Repository: edsu/twarc
Branch: main
Commit: 12104e080f48
Files: 89
Total size: 595.5 KB
Directory structure:
gitextract_5vocmduc/
├── .gitignore
├── .readthedocs.yaml
├── LICENSE
├── MANIFEST.in
├── README.md
├── RELEASING.md
├── docs/
│   ├── README.md
│   ├── api/
│   │   ├── client.md
│   │   ├── client2.md
│   │   ├── expansions.md
│   │   └── library.md
│   ├── plugins.md
│   ├── resources.md
│   ├── tutorial.md
│   ├── twarc1_en_us.md
│   ├── twarc1_es_mx.md
│   ├── twarc1_ja_jp.md
│   ├── twarc1_pt_br.md
│   ├── twarc1_sv_se.md
│   ├── twarc1_sw_ke.md
│   ├── twarc1_zw_zh.md
│   ├── twarc2_en_us.md
│   ├── twitter-developer-access.md
│   └── windows10.md
├── mkdocs.yml
├── pyproject.toml
├── requirements-mkdocs.txt
├── setup.cfg
├── src/
│   └── twarc/
│       ├── __init__.py
│       ├── __main__.py
│       ├── client.py
│       ├── client2.py
│       ├── command.py
│       ├── command2.py
│       ├── config.py
│       ├── decorators.py
│       ├── decorators2.py
│       ├── expansions.py
│       ├── handshake.py
│       ├── json2csv.py
│       └── version.py
├── test_twarc.py
├── test_twarc2.py
└── utils/
    ├── auth_timing.py
    ├── deduplicate.py
    ├── deleted.py
    ├── deleted_users.py
    ├── deletes.py
    ├── embeds.py
    ├── emojis.py
    ├── extractor.py
    ├── filter_date.py
    ├── filter_users.py
    ├── flakey.py
    ├── foaf.py
    ├── gender.py
    ├── geo.py
    ├── geofilter.py
    ├── geojson.py
    ├── json2csv.py
    ├── media2warc.py
    ├── media_urls.py
    ├── network.py
    ├── noretweets.py
    ├── oembeds.py
    ├── remove_limit.py
    ├── retweets.py
    ├── search.py
    ├── sensitive.py
    ├── sort_by_id.py
    ├── source.py
    ├── tags.py
    ├── times.py
    ├── twarc-archive.py
    ├── tweet.py
    ├── tweet_compliance.py
    ├── tweet_text.py
    ├── tweet_urls.py
    ├── tweetometer.py
    ├── tweets.py
    ├── unshrtn.py
    ├── urls.py
    ├── users.py
    ├── validate.py
    ├── wall.py
    ├── wayback.py
    ├── webarchives.py
    ├── wordcloud.py
    └── youtubedl.py
================================================
FILE CONTENTS
================================================
================================================
FILE: .gitignore
================================================
*.pyc
*.log
.cache
.venv
.eggs
Pipfile*
build
dist
twarc.egg-info
.pytest_cache
.vscode
.env
site
uv.lock
================================================
FILE: .readthedocs.yaml
================================================
version: 2
mkdocs:
  configuration: mkdocs.yml
python:
  version: 3.8
  install:
    - requirements: requirements-mkdocs.txt
================================================
FILE: LICENSE
================================================
The MIT License (MIT)
Copyright (c) Documenting the Now Project
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
================================================
FILE: MANIFEST.in
================================================
include requirements.txt
include docs/README.md
================================================
FILE: README.md
================================================
# twarc
**Note: twarc is no longer actively supported after changes to Twitter's API quotas made it unusable.**
---
[DOI](https://zenodo.org/badge/latestdoi/7605723)
twarc is a command line tool and Python library for collecting and archiving Twitter JSON
data via the Twitter API. It has separate commands (twarc and twarc2) for working with the older
v1.1 API and the newer v2 API and Academic Access (respectively).
* Read the [documentation](https://twarc-project.readthedocs.io)
* Ask questions here in [GitHub](https://github.com/DocNow/twarc/discussions), in [Slack](https://bit.ly/docnow-slack) or [Matrix](https://matrix.to/#/#docnow:matrix.org?via=matrix.org&via=petrichor.me&via=converser.eu)
twarc has been developed with generous support from the [Mellon Foundation](https://mellon.org/).
## Contributing
New features are welcome and encouraged for twarc. However, to keep the core twarc library and command line tool sustainable we will look at new functionality with the following principles in mind:
1. Purpose: twarc is for *collection* and *archiving* of Twitter data via the Twitter API.
2. Sustainability: keeping the surface area of twarc and its dependencies small enough to ensure high quality.
3. Utility: what is exposed by twarc should be applicable to different people, projects and domains, and not specific use cases.
4. API consistency: as much as sensible we aim to make twarc consistent with the Twitter API, and also aim to make twarc consistent with itself - so commands in core twarc should work similarly to each other, and twarc functionality should align towards the Twitter API.
For features and approaches that fall outside of this, twarc enables external packages to hook into the twarc2 command line tool via [click-plugins](https://github.com/click-contrib/click-plugins). This means that if you want to propose new functionality, you can create your own package without coordinating with core twarc.
### Documentation
The documentation is managed at ReadTheDocs. If you would like to improve the documentation you can edit the Markdown files in `docs` or add new ones. Then send a pull request and we can add it.
To view your documentation locally you should be able to:

    pip install -r requirements-mkdocs.txt
    pip install -e .
    mkdocs serve
    open http://127.0.0.1:8000/
If you prefer you can create a page on the [wiki](https://github.com/docnow/twarc/wiki/) to workshop the documentation, and then when/if you think it's ready to be merged with the documentation create an [issue](https://github.com/docnow/twarc/issues). Please feel free to create whatever documentation is useful in the wiki area.
### Code
If you are interested in adding functionality to twarc or fixing something that's broken, here are the steps for setting up your development environment:

    git clone https://github.com/docnow/twarc
    cd twarc

Create a .env file that includes Twitter App keys to use during testing:

    BEARER_TOKEN=CHANGEME
    CONSUMER_KEY=CHANGEME
    CONSUMER_SECRET=CHANGEME
    ACCESS_TOKEN=CHANGEME
    ACCESS_TOKEN_SECRET=CHANGEME

Now run the tests:

    uv run pytest
Add your code and some new tests, and send a pull request!
================================================
FILE: RELEASING.md
================================================
# Releasing
New versions of twarc can be released by creating a release and assigning a new tag in the GitHub repo. The release, including upload of the new version to PyPI, is performed by GitHub actions when a new tag is created, using the PyPI token stored in the secrets associated with the repository. Anybody who has the permission to create a tag can perform a release.
Steps in a release:
1. Update the version number in `src/twarc/version.py` - the format is MAJOR.MINOR.PATCH and should always be increasing and unique.
2. Make a new release from https://github.com/DocNow/twarc/releases (hit the 'draft new release' button on the top right).
3. Create a new tag, matching the version number in `src/twarc/version.py`, with a v prefix (i.e. vMAJOR.MINOR.PATCH)
4. Write release notes.
5. Publish the release.
6. Make sure the GitHub action completes successfully.
7. Double check that the new version correctly installs from PyPI: `pip install --upgrade twarc` should install the new version created above.
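The version and tag conventions in the steps above can be checked mechanically before publishing. The following is a minimal sketch, not part of twarc or its release workflow: the helper names `parse_version` and `check_release` are hypothetical and exist only for illustration.

```python
import re

VERSION_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)$")


def parse_version(version):
    """Turn a 'MAJOR.MINOR.PATCH' string into a tuple of ints."""
    match = VERSION_RE.match(version)
    if match is None:
        raise ValueError(f"not a MAJOR.MINOR.PATCH version: {version!r}")
    return tuple(int(part) for part in match.groups())


def check_release(new_version, tag, previous_version):
    """Return True if the tag is 'v' + version and the version has increased."""
    if tag != f"v{new_version}":
        return False
    # Compare as integer tuples so that 2.10.0 counts as newer than 2.9.0.
    return parse_version(new_version) > parse_version(previous_version)
```

Comparing integer tuples rather than strings matters here, because a plain string comparison would rank `2.9.0` above `2.10.0`.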
================================================
FILE: docs/README.md
================================================
# twarc
twarc is a command line tool and Python library for collecting and archiving Twitter JSON
data via the Twitter API. It has separate commands (twarc and twarc2) for working with the older
v1.1 API and the newer v2 API and Academic Access (respectively). It also has an ecosystem of [plugins](plugins) for doing things with the collected data.
See the `twarc` documentation for running commands: [twarc2](twarc2_en_us.md) for the v2 API and [twarc1](twarc1_en_us.md) for the v1.1 API. If you aren't sure which one to use, start with twarc2, since the v1.1 API is scheduled to be retired.
## Install
If you have python installed, you can install twarc from a terminal (such as the Windows Command Prompt available in the "start" menu, or the [OSX Terminal application](https://support.apple.com/en-au/guide/terminal/apd5265185d-f365-44cb-8b09-71a064a42125/mac)):
```
pip3 install twarc
```
Once installed, you should be able to use the twarc and twarc2 command line utilities, or use it as a Python library - check the examples [here](api/library.md) for that.
## Other Tools
Twarc is purpose-built for working with the Twitter API for archiving and studying digital trace data. It is not built as a general purpose API library for Twitter. While the primary use is academic, it works just as well with "Standard" v2 API and "Premium" v1.1 APIs.
For a list of general purpose Twitter Libraries in different languages see the [Twitter Documentation](https://developer.twitter.com/en/docs/twitter-api/tools-and-libraries). For Python, [TwitterAPI](https://github.com/geduldig/TwitterAPI) and [tweepy](https://github.com/tweepy/tweepy) are both up to date and maintained. They also support v2 APIs, and their data format with expansions may differ from twarc. There is also a reference implementation of the [v2 Academic Access Search](https://developer.twitter.com/en/docs/twitter-api/tweets/search/api-reference/get-tweets-search-all) and [v1.1 Premium Search](https://developer.twitter.com/en/docs/twitter-api/premium/search-api/overview) from Twitter [here](https://github.com/twitterdev/search-tweets-python/). The [v2 version](https://github.com/twitterdev/search-tweets-python/tree/v2) of this script is compatible with twarc.
For `R` there is [academictwitteR](https://cran.r-project.org/web/packages/academictwitteR/vignettes/academictwitteR-intro.html). Unlike twarc, it focuses solely on querying the Twitter Academic Research Product Track v2 API endpoint. Data gathered in twarc can be imported into `R` for analysis as a dataframe if you export the data into CSV using [twarc-csv](https://pypi.org/project/twarc-csv/).
## Getting Help
Check out the [tutorial](tutorial.md) to get started, or follow along with this [recorded stream](https://tube.nocturlab.fr/videos/watch/1d98d20e-a4fd-4594-aa94-9b1b1301cead) introducing twarc. You can also find additional resources linked from [resources](resources.md). If you run into trouble, feel free to make a post on the [Twarc Repository](https://github.com/DocNow/twarc/issues) or on the [Twitter Developer Forums](https://twittercommunity.com/c/academic-research/62).
================================================
FILE: docs/api/client.md
================================================
# twarc.Client
::: twarc.client
    handler: python
================================================
FILE: docs/api/client2.md
================================================
# twarc.Client2
::: twarc.client2
    handler: python
================================================
FILE: docs/api/expansions.md
================================================
# twarc.expansions
[Expansions](https://developer.twitter.com/en/docs/twitter-api/expansions) are how the new v2 Twitter API includes optional metadata about Tweets. In contrast to v1.1, where each Tweet JSON object is self-contained, in v2 metadata about a whole "page" of requests is included in the response. This means that to get a self-contained Tweet JSON, additional processing is needed to look up each piece of extra metadata. Different tools and libraries may implement this in different ways. In twarc, the goal was to retain the original JSON format and only append extra fields, so that any code that expects original JSON will still work.
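To make the idea concrete, here is a deliberately simplified sketch of the lookup step for one kind of expansion. It is not twarc's implementation: `flatten_page` is a hypothetical helper, the sample page is invented, and attaching the user under an `author` key is an assumption made for illustration.

```python
import copy


def flatten_page(page):
    """Yield self-contained copies of each tweet in a v2-style response page."""
    # Index the page-level user objects by id for quick lookup.
    users = {u["id"]: u for u in page.get("includes", {}).get("users", [])}
    for tweet in page.get("data", []):
        flat = copy.deepcopy(tweet)  # leave the original response untouched
        author = users.get(tweet.get("author_id"))
        if author is not None:
            flat["author"] = author  # append the looked-up metadata as an extra field
        yield flat


# Invented sample page: one tweet plus the page-level user it refers to.
page = {
    "data": [{"id": "1", "text": "hello", "author_id": "12"}],
    "includes": {"users": [{"id": "12", "username": "jack"}]},
}
flattened = list(flatten_page(page))
```

Every original field of the tweet is preserved and the extra metadata is only appended, which mirrors the goal described above.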
::: twarc.expansions
    handler: python
================================================
FILE: docs/api/library.md
================================================
# Examples of using twarc2 as a library
Please see [client2](client2.md) docs for the full list of available functions. Here are some minimal working snippets of code that use twarc2 as a library.
## Search
The client implements the API as closely as possible - so if the API docs expect a parameter in a certain way, so does the twarc2 library.
```python
import datetime
from twarc.client2 import Twarc2
from twarc.expansions import ensure_flattened
# Your bearer token here
t = Twarc2(bearer_token="A...z")
# Start and end times must be in UTC
start_time = datetime.datetime(2021, 3, 21, 0, 0, 0, 0, datetime.timezone.utc)
end_time = datetime.datetime(2021, 3, 22, 0, 0, 0, 0, datetime.timezone.utc)
# search_results is a generator, max_results is max tweets per page, 100 max for full archive search with all expansions.
search_results = t.search_all(query="dogs lang:en -is:retweet", start_time=start_time, end_time=end_time, max_results=100)
# Get all results page by page:
for page in search_results:
    # Do something with the whole page of results:
    # print(page)
    # or alternatively, "flatten" results returning 1 tweet at a time, with expansions inline:
    for tweet in ensure_flattened(page):
        # Do something with the tweet
        print(tweet)
    # Stop iteration prematurely, to only get 1 page of results.
    break
```
## Working with Generators
Twarc will try to retrieve all available results and handle retries and rate limits for you. This can potentially retrieve more tweets than your monthly limit will allow. The command line interface has a `--limit` option, but the library returns generator functions and it is up to you to stop iterating when you have retrieved enough results.
For example, to only get 2 "pages" of followers max per user:
```python
from twarc.client2 import Twarc2
# Your bearer token here
t = Twarc2(bearer_token="A...z")
user_ids = [12, 2244994945, 4503599627370241] # @jack, @twitterdev, @overflow64
# Iterate over our target users
for user_id in user_ids:
    # Iterate over pages of followers
    for i, follower_page in enumerate(t.followers(user_id)):
        # Do something with the follower_page here
        print(f"Fetched a page of {len(follower_page['data'])} followers for {user_id}")
        if i == 1:  # Only retrieve the first two pages (enumerate starts from 0)
            break
```
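An alternative to counting pages by hand is `itertools.islice` from the standard library, which stops requesting items from a generator after a fixed number. The sketch below uses a hypothetical `fake_pages` generator as a stand-in so that it runs without credentials; with twarc you would pass the generator returned by the client instead.

```python
from itertools import islice


def fake_pages():
    """Stand-in for a twarc generator: yields an endless stream of result pages."""
    page_number = 0
    while True:
        page_number += 1
        yield {"data": [{"id": str(page_number)}]}


# Take at most two pages, then stop asking the generator for more.
pages = list(islice(fake_pages(), 2))
```

Because generators are lazy, pages beyond the limit are never requested at all.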
## twarc CSV
`twarc-csv` is an extra plugin you can install:
```
pip install twarc-csv
```
This can also be used as a library, for example:
If you have a bunch of data, and want a DataFrame:
```python
from twarc_csv import DataFrameConverter
# Default options for Dataframe converter
converter = DataFrameConverter()
# this can be a list or generator of individual tweets or pages or results.
json_objects = [...]
df = converter.process(json_objects)
```
This doesn't save any files, and converts everything in memory.
If you have a large file, you should use `CSVConverter` as before:
```python
from twarc_csv import CSVConverter
with open("input.json", "r") as infile:
    with open("output.csv", "w") as outfile:
        converter = CSVConverter(infile=infile, outfile=outfile)
        converter.process()
```
or with additional options:
```python
from twarc_csv import CSVConverter, DataFrameConverter
converter = DataFrameConverter(
    input_data_type="tweets",
    json_encode_all=False,
    json_encode_text=False,
    json_encode_lists=True,
    inline_referenced_tweets=True,
    merge_retweets=True,
    allow_duplicates=False,
)
with open("results.jsonl", "r") as infile:
    with open("results.csv", "w") as outfile:
        converter = CSVConverter(infile=infile, outfile=outfile, converter=converter)
        converter.process()
```
`DataFrameConverter` parameters correspond to the command line options: https://github.com/DocNow/twarc-csv#extra-command-line-options
The full list of valid `output_columns` are: https://github.com/DocNow/twarc-csv/blob/main/dataframe_converter.py#L13-L85 when using `input_data_type="tweets"` and https://github.com/DocNow/twarc-csv/blob/main/dataframe_converter.py#L90-L115 when using `input_data_type="users"`. Note that it won't extract users from tweets, these have to be already extracted from the JSON. `twarc-csv` can also process compliance output and counts output.
## Search and write results to CSV example
Here is a complete working example that searches for all recent tweets in the last few hours, writes a `dogs_results.jsonl` file with the original responses, and then converts this to CSV:
```python
import json
from datetime import datetime, timezone, timedelta
from twarc.client2 import Twarc2
from twarc_csv import CSVConverter
# Your bearer token here
t = Twarc2(bearer_token="A...z")
# Start and end times must be in UTC
start_time = datetime.now(timezone.utc) + timedelta(hours=-3)
# end_time cannot be immediately now, has to be at least 30 seconds ago.
end_time = datetime.now(timezone.utc) + timedelta(minutes=-1)
query = "dogs lang:en -is:retweet has:media"
print(f"Searching for \"{query}\" tweets from {start_time} to {end_time}...")
# search_results is a generator, max_results is max tweets per page, not total, 100 is max when using all expansions.
search_results = t.search_recent(query=query, start_time=start_time, end_time=end_time, max_results=100)
# Get all results page by page and write each page as one line of JSON.
# The file is opened once, before the loop, so later pages do not overwrite earlier ones:
with open("dogs_results.jsonl", "w") as f:
    for page in search_results:
        f.write(json.dumps(page) + "\n")
        print("Wrote a page of results...")
print("Converting to CSV...")
# This assumes `dogs_results.jsonl` is finished writing.
with open("dogs_results.jsonl", "r") as infile:
    with open("dogs_output.csv", "w") as outfile:
        converter = CSVConverter(infile, outfile)
        converter.process()
print("Finished.")
```
================================================
FILE: docs/plugins.md
================================================
# Plugins
twarc v1 collected a set of utilities for working with tweet json in the
[utils] directory of the git repository. This was a handy way to develop and
share snippets of code. But some utilities had different dependencies which
weren't managed in a uniform way. Some of the utilities had slightly different
interfaces. They needed to be downloaded from GitHub manually and weren't
easily accessible at the command line unless you remembered where you put them.
With *twarc2* these utilities are now installable as plugins, which are made
available as subcommands using the same twarc2 command line. Plugins are
published separately from twarc on [PyPI] and are installed with [pip]. Here is
a list of some known plugins (if you write one please [let us know] so we can
add it to this list):
* [twarc-ids](https://pypi.org/project/twarc-ids/): a simple example of printing the ids for tweets to use as a reference for creating plugins
* [twarc-csv](https://pypi.org/project/twarc-csv/): export tweets to CSV, which is probably the first thing a researcher will want to do
* [twarc-videos](https://pypi.org/project/twarc-videos): extract videos from tweets
* [twarc-network](https://pypi.org/project/twarc-network): visualize tweets and users as a network graph
* [twarc-timeline-archive](https://pypi.org/project/twarc-timeline-archive): routinely download tweet timelines for a list of users
* [twarc-hashtags](https://pypi.org/project/twarc-hashtags): create a report of hashtags that are used in collected tweet data
* Write your own, and [let us know] so we can add it here!
## Writing a Plugin
The [twarc-ids] plugin provides an example of how to write plugins. This
reference plugin simply reads collected tweet JSON data and writes out the tweet
identifiers. First you install the plugin:

    pip install twarc-ids

and then you use it:

    twarc2 ids tweets.json > ids.txt
Internally twarc's command line is implemented using the [click] library. The
[click-plugins] module is what manages twarc2 plugins. Basically you import
`click` and implement your plugin as you would any other click utility, for
example:
```python
import json
import click
@click.command()
@click.argument('infile', type=click.File('r'), default='-')
@click.argument('outfile', type=click.File('w'), default='-')
def ids(infile, outfile):
    """
    Extract tweet ids from tweet JSON.
    """
    for line in infile:
        tweet = json.loads(line)
        click.echo(tweet['data']['id'], file=outfile)
```
Note that the plugin takes input file *infile* and writes to an output file
*outfile* which default to stdin and stdout respectively. This allows plugin
utilities to be used as part of pipelines. You can add options using the
standard facilities that click provides if your plugin needs them.
If your plugin needs to talk to the Twitter API then just add the
`@click.pass_obj` decorator which will ensure that the first parameter in
your function will be a Twarc2 client that is configured to use the
client's keys.
```python
import click
@click.command()
@click.argument('infile', type=click.File('r'), default='-')
@click.argument('outfile', type=click.File('w'), default='-')
@click.pass_obj
def ids(twarc_client, infile, outfile):
    # do something with the twarc client here
    ...
```
Finally you just need to create a `setup.py` file for your project that
looks something like this:
```python
import setuptools
setuptools.setup(
    name='twarc-ids',
    version='0.0.1',
    url='https://github.com/docnow/twarc-ids',
    author='Ed Summers',
    author_email='ehs@pobox.com',
    py_modules=['twarc_ids'],
    description='A twarc plugin to read Twitter data and output the tweet ids',
    install_requires=['twarc'],
    setup_requires=['pytest-runner'],
    tests_require=['pytest'],
    entry_points='''
        [twarc.plugins]
        ids=twarc_ids:ids
    '''
)
```
The key part here is the `entry_points` section which is what allows twarc2 to
discover twarc.plugins dynamically at runtime, and also defines how the
subcommand maps to the plugin's function.
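To see what "discovering plugins at runtime" means in practice, the standard library can list whatever installed packages have registered under an entry point group. This is only an illustrative sketch using `importlib.metadata`; it is not a description of how click-plugins is implemented internally, and the `discover` helper is a hypothetical name.

```python
import sys
from importlib import metadata


def discover(group="twarc.plugins"):
    """Return the entry points that installed packages registered under `group`."""
    if sys.version_info >= (3, 10):
        return list(metadata.entry_points(group=group))
    # Older Pythons return a mapping of group name -> entry points.
    return list(metadata.entry_points().get(group, []))


for entry_point in discover():
    # entry_point.name is the subcommand, entry_point.value is 'module:function'
    print(entry_point.name, "->", entry_point.value)
```

With `twarc-ids` installed, the `ids=twarc_ids:ids` line from the `setup.py` above would be expected to appear as one such entry; with no plugins installed the loop prints nothing.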
It's good practice to include a test or two for your plugin to ensure it works
over time. Check out the example [here] for how to test command line utilities
easily with click.
To publish your plugin on PyPI:
```
pip install twine
python setup.py sdist
twine upload dist/*
# enter pypi login details
```
[twarc-ids]: https://github.com/docnow/twarc-ids/
[PyPI]: https://python.org/pypi/
[pip]: https://pip.pypa.io/en/stable/
[click]: https://click.palletsprojects.com/
[click-plugins]: https://github.com/click-contrib/click-plugins
[here]: https://github.com/DocNow/twarc-ids/blob/main/test_twarc_ids.py
[let us know]: https://github.com/docnow/twarc/issues/
[utils]: https://github.com/DocNow/twarc/tree/main/utils
================================================
FILE: docs/resources.md
================================================
# Twarc Tutorials and Other Resources
Documentation here is largely auto generated from the code, which may not always be the most user friendly. Others have written great tutorials and other resources relating to using twarc, or working with the data generated by twarc. If you'd like to suggest additional resources that are relevant, please feel free to open a pull request or open an issue.
## An Introductory Video from the Australian Digital Observatory
A [six minute video](https://www.youtube.com/watch?v=4DXEeM2AA9Y) by the [Australian Digital Observatory](https://www.digitalobservatory.net.au/) that shows some of the functionality of `twarc2` search, as well as how to use [Twitter's Query Builder](https://developer.twitter.com/apitools/query?query=) in conjunction with twarc.
<iframe width="560" height="315" src="https://www.youtube.com/embed/4DXEeM2AA9Y" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
## Carpentries Lesson
<https://carpentries-incubator.github.io/twitter-with-twarc/index.html>
Includes a step by step guide to collecting Twitter data using `twarc2`. It includes information on Twitter's JSON format, and how to manage collected data.
## UVA Library's Scholars' Lab Twarc Tutorial
<https://scholarslab.github.io/learn-twarc/>
A beginner guide that also goes through command line and Python setup. Uses `twarc` for v1.1 API examples, not `twarc2`.
## Guide from TwitterDev
<https://github.com/twitterdev/getting-started-with-the-twitter-api-v2-for-academic-research>
Twitter have released a 101 guide on using the Academic Access endpoints. It uses `twarc2` as a library as opposed to command line, and gives code examples in R too.
## Twitter Data Collection & Analysis
<https://melaniewalsh.github.io/Intro-Cultural-Analytics/04-Data-Collection/12-Twitter-Data.html>
Lesson from Introduction to Cultural Analytics & Python
## Getting Data from Twitter: A twarc tutorial
<https://github.com/alblaine/twarc-tutorial>
Uses `twarc` for `v1.1` endpoints and has step by step examples for using some of the `/utils` scripts.
## UCSB Library Twarc Tutorials
<https://ucsb-collaboratory.github.io/twitter/>
Uses both `twarc` and `twarc2`
## Introduction to full archive searching using twarc v2
<https://github.com/jeffcsauer/twarc-v2-tutorials/blob/master/twarc_fas.md>
An example of using `twarc2` search, but be sure to install twarc using `pip install twarc` not the link to the v2 branch zip.
================================================
FILE: docs/tutorial.md
================================================
# Twarc Tutorial
Twarc is a command line tool for collecting Twitter data via Twitter's web Application Programming Interface (API). This tutorial is aimed at researchers who are new to collecting social media data, and who might be unfamiliar with command line interfaces.
By the end of this tutorial, you will have:
1. Familiarised yourself with interacting with a command line application via a terminal
2. Set up Twarc so you can collect data from the Twitter API (version 2)
3. Constructed two Twitter search queries to address a specific research question
4. Collected data for those two queries
5. Processed the collected data into formats suitable for other analysis
6. Performed a simple quantitative comparison of the two collections using Python
7. Prepared a dataset of tweet identifiers that can be shared with other researchers
## Motivating example
This tutorial is built around collecting data from Twitter to address the following research question:
***Which monotreme is currently the coolest - the echidna or the platypus?***
We'll answer this question with a simple quantitative approach to analysing the collected data: counting the volume of likes that tweets mentioning each species of animal accrue. For this tutorial, the species that gets the most likes on tweets is going to be considered the "coolest". This is a very simplistic quantitative approach, just to get you started on collecting and analysing Twitter data. To seriously study the relative coolness of monotremes, there are a wide variety of more appropriate (but also more involved) methods.
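As a preview of that comparison, the counting step might look like the sketch below. The sample tweets are invented for illustration, and the code assumes each collected tweet carries the v2 `public_metrics` object with a `like_count` field; `total_likes` is a hypothetical helper name.

```python
def total_likes(tweets):
    """Sum like_count over tweets, treating missing metrics as zero."""
    return sum(t.get("public_metrics", {}).get("like_count", 0) for t in tweets)


# Invented sample data, shaped like v2 tweet objects with public_metrics.
echidna_tweets = [
    {"id": "1", "public_metrics": {"like_count": 10}},
    {"id": "2", "public_metrics": {"like_count": 5}},
]
platypus_tweets = [
    {"id": "3", "public_metrics": {"like_count": 12}},
    {"id": "4"},  # no metrics collected for this tweet
]

totals = {
    "echidna": total_likes(echidna_tweets),
    "platypus": total_likes(platypus_tweets),
}
# The species whose tweets gathered the most likes is deemed the "coolest".
coolest = max(totals, key=totals.get)
```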
## Introduction to twarc and the Twitter API
### What is an API?
An **Application Programming Interface** (API) is a common method for software applications and services to allow other systems or people to programmatically interact with them. For example, Twitter has an API which allows external systems to make requests to Twitter for information or actions. Twitter (and many other web apps and services) uses an HTTP REST API, meaning that to interact with Twitter through the API you can send an HTTP request to a specific URL provided by Twitter. Twitter affords many different URLs (also known as **endpoints**) which have been designed for different purposes (more about that later). Assuming that your HTTP request is valid, Twitter will respond with a bundle of information in [JSON format](https://en.wikipedia.org/wiki/JSON) for you.
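The sketch below shows roughly what such a request and response look like from Python, using only the standard library. The request is constructed but never sent, the bearer token is a placeholder, and the response body is an invented sample rather than real API output.

```python
import json
import urllib.parse
import urllib.request

# Recent search endpoint of the v2 API; the query and token are placeholders.
params = urllib.parse.urlencode({"query": "echidna", "max_results": 10})
url = f"https://api.twitter.com/2/tweets/search/recent?{params}"
request = urllib.request.Request(url, headers={"Authorization": "Bearer CHANGEME"})

# A response body is JSON text that decodes into ordinary Python data.
sample_body = '{"data": [{"id": "1", "text": "echidnas are cool"}], "meta": {"result_count": 1}}'
response = json.loads(sample_body)
first_text = response["data"][0]["text"]
```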
Twarc acts as a tool or an intermediary for you to interact with the Twitter API, so that you don't have to manage the details of how exactly to make requests to the Twitter API and handle Twitter's responses. Twarc commands correspond roughly with Twitter API endpoints. For example, when you use Twarc to fetch the timeline of a specific Twitter account (we'll use @Twitter in this example), this is the sequence of events:
1. You run `twarc2 timeline Twitter tweets.jsonl`
2. twarc2 makes a request on your behalf to the [Twitter v2 user lookup API endpoint](https://developer.twitter.com/en/docs/twitter-api/users/lookup/introduction) in order to find the user ID for the @Twitter account, and receives a response from the Twitter API server with that user ID
3. twarc2 makes a request on your behalf to the [Twitter v2 timeline API endpoint](https://developer.twitter.com/en/docs/twitter-api/tweets/timelines/introduction), using the user ID determined in step 2, and receives a response (or several responses) from the Twitter API server with @Twitter's tweets
4. twarc2 consolidates the timeline responses from step 3 and outputs them according to your initial command, in this case as `tweets.jsonl`
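The sequence above can be sketched in a few lines of Python. This is an illustration of the request flow only, not twarc's code: `fetch_timeline` and `fake_get` are hypothetical names, the canned responses are invented, and a real client would also handle authentication, rate limits and retries.

```python
def fetch_timeline(username, get):
    """Look up a user ID, then page through that user's tweets.

    `get` is any callable taking (url, params) and returning decoded JSON.
    """
    # Step 2: resolve the username to a user ID.
    user = get(f"https://api.twitter.com/2/users/by/username/{username}", {})
    user_id = user["data"]["id"]

    # Step 3: request timeline pages until no next_token is returned.
    tweets, params = [], {}
    while True:
        page = get(f"https://api.twitter.com/2/users/{user_id}/tweets", params)
        tweets.extend(page.get("data", []))
        next_token = page.get("meta", {}).get("next_token")
        if next_token is None:
            return tweets  # Step 4: the consolidated results
        params = {"pagination_token": next_token}


def fake_get(url, params):
    """Canned responses standing in for the Twitter API (invented data)."""
    if url.endswith("/users/by/username/Twitter"):
        return {"data": {"id": "783214", "username": "Twitter"}}
    if params.get("pagination_token") == "abc":
        return {"data": [{"id": "2"}], "meta": {}}
    return {"data": [{"id": "1"}], "meta": {"next_token": "abc"}}


timeline = fetch_timeline("Twitter", fake_get)
```

Passing the HTTP function in as an argument keeps the sketch runnable offline; swapping `fake_get` for a function that performs real authenticated requests would leave the control flow unchanged.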
There are a great many resources on the internet to learn more about APIs more generally and how to use them in a variety of contexts. Here are a few introductory articles:
- [How-To Geek: What is an API, and how do developers use them?](https://www.howtogeek.com/343877/what-is-an-api/)
- [IBM: What is an API?](https://www.ibm.com/cloud/learn/api)
More detailed information on APIs and working with them:
- [Zapier: An introduction to APIs](https://zapier.com/learn/apis/)
- [RealPython: Python and REST APIs: Interacting with web services](https://realpython.com/api-integration-in-python/)
### What can you do with the Twitter API?
The Twitter API is very popular in academic communities for good reason: among the popular social media platforms, Twitter currently offers one of the most accessible and research-friendly APIs. The Twitter API is well-established and offers a broad range of possibilities for data collection.
Here are some examples of things you can do with the Twitter API:
- Find historical tweets containing words or phrases during a time window of interest
- Collect live tweets as they are posted matching specific search criteria
- Collect tweets using specific hashtags or mentioning particular users
- Collect tweets made by a particular user account
- Collect engagement metrics including likes and retweets for specific tweets of interest
- Map Twitter account followers and followees within or around a group of users
- Trace conversations and interactions around users or tweets of interest
You may notice as you read about the Twitter API that there are two versions of the Twitter API - version 1.1 and version 2. At the time of writing, Twitter is providing both versions of the API, but at some unknown point in the future version 1.1 may be discontinued. Twarc can handle either API version: the `twarc` command uses version 1.1 of the Twitter API, the `twarc2` command uses version 2. Take care when reading documentation and tutorials as to which Twitter API version is being referenced. **This tutorial uses version 2 of the Twitter API**.
Twitter API endpoints can be structured either around tweets or around user accounts. For example, the search endpoint provides lists of tweets - user information is included, but the data is focused on the tweets.
The available endpoints and their details are evolving as Twitter develops and releases its API version 2, so for the most up to date information refer to [the Twitter API documentation](https://developer.twitter.com/en/docs/twitter-api). Some of the most used endpoints for research purposes are:
- [search](https://developer.twitter.com/en/docs/twitter-api/tweets/search/introduction): This is the endpoint used to search tweets, whether recent or historical.
- [lookup](https://developer.twitter.com/en/docs/twitter-api/tweets/lookup/introduction): The lookup endpoints are useful when you have IDs of tweets of interest and want to fetch further data about those tweets - known in the Twarc community as **hydrating** the tweets.
- [follows](https://developer.twitter.com/en/docs/twitter-api/users/follows/introduction): The follows endpoint allows collecting information about who follows who on Twitter.
With the Twitter API, you can get data related to all types of objects that make up the Twitter experience, including [tweets](https://developer.twitter.com/en/docs/twitter-api/data-dictionary/object-model/tweet) and [users](https://developer.twitter.com/en/docs/twitter-api/data-dictionary/object-model/user). The Twitter documentation provides full details, and these two pages are very useful to bookmark!
The Twitter documentation also provides some useful tools for constructing searches and queries:
- [Twitter's v2 API Query Builder](https://developer.twitter.com/apitools/query?query=)
- [Building high quality filters for getting Twitter data](https://developer.twitter.com/en/docs/tutorials/building-high-quality-filters)
The rest of this tutorial is going to focus on using the Twitter search API endpoint to retrieve tweets containing content relevant to the research question. We've chosen to focus on this because:
1. With the rich functionality available in the search API the data collection for many projects can be condensed down to a few carefully chosen searches.
2. With [academic research access](https://developer.twitter.com/en/products/twitter-api/academic-research) it's possible to search the entire Twitter archive, making search uniquely powerful among the endpoints Twitter supports.
### Introduction to twarc
Twarc is at its core an application for interacting with the Twitter API, reading results from the different functionality the API offers, and safely writing the collected data to your machine for further analysis. Twarc handles the mechanical details of interacting with the Twitter API like including information to authenticate yourself, making HTTP requests to the API, formatting data in the right way, and retrying when things on the internet fail. Your job is to work out:
1. Which endpoint you want to call on from the Twitter API.
2. Which data you want to retrieve from that endpoint.
Twarc is a command line based application - to use twarc you type a command specifying a particular action, and the results of that command are shown as text on screen. If you haven't used a command line interface before, don't worry! Although there is a bit of a learning curve at the beginning, you will quickly get the hang of it - and because everything is a typed command, it is very easy to record and share _exactly_ how you collected data with other people.
## Considerations when using social media data for research
Before we dive into the details, it's worth mentioning some broader issues you will need to keep in mind when working with social media data. This is by no means an exhaustive list of issues and is intended as a starting point for further enquiry.
### Ethical use of "public" communication
Even though most tweets on Twitter are public, in that they're accessible to anyone on the web, most users of Twitter don't have any expectation that researchers will be reading their tweets for the purpose of research. Researchers need to be mindful of this when working with data from Twitter, and user expectations should be considered as part of the study design. The Association of Internet Researchers has established [Ethical Guidelines for Internet Research](https://aoir.org/ethics/) which are a good starting point for the higher level considerations.
Work has also been done specifically looking at [Twitter users' expectations](https://journals.sagepub.com/doi/10.1177/2056305118763366), with a number of key concerns outlined. For this tutorial we're going to take a high level quantitative look at very recent Twitter data, which distances us from the specific tweets and the users creating them, and aligns with these broader ethical considerations.
Finally, because tweets (and the internet more generally) are searchable, we need to keep in mind that quoting a tweet in whole or part might allow easy reidentification of any specific user or tweet. For this reason care needs to be taken when reporting material from tweets, and common practices in qualitative research may not align with Twitter users' interests or expectations.
### Copyright
This may vary according to where you are in the world, but tweets, including the text of the tweet and any attached photos and videos, are likely to be protected by copyright. Together with the Twitter Developer Agreement considerations in the next section, this may limit what you can do with tweets and media downloaded from Twitter.
### Twitter's terms of service
When you signed up for a Twitter developer account you agreed to follow Twitter's [Developer Agreement and Policy](https://developer.twitter.com/en/developer-terms/agreement-and-policy). This agreement constrains how you can use and share Twitter data. While the primary purpose of this agreement is to protect Twitter the company, this policy also incorporates some elements aimed at protecting users of Twitter.
Some particular things to note from the Developer Agreement are:
- Limits on how geolocation data can be used
- How to share Twitter data
- Dealing with deleted tweets
Note that researchers' use of deleted tweets was also a key concern for [Twitter users](https://journals.sagepub.com/doi/10.1177/2056305118763366). This tutorial won't cover geolocation data at all, but will cover approaches to sharing Twitter data and removing deleted material from collections.
## Setup
Twarc is a command line application, written in the Python programming language. To get Twarc running on our machines, we're going to need to install Python, then install Twarc itself, and we will also need to set up a Twitter developer account.
### Twitter developer access
[Start here](https://developer.twitter.com/en/apply-for-access) to apply for a Twitter developer account and follow the steps in [our developer access guide](twitter-developer-access.md). For this tutorial, you can skip step 2, as we won't require academic access.
Once you have the **Bearer Token**, you are ready for the next step. This token is like a password, so you shouldn't share it with other people. You will also need to be able to enter this token once to configure Twarc, so it would be best to copy and paste it to a text file on your local machine until we've finished configuration.
### Install Python
#### Windows
Install the latest version [for Windows](https://www.python.org/downloads/windows/). During the installation, make sure the *Add Python to PATH* option is selected/ticked.

#### Mac
Install the latest version [for Mac](https://www.python.org/downloads/macos/). No additional setup should be necessary for Python.
### Install Twarc and other utilities
For this tutorial we're going to install three Python packages, `twarc`, an extension called `twarc-csv`, and `pandas`, a Python library for data analysis. We will use a command line interface to install these packages. On Windows we will use the `cmd` console, which can be found by searching for `cmd` from the start menu - you should see a prompt like the below screenshot. On Mac you can open the `Terminal` app.

Once you have a terminal open we can run the following command to install the necessary packages:
```shell
pip install twarc twarc-csv pandas
```
You should see output similar to the following:

### Our first command: making sure everything is working
Let's open a terminal and get started - just like when installing twarc, you will want to use the `cmd` application on Windows and the `Terminal` application on Mac.
The first command we want to run is to check if everything in twarc is installed and working correctly. We'll use twarc's built-in `help` for this. Running the following command should show you a brief overview of the functionality that the twarc2 command provides and some of the options available:
```shell
twarc2 --help
```

Twarc is structured like many other command line applications: there is a single main command, `twarc2`, to launch the application, and then you provide a subcommand, or additional arguments, or flags to provide additional context about what that command should actually do. In this case we're only launching the `twarc2` command, and providing a single _flag_ `--help` (the double-dash syntax is usually used for this). Most terminal applications will have a `--help` or `-h` flag that will provide some useful information about the application you're running. This often includes example usage, options, and a short description.
Note also that often when reading commands out loud, the space in between words is not mentioned explicitly: the command above (`twarc2 --help`) might be read as "twarc-two dash dash help".
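To make the command / subcommand / flag structure concrete, here is a toy parser built with Python's standard `argparse` module. It only mimics the shape of a command like `twarc2 counts`; it is not how twarc itself is implemented, and the `toy` program name and the default values are assumptions made for illustration.

```python
import argparse

def build_parser():
    # Main command (a stand-in for `twarc2`).
    parser = argparse.ArgumentParser(prog="toy")
    subcommands = parser.add_subparsers(dest="subcommand", required=True)

    # Subcommand (a stand-in for `counts`) with one argument and two flags.
    counts = subcommands.add_parser("counts")
    counts.add_argument("query")
    counts.add_argument("--text", action="store_true")
    counts.add_argument("--granularity", default="hour")
    return parser

# Parse a list of words exactly as they would be typed after the main command.
args = build_parser().parse_args(["counts", "echidna", "--text", "--granularity", "day"])
print(args.subcommand, args.query, args.text, args.granularity)
```

The parser separates the subcommand, the plain argument and the flags in the same way you do when reading a command aloud.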
Though we won't cover the command line outside of using Twarc in this tutorial, your operating system's command line functionality is extensive and can help you automate a lot of otherwise tedious tasks. If you're interested in learning more the [Software Carpentry lesson on the shell](https://swcarpentry.github.io/shell-novice/) is a good starting point.
### Configuring twarc with our bearer token
The next thing we want to do is tell twarc about our bearer token so we can authenticate ourselves with the Twitter API. This can be done using twarc's `configure` command. In this case we're going to use the `twarc2` main command, and provide it with the subcommand `configure` to tell twarc we want to start the configuration process.
```shell
twarc2 configure
```
On running this command twarc will prompt us to paste our bearer token, as shown in the screenshot below. Note that for many command line terminals on Windows, using the usual `Ctrl+V` keyboard shortcut will not work by default. If this happens, try right-clicking, then click `paste` to achieve the same thing. After entering our token, we will be prompted to enter additional information - this is not necessary for this tutorial, so we will skip this step by typing the letter `n` and hitting `enter`.

## Introduction to Twitter search and counts
To tackle the research question we're interested in we're going to use the search endpoint to retrieve two sets of tweets: those using the word echidna, and those using the word platypus.
There are two key commands that the Twitter API provides for search: a `search` endpoint to retrieve tweets matching a particular query, and a `counts` endpoint to tell you how many tweets match that query over time. It's always a good idea to start with the `counts` endpoint first, because:
- it lets you establish early on how many tweets you will need to deal with: too many or too few matching tweets will help you determine whether your search strategy is reasonable
- it can take a long time to retrieve large numbers of tweets and it's better to know in advance how much data you will need to deal with
- the count and trend over time is useful in and of itself
- if you accidentally search for the wrong thing you can consume your monthly quota of tweets without collecting anything useful
Let's get started with the `counts` API - in twarc this is accessible by the command `counts`. As before `twarc2` is our entry command, `counts` is the subcommand we're interested in, and the `echidna` is what we're interested in searching for on Twitter (the query).
```shell
twarc2 counts echidna
```
You should see something like the below screenshot - and yes, this output isn't very readable! By default twarc shows us the response in the JSON format directly from the Twitter API, so it's not great for using directly on the command line.

Let's improve this by updating our command to:
```shell
twarc2 counts echidna --text --granularity day
```
And we should see output like below (your results will be different, because you're searching on a different day to when these screenshots were captured). Note that `--text` and `--granularity` are optional flags for the `twarc2 counts` command; we can see other options by running `twarc2 counts --help`. In this case `--text` returns a simplified text output for easier reading, and `--granularity day` is passed to the Twitter API to specify that we're interested only in daily counts of tweets, not the default hourly count.
```console
2022-11-03T02:49:02.000Z - 2022-11-04T00:00:00.000Z: 974
2022-11-04T00:00:00.000Z - 2022-11-05T00:00:00.000Z: 802
2022-11-05T00:00:00.000Z - 2022-11-06T00:00:00.000Z: 527
2022-11-06T00:00:00.000Z - 2022-11-07T00:00:00.000Z: 554
2022-11-07T00:00:00.000Z - 2022-11-08T00:00:00.000Z: 883
2022-11-08T00:00:00.000Z - 2022-11-09T00:00:00.000Z: 723
2022-11-09T00:00:00.000Z - 2022-11-10T00:00:00.000Z: 1,567
2022-11-10T00:00:00.000Z - 2022-11-10T02:49:02.000Z: 219
```
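If you want to total these counts rather than read them by eye, the text output can be parsed with a few lines of Python. This is a minimal sketch that assumes the line format is exactly as in the sample output above (`<start> - <end>: <count>`, with commas as thousands separators); `parse_counts` is a hypothetical helper, not part of twarc.

```python
import re

# One line of `twarc2 counts --text` output: "<start> - <end>: <count>"
LINE = re.compile(r"^(\S+) - (\S+): ([\d,]+)$")

def parse_counts(text):
    """Return a list of (start, end, count) tuples from the text output."""
    rows = []
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:
            start, end, count = match.groups()
            # Remove the thousands separator before converting to an integer.
            rows.append((start, end, int(count.replace(",", ""))))
    return rows

sample = """\
2022-11-08T00:00:00.000Z - 2022-11-09T00:00:00.000Z: 723
2022-11-09T00:00:00.000Z - 2022-11-10T00:00:00.000Z: 1,567
2022-11-10T00:00:00.000Z - 2022-11-10T02:49:02.000Z: 219
"""
rows = parse_counts(sample)
print(sum(count for _, _, count in rows))  # 2509
```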
Note that this is only the count for the last seven days, which is the level of search functionality available for all developers via the standard track of the Twitter API. If you have access to the [Twitter Academic track](https://developer.twitter.com/en/use-cases/do-research/academic-research), you can switch to searching the full Twitter archive from the `counts` and `search` commands by adding the `--archive` flag.
Twitter search is powerful and provides many rich options. However, it also functions a little differently to most other search engines, because Twitter search does not focus on _ranking_ tweets by relevance (like a web search engine does). Instead, Twitter search via the API focuses on retrieving all matching tweets in chronological order. In other words, Twitter search uses the [Boolean model of searching](https://nlp.stanford.edu/IR-book/html/htmledition/boolean-retrieval-1.html), and returns the documents that match exactly what you provide and nothing else.
Let's work through this example a little further. First we want to expand the query to capture more variants of the word echidna - note that Twitter search via the API matches on the whole word, so `echidna` and `echidnas` are different. You can also see that we've added double quotes around our query - without these quotes the individual pieces of our query might be interpreted as additional arguments to our search command:
```shell
twarc2 counts "echidna echidna's echidnas" --granularity day --text
```
```console
2022-11-03T03:40:44.000Z - 2022-11-04T00:00:00.000Z: 0
2022-11-04T00:00:00.000Z - 2022-11-05T00:00:00.000Z: 0
2022-11-05T00:00:00.000Z - 2022-11-06T00:00:00.000Z: 0
2022-11-06T00:00:00.000Z - 2022-11-07T00:00:00.000Z: 0
2022-11-07T00:00:00.000Z - 2022-11-08T00:00:00.000Z: 0
2022-11-08T00:00:00.000Z - 2022-11-09T00:00:00.000Z: 0
2022-11-09T00:00:00.000Z - 2022-11-10T00:00:00.000Z: 0
2022-11-10T00:00:00.000Z - 2022-11-10T03:40:44.000Z: 0
```
Suddenly we're retrieving no results at all! By default, if you don't specify an operator, the Twitter API assumes you mean AND, or that all of the words should be present - we will need to explicitly say that we want any of these words using the OR operator:
```shell
twarc2 counts "echidna OR echidna's OR echidnas" --granularity day --text
```
```console
2022-11-03T03:42:10.000Z - 2022-11-04T00:00:00.000Z: 964
2022-11-04T00:00:00.000Z - 2022-11-05T00:00:00.000Z: 846
2022-11-05T00:00:00.000Z - 2022-11-06T00:00:00.000Z: 552
2022-11-06T00:00:00.000Z - 2022-11-07T00:00:00.000Z: 573
2022-11-07T00:00:00.000Z - 2022-11-08T00:00:00.000Z: 962
2022-11-08T00:00:00.000Z - 2022-11-09T00:00:00.000Z: 758
2022-11-09T00:00:00.000Z - 2022-11-10T00:00:00.000Z: 1,591
2022-11-10T00:00:00.000Z - 2022-11-10T03:42:10.000Z: 288
```
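The difference between the two queries can be imitated with a toy matcher. This is only a sketch of the Boolean model under simplifying assumptions: the `words` tokenizer below is a made-up stand-in, and Twitter's real tokenization and matching are more involved.

```python
import re

def words(text):
    # Lowercase whole-word tokens; apostrophes stay inside a word.
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def match_all(terms, text):
    # Implicit AND: every term must appear as a whole word.
    return all(term in words(text) for term in terms)

def match_any(terms, text):
    # Explicit OR: at least one term must appear as a whole word.
    return any(term in words(text) for term in terms)

terms = ["echidna", "echidna's", "echidnas"]
tweet = "Two echidnas crossed the road today"

print(match_all(terms, tweet))        # False
print(match_any(terms, tweet))        # True
print(match_any(["echidna"], tweet))  # False
```

The last line shows the whole-word behaviour: a tweet containing only `echidnas` does not match the term `echidna`.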
We can also apply operators based on other content or properties of tweets (see more [search operators](https://developer.twitter.com/en/docs/twitter-api/tweets/search/integrate/build-a-query#list) in the Twitter API documentation). Because we're deciding to focus on the number of likes on tweets as our measure of coolness, we want to exclude retweets. If we don't exclude retweets, our like measure might be heavily influenced by one highly retweeted tweet.
We can do this using the `-` (minus) operator, which allows us to exclude tweets matching a criterion, in conjunction with the `is:retweet` operator, which filters on whether the tweet is a retweet or not. If we applied just the `is:retweet` operator we'd only see the retweets, the opposite of what we want.
```shell
twarc2 counts "echidna OR echidna's OR echidnas -is:retweet" --granularity day --text
```
```text
2022-11-03T03:43:02.000Z - 2022-11-04T00:00:00.000Z: 957
2022-11-04T00:00:00.000Z - 2022-11-05T00:00:00.000Z: 826
2022-11-05T00:00:00.000Z - 2022-11-06T00:00:00.000Z: 546
2022-11-06T00:00:00.000Z - 2022-11-07T00:00:00.000Z: 570
2022-11-07T00:00:00.000Z - 2022-11-08T00:00:00.000Z: 931
2022-11-08T00:00:00.000Z - 2022-11-09T00:00:00.000Z: 750
2022-11-09T00:00:00.000Z - 2022-11-10T00:00:00.000Z: 1,587
2022-11-10T00:00:00.000Z - 2022-11-10T03:43:02.000Z: 288
```
There's one tiny gotcha from the Twitter API here, which is important to know about. AND operators are applied before OR operators, even if the AND is not specified by the user. The query we wrote above actually means something like below. We're only removing the retweets containing the word "echidnas", not all retweets:
```
echidna OR echidna's OR (echidnas AND -is:retweet)
```
We can make our intent explicit by adding parentheses to group terms. This is a good idea in general to make your meaning clear, even if you know all of the operator rules.
```shell
twarc2 counts "(echidna OR echidna's OR echidnas) -is:retweet" --granularity day --text
```
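Python's `and` also binds more tightly than `or`, so the two readings of the query can be compared directly. This is a toy model only: the four boolean parameters are hypothetical stand-ins for "the tweet contains this word" and "the tweet is a retweet".

```python
def ungrouped(has_echidna, has_possessive, has_plural, is_retweet):
    # echidna OR echidna's OR echidnas -is:retweet
    # The implicit AND attaches only to the last term.
    return has_echidna or has_possessive or has_plural and not is_retweet

def grouped(has_echidna, has_possessive, has_plural, is_retweet):
    # (echidna OR echidna's OR echidnas) -is:retweet
    return (has_echidna or has_possessive or has_plural) and not is_retweet

# A retweet that contains only the word "echidna".
retweet = dict(has_echidna=True, has_possessive=False, has_plural=False, is_retweet=True)

print(ungrouped(**retweet))  # True
print(grouped(**retweet))    # False
```

The ungrouped form lets the retweet through, while the grouped form excludes it as intended.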
Now for the purposes of this tutorial we're going to stop exploring any further, but we could continue to refine and improve this query to match our research question. Twitter lets you build very long queries (up to 512 characters on the standard track and 1024 for the academic track) so you have plenty of scope to express yourself. As mentioned earlier, [Twitter's Query Builder](https://developer.twitter.com/apitools/query?query=) is an excellent tool for helping you to build your query.
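When a query has many variants it can be easier to assemble it in code and check its length before running it. The sketch below uses the limits quoted above; `build_query` is a hypothetical helper, not part of twarc.

```python
# Query length limits quoted in this tutorial.
STANDARD_LIMIT = 512
ACADEMIC_LIMIT = 1024

def build_query(variants, exclude_retweets=True):
    """Group the variants with OR and optionally exclude retweets."""
    query = "(" + " OR ".join(variants) + ")"
    if exclude_retweets:
        query += " -is:retweet"
    return query

query = build_query(["echidna", "echidna's", "echidnas"])
print(query)
print(len(query) <= STANDARD_LIMIT)
```

Always wrapping the variants in parentheses avoids the operator precedence gotcha described earlier.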
If we apply the same kind of process to the platypus case, we might end up with something like the following. In this case it was necessary to use the [Twitter search web interface](https://twitter.com/explore) to find some of the variations in the word platypus:
```shell
twarc2 counts "(platypus OR platypus's OR platypi OR platypusses OR platypuses) -is:retweet" --granularity day --text
```
Having decided on the actual queries to run and examined the counts, now it's time to actually collect the tweets! We can take the queries we ran earlier, replace the `counts` command with the `search` and remove the `counts` specific arguments to get:
```shell
twarc2 search "(echidna OR echidna's OR echidnas) -is:retweet" echidna.json
twarc2 search "(platypus OR platypus's OR platypi OR platypusses OR platypuses) -is:retweet" platypus.json
```
Running these two commands will save the tweets matching each of those searches to two files on our disk, which we will use in the next sections.

TIP: if you're not sure where the files above have been saved, you can run the command `cd` on Windows, or `pwd` on Mac to have your shell print out the folder in the filesystem where twarc has been working.
## Understanding and transforming twitter JSON data
Now that we've collected some data, it's time to take a look at it. Let's start by viewing the collected data in its plainest form: as a text file. Although we named the file with an extension of `.json`, this is just a convention: the actual file content is plain text in the [JSON](https://en.wikipedia.org/wiki/JSON) format. Let's open this file with our built-in text editor (Notepad on Windows, TextEdit on Mac).

You'll notice immediately that there is a *lot* of data in that file: tweets are rich objects, and we mentioned that twarc by default captures as much information as Twitter makes available. Further, the Twitter API provides data in a format that makes it convenient for machines to work with, but not so much for humans.
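A few lines of Python are enough to get a first feel for the file without reading it by eye. This is a minimal sketch that assumes each line of the file holds one JSON response from the API, with the tweets in a `data` list; the sample lines are made up and heavily trimmed.

```python
import json

def count_tweets(lines):
    """Count tweets in line-oriented JSON, one API response per line."""
    total = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        response = json.loads(line)
        data = response.get("data", [])
        # A response normally holds a list of tweets; tolerate a single tweet object too.
        total += len(data) if isinstance(data, list) else 1
    return total

# Two made-up responses standing in for lines of echidna.json.
sample = [
    '{"data": [{"id": "1", "text": "echidna!"}, {"id": "2", "text": "echidnas"}]}',
    '{"data": [{"id": "3", "text": "an echidna"}]}',
]
print(count_tweets(sample))  # 3
```

To run it on the collected data, pass an open file instead of the sample list, for example `count_tweets(open("echidna.json", encoding="utf-8"))`.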
## Making a CSV file from our collected tweets
We don't recommend trying to manually parse this raw data unless you have specific needs that aren't covered by existing tools. So we're going to use the `twarc-csv` package that we installed earlier to do the heavy lifting of transforming the collected JSON into a more friendly comma-separated value ([CSV](https://en.wikipedia.org/wiki/Comma-separated_values)) file. CSV is a simple plaintext format, but unlike JSON it is easy to import into or open with a spreadsheet.
The `twarc-csv` package lets us use a `csv` command to transform the files from twarc:
```shell
twarc2 csv echidna.json echidna.csv
twarc2 csv platypus.json platypus.csv
```
If we look at these files in our text editor again, we'll see a nice structure of one line per tweet, with all of the many columns for that tweet.

Since we're going to do more analysis with the Pandas library to answer our question, we will want to create the CSV with only the columns of interest. This will reduce the time and amount of computer memory/RAM you need to load your dataset. For example, the following commands produce CSV files with a small number of fields:
```shell
twarc2 csv --output-columns id,created_at,author_id,text,referenced_tweets.retweeted.id,public_metrics.like_count echidna.json echidna_minimal.csv
twarc2 csv --output-columns id,created_at,author_id,text,referenced_tweets.retweeted.id,public_metrics.like_count platypus.json platypus_minimal.csv
```
### The problem with Excel
It's tempting to try to open these CSV files directly in Excel, but if you do you're probably going to notice one or more of the following problems, as illustrated below:
1. The ID columns are likely to be broken.
2. Emoji and languages that don't use latin characters may not appear correctly.
3. Tweets may be broken up on newlines.
4. Excel can only support 1,048,576 rows - it's very easy to collect tweet datasets bigger than this.

If you save a file from Excel with any of those problems, that file is no longer useful for most purposes. This is a common and longstanding problem with spreadsheet software that affects many fields, [including genomics](https://www.nature.com/articles/d41586-021-02211-4). While it is possible to make Excel do the right thing with your data, it takes more work, and a single mistake can lead to loss of important data. Therefore our recommendation is, if possible, to avoid the use of spreadsheets for analysing Twitter data.
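The broken ID columns come from treating IDs as numbers: Excel keeps only about 15 significant digits, while tweet IDs are longer than that. Python floats lose digits in a similar way, which makes the effect easy to demonstrate. The ID below is made up, and this sketch illustrates the kind of damage rather than Excel's exact behaviour.

```python
import csv
import io

# A made-up 19-digit ID, similar in size to real tweet IDs.
tweet_id = 1590519164363313153

# Floating point cannot hold every integer this large, so digits are lost.
as_float = int(float(tweet_id))
print(as_float == tweet_id)  # False

# Reading the CSV as text keeps every digit of the ID.
data = io.StringIO("id,text\n1590519164363313153,hello\n")
row = next(csv.DictReader(data))
print(row["id"] == str(tweet_id))  # True
```

The same idea applies in Pandas, where `read_csv` accepts a `dtype` argument (for example `dtype={"id": str}`) so that ID columns are kept as text.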
### Working with Pandas
If you are going to process or analyse your data in Python, the scientific Python library [Pandas](https://pandas.pydata.org/) is a good choice: it can load and manipulate tabular data like the data in our CSV files. Note that in this section we're only going to run a very simple computation; the suggested resources at the end of this tutorial link to more extensive material for learning more.
```python
# process_monotremes.py
import pandas
echidna = pandas.read_csv("echidna_minimal.csv")
platypus = pandas.read_csv("platypus_minimal.csv")
echidna_likes = echidna["public_metrics.like_count"].sum()
platypus_likes = platypus["public_metrics.like_count"].sum()
print(f"Total likes on echidna tweets: {echidna_likes}. Total likes on platypus tweets: {platypus_likes}.")
```
Run this script through Python to see which of the monotremes is the coolest:
```shell
python process_monotremes.py
```
### Answering the research question: which monotreme is the coolest?
At the time of creating this tutorial, the above script run with the just collected data leads to the following result:
```console
Total likes on echidna tweets: 1787652. Total likes on platypus tweets: 3462715.
```
On that basis, we can conclude that at the time of running this search the platypus is nearly twice as cool as the echidna based on Twitter likes.
Of course this is a simplistic approach to answering this specific research question - we could have made many other choices. Even with a simple quantitative approach based on metrics, we could have chosen to look at other engagement counts like the number of retweets, or looked at the number of followers of the accounts tweeting about each animal (because a "cooler" account will have more followers). Much of the challenge in using Twitter for research lies in both asking the right research question and choosing the right approach to the data to address that research question.
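As one more illustration of how much the choice of metric matters, the total could be replaced by the average number of likes per tweet, so that the animal with more tweets does not win by volume alone. This sketch uses only the standard library and assumes the `public_metrics.like_count` column holds whole numbers; the sample rows are made up.

```python
import csv
import io

def like_counts(csv_file):
    """Return the like count of every row as a list of integers."""
    reader = csv.DictReader(csv_file)
    return [int(row["public_metrics.like_count"]) for row in reader]

# Made-up rows standing in for echidna_minimal.csv.
sample = io.StringIO(
    "id,public_metrics.like_count\n"
    "1,10\n"
    "2,0\n"
    "3,2\n"
)
likes = like_counts(sample)
print(sum(likes), sum(likes) / len(likes))  # 12 4.0
```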
## Prepare a dataset for sharing/using a shared dataset
Having performed this analysis and come to a conclusion, it is good practice to share the underlying data so other people can reproduce these results (with some caveats). Noting that we want to preserve Twitter users' agency over the availability of their content, and Twitter's Developer Agreement, we can do this by creating a dataset of tweet IDs. Instead of sharing the content of the tweets, we can share the unique ID for that tweet, which allows others to `hydrate` the tweets by retrieving them again from the Twitter API.
This can be done as follows using twarc's `dehydrate` command:
```shell
twarc2 dehydrate --id-type tweets platypus.json platypus_ids.txt
twarc2 dehydrate --id-type tweets echidna.json echidna_ids.txt
```
These commands will produce the two text files, with each line in these files containing the unique ID of the tweet.
To `hydrate`, or retrieve the tweets again, we can use the corresponding commands:
```shell
twarc2 hydrate platypus_ids.txt platypus_hydrated.json
twarc2 hydrate echidna_ids.txt echidna_hydrated.json
```
Note that the hydrated files may include fewer tweets than the originals: tweets that have been deleted, or tweets by accounts that have been deleted, suspended, or protected, will not be included in the file. Hydrating a dataset also means that engagement metrics like retweets and likes will be up to date for tweets that are still available.
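To see how many tweets have become unavailable, the shared IDs can be compared with the IDs that came back from hydration. This is a minimal sketch with made-up IDs; `read_ids` and `missing_ids` are hypothetical helpers, and collecting the hydrated IDs from the JSON file is left out.

```python
def read_ids(lines):
    """Collect non-empty lines (one tweet ID per line) into a set of strings."""
    return {line.strip() for line in lines if line.strip()}

def missing_ids(original_lines, hydrated_ids):
    """Return the IDs that were shared but could not be hydrated again."""
    return read_ids(original_lines) - set(hydrated_ids)

# Made-up IDs: three were shared, two came back from hydration.
shared = ["101\n", "102\n", "103\n"]
hydrated = ["101", "103"]
print(sorted(missing_ids(shared, hydrated)))  # ['102']
```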
## Suggested resources
You can find some additional links and resources in the [resources section](https://twarc-project.readthedocs.io/en/latest/resources/) of the twarc documentation.
================================================
FILE: docs/twarc1_en_us.md
================================================
twarc1
=====
***For information about working with the Twitter V2 API please see the [twarc2](https://twarc-project.readthedocs.io/en/latest/twarc2/) page.***
---
twarc is a command line tool and Python library for archiving Twitter JSON data.
Each tweet is represented as a JSON object that is
[exactly](https://dev.twitter.com/overview/api/tweets) what was returned from
the Twitter API. Tweets are stored as [line-oriented JSON](https://en.wikipedia.org/wiki/JSON_Streaming#Line-delimited_JSON). twarc will handle
Twitter API's [rate limits](https://dev.twitter.com/rest/public/rate-limiting)
for you. In addition to letting you collect tweets twarc can also help you
collect users, trends and hydrate tweet ids.
twarc was developed as part of the [Documenting the Now](http://www.docnow.io)
project which was funded by the [Mellon Foundation](https://mellon.org/).
## Install
Before using twarc you will need to register an application at
[apps.twitter.com](http://apps.twitter.com). Once you've created your
application, note down the consumer key, consumer secret and then click to
generate an access token and access token secret. With these four variables
in hand you are ready to start using twarc.
1. install [Python 3](http://python.org/download)
2. [pip](https://pip.pypa.io/en/stable/installing/) install twarc:
```
pip install --upgrade twarc
```
### Homebrew (macOS only)
For macOS users, you can also install `twarc` via [Homebrew](https://brew.sh/):
```bash
$ brew install twarc
```
### Windows
If you installed with pip and see a "failed to create process" when running twarc try reinstalling like this:

    python -m pip install --upgrade --force-reinstall twarc

## Quickstart:
First you're going to need to tell twarc about your application API keys and
grant access to one or more Twitter accounts:

    twarc configure

Then try out a search:

    twarc search blacklivesmatter > search.jsonl

Or maybe you'd like to collect tweets as they happen?

    twarc filter blacklivesmatter > stream.jsonl

See below for the details about these commands and more.
## Usage
### Configure
Once you've got your application keys you can tell twarc what they are with the
`configure` command.

    twarc configure

This will store your credentials in a file called `.twarc` in your home
directory so you don't have to keep entering them in. If you would rather supply
them directly you can set them in the environment (`CONSUMER_KEY`,
`CONSUMER_SECRET`, `ACCESS_TOKEN`, `ACCESS_TOKEN_SECRET`) or using command line
options (`--consumer_key`, `--consumer_secret`, `--access_token`,
`--access_token_secret`).
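If you rely on environment variables, a quick check that all four are set can save a confusing authentication error later. This is a minimal sketch using the variable names listed above; `missing_credentials` is a hypothetical helper, not part of twarc.

```python
import os

# The four environment variable names listed above.
CREDENTIAL_NAMES = ["CONSUMER_KEY", "CONSUMER_SECRET", "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET"]

def missing_credentials(environ=os.environ):
    """Return the names of credentials that are unset or empty."""
    return [name for name in CREDENTIAL_NAMES if not environ.get(name)]

# Example with a made-up environment in which only two values are set.
example_env = {"CONSUMER_KEY": "abc", "CONSUMER_SECRET": "def"}
print(missing_credentials(example_env))  # ['ACCESS_TOKEN', 'ACCESS_TOKEN_SECRET']
```

Calling `missing_credentials()` with no argument checks the real environment of the current process.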
### Search
This uses Twitter's [search/tweets](https://dev.twitter.com/rest/reference/get/search/tweets) to download *pre-existing* tweets matching a given query.

    twarc search blacklivesmatter > tweets.jsonl

It's important to note that `search` will return tweets that are found within a
7 day window that Twitter's search API imposes. If this seems like a small
window, it is, but you may be interested in collecting tweets as they happen
using the `filter` and `sample` commands below.
The best way to get familiar with Twitter's search syntax is to experiment with
[Twitter's Advanced Search](https://twitter.com/search-advanced) and copy and
pasting the resulting query from the search box. For example here is a more
complicated query that searches for tweets containing either the
\#blacklivesmatter or #blm hashtags that were sent to deray.

    twarc search '#blacklivesmatter OR #blm to:deray' > tweets.jsonl

You also should definitely check out Igor Brigadir's *excellent* reference guide
to the Twitter Search syntax:
[Advanced Search on Twitter](https://github.com/igorbrigadir/twitter-advanced-search/blob/master/README.md).
There are lots of hidden gems in there that the advanced search form doesn't
make readily apparent.
Twitter attempts to code the language of a tweet, and you can limit your search
to a particular language if you want using an [ISO 639-1] code:

    twarc search '#blacklivesmatter' --lang fr > tweets.jsonl

You can also search for tweets with a given location, for example tweets
mentioning *blacklivesmatter* within 1 mile of the center of Ferguson,
Missouri:

    twarc search blacklivesmatter --geocode 38.7442,-90.3054,1mi > tweets.jsonl

If a search query isn't supplied when using `--geocode` you will get all tweets
relevant for that location and radius:

    twarc search --geocode 38.7442,-90.3054,1mi > tweets.jsonl

### Filter
The `filter` command will use Twitter's [statuses/filter](https://developer.twitter.com/en/docs/twitter-api/v1/tweets/filter-realtime/api-reference/post-statuses-filter) API to collect tweets as they happen.

    twarc filter blacklivesmatter,blm > tweets.jsonl

Please note that the syntax for Twitter's track queries is significantly
different from the query syntax of their search API. Consult the
[track documentation](https://developer.twitter.com/en/docs/twitter-api/v1/tweets/filter-realtime/guides/basic-stream-parameters#track) on how best to express the filter option you are using.
Use the `follow` command line argument if you would like to collect tweets from
a given user id as they happen. This includes retweets. For example this will
collect tweets and retweets from CNN:

    twarc filter --follow 759251 > tweets.jsonl

You can also collect tweets using a bounding box. Note: the leading dash needs
to be escaped in the bounding box or else it will be interpreted as a command
line argument!
twarc filter --locations "\-74,40,-73,41" > tweets.jsonl
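The bounding box is given as longitude/latitude pairs, southwest corner first and northeast corner second. The following sketch parses such a string and tests whether a point falls inside it. `parse_locations` and `in_any_box` are hypothetical helpers for illustration only, and the assumption that several boxes can be concatenated in groups of four values comes from Twitter's streaming documentation rather than from twarc itself.

```python
def parse_locations(locations):
    """Parse 'west,south,east,north[,...]' into a list of 4-tuples of floats."""
    # Drop the backslash used to escape the leading dash on the command line.
    values = [float(v) for v in locations.lstrip("\\").split(",")]
    if len(values) % 4 != 0:
        raise ValueError("expected groups of four comma separated numbers")
    return [tuple(values[i:i + 4]) for i in range(0, len(values), 4)]


def in_any_box(lon, lat, boxes):
    """Return True if the point lies inside at least one bounding box."""
    return any(west <= lon <= east and south <= lat <= north
               for west, south, east, north in boxes)
```

With the New York box above, `in_any_box(-73.98, 40.75, parse_locations("-74,40,-73,41"))` checks a point in Manhattan.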
You can use the `lang` command line argument to pass in an [ISO 639-1] language
code to limit results to. Since the filter stream allows you to filter by one or
more languages, the argument is repeatable. So this would collect tweets that
mention paris or madrid that were made in French or Spanish:
twarc filter paris,madrid --lang fr --lang es
If you combine filter and follow options they are OR'ed together. For example
this will collect tweets that use the blacklivesmatter or blm hashtags and also
tweets from user CNN:
twarc filter blacklivesmatter,blm --follow 759251 > tweets.jsonl
But combining locations and languages will result effectively in an AND. For
example this will collect tweets from the greater New York area that are in
Spanish or French:
twarc filter --locations "\-74,40,-73,41" --lang es --lang fr
### Sample
Use the `sample` command to listen to Twitter's [statuses/sample](https://dev.twitter.com/streaming/reference/get/statuses/sample) API for a "random" sample of recent public statuses.
twarc sample > tweets.jsonl
### Dehydrate
The `dehydrate` command generates an id list from a file of tweets:
twarc dehydrate tweets.jsonl > tweet-ids.txt
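Conceptually, dehydrating is just pulling the identifier out of each line of JSON. The following is a minimal sketch of that idea, not twarc's actual implementation; it assumes each line is a v1.1 tweet object with an `id_str` field, and the function name `dehydrate` is only illustrative.

```python
import json


def dehydrate(lines):
    """Yield the id_str of every tweet in an iterable of JSON lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        yield tweet["id_str"]


# Example usage (assumes a tweets.jsonl file exists):
#
#     with open("tweets.jsonl") as f:
#         for tweet_id in dehydrate(f):
#             print(tweet_id)
```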
### Hydrate
twarc's `hydrate` command will read a file of tweet identifiers and write out the tweet JSON for them using Twitter's [statuses/lookup](https://dev.twitter.com/rest/reference/get/statuses/lookup) API.
twarc hydrate ids.txt > tweets.jsonl
Twitter API's [Terms of Service](https://dev.twitter.com/overview/terms/policy#6._Be_a_Good_Partner_to_Twitter) discourage people from making large amounts of raw Twitter data available on the Web. The data can be used for research and archived for local use, but not shared with the world. Twitter does allow files of tweet identifiers to be shared, which can be useful when you would like to make a dataset of tweets available. You can then use Twitter's API to *hydrate* the data, or to retrieve the full JSON for each identifier. This is particularly important for [verification](https://en.wikipedia.org/wiki/Reproducibility) of social media research.
### Users
The `users` command will return User metadata for the given screen names.
twarc users deray,Nettaaaaaaaa > users.jsonl
You can also give it user ids:
twarc users 1232134,1413213 > users.jsonl
If you want you can also use a file of user ids, which can be useful if you are
using the `followers` and `friends` commands below:
twarc users ids.txt > users.jsonl
### Followers
The `followers` command will use Twitter's [follower id API](https://dev.twitter.com/rest/reference/get/followers/ids) to collect the follower user ids for a single user screen name, given as an argument:
twarc followers deray > follower_ids.txt
The result will include exactly one user id per line. The response order is
reverse chronological, or most recent followers first.
### Friends
Like the `followers` command, the `friends` command will use Twitter's [friend id API](https://dev.twitter.com/rest/reference/get/friends/ids) to collect the friend user ids for a single user screen name, given as an argument:
twarc friends deray > friend_ids.txt
### Trends
The `trends` command lets you retrieve information from Twitter's API about trending hashtags. You need to supply a [Where On Earth](https://web.archive.org/web/20180102203025/https://developer.yahoo.com/geo/geoplanet/) identifier (`woeid`) to indicate what trends you are interested in. For example here's how you can get the current trends for St Louis:
twarc trends 2486982
Using a `woeid` of 1 will return trends for the entire planet:
twarc trends 1
If you aren't sure what to use as a `woeid` just omit it and you will get a list
of all the places for which Twitter tracks trends:
twarc trends
If you have a geo-location you can use it instead of the `woeid`.
twarc trends 39.9062,-79.4679
Behind the scenes twarc will look up the location using Twitter's [trends/closest](https://dev.twitter.com/rest/reference/get/trends/closest) API to find the nearest `woeid`.
### Timeline
The `timeline` command will use Twitter's [user timeline API](https://dev.twitter.com/rest/reference/get/statuses/user_timeline) to collect the most recent tweets posted by the user indicated by screen_name.
twarc timeline deray > tweets.jsonl
You can also look up users using a user id:
twarc timeline 12345 > tweets.jsonl
### Retweets
You can get retweets for a given tweet id like so:
twarc retweets 824077910927691778 > retweets.jsonl
If you have a file of tweet ids that you would like to fetch the retweets for, you can:
twarc retweets ids.txt > retweets.jsonl
### Replies
Unfortunately Twitter's API does not currently support getting replies to a
tweet. So twarc approximates it by using the search API. Since the search API
does not support getting tweets older than a week, twarc can only get the
replies to a tweet that have been sent in the last week.
If you want to get the replies to a given tweet you can:
twarc replies 824077910927691778 > replies.jsonl
Using the `--recursive` option will also fetch replies to the replies as well as
quotes. This can take a long time to complete for a large thread because of
rate limiting by the search API.
twarc replies 824077910927691778 --recursive
### Lists
To get the users that are on a list you can use the list URL with the
`listmembers` command:
twarc listmembers https://twitter.com/edsu/lists/bots
## Premium Search API
Twitter introduced a Premium Search API that lets you pay Twitter money for tweets.
Once you have set up an environment in your
[dashboard](https://developer.twitter.com/en/dashboard) you can use their 30day
and fullarchive endpoints to search for tweets outside the 7 day window provided
by the Standard Search API. To use the premium API from the command line you
will need to indicate which endpoint you are using, and the environment.
To avoid using up your entire budget you will likely want to limit the time
range using `--to_date` and `--from_date`. Additionally you can limit the
maximum number of tweets returned using `--limit`.
So for example, if I wanted to get all the blacklivesmatter tweets from a
two-week period in May (assuming today is June 1, 2020) using my environment named
*docnowdev* but not retrieving more than 1000 tweets, I could:
twarc search blacklivesmatter \
--30day docnowdev \
--from_date 2020-05-01 \
--to_date 2020-05-14 \
--limit 1000 \
> tweets.jsonl
Similarly, to find tweets from 2014 using the full archive you can:
twarc search blacklivesmatter \
--fullarchive docnowdev \
--from_date 2014-08-04 \
--to_date 2014-08-05 \
--limit 1000 \
> tweets.jsonl
If your environment is sandboxed you will need to use `--sandbox` so that twarc
knows not to request more than 100 tweets at a time (the default for
non-sandboxed environments is 500):
twarc search blacklivesmatter \
--fullarchive docnowdev \
--from_date 2014-08-04 \
--to_date 2014-08-05 \
--limit 1000 \
--sandbox \
> tweets.jsonl
## Gnip Enterprise API
twarc supports integration with the Gnip Twitter Full-Archive Enterprise API.
To do so, you must pass in the `--gnip_auth` argument. Additionally, set the
`GNIP_USERNAME`, `GNIP_PASSWORD`, and `GNIP_ACCOUNT` environment variables.
You can then run the following:
twarc search blacklivesmatter \
--gnip_auth \
--gnip_fullarchive prod \
--from_date 2014-08-04 \
--to_date 2015-08-05 \
--limit 1000 \
> tweets.jsonl
## Use as a Library
If you want you can use twarc programmatically as a library to collect
tweets. You first need to create a `Twarc` instance (using your Twitter
credentials), and then use it to iterate through search results, filter
results or lookup results.
```python
from twarc import Twarc
t = Twarc(consumer_key, consumer_secret, access_token, access_token_secret)
for tweet in t.search("ferguson"):
    print(tweet["text"])
```
You can do the same for a filter stream of new tweets that match a track
keyword:
```python
for tweet in t.filter(track="ferguson"):
    print(tweet["text"])
```
or location:
```python
for tweet in t.filter(locations="-74,40,-73,41"):
    print(tweet["text"])
```
or user ids:
```python
for tweet in t.filter(follow='12345,678910'):
    print(tweet["text"])
```
Similarly you can hydrate tweet identifiers by passing in a list of ids
or a generator:
```python
for tweet in t.hydrate(open('ids.txt')):
    print(tweet["text"])
```
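The command line tools write line-oriented JSON, and the same format is easy to produce from the library. The sketch below is one possible way to do it; `write_jsonl` is a hypothetical helper and is not part of twarc. It accepts any iterable of tweet dictionaries, so a generator such as `t.search("ferguson")` could be passed in directly.

```python
import json


def write_jsonl(tweets, out):
    """Write each tweet dict as one JSON document per line; return the count."""
    count = 0
    for tweet in tweets:
        out.write(json.dumps(tweet) + "\n")
        count += 1
    return count


# Example usage (assumes `t` is a configured Twarc instance):
#
#     with open("tweets.jsonl", "w") as f:
#         write_jsonl(t.search("ferguson"), f)
```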
## User vs App Auth
twarc will manage rate limiting by Twitter. However, you should know that
their rate limiting varies based on the way that you authenticate. The two
options are User Auth and App Auth. twarc defaults to using User Auth but you
can tell it to use App Auth.
Switching to App Auth can be handy in some situations like when you are
searching tweets, since User Auth can only issue 180 requests every 15 minutes
(about 1.7 million tweets per day), but App Auth can issue 450 (4.3 million
tweets per day).
But be careful: the `statuses/lookup` endpoint used by the hydrate subcommand
has a rate limit of 900 requests per 15 minutes for User Auth, and 300 requests
per 15 minutes for App Auth.
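To see where daily totals like these come from, here is a small back-of-the-envelope calculation. It assumes every request returns the maximum of 100 tweets, which is an upper bound and not something twarc guarantees, so real collections will usually be smaller. The results land in the same ballpark as the figures quoted above.

```python
WINDOWS_PER_DAY = 24 * 60 // 15  # 96 fifteen-minute rate limit windows per day


def tweets_per_day(requests_per_window, tweets_per_request=100):
    """Upper bound on tweets per day for a given per-window request limit."""
    return requests_per_window * WINDOWS_PER_DAY * tweets_per_request


print(tweets_per_day(180))  # search, User Auth:  1728000
print(tweets_per_day(450))  # search, App Auth:   4320000
print(tweets_per_day(900))  # hydrate, User Auth: 8640000
print(tweets_per_day(300))  # hydrate, App Auth:  2880000
```

So App Auth gives more headroom for search, while User Auth gives more headroom for hydrating.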
If you know what you are doing and want to force App Auth, you can use the
`--app_auth` command line option:
twarc --app_auth search ferguson > tweets.jsonl
Similarly, if you are using twarc as a library you can:
```python
from twarc import Twarc
t = Twarc(app_auth=True)
for tweet in t.search('ferguson'):
    print(tweet['id_str'])
```
## Utilities
In the utils directory there are some simple command line utilities for
working with the line-oriented JSON, like printing out the archived tweets as
text or html, extracting the usernames, referenced URLs, etc. If you create a
script that you find handy please send a pull request.
When you've got some tweets you can create a rudimentary wall of them:
utils/wall.py tweets.jsonl > tweets.html
You can create a word cloud of tweets you collected about nasa:
utils/wordcloud.py tweets.jsonl > wordcloud.html
If you've collected some tweets using `replies` you can create a static D3
visualization of them with:
utils/network.py tweets.jsonl tweets.html
Optionally you can consolidate tweets by user, allowing you to see central accounts:
utils/network.py --users tweets.jsonl tweets.html
Additionally, you can create a network of hashtags, allowing you to view their colocation:
utils/network.py --hashtags tweets.jsonl tweets.html
And if you want to use the network graph in a program like [Gephi](https://gephi.org/),
you can generate a GEXF file with the following:
utils/network.py --users tweets.jsonl tweets.gexf
utils/network.py --hashtags tweets.jsonl tweets.gexf
Additionally if you want to convert the network into a dynamic network with timeline enabled (i.e. nodes will appear and disappear according to their attributes), you can open up your GEXF file in Gephi and follow [these instructions](https://seinecle.github.io/gephi-tutorials/generated-html/converting-a-network-with-dates-into-dynamic.html). Note that in tweets.gexf there is a column for "start_date" (which is the day the post was created) but none for "end_date" and that in the dynamic timeline, the nodes will appear on the screen at their start date and stay on screen forever after. For the "Time Interval creation options" pop-up in Gephi, the "Start time column" should be "start_date", the "End time column" should be empty, the "Parse dates" should be selected, and the Date format should be the last option, "dd/MM/yyyy HH:mm:ss".
gender.py is a filter which allows you to filter tweets based on a guess about
the gender of the author. So for example you can filter out all the tweets that
look like they were from women, and create a word cloud for them:
utils/gender.py --gender female tweets.jsonl | utils/wordcloud.py > tweets-female.html
You can output [GeoJSON](http://geojson.org/) from tweets where geo coordinates are available:
utils/geojson.py tweets.jsonl > tweets.geojson
Optionally you can export GeoJSON with centroids replacing bounding boxes:
utils/geojson.py tweets.jsonl --centroid > tweets.geojson
And if you do export GeoJSON with centroids, you can add some random fuzzing:
utils/geojson.py tweets.jsonl --centroid --fuzz 0.01 > tweets.geojson
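As an illustration of what replacing a bounding box with a fuzzed centroid could look like, here is a minimal sketch. It is not the code in `utils/geojson.py`; the function names, the input shape (a list of `[longitude, latitude]` corner points) and the uniform random offset are all assumptions made for the example.

```python
import random


def centroid(bounding_box):
    """Average the corner points of a box given as [[lon, lat], ...]."""
    lons = [point[0] for point in bounding_box]
    lats = [point[1] for point in bounding_box]
    return [sum(lons) / len(lons), sum(lats) / len(lats)]


def fuzz(point, amount, rng=random):
    """Shift each coordinate by a random offset in [-amount, amount]."""
    return [coord + rng.uniform(-amount, amount) for coord in point]


box = [[-74, 40], [-73, 40], [-73, 41], [-74, 41]]
print(fuzz(centroid(box), 0.01))
```

Fuzzing keeps many tweets that share one place from stacking on exactly the same map point.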
To filter tweets by presence or absence of geo coordinates (or Place, see
[API documentation](https://dev.twitter.com/overview/api/places)):
utils/geofilter.py tweets.jsonl --yes-coordinates > tweets-with-geocoords.jsonl
cat tweets.jsonl | utils/geofilter.py --no-place > tweets-with-no-place.jsonl
To filter tweets by a GeoJSON fence (requires [Shapely](https://github.com/Toblerity/Shapely)):
utils/geofilter.py tweets.jsonl --fence limits.geojson > fenced-tweets.jsonl
cat tweets.jsonl | utils/geofilter.py --fence limits.geojson > fenced-tweets.jsonl
If you suspect you have duplicates in your tweets you can dedupe them:
utils/deduplicate.py tweets.jsonl > deduped.jsonl
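The idea behind deduplication is simply to remember which tweet ids have already been seen. The sketch below shows that approach. It is an illustration under the assumption that each line is a tweet object with an `id_str` field, and it is not necessarily how `utils/deduplicate.py` is written.

```python
import json


def deduplicate(lines):
    """Yield only the first line seen for each tweet id."""
    seen = set()
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        tweet_id = json.loads(line)["id_str"]
        if tweet_id not in seen:
            seen.add(tweet_id)
            yield line
```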
You can sort by ID, which is analogous to sorting by time:
utils/sort_by_id.py tweets.jsonl > sorted.jsonl
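Sorting by ID works as a proxy for time because tweet ids generated since late 2010 are "snowflake" ids whose upper bits encode a millisecond timestamp. The sketch below recovers that timestamp. The function name is hypothetical, and the result is only meaningful for snowflake ids, not for older sequential ids.

```python
from datetime import datetime, timezone

TWITTER_EPOCH_MS = 1288834974657  # snowflake epoch, in Unix milliseconds


def snowflake_to_datetime(tweet_id):
    """Return the UTC creation time encoded in a snowflake tweet id."""
    # The lowest 22 bits hold worker and sequence numbers; the rest is time.
    milliseconds = (int(tweet_id) >> 22) + TWITTER_EPOCH_MS
    return datetime.fromtimestamp(milliseconds / 1000, tz=timezone.utc)


print(snowflake_to_datetime("824077910927691778"))
```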
You can filter out all tweets before a certain date (for example, if a hashtag was used for another event before the one you're interested in):
utils/filter_date.py --mindate 1-may-2014 tweets.jsonl > filtered.jsonl
You can get an HTML list of the clients used:
utils/source.py tweets.jsonl > sources.html
If you want to remove the retweets:
utils/noretweets.py tweets.jsonl > tweets_noretweets.jsonl
Or unshorten urls (requires [unshrtn](https://github.com/docnow/unshrtn)):
cat tweets.jsonl | utils/unshrtn.py > unshortened.jsonl
Once you unshorten your URLs you can get a ranked list of most-tweeted URLs:
cat unshortened.jsonl | utils/urls.py | sort | uniq -c | sort -nr > urls.txt
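If you would rather do the counting in Python than with `sort | uniq -c | sort -nr`, a `collections.Counter` does the same job. This is a small sketch assuming one URL per input line, as in the pipeline above; `rank_urls` is a hypothetical name.

```python
from collections import Counter


def rank_urls(urls):
    """Return (url, count) pairs ordered from most to least frequent."""
    counts = Counter(url.strip() for url in urls if url.strip())
    return counts.most_common()


for url, count in rank_urls(["http://a.example\n", "http://b.example\n", "http://a.example\n"]):
    print(count, url)
```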
## twarc-report
Some further utility scripts to generate csv or json output suitable for
use with [D3.js](http://d3js.org/) visualizations are found in the
[twarc-report](https://github.com/pbinkley/twarc-report) project. The
util `directed.py`, formerly part of twarc, has moved to twarc-report as
`d3graph.py`.
Each script can also generate an html demo of a D3 visualization, e.g.
[timelines](https://wallandbinkley.com/twarc/bill10/) or a
[directed graph of retweets](https://wallandbinkley.com/twarc/bill10/directed-retweets.html).
[Chinese]: https://github.com/DocNow/twarc/blob/main/README_zw_zh.md
[Japanese]: https://github.com/DocNow/twarc/blob/main/README_ja_jp.md
[Portuguese]: https://github.com/DocNow/twarc/blob/main/README_pt_br.md
[Spanish]: https://github.com/DocNow/twarc/blob/main/README_es_mx.md
[Swedish]: https://github.com/DocNow/twarc/blob/main/README_sv_se.md
[Swahili]: https://github.com/DocNow/twarc/blob/main/README_sw_ke.md
[ISO 639-1]: https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes
================================================
FILE: docs/twarc1_es_mx.md
================================================
# twarc1
twarc is a command line tool and Python library for archiving Twitter JSON data. Each tweet is represented as
a JSON object that is [exactly](https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object) what was returned from the Twitter API. Tweets are stored as [line-oriented JSON](https://en.wikipedia.org/wiki/JSON_streaming#Line-delimited_JSON). twarc handles the Twitter API's [rate limits](https://developer.twitter.com/en/docs/basics/rate-limiting) for you. twarc can also help you collect users and trends, and hydrate tweet ids.
twarc was developed as part of the [Documenting the Now](http://www.docnow.io/) project, which was funded by the [Mellon Foundation](https://mellon.org/).
## Install
Before using twarc you need to register an application at [apps.twitter.com](https://apps.twitter.com/). Once you have created the application, note down the consumer key and consumer secret, and then click to generate an access token and access token secret. With these four values in hand you are ready to start using twarc.
1. Install [Python](https://www.python.org/downloads/) (2 or 3)
2. Install twarc with pip (if you are upgrading: pip install --upgrade twarc)
## Quickstart
To get started, you need to tell twarc about your API keys:
`twarc configure`
Try out a search:
`twarc search blacklivesmatter > search.jsonl`
Or maybe you would rather collect tweets in real time?
`twarc filter blacklivesmatter > stream.jsonl`
See below for details on these commands and more.
## Usage
### Configure
Once you have your application keys, you can tell twarc what they are with the `configure` command.
`twarc configure`
This stores your credentials in a file called `.twarc` in your home directory
so that you don't have to keep entering them. If you would rather supply them directly, you
can set them in the environment `(CONSUMER_KEY, CONSUMER_SECRET, ACCESS_TOKEN, ACCESS_TOKEN_SECRET)`
or use the command line options `(--consumer_key, --consumer_secret, --access_token, --access_token_secret)`.
### Search
This uses Twitter's [search](https://developer.twitter.com/en/docs/api-reference-index) API to download *pre-existing* tweets that match a given query.
`twarc search blacklivesmatter > tweets.jsonl`
It's important to note that `search` only returns tweets found within the seven-day window that Twitter's search API imposes. If that seems like a small window, it is, but you may be interested in collecting tweets in real time using the `filter` and `sample` commands detailed below.
The best way to get familiar with Twitter's search syntax is to experiment with [Twitter's Advanced Search](https://twitter.com/search-advanced) and copy and paste the resulting query from the search box. For example, here is a more complicated query that searches for tweets containing the #blacklivesmatter OR #blm hashtags that were sent to deray.
`twarc search '#blacklivesmatter OR #blm to:deray' > tweets.jsonl`
Twitter attempts to code the language of a tweet, and you can limit your search to a particular language:
`twarc search '#blacklivesmatter' --lang fr > tweets.jsonl`
You can also search for tweets within a given location, for example tweets mentioning blacklivesmatter that are 1 mile from the center of Ferguson, Missouri:
`twarc search blacklivesmatter --geocode 38.7442,-90.3054,1mi > tweets.jsonl`
If a search query isn't supplied when using `--geocode` you will get all tweets for that location and radius:
`twarc search --geocode 38.7442,-90.3054,1mi > tweets.jsonl`
### Filter
The `filter` command uses Twitter's [statuses/filter](https://developer.twitter.com/en/docs/tutorials/consuming-streaming-data) API to collect tweets in real time.
`twarc filter blacklivesmatter,blm > tweets.jsonl`
Please note that the syntax for Twitter's track queries is different from the query syntax of the search API. Please consult the documentation.
Use the `follow` command line argument to collect tweets from a given user id in real time. This includes retweets. For example, this collects tweets and retweets from CNN:
`twarc filter --follow 759251 > tweets.jsonl`
You can also collect tweets using a bounding box. Note: the leading dash needs to be escaped in the bounding box, or else it will be interpreted as a command line argument!
`twarc filter --locations "\-74,40,-73,41" > tweets.jsonl`
If you combine options they are OR'ed together. For example, this collects tweets that use the blacklivesmatter or blm hashtags and also tweets from the user CNN:
`twarc filter blacklivesmatter,blm --follow 759251 > tweets.jsonl`
### Sample
Use the `sample` command to listen to Twitter's [statuses/sample API](https://developer.twitter.com/en/docs/tutorials/consuming-streaming-data) for a "random" sample of recent tweets.
`twarc sample > tweets.jsonl`
### Dehydrate
The `dehydrate` command generates a list of ids from a file of tweets:
`twarc dehydrate tweets.jsonl > tweet-ids.txt`
### Hydrate
The `hydrate` command reads a file of tweet identifiers and returns the tweet JSON for them using the [statuses/lookup API](https://developer.twitter.com/en/docs/api-reference-index).
`twarc hydrate ids.txt > tweets.jsonl`
The Twitter API's [Terms of Service](https://developer.twitter.com/en/developer-terms/policy#6._Be_a_Good_Partner_to_Twitter) discourage people from making Twitter data publicly available on the Web. The data can be used for research and archived for local use, but not shared publicly. Twitter does, however, allow files of tweet identifiers to be shared. You can then use Twitter's API to hydrate the data, that is, to retrieve the full JSON for each identifier. This is important for [verification](https://en.wikipedia.org/wiki/Reproducibility) of social media research.
### Users
The `users` command returns user metadata for the given screen names.
`twarc users deray,Nettaaaaaaaa > users.jsonl`
You can also give it user ids:
`twarc users 1232134,1413213 > users.jsonl`
If you want, you can also use a file of user ids:
`twarc users ids.txt > users.jsonl`
### Followers
The `followers` command uses the [follower id API](https://developer.twitter.com/en/docs/accounts-and-users/follow-search-get-users/api-reference/get-followers-ids) to collect the follower user ids for one screen name per request:
`twarc followers deray > follower_ids.txt`
The result includes one user id per line. The order is reverse chronological, or most recent followers first.
### Friends
The `friends` command uses Twitter's [friend id API](https://developer.twitter.com/en/docs/accounts-and-users/follow-search-get-users/api-reference/get-friends-ids) to collect the friend user ids for one screen name per request:
`twarc friends deray > friend_ids.txt`
### Trends
The `trends` command returns information from Twitter's API about trending hashtags. You need to supply a [Where On Earth identifier (`woeid`)](https://en.wikipedia.org/wiki/WOEID) to indicate which trends you are interested in. For example:
`twarc trends 2486982`
Using a woeid of 1 will return trends for the entire planet:
`twarc trends 1`
You can also omit the `woeid`, in which case you will get a list of the places for which Twitter tracks trends:
`twarc trends`
If you have a geo-location, you can use it instead.
`twarc trends 39.9062,-79.4679`
twarc will look up the location using the [trends/closest](https://developer.twitter.com/en/docs/api-reference-index) API to find the nearest `woeid`.
### Timeline
The `timeline` command uses the [user timeline API](https://developer.twitter.com/en/docs/api-reference-index) to collect the most recent tweets posted by the user indicated by screen name.
`twarc timeline deray > tweets.jsonl`
You can also look up users using a user id:
`twarc timeline 12345 > tweets.jsonl`
### Retweets
You can get the retweets for a given tweet:
`twarc retweets 824077910927691778 > retweets.jsonl`
### Replies
Unfortunately, Twitter's API does not support getting replies to a tweet, so twarc approximates it using the search API. The search API does not return tweets older than seven days.
If you want to get the replies to a given tweet:
`twarc replies 824077910927691778 > replies.jsonl`
The `--recursive` option also fetches replies to the replies. This can take a long time for a large thread because of rate limiting by the search API.
`twarc replies 824077910927691778 --recursive`
### Lists
To get the users that are on a list, you can use the list URL with the `listmembers` command.
`twarc listmembers https://twitter.com/edsu/lists/bots`
## Use as a Library
twarc can be used programmatically as a library to collect tweets. You first need to create a `Twarc` instance (using your Twitter credentials), and then use it to iterate through search results.
`from twarc import Twarc
t = Twarc(consumer_key, consumer_secret, access_token, access_token_secret)
for tweet in t.search("ferguson"):
print(tweet["text"])`
You can do the same for a filter stream of new tweets that match a track keyword.
`for tweet in t.filter(track="ferguson"):
print(tweet["text"])`
or location:
`for tweet in t.filter(locations="-74,40,-73,41"):
print(tweet["text"])`
or user ids:
`for tweet in t.filter(follow='12345,678910'):
print(tweet["text"])`
Tweet identifiers can also be hydrated:
`for tweet in t.hydrate(open('ids.txt')):
print(tweet["text"])`
## Utilities
In the utils directory there are some simple command line utilities for working with the line-oriented JSON, like printing out the archived tweets as text or html, extracting the usernames, referenced URLs, etc. If you create a script that you find handy please send a pull request.
When you've got some tweets you can create a rudimentary wall of them:
`% utils/wall.py tweets.jsonl > tweets.html`
You can create a word cloud of tweets you collected about nasa:
`% utils/wordcloud.py tweets.jsonl > wordcloud.html`
If you've collected some tweets using `replies` you can create a static D3 visualization of them with:
`% utils/network.py tweets.jsonl tweets.html`
Optionally you can consolidate tweets by user, allowing you to see central accounts:
`% utils/network.py --users tweets.jsonl tweets.html`
And if you want to use the network graph in a program like [Gephi](https://gephi.org/), you can generate a GEXF file with the following:
`% utils/network.py --users tweets.jsonl tweets.gexf`
gender.py is a filter which allows you to filter tweets based on a guess about the gender of the author. For example, you can filter out all the tweets that look like they were from women, and create a word cloud for them:
`% utils/gender.py --gender female tweets.jsonl | utils/wordcloud.py > tweets-female.html`
You can output [GeoJSON](http://geojson.org/) from tweets where geo coordinates are available:
`% utils/geojson.py tweets.jsonl > tweets.geojson`
Optionally you can export GeoJSON with centroids replacing bounding boxes:
`% utils/geojson.py tweets.jsonl --centroid > tweets.geojson`
And if you do export GeoJSON with centroids, you can add some random fuzzing:
`% utils/geojson.py tweets.jsonl --centroid --fuzz 0.01 > tweets.geojson`
To filter tweets by presence or absence of geo coordinates (or Place, see the [API documentation](https://developer.twitter.com/en/docs/basics/getting-started)):
`% utils/geofilter.py tweets.jsonl --yes-coordinates > tweets-with-geocoords.jsonl
% cat tweets.jsonl | utils/geofilter.py --no-place > tweets-with-no-place.jsonl`
To filter tweets by a GeoJSON fence (requires [Shapely](https://github.com/Toblerity/Shapely)):
`% utils/geofilter.py tweets.jsonl --fence limits.geojson > fenced-tweets.jsonl
% cat tweets.jsonl | utils/geofilter.py --fence limits.geojson > fenced-tweets.jsonl`
If you suspect you have duplicates in your tweets you can dedupe them:
`% utils/deduplicate.py tweets.jsonl > deduped.jsonl`
To sort by ID:
`% utils/sort_by_id.py tweets.jsonl > sorted.jsonl`
You can filter out all tweets before a certain date (for example, if a hashtag was used for another event before the one you're interested in):
`% utils/filter_date.py --mindate 1-may-2014 tweets.jsonl > filtered.jsonl`
You can get an HTML list of the clients used:
`% utils/source.py tweets.jsonl > sources.html`
If you want to remove the retweets:
`% utils/noretweets.py tweets.jsonl > tweets_noretweets.jsonl`
Or unshorten urls (requires [unshrtn](https://github.com/DocNow/unshrtn)):
`% cat tweets.jsonl | utils/unshrtn.py > unshortened.jsonl`
Once you unshorten your URLs you can get a ranked list of the most-tweeted URLs:
`% cat unshortened.jsonl | utils/urls.py | sort | uniq -c | sort -nr > urls.txt`
## twarc-report
Some further utility scripts to generate csv or json output suitable for use with [D3.js](https://d3js.org/) visualizations are found in the [twarc-report](https://github.com/pbinkley/twarc-report) project. The util `directed.py` is now `d3graph.py`.
Each script can also generate an html demo of a D3 visualization, e.g. [timelines](https://www.wallandbinkley.com/twarc/bill10/) or a [directed graph of retweets](https://www.wallandbinkley.com/twarc/bill10/directed-retweets.html).
---
Translation credit: [Tina Figueroa]
[japonés]: https://github.com/DocNow/twarc/blob/main/README_ja_jp.md
[Portugués]: https://github.com/DocNow/twarc/blob/main/README_pt_br.md
[Inglés]: https://github.com/DocNow/twarc/blob/main/README.md
[Sueco]: https://github.com/DocNow/twarc/blob/main/README_sv_se.md
[Swahili]: https://github.com/DocNow/twarc/blob/main/README_sw_ke.md
[Tina Figueroa]: https://github.com/@tinafigueroa
================================================
FILE: docs/twarc1_ja_jp.md
================================================
twarc1
=====
twarc is a command line tool and Python library for archiving Twitter JSON data.
- Each tweet is represented as a JSON object that is [exactly](https://dev.twitter.com/overview/api/tweets) what was returned from the Twitter API.
- Tweets are stored as [line-oriented JSON](https://en.wikipedia.org/wiki/JSON_Streaming#Line-delimited_JSON).
- twarc handles Twitter API [rate limits](https://dev.twitter.com/rest/public/rate-limiting) for you.
- In addition to collecting tweets, twarc can help you collect users and trends, and hydrate tweet ids.
twarc was developed as part of the [Documenting the Now](http://www.docnow.io) project, which was funded by the [Mellon Foundation](https://mellon.org/).
## Install
Before using twarc you will need to register an application at [Twitter Developers](http://apps.twitter.com).
Once registered, note down the consumer key and consumer secret.
Then click "Create my access token" to generate an access token and access token secret, and note those down too.
With these four values in hand you are ready to start using twarc.
1. Install [Python](http://python.org/download) (version 2 or 3)
2. [pip](https://pip.pypa.io/en/stable/installing/) install twarc
### Homebrew (macOS only)
`twarc` can be installed with:
```bash
$ brew install twarc
```
## Quickstart
First you need to tell twarc about your application API keys and grant access to one or more Twitter accounts.
twarc configure
Then try out a search:
twarc search blacklivesmatter > search.jsonl
Or maybe you'd like to collect tweets as they happen?
twarc filter blacklivesmatter > stream.jsonl
See below for details on these commands and more.
## Usage
### Configure
Once you've got your application keys you can tell twarc what they are with the `configure` command.
twarc configure
This will store your credentials in a file called `.twarc` in your home directory so you don't have to keep entering them.
If you would rather supply them directly you can set them in the environment (`CONSUMER_KEY`, `CONSUMER_SECRET`, `ACCESS_TOKEN`, `ACCESS_TOKEN_SECRET`) or use the command line options (`--consumer_key`, `--consumer_secret`, `--access_token`, `--access_token_secret`).
### Search
Search uses Twitter's [search/tweets](https://dev.twitter.com/rest/reference/get/search/tweets) API to download *pre-existing* tweets matching a given query.
twarc search blacklivesmatter > tweets.jsonl
It's important to note that the `search` command returns tweets found within the 7 day window that Twitter's search API imposes.
If that window seems too short (and it is), you may be interested in collecting tweets with the `filter` and `sample` commands below.
The best way to get familiar with Twitter's search syntax is to experiment with [Twitter's Advanced Search](https://twitter.com/search-advanced) and copy and paste the resulting query from the search box.
For example, here is a more complicated query that searches for tweets containing either the `#blacklivesmatter` or `#blm` hashtag that were sent to `@deray`.
twarc search '#blacklivesmatter OR #blm to:deray' > tweets.jsonl
You should also definitely check out [Igor Brigadir](https://github.com/igorbrigadir)'s *excellent* reference guide to the Twitter search syntax ([Advanced Search on Twitter](https://github.com/igorbrigadir/twitter-advanced-search/blob/master/README.md)).
There are lots of hidden gems in there that the advanced search form doesn't make readily apparent.
Twitter attempts to code the language of a tweet. You can limit your search to a particular language using an [ISO 639-1] code.
twarc search '#blacklivesmatter' --lang fr > tweets.jsonl
You can also search for tweets with a given location.
For example, tweets mentioning `blacklivesmatter` that are 1 mile from the center of Ferguson, Missouri:
twarc search blacklivesmatter --geocode 38.7442,-90.3054,1mi > tweets.jsonl
If a search query isn't supplied when using `--geocode` you will get all tweets relevant for that location and radius.
twarc search --geocode 38.7442,-90.3054,1mi > tweets.jsonl
### Filter | フィルター
`filter`コマンドは, 呟かれたツイートを収集するために, Twitterの[statuses/filter](https://dev.twitter.com/streaming/reference/post/statuses/filter) APIを使います.
twarc filter blacklivesmatter,blm > tweets.jsonl
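Note that the syntax for Twitter's track queries is slightly different from the syntax of queries in the search API.
Please consult the documentation on how best to express the filter option you are using.
Use the `follow` argument if you would like to collect tweets from a given user ID as they happen.
This includes retweets. For example, this collects tweets and retweets from `@CNN`.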
twarc filter --follow 759251 > tweets.jsonl
You can also collect tweets using a bounding box.
Note: the leading dash (`-`) needs to be escaped in the bounding box, or else it will be interpreted as a command line argument!
twarc filter --locations "\-74,40,-73,41" > tweets.jsonl
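You can use the `lang` command line argument to pass in an [ISO 639-1] language code that limits the results.
The argument is repeatable, since the filter stream lets you filter by one or more languages.
The following collects tweets mentioning paris or madrid that were written in French or Spanish.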
twarc filter paris,madrid --lang fr --lang es
If you combine filter options, they are ORed together.
For example, this collects tweets that use the `#blacklivesmatter` or `#blm` hashtags, and also tweets from the user `@CNN`.
twarc filter blacklivesmatter,blm --follow 759251 > tweets.jsonl
Combining locations and languages, however, effectively results in an AND.
For example, this collects tweets from the greater New York area that are in Spanish or French.
twarc filter --locations "\-74,40,-73,41" --lang es --lang fr
### Sample
Use the `sample` command to ask Twitter's [statuses/sample](https://dev.twitter.com/streaming/reference/get/statuses/sample) API for a "random" sample of recent public statuses.
twarc sample > tweets.jsonl
### Dehydrate
The `dehydrate` command generates a list of tweet IDs from a JSONL file of tweets.
twarc dehydrate tweets.jsonl > tweet-ids.txt
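Conceptually, dehydrating reduces each JSON line to its identifier. The following is only a sketch of that idea, not twarc's implementation. It assumes each line is a tweet object carrying an `id_str` field, and the function name `tweet_ids` is made up for the example.

```python
import json


def tweet_ids(lines):
    """Yield the id of every tweet in an iterable of JSON lines, skipping blank lines."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)["id_str"]


# Two in-memory tweets stand in for the lines of tweets.jsonl.
sample = [
    '{"id_str": "824077910927691778", "text": "first"}',
    "",
    '{"id_str": "20", "text": "second"}',
]
print("\n".join(tweet_ids(sample)))
```

Passing an open file object instead of `sample` would process a real JSONL file line by line without loading it all into memory.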
### Hydrate
twarc's `hydrate` command reads a file of tweet identifiers and writes out the tweet JSON for them using Twitter's [statuses/lookup](https://dev.twitter.com/rest/reference/get/statuses/lookup) API.
twarc hydrate ids.txt > tweets.jsonl
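Twitter API's [Terms of Service](https://dev.twitter.com/overview/terms/policy#6._Be_a_Good_Partner_to_Twitter) discourage people from making large amounts of raw Twitter data available on the Web.
- The data can be used for research and archived for local use, but not shared with the world.
- Twitter does allow files of tweet identifiers to be shared, which is useful when you would like to make a dataset of tweets available.
- Others can then use Twitter's API to *hydrate* the identifiers, that is, to retrieve the full JSON for each one.
- `hydrate` is particularly important for the [verification](https://ja.wikipedia.org/wiki/再現性) of social media research.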
### Users
The `users` command returns metadata for the users with the given screen names.
twarc users deray,Nettaaaaaaaa > users.jsonl
You can also give it user ids.
twarc users 1232134,1413213 > users.jsonl
If you want, you can also use a file of user ids, which can be useful if you are using the `followers` and `friends` commands.
twarc users ids.txt > users.jsonl
### Followers
The `followers` command uses Twitter's [follower id API](https://dev.twitter.com/rest/reference/get/followers/ids) to collect the user IDs of the followers of the user given as an argument, taking exactly one screen name per request.
twarc followers deray > follower_ids.txt
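The result contains exactly one user ID per line, in reverse chronological order, i.e. the most recent followers come first.
### Friends
Like the `followers` command, the `friends` command uses Twitter's [friend id API](https://dev.twitter.com/rest/reference/get/friends/ids) to collect the friend (following) user IDs of the user given as an argument, taking exactly one screen name per request.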
twarc friends deray > friend_ids.txt
### Trends
The `trends` command takes a [Where On Earth](https://web.archive.org/web/20180102203025/https://developer.yahoo.com/geo/geoplanet/) identifier (`WOE ID`) indicating the region whose trends you are interested in.
For example, here is how you can get the current trends for St. Louis.
twarc trends 2486982
Using a `WOE ID` of `1` returns trends for the entire planet.
twarc trends 1
If you are not sure what to use as a `WOE ID`, omit it and you will get a list of all the places for which Twitter tracks trends.
twarc trends
If you have a geolocation, you can use it instead of the `WOE ID`.
twarc trends 39.9062,-79.4679
Behind the scenes, twarc looks up the location using Twitter's [trends/closest](https://dev.twitter.com/rest/reference/get/trends/closest) API to find the nearest `WOE ID`.
### Timeline
The `timeline` command uses Twitter's [user timeline API](https://dev.twitter.com/rest/reference/get/statuses/user_timeline) to collect the most recent tweets posted by the user indicated by a screen name.
twarc timeline deray > tweets.jsonl
You can also look up users by user ID.
twarc timeline 12345 > tweets.jsonl
### Retweets
You can get the retweets of a given tweet ID like so.
twarc retweets 824077910927691778 > retweets.jsonl
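### Replies
Unfortunately, Twitter's API does not currently support getting replies to a tweet.
Instead, twarc approximates it by using the search API.
Twitter's search API does not support getting tweets older than a week.
As a result, twarc can only get the replies to a tweet that were sent in the last week.
If you want to get the replies to a given tweet, you can do the following.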
twarc replies 824077910927691778 > replies.jsonl
Using the `--recursive` option also fetches replies to the replies, as well as quotes.
This can take a long time to complete for a large thread because of rate limiting by the search API.
twarc replies 824077910927691778 --recursive
### Lists
To get the users that are on a list, use the list URL with the `listmembers` command.
twarc listmembers https://twitter.com/edsu/lists/bots
## Premium Search API
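Twitter has introduced a Premium Search API that lets you pay Twitter money for tweets.
Once you have set up an environment in your [dashboard](https://developer.twitter.com/en/dashboard), you can use the 30-day and full-archive endpoints to search for tweets outside the 7-day window provided by the Standard Search API. To use the premium API from the command line, you need to indicate which endpoint you are using and the environment.
To avoid using up your entire budget, you will likely want to limit the time range using `--to_date` and `--from_date`. Additionally, you can limit the maximum number of tweets returned using `--limit`.
For example, to get all the blacklivesmatter tweets from two weeks ago (assuming today is June 1, 2020) using an environment named *docnowdev*, without retrieving more than 1000 tweets, you could run: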
twarc search blacklivesmatter \
--30day docnowdev \
--from_date 2020-05-01 \
--to_date 2020-05-14 \
--limit 1000 \
> tweets.jsonl
Similarly, to find tweets from 2014 using the full archive, you can run:
twarc search blacklivesmatter \
--fullarchive docnowdev \
--from_date 2014-08-04 \
--to_date 2014-08-05 \
--limit 1000 \
> tweets.jsonl
If your environment is sandboxed, you need to use `--sandbox` so that twarc knows not to request more than 100 tweets at a time (the default for non-sandboxed environments is 500).
twarc search blacklivesmatter \
--fullarchive docnowdev \
--from_date 2014-08-04 \
--to_date 2014-08-05 \
--limit 1000 \
--sandbox \
> tweets.jsonl
## Use as a Library
If you want, you can use twarc programmatically as a library to collect tweets.
You first need to create a `Twarc` instance (using your Twitter credentials), and then use it to iterate through search results, filter results, or lookup results.
```python
from twarc import Twarc

# consumer_key, consumer_secret, access_token and access_token_secret
# stand for your own Twitter credentials.
t = Twarc(consumer_key, consumer_secret, access_token, access_token_secret)
for tweet in t.search("ferguson"):
    print(tweet["text"])
```
You can do the same for a filter stream of new tweets that match a track keyword.
```python
for tweet in t.filter(track="ferguson"):
    print(tweet["text"])
```
Or a location:
```python
for tweet in t.filter(locations="-74,40,-73,41"):
    print(tweet["text"])
```
Or user ids:
```python
for tweet in t.filter(follow='12345,678910'):
    print(tweet["text"])
```
Similarly, you can hydrate tweet identifiers by passing in a list of ids or a generator.
```python
for tweet in t.hydrate(open('ids.txt')):
    print(tweet["text"])
```
## User vs App Auth
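twarc manages rate limiting by Twitter for you, but those rate limits vary depending on how you authenticate. The two options are User Auth and App Auth. twarc uses User Auth by default, but you can tell it to use App Auth instead.
Switching to App Auth can be handy in some situations, such as searching for tweets: User Auth can only issue 180 requests every 15 minutes (1.6 million tweets per day), whereas App Auth can issue 450 (4.3 million tweets per day).
Be careful, though: the `statuses/lookup` endpoint used by the `hydrate` subcommand has a rate limit of 900 requests per 15 minutes for User Auth and 300 requests per 15 minutes for App Auth.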
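The per-day figures follow from the per-window limits. A rough sketch of that arithmetic, assuming each request returns at most 100 tweets (an assumption about page size, not something stated above), gives upper bounds of the same order as the numbers quoted:

```python
WINDOW_MINUTES = 15
WINDOWS_PER_DAY = 24 * 60 // WINDOW_MINUTES  # number of rate-limit windows in a day


def daily_ceiling(requests_per_window, items_per_request=100):
    """Upper bound on tweets per day; items_per_request=100 is an assumed page size."""
    return requests_per_window * WINDOWS_PER_DAY * items_per_request


for label, limit in [
    ("search, user auth", 180),
    ("search, app auth", 450),
    ("lookup, user auth", 900),
    ("lookup, app auth", 300),
]:
    print(f"{label}: at most {daily_ceiling(limit):,} tweets/day")
```

The comparison shows why the better choice depends on the command: App Auth raises the ceiling for `search` but lowers it for `hydrate`.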
If you know what you are doing and want to force App Auth, you can use the `--app_auth` command line option:
twarc --app_auth search ferguson > tweets.jsonl
Similarly, if you are using twarc as a library, you can do the following.
```python
from twarc import Twarc

t = Twarc(app_auth=True)
for tweet in t.search('ferguson'):
    print(tweet['id_str'])
```
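## Utilities
In the `utils` directory there are some simple command line utilities for working with line-oriented JSON.
For example, they print archived tweets as text or HTML, or extract usernames, referenced URLs, and so on.
If you create a script that is handy, please send a pull request.
When you have some tweets, you can create a rudimentary wall of them.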
utils/wall.py tweets.jsonl > tweets.html
You can create a word cloud of tweets you collected about `NASA`.
utils/wordcloud.py tweets.jsonl > wordcloud.html
If you have collected some tweets using the `replies` command, you can create a static `D3.js` visualization of them.
utils/network.py tweets.jsonl tweets.html
Optionally, you can consolidate tweets by user, which lets you see the central accounts.
utils/network.py --users tweets.jsonl tweets.html
If you want to use the network graph in a program like [Gephi](https://gephi.org/), you can generate a GEXF file as follows.
utils/network.py --users tweets.jsonl tweets.gexf
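`gender.py` is a filter that lets you filter tweets based on a guess about the gender of the author.
For example, you can keep only the tweets that look like they were written by women, and create a word cloud of them.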
utils/gender.py --gender female tweets.jsonl | utils/wordcloud.py > tweets-female.html
You can output [GeoJSON](http://geojson.org/) from tweets where geo coordinates are available.
utils/geojson.py tweets.jsonl > tweets.geojson
Optionally, you can export GeoJSON with centroids replacing the bounding boxes.
utils/geojson.py tweets.jsonl --centroid > tweets.geojson
When exporting GeoJSON with centroids, you can also add some random fuzzing.
utils/geojson.py tweets.jsonl --centroid --fuzz 0.01 > tweets.geojson
To filter tweets by the presence or absence of geo coordinates (or a place, see the [API documentation](https://dev.twitter.com/overview/api/places)):
utils/geofilter.py tweets.jsonl --yes-coordinates > tweets-with-geocoords.jsonl
cat tweets.jsonl | utils/geofilter.py --no-place > tweets-with-no-place.jsonl
To filter tweets by a GeoJSON fence (requires [Shapely](https://github.com/Toblerity/Shapely)):
utils/geofilter.py tweets.jsonl --fence limits.geojson > fenced-tweets.jsonl
cat tweets.jsonl | utils/geofilter.py --fence limits.geojson > fenced-tweets.jsonl
If you suspect you have duplicates in your tweets, you can dedupe them.
utils/deduplicate.py tweets.jsonl > deduped.jsonl
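The idea behind deduplication can be sketched in a few lines: remember every tweet id seen so far and emit a line only the first time its id appears. This is an illustration under the assumption that each line has an `id_str` field, not a copy of `utils/deduplicate.py`.

```python
import json


def deduplicate(lines):
    """Yield each JSON line whose tweet id has not been seen before, preserving order."""
    seen = set()
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        tweet_id = json.loads(line)["id_str"]
        if tweet_id not in seen:
            seen.add(tweet_id)
            yield line


lines = [
    '{"id_str": "1", "text": "a"}',
    '{"id_str": "2", "text": "b"}',
    '{"id_str": "1", "text": "a"}',
]
for line in deduplicate(lines):
    print(line)
```

Keeping only ids in memory, rather than whole tweets, keeps the footprint small even for large files.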
You can sort by ID, which is analogous to sorting by time.
utils/sort_by_id.py tweets.jsonl > sorted.jsonl
You can filter out all tweets before a certain date.
This is useful, for example, if a hashtag was used for another event before the one you are interested in.
utils/filter_date.py --mindate 1-may-2014 tweets.jsonl > filtered.jsonl
You can get an HTML list of the clients used.
utils/source.py tweets.jsonl > sources.html
If you want to remove the retweets:
utils/noretweets.py tweets.jsonl > tweets_noretweets.jsonl
Or unshorten URLs (requires [unshrtn](https://github.com/docnow/unshrtn)):
cat tweets.jsonl | utils/unshrtn.py > unshortened.jsonl
Once you have unshortened your URLs, you can get a ranked list of the most-tweeted URLs.
cat unshortened.jsonl | utils/urls.py | sort | uniq -c | sort -nr > urls.txt
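The `sort | uniq -c | sort -nr` part of that pipeline counts identical lines and lists the most frequent first. For readers who prefer to stay in Python, a rough equivalent is sketched below. The function name `rank_urls` is hypothetical, and the input is assumed to be one URL per line, as in the pipeline above.

```python
from collections import Counter


def rank_urls(urls):
    """Return (count, url) pairs, most frequent first."""
    counts = Counter(u.strip() for u in urls if u.strip())
    return [(n, url) for url, n in counts.most_common()]


urls = [
    "https://example.com/a\n",
    "https://example.com/b\n",
    "https://example.com/a\n",
]
for count, url in rank_urls(urls):
    print(count, url)
```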
## twarc-report
The [twarc-report](https://github.com/pbinkley/twarc-report) project provides utility scripts that generate CSV or JSON output suitable for use with [D3.js](http://d3js.org/) visualizations.
The util `directed.py`, formerly part of twarc, has moved to the twarc-report project as `d3graph.py`.
Each script can also generate an HTML demo of a visualization.
Examples include:
- a [timeline](https://www.wallandbinkley.com/twarc/bill10/)
- a [directed graph of retweets](https://www.wallandbinkley.com/twarc/bill10/directed-retweets.html)
---
Translation credits: [Haruna]
[英語]: https://github.com/DocNow/twarc/blob/main/README.md
[ポルトガル語]: https://github.com/DocNow/twarc/blob/main/README_pt_br.md
[スペイン語]: https://github.com/DocNow/twarc/blob/main/README_es_mx.md
[スウェーデン語]: https://github.com/DocNow/twarc/blob/main/README_sv_se.md
[スワヒリ語]: https://github.com/DocNow/twarc/blob/main/README_sw_ke.md
[ISO 639-1]: https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes
[Haruna]: https://github.com/eggplants
================================================
FILE: docs/twarc1_pt_br.md
================================================
twarc1
=====
twarc is a command line tool and Python library for archiving Twitter JSON data.
Each tweet is represented as a JSON object that is
[exactly](https://dev.twitter.com/overview/api/tweets) what was returned from the
Twitter API. Tweets are stored as [line-oriented JSON](https://en.wikipedia.org/wiki/JSON_Streaming#Line-delimited_JSON). twarc handles Twitter API [rate limits](https://dev.twitter.com/rest/public/rate-limiting)
for you. In addition to letting you collect tweets, twarc can also help you
collect users and trends, and hydrate tweet ids.
twarc was developed as part of the [Documenting the Now](http://www.docnow.io)
project, which was funded by the [Mellon Foundation](https://mellon.org/).
## Install
Before using twarc you need to register an application at
[apps.twitter.com](http://apps.twitter.com). Once you have created your application, note down the consumer key and consumer secret, then click to generate an access token and access token secret. With these four values in hand you are ready to start using twarc.
Note: if you are unsure how to create the application, see [how to create an app](http://blog.difluir.com/2013/06/como-criar-uma-app-no-twitter/).
1. Install [Python](http://python.org/download) (2 or 3)
2. pip install twarc
### Homebrew (macOS only)
macOS users can install `twarc` via:
```bash
$ brew install twarc
```
## Quickstart:
First you need to configure twarc by telling it what your API keys are:
twarc configure
Then try out a quick search:
twarc search blacklivesmatter > search.jsonl
Or maybe you'd like to collect tweets as they happen?
twarc filter blacklivesmatter > stream.jsonl
See below for details on these commands and more.
## Usage
### Configure
Once you have your application keys, you can tell twarc what they are with the
`configure` command.
twarc configure
This stores your credentials in a file called `.twarc` in your
home directory. That file is used by default in later invocations.
If you prefer, you can supply the keys directly in the environment (`CONSUMER_KEY`,
`CONSUMER_SECRET`, `ACCESS_TOKEN`, `ACCESS_TOKEN_SECRET`) or on the command line
with the options (`--consumer_key`, `--consumer_secret`, `--access_token`,
`--access_token_secret`).
### Search
This uses Twitter's [search/tweets](https://dev.twitter.com/rest/reference/get/search/tweets) API to download *pre-existing* tweets matching a given query.
twarc search blacklivesmatter > tweets.jsonl
It is important to note that `search` returns tweets found within the
7-day window imposed by the Twitter search API. If that seems like a small
window, it is, but you may be interested in collecting tweets as they happen
using the `filter` and `sample` commands below.
The best way to get familiar with Twitter's search syntax is to experiment with
[Twitter's Advanced Search](https://twitter.com/search-advanced) and copy and
paste the resulting query from the search box. For example, here is a more
complicated query that searches for tweets containing either the
\#blacklivesmatter or #blm hashtags that were sent to deray.
twarc search '#blacklivesmatter OR #blm to:deray' > tweets.jsonl
You should also definitely check out Igor Brigadir's *excellent* reference guide
to the Twitter search syntax:
[Advanced Search on Twitter](https://github.com/igorbrigadir/twitter-advanced-search/blob/master/README.md).
It contains lots of hidden gems that are not readily apparent in the
advanced search form.
Twitter attempts to code the language of a tweet, and you can limit your
search to a particular language if you want, using an [ISO 639-1] code:
twarc search '#forabolsonaro' --lang pt > tweets.jsonl
You can also search for tweets from a given location, for example tweets
mentioning *foratemer* from within 1 mile of a point in the Brasília region:
twarc search foratemer --geocode -16.050561,-47.814708,1mi > tweets.jsonl
If a search query is not provided when using `--geocode`, you will get all tweets
relevant to that location and radius:
twarc search --geocode -16.050561,-47.814708,1mi > tweets.jsonl
### Filter
The `filter` command uses Twitter's [statuses/filter](https://dev.twitter.com/streaming/reference/post/statuses/filter) API to collect tweets as they happen.
twarc filter foratemer,blm > tweets.jsonl
Note that the syntax for Twitter's track queries is slightly
different from that of queries in its search API. Please consult the
documentation on how best to express the filter option you want.
Use the `follow` command line argument if you want to collect tweets from
a given user ID as they happen. This includes retweets. For example, this
collects tweets and retweets from CNN:
twarc filter --follow 759251 > tweets.jsonl
You can also collect tweets using a bounding box.
Note: the leading dash needs to be escaped in the bounding box, or else
it will be interpreted as a command line argument!
For example, escape it with a backslash right after the opening quote:
twarc filter --locations "\-74,40,-73,41" > tweets.jsonl
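To make the shape of the bounding box value concrete, here is a small sketch that parses it and checks whether a point falls inside. It assumes the order is west, south, east, north (longitude before latitude), as in the New York example above. The helper names are hypothetical and the code is not taken from twarc.

```python
def parse_bounding_box(value):
    """Parse 'west,south,east,north' (longitude first) into four floats.

    A leading backslash, used on the command line to escape the minus sign, is dropped.
    """
    west, south, east, north = (float(part) for part in value.lstrip("\\").split(","))
    if not (-180 <= west <= 180 and -180 <= east <= 180):
        raise ValueError("longitude out of range")
    if not (-90 <= south <= north <= 90):
        raise ValueError("latitude out of range, or south is greater than north")
    return west, south, east, north


def contains(box, lon, lat):
    """True when the point lies inside the box (boxes crossing the date line are not handled)."""
    west, south, east, north = box
    return west <= lon <= east and south <= lat <= north


box = parse_bounding_box("\\-74,40,-73,41")
print(box, contains(box, -73.97, 40.78))
```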
If you combine options, they are ORed together.
For example, this collects tweets that use the blacklivesmatter
OR blm hashtags, and also tweets from the user CNN:
twarc filter blacklivesmatter,blm --follow 759251 > tweets.jsonl
But combining locations and languages effectively results in an AND. For
example, this collects tweets from the greater New York area that are in
Spanish or French:
twarc filter --locations "\-74,40,-73,41" --lang es --lang fr
### Sample
Use the `sample` command to listen to Twitter's [statuses/sample](https://dev.twitter.com/streaming/reference/get/statuses/sample) API for a "random" sample of recent public tweets. The statuses are delivered to the user that twarc is authenticated as.
twarc sample > tweets.jsonl
### Dehydrate
The `dehydrate` command generates a list of ids from a file of tweets:
twarc dehydrate tweets.jsonl > tweet-ids.txt
### Hydrate
twarc's `hydrate` command reads a file of tweet IDs and writes out the tweet JSON for them using Twitter's [statuses/lookup](https://dev.twitter.com/rest/reference/get/statuses/lookup) API.
twarc hydrate ids.txt > tweets.jsonl
Twitter API's [Terms of Service](https://dev.twitter.com/overview/terms/policy#6._Be_a_Good_Partner_to_Twitter) discourage people from making large amounts of raw Twitter data available on the Web. The data can be used for research and archived for local use, but not shared with the world. Twitter does allow files of tweet identifiers to be shared, which can be useful when you want to make a dataset of tweets available. You can then use Twitter's API to *hydrate* the data, that is, to retrieve the full JSON for each identifier. This is particularly important for [verification](https://en.wikipedia.org/wiki/Reproducibility) of social media research.
### Users
The `users` command returns user metadata for the given screen names, for example:
twarc users deray,Nettaaaaaaaa > users.jsonl
You can also use user ids:
twarc users 1232134,1413213 > users.jsonl
If you want, you can also use a file of user ids, which
can be useful if you are using the `followers` and `friends` commands
below:
twarc users ids.txt > users.jsonl
### Followers (who follows me)
The `followers` command uses Twitter's [follower id API](https://dev.twitter.com/rest/reference/get/followers/ids) to collect the ids of the users who follow exactly one screen name per request, given as an argument:
twarc followers deray > follower_ids.txt
The result includes exactly one user ID per line.
The response order is reverse chronological, with the most recent followers first.
### Friends (who I follow)
Like the `followers` command, the `friends` command uses Twitter's [friend id API](https://dev.twitter.com/rest/reference/get/friends/ids) to collect the friend user IDs for exactly one screen name per request, given as an argument:
twarc friends deray > friend_ids.txt
### Trends
The `trends` command lets you retrieve information from Twitter's API about trending hashtags. You need to supply a [Where On Earth](http://developer.yahoo.com/geo/geoplanet/) identifier (`woeid`) to indicate which trends you are interested in. For example, here is how you can get the current trends for St. Louis:
twarc trends 2486982
Using a `woeid` of 1 returns trends for the entire planet:
twarc trends 1
If you are not sure what to use as a `woeid`, just omit it and you will get a list
of all the places for which Twitter tracks trends:
twarc trends
If you already have a geo-location, you can use it instead of the `woeid`.
twarc trends 39.9062,-79.4679
Behind the scenes, twarc looks up the location using Twitter's [trends/closest](https://dev.twitter.com/rest/reference/get/trends/closest) API to find the nearest `woeid`.
### Timeline
The `timeline` command uses Twitter's [user timeline API](https://dev.twitter.com/rest/reference/get/statuses/user_timeline) to collect the most recent tweets posted by the user indicated by a screen_name.
twarc timeline deray > tweets.jsonl
You can also look up users using a user id:
twarc timeline 12345 > tweets.jsonl
### Retweets
You can get the retweets for a given tweet id like so:
twarc retweets 824077910927691778 > retweets.jsonl
If you have a file of tweet ids that you would like to fetch the retweets for, you can:
twarc retweets ids.txt > retweets.jsonl
### Replies
Unfortunately, Twitter's API does not currently support getting replies
to a tweet. So twarc approximates it by using the search API. Since
the search API does not support getting tweets older than a week,
twarc can only get the replies to a tweet that were sent in the
last week.
If you want to get the replies to a given tweet, you can:
twarc replies 824077910927691778 > replies.jsonl
Using the `--recursive` option also fetches replies to the replies, as
well as quotes. This can take a long time to complete for a large
thread because of rate limiting by the search API.
twarc replies 824077910927691778 --recursive
### Lists
To get the users that are on a list, you can use the list URL with the
`listmembers` command:
twarc listmembers https://twitter.com/edsu/lists/bots
## Premium Search API
Twitter has introduced a Premium Search API that lets you pay Twitter
money for tweets. Once you have set up an environment in your
[dashboard](https://developer.twitter.com/en/dashboard) you can use the 30-day
and fullarchive endpoints to search for tweets outside the 7-day window provided
by the Standard Search API. To use the premium API from the command line, you
need to indicate which endpoint you are using and the environment.
To avoid using up your entire budget, you will likely want to limit the
time range using `--to_date` and `--from_date`. Additionally, you can
limit the maximum number of tweets returned using `--limit`.
For example, if I wanted to get all the blacklivesmatter tweets from two
weeks ago (assuming today is June 1, 2020) using my environment
named *docnowdev*, but not retrieving more than 1000 tweets, I could:
twarc search blacklivesmatter \
--30day docnowdev \
--from_date 2020-05-01 \
--to_date 2020-05-14 \
--limit 1000 \
> tweets.jsonl
Similarly, to find tweets from 2014 using the full archive, you
can:
twarc search blacklivesmatter \
--fullarchive docnowdev \
--from_date 2014-08-04 \
--to_date 2014-08-05 \
--limit 1000 \
> tweets.jsonl
If your environment is sandboxed, you need to use `--sandbox` so that
twarc knows not to request more than 100 tweets at a time (the default for
non-sandboxed environments is 500):
twarc search blacklivesmatter \
--fullarchive docnowdev \
--from_date 2014-08-04 \
--to_date 2014-08-05 \
--limit 1000 \
--sandbox \
> tweets.jsonl
## Use as a Library
If you want, you can use `twarc` programmatically as a library
to collect tweets. You first need to create a `Twarc` instance
(using your Twitter credentials), and then use it to iterate
through search results, filter results, or lookup results.
```python
from twarc import Twarc

# consumer_key, consumer_secret, access_token and access_token_secret
# stand for your own Twitter credentials.
t = Twarc(consumer_key, consumer_secret, access_token, access_token_secret)
for tweet in t.search("ferguson"):
    print(tweet["text"])
```
You can do the same for a filter stream of new tweets that
match a track keyword:
```python
for tweet in t.filter(track="ferguson"):
    print(tweet["text"])
```
or a location:
```python
for tweet in t.filter(locations="-74,40,-73,41"):
    print(tweet["text"])
```
or user ids:
```python
for tweet in t.filter(follow='12345,678910'):
    print(tweet["text"])
```
Similarly, you can hydrate tweet identifiers by passing
in a list of ids or a generator:
```python
for tweet in t.hydrate(open('ids.txt')):
    print(tweet["text"])
```
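Because `hydrate` accepts any iterable of ids, a large id file can be consumed lazily. If you ever need to group ids yourself, for instance to report progress per batch, a helper along these lines may be useful. The batch size of 100 is an assumption based on the per-request limit of the statuses/lookup endpoint, and `batches` is a hypothetical name; twarc's own batching may differ.

```python
from itertools import islice


def batches(ids, size=100):
    """Yield lists of up to `size` non-empty, stripped ids from any iterable."""
    iterator = (i.strip() for i in ids if i.strip())
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Lines as they would come from an ids.txt file, including a blank line.
lines = ["1\n", "2\n", "\n", "3\n"]
for batch in batches(lines, size=2):
    print(batch)
```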
## User vs App Auth
twarc manages rate limiting by Twitter for you. However, you should
know that the rate limits vary depending on how you
authenticate. The two options are User Auth and App Auth. twarc defaults to
User Auth, but you can tell it to use App Auth instead.
Switching to App Auth can be handy in some situations, such as when you are
searching for tweets, since User Auth can only issue 180 requests every
15 minutes (1.6 million tweets per day), but App Auth can issue 450
(4.3 million tweets per day).
But be careful: the `statuses/lookup` endpoint used by the `hydrate`
subcommand has a rate limit of 900 requests per 15 minutes for
User Auth and 300 requests per 15 minutes for App Auth.
If you know what you are doing and want to force App Auth, you can use the
`--app_auth` command line option:
twarc --app_auth search ferguson > tweets.jsonl
Similarly, if you are using twarc as a library, you can:
```python
from twarc import Twarc

t = Twarc(app_auth=True)
for tweet in t.search('ferguson'):
    print(tweet['id_str'])
```
## Utilities
In the `utils` directory there are some simple command line utilities for
working with line-oriented JSON, such as:
- Printing the archived tweets as text or HTML.
- Extracting usernames.
- Extracting referenced URLs.
- Etc.
If you create a script that you find handy, please send a pull request to the project on GitHub.
When you have some tweets, you can create a rudimentary wall of them:
utils/wall.py tweets.jsonl > tweets.html
You can create a word cloud of tweets you collected about NASA:
utils/wordcloud.py tweets.jsonl > wordcloud.html
If you have collected some tweets using `replies`, you can create a
static D3 visualization of them with:
utils/network.py tweets.jsonl tweets.html
Optionally, you can consolidate tweets by user, which lets you
see the central accounts:
utils/network.py --users tweets.jsonl tweets.html
Additionally, you can create a network of hashtags, which lets you
see how they occur together:
utils/network.py --hashtags tweets.jsonl tweets.html
And if you want to use the network graph in a program like
[Gephi](https://gephi.org/), you can generate a GEXF file with the following:
utils/network.py --users tweets.jsonl tweets.gexf
utils/network.py --hashtags tweets.jsonl tweets.gexf
`gender.py` is a filter that lets you filter tweets based on a guess about
the gender of the author. So, for example, you can keep only the tweets that
look like they were written by women, and create a word cloud of them:
utils/gender.py --gender female tweets.jsonl | utils/wordcloud.py > tweets-female.html
You can output [GeoJSON](http://geojson.org/) from tweets where geo coordinates are available:
utils/geojson.py tweets.jsonl > tweets.geojson
Optionally, you can export GeoJSON with centroids replacing the bounding boxes:
utils/geojson.py tweets.jsonl --centroid > tweets.geojson
And if you export GeoJSON with centroids, you can add some random fuzzing:
utils/geojson.py tweets.jsonl --centroid --fuzz 0.01 > tweets.geojson
To filter tweets by the presence or absence of geo coordinates (or a place, see the [places API documentation](https://dev.twitter.com/overview/api/places)):
utils/geofilter.py tweets.jsonl --yes-coordinates > tweets-with-geocoords.jsonl
cat tweets.jsonl | utils/geofilter.py --no-place > tweets-with-no-place.jsonl
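The two tests being applied can be pictured as simple predicates on the tweet object. The sketch below assumes the tweet JSON has `coordinates` and `place` fields that are `null` when absent. It illustrates the idea only, and the function names are made up rather than taken from `utils/geofilter.py`.

```python
def has_coordinates(tweet):
    """True when the tweet carries an exact point in its `coordinates` field."""
    return tweet.get("coordinates") is not None


def has_place(tweet):
    """True when the tweet is tagged with a `place`."""
    return tweet.get("place") is not None


tweets = [
    {"id_str": "1", "coordinates": {"type": "Point", "coordinates": [-90.3054, 38.7442]}, "place": None},
    {"id_str": "2", "coordinates": None, "place": {"full_name": "Ferguson, MO"}},
    {"id_str": "3", "coordinates": None, "place": None},
]
print([t["id_str"] for t in tweets if has_coordinates(t)])
print([t["id_str"] for t in tweets if not has_place(t)])
```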
To filter tweets by a GeoJSON fence (requires [Shapely](https://github.com/Toblerity/Shapely)):
utils/geofilter.py tweets.jsonl --fence limits.geojson > fenced-tweets.jsonl
cat tweets.jsonl | utils/geofilter.py --fence limits.geojson > fenced-tweets.jsonl
If you suspect you have duplicates in your tweets, you can remove them:
utils/deduplicate.py tweets.jsonl > deduped.jsonl
You can sort by ID, which is analogous to sorting by time:
utils/sort_by_id.py tweets.jsonl > sorted.jsonl
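Sorting by ID approximates sorting by time because tweet ids issued since late 2010 are "snowflake" ids whose upper bits encode a millisecond timestamp. The sketch below decodes that timestamp. The epoch constant and the 22-bit shift describe the snowflake layout as generally documented, and `snowflake_time` is a hypothetical helper, not part of twarc.

```python
from datetime import datetime, timezone

# Start of the snowflake epoch (4 Nov 2010), in Unix milliseconds.
TWITTER_EPOCH_MS = 1288834974657


def snowflake_time(tweet_id):
    """Return the UTC creation time encoded in a snowflake tweet id."""
    millis = (int(tweet_id) >> 22) + TWITTER_EPOCH_MS
    return datetime.fromtimestamp(millis / 1000, tz=timezone.utc)


ids = ["824077910927691778", "824077910927691000"]
# Compare ids as integers: string comparison breaks when lengths differ.
print(sorted(ids, key=int))
print(snowflake_time(ids[0]).isoformat())
```

Older, pre-snowflake ids are sequential rather than time-encoded, so decoding them this way gives meaningless dates even though their numeric order still follows creation order.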
You can filter out all tweets before a certain date (for example, if a hashtag was used for another event before the one you are interested in):
utils/filter_date.py --mindate 1-may-2014 tweets.jsonl > filtered.jsonl
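A date filter of this kind boils down to parsing each tweet's `created_at` value and comparing it with the cutoff. The sketch below assumes the timestamp format used in standard tweet JSON. Whether the real script treats the boundary as inclusive is not something this example claims, and the function name `on_or_after` is hypothetical.

```python
from datetime import datetime, timezone

# Format of created_at in standard tweet JSON, e.g. "Wed Aug 27 13:08:45 +0000 2008".
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def on_or_after(tweet, min_date):
    """True when the tweet's created_at is at or after min_date (a timezone-aware datetime)."""
    return datetime.strptime(tweet["created_at"], CREATED_AT_FORMAT) >= min_date


min_date = datetime(2014, 5, 1, tzinfo=timezone.utc)
tweets = [
    {"id_str": "1", "created_at": "Wed Aug 27 13:08:45 +0000 2008"},
    {"id_str": "2", "created_at": "Sat Aug 09 17:00:00 +0000 2014"},
]
print([t["id_str"] for t in tweets if on_or_after(t, min_date)])
```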
You can get an HTML list of the clients used:
utils/source.py tweets.jsonl > sources.html
If you want to remove the retweets:
utils/noretweets.py tweets.jsonl > tweets_noretweets.jsonl
Or unshorten URLs (requires [unshrtn](https://github.com/edsu/unshrtn)):
cat tweets.jsonl | utils/unshrtn.py > unshortened.jsonl
Once you have unshortened your URLs, you can get a ranked list of the most-tweeted URLs:
cat unshortened.jsonl | utils/urls.py | sort | uniq -c | sort -nr > urls.txt
## twarc-report
Some further utility scripts for generating CSV or JSON output suitable for
use with [D3.js](http://d3js.org/) visualizations can be found in the
[twarc-report](https://github.com/pbinkley/twarc-report) project. The
util `directed.py`, formerly part of twarc, has moved to twarc-report as
`d3graph.py`.
Each script can also generate an HTML demo of a D3 visualization, e.g.
[timelines](https://wallandbinkley.com/twarc/bill10/) or a
[directed graph of retweets](https://wallandbinkley.com/twarc/bill10/directed-retweets.html).
---
Translation credits: [Wilson Jr]
[Espanhol]: https://github.com/DocNow/twarc/blob/main/README_es_mx.md
[Inglês]: https://github.com/DocNow/twarc/blob/main/README.md
[Japonês]: https://github.com/DocNow/twarc/blob/main/README_ja_jp.md
[Sueco]: https://github.com/DocNow/twarc/blob/main/README_sv_se.md
[Suaíli]: https://github.com/DocNow/twarc/blob/main/README_sw_ke.md
[ISO 639-1]: https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes
[Wilson Jr]: https://github.com/py3in
================================================
FILE: docs/twarc1_sv_se.md
================================================
twarc1
=====
twarc is a command line tool and Python library for archiving Twitter JSON data.
Each tweet is represented as a JSON object that is [exactly](https://dev.twitter.com/overview/api/tweets) what was returned from the Twitter API.
Tweets are stored as [line-oriented JSON](https://en.wikipedia.org/wiki/JSON_Streaming#Line-delimited_JSON). twarc handles
Twitter API [rate limits](https://dev.twitter.com/rest/public/rate-limiting)
for you. In addition to letting you collect tweets, twarc can also help you collect users and trends, and hydrate tweet ids into tweets.
twarc was developed as part of the [Documenting the Now](http://www.docnow.io)
project, which was funded by the [Mellon Foundation](https://mellon.org/).
## Install
Before using twarc you need to register an application at
[apps.twitter.com](http://apps.twitter.com). Once you have created your application, note down the consumer key and consumer secret, then click to generate an access token and access token secret.
With these four values you are ready to start using twarc.
1. Install [Python](http://python.org/download) (2 or 3)
2. pip install twarc (if you are upgrading: pip install --upgrade twarc)
## Quickstart:
First you need to tell twarc what your API keys are and grant access to one
or more Twitter accounts:
twarc configure
Then try out a search:
twarc search blacklivesmatter > search.jsonl
Or, if you want to collect tweets as they happen:
twarc filter blacklivesmatter > stream.jsonl
See below for details on these commands and more.
## Usage
### Configure
Once you have your application keys, you can tell twarc what they are with the
`configure` command.
twarc configure
This stores your keys in a file called `.twarc` in your home directory, so you don't have to enter them every time.
If you would rather supply them directly, you can set them in the environment (`CONSUMER_KEY`,
`CONSUMER_SECRET`, `ACCESS_TOKEN`, `ACCESS_TOKEN_SECRET`) or use command line
options (`--consumer_key`, `--consumer_secret`, `--access_token`,
`--access_token_secret`).
### Search
This uses Twitter's [search/tweets](https://dev.twitter.com/rest/reference/get/search/tweets) API to download *pre-existing* tweets matching a given query.
twarc search blacklivesmatter > tweets.jsonl
Det är viktigt att notera att `search` retunerar tweets som hittas inom det 7-dagarsfönster som
Twitters sök-API erbjuder. Känns det som ett smalt fönster? Det är det. Men du kanske är intresserad av att samla in tweets i samma ögonblick som de skapas
genom att använda `filter` och `sample` kommandona nedan.
Det bästa sättet att bekanta sig med Twitters söksyntax är att experimentera med
[Twitters Advancerade Sök](https://twitter.com/search-advanced) och kopiera och klistra in söksträngen från sökboxen.
Här är till exempel en mer avancerad söksträng som matchar tweets innehållande antingen \#blacklivesmatter eller #blm hashtaggar som skickats till deray
twarc search '#blacklivesmatter OR #blm to:deray' > tweets.jsonl
Twitter försöker att koda en tweets språk, och du kan begränsa sökningen till ett specifikt språk om du vill:
twarc search '#blacklivesmatter' --lang fr > tweets.jsonl
Du kan också söka efter tweets inom en given plats, till exempel tweets som nämner *blacklivesmatter* som är 1 mile från centrala Ferguson, Missouri:
twarc search blacklivesmatter --geocode 38.7442,-90.3054,1mi > tweets.jsonl
Om inte en söksträng ges när du använder `--geocode` kommer du få alla tweets som är relevanta för den platsen och radien.
twarc search --geocode 38.7442,-90.3054,1mi > tweets.jsonl
### Filter

The `filter` command uses Twitter's [statuses/filter](https://dev.twitter.com/streaming/reference/post/statuses/filter) API to collect tweets as they are created.

    twarc filter blacklivesmatter,blm > tweets.jsonl

Note that the syntax for Twitter's track queries is slightly different from the queries used in the search API. Please consult the documentation on how best to express your filter.

Use the `--follow` option if you want to collect tweets from a specific user id as they are created. This includes retweets. For example, this collects tweets and retweets from CNN:

    twarc filter --follow 759251 > tweets.jsonl

You can also collect tweets using a bounding box. Note: the leading dash needs to be escaped, otherwise it will be interpreted as a command line option!

    twarc filter --locations "\-74,40,-73,41" > tweets.jsonl
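The four numbers in the bounding box follow the streaming API convention of a longitude,latitude pair for the southwest corner followed by one for the northeast corner. As a rough illustration (the function names are hypothetical and this is not how twarc itself processes the option), the box above can be parsed and tested against a point like this:

```python
def parse_bounding_box(locations):
    """Parse 'west,south,east,north' into a tuple of four floats.

    A leading backslash (the shell escape shown above) is stripped first.
    """
    parts = locations.lstrip("\\").split(",")
    west, south, east, north = (float(part) for part in parts)
    return west, south, east, north


def in_bounding_box(lon, lat, box):
    """Return True if the point (lon, lat) lies inside the box."""
    west, south, east, north = box
    return west <= lon <= east and south <= lat <= north


box = parse_bounding_box("-74,40,-73,41")
# A point in lower Manhattan falls inside this box.
print(in_bounding_box(-73.99, 40.73, box))
```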
If you combine options, they are OR'ed together. For example, this collects tweets that use the blacklivesmatter or blm hashtags, as well as tweets from the user CNN:

    twarc filter blacklivesmatter,blm --follow 759251 > tweets.jsonl
### Sample

Use the `sample` command to listen to Twitter's [statuses/sample](https://dev.twitter.com/streaming/reference/get/statuses/sample) API for a "random" sample of recent public statuses.

    twarc sample > tweets.jsonl

### Dehydrate

The `dehydrate` command generates a list of ids from a file of tweets:

    twarc dehydrate tweets.jsonl > tweet-ids.txt
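Conceptually, dehydrating reduces each JSON line to its identifier. The sketch below shows the idea with the standard library only; it assumes each tweet carries an `id_str` field, as in the v1.1 tweet JSON, and it is not twarc's own implementation.

```python
import io
import json


def dehydrate(lines):
    """Yield the id of each tweet found in an iterable of JSON lines."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            yield json.loads(line)["id_str"]


# Two made-up tweets stand in for a real tweets.jsonl file.
sample = io.StringIO('{"id_str": "1", "text": "a"}\n{"id_str": "2", "text": "b"}\n')
print(list(dehydrate(sample)))
```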
### Hydrate

twarc's `hydrate` command reads a file of tweet identifiers and writes out the tweet JSON for them using Twitter's [statuses/lookup](https://dev.twitter.com/rest/reference/get/statuses/lookup) API.

    twarc hydrate ids.txt > tweets.jsonl

Twitter API's [Terms of Service](https://dev.twitter.com/overview/terms/policy#6._Be_a_Good_Partner_to_Twitter) discourage people from making large amounts of raw Twitter data available on the Web. The data can be used for research and archived locally, but not shared with the world. Twitter does, however, allow files of tweet identifiers to be shared, which is useful when you want to make a dataset available. Others can then use Twitter's API to *hydrate* the data, that is, to retrieve the full JSON for each identifier. This is particularly important for [verification](https://en.wikipedia.org/wiki/Reproducibility) of social media research.
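The statuses/lookup endpoint accepts at most 100 ids per request, so a hydration run amounts to walking the id file in batches. A minimal sketch of that batching step, assuming one id per line (the `batches` helper is illustrative, not twarc's code):

```python
def batches(ids, size=100):
    """Group an iterable of id strings into lists of at most `size` ids."""
    batch = []
    for tweet_id in ids:
        tweet_id = tweet_id.strip()
        if not tweet_id:
            continue  # ignore empty lines
        batch.append(tweet_id)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly shorter, batch
        yield batch
```

Each batch corresponds to one API request, which is why a file of 250 ids needs three requests rather than 250.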
### Users

The `users` command returns user metadata for the given screen names.

    twarc users deray,Nettaaaaaaaa > users.jsonl

You can also give it user ids:

    twarc users 1232134,1413213 > users.jsonl

If you want, you can also use a file of user ids, which can be useful if you are using the `followers` and `friends` commands below:

    twarc users ids.txt > users.jsonl

### Followers

The `followers` command uses Twitter's [follower id API](https://dev.twitter.com/rest/reference/get/followers/ids) to collect the follower user ids for exactly one screen name per request, given as an argument:

    twarc followers deray > follower_ids.txt

The result contains exactly one user id per line, in reverse chronological order, i.e. the most recent followers first.

### Friends

Like the `followers` command, the `friends` command uses Twitter's [friend id API](https://dev.twitter.com/rest/reference/get/friends/ids) to collect the friend user ids for exactly one screen name per request, given as an argument:

    twarc friends deray > friend_ids.txt
### Trends

The `trends` command lets you retrieve information from Twitter's API about trending hashtags. You need to supply a [Where On Earth](http://developer.yahoo.com/geo/geoplanet/) identifier (`woeid`) to indicate which trends you are interested in. For example, here is how you can get the current trends for St. Louis:

    twarc trends 2486982

Using a `woeid` of 1 returns trends for the entire planet:

    twarc trends 1

If you aren't sure what to use as a `woeid`, just omit it and you will get a list of all the places for which Twitter tracks trends:

    twarc trends

If you have a geo-location you can use it instead of the `woeid`.

    twarc trends 39.9062,-79.4679

Behind the scenes twarc uses Twitter's [trends/closest](https://dev.twitter.com/rest/reference/get/trends/closest) API to find the nearest `woeid`.
### Timeline

The `timeline` command uses Twitter's [user timeline API](https://dev.twitter.com/rest/reference/get/statuses/user_timeline) to collect the most recent tweets posted by the user indicated by screen_name.

    twarc timeline deray > tweets.jsonl

You can also look up users by user id:

    twarc timeline 12345 > tweets.jsonl

### Retweets

You can get the retweets of a given tweet id like so:

    twarc retweets 824077910927691778 > retweets.jsonl

### Replies

Unfortunately Twitter's API does not support getting the replies to a tweet, so twarc approximates it with the search API. Since the search API cannot return tweets older than a week, twarc can only get the replies to a tweet that were sent in the last week.

If you want to get the replies to a given tweet you can use:

    twarc replies 824077910927691778 > replies.jsonl

Using the `--recursive` option also fetches replies to the replies, as well as quotes. This can take a long time to complete for a large thread because of rate limiting on the search API.

    twarc replies 824077910927691778 --recursive

### Lists

To get the users that are on a list you can use the list URL with the `listmembers` command:

    twarc listmembers https://twitter.com/edsu/lists/bots
## Use as a Library

You can also use twarc programmatically as a library to collect tweets. You first need to create a `Twarc` instance (using your Twitter credentials), and then use it to iterate through search results, filter results or lookup results.

```python
from twarc import Twarc

t = Twarc(consumer_key, consumer_secret, access_token, access_token_secret)
for tweet in t.search("ferguson"):
    print(tweet["text"])
```

You can do the same for a filter stream of new tweets that match a track keyword

```python
for tweet in t.filter(track="ferguson"):
    print(tweet["text"])
```

or a location:

```python
for tweet in t.filter(locations="-74,40,-73,41"):
    print(tweet["text"])
```

or user ids:

```python
for tweet in t.filter(follow='12345,678910'):
    print(tweet["text"])
```

Similarly you can hydrate tweet identifiers by passing in a list of ids or a generator:

```python
for tweet in t.hydrate(open('ids.txt')):
    print(tweet["text"])
```
## Utilities

In the utils directory there are some simple command line utilities for working with the line-oriented JSON, such as printing out the archived tweets as text or html, extracting the usernames, referenced URLs, etc. If you create a script that is handy, please send a pull request.

Once you have collected some tweets you can create a rudimentary wall of them:

    % utils/wall.py tweets.jsonl > tweets.html

You can create a word cloud from the tweets you have collected:

    % utils/wordcloud.py tweets.jsonl > wordcloud.html

If you have collected tweets using `replies` you can create a static D3 visualization of them with:

    % utils/network.py tweets.jsonl tweets.html

You can also consolidate tweets by user, which lets you see central accounts:

    % utils/network.py --users tweets.jsonl tweets.html

And if you want to use the network graph in a program like [Gephi](https://gephi.org/), you can generate a GEXF file with the following:

    % utils/network.py --users tweets.jsonl tweets.gexf

gender.py is a filter which lets you filter tweets based on a guess about the gender of the author. For example, you can filter out all the tweets that look like they were written by women and create a word cloud for them:

    % utils/gender.py --gender female tweets.jsonl | utils/wordcloud.py > tweets-female.html

You can output [GeoJSON](http://geojson.org/) from tweets where geo coordinates are available:

    % utils/geojson.py tweets.jsonl > tweets.geojson

Optionally you can export GeoJSON with centroids replacing bounding boxes:

    % utils/geojson.py tweets.jsonl --centroid > tweets.geojson

And if you do export GeoJSON with centroids, you can add some random fuzzing:

    % utils/geojson.py tweets.jsonl --centroid --fuzz 0.01 > tweets.geojson

To filter tweets by presence or absence of geo coordinates (or Place, see [API documentation](https://dev.twitter.com/overview/api/places)):

    % utils/geofilter.py tweets.jsonl --yes-coordinates > tweets-with-geocoords.jsonl
    % cat tweets.jsonl | utils/geofilter.py --no-place > tweets-with-no-place.jsonl

To filter tweets by a GeoJSON fence (requires [Shapely](https://github.com/Toblerity/Shapely)):

    % utils/geofilter.py tweets.jsonl --fence limits.geojson > fenced-tweets.jsonl
    % cat tweets.jsonl | utils/geofilter.py --fence limits.geojson > fenced-tweets.jsonl

If you suspect you have duplicates in your tweet collections you can remove them:

    % utils/deduplicate.py tweets.jsonl > deduped.jsonl

You can sort by ID, which is the same as sorting by time:

    % utils/sort_by_id.py tweets.jsonl > sorted.jsonl
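Sorting by ID matches sorting by time because tweet ids generated since late 2010 are "snowflake" ids whose upper bits hold a millisecond timestamp. The sketch below decodes that timestamp. It assumes a snowflake id, since older sequential ids do not encode a time, and the function name is illustrative.

```python
from datetime import datetime, timezone

# Offset of the snowflake epoch from the Unix epoch, in milliseconds.
TWITTER_EPOCH_MS = 1288834974657


def snowflake_to_datetime(tweet_id):
    """Return the UTC creation time encoded in a snowflake tweet id."""
    # The low 22 bits are worker and sequence numbers; the rest is time.
    timestamp_ms = (int(tweet_id) >> 22) + TWITTER_EPOCH_MS
    return datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)


# The tweet id used in the retweets and replies examples above.
print(snowflake_to_datetime(824077910927691778).date())
```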
You can filter out all tweets before a certain date (for example, if a hashtag was used for another event before the one you are interested in):

    % utils/filter_date.py --mindate 1-may-2014 tweets.jsonl > filtered.jsonl

You can get an HTML list of the clients used:

    % utils/source.py tweets.jsonl > sources.html

If you want to remove the retweets:

    % utils/noretweets.py tweets.jsonl > tweets_noretweets.jsonl

Or unshorten urls (requires [unshrtn](https://github.com/edsu/unshrtn)):

    % cat tweets.jsonl | utils/unshrtn.py > unshortened.jsonl

Once you have unshortened your urls you can get a ranked list of the most-tweeted urls:

    % cat unshortened.jsonl | utils/urls.py | sort | uniq -c | sort -nr > urls.txt
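The `sort | uniq -c | sort -nr` part of that pipeline simply counts how often each line occurs and lists the most frequent first. If you would rather do the counting in Python, a rough equivalent (assuming one URL per line, and with a hypothetical function name) looks like this:

```python
from collections import Counter


def rank_urls(urls):
    """Return (url, count) pairs, most frequent first."""
    counts = Counter(url.strip() for url in urls if url.strip())
    return counts.most_common()


for url, count in rank_urls(["http://a.example\n", "http://b.example\n", "http://a.example\n"]):
    print(count, url)
```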
## twarc-report

Some further utility scripts to generate CSV or JSON output suitable for use with [D3.js](http://d3js.org/) visualizations can be found in the [twarc-report](https://github.com/pbinkley/twarc-report) project. The util `directed.py`, formerly part of twarc, has moved to twarc-report as `d3graph.py`.

Each script can also generate an html demo of a D3 visualization, e.g. [timelines](https://wallandbinkley.com/twarc/bill10/) or a [directed graph of retweets](https://wallandbinkley.com/twarc/bill10/directed-retweets.html).

Translation: [Andreas Segerberg]
[Engelska]: https://github.com/DocNow/twarc/blob/main/README.md
[Japanska]: https://github.com/DocNow/twarc/blob/main/README_ja_jp.md
[Portugisiska]: https://github.com/DocNow/twarc/blob/main/README_pt_br.md
[Spanska]: https://github.com/DocNow/twarc/blob/main/README_es_mx.md
[Swahili]: https://github.com/DocNow/twarc/blob/main/README_sw_ke.md
[Andreas Segerberg]: https://github.com/Segerberg
================================================
FILE: docs/twarc1_sw_ke.md
================================================
twarc1
=====
twarc is a command line tool and Python library for archiving Twitter JSON data. Each tweet is represented as a JSON object that is [exactly](https://dev.twitter.com/overview/api/tweets) what was returned from the Twitter API. Tweets are stored as [line-oriented JSON](https://en.wikipedia.org/wiki/JSON_Streaming#Line-delimited_JSON). twarc will handle the Twitter API's [rate limits](https://dev.twitter.com/rest/public/rate-limiting) for you. In addition to collecting tweets, twarc can help you collect users and trends, and hydrate tweet ids.

twarc was developed as part of the [Documenting the Now](http://www.docnow.io) project, which was funded by the [Mellon Foundation](https://mellon.org/).
## Install

Before using twarc you will need to register an application at [apps.twitter.com](http://apps.twitter.com). Once you have created your application, note down your `consumer key` and `consumer secret`, then click to generate an `access token` and `access token secret`. You will need these four values to use twarc.

1. install [Python](http://python.org/download) (2 or 3)
2. pip install twarc (or to upgrade: pip install --upgrade twarc)
## Quickstart

First you need to tell twarc your Twitter API keys:

    twarc configure

Then try out a search:

    twarc search blacklivesmatter > search.jsonl

Or maybe you would like to collect tweets as they happen:

    twarc filter blacklivesmatter > stream.jsonl

Read on for more details about using twarc.
## Usage

### Configure

Once you have your Twitter keys you can tell twarc what they are with the `configure` command.

    twarc configure

twarc will write your credentials to a file called `.twarc` in your home directory. If you would rather not, or cannot, write that file, you can supply the credentials through the environment (`CONSUMER_KEY`, `CONSUMER_SECRET`, `ACCESS_TOKEN`, `ACCESS_TOKEN_SECRET`) or through command line options (`--consumer_key`, `--consumer_secret`, `--access_token`, `--access_token_secret`).
### Search

This uses Twitter's [search/tweets](https://dev.twitter.com/rest/reference/get/search/tweets) API to download pre-existing tweets matching a given query.

    twarc search blacklivesmatter > tweets.jsonl

It's important to remember that your query will only return tweets from the 7-day window that Twitter's search API imposes. If that window is too narrow for you, you may want to collect tweets as they are posted using the `filter` and `sample` commands below.

The best way to get familiar with Twitter's search syntax is to experiment with [Twitter's Advanced Search](https://twitter.com/search-advanced) and then use the resulting query with twarc. For example, here we search for tweets containing either the \#blacklivesmatter or \#blm hashtags that were sent to deray.

    twarc search '#blacklivesmatter OR #blm to:deray' > tweets.jsonl

Twitter attempts to code the language of a tweet, and you can limit your search to a particular language if you want:

    twarc search '#blacklivesmatter' --lang fr > tweets.jsonl

You can also search for tweets within a given location, for example tweets mentioning *blacklivesmatter* that are 1 mile from the center of Ferguson, Missouri:

    twarc search blacklivesmatter --geocode 38.7442,-90.3054,1mi > tweets.jsonl

If your query has no search terms but you used `--geocode`, you will get all tweets relevant to that area.

    twarc search --geocode 38.7442,-90.3054,1mi > tweets.jsonl
### Filter

The `filter` command collects tweets as they are posted, using Twitter's [statuses/filter](https://dev.twitter.com/streaming/reference/post/statuses/filter) API.

    twarc filter blacklivesmatter,blm > tweets.jsonl

Please note that the syntax for Twitter's track queries is different from the search syntax. Please consult the documentation on how best to express the filter you are using.

Use the `--follow` option if you want to collect tweets from a user as they happen. This includes retweets. For example, this will collect tweets and retweets from CNN:

    twarc filter --follow 759251 > tweets.jsonl

You can also collect tweets using a bounding box. Note: the leading dash needs to be escaped in the bounding box, or else it will be interpreted as a command line argument!

    twarc filter --locations "\-74,40,-73,41" > tweets.jsonl

If you combine options they are OR'ed together. For example, this will collect tweets that use blacklivesmatter or blm, and also tweets from the user CNN:

    twarc filter blacklivesmatter,blm --follow 759251 > tweets.jsonl
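To make the OR behaviour concrete, here is a simplified sketch of the combined condition applied to a single tweet. It is an approximation for illustration only: the real matching happens on Twitter's servers and covers more fields than the tweet text, and the `matches_filter` function is hypothetical.

```python
def matches_filter(tweet, track=(), follow=()):
    """Return True if the tweet matches ANY track term or ANY followed user.

    Simplified: only the tweet text and the author's id are inspected.
    """
    text = tweet.get("text", "").lower()
    track_hit = any(term.lower() in text for term in track)
    follow_hit = tweet.get("user", {}).get("id_str") in follow
    return track_hit or follow_hit


# A made-up tweet from user 759251 that mentions neither hashtag still matches.
tweet = {"text": "Breaking news", "user": {"id_str": "759251"}}
print(matches_filter(tweet, track=("blacklivesmatter", "blm"), follow=("759251",)))
```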
### Sample

Use the `sample` command to listen to Twitter's [statuses/sample](https://dev.twitter.com/streaming/reference/get/statuses/sample) API for a sample of recent statuses.

    twarc sample > tweets.jsonl

### Dehydrate

twarc's `dehydrate` command generates a list of ids from a file of tweets:

    twarc dehydrate tweets.jsonl > tweet-ids.txt

### Hydrate

twarc's `hydrate` command reads a file of ids and writes out the tweet JSON for them using Twitter's [statuses/lookup](https://dev.twitter.com/rest/reference/get/statuses/lookup) API.

    twarc hydrate ids.txt > tweets.jsonl

Twitter API's [Terms of Service](https://dev.twitter.com/overview/terms/policy#6._Be_a_Good_Partner_to_Twitter) discourage people from making large amounts of Twitter data available on the Web. The data can be used for research, as long as it is not shared with the world. Twitter does allow files of tweet identifiers to be shared, which can be useful. You can use Twitter's API to *hydrate* that data, that is, to retrieve the full JSON. This is important for [verification](https://en.wikipedia.org/wiki/Reproducibility) of social media research.
### Users

The `users` command returns user metadata for the given screen names.

    twarc users deray,Nettaaaaaaaa > users.jsonl

You can also give it user ids:

    twarc users 1232134,1413213 > users.jsonl

You can also use a file of user ids, for example when working with the `followers` and `friends` commands:

    twarc users ids.txt > users.jsonl

### Followers

The `followers` command uses Twitter's [follower id API](https://dev.twitter.com/rest/reference/get/followers/ids) to collect the follower ids for one screen name per request. For example:

    twarc followers deray > follower_ids.txt

The result contains one follower id per line, with the most recent followers first.
### Trends

The `trends` command uses Twitter's API for trending hashtags. You need to supply a [Where On Earth](http://developer.yahoo.com/geo/geoplanet/) identifier (`woeid`) to indicate which trends you are interested in. For example, if you want the trends for St. Louis:

    twarc trends 2486982

Using a `woeid` of 1 returns trends for the entire planet.

    twarc trends 1

If you don't know what to use as a `woeid`, leave it out and you will get all the places that Twitter tracks:

    twarc trends

If you have a geo-location you can use it instead of the `woeid`.

    twarc trends 39.9062,-79.4679

Behind the scenes twarc uses Twitter's [trends/closest](https://dev.twitter.com/rest/reference/get/trends/closest) API to find the `woeid` nearest to that location.
### Timeline

The `timeline` command uses Twitter's [user timeline API](https://dev.twitter.com/rest/reference/get/statuses/user_timeline) to collect the tweets of the user indicated by `screen_name`:

    twarc timeline deray > tweets.jsonl

You can also look up users by user id:

    twarc timeline 12345 > tweets.jsonl

### Retweets

You can get retweets by giving it a tweet id like so:

    twarc retweets 824077910927691778 > retweets.jsonl

### Replies

Twitter has no API for getting the replies to a tweet, so twarc approximates it using the search API. However, the search API cannot return replies older than seven days.

If you want to get the replies to a tweet, do this:

    twarc replies 824077910927691778 > replies.jsonl

Using `--recursive` will also fetch replies to the replies, as well as quotes. This can take a long time to complete if there are many replies, because of rate limiting on the search API.

    twarc replies 824077910927691778 --recursive

### Lists

To get the users that are on a list you can use the list URL with the `listmembers` command:

    twarc listmembers https://twitter.com/edsu/lists/bots
## Use as a Library

If you want, you can use twarc programmatically as a library to collect tweets. You first need to create a `Twarc` instance (using your Twitter credentials), and then use it to iterate through search results, filter results or lookup results.

```python
from twarc import Twarc

t = Twarc(consumer_key, consumer_secret, access_token, access_token_secret)
for tweet in t.search("ferguson"):
    print(tweet["text"])
```

You can do the same for a filter stream of tweets that match a track keyword:

```python
for tweet in t.filter(track="ferguson"):
    print(tweet["text"])
```

or a location

```python
for tweet in t.filter(locations="-74,40,-73,41"):
    print(tweet["text"])
```

or user ids

```python
for tweet in t.filter(follow='12345,678910'):
    print(tweet["text"])
```

Similarly you can hydrate tweet identifiers by passing in a list of ids or a generator:

```python
for tweet in t.hydrate(open('ids.txt')):
    print(tweet["text"])
```
## Utilities

In the `utils` directory there are commands that can help you work with line-oriented JSON, such as printing tweets as text or html, and extracting usernames and URLs. If you create a script of your own, please share it with us in a PR.

Once you have some tweets you can create a wall of them:

    % utils/wall.py tweets.jsonl > tweets.html

You can create a word cloud of tweets you have collected about nasa:

    % utils/wordcloud.py tweets.jsonl > wordcloud.html

If you have collected tweets using `replies` you can create a D3 visualization with:

    % utils/network.py tweets.jsonl tweets.html

You can consolidate tweets by user, allowing you to see central accounts:

    % utils/network.py --users tweets.jsonl tweets.html

And if you want to use the network graph in a program like [Gephi](https://gephi.org/), you can generate a GEXF file with:

    % utils/network.py --users tweets.jsonl tweets.gexf

`gender.py` is a filter which lets you filter tweets based on a guess about the gender of the author. For example, you can filter out all the tweets that look like they were from women, and create a word cloud with:

    % utils/gender.py --gender female tweets.jsonl | utils/wordcloud.py > tweets-female.html

You can output [GeoJSON](http://geojson.org/) from tweets where geo coordinates are available:

    % utils/geojson.py tweets.jsonl > tweets.geojson

You can also export GeoJSON with centroids replacing bounding boxes:

    % utils/geojson.py tweets.jsonl --centroid > tweets.geojson

And if you export GeoJSON with centroids, you can add some random fuzzing:

    % utils/geojson.py tweets.jsonl --centroid --fuzz 0.01 > tweets.geojson
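For intuition, the centroid of a rectangular bounding box is the average of its corner points, and fuzzing nudges that point by a small random amount. The sketch below is an assumption-laden illustration, not the code in `utils/geojson.py`: it assumes corners are given as `[longitude, latitude]` pairs and that the fuzz value is a maximum offset in degrees drawn uniformly.

```python
import random


def centroid(corners):
    """Return the [lon, lat] average of a list of [lon, lat] corner points."""
    lons = [point[0] for point in corners]
    lats = [point[1] for point in corners]
    return [sum(lons) / len(lons), sum(lats) / len(lats)]


def fuzz(point, amount, rng=random):
    """Shift each coordinate by a uniform random offset in [-amount, amount]."""
    return [coord + rng.uniform(-amount, amount) for coord in point]


# Corners of the New York bounding box used in the filter examples.
corners = [[-74, 40], [-73, 40], [-73, 41], [-74, 41]]
center = centroid(corners)
print(center, fuzz(center, 0.01))
```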
To filter tweets by presence or absence of geo coordinates (or Place, see the [API](https://dev.twitter.com/overview/api/places) documentation):

    % utils/geofilter.py tweets.jsonl --yes-coordinates > tweets-with-geocoords.jsonl
    % cat tweets.jsonl | utils/geofilter.py --no-place > tweets-with-no-place.jsonl

To filter tweets by a GeoJSON fence (requires [Shapely](https://github.com/Toblerity/Shapely)):

    % utils/geofilter.py tweets.jsonl --fence limits.geojson > fenced-tweets.jsonl
    % cat tweets.jsonl | utils/geofilter.py --fence limits.geojson > fenced-tweets.jsonl

If you suspect you have duplicates in your tweets you can dedupe them:

    % utils/deduplicate.py tweets.jsonl > deduped.jsonl
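Deduplication of this kind usually keeps the first occurrence of each tweet id and drops later ones. A minimal sketch of that idea, assuming every line is a tweet with an `id_str` field (this is not necessarily how `utils/deduplicate.py` is written):

```python
import json


def deduplicate(lines):
    """Yield each JSON line whose tweet id has not been seen before."""
    seen = set()
    for line in lines:
        tweet_id = json.loads(line)["id_str"]
        if tweet_id not in seen:
            seen.add(tweet_id)
            yield line


lines = ['{"id_str": "1"}', '{"id_str": "2"}', '{"id_str": "1"}']
print(list(deduplicate(lines)))
```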
You can sort by ID, which is the same as sorting by time:

    % utils/sort_by_id.py tweets.jsonl > sorted.jsonl

You can filter out all tweets before a certain date (for example, if a hashtag was used for another event before the one you are interested in):

    % utils/filter_date.py --mindate 1-may-2014 tweets.jsonl > filtered.jsonl
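A date filter like this boils down to parsing each tweet's `created_at` value and comparing it with the cutoff. The sketch below assumes the v1.1 `created_at` format (for example `Sat Aug 09 17:02:32 +0000 2014`) and treats the cutoff as inclusive; both the function name and the inclusive comparison are assumptions rather than a description of `utils/filter_date.py`.

```python
from datetime import datetime, timezone

# Format of the created_at field in v1.1 tweet JSON.
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def created_on_or_after(tweet, min_date):
    """Return True if the tweet was created at or after min_date."""
    created = datetime.strptime(tweet["created_at"], CREATED_AT_FORMAT)
    return created >= min_date


min_date = datetime(2014, 5, 1, tzinfo=timezone.utc)
tweet = {"created_at": "Sat Aug 09 17:02:32 +0000 2014"}
print(created_on_or_after(tweet, min_date))
```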
You can get an HTML list of the clients used:

    % utils/source.py tweets.jsonl > sources.html

If you want to remove the retweets:

    % utils/noretweets.py tweets.jsonl > tweets_noretweets.jsonl

Or unshorten urls (requires [unshrtn](https://github.com/docnow/unshrtn)):

    % cat tweets.jsonl | utils/unshrtn.py > unshortened.jsonl

Once you have unshortened your urls you can get a ranked list of the most-tweeted urls:

    % cat unshortened.jsonl | utils/urls.py | sort | uniq -c | sort -nr > urls.txt
## twarc-report

Some further utility scripts to generate csv or json output suitable for use with [D3.js](http://d3js.org/) visualizations are found in [twarc-report](https://github.com/pbinkley/twarc-report). `directed.py`, formerly part of twarc, has moved to twarc-report as `d3graph.py`.

Each script can also generate an html demo of a D3 visualization, e.g. [timelines](https://wallandbinkley.com/twarc/bill10/) or a [directed graph of retweets](https://wallandbinkley.com/twarc/bill10/directed-retweets.html).
[Kihispania]: https://github.com/DocNow/twarc/blob/main/README_es_mx.md
[Kiingereza]: https://github.com/DocNow/twarc/blob/main/README.md
[Kijapani]: https://github.com/DocNow/twarc/blob/main/README_ja_jp.md
[Kireno]: https://github.com/DocNow/twarc/blob/main/README_pt_br.md
[Kisweden]: https://github.com/DocNow/twarc/blob/main/README_sv_se.md
================================================
FILE: docs/twarc1_zw_zh.md
================================================
twarc1
=====
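twarc is a command line tool and Python library for processing and archiving Twitter JSON data.

Each tweet twarc handles is represented as a JSON object that is [exactly](https://dev.twitter.com/overview/api/tweets) what was returned from the Twitter API. twarc will handle the Twitter API's [rate limits](https://dev.twitter.com/rest/public/rate-limiting) for you. In addition to letting you collect tweets, twarc can help you collect users and trends, and hydrate tweet ids.

twarc was developed as part of the [Documenting the Now](http://www.docnow.io) project, which was funded by the [Mellon Foundation](https://mellon.org/).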
## Install

Before using twarc you will need to register an application at [apps.twitter.com](http://apps.twitter.com). Once you have registered your application, note down your `consumer key` and `consumer secret`, then click to generate an `access token` and `access token secret`. With these four values in hand you are ready to start using twarc.

1. Install [Python](http://python.org/download) (2 or 3)
2. [pip](https://pip.pypa.io/en/stable/installing/) install twarc

### Homebrew (macOS only)

On macOS you can install `twarc` via Homebrew:

```shell
$ brew install twarc
```
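## Quickstart

First you need to tell twarc your application keys and grant it access to one or more Twitter accounts:

```shell
twarc configure
```

Then try out a search:

```shell
twarc search blacklivesmatter > search.jsonl
```

Or maybe you would like to collect tweets as they happen?

```shell
twarc filter blacklivesmatter > stream.jsonl
```

Read on for more about what these commands do and for further commands.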
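## Usage

### Configure

Once you have your application keys you can tell twarc what they are with the `configure` command.

```shell
twarc configure
```

This creates a file called `.twarc` in your home directory (`~`) that stores your credentials, so you don't have to enter them every time you use twarc. If you would rather supply the keys each time, you can use environment variables (`CONSUMER_KEY`, `CONSUMER_SECRET`, `ACCESS_TOKEN`, `ACCESS_TOKEN_SECRET`) or command line options (`--consumer_key`, `--consumer_secret`, `--access_token`, `--access_token_secret`).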
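### Search

Search uses Twitter's [search/tweets](https://dev.twitter.com/rest/reference/get/search/tweets) API endpoint to download *pre-existing* tweets matching a given query.

```shell
twarc search blacklivesmatter > tweets.jsonl
```

It's important to note that `search` returns tweets from the past seven days only: this is a limit of Twitter's search API. If that seems too short (we think so too), you may prefer the `filter` and `sample` commands described below.

The best way to get familiar with Twitter's search syntax is to experiment with [Twitter's Advanced Search](https://twitter.com/search-advanced) and copy and paste the resulting query from the search box. For example, here is a more complicated query that searches for tweets containing either the `#blacklivesmatter` or `#blm` hashtags that were sent to [deray](https://twitter.com/deray).

```shell
twarc search '#blacklivesmatter OR #blm to:deray' > tweets.jsonl
```

You should also take a look at Igor Brigadir's *excellent* guide to the advanced search syntax: [Advanced Search on Twitter](https://github.com/igorbrigadir/twitter-advanced-search/blob/master/README.md). It covers many subtleties that are not obvious even after reading Twitter's search documentation.

Twitter attempts to code the language of a tweet, and you can limit your results to a particular language using an [ISO 639-1] code.

```shell
twarc search '#blacklivesmatter' --lang fr > tweets.jsonl
```

You can also search by location. For example, you can search for tweets mentioning *blacklivesmatter* that are located within 1 mile of the center of Ferguson, Missouri.

```shell
twarc search blacklivesmatter --geocode 38.7442,-90.3054,1mi > tweets.jsonl
```

If a search using `--geocode` does not include a query, you will get all tweets relevant to that location and radius.

```shell
twarc search --geocode 38.7442,-90.3054,1mi > tweets.jsonl
```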
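### Filter

The `filter` command uses Twitter's [statuses/filter](https://dev.twitter.com/streaming/reference/post/statuses/filter) API to collect tweets in real time.

```shell
twarc filter blacklivesmatter,blm > tweets.jsonl
```

Please note that the syntax for Twitter's `track` queries is slightly different from the syntax used in the search API. Please consult the official documentation on how best to express your filter.

Use the `--follow` command line option with a user id to collect tweets from a specific user as they happen. Note that the results include retweets. For example, the following collects tweets and retweets from `CNN`.

```shell
twarc filter --follow 759251 > tweets.jsonl
```

You can also collect tweets within a geographic bounding box. Note that the leading dash in the coordinates must be escaped with `\`, otherwise it will be interpreted as a command line option!

```shell
twarc filter --locations "\-74,40,-73,41" > tweets.jsonl
```

You can use the `--lang` command line option to pass in an [ISO 639-1] language code to limit the language. You can also use the option more than once to specify several languages. The following example collects, in real time, French and Spanish tweets that mention paris or madrid:

```shell
twarc filter paris,madrid --lang fr --lang es
```

The track terms and the `--follow` option are **OR**'ed together. The following example collects tweets that contain `blacklivesmatter` or `blm`, or tweets from CNN.

```shell
twarc filter blacklivesmatter,blm --follow 759251 > tweets.jsonl
```

Combining locations and languages, however, effectively results in an **AND**. The following example collects tweets from New York that are tagged as French or Spanish.

```shell
twarc filter --locations "\-74,40,-73,41" --lang es --lang fr
```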
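### Sample

Use the `sample` command to listen to Twitter's [statuses/sample](https://dev.twitter.com/streaming/reference/get/statuses/sample) API for a "random" sample of recent public tweets.

```shell
twarc sample > tweets.jsonl
```

### Dehydrate

The `dehydrate` command reads a jsonl file of tweets and generates a list of tweet ids.

```shell
twarc dehydrate tweets.jsonl > tweet-ids.txt
```

### Hydrate

twarc's `hydrate` command is the reverse of `dehydrate`: it reads a file of tweet ids and uses Twitter's [statuses/lookup](https://dev.twitter.com/rest/reference/get/statuses/lookup) API to rebuild a jsonl file containing the full tweet JSON.

```shell
twarc hydrate ids.txt > tweets.jsonl
```

Twitter API's [Terms of Service](https://dev.twitter.com/overview/terms/policy#6._Be_a_Good_Partner_to_Twitter) discourage people from making large amounts of raw Twitter data available on the Web. The data can be used for research and archived locally, but not shared with the world. Twitter does, however, allow tweet ids to be shared publicly in bulk, and these ids can be used to rebuild the tweet JSON with the `hydrate` command and Twitter's API. This is particularly important for [reproducibility](https://en.wikipedia.org/wiki/Reproducibility) of social media research.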
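### Users

The `users` command returns metadata for one or more users, identified by their Twitter screen names. (Translator's note: the screen name is the string shown when you @ a user.)

```shell
twarc users deray,Nettaaaaaaaa > users.jsonl
```

You can also give it user ids.

```shell
twarc users 1232134,1413213 > users.jsonl
```

You can also use a file of user ids as input, which is especially useful when you are working with the `followers` and `friends` commands. For example:

```shell
twarc users ids.txt > users.jsonl
```

### Followers

The `followers` command uses Twitter's [follower id](https://dev.twitter.com/rest/reference/get/followers/ids) API to collect the ids of a user's followers. The command takes exactly one screen name as input. For example:

```shell
twarc followers deray > follower_ids.txt
```

The result contains one follower id per line, in reverse chronological order, with the most recent followers first.

### Friends

Like the `followers` command, the `friends` command uses Twitter's [friend id](https://dev.twitter.com/rest/reference/get/friends/ids) API to collect the ids of a user's friends. The command takes exactly one screen name as input. For example:

```shell
twarc friends deray > friend_ids.txt
```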
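### Trends

The `trends` command lets you look up trending hashtags. You need to supply a [Where On Earth](https://web.archive.org/web/20180102203025/https://developer.yahoo.com/geo/geoplanet/) id (`woeid`) to indicate which location's trends you are interested in. In the following example `2486982` stands for St. Louis:

```shell
twarc trends 2486982
```

Using a `woeid` of 1 returns the trends for the entire planet:

```shell
twarc trends 1
```

If you aren't sure which `woeid` to use, leave it out and you will get a list of all the places for which Twitter tracks trends.

```shell
twarc trends
```

If you already have a geo-location, you can use it instead of the `woeid`.

```shell
twarc trends 39.9062,-79.4679
```

Behind the scenes twarc uses Twitter's [trends/closest](https://dev.twitter.com/rest/reference/get/trends/closest) API to find the `woeid` nearest to the given location.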
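### Timeline

The `timeline` command uses Twitter's [user timeline](https://dev.twitter.com/rest/reference/get/statuses/user_timeline) API to collect the most recent tweets posted by a user, identified by screen name.

```shell
twarc timeline deray > tweets.jsonl
```

You can also use a user id.

```shell
twarc timeline 12345 > tweets.jsonl
```

### Retweets

You can get the retweets of a given tweet, here the tweet with id `824077910927691778`, like so:

```shell
twarc retweets 824077910927691778 > retweets.jsonl
```

The input can also be a text file of tweet ids.

```shell
twarc retweets ids.txt > retweets.jsonl
```

### Replies

Twitter's API does not support getting the replies to a tweet, so twarc approximates this with the search API. Since the search API only reaches back one week, twarc can only get the replies to a tweet that were sent in the last week.

The following example takes a tweet id as input.

```shell
twarc replies 824077910927691778 > replies.jsonl
```

Using the `--recursive` option also fetches replies to the replies, as well as quotes. Note that this can take a long time because Twitter's search API is rate limited.

```shell
twarc replies 824077910927691778 --recursive
```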
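Recursive collection can be pictured as a breadth-first walk over the reply tree, where every visited tweet costs at least one search request, which is why large threads run into the rate limit. The sketch below illustrates the traversal only: `fetch_replies` is a hypothetical stand-in for a search call, and none of this is twarc's actual implementation.

```python
from collections import deque


def collect_replies(tweet_id, fetch_replies):
    """Yield the ids of all direct and indirect replies to tweet_id.

    fetch_replies(tweet_id) is a caller-supplied function (hypothetical
    here) that returns a list of ids replying directly to tweet_id.
    """
    seen = set()
    queue = deque([tweet_id])
    while queue:
        current = queue.popleft()
        for reply_id in fetch_replies(current):  # one lookup per visited tweet
            if reply_id not in seen:
                seen.add(reply_id)
                queue.append(reply_id)
                yield reply_id


# A made-up thread: tweet 1 has replies 2 and 3, and tweet 2 has reply 4.
thread = {"1": ["2", "3"], "2": ["4"]}
print(list(collect_replies("1", lambda tid: thread.get(tid, []))))
```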
### Lists

To get the users that are on a list, pass the list URL to the `listmembers` command:

```shell
twarc listmembers https://twitter.com/edsu/lists/bots
```
## 付费搜索 API
推特引入了付费搜索 API. 它可以让你通过付款的方式实现更高级的搜索功能。你需要在[仪表板](https://developer.twitter.com/en/dashboard) 配置一个环境。在此之后,你可以搜索不限于最近7天内的推文的过去30天内的备份甚至完整推文备份。如果需要在命令行实现这一功能,你需要告诉 twarc 你在使用哪一个 endpoint 和环境。
为了控制预算,你可能需要限制搜索的时间段:使用 `--to_date` 和 `--frome_date`. 再次之外,你还可以使用 `--limit` 参数来限制返回的推文数目上限。
举例来看,假设今天是2020年6月1日,如果你想搜索不超过1000条从2020年5月1日到2020年5月14日所有提到 `blacklivesmatter` 的推文。如果我们的环境名为 `docnowdev`, 那么这个命令如下,注意我们使用了 `--30day` 这个 endpoint:
```shell
twarc search blacklivesmatter \
--30day docnowdev \
--from_date 2020-05-01 \
--to_date 2020-05-14 \
--limit 1000 \
> tweets.jsonl
```
Similarly, to search the full archive beyond the 30-day window, use `--fullarchive`. For example:
```shell
twarc search blacklivesmatter \
--fullarchive docnowdev \
--from_date 2014-08-04 \
--to_date 2014-08-05 \
--limit 1000 \
> tweets.jsonl
```
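If your environment is sandboxed, you need to pass `--sandbox` so that twarc knows not to request more than 100 tweets at a time. The default for non-sandboxed environments is 500.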
```shell
twarc search blacklivesmatter \
--fullarchive docnowdev \
--from_date 2014-08-04 \
--to_date 2014-08-05 \
--limit 1000 \
--sandbox \
> tweets.jsonl
```
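## Gnip Enterprise API
twarc supports integration with the full-archive Gnip Twitter enterprise API. To use it, pass the `--gnip_auth` argument and set the `GNIP_USERNAME`, `GNIP_PASSWORD`, and `GNIP_ACCOUNT` environment variables. For example: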
```shell
twarc search blacklivesmatter \
--gnip_auth \
--gnip_fullarchive prod \
--from_date 2014-08-04 \
--to_date 2015-08-05 \
--limit 1000 \
> tweets.jsonl
```
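## Use as a Library
To use twarc from your own code, first create a `Twarc` instance with your Twitter application credentials, and then use it to search, filter, and look up tweets. For example:
```python
from twarc import Twarc
# consumer_key, consumer_secret, access_token and access_token_secret
# stand for your own Twitter application credentials.
t = Twarc(consumer_key, consumer_secret, access_token, access_token_secret)
for tweet in t.search("ferguson"):
    print(tweet["text"])
```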
You can use the same approach to filter the live stream for tweets that match a keyword:
```python
for tweet in t.filter(track="ferguson"):
    print(tweet["text"])
```
Or by location:
```python
for tweet in t.filter(locations="-74,40,-73,41"):
    print(tweet["text"])
```
Or by user id:
```python
for tweet in t.filter(follow='12345,678910'):
    print(tweet["text"])
```
Similarly, you can pass in a file of tweet ids and hydrate them to get the full tweets:
```python
for tweet in t.hydrate(open('ids.txt')):
    print(tweet["text"])
```
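The loops above only print the tweet text. To archive the results instead, a minimal sketch along the following lines could write any iterable of tweet dictionaries as line-oriented JSON. The `write_jsonl` helper and the stand-in sample data are hypothetical and not part of twarc; with real credentials the iterable could be something like `t.search("ferguson")`.

```python
import json


def write_jsonl(tweets, path):
    """Write each tweet dict as one JSON object per line and return the count."""
    count = 0
    with open(path, "w", encoding="utf-8") as out:
        for tweet in tweets:
            out.write(json.dumps(tweet) + "\n")
            count += 1
    return count


# Stand-in data so the sketch runs without Twitter credentials.
sample = [{"id_str": "1", "text": "hello"}, {"id_str": "2", "text": "world"}]
print(write_jsonl(sample, "tweets.jsonl"))
```

Writing one object per line keeps the file compatible with the jsonl files used elsewhere in this document.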
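## User vs App Auth
twarc handles Twitter's rate limits for you. However, you should know that the rate limits differ depending on how you authenticate. Twitter offers two options: user auth and app auth. twarc uses user auth by default, but you can tell it to use app auth instead.
For example, switching to app auth can make searching significantly faster. User auth allows 180 requests every 15 minutes (1.6 million tweets per day), while app auth allows 450 requests every 15 minutes (4.3 million tweets per day).
Be aware, though, that the `statuses/lookup` endpoint used by the `hydrate` command is limited to 900 requests per 15 minutes with user auth, but only 300 requests per 15 minutes with app auth.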
If you are sure you want to use app auth, pass the `--app_auth` command line option. For example:
```shell
twarc --app_auth search ferguson > tweets.jsonl
```
The same can be done from your own Python code:
```python
from twarc import Twarc
t = Twarc(app_auth=True)
for tweet in t.search('ferguson'):
    print(tweet['id_str'])
```
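## Utilities
The `utils` directory contains several scripts that operate on jsonl files to do useful things, such as converting JSON tweets to text or HTML, or extracting usernames and the URLs referenced in tweets. If you write a handy script of your own, a pull request is welcome.
The following command creates a simple wall of tweets: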
```shell
utils/wall.py tweets.jsonl > tweets.html
```
The following command creates a simple word cloud:
```shell
utils/wordcloud.py tweets.jsonl > wordcloud.html
```
If you have collected tweets with the `replies` command, you can create a static D3 visualization of them with:
```shell
utils/network.py tweets.jsonl tweets.html
```
You can add an optional argument to organize the tweets by user, which lets you see the central accounts in the network:
```shell
utils/network.py --users tweets.jsonl tweets.html
```
You can also create a network of hashtags to see how they relate to (co-occur with) one another:
```shell
utils/network.py --hashtags tweets.jsonl tweets.html
```
If you want to use the network graphing program [Gephi](https://gephi.org/), you can generate a `GEXF` file with the following commands:
```shell
utils/network.py --users tweets.jsonl tweets.gexf
utils/network.py --hashtags tweets.jsonl tweets.gexf
```
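In addition, if you want to turn the network into a dynamic one that changes over time (with nodes appearing and disappearing), open the generated `GEXF` file in Gephi and follow this [tutorial](https://seinecle.github.io/gephi-tutorials/generated-html/converting-a-network-with-dates-into-dynamic.html). Note that `tweets.gexf` contains a `start_date` column but no `end_date` column, so nodes stay on screen once they appear. In Gephi's `Time interval creation options` pop-up, set `Start time column` to `start_date`, leave `End time column` empty, check `Parse dates`, and select the last date format option: `dd/MM/yyyy HH:mm:ss`.
`gender.py` is a script that guesses the gender of a tweet's author. For example, the command below keeps only the tweets that appear to be from women and generates a word cloud from them:
```shell
utils/gender.py --gender female tweets.jsonl | utils/wordcloud.py > tweets-female.html
```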
You can generate a [GeoJSON](http://geojson.org/) file from tweets that contain geolocation information:
```shell
utils/geojson.py tweets.jsonl > tweets.geojson
```
You can also output the [centroid](https://en.wikipedia.org/wiki/Centroid) of the bounding box in place of the bounding box itself:
```shell
utils/geojson.py tweets.jsonl --centroid > tweets.geojson
```
On top of that, you can add some random fuzzing:
```shell
utils/geojson.py tweets.jsonl --centroid --fuzz 0.01 > tweets.geojson
```
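To learn more about filtering tweets by the presence or absence of geo coordinates (or a place), see the [API documentation](https://dev.twitter.com/overview/api/places). Here are two examples: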
```shell
utils/geofilter.py tweets.jsonl --yes-coordinates > tweets-with-geocoords.jsonl
cat tweets.jsonl | utils/geofilter.py --no-place > tweets-with-no-place.jsonl
```
To filter tweets by a GeoJSON fence, see the examples below. Note that this requires [Shapely](https://github.com/Toblerity/Shapely).
```shell
utils/geofilter.py tweets.jsonl --fence limits.geojson > fenced-tweets.jsonl
cat tweets.jsonl | utils/geofilter.py --fence limits.geojson > fenced-tweets.jsonl
```
If you suspect you have duplicate tweets, you can remove them with:
```shell
utils/deduplicate.py tweets.jsonl > deduped.jsonl
```
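The idea behind deduplication can be sketched in a few lines of Python: remember which tweet ids have been seen and keep only the first line for each. This is an illustrative sketch, not the actual `utils/deduplicate.py` script, and it assumes each line is a v1.1 tweet object with an `id_str` field.

```python
import json


def deduplicate(lines):
    """Yield each JSON line whose tweet id has not been seen before."""
    seen = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet_id = json.loads(line)["id_str"]
        if tweet_id not in seen:
            seen.add(tweet_id)
            yield line


# Stand-in data: the third line repeats the first tweet.
sample = [
    '{"id_str": "1", "text": "a"}',
    '{"id_str": "2", "text": "b"}',
    '{"id_str": "1", "text": "a"}',
]
for line in deduplicate(sample):
    print(line)
```

Because only the ids are kept in memory, the approach stays cheap even for large files.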
You can sort tweets by id, which is analogous to sorting them by time:
```shell
utils/sort_by_id.py tweets.jsonl > sorted.jsonl
```
You can filter out all tweets before a certain date (for example, if a hashtag was used for another event before the one you're interested in):
```shell
utils/filter_date.py --mindate 1-may-2014 tweets.jsonl > filtered.jsonl
```
You can also get a list of the clients that were used to post the tweets:
```shell
utils/source.py tweets.jsonl > sources.html
```
The following command removes retweets:
```shell
utils/noretweets.py tweets.jsonl > tweets_noretweets.jsonl
```
Or unshorten the URLs (requires [unshrtn](https://github.com/docnow/unshrtn)):
```shell
cat tweets.jsonl | utils/unshrtn.py > unshortened.jsonl
```
Once you have unshortened the URLs, you can rank them by how often they were tweeted:
```shell
cat unshortened.jsonl | utils/urls.py | sort | uniq -c | sort -nr > urls.txt
```
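For readers without the Unix tools at hand, the `sort | uniq -c | sort -nr` step of the pipeline above can be approximated in Python with `collections.Counter`. The `rank_urls` helper is a hypothetical name used only for this sketch, and it assumes its input is an iterable of URL strings, one per item.

```python
from collections import Counter


def rank_urls(urls):
    """Return (url, count) pairs, most frequent first."""
    counts = Counter(u.strip() for u in urls if u.strip())
    return counts.most_common()


# Stand-in data in place of the output of utils/urls.py.
sample = ["https://a.example/", "https://b.example/", "https://a.example/"]
for url, count in rank_urls(sample):
    print(count, url)
```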
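## twarc-report
More scripts that generate csv or json output suitable for [D3.js](http://d3js.org/) visualizations can be found in the [twarc-report](https://github.com/pbinkley/twarc-report) project. The `directed.py` script, formerly part of twarc, has also moved to twarc-report and been renamed `d3graph.py`.
The two links below are examples of the HTML D3 visualizations these scripts produce: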
1. [timelines](https://wallandbinkley.com/twarc/bill10/)
2. [directed graph of retweets](https://wallandbinkley.com/twarc/bill10/directed-retweets.html)
[英语]: https://github.com/DocNow/twarc/blob/main/README.md
[日语]: https://github.com/DocNow/twarc/blob/main/README_ja_jp.md
[葡萄牙语]: https://github.com/DocNow/twarc/blob/main/README_pt_br.md
[西班牙语]: https://github.com/DocNow/twarc/blob/main/README_es_mx.md
[瑞典语]: https://github.com/DocNow/twarc/blob/main/README_sv_se.md
[斯瓦希里语]: https://github.com/DocNow/twarc/blob/main/README_sw_ke.md
[ISO 639-1]: https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes
================================================
FILE: docs/twarc2_en_us.md
================================================
# twarc2
twarc2 is a command line tool and Python library for archiving Twitter JSON
data. Each tweet is represented as a JSON object that was returned from the
Twitter API. Since Twitter's introduction of their [v2
API](https://developer.twitter.com/en/docs/twitter-api/api-reference-index#v2)
the JSON representation of a tweet is conditional on the types of fields and
expansions that are requested. twarc2 does the work of requesting the highest
fidelity representation of a tweet by requesting all the available data for
tweets.
Tweets are streamed or stored as [line-oriented
JSON](https://en.wikipedia.org/wiki/JSON_Streaming#Line-delimited_JSON). twarc2
will handle Twitter API's [rate
limits](https://dev.twitter.com/rest/public/rate-limiting) for you. In addition
to letting you collect tweets twarc can also help you collect users and hydrate
tweet ids. It also has a collection of [plugins](plugins) you can use to do
things with the collected JSON data (such as converting it to CSV).
twarc2 was developed as part of the [Documenting the Now](http://www.docnow.io)
project which was funded by the [Mellon Foundation](https://mellon.org/).
## Install
Before using twarc you will need to create an application and attach it to a
project on your [Twitter Developer Portal](https://developer.twitter.com/en/portal/projects-and-apps). A ["Project"](https://developer.twitter.com/en/docs/projects/overview) is like a container for an "Application" with a specific purpose.
If you have Academic Access you should see an "Academic Research" Project;
if not, you should see only a "Standard" Project. Academic Access uses a separate endpoint; see [here](twitter-developer-access.md) for notes on this.
Once you've created your application, note down the Bearer token and/or the consumer key and consumer secret
(which may also be called the API Key and API Secret), and then optionally click to
generate an access token and access token secret. With these credentials
in hand you are ready to start using twarc.
1. install [Python 3](http://python.org/download)
2. [pip](https://pip.pypa.io/en/stable/installing/) install twarc from a terminal (such as the Windows Command Prompt available in the "start" menu, or the [OSX Terminal application](https://support.apple.com/en-au/guide/terminal/apd5265185d-f365-44cb-8b09-71a064a42125/mac)):
```
pip install --upgrade twarc
```
### Homebrew (macOS only)
For macOS users, you can also install `twarc` via [Homebrew](https://brew.sh/):
```bash
brew install twarc
```
### Windows
If you installed with pip and see a "failed to create process" when running twarc try reinstalling like this:
python -m pip install --upgrade --force-reinstall twarc
## Quickstart:
First you're going to need to tell twarc about your application API keys and
grant access to one or more Twitter accounts:
twarc2 configure
Then try out a search:
twarc2 search "blacklivesmatter" results.jsonl
Or maybe you'd like to collect tweets as they happen?
twarc2 filter "blacklivesmatter" results.jsonl
See below for the details about these commands and more.
## Configure
Once you've got your Twitter developer access set up you can tell twarc what your credentials are with the `configure` command.
twarc2 configure
This will store your credentials in your home directory so you don't have to
keep entering them in. You can use most of twarc's functionality by simply
configuring the *bearer token*, but if you want it to be complete you can enter
in the *API key* and *API secret*.
You can also set the keys in the system environment (`CONSUMER_KEY`,
`CONSUMER_SECRET`, `ACCESS_TOKEN`, `ACCESS_TOKEN_SECRET`) or using command line
options (`--consumer-key`, `--consumer-secret`, `--access-token`,
`--access-token-secret`).
## Search
This uses Twitter's [tweets/search/recent](https://developer.twitter.com/en/docs/twitter-api/tweets/search/api-reference/get-tweets-search-recent) and [tweets/search/all](https://developer.twitter.com/en/docs/twitter-api/tweets/search/api-reference/get-tweets-search-all) endpoints to download *pre-existing* tweets matching a given query. This command will search for any tweets mentioning *blacklivesmatter* from the last 7 days.
twarc2 search "blacklivesmatter" results.jsonl
If you have access to the [Academic Research Product Track](https://developer.twitter.com/en/products/twitter-api/academic-research) you can search the full archive of tweets by using the `--archive` option.
twarc2 search --archive "blacklivesmatter" results.jsonl
The queries can be a lot more expressive than matching a single term. For
example this query will search for tweets containing either `blacklivesmatter`
or `blm` that were sent to the user \@deray.
twarc2 search "(blacklivesmatter OR blm) to:deray" results.jsonl
The best way to get familiar with Twitter's search syntax is to consult Twitter's [Building queries for Search Tweets](https://developer.twitter.com/en/docs/twitter-api/tweets/search/integrate/build-a-query) documentation.
You also should definitely check out Igor Brigadir's *excellent* reference guide
to the Twitter Search syntax:
[Advanced Search on Twitter](https://github.com/igorbrigadir/twitter-advanced-search/blob/master/README.md).
There are lots of hidden gems in there that the advanced search form doesn't
make readily apparent.
### Limit
Because there is a 500,000 tweet limit (5, or sometimes 10 million for Academic Research Track)
you may want to limit the number of tweets you retrieve by using `--limit`:
twarc2 search --limit 5000 "blacklivesmatter" results.jsonl
### Time
You can also limit to a particular time range using `--start-time` and/or
`--end-time`, which can be especially useful in conjunction with `--archive`
when you are searching for historical tweets.
twarc2 search --start-time 2014-07-17 --end-time 2014-07-24 '"eric garner"' tweets.jsonl
If you leave off `--start-time` or `--end-time` it will be open on that side. So
for example to get all "eric garner" tweets before 2014-07-24 you would just
leave off the `--start-time`:
twarc2 search --end-time 2014-07-24 '"eric garner"' tweets.jsonl
### Sort Order
By default, Twitter returns the results ordered by their published date with the newest tweets being first.
To alter this behavior, it is possible to specify the `--sort-order` parameter.
Currently, it supports `recency` (the default) or `relevancy`.
In the latter case, tweets are ordered based on what Twitter determines to be the best results for your query.
## Searches
Searches works like the [search](#search) command, but instead of taking a single query, it reads from a file containing many queries. You can use the same limit and time options just like a single search command, but it will be applied to every query.
The input file for this command needs to be a plain text file, with one line for each query you want to run, for example you might have a file called `animals.txt` with the following lines:
cat
dog
mouse OR mice
Note that each line will be passed through directly to the Twitter API - if you have quoted strings, they will be treated as a phrase search by the Twitter API, which might not be what you intended.
If you run the following `searches` command, `animals.json` will contain up to about 100 tweets for each query in the input file:
twarc2 searches --limit 100 animals.txt animals.json
You can use the `--archive` and `--start-time` flags just like a regular search command too, in this case to search the full archive of all tweets for the first day of 2020:
twarc2 searches --archive --start-time 2020-01-01 --end-time 2020-01-02 animals.txt animals.json
You can also use the `--counts-only` flag to check volumes first. This produces a csv file in the same format as the [counts](#counts) command with the `--csv` flag, with the addition of a column containing the query for that row.
twarc2 searches --counts-only animals.txt animals_counts.csv
One more thing - if you have a lot of searches you want to run, you might want to consider using the `--combine-queries` flag. This combines consecutive queries in the file into a single longer query, meaning you issue fewer API calls and potentially collect fewer duplicate tweets that match more than one query. Using this on the `animals.txt` file as input will combine the three queries into the single longer query `(cat) OR (dog) OR (mouse OR mice)`, and only issue one logical query.
twarc2 searches --combine-queries animals.txt animals_combined.json
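The combination described above can be sketched in Python as follows. This is only an illustration of the `(a) OR (b)` pattern quoted in the text, not twarc's implementation; `combine_queries` is a hypothetical helper, and the sketch ignores any limit the API places on query length.

```python
def combine_queries(queries):
    """Wrap each non-empty query in parentheses and join them with OR."""
    return " OR ".join(f"({q.strip()})" for q in queries if q.strip())


# The three lines of the animals.txt example file.
animals = ["cat", "dog", "mouse OR mice"]
print(combine_queries(animals))
```

Wrapping every query in parentheses keeps an inner `OR`, as in `mouse OR mice`, grouped with its own query.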
## Stream
The `stream` command will use Twitter's API
[tweets/search/stream](https://developer.twitter.com/en/docs/twitter-api/tweets/filtered-stream/api-reference/get-tweets-search-stream)
endpoint to collect tweets as they happen. In order to use it you first need to
create one or more [rules]. For example:
twarc2 stream-rules add blacklivesmatter
You can list your active stream rules:
twarc2 stream-rules list
And you can collect the data from the stream, which will bring down any tweets that match your rules:
twarc2 stream stream.jsonl
When you want to stop you use `ctrl-c`. This only stops the stream but doesn't delete your stream rule. To remove a rule you can:
twarc2 stream-rules delete blacklivesmatter
## Sample
Use the `sample` command to listen to Twitter's [tweets/sample/stream](https://developer.twitter.com/en/docs/twitter-api/tweets/sampled-stream/api-reference/get-tweets-sample-stream) API for a "random" sample of recent public statuses. The sampling is based on the millisecond part of the tweet timestamp.
twarc2 sample sample.jsonl
## Users
If you have a file of user ids you can fetch the user metadata for them with
the `users` command:
twarc2 users users.txt users.jsonl
If the file contains usernames instead of user ids you can use the `--usernames` option:
twarc2 users --usernames users.txt users.jsonl
## Followers
You can fetch the followers of an account using the `followers` command:
twarc2 followers deray users.jsonl
## Following
To get the users that a user is following you can use `following`:
twarc2 following deray users.jsonl
The result will include exactly one user id per line. The response order is
reverse chronological, or most recent followers first.
## Timeline
The `timeline` command will use Twitter's [user timeline API](https://developer.twitter.com/en/docs/twitter-api/tweets/timelines/api-reference/get-users-id-tweets) to collect the most recent tweets posted by the user indicated by screen_name.
twarc2 timeline deray tweets.jsonl
## Conversation
You can retrieve a conversation thread using the tweet ID at the head of the
conversation:
twarc2 conversation 266031293945503744 > conversation.jsonl
## Likes
Twarc supports the two approaches that the Twitter API exposes for collecting likes via the `liked-tweets` and `liking-users` commands.
The `liked-tweets` command returns the tweets that have been liked by a specific account. The account is specified by its user ID, which in the following example belongs to the account of Twitter's founder:
twarc2 liked-tweets 12 jacks-likes.jsonl
In this case the output file contains all of the likes of publicly accessible tweets. Note that the order of likes is not guaranteed by the API, but is probably reverse chronological, or most recent likes by that account first. The underlying tweet objects contain no information about when the tweet was liked.
The `liking-users` command returns the user profiles of the accounts that have liked a specific tweet (specified by the ID of the tweet):
twarc2 liking-users 1460417326130421765 liking-users.jsonl
In this example the output file contains all of the user profiles of the publicly accessible accounts that have liked that specific tweet. Note that the order of profiles is not guaranteed by the API, but is probably reverse chronological, or the profile of the most recent like for that account first. The underlying profile objects contain no information about when the tweet was liked.
Note that likes of tweets that are not publicly accessible, or likes by accounts that are protected will not be retrieved by either of these methods. Therefore, the metrics available on a tweet object (under the `public_metrics.like_count` field) will likely be higher than the number of likes you can retrieve via the Twitter API using these endpoints.
## Retweets
You can retrieve the user profiles of publicly accessible accounts that have retweeted a specific tweet, using the `retweeted-by` command and the ID of the tweet as an identifier. For example:
twarc2 retweeted-by 1460417326130421765 retweeting-users.jsonl
Unfortunately this only returns the user profiles (presumably in reverse chronological order) of the retweeters of that tweet - this means that important information, like when the tweet was retweeted is not present in the returned object.
## Dehydrate
The `dehydrate` command generates an id list from a file of tweets:
twarc2 dehydrate tweets.jsonl tweet-ids.txt
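Conceptually, dehydrating means reading each JSON line and keeping only the tweet ids. The sketch below illustrates that idea and is not twarc's own code. It assumes each line is either a single tweet object with an `id` field or an API response whose `data` field holds one tweet or a list of tweets; `tweet_ids` is a hypothetical helper name.

```python
import json


def tweet_ids(lines):
    """Yield the id of every tweet found in an iterable of JSON lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        obj = json.loads(line)
        # Assumed layout: tweets sit under "data", or the line is a tweet itself.
        data = obj.get("data", obj)
        tweets = data if isinstance(data, list) else [data]
        for tweet in tweets:
            yield tweet["id"]


# Stand-in data: one response page with two tweets, then one bare tweet.
sample = ['{"data": [{"id": "1"}, {"id": "2"}]}', '{"id": "3"}']
for tweet_id in tweet_ids(sample):
    print(tweet_id)
```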
## Hydrate
twarc's `hydrate` command will read a file of tweet identifiers and write out the tweet JSON for them using Twitter's [tweets](https://developer.twitter.com/en/docs/twitter-api/tweets/lookup/api-reference/get-tweets)
API endpoint:
twarc2 hydrate ids.txt tweets.jsonl
The input file, `ids.txt` is expected to be a file that contains a tweet identifier on each line, without quotes or a header:
```
919505987303886849
919505982882844672
919505982602039297
```
Twitter API's [Terms of Service](https://dev.twitter.com/overview/terms/policy#6._Be_a_Good_Partner_to_Twitter) discourage people from making large amounts of raw Twitter data available on the Web. The data can be used for research and archived for local use, but not shared with the world. Twitter does allow files of tweet identifiers to be shared, which can be useful when you would like to make a dataset of tweets available. You can then use Twitter's API to *hydrate* the data, or to retrieve the full JSON for each identifier. This is particularly important for [verification](https://en.wikipedia.org/wiki/Reproducibility) of social media research.
## Places
The search and stream APIs allow you to search by places. But in order to use
them you need to know the identifier for a specific place. twarc's
`places` command will let you search by the place name, geo coordinates, or ip
address. For example:
twarc2 places Ferguson
Which will output something like:
```shell
$ twarc2 places Ferguson
Ferguson, MO, United States [id=0a62ce0f6aa37536]
Ruisseau-Ferguson, Québec, Canada [id=25283a1f59449e8f]
Ferguson, Victoria, Australia [id=2538e66b7e5c082c]
Ferguson Road Initiative, Dallas, United States [id=368aad647311292a]
Ferguson, Western Australia, Australia [id=45f20c78d803ad84]
Ferguson, PA, United States [id=00c92e14361c9674]
Ferguson, KY, United States [id=0190ea5612aaae32]
```
You can then use one of the ids in a search:
twarc2 search "place:0a62ce0f6aa37536" tweets.jsonl
You can also search by geo-coordinates (lat,lon) and IP address. If you would prefer to see the full JSON response with the bounding boxes use the `--json` option.
## Command Line Usage
Below is what you see when you run `twarc2 --help`.
::: mkdocs-click:
:module: twarc.command2
:command: twarc2
:depth: 1
================================================
FILE: docs/twitter-developer-access.md
================================================
# Twitter Developer Access
If you have established that you would like to use Twitter Data in your study, you will need access to the API. There are several steps required to get access to the API. This is a guide on how best to engage with this process. Allow plenty of time for this.
Twitter has made the process of accessing their API more strict. There are a number of restricted use cases that may require you implement additional safeguards.
Before applying, read the Terms of Service for Developers and the [Restricted Use Cases](https://developer.twitter.com/en/developer-terms/more-on-restricted-use-cases); both are short and relevant.
## Step 0: Have a Twitter account in good standing
Create or edit your Twitter profile to fit your person or organization, preferably in English. Make sure it is public and that you have done the basics: verify your email and phone number (do not use a VoIP service), set a non-default profile picture and header, add links to your research group or website, write a good description that identifies you, and preferably have some friends and followers who are already on Twitter in your research community. Use a good, stable email provider (such as Gmail) or your institutional email, as long as it is reliable and you can see any emails that may end up in spam, just in case.
## Step 1: Applying for a Developer Account
Fill out the forms for a new Individual developer Account here: <https://developer.twitter.com/en/apply-for-access>. Team accounts are not supported with Academic Access, so do not apply for a Team account. Pay attention to the specifics of each question: especially about sharing data outside of your organization, and with other government entities. Wait for a reply. This may take a couple of weeks.
## Step 2: Apply for the special Academic Access v2 Endpoint
Even if you specify "Academic" as your use case in your developer application form, you will not automatically get access to the [new Search endpoint](https://developer.twitter.com/en/docs/twitter-api/tweets/search/api-reference/get-tweets-search-all) with higher limits for academic use. You must fill in an additional form: <https://developer.twitter.com/en/portal/petition/academic/is-it-right-for-you>
Twitter generally prefers to grant access to faculty and postgrad researchers, not undergrad or masters students or contractors or collaborators. It may be better for the principal investigator or professor to log in from an institution account or their own one, provided it is in good standing and has an obviously identifiable online academic presence.
This application may also take a couple of days or weeks.
## Step 3: Create a Project and App
A Project with Academic Access should be created for you, or if you did not get Academic Access, you can create a new Standard Project. On your Dashboard <https://developer.twitter.com/en/portal/dashboard> you should see "Academic Research" or "Standard" and "Standalone Apps".
Before accessing the v2 API, you will need to create an App or use an existing one and add it to the Academic Access Project first. You can only have 1 App assigned to 1 Project.
When Creating an app, take note of the keys you are given:
API Key:
```
hCe77nsrgew3gsdhSDGFSgsdf
```
API Secret:
```
1jWERGWBrtRTWBTwGFDHGFH66SDFGSDFGSSDFGSDFGSSDFGa11
```
Bearer Token:
```
AAAAAAAAAAAAAAAAAAAAAAAsdfgsAAAAvSDFGSDRgssdfSDFGSDF44gsd4E%3Dkk33345336dfsgsdgsdgsdASGASDGadsGAFAKJGYIUYUIDGGKK
```
These are fake but have the same format as real ones. Note the `%` sign in the Bearer Token - this can often cause errors when copy pasting or providing this token in a command line. Other common causes of errors are including a trailing space, or extra `"` or `'` quotes or not quoting the string in code or command line. This depends on implementation.
These are important to save and [store as you would a password](https://developer.twitter.com/en/docs/authentication/guides/authentication-best-practices).
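The copy-paste pitfalls mentioned above (trailing spaces, stray quotes) can be guarded against in code before a token is used. The following is a minimal sketch; `clean_token` is a hypothetical helper and not part of twarc or any Twitter library. It deliberately leaves the `%` sequence in the token untouched.

```python
def clean_token(raw):
    """Strip surrounding whitespace and one pair of matching quotes."""
    token = raw.strip()
    if len(token) >= 2 and token[0] == token[-1] and token[0] in "'\"":
        token = token[1:-1].strip()
    return token


# A fake token pasted with extra quotes, a space and a newline.
pasted = ' "AAAAfake%3Dtoken" \n'
print(clean_token(pasted))
```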
Continue to "App Settings" and fill in the description field of the app. You don't need to change any other settings here. Generally you will only need Read Only Access and will not need "3-legged OAuth" or callback URLs unless you plan on using the [Account Activity API](https://developer.twitter.com/en/docs/twitter-api/enterprise/account-activity-api/overview), for example to make an interactive bot.
A project must *contain* an app. The difference between a [Project](https://developer.twitter.com/en/docs/projects/overview) and [App](https://developer.twitter.com/en/docs/apps/overview) is sometimes confusing.
*Standalone Apps* are for `v1.1` endpoints, Standard and Academic Access *Projects* are for `v2` endpoints.
## Step 4: Collaborating with Others
Now that you have your keys and tokens, you can start using the API. You may be working with other people on implementations, so you may have to share your keys with someone at some point. Do not share the Twitter username and password you use for the Developer Dashboard. Currently Twitter's "Teams" functionality is also incompatible with Academic Access. The best way is to provide your collaborator with the keys in a plain text configuration file that you share securely, or as environment variables. When someone has your keys, they have full access to the API on your behalf.
Be careful not to commit your keys into a public repository or make them visible to the public - do not include them in a client side js script for example. Most apps will ask for API Key and Secret, but "Consumer Key" is "API Key" and "Consumer Secret" is "API Secret".
For Academic Access, there is only one endpoint that takes Bearer (App Only) authentication, so in most cases, the Bearer Token is all you need to share.
## Step 5: Next Steps
Install `twarc`, and run `twarc2 configure` to set it up.
To make arbitrary API calls for testing, [twurl](https://github.com/twitter/twurl) is a good tool, when combined with [jq](https://stedolan.github.io/jq/).
To get help, a good place is the [Developer Forums](https://twittercommunity.com/), or the [DocNow Slack](https://docs.google.com/forms/d/1Wk0JdF2Cty2VHMqpf_QlJXVKQdUtfeeFhaYRben3qaM/viewform), or [Stackoverflow](https://stackoverflow.com/) for implementation details, or the repository [Issues](https://github.com/DocNow/twarc) if it's an issue with twarc or one of the addons.
To share and publish a Twitter Dataset, extract the Tweet IDs and or User IDs, and format these as 1 ID per line in a plain text file (optionally, you can compress this file). This will make your dataset easier to process for others. See the [DocNow Catalog](https://catalog.docnow.io/) and tools like [Zenodo](https://zenodo.org/) and [Figshare](https://figshare.com/).
================================================
FILE: docs/windows10.md
================================================
# twarc2 on Windows 10
This guide assumes you already have a Twitter Developer Account, a registered App with your keys and a Bearer Token, and Python installed on Windows.
## Prerequisites and Installation
You must have Python installed and working on Windows.
Python will be located in different places on your computer depending on whether you installed it from the [official website](https://www.python.org/downloads/windows/), from the [Microsoft App store](https://www.microsoft.com/en-us/p/python-38/9mssztt1n39l), or via [Anaconda](https://www.anaconda.com/products/individual#windows).
Check that you can run these successfully:
Open the command line `cmd.exe` or `PowerShell` or `Windows Terminal Preview` and run:
`python --version`
and
`pip --version`
If both give you some version output without errors everything is ready to go. Otherwise, install and configure `python` and `pip`.
`twarc2` CLI works best through [Windows Terminal Preview](https://www.microsoft.com/en-us/p/windows-terminal-preview/9n8g5rfz9xk3?activetab=pivot:overviewtab)
## Setting up twarc2
Install `twarc2` with
`pip install --upgrade twarc`
If you get a warning like
```
WARNING: The scripts twarc.exe and twarc2.exe are installed in 'C:\Users\t495\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\Scripts' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
```
You will need to add that folder to the PATH.
This will be different for your machine, so make sure to copy the full folder location from the command prompt, without the `'` quotes with `CTRL+C`.
Make sure that folder is set in PATH System Variables:
In Settings, find "edit the system environment variables"
After clicking on "Environment Variables"
Edit the "Path" variable in User Variables and add a new entry, in my case it was `C:\Users\t495\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\Scripts` but for you it will be different. Copy this from the warning it gives you, because it varies.
You should now be able to run `twarc2` from the command line:
`twarc2`
If you can see the instructions, everything is ready to go.
In powershell or command prompt, run:
`twarc2 configure`
Paste in your Bearer token, taking care not to accidentally copy an extra new line or space. Typing these in manually is not recommended. The API Secret entry will not display what is being typed, but it still accepts input. If something goes wrong, you can repeat the command and start over. The keys will be saved in a file that you can view with Notepad, usually `C:\Users\youraccount\AppData\Roaming\twarc\config`; twarc prints the location of this file after the command runs.
When this is completed, twarc2 is ready to use.
## Escaping `"` Characters in Windows
The query you specify to search can contain `"` quotes for phrases, spaces and other special characters like `:` and `()`. When entered directly into the prompt these can be interpreted as part of the command, not part of the command line argument value. Windows has an odd way of escaping characters in the command line.
To use a `"` in a query, change it to `""` in Windows. The more common escape `\"` does not work.
For example, if you want to search for tweets that contain the phrase `"live laugh love"` or `"home sweet home"` in English, from the US, the query would be:
```
lang:en ("live laugh love" OR "home sweet home") place_country:US
```
After changing each `"` to `""`, the twarc2 command for this (`--limit` is optional) would be:
```
twarc2 search --limit 500 "lang:en (""live laugh love"" OR ""home sweet home"") place_country:US" output.json
```
This Stack Overflow answer has the long version that explains why this works: https://stackoverflow.com/a/15262019
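The doubling rule above is mechanical, so it can be applied with a few lines of Python when preparing a command to paste into the prompt. This is only a sketch: `quote_for_windows` is a hypothetical helper, not part of twarc, and it handles only the `"` doubling described here.

```python
def quote_for_windows(query: str) -> str:
    """Double every `"` in the query and wrap the result in one pair of quotes."""
    return '"' + query.replace('"', '""') + '"'


query = 'lang:en ("live laugh love" OR "home sweet home") place_country:US'

# Prints the same command line as the example above.
print("twarc2 search --limit 500", quote_for_windows(query), "output.json")
```

This is meant for building text to paste into the Windows prompt. When launching a process from Python with an argument list, the quoting is handled for you and the query should be passed unmodified.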
## Output Format Errors:
If you see this kind of error, for example when using `twarc2 flatten`:
> ⚡ Expecting value: line 1 column 1 (char 0)
It means the file was saved incorrectly. There is an edge case in Windows when writing output: do not use `>` to redirect `stdout`. Redirection alters how files are written and adds a BOM (Byte Order Mark) that makes the files unreadable to twarc later, e.g. when using `twarc2 flatten`. To fix the file, edit it in a hex editor to remove the first 2 bytes.
For example, this will give you a bad file with a BOM:
`twarc2 search --limit 100 "dogs" > dogs.json`
While this will give you a correctly written UTF-8 file:
`twarc2 search --limit 100 "dogs" dogs.json`
Do not redirect stdout to a file in Windows. Instead, specify the output file as a command line argument.
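As an alternative to hex editing, an already damaged file can be rewritten as plain UTF-8 with a short script. This is a hedged sketch rather than a twarc feature: the function names are hypothetical, and it assumes the file begins with either a UTF-8 or a UTF-16 byte order mark, which may not match every shell or Windows version.

```python
import codecs


def decode_redirected_output(raw: bytes) -> str:
    """Decode bytes that may start with a byte order mark (BOM)."""
    # UTF-8 BOM (3 bytes): strip it, the rest is ordinary UTF-8.
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):].decode("utf-8")
    # UTF-16 BOM (2 bytes): the "utf-16" codec uses it to pick the byte order.
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return raw.decode("utf-16")
    # No BOM found: assume the file is already plain UTF-8.
    return raw.decode("utf-8")


def rewrite_as_utf8(src_path: str, dst_path: str) -> None:
    """Read a possibly BOM-prefixed file and write it back as UTF-8 without a BOM."""
    with open(src_path, "rb") as src:
        text = decode_redirected_output(src.read())
    # newline="" keeps the line endings exactly as they were decoded.
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)
```

For example, `rewrite_as_utf8("dogs.json", "dogs_fixed.json")` writes a new file and leaves the original untouched, so nothing is lost if the assumption about the encoding turns out to be wrong.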
================================================
FILE: mkdocs.yml
================================================
site_name: twarc
site_url: https://readthedocs.org/projects/twarc-project/
site_description: Collect Twitter JSON data from the command line.
repo_url: https://github.com/docnow/twarc
repo_name: twarc
edit_uri: edit/main/docs/
theme:
  name: "material"
  logo: images/docnow.png
  palette:
    scheme: preference
nav:
  - Home: README.md
  - twarc2:
    - twarc2 (en): twarc2_en_us.md
  - twarc1:
    - twarc1 (en): twarc1_en_us.md
    - twarc1 (es): twarc1_es_mx.md
    - twarc1 (ja): twarc1_ja_jp.md
    - twarc1 (pt): twarc1_pt_br.md
    - twarc1 (sv): twarc1_sv_se.md
    - twarc1 (sw): twarc1_sw_ke.md
    - twarc1 (zw): twarc1_zw_zh.md
  - Plugins: plugins.md
  - Tutorial: tutorial.md
  - Resources: resources.md
  - Twitter Developer Access: twitter-developer-access.md
  - Windows 10: windows10.md
  - Library API:
    - api/client.md
    - api/client2.md
    - api/library.md
    - api/expansions.md
plugins:
  - search
  - mkdocstrings
markdown_extensions:
  - mkdocs-click
  - pymdownx.highlight
  - pymdownx.superfences
================================================
FILE: pyproject.toml
================================================
[project]
name = "twarc"
version = "2.14.1"
description = "Archive tweets from the command line"
license = "MIT"
readme = "README.md"
requires-python = ">=3.13"
dependencies = [
"click>=7,<9",
"click-config-file>=0.6",
"click-plugins>=1",
"humanize>=3.9",
"python-dateutil>=2.8",
"requests_oauthlib>=1.3",
"tqdm>=4.62",
"twarc-csv>=0.7.2",
]
[dependency-groups]
dev = [
"black>=25.9.0",
"pytest>=8.4.2",
"pytest-black>=0.6.0",
"python-dotenv>=1.2.1",
"pytz>=2025.2",
"toml>=0.10.2",
]
[project.scripts]
twarc = "twarc.command:main"
twarc2 = "twarc.command2:twarc2"
[tool.pytest.ini_options]
addopts = "--verbose --black"
[tool.uv.workspace]
members = [
"tmp/twarc",
]
[build-system]
requires = ["uv_build>=0.8.3,<0.9.0"]
build-backend = "uv_build"
================================================
FILE: requirements-mkdocs.txt
================================================
click>=7,<9
click-config-file>=0.6
click-plugins>=1
humanize>=3.9
python-dateutil>=2.8
requests_oauthlib>=1.3
tqdm>=4.62
mkdocs>=1.2
mkdocs-click>=0.4
mkdocs-material>=7.2
mkdocstrings[python]>=0.15
================================================
FILE: setup.cfg
================================================
[tool:pytest]
addopts=--verbose --black
[aliases]
test=pytest
================================================
FILE: src/twarc/__init__.py
================================================
from .client import Twarc
from .client2 import Twarc2
from .version import version
from .expansions import ensure_flattened
================================================
FILE: src/twarc/__main__.py
================================================
from twarc.command2 import twarc2
if __name__ == "__main__":
    twarc2(prog_name="python -m twarc2")
================================================
FILE: src/twarc/client.py
================================================
# -*- coding: utf-8 -*-
import os
import re
import sys
import json
import types
import logging
import datetime
import requests
import ssl
from requests.exceptions import ConnectionError
from requests.packages.urllib3.exceptions import ProtocolError
from .decorators import *
from twarc.version import version, user_agen
SYMBOL INDEX (369 symbols across 35 files)
FILE: src/twarc/client.py
class Twarc (line 38) | class Twarc(object):
method __init__ (line 46) | def __init__(
method search (line 101) | def search(
method premium_search (line 174) | def premium_search(
method timeline (line 258) | def timeline(
method user_lookup (line 338) | def user_lookup(self, ids, id_type="user_id"):
method follower_ids (line 384) | def follower_ids(self, user, max_pages=None):
method friend_ids (line 417) | def friend_ids(self, user, max_pages=None):
method filter (line 452) | def filter(
method sample (line 538) | def sample(self, event=None, record_keepalive=False):
method dehydrate (line 596) | def dehydrate(self, iterator):
method hydrate (line 607) | def hydrate(self, iterator, trim_user=False):
method tweet (line 654) | def tweet(self, tweet_id):
method retweets (line 660) | def retweets(self, tweet_ids):
method trends_available (line 682) | def trends_available(self):
method trends_place (line 693) | def trends_place(self, woeid, exclude=None):
method trends_closest (line 711) | def trends_closest(self, lat, lon):
method replies (line 723) | def replies(self, tweet, recursive=False, prune=()):
method list_members (line 783) | def list_members(
method oembed (line 816) | def oembed(self, tweet_url, **params):
method get (line 837) | def get(self, *args, **kwargs):
method post (line 882) | def post(self, *args, **kwargs):
method connect (line 909) | def connect(self):
method get_keys (line 961) | def get_keys(self):
method validate_keys (line 996) | def validate_keys(self):
method load_config (line 1038) | def load_config(self):
method save_config (line 1072) | def save_config(self, profile):
method configure (line 1096) | def configure(self):
method default_config (line 1203) | def default_config(self):
method is_standard_v1 (line 1206) | def is_standard_v1(self, url):
FILE: src/twarc/client2.py
class Twarc2 (line 33) | class Twarc2:
method __init__ (line 38) | def __init__(
method _prepare_params (line 110) | def _prepare_params(self, **kwargs):
method _search (line 189) | def _search(
method _lists (line 327) | def _lists(
method list_followers (line 356) | def list_followers(
method list_members (line 392) | def list_members(
method list_memberships (line 430) | def list_memberships(
method owned_lists (line 468) | def owned_lists(
method followed_lists (line 506) | def followed_lists(
method pinned_lists (line 544) | def pinned_lists(
method list_lookup (line 582) | def list_lookup(self, list_id, expansions=None, list_fields=None, user...
method list_tweets (line 615) | def list_tweets(
method search_recent (line 651) | def search_recent(
method search_all (line 712) | def search_all(
method counts_recent (line 782) | def counts_recent(
method counts_all (line 834) | def counts_all(
method tweet_lookup (line 888) | def tweet_lookup(
method user_lookup (line 947) | def user_lookup(
method sample (line 1013) | def sample(
method add_stream_rules (line 1057) | def add_stream_rules(self, rules):
method get_stream_rules (line 1073) | def get_stream_rules(self):
method delete_stream_rule_ids (line 1086) | def delete_stream_rule_ids(self, rule_ids):
method stream (line 1102) | def stream(
method _stream (line 1145) | def _stream(self, url, params, event, record_keepalive, tries=30):
method _timeline (line 1200) | def _timeline(
method timeline (line 1272) | def timeline(
method mentions (line 1328) | def mentions(
method following (line 1385) | def following(
method followers (line 1418) | def followers(
method liking_users (line 1451) | def liking_users(
method liked_tweets (line 1484) | def liked_tweets(
method retweeted_by (line 1522) | def retweeted_by(
method quotes (line 1555) | def quotes(
method get (line 1586) | def get(self, *args, **kwargs):
method get_paginated (line 1603) | def get_paginated(self, *args, **kwargs):
method post (line 1668) | def post(self, url, json_data):
method connect (line 1683) | def connect(self):
method compliance_job_list (line 1728) | def compliance_job_list(self, job_type, status):
method compliance_job_get (line 1755) | def compliance_job_get(self, job_id):
method compliance_job_create (line 1780) | def compliance_job_create(self, job_type, job_name, resumable=False):
method geo (line 1810) | def geo(
method _id_exists (line 1865) | def _id_exists(self, user):
method _ensure_user_id (line 1875) | def _ensure_user_id(self, user):
method _ensure_user (line 1893) | def _ensure_user(self, user):
method _check_for_disconnect (line 1911) | def _check_for_disconnect(self, data):
function _ts (line 1924) | def _ts(dt):
function _utcnow (line 1942) | def _utcnow():
function _append_metadata (line 1952) | def _append_metadata(result, url):
FILE: src/twarc/command.py
function main (line 56) | def main():
function get_argparser (line 345) | def get_argparser():
function numbered_filepath (line 513) | def numbered_filepath(filepath, num):
FILE: src/twarc/command2.py
function twarc2 (line 106) | def twarc2(
function configure (line 174) | def configure(ctx):
function get_version (line 217) | def get_version():
function _search (line 224) | def _search(
class MutuallyExclusiveOption (line 299) | class MutuallyExclusiveOption(Option):
method __init__ (line 305) | def __init__(self, *args, **kwargs):
method parse_name (line 318) | def parse_name(self, name):
method handle_parse_result (line 321) | def handle_parse_result(self, ctx, opts, args):
function command_line_input_output_file_arguments (line 331) | def command_line_input_output_file_arguments(f):
function command_line_progressbar_option (line 340) | def command_line_progressbar_option(f):
function command_line_search_options (line 353) | def command_line_search_options(f):
function command_line_timelines_options (line 374) | def command_line_timelines_options(f):
function _validate_max_results (line 399) | def _validate_max_results(context, parameter, value):
function command_line_search_archive_options (line 442) | def command_line_search_archive_options(f):
function _validate_expansions (line 466) | def _validate_expansions(context, parameter, value):
function command_line_expansions_options (line 482) | def command_line_expansions_options(f):
function command_line_expansions_shortcuts (line 532) | def command_line_expansions_shortcuts(f):
function _process_expansions_shortcuts (line 575) | def _process_expansions_shortcuts(kwargs):
function command_line_verbose_options (line 599) | def command_line_verbose_options(f):
function search (line 633) | def search(
function counts (line 684) | def counts(
function tweet (line 781) | def tweet(T, tweet_id, outfile, pretty, **kwargs):
function followers (line 814) | def followers(T, user, outfile, limit, max_results, hide_progress):
function following (line 858) | def following(T, user, outfile, limit, max_results, hide_progress):
function liking_users (line 902) | def liking_users(T, tweet_id, outfile, limit, max_results, hide_progress):
function retweeted_by (line 952) | def retweeted_by(T, tweet_id, outfile, limit, max_results, hide_progress):
function quotes (line 1004) | def quotes(T, tweet_id, outfile, limit, max_results, hide_progress, **kw...
function liked_tweets (line 1058) | def liked_tweets(T, user_id, outfile, limit, max_results, hide_progress):
function sample (line 1086) | def sample(T, outfile, limit, **kwargs):
function hydrate (line 1119) | def hydrate(T, infile, outfile, hide_progress, **kwargs):
function dehydrate (line 1144) | def dehydrate(infile, outfile, id_type, hide_progress):
function users (line 1203) | def users(T, infile, outfile, usernames, hide_progress, **kwargs):
function user (line 1236) | def user(T, name_or_id, user, outfile, **kwargs):
function mentions (line 1271) | def mentions(T, user_id, outfile, hide_progress, **kwargs):
function timeline (line 1306) | def timeline(
function timelines (line 1420) | def timelines(
function _timeline_tweets (line 1526) | def _timeline_tweets(
function searches (line 1603) | def searches(
function conversation (line 1809) | def conversation(
function conversations (line 1849) | def conversations(
function flatten (line 1924) | def flatten(infile, outfile, hide_progress):
function places (line 1966) | def places(T, value, outfile, search_type, granularity, max_results, json):
function stream (line 2010) | def stream(T, outfile, limit, **kwargs):
function lists (line 2037) | def lists(T):
function lists_lookup (line 2058) | def lists_lookup(T, list_id, outfile, pretty, **kwargs):
function lists_bulk_lookup (line 2086) | def lists_bulk_lookup(T, infile, outfile, hide_progress, **kwargs):
function lists_all (line 2129) | def lists_all(T, user, outfile, limit, hide_progress, **kwargs):
function lists_owned (line 2179) | def lists_owned(T, user, outfile, limit, hide_progress, **kwargs):
function lists_followed (line 2215) | def lists_followed(T, user, outfile, limit, hide_progress, **kwargs):
function lists_memberships (line 2251) | def lists_memberships(T, user, outfile, limit, hide_progress, **kwargs):
function lists_followers (line 2288) | def lists_followers(T, list_id, outfile, limit, hide_progress, **kwargs):
function lists_members (line 2326) | def lists_members(T, list_id, outfile, limit, hide_progress, **kwargs):
function lists_tweets (line 2364) | def lists_tweets(T, list_id, outfile, limit, hide_progress, **kwargs):
function stream_rules (line 2387) | def stream_rules(T):
function list_stream_rules (line 2398) | def list_stream_rules(T, display_ids):
function _print_stream_rules (line 2405) | def _print_stream_rules(T, display_ids=False):
function add_stream_rule (line 2436) | def add_stream_rule(T, value, tag):
function delete_stream_rule (line 2457) | def delete_stream_rule(T, value):
function delete_all (line 2487) | def delete_all(T):
function compliance_job (line 2502) | def compliance_job(T):
function compliance_job_list (line 2527) | def compliance_job_list(T, job_type, status, verbose, json_output):
function compliance_job_get (line 2566) | def compliance_job_get(T, job, verbose, json_output):
function compliance_job_create (line 2607) | def compliance_job_create(T, job_type, infile, outfile, job_name, wait, ...
function compliance_job_download (line 2679) | def compliance_job_download(T, job, outfile, wait, hide_progress):
function _get_job (line 2723) | def _get_job(T, job):
function _wait_for_job (line 2741) | def _wait_for_job(T, job, hide_progress=False):
function _download_job (line 2853) | def _download_job(job, outfile=None, hide_progress=False):
function _print_compliance_job (line 2888) | def _print_compliance_job(job, verbose=False):
function _rule_str (line 2947) | def _rule_str(rule):
function _error_str (line 2954) | def _error_str(errors):
function _write (line 2979) | def _write(results, outfile, pretty=False):
function _write_with_progress (line 2984) | def _write_with_progress(
FILE: src/twarc/config.py
class ConfigProvider (line 10) | class ConfigProvider:
method __init__ (line 11) | def __init__(self):
method __call__ (line 14) | def __call__(self, file_path, cmd_name):
FILE: src/twarc/decorators.py
function rate_limit (line 11) | def rate_limit(f):
function catch_conn_reset (line 93) | def catch_conn_reset(f):
function catch_timeout (line 121) | def catch_timeout(f):
function catch_gzip_errors (line 137) | def catch_gzip_errors(f):
function interruptible_sleep (line 154) | def interruptible_sleep(t, event=None):
function filter_protected (line 169) | def filter_protected(f):
FILE: src/twarc/decorators2.py
function rate_limit (line 16) | def rate_limit(f, tries=30):
function catch_request_exceptions (line 105) | def catch_request_exceptions(f, tries=30):
function interruptible_sleep (line 151) | def interruptible_sleep(t, event=None):
class cli_api_error (line 166) | class cli_api_error:
method __init__ (line 171) | def __init__(self, f):
method __call__ (line 176) | def __call__(self, *args, **kwargs):
function requires_app_auth (line 200) | def requires_app_auth(f):
class InvalidAuthType (line 218) | class InvalidAuthType(Exception):
class FileLineProgressBar (line 224) | class FileLineProgressBar(tqdm):
method __init__ (line 230) | def __init__(self, infile, outfile, **kwargs):
method update_with_result (line 260) | def update_with_result(
class FileSizeProgressBar (line 294) | class FileSizeProgressBar(tqdm):
method __init__ (line 301) | def __init__(self, infile, outfile, **kwargs):
method update_with_result (line 318) | def update_with_result(
class TimestampProgressBar (line 352) | class TimestampProgressBar(tqdm):
method __init__ (line 358) | def __init__(self, since_id, until_id, start_time, end_time, **kwargs):
method update_with_dates (line 398) | def update_with_dates(self, start_span, end_span):
method update_with_result (line 417) | def update_with_result(self, result):
method format_dict (line 431) | def format_dict(self):
method close (line 444) | def close(self):
function _date2millis (line 451) | def _date2millis(dt):
function _millis2date (line 455) | def _millis2date(ms):
function _snowflake2millis (line 461) | def _snowflake2millis(snowflake_id):
function _millis2snowflake (line 465) | def _millis2snowflake(ms):
FILE: src/twarc/expansions.py
function extract_includes (line 113) | def extract_includes(response, expansion, _id="id"):
function flatten (line 123) | def flatten(response):
function ensure_flattened (line 232) | def ensure_flattened(data):
FILE: src/twarc/handshake.py
function handshake (line 11) | def handshake():
FILE: src/twarc/json2csv.py
function get_headings (line 17) | def get_headings():
function get_row (line 59) | def get_row(t, excel=False):
function clean_str (line 103) | def clean_str(string):
function text (line 109) | def text(t):
function coordinates (line 122) | def coordinates(t):
function hashtags (line 128) | def hashtags(t):
function media (line 139) | def media(t):
function urls (line 148) | def urls(t):
function place (line 152) | def place(t):
function retweet_id (line 157) | def retweet_id(t):
function retweet_screen_name (line 164) | def retweet_screen_name(t):
function retweet_user_id (line 171) | def retweet_user_id(t):
function favorite_count (line 178) | def favorite_count(t):
function tweet_url (line 185) | def tweet_url(t):
function user_urls (line 189) | def user_urls(t):
function tweet_type (line 201) | def tweet_type(t):
FILE: test_twarc.py
function test_search (line 49) | def test_search():
function test_search_max_pages (line 59) | def test_search_max_pages():
function test_since_id (line 66) | def test_since_id():
function test_max_id (line 76) | def test_max_id():
function test_max_and_since_ids (line 90) | def test_max_and_since_ids():
function test_paging (line 107) | def test_paging():
function test_geocode (line 117) | def test_geocode():
function test_track (line 135) | def test_track():
function test_keepalive (line 146) | def test_keepalive():
function test_follow (line 156) | def test_follow():
function test_locations (line 194) | def test_locations():
function test_languages (line 215) | def test_languages():
function test_timeline_by_user_id (line 233) | def test_timeline_by_user_id():
function test_timeline_max_pages (line 250) | def test_timeline_max_pages():
function test_timeline_by_screen_name (line 261) | def test_timeline_by_screen_name():
function test_home_timeline (line 269) | def test_home_timeline():
function test_timeline_arg_handling (line 277) | def test_timeline_arg_handling():
function test_timeline_with_since_id (line 287) | def test_timeline_with_since_id():
function test_trends_available (line 300) | def test_trends_available():
function test_trends_place (line 307) | def test_trends_place():
function test_trends_closest (line 313) | def test_trends_closest():
function test_trends_place_exclude (line 319) | def test_trends_place_exclude():
function test_follower_ids (line 326) | def test_follower_ids():
function test_follower_ids_with_user_id (line 335) | def test_follower_ids_with_user_id():
function test_follower_ids_max_pages (line 344) | def test_follower_ids_max_pages():
function test_friend_ids (line 351) | def test_friend_ids():
function test_friend_ids_with_user_id (line 360) | def test_friend_ids_with_user_id():
function test_friend_ids_max_pages (line 369) | def test_friend_ids_max_pages():
function test_user_lookup_by_user_id (line 376) | def test_user_lookup_by_user_id():
function test_user_lookup_by_screen_name (line 399) | def test_user_lookup_by_screen_name():
function test_tweet (line 421) | def test_tweet():
function test_dehydrate (line 426) | def test_dehydrate():
function test_hydrate (line 437) | def test_hydrate():
function test_connection_error_get (line 635) | def test_connection_error_get(oauth1session_class):
function test_connection_error_post (line 655) | def test_connection_error_post(oauth1session_class):
function test_http_error_sample (line 674) | def test_http_error_sample():
function test_http_error_filter (line 688) | def test_http_error_filter():
function test_retweets (line 701) | def test_retweets():
function test_missing_retweets (line 706) | def test_missing_retweets():
function test_oembed (line 711) | def test_oembed():
function test_oembed_params (line 720) | def test_oembed_params():
function test_replies (line 729) | def test_replies():
function test_lists_members (line 765) | def test_lists_members():
function test_lists_members_owner_id (line 773) | def test_lists_members_owner_id():
function test_lists_list_id (line 781) | def test_lists_list_id():
function test_extended_compat (line 787) | def test_extended_compat():
function test_csv_retweet (line 797) | def test_csv_retweet():
function test_csv_retweet_hashtag (line 805) | def test_csv_retweet_hashtag():
function test_truncated_text (line 824) | def test_truncated_text():
function test_invalid_credentials (line 832) | def test_invalid_credentials():
function test_app_auth (line 842) | def test_app_auth():
function test_premium_30day_search (line 854) | def test_premium_30day_search():
function test_premium_fullarchive_search (line 872) | def test_premium_fullarchive_search():
function test_gnip_fullarchive_search (line 899) | def test_gnip_fullarchive_search():
FILE: test_twarc2.py
function test_version (line 34) | def test_version():
function test_auth_types_interaction (line 43) | def test_auth_types_interaction():
function test_sample (line 84) | def test_sample():
function test_search_recent (line 103) | def test_search_recent(sort_order):
function test_counts_recent (line 118) | def test_counts_recent():
function test_counts_empty_page (line 133) | def test_counts_empty_page():
function test_search_times (line 148) | def test_search_times():
function test_user_ids_lookup (line 169) | def test_user_ids_lookup():
function test_usernames_lookup (line 188) | def test_usernames_lookup():
function test_tweet_lookup (line 197) | def test_tweet_lookup():
function test_stream (line 224) | def test_stream():
function test_timeline (line 277) | def test_timeline():
function test_timeline_username (line 298) | def test_timeline_username():
function test_missing_timeline (line 319) | def test_missing_timeline():
function test_follows (line 324) | def test_follows():
function test_follows_username (line 346) | def test_follows_username():
function test_flattened (line 368) | def test_flattened():
function test_ensure_flattened (line 454) | def test_ensure_flattened():
function test_ensure_flattened_errors (line 507) | def test_ensure_flattened_errors():
function test_ensure_user_id (line 515) | def test_ensure_user_id():
function test_liking_users (line 535) | def test_liking_users():
function test_retweeted_by (line 550) | def test_retweeted_by():
function test_liked_tweets (line 565) | def test_liked_tweets():
function test_list_lookup (line 580) | def test_list_lookup():
function test_list_members (line 586) | def test_list_members():
function test_list_followers (line 593) | def test_list_followers():
function test_list_memberships (line 600) | def test_list_memberships():
function test_followed_lists (line 607) | def test_followed_lists():
function test_owned_lists (line 614) | def test_owned_lists():
function test_list_tweets (line 621) | def test_list_tweets():
function test_user_lookup_non_existent (line 628) | def test_user_lookup_non_existent():
function test_twarc_metadata (line 634) | def test_twarc_metadata():
function test_docs_requirements (line 660) | def test_docs_requirements():
function test_geo (line 671) | def test_geo():
function pick_id (line 675) | def pick_id(id, objects):
FILE: utils/auth_timing.py
function count_tweets (line 30) | def count_tweets(app_auth):
FILE: utils/deduplicate.py
function main (line 18) | def main(files, extract_retweets=False):
FILE: utils/deleted.py
function missing (line 15) | def missing(tweets):
FILE: utils/deletes.py
function main (line 39) | def main(files, enhance_tweet=False, print_results=True, profile=None):
function examine (line 63) | def examine(tweet):
function get_user_status (line 102) | def get_user_status(tweet):
function get_tweet_status (line 138) | def get_tweet_status(tweet):
function tweet_url (line 173) | def tweet_url(tweet):
function has_error_code (line 180) | def has_error_code(resp, code):
FILE: utils/extractor.py
class attriObject (line 15) | class attriObject:
method __init__ (line 18) | def __init__(self, string):
method getElement (line 22) | def getElement(self, json_object):
function tweets_files (line 45) | def tweets_files(string, path):
function parse (line 55) | def parse(args):
function extract (line 102) | def extract(json_object, args, csv_writer):
FILE: utils/filter_date.py
function filter_input (line 21) | def filter_input(mindate, maxdate, files):
function main (line 35) | def main():
FILE: utils/filter_users.py
function read_user_list_file (line 24) | def read_user_list_file(user_list_filepath):
function _is_header (line 44) | def _is_header(count, split_line):
function main (line 54) | def main(files, user_ids, screen_names, positive_match=True):
FILE: utils/flakey.py
function id2time (line 18) | def id2time(tweet_id):
FILE: utils/foaf.py
function friendships (line 31) | def friendships(user_id, level=2):
function user_ids (line 63) | def user_ids():
function user_in_db (line 76) | def user_in_db(user_id):
function add_friendship (line 84) | def add_friendship(user_id, friend_id):
function add_user (line 93) | def add_user(u):
FILE: utils/geofilter.py
function process (line 12) | def process(line, has_coordinates=None, has_place=None, fence=None):
function main (line 40) | def main():
FILE: utils/geojson.py
function text (line 32) | def text(t):
FILE: utils/json2csv.py
function main (line 26) | def main():
function numbered_filepath (line 95) | def numbered_filepath(filepath, num):
function get_headings (line 100) | def get_headings(extra_headings=None):
function get_row (line 107) | def get_row(t, extra_fields=None, excel=False):
function extra_field (line 115) | def extra_field(t, field_str):
FILE: utils/media2warc.py
class GetResource (line 42) | class GetResource(threading.Thread):
method __init__ (line 43) | def __init__(self, q):
method run (line 50) | def run(self):
class WriteWarc (line 70) | class WriteWarc(threading.Thread):
method __init__ (line 71) | def __init__(self, out_queue, warcfile):
method run (line 78) | def run(self):
class Dedup (line 108) | class Dedup:
method __init__ (line 114) | def __init__(self):
method start (line 117) | def start(self):
method save (line 128) | def save(self, digest_key, url):
method lookup (line 136) | def lookup(self, digest_key, url=None):
function parse_extended_entities (line 148) | def parse_extended_entities(extended_entities_dict):
function parse_binlinks_from_tweet (line 182) | def parse_binlinks_from_tweet(tweetdict):
function main (line 201) | def main():
FILE: utils/network.py
function add (line 91) | def add(from_user, from_id, to_user, to_id, type, created_at=None):
function to_json (line 115) | def to_json(g):
FILE: utils/oembeds.py
function main (line 47) | def main():
class OEmbeds (line 67) | class OEmbeds:
method __init__ (line 68) | def __init__(self, path="oembeds.db"):
method put (line 79) | def put(self, url, metadata):
method get (line 84) | def get(self, url):
FILE: utils/retweets.py
function main (line 15) | def main():
FILE: utils/twarc-archive.py
function main (line 46) | def main():
function get_last_archive (line 183) | def get_last_archive(archive_dir):
function get_next_archive (line 195) | def get_next_archive(archive_dir):
FILE: utils/tweet_compliance.py
function process_tweets (line 44) | def process_tweets(tweets):
FILE: utils/unshrtn.py
function unshrtn_obj (line 35) | def unshrtn_obj(obj):
function rewrite_line (line 69) | def rewrite_line(line):
function main (line 79) | def main():
FILE: utils/wall.py
function download_file (line 21) | def download_file(url):
function text (line 34) | def text(t):
FILE: utils/wayback.py
function main (line 19) | def main(files, save, force_save, sleep):
function lookup (line 50) | def lookup(url):
function savepagenow (line 60) | def savepagenow(url):
function timestamp (line 67) | def timestamp(s):
FILE: utils/wordcloud.py
function main (line 10) | def main():
function text (line 291) | def text(t):
FILE: utils/youtubedl.py
function main (line 67) | def main():
function download (line 170) | def download(url, q, ydl_opts, log):
Condensed preview — 89 files, each showing path, character count, and a content snippet. Download the .json file or copy for the full structured content (661K chars).
[
{
"path": ".gitignore",
"chars": 106,
"preview": "*.pyc\n*.log\n.cache\n.venv\n.eggs\nPipfile*\nbuild\ndist\ntwarc.egg-info\n.pytest_cache\n.vscode\n.env\nsite\nuv.lock\n"
},
{
"path": ".readthedocs.yaml",
"chars": 125,
"preview": "version: 2\n\nmkdocs:\n configuration: mkdocs.yml\n\npython:\n version: 3.8\n install:\n - requirements: requirements-mkdocs"
},
{
"path": "LICENSE",
"chars": 1090,
"preview": "The MIT License (MIT)\n\nCopyright (c) Documenting the Now Project\n\nPermission is hereby granted, free of charge, to any p"
},
{
"path": "MANIFEST.in",
"chars": 48,
"preview": "include requirements.txt\ninclude docs/README.md\n"
},
{
"path": "README.md",
"chars": 3255,
"preview": "# twarc\n\n**Note: twarc is no longer actively supported after changes to Twitter's API quotas made it unusable.**\n\n---\n\n["
},
{
"path": "RELEASING.md",
"chars": 1011,
"preview": "# Releasing\n\nNew versions of twarc can be released by creating a release and assigning a new tag in the GitHub repo. The"
},
{
"path": "docs/README.md",
"chars": 3149,
"preview": "# twarc\n\ntwarc is a command line tool and Python library for collecting and archiving Twitter JSON\ndata via the Twitter "
},
{
"path": "docs/api/client.md",
"chars": 51,
"preview": "# twarc.Client\n\n::: twarc.client\n handler: python\n"
},
{
"path": "docs/api/client2.md",
"chars": 53,
"preview": "# twarc.Client2\n\n::: twarc.client2\n handler: python\n"
},
{
"path": "docs/api/expansions.md",
"chars": 695,
"preview": "# twarc.expansions\n\n[Expansions](https://developer.twitter.com/en/docs/twitter-api/expansions) are how the new v2 Twitte"
},
{
"path": "docs/api/library.md",
"chars": 5845,
"preview": "# Examples of using twarc2 as a library\n\nPlease see [client2](client2.md) docs for the full list of available functions."
},
{
"path": "docs/plugins.md",
"chars": 4784,
"preview": "# Plugins\n\ntwarc v1 collected a set of utilities for working with tweet json in the\n[utils] directory of the git reposit"
},
{
"path": "docs/resources.md",
"chars": 2581,
"preview": "# Twarc Tutorials and Other Resources\n\nDocumentation here is largely auto generated from the code, which may not always "
},
{
"path": "docs/tutorial.md",
"chars": 33440,
"preview": "# Twarc Tutorial\n\nTwarc is a command line tool for collecting Twitter data via Twitter's web Application Programming Int"
},
{
"path": "docs/twarc1_en_us.md",
"chars": 20706,
"preview": "twarc1\n=====\n\n***For information about working with the Twitter V2 API please see the [twarc2](https://twarc-project.rea"
},
{
"path": "docs/twarc1_es_mx.md",
"chars": 14480,
"preview": "# twarc1\r\n\r\ntwarc es una recurso de línea de commando y catálogo de Python para archivar JSON dato de Twitter. Cada twee"
},
{
"path": "docs/twarc1_ja_jp.md",
"chars": 13784,
"preview": "twarc1\n=====\n\ntwarcは、TwitterのJSONデータをアーカイブするためのコマンドラインツールおよびPythonライブラリーのプログラムです。\n\n- 各ツイートは、Twitter APIから返された内容を[正確に](ht"
},
{
"path": "docs/twarc1_pt_br.md",
"chars": 19899,
"preview": "twarc1\n=====\n\ntwarc é uma ferramenta de linha de comando e usa a biblioteca Python para arquivamento de dados do Twitter"
},
{
"path": "docs/twarc1_sv_se.md",
"chars": 14932,
"preview": "twarc1\n=====\n\ntwarc är ett kommandoradsverktyg twarc och ett Pythonbibliotek för arkivering av Twitter JSON data.\nVarje "
},
{
"path": "docs/twarc1_sw_ke.md",
"chars": 13034,
"preview": "twarc1\n\n=====\n\ntwarc ni chombo ya command-line na Python Library ya kuhifadhi Twitter JSON\ndata. Kila Tweet ita akilishw"
},
{
"path": "docs/twarc1_zw_zh.md",
"chars": 13399,
"preview": "twarc1\n=====\n\ntwarc 是一个用来处理并存档推特 JSON 数据的命令行工具和 Python 包。\n\n[正如](https://dev.twitter.com/overview/api/tweets)推特 API 返回的一样"
},
{
"path": "docs/twarc2_en_us.md",
"chars": 15446,
"preview": "\n# twarc2\n\ntwarc2 is a command line tool and Python library for archiving Twitter JSON\ndata. Each tweet is represented a"
},
{
"path": "docs/twitter-developer-access.md",
"chars": 6803,
"preview": "# Twitter Developer Access\n\nIf you have established that you would like to use Twitter Data in your study, you will need"
},
{
"path": "docs/windows10.md",
"chars": 4781,
"preview": "# twarc2 on Windows 10\n\nThis guide assumes you already have a Twitter Developer Account, a registered App with your keys"
},
{
"path": "mkdocs.yml",
"chars": 1032,
"preview": "site_name: twarc\nsite_url: https://readthedocs.org/projects/twarc-project/\nsite_description: Collect Twitter JSON data f"
},
{
"path": "pyproject.toml",
"chars": 813,
"preview": "[project]\nname = \"twarc\"\nversion = \"2.14.1\"\ndescription = \"Archive tweets from the command line\"\nlicense = \"MIT\"\nreadme "
},
{
"path": "requirements-mkdocs.txt",
"chars": 199,
"preview": "click>=7,<9\nclick-config-file>=0.6\nclick-plugins>=1\nhumanize>=3.9\npython-dateutil>=2.8\nrequests_oauthlib>=1.3\ntqdm>=4.62"
},
{
"path": "setup.cfg",
"chars": 63,
"preview": "[tool:pytest]\naddopts=--verbose --black\n\n[aliases]\ntest=pytest\n"
},
{
"path": "src/twarc/__init__.py",
"chars": 124,
"preview": "from .client import Twarc\nfrom .client2 import Twarc2\nfrom .version import version\nfrom .expansions import ensure_flatte"
},
{
"path": "src/twarc/__main__.py",
"chars": 103,
"preview": "from twarc.command2 import twarc2\n\nif __name__ == \"__main__\":\n twarc2(prog_name=\"python -m twarc2\")\n"
},
{
"path": "src/twarc/client.py",
"chars": 44131,
"preview": "# -*- coding: utf-8 -*-\n\nimport os\nimport re\nimport sys\nimport json\nimport types\nimport logging\nimport datetime\nimport r"
},
{
"path": "src/twarc/client2.py",
"chars": 70538,
"preview": "# -*- coding: utf-8 -*-\n\n\"\"\"\nSupport for the Twitter v2 API.\n\"\"\"\n\nimport re\nimport json\nimport time\nimport logging\nimpor"
},
{
"path": "src/twarc/command.py",
"chars": 15327,
"preview": "from __future__ import print_function\n\nimport os\nimport re\nimport sys\nimport json\nimport signal\nimport codecs\nimport log"
},
{
"path": "src/twarc/command2.py",
"chars": 91454,
"preview": "\"\"\"\nThe command line interfact to the Twitter v2 API.\n\"\"\"\n\nimport os\nimport re\nimport json\nimport time\nimport twarc\nimpo"
},
{
"path": "src/twarc/config.py",
"chars": 413,
"preview": "import logging\nimport configobj\n\n# Adapted from click_config_file.configobj_provider so that we can store the\n# file pat"
},
{
"path": "src/twarc/decorators.py",
"chars": 6286,
"preview": "import time\nimport logging\n\nfrom requests import HTTPError\nfrom requests.packages.urllib3.exceptions import ReadTimeoutE"
},
{
"path": "src/twarc/decorators2.py",
"chars": 17745,
"preview": "import os\nimport time\nimport click\nimport logging\nimport requests\n\nimport datetime\nimport humanize\nfrom tqdm.auto import"
},
{
"path": "src/twarc/expansions.py",
"chars": 9495,
"preview": "\"\"\"\nThis module contains a list of the known Twitter V2+ API expansions and fields\nfor each expansion, and a function fl"
},
{
"path": "src/twarc/handshake.py",
"chars": 4046,
"preview": "\"\"\"\nA function for asking the user for their Twitter API keys.\n\"\"\"\n\nimport requests\n\nfrom requests_oauthlib import OAuth"
},
{
"path": "src/twarc/json2csv.py",
"chars": 5643,
"preview": "#!/usr/bin/env python\n\nimport sys\n\nfrom dateutil.parser import parse as date_parse\nfrom six import string_types\n\nif sys."
},
{
"path": "src/twarc/version.py",
"chars": 175,
"preview": "import platform\n\nversion = \"2.14.1\"\n\nuser_agent = f\"twarc/{version} ({platform.system()} {platform.machine()}) {platform"
},
{
"path": "test_twarc.py",
"chars": 25079,
"preview": "import os\nimport re\nimport json\nimport time\nimport dotenv\nimport pytest\nimport logging\nimport datetime\n\ndotenv.load_dote"
},
{
"path": "test_twarc2.py",
"chars": 20031,
"preview": "import os\nimport pytz\nimport twarc\nimport dotenv\nimport pytest\nimport logging\nimport pathlib\nimport datetime\nimport thre"
},
{
"path": "utils/auth_timing.py",
"chars": 1327,
"preview": "#!/usr/bin/env python3\n\n\"\"\"\nTwitter's rate limits allow App Auth contexts to search at 450 requests\nevery 15 minutes, an"
},
{
"path": "utils/deduplicate.py",
"chars": 1186,
"preview": "#!/usr/bin/env python\n\"\"\"\nGiven a JSON file, remove any tweets with duplicate IDs.\n\nOptionally, this will extract retwee"
},
{
"path": "utils/deleted.py",
"chars": 793,
"preview": "#!/usr/bin/env python\n\n\"\"\"\nThis is a little utility that reads in tweets, rehydrates them, and only \noutputs the tweets "
},
{
"path": "utils/deleted_users.py",
"chars": 748,
"preview": "#!/usr/bin/env python3\n\n\"\"\"\nThis utility Will read in user ids, or tweet JSON data, and look up each\nuser_id. If the use"
},
{
"path": "utils/deletes.py",
"chars": 6683,
"preview": "#!/usr/bin/env python3\n\n\"\"\"\nThis program assumes that you are feeding it tweet JSON data for tweets\nthat have been delet"
},
{
"path": "utils/embeds.py",
"chars": 275,
"preview": "#!/usr/bin/env python\n\nfrom __future__ import print_function\nimport json\nimport fileinput\n\nfor line in fileinput.input()"
},
{
"path": "utils/emojis.py",
"chars": 728,
"preview": "#!/usr/bin/env python3\n\nimport re\nimport json\nimport fileinput\nimport collections\nimport optparse\n\nimport emoji\n\nopt_par"
},
{
"path": "utils/extractor.py",
"chars": 5783,
"preview": "#!/usr/bin/env python3\nfrom datetime import datetime\nimport json\nimport os\nimport re\nimport argparse\nimport csv\nimport c"
},
{
"path": "utils/filter_date.py",
"chars": 1328,
"preview": "#!/usr/bin/env python\n\"\"\"\nGiven a minimum and/or maximum date, filter out all tweets after this date.\n\nFor example, if a"
},
{
"path": "utils/filter_users.py",
"chars": 2943,
"preview": "#!/usr/bin/env python\n\"\"\"\nFilters tweets posted by a list of users.\n\nThe list is supplied in a file. The file can contai"
},
{
"path": "utils/flakey.py",
"chars": 781,
"preview": "#!/usr/bin/env python3\n\n#\n# This program will read tweet ids (Snowflake IDs) from a file or a pipe and\n# write the tweet"
},
{
"path": "utils/foaf.py",
"chars": 5221,
"preview": "#!/usr/bin/env python3\n\n\"\"\"\nThis is a utility for getting the friend-of-a-friend network for a \ngiven twitter user. It w"
},
{
"path": "utils/gender.py",
"chars": 1092,
"preview": "#!/usr/bin/env python\n\n\"\"\"\nfilters tweets based on a guess about the users gender\n\"\"\"\nfrom __future__ import print_funct"
},
{
"path": "utils/geo.py",
"chars": 378,
"preview": "#!/usr/bin/env python\n\n\"\"\"\nFilter tweets/retweets that have geocoding.\n\"\"\"\nfrom __future__ import print_function\n\nimport"
},
{
"path": "utils/geofilter.py",
"chars": 1777,
"preview": "#!/usr/bin/env python\n\nfrom __future__ import print_function\n\nimport argparse\nimport json\nimport sys\n\nfrom shapely.geome"
},
{
"path": "utils/geojson.py",
"chars": 3130,
"preview": "#!/usr/bin/env python\n\n\"\"\"\ngeojson.py reads in tweets and writes out a corresponding geojson file for the\ntweets. Each f"
},
{
"path": "utils/json2csv.py",
"chars": 3380,
"preview": "#!/usr/bin/env python\n\n\"\"\"\nA sample JSON to CSV program. Multivalued JSON properties are space delimited \nCSV columns. I"
},
{
"path": "utils/media2warc.py",
"chars": 8523,
"preview": "#!/usr/bin/env python\n\n\"\"\"\nThis utility extracts media urls from tweet jsonl.gz and save them as warc records.\n\nWarcio "
},
{
"path": "utils/media_urls.py",
"chars": 959,
"preview": "#!/usr/bin/env python\n\n\"\"\"\nPrint out the URLs of images uploaded to Twitter in a tweet json stream.\nUseful for piping to"
},
{
"path": "utils/network.py",
"chars": 9663,
"preview": "#!/usr/bin/env python\n\n# NOTE:\n#\n# This script has been ported to the twarc-network plugin for working\n# with data colle"
},
{
"path": "utils/noretweets.py",
"chars": 347,
"preview": "#!/usr/bin/env python\n\"\"\"\nGiven a JSON file, remove any retweets.\n\nExample usage:\nutils/noretweets.py tweets.jsonl > twe"
},
{
"path": "utils/oembeds.py",
"chars": 2770,
"preview": "#!/usr/bin/env python3\n\n\"\"\"\noembeds.py will read a stream of tweet JSON and augment .entities.urls with oembed\nmetadata "
},
{
"path": "utils/remove_limit.py",
"chars": 615,
"preview": "#!/usr/bin/env python\n\n\"\"\"\nUtility to remove limit warnings from Filter API output.\n\nIf --warnings was used, you will ha"
},
{
"path": "utils/retweets.py",
"chars": 831,
"preview": "#!/usr/bin/env python\n\n\"\"\"\nPrints out the tweet ids and counts of most retweeted.\n\"\"\"\nfrom __future__ import print_funct"
},
{
"path": "utils/search.py",
"chars": 1152,
"preview": "#!/usr/bin/env python\n\n\"\"\"\nFilter tweet JSON based on a regular expression to apply to the text of the \ntweet.\n\n sear"
},
{
"path": "utils/sensitive.py",
"chars": 554,
"preview": "#!/usr/bin/env python\n\n\"\"\"\nFilter out tweets or retweets that Twitter thinks are sensitive (mostly porn).\n\"\"\"\nfrom __fut"
},
{
"path": "utils/sort_by_id.py",
"chars": 528,
"preview": "#!/usr/bin/env python\n\"\"\"\nSort tweets by ID.\n\nTwitter IDs are generated in chronologically ascending order,\nso this is t"
},
{
"path": "utils/source.py",
"chars": 1412,
"preview": "#!/usr/bin/env python\n\"\"\"\nUtil to count which clients are most used.\n\nExample usage:\nutils/source.py tweets.jsonl > sour"
},
{
"path": "utils/tags.py",
"chars": 378,
"preview": "#!/usr/bin/env python\nfrom __future__ import print_function\n\nimport json\nimport fileinput\nimport collections\n\ncounts = c"
},
{
"path": "utils/times.py",
"chars": 794,
"preview": "#!/usr/bin/env python\nfrom __future__ import print_function\n\nimport sys\nimport json\nimport optparse\nimport fileinput\nimp"
},
{
"path": "utils/twarc-archive.py",
"chars": 6039,
"preview": "#!/usr/bin/env python\n\n\"\"\"\nThis little utility uses twarc to write Twitter search results to a directory\nof your choosin"
},
{
"path": "utils/tweet.py",
"chars": 1119,
"preview": "#!/usr/bin/env python\n\n\"\"\"\nFetch a single tweet as JSON using its id.\n\"\"\"\nfrom __future__ import print_function\n\nimport "
},
{
"path": "utils/tweet_compliance.py",
"chars": 2397,
"preview": "#!/usr/bin/env python\n\n\"\"\"\nSupports tweet compliance. See https://developer.twitter.com/en/docs/tweets/compliance/overvi"
},
{
"path": "utils/tweet_text.py",
"chars": 408,
"preview": "#!/usr/bin/env python\n\n\"\"\"\nGiven a JSON file, return just the text of the tweet.\nExample usage:\nutils/tweet_text.py twee"
},
{
"path": "utils/tweet_urls.py",
"chars": 716,
"preview": "#!/usr/bin/env python\n\n\"\"\"\nUsed in conjunction with retweet.py.\n\nPrints out the retweet count, and url of the retweeted "
},
{
"path": "utils/tweetometer.py",
"chars": 1364,
"preview": "#!/usr/bin/env python3\n\n\"\"\"\nReads tweet or Twitter user JSON and outputs a CSV of when the user account was\ncreated, how"
},
{
"path": "utils/tweets.py",
"chars": 516,
"preview": "#!/usr/bin/env python\nfrom __future__ import print_function\n\nimport json\nimport fileinput\nimport dateutil.parser\n\nfor li"
},
{
"path": "utils/unshrtn.py",
"chars": 3206,
"preview": "#!/usr/bin/env python3\n\n\"\"\"\nUnfortunately the \"expanded_url\" as supplied by Twitter aren't fully\nexpanded one hop past t"
},
{
"path": "utils/urls.py",
"chars": 461,
"preview": "#!/usr/bin/env python3\n\n\"\"\"\nPrint out the URLs in a tweet json stream.\n\"\"\"\nfrom __future__ import print_function\n\nimport"
},
{
"path": "utils/users.py",
"chars": 230,
"preview": "#!/usr/bin/env python\nfrom __future__ import print_function\n\nimport json\nimport fileinput\n\nfor line in fileinput.input()"
},
{
"path": "utils/validate.py",
"chars": 285,
"preview": "#!/usr/bin/env python\n\nimport sys\nimport json\nimport fileinput\n\nline_number = 0\n\nfor line in fileinput.input():\n line"
},
{
"path": "utils/wall.py",
"chars": 4868,
"preview": "#!/usr/bin/env python\n\n\"\"\"\nFeed wall.py your JSON and get a wall of tweets as HTML. If you want to get the\nwall in chron"
},
{
"path": "utils/wayback.py",
"chars": 2692,
"preview": "#!/usr/bin/env python\n\n#\n# Reads a stream of tweets and checks to see if the tweet is archived at\n# Internet Archive and"
},
{
"path": "utils/webarchives.py",
"chars": 671,
"preview": "#!/usr/bin/env python3\n\n\"\"\"\nA program to filter tweets that contain links to a web archive. At the moment it\nsupports ar"
},
{
"path": "utils/wordcloud.py",
"chars": 6670,
"preview": "#!/usr/bin/env python\n\nfrom __future__ import print_function\nimport re\nimport sys\nimport json\nimport fileinput\n\n\ndef mai"
},
{
"path": "utils/youtubedl.py",
"chars": 5911,
"preview": "#!/usr/bin/env python3\n\n\"\"\"\nusage: youtubedl.py [-h] [--max-downloads MAX_DOWNLOADS]\n [--max-filesize"
}
]
About this extraction
This page contains the full source code of the edsu/twarc GitHub repository, extracted and formatted as plain text. The extraction includes 89 files (595.5 KB), approximately 149.2k tokens, and a symbol index with 369 extracted functions, classes, methods, constants, and types.