Full Code of j-e-d/NYTdiff for AI


Repository: j-e-d/NYTdiff
Branch: master
Commit: cd031b3838f1
Files: 6
Total size: 28.8 KB

Directory structure:
gitextract_5u6rg6p8/

├── README.md
├── css/
│   └── styles.css
├── nytdiff.py
├── output/
│   └── .gitignore
├── requirements.txt
└── run_diff.sh

================================================
FILE CONTENTS
================================================

================================================
FILE: README.md
================================================
# NYTdiff

Code for the Twitter bot [@nyt_diff](https://twitter.com/nyt_diff).

Bluesky version: [@nytdiff.bsky.social](https://bsky.app/profile/nytdiff.bsky.social)

Mastodon version: [@nyt_diff@botsin.space](https://botsin.space/@nyt_diff)

The [phantomjs](http://phantomjs.org/) binary needs to be installed and the path updated in the run_diff.sh file.

[Twitter keys](https://dev.twitter.com/) and an [NYT API](http://developers.nytimes.com/) key for the "Top Stories V2" service are needed. Enter their values in the run_diff.sh file.
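
Because `nytdiff.py` reads its configuration from environment variables, a quick pre-flight check can catch a missing key before the bot starts. The sketch below is not part of the repository: the helper name `missing_env` is hypothetical, and the variable names are the ones `nytdiff.py` looks up with `os.environ[...]`.

```python
import os

# Variables nytdiff.py reads unconditionally with os.environ[...].
REQUIRED = ["NYT_API_KEY", "PHANTOMJS_PATH"]

# Read only when NYT_TWITTER_CONSUMER_KEY is set to a non-empty value.
TWITTER = [
    "NYT_TWITTER_CONSUMER_SECRET",
    "NYT_TWITTER_ACCESS_TOKEN",
    "NYT_TWITTER_ACCESS_TOKEN_SECRET",
    "NYT_BEARER_TOKEN",
]


def missing_env(env=os.environ):
    """Return names nytdiff.py would fail to read (hypothetical helper)."""
    needed = list(REQUIRED)
    if env.get("NYT_TWITTER_CONSUMER_KEY"):
        needed += TWITTER
    # The Bluesky password is read only when BLUESKY_LOGIN is present.
    if "BLUESKY_LOGIN" in env:
        needed.append("BLUESKY_PASSWD")
    return [name for name in needed if name not in env]
```

Calling `missing_env()` after sourcing the exports from run_diff.sh returns an empty list when every variable the script will read is present.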

Font: [Merriweather](https://github.com/SorkinType/Merriweather). Background pattern: [Paper Fibers](http://subtlepatterns.com/paper-fibers/).
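
The stylesheet refers to `../fonts/Merriweather-Regular.ttf` and `../img/paper_fibers.png`, and `show_diff()` copies the `css`, `fonts` and `img` directories before rendering, yet only `css/` appears in this repository's tree. Assuming the font and pattern are downloaded into `fonts/` and `img/` next to `nytdiff.py`, a small check like the following (the helper name `missing_assets` is hypothetical) can confirm they are in place:

```python
from pathlib import Path

# Paths referenced by css/styles.css and copied by show_diff() in nytdiff.py.
ASSETS = [
    "css/styles.css",
    "fonts/Merriweather-Regular.ttf",
    "img/paper_fibers.png",
]


def missing_assets(root="."):
    """Return the asset paths that do not exist under root (hypothetical helper)."""
    base = Path(root)
    return [rel for rel in ASSETS if not (base / rel).is_file()]
```

An empty result from `missing_assets()` means the files the stylesheet expects are present in the working directory.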

If you wish to build your own version, check [this fork](https://github.com/xuv/NYTdiff) that reads RSS feeds and [this project](https://github.com/docnow/diffengine), which is based in part on nyt_diff and also checks RSS feeds.

Bots with similar functionality that I am aware of (please let me know of others):


|site|bot|
|----|---|
|[ABC Digital]|[EditandoAbc]|
|[Ámbito]|[af_diff]|
|[AP]|~~[ap_diff]~~|
|[BBC]|~~[bbc_diff]~~|
|[Breitbart News]|[breitbart_diff]|
|[Calgary Herald]|~~[yyc_herald_diff]~~|
|[Canadaland]|~~[canadaland_diff]~~|
|[The Canadian Broadcasting Corporation]|~~[cbc_diff]~~|
|[Clarín]|[cl_diff]|
|[CNN]|~~[cnn_diff]~~|
|[Cronista]|[cc_diff]|
|[Daily Mail]|~~[dailymail_diff]~~|
|[Fox News]|~~[fox_diff]~~|
|[Israeli News Websites]|[ILNewsDiff]|
|[The Globe and Mail]|~~[globemail_diff]~~|
|[The Guardian]|[guardian_diff]|
|[Infobae]|[ib_diff]|
|[Le Monde]|[lemonde_diff]|
|[La Nación]|[ln_diff]|
|[La Presse]|~~[lapresse_diff]~~|
|[La Repubblica]|~~[repubblica_diff]~~|
|[Le Soir]|[lesoir_diff]|
|[Página/12]|[p12_diff]|
|[Reuters]|~~[reuters_diff]~~|
|[RT]|~~[rt_edits]~~|
|[Stuff]|[stuff_diff]|
|[Telegraph]|~~[telegraph_diff]~~|
|[Todo Noticias]|~~[tn_diff]~~|
|[The Toronto Star]|~~[torstar_diff]~~|
|[Valeurs Actuelles]|[valeurs_diff]|
|[Wall Street Journal]|[wsj_diff]|
|[The Washington Post]|[wapo_diff]|
|[White House Blog]|[whitehouse_diff]|

Some bot accounts are currently (November 2020) suspended by Twitter; they are shown ~~struck through~~.

[EditandoAbc]: https://twitter.com/EditandoAbc
[ABC Digital]: http://www.abc.com.py/

[af_diff]: https://twitter.com/af_diff
[Ámbito]: http://www.ambito.com/

[cc_diff]: https://twitter.com/cc_diff
[Cronista]: https://www.cronista.com/

[ib_diff]: https://twitter.com/ib_diff
[Infobae]: http://www.infobae.com/

[p12_diff]: https://twitter.com/p12_diff
[Página/12]: https://www.pagina12.com.ar/

[tn_diff]: https://twitter.com/tn_diff
[Todo Noticias]: http://tn.com.ar

[wapo_diff]: https://twitter.com/wapo_diff
[The Washington Post]: https://www.washingtonpost.com

[breitbart_diff]: https://twitter.com/breitbart_diff
[Breitbart News]: https://www.breitbart.com

[guardian_diff]: https://twitter.com/guardian_diff
[The Guardian]: https://www.theguardian.com/

[torstar_diff]: https://twitter.com/torstar_diff
[The Toronto Star]: https://www.thestar.com/

[globemail_diff]: https://twitter.com/globemail_diff
[The Globe and Mail]: http://www.theglobeandmail.com/

[canadaland_diff]: https://twitter.com/canadaland_diff
[Canadaland]: http://www.canadalandshow.com/

[repubblica_diff]: https://twitter.com/repubblica_diff
[La Repubblica]: http://www.repubblica.it/

[yyc_herald_diff]: https://twitter.com/yyc_herald_diff
[Calgary Herald]: http://calgaryherald.com/

[cbc_diff]: https://twitter.com/cbc_diff
[The Canadian Broadcasting Corporation]: http://www.cbc.ca/

[lapresse_diff]: https://twitter.com/lapresse_diff
[La Presse]: http://www.lapresse.ca/

[bbc_diff]: https://twitter.com/bbc_diff
[BBC]: http://www.bbc.co.uk/

[rt_edits]: https://twitter.com/rt_edits
[RT]: http://rt.com

[fox_diff]: https://twitter.com/fox_diff
[Fox News]: http://www.foxnews.com/

[dailymail_diff]: https://twitter.com/dailymail_diff
[Daily Mail]: http://www.dailymail.co.uk/

[telegraph_diff]: https://twitter.com/telegraph_diff
[Telegraph]: http://www.telegraph.co.uk/

[cnn_diff]: https://twitter.com/cnn_diff
[CNN]: http://www.cnn.com/

[reuters_diff]: https://twitter.com/reuters_diff
[Reuters]: http://www.reuters.com/

[ap_diff]: https://twitter.com/ap_diff
[AP]: https://www.ap.org/

[whitehouse_diff]: https://twitter.com/whitehouse_diff
[White House Blog]: https://www.whitehouse.gov/blog

[wsj_diff]: https://twitter.com/wsj_diff
[Wall Street Journal]: http://www.wsj.com/

[Clarín]: http://clarin.com
[cl_diff]: https://twitter.com/cl_diff

[La Nación]: http://www.lanacion.com.ar
[ln_diff]: https://twitter.com/ln_diff

[Le Monde]: http://lemonde.fr 
[lemonde_diff]: https://twitter.com/lemonde_diff

[Le Soir]: http://lesoir.be
[lesoir_diff]: https://twitter.com/lesoir_diff

[Valeurs Actuelles]: https://www.valeursactuelles.com/
[valeurs_diff]: https://twitter.com/valeurs_diff

[Stuff]: https://www.stuff.co.nz/
[stuff_diff]: https://twitter.com/stuff_diff

[Israeli News Websites]: https://github.com/alonmln/ILNewsDiff
[ILNewsDiff]: https://twitter.com/ILNewsDiff


================================================
FILE: css/styles.css
================================================
@font-face { 
    font-family: Merriweather; 
    font-style: normal;
    font-weight: normal;
    src: url('../fonts/Merriweather-Regular.ttf') format("truetype"); 
} 

body { 
    background: lightgray url('../img/paper_fibers.png') repeat; 
    font-family: Merriweather;
    font-size: 16px;
}

p {
    margin-left: 2em;
    margin-right: 2em;
    margin-top: 1em;
    margin-bottom: 1em;
    padding-left: 25px;
    padding-right: 25px;
    padding-top: 10px;
    padding-bottom: 10px;
    width: 350px;
    font-weight: normal;
}

del {
   background-color: lightpink;
   color: black;
   text-decoration: line-through;
   font-weight: lighter;
}

ins {
  background-color: aquamarine;
  color: black;
  text-decoration: none;
  font-weight: bold;
}


================================================
FILE: nytdiff.py
================================================
#!/usr/bin/python3

import collections
import hashlib
import json
import logging
import os
import shutil
import sys
import time
from datetime import datetime
from tempfile import TemporaryDirectory

import bleach
import dataset
import requests
import tweepy
from atproto import Client, models
from PIL import Image
from pytz import timezone
from simplediff import html_diff
from selenium import webdriver
from selenium.webdriver.common.by import By

TIMEZONE = "America/Buenos_Aires"
LOCAL_TZ = timezone(TIMEZONE)
MAX_RETRIES = 10
RETRY_DELAY = 3

# Dry-run mode is on unless TESTING is set to exactly "False"
TESTING = os.environ.get("TESTING", "True") != "False"

# Optional prefix (folder path) for the log file
LOG_FOLDER = os.environ.get("LOG_FOLDER", "")

PHANTOMJS_PATH = os.environ["PHANTOMJS_PATH"]


class BaseParser(object):
    def __init__(self, api, client, bsky_api=None):
        self.urls = list()
        self.payload = None
        self.articles = dict()
        self.current_ids = set()
        self.filename = str()
        self.db = dataset.connect("sqlite:///titles.db")
        self.api = api
        self.client = client
        self.bsky_api = bsky_api

    def remove_old(self, column="id"):
        db_ids = set()
        for nota_db in self.articles_table.find(status="home"):
            db_ids.add(nota_db[column])
        for to_remove in db_ids - self.current_ids:
            if column == "id":
                data = dict(id=to_remove, status="removed")
            else:
                data = dict(article_id=to_remove, status="removed")
            self.articles_table.update(data, [column])
            logging.info("Removed %s", to_remove)

    def get_prev_tweet(self, article_id, column):
        if column == "id":
            search = self.articles_table.find_one(id=article_id)
        else:
            search = self.articles_table.find_one(article_id=article_id)
        if search is None:
            return None
        else:
            if "tweet_id" in search:
                return search["tweet_id"]
            else:
                return None

    def get_bsky_parent(self, article_id, column):
        # Returns a tuple (parent, root) of bluesky "strong refs" for
        # the previously posted article in this thread
        # If no parent is found, returns (None, None)
        if column == "id":
            search = self.articles_table.find_one(id=article_id)
        else:
            search = self.articles_table.find_one(article_id=article_id)
        if search and search.get("post_uri"):
            post_uri = search["post_uri"]
            post_cid = search["post_cid"]
            root_uri = search["root_uri"]
            root_cid = search["root_cid"]
            return (
                models.ComAtprotoRepoStrongRef.Main(uri=post_uri, cid=post_cid),
                models.ComAtprotoRepoStrongRef.Main(uri=root_uri, cid=root_cid),
            )
        else:
            return (None, None)

    def update_tweet_db(self, article_id, tweet_id, column):
        if column == "id":
            article = {"id": article_id, "tweet_id": tweet_id}
        else:
            article = {"article_id": article_id, "tweet_id": tweet_id}
        self.articles_table.update(article, [column])
        logging.debug("Updated tweet ID in db")

    def update_bsky_db(self, article_id, post_ref, root_ref, column):
        article = {
            column: article_id,
            "post_uri": post_ref.uri,
            "post_cid": post_ref.cid,
            "root_uri": root_ref.uri,
            "root_cid": root_ref.cid,
        }
        self.articles_table.update(article, [column])
        logging.debug("Updated bsky refs in db")

    def media_upload(self, filename):
        if TESTING:
            return 1
        try:
            response = self.api.media_upload(filename)
        except:
            print(sys.exc_info()[0])
            logging.exception("Media upload")
            return False
        return response.media_id_string

    def tweet_with_media(self, text, images, reply_to=None):
        if TESTING:
            print(text, images, reply_to)
            return True
        try:
            if reply_to is not None:
                tweet = self.client.create_tweet(
                    text=text, media_ids=images, in_reply_to_tweet_id=reply_to
                )
            else:
                tweet = self.client.create_tweet(text=text, media_ids=images)
        except:
            logging.exception("Tweet with media failed")
            print(sys.exc_info()[0])
            return False
        return tweet

    def tweet_text(self, text):
        if TESTING:
            print(text)
            return True
        try:
            tweet = self.client.create_tweet(text=text)
        except:
            logging.exception("Tweet text failed")
            print(sys.exc_info()[0])
            return False
        return tweet

    def media_metadata(self, image, alt_text):
        if TESTING:
            print(image, alt_text)
            return True
        try:
            self.api.create_media_metadata(image, alt_text)
        except:
            logging.exception("Media metadata failed")
            print(sys.exc_info()[0])
            return False
        return True

    def tweet(
        self, text, article_id, url, column="id", alt_text=None, archive_url=None
    ):
        if not self.client:
            return
        images = list()
        image = self.media_upload("./output/" + self.filename + ".png")
        if image is False:
            # Upload failed (already logged in media_upload); nothing to attach
            return
        logging.info("Media ready with ids: %s", image)
        images.append(image)
        if alt_text is not None:
            logging.info("Alt text to add: %s", alt_text)
            self.media_metadata(image, alt_text)
        logging.info("Text to tweet: %s", text)
        logging.info("Article id: %s", article_id)
        reply_to = self.get_prev_tweet(article_id, column)
        if reply_to is None:
            if archive_url is not None:
                url = url + " " + archive_url
            logging.info("Tweeting url/s: %s", url)
            tweet = self.tweet_text(url)
            if tweet is False:
                return
            # In TESTING mode tweet_text() returns True rather than a response
            reply_to = None if TESTING else tweet.data["id"]
        logging.info("Replying to: %s", reply_to)
        tweet = self.tweet_with_media(text, images, reply_to)
        if tweet is False or TESTING:
            # Nothing to store on failure or in a dry run
            return
        logging.info("Id to store: %s", tweet.data["id"])
        self.update_tweet_db(article_id, tweet.data["id"], column)
        return

    def bsky_website_card(self, article_data):
        # Generate a website preview card for the specified url
        # Returns a models.AppBskyEmbedExternal object suitable
        # for passing as the `embed' argument to atproto.send_post
        post_title = article_data["title"]
        post_description = article_data["abstract"]
        post_uri = article_data["url"]
        extra_args = {}
        if article_data.get('thumbnail'):
            r = requests.get(url=article_data['thumbnail'])
            if r.ok:
                thumb = self.bsky_api.upload_blob(r.content)
                extra_args["thumb"] = thumb.blob

        return models.AppBskyEmbedExternal.Main(
            external=models.AppBskyEmbedExternal.External(
                title=post_title,
                description=post_description,
                uri=post_uri,
                **extra_args,
            )
        )

    def bsky_post(self, text, article_data, column="id", alt_text=""):
        if not self.bsky_api:
            return
        article_id = article_data["article_id"]
        url = article_data["url"]

        # Read and upload the rendered diff image
        img_path = "./output/" + self.filename + ".png"
        with open(img_path, "rb") as f:
            img_data = f.read()
            img_blob = self.bsky_api.upload_blob(img_data).blob
        logging.info("Media ready with ids: %s", img_path)
        logging.info("Text to post: %s", text)
        logging.info("Article id: %s", article_id)
        (parent_ref, root_ref) = self.get_bsky_parent(article_id, column)
        logging.info("Parent ref: %s", parent_ref)
        logging.info("Root ref: %s", root_ref)
        if parent_ref is None:
            # No parent, let's start a new thread
            logging.info("Posting url: %s", url)
            post = self.bsky_api.send_post(
                "", embed=self.bsky_website_card(article_data)
            )
            root_ref = models.create_strong_ref(post)
            parent_ref = root_ref

        logging.info("Replying to: %s", parent_ref)

        # Prepare an image upload with aspect ratio hints
        with Image.open(img_path) as img:
            aspect_ratio = models.AppBskyEmbedDefs.AspectRatio(
                height=img.height,
                width=img.width
            )
        image_embed = models.AppBskyEmbedImages.Image(
            alt=alt_text,
            image=img_blob,
            aspect_ratio=aspect_ratio
        )
        post = self.bsky_api.send_post(
            text=text,
            embed=models.AppBskyEmbedImages.Main(images=[image_embed]),
            reply_to=models.AppBskyFeedPost.ReplyRef(parent=parent_ref, root=root_ref),
        )
        child_ref = models.create_strong_ref(post)
        logging.info("Id to store: %s", child_ref)
        self.update_bsky_db(article_id, child_ref, root_ref, column)

    def get_page(self, url, header=None, payload=None):
        for x in range(MAX_RETRIES):
            try:
                r = requests.get(url=url, headers=header, params=payload)
            except Exception as e:
                if x == MAX_RETRIES - 1:
                    print("Max retries reached")
                    logging.warning("Max retries for: %s", url)
                    return None
                if "104" not in str(e):
                    print("Problem with url {}".format(url))
                    print("Exception: {}".format(str(e)))
                    logging.exception("Problem getting page")
                    return None
                time.sleep(RETRY_DELAY)
            else:
                break
        return r

    def strip_html(self, html_str):
        """
        a wrapper for bleach.clean() that strips ALL tags from the input
        """
        tags = []
        attr = {}
        strip = True
        return bleach.clean(html_str, tags=tags, attributes=attr, strip=strip)

    def show_diff(self, old, new):
        if old is None or new is None or len(old) == 0 or len(new) == 0:
            logging.info("Old or New empty")
            return False
        new_hash = hashlib.sha224(new.encode("utf8")).hexdigest()
        html_diff_result = html_diff(old, new)
        logging.info(html_diff_result)
        if "</ins>" not in html_diff_result and "</del>" not in html_diff_result:
            logging.info("No diff to show")
            return False
        html = """
        <!doctype html>
        <html lang="en">
          <head>
            <meta charset="utf-8">
            <link rel="stylesheet" href="css/styles.css">
          </head>
          <body>
          <p>
          {}
          </p>
          </body>
        </html>
        """.format(
            html_diff_result
        )
        with TemporaryDirectory(delete=False) as tmpdir:
            tmpfile = os.path.join(tmpdir, "tmp.html")
            with open(tmpfile, "w") as f:
                f.write(html)
            for d in ["css", "fonts", "img"]:
                shutil.copytree(d, os.path.join(tmpdir, d))
            opts = webdriver.chrome.options.Options()
            opts.add_argument("--headless")
            driver = webdriver.Chrome(options=opts)
            driver.get("file://{}".format(tmpfile))
            logging.info("tmpfile is %s", tmpfile)

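        try:
            e = driver.find_element(By.XPATH, "//p")
            timestamp = str(int(time.time()))
            self.filename = timestamp + new_hash
            e.screenshot("./output/" + self.filename + ".png")
        finally:
            # Always shut down the headless browser so Chrome processes do not pile up
            driver.quit()
        return True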

    def __str__(self):
        return "\n".join(self.urls)


class NYTParser(BaseParser):
    def __init__(self, nyt_api_key, api, client, bsky_api=None):
        BaseParser.__init__(self, api, client, bsky_api=bsky_api)
        self.urls = ["https://api.nytimes.com/svc/topstories/v2/home.json"]
        self.payload = {"api-key": nyt_api_key}
        self.articles_table = self.db["nyt_ids"]
        self.versions_table = self.db["nyt_versions"]

    def get_thumbnail(self, article):
        # Return the URL for the first thumbnail image in the article.
        # Choose the largest sub-600-pixel image available (Bluesky thumbnails
        # are resized to 560 pixels)
        thumb_url = None
        thumb_width = 0
        if article.get('multimedia'):
            for m in article['multimedia']:
                if m['type'] != 'image':
                    continue
                if m['width'] > 600:
                    continue
                if m['width'] > thumb_width:
                    thumb_width = m['width']
                    thumb_url = m['url']
        return thumb_url

    def json_to_dict(self, article):
        article_dict = dict()
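        # The uri is the article's primary key; skip entries without one
        if not article.get("uri"):
            return None
        # short_url may be missing or empty; fall back to an empty string
        article_dict["short_url"] = (article.get("short_url") or "").split("/")[-1]
        article_dict["article_id"] = article["uri"]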
        if "html>" in article_dict["short_url"]:
            logging.warning("Problem extracting short_url of: %s", article)
            return None
        article_dict["url"] = article["url"]
        article_dict["title"] = article["title"]
        article_dict["abstract"] = self.strip_html(article["abstract"])
        article_dict["byline"] = article["byline"]
        article_dict["kicker"] = article["kicker"]
        article_dict["thumbnail"] = self.get_thumbnail(article)
        od = collections.OrderedDict(sorted(article_dict.items()))
        article_dict["hash"] = hashlib.sha224(
            repr(od.items()).encode("utf-8")
        ).hexdigest()
        article_dict["date_time"] = datetime.now(LOCAL_TZ)
        return article_dict

    def generate_alt_text(self, old, new):
        return "Before: {}\nAfter: {}".format(old, new)

    def store_data(self, data):
        if self.articles_table.find_one(article_id=data["article_id"]) is None:  # New
            article = {
                "article_id": data["article_id"],
                "add_dt": data["date_time"],
                "status": "home",
                "thumbnail": data["thumbnail"],
                "tweet_id": None,
                "post_uri": None,
                "post_cid": None,
                "root_uri": None,
                "root_cid": None,
            }
            self.articles_table.insert(article)
            logging.info("New article tracked: %s", data["url"])
            data["version"] = 1
            self.versions_table.insert(data)
        else:
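            # re insert
            # NOTE: the dict built below is never written back to the table,
            # so a re-listed article keeps its "removed" status.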
            if (
                self.articles_table.find_one(
                    article_id=data["article_id"], status="removed"
                )
                is not None
            ):
                article = {
                    "article_id": data["article_id"],
                    "add_dt": data["date_time"],
                }

            count = self.versions_table.count(
                self.versions_table.table.columns.article_id == data["article_id"],
                hash=data["hash"],
            )
            if count == 1:  # Existing
                pass
            else:  # Changed
                # Bind the article id as a parameter instead of formatting it
                # into the SQL string
                result = self.db.query(
                    "SELECT * FROM nyt_versions "
                    "WHERE article_id = :article_id "
                    "ORDER BY version DESC LIMIT 1",
                    article_id=data["article_id"],
                )
                for row in result:
                    data["version"] = row["version"] + 1
                    self.versions_table.insert(data)
                    url = data["url"]
                    # Compare only the path after the domain
                    old_path = row["url"].split("nytimes.com/")[1]
                    new_path = data["url"].split("nytimes.com/")[1]
                    if old_path != new_path:
                        if self.show_diff(old_path, new_path):
                            tweet_text = "Change in URL"
                            alt_text = self.generate_alt_text(old_path, new_path)
                            self.bsky_post(
                                tweet_text, data, "article_id", alt_text)
                            self.tweet(
                                tweet_text,
                                data["article_id"],
                                url,
                                "article_id",
                                alt_text,
                            )
                    if row["title"] != data["title"]:
                        if self.show_diff(row["title"], data["title"]):
                            tweet_text = "Change in Headline"
                            alt_text = self.generate_alt_text(
                                row["title"], data["title"]
                            )
                            self.bsky_post(
                                tweet_text, data, "article_id", alt_text)
                            self.tweet(
                                tweet_text,
                                data["article_id"],
                                url,
                                "article_id",
                                alt_text,
                            )
                    if row["abstract"] != data["abstract"]:
                        if self.show_diff(row["abstract"], data["abstract"]):
                            tweet_text = "Change in Abstract"
                            alt_text = self.generate_alt_text(
                                row["abstract"], data["abstract"]
                            )
                            self.bsky_post(
                                tweet_text, data, "article_id", alt_text)
                            self.tweet(
                                tweet_text,
                                data["article_id"],
                                url,
                                "article_id",
                                alt_text,
                            )
                    if row["kicker"] != data["kicker"]:
                        if self.show_diff(row["kicker"], data["kicker"]):
                            tweet_text = "Change in Kicker"
                            alt_text = self.generate_alt_text(
                                row["kicker"], data["kicker"]
                            )
                            self.bsky_post(
                                tweet_text, data, "article_id", alt_text)
                            self.tweet(
                                tweet_text,
                                data["article_id"],
                                url,
                                "article_id",
                                alt_text,
                            )
        return data["article_id"]

    def loop_data(self, data):
        if "results" not in data:
            return False
        for article in data["results"]:
            try:
                article_dict = self.json_to_dict(article)
                if article_dict is not None and "/zh-hans/" not in article_dict["url"]:
                    article_id = self.store_data(article_dict)
                    self.current_ids.add(article_id)
            except Exception as e:
                logging.exception("Problem looping NYT: %s", article)
                print("Exception: {}".format(str(e)))
                print("***************")
                print(article)
                print("***************")
                return False
        return True

    def parse_pages(self):
        r = self.get_page(self.urls[0], None, self.payload)
        if r is None or len(r.text) == 0:
            logging.warning("Empty response NYT")
            return
        if r.status_code != 200:
            logging.warning(f"Non 200 response: {r.status_code}, text: {r.text}")
        try:
            data = json.loads(r.text, strict=False)
        except Exception as e:
            logging.exception("Problem parsing page: %s", r.text)
            print(e)
            print(len(r.text))
            print(type(r.text))
            print(r.text)
            print("----")
            return
        loop = self.loop_data(data)
        if loop:
            self.remove_old("article_id")


def main():
    # logging
    logging.basicConfig(
        filename=LOG_FOLDER + "titlediff.log",
        format="%(asctime)s %(name)13s %(levelname)8s: " + "%(message)s",
        level=logging.INFO,
    )
    logging.getLogger("requests").setLevel(logging.WARNING)
    logging.info("Starting script")

    nyt_api = None
    nyt_client = None
    if os.environ.get("NYT_TWITTER_CONSUMER_KEY"):
        consumer_key = os.environ["NYT_TWITTER_CONSUMER_KEY"]
        consumer_secret = os.environ["NYT_TWITTER_CONSUMER_SECRET"]
        access_token = os.environ["NYT_TWITTER_ACCESS_TOKEN"]
        access_token_secret = os.environ["NYT_TWITTER_ACCESS_TOKEN_SECRET"]
        bearer_token = os.environ["NYT_BEARER_TOKEN"]
        auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
        auth.secure = True
        auth.set_access_token(access_token, access_token_secret)
        nyt_api = tweepy.API(auth)
        nyt_client = tweepy.Client(
            bearer_token=bearer_token,
            consumer_key=consumer_key,
            consumer_secret=consumer_secret,
            access_token=access_token,
            access_token_secret=access_token_secret,
        )
        logging.debug("NYT Twitter API configured")

    bsky_api = None
    if "BLUESKY_LOGIN" in os.environ:
        bsky_login = os.environ["BLUESKY_LOGIN"]
        bsky_passwd = os.environ["BLUESKY_PASSWD"]
        bsky_api = Client(base_url="https://bsky.social")
        try:
            bsky_api.login(bsky_login, bsky_passwd)
        except:
            logging.exception("Bluesky login failed")
            return

    try:
        logging.debug("Starting NYT")
        nyt_api_key = os.environ["NYT_API_KEY"]
        nyt = NYTParser(
            nyt_api_key=nyt_api_key, api=nyt_api, client=nyt_client, bsky_api=bsky_api
        )
        nyt.parse_pages()
        logging.debug("Finished NYT")
    except:
        logging.exception("NYT")

    logging.info("Finished script")


if __name__ == "__main__":
    main()


================================================
FILE: output/.gitignore
================================================


================================================
FILE: requirements.txt
================================================
alembic>=1.13.1,<2.0.0
atproto>=0.0.50,<1.0.0
bleach>=6.1.0,<7.0.0
dataset>=1.6.2,<2.0.0
html5lib>=1.1,<2.0
Mako>=1.3.2,<2.0.0
MarkupSafe>=2.1.5,<3.0.0
normality>=2.5.0,<3.0.0
oauthlib>=3.2.2,<4.0.0
pillow>=11.0
python-editor>=1.0.4,<2.0.0
pytz>=2024.1,<2025.0
PyYAML>=6.0.1,<7.0.0
requests>=2.31.0,<3.0.0
requests-oauthlib>=1.3.1,<2.0.0
selenium>=4.18.1,<5.0.0
simplediff>=1.1,<2.0
six>=1.16.0,<2.0
SQLAlchemy>=1.4.52,<2.0.0
tweepy>=4.14.0,<5.0.0


================================================
FILE: run_diff.sh
================================================
#!/bin/bash
export TESTING=True

export NYT_TWITTER_CONSUMER_KEY=""
export NYT_TWITTER_CONSUMER_SECRET=""
export NYT_TWITTER_ACCESS_TOKEN=""
export NYT_TWITTER_ACCESS_TOKEN_SECRET=""
# Read by nytdiff.py whenever NYT_TWITTER_CONSUMER_KEY is non-empty
export NYT_BEARER_TOKEN=""

export NYT_API_KEY=""

export PHANTOMJS_PATH="./"

python nytdiff.py


SYMBOL INDEX (27 symbols across 1 file)

FILE: nytdiff.py
  class BaseParser (line 46) | class BaseParser(object):
    method __init__ (line 47) | def __init__(self, api, client, bsky_api=None):
    method remove_old (line 58) | def remove_old(self, column="id"):
    method get_prev_tweet (line 70) | def get_prev_tweet(self, article_id, column):
    method get_bsky_parent (line 83) | def get_bsky_parent(self, article_id, column):
    method update_tweet_db (line 103) | def update_tweet_db(self, article_id, tweet_id, column):
    method update_bsky_db (line 111) | def update_bsky_db(self, article_id, post_ref, root_ref, column):
    method media_upload (line 122) | def media_upload(self, filename):
    method tweet_with_media (line 133) | def tweet_with_media(self, text, images, reply_to=None):
    method tweet_text (line 150) | def tweet_text(self, text):
    method media_metadata (line 162) | def media_metadata(self, image, alt_text):
    method tweet (line 174) | def tweet(
    method bsky_website_card (line 205) | def bsky_website_card(self, article_data):
    method bsky_post (line 228) | def bsky_post(self, text, article_data, column="id", alt_text=""):
    method get_page (line 276) | def get_page(self, url, header=None, payload=None):
    method strip_html (line 295) | def strip_html(self, html_str):
    method show_diff (line 304) | def show_diff(self, old, new):
    method __str__ (line 348) | def __str__(self):
  class NYTParser (line 352) | class NYTParser(BaseParser):
    method __init__ (line 353) | def __init__(self, nyt_api_key, api, client, bsky_api=None):
    method get_thumbnail (line 360) | def get_thumbnail(self, article):
    method json_to_dict (line 377) | def json_to_dict(self, article):
    method generate_alt_text (line 399) | def generate_alt_text(self, old, new):
    method store_data (line 402) | def store_data(self, data):
    method loop_data (line 519) | def loop_data(self, data):
    method parse_pages (line 537) | def parse_pages(self):
  function main (line 559) | def main():

