[
  {
    "path": "README.md",
    "content": "# NYTdiff\n\nCode for the Twitter bot [@nyt_diff](https://twitter.com/nyt_diff).\n\nBluesky version: [@nytdiff.bsky.social](https://bsky.app/profile/nytdiff.bsky.social)\n\nMastodon version: [@nyt_diff@botsin.space](https://botsin.space/@nyt_diff)\n\nThe [phantomjs](http://phantomjs.org/) binary needs to be installed and the path updated in the run_diff.sh file.\n\n[Twitter keys](https://dev.twitter.com/) and the [NYT API](http://developers.nytimes.com/) key for the \"Top Stories V2\" service are needed, values of these keys need to be entered in the run_diff.sh file.\n\nFont: [Merriweather](https://github.com/SorkinType/Merriweather). Background pattern: [Paper Fibers](http://subtlepatterns.com/paper-fibers/).\n\nIf you wish to build your version check [this fork](https://github.com/xuv/NYTdiff) that reads RSS feeds and [this project](https://github.com/docnow/diffengine) that is in part based on nyt_diff and also checks RSS feeds.\n\nBots with similar functionality that I am aware of (please let me know of others).\n\n\n|site|bot|\n|----|---|\n|[ABC Digital]|[EditandoAbc]|\n|[Ámbito]|[af_diff]|\n|[AP]|~~[ap_diff]~~|\n|[BBC]|~~[bbc_diff]~~|\n|[Breitbart News]|[breitbart_diff]|\n|[Calgary Herald]|~~[yyc_herald_diff]~~|\n|[Canadaland]|~~[canadaland_diff]~~|\n|[The Canadian Broadcasting Corporation]|~~[cbc_diff]~~|\n|[Clarín]|[cl_diff]|\n|[CNN]|~~[cnn_diff]~~|\n|[Cronista]|[cc_diff]|\n|[Daily Mail]|~~[dailymail_diff]~~|\n|[Fox News]|~~[fox_diff]~~|\n|[Israeli News Websites]|[ILNewsDiff]|\n|[The Globe and Mail]|~~[globemail_diff]~~|\n|[The Guardian]|[guardian_diff]|\n|[Infobae]|[ib_diff]|\n|[Le Monde]|[lemonde_diff]|\n|[La Nación]|[ln_diff]|\n|[La Presse]|~~[lapresse_diff]~~|\n|[La Repubblica]|~~[repubblica_diff]~~|\n|[Le Soir]|[lesoir_diff]|\n|[Página/12]|[p12_diff]|\n|[Reuters]|~~[reuters_diff]~~|\n|[RT]|~~[rt_edits]~~|\n|[Stuff]|[stuff_diff]|\n|[Telegraph]|~~[telegraph_diff]~~|\n|[Todo Noticias]|~~[tn_diff]~~|\n|[The Toronto 
Star]|~~[torstar_diff]~~|\n|[Valeurs Actuelles]|[valeurs_diff]|\n|[Wall Street Journal]|[wsj_diff]|\n|[The Washington Post]|[wapo_diff]|\n|[White House Blog]|[whitehouse_diff]\n\nSome bot accounts are currently (November 2020) suspended by Twitter and have been ~~strikethrough~~.\n\n[EditandoAbc]: https://twitter.com/EditandoAbc\n[ABC Digital]: http://www.abc.com.py/\n\n[af_diff]: https://twitter.com/af_diff\n[Ámbito]: http://www.ambito.com/\n\n[cc_diff]: https://twitter.com/cc_diff\n[Cronista]: https://www.cronista.com/\n\n[ib_diff]: https://twitter.com/ib_diff\n[Infobae]: http://www.infobae.com/\n\n[p12_diff]: https://twitter.com/p12_diff\n[Página/12]: https://www.pagina12.com.ar/\n\n[tn_diff]: https://twitter.com/tn_diff\n[Todo Noticias]: http://tn.com.ar\n\n[wapo_diff]: https://twitter.com/wapo_diff\n[The Washington Post]: https://www.washingtonpost.com\n\n[breitbart_diff]: https://twitter.com/breitbart_diff\n[Breitbart News]: https://www.breitbart.com\n\n[guardian_diff]: https://twitter.com/guardian_diff\n[The Guardian]: https://www.theguardian.com/\n\n[torstar_diff]: https://twitter.com/torstar_diff\n[The Toronto Star]: https://www.thestar.com/\n\n[globemail_diff]: https://twitter.com/globemail_diff\n[The Globe and Mail]: http://www.theglobeandmail.com/\n\n[canadaland_diff]: https://twitter.com/canadaland_diff\n[Canadaland]: http://www.canadalandshow.com/\n\n[repubblica_diff]: https://twitter.com/repubblica_diff\n[La Repubblica]: http://www.repubblica.it/\n\n[yyc_herald_diff]: https://twitter.com/yyc_herald_diff\n[Calgary Herald]: http://calgaryherald.com/\n\n[cbc_diff]: https://twitter.com/cbc_diff\n[The Canadian Broadcasting Corporation]: http://www.cbc.ca/\n\n[lapresse_diff]: https://twitter.com/lapresse_diff\n[La Presse]: http://www.lapresse.ca/\n\n[bbc_diff]: https://twitter.com/bbc_diff\n[BBC]: http://www.bbc.co.uk/\n\n[rt_edits]: https://twitter.com/rt_edits\n[RT]: http://rt.com\n\n[fox_diff]: https://twitter.com/fox_diff\n[Fox News]: 
http://www.foxnews.com/\n\n[dailymail_diff]: https://twitter.com/dailymail_diff\n[Daily Mail]: http://www.dailymail.co.uk/\n\n[telegraph_diff]: https://twitter.com/telegraph_diff\n[Telegraph]: http://www.telegraph.co.uk/\n\n[cnn_diff]: https://twitter.com/cnn_diff\n[CNN]: http://www.cnn.com/\n\n[reuters_diff]: https://twitter.com/reuters_diff\n[Reuters]: http://www.reuters.com/\n\n[ap_diff]: https://twitter.com/ap_diff\n[AP]: https://www.ap.org/\n\n[whitehouse_diff]: https://twitter.com/whitehouse_diff\n[White House Blog]: https://www.whitehouse.gov/blog\n\n[wsj_diff]: https://twitter.com/wsj_diff\n[Wall Street Journal]: http://www.wsj.com/\n\n[Clarín]: http://clarin.com\n[cl_diff]: https://twitter.com/cl_diff\n\n[La Nación]: http://www.lanacion.com.ar\n[ln_diff]: https://twitter.com/ln_diff\n\n[Le Monde]: http://lemonde.fr \n[lemonde_diff]: https://twitter.com/lemonde_diff\n\n[Le Soir]: http://lesoir.be\n[lesoir_diff]: https://twitter.com/lesoir_diff\n\n[Valeurs Actuelles]: https://www.valeursactuelles.com/\n[valeurs_diff]: https://twitter.com/valeurs_diff\n\n[Stuff]: https://www.stuff.co.nz/\n[stuff_diff]: https://twitter.com/stuff_diff\n\n[Israeli News Websites]: https://github.com/alonmln/ILNewsDiff\n[ILNewsDiff]: https://twitter.com/ILNewsDiff\n"
  },
  {
    "path": "css/styles.css",
    "content": "@font-face { \n    font-family: Merriweather; \n    font-style: normal;\n    font-weight: normal;\n    src: url('../fonts/Merriweather-Regular.ttf') format(\"truetype\"); \n} \n\nbody { \n    background: lightgray url('../img/paper_fibers.png') repeat; \n    font-family: Merriweather;\n    font-size: 16px;\n}\n\np {\n    margin-left: 2em;\n    margin-right: 2em;\n    margin-top: 1em;\n    margin-bottom: 1em;\n    padding-left: 25px;\n    padding-right: 25px;\n    padding-top: 10px;\n    padding-bottom: 10px;\n    width: 350px;\n    font-weight: normal;\n}\n\ndel {\n   background-color: lightpink;\n   color: black;\n   text-decoration: line-through;\n   font-weight: lighter;\n}\n\nins {\n  background-color: aquamarine;\n  color: black;\n  text-decoration: none;\n  font-weight: bold;\n}\n"
  },
  {
    "path": "nytdiff.py",
    "content": "#!/usr/bin/python3\n\nimport collections\nimport hashlib\nimport json\nimport logging\nimport os\nimport shutil\nimport sys\nimport time\nfrom datetime import datetime\nfrom tempfile import TemporaryDirectory\n\nimport bleach\nimport dataset\nimport requests\nimport tweepy\nfrom atproto import Client, models\nfrom PIL import Image\nfrom pytz import timezone\nfrom simplediff import html_diff\nfrom selenium import webdriver\nfrom selenium.webdriver.common.by import By\n\nTIMEZONE = \"America/Buenos_Aires\"\nLOCAL_TZ = timezone(TIMEZONE)\nMAX_RETRIES = 10\nRETRY_DELAY = 3\n\nif \"TESTING\" in os.environ:\n    if os.environ[\"TESTING\"] == \"False\":\n        TESTING = False\n    else:\n        TESTING = True\nelse:\n    TESTING = True\n\nif \"LOG_FOLDER\" in os.environ:\n    LOG_FOLDER = os.environ[\"LOG_FOLDER\"]\nelse:\n    LOG_FOLDER = \"\"\n\nPHANTOMJS_PATH = os.environ[\"PHANTOMJS_PATH\"]\n\n\nclass BaseParser(object):\n    def __init__(self, api, client, bsky_api=None):\n        self.urls = list()\n        self.payload = None\n        self.articles = dict()\n        self.current_ids = set()\n        self.filename = str()\n        self.db = dataset.connect(\"sqlite:///titles.db\")\n        self.api = api\n        self.client = client\n        self.bsky_api = bsky_api\n\n    def remove_old(self, column=\"id\"):\n        db_ids = set()\n        for nota_db in self.articles_table.find(status=\"home\"):\n            db_ids.add(nota_db[column])\n        for to_remove in db_ids - self.current_ids:\n            if column == \"id\":\n                data = dict(id=to_remove, status=\"removed\")\n            else:\n                data = dict(article_id=to_remove, status=\"removed\")\n            self.articles_table.update(data, [column])\n            logging.info(\"Removed %s\", to_remove)\n\n    def get_prev_tweet(self, article_id, column):\n        if column == \"id\":\n            search = self.articles_table.find_one(id=article_id)\n        else:\n    
        search = self.articles_table.find_one(article_id=article_id)\n        if search is None:\n            return None\n        else:\n            if \"tweet_id\" in search:\n                return search[\"tweet_id\"]\n            else:\n                return None\n\n    def get_bsky_parent(self, article_id, column):\n        # Returns a tuple (parent, root) of bluesky \"strong refs\" for\n        # the previously posted article in this thread\n        # If no parent is found, returns (None, None)\n        if column == \"id\":\n            search = self.articles_table.find_one(id=article_id)\n        else:\n            search = self.articles_table.find_one(article_id=article_id)\n        if search and search.get(\"post_uri\"):\n            post_uri = search[\"post_uri\"]\n            post_cid = search[\"post_cid\"]\n            root_uri = search[\"root_uri\"]\n            root_cid = search[\"root_cid\"]\n            return (\n                models.ComAtprotoRepoStrongRef.Main(uri=post_uri, cid=post_cid),\n                models.ComAtprotoRepoStrongRef.Main(uri=root_uri, cid=root_cid),\n            )\n        else:\n            return (None, None)\n\n    def update_tweet_db(self, article_id, tweet_id, column):\n        if column == \"id\":\n            article = {\"id\": article_id, \"tweet_id\": tweet_id}\n        else:\n            article = {\"article_id\": article_id, \"tweet_id\": tweet_id}\n        self.articles_table.update(article, [column])\n        logging.debug(\"Updated tweet ID in db\")\n\n    def update_bsky_db(self, article_id, post_ref, root_ref, column):\n        article = {\n            column: article_id,\n            \"post_uri\": post_ref.uri,\n            \"post_cid\": post_ref.cid,\n            \"root_uri\": root_ref.uri,\n            \"root_cid\": root_ref.cid,\n        }\n        self.articles_table.update(article, [column])\n        logging.debug(\"Updated bsky refs in db\")\n\n    def media_upload(self, filename):\n        if 
TESTING:\n            return 1\n        try:\n            response = self.api.media_upload(filename)\n        except Exception:\n            print(sys.exc_info()[0])\n            logging.exception(\"Media upload\")\n            return False\n        return response.media_id_string\n\n    def tweet_with_media(self, text, images, reply_to=None):\n        if TESTING:\n            print(text, images, reply_to)\n            return True\n        try:\n            if reply_to is not None:\n                tweet = self.client.create_tweet(\n                    text=text, media_ids=images, in_reply_to_tweet_id=reply_to\n                )\n            else:\n                tweet = self.client.create_tweet(text=text, media_ids=images)\n        except Exception:\n            logging.exception(\"Tweet with media failed\")\n            print(sys.exc_info()[0])\n            return False\n        return tweet\n\n    def tweet_text(self, text):\n        if TESTING:\n            print(text)\n            return True\n        try:\n            tweet = self.client.create_tweet(text=text)\n        except Exception:\n            logging.exception(\"Tweet text failed\")\n            print(sys.exc_info()[0])\n            return False\n        return tweet\n\n    def media_metadata(self, image, alt_text):\n        if TESTING:\n            print(image, alt_text)\n            return True\n        try:\n            self.api.create_media_metadata(image, alt_text)\n        except Exception:\n            logging.exception(\"Media metadata failed\")\n            print(sys.exc_info()[0])\n            return False\n        return True\n\n    def tweet(\n        self, text, article_id, url, column=\"id\", alt_text=None, archive_url=None\n    ):\n        if not self.client:\n            return\n        images = list()\n        image = self.media_upload(\"./output/\" + self.filename + \".png\")\n        if image is False:\n            return\n        logging.info(\"Media ready with ids: %s\", image)\n        images.append(image)\n        if alt_text is not None:\n            
logging.info(\"Alt text to add: %s\", alt_text)\n            self.media_metadata(image, alt_text)\n        logging.info(\"Text to tweet: %s\", text)\n        logging.info(\"Article id: %s\", article_id)\n        reply_to = self.get_prev_tweet(article_id, column)\n        if reply_to is None:\n            if archive_url is not None:\n                url = url + \" \" + archive_url\n            logging.info(\"Tweeting url/s: %s\", url)\n            tweet = self.tweet_text(url)\n            if tweet is False:\n                return\n            reply_to = tweet.data[\"id\"]\n        logging.info(\"Replying to: %s\", reply_to)\n        tweet = self.tweet_with_media(text, images, reply_to)\n        if tweet is False:\n            return\n        logging.info(\"Id to store: %s\", tweet.data[\"id\"])\n        self.update_tweet_db(article_id, tweet.data[\"id\"], column)\n        return\n\n    def bsky_website_card(self, article_data):\n        # Generate a website preview card for the specified url\n        # Returns a models.AppBskyEmbedExternal object suitable\n        # for passing as the `embed' argument to atproto.send_post\n        post_title = article_data[\"title\"]\n        post_description = article_data[\"abstract\"]\n        post_uri = article_data[\"url\"]\n        extra_args = {}\n        if article_data.get('thumbnail'):\n            r = requests.get(url=article_data['thumbnail'])\n            if r.ok:\n                thumb = self.bsky_api.upload_blob(r.content)\n                extra_args[\"thumb\"] = thumb.blob\n\n        return models.AppBskyEmbedExternal.Main(\n            external=models.AppBskyEmbedExternal.External(\n                title=post_title,\n                description=post_description,\n                uri=post_uri,\n                **extra_args,\n            )\n        )\n\n    def bsky_post(self, text, article_data, column=\"id\", alt_text=\"\"):\n        if not self.bsky_api:\n            return\n        article_id = 
article_data[\"article_id\"]\n        url = article_data[\"url\"]\n\n        # Collect image data for the thumbnail\n        img_path = \"./output/\" + self.filename + \".png\"\n        with open(img_path, \"rb\") as f:\n            img_data = f.read()\n            img_blob = self.bsky_api.upload_blob(img_data).blob\n        logging.info(\"Media ready with ids: %s\", img_path)\n        logging.info(\"Text to post: %s\", text)\n        logging.info(\"Article id: %s\", article_id)\n        (parent_ref, root_ref) = self.get_bsky_parent(article_id, column)\n        logging.info(\"Parent ref: %s\", parent_ref)\n        logging.info(\"Root ref: %s\", root_ref)\n        if parent_ref is None:\n            # No parent, let's start a new thread\n            logging.info(\"Posting url: %s\", url)\n            post = self.bsky_api.send_post(\n                \"\", embed=self.bsky_website_card(article_data)\n            )\n            root_ref = models.create_strong_ref(post)\n            parent_ref = root_ref\n\n        logging.info(\"Replying to: %s\", parent_ref)\n\n        # Prepare an image upload with aspect ratio hints\n        with Image.open(img_path) as img:\n            aspect_ratio = models.AppBskyEmbedDefs.AspectRatio(\n                height=img.height,\n                width=img.width\n            )\n        image_embed = models.AppBskyEmbedImages.Image(\n            alt=alt_text,\n            image=img_blob,\n            aspect_ratio=aspect_ratio\n        )\n        post = self.bsky_api.send_post(\n            text=text,\n            embed=models.AppBskyEmbedImages.Main(images=[image_embed]),\n            reply_to=models.AppBskyFeedPost.ReplyRef(parent=parent_ref, root=root_ref),\n        )\n        child_ref = models.create_strong_ref(post)\n        logging.info(\"Id to store: %s\", child_ref)\n        self.update_bsky_db(article_id, child_ref, root_ref, column)\n\n    def get_page(self, url, header=None, payload=None):\n        for x in range(MAX_RETRIES):\n  
          try:\n                r = requests.get(url=url, headers=header, params=payload)\n            except BaseException as e:\n                if x == MAX_RETRIES - 1:\n                    print(\"Max retries reached\")\n                    logging.warning(\"Max retries for: %s\", url)\n                    return None\n                if \"104\" not in str(e):\n                    print(\"Problem with url {}\".format(url))\n                    print(\"Exception: {}\".format(str(e)))\n                    logging.exception(\"Problem getting page\")\n                    return None\n                time.sleep(RETRY_DELAY)\n            else:\n                break\n        return r\n\n    def strip_html(self, html_str):\n        \"\"\"\n        A wrapper for bleach.clean() that strips ALL tags from the input.\n        \"\"\"\n        tags = []\n        attr = {}\n        strip = True\n        return bleach.clean(html_str, tags=tags, attributes=attr, strip=strip)\n\n    def show_diff(self, old, new):\n        if old is None or new is None or len(old) == 0 or len(new) == 0:\n            logging.info(\"Old or New empty\")\n            return False\n        new_hash = hashlib.sha224(new.encode(\"utf8\")).hexdigest()\n        html_diff_result = html_diff(old, new)\n        logging.info(html_diff_result)\n        if \"</ins>\" not in html_diff_result and \"</del>\" not in html_diff_result:\n            logging.info(\"No diff to show\")\n            return False\n        html = \"\"\"\n        <!doctype html>\n        <html lang=\"en\">\n          <head>\n            <meta charset=\"utf-8\">\n            <link rel=\"stylesheet\" href=\"css/styles.css\">\n          </head>\n          <body>\n          <p>\n          {}\n          </p>\n          </body>\n        </html>\n        \"\"\".format(\n            html_diff_result\n        )\n        # Render and screenshot the diff while the temporary directory still\n        # exists, then let the context manager clean it up.\n        with TemporaryDirectory() as tmpdir:\n            tmpfile = os.path.join(tmpdir, \"tmp.html\")\n            with open(tmpfile, \"w\") as f:\n                f.write(html)\n            for d in [\"css\", \"fonts\", \"img\"]:\n                shutil.copytree(d, os.path.join(tmpdir, d))\n            opts = webdriver.chrome.options.Options()\n            opts.add_argument(\"--headless\")\n            driver = webdriver.Chrome(options=opts)\n            try:\n                driver.get(\"file://{}\".format(tmpfile))\n                logging.info(\"tmpfile is %s\", tmpfile)\n                e = driver.find_element(By.XPATH, \"//p\")\n                timestamp = str(int(time.time()))\n                self.filename = timestamp + new_hash\n                e.screenshot(\"./output/\" + self.filename + \".png\")\n            finally:\n                driver.quit()\n        return True\n\n    def __str__(self):\n        return \"\\n\".join(self.urls)\n\n\nclass NYTParser(BaseParser):\n    def __init__(self, nyt_api_key, api, client, bsky_api=None):\n        BaseParser.__init__(self, api, client, bsky_api=bsky_api)\n        self.urls = [\"https://api.nytimes.com/svc/topstories/v2/home.json\"]\n        self.payload = {\"api-key\": nyt_api_key}\n        self.articles_table = self.db[\"nyt_ids\"]\n        self.versions_table = self.db[\"nyt_versions\"]\n\n    def get_thumbnail(self, article):\n        # Return the URL for the first thumbnail image in the article.\n        # Choose the largest sub-600-pixel image available (Bluesky thumbnails\n        # are resized to 560 pixels)\n        thumb_url = None\n        thumb_width = 0\n        if article.get('multimedia'):\n            for m in article['multimedia']:\n                if m['type'] != 'image':\n                    continue\n                if m['width'] > 600:\n                    continue\n                if m['width'] > thumb_width:\n                    thumb_width = m['width']\n                    thumb_url = m['url']\n        return thumb_url\n\n    def json_to_dict(self, article):\n        article_dict = dict()\n        if not article.get(\"short_url\") and not article.get(\"uri\"):\n            return None\n        
article_dict[\"short_url\"] = article.get(\"short_url\", \"\").split(\"/\")[-1]\n        article_dict[\"article_id\"] = article[\"uri\"]\n        if \"html>\" in article_dict[\"short_url\"]:\n            logging.warning(\"Problem extracting short_url of: %s\", article)\n            return None\n        article_dict[\"url\"] = article[\"url\"]\n        article_dict[\"title\"] = article[\"title\"]\n        article_dict[\"abstract\"] = self.strip_html(article[\"abstract\"])\n        article_dict[\"byline\"] = article[\"byline\"]\n        article_dict[\"kicker\"] = article[\"kicker\"]\n        article_dict[\"thumbnail\"] = self.get_thumbnail(article)\n        od = collections.OrderedDict(sorted(article_dict.items()))\n        article_dict[\"hash\"] = hashlib.sha224(\n            repr(od.items()).encode(\"utf-8\")\n        ).hexdigest()\n        article_dict[\"date_time\"] = datetime.now(LOCAL_TZ)\n        return article_dict\n\n    def generate_alt_text(self, old, new):\n        return \"Before: {}\\nAfter: {}\".format(old, new)\n\n    def store_data(self, data):\n        if self.articles_table.find_one(article_id=data[\"article_id\"]) is None:  # New\n            article = {\n                \"article_id\": data[\"article_id\"],\n                \"add_dt\": data[\"date_time\"],\n                \"status\": \"home\",\n                \"thumbnail\": data[\"thumbnail\"],\n                \"tweet_id\": None,\n                \"post_uri\": None,\n                \"post_cid\": None,\n                \"root_uri\": None,\n                \"root_cid\": None,\n            }\n            self.articles_table.insert(article)\n            logging.info(\"New article tracked: %s\", data[\"url\"])\n            data[\"version\"] = 1\n            self.versions_table.insert(data)\n        else:\n            # re insert\n            if (\n                self.articles_table.find_one(\n                    article_id=data[\"article_id\"], status=\"removed\"\n                )\n                is not 
None\n            ):\n                # The article came back to the home page: mark it as \"home\" again\n                article = {\n                    \"article_id\": data[\"article_id\"],\n                    \"add_dt\": data[\"date_time\"],\n                    \"status\": \"home\",\n                }\n                self.articles_table.update(article, [\"article_id\"])\n\n            count = self.versions_table.count(\n                self.versions_table.table.columns.article_id == data[\"article_id\"],\n                hash=data[\"hash\"],\n            )\n            if count == 1:  # Existing\n                pass\n            else:  # Changed\n                result = self.db.query(\n                    'SELECT * FROM nyt_versions '\n                    'WHERE article_id = \"%s\" '\n                    'ORDER BY version DESC LIMIT 1' % (data[\"article_id\"])\n                )\n                for row in result:\n                    data[\"version\"] = row[\"version\"] + 1\n                    self.versions_table.insert(data)\n                    url = data[\"url\"]\n                    if (\n                        row[\"url\"].split(\"nytimes.com/\")[1]\n                        != data[\"url\"].split(\"nytimes.com/\")[1]\n                    ):\n                        if self.show_diff(\n                            row[\"url\"].split(\"nytimes.com/\")[1],\n                            data[\"url\"].split(\"nytimes.com/\")[1],\n                        ):\n                            tweet_text = \"Change in URL\"\n                            old_text = row[\"url\"].split(\"nytimes.com/\")[1]\n                            new_text = data[\"url\"].split(\"nytimes.com/\")[1]\n                            alt_text = self.generate_alt_text(old_text, new_text)\n                            self.bsky_post(\n                                tweet_text, data, \"article_id\", alt_text)\n                            self.tweet(\n                                tweet_text,\n                                
data[\"article_id\"],\n                                url,\n                                \"article_id\",\n                                alt_text,\n                            )\n                    if row[\"title\"] != data[\"title\"]:\n                        if self.show_diff(row[\"title\"], data[\"title\"]):\n                            tweet_text = \"Change in Headline\"\n                            alt_text = self.generate_alt_text(\n                                row[\"title\"], data[\"title\"]\n                            )\n                            self.bsky_post(\n                                tweet_text, data, \"article_id\", alt_text)\n                            self.tweet(\n                                tweet_text,\n                                data[\"article_id\"],\n                                url,\n                                \"article_id\",\n                                alt_text,\n                            )\n                    if row[\"abstract\"] != data[\"abstract\"]:\n                        if self.show_diff(row[\"abstract\"], data[\"abstract\"]):\n                            tweet_text = \"Change in Abstract\"\n                            alt_text = self.generate_alt_text(\n                                row[\"abstract\"], data[\"abstract\"]\n                            )\n                            self.bsky_post(\n                                tweet_text, data, \"article_id\", alt_text)\n                            self.tweet(\n                                tweet_text,\n                                data[\"article_id\"],\n                                url,\n                                \"article_id\",\n                                alt_text,\n                            )\n                    if row[\"kicker\"] != data[\"kicker\"]:\n                        if self.show_diff(row[\"kicker\"], data[\"kicker\"]):\n                            tweet_text = \"Change in Kicker\"\n                         
   alt_text = self.generate_alt_text(\n                                row[\"kicker\"], data[\"kicker\"]\n                            )\n                            self.bsky_post(\n                                tweet_text, data, \"article_id\", alt_text)\n                            self.tweet(\n                                tweet_text,\n                                data[\"article_id\"],\n                                url,\n                                \"article_id\",\n                                alt_text,\n                            )\n        return data[\"article_id\"]\n\n    def loop_data(self, data):\n        if \"results\" not in data:\n            return False\n        for article in data[\"results\"]:\n            try:\n                article_dict = self.json_to_dict(article)\n                if article_dict is not None and \"/zh-hans/\" not in article_dict[\"url\"]:\n                    article_id = self.store_data(article_dict)\n                    self.current_ids.add(article_id)\n            except BaseException as e:\n                logging.exception(\"Problem looping NYT: %s\", article)\n                print(\"Exception: {}\".format(str(e)))\n                print(\"***************\")\n                print(article)\n                print(\"***************\")\n                return False\n        return True\n\n    def parse_pages(self):\n        r = self.get_page(self.urls[0], None, self.payload)\n        if r is None or len(r.text) == 0:\n            logging.warning(\"Empty response NYT\")\n            return\n        if r.status_code != 200:\n            logging.warning(f\"Non 200 response: {r.status_code}, text: {r.text}\")\n        try:\n            data = json.loads(r.text, strict=False)\n        except BaseException as e:\n            logging.exception(\"Problem parsing page: %s\", r.text)\n            print(e)\n            print(len(r.text))\n            print(type(r.text))\n            print(r.text)\n            
print(\"----\")\n            return\n        loop = self.loop_data(data)\n        if loop:\n            self.remove_old(\"article_id\")\n\n\ndef main():\n    # logging\n    logging.basicConfig(\n        filename=LOG_FOLDER + \"titlediff.log\",\n        format=\"%(asctime)s %(name)13s %(levelname)8s: %(message)s\",\n        level=logging.INFO,\n    )\n    logging.getLogger(\"requests\").setLevel(logging.WARNING)\n    logging.info(\"Starting script\")\n\n    nyt_api = None\n    nyt_client = None\n    if os.environ.get(\"NYT_TWITTER_CONSUMER_KEY\"):\n        consumer_key = os.environ[\"NYT_TWITTER_CONSUMER_KEY\"]\n        consumer_secret = os.environ[\"NYT_TWITTER_CONSUMER_SECRET\"]\n        access_token = os.environ[\"NYT_TWITTER_ACCESS_TOKEN\"]\n        access_token_secret = os.environ[\"NYT_TWITTER_ACCESS_TOKEN_SECRET\"]\n        bearer_token = os.environ[\"NYT_BEARER_TOKEN\"]\n        auth = tweepy.OAuthHandler(consumer_key, consumer_secret)\n        auth.set_access_token(access_token, access_token_secret)\n        nyt_api = tweepy.API(auth)\n        nyt_client = tweepy.Client(\n            bearer_token=bearer_token,\n            consumer_key=consumer_key,\n            consumer_secret=consumer_secret,\n            access_token=access_token,\n            access_token_secret=access_token_secret,\n        )\n        logging.debug(\"NYT Twitter API configured\")\n\n    bsky_api = None\n    if \"BLUESKY_LOGIN\" in os.environ:\n        bsky_login = os.environ[\"BLUESKY_LOGIN\"]\n        bsky_passwd = os.environ[\"BLUESKY_PASSWD\"]\n        bsky_api = Client(base_url=\"https://bsky.social\")\n        try:\n            bsky_api.login(bsky_login, bsky_passwd)\n        except Exception:\n            logging.exception(\"Bluesky login failed\")\n            return\n\n    try:\n        logging.debug(\"Starting NYT\")\n        nyt_api_key = os.environ[\"NYT_API_KEY\"]\n        nyt = NYTParser(\n            nyt_api_key=nyt_api_key, api=nyt_api, 
client=nyt_client, bsky_api=bsky_api\n        )\n        nyt.parse_pages()\n        logging.debug(\"Finished NYT\")\n    except Exception:\n        logging.exception(\"NYT\")\n\n    logging.info(\"Finished script\")\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "output/.gitignore",
    "content": ""
  },
  {
    "path": "requirements.txt",
    "content": "alembic>=1.13.1,<2.0.0\natproto>=0.0.50,<1.0.0\nbleach>=6.1.0,<7.0.0\ndataset>=1.6.2,<2.0.0\nhtml5lib>=1.1,<2.0\nMako>=1.3.2,<2.0.0\nMarkupSafe>=2.1.5,<3.0.0\nnormality>=2.5.0,<3.0.0\noauthlib>=3.2.2,<4.0.0\npillow>=11.0\npython-editor>=1.0.4,<2.0.0\npytz>=2024.1,<2025.0\nPyYAML>=6.0.1,<7.0.0\nrequests>=2.31.0,<3.0.0\nrequests-oauthlib>=1.3.1,<2.0.0\nselenium>=4.18.1,<5.0.0\nsimplediff>=1.1,<2.0\nsix>=1.16.0,<2.0\nSQLAlchemy>=1.4.52,<2.0.0\ntweepy>=4.14.0,<5.0.0\n"
  },
  {
    "path": "run_diff.sh",
    "content": "#!/bin/bash\nexport TESTING=True\n\nexport NYT_TWITTER_CONSUMER_KEY=\"\"\nexport NYT_TWITTER_CONSUMER_SECRET=\"\"\nexport NYT_TWITTER_ACCESS_TOKEN=\"\"\nexport NYT_TWITTER_ACCESS_TOKEN_SECRET=\"\"\n\nexport NYT_API_KEY=\"\"\n\nexport PHANTOMJS_PATH=\"./\"\n\npython nytdiff.py\n"
  }
]