[
  {
    "path": "README.md",
    "content": "# ArXiv Paper Summarizer\n\nThis repository provides a Python script to fetch and summarize research papers from arXiv using the free Gemini API. Additionally, it demonstrates how to automate the extraction and summarization of arXiv articles daily based on specific keywords (see the section titled \"Automatic Daily Extraction and Summarization\" below). The tool is designed to help researchers, students, and enthusiasts quickly extract key insights from arXiv papers without manually reading through lengthy documents.\n\n## Features\n- **Single URL Summarization**: Summarize a single arXiv paper by providing its URL.\n- **Batch URL Summarization**: Summarize multiple arXiv papers by listing their URLs in a text file.\n- **Batch Keywords Summarization**: Fetch and summarize all papers from arXiv based on keywords and date ranges.\n- **Easy Setup**: Simple installation and configuration process using Conda and pip.\n- **Gemini API Integration**: Leverages the free Gemini API for high-quality summarization.\n\n## Prerequisites\n- Python 3.11\n- Conda (for environment management)\n- A Gemini API key (free to obtain)\n\n## Installation\n\n### 1. Clone the Repository\n```bash\ngit clone https://github.com/Shaier/arxiv_summarizer.git\ncd arxiv_summarizer\n```\n\n### 2. Set Up the Conda Environment\nCreate and activate a Conda environment with Python 3.11:\n```bash\nconda create -n arxiv_summarizer python=3.11\nconda activate arxiv_summarizer\n```\n\n### 3. Install Dependencies\nInstall the required Python packages using pip:\n```bash\npip install -r requirements.txt\n```\n\n### 4. Configure the Gemini API Key\nObtain your Gemini API key from [Google's Gemini API page](https://ai.google.dev/gemini-api/docs/api-key). 
Once you have the key, open `url_summarize.py` and set the `GEMINI_API_KEY` variable on line 5 to your actual key. If you plan to use keyword search, set the same variable at the top of `keywords_summarizer.py` as well.\n\n## Usage\n\n### Summarize a Single Paper (Based on a Single URL)\nTo summarize a single arXiv paper, run the script and provide the arXiv URL (ensure it is the abstract page, not the PDF link):\n```bash\npython url_summarize.py\n```\nWhen prompted:\n1. Enter `1` to summarize a single paper.\n2. Provide the arXiv URL (e.g., `https://arxiv.org/abs/2410.08003`).\n\n### Summarize Multiple Papers (Based on Multiple URLs)\nTo summarize multiple papers:\n1. Add the arXiv URLs to the `links.txt` file, one URL per line.\n2. Run the script:\n```bash\npython url_summarize.py\n```\n3. When prompted, enter `2` to process all URLs listed in `links.txt`. Summaries are saved in `result.txt`.\n\n### Example\nHere’s an example session:\n```bash\npython url_summarize.py\n> Enter 1 or 2: 1\n> Enter the arXiv URL: https://arxiv.org/abs/2410.08003\n```\n\n### Summarize Multiple Papers (Based on Keywords)\n\n`keywords_summarizer.py` fetches and summarizes papers based on specified keywords and date ranges. This is useful for tracking new research trends, generating related work sections, or conducting systematic reviews across multiple keywords at once.\n\n#### Usage\n\n1. **Run the script**:\n```bash\npython keywords_summarizer.py\n```\n2. **Specify keywords, a date range, and a result limit** when prompted. Example input:\n```bash\nEnter keywords separated by commas: transformer, sparsity, MoE\nEnter start date (YYYY-MM-DD): 2017-01-01\nEnter end date (YYYY-MM-DD): 2024-03-01\nEnter the number of results per keyword: 50\n```\n3. The script fetches relevant papers from arXiv and generates summaries. The results are saved in `result.txt`.
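For reference, `keywords_summarizer.py` builds its queries against the public arXiv API at `http://export.arxiv.org/api/query`. The snippet below is an illustrative sketch of how such a query URL can be assembled for one keyword and date range (the helper name `build_query_url` is not part of this repository); note that the API expects `submittedDate` bounds in `YYYYMMDDHHMM` form:

```python
from datetime import datetime

def build_query_url(keyword, start_date, end_date, max_results=10):
    # The arXiv API expects submittedDate bounds as YYYYMMDDHHMM timestamps.
    start = datetime.strptime(start_date, '%Y-%m-%d').strftime('%Y%m%d0000')
    end = datetime.strptime(end_date, '%Y-%m-%d').strftime('%Y%m%d2359')
    # %22 is a URL-encoded double quote, so the keyword is matched as a phrase.
    query = 'all:%22' + keyword.replace(' ', '+') + '%22'
    return ('http://export.arxiv.org/api/query'
            f'?search_query=({query})+AND+submittedDate:[{start}+TO+{end}]'
            f'&start=0&max_results={max_results}')

print(build_query_url('language models', '2024-01-01', '2024-03-01'))
```

The API returns an Atom feed; the script parses it with `xml.etree.ElementTree` to extract each paper title, abstract, and PDF link.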
\n\n\n## Automatic Daily Extraction and Summarization  \n  \n You can automate the extraction and summarization of arXiv articles based on specific keywords using Google Apps Script.  \n This setup will run daily and add newly found article titles (with links and summaries) to a Google Doc.  \n  \n ### Steps to Set Up  \n  \n 1. **Open Google Apps Script**  \n    - Log in to your Google account and go to [Google Apps Script](https://script.google.com/home/my).  \n    - Click on **\"New project\"** in the top left.  \n  \n 2. **Create a Google Doc**  \n    - Open [Google Docs](https://docs.google.com).  \n    - Click **Blank document** to create a new document.  \n    - Copy the **document ID** from the URL.  \n      - The ID is the long string in the document's URL, e.g., `123HEM4h5aQwygDk_A-xNaJ8CUoyMZTFsChyMk`.  \n  \n 3. **Copy and Modify the Script**  \n    - Open the `daily_arxiv.txt` file in this repository.  \n    - Copy and paste its content into the Google Apps Script editor.  \n    - Locate the `var docId` in the script (around line 3) and replace it with the **Google Doc ID** from Step 2.  \n    - Add your **Gemini API Key** around **line 81** (look for `var apiKey =`).\n    - Locate `var keywords = [...]` around **line 4** and update it with your preferred keywords.  \n  \n 4. **Test the Script**  \n    - Click the **Run** button at the top to execute the script (you might need to provide permissions).  \n    - If everything works correctly, your Google Doc should now contain a list of arXiv article titles with links.  \n  \n 5. **Schedule Daily Execution**  \n    - Click on the **clock icon** on the left (Triggers).  \n    - Click **\"Add trigger\"** in the bottom right.  \n    - Configure the trigger settings:  \n      - **Function**: Select the main function from the dropdown.  \n      - **Event Source**: Choose **Time-driven**.  \n      - **Type**: Select **Day timer**.  \n      - **Time Range**: Pick a time slot (e.g., midnight to 1 AM).  
\n      - **Notifications**: Enable email notifications if you want updates.  \n    - Click **Save**.  \n  \n Now, your script will automatically fetch and summarize new arXiv articles daily based on your chosen keywords!  \n\n\n\n## Contributing\nContributions are welcome! If you have suggestions, improvements, or bug fixes, please open an issue or submit a pull request.\n\n## Support\nIf you encounter any issues or have questions, feel free to open an issue.\n"
  },
  {
    "path": "daily_arxiv.txt",
    "content": "// Function to fetch and process papers\nfunction fetchAndWritePapers() {\n  var docId = \"GOOGLE-DOC-ID\"; // Replace with your Google Doc ID\n  var keywords = ['language models', 'llm'];\n  var arxivRssUrl = 'http://arxiv.org/rss/cs.AI'; \n\n  try {\n    var doc = DocumentApp.openById(docId);\n    var body = doc.getBody();\n    var response = UrlFetchApp.fetch(arxivRssUrl);\n    var responseText = response.getContentText();\n    var xml = XmlService.parse(responseText);\n    var rootElement = xml.getRootElement();\n    var channel = rootElement.getChild(\"channel\");\n    var items = channel.getChildren(\"item\");\n\n    var paperCount = 1; // Correctly track numbering for papers\n\n    for (var i = 0; i < items.length; i++) {\n      var title = items[i].getChildText(\"title\").toLowerCase();\n      var link = items[i].getChildText(\"link\");\n      var description = items[i].getChildText(\"description\"); // Abstract\n\n      for (var j = 0; j < keywords.length; j++) {\n        if (title.includes(keywords[j].toLowerCase())) {\n          // Append paper title as a paragraph with correct numbering\n          var listItem = body.appendParagraph(paperCount + \") \" + title);\n          listItem.setLinkUrl(link);\n          listItem.setBold(true);\n\n          // Ensure spacing consistency for summary\n          if (description) {\n            var summary = summarizeWithGemini(description);\n            var summaryParagraph = body.appendParagraph(summary);\n            summaryParagraph.setIndentStart(30); // Indentation for readability\n            summaryParagraph.setSpacingBefore(5); // Ensures space before summary\n            Utilities.sleep(5000); // Pause for 5 second because of the free Gemini quota\n          } else {\n            var summaryParagraph = body.appendParagraph(\"Abstract not available.\");\n            summaryParagraph.setIndentStart(30);\n            summaryParagraph.setSpacingBefore(5);\n          }\n\n          paperCount++; 
// Increment paper count correctly\n          break;\n        }\n      }\n    }\n  } catch (e) {\n    Logger.log(\"Error: \" + e.toString());\n  }\n}\n\n// Function to fetch the abstract from an arXiv URL\nfunction fetchAbstract(arxivUrl) {\n  try {\n    Logger.log(\"Fetching abstract from: \" + arxivUrl);\n    var response = UrlFetchApp.fetch(arxivUrl);\n    var html = response.getContentText();\n\n    // Use regex to extract the abstract (simplified example)\n    var abstractRegex = /<blockquote class=\"abstract mathjax\">([\\s\\S]*?)<\\/blockquote>/;\n    var match = html.match(abstractRegex);\n    if (match && match[1]) {\n      Logger.log(\"Abstract fetched successfully.\");\n      return match[1].replace(/Abstract:/, \"Abstract: \").trim();\n    } else {\n      Logger.log(\"Abstract not found.\");\n      return \"Error: Abstract not found.\";\n    }\n  } catch (e) {\n    Logger.log(\"Error fetching abstract: \" + e.toString());\n    return \"Error: Unable to fetch the abstract.\";\n  }\n}\n\n// Function to summarize the abstract using the Gemini API\nfunction summarizeWithGemini(abstract) {\n  try {\n    Logger.log(\"Summarizing abstract using Gemini API...\");\n    var apiKey = \"GEMINI-API-KEY\"; // Replace with your Gemini API key\n    var apiUrl = \"https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-flash:generateContent?key=\" + apiKey;\n\n    var payload = {\n      \"contents\": [{\n        \"parts\": [{\n          \"text\": \"Summarize the following abstract in 1-2 simple sentences. 
Focus on what the authors did, why, and the results: \\n\\n\" + abstract\n        }]\n      }]\n    };\n\n    var options = {\n      \"method\": \"post\",\n      \"contentType\": \"application/json\",\n      \"payload\": JSON.stringify(payload)\n    };\n\n    var response = UrlFetchApp.fetch(apiUrl, options);\n    var result = JSON.parse(response.getContentText());\n    Logger.log(\"Summary generated successfully.\");\n    return result.candidates[0].content.parts[0].text;\n  } catch (e) {\n    Logger.log(\"Error summarizing abstract: \" + e.toString());\n    return \"Error: Unable to summarize the abstract.\";\n  }\n}\n"
  },
  {
    "path": "keywords_summarizer.py",
    "content": "import requests\nfrom bs4 import BeautifulSoup\nimport xml.etree.ElementTree as ET\nimport time\nfrom datetime import datetime, timedelta\nfrom concurrent.futures import ThreadPoolExecutor, as_completed\n\n# Set up your Gemini API key\nGEMINI_API_KEY = \"\"\n\ndef fetch_abstract(arxiv_url):\n    # Fetch the arXiv page content using requests\n    response = requests.get(arxiv_url)\n    if response.status_code != 200:\n        return f\"Error: Unable to fetch {arxiv_url}, status code: {response.status_code}\"\n\n    # Parse the HTML content of the arXiv page\n    soup = BeautifulSoup(response.text, 'html.parser')\n\n    # Find the abstract section\n    abstract_tag = soup.find('blockquote', class_='abstract mathjax')\n    if abstract_tag:\n        # Get the content of the abstract and ensure a space after the \"Abstract:\" label\n        abstract_text = abstract_tag.text.strip()\n        # Ensure \"Abstract:\" has a space\n        if abstract_text.startswith(\"Abstract:\"):\n            abstract_text = abstract_text.replace(\"Abstract:\", \"Abstract: \")\n        return abstract_text\n    else:\n        return \"Error: Abstract not found.\"\n\ndef summarize_with_gemini(abstract_text):\n    # Set up the API endpoint for Gemini\n    url = \"https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-flash:generateContent?key=\" + GEMINI_API_KEY\n    \n    headers = {\n        \"Content-Type\": \"application/json\",\n    }\n\n    # Create the payload to send the abstract to Gemini with a more focused prompt\n    data = {\n        \"contents\": [{\n            \"parts\": [{\n                \"text\": f\"Summarize the following abstract in 1-2 simple sentences. 
Focus on what the authors did, why, and the results: \\n\\n{abstract_text}\"\n            }]\n        }]\n    }\n\n    # Make the POST request to the Gemini API\n    response = requests.post(url, headers=headers, json=data)\n    \n    if response.status_code == 200:\n        # Extract the summary from the response\n        result = response.json()\n        try:\n            # Access the correct keys in the response structure\n            summary = result['candidates'][0]['content']['parts'][0]['text']\n            return summary\n        except KeyError as e:\n            return f\"KeyError: {e}, check the response structure.\"\n    else:\n        return f\"Error: Unable to get response, status code: {response.status_code}\"\n\n\ndef fetch_papers_for_date_range(keyword, start_date, end_date, max_results):\n    papers = []\n    query = f'all:\"{keyword}\"'\n    # The arXiv API expects submittedDate bounds as YYYYMMDDHHMM timestamps\n    start = start_date.replace(\"-\", \"\") + \"0000\"\n    end = end_date.replace(\"-\", \"\") + \"2359\"\n    query_url = f\"http://export.arxiv.org/api/query?search_query=({query})+AND+submittedDate:[{start}+TO+{end}]&start=0&max_results={max_results}\"\n    \n    # Fetch papers for this keyword and date range\n    response = requests.get(query_url)\n    if response.status_code != 200:\n        print(f\"Error: Unable to fetch papers for keyword '{keyword}' from {start_date} to {end_date}, status code: {response.status_code}\")\n        return papers\n\n    # Parse the XML response to get the papers\n    root = ET.fromstring(response.content)\n    for entry in root.findall('{http://www.w3.org/2005/Atom}entry'):\n        title = entry.find('{http://www.w3.org/2005/Atom}title').text.strip()\n        summary = entry.find('{http://www.w3.org/2005/Atom}summary').text.strip()\n        # Fall back to the abstract-page URL when no PDF link is present\n        link_tag = entry.find('{http://www.w3.org/2005/Atom}link[@title=\"pdf\"]')\n        link = link_tag.attrib['href'] if link_tag is not None else entry.find('{http://www.w3.org/2005/Atom}id').text.strip()\n        papers.append({'title': title, 'summary': summary, 'link': link, 'keyword': keyword})\n    \n    return papers\n\ndef fetch_papers(keywords, start_date, end_date, max_results_per_keyword):\n    papers = []\n    keyword_totals = {keyword: 0 
for keyword in keywords}  # Dictionary to store total papers found for each keyword\n    \n    # Convert start_date and end_date to datetime objects\n    start_date = datetime.strptime(start_date, \"%Y-%m-%d\")\n    end_date = datetime.strptime(end_date, \"%Y-%m-%d\")\n    \n    # Split the date range into monthly intervals\n    current_start_date = start_date\n    date_ranges = []\n    while current_start_date < end_date:\n        current_end_date = current_start_date + timedelta(days=30)  # Approximate 1 month\n        if current_end_date > end_date:\n            current_end_date = end_date\n        date_ranges.append((current_start_date.strftime(\"%Y-%m-%d\"), current_end_date.strftime(\"%Y-%m-%d\")))\n        current_start_date = current_end_date + timedelta(days=1)\n    \n    # Use ThreadPoolExecutor to parallelize queries\n    with ThreadPoolExecutor(max_workers=5) as executor:\n        futures = []\n        for keyword in keywords:\n            for start, end in date_ranges:\n                futures.append(executor.submit(fetch_papers_for_date_range, keyword, start, end, max_results_per_keyword))\n        \n        for future in as_completed(futures):\n            papers.extend(future.result())\n    \n    # Count papers per keyword\n    for paper in papers:\n        keyword_totals[paper['keyword']] += 1\n    \n    # Print the total number of documents found for each keyword\n    print(\"\\nTotal documents found per keyword:\")\n    for keyword, total in keyword_totals.items():\n        print(f\"{keyword}: {total} documents\")\n    \n    return papers\n\n# Open the result file to store the summaries\nwith open(\"result.txt\", \"w\") as result_file:\n    # Prompt for user input\n    keywords = [keyword.strip() for keyword in input(\"Enter keywords separated by commas: \").split(',')]  # Strip stray whitespace around each keyword\n    start_date = input(\"Enter start date (YYYY-MM-DD): \").strip()\n    end_date = input(\"Enter end date (YYYY-MM-DD): \").strip()\n    max_results_per_keyword = int(input(\"Enter the number of results 
per keyword: \").strip())\n\n    # Fetch papers based on keywords, date range, and max results per keyword\n    papers = fetch_papers(keywords, start_date, end_date, max_results_per_keyword)\n    \n    for paper in papers:\n        print(f\"Fetching abstract for: {paper['title']}\")\n        \n        # Fetch the abstract\n        abstract = paper['summary']\n        if not abstract.startswith(\"Error\"):\n            # Summarize the abstract using Gemini\n            summary = summarize_with_gemini(abstract)\n            # summary = 'xyz'\n            result_file.write(f\"Keyword: {paper['keyword']}\\nTitle: {paper['title']}\\nLink: {paper['link']}\\nSummary: {summary}\\n\\n\")\n            print(f\"Summary for {paper['title']}:\\n{summary}\\n\")\n        else:\n            result_file.write(f\"Keyword: {paper['keyword']}\\nTitle: {paper['title']}\\nLink: {paper['link']}\\nSummary: Error fetching abstract\\n\\n\")\n            print(f\"Error fetching abstract for {paper['title']}\\n\")\n        # Add a 2-seconds delay between each summary request\n        time.sleep(2)\n"
  },
  {
    "path": "links.txt",
    "content": "https://arxiv.org/abs/2410.08003\nhttps://arxiv.org/abs/2410.04241\nhttps://arxiv.org/abs/2406.16779\nhttps://arxiv.org/abs/2402.00123\nhttps://arxiv.org/abs/2401.18001\nhttps://arxiv.org/abs/2310.10571\nhttps://arxiv.org/abs/2310.10583"
  },
  {
    "path": "requirements.txt",
    "content": "requests\nbeautifulsoup4\n"
  },
  {
    "path": "result.txt",
    "content": "arXiv URL: https://arxiv.org/abs/2410.08003\nSummary: The authors developed COMET, a new sparse neural network architecture inspired by biological neural systems, to address limitations of existing sparse networks in learning multiple tasks efficiently.  COMET uses random projections instead of trainable gating functions, leading to faster learning and better generalization across various tasks.\n\n\narXiv URL: https://arxiv.org/abs/2410.04241\nSummary: The authors introduced a new question answering task that incorporates source citations for questions with multiple valid answers, creating five new datasets and evaluation metrics to address this challenge; their results from several baseline models highlight the difficulty and importance of this task for building more trustworthy QA systems.\n\n\narXiv URL: https://arxiv.org/abs/2406.16779\nSummary: The authors investigated how input order and emphasis affect large language models' reading comprehension performance, testing nine models on three datasets.  They found that presenting the context before the question significantly improved accuracy (up to 36%), especially for questions requiring external knowledge, with simple context emphasis techniques proving most effective.\n\n\narXiv URL: https://arxiv.org/abs/2402.00123\nSummary: The authors compared language model evaluation using expert-designed templates versus naturally occurring text, finding that the two methods yielded different model rankings and scores, with template-free methods showing lower accuracy but a more expected relationship between perplexity and accuracy.  The differences were particularly notable between general and domain-specific models.\n\n\narXiv URL: https://arxiv.org/abs/2401.18001\nSummary: The authors evaluated 15 question-answering systems across five datasets using a comprehensive set of criteria to understand their robustness, consistency, and handling of conflicting information.  
Their results revealed unexpected relationships between these factors, highlighting the significant negative impact of combining conflicting knowledge and noisy data on system performance.\n\n\narXiv URL: https://arxiv.org/abs/2310.10571\nSummary: The authors investigated whether adding irrelevant demographic information to biomedical questions affected the answers given by two types of question answering systems (knowledge graph-grounded and text-based).  They found that irrelevant demographic details caused significant changes in the answers provided by both systems, highlighting fairness concerns in biomedical AI.\n\n\narXiv URL: https://arxiv.org/abs/2310.10583\nSummary: The authors argue that current language models are unreliable, especially for low-resource languages, and propose building models that cite their sources to improve trustworthiness.  They discuss the benefits and challenges of this approach, aiming to stimulate discussion on improving language model development.\n\n\n"
  },
  {
    "path": "url_summarize.py",
    "content": "import requests\nfrom bs4 import BeautifulSoup\n\n# Set up your Gemini API key\nGEMINI_API_KEY = \"\"\n\ndef fetch_abstract(arxiv_url):\n    # Fetch the arXiv page content using requests\n    response = requests.get(arxiv_url)\n    if response.status_code != 200:\n        return f\"Error: Unable to fetch {arxiv_url}, status code: {response.status_code}\"\n\n    # Parse the HTML content of the arXiv page\n    soup = BeautifulSoup(response.text, 'html.parser')\n\n    # Find the abstract section\n    abstract_tag = soup.find('blockquote', class_='abstract mathjax')\n    if abstract_tag:\n        # Get the content of the abstract and ensure a space after the \"Abstract:\" label\n        abstract_text = abstract_tag.text.strip()\n        # Ensure \"Abstract:\" has a space\n        if abstract_text.startswith(\"Abstract:\"):\n            abstract_text = abstract_text.replace(\"Abstract:\", \"Abstract: \")\n        return abstract_text\n    else:\n        return \"Error: Abstract not found.\"\n\ndef summarize_with_gemini(abstract_text):\n    # Set up the API endpoint for Gemini\n    url = \"https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-flash:generateContent?key=\" + GEMINI_API_KEY\n    \n    headers = {\n        \"Content-Type\": \"application/json\",\n    }\n\n    # Create the payload to send the abstract to Gemini with a more focused prompt\n    data = {\n        \"contents\": [{\n            \"parts\": [{\n                \"text\": f\"Summarize the following abstract in 1-2 simple sentences. 
Focus on what the authors did, why, and the results: \\n\\n{abstract_text}\"\n            }]\n        }]\n    }\n\n    # Make the POST request to the Gemini API\n    response = requests.post(url, headers=headers, json=data)\n    \n    if response.status_code == 200:\n        # Extract the summary from the response\n        result = response.json()\n        try:\n            # Access the correct keys in the response structure\n            summary = result['candidates'][0]['content']['parts'][0]['text']\n            return summary\n        except KeyError as e:\n            return f\"KeyError: {e}, check the response structure.\"\n    else:\n        return f\"Error: Unable to get response, status code: {response.status_code}\"\n\n# Open the result file to store the summaries\nwith open(\"result.txt\", \"w\") as result_file:\n    # Prompt for user input\n    print(\"Select an option:\")\n    print(\"1. Enter a single arXiv link\")\n    print(\"2. Provide a file with arXiv links (using 'links.txt')\")\n\n    option = input(\"Enter 1 or 2: \")\n\n    if option == '1':\n        # Single paper input\n        arxiv_url = input(\"Enter the arXiv URL: \").strip()\n        print(f\"Fetching abstract for: {arxiv_url}\")\n        \n        # Fetch the abstract\n        abstract = fetch_abstract(arxiv_url)\n        if not abstract.startswith(\"Error\"):\n            # Summarize the abstract using Gemini\n            summary = summarize_with_gemini(abstract)\n            result_file.write(f\"arXiv URL: {arxiv_url}\\nSummary: {summary}\\n\\n\")\n            print(f\"Summary for {arxiv_url}:\\n{summary}\\n\")\n        else:\n            print(f\"Error fetching abstract for {arxiv_url}\\n\")\n\n    elif option == '2':\n        # Multiple papers from file (assuming links.txt)\n        file_path = 'links.txt'\n        try:\n            with open(file_path, 'r') as file:\n                links = file.readlines()\n\n            for link in links:\n                arxiv_url = 
link.strip()\n                print(f\"Fetching abstract for: {arxiv_url}\")\n                \n                # Fetch the abstract\n                abstract = fetch_abstract(arxiv_url)\n                if not abstract.startswith(\"Error\"):\n                    # Summarize the abstract using Gemini\n                    summary = summarize_with_gemini(abstract)\n                    result_file.write(f\"arXiv URL: {arxiv_url}\\nSummary: {summary}\\n\\n\")\n                    print(f\"Summary for {arxiv_url}:\\n{summary}\\n\")\n                else:\n                    result_file.write(f\"arXiv URL: {arxiv_url}\\nSummary: Error fetching abstract\\n\\n\")\n                    print(f\"Error fetching abstract for {arxiv_url}\\n\")\n\n        except FileNotFoundError:\n            print(\"Error: The file 'links.txt' with arXiv links was not found.\")\n            result_file.write(\"Error: The file 'links.txt' with arXiv links was not found.\\n\")\n\n    else:\n        print(\"Invalid option. Please run the script again and choose option 1 or 2.\")"
  }
]