Repository: Shaier/arxiv_summarizer
Branch: main
Commit: 549d68c2039f
Files: 7
Total size: 23.4 KB

Directory structure:
gitextract_mgf991u5/
├── README.md
├── daily_arxiv.txt
├── keywords_summarizer.py
├── links.txt
├── requirements.txt
├── result.txt
└── url_summarize.py

================================================
FILE CONTENTS
================================================

================================================
FILE: README.md
================================================

# ArXiv Paper Summarizer

This repository provides a Python script to fetch and summarize research papers from arXiv using the free Gemini API. Additionally, it demonstrates how to automate the extraction and summarization of arXiv articles daily based on specific keywords (see the section titled "Automatic Daily Extraction and Summarization" below).

The tool is designed to help researchers, students, and enthusiasts quickly extract key insights from arXiv papers without manually reading through lengthy documents.

## Features

- **Single URL Summarization**: Summarize a single arXiv paper by providing its URL.
- **Batch URL Summarization**: Summarize multiple arXiv papers by listing their URLs in a text file.
- **Batch Keywords Summarization**: Fetch and summarize all papers from arXiv based on keywords and date ranges.
- **Easy Setup**: Simple installation and configuration process using Conda and pip.
- **Gemini API Integration**: Leverages the free Gemini API for high-quality summarization.

## Prerequisites

- Python 3.11
- Conda (for environment management)
- A Gemini API key (free to obtain)

## Installation

### 1. Clone the Repository

```bash
git clone https://github.com/Shaier/arxiv_summarizer.git
cd arxiv_summarizer
```

### 2. Set Up the Conda Environment

Create and activate a Conda environment with Python 3.11:

```bash
conda create -n arxiv_summarizer python=3.11
conda activate arxiv_summarizer
```

### 3. Install Dependencies

Install the required Python packages using pip:

```bash
pip install -r requirements.txt
```

### 4. Configure the Gemini API Key

Obtain your Gemini API key from [Google's Gemini API page](https://ai.google.dev/gemini-api/docs/api-key). Once you have the key, open `url_summarize.py` and set the `GEMINI_API_KEY` variable on line 5 to your actual API key. Do the same at the top of `keywords_summarizer.py` if you plan to use keyword-based summarization.

## Usage

### Summarize a Single Paper (Based on a Single URL)

To summarize a single arXiv paper, run the script and provide the arXiv URL (ensure it is the abstract page, not the PDF link):

```bash
python url_summarize.py
```

When prompted:

1. Enter `1` to summarize a single paper.
2. Provide the arXiv URL (e.g., `https://arxiv.org/abs/2410.08003`).

### Summarize Multiple Papers (Based on Multiple URLs)

To summarize multiple papers:

1. Add the arXiv URLs to the `links.txt` file, with one URL per line.
2. Run the script:

   ```bash
   python url_summarize.py
   ```

3. When prompted, enter `2` to process all URLs listed in `links.txt`.

Summaries are saved in `result.txt`.

## Example

Here’s an example of how to use the script:

```bash
python url_summarize.py
> Select an option:
> 1. Enter a single arXiv link
> 2. Provide a file with arXiv links (using 'links.txt')
> Enter 1 or 2: 1
> Enter the arXiv URL: https://arxiv.org/abs/2410.08003
```

### Summarize Multiple Papers (Based on Keywords)

`keywords_summarizer.py` enables fetching and summarizing papers based on specified keywords and date ranges. This is useful for tracking new research trends, generating related work sections, or conducting systematic reviews across multiple keywords at once.

### Usage
1. **Run the script** and provide your search criteria:

   ```bash
   python keywords_summarizer.py
   ```

2. **Specify keywords, a date range, and the number of results per keyword** when prompted. Example input:

   ```bash
   Enter keywords separated by commas: transformer,sparsity,MoE
   Enter start date (YYYY-MM-DD): 2017-01-01
   Enter end date (YYYY-MM-DD): 2024-03-01
   Enter the number of results per keyword: 50
   ```

3. The script fetches relevant papers from arXiv and generates summaries. The results are saved in `result.txt`.

## Automatic Daily Extraction and Summarization

You can automate the extraction and summarization of arXiv articles based on specific keywords using Google Apps Script. This setup will run daily and add newly found article titles (with links and summaries) to a Google Doc.

### Steps to Set Up

1. **Open Google Apps Script**
   - Log in to your Google account and go to [Google Apps Script](https://script.google.com/home/my).
   - Click on **"New project"** in the top left.

2. **Create a Google Doc**
   - Open [Google Docs](https://docs.google.com).
   - Click **Blank document** to create a new document.
   - Copy the **document ID** from the URL.
   - The ID is the long string in the document's URL, e.g., `123HEM4h5aQwygDk_A-xNaJ8CUoyMZTFsChyMk`.

3. **Copy and Modify the Script**
   - Open the `daily_arxiv.txt` file in this repository.
   - Copy and paste its content into the Google Apps Script editor.
   - Locate `var docId` in the script (around line 3) and replace its `"GOOGLE-DOC-ID"` placeholder with the **Google Doc ID** from Step 2.
   - Add your **Gemini API key** around **line 81** (look for `var apiKey =` and replace the `"GEMINI-API-KEY"` placeholder).
   - Locate `var keywords = [...]` around **line 4** and update it with your preferred keywords.

4. **Test the Script**
   - Click the **Run** button at the top to execute the script (you might need to provide permissions).
   - If everything works correctly, your Google Doc should now contain a list of arXiv article titles with links.

5. **Schedule Daily Execution**
   - Click on the **clock icon** on the left (Triggers).
   - Click **"Add trigger"** in the bottom right.
   - Configure the trigger settings:
     - **Function**: Select the main function from the dropdown.
     - **Event Source**: Choose **Time-driven**.
     - **Type**: Select **Day timer**.
     - **Time Range**: Pick a time slot (e.g., midnight to 1 AM).
     - **Notifications**: Enable email notifications if you want updates.
   - Click **Save**.

Now, your script will automatically fetch and summarize new arXiv articles daily based on your chosen keywords!

## Contributing

Contributions are welcome! If you have suggestions, improvements, or bug fixes, please open an issue or submit a pull request.

## Support

If you encounter any issues or have questions, feel free to open an issue.
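## How It Works (Sketch)

Both `url_summarize.py` and `keywords_summarizer.py` reduce to the same two steps: get an abstract from arXiv, then send it to the Gemini API with a short summarization prompt. The snippet below is a minimal sketch of that flow, not a replacement for the scripts; it assumes `requests` and `beautifulsoup4` are installed (see `requirements.txt`), and `YOUR_GEMINI_API_KEY` is a placeholder you must replace with a real key.

```python
import requests
from bs4 import BeautifulSoup

GEMINI_API_KEY = "YOUR_GEMINI_API_KEY"  # placeholder: substitute your own key

def summarize_arxiv(arxiv_url: str) -> str:
    # Fetch the arXiv abstract page and pull the abstract text out of the
    # <blockquote class="abstract mathjax"> element.
    page = requests.get(arxiv_url)
    page.raise_for_status()
    abstract = BeautifulSoup(page.text, "html.parser").find(
        "blockquote", class_="abstract mathjax"
    ).text.strip()

    # Ask gemini-1.5-flash for a short summary of the abstract.
    endpoint = (
        "https://generativelanguage.googleapis.com/v1beta/models/"
        "gemini-1.5-flash:generateContent?key=" + GEMINI_API_KEY
    )
    prompt = ("Summarize the following abstract in 1-2 simple sentences. "
              "Focus on what the authors did, why, and the results: \n\n" + abstract)
    response = requests.post(endpoint, json={"contents": [{"parts": [{"text": prompt}]}]})
    response.raise_for_status()
    return response.json()["candidates"][0]["content"]["parts"][0]["text"]

if __name__ == "__main__":
    print(summarize_arxiv("https://arxiv.org/abs/2410.08003"))
```

`keywords_summarizer.py` wraps the same Gemini call in an arXiv API keyword-and-date search and writes each summary to `result.txt`.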
================================================ FILE: daily_arxiv.txt ================================================ // Function to fetch and process papers function fetchAndWritePapers() { var docId = "GOOGLE-DOC-ID"; // Replace with your Google Doc ID var keywords = ['language models', 'llm']; var arxivRssUrl = 'http://arxiv.org/rss/cs.AI'; try { var doc = DocumentApp.openById(docId); var body = doc.getBody(); var response = UrlFetchApp.fetch(arxivRssUrl); var responseText = response.getContentText(); var xml = XmlService.parse(responseText); var rootElement = xml.getRootElement(); var channel = rootElement.getChild("channel"); var items = channel.getChildren("item"); var paperCount = 1; // Correctly track numbering for papers for (var i = 0; i < items.length; i++) { var title = items[i].getChildText("title").toLowerCase(); var link = items[i].getChildText("link"); var description = items[i].getChildText("description"); // Abstract for (var j = 0; j < keywords.length; j++) { if (title.includes(keywords[j].toLowerCase())) { // Append paper title as a paragraph with correct numbering var listItem = body.appendParagraph(paperCount + ") " + title); listItem.setLinkUrl(link); listItem.setBold(true); // Ensure spacing consistency for summary if (description) { var summary = summarizeWithGemini(description); var summaryParagraph = body.appendParagraph(summary); summaryParagraph.setIndentStart(30); // Indentation for readability summaryParagraph.setSpacingBefore(5); // Ensures space before summary Utilities.sleep(5000); // Pause for 5 second because of the free Gemini quota } else { var summaryParagraph = body.appendParagraph("Abstract not available."); summaryParagraph.setIndentStart(30); summaryParagraph.setSpacingBefore(5); } paperCount++; // Increment paper count correctly break; } } } } catch (e) { Logger.log("Error: " + e.toString()); } } // Function to fetch the abstract from an arXiv URL function fetchAbstract(arxivUrl) { try { Logger.log("Fetching abstract from: " + arxivUrl); var response = UrlFetchApp.fetch(arxivUrl); var html = response.getContentText(); // Use regex to extract the abstract (simplified example) var abstractRegex = /
<blockquote class="abstract mathjax">([\s\S]*?)<\/blockquote>/; var match = html.match(abstractRegex); if (match && match[1]) { Logger.log("Abstract fetched successfully."); return match[1].replace(/Abstract:/, "Abstract: ").trim(); } else { Logger.log("Abstract not found."); return "Error: Abstract not found."; } } catch (e) { Logger.log("Error fetching abstract: " + e.toString()); return "Error: Unable to fetch the abstract."; } } // Function to summarize the abstract using the Gemini API function summarizeWithGemini(abstract) { try { Logger.log("Summarizing abstract using Gemini API..."); var apiKey = "GEMINI-API-KEY"; // Replace with your Gemini API key var apiUrl = "https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-flash:generateContent?key=" + apiKey; var payload = { "contents": [{ "parts": [{ "text": "Summarize the following abstract in 1-2 simple sentences. Focus on what the authors did, why, and the results: \n\n" + abstract }] }] }; var options = { "method": "post", "contentType": "application/json", "payload": JSON.stringify(payload) }; var response = UrlFetchApp.fetch(apiUrl, options); var result = JSON.parse(response.getContentText()); Logger.log("Summary generated successfully."); return result.candidates[0].content.parts[0].text; } catch (e) { Logger.log("Error summarizing abstract: " + e.toString()); return "Error: Unable to summarize the abstract."; } } ================================================ FILE: keywords_summarizer.py ================================================ import requests from bs4 import BeautifulSoup import xml.etree.ElementTree as ET import time from datetime import datetime, timedelta from concurrent.futures import ThreadPoolExecutor, as_completed # Set up your Gemini API key GEMINI_API_KEY = "" def fetch_abstract(arxiv_url): # Fetch the arXiv page content using requests response = requests.get(arxiv_url) if response.status_code != 200: return f"Error: Unable to fetch {arxiv_url}, status code: {response.status_code}" # Parse the HTML content of the arXiv page soup = BeautifulSoup(response.text, 'html.parser') # Find the abstract section abstract_tag = soup.find('blockquote', class_='abstract mathjax') if abstract_tag: # Get the content of the abstract and ensure a space after the "Abstract:" label abstract_text = abstract_tag.text.strip() # Ensure "Abstract:" has a space if abstract_text.startswith("Abstract:"): abstract_text = abstract_text.replace("Abstract:", "Abstract: ") return abstract_text else: return "Error: Abstract not found." def summarize_with_gemini(abstract_text): # Set up the API endpoint for Gemini url = "https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-flash:generateContent?key=" + GEMINI_API_KEY headers = { "Content-Type": "application/json", } # Create the payload to send the abstract to Gemini with a more focused prompt data = { "contents": [{ "parts": [{ "text": f"Summarize the following abstract in 1-2 simple sentences. Focus on what the authors did, why, and the results: \n\n{abstract_text}" }] }] } # Make the POST request to the Gemini API response = requests.post(url, headers=headers, json=data) if response.status_code == 200: # Extract the summary from the response result = response.json() try: # Access the correct keys in the response structure summary = result['candidates'][0]['content']['parts'][0]['text'] return summary except KeyError as e: return f"KeyError: {e}, check the response structure."
else: return f"Error: Unable to get response, status code: {response.status_code}" def fetch_papers_for_date_range(keyword, start_date, end_date, max_results): papers = [] query = f'all:"{keyword}"' query_url = f"http://export.arxiv.org/api/query?search_query=({query})+AND+submittedDate:[{start_date}+TO+{end_date}]&start=0&max_results={max_results}" # Fetch papers for this keyword and date range response = requests.get(query_url) if response.status_code != 200: print(f"Error: Unable to fetch papers for keyword '{keyword}' from {start_date} to {end_date}, status code: {response.status_code}") return papers # Parse the XML response to get the papers root = ET.fromstring(response.content) for entry in root.findall('{http://www.w3.org/2005/Atom}entry'): title = entry.find('{http://www.w3.org/2005/Atom}title').text.strip() summary = entry.find('{http://www.w3.org/2005/Atom}summary').text.strip() link = entry.find('{http://www.w3.org/2005/Atom}link[@title="pdf"]').attrib['href'] papers.append({'title': title, 'summary': summary, 'link': link, 'keyword': keyword}) return papers def fetch_papers(keywords, start_date, end_date, max_results_per_keyword): papers = [] keyword_totals = {keyword: 0 for keyword in keywords} # Dictionary to store total papers found for each keyword # Convert start_date and end_date to datetime objects start_date = datetime.strptime(start_date, "%Y-%m-%d") end_date = datetime.strptime(end_date, "%Y-%m-%d") # Split the date range into monthly intervals current_start_date = start_date date_ranges = [] while current_start_date < end_date: current_end_date = current_start_date + timedelta(days=30) # Approximate 1 month if current_end_date > end_date: current_end_date = end_date date_ranges.append((current_start_date.strftime("%Y-%m-%d"), current_end_date.strftime("%Y-%m-%d"))) current_start_date = current_end_date + timedelta(days=1) # Use ThreadPoolExecutor to parallelize queries with ThreadPoolExecutor(max_workers=5) as executor: futures = [] for keyword in keywords: for start, end in date_ranges: futures.append(executor.submit(fetch_papers_for_date_range, keyword, start, end, max_results_per_keyword)) for future in as_completed(futures): papers.extend(future.result()) # Count papers per keyword for paper in papers: keyword_totals[paper['keyword']] += 1 # Print the total number of documents found for each keyword print("\nTotal documents found per keyword:") for keyword, total in keyword_totals.items(): print(f"{keyword}: {total} documents") return papers # Open the result file to store the summaries with open("result.txt", "w") as result_file: # Prompt for user input keywords = input("Enter keywords separated by commas: ").strip().split(',') start_date = input("Enter start date (YYYY-MM-DD): ").strip() end_date = input("Enter end date (YYYY-MM-DD): ").strip() max_results_per_keyword = int(input("Enter the number of results per keyword: ").strip()) # Fetch papers based on keywords, date range, and max results per keyword papers = fetch_papers(keywords, start_date, end_date, max_results_per_keyword) for paper in papers: print(f"Fetching abstract for: {paper['title']}") # Fetch the abstract abstract = paper['summary'] if not abstract.startswith("Error"): # Summarize the abstract using Gemini summary = summarize_with_gemini(abstract) # summary = 'xyz' result_file.write(f"Keyword: {paper['keyword']}\nTitle: {paper['title']}\nLink: {paper['link']}\nSummary: {summary}\n\n") print(f"Summary for {paper['title']}:\n{summary}\n") else: result_file.write(f"Keyword: 
{paper['keyword']}\nTitle: {paper['title']}\nLink: {paper['link']}\nSummary: Error fetching abstract\n\n") print(f"Error fetching abstract for {paper['title']}\n") # Add a 2-seconds delay between each summary request time.sleep(2) ================================================ FILE: links.txt ================================================ https://arxiv.org/abs/2410.08003 https://arxiv.org/abs/2410.04241 https://arxiv.org/abs/2406.16779 https://arxiv.org/abs/2402.00123 https://arxiv.org/abs/2401.18001 https://arxiv.org/abs/2310.10571 https://arxiv.org/abs/2310.10583 ================================================ FILE: requirements.txt ================================================ requests beautifulsoup4 ================================================ FILE: result.txt ================================================ arXiv URL: https://arxiv.org/abs/2410.08003 Summary: The authors developed COMET, a new sparse neural network architecture inspired by biological neural systems, to address limitations of existing sparse networks in learning multiple tasks efficiently. COMET uses random projections instead of trainable gating functions, leading to faster learning and better generalization across various tasks. arXiv URL: https://arxiv.org/abs/2410.04241 Summary: The authors introduced a new question answering task that incorporates source citations for questions with multiple valid answers, creating five new datasets and evaluation metrics to address this challenge; their results from several baseline models highlight the difficulty and importance of this task for building more trustworthy QA systems. arXiv URL: https://arxiv.org/abs/2406.16779 Summary: The authors investigated how input order and emphasis affect large language models' reading comprehension performance, testing nine models on three datasets. They found that presenting the context before the question significantly improved accuracy (up to 36%), especially for questions requiring external knowledge, with simple context emphasis techniques proving most effective. arXiv URL: https://arxiv.org/abs/2402.00123 Summary: The authors compared language model evaluation using expert-designed templates versus naturally occurring text, finding that the two methods yielded different model rankings and scores, with template-free methods showing lower accuracy but a more expected relationship between perplexity and accuracy. The differences were particularly notable between general and domain-specific models. arXiv URL: https://arxiv.org/abs/2401.18001 Summary: The authors evaluated 15 question-answering systems across five datasets using a comprehensive set of criteria to understand their robustness, consistency, and handling of conflicting information. Their results revealed unexpected relationships between these factors, highlighting the significant negative impact of combining conflicting knowledge and noisy data on system performance. arXiv URL: https://arxiv.org/abs/2310.10571 Summary: The authors investigated whether adding irrelevant demographic information to biomedical questions affected the answers given by two types of question answering systems (knowledge graph-grounded and text-based). They found that irrelevant demographic details caused significant changes in the answers provided by both systems, highlighting fairness concerns in biomedical AI. 
arXiv URL: https://arxiv.org/abs/2310.10583 Summary: The authors argue that current language models are unreliable, especially for low-resource languages, and propose building models that cite their sources to improve trustworthiness. They discuss the benefits and challenges of this approach, aiming to stimulate discussion on improving language model development. ================================================ FILE: url_summarize.py ================================================ import requests from bs4 import BeautifulSoup # Set up your Gemini API key GEMINI_API_KEY = "" def fetch_abstract(arxiv_url): # Fetch the arXiv page content using requests response = requests.get(arxiv_url) if response.status_code != 200: return f"Error: Unable to fetch {arxiv_url}, status code: {response.status_code}" # Parse the HTML content of the arXiv page soup = BeautifulSoup(response.text, 'html.parser') # Find the abstract section abstract_tag = soup.find('blockquote', class_='abstract mathjax') if abstract_tag: # Get the content of the abstract and ensure a space after the "Abstract:" label abstract_text = abstract_tag.text.strip() # Ensure "Abstract:" has a space if abstract_text.startswith("Abstract:"): abstract_text = abstract_text.replace("Abstract:", "Abstract: ") return abstract_text else: return "Error: Abstract not found." def summarize_with_gemini(abstract_text): # Set up the API endpoint for Gemini url = "https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-flash:generateContent?key=" + GEMINI_API_KEY headers = { "Content-Type": "application/json", } # Create the payload to send the abstract to Gemini with a more focused prompt data = { "contents": [{ "parts": [{ "text": f"Summarize the following abstract in 1-2 simple sentences. Focus on what the authors did, why, and the results: \n\n{abstract_text}" }] }] } # Make the POST request to the Gemini API response = requests.post(url, headers=headers, json=data) if response.status_code == 200: # Extract the summary from the response result = response.json() try: # Access the correct keys in the response structure summary = result['candidates'][0]['content']['parts'][0]['text'] return summary except KeyError as e: return f"KeyError: {e}, check the response structure." else: return f"Error: Unable to get response, status code: {response.status_code}" # Open the result file to store the summaries with open("result.txt", "w") as result_file: # Prompt for user input print("Select an option:") print("1. Enter a single arXiv link") print("2. 
Provide a file with arXiv links (using 'links.txt')") option = input("Enter 1 or 2: ") if option == '1': # Single paper input arxiv_url = input("Enter the arXiv URL: ").strip() print(f"Fetching abstract for: {arxiv_url}") # Fetch the abstract abstract = fetch_abstract(arxiv_url) if not abstract.startswith("Error"): # Summarize the abstract using Gemini summary = summarize_with_gemini(abstract) result_file.write(f"arXiv URL: {arxiv_url}\nSummary: {summary}\n\n") print(f"Summary for {arxiv_url}:\n{summary}\n") else: print(f"Error fetching abstract for {arxiv_url}\n") elif option == '2': # Multiple papers from file (assuming links.txt) file_path = 'links.txt' try: with open(file_path, 'r') as file: links = file.readlines() for link in links: arxiv_url = link.strip() print(f"Fetching abstract for: {arxiv_url}") # Fetch the abstract abstract = fetch_abstract(arxiv_url) if not abstract.startswith("Error"): # Summarize the abstract using Gemini summary = summarize_with_gemini(abstract) result_file.write(f"arXiv URL: {arxiv_url}\nSummary: {summary}\n\n") print(f"Summary for {arxiv_url}:\n{summary}\n") else: result_file.write(f"arXiv URL: {arxiv_url}\nSummary: Error fetching abstract\n\n") print(f"Error fetching abstract for {arxiv_url}\n") except FileNotFoundError: print("Error: The file 'links.txt' with arXiv links was not found.") result_file.write("Error: The file 'links.txt' with arXiv links was not found.\n") else: print("Invalid option. Please run the script again and choose option 1 or 2.")
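# ---------------------------------------------------------------------------
# Example interactive session for option 1 (illustrative only: the summary text
# depends on the Gemini response, and the URL shown is just the first entry in
# links.txt). The same summary is written to result.txt.
#
#   $ python url_summarize.py
#   Select an option:
#   1. Enter a single arXiv link
#   2. Provide a file with arXiv links (using 'links.txt')
#   Enter 1 or 2: 1
#   Enter the arXiv URL: https://arxiv.org/abs/2410.08003
#   Fetching abstract for: https://arxiv.org/abs/2410.08003
#   Summary for https://arxiv.org/abs/2410.08003:
#   <1-2 sentence summary returned by Gemini>
# ---------------------------------------------------------------------------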