Repository: Shaier/arxiv_summarizer
Branch: main
Commit: 549d68c2039f
Files: 7
Total size: 23.4 KB

Directory structure:
gitextract_mgf991u5/
├── README.md
├── daily_arxiv.txt
├── keywords_summarizer.py
├── links.txt
├── requirements.txt
├── result.txt
└── url_summarize.py

================================================
FILE CONTENTS
================================================

================================================
FILE: README.md
================================================

# ArXiv Paper Summarizer

This repository provides a Python script to fetch and summarize research papers from arXiv using the free Gemini API. Additionally, it demonstrates how to automate the extraction and summarization of arXiv articles daily based on specific keywords (see the section titled "Automatic Daily Extraction and Summarization" below).

The tool is designed to help researchers, students, and enthusiasts quickly extract key insights from arXiv papers without manually reading through lengthy documents.

## Features

- **Single URL Summarization**: Summarize a single arXiv paper by providing its URL.
- **Batch URL Summarization**: Summarize multiple arXiv papers by listing their URLs in a text file.
- **Batch Keywords Summarization**: Fetch and summarize all papers from arXiv based on keywords and date ranges.
- **Easy Setup**: Simple installation and configuration process using Conda and pip.
- **Gemini API Integration**: Leverages the free Gemini API for high-quality summarization.

## Prerequisites

- Python 3.11
- Conda (for environment management)
- A Gemini API key (free to obtain)

## Installation

### 1. Clone the Repository

```bash
git clone https://github.com/Shaier/arxiv_summarizer.git
cd arxiv_summarizer
```

### 2. Set Up the Conda Environment

Create and activate a Conda environment with Python 3.11:

```bash
conda create -n arxiv_summarizer python=3.11
conda activate arxiv_summarizer
```

### 3. Install Dependencies

Install the required Python packages using pip:

```bash
pip install -r requirements.txt
```

### 4. Configure the Gemini API Key

Obtain your Gemini API key from [Google's Gemini API page](https://ai.google.dev/gemini-api/docs/api-key). Once you have the key, open `url_summarize.py` and set the `GEMINI_API_KEY` variable on line 5 to your actual API key. Do the same at the top of `keywords_summarizer.py` if you plan to use keyword-based summarization.

## Usage

### Summarize a Single Paper (Based on a Single URL)

To summarize a single arXiv paper, run the script and provide the arXiv URL (ensure it is the abstract page, not the PDF link):

```bash
python url_summarize.py
```

When prompted:

1. Enter `1` to summarize a single paper.
2. Provide the arXiv URL (e.g., `https://arxiv.org/abs/2410.08003`).

### Summarize Multiple Papers (Based on Multiple URLs)

To summarize multiple papers:

1. Add the arXiv URLs to the `links.txt` file, with one URL per line.
2. Run the script:

   ```bash
   python url_summarize.py
   ```

3. When prompted, enter `2` to process all URLs listed in `links.txt`.

Summaries are saved in `result.txt`.

## Example

Here’s an example of how to use the script:

```bash
python url_summarize.py
> Select an option:
> 1. Enter a single arXiv link
> 2. Provide a file with arXiv links (using 'links.txt')
> Enter 1 or 2: 1
> Enter the arXiv URL: https://arxiv.org/abs/2410.08003
```

### Summarize Multiple Papers (Based on Keywords)

`keywords_summarizer.py` enables fetching and summarizing papers based on specified keywords and date ranges. This is useful for tracking new research trends, generating related work sections, or conducting systematic reviews across multiple keywords at once.

### Usage
1. **Run the script** and provide your search criteria:

   ```bash
   python keywords_summarizer.py
   ```

2. **Specify keywords, a date range, and the number of results per keyword** when prompted. Example input:

   ```bash
   Enter keywords separated by commas: transformer,sparsity,MoE
   Enter start date (YYYY-MM-DD): 2017-01-01
   Enter end date (YYYY-MM-DD): 2024-03-01
   Enter the number of results per keyword: 50
   ```

3. The script fetches relevant papers from arXiv and generates summaries. The results are saved in `result.txt`.

## Automatic Daily Extraction and Summarization

You can automate the extraction and summarization of arXiv articles based on specific keywords using Google Apps Script. This setup will run daily and add newly found article titles (with links and summaries) to a Google Doc.

### Steps to Set Up

1. **Open Google Apps Script**
   - Log in to your Google account and go to [Google Apps Script](https://script.google.com/home/my).
   - Click on **"New project"** in the top left.

2. **Create a Google Doc**
   - Open [Google Docs](https://docs.google.com).
   - Click **Blank document** to create a new document.
   - Copy the **document ID** from the URL.
   - The ID is the long string in the document's URL, e.g., `123HEM4h5aQwygDk_A-xNaJ8CUoyMZTFsChyMk`.

3. **Copy and Modify the Script**
   - Open the `daily_arxiv.txt` file in this repository.
   - Copy and paste its content into the Google Apps Script editor.
   - Locate `var docId` in the script (around line 3) and replace its `"GOOGLE-DOC-ID"` placeholder with the **Google Doc ID** from Step 2.
   - Add your **Gemini API key** around **line 81** (look for `var apiKey =` and replace the `"GEMINI-API-KEY"` placeholder).
   - Locate `var keywords = [...]` around **line 4** and update it with your preferred keywords.

4. **Test the Script**
   - Click the **Run** button at the top to execute the script (you might need to provide permissions).
   - If everything works correctly, your Google Doc should now contain a list of arXiv article titles with links.

5. **Schedule Daily Execution**
   - Click on the **clock icon** on the left (Triggers).
   - Click **"Add trigger"** in the bottom right.
   - Configure the trigger settings:
     - **Function**: Select the main function from the dropdown.
     - **Event Source**: Choose **Time-driven**.
     - **Type**: Select **Day timer**.
     - **Time Range**: Pick a time slot (e.g., midnight to 1 AM).
     - **Notifications**: Enable email notifications if you want updates.
   - Click **Save**.

Now, your script will automatically fetch and summarize new arXiv articles daily based on your chosen keywords!

## Contributing

Contributions are welcome! If you have suggestions, improvements, or bug fixes, please open an issue or submit a pull request.

## Support

If you encounter any issues or have questions, feel free to open an issue.
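## How It Works (Sketch)

Both `url_summarize.py` and `keywords_summarizer.py` reduce to the same two steps: get an abstract from arXiv, then send it to the Gemini API with a short summarization prompt. The snippet below is a minimal sketch of that flow, not a replacement for the scripts; it assumes `requests` and `beautifulsoup4` are installed (see `requirements.txt`), and `YOUR_GEMINI_API_KEY` is a placeholder you must replace with a real key.

```python
import requests
from bs4 import BeautifulSoup

GEMINI_API_KEY = "YOUR_GEMINI_API_KEY"  # placeholder: substitute your own key

def summarize_arxiv(arxiv_url: str) -> str:
    # Fetch the arXiv abstract page and pull the abstract text out of the
    # <blockquote class="abstract mathjax"> element.
    page = requests.get(arxiv_url)
    page.raise_for_status()
    abstract = BeautifulSoup(page.text, "html.parser").find(
        "blockquote", class_="abstract mathjax"
    ).text.strip()

    # Ask gemini-1.5-flash for a short summary of the abstract.
    endpoint = (
        "https://generativelanguage.googleapis.com/v1beta/models/"
        "gemini-1.5-flash:generateContent?key=" + GEMINI_API_KEY
    )
    prompt = ("Summarize the following abstract in 1-2 simple sentences. "
              "Focus on what the authors did, why, and the results: \n\n" + abstract)
    response = requests.post(endpoint, json={"contents": [{"parts": [{"text": prompt}]}]})
    response.raise_for_status()
    return response.json()["candidates"][0]["content"]["parts"][0]["text"]

if __name__ == "__main__":
    print(summarize_arxiv("https://arxiv.org/abs/2410.08003"))
```

`keywords_summarizer.py` wraps the same Gemini call in an arXiv API keyword-and-date search and writes each summary to `result.txt`.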
================================================ FILE: daily_arxiv.txt ================================================ // Function to fetch and process papers function fetchAndWritePapers() { var docId = "GOOGLE-DOC-ID"; // Replace with your Google Doc ID var keywords = ['language models', 'llm']; var arxivRssUrl = 'http://arxiv.org/rss/cs.AI'; try { var doc = DocumentApp.openById(docId); var body = doc.getBody(); var response = UrlFetchApp.fetch(arxivRssUrl); var responseText = response.getContentText(); var xml = XmlService.parse(responseText); var rootElement = xml.getRootElement(); var channel = rootElement.getChild("channel"); var items = channel.getChildren("item"); var paperCount = 1; // Correctly track numbering for papers for (var i = 0; i < items.length; i++) { var title = items[i].getChildText("title").toLowerCase(); var link = items[i].getChildText("link"); var description = items[i].getChildText("description"); // Abstract for (var j = 0; j < keywords.length; j++) { if (title.includes(keywords[j].toLowerCase())) { // Append paper title as a paragraph with correct numbering var listItem = body.appendParagraph(paperCount + ") " + title); listItem.setLinkUrl(link); listItem.setBold(true); // Ensure spacing consistency for summary if (description) { var summary = summarizeWithGemini(description); var summaryParagraph = body.appendParagraph(summary); summaryParagraph.setIndentStart(30); // Indentation for readability summaryParagraph.setSpacingBefore(5); // Ensures space before summary Utilities.sleep(5000); // Pause for 5 second because of the free Gemini quota } else { var summaryParagraph = body.appendParagraph("Abstract not available."); summaryParagraph.setIndentStart(30); summaryParagraph.setSpacingBefore(5); } paperCount++; // Increment paper count correctly break; } } } } catch (e) { Logger.log("Error: " + e.toString()); } } // Function to fetch the abstract from an arXiv URL function fetchAbstract(arxivUrl) { try { Logger.log("Fetching abstract from: " + arxivUrl); var response = UrlFetchApp.fetch(arxivUrl); var html = response.getContentText(); // Use regex to extract the abstract (simplified example) var abstractRegex = /
<blockquote class="abstract mathjax">([\s\S]*?)<\/blockquote>/; var match = html.match(abstractRegex); if (match && match[1]) { Logger.log("Abstract fetched successfully."); return match[1].replace(/Abstract:/, "Abstract: ").trim(); } else { Logger.log("Abstract not found."); return "Error: Abstract not found."; } } catch (e) { Logger.log("Error fetching abstract: " + e.toString()); return "Error: Unable to fetch the abstract."; } } // Function to summarize the abstract using the Gemini API function summarizeWithGemini(abstract) { try { Logger.log("Summarizing abstract using Gemini API..."); var apiKey = "GEMINI-API-KEY"; // Replace with your Gemini API key var apiUrl = "https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-flash:generateContent?key=" + apiKey; var payload = { "contents": [{ "parts": [{ "text": "Summarize the following abstract in 1-2 simple sentences. Focus on what the authors did, why, and the results: \n\n" + abstract }] }] }; var options = { "method": "post", "contentType": "application/json", "payload": JSON.stringify(payload) }; var response = UrlFetchApp.fetch(apiUrl, options); var result = JSON.parse(response.getContentText()); Logger.log("Summary generated successfully."); return result.candidates[0].content.parts[0].text; } catch (e) { Logger.log("Error summarizing abstract: " + e.toString()); return "Error: Unable to summarize the abstract."; } } ================================================ FILE: keywords_summarizer.py ================================================ import requests from bs4 import BeautifulSoup import xml.etree.ElementTree as ET import time from datetime import datetime, timedelta from concurrent.futures import ThreadPoolExecutor, as_completed # Set up your Gemini API key GEMINI_API_KEY = "" def fetch_abstract(arxiv_url): # Fetch the arXiv page content using requests response = requests.get(arxiv_url) if response.status_code != 200: return f"Error: Unable to fetch {arxiv_url}, status code: {response.status_code}" # Parse the HTML content of the arXiv page soup = BeautifulSoup(response.text, 'html.parser') # Find the abstract section abstract_tag = soup.find('blockquote', class_='abstract mathjax') if abstract_tag: # Get the content of the abstract and ensure a space after the "Abstract:" label abstract_text = abstract_tag.text.strip() # Ensure "Abstract:" has a space if abstract_text.startswith("Abstract:"): abstract_text = abstract_text.replace("Abstract:", "Abstract: ") return abstract_text else: return "Error: Abstract not found." def summarize_with_gemini(abstract_text): # Set up the API endpoint for Gemini url = "https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-flash:generateContent?key=" + GEMINI_API_KEY headers = { "Content-Type": "application/json", } # Create the payload to send the abstract to Gemini with a more focused prompt data = { "contents": [{ "parts": [{ "text": f"Summarize the following abstract in 1-2 simple sentences. Focus on what the authors did, why, and the results: \n\n{abstract_text}" }] }] } # Make the POST request to the Gemini API response = requests.post(url, headers=headers, json=data) if response.status_code == 200: # Extract the summary from the response result = response.json() try: # Access the correct keys in the response structure summary = result['candidates'][0]['content']['parts'][0]['text'] return summary except KeyError as e: return f"KeyError: {e}, check the response structure."
else: return f"Error: Unable to get response, status code: {response.status_code}" def fetch_papers_for_date_range(keyword, start_date, end_date, max_results): papers = [] query = f'all:"{keyword}"' query_url = f"http://export.arxiv.org/api/query?search_query=({query})+AND+submittedDate:[{start_date}+TO+{end_date}]&start=0&max_results={max_results}" # Fetch papers for this keyword and date range response = requests.get(query_url) if response.status_code != 200: print(f"Error: Unable to fetch papers for keyword '{keyword}' from {start_date} to {end_date}, status code: {response.status_code}") return papers # Parse the XML response to get the papers root = ET.fromstring(response.content) for entry in root.findall('{http://www.w3.org/2005/Atom}entry'): title = entry.find('{http://www.w3.org/2005/Atom}title').text.strip() summary = entry.find('{http://www.w3.org/2005/Atom}summary').text.strip() link = entry.find('{http://www.w3.org/2005/Atom}link[@title="pdf"]').attrib['href'] papers.append({'title': title, 'summary': summary, 'link': link, 'keyword': keyword}) return papers def fetch_papers(keywords, start_date, end_date, max_results_per_keyword): papers = [] keyword_totals = {keyword: 0 for keyword in keywords} # Dictionary to store total papers found for each keyword # Convert start_date and end_date to datetime objects start_date = datetime.strptime(start_date, "%Y-%m-%d") end_date = datetime.strptime(end_date, "%Y-%m-%d") # Split the date range into monthly intervals current_start_date = start_date date_ranges = [] while current_start_date < end_date: current_end_date = current_start_date + timedelta(days=30) # Approximate 1 month if current_end_date > end_date: current_end_date = end_date date_ranges.append((current_start_date.strftime("%Y-%m-%d"), current_end_date.strftime("%Y-%m-%d"))) current_start_date = current_end_date + timedelta(days=1) # Use ThreadPoolExecutor to parallelize queries with ThreadPoolExecutor(max_workers=5) as executor: futures = [] for keyword in keywords: for start, end in date_ranges: futures.append(executor.submit(fetch_papers_for_date_range, keyword, start, end, max_results_per_keyword)) for future in as_completed(futures): papers.extend(future.result()) # Count papers per keyword for paper in papers: keyword_totals[paper['keyword']] += 1 # Print the total number of documents found for each keyword print("\nTotal documents found per keyword:") for keyword, total in keyword_totals.items(): print(f"{keyword}: {total} documents") return papers # Open the result file to store the summaries with open("result.txt", "w") as result_file: # Prompt for user input keywords = input("Enter keywords separated by commas: ").strip().split(',') start_date = input("Enter start date (YYYY-MM-DD): ").strip() end_date = input("Enter end date (YYYY-MM-DD): ").strip() max_results_per_keyword = int(input("Enter the number of results per keyword: ").strip()) # Fetch papers based on keywords, date range, and max results per keyword papers = fetch_papers(keywords, start_date, end_date, max_results_per_keyword) for paper in papers: print(f"Fetching abstract for: {paper['title']}") # Fetch the abstract abstract = paper['summary'] if not abstract.startswith("Error"): # Summarize the abstract using Gemini summary = summarize_with_gemini(abstract) # summary = 'xyz' result_file.write(f"Keyword: {paper['keyword']}\nTitle: {paper['title']}\nLink: {paper['link']}\nSummary: {summary}\n\n") print(f"Summary for {paper['title']}:\n{summary}\n") else: result_file.write(f"Keyword: 
{paper['keyword']}\nTitle: {paper['title']}\nLink: {paper['link']}\nSummary: Error fetching abstract\n\n") print(f"Error fetching abstract for {paper['title']}\n") # Add a 2-seconds delay between each summary request time.sleep(2) ================================================ FILE: links.txt ================================================ https://arxiv.org/abs/2410.08003 https://arxiv.org/abs/2410.04241 https://arxiv.org/abs/2406.16779 https://arxiv.org/abs/2402.00123 https://arxiv.org/abs/2401.18001 https://arxiv.org/abs/2310.10571 https://arxiv.org/abs/2310.10583 ================================================ FILE: requirements.txt ================================================ requests beautifulsoup4 ================================================ FILE: result.txt ================================================ arXiv URL: https://arxiv.org/abs/2410.08003 Summary: The authors developed COMET, a new sparse neural network architecture inspired by biological neural systems, to address limitations of existing sparse networks in learning multiple tasks efficiently. COMET uses random projections instead of trainable gating functions, leading to faster learning and better generalization across various tasks. arXiv URL: https://arxiv.org/abs/2410.04241 Summary: The authors introduced a new question answering task that incorporates source citations for questions with multiple valid answers, creating five new datasets and evaluation metrics to address this challenge; their results from several baseline models highlight the difficulty and importance of this task for building more trustworthy QA systems. arXiv URL: https://arxiv.org/abs/2406.16779 Summary: The authors investigated how input order and emphasis affect large language models' reading comprehension performance, testing nine models on three datasets. They found that presenting the context before the question significantly improved accuracy (up to 36%), especially for questions requiring external knowledge, with simple context emphasis techniques proving most effective. arXiv URL: https://arxiv.org/abs/2402.00123 Summary: The authors compared language model evaluation using expert-designed templates versus naturally occurring text, finding that the two methods yielded different model rankings and scores, with template-free methods showing lower accuracy but a more expected relationship between perplexity and accuracy. The differences were particularly notable between general and domain-specific models. arXiv URL: https://arxiv.org/abs/2401.18001 Summary: The authors evaluated 15 question-answering systems across five datasets using a comprehensive set of criteria to understand their robustness, consistency, and handling of conflicting information. Their results revealed unexpected relationships between these factors, highlighting the significant negative impact of combining conflicting knowledge and noisy data on system performance. arXiv URL: https://arxiv.org/abs/2310.10571 Summary: The authors investigated whether adding irrelevant demographic information to biomedical questions affected the answers given by two types of question answering systems (knowledge graph-grounded and text-based). They found that irrelevant demographic details caused significant changes in the answers provided by both systems, highlighting fairness concerns in biomedical AI. 
arXiv URL: https://arxiv.org/abs/2310.10583 Summary: The authors argue that current language models are unreliable, especially for low-resource languages, and propose building models that cite their sources to improve trustworthiness. They discuss the benefits and challenges of this approach, aiming to stimulate discussion on improving language model development. ================================================ FILE: url_summarize.py ================================================ import requests from bs4 import BeautifulSoup # Set up your Gemini API key GEMINI_API_KEY = "" def fetch_abstract(arxiv_url): # Fetch the arXiv page content using requests response = requests.get(arxiv_url) if response.status_code != 200: return f"Error: Unable to fetch {arxiv_url}, status code: {response.status_code}" # Parse the HTML content of the arXiv page soup = BeautifulSoup(response.text, 'html.parser') # Find the abstract section abstract_tag = soup.find('blockquote', class_='abstract mathjax') if abstract_tag: # Get the content of the abstract and ensure a space after the "Abstract:" label abstract_text = abstract_tag.text.strip() # Ensure "Abstract:" has a space if abstract_text.startswith("Abstract:"): abstract_text = abstract_text.replace("Abstract:", "Abstract: ") return abstract_text else: return "Error: Abstract not found." def summarize_with_gemini(abstract_text): # Set up the API endpoint for Gemini url = "https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-flash:generateContent?key=" + GEMINI_API_KEY headers = { "Content-Type": "application/json", } # Create the payload to send the abstract to Gemini with a more focused prompt data = { "contents": [{ "parts": [{ "text": f"Summarize the following abstract in 1-2 simple sentences. Focus on what the authors did, why, and the results: \n\n{abstract_text}" }] }] } # Make the POST request to the Gemini API response = requests.post(url, headers=headers, json=data) if response.status_code == 200: # Extract the summary from the response result = response.json() try: # Access the correct keys in the response structure summary = result['candidates'][0]['content']['parts'][0]['text'] return summary except KeyError as e: return f"KeyError: {e}, check the response structure." else: return f"Error: Unable to get response, status code: {response.status_code}" # Open the result file to store the summaries with open("result.txt", "w") as result_file: # Prompt for user input print("Select an option:") print("1. Enter a single arXiv link") print("2. 
Provide a file with arXiv links (using 'links.txt')") option = input("Enter 1 or 2: ") if option == '1': # Single paper input arxiv_url = input("Enter the arXiv URL: ").strip() print(f"Fetching abstract for: {arxiv_url}") # Fetch the abstract abstract = fetch_abstract(arxiv_url) if not abstract.startswith("Error"): # Summarize the abstract using Gemini summary = summarize_with_gemini(abstract) result_file.write(f"arXiv URL: {arxiv_url}\nSummary: {summary}\n\n") print(f"Summary for {arxiv_url}:\n{summary}\n") else: print(f"Error fetching abstract for {arxiv_url}\n") elif option == '2': # Multiple papers from file (assuming links.txt) file_path = 'links.txt' try: with open(file_path, 'r') as file: links = file.readlines() for link in links: arxiv_url = link.strip() print(f"Fetching abstract for: {arxiv_url}") # Fetch the abstract abstract = fetch_abstract(arxiv_url) if not abstract.startswith("Error"): # Summarize the abstract using Gemini summary = summarize_with_gemini(abstract) result_file.write(f"arXiv URL: {arxiv_url}\nSummary: {summary}\n\n") print(f"Summary for {arxiv_url}:\n{summary}\n") else: result_file.write(f"arXiv URL: {arxiv_url}\nSummary: Error fetching abstract\n\n") print(f"Error fetching abstract for {arxiv_url}\n") except FileNotFoundError: print("Error: The file 'links.txt' with arXiv links was not found.") result_file.write("Error: The file 'links.txt' with arXiv links was not found.\n") else: print("Invalid option. Please run the script again and choose option 1 or 2.")
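# ---------------------------------------------------------------------------
# Example interactive session for option 1 (illustrative only: the summary text
# depends on the Gemini response, and the URL shown is just the first entry in
# links.txt). The same summary is written to result.txt.
#
#   $ python url_summarize.py
#   Select an option:
#   1. Enter a single arXiv link
#   2. Provide a file with arXiv links (using 'links.txt')
#   Enter 1 or 2: 1
#   Enter the arXiv URL: https://arxiv.org/abs/2410.08003
#   Fetching abstract for: https://arxiv.org/abs/2410.08003
#   Summary for https://arxiv.org/abs/2410.08003:
#   <1-2 sentence summary returned by Gemini>
# ---------------------------------------------------------------------------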