Repository: JorisdeJong123/7-Days-of-LangChain
Branch: main
Commit: ad587b189ebd
Files: 22
Total size: 43.3 KB
Directory structure:
gitextract_iggotek1/
├── .gitattributes
├── .gitignore
├── README.md
├── day_1/
│ ├── requirements.txt
│ └── yt_to_strategy.py
├── day_2/
│ ├── requirements.txt
│ ├── summary_example.txt
│ └── voice_to_meeting_notes.py
├── day_3/
│ ├── mindmap.py
│ ├── mindmap_example.md
│ └── requirements.txt
├── day_4/
│ ├── newsletter_example.txt
│ ├── requirements.txt
│ └── scientific_newsletter.py
├── day_5/
│ ├── podcast.py
│ └── requirements.txt
├── day_6/
│ ├── compare_files.py
│ └── requirements.txt
├── day_7/
│ ├── learning_path.py
│ ├── requirements.txt
│ └── youtube_ids.py
└── requirements.txt
================================================
FILE CONTENTS
================================================
================================================
FILE: .gitattributes
================================================
# Auto detect text files and perform LF normalization
* text=auto
================================================
FILE: .gitignore
================================================
.env
env/
================================================
FILE: README.md
================================================
# 7 Days of LangChain
Code repo for my "7 Days of LangChain" series. The repo is still quite messy, but I'll be cleaning it up in the coming week.
# How to start?
Clone the repo or copy the code snippets you'd like to use.
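Run
<code>pip install -r requirements.txt</code>
from the repo root. Each <code>day_*</code> folder also ships its own <code>requirements.txt</code> if you only want the dependencies for a single script.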
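The scripts set the OpenAI key through a placeholder string such as `'YOUR_API_KEY_HERE'`. As a minimal sketch, assuming you would rather export `OPENAI_API_KEY` in your shell than paste the key into a file, something like the following could replace the hard-coded assignment. The helper name `get_openai_api_key` is hypothetical and not part of this repo.

```python
import os


def get_openai_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key stored in the given environment variable.

    Raises RuntimeError when the variable is missing, empty, or still set
    to the placeholder used in the scripts.
    """
    key = os.environ.get(env_var, "").strip()
    if not key or key == "YOUR_API_KEY_HERE":
        raise RuntimeError(
            f"Set the {env_var} environment variable before running the script."
        )
    return key


# Possible usage inside one of the scripts (assumption, not the repo's code):
# openai_api_key = get_openai_api_key()
```

Keeping the key out of the source also means it cannot end up in a commit by accident.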
# Want to know more about the code?
Go and follow me on [Twitter](https://twitter.com/JorisTechTalk) for more details on the code!
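Several of the scripts share one pattern: split a long text into overlapping chunks with `TokenTextSplitter(chunk_size=..., chunk_overlap=...)`, then run a `refine` summarize chain that answers from the first chunk and updates that answer with each later chunk. The sketch below illustrates the idea in plain Python. It is a simplification under stated assumptions, not LangChain's implementation: whitespace-separated words stand in for model tokens, `fake_llm` stands in for a chat model, and the function names `split_with_overlap` and `refine` are hypothetical.

```python
from typing import Callable, List


def split_with_overlap(tokens: List[str], chunk_size: int, chunk_overlap: int) -> List[List[str]]:
    """Cut a token list into windows of at most chunk_size tokens.

    Consecutive windows share chunk_overlap tokens.
    """
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the text
        start += step
    return chunks


def refine(chunks: List[str], llm: Callable[[str], str],
           question_template: str, refine_template: str) -> str:
    """Answer from the first chunk, then refine that answer chunk by chunk."""
    if not chunks:
        raise ValueError("no chunks to process")
    # The initial prompt only sees the first chunk.
    answer = llm(question_template.format(text=chunks[0]))
    # Every later prompt sees the answer so far plus one new chunk.
    for chunk in chunks[1:]:
        answer = llm(refine_template.format(existing_answer=answer, text=chunk))
    return answer


if __name__ == "__main__":
    words = "one two three four five six seven".split()
    chunks = [" ".join(c) for c in split_with_overlap(words, chunk_size=3, chunk_overlap=1)]

    def fake_llm(prompt: str) -> str:
        # Placeholder for a chat model call: it only reports the prompt length.
        return f"answer based on {len(prompt)} characters"

    print(chunks)
    print(refine(chunks, fake_llm, "Summarize: {text}",
                 "Existing: {existing_answer}\nNew context: {text}"))
```

This is why the scripts define two prompts per chain: one with only `{text}` for the first chunk, and one with both `{existing_answer}` and `{text}` for every chunk after it.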
================================================
FILE: day_1/requirements.txt
================================================
langchain
openai
youtube-transcript-api
pytube
tiktoken
bs4
================================================
FILE: day_1/yt_to_strategy.py
================================================
"""
This script shows how to create a strategy for a four-hour workday based on a YouTube video.
We're using an easy LangChain implementation to show how to use the different components of LangChain.
This is part of my '7 Days of LangChain' series.
Check out the explanation about the code on my Twitter (@JorisTechTalk)
"""
from langchain import LLMChain
from langchain.document_loaders import YoutubeLoader
from langchain.text_splitter import TokenTextSplitter
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain.chains.summarize import load_summarize_chain
from langchain.callbacks import get_openai_callback
with get_openai_callback() as cb:
# Set your OpenAI API Key.
openai_api_key = 'YOUR_API_KEY_HERE'
# Load a youtube video and get the transcript
url = "https://www.youtube.com/watch?v=aV4jKPFOjvk"
loader = YoutubeLoader.from_youtube_url(url, add_video_info=True)
data = loader.load()
# Split the transcript into shorter chunks.
# First create the text splitter. The chunk_size is the maximum number of tokens in each chunk.
# With the new gpt-3.5-turbo-16k model, you actually don't need it in this example, but it's good to know how to do it.
text_splitter = TokenTextSplitter(chunk_size = 5000, chunk_overlap = 100)
# Then split the transcript into chunks.
# The .split_documents() method returns a list of Document objects; each one holds a chunk in its page_content attribute.
docs = text_splitter.split_documents(data)
# Create the prompts. The prompt is the instruction to the model. Prompting is key to getting good results.
# Play around with the prompt to get different results.
# We create two prompts. Since we will be using the refine summarize chain, we need a prompt for the initial 'summarization' of a chunk, and a prompt for the refinement of the summary of subsequent chunks.
# The first prompt is for the initial summarization of a chunk. You can add any info about yourself or the topic you want.
# You could specifically focus on a skill you have to get more relevant results.
strategy_template = """
You are an expert in creating strategies for getting a four-hour workday. You are a productivity coach and you have helped many people achieve a four-hour workday.
Your goal is to create a detailed strategy for getting a four-hour workday.
The strategy should be based on the following text:
------------
{text}
------------
Given the text, create a detailed strategy. The strategy is aimed to get a working plan on how to achieve a four-hour workday.
The strategy should be as detailed as possible.
STRATEGY:
"""
PROMPT_STRATEGY = PromptTemplate(template=strategy_template, input_variables=["text"])
# The second prompt is for the refinement of the summary, based on subsequent chunks.
strategy_refine_template = (
"""
You are an expert in creating strategies for getting a four-hour workday.
Your goal is to create a detailed strategy for getting a four-hour workday.
We have provided an existing strategy up to a certain point: {existing_answer}
We have the opportunity to refine the strategy
(only if needed) with some more context below.
------------
{text}
------------
Given the new context, refine the strategy.
The strategy is aimed to get a working plan on how to achieve a four-hour workday.
If the context isn't useful, return the original strategy.
"""
)
PROMPT_STRATEGY_REFINE = PromptTemplate(
input_variables=["existing_answer", "text"],
template=strategy_refine_template,
)
# Initialize the large language model. You can use the gpt-3.5-turbo-16k model or any model you prefer.
# Play around with the temperature parameter to get different results. Higher temperature means more randomness. Lower temperature means more deterministic.
llm = ChatOpenAI(openai_api_key=openai_api_key, model_name='gpt-3.5-turbo-16k', temperature=0.5)
# Initialize the chain.
# The verbose parameter prints the 'thought process' of the model. It's useful for debugging.
strategy_chain = load_summarize_chain(llm=llm, chain_type='refine', verbose=True, question_prompt=PROMPT_STRATEGY, refine_prompt=PROMPT_STRATEGY_REFINE)
strategy = strategy_chain.run(docs)
# Now write the strategy to a file.
with open('strategy.txt', 'w') as f:
    f.write(strategy)
# Now use this strategy to create a plan.
# The plan is a list of steps to take to achieve the goal.
# The plan is based on the strategy.
# Create the prompt for the plan.
plan_template = """
You are an expert in creating plans for getting a four-hour workday. You are a productivity coach and you have helped many people achieve a four-hour workday.
Your goal is to create a detailed plan for getting a four-hour workday.
The plan should be based on the following strategy:
------------
{strategy}
------------
Given the strategy, create a detailed plan. The plan is aimed to get a working plan on how to achieve a four-hour workday.
Think step by step.
The plan should be as detailed as possible.
PLAN:
"""
PROMPT_PLAN = PromptTemplate(template=plan_template, input_variables=["strategy"])
# Initialize the chain.
plan_chain = LLMChain(llm=llm, prompt=PROMPT_PLAN, verbose=True)
plan = plan_chain(strategy)
# Now write the plan to a file.
with open('plan.txt', 'w') as f:
    f.write(plan['text'])
# Print the total cost of the API calls.
print(cb)
================================================
FILE: day_2/requirements.txt
================================================
openai
langchain
tiktoken
================================================
FILE: day_2/summary_example.txt
================================================
The meeting took place on February 18, 2021, and was focused on the engineering key review at GitLab. Eric Johnson, the meeting organizer, proposed breaking up the meeting into four department key reviews: engineering, development quality, security, and UX infrastructure and support. The reasons for this proposal were increased visibility, the ability to go deeper into each department's work, increased objectivity for managers, more time for Eric to focus on new markets, and a shift into a question-asking role rather than generating content and answering questions. To avoid adding three new meetings to stakeholders' calendars, Eric suggested a two-month rotation, with development quality going in month one and security and UX going in month two. The group expressed support for this proposal, with some members suggesting that the larger development department may need more frequent meetings. However, they agreed to try the proposed rotation and remain flexible.
The discussion then moved to the R&D overall MR rate and R&D wider MR rate, which are top-level key performance indicators (KPIs) for engineering. The wider MR rate includes both community contributions and community merge requests (MRs), while the overall MR rate includes all MRs. Eric raised concerns about the duplication between the two rates and suggested simplifying the metrics. Lily confirmed that the wider MR rate only captures community contributions, and Eric proposed tracking the percentage of total MRs that come from the community as a KPI instead. The group agreed with this proposal and decided to make the transition.
Christopher mentioned a lag issue with metrics updates in the month of February, particularly in development and MR development. The data team was working on resolving this issue. The discussion then shifted to the Postgres replication issue and who should be responsible for addressing it. Eric clarified that the data engineering team should be the DRI (directly responsible individual) for the issue, with infrastructure owning the data source. They discussed the need for a dedicated host, tuning improvements, and addressing overall demand on the database layer. Steve expressed his willingness to partner with Eric on this issue and suggested focusing on getting the biggest impact for the resources allocated.
Mech provided an update on defect tracking and meeting service level objectives (SLOs). They mentioned a first iteration performance indicator (PI) that shows the percentage of defects meeting SLOs, with S1 defects at 80% and S2 defects at 60%. They also mentioned working on measuring the average age of open bugs to get a holistic view of the backlog. Craig raised a concern about a spike in mean time to close for S2 defects and asked for insights. Mech noted that they hadn't seen a dip in age or overall count and suggested digging deeper into the issue. Christy suggested that the change in severity levels across the board may be a factor to consider.
MAIN TAKEAWAYS:
- The proposal to break up the meeting into four department key reviews was supported, with a two-month rotation plan.
- The R&D wider MR rate will be transitioned to tracking the percentage of total MRs that come from the community.
- There is a lag issue with metrics updates in February, particularly in development and MR development.
- The Postgres replication issue will be addressed by the data engineering team, with infrastructure support.
- Defect tracking and meeting SLOs are ongoing, with a focus on improving the average age of open bugs.
ACTION ITEMS:
- Lily and Max will work on transitioning the R&D wider MR rate to tracking the percentage of total MRs from the community.
- Steve will provide an update on the Postgres replication issue in the infrastructure key review.
- Mech will investigate the spike in mean time to close for S2 defects and provide further insights.
- Christy will explore if the change in severity levels impacted the metrics.
DECISIONS:
- The meeting will be broken up into four department key reviews with a two-month rotation plan.
- The R&D wider MR rate will be transitioned to tracking the percentage of total MRs from the community.
OPEN QUESTIONS:
- None mentioned.
NEXT STEPS:
- Lily and Max will work on transitioning the R&D wider MR rate.
- Steve will provide an update on the Postgres replication issue in the infrastructure key review.
- Mech will investigate the spike in mean time to close for S2 defects.
- Christy will explore the impact of the change in severity levels on the metrics.
================================================
FILE: day_2/voice_to_meeting_notes.py
================================================
"""
This script shows how to create meeting notes from your recordings.
We're using an easy LangChain implementation to show how to use the different components of LangChain.
Also includes an integration with OpenAI Whisper.
This is part of my '7 Days of LangChain' series.
Check out the explanation about the code on my Twitter (@JorisTechTalk)
"""
import openai
from langchain.docstore.document import Document
from langchain.text_splitter import TokenTextSplitter
from langchain.chains.summarize import load_summarize_chain
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate
import os
# Set your API key
openai_api_key = 'YOUR_API_KEY_HERE'
# os.environ["OPENAI_API_KEY"] = 'YOUR_API_KEY_HERE'
# Set your media file path
media_file_path = "meeting_chunk0.mp3"
# Set your model ID
model_id = "whisper-1"
# Open the media file and call the API; the with-block closes the file afterwards
with open(media_file_path, "rb") as media_file:
    response = openai.Audio.transcribe(
        api_key=openai_api_key,
        model=model_id,
        file=media_file
    )
# Assign the transcript to a variable
transcript = response["text"]
# Split the text
text_splitter = TokenTextSplitter(model_name="gpt-3.5-turbo-16k", chunk_size=10000, chunk_overlap=300)
texts = text_splitter.split_text(transcript)
# Create documents for further processing
docs = [Document(page_content=t) for t in texts]
# Create the prompts
prompt_template_summary = """
You are a management assistant with a specialization in note taking. You are taking notes for a meeting.
Write a detailed summary of the following transcript of a meeting:
{text}
Make sure you don't lose any important information. Be as detailed as possible in your summary.
Also end with a list of:
- Main takeaways
- Action items
- Decisions
- Open questions
- Next steps
If there are any follow-up meetings, make sure to include them in the summary and mention them specifically.
DETAILED SUMMARY IN ENGLISH:"""
PROMPT_SUMMARY = PromptTemplate(template=prompt_template_summary, input_variables=["text"])
refine_template_summary = (
'''
You are a management assistant with a specialization in note taking. You are taking notes for a meeting.
Your job is to provide a detailed summary of the transcript of a meeting.
We have provided an existing summary up to a certain point: {existing_answer}.
We have the opportunity to refine the existing summary (only if needed) with some more context below.
----------------
{text}
----------------
Given the new context, refine the original summary in English.
If the context isn't useful, return the original summary. Make sure you are detailed in your summary.
Make sure you don't lose any important information. Be as detailed as possible.
Also end with a list of:
- Main takeaways
- Action items
- Decisions
- Open questions
- Next steps
If there are any follow-up meetings, make sure to include them in the summary and mention them specifically.
'''
)
refine_prompt_summary = PromptTemplate(
input_variables=["existing_answer", "text"],
template=refine_template_summary,
)
# Initialize LLM
llm = ChatOpenAI(openai_api_key=openai_api_key,temperature=0.2, model_name="gpt-3.5-turbo-16k")
# Create a summary
sum_chain = load_summarize_chain(llm, chain_type="refine", verbose=True, question_prompt=PROMPT_SUMMARY, refine_prompt=refine_prompt_summary)
summary = sum_chain.run(docs)
# Write the response to a file
with open("summary.txt", "w") as f:
    f.write(summary)
================================================
FILE: day_3/mindmap.py
================================================
"""
This script shows how to create a mindmap based on your study material.
We're using an easy LangChain implementation to show how to use the different components of LangChain.
Once you have your markdown mindmap, import it to Xmind to create a mindmap.
This is part of my '7 Days of LangChain' series.
Check out the explanation about the code on my Twitter (@JorisTechTalk)
"""
from langchain.chat_models import ChatOpenAI
from langchain.text_splitter import TokenTextSplitter
from langchain.chains.summarize import load_summarize_chain
from langchain.prompts import PromptTemplate
from PyPDF2 import PdfReader
from langchain.docstore.document import Document
from langchain.callbacks import get_openai_callback
# Set your OpenAI API Key.
openai_api_key = 'YOUR_API_KEY_HERE'
# Set file path
file_path = 'eight.pdf'
# Load data from the PDF for mindmap generation
loader_mindmap = PdfReader(file_path)
# Store all the text in a variable
text = ""
for page in loader_mindmap.pages:
    text += page.extract_text()
# Split Data For Mindmap Generation
text_splitter = TokenTextSplitter(model_name="gpt-3.5-turbo-16k", chunk_size=10000, chunk_overlap=1000)
texts_for_mindmap = text_splitter.split_text(text)
docs_for_mindmap = [Document(page_content=t) for t in texts_for_mindmap]
# Template for generating a mindmap from each document
prompt_template_mindmap = """
You are an experienced assistant in helping people understand topics through the help of mind maps.
You are an expert in the field of the requested topic.
Make a mindmap based on the context below. Try to make connections between the different topics and be concise:
------------
{text}
------------
Think step by step.
Always answer in markdown text. Adhere to the following structure:
## Main Topic 1
### Subtopic 1
- Subtopic 1
-Subtopic 1
-Subtopic 2
-Subtopic 3
### Subtopic 2
- Subtopic 1
-Subtopic 1
-Subtopic 2
-Subtopic 3
## Main Topic 2
### Subtopic 1
- Subtopic 1
-Subtopic 1
-Subtopic 2
-Subtopic 3
Make sure you only put out the Markdown text, do not put out anything else. Also make sure you have the correct indentation.
MINDMAP IN MARKDOWN:
"""
PROMPT_MINDMAP = PromptTemplate(template=prompt_template_mindmap, input_variables=["text"])
# Template for refining the mindmap
refine_template_mindmap = ("""
You are an experienced assistant in helping people understand topics through the help of mind maps.
You are an expert in the field of the requested topic.
We have received some mindmap in markdown to a certain extent: {existing_answer}.
We have the option to refine the existing mindmap or add new parts
(only if necessary) with some more context below.
------------
{text}
------------
Always answer in markdown text. Try to make connections between the different topics and be concise. Adhere to the following structure:
## Main Topic 1
### Subtopic 1
- Subtopic 1
-Subtopic 1
-Subtopic 2
-Subtopic 3
### Subtopic 2
- Subtopic 1
-Subtopic 1
-Subtopic 2
-Subtopic 3
## Main Topic 2
### Subtopic 1
- Subtopic 1
-Subtopic 1
-Subtopic 2
-Subtopic 3
Make sure you only put out the Markdown text, do not put out anything else. Also make sure you have the correct indentation.
MINDMAP IN MARKDOWN:
"""
)
REFINE_PROMPT_MINDMAP = PromptTemplate(
input_variables=["existing_answer", "text"],
template=refine_template_mindmap,
)
# Tracking cost
with get_openai_callback() as cb:
    # Initialize the LLM
    llm_markdown = ChatOpenAI(openai_api_key=openai_api_key, temperature=0.3, model="gpt-3.5-turbo-16k")
    # Initialize the summarization chain
    summarize_chain = load_summarize_chain(llm=llm_markdown, chain_type="refine", verbose=True, question_prompt=PROMPT_MINDMAP, refine_prompt=REFINE_PROMPT_MINDMAP)
    # Generate mindmap
    mindmap = summarize_chain(docs_for_mindmap)
    # Save mindmap to .md file
    with open("mindmap.md", "w") as f:
        f.write(mindmap['output_text'])
    # Print cost
    print(cb)
================================================
FILE: day_3/mindmap_example.md
================================================
## Eight Things to Know about Large Language Models
### Main Topic 1: Predictability and Capabilities
- Subtopic 1: LLMs get more capable with increasing investment
- Increasing investment leads to improved performance
- Scaling laws allow for precise prediction of capabilities
- Subtopic 2: Unpredictable emergence of important behaviors
- Specific behaviors can emerge unexpectedly with increasing investment
- Models can fail at a task consistently, but a larger model may succeed
### Main Topic 2: Understanding and Interpretation
- Subtopic 1: LLMs learn and use representations of the outside world
- Internal representations of color, objects, and geography
- Ability to reason at an abstract level
- Subtopic 2: Lack of reliable techniques for steering LLM behavior
- Limited control over LLM behavior
- Challenges in interpreting and guiding LLMs
### Main Topic 3: Performance and Values
- Subtopic 1: LLM performance surpasses human performance
- LLMs trained on more data can outperform humans
- Additional training methods improve performance
- Subtopic 2: LLMs do not necessarily express the values of their creators or web text
- Values expressed by LLMs can be controlled and influenced
- Third-party input and oversight can shape LLM values
### Main Topic 4: Interaction and Misleading Behavior
- Subtopic 1: Brief interactions with LLMs can be misleading
- Models can be sensitive to instructions and prompt wording
- Contingent failures do not necessarily indicate lack of capability
================================================
FILE: day_3/requirements.txt
================================================
openai
langchain
tiktoken
pypdf2
================================================
FILE: day_4/newsletter_example.txt
================================================
Title: Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs
Summary: This article explores different approaches for estimating the uncertainty of large language models (LLMs) without relying on model fine-tuning or proprietary information. The study introduces verbalize-based, consistency-based, and hybrid methods for benchmarking and evaluates their performance across various datasets and LLMs. The analysis reveals insights such as LLMs often exhibiting overconfidence when verbalizing their confidence and consistency-based methods outperforming verbalized confidences in most cases. The article concludes that hybrid methods show the most promising performance, but there is still room for improvement in confidence elicitation.
Link: http://arxiv.org/abs/2306.13063v1
Title: Towards Explainable Evaluation Metrics for Machine Translation
Summary: This concept paper discusses the need for explainable evaluation metrics for machine translation, as current metrics based on large language models lack transparency. The article identifies key properties and goals of explainable machine translation metrics and provides a synthesis of recent techniques and approaches. It also explores explainable metrics based on generative models like ChatGPT and GPT4. The article envisions next-generation approaches, including natural language explanations, to improve the transparency and acceptance of high-quality metrics for machine translation.
Link: http://arxiv.org/abs/2306.13041v1
Title: Tracking public attitudes toward ChatGPT on Twitter using sentiment analysis and topic modeling
Summary: This article investigates public attitudes towards ChatGPT, a chatbot powered by a large language model, using sentiment analysis and topic modeling techniques applied to Twitter data. The analysis reveals that the overall sentiment towards ChatGPT is largely neutral to positive across different occupation groups. The most popular topics mentioned in tweets related to ChatGPT include Artificial Intelligence, Search Engines, Education, Writing, and Question Answering.
Link: http://arxiv.org/abs/2306.12951v1
Title: Generative Multimodal Entity Linking
Summary: This article introduces GEMEL, a simple yet effective method for multimodal entity linking (MEL) that leverages large language models (LLMs) for generating target entity names. Unlike previous complex MEL methods, GEMEL only fine-tunes a linear layer while keeping the vision and language model frozen. The approach utilizes in-context learning capability of LLMs and achieves state-of-the-art results on two well-established MEL datasets with minimal fine-tuning. The article highlights the potential of using LLMs in the MEL task for efficient and general solutions.
Link: http://arxiv.org/abs/2306.12725v1
================================================
FILE: day_4/requirements.txt
================================================
langchain
openai
tabulate
tiktoken
google-api-python-client
google-auth-oauthlib
google-auth-httplib2
beautifulsoup4
arxiv
================================================
FILE: day_4/scientific_newsletter.py
================================================
"""
This script shows how to create a newsletter based on the latest Arxiv articles.
We're using an easy LangChain implementation to show how to use the different components of LangChain.
This is part of my '7 Days of LangChain' series.
Check out the explanation about the code on my Twitter (@JorisTechTalk)
"""
from langchain.document_loaders import ArxivLoader
from langchain.agents.agent_toolkits import GmailToolkit
from langchain import OpenAI
import os
from langchain.agents import initialize_agent, AgentType
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain import LLMChain
from langchain.callbacks import get_openai_callback
import arxiv
# Topic of the newsletter you want to write about
query = "LLM"
# Set up the arXiv search
search = arxiv.Search(
    query = query,
    max_results = 4,
    sort_by = arxiv.SortCriterion.SubmittedDate
)
# Initialize the docs variable
docs = ""
# Add all relevant information to the docs variable
for result in search.results():
    docs += "Title: " + result.title + "\n"
    docs += "Abstract: " + result.summary + "\n"
    docs += "Download URL: " + result.pdf_url + "\n"
    print(result.links)
    for link in result.links:
        docs += "Links: " + link.href + "\n"
# Track cost
with get_openai_callback() as cb:
# Template for the newsletter
prompt_newsletter_template = """
You are a newsletter writer. You write newsletters about scientific articles. You introduce the article and show a small summary to tell the user what the article is about.
Your main goal is to write a newsletter which contains summaries to interest the user in the articles.
--------------------
{text}
--------------------
Start with the title of the article. Then, write a small summary of the article.
Below each summary, include the link to the article containing /abs/ in the URL.
Summaries:
"""
PROMPT_NEWSLETTER = PromptTemplate(template=prompt_newsletter_template, input_variables=["text"])
# Set the OpenAI API key
os.environ['OPENAI_API_KEY'] = 'YOUR_API_KEY_HERE'
# Initialize the language model
llm = ChatOpenAI(temperature=0.6, model_name="gpt-3.5-turbo-16k", verbose=True)
# Initialize the LLMChain
newsletter_chain = LLMChain(llm=llm, prompt=PROMPT_NEWSLETTER, verbose=True)
# Run the LLMChain
newsletter = newsletter_chain.run(docs)
# Write newsletter to a text file
with open("newsletter.txt", "w") as f:
    f.write(newsletter)
# Set toolkit
toolkit = GmailToolkit()
# Initialize the Gmail agent
agent = initialize_agent(
tools=toolkit.get_tools(),
llm=llm,
agent=AgentType.STRUCTURED_CHAT_ZERO_SHOT_REACT_DESCRIPTION,
verbose=True
)
# Run the agent
instructions = f"""
Write a draft directed to jorisdejong456@gmail.com, NEVER SEND THE EMAIL.
The subject should be 'Scientific Newsletter about {query}'.
The content should be the following: {newsletter}.
"""
agent.run(instructions)
print(cb)
================================================
FILE: day_5/podcast.py
================================================
# PODCAST Q&A BOT
from langchain.text_splitter import TokenTextSplitter
from langchain.docstore.document import Document
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import YoutubeLoader
from langchain.prompts import PromptTemplate
from langchain.chains.summarize import load_summarize_chain
from langchain.callbacks import get_openai_callback
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains import RetrievalQA
with get_openai_callback() as cb:
# Load a youtube video and get the transcript
loader = YoutubeLoader.from_youtube_url('https://www.youtube.com/watch?v=-hxeDjAxvJ8', add_video_info=True)
data = loader.load()
# Initialize text splitter for summary (Large chunks for better context and fewer API calls)
text_splitter_summary = TokenTextSplitter(chunk_size = 10000, chunk_overlap = 250)
# Split text into docs for summary
docs_summary = text_splitter_summary.split_documents(data)
# Initialize text splitter for QA (Smaller chunks for better QA)
text_splitter_qa = TokenTextSplitter(chunk_size = 1000, chunk_overlap = 200)
# Split text into docs for QA
docs_qa = text_splitter_qa.split_documents(data)
# Prompts for summary
# The first prompt is for the initial summarization of a chunk. You can add any info about yourself or the topic you want.
# You could specifically focus on a skill you have to get more relevant results.
summary_template = """
You are an expert in summarizing YouTube videos.
Your goal is to create a summary of a podcast.
Below you find the transcript of a podcast:
------------
{text}
------------
The transcript of the podcast will also be used as the basis for a question and answer bot.
Provide some example questions and answers that could be asked about the podcast. Make these questions very specific.
Total output will be a summary of the video and a list of example questions the user could ask of the video.
SUMMARY AND QUESTIONS:
"""
PROMPT_SUMMARY = PromptTemplate(template=summary_template, input_variables=["text"])
# The second prompt is for the refinement of the summary, based on subsequent chunks.
summary_refine_template = (
"""
You are an expert in summarizing YouTube videos.
Your goal is to create a summary of a podcast.
We have provided an existing summary up to a certain point: {existing_answer}
We have the opportunity to refine the summary
(only if needed) with some more context below.
Below you find the transcript of a podcast:
------------
{text}
------------
Given the new context, refine the summary and example questions.
The transcript of the podcast will also be used as the basis for a question and answer bot.
Provide some example questions and answers that could be asked about the podcast. Make these questions very specific.
If the context isn't useful, return the original summary and questions.
Total output will be a summary of the video and a list of example questions the user could ask of the video.
SUMMARY AND QUESTIONS:
"""
)
PROMPT_SUMMARY_REFINE = PromptTemplate(
input_variables=["existing_answer", "text"],
template=summary_refine_template,
)
# Set OPENAI API key
openai_api_key = 'YOUR_API_KEY'
# Initialize LLM
llm_summary = ChatOpenAI(openai_api_key=openai_api_key, model_name='gpt-3.5-turbo-16k', temperature=0.3)
# Initialize summarization chain
summarize_chain = load_summarize_chain(llm=llm_summary, chain_type="refine", verbose=True, question_prompt=PROMPT_SUMMARY, refine_prompt=PROMPT_SUMMARY_REFINE)
summary = summarize_chain.run(docs_summary)
# Write summary to file
with open("summary.txt", "w") as f:
    f.write(summary)
# Create the LLM model for the question answering
llm_question_answer = ChatOpenAI(openai_api_key=openai_api_key,temperature=0.2, model="gpt-3.5-turbo-16k")
# Create the vector database and RetrievalQA Chain
embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key)
db = FAISS.from_documents(docs_qa, embeddings)
qa = RetrievalQA.from_chain_type(llm=llm_question_answer, chain_type="stuff", retriever=db.as_retriever())
question = ""
# Run the QA chain until the user enters "exit"
while True:
    # Get the user question
    question = input("Ask a question or enter exit to close the app: ")
    # Stop before sending "exit" to the model as if it were a question
    if question == "exit":
        break
    # Run the QA chain
    answer = qa.run(question)
    print(answer)
    print("---------------------------------")
    print("\n")
print(cb)
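# A minimal sketch of how a "refine"-style summarization proceeds, as used with
# PROMPT_SUMMARY and PROMPT_SUMMARY_REFINE above. This is an illustration under
# assumptions, not LangChain's implementation: refine_summary and fake_llm are
# hypothetical names, and the short templates stand in for the real prompts.

```python
from typing import Callable, List


def refine_summary(
    chunks: List[str],
    llm: Callable[[str], str],
    question_template: str,
    refine_template: str,
) -> str:
    """Summarize the first chunk, then refine the running answer chunk by chunk."""
    if not chunks:
        return ""
    # The first chunk is formatted with the initial prompt, which only needs {text}.
    answer = llm(question_template.format(text=chunks[0]))
    # Every later chunk is formatted with the refine prompt, which also
    # receives the answer produced so far as {existing_answer}.
    for chunk in chunks[1:]:
        answer = llm(refine_template.format(existing_answer=answer, text=chunk))
    return answer


# Stand-in for a chat model: records every prompt and returns a numbered reply.
calls: List[str] = []


def fake_llm(prompt: str) -> str:
    calls.append(prompt)
    return f"summary v{len(calls)}"


result = refine_summary(
    ["chunk one", "chunk two", "chunk three"],
    fake_llm,
    question_template="Summarize: {text}",
    refine_template="Existing: {existing_answer}\nNew context: {text}",
)
```

# One model call is made per chunk, and each refine call sees only the previous
# answer plus one new chunk, so the prompt size stays bounded for long transcripts.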
================================================
FILE: day_5/requirements.txt
================================================
langchain
openai
tiktoken
youtube-transcript-api
pytube
faiss-cpu
================================================
FILE: day_6/compare_files.py
================================================
from pydantic import BaseModel, Field
from langchain.chat_models import ChatOpenAI
from langchain.agents import Tool
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain.document_loaders import PyPDFLoader
from langchain.chains import RetrievalQA
from langchain.agents import initialize_agent
from langchain.agents import AgentType
import os
# Create a custom input schema
class DocumentInput(BaseModel):
    question: str = Field()
# Set OpenAI API key as environment variable
os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"
# List of files you want to compare
files = [
    {
        "name": "Volkswagen-earnings-Q1-2023",
        "path": "files/Volkswagen-Q1_2023.pdf"
    },
    {
        "name": "tesla-earning-Q1-2023",
        "path": "files/TSLA-Q1-2023-Update.pdf"
    },
]
# Initialize a list of tools
tools = []
# Initialize the LLM
llm = ChatOpenAI(temperature=0, model="gpt-3.5-turbo-0613")
# Loop over the files
for file in files:
    # Load the documents
    loader = PyPDFLoader(file["path"])
    pages = loader.load_and_split()
    # Split the documents into chunks
    text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    docs = text_splitter.split_documents(pages)
    print(f"Loaded {len(docs)} documents from {file['name']}")
    # Vectorize the documents and create a retriever
    embeddings = OpenAIEmbeddings()
    retriever = FAISS.from_documents(docs, embeddings).as_retriever()
    # Wrap retrievers in a Tool
    tools.append(
        Tool(
            args_schema=DocumentInput,
            name=file["name"],
            description=f"useful when you want to answer questions about {file['name']}",
            func=RetrievalQA.from_chain_type(llm=llm, retriever=retriever)
        )
    )
# Initialize LLM for the agent
llm = ChatOpenAI(
    temperature=0,
    model="gpt-3.5-turbo-0613",
)
# Initialize the agent
agent = initialize_agent(
    agent=AgentType.OPENAI_FUNCTIONS,
    tools=tools,
    llm=llm,
    verbose=True,
)
# Run a loop to ask questions until the user writes "exit"
while True:
    question = input("Ask a question or write exit to quit: ")
    if question == "exit":
        break
    answer = agent({"input": question})
    print(answer["output"])
    print("------")
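# A small optional sketch, assuming you want to fail early when a path in the
# files list is wrong instead of getting an error from the PDF loader halfway
# through the loop. find_missing_files is a hypothetical helper, not part of
# the original script; the temporary files below only exist for the demo.

```python
import tempfile
from pathlib import Path
from typing import Dict, List


def find_missing_files(files: List[Dict[str, str]]) -> List[str]:
    """Return the names of entries whose "path" does not point to an existing file."""
    return [entry["name"] for entry in files if not Path(entry["path"]).is_file()]


# Demo with one file that exists and one that does not.
with tempfile.TemporaryDirectory() as tmp:
    present = Path(tmp) / "report.pdf"
    present.write_bytes(b"%PDF-1.4")
    example_files = [
        {"name": "present-report", "path": str(present)},
        {"name": "absent-report", "path": str(Path(tmp) / "missing.pdf")},
    ]
    missing = find_missing_files(example_files)
```

# In the script above, this could be called on the files list before the loop,
# stopping with a clear message when the returned list is not empty.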
================================================
FILE: day_6/requirements.txt
================================================
langchain
openai
pypdf
tiktoken
faiss-cpu
pycryptodome
================================================
FILE: day_7/learning_path.py
================================================
from langchain.document_loaders import YoutubeLoader
from langchain.text_splitter import TokenTextSplitter
from langchain.prompts import PromptTemplate
from langchain.chains.summarize import load_summarize_chain
from langchain.callbacks import get_openai_callback
from langchain.chat_models import ChatOpenAI
from langchain.vectorstores import FAISS
from langchain.docstore.document import Document
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.chains import RetrievalQA
from langchain import LLMChain
import os
# Set OpenAI API key as environment variable
os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"
# List of Youtube Urls
# If you want to load all the videos from a channel, use the code in youtube_ids.py
youtube_urls = [
    "https://www.youtube.com/watch?v=pEkxRQFNAs4", # Extract Topics From Video/Audio With LLMs (Topic Modeling w/ LangChain)
    "https://www.youtube.com/watch?v=GvQC5BHBkoM", # 8 Things New SaaS Developers Need To Know on DAY 1
    "https://www.youtube.com/watch?v=QoBCtcWO02g", # The One Person Business Model 2.0 (Turn Yourself Into A Business)
    "https://www.youtube.com/watch?v=mMv6OSuitWw", # Python 101: Learn the 5 Must-Know Concepts
    "https://www.youtube.com/watch?v=2zW5emKWof8", # 10 Years of Coding: What I Wish I Knew Before I Started
]
# Create a template for extracting topics from a text.
extract_topics_template = """
You are an expert in extracting skills being taught from a transcript of a video.
Your goal is to extract the skills taught from the transcript below.
The skills will be used to give the user an idea of what will be learned in the video.
Transcript:
------------
{text}
------------
The description of the skills should be descriptive, but short and concise. Mention what overarching skill would be learned.
Example:
Implementing continuous delivery for faster shipping - Software development
Evaluating and selecting a suitable tech stack for SaaS development - Software development
Recognizing the importance of marketing and customer communication in building a successful SaaS business - Business and marketing
Don't add numbers. Just each skill on a new line.
SKILLS - OVERARCHING SKILL:
"""
PROMPT_EXTRACT_TOPICS = PromptTemplate(template=extract_topics_template, input_variables=["text"])
# The second prompt refines the extracted skills, based on subsequent chunks.
extract_topics_refine_template = (
"""
You are an expert in extracting skills from a transcript of a video.
Your goal is to extract the skills taught from the transcript below.
The skills will be used to give the user an idea of what will be learned in the video.
We have provided a list of skills up to a certain point: {existing_answer}
We have the opportunity to refine the skills
(only if needed) with some more context below.
------------
{text}
------------
Given the new context, refine the skills discussed.
If the context isn't useful, return the list of skills.
The description of the skills should be descriptive, but short and concise. Mention what overarching skill would be learned.
Example:
Implementing continuous delivery for faster shipping - Software development
Evaluating and selecting a suitable tech stack for SaaS development - Software development
Recognizing the importance of marketing and customer communication in building a successful SaaS business - Business and marketing
Don't add numbers. Just each skill on a new line.
SKILLS - OVERARCHING SKILL:
"""
)
PROMPT_EXTRACT_TOPICS_REFINE = PromptTemplate(
input_variables=["existing_answer", "text"],
template=extract_topics_refine_template,
)
# Prompt for generating a list of subskills needed to master a skill
subskills_template = """
You are an assistant specialized in designing learning paths for people trying to acquire a particular skill-set.
Your goal is to make a list of sub skills a person needs to become proficient in a particular skill.
The skill set you need to design a learning path for is: {skill_set}
The user will say which skill set they want to learn, and you'll provide a short and concise list of specific skills this person needs to learn.
This list will be used to find YouTube videos related to those skills. Don't mention youtube videos though! Name only 5 skills maximum.
"""
PROMPT_SUBSKILLS = PromptTemplate(template=subskills_template, input_variables=["skill_set"])
# Prompt for finding a video based on a subskill set
find_video_template = """
You are an assistant specialized in designing learning paths for people trying to acquire a particular skill-set.
Your goal is to find a list of videos that teach a particular skill.
It should be based on the following context:
{context}
Look for videos that teach the following skills: {skill_set}
RETURN A LIST OF VIDEOS WITH YOUTUBE URL AND TITLE:
"""
PROMPT_FIND_VIDEO = PromptTemplate(template=find_video_template, input_variables=["context","skill_set"])
# Initialize the large language model. You can use the gpt-3.5-turbo-16k model or any model you prefer.
# Play around with the temperature parameter to get different results. Higher temperature means more randomness. Lower temperature means more deterministic.
llm = ChatOpenAI(model_name='gpt-3.5-turbo-16k', temperature=0)
# Initialize empty document list
documents = []
with get_openai_callback() as cb:
    # Loop over the youtube urls
    for url in youtube_urls:
        # Load a youtube video and get the transcript
        youtube_url = url
        loader = YoutubeLoader.from_youtube_url(youtube_url=youtube_url, add_video_info=True)
        data = loader.load()
        metadata = data[0].metadata
        title = metadata['title']
        author = metadata['author']
        # Split the transcript into shorter chunks.
        # First create the text splitter. The chunk_size is the maximum number of tokens in each chunk.
        text_splitter = TokenTextSplitter(chunk_size = 2000, chunk_overlap = 100)
        # Then split the transcript into chunks.
        # The .split_documents() method returns a list of Document objects, one per chunk.
        docs = text_splitter.split_documents(data)
        # Initialize the summarization chain
        extract_topics_chain = load_summarize_chain(llm=llm, chain_type="refine", verbose=True, question_prompt = PROMPT_EXTRACT_TOPICS, refine_prompt = PROMPT_EXTRACT_TOPICS_REFINE)
        extracted_topics = extract_topics_chain(docs)
        video_overview = ""
        # Add the YouTube Channel name, video title, URL and extracted topics to the video overview
        video_overview += f"YouTube Channel: {author}\n"
        video_overview += f"YouTube Video: {title}\n"
        video_overview += f"YouTube URL: {youtube_url}\n"
        video_overview += "Skills: \n"
        video_overview += extracted_topics['output_text']
        # Create a document object with the video overview
        docs = Document(page_content=video_overview, metadata={"title": title, "author": author, "url": youtube_url})
        # Add the document to the documents list
        documents.append(docs)
# Initialize the embeddings
embeddings = OpenAIEmbeddings()
# Create a vector store with the documents
vector_store = FAISS.from_documents(documents, embeddings)
# Save the vector store
vector_store.save_local("vector/", "vector_store")
# Load the vector store
vector_store = FAISS.load_local("vector/", embeddings, "vector_store")
# Initialize the subskills chain
subskills_chain = LLMChain(llm=llm, prompt=PROMPT_SUBSKILLS)
# Loop for questions
while True:
    # Ask the user what skill they want to learn
    skill_set = input("What skill set do you want to learn? ")
    # Use skillset to find subskills
    subskills = subskills_chain.predict(skill_set=skill_set)
    # Print subskills
    print(f"Subskills: \n {subskills}\n")
    # Initialize the retrieval chain
    qa = RetrievalQA.from_chain_type(llm = llm, retriever = vector_store.as_retriever(), chain_type="stuff", verbose=True)
    # Set query to ask
    query = f"Which Youtube videos teach {subskills}?"
    # Use query to find videos
    videos = qa.run(query)
    # Print videos
    print(f"Videos: \n {videos}\n")
================================================
FILE: day_7/requirements.txt
================================================
langchain
openai
tiktoken
unstructured
youtube-transcript-api
faiss-cpu
pytube
google-api-python-client
================================================
FILE: day_7/youtube_ids.py
================================================
# If you want to load all YouTube videos from a specific channel in one go, use these functions.
import googleapiclient.discovery
from tqdm import tqdm
from youtube_transcript_api import YouTubeTranscriptApi
from youtube_transcript_api.formatters import TextFormatter
api_key = "Your Google Dev API Key" #@param {type:"string"}
channel_id = "" #@param {type:"string"} # Get your channel ID here https://commentpicker.com/youtube-channel-id.php
def get_channel_videos(channel_id, api_key):
    youtube = googleapiclient.discovery.build(
        "youtube", "v3", developerKey=api_key)
    video_ids = []
    page_token = None
    while True:
        request = youtube.search().list(
            part="snippet",
            channelId=channel_id,
            maxResults=10, # Fetch 10 results per page (the API allows up to 50)
            pageToken=page_token # Add pagination
        )
        response = request.execute()
        video_ids += [item['id']['videoId'] for item in response['items'] if item['id']['kind'] == 'youtube#video']
        # Check if there are more videos to fetch
        if 'nextPageToken' in response:
            page_token = response['nextPageToken']
        else:
            break
    return video_ids
def get_transcript(video_id):
    # Get transcript list
    transcript_list = YouTubeTranscriptApi.list_transcripts(video_id)
    transcripts_manual = transcript_list._manually_created_transcripts
    # Get transcript. If no manually created transcript is available, use the automatically generated one.
    if transcripts_manual:
        transcript = YouTubeTranscriptApi.get_transcript(video_id)
    else:
        transcript = YouTubeTranscriptApi.get_transcript(video_id, languages=['nl', 'en'])
    # Format transcript as text
    formatter = TextFormatter()
    text_transcript = formatter.format_transcript(transcript)
    text_transcript = text_transcript.replace('\n', ' ')
    return text_transcript
video_ids = get_channel_videos(channel_id, api_key)
transcript = get_transcript(video_ids[0])
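# A minimal sketch of the page-token loop used in get_channel_videos, with the
# API call swapped for an injected function so the loop can be tried without a
# Google API key. collect_all_pages and fake_pages are hypothetical; the fake
# responses only mimic the fields the script above reads.

```python
from typing import Callable, Dict, List, Optional


def collect_all_pages(fetch_page: Callable[[Optional[str]], Dict]) -> List[str]:
    """Follow nextPageToken until the last page and collect every video ID."""
    video_ids: List[str] = []
    page_token: Optional[str] = None
    while True:
        response = fetch_page(page_token)
        # Search results can contain channels and playlists, so keep videos only.
        video_ids += [
            item["id"]["videoId"]
            for item in response["items"]
            if item["id"]["kind"] == "youtube#video"
        ]
        # The last page has no nextPageToken, which ends the loop.
        page_token = response.get("nextPageToken")
        if page_token is None:
            break
    return video_ids


# Hypothetical two-page result keyed by the page token that requests it.
fake_pages = {
    None: {
        "items": [
            {"id": {"kind": "youtube#video", "videoId": "aaa"}},
            {"id": {"kind": "youtube#playlist", "playlistId": "ppp"}},
        ],
        "nextPageToken": "page-2",
    },
    "page-2": {
        "items": [{"id": {"kind": "youtube#video", "videoId": "bbb"}}],
    },
}
ids = collect_all_pages(lambda token: fake_pages[token])
```

# Passing the fetch step in as a function keeps the pagination logic testable;
# in the real script that function would wrap youtube.search().list(...).execute().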
================================================
FILE: requirements.txt
================================================
langchain
openai
youtube-transcript-api
pytube
tiktoken
bs4
pypdf2
google-api-python-client
google-auth-oauthlib
google-auth-httplib2
beautifulsoup4
tabulate