Repository: alexeygrigorev/llm-rag-workshop
Branch: main
Commit: 95602ff21b4a
Files: 11
Total size: 917.2 KB
Directory structure:
gitextract_ihuhb7f5/
├── .gitignore
├── Pipfile
├── README.md
├── app.py
├── docker-compose.yaml
├── notebooks/
│ ├── documents.json
│ ├── elastic-search.ipynb
│ ├── google_flan_t5.ipynb
│ ├── long-workshop.ipynb
│ └── parse-faq.ipynb
└── rag.py
================================================
FILE CONTENTS
================================================
================================================
FILE: .gitignore
================================================
.ipynb_checkpoints
.envrc
__pycache__
================================================
FILE: Pipfile
================================================
[[source]]
url = "https://pypi.org/simple"
verify_ssl = true
name = "pypi"
[packages]
scikit-learn = "*"
pandas = "*"
requests = "*"
python-docx = "*"
notebook = "==7.1.2"
transformers = "*"
accelerate = "*"
bitsandbytes = "*"
bitsandbytes-cuda116 = "*"
openai = "*"
tqdm = "*"
elasticsearch = "*"
[dev-packages]
[requires]
python_version = "3.9"
================================================
FILE: README.md
================================================
# LLM RAG Workshop
Chat with your own data - LLM+RAG workshop
The content here is based on [LLM Zoomcamp](https://github.com/DataTalksClub/llm-zoomcamp) - a free course about the engineering aspects of LLMs. We will run the course again in Spring-Summer 2025. Sign up if you're interested in attending it.
If you want to run a similar workshop in your company, contact
me at alexey@datatalks.club.
For this workshop, you need:
- Docker
- Python 3 (we use 3.12)
- [GitHub account](https://github.com/) + VS Code (optional - if you want to use codespaces, already contains Docker and Python)
- [OpenAI account](https://openai.com/) (optional)
- [Groq account](https://groq.com/) (optional)
- [HuggingFace account](https://huggingface.co/) (optional - if you want to access some open-source LLMs in the extended version)
# Plan
* LLM and RAG (theory)
* Preparing the environment (codespaces)
* Installing pipenv and direnv
* Running ElasticSearch
* Indexing and retrieving documents with ElasticSearch
* Generating the answers with OpenAI
Extended workshop:
* Creating a web interface with Streamlit
* Running LLMs locally
* Replacing OpenAI with Ollama
* Running Ollama and ElasticSearch in Docker-Compose
* Using Open-Source LLMs from HuggingFace Hub
Or
* Evaluating retrieval and RAG
# LLM and RAG
I generated that with ChatGPT:
## Large Language Models (LLMs)
- **Purpose:** Generate and understand text in a human-like manner.
- **Structure:** Built using deep learning techniques, especially Transformer architectures.
- **Size:** Characterized by having a vast number of parameters (billions to trillions), enabling nuanced understanding and generation.
- **Training:** Pre-trained on large datasets of text to learn a broad understanding of language, then fine-tuned for specific tasks.
- **Applications:** Used in chatbots, translation services, content creation, and more.
## Retrieval-Augmented Generation (RAG)
- **Purpose:** Enhance language model responses with information retrieved from external sources.
- **How It Works:** Combines a language model with a retrieval system, typically a document database or search engine.
- **Process:**
- Queries an external knowledge source based on input.
- Integrates retrieved information into the generation process to provide contextually rich and accurate responses.
- **Advantages:** Improves the factual accuracy and relevance of generated text.
- **Use Cases:** Fact-checking, knowledge-intensive tasks like medical diagnosis assistance, and detailed content creation where accuracy is crucial.
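The process above can be sketched in a few lines of plain Python. This is a toy illustration, not the implementation built later in the workshop: the knowledge base is a made-up list, `retrieve` ranks by simple word overlap instead of using a search engine, and `fake_llm` is a hypothetical stand-in for a real model call.

```python
import re

# Toy knowledge base; the records are made up for illustration.
KNOWLEDGE_BASE = [
    "You can join the course after it has started.",
    "Homework deadlines are announced in the course channel.",
    "Docker is required for the first module.",
]


def tokenize(text):
    """Lowercase the text and split it into a set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, documents, max_results=2):
    """Rank documents by the number of words they share with the query."""
    query_words = tokenize(query)
    scored = [(len(query_words & tokenize(doc)), doc) for doc in documents]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:max_results]]


def build_prompt(question, context_docs):
    """Put the retrieved documents into the prompt next to the question."""
    context = "\n".join(context_docs)
    return f"QUESTION: {question}\n\nCONTEXT:\n{context}"


def fake_llm(prompt):
    """Hypothetical stand-in for a real LLM call; it only reports the prompt size."""
    return f"(answer generated from a prompt of {len(prompt)} characters)"


question = "Can I join the course after it started?"
docs = retrieve(question, KNOWLEDGE_BASE)
prompt = build_prompt(question, docs)
print(prompt)
print(fake_llm(prompt))
```

The three steps (retrieve, build the prompt, generate) are the same ones the rest of the workshop implements with a real search engine and a real LLM.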
Use ChatGPT to show the difference between generating and RAG.
What we will do:
* Index Zoomcamp FAQ documents
* DE Zoomcamp: https://docs.google.com/document/d/19bnYs80DwuUimHM65UV3sylsCn2j1vziPOwzBwQrebw/edit
* ML Zoomcamp: https://docs.google.com/document/d/1LpPanc33QJJ6BSsyxVg-pWNMplal84TdZtq10naIhD8/edit
* MLOps Zoomcamp: https://docs.google.com/document/d/12TlBfhIiKtyBv8RnsoJR6F72bkPDGEvPOItJIxaEzE0/edit
* Create a Q&A system for answering questions about these documents
# Preparing the Environment
We will use codespaces - but it will work in any environment with Docker and Python 3
In codespaces:
* Create a repository, e.g. "llm-zoomcamp-rag-workshop"
* Start a codespace there
## Libraries
Install the packages:
```bash
pip install tqdm jupyter openai elasticsearch
```
If you use Groq:
```bash
pip install groq
```
## LLM services
If you use OpenAI, we need the key:
* Sign up at https://platform.openai.com/ if you don't have an account
* Go to https://platform.openai.com/api-keys
* Create a new key, copy it
Let's put the key into an environment variable:
```bash
export OPENAI_API_KEY="TOKEN"
```
For groq:
* Sign up at https://console.groq.com/
* Go to https://console.groq.com/keys
* Create a new key, copy it
* Use the `GROQ_API_KEY` env variable
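Before calling any API, it can help to check that the key is actually visible to Python. A minimal sketch: `get_api_key` is a hypothetical helper, not part of the OpenAI or Groq SDKs, and the optional `env` argument exists only so the function can be tried without touching the real environment.

```python
import os


def get_api_key(name, env=None):
    """Return the key stored in the environment variable `name`.

    `env` defaults to os.environ; passing a dict makes the function easy to try out.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before starting Jupyter")
    return key


# Example with an explicit mapping instead of the real environment
print(get_api_key("OPENAI_API_KEY", env={"OPENAI_API_KEY": "TOKEN"}))
```

In a notebook you would call `get_api_key("OPENAI_API_KEY")` or `get_api_key("GROQ_API_KEY")` without the `env` argument.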
## Managing secrets
You can also use [GitHub Codespaces secrets](https://docs.github.com/en/codespaces/managing-your-codespaces/managing-your-account-specific-secrets-for-github-codespaces#adding-a-secret) for better secret management.
If you don't use codespaces, you can do it with direnv:
```bash
sudo apt update
sudo apt install direnv
direnv hook bash >> ~/.bashrc
```
Create / edit `.envrc` in your project directory:
```bash
export OPENAI_API_KEY='sk-proj-key'
```
or
```bash
export GROQ_API_KEY='your-key'
```
Make sure `.envrc` is in your `.gitignore` - never commit it!
```bash
echo ".envrc" >> .gitignore
```
Allow direnv to run:
```bash
direnv allow
```
## Jupyter
Start a new terminal and run Jupyter there:
```bash
jupyter notebook
```
## Elasticsearch
In another terminal, run elasticsearch with docker:
```bash
docker run -it \
--rm \
--name elasticsearch \
-m 2G \
-p 9200:9200 \
-p 9300:9300 \
-e "discovery.type=single-node" \
-e "xpack.security.enabled=false" \
docker.elastic.co/elasticsearch/elasticsearch:8.4.3
```
Verify that ES is running:
```bash
curl http://localhost:9200
```
You should get something like this:
```json
{
"name" : "63d0133fc451",
"cluster_name" : "docker-cluster",
"cluster_uuid" : "AKW1gxdRTuSH8eLuxbqH6A",
"version" : {
"number" : "8.4.3",
"build_flavor" : "default",
"build_type" : "docker",
"build_hash" : "42f05b9372a9a4a470db3b52817899b99a76ee73",
"build_date" : "2022-10-04T07:17:24.662462378Z",
"build_snapshot" : false,
"lucene_version" : "9.3.0",
"minimum_wire_compatibility_version" : "7.17.0",
"minimum_index_compatibility_version" : "7.0.0"
},
"tagline" : "You Know, for Search"
}
```
# Retrieval
RAG consists of multiple components, and the first is R - "retrieval". For retrieval, we need a search system. In our example, we will use elasticsearch for searching.
## Searching in the documents
Create a notebook "elastic-rag" or something like that. We will use it for our experiments.
First, we need to download the docs:
```bash
wget https://github.com/alexeygrigorev/llm-rag-workshop/raw/main/notebooks/documents.json
```
Let's load the documents and flatten them into one list, keeping the course name in each record:
```python
import json

with open('./documents.json', 'rt') as f_in:
    documents_file = json.load(f_in)

documents = []

# Flatten the per-course structure into a single list of records
for course in documents_file:
    course_name = course['course']

    for doc in course['documents']:
        doc['course'] = course_name
        documents.append(doc)
```
Now we'll index these documents with Elasticsearch.
First, initiate the connection and check that it's working:
```python
from elasticsearch import Elasticsearch
es = Elasticsearch("http://localhost:9200")
es.info()
```
You should see the same response as earlier with `curl`.
Before we can index the documents, we need to create an index (an index in Elasticsearch is like a table in a "usual" database):
```python
index_settings = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
    },
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "section": {"type": "text"},
            "question": {"type": "text"},
            "course": {"type": "keyword"}
        }
    }
}
index_name = "course-questions"
response = es.indices.create(index=index_name, body=index_settings)
response
```
Now we're ready to index all the documents:
```python
from tqdm.auto import tqdm

for doc in tqdm(documents):
    es.index(index=index_name, document=doc)
```
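Indexing one document per request is fine for a small FAQ. For larger collections, the Python client also ships a bulk helper. A sketch, assuming the same `es` client and index name as above: `build_bulk_actions` and `bulk_index` are hypothetical helper names, and the sample records are made up.

```python
def build_bulk_actions(documents, index_name):
    """Yield one bulk action per document."""
    for doc in documents:
        yield {"_index": index_name, "_source": doc}


def bulk_index(es, documents, index_name):
    """Send all documents in batches instead of one request per document."""
    # Imported lazily so the generator above works without the client installed
    from elasticsearch import helpers

    return helpers.bulk(es, build_bulk_actions(documents, index_name))


# Made-up records with the same fields as the workshop documents
sample_docs = [
    {"course": "demo-course", "section": "General", "question": "Q1?", "text": "A1"},
    {"course": "demo-course", "section": "General", "question": "Q2?", "text": "A2"},
]
actions = list(build_bulk_actions(sample_docs, "course-questions"))
print(actions[0])
```

With a running Elasticsearch, `bulk_index(es, documents, index_name)` would replace the loop above.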
## Retrieving the docs
```python
user_question = "How do I join the course after it has started?"
search_query = {
    "size": 5,
    "query": {
        "bool": {
            "must": {
                "multi_match": {
                    "query": user_question,
                    "fields": ["question^3", "text", "section"],
                    "type": "best_fields"
                }
            },
            "filter": {
                "term": {
                    "course": "data-engineering-zoomcamp"
                }
            }
        }
    }
}
```
This query:
* Retrieves top 5 matching documents.
* Searches in the "question", "text", "section" fields, prioritizing "question" using `multi_match` query with type `best_fields` (see [here](https://github.com/DataTalksClub/llm-zoomcamp/blob/main/01-intro/elastic-search.md) for more information)
* Matches user query "How do I join the course after it has started?".
* Shows results only for the "data-engineering-zoomcamp" course.
Let's see the output:
```python
response = es.search(index=index_name, body=search_query)

for hit in response['hits']['hits']:
    doc = hit['_source']
    print(f"Section: {doc['section']}")
    print(f"Question: {doc['question']}")
    print(f"Answer: {doc['text'][:60]}...\n")
```
## Cleaning it
We can make it cleaner by putting it into a function:
```python
def retrieve_documents(query, index_name="course-questions", max_results=5):
    es = Elasticsearch("http://localhost:9200")

    search_query = {
        "size": max_results,
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": query,
                        "fields": ["question^3", "text", "section"],
                        "type": "best_fields"
                    }
                },
                "filter": {
                    "term": {
                        "course": "data-engineering-zoomcamp"
                    }
                }
            }
        }
    }

    response = es.search(index=index_name, body=search_query)
    # Keep only the stored documents, dropping the search metadata
    documents = [hit['_source'] for hit in response['hits']['hits']]
    return documents
```
And print the answers:
```python
user_question = "How do I join the course after it has started?"

response = retrieve_documents(user_question)

for doc in response:
    print(f"Section: {doc['section']}")
    print(f"Question: {doc['question']}")
    print(f"Answer: {doc['text'][:60]}...\n")
```
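The filter in `retrieve_documents` hardcodes the course. One possible variation is to build the query in a separate function that takes the course as a parameter. This is a sketch: `build_search_query` is a hypothetical helper, and the query body is the same one used above.

```python
def build_search_query(question, course, max_results=5):
    """Return the Elasticsearch query body, with the course as a parameter."""
    return {
        "size": max_results,
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": question,
                        "fields": ["question^3", "text", "section"],
                        "type": "best_fields",
                    }
                },
                # Exact match on the keyword field; it does not affect the score
                "filter": {"term": {"course": course}},
            }
        },
    }


query = build_search_query("How do I install Docker?", course="mlops-zoomcamp")
print(query["query"]["bool"]["filter"])
```

The result can be passed to `es.search(index=index_name, body=query)` in the same way as `search_query`.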
# Generation - Answering questions
Now let's do the "G" part - generation based on the "R" output
## OpenAI
Today we will use OpenAI (it's the easiest to get started with). In the course, we will learn how to use open-source models
Make sure we have the SDK installed and the key is set.
This is how we communicate with GPT-4o:
```python
from openai import OpenAI
client = OpenAI()
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "The course already started. Can I still join?"}]
)
print(response.choices[0].message.content)
```
## Groq
With groq it's almost the same:
```python
from groq import Groq
client = Groq()
response = client.chat.completions.create(
model="llama3-8b-8192",
messages=[{"role": "user", "content": "The course already started. Can I still join?"}]
)
print(response.choices[0].message.content)
```
## Building a Prompt
Now let's build a prompt. First, we put all the
documents together in one string:
```python
context_template = """
Section: {section}
Question: {question}
Answer: {text}
""".strip()
context_docs = retrieve_documents(user_question)
context_result = ""
for doc in context_docs:
    doc_str = context_template.format(**doc)
    context_result += ("\n\n" + doc_str)
context = context_result.strip()
print(context)
```
Now build the actual prompt:
```python
prompt = f"""
You're a course teaching assistant. Answer the user QUESTION based on CONTEXT - the documents retrieved from our FAQ database.
Only use the facts from the CONTEXT. If the CONTEXT doesn't contain the answer, return "NONE"
QUESTION: {user_question}
CONTEXT:
{context}
""".strip()
```
Now we can send it to the OpenAI API:
```python
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": prompt}]
)
answer = response.choices[0].message.content
answer
```
(Replace with `llama3-8b-8192` if using groq)
Note: there are system and user prompts, we can also experiment with them to make the design of the prompt cleaner.
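As a sketch of that idea, the instructions can go into a system message and the question with its context into a user message. `build_messages` is a hypothetical helper, and the wording of the system prompt is taken from the prompt above.

```python
SYSTEM_PROMPT = """
You're a course teaching assistant. Answer the user QUESTION based on CONTEXT - the documents retrieved from our FAQ database.
Only use the facts from the CONTEXT. If the CONTEXT doesn't contain the answer, return "NONE"
""".strip()


def build_messages(user_question, context):
    """Split the prompt into a fixed system message and a per-request user message."""
    user_content = f"QUESTION: {user_question}\n\nCONTEXT:\n{context}"
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_content},
    ]


messages = build_messages(
    "The course already started. Can I still join?",
    "Question: Can I still join the course after the start date?\nAnswer: Yes.",
)
for message in messages:
    print(message["role"], "->", message["content"][:50])
```

The list can be passed as `messages=messages` to `client.chat.completions.create`. Whether this split improves the answers is something to check experimentally.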
## Cleaning
Now let's put everything together in one function:
```python
context_template = """
Section: {section}
Question: {question}
Answer: {text}
""".strip()
prompt_template = """
You're a course teaching assistant.
Answer the user QUESTION based on CONTEXT - the documents retrieved from our FAQ database.
Don't use other information outside of the provided CONTEXT.
QUESTION: {user_question}
CONTEXT:
{context}
""".strip()
def build_context(documents):
    context_result = ""

    for doc in documents:
        doc_str = context_template.format(**doc)
        context_result += ("\n\n" + doc_str)

    return context_result.strip()


def build_prompt(user_question, documents):
    context = build_context(documents)
    prompt = prompt_template.format(
        user_question=user_question,
        context=context
    )
    return prompt


def ask_openai(prompt, model="gpt-4o"):
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}]
    )
    answer = response.choices[0].message.content
    return answer


# use ask_groq and model="llama3-8b-8192" if using groq
def qa_bot(user_question):
    context_docs = retrieve_documents(user_question)
    prompt = build_prompt(user_question, context_docs)
    answer = ask_openai(prompt)
    return answer
```
Now we can ask it different questions
```python
qa_bot("I'm getting invalid reference format: repository name must be lowercase")
```
```python
qa_bot("I can't connect to postgres port 5432, my password doesn't work")
```
```python
qa_bot("how can I run kafka?")
```
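The OpenAI and Groq clients shown earlier are called the same way, so one option is to pass the client in instead of writing a separate function per provider. A sketch: `ask_llm` is a hypothetical name, and `EchoCompletions` is a stub that only mimics the shape of the response so the example runs without an API key.

```python
from types import SimpleNamespace


def ask_llm(client, prompt, model):
    """Send a single user message and return the text of the first choice."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


class EchoCompletions:
    """Hypothetical stub that mimics the shape of a chat completions response."""

    def create(self, model, messages):
        text = f"[{model}] {messages[-1]['content']}"
        message = SimpleNamespace(content=text)
        return SimpleNamespace(choices=[SimpleNamespace(message=message)])


stub_client = SimpleNamespace(chat=SimpleNamespace(completions=EchoCompletions()))
print(ask_llm(stub_client, "how can I run kafka?", model="llama3-8b-8192"))
```

With real clients, this would be `ask_llm(OpenAI(), prompt, model="gpt-4o")` or `ask_llm(Groq(), prompt, model="llama3-8b-8192")`.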
# What's next
* Use open-source LLMs
* Build an interface, e.g. streamlit
* Deploy it
# Extended version
For an extended version of this workshop, we will
* Build a UI with streamlit
* Experiment with open-source LLMs and replace OpenAI
# Streamlit UI
We can build simple UI apps with streamlit. Let's install it
```bash
pipenv install streamlit
```
If you want to learn more about streamlit, you can
use [this material](https://github.com/DataTalksClub/project-of-the-week/blob/main/2022-08-14-frontend.md).
We need a simple form with
* Input box for the prompt
* Button
* Text field to display the response (in markdown)
```python
import streamlit as st


# Placeholder that imitates a slow model call
def qa_bot(prompt):
    import time
    time.sleep(2)
    return f"Response for the prompt: {prompt}"


def main():
    st.title("DTC Q&A System")

    with st.form(key='rag_form'):
        prompt = st.text_input("Enter your prompt")
        response_placeholder = st.empty()
        submit_button = st.form_submit_button(label='Submit')

    if submit_button:
        response_placeholder.markdown("Loading...")
        response = qa_bot(prompt)
        response_placeholder.markdown(response)


if __name__ == "__main__":
    main()
```
Let's run it
```bash
streamlit run app.py
```
Now we can replace the function `qa_bot`. Let's create
a file `rag.py` with the content from the notebook.
You can see the content of the file [here](rag.py).
Also, we add a special dropdown menu to select the course:
```python
courses = [
"data-engineering-zoomcamp",
"machine-learning-zoomcamp",
"mlops-zoomcamp"
]
zoomcamp_option = st.selectbox("Select a zoomcamp", courses)
```
# Open-Source LLMs
There are many open-source LLMs. We will use two platforms:
* Ollama for running on CPU
* HuggingFace for running on GPU
## Ollama
The easiest way to run an LLM without a GPU is using [Ollama](https://github.com/ollama/ollama)
Note that the 2 core codespaces instance is not enough.
For this part it's better to create a separate instance
with 4 cores.
You can also run it locally. I have 8 cores on my laptop,
so it's faster than doing it on codespaces.
Installing for Linux:
```bash
curl -fsSL https://ollama.com/install.sh | sh
```
Installing for other OS - check the [Ollama website](https://www.ollama.com/download). I successfully tested it on Windows too.
Let's run it:
```bash
# in one terminal
ollama start
# in another terminal
ollama run phi3
```
Prompt example:
```
Question: I just discovered the couse. can i still enrol
Context:
Course - Can I still join the course after the start date? Yes, even if you don't register, you're still eligible to submit the homeworks. Be aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.
Environment - Is Python 3.9 still the recommended version to use in 2024? Yes, for simplicity (of troubleshooting against the recorded videos) and stability. [source] But Python 3.10 and 3.11 should work fine.
How can we contribute to the course? Star the repo! Share it with friends if you find it useful ❣️ Create a PR if you see you can improve the text or the structure of the repository.
Answer:
```
Ollama's API is compatible with OpenAI's python client, so
we can use it by changing only a few lines of code:
```python
from openai import OpenAI
client = OpenAI(
base_url='http://localhost:11434/v1/',
api_key='ollama',
)
response = client.chat.completions.create(
model='phi3',
messages=[{"role": "user", "content": prompt}]
)
response.choices[0].message.content
```
That's it! Now let's put everything in Docker
## Ollama + Elastic in Docker
We already know how to run Elasticsearch in Docker:
```bash
docker run -it \
--rm \
--name elasticsearch \
-p 9200:9200 \
-p 9300:9300 \
-e "discovery.type=single-node" \
-e "xpack.security.enabled=false" \
docker.elastic.co/elasticsearch/elasticsearch:8.4.3
```
[This is how we run Ollama in Docker](https://hub.docker.com/r/ollama/ollama):
```bash
docker run -it \
--rm \
--name ollama \
-v ollama:/root/.ollama \
-p 11434:11434 \
ollama/ollama
```
When we run it, we need to log in to the container to download
the phi3 model:
```bash
docker exec -it ollama bash
ollama pull phi3
```
After pulling the model, we can query it with OpenAI's python package. Because we mount the `ollama` volume, the model files
persist across multiple runs of the container.
Let's now combine them into one docker-compose file.
Create a [`docker-compose.yaml`](docker-compose.yaml) file with both Ollama and Elasticsearch.
And now run it:
```bash
docker-compose up
```
## HuggingFace Hub
Ollama can run locally on a CPU. But there are many models
that require a GPU.
For running them, we will use Colab or another notebook platform with a GPU (for example, SaturnCloud). Let's stop our codespace
for now.
In Colab, you need to enable GPU:
* Create a notebook: https://colab.research.google.com/#create=true
* Runtime -> Change runtime type -> T4 GPU
* `!nvidia-smi` to verify you have a GPU
Now we need to install the dependencies:
```
!pip install -U transformers accelerate bitsandbytes
```
Also, it's tricky to run Elasticsearch on Colab, so we will replace
it with [minsearch](https://github.com/alexeygrigorev/minsearch) - a simple in-memory search library:
```
!pip install minsearch
```
Let's get the data and create an index:
```python
import requests
docs_url = 'https://github.com/alexeygrigorev/llm-rag-workshop/raw/main/notebooks/documents.json'
docs_response = requests.get(docs_url)
documents_raw = docs_response.json()
documents = []
for course in documents_raw:
course_name = course['course']
for doc in course['documents']:
doc['course'] = course_name
documents.append(doc)
import minsearch
index = minsearch.Index(
text_fields=["question", "text", "section"],
keyword_fields=["course"]
)
index.fit(documents)
```
Searching with minsearch:
```python
query = "I just discovered the course, can I still join?"
filter_dict = {"course": "data-engineering-zoomcamp"}
boost_dict = {"question": 3}
index.search(query, filter_dict, boost_dict, num_results=5)
```
Let's replace our search function:
```python
def retrieve_documents(query, max_results=5):
    filter_dict = {"course": "data-engineering-zoomcamp"}
    boost_dict = {"question": 3}
    # Pass max_results through instead of hardcoding 5
    return index.search(query, filter_dict, boost_dict, num_results=max_results)
```
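To make the roles of `filter_dict` and `boost_dict` concrete, here is a toy search in plain Python. It is not how minsearch is implemented: scoring by word overlap is an assumption made for illustration, `toy_search` is a hypothetical function, and the sample records are made up.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into a set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def toy_search(documents, query, filter_dict, boost_dict, num_results=5):
    """Score documents by boosted word overlap; keep only those passing the filter."""
    query_words = tokenize(query)
    scored = []
    for doc in documents:
        # Keyword fields must match exactly, otherwise the document is skipped
        if any(doc.get(field) != value for field, value in filter_dict.items()):
            continue
        score = 0.0
        for field in ("question", "text", "section"):
            overlap = len(query_words & tokenize(doc.get(field, "")))
            # Fields without an explicit boost get a weight of 1
            score += boost_dict.get(field, 1.0) * overlap
        if score > 0:
            scored.append((score, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:num_results]]


sample_docs = [
    {"course": "data-engineering-zoomcamp", "section": "General",
     "question": "Can I still join the course?", "text": "Yes."},
    {"course": "data-engineering-zoomcamp", "section": "General",
     "question": "What are the prerequisites?", "text": "You can join the Slack channel."},
    {"course": "mlops-zoomcamp", "section": "General",
     "question": "Can I still join?", "text": "Yes."},
]

results = toy_search(
    sample_docs,
    "can I still join",
    filter_dict={"course": "data-engineering-zoomcamp"},
    boost_dict={"question": 3},
)
for doc in results:
    print(doc["question"])
```

The filter removes the MLOps record entirely, and the boost makes a match in the question count three times as much as a match in the answer text.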
We will use Google's FLAN T5 model: [`google/flan-t5-xl`](https://huggingface.co/google/flan-t5-xl).
Downloading and loading it:
```python
from transformers import T5ForConditionalGeneration, T5Tokenizer
model_name = "google/flan-t5-xl"
model = T5ForConditionalGeneration.from_pretrained(model_name, device_map="auto")
tokenizer = T5Tokenizer.from_pretrained(model_name, legacy=False)
tokenizer.model_max_length = 4096
```
Using it:
```python
input_text = "translate English to German: How old are you?"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))
```
Let's put it to a function:
```python
def llm(prompt):
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")
    outputs = model.generate(input_ids)
    result = tokenizer.decode(outputs[0])
    return result
```
Everything together:
```python
context_template = """
Section: {section}
Question: {question}
Answer: {text}
""".strip()
prompt_template = """
You're a course teaching assistant.
Answer the user QUESTION based on CONTEXT - the documents retrieved from our FAQ database.
Don't use other information outside of the provided CONTEXT.
QUESTION: {user_question}
CONTEXT:
{context}
""".strip()
def build_context(documents):
    context_result = ""

    for doc in documents:
        doc_str = context_template.format(**doc)
        context_result += ("\n\n" + doc_str)

    return context_result.strip()


def build_prompt(user_question, documents):
    context = build_context(documents)
    prompt = prompt_template.format(
        user_question=user_question,
        context=context
    )
    return prompt


def qa_bot(user_question):
    context_docs = retrieve_documents(user_question)
    prompt = build_prompt(user_question, context_docs)
    answer = llm(prompt)
    return answer
```
Making the answers longer:
```python
def llm(prompt, generate_params=None):
    if generate_params is None:
        generate_params = {}

    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")
    outputs = model.generate(
        input_ids,
        max_length=generate_params.get("max_length", 100),
        num_beams=generate_params.get("num_beams", 5),
        do_sample=generate_params.get("do_sample", False),
        temperature=generate_params.get("temperature", 1.0),
        top_k=generate_params.get("top_k", 50),
        top_p=generate_params.get("top_p", 0.95),
    )
    # skip_special_tokens removes markers such as <pad> and </s> from the output
    result = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return result
```
Explanation of the parameters:
* `max_length`: Set this to a higher value if you want longer responses. For example, `max_length=300`.
* `num_beams`: Increasing this can lead to more thorough exploration of possible sequences. Typical values are between 5 and 10.
* `do_sample`: Set this to `True` to use sampling methods. This can produce more diverse responses.
* `temperature`: Lowering this value makes the model more confident and deterministic, while higher values increase diversity. Typical values range from 0.7 to 1.5.
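* `top_k` and `top_p`: These parameters restrict the pool of tokens to sample from. `top_k` limits the pool to the `k` most likely tokens (top-k sampling), while `top_p` keeps the smallest set of tokens whose cumulative probability reaches `p` (nucleus sampling). Like `temperature`, they only take effect when `do_sample=True`. Adjust these based on the desired level of randomness.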
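To make the difference between `top_k` and `top_p` concrete, here is a small sketch that applies both filters to a made-up next-token distribution. It illustrates the idea only and is not the code that `transformers` runs internally; `top_k_filter` and `top_p_filter` are hypothetical names.

```python
def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize."""
    kept = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


def top_p_filter(probs, p):
    """Keep the smallest set of most likely tokens whose cumulative probability reaches p."""
    kept, cumulative = [], 0.0
    for token, prob in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


# Made-up next-token distribution
probs = {"yes": 0.5, "no": 0.25, "maybe": 0.125, "never": 0.125}

print(top_k_filter(probs, k=3))
print(top_p_filter(probs, p=0.75))
```

Here `top_k=3` always keeps three tokens, while `top_p=0.75` keeps only "yes" and "no" because together they already cover 75% of the probability mass.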
Final notebook:
* [notebooks/google_flan_t5.ipynb](notebooks/google_flan_t5.ipynb)
* [On Colab](https://colab.research.google.com/drive/1ldGq6PJw5_vIEWFcxZsNlxX6IJR-rWPk?usp=sharing)
Other models:
* `microsoft/Phi-3-mini-128k-instruct`
* `mistralai/Mistral-7B-v0.1`
* And many more
# Alternative - evaluation
## Retrieval evaluation
Using hitrate to evaluate
First, we generate ground truth data ([notebook](https://github.com/DataTalksClub/llm-zoomcamp/blob/main/03-vector-search/eval/ground-truth-data.ipynb)) (also [add ID to documents](https://github.com/DataTalksClub/llm-zoomcamp/blob/main/03-vector-search/eval/documents-with-ids.json))
```python
prompt_template = """
You emulate a student who's taking our course.
Formulate 5 questions this student might ask based on a FAQ record. The record
should contain the answer to the questions, and the questions should be complete and not too short.
If possible, use as few words as possible from the record.
The record:
section: {section}
question: {question}
answer: {text}
Provide the output in parsable JSON without using code blocks:
["question1", "question2", ..., "question5"]
""".strip()
```
```
wget -O ground-truth-data.csv "https://github.com/DataTalksClub/llm-zoomcamp/blob/main/03-vector-search/eval/ground-truth-data.csv?raw=1"
```
Then [evaluate it](https://github.com/DataTalksClub/llm-zoomcamp/blob/main/03-vector-search/eval/evaluate-text.ipynb):
```python
relevance_total = []

for q in tqdm(ground_truth):
    doc_id = q['document']
    results = elastic_search(query=q['question'], course=q['course'])
    # One boolean per returned result: is it the document the question came from?
    relevance = [d['id'] == doc_id for d in results]
    relevance_total.append(relevance)
```
Metrics:
```python
def hit_rate(relevance_total):
    cnt = 0

    for line in relevance_total:
        if True in line:
            cnt = cnt + 1

    return cnt / len(relevance_total)


def mrr(relevance_total):
    total_score = 0.0

    for line in relevance_total:
        for rank in range(len(line)):
            if line[rank] == True:
                total_score = total_score + 1 / (rank + 1)

    return total_score / len(relevance_total)
```
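A small worked example may help to read these numbers. The relevance lists below are made up, and the functions are restated so the snippet runs on its own. The `break` in `mrr` makes it count only the first relevant result, which gives the same value as the version above when each query has at most one relevant result.

```python
def hit_rate(relevance_total):
    """Share of queries with the relevant document anywhere in the results."""
    hits = sum(1 for line in relevance_total if True in line)
    return hits / len(relevance_total)


def mrr(relevance_total):
    """Mean reciprocal rank of the first relevant result (0 if there is none)."""
    total_score = 0.0
    for line in relevance_total:
        for rank, is_relevant in enumerate(line):
            if is_relevant:
                total_score += 1 / (rank + 1)
                break  # only the first relevant result counts
    return total_score / len(relevance_total)


# Made-up results for four queries, three retrieved documents each
relevance_total = [
    [True, False, False],   # found at rank 1 -> 1
    [False, True, False],   # found at rank 2 -> 1/2
    [False, False, False],  # not found       -> 0
    [False, False, True],   # found at rank 3 -> 1/3
]

print(hit_rate(relevance_total))  # 3 of 4 queries have a hit
print(mrr(relevance_total))       # (1 + 1/2 + 0 + 1/3) / 4
```

Hit rate only asks whether the right document was retrieved at all, while MRR also rewards ranking it higher.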
## Cosine
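A->Q->A' cosine similarity: we take the original answer, generate a question from it, let the LLM answer that question, and compare the two answers (see [here](https://github.com/DataTalksClub/llm-zoomcamp/blob/main/04-monitoring/offline-rag-evaluation.ipynb))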
```python
from sentence_transformers import SentenceTransformer
model_name = 'multi-qa-MiniLM-L6-cos-v1'
model = SentenceTransformer(model_name)
answer_orig = 'Yes, sessions are recorded if you miss one. Everything is recorded, allowing you to catch up on any missed content. Additionally, you can ask questions in advance for office hours and have them addressed during the live stream. You can also ask questions in Slack.'
answer_llm = 'Everything is recorded, so you won’t miss anything. You will be able to ask your questions for office hours in advance and we will cover them during the live stream. Also, you can always ask questions in Slack.'
v_llm = model.encode(answer_llm)
v_orig = model.encode(answer_orig)
v_llm.dot(v_orig)
```
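A plain dot product equals cosine similarity only when both vectors have unit length. The code above relies on the model returning normalized embeddings, which is worth verifying for the model you use. The sketch below shows the relationship on two made-up vectors; `cosine_similarity` and `normalize` are illustrative helpers.

```python
import math


def dot(u, v):
    """Dot product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(u):
    """Scale the vector to unit length."""
    norm = math.sqrt(dot(u, u))
    return [a / norm for a in u]


def cosine_similarity(u, v):
    """Dot product divided by the product of the vector lengths."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


# Made-up vectors: the dot product is 24, both lengths are 5
u = [3.0, 4.0]
v = [4.0, 3.0]

print(dot(u, v))                        # raw dot product, not a similarity in [-1, 1]
print(cosine_similarity(u, v))          # 24 / 25
print(dot(normalize(u), normalize(v)))  # same value after normalization
```

If the embeddings are not normalized, use the full formula or normalize them first.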
## LLM as a Judge
[See here for code](https://github.com/DataTalksClub/llm-zoomcamp/blob/main/04-monitoring/offline-rag-evaluation.ipynb)
```python
prompt1_template = """
You are an expert evaluator for a Retrieval-Augmented Generation (RAG) system.
Your task is to analyze the relevance of the generated answer compared to the original answer provided.
Based on the relevance and similarity of the generated answer to the original answer, you will classify
it as "NON_RELEVANT", "PARTLY_RELEVANT", or "RELEVANT".
Here is the data for evaluation:
Original Answer: {answer_orig}
Generated Question: {question}
Generated Answer: {answer_llm}
Please analyze the content and context of the generated answer in relation to the original
answer and provide your evaluation in parsable JSON without using code blocks:
{{
"Relevance": "NON_RELEVANT" | "PARTLY_RELEVANT" | "RELEVANT",
"Explanation": "[Provide a brief explanation for your evaluation]"
}}
""".strip()
prompt2_template = """
You are an expert evaluator for a Retrieval-Augmented Generation (RAG) system.
Your task is to analyze the relevance of the generated answer to the given question.
Based on the relevance of the generated answer, you will classify it
as "NON_RELEVANT", "PARTLY_RELEVANT", or "RELEVANT".
Here is the data for evaluation:
Question: {question}
Generated Answer: {answer_llm}
Please analyze the content and context of the generated answer in relation to the question
and provide your evaluation in parsable JSON without using code blocks:
{{
"Relevance": "NON_RELEVANT" | "PARTLY_RELEVANT" | "RELEVANT",
"Explanation": "[Provide a brief explanation for your evaluation]"
}}
""".strip()
```
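Both prompts ask for parsable JSON, but a model does not always comply, so the answer should be validated before it is counted. A minimal sketch: `parse_judgement` is a hypothetical helper, and the sample answers are made up.

```python
import json

ALLOWED_LABELS = {"NON_RELEVANT", "PARTLY_RELEVANT", "RELEVANT"}


def parse_judgement(raw_answer):
    """Parse the judge's JSON answer; return None when it can't be used."""
    try:
        parsed = json.loads(raw_answer)
    except json.JSONDecodeError:
        return None
    # Reject anything that is not an object with one of the expected labels
    if not isinstance(parsed, dict) or parsed.get("Relevance") not in ALLOWED_LABELS:
        return None
    return parsed


good = '{"Relevance": "RELEVANT", "Explanation": "Covers the same facts."}'
bad = "Sure! Here is my evaluation: RELEVANT"

print(parse_judgement(good))
print(parse_judgement(bad))
```

Counting how often the result is `None` is also a useful signal about how well the judge follows the output format.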
# Conclusions
That was fun - thanks!
================================================
FILE: app.py
================================================
import streamlit as st
from rag import qa_bot


def main():
    st.title("DTC Q&A System")

    courses = [
        "data-engineering-zoomcamp",
        "machine-learning-zoomcamp",
        "mlops-zoomcamp"
    ]

    with st.form(key='rag_form'):
        zoomcamp_option = st.selectbox("Select a zoomcamp", courses)
        prompt = st.text_input("Enter your prompt")
        response_placeholder = st.empty()
        submit_button = st.form_submit_button(label='Submit')

    if submit_button:
        response_placeholder.markdown("Loading...")
        response = qa_bot(prompt, course=zoomcamp_option)
        response_placeholder.markdown(response)


if __name__ == "__main__":
    main()
================================================
FILE: docker-compose.yaml
================================================
version: '3.8'

services:
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama:/root/.ollama
    restart: always

  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.4.3
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
    ports:
      - "9200:9200"
      - "9300:9300"
    volumes:
      - es_data:/usr/share/elasticsearch/data
    deploy:
      resources:
        limits:
          memory: 2G
    restart: always

volumes:
  ollama:
  es_data:
================================================
FILE: notebooks/documents.json
================================================
[
{
"course": "data-engineering-zoomcamp",
"documents": [
{
"text": "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first \u201cOffice Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon\u2019t forget to register in DataTalks.Club's Slack and join the channel.",
"section": "General course-related questions",
"question": "Course - When will the course start?"
},
{
"text": "GitHub - DataTalksClub data-engineering-zoomcamp#prerequisites",
"section": "General course-related questions",
"question": "Course - What are the prerequisites for this course?"
},
{
"text": "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
"section": "General course-related questions",
"question": "Course - Can I still join the course after the start date?"
},
{
"text": "You don't need it. You're accepted. You can also just start learning and submitting homework without registering. It is not checked against any registered list. Registration is just to gauge interest before the start date.",
"section": "General course-related questions",
"question": "Course - I have registered for the Data Engineering Bootcamp. When can I expect to receive the confirmation email?"
},
{
"text": "You can start by installing and setting up all the dependencies and requirements:\nGoogle cloud account\nGoogle Cloud SDK\nPython 3 (installed with Anaconda)\nTerraform\nGit\nLook over the prerequisites and syllabus to see if you are comfortable with these subjects.",
"section": "General course-related questions",
"question": "Course - What can I do before the course starts?"
},
{
"text": "There are 3 Zoom Camps in a year, as of 2024. However, they are for separate courses:\nData-Engineering (Jan - Apr)\nMLOps (May - Aug)\nMachine Learning (Sep - Jan)\nThere's only one Data-Engineering Zoomcamp \u201clive\u201d cohort per year, for the certification. Same as for the other Zoomcamps.\nThey follow pretty much the same schedule for each cohort per zoomcamp. For Data-Engineering it is (generally) from Jan-Apr of the year. If you\u2019re not interested in the Certificate, you can take any zoom camps at any time, at your own pace, out of sync with any \u201clive\u201d cohort.",
"section": "General course-related questions",
"question": "Course - how many Zoomcamps in a year?"
},
{
"text": "Yes. For the 2024 edition we are using Mage AI instead of Prefect and re-recorded the terraform videos, For 2023, we used Prefect instead of Airflow..",
"section": "General course-related questions",
"question": "Course - Is the current cohort going to be different from the previous cohort?"
},
{
"text": "Yes, we will keep all the materials after the course finishes, so you can follow the course at your own pace after it finishes.\nYou can also continue looking at the homeworks and continue preparing for the next cohort. I guess you can also start working on your final capstone project.",
"section": "General course-related questions",
"question": "Course - Can I follow the course after it finishes?"
},
{
"text": "Yes, the slack channel remains open and you can ask questions there. But always search the channel first and, second, check the FAQ (this document); most likely all your questions are already answered here.\nYou can also tag the bot @ZoomcampQABot to help you conduct the search, but don\u2019t rely on its answers 100%, it is pretty good though.",
"section": "General course-related questions",
"question": "Course - Can I get support if I take the course in the self-paced mode?"
},
{
"text": "All the main videos are stored in the Main \u201cDATA ENGINEERING\u201d playlist (no year specified). The Github repository has also been updated to show each video with a thumbnail, that would bring you directly to the same playlist below.\nBelow is the MAIN PLAYLIST. And then you refer to the year specific playlist for additional videos for that year like for office hours videos etc. Also find this playlist pinned to the slack channel.\nhttps://youtube.com/playlist?list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&si=NspQhtZhZQs1B9F-",
"section": "General course-related questions",
"question": "Course - Which playlist on YouTube should I refer to?"
},
{
"text": "It depends on your background and previous experience with modules. It is expected to require about 5 - 15 hours per week. [source1] [source2]\nYou can also calculate it yourself using this data and then update this answer.",
"section": "General course-related questions",
"question": "Course - \u200b\u200bHow many hours per week am I expected to spend on this course?"
},
{
"text": "No, you can only get a certificate if you finish the course with a \u201clive\u201d cohort. We don't award certificates for the self-paced mode. The reason is you need to peer-review capstone(s) after submitting a project. You can only peer-review projects at the time the course is running.",
"section": "General course-related questions",
"question": "Certificate - Can I follow the course in a self-paced mode and get a certificate?"
},
{
"text": "The zoom link is only published to instructors/presenters/TAs.\nStudents participate via Youtube Live and submit questions to Slido (link would be pinned in the chat when Alexey goes Live). The video URL should be posted in the announcements channel on Telegram & Slack before it begins. Also, you will see it live on the DataTalksClub YouTube Channel.\nDon\u2019t post your questions in chat as it would be off-screen before the instructors/moderators have a chance to answer it if the room is very active.",
"section": "General course-related questions",
"question": "Office Hours - What is the video/zoom link to the stream for the \u201cOffice Hour\u201d or workshop sessions?"
},
{
"text": "Yes! Every \u201cOffice Hours\u201d will be recorded and available a few minutes after the live session is over; so you can view (or rewatch) whenever you want.",
"section": "General course-related questions",
"question": "Office Hours - I can\u2019t attend the \u201cOffice hours\u201d / workshop, will it be recorded?"
},
{
"text": "You can find the latest and up-to-date deadlines here: https://docs.google.com/spreadsheets/d/e/2PACX-1vQACMLuutV5rvXg5qICuJGL-yZqIV0FBD84CxPdC5eZHf8TfzB-CJT_3Mo7U7oGVTXmSihPgQxuuoku/pubhtml\nAlso, take note of Announcements from @Au-Tomator for any extensions or other news. Or, the form may also show the updated deadline, if Instructor(s) has updated it.",
"section": "General course-related questions",
"question": "Homework - What are homework and project deadlines?"
},
{
"text": "No, late submissions are not allowed. But if the form is still not closed and it\u2019s after the due date, you can still submit the homework. Confirm your submission by the date-timestamp on the Course page.\nOlder news:[source1] [source2]",
"section": "General course-related questions",
"question": "Homework - Are late submissions of homework allowed?"
},
{
"text": "Answer: In short, it\u2019s your repository on github, gitlab, bitbucket, etc\nIn long, your repository or any other location you have your code where a reasonable person would look at it and think yes, you went through the week and exercises.",
"section": "General course-related questions",
"question": "Homework - What is the homework URL in the homework link?"
},
{
"text": "After you submit your homework, it will be graded based on the number of questions in that homework. You can see how many points you have at the top of the homework page. Additionally, in the leaderboard you will find the sum of all points you\u2019ve earned: points for Homeworks, FAQs and Learning in Public. Homework points are self-explanatory; the others work as follows: if you submit something to the FAQ, you get one point, and for each learning in public link you get one point.\n(https://datatalks-club.slack.com/archives/C01FABYF2RG/p1706846846359379?thread_ts=1706825019.546229&cid=C01FABYF2RG)",
"section": "General course-related questions",
"question": "Homework and Leaderboard - what is the system for points in the course management platform?"
},
{
"text": "When you set up your account you are automatically assigned a random name such as \u201cLucid Elbakyan\u201d. To see what your display name is, go to the Homework submission link \u2192 https://courses.datatalks.club/de-zoomcamp-2024/homework/hw2 - Log in > Click on \u2018Data Engineering Zoom Camp 2024\u2019 > click on \u2018Edit Course Profile\u2019 - your display name is here, and you can also change it should you wish.",
"section": "General course-related questions",
"question": "Leaderboard - I am not on the leaderboard / how do I know which one I am on the leaderboard?"
},
{
"text": "Yes, for simplicity (of troubleshooting against the recorded videos) and stability. [source]\nBut Python 3.10 and 3.11 should work fine.",
"section": "General course-related questions",
"question": "Environment - Is Python 3.9 still the recommended version to use in 2024?"
},
{
"text": "You can set it up on your laptop or PC if you prefer to work locally.\nYou might face some challenges, especially for Windows users.\nIf you prefer to work on the local machine, you may start with the week 1 Introduction to Docker and follow through.\nHowever, if you prefer to set up a virtual machine, you may start with these first:\nUsing GitHub Codespaces\nSetting up the environment on a cloud VM\nI decided to work on a virtual machine because I have different laptops & PCs for my home & office, so I can work on this boot camp virtually anywhere.",
"section": "General course-related questions",
"question": "Environment - Should I use my local machine, GCP, or GitHub Codespaces for my environment?"
},
{
"text": "GitHub Codespaces offers you computing Linux resources with many pre-installed tools (Docker, Docker Compose, Python).\nYou can also open any GitHub repository in a GitHub Codespace.",
"section": "General course-related questions",
"question": "Environment - Is GitHub codespaces an alternative to using cli/git bash to ingest the data and create a docker file?"
},
{
"text": "It's up to you which platform and environment you use for the course.\nGithub codespaces or GCP VM are just possible options, but you can do the entire course from your laptop.",
"section": "General course-related questions",
"question": "Environment - Do we really have to use GitHub codespaces? I already have PostgreSQL & Docker installed."
},
{
"text": "Choose the approach that aligns the most with your idea for the end project\nOne of those should suffice. However, BigQuery, which is part of GCP, will be used, so learning that is probably a better option. Or you can set up a local environment for most of this course.",
"section": "General course-related questions",
"question": "Environment - Do I need both GitHub Codespaces and GCP?"
},
{
"text": "1. To open Run command window, you can either:\n(1-1) Use the shortcut keys: 'Windows + R', or\n(1-2) Right Click \"Start\", and click \"Run\" to open.\n2. Registry Values Located in Registry Editor, to open it: Type 'regedit' in the Run command window, and then press Enter.\n3. Now you can change the registry values \"Autorun\" in \"HKEY_CURRENT_USER\\Software\\Microsoft\\Command Processor\" from \"if exists\" to a blank.\nAlternatively, you can simplify the solution by deleting the fingerprint saved within the known_hosts file. In Windows, this file is placed at C:\\Users\\<your_user_name>\\.ssh\\known_host",
"section": "General course-related questions",
"question": "This happens when attempting to connect to a GCP VM using VSCode on a Windows machine. Changing registry value in registry editor"
},
{
"text": "For uniformity at least, but you\u2019re not restricted to GCP. You can use other cloud platforms like AWS if you\u2019re comfortable with them, since Azure, AWS and others offer equivalents of every service provided by GCP.\nBecause everyone has a google account, GCP has a free trial period and gives $300 in credits to new users. Also, we are working with BigQuery, which is a part of GCP.\nNote that to sign up for a free GCP account, you must have a valid credit card.",
"section": "General course-related questions",
"question": "Environment - Why are we using GCP and not other cloud providers?"
},
{
"text": "No, if you use GCP and take advantage of their free trial.",
"section": "General course-related questions",
"question": "Should I pay for cloud services?"
},
{
"text": "You can do most of the course without a cloud. Almost everything we use (excluding BigQuery) can be run locally. We won\u2019t be able to provide guidelines for some things, but most of the materials are runnable without GCP.\nFor everything in the course, there\u2019s a local alternative. You could even do the whole course locally.",
"section": "General course-related questions",
"question": "Environment - The GCP and other cloud providers are unavailable in some countries. Is it possible to provide a guide to installing a home lab?"
},
{
"text": "Yes, you can. Just remember to adapt all the information on the videos to AWS. Besides, the final capstone will be evaluated based on the task: Create a data pipeline! Develop a visualisation!\nThe problem would be when you need help. You\u2019d need to rely on fellow coursemates who also use AWS (or have experience using it before), which might be in smaller numbers than those learning the course with GCP.\nAlso see Is it possible to use x tool instead of the one tool you use?",
"section": "General course-related questions",
"question": "Environment - I want to use AWS. May I do that?"
},
{
"text": "We will probably have some calls during the Capstone period to clear some questions but it will be announced in advance if that happens.",
"section": "General course-related questions",
"question": "Besides the \u201cOffice Hour\u201d which are the live zoom calls?"
},
{
"text": "We will use the same data, as the project will essentially remain the same as last year\u2019s. The data is available here",
"section": "General course-related questions",
"question": "Are we still using the NYC Trip data for January 2021? Or are we using the 2022 data?"
},
{
"text": "No, but we moved the 2022 stuff here",
"section": "General course-related questions",
"question": "Is the 2022 repo deleted?"
},
{
"text": "Yes, you can use any tool you want for your project.",
"section": "General course-related questions",
"question": "Can I use Airflow instead for my final project?"
},
{
"text": "Yes, this applies if you want to use Airflow or Prefect instead of Mage, AWS or Snowflake instead of GCP products or Tableau instead of Metabase or Google data studio.\nThe course covers 2 alternative data stacks, one using GCP and one using local installation of everything. You can use one of them or use your tool of choice.\nWhat should you consider before using a different tool? We can\u2019t support you if you choose to use a different stack, and you would need to explain your choice of tools for the peer review of your capstone project.",
"section": "General course-related questions",
"question": "Is it possible to use tool \u201cX\u201d instead of the one tool you use in the course?"
},
{
"text": "Star the repo! Share it with friends if you find it useful \u2763\ufe0f\nCreate a PR if you see you can improve the text or the structure of the repository.",
"section": "General course-related questions",
"question": "How can we contribute to the course?"
},
{
"text": "Yes! Linux is ideal but technically it should not matter. Students last year used all 3 OSes successfully",
"section": "General course-related questions",
"question": "Environment - Is the course [Windows/mac/Linux/...] friendly?"
},
{
"text": "Have no idea how past cohorts got past this as I haven't read old slack messages, and no FAQ entries that I can find.\nLater modules (module-05 & RisingWave workshop) use shell scripts in *.sh files and most Windows users not using WSL would hit a wall and cannot continue, even in git bash or MINGW64. This is why WSL environment setup is recommended from the start.",
"section": "General course-related questions",
"question": "Environment - Roadblock for Windows users in modules with *.sh (shell scripts)."
},
{
"text": "Yes to both! check out this document: https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/awesome-data-engineering.md",
"section": "General course-related questions",
"question": "Any books or additional resources you recommend?"
},
{
"text": "You will have two attempts for a project. If the first project deadline is over and you\u2019re late or you submit the project and fail the first attempt, you have another chance to submit the project with the second attempt.",
"section": "General course-related questions",
"question": "Project - What is Project Attempt #1 and Project Attempt #2 exactly?"
},
{
"text": "The first step is to try to solve the issue on your own. Get used to solving problems and reading documentation. This will be a real life skill you need when employed. [ctrl+f] is your friend, use it! It is a universal shortcut and works in all apps/browsers.\nWhat does the error say? There will often be a description of the error or instructions on what is needed or even how to fix it. I have even seen a link to the solution. Does it reference a specific line of your code?\nRestart app or server/pc.\nGoogle it, use ChatGPT, Bing AI etc.\nIt is going to be rare that you are the first to have the problem, someone out there has posted the fly issue and likely the solution.\nSearch using: <technology> <problem statement>. Example: pgcli error column c.relhasoids does not exist.\nThere are often different solutions for the same problem due to variation in environments.\nCheck the tech\u2019s documentation. Use its search if available or use the browsers search function.\nTry uninstall (this may remove the bad actor) and reinstall of application or reimplementation of action. Remember to restart the server/pc for reinstalls.\nSometimes reinstalling fails to resolve the issue but works if you uninstall first.\nPost your question to Stackoverflow. Read the Stackoverflow guide on posting good questions.\nhttps://stackoverflow.com/help/how-to-ask\nThis will be your real life. Ask an expert in the future (in addition to coworkers).\nAsk in Slack\nBefore asking a question,\nCheck Pins (where the shortcut to the repo and this FAQ is located)\nUse the slack app\u2019s search function\nUse the bot @ZoomcampQABot to do the search for you\ncheck the FAQ (this document), use search [ctrl+f]\nWhen asking a question, include as much information as possible:\nWhat are you coding on? What OS?\nWhat command did you run, which video did you follow? Etc etc\nWhat error did you get? 
Does it have a line number to the \u201coffending\u201d code and have you checked it for typos?\nWhat have you tried that did not work? This answer is crucial as without it, helpers would ask you to do the suggestions in the error log first. Or just read this FAQ document.\nDO NOT use screenshots, especially don\u2019t take pictures from a phone.\nDO NOT tag instructors, it may discourage others from helping you. Copy and paste errors; if it\u2019s long, just post it in a reply to your thread.\nUse ``` for formatting your code.\nUse the same thread for the conversation (that means reply to your own thread).\nDO NOT create multiple posts to discuss the issue.\nYou may create a new post if the issue reemerges down the road. Describe what has changed in the environment.\nProvide additional information in the same thread of the steps you have taken for resolution.\nTake a break and come back later. You will be amazed at how often you figure out the solution after letting your brain rest. Get some fresh air, workout, play a video game, watch a tv show, whatever allows your brain to not think about it for a little while or even until the next day.\nRemember technology issues in real life sometimes take days or even weeks to resolve.\nIf somebody helped you with your problem and it's not in the FAQ, please add it there. It will help other students.",
"section": "General course-related questions",
"question": "How to troubleshoot issues"
},
{
"text": "Ask when the troubleshooting guide above does not help you resolve the issue and you need another pair of eyes to spot mistakes. When asking a question, include as much information as possible:\nWhat are you coding on? What OS?\nWhat command did you run, which video did you follow? Etc etc\nWhat error did you get? Does it have a line number to the \u201coffending\u201d code and have you checked it for typos?\nWhat have you tried that did not work? This answer is crucial as without it, helpers would ask you to do the suggestions in the error log first. Or just read this FAQ document.",
"section": "General course-related questions",
"question": "How to ask questions"
},
{
"text": "After you create a GitHub account, you should clone the course repo to your local machine using the process outlined in this video: Git for Everybody: How to Clone a Repository from GitHub\nHaving this local repository on your computer will make it easy for you to access the instructors\u2019 code and make pull requests (if you want to add your own notes or make changes to the course content).\nYou will probably also create your own repositories that host your notes, versions of your file, to do this. Here is a great tutorial that shows you how to do this: https://www.atlassian.com/git/tutorials/setting-up-a-repository\nRemember to ignore large database, .csv, and .gz files, and other files that should not be saved to a repository. Use .gitignore for this: https://www.atlassian.com/git/tutorials/saving-changes/gitignore NEVER store passwords or keys in a git repo (even if that repo is set to private).\nThis is also a great resource: https://dangitgit.com/",
"section": "General course-related questions",
"question": "How do I use Git / GitHub for this course?"
},
{
"text": "Error: Makefile:2: *** missing separator. Stop.\nSolution: Makefile recipe lines must be indented with a tab character, not spaces, so configure VS Code to insert tabs instead of spaces for this file. Follow this stack.",
"section": "General course-related questions",
"question": "VS Code: Tab using spaces"
},
{
"text": "If you\u2019re running Linux on Windows Subsystem for Linux (WSL) 2, you can open HTML files from the guest (Linux) with whatever Internet Browser you have installed on the host (Windows). Just install wslu and open the page with wslview <file>, for example:\nwslview index.html\nYou can customise which browser to use by setting the BROWSER environment variable first. For example:\nexport BROWSER='/mnt/c/Program Files/Firefox/firefox.exe'",
"section": "General course-related questions",
"question": "Opening an HTML file with a Windows browser from Linux running on WSL"
},
{
"text": "This tutorial shows you how to set up the Chrome Remote Desktop service on a Debian Linux virtual machine (VM) instance on Compute Engine. Chrome Remote Desktop allows you to remotely access applications with a graphical user interface.\nTaxi Data - Yellow Taxi Trip Records downloading error, Error no or XML error webpage\nWhen you try to download the 2021 data from the TLC website, you get an XML error webpage if you click on the link, and ERROR 403: Forbidden on the terminal.\nWe have a backup, so use it instead: https://github.com/DataTalksClub/nyc-tlc-data\nSo the link should be https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow/yellow_tripdata_2021-01.csv.gz\nNote: Make sure to unzip the \u201cgz\u201d file (no, the \u201cunzip\u201d command won\u2019t work for this.)\n\u201cgzip -d file.gz\u201d",
"section": "Module 1: Docker and Terraform",
"question": "Set up Chrome Remote Desktop for Linux on Compute Engine"
},
{
"text": "In this video, we store the data file as \u201coutput.csv\u201d. The data file won\u2019t store correctly if the file extension is csv.gz instead of csv. One alternative is to replace csv_name = \"output.csv\" with the file name given at the end of the URL. Notice that the URL for the yellow taxi data is: https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow/yellow_tripdata_2021-01.csv.gz where the last part (yellow_tripdata_2021-01.csv.gz) is the name of the file. We can parse this file name from the URL and use it as csv_name. That is, we can replace csv_name = \"output.csv\" with\ncsv_name = url.split(\"/\")[-1]\nThen when we pass csv_name to pd.read_csv, there won\u2019t be an issue even though the file name has the extension csv.gz instead of csv, since the pandas read_csv function can read csv.gz files directly.",
"section": "Module 1: Docker and Terraform",
"question": "Taxi Data - How to handle taxi data files, now that the files are available as *.csv.gz?"
},
{
"text": "Yellow Trips: https://www1.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf\nGreen Trips: https://www1.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_green.pdf",
"section": "Module 1: Docker and Terraform",
"question": "Taxi Data - Data Dictionary for NY Taxi data?"
},
{
"text": "You can unzip the downloaded csv.gz file on the command line. The result is a csv file which can be imported with pandas using the pd.read_csv() shown in the videos.\n`gunzip green_tripdata_2019-09.csv.gz`\nSOLUTION TO USING PARQUET FILES DIRECTLY IN PYTHON SCRIPT ingest_data.py\nIn the def main(params) add this line\nparquet_name = 'output.parquet'\nThen edit the code which downloads the files\nos.system(f\"wget {url} -O {parquet_name}\")\nConvert the downloaded .parquet file to csv and save it as csv_name to keep it consistent with the rest of the code\ndf = pd.read_parquet(parquet_name)\ndf.to_csv(csv_name, index=False)",
"section": "Module 1: Docker and Terraform",
"question": "Taxi Data - Unzip Parquet file"
},
{
"text": "If you get \u201cwget is not recognized as an internal or external command\u201d, you need to install it.\nOn Ubuntu, run:\n$ sudo apt-get install wget\nOn MacOS, the easiest way to install wget is to use Brew:\n$ brew install wget\nOn Windows, the easiest way to install wget is to use Chocolatey:\n$ choco install wget\nOr you can download a binary (https://gnuwin32.sourceforge.net/packages/wget.htm) and put it to any location in your PATH (e.g. C:/tools/)\nAlso, you can follow these steps to install Wget on MS Windows\n* Download the latest wget binary for windows from [eternallybored] (https://eternallybored.org/misc/wget/) (they are available as a zip with documentation, or just an exe)\n* If you downloaded the zip, extract all (if windows built in zip utility gives an error, use [7-zip] (https://7-zip.org/)).\n* Rename the file `wget64.exe` to `wget.exe` if necessary.\n* Move wget.exe to your `Git\\mingw64\\bin\\`.\nAlternatively, you can use a Python wget library, but instead of simply using \u201cwget\u201d you\u2019ll need to use\npython -m wget\nYou need to install it with pip first:\npip install wget\nAlternatively, you can just paste the file URL into your web browser and download the file normally that way. You\u2019ll want to move the resulting file into your working directory.\nIt is also worth looking at the Python library requests for downloading the gz file: https://pypi.org/project/requests",
"section": "Module 1: Docker and Terraform",
"question": "wget is not recognized as an internal or external command"
},
{
"text": "Firstly, make sure that you add \u201c!\u201d before wget if you\u2019re running your command in a Jupyter Notebook. Then, you can try one of these 2 things (from the CLI, drop the \u201c!\u201d):\nUsing the Python library wget you installed with pip, try python -m wget <url>\nWrite the usual command and add --no-check-certificate at the end. So it should be:\n!wget <website_url> --no-check-certificate",
"section": "Module 1: Docker and Terraform",
"question": "wget - ERROR: cannot verify <website> certificate (MacOS)"
},
{
"text": "For those who wish to use the backslash as an escape character in Git Bash for Windows (as Alexey normally does), type in the terminal: bash.escapeChar=\\ (no need to include in .bashrc)",
"section": "Module 1: Docker and Terraform",
"question": "Git Bash - Backslash as an escape character in Git Bash for Windows"
},
{
"text": "Instructions on how to store secrets that will be available in GitHub Codespaces.\nManaging your account-specific secrets for GitHub Codespaces - GitHub Docs",
"section": "Module 1: Docker and Terraform",
"question": "GitHub Codespaces - How to store secrets"
},
{
"text": "Make sure you're able to start the Docker daemon, and check the issue immediately down below.\nAlso, don\u2019t forget to update WSL. In PowerShell, the command is wsl --update",
"section": "Module 1: Docker and Terraform",
"question": "Docker - Cannot connect to Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?"
},
{
"text": "As the official Docker for Windows documentation says, the Docker engine can either use\nHyper-V or WSL2 as its backend. However, a few constraints might apply\nWindows 10 Pro / 11 Pro Users: \nIn order to use Hyper-V as its back-end, you MUST have it enabled first, which you can do by following the tutorial: Enable Hyper-V Option on Windows 10 / 11\nWindows 10 Home / 11 Home Users: \nOn the other hand, users of the 'Home' version do NOT have the Hyper-V option available, which means you can only get Docker up and running using WSL2 (Windows Subsystem for Linux).\nYou can find the detailed instructions to do so here: https://pureinfotech.com/install-wsl-windows-11/\nIn case you run into another issue while trying to install WSL2 (WslRegisterDistribution failed with error: 0x800701bc), make sure you update the WSL2 Linux Kernel, following the guidelines here: \n\nhttps://github.com/microsoft/WSL/issues/5393",
"section": "Module 1: Docker and Terraform",
"question": "Docker - Error during connect: In the default daemon configuration on Windows, the docker client must be run with elevated privileges to connect.: Post: \"http://%2F%2F.%2Fpipe%2Fdocker_engine/v1.24/containers/create\" : open //./pipe/docker_engine: The system cannot find the file specified"
},
{
"text": "Whenever a `docker pull is performed (either manually or by `docker-compose up`), it attempts to fetch the given image name (pgadmin4, for the example above) from a repository (dbpage).\nIF the repository is public, the fetch and download happens without any issue whatsoever.\nFor instance:\ndocker pull postgres:13\ndocker pull dpage/pgadmin4\nBE ADVISED:\n\nThe Docker Images we'll be using throughout the Data Engineering Zoomcamp are all public (except when or if explicitly said otherwise by the instructors or co-instructors).\n\nMeaning: you are NOT required to perform a docker login to fetch them. \n\nSo if you get the message above saying \"docker login': denied: requested access to the resource is denied. That is most likely due to a typo in your image name:\n\nFor instance:\n$ docker pull dbpage/pgadmin4\nWill throw that exception telling you \"repository does not exist or may require 'docker login'\nError response from daemon: pull access denied for dbpage/pgadmin4, repository does not exist or \nmay require 'docker login': denied: requested access to the resource is denied\nBut that actually happened because the actual image is dpage/pgadmin4 and NOT dbpage/pgadmin4\nHow to fix it:\n$ docker pull dpage/pgadmin4\nEXTRA NOTES:\nIn the real world, occasionally, when you're working for a company or closed organisation, the Docker image you're trying to fetch might be under a private repo that your DockerHub Username was granted access to.\nFor which cases, you must first execute:\n$ docker login\nFill in the details of your username and password.\nAnd only then perform the `docker pull` against that private repository\nWhy am I encountering a \"permission denied\" error when creating a PostgreSQL Docker container for the New York Taxi Database with a mounted volume on macOS M1?\nIssue Description:\nWhen attempting to run a Docker command similar to the one below:\ndocker run -it \\\n-e POSTGRES_USER=\"root\" \\\n-e POSTGRES_PASSWORD=\"root\" \\\n-e 
POSTGRES_DB=\"ny_taxi\" \\\n-v $(pwd)/ny_taxi_postgres_data:/var/lib/postgresql/data \\\n-p 5432:5432 \\\npostgres:13\nYou encounter the error message:\ndocker: Error response from daemon: error while creating mount source path '/path/to/ny_taxi_postgres_data': chown /path/to/ny_taxi_postgres_data: permission denied.\nSolution:\n1- Stop Rancher Desktop:\nIf you are using Rancher Desktop and face this issue, stop Rancher Desktop to resolve compatibility problems.\n2- Install Docker Desktop:\nInstall Docker Desktop, ensuring that it is properly configured and has the required permissions.\n3- Retry Docker Command:\nRun the Docker command again after switching to Docker Desktop. This step resolves compatibility issues on some systems.\nNote: The issue occurred because Rancher Desktop was in use. Switching to Docker Desktop resolves compatibility problems and allows for the successful creation of PostgreSQL containers with mounted volumes for the New York Taxi Database on macOS M1.",
"section": "Module 1: Docker and Terraform",
"question": "Docker - docker pull dbpage"
},
{
"text": "When I ran the command to create Postgres in a Docker container, it created a folder on my local machine to mount as a volume inside the container. The folder is read- and write-protected and owned by user 999, so I could not delete it by simply dragging it to the trash. My Obsidian could not start due to an access error, so I had to move this folder and delete the old folder with this command:\nsudo rm -r -f docker_test/\n- where `rm` - remove, `-r` - recursively, `-f` - force, `docker_test/` - folder.",
"section": "Module 1: Docker and Terraform",
"question": "Docker - can\u2019t delete local folder that mounted to docker volume"
},
{
"text": "First off, make sure you're running the latest version of Docker for Windows, which you can download from here. Sometimes using the menu to \"Upgrade\" doesn't work (which is another clear indicator for you to uninstall, and reinstall with the latest version)\nIf docker is stuck on starting, first try to switch containers by right clicking the docker symbol from the running programs and switch the containers from windows to linux or vice versa\n[Windows 10 / 11 Pro Edition] The Pro Edition of Windows can run Docker either by using Hyper-V or WSL2 as its backend (Docker Engine)\nIn order to use Hyper-V as its back-end, you MUST have it enabled first, which you can do by following the tutorial: Enable Hyper-V Option on Windows 10 / 11\nIf you opt-in for WSL2, you can follow the same steps as detailed in the tutorial here\n[Windows 10 / 11 Home Edition] If you're running a Home Edition, you can still make it work with WSL2 (Windows Subsystem for Linux) by following the tutorial here\nIf even after making sure your WSL2 (or Hyper-V) is set up accordingly, Docker remains stuck, you can try the option to Reset to Factory Defaults or do a fresh install.",
"section": "Module 1: Docker and Terraform",
"question": "Docker - Docker won't start or is stuck in settings (Windows 10 / 11)"
},
{
"text": "It is recommended by the Docker docs to store all code in your default Linux distro to get the best out of file system performance (since Docker runs on WSL2 backend by default for Windows 10 Home / Windows 11 Home users).\nMore info in the Docker Docs on Best Practises",
"section": "Module 1: Docker and Terraform",
"question": "Should I run docker commands from the windows file system or a file system of a Linux distribution in WSL?"
},
{
"text": "You may have this error:\n$ docker run -it ubuntu bash\nthe input device is not a TTY. If you are using mintty, try prefixing the command with 'winpty'\nerror:\nSolution:\nUse winpty before docker command (source)\n$ winpty docker run -it ubuntu bash\nYou also can make an alias:\necho \"alias docker='winpty docker'\" >> ~/.bashrc\nOR\necho \"alias docker='winpty docker'\" >> ~/.bash_profile",
"section": "Module 1: Docker and Terraform",
"question": "Docker - The input device is not a TTY (Docker run for Windows)"
},
{
"text": "You may have this error:\nRetrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.u\nrllib3.connection.HTTPSConnection object at 0x7efe331cf790>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution')':\n/simple/pandas/\nPossible solution might be:\n$ winpty docker run -it --dns=8.8.8.8 --entrypoint=bash python:3.9",
"section": "Module 1: Docker and Terraform",
"question": "Docker - Cannot pip install on Docker container (Windows)"
},
{
"text": "Even after properly running the docker script the folder is empty in the vs code then try this (For Windows)\nwinpty docker run -it \\\n-e POSTGRES_USER=\"root\" \\\n-e POSTGRES_PASSWORD=\"root\" \\\n-e POSTGRES_DB=\"ny_taxi\" \\\n-v \"C:\\Users\\abhin\\dataengg\\DE_Project_git_connected\\DE_OLD\\week1_set_up\\docker_sql/ny_taxi_postgres_data:/var/lib/postgresql/data\" \\\n-p 5432:5432 \\\npostgres:13\nHere quoting the absolute path in the -v parameter is solving the issue and all the files are visible in the Vs-code ny_taxi folder as shown in the video",
"section": "Module 1: Docker and Terraform",
"question": "Docker - ny_taxi_postgres_data is empty"
},
{
"text": "Check this article for details - Setting up docker in macOS\nFrom researching it seems this method might be out of date, it seems that since docker changed their licensing model, the above is a bit hit and miss. What worked for me was to just go to the docker website and download their dmg. Haven\u2019t had an issue with that method.",
"section": "Module 1: Docker and Terraform",
"question": "dasDocker - Setting up Docker on Mac"
},
{
"text": "$ docker run -it\\\n-e POSTGRES_USER=\"root\" \\\n-e POSTGRES_PASSWORD=\"admin\" \\\n-e POSTGRES_DB=\"ny_taxi\" \\\n-v \"/mnt/path/to/ny_taxi_postgres_data\":\"/var/lib/postgresql/data\" \\\n-p 5432:5432 \\\npostgres:13\nCCW\nThe files belonging to this database system will be owned by user \"postgres\".\nThis use The database cluster will be initialized with locale \"en_US.utf8\".\nThe default databerrorase encoding has accordingly been set to \"UTF8\".\nxt search configuration will be set to \"english\".\nData page checksums are disabled.\nfixing permissions on existing directory /var/lib/postgresql/data ... initdb: f\nerror: could not change permissions of directory \"/var/lib/postgresql/data\": Operation not permitted volume\nOne way to solve this issue is to create a local docker volume and map it to postgres data directory /var/lib/postgresql/data\nThe input dtc_postgres_volume_local must match in both commands below\n$ docker volume create --name dtc_postgres_volume_local -d local\n$ docker run -it\\\n-e POSTGRES_USER=\"root\" \\\n-e POSTGRES_PASSWORD=\"root\" \\\n-e POSTGRES_DB=\"ny_taxi\" \\\n-v dtc_postgres_volume_local:/var/lib/postgresql/data \\\n-p 5432:5432\\\npostgres:13\nTo verify the above command works in (WSL2 Ubuntu 22.04, verified 2024-Jan), go to the Docker Desktop app and look under Volumes - dtc_postgres_volume_local would be listed there. The folder ny_taxi_postgres_data would however be empty, since we used an alternative config.\nAn alternate error could be:\ninitdb: error: directory \"/var/lib/postgresql/data\" exists but is not empty\nIf you want to create a new database system, either remove or empthe directory \"/var/lib/postgresql/data\" or run initdb\nwitls",
"section": "Module 1: Docker and Terraform",
"question": "1Docker - Could not change permissions of directory \"/var/lib/postgresql/data\": Operation not permitted"
},
{
"text": "Mapping volumes on Windows could be tricky. The way it was done in the course video doesn\u2019t work for everyone.\nFirst, if yo\nmove your data to some folder without spaces. E.g. if your code is in \u201cC:/Users/Alexey Grigorev/git/\u2026\u201d, move it to \u201cC:/git/\u2026\u201d\nTry replacing the \u201c-v\u201d part with one of the following options:\n-v /c:/some/path/ny_taxi_postgres_data:/var/lib/postgresql/data\n-v //c:/some/path/ny_taxi_postgres_data:/var/lib/postgresql/data\n-v /c/some/path/ny_taxi_postgres_data:/var/lib/postgresql/data\n-v //c/some/path/ny_taxi_postgres_data:/var/lib/postgresql/data\n--volume //driveletter/path/ny_taxi_postgres_data/:/var/lib/postgresql/data\nwinpty docker run -it\n-e POSTGRES_USER=\"root\"\n-e POSTGRES_PASSWORD=\"root\"\n-e POSTGRES_DB=\"ny_taxi\"\n-v /c:/some/path/ny_taxi_postgres_data:/var/lib/postgresql/data\n-p 5432:5432\npostgres:1\nTry adding winpty before the whole command\n3\nwin\nTry adding quotes:\n-v \"/c:/some/path/ny_taxi_postgres_data:/var/lib/postgresql/data\"\n-v \"//c:/some/path/ny_taxi_postgres_data:/var/lib/postgresql/data\"\n-v \u201c/c/some/path/ny_taxi_postgres_data:/var/lib/postgresql/data\"\n-v \"//c/some/path/ny_taxi_postgres_data:/var/lib/postgresql/data\"\n-v \"c:\\some\\path\\ny_taxi_postgres_data\":/var/lib/postgresql/data\nNote: (Window) if it automatically creates a folder called \u201cny_taxi_postgres_data;C\u201d suggests you have problems with volume mapping, try deleting both folders and replacing \u201c-v\u201d part with other options. For me \u201c//c/\u201d works instead of \u201c/c/\u201d. 
And it will work by automatically creating a correct folder called \u201cny_taxi_postgres_data\u201d.\nA possible solution to this error would be to use /\u201d$(pwd)\u201d/ny_taxi_postgres_data:/var/lib/postgresql/data (with quotes\u2019 position varying as in the above list).\nYes for windows use the command it works perfectly fine\n-v /\u201d$(pwd)\u201d/ny_taxi_postgres_data:/var/lib/postgresql/data\nImportant: note how the quotes are placed.\nIf none of these options work, you can use a volume name instead of the path:\n-v ny_taxi_postgres_data:/var/lib/postgresql/data\nFor Mac: You can wrap $(pwd) with quotes like the highlighted.\ndocker run -it \\\n-e POSTGRES_USER=\"root\" \\\n-e POSTGRES_PASSWORD=\"root\" \\\n-e POSTGRES_DB=\"ny_taxi\" \\\n-v \"$(pwd)\"/ny_taxi_postgres_data:/var/lib/postgresql/data \\\n-p 5432:5432 \\\nPostgres:13\ndocker run -it \\\n-e POSTGRES_USER=\"root\" \\\n-e POSTGRES_PASSWORD=\"root\" \\\n-e POSTGRES_DB=\"ny_taxi\" \\\n-v \"$(pwd)\"/ny_taxi_postgres_data:/var/lib/postgresql/data \\\n-p 5432:5432 \\\npostgres:13\nSource:https://stackoverflow.com/questions/48522615/docker-error-invalid-reference-format-repository-name-must-be-lowercase",
"section": "Module 1: Docker and Terraform",
"question": "Docker - invalid reference format: repository name must be lowercase (Mounting volumes with Docker on Windows)"
},
{
"text": "Change the mounting path. Replace it with one of following:\n-v /e/zoomcamp/...:/var/lib/postgresql/data\n-v /c:/.../ny_taxi_postgres_data:/var/lib/postgresql/data\\ (leading slash in front of c:)",
"section": "Module 1: Docker and Terraform",
"question": "Docker - Error response from daemon: invalid mode: \\Program Files\\Git\\var\\lib\\postgresql\\data."
},
{
"text": "When you run this command second time\ndocker run -it \\\n-e POSTGRES_USER=\"root\" \\\n-e POSTGRES_PASSWORD=\"root\" \\\n-e POSTGRES_DB=\"ny_taxi\" \\\n-v <your path>:/var/lib/postgresql/data \\\n-p 5432:5432 \\\npostgres:13\nThe error message above could happen. That means you should not mount on the second run. This command helped me:\nWhen you run this command second time\ndocker run -it \\\n-e POSTGRES_USER=\"root\" \\\n-e POSTGRES_PASSWORD=\"root\" \\\n-e POSTGRES_DB=\"ny_taxi\" \\\n-p 5432:5432 \\\npostgres:13",
"section": "Module 1: Docker and Terraform",
"question": "Docker - Error response from daemon: error while creating buildmount source path '/run/desktop/mnt/host/c/<your path>': mkdir /run/desktop/mnt/host/c: file exists"
},
{
"text": "This error appeared when running the command: docker build -t taxi_ingest:v001 .\nWhen feeding the database with the data the user id of the directory ny_taxi_postgres_data was changed to 999, so my user couldn\u2019t access it when running the above command. Even though this is not the problem here it helped to raise the error due to the permission issue.\nSince at this point we only need the files Dockerfile and ingest_data.py, to fix this error one can run the docker build command on a different directory (having only these two files).\nA more complete explanation can be found here: https://stackoverflow.com/questions/41286028/docker-build-error-checking-context-cant-stat-c-users-username-appdata\nYou can fix the problem by changing the permission of the directory on ubuntu with following command:\nsudo chown -R $USER dir_path\nOn windows follow the link: https://thegeekpage.com/take-ownership-of-a-file-folder-through-command-prompt-in-windows-10/ \n\n\t\t\t\t\t\t\t\t\t\t\tAdded by\n\t\t\t\t\t\t\t\t\t\t\tKenan Arslanbay",
"section": "Module 1: Docker and Terraform",
"question": "Docker - build error: error checking context: 'can't stat '/home/user/repos/data-engineering/week_1_basics_n_setup/2_docker_sql/ny_taxi_postgres_data''."
},
{
"text": "You might have installed docker via snap. Run \u201csudo snap status docker\u201d to verify.\nIf you have \u201cerror: unknown command \"status\", see 'snap help'.\u201d as a response than deinstall docker and install via the official website\nBind for 0.0.0.0:5432 failed: port is a",
"section": "Module 1: Docker and Terraform",
"question": "Docker - ERRO[0000] error waiting for container: context canceled"
},
{
"text": "Found the issue in the PopOS linux. It happened because our user didn\u2019t have authorization rights to the host folder ( which also caused folder seems empty, but it didn\u2019t!).\n\u2705Solution:\nJust add permission for everyone to the corresponding folder\nsudo chmod -R 777 <path_to_folder>\nExample:\nsudo chmod -R 777 ny_taxi_postgres_data/",
"section": "Module 1: Docker and Terraform",
"question": "Docker - build error checking context: can\u2019t stat \u2018/home/fhrzn/Projects/\u2026./ny_taxi_postgres_data\u2019"
},
{
"text": "This happens on Ubuntu/Linux systems when trying to run the command to build the Docker container again.\n$ docker build -t taxi_ingest:v001 .\nA folder is created to host the Docker files. When the build command is executed again to rebuild the pipeline or create a new one the error is raised as there are no permissions on this new folder. Grant permissions by running this comtionmand;\n$ sudo chmod -R 755 ny_taxi_postgres_data\nOr use 777 if you still see problems. 755 grants write access to only the owner.",
"section": "Module 1: Docker and Terraform",
"question": "Docker - failed to solve with frontend dockerfile.v0: failed to read dockerfile: error from sender: open ny_taxi_postgres_data: permission denied."
},
{
"text": "Get the network name via: $ docker network ls.",
"section": "Module 1: Docker and Terraform",
"question": "Docker - Docker network name"
},
{
"text": "Sometimes, when you try to restart a docker image configured with a network name, the above message appears. In this case, use the following command with the appropriate container name:\n>>> If the container is running state, use docker stop <container_name>\n>>> then, docker rm pg-database\nOr use docker start instead of docker run in order to restart the docker image without removing it.",
"section": "Module 1: Docker and Terraform",
"question": "Docker - Error response from daemon: Conflict. The container name \"pg-database\" is already in use by container \u201cxxx\u201d. You have to remove (or rename) that container to be able to reuse that name."
},
{
"text": "Typical error: sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) could not translate host name \"pgdatabase\" to address: Name or service not known\nWhen running docker-compose up -d see which network is created and use this for the ingestions script instead of pg-network and see the name of the database to use instead of pgdatabase\nE.g.:\npg-network becomes 2docker_default\nPgdatabase becomes 2docker-pgdatabase-1",
"section": "Module 1: Docker and Terraform",
"question": "Docker - ingestion when using docker-compose could not translate host name"
},
{
"text": "terraformRun this command before starting your VM:\nOn Intel CPU:\nmodprobe -r kvm_intel\nmodprobe kvm_intel nested=1\nOn AMD CPU:\nmodprobe -r kvm_amd\nmodprobe kvm_amd nested=1",
"section": "Module 1: Docker and Terraform",
"question": "Docker - Cannot install docker on MacOS/Windows 11 VM running on top of Linux (due to Nested virtualization)."
},
{
"text": "It\u2019s very easy to manage your docker container, images, network and compose projects from VS Code.\nJust install the official extension and launch it from the left side icon.\nIt will work even if your Docker runs on WSL2, as VS Code can easily connect with your Linux.\nDocker - How to stop a container?\nUse the following command:\n$ docker stop <container_id>",
"section": "Module 1: Docker and Terraform",
"question": "Docker - Connecting from VS Code"
},
{
"text": "When you see this in logs, your container with postgres is not accepting any requests, so if you attempt to connect, you'll get this error:\nconnection failed: server closed the connection unexpectedly\nThis probably means the server terminated abnormally before or while processing the request.\nIn this case, you need to delete the directory with data (the one you map to the container with the -v flag) and restart the container.",
"section": "Module 1: Docker and Terraform",
"question": "Docker - PostgreSQL Database directory appears to contain a database. Database system is shut down"
},
{
"text": "On few versions of Ubuntu, snap command can be used to install Docker.\nsudo snap install docker",
"section": "Module 1: Docker and Terraform",
"question": "Docker not installable on Ubuntu"
},
{
"text": "error: could not change permissions of directory \"/var/lib/postgresql/data\": Operation not permitted volume\nif you have used the prev answer (just before this) and have created a local docker volume, then you need to tell the compose file about the named volume:\nvolumes:\ndtc_postgres_volume_local: # Define the named volume here\n# services mentioned in the compose file auto become part of the same network!\nservices:\nyour remaining code here . . .\nnow use docker volume inspect dtc_postgres_volume_local to see the location by checking the value of Mountpoint\nIn my case, after i ran docker compose up the mounting dir created was named \u2018docker_sql_dtc_postgres_volume_local\u2019 whereas it should have used the already existing \u2018dtc_postgres_volume_local\u2019\nAll i did to fix this is that I renamed the existing \u2018dtc_postgres_volume_local\u2019 to \u2018docker_sql_dtc_postgres_volume_local\u2019 and removed the newly created one (just be careful when doing this)\nrun docker compose up again and check if the table is there or not!",
"section": "Module 1: Docker and Terraform",
"question": "Docker-Compose - mounting error"
},
{
"text": "Couldn\u2019t translate host name to address\nMake sure postgres database is running.\n\n\u200b\u200bUse the command to start containers in detached mode: docker-compose up -d\n(data-engineering-zoomcamp) hw % docker compose up -d\n[+] Running 2/2\n\u283f Container pg-admin Started 0.6s\n\u283f Container pg-database Started\nTo view the containers use: docker ps.\n(data-engineering-zoomcamp) hw % docker ps\nCONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES\nfaf05090972e postgres:13 \"docker-entrypoint.s\u2026\" 39 seconds ago Up 37 seconds 0.0.0.0:5432->5432/tcp pg-database\n6344dcecd58f dpage/pgadmin4 \"/entrypoint.sh\" 39 seconds ago Up 37 seconds 443/tcp, 0.0.0.0:8080->80/tcp pg-admin\nhw\nTo view logs for a container: docker logs <containerid>\n(data-engineering-zoomcamp) hw % docker logs faf05090972e\nPostgreSQL Database directory appears to contain a database; Skipping initialization\n2022-01-25 05:58:45.948 UTC [1] LOG: starting PostgreSQL 13.5 (Debian 13.5-1.pgdg110+1) on aarch64-unknown-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit\n2022-01-25 05:58:45.948 UTC [1] LOG: listening on IPv4 address \"0.0.0.0\", port 5432\n2022-01-25 05:58:45.948 UTC [1] LOG: listening on IPv6 address \"::\", port 5432\n2022-01-25 05:58:45.954 UTC [1] LOG: listening on Unix socket \"/var/run/postgresql/.s.PGSQL.5432\"\n2022-01-25 05:58:45.984 UTC [28] LOG: database system was interrupted; last known up at 2022-01-24 17:48:35 UTC\n2022-01-25 05:58:48.581 UTC [28] LOG: database system was not properly shut down; automatic recovery in\nprogress\n2022-01-25 05:58:48.602 UTC [28] LOG: redo starts at 0/872A5910\n2022-01-25 05:59:33.726 UTC [28] LOG: invalid record length at 0/98A3C160: wanted 24, got 0\n2022-01-25 05:59:33.726 UTC [28\n] LOG: redo done at 0/98A3C128\n2022-01-25 05:59:48.051 UTC [1] LOG: database system is ready to accept connections\nIf docker ps doesn\u2019t show pgdatabase running, run: docker ps -a\nThis should show all 
containers, either running or stopped.\nGet the container id for pgdatabase-1, and run",
"section": "Module 1: Docker and Terraform",
"question": "Docker-Compose - Error translating host name to address"
},
{
"text": "After executing `docker-compose up` - if you lose database data and are unable to successfully execute your Ingestion script (to re-populate your database) but receive the following error:\nsqlalchemy.exc.OperationalError: (psycopg2.OperationalError) could not translate host name /data_pgadmin:/var/lib/pgadmin\"pg-database\" to address: Name or service not known\nDocker compose is creating its own default network since it is no longer specified in a docker execution command or file. Docker Compose will emit to logs the new network name. See the logs after executing `docker compose up` to find the network name and change the network name argument in your Ingestion script.\nIf problems persist with pgcli, we can use HeidiSQL,usql\nKrishna Anand",
"section": "Module 1: Docker and Terraform",
"question": "Docker-Compose - Data retention (could not translate host name \"pg-database\" to address: Name or service not known)"
},
{
"text": "It returns --> Error response from daemon: network 66ae65944d643fdebbc89bd0329f1409dec2c9e12248052f5f4c4be7d1bdc6a3 not found\nTry:\ndocker ps -a to see all the stopped & running containers\nd to nuke all the containers\nTry: docker-compose up -d again ports\nOn localhost:8080 server \u2192 Unable to connect to server: could not translate host name 'pg-database' to address: Name does not resolve\nTry: new host name, best without \u201c - \u201d e.g. pgdatabase\nAnd on docker-compose.yml, should specify docker network & specify the same network in both containers\nservices:\npgdatabase:\nimage: postgres:13\nenvironment:\n- POSTGRES_USER=root\n- POSTGRES_PASSWORD=root\n- POSTGRES_DB=ny_taxi\nvolumes:\n- \"./ny_taxi_postgres_data:/var/lib/postgresql/data:rw\"\nports:\n- \"5431:5432\"\nnetworks:\n- pg-network\npgadmin:\nimage: dpage/pgadmin4\nenvironment:\n- PGADMIN_DEFAULT_EMAIL=admin@admin.com\n- PGADMIN_DEFAULT_PASSWORD=root\nports:\n- \"8080:80\"\nnetworks:\n- pg-network\nnetworks:\npg-network:\nname: pg-network",
"section": "Module 1: Docker and Terraform",
"question": "Docker-Compose - Hostname does not resolve"
},
{
"text": "So one common issue is when you run docker-compose on GCP, postgres won\u2019t persist it\u2019s data to mentioned path for example:\nservices:\n\u2026\n\u2026\npgadmin:\n\u2026\n\u2026\nVolumes:\n\u201c./pgadmin\u201d:/var/lib/pgadmin:wr\u201d\nMight not work so in this use you can use Docker Volume to make it persist, by simply changing\nservices:\n\u2026\n\u2026.\npgadmin:\n\u2026\n\u2026\nVolumes:\npgadmin:/var/lib/pgadmin\nvolumes:\nPgadmin:",
"section": "Module 1: Docker and Terraform",
"question": "Docker-Compose - Persist PGAdmin docker contents on GCP"
},
{
"text": "The docker will keep on crashing continuously\nNot working after restart\ndocker engine stopped\nAnd failed to fetch extensions pop ups will on screen non-stop\nSolution :\nTry checking if latest version of docker is installed / Try updating the docker\nIf Problem still persist then final solution is to reinstall docker\n(Just have to fetch images again else no issues)",
"section": "Module 1: Docker and Terraform",
"question": "Docker engine stopped_failed to fetch extensions"
},
{
"text": "As per the lessons,\nPersisting pgAdmin configuration (i.e. server name) is done by adding a \u201cvolumes\u201d section:\nservices:\npgdatabase:\n[...]\npgadmin:\nimage: dpage/pgadmin4\nenvironment:\n- PGADMIN_DEFAULT_EMAIL=admin@admin.com\n- PGADMIN_DEFAULT_PASSWORD=root\nvolumes:\n- \"./pgAdmin_data:/var/lib/pgadmin/sessions:rw\"\nports:\n- \"8080:80\"\nIn the example above, \u201dpgAdmin_data\u201d is a folder on the host machine, and \u201c/var/lib/pgadmin/sessions\u201d is the session settings folder in the pgAdmin container.\nBefore running docker-compose up on the YAML file, we also need to give the pgAdmin container access to write to the \u201cpgAdmin_data\u201d folder. The container runs with a username called \u201c5050\u201d and user group \u201c5050\u201d. The bash command to give access over the mounted volume is:\nsudo chown -R 5050:5050 pgAdmin_data",
"section": "Module 1: Docker and Terraform",
"question": "Docker-Compose - Persist PGAdmin configuration"
},
{
"text": "This happens if you did not create the docker group and added your user. Follow these steps from the link:\nguides/docker-without-sudo.md at main \u00b7 sindresorhus/guides \u00b7 GitHub\nAnd then press ctrl+D to log-out and log-in again. pgAdmin: Maintain state so that it remembers your previous connection\nIf you are tired of having to setup your database connection each time that you fire up the containers, all you have to do is create a volume for pgAdmin:\nIn your docker-compose.yaml file, enter the following into your pgAdmin declaration:\nvolumes:\n- type: volume\nsource: pgadmin_data\ntarget: /var/lib/pgadmin\nAlso add the following to the end of the file:ls\nvolumes:\nPgadmin_data:",
"section": "Module 1: Docker and Terraform",
"question": "Docker-Compose - dial unix /var/run/docker.sock: connect: permission denied"
},
{
"text": "This is happen to me after following 1.4.1 video where we are installing docker compose in our Google Cloud VM. In my case, the docker-compose file downloaded from github named docker-compose-linux-x86_64 while it is more convenient to use docker-compose command instead. So just change the docker-compose-linux-x86_64 into docker-compose.",
"section": "Module 1: Docker and Terraform",
"question": "Docker-Compose - docker-compose still not available after changing .bashrc"
},
{
"text": "Installing pass via \u2018sudo apt install pass\u2019 helped to solve the issue. More about this can be found here: https://github.com/moby/buildkit/issues/1078",
"section": "Module 1: Docker and Terraform",
"question": "Docker-Compose - Error getting credentials after running docker-compose up -d"
},
{
"text": "For everyone who's having problem with Docker compose, getting the data in postgres and similar issues, please take care of the following:\ncreate a new volume on docker (either using the command line or docker desktop app)\nmake the following changes to your docker-compose.yml file (see attachment)\nset low_memory=false when importing the csv file (df = pd.read_csv('yellow_tripdata_2021-01.csv', nrows=1000, low_memory=False))\nuse the below function (in the upload-data.ipynb) for better tracking of your ingestion process (see attachment)\nOrder of execution:\n(1) open terminal in 2_docker_sql folder and run docker compose up\n(2) ensure no other containers are running except the one you just executed (pgadmin and pgdatabase)\n(3) open jupyter notebook and begin the data ingestion\n(4) open pgadmin and set up a server (make sure you use the same configurations as your docker-compose.yml file like the same name (pgdatabase), port, databasename (ny_taxi) etc.",
"section": "Module 1: Docker and Terraform",
"question": "Docker-Compose - Errors pertaining to docker-compose.yml and pgadmin setup"
},
{
"text": "Locate config.json file for docker (check your home directory; Users/username/.docker).\nModify credsStore to credStore\nSave and re-run",
"section": "Module 1: Docker and Terraform",
"question": "Docker Compose up -d error getting credentials - err: exec: \"docker-credential-desktop\": executable file not found in %PATH%, out: ``"
},
{
"text": "To figure out which docker-compose you need to download from https://github.com/docker/compose/releases you can check your system with these commands:\nuname -s -> return Linux most likely\nuname -m -> return \"flavor\"\nOr try this command -\nsudo curl -L \"https://github.com/docker/compose/releases/download/1.29.2/docker-compose-$(uname -s)-$(uname -m)\" -o /usr/local/bin/docker-compose",
"section": "Module 1: Docker and Terraform",
"question": "Docker-Compose - Which docker-compose binary to use for WSL?"
},
{
"text": "If you wrote the docker-compose.yaml file exactly like the video, you might run into an error like this:dev\nservice \"pgdatabase\" refers to undefined volume dtc_postgres_volume_local: invalid compose project\nIn order to make it work, you need to include the volume in your docker-compose file. Just add the following:\nvolumes:\ndtc_postgres_volume_local:\n(Make sure volumes are at the same level as services.)",
"section": "Module 1: Docker and Terraform",
"question": "Docker-Compose - Error undefined volume in Windows/WSL"
},
{
"text": "Error: initdb: error: could not change permissions of directory\nIssue: WSL and Windows do not manage permissions in the same way causing conflict if using the Windows file system rather than the WSL file system.\nSolution: Use Docker volumes.\nWhy: Volume is used for storage of persistent data and not for use of transferring files. A local volume is unnecessary.\nBenefit: This resolves permission issues and allows for better management of volumes.\nNOTE: the \u2018user:\u2019 is not necessary if using docker volumes, but is if using local drive.\n</> docker-compose.yaml\nservices:\npostgres:\nimage: postgres:15-alpine\ncontainer_name: postgres\nuser: \"0:0\"\nenvironment:\n- POSTGRES_USER=postgres\n- POSTGRES_PASSWORD=postgres\n- POSTGRES_DB=ny_taxi\nvolumes:\n- \"pg-data:/var/lib/postgresql/data\"\nports:\n- \"5432:5432\"\nnetworks:\n- pg-network\npgadmin:\nimage: dpage/pgadmin4\ncontainer_name: pgadmin\nuser: \"${UID}:${GID}\"\nenvironment:\n- PGADMIN_DEFAULT_EMAIL=email@some-site.com\n- PGADMIN_DEFAULT_PASSWORD=pgadmin\nvolumes:\n- \"pg-admin:/var/lib/pgadmin\"\nports:\n- \"8080:80\"\nnetworks:\n- pg-network\nnetworks:\npg-network:\nname: pg-network\nvolumes:\npg-data:\nname: ingest_pgdata\npg-admin:\nname: ingest_pgadmin",
"section": "Module 1: Docker and Terraform",
"question": "WSL Docker directory permissions error"
},
{
"text": "Cause : If Running on git bash or vm in windows pgadmin doesnt work easily LIbraries like psycopg2 and libpq ar required still the error persists.\nSolution- I use psql instead of pgadmin totally same\nPip install psycopg2\ndock",
"section": "Module 1: Docker and Terraform",
"question": "Docker - If pgadmin is not working for Querying in Postgres Use PSQL"
},
{
"text": "Cause:\nIt happens because the apps are not updated. To be specific, search for any pending updates for Windows Terminal, WSL and Windows Security updates.\nSolution\nfor updating Windows terminal which worked for me:\nGo to Microsoft Store.\nGo to the library of apps installed in your system.\nSearch for Windows terminal.\nUpdate the app and restart your system to see the changes.\nFor updating the Windows security updates:\nGo to Windows updates and check if there are any pending updates from Windows, especially security updates.\nDo restart your system once the updates are downloaded and installed successfully.",
"section": "Module 1: Docker and Terraform",
"question": "WSL - Insufficient system resources exist to complete the requested service."
},
{
"text": "Up restardoting the same issue appears. Happens out of the blue on windows.\nSolution 1: Fixing DNS Issue (credit: reddit) this worked for me personally\nreg add \"HKLM\\System\\CurrentControlSet\\Services\\Dnscache\" /v \"Start\" /t REG_DWORD /d \"4\" /f\nRestart your computer and then enable it with the following\nreg add \"HKLM\\System\\CurrentControlSet\\Services\\Dnscache\" /v \"Start\" /t REG_DWORD /d \"2\" /f\nRestart your OS again. It should work.\nSolution 2: right click on running Docker icon (next to clock) and chose \"Switch to Linux containers\"\nbash: conda: command not found\nDatabase is uninitialized and superuser password is not specified.\nDatabase is uninitialized and superuser password is not specified.",
"section": "Module 1: Docker and Terraform",
"question": "WSL - WSL integration with distro Ubuntu unexpectedly stopped with exit code 1."
},
{
"text": "Issue when trying to run the GPC VM through SSH through WSL2, probably because WSL2 isn\u2019t looking for .ssh keys in the correct folder. My case I was trying to run this command in the terminal and getting an error\nPC:/mnt/c/Users/User/.ssh$ ssh -i gpc [username]@[my external IP]\nYou can try to use sudo before the command\nSudo .ssh$ ssh -i gpc [username]@[my external IP]\nYou can also try to cd to your folder and change the permissions for the private key SSH file.\nchmod 600 gpc\nIf that doesn\u2019t work, create a .ssh folder in the home diretory of WSL2 and copy the content of windows .ssh folder to that new folder.\ncd ~\nmkdir .ssh\ncp -r /mnt/c/Users/YourUsername/.ssh/* ~/.ssh/\nYou might need to adjust the permissions of the files and folders in the .ssh directory.",
"section": "Module 1: Docker and Terraform",
"question": "WSL - Permissions too open at Windows"
},
{
"text": "Such as the issue above, WSL2 may not be referencing the correct .ssh/config path from Windows. You can create a config file at the home directory of WSL2.\ncd ~\nmkdir .ssh\nCreate a config file in this new .ssh/ folder referencing this folder:\nHostName [GPC VM external IP]\nUser [username]\nIdentityFile ~/.ssh/[private key]",
"section": "Module 1: Docker and Terraform",
"question": "WSL - Could not resolve host name"
},
{
"text": "Change TO Socket\npgcli -h 127.0.0.1 -p 5432 -u root -d ny_taxi\npgcli -h 127.0.0.1 -p 5432 -u root -d ny_taxi",
"section": "Module 1: Docker and Terraform",
"question": "PGCLI - connection failed: :1), port 5432 failed: could not receive data from server: Connection refused could not send SSL negotiation packet: Connection refused"
},
{
"text": "probably some installation error, check out sy",
"section": "Module 1: Docker and Terraform",
"question": "PGCLI --help error"
},
{
"text": "In this section of the course, the 5432 port of pgsql is mapped to your computer\u2019s 5432 port. Which means you can access the postgres database via pgcli directly from your computer.\nSo No, you don\u2019t need to run it inside another container. Your local system will do.",
"section": "Module 1: Docker and Terraform",
"question": "PGCLI - INKhould we run pgcli inside another docker container?"
},
{
"text": "FATAL: password authentication failed for user \"root\"\nobservations: Below in bold do not forget the folder that was created ny_taxi_postgres_data\nThis happens if you have a local Postgres installation in your computer. To mitigate this, use a different port, like 5431, when creating the docker container, as in: -p 5431: 5432\nThen, we need to use this port when connecting to pgcli, as shown below:\npgcli -h localhost -p 5431 -u root -d ny_taxi\nThis will connect you to your postgres docker container, which is mapped to your host\u2019s 5431 port (though you might choose any port of your liking as long as it is not occupied).\nFor a more visual and detailed explanation, feel free to check the video 1.4.2 - Port Mapping and Networks in Docker\nIf you want to debug: the following can help (on a MacOS)\nTo find out if something is blocking your port (on a MacOS):\nYou can use the lsof command to find out which application is using a specific port on your local machine. `lsof -i :5432`wi\nOr list the running postgres services on your local machine with launchctl\nTo unload the running service on your local machine (on a MacOS):\nunload the launch agent for the PostgreSQL service, which will stop the service and free up the port \n`launchctl unload -w ~/Library/LaunchAgents/homebrew.mxcl.postgresql.plist`\nthis one to start it again\n`launchctl load -w ~/Library/LaunchAgents/homebrew.mxcl.postgresql.plist`\nChanging port from 5432:5432 to 5431:5432 helped me to avoid this error.",
"section": "Module 1: Docker and Terraform",
"question": "PGCLI - FATAL: password authentication failed for user \"root\" (You already have Postgres)"
},
{
"text": "I get this error\npgcli -h localhost -p 5432 -U root -d ny_taxi\nTraceback (most recent call last):\nFile \"/opt/anaconda3/bin/pgcli\", line 8, in <module>\nsys.exit(cli())\nFile \"/opt/anaconda3/lib/python3.9/site-packages/click/core.py\", line 1128, in __call__\nreturn self.main(*args, **kwargs)\nFile \"/opt/anaconda3/lib/python3.9/sitYe-packages/click/core.py\", line\n1053, in main\nrv = self.invoke(ctx)\nFile \"/opt/anaconda3/lib/python3.9/site-packages/click/core.py\", line 1395, in invoke\nreturn ctx.invoke(self.callback, **ctx.params)\nFile \"/opt/anaconda3/lib/python3.9/site-packages/click/core.py\", line 754, in invoke\nreturn __callback(*args, **kwargs)\nFile \"/opt/anaconda3/lib/python3.9/site-packages/pgcli/main.py\", line 880, in cli\nos.makedirs(config_dir)\nFile \"/opt/anaconda3/lib/python3.9/os.py\", line 225, in makedirspython\nmkdir(name, mode)PermissionError: [Errno 13] Permission denied: '/Users/vray/.config/pgcli'\nMake sure you install pgcli without sudo.\nThe recommended approach is to use conda/anaconda to make sure your system python is not affected.\nIf conda install gets stuck at \"Solving environment\" try these alternatives: https://stackoverflow.com/questions/63734508/stuck-at-solving-environment-on-anaconda",
"section": "Module 1: Docker and Terraform",
"question": "PGCLI - PermissionError: [Errno 13] Permission denied: '/some/path/.config/pgcli'"
},
{
"text": "ImportError: no pq wrapper available.\nAttempts made:\n- couldn't import \\dt\nopg 'c' implementation: No module named 'psycopg_c'\n- couldn't import psycopg 'binary' implementation: No module named 'psycopg_binary'\n- couldn't import psycopg 'python' implementation: libpq library not found\nSolution:\nFirst, make sure your Python is set to 3.9, at least.\nAnd the reason for that is we have had cases of 'psycopg2-binary' failing to install because of an old version of Python (3.7.3). \n\n0. You can check your current python version with: \n$ python -V(the V must be capital)\n1. Based on the previous output, if you've got a 3.9, skip to Step #2\n Otherwispye better off with a new environment with 3.9\n$ conda create \u2013name de-zoomcamp python=3.9\n$ conda activate de-zoomcamp\n2. Next, you should be able to install the lib for postgres like this:\n```\n$ e\n$ pip install psycopg2_binary\n```\n3. Finally, make sure you're also installing pgcli, but use conda for that:\n```\n$ pgcli -h localhost -U root -d ny_taxisudo\n```\nThere, you should be good to go now!\nAnother solution:\nRun this\npip install \"psycopg[binary,pool]\"",
"section": "Module 1: Docker and Terraform",
"question": "PGCLI - no pq wrapper available."
},
{
"text": "If your Bash prompt is stuck on the password command for postgres\nUse winpty:\nwinpty pgcli -h localhost -p 5432 -u root -d ny_taxi\nAlternatively, try using Windows terminal or terminal in VS code.\nEditPGCLI -connection failed: FATAL: password authentication failed for user \"root\"\nThe error above was faced continually despite inputting the correct password\nSolution\nOption 1: Stop the PostgreSQL service on Windows\nOption 2 (using WSL): Completely uninstall Protgres 12 from Windows and install postgresql-client on WSL (sudo apt install postgresql-client-common postgresql-client libpq-dev)\nOption 3: Change the port of the docker container\nNEW SOLUTION: 27/01/2024\nPGCLI -connection failed: FATAL: password authentication failed for user \"root\"\nIf you\u2019ve got the error above, it\u2019s probably because you were just like me, closed the connection to the Postgres:13 image in the previous step of the tutorial, which is\n\ndocker run -it \\\n-e POSTGRES_USER=root \\\n-e POSTGRES_PASSWORD=root \\\n-e POSTGRES_DB=ny_taxi \\\n-v d:/git/data-engineering-zoomcamp/week_1/docker_sql/ny_taxi_postgres_data:/var/lib/postgresql/data \\\n-p 5432:5432 \\\npostgres:13\nSo keep the database connected and you will be able to implement all the next steps of the tutorial.",
"section": "Module 1: Docker and Terraform",
"question": "PGCLI - stuck on password prompt"
},
{
"text": "Problem: If you have already installed pgcli but bash doesn't recognize pgcli\nOn Git bash: bash: pgcli: command not found\nOn Windows Terminal: pgcli: The term 'pgcli' is not recognized\u2026\nSolution: Try adding a Python path C:\\Users\\...\\AppData\\Roaming\\Python\\Python39\\Scripts to Windows PATH\nFor details:\nGet the location: pip list -v\nCopy C:\\Users\\...\\AppData\\Roaming\\Python\\Python39\\site-packages\n3. Replace site-packages with Scripts: C:\\Users\\...\\AppData\\Roaming\\Python\\Python39\\Scripts\nIt can also be that you have Python installed elsewhere.\nFor me it was under c:\\python310\\lib\\site-packages\nSo I had to add c:\\python310\\lib\\Scripts to PATH, as shown below.\nPut the above path in \"Path\" (or \"PATH\") in System Variables\nReference: https://stackoverflow.com/a/68233660",
"section": "Module 1: Docker and Terraform",
"question": "PGCLI - pgcli: command not found"
},
{
"text": "In case running pgcli locally causes issues or you do not want to install it locally you can use it running in a Docker container instead.\nBelow the usage with values used in the videos of the course for:\nnetwork name (docker network)\npostgres related variables for pgcli\nHostname\nUsername\nPort\nDatabase name\n$ docker run -it --rm --network pg-network ai2ys/dockerized-pgcli:4.0.1\n175dd47cda07:/# pgcli -h pg-database -U root -p 5432 -d ny_taxi\nPassword for root:\nServer: PostgreSQL 16.1 (Debian 16.1-1.pgdg120+1)\nVersion: 4.0.1\nHome: http://pgcli.com\nroot@pg-database:ny_taxi> \\dt\n+--------+------------------+-------+-------+\n| Schema | Name | Type | Owner |\n|--------+------------------+-------+-------|\n| public | yellow_taxi_data | table | root |\n+--------+------------------+-------+-------+\nSELECT 1\nTime: 0.009s\nroot@pg-database:ny_taxi>",
"section": "Module 1: Docker and Terraform",
"question": "PGCLI - running in a Docker container"
},
{
"text": "PULocationID will not be recognized but \u201cPULocationID\u201d will be. This is because unquoted \"Localidentifiers are case insensitive. See docs.",
"section": "Module 1: Docker and Terraform",
"question": "PGCLI - case sensitive use \u201cQuotations\u201d around columns with capital letters"
},
{
"text": "When using the command `\\d <database name>` you get the error column `c.relhasoids does not exist`.\nResolution:\nUninstall pgcli\nReinstall pgclidatabase \"ny_taxi\" does not exist\nRestart pc",
"section": "Module 1: Docker and Terraform",
"question": "PGCLI - error column c.relhasoids does not exist"
},
{
"text": "This happens while uploading data via the connection in jupyter notebook\nengine = create_engine('postgresql://root:root@localhost:5432/ny_taxi')\nThe port 5432 was taken by another postgres. We are not connecting to the port in docker, but to the port on our machine. Substitute 5431 or whatever port you mapped to for port 5432.\nAlso if this error is still persistent , kindly check if you have a service in windows running postgres , Stopping that service will resolve the issue",
"section": "Module 1: Docker and Terraform",
"question": "Postgres - OperationalError: (psycopg2.OperationalError) connection to server at \"localhost\" (::1), port 5432 failed: FATAL: password authentication failed for user \"root\""
},
{
"text": "Can happen when connecting via pgcli\npgcli -h localhost -p 5432 -U root -d ny_taxi\nOr while uploading data via the connection in jupyter notebook\nengine = create_engine('postgresql://root:root@localhost:5432/ny_taxi')\nThis can happen when Postgres is already installed on your computer. Changing the port can resolve that (e.g. from 5432 to 5431).\nTo check whether there even is a root user with the ability to login:\nTry: docker exec -it <your_container_name> /bin/bash\nAnd then run\n???\nAlso, you could change port from 5432:5432 to 5431:5432\nOther solution that worked:\nChanging `POSTGRES_USER=juroot` to `PGUSER=postgres`\nBased on this: postgres with docker compose gives FATAL: role \"root\" does not exist error - Stack Overflow\nAlso `docker compose down`, removing folder that had postgres volume, running `docker compose up` again.",
"section": "Module 1: Docker and Terraform",
"question": "Postgres - OperationalError: (psycopg2.OperationalError) connection to server at \"localhost\" (::1), port 5432 failed: FATAL: role \"root\" does not exist"
},
{
"text": "~\\anaconda3\\lib\\site-packages\\psycopg2\\__init__.py in connect(dsn, connection_factory, cursor_factory, **kwargs)\n120\n121 dsn = _ext.make_dsn(dsn, **kwargs)\n--> 122 conn = _connect(dsn, connection_factory=connection_factory, **kwasync)\n123 if cursor_factory is not None:\n124 conn.cursor_factory = cursor_factory\nOperationalError: (psycopg2.OperationalError) connection to server at \"localhost\" (::1), port 5432 failed: FATAL: database \"ny_taxi\" does not exist\nMake sure postgres is running. You can check that by running `docker ps`\n\u2705Solution: If you have postgres software installed on your computer before now, build your instance on a different port like 8080 instead of 5432",
"section": "Module 1: Docker and Terraform",
"question": "Postgres - OperationalError: (psycopg2.OperationalError) connection to server at \"localhost\" (::1), port 5432 failed: FATAL: dodatabase \"ny_taxi\" does not exist"
},
{
"text": "Issue:\ne\u2026\nSolution:\npip install psycopg2-binary\nIf you already have it, you might need to update it:\npip install psycopg2-binary --upgrade\nOther methods, if the above fails:\nif you are getting the \u201c ModuleNotFoundError: No module named 'psycopg2' \u201c error even after the above installation, then try updating conda using the command conda update -n base -c defaults conda. Or if you are using pip, then try updating it before installing the psycopg packages i.e\nFirst uninstall the psycopg package\nThen update conda or pip\nThen install psycopg again using pip.\nif you are still facing error with r pcycopg2 and showing pg_config not found then you will have to install postgresql. in MAC it is brew install postgresql",
"section": "Module 1: Docker and Terraform",
"question": "Postgres - ModuleNotFoundError: No module named 'psycopg2'"
},
{
"text": "In the join queries, if we mention the column name directly or enclosed in single quotes it\u2019ll throw an error says \u201ccolumn does not exist\u201d.\n\u2705Solution: But if we enclose the column names in double quotes then it will work",
"section": "Module 1: Docker and Terraform",
"question": "Postgres - \"Column does not exist\" but it actually does (Pyscopg2 error in MacBook Pro M2)"
},
{
"text": "pgAdmin has a new version. Create server dialog may not appear. Try using register-> server instead.",
"section": "Module 1: Docker and Terraform",
"question": "pgAdmin - Create server dialog does not appear"
},
{
"text": "Using GitHub Codespaces in the browser resulted in a blank screen after the login to pgAdmin (running in a Docker container). The terminal of the pgAdmin container was showing the following error message:\nCSRFError: 400 Bad Request: The referrer does not match the host.\nSolution #1:\nAs recommended in the following issue https://github.com/pgadmin-org/pgadmin4/issues/5432 setting the following environment variable solved it.\nPGADMIN_CONFIG_WTF_CSRF_ENABLED=\"False\"\nModified \u201cdocker run\u201d command\ndocker run --rm -it \\\n-e PGADMIN_DEFAULT_EMAIL=\"admin@admin.com\" \\\n-e PGADMIN_DEFAULT_PASSWORD=\"root\" \\\n-e PGADMIN_CONFIG_WTF_CSRF_ENABLED=\"False\" \\\n-p \"8080:80\" \\\n--name pgadmin \\\n--network=pg-network \\\ndpage/pgadmin4:8.2\nSolution #2:\nUsing the local installed VSCode to display GitHub Codespaces.\nWhen using GitHub Codespaces in the locally installed VSCode (opening a Codespace or creating/starting one) this issue did not occur.",
"section": "Module 1: Docker and Terraform",
"question": "pgAdmin - Blank/white screen after login (browser)"
},
{
"text": "I am using a Mac Pro device and connect to the GCP Compute Engine via Remote SSH - VSCode. But when I trying to run the PgAdmin container via docker run or docker compose command, I am failed to access the pgAdmin address via my browser. I have switched to another browser, but still can not access the pgAdmin address. So I modified a little bit the configuration from the previous DE Zoomcamp repository like below and can access the pgAdmin address:\nSolution #1:\nModified \u201cdocker run\u201d command\ndocker run --rm -it \\\n-e PGADMIN_DEFAULT_EMAIL=\"admin@admin.com\" \\\n-e PGADMIN_DEFAULT_PASSWORD=\"pgadmin\" \\\n-e PGADMIN_CONFIG_WTF_CSRF_ENABLED=\"False\" \\\n-e PGADMIN_LISTEN_ADDRESS=0.0.0.0 \\\n-e PGADMIN_LISTEN_PORT=5050 \\\n-p 5050:5050 \\\n--network=de-zoomcamp-network \\\n--name pgadmin-container \\\n--link postgres-container \\\n-t dpage/pgadmin4\nSolution #2:\nModified docker-compose.yaml configuration (via \u201cdocker compose up\u201d command)\npgadmin:\nimage: dpage/pgadmin4\ncontainer_name: pgadmin-conntainer\nenvironment:\n- PGADMIN_DEFAULT_EMAIL=admin@admin.com\n- PGADMIN_DEFAULT_PASSWORD=pgadmin\n- PGADMIN_CONFIG_WTF_CSRF_ENABLED=False\n- PGADMIN_LISTEN_ADDRESS=0.0.0.0\n- PGADMIN_LISTEN_PORT=5050\nvolumes:\n- \"./pgadmin_data:/var/lib/pgadmin/data\"\nports:\n- \"5050:5050\"\nnetworks:\n- de-zoomcamp-network\ndepends_on:\n- postgres-conntainer\nPython - ModuleNotFoundError: No module named 'pysqlite2'\nImportError: DLL load failed while importing _sqlite3: The specified module could not be found. ModuleNotFoundError: No module named 'pysqlite2'\nThe issue seems to arise from the missing of sqlite3.dll in path \".\\Anaconda\\Dlls\\\".\n\u2705I solved it by simply copying that .dll file from \\Anaconda3\\Library\\bin and put it under the path mentioned above. (if you are using anaconda)",
"section": "Module 1: Docker and Terraform",
"question": "pgAdmin - Can not access/open the PgAdmin address via browser"
},
{
"text": "If you follow the video 1.2.2 - Ingesting NY Taxi Data to Postgres and you execute all the same\nsteps as Alexey does, you will ingest all the data (~1.3 million rows) into the table yellow_taxi_data as expected.\nHowever, if you try to run the whole script in the Jupyter notebook for a second time from top to bottom, you will be missing the first chunk of 100000 records. This is because there is a call to the iterator before the while loop that puts the data in the table. The while loop therefore starts by ingesting the second chunk, not the first.\n\u2705Solution: remove the cell \u201cdf=next(df_iter)\u201d that appears higher up in the notebook than the while loop. The first time w(df_iter) is called should be within the while loop.\n\ud83d\udcd4Note: As this notebook is just used as a way to test the code, it was not intended to be run top to bottom, and the logic is tidied up in a later step when it is instead inserted into a .py file for the pipeline",
"section": "Module 1: Docker and Terraform",
"question": "Python - Ingestion with Jupyter notebook - missing 100000 records"
},
{
"text": "{t_end - t_start} seconds\")\nimport pandas as pd\ndf = pd.read_csv('path/to/file.csv.gz', /app/ingest_data.py:1: DeprecationWarning:)\nIf you prefer to keep the uncompressed csv (easier preview in vscode and similar), gzip files can be unzipped using gunzip (but not unzip). On a Ubuntu local or virtual machine, you may need to apt-get install gunzip first.",
"section": "Module 1: Docker and Terraform",
"question": "Python - Iteration csv without error"
},
{
"text": "Pandas can interpret \u201cstring\u201d column values as \u201cdatetime\u201d directly when reading the CSV file using \u201cpd.read_csv\u201d using the parameter \u201cparse_dates\u201d, which for example can contain a list of column names or column indices. Then the conversion afterwards is not required anymore.\npandas.read_csv \u2014 pandas 2.1.4 documentation (pydata.org)\nExample from week 1\nimport pandas as pd\ndf = pd.read_csv(\n'yellow_tripdata_2021-01.csv',\nnrows=100,\nparse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime'])\ndf.info()\nwhich will output\n<class 'pandas.core.frame.DataFrame'>\nRangeIndex: 100 entries, 0 to 99\nData columns (total 18 columns):\n# Column Non-Null Count Dtype\n--- ------ -------------- -----\n0 VendorID 100 non-null int64\n1 tpep_pickup_datetime 100 non-null datetime64[ns]\n2 tpep_dropoff_datetime 100 non-null datetime64[ns]\n3 passenger_count 100 non-null int64\n4 trip_distance 100 non-null float64\n5 RatecodeID 100 non-null int64\n6 store_and_fwd_flag 100 non-null object\n7 PULocationID 100 non-null int64\n8 DOLocationID 100 non-null int64\n9 payment_type 100 non-null int64\n10 fare_amount 100 non-null float64\n11 extra 100 non-null float64\n12 mta_tax 100 non-null float64\n13 tip_amount 100 non-null float64\n14 tolls_amount 100 non-null float64\n15 improvement_surcharge 100 non-null float64\n16 total_amount 100 non-null float64\n17 congestion_surcharge 100 non-null float64\ndtypes: datetime64[ns](2), float64(9), int64(6), object(1)\nmemory usage: 14.2+ KB",
"section": "Module 1: Docker and Terraform",
"question": "iPython - Pandas parsing dates with \u2018read_csv\u2019"
},
{
"text": "os.system(f\"curl -LO {url} -o {csv_name}\")",
"section": "Module 1: Docker and Terraform",
"question": "Python - Python cant ingest data from the github link provided using curl"
},
{
"text": "When a CSV file is compressed using Gzip, it is saved with a \".csv.gz\" file extension. This file type is also known as a Gzip compressed CSV file. When you want to read a Gzip compressed CSV file using Pandas, you can use the read_csv() function, which is specifically designed to read CSV files. The read_csv() function accepts several parameters, including a file path or a file-like object. To read a Gzip compressed CSV file, you can pass the file path of the \".csv.gz\" file as an argument to the read_csv() function.\nHere is an example of how to read a Gzip compressed CSV file using Pandas:\ndf = pd.read_csv('file.csv.gz'\n, compression='gzip'\n, low_memory=False\n)",
"section": "Module 1: Docker and Terraform",
"question": "Python - Pandas can read *.csv.gzip"
},
{
"text": "Contrary to panda\u2019s read_csv method there\u2019s no such easy way to iterate through and set chunksize for parquet files. We can use PyArrow (Apache Arrow Python bindings) to resolve that.\nimport pyarrow.parquet as pq\noutput_name = \u201chttps://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-01.parquet\u201d\nparquet_file = pq.ParquetFile(output_name)\nparquet_size = parquet_file.metadata.num_rows\nengine = create_engine(f'postgresql://{user}:{password}@{host}:{port}/{db}')\ntable_name=\u201dyellow_taxi_schema\u201d\n# Clear table if exists\npq.read_table(output_name).to_pandas().head(n=0).to_sql(name=table_name, con=engine, if_exists='replace')\n# default (and max) batch size\nindex = 65536\nfor i in parquet_file.iter_batches(use_threads=True):\nt_start = time()\nprint(f'Ingesting {index} out of {parquet_size} rows ({index / parquet_size:.0%})')\ni.to_pandas().to_sql(name=table_name, con=engine, if_exists='append')\nindex += 65536\nt_end = time()\nprint(f'\\t- it took %.1f seconds' % (t_end - t_start))",
"section": "Module 1: Docker and Terraform",
"question": "Python - How to iterate through and ingest parquet file"
},
{
"text": "Error raised during the jupyter notebook\u2019s cell execution:\nfrom sqlalchemy import create_engine.\nSolution: Version of Python module \u201ctyping_extensions\u201d >= 4.6.0. Can be updated by Conda or pip.",
"section": "Module 1: Docker and Terraform",
"question": "Python - SQLAlchemy - ImportError: cannot import name 'TypeAliasType' from 'typing_extensions'."
},
{
"text": "create_engine('postgresql://root:root@localhost:5432/ny_taxi') I get the error \"TypeError: 'module' object is not callable\"\nSolution:\nconn_string = \"postgresql+psycopg://root:root@localhost:5432/ny_taxi\"\nengine = create_engine(conn_string)",
"section": "Module 1: Docker and Terraform",
"question": "Python - SQLALchemy - TypeError 'module' object is not callable"
},
{
"text": "Error raised during the jupyter notebook\u2019s cell execution:\nengine = create_engine('postgresql://root:root@localhost:5432/ny_taxi').\nSolution: Need to install Python module \u201cpsycopg2\u201d. Can be installed by Conda or pip.",
"section": "Module 1: Docker and Terraform",
"question": "Python - SQLAlchemy - ModuleNotFoundError: No module named 'psycopg2'."
},
{
"text": "Unable to add Google Cloud SDK PATH to Windows\nWindows error: The installer is unable to automatically update your system PATH. Please add C:\\tools\\google-cloud-sdk\\bin\nif you are constantly getting this feedback. Might be that you needed to add Gitbash to your Windows path:\nOne way of doing that is to use conda: \u2018If you are not already using it\nDownload the Anaconda Navigator\nMake sure to check the box (add conda to the path when installing navigator: although not recommended do it anyway)\nYou might also need to install git bash if you are not already using it(or you might need to uninstall it to reinstall it properly)\nMake sure to check the following boxes while you install Gitbash\nAdd a GitBash to Windows Terminal\nUse Git and optional Unix tools from the command prompt\nNow open up git bash and type conda init bash This should modify your bash profile\nAdditionally, you might want to use Gitbash as your default terminal.\nOpen your Windows terminal and go to settings, on the default profile change Windows power shell to git bash",
"section": "Module 1: Docker and Terraform",
"question": "GCP - Unable to add Google Cloud SDK PATH to Windows"
},
{
"text": "It asked me to create a project. This should be done from the cloud console. So maybe we don\u2019t need this FAQ.\nWARNING: Project creation failed: HttpError accessing <https://cloudresourcemanager.googleapis.com/v1/projects?alt=json>: response: <{'vtpep_pickup_datetimeary': 'Origin, X-Origin, Referer', 'content-type': 'application/json; charset=UTF-8', 'content-encoding': 'gzip', 'date': 'Mon, 24 Jan 2022 19:29:12 GMT', 'server': 'ESF', 'cache-control': 'private', 'x-xss-protection': '0', 'x-frame-options': 'SAMEORIGIN', 'x-content-type-options': 'nosniff', 'server-timing': 'gfet4t7; dur=189', 'alt-svc': 'h3=\":443\"; ma=2592000,h3-29=\":443\"; ma=2592000,h3-Q050=\":443\"; ma=2592000,h3-Q046=\":443\"; ma=2592000,h3-Q043=\":443\"; ma=2592000,quic=\":443\"; ma=2592000; v=\"46,43\"', 'transfer-encoding': 'chunked', 'status': 409}>, content <{\n\"error\": {\n\"code\": 409,\n\"message\": \"Requested entity alreadytpep_pickup_datetime exists\",\n\"status\": \"ALREADY_EXISTS\"\n}\n}\nFrom Stackoverflow: https://stackoverflow.com/questions/52561383/gcloud-cli-cannot-create-project-the-project-id-you-specified-is-already-in-us?rq=1\nProject IDs are unique across all projects. That means if any user ever had a project with that ID, you cannot use it. testproject is pretty common, so it's not surprising it's already taken.",
"section": "Module 1: Docker and Terraform",
"question": "GCP - Project creation failed: HttpError accessing \u2026 Requested entity alreadytpep_pickup_datetime exists"
},
{
"text": "If you receive the error: \u201cError 403: The project to be billed is associated with an absent billing account., accountDisabled\u201d It is most likely because you did not enter YOUR project ID. The snip below is from video 1.3.2\nThe value you enter here will be unique to each student. You can find this value on your GCP Dashboard when you login.\nAshish Agrawal\nAnother possibility is that you have not linked your billing account to your current project",
"section": "Module 1: Docker and Terraform",
"question": "GCP - The project to be billed is associated with an absent billing account"
},
{
"text": "GCP Account Suspension Inquiry\nIf Google refuses your credit/debit card, try another - I\u2019ve got an issue with Kaspi (Kazakhstan) but it worked with TBC (Georgia).\nUnfortunately, there\u2019s small hope that support will help.\nIt seems that Pyypl web-card should work too.",
"section": "Module 1: Docker and Terraform",
"question": "GCP - OR-CBAT-15 ERROR Google cloud free trial account"
},
{
"text": "The ny-rides.json is your private file in Google Cloud Platform (GCP). \n\nAnd here\u2019s the way to find it:\nGCP -> Select project with your instance -> IAM & Admin -> Service Accounts Keys tab -> add key, JSON as key type, then click create\nNote: Once you go into Service Accounts Keys tab, click the email, then you can see the \u201cKEYS\u201d tab where you can add key as a JSON as its key type",
"section": "Module 1: Docker and Terraform",
"question": "GCP - Where can I find the \u201cny-rides.json\u201d file?"
},
{
"text": "In this lecture, Alexey deleted his instance in Google Cloud. Do I have to do it?\nNope. Do not delete your instance in Google Cloud platform. Otherwise, you have to do this twice for the week 1 readings.",
"section": "Module 1: Docker and Terraform",
"question": "GCP - Do I need to delete my instance in Google Cloud?"
},
{
"text": "System Resource Usage:\ntop or htop: Shows real-time information about system resource usage, including CPU, memory, and processes.\nfree -h: Displays information about system memory usage and availability.\ndf -h: Shows disk space usage of file systems.\ndu -h <directory>: Displays disk usage of a specific directory.\nRunning Processes:\nps aux: Lists all running processes along with detailed information.\nNetwork:\nifconfig or ip addr show: Shows network interface configuration.\nnetstat -tuln: Displays active network connections and listening ports.\nHardware Information:\nlscpu: Displays CPU information.\nlsblk: Lists block devices (disks and partitions).\nlshw: Lists hardware configuration.\nUser and Permissions:\nwho: Shows who is logged on and their activities.\nw: Displays information about currently logged-in users and their processes.\nPackage Management:\napt list --installed: Lists installed packages (for Ubuntu and Debian-based systems)",
"section": "Module 1: Docker and Terraform",
"question": "Commands to inspect the health of your VM:"
},
{
"text": "if you\u2019ve got the error\n\u2502 Error: Error updating Dataset \"projects/<your-project-id>/datasets/demo_dataset\": googleapi: Error 403: Billing has not been enabled for this project. Enable billing at https://console.cloud.google.com/billing. The default table expiration time must be less than 60 days, billingNotEnabled\nbut you\u2019ve set your billing account indeed, then try to disable billing for the project and enable it again. It worked for ME!",
"section": "Module 1: Docker and Terraform",
"question": "Billing account has not been enabled for this project. But you\u2019ve done it indeed!"
},
{
"text": "for windows if you having trouble install SDK try follow these steps on the link, if you getting this error:\nThese credentials will be used by any library that requests Application Default Credentials (ADC).\nWARNING:\nCannot find a quota project to add to ADC. You might receive a \"quota exceeded\" or \"API not enabled\" error. Run $ gcloud auth application-default set-quota-project to add a quota project.\nFor me:\nI reinstalled the sdk using unzip file \u201cinstall.bat\u201d,\nafter successfully checking gcloud version,\nrun gcloud init to set up project before\nyou run gcloud auth application-default login\nhttps://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/week_1_basics_n_setup/1_terraform_gcp/windows.md\nGCP VM - I cannot get my Virtual Machine to start because GCP has no resources.\nClick on your VM\nCreate an image of your VM\nOn the page of the image, tell GCP to create a new VM instance via the image\nOn the settings page, change the location",
"section": "Module 1: Docker and Terraform",
"question": "GCP - Windows Google Cloud SDK install issue:gcp"
},
{
"text": "The reason this video about the GCP VM exists is that many students had problems configuring their env. You can use your own env if it works for you.\nAnd the advantage of using your own environment is that if you are working in a Github repo where you can commit, you will be able to commit the changes that you do. In the VM the repo is cloned via HTTPS so it is not possible to directly commit, even if you are the owner of the repo.",
"section": "Module 1: Docker and Terraform",
"question": "GCP VM - Is it necessary to use a GCP VM? When is it useful?"
},
{
"text": "I am trying to create a directory but it won't let me do it\nUser1@DESKTOP-PD6UM8A MINGW64 /\n$ mkdir .ssh\nmkdir: cannot create directory \u2018.ssh\u2019: Permission denied\nYou should do it in your home directory. Should be your home (~)\nLocal. But it seems you're trying to do it in the root folder (/). Should be your home (~)\nLink to Video 1.4.1",
"section": "Module 1: Docker and Terraform",
"question": "GCP VM - mkdir: cannot create directory \u2018.ssh\u2019: Permission denied"
},
{
"text": "Failed to save '<file>': Unable to write file 'vscode-remote://ssh-remote+de-zoomcamp/home/<user>/data_engineering_course/week_2/airflow/dags/<file>' (NoPermissions (FileSystemError): Error: EACCES: permission denied, open '/home/<user>/data_engineering_course/week_2/airflow/dags/<file>')\nYou need to change the owner of the files you are trying to edit via VS Code. You can run the following command to change the ownership.\nssh\nsudo chown -R <user> <path to your directory>",
"section": "Module 1: Docker and Terraform",
"question": "GCP VM - Error while saving the file in VM via VS Code"
},
{
"text": "Question: I connected to my VM perfectly fine last week (ssh) but when I tried again this week, the connection request keeps timing out.\n\u2705Answer: Start your VM. Once the VM is running, copy its External IP and paste that into your config file within the ~/.ssh folder.\ncd ~/.ssh\ncode config \u2190 this opens the config file in VSCode",
"section": "Module 1: Docker and Terraform",
"question": ". GCP VM - VM connection request timeout"
},
{
"text": "(reference: https://serverfault.com/questions/953290/google-compute-engine-ssh-connect-to-host-ip-port-22-operation-timed-out)Go to edit your VM.\nGo to section Automation\nAdd Startup script\n```\n#!/bin/bash\nsudo ufw allow ssh\n```\nStop and Start VM.",
"section": "Module 1: Docker and Terraform",
"question": "GCP VM - connect to host port 22 no route to host"
},
{
"text": "You can easily forward the ports of pgAdmin, postgres and Jupyter Notebook using the built-in tools in Ubuntu and without any additional client:\nFirst, in the VM machine, launch docker-compose up -d and jupyter notebook in the correct folder.\nFrom the local machine, execute: ssh -i ~/.ssh/gcp -L 5432:localhost:5432 username@external_ip_of_vm\nExecute the same command but with ports 8080 and 8888.\nNow you can access pgAdmin on local machine in browser typing localhost:8080\nFor Jupyter Notebook, type localhost:8888 in the browser of your local machine. If you have problems with the credentials, it is possible that you have to copy the link with the access token provided in the logs of the terminal of the VM machine when you launched the jupyter notebook command.\nTo forward both pgAdmin and postgres use, ssh -i ~/.ssh/gcp -L 5432:localhost:5432 -L 8080:localhost:8080 modito@35.197.218.128",
"section": "Module 1: Docker and Terraform",
"question": "GCP VM - Port forwarding from GCP without using VS Code"
},
{
"text": "If you are using MS VS Code and running gcloud in WSL2, when you first try to login to gcp via the gcloud cli gcloud auth application-default login, you will see a message like this, and nothing will happen\nAnd there might be a prompt to ask if you want to open it via browser, if you click on it, it will open up a page with error message\nSolution : you should instead hover on the long link, and ctrl + click the long link\n\nClick configure Trusted Domains here\n\nPopup will appear, pick first or second entry\nNext time you gcloud auth, the login page should popup via default browser without issues",
"section": "Module 1: Docker and Terraform",
"question": "GCP gcloud + MS VS Code - gcloud auth hangs"
},
{
"text": "It is an internet connectivity error, terraform is somehow not able to access the online registry. Check your VPN/Firewall settings (or just clear cookies or restart your network). Try terraform init again after this, it should work.",
"section": "Module 1: Docker and Terraform",
"question": "Terraform - Error: Failed to query available provider packages \u2502 Could not retrieve the list of available versions for provider hashicorp/google: could not query \u2502 provider registry for registry.terrafogorm.io/hashicorp/google: the request failed after 2 attempts, \u2502 please try again later"
},
{
"text": "The issue was with the network. Google is not accessible in my country, I am using a VPN. And The terminal program does not automatically follow the system proxy and requires separate proxy configuration settings.I opened a Enhanced Mode in Clash, which is a VPN app, and 'terraform apply' works! So if you encounter the same issue, you can ask help for your vpn provider.",
"section": "Module 1: Docker and Terraform",
"question": "Terraform - Error:Post \"https://storage.googleapis.com/storage/v1/b?alt=json&prettyPrint=false&project=coherent-ascent-379901\": oauth2: cannot fetch token: Post \"https://oauth2.googleapis.com/token\": dial tcp 172.217.163.42:443: i/o timeout"
},
{
"text": "https://techcommunity.microsoft.com/t5/azure-developer-community-blog/configuring-terraform-on-windows-10-linux-sub-system/ba-p/393845",
"section": "Module 1: Docker and Terraform",
"question": "Terraform - Install for WSL"
},
{
"text": "https://github.com/hashicorp/terraform/issues/14513",
"section": "Module 1: Docker and Terraform",
"question": "Terraform - Error acquiring the state lock"
},
{
"text": "When running\nterraform apply\non wsl2 I've got this error:\n\u2502 Error: Post \"https://storage.googleapis.com/storage/v1/b?alt=json&prettyPrint=false&project=<your-project-id>\": oauth2: cannot fetch token: 400 Bad Request\n\u2502 Response: {\"error\":\"invalid_grant\",\"error_description\":\"Invalid JWT: Token must be a short-lived token (60 minutes) and in a reasonable timeframe. Check your iat and exp values in the JWT claim.\"}\nIT happens because there may be time desync on your machine which affects computing JWT\nTo fix this, run the command\nsudo hwclock -s\nwhich fixes your system time.\nReference",
"section": "Module 1: Docker and Terraform",
"question": "Terraform - Error 400 Bad Request. Invalid JWT Token on WSL."
},
{
"text": "\u2502 Error: googleapi: Error 403: Access denied., forbidden\nYour $GOOGLE_APPLICATION_CREDENTIALS might not be pointing to the correct file \nrun = export GOOGLE_APPLICATION_CREDENTIALS=~/.gc/YOUR_JSON.json\nAnd then = gcloud auth activate-service-account --key-file $GOOGLE_APPLICATION_CREDENTIALS",
"section": "Module 1: Docker and Terraform",
"question": "Terraform - Error 403 : Access denied"
},
{
"text": "One service account is enough for all the services/resources you'll use in this course. After you get the file with your credentials and set your environment variable, you should be good to go.",
"section": "Module 1: Docker and Terraform",
"question": "Terraform - Do I need to make another service account for terraform before I get the keys (.json file)?"
},
{
"text": "Here: https://releases.hashicorp.com/terraform/1.1.3/terraform_1.1.3_linux_amd64.zip",
"section": "Module 1: Docker and Terraform",
"question": "Terraform - Where can I find the Terraform 1.1.3 Linux (AMD 64)?"
},
{
"text": "You get this error because I run the command terraform init outside the working directory, and this is wrong.You need first to navigate to the working directory that contains terraform configuration files, and and then run the command.",
"section": "Module 1: Docker and Terraform",
"question": "Terraform - Terraform initialized in an empty directory! The directory has no Terraform configuration files. You may begin working with Terraform immediately by creating Terraform configuration files.g"
},
{
"text": "The error:\nError: googleapi: Error 403: Access denied., forbidden\n\u2502\nand\n\u2502 Error: Error creating Dataset: googleapi: Error 403: Request had insufficient authentication scopes.\nFor this solution make sure to run:\necho $GOOGLE_APPLICATION_CREDENTIALS\necho $?\nSolution:\nYou have to set again the GOOGLE_APPLICATION_CREDENTIALS as Alexey did in the environment set-up video in week1:\nexport GOOGLE_APPLICATION_CREDENTIALS=\"<path/to/your/service-account-authkeys>.json",
"section": "Module 1: Docker and Terraform",
"question": "Terraform - Error creating Dataset: googleapi: Error 403: Request had insufficient authentication scopes"
},
{
"text": "The error:\nError: googleapi: Error 403: terraform-trans-campus@trans-campus-410115.iam.gserviceaccount.com does not have storage.buckets.create access to the Google Cloud project. Permission 'storage.buckets.create' denied on resource (or it may not exist)., forbidden\nThe solution:\nYou have to declare the project name as your Project ID, and not your Project name, available on GCP console Dashboard.",
"section": "Module 1: Docker and Terraform",
"question": "Terraform - Error creating Bucket: googleapi: Error 403: Permission denied to access \u2018storage.buckets.create\u2019"
},
{
"text": "provider \"google\" {\nproject = var.projectId\ncredentials = file(\"${var.gcpkey}\")\n#region = var.region\nzone = var.zone\n}",
"section": "Module 1: Docker and Terraform",
"question": "To ensure the sensitivity of the credentials file, I had to spend lot of time to input that as a file."
},
{
"text": "For the HW1 I encountered this issue. The solution is\nSELECT * FROM zones AS z WHERE z.\"Zone\" = 'Astoria Zone';\nI think columns which start with uppercase need to go between \u201cColumn\u201d. I ran into a lot of issues like this and \u201c \u201d made it work out.\nAddition to the above point, for me, there is no \u2018Astoria Zone\u2019, only \u2018Astoria\u2019 is existing in the dataset.\nSELECT * FROM zones AS z WHERE z.\"Zone\" = 'Astoria\u2019;",
"section": "Module 1: Docker and Terraform",
"question": "SQL - SELECT * FROM zones_taxi WHERE Zone='Astoria Zone'; Error Column Zone doesn't exist"
},
{
"text": "It is inconvenient to use quotation marks all the time, so it is better to put the data to the database all in lowercase, so in Pandas after\ndf = pd.read_csv(\u2018taxi+_zone_lookup.csv\u2019)\nAdd the row:\ndf.columns = df.columns.str.lower()",
"section": "Module 1: Docker and Terraform",
"question": "SQL - SELECT Zone FROM taxi_zones Error Column Zone doesn't exist"
},
{
"text": "Solution (for mac users): os.system(f\"curl {url} --output {csv_name}\")",
"section": "Module 1: Docker and Terraform",
"question": "CURL - curl: (6) Could not resolve host: output.csv"
},
{
"text": "To resolve this, ensure that your config file is in C/User/Username/.ssh/config",
"section": "Module 1: Docker and Terraform",
"question": "SSH Error: ssh: Could not resolve hostname linux: Name or service not known"
},
{
"text": "If you use Anaconda (recommended for the course), it comes with pip, so the issues is probably that the anaconda\u2019s Python is not on the PATH.\nAdding it to the PATH is different for each operation system.\nFor Linux and MacOS:\nOpen a terminal.\nFind the path to your Anaconda installation. This is typically `~/anaconda3` or `~/opt/anaconda3`.\nAdd Anaconda to your PATH with the command: `export PATH=\"/path/to/anaconda3/bin:$PATH\"`.\nTo make this change permanent, add the command to your `.bashrc` (Linux) or `.bash_profile` (MacOS) file.\nOn Windows, python and pip are in different locations (python is in the anaconda root, and pip is in Scripts). With GitBash:\nLocate your Anaconda installation. The default path is usually `C:\\Users\\[YourUsername]\\Anaconda3`.\nDetermine the correct path format for Git Bash. Paths in Git Bash follow the Unix-style, so convert the Windows path to a Unix-style path. For example, `C:\\Users\\[YourUsername]\\Anaconda3` becomes `/c/Users/[YourUsername]/Anaconda3`.\nAdd Anaconda to your PATH with the command: `export PATH=\"/c/Users/[YourUsername]/Anaconda3/:/c/Users/[YourUsername]/Anaconda3/Scripts/$PATH\"`.\nTo make this change permanent, add the command to your `.bashrc` file in your home directory.\nRefresh your environment with the command: `source ~/.bashrc`.\nFor Windows (without Git Bash):\nRight-click on 'This PC' or 'My Computer' and select 'Properties'.\nClick on 'Advanced system settings'.\nIn the System Properties window, click on 'Environment Variables'.\nIn the Environment Variables window, select the 'Path' variable in the 'System variables' section and click 'Edit'.\nIn the Edit Environment Variable window, click 'New' and add the path to your Anaconda installation (typically `C:\\Users\\[YourUsername]\\Anaconda3` and C:\\Users\\[YourUsername]\\Anaconda3\\Scripts`).\nClick 'OK' in all windows to apply the changes.\nAfter adding Anaconda to the PATH, you should be able to use `pip` from the command line. 
Remember to restart your terminal (or command prompt in Windows) to apply these changes.",
"section": "Module 1: Docker and Terraform",
"question": "'pip' is not recognized as an internal or external command, operable program or batch file."
},
{
"text": "Resolution: You need to stop the services which is using the port.\nRun the following:\n```\nsudo kill -9 `sudo lsof -t -i:<port>`\n```\n<port> being 8080 in this case. This will free up the port for use.\n~ Abhijit Chakraborty\nError: error response from daemon: cannot stop container: 1afaf8f7d52277318b71eef8f7a7f238c777045e769dd832426219d6c4b8dfb4: permission denied\nResolution: In my case, I had to stop docker and restart the service to get it running properly\nUse the following command:\n```\nsudo systemctl restart docker.socket docker.service\n```\n~ Abhijit Chakraborty\nError: cannot import module psycopg2\nResolution: Run the following command in linux:\n```\nsudo apt-get install libpq-dev\npip install psycopg2\n```\n~ Abhijit Chakraborty\nError: docker build Error checking context: 'can't stat '<path-to-file>'\nResolution: This happens due to insufficient permission for docker to access a certain file within the directory which hosts the Dockerfile.\n1. You can create a .dockerignore file and add the directory/file which you want Dockerfile to ignore while build.\n2. If the above does not work, then put the dockerfile and corresponding script, `\t1.py` in our case to a subfolder. and run `docker build ...`\nfrom inside the new folder.\n~ Abhijit Chakraborty",
"section": "Module 1: Docker and Terraform",
"question": "Error: error starting userland proxy: listen tcp4 0.0.0.0:8080: bind: address already in use"
},
{
"text": "To get a pip-friendly requirements.txt file file from Anaconda use\nconda install pip then `pip list \u2013format=freeze > requirements.txt`.\n`conda list -d > requirements.txt` will not work and `pip freeze > requirements.txt` may give odd pathing.",
"section": "Module 2: Workflow Orchestration",
"question": "Anaconda to PIP"
},
{
"text": "Prefect: https://docs.google.com/document/d/1K_LJ9RhAORQk3z4Qf_tfGQCDbu8wUWzru62IUscgiGU/edit?usp=sharing\nAirflow: https://docs.google.com/document/d/1-BwPAsyDH_mAsn8HH5z_eNYVyBMAtawJRjHHsjEKHyY/edit?usp=sharing",
"section": "Module 2: Workflow Orchestration",
"question": "Where are the FAQ questions from the previous cohorts for the orchestration module?"
},
{
"text": "Issue : Docker containers exit instantly with code 132, upon docker compose up\nMage documentation has it listing the cause as \"older architecture\" .\nThis might be a hardware issue, so unless you have another computer, you can't solve it without purchasing a new one, so the next best solution is a VM.\nThis is from a student running on a VirtualBox VM, Ubuntu 22.04.3 LTS, Docker version 25.0.2. So not having the context on how the vbox was spin up with (CPU, RAM, network, etc), it\u2019s really inconclusive at this time.",
"section": "Module 2: Workflow Orchestration",
"question": "Docker - 2.2.2 Configure Mage"
},
{
"text": "This issue was occurring with Windows WSL 2\nFor me this was because WSL 2 was not dedicating enough cpu cores to Docker.The load seems to take up at least one cpu core so I recommend dedicating at least two.\nOpen Bash and run the following code:\n$ cd ~\n$ ls -la\nLook for the .wsl config file:\n-rw-r--r-- 1 ~1049089 31 Jan 25 12:54 .wslconfig\nUsing a text editing tool of your choice edit or create your .wslconfig file:\n$ nano .wslconfig\nPaste the following into the new file/ edit the existing file in this format and save:\n*** Note - for memory\u2013 this is the RAM on your machine you can dedicate to Docker, your situation may be different than mine ***\n[wsl2]\nprocessors=<Number of Processors - at least 2!> example: 4\nmemory=<memory> example:4GB\nExample:\nOnce you do that run:\n$ wsl --shutdown\nThis shuts down WSL\nThen Restart Docker Desktop - You should now be able to load the .csv.gz file without the error into a pandas dataframe",
"section": "Module 2: Workflow Orchestration",
"question": "WSL - 2.2.3 Mage - Unexpected Kernel Restarts; Kernel Running out of memory:"
},
{
"text": "The issue and solution on the link:\nhttps://datatalks-club.slack.com/archives/C01FABYF2RG/p1706817366764269?thread_ts=1706815324.993529&cid=C01FABYF2RG",
"section": "Module 2: Workflow Orchestration",
"question": "2.2.3 Configuring Postgres"
},
{
"text": "Check that the POSTGRES_PORT variable in the io_config.yml file is set to port 5432, which is the default postgres port. The POSTGRES_PORT variable is the mage container port, not the host port. Hence, there\u2019s no need to set the POSTGRES_PORT to 5431 just because you already have a conflicting postgres installation in your host machine.",
"section": "Module 2: Workflow Orchestration",
"question": "MAGE - 2.2.3 OperationalError: (psycopg2.OperationalError) connection to server at \"localhost\" (::1), port 5431 failed: Connection refused"
},
{
"text": "You forgot to select \u2018dev\u2019 profile in the dropdown menu next to where you select \u2018PostgreSQL\u2019 in the connection drop down.",
"section": "Module 2: Workflow Orchestration",
"question": "MAGE - 2.2.4 executing SELECT 1; results in KeyError"
},
{
"text": "If you are getting this error. Update your mage io_config.yaml file, and specify a timeout value set to 600 like this.\nMake sure to save your changes.\nMAGE - 2.2.4 Testing BigQuery connection using SQL 404 error:\nNotFound: 404 Not found: Dataset ny-rides-diegogutierrez:None was not found in location northamerica-northeast1\nIf you get this error even with all roles/permissions given to the service account check if you have ticked the box where it says \u201cUse raw SQL\u201d, just like the image below.",
"section": "Module 2: Workflow Orchestration",
"question": "MAGE -2.2.4 ConnectionError: ('Connection aborted.', TimeoutError('The write operation timed out'))"
},
{
"text": "Solution: https://stackoverflow.com/questions/48056381/google-client-invalid-jwt-token-must-be-a-short-lived-token",
"section": "Module 2: Workflow Orchestration",
"question": "Problem: RefreshError: ('invalid_grant: Invalid JWT: Token must be a short-lived token (60 minutes) and in a reasonable timeframe. Check your iat and exp values in the JWT claim.', {'error': 'invalid_grant', 'error_description': 'Invalid JWT: Token must be a short-lived token (60 minutes) and in a reasonable timeframe. Check your iat and exp values in the JWT claim.'})"
},
{
"text": "Origin of Solution (Mage Slack-Channel): https://mageai.slack.com/archives/C03HTTWFEKE/p1706543947795599\nProblem: This error can often be seen after solving the error mentioned in 2.2.4. The error can be found in Mage version 0.9.61 and is a side-effect of the update of the code for data-loader blocks.\nNote: Mage 0.9.62 has been released, as of Feb 5 2024. Please recheck. Solution below may be obsolete\nSolution: Using a \u201cfixed\u201d version of the docker container\nPull updated docker image from docker-hub\nmageai/mageaidocker pull:alpha\nUpdate docker-compose.yaml\nversion: '3'\nservices:\nmagic:\nimage: mageai/mageai:alpha <--- instead of \u201clatest\u201d-tag\ndocker-compose up\nThe original Error is still present, but the SQL-query will return the desired result:\n--------------------------------------------------------------------------------------",
"section": "Module 2: Workflow Orchestration",
"question": "Mage - 2.2.4 IndexError: list index out of range"
},
{
"text": "Add\nif not path.parent.is_dir():\npath.parent.mkdir(parents=True)\npath = Path(path).as_posix()\nsee:\nhttps://datatalks-club.slack.com/archives/C01FABYF2RG/p1675774214591809?thread_ts=1675768839.028879&cid=C01FABYF2RG",
"section": "Module 2: Workflow Orchestration",
"question": "2.2.6 OSError: Cannot save file into a non-existent directory: '..\\\\..\\\\data\\\\yellow'\\n\")"
},
{
"text": "The video DE Zoomcamp 2.2.7 is missing the actual deployment of Mage using Terraform to GCP. The steps for the deployment were not covered in the video.\nI successfully deployed it and wanted to share some key points:\nIn variables.tf, set the project_id default value to your GCP project ID.\nEnable the Cloud Filestore API:\nVisit the Google Cloud Console.to\nNavigate to \"APIs & Services\" > \"Library.\"\nSearch for \"Cloud Filestore API.\"\nClick on the API and enable it.\nTo perform the deployment:\nterraform init\nterraform apply\nPlease note that during the terraform apply step, Terraform will prompt you to enter the PostgreSQL password. After that, it will ask for confirmation to proceed with the deployment. Review the changes, type 'yes' when prompted, and press Enter.",
"section": "Module 2: Workflow Orchestration",
"question": "GCP - 2.2.7d Deploying Mage to GCP"
},
{
"text": "If you want to rune multiple docker containers from different directories. Then make sure to change the port mappings in the docker-compose.yml file.\nports:\n- 8088:6789\nThe 8088 port in above case is hostport, where mage will run on your local machine. You can customize this as long as the port is available. If you are running on VM, make sure to forward the port too. You need to keep the container port to 6789 as this is the port where mage is running.\nGCP - 2.2.7d Deploying Mage to Google Cloud\nWhile terraforming all the resources inside a VM created in GCS the following error is shown.\nError log:\nmodule.lb-http.google_compute_backend_service.default[\"default\"]: Creating...\n\u2577\n\u2502 Error: Error creating GlobalAddress: googleapi: Error 403: Request had insufficient authentication scopes.\n\u2502 Details:\n\u2502 [\n\u2502 {\n\u2502 \"@type\": \"type.googleapis.com/google.rpc.ErrorInfo\",\n\u2502 \"domain\": \"googleapis.com\",\n\u2502 \"metadatas\": {\n\u2502 \"method\": \"compute.beta.GlobalAddressesService.Insert\",\n\u2502 \"service\": \"compute.googleapis.com\"\n\u2502 },\n\u2502 \"reason\": \"ACCESS_TOKEN_SCOPE_INSUFFICIENT\"\n\u2502 }\n\u2502 ]\n\u2502\n\u2502 More details:\n\u2502 Reason: insufficientPermissions, Message: Insufficient Permission\nThis error might happen when you are using a VM inside GCS. To use the Google APIs from a GCP virtual machine you need to add the cloud platform scope (\"https://www.googleapis.com/auth/cloud-platform\") to your VM when it is created.\nSince ours is already created you can just stop it and change the permissions. You can do it in the console, just go to \"EDIT\", g99o all the way down until you find \"Cloud API access scopes\". There you can \"Allow full access to all Cloud APIs\". I did this and all went smoothly generating all the resources needed. 
Hope it helps if you encounter this same error.\nResources: https://stackoverflow.com/questions/35928534/403-request-had-insufficient-authentication-scopes-during-gcloud-container-clu",
"section": "Module 2: Workflow Orchestration",
"question": "Ruuning Multiple Mage instances in Docker from different directories"
},
{
"text": "If you are on the free trial account on GCP you will face this issue when trying to deploy the infrastructures with terraform. This service is not available for this kind of account.\nThe solution I found was to delete the load_balancer.tf file and to comment or delete the rows that differentiate it on the main.tf file. After this just do terraform destroy to delete any infrastructure created on the fail attempts and re-run the terraform apply.\nCode on main.tf to comment/delete:\nLine 166, 167, 168",
"section": "Module 2: Workflow Orchestration",
"question": "GCP - 2.2.7d Load Balancer Problem (Security Policies quota)"
},
{
"text": "If you get the following error\nYou have to edit variables.tf on the gcp folder, set your project-id and region and zones properly. Then, run terraform apply again.\nYou can find correct regions/zones here: https://cloud.google.com/compute/docs/regions-zones\nDeploying MAGE to GCP with Terraform via the VM (2.2.7)\nFYI - It can take up to 20 minutes to deploy the MAGE Terraform files if you are using a GCP Virtual Machine. It is normal, so don\u2019t interrupt the process or think it\u2019s taking too long. If you have, make sure you run a terraform destroy before trying again as you will have likely partially created resources which will cause errors next time you run `terraform apply`.\n`terraform destroy` may not completely delete partial resources - go to Google Cloud Console and use the search bar at the top to search for the \u2018app.name\u2019 you declared in your variables.tf file; this will list all resources with that name - make sure you delete them all before running `terraform apply` again.\nWhy are my GCP free credits going so fast? MAGE .tf files - Terraform Destroy not destroying all Resources\nI checked my GCP billing last night & the MAGE Terraform IaC didn't destroy a GCP Resource called Filestore as \u2018mage-data-prep- it has been costing \u00a35.01 of my free credits each day I now have \u00a3151 left - Alexey has assured me that This amount WILL BE SUFFICIENT funds to finish the course. Note to anyone who had issues deploying the MAGE terraform code: check your billing account to see what you're being charged for (main menu - billing) (even if it's your free credits) and run a search for 'mage-data-prep' in the top bar just to be sure that your resources have been destroyed - if any come up delete them.",
"section": "Module 2: Workflow Orchestration",
"question": "GCP - 2.2.7d Part 2 - Getting error when you run terraform apply"
},
{
"text": "```\n\u2502 Error: Error creating Connector: googleapi: Error 403: Permission 'vpcaccess.connectors.create' denied on resource '//vpcaccess.googleapis.com/projects/<ommit>/locations/us-west1' (or it may not exist).\n\u2502 Details:\n\u2502 [\n\u2502 {\n\u2502 \"@type\": \"type.googleapis.com/google.rpc.ErrorInfo\",\n\u2502 \"domain\": \"vpcaccess.googleapis.com\",\n\u2502 \"metadata\": {\n\u2502 \"permission\": \"vpcaccess.connectors.create\",\n\u2502 \"resource\": \"projects/<ommit>/locations/us-west1\"\n\u2502 },\n\u2502 \"reason\": \"IAM_PERMISSION_DENIED\"\n\u2502 }\n\u2502 ]\n\u2502\n\u2502 with google_vpc_access_connector.connector,\n\u2502 on fs.tf line 19, in resource \"google_vpc_access_connector\" \"connector\":\n\u2502 19: resource \"google_vpc_access_connector\" \"connector\" {\n\u2502\n```\nSolution: Add Serverless VPC Access Admin to Service Account.\nLine 148",
"section": "Module 2: Workflow Orchestration",
"question": "Question: Permission 'vpcaccess.connectors.create'"
},
{
"text": "Git won\u2019t push an empty folder to GitHub, so if you put a file in that folder and then push, then you should be good to go.\nOr - in your code- make the folder if it doesn\u2019t exist using Pathlib as shown here: https://stackoverflow.com/a/273227/4590385.\nFor some reason, when using github storage, the relative path for writing locally no longer works. Try using two separate paths, one full path for the local write, and the original relative path for GCS bucket upload.",
"section": "Module 2: Workflow Orchestration",
"question": "File Path: Cannot save file into a non-existent directory: 'data/green'"
},
{
"text": "The green dataset contains lpep_pickup_datetime while the yellow contains tpep_pickup_datetime. Modify the script(s) depending on the dataset as required.",
"section": "Module 2: Workflow Orchestration",
"question": "No column name lpep_pickup_datetime / tpep_pickup_datetime"
},
{
"text": "pd.read_csv\ndf_iter = pd.read_csv(dataset_url, iterator=True, chunksize=100000)\nThe data needs to be appended to the parquet file using the fastparquet engine\ndf.to_parquet(path, compression=\"gzip\", engine='fastparquet', append=True)",
"section": "Module 2: Workflow Orchestration",
"question": "Process to download the VSC using Pandas is killed right away"
},
{
"text": "denied: requested access to the resource is denied\nThis can happen when you\nHaven't logged in properly to Docker Desktop (use docker login -u \"myusername\")\nHave used the wrong username when pushing to docker images. Use the same one as your username and as the one you build on\ndocker image build -t <myusername>/<imagename>:<tag>\ndocker image push <myusername>/<imagename>:<tag>",
"section": "Module 2: Workflow Orchestration",
"question": "Push to docker image failure"
},
{
"text": "16:21:35.607 | INFO | Flow run 'singing-malkoha' - Executing 'write_bq-b366772c-0' immediately...\nKilled\nSolution: You probably are running out of memory on your VM and need to add more. For example, if you have 8 gigs of RAM on your VM, you may want to expand that to 16 gigs.",
"section": "Module 2: Workflow Orchestration",
"question": "Flow script fails with \u201ckilled\u201d message:"
},
{
"text": "After playing around with prefect for a while this can happen.\nSsh to your VM and run sudo du -h --block-size=G | sort -n -r | head -n 30 to see which directory needs the most space.\nMost likely it will be \u2026/.prefect/storage, where your cached flows are stored. You can delete older flows from there. You also have to delete the corresponding flow in the UI, otherwise it will throw you an error, when you try to run your next flow.\nSSL Certificate Verify: (I got it when trying to run flows on MAC): urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED]\npip install certifi\n/Applications/Python\\ {ver}/Install\\ Certificates.command\nor\nrunning the \u201cInstall Certificate.command\u201d inside of the python{ver} folder",
"section": "Module 2: Workflow Orchestration",
"question": "GCP VM: Disk Space is full"
},
{
"text": "It means your container consumed all available RAM allocated to it. It can happen in particular when working on Question#3 in the homework as the dataset is relatively large and containers eat a lot of memory in general.\nI would recommend restarting your computer and only starting the necessary processes to run the container. If that doesn\u2019t work, allocate more resources to docker. If also that doesn\u2019t work because your workstation is a potato, you can use an online compute environment service like GitPod, which is free under under 50 hours / month of use.",
"section": "Module 2: Workflow Orchestration",
"question": "Docker: container crashed with status code 137."
},
{
"text": "In Q3 there was a task to run the etl script from web to GCS. The problem was, it wasn\u2019t really an ETL straight from web to GCS, but it was actually a web to local storage to local memory to GCS over network ETL. Yellow data is about 100 MB each per month compressed and ~700 MB after uncompressed on memory\nThis leads to a problem where i either got a network type error because my not so good 3rd world internet or i got my WSL2 crashed/hanged because out of memory error and/or 100% resource usage hang.\nSolution:\nif you have a lot of time at hand, try compressing it to parquet and writing it to GCS with the timeout argument set to a really high number (the default os 60 seconds)\nthe yellow taxi data for feb 2019 is about 100MB as parquet file\ngcp_cloud_storage_bucket_block.upload_from_path(\nfrom_path=f\"{path}\",\nto_path=path,\ntimeout=600\n)",
"section": "Module 2: Workflow Orchestration",
"question": "Timeout due to slow upload internet"
},
{
"text": "This error occurs when you try to re-run the export block, of the transformed green_taxi data to PostgreSQL.\nWhat you\u2019ll need to do is to drop the table using SQL in Mage (screenshot below).\nYou should be able to re-run the block successfully after dropping the table.",
"section": "Module 2: Workflow Orchestration",
"question": "UndefinedColumn: column \"ratecode_id\", \"rate_code_id\" \u201cvendor_id\u201d, \u201cpu_location_id\u201d, \u201cdo_location_id\u201d of relation \"green_taxi\" does not exist - Export transformed green_taxi data to PostgreSQL"
},
{
"text": "SettingWithCopyWarning:\nA value is trying to be set on a copy of a slice from a DataFrame.\nUse the data.loc[] = value syntax instead of df[] = value to ensure that the new column is being assigned to the original dataframe instead of a copy of a dataframe or a series.",
"section": "Module 2: Workflow Orchestration",
"question": "Homework - Q3 SettingWithCopyWarning Error:"
},
{
"text": "CSV Files are very big in nyc data, so we instead of using Pandas/Python kernel , we can try Pyspark Kernel\nDocumentation of Mage for using pyspark kernel: https://docs.mage.ai/integrations/spark-pyspark\n?",
"section": "Module 2: Workflow Orchestration",
"question": "Since I was using slow laptop, and we have so big csv files, I used pyspark kernel in mage instead of python, How to do it?"
},
{
"text": "So we will first delete the connection between blocks then we can remove the connection.",
"section": "Module 2: Workflow Orchestration",
"question": "I got an error when I was deleting BLOCK IN A PIPELINE"
},
{
"text": "While Editing the Pipeline Name It throws permission denied error.\n(Work around)In that case proceed with the work and save later on revisit it will let you edit.",
"section": "Module 2: Workflow Orchestration",
"question": "Mage UI won\u2019t let you edit the Pipeline name?"
},
{
"text": "Solution n\u00b01 if you want to download everything :\n```\nimport pyarrow as pa\nimport pyarrow.parquet as pq\nfrom pyarrow.fs import GcsFileSystem\n\u2026\n@data_loader\ndef load_data(*args, **kwargs):\n bucket_name = YOUR_BUCKET_NAME_HERE'\n blob_prefix = 'PATH / TO / WHERE / THE / PARTITIONS / ARE'\n root_path = f\"{bucket_name}/{blob_prefix}\"\npa_table = pq.read_table(\n source=root_path,\n filesystem=GcsFileSystem(), \n )\n\n return pa_table.to_pandas()\nSolution n\u00b02 if you want to download only some dates :\n@data_loader\ndef load_data(*args, **kwargs):\ngcs = pa.fs.GcsFileSystem()\nbucket_name = 'YOUR_BUCKET_NAME_HERE'\nblob_prefix = ''PATH / TO / WHERE / THE / PARTITIONS / ARE''\nroot_path = f\"{bucket_name}/{blob_prefix}\"\npa_dataset = pq.ParquetDataset(\npath_or_paths=root_path,\nfilesystem=gcs,\nfilters=[('lpep_pickup_date', '>=', '2020-10-01'), ('lpep_pickup_date', '<=', '2020-10-31')]\n)\nreturn pa_dataset.read().to_pandas()\n# More information about the pq.Parquet.Dataset : Encapsulates details of reading a complete Parquet dataset possibly consisting of multiple files and partitions in subdirectories. Documentation here :\nhttps://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetDataset.html#pyarrow.parquet.ParquetDataset\nERROR: UndefinedColumn: column \"vendor_id\" of relation \"green_taxi\" does not exist\nTwo possible solutions both of them work in the same way.\nOpen up a Data Loader connect using SQL - RUN the command \n`DROP TABLE mage.green_taxi`\nElse, Open up a Data Extractor of SQL - increase the rows to above the number of rows in the dataframe (you can find that in the bottom of the transformer block) change the Write Policy to `Replace` and run the SELECT statement",
"section": "Module 2: Workflow Orchestration",
"question": "How do I make Mage load the partitioned files that we created on 2.2.4, to load them into BigQuery ?"
},
{
"text": "All mage files are in your /home/src/folder where you saved your credentials.json so you should be able to access them locally. You will see a folder for \u2018Pipelines\u2019, 'data loaders', 'data transformers' & 'data exporters' - inside these will be the .py or .sql files for the blocks you created in your pipeline.\nRight click & \u2018download\u2019 the pipeline itself to your local machine (which gives you metadata, pycache and other files)\nAs above, download each .py/.sql file that corresponds to each block you created for the pipeline. You'll find these under 'data loaders', 'data transformers' 'data exporters'\nMove the downloaded files to your GitHub repo folder & commit your changes.",
"section": "Module 2: Workflow Orchestration",
"question": "Git - What Files Should I Submit for Homework 2 & How do I get them out of MAGE:"
},
{
"text": "Assuming you downloaded the Mage repo in the week 2 folder of the Data Engineering Zoomcamp, you might want to include your mage copy, demo pipelines and homework within your personal copy of the Data Engineering Zoomcamp repo. This will not work by default, because GitHub sees them as two separate repositories, and one does not track the other. To add the Mage files to your main DE Zoomcamp repo, you will need to:\nMove the contents of the .gitignore file in your main .gitignore.\nUse the terminal to cd into the Mage folder and:\nrun \u201cgit remote remove origin\u201d to de-couple the Mage repo,\nrun \u201crm -rf .git\u201d to delete local git files,\nrun \u201cgit add .\u201d to add the current folder as changes to stage, commit and push.",
"section": "Module 2: Workflow Orchestration",
"question": "Git - How do I include the files in the Mage repo (including exercise files and homework) in a personal copy of the Data Engineering Zoomcamp repo?"
},
{
"text": "When try to add three assertions:\nvendor_id is one of the existing values in the column (currently)\npassenger_count is greater than 0\ntrip_distance is greater than 0\nto test_output, I got ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all(). Below is my code:\ndata_filter = (data['passenger_count'] > 0) and (data['trip_distance'] > 0)\nAfter looking for solutions at Stackoverflow, I found great discussion about it. So I changed my code into:\ndata_filter = (data['passenger_count'] > 0) & (data['trip_distance'] > 0)",
"section": "Module 2: Workflow Orchestration",
"question": "Got ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()"
},
{
"text": "This happened when I just booted up my PC, continuing from the progress I was doing from yesterday.\nAfter cd-ing into your directory, and running docker compose up , the web interface for the Mage shows, but the files that I had yesterday was gone.\nIf your files are gone, go ahead and close the web interface, and properly shutting down the mage docker compose by doing Ctrl + C once. Try running it again. This worked for me more than once (yes the issue persisted with my PC twice)\nAlso, you should check if you\u2019re in the correct repository before doing docker compose up . This was discussed in the Slack #course-data-engineering channel",
"section": "Module 2: Workflow Orchestration",
"question": "Mage AI Files are Gone/disappearing"
},
{
"text": "The above errors due to \u201c at the trailing side and it need to be modified with \u2018 quotes at both ends\nKrishna Anand",
"section": "Module 2: Workflow Orchestration",
"question": "Mage - Errors in io.config.yaml file"
},
{
"text": "Problem: The following error occurs when attempting to export data from Mage to a GCS bucket using pyarrow suggesting Mage doesn\u2019t have the necessary permissions to access the specified GCP credentials .json file.\nArrowException: Unknown error: google::cloud::Status(UNKNOWN: Permanent error GetBucketMetadata: Could not create a OAuth2 access token to authenticate the request. The request was not sent, as such an access token is required to complete the request successfully. Learn more about Google Cloud authentication at https://cloud.google.com/docs/authentication. The underlying error message was: Cannot open credentials file /home/src/...\nSolution: Inside the Mage app:\nCreate a credentials folder (e.g. gcp-creds) within the magic-zoomcamp folder\nIn the credentials folder create a .json key file (e.g. mage-gcp-creds.json)\nCopy/paste GCP service account credentials into the .json key file and save\nUpdate code to point to this file. E.g.\nenviron['GOOGLE_APPLICATION_CREDENTIALS'] = '/home/src/magic-zoomcamp/gcp-creds/mage-gcp-creds.json'",
"section": "Module 2: Workflow Orchestration",
"question": "Mage - ArrowException Cannot open credentials file"
},
{
"text": "Oserror: google::cloud::status(unavailable: retry policy exhausted getbucketmetadata: could not create a OAuth2 access token to authenticate the request. the request was not sent, as such an access token is required to complete the request successfully. learn more about google cloud authentication at https://cloud.google.com/docs/authentication. the underlying error message was: performwork() - curl error [6]=couldn't resolve host name)",
"section": "Module 2: Workflow Orchestration",
"question": "Mage - OSError"
},
{
"text": "Problem: The following error occurs when attempting to export data from Mage to a GCS bucket. Assigned service account doesn\u2019t have the necessary permissions access Google Cloud Storage Bucket\nPermissionError: [Errno 13] google::cloud::Status(PERMISSION_DENIED: Permanent error GetBucketMetadata:... .iam.gserviceaccount.com does not have storage.buckets.get access to the Google Cloud Storage bucket. Permission 'storage.buckets.get' denied on resource (or it may not exist). error_info={reason=forbidden, domain=global, metadata={http_status_code=403}}). Detail: [errno 13] Permission denied\nSolution: Add Cloud Storage Admin role to the service account:\nGo to project in Google Cloud Console>IAM & Admin>IAM\nClick Edit principal (pencil symbol) to the right of the service account you are using\nClick + ADD ANOTHER ROLE\nSelect Cloud Storage>Storage Admin\nClick Save",
"section": "Module 2: Workflow Orchestration",
"question": "Mage - PermissionError service account does not have storage.buckets.get access to the Google Cloud Storage bucket"
},
{
"text": "1. Make sure your pyspark script is ready to be send to Dataproc cluster\n2. Create a Dataproc Cluster in GCP Console\n3. Make sure to edit the service account and add new role - Dataproc Editor\n4. Copy the python script ./notebooks/pyspark_script.py and place it under GCS bucket path\n5. Make sure gcloud cli is installed either in Mage manually or via your Dockerfile and docker-compose files. This is needed to let Mage access google Dataproc and the script it needs to execute. Refer - Installing the latest gcloud CLI\n6. Use the Bigquery/Dataproc script mentioned here - https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/05-batch/code/cloud.md . Use Mage to trigger the query",
"section": "Module 3: Data Warehousing",
"question": "Trigger Dataproc from Mage"
},
{
"text": "A:\n1 solution) Add -Y flag, so that apt-get automatically agrees to install additional packages\n2) Use python ZipFile package, which is included in all modern python distributions",
"section": "Module 3: Data Warehousing",
"question": "Docker-compose takes infinitely long to install zip unzip packages for linux, which are required to unpack datasets"
},
{
"text": "Make sure to use Nullable dataTypes, such as Int64 when appliable.",
"section": "Module 3: Data Warehousing",
"question": "GCS Bucket - error when writing data from web to GCS:"
},
{
"text": "Ultimately, when trying to ingest data into a BigQuery table, all files within a given directory must have the same schema.\nWhen dealing for example with the FHV Datasets from 2019, however (see image below), one can see that the files for '2019-05', and 2019-06, have the columns \"PUlocationID\" and \"DOlocationID\" as Integers, while for the period of '2019-01' through '2019-04', the same column is defined as FLOAT.\nSo while importing these files as parquet to BigQuery, the first one will be used to define the schema of the table, while all files following that will be used to append data on the existing table. Which means, they must all follow the very same schema of the file that created the table.\nSo, in order to prevent errors like that, make sure to enforce the data types for the columns on the DataFrame before you serialize/upload them to BigQuery. Like this:\npd.read_csv(\"path_or_url\").astype({\n\t\"col1_name\": \"datatype\",\t\n\t\"col2_name\": \"datatype\",\t\n\t...\t\t\t\t\t\n\t\"colN_name\": \"datatype\" \t\n})",
"section": "Module 3: Data Warehousing",
"question": "GCS Bucket - Failed to create table: Error while reading data, error message: Parquet column 'XYZ' has type INT which does not match the target cpp_type DOUBLE. File: gs://path/to/some/blob.parquet"
},
{
"text": "If you receive the error gzip.BadGzipFile: Not a gzipped file (b'\\n\\n'), this is because you have specified the wrong URL to the FHV dataset. Make sure to use https://github.com/DataTalksClub/nyc-tlc-data/releases/download/fhv/{dataset_file}.csv.gz\nEmphasising the \u2018/releases/download\u2019 part of the URL.",
"section": "Module 3: Data Warehousing",
"question": "GCS Bucket - Fix Error when importing FHV data to GCS"
},
{
"text": "Krishna Anand",
"section": "Module 3: Data Warehousing",
"question": "GCS Bucket - Load Data From URL list in to GCP Bucket"
},
{
"text": "Check the Schema\nYou might have a wrong formatting\nTry to upload the CSV.GZ files without formatting or going through pandas via wget\nSee this Slack conversation for helpful tips",
"section": "Module 3: Data Warehousing",
"question": "GCS Bucket - I query my dataset and get a Bad character (ASCII 0) error?"
},
{
"text": "Run the following command to check if \u201cBigQuery Command Line Tool\u201d is installed or not: gcloud components list\nYou can also use bq.cmd instead of bq to make it work.",
"section": "Module 3: Data Warehousing",
"question": "GCP BQ - \u201cbq: command not found\u201d"
},
{
"text": "Use big queries carefully,\nI created by bigquery dataset on an account where my free trial was exhausted, and got a bill of $80.\nUse big query in free credits and destroy all the datasets after creation.\nCheck your Billing daily! Especially if you\u2019ve spinned up a VM.",
"section": "Module 3: Data Warehousing",
"question": "GCP BQ - Caution in using bigquery:no"
},
{
"text": "Be careful when you create your resources on GCP, all of them have to share the same Region in order to allow load data from GCS Bucket to BigQuery. If you forgot it when you created them, you can create a new dataset on BigQuery using the same Region which you used on your GCS Bucket.\nThis means that your GCS Bucket and the BigQuery dataset are placed in different regions. You have to create a new dataset inside BigQuery in the same region with your GCS bucket and store the data in the newly created dataset.",
"section": "Module 3: Data Warehousing",
"question": "GCP BQ - Cannot read and write in different locations: source: EU, destination: US - Loading data from GCS into BigQuery (different Region):"
},
{
"text": "Make sure to create the BigQuery dataset in the very same location that you've created the GCS Bucket. For instance, if your GCS Bucket was created in `us-central1`, then BigQuery dataset must be created in the same region (us-central1, in this example)",
"section": "Module 3: Data Warehousing",
"question": "GCP BQ - Cannot read and write in different locations: source: <REGION_HERE>, destination: <ANOTHER_REGION_HERE>"
},
{
"text": "By the way, this isn\u2019t a problem/solution, but a useful hint:\nPlease, remember to save your progress in BigQuery SQL Editor.\nI was almost finishing the homework, when my Chrome Tab froze and I had to reload it. Then I lost my entire SQL script.\nSave your script from time to time. Just click on the button at the top bar. Your saved file will be available on the left panel.\nAlternatively, you can copy paste your queries into an .sql file in your preferred editor (Notepad++, VS Code, etc.). Using the .sql extension will provide convenient color formatting.",
"section": "Module 3: Data Warehousing",
"question": "GCP BQ - Remember to save your queries"
},
{
"text": "Ans : While real-time analytics might not be explicitly mentioned, BigQuery has real-time data streaming capabilities, allowing for potential integration in future project iterations.",
"section": "Module 3: Data Warehousing",
"question": "GCP BQ - Can I use BigQuery for real-time analytics in this project?"
},
{
"text": "could not parse 'pickup_datetime' as timestamp for field pickup_datetime (position 2)\nThis error is caused by invalid data in the timestamp column. A way to identify the problem is to define the schema from the external table using string datatype. This enables the queries to work at which point we can filter out the invalid rows from the import to the materialised table and insert the fields with the timestamp data type.",
"section": "Module 3: Data Warehousing",
"question": "GCP BQ - Unable to load data from external tables into a materialized table in BigQuery due to an invalid timestamp error that are added while appending data to the file in Google Cloud Storage"
},
{
"text": "Background:\n`pd.read_parquet`\n`pd.to_datetime`\n`pq.write_to_dataset`\nReference:\nhttps://stackoverflow.com/questions/48314880/are-parquet-file-created-with-pyarrow-vs-pyspark-compatible\nhttps://stackoverflow.com/questions/57798479/editing-parquet-files-with-python-causes-errors-to-datetime-format\nhttps://www.reddit.com/r/bigquery/comments/16aoq0u/parquet_timestamp_to_bq_coming_across_as_int/?share_id=YXqCs5Jl6hQcw-kg6-VgF&utm_content=1&utm_medium=ios_app&utm_name=ioscss&utm_source=share&utm_term=1\nSolution:\nAdd `use_deprecated_int96_timestamps=True` to `pq.write_to_dataset` function, like below\npq.write_to_dataset(\ntable,\nroot_path=root_path,\nfilesystem=gcs,\nuse_deprecated_int96_timestamps=True\n# Write timestamps to INT96 Parquet format\n)",
"section": "Module 3: Data Warehousing",
"question": "GCP BQ - Error Message in BigQuery: annotated as a valid Timestamp, please annotate it as TimestampType(MICROS) or TimestampType(MILLIS)"
},
{
"text": "Solution:\nIf you\u2019re using Mage, in the last Data Exporter that writes to Google Cloud Storage use PyArrow to generate the Parquet file with the correct logical type for the datetime columns, otherwise they won't be converted to timestamp when loaded by BigQuery later on.\nimport pyarrow as pa\nimport pyarrow.parquet as pq\nimport os\nif 'data_exporter' not in globals():\nfrom mage_ai.data_preparation.decorators import data_exporter\n# Replace with the location of your service account key JSON file.\nos.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/home/src/personal-gcp.json'\nbucket_name = \"<YOUR_BUCKET_NAME>\"\nobject_key = 'nyc_taxi_data_2022.parquet'\nwhere = f'{bucket_name}/{object_key}'\n@data_exporter\ndef export_data(data, *args, **kwargs):\ntable = pa.Table.from_pandas(data, preserve_index=False)\ngcs = pa.fs.GcsFileSystem()\npq.write_table(\ntable,\nwhere,\n# Convert integer columns in Epoch milliseconds\n# to Timestamp columns in microseconds ('us') so\n# they can be loaded into BigQuery with the right\n# data type\ncoerce_timestamps='us',\nfilesystem=gcs\n)\nSolution 2:\nIf you\u2019re using Mage, in the last Data Exporter that writes to Google Cloud Storage, provide PyArrow with explicit schema to generate the Parquet file with the correct logical type for the datetime columns, otherwise they won't be converted to timestamp when loaded by BigQuery later on.\nschema = pa.schema([\n('vendor_id', pa.int64()),\n('lpep_pickup_datetime', pa.timestamp('ns')),\n('lpep_dropoff_datetime', pa.timestamp('ns')),\n('store_and_fwd_flag', pa.string()),\n('ratecode_id', pa.int64()),\n('pu_location_id', pa.int64()),\n('do_location_id', pa.int64()),\n('passenger_count', pa.int64()),\n('trip_distance', pa.float64()),\n('fare_amount', pa.float64()),\n('extra', pa.float64()),\n('mta_tax', pa.float64()),\n('tip_amount', pa.float64()),\n('tolls_amount', pa.float64()),\n('improvement_surcharge', pa.float64()),\n('total_amount', pa.float64()),\n('payment_type', 
pa.int64()),\n('trip_type', pa.int64()),\n('congestion_surcharge', pa.float64()),\n('lpep_pickup_month', pa.int64())\n])\ntable = pa.Table.from_pandas(data, schema=schema)",
"section": "Module 3: Data Warehousing",
"question": "GCP BQ - Datetime columns in Parquet files created from Pandas show up as integer columns in BigQuery"
},
{
"text": "Reference:\nhttps://cloud.google.com/bigquery/docs/external-data-cloud-storage\nSolution:\nfrom google.cloud import bigquery\n# Set table_id to the ID of the table to create\ntable_id = f\"{project_id}.{dataset_name}.{table_name}\"\n# Construct a BigQuery client object\nclient = bigquery.Client()\n# Set the external source format of your table\nexternal_source_format = \"PARQUET\"\n# Set the source_uris to point to your data in Google Cloud\nsource_uris = [ f'gs://{bucket_name}/{object_key}/*']\n# Create ExternalConfig object with external source format\nexternal_config = bigquery.ExternalConfig(external_source_format)\n# Set source_uris that point to your data in Google Cloud\nexternal_config.source_uris = source_uris\nexternal_config.autodetect = True\ntable = bigquery.Table(table_id)\n# Set the external data configuration of the table\ntable.external_data_configuration = external_config\ntable = client.create_table(table) # Make an API request.\nprint(f'Created table with external source: {table_id}')\nprint(f'Format: {table.external_data_configuration.source_format}')",
"section": "Module 3: Data Warehousing",
"question": "GCP BQ - Create External Table using Python"
},
{
"text": "Reference:\nhttps://stackoverflow.com/questions/60941726/can-bigquery-api-overwrite-existing-table-view-with-create-table-tables-inser\nSolution:\nCombine with \u201cCreate External Table using Python\u201d, use it before \u201cclient.create_table\u201d function.\ndef tableExists(tableID, client):\n\"\"\"\nCheck if a table already exists using the tableID.\nreturn : (Boolean)\n\"\"\"\ntry:\ntable = client.get_table(tableID)\nreturn True\nexcept Exception as e: # NotFound:\nreturn False",
"section": "Module 3: Data Warehousing",
"question": "GCP BQ - Check BigQuery Table Exist And Delete"
},
{
"text": "To avoid this error you can upload data from Google Cloud Storage to BigQuery through BigQuery Cloud Shell using the command:\n$ bq load --autodetect --allow_quoted_newlines --source_format=CSV dataset_name.table_name \"gs://dtc-data-lake-bucketname/fhv/fhv_tripdata_2019-*.csv.gz\"",
"section": "Module 3: Data Warehousing",
"question": "GCP BQ - Error: Missing close double quote (\") character"
},
{
"text": "Solution: This problem arises if your gcs and bigquery storage is in different regions.\nOne potential way to solve it:\nGo to your google cloud bucket and check the region in field named \u201cLocation\u201d\nNow in bigquery, click on three dot icon near your project name and select create dataset.\nIn region filed choose the same regions as you saw in your google cloud bucket",
"section": "Module 3: Data Warehousing",
"question": "GCP BQ - Cannot read and write in different locations: source: asia-south2, destination: US"
},
{
"text": "There are multiple benefits of using Cloud Functions to automate tasks in Google Cloud.\nUse below Cloud Function python script to load files directly to BigQuery. Use your project id, dataset id & table id as defined by you.\nimport tempfile\nimport requests\nimport logging\nfrom google.cloud import bigquery\ndef hello_world(request):\n# table_id = <project_id.dataset_id.table_id>\ntable_id = 'de-zoomcap-project.dezoomcamp.fhv-2019'\n# Create a new BigQuery client\nclient = bigquery.Client()\nfor month in range(4, 13):\n# Define the schema for the data in the CSV.gz files\nurl = 'https://github.com/DataTalksClub/nyc-tlc-data/releases/download/fhv/fhv_tripdata_2019-{:02d}.csv.gz'.format(month)\n# Download the CSV.gz file from Github\nresponse = requests.get(url)\n# Create new table if loading first month data else append\nwrite_disposition_string = \"WRITE_APPEND\" if month > 1 else \"WRITE_TRUNCATE\"\n# Defining LoadJobConfig with schema of table to prevent it from changing with every table\njob_config = bigquery.LoadJobConfig(\nschema=[\nbigquery.SchemaField(\"dispatching_base_num\", \"STRING\"),\nbigquery.SchemaField(\"pickup_datetime\", \"TIMESTAMP\"),\nbigquery.SchemaField(\"dropOff_datetime\", \"TIMESTAMP\"),\nbigquery.SchemaField(\"PUlocationID\", \"STRING\"),\nbigquery.SchemaField(\"DOlocationID\", \"STRING\"),\nbigquery.SchemaField(\"SR_Flag\", \"STRING\"),\nbigquery.SchemaField(\"Affiliated_base_number\", \"STRING\"),\n],\nskip_leading_rows=1,\nwrite_disposition=write_disposition_string,\nautodetect=True,\nsource_format=\"CSV\",\n)\n# Load the data into BigQuery\n# Create a temporary file to prevent the exception- AttributeError: 'bytes' object has no attribute 'tell'\"\nwith tempfile.NamedTemporaryFile() as f:\nf.write(response.content)\nf.seek(0)\njob = client.load_table_from_file(\nf,\ntable_id,\nlocation=\"US\",\njob_config=job_config,\n)\njob.result()\nlogging.info(\"Data for month %d successfully loaded into table %s.\", month, 
table_id)\nreturn 'Data loaded into table {}.'.format(table_id)",
"section": "Module 3: Data Warehousing",
"question": "GCP BQ - Tip: Using Cloud Function to read csv.gz files from github directly to BigQuery in Google Cloud:"
},
{
"text": "You need to uncheck cache preferences in query settings",
"section": "Module 3: Data Warehousing",
"question": "GCP BQ - When querying two different tables external and materialized you get the same result when count(distinct(*))"
},
{
"text": "Problem: When you inject data into GCS using Pandas, there is a chance that some dataset has missing values on DOlocationID and PUlocationID. Pandas by default will cast these columns as float data type, causing inconsistent data type between parquet in GCS and schema defined in big query. You will see something like this:\nSolution:\nFix the data type issue in data pipeline\nBefore injecting data into GCS, use astype and Int64 (which is different from int64 and accept both missing value and integer exist in the column) to cast the columns.\nSomething like:\ndf[\"PUlocationID\"] = df.PUlocationID.astype(\"Int64\")\ndf[\"DOlocationID\"] = df.DOlocationID.astype(\"Int64\")\nNOTE: It is best to define the data type of all the columns in the Transformation section of the ETL pipeline before loading to BigQuery",
"section": "error: Error while reading table: trips_data_all.external_fhv_tripdata, error message: Parquet column 'DOlocationID' has type INT64 which does not match the target cpp_type DOUBLE.",
"question": "GCP BQ - How to handle type error from big query and parquet data?"
},
{
"text": "Problem occurs when misplacing content after fro``m clause in BigQuery SQLs.\nCheck to remove any extra apaces or any other symbols, keep in lowercases, digits and dashes only",
"section": "error: Error while reading table: trips_data_all.external_fhv_tripdata, error message: Parquet column 'DOlocationID' has type INT64 which does not match the target cpp_type DOUBLE.",
"question": "GCP BQ - Invalid project ID . Project IDs must contain 6-63 lowercase letters, digits, or dashes. Some project"
},
{
"text": "No. Based on the documentation for Bigquery, it does not support more than 1 column to be partitioned.\n[source]",
"section": "error: Error while reading table: trips_data_all.external_fhv_tripdata, error message: Parquet column 'DOlocationID' has type INT64 which does not match the target cpp_type DOUBLE.",
"question": "GCP BQ - Does BigQuery support multiple columns partition?"
},
{
"text": "Error Message:\nPARTITION BY expression must be DATE(<timestamp_column>), DATE(<datetime_column>), DATETIME_TRUNC(<datetime_column>, DAY/HOUR/MONTH/YEAR), a DATE column, TIMESTAMP_TRUNC(<timestamp_column>, DAY/HOUR/MONTH/YEAR), DATE_TRUNC(<date_column>, MONTH/YEAR), or RANGE_BUCKET(<int64_column>, GENERATE_ARRAY(<int64_value>, <int64_value>[, <int64_value>]))\nSolution:\nConvert the column to datetime first.\ndf[\"pickup_datetime\"] = pd.to_datetime(df[\"pickup_datetime\"])\ndf[\"dropOff_datetime\"] = pd.to_datetime(df[\"dropOff_datetime\"])",
"section": "error: Error while reading table: trips_data_all.external_fhv_tripdata, error message: Parquet column 'DOlocationID' has type INT64 which does not match the target cpp_type DOUBLE.",
"question": "GCP BQ - DATE() Error in BigQuery"
},
{
"text": "Native tables are tables where the data is stored in BigQuery. External tables store the data outside BigQuery, with BigQuery storing metadata about that external table.\nResources:\nhttps://cloud.google.com/bigquery/docs/external-tables\nhttps://cloud.google.com/bigquery/docs/tables-intro",
"section": "error: Error while reading table: trips_data_all.external_fhv_tripdata, error message: Parquet column 'DOlocationID' has type INT64 which does not match the target cpp_type DOUBLE.",
"question": "GCP BQ - Native tables vs External tables in BigQuery?"
},
{
"text": "Issue: Tried running command to export ML model from BQ to GCS from Week 3\nbq --project_id taxi-rides-ny extract -m nytaxi.tip_model gs://taxi_ml_model/tip_model\nIt is failing on following error:\nBigQuery error in extract operation: Error processing job Not found: Dataset was not found in location US\nI verified the BQ data set and gcs bucket are in the same region- us-west1. Not sure how it gets location US. I couldn\u2019t find the solution yet.\nSolution: Please enter correct project_id and gcs_bucket folder address. My gcs_bucket folder address is\ngs://dtc_data_lake_optimum-airfoil-376815/tip_model",
"section": "error: Error while reading table: trips_data_all.external_fhv_tripdata, error message: Parquet column 'DOlocationID' has type INT64 which does not match the target cpp_type DOUBLE.",
"question": "GCP BQ ML - Unable to run command (shown in video) to export ML model from BQ to GCS"
},
{
"text": "To sol