Repository: datastacktv/data-engineer-roadmap
Branch: master
Commit: 53e47f5780a5
Files: 5
Total size: 9.4 KB
Directory structure:
gitextract_pyo98h0e/
├── .github/
│ └── FUNDING.yml
├── CHANGELOG.md
├── README.md
└── text/
├── extras.md
└── roadmap.md
================================================
FILE CONTENTS
================================================
================================================
FILE: .github/FUNDING.yml
================================================
# These are supported funding model platforms
# github: [datastacktv, alexandraabbas]
custom: https://paypal.me/alexandraabbas
================================================
FILE: CHANGELOG.md
================================================
## Roadmap 2021
### Update 2021-01-15
* Added text version for visually impaired users (issue #10)
* Math & statistics basics have been added to CS fundamentals (issue #22)
* Dimensional modelling has been added to Database fundamentals
* Added section for Object storage (issue #7)
* Azure CosmosDB has been added to Document databases
* Apache Impala has been moved from Batch processing to Data Warehouses
* Azure Synapse Analytics (issue #18) and ClickHouse (issue #24) have been added to Data Warehouses
* Lambda & Kappa architectures have been added to Cluster computing fundamentals (issue #31)
* Azure Data Lake has been added to Managed Hadoop
* Apache NiFi has been added to Hybrid data processing
* Cloud specific messaging services have been added to Messaging (issue #8)
* Luigi has been added to Workflow scheduling
* AWS CDK has replaced AWS CloudFormation in Infrastructure provisioning (issue #4, issue #6)
* Power BI has been added to data visualisation tools (issue #29)
* MLflow has been added to Machine Learning Ops (issue #30)
## Roadmap 2020
[Modern Data Engineer Roadmap 2020](https://github.com/datastacktv/data-engineer-roadmap/tree/8b1ccdce4524961bfd37495de20117c47766b1eb)
================================================
FILE: README.md
================================================

> Roadmap to becoming a data engineer in 2021
[Twitter](https://twitter.com/datastacktv)
[YouTube](http://youtube.com/c/datastacktv)
[datastack.tv](https://datastack.tv/)
[Data Stack Jobs](https://datastackjobs.com/)
This roadmap aims to give a **complete picture of the modern data engineering landscape** and serve as a **study guide** for aspiring data engineers.
***
Note to beginners
> Beginners shouldn’t feel overwhelmed by the vast number of tools and frameworks listed here. A typical data engineer masters a subset of these tools over several years, depending on their company and career choices.
***
🔥 We just launched [**Data Stack Jobs**](https://datastackjobs.com/) — a clean and simple job site for Data Stack Engineers!
> [Text version for visually impaired users](text/roadmap.md)

## Nice to have 😎
> [Text version for visually impaired users](text/extras.md)

## Contributions are welcome 💜
Please raise an issue to discuss your suggestions or open a pull request with your improvements.
## Reviewers 🔎
Huge thank you to [@whydidithavetobebugs](https://github.com/whydidithavetobebugs), [@sawidis](https://github.com/sawidis), [@marclamberti](https://github.com/marclamberti) and [@mpyeager](https://github.com/mpyeager) for reviewing this roadmap.
## About us 👋🏼
[datastack.tv](https://datastack.tv/) is the learning platform for the modern data stack. We create concise screencast video tutorials for data engineers. [**Browse our courses here!**](https://datastack.tv/courses.html)
## License 🗞
> Copyright © 2021 Alexandra Abbas —
================================================
FILE: text/extras.md
================================================
> Text version for visually impaired users
*Note: Data engineers often work closely with Data scientists, Data analysts and Machine Learning engineers. It’s good to have a basic understanding of the tools they use.*
* Visualise data
* Tableau [general recommendation]
* Looker [personal recommendation]
* Grafana [general recommendation]
* Jupyter Notebook [general recommendation]
* Microsoft Power BI
* Machine Learning fundamentals
* Terminology [general recommendation]
* Supervised vs unsupervised learning
* Classification vs regression
* Evaluation metrics
* scikit-learn [general recommendation]
* TensorFlow [personal recommendation]
* Keras [personal recommendation]
* PyTorch [general recommendation]
* Machine Learning Ops
* TensorFlow Extended (TFX) [general recommendation]
* Kubeflow [personal recommendation]
* MLflow
* Amazon SageMaker
* Google Cloud AI Platform
*Note: Keep learning...*
================================================
FILE: text/roadmap.md
================================================
> Text version for visually impaired users
# Data Engineer in 2021
* CS fundamentals
* Basic terminal usage [general recommendation]
* Data structures & algorithms [general recommendation]
* APIs [general recommendation]
* REST [general recommendation]
* Structured vs unstructured data [general recommendation]
* Serialisation
* Linux [general recommendation]
* CLI
* Vim
* Shell scripting
* Cronjobs
* How does the computer work? [general recommendation]
* How does the Internet work? [general recommendation]
* Git — Version control [general recommendation]
* Math & statistics basics [general recommendation]
*Note: Git is used for tracking changes in source code and coordinating work among programmers. In your day-to-day work you will use a hosted Git service such as GitHub, GitLab or Bitbucket.*
* Learn a programming language
* Python [personal recommendation]
* Java [general recommendation]
* Scala
* Go
*Note: Learn how to write clean, extensible code. Spend some time understanding programming paradigms (functional vs. OOP) and best practices (design patterns, YAGNI, stateful vs. stateless applications). Get familiar with an IDE or code editor like VS Code.*
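
The stateful vs stateless distinction named in the note can be illustrated with a minimal Python sketch. The names `StatefulCounter` and `add_stateless` are hypothetical and chosen only for this example; real applications usually keep state in a database or cache rather than in an object.

```python
class StatefulCounter:
    """Stateful: the running total lives inside the object between calls."""

    def __init__(self) -> None:
        self.total = 0

    def add(self, value: int) -> int:
        self.total += value  # the result depends on every earlier call
        return self.total


def add_stateless(total: int, value: int) -> int:
    """Stateless: the caller passes the current total in and gets a new one back."""
    return total + value


# Stateful usage: the object remembers what happened before.
counter = StatefulCounter()
counter.add(2)
stateful_result = counter.add(3)

# Stateless usage: all required information travels with each call.
stateless_result = add_stateless(add_stateless(0, 2), 3)
```

Both approaches arrive at the same total. The stateless version is easier to test and to run on several machines at once, because no call depends on hidden history.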
* Testing
* Unit testing [general recommendation]
* Integration testing [general recommendation]
* Functional testing [general recommendation]
* Database fundamentals
* SQL [general recommendation]
* Normalisation [general recommendation]
* ACID transactions [general recommendation]
* CAP theorem [general recommendation]
* OLTP vs OLAP [general recommendation]
* Horizontal vs vertical scaling [general recommendation]
* Dimensional modelling [general recommendation]
* Relational databases
* MySQL [general recommendation]
* PostgreSQL [general recommendation]
* MariaDB
* Amazon Aurora
* Non-relational databases
* Document databases
* MongoDB [general recommendation]
* Elasticsearch [general recommendation]
* Apache CouchDB
* Azure Cosmos DB
* Wide column databases
* Apache Cassandra [general recommendation]
* Apache HBase [general recommendation]
* Google Cloud Bigtable [personal recommendation]
* Graph databases
* Neo4j
* Amazon Neptune
* Key-value stores
* Redis [personal recommendation]
* Memcached
* Amazon DynamoDB [general recommendation]
*Note: Understand the difference between Document, Wide column, Graph and Key-value NoSQL databases. We recommend mastering one database from each category.*
* Data warehouses
* Snowflake [general recommendation]
* Presto
* Apache Hive
* Apache Impala
* Amazon Redshift [general recommendation]
* Google BigQuery [personal recommendation]
* Azure Synapse
* ClickHouse
* Object storage
* Amazon S3 [general recommendation]
* Azure Blob Storage
* Google Cloud Storage
* Apache Ozone
* Cluster computing fundamentals
* Apache Hadoop [general recommendation]
* HDFS [general recommendation]
* MapReduce [general recommendation]
* Lambda & Kappa architectures
* Managed Hadoop [general recommendation]
* Amazon EMR
* Google Dataproc
* Azure Data Lake
*Note: Most modern data processing frameworks build on ideas from Apache Hadoop and MapReduce to some extent. Understanding these concepts can help you learn modern data processing frameworks much more quickly.*
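
To make the MapReduce idea concrete, here is a minimal single-process sketch of a word count in plain Python. It is not Hadoop code: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are hypothetical, and a real cluster would run the map and reduce steps in parallel across many machines.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key before reducing."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run the three phases in sequence over an iterable of text lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

Calling `word_count(["a b a", "b c"])` counts each word across both lines. The same map, group by key, reduce shape shows up in later frameworks, which is one reason the note suggests learning it first.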
* Data processing
* Batch
* Apache Pig [general recommendation]
* Apache Arrow
* data build tool [personal recommendation]
* Hybrid
* Apache Spark [general recommendation]
* Apache Beam [personal recommendation]
* Apache Flink [general recommendation]
* Apache NiFi
* Streaming
* Apache Kafka [personal recommendation]
* Apache Storm [general recommendation]
* Apache Samza
* Amazon Kinesis
*Note: Hybrid frameworks can process both batch and streaming data. Batch data processing is often done by analytical data warehouse applications. See the Data warehouses section for more.*
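
As a rough illustration of what "hybrid" means, the sketch below applies one piece of logic to a bounded batch and to an unbounded stream. It uses only plain Python iterables, not the API of Spark, Beam or Flink; `running_totals` and `event_stream` are hypothetical names, and the endless generator merely stands in for a real source such as a message queue.

```python
from typing import Dict, Iterable, Iterator, List


def running_totals(events: Iterable[Dict[str, float]]) -> Iterator[float]:
    """Yield the cumulative amount after each event; works on any iterable."""
    total = 0.0
    for event in events:
        total += event["amount"]
        yield total


def event_stream() -> Iterator[Dict[str, float]]:
    """Hypothetical stand-in for an unbounded source of events."""
    amount = 1.0
    while True:
        yield {"amount": amount}
        amount += 1.0


# Batch: the whole bounded dataset is available up front.
batch: List[Dict[str, float]] = [{"amount": 1.0}, {"amount": 2.5}]
batch_totals = list(running_totals(batch))

# Streaming: results are pulled one at a time from an endless source.
stream_totals = running_totals(event_stream())
first_three = [next(stream_totals) for _ in range(3)]
```

The processing function does not know whether its input ever ends. Hybrid frameworks offer a similar guarantee at cluster scale, so one pipeline definition can serve both modes.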
* Messaging
* RabbitMQ [general recommendation]
* Apache ActiveMQ
* Amazon SNS & SQS
* Google Cloud Pub/Sub
* Azure Service Bus
* Workflow scheduling
* Apache Airflow [personal recommendation]
* Google Cloud Composer
* Apache Oozie
* Luigi
*Note: Cloud Composer is a managed Apache Airflow service on Google Cloud Platform.*
* Monitoring and observability for data pipelines
* Prometheus [general recommendation]
* Datadog [general recommendation]
* Sentry [general recommendation]
* Monte Carlo
* Datafold
* Soda Data
* StatsD
* Networking
* Protocols [general recommendation]
* HTTP / HTTPS
* TCP
* SSH
* IP
* DNS
* Firewalls [general recommendation]
* VPN [general recommendation]
* VPC [general recommendation]
* Infrastructure as Code
* Containers
* Docker [personal recommendation]
* LXC
* Container orchestration
* Kubernetes [general recommendation]
* Docker Swarm
* Apache Mesos
* Google Kubernetes Engine (GKE) [general recommendation]
* Infrastructure provisioning
* Terraform [personal recommendation]
* Pulumi
* AWS CDK [general recommendation]
* CI/CD
* GitHub Actions [general recommendation]
* Jenkins [general recommendation]
* Identity and access management
* Active Directory [general recommendation]
* Azure Active Directory
* Data security & privacy
* Legal compliance [general recommendation]
* Encryption [general recommendation]
* Key management [general recommendation]
* Data governance & integrity