Repository: panovvv/bigdata-docker-compose
Branch: master
Commit: ca515ac08b21
Files: 9
Total size: 32.1 KB
Directory structure:
gitextract_p_pqx3h3/
├── .gitignore
├── LICENSE
├── README.md
├── data/
│ ├── batches/
│ │ └── sample_batch.py
│ ├── grades.csv
│ └── ssn-address.tsv
├── docker-compose.yml
├── init.sql
└── zeppelin_notebooks/
└── test_2FVBJBJ1V.zpln
================================================
FILE CONTENTS
================================================
================================================
FILE: .gitignore
================================================
# OS garbage
.DS_Store
desktop.ini
# IDE garbage
.idea/
# Livy batch files, copied over from elsewhere, except one sample batch
data/batches/*
!data/batches/sample_batch.py
# Spark job results
data/output/
================================================
FILE: LICENSE
================================================
The MIT License
Copyright (c) 2020 Vadim Panov
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
================================================
FILE: README.md
================================================
# Big data playground: cluster with Hadoop, Hive, Spark, Zeppelin and Livy via Docker Compose
I wanted the ability to play around with various big data
applications (namely those found in Amazon EMR)
as effortlessly as possible.
Ideally, that would be something that can be brought up and torn down
in one command. This is how this repository came to be!
## Constituent images:
[Base image](https://github.com/panovvv/hadoop-hive-spark-docker):
[builds](https://cloud.docker.com/repository/docker/panovvv/hadoop-hive-spark/builds) |
[Docker Hub](https://hub.docker.com/r/panovvv/hadoop-hive-spark)

[Zeppelin image](https://github.com/panovvv/zeppelin-bigdata-docker):
[builds](https://cloud.docker.com/repository/docker/panovvv/zeppelin-bigdata/builds) |
[Docker Hub](https://hub.docker.com/r/panovvv/zeppelin-bigdata)

[Livy image](https://github.com/panovvv/livy-docker):
[builds](https://cloud.docker.com/repository/docker/panovvv/livy/builds) |
[Docker Hub](https://hub.docker.com/r/panovvv/livy)
## Usage
Clone:
```bash
git clone https://github.com/panovvv/bigdata-docker-compose.git
```
* On non-Linux platforms, dedicate more RAM to Docker than it gets by default
(2 GB on my machine with 16 GB of RAM). Otherwise applications (ResourceManager in my case)
will quit sporadically and you'll see messages like this one in the logs:
<pre>
current-datetime INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 1234ms
No GCs detected
</pre>
Increasing the memory to 8 GB solved all those mysterious problems for me.
* Keep your disk less than 90% full, otherwise
YARN will deem all nodes unhealthy.
Bring everything up:
```bash
cd bigdata-docker-compose
docker-compose up -d
```
* The **data/** directory is mounted into every container; use it as
storage both for files you want to process with Hive/Spark/whatever
and for the results of those computations.
* The **livy_batches/** directory holds sample code for
Livy's batch processing mode. It's mounted to the node where Livy
is running. You can store your own code there as well, or use the
universal **data/**.
* **zeppelin_notebooks/** contains, quite predictably, notebook files
for Zeppelin. Thanks to that, all your notebooks persist across runs.
The Hive JDBC port is exposed to the host:
* URI: `jdbc:hive2://localhost:10000`
* Driver: `org.apache.hive.jdbc.HiveDriver` (org.apache.hive:hive-jdbc:3.1.2)
* User and password: unused.
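HiveServer2 can take a while to come up after `docker-compose up`, so before pointing a JDBC client at port 10000 you may want to wait until it actually accepts connections. A minimal stdlib-only Python sketch (not part of this repo; the host/port values are the ones from this compose file):

```python
import socket
import time

def wait_for_port(host: str, port: int, timeout: float = 120.0) -> bool:
    """Poll until a TCP port accepts connections, or give up after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening. HiveServer2 may
            # still be initializing internally, but an open port is a good first signal.
            with socket.create_connection((host, port), timeout=2):
                return True
        except OSError:
            time.sleep(1)
    return False

# Example: block until the Hive JDBC port from docker-compose.yml is reachable.
# wait_for_port("localhost", 10000)
```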
To shut the whole thing down, run this from the same folder:
```bash
docker-compose down
```
## Checking if everything plays well together
You can quickly check everything by opening the
[bundled Zeppelin notebook](http://localhost:8890)
and running all paragraphs.
Alternatively, to get a sense of
how it all works under the hood, follow the instructions below:
### Hadoop and YARN:
Check [YARN (Hadoop ResourceManager) Web UI
(localhost:8088)](http://localhost:8088/).
You should see 2 active nodes there.
There's also an
[alternative YARN Web UI 2 (http://localhost:8088/ui2)](http://localhost:8088/ui2).
Then, [Hadoop Name Node UI (localhost:9870)](http://localhost:9870),
Hadoop Data Node UIs at
[http://localhost:9864](http://localhost:9864) and [http://localhost:9865](http://localhost:9865):
each of those URLs should serve a page.
Open up a shell in the master node.
```bash
docker-compose exec master bash
jps
```
The `jps` command lists the running Java processes,
which on the Hadoop NameNode/Spark Master node should include these:
<pre>
123 Jps
456 ResourceManager
789 NameNode
234 SecondaryNameNode
567 HistoryServer
890 Master
</pre>
... though not necessarily in this order or with these IDs;
extras such as `RunJar` and `JobHistoryServer` may show up too.
Then let's check that YARN sees all the resources we have (2 worker nodes):
```bash
yarn node -list
```
<pre>
current-datetime INFO client.RMProxy: Connecting to ResourceManager at master/172.28.1.1:8032
Total Nodes:2
Node-Id Node-State Node-Http-Address Number-of-Running-Containers
worker1:45019 RUNNING worker1:8042 0
worker2:41001 RUNNING worker2:8042 0
</pre>
Check the HDFS (Hadoop Distributed File System) condition:
```bash
hdfs dfsadmin -report
```
<pre>
Live datanodes (2):
Name: 172.28.1.2:9866 (worker1)
...
Name: 172.28.1.3:9866 (worker2)
</pre>
Now we'll upload a file into HDFS and see that it's visible from all
nodes:
```bash
hadoop fs -put /data/grades.csv /
hadoop fs -ls /
```
<pre>
Found N items
...
-rw-r--r-- 2 root supergroup ... /grades.csv
...
</pre>
Ctrl+D out of the master now. Repeat for the remaining nodes
(there are 3 total: master, worker1 and worker2):
```bash
docker-compose exec worker1 bash
hadoop fs -ls /
```
<pre>
Found 1 items
-rw-r--r-- 2 root supergroup ... /grades.csv
</pre>
On nodes other than the Hadoop NameNode/Spark Master node,
`jps` output should include `DataNode` and `Worker` instead of
`NameNode` and `Master`:
```bash
jps
```
<pre>
123 Jps
456 NodeManager
789 DataNode
234 Worker
</pre>
### Hive
Prerequisite: there's a file `grades.csv` stored in HDFS (`hadoop fs -put /data/grades.csv /`)
```bash
docker-compose exec master bash
hive
```
```sql
CREATE TABLE grades(
`Last name` STRING,
`First name` STRING,
`SSN` STRING,
`Test1` DOUBLE,
`Test2` INT,
`Test3` DOUBLE,
`Test4` DOUBLE,
`Final` DOUBLE,
`Grade` STRING)
COMMENT 'https://people.sc.fsu.edu/~jburkardt/data/csv/csv.html'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
tblproperties("skip.header.line.count"="1");
LOAD DATA INPATH '/grades.csv' INTO TABLE grades;
SELECT * FROM grades;
-- OK
-- Alfalfa Aloysius 123-45-6789 40.0 90 100.0 83.0 49.0 D-
-- Alfred University 123-12-1234 41.0 97 96.0 97.0 48.0 D+
-- Gerty Gramma 567-89-0123 41.0 80 60.0 40.0 44.0 C
-- Android Electric 087-65-4321 42.0 23 36.0 45.0 47.0 B-
-- Bumpkin Fred 456-78-9012 43.0 78 88.0 77.0 45.0 A-
-- Rubble Betty 234-56-7890 44.0 90 80.0 90.0 46.0 C-
-- Noshow Cecil 345-67-8901 45.0 11 -1.0 4.0 43.0 F
-- Buff Bif 632-79-9939 46.0 20 30.0 40.0 50.0 B+
-- Airpump Andrew 223-45-6789 49.0 1 90.0 100.0 83.0 A
-- Backus Jim 143-12-1234 48.0 1 97.0 96.0 97.0 A+
-- Carnivore Art 565-89-0123 44.0 1 80.0 60.0 40.0 D+
-- Dandy Jim 087-75-4321 47.0 1 23.0 36.0 45.0 C+
-- Elephant Ima 456-71-9012 45.0 1 78.0 88.0 77.0 B-
-- Franklin Benny 234-56-2890 50.0 1 90.0 80.0 90.0 B-
-- George Boy 345-67-3901 40.0 1 11.0 -1.0 4.0 B
-- Heffalump Harvey 632-79-9439 30.0 1 20.0 30.0 40.0 C
-- Time taken: 3.324 seconds, Fetched: 16 row(s)
```
Ctrl+D back to bash. Check that the file has been loaded into the Hive warehouse
directory:
```bash
hadoop fs -ls /usr/hive/warehouse/grades
```
<pre>
Found 1 items
-rw-r--r-- 2 root supergroup ... /usr/hive/warehouse/grades/grades.csv
</pre>
The table we just created should be accessible from all nodes; let's
verify that now:
```bash
docker-compose exec worker2 bash
hive
```
```sql
SELECT * FROM grades;
```
You should be able to see the same table.
### Spark
Open up [Spark Master Web UI (localhost:8080)](http://localhost:8080/):
<pre>
Workers (2)
Worker Id Address State Cores Memory
worker-timestamp-172.28.1.3-8882 172.28.1.3:8882 ALIVE 2 (0 Used) 1024.0 MB (0.0 B Used)
worker-timestamp-172.28.1.2-8881 172.28.1.2:8881 ALIVE 2 (0 Used) 1024.0 MB (0.0 B Used)
</pre>
There are also worker UIs at [localhost:8081](http://localhost:8081/)
and [localhost:8082](http://localhost:8082/). All those pages should be
accessible.
There's also a Spark History Server running at
[localhost:18080](http://localhost:18080/): every Spark job you run
will show up here.
The History Server also exposes a REST API at
[localhost:18080/api/v1/applications](http://localhost:18080/api/v1/applications),
which mirrors everything on the main page, only in JSON format.
Let's run some sample jobs now:
```bash
docker-compose exec master bash
run-example SparkPi 10
# ...or you can do the same via spark-submit:
spark-submit --class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode client \
--driver-memory 2g \
--executor-memory 1g \
--executor-cores 1 \
$SPARK_HOME/examples/jars/spark-examples*.jar \
10
```
<pre>
INFO spark.SparkContext: Running Spark version 2.4.4
INFO spark.SparkContext: Submitted application: Spark Pi
..
INFO client.RMProxy: Connecting to ResourceManager at master/172.28.1.1:8032
INFO yarn.Client: Requesting a new application from cluster with 2 NodeManagers
...
INFO yarn.Client: Application report for application_1567375394688_0001 (state: ACCEPTED)
...
INFO yarn.Client: Application report for application_1567375394688_0001 (state: RUNNING)
...
INFO scheduler.DAGScheduler: Job 0 finished: reduce at SparkPi.scala:38, took 1.102882 s
Pi is roughly 3.138915138915139
...
INFO util.ShutdownHookManager: Deleting directory /tmp/spark-81ea2c22-d96e-4d7c-a8d7-9240d8eb22ce
</pre>
Spark has 3 interactive shells: spark-shell to code in Scala,
pyspark for Python and sparkR for R. Let's try them all out:
```bash
hadoop fs -put /data/grades.csv /
spark-shell
```
```scala
spark.range(1000 * 1000 * 1000).count()
val df = spark.read.format("csv").option("header", "true").load("/grades.csv")
df.show()
df.createOrReplaceTempView("df")
spark.sql("SHOW TABLES").show()
spark.sql("SELECT * FROM df WHERE Final > 50").show()
//TODO SELECT TABLE from hive - not working for now.
spark.sql("SELECT * FROM grades").show()
```
<pre>
Spark context Web UI available at http://localhost:4040
Spark context available as 'sc' (master = yarn, app id = application_N).
Spark session available as 'spark'.
res0: Long = 1000000000
df: org.apache.spark.sql.DataFrame = [Last name: string, First name: string ... 7 more fields]
+---------+----------+-----------+-----+-----+-----+-----+-----+-----+
|Last name|First name| SSN|Test1|Test2|Test3|Test4|Final|Grade|
+---------+----------+-----------+-----+-----+-----+-----+-----+-----+
| Alfalfa| Aloysius|123-45-6789| 40| 90| 100| 83| 49| D-|
...
|Heffalump| Harvey|632-79-9439| 30| 1| 20| 30| 40| C|
+---------+----------+-----------+-----+-----+-----+-----+-----+-----+
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
| | df| true|
+--------+---------+-----------+
+---------+----------+-----------+-----+-----+-----+-----+-----+-----+
|Last name|First name| SSN|Test1|Test2|Test3|Test4|Final|Grade|
+---------+----------+-----------+-----+-----+-----+-----+-----+-----+
| Airpump| Andrew|223-45-6789| 49| 1| 90| 100| 83| A|
| Backus| Jim|143-12-1234| 48| 1| 97| 96| 97| A+|
| Elephant| Ima|456-71-9012| 45| 1| 78| 88| 77| B-|
| Franklin| Benny|234-56-2890| 50| 1| 90| 80| 90| B-|
+---------+----------+-----------+-----+-----+-----+-----+-----+-----+
</pre>
Ctrl+D out of Scala shell now.
```bash
pyspark
```
```python
spark.range(1000 * 1000 * 1000).count()
df = spark.read.format('csv').option('header', 'true').load('/grades.csv')
df.show()
df.createOrReplaceTempView('df')
spark.sql('SHOW TABLES').show()
spark.sql('SELECT * FROM df WHERE Final > 50').show()
# TODO SELECT TABLE from hive - not working for now.
spark.sql('SELECT * FROM grades').show()
```
<pre>
1000000000
$same_tables_as_above
</pre>
Ctrl+D out of PySpark.
```bash
sparkR
```
```R
df <- as.DataFrame(list("One", "Two", "Three", "Four"), "This is as example")
head(df)
df <- read.df("/grades.csv", "csv", header="true")
head(df)
```
<pre>
This is as example
1 One
2 Two
3 Three
4 Four
$same_tables_as_above
</pre>
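The `Final > 50` query from the Scala and Python shells above can be sanity-checked without a cluster. A plain-Python sketch over the same CSV layout (a few rows inlined here purely for illustration):

```python
import csv
import io

# Sample rows in the same shape as data/grades.csv (header line + records).
SAMPLE = """\
Last name,First name,SSN,Test1,Test2,Test3,Test4,Final,Grade
Alfalfa,Aloysius,123-45-6789,40,90,100,83,49,D-
Airpump,Andrew,223-45-6789,49,1,90,100,83,A
Backus,Jim,143-12-1234,48,1,97,96,97,A+
"""

def finals_above(data: str, threshold: float = 50.0):
    """Return last names of rows whose Final score exceeds the threshold."""
    reader = csv.DictReader(io.StringIO(data))  # DictReader consumes the header row
    return [row["Last name"] for row in reader if float(row["Final"]) > threshold]

print(finals_above(SAMPLE))  # Alfalfa (Final 49.0) is filtered out
```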
* Amazon S3
From Hadoop:
```bash
hadoop fs -Dfs.s3a.impl="org.apache.hadoop.fs.s3a.S3AFileSystem" -Dfs.s3a.access.key="classified" -Dfs.s3a.secret.key="classified" -ls "s3a://bucket"
```
Then from PySpark:
```python
sc._jsc.hadoopConfiguration().set('fs.s3a.impl', 'org.apache.hadoop.fs.s3a.S3AFileSystem')
sc._jsc.hadoopConfiguration().set('fs.s3a.access.key', 'classified')
sc._jsc.hadoopConfiguration().set('fs.s3a.secret.key', 'classified')
df = spark.read.format('csv').option('header', 'true').option('sep', '\t').load('s3a://bucket/tabseparated_withheader.tsv')
df.show(5)
```
None of the commands above stores your credentials anywhere
(as soon as you shut down the cluster, they're gone). More
persistent ways of storing credentials are out of scope for this
readme.
### Zeppelin
Zeppelin interface should be available at [http://localhost:8890](http://localhost:8890).
You'll find a notebook called "test" in there, containing commands
to test integration with bash, Spark and Livy.
### Livy
Livy is at [http://localhost:8998](http://localhost:8998) (and yes,
there's a web UI as well as REST API on that port - just click the link).
* Livy Sessions.
Try to poll the REST API:
```bash
curl --request GET \
--url http://localhost:8998/sessions | python3 -mjson.tool
```
The response, assuming you didn't create any sessions before, should look like this:
```json
{
"from": 0,
"total": 0,
"sessions": []
}
```
1) Create a session:
```bash
curl --request POST \
--url http://localhost:8998/sessions \
--header 'content-type: application/json' \
--data '{
"kind": "pyspark"
}' | python3 -mjson.tool
```
Response:
```json
{
"id": 0,
"name": null,
"appId": null,
"owner": null,
"proxyUser": null,
"state": "starting",
"kind": "pyspark",
"appInfo": {
"driverLogUrl": null,
"sparkUiUrl": null
},
"log": [
"stdout: ",
"\nstderr: ",
"\nYARN Diagnostics: "
]
}
```
2) Wait for the session to start (its state will transition from "starting"
to "idle"):
```bash
curl --request GET \
--url http://localhost:8998/sessions/0 | python3 -mjson.tool
```
Response:
```json
{
"id": 0,
"name": null,
"appId": "application_1584274334558_0001",
"owner": null,
"proxyUser": null,
"state": "starting",
"kind": "pyspark",
"appInfo": {
"driverLogUrl": "http://worker2:8042/node/containerlogs/container_1584274334558_0003_01_000001/root",
"sparkUiUrl": "http://master:8088/proxy/application_1584274334558_0003/"
},
"log": [
"timestamp bla"
]
}
```
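Rather than re-running that GET by hand, the wait can be scripted. A hedged Python sketch (not part of this repo): the state-fetching callable is injected so the loop is testable, and `livy_session_state` wraps the same `GET /sessions/{id}` shown above.

```python
import json
import time
from typing import Callable
from urllib.request import urlopen

def livy_session_state(base_url: str, session_id: int) -> str:
    """GET /sessions/{id} and return its 'state' field."""
    with urlopen(f"{base_url}/sessions/{session_id}") as resp:
        return json.load(resp)["state"]

def wait_until_idle(get_state: Callable[[], str],
                    attempts: int = 60, delay: float = 1.0) -> bool:
    """Poll a state-returning callable until it reports 'idle'.

    Returns False if the session lands in a terminal failure state or
    the attempt budget runs out.
    """
    for _ in range(attempts):
        state = get_state()
        if state == "idle":
            return True
        if state in ("error", "dead", "killed"):
            return False
        time.sleep(delay)
    return False

# Usage against this compose stack (hypothetical session id 0):
# wait_until_idle(lambda: livy_session_state("http://localhost:8998", 0))
```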
3) Post some statements:
```bash
curl --request POST \
--url http://localhost:8998/sessions/0/statements \
--header 'content-type: application/json' \
--data '{
"code": "import sys;print(sys.version)"
}' | python3 -mjson.tool
curl --request POST \
--url http://localhost:8998/sessions/0/statements \
--header 'content-type: application/json' \
--data '{
"code": "spark.range(1000 * 1000 * 1000).count()"
}' | python3 -mjson.tool
```
Response:
```json
{
"id": 0,
"code": "import sys;print(sys.version)",
"state": "waiting",
"output": null,
"progress": 0.0,
"started": 0,
"completed": 0
}
```
```json
{
"id": 1,
"code": "spark.range(1000 * 1000 * 1000).count()",
"state": "waiting",
"output": null,
"progress": 0.0,
"started": 0,
"completed": 0
}
```
4) Get the result:
```bash
curl --request GET \
--url http://localhost:8998/sessions/0/statements | python3 -mjson.tool
```
Response:
```json
{
"total_statements": 2,
"statements": [
{
"id": 0,
"code": "import sys;print(sys.version)",
"state": "available",
"output": {
"status": "ok",
"execution_count": 0,
"data": {
"text/plain": "3.7.3 (default, Apr 3 2019, 19:16:38) \n[GCC 8.0.1 20180414 (experimental) [trunk revision 259383]]"
}
},
"progress": 1.0
},
{
"id": 1,
"code": "spark.range(1000 * 1000 * 1000).count()",
"state": "available",
"output": {
"status": "ok",
"execution_count": 1,
"data": {
"text/plain": "1000000000"
}
},
"progress": 1.0
}
]
}
```
5) Delete the session:
```bash
curl --request DELETE \
--url http://localhost:8998/sessions/0 | python3 -mjson.tool
```
Response:
```json
{
"msg": "deleted"
}
```
* Livy Batches.
To get all active batches:
```bash
curl --request GET \
--url http://localhost:8998/batches | python3 -mjson.tool
```
Strangely enough, this elicits the same response shape as querying
the sessions endpoint, but OK...
1) Send the batch:
```bash
curl --request POST \
--url http://localhost:8998/batches \
--header 'content-type: application/json' \
--data '{
"file": "local:/data/batches/sample_batch.py",
"pyFiles": [
"local:/data/batches/sample_batch.py"
],
"args": [
"123"
]
}' | python3 -mjson.tool
```
Response:
```json
{
"id": 0,
"name": null,
"owner": null,
"proxyUser": null,
"state": "starting",
"appId": null,
"appInfo": {
"driverLogUrl": null,
"sparkUiUrl": null
},
"log": [
"stdout: ",
"\nstderr: ",
"\nYARN Diagnostics: "
]
}
```
2) Query the status:
```bash
curl --request GET \
--url http://localhost:8998/batches/0 | python3 -mjson.tool
```
Response:
```json
{
"id": 0,
"name": null,
"owner": null,
"proxyUser": null,
"state": "running",
"appId": "application_1584274334558_0005",
"appInfo": {
"driverLogUrl": "http://worker2:8042/node/containerlogs/container_1584274334558_0005_01_000001/root",
"sparkUiUrl": "http://master:8088/proxy/application_1584274334558_0005/"
},
"log": [
"timestamp bla",
"\nstderr: ",
"\nYARN Diagnostics: "
]
}
```
3) To see the log lines, query the `/log` endpoint.
You can omit the `from` and `to` params, or adjust them to page through the log.
Livy (as of 0.7.0) returns no more than 100 log lines per response.
```bash
curl --request GET \
--url 'http://localhost:8998/batches/0/log?from=100&to=200' | python3 -mjson.tool
```
Response:
```json
{
"id": 0,
"from": 100,
"total": 203,
"log": [
"...",
"Welcome to",
" ____ __",
" / __/__ ___ _____/ /__",
" _\\ \\/ _ \\/ _ `/ __/ '_/",
" /__ / .__/\\_,_/_/ /_/\\_\\ version 2.4.5",
" /_/",
"",
"Using Python version 3.7.5 (default, Oct 17 2019 12:25:15)",
"SparkSession available as 'spark'.",
"3.7.5 (default, Oct 17 2019, 12:25:15) ",
"[GCC 8.3.0]",
"Arguments: ",
"['/data/batches/sample_batch.py', '123']",
"Custom number passed in args: 123",
"Will raise 123 to the power of 3...",
"...",
"123 ^ 3 = 1860867",
"...",
"2020-03-15 13:06:09,503 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-138164b7-c5dc-4dc5-be6b-7a49c6bcdff0/pyspark-4d73b7c7-e27c-462f-9e5a-96011790d059"
]
}
```
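Since each response carries at most 100 lines, fetching a long log means paging through it. The offset arithmetic can be sketched in pure Python (independent of the exact query parameter names your Livy version accepts):

```python
def log_pages(total: int, page_size: int = 100):
    """Split `total` log lines into (offset, count) pages no larger than page_size."""
    pages = []
    offset = 0
    while offset < total:
        count = min(page_size, total - offset)
        pages.append((offset, count))
        offset += count
    return pages

# For the 203-line log above:
print(log_pages(203))  # [(0, 100), (100, 100), (200, 3)]
```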
4) Delete the batch:
```bash
curl --request DELETE \
--url http://localhost:8998/batches/0 | python3 -mjson.tool
```
Response:
```json
{
"msg": "deleted"
}
```
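If you drive these endpoints from a script instead of curl, the batch request body can be assembled in Python. A sketch using only the fields shown in the curl examples above (`file` is the one required field):

```python
import json

def batch_payload(file: str, args=(), py_files=()):
    """Build the JSON body for POST /batches; optional fields are omitted when empty."""
    body = {"file": file}
    if py_files:
        body["pyFiles"] = list(py_files)
    if args:
        body["args"] = [str(a) for a in args]  # Livy expects string arguments
    return json.dumps(body)

# Mirrors the curl example above:
print(batch_payload("local:/data/batches/sample_batch.py",
                    args=[123],
                    py_files=["local:/data/batches/sample_batch.py"]))
```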
## Credits
Sample data file:
* __grades.csv__ is borrowed from
[John Burkardt's page](https://people.sc.fsu.edu/~jburkardt/data/csv/csv.html)
under Florida State University domain. Thanks for
sharing those!
* __ssn-address.tsv__ is derived from __grades.csv__ by removing some fields
and adding randomly-generated addresses.
================================================
FILE: data/batches/sample_batch.py
================================================
import sys

from pyspark.shell import spark

print(sys.version)
print("Arguments: \n" + str(sys.argv))

try:
    num = int(sys.argv[1])
    print("Custom number passed in args: " + str(num))
except (ValueError, IndexError):
    # Don't touch sys.argv[1] here: on IndexError that would raise again.
    num = 1000
    print("No usable number in args, falling back to " + str(num))

# Checking if f-strings are available (python >= 3.6)
print(f"Will raise {num} to the power of 3...")
cube = spark.range(num * num * num).count()
print(f"{num} ^ 3 = {cube}")
================================================
FILE: data/grades.csv
================================================
Last name,First name,SSN,Test1,Test2,Test3,Test4,Final,Grade
Alfalfa,Aloysius,123-45-6789,40,90,100,83,49,D-
Alfred,University,123-12-1234,41,97,96,97,48,D+
Gerty,Gramma,567-89-0123,41,80,60,40,44,C
Android,Electric,087-65-4321,42,23,36,45,47,B-
Bumpkin,Fred,456-78-9012,43,78,88,77,45,A-
Rubble,Betty,234-56-7890,44,90,80,90,46,C-
Noshow,Cecil,345-67-8901,45,11,-1,4,43,F
Buff,Bif,632-79-9939,46,20,30,40,50,B+
Airpump,Andrew,223-45-6789,49,1,90,100,83,A
Backus,Jim,143-12-1234,48,1,97,96,97,A+
Carnivore,Art,565-89-0123,44,1,80,60,40,D+
Dandy,Jim,087-75-4321,47,1,23,36,45,C+
Elephant,Ima,456-71-9012,45,1,78,88,77,B-
Franklin,Benny,234-56-2890,50,1,90,80,90,B-
George,Boy,345-67-3901,40,1,11,-1,4,B
Heffalump,Harvey,632-79-9439,30,1,20,30,40,C
================================================
FILE: data/ssn-address.tsv
================================================
Alfalfa Aloysius 123-45-6789 7098 East Road Hopkins, MN 55343
Backus Jim 143-12-1234 603 Wagon Drive Miamisburg, OH 45342
Dandy Jim 087-75-4321 4 Ann St. Hackensack, NJ 07601
George Boy 345-67-3901 13 Foxrun Ave. Annandale, VA 22003
Alfred University 123-12-1234 98 Wellington Ave. Lowell, MA 01851
Elephant Ima 456-71-9012
Heffalump Harvey 632-79-9439 5 Beech Street Canyon Country, CA 91387
Gerty Gramma 567-89-0123
Rubble Betty 234-56-7890 9715 Penn St. Royal Oak, MI 48067
================================================
FILE: docker-compose.yml
================================================
version: "3.7"
services:
hivemetastore:
image: postgres:11.5
hostname: hivemetastore
environment:
POSTGRES_PASSWORD: new_password
expose:
- 5432
volumes:
- ./init.sql:/docker-entrypoint-initdb.d/init.sql
healthcheck:
test: ["CMD-SHELL", "pg_isready -U postgres"]
interval: 10s
timeout: 5s
retries: 5
networks:
spark_net:
ipv4_address: 172.28.1.4
extra_hosts:
- "master:172.28.1.1"
- "worker1:172.28.1.2"
- "worker2:172.28.1.3"
- "zeppelin:172.28.1.5"
- "livy:172.28.1.6"
master:
image: panovvv/hadoop-hive-spark:2.5.2
# build: '../hadoop-hive-spark-docker'
hostname: master
depends_on:
- hivemetastore
environment:
HADOOP_NODE: namenode
HIVE_CONFIGURE: yes, please
SPARK_PUBLIC_DNS: localhost
SPARK_LOCAL_IP: 172.28.1.1
SPARK_MASTER_HOST: 172.28.1.1
SPARK_LOCAL_HOSTNAME: master
expose:
- 1-65535
ports:
# Spark Master Web UI
- 8080:8080
# Spark job Web UI: increments for each successive job
- 4040:4040
- 4041:4041
- 4042:4042
- 4043:4043
# Spark History server
- 18080:18080
# YARN UI
- 8088:8088
# Hadoop namenode UI
- 9870:9870
# Hadoop secondary namenode UI
- 9868:9868
# Hive JDBC
- 10000:10000
volumes:
- ./data:/data
networks:
spark_net:
ipv4_address: 172.28.1.1
extra_hosts:
- "worker1:172.28.1.2"
- "worker2:172.28.1.3"
- "hivemetastore:172.28.1.4"
- "zeppelin:172.28.1.5"
- "livy:172.28.1.6"
worker1:
image: panovvv/hadoop-hive-spark:2.5.2
# build: '../hadoop-hive-spark-docker'
hostname: worker1
depends_on:
- hivemetastore
environment:
SPARK_MASTER_ADDRESS: spark://master:7077
SPARK_WORKER_PORT: 8881
SPARK_WORKER_WEBUI_PORT: 8081
SPARK_PUBLIC_DNS: localhost
SPARK_LOCAL_HOSTNAME: worker1
SPARK_LOCAL_IP: 172.28.1.2
SPARK_MASTER_HOST: 172.28.1.1
HADOOP_NODE: datanode
expose:
- 1-65535
ports:
# Hadoop datanode UI
- 9864:9864
#Spark worker UI
- 8081:8081
volumes:
- ./data:/data
networks:
spark_net:
ipv4_address: 172.28.1.2
extra_hosts:
- "master:172.28.1.1"
- "worker2:172.28.1.3"
- "hivemetastore:172.28.1.4"
- "zeppelin:172.28.1.5"
- "livy:172.28.1.6"
worker2:
image: panovvv/hadoop-hive-spark:2.5.2
# build: '../hadoop-hive-spark-docker'
hostname: worker2
depends_on:
- hivemetastore
environment:
SPARK_MASTER_ADDRESS: spark://master:7077
SPARK_WORKER_PORT: 8882
SPARK_WORKER_WEBUI_PORT: 8082
SPARK_PUBLIC_DNS: localhost
SPARK_LOCAL_HOSTNAME: worker2
SPARK_LOCAL_IP: 172.28.1.3
SPARK_MASTER_HOST: 172.28.1.1
HADOOP_NODE: datanode
HADOOP_DATANODE_UI_PORT: 9865
expose:
- 1-65535
ports:
# Hadoop datanode UI
- 9865:9865
# Spark worker UI
- 8082:8082
volumes:
- ./data:/data
networks:
spark_net:
ipv4_address: 172.28.1.3
extra_hosts:
- "master:172.28.1.1"
- "worker1:172.28.1.2"
- "hivemetastore:172.28.1.4"
- "zeppelin:172.28.1.5"
- "livy:172.28.1.6"
livy:
image: panovvv/livy:2.5.2
# build: '../livy-docker'
hostname: livy
depends_on:
- master
- worker1
- worker2
volumes:
- ./livy_batches:/livy_batches
- ./data:/data
environment:
- SPARK_MASTER=yarn
# Intentionally not specified - if it's set here, then we can't override it
# via REST API ("conf"={} map)
# Can be client or cluster
# - SPARK_DEPLOY_MODE=client
- LOCAL_DIR_WHITELIST=/data/batches/
- ENABLE_HIVE_CONTEXT=false
# Defaults are fine for variables below. Uncomment to change them.
# - LIVY_HOST=0.0.0.0
# - LIVY_PORT=8998
expose:
- 1-65535
ports:
- 8998:8998
networks:
spark_net:
ipv4_address: 172.28.1.6
extra_hosts:
- "master:172.28.1.1"
- "worker1:172.28.1.2"
- "worker2:172.28.1.3"
- "hivemetastore:172.28.1.4"
- "zeppelin:172.28.1.5"
zeppelin:
image: panovvv/zeppelin-bigdata:2.5.2
# build: '../zeppelin-bigdata-docker'
hostname: zeppelin
depends_on:
- master
- worker1
- worker2
- livy
volumes:
- ./zeppelin_notebooks:/zeppelin_notebooks
- ./data:/data
environment:
ZEPPELIN_PORT: 8890
ZEPPELIN_NOTEBOOK_DIR: '/zeppelin_notebooks'
expose:
- 8890
ports:
- 8890:8890
networks:
spark_net:
ipv4_address: 172.28.1.5
extra_hosts:
- "master:172.28.1.1"
- "worker1:172.28.1.2"
- "worker2:172.28.1.3"
- "hivemetastore:172.28.1.4"
- "livy:172.28.1.6"
networks:
spark_net:
ipam:
driver: default
config:
- subnet: 172.28.0.0/16
================================================
FILE: init.sql
================================================
CREATE DATABASE "hivemetastoredb";
================================================
FILE: zeppelin_notebooks/test_2FVBJBJ1V.zpln
================================================
{
"paragraphs": [
{
"text": "%sh\n\n#Only load a file to HDFS if it\u0027s not already there - because of this you can run all paragraphs as many times as you like.\nif ! hadoop fs -test -e /grades.csv\nthen\n echo \"*******************************************\"\n echo \"grades.csv is not in HDFS yet! Uploading...\"\n echo \"*******************************************\"\n hadoop fs -put /data/grades.csv /\nfi"
},
{
"text": "%sh\n\nhadoop fs -ls /"
},
{
"text": "%jdbc\n\n-- Does not support more than one statement per paragraph, it seems. Same goes for semicolon at the end of statements - errors out if you include it.\nDROP TABLE IF EXISTS grades"
},
{
"text": "%jdbc\n\nCREATE TABLE grades(\n `Last name` STRING,\n `First name` STRING,\n `SSN` STRING,\n `Test1` DOUBLE,\n `Test2` INT,\n `Test3` DOUBLE,\n `Test4` DOUBLE,\n `Final` DOUBLE,\n `Grade` STRING)\nCOMMENT \u0027https://people.sc.fsu.edu/~jburkardt/data/csv/csv.html\u0027\nROW FORMAT DELIMITED\nFIELDS TERMINATED BY \u0027,\u0027\nSTORED AS TEXTFILE\ntblproperties(\"skip.header.line.count\"\u003d\"1\")"
},
{
"text": "%jdbc\n\nLOAD DATA INPATH \u0027/grades.csv\u0027 INTO TABLE grades"
},
{
"text": "%jdbc\n\nSELECT * FROM grades"
},
{
"text": "%sh\n\n# Take a look at the warehouse directory, specifically where our Hive table is stored.\n hadoop fs -ls /usr/hive/warehouse/grades"
},
{
"text": "%sh\n\n# Put the file back into HDFS - it was moved to warehouse directory when we loaded it with Hive.\nhadoop fs -put /data/grades.csv /\nhadoop fs -ls /"
},
{
"text": "%spark\n\n// Basic Spark functions\nspark.range(1000 * 1000 * 1000).count()"
},
{
"text": "%spark\n\n// Dataframes\nval df \u003d Seq(\n (\"One\", 1),\n (\"Two\", 2),\n (\"Three\", 3),\n (\"Four\", 4)\n).toDF(\"This is\", \"an example\")\ndf.show()"
},
{
"text": "%spark\n\n// Read CSV file from HDFS into Dataframe\nval df \u003d spark.read.format(\"csv\").option(\"header\", \"true\").load(\"/grades.csv\")\ndf.show()"
},
{
"text": "%spark\n\n// Spark SQL and temporary views\ndf.createOrReplaceTempView(\"df\")\nspark.sql(\"SHOW TABLES\").show()"
},
{
"text": "%spark\n\nspark.sql(\"SELECT * FROM df WHERE Final > 50\").show()"
},
{
"text": "%spark.pyspark\n\n# Check Python version - 2 not allowed.\nimport sys\nprint(sys.version)"
},
{
"text": "%spark.pyspark\n\n# Basic Spark functions\nspark.range(1000 * 1000 * 1000).count()"
},
{
"text": "%spark.pyspark\n\n# Dataframes\ndf \u003d sqlContext.createDataFrame([(\"One\", 1), (\"Two\", 2), (\"Three\", 3), (\"Four\", 4)], (\"This is\", \"an example\"))\ndf.show()"
},
{
"text": "%spark.pyspark\n\n# Read CSV file from HDFS into Dataframe\ndf \u003d spark.read.format(\"csv\").option(\"header\", \"true\").load(\"/grades.csv\")\ndf.show()"
},
{
"text": "%spark.r\n\n# Dataframes\ndf \u003c- as.DataFrame(list(\"One\", \"Two\", \"Three\", \"Four\"), \"This is as example\")\nhead(df)"
},
{
"text": "%spark.r\n\n# Read CSV file from HDFS into Dataframe\ndf \u003c- read.df(\"/grades.csv\", \"csv\", header\u003d\"true\")\nhead(df)"
},
{
"text": "%livy\n\n// Scala Spark over Livy\nspark.range(1000 * 1000 * 1000).count()"
},
{
"text": "%livy.pyspark\n\n# PySpark over Livy\nimport sys\nprint(sys.version)\nspark.range(1000 * 1000 * 1000).count()"
},
{
"text": "%livy.sparkr\n\n# SparkR over Livy\ndf \u003c- as.DataFrame(list(\"One\", \"Two\", \"Three\", \"Four\"), \"This is as example\")\nhead(df)"
},
{
"text": "%livy.sql\nSELECT 1, CONCAT(\u0027This is\u0027, \u0027 a test\u0027)"
}
],
"name": "test",
"id": "2FVBJBJ1V",
"defaultInterpreterGroup": "spark",
"version": "0.9.0",
"noteParams": {},
"noteForms": {},
"angularObjects": {},
"config": {
"isZeppelinNotebookCronEnable": false
},
"info": {}
}