Repository: panovvv/bigdata-docker-compose Branch: master Commit: ca515ac08b21 Files: 9 Total size: 32.1 KB Directory structure: gitextract_p_pqx3h3/ ├── .gitignore ├── LICENSE ├── README.md ├── data/ │ ├── batches/ │ │ └── sample_batch.py │ ├── grades.csv │ └── ssn-address.tsv ├── docker-compose.yml ├── init.sql └── zeppelin_notebooks/ └── test_2FVBJBJ1V.zpln ================================================ FILE CONTENTS ================================================ ================================================ FILE: .gitignore ================================================ # OS garbage .DS_Store desktop.ini # IDE garbage .idea/ # Livy batch files, copied over from elsewhere, except one sample batch data/batches/* !data/batches/sample_batch.py # Spark job results data/output/ ================================================ FILE: LICENSE ================================================ The MIT License Copyright (c) 2020 Vadim Panov Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 
================================================
FILE: README.md
================================================
# Big data playground: Cluster with Hadoop, Hive, Spark, Zeppelin and Livy via Docker-compose

I wanted the ability to play around with various big data applications as effortlessly as possible, namely those found in Amazon EMR. Ideally, that would be something that can be brought up and torn down in one command. This is how this repository came to be!

## Constituent images

* [Base image](https://github.com/panovvv/hadoop-hive-spark-docker): [build status](https://cloud.docker.com/repository/docker/panovvv/hadoop-hive-spark/builds), [Docker Hub](https://hub.docker.com/r/panovvv/hadoop-hive-spark)
* [Zeppelin image](https://github.com/panovvv/zeppelin-bigdata-docker): [build status](https://cloud.docker.com/repository/docker/panovvv/zeppelin-bigdata/builds), [Docker Hub](https://hub.docker.com/r/panovvv/zeppelin-bigdata)
* [Livy image](https://github.com/panovvv/livy-docker): [build status](https://cloud.docker.com/repository/docker/panovvv/livy/builds), [Docker Hub](https://hub.docker.com/r/panovvv/livy)

## Usage

Clone:
```bash
git clone https://github.com/panovvv/bigdata-docker-compose.git
```

* On non-Linux platforms, dedicate more RAM to Docker than it allocates by default (2 GB on my machine with 16 GB of RAM). Otherwise applications (ResourceManager in my case) will quit sporadically, and you'll see messages like this one in the logs:
```
current-datetime INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 1234ms
No GCs detected
```
Increasing memory to 8 GB solved all those mysterious problems for me.
* Keep your disk usage below 90%, otherwise YARN will deem all nodes unhealthy.

Bring everything up:
```bash
cd bigdata-docker-compose
docker-compose up -d
```

* The **data/** directory is mounted into every container; you can use it as storage both for files you want to process with Hive/Spark/whatever and for the results of those computations.
* The **livy_batches/** directory holds sample code for Livy batch processing mode. It's mounted to the node where Livy is running. You can store your own code there as well, or make use of the universal **data/**.
* **zeppelin_notebooks/** contains, quite predictably, notebook files for Zeppelin. Thanks to that, all your notebooks persist across runs.

The Hive JDBC port is exposed to the host:
* URI: `jdbc:hive2://localhost:10000`
* Driver: `org.apache.hive.jdbc.HiveDriver` (org.apache.hive:hive-jdbc:3.1.2)
* User and password: unused.

To shut the whole thing down, run this from the same folder:
```bash
docker-compose down
```

## Checking if everything plays well together

You can quickly check everything by opening the [bundled Zeppelin notebook](http://localhost:8890) and running all of its paragraphs. Alternatively, to get a sense of how it all works under the hood, follow the instructions below.

### Hadoop and YARN

Check the [YARN (Hadoop ResourceManager) Web UI (localhost:8088)](http://localhost:8088/). You should see 2 active nodes there. There's also an [alternative YARN Web UI 2 (http://localhost:8088/ui2)](http://localhost:8088/ui2). Then check the [Hadoop NameNode UI (localhost:9870)](http://localhost:9870) and the Hadoop DataNode UIs at [http://localhost:9864](http://localhost:9864) and [http://localhost:9865](http://localhost:9865): all of those URLs should result in a page.
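As a quick scripted alternative to clicking through all those UIs, you can probe the mapped ports from the host. This is only a sketch: the port-to-service map below is copied from docker-compose.yml, while `port_open` is a hypothetical helper, not something shipped with this repo.

```python
# Sketch: probe the web UI ports that docker-compose.yml maps to localhost.
# The port list mirrors the compose file; the probing helper is illustrative.
import socket

UI_PORTS = {
    8088: "YARN ResourceManager UI",
    9870: "Hadoop NameNode UI",
    9864: "Hadoop DataNode UI (worker1)",
    9865: "Hadoop DataNode UI (worker2)",
    8080: "Spark Master UI",
    18080: "Spark History Server",
    8890: "Zeppelin",
    8998: "Livy",
}

def port_open(port, host="localhost", timeout=2.0):
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Once the cluster is up, `{name: port_open(p) for p, name in UI_PORTS.items()}` should come back all `True`.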
Open up a shell in the master node:
```bash
docker-compose exec master bash
jps
```
The `jps` command outputs a list of running Java processes, which on the Hadoop NameNode / Spark Master node should include these:
```
123 Jps
456 ResourceManager
789 NameNode
234 SecondaryNameNode
567 HistoryServer
890 Master
```
...but not necessarily in this order or with those IDs; some extras like `RunJar` and `JobHistoryServer` might be there too.

Then let's see if YARN can see all the resources we have (2 worker nodes):
```bash
yarn node -list
```
```
current-datetime INFO client.RMProxy: Connecting to ResourceManager at master/172.28.1.1:8032
Total Nodes:2
        Node-Id   Node-State  Node-Http-Address  Number-of-Running-Containers
  worker1:45019      RUNNING       worker1:8042                             0
  worker2:41001      RUNNING       worker2:8042                             0
```
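The ResourceManager exposes the same node list over its REST API at `/ws/v1/cluster/nodes`. Here's a minimal sketch using only the standard library; the endpoint path and field names come from Hadoop's documented RM REST API, while the helper names are just illustrative:

```python
# Sketch: query the YARN ResourceManager REST API for the node list,
# the programmatic equivalent of `yarn node -list`. Assumes the cluster
# from this docker-compose is up, with RM port 8088 mapped to localhost.
import json
from urllib.request import urlopen

RM_NODES_URL = "http://localhost:8088/ws/v1/cluster/nodes"

def summarize_nodes(payload):
    """Reduce the /ws/v1/cluster/nodes JSON to (id, state, containers) tuples."""
    nodes = payload.get("nodes", {}).get("node", [])
    return [(n["id"], n["state"], n.get("numContainers", 0)) for n in nodes]

def fetch_nodes(url=RM_NODES_URL):
    """Fetch and summarize the node list from a running ResourceManager."""
    with urlopen(url) as resp:
        return summarize_nodes(json.load(resp))
```

With the cluster running, `fetch_nodes()` should return both workers in the `RUNNING` state.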
Check HDFS (Hadoop Distributed File System) health:
```bash
hdfs dfsadmin -report
```
```
Live datanodes (2):
Name: 172.28.1.2:9866 (worker1)
...
Name: 172.28.1.3:9866 (worker2)
```

Now we'll upload a file into HDFS and see that it's visible from all nodes:
```bash
hadoop fs -put /data/grades.csv /
hadoop fs -ls /
```
```
Found N items
...
-rw-r--r--   2 root supergroup  ...  /grades.csv
...
```

Ctrl+D out of master now. Repeat for the remaining nodes (there are 3 total: master, worker1 and worker2):
```bash
docker-compose exec worker1 bash
hadoop fs -ls /
```
```
Found 1 items
-rw-r--r--   2 root supergroup  ...  /grades.csv
```

While we're on nodes other than the Hadoop NameNode / Spark Master node, `jps` output should include `DataNode` and `Worker` now, instead of `NameNode` and `Master`:
```bash
jps
```
```
123 Jps
456 NodeManager
789 DataNode
234 Worker
```

### Hive

Prerequisite: the file `grades.csv` is stored in HDFS (`hadoop fs -put /data/grades.csv /`).

```bash
docker-compose exec master bash
hive
```
```sql
CREATE TABLE grades(
  `Last name` STRING,
  `First name` STRING,
  `SSN` STRING,
  `Test1` DOUBLE,
  `Test2` INT,
  `Test3` DOUBLE,
  `Test4` DOUBLE,
  `Final` DOUBLE,
  `Grade` STRING)
COMMENT 'https://people.sc.fsu.edu/~jburkardt/data/csv/csv.html'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
tblproperties("skip.header.line.count"="1");

LOAD DATA INPATH '/grades.csv' INTO TABLE grades;

SELECT * FROM grades;
-- OK
-- Alfalfa    Aloysius  123-45-6789  40.0  90  100.0  83.0   49.0  D-
-- Alfred     University 123-12-1234 41.0  97  96.0   97.0   48.0  D+
-- Gerty      Gramma    567-89-0123  41.0  80  60.0   40.0   44.0  C
-- Android    Electric  087-65-4321  42.0  23  36.0   45.0   47.0  B-
-- Bumpkin    Fred      456-78-9012  43.0  78  88.0   77.0   45.0  A-
-- Rubble     Betty     234-56-7890  44.0  90  80.0   90.0   46.0  C-
-- Noshow     Cecil     345-67-8901  45.0  11  -1.0   4.0    43.0  F
-- Buff       Bif       632-79-9939  46.0  20  30.0   40.0   50.0  B+
-- Airpump    Andrew    223-45-6789  49.0  1   90.0   100.0  83.0  A
-- Backus     Jim       143-12-1234  48.0  1   97.0   96.0   97.0  A+
-- Carnivore  Art       565-89-0123  44.0  1   80.0   60.0   40.0  D+
-- Dandy      Jim       087-75-4321  47.0  1   23.0   36.0   45.0  C+
-- Elephant   Ima       456-71-9012  45.0  1   78.0   88.0   77.0  B-
-- Franklin   Benny     234-56-2890  50.0  1   90.0   80.0   90.0  B-
-- George     Boy       345-67-3901  40.0  1   11.0   -1.0   4.0   B
-- Heffalump  Harvey    632-79-9439  30.0  1   20.0   30.0   40.0  C
-- Time taken: 3.324 seconds, Fetched: 16 row(s)
```
Ctrl+D back to bash. Check that the file has been loaded into the Hive warehouse directory:
```bash
hadoop fs -ls /usr/hive/warehouse/grades
```
```
Found 1 items
-rw-r--r--   2 root supergroup  ...  /usr/hive/warehouse/grades/grades.csv
```

The table we just created should be accessible from all nodes; let's verify that now:
```bash
docker-compose exec worker2 bash
hive
```
```sql
SELECT * FROM grades;
```
You should be able to see the same table.

### Spark

Open up the [Spark Master Web UI (localhost:8080)](http://localhost:8080/):
```
Workers (2)
Worker Id                           Address          State  Cores       Memory
worker-timestamp-172.28.1.3-8882    172.28.1.3:8882  ALIVE  2 (0 Used)  1024.0 MB (0.0 B Used)
worker-timestamp-172.28.1.2-8881    172.28.1.2:8881  ALIVE  2 (0 Used)  1024.0 MB (0.0 B Used)
```

There are also worker UIs at [localhost:8081](http://localhost:8081/) and [localhost:8082](http://localhost:8082/). All those pages should be accessible.

Then there's the Spark History Server running at [localhost:18080](http://localhost:18080/): every time you run Spark jobs, you will see them here. The History Server includes a REST API at [localhost:18080/api/v1/applications](http://localhost:18080/api/v1/applications), which mirrors everything on the main page, only in JSON format.

Let's run some sample jobs now:
```bash
docker-compose exec master bash
run-example SparkPi 10
# ...or you can do the same via spark-submit:
spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn \
    --deploy-mode client \
    --driver-memory 2g \
    --executor-memory 1g \
    --executor-cores 1 \
    $SPARK_HOME/examples/jars/spark-examples*.jar \
    10
```
```
INFO spark.SparkContext: Running Spark version 2.4.4
INFO spark.SparkContext: Submitted application: Spark Pi
..
INFO client.RMProxy: Connecting to ResourceManager at master/172.28.1.1:8032
INFO yarn.Client: Requesting a new application from cluster with 2 NodeManagers
...
INFO yarn.Client: Application report for application_1567375394688_0001 (state: ACCEPTED)
...
INFO yarn.Client: Application report for application_1567375394688_0001 (state: RUNNING)
...
INFO scheduler.DAGScheduler: Job 0 finished: reduce at SparkPi.scala:38, took 1.102882 s
Pi is roughly 3.138915138915139
...
INFO util.ShutdownHookManager: Deleting directory /tmp/spark-81ea2c22-d96e-4d7c-a8d7-9240d8eb22ce
```

Spark has 3 interactive shells: `spark-shell` for Scala, `pyspark` for Python and `sparkR` for R. Let's try them all out:
```bash
hadoop fs -put /data/grades.csv /
spark-shell
```
```scala
spark.range(1000 * 1000 * 1000).count()

val df = spark.read.format("csv").option("header", "true").load("/grades.csv")
df.show()

df.createOrReplaceTempView("df")
spark.sql("SHOW TABLES").show()
spark.sql("SELECT * FROM df WHERE Final > 50").show()

// TODO SELECT TABLE from Hive - not working for now.
spark.sql("SELECT * FROM grades").show()
```
```
Spark context Web UI available at http://localhost:4040
Spark context available as 'sc' (master = yarn, app id = application_N).
Spark session available as 'spark'.

res0: Long = 1000000000

df: org.apache.spark.sql.DataFrame = [Last name: string, First name: string ... 7 more fields]
+---------+----------+-----------+-----+-----+-----+-----+-----+-----+
|Last name|First name|        SSN|Test1|Test2|Test3|Test4|Final|Grade|
+---------+----------+-----------+-----+-----+-----+-----+-----+-----+
|  Alfalfa|  Aloysius|123-45-6789|   40|   90|  100|   83|   49|   D-|
...
|Heffalump|    Harvey|632-79-9439|   30|    1|   20|   30|   40|    C|
+---------+----------+-----------+-----+-----+-----+-----+-----+-----+

+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
|        |       df|       true|
+--------+---------+-----------+

+---------+----------+-----------+-----+-----+-----+-----+-----+-----+
|Last name|First name|        SSN|Test1|Test2|Test3|Test4|Final|Grade|
+---------+----------+-----------+-----+-----+-----+-----+-----+-----+
|  Airpump|    Andrew|223-45-6789|   49|    1|   90|  100|   83|    A|
|   Backus|       Jim|143-12-1234|   48|    1|   97|   96|   97|   A+|
| Elephant|       Ima|456-71-9012|   45|    1|   78|   88|   77|   B-|
| Franklin|     Benny|234-56-2890|   50|    1|   90|   80|   90|   B-|
+---------+----------+-----------+-----+-----+-----+-----+-----+-----+
```

Ctrl+D out of Scala shell now.
```bash
pyspark
```
```python
spark.range(1000 * 1000 * 1000).count()

df = spark.read.format('csv').option('header', 'true').load('/grades.csv')
df.show()

df.createOrReplaceTempView('df')
spark.sql('SHOW TABLES').show()
spark.sql('SELECT * FROM df WHERE Final > 50').show()

# TODO SELECT TABLE from Hive - not working for now.
spark.sql('SELECT * FROM grades').show()
```
```
1000000000

$same_tables_as_above
```

Ctrl+D out of PySpark.
```bash
sparkR
```
```R
df <- as.DataFrame(list("One", "Two", "Three", "Four"), "This is as example")
head(df)

df <- read.df("/grades.csv", "csv", header="true")
head(df)
```
```
  This is as example
1                One
2                Two
3              Three
4               Four

$same_tables_as_above
```

* Amazon S3

From Hadoop:
```bash
hadoop fs -Dfs.s3a.impl="org.apache.hadoop.fs.s3a.S3AFileSystem" \
  -Dfs.s3a.access.key="classified" \
  -Dfs.s3a.secret.key="classified" \
  -ls "s3a://bucket"
```
Then from PySpark:
```python
sc._jsc.hadoopConfiguration().set('fs.s3a.impl', 'org.apache.hadoop.fs.s3a.S3AFileSystem')
sc._jsc.hadoopConfiguration().set('fs.s3a.access.key', 'classified')
sc._jsc.hadoopConfiguration().set('fs.s3a.secret.key', 'classified')

df = spark.read.format('csv').option('header', 'true').option('sep', '\t').load('s3a://bucket/tabseparated_withheader.tsv')
df.show(5)
```
None of the commands above stores your credentials anywhere (i.e. as soon as you shut the cluster down, your creds are safe). More persistent ways of storing the credentials are out of scope for this readme.

### Zeppelin

The Zeppelin interface should be available at [http://localhost:8890](http://localhost:8890). You'll find a notebook called "test" in there, containing commands to test integration with bash, Spark and Livy.

### Livy

Livy is at [http://localhost:8998](http://localhost:8998) (and yes, there's a web UI as well as a REST API on that port - just click the link).

* Livy Sessions.
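The curl-based session walkthrough that follows (create, poll until idle, post statements, delete) can also be scripted. Here's a minimal sketch using only the standard library; the endpoint paths (`/sessions`, `/sessions/{id}/statements`) are Livy's REST API, while the helper names are purely illustrative:

```python
# Sketch of the Livy session lifecycle, using only the standard library.
# Endpoint paths match Livy's REST API; the helpers themselves are ours.
import json
from urllib.request import Request, urlopen

LIVY = "http://localhost:8998"

def livy_json(method, path, body=None):
    """Send a JSON request to Livy and decode the JSON response."""
    data = json.dumps(body).encode() if body is not None else None
    req = Request(LIVY + path, data=data, method=method,
                  headers={"Content-Type": "application/json"})
    with urlopen(req) as resp:
        return json.load(resp)

def session_is_ready(session):
    """A session can accept statements once its state is 'idle'."""
    return session.get("state") == "idle"
```

Usage with a running cluster: `livy_json("POST", "/sessions", {"kind": "pyspark"})`, poll `livy_json("GET", "/sessions/0")` until `session_is_ready(...)`, post code with `livy_json("POST", "/sessions/0/statements", {"code": "..."})`, then clean up with `livy_json("DELETE", "/sessions/0")`.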
Try to poll the REST API:
```bash
curl --request GET \
  --url http://localhost:8998/sessions | python3 -mjson.tool
```
The response, assuming you didn't create any sessions before, should look like this:
```json
{
    "from": 0,
    "total": 0,
    "sessions": []
}
```

1) Create a session:
```bash
curl --request POST \
  --url http://localhost:8998/sessions \
  --header 'content-type: application/json' \
  --data '{
    "kind": "pyspark"
}' | python3 -mjson.tool
```
Response:
```json
{
    "id": 0,
    "name": null,
    "appId": null,
    "owner": null,
    "proxyUser": null,
    "state": "starting",
    "kind": "pyspark",
    "appInfo": {
        "driverLogUrl": null,
        "sparkUiUrl": null
    },
    "log": [
        "stdout: ",
        "\nstderr: ",
        "\nYARN Diagnostics: "
    ]
}
```

2) Wait for the session to start (state will transition from "starting" to "idle"):
```bash
curl --request GET \
  --url http://localhost:8998/sessions/0 | python3 -mjson.tool
```
Response:
```json
{
    "id": 0,
    "name": null,
    "appId": "application_1584274334558_0001",
    "owner": null,
    "proxyUser": null,
    "state": "starting",
    "kind": "pyspark",
    "appInfo": {
        "driverLogUrl": "http://worker2:8042/node/containerlogs/container_1584274334558_0003_01_000001/root",
        "sparkUiUrl": "http://master:8088/proxy/application_1584274334558_0003/"
    },
    "log": [
        "timestamp bla"
    ]
}
```

3) Post some statements:
```bash
curl --request POST \
  --url http://localhost:8998/sessions/0/statements \
  --header 'content-type: application/json' \
  --data '{
    "code": "import sys;print(sys.version)"
}' | python3 -mjson.tool

curl --request POST \
  --url http://localhost:8998/sessions/0/statements \
  --header 'content-type: application/json' \
  --data '{
    "code": "spark.range(1000 * 1000 * 1000).count()"
}' | python3 -mjson.tool
```
Response:
```json
{
    "id": 0,
    "code": "import sys;print(sys.version)",
    "state": "waiting",
    "output": null,
    "progress": 0.0,
    "started": 0,
    "completed": 0
}
```
```json
{
    "id": 1,
    "code": "spark.range(1000 * 1000 * 1000).count()",
    "state": "waiting",
    "output": null,
    "progress": 0.0,
    "started": 0,
    "completed": 0
}
```

4) Get the result:
```bash
curl --request GET \
  --url http://localhost:8998/sessions/0/statements | python3 -mjson.tool
```
Response:
```json
{
    "total_statements": 2,
    "statements": [
        {
            "id": 0,
            "code": "import sys;print(sys.version)",
            "state": "available",
            "output": {
                "status": "ok",
                "execution_count": 0,
                "data": {
                    "text/plain": "3.7.3 (default, Apr 3 2019, 19:16:38) \n[GCC 8.0.1 20180414 (experimental) [trunk revision 259383]]"
                }
            },
            "progress": 1.0
        },
        {
            "id": 1,
            "code": "spark.range(1000 * 1000 * 1000).count()",
            "state": "available",
            "output": {
                "status": "ok",
                "execution_count": 1,
                "data": {
                    "text/plain": "1000000000"
                }
            },
            "progress": 1.0
        }
    ]
}
```

5) Delete the session:
```bash
curl --request DELETE \
  --url http://localhost:8998/sessions/0 | python3 -mjson.tool
```
Response:
```json
{
    "msg": "deleted"
}
```

* Livy Batches.

To get all active batches:
```bash
curl --request GET \
  --url http://localhost:8998/batches | python3 -mjson.tool
```
Strangely enough, this elicits the same response as if we were querying the sessions endpoint, but OK...
1) Send the batch:
```bash
curl --request POST \
  --url http://localhost:8998/batches \
  --header 'content-type: application/json' \
  --data '{
    "file": "local:/data/batches/sample_batch.py",
    "pyFiles": [
        "local:/data/batches/sample_batch.py"
    ],
    "args": [
        "123"
    ]
}' | python3 -mjson.tool
```
Response:
```json
{
    "id": 0,
    "name": null,
    "owner": null,
    "proxyUser": null,
    "state": "starting",
    "appId": null,
    "appInfo": {
        "driverLogUrl": null,
        "sparkUiUrl": null
    },
    "log": [
        "stdout: ",
        "\nstderr: ",
        "\nYARN Diagnostics: "
    ]
}
```

2) Query the status:
```bash
curl --request GET \
  --url http://localhost:8998/batches/0 | python3 -mjson.tool
```
Response:
```json
{
    "id": 0,
    "name": null,
    "owner": null,
    "proxyUser": null,
    "state": "running",
    "appId": "application_1584274334558_0005",
    "appInfo": {
        "driverLogUrl": "http://worker2:8042/node/containerlogs/container_1584274334558_0005_01_000001/root",
        "sparkUiUrl": "http://master:8088/proxy/application_1584274334558_0005/"
    },
    "log": [
        "timestamp bla",
        "\nstderr: ",
        "\nYARN Diagnostics: "
    ]
}
```

3) To see all log lines, query the `/log` endpoint. You can skip the 'to' and 'from' params, or manipulate them to page through the log - Livy (as of 0.7.0) returns no more than 100 log lines per response.
```bash
curl --request GET \
  --url 'http://localhost:8998/batches/0/log?from=100&to=200' | python3 -mjson.tool
```
Response:
```json
{
    "id": 0,
    "from": 100,
    "total": 203,
    "log": [
        "...",
        "Welcome to",
        "      ____              __",
        "     / __/__  ___ _____/ /__",
        "    _\\ \\/ _ \\/ _ `/ __/  '_/",
        "   /__ / .__/\\_,_/_/ /_/\\_\\   version 2.4.5",
        "      /_/",
        "",
        "Using Python version 3.7.5 (default, Oct 17 2019 12:25:15)",
        "SparkSession available as 'spark'.",
        "3.7.5 (default, Oct 17 2019, 12:25:15) ",
        "[GCC 8.3.0]",
        "Arguments: ",
        "['/data/batches/sample_batch.py', '123']",
        "Custom number passed in args: 123",
        "Will raise 123 to the power of 3...",
        "...",
        "123 ^ 3 = 1860867",
        "...",
        "2020-03-15 13:06:09,503 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-138164b7-c5dc-4dc5-be6b-7a49c6bcdff0/pyspark-4d73b7c7-e27c-462f-9e5a-96011790d059"
    ]
}
```

4) Delete the batch:
```bash
curl --request DELETE \
  --url http://localhost:8998/batches/0 | python3 -mjson.tool
```
Response:
```json
{
    "msg": "deleted"
}
```

## Credits

Sample data files:
* __grades.csv__ is borrowed from [John Burkardt's page](https://people.sc.fsu.edu/~jburkardt/data/csv/csv.html) under the Florida State University domain. Thanks for sharing those!
* __ssn-address.tsv__ is derived from __grades.csv__ by removing some fields and adding randomly-generated addresses.
================================================
FILE: data/batches/sample_batch.py
================================================
import sys

from pyspark.shell import spark

print(sys.version)

print("Arguments: \n" + str(sys.argv))
try:
    num = int(sys.argv[1])
    print("Custom number passed in args: " + str(num))
except (ValueError, IndexError):
    num = 1000
    print("No valid number in args, defaulting to " + str(num))

# Checking if f-strings are available (python>=3.6)
print(f"Will raise {num} to the power of 3...")
cube = spark.range(num * num * num).count()
print(f"{num} ^ 3 = {cube}")

================================================
FILE: data/grades.csv
================================================
Last name,First name,SSN,Test1,Test2,Test3,Test4,Final,Grade
Alfalfa,Aloysius,123-45-6789,40,90,100,83,49,D-
Alfred,University,123-12-1234,41,97,96,97,48,D+
Gerty,Gramma,567-89-0123,41,80,60,40,44,C
Android,Electric,087-65-4321,42,23,36,45,47,B-
Bumpkin,Fred,456-78-9012,43,78,88,77,45,A-
Rubble,Betty,234-56-7890,44,90,80,90,46,C-
Noshow,Cecil,345-67-8901,45,11,-1,4,43,F
Buff,Bif,632-79-9939,46,20,30,40,50,B+
Airpump,Andrew,223-45-6789,49,1,90,100,83,A
Backus,Jim,143-12-1234,48,1,97,96,97,A+
Carnivore,Art,565-89-0123,44,1,80,60,40,D+
Dandy,Jim,087-75-4321,47,1,23,36,45,C+
Elephant,Ima,456-71-9012,45,1,78,88,77,B-
Franklin,Benny,234-56-2890,50,1,90,80,90,B-
George,Boy,345-67-3901,40,1,11,-1,4,B
Heffalump,Harvey,632-79-9439,30,1,20,30,40,C

================================================
FILE: data/ssn-address.tsv
================================================
Alfalfa Aloysius 123-45-6789 7098 East Road Hopkins, MN 55343 Backus Jim 143-12-1234 603 Wagon Drive Miamisburg, OH 45342 Dandy Jim 087-75-4321 4 Ann St. Hackensack, NJ 07601 George Boy 345-67-3901 13 Foxrun Ave. Annandale, VA 22003 Alfred University 123-12-1234 98 Wellington Ave.
Lowell, MA 01851 Elephant Ima 456-71-9012 Heffalump Harvey 632-79-9439 5 Beech Street Canyon Country, CA 91387 Gerty Gramma 567-89-0123 Rubble Betty 234-56-7890 9715 Penn St. Royal Oak, MI 48067 ================================================ FILE: docker-compose.yml ================================================ version: "3.7" services: hivemetastore: image: postgres:11.5 hostname: hivemetastore environment: POSTGRES_PASSWORD: new_password expose: - 5432 volumes: - ./init.sql:/docker-entrypoint-initdb.d/init.sql healthcheck: test: ["CMD-SHELL", "pg_isready -U postgres"] interval: 10s timeout: 5s retries: 5 networks: spark_net: ipv4_address: 172.28.1.4 extra_hosts: - "master:172.28.1.1" - "worker1:172.28.1.2" - "worker2:172.28.1.3" - "zeppelin:172.28.1.5" - "livy:172.28.1.6" master: image: panovvv/hadoop-hive-spark:2.5.2 # build: '../hadoop-hive-spark-docker' hostname: master depends_on: - hivemetastore environment: HADOOP_NODE: namenode HIVE_CONFIGURE: yes, please SPARK_PUBLIC_DNS: localhost SPARK_LOCAL_IP: 172.28.1.1 SPARK_MASTER_HOST: 172.28.1.1 SPARK_LOCAL_HOSTNAME: master expose: - 1-65535 ports: # Spark Master Web UI - 8080:8080 # Spark job Web UI: increments for each successive job - 4040:4040 - 4041:4041 - 4042:4042 - 4043:4043 # Spark History server - 18080:18080 # YARN UI - 8088:8088 # Hadoop namenode UI - 9870:9870 # Hadoop secondary namenode UI - 9868:9868 # Hive JDBC - 10000:10000 volumes: - ./data:/data networks: spark_net: ipv4_address: 172.28.1.1 extra_hosts: - "worker1:172.28.1.2" - "worker2:172.28.1.3" - "hivemetastore:172.28.1.4" - "zeppelin:172.28.1.5" - "livy:172.28.1.6" worker1: image: panovvv/hadoop-hive-spark:2.5.2 # build: '../hadoop-hive-spark-docker' hostname: worker1 depends_on: - hivemetastore environment: SPARK_MASTER_ADDRESS: spark://master:7077 SPARK_WORKER_PORT: 8881 SPARK_WORKER_WEBUI_PORT: 8081 SPARK_PUBLIC_DNS: localhost SPARK_LOCAL_HOSTNAME: worker1 SPARK_LOCAL_IP: 172.28.1.2 SPARK_MASTER_HOST: 172.28.1.1 HADOOP_NODE: 
datanode expose: - 1-65535 ports: # Hadoop datanode UI - 9864:9864 #Spark worker UI - 8081:8081 volumes: - ./data:/data networks: spark_net: ipv4_address: 172.28.1.2 extra_hosts: - "master:172.28.1.1" - "worker2:172.28.1.3" - "hivemetastore:172.28.1.4" - "zeppelin:172.28.1.5" - "livy:172.28.1.6" worker2: image: panovvv/hadoop-hive-spark:2.5.2 # build: '../hadoop-hive-spark-docker' hostname: worker2 depends_on: - hivemetastore environment: SPARK_MASTER_ADDRESS: spark://master:7077 SPARK_WORKER_PORT: 8882 SPARK_WORKER_WEBUI_PORT: 8082 SPARK_PUBLIC_DNS: localhost SPARK_LOCAL_HOSTNAME: worker2 SPARK_LOCAL_IP: 172.28.1.3 SPARK_MASTER_HOST: 172.28.1.1 HADOOP_NODE: datanode HADOOP_DATANODE_UI_PORT: 9865 expose: - 1-65535 ports: # Hadoop datanode UI - 9865:9865 # Spark worker UI - 8082:8082 volumes: - ./data:/data networks: spark_net: ipv4_address: 172.28.1.3 extra_hosts: - "master:172.28.1.1" - "worker1:172.28.1.2" - "hivemetastore:172.28.1.4" - "zeppelin:172.28.1.5" - "livy:172.28.1.6" livy: image: panovvv/livy:2.5.2 # build: '../livy-docker' hostname: livy depends_on: - master - worker1 - worker2 volumes: - ./livy_batches:/livy_batches - ./data:/data environment: - SPARK_MASTER=yarn # Intentionally not specified - if it's set here, then we can't override it # via REST API ("conf"={} map) # Can be client or cluster # - SPARK_DEPLOY_MODE=client - LOCAL_DIR_WHITELIST=/data/batches/ - ENABLE_HIVE_CONTEXT=false # Defaults are fine for variables below. Uncomment to change them. 
# - LIVY_HOST=0.0.0.0 # - LIVY_PORT=8998 expose: - 1-65535 ports: - 8998:8998 networks: spark_net: ipv4_address: 172.28.1.6 extra_hosts: - "master:172.28.1.1" - "worker1:172.28.1.2" - "worker2:172.28.1.3" - "hivemetastore:172.28.1.4" - "zeppelin:172.28.1.5" zeppelin: image: panovvv/zeppelin-bigdata:2.5.2 # build: '../zeppelin-bigdata-docker' hostname: zeppelin depends_on: - master - worker1 - worker2 - livy volumes: - ./zeppelin_notebooks:/zeppelin_notebooks - ./data:/data environment: ZEPPELIN_PORT: 8890 ZEPPELIN_NOTEBOOK_DIR: '/zeppelin_notebooks' expose: - 8890 ports: - 8890:8890 networks: spark_net: ipv4_address: 172.28.1.5 extra_hosts: - "master:172.28.1.1" - "worker1:172.28.1.2" - "worker2:172.28.1.3" - "hivemetastore:172.28.1.4" - "livy:172.28.1.6" networks: spark_net: ipam: driver: default config: - subnet: 172.28.0.0/16 ================================================ FILE: init.sql ================================================ CREATE DATABASE "hivemetastoredb"; ================================================ FILE: zeppelin_notebooks/test_2FVBJBJ1V.zpln ================================================ { "paragraphs": [ { "text": "%sh\n\n#Only load a file to HDFS if it\u0027s not already there - because of this you can run all paragraphs as many times as you like.\nhadoop fs -test -e /grades.csv\n\nif ! hadoop fs -test -e /grades.csv\nthen\n echo \"*******************************************\"\n echo \"grades.csv is not in HDFS yet! Uploading...\"\n echo \"*******************************************\"d\n hadoop fs -put /data/grades.csv /\nfi" }, { "text": "%sh\n\nhadoop fs -ls /" }, { "text": "%jdbc\n\n-- Does not support more than one statement per paragraph, it seems. 
Same goes for semicolon at the end of statements - errors out if you include it.\nDROP TABLE IF EXISTS grades" }, { "text": "%jdbc\n\nCREATE TABLE grades(\n `Last name` STRING,\n `First name` STRING,\n `SSN` STRING,\n `Test1` DOUBLE,\n `Test2` INT,\n `Test3` DOUBLE,\n `Test4` DOUBLE,\n `Final` DOUBLE,\n `Grade` STRING)\nCOMMENT \u0027https://people.sc.fsu.edu/~jburkardt/data/csv/csv.html\u0027\nROW FORMAT DELIMITED\nFIELDS TERMINATED BY \u0027,\u0027\nSTORED AS TEXTFILE\ntblproperties(\"skip.header.line.count\"\u003d\"1\")" }, { "text": "%jdbc\n\nLOAD DATA INPATH \u0027/grades.csv\u0027 INTO TABLE grades" }, { "text": "%jdbc\n\nSELECT * FROM grades" }, { "text": "%sh\n\n# Take a look at the warehouse directory, specifically where our Hive table is stored.\n hadoop fs -ls /usr/hive/warehouse/grades" }, { "text": "%sh\n\n# Put the file back into HDFS - it was moved to warehouse directory when we loaded it with Hive.\nhadoop fs -put /data/grades.csv /\nhadoop fs -ls /" }, { "text": "%spark\n\n// Basic Spark functions\nspark.range(1000 * 1000 * 1000).count()" }, { "text": "%spark\n\n// Dataframes\nval df \u003d Seq(\n (\"One\", 1),\n (\"Two\", 2),\n (\"Three\", 3),\n (\"Four\", 4)\n).toDF(\"This is\", \"an example\")\ndf.show()" }, { "text": "%spark\n\n// Read CSV file from HDFS into Dataframe\nval df \u003d spark.read.format(\"csv\").option(\"header\", \"true\").load(\"/grades.csv\")\ndf.show()" }, { "text": "%spark\n\n// Spark SQL and temporary views\ndf.createOrReplaceTempView(\"df\")\nspark.sql(\"SHOW TABLES\").show()" }, { "text": "%spark\n\nspark.sql(\"SELECT * FROM df WHERE Final > 50\").show()" }, { "text": "%spark.pyspark\n\n# Check Python version - 2 not allowed.\nimport sys\nprint(sys.version)" }, { "text": "%spark.pyspark\n\n# Basic Spark functions\nspark.range(1000 * 1000 * 1000).count()" }, { "text": "%spark.pyspark\n\n# Dataframes\ndf \u003d sqlContext.createDataFrame([(\"One\", 1), (\"Two\", 2), (\"Three\", 3), (\"Four\", 4)], (\"This is\", \"an 
example\"))\ndf.show()" }, { "text": "%spark.pyspark\n\n# Read CSV file from HDFS into Dataframe\ndf \u003d spark.read.format(\"csv\").option(\"header\", \"true\").load(\"/grades.csv\")\ndf.show()" }, { "text": "%spark.r\n\n# Dataframes\ndf \u003c- as.DataFrame(list(\"One\", \"Two\", \"Three\", \"Four\"), \"This is as example\")\nhead(df)" }, { "text": "%spark.r\n\n# Read CSV file from HDFS into Dataframe\ndf \u003c- read.df(\"/grades.csv\", \"csv\", header\u003d\"true\")\nhead(df)" }, { "text": "%livy\n\n// Scala Spark over Livy\nspark.range(1000 * 1000 * 1000).count()" }, { "text": "%livy.pyspark\n\n# PySpark over Livy\nimport sys\nprint(sys.version)\nspark.range(1000 * 1000 * 1000).count()" }, { "text": "%livy.sparkr\n\n# SparkR over Livy\ndf \u003c- as.DataFrame(list(\"One\", \"Two\", \"Three\", \"Four\"), \"This is as example\")\nhead(df)" }, { "text": "%livy.sql\nSELECT 1, CONCAT(\u0027This is\u0027, \u0027 a test\u0027)" } ], "name": "test", "id": "2FVBJBJ1V", "defaultInterpreterGroup": "spark", "version": "0.9.0", "noteParams": {}, "noteForms": {}, "angularObjects": {}, "config": { "isZeppelinNotebookCronEnable": false }, "info": {} }