Repository: panovvv/bigdata-docker-compose Branch: master Commit: ca515ac08b21 Files: 9 Total size: 32.1 KB Directory structure: gitextract_p_pqx3h3/ ├── .gitignore ├── LICENSE ├── README.md ├── data/ │ ├── batches/ │ │ └── sample_batch.py │ ├── grades.csv │ └── ssn-address.tsv ├── docker-compose.yml ├── init.sql └── zeppelin_notebooks/ └── test_2FVBJBJ1V.zpln ================================================ FILE CONTENTS ================================================ ================================================ FILE: .gitignore ================================================ # OS garbage .DS_Store desktop.ini # IDE garbage .idea/ # Livy batch files, copied over from elsewhere, except one sample batch data/batches/* !data/batches/sample_batch.py # Spark job results data/output/ ================================================ FILE: LICENSE ================================================ The MIT License Copyright (c) 2020 Vadim Panov Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 
================================================
FILE: README.md
================================================
# Big data playground: Cluster with Hadoop, Hive, Spark, Zeppelin and Livy via Docker-compose

I wanted the ability to play around with various big data applications as effortlessly as possible, namely those found in Amazon EMR. Ideally, that would be something that can be brought up and torn down in one command. This is how this repository came to be!

## Constituent images

* [Base image](https://github.com/panovvv/hadoop-hive-spark-docker): [build status](https://cloud.docker.com/repository/docker/panovvv/hadoop-hive-spark/builds), [Docker Hub](https://hub.docker.com/r/panovvv/hadoop-hive-spark)
* [Zeppelin image](https://github.com/panovvv/zeppelin-bigdata-docker): [build status](https://cloud.docker.com/repository/docker/panovvv/zeppelin-bigdata/builds), [Docker Hub](https://hub.docker.com/r/panovvv/zeppelin-bigdata)
* [Livy image](https://github.com/panovvv/livy-docker): [build status](https://cloud.docker.com/repository/docker/panovvv/livy/builds), [Docker Hub](https://hub.docker.com/r/panovvv/livy)

## Usage

Clone:
```bash
git clone https://github.com/panovvv/bigdata-docker-compose.git
```

* On non-Linux platforms, dedicate more RAM to Docker than it allocates by default (2 GB on my machine with 16 GB of RAM). Otherwise applications (ResourceManager in my case) will quit sporadically, and you'll see messages like this one in the logs:
```
current-datetime INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 1234ms
No GCs detected
```
Increasing memory to 8 GB solved all those mysterious problems for me.
* Keep your disk usage below 90%, otherwise YARN will deem all nodes unhealthy.

Bring everything up:
```bash
cd bigdata-docker-compose
docker-compose up -d
```

* The **data/** directory is mounted into every container; you can use it as storage both for files you want to process with Hive/Spark/whatever and for the results of those computations.
* The **livy_batches/** directory holds sample code for Livy batch processing mode. It's mounted to the node where Livy is running. You can store your own code there as well, or make use of the universal **data/**.
* **zeppelin_notebooks/** contains, quite predictably, notebook files for Zeppelin. Thanks to that, all your notebooks persist across runs.

The Hive JDBC port is exposed to the host:
* URI: `jdbc:hive2://localhost:10000`
* Driver: `org.apache.hive.jdbc.HiveDriver` (org.apache.hive:hive-jdbc:3.1.2)
* User and password: unused.

To shut the whole thing down, run this from the same folder:
```bash
docker-compose down
```

## Checking if everything plays well together

You can quickly check everything by opening the [bundled Zeppelin notebook](http://localhost:8890) and running all of its paragraphs. Alternatively, to get a sense of how it all works under the hood, follow the instructions below.

### Hadoop and YARN

Check the [YARN (Hadoop ResourceManager) Web UI (localhost:8088)](http://localhost:8088/). You should see 2 active nodes there. There's also an [alternative YARN Web UI 2 (http://localhost:8088/ui2)](http://localhost:8088/ui2). Then check the [Hadoop NameNode UI (localhost:9870)](http://localhost:9870) and the Hadoop DataNode UIs at [http://localhost:9864](http://localhost:9864) and [http://localhost:9865](http://localhost:9865): all of those URLs should result in a page.
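As a quick scripted alternative to clicking through all those UIs, you can probe the mapped ports from the host. This is only a sketch: the port-to-service map below is copied from docker-compose.yml, while `port_open` is a hypothetical helper, not something shipped with this repo.

```python
# Sketch: probe the web UI ports that docker-compose.yml maps to localhost.
# The port list mirrors the compose file; the probing helper is illustrative.
import socket

UI_PORTS = {
    8088: "YARN ResourceManager UI",
    9870: "Hadoop NameNode UI",
    9864: "Hadoop DataNode UI (worker1)",
    9865: "Hadoop DataNode UI (worker2)",
    8080: "Spark Master UI",
    18080: "Spark History Server",
    8890: "Zeppelin",
    8998: "Livy",
}

def port_open(port, host="localhost", timeout=2.0):
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Once the cluster is up, `{name: port_open(p) for p, name in UI_PORTS.items()}` should come back all `True`.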
Open up a shell in the master node:
```bash
docker-compose exec master bash
jps
```
The `jps` command outputs a list of running Java processes, which on the Hadoop NameNode / Spark Master node should include these:
```
123 Jps
456 ResourceManager
789 NameNode
234 SecondaryNameNode
567 HistoryServer
890 Master
```
...but not necessarily in this order or with those IDs; some extras like `RunJar` and `JobHistoryServer` might be there too.

Then let's see if YARN can see all the resources we have (2 worker nodes):
```bash
yarn node -list
```
```
current-datetime INFO client.RMProxy: Connecting to ResourceManager at master/172.28.1.1:8032
Total Nodes:2
        Node-Id   Node-State  Node-Http-Address  Number-of-Running-Containers
  worker1:45019      RUNNING       worker1:8042                             0
  worker2:41001      RUNNING       worker2:8042                             0
```
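The ResourceManager exposes the same node list over its REST API at `/ws/v1/cluster/nodes`. Here's a minimal sketch using only the standard library; the endpoint path and field names come from Hadoop's documented RM REST API, while the helper names are just illustrative:

```python
# Sketch: query the YARN ResourceManager REST API for the node list,
# the programmatic equivalent of `yarn node -list`. Assumes the cluster
# from this docker-compose is up, with RM port 8088 mapped to localhost.
import json
from urllib.request import urlopen

RM_NODES_URL = "http://localhost:8088/ws/v1/cluster/nodes"

def summarize_nodes(payload):
    """Reduce the /ws/v1/cluster/nodes JSON to (id, state, containers) tuples."""
    nodes = payload.get("nodes", {}).get("node", [])
    return [(n["id"], n["state"], n.get("numContainers", 0)) for n in nodes]

def fetch_nodes(url=RM_NODES_URL):
    """Fetch and summarize the node list from a running ResourceManager."""
    with urlopen(url) as resp:
        return summarize_nodes(json.load(resp))
```

With the cluster running, `fetch_nodes()` should return both workers in the `RUNNING` state.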
Check HDFS (Hadoop Distributed File System) health:
```bash
hdfs dfsadmin -report
```
```
Live datanodes (2):
Name: 172.28.1.2:9866 (worker1)
...
Name: 172.28.1.3:9866 (worker2)
```

Now we'll upload a file into HDFS and see that it's visible from all nodes:
```bash
hadoop fs -put /data/grades.csv /
hadoop fs -ls /
```
```
Found N items
...
-rw-r--r--   2 root supergroup  ...  /grades.csv
...
```

Ctrl+D out of master now. Repeat for the remaining nodes (there are 3 total: master, worker1 and worker2):
```bash
docker-compose exec worker1 bash
hadoop fs -ls /
```
```
Found 1 items
-rw-r--r--   2 root supergroup  ...  /grades.csv
```

While we're on nodes other than the Hadoop NameNode / Spark Master node, `jps` output should include `DataNode` and `Worker` now, instead of `NameNode` and `Master`:
```bash
jps
```
```
123 Jps
456 NodeManager
789 DataNode
234 Worker
```

### Hive

Prerequisite: the file `grades.csv` is stored in HDFS (`hadoop fs -put /data/grades.csv /`).

```bash
docker-compose exec master bash
hive
```
```sql
CREATE TABLE grades(
  `Last name` STRING,
  `First name` STRING,
  `SSN` STRING,
  `Test1` DOUBLE,
  `Test2` INT,
  `Test3` DOUBLE,
  `Test4` DOUBLE,
  `Final` DOUBLE,
  `Grade` STRING)
COMMENT 'https://people.sc.fsu.edu/~jburkardt/data/csv/csv.html'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
tblproperties("skip.header.line.count"="1");

LOAD DATA INPATH '/grades.csv' INTO TABLE grades;

SELECT * FROM grades;
-- OK
-- Alfalfa    Aloysius  123-45-6789  40.0  90  100.0  83.0   49.0  D-
-- Alfred     University 123-12-1234 41.0  97  96.0   97.0   48.0  D+
-- Gerty      Gramma    567-89-0123  41.0  80  60.0   40.0   44.0  C
-- Android    Electric  087-65-4321  42.0  23  36.0   45.0   47.0  B-
-- Bumpkin    Fred      456-78-9012  43.0  78  88.0   77.0   45.0  A-
-- Rubble     Betty     234-56-7890  44.0  90  80.0   90.0   46.0  C-
-- Noshow     Cecil     345-67-8901  45.0  11  -1.0   4.0    43.0  F
-- Buff       Bif       632-79-9939  46.0  20  30.0   40.0   50.0  B+
-- Airpump    Andrew    223-45-6789  49.0  1   90.0   100.0  83.0  A
-- Backus     Jim       143-12-1234  48.0  1   97.0   96.0   97.0  A+
-- Carnivore  Art       565-89-0123  44.0  1   80.0   60.0   40.0  D+
-- Dandy      Jim       087-75-4321  47.0  1   23.0   36.0   45.0  C+
-- Elephant   Ima       456-71-9012  45.0  1   78.0   88.0   77.0  B-
-- Franklin   Benny     234-56-2890  50.0  1   90.0   80.0   90.0  B-
-- George     Boy       345-67-3901  40.0  1   11.0   -1.0   4.0   B
-- Heffalump  Harvey    632-79-9439  30.0  1   20.0   30.0   40.0  C
-- Time taken: 3.324 seconds, Fetched: 16 row(s)
```
Ctrl+D back to bash. Check that the file has been loaded into the Hive warehouse directory:
```bash
hadoop fs -ls /usr/hive/warehouse/grades
```
```
Found 1 items
-rw-r--r--   2 root supergroup  ...  /usr/hive/warehouse/grades/grades.csv
```

The table we just created should be accessible from all nodes; let's verify that now:
```bash
docker-compose exec worker2 bash
hive
```
```sql
SELECT * FROM grades;
```
You should be able to see the same table.

### Spark

Open up the [Spark Master Web UI (localhost:8080)](http://localhost:8080/):
```
Workers (2)
Worker Id                           Address          State  Cores       Memory
worker-timestamp-172.28.1.3-8882    172.28.1.3:8882  ALIVE  2 (0 Used)  1024.0 MB (0.0 B Used)
worker-timestamp-172.28.1.2-8881    172.28.1.2:8881  ALIVE  2 (0 Used)  1024.0 MB (0.0 B Used)
```

There are also worker UIs at [localhost:8081](http://localhost:8081/) and [localhost:8082](http://localhost:8082/). All those pages should be accessible.

Then there's the Spark History Server running at [localhost:18080](http://localhost:18080/): every time you run Spark jobs, you will see them here. The History Server includes a REST API at [localhost:18080/api/v1/applications](http://localhost:18080/api/v1/applications), which mirrors everything on the main page, only in JSON format.

Let's run some sample jobs now:
```bash
docker-compose exec master bash
run-example SparkPi 10
# ...or you can do the same via spark-submit:
spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn \
    --deploy-mode client \
    --driver-memory 2g \
    --executor-memory 1g \
    --executor-cores 1 \
    $SPARK_HOME/examples/jars/spark-examples*.jar \
    10
```
```
INFO spark.SparkContext: Running Spark version 2.4.4
INFO spark.SparkContext: Submitted application: Spark Pi
..
INFO client.RMProxy: Connecting to ResourceManager at master/172.28.1.1:8032
INFO yarn.Client: Requesting a new application from cluster with 2 NodeManagers
...
INFO yarn.Client: Application report for application_1567375394688_0001 (state: ACCEPTED)
...
INFO yarn.Client: Application report for application_1567375394688_0001 (state: RUNNING)
...
INFO scheduler.DAGScheduler: Job 0 finished: reduce at SparkPi.scala:38, took 1.102882 s
Pi is roughly 3.138915138915139
...
INFO util.ShutdownHookManager: Deleting directory /tmp/spark-81ea2c22-d96e-4d7c-a8d7-9240d8eb22ce
```

Spark has 3 interactive shells: `spark-shell` for Scala, `pyspark` for Python and `sparkR` for R. Let's try them all out:
```bash
hadoop fs -put /data/grades.csv /
spark-shell
```
```scala
spark.range(1000 * 1000 * 1000).count()

val df = spark.read.format("csv").option("header", "true").load("/grades.csv")
df.show()

df.createOrReplaceTempView("df")
spark.sql("SHOW TABLES").show()
spark.sql("SELECT * FROM df WHERE Final > 50").show()

// TODO SELECT TABLE from Hive - not working for now.
spark.sql("SELECT * FROM grades").show()
```
```
Spark context Web UI available at http://localhost:4040
Spark context available as 'sc' (master = yarn, app id = application_N).
Spark session available as 'spark'.

res0: Long = 1000000000

df: org.apache.spark.sql.DataFrame = [Last name: string, First name: string ... 7 more fields]
+---------+----------+-----------+-----+-----+-----+-----+-----+-----+
|Last name|First name|        SSN|Test1|Test2|Test3|Test4|Final|Grade|
+---------+----------+-----------+-----+-----+-----+-----+-----+-----+
|  Alfalfa|  Aloysius|123-45-6789|   40|   90|  100|   83|   49|   D-|
...
|Heffalump|    Harvey|632-79-9439|   30|    1|   20|   30|   40|    C|
+---------+----------+-----------+-----+-----+-----+-----+-----+-----+

+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
|        |       df|       true|
+--------+---------+-----------+

+---------+----------+-----------+-----+-----+-----+-----+-----+-----+
|Last name|First name|        SSN|Test1|Test2|Test3|Test4|Final|Grade|
+---------+----------+-----------+-----+-----+-----+-----+-----+-----+
|  Airpump|    Andrew|223-45-6789|   49|    1|   90|  100|   83|    A|
|   Backus|       Jim|143-12-1234|   48|    1|   97|   96|   97|   A+|
| Elephant|       Ima|456-71-9012|   45|    1|   78|   88|   77|   B-|
| Franklin|     Benny|234-56-2890|   50|    1|   90|   80|   90|   B-|
+---------+----------+-----------+-----+-----+-----+-----+-----+-----+
```

Ctrl+D out of Scala shell now.
```bash
pyspark
```
```python
spark.range(1000 * 1000 * 1000).count()

df = spark.read.format('csv').option('header', 'true').load('/grades.csv')
df.show()

df.createOrReplaceTempView('df')
spark.sql('SHOW TABLES').show()
spark.sql('SELECT * FROM df WHERE Final > 50').show()

# TODO SELECT TABLE from Hive - not working for now.
spark.sql('SELECT * FROM grades').show()
```
```
1000000000

$same_tables_as_above
```

Ctrl+D out of PySpark.
```bash
sparkR
```
```R
df <- as.DataFrame(list("One", "Two", "Three", "Four"), "This is as example")
head(df)

df <- read.df("/grades.csv", "csv", header="true")
head(df)
```
```
  This is as example
1                One
2                Two
3              Three
4               Four

$same_tables_as_above
```

* Amazon S3

From Hadoop:
```bash
hadoop fs -Dfs.s3a.impl="org.apache.hadoop.fs.s3a.S3AFileSystem" \
  -Dfs.s3a.access.key="classified" \
  -Dfs.s3a.secret.key="classified" \
  -ls "s3a://bucket"
```
Then from PySpark:
```python
sc._jsc.hadoopConfiguration().set('fs.s3a.impl', 'org.apache.hadoop.fs.s3a.S3AFileSystem')
sc._jsc.hadoopConfiguration().set('fs.s3a.access.key', 'classified')
sc._jsc.hadoopConfiguration().set('fs.s3a.secret.key', 'classified')

df = spark.read.format('csv').option('header', 'true').option('sep', '\t').load('s3a://bucket/tabseparated_withheader.tsv')
df.show(5)
```
None of the commands above stores your credentials anywhere (i.e. as soon as you shut the cluster down, your creds are safe). More persistent ways of storing the credentials are out of scope for this readme.

### Zeppelin

The Zeppelin interface should be available at [http://localhost:8890](http://localhost:8890). You'll find a notebook called "test" in there, containing commands to test integration with bash, Spark and Livy.

### Livy

Livy is at [http://localhost:8998](http://localhost:8998) (and yes, there's a web UI as well as a REST API on that port - just click the link).

* Livy Sessions.
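The curl-based session walkthrough that follows (create, poll until idle, post statements, delete) can also be scripted. Here's a minimal sketch using only the standard library; the endpoint paths (`/sessions`, `/sessions/{id}/statements`) are Livy's REST API, while the helper names are purely illustrative:

```python
# Sketch of the Livy session lifecycle, using only the standard library.
# Endpoint paths match Livy's REST API; the helpers themselves are ours.
import json
from urllib.request import Request, urlopen

LIVY = "http://localhost:8998"

def livy_json(method, path, body=None):
    """Send a JSON request to Livy and decode the JSON response."""
    data = json.dumps(body).encode() if body is not None else None
    req = Request(LIVY + path, data=data, method=method,
                  headers={"Content-Type": "application/json"})
    with urlopen(req) as resp:
        return json.load(resp)

def session_is_ready(session):
    """A session can accept statements once its state is 'idle'."""
    return session.get("state") == "idle"
```

Usage with a running cluster: `livy_json("POST", "/sessions", {"kind": "pyspark"})`, poll `livy_json("GET", "/sessions/0")` until `session_is_ready(...)`, post code with `livy_json("POST", "/sessions/0/statements", {"code": "..."})`, then clean up with `livy_json("DELETE", "/sessions/0")`.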
Try to poll the REST API:
```bash
curl --request GET \
  --url http://localhost:8998/sessions | python3 -mjson.tool
```
The response, assuming you didn't create any sessions before, should look like this:
```json
{
    "from": 0,
    "total": 0,
    "sessions": []
}
```

1) Create a session:
```bash
curl --request POST \
  --url http://localhost:8998/sessions \
  --header 'content-type: application/json' \
  --data '{
    "kind": "pyspark"
}' | python3 -mjson.tool
```
Response:
```json
{
    "id": 0,
    "name": null,
    "appId": null,
    "owner": null,
    "proxyUser": null,
    "state": "starting",
    "kind": "pyspark",
    "appInfo": {
        "driverLogUrl": null,
        "sparkUiUrl": null
    },
    "log": [
        "stdout: ",
        "\nstderr: ",
        "\nYARN Diagnostics: "
    ]
}
```

2) Wait for the session to start (state will transition from "starting" to "idle"):
```bash
curl --request GET \
  --url http://localhost:8998/sessions/0 | python3 -mjson.tool
```
Response:
```json
{
    "id": 0,
    "name": null,
    "appId": "application_1584274334558_0001",
    "owner": null,
    "proxyUser": null,
    "state": "starting",
    "kind": "pyspark",
    "appInfo": {
        "driverLogUrl": "http://worker2:8042/node/containerlogs/container_1584274334558_0003_01_000001/root",
        "sparkUiUrl": "http://master:8088/proxy/application_1584274334558_0003/"
    },
    "log": [
        "timestamp bla"
    ]
}
```

3) Post some statements:
```bash
curl --request POST \
  --url http://localhost:8998/sessions/0/statements \
  --header 'content-type: application/json' \
  --data '{
    "code": "import sys;print(sys.version)"
}' | python3 -mjson.tool

curl --request POST \
  --url http://localhost:8998/sessions/0/statements \
  --header 'content-type: application/json' \
  --data '{
    "code": "spark.range(1000 * 1000 * 1000).count()"
}' | python3 -mjson.tool
```
Response:
```json
{
    "id": 0,
    "code": "import sys;print(sys.version)",
    "state": "waiting",
    "output": null,
    "progress": 0.0,
    "started": 0,
    "completed": 0
}
```
```json
{
    "id": 1,
    "code": "spark.range(1000 * 1000 * 1000).count()",
    "state": "waiting",
    "output": null,
    "progress": 0.0,
    "started": 0,
    "completed": 0
}
```

4) Get the result:
```bash
curl --request GET \
  --url http://localhost:8998/sessions/0/statements | python3 -mjson.tool
```
Response:
```json
{
    "total_statements": 2,
    "statements": [
        {
            "id": 0,
            "code": "import sys;print(sys.version)",
            "state": "available",
            "output": {
                "status": "ok",
                "execution_count": 0,
                "data": {
                    "text/plain": "3.7.3 (default, Apr 3 2019, 19:16:38) \n[GCC 8.0.1 20180414 (experimental) [trunk revision 259383]]"
                }
            },
            "progress": 1.0
        },
        {
            "id": 1,
            "code": "spark.range(1000 * 1000 * 1000).count()",
            "state": "available",
            "output": {
                "status": "ok",
                "execution_count": 1,
                "data": {
                    "text/plain": "1000000000"
                }
            },
            "progress": 1.0
        }
    ]
}
```

5) Delete the session:
```bash
curl --request DELETE \
  --url http://localhost:8998/sessions/0 | python3 -mjson.tool
```
Response:
```json
{
    "msg": "deleted"
}
```

* Livy Batches.

To get all active batches:
```bash
curl --request GET \
  --url http://localhost:8998/batches | python3 -mjson.tool
```
Strangely enough, this elicits the same response as if we were querying the sessions endpoint, but OK...
1) Send the batch:
```bash
curl --request POST \
  --url http://localhost:8998/batches \
  --header 'content-type: application/json' \
  --data '{
    "file": "local:/data/batches/sample_batch.py",
    "pyFiles": [
        "local:/data/batches/sample_batch.py"
    ],
    "args": [
        "123"
    ]
}' | python3 -mjson.tool
```
Response:
```json
{
    "id": 0,
    "name": null,
    "owner": null,
    "proxyUser": null,
    "state": "starting",
    "appId": null,
    "appInfo": {
        "driverLogUrl": null,
        "sparkUiUrl": null
    },
    "log": [
        "stdout: ",
        "\nstderr: ",
        "\nYARN Diagnostics: "
    ]
}
```

2) Query the status:
```bash
curl --request GET \
  --url http://localhost:8998/batches/0 | python3 -mjson.tool
```
Response:
```json
{
    "id": 0,
    "name": null,
    "owner": null,
    "proxyUser": null,
    "state": "running",
    "appId": "application_1584274334558_0005",
    "appInfo": {
        "driverLogUrl": "http://worker2:8042/node/containerlogs/container_1584274334558_0005_01_000001/root",
        "sparkUiUrl": "http://master:8088/proxy/application_1584274334558_0005/"
    },
    "log": [
        "timestamp bla",
        "\nstderr: ",
        "\nYARN Diagnostics: "
    ]
}
```

3) To see all log lines, query the `/log` endpoint. You can skip the 'to' and 'from' params, or manipulate them to page through the log - Livy (as of 0.7.0) returns no more than 100 log lines per response.
```bash
curl --request GET \
  --url 'http://localhost:8998/batches/0/log?from=100&to=200' | python3 -mjson.tool
```
Response:
```json
{
    "id": 0,
    "from": 100,
    "total": 203,
    "log": [
        "...",
        "Welcome to",
        "      ____              __",
        "     / __/__  ___ _____/ /__",
        "    _\\ \\/ _ \\/ _ `/ __/  '_/",
        "   /__ / .__/\\_,_/_/ /_/\\_\\   version 2.4.5",
        "      /_/",
        "",
        "Using Python version 3.7.5 (default, Oct 17 2019 12:25:15)",
        "SparkSession available as 'spark'.",
        "3.7.5 (default, Oct 17 2019, 12:25:15) ",
        "[GCC 8.3.0]",
        "Arguments: ",
        "['/data/batches/sample_batch.py', '123']",
        "Custom number passed in args: 123",
        "Will raise 123 to the power of 3...",
        "...",
        "123 ^ 3 = 1860867",
        "...",
        "2020-03-15 13:06:09,503 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-138164b7-c5dc-4dc5-be6b-7a49c6bcdff0/pyspark-4d73b7c7-e27c-462f-9e5a-96011790d059"
    ]
}
```

4) Delete the batch:
```bash
curl --request DELETE \
  --url http://localhost:8998/batches/0 | python3 -mjson.tool
```
Response:
```json
{
    "msg": "deleted"
}
```

## Credits

Sample data files:
* __grades.csv__ is borrowed from [John Burkardt's page](https://people.sc.fsu.edu/~jburkardt/data/csv/csv.html) under the Florida State University domain. Thanks for sharing those!
* __ssn-address.tsv__ is derived from __grades.csv__ by removing some fields and adding randomly-generated addresses.
================================================
FILE: data/batches/sample_batch.py
================================================
import sys

from pyspark.shell import spark

print(sys.version)

print("Arguments: \n" + str(sys.argv))
try:
    num = int(sys.argv[1])
    print("Custom number passed in args: " + str(num))
except (ValueError, IndexError):
    num = 1000
    print("No valid number in args, defaulting to " + str(num))

# Checking if f-strings are available (python>=3.6)
print(f"Will raise {num} to the power of 3...")
cube = spark.range(num * num * num).count()
print(f"{num} ^ 3 = {cube}")

================================================
FILE: data/grades.csv
================================================
Last name,First name,SSN,Test1,Test2,Test3,Test4,Final,Grade
Alfalfa,Aloysius,123-45-6789,40,90,100,83,49,D-
Alfred,University,123-12-1234,41,97,96,97,48,D+
Gerty,Gramma,567-89-0123,41,80,60,40,44,C
Android,Electric,087-65-4321,42,23,36,45,47,B-
Bumpkin,Fred,456-78-9012,43,78,88,77,45,A-
Rubble,Betty,234-56-7890,44,90,80,90,46,C-
Noshow,Cecil,345-67-8901,45,11,-1,4,43,F
Buff,Bif,632-79-9939,46,20,30,40,50,B+
Airpump,Andrew,223-45-6789,49,1,90,100,83,A
Backus,Jim,143-12-1234,48,1,97,96,97,A+
Carnivore,Art,565-89-0123,44,1,80,60,40,D+
Dandy,Jim,087-75-4321,47,1,23,36,45,C+
Elephant,Ima,456-71-9012,45,1,78,88,77,B-
Franklin,Benny,234-56-2890,50,1,90,80,90,B-
George,Boy,345-67-3901,40,1,11,-1,4,B
Heffalump,Harvey,632-79-9439,30,1,20,30,40,C

================================================
FILE: data/ssn-address.tsv
================================================
Alfalfa Aloysius 123-45-6789 7098 East Road Hopkins, MN 55343 Backus Jim 143-12-1234 603 Wagon Drive Miamisburg, OH 45342 Dandy Jim 087-75-4321 4 Ann St. Hackensack, NJ 07601 George Boy 345-67-3901 13 Foxrun Ave. Annandale, VA 22003 Alfred University 123-12-1234 98 Wellington Ave.
Lowell, MA 01851 Elephant Ima 456-71-9012 Heffalump Harvey 632-79-9439 5 Beech Street Canyon Country, CA 91387 Gerty Gramma 567-89-0123 Rubble Betty 234-56-7890 9715 Penn St. Royal Oak, MI 48067 ================================================ FILE: docker-compose.yml ================================================ version: "3.7" services: hivemetastore: image: postgres:11.5 hostname: hivemetastore environment: POSTGRES_PASSWORD: new_password expose: - 5432 volumes: - ./init.sql:/docker-entrypoint-initdb.d/init.sql healthcheck: test: ["CMD-SHELL", "pg_isready -U postgres"] interval: 10s timeout: 5s retries: 5 networks: spark_net: ipv4_address: 172.28.1.4 extra_hosts: - "master:172.28.1.1" - "worker1:172.28.1.2" - "worker2:172.28.1.3" - "zeppelin:172.28.1.5" - "livy:172.28.1.6" master: image: panovvv/hadoop-hive-spark:2.5.2 # build: '../hadoop-hive-spark-docker' hostname: master depends_on: - hivemetastore environment: HADOOP_NODE: namenode HIVE_CONFIGURE: yes, please SPARK_PUBLIC_DNS: localhost SPARK_LOCAL_IP: 172.28.1.1 SPARK_MASTER_HOST: 172.28.1.1 SPARK_LOCAL_HOSTNAME: master expose: - 1-65535 ports: # Spark Master Web UI - 8080:8080 # Spark job Web UI: increments for each successive job - 4040:4040 - 4041:4041 - 4042:4042 - 4043:4043 # Spark History server - 18080:18080 # YARN UI - 8088:8088 # Hadoop namenode UI - 9870:9870 # Hadoop secondary namenode UI - 9868:9868 # Hive JDBC - 10000:10000 volumes: - ./data:/data networks: spark_net: ipv4_address: 172.28.1.1 extra_hosts: - "worker1:172.28.1.2" - "worker2:172.28.1.3" - "hivemetastore:172.28.1.4" - "zeppelin:172.28.1.5" - "livy:172.28.1.6" worker1: image: panovvv/hadoop-hive-spark:2.5.2 # build: '../hadoop-hive-spark-docker' hostname: worker1 depends_on: - hivemetastore environment: SPARK_MASTER_ADDRESS: spark://master:7077 SPARK_WORKER_PORT: 8881 SPARK_WORKER_WEBUI_PORT: 8081 SPARK_PUBLIC_DNS: localhost SPARK_LOCAL_HOSTNAME: worker1 SPARK_LOCAL_IP: 172.28.1.2 SPARK_MASTER_HOST: 172.28.1.1 HADOOP_NODE: 
datanode expose: - 1-65535 ports: # Hadoop datanode UI - 9864:9864 #Spark worker UI - 8081:8081 volumes: - ./data:/data networks: spark_net: ipv4_address: 172.28.1.2 extra_hosts: - "master:172.28.1.1" - "worker2:172.28.1.3" - "hivemetastore:172.28.1.4" - "zeppelin:172.28.1.5" - "livy:172.28.1.6" worker2: image: panovvv/hadoop-hive-spark:2.5.2 # build: '../hadoop-hive-spark-docker' hostname: worker2 depends_on: - hivemetastore environment: SPARK_MASTER_ADDRESS: spark://master:7077 SPARK_WORKER_PORT: 8882 SPARK_WORKER_WEBUI_PORT: 8082 SPARK_PUBLIC_DNS: localhost SPARK_LOCAL_HOSTNAME: worker2 SPARK_LOCAL_IP: 172.28.1.3 SPARK_MASTER_HOST: 172.28.1.1 HADOOP_NODE: datanode HADOOP_DATANODE_UI_PORT: 9865 expose: - 1-65535 ports: # Hadoop datanode UI - 9865:9865 # Spark worker UI - 8082:8082 volumes: - ./data:/data networks: spark_net: ipv4_address: 172.28.1.3 extra_hosts: - "master:172.28.1.1" - "worker1:172.28.1.2" - "hivemetastore:172.28.1.4" - "zeppelin:172.28.1.5" - "livy:172.28.1.6" livy: image: panovvv/livy:2.5.2 # build: '../livy-docker' hostname: livy depends_on: - master - worker1 - worker2 volumes: - ./livy_batches:/livy_batches - ./data:/data environment: - SPARK_MASTER=yarn # Intentionally not specified - if it's set here, then we can't override it # via REST API ("conf"={} map) # Can be client or cluster # - SPARK_DEPLOY_MODE=client - LOCAL_DIR_WHITELIST=/data/batches/ - ENABLE_HIVE_CONTEXT=false # Defaults are fine for variables below. Uncomment to change them. 
# - LIVY_HOST=0.0.0.0 # - LIVY_PORT=8998 expose: - 1-65535 ports: - 8998:8998 networks: spark_net: ipv4_address: 172.28.1.6 extra_hosts: - "master:172.28.1.1" - "worker1:172.28.1.2" - "worker2:172.28.1.3" - "hivemetastore:172.28.1.4" - "zeppelin:172.28.1.5" zeppelin: image: panovvv/zeppelin-bigdata:2.5.2 # build: '../zeppelin-bigdata-docker' hostname: zeppelin depends_on: - master - worker1 - worker2 - livy volumes: - ./zeppelin_notebooks:/zeppelin_notebooks - ./data:/data environment: ZEPPELIN_PORT: 8890 ZEPPELIN_NOTEBOOK_DIR: '/zeppelin_notebooks' expose: - 8890 ports: - 8890:8890 networks: spark_net: ipv4_address: 172.28.1.5 extra_hosts: - "master:172.28.1.1" - "worker1:172.28.1.2" - "worker2:172.28.1.3" - "hivemetastore:172.28.1.4" - "livy:172.28.1.6" networks: spark_net: ipam: driver: default config: - subnet: 172.28.0.0/16 ================================================ FILE: init.sql ================================================ CREATE DATABASE "hivemetastoredb"; ================================================ FILE: zeppelin_notebooks/test_2FVBJBJ1V.zpln ================================================ { "paragraphs": [ { "text": "%sh\n\n#Only load a file to HDFS if it\u0027s not already there - because of this you can run all paragraphs as many times as you like.\nhadoop fs -test -e /grades.csv\n\nif ! hadoop fs -test -e /grades.csv\nthen\n echo \"*******************************************\"\n echo \"grades.csv is not in HDFS yet! Uploading...\"\n echo \"*******************************************\"d\n hadoop fs -put /data/grades.csv /\nfi" }, { "text": "%sh\n\nhadoop fs -ls /" }, { "text": "%jdbc\n\n-- Does not support more than one statement per paragraph, it seems. 
Same goes for semicolon at the end of statements - errors out if you include it.\nDROP TABLE IF EXISTS grades" }, { "text": "%jdbc\n\nCREATE TABLE grades(\n `Last name` STRING,\n `First name` STRING,\n `SSN` STRING,\n `Test1` DOUBLE,\n `Test2` INT,\n `Test3` DOUBLE,\n `Test4` DOUBLE,\n `Final` DOUBLE,\n `Grade` STRING)\nCOMMENT \u0027https://people.sc.fsu.edu/~jburkardt/data/csv/csv.html\u0027\nROW FORMAT DELIMITED\nFIELDS TERMINATED BY \u0027,\u0027\nSTORED AS TEXTFILE\ntblproperties(\"skip.header.line.count\"\u003d\"1\")" }, { "text": "%jdbc\n\nLOAD DATA INPATH \u0027/grades.csv\u0027 INTO TABLE grades" }, { "text": "%jdbc\n\nSELECT * FROM grades" }, { "text": "%sh\n\n# Take a look at the warehouse directory, specifically where our Hive table is stored.\n hadoop fs -ls /usr/hive/warehouse/grades" }, { "text": "%sh\n\n# Put the file back into HDFS - it was moved to warehouse directory when we loaded it with Hive.\nhadoop fs -put /data/grades.csv /\nhadoop fs -ls /" }, { "text": "%spark\n\n// Basic Spark functions\nspark.range(1000 * 1000 * 1000).count()" }, { "text": "%spark\n\n// Dataframes\nval df \u003d Seq(\n (\"One\", 1),\n (\"Two\", 2),\n (\"Three\", 3),\n (\"Four\", 4)\n).toDF(\"This is\", \"an example\")\ndf.show()" }, { "text": "%spark\n\n// Read CSV file from HDFS into Dataframe\nval df \u003d spark.read.format(\"csv\").option(\"header\", \"true\").load(\"/grades.csv\")\ndf.show()" }, { "text": "%spark\n\n// Spark SQL and temporary views\ndf.createOrReplaceTempView(\"df\")\nspark.sql(\"SHOW TABLES\").show()" }, { "text": "%spark\n\nspark.sql(\"SELECT * FROM df WHERE Final > 50\").show()" }, { "text": "%spark.pyspark\n\n# Check Python version - 2 not allowed.\nimport sys\nprint(sys.version)" }, { "text": "%spark.pyspark\n\n# Basic Spark functions\nspark.range(1000 * 1000 * 1000).count()" }, { "text": "%spark.pyspark\n\n# Dataframes\ndf \u003d sqlContext.createDataFrame([(\"One\", 1), (\"Two\", 2), (\"Three\", 3), (\"Four\", 4)], (\"This is\", \"an 
example\"))\ndf.show()" }, { "text": "%spark.pyspark\n\n# Read CSV file from HDFS into Dataframe\ndf \u003d spark.read.format(\"csv\").option(\"header\", \"true\").load(\"/grades.csv\")\ndf.show()" }, { "text": "%spark.r\n\n# Dataframes\ndf \u003c- as.DataFrame(list(\"One\", \"Two\", \"Three\", \"Four\"), \"This is as example\")\nhead(df)" }, { "text": "%spark.r\n\n# Read CSV file from HDFS into Dataframe\ndf \u003c- read.df(\"/grades.csv\", \"csv\", header\u003d\"true\")\nhead(df)" }, { "text": "%livy\n\n// Scala Spark over Livy\nspark.range(1000 * 1000 * 1000).count()" }, { "text": "%livy.pyspark\n\n# PySpark over Livy\nimport sys\nprint(sys.version)\nspark.range(1000 * 1000 * 1000).count()" }, { "text": "%livy.sparkr\n\n# SparkR over Livy\ndf \u003c- as.DataFrame(list(\"One\", \"Two\", \"Three\", \"Four\"), \"This is as example\")\nhead(df)" }, { "text": "%livy.sql\nSELECT 1, CONCAT(\u0027This is\u0027, \u0027 a test\u0027)" } ], "name": "test", "id": "2FVBJBJ1V", "defaultInterpreterGroup": "spark", "version": "0.9.0", "noteParams": {}, "noteForms": {}, "angularObjects": {}, "config": { "isZeppelinNotebookCronEnable": false }, "info": {} }