[
  {
    "path": ".gitignore",
    "content": "# OS garbage\n.DS_Store\ndesktop.ini\n\n# IDE garbage\n.idea/\n\n# Livy batch files, copied over from elsewhere, except one sample batch\ndata/batches/*\n!data/batches/sample_batch.py\n\n# Spark job results\ndata/output/\n"
  },
  {
    "path": "LICENSE",
    "content": "The MIT License\n\nCopyright (c) 2020 Vadim Panov\n\nPermission is hereby granted, free of charge, to any person obtaining a copy\nof this software and associated documentation files (the \"Software\"), to deal\nin the Software without restriction, including without limitation the rights\nto use, copy, modify, merge, publish, distribute, sublicense, and/or sell\ncopies of the Software, and to permit persons to whom the Software is\nfurnished to do so, subject to the following conditions:\n\nThe above copyright notice and this permission notice shall be included in\nall copies or substantial portions of the Software.\n\nTHE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\nIMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\nFITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\nAUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\nLIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\nOUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN\nTHE SOFTWARE."
  },
  {
    "path": "README.md",
    "content": "# Big data playground: Cluster with Hadoop, Hive, Spark, Zeppelin and Livy via Docker-compose.\n\nI wanted to have the ability to play around with various big data\napplications as effortlessly as possible,\nnamely those found in Amazon EMR.\nIdeally, that would be something that can be brought up and torn down\nin one command. This is how this repository came to be!\n\n## Constituent images:\n\n[Base image](https://github.com/panovvv/hadoop-hive-spark-docker):\n[![Docker Build Status: Base image](https://img.shields.io/docker/cloud/build/panovvv/hadoop-hive-spark.svg)](https://cloud.docker.com/repository/docker/panovvv/hadoop-hive-spark/builds)\n[![Docker Pulls: Base image](https://img.shields.io/docker/pulls/panovvv/hadoop-hive-spark.svg)](https://hub.docker.com/r/panovvv/hadoop-hive-spark)\n[![Docker Stars: Base image](https://img.shields.io/docker/stars/panovvv/hadoop-hive-spark.svg)](https://hub.docker.com/r/panovvv/hadoop-hive-spark)\n\n[Zeppelin image](https://github.com/panovvv/zeppelin-bigdata-docker): [![Docker Build Status: Zeppelin](https://img.shields.io/docker/cloud/build/panovvv/zeppelin-bigdata.svg)](https://cloud.docker.com/repository/docker/panovvv/zeppelin-bigdata/builds)\n[![Docker Pulls: Zeppelin](https://img.shields.io/docker/pulls/panovvv/zeppelin-bigdata.svg)](https://hub.docker.com/r/panovvv/zeppelin-bigdata)\n[![Docker Stars: Zeppelin](https://img.shields.io/docker/stars/panovvv/zeppelin-bigdata.svg)](https://hub.docker.com/r/panovvv/zeppelin-bigdata)\n\n[Livy image](https://github.com/panovvv/livy-docker): [![Docker Build Status: Livy](https://img.shields.io/docker/cloud/build/panovvv/livy.svg)](https://cloud.docker.com/repository/docker/panovvv/livy/builds)\n[![Docker Pulls: Livy](https://img.shields.io/docker/pulls/panovvv/livy.svg)](https://hub.docker.com/r/panovvv/livy)\n[![Docker Stars: Livy](https://img.shields.io/docker/stars/panovvv/livy.svg)](https://hub.docker.com/r/panovvv/livy)\n\n## 
Usage\n\nClone:\n```bash\ngit clone https://github.com/panovvv/bigdata-docker-compose.git\n```\n* On non-Linux platforms, you should dedicate more RAM to Docker than it allocates by default\n  (2 GB on my machine with 16 GB of RAM). Otherwise, applications (ResourceManager in my case)\n  will quit sporadically, and you'll see messages like this one in the logs:\n  <pre>\n  current-datetime INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 1234ms\n  No GCs detected\n  </pre>\n  Increasing the memory to 8 GB solved all those mysterious problems for me.\n\n* Keep your disk usage below 90%, otherwise\n  YARN will deem all nodes unhealthy.\n\nBring everything up:\n```bash\ncd bigdata-docker-compose\ndocker-compose up -d\n```\n\n* The **data/** directory is mounted into every container. You can use it as\nstorage both for files you want to process with Hive/Spark/whatever\nand for the results of those computations.\n* The **livy_batches/** directory contains sample code for\nLivy's batch processing mode. It's mounted to the node where Livy\nis running. You can store your code there as well, or make use of the\nuniversal **data/**.\n* **zeppelin_notebooks/** contains, quite predictably, notebook files\nfor Zeppelin. 
Thanks to that, all your notebooks persist across runs.\n\nThe Hive JDBC port is exposed to the host:\n* URI: `jdbc:hive2://localhost:10000`\n* Driver: `org.apache.hive.jdbc.HiveDriver` (org.apache.hive:hive-jdbc:3.1.2)\n* User and password: unused.\n\nTo shut the whole thing down, run this from the same folder:\n```bash\ndocker-compose down\n```\n\n## Checking if everything plays well together\nYou can quickly check everything by opening the\n[bundled Zeppelin notebook](http://localhost:8890)\nand running all paragraphs.\n\nAlternatively, to get a sense of\nhow it all works under the hood, follow the instructions below:\n\n### Hadoop and YARN:\n\nCheck [YARN (Hadoop ResourceManager) Web UI\n(localhost:8088)](http://localhost:8088/).\nYou should see 2 active nodes there.\nThere's also an\n[alternative YARN Web UI 2 (http://localhost:8088/ui2)](http://localhost:8088/ui2).\n\nThen, [Hadoop Name Node UI (localhost:9870)](http://localhost:9870),\nHadoop Data Node UIs at\n[http://localhost:9864](http://localhost:9864) and [http://localhost:9865](http://localhost:9865):\nall of those URLs should serve a page.\n\nOpen up a shell on the master node:\n```bash\ndocker-compose exec master bash\njps\n```\nThe `jps` command outputs a list of running Java processes,\nwhich on the Hadoop Namenode/Spark Master node should include these:\n<pre>\n123 Jps\n456 ResourceManager\n789 NameNode\n234 SecondaryNameNode\n567 HistoryServer\n890 Master\n</pre>\n\n... 
but not necessarily in this order or with these exact IDs;\nsome extras like `RunJar` and `JobHistoryServer` might be there too.\n\nThen let's check that YARN can see all the resources we have (2 worker nodes):\n```bash\nyarn node -list\n```\n<pre>\ncurrent-datetime INFO client.RMProxy: Connecting to ResourceManager at master/172.28.1.1:8032\nTotal Nodes:2\n         Node-Id\t     Node-State\tNode-Http-Address\tNumber-of-Running-Containers\n   worker1:45019\t        RUNNING\t     worker1:8042\t                           0\n   worker2:41001\t        RUNNING\t     worker2:8042\t                           0\n</pre>\n\nHDFS (Hadoop Distributed File System) health:\n```bash\nhdfs dfsadmin -report\n```\n<pre>\nLive datanodes (2):\nName: 172.28.1.2:9866 (worker1)\n...\nName: 172.28.1.3:9866 (worker2)\n</pre>\n\nNow we'll upload a file into HDFS and see that it's visible from all\nnodes:\n```bash\nhadoop fs -put /data/grades.csv /\nhadoop fs -ls /\n```\n<pre>\nFound N items\n...\n-rw-r--r--   2 root supergroup  ... /grades.csv\n...\n</pre>\n\nCtrl+D out of master now. Repeat for the remaining nodes\n(there are 3 nodes in total: master, worker1 and worker2):\n\n```bash\ndocker-compose exec worker1 bash\nhadoop fs -ls /\n```\n<pre>\nFound 1 items\n-rw-r--r--   2 root supergroup  ... 
/grades.csv\n</pre>\n\nWhile we're on nodes other than Hadoop Namenode/Spark Master node,\njps command output should include DataNode and Worker now instead of\nNameNode and Master:\n```bash\njps\n```\n<pre>\n123 Jps\n456 NodeManager\n789 DataNode\n234 Worker\n</pre>\n\n### Hive\n\nPrerequisite: there's a file `grades.csv` stored in HDFS ( `hadoop fs -put /data/grades.csv /` )\n```bash\ndocker-compose exec master bash\nhive\n```\n```sql\nCREATE TABLE grades(\n    `Last name` STRING,\n    `First name` STRING,\n    `SSN` STRING,\n    `Test1` DOUBLE,\n    `Test2` INT,\n    `Test3` DOUBLE,\n    `Test4` DOUBLE,\n    `Final` DOUBLE,\n    `Grade` STRING)\nCOMMENT 'https://people.sc.fsu.edu/~jburkardt/data/csv/csv.html'\nROW FORMAT DELIMITED\nFIELDS TERMINATED BY ','\nSTORED AS TEXTFILE\ntblproperties(\"skip.header.line.count\"=\"1\");\n\nLOAD DATA INPATH '/grades.csv' INTO TABLE grades;\n\nSELECT * FROM grades;\n-- OK\n-- Alfalfa\tAloysius\t123-45-6789\t40.0\t90\t100.0\t83.0\t49.0\tD-\n-- Alfred\tUniversity\t123-12-1234\t41.0\t97\t96.0\t97.0\t48.0\tD+\n-- Gerty\tGramma\t567-89-0123\t41.0\t80\t60.0\t40.0\t44.0\tC\n-- Android\tElectric\t087-65-4321\t42.0\t23\t36.0\t45.0\t47.0\tB-\n-- Bumpkin\tFred\t456-78-9012\t43.0\t78\t88.0\t77.0\t45.0\tA-\n-- Rubble\tBetty\t234-56-7890\t44.0\t90\t80.0\t90.0\t46.0\tC-\n-- Noshow\tCecil\t345-67-8901\t45.0\t11\t-1.0\t4.0\t43.0\tF\n-- Buff\tBif\t632-79-9939\t46.0\t20\t30.0\t40.0\t50.0\tB+\n-- Airpump\tAndrew\t223-45-6789\t49.0\t1\t90.0\t100.0\t83.0\tA\n-- Backus\tJim\t143-12-1234\t48.0\t1\t97.0\t96.0\t97.0\tA+\n-- Carnivore\tArt\t565-89-0123\t44.0\t1\t80.0\t60.0\t40.0\tD+\n-- Dandy\tJim\t087-75-4321\t47.0\t1\t23.0\t36.0\t45.0\tC+\n-- Elephant\tIma\t456-71-9012\t45.0\t1\t78.0\t88.0\t77.0\tB-\n-- Franklin\tBenny\t234-56-2890\t50.0\t1\t90.0\t80.0\t90.0\tB-\n-- George\tBoy\t345-67-3901\t40.0\t1\t11.0\t-1.0\t4.0\tB\n-- Heffalump\tHarvey\t632-79-9439\t30.0\t1\t20.0\t30.0\t40.0\tC\n-- Time taken: 3.324 seconds, Fetched: 16 row(s)\n```\n\nCtrl+D 
back to bash. Check that the file has been moved to the Hive warehouse\ndirectory:\n\n```bash\nhadoop fs -ls /usr/hive/warehouse/grades\n```\n<pre>\nFound 1 items\n-rw-r--r--   2 root supergroup  ... /usr/hive/warehouse/grades/grades.csv\n</pre>\n\nThe table we just created should be accessible from all nodes; let's\nverify that now:\n```bash\ndocker-compose exec worker2 bash\nhive\n```\n```sql\nSELECT * FROM grades;\n```\nYou should be able to see the same table.\n\n### Spark\n\nOpen up [Spark Master Web UI (localhost:8080)](http://localhost:8080/):\n<pre>\nWorkers (2)\nWorker Id\tAddress\tState\tCores\tMemory\nworker-timestamp-172.28.1.3-8882\t172.28.1.3:8882\tALIVE\t2 (0 Used)\t1024.0 MB (0.0 B Used)\nworker-timestamp-172.28.1.2-8881\t172.28.1.2:8881\tALIVE\t2 (0 Used)\t1024.0 MB (0.0 B Used)\n</pre>\n\nThere are also worker UIs at [localhost:8081](http://localhost:8081/)\nand [localhost:8082](http://localhost:8082/). All those pages should be\naccessible.\n\nThere's also the Spark History Server running at\n[localhost:18080](http://localhost:18080/) - every time you run Spark jobs, you\nwill see them here.\n\nThe History Server exposes a REST API at\n[localhost:18080/api/v1/applications](http://localhost:18080/api/v1/applications).\nIt mirrors everything on the main page, only in JSON format.\n\nLet's run some sample jobs now:\n```bash\ndocker-compose exec master bash\nrun-example SparkPi 10\n# ...or you can do the same via spark-submit:\nspark-submit --class org.apache.spark.examples.SparkPi \\\n    --master yarn \\\n    --deploy-mode client \\\n    --driver-memory 2g \\\n    --executor-memory 1g \\\n    --executor-cores 1 \\\n    $SPARK_HOME/examples/jars/spark-examples*.jar \\\n    10\n```\n<pre>\nINFO spark.SparkContext: Running Spark version 2.4.4\nINFO spark.SparkContext: Submitted application: Spark Pi\n..\nINFO client.RMProxy: Connecting to ResourceManager at master/172.28.1.1:8032\nINFO yarn.Client: Requesting a new application from cluster with 2 
NodeManagers\n...\nINFO yarn.Client: Application report for application_1567375394688_0001 (state: ACCEPTED)\n...\nINFO yarn.Client: Application report for application_1567375394688_0001 (state: RUNNING)\n...\nINFO scheduler.DAGScheduler: Job 0 finished: reduce at SparkPi.scala:38, took 1.102882 s\nPi is roughly 3.138915138915139\n...\nINFO util.ShutdownHookManager: Deleting directory /tmp/spark-81ea2c22-d96e-4d7c-a8d7-9240d8eb22ce\n</pre>\n\nSpark has 3 interactive shells: spark-shell to code in Scala,\npyspark for Python and sparkR for R. Let's try them all out:\n```bash\nhadoop fs -put /data/grades.csv /\nspark-shell\n```\n```scala\nspark.range(1000 * 1000 * 1000).count()\n\nval df = spark.read.format(\"csv\").option(\"header\", \"true\").load(\"/grades.csv\")\ndf.show()\n\ndf.createOrReplaceTempView(\"df\")\nspark.sql(\"SHOW TABLES\").show()\nspark.sql(\"SELECT * FROM df WHERE Final > 50\").show()\n\n//TODO SELECT TABLE from hive - not working for now.\nspark.sql(\"SELECT * FROM grades\").show()\n```\n<pre>\nSpark context Web UI available at http://localhost:4040\nSpark context available as 'sc' (master = yarn, app id = application_N).\nSpark session available as 'spark'.\n\nres0: Long = 1000000000\n\ndf: org.apache.spark.sql.DataFrame = [Last name: string, First name: string ... 
7 more fields]\n\n+---------+----------+-----------+-----+-----+-----+-----+-----+-----+\n|Last name|First name|        SSN|Test1|Test2|Test3|Test4|Final|Grade|\n+---------+----------+-----------+-----+-----+-----+-----+-----+-----+\n|  Alfalfa|  Aloysius|123-45-6789|   40|   90|  100|   83|   49|   D-|\n...\n|Heffalump|    Harvey|632-79-9439|   30|    1|   20|   30|   40|    C|\n+---------+----------+-----------+-----+-----+-----+-----+-----+-----+\n\n+--------+---------+-----------+\n|database|tableName|isTemporary|\n+--------+---------+-----------+\n|        |       df|       true|\n+--------+---------+-----------+\n\n+---------+----------+-----------+-----+-----+-----+-----+-----+-----+\n|Last name|First name|        SSN|Test1|Test2|Test3|Test4|Final|Grade|\n+---------+----------+-----------+-----+-----+-----+-----+-----+-----+\n|  Airpump|    Andrew|223-45-6789|   49|    1|   90|  100|   83|    A|\n|   Backus|       Jim|143-12-1234|   48|    1|   97|   96|   97|   A+|\n| Elephant|       Ima|456-71-9012|   45|    1|   78|   88|   77|   B-|\n| Franklin|     Benny|234-56-2890|   50|    1|   90|   80|   90|   B-|\n+---------+----------+-----------+-----+-----+-----+-----+-----+-----+\n</pre>\nCtrl+D out of Scala shell now.\n\n```bash\npyspark\n```\n```python\nspark.range(1000 * 1000 * 1000).count()\n\ndf = spark.read.format('csv').option('header', 'true').load('/grades.csv')\ndf.show()\n\ndf.createOrReplaceTempView('df')\nspark.sql('SHOW TABLES').show()\nspark.sql('SELECT * FROM df WHERE Final > 50').show()\n\n# TODO SELECT TABLE from hive - not working for now.\nspark.sql('SELECT * FROM grades').show()\n```\n<pre>\n1000000000\n\n$same_tables_as_above\n</pre>\nCtrl+D out of PySpark.\n\n```bash\nsparkR\n```\n```R\ndf <- as.DataFrame(list(\"One\", \"Two\", \"Three\", \"Four\"), \"This is as example\")\nhead(df)\n\ndf <- read.df(\"/grades.csv\", \"csv\", header=\"true\")\nhead(df)\n```\n<pre>\n  This is as example\n1                One\n2                Two\n3        
      Three\n4               Four\n\n$same_tables_as_above\n</pre>\n\n* Amazon S3\n\nFrom Hadoop:\n```bash\nhadoop fs -Dfs.s3a.impl=\"org.apache.hadoop.fs.s3a.S3AFileSystem\" -Dfs.s3a.access.key=\"classified\" -Dfs.s3a.secret.key=\"classified\" -ls \"s3a://bucket\"\n```\n\nThen from PySpark:\n\n```python\nsc._jsc.hadoopConfiguration().set('fs.s3a.impl', 'org.apache.hadoop.fs.s3a.S3AFileSystem')\nsc._jsc.hadoopConfiguration().set('fs.s3a.access.key', 'classified')\nsc._jsc.hadoopConfiguration().set('fs.s3a.secret.key', 'classified')\n\ndf = spark.read.format('csv').option('header', 'true').option('sep', '\\t').load('s3a://bucket/tabseparated_withheader.tsv')\ndf.show(5)\n```\n\nNone of the commands above stores your credentials anywhere\n(i.e. as soon as you shut down the cluster, your creds are gone). More\npersistent ways of storing the credentials are out of the scope of this\nreadme.\n\n### Zeppelin\n\nThe Zeppelin interface should be available at [http://localhost:8890](http://localhost:8890).\n\nYou'll find a notebook called \"test\" in there, containing commands\nto test integration with bash, Spark and Livy.\n\n### Livy\n\nLivy is at [http://localhost:8998](http://localhost:8998) (and yes,\nthere's a web UI as well as a REST API on that port - just click the link).\n\n* Livy Sessions.\n\nTry to poll the REST API:\n```bash\ncurl --request GET \\\n  --url http://localhost:8998/sessions | python3 -mjson.tool\n```\nThe response, assuming you didn't create any sessions before, should look like this:\n```json\n{\n  \"from\": 0,\n  \"total\": 0,\n  \"sessions\": []\n}\n```\n\n1 ) Create a session:\n```bash\ncurl --request POST \\\n  --url http://localhost:8998/sessions \\\n  --header 'content-type: application/json' \\\n  --data '{\n\t\"kind\": \"pyspark\"\n}' | python3 -mjson.tool\n```\nResponse:\n```json\n{\n    \"id\": 0,\n    \"name\": null,\n    \"appId\": null,\n    \"owner\": null,\n    \"proxyUser\": null,\n    \"state\": \"starting\",\n    \"kind\": \"pyspark\",\n  
  \"appInfo\": {\n        \"driverLogUrl\": null,\n        \"sparkUiUrl\": null\n    },\n    \"log\": [\n        \"stdout: \",\n        \"\\nstderr: \",\n        \"\\nYARN Diagnostics: \"\n    ]\n}\n```\n\n2 ) Wait for session to start (state will transition from \"starting\"\nto \"idle\"):\n```bash\ncurl --request GET \\\n  --url http://localhost:8998/sessions/0 | python3 -mjson.tool\n```\nResponse:\n```json\n{\n    \"id\": 0,\n    \"name\": null,\n    \"appId\": \"application_1584274334558_0001\",\n    \"owner\": null,\n    \"proxyUser\": null,\n    \"state\": \"starting\",\n    \"kind\": \"pyspark\",\n    \"appInfo\": {\n        \"driverLogUrl\": \"http://worker2:8042/node/containerlogs/container_1584274334558_0003_01_000001/root\",\n        \"sparkUiUrl\": \"http://master:8088/proxy/application_1584274334558_0003/\"\n    },\n    \"log\": [\n        \"timestamp bla\"\n    ]\n}\n```\n\n3 ) Post some statements:\n```bash\ncurl --request POST \\\n  --url http://localhost:8998/sessions/0/statements \\\n  --header 'content-type: application/json' \\\n  --data '{\n\t\"code\": \"import sys;print(sys.version)\"\n}' | python3 -mjson.tool\ncurl --request POST \\\n  --url http://localhost:8998/sessions/0/statements \\\n  --header 'content-type: application/json' \\\n  --data '{\n\t\"code\": \"spark.range(1000 * 1000 * 1000).count()\"\n}' | python3 -mjson.tool\n```\nResponse:\n```json\n{\n    \"id\": 0,\n    \"code\": \"import sys;print(sys.version)\",\n    \"state\": \"waiting\",\n    \"output\": null,\n    \"progress\": 0.0,\n    \"started\": 0,\n    \"completed\": 0\n}\n```\n```json\n{\n    \"id\": 1,\n    \"code\": \"spark.range(1000 * 1000 * 1000).count()\",\n    \"state\": \"waiting\",\n    \"output\": null,\n    \"progress\": 0.0,\n    \"started\": 0,\n    \"completed\": 0\n}\n```\n\n4) Get the result:\n```bash\ncurl --request GET \\\n  --url http://localhost:8998/sessions/0/statements | python3 -mjson.tool\n```\nResponse:\n```json\n{\n  \"total_statements\": 2,\n  
\"statements\": [\n    {\n      \"id\": 0,\n      \"code\": \"import sys;print(sys.version)\",\n      \"state\": \"available\",\n      \"output\": {\n        \"status\": \"ok\",\n        \"execution_count\": 0,\n        \"data\": {\n          \"text/plain\": \"3.7.3 (default, Apr  3 2019, 19:16:38) \\n[GCC 8.0.1 20180414 (experimental) [trunk revision 259383]]\"\n        }\n      },\n      \"progress\": 1.0\n    },\n    {\n      \"id\": 1,\n      \"code\": \"spark.range(1000 * 1000 * 1000).count()\",\n      \"state\": \"available\",\n      \"output\": {\n        \"status\": \"ok\",\n        \"execution_count\": 1,\n        \"data\": {\n          \"text/plain\": \"1000000000\"\n        }\n      },\n      \"progress\": 1.0\n    }\n  ]\n}\n```\n\n5) Delete the session:\n```bash\ncurl --request DELETE \\\n  --url http://localhost:8998/sessions/0 | python3 -mjson.tool\n```\nResponse:\n```json\n{\n  \"msg\": \"deleted\"\n}\n```\n* Livy Batches.\n\nTo get all active batches:\n```bash\ncurl --request GET \\\n  --url http://localhost:8998/batches | python3 -mjson.tool\n```\nStrange enough, this elicits the same response as if we were querying\nthe sessions endpoint, but ok...\n\n1 ) Send the batch:\n```bash\ncurl --request POST \\\n  --url http://localhost:8998/batches \\\n  --header 'content-type: application/json' \\\n  --data '{\n\t\"file\": \"local:/data/batches/sample_batch.py\",\n\t\"pyFiles\": [\n\t\t\"local:/data/batches/sample_batch.py\"\n\t],\n\t\"args\": [\n\t\t\"123\"\n\t]\n}' | python3 -mjson.tool\n```\nResponse:\n```json\n{\n    \"id\": 0,\n    \"name\": null,\n    \"owner\": null,\n    \"proxyUser\": null,\n    \"state\": \"starting\",\n    \"appId\": null,\n    \"appInfo\": {\n        \"driverLogUrl\": null,\n        \"sparkUiUrl\": null\n    },\n    \"log\": [\n        \"stdout: \",\n        \"\\nstderr: \",\n        \"\\nYARN Diagnostics: \"\n    ]\n}\n```\n\n2 ) Query the status:\n```bash\ncurl --request GET \\\n  --url http://localhost:8998/batches/0 | 
python3 -mjson.tool\n```\nResponse:\n```json\n{\n    \"id\": 0,\n    \"name\": null,\n    \"owner\": null,\n    \"proxyUser\": null,\n    \"state\": \"running\",\n    \"appId\": \"application_1584274334558_0005\",\n    \"appInfo\": {\n        \"driverLogUrl\": \"http://worker2:8042/node/containerlogs/container_1584274334558_0005_01_000001/root\",\n        \"sparkUiUrl\": \"http://master:8088/proxy/application_1584274334558_0005/\"\n    },\n    \"log\": [\n        \"timestamp bla\",\n        \"\\nstderr: \",\n        \"\\nYARN Diagnostics: \"\n    ]\n}\n```\n\n3 ) To see all log lines, query the `/log` endpoint.\nYou can skip 'to' and 'from' params, or manipulate them to get all log lines.\nLivy (as of 0.7.0) supports no more than 100 log lines per response.\n```bash\ncurl --request GET \\\n  --url 'http://localhost:8998/batches/0/log?from=100&to=200' | python3 -mjson.tool\n```\nResponse:\n```json\n{\n    \"id\": 0,\n    \"from\": 100,\n    \"total\": 203,\n    \"log\": [\n        \"...\",\n        \"Welcome to\",\n        \"      ____              __\",\n        \"     / __/__  ___ _____/ /__\",\n        \"    _\\\\ \\\\/ _ \\\\/ _ `/ __/  '_/\",\n        \"   /__ / .__/\\\\_,_/_/ /_/\\\\_\\\\   version 2.4.5\",\n        \"      /_/\",\n        \"\",\n        \"Using Python version 3.7.5 (default, Oct 17 2019 12:25:15)\",\n        \"SparkSession available as 'spark'.\",\n        \"3.7.5 (default, Oct 17 2019, 12:25:15) \",\n        \"[GCC 8.3.0]\",\n        \"Arguments: \",\n        \"['/data/batches/sample_batch.py', '123']\",\n        \"Custom number passed in args: 123\",\n        \"Will raise 123 to the power of 3...\",\n        \"...\",\n        \"123 ^ 3 = 1860867\",\n        \"...\",\n        \"2020-03-15 13:06:09,503 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-138164b7-c5dc-4dc5-be6b-7a49c6bcdff0/pyspark-4d73b7c7-e27c-462f-9e5a-96011790d059\"\n    ]\n}\n```\n\n4 ) Delete the batch:\n```bash\ncurl --request DELETE \\\n  --url 
http://localhost:8998/batches/0 | python3 -mjson.tool\n```\nResponse:\n```json\n{\n  \"msg\": \"deleted\"\n}\n```\n\n## Credits\nSample data files:\n* __grades.csv__ is borrowed from\n[John Burkardt's page](https://people.sc.fsu.edu/~jburkardt/data/csv/csv.html)\non the Florida State University domain. Thanks for\nsharing those!\n\n* __ssn-address.tsv__ is derived from __grades.csv__ by removing some fields\n  and adding randomly-generated addresses.\n"
  },
  {
    "path": "data/batches/sample_batch.py",
"content": "import sys\n\nfrom pyspark.shell import spark\n\nprint(sys.version)\nprint(\"Arguments: \\n\" + str(sys.argv))\n\ntry:\n    num = int(sys.argv[1])\n    print(\"Custom number passed in args: \" + str(num))\nexcept ValueError:\n    num = 1000\n    print(\"Can't process as number: \" + sys.argv[1])\nexcept IndexError:\n    num = 1000\n    print(\"No custom number in args, defaulting to \" + str(num))\n\n# Checking if f-strings are available (Python >= 3.6)\nprint(f\"Will raise {num} to the power of 3...\")\n\ncube = spark.range(num * num * num).count()\nprint(f\"{num} ^ 3 = {cube}\")\n"
  },
  {
    "path": "data/grades.csv",
    "content": "Last name,First name,SSN,Test1,Test2,Test3,Test4,Final,Grade\nAlfalfa,Aloysius,123-45-6789,40,90,100,83,49,D-\nAlfred,University,123-12-1234,41,97,96,97,48,D+\nGerty,Gramma,567-89-0123,41,80,60,40,44,C\nAndroid,Electric,087-65-4321,42,23,36,45,47,B-\nBumpkin,Fred,456-78-9012,43,78,88,77,45,A-\nRubble,Betty,234-56-7890,44,90,80,90,46,C-\nNoshow,Cecil,345-67-8901,45,11,-1,4,43,F\nBuff,Bif,632-79-9939,46,20,30,40,50,B+\nAirpump,Andrew,223-45-6789,49,1,90,100,83,A\nBackus,Jim,143-12-1234,48,1,97,96,97,A+\nCarnivore,Art,565-89-0123,44,1,80,60,40,D+\nDandy,Jim,087-75-4321,47,1,23,36,45,C+\nElephant,Ima,456-71-9012,45,1,78,88,77,B-\nFranklin,Benny,234-56-2890,50,1,90,80,90,B-\nGeorge,Boy,345-67-3901,40,1,11,-1,4,B\nHeffalump,Harvey,632-79-9439,30,1,20,30,40,C\n"
  },
  {
    "path": "data/ssn-address.tsv",
"content": "Alfalfa\tAloysius\t123-45-6789\t7098 East Road\tHopkins, MN 55343\nBackus\tJim\t143-12-1234\t603 Wagon Drive\tMiamisburg, OH 45342\nDandy\tJim\t087-75-4321\t4 Ann St.\tHackensack, NJ 07601\nGeorge\tBoy\t345-67-3901\t13 Foxrun Ave.\tAnnandale, VA 22003\nAlfred\tUniversity\t123-12-1234\t98 Wellington Ave.\tLowell, MA 01851\nElephant\tIma\t456-71-9012\t\nHeffalump\tHarvey\t632-79-9439\t5 Beech Street\tCanyon Country, CA 91387\nGerty\tGramma\t567-89-0123\t\nRubble\tBetty\t234-56-7890\t9715 Penn St.\tRoyal Oak, MI 48067\n"
  },
  {
    "path": "docker-compose.yml",
    "content": "version: \"3.7\"\nservices:\n  hivemetastore:\n    image: postgres:11.5\n    hostname: hivemetastore\n    environment:\n      POSTGRES_PASSWORD: new_password\n    expose:\n      - 5432\n    volumes:\n      - ./init.sql:/docker-entrypoint-initdb.d/init.sql\n    healthcheck:\n      test: [\"CMD-SHELL\", \"pg_isready -U postgres\"]\n      interval: 10s\n      timeout: 5s\n      retries: 5\n    networks:\n      spark_net:\n        ipv4_address: 172.28.1.4\n    extra_hosts:\n      - \"master:172.28.1.1\"\n      - \"worker1:172.28.1.2\"\n      - \"worker2:172.28.1.3\"\n      - \"zeppelin:172.28.1.5\"\n      - \"livy:172.28.1.6\"\n\n  master:\n    image: panovvv/hadoop-hive-spark:2.5.2\n#    build: '../hadoop-hive-spark-docker'\n    hostname: master\n    depends_on:\n      - hivemetastore\n    environment:\n      HADOOP_NODE: namenode\n      HIVE_CONFIGURE: yes, please\n      SPARK_PUBLIC_DNS: localhost\n      SPARK_LOCAL_IP: 172.28.1.1\n      SPARK_MASTER_HOST: 172.28.1.1\n      SPARK_LOCAL_HOSTNAME: master\n    expose:\n      - 1-65535\n    ports:\n      # Spark Master Web UI\n      - 8080:8080\n      # Spark job Web UI: increments for each successive job\n      - 4040:4040\n      - 4041:4041\n      - 4042:4042\n      - 4043:4043\n      # Spark History server\n      - 18080:18080\n      # YARN UI\n      - 8088:8088\n      # Hadoop namenode UI\n      - 9870:9870\n      # Hadoop secondary namenode UI\n      - 9868:9868\n      # Hive JDBC\n      - 10000:10000\n    volumes:\n      - ./data:/data\n    networks:\n      spark_net:\n        ipv4_address: 172.28.1.1\n    extra_hosts:\n      - \"worker1:172.28.1.2\"\n      - \"worker2:172.28.1.3\"\n      - \"hivemetastore:172.28.1.4\"\n      - \"zeppelin:172.28.1.5\"\n      - \"livy:172.28.1.6\"\n\n  worker1:\n    image: panovvv/hadoop-hive-spark:2.5.2\n#    build: '../hadoop-hive-spark-docker'\n    hostname: worker1\n    depends_on:\n      - hivemetastore\n    environment:\n      SPARK_MASTER_ADDRESS: 
spark://master:7077\n      SPARK_WORKER_PORT: 8881\n      SPARK_WORKER_WEBUI_PORT: 8081\n      SPARK_PUBLIC_DNS: localhost\n      SPARK_LOCAL_HOSTNAME: worker1\n      SPARK_LOCAL_IP: 172.28.1.2\n      SPARK_MASTER_HOST: 172.28.1.1\n      HADOOP_NODE: datanode\n    expose:\n      - 1-65535\n    ports:\n      # Hadoop datanode UI\n      - 9864:9864\n      #Spark worker UI\n      - 8081:8081\n    volumes:\n      - ./data:/data\n    networks:\n      spark_net:\n        ipv4_address: 172.28.1.2\n    extra_hosts:\n      - \"master:172.28.1.1\"\n      - \"worker2:172.28.1.3\"\n      - \"hivemetastore:172.28.1.4\"\n      - \"zeppelin:172.28.1.5\"\n      - \"livy:172.28.1.6\"\n\n  worker2:\n    image: panovvv/hadoop-hive-spark:2.5.2\n#    build: '../hadoop-hive-spark-docker'\n    hostname: worker2\n    depends_on:\n      - hivemetastore\n    environment:\n      SPARK_MASTER_ADDRESS: spark://master:7077\n      SPARK_WORKER_PORT: 8882\n      SPARK_WORKER_WEBUI_PORT: 8082\n      SPARK_PUBLIC_DNS: localhost\n      SPARK_LOCAL_HOSTNAME: worker2\n      SPARK_LOCAL_IP: 172.28.1.3\n      SPARK_MASTER_HOST: 172.28.1.1\n      HADOOP_NODE: datanode\n      HADOOP_DATANODE_UI_PORT: 9865\n    expose:\n      - 1-65535\n    ports:\n      # Hadoop datanode UI\n      - 9865:9865\n      # Spark worker UI\n      - 8082:8082\n    volumes:\n      - ./data:/data\n    networks:\n      spark_net:\n        ipv4_address: 172.28.1.3\n    extra_hosts:\n      - \"master:172.28.1.1\"\n      - \"worker1:172.28.1.2\"\n      - \"hivemetastore:172.28.1.4\"\n      - \"zeppelin:172.28.1.5\"\n      - \"livy:172.28.1.6\"\n\n  livy:\n    image: panovvv/livy:2.5.2\n#    build: '../livy-docker'\n    hostname: livy\n    depends_on:\n      - master\n      - worker1\n      - worker2\n    volumes:\n      - ./livy_batches:/livy_batches\n      - ./data:/data\n    environment:\n      - SPARK_MASTER=yarn\n      # Intentionally not specified - if it's set here, then we can't override it\n      # via REST API (\"conf\"={} 
map)\n      # Can be client or cluster\n#      - SPARK_DEPLOY_MODE=client\n\n      - LOCAL_DIR_WHITELIST=/data/batches/\n      - ENABLE_HIVE_CONTEXT=false\n      # Defaults are fine for variables below. Uncomment to change them.\n#      - LIVY_HOST=0.0.0.0\n#      - LIVY_PORT=8998\n    expose:\n      - 1-65535\n    ports:\n      - 8998:8998\n    networks:\n      spark_net:\n        ipv4_address: 172.28.1.6\n    extra_hosts:\n      - \"master:172.28.1.1\"\n      - \"worker1:172.28.1.2\"\n      - \"worker2:172.28.1.3\"\n      - \"hivemetastore:172.28.1.4\"\n      - \"zeppelin:172.28.1.5\"\n\n  zeppelin:\n    image: panovvv/zeppelin-bigdata:2.5.2\n#    build: '../zeppelin-bigdata-docker'\n    hostname: zeppelin\n    depends_on:\n      - master\n      - worker1\n      - worker2\n      - livy\n    volumes:\n      - ./zeppelin_notebooks:/zeppelin_notebooks\n      - ./data:/data\n    environment:\n      ZEPPELIN_PORT: 8890\n      ZEPPELIN_NOTEBOOK_DIR: '/zeppelin_notebooks'\n    expose:\n      - 8890\n    ports:\n      - 8890:8890\n    networks:\n      spark_net:\n        ipv4_address: 172.28.1.5\n    extra_hosts:\n      - \"master:172.28.1.1\"\n      - \"worker1:172.28.1.2\"\n      - \"worker2:172.28.1.3\"\n      - \"hivemetastore:172.28.1.4\"\n      - \"livy:172.28.1.6\"\n\nnetworks:\n  spark_net:\n    ipam:\n      driver: default\n      config:\n        - subnet: 172.28.0.0/16"
  },
  {
    "path": "init.sql",
    "content": "CREATE DATABASE \"hivemetastoredb\";"
  },
  {
    "path": "zeppelin_notebooks/test_2FVBJBJ1V.zpln",
"content": "{\n  \"paragraphs\": [\n        {\n          \"text\": \"%sh\\n\\n# Only load a file to HDFS if it\\u0027s not already there - because of this you can run all paragraphs as many times as you like.\\nif ! hadoop fs -test -e /grades.csv\\nthen\\n    echo \\\"*******************************************\\\"\\n    echo \\\"grades.csv is not in HDFS yet! Uploading...\\\"\\n    echo \\\"*******************************************\\\"\\n    hadoop fs -put /data/grades.csv /\\nfi\"\n        },\n        {\n          \"text\": \"%sh\\n\\nhadoop fs -ls /\"\n        },\n        {\n          \"text\": \"%jdbc\\n\\n-- Does not support more than one statement per paragraph, it seems. Same goes for semicolon at the end of statements - errors out if you include it.\\nDROP TABLE IF EXISTS grades\"\n        },\n        {\n          \"text\": \"%jdbc\\n\\nCREATE TABLE grades(\\n    `Last name` STRING,\\n    `First name` STRING,\\n    `SSN` STRING,\\n    `Test1` DOUBLE,\\n    `Test2` INT,\\n    `Test3` DOUBLE,\\n    `Test4` DOUBLE,\\n    `Final` DOUBLE,\\n    `Grade` STRING)\\nCOMMENT \\u0027https://people.sc.fsu.edu/~jburkardt/data/csv/csv.html\\u0027\\nROW FORMAT DELIMITED\\nFIELDS TERMINATED BY \\u0027,\\u0027\\nSTORED AS TEXTFILE\\ntblproperties(\\\"skip.header.line.count\\\"\\u003d\\\"1\\\")\"\n        },\n        {\n          \"text\": \"%jdbc\\n\\nLOAD DATA INPATH \\u0027/grades.csv\\u0027 INTO TABLE grades\"\n        },\n        {\n          \"text\": \"%jdbc\\n\\nSELECT * FROM grades\"\n        },\n        {\n          \"text\": \"%sh\\n\\n# Take a look at the warehouse directory, specifically where our Hive table is stored.\\nhadoop fs -ls /usr/hive/warehouse/grades\"\n        },\n        {\n          \"text\": \"%sh\\n\\n# Put the file back into HDFS - it was moved to the warehouse directory when we loaded it with Hive.\\nhadoop fs -put /data/grades.csv /\\nhadoop fs -ls /\"\n        },\n        {\n          \"text\": 
\"%spark\\n\\n// Basic Spark functions\\nspark.range(1000 * 1000 * 1000).count()\"\n        },\n        {\n          \"text\": \"%spark\\n\\n// Dataframes\\nval df \\u003d Seq(\\n  (\\\"One\\\", 1),\\n  (\\\"Two\\\", 2),\\n  (\\\"Three\\\", 3),\\n  (\\\"Four\\\", 4)\\n).toDF(\\\"This is\\\", \\\"an example\\\")\\ndf.show()\"\n        },\n        {\n          \"text\": \"%spark\\n\\n// Read CSV file from HDFS into Dataframe\\nval df \\u003d spark.read.format(\\\"csv\\\").option(\\\"header\\\", \\\"true\\\").load(\\\"/grades.csv\\\")\\ndf.show()\"\n        },\n        {\n          \"text\": \"%spark\\n\\n// Spark SQL and temporary views\\ndf.createOrReplaceTempView(\\\"df\\\")\\nspark.sql(\\\"SHOW TABLES\\\").show()\"\n        },\n        {\n          \"text\": \"%spark\\n\\nspark.sql(\\\"SELECT * FROM df WHERE Final > 50\\\").show()\"\n        },\n        {\n          \"text\": \"%spark.pyspark\\n\\n# Check Python version - 2 not allowed.\\nimport sys\\nprint(sys.version)\"\n        },\n        {\n          \"text\": \"%spark.pyspark\\n\\n#  Basic Spark functions\\nspark.range(1000 * 1000 * 1000).count()\"\n        },\n        {\n          \"text\": \"%spark.pyspark\\n\\n# Dataframes\\ndf \\u003d sqlContext.createDataFrame([(\\\"One\\\", 1), (\\\"Two\\\", 2), (\\\"Three\\\", 3), (\\\"Four\\\", 4)], (\\\"This is\\\", \\\"an example\\\"))\\ndf.show()\"\n        },\n        {\n          \"text\": \"%spark.pyspark\\n\\n# Read CSV file from HDFS into Dataframe\\ndf \\u003d spark.read.format(\\\"csv\\\").option(\\\"header\\\", \\\"true\\\").load(\\\"/grades.csv\\\")\\ndf.show()\"\n        },\n        {\n          \"text\": \"%spark.r\\n\\n# Dataframes\\ndf \\u003c- as.DataFrame(list(\\\"One\\\", \\\"Two\\\", \\\"Three\\\", \\\"Four\\\"), \\\"This is as example\\\")\\nhead(df)\"\n        },\n        {\n          \"text\": \"%spark.r\\n\\n# Read CSV file from HDFS into Dataframe\\ndf \\u003c- read.df(\\\"/grades.csv\\\", \\\"csv\\\", 
header\\u003d\\\"true\\\")\\nhead(df)\"\n        },\n        {\n          \"text\": \"%livy\\n\\n// Scala Spark over Livy\\nspark.range(1000 * 1000 * 1000).count()\"\n        },\n        {\n          \"text\": \"%livy.pyspark\\n\\n#  PySpark over Livy\\nimport sys\\nprint(sys.version)\\nspark.range(1000 * 1000 * 1000).count()\"\n        },\n        {\n          \"text\": \"%livy.sparkr\\n\\n# SparkR over Livy\\ndf \\u003c- as.DataFrame(list(\\\"One\\\", \\\"Two\\\", \\\"Three\\\", \\\"Four\\\"), \\\"This is as example\\\")\\nhead(df)\"\n        },\n        {\n          \"text\": \"%livy.sql\\nSELECT 1, CONCAT(\\u0027This is\\u0027, \\u0027 a test\\u0027)\"\n        }\n      ],\n  \"name\": \"test\",\n  \"id\": \"2FVBJBJ1V\",\n  \"defaultInterpreterGroup\": \"spark\",\n  \"version\": \"0.9.0\",\n  \"noteParams\": {},\n  \"noteForms\": {},\n  \"angularObjects\": {},\n  \"config\": {\n    \"isZeppelinNotebookCronEnable\": false\n  },\n  \"info\": {}\n}\n"
  }
]