Repository: ropenscilabs/r-docker-tutorial
Branch: gh-pages
Commit: c4428c6af0fe
Files: 25
Total size: 6.2 MB

Directory structure:
gitextract_zmrekp_n/
├── .gitattributes
├── .gitignore
├── 01-what-and-why.Rmd
├── 01-what-and-why.html
├── 02-Launching-Docker.Rmd
├── 02-Launching-Docker.html
├── 03-install-packages.Rmd
├── 03-install-packages.html
├── 04-Dockerhub.Rmd
├── 04-Dockerhub.html
├── 05-dockerfiles.Rmd
├── 05-dockerfiles.html
├── 06-Sharing-all-your-analysis.Rmd
├── 06-Sharing-all-your-analysis.html
├── Makefile
├── README.md
├── data/
│   └── gapminder-FiveYearData.csv
├── index.html
├── instructors.md
├── javascripts/
│   └── scale.fix.js
├── params.json
├── r-docker-tutorial.Rmd
├── stylesheets/
│   ├── github-light.css
│   └── styles.css
└── supplemental.md

================================================
FILE CONTENTS
================================================

================================================
FILE: .gitattributes
================================================
*.html linguist-vendored
stylesheets/*.css linguist-vendored
javascripts/*.js linguist-vendored
*.Rmd linguist-language=R

================================================
FILE: .gitignore
================================================
.Rhistory
*_cache

================================================
FILE: 01-what-and-why.Rmd
================================================
---
title: "What is Docker and Why should I use it?"
output:
  dcTemplate::dc_lesson_template:
    fig_width: 6
    fig_height: 6
    highlight: pygments
---

```{r knitr_init, echo = FALSE, cache = FALSE}
library(knitr)

## Global options
options(max.print = "75")
opts_chunk$set(cache = TRUE,
               prompt = FALSE,
               tidy = TRUE,
               comment = "> #",
               message = FALSE,
               warning = FALSE)
opts_knit$set(width = 75)
```

## Lesson Objectives

- Understanding the basic idea of Docker
- Seeing the point of why Docker is useful

## Why would I want to use Docker?

Imagine you are working on an analysis in R and you send your code to a friend.
Your friend runs exactly this code on exactly the same data set but gets a slightly different result. This can have various causes, such as a different operating system or a different version of an R package. Docker is trying to solve problems like that.

**A Docker container can be seen as a computer inside your computer**. The cool thing about this virtual computer is that you can send it to your friends, and when they start this computer and run your code they will get exactly the same results as you did.

In short, you should use Docker because

- it allows you to **wrangle dependencies** starting from the operating system up to details such as R and LaTeX package versions
- it makes sure that your analyses are **reproducible**.

Docker also helps with a couple of other things:

- Portability: Since a Docker container can easily be sent to another machine, you can set up everything on your own computer and then run the analyses on e.g. a more powerful machine.
- Sharability: You can send the Docker container to anyone (who knows how to work with Docker).

## Basic vocabulary

The words *image* and *container* will come up a lot in the following. An image is the setup of the virtual computer. If you run this image, you will have an instance of it, which we call a container. You can have many running containers of the same image.

Next: Go to [Lesson 02 Launching Docker](02-Launching-Docker.html) or back to the [main page](http://jsta.github.io/r-docker-tutorial/).

================================================
FILE: 01-what-and-why.html
================================================
Imagine you are working on an analysis in R and you send your code to a friend. Your friend runs exactly this code on exactly the same data set but gets a slightly different result. This can have various reasons such as a different operating system, a different version of an R package, et cetera. Docker is trying to solve problems like that.
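A Docker container can be seen as a computer inside your computer. The cool thing about this virtual computer is that you can send it to your friends, and when they start this computer and run your code they will get exactly the same results as you did.
[Figure: Computerception]
In short, you should use Docker because
- it allows you to wrangle dependencies starting from the operating system up to details such as R and LaTeX package versions
- it makes sure that your analyses are reproducible.
Docker also helps with a couple of other things:
- Portability: Since a Docker container can easily be sent to another machine, you can set up everything on your own computer and then run the analyses on e.g. a more powerful machine.
- Sharability: You can send the Docker container to anyone (who knows how to work with Docker).
The words image and container will come up a lot in the following. An image is the setup of the virtual computer. If you run this image, you will have an instance of it, which we call a container. You can have many running containers of the same image.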
Next: Go to Lesson 02 Launching Docker or back to the main page.
First things first: install Docker. The install guide links to a bunch of introductory material after installation is complete; it’s not necessary to complete those tutorials for this lesson, but they are an excellent introduction to basic Docker usage.
The first thing we need to do to launch Docker is to open a Unix shell. If you’re on Mac or Windows, in the last step you installed something called the Docker Quickstart Terminal; open that up now. It should look like a plain shell prompt (~$), but it is really pointing at a Linux virtual machine that Docker likes to run in, and this is where you should do everything for the rest of this tutorial unless otherwise noted. If you’re on a Linux machine, then you can use a plain old terminal prompt.
On a Mac you can also go to your terminal of choice and configure it for Docker usage. In particular, if you get the error "Cannot connect to the Docker daemon. Is the docker daemon running on this host?" at some point in the tutorial, running the following command might fix your problem:
eval "$(docker-machine env default)"
Next, we will ask Docker to run an image that already exists. We will use the verse Docker image from Rocker, which will allow us to run RStudio inside the container and has many useful R packages already installed. You need to set a password for RStudio using the -e environment flag. You will be asked to enter this password when you log in to RStudio.
docker run --rm -p 8787:8787 -e PASSWORD=YOURNEWPASSWORD rocker/verse
Optional: -p and --rm are flags that allow you to customize how you run the container. -p tells Docker that you will be using a port to see RStudio in your web browser (at a location which we specify afterwards as port 8787:8787). Finally, --rm ensures that when we quit the container, the container is deleted. If we did not do this, every time we ran a container, a version of it would be saved to our local computer. This can eventually waste a lot of disk space until we manually remove these containers. Later we will show you how to save your container if you want to do so.
If you try to run a Docker image which you have not installed locally, Docker will automatically search for the image on Docker Hub (an online repository for Docker images) and download it if it exists.
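As a rough sketch of how these flags fit together, the shell function below assembles the same docker run command from its parts. The function name rstudio_run and its argument order are assumptions made for this illustration; they are not part of Docker or Rocker.

```shell
# Hypothetical helper (not part of Docker or Rocker): launch RStudio from an image.
# Usage: rstudio_run PASSWORD [IMAGE] [HOST_PORT]
rstudio_run() {
    password=$1
    image=${2:-rocker/verse}
    host_port=${3:-8787}
    # --rm : delete the container when it exits
    # -p   : map HOST_PORT on your machine to port 8787 inside the container
    # -e   : set the PASSWORD environment variable used for the RStudio login
    docker run --rm \
        -p "${host_port}:8787" \
        -e "PASSWORD=${password}" \
        "$image"
}

# Example: same command as above, but reachable on port 8888 of your machine
# rstudio_run YOURNEWPASSWORD rocker/verse 8888
```

Only the left-hand side of the port mapping changes in the example; RStudio still listens on 8787 inside the container.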
The command above will lead RStudio-Server to launch invisibly. To connect to it, open a browser and enter http://, followed by your ip address, followed by :8787. If you are running a Mac or Windows machine, you will find the ip address on the first line of text that appeared in your terminal when you launched the Docker Quickstart Terminal. For example, you should see:
                        ##         .
                  ## ## ##        ==
               ## ## ## ## ##    ===
           /"""""""""""""""""\___/ ===
      ~~~ {~~ ~~~~ ~~~ ~~~~ ~~~ ~ /  ===- ~~~
           \______ o           __/
             \    \         __/
              \____\_______/
docker is configured to use the default machine with IP 192.168.99.100
For help getting started, check out the docs at https://docs.docker.com
Thus, you would enter http://192.168.99.100:8787 in your browser as the url.
If you are running a Mac or Linux machine, you can use localhost as the ip address. For example: http://localhost:8787
This should lead to you being greeted by the RStudio welcome screen. Log in using:
username: rstudio
password: YOURNEWPASSWORD
Now you should be able to work with RStudio in your browser in much the same way as you would on your desktop. The password you should use is the one you created when you used docker run above.
The image below shows RStudio Server running within a Docker container. You should see something similar on your machine.
[Figure: RStudio Server running in the browser]
Now try to look at the files on your virtual computer (the Docker container). Go to File -> Open File. You will see that there are actually no files. The reason for this is that this image came with no files. Next, open a new R script, e.g. by going to File -> New File -> R Script. Enter the following code in the script, run it and save it.
# make x the numbers from 1 to 5, and y the numbers from 6-10
x <- 1:5
y <- 6:10
# plot x against y
plot(x, y)
If you look at your files again now, you will see the script file.
Now, given that we used the --rm flag when we launched the Docker container, anything we create on the machine will be gone once the container stops. Let’s verify this. First, close the browser tab where you have RStudio open, and then go to the terminal window from which you launched the Docker container and type Control+C. This shuts down the Docker container.
Now relaunch a Docker container using the RStudio image as you did previously, e.g., by running docker run --rm -p 8787:8787 -e PASSWORD=YOURNEWPASSWORD rocker/verse in the terminal and opening http://192.168.99.100:8787 in your browser, and see whether the R script and plot you saved are still there.
That leads us to the question: how can we save our work if the container is deleted when we exit it? One solution is to link a volume (for example your local hard drive) to the container, so that you can access the data there as well as save things there.
This time when we launch our container we will use the -v flag along with the path to our project’s root directory. Your launch command should look something like this, although the path will differ depending on where you saved the data to on your computer. On the left hand side of the : is the path on your own computer, which you can get by doing pwd at the bash prompt from your project’s root directory. On the right hand side is the path on the container; this should almost always start with /home/rstudio/.
docker run --rm -p 8787:8787 -e PASSWORD=YOURNEWPASSWORD -v /Users/tiffanytimbers/Documents/DC/r-docker-tutorial:/home/rstudio/r-docker-tutorial rocker/verse
Again, you will have to enter something like http://192.168.99.100:8787 in your browser as the url to get RStudio to run.
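The two sides of the -v argument can also be built from the current directory rather than typed by hand. The following is a minimal sketch, assuming you run it from your project's root directory; project_volume is a hypothetical helper name, and mounting under /home/rstudio/ follows the convention described above.

```shell
# Hypothetical helper: build the -v argument for the current project directory.
# Left of the colon: the absolute path on your computer (what pwd prints).
# Right of the colon: a directory of the same name under /home/rstudio/ in the container.
project_volume() {
    host_dir=$(pwd)
    printf '%s:/home/rstudio/%s\n' "$host_dir" "$(basename "$host_dir")"
}

# Example, run from the project's root directory:
# docker run --rm -p 8787:8787 -e PASSWORD=YOURNEWPASSWORD -v "$(project_volume)" rocker/verse
```

Quoting "$(project_volume)" keeps the argument intact even if the path contains spaces.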
This time when you launch RStudio in a Docker container and you try to open a file you should be able to see some files and directories. Now set the working directory to the directory called r-docker-tutorial and load gapminder-FiveYearData.csv into R via read.csv.
# load gapminder data from a csv on your computer
gap5yr <- read.csv(file = 'data/gapminder-FiveYearData.csv')
Now let’s plot GDP per capita against life expectancy.
# load ggplot library
library(ggplot2)
# plot GDP against life expectancy
qplot(gap5yr$lifeExp, gap5yr$gdpPercap)
# save the plot
ggsave(filename = 'data/GDP_LifeExp.pdf')
Let’s also save the script as plot_GDP_LifeExp.R in the r-docker-tutorial directory. Now close the RStudio browser and exit your Docker container via the terminal using Control+C. Then look inside the r-docker-tutorial and r-docker-tutorial/data directories on your laptop to see if you can see the two files you created.
In this lesson we learned how to launch a Docker container that allows us to run RStudio in a browser. We learned that using the --rm flag when we run Docker makes the container ephemeral, meaning that it is automatically deleted after we close the container. We do this so as not to build up a large collection of containers on our machine and waste space. We also learned that we can link a volume of our laptop to the Docker container if we want to be able to access and save data, scripts and any other files.
The image we used already had R, RStudio and several useful R packages installed. In later lessons we will learn how to modify this image to install new packages, and where we can find other Docker images that might be useful for our work.
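If you have launched containers without --rm, the stopped containers stay on your machine until you remove them. A minimal sketch of one way to clean them up is shown here; remove_exited_containers is a hypothetical helper name, while the docker ps and docker rm options it uses are standard Docker CLI options.

```shell
# Hypothetical helper: remove containers that have exited.
remove_exited_containers() {
    # -a includes stopped containers, -q prints only IDs,
    # --filter status=exited keeps running containers out of the list
    ids=$(docker ps -a -q --filter status=exited)
    if [ -z "$ids" ]; then
        echo "No exited containers to remove."
        return 0
    fi
    # $ids is intentionally unquoted so that each ID becomes its own argument
    docker rm $ids
}
```

Running containers are left alone because only containers with status exited are listed.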
Open R or RStudio on your local machine and check which packages you have installed by typing installed.packages(). Compare with your neighbour to see whether they match. Do the same in RStudio in your browser.
Next: Go to Lesson 03 Install packages or back to the main page.
You can install R packages with RStudio in the browser, like you would on a desktop-RStudio-session, by using install.packages. Let’s launch a verse Docker container to run RStudio as we did previously, and try to install the gapminder package, and load it and peek at the data.
# install package
install.packages('gapminder')
# load library
library(gapminder)
# peek at data
head(gapminder)
Great! Now we have the Gapminder package installed so we can work with the whole dataset. But wait, what is going to happen when we exit the container? It will be deleted and since we didn’t save this version of the Docker image, when we open another instance of the container we will have to install the Gapminder package again if we want to use it.
To avoid this, let’s save the image by running docker commit, and then the next time we run a Docker container we can run an instance of this image which includes the Gapminder package. To do this we need to open another terminal window before we close our Docker container.
To save this specific version of the image we need to find this container’s specific hash. We can see this by typing the following command in the new terminal window, which will list all running Docker containers:
docker ps
The output should look something like what is shown below, and the specific hash for this container is the alphanumeric text in the first column.
4a6a528b35da rocker/verse "/init" 2 minutes ago Up 2 minutes 0.0.0.0:8787->8787/tcp silly_meninsky
Now to save this version of the image, in the new terminal window type:
docker commit -m "verse + gapminder" 4a6a528b35da verse_gapminder
To save this Docker image we have to provide a commit message to describe the change that we have made to the image. We do this by passing the -m flag followed by the message in quotes. We also need to provide the specific hash for this version of the container (here 4a6a528b35da). Finally, we also provide a new name for the new image. We called this new image verse_gapminder.
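The two steps (look up the hash, then commit) can be combined. The sketch below assumes exactly one container is running from the given image; commit_running is a hypothetical helper name, and the ancestor filter of docker ps is used to find containers created from that image.

```shell
# Hypothetical helper: commit the running container started from IMAGE as NEW_NAME.
# Usage: commit_running IMAGE NEW_NAME MESSAGE
commit_running() {
    image=$1
    new_name=$2
    message=$3
    # ancestor=IMAGE matches containers created from that image;
    # -q prints only the container ID (the hash in the first column of docker ps)
    container_id=$(docker ps -q --filter "ancestor=${image}" | head -n 1)
    if [ -z "$container_id" ]; then
        echo "No running container found for image ${image}" >&2
        return 1
    fi
    docker commit -m "$message" "$container_id" "$new_name"
}

# Example:
# commit_running rocker/verse verse_gapminder "verse + gapminder"
```

If several containers are running from the same image, only the first one listed is committed, so in that case it is safer to pass the hash to docker commit by hand.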
We can see that we now have two Docker images saved on our laptops by typing:
docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
verse_gapminder latest bb38976d03cf 57 seconds ago 1.955 GB
rocker/verse latest 0168d115f220 3 days ago 1.954 GB
You can test that this worked by running a Docker container from each image. You will find that the Gapminder package is only installed on the verse_gapminder image and not on the rocker/verse image.
Many R packages have dependencies external to R, for example GSL, GDAL, JAGS and so on. To install these on a running rocker container you need to go to the docker command line (in a new terminal window) and type the following:
docker ps # find the ID of the running container you want to add a package to
docker exec -it <container-id> bash # a docker command to start a bash shell in your container
apt-get install libgsl0-dev # install the package, in this case GSL
If you get an error message when running apt-get install libgsl0-dev, try running apt-get update first.
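The same installation can be done without opening an interactive shell in the container. This is a sketch only: install_system_package is a hypothetical helper name, and the package name is the one used above, which may differ on newer base images.

```shell
# Hypothetical helper: install a system package inside a running container.
# Usage: install_system_package CONTAINER_ID PACKAGE
install_system_package() {
    container_id=$1
    package=$2
    # apt-get update refreshes the package index first;
    # -y answers the confirmation prompt automatically
    docker exec "$container_id" sh -c "apt-get update && apt-get install -y ${package}"
}

# Example (use the ID shown by docker ps):
# install_system_package 4a6a528b35da libgsl0-dev
```

Running apt-get update first avoids the error mentioned above when the package index in the image is out of date.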
To save these changes go to yet another terminal window and save as above using docker commit, e.g.
docker commit -m "verse + gapminder + GSL" <container id> verse_gapminder_gsl
Now you can go to the terminal window in which you typed the docker exec command and close the docker container by typing exit.
At some point later, you or someone else will probably want to check what the Docker image contains. Write a README file which documents the details of the verse_gapminder_gsl image. For the R packages you can use the output of installed.packages().
Next: Go to Lesson 04 Dockerhub or back to the main page.
Docker Hub is the place where open Docker images are stored. When we ran our first image by typing
docker run --rm -p 8787:8787 -e PASSWORD=YOURNEWPASSWORD rocker/verse
the software first checked if this image is available on your computer and since it wasn’t it downloaded the image from Docker Hub. So getting an image from Docker Hub works sort of automatically. If you just want to pull the image but not run it, you can also do
docker pull rocker/verse
If you have made your own Docker image and would like to share it with the world, you can sign up for an account on https://hub.docker.com/. After verifying your email you are ready to go and upload your first Docker image.
Log into the Docker Hub from the command line
docker login --username=yourhubusername
using the user name you chose for the account. Enter your password when prompted. If everything worked you will get a message similar to
WARNING: login credentials saved in /home/username/.docker/config.json
Login Succeeded
Check the image ID using
docker images
and what you will see will be similar to
REPOSITORY TAG IMAGE ID CREATED SIZE
verse_gapminder_gsl latest 023ab91c6291 3 minutes ago 1.975 GB
verse_gapminder latest bb38976d03cf 13 minutes ago 1.955 GB
rocker/verse latest 0168d115f220 3 days ago 1.954 GB
and tag your image
docker tag bb38976d03cf yourhubusername/verse_gapminder:firsttry
The number must match the image ID and :firsttry is the tag. In general, a good choice for a tag is something that will help you understand what this container should be used in conjunction with, or what it represents. If this container contains the analysis for a paper, consider using that paper’s DOI or journal-issued serial number; if it’s meant for use with a particular version of a code or data version control repo, that’s a good choice too - whatever will help you understand what this particular image is intended for.
Push your image to the repository you created
docker push yourhubusername/verse_gapminder:firsttry
Your image is now available for everyone to use.
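Tagging and pushing always use the same full name, so the two commands can be wrapped together. The function below is a sketch under that assumption; tag_and_push and its argument order are hypothetical and chosen only for this illustration.

```shell
# Hypothetical helper: tag a local image for Docker Hub and push that tag.
# Usage: tag_and_push IMAGE_ID HUB_USER REPO TAG
tag_and_push() {
    image_id=$1
    target="$2/$3:$4"   # e.g. yourhubusername/verse_gapminder:firsttry
    # push only runs if tagging succeeded
    docker tag "$image_id" "$target" && docker push "$target"
}

# Example:
# tag_and_push bb38976d03cf yourhubusername verse_gapminder firsttry
```

Naming the tag explicitly in the push makes it clear which version of the image ends up on Docker Hub.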
Pushing to Docker Hub is great, but it does have some disadvantages:
One solution to these problems is to save the Docker image locally as a tar archive, which you can then easily load back into an image when needed.
To save a Docker image after you have pulled, committed or built it, you use the docker save command. For example, let’s save a local copy of the verse_gapminder Docker image we made:
docker save verse_gapminder > verse_gapminder.tar
If we want to load that Docker image from the archived tar file in the future, we can use the docker load command:
docker load --input verse_gapminder.tar
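Image archives are large, so it can help to compress them. The sketch below pipes docker save through gzip and relies on docker load accepting gzip-compressed archives; save_image_gz and load_image_gz are hypothetical helper names.

```shell
# Hypothetical helpers: save an image as a gzip-compressed archive and load it back.

# Usage: save_image_gz IMAGE ARCHIVE
save_image_gz() {
    # docker save writes a tar stream to stdout; gzip compresses it on the fly
    docker save "$1" | gzip > "$2"
}

# Usage: load_image_gz ARCHIVE
load_image_gz() {
    # docker load reads the archive from stdin and handles the gzip compression itself
    docker load < "$1"
}

# Example:
# save_image_gz verse_gapminder verse_gapminder.tar.gz
# load_image_gz verse_gapminder.tar.gz
```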
Next: Go to Lesson 05 Dockerfiles or back to the main page.
Earlier, we got started with a base image that let us run RStudio from within Docker, and learned to modify the contents of that image using docker commit. This is an excellent technique for capturing what we’ve done so we can reproduce it later, but what if we want to be able to easily change the collection of things in our image, and have a clear record of just what went into it? This is useful when maintaining running environments that may change and evolve over a project, and is facilitated by Dockerfiles.
Dockerfiles are a set of instructions on how to add things to a base image. They build custom images up in a series of layers. In a new file called Dockerfile, put the following:
FROM rocker/verse:latest
This tells Docker to start with the rocker/verse base image - that’s what we’ve been using so far. The FROM command must always be the first thing in your Dockerfile; this is the bottom crust of the pie we are baking.
Next, let’s add another layer on top of our base, in order to have gapminder pre-installed and ready to go:
RUN R -e "install.packages('gapminder', repos = 'http://cran.us.r-project.org')"
RUN commands in your Dockerfile execute shell commands to build up your image, like putting the filling in our pie. In this example, we install gapminder from the command line using install.packages, which does the same thing as if we had done install.packages('gapminder') from within RStudio. Save your Dockerfile, and return to your docker terminal; we can now build our image by doing:
docker build -t my-r-image .
-t my-r-image gives our image a name (note image names are always all lower case), and the . says all the resources we need to build this image are in our current directory. List your images via:
docker images
and you should see my-r-image in the list. Launch your new image similarly to how we launched the base image:
docker run --rm -p 8787:8787 my-r-image
Then in the RStudio console, try gapminder again:
library('gapminder')
gapminder
And there it is - gapminder is pre-installed and ready to go in your new docker image.
Our pie is almost complete! All we need to finish it is the topping. In addition to R packages like gapminder, we may also want some static files inside our Docker image, such as data. We can do this using the ADD command in your Dockerfile:
ADD data/gapminder-FiveYearData.csv /home/rstudio/
Rebuild your Docker image:
docker build -t my-r-image .
And launch it again:
docker run --rm -p 8787:8787 my-r-image
Go back to RStudio in the browser, and there gapminder-FiveYearData.csv will be, present in the files visible to RStudio. In this way, we can capture files as part of our Docker image, so they’re always available along with the rest of our image in the exact same state.
While building and rebuilding your Docker image in this tutorial, you may have noticed lines like this:
Step 2 : RUN R -e "install.packages('gapminder', repos = 'http://cran.us.r-project.org')"
---> Using cache
---> fa9be67b52d1
These lines indicate that a cached version of the commands was being used. When you rebuild an image, Docker checks the previous version(s) of that image to see if the same commands were executed previously; each of those steps is preserved as a separate layer, and Docker is smart enough to re-use those layers if they are unchanged and in the same order as previously. Therefore, once you’ve got part of your setup process figured out (particularly if it’s a slow part), leave it near the top of your Dockerfile and don’t put anything above or between those lines, particularly things that change frequently; this can substantially speed up your build process.
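To illustrate the ordering advice, here is a sketch in shell that writes out the Dockerfile built up in this lesson, with the slow and stable steps first and the frequently changing files last. The function name write_dockerfile is hypothetical; the Dockerfile lines themselves are the ones used above.

```shell
# Hypothetical helper: write a Dockerfile ordered for good cache re-use.
write_dockerfile() {
    cat > Dockerfile <<'EOF'
FROM rocker/verse:latest

# Slow step that rarely changes: keep it near the top so its layer stays cached
RUN R -e "install.packages('gapminder', repos = 'http://cran.us.r-project.org')"

# Files that change often go last, so editing them only rebuilds this layer
ADD data/gapminder-FiveYearData.csv /home/rstudio/
EOF
}

# Example, run in the directory that contains data/:
# write_dockerfile && docker build -t my-r-image .
```

With this ordering, changing the data file leaves the cached install.packages layer untouched on the next build.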
In this lesson, we learned how to compose a Dockerfile so that we can re-create our images at will. We learned three main commands:
- FROM is always at the top of a Dockerfile, and specifies the image we want to start from.
- RUN runs shell commands on top of our base image, and is used for doing things like downloads and installations.
- ADD adds files from our computer to our new Docker image.
The image is built by running docker build -t my-r-image . in the same directory as our Dockerfile and any files we want to include with an ADD command.
rocker/verse, with gapminder and gsl installed. Also add a readme to your image describing what it contains.
Find the Dockerfile of the rocker/verse image through Docker Hub.
rocker/verse image. What do they tell us?
Go to Lesson 06 Share all your analysis or back to the main page.
Now that we have learned how to work with a Dockerfile, we can send our whole analysis to a collaborator. We will share an image that contains all the dependencies that we need to run our analysis, the data and the analysis itself.
We will build this image via a Dockerfile. Let’s start with the basic verse rocker image we used before. This time we want to have a specific R version (3.3.2), which the developers make possible by tagging the images with the version. See all available tags for the rocker/verse image here. The version tag is very useful when you want your analysis to be reproducible.
FROM rocker/verse:3.3.2
As part of our analysis, we will use the gapminder data. We will need to install this package into our Docker image. Let’s modify our Dockerfile to install this package.
RUN R -e "install.packages('gapminder', repos = 'http://cran.us.r-project.org')"
Now we just need to write our analysis and add it to our Dockerfile.
For this analysis, we will create a plot of life expectancy vs. gdp per capita.
On a new R script let’s write the following analysis.
library(ggplot2)
library(gapminder)
life_expectancy_plot <- ggplot(data = gapminder) +
  geom_point(aes(x = lifeExp, y = gdpPercap, colour = continent))
We will save this R script as analysis.R and add it to our Dockerfile.
ADD analysis.R /home/rstudio/
Now we can build the image and check that we have everything we want to share with our collaborator.
docker build -t my-analysis .
Our analysis will appear on the list of images.
docker images
Launch your new image and check you have everything you want to include:
docker run -dp 8787:8787 my-analysis
Great! Our analysis script is there and gapminder is installed.
Now we can push our analysis to Docker Hub.
On Docker Hub click on Create Repository. Choose a name (e.g. gapminder_my_analysis) and a description for your repository and click Create.
Log into the Docker Hub from the command line
docker login --username=yourhubusername
using the user name you chose for the account. Enter your password when prompted.
Check the image ID using
docker images
and what you will see will be similar to
REPOSITORY TAG IMAGE ID CREATED SIZE
my-analysis latest dc63d4790eaa 2 minutes ago 3.164 GB
and tag your image
docker tag dc63d4790eaa yourhubusername/gapminder_my_analysis:firsttry
Push your image to the repository you created
docker push yourhubusername/gapminder_my_analysis:firsttry
Your image is now available for everyone to use.
Now your collaborator can download your image.
Your collaborator should write on their command line:
docker pull yourhubusername/gapminder_my_analysis:firsttry
They now have the image of your analysis.
Click to go back to the main page.
This is an introduction to Docker designed for participants who already know R and RStudio. It is intended to help people who need Docker for a project. We first explain what Docker is and why it is useful. Then we go into the details of how to use it for a reproducible, transportable project.
Try to install Docker before you come to the workshop, by following the instructions for mac, linux, or windows. If you get stuck, don't worry! We'll help you get set up at the beginning of the workshop.
This tutorial is a work in progress. If you have any suggestions for how we could make it better, please open a new issue.