[
  {
    "path": ".gitignore",
    "content": "*.o\n/contain\n/inject\n/pseudo\n/tags\n"
  },
  {
    "path": "COPYING",
    "content": "Copyright (C) 2013 Chris Webb <chris@arachsys.com>\n\nPermission is hereby granted, free of charge, to any person obtaining a copy\nof this software and associated documentation files (the \"Software\"), to\ndeal in the Software without restriction, including without limitation the\nrights to use, copy, modify, merge, publish, distribute, sublicense, and/or\nsell copies of the Software, and to permit persons to whom the Software is\nfurnished to do so, subject to the following conditions:\n\nThe above copyright notice and this permission notice shall be included in\nall copies or substantial portions of the Software.\n\nTHE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\nIMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\nFITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\nAUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\nLIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING\nFROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS\nIN THE SOFTWARE.\n"
  },
  {
    "path": "Makefile",
    "content": "BINDIR := $(PREFIX)/bin\nCFLAGS := -Os -Wall -Wfatal-errors\n\nBINARIES := inject\nSUIDROOT := contain pseudo\n\n%:: %.c Makefile\n\t$(CC) $(CFLAGS) -o $@ $(filter %.c,$^)\n\nall: $(BINARIES) $(SUIDROOT)\n\ncontain: contain.[ch] console.c map.c mount.c util.c\n\ninject: contain.h inject.c map.c util.c\n\npseudo: contain.h pseudo.c map.c util.c\n\nclean:\n\trm -f $(BINARIES) $(SUIDROOT)\n\ninstall: $(BINARIES) $(SUIDROOT)\n\tmkdir -p $(DESTDIR)$(BINDIR)\n\tinstall -s $(BINARIES) $(DESTDIR)$(BINDIR)\n\tinstall -o root -g root -m 4755 -s $(SUIDROOT) $(DESTDIR)$(BINDIR)\n\n.PHONY: all clean install\n"
  },
  {
    "path": "README",
    "content": "Containers\n==========\n\nThis package is a simple implementation of containers for Linux, making\nsecure containers as easy to create and use as a traditional chroot. It\ncomprises three utilities, contain, inject and pseudo, which use the kernel\nsupport for user namespaces merged in Linux 3.8.\n\n\nDemonstration\n-------------\n\nWith the utilities already installed, the demo begins in an unprivileged\nuser's shell:\n\n  $ echo $$ $UID\n  21260 1000\n\nTo create a simple test container, copy /bin and /lib* from the host into a\ntemporary directory with the default UID/GID mappings applied:\n\n  $ cd $(mktemp -d)\n  $ tar -c -f - -C / bin lib lib32 lib64 | pseudo tar -x -f -\n\nIt is very straightforward to launch a container with this newly-created\nroot filesystem:\n\n  $ contain . /bin/bash\n  #\n\nThe new shell has PID 1 within the container, and cannot see other processes\non the host:\n\n  # echo $$ $UID\n  1 0\n  # ps ax\n    PID TTY      STAT   TIME COMMAND\n      1 console  Ss     0:00 /bin/bash\n      2 console  R+     0:00 ps ax\n\nThe container root user is able to manipulate ownerships and permissions\nwithin its filesystem:\n\n  # ls -l /dev/console\n  crw--w---- 1 0 5 136, 9 Jul  1 14:00 /dev/console\n  # chown 12:34 /dev/console\n  # chmod a+rw /dev/console\n  # ls -l /dev/console\n  crw-rw-rw- 1 12 34 136, 9 Jul  1 14:00 /dev/console\n\nand can also make other privileged changes such as setting the hostname:\n\n  # echo -n \"hostname $(hostname) -> \" && hostname brian && hostname\n  hostname alice -> brian\n\nor configuring the network stack:\n\n  # ip link show\n  1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT\n      link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00\n  # ping -w 1 1.2.3.4 &>/dev/null && echo up || echo down\n  down\n  # ip addr add 1.2.3.4/32 dev lo && ip link set lo up\n  # ping -w 1 1.2.3.4 &>/dev/null && echo up || echo down\n  up\n  # ip link add type veth && ip link show\n  1: lo: 
<LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT\n      link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00\n  2: veth0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT qlen 1000\n      link/ether 3a:0c:96:36:2d:ff brd ff:ff:ff:ff:ff:ff\n  3: veth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT qlen 1000\n      link/ether a2:86:1a:92:58:cb brd ff:ff:ff:ff:ff:ff\n\nIn all cases, these changes affect the container but not the host as a\nwhole. Processes in the container live in different resource namespaces\nisolated from the host, and the container root user is unable to do anything\nthat would require elevated capabilities or root privilege on the host\nitself.\n\n\ncontain\n-------\n\nThe contain utility is invoked as\n\n  contain [OPTIONS] DIR [CMD [ARG]...]\n\nwith options\n\n  -c        disable console emulation in the container\n  -g MAP    set the container-to-host GID map\n  -i CMD    run a helper child inside the new namespaces\n  -n        share the host network unprivileged in the container\n  -o CMD    run a helper child outside the new namespaces\n  -u MAP    set the container-to-host UID map\n\nand creates a new container with DIR recursively bound as its root\nfilesystem, running CMD as PID 1 within that container. If unspecified, CMD\ndefaults to /bin/sh to start a shell, so to fully boot a distribution,\nspecify CMD as /bin/init or /sbin/init.\n\nThe container init process is isolated in new user, cgroup, mount, IPC, UTS,\ntime and PID namespaces. A synthetic /dev with device nodes bound from the\nhost /dev is automatically mounted within the new mount namespace, together\nwith standard /dev/pts, /proc and /sys filesystems.\n\nBecause it runs in its own user namespace, users and groups seen inside a\ncontainer are not the same as the underlying credentials visible for the\nsame processes and files on the host. 
Sensible default container-to-host UID\nand GID mappings are provided and described below, but the -u and -g options\ncan be used to override the defaults.\n\nThe container console is a host pseudo-terminal bound at /dev/console in the\nnew /dev filesystem: stdin and stdout are copied to/from this, and it serves\nas stdin, stdout and stderr for the container init process. This console\nemulation can be disabled using the -c option: if -c is used, init is run\ndirectly with the stdin, stdout and stderr of the contain command.\n\nContainers are usually isolated in their own network namespace, with a\ndistinct set of network interfaces from the host. By specifying the -n\noption, it is possible to safely share the host network stack instead. If\nyou do this, user networking within the container will work normally, but\nthe container has no privileges with respect to its network namespace so it\nisn't possible to (re)configure interfaces or routes, and setuid utilities\nlike ping which use a raw socket will fail.\n\nTwo different kinds of helper program can be used to help set up a\ncontainer. A program specified with -i is run inside the new namespaces with\nthe new root filesystem as its working directory, just before pivoting into\nit. Typically this type of helper is used to bind-mount additional parts of\nthe host filesystem inside the container.\n\nA helper specified with -o is run outside the namespaces but as a direct\nchild of the supervisor process which is running within them. 
This type of\nhelper can be used to move host network interfaces (such as a macvtap\ninterface or one half of a veth pair) into the container's network\nnamespace.\n\nThe environment of the container init process includes \"container=contain\"\nso that distributions can identify when they are running under contain.\n\n\ninject\n------\n\nThe inject utility is invoked as\n\n  inject PID [CMD [ARG]...]\n\nwhere PID is the process ID of a running container supervisor, and runs a\ncommand or shell inside the existing container. The environment, stdin,\nstdout and stderr of inject are all inherited by the command to be run.\n\nThe container supervisor PID (i.e. that of contain itself) should be given\nto inject, not the PID of the descendant init process. The inject utility\nwill only work if the process specified has a child with \"container=contain\"\nin its environment, which it assumes to be the container init.\n\nLinux allows an unprivileged user to join the user namespace of any process\nhe can dump or ptrace, so inject need not be installed setuid even if\ncontain and pseudo are setuid root. It will refuse to run if it detects\nsetuid/setgid operation.\n\n\npseudo\n------\n\nThe pseudo utility is invoked as\n\n  pseudo [OPTIONS] [CMD [ARG]...]\n\nwith options\n\n  -g MAP    set the user namespace GID map\n  -u MAP    set the user namespace UID map\n\nand runs a command or shell as root in a new user namespace, by analogy with\nsudo which runs a command as root in the host user namespace.\n\nUnlike contain, pseudo does not unshare other namespaces or attempt to\nisolate the new process from the rest of the host. 
It has identical default\nUID/GID mappings, -u and -g options, and support for /etc/subuid and\n/etc/subgid when installed setuid root, but no other contain options are\nsupported.\n\nOne use for pseudo is as a more capable replacement for fakeroot, useful for\ntesting, when building software packages or for constructing system images.\nUnlike the traditional fakeroot approach based on LD_PRELOAD, static\nbinaries and chroot jails are both handled correctly.\n\nIt is also invaluable for running host software to access the same\nfilesystem as a container, replicating the user and group file ownerships\nthat the container would see. For example, in the demo above, the system\nimage is untarred under pseudo so that files are written into the filesystem\nwith UIDs and GIDs mapped for the container rather than unmapped as on the\nhost.\n\n\nUser and group mappings\n-----------------------\n\nBy default, when run as root, contain and pseudo will map container UID/GID\n0 onto the highest available host UID/GID (4294967294 unless nested), and\nall other UIDs/GIDs are mapped onto themselves apart from the top container\nUID and GID which must be left unmapped.\n\nThe default mappings avoid host UID and GID 0 as the host root user is still\ngranted a variety of privileges even after dropping all capabilities in the\nhost user namespace. For example, /proc and /sys files typically have (host)\nroot:root ownership, and allowing the container unfiltered access to\nthings like /proc/sys is dangerous.\n\nRun as an unprivileged user, container UID/GID 0 is mapped onto the\nunprivileged user's UID/GID, then container UIDs/GIDs 1, 2, etc. are\nsuccessively mapped onto any ranges delegated to that user in /etc/subuid\nand /etc/subgid.\n\nThe -u and -g options can be used to specify custom mappings, in the format\nSTART:LOWER:COUNT[,START:LOWER:COUNT]... 
where START is the first UID/GID in\na container range, LOWER is the first UID/GID in the corresponding range in\nthe host, and COUNT is the length of these ranges.\n\nFor example, -u 0:1000:1,1:4000:2000 will map container UID 0 onto host UID\n1000 and container UIDs 1...2000 onto host UIDs 4000...5999.\n\nIt is not possible to map more than one container ID onto a given host ID,\nnor to list the same container ID twice in a map specification. When invoked\nby an unprivileged user, all host ranges are checked against /etc/subuid and\n/etc/subgid.\n\nUnmapped users and groups are mapped by the kernel onto the overflow UID and\nGID set in /proc/sys/kernel/overflowuid and /proc/sys/kernel/overflowgid. By\ndefault the kernel sets both these values to 65534.\n\n\nUnprivileged operation, /etc/subuid and /etc/subgid\n---------------------------------------------------\n\nWhen a non-root user runs contain or pseudo unprivileged, these tools can\nonly map container UID/GIDs onto the host UID/GID of that user. The\nresulting container is not very useful as it has just a single user and\ngroup available. (Typically only root is mapped in the container.)\n\nHowever, contain and pseudo can also be installed setuid root, and in this\ncase, unprivileged users can also map onto ranges of UIDs/GIDs that have\nbeen delegated for their use in /etc/subuid and /etc/subgid.\n\nThe format of these files is similar to /etc/passwd, /etc/group and\n/etc/shadow. Each line specifies an additional range of UIDs/GIDs allocated\nto a particular user, and there can be zero, one, or multiple lines for any\ngiven user. 
There are three colon-delimited fields: the user's login name,\nthe first UID/GID in the range, and the number of UIDs/GIDs in the range.\nFor example, an /etc/subuid containing the lines\n\n  chris:100000:10000\n  chris:120000:10000\n\nallocates UID ranges 100000-109999 and 120000-129999 to my user 'chris' in\naddition to my normal login UID.\n\nThe kernel user namespace author Eric Biederman <ebiederm@xmission.com> has\nproposed patches against the standard GNU/Linux Shadow package which add\nsupport for creating and updating these files in this format; they are\nlikely to become a standard way to delegate sub-users and sub-groups.\n\nLinux 3.19 and later do not allow unprivileged processes to write a GID map\nunless the setgroups() call has been permanently disabled by writing \"deny\"\nto /proc/PID/setgroups. This is a fix for CVE-2014-8989 which applied to\nstrangely-configured systems where group membership implies more restricted\npermissions rather than supplementary permissions.\n\nAs a result, when run non-setuid by an unprivileged user, contain and pseudo\nmust disable setgroups() in the container. Conversely, when installed setuid\nroot, they will use their privilege to bypass this kernel restriction,\nresulting in fully-functional containers which still support setgroups().\nHowever, this also means that they can be used to bypass restrictions\nimplemented by group membership.\n\n\nBuilding and installing\n-----------------------\n\nUnpack the source tar.gz file and change to the unpacked directory.\n\nRun 'make', then 'make install' as root to install both binaries setuid root\nin /bin. Alternatively, you can set DESTDIR and/or BINDIR to install in a\ndifferent location, or strip and copy the compiled binaries into the correct\nplace manually.\n\nNote that setuid contain and pseudo effectively enable unprivileged users to\ndrop supplementary group memberships using setgroups(). 
Consequently,\nthey should NOT be installed setuid root on systems where group membership\nimplies more restricted permissions rather than supplementary permissions.\n\nThese utilities were developed on GNU/Linux and are not portable to other\nplatforms as they rely on Linux-specific facilities such as namespaces.\nPlease report any problems or bugs to Chris Webb <chris@arachsys.com>.\n\n\nCopying\n-------\n\nThis software was written by Chris Webb <chris@arachsys.com> and is\ndistributed as Free Software under the terms of the MIT license in COPYING.\n"
  },
  {
    "path": "TIPS",
    "content": "Shutting down or killing a container\n------------------------------------\n\nFrom the host, the inject utility can be used to run an appropriate command\nwithin the container to start a graceful shutdown. For example\n\n  inject PID /bin/halt\n\nTo immediately kill a container and all its processes, it is sufficient to\nsend the init process a SIGKILL from the host using\n\n  pkill -KILL -P PID\n\nwhere PID is the process ID of a running container supervisor. It is very\nimportant not to SIGKILL the container supervisor itself or the container\nwill be orphaned, continuing to run unsupervised as a child of the host\ninit.\n\n\nUsing cgroups to limit memory and processes available to a container\n--------------------------------------------------------------------\n\nIf cgroup support, the memory controller and the pids controller are\ncompiled into the kernel, a mounted cgroup2 filesystem can be used to apply\nmemory and process-count limits to a container as it is started. For\nexample, the shell script\n\n  #!/bin/sh -e\n  echo +memory +pids >/sys/fs/cgroup/cgroup.subtree_control\n  mkdir /sys/fs/cgroup/mycontainer\n  echo $$ >/sys/fs/cgroup/mycontainer/cgroup.procs\n  echo 2G >/sys/fs/cgroup/mycontainer/memory.high\n  echo 3G >/sys/fs/cgroup/mycontainer/memory.max\n  echo 2G >/sys/fs/cgroup/mycontainer/memory.swap.max\n  echo 256 >/sys/fs/cgroup/mycontainer/pids.max\n  exec contain [...]\n\napplies a best-efforts limit of 2GB memory with a hard limit of 3GB. Swap\nusage is restricted to at most 2G, and no more than 256 processes can be\nforked within the container.\n\nIn addition, if contain is built and run on Linux 4.6 or later, a cgroup\nnamespace will be used to virtualise the container's view of the cgroup\nhierarchy in /sys/fs/cgroup and /proc/*/cgroup. 
/sys/fs/cgroup/mycontainer\nwill appear as the root of the hierarchy at /sys/fs/cgroup within the\ncontainer.\n\nSee linux/kernel/Documentation/cgroup-v2.txt for detailed info on the\navailable controllers and configuration parameters.\n\n\nTroubleshooting\n---------------\n\nThe contain/pseudo error message 'Failed to unshare user namespace: Invalid\nargument' typically means that your kernel is not compiled with support for\nuser namespaces, i.e. CONFIG_USER_NS is not set. The contain tool will also\ndie with a similar message referring to one of the other required namespaces\nif support for that is not available in the kernel.\n\nTo run these tools you need to be running Linux 3.8 or later with\n\n  CONFIG_CGROUPS=y\n  CONFIG_UTS_NS=y\n  CONFIG_TIME_NS=y\n  CONFIG_IPC_NS=y\n  CONFIG_USER_NS=y\n  CONFIG_PID_NS=y\n  CONFIG_NET_NS=y\n\nset in the kernel build config. Note that before Linux 3.12, CONFIG_XFS_FS\nconflicted with CONFIG_USER_NS, so these tools could not be used where XFS\nsupport was compiled either into the kernel or as a module.\n\nThe contain tool will fail to mount /dev/pts unless\n\n  CONFIG_DEVPTS_MULTIPLE_INSTANCES=y\n\nis set in the kernel build config. Both container and host /dev/pts must be\nmounted with -o newinstance, with /dev/ptmx symlinked to pts/ptmx.\n\nLinux 3.12 introduced tighter restrictions on mounting proc and sysfs, which\nbroke older versions of contain. To comply with these new rules, contain\nnow ensures that procfs and sysfs are mounted in the new mount namespace\nbefore pivoting into the container and detaching the host root.\n\nA bug in Linux 3.12 will prevent contain from mounting /proc in a container\nif binfmt_misc is mounted on /proc/sys/fs/binfmt_misc in the host\nfilesystem. This was fixed in Linux 3.13.\n\nLinux 3.19 introduced restrictions on writing a user namespace GID map as an\nunprivileged user unless setgroups() has been permanently disabled, which\nbroke older versions of contain. 
Run non-setuid and unprivileged, contain\nand pseudo must now disable setgroups() to create containers, but if they\nare installed setuid, they will bypass this kernel restriction and leave\nsetgroups() enabled in the resulting containers.\n"
  },
  {
    "path": "console.c",
    "content": "#define _GNU_SOURCE\n#include <err.h>\n#include <errno.h>\n#include <fcntl.h>\n#include <limits.h>\n#include <poll.h>\n#include <signal.h>\n#include <stdlib.h>\n#include <termios.h>\n#include <unistd.h>\n#include <sys/ioctl.h>\n#include <sys/signalfd.h>\n#include <sys/syscall.h>\n#include <sys/types.h>\n#include <sys/wait.h>\n#include \"contain.h\"\n\nstatic struct termios saved;\n\nint getconsole(void) {\n  int master, null;\n\n  if ((null = open(\"/dev/null\", O_RDWR)) < 0)\n    errx(EXIT_FAILURE, \"Failed to open /dev/null\");\n\n  if (fcntl(STDIN_FILENO, F_GETFD) < 0)\n    dup2(null, STDIN_FILENO);\n  if (fcntl(STDOUT_FILENO, F_GETFD) < 0)\n    dup2(null, STDOUT_FILENO);\n  if (fcntl(STDERR_FILENO, F_GETFD) < 0)\n    dup2(null, STDERR_FILENO);\n\n  if (null != STDIN_FILENO)\n    if (null != STDOUT_FILENO)\n      if (null != STDERR_FILENO)\n        close(null);\n\n  if ((master = posix_openpt(O_RDWR | O_NOCTTY)) < 0)\n    errx(EXIT_FAILURE, \"Failed to allocate a console pseudo-terminal\");\n  grantpt(master);\n  unlockpt(master);\n  return master;\n}\n\nstatic void rawmode(void) {\n  struct termios termios;\n\n  if (!isatty(STDIN_FILENO))\n    return;\n  if (tcgetattr(STDIN_FILENO, &termios) < 0)\n    err(EXIT_FAILURE, \"tcgetattr\");\n  cfmakeraw(&termios);\n  tcsetattr(STDIN_FILENO, TCSANOW, &termios);\n}\n\nstatic void restoremode(void) {\n  if (isatty(STDIN_FILENO))\n    tcsetattr(STDIN_FILENO, TCSANOW, &saved);\n}\n\nstatic void savemode(void) {\n  if (isatty(STDIN_FILENO) && tcgetattr(STDIN_FILENO, &saved) < 0)\n    err(EXIT_FAILURE, \"tcgetattr\");\n}\n\nvoid setconsole(char *name) {\n  int console;\n  struct termios termios;\n\n  setsid();\n\n  if ((console = open(name, O_RDWR)) < 0)\n    errx(EXIT_FAILURE, \"Failed to open console in container\");\n  ioctl(console, TIOCSCTTY, NULL);\n\n  if (tcgetattr(console, &termios) < 0)\n    err(EXIT_FAILURE, \"tcgetattr\");\n  termios.c_iflag |= IGNBRK | IUTF8;\n  tcsetattr(console, TCSANOW, 
&termios);\n\n  dup2(console, STDIN_FILENO);\n  dup2(console, STDOUT_FILENO);\n  dup2(console, STDERR_FILENO);\n  if (console != STDIN_FILENO)\n    if (console != STDOUT_FILENO)\n      if (console != STDERR_FILENO)\n        close(console);\n}\n\nint supervise(pid_t child, int console) {\n  char buffer[PIPE_BUF];\n  int signals, slave, status;\n  sigset_t mask;\n  ssize_t count, length, offset;\n  struct pollfd fds[3];\n\n  if (console < 0) {\n    if (waitpid(child, &status, 0) < 0)\n      err(EXIT_FAILURE, \"waitpid\");\n    return WIFEXITED(status) ? WEXITSTATUS(status) : EXIT_FAILURE;\n  }\n\n  sigemptyset(&mask);\n  sigaddset(&mask, SIGCHLD);\n  sigprocmask(SIG_BLOCK, &mask, NULL);\n  if ((signals = signalfd(-1, &mask, 0)) < 0)\n    err(EXIT_FAILURE, \"signalfd\");\n\n  if (waitpid(child, &status, WNOHANG) > 0)\n    if (WIFEXITED(status) || WIFSIGNALED(status))\n      raise(SIGCHLD);\n\n  savemode();\n  atexit(restoremode);\n  rawmode();\n\n  slave = open(ptsname(console), O_RDWR);\n\n  fds[0].fd = console;\n  fds[0].events = POLLIN;\n  fds[1].fd = STDIN_FILENO;\n  fds[1].events = POLLIN;\n  fds[2].fd = signals;\n  fds[2].events = POLLIN;\n\n  while (1) {\n    if (poll(fds, 3, -1) < 0)\n        if (errno != EAGAIN && errno != EINTR)\n          err(EXIT_FAILURE, \"poll\");\n\n    if (fds[0].revents & POLLIN) {\n      if ((length = read(console, buffer, sizeof(buffer))) < 0)\n        if (errno != EAGAIN && errno != EINTR)\n          err(EXIT_FAILURE, \"read\");\n      for (offset = 0; length > 0; offset += count, length -= count)\n        while ((count = write(STDOUT_FILENO, buffer + offset, length)) < 0)\n          if (errno != EAGAIN && errno != EINTR)\n            err(EXIT_FAILURE, \"write\");\n    }\n\n    if (fds[1].revents & (POLLHUP | POLLIN)) {\n      if ((length = read(STDIN_FILENO, buffer, sizeof(buffer))) == 0)\n        fds[1].events = 0;\n      else if (length < 0 && errno != EAGAIN && errno != EINTR)\n        err(EXIT_FAILURE, \"read\");\n      for 
(offset = 0; length > 0; offset += count, length -= count)\n        while ((count = write(console, buffer + offset, length)) < 0)\n          if (errno != EAGAIN && errno != EINTR)\n            err(EXIT_FAILURE, \"write\");\n    }\n\n    if (fds[2].revents & POLLIN) {\n      if (read(signals, buffer, sizeof(buffer)) < 0)\n        if (errno != EAGAIN && errno != EINTR)\n          err(EXIT_FAILURE, \"read\");\n      if (waitpid(child, &status, WNOHANG) > 0)\n        if (WIFEXITED(status) || WIFSIGNALED(status))\n          break;\n    }\n  }\n\n  close(signals);\n  close(slave);\n\n  while ((length = read(console, buffer, sizeof(buffer)))) {\n    if (length < 0 && errno != EAGAIN && errno != EINTR)\n      break;\n    for (offset = 0; length > 0; offset += count, length -= count)\n      while ((count = write(STDOUT_FILENO, buffer + offset, length)) < 0)\n        if (errno != EAGAIN && errno != EINTR)\n          err(EXIT_FAILURE, \"write\");\n  }\n\n  return WIFEXITED(status) ? WEXITSTATUS(status) : EXIT_FAILURE;\n}\n"
  },
  {
    "path": "contain.c",
    "content": "#define _GNU_SOURCE\n#include <err.h>\n#include <errno.h>\n#include <fcntl.h>\n#include <grp.h>\n#include <sched.h>\n#include <signal.h>\n#include <stdio.h>\n#include <stdlib.h>\n#include <string.h>\n#include <sysexits.h>\n#include <unistd.h>\n#include <linux/sched.h>\n#include <sys/prctl.h>\n#include <sys/syscall.h>\n#include <sys/types.h>\n#include \"contain.h\"\n\nstatic void usage(const char *progname) {\n  fprintf(stderr, \"\\\nUsage: %s [OPTIONS] DIR [CMD [ARG]...]\\n\\\nOptions:\\n\\\n  -c        disable console emulation in the container\\n\\\n  -g MAP    set the container-to-host GID map\\n\\\n  -i CMD    run a helper child inside the new namespaces\\n\\\n  -n        share the host network unprivileged in the container\\n\\\n  -o CMD    run a helper child outside the new namespaces\\n\\\n  -u MAP    set the container-to-host UID map\\n\\\nGID and UID maps are specified as START:LOWER:COUNT[,START:LOWER:COUNT]...\\n\\\n\", progname);\n  exit(EX_USAGE);\n}\n\nint main(int argc, char **argv) {\n  char *gidmap = NULL, *inside = NULL, *outside = NULL, *uidmap = NULL;\n  int hostnet = 0, master, option, stdio = 0;\n  pid_t child, parent;\n\n  while ((option = getopt(argc, argv, \"+:cg:i:no:u:\")) > 0)\n    switch (option) {\n      case 'c':\n        stdio++;\n        break;\n      case 'g':\n        gidmap = optarg;\n        break;\n      case 'i':\n        inside = optarg;\n        break;\n      case 'n':\n        hostnet++;\n        break;\n      case 'o':\n        outside = optarg;\n        break;\n      case 'u':\n        uidmap = optarg;\n        break;\n      default:\n        usage(argv[0]);\n    }\n\n  if (argc <= optind)\n    usage(argv[0]);\n\n  parent = getpid();\n  switch (child = fork()) {\n    case -1:\n      err(EXIT_FAILURE, \"fork\");\n    case 0:\n      raise(SIGSTOP);\n      if (geteuid() != 0)\n        denysetgroups(parent);\n      writemap(parent, GID, gidmap);\n      writemap(parent, UID, uidmap);\n\n      if (outside) {\n   
     if (setgid(getgid()) < 0 || setuid(getuid()) < 0)\n          errx(EXIT_FAILURE, \"Failed to drop privileges\");\n        prctl(PR_SET_DUMPABLE, 1);\n        execlp(SHELL, SHELL, \"-c\", outside, NULL);\n        err(EXIT_FAILURE, \"exec %s\", outside);\n      }\n\n      exit(EXIT_SUCCESS);\n  }\n\n  if (setgid(getgid()) < 0 || setuid(getuid()) < 0)\n    errx(EXIT_FAILURE, \"Failed to drop privileges\");\n  prctl(PR_SET_DUMPABLE, 1);\n\n  if (unshare(CLONE_NEWUSER) < 0)\n    errx(EXIT_FAILURE, \"Failed to unshare user namespace\");\n\n#ifdef CLONE_NEWCGROUP\n  if (unshare(CLONE_NEWCGROUP) < 0)\n    errx(EXIT_FAILURE, \"Failed to unshare cgroup namespace\");\n#endif\n\n  if (unshare(CLONE_NEWIPC) < 0)\n    errx(EXIT_FAILURE, \"Failed to unshare IPC namespace\");\n\n  if (!hostnet && unshare(CLONE_NEWNET) < 0)\n    errx(EXIT_FAILURE, \"Failed to unshare network namespace\");\n\n  if (unshare(CLONE_NEWNS) < 0)\n    errx(EXIT_FAILURE, \"Failed to unshare mount namespace\");\n\n#ifdef CLONE_NEWTIME\n  if (unshare(CLONE_NEWTIME) < 0)\n    errx(EXIT_FAILURE, \"Failed to unshare time namespace\");\n#endif\n\n  if (unshare(CLONE_NEWUTS) < 0)\n    errx(EXIT_FAILURE, \"Failed to unshare UTS namespace\");\n\n  waitforstop(child);\n  kill(child, SIGCONT);\n  waitforexit(child);\n\n  setgid(0);\n  setgroups(0, NULL);\n  setuid(0);\n\n  master = stdio ? 
-1 : getconsole();\n  createroot(argv[optind], master, inside);\n\n  if (unshare(CLONE_NEWPID) < 0)\n    errx(EXIT_FAILURE, \"Failed to unshare PID namespace\");\n\n  switch (child = fork()) {\n    case -1:\n      err(EXIT_FAILURE, \"fork\");\n    case 0:\n      mountproc();\n      if (!hostnet)\n        mountsys();\n      enterroot();\n\n      if (master >= 0) {\n        close(master);\n        setconsole(\"/dev/console\");\n      }\n\n      clearenv();\n      putenv(\"container=contain\");\n\n      if (argv[optind + 1])\n        execv(argv[optind + 1], argv + optind + 1);\n      else\n        execl(SHELL, SHELL, NULL);\n      err(EXIT_FAILURE, \"exec\");\n  }\n\n  return supervise(child, master);\n}\n"
  },
  {
    "path": "contain.h",
    "content": "#ifndef CONTAIN_H\n#define CONTAIN_H\n\n#define GID 0\n#define UID 1\n#define INVALID ((unsigned) -1)\n#define SHELL \"/bin/sh\"\n\n#define getid(type) ((unsigned) ((type) == GID ? getgid() : getuid()))\n#define idfile(type) ((type) == GID ? \"gid_map\" : \"uid_map\")\n#define idname(type) ((type) == GID ? \"GID\" : \"UID\")\n#define subpath(type) ((type) == GID ? \"/etc/subgid\" : \"/etc/subuid\")\n\nchar *append(char **destination, const char *format, ...);\nvoid createroot(char *src, int console, char *helper);\nvoid denysetgroups(pid_t pid);\nvoid enterroot(void);\nint getconsole(void);\nvoid mountproc(void);\nvoid mountsys(void);\nvoid seal(char **argv, char **envp);\nvoid setconsole(char *name);\nchar *string(const char *format, ...);\nint supervise(pid_t child, int console);\nchar *tmpdir(void);\nvoid waitforstop(pid_t child);\nvoid waitforexit(pid_t child);\nvoid writemap(pid_t pid, int type, char *map);\n\n#endif\n"
  },
  {
    "path": "inject.c",
    "content": "#define _GNU_SOURCE\n#include <dirent.h>\n#include <err.h>\n#include <errno.h>\n#include <fcntl.h>\n#include <grp.h>\n#include <sched.h>\n#include <signal.h>\n#include <stdio.h>\n#include <stdlib.h>\n#include <string.h>\n#include <unistd.h>\n#include <sys/stat.h>\n#include <sys/syscall.h>\n#include <sys/types.h>\n#include \"contain.h\"\n\nstatic int getparent(pid_t child) {\n  char *end, *line = NULL, *path, *start;\n  pid_t parent = -1;\n  size_t size;\n  FILE *file;\n\n  path = string(\"/proc/%u/stat\", child);\n  file = fopen(path, \"r\");\n  free(path);\n\n  if (file && getline(&line, &size, file) >= 0)\n    /* \"PID (NAME) S PPID ...\", so PPID begins 4 chars after the last ')' */\n    if ((start = strrchr(line, ')')) && strlen(start) >= 4) {\n      parent = strtol(start + 4, &end, 10);\n      if (end == start || *end != ' ')\n        parent = -1;\n    }\n\n  if (file)\n    fclose(file);\n  if (line)\n    free(line);\n\n  return parent;\n}\n\nstatic void join(pid_t pid, char *type) {\n  char *path;\n  int fd;\n\n  path = string(\"/proc/%u/ns/%s\", pid, type);\n\n  if ((fd = open(path, O_RDONLY)) >= 0) {\n    if (syscall(__NR_setns, fd, 0) < 0 && strcmp(type, \"user\") == 0)\n      errx(EXIT_FAILURE, \"Failed to join user namespace\");\n    close(fd);\n  } else if (errno != ENOENT) {\n    errx(EXIT_FAILURE, \"PID %u does not belong to you\", pid);\n  } else if (strcmp(type, \"user\") == 0) {\n    errx(EXIT_FAILURE, \"PID %u not found or user namespace unavailable\", pid);\n  }\n\n  free(path);\n}\n\nstatic void usage(const char *progname) {\n  fprintf(stderr, \"Usage: %s PID [CMD [ARG]...]\\n\", progname);\n  exit(64);\n}\n\nint main(int argc, char **argv, char **envp) {\n  char *end, *item = NULL, *path;\n  pid_t child = -1, parent, pid;\n  size_t size;\n  struct dirent *entry;\n  DIR *dir;\n  FILE *file;\n\n  seal(argv, envp);\n  if (argc < 2)\n    usage(argv[0]);\n\n  parent = strtol(argv[1], &end, 10);\n  if (end == argv[1] || *end)\n    
usage(argv[0]);\n\n  if (geteuid() != getuid())\n    errx(EXIT_FAILURE, \"setuid installation is unsafe\");\n  else if (getegid() != getgid())\n    errx(EXIT_FAILURE, \"setgid installation is unsafe\");\n\n  join(parent, \"user\");\n  setgid(0);\n  setgroups(0, NULL);\n  setuid(0);\n\n  if (!(dir = opendir(\"/proc\")))\n    errx(EXIT_FAILURE, \"Failed to list processes\");\n  while (child < 0 && (entry = readdir(dir))) {\n    pid = strtol(entry->d_name, &end, 10);\n    if (end == entry->d_name || *end)\n      continue;\n    if (getparent(pid) == parent) {\n      path = string(\"/proc/%u/environ\", pid);\n      if ((file = fopen(path, \"r\"))) {\n        while (getdelim(&item, &size, '\\0', file) >= 0)\n          if (strcmp(item, \"container=contain\") == 0)\n            child = pid;\n        fclose(file);\n      }\n      free(path);\n    }\n  }\n  closedir(dir);\n  if (item)\n    free(item);\n\n  if (child < 0)\n    errx(EXIT_FAILURE, \"PID %u is not a container supervisor\", parent);\n\n  join(child, \"cgroup\");\n  join(child, \"ipc\");\n  join(child, \"net\");\n  join(child, \"pid\");\n  join(child, \"time\");\n  join(child, \"uts\");\n  join(child, \"mnt\");\n\n  if (chdir(\"/\") < 0)\n    errx(EXIT_FAILURE, \"Failed to enter container root directory\");\n\n  switch (child = fork()) {\n    case -1:\n      err(EXIT_FAILURE, \"fork\");\n    case 0:\n      if (argv[2])\n        execvp(argv[2], argv + 2);\n      else if (getenv(\"SHELL\"))\n        execl(getenv(\"SHELL\"), getenv(\"SHELL\"), NULL);\n      else\n        execl(SHELL, SHELL, NULL);\n      err(EXIT_FAILURE, \"exec\");\n  }\n\n  waitforexit(child);\n  return EXIT_SUCCESS;\n}\n"
  },
  {
    "path": "map.c",
    "content": "#define _GNU_SOURCE\n#include <err.h>\n#include <errno.h>\n#include <grp.h>\n#include <fcntl.h>\n#include <pwd.h>\n#include <sched.h>\n#include <stdio.h>\n#include <stdlib.h>\n#include <string.h>\n#include <unistd.h>\n#include \"contain.h\"\n\nvoid denysetgroups(pid_t pid) {\n  char *path, *text = \"deny\";\n  int fd;\n\n  path = string(\"/proc/%d/setgroups\", pid);\n  if ((fd = open(path, O_WRONLY)) < 0)\n    errx(EXIT_FAILURE, \"Failed to disable setgroups() in container\");\n  else if (write(fd, text, strlen(text)) != (ssize_t) strlen(text))\n    errx(EXIT_FAILURE, \"Failed to disable setgroups() in container\");\n  close(fd);\n  free(path);\n}\n\nstatic char *getmap(pid_t pid, int type) {\n  char *line = NULL, *result = NULL, *path;\n  size_t size;\n  unsigned count, first, lower;\n  FILE *file;\n\n  if (pid == -1)\n    path = string(\"/proc/self/%s\", idfile(type));\n  else\n    path = string(\"/proc/%d/%s\", pid, idfile(type));\n  if (!(file = fopen(path, \"r\")))\n    errx(EXIT_FAILURE, \"Cannot read %s\", path);\n\n  while (getline(&line, &size, file) >= 0) {\n    if (sscanf(line, \" %u %u %u\", &first, &lower, &count) != 3)\n      errx(EXIT_FAILURE, \"Invalid map data in %s\", path);\n    append(&result, \"%s%u:%u:%u\", result ? 
\",\" : \"\", first, lower, count);\n  }\n\n  if (!result)\n    errx(EXIT_FAILURE, \"Invalid map data in %s\", path);\n\n  fclose(file);\n  free(line);\n  free(path);\n  return result;\n}\n\nstatic char *mapitem(char *map, unsigned *first, unsigned *lower,\n    unsigned *count) {\n  ssize_t skip;\n\n  while (map && *map && strchr(\",;\", *map))\n    map++;\n  if (map == NULL || *map == '\\0')\n    return NULL;\n  if (sscanf(map, \"%u:%u:%u%zn\", first, lower, count, &skip) < 3)\n    errx(EXIT_FAILURE, \"Invalid ID map '%s'\", map);\n  return map + skip;\n}\n\nstatic char *rangeitem(char *range, unsigned *start, unsigned *length) {\n  ssize_t skip;\n\n  while (range && *range && strchr(\",;\", *range))\n    range++;\n  if (range == NULL || *range == '\\0')\n    return NULL;\n  if (sscanf(range, \"%u:%u%zn\", start, length, &skip) < 2)\n    errx(EXIT_FAILURE, \"Invalid ID range '%s'\", range);\n  return range + skip;\n}\n\nstatic char *readranges(int type) {\n  char *line = NULL, *entry, *range, *user;\n  size_t end, size;\n  struct passwd *passwd;\n  uid_t uid;\n  unsigned int length, start;\n  FILE *file;\n\n  range = string(\"%u:1\", getid(type));\n  if (!(file = fopen(subpath(type), \"r\")))\n    return range;\n\n  uid = getuid();\n  user = getenv(\"USER\");\n  user = user ? user : getenv(\"LOGNAME\");\n  user = user ? 
user : getlogin();\n  if (!user || !(passwd = getpwnam(user)) || passwd->pw_uid != uid) {\n    if (!(passwd = getpwuid(uid)))\n      errx(EXIT_FAILURE, \"Failed to validate your username\");\n    user = passwd->pw_name;\n  }\n  endpwent();\n\n  while (getline(&line, &size, file) >= 0) {\n    if (strtol(line, &entry, 10) != uid || entry == line) {\n      if (strncmp(line, user, strlen(user)))\n        continue;\n      entry = line + strlen(user);\n    }\n    if (sscanf(entry, \":%u:%u%zn\", &start, &length, &end) < 2)\n      continue;\n    if (strchr(\":\\n\", entry[end + 1]))\n      append(&range, \",%u:%u\", start, length);\n  }\n\n  free(line);\n  fclose(file);\n  return range;\n}\n\nstatic char *rootdefault(int type) {\n  char *cursor, *map, *result;\n  unsigned count, first, last = INVALID, lower;\n\n  cursor = map = getmap(-1, type);\n  while ((cursor = mapitem(cursor, &first, &lower, &count)))\n    if (last == INVALID || last < first + count - 1)\n      last = first + count - 1;\n  result = string(\"0:%u:1\", last);\n\n  cursor = map;\n  while ((cursor = mapitem(cursor, &first, &lower, &count))) {\n    if (first == 0) {\n      if (count == 1 && first >= last)\n        errx(EXIT_FAILURE, \"No unprivileged %s available\", idname(type));\n      first++, lower++, count--;\n    }\n\n    if (last <= first + count - 1 && count > 0)\n      count--;\n\n    if (count > 0)\n      append(&result, \"%s%u:%u:%u\", result ? 
\",\" : \"\", first, first, count);\n  }\n\n  free(map);\n  return result;\n}\n\nstatic char *userdefault(int type) {\n  char *cursor, *map, *range, *result = NULL;\n  unsigned count, first, index = 0, length, lower, start;\n\n  if (geteuid() != 0)\n    return string(\"0:%u:1\", getid(type));\n\n  map = getmap(-1, type);\n  range = readranges(type);\n\n  while ((range = rangeitem(range, &start, &length))) {\n    cursor = map;\n    while ((cursor = mapitem(cursor, &first, &lower, &count))) {\n      if (start + length <= first || first + count <= start)\n        continue;\n      if (first + count < start + length)\n        length = start - first + count;\n      if (start < first) {\n        index += first - start;\n        length -= first - start;\n        start = first;\n      }\n      append(&result, \"%s%u:%u:%u\", result ? \",\" : \"\", index, start, length);\n      index += length;\n    }\n  }\n\n  free(map);\n  free(range);\n  return result;\n}\n\nstatic void validate(char *range, unsigned first, unsigned count) {\n  unsigned length, start;\n\n  while ((range = rangeitem(range, &start, &length)))\n    if (first < start + length && start < first + count) {\n      if (first < start)\n        validate(range, first, start - first);\n      if (first + count > start + length)\n        validate(range, start + length, first + count - start - length);\n      return;\n    }\n  errx(EXIT_FAILURE, \"Cannot map onto IDs that are not delegated to you\");\n}\n\nstatic void verifymap(char *map, char *range) {\n  unsigned count, first, lower;\n\n  while ((map = mapitem(map, &first, &lower, &count)))\n    validate(range, lower, count);\n}\n\nvoid writemap(pid_t pid, int type, char *map) {\n  char *path, *range, *text = NULL;\n  int fd;\n  unsigned count, first, lower;\n\n  if (!map) {\n    map = (getuid() == 0 ? 
rootdefault : userdefault)(type);\n  } else if (getuid() != 0) {\n    range = readranges(type);\n    verifymap(map, range);\n    free(range);\n  }\n\n  while ((map = mapitem(map, &first, &lower, &count)))\n    append(&text, \"%u %u %u\\n\", first, lower, count);\n\n  path = string(\"/proc/%d/%s\", pid, idfile(type));\n  if ((fd = open(path, O_WRONLY)) < 0)\n    errx(EXIT_FAILURE, \"Failed to set container %s map\", idname(type));\n  else if (write(fd, text, strlen(text)) != (ssize_t) strlen(text))\n    errx(EXIT_FAILURE, \"Failed to set container %s map\", idname(type));\n\n  close(fd);\n  free(path);\n  free(text);\n}\n"
  },
  {
    "path": "mount.c",
    "content": "#define _GNU_SOURCE\n#include <err.h>\n#include <errno.h>\n#include <fcntl.h>\n#include <stdlib.h>\n#include <unistd.h>\n#include <sys/mount.h>\n#include <sys/stat.h>\n#include <sys/syscall.h>\n#include <sys/types.h>\n#include \"contain.h\"\n\nstatic char *root;\n\nstatic void bindnode(char *src, char *dst) {\n  int fd;\n\n  if ((fd = open(dst, O_WRONLY | O_CREAT, 0600)) >= 0)\n    close(fd);\n  if (mount(src, dst, NULL, MS_BIND, NULL) < 0)\n    errx(EXIT_FAILURE, \"Failed to bind %s into new /dev filesystem\", src);\n}\n\nstatic void cleanup(void) {\n  if (root) {\n    umount2(root, MNT_DETACH);\n    rmdir(root);\n  }\n}\n\nvoid createroot(char *src, int console, char *helper) {\n  mode_t mask;\n  pid_t child;\n\n  root = tmpdir();\n  atexit(cleanup);\n\n  if (mount(src, root, NULL, MS_BIND | MS_REC, NULL) < 0)\n    errx(EXIT_FAILURE, \"Failed to bind new root filesystem\");\n  else if (chdir(root) < 0)\n    errx(EXIT_FAILURE, \"Failed to enter new root filesystem\");\n\n  mask = umask(0);\n  mkdir(\"dev\" , 0755);\n  if (mount(\"tmpfs\", \"dev\", \"tmpfs\", 0, \"mode=0755\") < 0)\n    errx(EXIT_FAILURE, \"Failed to mount /dev tmpfs in new root filesystem\");\n\n  mkdir(\"dev/pts\", 0755);\n  if (mount(\"devpts\", \"dev/pts\", \"devpts\", 0, \"newinstance,ptmxmode=666\") < 0)\n    errx(EXIT_FAILURE, \"Failed to mount /dev/pts in new root filesystem\");\n\n  mkdir(\"dev/tmp\", 0755);\n  umask(mask);\n\n  if (console >= 0)\n    bindnode(ptsname(console), \"dev/console\");\n  bindnode(\"/dev/full\", \"dev/full\");\n  bindnode(\"/dev/null\", \"dev/null\");\n  bindnode(\"/dev/random\", \"dev/random\");\n  bindnode(\"/dev/tty\", \"dev/tty\");\n  bindnode(\"/dev/urandom\", \"dev/urandom\");\n  bindnode(\"/dev/zero\", \"dev/zero\");\n  symlink(\"pts/ptmx\", \"dev/ptmx\");\n\n  if (helper)\n    switch (child = fork()) {\n      case -1:\n        err(EXIT_FAILURE, \"fork\");\n      case 0:\n        execlp(SHELL, SHELL, \"-c\", helper, NULL);\n        
err(EXIT_FAILURE, \"exec %s\", helper);\n      default:\n        waitforexit(child);\n    }\n}\n\nvoid enterroot(void) {\n  if (syscall(__NR_pivot_root, \".\", \"dev/tmp\") < 0)\n    errx(EXIT_FAILURE, \"Failed to pivot into new root filesystem\");\n\n  if (chdir(\"/dev/tmp\") >= 0) {\n    while (*root == '/')\n      root++;\n    rmdir(root);\n  }\n\n  root = NULL;\n\n  if (chdir(\"/\") < 0 || umount2(\"/dev/tmp\", MNT_DETACH) < 0)\n    errx(EXIT_FAILURE, \"Failed to detach old root filesystem\");\n  else\n    rmdir(\"/dev/tmp\");\n}\n\nvoid mountproc(void) {\n  mode_t mask;\n\n  mask = umask(0);\n  mkdir(\"proc\" , 0755);\n  umask(mask);\n\n  if (mount(\"proc\", \"proc\", \"proc\", 0, NULL) < 0)\n    errx(EXIT_FAILURE, \"Failed to mount /proc in new root filesystem\");\n}\n\nvoid mountsys(void) {\n  mode_t mask;\n\n  mask = umask(0);\n  mkdir(\"sys\" , 0755);\n  umask(mask);\n\n  if (mount(\"sysfs\", \"sys\", \"sysfs\", 0, NULL) < 0)\n    errx(EXIT_FAILURE, \"Failed to mount /sys in new root filesystem\");\n  mount(\"cgroup2\", \"sys/fs/cgroup\", \"cgroup2\", 0, NULL);\n}\n"
  },
  {
    "path": "pseudo.c",
    "content": "#define _GNU_SOURCE\n#include <err.h>\n#include <errno.h>\n#include <grp.h>\n#include <sched.h>\n#include <signal.h>\n#include <stdio.h>\n#include <stdlib.h>\n#include <sysexits.h>\n#include <unistd.h>\n#include <sys/prctl.h>\n#include \"contain.h\"\n\nstatic void usage(const char *progname) {\n  fprintf(stderr, \"\\\nUsage: %s [OPTIONS] [CMD [ARG]...]\\n\\\nOptions:\\n\\\n  -g MAP    set the user namespace GID map\\n\\\n  -u MAP    set the user namespace UID map\\n\\\nGID and UID maps are specified as START:LOWER:COUNT[,START:LOWER:COUNT]...\\n\\\n\", progname);\n  exit(EX_USAGE);\n}\n\nint main(int argc, char **argv) {\n  char *gidmap = NULL, *uidmap = NULL;\n  int option;\n  pid_t child, parent;\n\n  while ((option = getopt(argc, argv, \"+:g:u:\")) > 0)\n    switch (option) {\n      case 'g':\n        gidmap = optarg;\n        break;\n      case 'u':\n        uidmap = optarg;\n        break;\n      default:\n        usage(argv[0]);\n    }\n\n  parent = getpid();\n  switch (child = fork()) {\n    case -1:\n      err(EXIT_FAILURE, \"fork\");\n    case 0:\n      raise(SIGSTOP);\n      if (geteuid() != 0)\n        denysetgroups(parent);\n      writemap(parent, GID, gidmap);\n      writemap(parent, UID, uidmap);\n      exit(0);\n  }\n\n  if (setgid(getgid()) < 0 || setuid(getuid()) < 0)\n    errx(EXIT_FAILURE, \"Failed to drop privileges\");\n  prctl(PR_SET_DUMPABLE, 1);\n\n  if (unshare(CLONE_NEWUSER) < 0)\n    errx(EXIT_FAILURE, \"Failed to unshare user namespace\");\n\n  waitforstop(child);\n  kill(child, SIGCONT);\n  waitforexit(child);\n\n  setgid(0);\n  setgroups(0, NULL);\n  setuid(0);\n\n  if (argv[optind])\n    execvp(argv[optind], argv + optind);\n  else if (getenv(\"SHELL\"))\n    execl(getenv(\"SHELL\"), getenv(\"SHELL\"), NULL);\n  else\n    execl(SHELL, SHELL, NULL);\n\n  err(EXIT_FAILURE, \"exec\");\n  return EXIT_FAILURE;\n}\n"
  },
  {
    "path": "util.c",
    "content": "#define _GNU_SOURCE\n#include <err.h>\n#include <errno.h>\n#include <fcntl.h>\n#include <stdarg.h>\n#include <stdio.h>\n#include <stdlib.h>\n#include <string.h>\n#include <unistd.h>\n#include <sys/mman.h>\n#include <sys/sendfile.h>\n#include <sys/types.h>\n#include <sys/wait.h>\n#include \"contain.h\"\n\nchar *append(char **destination, const char *format, ...) {\n  char *extra, *result;\n  va_list args;\n\n  va_start(args, format);\n  if (vasprintf(&extra, format, args) < 0)\n    err(EXIT_FAILURE, \"asprintf\");\n  va_end(args);\n\n  if (*destination == NULL) {\n    *destination = extra;\n    return extra;\n  }\n\n  if (asprintf(&result, \"%s%s\", *destination, extra) < 0)\n      err(EXIT_FAILURE, \"asprintf\");\n  free(*destination);\n  free(extra);\n  *destination = result;\n  return result;\n}\n\nvoid seal(char **argv, char **envp) {\n  const int seals = F_SEAL_SEAL | F_SEAL_SHRINK | F_SEAL_GROW | F_SEAL_WRITE;\n  int dst, src;\n  ssize_t length;\n\n  if ((src = open(\"/proc/self/exe\", O_RDONLY)) < 0)\n    err(EXIT_FAILURE, \"open /proc/self/exe\");\n  if (fcntl(src, F_GET_SEALS) == seals) {\n    close(src);\n    return;\n  }\n\n  dst = memfd_create(\"/proc/self/exe\", MFD_CLOEXEC | MFD_ALLOW_SEALING);\n  if (dst < 0)\n    err(EXIT_FAILURE, \"memfd_create\");\n\n  while (length = sendfile(dst, src, NULL, BUFSIZ), length != 0)\n    if (length < 0 && errno != EAGAIN && errno != EINTR)\n      err(EXIT_FAILURE, \"sendfile\");\n  close(src);\n\n  if (fcntl(dst, F_ADD_SEALS, seals) < 0)\n    err(EXIT_FAILURE, \"fcntl F_ADD_SEALS\");\n  fexecve(dst, argv, envp);\n  err(EXIT_FAILURE, \"fexecve\");\n}\n\nchar *string(const char *format, ...) 
{\n  char *result;\n  va_list args;\n\n  va_start(args, format);\n  if (vasprintf(&result, format, args) < 0)\n    err(EXIT_FAILURE, \"asprintf\");\n  va_end(args);\n  return result;\n}\n\nchar *tmpdir(void) {\n  char *dir;\n\n  if (!(dir = strdup(\"/tmp/XXXXXX\")))\n    err(EXIT_FAILURE, \"strdup\");\n  else if (!mkdtemp(dir))\n    errx(EXIT_FAILURE, \"Failed to create temporary directory\");\n  return dir;\n}\n\nvoid waitforexit(pid_t child) {\n  int status;\n\n  if (waitpid(child, &status, 0) < 0)\n    err(EXIT_FAILURE, \"waitpid\");\n  else if (WEXITSTATUS(status) != EXIT_SUCCESS)\n    exit(WEXITSTATUS(status));\n}\n\nvoid waitforstop(pid_t child) {\n  int status;\n\n  if (waitpid(child, &status, WUNTRACED) < 0)\n    err(EXIT_FAILURE, \"waitpid\");\n  if (!WIFSTOPPED(status))\n    exit(WEXITSTATUS(status));\n}\n"
  }
]