Repository: Vonng/ddia
Branch: main
Commit: 573bb53a0557
Files: 139
Total size: 5.4 MB
Directory structure:
gitextract_zlswtsh8/
├── .github/
│   └── workflows/
│       └── pages.yaml
├── .gitignore
├── .nojekyll
├── LICENSE
├── Makefile
├── README.md
├── assets/
│   └── css/
│       ├── custom.css
│       └── example.css
├── bin/
│   ├── Pipfile
│   ├── doc
│   ├── epub
│   ├── preprocess-epub.py
│   ├── toc.py
│   ├── translate.py
│   └── zh-tw.py
├── content/
│   ├── en/
│   │   ├── _index.md
│   │   ├── ch1.md
│   │   ├── ch10.md
│   │   ├── ch11.md
│   │   ├── ch12.md
│   │   ├── ch13.md
│   │   ├── ch14.md
│   │   ├── ch2.md
│   │   ├── ch3.md
│   │   ├── ch4.md
│   │   ├── ch5.md
│   │   ├── ch6.md
│   │   ├── ch7.md
│   │   ├── ch8.md
│   │   ├── ch9.md
│   │   ├── colophon.md
│   │   ├── glossary.md
│   │   ├── indexes.md
│   │   ├── part-i.md
│   │   ├── part-ii.md
│   │   ├── part-iii.md
│   │   ├── preface.md
│   │   └── toc.md
│   ├── tw/
│   │   ├── _index.md
│   │   ├── ch1.md
│   │   ├── ch10.md
│   │   ├── ch11.md
│   │   ├── ch12.md
│   │   ├── ch13.md
│   │   ├── ch14.md
│   │   ├── ch2.md
│   │   ├── ch3.md
│   │   ├── ch4.md
│   │   ├── ch5.md
│   │   ├── ch6.md
│   │   ├── ch7.md
│   │   ├── ch8.md
│   │   ├── ch9.md
│   │   ├── colophon.md
│   │   ├── contrib.md
│   │   ├── glossary.md
│   │   ├── indexes.md
│   │   ├── part-i.md
│   │   ├── part-ii.md
│   │   ├── part-iii.md
│   │   ├── preface.md
│   │   └── toc.md
│   ├── v1/
│   │   ├── _index.md
│   │   ├── ch1.md
│   │   ├── ch10.md
│   │   ├── ch11.md
│   │   ├── ch12.md
│   │   ├── ch2.md
│   │   ├── ch3.md
│   │   ├── ch4.md
│   │   ├── ch5.md
│   │   ├── ch6.md
│   │   ├── ch7.md
│   │   ├── ch8.md
│   │   ├── ch9.md
│   │   ├── colophon.md
│   │   ├── contrib.md
│   │   ├── glossary.md
│   │   ├── part-i.md
│   │   ├── part-ii.md
│   │   ├── part-iii.md
│   │   ├── preface.md
│   │   └── toc.md
│   ├── v1_tw/
│   │   ├── _index.md
│   │   ├── ch1.md
│   │   ├── ch10.md
│   │   ├── ch11.md
│   │   ├── ch12.md
│   │   ├── ch2.md
│   │   ├── ch3.md
│   │   ├── ch4.md
│   │   ├── ch5.md
│   │   ├── ch6.md
│   │   ├── ch7.md
│   │   ├── ch8.md
│   │   ├── ch9.md
│   │   ├── colophon.md
│   │   ├── contrib.md
│   │   ├── glossary.md
│   │   ├── part-i.md
│   │   ├── part-ii.md
│   │   ├── part-iii.md
│   │   ├── preface.md
│   │   └── toc.md
│   └── zh/
│       ├── _index.md
│       ├── ch1.md
│       ├── ch10.md
│       ├── ch11.md
│       ├── ch12.md
│       ├── ch13.md
│       ├── ch14.md
│       ├── ch2.md
│       ├── ch3.md
│       ├── ch4.md
│       ├── ch5.md
│       ├── ch6.md
│       ├── ch7.md
│       ├── ch8.md
│       ├── ch9.md
│       ├── colophon.md
│       ├── contrib.md
│       ├── glossary.md
│       ├── indexes.md
│       ├── part-i.md
│       ├── part-ii.md
│       ├── part-iii.md
│       ├── preface.md
│       └── toc.md
├── giscus.json
├── go.mod
├── go.sum
├── hugo.yaml
├── i18n/
│   ├── en.yaml
│   ├── tw.yaml
│   ├── v2.yaml
│   └── zh.yaml
├── js/
│   └── epub.css
├── layouts/
│   └── shortcodes/
│       └── figure.html
└── metadata.yaml
================================================
FILE CONTENTS
================================================
================================================
FILE: .github/workflows/pages.yaml
================================================
# Sample workflow for building and deploying a Hugo site to GitHub Pages
name: Deploy Hugo site to Pages
on:
  # Runs on pushes targeting the default branch
  push:
    branches: ["main"]
  # Allows you to run this workflow manually from the Actions tab
  workflow_dispatch:
# Sets permissions of the GITHUB_TOKEN to allow deployment to GitHub Pages
permissions:
  contents: read
  pages: write
  id-token: write
# Allow only one concurrent deployment, skipping runs queued between the run in-progress and latest queued.
# However, do NOT cancel in-progress runs as we want to allow these production deployments to complete.
concurrency:
  group: "pages"
  cancel-in-progress: false
# Default to bash
defaults:
  run:
    shell: bash
jobs:
  # Build job
  build:
    runs-on: ubuntu-latest
    env:
      HUGO_VERSION: 0.155.3
    steps:
      - name: Checkout
        uses: actions/checkout@v4
        with:
          fetch-depth: 0 # fetch all history for .GitInfo and .Lastmod
          submodules: recursive
      - name: Setup Go
        uses: actions/setup-go@v5
        with:
          go-version: '1.26'
      - name: Setup Pages
        id: pages
        uses: actions/configure-pages@v4
      - name: Setup Hugo
        run: |
          wget -O ${{ runner.temp }}/hugo.deb https://github.com/gohugoio/hugo/releases/download/v${HUGO_VERSION}/hugo_extended_${HUGO_VERSION}_linux-amd64.deb \
          && sudo dpkg -i ${{ runner.temp }}/hugo.deb
      - name: Build with Hugo
        env:
          # For maximum backward compatibility with Hugo modules
          HUGO_ENVIRONMENT: production
          HUGO_ENV: production
        run: |
          hugo \
            --gc --minify \
            --baseURL "${{ steps.pages.outputs.base_url }}/"
      - name: Upload artifact
        uses: actions/upload-pages-artifact@v3
        with:
          path: ./public
  # Deployment job
  deploy:
    environment:
      name: github-pages
      url: ${{ steps.deployment.outputs.page_url }}
    runs-on: ubuntu-latest
    needs: build
    steps:
      - name: Deploy to GitHub Pages
        id: deployment
        uses: actions/deploy-pages@v4
================================================
FILE: .gitignore
================================================
.idea/
.code/
__pycache__/
.DS_Store
tmp/
output/
public/
.hugo_build.lock
.claude
CLAUDE.md
content/cn/
zh.md
en.md
================================================
FILE: .nojekyll
================================================
================================================
FILE: LICENSE
================================================
Attribution 4.0 International
=======================================================================
Creative Commons Corporation ("Creative Commons") is not a law firm and
does not provide legal services or legal advice. Distribution of
Creative Commons public licenses does not create a lawyer-client or
other relationship. Creative Commons makes its licenses and related
information available on an "as-is" basis. Creative Commons gives no
warranties regarding its licenses, any material licensed under their
terms and conditions, or any related information. Creative Commons
disclaims all liability for damages resulting from their use to the
fullest extent possible.
Using Creative Commons Public Licenses
Creative Commons public licenses provide a standard set of terms and
conditions that creators and other rights holders may use to share
original works of authorship and other material subject to copyright
and certain other rights specified in the public license below. The
following considerations are for informational purposes only, are not
exhaustive, and do not form part of our licenses.
Considerations for licensors: Our public licenses are
intended for use by those authorized to give the public
permission to use material in ways otherwise restricted by
copyright and certain other rights. Our licenses are
irrevocable. Licensors should read and understand the terms
and conditions of the license they choose before applying it.
Licensors should also secure all rights necessary before
applying our licenses so that the public can reuse the
material as expected. Licensors should clearly mark any
material not subject to the license. This includes other CC-
licensed material, or material used under an exception or
limitation to copyright. More considerations for licensors:
wiki.creativecommons.org/Considerations_for_licensors
Considerations for the public: By using one of our public
licenses, a licensor grants the public permission to use the
licensed material under specified terms and conditions. If
the licensor's permission is not necessary for any reason--for
example, because of any applicable exception or limitation to
copyright--then that use is not regulated by the license. Our
licenses grant only permissions under copyright and certain
other rights that a licensor has authority to grant. Use of
the licensed material may still be restricted for other
reasons, including because others have copyright or other
rights in the material. A licensor may make special requests,
such as asking that all changes be marked or described.
Although not required by our licenses, you are encouraged to
respect those requests where reasonable. More considerations
for the public:
wiki.creativecommons.org/Considerations_for_licensees
=======================================================================
Creative Commons Attribution 4.0 International Public License
By exercising the Licensed Rights (defined below), You accept and agree
to be bound by the terms and conditions of this Creative Commons
Attribution 4.0 International Public License ("Public License"). To the
extent this Public License may be interpreted as a contract, You are
granted the Licensed Rights in consideration of Your acceptance of
these terms and conditions, and the Licensor grants You such rights in
consideration of benefits the Licensor receives from making the
Licensed Material available under these terms and conditions.
Section 1 -- Definitions.
a. Adapted Material means material subject to Copyright and Similar
Rights that is derived from or based upon the Licensed Material
and in which the Licensed Material is translated, altered,
arranged, transformed, or otherwise modified in a manner requiring
permission under the Copyright and Similar Rights held by the
Licensor. For purposes of this Public License, where the Licensed
Material is a musical work, performance, or sound recording,
Adapted Material is always produced where the Licensed Material is
synched in timed relation with a moving image.
b. Adapter's License means the license You apply to Your Copyright
and Similar Rights in Your contributions to Adapted Material in
accordance with the terms and conditions of this Public License.
c. Copyright and Similar Rights means copyright and/or similar rights
closely related to copyright including, without limitation,
performance, broadcast, sound recording, and Sui Generis Database
Rights, without regard to how the rights are labeled or
categorized. For purposes of this Public License, the rights
specified in Section 2(b)(1)-(2) are not Copyright and Similar
Rights.
d. Effective Technological Measures means those measures that, in the
absence of proper authority, may not be circumvented under laws
fulfilling obligations under Article 11 of the WIPO Copyright
Treaty adopted on December 20, 1996, and/or similar international
agreements.
e. Exceptions and Limitations means fair use, fair dealing, and/or
any other exception or limitation to Copyright and Similar Rights
that applies to Your use of the Licensed Material.
f. Licensed Material means the artistic or literary work, database,
or other material to which the Licensor applied this Public
License.
g. Licensed Rights means the rights granted to You subject to the
terms and conditions of this Public License, which are limited to
all Copyright and Similar Rights that apply to Your use of the
Licensed Material and that the Licensor has authority to license.
h. Licensor means the individual(s) or entity(ies) granting rights
under this Public License.
i. Share means to provide material to the public by any means or
process that requires permission under the Licensed Rights, such
as reproduction, public display, public performance, distribution,
dissemination, communication, or importation, and to make material
available to the public including in ways that members of the
public may access the material from a place and at a time
individually chosen by them.
j. Sui Generis Database Rights means rights other than copyright
resulting from Directive 96/9/EC of the European Parliament and of
the Council of 11 March 1996 on the legal protection of databases,
as amended and/or succeeded, as well as other essentially
equivalent rights anywhere in the world.
k. You means the individual or entity exercising the Licensed Rights
under this Public License. Your has a corresponding meaning.
Section 2 -- Scope.
a. License grant.
1. Subject to the terms and conditions of this Public License,
the Licensor hereby grants You a worldwide, royalty-free,
non-sublicensable, non-exclusive, irrevocable license to
exercise the Licensed Rights in the Licensed Material to:
a. reproduce and Share the Licensed Material, in whole or
in part; and
b. produce, reproduce, and Share Adapted Material.
2. Exceptions and Limitations. For the avoidance of doubt, where
Exceptions and Limitations apply to Your use, this Public
License does not apply, and You do not need to comply with
its terms and conditions.
3. Term. The term of this Public License is specified in Section
6(a).
4. Media and formats; technical modifications allowed. The
Licensor authorizes You to exercise the Licensed Rights in
all media and formats whether now known or hereafter created,
and to make technical modifications necessary to do so. The
Licensor waives and/or agrees not to assert any right or
authority to forbid You from making technical modifications
necessary to exercise the Licensed Rights, including
technical modifications necessary to circumvent Effective
Technological Measures. For purposes of this Public License,
simply making modifications authorized by this Section 2(a)
(4) never produces Adapted Material.
5. Downstream recipients.
a. Offer from the Licensor -- Licensed Material. Every
recipient of the Licensed Material automatically
receives an offer from the Licensor to exercise the
Licensed Rights under the terms and conditions of this
Public License.
b. No downstream restrictions. You may not offer or impose
any additional or different terms or conditions on, or
apply any Effective Technological Measures to, the
Licensed Material if doing so restricts exercise of the
Licensed Rights by any recipient of the Licensed
Material.
6. No endorsement. Nothing in this Public License constitutes or
may be construed as permission to assert or imply that You
are, or that Your use of the Licensed Material is, connected
with, or sponsored, endorsed, or granted official status by,
the Licensor or others designated to receive attribution as
provided in Section 3(a)(1)(A)(i).
b. Other rights.
1. Moral rights, such as the right of integrity, are not
licensed under this Public License, nor are publicity,
privacy, and/or other similar personality rights; however, to
the extent possible, the Licensor waives and/or agrees not to
assert any such rights held by the Licensor to the limited
extent necessary to allow You to exercise the Licensed
Rights, but not otherwise.
2. Patent and trademark rights are not licensed under this
Public License.
3. To the extent possible, the Licensor waives any right to
collect royalties from You for the exercise of the Licensed
Rights, whether directly or through a collecting society
under any voluntary or waivable statutory or compulsory
licensing scheme. In all other cases the Licensor expressly
reserves any right to collect such royalties.
Section 3 -- License Conditions.
Your exercise of the Licensed Rights is expressly made subject to the
following conditions.
a. Attribution.
1. If You Share the Licensed Material (including in modified
form), You must:
a. retain the following if it is supplied by the Licensor
with the Licensed Material:
i. identification of the creator(s) of the Licensed
Material and any others designated to receive
attribution, in any reasonable manner requested by
the Licensor (including by pseudonym if
designated);
ii. a copyright notice;
iii. a notice that refers to this Public License;
iv. a notice that refers to the disclaimer of
warranties;
v. a URI or hyperlink to the Licensed Material to the
extent reasonably practicable;
b. indicate if You modified the Licensed Material and
retain an indication of any previous modifications; and
c. indicate the Licensed Material is licensed under this
Public License, and include the text of, or the URI or
hyperlink to, this Public License.
2. You may satisfy the conditions in Section 3(a)(1) in any
reasonable manner based on the medium, means, and context in
which You Share the Licensed Material. For example, it may be
reasonable to satisfy the conditions by providing a URI or
hyperlink to a resource that includes the required
information.
3. If requested by the Licensor, You must remove any of the
information required by Section 3(a)(1)(A) to the extent
reasonably practicable.
4. If You Share Adapted Material You produce, the Adapter's
License You apply must not prevent recipients of the Adapted
Material from complying with this Public License.
Section 4 -- Sui Generis Database Rights.
Where the Licensed Rights include Sui Generis Database Rights that
apply to Your use of the Licensed Material:
a. for the avoidance of doubt, Section 2(a)(1) grants You the right
to extract, reuse, reproduce, and Share all or a substantial
portion of the contents of the database;
b. if You include all or a substantial portion of the database
contents in a database in which You have Sui Generis Database
Rights, then the database in which You have Sui Generis Database
Rights (but not its individual contents) is Adapted Material; and
c. You must comply with the conditions in Section 3(a) if You Share
all or a substantial portion of the contents of the database.
For the avoidance of doubt, this Section 4 supplements and does not
replace Your obligations under this Public License where the Licensed
Rights include other Copyright and Similar Rights.
Section 5 -- Disclaimer of Warranties and Limitation of Liability.
a. UNLESS OTHERWISE SEPARATELY UNDERTAKEN BY THE LICENSOR, TO THE
EXTENT POSSIBLE, THE LICENSOR OFFERS THE LICENSED MATERIAL AS-IS
AND AS-AVAILABLE, AND MAKES NO REPRESENTATIONS OR WARRANTIES OF
ANY KIND CONCERNING THE LICENSED MATERIAL, WHETHER EXPRESS,
IMPLIED, STATUTORY, OR OTHER. THIS INCLUDES, WITHOUT LIMITATION,
WARRANTIES OF TITLE, MERCHANTABILITY, FITNESS FOR A PARTICULAR
PURPOSE, NON-INFRINGEMENT, ABSENCE OF LATENT OR OTHER DEFECTS,
ACCURACY, OR THE PRESENCE OR ABSENCE OF ERRORS, WHETHER OR NOT
KNOWN OR DISCOVERABLE. WHERE DISCLAIMERS OF WARRANTIES ARE NOT
ALLOWED IN FULL OR IN PART, THIS DISCLAIMER MAY NOT APPLY TO YOU.
b. TO THE EXTENT POSSIBLE, IN NO EVENT WILL THE LICENSOR BE LIABLE
TO YOU ON ANY LEGAL THEORY (INCLUDING, WITHOUT LIMITATION,
NEGLIGENCE) OR OTHERWISE FOR ANY DIRECT, SPECIAL, INDIRECT,
INCIDENTAL, CONSEQUENTIAL, PUNITIVE, EXEMPLARY, OR OTHER LOSSES,
COSTS, EXPENSES, OR DAMAGES ARISING OUT OF THIS PUBLIC LICENSE OR
USE OF THE LICENSED MATERIAL, EVEN IF THE LICENSOR HAS BEEN
ADVISED OF THE POSSIBILITY OF SUCH LOSSES, COSTS, EXPENSES, OR
DAMAGES. WHERE A LIMITATION OF LIABILITY IS NOT ALLOWED IN FULL OR
IN PART, THIS LIMITATION MAY NOT APPLY TO YOU.
c. The disclaimer of warranties and limitation of liability provided
above shall be interpreted in a manner that, to the extent
possible, most closely approximates an absolute disclaimer and
waiver of all liability.
Section 6 -- Term and Termination.
a. This Public License applies for the term of the Copyright and
Similar Rights licensed here. However, if You fail to comply with
this Public License, then Your rights under this Public License
terminate automatically.
b. Where Your right to use the Licensed Material has terminated under
Section 6(a), it reinstates:
1. automatically as of the date the violation is cured, provided
it is cured within 30 days of Your discovery of the
violation; or
2. upon express reinstatement by the Licensor.
For the avoidance of doubt, this Section 6(b) does not affect any
right the Licensor may have to seek remedies for Your violations
of this Public License.
c. For the avoidance of doubt, the Licensor may also offer the
Licensed Material under separate terms or conditions or stop
distributing the Licensed Material at any time; however, doing so
will not terminate this Public License.
d. Sections 1, 5, 6, 7, and 8 survive termination of this Public
License.
Section 7 -- Other Terms and Conditions.
a. The Licensor shall not be bound by any additional or different
terms or conditions communicated by You unless expressly agreed.
b. Any arrangements, understandings, or agreements regarding the
Licensed Material not stated herein are separate from and
independent of the terms and conditions of this Public License.
Section 8 -- Interpretation.
a. For the avoidance of doubt, this Public License does not, and
shall not be interpreted to, reduce, limit, restrict, or impose
conditions on any use of the Licensed Material that could lawfully
be made without permission under this Public License.
b. To the extent possible, if any provision of this Public License is
deemed unenforceable, it shall be automatically reformed to the
minimum extent necessary to make it enforceable. If the provision
cannot be reformed, it shall be severed from this Public License
without affecting the enforceability of the remaining terms and
conditions.
c. No term or condition of this Public License will be waived and no
failure to comply consented to unless expressly agreed to by the
Licensor.
d. Nothing in this Public License constitutes or may be interpreted
as a limitation upon, or waiver of, any privileges and immunities
that apply to the Licensor or You, including from the legal
processes of any jurisdiction or authority.
=======================================================================
Creative Commons is not a party to its public
licenses. Notwithstanding, Creative Commons may elect to apply one of
its public licenses to material it publishes and in those instances
will be considered the “Licensor.” The text of the Creative Commons
public licenses is dedicated to the public domain under the CC0 Public
Domain Dedication. Except for the limited purpose of indicating that
material is shared under a Creative Commons public license or as
otherwise permitted by the Creative Commons policies published at
creativecommons.org/policies, Creative Commons does not authorize the
use of the trademark "Creative Commons" or any other trademark or logo
of Creative Commons without its prior written consent including,
without limitation, in connection with any unauthorized modifications
to any of its public licenses or any other arrangements,
understandings, or agreements concerning use of licensed material. For
the avoidance of doubt, this paragraph does not form part of the
public licenses.
Creative Commons may be contacted at creativecommons.org.
================================================
FILE: Makefile
================================================
default: dev
d:dev
dev:
	hugo serve
b:build
build:
	hugo build
.PHONY: default d dev b build
# generate zh-tw version
translate:
	bin/zh-tw.py
epub:
	bin/epub
.PHONY: translate epub
================================================
FILE: README.md
================================================
# Designing Data-Intensive Applications (2nd Edition) - Chinese Translation
[](https://ddia.vonng.com)
[](https://ddia.vonng.com/v1)
[](https://star-history.com/#Vonng/ddia&Date)
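**Author**: [Martin Kleppmann](https://martin.kleppmann.com), author of [*Designing Data-Intensive Applications, 2nd Edition*](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html): a distributed systems researcher at the University of Cambridge, UK; a speaker, blogger, and open source contributor; a software engineer and entrepreneur who previously worked on data infrastructure at LinkedIn and Rapportive.
**Translator**: [Feng Ruohang (冯若航)](https://vonng.com) / [Vonng](https://github.com/Vonng) (rh@vonng.com), founder of [Pigsty](https://pgsty.com), [active](https://committers.top/china) [open source contributor](https://gitstar-ranking.com/Vonng), and PostgreSQL hacker. Author of the open source RDS PG distribution [Pigsty](https://pigsty.cc/zh/) and of the WeChat official account "[老冯云数](https://mp.weixin.qq.com/s/p4Ys10ZdEDAuqNAiRmcnIQ)"; [database veteran](https://pigsty.cc/zh/blog/db) and [cloud computing maverick](https://pigsty.cc/zh/blog/cloud); formerly an architect and DBA at Alibaba, Apple, and Tantan.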
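**Proofreading**: [@yingang](https://github.com/yingang) | [**Traditional Chinese (繁體中文)**](content/tw/_index.md) by [@afunTW](https://github.com/afunTW) | [Full contributor list](#contributions)
**Reading**: Read the online edition at [https://ddia.vonng.com](https://ddia.vonng.com), or build the site yourself with [hugo](https://gohugo.io/documentation/) and the [hextra](https://imfing.github.io/hextra/zh-cn/) theme.
> [!NOTE]
> The translation of the [**DDIA 2nd Edition**](https://ddia.vonng.com) is in progress (translated up to Chapter 10). You are welcome to read it and share your feedback.
---------
## Translator's Preface
> A full-stack engineer who doesn't understand databases is not a good architect.
>
> -- Feng Ruohang (冯若航) / Vonng
Today, especially in the internet industry, most applications are data-intensive. From low-level data structures to high-level architecture design, this book walks through the essence of data system design. Its hard-won lessons will help architects, DBAs, backend engineers, and even product managers.
This book combines theory with practice. The translator has run into many of the problems it describes in real-world work, and reading about them brings both applause and regret. Had I read this book sooner, how many detours I could have avoided!
It is also a book that makes deep ideas accessible. It explains where concepts come from rather than showing off definitions, and traces how things evolved rather than piling up facts. Complex concepts are explained in plain terms, yet the explanations get to the heart of the matter without losing depth. The references at the end of each chapter are of very high quality and serve as an excellent index for studying each topic further.
This book provides a solid conceptual framework for designing, implementing, and evaluating data systems. After reading and understanding it, readers can easily see through most technical hype and hold their own in arguments with self-styled experts 🤣.
This was the best technical book the translator read in 2017, and it would be a real pity for such a good book to have no Chinese translation. Modest as my abilities are, I would like to contribute to spreading advanced technical culture. Translating lets me study interesting technical topics in depth while sharpening my Chinese and English writing, so why not?
---------
## Foreword
> Technology is a powerful force in our society. Data, software, and communication can be used for bad: to entrench unfair power structures, to undermine human rights, and to protect vested interests. But they can also be used for good: to make underrepresented people's voices heard, to create opportunities for everyone, and to avert disasters. This book is dedicated to everyone working toward the good.
---------
> Computing is pop culture, and pop culture holds a disdain for history. Pop culture is all about identity and feeling like you're participating; it has nothing to do with cooperation. Pop culture lives in the present and has nothing to do with the past or the future. I think the same is true of most people who write code for money: they have no idea where their culture came from.
>
> -- Alan Kay, in an interview with Dr. Dobb's Journal (2012)
---------
## Table of Contents
* [Preface](https://ddia.vonng.com/preface)
* [Part I: Foundations of Data Systems](https://ddia.vonng.com/part-i)
  - [1. Trade-offs in Data Systems Architecture](https://ddia.vonng.com/ch1)
  - [2. Defining Nonfunctional Requirements](https://ddia.vonng.com/ch2)
  - [3. Data Models and Query Languages](https://ddia.vonng.com/ch3)
  - [4. Storage and Retrieval](https://ddia.vonng.com/ch4)
  - [5. Encoding and Evolution](https://ddia.vonng.com/ch5)
* [Part II: Distributed Data](https://ddia.vonng.com/part-ii)
  - [6. Replication](https://ddia.vonng.com/ch6)
  - [7. Sharding](https://ddia.vonng.com/ch7)
  - [8. Transactions](https://ddia.vonng.com/ch8)
  - [9. The Trouble with Distributed Systems](https://ddia.vonng.com/ch9)
  - [10. Consistency and Consensus](https://ddia.vonng.com/ch10)
* [Part III: Derived Data](https://ddia.vonng.com/part-iii)
  - [11. Batch Processing](https://ddia.vonng.com/ch11)
  - [12. Stream Processing](https://ddia.vonng.com/ch12)
  - [13. A Philosophy of Streaming Systems](https://ddia.vonng.com/ch13)
  - [14. Doing the Right Thing](https://ddia.vonng.com/ch14)
* [Glossary](https://ddia.vonng.com/glossary)
* [Colophon](https://ddia.vonng.com/colophon)

---------
## Legal Notice
According to the original author, an official Simplified Chinese translation was already planned, to be completed by the end of 2018. [Where to buy](https://search.jd.com/Search?keyword=设计数据密集型应用)
The translator translated this book purely for **learning purposes** and out of **personal interest**, and seeks no financial gain.
The translator retains the right of attribution for this translation; all other rights are subject to the claims of the original author and the publisher.
This translation is for study and research reference only and must not be publicly distributed or used commercially. If you can read English, please buy a legitimate copy to support the author. The original English edition is available for a free online trial preview on the [O'REILLY](https://learning.oreilly.com/api/v1/continue/9781098119058/) platform.
---------
## Contributions
0. Full-text proofreading by [@yingang](https://github.com/Vonng/ddia/commits?author=yingang)
1. [Corrections to the initial translation of the preface](https://github.com/Vonng/ddia/commit/afb5edab55c62ed23474149f229677e3b42dfc2c) by [@seagullbird](https://github.com/Vonng/ddia/commits?author=seagullbird)
2. [Grammar and punctuation corrections for Chapter 1](https://github.com/Vonng/ddia/commit/973b12cd8f8fcdf4852f1eb1649ddd9d187e3644) by [@nevertiree](https://github.com/Vonng/ddia/commits?author=nevertiree)
3. [Partial corrections to Chapter 6](https://github.com/Vonng/ddia/commit/d4eb0852c0ec1e93c8aacc496c80b915bb1e6d48) and the [initial translation of Chapter 10](https://github.com/Vonng/ddia/commit/9de8dbd1bfe6fbb03b3bf6c1a1aa2291aed2490e) by [@MuAlex](https://github.com/Vonng/ddia/commits?author=MuAlex)
4. Corrections to the Part I introduction and ch2 by [@jiajiadebug](https://github.com/Vonng/ddia/commits?author=jiajiadebug)
5. The glossary and the part of the colophon about the wild boar by [@Chowss](https://github.com/Vonng/ddia/commits?author=Chowss)
6. Traditional Chinese version and conversion script by [@afunTW](https://github.com/afunTW)
7. Numerous translation corrections by [@songzhibin97](https://github.com/Vonng/ddia/commits?author=songzhibin97) [@MamaShip](https://github.com/Vonng/ddia/commits?author=MamaShip) [@FangYuan33](https://github.com/Vonng/ddia/commits?author=FangYuan33)
8. Thanks to everyone who contributed and offered feedback:
Pull Requests & Issues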
| ISSUE & Pull Requests | USER | Title |
|-------------------------------------------------|------------------------------------------------------------|----------------------------------------------------------------|
| [386](https://github.com/Vonng/ddia/pull/386) | [@uncle-lv](https://github.com/uncle-lv) | ch2: improve a translation |
| [384](https://github.com/Vonng/ddia/pull/384) | [@PanggNOTlovebean](https://github.com/PanggNOTlovebean) | docs: improve wording and phrasing in the Chinese docs |
| [383](https://github.com/Vonng/ddia/pull/383) | [@PanggNOTlovebean](https://github.com/PanggNOTlovebean) | docs: fix terminology and phrasing errors in ch4 |
| [382](https://github.com/Vonng/ddia/pull/382) | [@uncle-lv](https://github.com/uncle-lv) | ch1: improve a translation |
| [381](https://github.com/Vonng/ddia/pull/381) | [@Max-Tortoise](https://github.com/Max-Tortoise) | ch4: fix an incomplete term |
| [377](https://github.com/Vonng/ddia/pull/377) | [@huang06](https://github.com/huang06) | Improve translated terminology |
| [375](https://github.com/Vonng/ddia/issues/375) | [@z-soulx](https://github.com/z-soulx) | Discussion: is a 100% Chinese translation necessary? Personal view: not 100%, especially for nouns, where keeping the original word suits IT readers better |
| [371](https://github.com/Vonng/ddia/pull/371) | [@lewiszlw](https://github.com/lewiszlw) | CPU core -> CPU 核心 |
| [369](https://github.com/Vonng/ddia/pull/369) | [@bbwang-gl](https://github.com/bbwang-gl) | ch7: serializable snapshot isolation detects when one transaction modifies another transaction's reads |
| [368](https://github.com/Vonng/ddia/pull/368) | [@yhao3](https://github.com/yhao3) | Update zh-tw.py and the zh-tw content |
| [367](https://github.com/Vonng/ddia/pull/367) | [@yhao3](https://github.com/yhao3) | Fix spelling, formatting, and punctuation issues |
| [366](https://github.com/Vonng/ddia/pull/366) | [@yangshangde](https://github.com/yangshangde) | ch8: change "电源失败" to "电源失效" |
| [365](https://github.com/Vonng/ddia/pull/365) | [@xyohn](https://github.com/xyohn) | ch1: improve the translation related to "separation of storage and compute" |
| [364](https://github.com/Vonng/ddia/issues/364) | [@xyohn](https://github.com/xyohn) | ch1: improve the translation related to "separation of storage and compute" |
| [363](https://github.com/Vonng/ddia/pull/363) | [@xyohn](https://github.com/xyohn) | #362: improve a translation |
| [362](https://github.com/Vonng/ddia/issues/362) | [@xyohn](https://github.com/xyohn) | ch1: improve a translation |
| [359](https://github.com/Vonng/ddia/pull/359) | [@c25423](https://github.com/c25423) | ch10: fix a typo |
| [358](https://github.com/Vonng/ddia/pull/358) | [@lewiszlw](https://github.com/lewiszlw) | ch4: fix a typo |
| [356](https://github.com/Vonng/ddia/pull/356) | [@lewiszlw](https://github.com/lewiszlw) | ch2: fix a punctuation error |
| [355](https://github.com/Vonng/ddia/pull/355) | [@DuroyGeorge](https://github.com/DuroyGeorge) | ch12: fix a formatting error |
| [354](https://github.com/Vonng/ddia/pull/354) | [@justlorain](https://github.com/justlorain) | ch7: fix a reference link |
| [353](https://github.com/Vonng/ddia/pull/353) | [@fantasyczl](https://github.com/fantasyczl) | ch3&9: fix two citation errors |
| [352](https://github.com/Vonng/ddia/pull/352) | [@fantasyczl](https://github.com/fantasyczl) | Support EPUB output |
| [349](https://github.com/Vonng/ddia/pull/349) | [@xiyihan0](https://github.com/xiyihan0) | ch1: fix a formatting error |
| [348](https://github.com/Vonng/ddia/pull/348) | [@omegaatt36](https://github.com/omegaatt36) | ch3: fix an image link |
| [346](https://github.com/Vonng/ddia/issues/346) | [@Vermouth1995](https://github.com/Vermouth1995) | ch1: improve a translation |
| [343](https://github.com/Vonng/ddia/pull/343) | [@kehao-chen](https://github.com/kehao-chen) | ch10: improve a translation |
| [341](https://github.com/Vonng/ddia/pull/341) | [@YKIsTheBest](https://github.com/YKIsTheBest) | ch3: improve two translations |
| [340](https://github.com/Vonng/ddia/pull/340) | [@YKIsTheBest](https://github.com/YKIsTheBest) | ch2: improve several translations |
| [338](https://github.com/Vonng/ddia/pull/338) | [@YKIsTheBest](https://github.com/YKIsTheBest) | ch1: improve a translation |
| [335](https://github.com/Vonng/ddia/pull/335) | [@kimi0230](https://github.com/kimi0230) | Fix a Traditional Chinese error |
| [334](https://github.com/Vonng/ddia/pull/334) | [@soulrrrrr](https://github.com/soulrrrrr) | ch2: fix a Traditional Chinese error |
| [332](https://github.com/Vonng/ddia/pull/332) | [@justlorain](https://github.com/justlorain) | ch5: fix a translation error |
| [331](https://github.com/Vonng/ddia/pull/331) | [@Lyianu](https://github.com/Lyianu) | ch9: correct several typos |
| [330](https://github.com/Vonng/ddia/pull/330) | [@Lyianu](https://github.com/Lyianu) | ch7: improve a translation |
| [329](https://github.com/Vonng/ddia/issues/329) | [@Lyianu](https://github.com/Lyianu) | ch6: point out a translation error |
| [328](https://github.com/Vonng/ddia/pull/328) | [@justlorain](https://github.com/justlorain) | ch4: correct a translation omission |
| [326](https://github.com/Vonng/ddia/pull/326) | [@liangGTY](https://github.com/liangGTY) | ch1: improve a translation |
| [323](https://github.com/Vonng/ddia/pull/323) | [@marvin263](https://github.com/marvin263) | ch5: improve a translation |
| [322](https://github.com/Vonng/ddia/pull/322) | [@marvin263](https://github.com/marvin263) | ch8: improve a translation |
| [304](https://github.com/Vonng/ddia/pull/304) | [@spike014](https://github.com/spike014) | ch11: improve a translation |
| [298](https://github.com/Vonng/ddia/pull/298) | [@Makonike](https://github.com/Makonike) | ch11&12: fix two errors |
| [284](https://github.com/Vonng/ddia/pull/284) | [@WAangzE](https://github.com/WAangzE) | ch4: correct a list error |
| [283](https://github.com/Vonng/ddia/pull/283) | [@WAangzE](https://github.com/WAangzE) | ch3: correct a typo |
| [282](https://github.com/Vonng/ddia/pull/282) | [@WAangzE](https://github.com/WAangzE) | ch2: correct a formula issue |
| [281](https://github.com/Vonng/ddia/pull/281) | [@lyuxi99](https://github.com/lyuxi99) | Correct several internal link errors |
| [280](https://github.com/Vonng/ddia/pull/280) | [@lyuxi99](https://github.com/lyuxi99) | ch9: correct internal link errors |
| [279](https://github.com/Vonng/ddia/issues/279) | [@codexvn](https://github.com/codexvn) | ch9: point out a formula rendering problem on GitHub Pages |
| [278](https://github.com/Vonng/ddia/pull/278) | [@LJlkdskdjflsa](https://github.com/LJlkdskdjflsa) | Found a mistranslation in the Traditional Chinese version |
| [275](https://github.com/Vonng/ddia/pull/275) | [@117503445](https://github.com/117503445) | Correct the LICENSE link |
| [274](https://github.com/Vonng/ddia/pull/274) | [@uncle-lv](https://github.com/uncle-lv) | ch7: fix typos |
| [273](https://github.com/Vonng/ddia/pull/273) | [@Sdot-Python](https://github.com/Sdot-Python) | ch7: unify the translation of write skew |
| [271](https://github.com/Vonng/ddia/pull/271) | [@Makonike](https://github.com/Makonike) | ch6: unify the translation of rebalancing |
| [270](https://github.com/Vonng/ddia/pull/270) | [@Ynjxsjmh](https://github.com/Ynjxsjmh) | ch7: fix inconsistent translations |
| [263](https://github.com/Vonng/ddia/pull/263) | [@zydmayday](https://github.com/zydmayday) | ch5: fix duplicated words in the translation |
| [260](https://github.com/Vonng/ddia/pull/260) | [@haifeiWu](https://github.com/haifeiWu) | ch4: fix some inaccurate translations |
| [258](https://github.com/Vonng/ddia/pull/258) | [@bestgrc](https://github.com/bestgrc) | ch3: fix a translation error |
| [257](https://github.com/Vonng/ddia/pull/257) | [@UnderSam](https://github.com/UnderSam) | ch8: fix a typo |
| [256](https://github.com/Vonng/ddia/pull/256) | [@AlphaWang](https://github.com/AlphaWang) | ch7: fix several poor translations in the "serializable" content |
| [255](https://github.com/Vonng/ddia/pull/255) | [@AlphaWang](https://github.com/AlphaWang) | ch7: fix several poor translations in the "repeatable read" content |
| [253](https://github.com/Vonng/ddia/pull/253) | [@AlphaWang](https://github.com/AlphaWang) | ch7: fix several poor translations in the "read committed" content |
| [246](https://github.com/Vonng/ddia/pull/246) | [@derekwu0101](https://github.com/derekwu0101) | ch3: fix a Traditional Chinese conversion error |
| [245](https://github.com/Vonng/ddia/pull/245) | [@skyran1278](https://github.com/skyran1278) | ch12: fix a Traditional Chinese conversion error |
| [244](https://github.com/Vonng/ddia/pull/244) | [@Axlgrep](https://github.com/Axlgrep) | ch9: fix an awkward translation |
| [242](https://github.com/Vonng/ddia/pull/242) | [@lynkeib](https://github.com/lynkeib) | ch9: 修正不通顺的翻译 |
| [241](https://github.com/Vonng/ddia/pull/241) | [@lynkeib](https://github.com/lynkeib) | ch8: 修正不正确的公式格式 |
| [240](https://github.com/Vonng/ddia/pull/240) | [@8da2k](https://github.com/8da2k) | ch9: 修正不通顺的翻译 |
| [239](https://github.com/Vonng/ddia/pull/239) | [@BeBraveBeCurious](https://github.com/BeBraveBeCurious) | ch7: 修正不一致的翻译 |
| [237](https://github.com/Vonng/ddia/pull/237) | [@zhangnew](https://github.com/zhangnew) | ch3: 修正错误的图片链接 |
| [229](https://github.com/Vonng/ddia/pull/229) | [@lis186](https://github.com/lis186) | 指出繁体中文的转译错误:复杂 |
| [226](https://github.com/Vonng/ddia/pull/226) | [@chroming](https://github.com/chroming) | ch1: 修正导航栏中的章节名称 |
| [220](https://github.com/Vonng/ddia/pull/220) | [@skyran1278](https://github.com/skyran1278) | ch9: 修正线性一致的繁体中文翻译 |
| [194](https://github.com/Vonng/ddia/pull/194) | [@BeBraveBeCurious](https://github.com/BeBraveBeCurious) | ch4: 修正错误的翻译 |
| [193](https://github.com/Vonng/ddia/pull/193) | [@BeBraveBeCurious](https://github.com/BeBraveBeCurious) | ch4: 优化译文 |
| [192](https://github.com/Vonng/ddia/pull/192) | [@BeBraveBeCurious](https://github.com/BeBraveBeCurious) | ch4: 修正不一致和不通顺的翻译 |
| [190](https://github.com/Vonng/ddia/pull/190) | [@Pcrab](https://github.com/Pcrab) | ch1: 修正不准确的翻译 |
| [187](https://github.com/Vonng/ddia/pull/187) | [@narojay](https://github.com/narojay) | ch9: 修正生硬的翻译 |
| [186](https://github.com/Vonng/ddia/pull/186) | [@narojay](https://github.com/narojay) | ch8: 修正错别字 |
| [185](https://github.com/Vonng/ddia/issues/185) | [@8da2k](https://github.com/8da2k) | 指出小标题跳转的问题 |
| [184](https://github.com/Vonng/ddia/pull/184) | [@DavidZhiXing](https://github.com/DavidZhiXing) | ch10: 修正失效的网址 |
| [183](https://github.com/Vonng/ddia/pull/183) | [@OneSizeFitsQuorum](https://github.com/OneSizeFitsQuorum) | ch8: 修正错别字 |
| [182](https://github.com/Vonng/ddia/issues/182) | [@lroolle](https://github.com/lroolle) | 建议docsify的主题风格 |
| [181](https://github.com/Vonng/ddia/pull/181) | [@YunfengGao](https://github.com/YunfengGao) | ch2: 修正翻译错误 |
| [180](https://github.com/Vonng/ddia/pull/180) | [@skyran1278](https://github.com/skyran1278) | ch3: 指出繁体中文的转译错误 |
| [177](https://github.com/Vonng/ddia/pull/177) | [@exzhawk](https://github.com/exzhawk) | 支持 Github Pages 里的公式显示 |
| [176](https://github.com/Vonng/ddia/pull/176) | [@haifeiWu](https://github.com/haifeiWu) | ch2: 语义网相关翻译更正 |
| [175](https://github.com/Vonng/ddia/pull/175) | [@cwr31](https://github.com/cwr31) | ch7: 不变式相关翻译更正 |
| [174](https://github.com/Vonng/ddia/pull/174) | [@BeBraveBeCurious](https://github.com/BeBraveBeCurious) | README & preface: 更正不正确的中文用词和标点符号 |
| [173](https://github.com/Vonng/ddia/pull/173) | [@ZvanYang](https://github.com/ZvanYang) | ch12: 修正不完整的翻译 |
| [171](https://github.com/Vonng/ddia/pull/171) | [@ZvanYang](https://github.com/ZvanYang) | ch12: 修正重复的译文 |
| [169](https://github.com/Vonng/ddia/pull/169) | [@ZvanYang](https://github.com/ZvanYang) | ch12: 更正不太通顺的翻译 |
| [166](https://github.com/Vonng/ddia/pull/166) | [@bp4m4h94](https://github.com/bp4m4h94) | ch1: 发现错误的文献索引 |
| [164](https://github.com/Vonng/ddia/pull/164) | [@DragonDriver](https://github.com/DragonDriver) | preface: 更正错误的标点符号 |
| [163](https://github.com/Vonng/ddia/pull/163) | [@llmmddCoder](https://github.com/llmmddCoder) | ch1: 更正错误字 |
| [160](https://github.com/Vonng/ddia/pull/160) | [@Zhayhp](https://github.com/Zhayhp) | ch2: 建议将 network model 翻译为网状模型 |
| [159](https://github.com/Vonng/ddia/pull/159) | [@1ess](https://github.com/1ess) | ch4: 更正错误字 |
| [157](https://github.com/Vonng/ddia/pull/157) | [@ZvanYang](https://github.com/ZvanYang) | ch7: 更正不太通顺的翻译 |
| [155](https://github.com/Vonng/ddia/pull/155) | [@ZvanYang](https://github.com/ZvanYang) | ch7: 更正不太通顺的翻译 |
| [153](https://github.com/Vonng/ddia/pull/153) | [@DavidZhiXing](https://github.com/DavidZhiXing) | ch9: 修正缩略图的错别字 |
| [152](https://github.com/Vonng/ddia/pull/152) | [@ZvanYang](https://github.com/ZvanYang) | ch7: 除重->去重 |
| [151](https://github.com/Vonng/ddia/pull/151) | [@ZvanYang](https://github.com/ZvanYang) | ch5: 修订sibling相关的翻译 |
| [147](https://github.com/Vonng/ddia/pull/147) | [@ZvanYang](https://github.com/ZvanYang) | ch5: 更正一处不准确的翻译 |
| [145](https://github.com/Vonng/ddia/pull/145) | [@Hookey](https://github.com/Hookey) | 识别了当前简繁转译过程中处理不当的地方,暂通过转换脚本规避 |
| [144](https://github.com/Vonng/ddia/issues/144) | [@secret4233](https://github.com/secret4233) | ch7: 不翻译`next-key locking` |
| [143](https://github.com/Vonng/ddia/issues/143) | [@imcheney](https://github.com/imcheney) | ch3: 更新残留的机翻段落 |
| [142](https://github.com/Vonng/ddia/issues/142) | [@XIJINIAN](https://github.com/XIJINIAN) | 建议去除段首的制表符 |
| [141](https://github.com/Vonng/ddia/issues/141) | [@Flyraty](https://github.com/Flyraty) | ch5: 发现一处错误格式的章节引用 |
| [140](https://github.com/Vonng/ddia/pull/140) | [@Bowser1704](https://github.com/Bowser1704) | ch5: 修正章节Summary中多处不通顺的翻译 |
| [139](https://github.com/Vonng/ddia/pull/139) | [@Bowser1704](https://github.com/Bowser1704) | ch2&ch3: 修正多处不通顺的或错误的翻译 |
| [137](https://github.com/Vonng/ddia/pull/137) | [@fuxuemingzhu](https://github.com/fuxuemingzhu) | ch5&ch6: 优化多处不通顺的或错误的翻译 |
| [134](https://github.com/Vonng/ddia/pull/134) | [@fuxuemingzhu](https://github.com/fuxuemingzhu) | ch4: 优化多处不通顺的或错误的翻译 |
| [133](https://github.com/Vonng/ddia/pull/133) | [@fuxuemingzhu](https://github.com/fuxuemingzhu) | ch3: 优化多处错误的或不通顺的翻译 |
| [132](https://github.com/Vonng/ddia/pull/132) | [@fuxuemingzhu](https://github.com/fuxuemingzhu) | ch3: 优化一处容易产生歧义的翻译 |
| [131](https://github.com/Vonng/ddia/pull/131) | [@rwwg4](https://github.com/rwwg4) | ch6: 修正两处错误的翻译 |
| [129](https://github.com/Vonng/ddia/pull/129) | [@anaer](https://github.com/anaer) | ch4: 修正两处强调文本和四处代码变量名称 |
| [128](https://github.com/Vonng/ddia/pull/128) | [@meilin96](https://github.com/meilin96) | ch5: 修正一处错误的引用 |
| [126](https://github.com/Vonng/ddia/pull/126) | [@cwr31](https://github.com/cwr31) | ch10: 修正一处错误的翻译(功能 -> 函数) |
| [125](https://github.com/Vonng/ddia/pull/125) | [@dch1228](https://github.com/dch1228) | ch2: 优化 how best 的翻译(如何以最佳方式) |
| [123](https://github.com/Vonng/ddia/pull/123) | [@yingang](https://github.com/yingang) | translation updates (chapter 9, TOC in readme, glossary, etc.) |
| [121](https://github.com/Vonng/ddia/pull/121) | [@yingang](https://github.com/yingang) | translation updates (chapter 5 to chapter 8) |
| [120](https://github.com/Vonng/ddia/pull/120) | [@jiong-han](https://github.com/jiong-han) | Typo fix: 呲之以鼻 -> 嗤之以鼻 |
| [119](https://github.com/Vonng/ddia/pull/119) | [@cclauss](https://github.com/cclauss) | Streamline file operations in convert() |
| [118](https://github.com/Vonng/ddia/pull/118) | [@yingang](https://github.com/yingang) | translation updates (chapter 2 to chapter 4) |
| [117](https://github.com/Vonng/ddia/pull/117) | [@feeeei](https://github.com/feeeei) | 统一每章的标题格式 |
| [115](https://github.com/Vonng/ddia/pull/115) | [@NageNalock](https://github.com/NageNalock) | 第七章病句修改: 重复词语 |
| [114](https://github.com/Vonng/ddia/pull/114) | [@Sunt-ing](https://github.com/Sunt-ing) | Update README.md: correct the book name |
| [113](https://github.com/Vonng/ddia/pull/113) | [@lpxxn](https://github.com/lpxxn) | 修改语句 |
| [112](https://github.com/Vonng/ddia/pull/112) | [@ibyte2011](https://github.com/ibyte2011) | Update ch9.md |
| [110](https://github.com/Vonng/ddia/pull/110) | [@lpxxn](https://github.com/lpxxn) | 读已写入数据 |
| [107](https://github.com/Vonng/ddia/pull/107) | [@abbychau](https://github.com/abbychau) | 單調鐘和好死还是赖活着 |
| [106](https://github.com/Vonng/ddia/pull/106) | [@enochii](https://github.com/enochii) | typo in ch2: fix braces typo |
| [105](https://github.com/Vonng/ddia/pull/105) | [@LiminCode](https://github.com/LiminCode) | Chronicle translation error |
| [104](https://github.com/Vonng/ddia/pull/104) | [@Sunt-ing](https://github.com/Sunt-ing) | several advice for better translation |
| [103](https://github.com/Vonng/ddia/pull/103) | [@Sunt-ing](https://github.com/Sunt-ing) | typo in ch4: should be 完成 rather than 完全 |
| [102](https://github.com/Vonng/ddia/pull/102) | [@Sunt-ing](https://github.com/Sunt-ing) | ch4: better-translation: 扼杀 → 破坏 |
| [101](https://github.com/Vonng/ddia/pull/101) | [@Sunt-ing](https://github.com/Sunt-ing) | typo in Ch4: should be "改变" rathr than "盖面" |
| [100](https://github.com/Vonng/ddia/pull/100) | [@LiminCode](https://github.com/LiminCode) | fix missing translation |
| [99 ](https://github.com/Vonng/ddia/pull/99) | [@mrdrivingduck](https://github.com/mrdrivingduck) | ch6: fix the word rebalancing |
| [98 ](https://github.com/Vonng/ddia/pull/98) | [@jacklightChen](https://github.com/jacklightChen) | fix ch7.md: fix wrong references |
| [97 ](https://github.com/Vonng/ddia/pull/97) | [@jenac](https://github.com/jenac) | 96 |
| [96 ](https://github.com/Vonng/ddia/pull/96) | [@PragmaTwice](https://github.com/PragmaTwice) | ch2: fix typo about 'may or may not be' |
| [95 ](https://github.com/Vonng/ddia/pull/95) | [@EvanMu96](https://github.com/EvanMu96) | fix translation of "the battle cry" in ch5 |
| [94 ](https://github.com/Vonng/ddia/pull/94) | [@kemingy](https://github.com/kemingy) | ch6: fix markdown and punctuations |
| [93 ](https://github.com/Vonng/ddia/pull/93) | [@kemingy](https://github.com/kemingy) | ch5: fix markdown and some typos |
| [92 ](https://github.com/Vonng/ddia/pull/92) | [@Gilbert1024](https://github.com/Gilbert1024) | Merge pull request #1 from Vonng/master |
| [88 ](https://github.com/Vonng/ddia/pull/88) | [@kemingy](https://github.com/kemingy) | fix typo for ch1, ch2, ch3, ch4 |
| [87 ](https://github.com/Vonng/ddia/pull/87) | [@wynn5a](https://github.com/wynn5a) | Update ch3.md |
| [86 ](https://github.com/Vonng/ddia/pull/86) | [@northmorn](https://github.com/northmorn) | Update ch1.md |
| [85 ](https://github.com/Vonng/ddia/pull/85) | [@sunbuhui](https://github.com/sunbuhui) | fix ch2.md: fix ch2 ambiguous translation |
| [84 ](https://github.com/Vonng/ddia/pull/84) | [@ganler](https://github.com/ganler) | Fix translation: use up |
| [83 ](https://github.com/Vonng/ddia/pull/83) | [@afunTW](https://github.com/afunTW) | Using OpenCC to convert from zh-cn to zh-tw |
| [82 ](https://github.com/Vonng/ddia/pull/82) | [@kangni](https://github.com/kangni) | fix gitbook url |
| [78 ](https://github.com/Vonng/ddia/pull/78) | [@hanyu2](https://github.com/hanyu2) | Fix unappropriated translation |
| [77 ](https://github.com/Vonng/ddia/pull/77) | [@Ozarklake](https://github.com/Ozarklake) | fix typo |
| [75 ](https://github.com/Vonng/ddia/pull/75) | [@2997ms](https://github.com/2997ms) | Fix typo |
| [74 ](https://github.com/Vonng/ddia/pull/74) | [@2997ms](https://github.com/2997ms) | Update ch9.md |
| [70 ](https://github.com/Vonng/ddia/pull/70) | [@2997ms](https://github.com/2997ms) | Update ch7.md |
| [67 ](https://github.com/Vonng/ddia/pull/67) | [@jiajiadebug](https://github.com/jiajiadebug) | fix issues in ch2 - ch9 and glossary |
| [66 ](https://github.com/Vonng/ddia/pull/66) | [@blindpirate](https://github.com/blindpirate) | Fix typo |
| [63 ](https://github.com/Vonng/ddia/pull/63) | [@haifeiWu](https://github.com/haifeiWu) | Update ch10.md |
| [62 ](https://github.com/Vonng/ddia/pull/62) | [@ych](https://github.com/ych) | fix ch1.md typesetting problem |
| [61 ](https://github.com/Vonng/ddia/pull/61) | [@xianlaioy](https://github.com/xianlaioy) | docs:钟-->种,去掉ou |
| [60 ](https://github.com/Vonng/ddia/pull/60) | [@Zombo1296](https://github.com/Zombo1296) | 否则 -> 或者 |
| [59 ](https://github.com/Vonng/ddia/pull/59) | [@AlexanderMisel](https://github.com/AlexanderMisel) | 呼叫->调用,显着->显著 |
| [58 ](https://github.com/Vonng/ddia/pull/58) | [@ibyte2011](https://github.com/ibyte2011) | Update ch8.md |
| [55 ](https://github.com/Vonng/ddia/pull/55) | [@saintube](https://github.com/saintube) | ch8: 修改链接错误 |
| [54 ](https://github.com/Vonng/ddia/pull/54) | [@Panmax](https://github.com/Panmax) | Update ch2.md |
| [53 ](https://github.com/Vonng/ddia/pull/53) | [@ibyte2011](https://github.com/ibyte2011) | Update ch9.md |
| [52 ](https://github.com/Vonng/ddia/pull/52) | [@hecenjie](https://github.com/hecenjie) | Update ch1.md |
| [51 ](https://github.com/Vonng/ddia/pull/51) | [@latavin243](https://github.com/latavin243) | fix 修正ch3 ch4几处翻译 |
| [50 ](https://github.com/Vonng/ddia/pull/50) | [@AlexZFX](https://github.com/AlexZFX) | 几个疏漏和格式错误 |
| [49 ](https://github.com/Vonng/ddia/pull/49) | [@haifeiWu](https://github.com/haifeiWu) | Update ch1.md |
| [48 ](https://github.com/Vonng/ddia/pull/48) | [@scaugrated](https://github.com/scaugrated) | fix typo |
| [47 ](https://github.com/Vonng/ddia/pull/47) | [@lzwill](https://github.com/lzwill) | Fixed typos in ch2 |
| [45 ](https://github.com/Vonng/ddia/pull/45) | [@zenuo](https://github.com/zenuo) | 删除一个多余的右括号 |
| [44 ](https://github.com/Vonng/ddia/pull/44) | [@akxxsb](https://github.com/akxxsb) | 修正第七章底部链接错误 |
| [43 ](https://github.com/Vonng/ddia/pull/43) | [@baijinping](https://github.com/baijinping) | "更假简单"->"更加简单" |
| [42 ](https://github.com/Vonng/ddia/pull/42) | [@tisonkun](https://github.com/tisonkun) | 修复 ch1 中的无序列表格式 |
| [38 ](https://github.com/Vonng/ddia/pull/38) | [@renjie-c](https://github.com/renjie-c) | 纠正多处的翻译小错误 |
| [37 ](https://github.com/Vonng/ddia/pull/37) | [@tankilo](https://github.com/tankilo) | fix translation mistakes in ch4.md |
| [36 ](https://github.com/Vonng/ddia/pull/36) | [@wwek](https://github.com/wwek) | 1.修复多个链接错误 2.名词优化修订 3.错误修订 |
| [35 ](https://github.com/Vonng/ddia/pull/35) | [@wwek](https://github.com/wwek) | fix ch7.md to ch8.md link error |
| [34 ](https://github.com/Vonng/ddia/pull/34) | [@wwek](https://github.com/wwek) | Merge pull request #1 from Vonng/master |
| [33 ](https://github.com/Vonng/ddia/pull/33) | [@wwek](https://github.com/wwek) | fix part-ii.md link error |
| [32 ](https://github.com/Vonng/ddia/pull/32) | [@JCYoky](https://github.com/JCYoky) | Update ch2.md |
| [31 ](https://github.com/Vonng/ddia/pull/31) | [@elsonLee](https://github.com/elsonLee) | Update ch7.md |
| [26 ](https://github.com/Vonng/ddia/pull/26) | [@yjhmelody](https://github.com/yjhmelody) | 修复一些明显错误 |
| [25 ](https://github.com/Vonng/ddia/pull/25) | [@lqbilbo](https://github.com/lqbilbo) | 修复链接错误 |
| [24 ](https://github.com/Vonng/ddia/pull/24) | [@artiship](https://github.com/artiship) | 修改词语顺序 |
| [23 ](https://github.com/Vonng/ddia/pull/23) | [@artiship](https://github.com/artiship) | 修正错别字 |
| [22 ](https://github.com/Vonng/ddia/pull/22) | [@artiship](https://github.com/artiship) | 纠正翻译错误 |
| [21 ](https://github.com/Vonng/ddia/pull/21) | [@zhtisi](https://github.com/zhtisi) | 修正目录和本章标题不符的情况 |
| [20 ](https://github.com/Vonng/ddia/pull/20) | [@rentiansheng](https://github.com/rentiansheng) | Update ch7.md |
| [19 ](https://github.com/Vonng/ddia/pull/19) | [@LHRchina](https://github.com/LHRchina) | 修复语句小bug |
| [16 ](https://github.com/Vonng/ddia/pull/16) | [@MuAlex](https://github.com/MuAlex) | Master |
| [15 ](https://github.com/Vonng/ddia/pull/15) | [@cg-zhou](https://github.com/cg-zhou) | Update translation progress |
| [14 ](https://github.com/Vonng/ddia/pull/14) | [@cg-zhou](https://github.com/cg-zhou) | Translate glossary |
| [13 ](https://github.com/Vonng/ddia/pull/13) | [@cg-zhou](https://github.com/cg-zhou) | 详细修改了后记中和印度野猪相关的描述 |
| [12 ](https://github.com/Vonng/ddia/pull/12) | [@ibyte2011](https://github.com/ibyte2011) | 修改了部分翻译 |
| [11 ](https://github.com/Vonng/ddia/pull/11) | [@jiajiadebug](https://github.com/jiajiadebug) | ch2 100% |
| [10 ](https://github.com/Vonng/ddia/pull/10) | [@jiajiadebug](https://github.com/jiajiadebug) | ch2 20% |
| [9 ](https://github.com/Vonng/ddia/pull/9) | [@jiajiadebug](https://github.com/jiajiadebug) | Preface, ch1, part-i translation minor fixes |
| [7 ](https://github.com/Vonng/ddia/pull/7) | [@MuAlex](https://github.com/MuAlex) | Ch6 translation pull request |
| [6 ](https://github.com/Vonng/ddia/pull/6) | [@MuAlex](https://github.com/MuAlex) | Ch6 change version1 |
| [5 ](https://github.com/Vonng/ddia/pull/5) | [@nevertiree](https://github.com/nevertiree) | Chapter 01语法微调 |
| [2 ](https://github.com/Vonng/ddia/pull/2) | [@seagullbird](https://github.com/seagullbird) | 序言初翻 |
---------
## License
This project is licensed under [CC-BY 4.0](https://github.com/Vonng/ddia/blob/master/LICENSE). The full terms are available here:
- [署名 4.0 协议国际版 CC BY 4.0 Deed](https://creativecommons.org/licenses/by/4.0/deed.zh-hans)
- [Attribution 4.0 International CC BY 4.0](https://creativecommons.org/licenses/by/4.0/deed.en)
================================================
FILE: assets/css/custom.css
================================================
/* Widen the left sidebar to 28rem so navigation text does not wrap */
.hextra-sidebar {
    width: 28rem !important;
    min-width: 28rem !important;
}

/* Make the navigation fill the sidebar */
.hextra-sidebar nav {
    width: 100%;
}

/* Keep navigation items on one line */
.hextra-sidebar li {
    white-space: nowrap;
}

.sidebar-container {
    width: 20rem !important;
    white-space: nowrap;
}

/* Keep navigation links on one line; truncate overflow with an ellipsis */
.hextra-sidebar a {
    white-space: nowrap;
    overflow: hidden;
    text-overflow: ellipsis;
    display: block;
}

/* Widen the right-hand in-page table of contents ("On This Page").
   Hextra's default width is about 16rem (256px); 1.5x that is 24rem (384px). */
.hextra-toc {
    width: 24rem !important;
}

/* Make the table of contents fill its container */
.hextra-toc nav {
    width: 100%;
}

/* Let long entries wrap within the wider table of contents */
.hextra-toc li {
    word-wrap: break-word;
}
================================================
FILE: assets/css/example.css
================================================
.md-example { margin: 1.25rem 0; padding: 1rem; border: 1px solid var(--border); border-radius: .75rem; }
.md-example__caption { display: flex; gap: .5rem; align-items: baseline; margin-bottom: .5rem; }
.md-example__label { font-weight: 600; opacity: .85; }
.md-example__title { font-weight: 600; }
.md-example__anchor { margin-left: auto; text-decoration: none; opacity: .6; }
.md-example__anchor:hover { opacity: 1; }
.md-example__note { margin-top: .5rem; font-size: .95em; opacity: .85; }
================================================
FILE: bin/Pipfile
================================================
[[source]]
url = "https://pypi.org/simple"
verify_ssl = true
name = "pypi"
[packages]
opencc = "*"
click = "*"
[dev-packages]
[requires]
python_version = "3.6"
================================================
FILE: bin/doc
================================================
#!/bin/bash
#==============================================================#
# File : doc
# Ctime : 2021-08-10
# Mtime : 2021-08-12
# Desc : Serve local doc with docsify, python3, python
# Path : bin/doc
# Deps : docsify or python3 or python2
# Copyright (C) 2018-2021 Ruohang Feng
#==============================================================#
PROG_DIR="$(cd "$(dirname "$0")" && pwd)"
DOCS_DIR="$(cd "$(dirname "${PROG_DIR}")" && pwd)"

# Preference order: node.js (docsify) > python3 (http.server) > python2 (SimpleHTTPServer)
if command -v docsify >/dev/null 2>&1; then
    echo "serve with docsify (click url to view in browser)"
    cd "${DOCS_DIR}" && docsify serve
elif command -v python3 >/dev/null 2>&1; then
    echo "serve http://localhost:3001 (python3 http.server)"
    cd "${DOCS_DIR}" && python3 -m http.server 3001
elif command -v python2 >/dev/null 2>&1; then
    echo "serve http://localhost:3001 (python2 SimpleHTTPServer)"
    cd "${DOCS_DIR}" && python2 -m SimpleHTTPServer 3001
else
    echo "no available server" >&2
    exit 1
fi
================================================
FILE: bin/epub
================================================
#!/usr/bin/env bash
set -e
# Set the directory containing Markdown files
SCRIPT_DIR=$(dirname "$0")
INPUT_DIR=$(cd "$(dirname "$SCRIPT_DIR")" && pwd)
OUTPUT_DIR="$INPUT_DIR/output"
TEMP_DIR="$OUTPUT_DIR/temp"
# Create output directory if it doesn't exist
mkdir -p "$OUTPUT_DIR"
mkdir -p "$TEMP_DIR"
# Preprocess Markdown files to convert Hugo shortcodes
echo "Preprocessing Markdown files..."
python3 "${SCRIPT_DIR}/preprocess-epub.py" "${INPUT_DIR}/content/zh" "$TEMP_DIR"

convert_to_epub() {
    # Combine the preprocessed Markdown files into a single EPUB book
    OUTPUT_BOOK="$OUTPUT_DIR/ddia.epub"
    rm -f "$OUTPUT_BOOK"
    echo "Converting Markdown files into $OUTPUT_BOOK..."
    local meta_file="${INPUT_DIR}/metadata.yaml"
    local css_file="${INPUT_DIR}/js/epub.css"
    pandoc -o "$OUTPUT_BOOK" --metadata-file="$meta_file" \
        --toc-depth=2 \
        --top-level-division=chapter \
        --file-scope=true \
        --css="$css_file" \
        --webtex \
        --wrap=preserve \
        "${TEMP_DIR}"/_index.md \
        "${TEMP_DIR}"/preface.md \
        "${TEMP_DIR}"/part-i.md \
        "${TEMP_DIR}"/ch1.md \
        "${TEMP_DIR}"/ch2.md \
        "${TEMP_DIR}"/ch3.md \
        "${TEMP_DIR}"/ch4.md \
        "${TEMP_DIR}"/part-ii.md \
        "${TEMP_DIR}"/ch5.md \
        "${TEMP_DIR}"/ch6.md \
        "${TEMP_DIR}"/ch7.md \
        "${TEMP_DIR}"/ch8.md \
        "${TEMP_DIR}"/ch9.md \
        "${TEMP_DIR}"/part-iii.md \
        "${TEMP_DIR}"/ch10.md \
        "${TEMP_DIR}"/ch11.md \
        "${TEMP_DIR}"/ch12.md \
        "${TEMP_DIR}"/ch13.md \
        "${TEMP_DIR}"/ch14.md \
        "${TEMP_DIR}"/colophon.md \
        "${TEMP_DIR}"/glossary.md
    echo "EPUB book created at $OUTPUT_BOOK."
}

convert_to_epub
# Clean up temporary files
rm -rf "$TEMP_DIR"
================================================
FILE: bin/preprocess-epub.py
================================================
#!/usr/bin/env python3
"""
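Preprocess Markdown files, converting Hugo shortcodes into a form Pandoc understands.

Two kinds of shortcode are handled:
1. {{< figure src="/fig/xxx.png" caption="xxx" >}} → ![xxx](static/fig/xxx.png)
2. {{< figure ... >}} (no src) → removed (usually a code-example placeholder)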
"""
import os
import re
import sys
from pathlib import Path
FIGURE_SHORTCODE_RE = re.compile(r"\{\{<\s*figure\b(.*?)>\}\}", re.DOTALL)
ATTR_RE = re.compile(r'([\w-]+)="([^"]*)"')
ABS_IMAGE_RE = re.compile(r'!\[([^\]]*)\]\(/(?!static/)([^)]+)\)')

def _escape_alt_text(text):
    """Escape `]` in alt text to avoid breaking Markdown image syntax."""
    return text.replace("]", r"\]")

def convert_markdown(text):
    """
    Convert Hugo figure shortcodes and absolute-path image references.

    Args:
        text: Markdown text

    Returns:
        The converted text
    """
    def replace_figure_shortcode(match):
        attrs_text = match.group(1)
        attrs = dict(ATTR_RE.findall(attrs_text))
        src = attrs.get("src")

        # A figure without src is usually a code-example placeholder; drop it
        if not src:
            return ""

        # Rewrite absolute paths relative to static/ so Pandoc can bundle them
        if src.startswith('/'):
            src = 'static' + src

        # Prefer caption, fall back to title, so the image at least renders
        alt = _escape_alt_text(attrs.get("caption") or attrs.get("title") or "")
        return f'![{alt}]({src})'

    text = FIGURE_SHORTCODE_RE.sub(replace_figure_shortcode, text)

    # Rewrite absolute-path Markdown images, e.g. ![](/map/ch01.png) → ![](static/map/ch01.png)
    text = ABS_IMAGE_RE.sub(r'![\1](static/\2)', text)
    return text

def process_file(input_path, output_path):
    """
    Process a single Markdown file.

    Args:
        input_path: path of the input file
        output_path: path of the output file
    """
    with open(input_path, 'r', encoding='utf-8') as f:
        content = f.read()

    # Convert the content
    converted_content = convert_markdown(content)

    # Create the output directory if the path has one (dirname is '' for a bare file name)
    output_dir = os.path.dirname(output_path)
    if output_dir:
        os.makedirs(output_dir, exist_ok=True)

    # Write the output file
    with open(output_path, 'w', encoding='utf-8') as f:
        f.write(converted_content)
    print(f"Processed: {input_path} -> {output_path}")

def main():
    """Entry point."""
    if len(sys.argv) < 2:
        print("Usage: preprocess-epub.py <input_file> [output_file]")
        print("   or: preprocess-epub.py <input_dir> <output_dir>")
        sys.exit(1)

    input_path = sys.argv[1]

    if os.path.isfile(input_path):
        # Single file: overwrite in place unless an output path is given
        output_path = sys.argv[2] if len(sys.argv) > 2 else input_path
        process_file(input_path, output_path)
    elif os.path.isdir(input_path):
        # Directory: an output directory is required
        if len(sys.argv) < 3:
            print("Error: an output directory is required when the input is a directory")
            sys.exit(1)
        output_dir = sys.argv[2]
        input_dir = Path(input_path)

        # Collect all .md files
        md_files = sorted(input_dir.glob('*.md'))
        for md_file in md_files:
            output_file = os.path.join(output_dir, md_file.name)
            process_file(str(md_file), output_file)
        print(f"\nTotal processed: {len(md_files)} files")
    else:
        print(f"Error: {input_path} is not a valid file or directory")
        sys.exit(1)


if __name__ == '__main__':
    main()
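
The figure-shortcode conversion described in the module docstring can be tried in isolation. The snippet below is a minimal sketch, assuming the shortcode is meant to become a standard Markdown image; it repeats the two shortcode patterns from the script, while `figure_to_markdown`, the sample strings, and the file name `ddia_0101.png` are hypothetical and exist only for this illustration. It leaves out the alt-text escaping and the separate rewrite of absolute-path images.

```python
import re

# Same patterns as in the script above.
FIGURE_SHORTCODE_RE = re.compile(r"\{\{<\s*figure\b(.*?)>\}\}", re.DOTALL)
ATTR_RE = re.compile(r'([\w-]+)="([^"]*)"')


def figure_to_markdown(text):
    """Hypothetical helper: rewrite Hugo figure shortcodes as Markdown images."""
    def replace(match):
        attrs = dict(ATTR_RE.findall(match.group(1)))
        src = attrs.get("src")
        if not src:
            return ""  # figures without src are dropped
        if src.startswith("/"):
            src = "static" + src  # make the path relative to static/
        alt = attrs.get("caption") or attrs.get("title") or ""
        return f"![{alt}]({src})"

    return FIGURE_SHORTCODE_RE.sub(replace, text)


# Hypothetical sample input, not taken from the book's content.
sample = '{{< figure src="/fig/ddia_0101.png" caption="Figure 1-1" >}}'
print(figure_to_markdown(sample))
```

A shortcode that has only a `title` and no `src` yields an empty string, which matches the "removed" case in the docstring.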
================================================
FILE: bin/toc.py
================================================
#!/usr/bin/env python3
"""
TOC Generator for DDIA book
Usage: python toc.py <lang> <depth> [output_file]
Example: python toc.py zh 2
         python toc.py en 3 en-toc.md
"""
import os
import sys
import re
from pathlib import Path

def extract_front_matter_title(content):
    """Extract title from Hugo front matter"""
    lines = content.split('\n')
    in_front_matter = False
    for line in lines:
        if line.strip() == '---':
            if not in_front_matter:
                in_front_matter = True
            else:
                break
        elif in_front_matter and line.startswith('title:'):
            # Extract title, removing quotes
            title = line.split(':', 1)[1].strip()
            if title.startswith('"') and title.endswith('"'):
                title = title[1:-1]
            elif title.startswith("'") and title.endswith("'"):
                title = title[1:-1]
            return title
    return None

def extract_headings(content, max_depth):
    """Extract headings up to specified depth from markdown content
    max_depth=1 -> extract H2 only
    max_depth=2 -> extract H2-H3
    max_depth=3 -> extract H2-H4
    max_depth=4 -> extract H2-H5
    """
    headings = []
    lines = content.split('\n')

    # Skip front matter
    skip_until = 0
    if lines[0].strip() == '---':
        for i, line in enumerate(lines[1:], 1):
            if line.strip() == '---':
                skip_until = i + 1
                break

    # Headings start at H2, so the deepest level kept is max_depth + 1
    max_level = max_depth + 1

    for line in lines[skip_until:]:
        # Match markdown headings with optional ID
        # Format: ## Heading Text {#heading-id}
        match = re.match(r'^(#{2,5})\s+(.*?)(?:\s*\{#([\w-]+)\})?$', line)
        if match:
            level = len(match.group(1))
            if level <= max_level:
                headings.append({
                    'level': level,  # Keep original level: 2 for H2, 3 for H3, etc.
                    'text': match.group(2).strip(),
                    'id': match.group(3)
                })
    return headings

def generate_toc_entry(file_name, title, lang, depth, content_dir):
    """Generate TOC entry for a file"""
    entries = []

    # Determine URL path
    base_name = file_name.replace('.md', '')
    if lang == 'zh':
        url = f"/{base_name}"
    else:
        url = f"/{lang}/{base_name}"

    # Add main entry (level 1)
    entries.append({
        'level': 1,
        'text': f"[{title}]({url})",
        'raw_text': title
    })

    # Special case: glossary.md only shows main title (no sub-headings)
    if file_name == 'glossary.md':
        effective_depth = 0  # Don't extract any sub-headings
    else:
        effective_depth = depth - 1  # Adjust depth: user depth 1 = no extraction, 2 = extract H2, etc.

    # If effective_depth >= 1, extract headings from file
    if effective_depth >= 1:
        file_path = content_dir / file_name
        if file_path.exists():
            with open(file_path, 'r', encoding='utf-8') as f:
                content = f.read()
            headings = extract_headings(content, effective_depth)
            for heading in headings:
                # Create link with anchor
                if heading['id']:
                    anchor_url = f"{url}#{heading['id']}"
                else:
                    # Generate anchor from heading text (simplified)
                    anchor = heading['text'].lower()
                    anchor = re.sub(r'[^\w\s-]', '', anchor)
                    anchor = re.sub(r'\s+', '-', anchor)
                    anchor_url = f"{url}#{anchor}"
                # Adjust level: H2 becomes level 2, H3 becomes level 3, etc.
                # This ensures proper indentation under the main entry
                entries.append({
                    'level': heading['level'],
                    'text': f"[{heading['text']}]({anchor_url})",
                    'raw_text': heading['text']
                })
    return entries

def format_toc_entries(entries):
    """Format TOC entries with proper indentation"""
    formatted = []
    for entry in entries:
        level = entry['level']
        text = entry['text']
        if level == 0:
            # Blank line separator
            formatted.append('')
        elif level == 1:
            # Main entry (chapter/section level)
            formatted.append(f"## {text}")
        elif level == 2:
            # H2 heading: top-level list item
            formatted.append(f"- {text}")
        elif level == 3:
            # H3 heading: nested one level (two spaces)
            formatted.append(f"  - {text}")
        elif level == 4:
            # H4 heading: nested two levels (four spaces)
            formatted.append(f"    - {text}")
        elif level == 5:
            # H5 heading: nested three levels (six spaces)
            formatted.append(f"      - {text}")
    return '\n'.join(formatted)

def check_file_status(file_path, lang):
    """Return a status marker if the file is missing or nearly empty, else an empty string"""
    wip_marker = " (未发布)" if lang == 'zh' else " (未發布)" if lang == 'tw' else " (WIP)"

    if not file_path.exists():
        return wip_marker

    # Check if file has minimal content (you can adjust this logic)
    with open(file_path, 'r', encoding='utf-8') as f:
        content = f.read()

    # Simple heuristic: if file has less than 500 characters of actual content, consider it WIP
    # Remove front matter for content check
    lines = content.split('\n')
    if lines[0].strip() == '---':
        for i, line in enumerate(lines[1:], 1):
            if line.strip() == '---':
                content = '\n'.join(lines[i+1:])
                break

    if len(content.strip()) < 500:
        return wip_marker
    return ""

def main():
    # Parse arguments
    if len(sys.argv) < 3:
        print("Usage: python toc.py <lang> <depth> [output_file]")
        print("Example: python toc.py zh 2")
        sys.exit(1)

    lang = sys.argv[1]
    if lang not in ['zh', 'en', 'tw']:
        print("Error: Language must be one of: zh, en, tw")
        sys.exit(1)

    try:
        depth = int(sys.argv[2])
        if depth not in [1, 2, 3, 4]:
            raise ValueError
    except ValueError:
        print("Error: Depth must be 1, 2, 3, or 4")
        sys.exit(1)

    # Determine output file
    if len(sys.argv) > 3:
        output_file = sys.argv[3]
    else:
        output_file = f"{lang}.md"

    # Get content directory
    script_dir = Path(__file__).parent
    project_root = script_dir.parent
    content_dir = project_root / 'content' / lang
    if not content_dir.exists():
        print(f"Error: Content directory {content_dir} does not exist")
        sys.exit(1)

    # Define file order
    file_order = [
        'preface.md',
        'ch1.md', 'ch2.md', 'ch3.md', 'ch4.md', 'ch5.md', 'ch6.md',
        'ch7.md', 'ch8.md', 'ch9.md', 'ch10.md', 'ch11.md', 'ch12.md', 'ch13.md',
        'glossary.md',
        'colophon.md'
    ]

    # Generate TOC
    all_entries = []
    for file_name in file_order:
        file_path = content_dir / file_name
        if file_path.exists():
            with open(file_path, 'r', encoding='utf-8') as f:
                content = f.read()
            title = extract_front_matter_title(content)
            if title:
                entries = generate_toc_entry(file_name, title, lang, depth, content_dir)
                # Add status marker to main entry if needed
                status = check_file_status(file_path, lang)
                if status and entries:
                    # Append the marker after the main link; a blanket replace(')')
                    # would repeat it if the title itself contained ')'
                    entries[0]['text'] += status
                all_entries.extend(entries)
                if entries:  # Add blank line between chapters
                    all_entries.append({'level': 0, 'text': ''})

    # Format and write output
    formatted_toc = format_toc_entries(all_entries)

    # Clean up extra blank lines
    formatted_toc = re.sub(r'\n{3,}', '\n\n', formatted_toc)

    # Write to file
    output_path = Path(output_file)
    with open(output_path, 'w', encoding='utf-8') as f:
        f.write(formatted_toc)
    print(f"TOC generated successfully: {output_path}")
    print(f"Language: {lang}, Depth: {depth}")


if __name__ == "__main__":
    main()
================================================
FILE: bin/translate.py
================================================
"""Convert zh-cn to zh-tw
Refer to https://github.com/BYVoid/OpenCC
"""
import click
import opencc
from pathlib import Path


@click.group()
def cli():
    pass


def convert(infile: str, outfile: str, cfg: str):
    """read >> convert >> write file
    Args:
        infile (str): input file
        outfile (str): output file
        cfg (str): config
    """
    converter = opencc.OpenCC(cfg)
    with open(infile, "r", encoding="utf-8") as inf, open(outfile, "w+", encoding="utf-8") as outf:
        for line in inf:
            # Strip the line's own newline before converting and write exactly
            # one back; joining the raw lines with "\n" would double every
            # line break in the output.
            outf.write(converter.convert(line.rstrip("\n")) + "\n")
    print(f"Convert to {outfile}")


@cli.command()
@click.option("-i", "--input", "infile", required=True)
@click.option("-o", "--output", "outfile", required=True)
@click.option("-c", "--config", "cfg", required=True, default="s2twp.json")
def file(infile: str, outfile: str, cfg: str):
    """read >> convert >> write file
    Args:
        infile (str): input file
        outfile (str): output file
        cfg (str): config
    """
    convert(infile, outfile, cfg)


@cli.command()
@click.option("-i", "--input", "infolder", required=True)
@click.option("-o", "--output", "outfolder", required=True)
@click.option("-c", "--config", "cfg", required=True, default="s2twp.json")
def repo(infolder, outfolder, cfg):
    if not Path(outfolder).exists():
        Path(outfolder).mkdir(parents=True)
        print(f"Create {outfolder}")
    infiles = Path(infolder).resolve().glob("*.md")
    # Pair every input Markdown file with a same-named file in the output folder
    pair = [
        {"infile": str(infile), "outfile": str(Path(outfolder).resolve() / infile.name)}
        for infile in infiles
    ]
    for p in pair:
        convert(p["infile"], p["outfile"], cfg)


if __name__ == "__main__":
    cli()
================================================
FILE: bin/zh-tw.py
================================================
#!/usr/bin/env python3
import os
import re
import sys

import opencc


def process_urls(text, src_folder, dst_folder):
    """Process relative URLs in Markdown"""
    # Page paths that need rewriting (without the .md suffix)
    page_paths = [
        '/ch1', '/ch2', '/ch3', '/ch4', '/ch5', '/ch6',
        '/ch7', '/ch8', '/ch9', '/ch10', '/ch11', '/ch12', '/ch13', '/ch14',
        '/part-i', '/part-ii', '/part-iii',
        '/preface', '/glossary', '/colophon'
    ]
    # Apply the replacement for each page path
    for page_path in page_paths:
        # Match Markdown links of the form [text](page_path) or [text](page_path#anchor)
        pattern = rf'\[([^\]]*)\]\(([^)]*)({re.escape(page_path)})(#[^)]*)?\)'

        # Replace with a version that carries the destination folder prefix
        def replace_func(match):
            text_part = match.group(1)
            folder_part = match.group(2) or ''
            page_part = match.group(3)
            anchor_part = match.group(4) or ''
            if not folder_part:
                # The default Chinese edition has no /zh prefix, so just prepend /tw
                return f'[{text_part}](/tw{page_part}{anchor_part})'
            elif folder_part[1:] == src_folder:
                # Other Chinese editions have a prefix such as /v1; swap it according to the arguments
                return f'[{text_part}](/{dst_folder}{page_part}{anchor_part})'
            else:
                link = f'[{text_part}]({folder_part}{page_part}{anchor_part})'
                print(f'unknown folder part in: {link}, keep it unchanged')
                return link

        text = re.sub(pattern, replace_func, text)
    return text


def convert_file(src_filepath, dst_filepath, src_folder, dst_folder, cfg='s2twp.json'):
    print("convert %s to %s" % (src_filepath, dst_filepath))
    converter = opencc.OpenCC(cfg)
    with open(src_filepath, "r", encoding='utf-8') as src, open(dst_filepath, "w+", encoding='utf-8') as dst:
        dst.write("\n".join(
            process_urls(
                converter.convert(line.rstrip())
                .replace('一箇', '一個')
                .replace('髮送', '傳送')
                .replace('髮布', '釋出')
                .replace('髮生', '發生')
                .replace('髮出', '發出')
                .replace('嚐試', '嘗試')
                .replace('線上性一致', '在線性一致')  # apparently parsed as "在线" first?
                .replace('復雜', '複雜')
                .replace('討論瞭', '討論了')
                .replace('瞭解釋', '了解釋')
                .replace('瞭如', '了如')  # e.g. 引入了如, 實現了如, 了如何, 了如果, 了如此
                .replace('了如指掌', '瞭如指掌')  # exception to the previous line
                .replace('明瞭', '明了')  # e.g. 闡明了, 聲明了, 指明了
                .replace('倒黴', '倒楣')
                .replace('區域性性', '區域性')
                .replace('下麵條件', '下面條件')  # apparently parsed as "面条" first?
                .replace('當日志', '當日誌')  # apparently parsed as "当日" first?
                .replace('真即時間', '真實時間')  # apparently parsed as "实时" first?
                .replace('面向物件', '物件導向')
                .replace('非規範化', '反正規化')
                .replace('規範化', '正規化'),
                src_folder, dst_folder
            )
            for line in src))


def convert(zh_folder, tw_folder):
    home = os.path.abspath(os.path.join(os.path.dirname(os.path.abspath(sys.argv[0])), '..'))
    zh_dirpath = os.path.join(home, 'content', zh_folder)
    tw_dirpath = os.path.join(home, 'content', tw_folder)
    for file in os.listdir(zh_dirpath):
        if file.endswith('.md'):
            zh_filepath = os.path.join(zh_dirpath, file)
            tw_filepath = os.path.join(tw_dirpath, file)
            convert_file(zh_filepath, tw_filepath, zh_folder, tw_folder)


if __name__ == '__main__':
    print(sys.argv)
    convert('zh', 'tw')
    convert('v1', 'v1_tw')
================================================
FILE: content/en/_index.md
================================================
---
title: "Designing Data-Intensive Applications 2nd Edition"
linkTitle: DDIA
cascade:
type: docs
breadcrumbs: false
---
—— **The Big Ideas Behind Reliable, Scalable, and Maintainable Systems**
[Martin Kleppmann](https://martin.kleppmann.com)
> The en-us version includes only the **intro**, **summary**, and **references** of each chapter, to protect the intellectual property of the author and publisher.

--------
*Technology is a powerful force in our society. Data, software, and communication can be used for bad: to entrench unfair power structures, to undermine human rights, and to protect vested interests. But they can also be used for good: to make underrepresented people’s voices heard, to create opportunities for everyone, and to avert disasters. This book is dedicated to everyone working toward the good.*
---------
*Computing is pop culture. [...] Pop culture holds a disdain for history. Pop culture is all about identity and feeling like you’re participating. It has nothing to do with cooperation, the past or the future—it’s living in the present. I think the same is true of most people who write code for money. They have no idea where [their culture came from].*
— [Alan Kay](http://www.drdobbs.com/architecture-and-design/interview-with-alan-kay/240003442), in interview with *Dr Dobb’s Journal* (2012)
---------
## Table of Contents
### [Preface](/en/preface)
### [Part I: Foundations of Data Systems](/en/part-i)
- [1. Trade-offs in Data Systems Architecture](/en/ch1)
- [2. Defining Nonfunctional Requirements](/en/ch2)
- [3. Data Models and Query Languages](/en/ch3)
- [4. Storage and Retrieval](/en/ch4)
- [5. Encoding and Evolution](/en/ch5)
### [Part II: Distributed Data](/en/part-ii)
- [6. Replication](/en/ch6)
- [7. Partitioning](/en/ch7)
- [8. Transactions](/en/ch8)
- [9. The Trouble with Distributed Systems](/en/ch9)
- [10. Consistency and Consensus](/en/ch10)
### [Part III: Derived Data](/en/part-iii)
- [11. Batch Processing](/en/ch11)
- [12. Stream Processing](/en/ch12)
- [13. A Philosophy of Streaming Systems](/en/ch13)
- [14. Doing the Right Thing](/en/ch14)
### [Glossary](/en/glossary)
### [Colophon](/en/colophon)
================================================
FILE: content/en/ch1.md
================================================
---
title: "1. Trade-offs in Data Systems Architecture"
weight: 101
breadcrumbs: false
---
> *There are no solutions, there are only trade-offs. […] But you try to get the best
> trade-off you can get, and that’s all you can hope for.*
>
> [Thomas Sowell](https://www.youtube.com/watch?v=2YUtKr8-_Fg), Interview with Fred Barnes (2005)
> [!TIP] A NOTE FOR EARLY RELEASE READERS
> With Early Release ebooks, you get books in their earliest form—the author’s raw and unedited content as they write—so you can take advantage of these technologies long before the official release of these titles.
>
> This will be the 1st chapter of the final book. The GitHub repo for this book is https://github.com/ept/ddia2-feedback.
> If you’d like to be actively involved in reviewing and commenting on this draft, please reach out on GitHub.
Data is central to much application development today. With web and mobile apps, software as a
service (SaaS), and cloud services, it has become normal to store data from many different users in
a shared server-based data infrastructure. Data from user activity, business transactions, devices
and sensors needs to be stored and made available for analysis. As users interact with an
application, they both read the data that is stored, and also generate more data.
Small amounts of data, which can be stored and processed on a single machine, are often fairly easy
to deal with. However, as the data volume or the rate of queries grows, it needs to be distributed
across multiple machines, which introduces many challenges. As the needs of the application become
more complex, it is no longer sufficient to store everything in one system, but it might be
necessary to combine multiple storage or processing systems that provide different capabilities.
We call an application *data-intensive* if data management is one of the primary challenges in
developing the application [^1].
While in *compute-intensive* systems the challenge is parallelizing some very large computation, in
data-intensive applications we usually worry more about things like storing and processing large
data volumes, managing changes to data, ensuring consistency in the face of failures and
concurrency, and making sure services are highly available.
Such applications are typically built from standard building blocks that provide commonly needed
functionality. For example, many applications need to:
* Store data so that they, or another application, can find it again later (*databases*)
* Remember the result of an expensive operation, to speed up reads (*caches*)
* Allow users to search data by keyword or filter it in various ways (*search indexes*)
* Handle events and data changes as soon as they occur (*stream processing*)
* Periodically crunch a large amount of accumulated data (*batch processing*)
In building an application we typically take several software systems or services, such as databases
or APIs, and glue them together with some application code. If you are doing exactly what the data
systems were designed for, then this process can be quite easy.
However, as your application becomes more ambitious, challenges arise. There are many database
systems with different characteristics, suitable for different purposes—how do you choose which one
to use? There are various approaches to caching, several ways of building search indexes, and so
on—how do you reason about their trade-offs? You need to figure out which tools and which approaches
are the most appropriate for the task at hand, and it can be difficult to combine tools when you
need to do something that a single tool cannot do alone.
This book is a guide to help you make decisions about which technologies to use and how to combine
them. As you will see, there is no one approach that is fundamentally better than others; everything
has pros and cons. With this book, you will learn to ask the right questions to evaluate and compare
data systems, so that you can figure out which approach will best serve the needs of your particular
application.
We will start our journey by looking at some of the ways that data is typically used in
organizations today. Many of the ideas here have their origin in *enterprise software* (i.e., the
software needs and engineering practices of large organizations, such as big corporations and
governments), since historically, only large organizations had the large data volumes that required
sophisticated technical solutions. If your data volume is small enough, you can simply keep it in a
spreadsheet! However, more recently it has also become common for smaller companies and startups to
manage large data volumes and build data-intensive systems.
One of the key challenges with data systems is that different people need to do very different
things with data. If you are working at a company, you and your team will have one set of
priorities, while another team may have entirely different goals, even though you might be working
with the same dataset! Moreover, those goals might not be explicitly articulated, which can lead to
misunderstandings and disagreement about the right approach.
To help you understand what choices you can make, this chapter compares several contrasting
concepts, and explores their trade-offs:
* the difference between operational and analytical systems ([“Analytical versus Operational Systems”](/en/ch1#sec_introduction_analytics));
* pros and cons of cloud services and self-hosted systems ([“Cloud versus Self-Hosting”](/en/ch1#sec_introduction_cloud));
* when to move from single-node systems to distributed systems ([“Distributed versus Single-Node Systems”](/en/ch1#sec_introduction_distributed)); and
* balancing the needs of the business and the rights of the user ([“Data Systems, Law, and Society”](/en/ch1#sec_introduction_compliance)).
Moreover, this chapter will provide you with terminology that we will need for the rest of the book.
> [!TIP] TERMINOLOGY: FRONTENDS AND BACKENDS
> Much of what we will discuss in this book relates to *backend development*. To explain that term:
> for web applications, the client-side code (which runs in a web browser) is called the *frontend*,
> and the server-side code that handles user requests is known as the *backend*. Mobile apps are
> similar to frontends in that they provide user interfaces, which often communicate over the Internet
> with a server-side backend. Frontends sometimes manage data locally on the user’s device [^2],
> but the greatest data infrastructure challenges often lie in the backend: a frontend only needs to
> handle one user’s data, whereas the backend manages data on behalf of *all* of the users.
>
> A backend service is often reachable via HTTP (sometimes WebSocket); it usually consists of some
> application code that reads and writes data in one or more databases, and sometimes interfaces with
> additional data systems such as caches or message queues (which we might collectively call *data
> infrastructure*). The application code is often *stateless* (i.e., when it finishes handling one
> HTTP request, it forgets everything about that request), and any information that needs to persist
> from one request to another needs to be stored either on the client, or in the server-side data
> infrastructure.
## Analytical versus Operational Systems {#sec_introduction_analytics}
If you are working on data systems in an enterprise, you are likely to encounter several different
types of people who work with data. The first type are *backend engineers* who build services that
handle requests for reading and updating data; these services often serve external users, either
directly or indirectly via other services (see [“Microservices and Serverless”](/en/ch1#sec_introduction_microservices)). Sometimes
services are for internal use by other parts of the organization.
In addition to the teams managing backend services, two other groups of people typically require
access to an organization’s data: *business analysts*, who generate reports about the activities of
the organization in order to help the management make better decisions (*business intelligence* or
*BI*), and *data scientists*, who look for novel insights in data or who create user-facing product
features that are enabled by data analysis and machine learning/AI (for example, “people who bought
X also bought Y” recommendations on an e-commerce website, predictive analytics such as risk scoring
or spam filtering, and ranking of search results).
Although business analysts and data scientists tend to use different tools and operate in different
ways, they have some things in common: both perform *analytics*, which means they look at the data
that the users and backend services have generated, but they generally do not modify this data
(except perhaps for fixing mistakes). They might create derived datasets in which the original data
has been processed in some way. This has led to a split between two types of systems—a distinction
that we will use throughout this book:
* *Operational systems* consist of the backend services and data infrastructure where data is
created, for example by serving external users. Here, the application code both reads and modifies
the data in its databases, based on the actions performed by the users.
* *Analytical systems* serve the needs of business analysts and data scientists. They contain a
read-only copy of the data from the operational systems, and they are optimized for the types of
data processing that are needed for analytics.
As we shall see in the next section, operational and analytical systems are often kept separate, for
good reasons. As these systems have matured, two new specialized roles have emerged: *data
engineers* and *analytics engineers*. Data engineers are the people who know how to integrate the
operational and the analytical systems, and who take responsibility for the organization’s data
infrastructure more widely [^3].
Analytics engineers model and transform data to make it more useful for the business analysts and
data scientists in an organization [^4].
Many engineers specialize in either the operational or the analytical side. However, this book
covers both operational and analytical data systems, since both play an important role in the
lifecycle of data within an organization. We will explore in depth the data infrastructure that is
used to deliver services both to internal and external users, so that you can work better with your
colleagues on the other side of this divide.
### Characterizing Transaction Processing and Analytics {#sec_introduction_oltp}
In the early days of business data processing, a write to the database typically corresponded to a
*commercial transaction* taking place: making a sale, placing an order with a supplier, paying an
employee’s salary, etc. As databases expanded into areas that didn’t involve money changing hands,
the term *transaction* nevertheless stuck, referring to a group of reads and writes that form a
logical unit.
> [!NOTE]
> [Chapter 8](/en/ch8#ch_transactions) explores in detail what we mean by a transaction. This chapter uses the term
> loosely to refer to low-latency reads and writes.
Even though databases started being used for many different kinds of data—posts on social media,
moves in a game, contacts in an address book, and many others—the basic access pattern
remained similar to processing business transactions. An operational system typically looks up a
small number of records by some key (this is called a *point query*). Records are inserted, updated,
or deleted based on the user’s input. Because these applications are interactive, this access
pattern became known as *online transaction processing* (OLTP).
However, databases also started being increasingly used for analytics, which has very different
access patterns compared to OLTP. Usually an analytic query scans over a huge number of records, and
calculates aggregate statistics (such as count, sum, or average) rather than returning the
individual records to the user. For example, a business analyst at a supermarket chain may want to
answer analytic queries such as:
* What was the total revenue of each of our stores in January?
* How many more bananas than usual did we sell during our latest promotion?
* Which brand of baby food is most often purchased together with brand X diapers?
The reports that result from these types of queries are important for business intelligence, helping
the management decide what to do next. In order to differentiate this pattern of using databases
from transaction processing, it has been called *online analytic processing* (OLAP) [^5].
The difference between OLTP and analytics is not always clear-cut, but some typical characteristics are listed in [Table 1-1](/en/ch1#tab_oltp_vs_olap).
{{< figure id="tab_oltp_vs_olap" title="Table 1-1. Comparing characteristics of operational and analytic systems" class="w-full my-4" >}}
| Property | Operational systems (OLTP) | Analytical systems (OLAP) |
|---------------------|-------------------------------------------------|-------------------------------------------|
| Main read pattern | Point queries (fetch individual records by key) | Aggregate over large number of records |
| Main write pattern | Create, update, and delete individual records | Bulk import (ETL) or event stream |
| Human user example | End user of web/mobile application | Internal analyst, for decision support |
| Machine use example | Checking if an action is authorized | Detecting fraud/abuse patterns |
| Type of queries | Fixed set of queries, predefined by application | Analyst can make arbitrary queries |
| Data represents | Latest state of data (current point in time) | History of events that happened over time |
| Dataset size | Gigabytes to terabytes | Terabytes to petabytes |
> [!NOTE]
> The meaning of *online* in *OLAP* is unclear; it probably refers to the fact that queries are not
> just for predefined reports, but that analysts use the OLAP system interactively for explorative
> queries.
With operational systems, users are generally not allowed to construct custom SQL queries and run
them on the database, since that would potentially allow them to read or modify data that they do
not have permission to access. Moreover, they might write queries that are expensive to execute, and
hence affect the database performance for other users. For these reasons, OLTP systems mostly run a
fixed set of queries that are baked into the application code, and use one-off custom queries only
occasionally for maintenance or troubleshooting. On the other hand, analytic databases usually give
their users the freedom to write arbitrary SQL queries by hand, or to generate queries automatically
using a data visualization or dashboard tool such as Tableau, Looker, or Microsoft Power BI.
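
The two access patterns described above can be sketched with Python's built-in `sqlite3` module. This is only an illustration: the `sales` table, its columns, and its rows are made up for the example, and a real analytic query would scan far more records than this.

```python
import sqlite3

# Hypothetical sales table; the schema and rows are assumptions for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, store TEXT, product TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales (id, store, product, amount) VALUES (?, ?, ?, ?)",
    [
        (1, "north", "bananas", 3.0),
        (2, "north", "diapers", 12.5),
        (3, "south", "bananas", 4.5),
        (4, "south", "baby food", 6.0),
    ],
)


def point_query(sale_id):
    """OLTP-style read: fetch a single record by its key."""
    return conn.execute(
        "SELECT store, product, amount FROM sales WHERE id = ?", (sale_id,)
    ).fetchone()


def revenue_per_store():
    """OLAP-style read: scan every record and return an aggregate per store."""
    rows = conn.execute(
        "SELECT store, SUM(amount) FROM sales GROUP BY store ORDER BY store"
    ).fetchall()
    return dict(rows)
```

The point query is the kind of fixed, parameterized statement that would be baked into application code, whereas the aggregate is the kind of query an analyst might write by hand or generate from a dashboard tool.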
There is also a type of system that is designed for analytical workloads (queries that aggregate
over many records) but that is embedded into user-facing products. This category is known as
*product analytics* or *real-time analytics*, and systems designed for this type of use include
Pinot, Druid, and ClickHouse [^6].
### Data Warehousing {#sec_introduction_dwh}
At first, the same databases were used for both transaction processing and analytic queries. SQL
turned out to be quite flexible in this regard: it works well for both types of queries.
Nevertheless, in the late 1980s and early 1990s, there was a trend for companies to stop using their
OLTP systems for analytics purposes, and to run the analytics on a separate database system instead.
This separate database was called a *data warehouse*.
A large enterprise may have dozens, even hundreds, of online transaction processing systems:
systems powering the customer-facing website, controlling point of sale (checkout) systems in
physical stores, tracking inventory in warehouses, planning routes for vehicles, managing suppliers,
administering employees, and performing many other tasks. Each of these systems is complex and needs
a team of people to maintain it, so these systems end up operating mostly independently from each
other.
It is usually undesirable for business analysts and data scientists to directly query these OLTP
systems, for several reasons:
* the data of interest may be spread across multiple operational systems, making it difficult to
combine those datasets in a single query (a problem known as *data silos*);
* the kinds of schemas and data layouts that are good for OLTP are less well suited for analytics
(see [“Stars and Snowflakes: Schemas for Analytics”](/en/ch3#sec_datamodels_analytics));
* analytic queries can be quite expensive, and running them on an OLTP database would impact the
performance for other users; and
* the OLTP systems might reside in a separate network that users are not allowed direct access to
for security or compliance reasons.
A *data warehouse*, by contrast, is a separate database that analysts can query to their hearts’
content, without affecting OLTP operations [^7].
As we shall see in [Chapter 4](/en/ch4#ch_storage), data warehouses often store data in a way that is very different
from OLTP databases, in order to optimize for the types of queries that are common in analytics.
The data warehouse contains a read-only copy of the data in all the various OLTP systems in the
company. Data is extracted from OLTP databases (using either a periodic data dump or a continuous
stream of updates), transformed into an analysis-friendly schema, cleaned up, and then loaded into
the data warehouse. This process of getting data into the data warehouse is known as
*Extract–Transform–Load* (ETL) and is illustrated in [Figure 1-1](/en/ch1#fig_dwh_etl). Sometimes the order of the
*transform* and *load* steps is swapped (i.e., the transformation is done in the data warehouse,
after loading), resulting in *ELT*.
{{< figure src="/fig/ddia_0101.png" id="fig_dwh_etl" caption="Figure 1-1. Simplified outline of ETL into a data warehouse." class="w-full my-4" >}}
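
As a rough sketch of the extract, transform, and load steps, the following uses two in-memory SQLite databases to stand in for an OLTP system and a data warehouse. The table names (`orders`, `fact_orders`), the columns, and the cleaning rules are assumptions chosen for the example, not part of any particular ETL tool.

```python
import sqlite3


def extract(oltp):
    """Extract: read raw order rows from the operational database."""
    return oltp.execute(
        "SELECT id, customer, amount_cents, ordered_at FROM orders"
    ).fetchall()


def transform(rows):
    """Transform: clean up records and reshape them for analysis."""
    cleaned = []
    for order_id, customer, amount_cents, ordered_at in rows:
        if amount_cents is None:  # drop incomplete records
            continue
        cleaned.append(
            (order_id, customer.strip().lower(), amount_cents / 100, ordered_at[:7])
        )
    return cleaned


def load(warehouse, rows):
    """Load: write the transformed rows into the warehouse table."""
    warehouse.execute(
        "CREATE TABLE IF NOT EXISTS fact_orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL, month TEXT)"
    )
    warehouse.executemany("INSERT OR REPLACE INTO fact_orders VALUES (?, ?, ?, ?)", rows)
    warehouse.commit()


# Hypothetical operational data, including one dirty and one incomplete row.
oltp = sqlite3.connect(":memory:")
oltp.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount_cents INTEGER, ordered_at TEXT)"
)
oltp.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, " Alice ", 1250, "2024-01-15"),
        (2, "BOB", None, "2024-01-20"),
        (3, "alice", 750, "2024-02-03"),
    ],
)

warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(extract(oltp)))
```

An ELT variant would call `load` on the raw rows first and then run the transformation as SQL inside the warehouse.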
In some cases the data sources of the ETL processes are external SaaS products such as customer
relationship management (CRM), email marketing, or credit card processing systems. In those cases,
you do not have direct access to the original database, since it is accessible only via the software
vendor’s API. Bringing the data from these external systems into your own data warehouse can enable
analyses that are not possible via the SaaS API. ETL for SaaS APIs is often implemented by
specialist data connector services such as Fivetran, Singer, or Airbyte.
Some database systems offer *hybrid transactional/analytic processing* (HTAP), which aims to enable
OLTP and analytics in a single system without requiring ETL from one system into another [^8] [^9].
However, many HTAP systems internally consist of an OLTP system coupled with a separate analytical
system, hidden behind a common interface—so the distinction between the two remains important for
understanding how these systems work.
Moreover, even though HTAP exists, it is common to have a separation between transactional and
analytic systems due to their different goals and requirements. In particular, it is considered good
practice for each operational system to have its own database (see
[“Microservices and Serverless”](/en/ch1#sec_introduction_microservices)), leading to hundreds of separate operational databases; on the
other hand, an enterprise usually has a single data warehouse, so that business analysts can combine
data from several operational systems in a single query.
HTAP therefore does not replace data warehouses. Rather, it is useful in scenarios where the same
application needs to both perform analytics queries that scan a large number of rows, and also
read and update individual records with low latency. Fraud detection can involve such workloads, for
example [^10].
The separation between operational and analytical systems is part of a wider trend: as workloads
have become more demanding, systems have become more specialized and optimized for particular
workloads. General-purpose systems can handle small data volumes comfortably, but the greater the
scale, the more specialized systems tend to become [^11].
#### From data warehouse to data lake {#from-data-warehouse-to-data-lake}
A data warehouse often uses a *relational* data model that is queried through SQL (see
[Chapter 3](/en/ch3#ch_datamodels)), perhaps using specialized business intelligence software. This model works well
for the types of queries that business analysts need to make, but it is less well suited to the
needs of data scientists, who might need to perform tasks such as:
* Transform data into a form that is suitable for training a machine learning model; often this
requires turning the rows and columns of a database table into a vector or matrix of numerical
values called *features*. The process of performing this transformation in a way that maximizes
the performance of the trained model is called *feature engineering*, and it often requires custom
code that is difficult to express using SQL.
* Take textual data (e.g., reviews of a product) and use natural language processing techniques to
try to extract structured information from it (e.g., the sentiment of the author, or which topics
they mention). Similarly, they might need to extract structured information from photos using
computer vision techniques.
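
To make the first task above more concrete, here is a minimal sketch of turning database-style rows into a numeric feature matrix using plain Python. The record fields (`price`, `quantity`, `category`) and the choice of one-hot encoding are assumptions for illustration; real feature engineering usually involves libraries such as pandas or scikit-learn and considerably more judgment.

```python
def build_features(rows, categories):
    """Turn dict-shaped records into a numeric feature matrix (a list of lists)."""
    matrix = []
    for row in rows:
        # One-hot encode the categorical column: one 0/1 slot per known category.
        one_hot = [1.0 if row["category"] == c else 0.0 for c in categories]
        matrix.append([float(row["price"]), float(row["quantity"])] + one_hot)
    return matrix


# Hypothetical records, as they might come out of a relational table.
rows = [
    {"price": 2.5, "quantity": 4, "category": "fruit"},
    {"price": 9.0, "quantity": 1, "category": "baby"},
]
categories = sorted({r["category"] for r in rows})
features = build_features(rows, categories)
```

Even this small transformation depends on state computed across the whole dataset (the list of categories), which hints at why such code is often easier to write in a general-purpose language than in SQL.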
Although there have been efforts to add machine learning operators to a SQL data model [^12]
and to build efficient machine learning systems on top of a relational foundation [^13],
many data scientists prefer not to work in a relational database such as a data warehouse. Instead,
many prefer to use Python data analysis libraries such as pandas and scikit-learn, statistical
analysis languages such as R, and distributed analytics frameworks such as Spark [^14].
We discuss these further in [“Dataframes, Matrices, and Arrays”](/en/ch3#sec_datamodels_dataframes).
Consequently, organizations face a need to make data available in a form that is suitable for use by
data scientists. The answer is a *data lake*: a centralized data repository that holds a copy of any
data that might be useful for analysis, obtained from operational systems via ETL processes. The
difference from a data warehouse is that a data lake simply contains files, without imposing any
particular file format or data model. Files in a data lake might be collections of database records,
encoded using a file format such as Avro or Parquet (see [Chapter 5](/en/ch5#ch_encoding)), but they can equally well
contain text, images, videos, sensor readings, sparse matrices, feature vectors, genome sequences,
or any other kind of data [^15].
Besides being more flexible, this is also often cheaper than relational data storage, since the data
lake can use commoditized file storage such as object stores (see [“Cloud-Native System Architecture”](/en/ch1#sec_introduction_cloud_native)).
ETL processes have been generalized to *data pipelines*, and in some cases the data lake has become
an intermediate stop on the path from the operational systems to the data warehouse. The data lake
contains data in a “raw” form produced by the operational systems, without the transformation into a
relational data warehouse schema. This approach has the advantage that each consumer of the data can
transform the raw data into a form that best suits their needs. It has been dubbed the *sushi
principle*: “raw data is better” [^16].
Besides loading data from a data lake into a separate data warehouse, it is also possible to run
typical data warehousing workloads (SQL queries and business analytics) directly on the files in the
data lake, alongside data science/machine learning workloads. This architecture is known as a *data
lakehouse*, and it requires a query execution engine and a metadata (e.g., schema management) layer
that extend the data lake’s file storage [^17].
Apache Hive, Spark SQL, Presto, and Trino are examples of this approach.
#### Beyond the data lake {#beyond-the-data-lake}
As analytics practices have matured, organizations have been increasingly paying attention to the
management and operations of analytics systems and data pipelines, as captured for example in the
DataOps manifesto [^18].
This includes issues of governance, privacy, and compliance with regulations such as GDPR and
CCPA, which we discuss in [“Data Systems, Law, and Society”](/en/ch1#sec_introduction_compliance) and [“Legislation and Self-Regulation”](/en/ch14#sec_future_legislation).
Moreover, analytical data is increasingly made available not only as files and relational tables,
but also as streams of events (see [Chapter 12](/en/ch12#ch_stream)). With file-based data analysis you can re-run the
analysis periodically (e.g., daily) in order to respond to changes in the data, but stream processing
allows analytics systems to respond to events much faster, on the order of seconds. Depending on the
application and how time-sensitive it is, a stream processing approach can be valuable, for example
to identify and block potentially fraudulent or abusive activity.
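As a minimal sketch of the stream-processing style, the following function handles each event as it arrives and flags an account that generates too many events within a short window. The window length, the threshold, and the account names are assumptions made up for this illustration; real fraud detection uses far richer signals.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 60   # hypothetical window length
MAX_EVENTS = 3        # hypothetical threshold per window

recent = defaultdict(deque)   # account -> timestamps of its recent events


def on_event(account, timestamp):
    """Process one event as it arrives; return True if the account looks suspicious."""
    window = recent[account]
    window.append(timestamp)
    # Forget events that have fallen out of the window.
    while window and window[0] <= timestamp - WINDOW_SECONDS:
        window.popleft()
    return len(window) > MAX_EVENTS


# Four events within 30 seconds: the fourth one exceeds the threshold.
flags = [on_event("acct-1", t) for t in (0, 10, 20, 30)]
```

A file-based job would compute the same counts, but only on its next scheduled run; here the decision is available as soon as the fourth event has been processed.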
In some cases the outputs of analytics systems are made available to operational systems (a process
sometimes known as *reverse ETL* [^19]). For example, a machine-learning model that was trained on data in an analytics system may be deployed to
production, so that it can generate recommendations for end-users, such as “people who bought X also
bought Y”. Such deployed outputs of analytics systems are also known as *data products* [^20].
Machine learning models can be deployed to operational systems using specialized tools such as
TFX, Kubeflow, or MLflow.
### Systems of Record and Derived Data {#sec_introduction_derived}
Related to the distinction between operational and analytical systems, this book also distinguishes
between *systems of record* and *derived data systems*. These terms are useful because they can help
you clarify the flow of data through a system:
Systems of record
: A system of record, also known as a *source of truth*, holds the authoritative or *canonical*
version of some data. When new data comes in, e.g., as user input, it is first written here. Each
fact is represented exactly once (the representation is typically *normalized*; see
[“Normalization, Denormalization, and Joins”](/en/ch3#sec_datamodels_normalization)). If there is any discrepancy between another system and the
system of record, then the value in the system of record is (by definition) the correct one.
Derived data systems
: Data in a derived system is the result of taking some existing data from another system and
transforming or processing it in some way. If you lose derived data, you can recreate it from the
original source. A classic example is a cache: data can be served from the cache if present, but
if the cache doesn’t contain what you need, you can fall back to the underlying database.
Denormalized values, indexes, materialized views, transformed data representations, and models
trained on a dataset also fall into this category.
Technically speaking, derived data is *redundant*, in the sense that it duplicates existing
information. However, it is often essential for getting good performance on read queries. You can
derive several different datasets from a single source, enabling you to look at the data from
different “points of view.”
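The relationship between a system of record and derived data can be sketched in a few lines. In this illustration the `users` dictionary plays the role of the system of record and `city_index` is a derived dataset; the record layout and the function name are assumptions made up for the example.

```python
from collections import defaultdict

# System of record: each user is stored exactly once, keyed by user ID.
users = {
    1: {"name": "Alice", "city": "Berlin"},
    2: {"name": "Bob", "city": "Lima"},
    3: {"name": "Carol", "city": "Berlin"},
}


def build_city_index(records):
    """Derive a city -> sorted user IDs index from the system of record."""
    index = defaultdict(list)
    for user_id, user in records.items():
        index[user["city"]].append(user_id)
    return {city: sorted(ids) for city, ids in index.items()}


# Derived data: redundant, but it turns "who lives in Berlin?" into one lookup.
city_index = build_city_index(users)

# Losing the derived data is not fatal: it can be recreated from the source.
city_index = None
city_index = build_city_index(users)
```

The index contains no information that is not already in `users`, which is what makes it safe to discard and rebuild.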
Analytical systems are usually derived data systems, because they are consumers of data created
elsewhere. Operational services may contain a mixture of systems of record and derived data systems.
The systems of record are the primary databases to which data is first written, whereas the derived
data systems are the indexes and caches that speed up common read operations, especially for queries
that the system of record cannot answer efficiently.
Most databases, storage engines, and query languages are not inherently a system of record or a
derived system. A database is just a tool: how you use it is up to you. The distinction between
system of record and derived data system depends not on the tool, but on how you use it in your
application. By being clear about which data is derived from which other data, you can bring clarity
to an otherwise confusing system architecture.
When the data in one system is derived from the data in another, you need a process for updating the
derived data when the original in the system of record changes. Unfortunately, many databases are
designed based on the assumption that your application only ever needs to use that one database, and
they do not make it easy to integrate multiple systems in order to propagate such updates. In
[“Data Integration”](/en/ch13#sec_future_integration) we will discuss approaches to *data integration*, which allow us to compose multiple
data systems to achieve things that one system alone cannot do.
That brings us to the end of our comparison of analytics and transaction processing. In the next
section, we will examine another trade-off that you might have already seen debated multiple times.
## Cloud versus Self-Hosting {#sec_introduction_cloud}
With anything that an organization needs to do, one of the first questions is: should it be done
in-house, or should it be outsourced? Should you build or should you buy?
Ultimately, this is a question about business priorities. The received management wisdom is that
things that are a core competency or a competitive advantage of your organization should be done
in-house, whereas things that are non-core, routine, or commonplace should be left to a vendor [^21].
To give an extreme example, most companies do not generate their own electricity (unless they are an
energy company, and leaving aside emergency backup power), since it is cheaper to buy electricity from the grid.
With software, two important decisions to be made are who builds the software and who deploys it.
There is a spectrum of possibilities that outsource each decision to various degrees, as illustrated
in [Figure 1-2](/en/ch1#fig_cloud_spectrum). At one extreme is bespoke software that you write and run in-house; at
the other extreme are widely-used cloud services or Software as a Service (SaaS) products that are
implemented and operated by an external vendor, and which you only access through a web interface or API.
{{< figure src="/fig/ddia_0102.png" id="fig_cloud_spectrum" caption="Figure 1-2. A spectrum of types of software and its operations." class="w-full my-4" >}}
The middle ground is off-the-shelf software (open source or commercial) that you *self-host*, i.e.,
deploy yourself—for example, if you download MySQL and install it on a server you control. This
could be on your own hardware (often called *on-premises*, even if the server is actually in a
rented datacenter rack and not literally on your own premises), or on a virtual machine in the cloud
(*Infrastructure as a Service* or IaaS). There are still more points along this spectrum, e.g.,
taking open source software and running a modified version of it.
Separately from this spectrum there is also the question of *how* you deploy services, either in the
cloud or on-premises—for example, whether you use an orchestration framework such as Kubernetes.
However, choice of deployment tooling is out of scope of this book, since other factors have a
greater influence on the architecture of data systems.
### Pros and Cons of Cloud Services {#sec_introduction_cloud_tradeoffs}
Using a cloud service, rather than running comparable software yourself, essentially outsources the
operation of that software to the cloud provider. There are good arguments for and against cloud
services. Cloud providers claim that using their services saves you time and money, and allows you
to move faster compared to setting up your own infrastructure.
Whether a cloud service is actually cheaper and easier than self-hosting depends very much on your
skills and the workload on your systems. If you already have experience setting up and operating the
systems you need, and if your load is quite predictable (i.e., the number of machines you need does
not fluctuate wildly), then it’s often cheaper to buy your own machines and run the software on them
yourself [^22] [^23].
On the other hand, if you need a system that you don’t already know how to deploy and operate, then
adopting a cloud service is often easier and quicker than learning to manage the system yourself. If
you have to hire and train staff specifically to maintain and operate the system, that can get very
expensive. You still need an operations team when you’re using the cloud (see
[“Operations in the Cloud Era”](/en/ch1#sec_introduction_operations)), but outsourcing the basic system administration can free up your
team to focus on higher-level concerns.
When you outsource the operation of a system to a company that specializes in running that service,
that can potentially result in a better service, since the provider gains operational expertise from
providing the service to many customers. On the other hand, if you run the service yourself, you can
configure and tune it to perform well on your particular workload; it is unlikely that a cloud
service would be willing to make such customizations on your behalf.
Cloud services are particularly valuable if the load on your systems varies a lot over time. If you
provision your machines to be able to handle peak load, but those computing resources are idle most
of the time, the system becomes less cost-effective. In this situation, cloud services have the
advantage that they can make it easier to scale your computing resources up or down in response to
changes in demand.
For example, analytics systems often have extremely variable load: running a large analytical query
quickly requires a lot of computing resources in parallel, but once the query completes, those
resources sit idle until the user makes the next query. Predefined queries (e.g., for daily reports)
can be enqueued and scheduled to smooth out the load, but for interactive queries, the faster you
want them to complete, the more variable the workload becomes. If your dataset is so large that
querying it quickly requires significant computing resources, using the cloud can save money, since
you can return unused resources to the provider rather than leaving them idle. For smaller datasets,
this difference is less significant.
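A back-of-the-envelope calculation makes the trade-off concrete. The demand profiles and prices below are hypothetical, chosen only to illustrate the argument: we assume that on-demand cloud capacity costs three times as much per machine-hour as hardware you own.

```python
def fixed_cost(hourly_demand, price_per_machine_hour):
    """Cost when you provision for peak load and keep every machine running."""
    peak = max(hourly_demand)
    return peak * len(hourly_demand) * price_per_machine_hour


def elastic_cost(hourly_demand, price_per_machine_hour):
    """Cost when you pay only for the machines needed in each hour."""
    return sum(hourly_demand) * price_per_machine_hour


# Hypothetical prices, in arbitrary currency units per machine-hour.
OWNED_PRICE = 1
CLOUD_PRICE = 3

# Spiky day: 2 machines for 22 hours, 100 machines for 2 hours of heavy queries.
spiky = [2] * 22 + [100] * 2
spiky_owned = fixed_cost(spiky, OWNED_PRICE)     # 100 machines * 24 h * 1 = 2400
spiky_cloud = elastic_cost(spiky, CLOUD_PRICE)   # 244 machine-hours * 3 = 732

# Steady day: 50 machines around the clock.
steady = [50] * 24
steady_owned = fixed_cost(steady, OWNED_PRICE)   # 1200
steady_cloud = elastic_cost(steady, CLOUD_PRICE) # 3600
```

Under these assumptions the cloud is much cheaper for the spiky workload, even at a higher unit price, while owned hardware is cheaper for the steady one.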
The biggest downside of a cloud service is that you have no control over it:
* If it is lacking a feature you need, all you can do is to politely ask the vendor whether they
will add it; you generally cannot implement it yourself.
* If the service goes down, all you can do is to wait for it to recover.
* If you are using the service in a way that triggers a bug or causes performance problems, it will
be difficult for you to diagnose the issue. With software that you run yourself, you can get
performance metrics and debugging information from the operating system to help you understand its
behavior, and you can look at the server logs, but with a service hosted by a vendor you usually
do not have access to these internals.
* Moreover, if the service shuts down or becomes unacceptably expensive, or if the vendor decides to
change their product in a way you don’t like, you are at their mercy—continuing to run an old
version of the software is usually not an option, so you will be forced to migrate to an
alternative service [^24].
This risk is mitigated if there are alternative services that expose a compatible API, but for
many cloud services there are no standard APIs, which raises the cost of switching, making vendor
lock-in a problem.
* The cloud provider needs to be trusted to keep the data secure, which can complicate the process
of complying with privacy and security regulations.
Despite all these risks, it has become more and more popular for organizations to build new
applications on top of cloud services, or to adopt a hybrid approach in which cloud services are
used for some aspects of a system. However, cloud services will not subsume all in-house data
systems: many older systems predate the cloud, and for any services that have specialist
requirements that existing cloud services cannot meet, in-house systems remain necessary. For
example, very latency-sensitive applications such as high-frequency trading require full control of
the hardware.
### Cloud-Native System Architecture {#sec_introduction_cloud_native}
Besides having a different economic model (subscribing to a service instead of buying hardware and
licensing software to run on it), the rise of the cloud has also had a profound effect on how data
systems are implemented on a technical level. The term *cloud-native* is used to describe an
architecture that is designed to take advantage of cloud services.
In principle, almost any software that you can self-host could also be provided as a cloud service,
and indeed such managed services are now available for many popular data systems. However, systems
that have been designed from the ground up to be cloud-native have been shown to have several
advantages: better performance on the same hardware, faster recovery from failures, being able to
quickly scale computing resources to match the load, and supporting larger datasets [^25] [^26] [^27].
[Table 1-2](/en/ch1#tab_cloud_native_dbs) lists some examples of both types of systems.
{{< figure id="tab_cloud_native_dbs" title="Table 1-2. Examples of self-hosted and cloud-native database systems" class="w-full my-4" >}}
| Category | Self-hosted systems | Cloud-native systems |
|------------------|-----------------------------|-----------------------------------------------------------------------|
| Operational/OLTP | MySQL, PostgreSQL, MongoDB | AWS Aurora [^25], Azure SQL DB Hyperscale [^26], Google Cloud Spanner |
| Analytical/OLAP | Teradata, ClickHouse, Spark | Snowflake [^27], Google BigQuery, Azure Synapse Analytics |
#### Layering of cloud services {#layering-of-cloud-services}
Many self-hosted data systems have very simple system requirements: they run on a conventional
operating system such as Linux or Windows, they store their data as files on the filesystem, and
they communicate via standard network protocols such as TCP/IP. A few systems depend on special
hardware such as GPUs (for machine learning) or RDMA network interfaces, but on the whole,
self-hosted software tends to use very generic computing resources: CPU, RAM, a filesystem, and an IP network.
In a cloud, this type of software can be run on an Infrastructure-as-a-Service environment, using
one or more virtual machines (or *instances*) with a certain allocation of CPUs, memory, disk, and
network bandwidth. Compared to physical machines, cloud instances can be provisioned faster and they
come in a greater variety of sizes, but otherwise they are similar to a traditional computer: you
can run any software you like on them, but you are responsible for administering them yourself.
In contrast, the key idea of cloud-native services is not only to use the computing resources
managed by your operating system, but also to build upon lower-level cloud services to create
higher-level services. For example:
* *Object storage* services such as Amazon S3, Azure Blob Storage, and Cloudflare R2 store large
files. They provide more limited APIs than a typical filesystem (basic file reads and writes), but
they have the advantage that they hide the underlying physical machines: the service automatically
distributes the data across many machines, so that you don’t have to worry about running out of
disk space on any one machine. Even if some machines or their disks fail entirely, no data is
lost.
* Many other services are in turn built upon object storage and other cloud services: for example,
Snowflake is a cloud-based analytic database (data warehouse) that relies on S3 for data storage [^27],
and some other services in turn build upon Snowflake.
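To make the object storage idea concrete, here is a toy in-memory sketch of a store that offers only whole-object `put` and `get`, and that keeps several copies of each object on different nodes. The class, its placement scheme, and the object name are assumptions for illustration; they do not describe how S3 or any other real service is implemented.

```python
import hashlib


class ToyObjectStore:
    """In-memory stand-in for an object store: whole-object put/get only."""

    def __init__(self, num_nodes=5, replicas=3):
        # Each "node" is a dict standing in for one machine's disks.
        self.nodes = [dict() for _ in range(num_nodes)]
        self.replicas = replicas

    def _placement(self, key):
        # Pick `replicas` consecutive nodes, starting from a hash of the key.
        start = int(hashlib.sha256(key.encode()).hexdigest(), 16) % len(self.nodes)
        return [(start + i) % len(self.nodes) for i in range(self.replicas)]

    def put(self, key, data):
        for n in self._placement(key):
            if self.nodes[n] is not None:
                self.nodes[n][key] = data

    def get(self, key):
        # Any surviving replica can serve the read.
        for n in self._placement(key):
            node = self.nodes[n]
            if node is not None and key in node:
                return node[key]
        raise KeyError(key)

    def fail_node(self, n):
        # Simulate losing a machine and everything stored on it.
        self.nodes[n] = None


store = ToyObjectStore()
store.put("logs/2024-01-01.parquet", b"example bytes")
store.fail_node(store._placement("logs/2024-01-01.parquet")[0])
data = store.get("logs/2024-01-01.parquet")   # still readable from another replica
```

The caller never chooses a machine or a disk; that decision is hidden behind the key-based API, which is the property the higher-level services build upon.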
As always with abstractions in computing, there is no one right answer to what you should use. As a
general rule, higher-level abstractions tend to be more oriented towards particular use cases. If
your needs match the situations for which a higher-level system is designed, using the existing
higher-level system will probably provide what you need with much less hassle than building it
yourself from lower-level systems. On the other hand, if there is no high-level system that meets
your needs, then building it yourself from lower-level components is the only option.
#### Separation of storage and compute {#sec_introduction_storage_compute}
In traditional computing, disk storage is regarded as durable (we assume that once something is
written to disk, it will not be lost). To tolerate the failure of an individual hard disk, RAID
(Redundant Array of Independent Disks) is often used to maintain copies of the data on several
disks attached to the same machine. RAID can be performed either in hardware or in software by the
operating system, and it is transparent to the applications accessing the filesystem.
In the cloud, compute instances (virtual machines) may also have local disks attached, but
cloud-native systems typically treat these disks more like an ephemeral cache, and less like
long-term storage. This is because the local disk becomes inaccessible if the associated instance
fails, or if the instance is replaced with a bigger or a smaller one (on a different physical machine) in order to adapt to changes in load.
As an alternative to local disks, cloud services also offer virtual disk storage that can be
detached from one instance and attached to a different one (Amazon EBS, Azure managed disks, and
persistent disks in Google Cloud). Such a virtual disk is not actually a physical disk, but rather a
cloud service provided by a separate set of machines, which emulates the behavior of a disk (a
*block device*, where each block is typically 4 KiB in size). This technology makes it
possible to run traditional disk-based software in the cloud, but the block device emulation
introduces overheads that can be avoided in systems that are designed from the ground up for the cloud [^25]. It also makes the application
very sensitive to network glitches, since every I/O on the virtual block device is actually a network call [^28].
To address this problem, cloud-native services generally avoid using virtual disks, and instead
build on dedicated storage services that are optimized for particular workloads. Object storage
services such as S3 are designed for long-term storage of fairly large files, ranging from hundreds
of kilobytes to several gigabytes in size. The individual rows or values stored in a database are
typically much smaller than this; cloud databases therefore typically manage smaller values in a
separate service, and store larger data blocks (containing many individual values) in an object
store [^26] [^29]. We will see ways of doing this in [Chapter 4](/en/ch4#ch_storage).
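The following sketch illustrates only the second half of that idea: packing many small values into one larger block that is stored as a single object, with a separate index recording where each value lives. The two dictionaries stand in for an object store and a metadata service, and all names are hypothetical; real systems use more elaborate block formats.

```python
def pack_block(rows):
    """Concatenate small values into one block.

    Returns the block bytes and a map from each key to the
    (start, length) of its value within the block.
    """
    block = bytearray()
    offsets = {}
    for key, value in rows.items():
        offsets[key] = (len(block), len(value))
        block += value
    return bytes(block), offsets


object_store = {}   # stand-in for an object store: object name -> bytes
index = {}          # stand-in for a metadata service: key -> (object, start, length)


def write_batch(block_name, rows):
    block, offsets = pack_block(rows)
    object_store[block_name] = block   # one large object instead of many tiny ones
    for key, (start, length) in offsets.items():
        index[key] = (block_name, start, length)


def read_value(key):
    block_name, start, length = index[key]
    return object_store[block_name][start:start + length]


write_batch("block-0001", {"user:1": b"Alice", "user:2": b"Bob"})
value = read_value("user:2")
```

Two values were written, but the object store holds only one object, which matches the kind of workload it is designed for.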
In a traditional systems architecture, the same computer is responsible for both storage (disk) and
computation (CPU and RAM), but in cloud-native systems, these two responsibilities have become
somewhat separated or *disaggregated* [^9] [^27] [^30] [^31]:
for example, S3 only stores files, and if you want to analyze that data, you will have to run the
analysis code somewhere outside of S3. This implies transferring the data over the network, which we
will discuss further in [“Distributed versus Single-Node Systems”](/en/ch1#sec_introduction_distributed).
Moreover, cloud-native systems are often *multitenant*, which means that rather than having a
separate machine for each customer, data and computation from several different customers are
handled on the same shared hardware by the same service [^32].
Multitenancy can enable better hardware utilization, easier scalability, and easier management by
the cloud provider, but it also requires careful engineering to ensure that one customer’s activity
does not affect the performance or security of the system for other customers [^33].
### Operations in the Cloud Era {#sec_introduction_operations}
Traditionally, the people managing an organization’s server-side data infrastructure were known as
*database administrators* (DBAs) or *system administrators* (sysadmins). More recently, many
organizations have tried to integrate the roles of software development and operations into teams
with a shared responsibility for both backend services and data infrastructure; the *DevOps*
philosophy has guided this trend. *Site Reliability Engineers* (SREs) are Google’s implementation of
this idea [^34].
The role of operations is to ensure services are reliably delivered to users (including configuring
infrastructure and deploying applications), and to ensure a stable production environment (including
monitoring and diagnosing any problems that may affect reliability). For self-hosted systems,
operations traditionally involves a significant amount of work at the level of individual machines,
such as capacity planning (e.g., monitoring available disk space and adding more disks before you
run out of space), provisioning new machines, moving services from one machine to another, and
installing operating system patches.
Many cloud services present an API that hides the individual machines that actually implement the
service. For example, cloud storage replaces fixed-size disks with *metered billing*, where you can
store data without planning your capacity needs in advance, and you are then charged based on the
space actually used. Moreover, many cloud services remain highly available, even when individual
machines have failed (see [“Reliability and Fault Tolerance”](/en/ch2#sec_introduction_reliability)).
This shift in emphasis from individual machines to services has been accompanied by a change in the
role of operations. The high-level goal of providing a reliable service remains the same, but the
processes and tools have evolved. The DevOps/SRE philosophy places greater emphasis on:
* automation—preferring repeatable processes over manual one-off jobs,
* preferring ephemeral virtual machines and services over long-running servers,
* enabling frequent application updates,
* learning from incidents, and
* preserving the organization’s knowledge about the system, even as individual people come and go [^35].
With the rise of cloud services, there has been a bifurcation of roles: operations teams at
infrastructure companies specialize in the details of providing a reliable service to a large number
of customers, while the customers of the service spend as little time and effort as possible on infrastructure [^36].
Customers of cloud services still require operations, but they focus on different aspects, such as
choosing the most appropriate service for a given task, integrating different services with each
other, and migrating from one service to another. Even though metered billing removes the need for
capacity planning in the traditional sense, it’s still important to know what resources you are
using for which purpose, so that you don’t waste money on cloud resources that are not needed:
capacity planning becomes financial planning, and performance optimization becomes cost optimization [^37].
Moreover, cloud services do have resource limits or *quotas* (such as the maximum number of
processes you can run concurrently), which you need to know about and plan for before you run into them [^38].
Adopting a cloud service can be easier and quicker than running your own infrastructure, although
even here there is a cost in learning how to use it, and perhaps working around its limitations.
Integration between different services becomes a particular challenge as a growing number of vendors
offer an ever broader range of cloud services targeting different use cases [^39] [^40].
ETL (see [“Data Warehousing”](/en/ch1#sec_introduction_dwh)) is only part of the story; operational cloud services also need
to be integrated with each other. At present, there is a lack of standards that would facilitate
this sort of integration, so it often involves significant manual effort.
Other operational aspects that cannot fully be outsourced to cloud services include maintaining the
security of an application and the libraries it uses, managing the interactions between your own
services, monitoring the load on your services, and tracking down the cause of problems such as
performance degradations or outages. While the cloud is changing the role of operations, the need
for operations is as great as ever.
## Distributed versus Single-Node Systems {#sec_introduction_distributed}
A system that involves several machines communicating via a network is called a *distributed
system*. Each of the processes participating in a distributed system is called a *node*. There are
various reasons why you might want a system to be distributed:
Inherently distributed systems
: If an application involves two or more interacting users, each using their own device, then the
system is unavoidably distributed: the communication between the devices will have to go via a
network.
Requests between cloud services
: If data is stored in one service but processed in another, it must be transferred over the network
from one service to the other.
Fault tolerance/high availability
: If your application needs to continue working even if one machine (or several machines, or
the network, or an entire datacenter) goes down, you can use multiple machines to give you
redundancy. When one fails, another one can take over. See [“Reliability and Fault Tolerance”](/en/ch2#sec_introduction_reliability) and
[Chapter 6](/en/ch6#ch_replication) on replication.
Scalability
: If your data volume or computing requirements grow bigger than a single machine can handle,
you can potentially spread the load across multiple machines. See
[“Scalability”](/en/ch2#sec_introduction_scalability).
Latency
: If you have users around the world, you might want to have servers in various regions
worldwide so that each user can be served from a server that is geographically close to
them. That avoids the users having to wait for network packets to travel halfway around the
world to answer their requests. See [“Describing Performance”](/en/ch2#sec_introduction_percentiles).
Elasticity
: If your application is busy at some times and idle at other times, a cloud deployment can scale up
or down to meet the demand, so that you pay only for resources you are actively using. This is more
difficult on a single machine, which needs to be provisioned to handle the maximum load, even at
times when it is barely used.
Using specialized hardware
: Different parts of the system can take advantage of different types of hardware to match their
workload. For example, an object store may use machines with many disks but few CPUs, whereas a
data analysis system may use machines with lots of CPU and memory but no disks, and a machine
learning system may use machines with GPUs (which are much more efficient than CPUs for training
deep neural networks and other machine learning tasks).
Legal compliance
: Some countries have data residency laws that require data about people in their jurisdiction to be
stored and processed geographically within that country [^41].
The scope of these rules varies: for example, in some cases they apply only to medical or financial
data, while other cases are broader. A service with users in several such jurisdictions will
therefore have to distribute their data across servers in several locations.
Sustainability
: If you have flexibility on where and when to run your jobs, you might be able to run them in a
time and place where plenty of renewable electricity is available, and avoid running them when the
power grid is under strain. This can reduce your carbon emissions and allow you to take advantage
of cheap power when it is available [^42] [^43].
These reasons apply both to services that you write yourself (application code) and services
consisting of off-the-shelf software (such as databases).
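To put a rough number on the latency reason listed above, we can compute a lower bound on the round-trip time over optical fiber. The calculation assumes that light travels through fiber at about two-thirds of its speed in a vacuum, and it ignores routing detours, queueing, and processing time, so real round trips are slower.

```python
EARTH_CIRCUMFERENCE_KM = 40_075      # around the equator
SPEED_IN_FIBER_KM_PER_S = 200_000    # roughly two-thirds of the speed of light in a vacuum


def min_round_trip_ms(distance_km):
    """Lower bound on round-trip time over fiber, ignoring routing and processing."""
    return 2 * distance_km / SPEED_IN_FIBER_KM_PER_S * 1000


halfway_around_world = min_round_trip_ms(EARTH_CIRCUMFERENCE_KM / 2)   # about 200 ms
nearby_region = min_round_trip_ms(500)                                  # about 5 ms
```

A request that needs several sequential round trips multiplies this difference, which is why serving users from a nearby region can matter.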
### Problems with Distributed Systems {#sec_introduction_dist_sys_problems}
Distributed systems also have downsides. Every request and API call that goes via the network needs
to deal with the possibility of failure: the network may be interrupted, or the service may be
overloaded or crashed, and therefore any request may time out without receiving a response. In this
case, we don’t know whether the service received the request, and simply retrying it might not be
safe. We will discuss these problems in detail in [Chapter 9](/en/ch9#ch_distributed).
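The sketch below shows why a blind retry can be unsafe, together with one common mitigation: the client attaches the same request ID to every attempt, and the server remembers which IDs it has already processed. The `PaymentService` class and its behavior are hypothetical, and this is only one of several possible approaches.

```python
import uuid


class PaymentService:
    """Toy server that remembers request IDs so that retries are harmless."""

    def __init__(self):
        self.balance = 100
        self.seen = {}                    # request_id -> result of the first execution
        self.drop_next_response = False   # lets us simulate a lost response

    def charge(self, request_id, amount):
        if request_id not in self.seen:
            self.balance -= amount
            self.seen[request_id] = self.balance
        if self.drop_next_response:
            # The request was processed, but the response is lost on the way back.
            self.drop_next_response = False
            raise TimeoutError("no response received")
        return self.seen[request_id]


def charge_with_retry(service, amount, attempts=3):
    request_id = str(uuid.uuid4())   # the same ID is reused on every retry
    for _ in range(attempts):
        try:
            return service.charge(request_id, amount)
        except TimeoutError:
            continue                 # we cannot tell whether the charge happened
    raise TimeoutError("gave up after retries")


service = PaymentService()
service.drop_next_response = True
result = charge_with_retry(service, 30)   # charged once, despite two attempts
```

Without the `seen` check, the second attempt would have deducted the amount again, even though the client only ever intended one charge.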
Although datacenter networks are fast, making a call to another service is still vastly slower than
calling a function in the same process [^44].
When operating on large volumes of data, rather than transferring the data from storage to a
separate machine that processes it, it can be faster to bring the computation to the machine that
already has the data [^45].
More nodes are not always faster: in some cases, a simple single-threaded program on one computer
can perform significantly better than a cluster with over 100 CPU cores [^46].
Troubleshooting a distributed system is often difficult: if the system is slow to respond, how do
you figure out where the problem lies? Techniques for diagnosing problems in distributed systems are
developed under the heading of *observability* [^47] [^48],
which involves collecting data about the execution of a system, and allowing it to be queried in
ways that allow both high-level metrics and individual events to be analyzed. *Tracing* tools such
as OpenTelemetry, Zipkin, and Jaeger allow you to track which client called which server for which
operation, and how long each call took [^49].
Databases provide various mechanisms for ensuring data consistency, as we shall see in
[Chapter 6](/en/ch6#ch_replication) and [Chapter 8](/en/ch8#ch_transactions). However, when each service has its own database,
maintaining consistency of data across those different services becomes the application’s problem.
Distributed transactions, which we explore in [Chapter 8](/en/ch8#ch_transactions), are a possible technique for
ensuring consistency, but they are rarely used in a microservices context because they run counter
to the goal of making services independent from each other, and many databases don’t support them [^50].
For all these reasons, if you can do something on a single machine, this is often much simpler and
cheaper compared to setting up a distributed system [^23] [^46] [^51].
CPUs, memory, and disks have grown larger, faster, and more reliable. When combined with single-node
databases such as DuckDB, SQLite, and KùzuDB, many workloads can now run on a single node. We will
explore more on this topic in [Chapter 4](/en/ch4#ch_storage).
### Microservices and Serverless {#sec_introduction_microservices}
The most common way of distributing a system across multiple machines is to divide them into clients
and servers, and let the clients make requests to the servers. Most commonly HTTP is used for this
communication, as we will discuss in [“Dataflow Through Services: REST and RPC”](/en/ch5#sec_encoding_dataflow_rpc). The same process may be both a
server (handling incoming requests) and a client (making outbound requests to other services).
This way of building applications has traditionally been called a *service-oriented architecture*
(SOA); more recently the idea has been refined into a *microservices* architecture [^52] [^53].
In this architecture, a service has one well-defined purpose (for example, in the case of S3, this
would be file storage); each service exposes an API that can be called by clients via the network,
and each service has one team that is responsible for its maintenance. A complex application can
thus be decomposed into multiple interacting services, each managed by a separate team.
There are several advantages to breaking down a complex piece of software into multiple services:
each service can be updated independently, reducing coordination effort among teams; each service
can be assigned the hardware resources it needs; and by hiding the implementation details behind an
API, the service owners are free to change the implementation without affecting clients. In terms of
data storage, it is common for each service to have its own databases, and not to share databases
between services: sharing a database would effectively make the entire database structure a part of
the service’s API, and then that structure would be difficult to change. Shared databases could also
cause one service’s queries to negatively impact the performance of other services.
On the other hand, having many services can itself breed complexity: each service requires
infrastructure for deploying new releases, adjusting the allocated hardware resources to match the
load, collecting logs, monitoring service health, and alerting an on-call engineer in the case of a
problem. *Orchestration* frameworks such as Kubernetes have become a popular way of deploying
services, since they provide a foundation for this infrastructure. Testing a service during
development can be complicated, since you also need to run all the other services that it depends on.
Microservice APIs can be challenging to evolve. Clients that call an API expect the API to have
certain fields. Developers might wish to add fields to an API or remove fields from it as business needs change,
but doing so can cause clients to fail. Worse still, such failures are often not discovered until
late in the development cycle when the updated service API is deployed to a staging or production
environment. API description standards such as OpenAPI and gRPC help manage the relationship between
client and server APIs; we discuss these further in [Chapter 5](/en/ch5#ch_encoding).
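As a minimal sketch of why adding a field tends to be safer than removing one, consider a client that reads a JSON response defensively. The `UserProfile` type, its field names, and the sample payloads are hypothetical and chosen only for illustration; real services would usually generate such code from an API description.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class UserProfile:
    user_id: int
    name: str
    # Hypothetical field added in a later API version; older servers omit it.
    email: Optional[str] = None


def parse_user_profile(payload: str) -> UserProfile:
    """Parse a JSON response, ignoring unknown fields and tolerating
    the absence of optional ones."""
    data = json.loads(payload)
    return UserProfile(
        user_id=data["user_id"],  # required: raises KeyError if the server drops it
        name=data["name"],        # required
        email=data.get("email"),  # optional: None when the server does not send it
    )


# Response from an older server that does not know about "email".
old_response = '{"user_id": 1, "name": "Alice"}'
# Response from a newer server that added "email" and "locale".
new_response = (
    '{"user_id": 1, "name": "Alice", '
    '"email": "alice@example.com", "locale": "en"}'
)
# Response from a server that removed the required "name" field.
broken_response = '{"user_id": 1}'
```

With this client, both `old_response` and `new_response` parse successfully, because unknown fields are ignored and the new field is optional. Parsing `broken_response` raises a `KeyError`, which is the kind of failure that removing a field can cause in clients that still depend on it.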
Microservices are primarily a technical solution to a people problem: allowing different teams to
make progress independently without having to coordinate with each other. This is valuable in a large
company, but in a small company where there are not many teams, using microservices is likely to be
unnecessary overhead, and it is preferable to implement the application in the simplest way possible [^52].
*Serverless*, or *function-as-a-service* (FaaS), is another approach to deploying services, in which
the management of the infrastructure is outsourced to a cloud vendor [^33].
When using virtual machines, you have to explicitly choose when to start up or shut down an
instance; in contrast, with the serverless model, the cloud provider automatically allocates and
frees hardware resources as needed, based on the incoming requests to your service [^54].
Serverless deployment shifts more of the operational burden to cloud providers and enables flexible billing
by usage rather than machine instances. To offer such benefits, many serverless infrastructure providers
impose a time limit on function execution and restrict the available runtime environments, and functions
might suffer from slow start times when first invoked. The term “serverless” can also be misleading: each
serverless function execution still runs on a server, but subsequent executions might run on a
different one. Moreover, infrastructure such as BigQuery and various Kafka offerings have adopted
“serverless” terminology to signal that their services auto-scale and that they bill by usage rather than machine instances.
Just as cloud storage replaced capacity planning (deciding in advance how many disks to buy) with
a metered billing model, the serverless approach is bringing metered billing to code execution: you
only pay for the time that your application code is actually running, rather than having to
provision resources in advance.
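The difference between the two billing models can be illustrated with some simple arithmetic. The following sketch uses invented prices and an invented workload; none of the numbers come from a real provider's price list, and real pricing typically has more dimensions (memory size, request fees, free tiers).

```python
def provisioned_cost(instances: int, hourly_rate: float, hours: float) -> float:
    """Cost of instances that are paid for whether or not they are busy."""
    return instances * hourly_rate * hours


def metered_cost(invocations: int, seconds_per_invocation: float,
                 rate_per_second: float) -> float:
    """Cost when you pay only for the time your code is actually running."""
    return invocations * seconds_per_invocation * rate_per_second


# Hypothetical workload: one million requests per month, 0.2 s of compute each.
HOURS_PER_MONTH = 720
vm_bill = provisioned_cost(instances=2, hourly_rate=0.10, hours=HOURS_PER_MONTH)
faas_bill = metered_cost(invocations=1_000_000, seconds_per_invocation=0.2,
                         rate_per_second=0.00002)

print(f"provisioned: ${vm_bill:.2f} per month")
print(f"metered:     ${faas_bill:.2f} per month")
```

Under these assumptions the metered bill is much smaller, because the two provisioned instances sit idle most of the time. The comparison can reverse for a workload that keeps machines busy around the clock, since the metered rate per second of compute is usually higher than the equivalent share of an instance.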
### Cloud Computing versus Supercomputing {#id17}
Cloud computing is not the only way of building large-scale computing systems; an alternative is
*high-performance computing* (HPC), also known as *supercomputing*. Although there are overlaps, HPC
often has different priorities and uses different techniques compared to cloud computing and
enterprise datacenter systems. Some of those differences are:
* Supercomputers are typically used for computationally intensive scientific computing tasks, such
as weather forecasting, climate modeling, molecular dynamics (simulating the movement of atoms and
molecules), complex optimization problems, and solving partial differential equations. On the
other hand, cloud computing tends to be used for online services, business data systems, and
similar systems that need to serve user requests with high availability.
* A supercomputer typically runs large batch jobs that checkpoint the state of their computation to
disk from time to time. If a node fails, a common solution is to simply stop the entire cluster
workload, repair the faulty node, and then restart the computation from the last checkpoint [^55] [^56].
With cloud services, it is usually not desirable to stop the entire cluster, since the services
need to continually serve users with minimal interruptions.
* Supercomputer nodes typically communicate through shared memory and remote direct memory access
(RDMA), which support high bandwidth and low latency, but assume a high level of trust among the users of the system [^57].
In cloud computing, the network and the machines are often shared by mutually untrusting
organizations, requiring stronger security mechanisms such as resource isolation (e.g., virtual
machines), encryption and authentication.
* Cloud datacenter networks are often based on IP and Ethernet, arranged in Clos topologies to
provide high bisection bandwidth—a commonly used measure of a network’s overall performance [^55] [^58].
Supercomputers often use specialized network topologies, such as multi-dimensional meshes and toruses [^59],
which yield better performance for HPC workloads with known communication patterns.
* Cloud computing allows nodes to be distributed across multiple geographic regions, whereas
supercomputers generally assume that all of their nodes are close together.
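The checkpoint-and-restart approach mentioned in the list above can be sketched in a few lines. This is a single-process toy under stated assumptions, not how an HPC scheduler implements it: the job (summing integers), the JSON checkpoint format, and the `fail_at` parameter used to simulate a node failure are all hypothetical.

```python
import json
import os


def save_checkpoint(state: dict, path: str) -> None:
    """Write the checkpoint to a temporary file and rename it into place,
    so that a crash in the middle of writing cannot corrupt the last
    good checkpoint."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def run_job(total_steps: int, checkpoint_path: str,
            checkpoint_every: int = 100, fail_at=None) -> int:
    """Sum the integers 0..total_steps-1, checkpointing periodically.
    If a checkpoint exists, resume from it instead of starting over."""
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)
    else:
        state = {"next_step": 0, "total": 0}

    for step in range(state["next_step"], total_steps):
        if fail_at is not None and step == fail_at:
            # Simulated fault: everything since the last checkpoint is lost.
            raise RuntimeError(f"simulated node failure at step {step}")
        state["total"] += step
        state["next_step"] = step + 1
        if state["next_step"] % checkpoint_every == 0:
            save_checkpoint(state, checkpoint_path)
    return state["total"]
```

If the first run fails at step 250 with `checkpoint_every=100`, the work done in steps 200 to 249 is discarded, and a second call to `run_job` with the same checkpoint path resumes from step 200. The whole job is unavailable between the failure and the restart, which is acceptable for a batch computation but not for a service that must keep answering user requests.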
Large-scale analytics systems sometimes share some characteristics with supercomputing, which is why
it can be worth knowing about these techniques if you are working in this area. However, this book
is mostly concerned with services that need to be continually available, as discussed in [“Reliability and Fault Tolerance”](/en/ch2#sec_introduction_reliability).
## Data Systems, Law, and Society {#sec_introduction_compliance}
So far you’ve seen in this chapter that the architecture of data systems is influenced not only by
technical goals and requirements, but also by the human needs of the organizations that they
support. Increasingly, data systems engineers are realizing that serving the needs of their own
business is not enough: we also have a responsibility towards society at large.
One particular concern is systems that store data about people and their behavior. Since 2018 the
*General Data Protection Regulation* (GDPR) has given residents of many European countries greater
control and legal rights over their personal data, and similar privacy regulation has been adopted
in various other countries and states around the world, including for example the California
Consumer Privacy Act (CCPA). Regulations around AI, such as the *EU AI Act*, place further
restrictions on how personal data can be used.
Moreover, even in areas that are not directly subject to regulation, there is increasing recognition
of the effects that computer systems have on people and society. Social media has changed how
individuals consume news, which influences their political opinions and hence may affect the outcome
of elections. Automated systems increasingly make decisions that have profound consequences for
individuals, such as deciding who should be given a loan or insurance coverage, who should be
invited to a job interview, or who should be suspected of a crime [^60].
Everyone who works on such systems shares a responsibility for considering the ethical impact and
ensuring that they comply with relevant law. It is not necessary for everybody to become an expert
in law and ethics, but a basic awareness of legal and ethical principles is just as important as,
say, some foundational knowledge in distributed systems.
Legal considerations are influencing the very foundations of how data systems are being designed [^61].
For example, the GDPR grants individuals the right to have their data erased on request (sometimes
known as the *right to be forgotten*). However, as we shall see in this book, many data systems rely
on immutable constructs such as append-only logs as part of their design; how can we ensure deletion
of some data in the middle of a file that is supposed to be immutable? How do we handle deletion of
data that has been incorporated into derived datasets (see [“Systems of Record and Derived Data”](/en/ch1#sec_introduction_derived)), such as
training data for machine learning models? Answering these questions creates new engineering
challenges.
At present we don’t have clear guidelines on which particular technologies or system architectures
should be considered “GDPR-compliant” or not. The regulation deliberately does not mandate
particular technologies, because these may quickly change as technology progresses. Instead, the
legal texts set out high-level principles that are subject to interpretation. This means that there
are no simple answers to the question of how to comply with privacy regulation, but we will look at
some of the technologies in this book through this lens.
In general, we store data because we think that its value is greater than the costs of storing it.
However, it is worth remembering that the costs of storage are not just the bill you pay for Amazon
S3 or another service: the cost-benefit calculation should also take into account the risks of
liability and reputational damage if the data were to be leaked or compromised by adversaries, and
the risk of legal costs and fines if the storage and processing of the data are found not to be
compliant with the law [^51].
Governments or police forces might also compel companies to hand over data. When there is a risk
that the data may reveal criminalized behaviors (for example, homosexuality in several Middle
Eastern and African countries, or seeking an abortion in several US states), storing that data
creates real safety risks for users. Travel to an abortion clinic, for example, could easily be
revealed by location data, perhaps even by a log of the user’s IP addresses over time (which
indicate approximate location).
Once all the risks are taken into account, it might be reasonable to decide that some data is simply
not worth storing, and that it should therefore be deleted. This principle of *data minimization*
(sometimes known by the German term *Datensparsamkeit*) runs counter to the “big data” philosophy of
storing lots of data speculatively in case it turns out to be useful in the future [^62].
But it fits with the GDPR, which mandates that personal data may only be collected for a specified,
explicit purpose, that this data may not later be used for any other purpose, and that the data must
not be kept for longer than necessary for the purposes for which it was collected [^63].
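One way such a retention rule might be expressed in code is sketched below. It is an illustration under stated assumptions rather than a compliance recipe: the purposes, the retention periods, and the `Record` type are hypothetical, and deciding what counts as "necessary" for a given purpose is a legal question that code cannot answer.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Hypothetical retention periods, keyed by the purpose for which
# the data was collected.
RETENTION = {
    "order_fulfilment": timedelta(days=365),
    "fraud_detection": timedelta(days=90),
}


@dataclass
class Record:
    user_id: int
    purpose: str
    collected_at: datetime


def is_expired(record: Record, now: datetime) -> bool:
    """A record is expired if it has outlived the retention period of its
    purpose. Records with no declared purpose are treated as expired,
    since there is no stated reason to keep them."""
    limit = RETENTION.get(record.purpose)
    return limit is None or now - record.collected_at > limit


def purge(records: list, now: datetime) -> list:
    """Return only the records that may still be kept."""
    return [r for r in records if not is_expired(r, now)]


now = datetime(2025, 1, 1, tzinfo=timezone.utc)
collected = datetime(2024, 6, 1, tzinfo=timezone.utc)
records = [
    Record(1, "order_fulfilment", collected),  # within 365 days: kept
    Record(2, "fraud_detection", collected),   # older than 90 days: purged
    Record(3, "marketing", collected),         # no declared purpose: purged
]
kept = purge(records, now)
```

Attaching a purpose to each record at collection time is what makes this check possible; data stored speculatively without a purpose has no retention period to evaluate against.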
Businesses have also taken notice of privacy and safety concerns. Credit card companies require
payment processing businesses to adhere to strict payment card industry (PCI) standards. Processors
undergo frequent evaluations from independent auditors to verify continued compliance. Software
vendors have also seen increased scrutiny. Many buyers now require their vendors to comply with
Service Organization Control (SOC) Type 2 standards. As with PCI compliance, vendors undergo
third-party audits to verify adherence.
Generally, it is important to balance the needs of your business against the needs of the people
whose data you are collecting and processing. There is much more to this topic; in [Chapter 14](/en/ch14#ch_right_thing) we
will go deeper into the topics of ethics and legal compliance, including the problems of bias and
discrimination.
## Summary {#summary}
The theme of this chapter has been to understand trade-offs: that is, to recognize that for many
questions there is not one right answer, but several different approaches that each have various
pros and cons. We explored some of the most important choices that affect the architecture of data
systems, and introduced terminology that will be needed throughout the rest of this book.
We started by making a distinction between operational (transaction-processing, OLTP) and analytical
(OLAP) systems, and saw their different characteristics: not only managing different types of data
with different access patterns, but also serving different audiences. We encountered the concept of
a data warehouse and data lake, which receive data feeds from operational systems via ETL. In
[Chapter 4](/en/ch4#ch_storage) we will see that operational and analytical systems often use very different internal
data layouts because of the different types of queries they need to serve.
We then compared cloud services, a comparatively recent development, to the traditional paradigm of
self-hosted software that has previously dominated data systems architecture. Which of these
approaches is more cost-effective depends a lot on your particular situation, but it’s undeniable
that cloud-native approaches are bringing big changes to the way data systems are architected, for
example in the way they separate storage and compute.
Cloud systems are intrinsically distributed, and we briefly examined some of the trade-offs of
distributed systems compared to using a single machine. There are situations in which you can’t
avoid going distributed, but it’s advisable not to rush into making a system distributed if it’s
possible to keep it on a single machine. In [Chapter 9](/en/ch9#ch_distributed) we will cover the challenges with
distributed systems in more detail.
Finally, we saw that data systems architecture is determined not only by the needs of the business
deploying the system, but also by privacy regulation that protects the rights of the people whose
data is being processed—an aspect that many engineers are prone to ignoring. How we translate legal
requirements into technical implementations is not yet well understood, but it’s important to keep
this question in mind as we move through the rest of this book.
### References
[^1]: Richard T. Kouzes, Gordon A. Anderson, Stephen T. Elbert, Ian Gorton, and Deborah K. Gracio. [The Changing Paradigm of Data-Intensive Computing](http://www2.ic.uff.br/~boeres/slides_AP/papers/TheChanginParadigmDataIntensiveComputing_2009.pdf). *IEEE Computer*, volume 42, issue 1, January 2009. [doi:10.1109/MC.2009.26](https://doi.org/10.1109/MC.2009.26)
[^2]: Martin Kleppmann, Adam Wiggins, Peter van Hardenberg, and Mark McGranaghan. [Local-first software: you own your data, in spite of the cloud](https://www.inkandswitch.com/local-first/). At *2019 ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software* (Onward!), October 2019. [doi:10.1145/3359591.3359737](https://doi.org/10.1145/3359591.3359737)
[^3]: Joe Reis and Matt Housley. [*Fundamentals of Data Engineering*](https://www.oreilly.com/library/view/fundamentals-of-data/9781098108298/). O’Reilly Media, 2022. ISBN: 9781098108304
[^4]: Rui Pedro Machado and Helder Russa. [*Analytics Engineering with SQL and dbt*](https://www.oreilly.com/library/view/analytics-engineering-with/9781098142377/). O’Reilly Media, 2023. ISBN: 9781098142384
[^5]: Edgar F. Codd, S. B. Codd, and C. T. Salley. [Providing OLAP to User-Analysts: An IT Mandate](https://www.estgv.ipv.pt/PaginasPessoais/jloureiro/ESI_AID2007_2008/fichas/codd.pdf). E. F. Codd Associates, 1993. Archived at [perma.cc/RKX8-2GEE](https://perma.cc/RKX8-2GEE)
[^6]: Chinmay Soman and Neha Pawar. [Comparing Three Real-Time OLAP Databases: Apache Pinot, Apache Druid, and ClickHouse](https://startree.ai/blog/a-tale-of-three-real-time-olap-databases). *startree.ai*, April 2023. Archived at [perma.cc/8BZP-VWPA](https://perma.cc/8BZP-VWPA)
[^7]: Surajit Chaudhuri and Umeshwar Dayal. [An Overview of Data Warehousing and OLAP Technology](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/sigrecord.pdf). *ACM SIGMOD Record*, volume 26, issue 1, pages 65–74, March 1997. [doi:10.1145/248603.248616](https://doi.org/10.1145/248603.248616)
[^8]: Fatma Özcan, Yuanyuan Tian, and Pinar Tözün. [Hybrid Transactional/Analytical Processing: A Survey](https://humming80.github.io/papers/sigmod-htaptut.pdf). At *ACM International Conference on Management of Data* (SIGMOD), May 2017. [doi:10.1145/3035918.3054784](https://doi.org/10.1145/3035918.3054784)
[^9]: Adam Prout, Szu-Po Wang, Joseph Victor, Zhou Sun, Yongzhu Li, Jack Chen, Evan Bergeron, Eric Hanson, Robert Walzer, Rodrigo Gomes, and Nikita Shamgunov. [Cloud-Native Transactions and Analytics in SingleStore](https://dl.acm.org/doi/abs/10.1145/3514221.3526055). At *International Conference on Management of Data* (SIGMOD), June 2022. [doi:10.1145/3514221.3526055](https://doi.org/10.1145/3514221.3526055)
[^10]: Chao Zhang, Guoliang Li, Jintao Zhang, Xinning Zhang, and Jianhua Feng. [HTAP Databases: A Survey](https://arxiv.org/pdf/2404.15670). *IEEE Transactions on Knowledge and Data Engineering*, April 2024. [doi:10.1109/TKDE.2024.3389693](https://doi.org/10.1109/TKDE.2024.3389693)
[^11]: Michael Stonebraker and Uğur Çetintemel. [‘One Size Fits All’: An Idea Whose Time Has Come and Gone](https://pages.cs.wisc.edu/~shivaram/cs744-readings/fits_all.pdf). At *21st International Conference on Data Engineering* (ICDE), April 2005. [doi:10.1109/ICDE.2005.1](https://doi.org/10.1109/ICDE.2005.1)
[^12]: Jeffrey Cohen, Brian Dolan, Mark Dunlap, Joseph M. Hellerstein, and Caleb Welton. [MAD Skills: New Analysis Practices for Big Data](https://www.vldb.org/pvldb/vol2/vldb09-219.pdf). *Proceedings of the VLDB Endowment*, volume 2, issue 2, pages 1481–1492, August 2009. [doi:10.14778/1687553.1687576](https://doi.org/10.14778/1687553.1687576)
[^13]: Dan Olteanu. [The Relational Data Borg is Learning](https://www.vldb.org/pvldb/vol13/p3502-olteanu.pdf). *Proceedings of the VLDB Endowment*, volume 13, issue 12, August 2020. [doi:10.14778/3415478.3415572](https://doi.org/10.14778/3415478.3415572)
[^14]: Matt Bornstein, Martin Casado, and Jennifer Li. [Emerging Architectures for Modern Data Infrastructure: 2020](https://future.a16z.com/emerging-architectures-for-modern-data-infrastructure-2020/). *future.a16z.com*, October 2020. Archived at [perma.cc/LF8W-KDCC](https://perma.cc/LF8W-KDCC)
[^15]: Martin Fowler. [DataLake](https://www.martinfowler.com/bliki/DataLake.html). *martinfowler.com*, February 2015. Archived at [perma.cc/4WKN-CZUK](https://perma.cc/4WKN-CZUK)
[^16]: Bobby Johnson and Joseph Adler. [The Sushi Principle: Raw Data Is Better](https://learning.oreilly.com/videos/strata-hadoop/9781491924143/9781491924143-video210840/). At *Strata+Hadoop World*, February 2015.
[^17]: Michael Armbrust, Ali Ghodsi, Reynold Xin, and Matei Zaharia. [Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics](https://www.cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf). At *11th Annual Conference on Innovative Data Systems Research* (CIDR), January 2021.
[^18]: DataKitchen, Inc. [The DataOps Manifesto](https://dataopsmanifesto.org/en/). *dataopsmanifesto.org*, 2017. Archived at [perma.cc/3F5N-FUQ4](https://perma.cc/3F5N-FUQ4)
[^19]: Tejas Manohar. [What is Reverse ETL: A Definition & Why It’s Taking Off](https://hightouch.io/blog/reverse-etl/). *hightouch.io*, November 2021. Archived at [perma.cc/A7TN-GLYJ](https://perma.cc/A7TN-GLYJ)
[^20]: Simon O’Regan. [Designing Data Products](https://towardsdatascience.com/designing-data-products-b6b93edf3d23). *towardsdatascience.com*, August 2018. Archived at [perma.cc/HU67-3RV8](https://perma.cc/HU67-3RV8)
[^21]: Camille Fournier. [Why is it so hard to decide to buy?](https://skamille.medium.com/why-is-it-so-hard-to-decide-to-buy-d86fee98e88e) *skamille.medium.com*, July 2021. Archived at [perma.cc/6VSG-HQ5X](https://perma.cc/6VSG-HQ5X)
[^22]: David Heinemeier Hansson. [Why we’re leaving the cloud](https://world.hey.com/dhh/why-we-re-leaving-the-cloud-654b47e0). *world.hey.com*, October 2022. Archived at [perma.cc/82E6-UJ65](https://perma.cc/82E6-UJ65)
[^23]: Nima Badizadegan. [Use One Big Server](https://specbranch.com/posts/one-big-server/). *specbranch.com*, August 2022. Archived at [perma.cc/M8NB-95UK](https://perma.cc/M8NB-95UK)
[^24]: Steve Yegge. [Dear Google Cloud: Your Deprecation Policy is Killing You](https://steve-yegge.medium.com/dear-google-cloud-your-deprecation-policy-is-killing-you-ee7525dc05dc). *steve-yegge.medium.com*, August 2020. Archived at [perma.cc/KQP9-SPGU](https://perma.cc/KQP9-SPGU)
[^25]: Alexandre Verbitski, Anurag Gupta, Debanjan Saha, Murali Brahmadesam, Kamal Gupta, Raman Mittal, Sailesh Krishnamurthy, Sandor Maurice, Tengiz Kharatishvili, and Xiaofeng Bao. [Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases](https://media.amazonwebservices.com/blog/2017/aurora-design-considerations-paper.pdf). At *ACM International Conference on Management of Data* (SIGMOD), pages 1041–1052, May 2017. [doi:10.1145/3035918.3056101](https://doi.org/10.1145/3035918.3056101)
[^26]: Panagiotis Antonopoulos, Alex Budovski, Cristian Diaconu, Alejandro Hernandez Saenz, Jack Hu, Hanuma Kodavalla, Donald Kossmann, Sandeep Lingam, Umar Farooq Minhas, Naveen Prakash, Vijendra Purohit, Hugh Qu, Chaitanya Sreenivas Ravella, Krystyna Reisteter, Sheetal Shrotri, Dixin Tang, and Vikram Wakade. [Socrates: The New SQL Server in the Cloud](https://www.microsoft.com/en-us/research/uploads/prod/2019/05/socrates.pdf). At *ACM International Conference on Management of Data* (SIGMOD), pages 1743–1756, June 2019. [doi:10.1145/3299869.3314047](https://doi.org/10.1145/3299869.3314047)
[^27]: Midhul Vuppalapati, Justin Miron, Rachit Agarwal, Dan Truong, Ashish Motivala, and Thierry Cruanes. [Building An Elastic Query Engine on Disaggregated Storage](https://www.usenix.org/system/files/nsdi20-paper-vuppalapati.pdf). At *17th USENIX Symposium on Networked Systems Design and Implementation* (NSDI), February 2020.
[^28]: Nick Van Wiggeren. [The Real Failure Rate of EBS](https://planetscale.com/blog/the-real-fail-rate-of-ebs). *planetscale.com*, March 2025. Archived at [perma.cc/43CR-SAH5](https://perma.cc/43CR-SAH5)
[^29]: Colin Breck. [Predicting the Future of Distributed Systems](https://blog.colinbreck.com/predicting-the-future-of-distributed-systems/). *blog.colinbreck.com*, August 2024. Archived at [perma.cc/K5FC-4XX2](https://perma.cc/K5FC-4XX2)
[^30]: Gwen Shapira. [Compute-Storage Separation Explained](https://www.thenile.dev/blog/storage-compute). *thenile.dev*, January 2023. Archived at [perma.cc/QCV3-XJNZ](https://perma.cc/QCV3-XJNZ)
[^31]: Ravi Murthy and Gurmeet Goindi. [AlloyDB for PostgreSQL under the hood: Intelligent, database-aware storage](https://cloud.google.com/blog/products/databases/alloydb-for-postgresql-intelligent-scalable-storage). *cloud.google.com*, May 2022. Archived at [archive.org](https://web.archive.org/web/20220514021120/https%3A//cloud.google.com/blog/products/databases/alloydb-for-postgresql-intelligent-scalable-storage)
[^32]: Jack Vanlightly. [The Architecture of Serverless Data Systems](https://jack-vanlightly.com/blog/2023/11/14/the-architecture-of-serverless-data-systems). *jack-vanlightly.com*, November 2023. Archived at [perma.cc/UDV4-TNJ5](https://perma.cc/UDV4-TNJ5)
[^33]: Eric Jonas, Johann Schleier-Smith, Vikram Sreekanti, Chia-Che Tsai, Anurag Khandelwal, Qifan Pu, Vaishaal Shankar, Joao Carreira, Karl Krauth, Neeraja Yadwadkar, Joseph E. Gonzalez, Raluca Ada Popa, Ion Stoica, and David A. Patterson. [Cloud Programming Simplified: A Berkeley View on Serverless Computing](https://arxiv.org/abs/1902.03383). *arxiv.org*, February 2019.
[^34]: Betsy Beyer, Jennifer Petoff, Chris Jones, and Niall Richard Murphy. [*Site Reliability Engineering: How Google Runs Production Systems*](https://www.oreilly.com/library/view/site-reliability-engineering/9781491929117/). O’Reilly Media, 2016. ISBN: 9781491929124
[^35]: Thomas Limoncelli. [The Time I Stole $10,000 from Bell Labs](https://queue.acm.org/detail.cfm?id=3434773). *ACM Queue*, volume 18, issue 5, November 2020. [doi:10.1145/3434571.3434773](https://doi.org/10.1145/3434571.3434773)
[^36]: Charity Majors. [The Future of Ops Jobs](https://acloudguru.com/blog/engineering/the-future-of-ops-jobs). *acloudguru.com*, August 2020. Archived at [perma.cc/GRU2-CZG3](https://perma.cc/GRU2-CZG3)
[^37]: Boris Cherkasky. [(Over)Pay As You Go for Your Datastore](https://medium.com/riskified-technology/over-pay-as-you-go-for-your-datastore-11a29ae49a8b). *medium.com*, September 2021. Archived at [perma.cc/Q8TV-2AM2](https://perma.cc/Q8TV-2AM2)
[^38]: Shlomi Kushchi. [Serverless Doesn’t Mean DevOpsLess or NoOps](https://thenewstack.io/serverless-doesnt-mean-devopsless-or-noops/). *thenewstack.io*, February 2023. Archived at [perma.cc/3NJR-AYYU](https://perma.cc/3NJR-AYYU)
[^39]: Erik Bernhardsson. [Storm in the stratosphere: how the cloud will be reshuffled](https://erikbern.com/2021/11/30/storm-in-the-stratosphere-how-the-cloud-will-be-reshuffled.html). *erikbern.com*, November 2021. Archived at [perma.cc/SYB2-99P3](https://perma.cc/SYB2-99P3)
[^40]: Benn Stancil. [The data OS](https://benn.substack.com/p/the-data-os). *benn.substack.com*, September 2021. Archived at [perma.cc/WQ43-FHS6](https://perma.cc/WQ43-FHS6)
[^41]: Maria Korolov. [Data residency laws pushing companies toward residency as a service](https://www.csoonline.com/article/3647761/data-residency-laws-pushing-companies-toward-residency-as-a-service.html). *csoonline.com*, January 2022. Archived at [perma.cc/CHE4-XZZ2](https://perma.cc/CHE4-XZZ2)
[^42]: Severin Borenstein. [Can Data Centers Flex Their Power Demand?](https://energyathaas.wordpress.com/2025/04/14/can-data-centers-flex-their-power-demand/) *energyathaas.wordpress.com*, April 2025. Archived at
[^43]: Bilge Acun, Benjamin Lee, Fiodar Kazhamiaka, Aditya Sundarrajan, Kiwan Maeng, Manoj Chakkaravarthy, David Brooks, and Carole-Jean Wu. [Carbon Dependencies in Datacenter Design and Management](https://hotcarbon.org/assets/2022/pdf/hotcarbon22-acun.pdf). *ACM SIGENERGY Energy Informatics Review*, volume 3, issue 3, pages 21–26. [doi:10.1145/3630614.3630619](https://doi.org/10.1145/3630614.3630619)
[^44]: Kousik Nath. [These are the numbers every computer engineer should know](https://www.freecodecamp.org/news/must-know-numbers-for-every-computer-engineer/). *freecodecamp.org*, September 2019. Archived at [perma.cc/RW73-36RL](https://perma.cc/RW73-36RL)
[^45]: Joseph M. Hellerstein, Jose Faleiro, Joseph E. Gonzalez, Johann Schleier-Smith, Vikram Sreekanti, Alexey Tumanov, and Chenggang Wu. [Serverless Computing: One Step Forward, Two Steps Back](https://arxiv.org/abs/1812.03651). At *Conference on Innovative Data Systems Research* (CIDR), January 2019.
[^46]: Frank McSherry, Michael Isard, and Derek G. Murray. [Scalability! But at What COST?](https://www.usenix.org/system/files/conference/hotos15/hotos15-paper-mcsherry.pdf) At *15th USENIX Workshop on Hot Topics in Operating Systems* (HotOS), May 2015.
[^47]: Cindy Sridharan. *[Distributed Systems Observability: A Guide to Building Robust Systems](https://unlimited.humio.com/rs/756-LMY-106/images/Distributed-Systems-Observability-eBook.pdf)*. Report, O’Reilly Media, May 2018. Archived at [perma.cc/M6JL-XKCM](https://perma.cc/M6JL-XKCM)
[^48]: Charity Majors. [Observability — A 3-Year Retrospective](https://thenewstack.io/observability-a-3-year-retrospective/). *thenewstack.io*, August 2019. Archived at [perma.cc/CG62-TJWL](https://perma.cc/CG62-TJWL)
[^49]: Benjamin H. Sigelman, Luiz André Barroso, Mike Burrows, Pat Stephenson, Manoj Plakal, Donald Beaver, Saul Jaspan, and Chandan Shanbhag. [Dapper, a Large-Scale Distributed Systems Tracing Infrastructure](https://research.google/pubs/pub36356/). Google Technical Report dapper-2010-1, April 2010. Archived at [perma.cc/K7KU-2TMH](https://perma.cc/K7KU-2TMH)
[^50]: Rodrigo Laigner, Yongluan Zhou, Marcos Antonio Vaz Salles, Yijian Liu, and Marcos Kalinowski. [Data management in microservices: State of the practice, challenges, and research directions](https://www.vldb.org/pvldb/vol14/p3348-laigner.pdf). *Proceedings of the VLDB Endowment*, volume 14, issue 13, pages 3348–3361, September 2021. [doi:10.14778/3484224.3484232](https://doi.org/10.14778/3484224.3484232)
[^51]: Jordan Tigani. [Big Data is Dead](https://motherduck.com/blog/big-data-is-dead/). *motherduck.com*, February 2023. Archived at [perma.cc/HT4Q-K77U](https://perma.cc/HT4Q-K77U)
[^52]: Sam Newman. [*Building Microservices*, second edition](https://www.oreilly.com/library/view/building-microservices-2nd/9781492034018/). O’Reilly Media, 2021. ISBN: 9781492034025
[^53]: Chris Richardson. [Microservices: Decomposing Applications for Deployability and Scalability](https://www.infoq.com/articles/microservices-intro/). *infoq.com*, May 2014. Archived at [perma.cc/CKN4-YEQ2](https://perma.cc/CKN4-YEQ2)
[^54]: Mohammad Shahrad, Rodrigo Fonseca, Íñigo Goiri, Gohar Chaudhry, Paul Batum, Jason Cooke, Eduardo Laureano, Colby Tresness, Mark Russinovich, and Ricardo Bianchini. [Serverless in the Wild: Characterizing and Optimizing the Serverless Workload at a Large Cloud Provider](https://www.usenix.org/system/files/atc20-shahrad.pdf). At *USENIX Annual Technical Conference* (ATC), July 2020.
[^55]: Luiz André Barroso, Urs Hölzle, and Parthasarathy Ranganathan. [The Datacenter as a Computer: Designing Warehouse-Scale Machines](https://www.morganclaypool.com/doi/10.2200/S00874ED3V01Y201809CAC046), third edition. Morgan & Claypool Synthesis Lectures on Computer Architecture, October 2018. [doi:10.2200/S00874ED3V01Y201809CAC046](https://doi.org/10.2200/S00874ED3V01Y201809CAC046)
[^56]: David Fiala, Frank Mueller, Christian Engelmann, Rolf Riesen, Kurt Ferreira, and Ron Brightwell. [Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing](https://arcb.csc.ncsu.edu/~mueller/ftp/pub/mueller/papers/sc12.pdf). At *International Conference for High Performance Computing, Networking, Storage and Analysis* (SC), November 2012. [doi:10.1109/SC.2012.49](https://doi.org/10.1109/SC.2012.49)
[^57]: Anna Kornfeld Simpson, Adriana Szekeres, Jacob Nelson, and Irene Zhang. [Securing RDMA for High-Performance Datacenter Storage Systems](https://www.usenix.org/conference/hotcloud20/presentation/kornfeld-simpson). At *12th USENIX Workshop on Hot Topics in Cloud Computing* (HotCloud), July 2020.
[^58]: Arjun Singh, Joon Ong, Amit Agarwal, Glen Anderson, Ashby Armistead, Roy Bannon, Seb Boving, Gaurav Desai, Bob Felderman, Paulie Germano, Anand Kanagala, Jeff Provost, Jason Simmons, Eiichi Tanda, Jim Wanderer, Urs Hölzle, Stephen Stuart, and Amin Vahdat. [Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google’s Datacenter Network](https://conferences.sigcomm.org/sigcomm/2015/pdf/papers/p183.pdf). At *Annual Conference of the ACM Special Interest Group on Data Communication* (SIGCOMM), August 2015. [doi:10.1145/2785956.2787508](https://doi.org/10.1145/2785956.2787508)
[^59]: Glenn K. Lockwood. [Hadoop’s Uncomfortable Fit in HPC](https://blog.glennklockwood.com/2014/05/hadoops-uncomfortable-fit-in-hpc.html). *glennklockwood.blogspot.co.uk*, May 2014. Archived at [perma.cc/S8XX-Y67B](https://perma.cc/S8XX-Y67B)
[^60]: Cathy O’Neil. *Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy*. Crown Publishing, 2016. ISBN: 9780553418811
[^61]: Supreeth Shastri, Vinay Banakar, Melissa Wasserman, Arun Kumar, and Vijay Chidambaram. [Understanding and Benchmarking the Impact of GDPR on Database Systems](https://www.vldb.org/pvldb/vol13/p1064-shastri.pdf). *Proceedings of the VLDB Endowment*, volume 13, issue 7, pages 1064–1077, March 2020. [doi:10.14778/3384345.3384354](https://doi.org/10.14778/3384345.3384354)
[^62]: Martin Fowler. [Datensparsamkeit](https://www.martinfowler.com/bliki/Datensparsamkeit.html). *martinfowler.com*, December 2013. Archived at [perma.cc/R9QX-CME6](https://perma.cc/R9QX-CME6)
[^63]: [Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 (General Data Protection Regulation)](https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=CELEX:32016R0679&from=EN). *Official Journal of the European Union* L 119/1, May 2016.
================================================
FILE: content/en/ch10.md
================================================
---
title: "10. Consistency and Consensus"
weight: 210
breadcrumbs: false
---

> *An ancient adage warns, “Never go to sea with two chronometers; take one or three.”*
>
> Frederick P. Brooks Jr., *The Mythical Man-Month: Essays on Software Engineering* (1995)
Lots of things can go wrong in distributed systems, as discussed in [Chapter 9](/en/ch9#ch_distributed). If we want a
service to continue working correctly despite those things going wrong, we need to find ways of
tolerating faults.
One of the best tools we have for fault tolerance is *replication*. However, as we saw in
[Chapter 6](/en/ch6#ch_replication), having multiple copies of the data on multiple replicas opens up the risk of
inconsistencies. Reads might be handled by a replica that is not up-to-date, yielding stale results.
If multiple replicas can accept writes, we have to deal with conflicts between values that were
concurrently written on different replicas. At a high level, there are two competing philosophies
for dealing with such issues:
Eventual consistency
: In this philosophy, the fact that a system is replicated is made visible to the application, and
you as application developer are expected to deal with the inconsistencies and conflicts that may
arise. This approach is often used in systems with multi-leader (see
[“Multi-Leader Replication”](/en/ch6#sec_replication_multi_leader)) and leaderless replication (see [“Leaderless Replication”](/en/ch6#sec_replication_leaderless)).
Strong consistency
: This philosophy says that applications should not have to worry about internal details of
replication, and that the system should behave as if it were a single node. The advantage of this
approach is that it’s simpler for you, the application developer. The disadvantage is that
stronger consistency has a performance cost, and some kinds of fault that an eventually consistent
system can tolerate cause outages in strongly consistent systems.
As always, which approach is better depends on your application. If you have an app where users can
make changes to data while offline, then eventual consistency is inevitable, as discussed in
[“Sync Engines and Local-First Software”](/en/ch6#sec_replication_offline_clients). However, eventual consistency can also be difficult for
applications to deal with. If your replicas are located in datacenters with fast, reliable
communication, then strong consistency is often appropriate because its cost is acceptable.
In this chapter we will dive deeper into the strongly consistent approach, looking at three areas:
1. One challenge is that “strong consistency” is quite vague, so we will develop a more precise
definition of what we want to achieve: *linearizability*.
2. We will look at the problem of generating IDs and timestamps. This may sound unrelated to
consistency but is actually closely connected.
3. We will explore how distributed systems can achieve linearizability while still remaining
fault-tolerant; the answer is *consensus* algorithms.
Along the way, we will see that there are some fundamental limits on what is possible and what is
not in a distributed system.
The topics of this chapter are notorious for being hard to implement correctly; it’s very easy to
build systems that behave fine when there are no faults, but which completely fall apart when faced
with an unlucky combination of faults that the designer of the system hadn’t considered. A lot of
theory has been developed to help us think through those edge cases, which enables us to build
systems that can robustly tolerate faults.
This chapter will only scratch the surface: we will stick with informal intuitions, and avoid the
algorithmic nitty-gritty, formal models, and proofs. If you want to do serious work on consensus
systems and similar infrastructure, you will need to go much deeper into the theory to have any
chance of your systems being robust. As usual, the literature references in this chapter provide
some initial pointers.
## Linearizability {#sec_consistency_linearizability}
If you want a replicated database to be as simple as possible to use, you should make it behave as
if it weren’t replicated at all. Then users don’t have to worry about replication lag, conflicts, and
other inconsistencies. That would give us the advantage of fault tolerance, but without the
complexity arising from having to think about multiple replicas.
This is the idea behind *linearizability* [^1] (also known as *atomic consistency* [^2], *strong consistency*, *immediate consistency*, or *external consistency* [^3]).
The exact definition of linearizability is quite subtle, and we will explore it in the rest of this
section. But the basic idea is to make a system appear as if there were only one copy of the data,
and all operations on it are atomic. With this guarantee, even though there may be multiple replicas
in reality, the application does not need to worry about them.
In a linearizable system, as soon as one client successfully completes a write, all clients reading
from the database must be able to see the value just written. Maintaining the illusion of a single
copy of the data means guaranteeing that the value read is the most recent, up-to-date value, and
doesn’t come from a stale cache or replica. In other words, linearizability is a *recency
guarantee*. To clarify this idea, let’s look at an example of a system that is not linearizable.
{{< figure src="/fig/ddia_1001.png" id="fig_consistency_linearizability_0" caption="Figure 10-1. This system is not linearizable: Bryce's read begins after Aaliyah's read has returned the final score, yet it returns a stale result." class="w-full my-4" >}}
[Figure 10-1](/en/ch10#fig_consistency_linearizability_0) shows an example of a nonlinearizable sports website [^4].
Aaliyah and Bryce are sitting in the same room, both checking their phones to see the outcome of a
game their favorite team is playing. Just after the final score is announced, Aaliyah refreshes the
page, sees the winner announced, and excitedly tells Bryce about it. Bryce incredulously hits
*reload* on his own phone, but his request goes to a database replica that is lagging, and so his
phone shows that the game is still ongoing.
If Aaliyah and Bryce had hit reload at the same time, it would have been less surprising if they had
gotten two different query results, because they wouldn’t know at exactly what time their respective
requests were processed by the server. However, Bryce knows that he hit the reload button (initiated
his query) *after* he heard Aaliyah exclaim the final score, and therefore he expects his query
result to be at least as recent as Aaliyah’s. The fact that his query returned a stale result is a
violation of linearizability.
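The scenario can be reproduced with a toy model. The sketch below is not any real database API: the `Replica` class and its methods are hypothetical names, and replication lag is modeled by applying writes on the follower only when `apply_pending` is called.

```python
class Replica:
    """A toy replica that applies replicated writes only when told to."""

    def __init__(self):
        self.value = "game ongoing"
        self.pending = []            # writes received but not yet applied

    def receive(self, value):
        self.pending.append(value)

    def apply_pending(self):
        for value in self.pending:
            self.value = value
        self.pending.clear()

    def read(self):
        return self.value


leader, follower = Replica(), Replica()

# The final score is written on the leader and shipped to the follower,
# but the follower is lagging and has not applied it yet.
leader.value = "final score: 2-1"
follower.receive("final score: 2-1")

aaliyah_sees = leader.read()      # this read completes first
bryce_sees = follower.read()      # this read starts after Aaliyah's has completed

print(aaliyah_sees)  # final score: 2-1
print(bryce_sees)    # game ongoing
```

Bryce's read started after Aaliyah's read had returned, yet it observed an older state, which is exactly the behavior a linearizable system rules out.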
### What Makes a System Linearizable? {#sec_consistency_lin_definition}
In order to understand linearizability better, let’s look at some more examples.
[Figure 10-2](/en/ch10#fig_consistency_linearizability_1) shows three clients concurrently reading and writing the same
object *x* in a linearizable database. In distributed systems theory, *x* is called a *register*—in
practice, it could be one key in a key-value store, one row in a relational database, or one
document in a document database, for example.
{{< figure src="/fig/ddia_1002.png" id="fig_consistency_linearizability_1" caption="Figure 10-2. If a read request is concurrent with a write request, it may return either the old or the new value." class="w-full my-4" >}}
For simplicity, [Figure 10-2](/en/ch10#fig_consistency_linearizability_1) shows only the requests from the clients’
point of view, not the internals of the database. Each bar is a request made by a client, where the
start of a bar is the time when the request was sent, and the end of a bar is when the response was
received by the client. Due to variable network delays, a client doesn’t know exactly when the
database processed its request—it only knows that it must have happened sometime between the
client sending the request and receiving the response.
In this example, the register has two types of operations:
* *read*(*x*) ⇒ *v* means the client requested to read the value of register
*x*, and the database returned the value *v*.
* *write*(*x*, *v*) ⇒ *r* means the client requested to set the
register *x* to value *v*, and the database returned response *r* (which could be *ok* or *error*).
In [Figure 10-2](/en/ch10#fig_consistency_linearizability_1), the value of *x* is initially 0, and client C performs a
write request to set it to 1. While this is happening, clients A and B are repeatedly polling the
database to read the latest value. What are the possible responses that A and B might get for their
read requests?
* The first read operation by client A completes before the write begins, so it must definitely
return the old value 0.
* The last read by client A begins after the write has completed, so it must definitely return the
new value 1 if the database is linearizable, because the read must have been processed after the
write.
* Any read operations that overlap in time with the write operation might return either 0 or 1,
because we don’t know whether or not the write has taken effect at the time when the read
operation is processed. These operations are *concurrent* with the write.
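The three cases above can be written as a small function over request intervals. This is a sketch under assumptions: the `Op` type and the timestamps are invented for illustration, and the function encodes only these three rules, which on their own do not yet fully describe linearizability.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    start: float   # time at which the client sent the request
    end: float     # time at which the client received the response


def allowed_read_values(read: Op, write: Op, old, new):
    """Values a single read may return, given one write of `new` over `old`."""
    if read.end < write.start:       # read completed before the write began
        return {old}
    if read.start > write.end:       # read began after the write completed
        return {new}
    return {old, new}                # the two operations overlap: concurrent


write_c = Op(start=3.0, end=7.0)     # client C sets x from 0 to 1

print(allowed_read_values(Op(0.0, 2.0), write_c, 0, 1))   # {0}
print(allowed_read_values(Op(4.0, 5.0), write_c, 0, 1))   # {0, 1}
print(allowed_read_values(Op(8.0, 9.0), write_c, 0, 1))   # {1}
```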
However, that is not yet sufficient to fully describe linearizability: if reads that are concurrent
with a write can return either the old or the new value, then readers could see a value flip back
and forth between the old and the new value several times while a write is going on. That is not
what we expect of a system that emulates a “single copy of the data.”
To make the system linearizable, we need to add another constraint, illustrated in
[Figure 10-3](/en/ch10#fig_consistency_linearizability_2).
{{< figure src="/fig/ddia_1003.png" id="fig_consistency_linearizability_2" caption="Figure 10-3. After any one read has returned the new value, all following reads (on the same or other clients) must also return the new value." class="w-full my-4" >}}
In a linearizable system we imagine that there must be some point in time (between the start and end
of the write operation) at which the value of *x* atomically flips from 0 to 1. Thus, if one
client’s read returns the new value 1, all subsequent reads must also return the new value, even if
the write operation has not yet completed.
This timing dependency is illustrated with an arrow in [Figure 10-3](/en/ch10#fig_consistency_linearizability_2).
Client A is the first to read the new value, 1. Just after A’s read returns, B begins a new read.
Since B’s read occurs strictly after A’s read, it must also return 1, even though the write by C is
still ongoing. (It’s the same situation as with Aaliyah and Bryce in
[Figure 10-1](/en/ch10#fig_consistency_linearizability_0): after Aaliyah has read the new value, Bryce also expects to
read the new value.)
We can further refine this timing diagram to visualize each operation taking effect atomically at
some point in time [^5],
like in the more complex example shown in [Figure 10-4](/en/ch10#fig_consistency_linearizability_3). In this example we
add a third type of operation besides *read* and *write*:
* *cas*(*x*, *v_old*, *v_new*) ⇒ *r* means the client
requested an atomic *compare-and-set* operation (see [“Conditional writes (compare-and-set)”](/en/ch8#sec_transactions_compare_and_set)). If the
current value of the register *x* equals *v_old*, it should be atomically set to *v_new*. If
the value of *x* is different from *v_old*, then the operation should leave the register
unchanged and return an error. *r* is the database’s response (*ok* or *error*).
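As a point of reference, here is a minimal single-copy register offering the three operations. It is only a sketch: the `Register` class is a hypothetical in-memory object in which a lock makes each operation atomic, whereas a replicated database has to provide the same behavior across several nodes.

```python
import threading


class Register:
    """In-memory register; the lock makes each operation atomic."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def read(self):
        with self._lock:
            return self._value

    def write(self, value):
        with self._lock:
            self._value = value
            return "ok"

    def cas(self, v_old, v_new):
        """Set the register to v_new only if it currently holds v_old."""
        with self._lock:
            if self._value != v_old:
                return "error"       # leave the register unchanged
            self._value = v_new
            return "ok"


x = Register(0)
print(x.cas(0, 1))   # ok
print(x.cas(0, 2))   # error
print(x.read())      # 1
```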
Each operation in [Figure 10-4](/en/ch10#fig_consistency_linearizability_3) is marked with a vertical line (inside the
bar for each operation) at the time when we think the operation was executed. Those markers are
joined up in a sequential order, and the result must be a valid sequence of reads and writes for a
register (every read must return the value set by the most recent write).
The requirement of linearizability is that the lines joining up the operation markers always move
forward in time (from left to right), never backward. This requirement ensures the recency guarantee we
discussed earlier: once a new value has been written or read, all subsequent reads see the value
that was written, until it is overwritten again.
{{< figure src="/fig/ddia_1004.png" id="fig_consistency_linearizability_3" caption="Figure 10-4. Visualizing the points in time at which the reads and writes appear to have taken effect. The final read by B is not linearizable." class="w-full my-4" >}}
There are a few interesting details to point out in [Figure 10-4](/en/ch10#fig_consistency_linearizability_3):
* First, client B sent a request to read *x*, then client D sent a request to set *x* to 0, and then
client A sent a request to set *x* to 1. Nevertheless, the value returned to B’s read is 1 (the
value written by A). This is okay: it means that the database first processed D’s write, then A’s
write, and finally B’s read. Although this is not the order in which the requests were sent, it’s
an acceptable order, because the three requests are concurrent. Perhaps B’s read request was
slightly delayed in the network, so it only reached the database after the two writes.
* Client B’s read returned 1 before client A received its response from the database, saying that
the write of the value 1 was successful. This is also okay: it just means the *ok* response from
the database to client A was slightly delayed in the network.
* This model doesn’t assume any transaction isolation: another client may change a value at any
time. For example, C first reads 1 and then reads 2, because the value was changed by B between
the two reads. An atomic compare-and-set (*cas*) operation can be used to check the value hasn’t
been concurrently changed by another client: B and C’s *cas* requests succeed, but D’s *cas*
request fails (by the time the database processes it, the value of *x* is no longer 0).
* The final read by client B (in a shaded bar) is not linearizable. The operation is concurrent with
C’s *cas* write, which updates *x* from 2 to 4. In the absence of other requests, it would be okay for
B’s read to return 2. However, client A has already read the new value 4 before B’s read started,
so B is not allowed to read an older value than A. Again, it’s the same situation as with Aaliyah
and Bryce in [Figure 10-1](/en/ch10#fig_consistency_linearizability_0).
That is the intuition behind linearizability; the formal definition [^1] describes it more precisely. It is
possible (though computationally expensive) to test whether a system’s behavior is linearizable by
recording the timings of all requests and responses, and checking whether they can be arranged into
a valid sequential order [^6] [^7].
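To make the idea of such a test concrete, the following sketch checks a recorded history of reads and writes on one register by brute force. The `Op` record and the timestamps are assumptions made for illustration, *cas* is omitted for brevity, and the search is exponential in the worst case, which gives a sense of why the check is expensive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    start: float   # request sent
    end: float     # response received
    kind: str      # "read" or "write"
    value: int     # value written, or value returned by the read


def is_linearizable(history, initial=0):
    """Search for a sequential order that respects real time and register semantics."""

    def search(remaining, current):
        if not remaining:
            return True
        for op in remaining:
            # op can only come next if no other pending operation
            # had already finished before op was even sent.
            if any(o.end < op.start for o in remaining if o is not op):
                continue
            if op.kind == "read" and op.value != current:
                continue             # a read must return the most recent write
            next_value = op.value if op.kind == "write" else current
            rest = [o for o in remaining if o is not op]
            if search(rest, next_value):
                return True
        return False

    return search(list(history), initial)


write_c = Op(2, 10, "write", 1)      # C's write is in flight the whole time
read_a = Op(3, 4, "read", 1)         # A already sees the new value

print(is_linearizable([write_c, read_a, Op(5, 6, "read", 1)]))   # True
print(is_linearizable([write_c, read_a, Op(5, 6, "read", 0)]))   # False
```

The second history fails because the later read returns the old value after an earlier read has already returned the new one.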
Just as there are various weak isolation levels for transactions besides serializability (see
[“Weak Isolation Levels”](/en/ch8#sec_transactions_isolation_levels)), there are also various weaker consistency models for
replicated systems besides linearizability [^8].
In fact, the *read-after-write*, *monotonic reads*, and *consistent prefix reads* properties we saw
in [“Problems with Replication Lag”](/en/ch6#sec_replication_lag) are examples of such weaker consistency models. Linearizability
guarantees all these weaker properties, and more. In this chapter we will focus on linearizability,
which is the strongest consistency model in common use.
--------
> [!TIP] LINEARIZABILITY VERSUS SERIALIZABILITY
Linearizability is easily confused with serializability (see [“Serializability”](/en/ch8#sec_transactions_serializability)),
as both words seem to mean something like “can be arranged in a sequential order.” However, they are
quite different guarantees, and it is important to distinguish between them:
Serializability
: Serializability is an isolation property of transactions, where every transaction may read and
write *multiple objects* (rows, documents, records). It guarantees that transactions behave the
same as if they had executed in *some* serial order: that is, as if you first performed all of one
transaction’s operations, then all of another transaction’s operations, and so on, without
interleaving them. It is okay for that serial order to be different from the order in which the
transactions were actually run [^9].
Linearizability
: Linearizability is a guarantee on reads and writes of a register (an *individual object*). It
doesn’t group operations together into transactions, so it does not prevent problems such as write
skew that involve multiple objects (see [“Write Skew and Phantoms”](/en/ch8#sec_transactions_write_skew)). However, linearizability
is a *recency* guarantee: it requires that if one operation finishes before another one starts,
then the later operation must observe a state that is at least as new as the earlier operation.
Serializability does not have that requirement: for example, stale reads are allowed by
serializability [^10].
(*Sequential consistency* is something else again [^8], but we won’t discuss it here.)
A database may provide both serializability and linearizability, and this combination is known as
*strict serializability* or *strong one-copy serializability* (*strong-1SR*) [^11] [^12].
Single-node databases are typically linearizable. With distributed databases using optimistic
methods like serializable snapshot isolation (see [“Serializable Snapshot Isolation (SSI)”](/en/ch8#sec_transactions_ssi)) the situation is more
complicated: for example, CockroachDB provides serializability, and some recency guarantees on
reads, but not strict serializability [^13]
because this would require expensive coordination between transactions [^14].
It is also possible to combine a weaker isolation level with linearizability, or a weaker
consistency model with serializability; in fact, consistency model and isolation level can be chosen
largely independently from each other [^15] [^16].
--------
### Relying on Linearizability {#sec_consistency_linearizability_usage}
In what circumstances is linearizability useful? Viewing the final score of a sporting match is
perhaps a frivolous example: a result that is outdated by a few seconds is unlikely to cause any
real harm in this situation. However, there are a few areas in which linearizability is an important
requirement for making a system work correctly.
#### Locking and leader election {#locking-and-leader-election}
A system that uses single-leader replication needs to ensure that there is indeed only one leader,
not several (split brain). One way of electing a leader is to use a lease: every node that starts up
tries to acquire the lease, and the one that succeeds becomes the leader [^17].
No matter how this mechanism is implemented, it must be linearizable: it should not be possible for
two different nodes to acquire the lease at the same time.
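The sketch below illustrates the property with a single-copy lease store. Everything in it is hypothetical: `LeaseStore` and `try_acquire` are illustrative names, the current time is passed in explicitly to keep the example deterministic, and a lock stands in for the atomicity that a real coordination service must provide across nodes. It also ignores the clock and fencing issues that a real lease implementation has to handle.

```python
import threading


class LeaseStore:
    """Toy lease service holding a single lease."""

    def __init__(self):
        self._holder = None
        self._expires_at = 0.0
        self._lock = threading.Lock()

    def try_acquire(self, node_id, now, ttl):
        """Grant the lease unless another node holds an unexpired lease."""
        with self._lock:
            held_by_other = self._holder is not None and self._holder != node_id
            if held_by_other and now < self._expires_at:
                return False
            self._holder = node_id
            self._expires_at = now + ttl
            return True


store = LeaseStore()
winners = []


def campaign(node_id):
    if store.try_acquire(node_id, now=0.0, ttl=10.0):
        winners.append(node_id)


threads = [threading.Thread(target=campaign, args=(f"node-{i}",)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(winners))   # 1
```

However the threads interleave, exactly one node becomes the leader, because the check and the update happen as one atomic step.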
Coordination services like Apache ZooKeeper [^18]
and etcd are often used to implement distributed leases and leader election. They use consensus
algorithms to implement linearizable operations in a fault-tolerant way (we discuss such algorithms
later in this chapter). There are still many subtle details to implementing leases and leader
election correctly (see for example the fencing issue in [“Distributed Locks and Leases”](/en/ch9#sec_distributed_lock_fencing)), and
libraries like Apache Curator help by providing higher-level recipes on top of ZooKeeper. However, a
linearizable storage service is the basic foundation for these coordination tasks.
--------
> [!NOTE]
> Strictly speaking, ZooKeeper provides linearizable writes, but reads may be stale, since there is no
> guarantee that they are served from the current leader [^18]. etcd since version 3 provides linearizable reads by default.
--------
Distributed locking is also used at a much more granular level in some distributed databases, such as
Oracle Real Application Clusters (RAC) [^19].
RAC uses a lock per disk page, with multiple nodes sharing access
to the same disk storage system. Since these linearizable locks are on the critical path of
transaction execution, RAC deployments usually have a dedicated cluster interconnect network for
communication between database nodes.
#### Constraints and uniqueness guarantees {#sec_consistency_uniqueness}
Uniqueness constraints are common in databases: for example, a username or email address must
uniquely identify one user, and in a file storage service there cannot be two files with the same
path and filename. If you want to enforce this constraint as the data is written (such that if two people
try to concurrently create a user or a file with the same name, one of them will be returned an
error), you need linearizability.
This situation is actually similar to a lock: when a user registers for your service, you can think
of them acquiring a “lock” on their chosen username. The operation is also very similar to an atomic
compare-and-set, setting the username to the ID of the user who claimed it, provided that the
username is not already taken.
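A small sketch shows why a check against a possibly stale copy is not enough. The function and class names are hypothetical, plain dictionaries stand in for replicas that have not yet exchanged writes, and a lock stands in for a linearizable compare-and-set.

```python
import threading


def claim_locally(replica, username, user_id):
    """Check-then-insert against one replica's local copy only."""
    if username in replica:
        return False
    replica[username] = user_id
    return True


replica_1, replica_2 = {}, {}        # two replicas that have not yet synchronized

ok_1 = claim_locally(replica_1, "alice", 101)
ok_2 = claim_locally(replica_2, "alice", 202)
print(ok_1, ok_2)                    # True True


class UsernameRegistry:
    """Single up-to-date copy; claim behaves like compare-and-set from 'unclaimed'."""

    def __init__(self):
        self._owner_by_name = {}
        self._lock = threading.Lock()

    def claim(self, username, user_id):
        with self._lock:
            if username in self._owner_by_name:
                return False
            self._owner_by_name[username] = user_id
            return True


registry = UsernameRegistry()
first = registry.claim("alice", 101)
second = registry.claim("alice", 202)
print(first, second)                 # True False
```

In the first half both users are told that the name is theirs, so the uniqueness constraint is violated. In the second half there is one authoritative value and the second claim is rejected.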
Similar issues arise if you want to ensure that a bank account balance never goes negative, or that
you don’t sell more items than you have in stock in the warehouse, or that two people don’t
concurrently book the same seat on a flight or in a theater. These constraints all require there to
be a single up-to-date value (the account balance, the stock level, the seat occupancy) that all
nodes agree on.
In real applications, it is sometimes acceptable to treat such constraints loosely (for example, if
a flight is overbooked, you can move customers to a different flight and offer them compensation for
the inconvenience). In such cases, linearizability may not be needed, and we will discuss such
loosely interpreted constraints in [“Timeliness and Integrity”](/en/ch13#sec_future_integrity).
However, a hard uniqueness constraint, such as the one you typically find in relational databases,
requires linearizability. Other kinds of constraints, such as foreign key or attribute constraints,
can be implemented without linearizability [^20].
#### Cross-channel timing dependencies {#cross-channel-timing-dependencies}
Notice a detail in [Figure 10-1](/en/ch10#fig_consistency_linearizability_0): if Aaliyah hadn’t exclaimed the score,
Bryce wouldn’t have known that the result of his query was stale. He would have just refreshed the
page again a few seconds later, and eventually seen the final score. The linearizability violation
was only noticed because there was an additional communication channel in the system (Aaliyah’s
voice to Bryce’s ears).
Similar situations can arise in computer systems. For example, say you have a website where users
can upload a video, and a background process transcodes the video to a lower quality that can be
streamed on slow internet connections. The architecture and dataflow of this system are illustrated
in [Figure 10-5](/en/ch10#fig_consistency_transcoder).
The video transcoder needs to be explicitly instructed to perform a transcoding job, and this
instruction is sent from the web server to the transcoder via a message queue (see [“Messaging Systems”](/en/ch12#sec_stream_messaging)).
The web server doesn’t place the entire video on the queue, since most message brokers are designed
for small messages, and a video may be many megabytes in size. Instead, the video is first written
to a file storage service, and once the write is complete, the instruction to the transcoder is
placed on the queue.
{{< figure src="/fig/ddia_1005.png" id="fig_consistency_transcoder" caption="Figure 10-5. The web server and video transcoder communicate both through file storage and a message queue, opening the potential for race conditions." class="w-full my-4" >}}
If the file storage service is linearizable, then this system should work fine. If it is not
linearizable, there is the risk of a race condition: the message queue (steps 3 and 4 in
[Figure 10-5](/en/ch10#fig_consistency_transcoder)) might be faster than the internal replication inside the storage
service. In this case, when the transcoder fetches the original video (step 5), it might see an old
version of the file, or nothing at all. If it processes an old version of the video, the original
and transcoded videos in the file storage become permanently inconsistent with each other.
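The race can be reproduced with a toy model. In this sketch, `ReplicatedFileStore` is a hypothetical store whose reads are served by a replica that only catches up when `replicate()` is called, the queue is an in-process `deque`, and the path and payload are made up.

```python
from collections import deque


class ReplicatedFileStore:
    """Toy file store: writes go to the primary, reads come from a lagging replica."""

    def __init__(self):
        self.primary = {}
        self.replica = {}

    def put(self, path, data):
        self.primary[path] = data

    def replicate(self):
        self.replica.update(self.primary)

    def get(self, path):
        return self.replica.get(path)


store = ReplicatedFileStore()
queue = deque()

# Web server: write the video, then enqueue the transcoding instruction.
store.put("videos/42.mp4", b"original video")
queue.append({"job": "transcode", "path": "videos/42.mp4"})

# Transcoder: the message arrives before the storage service has replicated.
job = queue.popleft()
fetched = store.get(job["path"])
print(fetched)                       # None

store.replicate()
print(store.get(job["path"]))        # b'original video'
```

The queue was the faster channel here, so the transcoder acted on a file that, from its point of view, did not exist yet.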
This problem arises because there are two different communication channels between the web server
and the transcoder: the file storage and the message queue. Without the recency guarantee of
linearizability, race conditions between these two channels are possible. This situation is
analogous to [Figure 10-1](/en/ch10#fig_consistency_linearizability_0), where there was also a race condition between
two communication channels: the database replication and the real-life audio channel between
Aaliyah’s mouth and Bryce’s ears.
A similar race condition occurs if you have a mobile app that can receive push notifications, and
the app fetches some data from a server when it receives a push notification. If the data fetch
might go to a lagging replica, it could happen that the push notification goes through quickly, but
the subsequent fetch doesn’t see the data that the push notification was about.
Linearizability is not the only way of avoiding this race condition, but it’s the simplest to
understand. If you control the additional communication channel (like in the case of the message
queue, but not in the case of Aaliyah and Bryce), you can use alternative approaches similar to what
we discussed in [“Reading Your Own Writes”](/en/ch6#sec_replication_ryw), at the cost of additional complexity.
### Implementing Linearizable Systems {#sec_consistency_implementing_linearizable}
Now that we’ve looked at a few examples in which linearizability is useful, let’s think about how we
might implement a system that offers linearizable semantics.
Since linearizability essentially means “behave as though there is only a single copy of the data,
and all operations on it are atomic,” the simplest answer would be to really only use a single copy
of the data. However, that approach would not be able to tolerate faults: if the node holding that
one copy failed, the data would be lost, or at least inaccessible until the node was brought up again.
Let’s revisit the replication methods from [Chapter 6](/en/ch6#ch_replication), and compare whether they can be made linearizable:
Single-leader replication (potentially linearizable)
: In a system with single-leader replication, the leader has the primary copy of the data that is
used for writes, and the followers maintain backup copies of the data on other nodes. As long as
you perform all reads and writes on the leader, they are likely to be linearizable. However, this
assumes that you know for sure who the leader is. As discussed in
[“Distributed Locks and Leases”](/en/ch9#sec_distributed_lock_fencing), it is quite possible for a node to think that it is the leader,
when in fact it is not—and if the delusional leader continues to serve requests, it is likely to
violate linearizability [^21].
With asynchronous replication, failover may even lose committed writes, which violates both
durability and linearizability.
Sharding a single-leader database, with a separate leader per shard, does not affect
linearizability, since it is only a single-object guarantee. Cross-shard transactions are a
different matter (see [“Distributed Transactions”](/en/ch8#sec_transactions_distributed)).
Consensus algorithms (likely linearizable)
: Some consensus algorithms are essentially single-leader replication with automatic leader election
and failover. They are carefully designed to prevent split brain, allowing them to implement
linearizable storage safely. ZooKeeper uses the Zab consensus algorithm [^22] and etcd uses Raft [^23], for example.
However, just because a system uses consensus does not guarantee that all operations on it are
linearizable: if it allows reads on a node without checking that it is still the leader, the
results of the read may be stale if a new leader has just been elected.
Multi-leader replication (not linearizable)
: Systems with multi-leader replication are generally not linearizable, because they concurrently
process writes on multiple nodes and asynchronously replicate them to other nodes. For this
reason, they can produce conflicting writes that require resolution (see
[“Dealing with Conflicting Writes”](/en/ch6#sec_replication_write_conflicts)).
Leaderless replication (probably not linearizable)
: For systems with leaderless replication (Dynamo-style; see [“Leaderless Replication”](/en/ch6#sec_replication_leaderless)), people
sometimes claim that you can obtain “strong consistency” by requiring quorum reads and writes
(*w* + *r* > *n*). Depending on the exact algorithm and on how you define
strong consistency, this claim is not quite true.
“Last write wins” conflict resolution methods based on time-of-day clocks (e.g., in Cassandra and
ScyllaDB) are almost certainly nonlinearizable, because clock timestamps cannot be guaranteed to be
consistent with actual event ordering due to clock skew (see [“Relying on Synchronized Clocks”](/en/ch9#sec_distributed_clocks_relying)).
Even with quorums, nonlinearizable behavior is possible, as demonstrated in the next section.
#### Linearizability and quorums {#sec_consistency_quorum_linearizable}
Intuitively, it seems as though quorum reads and writes should be linearizable in a
Dynamo-style model. However, when we have variable network delays, it is possible to have race
conditions, as demonstrated in [Figure 10-6](/en/ch10#fig_consistency_leaderless).
{{< figure src="/fig/ddia_1006.png" id="fig_consistency_leaderless" caption="Figure 10-6. Quorums are not sufficient to ensure linearizability if network delays are variable." class="w-full my-4" >}}
In [Figure 10-6](/en/ch10#fig_consistency_leaderless), the initial value of *x* is 0, and a writer client is updating
*x* to 1 by sending the write to all three replicas (*n* = 3, *w* = 3).
Concurrently, client A reads from a quorum of two nodes (*r* = 2) and sees the new value 1
on one of the nodes. Also concurrently with the write, client B reads from a different quorum of two
nodes, and gets back the old value 0 from both.
The quorum condition is met (*w* + *r* > *n*), but this execution is nevertheless not
linearizable: B’s request begins after A’s request completes, but B returns the old value while A
returns the new value. (It’s once again the Aaliyah and Bryce situation from
[Figure 10-1](/en/ch10#fig_consistency_linearizability_0).)
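This execution can be replayed step by step. The sketch is a simplification: replicas are plain dictionaries carrying a version number, replica indexes 0 to 2 are arbitrary, and message delivery is modeled by calling `deliver_write` at the moment a write reaches a replica.

```python
replicas = [{"version": 0, "value": 0} for _ in range(3)]    # n = 3


def deliver_write(index, version, value):
    """The write message arrives at one replica."""
    if version > replicas[index]["version"]:
        replicas[index] = {"version": version, "value": value}


def quorum_read(indexes):
    """Read from the given replicas and return the newest value seen."""
    responses = [replicas[i] for i in indexes]
    return max(responses, key=lambda r: r["version"])["value"]


# The writer sends x = 1 to all replicas (w = 3), but delays vary.
deliver_write(0, version=1, value=1)     # reaches replica 0 first

read_a = quorum_read([0, 1])             # A (r = 2): sees 1 on replica 0
read_b = quorum_read([1, 2])             # B starts after A finished: sees only 0

deliver_write(1, version=1, value=1)     # the write reaches the rest later
deliver_write(2, version=1, value=1)

print(read_a, read_b)                    # 1 0
```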
It is possible to make Dynamo-style quorums linearizable at the cost of reduced
performance: a reader must perform read repair (see [“Catching up on missed writes”](/en/ch6#sec_replication_read_repair)) synchronously,
before returning results to the application [^24].
Moreover, before writing, a writer must read the latest state of a quorum of nodes to fetch the
latest timestamp of any prior write, and ensure that the new write has a greater timestamp [^25] [^26].
However, Riak does not perform synchronous read repair due to the performance penalty.
Cassandra does wait for read repair to complete on quorum reads [^27],
but it loses linearizability due to its use of time-of-day clocks for timestamps.
Moreover, only linearizable read and write operations can be implemented in this way; a
linearizable compare-and-set operation cannot, because it requires a consensus algorithm [^28].
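A rough sketch of these two steps, under simplifying assumptions: versions are `(counter, writer_id)` pairs so that different writers never choose the same version, replicas are in-memory dictionaries, and all function names are illustrative rather than taken from any real system.

```python
replicas = [{"version": (0, ""), "value": 0} for _ in range(3)]


def deliver_write(index, version, value):
    """Apply a write to one replica unless it already holds a newer version."""
    if version > replicas[index]["version"]:
        replicas[index] = {"version": version, "value": value}


def next_version(quorum, writer_id):
    """Before writing, read a quorum and choose a version greater than any seen."""
    highest_counter = max(replicas[i]["version"][0] for i in quorum)
    return (highest_counter + 1, writer_id)


def quorum_read(quorum):
    """Return the newest value in the quorum after writing it back to that quorum."""
    newest = max((replicas[i] for i in quorum), key=lambda r: r["version"])
    for i in quorum:                 # synchronous read repair, done before returning
        deliver_write(i, newest["version"], newest["value"])
    return newest["value"]


version = next_version([0, 1], writer_id="w1")
deliver_write(0, version, 1)         # the write has reached only replica 0 so far

read_a = quorum_read([0, 1])         # sees 1 and repairs replica 1
read_b = quorum_read([1, 2])         # starts after A finished and now also sees 1
print(read_a, read_b)                # 1 1
```

Because A does not return until the new value is stored on its whole read quorum, any later quorum read overlaps with at least one repaired replica. The price is an extra round of writes on every read.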
In summary, it is safest to assume that a leaderless system with Dynamo-style replication does not
provide linearizability, even with quorum reads and writes.
### The Cost of Linearizability {#sec_linearizability_cost}
As some replication methods can provide linearizability and others cannot, it is interesting to
explore the pros and cons of linearizability in more depth.
We already discussed some use cases for different replication methods in [Chapter 6](/en/ch6#ch_replication); for
example, we saw that multi-leader replication is often a good choice for multi-region
replication (see [“Geographically Distributed Operation”](/en/ch6#sec_replication_multi_dc)). An example of such a deployment is illustrated in
[Figure 10-7](/en/ch10#fig_consistency_cap_availability).
{{< figure src="/fig/ddia_1007.png" id="fig_consistency_cap_availability" caption="Figure 10-7. If clients cannot contact enough replicas due to a network partition, they cannot process writes." class="w-full my-4" >}}
Consider what happens if there is a network interruption between the two regions. Let’s assume
that the network within each region is working, and clients can reach their local region, but the
regions cannot connect to each other. This is known as a *network partition*.
With a multi-leader database, each region can continue operating normally: since writes from one
region are asynchronously replicated to the other, the writes are simply queued up and exchanged
when network connectivity is restored.
On the other hand, if single-leader replication is used, then the leader must be in one of the
regions. Any writes and any linearizable reads must be sent to the leader—thus, for any
clients connected to a follower region, those read and write requests must be sent synchronously
over the network to the leader region.
If the network between regions is interrupted in a single-leader setup, clients connected to
follower regions cannot contact the leader, so they cannot make any writes to the database, nor
any linearizable reads. They can still make reads from the follower, but they might be stale
(nonlinearizable). If the application requires linearizable reads and writes, the network
interruption causes the application to become unavailable in the regions that cannot contact the leader.
If clients can connect directly to the leader region, this is not a problem, since the
application continues to work normally there. But clients that can only reach a follower region
will experience an outage until the network link is repaired.
#### The CAP theorem {#the-cap-theorem}
This issue is not just a consequence of single-leader and multi-leader replication: any linearizable
database has this problem, no matter how it is implemented. The issue also isn’t specific to
multi-region deployments, but can occur on any unreliable network, even within one region.
The trade-off is as follows:
* If your application *requires* linearizability, and some replicas are disconnected from the other
replicas due to a network problem, then some replicas cannot process requests while they are
disconnected: they must either wait until the network problem is fixed, or return an error (either
way, they become *unavailable*). This choice is sometimes known as *CP* (consistent under network partitions).
* If your application *does not require* linearizability, then it can be written in a way that each
replica can process requests independently, even if it is disconnected from other replicas (e.g.,
multi-leader). In this case, the application can remain *available* in the face of a network
problem, but its behavior is not linearizable. This choice is known as *AP* (available under network partitions).
Thus, applications that don’t require linearizability can be more tolerant of network problems. This
insight is popularly known as the *CAP theorem* [^29] [^30] [^31] [^32],
named by Eric Brewer in 2000, although the trade-off had been known to designers of
distributed databases since the 1970s [^33] [^34] [^35].
CAP was originally proposed as a rule of thumb, without precise definitions, with the goal of
starting a discussion about trade-offs in databases. At the time, many distributed databases
focused on providing linearizable semantics on a cluster of machines with shared storage [^19], and CAP encouraged database engineers
to explore a wider design space of distributed shared-nothing systems, which were more suitable for
implementing large-scale web services [^36].
CAP deserves credit for this culture shift—it helped trigger the NoSQL movement, a burst of new
database technologies around the mid-2000s.
> [!TIP] THE UNHELPFUL CAP THEOREM
CAP is sometimes presented as *Consistency, Availability, Partition tolerance: pick 2 out of 3*.
Unfortunately, putting it this way is misleading [^32] because network partitions are a kind of
fault, so they aren’t something about which you have a choice: they will happen whether you like it or not.
At times when the network is working correctly, a system can provide both consistency
(linearizability) and total availability. When a network fault occurs, you have to choose between
either linearizability or total availability. Thus, a better way of phrasing CAP would be
*either Consistent or Available when Partitioned* [^37].
With a more reliable network, this choice has to be made less often, but at some point it is inevitable.
The CP/AP classification scheme has several further flaws [^4]. *Consistency* is formalized as
linearizability (the theorem doesn’t say anything about weaker consistency models), and the
formalization of *availability* [^30] does not
match the usual meaning of the term [^38]. Many highly available (fault-tolerant) systems actually do not meet CAP’s
idiosyncratic definition of availability. Moreover, some system designers choose (with good reason)
to provide neither linearizability nor the form of availability that the CAP theorem assumes, so
those systems are neither CP nor AP [^39] [^40].
All in all, there is a lot of misunderstanding and confusion around CAP, and it does not help us
understand systems better, so CAP is best avoided.
The CAP theorem as formally defined [^30] is of
very narrow scope: it only considers one consistency model (namely linearizability) and one kind of
fault (network partitions, which according to data from Google are the cause of less than 8% of incidents [^41]).
It doesn’t say anything about network delays, dead nodes, or other trade-offs. Thus, although CAP
has been historically influential, it has little practical value for designing systems [^4] [^38].
There have been efforts to generalize CAP. For example, the *PACELC principle* observes that system
designers might also choose to weaken consistency at times when the network is working fine in order
to reduce latency [^39] [^40] [^42].
Thus, during a network partition (P), we need to choose between availability (A) and consistency (C);
else (E), when there is no partition, we may choose between low latency (L) and consistency (C).
However, this definition inherits several problems from CAP, such as the counterintuitive definitions of consistency and availability.
There are many more interesting impossibility results in distributed systems [^43], and CAP has now been
superseded by more precise results [^44] [^45], so it is of mostly historical interest today.
#### Linearizability and network delays {#linearizability-and-network-delays}
Although linearizability is a useful guarantee, surprisingly few systems are actually linearizable
in practice. For example, even RAM on a modern multi-core CPU is not linearizable [^46]:
if a thread running on one CPU core writes to a memory address, and a thread on another CPU core
reads the same address shortly afterward, it is not guaranteed to read the value written by the
first thread (unless a *memory barrier* or *fence* [^47] is used).
The reason for this behavior is that every CPU core has its own memory cache and store buffer.
Memory access first goes to the cache by default, and any changes are asynchronously written out to
main memory. Since accessing data in the cache is much faster than going to main memory [^48], this feature is essential for
good performance on modern CPUs. However, there are now several copies of the data (one in main
memory, and perhaps several more in various caches), and these copies are asynchronously updated, so
linearizability is lost.
Why make this trade-off? It makes no sense to use the CAP theorem to justify the multi-core memory
consistency model: within one computer we usually assume reliable communication, and we don’t expect
one CPU core to be able to continue operating normally if it is disconnected from the rest of the
computer. The reason for dropping linearizability is *performance*, not fault tolerance [^39].
The same is true of many distributed databases that choose not to provide linearizable guarantees:
they do so primarily to increase performance, not so much for fault tolerance [^42].
Linearizability is slow—and this is true all the time, not only during a network fault.
Can’t we maybe find a more efficient implementation of linearizable storage? It seems the answer is
no: Attiya and Welch [^49] prove that if you want linearizability, the response time of read and write requests is at least
proportional to the uncertainty of delays in the network. In a network with highly variable delays,
like most computer networks (see [“Timeouts and Unbounded Delays”](/en/ch9#sec_distributed_queueing)), the response time of linearizable
reads and writes is inevitably going to be high. A faster algorithm for linearizability does not
exist, but weaker consistency models can be much faster, so this trade-off is important for
latency-sensitive systems. In [“Timeliness and Integrity”](/en/ch13#sec_future_integrity) we will discuss some approaches for avoiding
linearizability without sacrificing correctness.
## ID Generators and Logical Clocks {#sec_consistency_logical}
In many applications you need to assign some sort of unique ID to database records when they are
created, which gives you a primary key by which you can refer to those records. In single-node
databases it is common to use an auto-incrementing integer, which has the advantage that it can be
stored in only 64 bits (or even 32 bits if you are sure that you will never have more than 4 billion
records, but that is risky).
Another advantage of such auto-incrementing IDs is that the order of the IDs tells you the order in
which the records were created. For example, [Figure 10-8](/en/ch10#fig_consistency_id_generator) shows a chat
application that assigns auto-incrementing IDs to chat messages as they are posted. You can then
display the messages in order of increasing ID, and the resulting chat threads will make sense:
Aaliyah posts a question that is assigned ID 1, and Bryce’s answer to the question is assigned a
greater ID, namely 3.
{{< figure src="/fig/ddia_1008.png" id="fig_consistency_id_generator" caption="Figure 10-8. Two different nodes may generate conflicting IDs." class="w-full my-4" >}}
This single-node ID generator is another example of a linearizable system. Each request to fetch the
ID is an operation that atomically increments a counter and returns the old counter value (a
*fetch-and-add* operation); linearizability ensures that if the posting of Aaliyah’s message
completes before Bryce’s posting begins, then Bryce’s ID must be greater than Aaliyah’s. The
messages by Aaliyah and Caleb in [Figure 10-8](/en/ch10#fig_consistency_id_generator) are concurrent, so linearizability
doesn’t specify how their IDs must be ordered, as long as they are unique.
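
As a minimal sketch of such a fetch-and-add generator, the following uses a lock to stand in for the CPU's atomic increment instruction. The class and function names are hypothetical, and the counter is in-memory only, so it does not survive a restart.

```python
import threading


class IdGenerator:
    """In-memory ID generator with fetch-and-add semantics."""

    def __init__(self, start=1):
        self._next = start
        self._lock = threading.Lock()

    def fetch_and_add(self):
        # The lock makes read-then-increment a single atomic step, playing
        # the role of the CPU's atomic increment instruction.
        with self._lock:
            value = self._next
            self._next += 1
            return value


def allocate_concurrently(generator, num_threads=8, ids_per_thread=1000):
    """Have several threads request IDs at once; return one list per thread."""
    results = [[] for _ in range(num_threads)]

    def worker(out):
        for _ in range(ids_per_thread):
            out.append(generator.fetch_and_add())

    threads = [threading.Thread(target=worker, args=(r,)) for r in results]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Each thread sees its own IDs in increasing order, and no ID is handed out twice, regardless of how the threads interleave.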
An in-memory single-node ID generator is easy to implement: you can use the atomic increment
instruction provided by your CPU, which allows multiple threads to safely increment the same
counter. It’s a bit more effort to make the counter persistent, so that the node can crash and
restart without resetting the counter value, which would result in duplicate IDs. But the real
problems are:
* A single-node ID generator is not fault-tolerant because that node is a single point of failure.
* It’s slow if you want to create a record in another region, as you potentially have to make a
round-trip to the other side of the planet just to get an ID.
* That single node could become a bottleneck if you have high write throughput.
There are various alternative options for ID generators that you can consider:
Sharded ID assignment
: You could have multiple nodes that assign IDs—for example, one that generates only even numbers,
and one that generates only odd numbers. In general, you can reserve some bits in the ID to
contain a shard number. Those IDs are still compact, but you lose the ordering property: for
example, if you have chat messages with IDs 16 and 17, you don’t know whether message 16 was
actually sent first, because the IDs were assigned by different nodes, and one node might have
been ahead of the other.
Preallocated blocks of IDs
: Instead of requesting individual IDs from the single-node ID generator, it could hand out blocks
of IDs. For example, node A might claim the block of IDs from 1 to 1,000, and node B might claim
the block from 1,001 to 2,000. Then each node can independently hand out IDs from its block, and
request a new block from the single-node ID generator when its supply of sequence numbers begins
to run low. However, this scheme doesn’t ensure correct ordering either: it could happen that one
message is given an ID in the range from 1,001 to 2,000, and a later message is given an ID in the
range from 1 to 1,000 if the ID was assigned by a different node.
Random UUIDs
: You can use *universally unique identifiers* (UUIDs), also known as *globally unique identifiers*
(GUIDs). These have the big advantage that they can be generated locally on any node without
requiring communication, but they require more space (128 bits). There are several different
versions of UUIDs; the simplest is version 4, which is essentially a random number that is so long
that it is very unlikely that two nodes would ever pick the same one. Unfortunately, the order of
such IDs is also random, so comparing two IDs tells you nothing about which one is newer.
Wall-clock timestamp made unique
: If your nodes’ time-of-day clock is kept approximately correct using NTP, you can generate IDs by
putting a timestamp from that clock in the most significant bits, and filling the remaining bits
with extra information that ensures the ID is unique even if the timestamp is not—for example, a
shard number and a per-shard incrementing sequence number, or a long random value. This approach
is used in Version 7 UUIDs [^50], Twitter’s Snowflake [^51], ULIDs [^52], Hazelcast’s Flake ID generator,
MongoDB ObjectIDs, and many similar schemes [^50]. You can implement these ID generators in application code or within a database [^53].
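
To make some of these options concrete, here is a rough sketch of sharded IDs, random UUIDs, and timestamp-based IDs. The bit widths and names are assumptions for illustration and do not match the layout of any particular scheme listed above. The timestamp-based generator is deliberately naive: it does not guard against the clock jumping backward.

```python
import time
import uuid

SHARD_BITS = 10      # assumed: up to 1,024 shards
SEQUENCE_BITS = 12   # assumed: up to 4,096 IDs per shard per millisecond


def sharded_id(counter, shard, shard_bits=SHARD_BITS):
    """Sharded assignment: reserve the low bits of the ID for the shard number."""
    return (counter << shard_bits) | shard


def random_id():
    """Version 4 UUID: essentially a long random number with no ordering."""
    return uuid.uuid4()


class TimestampIdGenerator:
    """Wall-clock milliseconds in the high bits, then shard and sequence number."""

    def __init__(self, shard, clock_ms=lambda: time.time_ns() // 1_000_000):
        self.shard = shard
        self.clock_ms = clock_ms
        self.last_ms = -1
        self.sequence = 0

    def next_id(self):
        now = self.clock_ms()
        if now == self.last_ms:
            # Same millisecond as the previous ID: disambiguate by sequence.
            self.sequence += 1
            if self.sequence >= 1 << SEQUENCE_BITS:
                raise OverflowError("too many IDs requested in one millisecond")
        else:
            self.last_ms, self.sequence = now, 0
        return ((now << (SHARD_BITS + SEQUENCE_BITS))
                | (self.shard << SEQUENCE_BITS)
                | self.sequence)
```

With a single shard bit, `sharded_id(8, 0, shard_bits=1)` and `sharded_id(8, 1, shard_bits=1)` yield 16 and 17: the even/odd split from the example, where the numeric order says nothing about which message was sent first.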
All these schemes generate IDs that are unique (at least with high enough probability that
collisions are vanishingly rare), but they have much weaker ordering guarantees for IDs than the
single-node auto-incrementing scheme.
As discussed in [“Timestamps for ordering events”](/en/ch9#sec_distributed_lww), wall-clock timestamps can provide at best an approximate
ordering: if an earlier write gets a timestamp from a slightly fast clock, and a later write’s
timestamp is from a slightly slow clock, the timestamp order may be inconsistent with the order in
which the events actually happened. With clock jumps due to using a non-monotonic clock, even the
timestamps generated by a single node might be ordered incorrectly. ID generators based on
wall-clock time are therefore unlikely to be linearizable.
You can reduce such ordering inconsistencies by relying on high-precision clock synchronization,
using atomic clocks or GPS receivers. But it would also be nice to be able to generate IDs that are
unique and correctly ordered without relying on special hardware. That’s what *logical clocks* are
about.
### Logical Clocks {#sec_consistency_timestamps}
In [“Unreliable Clocks”](/en/ch9#sec_distributed_clocks) we discussed time-of-day clocks and monotonic clocks. Both of these
are *physical clocks*: they measure the passing of seconds (or milliseconds, microseconds, etc.).
In distributed systems it is common to also use another kind of clock, called a *logical clock*.
While a physical clock is a hardware device that counts the seconds that have elapsed, a logical
clock is an algorithm that counts the events that have occurred. A timestamp from a logical clock
therefore doesn’t tell you what time it is, but you *can* compare two timestamps from a logical
clock to tell which one is earlier and which one is later.
The requirements for a logical clock are typically:
* that its timestamps are compact (a few bytes in size) and unique;
* that you can compare any two timestamps (i.e., they are *totally ordered*); and
* that the order of timestamps is *consistent with causality*: if operation A happened before B,
then A’s timestamp is less than B’s timestamp. (We discussed causality previously in
[“The “happens-before” relation and concurrency”](/en/ch6#sec_replication_happens_before).)
A single-node ID generator meets these requirements, but the distributed ID generators we just
discussed do not meet the causal ordering requirement.
#### Lamport timestamps {#lamport-timestamps}
Fortunately, there is a simple method for generating logical timestamps that *is* consistent with
causality, and which you can use as a distributed ID generator. It is called a *Lamport clock*,
proposed in 1978 by Leslie Lamport [^54],
in what is now one of the most-cited papers in the field of distributed systems.
[Figure 10-9](/en/ch10#fig_consistency_lamport_ts) shows how a Lamport clock would work in the chat example of
[Figure 10-8](/en/ch10#fig_consistency_id_generator). Each node has a unique identifier, which in
[Figure 10-9](/en/ch10#fig_consistency_lamport_ts) is the name “Aaliyah”, “Bryce”, or “Caleb”, but which in practice
could be a random UUID or something similar. Moreover, each node keeps a counter of the number of
operations it has processed. A Lamport timestamp is then simply a pair of (*counter*, *node ID*).
Two nodes may sometimes have the same counter value, but by including the node ID in the timestamp,
each timestamp is made unique.
{{< figure src="/fig/ddia_1009.png" id="fig_consistency_lamport_ts" caption="Figure 10-9. Lamport timestamps provide a total ordering consistent with causality." class="w-full my-4" >}}
Every time a node generates a timestamp, it increments its counter value and uses the new value.
Moreover, every time a node sees a timestamp from another node, if the counter value in that
timestamp is greater than its local counter value, it increases its local counter to match the value in the timestamp.
In [Figure 10-9](/en/ch10#fig_consistency_lamport_ts), Aaliyah had not yet seen Caleb’s message when posting her own,
and vice versa. Assuming both users start with an initial counter value of 0, both therefore
increment their local counter and attach the new counter value of 1 to their message. When Bryce
receives those messages, he increases his local counter value to 1. Finally, Bryce sends a reply to
Aaliyah’s message, for which he increments his local counter and attaches the new value of 2 to the
message.
To compare two Lamport timestamps, we first compare their counter value: for example,
(2, “Bryce”) is greater than (1, “Aaliyah”) and also greater than (1, “Caleb”). If
two timestamps have the same counter, we compare their node IDs instead, using the usual
lexicographic string comparison. Thus, the timestamp order in this example is
(1, “Aaliyah”) < (1, “Caleb”) < (2, “Bryce”).
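
The rules above fit in a few lines of code. The following sketch replays the chat example; the class and method names are hypothetical, and Python's built-in tuple comparison is used because it compares the counter first and the node ID second, as described.

```python
class LamportClock:
    """Lamport clock for one node; timestamps are (counter, node_id) pairs."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.counter = 0

    def tick(self):
        """Generate a timestamp: increment the counter and use the new value."""
        self.counter += 1
        return (self.counter, self.node_id)

    def observe(self, timestamp):
        """On seeing another node's timestamp, catch up if its counter is greater."""
        remote_counter, _ = timestamp
        self.counter = max(self.counter, remote_counter)


aaliyah = LamportClock("Aaliyah")
bryce = LamportClock("Bryce")
caleb = LamportClock("Caleb")

question = aaliyah.tick()  # Aaliyah posts without having seen Caleb's message
other = caleb.tick()       # Caleb posts without having seen Aaliyah's message

bryce.observe(question)    # Bryce receives both messages...
bryce.observe(other)
reply = bryce.tick()       # ...and then replies to Aaliyah

ordered = sorted([reply, other, question])
```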
#### Hybrid logical clocks {#hybrid-logical-clocks}
Lamport timestamps are good at capturing the order in which things happened, but they have some
limitations:
* Since they have no direct relation to physical time, you can’t use them to find, say, all the
messages that were posted on a particular date—you would need to store the physical time
separately.
* If two nodes never communicate, one node’s counter increments will never be reflected in the other
one’s counter. As a result, it could happen that events generated around the same time on
different nodes have wildly different counter values.
A *hybrid logical clock* combines the advantages of physical time-of-day clocks with the ordering
guarantees of Lamport clocks [^55].
Like a physical clock, it counts seconds or microseconds. Like a Lamport clock, when one node sees a
timestamp from another node that is greater than its local clock value, it moves its own local value
forward to match the other node’s timestamp. As a result, if one node’s clock is running fast, the
other nodes will similarly move their clocks forward when they communicate.
Every time a timestamp from a hybrid logical clock is generated, it is also incremented, which
ensures that the clock monotonically moves forward, even if the underlying physical clock jumps
backwards, for example due to NTP adjustments. Thus, the hybrid logical clock might be slightly
ahead of the underlying physical clock. Details of the algorithm ensure that this discrepancy
remains as small as possible.
As a result, you can treat a timestamp from a hybrid logical clock almost like a timestamp from a
conventional time-of-day clock, with the added property that its ordering is consistent with the
happens-before relation. It doesn’t depend on any special hardware, and requires only roughly
synchronized clocks. Hybrid logical clocks are used by CockroachDB, for example.
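
A hybrid logical clock can be sketched as follows. This is one common formulation of the update rules, not necessarily what any particular database implements: the timestamp is assumed to be a triple of the greatest physical time seen, a logical counter that breaks ties, and a node ID for uniqueness. The physical clock is passed in as a function so that the example can simulate clock jumps.

```python
class HybridLogicalClock:
    """Hybrid logical clock; timestamps are (physical, logical, node_id) triples."""

    def __init__(self, node_id, physical_clock):
        self.node_id = node_id
        self.physical_clock = physical_clock  # function returning e.g. microseconds
        self.physical = 0  # greatest physical time seen so far
        self.logical = 0   # counter that orders events sharing the same physical part

    def now(self):
        """Generate a timestamp for a local event."""
        reading = self.physical_clock()
        if reading > self.physical:
            self.physical, self.logical = reading, 0
        else:
            # The physical clock stood still or jumped backward: keep the old
            # physical part and advance the logical counter instead.
            self.logical += 1
        return (self.physical, self.logical, self.node_id)

    def observe(self, timestamp):
        """Merge a timestamp received from another node and return a new one."""
        remote_physical, remote_logical, _ = timestamp
        reading = self.physical_clock()
        new_physical = max(self.physical, remote_physical, reading)
        if new_physical == self.physical and new_physical == remote_physical:
            new_logical = max(self.logical, remote_logical) + 1
        elif new_physical == self.physical:
            new_logical = self.logical + 1
        elif new_physical == remote_physical:
            new_logical = remote_logical + 1
        else:
            new_logical = 0  # the local physical clock is ahead of everything seen
        self.physical, self.logical = new_physical, new_logical
        return (self.physical, self.logical, self.node_id)
```

If the physical clock reads 100 and then jumps back to 90, `now()` returns `(100, 0, ...)` followed by `(100, 1, ...)`, so the timestamps keep increasing. A node whose physical clock is behind adopts the sender's physical part when it calls `observe()`.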
#### Lamport/hybrid logical clocks vs. vector clocks {#lamporthybrid-logical-clocks-vs-vector-clocks}
In [“Multi-version concurrency control (MVCC)”](/en/ch8#sec_transactions_snapshot_impl) we discussed how snapshot isolation is often implemented:
essentially, by giving each transaction a transaction ID, and allowing each transaction to see
writes made by transactions with a lower ID, while making writes by transactions with higher IDs
invisible. Lamport clocks and hybrid logical clocks are a good way of generating these transaction
IDs, because they ensure that the snapshot is consistent with causality [^56].
When multiple timestamps are generated concurrently, these algorithms order them arbitrarily. This
means that when you look at two timestamps, you generally can’t tell whether they were generated
concurrently or whether one happened before the other. (In the example of
[Figure 10-9](/en/ch10#fig_consistency_lamport_ts) you actually can tell that Aaliyah and Caleb’s messages must have
been concurrent, because they have the same counter value, but when the counter values are different
you can’t tell whether they were concurrent.)
If you want to be able to determine when records were created concurrently, you need a different
algorithm, such as a *vector clock*. The downside is that the timestamps from a vector clock are
much bigger—potentially one integer for every node in the system. See [“Detecting Concurrent Writes”](/en/ch6#sec_replication_concurrent)
for more details on detecting concurrency.
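
As a small illustration of the difference, the sketch below compares vector timestamps represented as dictionaries from node ID to counter. The representation and function names are assumptions made for this example; a missing entry is treated as a counter of zero.

```python
def happened_before(a, b):
    """True if vector timestamp a causally precedes vector timestamp b."""
    nodes = a.keys() | b.keys()
    no_entry_greater = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    some_entry_smaller = any(a.get(n, 0) < b.get(n, 0) for n in nodes)
    return no_entry_greater and some_entry_smaller


def concurrent(a, b):
    """True if neither timestamp precedes the other: each is ahead on some node."""
    nodes = a.keys() | b.keys()
    a_ahead = any(a.get(n, 0) > b.get(n, 0) for n in nodes)
    b_ahead = any(b.get(n, 0) > a.get(n, 0) for n in nodes)
    return a_ahead and b_ahead


# The chat example again, this time with vector timestamps.
question = {"Aaliyah": 1}
other = {"Caleb": 1}
reply = {"Aaliyah": 1, "Caleb": 1, "Bryce": 1}
```

Here `concurrent(question, other)` is true and `happened_before(question, reply)` is true. Unlike with Lamport timestamps, the answer does not depend on the counters happening to be equal.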
### Linearizable ID Generators {#sec_consistency_linearizable_id}
Although Lamport clocks and hybrid logical clocks provide useful ordering guarantees, that ordering
is still weaker than the linearizable single-node ID generator we talked about previously. Recall
that linearizability requires that if request A completed before request B began, then B must have
the higher ID, even if A and B never communicated with each other. On the other hand, Lamport clocks
can only ensure that a node generates timestamps that are greater than any other timestamp that node
has seen; they can’t say anything about timestamps that the node hasn’t seen.
[Figure 10-10](/en/ch10#fig_consistency_permissions) shows how a non-linearizable ID generator could cause problems.
Imagine a social media website where user A wants to share an embarrassing photo privately with
their friends. A’s account is initially public, but using their laptop, A first changes their
account settings to private. Then A uses their phone to upload the photo. Since A performed these
updates in sequence, they might reasonably expect the photo upload to be subject to the new,
restricted account permissions.
{{< figure src="/fig/ddia_1010.png" id="fig_consistency_permissions" caption="Figure 10-10. An example of a permission system using Lamport timestamps." class="w-full my-4" >}}
The account permission and the photo are stored in two separate databases (or separate shards of the
same database), and let’s assume they use a Lamport clock or hybrid logical clock to assign a
timestamp to every write. Since the photos database didn’t read from the accounts database, it’s
possible that the local counter in the photos database is slightly behind, and therefore the photo
upload is assigned a lower timestamp than the update of the account settings.
Next, let’s say that a viewer (who is not friends with A) is looking at A’s profile, and their read
uses an MVCC implementation of snapshot isolation. It could happen that the viewer’s read has a
timestamp that is greater than that of the photo upload, but less than that of the account settings
update. As a result, the system will determine that the account is still public at the time of the
read, and therefore show the viewer the embarrassing photo that they were not supposed to see.
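
The scenario can be reproduced in a few lines. The counter values, node names, and the simplified visibility rule below are all assumptions chosen to trigger the anomaly; timestamps are Lamport-style (counter, node ID) pairs.

```python
# A's two writes, in the order A performed them. The accounts database has
# processed more operations than the photos database, so its counter is higher.
settings_write = (11, "accounts")  # first: A sets the account to private
photo_write = (4, "photos")        # second: A uploads the photo

# Each database stores versions as (timestamp, value), ordered by timestamp.
accounts_db = [((0, "accounts"), "public"), (settings_write, "private")]
photos_db = [(photo_write, "embarrassing.jpg")]


def visible(versions, snapshot_ts):
    """Simplified MVCC visibility: values written at or before the snapshot."""
    return [value for ts, value in versions if ts <= snapshot_ts]


def view_profile(snapshot_ts):
    """Return the photos a non-friend sees when reading at the given snapshot."""
    account_state = visible(accounts_db, snapshot_ts)[-1]  # latest visible setting
    photos = visible(photos_db, snapshot_ts)
    return photos if account_state == "public" else []


leaked = view_profile((7, "viewer"))      # snapshot between the two timestamps
protected = view_profile((12, "viewer"))  # snapshot after both writes
```

The snapshot at counter 7 includes the photo but not the settings change, so `leaked` contains the photo even though the upload happened after the account was made private.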
You can imagine several possible ways of fixing this problem. Maybe the photos database should have
read the user’s account status before performing the write, but it’s easy to forget such a check.
If A’s actions had been performed on the same device, maybe the app on their device could have
tracked the latest timestamp of that user’s writes—but if the user uses a laptop and a phone, as in
the example, that’s not so easy.
The simplest solution in this case would be to use a linearizable ID generator, which would ensure
that the photo upload is assigned a greater ID than the account permissions change.
#### Implementing a linearizable ID generator {#implementing-a-linearizable-id-generator}
The simplest way of ensuring that ID assignment is linearizable is by actually using a single node
for this purpose. That node only needs to atomically increment a counter and return its value when
requested, persist the counter value (so that it doesn’t generate duplicate IDs if the node crashes
and restarts), and replicate it for fault tolerance (using single-leader replication). This approach
is used in practice: for example, TiDB/TiKV calls it a *timestamp oracle*, inspired by Google’s
Percolator [^57].
As an optimization, you can avoid performing a disk write and replication on every single request.
Instead, the ID generator can write a record describing a batch of IDs; once that record is
persisted and replicated, the node can start handing out those IDs to clients in sequence. Before it
runs out of IDs in that batch, it can persist and replicate the record for the next batch. That way,
some IDs will be skipped if the node crashes and restarts or if you fail over to a follower, but you
won’t issue any duplicate or out-of-order IDs.
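
The batching optimization might look like the following sketch. A plain dictionary stands in for storage that is persisted and replicated, and the key name `reserved_up_to` is hypothetical. For simplicity, the sketch reserves the next batch only when the current one is exhausted, rather than slightly ahead of time.

```python
import threading


class BatchedIdGenerator:
    """ID generator that persists one record per batch instead of per ID."""

    def __init__(self, durable, batch_size=1000):
        self._durable = durable        # stand-in for persisted, replicated storage
        self._batch_size = batch_size
        self._lock = threading.Lock()
        # After a restart or failover, resume above the highest ID that could
        # possibly have been handed out, skipping the unused rest of the batch.
        self._limit = durable.get("reserved_up_to", 0)
        self._next = self._limit + 1

    def allocate(self):
        with self._lock:
            if self._next > self._limit:
                # Record the new batch durably before handing out any of its IDs.
                new_limit = self._limit + self._batch_size
                self._durable["reserved_up_to"] = new_limit
                self._limit = new_limit
            value = self._next
            self._next += 1
            return value
```

If a generator with a batch size of 100 hands out IDs 1 to 3 and then crashes, a new instance created from the same storage continues at 101: IDs 4 to 100 are skipped, but nothing is duplicated or issued out of order.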
You can’t easily shard the ID generator, since if you have multiple shards independently handing out
IDs, you can no longer guarantee that their order is linearizable. You also can’t easily distribute
the ID generator across multiple regions; thus, in a geographically distributed database, all
requests for IDs will have to go to a node in a single region. On the upside, the ID generator’s job
is very simple, so a single node can handle a large request throughput.
If you don’t want to use a single-node ID generator, an alternative is possible: you can do what
Google’s Spanner does, as discussed in [“Synchronized clocks for global snapshots”](/en/ch9#sec_distributed_spanner). It relies on a physical clock
that returns not just a single timestamp, but a range of timestamps indicating the uncertainty in
the clock reading. It then waits for the duration of that uncertainty interval to elapse before
returning.
Assuming that the uncertainty interval is correct (i.e., that the true current physical time always
lies within that interval), this process also ensures that if one request completes before another
begins, the later request will have a greater timestamp. This approach achieves linearizable ID
assignment without any communication: even requests in different regions will be ordered correctly,
without waiting for cross-region requests. The downside is that you need hardware and software
support to keep clocks tightly synchronized and to compute the necessary uncertainty interval.
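
The waiting step can be sketched with simulated time. Everything here is an assumption for illustration: the class names, the 4 ms uncertainty, and the choice of the interval's upper bound as the timestamp. The key property is that each node's clock error stays within the uncertainty it reports.

```python
class SimulatedTime:
    """Shared 'true' time for the simulation, in seconds."""

    def __init__(self):
        self.now = 0.0

    def sleep(self, seconds):
        self.now += seconds


class NodeClock:
    """A node's clock: true time plus a fixed error, reported as an interval."""

    def __init__(self, time_source, error, uncertainty):
        if abs(error) > uncertainty:
            raise ValueError("true time must lie within the reported interval")
        self.time_source = time_source
        self.error = error
        self.uncertainty = uncertainty

    def now_interval(self):
        reading = self.time_source.now + self.error
        return (reading - self.uncertainty, reading + self.uncertainty)


def assign_timestamp(clock, sleep, poll=0.001):
    """Pick the latest possible current time, then wait until it has surely passed."""
    _, latest = clock.now_interval()
    timestamp = latest
    while clock.now_interval()[0] <= timestamp:
        sleep(poll)
    return timestamp


world = SimulatedTime()
fast_node = NodeClock(world, error=+0.003, uncertainty=0.004)
slow_node = NodeClock(world, error=-0.003, uncertainty=0.004)

first = assign_timestamp(fast_node, world.sleep)   # completes before...
second = assign_timestamp(slow_node, world.sleep)  # ...this request begins
```

Even though the first request is served by the node with the fast clock and the second by the node with the slow clock, `second` is greater than `first`, and the two nodes never exchanged a message.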
#### Enforcing constraints using logical clocks {#enforcing-constraints-using-logical-clocks}
In [“Constraints and uniqueness guarantees”](/en/ch10#sec_consistency_uniqueness) we saw that a linearizable compare-and-set operation can be used
to implement locks, uniqueness constraints, and similar constructs in a distributed system. This
raises the question: is a logical clock or a linearizable ID generator also sufficient to implement
these things?
The answer is: not quite. When you have several nodes that are all trying to acquire the
same lock or register the same username, you could use a logical clock to assign timestamps to those
requests, and pick the one with the lowest timestamp as the winner. If the clock is linearizable,
you know that any future requests will always generate greater timestamps, and therefore you can be
sure that no future request will receive an even lower timestamp than the winner.
Unfortunately, part of the problem is still unsolved: how does a node know whether its own timestamp
is the lowest? To be sure, it needs to hear from *every* other node that might have generated a
timestamp [^54]. If one of the other nodes
has failed in the meantime, or cannot be reached due to a network problem, this system would grind
to a halt, because we can’t be sure whether that node might have the lowest timestamp. This is not
the kind of fault-tolerant system that we need.
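
The following sketch shows why a single silent node blocks progress. It loosely follows the idea of waiting to hear from every node, and it assumes that messages from each node arrive in the order they were sent; the function and parameter names are hypothetical.

```python
def can_claim(my_request, known_requests, latest_seen, all_nodes):
    """Return True only if my_request is certain to have the lowest timestamp.

    my_request and known_requests are (counter, node_id) timestamps.
    latest_seen maps each node we have heard from to the greatest timestamp
    received from it so far.
    """
    if any(other < my_request for other in known_requests):
        return False  # someone else is known to have asked first
    for node in all_nodes:
        if node == my_request[1]:
            continue
        if node not in latest_seen:
            return False  # a silent node might still hold a lower timestamp
        if latest_seen[node] < my_request:
            return False  # this node has not yet caught up past our request
    return True


nodes = ["a", "b", "c"]
mine = (3, "a")

everyone_replied = can_claim(mine, [mine], {"b": (4, "b"), "c": (5, "c")}, nodes)
c_unreachable = can_claim(mine, [mine], {"b": (4, "b")}, nodes)
```

With replies from both other nodes the request can proceed, but when node `c` is unreachable the function can never return true, no matter how long node `a` waits.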
To implement locks, leases, and similar constructs in a fault-tolerant way, we need something
stronger than logical clocks or ID generators: we need consensus.
## Consensus {#sec_consistency_consensus}
In this chapter we have seen several examples of things that are easy when you have only a single
node, but which get a lot harder if you want fault tolerance:
* A database can be linearizable if you have only a single leader, and you make all reads and writes
on that leader. But how do you fail over if that leader fails, while avoiding split brain? How do
you ensure that a node that believes itself to be the leader hasn’t actually been voted out in the meantime?
* A linearizable ID generator on a single node is just a counter with an atomic fetch-and-add
instruction, but what if it crashes?
* An atomic compare-and-set (CAS) operation is useful for many things, such as deciding who gets a
lock or lease when several processes are racing to acquire it, or ensuring the uniqueness of a
file or user with a given name. On a single node, CAS may be as simple as one CPU instruction, but
how do you make it fault-tolerant?
It turns out that all of these are instances of the same fundamental distributed systems problem:
*consensus*. Consensus is one of the most important and fundamental problems in distributed
computing; it is also infamously difficult to get right [^58] [^59],
and many systems have got it wrong in the past. Now that we have discussed replication
([Chapter 6](/en/ch6#ch_replication)), transactions ([Chapter 8](/en/ch8#ch_transactions)), system models ([Chapter 9](/en/ch9#ch_distributed)), and
linearizability (this chapter), we are finally ready to tackle the consensus problem.
The best-known consensus algorithms are Viewstamped Replication [^60] [^61], Paxos [^58] [^62] [^63] [^64],
Raft [^23] [^65] [^66], and Zab [^18] [^22] [^67]. There are quite a few similarities between these algorithms, but they are not the same [^68] [^69].
These algorithms work in a non-Byzantine system model: that is, network communication may be
arbitrarily delayed or dropped, and nodes may crash, restart, and become disconnected, but the
algorithms assume that nodes otherwise follow the protocol correctly and do not behave maliciously.
There are also consensus algorithms that can tolerate some Byzantine nodes, i.e., nodes that don’t
correctly follow the protocol (for example, by sending contradictory messages to other nodes). A
common assumption is that fewer than one-third of the nodes are Byzantine-faulty [^26] [^70].
Such *Byzantine fault tolerant* (BFT) consensus algorithms are used in blockchains [^71].
However, as explained in [“Byzantine Faults”](/en/ch9#sec_distributed_byzantine), BFT algorithms are beyond the scope of this
book.
--------
> [!TIP] THE IMPOSSIBILITY OF CONSENSUS
You may have heard about the FLP result [^72]—named after the
authors Fischer, Lynch, and Paterson—which proves that there is no algorithm that is always able to
reach consensus if there is a risk that a node may crash. In a distributed system, we must assume
that nodes may crash, so reliable consensus is impossible. Yet, here we are, discussing algorithms
for achieving consensus. What is going on here?
Firstly, FLP doesn’t say that we can never reach consensus—it only says that we can’t guarantee that
a consensus algorithm will *always* terminate. Moreover, the FLP result is proved assuming a
deterministic algorithm in the asynchronous system model (see [“System Model and Reality”](/en/ch9#sec_distributed_system_model)),
which means the algorithm cannot use any clocks or timeouts. If it can use timeouts to suspect that
another node may have crashed (even if the suspicion is sometimes wrong), then consensus becomes solvable [^73].
Even just allowing the algorithm to use random numbers is sufficient to get around the impossibility result [^74].
Thus, although the FLP result about the impossibility of consensus is of great theoretical
importance, distributed systems can usually achieve consensus in practice.
--------
### The Many Faces of Consensus {#sec_consistency_faces}
Consensus can be expressed in several different ways:
* *Single-value consensus* is very similar to an atomic *compare-and-set* operation, and it can be
used to implement locks, leases, and uniqueness constraints.
* Constructing an *append-only log* also requires consensus; it is usually formalized as *total
order broadcast*. With a log you can build *state machine replication*, leader-based replication,
event sourcing, and other useful things.
* *Atomic commitment* of a multi-database or multi-shard transaction requires that all participants
agree on whether to commit or abort the transaction.
We will explore all of these shortly. In fact, these problems are all equivalent to each other: if
you have an algorithm that solves one of these problems, you can convert it into a solution for any
of the others. This is quite a profound and perhaps surprising insight! And that’s why we can lump
all of these things together under “consensus”, even though they look quite different on the surface.
#### Single-value consensus {#single-value-consensus}
The standard formulation of consensus involves getting multiple nodes to agree on a single value.
For example:
* When a database with single-leader replication first starts up, or when the existing leader fails,
several nodes may concurrently try to become the leader. Similarly, multiple nodes may race to
acquire a lock or lease. Consensus allows them to decide which one wins.
* If several people concurrently try to book the last seat on an airplane, or the same seat in a
theater, or try to register an account with the same username, then a consensus algorithm could
determine which one should succeed.
More generally, one or more nodes may *propose* values, and the consensus algorithm *decides* on one
of those values. In the examples above, each node could propose its own ID, and the algorithm
decides which node ID should become the new leader, the holder of the lease, or the buyer of the
airplane/theater seat. In this formalism, a consensus algorithm must satisfy the following
properties [^26]:
Uniform agreement
: No two nodes decide differently.
Integrity
: Once a node has decided one value, it cannot change its mind by deciding another value.
Validity
: If a node decides value *v*, then *v* was proposed by some node.
Termination
: Every node that does not crash eventually decides some value.
If you want to decide multiple values, you can run a separate instance of the consensus algorithm
for each. For example, you could have a separate consensus run for each bookable seat in the
theater, so that you get one decision (one buyer) for each seat.
The uniform agreement and integrity properties define the core idea of consensus: everyone decides
on the same outcome, and once you have decided, you cannot change your mind. The validity property
rules out trivial solutions: for example, you could have an algorithm that always decides `null`, no
matter what was proposed; this algorithm would satisfy the agreement and integrity properties, but
not the validity property.
If you don’t care about fault tolerance, then satisfying the first three properties is easy: you can
just hardcode one node to be the “dictator,” and let that node make all of the decisions. However,
if that one node fails, then the system can no longer make any decisions—just like single-leader
replication without failover. All the difficulty arises from the need for fault tolerance.
The termination property formalizes the idea of fault tolerance. It essentially says that a
consensus algorithm cannot simply sit around and do nothing forever—in other words, it must make
progress. Even if some nodes fail, the other nodes must still reach a decision. (Termination is a
liveness property, whereas the other three are safety properties—see
[“Safety and liveness”](/en/ch9#sec_distributed_safety_liveness).)
If a crashed node may recover, you could just wait for it to come back. However, consensus must
ensure that it makes a decision even if a crashed node suddenly disappears and never comes back.
(Instead of a software crash, imagine that there is an earthquake, and the datacenter containing
your node is destroyed by a landslide. You must assume that your node is buried under 30 feet of mud
and is never going to come back online.)
Of course, if *all* nodes crash and none of them are running, then it is not possible for any
algorithm to decide anything. There is a limit to the number of failures that an algorithm can
tolerate: in fact, it can be proved that any consensus algorithm requires at least a majority of
nodes to be functioning correctly in order to assure termination [^73]. That majority can safely form a quorum
(see [“Quorums for reading and writing”](/en/ch6#sec_replication_quorum_condition)).
Thus, the termination property is subject to the assumption that fewer than half of the nodes are
crashed or unreachable. However, most consensus algorithms ensure that the safety
properties—agreement, integrity, and validity—are always met, even if a majority of nodes fail or
there is a severe network problem [^75].
Thus, a large-scale outage can stop the system from being able to process requests, but it cannot
corrupt the consensus system by causing it to make inconsistent decisions.
#### Compare-and-set as consensus {#compare-and-set-as-consensus}
A compare-and-set (CAS) operation checks whether the current value of some object equals some
expected value; if yes, it atomically updates the object to some new value; if no, it leaves the
object unchanged and returns an error.
If you have a fault-tolerant, linearizable CAS operation, it is easy to solve the consensus problem:
initially set the object to a null value; each node that wants to propose a value invokes CAS with
the expected value being null, and the new value being the value it wants to propose (assuming it is
non-null). The decided value is then whatever value the object is set to.
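The construction above can be sketched in a few lines of Python. This is only an illustration under strong assumptions: `CasRegister` is a hypothetical in-process object guarded by a mutex, standing in for a register that would really have to be fault-tolerant and linearizable across nodes, and `propose` is a name chosen here for the example.

```python
import threading


class CasRegister:
    """In-process stand-in for a fault-tolerant, linearizable CAS register."""

    def __init__(self):
        self._value = None              # null means "nothing decided yet"
        self._lock = threading.Lock()   # makes each operation atomic

    def compare_and_set(self, expected, new):
        """Set the value to `new` only if it currently equals `expected`."""
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False

    def read(self):
        with self._lock:
            return self._value


def propose(register, value):
    """Propose a non-null value and return the decided value."""
    if value is None:
        raise ValueError("null is reserved for 'not yet decided'")
    register.compare_and_set(None, value)   # succeeds only for the first caller
    return register.read()                  # every caller reads the same winner


register = CasRegister()
decisions = [propose(register, node_id) for node_id in ["node-a", "node-b", "node-c"]]
```

Every call to `propose` returns the value of the first successful CAS, so all proposers decide the same value regardless of what they proposed themselves.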
Likewise, if you have a solution for consensus, you can implement CAS: whenever one or more nodes
want to perform CAS with the same expected value, you use the consensus protocol to propose the new
values in the CAS invocation, and then set the object to whatever value was decided by the
consensus. Any CAS invocations whose new value was not decided return an error. CAS invocations with
different expected values use separate runs of the consensus protocol.
This shows that CAS and consensus are equivalent to each other [^28] [^73].
Again, both are straightforward on a single node, but challenging to make fault-tolerant. As an
example of CAS in a distributed setting, we saw conditional write operations for object stores in
[“Databases backed by object storage”](/en/ch6#sec_replication_object_storage), which allow a write to happen only if an object with the same
name has not been created or modified by another client since the current client last read it.
However, a linearizable read-write register is not sufficient to solve consensus. The FLP result
tells us that consensus cannot be solved by a deterministic algorithm in the asynchronous crash-stop
model [^72], but we saw in [“Linearizability and quorums”](/en/ch10#sec_consistency_quorum_linearizable) that a linearizable register can be implemented using quorum
reads/writes in this model [^24] [^25] [^26]. From this it follows that a linearizable register cannot solve consensus.
#### Shared logs as consensus {#sec_consistency_shared_logs}
We have seen several examples of logs, such as replication logs, transaction logs, and write-ahead
logs. A log stores a sequence of *log entries*, and anyone who reads it sees the same entries in the
same order. Sometimes a log has a single writer that is allowed to append new entries, but a *shared
log* is one where multiple nodes can request entries to be appended. An example is single-leader
replication: any client can ask the leader to make a write, which the leader appends to the
replication log, and then all followers apply the writes in the same order as the leader.
More formally, a shared log supports two operations: you can request for a value to be added to the
log, and you can read the entries in the log. It must satisfy the following properties:
Eventual append
: If a node requests for some value to be added to the log, and the node does not crash, then that node
must eventually read that value in a log entry.
Reliable delivery
: No log entries are lost: if one node reads some log entry, then eventually every node that does
not crash must also read that log entry.
Append-only
: Once a node has read some log entry, that entry is immutable, and new log entries can only be added after
it, but not before. A node may re-read the log, in which case it sees the same log entries in the
same order as it read them initially (even if the node crashes and restarts).
Agreement
: If two nodes both read some log entry *e*, then prior to *e* they must have read exactly the same
sequence of log entries in the same order.
Validity
: If a node reads a log entry containing some value, then some node previously requested for that
value to be added to the log.
--------
> [!NOTE]
> A shared log is formally known as a *total order broadcast*, *atomic broadcast*, or *total order multicast* protocol [^26] [^76] [^77].
> It’s the same thing described in different words: requesting a value to be added to the log is then called “broadcasting” it, and reading a log entry is called “delivering” it.
--------
If you have an implementation of a shared log, it is easy to solve the consensus problem: every node
that wants to propose a value requests for it to be added to the log, and whichever value is read
back in the first log entry is the value that is decided. Since all nodes read log entries in the
same order, they are guaranteed to agree on which value is delivered first [^28].
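As a minimal sketch of this reduction, assume a hypothetical `SharedLog` class that is just an in-process list; a real shared log would be replicated and fault-tolerant. The `propose` function is likewise a name invented for this example.

```python
class SharedLog:
    """In-process stand-in for a fault-tolerant shared log."""

    def __init__(self):
        self._entries = []

    def append(self, value):
        """Request for a value to be added to the log."""
        self._entries.append(value)

    def read(self):
        """Return all log entries in log order."""
        return list(self._entries)


def propose(log, value):
    """Propose a value, then decide whatever is in the first log entry."""
    log.append(value)
    return log.read()[0]


log = SharedLog()
decisions = [propose(log, value) for value in ["node-b", "node-a", "node-c"]]
```

All three proposers decide `"node-b"`, because that is the value in the first log entry that each of them reads back.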
Conversely, if you have a solution for consensus, you can implement a shared log. The details are a
bit more complicated, but the basic idea is this [^73]:
1. You have a slot in the log for every future log entry, and you run a separate instance of the
consensus algorithm for every such slot to decide what value should go in that entry.
2. When a node wants to add a value to the log, it proposes that value for one of the slots that has
not yet been decided.
3. When the consensus algorithm decides for one of the slots, and all the previous slots have
already been decided, then the decided value is appended as a new log entry, and any consecutive
slots that have been decided also have their decided value appended to the log.
4. If a proposed value was not chosen for some slot, the node that wanted to add it retries by
proposing it for a later slot.
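The four steps above can be sketched as follows. The sketch makes several assumptions that are not part of the original description: `SingleValueConsensus` is a trivial single-process stand-in in which the first proposal wins, each request is tagged with a unique ID so that two requests carrying the same value can be told apart, and the `start_slot` parameter models a proposer whose view of the first undecided slot may be stale.

```python
import itertools


class SingleValueConsensus:
    """Stand-in for one consensus instance: the first proposal is decided."""

    def __init__(self):
        self.decided = False
        self.value = None

    def propose(self, value):
        if not self.decided:
            self.decided, self.value = True, value
        return self.value


class SlotLog:
    """Shared log built from one consensus instance per slot."""

    def __init__(self):
        self._slots = []
        self._request_ids = itertools.count()

    def _slot(self, index):
        while len(self._slots) <= index:   # step 1: a slot for every future entry
            self._slots.append(SingleValueConsensus())
        return self._slots[index]

    def append(self, value, start_slot=0):
        """Add `value` to the log and return the index of the slot it won."""
        entry = (next(self._request_ids), value)   # unique tag per request
        index = start_slot
        while self._slot(index).propose(entry) != entry:   # step 2: propose
            index += 1                                     # step 4: retry later slot
        return index

    def read(self):
        """Step 3: only the gap-free prefix of decided slots is readable."""
        values = []
        for slot in self._slots:
            if not slot.decided:
                break
            values.append(slot.value[1])
        return values
```

A proposer that loses a slot simply moves on to the next one, so every requested value eventually lands in exactly one slot, and readers never see an entry whose predecessors are still undecided.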
This shows that consensus is equivalent to total order broadcast and shared logs. Single-leader
replication without failover does not meet the liveness requirements, since it stops delivering
messages if the leader crashes. As usual, the challenge is in performing failover safely and
automatically.
#### Fetch-and-add as consensus {#fetch-and-add-as-consensus}
The linearizable ID generator we saw in [“Linearizable ID Generators”](/en/ch10#sec_consistency_linearizable_id) comes close to solving
consensus, but it falls slightly short. We can implement such an ID generator using a fetch-and-add
operation, which atomically increments a counter and returns the old counter value.
If you have a CAS operation, it’s easy to implement fetch-and-add: first read the counter value,
then perform a CAS where the expected value is the value you read, and the new value is that value
plus one. If the CAS fails, you retry the whole process until the CAS succeeds. This is less
efficient than a native fetch-and-add operation when there is contention, but it is functionally
equivalent. Since you can implement CAS using consensus, you can also implement fetch-and-add using
consensus.
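The retry loop described above might look like this. `CasCounter` is a hypothetical in-process counter whose operations are made atomic with a mutex; it stands in for a fault-tolerant CAS object, and the `delta` parameter is an addition for this example (the text only needs an increment of one).

```python
import threading


class CasCounter:
    """In-process stand-in for a counter that offers read and CAS."""

    def __init__(self, initial=0):
        self._value = initial
        self._lock = threading.Lock()

    def read(self):
        with self._lock:
            return self._value

    def compare_and_set(self, expected, new):
        with self._lock:
            if self._value != expected:
                return False
            self._value = new
            return True


def fetch_and_add(counter, delta=1):
    """Atomically add `delta` and return the old value, using only read and CAS."""
    while True:
        old = counter.read()
        if counter.compare_and_set(old, old + delta):
            return old
        # Another caller changed the counter between our read and our CAS: retry.
```

Under contention a caller may loop several times before its CAS succeeds, which is the inefficiency mentioned above, but each successful call still returns a distinct old value.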
Conversely, if you have a fault-tolerant fetch-and-add operation, can you solve the consensus
problem? Let’s say you initialize the counter to zero, and every node that wants to propose a value
invokes the fetch-and-add operation to increment the counter. Since the fetch-and-add operation is
atomic, one of the nodes will read the initial value of zero, and the others will all read a value
that has been incremented at least once.
Now let’s say that the node that reads zero is the winner, and its value is decided. That works for
the node that read zero, but the other nodes have a problem: they know that they are not the winner,
but they don’t know which of the other nodes has won. The winner could send a message to the other
nodes to let them know it has won, but what if the winner crashes before it has a chance to send
this message? In that case the other nodes are left hanging, unable to decide any value, and thus
the consensus does not terminate. And the other nodes can’t fall back to another node, because the
node that read zero may yet come back and rightly decide the value it proposed.
An exception is if we know for sure that no more than two nodes will propose a value. In that case,
the nodes can send each other the values they want to propose, and then each perform the
fetch-and-add operation. The node that reads zero decides its own value, and the node that reads one
decides the other node’s value. This solves the consensus problem among two nodes, which is why we
can say that fetch-and-add has a *consensus number* of two [^28].
In contrast, CAS and shared logs solve consensus for any number of nodes that may propose values, so
they have a consensus number of ∞ (infinity).
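The two-node case can be sketched as below. The class and function names are hypothetical, the counter is a single-process stand-in for a fault-tolerant fetch-and-add object, and the sketch assumes that both nodes have already exchanged their proposed values before calling `decide`.

```python
import itertools


class FetchAndAddCounter:
    """In-process stand-in for a fault-tolerant fetch-and-add counter."""

    def __init__(self):
        self._tickets = itertools.count()   # yields 0, 1, 2, ...

    def fetch_and_add(self):
        """Increment the counter and return its old value."""
        return next(self._tickets)


def decide(counter, my_value, other_value):
    """Run by each of the two nodes after they have exchanged proposals."""
    ticket = counter.fetch_and_add()
    # Whoever reads zero wins; the node that reads one adopts the winner's value.
    return my_value if ticket == 0 else other_value
```

With a third proposer this breaks down: a node that reads two knows it has lost, but it cannot tell which of the other two nodes read zero.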
#### Atomic commitment as consensus {#atomic-commitment-as-consensus}
In [“Distributed Transactions”](/en/ch8#sec_transactions_distributed) we saw the *atomic commitment* problem, which is to ensure that
the databases or shards involved in a distributed transaction all either commit or abort a
transaction. We also saw the *two-phase commit* algorithm, which relies on a coordinator that is a
single point of failure.
What is the relationship between consensus and atomic commitment? At first glance, they seem very
similar—both require nodes to come to some form of agreement. However, there is one important
difference: with consensus it’s okay to decide any value that was proposed, whereas with atomic
commitment the algorithm *must* abort if *any* of the participants voted to abort. More precisely,
atomic commitment requires the following properties [^78]:
Uniform agreement
: No two nodes decide on different outcomes.
Integrity
: Once a node has decided one outcome, it cannot change its mind by deciding another outcome.
Validity
: If a node decides to commit, then all nodes must have previously voted to commit. If any node
voted to abort, the nodes must abort.
Non-triviality
: If all nodes vote to commit, and no communication timeouts occur, then all nodes must decide to
commit.
Termination
: Every node that does not crash eventually decides to either commit or abort.
The validity property ensures that a transaction can only commit if all nodes agree; and the
non-triviality property ensures the algorithm can’t simply always abort (but it allows an abort if
any of the communication among the nodes times out). The other three properties are basically the
same as for consensus.
If you have a solution for consensus, there are multiple ways you could solve atomic commitment [^78] [^79].
One works like this: when you want to commit the transaction, every node sends its vote to commit or
abort to every other node. Nodes that receive a vote to commit from themselves and from every other node
propose “commit” using the consensus algorithm; nodes that receive a vote to abort, or which
experience a timeout, propose “abort” using the consensus algorithm. When a node finds out what the
consensus algorithm decided, it commits or aborts accordingly.
In this algorithm, “commit” will only be proposed if all nodes voted to commit. If any node voted to
abort, all proposals in the consensus algorithm will be “abort”. It could happen that some nodes
propose “abort” while others propose “commit” if all nodes voted to commit but some communication
timed out; in this case it doesn’t matter whether the nodes commit or abort, as long as they all do the same.
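A rough sketch of this algorithm follows. All names are hypothetical, and the consensus step is replaced by a stand-in that decides the first proposal in the list; a real consensus algorithm could decide any one of the proposed values. The `delivered_to` map models which votes reached each node before its timeout, and it is assumed to always include the node's own vote.

```python
def choose_proposal(received_votes, all_nodes):
    """Pick what one node proposes to the consensus algorithm.

    `received_votes` maps each node whose vote arrived before the timeout
    (including this node itself) to "commit" or "abort".
    """
    if "abort" in received_votes.values():
        return "abort"
    if set(received_votes) != set(all_nodes):
        return "abort"   # at least one vote timed out
    return "commit"


def run_atomic_commit(votes, delivered_to):
    """Return (decision, proposals) for one transaction.

    `votes` maps node -> its own vote; `delivered_to` maps node -> the set of
    senders whose votes it received in time.
    """
    nodes = list(votes)
    proposals = [
        choose_proposal({sender: votes[sender] for sender in delivered_to[node]}, nodes)
        for node in nodes
    ]
    decision = proposals[0]   # stand-in for consensus: some proposed value is decided
    return decision, proposals
```

If any node votes to abort, every other node either receives that vote or times out waiting for it, so every proposal is "abort" and the stand-in can only decide "abort".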
If you have a fault-tolerant atomic commitment protocol, you can also solve consensus. Every node
that wants to propose a value starts a transaction on a quorum of nodes, and at each node it
performs a single-node CAS to set a register to the proposed value if its value has not already been
set by another transaction. If the CAS succeeds, the node votes to commit, otherwise it votes to
abort. If the atomic commit protocol decides to commit a transaction, its value is decided for
consensus; if atomic commit aborts, the proposing node retries with a new transaction.
This shows that atomic commit and consensus are also equivalent to each other.
### Consensus in Practice {#sec_consistency_total_order}
We have seen that single-value consensus, CAS, shared logs, and atomic commitment are all equivalent
to each other: you can convert a solution to one of them into a solution to any of the others. That
is a valuable theoretical insight, but it doesn’t answer the question: which of these many
formulations of consensus is the most useful in practice?
The answer is that most consensus systems provide shared logs, also known as total order broadcast.
Raft, Viewstamped Replication, and Zab provide shared logs right out of the box. Paxos provides
single-value consensus, but in practice most systems using Paxos actually use the extension called
Multi-Paxos, which also provides a shared log.
#### Using shared logs {#sec_consistency_smr}
A shared log is a good fit for database replication: if every log entry represents a write to the
database, and every replica processes the same writes in the same order using deterministic logic,
then the replicas will all end up in a consistent state. This idea is known as *state machine replication* [^80],
and it is the principle behind event sourcing, which we saw in [“Event Sourcing and CQRS”](/en/ch3#sec_datamodels_events). Shared
logs are also useful for stream processing, as we shall see in [Chapter 12](/en/ch12#ch_stream).
Similarly, a shared log can be used to implement serializable transactions: as discussed in
[“Actual Serial Execution”](/en/ch8#sec_transactions_serial), if every log entry represents a deterministic transaction to be
executed as a stored procedure, and if every node executes those transactions in the same order,
then the transactions will be serializable [^81] [^82].
--------
> [!NOTE]
> Sharded databases with a strong consistency model often maintain a separate log per shard, which
> improves scalability, but limits the consistency guarantees (e.g., consistent snapshots, foreign key
> references) they can offer across shards. Serializable transactions across shards are possible, but
> require additional coordination [^83].
--------
A shared log is also powerful because it can easily be adapted to other forms of consensus:
* We saw previously how to use it to implement single-value consensus and CAS: simply decide the
value that appears first in the log.
* If you want many instances of single-value consensus (e.g. one per seat in a theater that several
people are trying to book), include the seat number in the log entries, and decide the first log
entry that contains a given seat number.
* If you want an atomic fetch-and-add, put the number to add to the counter in a log entry, and the
current counter value is the sum of all of the log entries so far. A simple counter on log entries
can be used to generate fencing tokens (see [“Fencing off zombies and delayed requests”](/en/ch9#sec_distributed_fencing_tokens)); for example, in
ZooKeeper, this sequence number is called `zxid` [^18].
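The adaptations in this list amount to deterministic functions over the log contents. The sketch below assumes the log has already been read into a plain Python list; the function names and the entry formats (a `(seat, buyer)` pair, or a bare number to add) are invented for this example.

```python
import itertools


def seat_winners(log):
    """Decide, per seat, the buyer in the first log entry naming that seat."""
    winners = {}
    for seat, buyer in log:
        winners.setdefault(seat, buyer)   # later entries for the same seat lose
    return winners


def fetch_and_add_results(increments):
    """Return the counter value that each fetch-and-add entry observed.

    The old value seen by an entry is the sum of all entries before it.
    """
    running_totals = itertools.accumulate(increments, initial=0)
    return list(running_totals)[:-1]


booking_log = [("12A", "alice"), ("12A", "bob"), ("14C", "carol")]
increment_log = [5, 1, 1]
```

Because every node reads the same entries in the same order, every node that evaluates these functions over the same log prefix computes the same winners and the same counter values.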
#### From single-leader replication to consensus {#from-single-leader-replication-to-consensus}
We saw previously that single-value consensus is easy if you have a single “dictator” node that
makes the decision, and likewise a shared log is easy if a single leader is the only node that is
allowed to append entries to it. The question is how to provide fault tolerance if that node fails.
Traditionally, databases with single-leader replication didn’t solve this problem: they left leader
failover as an action that a human administrator had to perform manually. Unfortunately, this means
a significant amount of downtime, since there is a limit to how fast humans can react, and it
doesn’t satisfy the termination property of consensus. For consensus, we require that the algorithm
can automatically choose a new leader. (Not all consensus algorithms have a leader, but the commonly
used algorithms do [^84].)
However, there is a problem. We previously discussed the problem of split brain, and said that all
nodes need to agree who the leader is—otherwise two different nodes could each believe themselves to
be the leader, and consequently make inconsistent decisions. Thus, it seems like we need consensus
in order to elect a leader, and we need a leader in order to solve consensus. How do we break out of
this conundrum?
In fact, consensus algorithms don’t require that there is only one leader at any one time. Instead,
they make a weaker guarantee: they define an *epoch number* (called the *ballot number* in Paxos,
*view number* in Viewstamped Replication, and *term number* in Raft) and guarantee that within each
epoch, the leader is unique.
When a node believes that the current leader is dead because it hasn’t heard from the leader for
some timeout, it may start a vote to elect a new leader. This election is given a new epoch number
that is greater than any previous epoch. If there is a conflict between two different leaders in two
different epochs (perhaps because the previous leader actually wasn’t dead after all), then the
leader with the higher epoch number prevails.
Before a leader is allowed to append the next entry to the shared log, it must first check that
there isn’t some other leader with a higher epoch number which might append a different entry. It
can do this by collecting votes from a quorum of nodes—typically, but not always, a majority of
nodes [^85]. A node votes yes only if it is not aware of any other leader with a higher epoch.
Thus, we have two rounds of voting: once to choose a leader, and a second time to vote on a leader’s
proposal for the next entry to append to the log. The quorums for those two votes must overlap: if
a vote on a proposal succeeds, at least one of the nodes that voted for it must have also
participated in the most recent successful leader election [^85]. Thus, if the vote on a proposal
passes without revealing any higher-numbered epoch, the current leader can conclude that no leader
with a higher epoch number has been elected, and therefore it can safely append the proposed entry
to the log [^26] [^86].
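The role of epoch numbers in the two rounds of voting can be illustrated with a heavily simplified sketch. It assumes majority quorums, uses hypothetical names (`Voter`, `collect_votes`), and leaves out everything else a real algorithm needs, such as log contents and the rule that a node votes for only one candidate per epoch.

```python
class Voter:
    """A node that remembers the highest epoch it has voted for."""

    def __init__(self):
        self.highest_epoch_seen = 0

    def vote(self, epoch):
        """Vote yes only if this node is not aware of a higher epoch."""
        if epoch < self.highest_epoch_seen:
            return False
        self.highest_epoch_seen = epoch
        return True


def collect_votes(reachable_voters, epoch, cluster_size):
    """Return True if a majority of the whole cluster voted yes for `epoch`."""
    yes_votes = sum(1 for voter in reachable_voters if voter.vote(epoch))
    return yes_votes > cluster_size // 2


voters = [Voter() for _ in range(5)]
elected_a = collect_votes(voters[0:3], epoch=1, cluster_size=5)     # leader A, epoch 1
elected_b = collect_votes(voters[2:5], epoch=2, cluster_size=5)     # leader B, epoch 2
stale_append = collect_votes(voters, epoch=1, cluster_size=5)       # A tries to append
fresh_append = collect_votes(voters, epoch=2, cluster_size=5)       # B tries to append
```

Here the old leader's proposal in epoch 1 gets only two yes votes, because three of the five voters took part in the election for epoch 2. Any majority that the old leader contacts must overlap with that election quorum, which is how it finds out that it has been superseded.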
These two rounds of voting look superficially similar to two-phase commit, but they are very
different protocols. In consensus algorithms, any node can start an election and it requires only a
quorum of nodes to respond; in 2PC, only the coordinator can request votes, and it requires a “yes”
vote from *every* participant before it can commit.
#### Subtleties of consensus {#subtleties-of-consensus}
This basic structure is common to all of Raft, Multi-Paxos, Zab, and Viewstamped Replication: a vote
by a quorum of nodes elects a leader, and then another quorum vote is required for every entry that
the leader wants to append to the log [^68] [^69]. Every new log entry is synchronously replicated
to a quorum of nodes before it is confirmed to the client that requested the write. This ensures
that the log entry won’t be lost if the current leader fails.
However, the devil is in the details, and that’s also where these algorithms take different
approaches. For example, when the old leader fails and a new one is elected, the algorithm needs to
ensure that the new leader honors any log entries that had already been appended by the old leader
before it failed. Raft does this by only allowing a node to become the new leader if its log is at
least as up-to-date as a majority of its followers [^69].
In contrast, Paxos allows any node to become the new leader, but requires it to bring its log
up-to-date with other nodes before it can start appending new entries of its own.
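Raft's notion of "at least as up-to-date" compares the term and the index of the last entry in each log: a later term wins, and for equal terms the longer log wins. The sketch below expresses that comparison with Python tuples. The function names are hypothetical, and the election check is simplified: it ignores, for example, that a voter grants at most one vote per term.

```python
def at_least_as_up_to_date(candidate_last, voter_last):
    """Compare two logs by their last entry, given as (term, index) pairs.

    Tuple comparison checks the term first and the index second.
    """
    return candidate_last >= voter_last


def can_win_election(candidate_last, other_nodes_last):
    """Return True if a majority of the cluster would vote for the candidate.

    The candidate votes for itself; every other node grants its vote only if
    the candidate's log is at least as up-to-date as its own.
    """
    cluster_size = len(other_nodes_last) + 1
    votes = 1 + sum(
        1 for voter_last in other_nodes_last
        if at_least_as_up_to_date(candidate_last, voter_last)
    )
    return votes > cluster_size // 2
```

Since every confirmed entry is stored on a quorum of nodes, a candidate that is missing such an entry cannot collect votes from a majority, because at least one node in any majority holds the entry and would refuse.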
--------
> [!TIP] CONSISTENCY VS. AVAILABILITY IN LEADER ELECTION
If you want the consensus algorithm to strictly guarantee the properties laid out in
[“Shared logs as consensus”](/en/ch10#sec_consistency_shared_logs), it’s essential that the new leader is up-to-date with any confirmed
log entries before it can process any writes or linearizable reads. If a node with stale data were
to become the new leader, it might write a new value to log entries that were already written by the
old leader, violating the shared log’s append-only property.
In some cases, you might choose to weaken the consensus properties in order to recover more quickly
from a leader failure. For example, Kafka offers the option of enabling *unclean leader election*,
which allows any replica to become leader, even if it is not up-to-date. Also, in databases with
asynchronous replication, you cannot guarantee that any follower is up-to-date when the leader
fails.
If you drop the requirement for the new leader to be up-to-date, you may improve performance and
availability, but you are on thin ice, since the theory of consensus no longer applies. While things
will work fine as long as there are no faults, the problems discussed in [Chapter 9](/en/ch9#ch_distributed) can
easily cause a lot of data loss or corruption.
--------
Another subtlety is in how the algorithms deal with log entries that had been proposed by the old
leader before it failed, but for which the vote on appending to the log had not yet completed. You
can find discussions of these details in the references for this chapter [^23] [^69] [^86].
For databases that use a consensus algorithm for replication, writes are not the only operations that
need to be turned into log entries and replicated to a quorum. If you want to guarantee linearizable
reads, they also have to go through a quorum vote, similarly to a write, to confirm that the node that
believes itself to be the leader really is still up-to-date. Linearizable reads in etcd work like this, for example.
In their standard form, most consensus algorithms assume a fixed set of nodes—that is, nodes may go
down and come back up again, but the set of nodes that is allowed to vote is fixed when the cluster
is created. In practice, it’s often necessary to add new nodes or remove old nodes in a system
configuration. Consensus algorithms have been extended with *reconfiguration* features that make
this possible. This is especially useful when adding new regions to a system, or when migrating from
one location to another (by first adding the new nodes, and then removing the old nodes).
#### Pros and cons of consensus {#pros-and-cons-of-consensus}
Although they are complex and subtle, consensus algorithms are a huge breakthrough for distributed
systems. Consensus is essentially “single-leader replication done right”, with automatic failover on
leader failure, ensuring that no committed data is lost and no split-brain is possible, even in the
face of all the problems we discussed in [Chapter 9](/en/ch9#ch_distributed).
Since single-leader replication with automatic failover is essentially one of the definitions of
consensus, any system that provides automatic failover but does not use a proven consensus algorithm
is likely to be unsafe [^87].
Using a proven consensus algorithm is not a guarantee of correctness of the whole system—there are
still plenty of other places where bugs can lurk—but it’s a good start.
Nevertheless, consensus is not used everywhere, because the benefits come at a cost. Consensus
systems always require a strict majority to operate—three nodes to tolerate one failure, or five
nodes to tolerate two failures. Every operation needs to communicate with a quorum, so you can’t
increase throughput by adding more nodes (in fact, every node you add makes the algorithm slower).
If a network partition cuts off some nodes from the rest, only the majority portion of the network
can make progress, and the rest are blocked.
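The arithmetic behind these numbers is short enough to write down. The helper names below are made up for this example, and the sketch assumes majority quorums.

```python
def majority(cluster_size):
    """Smallest number of nodes that forms a strict majority."""
    return cluster_size // 2 + 1


def tolerated_failures(cluster_size):
    """Number of nodes that can fail while a majority remains."""
    return cluster_size - majority(cluster_size)


def partition_can_progress(partition_size, cluster_size):
    """A side of a network partition can make progress only with a majority."""
    return partition_size >= majority(cluster_size)
```

For example, `tolerated_failures(4)` equals `tolerated_failures(3)`: a fourth node raises the quorum size from two to three without allowing any additional failure, which is one reason clusters usually have an odd number of nodes.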
Consensus systems generally rely on timeouts to detect failed nodes. In environments with highly
variable network delays, especially systems distributed across multiple geographic regions, it can
be difficult to tune these timeouts: if they are too large it takes a long time to recover from a
failure; if they are too small there can be lots of unnecessary leader elections, resulting in
terrible performance as the system can end up spending more time choosing leaders than doing useful
work.
Sometimes, consensus algorithms are particularly sensitive to network problems. For example, Raft
has been shown to have unpleasant edge cases [^88] [^89]:
if the entire network is working correctly except for one particular network link that is
consistently unreliable, Raft can get into situations where leadership continually bounces between
two nodes, or the current leader is continually forced to resign, so the system effectively never
makes progress. Designing algorithms that are more robust to unreliable networks is still an open
research problem.
For systems that want to be highly available, but don’t want to accept the cost of consensus, the
only real alternative is to use a weaker consistency model instead, such as those offered by
leaderless or multi-leader replication as discussed in [Chapter 6](/en/ch6#ch_replication). These approaches
generally don’t offer linearizability, but for applications that don’t need it, that is fine.
### Coordination Services {#sec_consistency_coordination}
Consensus algorithms are useful in any distributed database that wants to offer linearizable
operations, and many modern distributed databases use consensus algorithms for replication. But one
family of systems is a particularly prominent user of consensus: *coordination services* such as
ZooKeeper, etcd, or Consul. Although these systems look superficially like any other key-value
store, they are not designed for general-purpose data storage like most databases.
Instead, they are designed to coordinate between nodes of another distributed system. For example,
Kubernetes relies on etcd, while Spark and Flink in high availability mode rely on ZooKeeper running
in the background. Coordination services are designed to hold small amounts of data that can fit
entirely in memory (although they still write to disk for durability), which is replicated across
multiple nodes using a fault-tolerant consensus algorithm.
Coordination services are modeled after Google’s Chubby lock service [^17] [^58].
They combine a consensus algorithm with several other features that turn out to be particularly
useful when building distributed systems:
Locks and leases
: We saw previously how consensus systems can implement an atomic, fault-tolerant compare-and-set
(CAS) operation. Coordination services rely on this approach to implement locks and leases: if
several nodes concurrently try to acquire the same lease, only one of them will succeed.
Support for fencing
: As discussed in [“Distributed Locks and Leases”](/en/ch9#sec_distributed_lock_fencing), when a resource is protected by a lease, you
need *fencing* to prevent clients from interfering with each other in the case of a process pause
or large network delay. Consensus systems can generate fencing tokens by giving each log entry a
monotonically increasing ID (`zxid` and `cversion` in ZooKeeper, revision number in etcd).
Failure detection
: Clients maintain a long-lived session on the coordination service, and client and server
periodically exchange heartbeats to check whether the other is still alive. Even if the connection is temporarily
interrupted, or a server fails, any leases held by the client remain active. However, if there is
no heartbeat for longer than the timeout of the lease, the coordination service assumes the client
is dead and releases the lease (ZooKeeper calls these *ephemeral nodes*).
Change notifications
: A client can request that the coordination service sends it a notification whenever certain keys
change. This allows a client to find out when another client joins the cluster (based on the value
it writes to the coordination service), or if another client fails (because its session times out
and its ephemeral nodes disappear), for example. These notifications save the client from having
to frequently poll the service to find out about changes.
Failure detection and change notifications do not require consensus, but they are useful for
distributed coordination alongside the atomic operations and fencing support that do require
consensus.
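To illustrate how leases, fencing, and failure detection fit together, here is a toy single-process sketch. It is not how ZooKeeper or etcd are implemented: a plain dictionary stands in for the replicated state, a simple counter stands in for the log position that a real service would use as the fencing token, and time is passed in explicitly as `now` so that the behavior is deterministic. All class and method names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Lease:
    holder: str
    token: int         # fencing token: increases with every grant
    expires_at: float  # logical time at which the lease lapses


class ToyLeaseService:
    """Single-process stand-in for a coordination service (no replication)."""

    def __init__(self, ttl: float) -> None:
        self.ttl = ttl
        self.next_token = 1
        self.leases: Dict[str, Lease] = {}

    def _expire(self, now: float) -> None:
        # Failure detection: drop leases whose holder stopped heartbeating.
        expired = [k for k, lease in self.leases.items() if lease.expires_at <= now]
        for key in expired:
            del self.leases[key]

    def acquire(self, key: str, client: str, now: float) -> Optional[int]:
        """Compare-and-set style acquire: succeeds only if the key is free."""
        self._expire(now)
        if key in self.leases:
            return None
        token = self.next_token
        self.next_token += 1
        self.leases[key] = Lease(client, token, now + self.ttl)
        return token

    def heartbeat(self, key: str, client: str, now: float) -> bool:
        """Extend the lease if this client still holds it."""
        self._expire(now)
        lease = self.leases.get(key)
        if lease is None or lease.holder != client:
            return False
        lease.expires_at = now + self.ttl
        return True


class FencedStorage:
    """A protected resource that rejects writes carrying a stale token."""

    def __init__(self) -> None:
        self.highest_token = 0
        self.value: Optional[str] = None

    def write(self, token: int, value: str) -> bool:
        if token < self.highest_token:
            return False  # an older lease holder woke up too late
        self.highest_token = token
        self.value = value
        return True
```

If client A acquires the lease, pauses for longer than the TTL, and client B acquires it in the meantime, then B receives a higher token than A. Once B has written to `FencedStorage`, a late write from A with its old token is rejected.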
--------
> [!TIP] MANAGING CONFIGURATION WITH COORDINATION SERVICES
Applications and infrastructure often have configuration parameters such as timeouts, thread pool
sizes, and so on. Coordination services are sometimes used to store such configuration data,
represented as key-value pairs. Processes load the latest settings upon startup, and subscribe to
receive notifications of any changes. When a configuration changes, the process can begin using the
new setting immediately or restart itself to load the latest changes.
Configuration management doesn’t need the consensus aspect of a coordination service, but it’s
convenient to use a coordination service and rely on its notification feature if you are already
running the coordination service anyway. Alternatively, a process could periodically poll for
configuration updates from a file or URL, which avoids the need for a specialized service.
--------
#### Allocating work to nodes {#allocating-work-to-nodes}
A coordination service is useful if you have several instances of a process or service, and one
of them needs to be chosen as leader or primary. If the leader fails, one of the other nodes should
take over. This is necessary for single-leader databases, but it’s also appropriate for job
schedulers and similar stateful systems.
Another use case is when you have some sharded resource (database, message streams, file storage,
distributed actor system, etc.) and need to decide which shard to assign to which node. As new nodes
join the cluster, some of the shards need to be moved from existing nodes to the new nodes in order
to rebalance the load. As nodes are removed or fail, other nodes need to take over the failed nodes’
work.
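As a rough illustration of the shard assignment problem, the following sketch reassigns shards when the set of live nodes changes, moving as few shards as it reasonably can. It is only one naive strategy, and the function name is made up; real systems also take data size, load, and placement constraints into account.

```python
from typing import Dict, List


def rebalance(assignment: Dict[int, str], live_nodes: List[str]) -> Dict[int, str]:
    """Return a shard -> node mapping that uses only live nodes, evenly loaded."""
    if not live_nodes:
        raise ValueError("at least one live node is required")
    live = sorted(live_nodes)
    shards_by_node: Dict[str, List[int]] = {node: [] for node in live}
    orphans: List[int] = []

    # Shards whose owner is still alive stay where they are for now.
    for shard in sorted(assignment):
        owner = assignment[shard]
        if owner in shards_by_node:
            shards_by_node[owner].append(shard)
        else:
            orphans.append(shard)

    # Shards of failed or removed nodes go to the least-loaded live node.
    for shard in orphans:
        target = min(live, key=lambda n: len(shards_by_node[n]))
        shards_by_node[target].append(shard)

    # Even out the load, e.g. after a new, empty node has joined.
    while True:
        busiest = max(live, key=lambda n: len(shards_by_node[n]))
        idlest = min(live, key=lambda n: len(shards_by_node[n]))
        if len(shards_by_node[busiest]) - len(shards_by_node[idlest]) <= 1:
            break
        shards_by_node[idlest].append(shards_by_node[busiest].pop())

    return {s: node for node, shards in shards_by_node.items() for s in shards}
```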
These kinds of tasks can be achieved by judicious use of atomic operations, ephemeral nodes, and
notifications in a coordination service. If done correctly, this approach allows the application to
automatically recover from faults without human intervention. It’s not easy, even with libraries
such as Apache Curator that provide higher-level tools on top of the ZooKeeper client API, but it
is still much better than attempting to implement the necessary
consensus algorithms from scratch, which would be very prone to bugs.
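The combination of an atomic create operation, ephemeral keys, and notifications can be sketched as follows. This is a toy in-memory model, not the ZooKeeper or Curator API: the class names, the one-shot `watch` callback, and the explicit `expire_session` call that simulates a session timeout are all assumptions made for the sake of the example.

```python
from typing import Callable, Dict, List


class ToyRegistry:
    """In-memory stand-in for a coordination service (no replication)."""

    def __init__(self) -> None:
        self.data: Dict[str, str] = {}   # key -> value
        self.owner: Dict[str, str] = {}  # key -> session that created it
        self.watchers: Dict[str, List[Callable[[str], None]]] = {}

    def create_ephemeral(self, key: str, value: str, session: str) -> bool:
        """Atomic create-if-absent; the key lives only as long as the session."""
        if key in self.data:
            return False
        self.data[key] = value
        self.owner[key] = session
        return True

    def watch(self, key: str, callback: Callable[[str], None]) -> None:
        """Register a one-shot callback that fires when the key is deleted."""
        self.watchers.setdefault(key, []).append(callback)

    def expire_session(self, session: str) -> None:
        """Simulate a session timeout: delete its keys and notify watchers."""
        for key in [k for k, s in self.owner.items() if s == session]:
            del self.data[key]
            del self.owner[key]
            for callback in self.watchers.pop(key, []):
                callback(key)


class Candidate:
    def __init__(self, name: str, registry: ToyRegistry) -> None:
        self.name = name
        self.registry = registry
        self.is_leader = False

    def campaign(self, key: str = "/leader") -> None:
        if self.registry.create_ephemeral(key, self.name, session=self.name):
            self.is_leader = True
        else:
            # Lost the race: ask to be told when the leader's key disappears.
            self.registry.watch(key, lambda k: self.campaign(k))
```

If candidates `a` and `b` both campaign, `a` wins; when `a`'s session expires, `b` is notified and takes over. Note that in this sketch `a.is_leader` remains `True` afterwards: nothing tells a paused process that it has been replaced, which is why a deposed leader must also be fenced off.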
A dedicated coordination service also has the advantage that it can run on a fixed set of nodes
(usually three or five), regardless of how many nodes there are in the distributed system that
relies on it for coordination. For example, in a storage system with thousands of shards, it would
be terribly inefficient to run a consensus algorithm over thousands of nodes; it’s much better to
“outsource” the consensus to a small number of nodes running a coordination service.
Normally, the kind of data managed by a coordination service is quite slow-changing: it represents
information like “the node running on IP address 10.1.1.23 is the leader for shard 7,” and such
assignments usually change on a timescale of minutes or hours. Coordination services are not
intended for storing data that may change thousands of times per second. For that, it is better to
use a conventional database; alternatively, tools like Apache BookKeeper [^90] [^91]
can be used to replicate fast-changing internal state of a service.
#### Service discovery {#service-discovery}
ZooKeeper, etcd, and Consul are also often used for *service discovery*—that is, to find out which
IP address you need to connect to in order to reach a particular service (see
[“Load balancers, service discovery, and service meshes”](/en/ch5#sec_encoding_service_discovery)). In cloud environments, where it is common for
virtual machines to continually come and go, you often don’t know the IP addresses of your services
ahead of time. Instead, you can configure your services such that when they start up they register
their network endpoints in a service registry, where they can then be found by other services.
Using a coordination service for service discovery can be convenient, as its failure detection and
change notification features make it easy for clients to keep track of service instances as they
come and go. And if you are already using a coordination service for leases, locking, or leader
election, it makes sense to also use it for service discovery, since it already knows which node
should receive requests for your service.
However, using consensus for service discovery is often overkill: this use case often doesn’t
require linearizability, and it’s more important that service discovery is highly available and
fast, since without it everything would grind to a halt. It’s therefore often preferable to cache
service discovery information and accept that it might be slightly stale. For example, DNS-based
service discovery uses multiple layers of caching to achieve good performance and availability.
To support this use case, ZooKeeper supports *observers*, which are replicas that receive the log
and maintain a copy of the data stored in ZooKeeper, but which do not participate in the consensus
algorithm’s voting process. Reads from an observer are not linearizable as they might be stale, but
they remain available even if the network is interrupted, and they increase the read throughput that
the system can support by caching.
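The same trade-off, accepting slightly stale data in exchange for availability and speed, can also be made on the client side. The sketch below caches lookups for a fixed time and falls back to the last known answer if the registry cannot be reached. The `lookup` function is a hypothetical placeholder for whatever call actually queries your registry, and raising `ConnectionError` on failure is an assumption of this example.

```python
from typing import Callable, Dict, List, Tuple


class CachingResolver:
    """Client-side cache in front of a service registry."""

    def __init__(self, lookup: Callable[[str], List[str]], ttl: float) -> None:
        self.lookup = lookup  # hypothetical function that queries the registry
        self.ttl = ttl
        self.cache: Dict[str, Tuple[float, List[str]]] = {}

    def resolve(self, service: str, now: float) -> List[str]:
        cached = self.cache.get(service)
        if cached is not None and now - cached[0] < self.ttl:
            return cached[1]  # fresh enough: don't contact the registry
        try:
            endpoints = self.lookup(service)
        except ConnectionError:
            if cached is not None:
                return cached[1]  # registry unreachable: serve stale data
            raise
        self.cache[service] = (now, endpoints)
        return endpoints
```

A client using this resolver may briefly send requests to an instance that has already gone away, so it needs to handle connection failures and retry against another endpoint.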
## Summary {#summary}
In this chapter we examined the topic of strong consistency in fault-tolerant systems: what it is,
and how to achieve it. We looked in depth at linearizability, a popular formalization of strong
consistency: it means that replicated data appears as though there were only a single copy, and all
operations act on it atomically. We saw that linearizability is useful when you need some data to be
up-to-date when you read it, or if you need to resolve a race condition (e.g. if multiple nodes are
concurrently trying to do the same thing, such as creating files with the same name).
Although linearizability is appealing because it is easy to understand—it makes a database behave
like a variable in a single-threaded program—it has the downside of being slow, especially in
environments with large network delays. Many replication algorithms don’t guarantee linearizability,
even though they might superficially appear to provide strong consistency.
Next, we applied the concept of linearizability in the context of ID generators. A single-node
auto-incrementing counter is linearizable, but not fault-tolerant. Many distributed ID generation
schemes don’t guarantee that the IDs are ordered consistently with the order in which the events
actually happened. Logical clocks such as Lamport clocks and hybrid logical clocks provide ordering
that is consistent with causality, but not linearizability.
This led us to the concept of consensus. We saw that achieving consensus means deciding something in
such a way that all nodes agree on what was decided, and such that they can’t change their mind. A
wide range of problems are actually reducible to consensus and are equivalent to each other (i.e.,
if you have a solution for one of them, you can transform it into a solution for all of the others).
Such equivalent problems include:
Linearizable compare-and-set operation
: The register needs to atomically *decide* whether to set its value, based on whether its current
value equals the parameter given in the operation.
Locks and leases
: When several clients are concurrently trying to grab a lock or lease, the lock *decides* which one
successfully acquired it.
Uniqueness constraints
: When several transactions concurrently try to create conflicting records with the same key, the
constraint must *decide* which one to allow and which should fail with a constraint violation.
Shared logs
: When several nodes concurrently want to append entries to a log, the log *decides* in which order
they are appended. Total order broadcast is also equivalent.
Atomic transaction commit
: The database nodes involved in a distributed transaction must all *decide* the same way whether to
commit or abort the transaction.
Linearizable fetch-and-add operation
: This operation can be used to implement an ID generator. Several nodes can concurrently invoke the
operation, and it *decides* the order in which they increment the counter. This case actually
solves consensus only between two nodes, while the others work for any number of nodes.
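One direction of this equivalence is easy to make concrete: given a linearizable compare-and-set register, several of the other problems take only a few lines each. In the sketch below the register lives in a single process, and a mutex stands in for the hard part that a consensus algorithm would provide in a distributed setting. The names are illustrative only.

```python
import threading
from typing import Dict, Optional


class CasRegister:
    """Linearizable register on one node; a mutex stands in for consensus."""

    def __init__(self) -> None:
        self._value: Optional[str] = None
        self._lock = threading.Lock()

    def compare_and_set(self, expected: Optional[str], new: str) -> bool:
        with self._lock:
            if self._value != expected:
                return False
            self._value = new
            return True

    def read(self) -> Optional[str]:
        with self._lock:
            return self._value


def decide(register: CasRegister, proposal: str) -> str:
    """Consensus on one value: the first proposal to arrive wins for everyone."""
    register.compare_and_set(None, proposal)  # fails harmlessly if already decided
    value = register.read()
    assert value is not None
    return value


def try_lock(register: CasRegister, client: str) -> bool:
    """A lock is a compare-and-set from 'unowned' (None) to the client's name."""
    return register.compare_and_set(None, client)


def insert_unique(index: Dict[str, CasRegister], key: str, record: str) -> bool:
    """Uniqueness constraint: one register per key, and the first writer wins."""
    register = index.setdefault(key, CasRegister())
    return register.compare_and_set(None, record)
```

However many threads call `decide` on the same register with different proposals, they all get back the same value, and once decided it never changes.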
All of these are straightforward if you only have a single node, or if you are willing to assign the
decision-making capability to a single node. This is what happens in a single-leader database: all
the power to make decisions is vested in the leader, which is why such databases are able to provide
linearizable operations, uniqueness constraints, a replication log, and more.
However, if that single leader fails, or if a network interruption makes the leader unreachable,
such a system becomes unable to make any progress until a human performs a manual failover.
Widely used consensus algorithms like Raft and Paxos are essentially single-leader replication with
built-in automatic leader election and failover if the current leader fails.
Consensus algorithms are carefully designed to ensure that no committed writes are lost during a
failover, and that the system cannot get into a split brain state in which multiple nodes are
accepting writes. This requires that every write, and every linearizable read, is confirmed by a
quorum (typically a majority) of nodes. This can be expensive, especially across geographic regions,
but it is unavoidable if you want the strong consistency and fault tolerance that consensus provides.
Coordination services like ZooKeeper and etcd are also built on top of consensus algorithms. They
provide locks, leases, failure detection, and change notification features that are useful for
managing the state of distributed applications. If you find yourself wanting to do one of those
things that is reducible to consensus, and you want it to be fault-tolerant, it is advisable to use
a coordination service. It won’t guarantee that you will get it right, but it will probably help.
Consensus algorithms are complicated and subtle, but they are supported by a rich body of theory
that has been developed since the 1980s. This theory makes it possible to build systems that can
tolerate all the faults that we discussed in [Chapter 9](/en/ch9#ch_distributed), and still ensure that your data is
not corrupted. This is an amazing achievement, and the references at the end of this chapter feature
some of the highlights of this work.
Nevertheless, consensus is not always the right tool: in some systems, the strong consistency
properties it provides are not needed, and it is better to have weaker consistency with higher
availability and better performance. In these cases, it is common to use leaderless or multi-leader
replication, which we previously discussed in [Chapter 6](/en/ch6#ch_replication). The logical clocks that we
discussed in this chapter are helpful in that context.
### References
[^1]: Maurice P. Herlihy and Jeannette M. Wing. [Linearizability: A Correctness Condition for Concurrent Objects](https://cs.brown.edu/~mph/HerlihyW90/p463-herlihy.pdf). *ACM Transactions on Programming Languages and Systems* (TOPLAS), volume 12, issue 3, pages 463–492, July 1990. [doi:10.1145/78969.78972](https://doi.org/10.1145/78969.78972)
[^2]: Leslie Lamport. [On interprocess communication](https://www.microsoft.com/en-us/research/publication/interprocess-communication-part-basic-formalism-part-ii-algorithms/). *Distributed Computing*, volume 1, issue 2, pages 77–101, June 1986. [doi:10.1007/BF01786228](https://doi.org/10.1007/BF01786228)
[^3]: David K. Gifford. [Information Storage in a Decentralized Computer System](https://bitsavers.org/pdf/xerox/parc/techReports/CSL-81-8_Information_Storage_in_a_Decentralized_Computer_System.pdf). Xerox Palo Alto Research Centers, CSL-81-8, June 1981. Archived at [perma.cc/2XXP-3JPB](https://perma.cc/2XXP-3JPB)
[^4]: Martin Kleppmann. [Please Stop Calling Databases CP or AP](https://martin.kleppmann.com/2015/05/11/please-stop-calling-databases-cp-or-ap.html). *martin.kleppmann.com*, May 2015. Archived at [perma.cc/MJ5G-75GL](https://perma.cc/MJ5G-75GL)
[^5]: Kyle Kingsbury. [Call Me Maybe: MongoDB Stale Reads](https://aphyr.com/posts/322-call-me-maybe-mongodb-stale-reads). *aphyr.com*, April 2015. Archived at [perma.cc/DXB4-J4JC](https://perma.cc/DXB4-J4JC)
[^6]: Kyle Kingsbury. [Computational Techniques in Knossos](https://aphyr.com/posts/314-computational-techniques-in-knossos). *aphyr.com*, May 2014. Archived at [perma.cc/2X5M-EHTU](https://perma.cc/2X5M-EHTU)
[^7]: Kyle Kingsbury and Peter Alvaro. [Elle: Inferring Isolation Anomalies from Experimental Observations](https://www.vldb.org/pvldb/vol14/p268-alvaro.pdf). *Proceedings of the VLDB Endowment*, volume 14, issue 3, pages 268–280, November 2020. [doi:10.14778/3430915.3430918](https://doi.org/10.14778/3430915.3430918)
[^8]: Paolo Viotti and Marko Vukolić. [Consistency in Non-Transactional Distributed Storage Systems](https://arxiv.org/abs/1512.00168). *ACM Computing Surveys* (CSUR), volume 49, issue 1, article no. 19, June 2016. [doi:10.1145/2926965](https://doi.org/10.1145/2926965)
[^9]: Peter Bailis. [Linearizability Versus Serializability](http://www.bailis.org/blog/linearizability-versus-serializability/). *bailis.org*, September 2014. Archived at [perma.cc/386B-KAC3](https://perma.cc/386B-KAC3)
[^10]: Daniel Abadi. [Correctness Anomalies Under Serializable Isolation](https://dbmsmusings.blogspot.com/2019/06/correctness-anomalies-under.html). *dbmsmusings.blogspot.com*, June 2019. Archived at [perma.cc/JGS7-BZFY](https://perma.cc/JGS7-BZFY)
[^11]: Peter Bailis, Aaron Davidson, Alan Fekete, Ali Ghodsi, Joseph M. Hellerstein, and Ion Stoica. [Highly Available Transactions: Virtues and Limitations](https://www.vldb.org/pvldb/vol7/p181-bailis.pdf). *Proceedings of the VLDB Endowment*, volume 7, issue 3, pages 181–192, November 2013. [doi:10.14778/2732232.2732237](https://doi.org/10.14778/2732232.2732237), extended version published as [arXiv:1302.0309](https://arxiv.org/abs/1302.0309)
[^12]: Philip A. Bernstein, Vassos Hadzilacos, and Nathan Goodman. [*Concurrency Control and Recovery in Database Systems*](https://www.microsoft.com/en-us/research/people/philbe/book/). Addison-Wesley, 1987. ISBN: 978-0-201-10715-9, available online at [*microsoft.com*](https://www.microsoft.com/en-us/research/people/philbe/book/).
[^13]: Andrei Matei. [CockroachDB’s consistency model](https://www.cockroachlabs.com/blog/consistency-model/). *cockroachlabs.com*, February 2021. Archived at [perma.cc/MR38-883B](https://perma.cc/MR38-883B)
[^14]: Murat Demirbas. [Strict-serializability, but at what cost, for what purpose?](https://muratbuffalo.blogspot.com/2022/08/strict-serializability-but-at-what-cost.html) *muratbuffalo.blogspot.com*, August 2022. Archived at [perma.cc/T8AY-N3U9](https://perma.cc/T8AY-N3U9)
[^15]: Ben Darnell. [How to talk about consistency and isolation in distributed DBs](https://www.cockroachlabs.com/blog/db-consistency-isolation-terminology/). *cockroachlabs.com*, February 2022. Archived at [perma.cc/53SV-JBGK](https://perma.cc/53SV-JBGK)
[^16]: Daniel Abadi. [An explanation of the difference between Isolation levels vs. Consistency levels](https://dbmsmusings.blogspot.com/2019/08/an-explanation-of-difference-between.html). *dbmsmusings.blogspot.com*, August 2019. Archived at [perma.cc/QSF2-CD4P](https://perma.cc/QSF2-CD4P)
[^17]: Mike Burrows. [The Chubby Lock Service for Loosely-Coupled Distributed Systems](https://research.google/pubs/pub27897/). At *7th USENIX Symposium on Operating Systems Design and Implementation* (OSDI), November 2006.
[^18]: Flavio P. Junqueira and Benjamin Reed. [*ZooKeeper: Distributed Process Coordination*](https://www.oreilly.com/library/view/zookeeper/9781449361297/). O’Reilly Media, 2013. ISBN: 978-1-449-36130-3
[^19]: Murali Vallath. [*Oracle 10g RAC Grid, Services & Clustering*](https://www.oreilly.com/library/view/oracle-10g-rac/9781555583217/). Elsevier Digital Press, 2006. ISBN: 978-1-555-58321-7
[^20]: Peter Bailis, Alan Fekete, Michael J. Franklin, Ali Ghodsi, Joseph M. Hellerstein, and Ion Stoica. [Coordination Avoidance in Database Systems](https://arxiv.org/abs/1402.2237). *Proceedings of the VLDB Endowment*, volume 8, issue 3, pages 185–196, November 2014. [doi:10.14778/2735508.2735509](https://doi.org/10.14778/2735508.2735509)
[^21]: Kyle Kingsbury. [Call Me Maybe: etcd and Consul](https://aphyr.com/posts/316-call-me-maybe-etcd-and-consul). *aphyr.com*, June 2014. Archived at [perma.cc/XL7U-378K](https://perma.cc/XL7U-378K)
[^22]: Flavio P. Junqueira, Benjamin C. Reed, and Marco Serafini. [Zab: High-Performance Broadcast for Primary-Backup Systems](https://marcoserafini.github.io/assets/pdf/zab.pdf). At *41st IEEE International Conference on Dependable Systems and Networks* (DSN), June 2011. [doi:10.1109/DSN.2011.5958223](https://doi.org/10.1109/DSN.2011.5958223)
[^23]: Diego Ongaro and John K. Ousterhout. [In Search of an Understandable Consensus Algorithm](https://www.usenix.org/system/files/conference/atc14/atc14-paper-ongaro.pdf). At *USENIX Annual Technical Conference* (ATC), June 2014.
[^24]: Hagit Attiya, Amotz Bar-Noy, and Danny Dolev. [Sharing Memory Robustly in Message-Passing Systems](https://www.cs.huji.ac.il/course/2004/dist/p124-attiya.pdf). *Journal of the ACM*, volume 42, issue 1, pages 124–142, January 1995. [doi:10.1145/200836.200869](https://doi.org/10.1145/200836.200869)
[^25]: Nancy Lynch and Alex Shvartsman. [Robust Emulation of Shared Memory Using Dynamic Quorum-Acknowledged Broadcasts](https://groups.csail.mit.edu/tds/papers/Lynch/FTCS97.pdf). At *27th Annual International Symposium on Fault-Tolerant Computing* (FTCS), June 1997. [doi:10.1109/FTCS.1997.614100](https://doi.org/10.1109/FTCS.1997.614100)
[^26]: Christian Cachin, Rachid Guerraoui, and Luís Rodrigues. [*Introduction to Reliable and Secure Distributed Programming*](https://www.distributedprogramming.net/), 2nd edition. Springer, 2011. ISBN: 978-3-642-15259-7, [doi:10.1007/978-3-642-15260-3](https://doi.org/10.1007/978-3-642-15260-3)
[^27]: Niklas Ekström, Mikhail Panchenko, and Jonathan Ellis. [Possible Issue with Read Repair?](https://lists.apache.org/thread/wwsjnnc93mdlpw8nb0d5gn4q1bmpzbon) Email thread on *cassandra-dev* mailing list, October 2012.
[^28]: Maurice P. Herlihy. [Wait-Free Synchronization](https://cs.brown.edu/~mph/Herlihy91/p124-herlihy.pdf). *ACM Transactions on Programming Languages and Systems* (TOPLAS), volume 13, issue 1, pages 124–149, January 1991. [doi:10.1145/114005.102808](https://doi.org/10.1145/114005.102808)
[^29]: Armando Fox and Eric A. Brewer. [Harvest, Yield, and Scalable Tolerant Systems](https://radlab.cs.berkeley.edu/people/fox/static/pubs/pdf/c18.pdf). At *7th Workshop on Hot Topics in Operating Systems* (HotOS), March 1999. [doi:10.1109/HOTOS.1999.798396](https://doi.org/10.1109/HOTOS.1999.798396)
[^30]: Seth Gilbert and Nancy Lynch. [Brewer’s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services](https://www.comp.nus.edu.sg/~gilbert/pubs/BrewersConjecture-SigAct.pdf). *ACM SIGACT News*, volume 33, issue 2, pages 51–59, June 2002. [doi:10.1145/564585.564601](https://doi.org/10.1145/564585.564601)
[^31]: Seth Gilbert and Nancy Lynch. [Perspectives on the CAP Theorem](https://groups.csail.mit.edu/tds/papers/Gilbert/Brewer2.pdf). *IEEE Computer Magazine*, volume 45, issue 2, pages 30–36, February 2012. [doi:10.1109/MC.2011.389](https://doi.org/10.1109/MC.2011.389)
[^32]: Eric A. Brewer. [CAP Twelve Years Later: How the ‘Rules’ Have Changed](https://sites.cs.ucsb.edu/~rich/class/cs293-cloud/papers/brewer-cap.pdf). *IEEE Computer Magazine*, volume 45, issue 2, pages 23–29, February 2012. [doi:10.1109/MC.2012.37](https://doi.org/10.1109/MC.2012.37)
[^33]: Susan B. Davidson, Hector Garcia-Molina, and Dale Skeen. [Consistency in Partitioned Networks](https://www.cs.rice.edu/~alc/old/comp520/papers/DGS85.pdf). *ACM Computing Surveys*, volume 17, issue 3, pages 341–370, September 1985. [doi:10.1145/5505.5508](https://doi.org/10.1145/5505.5508)
[^34]: Paul R. Johnson and Robert H. Thomas. [RFC 677: The Maintenance of Duplicate Databases](https://tools.ietf.org/html/rfc677). Network Working Group, January 1975.
[^35]: Michael J. Fischer and Alan Michael. [Sacrificing Serializability to Attain High Availability of Data in an Unreliable Network](https://sites.cs.ucsb.edu/~agrawal/spring2011/ugrad/p70-fischer.pdf). At *1st ACM Symposium on Principles of Database Systems* (PODS), March 1982. [doi:10.1145/588111.588124](https://doi.org/10.1145/588111.588124)
[^36]: Eric A. Brewer. [NoSQL: Past, Present, Future](https://www.infoq.com/presentations/NoSQL-History/). At *QCon San Francisco*, November 2012.
[^37]: Adrian Cockcroft. [Migrating to Microservices](https://www.infoq.com/presentations/migration-cloud-native/). At *QCon London*, March 2014.
[^38]: Martin Kleppmann. [A Critique of the CAP Theorem](https://arxiv.org/abs/1509.05393). arXiv:1509.05393, September 2015.
[^39]: Daniel Abadi. [Problems with CAP, and Yahoo’s little known NoSQL system](https://dbmsmusings.blogspot.com/2010/04/problems-with-cap-and-yahoos-little.html). *dbmsmusings.blogspot.com*, April 2010. Archived at [perma.cc/4NTZ-CLM9](https://perma.cc/4NTZ-CLM9)
[^40]: Daniel Abadi. [Hazelcast and the Mythical PA/EC System](https://dbmsmusings.blogspot.com/2017/10/hazelcast-and-mythical-paec-system.html). *dbmsmusings.blogspot.com*, October 2017. Archived at [perma.cc/J5XM-U5C2](https://perma.cc/J5XM-U5C2)
[^41]: Eric Brewer. [Spanner, TrueTime & The CAP Theorem](https://research.google.com/pubs/archive/45855.pdf). *research.google.com*, February 2017. Archived at [perma.cc/59UW-RH7N](https://perma.cc/59UW-RH7N)
[^42]: Daniel J. Abadi. [Consistency Tradeoffs in Modern Distributed Database System Design](https://www.cs.umd.edu/~abadi/papers/abadi-pacelc.pdf). *IEEE Computer Magazine*, volume 45, issue 2, pages 37–42, February 2012. [doi:10.1109/MC.2012.33](https://doi.org/10.1109/MC.2012.33)
[^43]: Nancy A. Lynch. [A Hundred Impossibility Proofs for Distributed Computing](https://groups.csail.mit.edu/tds/papers/Lynch/podc89.pdf). At *8th ACM Symposium on Principles of Distributed Computing* (PODC), August 1989. [doi:10.1145/72981.72982](https://doi.org/10.1145/72981.72982)
[^44]: Prince Mahajan, Lorenzo Alvisi, and Mike Dahlin. [Consistency, Availability, and Convergence](https://apps.cs.utexas.edu/tech_reports/reports/tr/TR-2036.pdf). University of Texas at Austin, Department of Computer Science, Tech Report UTCS TR-11-22, May 2011. Archived at [perma.cc/SAV8-9JAJ](https://perma.cc/SAV8-9JAJ)
[^45]: Hagit Attiya, Faith Ellen, and Adam Morrison. [Limitations of Highly-Available Eventually-Consistent Data Stores](https://www.cs.tau.ac.il/~mad/publications/podc2015-replds.pdf). At *ACM Symposium on Principles of Distributed Computing* (PODC), July 2015. [doi:10.1145/2767386.2767419](https://doi.org/10.1145/2767386.2767419)
[^46]: Peter Sewell, Susmit Sarkar, Scott Owens, Francesco Zappa Nardelli, and Magnus O. Myreen. [x86-TSO: A Rigorous and Usable Programmer’s Model for x86 Multiprocessors](https://www.cl.cam.ac.uk/~pes20/weakmemory/cacm.pdf). *Communications of the ACM*, volume 53, issue 7, pages 89–97, July 2010. [doi:10.1145/1785414.1785443](https://doi.org/10.1145/1785414.1785443)
[^47]: Martin Thompson. [Memory Barriers/Fences](https://mechanical-sympathy.blogspot.com/2011/07/memory-barriersfences.html). *mechanical-sympathy.blogspot.co.uk*, July 2011. Archived at [perma.cc/7NXM-GC5U](https://perma.cc/7NXM-GC5U)
[^48]: Ulrich Drepper. [What Every Programmer Should Know About Memory](https://www.akkadia.org/drepper/cpumemory.pdf). *akkadia.org*, November 2007. Archived at [perma.cc/NU6Q-DRXZ](https://perma.cc/NU6Q-DRXZ)
[^49]: Hagit Attiya and Jennifer L. Welch. [Sequential Consistency Versus Linearizability](https://courses.csail.mit.edu/6.852/01/papers/p91-attiya.pdf). *ACM Transactions on Computer Systems* (TOCS), volume 12, issue 2, pages 91–122, May 1994. [doi:10.1145/176575.176576](https://doi.org/10.1145/176575.176576)
[^50]: Kyzer R. Davis, Brad G. Peabody, and Paul J. Leach. [Universally Unique IDentifiers (UUIDs)](https://www.rfc-editor.org/rfc/rfc9562). RFC 9562, IETF, May 2024.
[^51]: Ryan King. [Announcing Snowflake](https://blog.x.com/engineering/en_us/a/2010/announcing-snowflake). *blog.x.com*, June 2010. Archived at [archive.org](https://web.archive.org/web/20241128214604/https%3A//blog.x.com/engineering/en_us/a/2010/announcing-snowflake)
[^52]: Alizain Feerasta. [Universally Unique Lexicographically Sortable Identifier](https://github.com/ulid/spec). *github.com*, 2016. Archived at [perma.cc/NV2Y-ZP8U](https://perma.cc/NV2Y-ZP8U)
[^53]: Rob Conery. [A Better ID Generator for PostgreSQL](https://bigmachine.io/2014/05/29/a-better-id-generator-for-postgresql/). *bigmachine.io*, May 2014. Archived at [perma.cc/K7QV-3KFC](https://perma.cc/K7QV-3KFC)
[^54]: Leslie Lamport. [Time, Clocks, and the Ordering of Events in a Distributed System](https://www.microsoft.com/en-us/research/publication/time-clocks-ordering-events-distributed-system/). *Communications of the ACM*, volume 21, issue 7, pages 558–565, July 1978. [doi:10.1145/359545.359563](https://doi.org/10.1145/359545.359563)
[^55]: Sandeep S. Kulkarni, Murat Demirbas, Deepak Madeppa, Bharadwaj Avva, and Marcelo Leone. [Logical Physical Clocks](https://cse.buffalo.edu/~demirbas/publications/hlc.pdf). At *18th International Conference on Principles of Distributed Systems* (OPODIS), December 2014. [doi:10.1007/978-3-319-14472-6\_2](https://doi.org/10.1007/978-3-319-14472-6_2)
[^56]: Manuel Bravo, Nuno Diegues, Jingna Zeng, Paolo Romano, and Luís Rodrigues. [On the use of Clocks to Enforce Consistency in the Cloud](http://sites.computer.org/debull/A15mar/p18.pdf). *IEEE Data Engineering Bulletin*, volume 38, issue 1, pages 18–31, March 2015. Archived at [perma.cc/68ZU-45SH](https://perma.cc/68ZU-45SH)
[^57]: Daniel Peng and Frank Dabek. [Large-Scale Incremental Processing Using Distributed Transactions and Notifications](https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Peng.pdf). At *9th USENIX Conference on Operating Systems Design and Implementation* (OSDI), October 2010.
[^58]: Tushar Deepak Chandra, Robert Griesemer, and Joshua Redstone. [Paxos Made Live – An Engineering Perspective](https://www.read.seas.harvard.edu/~kohler/class/08w-dsi/chandra07paxos.pdf). At *26th ACM Symposium on Principles of Distributed Computing* (PODC), June 2007. [doi:10.1145/1281100.1281103](https://doi.org/10.1145/1281100.1281103)
[^59]: Will Portnoy. [Lessons Learned from Implementing Paxos](https://blog.willportnoy.com/2012/06/lessons-learned-from-paxos.html). *blog.willportnoy.com*, June 2012. Archived at [perma.cc/QHD9-FDD2](https://perma.cc/QHD9-FDD2)
[^60]: Brian M. Oki and Barbara H. Liskov. [Viewstamped Replication: A New Primary Copy Method to Support Highly-Available Distributed Systems](https://pmg.csail.mit.edu/papers/vr.pdf). At *7th ACM Symposium on Principles of Distributed Computing* (PODC), August 1988. [doi:10.1145/62546.62549](https://doi.org/10.1145/62546.62549)
[^61]: Barbara H. Liskov and James Cowling. [Viewstamped Replication Revisited](https://pmg.csail.mit.edu/papers/vr-revisited.pdf). Massachusetts Institute of Technology, Tech Report MIT-CSAIL-TR-2012-021, July 2012. Archived at [perma.cc/56SJ-WENQ](https://perma.cc/56SJ-WENQ)
[^62]: Leslie Lamport. [The Part-Time Parliament](https://www.microsoft.com/en-us/research/publication/part-time-parliament/). *ACM Transactions on Computer Systems*, volume 16, issue 2, pages 133–169, May 1998. [doi:10.1145/279227.279229](https://doi.org/10.1145/279227.279229)
[^63]: Leslie Lamport. [Paxos Made Simple](https://www.microsoft.com/en-us/research/publication/paxos-made-simple/). *ACM SIGACT News*, volume 32, issue 4, pages 51–58, December 2001. Archived at [perma.cc/82HP-MNKE](https://perma.cc/82HP-MNKE)
[^64]: Robbert van Renesse and Deniz Altinbuken. [Paxos Made Moderately Complex](https://people.cs.umass.edu/~arun/590CC/papers/paxos-moderately-complex.pdf). *ACM Computing Surveys* (CSUR), volume 47, issue 3, article no. 42, February 2015. [doi:10.1145/2673577](https://doi.org/10.1145/2673577)
[^65]: Diego Ongaro. [Consensus: Bridging Theory and Practice](https://github.com/ongardie/dissertation). PhD Thesis, Stanford University, August 2014. Archived at [perma.cc/5VTZ-2ADH](https://perma.cc/5VTZ-2ADH)
[^66]: Heidi Howard, Malte Schwarzkopf, Anil Madhavapeddy, and Jon Crowcroft. [Raft Refloated: Do We Have Consensus?](https://www.cl.cam.ac.uk/research/srg/netos/papers/2015-raftrefloated-osr.pdf) *ACM SIGOPS Operating Systems Review*, volume 49, issue 1, pages 12–21, January 2015. [doi:10.1145/2723872.2723876](https://doi.org/10.1145/2723872.2723876)
[^67]: André Medeiros. [ZooKeeper’s Atomic Broadcast Protocol: Theory and Practice](http://www.tcs.hut.fi/Studies/T-79.5001/reports/2012-deSouzaMedeiros.pdf). Aalto University School of Science, March 2012. Archived at [perma.cc/FVL4-JMVA](https://perma.cc/FVL4-JMVA)
[^68]: Robbert van Renesse, Nicolas Schiper, and Fred B. Schneider. [Vive La Différence: Paxos vs. Viewstamped Replication vs. Zab](https://arxiv.org/abs/1309.5671). *IEEE Transactions on Dependable and Secure Computing*, volume 12, issue 4, pages 472–484, September 2014. [doi:10.1109/TDSC.2014.2355848](https://doi.org/10.1109/TDSC.2014.2355848)
[^69]: Heidi Howard and Richard Mortier. [Paxos vs Raft: Have we reached consensus on distributed consensus?](https://arxiv.org/abs/2004.05074). At *7th Workshop on Principles and Practice of Consistency for Distributed Data* (PaPoC), April 2020. [doi:10.1145/3380787.3393681](https://doi.org/10.1145/3380787.3393681)
[^70]: Miguel Castro and Barbara H. Liskov. [Practical Byzantine Fault Tolerance and Proactive Recovery](https://www.microsoft.com/en-us/research/wp-content/uploads/2017/01/p398-castro-bft-tocs.pdf). *ACM Transactions on Computer Systems*, volume 20, issue 4, pages 396–461, November 2002. [doi:10.1145/571637.571640](https://doi.org/10.1145/571637.571640)
[^71]: Shehar Bano, Alberto Sonnino, Mustafa Al-Bassam, Sarah Azouvi, Patrick McCorry, Sarah Meiklejohn, and George Danezis. [SoK: Consensus in the Age of Blockchains](https://smeiklej.com/files/aft19a.pdf). At *1st ACM Conference on Advances in Financial Technologies* (AFT), October 2019. [doi:10.1145/3318041.3355458](https://doi.org/10.1145/3318041.3355458)
[^72]: Michael J. Fischer, Nancy Lynch, and Michael S. Paterson. [Impossibility of Distributed Consensus with One Faulty Process](https://groups.csail.mit.edu/tds/papers/Lynch/jacm85.pdf). *Journal of the ACM*, volume 32, issue 2, pages 374–382, April 1985. [doi:10.1145/3149.214121](https://doi.org/10.1145/3149.214121)
[^73]: Tushar Deepak Chandra and Sam Toueg. [Unreliable Failure Detectors for Reliable Distributed Systems](https://courses.csail.mit.edu/6.852/08/papers/CT96-JACM.pdf). *Journal of the ACM*, volume 43, issue 2, pages 225–267, March 1996. [doi:10.1145/226643.226647](https://doi.org/10.1145/226643.226647)
[^74]: Michael Ben-Or. [Another Advantage of Free Choice: Completely Asynchronous Agreement Protocols](https://homepage.cs.uiowa.edu/~ghosh/BenOr.pdf). At *2nd ACM Symposium on Principles of Distributed Computing* (PODC), August 1983. [doi:10.1145/800221.806707](https://doi.org/10.1145/800221.806707)
[^75]: Cynthia Dwork, Nancy Lynch, and Larry Stockmeyer. [Consensus in the Presence of Partial Synchrony](https://groups.csail.mit.edu/tds/papers/Lynch/jacm88.pdf). *Journal of the ACM*, volume 35, issue 2, pages 288–323, April 1988. [doi:10.1145/42282.42283](https://doi.org/10.1145/42282.42283)
[^76]: Xavier Défago, André Schiper, and Péter Urbán. [Total Order Broadcast and Multicast Algorithms: Taxonomy and Survey](https://dspace.jaist.ac.jp/dspace/bitstream/10119/4883/1/defago_et_al.pdf). *ACM Computing Surveys*, volume 36, issue 4, pages 372–421, December 2004. [doi:10.1145/1041680.1041682](https://doi.org/10.1145/1041680.1041682)
[^77]: Hagit Attiya and Jennifer Welch. *Distributed Computing: Fundamentals, Simulations and Advanced Topics*, 2nd edition. John Wiley & Sons, 2004. ISBN: 978-0-471-45324-6, [doi:10.1002/0471478210](https://doi.org/10.1002/0471478210)
[^78]: Rachid Guerraoui. [Revisiting the Relationship Between Non-Blocking Atomic Commitment and Consensus](https://citeseerx.ist.psu.edu/pdf/5d06489503b6f791aa56d2d7942359c2592e44b0). At *9th International Workshop on Distributed Algorithms* (WDAG), September 1995. [doi:10.1007/BFb0022140](https://doi.org/10.1007/BFb0022140)
[^79]: Jim N. Gray and Leslie Lamport. [Consensus on Transaction Commit](https://dsf.berkeley.edu/cs286/papers/paxoscommit-tods2006.pdf). *ACM Transactions on Database Systems* (TODS), volume 31, issue 1, pages 133–160, March 2006. [doi:10.1145/1132863.1132867](https://doi.org/10.1145/1132863.1132867)
[^80]: Fred B. Schneider. [Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial](https://www.cs.cornell.edu/fbs/publications/SMSurvey.pdf). *ACM Computing Surveys*, volume 22, issue 4, pages 299–319, December 1990. [doi:10.1145/98163.98167](https://doi.org/10.1145/98163.98167)
[^81]: Alexander Thomson, Thaddeus Diamond, Shu-Chun Weng, Kun Ren, Philip Shao, and Daniel J. Abadi. [Calvin: Fast Distributed Transactions for Partitioned Database Systems](https://cs.yale.edu/homes/thomson/publications/calvin-sigmod12.pdf). At *ACM International Conference on Management of Data* (SIGMOD), May 2012. [doi:10.1145/2213836.2213838](https://doi.org/10.1145/2213836.2213838)
[^82]: Mahesh Balakrishnan, Dahlia Malkhi, Ted Wobber, Ming Wu, Vijayan Prabhakaran, Michael Wei, John D. Davis, Sriram Rao, Tao Zou, and Aviad Zuck. [Tango: Distributed Data Structures over a Shared Log](https://www.microsoft.com/en-us/research/publication/tango-distributed-data-structures-over-a-shared-log/). At *24th ACM Symposium on Operating Systems Principles* (SOSP), November 2013. [doi:10.1145/2517349.2522732](https://doi.org/10.1145/2517349.2522732)
[^83]: Mahesh Balakrishnan, Dahlia Malkhi, Vijayan Prabhakaran, Ted Wobber, Michael Wei, and John D. Davis. [CORFU: A Shared Log Design for Flash Clusters](https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final30.pdf). At *9th USENIX Symposium on Networked Systems Design and Implementation* (NSDI), April 2012.
[^84]: Vasilis Gavrielatos, Antonios Katsarakis, and Vijay Nagarajan. [Odyssey: the impact of modern hardware on strongly-consistent replication protocols](https://vasigavr1.github.io/files/Odyssey_Eurosys_2021.pdf). At *16th European Conference on Computer Systems* (EuroSys), April 2021. [doi:10.1145/3447786.3456240](https://doi.org/10.1145/3447786.3456240)
[^85]: Heidi Howard, Dahlia Malkhi, and Alexander Spiegelman. [Flexible Paxos: Quorum Intersection Revisited](https://drops.dagstuhl.de/opus/volltexte/2017/7094/pdf/LIPIcs-OPODIS-2016-25.pdf). At *20th International Conference on Principles of Distributed Systems* (OPODIS), December 2016. [doi:10.4230/LIPIcs.OPODIS.2016.25](https://doi.org/10.4230/LIPIcs.OPODIS.2016.25)
[^86]: Martin Kleppmann. [Distributed Systems lecture notes](https://www.cl.cam.ac.uk/teaching/2425/ConcDisSys/dist-sys-notes.pdf). *University of Cambridge*, October 2024. Archived at [perma.cc/SS3Q-FNS5](https://perma.cc/SS3Q-FNS5)
[^87]: Kyle Kingsbury. [Call Me Maybe: Elasticsearch 1.5.0](https://aphyr.com/posts/323-call-me-maybe-elasticsearch-1-5-0). *aphyr.com*, April 2015. Archived at [perma.cc/37MZ-JT7H](https://perma.cc/37MZ-JT7H)
[^88]: Heidi Howard and Jon Crowcroft. [Coracle: Evaluating Consensus at the Internet Edge](https://conferences.sigcomm.org/sigcomm/2015/pdf/papers/p85.pdf). At *Annual Conference of the ACM Special Interest Group on Data Communication* (SIGCOMM), August 2015. [doi:10.1145/2829988.2790010](https://doi.org/10.1145/2829988.2790010)
[^89]: Tom Lianza and Chris Snook. [A Byzantine failure in the real world](https://blog.cloudflare.com/a-byzantine-failure-in-the-real-world/). *blog.cloudflare.com*, November 2020. Archived at [perma.cc/83EZ-ALCY](https://perma.cc/83EZ-ALCY)
[^90]: Ivan Kelly. [BookKeeper Tutorial](https://github.com/ivankelly/bookkeeper-tutorial). *github.com*, October 2014. Archived at [perma.cc/37Y6-VZWU](https://perma.cc/37Y6-VZWU)
[^91]: Jack Vanlightly. [Apache BookKeeper Insights Part 1 — External Consensus and Dynamic Membership](https://medium.com/splunk-maas/apache-bookkeeper-insights-part-1-external-consensus-and-dynamic-membership-c259f388da21). *medium.com*, November 2021. Archived at [perma.cc/3MDB-8GFB](https://perma.cc/3MDB-8GFB)
================================================
FILE: content/en/ch11.md
================================================
---
title: "11. Batch Processing"
weight: 311
breadcrumbs: false
---

> *A system cannot be successful if it is too strongly influenced by a single person. Once the
> initial design is complete and fairly robust, the real test begins as people with many different
> viewpoints undertake their own experiments.*
>
> Donald Knuth

> [!TIP] A NOTE FOR EARLY RELEASE READERS
> With Early Release ebooks, you get books in their earliest form---the author's raw and unedited
> content as they write---so you can take advantage of these technologies long before the official
> release of these titles.
>
> This will be the 11th chapter of the final book. The GitHub repo for this book is
> [https://github.com/ept/ddia2-feedback](https://github.com/ept/ddia2-feedback).
>
> If you'd like to be actively involved in reviewing and commenting on this draft, please reach out on
> GitHub.

Much of this book so far has talked about *requests* and *queries*, and the corresponding
*responses* or *results*. This style of data processing is assumed in many modern data systems: you
ask for something, or you send an instruction, and the system tries to give you an answer as quickly
as possible.
A web browser requesting a page, a service calling a remote API, databases, caches, search indexes,
and many other systems work this way. We call these *online systems*. Response time is usually their
primary measure of performance, and they often require fault tolerance to ensure high availability.
However, sometimes you need to run a bigger computation or process larger amounts of data than you
can do in an interactive request. Maybe you need to train an AI model, or transform lots of data
from one form into another, or compute analytics over a very large dataset. We call these tasks
*batch processing* jobs, or sometimes *offline systems*.
A batch processing job takes some input data (which is read-only), and produces some output data
(which is generated from scratch every time the job runs). It typically does not mutate data in the
way a read/write transaction would. The output is therefore *derived* from the input (as discussed
in ["Systems of Record and Derived Data"](/en/ch1#sec_introduction_derived)): if you don't like the
output, you can just delete it, adjust the job logic, and run it again. By treating inputs as
immutable and avoiding side effects (such as writing to external databases), batch jobs not only
achieve good performance but also have other benefits:
- If you introduce a bug into the code and the output is wrong or corrupted, you can simply roll
back to a previous version of the code and rerun the job, and the output will be correct again.
Or, even simpler, you can keep the old output in a different directory and simply switch back to
it. Most object stores and open table formats (see ["Cloud Data
Warehouses"](/en/ch4#sec_cloud_data_warehouses)) support this feature, which is known as *time
travel*. Most databases with read-write transactions do not have this property: if you deploy
buggy code that writes bad data to the database, then rolling back the code will do nothing to fix
the data in the database. The idea of being able to recover from buggy code has been called *human
fault tolerance* [^1].
- As a consequence of this ease of rolling back, feature development can proceed more quickly than
in an environment where mistakes could mean irreversible damage. This principle of *minimizing
irreversibility* is beneficial for Agile software development [^2].
- The same set of files can be used as input for various different jobs, including monitoring jobs
that calculate metrics and evaluate whether a job's output has the expected characteristics (for
example, by comparing it to the output from the previous run and measuring discrepancies).
- Batch processing frameworks make efficient use of computing resources. Even though it's possible
to batch process data using online data systems such as OLTP databases and application servers,
doing so can be much more expensive in terms of the resources required.

Batch data processing also presents challenges. With most frameworks, output can only be processed
by other jobs after the whole job finishes. Batch processing can also be inefficient: any change to
input data---even a single byte---means the batch job must reprocess the entire input dataset.
Despite these limitations, batch processing has proven useful in a wide range of use cases, which
we'll revisit in ["Batch Use Cases"](/en/ch11#sec_batch_output).
A batch job may take a long time to run: minutes, hours, or even days. Jobs may be scheduled to run
periodically (for example, once per day). The primary measure of performance is usually throughput:
how much data the job can process per unit time. Some batch systems handle faults by simply aborting
and restarting the whole job, while others have fault tolerance so that a job can complete
successfully despite some of its nodes crashing.
> [!NOTE]
> An alternative to batch processing is *stream processing*, in which the job doesn't finish running
> when it has processed the input, but instead continues watching the input and processes changes in
> the input shortly after they happen. We will turn to stream processing in
> [Chapter 12](/en/ch12#ch_stream).

The boundary between online and batch processing systems is not always clear: a long-running
database query looks quite like a batch process. But batch processing also has some particular
characteristics that make it a useful building block for building reliable, scalable, and
maintainable applications. For example, it often plays a role in *data integration*, i.e., composing
multiple data systems to achieve things that one system alone cannot do. ETL, as discussed in ["Data
Warehousing"](/en/ch1#sec_introduction_dwh), is an example of this.
Modern batch processing has been heavily influenced by MapReduce, a batch processing algorithm that
was published by Google in 2004 [^3], and subsequently implemented in various open source
data systems, including Hadoop, CouchDB, and MongoDB. MapReduce is a fairly low-level programming
model, and less sophisticated than the parallel query execution engines found, for example, in data
warehouses [^4], [^5]. When it was new, MapReduce was a step forward in terms of the
scale of processing that could be achieved on commodity hardware, but now it is largely obsolete,
and no longer used at Google [^6], [^7].
Batch processing today is more often done using frameworks such as Spark or Flink, or data warehouse
query engines. Like MapReduce, they rely heavily on sharding (see [Chapter 7](/en/ch7#ch_sharding))
and parallel execution, but they have far more sophisticated caching and execution strategies. As
these systems have matured, operational concerns have been largely solved, so focus has shifted
toward usability. New processing models such as dataflow APIs, query languages, and DataFrame APIs
are now widely supported. Job and workflow orchestration has also matured. Hadoop-centric workflow
schedulers such as Oozie and Azkaban have been replaced with more generalized solutions such as
Airflow, Dagster, and Prefect, which support a wide array of batch processing frameworks and cloud
data warehouses.
Cloud computing has grown ubiquitous. Batch storage layers are shifting from distributed filesystems
(DFSs) like HDFS, GlusterFS, and CephFS to object storage systems such as S3. Scalable cloud data
warehouses like BigQuery and Snowflake are blurring the line between data warehouses and batch
processing.
To build an intuition of what batch processing is about, we will start this chapter with an example
that uses standard Unix tools on a single machine. We will then investigate how we can extend data
processing to multiple machines in a distributed system. We will see that, much like an operating
system, distributed batch processing frameworks have a scheduler and a filesystem. We will then
explore various processing models that we use to write batch jobs. Finally, we discuss common batch
processing use cases.
## Batch Processing with Unix Tools {#sec_batch_unix}
Say you have a web server that appends a line to a log file every time it serves a request. For
example, using the nginx default access log format, one line of the log might look like this:

    216.58.210.78 - - [27/Jun/2025:17:55:11 +0000] "GET /css/typography.css HTTP/1.1"
    200 3377 "https://martin.kleppmann.com/" "Mozilla/5.0 (Macintosh; Intel Mac OS X
    10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/137.0.0.0 Safari/537.36"

(That is actually one line; it's only broken onto multiple lines here for readability.) There's a
lot of information in that line. In order to interpret it, you need to look at the definition of the
log format, which is as follows:

    $remote_addr - $remote_user [$time_local] "$request"
    $status $body_bytes_sent "$http_referer" "$http_user_agent"

So, this one line of the log indicates that on June 27, 2025, at 17:55:11 UTC, the server received a
request for the file */css/typography.css* from the client IP address 216.58.210.78. The user was
not authenticated, so `$remote_user` is set to a hyphen (`-`). The response status was 200 (i.e.,
the request was successful), and the response was 3,377 bytes in size. The web browser was Chrome
137, and it loaded the file because it was referenced in the page at the URL
[https://martin.kleppmann.com/](https://martin.kleppmann.com/).
Though log parsing might seem contrived, it's actually a critical part of many modern technology
companies, and is used for everything from ad pipelines to payment processing. Indeed, it was a
driving force behind the rapid adoption of MapReduce and the "big data" movement.
### Simple Log Analysis {#sec_batch_log_analysis}
Various tools can take these log files and produce pretty reports about your website traffic, but
for the sake of exercise, let's build our own, using basic Unix tools. For example, say you want to
find the five most popular pages on your website. You can do this in a Unix shell as follows:
``` bash
cat /var/log/nginx/access.log | #1
awk '{print $7}' | #2
sort | #3
uniq -c | #4
sort -r -n | #5
head -n 5 #6
```
1. Read the log file. (Strictly speaking, `cat` is unnecessary here, as the input file could be
given directly as an argument to `awk`. However, the linear pipeline is more apparent when
written like this.)
2. Split each line into fields by whitespace, and output only the seventh such field from each
line, which happens to be the requested URL. In our example line, this request URL is
*/css/typography.css*.
3. Alphabetically `sort` the list of requested URLs. If some URL has been requested *n* times, then
after sorting, the file contains the same URL repeated *n* times in a row.
4. The `uniq` command filters out repeated lines in its input by checking whether two adjacent
lines are the same. The `-c` option tells it to also output a counter: for every distinct URL,
it reports how many times that URL appeared in the input.
5. The second `sort` sorts by the number (`-n`) at the start of each line, which is the number of
times the URL was requested. It then returns the results in reverse (`-r`) order, i.e. with the
largest number first.
6. Finally, `head` outputs just the first five lines (`-n 5`) of input, and discards the rest.

The output of that series of commands looks something like this:

    4189 /favicon.ico
    3631 /2016/02/08/how-to-do-distributed-locking.html
    2124 /2020/11/18/distributed-systems-and-elliptic-curves.html
    1369 /
    915 /css/typography.css

Although the preceding command line likely looks a bit obscure if you're unfamiliar with Unix tools,
it is incredibly powerful. It will process gigabytes of log files in a matter of seconds, and you
can easily modify the analysis to suit your needs. For example, if you want to omit CSS files from
the report, change the `awk` argument to `'$7 !~ /\.css$/ {print $7}'`. If you want to count top
client IP addresses instead of top pages, change the `awk` argument to `'{print $1}'`. And so on.
We don't have space in this book to explore Unix tools in detail, but they are very much worth
learning about. Surprisingly many data analyses can be done in a few minutes using some combination
of `awk`, `sed`, `grep`, `sort`, `uniq`, and `xargs`, and they perform surprisingly well
[^8].
### Chain of Commands Versus Custom Program {#sec_batch_custom_program}
Instead of the chain of Unix commands, you could write a simple program to do the same thing. For
example, in Python, it might look something like this:
``` python
from collections import defaultdict

counts = defaultdict(int) #1

with open('/var/log/nginx/access.log', 'r') as file:
    for line in file:
        url = line.split()[6] #2
        counts[url] += 1 #3

top5 = sorted(((count, url) for url, count in counts.items()), reverse=True)[:5] #4

for count, url in top5: #5
    print(f"{count} {url}")
```
1. `counts` is a hash table that keeps a counter for the number of times we've seen each URL. A
counter is zero by default.
2. From each line of the log, we take the URL to be the seventh whitespace-separated field (the
list index is 6 because Python lists are zero-indexed).
3. Increment the counter for the URL in the current line of the log.
4. Sort the hash table contents by counter value (descending), and take the top five entries.
5. Print out those top five entries.
This program is not as concise as the chain of Unix pipes, but it's fairly readable, and which of
the two you prefer is partly a matter of taste. However, besides the superficial syntactic
differences between the two, there is a big difference in the execution flow, which becomes apparent
if you run this analysis on a large file.
### Sorting Versus In-memory Aggregation {#id275}
The Python script keeps an in-memory hash table of URLs, where each URL is mapped to the number of
times it has been seen. The Unix pipeline example does not have such a hash table, but instead
relies on sorting a list of URLs in which multiple occurrences of the same URL are simply repeated.
Which approach is better? It depends how many different URLs you have. For most small to mid-sized
websites, you can probably fit all distinct URLs, and a counter for each URL, in (say) 1 GB of
memory. In this example, the *working set* of the job (the amount of memory to which the job needs
random access) depends only on the number of distinct URLs: if there are a million log entries for a
single URL, the space required in the hash table is still just one URL plus the size of the counter.
If this working set is small enough, an in-memory hash table works fine---even on a laptop.
On the other hand, if the job's working set is larger than the available memory, the sorting
approach has the advantage that it can make efficient use of disks. It's the same principle as we
discussed in ["Log-Structured Storage"](/en/ch4#sec_storage_log_structured): chunks of data can be
sorted in memory and written out to disk as segment files, and then multiple sorted segments can be
merged into a larger sorted file. Mergesort has sequential access patterns that perform well on
disks (see ["Sequential Versus Random Writes on SSDs"](/en/ch4#sidebar_sequential)).
The `sort` utility in GNU Coreutils (Linux) automatically handles larger-than-memory datasets by
spilling to disk, and automatically parallelizes sorting across multiple CPU cores [^9].
This means that the simple chain of Unix commands we saw earlier easily scales to large datasets,
without running out of memory. The bottleneck is likely to be the rate at which the input file can
be read from disk.
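
The sort-then-count approach can be sketched in a few lines of Python. This is only an illustration of the principle, not how GNU `sort` is implemented: the function names and the `chunk_size` parameter are made up for this example, and chunks are measured in number of URLs rather than bytes. Each chunk is sorted in memory and written to a temporary segment file, the segments are merged as a stream, and adjacent duplicates are counted in the same way as `uniq -c`.

```python
import heapq
import itertools
import os
import tempfile


def write_sorted_chunk(lines):
    """Sort one in-memory chunk and write it to a temporary segment file."""
    with tempfile.NamedTemporaryFile("w", delete=False, suffix=".seg") as seg:
        seg.writelines(sorted(lines))
        return seg.name


def external_sort_count(urls, chunk_size):
    """Yield (url, count) pairs, holding at most chunk_size URLs in memory."""
    segments = []
    it = iter(urls)
    while True:
        chunk = [url + "\n" for url in itertools.islice(it, chunk_size)]
        if not chunk:
            break
        segments.append(write_sorted_chunk(chunk))

    files = [open(path) for path in segments]
    try:
        merged = heapq.merge(*files)  # streams lines from all segments in sorted order
        for url, run in itertools.groupby(merged):  # equal URLs are now adjacent
            yield url.rstrip("\n"), sum(1 for _ in run)
    finally:
        for f in files:
            f.close()
        for path in segments:
            os.remove(path)


urls = ["/b", "/a", "/b", "/c", "/a", "/b"]
counts = list(external_sort_count(urls, chunk_size=2))
top = sorted(counts, key=lambda pair: pair[1], reverse=True)
```

Only one chunk, plus one buffered line per segment during the merge, needs to be in memory at any time, and both the segment writes and the merge read files sequentially.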
A limitation of Unix tools is that they run only on a single machine. Datasets that are too large to
fit in memory or local disk present a problem---and that's where distributed batch processing
frameworks come in.
## Batch Processing in Distributed Systems {#sec_batch_distributed}
The machine that runs our Unix tool example has a number of components that work together to process
the log data:
- Storage devices that are accessed through the operating system's filesystem interface.
- A scheduler that determines when processes get to run, and how to allocate CPU resources to them.
- A series of Unix programs whose `stdin` and `stdout` are connected together by pipes.
These same components exist in distributed data processing frameworks. In fact, you can think of a
distributed processing framework as a distributed operating system; they have filesystems, job
schedulers, and programs that send data to each other through the filesystem or other communication
channels.
### Distributed Filesystems {#sec_batch_dfs}
The filesystem provided by your operating system is composed of several layers:
- At the lowest level, block device drivers speak directly to the disk, and allow the layers above
to read and write raw blocks.
- Above the block layer sits a page cache that keeps recently accessed blocks in memory for faster
access.
- The block API is wrapped in a filesystem layer that breaks up large files into blocks, and tracks
file metadata such as inodes, directories, and files. ext4 and XFS are two common implementations
on Linux, for example.
- Finally, the operating system exposes different filesystems to applications through a common API
called the virtual file system (VFS). The VFS is what allows applications to read and write in a
standard way regardless of the underlying filesystem.
Distributed filesystems work in much the same way. Files are broken up into blocks, which are
distributed across many machines. DFS blocks are typically much larger than local blocks: HDFS
(Hadoop Distributed File System) defaults to 128MB, while JuiceFS and many object stores use 4MB
blocks---much larger than ext4's 4096 bytes. Larger blocks mean less metadata to keep track of,
which makes a big difference on petabyte-sized datasets. Larger blocks also lower the overhead of
seeking to a block relative to reading it.
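
To get a feel for the metadata savings, here is a rough calculation. The 1 PiB dataset size and the use of binary units are assumptions chosen to give round numbers; real systems also keep more metadata per block than a simple count suggests.

```python
PEBIBYTE = 2**50  # assumed dataset size, in bytes
MIB = 2**20


def block_count(dataset_bytes, block_bytes):
    """Number of blocks needed, rounding up for a partial final block."""
    return -(-dataset_bytes // block_bytes)


hdfs_blocks = block_count(PEBIBYTE, 128 * MIB)  # 128 MiB blocks
ext4_blocks = block_count(PEBIBYTE, 4096)       # 4096-byte blocks
ratio = ext4_blocks // hdfs_blocks

print(f"{hdfs_blocks:,} blocks at 128 MiB")
print(f"{ext4_blocks:,} blocks at 4096 bytes ({ratio:,}x as many)")
```

With these assumptions, the larger block size reduces the number of blocks to track from hundreds of billions to a few million.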
Most physical storage devices can't write partial blocks, so operating systems require writes to use
an entire block even if the data doesn't take up the whole block. Since distributed filesystems have
larger blocks and are usually implemented on top of operating system filesystems, they don't have
this requirement. For example, a 900MB file stored with 128MB blocks would have 7 blocks that use
128MB and 1 block that uses 4MB.
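
The block layout in this example can be checked with a short calculation. The helper name `block_sizes` is made up for illustration:

```python
MB = 2**20


def block_sizes(file_bytes, block_bytes):
    """Sizes of the blocks a file is split into; only the last may be partial."""
    full, remainder = divmod(file_bytes, block_bytes)
    sizes = [block_bytes] * full
    if remainder:
        sizes.append(remainder)
    return sizes


sizes = block_sizes(900 * MB, 128 * MB)
print([size // MB for size in sizes])
```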
DFS blocks are read by making network requests to a machine in the cluster that stores the block.
Each machine runs a daemon, exposing an API that allows remote processes to read and write blocks as
files on its local filesystem. HDFS refers to these daemons as DataNodes, while GlusterFS calls them
glusterfsd processes. We'll call them *data nodes* in this book.
Distributed filesystems also implement the distributed equivalent of a page cache. Since DFS blocks
are stored as files on data nodes, reads and writes go through each data node's operating system,
which includes an in-memory page cache. This keeps frequently read data blocks in-memory on the data
nodes. Some distributed filesystems also implement more caching tiers such as the client-side and
local-disk caching found in JuiceFS.
Filesystems such as ext4 and XFS keep track of storage metadata including free space, file block
locations, directory structures, permission settings, and more. Distributed filesystems also need a
way to track file locations spread across machines, permission settings, and so on. Hadoop has a
service called the NameNode, which maintains metadata for the cluster. DeepSeek's 3FS has a metadata
service that persists its data to a key-value store such as FoundationDB.
Above the filesystem sits the VFS. A close analogue in batch processing is a distributed
filesystem's protocol. Distributed filesystems must expose a protocol or interface so that batch
processing systems can read and write files. This protocol acts as a pluggable interface: any DFS
may be used so long as it implements the protocol. For example, Amazon S3's API has been widely
adopted by other storage systems such as MinIO, Cloudflare's R2, Tigris, Backblaze's B2, and many
others. Batch processing systems with S3 support can use any of these storage systems.
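
The idea of a pluggable storage protocol can be sketched as follows. The `BlobStore` interface, the `InMemoryStore` backend, and the `count_lines` job are all hypothetical names invented for this illustration; they are not the API of S3, HDFS, or any real framework.

```python
from typing import Protocol


class BlobStore(Protocol):
    """Hypothetical minimal interface a batch job needs from its storage layer."""

    def read(self, path: str) -> bytes: ...

    def write(self, path: str, data: bytes) -> None: ...


class InMemoryStore:
    """Toy backend; a real one would talk to a DFS or an S3-compatible service."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def read(self, path: str) -> bytes:
        return self._blobs[path]

    def write(self, path: str, data: bytes) -> None:
        self._blobs[path] = data


def count_lines(store: BlobStore, src: str, dst: str) -> None:
    """A tiny 'batch job' that depends only on the interface, not the backend."""
    line_count = len(store.read(src).splitlines())
    store.write(dst, str(line_count).encode())


store = InMemoryStore()
store.write("input/access.log", b"GET /\nGET /about\nGET /\n")
count_lines(store, "input/access.log", "output/line_count")
```

Because `count_lines` only uses `read` and `write`, any backend that provides those two methods could be swapped in without changing the job.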
Some DFSs implement POSIX-compliant filesystems that appear to the operating system's VFS like any
other filesystem. Filesystem in Userspace (FUSE) or the Network File System (NFS) protocol are often
used to integrate into the VFS. NFS is perhaps the best-known distributed filesystem protocol.
The protocol was originally developed to allow multiple clients to read and write data on a single
server. More recently, filesystems such as AWS's Elastic File System (EFS) and Archil provide
NFS-compatible distributed filesystem implementations that are far more scalable. NFS clients still
connect to one endpoint, but underneath, these systems communicate with distributed metadata
services and data nodes to read and write data.
> [!TIP] DISTRIBUTED FILESYSTEMS AND NETWORK STORAGE
> Distributed filesystems are based on the *shared-nothing* principle (see ["Shared-Memory,
> Shared-Disk, and Shared-Nothing
> Architecture"](/en/ch2#sec_introduction_shared_nothing)), in contrast to the
> shared-disk approach of *Network Attached Storage* (NAS) and *Storage Area Network* (SAN)
> architectures. Shared-disk storage is implemented by a centralized storage appliance, often using
> custom hardware and special network infrastructure such as Fibre Channel. On the other hand, the
> shared-nothing approach requires no special hardware, only computers connected by a conventional
> datacenter network.
Many distributed filesystems are built on commodity hardware, which is less expensive but has higher
failure rates than enterprise-grade hardware. In order to tolerate machine and disk failures, file
blocks are replicated on multiple machines. This also allows schedulers to distribute workloads
more evenly, since they can execute a task on any node that contains a replica of the task's input data.
Replication may mean simply several copies of the same data on multiple machines, as in
[Chapter 6](/en/ch6#ch_replication), or an *erasure coding* scheme such as Reed--Solomon codes,
which allows lost data to be recovered with lower storage overhead than full replication
[^10], [^11], [^12]. The techniques are similar to RAID, which provides
redundancy across several disks attached to the same machine; the difference is that in a
distributed filesystem, file access and replication are done over a conventional datacenter network
without special hardware.
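
To illustrate how an erasure code recovers lost data with less overhead than full copies, here is a sketch using a single XOR parity block. This is a much simpler scheme than Reed--Solomon and tolerates only one lost block; the block contents and the four-node layout are assumptions made up for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """Bytewise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


data = [b"batch", b"jobs!", b"rock."]  # three data blocks, each on a different node
parity = xor_blocks(data)              # one parity block, stored on a fourth node

# Suppose the node holding data[1] fails. XORing the surviving data blocks
# with the parity block reconstructs the missing block.
recovered = xor_blocks([data[0], data[2], parity])
print(recovered)
```

This layout stores four blocks for every three blocks of data, whereas keeping three full copies stores nine. The trade-off is that a single parity block can only repair one loss; codes such as Reed--Solomon add more parity blocks so that several simultaneous losses can be repaired.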
### Object Stores {#id277}
Object storage services such as Amazon S3, Google Cloud Storage, Azure Blob Storage, and OpenStack
Swift have become a popular alternative to distributed filesystems for batch processing jobs. In
fact, the line between the two is somewhat blurry. As we saw in the previous section and ["Databases
Backed by Object Storage"](/en/ch6#sec_replication_object_storage), Filesystem in Userspace (FUSE)
drivers allow users to treat object stores such as S3 as a filesystem. Some DFS implementations such
as JuiceFS and Ceph offer both object storage and filesystem APIs. However, the APIs, performance,
and consistency guarantees of object stores and distributed filesystems are very different. Care must be taken when adopting such systems to make
sure they behave as expected, even if they seem to implement the requisite APIs.
Each object in an object store has a URL such as `s3://my-photo-bucket/2025/04/01/birthday.png`. The
host portion of the URL (`my-photo-bucket`) describes the bucket where objects are stored, and the
part that follows is the object's *key* (`2025/04/01/birthday.png` in our example). A bucket has a
globally unique name, and each object's key must be unique within its bucket.
Objects are read using a `get` call and written using a `put` call. Unlike files on a filesystem,
objects are immutable once written. To update an object, it must be fully rewritten using a `put`
call, similarly to a key-value store. Azure Blob Storage and S3 Express One Zone support appends,
but most other stores do not. There are no file handle APIs with functions like `fopen` and `fseek`.
Objects may look as if they are organized into directories, which is somewhat confusing, since
object stores do not have the concept of directories. The path structure is simply a convention, and
the slashes are a part of the object's key. This convention allows you to perform something similar
to a directory listing by requesting a list of objects with a particular prefix. However, listing
objects by prefix is different from a filesystem directory listing in two ways:
- A prefix `list` operation behaves like a recursive `ls -R` call on a Unix system: it returns all
objects that start with the prefix---objects in subpaths are included.
- Empty directories are not possible: if you were to remove all objects underneath
`s3://my-photo-bucket/2025/04/01`, then `01` would no longer appear when we call `list` on
`s3://my-photo-bucket/2025/04`. It is a common practice to create a zero-byte object as a way to
represent an empty directory (e.g. creating an empty `s3://my-photo-bucket/2025/04/01` file to
keep it present when all child objects are deleted).
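To make these two differences concrete, here is a minimal sketch that uses a plain Python dictionary as a stand-in for a bucket. The `list_prefix` helper is hypothetical and is not the API of any real object store; it only models the flat key namespace described above.

```python
# Toy model of a bucket: a flat mapping from object key to bytes.
# There are no directories; slashes are just characters in the key.
bucket = {
    "2025/04/01/birthday.png": b"...",
    "2025/04/01/cake.png": b"...",
    "2025/04/02/hike.png": b"...",
}


def list_prefix(objects, prefix):
    """Return every key that starts with prefix, sorted (hypothetical helper)."""
    return sorted(key for key in objects if key.startswith(prefix))


# Listing by prefix also returns objects in "subdirectories", like `ls -R`.
april = list_prefix(bucket, "2025/04/")

# Deleting every object under 2025/04/01/ makes that "directory" vanish.
for key in list_prefix(bucket, "2025/04/01/"):
    del bucket[key]
after_delete = list_prefix(bucket, "2025/04/")

# A zero-byte marker object keeps the path visible in listings.
bucket["2025/04/01"] = b""
with_marker = list_prefix(bucket, "2025/04/")
```

After the deletion, `after_delete` contains only the `2025/04/02` object; the marker object brings `2025/04/01` back into the listing even though it has no children.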
DFS implementations often support many common filesystem operations such as hard links, symbolic
links, file locking, and atomic renames. Such features are missing from object stores. Linking and
locks are typically not supported, while renames are non-atomic; they're accomplished by copying the
object to the new key, and then deleting the old object. If you want to rename a directory, you have
to individually rename every object within it, since the directory name is a part of the key.
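The copy-then-delete behavior can be sketched as follows, again with a dictionary standing in for a bucket. Both helper functions are hypothetical illustrations, not calls from a real object store client.

```python
def rename_object(objects, old_key, new_key):
    """Rename by copy-then-delete; the two steps are not atomic."""
    objects[new_key] = objects[old_key]  # step 1: copy to the new key
    # A crash or a concurrent reader at this point would see both keys.
    del objects[old_key]                 # step 2: delete the old object


def rename_prefix(objects, old_prefix, new_prefix):
    """'Rename a directory' by renaming every object under the prefix."""
    moved = 0
    # sorted() materializes the key list before we start mutating the dict.
    for key in sorted(k for k in objects if k.startswith(old_prefix)):
        rename_object(objects, key, new_prefix + key[len(old_prefix):])
        moved += 1
    return moved


bucket = {
    "2025/04/01/birthday.png": b"a",
    "2025/04/01/cake.png": b"b",
    "2025/05/01/picnic.png": b"c",
}
moved = rename_prefix(bucket, "2025/04/", "archive/2025-04/")
```

The cost of the "directory" rename grows with the number of objects under the prefix, and a failure partway through leaves some objects under the old prefix and some under the new one.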
The key-value stores we discussed in [Chapter 4](/en/ch4#ch_storage) are optimized for small values
(typically kilobytes) and frequent, low-latency reads/writes. In contrast, distributed filesystems
and object stores are generally optimized for large objects (megabytes to gigabytes) and less
frequent, larger reads. Recently, however, object stores have begun to add support for frequent and
smaller reads/writes. For example, S3 Express One Zone now offers single-millisecond latency and a
pricing model that is more similar to key-value stores.
Another difference between distributed filesystems and object stores is that DFSes such as HDFS
allow computing tasks to be run on the machine that stores a copy of a particular file. This allows
the task to read that file without having to send it over the network, which saves bandwidth if the
executable code of the task is smaller than the file it needs to read. On the other hand, object
stores usually keep storage and computation separate. Doing so might use more bandwidth, but modern
datacenter networks are very fast, so this is often acceptable. This architecture also allows
machine resources such as CPU and memory to be scaled independently of storage since the two are
decoupled.
### Distributed Job Orchestration {#id278}
Our operating system analogy also applies to job orchestration. When you execute a Unix batch job,
something needs to actually run the `awk`, `sort`, `uniq`, and `head` processes. Data needs to be
transferred from one process's output to another process's input, memory must be allocated for each
process, instructions from each process must be scheduled fairly and executed on the CPU, memory and
I/O boundaries must be enforced, and so on. On a single machine, an operating system's kernel is
responsible for such work. In a distributed environment, this is the role of a job orchestrator.
Batch processing frameworks send a request to an orchestrator's scheduler to run a job. Requests to
start a job contain metadata such as:
- the number of tasks to execute,
- the amount of memory, CPU, and disk needed for each task,
- a job identifier,
- access credentials,
- job parameters such as input and output data,
- required hardware details such as GPUs or disk types, and
- where the job's executable code is located.
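As a rough illustration, such a request could be modeled as a small data structure. All field names, bucket names, and values below are assumptions made for this sketch; they do not correspond to the request schema of YARN, Kubernetes, or any other orchestrator.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class TaskResources:
    memory_mb: int
    cpu_cores: int
    disk_gb: int
    gpu_type: Optional[str] = None  # required hardware, if any


@dataclass(frozen=True)
class JobRequest:
    job_id: str
    num_tasks: int
    per_task: TaskResources
    credentials_ref: str   # reference to access credentials, not the secret itself
    executable_uri: str    # where the job's executable code is located
    parameters: dict = field(default_factory=dict)

    def total_cpu_cores(self) -> int:
        """Cores needed if every task runs at the same time."""
        return self.num_tasks * self.per_task.cpu_cores


request = JobRequest(
    job_id="daily-url-counts",
    num_tasks=10,
    per_task=TaskResources(memory_mb=4096, cpu_cores=2, disk_gb=20),
    credentials_ref="secret/batch-reader",
    executable_uri="s3://my-jobs-bucket/url-counts.jar",
    parameters={
        "input": "s3://my-logs-bucket/2025/04/01/",
        "output": "s3://my-reports-bucket/url-counts/",
    },
)
```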
Orchestrators such as Kubernetes and Hadoop YARN (Yet Another Resource Negotiator) [^13]
combine this information with cluster metadata to execute the job using the following components:
Task executors
: An executor daemon such as YARN's *NodeManager* or Kubernetes's *kubelet* runs on each node in
the cluster. Executors are responsible for running job tasks, sending heartbeats to signal their
liveness, and tracking task status and resource allocation on the node. When a task-start
request is sent to an executor, it retrieves the job's executable code and runs a command to
start the task. The executor then monitors the process until it finishes or fails, at which
point it updates the task status metadata accordingly.
Many executors also work with the operating system to provide both security and performance
isolation. YARN and Kubernetes both use Linux *cgroups*, for example. This prevents tasks from
accessing data without permission, or from negatively affecting the performance of other tasks
on the node by using excessive resources.
Resource manager
: An orchestrator's resource manager stores metadata about each node, including available hardware
(CPUs, GPUs, memory, disks, and so on), task statuses, network location, node status, and other
relevant information. Thus, the manager provides a global view of the cluster's current state.
The centralized nature of the resource manager can lead to both scalability and availability
bottlenecks. YARN uses ZooKeeper and Kubernetes uses etcd to store cluster state (see
["Coordination Services"](/en/ch10#sec_consistency_coordination)).
Scheduler
: Orchestrators usually have a centralized scheduler subsystem, which receives requests to start,
stop, or check on the status of a job. For example, a scheduler might receive a request to start
a job with 10 tasks using a specific Docker image on nodes that have a specific type of GPU. The
scheduler uses the information from the request and state of the resource manager to determine
which tasks to run on which nodes. The task executors are then informed of their assigned work
and begin execution.
Though each orchestrator uses different terminology, you will find these components in nearly all
orchestration systems.
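The interaction between the scheduler and the resource manager's view of the cluster can be sketched with a toy placement function. This is a deliberately simplified first-fit strategy with made-up node names and GPU labels; real schedulers consider far more state and use more elaborate policies.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_cores: int
    gpu_type: str  # empty string when the node has no GPU


def schedule(nodes, num_tasks, cores_per_task, gpu_type):
    """Place each task on the first matching node with enough spare cores.

    Returns one node name per task, or None if the job does not fit.
    Nothing is reserved unless every task can be placed.
    """
    free = {n.name: n.free_cores for n in nodes if n.gpu_type == gpu_type}
    placement = []
    for _ in range(num_tasks):
        target = next(
            (name for name, cores in free.items() if cores >= cores_per_task),
            None,
        )
        if target is None:
            return None
        free[target] -= cores_per_task
        placement.append(target)
    # Commit the allocation to the shared view of the cluster.
    for n in nodes:
        if n.name in free:
            n.free_cores = free[n.name]
    return placement


cluster = [Node("node-1", 8, "a100"), Node("node-2", 16, ""), Node("node-3", 8, "a100")]
placement = schedule(cluster, num_tasks=3, cores_per_task=4, gpu_type="a100")
```

Here `node-2` is never chosen because it lacks the requested GPU type, even though it has the most free cores.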
> [!NOTE]
> Scheduling decisions sometimes require application-specific schedulers that can take into account
> particular requirements, such as auto-scaling read replicas when a certain query threshold is
> reached. The centralized scheduler and application-specific schedulers work together to determine
> how to best execute tasks. YARN refers to its sub-schedulers as *ApplicationMasters*, while
> Kubernetes calls them *operators*.
#### Resource Allocation {#id279}
Schedulers have a particularly challenging role in job orchestration: they must figure out how to
best allocate the cluster's limited resources amongst jobs with competing needs. Fundamentally, their
decisions must balance fairness and efficiency.
Imagine a small cluster with five nodes that has a total of 160 CPU cores available. The cluster's
scheduler receives two job requests, each wanting 100 cores to complete its work. What's the best
way to schedule the workload?
- The scheduler could decide to run 80 tasks for each job, starting the remaining 20 tasks for each
job as earlier tasks complete.
- The scheduler could run all of one job's tasks, and begin running the second job's tasks only when
100 cores are available, a strategy known as *gang scheduling*.
- One job request comes before the other. The scheduler has to decide whether to allocate all 100
cores to that job, or hold some back in anticipation of future jobs.
This is a very simple example, but we already see many difficult trade-offs. In the gang-scheduling
scenario, for example, if the scheduler reserves CPU cores until all 100 are available at the same
time, nodes will sit idle. The cluster's resource utilization will drop and a deadlock might occur
if other jobs also attempt to reserve CPU cores.
On the other hand, if the scheduler simply waits for 100 cores to become available, other jobs might
grab the cores in the meantime. The cluster might not have 100 cores available for a very long time,
which leads to *starvation*. The scheduler could decide to *preempt* some of the first job's tasks,
killing them to make room for the second job. Task preemption decreases cluster efficiency as well,
since the killed tasks will need to be restarted later and re-run.
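The arithmetic behind the first two options is worth spelling out. The sketch below assumes one core per task, as in the example, and computes how many tasks are deferred under an even split and how many cores sit idle under gang scheduling.

```python
TOTAL_CORES = 160
JOB_CORES = 100  # each of the two jobs wants this many cores

# Option 1: split the cluster evenly and start the leftover tasks later.
running_per_job = TOTAL_CORES // 2
deferred_per_job = JOB_CORES - running_per_job

# Option 2: gang-schedule the first job; the second waits for 100 free cores.
idle_while_second_waits = TOTAL_CORES - JOB_CORES
utilization = JOB_CORES / TOTAL_CORES
```

With gang scheduling, 60 cores are idle while the second job waits, so utilization is 62.5% until the first job releases its cores.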
Now imagine a scheduler that must make allocation decisions for hundreds or even millions of such
job requests. Finding an optimal solution seems intractable. In fact, the problem is *NP-hard*,
which means that it is prohibitively slow to calculate an optimal solution for all but the smallest
examples [^14], [^15].
In practice, schedulers therefore use heuristics to make non-optimal but reasonable decisions.
Several algorithms are commonly used, including first-in first-out (FIFO), dominant resource
fairness (DRF), priority queues, capacity or quota-based scheduling, and various bin-packing
algorithms. The details for such algorithms are beyond the scope of this book, but they're a
fascinating area of research.
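To give a flavor of such heuristics, here is a minimal sketch of *first-fit decreasing*, one well-known bin-packing heuristic. It considers a single resource (CPU cores) and identical nodes, which is a strong simplification; the function name and the example numbers are our own.

```python
def first_fit_decreasing(task_cores, node_capacity):
    """Pack tasks onto nodes, largest task first.

    Returns a list of nodes, each a list of the task sizes placed on it.
    The result is reasonable but not guaranteed to be optimal.
    """
    nodes = []
    for cores in sorted(task_cores, reverse=True):
        if cores > node_capacity:
            raise ValueError(f"task needing {cores} cores fits on no node")
        for node in nodes:
            if sum(node) + cores <= node_capacity:
                node.append(cores)  # first node with enough room
                break
        else:
            nodes.append([cores])   # no existing node fits: use a new one
    return nodes


packing = first_fit_decreasing([8, 20, 12, 4, 16, 4], node_capacity=32)
```

In this example the heuristic fills two 32-core nodes exactly, but on other inputs it may use more nodes than an optimal packing would.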
#### Scheduling Workflows {#sec_batch_workflows}
The Unix tools example at the start of this chapter involved a chain of several commands, connected
by Unix pipes. The same pattern arises in distributed batch processes: often the output from one job
needs to become the input to one or more other jobs, and each job may have several inputs that are
produced by other jobs. This is called a *workflow* or *directed acyclic graph (DAG)* of jobs.
> [!NOTE]
> In ["Durable Execution and Workflows"](/en/ch5#sec_encoding_dataflow_workflows) we
> saw workflow engines that offer durable execution of a sequence of steps, typically performing RPCs.
> In the context of batch processing, "workflow" has a different meaning: it's a sequence of batch
> processes, each taking input data and producing output data, but normally not making RPCs to
> external services. Durable execution engines typically process less data per request than their
> batch processing counterparts, though the line is somewhat fuzzy.
There are several reasons why a workflow of multiple jobs might be needed:
- If the output of one job needs to become the input to several other jobs, which are maintained by
different teams, it's best for the first job to first write its output to a location where all the
other jobs can read it. Those consuming jobs can then be scheduled to run every time that data has
been updated, or on some other schedule.
- You might want to transfer data from one processing tool to another. For example, a Spark job
might output its data to HDFS, then a Python script might trigger a Trino SQL query (see ["Cloud
Data Warehouses"](/en/ch4#sec_cloud_data_warehouses)) that does further processing on the HDFS
files and outputs to S3.
- Some data pipelines internally require multiple processing stages. For example, if one stage needs
to shard the data by one key, and the next stage needs to shard by a different key, the first
stage can output data sharded in the way that is required by the second stage.
In the Unix tools example, the pipe that connects the output of one command to the input of another
uses only a small in-memory buffer, and doesn't write the data into a file. If that buffer fills up,
the producing process needs to wait until the consuming process has read some data from the buffer
before it can output more---a form of backpressure. Spark, Flink, and other batch execution engines
support a similar model where the output of one task is directly passed to another task (over the
network if the tasks are running on different machines).
However, in a workflow it is more usual for one job to write its output to a distributed filesystem
or object store, and for the next job to read it from there. This decouples the jobs from each
other, allowing them to run at different times. If a job has several inputs, a workflow scheduler
typically waits until all of the jobs that produce its inputs have completed successfully before
running the job that consumes those inputs.
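The rule that a job runs only once all of its inputs are complete amounts to processing the DAG in topological order. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library; the job names are hypothetical, and a real workflow scheduler would actually run each job and handle failures rather than marking it done immediately.

```python
from graphlib import TopologicalSorter

# Each job maps to the set of jobs whose output it reads.
workflow = {
    "clean_events": set(),
    "clean_users": set(),
    "join_events_users": {"clean_events", "clean_users"},
    "age_report": {"join_events_users"},
    "export_to_warehouse": {"join_events_users"},
}

sorter = TopologicalSorter(workflow)
sorter.prepare()  # raises CycleError if the graph is not acyclic

waves = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())  # jobs whose inputs are all complete
    waves.append(ready)
    for job in ready:
        # A real scheduler would run the job here and call done()
        # only after it completed successfully.
        sorter.done(job)
```

Jobs within the same wave have no dependencies on each other, so they could run in parallel or at different times.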
Schedulers found in orchestration frameworks such as YARN's ResourceManager or Spark's built-in
scheduler do not manage entire workflows; they do scheduling on a per-job basis. To handle these
dependencies between job executions, various workflow schedulers have been developed, including
Airflow, Dagster, and Prefect. Workflow schedulers have management features that are useful when
maintaining a large collection of batch jobs. Workflows consisting of 50 to 100 jobs are common in
many data pipelines, and in a large organization, many different teams may be running different jobs
or workflows that read each other's output across many different systems. Tool support is important
for managing such complex dataflows.
#### Handling Faults {#id281}
Batch jobs often run for long periods of time. Long-running jobs with many parallel tasks are likely
to experience at least one task failure along the way. As discussed in ["Hardware and Software
Faults"](/en/ch2#sec_introduction_hardware_faults) and ["Unreliable
Networks"](/en/ch9#sec_distributed_networks), there are many reasons why this could happen,
including hardware faults (especially on commodity hardware), or network interruptions.
Another reason why a task might not finish running is that the scheduler may intentionally preempt
(kill) it. Preemption is particularly useful if you have multiple priority levels: low-priority
tasks that are cheaper to run, and high-priority tasks that cost more. Low-priority tasks can run
whenever there is spare computing capacity, but they run the risk of being preempted at any moment
if a higher-priority task arrives. Such cheaper, low-priority virtual machines are called *spot
instances* on Amazon EC2, *spot virtual machines* on Azure, and *preemptible instances* on Google
Cloud [^16].
Since batch processing is often used for jobs that are not time-sensitive, it is well suited for
using low-priority tasks and spot instances to reduce the cost of running jobs. Essentially, those
jobs can use spare computing resources that would otherwise be idle, and thereby increase the
utilization of the cluster. However, this also means that those tasks are more likely to be killed
by the scheduler: preemptions occur more frequently than hardware faults [^17].
Since batch jobs regenerate their output from scratch every time they are run, task failures are
easier to handle than in online systems: the system can delete the partial output from the failed
execution and schedule it to run again on another machine. It would be wasteful to rerun the entire
job due to a single task failure, though. MapReduce and its successors therefore keep the execution
of parallel tasks independent from each other, so that they can retry work at the granularity of an
individual task [^3].
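Task-level retry can be sketched in a few lines. The `run_job` and `flaky_task` functions below are hypothetical, and the failure is simulated deterministically; the point is only that a failed task is rerun on its own while the results of the other tasks are kept.

```python
def run_job(tasks, run_task, max_attempts=3):
    """Run every task, retrying only the tasks that fail."""
    results = {}
    attempts = {}
    for task_id in tasks:
        for attempt in range(1, max_attempts + 1):
            attempts[task_id] = attempt
            try:
                results[task_id] = run_task(task_id, attempt)
                break
            except RuntimeError:
                # The partial output of the failed attempt is discarded.
                continue
        else:
            raise RuntimeError(f"task {task_id} failed {max_attempts} times")
    return results, attempts


def flaky_task(task_id, attempt):
    # Stand-in for a fault: task 2 is preempted on its first attempt.
    if task_id == 2 and attempt == 1:
        raise RuntimeError("preempted")
    return task_id * 10


results, attempts = run_job([1, 2, 3], flaky_task)
```

This only works because the tasks are independent: rerunning task 2 does not require tasks 1 or 3 to run again.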
Fault tolerance is trickier when the output of one task becomes the input to another task as part of
a workflow. MapReduce solves this by always writing such intermediate data back to the distributed
filesystem, and waiting for the writing task to complete successfully before allowing other tasks to
read the data. This works, even in an environment where preemption is common, but it means a lot of
writes to the DFS, which can be inefficient.
Spark keeps intermediate data in memory or "spills" to local disk, and only writes the final result
to the DFS. It also keeps track of how the intermediate data was computed, allowing Spark to
recompute it in case it is lost [^18]. Flink uses a different approach based on
periodically checkpointing a snapshot of tasks [^19]. We will return to this topic in
["Dataflow Engines"](/en/ch11#sec_batch_dataflow).
## Batch Processing Models {#id431}
We have seen how batch jobs are scheduled in a distributed environment. Let us now turn our
attention to how batch processing frameworks actually process data. The two most common models are
MapReduce and dataflow engines. Although dataflow engines have largely replaced MapReduce in
practice, it is useful to understand how MapReduce works, since it influenced many modern batch
processing frameworks.
MapReduce and dataflow engines have evolved to support multiple programming models including
low-level programmatic APIs, relational query languages, and DataFrame APIs. A variety of options
enable application engineers, analytics engineers, business analysts, and even non-technical
employees to process company data for various use cases, which we'll discuss in ["Batch Use
Cases"](/en/ch11#sec_batch_output).
### MapReduce {#sec_batch_mapreduce}
The pattern of data processing in MapReduce is very similar to the web server log analysis example
in ["Simple Log Analysis"](/en/ch11#sec_batch_log_analysis):
1. Read a set of input files, and break it up into *records*. In the web server log example, each
record is one line in the log (that is, `\n` is the record separator). In Hadoop's MapReduce,
the input file is stored in a distributed filesystem like HDFS or an object store like S3.
Various file formats are used, such as Apache Parquet (a columnar format, see ["Column-Oriented
Storage"](/en/ch4#sec_storage_column)) or Apache Avro (a row-based format, see
["Avro"](/en/ch5#sec_encoding_avro)).
2. Call the mapper function to extract a key and value from each input record. In the Unix tool
example, the mapper function is `awk '{print $7}'`: it extracts the URL (`$7`) as the key, and
leaves the value empty.
3. Sort all of the key-value pairs by key. In the log example, this is done by the first `sort`
command.
4. Call the reducer function to iterate over the sorted key-value pairs. If there are multiple
occurrences of the same key, the sorting has made them adjacent in the list, so it is easy to
combine those values without having to keep a lot of state in memory. In the Unix tool example,
the reducer is implemented by the command `uniq -c`, which counts the number of adjacent records
with the same key.
Those four steps can be performed by one MapReduce job. Steps 2 (map) and 4 (reduce) are where you
write your custom data processing code. Step 1 (breaking files into records) is handled by the input
format parser. Step 3, the `sort` step, is implicit in MapReduce---you don't have to write it,
because the output from the mapper is always sorted before it is given to the reducer. This sorting
step is a foundational batch processing algorithm, which we'll revisit in ["Shuffling
Data"](/en/ch11#sec_shuffle).
To create a MapReduce job, you need to implement two callback functions, the mapper and reducer,
which behave as follows:
Mapper
: The mapper is called once for every input record, and its job is to extract the key and value
from the input record. For each input, it may generate any number of key-value pairs (including
none). It does not keep any state from one input record to the next, so each record is handled
independently.
Reducer
: The MapReduce framework takes the key-value pairs produced by the mappers, collects all the
values belonging to the same key, and calls the reducer with an iterator over that collection of
values. The reducer can produce output records (such as the number of occurrences of the same
URL).
In the web server log example, we had a second `sort` command in step 5, which ranked URLs by number
of requests. In MapReduce, if you need a second sorting stage, you can implement it by writing a
second MapReduce job and using the output of the first job as input to the second job. Viewed like
this, the role of the mapper is to prepare the data by putting it into a form that is suitable for
sorting, and the role of the reducer is to process the data that has been sorted.
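The following single-process sketch mimics these steps in plain Python. It is an illustration of the programming model only, not of Hadoop's API: `run_mapreduce` and the mapper and reducer names are our own, and everything runs in memory on one machine.

```python
from itertools import groupby
from operator import itemgetter


def run_mapreduce(records, mapper, reducer):
    # Step 2: call the mapper on every record; it may emit zero or more pairs.
    pairs = [pair for record in records for pair in mapper(record)]
    # Step 3: sort by key (the framework does this implicitly).
    pairs.sort(key=itemgetter(0))
    # Step 4: call the reducer once per key with an iterator over its values.
    output = []
    for key, group in groupby(pairs, key=itemgetter(0)):
        output.extend(reducer(key, (value for _, value in group)))
    return output


def url_mapper(line):
    fields = line.split()
    if len(fields) >= 7:
        yield fields[6], None  # like awk '{print $7}': URL as key, empty value


def count_reducer(url, values):
    yield url, sum(1 for _ in values)  # like uniq -c


def swap_mapper(record):
    url, count = record
    yield count, url  # second job: make the count the sort key


def identity_reducer(count, urls):
    for url in urls:
        yield count, url


log_lines = [  # step 1: one record per log line
    '10.0.0.1 - - [27/Jun/2025:17:55:11 +0000] "GET /index.html HTTP/1.1" 200 3377',
    '10.0.0.2 - - [27/Jun/2025:17:55:12 +0000] "GET /about.html HTTP/1.1" 200 1200',
    '10.0.0.3 - - [27/Jun/2025:17:55:13 +0000] "GET /index.html HTTP/1.1" 200 3377',
]
counts = run_mapreduce(log_lines, url_mapper, count_reducer)
ranked = run_mapreduce(counts, swap_mapper, identity_reducer)
```

The second call plays the role of the second MapReduce job: its mapper only rearranges each record so that the sort step orders the URLs by request count (in ascending order here).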
> [!TIP] MAPREDUCE AND FUNCTIONAL PROGRAMMING
> Though MapReduce is used for batch processing, the programming model comes from functional
> programming. Lisp introduced *map* and *reduce* (or *fold*) as higher‑order functions on lists, and
> they have made their way into mainstream languages such as Python, Rust, and Java. Many common data
> processing operations, including those offered by SQL, can be implemented on top of MapReduce. Both
> functions, and functional programming in general, have important properties that MapReduce benefits
> from. Map and reduce are composable, which fits nicely with data processing (as we saw in our Unix
> example). Map is also *embarrassingly parallel* (each input is processed independently), which
> simplifies MapReduce's parallel execution. For reduce, different keys can be processed in parallel.
Implementing a complex processing job using the raw MapReduce APIs is actually quite hard and
laborious---for instance, any join algorithms used by the job would need to be implemented from
scratch [^20]. MapReduce is also quite slow compared to more modern batch processors. One
reason is that its file-based I/O prevents job pipelining, i.e., processing output data in a
downstream job before the upstream job is complete.
### Dataflow Engines {#sec_batch_dataflow}
In order to fix some of MapReduce's problems, several new execution engines for distributed batch
computations were developed, the most well known of which are Spark [^18], [^21] and
Flink [^19]. There are various differences in the way they are designed, but they have one
thing in common: they handle an entire workflow as one job, rather than breaking it up into
independent subjobs.
Since they explicitly model the flow of data through several processing stages, these systems are
known as *dataflow engines*. Like MapReduce, they support a low-level API that repeatedly calls a
user-defined function to process one record at a time, but they also offer higher-level operators
such as *join* and *group by*. They parallelize work by sharding inputs, and they copy the output of
one task over the network to become the input to another task. Unlike in MapReduce, operators need
not take the strict roles of alternating map and reduce, but instead can be assembled in more
flexible ways.
These dataflow APIs generally use relational-style building blocks to express a computation: joining
datasets on the value of some field; grouping tuples by key; filtering by some condition; and
aggregating tuples by counting, summing, or other functions. Internally, these operations are
implemented using the shuffle algorithms that we discuss in the next section.
This style of processing engine is based on research systems like Dryad [^22] and Nephele
[^23], and it offers several advantages compared to the MapReduce model:
- Expensive work such as sorting need only be performed in places where it is actually required,
rather than always happening by default between every map and reduce stage.
- When there are several operators in a row that don't change the sharding of the dataset (such as
map or filter), they can be combined into a single task, reducing data copying overheads.
- Because all joins and data dependencies in a workflow are explicitly declared, the scheduler has
an overview of what data is required where, so it can make locality optimizations. For example, it
can try to place the task that consumes some data on the same machine as the task that produces
it, so that the data can be exchanged through a shared memory buffer rather than having to copy it
over the network.
- It is usually sufficient for intermediate state between operators to be kept in memory or written
to local disk, which requires less I/O than writing it to a distributed filesystem or object store
(where it must be replicated to several machines and written to disk on each replica). MapReduce
already uses this optimization for mapper output, but dataflow engines generalize the idea to all
intermediate state.
- Operators can start executing as soon as their input is ready; there is no need to wait for the
entire preceding stage to finish before the next one starts.
- Existing processes can be reused to run new operators, reducing startup overheads compared to
MapReduce (which launches a new JVM for each task).
You can use dataflow engines to implement the same computations as MapReduce workflows, and they
usually execute significantly faster due to the optimizations described here.
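The idea of combining consecutive operators into a single task can be loosely illustrated with Python generators. This is an analogy rather than a description of how Spark or Flink are implemented: the three operators below are chained so that each record flows through all of them in one pass, and no intermediate dataset is materialized.

```python
def fused_task(shard):
    """Parse, filter, and project the records of one shard in a single pass."""
    parsed = (line.split(",") for line in shard)              # map
    ok = (fields for fields in parsed if fields[2] == "200")  # filter
    return ((fields[0], fields[1]) for fields in ok)          # map


shard = [
    "alice,/index.html,200",
    "bob,/missing,404",
    "carol,/about.html,200",
]
result = list(fused_task(shard))
```

None of these operators changes how the data is sharded, so they can run inside the same task; an operator such as a join or group by would instead require a shuffle between tasks.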
### Shuffling Data {#sec_shuffle}
We saw that both the Unix tools example at the beginning of the chapter and MapReduce are based on
sorting. Batch processors need to be able to sort datasets petabytes in size, which are too large to
fit on a single machine. They therefore require a distributed sorting algorithm where both the input
and the output are sharded. Such an algorithm is called a *shuffle*.
> [!NOTE] SHUFFLE IS NOT RANDOM
> The term *shuffle* is confusing. When you shuffle a deck of cards, you end up with a random order.
> In contrast, the shuffle we're talking about here produces a sorted order, with no randomness.
Shuffling is a foundational algorithm for batch processors, where it is used for joins and
aggregations. MapReduce, Spark, Flink, Daft, Dataflow, and BigQuery [^24] all implement
scalable and performant shuffle algorithms in order to handle large datasets. We'll use the shuffle
in Hadoop MapReduce [^25] for illustration purposes, but the concepts in this section
translate to other systems as well.
[Figure 11-1](/en/ch11#fig_batch_mapreduce) shows the dataflow in a MapReduce job. We assume that
the input to the job is sharded, and the shards are labelled *m 1*, *m 2*, and *m 3*. For example,
each shard may be a separate file on HDFS or a separate object in an object store, and all the
shards belonging to the same dataset are grouped into the same HDFS directory or have the same key
prefix in an object store bucket.
{{< figure src="/fig/ddia_1101.png" id="fig_batch_mapreduce" caption="Figure 11-1. A MapReduce job with three mappers and three reducers." class="w-full my-4" >}}
The framework starts a separate map task for each input shard. A task reads its assigned file,
passing one record at a time to the mapper callback. The reduce side of the computation is also
sharded. While the number of map tasks is determined by the number of input shards, the number of
reduce tasks is configured by the job author (it can be different from the number of map tasks).
The output of the mapper consists of key-value pairs, and the framework needs to ensure that if two
different mappers output the same key, those key-value pairs end up being processed by the same
reducer task. To achieve this, each mapper creates a separate output file on its local disk for
every reducer (for example, the file *m 1, r 2* in [Figure 11-1](/en/ch11#fig_batch_mapreduce) is
the file created by mapper 1 containing the data destined for reducer 2). When the mapper outputs a
key-value pair, a hash of the key typically determines which reducer file it is written to
(similarly to ["Sharding by Hash of Key"](/en/ch7#sec_sharding_hash)).
While a mapper is writing these files, it also sorts the key-value pairs within each file. This can
be done using the techniques we saw in ["Log-Structured
Storage"](/en/ch4#sec_storage_log_structured): batches of key-value pairs are first collected in a
sorted data structure in memory, then written out as sorted segment files, and smaller segment files
are progressively merged into larger ones.
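The map side of the shuffle can be sketched as follows. The use of CRC32 as the hash function is an assumption made for this example (it is not what Hadoop uses by default), and the per-reducer "files" are simply in-memory lists that are sorted in one step rather than built from merged segments.

```python
import zlib


def reducer_for(key, num_reducers):
    """Pick a reducer index from a stable hash of the key."""
    # Python's built-in hash() is salted per process for strings, so two
    # mappers in different processes could disagree; CRC32 is deterministic.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def partition(pairs, num_reducers):
    """Split one mapper's output into a sorted 'file' per reducer."""
    files = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        files[reducer_for(key, num_reducers)].append((key, value))
    return [sorted(f, key=lambda kv: kv[0]) for f in files]


# Output of two different mappers, partitioned for three reducers.
m1_files = partition([("b", 1), ("a", 2), ("c", 3), ("a", 4)], 3)
m2_files = partition([("a", 5), ("c", 6)], 3)
```

Because both mappers use the same hash function, all pairs with key `a` land in the file with the same reducer index, no matter which mapper produced them.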
After each mapper finishes, reducers connect to it and copy the appropriate file of sorted key-value
pairs to their local disk. Once the reduce task has its share of the output from all of the mappers,
it merges these files together, preserving the sort order, mergesort-style. Key-value pairs with the
same key are now consecutive, even if they came from different mappers. The reducer function is then
called once per key, each time with an iterator that returns all the values for that key.
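The reduce side can be sketched with `heapq.merge`, which lazily merges already-sorted inputs, and `itertools.groupby`, which groups consecutive records with the same key. The function names and the sample data are our own; in a real system the sorted inputs would be files copied from the mappers rather than lists.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def reduce_side(sorted_files, reducer):
    """Merge sorted files from all mappers and call the reducer once per key."""
    merged = heapq.merge(*sorted_files, key=itemgetter(0))  # keeps sort order
    output = []
    for key, group in groupby(merged, key=itemgetter(0)):
        output.extend(reducer(key, (value for _, value in group)))
    return output


def sum_reducer(key, values):
    yield key, sum(values)


# Sorted files destined for this reducer, one from each of three mappers.
from_m1 = [("/about.html", 1), ("/index.html", 1), ("/index.html", 1)]
from_m2 = [("/index.html", 1)]
from_m3 = [("/about.html", 1), ("/contact.html", 1)]

totals = reduce_side([from_m1, from_m2, from_m3], sum_reducer)
```

Since the merge is streaming, the reducer never needs to hold more than the values for the current key, however large the inputs are.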
Any records output by the reducer function are sequentially written to a file, with one file per
reduce task. These files (*r 1*, *r 2*, *r 3* in [Figure 11-1](/en/ch11#fig_batch_mapreduce)) become
the shards of the job's output dataset, and they are written back to the distributed filesystem or
object store.
Though MapReduce executes the shuffle step between its map and reduce steps, modern dataflow engines
and cloud data warehouses are more sophisticated. Systems such as BigQuery have optimized their
shuffle algorithms to keep data in memory and to write data to external sorting services
[^24]. Such services speed up shuffling and replicate shuffled data to provide resilience.
#### JOIN and GROUP BY {#sec_batch_join}
Let's look at how sorted data simplifies distributed joins and aggregations. We'll continue with
MapReduce for illustration purposes, though these concepts apply to most batch processing systems.
A typical example of a join in a batch job is illustrated in
[Figure 11-2](/en/ch11#fig_batch_join_example). On the left is a log of events describing the things
that logged-in users did on a website (known as *activity events* or *clickstream data*), and on the
right is a database of users. You can think of this example as being part of a star schema (see
["Stars and Snowflakes: Schemas for Analytics"](/en/ch3#sec_datamodels_analytics)): the log of
events is the fact table, and the user database is one of the dimensions.
{{< figure src="/fig/ddia_1102.png" id="fig_batch_join_example" caption="Figure 11-2. A join between a log of user activity events and a database of user profiles." class="w-full my-4" >}}
If you want to perform an analysis of the activity events that takes into account information from
the user database (for example, find out whether certain pages are more popular with younger or
older users, using the date of birth field in the user profile), you need to compute a join between
these two tables. How would you compute that join, assuming both tables are so large that they have
to be sharded?
You can use the fact that in MapReduce, the shuffle brings together all the key-value pairs with the
same key to the same reducer, no matter which shard they were on originally. Here, the user ID can
serve as the key. You can therefore write a mapper that goes over the user activity events, and
emits page view URLs keyed by user ID, as illustrated in
[Figure 11-3](/en/ch11#fig_batch_join_reduce). Another mapper goes over the user database row by
row, extracting the user ID as the key and the user's date of birth as the value.
{{< figure src="/fig/ddia_1103.png" id="fig_batch_join_reduce" caption="Figure 11-3. A sort-merge join on user ID. If the input datasets are sharded into multiple files, each could be processed with multiple mappers in parallel." class="w-full my-4" >}}
The shuffle then ensures that a reducer function can access a particular user's date of birth and
all of that user's page view events at the same time. The MapReduce job can even arrange the records
to be sorted such that the reducer always sees the record from the user database first, followed by
the activity events in timestamp order---this technique is known as a *secondary sort*
[^25].
The reducer can then perform the actual join logic easily. The first value is expected to be the
date of birth, which the reducer stores in a local variable. It then iterates over the activity
events with the same user ID, outputting each viewed URL along with the viewer's date of birth.
Since the reducer processes all of the records for a particular user ID in one go, it only needs to
keep one user record in memory at any one time, and it never needs to make any requests over the
network. This algorithm is known as a *sort-merge join*, since mapper output is sorted by key, and
the reducers then merge together the sorted lists of records from both sides of the join.
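A single-process sketch of this join is shown below. The record fields, the tagging scheme, and the function names are assumptions made for illustration; the secondary sort is modeled by sorting on a composite key of user ID, record type, and timestamp, so that the user record always precedes that user's events.

```python
from itertools import groupby

USER, EVENT = 0, 1  # tag values: a user record sorts before that user's events


def map_user(row):
    yield (row["user_id"], USER, ""), row["date_of_birth"]


def map_event(event):
    yield (event["user_id"], EVENT, event["timestamp"]), event["url"]


def join_reducer(user_id, tagged_values):
    date_of_birth = None  # stays None if there is no matching user record
    for tag, value in tagged_values:
        if tag == USER:
            date_of_birth = value         # first value: the user record
        else:
            yield value, date_of_birth    # each viewed URL with viewer's DOB


def sort_merge_join(users, events):
    pairs = [p for u in users for p in map_user(u)]
    pairs += [p for e in events for p in map_event(e)]
    pairs.sort(key=lambda kv: kv[0])  # composite key: (user_id, tag, timestamp)
    output = []
    for user_id, group in groupby(pairs, key=lambda kv: kv[0][0]):
        tagged = ((key[1], value) for key, value in group)
        output.extend(join_reducer(user_id, tagged))
    return output


users = [
    {"user_id": 104, "date_of_birth": "1989-01-30"},
    {"user_id": 173, "date_of_birth": "1978-05-12"},
]
events = [
    {"user_id": 173, "timestamp": "2025-04-01T10:02", "url": "/x"},
    {"user_id": 104, "timestamp": "2025-04-01T10:05", "url": "/y"},
    {"user_id": 104, "timestamp": "2025-04-01T10:01", "url": "/z"},
]
joined = sort_merge_join(users, events)
```

The reducer holds only one date of birth at a time, which is what allows the join to scale to inputs much larger than memory.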
The next MapReduce job in the workflow can then calculate the distribution of viewer ages for each
URL. To do so, the job would first shuffle the data using the URL as key. Once sorted, the reducers
would then iterate over all the page views (with viewer birth date) for a single URL, keep a counter
for the number of views by each age group, and increment the appropriate counter for each page view.
This way you can implement a *group by* operation and aggregation.
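This second job could look roughly like the sketch below. For brevity it takes the viewer's birth year rather than a full date, approximates age as a difference of years, and uses ten-year age groups; all of these choices, along with the function names, are assumptions for the example.

```python
from collections import Counter
from itertools import groupby


def age_group(birth_year, as_of_year):
    """Bucket an approximate age into a ten-year group such as '30-39'."""
    decade = (as_of_year - birth_year) // 10 * 10
    return f"{decade}-{decade + 9}"


def views_by_age_group(joined, as_of_year):
    pairs = sorted(joined, key=lambda kv: kv[0])  # shuffle: sort by URL
    report = {}
    for url, group in groupby(pairs, key=lambda kv: kv[0]):
        counters = Counter()  # one counter per age group for this URL
        for _, birth_year in group:
            counters[age_group(birth_year, as_of_year)] += 1
        report[url] = dict(counters)
    return report


# Output of the join: (viewed URL, viewer's birth year).
joined = [("/a", 1989), ("/b", 1978), ("/a", 2001), ("/a", 1985)]
report = views_by_age_group(joined, as_of_year=2025)
```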
### Query Languages {#sec_batch_query_lanauges}
Over the years, execution engines for distributed batch processing have matured. By now, the
infrastructure has become robust enough to store and process many petabytes of data on clusters of
over 10,000 machines. As the problem of physically operating batch processes at such scale has been
considered more or less solved, attention has turned to improving the programming model.
MapReduce, dataflow engines, and cloud data warehouses have all embraced SQL as the lingua franca
for batch processing. It's a natural fit: legacy data warehouses used SQL, data analytics and ETL
tools already support SQL, and most developers and analysts already know it.
Besides the obvious advantage of requiring less code than handwritten MapReduce jobs, these query
language interfaces also allow interactive use, in which you write analytical queries and run them
from a terminal or GUI. This style of interactive querying is an efficient and natural way for
business analysts, product managers, sales and finance teams, and others to explore data in a batch
processing environment. Although interactive querying is not a classic form of batch processing, SQL
support has made distributed batch processing systems suitable for exploratory queries.
High-level query languages not only make the humans using the system more productive, but they also
improve the job execution efficiency at a machine level. As we saw in ["Cloud Data
Warehouses"](/en/ch4#sec_cloud_data_warehouses), query engines are responsible for converting SQL
queries into batch jobs to be executed in a cluster. This translation step from query to syntax tree
to physical operators allows the engine to optimize queries. Query engines such as Hive, Trino,
Spark, and Flink have cost-based query optimizers that can analyze the properties of join inputs and
automatically decide which algorithm would be most suitable for the task at hand. Optimizers might
even change the order of joins so that the amount of intermediate state is minimized [^19],
[^26], [^27], [^28].
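To make the idea of a cost-based decision concrete, here is a deliberately simplified sketch of choosing a join algorithm from estimated input sizes. Everything in it is hypothetical: the `TableStats` fields, the 10 MiB threshold, and the three strategies are assumptions for illustration, and real optimizers weigh far more statistics than this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableStats:
    """Hypothetical statistics an optimizer might keep about a join input."""
    name: str
    estimated_bytes: int
    sorted_on_join_key: bool = False


BROADCAST_LIMIT_BYTES = 10 * 1024 * 1024  # hypothetical threshold


def choose_join(left, right, broadcast_limit=BROADCAST_LIMIT_BYTES):
    smaller = min(left, right, key=lambda t: t.estimated_bytes)
    if smaller.estimated_bytes <= broadcast_limit:
        # A small input can be copied to every task, avoiding a shuffle.
        return f"broadcast hash join (broadcast {smaller.name})"
    if left.sorted_on_join_key and right.sorted_on_join_key:
        return "merge join (inputs already sorted)"
    return "sort-merge join (shuffle and sort both inputs)"


users = TableStats("users", 5 * 1024 * 1024)
page_views = TableStats("page_views", 500 * 1024**3)
orders = TableStats("orders", 80 * 1024**3)

print(choose_join(users, page_views))
print(choose_join(orders, page_views))
```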
While SQL is the most popular general-purpose batch processing query language, other languages
remain in use for niche use cases. Apache Pig was a language based on relational operators that
allowed data pipelines to be specified step by step, rather than as one big SQL query. DataFrames
(see next section) have similar characteristics, and Morel is a more modern language influenced by
Pig. Other users have adopted JSON query languages such as jq, JMESPath, or JSONPath.
In ["Graph-Like Data Models"](/en/ch3#sec_datamodels_graph) we discussed using graphs for modeling
data, and using graph query languages to traverse the edges and vertices in a graph. Many graph
processing frameworks also support batch computation through query languages such as Apache
TinkerPop's Gremlin. We will look at graph processing use cases in more detail in ["Batch Use
Cases"](/en/ch11#sec_batch_output).
> [!TIP] BATCH PROCESSING AND CLOUD DATA WAREHOUSES CONVERGE
> Historically, data warehouses ran on specialized hardware appliances, and provided SQL analytics
> queries over relational data. In contrast, batch processing frameworks like MapReduce set out to
> provide greater scalability and greater flexibility by supporting processing logic written in a
> general-purpose programming language, allowing it to read and write arbitrary data formats.
>
> Over time, the two have become much more similar. Modern batch processing frameworks now support SQL
> as a language for writing batch jobs, and they achieve good performance on relational queries by
> using columnar storage formats such as Parquet and optimized query execution engines (see ["Query
> Execution: Compilation and Vectorization"](/en/ch4#sec_storage_vectorized)).
> Meanwhile, data warehouses have grown more scalable by moving to the cloud (see ["Cloud Data
> Warehouses"](/en/ch4#sec_cloud_data_warehouses)), and implementing many of the
> same scheduling, fault tolerance, and shuffling techniques that distributed batch frameworks do.
> Many use distributed filesystems as well.
>
> Just as batch processing systems adopted SQL as a processing model, cloud warehouses have adopted
> alternative processing models such as DataFrames as well (discussed in the next section). For
> example, Google Cloud BigQuery offers a BigQuery DataFrames library and Snowflake's Snowpark
> integrates with Pandas. Batch processing workflow orchestrators such as Airflow, Prefect, and
> Dagster also integrate with cloud warehouses.
>
> Not all batch jobs are easily expressed in SQL, though. Iterative graph algorithms such as PageRank,
> complex machine learning, and many other tasks are difficult to express in SQL. AI data processing,
> which includes non-relational and multi-modal data such as images, video, and audio, can also be
> difficult to do in SQL.
>
> Moreover, cloud data warehouses struggle with certain workloads. Row-by-row computation is less
> efficient when using column-oriented storage formats. Alternative warehouse APIs or a batch
> processing system are preferable in such cases. Cloud data warehouses also tend to be more expensive
> than other batch processing systems. It can be more cost-efficient to run large jobs in batch
> processing systems such as Spark or Flink instead.
>
> Ultimately, the decision between processing data in batch systems or data warehouses comes down to
> factors such as cost, convenience, ease of implementation, availability, and so on. Most large
> enterprises have many data processing systems, which give them flexibility in this decision. Smaller
> companies often get by with just one.
### DataFrames {#id287}
As data scientists and statisticians began using distributed batch processing frameworks for machine
learning use cases, they found existing processing models cumbersome, as they were used to working
with the DataFrame data model found in R and Pandas (see ["DataFrames, Matrices, and
Arrays"](/en/ch3#sec_datamodels_dataframes)). A DataFrame is similar to a table in a relational
database: it is a collection of rows, and all the values in the same column have the same type.
Instead of writing one big SQL query, users call functions corresponding to relational operators to
perform filters, joins, sorting, group by, and other operations.
Originally, DataFrame manipulation typically occurred locally, in memory. Consequently, DataFrames
were limited to datasets that fit on a single machine. Data scientists wanted to interact with the
large datasets found in batch processing environments using the DataFrame APIs they were used to.
Distributed data processing frameworks such as Spark, Flink, and Daft have adopted DataFrame APIs to
meet this need. However, local DataFrames are usually indexed and ordered, whereas distributed
DataFrames generally are not [^29]. This difference can lead to performance surprises when migrating
DataFrame code to batch frameworks.
DataFrame APIs appear similar to dataflow APIs, but implementations vary. While Pandas executes
operations immediately when the DataFrame methods are called, Apache Spark first translates all the
DataFrame API calls into a query plan and runs query optimization before executing the workflow on
top of its distributed dataflow engine. This allows it to improve performance.
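The difference between eager and lazy execution can be illustrated with a toy class. This is not how Pandas or Spark is implemented; `LazyFrame` and its methods are hypothetical names, and the only "optimization" is moving filters ahead of projections, which is safe here because a projection only drops columns.

```python
class LazyFrame:
    """Toy DataFrame that records operations instead of running them."""

    def __init__(self, rows, plan=()):
        self._rows = rows
        self._plan = plan  # tuple of recorded operations, in call order

    def filter(self, column, predicate):
        # Nothing is computed here; the call is only appended to the plan.
        return LazyFrame(self._rows, self._plan + (("filter", column, predicate),))

    def select(self, *columns):
        return LazyFrame(self._rows, self._plan + (("select", columns),))

    def explain(self):
        # "Optimizer": run all filters first so later steps see fewer rows.
        filters = [op for op in self._plan if op[0] == "filter"]
        selects = [op for op in self._plan if op[0] == "select"]
        return filters + selects

    def collect(self):
        # Execution happens only now, using the optimized plan.
        rows = self._rows
        for op in self.explain():
            if op[0] == "filter":
                _, column, predicate = op
                rows = [r for r in rows if predicate(r[column])]
            else:
                _, columns = op
                rows = [{c: r[c] for c in columns} for r in rows]
        return rows


views = [
    {"url": "/x", "age": 35, "country": "DE"},
    {"url": "/y", "age": 29, "country": "US"},
    {"url": "/x", "age": 41, "country": "US"},
]
query = LazyFrame(views).select("url", "age").filter("age", lambda a: a >= 30)
plan_ops = [op[0] for op in query.explain()]
result = query.collect()
print(plan_ops, result)
```

An eager library would have materialized the projection as soon as `select` was called; the lazy version sees the whole chain of calls before doing any work, which is what gives an optimizer room to reorder it.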
Frameworks such as Daft even support both client and server-side computation. Smaller, in-memory
operations are executed on the client while larger datasets and computation are executed on the
server. Columnar storage formats such as Apache Arrow offer a unified data model that both client
and server-side execution engines can share.
## Batch Use Cases {#sec_batch_output}
Now that we've seen how batch processing works, let's see how it is applied to a range of different
applications. Batch jobs are excellent for processing large datasets in bulk, but they aren't good
for low-latency use cases. Consequently, you'll find batch jobs wherever there's a lot of data and
data freshness isn't important. This might sound limiting, but it turns out that a significant
amount of data processing fits this model:
- Accounting and inventory reconciliation, where companies verify that transactions line up with
their bank accounts and inventory, are often done in batch [^30].
- In manufacturing, demand forecasting is computed in periodic batch jobs [^31].
- Ecommerce, media, and social media companies train their recommendation models using batch jobs
[^32], [^33].
- Many financial systems are batch-based, as well. For example, the United States's banking network
runs almost entirely on batch jobs [^34].
In the following sections, we'll discuss some of the batch processing use cases you'll find in
nearly every industry.
### Extract--Transform--Load (ETL) {#sec_batch_etl_usage}
["Data Warehousing"](/en/ch1#sec_introduction_dwh) introduced the idea of ETL and ELT, where a data
processing pipeline extracts data from a production database, transforms it, and loads results into
a downstream system (we'll use "ETL" in this section to represent both ETL and ELT workloads). Batch
jobs are often used for such workloads, especially when the downstream system is a data warehouse.
The parallel nature of batch jobs makes them a great fit for data transformation. Much of data
transformation involves "embarrassingly parallel" workloads. Filtering data, projecting fields, and
many other common data warehouse transformations can all be done in parallel.
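As a small illustration of why such work parallelizes so easily, the sketch below filters and projects each input shard independently. The shards and field names are made up, and a local thread pool stands in for the tasks a batch framework would schedule across machines.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical input shards; in a real job these would be files.
shards = [
    [
        {"user_id": 1, "country": "DE", "amount": 10, "debug": "a"},
        {"user_id": 2, "country": "US", "amount": 5, "debug": "b"},
    ],
    [
        {"user_id": 3, "country": "DE", "amount": 7, "debug": "c"},
    ],
]


def transform_shard(rows):
    # Filter, then project. The result depends only on this shard's rows,
    # so no coordination with other shards is needed.
    return [
        {"user_id": r["user_id"], "amount": r["amount"]}
        for r in rows
        if r["country"] == "DE"
    ]


with ThreadPoolExecutor(max_workers=4) as pool:
    transformed = list(pool.map(transform_shard, shards))

print(transformed)
```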
Batch processing environments also come with robust workflow schedulers, which make it easy to
schedule, orchestrate, and debug ETL data pipeline jobs. When a failure occurs, schedulers often
retry jobs to mitigate transient issues that might occur. A job that fails repeatedly will be marked
as failed, which helps developers easily see which job in their data pipeline stopped working.
Schedulers like Airflow even come with built-in source, sink, and query operators for MySQL,
PostgreSQL, Snowflake, Spark, Flink, and dozens of other popular systems. A tight integration
between schedulers and data processing systems simplifies data integration.
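The retry behavior described above might look roughly like the following. It is a minimal sketch, not the logic of any particular scheduler: `run_with_retries` and the status dictionary are hypothetical, and a real scheduler would typically also wait between attempts.

```python
def run_with_retries(job, max_attempts=3):
    """Run `job` (a zero-argument callable), retrying when it raises."""
    errors = []
    for attempt in range(1, max_attempts + 1):
        try:
            return {"state": "success", "attempts": attempt, "result": job()}
        except Exception as exc:
            errors.append(repr(exc))  # keep the error for later debugging
    # Every attempt failed: mark the job as failed so it shows up clearly.
    return {"state": "failed", "attempts": max_attempts, "errors": errors}


calls = {"n": 0}


def flaky_job():
    # Fails twice with a transient error, then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient network issue")
    return "42 rows loaded"


def broken_job():
    # Fails every time, for example because an input field disappeared.
    raise KeyError("date_of_birth")


flaky_status = run_with_retries(flaky_job)
broken_status = run_with_retries(broken_job)
print(flaky_status["state"], broken_status["state"])
```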
We've also seen that batch jobs are easy to troubleshoot and fix when things go awry. This feature
is invaluable when debugging data pipelines. Failed files can be easily inspected to see what went
wrong, and ETL batch jobs can be fixed and re-run. For example, an input file might no longer
contain a field that a transformation batch job intends to use. Data engineers will see that the
field is missing, and update the transformation logic or the job that produced the input.
Data pipelines used to be managed by a single data engineering team, as it was considered unfair to
ask other teams working on product features to write and manage complex batch data pipelines.
Recently, improvements in batch processing models and metadata management have made it much easier
for engineers across an organization to contribute to and manage their own data pipelines. *Data
mesh* [^35], [^36], *data contract* [^37], and *data fabric*
[^38] practices provide standards and tools to help teams safely publish their data for
consumption by anybody in the organization.
Data pipelines and analytic queries have begun to share not only processing models, but execution
engines as well. Many batch ETL jobs now run on the same systems as the analytic queries that read
their output. It is not uncommon to see data pipeline transformations and analytic queries both run
as Spark SQL, Trino, or DuckDB queries. Such an architecture further blurs the line between
application engineering, data engineering, analytics engineering, and business analysis.
### Analytics {#sec_batch_olap}
In ["Operational Versus Analytical Systems"](/en/ch1#sec_introduction_analytics), we saw that
analytic queries (OLAP) often scan over a large number of records, performing groupings and
aggregations. It is possible to run such workloads in a batch processing system, alongside other
batch processing workloads. Analysts write SQL queries that execute atop a query engine, which reads
from and writes to a distributed filesystem or object store. Table metadata such as table-to-file
mappings, names, and types are managed with table formats such as Apache Iceberg and catalogs such
as Unity (see ["Cloud Data Warehouses"](/en/ch4#sec_cloud_data_warehouses)). This architecture is
known as a *data lakehouse* [^39].
As with ETL, improvements in SQL query interfaces mean many organizations now use batch frameworks
such as Spark for analytics. Such query patterns come in two styles:
- Pre-aggregation queries, where data is rolled up into OLAP cubes or data marts to speed up queries
(see ["Materialized Views and Data Cubes"](/en/ch4#sec_storage_materialized_views)).
Pre-aggregated data is queried in the warehouse or pushed to purpose-built realtime OLAP systems
such as Apache Druid or Apache Pinot. Pre-aggregation normally takes place at a scheduled
interval. The workflow schedulers discussed in ["Scheduling
Workflows"](/en/ch11#sec_batch_workflows) are used to manage these workloads.
- Ad hoc queries that users run to answer specific business questions, investigate user behavior,
debug operational issues, and much more. Response times are important for this use case. Analysts
run queries iteratively as they get responses and learn more about the data they're investigating.
Batch processing frameworks with fast query execution help reduce waiting times for analysts.
SQL support enables batch processing frameworks to integrate with spreadsheets and data
visualization tools such as Tableau, Power BI, Looker, and Apache Superset. For example, Tableau
offers Spark SQL and Presto connectors, while Apache Superset supports Trino, Hive, Spark SQL,
Presto, and many other systems that ultimately execute batch jobs to query data.
### Machine Learning {#id290}
Machine learning (ML) makes frequent use of batch processing. Data scientists, ML engineers, and AI
engineers use batch processing frameworks to investigate data patterns, transform data, and train
machine learning models. Common uses include:
- Feature engineering: Raw data is filtered and transformed into data that models can be trained on.
Predictive models often need numeric data, so engineers must transform other forms of data (such
as text or discrete values) into the required format.
- Model training: The training data is the input to the batch process, and the weights of the
trained model are the output.
- Batch inference: A trained model can then be used to make predictions in bulk if datasets are
large and realtime results are not required. This includes evaluating the model's predictions on a
test dataset.
Batch processing frameworks provide tools explicitly for these use cases. For example, Apache
Spark's MLlib and Apache Flink's FlinkML come with a wide variety of feature engineering tools,
statistical functions, and classifiers.
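As an example of the feature engineering step, the sketch below one-hot encodes a discrete column into numeric features. It is plain Python rather than any framework's API, and the `device` column and sample rows are hypothetical.

```python
def one_hot_encode(rows, column):
    # First pass: discover the distinct categories in the column.
    categories = sorted({r[column] for r in rows})
    encoded = []
    # Second pass: replace the column with one 0/1 feature per category.
    for r in rows:
        features = {k: v for k, v in r.items() if k != column}
        for category in categories:
            features[f"{column}={category}"] = 1 if r[column] == category else 0
        encoded.append(features)
    return encoded


rows = [
    {"user_id": 1, "device": "mobile"},
    {"user_id": 2, "device": "desktop"},
    {"user_id": 3, "device": "mobile"},
]
encoded = one_hot_encode(rows, "device")
print(encoded[0])
```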
Machine learning applications such as recommendation engines and ranking systems also make heavy use
of graph processing (see ["Graph-Like Data Models"](/en/ch3#sec_datamodels_graph)). Many graph
algorithms are expressed by traversing one edge at a time, joining one vertex with an adjacent
vertex in order to propagate some information, and repeating until some condition is met---for
example, until there are no more edges to follow, or until some metric converges.
The *bulk synchronous parallel* (BSP) model of computation [^40] has become popular for
batch processing graphs. Among others, it is implemented by Apache Giraph [^20], Spark's
GraphX API, and Flink's Gelly API [^41]. It is also known as the *Pregel* model, as
Google's Pregel paper popularized this approach for processing graphs [^42].
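The following single-process sketch shows the shape of a BSP computation: vertices exchange messages, and messages sent in one superstep are only delivered in the next. It labels each vertex with the smallest vertex ID in its connected component. The function name and the example graph are hypothetical, and a real implementation would partition vertices across machines.

```python
def pregel_min_label(edges, max_supersteps=50):
    # Build the adjacency sets of an undirected graph.
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    label = {v: v for v in neighbors}  # per-vertex state

    # Superstep 0: every vertex sends its own label to its neighbors.
    inbox = {v: [] for v in neighbors}
    for v in neighbors:
        for n in neighbors[v]:
            inbox[n].append(label[v])

    supersteps = 0
    while any(inbox.values()) and supersteps < max_supersteps:
        supersteps += 1
        outbox = {v: [] for v in neighbors}
        for v, messages in inbox.items():
            if not messages:
                continue  # no messages: the vertex stays inactive
            smallest = min(messages)
            if smallest < label[v]:
                # State changed, so tell the neighbors about it.
                label[v] = smallest
                for n in neighbors[v]:
                    outbox[n].append(smallest)
        inbox = outbox  # barrier: delivery happens in the next superstep
    return label


components = pregel_min_label([(1, 2), (2, 3), (4, 5)])
print(components)
```

The loop ends when no vertex has anything left to send, which corresponds to the "until some metric converges" condition mentioned earlier.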
Batch processing is also an integral part of large language model (LLM) data preparation and
training. Raw text input data such as websites typically reside in a DFS or object store. This data
must be pre-processed to make it suitable for training. Pre-processing steps that are well-suited
for batch processing frameworks include:
- Plain text must be extracted from HTML and malformed text must be fixed.
- Low quality, irrelevant, and duplicate documents must be detected and removed.
- Text must be tokenized (split into words) and converted into embeddings, which are numeric
  representations of each word.
Batch processing frameworks such as Kubeflow, Flyte, and Ray are purpose-built for such workloads.
OpenAI uses Ray as part of its ChatGPT training process, for example [^43]. These
frameworks have built-in integrations for LLM and AI libraries such as PyTorch, TensorFlow, XGBoost,
and many others. They also offer built-in support for feature engineering, model training, batch
inference, and fine-tuning (adjusting a foundation model for specific use cases).
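A heavily simplified version of the pre-processing steps listed earlier (text extraction, quality filtering, deduplication, and tokenization) is sketched below using only the Python standard library. The quality rule, the exact-match deduplication, and the whitespace tokenizer are assumptions chosen for brevity; production pipelines generally use more sophisticated techniques, and the embedding step is omitted entirely.

```python
import hashlib
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text content, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace into single spaces.
    return " ".join(" ".join(parser.parts).split())


def preprocess(pages, min_tokens=3):
    seen = set()
    for html in pages:
        tokens = extract_text(html).lower().split()  # naive tokenizer
        if len(tokens) < min_tokens:
            continue  # crude quality filter: drop very short documents
        digest = hashlib.sha256(" ".join(tokens).encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate after normalization
        seen.add(digest)
        yield tokens


pages = [
    "<p>Batch jobs process bounded input.</p>",
    "<div>batch JOBS process   bounded input.</div><script>var x = 1;</script>",
    "<p>Click here</p>",
]
documents = list(preprocess(pages))
print(documents)
```

In a distributed job, the in-memory `seen` set would not scale; one option is to shuffle documents by their digest so that duplicates meet in the same reducer.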
Finally, data scientists often experiment with data in interactive notebooks such as Jupyter or Hex.
Notebooks are made up of *cells*, which are small chunks of markdown, Python, or SQL. Cells are
executed sequentially to produce spreadsheets, graphs, or data. Many notebooks use batch processing
via DataFrame APIs or query such systems using SQL.
### Serving Derived Data {#sec_batch_serving_derived}
Batch jobs are often used to build pre-computed or derived datasets such as product recommendations,
user-facing reports, and features for machine learning models. These datasets are typically served
from a production database, key-value store, or search engine. Regardless of the system used, the
pre-computed data needs to make its way from the batch processor's distributed filesystem or object
store back into the database that's serving live traffic.
The most obvious choice might be to use the client library for your favorite database directly
within a batch job, and to write directly to the database server, one record at a time. This will
work (assuming your firewall rules allow direct access from your batch processing environment to
your production databases), but it is a bad idea for several reasons:
- Making a network request for every single record is orders of magnitude slower than the normal
throughput of a batch task. Even if the client library supports batching, performance is likely to
be poor.
- Batch processing frameworks often run many tasks in parallel. If all the tasks concurrently write
  to the same output database, at the rate expected of a batch process, that database can easily be
overwhelmed, and its performance for queries is likely to suffer. This can in turn cause
operational problems in other parts of the system [^44].
- Normally, batch jobs provide a clean all-or-nothing guarantee for job output: if a job succeeds,
the result is the output of running every task exactly once, even if some tasks failed and had to
be retried along the way; if the entire job fails, no output is produced. However, writing to an
external system from inside a job produces externally visible side effects that cannot be hidden
in this way. Thus, you have to worry about the results from partially completed jobs being visible
to other systems. If a task fails and is restarted, it may duplicate output from the failed
execution.
A better solution is to have batch jobs push pre-computed datasets to streams such as Kafka topics,
which we discuss further in [Chapter 12](/en/ch12#ch_stream). Search engines like Elasticsearch,
realtime OLAP systems like Apache Pinot and Apache Druid, derived datastores like Venice
[^45], and cloud data warehouses like ClickHouse all have the built-in ability to ingest
data from Kafka into their systems. Pushing data through a streaming system fixes a few of the
problems we discussed above:
- Streaming systems are optimized for sequential writes, which makes them better suited for the bulk
write workload of a batch job.
- Streaming systems can also act as a buffer between the batch job and the production databases.
Downstream systems can throttle their read rate to ensure they can continue to comfortably serve
production traffic.
- The output of a single batch job can be consumed by multiple downstream systems.
- Streaming systems can serve as a security boundary between batch processing environments and
production networks: they can be deployed in a so-called DMZ (demilitarized zone) network that
sits between the batch processing network and production network.
Pushing data through streams doesn't inherently solve the all-or-nothing guarantee issue we
discussed above. To make this work, batch jobs must send a notification to downstream systems that
their job is complete and the data can now be served. Consumers of the stream need to be able to
keep data they receive invisible to queries, like an uncommitted transaction with *read committed*
isolation (see ["Read Committed"](/en/ch8#sec_transactions_read_committed)), until they are notified
that it is complete.
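One way a stream consumer could implement this is sketched below. The message format (`record`, `commit`, and `abort` messages tagged with a `job_id`) and the `StagedStore` class are hypothetical; they are not the protocol of Kafka or of any of the systems named above.

```python
class StagedStore:
    """Toy serving store that hides a batch job's output until it commits."""

    def __init__(self):
        self._live = {}    # data visible to queries
        self._staged = {}  # job_id -> records received but not yet visible

    def consume(self, message):
        kind, job_id = message["type"], message["job_id"]
        if kind == "record":
            # Keyed writes make duplicates from retried tasks harmless:
            # a repeated record simply overwrites itself.
            self._staged.setdefault(job_id, {})[message["key"]] = message["value"]
        elif kind == "commit":
            # The job finished: make all of its output visible.
            self._live.update(self._staged.pop(job_id, {}))
        elif kind == "abort":
            # The job failed: discard its partial output.
            self._staged.pop(job_id, None)

    def get(self, key):
        return self._live.get(key)


store = StagedStore()
store.consume({"type": "record", "job_id": "j1", "key": "user:104", "value": ["/x"]})
store.consume({"type": "record", "job_id": "j1", "key": "user:104", "value": ["/x"]})
before_commit = store.get("user:104")
store.consume({"type": "commit", "job_id": "j1"})
after_commit = store.get("user:104")
print(before_commit, after_commit)
```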
Another pattern, more common when bootstrapping databases, is to build the files of a brand-new
database *inside* the batch job and bulk load those files directly into the database from a
distributed filesystem, object store, or local filesystem. Many data systems offer bulk import tools such as
TiDB's Lightning tool, or Apache Pinot's and Apache Druid's Hadoop import jobs. RocksDB also offers
an API to bulk import SSTs from batch jobs.
Building databases in batch and bulk importing the data is very fast, and makes it easier for
systems to atomically switch between dataset versions. On the other hand, it can be challenging to
incrementally update datasets from batch jobs that build brand-new databases. It's common to take a
hybrid approach in situations where both bootstrapping and incremental loads are needed. Venice, for
example, supports hybrid stores that allow for batch row-based updates and full dataset swaps.
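The atomic switch between dataset versions can be sketched with ordinary files. In this illustration, each version is written to its own directory and a small pointer file names the version being served. The directory layout and the `CURRENT` file name are assumptions; the sketch relies on `os.replace`, which renames a file atomically on POSIX systems when source and destination are on the same filesystem.

```python
import json
import os
import tempfile


def publish_version(root, version, records):
    # Build the complete new dataset in its own directory first.
    version_dir = os.path.join(root, f"v{version}")
    os.makedirs(version_dir)
    with open(os.path.join(version_dir, "data.json"), "w") as f:
        json.dump(records, f)
    # Then switch the pointer with a single rename, so readers see
    # either the old version or the new one, never a mixture.
    tmp_pointer = os.path.join(root, "CURRENT.tmp")
    with open(tmp_pointer, "w") as f:
        f.write(f"v{version}")
    os.replace(tmp_pointer, os.path.join(root, "CURRENT"))


def read_current(root):
    with open(os.path.join(root, "CURRENT")) as f:
        version_dir = f.read().strip()
    with open(os.path.join(root, version_dir, "data.json")) as f:
        return json.load(f)


root = tempfile.mkdtemp()
publish_version(root, 1, {"user:104": ["/x"]})
first = read_current(root)
publish_version(root, 2, {"user:104": ["/x", "/y"]})
second = read_current(root)
print(first, second)
```

Keeping the old version's directory around also makes it easy to roll back by pointing `CURRENT` at it again.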
## Summary {#id292}
In this chapter, we explored the design and implementation of batch processing systems. We began
with the classic Unix toolchain (awk, sort, uniq, etc.), to illustrate fundamental batch processing
primitives such as sorting and counting.
We then scaled up to distributed batch processing systems. We saw that batch-style I/O processes
immutable, bounded input datasets to produce output data, allowing reruns and debugging without side
effects. To process files, we saw that batch frameworks have three main components: an orchestration
layer that determines where and when jobs run, a storage layer to persist data, and a computation
layer that processes the actual data.
We looked at how distributed filesystems and object stores manage large files through block-based
replication, caching, and metadata services, and how modern batch frameworks interact with these
systems using pluggable APIs. We then discussed how orchestrators schedule tasks, allocate
resources, and handle faults in large clusters, and we compared job orchestrators that schedule
jobs with workflow orchestrators that manage the lifecycle of a collection of jobs that run in a
dependency graph.
We surveyed batch processing models, starting with MapReduce and its canonical map and reduce
functions. Next, we turned to dataflow engines like Spark and Flink, which offer simpler-to-use
dataflow APIs and better performance. To understand how batch jobs scale, we covered the shuffle
algorithm, a foundational operation that enables grouping, joining, and aggregation.
As batch systems matured, focus shifted to usability. You learned about high-level query languages
like SQL and DataFrame APIs, which make batch jobs more accessible and easier to optimize. Query
optimizers translate declarative queries into efficient execution plans.
We finished the chapter with common batch processing use cases:
- ETL pipelines, which extract, transform, and load data between different systems using scheduled
workflows;
- Analytics, where batch jobs support both pre-aggregated dashboards and ad hoc queries;
- Machine learning, where batch jobs prepare and process large training datasets;
- Populating production-facing systems from batch outputs, often via streams or bulk loading tools,
in order to serve the derived data to users.
In the next chapter, we will turn to stream processing, in which the input is *unbounded*---that is,
you still have a job, but its inputs are never-ending streams of data. In this case, a job is never
complete, because at any time there may still be more work coming in. We shall see that stream and
batch processing are similar in some respects, but the assumption of unbounded streams also changes
a lot about how we build systems.
##### Footnotes
### References {#references}
[^1]: Nathan Marz. [How to Beat the CAP Theorem](http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html). *nathanmarz.com*, October 2011. Archived at [perma.cc/4BS9-R9A4](https://perma.cc/4BS9-R9A4)
[^2]: Molly Bartlett Dishman and Martin Fowler. [Agile Architecture](https://www.youtube.com/watch?v=VjKYO6DP3fo&list=PL055Epbe6d5aFJdvWNtTeg_UEHZEHdInE). At *O'Reilly Software Architecture Conference*, March 2015.
[^3]: Jeffrey Dean and Sanjay Ghemawat. [MapReduce: Simplified Data Processing on Large Clusters](https://www.usenix.org/legacy/publications/library/proceedings/osdi04/tech/full_papers/dean/dean.pdf). At *6th USENIX Symposium on Operating System Design and Implementation* (OSDI), December 2004.
[^4]: Shivnath Babu and Herodotos Herodotou. [Massively Parallel Databases and MapReduce Systems](https://www.microsoft.com/en-us/research/wp-content/uploads/2013/11/db-mr-survey-final.pdf). *Foundations and Trends in Databases*, volume 5, issue 1, pages 1--104, November 2013. [doi:10.1561/1900000036](https://doi.org/10.1561/1900000036)
[^5]: David J. DeWitt and Michael Stonebraker. [MapReduce: A Major Step Backwards](https://homes.cs.washington.edu/~billhowe/mapreduce_a_major_step_backwards.html). Originally published at *databasecolumn.vertica.com*, January 2008. Archived at [perma.cc/U8PA-K48V](https://perma.cc/U8PA-K48V)
[^6]: Henry Robinson. [The Elephant Was a Trojan Horse: On the Death of Map-Reduce at Google](https://www.the-paper-trail.org/post/2014-06-25-the-elephant-was-a-trojan-horse-on-the-death-of-map-reduce-at-google/). *the-paper-trail.org*, June 2014. Archived at [perma.cc/9FEM-X787](https://perma.cc/9FEM-X787)
[^7]: Urs Hölzle. [R.I.P. MapReduce. After having served us well since 2003, today we removed the remaining internal codebase for good](https://twitter.com/uhoelzle/status/1177360023976067077). *twitter.com*, September 2019. Archived at [perma.cc/B34T-LLY7](https://perma.cc/B34T-LLY7)
[^8]: Adam Drake. [Command-Line Tools Can Be 235x Faster than Your Hadoop Cluster](https://adamdrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html). *aadrake.com*, January 2014. Archived at [perma.cc/87SP-ZMCY](https://perma.cc/87SP-ZMCY)
[^9]: [`sort`: Sort text files](https://www.gnu.org/software/coreutils/manual/html_node/sort-invocation.html). GNU Coreutils 9.7 Documentation, Free Software Foundation, Inc., 2025.
[^10]: Michael Ovsiannikov, Silvius Rus, Damian Reeves, Paul Sutter, Sriram Rao, and Jim Kelly. [The Quantcast File System](https://db.disi.unitn.eu/pages/VLDBProgram/pdf/industry/p808-ovsiannikov.pdf). *Proceedings of the VLDB Endowment*, volume 6, issue 11, pages 1092--1101, August 2013. [doi:10.14778/2536222.2536234](https://doi.org/10.14778/2536222.2536234)
[^11]: Andrew Wang, Zhe Zhang, Kai Zheng, Uma Maheswara G., and Vinayakumar B. [Introduction to HDFS Erasure Coding in Apache Hadoop](https://www.cloudera.com/blog/technical/introduction-to-hdfs-erasure-coding-in-apache-hadoop.html). *blog.cloudera.com*, September 2015. Archived at [archive.org](https://web.archive.org/web/20250731115546/https://www.cloudera.com/blog/technical/introduction-to-hdfs-erasure-coding-in-apache-hadoop.html)
[^12]: Andy Warfield. [Building and operating a pretty big storage system called S3](https://www.allthingsdistributed.com/2023/07/building-and-operating-a-pretty-big-storage-system.html). *allthingsdistributed.com*, July 2023. Archived at [perma.cc/7LPK-TP7V](https://perma.cc/7LPK-TP7V)
[^13]: Vinod Kumar Vavilapalli, Arun C. Murthy, Chris Douglas, Sharad Agarwal, Mahadev Konar, Robert Evans, Thomas Graves, Jason Lowe, Hitesh Shah, Siddharth Seth, Bikas Saha, Carlo Curino, Owen O'Malley, Sanjay Radia, Benjamin Reed, and Eric Baldeschwieler. [Apache Hadoop YARN: Yet Another Resource Negotiator](https://opencourse.inf.ed.ac.uk/sites/default/files/2023-10/yarn-socc13.pdf). At *4th Annual Symposium on Cloud Computing* (SoCC), October 2013. [doi:10.1145/2523616.2523633](https://doi.org/10.1145/2523616.2523633)
[^14]: Richard M. Karp. [Reducibility Among Combinatorial Problems](https://www.cs.purdue.edu/homes/hosking/197/canon/karp.pdf). *Complexity of Computer Computations. The IBM Research Symposia Series*. Springer, 1972. [doi:10.1007/978-1-4684-2001-2_9](https://doi.org/10.1007/978-1-4684-2001-2_9)
[^15]: J. D. Ullman. [NP-Complete Scheduling Problems](https://www.cs.montana.edu/bhz/classes/fall-2018/csci460/paper4.pdf). *Journal of Computer and System Sciences*, volume 10, issue 3, June 1975. [doi:10.1016/S0022-0000(75)80008-0](https://doi.org/10.1016/S0022-0000(75)80008-0)
[^16]: Gilad David Maayan. [The complete guide to spot instances on AWS, Azure and GCP](https://www.datacenterdynamics.com/en/opinions/complete-guide-spot-instances-aws-azure-and-gcp/). *datacenterdynamics.com*, March 2021. Archived at [archive.org](https://web.archive.org/web/20250722114617/https://www.datacenterdynamics.com/en/opinions/complete-guide-spot-instances-aws-azure-and-gcp/)
[^17]: Abhishek Verma, Luis Pedrosa, Madhukar Korupolu, David Oppenheimer, Eric Tune, and John Wilkes. [Large-Scale Cluster Management at Google with Borg](https://dl.acm.org/doi/pdf/10.1145/2741948.2741964). At *10th European Conference on Computer Systems* (EuroSys), April 2015. [doi:10.1145/2741948.2741964](https://doi.org/10.1145/2741948.2741964)
[^18]: Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. [Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing](https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf). At *9th USENIX Symposium on Networked Systems Design and Implementation* (NSDI), April 2012.
[^19]: Paris Carbone, Stephan Ewen, Seif Haridi, Asterios Katsifodimos, Volker Markl, and Kostas Tzoumas. [Apache Flink™: Stream and Batch Processing in a Single Engine](http://sites.computer.org/debull/A15dec/p28.pdf). *Bulletin of the IEEE Computer Society Technical Committee on Data Engineering*, volume 38, issue 4, December 2015. Archived at [perma.cc/G3N3-BKX5](https://perma.cc/G3N3-BKX5)
[^20]: Mark Grover, Ted Malaska, Jonathan Seidman, and Gwen Shapira. *[Hadoop Application Architectures](https://learning.oreilly.com/library/view/hadoop-application-architectures/9781491910313/)*. O'Reilly Media, 2015. ISBN: 978-1-491-90004-8
[^21]: Jules S. Damji, Brooke Wenig, Tathagata Das, and Denny Lee. *[Learning Spark, 2nd Edition](https://learning.oreilly.com/library/view/learning-spark-2nd/9781492050032/)*. O'Reilly Media, 2020. ISBN: 978-1492050049
[^22]: Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. [Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks](https://www.microsoft.com/en-us/research/publication/dryad-distributed-data-parallel-programs-from-sequential-building-blocks/). At *2nd European Conference on Computer Systems* (EuroSys), March 2007. [doi:10.1145/1272996.1273005](https://doi.org/10.1145/1272996.1273005)
[^23]: Daniel Warneke and Odej Kao. [Nephele: Efficient Parallel Data Processing in the Cloud](https://stratosphere2.dima.tu-berlin.de/assets/papers/Nephele_09.pdf). At *2nd Workshop on Many-Task Computing on Grids and Supercomputers* (MTAGS), November 2009. [doi:10.1145/1646468.1646476](https://doi.org/10.1145/1646468.1646476)
[^24]: Hossein Ahmadi. [In-memory query execution in Google BigQuery](https://cloud.google.com/blog/products/bigquery/in-memory-query-execution-in-google-bigquery). *cloud.google.com*, August 2016. Archived at [perma.cc/DGG2-FL9W](https://perma.cc/DGG2-FL9W)
[^25]: Tom White. *[Hadoop: The Definitive Guide](https://learning.oreilly.com/library/view/hadoop-the-definitive/9781491901687/)*, 4th edition. O'Reilly Media, 2015. ISBN: 978-1-491-90163-2
[^26]: Fabian Hüske. [Peeking into Apache Flink's Engine Room](https://flink.apache.org/2015/03/13/peeking-into-apache-flinks-engine-room/). *flink.apache.org*, March 2015. Archived at [perma.cc/44BW-ALJX](https://perma.cc/44BW-ALJX)
[^27]: Mostafa Mokhtar. [Hive 0.14 Cost Based Optimizer (CBO) Technical Overview](https://web.archive.org/web/20170607112708/http://hortonworks.com/blog/hive-0-14-cost-based-optimizer-cbo-technical-overview/). *hortonworks.com*, March 2015. Archived on [archive.org](https://web.archive.org/web/20170607112708/http://hortonworks.com/blog/hive-0-14-cost-based-optimizer-cbo-technical-overview/)
[^28]: Michael Armbrust, Reynold S. Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph K. Bradley, Xiangrui Meng, Tomer Kaftan, Michael J. Franklin, Ali Ghodsi, and Matei Zaharia. [Spark SQL: Relational Data Processing in Spark](https://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf). At *ACM International Conference on Management of Data* (SIGMOD), June 2015. [doi:10.1145/2723372.2742797](https://doi.org/10.1145/2723372.2742797)
[^29]: Kaya Kupferschmidt. [Spark vs Pandas, part 2 -- Spark](https://towardsdatascience.com/spark-vs-pandas-part-2-spark-c57f8ea3a781/). *towardsdatascience.com*, October 2020. Archived at [perma.cc/5BRK-G4N5](https://perma.cc/5BRK-G4N5)
[^30]: Ammar Chalifah. [Tracking payments at scale](https://bolt.eu/en/blog/tracking-payments-at-scale). *bolt.eu*, June 2025. Archived at [perma.cc/Q4KX-8K3J](https://perma.cc/Q4KX-8K3J)
[^31]: Nafi Ahmet Turgut, Hamza Akyıldız, Hasan Burak Yel, Mehmet İkbal Özmen, Mutlu Polatcan, Pinar Baki, and Esra Kayabali. [Demand forecasting at Getir built with Amazon Forecast](https://aws.amazon.com/blogs/machine-learning/demand-forecasting-at-getir-built-with-amazon-forecast). *aws.amazon.com*, May 2023. Archived at [perma.cc/H3H6-GNL7](https://perma.cc/H3H6-GNL7)
[^32]: Jason (Siyu) Zhu. [Enhancing homepage feed relevance by harnessing the power of large corpus sparse ID embeddings](https://www.linkedin.com/blog/engineering/feed/enhancing-homepage-feed-relevance-by-harnessing-the-power-of-lar). *linkedin.com*, August 2023. Archived at [archive.org](https://web.archive.org/web/20250225094424/https://www.linkedin.com/blog/engineering/feed/enhancing-homepage-feed-relevance-by-harnessing-the-power-of-lar)
[^33]: Avery Ching, Sital Kedia, and Shuojie Wang. [Apache Spark \@Scale: A 60 TB+ production use case](https://engineering.fb.com/2016/08/31/core-infra/apache-spark-scale-a-60-tb-production-use-case/). *engineering.fb.com*, August 2016. Archived at [perma.cc/F7R5-YFAV](https://perma.cc/F7R5-YFAV)
[^34]: Edward Kim. [How ACH works: A developer perspective --- Part 1](https://engineering.gusto.com/how-ach-works-a-developer-perspective-part-1-339d3e7bea1). *engineering.gusto.com*, April 2014. Archived at [perma.cc/F67P-VBLK](https://perma.cc/F67P-VBLK)
[^35]: Zhamak Dehghani. [How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh](https://martinfowler.com/articles/data-monolith-to-mesh.html). *martinfowler.com*, May 2019. Archived at [perma.cc/LN2L-L4VC](https://perma.cc/LN2L-L4VC)
[^36]: Chris Riccomini. [What the Heck is a Data Mesh?!](https://cnr.sh/essays/what-the-heck-data-mesh) *cnr.sh*, June 2021. Archived at [perma.cc/NEJ2-BAX3](https://perma.cc/NEJ2-BAX3)
[^37]: Chad Sanderson, Mark Freeman, and B. E. Schmidt. *[Data Contracts](https://www.oreilly.com/library/view/data-contracts/9781098157623/)*. O'Reilly Media, 2025. ISBN: 9781098157623
[^38]: Daniel Abadi. [Data Fabric vs. Data Mesh: What's the Difference?](https://www.starburst.io/blog/data-fabric-vs-data-mesh-whats-the-difference/) *starburst.io*, November 2021. Archived at [perma.cc/RSK3-HXDK](https://perma.cc/RSK3-HXDK)
[^39]: Michael Armbrust, Ali Ghodsi, Reynold Xin, and Matei Zaharia. [Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics](https://www.cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf). At *11th Annual Conference on Innovative Data Systems Research* (CIDR), January 2021.
[^40]: Leslie G. Valiant. [A Bridging Model for Parallel Computation](https://dl.acm.org/doi/pdf/10.1145/79173.79181). *Communications of the ACM*, volume 33, issue 8, pages 103--111, August 1990. [doi:10.1145/79173.79181](https://doi.org/10.1145/79173.79181)
[^41]: Stephan Ewen, Kostas Tzoumas, Moritz Kaufmann, and Volker Markl. [Spinning Fast Iterative Data Flows](https://vldb.org/pvldb/vol5/p1268_stephanewen_vldb2012.pdf). *Proceedings of the VLDB Endowment*, volume 5, issue 11, pages 1268--1279, July 2012. [doi:10.14778/2350229.2350245](https://doi.org/10.14778/2350229.2350245)
[^42]: Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. [Pregel: A System for Large-Scale Graph Processing](https://kowshik.github.io/JPregel/pregel_paper.pdf). At *ACM International Conference on Management of Data* (SIGMOD), June 2010. [doi:10.1145/1807167.1807184](https://doi.org/10.1145/1807167.1807184)
[^43]: Richard MacManus. [OpenAI Chats about Scaling LLMs at Anyscale's Ray Summit](https://thenewstack.io/openai-chats-about-scaling-llms-at-anyscales-ray-summit/). *thenewstack.io*, September 2023. Archived at [perma.cc/YJD6-KUXU](https://perma.cc/YJD6-KUXU)
[^44]: Jay Kreps. [Why Local State is a Fundamental Primitive in Stream Processing](https://www.oreilly.com/ideas/why-local-state-is-a-fundamental-primitive-in-stream-processing). *oreilly.com*, July 2014. Archived at [perma.cc/P8HU-R5LA](https://perma.cc/P8HU-R5LA)
[^45]: Félix GV. [Open Sourcing Venice -- LinkedIn's Derived Data Platform](https://www.linkedin.com/blog/engineering/open-source/open-sourcing-venice-linkedin-s-derived-data-platform). *linkedin.com*, September 2022. Archived at [archive.org](https://web.archive.org/web/20250226160927/https://www.linkedin.com/blog/engineering/open-source/open-sourcing-venice-linkedin-s-derived-data-platform)
================================================
FILE: content/en/ch12.md
================================================
---
title: "12. Stream Processing"
weight: 312
breadcrumbs: false
---

> *A complex system that works is invariably found to have evolved from a simple system that works.
> The inverse proposition also appears to be true: A complex system designed from scratch never
> works and cannot be made to work.*
>
> John Gall, *Systemantics* (1975)

> [!TIP] A NOTE FOR EARLY RELEASE READERS
> With Early Release ebooks, you get books in their earliest form---the author's raw and unedited
> content as they write---so you can take advantage of these technologies long before the official
> release of these titles.
>
> This will be the 12th chapter of the final book. The GitHub repo for this book is
> [https://github.com/ept/ddia2-feedback](https://github.com/ept/ddia2-feedback).
>
> If you'd like to be actively involved in reviewing and commenting on this draft, please reach out on
> GitHub.

In [Chapter 11](/en/ch11#ch_batch) we discussed batch processing---techniques that read a set of
files as input and produce a new set of output files. The output is a form of *derived data*; that
is, a dataset that can be recreated by running the batch process again if necessary. We saw how this
simple but powerful idea can be used to create search indexes, recommendation systems, analytics,
and more.
However, one big assumption remained throughout [Chapter 11](/en/ch11#ch_batch): namely, that the
input is bounded---i.e., of a known and finite size---so the batch process knows when it has
finished reading its input. For example, the sorting operation that is central to MapReduce must
read its entire input before it can start producing output: it could happen that the very last input
record is the one with the lowest key, and thus needs to be the very first output record, so
starting the output early is not an option.
In reality, a lot of data is unbounded because it arrives gradually over time: your users produced
data yesterday and today, and they will continue to produce more data tomorrow. Unless you go out of
business, this process never ends, and so the dataset is never "complete" in any meaningful way
[^1]. Thus, batch processors must artificially divide the data into chunks of fixed
duration: for example, processing a day's worth of data at the end of every day, or processing an
hour's worth of data at the end of every hour.
The problem with daily batch processes is that changes in the input are only reflected in the output
a day later, which is too slow for many impatient users. To reduce the delay, we can run the
processing more frequently---say, processing a second's worth of data at the end of every
second---or even continuously, abandoning the fixed time slices entirely and simply processing every
event as it happens. That is the idea behind *stream processing*.
In general, a "stream" refers to data that is incrementally made available over time. The concept
appears in many places: in the `stdin` and `stdout` of Unix, programming languages (lazy lists)
[^2], filesystem APIs (such as Java's `FileInputStream`), TCP connections, delivering
audio and video over the internet, and so on.
In this chapter we will look at *event streams* as a data management mechanism: the unbounded,
incrementally processed counterpart to the batch data we saw in the last chapter. We will first
discuss how streams are represented, stored, and transmitted over a network. In ["Databases and
Streams"](/en/ch12#sec_stream_databases) we will investigate the relationship between streams and
databases. And finally, in ["Processing Streams"](/en/ch12#sec_stream_processing) we will explore
approaches and tools for processing those streams continually, and ways that they can be used to
build applications.
## Transmitting Event Streams {#sec_stream_transmit}
In the batch processing world, the inputs and outputs of a job are files (perhaps on a distributed
filesystem). What does the streaming equivalent look like?
When the input is a file (a sequence of bytes), the first processing step is usually to parse it
into a sequence of records. In a stream processing context, a record is more commonly known as an
*event*, but it is essentially the same thing: a small, self-contained, immutable object containing
the details of something that happened at some point in time. An event usually contains a timestamp
indicating when it happened according to a time-of-day clock (see ["Monotonic Versus Time-of-Day
Clocks"](/en/ch9#sec_distributed_monotonic_timeofday)).
For example, the thing that happened might be an action that a user took, such as viewing a page or
making a purchase. It might also originate from a machine, such as a periodic measurement from a
temperature sensor, or a CPU utilization metric. In the example of ["Batch Processing with Unix
Tools"](/en/ch11#sec_batch_unix), each line of the web server log is an event.
An event may be encoded as a text string, or JSON, or perhaps in some binary form, as discussed in
[Chapter 5](/en/ch5#ch_encoding). This encoding allows you to store an event, for example by
appending it to a file, inserting it into a relational table, or writing it to a document database.
It also allows you to send the event over the network to another node in order to process it.
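To make the idea of an encoded event concrete, here is a minimal sketch in Python, assuming a page-view event encoded as one line of JSON. The class name `PageViewEvent`, its fields, and the `encode`/`decode` helpers are hypothetical and chosen only for illustration; a binary encoding from Chapter 5 would work equally well.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)  # frozen: an event is immutable once it has been created
class PageViewEvent:
    user_id: str       # hypothetical field names, chosen for illustration
    url: str
    timestamp: float   # seconds since the epoch, from a time-of-day clock


def encode(event: PageViewEvent) -> bytes:
    """Encode an event as one line of JSON, ready to append to a file
    or to send over the network."""
    return (json.dumps(asdict(event), sort_keys=True) + "\n").encode("utf-8")


def decode(line: bytes) -> PageViewEvent:
    """Parse one encoded line back into an event."""
    return PageViewEvent(**json.loads(line.decode("utf-8")))


event = PageViewEvent(user_id="u42", url="/home", timestamp=1700000000.0)
wire = encode(event)
assert decode(wire) == event  # the encoding round-trips without loss
```

The same bytes could be appended to a log file, stored in a table column, or written to a socket, which is what lets one encoded event serve both storage and transmission.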
In batch processing, a file is written once and then potentially read by multiple jobs. Analogously,
in streaming terminology, an event is generated once by a *producer* (also known as a *publisher* or
*sender*), and then potentially processed by multiple *consumers* (*subscribers* or *recipients*)
[^3]. In a filesystem, a filename identifies a set of related records; in a streaming
system, related events are usually grouped together into a *topic* or *stream*.
In principle, a file or database is sufficient to connect producers and consumers: a producer writes
every event that it generates to the datastore, and each consumer periodically polls the datastore
to check for events that have appeared since it last ran. This is essentially what a batch process
does when it processes a day's worth of data at the end of every day.
However, when moving toward continual processing with low delays, polling becomes expensive if the
datastore is not designed for this kind of usage. The more often you poll, the lower the percentage
of requests that return new events, and thus the higher the overheads become. Instead, it is better
for consumers to be notified when new events appear.
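The cost of frequent polling can be estimated with a small calculation. The sketch below assumes that events arrive independently at a constant average rate (a Poisson process), which is a modeling assumption rather than something the text requires; the function name is hypothetical.

```python
import math


def fraction_of_useful_polls(events_per_second: float,
                             poll_interval_seconds: float) -> float:
    """Probability that at least one new event arrived since the last poll,
    assuming Poisson arrivals at the given average rate."""
    return 1 - math.exp(-events_per_second * poll_interval_seconds)


# One event every 10 seconds on average (0.1 events per second):
once_a_minute = fraction_of_useful_polls(0.1, 60)    # about 0.998
ten_per_second = fraction_of_useful_polls(0.1, 0.1)  # about 0.01
```

Under this assumption, polling once a minute almost always finds something new, whereas polling ten times per second returns new events on only about 1% of requests: the other 99% are pure overhead.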
Databases have traditionally not supported this kind of notification mechanism very well: relational
databases commonly have *triggers*, which can react to a change (e.g., a row being inserted into a
table), but they are very limited in what they can do and have been somewhat of an afterthought in
database design [^4]. Instead, specialized tools have been developed for the purpose of
delivering event notifications.
### Messaging Systems {#sec_stream_messaging}
A common approach for notifying consumers about new events is to use a *messaging system*: a
producer sends a message containing the event, which is then pushed to consumers. We touched on
these systems previously in ["Event-Driven Architectures"](/en/ch5#sec_encoding_dataflow_msg), but
we will now go into more detail.
A direct communication channel like a Unix pipe or TCP connection between producer and consumer
would be a simple way of implementing a messaging system. However, most messaging systems expand on
this basic model. In particular, Unix pipes and TCP connect exactly one sender with one recipient,
whereas a messaging system allows multiple producer nodes to send messages to the same topic and
allows multiple consumer nodes to receive messages in a topic.
Within this *publish/subscribe* model, different systems take a wide range of approaches, and there
is no one right answer for all purposes. To differentiate the systems, it is particularly helpful to
ask the following two questions:
1. *What happens if the producers send messages faster than the consumers can process them?*
   Broadly speaking, there are three options: the system can drop messages, buffer messages in a
   queue, or apply *backpressure* (also known as *flow control*; i.e., blocking the producer from
   sending more messages). For example, Unix pipes and TCP use backpressure: they have a small
   fixed-size buffer, and if it fills up, the sender is blocked until the recipient takes data out
   of the buffer (see ["Network congestion and queueing"](/en/ch9#sec_distributed_congestion)).

   If messages are buffered in a queue, it is important to understand what happens as that queue
   grows. Does the system crash if the queue no longer fits in memory, or does it write messages to
   disk? In the latter case, how does the disk access affect the performance of the messaging
   system [^5], and what happens when the disk fills up [^6]?

2. *What happens if nodes crash or temporarily go offline---are any messages lost?* As with
   databases, durability may require some combination of writing to disk and/or replication (see
   the sidebar ["Replication and Durability"](/en/ch8#sidebar_transactions_durability)), which has
   a cost. If you can afford to sometimes lose messages, you can probably get higher throughput and
   lower latency on the same hardware.
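The three options in the first question (drop, buffer, or backpressure) can be compared with a small simulation. This is a sketch under stated assumptions, not a model of any real messaging system: the producer offers one message per tick, the consumer takes one message every other tick, and the function name, policy names, and default capacity are all hypothetical.

```python
from collections import deque


def simulate(policy: str, n_messages: int = 10, capacity: int = 3):
    """Run a producer that is twice as fast as its consumer.

    Returns (delivered, dropped, max_queue_length).
    """
    if policy not in ("drop", "buffer", "backpressure"):
        raise ValueError(f"unknown policy: {policy}")
    queue = deque()
    delivered, dropped = [], []
    max_len = 0
    next_msg = 0
    tick = 0
    while next_msg < n_messages or queue:
        tick += 1
        if next_msg < n_messages:
            if policy == "buffer" or len(queue) < capacity:
                queue.append(next_msg)    # "buffer" ignores the capacity limit
                next_msg += 1
            elif policy == "drop":
                dropped.append(next_msg)  # queue is full: discard the message
                next_msg += 1
            # "backpressure": the producer is blocked and retries next tick
        max_len = max(max_len, len(queue))
        if tick % 2 == 0 and queue:
            delivered.append(queue.popleft())
    return delivered, dropped, max_len


for policy in ("drop", "buffer", "backpressure"):
    delivered, dropped, max_len = simulate(policy)
    print(policy, "delivered", len(delivered), "dropped", len(dropped),
          "max queue length", max_len)
```

With these parameters, dropping loses three of the ten messages, unbounded buffering delivers everything but lets the queue grow to six entries, and backpressure delivers everything with at most three queued messages by slowing the producer down.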
Whether message loss is acceptable depends very much on the application. For example, with sensor
readings and metrics that are transmitted periodically, an occasional missing data point is perhaps
not important, since an updated value will be sent a short time later anyway. However, beware that
if a large number of messages are dropped, it may not be immediately apparent that the metrics are
incorrect [^7]. If you are counting events, it is more important that they are delivered
reliably, since every lost message means incorrect counters.
A nice property of the batch processing systems we explored in [Chapter 11](/en/ch11#ch_batch) is
that they provide a strong reliability guarantee: failed tasks are automatically retried, and
partial output from failed tasks is automatically discarded. This means the output is the same as if
no failures had occurred, which helps simplify the programming model. Later in this chapter we will
examine how we can provide similar guarantees in a streaming context.
#### Direct messaging from producers to consumers {#id296}
A number of messaging systems use direct network communication between producers and consumers
without going via intermediary nodes:
- UDP multicast is widely used in the financial industry for streams such as stock market feeds,
where low latency is important [^8]. Although UDP itself is unreliable,
application-level protocols can recover lost packets (the producer must remember packets it has
sent so that it can retransmit them on demand).
- Brokerless messaging libraries such as ZeroMQ and nanomsg take a similar approach, implementing
publish/subscribe messaging over TCP or IP multicast.
- Some metrics collection agents, such as StatsD [^9], use unreliable UDP messaging to
collect metrics from all machines on the network and monitor them. (In the StatsD protocol,
counter metrics are only correct if all messages are received; using UDP makes the metrics at best
approximate [^10]. See also ["TCP Versus UDP"](/en/ch9#sidebar_distributed_tcp_udp).)
- If the consumer exposes a service on the network, producers can make a direct HTTP or RPC request
(see ["Dataflow Through Services: REST and RPC"](/en/ch5#sec_encoding_dataflow_rpc)) to push
messages to the consumer. This is the idea behind webhooks [^11], a pattern in which a
callback URL of one service is registered with another service, and it makes a request to that URL
whenever an event occurs.
Although these direct messaging systems work well in the situations for which they are designed,
they generally require the application code to be aware of the possibility of message loss. The
faults they can tolerate are quite limited: even if the protocols detect and retransmit packets that
are lost in the network, they generally assume that producers and consumers are constantly online.
If a consumer is offline, it may miss messages that were sent while it was unreachable. Some
protocols allow the producer to retry failed message deliveries, but this approach may break down if
the producer crashes, losing the buffer of messages that it was supposed to retry.
#### Message brokers {#id433}
A widely used alternative is to send messages via a *message broker* (also known as a *message
queue*), which is essentially a kind of database that is optimized for handling message streams
[^12]. It runs as a server, with producers and consumers connecting to it as clients.
Producers write messages to the broker, and consumers receive them by reading them from the broker.
By centralizing the data in the broker, these systems can more easily tolerate clients that come and
go (connect, disconnect, and crash), and the question of durability is moved to the broker instead.
Some message brokers only keep messages in memory, while others (depending on configuration) write
them to disk so that they are not lost in case of a broker crash. Faced with slow consumers, they
generally allow unbounded queueing (as opposed to dropping messages or backpressure), although this
choice may also depend on the configuration.
A consequence of queueing is also that consumers are generally *asynchronous*: when a producer sends
a message, it normally only waits for the broker to confirm that it has buffered the message and
does not wait for the message to be processed by consumers. The delivery to consumers will happen at
some undetermined future point in time---often within a fraction of a second, but sometimes
significantly later if there is a queue backlog.
#### Message brokers compared to databases {#id297}
Some message brokers can even participate in two-phase commit protocols using XA or JTA (see
["Distributed Transactions Across Different Systems"](/en/ch8#sec_transactions_xa)). This feature
makes them quite similar in nature to databases, although there are still important practical
differences between message brokers and databases:
- Databases usually keep data until it is explicitly deleted, whereas some message brokers
automatically delete a message when it has been successfully delivered to its consumers. Such
message brokers are not suitable for long-term data storage.
- Since they quickly delete messages, most message brokers assume that their working set is fairly
small---i.e., the queues are short. If the broker needs to buffer a lot of messages because the
consumers are slow (perhaps spilling messages to disk if they no longer fit in memory), each
individual message takes longer to process, and the overall throughput may degrade [^5].
- Databases often support secondary indexes and various ways of searching for data using a query
language, while message brokers often support some way of subscribing to a subset of topics
matching some pattern. Both are essentially ways for a client to select the portion of the data
that it wants to know about, but databases typically offer much more advanced query functionality.
- When querying a database, the result is typically based on a point-in-time snapshot of the data;
if another client subsequently writes something to the database that changes the query result, the
first client does not find out that its prior result is now outdated (unless it repeats the query,
or polls for changes). By contrast, message brokers do not support arbitrary queries and don't
allow message updates once they're sent, but they do notify clients when data changes (i.e., when
new messages become available).
This is the traditional view of message brokers, which is encapsulated in standards like JMS
[^13] and AMQP [^14] and implemented in software like RabbitMQ, ActiveMQ,
HornetQ, Qpid, TIBCO Enterprise Message Service, IBM MQ, Azure Service Bus, and Google Cloud Pub/Sub
[^15]. Although it is possible to use databases as queues, tuning them to get good
performance is not straightforward [^16].
#### Multiple consumers {#id298}
When multiple consumers read messages in the same topic, two main patterns of messaging are used, as
illustrated in [Figure 12-1](/en/ch12#fig_stream_multi_consumer):
Load balancing
: Each message is delivered to *one* of the consumers, so the consumers can share the work of
processing the messages in the topic. The broker may assign messages to consumers arbitrarily.
This pattern is useful when the messages are expensive to process, and so you want to be able to
add consumers to parallelize the processing. (In AMQP, you can implement load balancing by
having multiple clients consuming from the same queue, and in JMS it is called a *shared*
*subscription*.)
Fan-out
: Each message is delivered to *all* of the consumers. Fan-out allows several independent
consumers to each "tune in" to the same broadcast of messages, without affecting each
other---the streaming equivalent of having several different batch jobs that read the same input
file. (This feature is provided by topic subscriptions in JMS, and exchange bindings in AMQP.)
{{< figure src="/fig/ddia_1201.png" id="fig_stream_multi_consumer" caption="Figure 12-1. (a) Load balancing: sharing the work of consuming a topic among consumers; (b) fan-out: delivering each message to multiple consumers." class="w-full my-4" >}}
The two patterns can be combined, for example using Kafka's *consumer groups* feature. When a
consumer group subscribes to a topic, each message in the topic is sent to one of the consumers in
the group (load-balancing across the consumers in the group). If two separate consumer groups
subscribe to the same topic, each message is sent to one consumer in each group (providing fan-out
across consumer groups).
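The combination of the two patterns can be sketched as a toy in-memory topic. The `Topic` class, the group names, and the use of plain lists as consumer inboxes are hypothetical; picking a consumer round-robin within each group is a simplification made for this sketch and is not how Kafka distributes work.

```python
from itertools import cycle


class Topic:
    """Toy topic: fan-out across groups, load balancing within a group."""

    def __init__(self):
        self._groups = {}  # group name -> round-robin iterator over inboxes

    def subscribe(self, group: str, inboxes: list):
        self._groups[group] = cycle(inboxes)

    def publish(self, message):
        # Every group sees the message (fan-out), but only one consumer
        # in each group receives it (load balancing).
        for inboxes in self._groups.values():
            next(inboxes).append(message)


indexer_1, indexer_2, auditor = [], [], []
topic = Topic()
topic.subscribe("indexers", [indexer_1, indexer_2])
topic.subscribe("audit", [auditor])
for message in ["m1", "m2", "m3", "m4"]:
    topic.publish(message)

assert indexer_1 == ["m1", "m3"] and indexer_2 == ["m2", "m4"]
assert auditor == ["m1", "m2", "m3", "m4"]
```

The two indexers split the messages between them, while the auditor, alone in its own group, receives every message.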
#### Acknowledgments and redelivery {#sec_stream_reordering}
Consumers may crash at any time, so it could happen that a broker delivers a message to a consumer
but the consumer never processes it, or only partially processes it before crashing. In order to
ensure that the message is not lost, message brokers use *acknowledgments*: a client must explicitly
tell the broker when it has finished processing a message so that the broker can remove it from the
queue.
If the connection to a client is closed or times out without the broker receiving an acknowledgment,
it assumes that the message was not processed, and therefore it delivers the message again to
another consumer. (Note that it could happen that the message actually *was* fully processed, but
the acknowledgment was lost in the network. Handling this case requires an atomic commit protocol,
as discussed in ["Exactly-once message processing"](/en/ch8#sec_transactions_exactly_once), unless
the operation was idempotent or exactly-once semantics are not required.)
When combined with load balancing, this redelivery behavior has an interesting effect on the
ordering of messages. In [Figure 12-2](/en/ch12#fig_stream_redelivery), the consumers generally
process messages in the order they were sent by producers. However, consumer 2 crashes while
processing message *m3*, at the same time as consumer 1 is processing message *m4*. The
unacknowledged message *m3* is subsequently redelivered to consumer 1, with the result that consumer
1 processes messages in the order *m4*, *m3*, *m5*. Thus, *m3* and *m4* are not delivered in the
same order as they were sent by producer 1.
{{< figure src="/fig/ddia_1202.png" id="fig_stream_redelivery" caption="Figure 12-2. Consumer 2 crashes while processing m3, so it is redelivered to consumer 1 at a later time." class="w-full my-4" >}}
Even if the message broker otherwise tries to preserve the order of messages (as required by both
the JMS and AMQP standards), the combination of load balancing with redelivery inevitably leads to
messages being reordered. To avoid this issue, you can use a separate queue per consumer (i.e., not
use the load balancing feature). Message reordering is not a problem if messages are completely
independent of each other, but it can be important if there are causal dependencies between
messages, as we shall see later in the chapter.
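The reordering scenario described above can be reproduced with a toy broker. This is a minimal sketch: the `Broker` class and its method names are hypothetical, and a crash is modeled simply as the broker noticing a lost connection.

```python
from collections import deque


class Broker:
    """Toy broker with load balancing, acknowledgments, and redelivery."""

    def __init__(self, messages):
        self.pending = deque(messages)  # messages not yet delivered
        self.unacked = {}               # consumer name -> message in flight

    def deliver(self, consumer: str):
        """Hand the next pending message to a consumer and await its ack."""
        message = self.pending.popleft()
        self.unacked[consumer] = message
        return message

    def ack(self, consumer: str):
        """The consumer finished processing, so the message can be removed."""
        del self.unacked[consumer]

    def connection_lost(self, consumer: str):
        """No ack arrived: put the message back at the front for redelivery."""
        self.pending.appendleft(self.unacked.pop(consumer))


broker = Broker(["m3", "m4", "m5"])
seen_by_consumer_1 = []

broker.deliver("consumer 2")                             # consumer 2 gets m3
seen_by_consumer_1.append(broker.deliver("consumer 1"))  # consumer 1 gets m4
broker.connection_lost("consumer 2")                     # crash before acking m3
broker.ack("consumer 1")                                 # m4 is done
seen_by_consumer_1.append(broker.deliver("consumer 1"))  # m3 is redelivered
broker.ack("consumer 1")
seen_by_consumer_1.append(broker.deliver("consumer 1"))  # then m5
broker.ack("consumer 1")

assert seen_by_consumer_1 == ["m4", "m3", "m5"]
```

Even though the broker puts the unacknowledged message back at the front of the queue, consumer 1 has already received *m4*, so it sees *m3* after *m4*.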
Redelivery can also result in wasted resources, resource starvation, or permanent blockages in a
stream. A common scenario is a producer that improperly serializes a message; for example, by
leaving out a required key in a JSON-encoded object. Any consumer that reads the message will expect
the key, and fail if it's missing. No acknowledgment is sent, so the broker will re-send the
message, which will cause another consumer to fail. This loop repeats itself indefinitely. If the
broker guarantees strong ordering, no further progress can be made. Brokers that allow message
reordering can continue to make progress, but will waste resources on messages that will never be
acknowledged.
Dead letter queues (DLQs) are used to handle this problem. Rather than keeping the message in the
current queue and retrying forever, the message is moved to a different queue to unblock consumers
[^17], [^18]. Monitoring is usually set up on dead letter queues---any message in
the queue is an error. Once a new message is detected, an operator can decide to permanently drop
it, manually modify and re-produce the message, or fix consumer code to handle the message
appropriately. DLQs are common in most queueing systems, but log-based messaging systems such as
Apache Pulsar and stream processing systems such as Kafka Streams now support them as well
[^19].
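A dead letter queue can be sketched as follows, assuming a consumer that gives up on a message after a fixed number of failed attempts. The retry limit, the `handle` function, and the required `user_id` key are hypothetical; real systems differ in how they count attempts and in where the dead letter queue lives.

```python
import json
from collections import deque

MAX_ATTEMPTS = 3  # assumed retry limit; real brokers make this configurable


def handle(raw: str) -> str:
    """Hypothetical consumer logic that requires a 'user_id' key."""
    return json.loads(raw)["user_id"]


def consume(queue: deque, dead_letters: list) -> list:
    """Process messages in order, parking repeatedly failing ones."""
    results = []
    attempts = {}  # keyed by message content, which is enough for a sketch
    while queue:
        raw = queue.popleft()
        try:
            results.append(handle(raw))
        except (KeyError, TypeError, ValueError):
            attempts[raw] = attempts.get(raw, 0) + 1
            if attempts[raw] >= MAX_ATTEMPTS:
                dead_letters.append(raw)  # give up: leave it for an operator
            else:
                queue.appendleft(raw)     # no ack, so the message is retried
    return results


queue = deque(['{"user_id": "u1"}', '{"page": "/home"}', '{"user_id": "u2"}'])
dead_letters = []
results = consume(queue, dead_letters)

assert results == ["u1", "u2"]
assert dead_letters == ['{"page": "/home"}']
```

Without the attempt limit, the malformed second message would be retried forever and `u2` would never be processed; with it, the bad message is set aside and the stream makes progress.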
### Log-based Message Brokers {#sec_stream_log}
Sending a packet over a network or making a request to a network service is normally a transient
operation that leaves no permanent trace. Although it is possible to record it permanently (using
packet capture and logging), we normally don't think of it that way. AMQP/JMS-style message brokers
inherited this transient messaging mindset: even though they may write messages to disk, they
quickly delete the messages again after they have been delivered to consumers.
Databases and filesystems take the opposite approach: everything that is written to a database or
file is normally expected to be permanently recorded, at least until someone explicitly chooses to
delete it again.
This difference in mindset has a big impact on how derived data is created. A key feature of batch
processes, as discussed in [Chapter 11](/en/ch11#ch_batch), is that you can run them repeatedly,
experimenting with the processing steps, without risk of damaging the input (since the input is
read-only). This is not the case with AMQP/JMS-style messaging: receiving a message is destructive
if the acknowledgment causes it to be deleted from the broker, so you cannot run the same consumer
again and expect to get the same result.
If you add a new consumer to a messaging system, it typically only starts receiving messages sent
after the time it was registered; any prior messages are already gone and cannot be recovered.
Contrast this with files and databases, where you can add a new client at any time, and it can read
data written arbitrarily far in the past (as long as it has not been explicitly overwritten or
deleted by the application).
Why can we not have a hybrid, combining the durable storage approach of databases with the
low-latency notification facilities of messaging? This is the idea behind *log-based message
brokers*, which have become very popular in recent years.
#### Using logs for message storage {#id300}
A log is simply an append-only sequence of records on disk. We previously discussed logs in the
context of log-structured storage engines and write-ahead logs in [Chapter 4](/en/ch4#ch_storage),
in the context of replication in [Chapter 6](/en/ch6#ch_replication), and as a form of consensus in
[Chapter 10](/en/ch10#ch_consistency).
The same structure can be used to implement a message broker: a producer sends a message by
appending it to the end of the log, and a consumer receives messages by reading the log
sequentially. If a consumer reaches the end of the log, it waits for a notification that a new
message has been appended. The Unix tool `tail -f`, which watches a file for data being appended,
essentially works like this.
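The following sketch shows the core of this idea, with a Python list standing in for the file on disk. The `Log` and `Consumer` classes are hypothetical, and a real consumer would wait for a notification at the end of the log rather than receive an empty result.

```python
class Log:
    """Append-only log; a record's offset is its position in the sequence."""

    def __init__(self):
        self._records = []  # stands in for an append-only file on disk

    def append(self, record) -> int:
        self._records.append(record)
        return len(self._records) - 1  # offset assigned to this record

    def read(self, offset: int, max_records: int = 100) -> list:
        """Return records starting at offset; reading removes nothing."""
        return self._records[offset:offset + max_records]


class Consumer:
    def __init__(self, log: Log, start_offset: int = 0):
        self.log = log
        self.offset = start_offset  # next offset this consumer will read

    def poll(self) -> list:
        records = self.log.read(self.offset)
        self.offset += len(records)
        return records


log = Log()
log.append("a")
log.append("b")

early = Consumer(log)
assert early.poll() == ["a", "b"]
assert early.poll() == []          # caught up with the end of the log

log.append("c")
assert early.poll() == ["c"]

late = Consumer(log)               # a consumer added later can read everything
assert late.poll() == ["a", "b", "c"]
```

Because the only per-consumer state is an offset, adding a consumer or rereading old messages does not affect anyone else.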
In order to scale to higher throughput than a single disk can offer, the log can be *sharded* (in
the sense of [Chapter 7](/en/ch7#ch_sharding)). Different shards can then be hosted on different
machines, making each shard a separate log that can be read and written independently from other
shards. A topic can then be defined as a group of shards that all carry messages of the same type.
This approach is illustrated in [Figure 12-3](/en/ch12#fig_stream_kafka_partitions).
Within each shard, which Kafka calls a *partition*, the broker assigns a monotonically increasing
sequence number, or *offset*, to every message (in
[Figure 12-3](/en/ch12#fig_stream_kafka_partitions), the numbers in boxes are message offsets). Such
a sequence number makes sense because a partition (shard) is append-only, so the messages within a
partition are totally ordered. There is no ordering guarantee across different partitions.
{{< figure src="/fig/ddia_1203.png" id="fig_stream_kafka_partitions" caption="Figure 12-3. Producers send messages by appending them to a topic-partition file, and consumers read these files sequentially." class="w-full my-4" >}}
Apache Kafka [^20] and Amazon Kinesis Streams are log-based message brokers that work like
this. Google Cloud Pub/Sub is architecturally similar but exposes a JMS-style API rather than a log
abstraction [^15]. Even though these message brokers write all messages to disk, they are
able to achieve throughput of millions of messages per second by sharding across multiple machines,
and fault tolerance by replicating messages [^21], [^22].
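A sharded log with per-shard offsets can be sketched as below. Routing each message by a hash of its key is an assumption made for this example (the text above does not prescribe how messages are assigned to shards), and the `ShardedTopic` class is hypothetical. The sketch uses `zlib.crc32` rather than Python's built-in `hash()`, because the latter can give different results for the same string in different processes.

```python
import zlib


class ShardedTopic:
    """Toy topic made of several independent append-only logs."""

    def __init__(self, num_shards: int):
        self.shards = [[] for _ in range(num_shards)]

    def append(self, key: str, value) -> tuple:
        """Append to the shard chosen by the key; return (shard, offset)."""
        shard = zlib.crc32(key.encode("utf-8")) % len(self.shards)
        self.shards[shard].append((key, value))
        offset = len(self.shards[shard]) - 1  # increases only within a shard
        return shard, offset


topic = ShardedTopic(num_shards=4)
shard_a, offset_a = topic.append("user-1", "viewed /home")
shard_b, offset_b = topic.append("user-1", "viewed /cart")

assert shard_a == shard_b                 # same key, same shard
assert (offset_a, offset_b) == (0, 1)     # ordered within that shard
```

Messages with the same key are totally ordered because they land in the same shard; two messages with different keys may end up in different shards, and their offsets then say nothing about their relative order.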
#### Logs compared to traditional messaging {#sec_stream_logs_vs_messaging}
The log-based approach trivially supports fan-out messaging, because several consumers can
independently read the log without affecting each other---reading a message does not delete it from
the log. To achieve load balancing across a group of consumers, instead of assigning individual
messages to consumer clients, the broker can assign entire shards to nodes in the consumer group.
Each client then consumes *all* the messages in the shards it has been assigned. Typically, when a
consumer has been assigned a log shard, it reads the messages in the shard sequentially, in a
straightforward single-threaded manner. This coarse-grained load balancing approach has some
downsides:
- The number of nodes sharing the work of consuming a topic can be at most the number of log shards
in that topic, because messages within the same shard are delivered to the same node. (It's
possible to create a load balancing scheme in which two consumers share the work of processing a
shard by having both read the full set of messages, but one of them only considers messages with
even-numbered offsets while the other deals with the odd-numbered offsets. Alternatively, you
could spread message processing over a thread pool, but that approach complicates consumer offset
management. In general, single-threaded processing of a shard is preferable, and parallelism can
be increased by using more shards.)
- If a single message is slow to process, it holds up the processing of subsequent messages in that
shard (a form of head-of-line blocking; see ["Describing
Performance"](/en/ch2#sec_introduction_percentiles)).
Thus, in situations where messages may be expensive to process and you want to parallelize
processing on a message-by-message basis, and where message ordering is not so important, the
JMS/AMQP style of message broker is preferable. On the other hand, in situations with high message
throughput, where each message is fast to process and where message ordering is important, the
log-based approach works very well [^23], [^24]. However, the distinction between
the two architectures is being blurred as log-based messaging systems such as Kafka now support
JMS/AMQP style consumer groups, which allow multiple consumers to receive messages from the same
partition [^25], [^26].
Since sharded logs typically preserve message ordering only within a single shard, all messages that
need to be ordered consistently need to be routed to the same shard. For example, an application may
require that the events relating to one particular user appear in a fixed order. This can be
achieved by choosing the shard for an event based on the user ID of that event (in other words,
making the user ID the *partition key*).
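As a rough illustration of routing by partition key, the sketch below hashes a user ID to pick a shard. The function name `shard_for` and the choice of SHA-256 are assumptions made for this example; brokers use their own hash functions. A stable hash is used because Python's built-in `hash()` for strings is randomized per process.

```python
import hashlib


def shard_for(partition_key: str, num_shards: int) -> int:
    """Map a partition key to a shard using a hash that is stable across processes."""
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


events = [
    ("user-17", "login"),
    ("user-42", "login"),
    ("user-17", "add_to_cart"),
    ("user-17", "checkout"),
]

shards = {n: [] for n in range(4)}
for user_id, action in events:
    shards[shard_for(user_id, 4)].append((user_id, action))

# All events for user-17 land in one shard, in the order they were produced.
user_17_shard = shards[shard_for("user-17", 4)]
user_17_actions = [action for user_id, action in user_17_shard if user_id == "user-17"]
```

Note that with this simple modulo scheme, changing the number of shards changes the mapping for most keys, so ordering for a key is only preserved while the shard count stays fixed.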
#### Consumer offsets {#sec_stream_log_offsets}
Consuming a shard sequentially makes it easy to tell which messages have been processed: all
messages with an offset less than a consumer's current offset have already been processed, and all
messages with a greater offset have not yet been seen. Thus, the broker does not need to track
acknowledgments for every single message---it only needs to periodically record the consumer
offsets. The reduced bookkeeping overhead and the opportunities for batching and pipelining in this
approach help increase the throughput of log-based systems. If a consumer fails, however, it will
resume from the last recorded offset rather than the more recent last offset it saw. This can cause
the consumer to see some messages twice.
This offset is in fact very similar to the *log sequence number* that is commonly found in
single-leader database replication, and which we discussed in ["Setting Up New
Followers"](/en/ch6#sec_replication_new_replica). In database replication, the log sequence number
allows a follower to reconnect to a leader after it has become disconnected, and resume replication
without skipping any writes. Exactly the same principle is used here: the message broker behaves
like a leader database, and the consumer like a follower.
If a consumer node fails, another node in the consumer group is assigned the failed consumer's
shards, and it starts consuming messages at the last recorded offset. If the consumer had processed
subsequent messages but not yet recorded their offset, those messages will be processed a second
time upon restart. We will discuss ways of dealing with this issue later in the chapter.
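The following sketch simulates this behavior under simplifying assumptions: the log is a Python list, offsets are committed every `commit_every` messages, and a crash is modeled by returning early without committing. The function `consume` and its parameters are hypothetical.

```python
def consume(log, committed_offset, commit_every, crash_after=None):
    """Process messages from committed_offset onward.

    Returns (processed_messages, new_committed_offset). If crash_after is set,
    the consumer stops after processing that many messages and loses any
    progress made since the last periodic commit.
    """
    processed = []
    offset = committed_offset
    while offset < len(log):
        processed.append(log[offset])
        offset += 1
        if crash_after is not None and len(processed) == crash_after:
            return processed, committed_offset  # crash: uncommitted progress is lost
        if (offset - committed_offset) % commit_every == 0:
            committed_offset = offset  # periodic offset commit
    return processed, offset  # clean shutdown commits the final position


log = ["m0", "m1", "m2", "m3", "m4", "m5", "m6"]

# First consumer crashes after five messages; only offset 3 was committed.
first_processed, committed = consume(log, 0, commit_every=3, crash_after=5)

# A replacement consumer resumes from the committed offset.
second_processed, final_offset = consume(log, committed, commit_every=3)

duplicates = [m for m in second_processed if m in first_processed]
```

In this run, `m3` and `m4` are processed twice: the first consumer handled them but crashed before recording an offset beyond 3.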
#### Disk space usage {#sec_stream_disk_usage}
If you only ever append to the log, you will eventually run out of disk space. To reclaim disk
space, the log is actually divided into segments, and from time to time old segments are deleted or
moved to archive storage. (We'll discuss a more sophisticated way of freeing disk space in ["Log
compaction"](/en/ch12#sec_stream_log_compaction).)
This means that if a slow consumer cannot keep up with the rate of messages, and it falls so far
behind that its consumer offset points to a deleted segment, it will miss some of the messages.
Effectively, the log implements a bounded-size buffer that discards old messages when it gets full,
also known as a *circular buffer* or *ring buffer*. However, since that buffer is on disk, it can be
quite large.
Let's do a back-of-the-envelope calculation. At the time of writing, a typical large hard drive has
a capacity of 20 TB and a sequential write throughput of 250 MB/s. If you are writing messages at
the fastest possible rate, it takes about 22 hours until the drive is full and you need to start
deleting the oldest messages. That means a disk-based log can always buffer at least 22 hours' worth
of messages, even if you have many machines with many disks (having more disks increases both the
available space and the total write bandwidth). In practice, deployments rarely use the full write
bandwidth of the disk, so the log can typically keep a buffer of several days' or even weeks' worth
of messages.
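The arithmetic behind this estimate can be written out as follows. The 10% utilization figure in the second half is an assumed value for illustration, not a measurement.

```python
capacity_bytes = 20 * 10**12        # 20 TB drive
write_bytes_per_sec = 250 * 10**6   # 250 MB/s sequential write throughput

seconds_to_fill = capacity_bytes / write_bytes_per_sec
hours_to_fill = seconds_to_fill / 3600

# Assumed: the deployment uses only 10% of the peak write bandwidth,
# so the same disk holds ten times as many hours of messages.
utilization = 0.10
days_at_utilization = hours_to_fill / utilization / 24

print(f"{seconds_to_fill:.0f} s = {hours_to_fill:.1f} hours at full write rate")
print(f"{days_at_utilization:.1f} days at {utilization:.0%} utilization")
```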
Many log-based message brokers now store messages in object storage to increase their storage
capacity, similarly to databases as we saw in ["Databases Backed by Object
Storage"](/en/ch6#sec_replication_object_storage). Message brokers such as Apache Kafka and Redpanda
serve older messages from object storage as part of their tiered storage. Others, such as
WarpStream, Confluent Freight, and Bufstream, store all of their data in the object store. In
addition to cost-efficiency, this architecture also makes data integration easier: messages in
object storage can be stored as Iceberg tables, which allow batch and data warehouse jobs to run
directly on the data without having to copy it into another system.
#### When consumers cannot keep up with producers {#id459}
At the beginning of ["Messaging Systems"](/en/ch12#sec_stream_messaging) we discussed three choices
of what to do if a consumer cannot keep up with the rate at which producers are sending messages:
dropping messages, buffering, or applying backpressure. In this taxonomy, the log-based approach is
a form of buffering with a large but fixed-size buffer (limited by the available disk space).
If a consumer falls so far behind that the messages it requires are older than what is retained on
disk, it will not be able to read those messages---so the broker effectively drops old messages that
go back further than the size of the buffer can accommodate. You can monitor how far a consumer is
behind the head of the log, and raise an alert if it falls behind significantly. As the buffer is
large, there is enough time for a human operator to fix the slow consumer and allow it to catch up
before it starts missing messages.
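A lag check of this kind might look like the sketch below. It assumes you can obtain three numbers per shard: the oldest offset still retained, the offset at the head of the log, and the consumer's current offset. The function name and the alert threshold are hypothetical.

```python
def check_consumer(oldest_retained, log_end, consumer_offset, alert_lag):
    """Classify a consumer's position within one shard."""
    if consumer_offset < oldest_retained:
        # The consumer's next message has already been deleted from the log.
        return "missing messages"
    lag = log_end - consumer_offset
    return "alert" if lag >= alert_lag else "ok"


healthy = check_consumer(oldest_retained=1000, log_end=5000, consumer_offset=4990, alert_lag=1000)
behind = check_consumer(oldest_retained=1000, log_end=5000, consumer_offset=2000, alert_lag=1000)
too_late = check_consumer(oldest_retained=1000, log_end=5000, consumer_offset=900, alert_lag=1000)
```

The alert threshold should be well below the retention size, so that an operator has time to react before the `"missing messages"` case is reached.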
Even if a consumer does fall too far behind and starts missing messages, only that consumer is
affected; it does not disrupt the service for other consumers. This fact is a big operational
advantage: you can experimentally consume a production log for development, testing, or debugging
purposes, without having to worry much about disrupting production services. When a consumer is shut
down or crashes, it stops consuming resources---the only thing that remains is its consumer offset.
This behavior also contrasts with traditional message brokers, where you need to be careful to
delete any queues whose consumers have been shut down---otherwise they continue unnecessarily
accumulating messages and taking away memory from consumers that are still active.
#### Replaying old messages {#sec_stream_replay}
We noted previously that with AMQP- and JMS-style message brokers, processing and acknowledging
messages is a destructive operation, since it causes the messages to be deleted on the broker. On
the other hand, in a log-based message broker, consuming messages is more like reading from a file:
it is a read-only operation that does not change the log.
The only side effect of processing, besides any output of the consumer, is that the consumer offset
moves forward. But the offset is under the consumer's control, so it can easily be manipulated if
necessary: for example, you can start a copy of a consumer with yesterday's offsets and write the
output to a different location, in order to reprocess the last day's worth of messages. You can
repeat this any number of times, varying the processing code.
This aspect makes log-based messaging more like the batch processes of the last chapter, where
derived data is clearly separated from input data through a repeatable transformation process. It
allows more experimentation and easier recovery from errors and bugs, making it a good tool for
integrating dataflows within an organization [^27].
## Databases and Streams {#sec_stream_databases}
We have drawn some comparisons between message brokers and databases. Even though they have
traditionally been considered separate categories of tools, we saw that log-based message brokers
have been successful in taking ideas from databases and applying them to messaging. We can also go
in reverse: take ideas from messaging and streams, and apply them to databases.
One approach is to use an *event stream as the system of record* for storing data (see ["Systems of
Record and Derived Data"](/en/ch1#sec_introduction_derived)). This is what happens in *event
sourcing*, which we discussed in ["Event Sourcing and CQRS"](/en/ch3#sec_datamodels_events): instead
of storing data in a data model that is mutated by updating and deleting, you can model every state
change as an immutable event, and write it to an append-only log. Any read-optimized materialized
views are derived from these events. Log-based message brokers (configured to never delete old
events) are well suited for event sourcing since they use append-only storage, and they can notify
consumers about new events with low latency.
But you don't have to go as far as adopting event sourcing; even with mutable data models, event
streams are useful for databases. In fact, every write to a database is an event that can be
captured, stored, and processed. The connection between databases and streams runs deeper than just
the physical storage of logs on disk---it is quite fundamental.
For example, a replication log (see ["Implementation of Replication
Logs"](/en/ch6#sec_replication_implementation)) is a stream of database write events, produced by
the leader as it processes transactions. The followers apply that stream of writes to their own copy
of the database and thus end up with an accurate copy of the same data. The events in the
replication log describe the data changes that occurred.
We also came across the *state machine replication* principle in ["Using shared
logs"](/en/ch10#sec_consistency_smr), which states: if every event represents a write to the
database, and every replica processes the same events in the same order, then the replicas will all
end up in the same final state. (Processing an event is assumed to be a deterministic operation.)
It's just another case of event streams!
In this section we will first look at a problem that arises in heterogeneous data systems, and then
explore how we can solve it by bringing ideas from event streams to databases.
### Keeping Systems in Sync {#sec_stream_sync}
As we have seen throughout this book, there is no single system that can satisfy all data storage,
querying, and processing needs. In practice, most nontrivial applications need to combine several
different technologies in order to satisfy their requirements: for example, using an OLTP database
to serve user requests, a cache to speed up common requests, a full-text index to handle search
queries, and a data warehouse for analytics. Each of these has its own copy of the data, stored in
its own representation that is optimized for its own purposes.
As the same or related data appears in several different places, they need to be kept in sync with
one another: if an item is updated in the database, it also needs to be updated in the cache, search
indexes, and data warehouse. With data warehouses this synchronization is usually performed by ETL
processes (see ["Data Warehousing"](/en/ch1#sec_introduction_dwh)), often by taking a full copy of a
database, transforming it, and bulk-loading it into the data warehouse---in other words, a batch
process. Similarly, we saw in ["Batch Use Cases"](/en/ch11#sec_batch_output) how search indexes,
recommendation systems, and other derived data systems might be created using batch processes.
If periodic full database dumps are too slow, an alternative that is sometimes used is *dual
writes*, in which the application code explicitly writes to each of the systems when data changes:
for example, first writing to the database, then updating the search index, then invalidating the
cache entries (or even performing those writes concurrently).
However, dual writes have some serious problems, one of which is a race condition illustrated in
[Figure 12-4](/en/ch12#fig_stream_write_order). In this example, two clients concurrently want to
update an item X: client 1 wants to set the value to A, and client 2 wants to set it to B. Both
clients first write the new value to the database, then write it to the search index. Due to unlucky
timing, the requests are interleaved: the database first sees the write from client 1 setting the
value to A, then the write from client 2 setting the value to B, so the final value in the database
is B. The search index first sees the write from client 2, then client 1, so the final value in the
search index is A. The two systems are now permanently inconsistent with each other, even though no
error occurred.
{{< figure src="/fig/ddia_1204.png" id="fig_stream_write_order" caption="Figure 12-4. In the database, X is first set to A and then to B, while at the search index the writes arrive in the opposite order." class="w-full my-4" >}}
Unless you have some additional concurrency detection mechanism, such as the version vectors we
discussed in ["Detecting Concurrent Writes"](/en/ch6#sec_replication_concurrent), you will not even
notice that concurrent writes occurred---one value will simply silently overwrite another value.
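The interleaving described above can be reproduced deterministically. In this sketch, two dictionaries stand in for the database and the search index, and the list of steps encodes the unlucky ordering.

```python
database = {}
search_index = {}

# Each step is (client, system, value). The database sees client 1 then client 2,
# while the search index sees client 2 then client 1.
interleaving = [
    ("client 1", database, "A"),
    ("client 2", database, "B"),
    ("client 2", search_index, "B"),
    ("client 1", search_index, "A"),
]

for client, system, value in interleaving:
    system["X"] = value  # every individual write succeeds

inconsistent = database["X"] != search_index["X"]
```

Every write succeeds and no exception is raised, yet the two systems end up with different values for X.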
Another problem with dual writes is that one of the writes may fail while the other succeeds. This
is a fault-tolerance problem rather than a concurrency problem, but it also has the effect of the
two systems becoming inconsistent with each other. Ensuring that they either both succeed or both
fail is a case of the atomic commit problem, which is expensive to solve (see ["Two-Phase Commit
(2PC)"](/en/ch8#sec_transactions_2pc)).
If you only have one replicated database with a single leader, then that leader determines the order
of writes, so the state machine replication approach works among replicas of the database. However,
in [Figure 12-4](/en/ch12#fig_stream_write_order) there isn't a single leader: the database may have
a leader and the search index may have a leader, but neither follows the other, and so conflicts can
occur (see ["Multi-Leader Replication"](/en/ch6#sec_replication_multi_leader)).
The situation would be better if there really was only one leader---for example, the database---and
if we could make the search index a follower of the database. But is this possible in practice?
### Change Data Capture {#sec_stream_cdc}
The problem with most databases' replication logs is that they have long been considered to be an
internal implementation detail of the database, not a public API. Clients are supposed to query the
database through its data model and query language, not parse the replication logs and try to
extract data from them.
For decades, many databases simply did not have a documented way of getting the log of changes
written to them. For this reason it was difficult to take all the changes made in a database and
replicate them to a different storage technology such as a search index, cache, or data warehouse.
More recently, there has been growing interest in *change data capture* (CDC), which is the process
of observing all data changes written to a database and extracting them in a form in which they can
be replicated to other systems [^28]. CDC is especially interesting if changes are made
available as a stream, immediately as they are written.
For example, you can capture the changes in a database and continually apply the same changes to a
search index. If the log of changes is applied in the same order, you can expect the data in the
search index to match the data in the database. The search index and any other derived data systems
are just consumers of the change stream.
[Figure 12-5](/en/ch12#fig_stream_change_capture) shows how the concurrency problem of
[Figure 12-4](/en/ch12#fig_stream_write_order) is solved with CDC. Even though the two requests to
set X to A and B respectively arrive concurrently at the database, the database decides on some
order in which to execute them, and writes them to its replication log in that order. The search
index picks them up and applies them in the same order. If you need the data in another system, such
as a data warehouse, you can simply add it as another consumer of the CDC event stream.
{{< figure src="/fig/ddia_1205.png" id="fig_stream_change_capture" caption="Figure 12-5. Taking data in the order it was written to one database, and applying the changes to other systems in the same order." class="w-full my-4" >}}
#### Implementing change data capture {#id307}
We can call the log consumers *derived data systems*, as discussed in ["Systems of Record and
Derived Data"](/en/ch1#sec_introduction_derived): the data stored in the search index and the data
warehouse is just another view onto the data in the system of record. Change data capture is a
mechanism for ensuring that all changes made to the system of record are also reflected in the
derived data systems so that the derived systems have an accurate copy of the data.
Essentially, change data capture makes one database the leader (the one from which the changes are
captured), and turns the others into followers. A log-based message broker is well suited for
transporting the change events from the source database to the derived systems, since it preserves
the ordering of messages (avoiding the reordering issue of
[Figure 12-2](/en/ch12#fig_stream_redelivery)).
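The leader/follower relationship can be sketched in a few lines. The classes `SourceDatabase` and `DerivedSystem` are hypothetical stand-ins: the first records each write in a change log in the order it was applied, and the second applies that log from its own offset.

```python
class SourceDatabase:
    """System of record: applies writes in one order and logs them in that order."""

    def __init__(self):
        self.rows = {}
        self.change_log = []

    def write(self, key, value):
        self.rows[key] = value
        self.change_log.append((key, value))


class DerivedSystem:
    """Follower (for example a search index) that applies the change log in order."""

    def __init__(self):
        self.data = {}
        self.offset = 0  # position in the change log up to which changes are applied

    def catch_up(self, change_log):
        for key, value in change_log[self.offset:]:
            self.data[key] = value
        self.offset = len(change_log)


db = SourceDatabase()
search_index = DerivedSystem()

db.write("X", "A")
db.write("X", "B")
search_index.catch_up(db.change_log)

# A second consumer can be added later; it reads the same log from the start.
warehouse = DerivedSystem()
db.write("Y", "C")
search_index.catch_up(db.change_log)
warehouse.catch_up(db.change_log)
```

Because both followers apply the same log in the same order, they converge on the same contents as the source, which is the property that dual writes could not provide.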
Logical replication logs can be used to implement change data capture (see ["Logical (row-based) log
replication"](/en/ch6#sec_replication_logical)), although it comes with challenges, such as handling
schema changes and properly modeling updates. The Debezium open source project addresses these
challenges. The project contains *source connectors* for MySQL, PostgreSQL, Oracle, SQL Server, Db2,
Cassandra, and many other databases. These connectors attach to database replication logs and
surface the changes in a standard event schema. Messages can then be transformed and written to
downstream databases. The Kafka Connect framework offers further CDC connectors for various
databases, as well. Maxwell does something similar for MySQL by parsing the binlog [^29],
GoldenGate provides similar facilities for Oracle, and pgcapture does the same for PostgreSQL.
Like message brokers, change data capture is usually asynchronous: the system of record database
does not wait for the change to be applied to consumers before committing it. This design has the
operational advantage that adding a slow consumer does not affect the system of record too much, but
it has the downside that all the issues of replication lag apply (see ["Problems with Replication
Lag"](/en/ch6#sec_replication_lag)).
#### Initial snapshot {#sec_stream_cdc_snapshot}
If you have the log of all changes that were ever made to a database, you can reconstruct the entire
state of the database by replaying the log. However, in many cases, keeping all changes forever
would require too much disk space, and replaying it would take too long, so the log needs to be
truncated.
Building a new full-text index, for example, requires a full copy of the entire database---it is not
sufficient to only apply a log of recent changes, since it would be missing items that were not
recently updated. Thus, if you don't have the entire log history, you need to start with a
consistent snapshot, as previously discussed in ["Setting Up New
Followers"](/en/ch6#sec_replication_new_replica).
The snapshot of the database must correspond to a known position or offset in the change log, so
that you know at which point to start applying changes after the snapshot has been processed. Some
CDC tools integrate this snapshot facility, while others leave it as a manual operation. Debezium
uses Netflix's DBLog watermarking algorithm to provide incremental snapshots [^30],
[^31].
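The requirement that a snapshot correspond to a known log position can be illustrated with a small sketch. It assumes a truncated log whose first retained entry has offset `first_retained_offset`, and a snapshot that reflects all changes before `snapshot_offset`. The function `bootstrap` is hypothetical and does not represent how any particular CDC tool works.

```python
def bootstrap(snapshot_data, snapshot_offset, retained_log, first_retained_offset):
    """Build a derived copy from a snapshot plus the changes that follow it."""
    if snapshot_offset < first_retained_offset:
        raise ValueError("snapshot is older than the retained log; take a newer snapshot")
    derived = dict(snapshot_data)
    start = snapshot_offset - first_retained_offset  # skip changes already in the snapshot
    for key, value in retained_log[start:]:
        derived[key] = value
    return derived


# Offsets 0 and 1 have been truncated; the retained log starts at offset 2.
retained_log = [("a", 3), ("c", 4), ("b", 5)]
snapshot_data = {"a": 3, "b": 2}  # state after applying offsets 0, 1, and 2

index = bootstrap(snapshot_data, snapshot_offset=3,
                  retained_log=retained_log, first_retained_offset=2)
```

If the snapshot's position were unknown, the consumer could not tell which retained changes to skip and which to apply.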
#### Log compaction {#sec_stream_log_compaction}
If you can only keep a limited amount of log history, you need to go through the snapshot process
every time you want to add a new derived data system. However, *log compaction* provides a good
alternative.
We discussed log compaction previously in ["Log-Structured
Storage"](/en/ch4#sec_storage_log_structured), in the context of log-structured storage engines (see
[Figure 4-3](/en/ch4#fig_storage_sstable_merging) for an example). The principle is simple: the
storage engine periodically looks for log records with the same key, throws away any duplicates, and
keeps only the most recent update for each key. This might make log segments much smaller, so
segments may also be merged as part of the compaction process, as shown in
[Figure 12-6](/en/ch12#fig_stream_compaction). This process runs in the background.
{{< figure src="/fig/ddia_1206.png" id="fig_stream_compaction" caption="Figure 12-6. A log of key-value pairs, where the key is the ID of a cat video (mew, purr, scratch, or yawn), and the value is the number of times it has been played. Log compaction retains only the most recent value for each key." class="w-full my-4" >}}
In a log-structured storage engine, an update with a special null value (a *tombstone*) indicates
that a key was deleted, and causes it to be removed during log compaction. But as long as a key is
not overwritten or deleted, it stays in the log forever. The disk space required for such a
compacted log depends only on the current contents of the database, not the number of writes that
have ever occurred in the database. If the same key is frequently overwritten, previous values will
eventually be garbage-collected, and only the latest value will be retained.
The same idea works in the context of log-based message brokers and change data capture. If the CDC
system is set up such that every change has a primary key, and every update for a key replaces the
previous value for that key, then it's sufficient to keep just the most recent write for a
particular key.
Now, whenever you want to rebuild a derived data system such as a search index, you can start a new
consumer from offset 0 of the log-compacted topic, and sequentially scan over all messages in the
log. The log is guaranteed to contain the most recent value for every key in the database (and maybe
some older values)---in other words, you can use it to obtain a full copy of the database contents
without having to take another snapshot of the CDC source database.
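A simplified version of compaction, and of rebuilding state from the compacted log, is sketched below. The play counts are made up, `None` plays the role of the tombstone, and the `compact` function is a hypothetical illustration rather than the algorithm any particular broker uses.

```python
TOMBSTONE = None


def compact(log):
    """Keep only the latest record per key; drop keys whose latest record is a tombstone."""
    latest = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)  # later records replace earlier ones
    survivors = sorted(
        (offset, key, value)
        for key, (offset, value) in latest.items()
        if value is not TOMBSTONE
    )
    return [(key, value) for offset, key, value in survivors]  # original order preserved


log = [
    ("mew", 1078),
    ("purr", 2103),
    ("purr", 2104),
    ("mew", 1079),
    ("yawn", 511),
    ("scratch", 12),
    ("scratch", TOMBSTONE),  # the scratch video was deleted
]

compacted = compact(log)

# A new consumer scanning the compacted log from the beginning obtains the current state.
rebuilt = dict(compacted)
```

The size of `compacted` depends only on the number of live keys, not on how many times each key was written.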
This log compaction feature is supported by Apache Kafka. As we shall see later in this chapter, it
allows the message broker to be used for durable storage, not just for transient messaging.
#### API support for change streams {#sec_stream_change_api}
Most popular databases now expose change streams as a first-class interface, rather than the
retrofitted and reverse-engineered CDC efforts of the past. Relational databases such as MySQL and
PostgreSQL typically send changes through the same replication log they use for their own replicas.
Most cloud vendors offer CDC solutions for their products as well: for example, Datastream offers
streaming data access for Google Cloud's relational databases and data warehouses.
Even eventually consistent, quorum-based databases such as Cassandra now support change data
capture. As we saw in ["Linearizability and quorums"](/en/ch10#sec_consistency_quorum_linearizable),
clients must persist writes to a majority of nodes before they're considered visible. CDC support
for quorum writes is challenging because there's no single source of truth to subscribe to. Whether
the data is visible or not depends on each reader's consistency preferences. Cassandra sidesteps
this issue by exposing raw log segments for each node rather than providing a single stream of
mutations. Systems that wish to consume the data must read the raw log segments for each node and
decide how best to merge them into a single stream (much like a quorum reader does) [^32].
Kafka Connect [^33] integrates change data capture tools for a wide range of database
systems with Kafka. Once the stream of change events is in Kafka, it can be used to update derived
data systems such as search indexes, and also feed into stream processing systems as discussed later
in this chapter.
#### Change data capture versus event sourcing {#sec_stream_event_sourcing}
Let's compare change data capture to event sourcing. Similarly to change data capture, event
sourcing involves storing all changes to the application state as a log of change events. The
biggest difference is that event sourcing applies the idea at a different level of abstraction:
- In change data capture, the application uses the database in a mutable way, updating and deleting
records at will. The log of changes is extracted from the database at a low level (e.g., by
parsing the replication log), which ensures that the order of writes extracted from the database
matches the order in which they were actually written, avoiding the race condition in
[Figure 12-4](/en/ch12#fig_stream_write_order).
- In event sourcing, the application logic is explicitly built on the basis of immutable events that
are written to an event log. In this case, the event store is append-only, and updates or deletes
of events are discouraged or prohibited. Events are designed to reflect things that happened at
the application level, rather than low-level state changes.
Which one is better depends on your situation. Adopting event sourcing is a big change for an
application that is not already doing it; it has a number of pros and cons, which we discussed in
["Event Sourcing and CQRS"](/en/ch3#sec_datamodels_events). In contrast, CDC can be added to an
existing database with minimal changes---the application writing to the database might not even know
that CDC is occurring.
> [!TIP] CHANGE DATA CAPTURE AND DATABASE SCHEMAS
> Though change data capture appears easier to adopt than event sourcing, it comes with its own set of
> challenges.
>
> In a microservices architecture, a database is typically only accessed from one service. Other
> services interact with it through that service's public API, but they don't normally access the
> database directly. This makes the database an internal implementation detail of the service,
> allowing the developers to change its database schema without affecting the public API.
>
> However, CDC systems typically use the upstream database's schema when replicating its data, which
> turns these schemas into public APIs that must be managed much like the public API of the service. A
> developer who removes a table column in their database table will break downstream consumers that
> depend on this field. Such challenges have always existed with data pipelines, but they typically
> only impacted data warehouse ETL. Since CDC is often implemented as a data stream, other production
> services might be consumers. Breaking such consumers can cause a customer-facing outage
> [^34]. Data contracts are often used to prevent these breakages.
>
> A common way to decouple internal from external schemas is to use the *outbox pattern*. Outboxes are
> tables with their own schemas, which are exposed to the CDC system rather than the internal domain
> model in the database [^35], [^36]. Developers can then modify their internal
> schemas as they see fit while leaving their outbox tables untouched. This might look like a dual
> write---it is. However, outboxes avoid the challenges we discussed in ["Keeping Systems in
> Sync"](/en/ch12#sec_stream_sync) by keeping both writes in the same system (the
> database). This design allows both writes to appear in a single transaction.
>
> Outboxes present a few tradeoffs, though. Developers must still maintain the transformation between
> their internal and outbox schemas, which can be challenging. An outbox also increases the amount of
> data that the database has to write to its underlying storage, which might trigger performance
> problems.
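The outbox pattern described above can be sketched with SQLite from the Python standard library. The table names, columns, and the `fail_before_outbox` flag (which simulates a crash between the two writes) are assumptions made for this example.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
# Internal domain table: free to change without affecting consumers.
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total_cents INTEGER)")
# Outbox table: the only table exposed to the CDC system in this sketch.
conn.execute(
    "CREATE TABLE outbox (seq INTEGER PRIMARY KEY AUTOINCREMENT, event_type TEXT, payload TEXT)"
)


def place_order(conn, order_id, customer, total_cents, fail_before_outbox=False):
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("INSERT INTO orders VALUES (?, ?, ?)", (order_id, customer, total_cents))
        if fail_before_outbox:
            raise RuntimeError("simulated crash between the two writes")
        payload = json.dumps({"order_id": order_id, "total_cents": total_cents})
        conn.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            ("order_placed", payload),
        )


place_order(conn, 1, "alice", 4200)
try:
    place_order(conn, 2, "bob", 100, fail_before_outbox=True)
except RuntimeError:
    pass  # the order insert for bob was rolled back together with the transaction

order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
outbox_rows = conn.execute("SELECT event_type, payload FROM outbox ORDER BY seq").fetchall()
```

Because both inserts share one transaction, there is never an order without a matching outbox event, or an event without an order.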
As with change data capture, replaying the event log in an event-sourced system allows you to
reconstruct the current state of the system. However, log compaction needs to be handled differently:
- A CDC event for the update of a record typically contains the entire new version of the record, so
the current value for a primary key is entirely determined by the most recent event for that
primary key, and log compaction can discard previous events for the same key.
- On the other hand, with event sourcing, events are modeled at a higher level: an event typically
expresses the intent of a user action, not the mechanics of the state update that occurred as a
result of the action. In this case, later events typically do not override prior events, and so
you need the full history of events to reconstruct the final state. Log compaction is not possible
in the same way.
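The difference between the two cases can be seen in a small shopping cart example. The event shapes and the `replay` function are hypothetical and chosen only to contrast the two styles.

```python
# CDC style: each event carries the entire new version of the record,
# so the most recent event for a key is enough.
cdc_events = [
    ("cart-1", {"items": ["book"]}),
    ("cart-1", {"items": ["book", "pen"]}),
    ("cart-1", {"items": ["pen"]}),
]
current_from_cdc = cdc_events[-1][1]

# Event sourcing style: each event expresses what the user did.
sourced_events = [
    ("item_added", "book"),
    ("item_added", "pen"),
    ("item_removed", "book"),
]


def replay(events):
    """Derive the cart contents by applying every event in order."""
    items = []
    for kind, item in events:
        if kind == "item_added":
            items.append(item)
        elif kind == "item_removed" and item in items:
            items.remove(item)
    return items


current_from_events = replay(sourced_events)
from_last_event_only = replay(sourced_events[-1:])  # what naive compaction would leave
```

Keeping only the last event yields an empty cart rather than the correct contents, which is why event-sourced logs cannot be compacted by key in the same way.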
Applications that use event sourcing typically have some mechanism for storing snapshots of the
current state that is derived from the log of events, so they don't need to repeatedly reprocess the
full log. However, this is only a performance optimization to speed up reads and recovery from
crashes; the intention is that the system is able to store all raw events forever and reprocess the
full event log whenever required. We discuss this assumption in ["Limitations of
immutability"](/en/ch12#sec_stream_immutability_limitations).
### State, Streams, and Immutability {#sec_stream_immutability}
We saw in [Chapter 11](/en/ch11#ch_batch) that batch processing benefits from the immutability of
its input files, so you can run experimental processing jobs on existing input files without fear of
damaging them. This principle of immutability is also what makes event sourcing and change data
capture so powerful.
We normally think of databases as storing the current state of the application---this representation
is optimized for reads, and it is usually the most convenient for serving queries. The nature of
state is that it changes, so databases support updating and deleting data as well as inserting it.
How does this fit with immutability?
Whenever you have state that changes, that state is the result of the events that mutated it over
time. For example, your list of currently available seats is the result of the reservations you have
processed, the current account balance is the result of the credits and debits on the account, and
the response time graph for your web server is an aggregation of the individual response times of
all web requests that have occurred.
No matter how the state changes, there was always a sequence of events that caused those changes.
Even as things are done and undone, the fact remains true that those events occurred. The key idea
is that mutable state and an append-only log of immutable events do not contradict each other: they
are two sides of the same coin. The log of all changes, the *changelog*, represents the evolution of
state over time.
If you are mathematically inclined, you might say that the application state is what you get when
you integrate an event stream over time, and a change stream is what you get when you differentiate
the state by time, as shown in [Figure 12-7](/en/ch12#fig_stream_integral) [^37],
[^38]. The analogy has limitations (for example, the second derivative of state does not
seem to be meaningful), but it's a useful starting point for thinking about data.
{{< figure src="/fig/ddia_1207.png" id="fig_stream_integral" caption="Figure 12-7. The relationship between the current application state and an event stream." class="w-full my-4" >}}
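
The integration and differentiation analogy can be made concrete with a small sketch. It assumes a simplified change stream in which every event is a signed change to a single counter (for example, a number of available seats); real events are usually richer than that.

```python
from itertools import accumulate

# Hypothetical change stream: each event is a signed change to one counter.
changes = [+3, -1, +2, -4]

# "Integrating" the stream: the running sum is the state after each event,
# starting from an initial state of 0.
states = list(accumulate(changes, initial=0))

# "Differentiating" the state: the difference between successive states
# recovers the original change stream.
recovered = [after - before for before, after in zip(states, states[1:])]
```
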
If you store the changelog durably, that simply has the effect of making the state reproducible. If
you consider the log of events to be your system of record, and any mutable state as being derived
from it, it becomes easier to reason about the flow of data through a system. As Jim Gray and
Andreas Reuter put it in 1992 [^39]:
> \[T\]here is no fundamental need to keep a database at all; the log contains all the information
> there is. The only reason for storing the database (i.e., the current end-of-the-log) is
> performance of retrieval operations.
Log compaction is one way of bridging the distinction between log and database state: it retains
only the latest version of each record, and discards overwritten versions.
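
A minimal sketch of that idea, assuming the log is simply a list of `(key, value)` pairs in which a later record for a key overwrites any earlier one:

```python
def compact(log):
    """Keep only the latest record for each key, preserving log order."""
    latest = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)            # later records replace earlier ones
    survivors = sorted(latest.items(), key=lambda item: item[1][0])
    return [(key, value) for key, (offset, value) in survivors]

log = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
compacted = compact(log)
```

Replaying either the full log or the compacted log into a key-value store yields the same final state, which is why compaction is safe for this kind of log.
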
#### Advantages of immutable events {#sec_stream_immutability_pros}
Immutability in databases is an old idea. For example, accountants have been using immutability for
centuries in financial bookkeeping. When a transaction occurs, it is recorded in an append-only
*ledger*, which is essentially a log of events describing money, goods, or services that have
changed hands. The accounts, such as profit and loss or the balance sheet, are derived from the
transactions in the ledger by adding them up [^40].
If a mistake is made, accountants don't erase or change the incorrect transaction in the
ledger---instead, they add another transaction that compensates for the mistake, for example
refunding an incorrect charge. The incorrect transaction still remains in the ledger forever,
because it might be important for auditing reasons. If incorrect figures, derived from the incorrect
ledger, have already been published, then the figures for the next accounting period include a
correction. This process is entirely normal in accounting [^41].
Although such auditability is particularly important in financial systems, it is also beneficial for
many other systems that are not subject to such strict regulation. If you accidentally deploy buggy
code that writes bad data to a database, recovery is much harder if the code is able to
destructively overwrite data. With an append-only log of immutable events, it is much easier to
diagnose what happened and recover from the problem. Similarly, customer service can use an audit
log to diagnose customer requests and complaints.
Immutable events also capture more information than just the current state. For example, on a
shopping website, a customer may add an item to their cart and then remove it again. Although the
second event cancels out the first event from the point of view of order fulfillment, it may be
useful to know for analytics purposes that the customer was considering a particular item but then
decided against it. Perhaps they will choose to buy it in the future, or perhaps they found a
substitute. This information is recorded in an event log, but would be lost in a database that
deletes items when they are removed from the cart.
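
The shopping cart example can be sketched as follows. The event shape, a pair of action and item, is an assumption chosen for brevity.

```python
# Hypothetical cart events in the order they occurred.
events = [("add", "book"), ("add", "lamp"), ("remove", "lamp"), ("add", "pen")]

def current_cart(events):
    """The mutable state that order fulfillment cares about."""
    cart = set()
    for action, item in events:
        if action == "add":
            cart.add(item)
        else:
            cart.discard(item)
    return cart

def abandoned_items(events):
    """Items that were added at some point but are not in the final cart."""
    ever_added = {item for action, item in events if action == "add"}
    return ever_added - current_cart(events)
```

A database that stores only the result of `current_cart` could not answer the question that `abandoned_items` answers.
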
#### Deriving several views from the same event log {#sec_stream_deriving_views}
Moreover, by separating mutable state from the immutable event log, you can derive several different
read-oriented representations from the same log of events. This works just like having multiple
consumers of a stream ([Figure 12-5](/en/ch12#fig_stream_change_capture)): for example, the analytic
database Druid ingests directly from Kafka using this approach, and Kafka Connect sinks can export
data from Kafka to various different databases and indexes [^33].
Having an explicit translation step from an event log to a database makes it easier to evolve your
application over time: if you want to introduce a new feature that presents your existing data in
some new way, you can use the event log to build a separate read-optimized view for the new feature,
and run it alongside the existing systems without having to modify them. Running old and new systems
side by side is often easier than performing a complicated schema migration in an existing system.
Once readers have switched to the new system and the old system is no longer needed, you can simply
shut it down and reclaim its resources [^42], [^43].
This idea of writing data in one write-optimized form, and then translating it into different
read-optimized representations as needed, is the *command query responsibility segregation* (CQRS)
pattern that we already encountered in ["Event Sourcing and CQRS"](/en/ch3#sec_datamodels_events).
It doesn't necessarily require event sourcing: you can just as well build multiple materialized
views from a stream of CDC events [^44].
The traditional approach to database and schema design is based on the fallacy that data must be
written in the same form as it will be queried. Debates about normalization and denormalization (see
["Normalization, Denormalization, and Joins"](/en/ch3#sec_datamodels_normalization)) become largely
irrelevant if you can translate data from a write-optimized event log to read-optimized application
state: it is entirely reasonable to denormalize data in the read-optimized views, as the translation
process gives you a mechanism for keeping it consistent with the event log.
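
As a small sketch of this translation step, the following code derives two differently shaped read views from one event log. The event fields and the view names are invented for the example.

```python
from collections import defaultdict

# Hypothetical write-optimized log: one self-contained event per order.
event_log = [
    {"order": 1, "customer": "alice", "product": "book", "price": 12},
    {"order": 2, "customer": "bob", "product": "book", "price": 12},
    {"order": 3, "customer": "alice", "product": "lamp", "price": 30},
]

def build_orders_by_customer(log):
    """Read view for a "my orders" page."""
    view = defaultdict(list)
    for event in log:
        view[event["customer"]].append(event["order"])
    return dict(view)

def build_revenue_by_product(log):
    """Read view for a sales report; denormalized on purpose."""
    view = defaultdict(int)
    for event in log:
        view[event["product"]] += event["price"]
    return dict(view)

orders_by_customer = build_orders_by_customer(event_log)
revenue_by_product = build_revenue_by_product(event_log)
```

A new view can be added later by writing another such function and running it over the existing log, without touching the views that already exist.
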
In ["Case Study: Social Network Home Timelines"](/en/ch2#sec_introduction_twitter) we discussed a
social network's home timelines, a cache of recent posts by the people a particular user is
following (like a mailbox). This is another example of read-optimized state: home timelines are
highly denormalized, since your posts are duplicated in all of the timelines of the people following
you. However, the fan-out service keeps this duplicated state in sync with new posts and new
following relationships, which keeps the duplication manageable.
#### Concurrency control {#sec_stream_concurrency}
The biggest downside of CQRS is that the consumers of the event log are usually asynchronous, so
there is a possibility that a user may make a write to the log, then read from a derived view and
find that their write has not yet been reflected in the view. We discussed this problem and
potential solutions previously in ["Reading Your Own Writes"](/en/ch6#sec_replication_ryw).
One solution would be to perform the updates of the read view synchronously with appending the event
to the log. This either requires a distributed transaction across the event log and the derived
view, or some way of waiting until an event has been reflected in the view. Both approaches are
usually impractical, so views are normally updated asynchronously.
On the other hand, deriving the current state from an event log also simplifies some aspects of
concurrency control. Much of the need for multi-object transactions (see ["Single-Object and
Multi-Object Operations"](/en/ch8#sec_transactions_multi_object)) stems from a single user action
requiring data to be changed in several different places. With event sourcing, you can design an
event such that it is a self-contained description of a user action. The user action then requires
only a single write in one place---namely appending the event to the log---which is easy to make
atomic.
If the event log and the application state are sharded in the same way (for example, processing an
event for a customer in shard 3 only requires updating shard 3 of the application state), then a
straightforward single-threaded log consumer needs no concurrency control for writes---by
construction, it only processes a single event at a time (see also ["Actual Serial
Execution"](/en/ch8#sec_transactions_serial)). The log removes the nondeterminism of concurrency by
defining a serial order of events in a shard [^27]. If an event touches multiple state
shards, a bit more work is required, which we will discuss in [Chapter 13](/en/ch13#ch_philosophy).
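
The co-sharding argument can be sketched as below. The hash function, the shard count, and the in-memory lists standing in for log shards are all assumptions for illustration.

```python
import zlib

NUM_SHARDS = 4

def shard_of(customer_id: str) -> int:
    """Stable hash so that a customer's events and state share a shard."""
    return zlib.crc32(customer_id.encode("utf-8")) % NUM_SHARDS

shard_logs = [[] for _ in range(NUM_SHARDS)]     # one append-only log per shard
shard_state = [{} for _ in range(NUM_SHARDS)]    # one state store per shard

def append_event(customer_id, amount):
    shard_logs[shard_of(customer_id)].append((customer_id, amount))

def consume_shard(shard):
    """Single-threaded consumer: applies one event at a time, in log order,
    and only ever touches the state of its own shard."""
    state = shard_state[shard]
    for customer_id, amount in shard_logs[shard]:
        state[customer_id] = state.get(customer_id, 0) + amount

for customer, amount in [("alice", 10), ("bob", 5), ("alice", -3)]:
    append_event(customer, amount)
for shard in range(NUM_SHARDS):
    consume_shard(shard)
```

Because each consumer owns its shard's state and processes events serially, no locks or transactions are needed for these writes.
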
Many systems that don't use an event-sourced model nevertheless rely on immutability for concurrency
control: various databases internally use immutable data structures or multi-version data to support
point-in-time snapshots (see ["Indexes and snapshot
isolation"](/en/ch8#sec_transactions_snapshot_indexes)). Version control systems such as Git,
Mercurial, and Fossil also rely on immutable data to preserve version history of files.
#### Limitations of immutability {#sec_stream_immutability_limitations}
To what extent is it feasible to keep an immutable history of all changes forever? The answer
depends on the amount of churn in the dataset. Some workloads mostly add data and rarely update or
delete; they are easy to make immutable. Other workloads have a high rate of updates and deletes on
a comparatively small dataset; in these cases, the immutable history may grow prohibitively large,
fragmentation may become an issue, and the performance of compaction and garbage collection becomes
crucial for operational robustness [^45], [^46].
Besides the performance reasons, there may also be circumstances in which you need data to be
deleted for administrative or legal reasons, even in an otherwise immutable system. For example, privacy
regulations such as the European General Data Protection Regulation (GDPR) require that a user's
personal information be deleted and erroneous information be removed on demand, or an accidental
leak of sensitive information may need to be contained.
In these circumstances, it's not sufficient to just append another event to the log to indicate that
the prior data should be considered deleted---you actually want to rewrite history and pretend that
the data was never written in the first place. For example, Datomic calls this feature *excision*
[^47], and the Fossil version control system has a similar concept called *shunning*
[^48].
Truly deleting data is surprisingly hard [^49], since copies can live in many places: for
example, storage engines, filesystems, and SSDs often write to a new location rather than
overwriting in place [^41], and backups are often deliberately immutable to prevent
accidental deletion or corruption.
One way of enabling deletion of immutable data is *crypto-shredding* [^50]: data that you
may want to delete in the future is stored encrypted, and when you want to get rid of it, you forget
the encryption key. The encrypted data is then still there, but nobody can use it. In some sense
this only moves the problem around: the actual data is now immutable, but your key storage is
mutable.
Moreover, you have to decide up front which data is going to be encrypted with the same key, and
when you are going to use different keys---an important decision, since you can later crypto-shred
either all or none of the data encrypted with a particular key, but not some of it. Storing a
separate key for every single data item would get too unwieldy, as the key storage would get as big
as the primary data storage. More sophisticated schemes such as puncturable encryption
[^51] make it possible to selectively revoke a key's decryption abilities, but they are
not widely used.
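
The mechanics of crypto-shredding can be sketched as follows, assuming one key per user as the unit of deletion. The cipher here is a deliberately simplistic toy (XOR with a SHA-256 counter keystream, no authentication, keystream reused across records) and must not be used to protect real data; a real system would use an authenticated cipher from a vetted library. All names are hypothetical.

```python
import hashlib
import secrets

def keystream_xor(key: bytes, data: bytes) -> bytes:
    """Toy cipher for illustration only; applying it twice decrypts."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        pad = hashlib.sha256(key + offset.to_bytes(8, "big")).digest()
        out.extend(b ^ p for b, p in zip(data[offset:offset + 32], pad))
    return bytes(out)

key_store = {}        # mutable: one key per user
immutable_log = []    # append-only: holds only ciphertext

def append_record(user, plaintext: bytes):
    key = key_store.setdefault(user, secrets.token_bytes(32))
    immutable_log.append((user, keystream_xor(key, plaintext)))

def read_records(user):
    key = key_store.get(user)
    if key is None:
        return []     # key forgotten: the ciphertext can no longer be read
    return [keystream_xor(key, data) for owner, data in immutable_log if owner == user]

def crypto_shred(user):
    """Delete the key; the log itself is never modified."""
    key_store.pop(user, None)

append_record("alice", b"secret address")
append_record("bob", b"hello")
before = read_records("alice")
crypto_shred("alice")
after = read_records("alice")
```

The choice of one key per user illustrates the granularity trade-off: Alice's records can be shredded only all together.
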
Overall, deletion is more a matter of "making it harder to retrieve the data" than actually "making
it impossible to retrieve the data." Nevertheless, you sometimes have to try, as we shall see in
["Legislation and Self-Regulation"](/en/ch14#sec_future_legislation).
## Processing Streams {#sec_stream_processing}
So far in this chapter we have talked about where streams come from (user activity events, sensors,
and writes to databases), and we have talked about how streams are transported (through direct
messaging, via message brokers, and in event logs).
What remains is to discuss what you can do with the stream once you have it---namely, you can
process it. Broadly, there are three options:
1. You can take the data in the events and write it to a database, cache, search index, or similar
storage system, from where it can then be queried by other clients. As shown in
[Figure 12-5](/en/ch12#fig_stream_change_capture), this is a good way of keeping a database in
sync with changes happening in other parts of the system---especially if the stream consumer is
the only client writing to the database. Writing to a storage system is the streaming equivalent
of what we discussed in ["Batch Use Cases"](/en/ch11#sec_batch_output).
2. You can push the events to users in some way, for example by sending email alerts or push
notifications, or by streaming the events to a real-time dashboard where they are visualized. In
this case, a human is the ultimate consumer of the stream.
3. You can process one or more input streams to produce one or more output streams. Streams may go
through a pipeline consisting of several such processing stages before they eventually end up at
an output (option 1 or 2).
In the rest of this chapter, we will discuss option 3: processing streams to produce other, derived
streams. A piece of code that processes streams like this is known as an *operator* or a *job*. It
is closely related to the Unix processes and MapReduce jobs we discussed in
[Chapter 11](/en/ch11#ch_batch), and the pattern of dataflow is similar: a stream processor consumes
input streams in a read-only fashion and writes its output to a different location in an append-only
fashion.
The patterns for sharding and parallelization in stream processors are also very similar to those in
MapReduce and the dataflow engines we saw in [Chapter 11](/en/ch11#ch_batch), so we won't repeat
those topics here. Basic mapping operations such as transforming and filtering records also work the
same.
The one crucial difference from batch jobs is that a stream never ends. This difference has many
implications: as discussed at the start of this chapter, sorting does not make sense with an
unbounded dataset, and so sort-merge joins (see ["JOIN and GROUP BY"](/en/ch11#sec_batch_join))
cannot be used. Fault-tolerance mechanisms must also change: with a batch job that has been running
for a few minutes, a failed task can simply be restarted from the beginning, but with a stream job
that has been running for several years, restarting from the beginning after a crash may not be a
viable option.
### Uses of Stream Processing {#sec_stream_uses}
Stream processing has long been used for monitoring purposes, where an organization wants to be
alerted if certain things happen. For example:
- Fraud detection systems need to determine if the usage patterns of a credit card have unexpectedly
changed, and block the card if it is likely to have been stolen.
- Trading systems need to examine price changes in a financial market and execute trades according
to specified rules.
- Manufacturing systems need to monitor the status of machines in a factory, and quickly identify
the problem if there is a malfunction.
- Military and intelligence systems need to track the activities of a potential aggressor, and raise
the alarm if there are signs of an attack.
These kinds of applications require quite sophisticated pattern matching and correlations. However,
other uses of stream processing have also emerged over time. In this section we will briefly compare
and contrast some of these applications.
#### Complex event processing {#id317}
*Complex event processing* (CEP) is an approach developed in the 1990s for analyzing event streams,
especially geared toward the kind of application that requires searching for certain event patterns
[^52]. Similarly to the way that a regular expression allows you to search for certain
patterns of characters in a string, CEP allows you to specify rules to search for certain patterns
of events in a stream.
CEP systems often use a high-level declarative query language like SQL, or a graphical user
interface, to describe the patterns of events that should be detected. These queries are submitted
to a processing engine that consumes the input streams and internally maintains a state machine that
performs the required matching. When a match is found, the engine emits a *complex event* (hence the
name) with the details of the event pattern that was detected [^53].
In these systems, the relationship between queries and data is reversed compared to normal
databases. Usually, a database stores data persistently and treats queries as transient: when a
query comes in, the database searches for data matching the query, and then forgets about the query
when it has finished. CEP engines reverse these roles: queries are stored long-term; as each event
arrives, the engine checks whether it has now seen an event pattern that matches any of its standing
queries [^54].
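
To make the reversal of roles concrete, here is a hand-written standing query with a tiny per-user state machine. It is only a sketch: the pattern (several consecutive failed logins followed by a success), the event shape, and the class name are invented, and a real CEP engine would compile such logic from a declarative query rather than have it written by hand.

```python
from collections import defaultdict

class FailedLoginsThenSuccess:
    """Standing query: at least `threshold` consecutive failed logins
    followed by a successful login, tracked separately for each user."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = defaultdict(int)   # per-user state: consecutive failures

    def on_event(self, event):
        """Process one event; return a complex event if the pattern matched."""
        user, outcome = event
        if outcome == "fail":
            self.failures[user] += 1
            return None
        count = self.failures.pop(user, 0)  # a success resets the state
        if count >= self.threshold:
            return {"type": "suspicious_login", "user": user, "failures": count}
        return None

query = FailedLoginsThenSuccess()
stream = [("alice", "fail"), ("bob", "fail"), ("alice", "fail"), ("bob", "ok"),
          ("alice", "fail"), ("alice", "ok"), ("alice", "ok")]
complex_events = [match for event in stream if (match := query.on_event(event))]
```

The query object lives for as long as the stream does, while each event is seen once and then discarded.
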
Implementations of CEP include Esper, Apama, and TIBCO StreamBase. Distributed stream processors
like Flink and Spark Streaming also have SQL support for declarative queries on streams.
#### Stream analytics {#id318}
Another area in which stream processing is used is *analytics* on streams. The boundary between
CEP and stream analytics is blurry, but as a general rule, analytics tends to be less interested in
finding specific event sequences and is more oriented toward aggregations and statistical metrics
over a large number of events---for example:
- Measuring the rate of some type of event (how often it occurs per time interval)
- Calculating the rolling average of a value over some time period
- Comparing current statistics to previous time intervals (e.g., to detect trends or to alert on
metrics that are unusually high or low compared to the same time last week)
Such statistics are usually computed over fixed time intervals---for example, you might want to know
the average number of queries per second to a service over the last 5 minutes, and their 99th
percentile response time during that period. Averaging over a few minutes smooths out irrelevant
fluctuations from one second to the next, while still giving you a timely picture of any changes in
traffic pattern. The time interval over which you aggregate is known as a *window*, and we will look
into windowing in more detail in ["Reasoning About Time"](/en/ch12#sec_stream_time).
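
A minimal sketch of such a windowed aggregation is shown below. It assumes events are pairs of an event timestamp in seconds and a response time in milliseconds, uses fixed five-minute windows, and computes an exact nearest-rank percentile, which is only practical when the samples of a window fit in memory.

```python
import math
from collections import defaultdict

WINDOW_SECONDS = 300  # five-minute fixed windows

def window_start(event_time):
    """Start of the window that contains this timestamp (in seconds)."""
    return event_time - (event_time % WINDOW_SECONDS)

def percentile(sorted_samples, p):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = math.ceil(p / 100 * len(sorted_samples))
    return sorted_samples[max(rank, 1) - 1]

def aggregate(events):
    """Group events by window and compute a rate and a percentile for each."""
    windows = defaultdict(list)
    for event_time, response_ms in events:
        windows[window_start(event_time)].append(response_ms)
    return {
        start: {
            "queries_per_second": len(samples) / WINDOW_SECONDS,
            "p99_response_ms": percentile(sorted(samples), 99),
        }
        for start, samples in windows.items()
    }

stats = aggregate([(0, 10), (100, 20), (299, 30), (300, 40)])
```
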
Stream analytics systems sometimes use probabilistic algorithms, such as Bloom filters (which we
encountered in ["Bloom filters"](/en/ch4#sec_storage_bloom_filter)) for set membership, HyperLogLog
[^55] for cardinality estimation, and various percentile estimation algorithms (see
["Computing Percentiles"](/en/ch2#sidebar_percentiles)). Probabilistic algorithms produce
approximate results, but have the advantage of requiring significantly less memory in the stream
processor than exact algorithms. This use of approximation algorithms sometimes leads people to
believe that stream processing systems are always lossy and inexact, but that is wrong: there is
nothing inherently approximate about stream processing, and using probabilistic algorithms is merely
an optimization [^56].
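
As an illustration of the memory trade-off, here is a small Bloom filter sketch. The sizes and the way hash positions are derived from SHA-256 are arbitrary choices for the example, not tuned recommendations.

```python
import hashlib

class BloomFilter:
    """Approximate set membership in a fixed amount of memory.
    Lookups can return false positives but never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive several bit positions by salting the hash input with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

seen_users = BloomFilter()
for user_id in ["u1", "u2", "u3"]:
    seen_users.add(user_id)
```

However many user IDs flow through the stream, this filter occupies 128 bytes; the price is that `might_contain` occasionally answers `True` for an ID that was never added.
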
Many open source distributed stream processing frameworks are designed with analytics in mind: for
example, Apache Storm, Spark Streaming, Flink, Samza, Apache Beam, and Kafka Streams
[^57]. Hosted services include Google Cloud Dataflow and Azure Stream Analytics.
#### Maintaining materialized views {#sec_stream_mat_view}
We saw that a stream of changes to a database can be used to keep derived data systems, such as
caches, search indexes, and data warehouses, up to date with a source database. These are examples
of maintaining materialized views: deriving an alternative view onto some dataset so that you can
query it efficiently, and updating that view whenever the underlying data changes [^37].
Similarly, in event sourcing, application state is maintained by applying a log of events; here the
application state is also a kind of materialized view. Unlike stream analytics scenarios, it is
usually not sufficient to consider only events within some time window: building the materialized
view potentially requires *all* events over an arbitrary time period, apart from any obsolete events
that may be discarded by log compaction. In effect, you need a window that stretches all the way
back to the beginning of time.
In principle, any stream processor could be used for materialized view maintenance, although the
need to maintain events forever runs counter to the assumptions of some analytics-oriented
frameworks that mostly operate on windows of a limited duration. Kafka Streams and Confluent's
ksqlDB support this kind of usage, building upon Kafka's support for log compaction [^58].
> [!TIP] INCREMENTAL VIEW MAINTENANCE
> Databases might seem well suited for materialized view maintenance; they are designed to keep full
> copies of a dataset, after all. Many also support materialized views. We saw in ["Materialized Views
> and Data Cubes"](/en/ch4#sec_storage_materialized_views) that analytical queries
> typical of a data warehouse can be materialized into OLAP cubes.
>
> Unfortunately, databases often refresh materialized view tables using batch jobs or on-demand
> requests such as PostgreSQL's `REFRESH MATERIALIZED VIEW`. Views are recalculated
> periodically rather than as updates to the source data occur. This approach has two significant
> drawbacks that make it inappropriate for stream processing view maintenance:
>
> 1. Poor efficiency: all data is reprocessed every time the view is updated, though it's likely that
> most of the data remains unchanged.
>
> 2. Data freshness: changes in source data are not reflected in a materialized view until its query
> is re-run during its next scheduled update.
>
> It is possible to write database triggers that update materialized views efficiently in scenarios
> where the data is easily partitioned and the computation is naturally incremental. For example, if a
> materialized view maintains total sales revenue per day, the row for the appropriate day can be
> updated every time a new sale occurs. Bespoke solutions work in a few cases, but many SQL queries
> can't be easily or efficiently converted to incremental computation.
>
> *Incremental view maintenance (IVM)* is a more general solution to the problems listed above. IVM
> techniques convert relational grammars such as SQL into operators capable of incremental
> computations. Rather than processing entire datasets, IVM algorithms recompute and update only data
> that has changed [^38], [^59], [^60]. View computation becomes far more
> efficient. Updates can then be run much more frequently, which dramatically increases data
> freshness.
>
> Databases such as Materialize [^61], RisingWave, ClickHouse, and Feldera all use IVM
> techniques to provide efficient incremental materialized views. These databases ingest streams of
> events to expose materialized views in real time. Recent events are buffered in memory and
> periodically used to update on-disk materialized views. Reads combine the recent events and the
> materialized data to provide a single real-time view. Since reads are often expressed in SQL and
> materialized views are often stored in OLAP-style formats, these systems also support large-scale
> data warehouse-style queries such as those discussed in
> [Chapter 11](/en/ch11#ch_batch).
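
The daily revenue example from the note above can be sketched as follows. This is a hand-written delta rule for one specific query (a sum grouped by day), not a general IVM engine; the change format with a `+1`/`-1` sign and all names are assumptions for the example.

```python
from collections import defaultdict

def recompute(sales):
    """Full refresh: scan every row of the base table."""
    totals = defaultdict(int)
    for day, amount in sales:
        totals[day] += amount
    return dict(totals)

class DailyRevenueView:
    """Incrementally maintained equivalent of the full refresh."""

    def __init__(self):
        self.rows = defaultdict(int)     # number of contributing rows per day
        self.totals = defaultdict(int)   # revenue per day, in cents

    def apply(self, sign, day, amount):
        """sign is +1 for an inserted row and -1 for a deleted row."""
        self.rows[day] += sign
        self.totals[day] += sign * amount
        if self.rows[day] == 0:          # no rows left for that day
            del self.rows[day], self.totals[day]

view = DailyRevenueView()
changes = [(+1, "2024-05-01", 500), (+1, "2024-05-01", 250),
           (+1, "2024-05-02", 100), (-1, "2024-05-02", 100)]
for sign, day, amount in changes:
    view.apply(sign, day, amount)

base_table = [("2024-05-01", 500), ("2024-05-01", 250)]  # rows left after the changes
```

Each call to `apply` touches a single day, whereas `recompute` has to read the whole table to arrive at the same answer.
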
#### Search on streams {#id320}
Besides CEP, which allows searching for patterns consisting of multiple events, there is also
sometimes a need to search for individual events based on complex criteria, such as full-text search
queries.
For example, media monitoring services subscribe to feeds of news articles and broadcasts from media
outlets, and search for any news mentioning companies, products, or topics of interest. This is done
by formulating a search query in advance, and then continually matching the stream of news items
against this query. Similar features exist on some websites: for example, users of real estate
websites can ask to be notified when a new property matching their search criteria appears on the
market. The percolator feature of Elasticsearch [^62] is one option for implementing this
kind of stream search.
Conventional search engines first index the documents and then run queries over the index. By
contrast, searching a stream turns the processing on its head: the queries are stored, and the
documents run past the queries, as in CEP. In the simplest case, you can test every document
against every query, although this can get slow if you have a large number of queries. To optimize
the process, it is possible to index the queries as well as the documents, and thus narrow down the
set of queries that may match [^63].
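
One simple way of indexing queries is sketched below, assuming that every stored query is a non-empty set of keywords that must all appear in a document. Each query is indexed under just one of its terms, so a document only needs to be checked against queries whose indexed term it contains. This illustrates the general idea and does not describe how Elasticsearch's percolator works internally.

```python
from collections import defaultdict

class StoredQueries:
    """Stored keyword queries, indexed by one required term each."""

    def __init__(self):
        self.queries = {}                  # query id -> set of required terms
        self.by_term = defaultdict(set)    # term -> ids of queries indexed under it

    def register(self, query_id, terms):
        required = {term.lower() for term in terms}
        self.queries[query_id] = required
        # A document that lacks this term cannot match the query.
        self.by_term[min(required)].add(query_id)

    def match(self, document: str):
        """Return the ids of all stored queries that the document satisfies."""
        words = set(document.lower().split())
        candidates = set()
        for word in words:
            candidates |= self.by_term.get(word, set())
        return sorted(q for q in candidates if self.queries[q] <= words)

alerts = StoredQueries()
alerts.register("q1", ["flat", "london"])
alerts.register("q2", ["garden"])
alerts.register("q3", ["london", "garden"])
```

For the document `"Two-bedroom flat in London"`, only `q1` becomes a candidate; the other two queries are never evaluated.
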
#### Event-Driven Architectures and RPC {#sec_stream_actors_drpc}
In ["Event-Driven Architectures"](/en/ch5#sec_encoding_dataflow_msg) we discussed message-passing
systems as an alternative to RPC---i.e., as a mechanism for services to communicate, as used for
example in the actor model. Although these systems are also based on messages and events, we
normally don't think of them as stream processors:
- Actor frameworks are primarily a mechanism for managing concurrency and distributed execution of
communicating modules, whereas stream processing is primarily a data management technique.
- Communication between actors is often ephemeral and one-to-one, whereas event logs are durable and
multi-subscriber.
- Actors can communicate in arbitrary ways (including cyclic request/response patterns), but stream
processors are usually set up in acyclic pipelines where every stream is the output of one
particular job, and derived from a well-defined set of input streams.
That said, there is some crossover area between RPC-like systems and stream processing. For example,
Apache Storm has a feature called *distributed RPC*, which allows user queries to be farmed out to a
set of nodes that also process event streams; these queries are then interleaved with events from
the input streams, and results can be aggregated and sent back to the user. (See also ["Multi-shard
data processing"](/en/ch13#sec_future_unbundled_multi_shard).)
It is also possible to process streams using actor frameworks. However, many such frameworks do not
guarantee message delivery in the case of crashes, so the processing is not fault-tolerant unless
you implement additional retry logic.
### Reasoning About Time {#sec_stream_time}
Stream processors often need to deal with time, especially when used for analytics purposes, which
frequently use time windows such as "the average over the last five minutes." It might seem that the
meaning of "the last five minutes" should be unambiguous and clear, but unfortunately the notion is
surprisingly tricky.
In a batch process, the processing tasks rapidly crunch through a large collection of historical
events. If some kind of breakdown by time needs to happen, the batch process needs to look at the
timestamp embedded in each event. There is no point in looking at the system clock of the machine
running the batch process, because the time at which the process is run has nothing to do with the
time at which the events actually occurred.
A batch process may read a year's worth of historical events within a few minutes; in most cases,
the timeline of interest is the year of history, not the few minutes of processing. Moreover, using
the timestamps in the events allows the processing to be deterministic: running the same process
again on the same input yields the same result.
On the other hand, many stream processing frameworks use the local system clock on the processing
machine (the *processing time*) to determine windowing [^64]. This approach has the
advantage of being simple, and it is reasonable if the delay between event creation and event
processing is negligibly short. However, it breaks down if there is any significant processing
lag---i.e., if the processing may happen noticeably later than the time at which the event actually
occurred.
#### Event time versus processing time {#id322}
There are many reasons why processing may be delayed: queueing, network faults, a performance issue
leading to contention in the message broker or processor, a restart of the stream consumer, or
reprocessing of past events while recovering from a fault or after fixing a bug in the code.
Moreover, message delays can also lead to unpredictable ordering of messages. For example, say a
user first makes one web request (which is handled by web server A), and then a second request
(which is handled by server B). A and B emit events describing the requests they handled, but B's
event reaches the message broker before A's event does. Now stream processors will first see the B
event and then the A event, even though they actually occurred in the opposite order.
If it helps to have an analogy, consider the *Star Wars* movies: Episode IV was released in 1977,
Episode V in 1980, and Episode VI in 1983, followed by Episodes I, II, and III in 1999, 2002, and
2005, respectively, and Episodes VII, VIII, and IX in 2015, 2017, and 2019 [^65]. If you
watched the movies in the order they came out, the order in which you processed the movies is
inconsistent with the order of their narrative. (The episode number is like the event timestamp, and
the date when you watched the movie is the processing time.) As humans, we are able to cope with
such discontinuities, but stream processing algorithms need to be specifically written to
accommodate such timing and ordering issues.
Confusing event time and processing time leads to bad data. For example, say you have a stream
processor that measures the rate of requests (counting the number of requests per second). If you
redeploy the stream processor, it may be shut down for a minute and process the backlog of events
when it comes back up. If you measure the rate based on the processing time, it will look as if
there was a sudden anomalous spike of requests while processing the backlog, when in fact the real
rate of requests was steady ([Figure 12-8](/en/ch12#fig_stream_processing_time)).
{{< figure src="/fig/ddia_1208.png" id="fig_stream_processing_time" caption="Figure 12-8. Windowing by processing time introduces artifacts due to variations in processing rate." class="w-full my-4" >}}
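
The artifact can be reproduced with a few lines of code. The sketch assumes exactly one request per second and a hypothetical consumer that is down for one minute and then drains its backlog instantly on restart.

```python
from collections import Counter

# One request per second at event times 0..179: the real rate is steady.
event_times = list(range(180))

def processing_time(event_time):
    """Hypothetical consumer that is down from t=60 until t=120 and
    processes the whole backlog at the moment it restarts."""
    return 120 if 60 <= event_time < 120 else event_time

def per_minute(timestamps):
    """Count timestamps per one-minute window."""
    return dict(Counter(t // 60 for t in timestamps))

by_event_time = per_minute(event_times)
by_processing_time = per_minute(processing_time(t) for t in event_times)
```

Windowing by event time gives 60 requests in each of the three minutes. Windowing by processing time shows nothing at all in the second minute and 120 requests in the third, a spike that never happened.
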
#### Handling straggler events {#id323}
A tricky problem when defining windows in terms of event time is that you can never be sure when you
have received all of the events for a particular window, or whether there are some events still to
come.
For example, say you're grouping events into one-minute windows so that you can count the number of
requests per minute. You have counted some number of events with timestamps that fall in the 37th
minute of the hour, and time has moved on; now most of the incoming events fall within the 38th and
39th minutes of the hour. When do you declare that you have finished the window for the 37th minute,
and output its counter value?
You can time out and declare a window ready after you have not seen any new events for that window
in a while. However, it could still happen that some events were buffered on another machine
somewhere, delayed due to a network interruption. You need to be able to handle such *straggler*
events that arrive after the window has already been declared complete. Broadly, you have two
options [^1]:
1. Ignore the straggler events, as they are probably a small percentage of events in normal
circumstances. You can track the number of dropped events as a metric, and alert if you start
dropping a significant amount of data.
2. Publish a *correction*, an updated value for the window with stragglers included. You may also
need to retract the previous output.
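The two options can be sketched as a single counter with a switch. This is an illustrative sketch, not how any particular framework implements it: the class `MinuteCounter`, its `close` method, and the output tuples are hypothetical, and timestamps are plain seconds since the start of the hour.

```python
from collections import defaultdict

class MinuteCounter:
    """Counts events per one-minute event-time window (hypothetical helper)."""

    def __init__(self, publish_corrections):
        self.publish_corrections = publish_corrections
        self.counts = defaultdict(int)  # window start (seconds) -> count
        self.closed = set()             # windows whose result was already emitted
        self.dropped = 0                # metric: number of ignored stragglers
        self.output = []                # (kind, window_start, count)

    def on_event(self, event_time_s):
        window = event_time_s - event_time_s % 60
        if window not in self.closed:
            self.counts[window] += 1
        elif self.publish_corrections:
            # Option 2: emit an updated value that supersedes the earlier result.
            self.counts[window] += 1
            self.output.append(("correction", window, self.counts[window]))
        else:
            # Option 1: ignore the straggler, but count it so you can alert on it.
            self.dropped += 1

    def close(self, window):
        """Declare a window complete, e.g. after a timeout with no new events."""
        self.closed.add(window)
        self.output.append(("result", window, self.counts[window]))

ignoring = MinuteCounter(publish_corrections=False)
correcting = MinuteCounter(publish_corrections=True)
for counter in (ignoring, correcting):
    for t in (2220, 2235, 2279):  # three events in minute 37 of the hour
        counter.on_event(t)
    counter.close(2220)
    counter.on_event(2250)        # a straggler for the already-closed window
```

With the first option the straggler only shows up in the `dropped` metric; with the second, downstream consumers receive a second value for the same window and must be prepared to replace the first.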
In some cases it is possible to use a special message to indicate, "From now on there will be no
more messages with a timestamp earlier than *t*," which can be used by consumers to trigger windows
[^66]. However, if several producers on different machines are generating events, each
with their own minimum timestamp thresholds, the consumers need to keep track of each producer
individually. Adding and removing producers is trickier in this case.
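One way to picture the per-producer bookkeeping is the following sketch. The class name `WatermarkTracker` and its methods are hypothetical, and it assumes a fixed, known set of producers, which is exactly the assumption that makes adding and removing producers awkward.

```python
class WatermarkTracker:
    """Tracks the latest 'no more events earlier than t' message per producer."""

    def __init__(self, producers):
        # Until a producer has sent such a message, nothing can be assumed about it.
        self.watermarks = {p: float("-inf") for p in producers}

    def on_watermark(self, producer, t):
        # Thresholds only move forward; ignore one that would go backward.
        self.watermarks[producer] = max(self.watermarks[producer], t)

    def can_close(self, window_end):
        # A window is complete only once every producer has passed its end.
        return min(self.watermarks.values()) >= window_end

tracker = WatermarkTracker(["producer-a", "producer-b"])
tracker.on_watermark("producer-a", 120)
before = tracker.can_close(60)   # producer-b has not reported yet
tracker.on_watermark("producer-b", 60)
after = tracker.can_close(60)
```

The slowest producer determines when a window can be triggered, so a single stalled producer holds back every window.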
#### Whose clock are you using, anyway? {#id438}
Assigning timestamps to events is even more difficult when events can be buffered at several points
in the system. For example, consider a mobile app that reports events for usage metrics to a server.
The app may be used while the device is offline, in which case it will buffer events locally on the
device and send them to a server when an internet connection is next available (which may be hours
or even days later). To any consumers of this stream, the events will appear as extremely delayed
stragglers.
In this context, the timestamp on the events should really be the time at which the user interaction
occurred, according to the mobile device's local clock. However, the clock on a user-controlled
device often cannot be trusted, as it may be accidentally or deliberately set to the wrong time (see
["Clock Synchronization and Accuracy"](/en/ch9#sec_distributed_clock_accuracy)). The time at which
the event was received by the server (according to the server's clock) is more likely to be
accurate, since the server is under your control, but less meaningful in terms of describing the
user interaction.
To adjust for incorrect device clocks, one approach is to log three timestamps [^67]:
- The time at which the event occurred, according to the device clock
- The time at which the event was sent to the server, according to the device clock
- The time at which the event was received by the server, according to the server clock
By subtracting the second timestamp from the third, you can estimate the offset between the device
clock and the server clock (assuming the network delay is negligible compared to the required
timestamp accuracy). You can then apply that offset to the event timestamp, and thus estimate the
true time at which the event actually occurred (assuming the device clock offset did not change
between the time the event occurred and the time it was sent to the server).
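The arithmetic is small enough to show directly. In this sketch the function name and the example values are made up; timestamps are seconds on an arbitrary scale, and the network delay is assumed to be negligible, as stated above.

```python
def estimate_event_time(device_event_ts, device_sent_ts, server_received_ts):
    """Estimate when an event occurred in terms of the server's clock."""
    # Offset between the two clocks, ignoring network delay.
    offset = server_received_ts - device_sent_ts
    # Assumes the device clock offset did not change between event and send.
    return device_event_ts + offset

# Hypothetical device whose clock runs one hour (3600 s) fast: the event
# really happened at server time 1000 and was sent at server time 8200.
estimated = estimate_event_time(
    device_event_ts=4600,     # 1000 + 3600, according to the device
    device_sent_ts=11800,     # 8200 + 3600, according to the device
    server_received_ts=8200,  # according to the server
)
print(estimated)
```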
This problem is not unique to stream processing---batch processing suffers from exactly the same
issues of reasoning about time. It is just more noticeable in a streaming context, where we are more
aware of the passage of time.
#### Types of windows {#id324}
Once you know how the timestamp of an event should be determined, the next step is to decide how
windows over time periods should be defined. The window can then be used for aggregations, for
example to count events, or to calculate the average of values within the window. Several types of
windows are in common use [^64], [^68]:
Tumbling window
: A tumbling window has a fixed length, and every event belongs to exactly one window. For
example, if you have a 1-minute tumbling window, all the events with timestamps between 10:03:00
and 10:03:59 are grouped into one window, events between 10:04:00 and 10:04:59 into the next
window, and so on. You could implement a 1-minute tumbling window by taking each event timestamp
and rounding it down to the nearest minute to determine the window that it belongs to.
Hopping window
: A hopping window also has a fixed length, but allows windows to overlap in order to provide some
smoothing. For example, a 5-minute window with a hop size of 1 minute would contain the events
between 10:03:00 and 10:07:59, then the next window would cover events between 10:04:00 and
10:08:59, and so on. You can implement this hopping window by first calculating 1-minute
tumbling windows, and then aggregating over several adjacent windows.
Sliding window
: A sliding window contains all the events that occur within some interval of each other. For
example, a 5-minute sliding window would cover events at 10:03:39 and 10:08:12, because they are
less than 5 minutes apart (note that tumbling and hopping 5-minute windows would not have put
these two events in the same window, as they use fixed boundaries). A sliding window can be
implemented by keeping a buffer of events sorted by time and removing old events when they
expire from the window.
Session window
: Unlike the other window types, a session window has no fixed duration. Instead, it is defined by
grouping together all events for the same user that occur closely together in time, and the
window ends when the user has been inactive for some time (for example, if there have been no
events for 30 minutes). Sessionization is a common requirement for website analytics.
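The implementation hints in these definitions can be sketched as follows. All function and class names are illustrative; timestamps are integer seconds, the hopping window length is assumed to be a multiple of the hop size, and the sliding and session helpers assume events are seen in timestamp order.

```python
from collections import deque

def tumbling_window(ts, size):
    """Start of the single tumbling window containing ts (round down)."""
    return ts - ts % size

def hopping_windows(ts, size, hop):
    """Starts of all hopping windows [start, start + size) that contain ts."""
    last = ts - ts % hop
    first = last - size + hop
    return list(range(first, last + hop, hop))

class SlidingWindow:
    """Buffer of the events that are less than `size` seconds older than the newest."""

    def __init__(self, size):
        self.size = size
        self.buffer = deque()

    def add(self, ts):
        self.buffer.append(ts)
        # Remove events that have expired from the window.
        while self.buffer[0] <= ts - self.size:
            self.buffer.popleft()
        return list(self.buffer)

def session_windows(timestamps, gap):
    """Group one user's events into sessions separated by at least `gap` of inactivity."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] < gap:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions

# Seconds since 10:00:00, so 219 is 10:03:39 and 492 is 10:08:12.
sliding = SlidingWindow(size=300)
sliding.add(219)
both = sliding.add(492)       # 273 seconds apart: both are in the window
later = sliding.add(520)      # 219 is now 301 seconds old and expires
```

Here `tumbling_window(219, 60)` gives 180 (the window starting at 10:03:00), and `hopping_windows(330, 300, 60)` gives the five overlapping 5-minute windows that contain 10:05:30.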
Window operations usually maintain temporary state. In some cases, the state is of a fixed size, no
matter how large the window or how many events occur: for example, a counting operation will only
have one counter regardless of the window size or event count. On the other hand, sliding windows or
stream joins, which we discuss in the next section, require that events be buffered until the window
finishes. Therefore, large window sizes or high-throughput streams can cause stream processors to
keep a lot of temporary state. You must then take care to ensure the machines running stream
processing tasks have enough capacity to maintain this state, whether in-memory or on-disk.
### Stream Joins {#sec_stream_joins}
In ["JOIN and GROUP BY"](/en/ch11#sec_batch_join) we discussed how batch jobs can join datasets by
key, and how such joins form an important part of data pipelines. Since stream processing
generalizes data pipelines to incremental processing of unbounded datasets, there is exactly the
same need for joins on streams.
However, the fact that new events can appear anytime on a stream makes joins on streams more
challenging than in batch jobs. To understand the situation better, let's distinguish three
different types of joins: *stream-stream* joins, *stream-table* joins, and *table-table* joins. In
the following sections we'll illustrate each by example.
#### Stream-stream join (window join) {#id440}
Say you have a search feature on your website, and you want to detect recent trends in searched-for
URLs. Every time someone types a search query, you log an event containing the query and the results
returned. Every time someone clicks one of the search results, you log another event recording the
click. In order to calculate the click-through rate for each URL in the search results, you need to
bring together the events for the search action and the click action, which are connected by having
the same session ID. Similar analyses are needed in advertising systems [^69].
The click may never come if the user abandons their search, and even if it comes, the time between
the search and the click may be highly variable: in many cases it might be a few seconds, but it
could be as long as days or weeks (if a user runs a search, forgets about that browser tab, and then
returns to the tab and clicks a result sometime later). Due to variable network delays, the click
event may even arrive before the search event. You can choose a suitable window for the join---for
example, you may choose to join a click with a search if they occur at most one hour apart.
Note that embedding the details of the search in the click event is not equivalent to joining the
events: doing so would only tell you about the cases where the user clicked a search result, not
about the searches where the user did not click any of the results. In order to measure search
quality, you need accurate click-through rates, for which you need both the search events and the
click events.
To implement this type of join, a stream processor needs to maintain *state*: for example, all the
events that occurred in the last hour, indexed by session ID. Whenever a search event or click event
occurs, it is added to the appropriate index, and the stream processor also checks the other index
to see if another event for the same session ID has already arrived. If there is a matching event,
you emit an event saying which search result was clicked. If the search event expires without you
seeing a matching click event, you emit an event saying which search results were not clicked.
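The state-keeping described above might look roughly like this. It is a deliberately simplified sketch: the class `SearchClickJoin` and its output tuples are hypothetical, it keeps at most one search and one click per session ID, and it expires state whenever a new event arrives rather than on a timer.

```python
class SearchClickJoin:
    """Joins search and click events by session ID within a time window."""

    def __init__(self, window_s=3600):
        self.window_s = window_s
        self.searches = {}  # session_id -> (timestamp, query)
        self.clicks = {}    # session_id -> (timestamp, url)
        self.output = []

    def on_search(self, ts, session_id, query):
        self._expire(ts)
        click = self.clicks.pop(session_id, None)
        if click is not None:
            # The click arrived before the search, e.g. due to network delays.
            self.output.append(("clicked", session_id, query, click[1]))
        else:
            self.searches[session_id] = (ts, query)

    def on_click(self, ts, session_id, url):
        self._expire(ts)
        search = self.searches.pop(session_id, None)
        if search is not None:
            self.output.append(("clicked", session_id, search[1], url))
        else:
            self.clicks[session_id] = (ts, url)

    def _expire(self, now):
        for sid, (ts, query) in list(self.searches.items()):
            if now - ts > self.window_s:
                del self.searches[sid]
                self.output.append(("not_clicked", sid, query, None))
        for sid, (ts, _) in list(self.clicks.items()):
            if now - ts > self.window_s:
                del self.clicks[sid]

join = SearchClickJoin()
join.on_search(0, "s1", "kafka")
join.on_click(5, "s1", "https://example.com/a")
join.on_click(10, "s2", "https://example.com/b")  # click seen before its search
join.on_search(12, "s2", "flink")
join.on_search(20, "s3", "storm")                 # never clicked
join.on_search(4000, "s4", "samza")               # triggers expiry of s3
```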
#### Stream-table join (stream enrichment) {#sec_stream_table_joins}
In ["JOIN and GROUP BY"](/en/ch11#sec_batch_join) ([Figure 11-2](/en/ch11#fig_batch_join_example))
we saw an example of a batch job joining two datasets: a set of user activity events and a database
of user profiles. It is natural to think of the user activity events as a stream, and to perform the
same join on a continuous basis in a stream processor: the input is a stream of activity events
containing a user ID, and the output is a stream of activity events in which the user ID has been
augmented with profile information about the user. This process is sometimes known as *enriching*
the activity events with information from the database.
To perform this join, the stream process needs to look at one activity event at a time, look up the
event's user ID in the database, and add the profile information to the activity event. The database
lookup could be implemented by querying a remote database; however, as discussed in ["JOIN and GROUP
BY"](/en/ch11#sec_batch_join), such remote queries are likely to be slow and risk overloading the
database [^58].
Another approach is to load a copy of the database into the stream processor so that it can be
queried locally without a network round-trip. This technique is called a *hash join* since the local
copy of the database might be an in-memory hash table if it is small enough, or an index on the
local disk.
The difference from batch jobs is that a batch job uses a point-in-time snapshot of the database as
input, whereas a stream processor is long-running, and the contents of the database are likely to
change over time, so the stream processor's local copy of the database needs to be kept up to date.
This issue can be solved by change data capture: the stream processor can subscribe to a changelog
of the user profile database as well as the stream of activity events. When a profile is created or
modified, the stream processor updates its local copy. Thus, we obtain a join between two streams:
the activity events and the profile updates.
A stream-table join is actually very similar to a stream-stream join; the biggest difference is that
for the table changelog stream, the join uses a window that reaches back to the "beginning of time"
(a conceptually infinite window), with newer versions of records overwriting older ones. For the
stream input, the join might not maintain a window at all.
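As a rough illustration, the local copy can be an in-memory hash table that is updated from the changelog and probed for every activity event. The class and field names here are assumptions, and a `None` profile is used to stand for a deletion in the changelog.

```python
class EnrichmentJoin:
    """Enriches activity events from a local copy of the user profile table."""

    def __init__(self):
        self.profiles = {}  # local hash table: user_id -> latest profile

    def on_profile_change(self, user_id, profile):
        # Changelog input: newer versions of a record overwrite older ones.
        if profile is None:
            self.profiles.pop(user_id, None)
        else:
            self.profiles[user_id] = profile

    def on_activity(self, event):
        # Stream input: a local lookup, with no network round-trip and no window.
        return {**event, "profile": self.profiles.get(event["user_id"])}

join = EnrichmentJoin()
join.on_profile_change(42, {"name": "Alice", "country": "DE"})
first = join.on_activity({"user_id": 42, "url": "/home"})
join.on_profile_change(42, {"name": "Alice", "country": "FR"})
second = join.on_activity({"user_id": 42, "url": "/cart"})
unknown = join.on_activity({"user_id": 7, "url": "/home"})
```

Which profile version an activity event is joined with depends on the order in which the two inputs are processed.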
#### Table-table join (materialized view maintenance) {#id326}
Consider the social network timeline example that we discussed in ["Case Study: Social Network Home
Timelines"](/en/ch2#sec_introduction_twitter). We said that when a user wants to view their home
timeline, it is too expensive to iterate over all the people the user is following, find their
recent posts, and merge them.
Instead, we want a timeline cache: a kind of per-user "inbox" to which posts are written as they are
sent, so that reading the timeline is a single lookup. Materializing and maintaining this cache
requires the following event processing:
- When user *u* sends a new post, it is added to the timeline of every user who is following *u*.
- When a user deletes a post, or deletes their entire account, it is removed from all users'
timelines.
- When user *u*~1~ starts following user *u*~2~, recent posts by *u*~2~ are added to *u*~1~'s
timeline.
- When user *u*~1~ unfollows user *u*~2~, posts by *u*~2~ are removed from *u*~1~'s timeline.
To implement this cache maintenance in a stream processor, you need streams of events for posts
(sending and deleting) and for follow relationships (following and unfollowing). The stream process
needs to maintain a database containing the set of followers for each user so that it knows which
timelines need to be updated when a new post arrives.
Another way of looking at this stream process is that it maintains a materialized view for a query
that joins two tables (posts and follows), something like the following:
``` sql
SELECT follows.follower_id AS timeline_id,
array_agg(posts.* ORDER BY posts.timestamp DESC)
FROM posts
JOIN follows ON follows.followee_id = posts.sender_id
GROUP BY follows.follower_id
```
The join of the streams corresponds directly to the join of the tables in that query. The timelines
are effectively a cache of the result of this query, updated every time the underlying tables
change.
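The four event handlers listed earlier can be sketched in memory as follows. The class `TimelineView` is hypothetical, and for brevity it adds all of a user's posts on follow rather than only recent ones, and it omits account deletion.

```python
from collections import defaultdict

class TimelineView:
    """Maintains per-user timelines from streams of post and follow events."""

    def __init__(self):
        self.followers = defaultdict(set)   # followee -> set of followers
        self.posts = defaultdict(dict)      # sender -> {post_id: text}
        self.timelines = defaultdict(dict)  # follower -> {post_id: text}

    def on_post(self, sender, post_id, text):
        self.posts[sender][post_id] = text
        # A change to posts is joined with the current followers.
        for follower in self.followers[sender]:
            self.timelines[follower][post_id] = text

    def on_delete_post(self, sender, post_id):
        self.posts[sender].pop(post_id, None)
        for follower in self.followers[sender]:
            self.timelines[follower].pop(post_id, None)

    def on_follow(self, follower, followee):
        self.followers[followee].add(follower)
        # A change to follows is joined with the current posts.
        self.timelines[follower].update(self.posts[followee])

    def on_unfollow(self, follower, followee):
        self.followers[followee].discard(follower)
        for post_id in self.posts[followee]:
            self.timelines[follower].pop(post_id, None)

view = TimelineView()
view.on_post("bob", 1, "hello")
view.on_follow("alice", "bob")
view.on_post("bob", 2, "again")
after_posts = dict(view.timelines["alice"])
view.on_delete_post("bob", 1)
after_delete = dict(view.timelines["alice"])
view.on_unfollow("alice", "bob")
after_unfollow = dict(view.timelines["alice"])
```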
> [!NOTE]
> If you regard a stream as the derivative of a table, as in
> [Figure 12-7](/en/ch12#fig_stream_integral), and regard a join as a product of two
> tables *u·v*, something interesting happens: the stream of changes to the materialized join follows
> the product rule (*u·v*)′ = *u*′*v* + *uv*′. In words: any change of posts is joined with the
> current followers, and any change of followers is joined with the current posts [^37].
#### Time-dependence of joins {#sec_stream_join_time}
The three types of joins described here (stream-stream, stream-table, and table-table) have a lot in
common: they all require the stream processor to maintain some state (search and click events, user
profiles, or follower list) based on one join input, and query that state on messages from the other
join input.
The order of the events that maintain the state is important (it matters whether you first follow
and then unfollow, or the other way round). In a sharded event log like Kafka, the ordering of
events within a single shard (partition) is preserved, but there is typically no ordering guarantee
across different streams or shards.
This raises a question: if events on different streams happen around a similar time, in which order
are they processed? In the stream-table join example, if a user updates their profile, which
activity events are joined with the old profile (processed before the profile update), and which are
joined with the new profile (processed after the profile update)? Put another way: if state changes
over time, and you join with some state, what point in time do you use for the join?
Such time dependence can occur in many places. For example, if you sell things, you need to apply
the right tax rate to invoices, which depends on the country or state, the type of product, and the
date of sale (since tax rates change from time to time). When joining sales to a table of tax rates,
you probably want to join with the tax rate at the time of the sale, which may be different from the
current tax rate if you are reprocessing historical data.
If the ordering of events across streams is undetermined, the join becomes nondeterministic
[^70], which means that rerunning the same job on the same input does not necessarily produce the
same result: the events on the input streams may be interleaved in a different way when you run the
job again.
In data warehouses, this issue is known as a *slowly changing dimension* (SCD), and it is often
addressed by using a unique identifier for a particular version of the joined record: for example,
every time the tax rate changes, it is given a new identifier, and the invoice includes the
identifier for the tax rate at the time of sale [^71], [^72]. This change makes the
join deterministic, but has the consequence that log compaction is not possible, since all versions
of the records in the table need to be retained. Alternatively, you can denormalize the data and
include the applicable tax rate directly in every sale event.
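A sketch of the versioned-identifier approach is shown below. The identifiers and rates are made up for illustration; amounts are in cents and rates in basis points to avoid floating-point rounding.

```python
# Hypothetical versioned dimension table: every change of rate gets a new
# identifier, and all old versions are retained.
TAX_RATES = {
    "vat-v1": 1900,  # 19%
    "vat-v2": 1600,  # 16%, introduced later
}

def gross_cents(sale):
    """Join a sale with the tax rate version recorded at the time of sale."""
    rate_bp = TAX_RATES[sale["tax_rate_id"]]
    return sale["net_cents"] + sale["net_cents"] * rate_bp // 10_000

old_sale = {"net_cents": 10_000, "tax_rate_id": "vat-v1"}
new_sale = {"net_cents": 10_000, "tax_rate_id": "vat-v2"}
print(gross_cents(old_sale), gross_cents(new_sale))
```

Because the sale event names the exact version it should be joined with, reprocessing `old_sale` gives the same result no matter what the current rate is or how the inputs are interleaved.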
### Fault Tolerance {#sec_stream_fault_tolerance}
In the final section of this chapter, let's consider how stream processors can tolerate faults. We
saw in [Chapter 11](/en/ch11#ch_batch) that batch processing frameworks can tolerate faults fairly
easily: if a task fails, it can simply be started again on another machine, and the output of the
failed task is discarded. This transparent retry is possible because input files are immutable, each
task writes its output to a separate file, and output is only made visible when a task completes
successfully.
In particular, the batch approach to fault tolerance ensures that the output of the batch job is the
same as if nothing had gone wrong, even if in fact some tasks did fail. It appears as though every
input record was processed exactly once---no records are skipped, and none are processed twice.
Although restarting tasks means that records may in fact be processed multiple times, the visible
effect in the output is as if they had only been processed once. This principle is known as
*exactly-once semantics*, although *effectively-once* would be a more descriptive term
[^73].
The same issue of fault tolerance arises in stream processing, but it is less straightforward to
handle: waiting until a task is finished before making its output visible is not an option, because
a stream is infinite and so you can never finish processing it.
#### Microbatching and checkpointing {#id329}
One solution is to break the stream into small blocks, and treat each block like a miniature batch
process. This approach is called *microbatching*, and it is used in Spark Streaming [^74].
The batch size is typically around one second, which is the result of a performance compromise:
smaller batches incur greater scheduling and coordination overhead, while larger batches mean a
longer delay before results of the stream processor become visible.
Microbatching also implicitly provides a tumbling window equal to the batch size (windowed by
processing time, not event timestamps); any jobs that require larger windows need to explicitly
carry over state from one microbatch to the next.
A variant approach, used in Apache Flink, is to periodically generate rolling checkpoints of state
and write them to durable storage [^75], [^76]. If a stream operator crashes, it can
restart from its most recent checkpoint and discard any output generated between the last checkpoint
and the crash. The checkpoints are triggered by barriers in the message stream, similar to the
boundaries between microbatches, but without forcing a particular window size.
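The restart-from-checkpoint behavior can be simulated in a few lines. This is a toy model rather than Flink's actual mechanism: the operator just sums a list, the function `run` and its parameters are hypothetical, and a tuple stands in for the checkpoint written to durable storage.

```python
def run(log, checkpoint, crash_at=None, every=3):
    """Sum the log starting from a checkpoint of (next offset, running total).

    Returns the latest checkpoint and whether the run finished.
    """
    offset, total = checkpoint
    for i in range(offset, len(log)):
        if crash_at is not None and i == crash_at:
            # Work done since the last checkpoint is lost and must be redone.
            return checkpoint, False
        total += log[i]
        if (i + 1) % every == 0:
            checkpoint = (i + 1, total)  # stands in for a write to durable storage
    return (len(log), total), True

log = [1, 2, 3, 4, 5, 6, 7]
saved, finished = run(log, (0, 0), crash_at=4)  # crashes before processing offset 4
resumed, finished_after_restart = run(log, saved)
```

The crashed run had already processed offset 3, but since that work was not covered by a checkpoint, the restarted run processes it again and still arrives at the same total as a run without failures.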
Within the confines of the stream processing framework, the microbatching and checkpointing
approaches provide the same exactly-once semantics as batch processing. However, as soon as output
leaves the stream processor (for example, by writing to a database, sending messages to an external
message broker, or sending emails), the framework is no longer able to discard the output of a
failed microbatch. In this case, restarting a failed task causes the external side effect to happen
twice, and microbatching or checkpointing alone is not sufficient to prevent this problem.
#### Atomic commit revisited {#sec_stream_atomic_commit}
In order to give the appearance of exactly-once processing in the presence of faults, we need to
ensure that all outputs and side effects of processing an event take effect *if and only if* the
processing is successful. Those effects include any messages sent to downstream operators or
external messaging systems (including email or push notifications), any database writes, any changes
to operator state, and any acknowledgment of input messages (including moving the consumer offset
forward in a log-based message broker).
Either all of those things must happen atomically, or none of them must happen; they must not
go out of sync with each other. If this approach sounds familiar, it is because we discussed it in
["Exactly-once message processing"](/en/ch8#sec_transactions_exactly_once) in the context of
distributed transactions and two-phase commit.
In [Chapter 10](/en/ch10#ch_consistency) we discussed the problems in the traditional
implementations of distributed transactions, such as XA. However, in more restricted environments it
is possible to implement such an atomic commit facility efficiently. This approach is used in Google
Cloud Dataflow [^66], [^75], VoltDB [^77], and Apache Kafka [^78],
[^79]. Unlike XA, these implementations do not attempt to provide transactions across
heterogeneous technologies, but instead keep the transactions internal by managing both state
changes and messaging within the stream processing framework. The overhead of the transaction
protocol can be amortized by processing several input messages within a single transaction.
#### Idempotence {#sec_stream_idempotence}
Our goal is to discard the partial output of any failed tasks so that they can be safely retried
without taking effect twice. Distributed transactions are one way of achieving that goal, but
another way is to rely on *idempotence*, as we saw in ["Durable Execution and
Workflows"](/en/ch5#sec_encoding_dataflow_workflows) [^80].
An idempotent operation is one that you can perform multiple times, and it has the same effect as if
you performed it only once. For example, deleting a key in a key-value store is idempotent (deleting
the value again has no further effect), whereas incrementing a counter is not idempotent (performing
the increment again means the value is incremented twice).
Even if an operation is not naturally idempotent, it can often be made idempotent with a bit of
extra metadata. For example, when consuming messages from Kafka, every message has a persistent,
monotonically increasing offset. When writing a value to an external database, you can store the
offset of the message that triggered the last write alongside the value. Thus, you can tell whether
an update has already been applied, and avoid performing the same update again.
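For instance, a counter increment, which is not naturally idempotent, can be guarded by the offset. In this sketch `CounterStore` is a hypothetical stand-in for the external database, and it assumes the value and the offset are written together atomically.

```python
class CounterStore:
    """Stand-in for an external database row holding a value and an offset."""

    def __init__(self):
        self.value = 0
        self.last_offset = -1  # offset of the message that triggered the last write

    def apply_increment(self, offset, amount):
        if offset <= self.last_offset:
            return False  # already applied; a retry must not take effect twice
        # In a real database, these two writes would need to be atomic.
        self.value += amount
        self.last_offset = offset
        return True

store = CounterStore()
messages = [(0, 5), (1, 3), (2, 1)]  # (offset, increment)
for offset, amount in messages[:2]:
    store.apply_increment(offset, amount)
# The task crashes and restarts, replaying the log from the beginning.
for offset, amount in messages:
    store.apply_increment(offset, amount)
print(store.value)
```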
The state handling in Storm's Trident is based on a similar idea. Relying on idempotence implies
several assumptions: restarting a failed task must replay the same messages in the same order (a
log-based message broker does this), the processing must be deterministic, and no other node may
concurrently update the same value [^81], [^82].
When failing over from one processing node to another, fencing may be required (see ["Distributed
Locks and Leases"](/en/ch9#sec_distributed_lock_fencing)) to prevent interference from a node that
is thought to be dead but is actually alive. Despite all those caveats, idempotent operations can be
an effective way of achieving exactly-once semantics with only a small overhead.
#### Rebuilding state after a failure {#sec_stream_state_fault_tolerance}
Any stream process that requires state---for example, any windowed aggregations (such as counters,
averages, and histograms) and any tables and indexes used for joins---must ensure that this state
can be recovered after a failure.
One option is to keep the state in a remote datastore and replicate it, although having to query a
remote database for each individual message can be slow. An alternative is to keep state local to
the stream processor, and replicate it periodically. Then, when the stream processor is recovering
from a failure, the new task can read the replicated state and resume processing without data loss.
For example, Flink periodically captures snapshots of operator state and writes them to durable
storage such as a distributed filesystem [^75], [^76], and Kafka Streams replicates
state changes by sending them to a dedicated Kafka topic with log compaction, similar to change data
capture [^83]. VoltDB replicates state by redundantly processing each input message on
several nodes (see ["Actual Serial Execution"](/en/ch8#sec_transactions_serial)).
In some cases, it may not even be necessary to replicate the state, because it can be rebuilt from
the input streams. For example, if the state consists of aggregations over a fairly short window, it
may be fast enough to simply replay the input events corresponding to that window. If the state is a
local replica of a database, maintained by change data capture, the database can also be rebuilt
from the log-compacted change stream.
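The following sketch shows why a log-compacted change stream is enough to rebuild such a local replica. The function names are illustrative, and a value of `None` is assumed to be a tombstone marking a deleted key.

```python
def apply_changelog(changelog):
    """Rebuild a key-value state by replaying (key, value) change events."""
    state = {}
    for key, value in changelog:
        if value is None:
            state.pop(key, None)  # tombstone: the key was deleted
        else:
            state[key] = value
    return state

def compact(changelog):
    """Keep only the most recent event for each key, preserving log order."""
    last_index = {key: i for i, (key, _) in enumerate(changelog)}
    return [entry for i, entry in enumerate(changelog) if last_index[entry[0]] == i]

changelog = [
    ("u1", "Alice"),
    ("u2", "Bob"),
    ("u1", "Alicia"),
    ("u2", None),
    ("u3", "Carol"),
]
compacted = compact(changelog)
rebuilt = apply_changelog(compacted)
```

Replaying the compacted log yields the same state as replaying the full history, while reading fewer events.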
However, all of these trade-offs depend on the performance characteristics of the underlying
infrastructure: in some systems, network delay may be lower than disk access latency, and network
bandwidth may be comparable to disk bandwidth. There is no universally ideal trade-off for all
situations, and the merits of local versus remote state may also shift as storage and networking
technologies evolve.
## Summary {#id332}
In this chapter we have discussed event streams, what purposes they serve, and how to process them.
In some ways, stream processing is very much like the batch processing we discussed in
[Chapter 11](/en/ch11#ch_batch), but done continuously on unbounded (never-ending) streams rather
than on a fixed-size input [^84]. From this perspective, message brokers and event logs
serve as the streaming equivalent of a filesystem.
We spent some time comparing two types of message brokers:
AMQP/JMS-style message broker
: The broker assigns individual messages to consumers, and consumers acknowledge individual
messages when they have been successfully processed. Messages are deleted from the broker once
they have been acknowledged. This approach is appropriate as an asynchronous form of RPC (see
also ["Event-Driven Architectures"](/en/ch5#sec_encoding_dataflow_msg)), for example in a task
queue, where the exact order of message processing is not important and where there is no need
to go back and read old messages again after they have been processed.
Log-based message broker
: The broker assigns all messages in a shard to the same consumer node, and always delivers
messages in the same order. Parallelism is achieved through sharding, and consumers track their
progress by checkpointing the offset of the last message they have processed. The broker retains
messages on disk, so it is possible to jump back and reread old messages if necessary.
The log-based approach has similarities to the replication logs found in databases (see
[Chapter 6](/en/ch6#ch_replication)) and log-structured storage engines (see
[Chapter 4](/en/ch4#ch_storage)). It is also a form of consensus, as we saw in
[Chapter 10](/en/ch10#ch_consistency). We saw that this approach is especially appropriate for
stream processing systems that consume input streams and generate derived state or derived output
streams.
In terms of where streams come from, we discussed several possibilities: user activity events,
sensors providing periodic readings, and data feeds (e.g., market data in finance) are naturally
represented as streams. We saw that it can also be useful to think of the writes to a database as a
stream: we can capture the changelog---i.e., the history of all changes made to a database---either
implicitly through change data capture or explicitly through event sourcing. Log compaction allows
the stream to retain a full copy of the contents of a database.
Representing databases as streams opens up powerful opportunities for integrating systems. You can
keep derived data systems such as search indexes, caches, and analytics systems continually up to
date by consuming the log of changes and applying them to the derived system. You can even build
fresh views onto existing data by starting from scratch and consuming the log of changes from the
beginning all the way to the present.
The facilities for maintaining state as streams and replaying messages are also the basis for the
techniques that enable stream joins and fault tolerance in various stream processing frameworks. We
discussed several purposes of stream processing, including searching for event patterns (complex
event processing), computing windowed aggregations (stream analytics), and keeping derived data
systems up to date (materialized views).
We then discussed the difficulties of reasoning about time in a stream processor, including the
distinction between processing time and event timestamps, and the problem of dealing with straggler
events that arrive after you thought your window was complete.
We distinguished three types of joins that may appear in stream processes:
Stream-stream joins
: Both input streams consist of activity events, and the join operator searches for related events
that occur within some window of time. For example, it may match two actions taken by the same
user within 30 minutes of each other. The two join inputs may in fact be the same stream (a
*self-join*) if you want to find related events within that one stream.
Stream-table joins
: One input stream consists of activity events, while the other is a database changelog. The
changelog keeps a local copy of the database up to date. For each activity event, the join
operator queries the database and outputs an enriched activity event.
Table-table joins
: Both input streams are database changelogs. In this case, every change on one side is joined
with the latest state of the other side. The result is a stream of changes to the materialized
view of the join between the two tables.
Finally, we discussed techniques for achieving fault tolerance and exactly-once semantics in a
stream processor. As with batch processing, we need to discard the partial output of any failed
tasks. However, since a stream process is long-running and produces output continuously, we can't
simply discard all output. Instead, a finer-grained recovery mechanism can be used, based on
microbatching, checkpointing, transactions, or idempotent writes.
##### Footnotes
### References {#references}
[^1]: Tyler Akidau, Robert Bradshaw, Craig Chambers, Slava Chernyak, Rafael J. Fernández-Moctezuma, Reuven Lax, Sam McVeety, Daniel Mills, Frances Perry, Eric Schmidt, and Sam Whittle. [The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing](https://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf). *Proceedings of the VLDB Endowment*, volume 8, issue 12, pages 1792--1803, August 2015. [doi:10.14778/2824032.2824076](https://doi.org/10.14778/2824032.2824076)
[^2]: Harold Abelson, Gerald Jay Sussman, and Julie Sussman. [*Structure and Interpretation of Computer Programs*](https://web.mit.edu/6.001/6.037/sicp.pdf), 2nd edition. MIT Press, 1996. ISBN: 978-0-262-51087-5, archived at [archive.org/details/sicp_20211010](https://archive.org/details/sicp_20211010)
[^3]: Patrick Th. Eugster, Pascal A. Felber, Rachid Guerraoui, and Anne-Marie Kermarrec. [The Many Faces of Publish/Subscribe](https://www.cs.ru.nl/~pieter/oss/manyfaces.pdf). *ACM Computing Surveys*, volume 35, issue 2, pages 114--131, June 2003. [doi:10.1145/857076.857078](https://doi.org/10.1145/857076.857078)
[^4]: Don Carney, Uğur Çetintemel, Mitch Cherniack, Christian Convey, Sangdon Lee, Greg Seidman, Michael Stonebraker, Nesime Tatbul, and Stan Zdonik. [Monitoring Streams -- A New Class of Data Management Applications](https://www.vldb.org/conf/2002/S07P02.pdf). At *28th International Conference on Very Large Data Bases* (VLDB), August 2002. [doi:10.1016/B978-155860869-6/50027-5](https://doi.org/10.1016/B978-155860869-6/50027-5)
[^5]: Matthew Sackman. [Pushing Back](https://wellquite.org/posts/lshift/pushing_back/). *wellquite.org*, May 2016. Archived at [perma.cc/3KCZ-RUFY](https://perma.cc/3KCZ-RUFY)
[^6]: Thomas Figg (tef). [how (not) to write a pipeline](https://web.archive.org/web/20250107135013/https://cohost.org/tef/post/1764930-how-not-to-write-a). *cohost.org*, June 2023. Archived at [perma.cc/A3V8-NYCM](https://perma.cc/A3V8-NYCM)
[^7]: Vicent Martí. [Brubeck, a statsd-Compatible Metrics Aggregator](https://github.blog/news-insights/the-library/brubeck/). *github.blog*, June 2015. Archived at [perma.cc/TP3Q-DJYM](https://perma.cc/TP3Q-DJYM)
[^8]: Seth Lowenberger. [MoldUDP64 Protocol Specification V 1.00](https://www.nasdaqtrader.com/content/technicalsupport/specifications/dataproducts/moldudp64.pdf). *nasdaqtrader.com*, July 2009. Archived at
[^9]: Ian Malpass. [Measure Anything, Measure Everything](https://codeascraft.com/2011/02/15/measure-anything-measure-everything/). *codeascraft.com*, February 2011. Archived at [archive.org](https://web.archive.org/web/20250820034209/https://www.etsy.com/codeascraft/measure-anything-measure-everything/)
[^10]: Dieter Plaetinck. [25 Graphite, Grafana and statsd Gotchas](https://grafana.com/blog/2016/03/03/25-graphite-grafana-and-statsd-gotchas/). *grafana.com*, March 2016. Archived at [perma.cc/3NP3-67U7](https://perma.cc/3NP3-67U7)
[^11]: Jeff Lindsay. [Web Hooks to Revolutionize the Web](https://progrium.github.io/blog/2007/05/03/web-hooks-to-revolutionize-the-web/). *progrium.com*, May 2007. Archived at [perma.cc/BF9U-XNX4](https://perma.cc/BF9U-XNX4)
[^12]: Jim N. Gray. [Queues Are Databases](https://arxiv.org/pdf/cs/0701158.pdf). Microsoft Research Technical Report MSR-TR-95-56, December 1995. Archived at [arxiv.org](https://arxiv.org/pdf/cs/0701158)
[^13]: Mark Hapner, Rich Burridge, Rahul Sharma, Joseph Fialli, Kate Stout, and Nigel Deakin. [JSR-343 Java Message Service (JMS) 2.0 Specification](https://jcp.org/en/jsr/detail?id=343). *jms-spec.java.net*, March 2013. Archived at [perma.cc/E4YG-46TA](https://perma.cc/E4YG-46TA)
[^14]: Sanjay Aiyagari, Matthew Arrott, Mark Atwell, Jason Brome, Alan Conway, Robert Godfrey, Robert Greig, Pieter Hintjens, John O'Hara, Matthias Radestock, Alexis Richardson, Martin Ritchie, Shahrokh Sadjadi, Rafael Schloming, Steven Shaw, Martin Sustrik, Carl Trieloff, Kim van der Riet, and Steve Vinoski. [AMQP: Advanced Message Queuing Protocol Specification](https://www.rabbitmq.com/resources/specs/amqp0-9-1.pdf). Version 0-9-1, November 2008. Archived at [perma.cc/6YJJ-GM9X](https://perma.cc/6YJJ-GM9X)
[^15]: [Architectural overview of Pub/Sub](https://cloud.google.com/pubsub/architecture). *cloud.google.com*, 2025. Archived at [perma.cc/VWF5-ABP4](https://perma.cc/VWF5-ABP4)
[^16]: Aris Tzoumas. [Lessons from scaling PostgreSQL queues to 100k events per second](https://www.rudderstack.com/blog/scaling-postgres-queue/). *rudderstack.com*, July 2025. Archived at [perma.cc/QD8C-VA4Y](https://perma.cc/QD8C-VA4Y)
[^17]: Robin Moffatt. [Kafka Connect Deep Dive -- Error Handling and Dead Letter Queues](https://www.confluent.io/blog/kafka-connect-deep-dive-error-handling-dead-letter-queues/). *confluent.io*, March 2019. Archived at [perma.cc/KQ5A-AB28](https://perma.cc/KQ5A-AB28)
[^18]: Dunith Danushka. [Message reprocessing: How to implement the dead letter queue](https://redpanda.com/blog/reliable-message-processing-with-dead-letter-queue). *redpanda.com*. Archived at [perma.cc/R7UB-WEWF](https://perma.cc/R7UB-WEWF)
[^19]: Damien Gasparina, Loic Greffier, and Sebastien Viale. [KIP-1034: Dead letter queue in Kafka Streams](https://cwiki.apache.org/confluence/display/KAFKA/KIP-1034%3A+Dead+letter+queue+in+Kafka+Streams). *cwiki.apache.org*, April 2024. Archived at [perma.cc/3VXV-QXAN](https://perma.cc/3VXV-QXAN)
[^20]: Jay Kreps, Neha Narkhede, and Jun Rao. [Kafka: A Distributed Messaging System for Log Processing](https://www.microsoft.com/en-us/research/wp-content/uploads/2017/09/Kafka.pdf). At *6th International Workshop on Networking Meets Databases* (NetDB), June 2011. Archived at [perma.cc/CSW7-TCQ5](https://perma.cc/CSW7-TCQ5)
[^21]: Jay Kreps. [Benchmarking Apache Kafka: 2 Million Writes Per Second (On Three Cheap Machines)](https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines). *engineering.linkedin.com*, April 2014. Archived at [archive.org](https://web.archive.org/web/20140921000742/https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines)
[^22]: Kartik Paramasivam. [How We're Improving and Advancing Kafka at LinkedIn](https://engineering.linkedin.com/apache-kafka/how-we_re-improving-and-advancing-kafka-linkedin). *engineering.linkedin.com*, September 2015. Archived at [perma.cc/3S3V-JCYJ](https://perma.cc/3S3V-JCYJ)
[^23]: Philippe Dobbelaere and Kyumars Sheykh Esmaili. [Kafka versus RabbitMQ: A comparative study of two industry reference publish/subscribe implementations](https://arxiv.org/abs/1709.00333). At *11th ACM International Conference on Distributed and Event-based Systems* (DEBS), June 2017. [doi:10.1145/3093742.3093908](https://doi.org/10.1145/3093742.3093908)
[^24]: Kate Holterhoff. [Why Message Queues Endure: A History](https://redmonk.com/kholterhoff/2024/12/12/why-message-queues-endure-a-history/). *redmonk.com*, December 2024. Archived at [perma.cc/6DX8-XK4W](https://perma.cc/6DX8-XK4W)
[^25]: Andrew Schofield. [KIP-932: Queues for Kafka](https://cwiki.apache.org/confluence/display/KAFKA/KIP-932%3A+Queues+for+Kafka). *cwiki.apache.org*, May 2023. Archived at [perma.cc/LBE4-BEMK](https://perma.cc/LBE4-BEMK)
[^26]: Jack Vanlightly. [The advantages of queues on logs](https://jack-vanlightly.com/blog/2023/10/2/the-advantages-of-queues-on-logs). *jack-vanlightly.com*, October 2023. Archived at [perma.cc/WJ7V-287K](https://perma.cc/WJ7V-287K)
[^27]: Jay Kreps. [The Log: What Every Software Engineer Should Know About Real-Time Data's Unifying Abstraction](https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying). *engineering.linkedin.com*, December 2013. Archived at [perma.cc/2JHR-FR64](https://perma.cc/2JHR-FR64)
[^28]: Andy Hattemer. [Change Data Capture is having a moment. Why?](https://materialize.com/blog/change-data-capture-is-having-a-moment-why/) *materialize.com*, September 2021. Archived at [perma.cc/AL37-P53C](https://perma.cc/AL37-P53C)
[^29]: Prem Santosh Udaya Shankar. [Streaming MySQL Tables in Real-Time to Kafka](https://engineeringblog.yelp.com/2016/08/streaming-mysql-tables-in-real-time-to-kafka.html). *engineeringblog.yelp.com*, August 2016. Archived at [perma.cc/5ZR3-2GVV](https://perma.cc/5ZR3-2GVV)
[^30]: Andreas Andreakis and Ioannis Papapanagiotou. [DBLog: A Watermark Based Change-Data-Capture Framework](https://arxiv.org/pdf/2010.12597). October 2020. Archived at [arxiv.org](https://arxiv.org/pdf/2010.12597)
[^31]: Jiri Pechanec. [Incremental Snapshots in Debezium](https://debezium.io/blog/2021/10/07/incremental-snapshots/). *debezium.io*, October 2021. Archived at [perma.cc/EQ8E-W6KQ](https://perma.cc/EQ8E-W6KQ)
[^32]: Debezium maintainers. [Debezium Connector for Cassandra](https://debezium.io/documentation/reference/stable/connectors/cassandra.html). *debezium.io*. Archived at [perma.cc/WR6K-EKMD](https://perma.cc/WR6K-EKMD)
[^33]: Neha Narkhede. [Announcing Kafka Connect: Building Large-Scale Low-Latency Data Pipelines](https://www.confluent.io/blog/announcing-kafka-connect-building-large-scale-low-latency-data-pipelines/). *confluent.io*, February 2016. Archived at [perma.cc/8WXJ-L6GF](https://perma.cc/8WXJ-L6GF)
[^34]: Chris Riccomini. [Kafka change data capture breaks database encapsulation](https://cnr.sh/posts/2018-11-05-kafka-change-data-capture-breaks-database-encapsulation/). *cnr.sh*, November 2018. Archived at [perma.cc/P572-9MKF](https://perma.cc/P572-9MKF)
[^35]: Gunnar Morling. ["Change Data Capture Breaks Encapsulation". Does it, though?](https://www.decodable.co/blog/change-data-capture-breaks-encapsulation-does-it-though) *decodable.co*, November 2023. Archived at [perma.cc/YX2P-WNWR](https://perma.cc/YX2P-WNWR)
[^36]: Gunnar Morling. [Revisiting the Outbox Pattern](https://www.decodable.co/blog/revisiting-the-outbox-pattern). *decodable.co*, October 2024. Archived at [perma.cc/M5ZL-RPS9](https://perma.cc/M5ZL-RPS9)
[^37]: Ashish Gupta and Inderpal Singh Mumick. [Maintenance of Materialized Views: Problems, Techniques, and Applications](https://web.archive.org/web/20220407025818id_/http://sites.computer.org/debull/95JUN-CD.pdf#page=5). *IEEE Data Engineering Bulletin*, volume 18, issue 2, pages 3--18, June 1995. Archived at [archive.org](https://web.archive.org/web/20220407025818id_/http://sites.computer.org/debull/95JUN-CD.pdf#page=5)
[^38]: Mihai Budiu, Tej Chajed, Frank McSherry, Leonid Ryzhyk, and Val Tannen. [DBSP: Incremental Computation on Streams and Its Applications to Databases](https://sigmodrecord.org/publications/sigmodRecord/2403/pdfs/20_dbsp-budiu.pdf). *SIGMOD Record*, volume 53, issue 1, pages 87--95, March 2024. [doi:10.1145/3665252.3665271](https://doi.org/10.1145/3665252.3665271)
[^39]: Jim Gray and Andreas Reuter. [*Transaction Processing: Concepts and Techniques*](https://learning.oreilly.com/library/view/transaction-processing/9780080519555/). Morgan Kaufmann, 1992. ISBN: 9781558601901
[^40]: Martin Kleppmann. [Accounting for Computer Scientists](https://martin.kleppmann.com/2011/03/07/accounting-for-computer-scientists.html). *martin.kleppmann.com*, March 2011. Archived at [perma.cc/9EGX-P38N](https://perma.cc/9EGX-P38N)
[^41]: Pat Helland. [Immutability Changes Everything](https://www.cidrdb.org/cidr2015/Papers/CIDR15_Paper16.pdf). At *7th Biennial Conference on Innovative Data Systems Research* (CIDR), January 2015.
[^42]: Martin Kleppmann. [*Making Sense of Stream Processing*](https://martin.kleppmann.com/papers/stream-processing.pdf). Report, O'Reilly Media, May 2016. Archived at [perma.cc/RAY4-JDVX](https://perma.cc/RAY4-JDVX)
[^43]: Kartik Paramasivam. [Stream Processing Hard Problems -- Part 1: Killing Lambda](https://engineering.linkedin.com/blog/2016/06/stream-processing-hard-problems-part-1-killing-lambda). *engineering.linkedin.com*, June 2016. Archived at [archive.org](https://web.archive.org/web/20240621211312/https://www.linkedin.com/blog/engineering/data-streaming-processing/stream-processing-hard-problems-part-1-killing-lambda)
[^44]: Stéphane Derosiaux. [CQRS: What? Why? How?](https://sderosiaux.medium.com/cqrs-what-why-how-945543482313) *sderosiaux.medium.com*, September 2019. Archived at [perma.cc/FZ3U-HVJ4](https://perma.cc/FZ3U-HVJ4)
[^45]: Baron Schwartz. [Immutability, MVCC, and Garbage Collection](https://web.archive.org/web/20220122020806/http://www.xaprb.com/blog/2013/12/28/immutability-mvcc-and-garbage-collection/). *xaprb.com*, December 2013. Archived at [archive.org](https://web.archive.org/web/20220122020806/http://www.xaprb.com/blog/2013/12/28/immutability-mvcc-and-garbage-collection/)
[^46]: Daniel Eloff, Slava Akhmechet, Jay Kreps, et al. [Re: Turning the Database Inside-out with Apache Samza](https://news.ycombinator.com/item?id=9145197). Hacker News discussion, *news.ycombinator.com*, March 2015. Archived at [perma.cc/ML9E-JC83](https://perma.cc/ML9E-JC83)
[^47]: [Datomic Documentation: Excision](https://docs.datomic.com/operation/excision.html). Cognitect, Inc., *docs.datomic.com*. Archived at [perma.cc/J5QQ-SH32](https://perma.cc/J5QQ-SH32)
[^48]: [Fossil Documentation: Deleting Content from Fossil](https://fossil-scm.org/home/doc/trunk/www/shunning.wiki). *fossil-scm.org*, 2025. Archived at [perma.cc/DS23-GTNG](https://perma.cc/DS23-GTNG)
[^49]: Jay Kreps. [The irony of distributed systems is that data loss is really easy but deleting data is surprisingly hard.](https://x.com/jaykreps/status/582580836425330688) *x.com*, March 2015. Archived at [perma.cc/7RRZ-V7B7](https://perma.cc/7RRZ-V7B7)
[^50]: Brent Robinson. [Crypto shredding: How it can solve modern data retention challenges](https://medium.com/@brentrobinson5/crypto-shredding-how-it-can-solve-modern-data-retention-challenges-da874b01745b). *medium.com*, January 2019. Archived at
[^51]: Matthew D. Green and Ian Miers. [Forward Secure Asynchronous Messaging from Puncturable Encryption](https://isi.jhu.edu/~mgreen/forward_sec.pdf). At *IEEE Symposium on Security and Privacy*, May 2015. [doi:10.1109/SP.2015.26](https://doi.org/10.1109/SP.2015.26)
[^52]: David C. Luckham. [What's the Difference Between ESP and CEP?](https://complexevents.com/2020/06/15/whats-the-difference-between-esp-and-cep-2/) *complexevents.com*, June 2019. Archived at [perma.cc/E7PZ-FDEF](https://perma.cc/E7PZ-FDEF)
[^53]: Arvind Arasu, Shivnath Babu, and Jennifer Widom. [The CQL Continuous Query Language: Semantic Foundations and Query Execution](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/cql.pdf). *The VLDB Journal*, volume 15, issue 2, pages 121--142, June 2006. [doi:10.1007/s00778-004-0147-z](https://doi.org/10.1007/s00778-004-0147-z)
[^54]: Julian Hyde. [Data in Flight: How Streaming SQL Technology Can Help Solve the Web 2.0 Data Crunch](https://queue.acm.org/detail.cfm?id=1667562). *ACM Queue*, volume 7, issue 11, December 2009. [doi:10.1145/1661785.1667562](https://doi.org/10.1145/1661785.1667562)
[^55]: Philippe Flajolet, Éric Fusy, Olivier Gandouet, and Frédéric Meunier. [HyperLogLog: The Analysis of a Near-Optimal Cardinality Estimation Algorithm](https://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf). At *Conference on Analysis of Algorithms* (AofA), June 2007. [doi:10.46298/dmtcs.3545](https://doi.org/10.46298/dmtcs.3545)
[^56]: Jay Kreps. [Questioning the Lambda Architecture](https://www.oreilly.com/ideas/questioning-the-lambda-architecture). *oreilly.com*, July 2014. Archived at [perma.cc/2WY5-HC8Y](https://perma.cc/2WY5-HC8Y)
[^57]: Ian Reppel. [An Overview of Apache Streaming Technologies](https://ianreppel.org/an-overview-of-apache-streaming-technologies/). *ianreppel.org*, March 2016. Archived at [perma.cc/BB3E-QJLW](https://perma.cc/BB3E-QJLW)
[^58]: Jay Kreps. [Why Local State is a Fundamental Primitive in Stream Processing](https://www.oreilly.com/ideas/why-local-state-is-a-fundamental-primitive-in-stream-processing). *oreilly.com*, July 2014. Archived at [perma.cc/P8HU-R5LA](https://perma.cc/P8HU-R5LA)
[^59]: RisingWave Labs. [Deep Dive Into the RisingWave Stream Processing Engine - Part 2: Computational Model](https://risingwave.com/blog/deep-dive-into-the-risingwave-stream-processing-engine-part-2-computational-model/). *risingwave.com*, November 2023. Archived at [perma.cc/LM74-XDEL](https://perma.cc/LM74-XDEL)
[^60]: Frank McSherry, Derek G. Murray, Rebecca Isaacs, and Michael Isard. [Differential dataflow](https://www.cidrdb.org/cidr2013/Papers/CIDR13_Paper111.pdf). At *6th Biennial Conference on Innovative Data Systems Research* (CIDR), January 2013.
[^61]: Andy Hattemer. [Incremental Computation in the Database](https://materialize.com/guides/incremental-computation/). *materialize.com*, March 2020. Archived at [perma.cc/AL94-YVRN](https://perma.cc/AL94-YVRN)
[^62]: Shay Banon. [Percolator](https://www.elastic.co/blog/percolator). *elastic.co*, February 2011. Archived at [perma.cc/LS5R-4FQX](https://perma.cc/LS5R-4FQX)
[^63]: Alan Woodward and Martin Kleppmann. [Real-Time Full-Text Search with Luwak and Samza](https://martin.kleppmann.com/2015/04/13/real-time-full-text-search-luwak-samza.html). *martin.kleppmann.com*, April 2015. Archived at [perma.cc/2U92-Q7R4](https://perma.cc/2U92-Q7R4)
[^64]: Tyler Akidau. [The World Beyond Batch: Streaming 102](https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102). *oreilly.com*, January 2016. Archived at [perma.cc/4XF9-8M2K](https://perma.cc/4XF9-8M2K)
[^65]: Stephan Ewen. [Streaming Analytics with Apache Flink](https://www.slideshare.net/slideshow/advanced-streaming-analytics-with-apache-flink-and-apache-kafka-stephan-ewen/61920008). At *Kafka Summit*, April 2016. Archived at [perma.cc/QBQ4-F9MR](https://perma.cc/QBQ4-F9MR)
[^66]: Tyler Akidau, Alex Balikov, Kaya Bekiroğlu, Slava Chernyak, Josh Haberman, Reuven Lax, Sam McVeety, Daniel Mills, Paul Nordstrom, and Sam Whittle. [MillWheel: Fault-Tolerant Stream Processing at Internet Scale](https://www.vldb.org/pvldb/vol6/p1033-akidau.pdf). *Proceedings of the VLDB Endowment*, volume 6, issue 11, pages 1033--1044, August 2013. [doi:10.14778/2536222.2536229](https://doi.org/10.14778/2536222.2536229)
[^67]: Alex Dean. [Improving Snowplow's Understanding of Time](https://snowplow.io/blog/improving-snowplows-understanding-of-time). *snowplow.io*, September 2015. Archived at [perma.cc/6CT9-Z3Q2](https://perma.cc/6CT9-Z3Q2)
[^68]: [Azure Stream Analytics: Windowing functions](https://learn.microsoft.com/en-gb/stream-analytics-query/windowing-azure-stream-analytics). Microsoft Azure Reference, *learn.microsoft.com*, July 2025. Archived at [archive.org](https://web.archive.org/web/20250901140013/https://learn.microsoft.com/en-gb/stream-analytics-query/windowing-azure-stream-analytics)
[^69]: Rajagopal Ananthanarayanan, Venkatesh Basker, Sumit Das, Ashish Gupta, Haifeng Jiang, Tianhao Qiu, Alexey Reznichenko, Deomid Ryabkov, Manpreet Singh, and Shivakumar Venkataraman. [Photon: Fault-Tolerant and Scalable Joining of Continuous Data Streams](https://research.google.com/pubs/archive/41529.pdf). At *ACM International Conference on Management of Data* (SIGMOD), June 2013. [doi:10.1145/2463676.2465272](https://doi.org/10.1145/2463676.2465272)
[^70]: Ben Kirwin. [Doing the Impossible: Exactly-Once Messaging Patterns in Kafka](https://ben.kirw.in/2014/11/28/kafka-patterns/). *ben.kirw.in*, November 2014. Archived at [perma.cc/A5QL-QRX7](https://perma.cc/A5QL-QRX7)
[^71]: Pat Helland. [Data on the Outside Versus Data on the Inside](https://www.cidrdb.org/cidr2005/papers/P12.pdf). At *2nd Biennial Conference on Innovative Data Systems Research* (CIDR), January 2005.
[^72]: Ralph Kimball and Margy Ross. [*The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling*](https://learning.oreilly.com/library/view/the-data-warehouse/9781118530801/), 3rd edition. John Wiley & Sons, 2013. ISBN: 978-1-118-53080-1
[^73]: Viktor Klang. [I'm coining the phrase 'effectively-once' for message processing with at-least-once + idempotent operations](https://x.com/viktorklang/status/789036133434978304). *x.com*, October 2016. Archived at [perma.cc/7DT9-TDG2](https://perma.cc/7DT9-TDG2)
[^74]: Matei Zaharia, Tathagata Das, Haoyuan Li, Scott Shenker, and Ion Stoica. [Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters](https://www.usenix.org/system/files/conference/hotcloud12/hotcloud12-final28.pdf). At *4th USENIX Conference in Hot Topics in Cloud Computing* (HotCloud), June 2012.
[^75]: Kostas Tzoumas, Stephan Ewen, and Robert Metzger. [High-Throughput, Low-Latency, and Exactly-Once Stream Processing with Apache Flink](https://web.archive.org/web/20250429165534/https://www.ververica.com/blog/high-throughput-low-latency-and-exactly-once-stream-processing-with-apache-flink). *ververica.com*, August 2015. Archived at [archive.org](https://web.archive.org/web/20250429165534/https://www.ververica.com/blog/high-throughput-low-latency-and-exactly-once-stream-processing-with-apache-flink)
[^76]: Paris Carbone, Gyula Fóra, Stephan Ewen, Seif Haridi, and Kostas Tzoumas. [Lightweight Asynchronous Snapshots for Distributed Dataflows](https://arxiv.org/abs/1506.08603). arXiv:1506.08603 \[cs.DC\], June 2015.
[^77]: Ryan Betts and John Hugg. [*Fast Data: Smart and at Scale*](https://www.voltactivedata.com/wp-content/uploads/2017/03/hv-ebook-fast-data-smart-and-at-scale.pdf). Report, O'Reilly Media, October 2015. Archived at [perma.cc/VQ6S-XQQY](https://perma.cc/VQ6S-XQQY)
[^78]: Neha Narkhede and Guozhang Wang. [Exactly-Once Semantics Are Possible: Here's How Kafka Does It](https://www.confluent.io/blog/exactly-once-semantics-are-possible-heres-how-apache-kafka-does-it/). *confluent.io*, June 2019. Archived at [perma.cc/Q2AU-Q2ED](https://perma.cc/Q2AU-Q2ED)
[^79]: Jason Gustafson, Flavio Junqueira, Apurva Mehta, Sriram Subramanian, and Guozhang Wang. [KIP-98 -- Exactly Once Delivery and Transactional Messaging](https://cwiki.apache.org/confluence/display/KAFKA/KIP-98+-+Exactly+Once+Delivery+and+Transactional+Messaging). *cwiki.apache.org*, November 2016. Archived at [perma.cc/95PT-RCTG](https://perma.cc/95PT-RCTG)
[^80]: Pat Helland. [Idempotence Is Not a Medical Condition](https://dl.acm.org/doi/pdf/10.1145/2160718.2160734). *Communications of the ACM*, volume 55, issue 5, page 56, May 2012. [doi:10.1145/2160718.2160734](https://doi.org/10.1145/2160718.2160734)
[^81]: Jay Kreps. [Re: Trying to Achieve Deterministic Behavior on Recovery/Rewind](https://lists.apache.org/thread/n0sz6zld72nvjtnytv09pxc57mdcf9ft). Email to *samza-dev* mailing list, September 2014. Archived at [perma.cc/7DPD-GJNL](https://perma.cc/7DPD-GJNL)
[^82]: E. N. (Mootaz) Elnozahy, Lorenzo Alvisi, Yi-Min Wang, and David B. Johnson. [A Survey of Rollback-Recovery Protocols in Message-Passing Systems](https://www.cs.utexas.edu/~lorenzo/papers/SurveyFinal.pdf). *ACM Computing Surveys*, volume 34, issue 3, pages 375--408, September 2002. [doi:10.1145/568522.568525](https://doi.org/10.1145/568522.568525)
[^83]: Adam Warski. [Kafka Streams -- How Does It Fit the Stream Processing Landscape?](https://softwaremill.com/kafka-streams-how-does-it-fit-stream-landscape/) *softwaremill.com*, June 2016. Archived at [perma.cc/WQ5Q-H2J2](https://perma.cc/WQ5Q-H2J2)
[^84]: Stephan Ewen, Fabian Hueske, and Xiaowei Jiang. [Batch as a Special Case of Streaming and Alibaba's contribution of Blink](https://flink.apache.org/2019/02/13/batch-as-a-special-case-of-streaming-and-alibabas-contribution-of-blink/). *flink.apache.org*, February 2019. Archived at [perma.cc/A529-SKA9](https://perma.cc/A529-SKA9)
================================================
FILE: content/en/ch13.md
================================================
---
title: "13. A Philosophy of Streaming Systems"
weight: 313
breadcrumbs: false
---

> *If a thing be ordained to another as to its end, its last end cannot consist in the preservation
> of its being. Hence a captain does not intend as a last end, the preservation of the ship
> entrusted to him, since a ship is ordained to something else as its end, viz. to navigation.*
>
> *(Often quoted as: If the highest aim of a captain was to preserve his ship, he would keep it in
> port forever.)*
>
> St. Thomas Aquinas, *Summa Theologica* (1265--1274)

> [!TIP] A NOTE FOR EARLY RELEASE READERS
> With Early Release ebooks, you get books in their earliest form---the author's raw and unedited
> content as they write---so you can take advantage of these technologies long before the official
> release of these titles.
>
> This will be the 13th chapter of the final book. The GitHub repo for this book is
> *[https://github.com/ept/ddia2-feedback](https://github.com/ept/ddia2-feedback)*.
>
> If you'd like to be actively involved in reviewing and commenting on this draft, please reach out on GitHub.

In [Chapter 2](/en/ch2#ch_nonfunctional) we discussed the goal of creating applications and systems
that are *reliable*, *scalable*, and *maintainable*. These themes have run through all of the
chapters: for example, we discussed many fault-tolerance algorithms that help improve reliability,
sharding to improve scalability, and mechanisms for evolution and abstraction that improve
maintainability.
In this chapter we will bring all of these ideas together, and build on the streaming/event-driven
architecture ideas from [Chapter 12](/en/ch12#ch_stream) in particular to develop a philosophy of
application development that meets those goals. This chapter is more opinionated than previous
chapters, presenting a deep-dive into one particular philosophy rather than comparing multiple
approaches.
## Data Integration {#sec_future_integration}
A recurring theme in this book has been that for any given problem, there are several solutions, all
of which have different pros, cons, and trade-offs. For example, when discussing storage engines in
[Chapter 4](/en/ch4#ch_storage), we saw log-structured storage, B-trees, and column-oriented
storage. When discussing replication in [Chapter 6](/en/ch6#ch_replication), we saw single-leader,
multi-leader, and leaderless approaches.
If you have a problem such as "I want to store some data and look it up again later," there is no
one right solution, but many different approaches that are each appropriate in different
circumstances. A software implementation typically has to pick one particular approach. It's hard
enough to get one code path robust and performing well---trying to do everything in one piece of
software almost guarantees that the implementation will be poor.
Thus, the most appropriate choice of software tool also depends on the circumstances. Every piece of
software, even a so-called "general-purpose" database, is designed for a particular usage pattern.
Faced with this profusion of alternatives, the first challenge is then to figure out the mapping
between the software products and the circumstances in which they are a good fit. Vendors are
understandably reluctant to tell you about the kinds of workloads for which their software is poorly
suited, but hopefully the previous chapters have equipped you with some questions to ask in order to
read between the lines and better understand the trade-offs.
However, even if you perfectly understand the mapping between tools and circumstances for their use,
there is another challenge: in complex applications, data is often used in several different ways.
There is unlikely to be one piece of software that is suitable for *all* the different circumstances
in which the data is used, so you inevitably end up having to cobble together several different
pieces of software in order to provide your application's functionality.
### Combining Specialized Tools by Deriving Data {#id442}
For example, it is common to need to integrate an OLTP database with a full-text search index in
order to handle queries for arbitrary keywords. Although some databases (such as PostgreSQL) include
a full-text indexing feature, which can be sufficient for simple applications [^1], more
sophisticated search facilities require specialist information retrieval tools. Conversely, search
indexes are generally not very suitable as a durable system of record, and so many applications need
to combine two different tools in order to satisfy all of the requirements.
We touched on the issue of integrating data systems in ["Keeping Systems in
Sync"](/en/ch12#sec_stream_sync). As the number of different representations of the data increases,
the integration problem becomes harder. Besides the database and the search index, perhaps you need
to keep copies of the data in analytics systems (data warehouses, or batch and stream processing
systems); maintain caches or denormalized versions of objects that were derived from the original
data; pass the data through machine learning, classification, ranking, or recommendation systems; or
send notifications based on changes to the data.
#### Reasoning about dataflows {#id443}
When copies of the same data need to be maintained in several storage systems in order to satisfy
different access patterns, you need to be very clear about the inputs and outputs: where is data
written first, and which representations are derived from which sources? How do you get data into
all the right places, in the right formats?
For example, you might arrange for data to first be written to a system of record database, then
capture the changes made to that database (see ["Change Data Capture"](/en/ch12#sec_stream_cdc))
and apply those changes to the search index in the same order. If change data capture (CDC) is
the only way of updating the index, you can be confident that the index is entirely derived from the
system of record, and therefore consistent with it (barring bugs in the software). Writing to the
database is the only way of supplying new input into this system.
Allowing the application to directly write to both the search index and the database introduces the
problem shown in [Figure 12-4](/en/ch12#fig_stream_write_order), in which two clients concurrently
send conflicting writes, and the two storage systems process them in a different order. In this
case, neither the database nor the search index is "in charge" of determining the order of writes,
and so they may make contradictory decisions and become permanently inconsistent with each other.
If it is possible for you to funnel all user input through a single system that decides on an
ordering for all writes, it becomes much easier to derive other representations of the data by
processing the writes in the same order. This is an application of the state machine replication
approach that we saw in ["Consensus in Practice"](/en/ch10#sec_consistency_total_order). Whether you
use change data capture or an event sourcing log is less important than simply the principle of
deciding on a total order.
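
As a minimal sketch of why a single agreed order matters, the following Python snippet models the database and the search index as last-write-wins dictionaries. The helper `apply_writes` and the two sample writes are illustrative assumptions, not part of any real database or search index API.

```python
def apply_writes(writes):
    """Apply (key, value) writes in the given order; the last write to a key wins."""
    store = {}
    for key, value in writes:
        store[key] = value
    return store


write_a = ("x", "A")  # hypothetical write from client 1
write_b = ("x", "B")  # hypothetical conflicting write from client 2

# Dual writes: each system happens to receive the two writes in a different order.
database = apply_writes([write_a, write_b])
search_index = apply_writes([write_b, write_a])
dual_write_consistent = database == search_index

# Log-based derivation: one log fixes the order, and every system follows it.
log = [write_a, write_b]
database_from_log = apply_writes(log)
index_from_log = apply_writes(log)
log_consistent = database_from_log == index_from_log
```

With dual writes the two stores end up holding different values for `x` and nothing ever reconciles them. When both are derived from the same log they necessarily agree, whichever write the log happened to place first.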
Updating a derived data system based on an event log can often be made deterministic and idempotent
(see ["Idempotence"](/en/ch12#sec_stream_idempotence)), making it quite easy to recover from faults.
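
One way to make such an update idempotent, sketched here under simplifying assumptions, is for the derived store to remember the log offset of the last event it applied and to skip anything at or below that offset. The class `DerivedIndex` and the sample events are hypothetical, and a real system would need to persist the data and the offset together atomically.

```python
class DerivedIndex:
    """A derived store that records the offset of the last event it applied."""

    def __init__(self):
        self.data = {}
        self.last_offset = -1  # nothing applied yet

    def apply(self, offset, event):
        """Apply the event unless it was already applied; return True if applied."""
        if offset <= self.last_offset:
            return False  # duplicate delivery after a retry: safe to ignore
        key, value = event
        self.data[key] = value
        # Assumption: in a real system data and offset are written atomically.
        self.last_offset = offset
        return True


log = [("title:1", "Hello"), ("title:2", "World"), ("title:1", "Hello again")]

index = DerivedIndex()
for offset, event in enumerate(log[:2]):
    index.apply(offset, event)

# Simulate recovery from a fault: the whole log is redelivered from the start.
applied = [index.apply(offset, event) for offset, event in enumerate(log)]
```

After the replay, only the third event is newly applied. The first two are recognized as duplicates, so reprocessing the log from an earlier position leaves the derived state unchanged.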
#### Derived data versus distributed transactions {#sec_future_derived_vs_transactions}
The classic approach for keeping different data systems consistent with each other involves
distributed transactions, as discussed in ["Two-Phase Commit (2PC)"](/en/ch8#sec_transactions_2pc).
How does the approach of using derived data systems fare in comparison to distributed transactions?
At an abstract level, they achieve a similar goal by different means. Distributed transactions
decide on an ordering of writes by using locks for mutual exclusion, while CDC and event sourcing
use a log for ordering. Distributed transactions use atomic commit to ensure that changes take
effect exactly once, while log-based systems are often based on deterministic retry and idempotence.
The biggest difference is that transaction systems usually guarantee that after a value is written,
you can immediately read the up-to-date value (see ["Reading Your Own
Writes"](/en/ch6#sec_replication_ryw)). On the other hand, derived data systems are often updated
asynchronously, and so they do not by default guarantee that reads are up-to-date.
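
The difference can be illustrated with a small sketch in which a derived view only catches up with the system of record when its consumer runs. The classes `SystemOfRecord` and `AsyncView` are hypothetical names used for illustration only.

```python
class SystemOfRecord:
    """Holds the authoritative rows plus a changelog in commit order."""

    def __init__(self):
        self.rows = {}
        self.changelog = []

    def write(self, key, value):
        self.rows[key] = value
        self.changelog.append((key, value))


class AsyncView:
    """A derived view that applies the changelog only when catch_up() is called."""

    def __init__(self, source):
        self.source = source
        self.rows = {}
        self.position = 0  # index of the next changelog entry to apply

    def catch_up(self):
        for key, value in self.source.changelog[self.position:]:
            self.rows[key] = value
        self.position = len(self.source.changelog)


db = SystemOfRecord()
view = AsyncView(db)

db.write("profile:42", "new bio")
read_from_db = db.rows.get("profile:42")    # the writer sees its own write
stale_read = view.rows.get("profile:42")    # the view has not caught up yet

view.catch_up()
fresh_read = view.rows.get("profile:42")    # visible once the change is applied
```

Between the write and the call to `catch_up()`, a read from the view does not reflect the write, which is exactly the window in which reading your own writes is not guaranteed.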
Within limited environments that are willing to pay the cost of distributed transactions, they have
been used successfully. However, XA, the standard for two-phase commit across heterogeneous systems, has poor fault tolerance and performance characteristics (see
["Distributed Transactions Across Different Systems"](/en/ch8#sec_transactions_xa)), which severely
limit its usefulness. It might be possible to create a better protocol for distributed transactions,
but getting such a protocol widely adopted and integrated with existing tools would be challenging,
and is unlikely to happen soon.
In the absence of widespread support for a good distributed transaction protocol, log-based derived
data is the most promising approach for integrating different data systems. However, guarantees such
as reading your own writes are useful, and it is not productive to tell everyone "eventual
consistency is inevitable---suck it up and learn to deal with it" (at least not without good
guidance on *how* to deal with it).
Later in this chapter we will discuss some approaches for implementing stronger guarantees on top of
asynchronously derived systems, and work toward a middle ground between distributed transactions and
asynchronous log-based systems.
#### The limits of total ordering {#id335}
With systems that are small enough, constructing a totally ordered event log is entirely feasible
(as demonstrated by the popularity of databases with single-leader replication, which construct
precisely such a log). However, as systems are scaled toward bigger and more complex workloads,
limitations begin to emerge:
- In most cases, constructing a totally ordered log requires all events to pass through a *single
leader node* that decides on the ordering. If the throughput of events is greater than a single
machine can handle, you need to shard the log across multiple machines. The order of events in two
different shards is then ambiguous.
- If the servers are spread across multiple *geographically distributed* regions, for example in
order to tolerate an entire datacenter going offline, you typically have a separate leader in each
datacenter, because network delays make synchronous cross-datacenter coordination inefficient.
This implies an undefined ordering of events that originate in two different datacenters.
- When applications are deployed as *microservices*, a common design choice is to deploy each
service and its durable state as an independent unit, with no durable state shared between
services. When two events originate in different services, there is no defined order for those
events.
- Some applications maintain client-side state that is updated immediately on user input (without
waiting for confirmation from a server), and even continue to work offline. With such
applications, clients and servers are very likely to see events in different orders.
In formal terms, deciding on a total order of events is known as *total order broadcast*, which is
equivalent to consensus (see ["The Many Faces of Consensus"](/en/ch10#sec_consistency_faces)). Most
consensus algorithms are designed for situations in which the throughput of a single node is
sufficient to process the entire stream of events, and these algorithms do not provide a mechanism
for multiple nodes to share the work of ordering the events.
#### Ordering events to capture causality {#sec_future_capture_causality}
In cases where there is no causal link between events, the lack of a total order is not a big
problem, since concurrent events can be ordered arbitrarily. Some other cases are easy to handle:
for example, when there are multiple updates of the same object, they can be totally ordered by
routing all updates for a particular object ID to the same log shard. However, causal dependencies
sometimes arise in more subtle ways.
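As a minimal sketch of routing updates by object ID, the following assumes a fixed number of in-memory shards and a hash-based assignment. The names `NUM_SHARDS`, `shard_for`, and `append_update` are hypothetical and not taken from any particular system.

```python
import hashlib

NUM_SHARDS = 4  # hypothetical shard count, chosen only for illustration


def shard_for(object_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map an object ID to a shard with a stable hash (not Python's salted hash())."""
    digest = hashlib.sha256(object_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def append_update(shards, object_id: str, update: str) -> int:
    """Append an update to the shard that owns object_id and return the shard index."""
    idx = shard_for(object_id, len(shards))
    shards[idx].append((object_id, update))
    return idx


shards = [[] for _ in range(NUM_SHARDS)]
for object_id, update in [
    ("user:1", "set name=A"),
    ("user:2", "set name=B"),
    ("user:1", "set name=C"),
]:
    append_update(shards, object_id, update)

# All updates for user:1 land in one shard, in the order they were appended.
user1_updates = [u for shard in shards for (oid, u) in shard if oid == "user:1"]
```

Updates to the same object are totally ordered within their shard, but nothing in this scheme orders an update to `user:1` relative to an update to `user:2` if they fall into different shards.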
For example, consider a social networking service, and two users who were in a relationship but have
just broken up. One of the users removes the other as a friend, and then sends a message to their
remaining friends complaining about their ex-partner. The user's intention is that their ex-partner
should not see the rude message, since the message was sent after the friend status was revoked.
However, in a system that stores friendship status in one place and messages in another place, that
ordering dependency between the *unfriend* event and the *message-send* event may be lost. If the
causal dependency is not captured, a service that sends notifications about new messages may process
the *message-send* event before the *unfriend* event, and thus incorrectly send a notification to
the ex-partner.
In this example, the notifications are effectively a join between the messages and the friend list,
which relates this problem to the timing issues of joins that we discussed previously (see ["Time-dependence
of joins"](/en/ch12#sec_stream_join_time)). Unfortunately, there does not seem to be a simple answer
to this problem [^2], [^3]. Starting points include:
- Logical timestamps can provide total ordering without coordination (see ["ID Generators and
Logical Clocks"](/en/ch10#sec_consistency_logical)), so they may help in cases where total order
broadcast is not feasible. However, they still require recipients to handle events that are
delivered out of order, and they require additional metadata to be passed around.
- If you can log an event to record the state of the system that the user saw before making a
decision, and give that event a unique identifier, then any later events can reference that event
identifier in order to record the causal dependency [^4].
- Conflict resolution algorithms (see ["Automatic conflict
resolution"](/en/ch6#sec_replication_automatic_resolution)) help with processing events that are
delivered in an unexpected order. They are useful for maintaining state, but they do not help if
actions have external side effects (such as sending a notification to a user).
Perhaps patterns for application development will emerge in the future that allow causal
dependencies to be captured efficiently, and derived state to be maintained correctly, without
forcing all events to go through the bottleneck of total order broadcast.
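One way to picture the idea of referencing event identifiers is the following sketch, in which a *message-send* event names the *unfriend* event it causally follows, and a notification service holds back any event whose dependencies it has not yet processed. The `Event` and `NotificationService` classes and the `depends_on` field are hypothetical; this illustrates the principle under simplifying assumptions (one process, in-memory state) and is not a description of how any real system works.

```python
from dataclasses import dataclass


@dataclass
class Event:
    event_id: str
    kind: str                           # "unfriend" or "message-send"
    user: str
    target: str = ""                    # for unfriend: the user being removed
    depends_on: tuple[str, ...] = ()    # IDs of events this one causally follows


class NotificationService:
    def __init__(self, friends: dict[str, set[str]]):
        self.friends = {u: set(f) for u, f in friends.items()}
        self.processed: set[str] = set()
        self.pending: list[Event] = []
        self.notifications: list[tuple[str, str]] = []  # (recipient, message event ID)

    def receive(self, event: Event) -> None:
        self.pending.append(event)
        self._drain()

    def _drain(self) -> None:
        # Repeatedly apply every pending event whose dependencies are satisfied.
        progressed = True
        while progressed:
            progressed = False
            for event in list(self.pending):
                if all(dep in self.processed for dep in event.depends_on):
                    self.pending.remove(event)
                    self._apply(event)
                    progressed = True

    def _apply(self, event: Event) -> None:
        if event.kind == "unfriend":
            self.friends[event.user].discard(event.target)
            self.friends[event.target].discard(event.user)
        elif event.kind == "message-send":
            for friend in sorted(self.friends[event.user]):
                self.notifications.append((friend, event.event_id))
        self.processed.add(event.event_id)


svc = NotificationService(
    {"alice": {"bob", "carol"}, "bob": {"alice"}, "carol": {"alice"}}
)
unfriend = Event("e1", "unfriend", "alice", target="bob")
message = Event("e2", "message-send", "alice", depends_on=("e1",))
svc.receive(message)    # delivered first, but held back until e1 has been applied
svc.receive(unfriend)
```

In this sketch only `carol` is notified, even though the message arrived before the unfriend event. The cost is the extra metadata on every event and the need to buffer events whose dependencies are still missing.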
### Batch and Stream Processing {#sec_future_batch_streaming}
The goal of data integration is to make sure that data ends up in the right form in all the right
places. Doing so requires consuming inputs, transforming, joining, filtering, aggregating, training
models, evaluating, and eventually writing to the appropriate outputs. Batch and stream processors
are the tools for achieving this goal. The outputs of batch and stream processes are derived
datasets such as search indexes, materialized views, recommendations to show to users, aggregate
metrics, and so on.
As we saw in [Chapter 11](/en/ch11#ch_batch) and [Chapter 12](/en/ch12#ch_stream), batch and stream
processing have many principles in common; the fundamental difference is that stream processors
operate on unbounded datasets, whereas batch processors operate on inputs of a known, finite size.
#### Maintaining derived state {#id446}
Batch processing has quite a strong functional flavor (even if the code is not written in a
functional programming language): it encourages deterministic, pure functions whose output depends
only on the input and which have no side effects other than the explicit outputs, treating inputs as
immutable and outputs as append-only. Stream processing is similar, but it extends operators to
allow managed, fault-tolerant state.
The principle of deterministic functions with well-defined inputs and outputs is not only good for
fault tolerance, but also simplifies reasoning about the dataflows in an organization
[^5]. No matter whether the derived data is a search index, a statistical model, or a
cache, it is helpful to think in terms of data pipelines that derive one thing from another, pushing
state changes in one system through functional application code and applying the effects to derived
systems.
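To make the functional flavor concrete, here is a small sketch of a derivation written as a pure function. The event shape (a dictionary with a `page` field) and the name `derive_counts` are assumptions made only for this example.

```python
def derive_counts(events):
    """Pure derivation: the output depends only on the input events."""
    counts = {}
    for event in events:
        counts[event["page"]] = counts.get(event["page"], 0) + 1
    return counts


# A tuple, to signal that the input is treated as immutable.
events = (
    {"page": "/home"},
    {"page": "/about"},
    {"page": "/home"},
)

first_run = derive_counts(events)
retry = derive_counts(events)  # e.g. after a task failure; safe because the input was not mutated
```

Because the function has no side effects and does not modify its input, a failed task can simply be run again, and the retry produces the same derived output as the first attempt would have.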
In principle, derived data systems could be maintained synchronously, just like a relational
database updates secondary indexes synchronously within the same transaction as writes to the table
being indexed. However, asynchrony is what makes systems based on event logs robust: it allows a
fault in one part of the system to be contained locally, whereas distributed transactions abort if
any one participant fails, so they tend to amplify failures by spreading them to the rest of the
system.
We saw in ["Sharding and Secondary Indexes"](/en/ch7#sec_sharding_secondary_indexes) that secondary
indexes often cross shard boundaries. A sharded system with secondary indexes either needs to send
writes to multiple shards (if the index is term-partitioned) or send reads to all shards (if the
index is document-partitioned). Such cross-shard communication is also most reliable and scalable if
the index is maintained asynchronously [^6].
#### Reprocessing data for application evolution {#sec_future_reprocessing}
When maintaining derived data, batch and stream processing are both useful. Stream processing allows
changes in the input to be reflected in derived views with low delay, whereas batch processing
allows large amounts of accumulated historical data to be reprocessed in order to derive new views
onto an existing dataset.
In particular, reprocessing existing data provides a good mechanism for maintaining a system,
evolving it to support new features and changed requirements. Without reprocessing, schema evolution
is limited to simple changes like adding a new optional field to a record, or adding a new type of
record. On the other hand, with reprocessing it is possible to restructure a dataset into a
completely different model in order to better serve new requirements.
> [!TIP] SCHEMA MIGRATIONS ON RAILWAYS
> Large-scale "schema migrations" occur in noncomputer systems as well. For example, in the early days
> of railway building in 19th-century England there were various competing standards for the gauge
> (the distance between the two rails). Trains built for one gauge couldn't run on tracks of another
> gauge, which restricted the possible interconnections in the train network [^7].
>
> After a single standard gauge was finally decided upon in 1846, tracks with other gauges had to be
> converted---but how do you do this without shutting down the train line for months or years? The
> solution is to first convert the track to *dual gauge* or *mixed gauge* by adding a third rail. This
> conversion can be done gradually, and when it is done, trains of both gauges can run on the line,
> using two of the three rails. Eventually, once all trains have been converted to the standard gauge,
> the rail providing the nonstandard gauge can be removed.
>
> "Reprocessing" the existing tracks in this way, and allowing the old and new versions to exist side
> by side, makes it possible to change the gauge gradually over the course of years. Nevertheless, it
> is an expensive undertaking, which is why nonstandard gauges still exist today. For example, the
> BART system in the San Francisco Bay Area uses a different gauge from the majority of the US.
Derived views allow *gradual* evolution. If you want to restructure a dataset, you do not need to
perform the migration as a sudden switch. Instead, you can maintain the old schema and the new
schema side by side as two independently derived views onto the same underlying data. You can then
start shifting a small number of users to the new view in order to test its performance and find any
bugs, while most users continue to be routed to the old view. Gradually, you can increase the
proportion of users accessing the new view, and eventually you can drop the old view [^8],
[^9].
The beauty of such a gradual migration is that every stage of the process is easily reversible if
something goes wrong: you always have a working system to go back to. By reducing the risk of
irreversible damage, you can be more confident about going ahead, and thus move faster to improve
your system [^10].
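Shifting a growing proportion of users to the new view could be sketched as follows, assuming users are assigned to stable buckets by hashing their ID. The function names and the percentage-based rollout are illustrative assumptions, not a prescribed mechanism.

```python
import hashlib


def bucket(user_id: str) -> int:
    """Stable bucket in [0, 100), so a user stays on the same side between requests."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def choose_view(user_id: str, new_view_percent: int) -> str:
    """Route a user to the old or the new derived view."""
    return "new" if bucket(user_id) < new_view_percent else "old"


users = [f"user-{i}" for i in range(1000)]
on_new_at_10 = {u for u in users if choose_view(u, 10) == "new"}
on_new_at_50 = {u for u in users if choose_view(u, 50) == "new"}
```

Raising `new_view_percent` only ever moves users from the old view to the new one, and setting it back to 0 returns everyone to the old view, which is what makes each stage reversible.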
#### Unifying batch and stream processing {#id338}
An early proposal for unifying batch and stream processing was the *lambda architecture*
[^11], which had a number of problems [^12] and has fallen out of use. More
recent systems allow batch computations (reprocessing historical data) and stream computations
(processing events as they arrive) to be implemented in the same system [^13], an approach
that is sometimes known as the *kappa architecture* [^12].
Unifying batch and stream processing in one system requires the following features:
- The ability to replay historical events through the same processing engine that handles the stream
of recent events. For example, log-based message brokers have the ability to replay messages, and
some stream processors can read input from a distributed filesystem or object storage.
- Exactly-once semantics for stream processors---that is, ensuring that the output is the same as if
  no faults had occurred, even if faults did in fact occur. As with batch processing, this
  requires discarding the partial output of any failed tasks.
- Tools for windowing by event time, not by processing time, since processing time is meaningless
when reprocessing historical events. For example, Apache Beam provides an API for expressing such
computations, which can then be run using Apache Flink or Google Cloud Dataflow.
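The following sketch illustrates why windowing by event time suits reprocessing. It is plain Python rather than the API of Beam or any other framework, and it ignores late-arriving events and watermarks. Events are assumed to be `(event_time_in_seconds, payload)` pairs.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds=60):
    """Count events per tumbling window, keyed by each event's own timestamp."""
    counts = defaultdict(int)
    for event_time, _payload in events:
        window_start = (event_time // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(counts)


events = [(5, "a"), (61, "b"), (59, "c"), (130, "d")]
live_result = tumbling_window_counts(events)
replayed_result = tumbling_window_counts(sorted(events))  # replay in a different order
```

Since each event is assigned to a window by its embedded timestamp, the result is the same whether the events are processed as they arrive or replayed later in a different order. A window keyed by processing time would instead depend on when the replay happened to run.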
## Unbundling Databases {#sec_future_unbundling}
At the most abstract level, databases, batch/stream processors, and operating systems all perform the
same functions: they store some data, and they allow you to process and query that data
[^14], [^15]. A database stores data in records of some data model (rows in tables,
documents, vertices in a graph, etc.) while an operating system's filesystem stores data in
files---but at their core, both are "information management" systems [^16]. As we saw in
[Chapter 11](/en/ch11#ch_batch), batch processors are like a distributed version of Unix.
Of course, there are many practical differences. For example, many filesystems do not cope very well
with a directory containing 10 million small files, whereas a database containing 10 million small
records is completely normal and unremarkable. Nevertheless, the similarities and differences
between operating systems and databases are worth exploring.
Unix and relational databases have approached the information management problem with very different
philosophies. Unix viewed its purpose as presenting programmers with a logical but fairly low-level
hardware abstraction, whereas relational databases wanted to give application programmers a
high-level abstraction that would hide the complexities of data structures on disk, concurrency,
crash recovery, and so on. Unix developed pipes and files that are just sequences of bytes, whereas
databases developed SQL and transactions.
Which approach is better? Of course, it depends on what you want. Unix is "simpler" in the sense that
it is a fairly thin wrapper around hardware resources; relational databases are "simpler" in the
sense that a short declarative query can draw on a lot of powerful infrastructure (query
optimization, indexes, join methods, concurrency control, replication, etc.) without the author of
the query needing to understand the implementation details.
The tension between these philosophies has lasted for decades (both Unix and the relational model
emerged in the early 1970s) and still isn't resolved. For example, the NoSQL movement could be
interpreted as wanting to apply a Unix-esque approach of low-level abstractions to the domain of
distributed OLTP data storage.
This section attempts to reconcile the two philosophies, in the hope that we can combine the best of
both worlds.
### Composing Data Storage Technologies {#id447}
Over the course of this book we have discussed various features provided by databases and how they
work, including:
- Secondary indexes, which allow you to efficiently search for records based on the value of a
field;
- Materialized views, which are a kind of precomputed cache of query results;
- Replication logs, which keep copies of the data on other nodes up to date; and
- Full-text search indexes, which allow keyword search in text and which are built into some
relational databases [^1].
In Chapters [11](/en/ch11#ch_batch) and [12](/en/ch12#ch_stream), similar themes emerged. We talked
about building full-text search indexes, about materialized view maintenance, and about replicating
changes from a database to derived data systems using change data capture.
It seems that there are parallels between the features that are built into databases and the derived
data systems that people are building with batch and stream processors.
#### Creating an index {#id340}
Think about what happens when you run `CREATE INDEX` to create a new index in a relational database.
The database has to scan over a consistent snapshot of a table, pick out all of the field values
being indexed, sort them, and write out the index. Then it must process the backlog of writes that
have been made since the consistent snapshot was taken (assuming the table was not locked while
creating the index, so writes could continue). Once that is done, the database must continue to keep
the index up to date whenever a transaction writes to the table.
This process is remarkably similar to setting up a new follower replica (see ["Setting Up New
Followers"](/en/ch6#sec_replication_new_replica)), and also very similar to bootstrapping change
data capture in a streaming system (see ["Initial snapshot"](/en/ch12#sec_stream_cdc_snapshot)).
Whenever you run `CREATE INDEX`, the database essentially reprocesses the existing dataset and
derives the index as a new view onto the existing data. The existing data may be a snapshot of the
state rather than a log of all changes that ever happened, but the two are closely related.
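The steps of building an index can be sketched in a few lines, assuming an in-memory table and a sorted list as the index. The functions `build_index` and `apply_write` and the `(row_id, old_row, new_row)` shape of the backlog are hypothetical simplifications, not how any particular database implements `CREATE INDEX`.

```python
from bisect import insort


def build_index(snapshot_rows, field):
    """Scan a consistent snapshot, pick out the indexed field, and sort."""
    return sorted((row[field], row_id) for row_id, row in snapshot_rows.items())


def apply_write(index, field, row_id, old_row, new_row):
    """Keep the index up to date for one write (insert, update, or delete)."""
    if old_row is not None:
        index.remove((old_row[field], row_id))
    if new_row is not None:
        insort(index, (new_row[field], row_id))


table = {1: {"name": "carol"}, 2: {"name": "alice"}}
snapshot = {row_id: dict(row) for row_id, row in table.items()}  # consistent snapshot

# Writes that arrive while the index is being built are recorded as a backlog.
backlog = [
    (3, None, {"name": "bob"}),                # insert
    (1, {"name": "carol"}, {"name": "dave"}),  # update
]

index = build_index(snapshot, "name")
for row_id, old_row, new_row in backlog:
    apply_write(index, "name", row_id, old_row, new_row)
```

After the backlog has been applied, the same `apply_write` function continues to be called for every subsequent write, which corresponds to the ongoing index maintenance described above.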
#### The meta-database of everything {#id341}
In this light, the dataflow across an entire organization starts looking like one huge database
[^5]. Whenever a batch, stream, or ETL process transports data from one place and form to
another place and form, it is acting like the database subsystem that keeps indexes or materialized
views up to date.
Viewed like this, batch and stream processors are like elaborate implementations of triggers, stored
procedures, and materialized view maintenance algorithms. The derived data systems they maintain are
like different index types. For example, a relational database may support B-tree indexes, hash
indexes, spatial indexes, and other types of indexes. In the emerging architecture of derived data
systems, instead of implementing those facilities as features of a single integrated database
product, they are provided by various different pieces of software, running on different machines,
administered by different teams.
Where will these developments take us in the future? If we start from the premise that there is no
single data model or storage format that is suitable for all access patterns, there are two avenues
by which different storage and processing tools can nevertheless be composed into a cohesive system:
Federated databases: unifying reads
: It is possible to provide a unified query interface to a wide variety of underlying storage
engines and processing methods---an approach known as a *federated database* or *polystore*
[^17], [^18]. For example, PostgreSQL's *foreign data wrapper* feature fits this
pattern, as do federated query engines such as Trino, Hoptimator, and Xorq. Applications that
need a specialized data model or query interface can still access the underlying storage engines
directly, while users who want to combine data from disparate places can do so easily through
the federated interface.
A federated query interface follows the relational tradition of a single integrated system with
a high-level query language and elegant semantics, but a complicated implementation.
Unbundled databases: unifying writes
: While federation addresses read-only querying across several different systems, it does not have
a good answer to synchronizing writes across those systems. We said that within a single
database, creating a consistent index is a built-in feature. When we compose several storage
systems, we similarly need to ensure that all data changes end up in all the right places, even
in the face of faults. Making it easier to reliably plug together storage systems (e.g., through
change data capture and event logs) is like *unbundling* a database's index-maintenance features
in a way that can synchronize writes across disparate technologies [^5], [^19].
The unbundled approach follows the Unix tradition of small tools that do one thing well
[^20], that communicate through a uniform low-level API (pipes), and that can be
composed using a higher-level language (the shell) [^14].
#### Making unbundling work {#sec_future_unbundling_favor}
Federation and unbundling are two sides of the same coin: composing a reliable, scalable, and
maintainable system out of diverse components. Federated read-only querying requires mapping one
data model into another, which takes some thought but is ultimately quite a manageable problem.
Keeping the writes to several storage systems in sync is the harder engineering problem, and so we
will focus on it here.
The traditional approach to synchronizing writes requires distributed transactions across
heterogeneous storage systems [^17], which are problematic, as discussed previously.
Transactions within a single storage or stream processing system are feasible, but when data crosses
the boundary between different technologies, an asynchronous event log with idempotent writes is a
much more robust and practicable approach.
For example, distributed transactions are used within some stream processors to achieve exactly-once
semantics, and this can work quite well. However, when a transaction would need to involve systems
written by different groups of people (e.g., when data is written from a stream processor to a
distributed key-value store or search index), the lack of a standardized transaction protocol makes
integration much harder. An ordered log of events with idempotent consumers is a much simpler
abstraction, and thus much more feasible to implement across heterogeneous systems [^5].
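A minimal sketch of an idempotent consumer is shown below, assuming every log entry carries a monotonically increasing offset. The class name and fields are hypothetical. In a real system the store and the last applied offset would need to be persisted together atomically, which this in-memory example does not attempt.

```python
class IdempotentConsumer:
    """Applies each log entry to a key-value store at most once."""

    def __init__(self):
        self.store = {}
        self.last_applied = -1  # highest log offset already applied

    def handle(self, offset, key, value):
        if offset <= self.last_applied:
            return False  # duplicate delivery, e.g. after a retry: ignore it
        self.store[key] = value
        self.last_applied = offset
        return True


log = [(0, "x", 1), (1, "y", 2), (2, "x", 3)]
consumer = IdempotentConsumer()

# Deliver the whole log, then redeliver the first two entries as a retry might.
results = [consumer.handle(*entry) for entry in log + log[:2]]
```

Without the offset check, the redelivered entry at offset 0 would overwrite `x` with a stale value. With it, redelivery is harmless, so the producer and the log can retry freely without coordinating with the consumer.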
The big advantage of log-based integration is *loose coupling* between the various components, which
manifests itself in two ways:
1. At a system level, asynchronous event streams make the system as a whole more robust to outages
or performance degradation of individual components. If a consumer runs slow or fails, the event
log can buffer messages, allowing the producer and any other consumers to continue running
unaffected. The faulty consumer can catch up when it is fixed, so it doesn't miss any data, and
the fault is contained. By contrast, the synchronous interaction of distributed transactions
tends to escalate local faults into large-scale failures.
2. At a human level, unbundling data systems allows different software components and services to
be developed, improved, and maintained independently from each other by different teams.
Specialization allows each team to focus on doing one thing well, with well-defined interfaces
to other teams' systems. Event logs provide an interface that is powerful enough to capture
fairly strong consistency properties (due to durability and ordering of events), but also
general enough to be applicable to almost any kind of data.
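The first point, about the log buffering messages for a slow consumer, can be illustrated with a toy log in which each consumer tracks its own read position. The `EventLog` class and its methods are assumptions for this sketch and do not correspond to the API of Kafka or any other broker.

```python
class EventLog:
    """Append-only log; each consumer reads at its own pace."""

    def __init__(self):
        self.entries = []
        self.offsets = {}  # consumer name -> next offset to read

    def append(self, event):
        self.entries.append(event)

    def poll(self, consumer, max_events=10):
        start = self.offsets.get(consumer, 0)
        batch = self.entries[start:start + max_events]
        self.offsets[consumer] = start + len(batch)
        return batch


log = EventLog()
for event in ["e1", "e2", "e3"]:
    log.append(event)

fast_first = log.poll("search-indexer")  # the healthy consumer keeps up
# The "cache-updater" consumer is down during this period and polls nothing.

for event in ["e4", "e5"]:
    log.append(event)                    # the producer is unaffected by the outage

fast_second = log.poll("search-indexer")
slow_catch_up = log.poll("cache-updater")  # recovers later and reads everything it missed
```

Neither the producer nor the healthy consumer had to wait for the faulty one, and the faulty consumer misses no events once it recovers.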
#### Unbundled versus integrated systems {#id448}
If unbundling does indeed become the way of the future, it will not replace databases in their
current form---they will still be needed as much as ever. Databases are still required for
maintaining state in stream processors, and in order to serve queries for the output of batch and
stream processors. Specialized query engines will continue to be important for particular workloads:
for example, query engines in data warehouses are optimized for exploratory analytic queries and
handle this kind of workload very well.
The complexity of running several different pieces of infrastructure can be a problem: each piece of
software has a learning curve, configuration issues, and operational quirks, and so it is worth
deploying as few moving parts as possible. A single integrated software product may also be able to
achieve better and more predictable performance on the kinds of workloads for which it is designed,
compared to a system consisting of several tools that you have composed with application code
[^21]. Building for scale that you don't need is wasted effort and may lock you into an
inflexible design. In effect, it is a form of premature optimization.
The goal of unbundling is not to compete with individual databases on performance for particular
workloads; the goal is to allow you to combine several different databases in order to achieve good
performance for a much wider range of workloads than is possible with a single piece of software.
It's about breadth, not depth.
Thus, if there is a single technology that does everything you need, you're most likely best off
simply using that product rather than trying to reimplement it yourself from lower-level components.
The advantages of unbundling and composition only come into the picture when there is no single
piece of software that satisfies all your requirements.
The tools for composing data systems are getting better: Debezium can extract change streams from
many databases, Kafka's protocol is becoming a de-facto standard for event streams, and incremental
view maintenance engines (see ["Incremental View Maintenance"](/en/ch12#sec_stream_ivm)) make it
possible to precompute and update caches of complex queries.
### Designing Applications Around Dataflow {#sec_future_dataflow}
The general idea of updating derived data when its underlying data changes is nothing new. For
example, spreadsheets have powerful dataflow programming capabilities [^22]: you can put a
formula in one cell (for example, the sum of cells in another column), and whenever any input to the
formula changes, the result of the formula is automatically recalculated. This is exactly what we
want at a data system level: when a record in a database changes, we want any index for that record
to be automatically updated, and any cached views or aggregations that depend on the record to be
automatically refreshed. You should not have to worry about the technical details of how this
refresh happens, but be able to simply trust that it works correctly.
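The spreadsheet behavior can be approximated in a short sketch. The `Sheet` class below is a hypothetical illustration that assumes formulas contain no cycles and recomputes dependents naively; real spreadsheets use more careful dependency tracking.

```python
class Sheet:
    def __init__(self):
        self.values = {}    # cell name -> current value
        self.formulas = {}  # cell name -> (function, input cell names)

    def set_value(self, cell, value):
        self.values[cell] = value
        self._recalculate_dependents(cell)

    def set_formula(self, cell, func, inputs):
        self.formulas[cell] = (func, tuple(inputs))
        self._evaluate(cell)

    def _evaluate(self, cell):
        func, inputs = self.formulas[cell]
        self.values[cell] = func(*(self.values.get(name, 0) for name in inputs))
        self._recalculate_dependents(cell)

    def _recalculate_dependents(self, changed):
        # Re-evaluate every formula that reads the changed cell (assumes no cycles).
        for cell, (_func, inputs) in self.formulas.items():
            if changed in inputs:
                self._evaluate(cell)


sheet = Sheet()
sheet.set_value("A1", 2)
sheet.set_value("A2", 3)
sheet.set_formula("A3", lambda a, b: a + b, ["A1", "A2"])
sheet.set_formula("B1", lambda total: total * 10, ["A3"])
before = dict(sheet.values)

sheet.set_value("A1", 5)  # the change propagates to A3 and then to B1
```

The caller only changes `A1`; the derived cells are refreshed without any further action, which is the property we would like derived data systems to have.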
Thus, most data systems still have something to learn from the features that VisiCalc already had in
1979 [^23]. The difference from spreadsheets is that today's data systems need to be
fault-tolerant, scalable, and store data durably. They also need to be able to integrate disparate
technologies written by different groups of people over time, and reuse existing libraries and
services: it is unrealistic to expect all software to be developed using one particular language,
framework, or tool.
In this section we will expand on these ideas and explore some ways of building applications around
the ideas of unbundled databases and dataflow.
#### Application code as a derivation function {#sec_future_dataflow_derivation}
When one dataset is derived from another, it goes through some kind of transformation function. For
example:
- A secondary index is a kind of derived dataset with a straightforward transformation function: for
each row or document in the base table, it picks out the values in the columns or fields being
  indexed, and sorts by those values (assuming an SSTable or B-tree index, which are sorted by key).
- A full-text search index is created by applying various natural language processing functions such
as language detection, word segmentation, stemming or lemmatization, spelling correction, and
synonym identification, followed by building a data structure for efficient lookups (such as an
inverted index).
- In a machine learning system, we can consider the model as being derived from the training data by
applying various feature extraction and statistical analysis functions. When the model is applied
to new input data, the output of the model is derived from the input and the model (and hence,
indirectly, from the training data).
- A cache often contains an aggregation of data in the form in which it is going to be displayed in
a user interface (UI). Populating the cache thus requires knowledge of what fields are referenced
in the UI; changes in the UI may require updating the definition of how the cache is populated and
rebuilding the cache.
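As an illustration of a derivation function, the following sketch derives a tiny inverted index from a set of documents. The tokenizer is deliberately crude (lowercasing and splitting on non-alphanumeric characters) and omits the stemming, spelling correction, and synonym handling that a real search index would apply. The function names are assumptions for this example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Very crude word segmentation and normalization."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(documents):
    """Derive a mapping of term -> sorted list of document IDs."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


docs = {
    1: "Stream processing and batch processing",
    2: "Batch jobs",
}
index = build_inverted_index(docs)
```

The index contains no information that is not already in `docs`; it is the same data passed through a transformation function and arranged for efficient lookup.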
The derivation function for a secondary index is so commonly required that it is built into many
databases as a core feature, and you can invoke it by merely saying `CREATE INDEX`. For full-text
indexing, basic linguistic features for common languages may be built into a database, but the more
sophisticated features often require domain-specific tuning. In machine learning, feature
engineering is notoriously application-specific, and often has to incorporate detailed knowledge
about the user interaction and deployment of an application [^24].
When the function that creates a derived dataset is not a standard cookie-cutter function like
creating a secondary index, custom code is required to handle the application-specific aspects. And
this custom code is where many databases struggle. Although relational databases commonly support
triggers, stored procedures, and user-defined functions, which can be used to execute application
code within the database, they have been somewhat of an afterthought in database design.
#### Separation of application code and state {#id344}
In theory, databases could be deployment environments for arbitrary application code, like an
operating system. However, in practice they have turned out to be poorly suited for this purpose.
They do not fit well with the requirements of modern application development, such as dependency and
package management, version control, rolling upgrades, evolvability, monitoring, metrics, calls to
network services, and integration with external systems.
On the other hand, deployment and cluster management tools such as Kubernetes, Docker, Mesos, YARN,
and others are designed specifically for the purpose of running application code. By focusing on
doing one thing well, they are able to do it much better than a database that provides execution of
user-defined functions as one of its many features.
Most web applications today are deployed as stateless services, in which any user request can be
routed to any application server, and the server forgets everything about the request once it has
sent the response. This style of deployment is convenient, as servers can be added or removed at
will, but the state has to go somewhere: typically, a database. The trend has been to keep stateless
application logic separate from state management (databases): not putting application logic in the
database and not putting persistent state in the application [^25]. As people in the
functional programming community like to joke, "We believe in the separation of Church and state"
[^26].
> [!NOTE]
> Explaining a joke usually ruins it, but here is an explanation anyway so that nobody feels left out.
> *Church* is a reference to the mathematician Alonzo Church, who created the lambda calculus, an
> early form of computation that is the basis for most functional programming languages. The lambda
> calculus has no mutable state (i.e., no variables that can be overwritten), so one could say that
> mutable state is separate from Church's work.
In this typical web application model, the database acts as a kind of mutable shared variable that
can be accessed synchronously over the network. The application can read and update the variable,
and the database takes care of making it durable, providing some concurrency control and fault
tolerance.
However, in most programming languages you cannot subscribe to changes in a mutable variable---you
can only read it periodically. Unlike in a spreadsheet, readers of the variable don't get notified
if the value of the variable changes. (You can implement such notifications in your own code---this
is known as the *observer pattern*---but most languages do not have this pattern as a built-in
feature.)
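For reference, a hand-written version of the observer pattern might look like the sketch below. The `Observable` class is hypothetical; it wraps a single value and calls each subscriber whenever the value actually changes.

```python
class Observable:
    """A mutable variable that notifies subscribers when its value changes."""

    def __init__(self, value):
        self._value = value
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    @property
    def value(self):
        return self._value

    @value.setter
    def value(self, new_value):
        old_value, self._value = self._value, new_value
        if new_value != old_value:
            for callback in self._subscribers:
                callback(old_value, new_value)


seen = []
price = Observable(10)
price.subscribe(lambda old, new: seen.append((old, new)))

price.value = 12  # subscribers are notified
price.value = 12  # no change, so no notification
```

The notification logic lives entirely in application code here: a plain variable, like a plain database row, would have changed silently.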
Databases have inherited this passive approach to mutable data: if you want to find out whether the
content of the database has changed, often your only option is to poll (i.e., to repeat your query
periodically). Subscribing to changes is only just beginning to emerge as a feature.
#### Dataflow: Interplay between state changes and application code {#id450}
Thinking about applications in terms of dataflow implies renegotiating the relationship between
application code and state management. Instead of treating a database as a passive variable that is
manipulated by the application, we think much more about the interplay and collaboration between
state, state changes, and code that processes them. Application code responds to state changes in
one place by triggering state changes in another place.
We have already seen this idea in change data capture, in the actor model, in triggers, and in
incremental view maintenance. Unbundling the database means taking this idea and applying it to the
creation of derived datasets outside of the primary database: caches, full-text search indexes,
machine learning, or analytics systems. We can use stream processing and messaging systems for this
purpose.
Maintaining derived data requires the following properties, which log-based message brokers can
provide:
- When maintaining derived data, the order of state changes is often important (if several views are
derived from an event log, they need to process the events in the same order so that they remain
consistent with each other).
- Fault tolerance is essential: losing just a single message causes the derived dataset to go
permanently out of sync with its data source. Both message delivery and derived state updates must
be reliable.
Stable message ordering and fault-tolerant message processing are quite stringent demands, but they
are much less expensive and more operationally robust than distributed transactions. Modern stream
processors can provide these ordering and reliability guarantees at scale, and they allow
application code to be run as stream operators.
This application code can do the arbitrary processing that built-in derivation functions in
databases generally don't provide. Like Unix tools chained by pipes, stream operators can be
composed to build large systems around dataflow. Each operator takes streams of state changes as
input, and produces other streams of state changes as output.
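The composition of operators can be sketched with Python generators. This is only a single-process analogy: it assumes a state change can be modeled as a `(key, value)` pair, and it provides none of the ordering or fault-tolerance guarantees of a real stream processor. The operator names are made up for the example.

```python
from typing import Dict, Iterable, Iterator, Tuple

# A state change is modeled here as (key, new_value); this shape is an
# assumption made for the example.
Change = Tuple[str, int]


def only_positive(changes: Iterable[Change]) -> Iterator[Change]:
    """Operator 1: drop changes whose value is not positive."""
    for key, value in changes:
        if value > 0:
            yield key, value


def running_total(changes: Iterable[Change]) -> Iterator[Change]:
    """Operator 2: emit the updated per-key total after each change."""
    totals: Dict[str, int] = {}
    for key, value in changes:
        totals[key] = totals.get(key, 0) + value
        yield key, totals[key]


source = [("a", 5), ("b", -1), ("a", 2), ("b", 4)]
# Compose operators the way a shell pipeline composes commands.
output = list(running_total(only_positive(source)))
```

Each operator consumes one stream and produces another, so operators can be rearranged or extended without changing their neighbors.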
#### Stream processors and services {#id345}
The currently dominant style of application development involves breaking down functionality into a
set of *services* that communicate via synchronous network requests such as REST APIs. The advantage
of such a service-oriented architecture over a single monolithic application is primarily
organizational scalability through loose coupling: different teams can work on different services,
which reduces coordination effort between teams (as long as the services can be deployed and updated
independently).
Composing stream operators into dataflow systems shares many characteristics with the
microservices approach [^27], [^28]. However, the underlying communication mechanism
is very different: one-directional, asynchronous message streams rather than synchronous
request/response interactions.
Besides the advantages listed in ["Event-Driven Architectures"](/en/ch5#sec_encoding_dataflow_msg),
such as better fault tolerance, dataflow systems can also achieve better performance than
traditional REST APIs or RPC. For example, say a customer is purchasing an item that is priced in
one currency but paid for in another currency. In order to perform the currency conversion, you need
to know the current exchange rate. This operation could be implemented in two ways [^27],
[^29]:
1. In the microservices approach, the code that processes the purchase would probably query an
exchange-rate service or database in order to obtain the current rate for a particular currency.
2. In the dataflow approach, the code that processes purchases would subscribe to a stream of
exchange rate updates ahead of time, and record the current rate in a local database whenever it
changes. When it comes to processing the purchase, it only needs to query the local database.
The second approach has replaced a synchronous network request to another service with a query to a
local database (which may be on the same machine, even in the same process). In the microservices
approach, you could avoid the synchronous network request by caching the exchange rate locally in
the service that processes the purchase. However, in order to keep that cache fresh, you would need
to periodically poll for updated exchange rates, or subscribe to a stream of changes---which is
exactly what happens in the dataflow approach.
Not only is the dataflow approach faster, but it is also more robust to the failure of another
service. The fastest and most reliable network request is no network request at all! Instead of RPC,
we now have a stream join between purchase events and exchange rate update events.
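A minimal sketch of that stream join follows. It assumes both kinds of event arrive interleaved on a single ordered stream, and it uses an in-memory dictionary to stand in for the local database. The event classes and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple, Union


@dataclass
class RateUpdate:
    currency: str
    rate_to_usd: float  # hypothetical field: price of one unit in USD


@dataclass
class Purchase:
    purchase_id: str
    currency: str
    amount: float


Event = Union[RateUpdate, Purchase]


def convert_purchases(events: Iterable[Event]) -> List[Tuple[str, float]]:
    """Join purchase events against the latest known exchange rate."""
    local_rates: Dict[str, float] = {}  # stands in for the local database
    results: List[Tuple[str, float]] = []
    for event in events:
        if isinstance(event, RateUpdate):
            # Keep the local copy of the rate up to date.
            local_rates[event.currency] = event.rate_to_usd
        else:
            # A local lookup replaces the synchronous network request.
            # (Raises KeyError if no rate has been received yet.)
            rate = local_rates[event.currency]
            results.append((event.purchase_id, round(event.amount * rate, 2)))
    return results


events = [
    RateUpdate("EUR", 1.10),
    Purchase("p1", "EUR", 100.0),
    RateUpdate("EUR", 1.20),
    Purchase("p2", "EUR", 100.0),
]
converted = convert_purchases(events)
```

The two purchases are identical except for when they occur, yet they convert to different amounts, because each one is joined with whichever rate was current when it was processed.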
The join is time-dependent: if the purchase events are reprocessed at a later point in time, the
exchange rate will have changed. If you want to reconstruct the original output, you will need to
obtain the historical exchange rate at the original time of purchase. No matter whether you query a
service or subscribe to a stream of exchange rate updates, you will need to handle this time
dependence (see ["Time-dependence of joins"](/en/ch12#sec_stream_join_time)).
Subscribing to a stream of changes, rather than querying the current state when needed, brings us
closer to a spreadsheet-like model of computation: when some piece of data changes, any derived data
that depends on it can swiftly be updated. There are still many open questions, for example around
issues like time-dependent joins, but building applications around dataflow ideas is a very
promising direction to explore.
### Observing Derived State {#sec_future_observing}
At an abstract level, the dataflow systems discussed in the last section give you a process for
creating derived datasets (such as search indexes, materialized views, and predictive models) and
keeping them up to date. Let's call that process the *write path*: whenever some piece of
information is written to the system, it may go through multiple stages of batch and stream
processing, and eventually every derived dataset is updated to incorporate the data that was
written. [Figure 13-1](/en/ch13#fig_future_write_read_paths) shows an example of updating a search
index.
{{< figure src="/fig/ddia_1301.png" id="fig_future_write_read_paths" caption="Figure 13-1. In a search index, writes (document updates) meet reads (queries)." class="w-full my-4" >}}
But why do you create the derived dataset in the first place? Most likely because you want to query
it again at a later time. This is the *read path*: when serving a user request you read from the
derived dataset, perhaps perform some more processing on the results, and construct the response to
the user.
Taken together, the write path and the read path encompass the whole journey of the data, from the
point where it is collected to the point where it is consumed (probably by another human). The write
path is the portion of the journey that is precomputed---i.e., that is done eagerly as soon as the
data comes in, regardless of whether anyone has asked to see it. The read path is the portion of the
journey that only happens when someone asks for it. If you are familiar with functional programming
languages, you might notice that the write path is similar to eager evaluation, and the read path is
similar to lazy evaluation.
The derived dataset is the place where the write path and the read path meet, as illustrated in
[Figure 13-1](/en/ch13#fig_future_write_read_paths). It represents a trade-off between the amount of
work that needs to be done at write time and the amount that needs to be done at read time.
#### Materialized views and caching {#id451}
A full-text search index is a good example: the write path updates the index, and the read path
searches the index for keywords. Both reads and writes need to do some work. Writes need to update
the index entries for all terms that appear in the document. Reads need to search for each of the
words in the query, and apply Boolean logic to find documents that contain *all* of the words in the
query (an `AND` operator), or *any* synonym of each of the words (an `OR` operator).
If you didn't have an index, a search query would have to scan over all documents (like `grep`),
which would get very expensive if you had a large number of documents. No index means less work on
the write path (no index to update), but a lot more work on the read path.
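The two extremes described so far can be compared in a small sketch. It assumes whitespace tokenization and an in-memory corpus; the function names are invented for the example, and a real search engine does far more (stemming, ranking, synonyms).

```python
from collections import defaultdict
from typing import Dict, List, Set

documents: Dict[int, str] = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "quick dog tricks",
}


def scan_search(docs: Dict[int, str], words: List[str]) -> Set[int]:
    """No index: all the work happens on the read path (grep-like scan)."""
    return {
        doc_id
        for doc_id, text in docs.items()
        if all(word in text.split() for word in words)
    }


def build_index(docs: Dict[int, str]) -> Dict[str, Set[int]]:
    """Write path: record, for every term, the documents containing it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            index[term].add(doc_id)
    return index


def index_search(index: Dict[str, Set[int]], words: List[str]) -> Set[int]:
    """Read path: intersect the posting sets (the AND operator)."""
    postings = [index.get(word, set()) for word in words]
    return set.intersection(*postings) if postings else set()


index = build_index(documents)
by_scan = scan_search(documents, ["quick", "dog"])
by_index = index_search(index, ["quick", "dog"])
```

Both functions return the same answer; they differ only in whether the per-term work is paid once when a document is written or again on every query.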
On the other hand, you could imagine precomputing the search results for all possible queries. In
that case, you would have less work to do on the read path: no Boolean logic, just find the results
for your query and return them. However, the write path would be a lot more expensive: the set of
possible search queries that could be asked is infinite (or at least exponential in the number of
terms in the corpus), and thus precomputing all possible search results would be infeasible.
Another option would be to precompute the search results for only a fixed set of the most common
queries, so that they can be served quickly without having to go to the index. The uncommon queries
can still be served from the index. This would generally be called a *cache* of common queries,
although we could also call it a materialized view, as it would need to be updated when new
documents appear that should be included in the results of one of the common queries.
From this example we can see that an index is not the only possible boundary between the write path
and the read path. Caching of common search results is possible, and `grep`-like scanning without
the index is also possible on a small number of documents. Viewed like this, the role of caches,
indexes, and materialized views is simple: they shift the boundary between the read path and the
write path. They allow us to do more work on the write path, by precomputing results, in order to
save effort on the read path.
Shifting the boundary between work done on the write path and the read path was in fact the topic of
the social networking example in ["Case Study: Social Network Home
Timelines"](/en/ch2#sec_introduction_twitter). In that example, we also saw how the boundary between
write path and read path might be drawn differently for celebrities compared to ordinary users.
After 500 pages we have come full circle!
#### Stateful, offline-capable clients {#id347}
The idea of a boundary between write and read paths is interesting because we can discuss shifting
that boundary and explore what that shift means in practical terms. Let's look at the idea in a
different context.
In the past, web browsers were stateless clients that could only do useful things when you had an
internet connection (just about the only thing you could do offline was to scroll up and down in a
page that you had previously loaded while online). However, single-page JavaScript web apps now have
a lot of stateful capabilities, including client-side user interface interaction and persistent
local storage in the web browser. Mobile apps can similarly store a lot of state on the device and
don't require a round-trip to the server for most user interactions.
In ["Sync Engines and Local-First Software"](/en/ch6#sec_replication_offline_clients) we saw how
persistent local state enables a class of applications in which users can work offline, without an
internet connection, and sync with remote servers in the background when a network connection is
available [^30]. Since mobile devices sometimes have slow and unreliable cellular internet
connections, it's a big advantage for users if their user interface does not have to wait for
synchronous network requests, and if apps mostly work offline.
When we move away from the assumption of stateless clients talking to a central database and toward
state that is maintained on end-user devices, a world of new opportunities opens up. In particular,
we can think of the on-device state as a *cache of state on the server*. The pixels on the screen
are a materialized view onto model objects in the client app; the model objects are a local replica
of state in a remote datacenter [^31].
#### Pushing state changes to clients {#id348}
In a typical web page, if you load the page in a web browser and the data subsequently changes on
the server, the browser does not find out about the change until you reload the page. The browser
only reads the data at one point in time, assuming that it is static---it does not subscribe to
updates from the server. Thus, the state in the browser is a stale cache that is not updated unless
you explicitly poll for changes. (HTTP-based feed subscription protocols like RSS are really just a
basic form of polling.)
More recent protocols have moved beyond the basic request/response pattern of HTTP: server-sent
events (the EventSource API) and WebSockets provide communication channels by which a web browser
can keep an open TCP connection to a server, and the server can actively push messages to the
browser as long as it remains connected. This provides an opportunity for the server to actively
inform the end-user client about any changes to the state it has stored locally, reducing the
staleness of the client-side state.
In terms of our model of write path and read path, actively pushing state changes all the way to
client devices means extending the write path all the way to the end user. When a client is first
initialized, it would still need to use a read path to get its initial state, but thereafter it
could rely on a stream of state changes sent by the server. The ideas we discussed around stream
processing and messaging are not restricted to running only in a datacenter: we can take the ideas
further, and extend them all the way to end-user devices [^32].
The devices will be offline some of the time, and unable to receive any notifications of state
changes from the server during that time. But we already solved that problem: in ["Consumer
offsets"](/en/ch12#sec_stream_log_offsets) we discussed how a consumer of a log-based message broker
can reconnect after failing or becoming disconnected, and ensure that it doesn't miss any messages
that arrived while it was disconnected. The same technique works for individual users, where each
device is a small subscriber to a small stream of events.
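The offset-based catch-up can be sketched as follows. This is an in-memory stand-in for a log-based broker, with hypothetical class names; it assumes the device durably remembers the next offset it needs, and it ignores log truncation and retention.

```python
from typing import List, Tuple


class EventLog:
    """An append-only log; offsets are simply list positions."""

    def __init__(self) -> None:
        self._events: List[str] = []

    def append(self, event: str) -> int:
        self._events.append(event)
        return len(self._events) - 1  # offset of the new event

    def read_from(self, offset: int) -> List[Tuple[int, str]]:
        # Return every (offset, event) pair at or after the given offset.
        return list(enumerate(self._events[offset:], start=offset))


class DeviceClient:
    """A client that remembers the next offset it needs to read."""

    def __init__(self) -> None:
        self.next_offset = 0
        self.state: List[str] = []

    def sync(self, log: EventLog) -> None:
        for offset, event in log.read_from(self.next_offset):
            self.state.append(event)
            self.next_offset = offset + 1


log = EventLog()
phone = DeviceClient()
log.append("a")
phone.sync(log)  # online: receives "a"
log.append("b")  # the device is offline while these arrive
log.append("c")
phone.sync(log)  # reconnect: resumes from its stored offset
phone.sync(log)  # nothing new, so nothing is applied twice
```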
#### End-to-end event streams {#id349}
Tools for developing stateful clients and user interfaces, such as React and Elm [^33],
already have the ability to update the rendered user interface in response to changes in the
underlying state. It would be very natural to extend this programming model to also allow a server
to push state-change events into this client-side event pipeline.
Thus, state changes could flow through an end-to-end write path: from the interaction on one device
that triggers a state change, via event logs and through several derived data systems and stream
processors, all the way to the user interface of a person observing the state on another device.
These state changes could be propagated with fairly low delay---say, under one second end to end.
Some applications, such as instant messaging and online games, already have such a "real-time"
architecture (in the sense of interactions with low delay, not in the sense of response time
guarantees). But why don't we build all applications this way?
The challenge is that the assumption of stateless clients and request/response interactions is very
deeply ingrained in our databases, libraries, frameworks, and protocols. Many datastores support
read and write operations where a request returns one response, but far fewer provide the ability to
subscribe to changes---i.e., a request that returns a stream of responses over time.
In order to extend the write path all the way to the end user, we would need to fundamentally
rethink the way we build many of these systems: moving away from request/response interaction and
toward publish/subscribe dataflow [^31]. This would require effort, but it would have the
advantage of making user interfaces more responsive and providing better offline support.
#### Reads are events too {#sec_future_read_events}
We discussed that when a stream processor writes derived data to a store (database, cache, or
index), and that store is queried, the store acts as the boundary between the write path and the
read path. The store allows random-access read queries to the data that would otherwise require
scanning the whole event log.
In many cases, the data storage is separate from the streaming system. But recall that stream
processors also need to maintain state to perform aggregations and joins. This state is normally
hidden inside the stream processor, but some frameworks allow it to also be queried by outside
clients [^34], turning the stream processor itself into a kind of simple database.
Let's take that idea further. As discussed so far, the writes to the store go through an event log,
while reads are transient network requests that go directly to the nodes that store the data being
queried. This is a reasonable design, but not the only possible one. It is also possible to
represent read requests as streams of events, and send both the read events and the write events
through a stream processor; the processor responds to read events by emitting the result of the read
to an output stream [^35].
When both the writes and the reads are represented as events, and routed to the same stream operator
in order to be handled, we are in fact performing a stream-table join between the stream of read
queries and the database. The read event needs to be sent to the database shard holding the data,
just like batch and stream processors need to copartition inputs on the same key when joining.
This correspondence between serving requests and performing joins is quite fundamental
[^36]. A one-off read request passes through the join operator, which then immediately
forgets the request; a subscribe request is a persistent join with past and future events on the
other side of the join.
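The following sketch routes writes, one-off reads, and subscriptions through a single operator. The event shapes (`"write"`, `"read"`, `"subscribe"`) and the function name `serve` are assumptions made for the illustration, and the table lives in memory on a single node.

```python
from typing import Dict, Iterable, List, Tuple

# Event shapes are assumptions for this sketch:
#   ("write", key, value)            update the stored value
#   ("read", key, request_id)        one-off query
#   ("subscribe", key, request_id)   persistent query
Event = Tuple[str, str, str]


def serve(events: Iterable[Event]) -> List[Tuple[str, str]]:
    """Handle writes and reads in one operator; emit (request_id, value)."""
    table: Dict[str, str] = {}
    subscribers: Dict[str, List[str]] = {}
    output: List[Tuple[str, str]] = []
    for kind, key, payload in events:
        if kind == "write":
            table[key] = payload
            # A subscription joins with future writes to the same key.
            for request_id in subscribers.get(key, []):
                output.append((request_id, payload))
        elif kind == "read":
            # A one-off read is answered and then forgotten.
            if key in table:
                output.append((payload, table[key]))
        elif kind == "subscribe":
            subscribers.setdefault(key, []).append(payload)
            # It also joins with the past: emit the current value, if any.
            if key in table:
                output.append((payload, table[key]))
    return output


responses = serve([
    ("write", "x", "1"),
    ("read", "x", "r1"),
    ("subscribe", "x", "s1"),
    ("write", "x", "2"),
    ("read", "x", "r2"),
])
```

The subscriber `s1` receives both the value that existed when it subscribed and the later update, whereas each one-off read produces exactly one response.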
Recording a log of read events potentially also has benefits with regard to tracking causal
dependencies and data provenance across a system: it would allow you to reconstruct what the user
saw before they made a particular decision. For example, in an online shop, it is likely that the
predicted shipping date and the inventory status shown to a customer affect whether they choose to
buy an item [^4]. To analyze this connection, you need to record the result of the user's
query of the shipping and inventory status.
Writing read requests to durable storage thus enables better tracking of causal dependencies, but it
incurs additional storage and I/O cost. Optimizing such systems to reduce the overhead is still an
open research problem [^2]. But if you already log read requests for operational purposes,
as a side effect of request processing, it is not such a great change to make the log the source of
the requests instead.
#### Multi-shard data processing {#sec_future_unbundled_multi_shard}
For queries that only touch a single shard, the effort of sending queries through a stream and
collecting a stream of responses is perhaps overkill. However, this idea opens the possibility of
distributed execution of complex queries that need to combine data from several shards, taking
advantage of the infrastructure for message routing, sharding, and joining that is already provided
by stream processors.
Storm's distributed RPC feature supports this usage pattern. For example, it has been used to
compute the number of people who have seen a URL on a social network---i.e., the union of the
follower sets of everyone who has posted that URL [^37]. As the set of users is sharded,
this computation requires combining results from many shards.
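The shape of such a computation can be sketched as a scatter/gather over shards. This is not Storm's API; it is a plain in-process illustration that assumes users are sharded by ID modulo a fixed shard count, with made-up data and function names.

```python
from typing import Dict, List, Set

NUM_SHARDS = 3  # assumption for the example


def shard_of(user_id: int) -> int:
    # Hypothetical sharding function: user ID modulo the shard count.
    return user_id % NUM_SHARDS


# Each shard stores the follower sets of the users assigned to it.
followers = {1: {10, 11}, 2: {11, 12}, 3: {12, 13}, 4: {10}}
shards: List[Dict[int, Set[int]]] = [{} for _ in range(NUM_SHARDS)]
for user_id, follower_ids in followers.items():
    shards[shard_of(user_id)][user_id] = follower_ids


def reach(posters: List[int]) -> int:
    """Count the distinct followers of everyone who posted a URL."""
    # Scatter: group the lookups by the shard that holds each poster.
    by_shard: Dict[int, List[int]] = {}
    for user_id in posters:
        by_shard.setdefault(shard_of(user_id), []).append(user_id)
    # Gather: each shard yields a partial union; merge the partial results.
    audience: Set[int] = set()
    for shard_id, user_ids in by_shard.items():
        partial = set().union(*(shards[shard_id][u] for u in user_ids))
        audience |= partial
    return len(audience)


total_reach = reach([1, 2, 3])
```

Users 11 and 12 each follow two of the posters but are counted once, which is why the partial results must be merged as sets rather than summed.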
Another example of this pattern occurs in fraud prevention: in order to assess the risk of whether a
particular purchase event is fraudulent, you can examine the reputation scores of the user's IP
address, email address, billing address, shipping address, and so on. Each of these reputation
databases is itself sharded, and so collecting the scores for a particular purchase event requires a
sequence of joins with differently sharded datasets [^38].
The internal query execution graphs of data warehouse query engines have similar characteristics. If
you need to perform this kind of multi-shard join, it is probably simpler to use a database that
provides this feature than to implement it using a stream processor. However, treating queries as
streams provides an option for implementing large-scale applications that run against the limits of
conventional off-the-shelf solutions.
## Aiming for Correctness {#sec_future_correctness}
With stateless services that only read data, it is not a big deal if something goes wrong: you can
fix the bug and restart the service, and everything returns to normal. Stateful systems such as
databases are not so simple: they are designed to remember things forever (more or less), so if
something goes wrong, the effects also potentially last forever---which means they require more
careful thought [^39].
We want to build applications that are reliable and *correct* (i.e., programs whose semantics are
well defined and understood, even in the face of various faults). For approximately four decades,
the transaction properties of atomicity, isolation, and durability have been the tools of choice for
building correct applications. However, those foundations are weaker than they seem: witness for
example the confusion of weak isolation levels (see ["Weak Isolation
Levels"](/en/ch8#sec_transactions_isolation_levels)).
In some areas, transactions have been abandoned entirely and replaced with models that offer better
performance and scalability, but much messier semantics. *Consistency* is often talked about, but
poorly defined. Some people assert that we should "embrace weak consistency" for the sake of better
availability, while lacking a clear idea of what that actually means in practice.
For a topic that is so important, our understanding and our engineering methods are surprisingly
flaky. For example, it is very difficult to determine whether it is safe to run a particular
application at a particular transaction isolation level or replication configuration [^40],
[^41]. Often simple solutions appear to work correctly when concurrency is low and there are
no faults, but turn out to have many subtle bugs in more demanding circumstances.
For example, Kyle Kingsbury's Jepsen experiments [^42] have highlighted the stark
discrepancies between some products' claimed safety guarantees and their actual behavior in the
presence of network problems and crashes. Even if infrastructure products like databases were free
from problems, application code would still need to correctly use the features they provide, which
is error-prone if the configuration is hard to understand (which is the case with weak isolation
levels, quorum configurations, and so on).
If your application can tolerate occasionally corrupting or losing data in unpredictable ways, life
is a lot simpler, and you might be able to get away with simply crossing your fingers and hoping for
the best. On the other hand, if you need stronger assurances of correctness, then serializability
and atomic commit are established approaches, but they come at a cost: they typically only work in a
single datacenter (ruling out geographically distributed architectures), and they limit the scale
and fault-tolerance properties you can achieve.
While the traditional transaction approach is not going away, it is not the last word in making
applications correct and resilient to faults. In this section we will explore some ways of thinking
about correctness in the context of dataflow architectures.
### The End-to-End Argument for Databases {#sec_future_end_to_end}
Just because an application uses a data system that provides comparatively strong safety properties,
such as serializable transactions, that does not mean the application is guaranteed to be free from
data loss or corruption. For example, if an application has a bug that causes it to write incorrect
data, or delete data from a database, serializable transactions aren't going to save you. This is an
argument in favor of immutable and append-only data, because it is easier to recover from such
mistakes if you remove the ability of faulty code to destroy good data.
Although immutability is useful, it is not a cure-all by itself. Let's look at a more subtle example
of data corruption that can occur.
#### Exactly-once execution of an operation {#id353}
In ["Fault Tolerance"](/en/ch12#sec_stream_fault_tolerance) we encountered *exactly-once* (or
*effectively-once*) semantics. If something goes wrong while processing a message, you can either
give up (drop the message---i.e., incur data loss) or try again. If you try again, there is the risk
that it actually succeeded the first time, but you just didn't find out about the success, and so
the message ends up being processed twice.
Processing twice is a form of data corruption: it is undesirable to charge a customer twice for the
same service (billing them too much) or increment a counter twice (overstating some metric). In this
context, *exactly-once* means arranging the computation such that the final effect is the same as if
no faults had occurred, even if the operation actually was retried due to some fault. We previously
discussed a few approaches for achieving this goal.
One of the most effective approaches is to make the operation *idempotent*; that is, to ensure that
it has the same effect, no matter whether it is executed once or multiple times. However, taking an
operation that is not naturally idempotent and making it idempotent requires some effort and care:
you may need to maintain some additional metadata (such as the set of operation IDs that have
updated a value), and ensure fencing when failing over from one node to another (see ["Distributed
Locks and Leases"](/en/ch9#sec_distributed_lock_fencing)).
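As a rough sketch of the metadata approach, the counter below remembers which operation IDs it has already applied. The class name is hypothetical, and the sketch assumes a single process; in a real system the set of IDs and the value would have to be persisted and updated atomically, and the set would need some form of pruning.

```python
from typing import Set


class IdempotentCounter:
    """A counter that applies each operation ID at most once."""

    def __init__(self) -> None:
        self.value = 0
        self._applied: Set[str] = set()  # IDs of operations already applied

    def increment(self, operation_id: str, amount: int = 1) -> bool:
        """Apply the increment unless this ID was seen before.

        Returns True if the operation took effect, False if it was a retry.
        """
        if operation_id in self._applied:
            return False
        self._applied.add(operation_id)
        self.value += amount
        return True


counter = IdempotentCounter()
first = counter.increment("op-1")
retry = counter.increment("op-1")  # same ID: has no further effect
other = counter.increment("op-2")
```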
#### Duplicate suppression {#id354}
The same pattern of needing to suppress duplicates occurs in many other places besides stream
processing. For example, TCP uses sequence numbers on packets to put them in the correct order at
the recipient, and to determine whether any packets were lost or duplicated on the network. Any lost
packets are retransmitted and any duplicates are removed by the TCP stack before it hands the data
to an application.
However, this duplicate suppression only works within the context of a single TCP connection.
Imagine the TCP connection is a client's connection to a database, and it is currently executing the
transaction in [Example 13-1](/en/ch13#fig_future_non_idempotent). In many databases, a transaction
is tied to a client connection (if the client sends several queries, the database knows that they
belong to the same transaction because they are sent on the same TCP connection). If the client
suffers a network interruption and connection timeout after sending the `COMMIT`, but before hearing
back from the database server, it does not know whether the transaction has been committed or
aborted (see [Figure 9-1](/en/ch9#fig_distributed_network)).
##### Example 13-1. A nonidempotent transfer of money from one account to another
``` sql
BEGIN TRANSACTION;
UPDATE accounts SET balance = balance + 11.00 WHERE account_id = 1234;
UPDATE accounts SET balance = balance - 11.00 WHERE account_id = 4321;
COMMIT;
```
The client can reconnect to the database and retry the transaction, but now it is outside of the
scope of TCP duplicate suppression. Since the transaction in
[Example 13-1](/en/ch13#fig_future_non_idempotent) is not idempotent, it could happen that \$22 is
transferred instead of the desired \$11. Thus, even though
[Example 13-1](/en/ch13#fig_future_non_idempotent) is a standard example for transaction atomicity,
it is actually not correct, and real banks do not work like this [^3].
Two-phase commit protocols (see ["Two-Phase Commit (2PC)"](/en/ch8#sec_transactions_2pc)) break the
1:1 mapping between a TCP connection and a transaction, since they must allow a transaction
coordinator to reconnect to a database after a network fault, and tell it whether to commit or abort
an in-doubt transaction. Is this sufficient to ensure that the transaction will only be executed
once? Unfortunately not.
Even if we can suppress duplicate transactions between the database client and server, we still need
to worry about the network between the end-user device and the application server. For example, if
the end-user client is a web browser, it probably uses an HTTP POST request to submit an instruction
to the server. Perhaps the user is on a weak cellular data connection, and they succeed in sending
the POST, but the signal becomes too weak before they are able to receive the response from the
server.
In this case, the user will probably be shown an error message, and they may retry manually. Web
browsers warn, "Are you sure you want to submit this form again?"---and the user says yes, because
they wanted the operation to happen. (The Post/Redirect/Get pattern [^43] avoids this
warning message in normal operation, but it doesn't help if the POST request times out.) From the
web server's point of view the retry is a separate request, and from the database's point of view it
is a separate transaction. The usual deduplication mechanisms don't help.
#### Uniquely identifying requests {#id355}
To make the request idempotent through several hops of network communication, it is not sufficient
to rely just on a transaction mechanism provided by a database---you need to consider the
*end-to-end* flow of the request.
For example, you could generate a unique identifier for a request (such as a UUID) and include it as
a hidden form field in the client application, or calculate a hash of all the relevant form fields
to derive the request ID [^3]. If the web browser submits the POST request twice, the two
requests will have the same request ID. You can then pass that request ID all the way through to the
database and check that you only ever execute one request with a given ID, as shown in
[Example 13-2](/en/ch13#fig_future_request_id).
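The two ways of obtaining a request ID might look like the following sketch. The function names and the form field names are hypothetical. Note one consequence of the hashing option: two deliberate, identical submissions produce the same ID, so the hashed fields would need to include something that distinguishes them (for instance, a value generated when the form is rendered).

```python
import hashlib
import json
import uuid
from typing import Dict


def random_request_id() -> str:
    """Option 1: generate a fresh UUID when the form is first rendered."""
    return str(uuid.uuid4())


def hashed_request_id(form_fields: Dict[str, str]) -> str:
    """Option 2: derive the ID deterministically from the form contents."""
    # Sort keys so that field order does not change the hash.
    canonical = json.dumps(form_fields, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


form = {"from_account": "4321", "to_account": "1234", "amount": "11.00"}
reordered = {"amount": "11.00", "to_account": "1234", "from_account": "4321"}
id_first_submit = hashed_request_id(form)
id_resubmit = hashed_request_id(reordered)
id_other_amount = hashed_request_id({**form, "amount": "12.00"})
```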
##### Example 13-2. Suppressing duplicate requests using a unique ID
``` sql
ALTER TABLE requests ADD UNIQUE (request_id);
BEGIN TRANSACTION;
INSERT INTO requests
(request_id, from_account, to_account, amount)
VALUES('0286FDB8-D7E1-423F-B40B-792B3608036C', 4321, 1234, 11.00);
UPDATE accounts SET balance = balance + 11.00 WHERE account_id = 1234;
UPDATE accounts SET balance = balance - 11.00 WHERE account_id = 4321;
COMMIT;
```
[Example 13-2](/en/ch13#fig_future_request_id) relies on a uniqueness constraint on the `request_id`
column. If a transaction attempts to insert an ID that already exists, the `INSERT` fails and the
transaction is aborted, preventing it from taking effect twice. Relational databases can generally
maintain a uniqueness constraint correctly, even at weak isolation levels (whereas an
application-level check-then-insert may fail under nonserializable isolation, as discussed in
["Write Skew and Phantoms"](/en/ch8#sec_transactions_write_skew)).
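To see the constraint suppress a retry, the following sketch runs the same transfer twice against an in-memory SQLite database using Python's `sqlite3` module. SQLite, the simplified schema, the starting balances, and the whole-number amounts are all assumptions made to keep the example self-contained; the function `transfer` is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE accounts (account_id INTEGER PRIMARY KEY, balance NUMERIC);
    CREATE TABLE requests (
        request_id TEXT UNIQUE,
        from_account INTEGER,
        to_account INTEGER,
        amount NUMERIC
    );
    INSERT INTO accounts VALUES (1234, 100), (4321, 100);
    """
)


def transfer(request_id: str, from_acct: int, to_acct: int, amount: int) -> bool:
    """Run the transfer once per request ID; return False for a duplicate."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "INSERT INTO requests VALUES (?, ?, ?, ?)",
                (request_id, from_acct, to_acct, amount),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE account_id = ?",
                (amount, to_acct),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE account_id = ?",
                (amount, from_acct),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # the request ID already exists: suppress the retry


request_id = "0286FDB8-D7E1-423F-B40B-792B3608036C"
first = transfer(request_id, 4321, 1234, 11)
retry = transfer(request_id, 4321, 1234, 11)
balances = dict(conn.execute("SELECT account_id, balance FROM accounts"))
```

Because the `INSERT` is the first statement in the transaction, a duplicate is rejected before either balance is touched.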
Besides suppressing duplicate requests, the `requests` table in
[Example 13-2](/en/ch13#fig_future_request_id) acts as a kind of event log, which can be useful for
event sourcing or change data capture. The updates to the account balances don't actually have to
happen in the same transaction as the insertion of the event, since they are redundant and could be
derived from the request event in a downstream consumer---as long as the event is processed exactly
once, which can again be enforced using the request ID.
#### The end-to-end argument {#sec_future_e2e_argument}
This scenario of suppressing duplicate transactions is just one example of a more general principle
called the *end-to-end argument*, which was articulated by Saltzer, Reed, and Clark in 1984
[^44]:
> The function in question can completely and correctly be implemented only with the knowledge and
> help of the application standing at the endpoints of the communication system. Therefore,
> providing that questioned function as a feature of the communication system itself is not
> possible. (Sometimes an incomplete version of the function provided by the communication system
> may be useful as a performance enhancement.)
In our example, the *function in question* was duplicate suppression. We saw that TCP suppresses
duplicate packets at the TCP connection level, and some stream processors provide so-called
exactly-once semantics at the message processing level, but that is not enough to prevent a user
from submitting a duplicate request if the first one times out. By themselves, TCP, database
transactions, and stream processors cannot entirely rule out these duplicates. Solving the problem
requires an end-to-end solution: a transaction identifier that is passed all the way from the
end-user client to the database.
The end-to-end argument also applies to checking the integrity of data: checksums built into
Ethernet, TCP, and TLS can detect corruption of packets in the network, but they cannot detect
corruption due to bugs in the software at the sending and receiving ends of the network connection,
or corruption on the disks where the data is stored. If you want to catch all possible sources of
data corruption, you also need end-to-end checksums.
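A rough sketch of an end-to-end checksum, assuming SHA-256 as the hash function and hypothetical `seal` and `verify` helpers: the producer computes the digest before the data enters any network or storage layer, and the final consumer recomputes it after the data has passed through all of them.

```python
import hashlib

def seal(payload: bytes) -> tuple[bytes, str]:
    """Producer side: attach a digest computed over the application-level payload."""
    return payload, hashlib.sha256(payload).hexdigest()

def verify(payload: bytes, expected_digest: str) -> bool:
    """Consumer side: recompute the digest over what was actually received."""
    return hashlib.sha256(payload).hexdigest() == expected_digest

record, digest = seal(b"balance=100")

# Simulate a single flipped bit somewhere between producer and consumer,
# for example in a buggy serialization library or on a failing disk.
corrupted = bytes([record[0] ^ 0x01]) + record[1:]

intact_ok = verify(record, digest)
corrupted_ok = verify(corrupted, digest)
```

The check does not care where the corruption happened, which is the point: it covers every component between the two endpoints.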
A similar argument applies with encryption [^44]: the password on your home WiFi network
protects against people snooping your WiFi traffic, but not against attackers elsewhere on the
internet; TLS/SSL between your client and the server protects against network attackers, but not
against compromises of the server. Only end-to-end encryption and authentication can protect against
all of these things.
Although the low-level features (TCP duplicate suppression, Ethernet checksums, WiFi encryption)
cannot provide the desired end-to-end features by themselves, they are still useful, since they
reduce the probability of problems at the higher levels. For example, HTTP requests would often get
mangled if we didn't have TCP putting the packets back in the right order. We just need to remember
that the low-level reliability features are not by themselves sufficient to ensure end-to-end
correctness.
#### Applying end-to-end thinking in data systems {#id357}
This brings us back to the original thesis: just because an application uses a data system that
provides comparatively strong safety properties, such as serializable transactions, that does not
mean the application is guaranteed to be free from data loss or corruption. The application itself
needs to take end-to-end measures, such as duplicate suppression, as well.
That is a shame, because fault-tolerance mechanisms are hard to get right. Low-level reliability
mechanisms, such as those in TCP, work quite well, and so the remaining higher-level faults occur
fairly rarely. It would be really nice to wrap up the remaining high-level fault-tolerance machinery
in an abstraction so that application code needn't worry about it---but it seems that we have not
yet found the right abstraction.
Transactions have long been seen as a useful abstraction. As discussed in
[Chapter 8](/en/ch8#ch_transactions), they take a wide range of possible issues (concurrent writes,
constraint violations, crashes, network interruptions, disk failures) and collapse them down to two
possible outcomes: commit or abort. That is a huge simplification of the programming model, but it
is not enough.
Transactions are expensive, especially when they involve heterogeneous storage technologies (see
["Distributed Transactions Across Different Systems"](/en/ch8#sec_transactions_xa)). When we refuse
to use distributed transactions because they are too expensive, we end up having to reimplement
fault-tolerance mechanisms in application code. As numerous examples throughout this book have
shown, reasoning about concurrency and partial failure is difficult and counterintuitive, and so
most application-level mechanisms do not work correctly. The consequence is lost or corrupted data.
For these reasons, it is worth exploring fault-tolerance abstractions that make it easy to provide
application-specific end-to-end correctness properties, but also maintain good performance and good
operational characteristics in a large-scale distributed environment.
### Enforcing Constraints {#sec_future_constraints}
Let's think about correctness in the context of the ideas around unbundling databases. We saw that
end-to-end duplicate suppression can be achieved with a request ID that is passed all the way from
the client to the database that records the write. What about other kinds of constraints?
In particular, let's focus on uniqueness constraints---such as the one we relied on in
[Example 13-2](/en/ch13#fig_future_request_id). In ["Constraints and uniqueness
guarantees"](/en/ch10#sec_consistency_uniqueness) we saw several other examples of application
features that need to enforce uniqueness: a username or email address must uniquely identify a user,
a file storage service cannot have more than one file with the same name, and two people cannot book
the same seat on a flight or in a theater.
Other kinds of constraints are very similar: for example, ensuring that an account balance never
goes negative, that you don't sell more items than you have in stock in the warehouse, or that a
meeting room does not have overlapping bookings. Techniques that enforce uniqueness can often be
used for these kinds of constraints as well.
#### Uniqueness constraints require consensus {#id452}
In [Chapter 10](/en/ch10#ch_consistency) we saw that in a distributed setting, enforcing a
uniqueness constraint requires consensus: if there are several concurrent requests with the same
value, the system somehow needs to decide which one of the conflicting operations is accepted, and
reject the others as violations of the constraint.
The most common way of achieving this consensus is to make a single node the leader, and put it in
charge of making all the decisions. That works fine as long as you don't mind funneling all requests
through a single node (even if the client is on the other side of the world), and as long as that
node doesn't fail. Consensus algorithms like Raft tackle the problem of safely electing a new leader
if the current leader has failed (or is believed to have failed due to a network problem), and
preventing split brain.
Uniqueness checking can be scaled out by sharding based on the value that needs to be unique. For
example, if you need to ensure uniqueness by request ID, as in
[Example 13-2](/en/ch13#fig_future_request_id), you can ensure all requests with the same request ID
are routed to the same shard. If you need usernames to be unique, you can shard by hash of username.
However, asynchronous multi-leader replication is ruled out, because it could happen that different
leaders concurrently accept conflicting writes, and thus the values are no longer unique. If you
want to be able to immediately reject any writes that would violate the constraint, synchronous
coordination is unavoidable [^45].
#### Uniqueness in log-based messaging {#sec_future_uniqueness_log}
A shared log ensures that all consumers see messages in the same order---a guarantee that is
formally known as *total order broadcast* and is equivalent to consensus (see ["The Many Faces of
Consensus"](/en/ch10#sec_consistency_faces)). In the unbundled database approach with log-based
messaging, we can use a very similar approach to enforce uniqueness constraints.
A stream processor consumes all the messages in a log shard sequentially on a single thread. Thus,
if the log is sharded based on the value that needs to be unique, a stream processor can
unambiguously and deterministically decide which one of several conflicting operations came first in
the log. For example, in the case of several users trying to claim the same username
[^46]:
1. Every request for a username is encoded as a message, and appended to a shard determined by the
hash of the username.
2. A stream processor sequentially reads the requests in the log, using a local database to keep
track of which usernames are taken. For every request for a username that is available, it
records the name as taken and emits a success message to an output stream. For every request for
a username that is already taken, it emits a rejection message to an output stream.
3. The client that requested the username watches the output stream and waits for a success or
rejection message corresponding to its request.
This algorithm is the same as the construction for achieving consensus using a shared log, which we
saw in [Chapter 10](/en/ch10#ch_consistency). It scales easily to a large request throughput by
increasing the number of shards, as each shard can be processed independently.
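The three steps can be sketched in a few lines of Python. This is a simplified in-memory model under stated assumptions: the logs are plain lists, `NUM_SHARDS`, `shard_for`, and `UsernameProcessor` are hypothetical names, and a real deployment would use a durable log and a persistent local database.

```python
import hashlib
from collections import defaultdict

NUM_SHARDS = 4  # assumed shard count, chosen arbitrarily for the example

def shard_for(username: str) -> int:
    """Step 1: route every request for the same username to the same shard."""
    digest = hashlib.sha256(username.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

class UsernameProcessor:
    """Step 2: single-threaded consumer of one log shard."""

    def __init__(self):
        self.taken = {}  # username -> request_id of the request that won

    def process(self, request):
        request_id, username = request["request_id"], request["username"]
        # The first request in log order claims the name. Replaying the same
        # request later yields the same answer, so reprocessing is harmless.
        owner = self.taken.setdefault(username, request_id)
        status = "success" if owner == request_id else "rejected"
        return {"request_id": request_id, "username": username, "status": status}

logs = defaultdict(list)  # shard number -> ordered list of requests

def append_request(request_id, username):
    logs[shard_for(username)].append({"request_id": request_id, "username": username})

append_request("r1", "alice")
append_request("r2", "bob")
append_request("r3", "alice")  # conflicts with r1, lands in the same shard

processors = defaultdict(UsernameProcessor)  # one processor per shard
output = [processors[shard].process(req) for shard, log in logs.items() for req in log]

# Step 3: each client looks up the message that carries its own request ID.
results = {msg["request_id"]: msg["status"] for msg in output}
```

Because the decision depends only on the order of requests within one shard, shards never need to talk to each other.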
The approach works not only for uniqueness constraints, but also for many other kinds of
constraints. Its fundamental principle is that any writes that may conflict are routed to the same
shard and processed sequentially. The definition of a conflict may depend on the application, but
the stream processor can use arbitrary logic to validate a request.
#### Multi-shard request processing {#id360}
Ensuring that an operation is executed atomically, while satisfying constraints, becomes more
interesting when several shards are involved. In [Example 13-2](/en/ch13#fig_future_request_id),
there are potentially three shards: the one containing the request ID, the one containing the payee
account, and the one containing the payer account. There is no reason why those three things should
be in the same shard, since they are all independent from each other.
In the traditional approach to databases, executing this transaction would require an atomic commit
across all three shards, which essentially forces it into a total order with respect to all other
transactions on any of those shards. Since there is now cross-shard coordination, different shards
can no longer be processed independently, so throughput is likely to suffer.
However, equivalent correctness can be achieved without cross-shard transactions using sharded logs
and stream processors. [Figure 13-2](/en/ch13#fig_future_multi_shard) shows an example of a payment
transaction that needs to check whether there is sufficient money in the source account, and if so,
atomically transfers some amount to a destination account while deducting fees. It works as follows
[^47]:
{{< figure src="/fig/ddia_1302.png" id="fig_future_multi_shard" caption="Figure 13-2. Checking whether a source account has enough money, and atomically transferring money to a destination account and a fees account, using event logs and stream processors." class="w-full my-4" >}}
1. The request to transfer money from the source account to the destination account is given a
unique request ID by the user's client, and appended to a log shard based on the source account
ID.
2. A stream processor reads the log of requests and maintains a database containing the state of
the source account and the IDs of requests it has already processed. The contents of this
database are entirely derived from the log. When the stream processor encounters a request with
an ID that it has not seen before, it checks in its local database whether the source account
has enough money to perform the transfer.
If yes, it updates its local database to reserve the payment amount on the source account, and
emits events to several other logs: an outgoing payment event to the log shard for the source
account (its own input log), an incoming payment event to the log shard for the destination
account, and an incoming payment event to the log shard for the fees account. The original
request ID is included in those emitted events.
3. Eventually the outgoing payment event is delivered back to the source account processor
(possibly after having received unrelated events in the meantime). The stream processor
recognizes based on the request ID that this is a payment it previously reserved, and it now
executes the payment, again updating its local state of the source account. It ignores
duplicates based on request ID.
4. The log shards for the destination and fees accounts are consumed by independent stream
processing tasks. When they receive an incoming payment event, they update their local state to
reflect the payment, and they deduplicate events based on request ID.
[Figure 13-2](/en/ch13#fig_future_multi_shard) shows the three accounts as being in three separate
shards, but they could just as well be in the same shard---it doesn't matter. All we need is that
the events for any given account are processed strictly in log order with at-least-once semantics,
and that the stream processors are deterministic.
For example, consider what happens if the source account processor crashes while processing a
payment request. The output messages may or may not have been emitted before the crash occurred.
When it recovers from the crash, it will process the same request again (due to at-least-once
semantics), and it will make the same decision on whether to allow the payment (since it's
deterministic). It will therefore emit the same output messages with the same request ID to the
source, destination, and fees account shards. If they are duplicates, the downstream consumers will
ignore them based on the request ID.
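The crash scenario can be simulated with a small in-memory model. Everything here is a sketch under assumptions: logs are Python lists, the class names and event fields are hypothetical, and recovery is modeled by rebuilding the processor's state from the start of its log (which is allowed because that state is entirely derived from the log).

```python
class SourceAccountProcessor:
    """Consumes the source account's log shard (steps 2 and 3)."""

    def __init__(self, opening_balance):
        self.balance = opening_balance
        self.reserved = {}  # request_id -> amount held for a pending payment
        self.seen = set()   # request IDs that have already been decided

    def process(self, event):
        rid = event["request_id"]
        if event["type"] == "request":
            if rid in self.seen:
                return []  # duplicate request: already decided
            self.seen.add(rid)
            total = event["amount"] + event["fee"]
            if self.balance - sum(self.reserved.values()) < total:
                return [("source", {"type": "declined", "request_id": rid})]
            self.reserved[rid] = total
            return [
                ("source", {"type": "outgoing", "request_id": rid, "amount": total}),
                ("destination", {"type": "incoming", "request_id": rid,
                                 "amount": event["amount"]}),
                ("fees", {"type": "incoming", "request_id": rid, "amount": event["fee"]}),
            ]
        if event["type"] == "outgoing" and rid in self.reserved:
            # Execute the reserved payment; a duplicate finds no reservation.
            self.balance -= self.reserved.pop(rid)
        return []

class IncomingProcessor:
    """Consumes the destination or fees log shard (step 4)."""

    def __init__(self):
        self.balance = 0
        self.applied = set()

    def process(self, event):
        rid = event["request_id"]
        if event["type"] == "incoming" and rid not in self.applied:
            self.applied.add(rid)
            self.balance += event["amount"]

request = {"type": "request", "request_id": "req-1", "amount": 50, "fee": 1}
logs = {"source": [request], "destination": [], "fees": []}

def emit(outputs):
    for shard, event in outputs:
        logs[shard].append(event)

# First attempt: the processor emits its output events, then crashes
# before recording that it has consumed the request.
emit(SourceAccountProcessor(100).process(logs["source"][0]))

# Recovery: a fresh processor replays the source log from the beginning.
# The log may grow while it is being read, so the length is rechecked each time.
source = SourceAccountProcessor(100)
offset = 0
while offset < len(logs["source"]):
    emit(source.process(logs["source"][offset]))
    offset += 1

destination, fees = IncomingProcessor(), IncomingProcessor()
for event in logs["destination"]:
    destination.process(event)
for event in logs["fees"]:
    fees.process(event)
```

After the replay, the destination and fees logs each contain two copies of the incoming payment event, but each account is credited only once because the consumers deduplicate by request ID.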
Atomicity in this system comes not from any transactions, but from the fact that writing the initial
request event to the source account log is an atomic action. Once that one event is in the log, all the
downstream events will eventually be written as well---possibly after stream processors have
recovered from crashes, and possibly with duplicates, but they will appear eventually.
With exactly-once semantics this example becomes easier to implement, since it ensures that the
stream processor's local state is consistent with the set of messages it has processed. Thus, if it
crashes and re-processes some messages, its local state is also reset to what it was before those
messages were processed.
If the user in [Figure 13-2](/en/ch13#fig_future_multi_shard) wants to find out whether their
transfer was approved or not, they can subscribe to the source account log shard and wait for the
outgoing payment event. In order to explicitly notify the user if the balance is insufficient, the
stream processor can emit a "declined payment" event to that log shard.
By breaking down the multi-shard transaction into several differently sharded stages and using the
end-to-end request ID, we have achieved the same correctness property (every request is applied
exactly once to both the payer and payee accounts), even in the presence of faults, and without
using an atomic commit protocol.
### Timeliness and Integrity {#sec_future_integrity}
A convenient property of many transactional systems is that as soon as one transaction commits, its
writes are immediately visible to other transactions. This property is formalized as *strict
serializability* (see ["Linearizability Versus
Serializability"](/en/ch10#sidebar_consistency_serializability)).
This is not the case when unbundling an operation across multiple stages of stream processors:
consumers of a log are asynchronous by design, so a sender does not wait until its message has been
processed by consumers. However, it is possible for a client to wait for a message to appear on an
output stream, like the user waiting for an outgoing payment or declined payment event in
[Figure 13-2](/en/ch13#fig_future_multi_shard), which depends on whether there was enough money in
the source account.
In this example, the correctness of the source account balance check does not depend on whether the
user making the request waits for the outcome. The waiting only has the purpose of synchronously
informing the user whether or not the payment succeeded, but this notification is decoupled from the
effects of processing the request.
More generally, the term *consistency* conflates two different requirements that are worth
considering separately:
Timeliness
: Timeliness means ensuring that users observe the system in an up-to-date state. We saw
previously that if a user reads from a stale copy of the data, they may observe it in an
inconsistent state (see ["Problems with Replication Lag"](/en/ch6#sec_replication_lag)).
However, that inconsistency is temporary, and will eventually be resolved simply by waiting and
trying again.
The CAP theorem uses consistency in the sense of linearizability, which is a strong way of
achieving timeliness. Weaker timeliness properties like *read-after-write* consistency can also
be useful.
Integrity
: Integrity means absence of corruption; i.e., no data loss, and no contradictory or false data.
In particular, if some derived dataset is maintained as a view onto some underlying data, the
derivation must be correct. For example, a database index must correctly reflect the contents of
the database---an index in which some records are missing is not very useful.
If integrity is violated, the inconsistency is permanent: waiting and trying again is not going
to fix database corruption in most cases. Instead, explicit checking and repair is needed. In
the context of ACID transactions, "consistency" is usually understood as some kind of
application-specific notion of integrity. Atomicity and durability are important tools for
preserving integrity.
In slogan form: violations of timeliness are "eventual consistency," whereas violations of integrity
are "perpetual inconsistency."
In most applications, integrity is much more important than timeliness. Violations of timeliness can
be annoying and confusing, but violations of integrity can be catastrophic.
For example, on your credit card statement, it is not surprising if a transaction that you made
within the last 24 hours does not yet appear---it is normal that these systems have a certain lag.
We know that banks reconcile and settle transactions asynchronously, and timeliness is not very
important here [^3]. However, it would be very bad if the statement balance was not equal
to the sum of the transactions plus the previous statement balance (an error in the sums), or if a
transaction was charged to you but not paid to the merchant (disappearing money). Such problems
would be violations of the integrity of the system.
#### Correctness of dataflow systems {#id453}
ACID transactions usually provide both timeliness (e.g., linearizability) and integrity (e.g.,
atomic commit) guarantees. Thus, if you approach application correctness from the point of view of
ACID transactions, the distinction between timeliness and integrity is fairly inconsequential.
On the other hand, an interesting property of the event-based dataflow systems that we have
discussed in this chapter is that they decouple timeliness and integrity. When processing event
streams asynchronously, there is no guarantee of timeliness, unless you explicitly build consumers
that wait for a message to arrive before returning. For example, a user could request a payment and
then read the state of their account before the stream processor has executed the request; the user
will not see the payment they just requested.
However, integrity is in fact central to streaming systems. *Exactly-once* or *effectively-once*
semantics is a mechanism for preserving integrity. If an event is lost, or if an event takes effect
twice, the integrity of a data system could be violated. Thus, fault-tolerant message delivery and
duplicate suppression (e.g., idempotent operations) are important for maintaining the integrity of a
data system in the face of faults.
As we saw in the last section, reliable stream processing systems can preserve integrity without
requiring distributed transactions and an atomic commit protocol, which means they can potentially
achieve comparable correctness with much better performance and operational robustness. We achieved
this integrity through a combination of mechanisms:
- Representing the content of the write operation as a single message, which can easily be written
atomically---an approach that fits very well with event sourcing
- Deriving all other state updates from that single message using deterministic derivation
functions, similarly to stored procedures
- Passing a client-generated request ID through all these levels of processing, enabling end-to-end
duplicate suppression and idempotence
- Making messages immutable and allowing derived data to be reprocessed from time to time, which
makes it easier to recover from bugs
#### Loosely interpreted constraints {#id362}
As discussed previously, enforcing a uniqueness constraint requires consensus, typically implemented
by funneling all events in a particular shard through a single node. This limitation is unavoidable
if we want the traditional form of uniqueness constraint, and stream processing cannot get around
it.
However, another thing to realize is that in many real applications there is actually a business
requirement to allow violations of what you might think of as hard constraints:
- If customers order more items than you have in your warehouse, you can order in more stock,
apologize to customers for the delay, and offer them a discount. This is actually the same as what
you'd have to do if, say, a forklift truck ran over some of the items in your warehouse, leaving
you with fewer items in stock than you thought you had [^3]. Thus, the apology workflow
already needs to be part of your business processes anyway in order to deal with forklift
incidents, and a hard constraint on the number of items in stock might be unnecessary.
- Similarly, many airlines overbook airplanes in the expectation that some passengers will miss
their flight, and many hotels overbook rooms, expecting that some guests will cancel. In these
cases, the constraint of "one person per seat" is deliberately violated for business reasons, and
compensation processes (refunds, upgrades, providing a complimentary room at a neighboring hotel)
are put in place to handle situations in which demand exceeds supply. Even if there was no
overbooking, apology and compensation processes would be needed in order to deal with flights
being cancelled due to bad weather or staff on strike---recovering from such issues is just a
normal part of business [^3].
- If someone withdraws more money than they have in their account, the bank can charge them an
overdraft fee and ask them to pay back what they owe. By limiting the total withdrawals per day,
the risk to the bank is bounded.
- In systems that integrate data between different organizations, inconsistencies will inevitably
arise, and correction mechanisms are necessary to handle them. As noted in ["Batch Use
Cases"](/en/ch11#sec_batch_output), settlement of payments between banks is an example of this.
In many business contexts, it is therefore acceptable to temporarily violate a constraint and fix it
up later by apologizing. This kind of change to correct a mistake is called a *compensating
transaction* [^48], [^49]. The cost of the apology (in terms of money or reputation)
varies, but it is often quite low: you can't unsend an email, but you can send a follow-up email
with a correction. If you accidentally charge a credit card twice, you can refund one of the
charges, and the cost to you is just the processing fees and perhaps a customer complaint. Once
money has been paid out of an ATM, you can't directly get it back, although in principle you can
send debt collectors to recover the money if the account was overdrawn and the customer won't pay it
back.
Whether the cost of the apology is acceptable is a business decision. If it is acceptable, the
traditional model of checking all constraints before even writing the data is unnecessarily
restrictive. It may well be reasonable to go ahead with a write optimistically, and to check the
constraint after the fact. You can still ensure that the validation occurs before doing things that
would be expensive to recover from, but that doesn't imply you must do the validation before you
even write the data.
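To make the optimistic approach concrete, here is a small sketch in which orders are written without any stock check and the constraint is validated afterward. The function names, the event type `apologize_and_backorder`, and the data layout are all hypothetical.

```python
def accept_order(orders, order_id, item, quantity):
    """Write the order optimistically, without checking stock first."""
    orders.append({"order_id": order_id, "item": item, "quantity": quantity})

def reconcile(orders, stock):
    """Check the stock constraint after the fact, in the order the orders were written.

    Returns the remaining stock and a list of compensating events for
    orders that could not be fulfilled from the available stock.
    """
    remaining = dict(stock)
    compensations = []
    for order in orders:
        if remaining.get(order["item"], 0) >= order["quantity"]:
            remaining[order["item"]] -= order["quantity"]
        else:
            compensations.append(
                {"type": "apologize_and_backorder", "order_id": order["order_id"]}
            )
    return remaining, compensations

orders = []
accept_order(orders, "o1", "widget", 2)
accept_order(orders, "o2", "widget", 2)  # oversells: only 1 widget is left after o1
accept_order(orders, "o3", "widget", 1)

remaining, compensations = reconcile(orders, {"widget": 3})
```

No order is ever lost or rejected at write time, so integrity is preserved; only the enforcement of the constraint is delayed. Validation still runs before the expensive step, which in this example would be shipping.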
These applications *do* require integrity: you would not want to lose a reservation, or have money
disappear due to mismatched credits and debits. But they *don't* require timeliness on the
enforcement of the constraint: if you have sold more items than you have in the warehouse, you can
patch up the problem after the fact by apologizing. Doing so is similar to the conflict resolution
approaches we discussed in ["Dealing with Conflicting
Writes"](/en/ch6#sec_replication_write_conflicts).
#### Coordination-avoiding data systems {#id454}
We have now made two interesting observations:
1. Dataflow systems can maintain integrity guarantees on derived data without atomic commit,
linearizability, or synchronous cross-shard coordination.
2. Although strict uniqueness constraints require timeliness and coordination, many applications
are actually fine with loose constraints that may be temporarily violated and fixed up later, as
long as integrity is preserved throughout.
Taken together, these observations mean that dataflow systems can provide the data management
services for many applications without requiring coordination, while still giving strong integrity
guarantees. Such *coordination-avoiding* data systems have a lot of appeal: they can achieve better
performance and fault tolerance than systems that need to perform synchronous coordination
[^45].
For example, such a system could operate distributed across multiple datacenters in a multi-leader
configuration, asynchronously replicating between regions. Any one datacenter can continue operating
independently from the others, because no synchronous cross-region coordination is required. Such a
system would have weak timeliness guarantees---it could not be linearizable without introducing
coordination---but it can still have strong integrity guarantees.
In this context, serializable transactions are still useful as part of maintaining derived state,
but they can be run at a small scope where they work well [^6]. Heterogeneous distributed
transactions such as XA transactions are not required. Synchronous coordination can still be
introduced in places where it is needed (for example, to enforce strict constraints before an
operation from which recovery is not possible), but there is no need for everything to pay the cost
of coordination if only a small part of an application needs it [^32].
Another way of looking at coordination and constraints: they reduce the number of apologies you have
to make due to inconsistencies, but potentially also reduce the performance and availability of your
system, and thus potentially increase the number of apologies you have to make due to outages. You
cannot reduce the number of apologies to zero, but you can aim to find the best trade-off for your
needs---the sweet spot where there are neither too many inconsistencies nor too many availability
problems.
### Trust, but Verify {#sec_future_verification}
All of our discussion of correctness, integrity, and fault-tolerance has been under the assumption
that certain things might go wrong, but other things won't. We call these assumptions our *system
model* (see ["System Model and Reality"](/en/ch9#sec_distributed_system_model)): for example, we
should assume that processes can crash, machines can suddenly lose power, and the network can
arbitrarily delay or drop messages. But we might also assume that data written to disk is not lost
after `fsync`, that data in memory is not corrupted, and that the multiplication instruction of our
CPU always returns the correct result.
These assumptions are quite reasonable, as they are true most of the time, and it would be difficult
to get anything done if we had to constantly worry about our computers making mistakes.
Traditionally, system models take a binary approach toward faults: we assume that some things can
happen, and other things can never happen. In reality, it is more a question of probabilities: some
things are more likely, other things less likely. The question is whether violations of our
assumptions happen often enough that we may encounter them in practice.
We have seen that data can become corrupted in memory (see ["Hardware and Software
Faults"](/en/ch2#sec_introduction_hardware_faults)), on disk (see ["Replication and
Durability"](/en/ch8#sidebar_transactions_durability)), and on the network (see ["Weak forms of
lying"](/en/ch9#sec_distributed_weak_lying)). Maybe this is something we should be paying more
attention to? If you are operating at large enough scale, even very unlikely things do happen.
#### Maintaining integrity in the face of software bugs {#id455}
Besides such hardware issues, there is always the risk of software bugs, which would not be caught
by lower-level network, memory, or filesystem checksums. Even widely used database software has
bugs: for example, past versions of MySQL have failed to correctly maintain uniqueness constraints
[^50] and PostgreSQL's serializable isolation level has exhibited write skew anomalies in
the past [^51], even though MySQL and PostgreSQL are robust and well-regarded databases
that have been battle-tested by many people for many years. In less mature software, the situation
is likely to be much worse.
Despite considerable efforts in careful design, testing, and review, bugs still creep in. Although
they are rare, and they eventually get found and fixed, there is still a period during which such
bugs can corrupt data.
When it comes to application code, we have to assume many more bugs, since most applications don't
receive anywhere near the amount of review and testing that database code does. Many applications
don't even correctly use the features that databases offer for preserving integrity, such as foreign
key or uniqueness constraints [^25].
Consistency in the sense of ACID is based on the idea that the database starts off in a consistent
state, and a transaction transforms it from one consistent state to another consistent state. Thus,
we expect the database to always be in a consistent state. However, this notion only makes sense if
you assume that the transaction is free from bugs. If the application uses the database incorrectly
in some way, for example using a weak isolation level unsafely, the integrity of the database cannot
be guaranteed.
#### Don't just blindly trust what they promise {#id364}
Since neither hardware nor software always lives up to the ideal we would like, it seems that data
corruption is inevitable sooner or later. Thus, we should at least have a way of
finding out if data has been corrupted so that we can fix it and try to track down the source of the
error. Checking the integrity of data is known as *auditing*.
As discussed in ["Advantages of immutable events"](/en/ch12#sec_stream_immutability_pros), auditing
is not just for financial applications. However, auditability is very important in finance precisely
because everyone knows that mistakes happen, and we all recognize the need to be able to detect and
fix problems.
Mature systems similarly tend to consider the possibility of unlikely things going wrong, and manage
that risk. For example, large-scale storage systems such as HDFS and Amazon S3 do not fully trust
disks: they run background processes that continually read back files, compare them to other
replicas, and move files from one disk to another, in order to mitigate the risk of silent
corruption [^52], [^53].
If you want to be sure that your data is still there, you have to actually read it and check. Most
of the time it will still be there, but if it isn't, you really want to find out sooner rather than
later. By the same argument, it is important to try restoring from your backups from time to
time---otherwise you may only find out that your backup is broken when it is too late and you have
already lost data. Don't just blindly trust that it is all working.
Systems like HDFS and S3 still have to assume that disks work correctly most of the time---which is
a reasonable assumption, but not the same as assuming that they *always* work correctly. However,
not many systems currently have this kind of "trust, but verify" approach of continually auditing
themselves. Many assume that correctness guarantees are absolute and make no provision for the
possibility of rare data corruption. In the future we may see more *self-validating* or
*self-auditing* systems that continually check their own integrity, rather than relying on blind
trust [^54].
#### Designing for auditability {#id365}
If a transaction mutates several objects in a database, it is difficult to tell after the fact what
that transaction means. Even if you capture the transaction logs, the insertions, updates, and
deletions in various tables do not necessarily give a clear picture of *why* those mutations were
performed. The invocation of the application logic that decided on those mutations is transient and
cannot be reproduced.
By contrast, event-based systems can provide better auditability. In the event sourcing approach,
user input to the system is represented as a single immutable event, and any resulting state updates
are derived from that event. The derivation can be made deterministic and repeatable, so that
running the same log of events through the same version of the derivation code will result in the
same state updates.
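The following sketch illustrates what a deterministic, repeatable derivation might look like. The `Event` schema and the account-balance example are hypothetical; the point is only that the derivation function is pure, so replaying the same log yields the same state:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """An immutable record of one user input (hypothetical schema)."""
    account: str
    amount: int  # positive = deposit, negative = withdrawal


def apply(state: dict[str, int], event: Event) -> dict[str, int]:
    """Pure derivation step: no clocks, randomness, or external lookups."""
    new_state = dict(state)
    new_state[event.account] = new_state.get(event.account, 0) + event.amount
    return new_state


def derive(log: list[Event]) -> dict[str, int]:
    """Replay the whole log, starting from an empty state."""
    state: dict[str, int] = {}
    for event in log:
        state = apply(state, event)
    return state


log = [Event("alice", 100), Event("bob", 50), Event("alice", -30)]
first_run = derive(log)
second_run = derive(log)
```

Because `apply` depends only on its arguments, `first_run` and `second_run` are equal. If the derivation consulted the current time or an external service, that guarantee would be lost unless those inputs were themselves recorded as events.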
Being explicit about dataflow makes the *provenance* of data much clearer, which makes integrity
checking much more feasible. For the event log, we can use hashes to check that the event storage
has not been corrupted. For any derived state, we can rerun the batch and stream processors that
derived it from the event log in order to check whether we get the same result, or even run a
redundant derivation in parallel.
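One simple way of using hashes to check event storage is a hash chain, in which each entry's hash covers both the event and the previous entry's hash. The sketch below assumes events are JSON-serializable dictionaries and uses hypothetical function names. It detects accidental corruption; protecting against deliberate tampering would additionally require storing or signing the latest hash somewhere the attacker cannot modify.

```python
import hashlib
import json
from typing import Optional

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def chain_hash(prev_hash: str, event: dict) -> str:
    """Hash of this event combined with the hash of the preceding entry."""
    payload = json.dumps(event, sort_keys=True).encode("utf-8")
    return hashlib.sha256(prev_hash.encode("ascii") + payload).hexdigest()


def append(log: list[dict], event: dict) -> None:
    prev = log[-1]["hash"] if log else GENESIS
    log.append({"event": event, "hash": chain_hash(prev, event)})


def first_corrupt_index(log: list[dict]) -> Optional[int]:
    """Recompute the chain; return the index of the first bad entry, if any."""
    prev = GENESIS
    for i, entry in enumerate(log):
        if chain_hash(prev, entry["event"]) != entry["hash"]:
            return i
        prev = entry["hash"]
    return None


log: list[dict] = []
for event in [{"user": "alice", "add": 5},
              {"user": "bob", "add": 7},
              {"user": "alice", "add": -2}]:
    append(log, event)

intact = first_corrupt_index(log)        # None: nothing wrong yet
log[1]["event"]["add"] = 700             # simulate silent corruption
corrupted_at = first_corrupt_index(log)  # index of the damaged entry
```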
A deterministic and well-defined dataflow also makes it easier to debug and trace the execution of a
system in order to determine why it did something [^4], [^55]. If something
unexpected occurred, it is valuable to have the diagnostic capability to reproduce the exact
circumstances that led to the unexpected event---a kind of time-travel debugging capability.
#### The end-to-end argument again {#id456}
If we cannot fully trust that every individual component of the system will be free from
corruption---that every piece of hardware is fault-free and that every piece of software is
bug-free---then we must at least periodically check the integrity of our data. If we don't check, we
won't find out about corruption until it is too late and it has caused some downstream damage, at
which point it will be much harder and more expensive to track down the problem.
Checking the integrity of data systems is best done in an end-to-end fashion: the more systems we
can include in an integrity check, the fewer opportunities there are for corruption to go unnoticed
at some stage of the process. If we can check that an entire derived data pipeline is correct end to
end, then any disks, networks, services, and algorithms along the path are implicitly included in
the check.
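To make the idea concrete, here is a toy sketch of an end-to-end check. The `pipeline` function is a hypothetical stand-in for everything between the source and the destination (encoding, network transfer, storage), and the check compares a digest computed at the source with one computed over what is read back at the far end:

```python
import hashlib
import json


def dataset_digest(records: list[dict]) -> str:
    """Order-independent digest: hash each record, then hash the sorted hashes."""
    record_hashes = sorted(
        hashlib.sha256(json.dumps(r, sort_keys=True).encode("utf-8")).hexdigest()
        for r in records
    )
    return hashlib.sha256("".join(record_hashes).encode("ascii")).hexdigest()


def pipeline(records: list[dict], corrupt: bool = False) -> list[dict]:
    """Hypothetical stand-in for serialization, transfer, and storage."""
    wire = [json.dumps(r) for r in records]          # encode
    if corrupt:
        wire[0] = wire[0].replace("10", "19")        # simulated corruption in transit
    return [json.loads(w) for w in reversed(wire)]   # decode; order not preserved


source = [{"id": 1, "total": 10}, {"id": 2, "total": 25}]
expected = dataset_digest(source)

healthy_ok = dataset_digest(pipeline(source)) == expected
corrupt_ok = dataset_digest(pipeline(source, corrupt=True)) == expected
```

The check does not need to know which stage introduced the fault: any component along the path that alters a record causes the digests to differ.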
Having continuous end-to-end integrity checks gives you increased confidence about the correctness
of your systems, which in turn allows you to move faster [^56]. Like automated testing,
auditing increases the chances that bugs will be found quickly, and thus reduces the risk that a
change to the system or a new storage technology will cause damage. If you are not afraid of making
changes, you can much better evolve an application to meet changing requirements.
#### Tools for auditable data systems {#id366}
At present, not many data systems make auditability a top-level concern. Some applications implement
their own audit mechanisms, for example by logging all changes to a separate audit table, but
guaranteeing the integrity of the audit log and the database state is still difficult. A transaction
log can be made tamper-proof by periodically signing it with a hardware security module, but that
does not guarantee that the right transactions went into the log in the first place.
Blockchains such as Bitcoin or Ethereum are shared append-only logs with cryptographic consistency
checks; the transactions they store are events, and smart contracts are basically stream processors.
The consensus protocols they use ensure that all nodes agree on the same sequence of events. The
difference from the consensus protocols of [Chapter 10](/en/ch10#ch_consistency) is that blockchains
are Byzantine fault tolerant: because the replicas continually check each other's integrity, they
still work even if some of the participating nodes have corrupted data.
For most applications, blockchains have too much overhead to be useful. However, some of their
cryptographic tools can also be used in a more lightweight context. For example, *Merkle trees*
[^57] are trees of hashes that can be used to efficiently prove that a record appears in
some dataset (and a few other things). *Certificate transparency* uses cryptographically verified
append-only logs and Merkle trees to check the validity of TLS/SSL certificates [^58],
[^59]; it avoids needing a consensus protocol by having a single leader per log.
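The sketch below shows how a Merkle tree can prove that a record is part of a dataset using only a logarithmic number of hashes. It is a simplified illustration with hypothetical function names, not the exact construction used by certificate transparency or any blockchain: it assumes a non-empty list of records, and an unpaired node at the end of a level is simply promoted to the next level.

```python
import hashlib


def leaf_hash(data: bytes) -> bytes:
    # A distinct prefix keeps leaf hashes separate from interior-node hashes.
    return hashlib.sha256(b"\x00" + data).digest()


def node_hash(left: bytes, right: bytes) -> bytes:
    return hashlib.sha256(b"\x01" + left + right).digest()


def build_levels(records: list[bytes]) -> list[list[bytes]]:
    """Return all tree levels, leaves first and the root level last."""
    level = [leaf_hash(r) for r in records]
    levels = [level]
    while len(level) > 1:
        nxt = [node_hash(level[i], level[i + 1])
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2 == 1:
            nxt.append(level[-1])  # unpaired node is promoted unchanged
        level = nxt
        levels.append(level)
    return levels


def inclusion_proof(levels: list[list[bytes]], index: int) -> list[tuple[str, bytes]]:
    """Sibling hashes on the path from a leaf to the root, tagged by side."""
    proof = []
    for level in levels[:-1]:
        sibling = index ^ 1
        if sibling < len(level):
            side = "left" if sibling < index else "right"
            proof.append((side, level[sibling]))
        index //= 2
    return proof


def verify(record: bytes, proof: list[tuple[str, bytes]], root: bytes) -> bool:
    """Recompute the root from the record and the proof; compare with the known root."""
    h = leaf_hash(record)
    for side, sibling in proof:
        h = node_hash(sibling, h) if side == "left" else node_hash(h, sibling)
    return h == root


records = [b"cert-a", b"cert-b", b"cert-c", b"cert-d", b"cert-e"]
levels = build_levels(records)
root = levels[-1][0]
proof = inclusion_proof(levels, 2)            # proof for b"cert-c"
included = verify(b"cert-c", proof, root)
forged = verify(b"cert-x", proof, root)
```

A verifier that knows only `root` can check `included` without seeing the other records. The proof contains one hash per tree level at most, so it grows with the logarithm of the dataset size.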
Integrity-checking and auditing algorithms, like those of certificate transparency and distributed
ledgers, might become more widely used in data systems in general in the future. Some work will be
needed to make them as scalable as systems without cryptographic auditing, and to keep the
performance penalty as low as possible, but they are nevertheless interesting.
## Summary {#id367}
In this chapter we discussed new approaches to designing data systems based on ideas from stream
processing. We started with the observation that there is no one single tool that can efficiently
serve all possible use cases, and so applications necessarily need to compose several different
pieces of software to accomplish their goals. We discussed how to solve this *data integration*
problem by using batch processing and event streams to let data changes flow between different
systems.
In this approach, certain systems are designated as systems of record, and other data is derived
from them through transformations. In this way we can maintain indexes, materialized views, machine
learning models, statistical summaries, and more. By making these derivations and transformations
asynchronous and loosely coupled, a problem in one area is prevented from spreading to unrelated
parts of the system, increasing the robustness and fault-tolerance of the system as a whole.
Expressing dataflows as transformations from one dataset to another also helps evolve applications:
if you want to change one of the processing steps, for example to change the structure of an index
or cache, you can just rerun the new transformation code on the whole input dataset in order to
rederive the output. Similarly, if something goes wrong, you can fix the code and reprocess the data
in order to recover.
These processes are quite similar to what databases already do internally, so we recast the idea of
dataflow applications as *unbundling* the components of a database, and building an application by
composing these loosely coupled components.
Derived state can be updated by observing changes in the underlying data. Moreover, the derived
state itself can further be observed by downstream consumers. We can even take this dataflow all the
way through to the end-user device that is displaying the data, and thus build user interfaces that
dynamically update to reflect data changes and continue to work offline.
Next, we discussed how to ensure that all of this processing remains correct in the presence of
faults. We saw that strong integrity guarantees can be implemented scalably with asynchronous event
processing, by using end-to-end request identifiers to make operations idempotent and by checking
constraints asynchronously. Clients can either wait until the check has passed, or go ahead without
waiting but risk having to apologize about a constraint violation. This approach is much more
scalable and robust than the traditional approach of using distributed transactions, and fits with
how many business processes work in practice.
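As a reminder of how end-to-end request identifiers make an operation idempotent, consider the following sketch. The `AccountService` class is hypothetical, and it keeps its state in memory; in a real system the record of processed request IDs and the state change would need to be committed atomically.

```python
import uuid


class AccountService:
    """Hypothetical service that deduplicates by client-generated request ID."""

    def __init__(self) -> None:
        self.balances: dict[str, int] = {}
        self.processed: dict[str, int] = {}  # request_id -> resulting balance

    def deposit(self, request_id: str, account: str, amount: int) -> int:
        # A retry of an already-processed request returns the original result
        # instead of applying the deposit a second time.
        if request_id in self.processed:
            return self.processed[request_id]
        balance = self.balances.get(account, 0) + amount
        self.balances[account] = balance
        self.processed[request_id] = balance
        return balance


service = AccountService()
request_id = str(uuid.uuid4())  # generated once by the client, reused on retry

first = service.deposit(request_id, "alice", 100)
retry = service.deposit(request_id, "alice", 100)  # e.g. after a network timeout
```

Both calls return the same balance, and the deposit takes effect only once, regardless of how many times the client retries.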
By structuring applications around dataflow and checking constraints asynchronously, we can avoid
most coordination and create systems that maintain integrity but still perform well, even in
geographically distributed scenarios and in the presence of faults. We then talked a little about
using audits to verify the integrity of data and detect corruption, and observed that the techniques
used by blockchains also have a similarity to event-based systems.
##### Footnotes
### References {#references}
[^1]: Rachid Belaid. [Postgres Full-Text Search is Good Enough!](https://rachbelaid.com/postgres-full-text-search-is-good-enough/) *rachbelaid.com*, July 2015. Archived at [perma.cc/ZVP9-YDCB](https://perma.cc/ZVP9-YDCB)
[^2]: Philippe Ajoux, Nathan Bronson, Sanjeev Kumar, Wyatt Lloyd, and Kaushik Veeraraghavan. [Challenges to Adopting Stronger Consistency at Scale](https://www.usenix.org/system/files/conference/hotos15/hotos15-paper-ajoux.pdf). At *15th USENIX Workshop on Hot Topics in Operating Systems* (HotOS), May 2015.
[^3]: Pat Helland and Dave Campbell. [Building on Quicksand](https://arxiv.org/pdf/0909.1788). At *4th Biennial Conference on Innovative Data Systems Research* (CIDR), January 2009.
[^4]: Jessica Kerr. [Provenance and Causality in Distributed Systems](https://jessitron.com/2016/09/25/provenance-and-causality-in-distributed-systems/). *jessitron.com*, September 2016. Archived at [perma.cc/DTD2-F8ZM](https://perma.cc/DTD2-F8ZM)
[^5]: Jay Kreps. [The Log: What Every Software Engineer Should Know About Real-Time Data's Unifying Abstraction](https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying). *engineering.linkedin.com*, December 2013. Archived at [perma.cc/2JHR-FR64](https://perma.cc/2JHR-FR64)
[^6]: Pat Helland. [Life Beyond Distributed Transactions: An Apostate's Opinion](https://www.cidrdb.org/cidr2007/papers/cidr07p15.pdf). At *3rd Biennial Conference on Innovative Data Systems Research* (CIDR), January 2007.
[^7]: Lionel A. Smith. [The Broad Gauge Story](https://lionels.orpheusweb.co.uk/RailSteam/GWRBroadG/BGHist.html). *Journal of the Monmouthshire Railway Society*, Summer 1985. Archived at [perma.cc/DDK9-JA6X](https://perma.cc/DDK9-JA6X)
[^8]: Jacqueline Xu. [Online Migrations at Scale](https://stripe.com/blog/online-migrations). *stripe.com*, February 2017. Archived at [perma.cc/ZQY2-EAU2](https://perma.cc/ZQY2-EAU2)
[^9]: Flavio Santos and Robert Stephenson. [Changing the Wheels on a Moving Bus --- Spotify's Event Delivery Migration](https://engineering.atspotify.com/2021/10/changing-the-wheels-on-a-moving-bus-spotify-event-delivery-migration). *engineering.atspotify.com*, October 2021. Archived at [perma.cc/5C4V-G8EV](https://perma.cc/5C4V-G8EV)
[^10]: Molly Bartlett Dishman and Martin Fowler. [Agile Architecture](https://www.youtube.com/watch?v=VjKYO6DP3fo&list=PL055Epbe6d5aFJdvWNtTeg_UEHZEHdInE). At *O'Reilly Software Architecture Conference*, March 2015.
[^11]: Nathan Marz and James Warren. [*Big Data: Principles and Best Practices of Scalable Real-Time Data Systems*](https://www.manning.com/books/big-data). Manning, 2015. ISBN: 978-1-617-29034-3
[^12]: Jay Kreps. [Questioning the Lambda Architecture](https://www.oreilly.com/ideas/questioning-the-lambda-architecture). *oreilly.com*, July 2014. Archived at [perma.cc/PGH6-XUCH](https://perma.cc/PGH6-XUCH)
[^13]: Raul Castro Fernandez, Peter Pietzuch, Jay Kreps, Neha Narkhede, Jun Rao, Joel Koshy, Dong Lin, Chris Riccomini, and Guozhang Wang. [Liquid: Unifying Nearline and Offline Big Data Integration](https://www.cidrdb.org/cidr2015/Papers/CIDR15_Paper25u.pdf). At *7th Biennial Conference on Innovative Data Systems Research* (CIDR), January 2015.
[^14]: Dennis M. Ritchie and Ken Thompson. [The UNIX Time-Sharing System](https://web.eecs.utk.edu/~qcao1/cs560/papers/paper-unix.pdf). *Communications of the ACM*, volume 17, issue 7, pages 365--375, July 1974. [doi:10.1145/361011.361061](https://doi.org/10.1145/361011.361061)
[^15]: Wes McKinney. [The Road to Composable Data Systems: Thoughts on the Last 15 Years and the Future](https://wesmckinney.com/blog/looking-back-15-years/). *wesmckinney.com*, September 2023. Archived at [perma.cc/J9SJ-886N](https://perma.cc/J9SJ-886N)
[^16]: Eric A. Brewer and Joseph M. Hellerstein. [CS262a: Advanced Topics in Computer Systems](https://people.eecs.berkeley.edu/~brewer/cs262/systemr.html). Lecture notes, University of California, Berkeley, *cs.berkeley.edu*, August 2011. Archived at [perma.cc/TE79-LGWU](https://perma.cc/TE79-LGWU)
[^17]: Michael Stonebraker. [The Case for Polystores](https://wp.sigmod.org/?p=1629). *wp.sigmod.org*, July 2015. Archived at [perma.cc/G7J2-KR45](https://perma.cc/G7J2-KR45)
[^18]: Jennie Duggan, Aaron J. Elmore, Michael Stonebraker, Magda Balazinska, Bill Howe, Jeremy Kepner, Sam Madden, David Maier, Tim Mattson, and Stan Zdonik. [The BigDAWG Polystore System](https://sigmod.org/publications/sigmodRecord/1506/pdfs/04_vision_Duggan.pdf). *ACM SIGMOD Record*, volume 44, issue 2, pages 11--16, June 2015. [doi:10.1145/2814710.2814713](https://doi.org/10.1145/2814710.2814713)
[^19]: David B. Lomet, Alan Fekete, Gerhard Weikum, and Mike Zwilling. [Unbundling Transaction Services in the Cloud](https://arxiv.org/pdf/0909.1768). At *4th Biennial Conference on Innovative Data Systems Research* (CIDR), January 2009.
[^20]: Martin Kleppmann and Jay Kreps. [Kafka, Samza and the Unix Philosophy of Distributed Data](https://martin.kleppmann.com/papers/kafka-debull15.pdf). *IEEE Data Engineering Bulletin*, volume 38, issue 4, pages 4--14, December 2015.
[^21]: John Hugg. [Winning Now and in the Future: Where Volt Active Data Shines](https://www.voltactivedata.com/blog/2016/03/winning-now-future-voltdb-shines/). *voltactivedata.com*, March 2016. Archived at [perma.cc/44MP-3MWM](https://perma.cc/44MP-3MWM)
[^22]: Felienne Hermans. [Spreadsheets Are Code](https://vimeo.com/145492419). At *Code Mesh*, November 2015.
[^23]: Dan Bricklin and Bob Frankston. [VisiCalc: Information from Its Creators](http://danbricklin.com/visicalc.htm). *danbricklin.com*. Archived at [archive.org](https://web.archive.org/web/20250905040530/http://danbricklin.com/visicalc.htm)
[^24]: D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, and Michael Young. [Machine Learning: The High-Interest Credit Card of Technical Debt](https://research.google.com/pubs/archive/43146.pdf). At *NIPS Workshop on Software Engineering for Machine Learning* (SE4ML), December 2014. Archived at
[^25]: Peter Bailis, Alan Fekete, Michael J. Franklin, Ali Ghodsi, Joseph M. Hellerstein, and Ion Stoica. [Feral Concurrency Control: An Empirical Investigation of Modern Application Integrity](http://www.bailis.org/papers/feral-sigmod2015.pdf). At *ACM International Conference on Management of Data* (SIGMOD), June 2015. [doi:10.1145/2723372.2737784](https://doi.org/10.1145/2723372.2737784)
[^26]: Guy Steele. [Re: Need for Macros (Was Re: Icon)](https://people.csail.mit.edu/gregs/ll1-discuss-archive-html/msg01134.html). email to *ll1-discuss* mailing list, *people.csail.mit.edu*, December 2001. Archived at [perma.cc/K9X8-CJ65](https://perma.cc/K9X8-CJ65)
[^27]: Ben Stopford. [Microservices in a Streaming World](https://www.infoq.com/presentations/microservices-streaming). At *QCon London*, March 2016.
[^28]: Adam Bellemare. [*Building Event-Driven Microservices, 2nd Edition*](https://learning.oreilly.com/library/view/building-event-driven-microservices/9798341622180/). O'Reilly Media, 2025.
[^29]: Christian Posta. [Why Microservices Should Be Event Driven: Autonomy vs Authority](https://blog.christianposta.com/microservices/why-microservices-should-be-event-driven-autonomy-vs-authority/). *blog.christianposta.com*, May 2016. Archived at [perma.cc/E6N9-3X92](https://perma.cc/E6N9-3X92)
[^30]: Alex Feyerke. [Designing Offline-First Web Apps](https://alistapart.com/article/offline-first/). *alistapart.com*, December 2013. Archived at [perma.cc/WH7R-S2DS](https://perma.cc/WH7R-S2DS)
[^31]: Martin Kleppmann. [Turning the Database Inside-out with Apache Samza](https://martin.kleppmann.com/2015/03/04/turning-the-database-inside-out.html). At *Strange Loop*, September 2014. Archived at [perma.cc/U6E8-A9MT](https://perma.cc/U6E8-A9MT)
[^32]: Sebastian Burckhardt, Daan Leijen, Jonathan Protzenko, and Manuel Fähndrich. [Global Sequence Protocol: A Robust Abstraction for Replicated Shared State](https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ECOOP.2015.568). At *29th European Conference on Object-Oriented Programming* (ECOOP), July 2015. [doi:10.4230/LIPIcs.ECOOP.2015.568](https://doi.org/10.4230/LIPIcs.ECOOP.2015.568)
[^33]: Evan Czaplicki and Stephen Chong. [Asynchronous Functional Reactive Programming for GUIs](https://people.seas.harvard.edu/~chong/pubs/pldi13-elm.pdf). At *34th ACM SIGPLAN Conference on Programming Language Design and Implementation* (PLDI), June 2013. [doi:10.1145/2491956.2462161](https://doi.org/10.1145/2491956.2462161)
[^34]: Eno Thereska, Damian Guy, Michael Noll, and Neha Narkhede. [Unifying Stream Processing and Interactive Queries in Apache Kafka](https://www.confluent.io/blog/unifying-stream-processing-and-interactive-queries-in-apache-kafka/). *confluent.io*, October 2016. Archived at [perma.cc/W8JG-EAZF](https://perma.cc/W8JG-EAZF)
[^35]: Frank McSherry. [Dataflow as Database](https://github.com/frankmcsherry/blog/blob/master/posts/2016-07-17.md). *github.com*, July 2016. Archived at [perma.cc/384D-DUFH](https://perma.cc/384D-DUFH)
[^36]: Peter Alvaro. [I See What You Mean](https://www.youtube.com/watch?v=R2Aa4PivG0g). At *Strange Loop*, September 2015.
[^37]: Nathan Marz. [Trident: A High-Level Abstraction for Realtime Computation](https://blog.x.com/engineering/en_us/a/2012/trident-a-high-level-abstraction-for-realtime-computation). *blog.x.com*, August 2012. Archived at [archive.org](https://web.archive.org/web/20250515030808/https://blog.x.com/engineering/en_us/a/2012/trident-a-high-level-abstraction-for-realtime-computation)
[^38]: Edi Bice. [Low Latency Web Scale Fraud Prevention with Apache Samza, Kafka and Friends](https://www.slideshare.net/slideshow/extremely-low-latency-web-scale-fraud-prevention-with-apache-samza-kafka-and-friends/57068078). At *Merchant Risk Council MRC Vegas Conference*, March 2016. Archived at [perma.cc/T3H5-QN3R](https://perma.cc/T3H5-QN3R)
[^39]: Charity Majors. [The Accidental DBA](https://charity.wtf/2016/10/02/the-accidental-dba/). *charity.wtf*, October 2016. Archived at [perma.cc/6ANP-ARB6](https://perma.cc/6ANP-ARB6)
[^40]: Arthur J. Bernstein, Philip M. Lewis, and Shiyong Lu. [Semantic Conditions for Correctness at Different Isolation Levels](https://dsf.berkeley.edu/cs286/papers/isolation-icde2000.pdf). At *16th International Conference on Data Engineering* (ICDE), February 2000. [doi:10.1109/ICDE.2000.839387](https://doi.org/10.1109/ICDE.2000.839387)
[^41]: Sudhir Jorwekar, Alan Fekete, Krithi Ramamritham, and S. Sudarshan. [Automating the Detection of Snapshot Isolation Anomalies](https://www.vldb.org/conf/2007/papers/industrial/p1263-jorwekar.pdf). At *33rd International Conference on Very Large Data Bases* (VLDB), September 2007.
[^42]: Kyle Kingsbury. [Jepsen: Distributed Systems Safety Research](https://jepsen.io/). *jepsen.io*.
[^43]: Michael Jouravlev. [Redirect After Post](https://www.theserverside.com/news/1365146/Redirect-After-Post). *theserverside.com*, August 2004. Archived at [archive.org](https://web.archive.org/web/20250904205736/https://www.theserverside.com/news/1365146/Redirect-After-Post)
[^44]: Jerome H. Saltzer, David P. Reed, and David D. Clark. [End-to-End Arguments in System Design](https://groups.csail.mit.edu/ana/Publications/PubPDFs/End-to-End%20Arguments%20in%20System%20Design.pdf). *ACM Transactions on Computer Systems*, volume 2, issue 4, pages 277--288, November 1984. [doi:10.1145/357401.357402](https://doi.org/10.1145/357401.357402)
[^45]: Peter Bailis, Alan Fekete, Michael J. Franklin, Ali Ghodsi, Joseph M. Hellerstein, and Ion Stoica. [Coordination Avoidance in Database Systems](https://arxiv.org/abs/1402.2237). *Proceedings of the VLDB Endowment*, volume 8, issue 3, pages 185--196, November 2014. [doi:10.14778/2735508.2735509](https://doi.org/10.14778/2735508.2735509)
[^46]: Alex Yarmula. [Strong Consistency in Manhattan](https://blog.x.com/engineering/en_us/a/2016/strong-consistency-in-manhattan). *blog.x.com*, March 2016. Archived at [archive.org](https://web.archive.org/web/20250713175819/https://blog.x.com/engineering/en_us/a/2016/strong-consistency-in-manhattan)
[^47]: Martin Kleppmann, Alastair R. Beresford, and Boerge Svingen. [Online Event Processing: Achieving consistency where distributed transactions have failed](https://martin.kleppmann.com/papers/olep-cacm.pdf). *Communications of the ACM*, volume 62, issue 5, pages 43--49, May 2019. [doi:10.1145/3312527](https://doi.org/10.1145/3312527)
[^48]: Jim Gray. [The Transaction Concept: Virtues and Limitations](https://jimgray.azurewebsites.net/papers/thetransactionconcept.pdf). At *7th International Conference on Very Large Data Bases* (VLDB), September 1981. Archived at [perma.cc/8VPT-N5H6](https://perma.cc/8VPT-N5H6)
[^49]: Hector Garcia-Molina and Kenneth Salem. [Sagas](https://www.cs.cornell.edu/andru/cs711/2002fa/reading/sagas.pdf). At *ACM International Conference on Management of Data* (SIGMOD), May 1987. [doi:10.1145/38713.38742](https://doi.org/10.1145/38713.38742)
[^50]: Annamalai Gurusami and Daniel Price. [Bug #73170: Duplicates in Unique Secondary Index Because of Fix of Bug#68021](https://bugs.mysql.com/bug.php?id=73170). *bugs.mysql.com*, July 2014. Archived at [perma.cc/P6BV-W7JJ](https://perma.cc/P6BV-W7JJ)
[^51]: Gary Fredericks. [Postgres Serializability Bug](https://github.com/gfredericks/pg-serializability-bug). *github.com*, September 2015. Archived at [perma.cc/N8UP-2822](https://perma.cc/N8UP-2822)
[^52]: Xiao Chen. [HDFS DataNode Scanners and Disk Checker Explained](https://www.cloudera.com/blog/technical/hdfs-datanode-scanners-and-disk-checker-explained.html). *blog.cloudera.com*, December 2016. Archived at [perma.cc/6S36-X98L](https://perma.cc/6S36-X98L)
[^53]: Daniel Persson. [How does Ceph scrubbing work?](https://www.youtube.com/watch?v=M9QGMoc3GU8) *youtube.com*, March 2022.
[^54]: Jay Kreps. [Getting Real About Distributed System Reliability](https://blog.empathybox.com/post/19574936361/getting-real-about-distributed-system-reliability). *blog.empathybox.com*, March 2012. Archived at [perma.cc/9B5Q-AEBW](https://perma.cc/9B5Q-AEBW)
[^55]: Martin Fowler. [The LMAX Architecture](https://martinfowler.com/articles/lmax.html). *martinfowler.com*, July 2011. Archived at [perma.cc/5AV4-N6RJ](https://perma.cc/5AV4-N6RJ)
[^56]: Sam Stokes. [Move Fast with Confidence](https://five-eights.com/2016/07/11/move-fast-with-confidence/). *five-eights.com*, July 2016. Archived at [perma.cc/J8C6-DHXB](https://perma.cc/J8C6-DHXB)
[^57]: Ralph C. Merkle. [A Digital Signature Based on a Conventional Encryption Function](https://people.eecs.berkeley.edu/~raluca/cs261-f15/readings/merkle.pdf). At *CRYPTO '87*, August 1987. [doi:10.1007/3-540-48184-2_32](https://doi.org/10.1007/3-540-48184-2_32)
[^58]: Ben Laurie. [Certificate Transparency](https://queue.acm.org/detail.cfm?id=2668154). *ACM Queue*, volume 12, issue 8, pages 10--19, August 2014. [doi:10.1145/2668152.2668154](https://doi.org/10.1145/2668152.2668154)
[^59]: Mark D. Ryan. [Enhanced Certificate Transparency and End-to-End Encrypted Mail](https://www.ndss-symposium.org/wp-content/uploads/2017/09/12_2_1.pdf). At *Network and Distributed System Security Symposium* (NDSS), February 2014. [doi:10.14722/ndss.2014.23379](https://doi.org/10.14722/ndss.2014.23379)
================================================
FILE: content/en/ch14.md
================================================
---
title: "14. Doing the Right Thing"
weight: 314
breadcrumbs: false
---

> *Feeding AI systems on the world's beauty, ugliness, and cruelty, but expecting it to reflect only
> the beauty is a fantasy.*
>
> Vinay Uday Prabhu and Abeba Birhane, *Large Datasets: A Pyrrhic Win for Computer Vision?* (2020)

> [!TIP] A NOTE FOR EARLY RELEASE READERS
> With Early Release ebooks, you get books in their earliest form---the author's raw and unedited
> content as they write---so you can take advantage of these technologies long before the official
> release of these titles.
>
> This will be the 14th chapter of the final book. The GitHub repo for this book is
> [https://github.com/ept/ddia2-feedback](https://github.com/ept/ddia2-feedback).
>
> If you'd like to be actively involved in reviewing and commenting on this draft, please reach out on GitHub.
In the final chapter of this book, let's take a step back. Throughout this book we have examined a
wide range of different architectures for data systems, evaluated their pros and cons, and explored
techniques for building reliable, scalable, and maintainable applications. However, we have left out
an important and fundamental part of the discussion, which we should now fill in.
Every system is built for a purpose; every action we take has both intended and unintended
consequences. The purpose may be as simple as making money, but the consequences for the world may
reach far beyond that original purpose. We, the engineers building these systems, have a
responsibility to carefully consider those consequences and to consciously decide what kind of world
we want to live in.
We talk about data as an abstract thing, but remember that many datasets are about people: their
behavior, their interests, their identity. We must treat such data with humanity and respect. Users
are humans too, and human dignity is paramount [^1].
Software development increasingly involves making important ethical choices. There are guidelines to
help software engineers navigate these issues, such as the ACM Code of Ethics and Professional
Conduct [^2], but they are rarely discussed, applied, or enforced in practice. As a
result, engineers and product managers sometimes take a very cavalier attitude to privacy and
potential negative consequences of their products [^3], [^4].
A technology is not good or bad in itself---what matters is how it is used and how it affects
people. This is true for a software system like a search engine in much the same way as it is for a
weapon like a gun. It is not sufficient for software engineers to focus exclusively on the technology
and ignore its consequences: the ethical responsibility is also ours to bear. Reasoning about ethics
is difficult, but it is too important to ignore.
However, what makes something "good" or "bad" is not well-defined, and most people in computing
don't even discuss that question [^5]. In contrast to much of computing, the concepts at
the heart of ethics are not fixed or determinate in their precise meaning, and they require
interpretation, which may be subjective [^6]. Ethics is not going through some checklist
to confirm you comply; it's a participatory and iterative process of reflection, in dialog with the
people involved, with accountability for the results [^7].
## Predictive Analytics {#id369}
For example, predictive analytics is a major part of why people are excited about big data and AI.
Using data analysis to predict the weather, or the spread of diseases, is one thing [^8];
it is another matter to predict whether a convict is likely to reoffend, whether an applicant for a
loan is likely to default, or whether an insurance customer is likely to make expensive claims
[^9]. The latter have a direct effect on individual people's lives.
Naturally, payment networks want to prevent fraudulent transactions, banks want to avoid bad loans,
airlines want to avoid hijackings, and companies want to avoid hiring ineffective or untrustworthy
people. From their point of view, the cost of a missed business opportunity is low, but the cost of
a bad loan or a problematic employee is much higher, so it is natural for organizations to want to
be cautious. If in doubt, they are better off saying no.
However, as algorithmic decision-making becomes more widespread, someone who has (accurately or
falsely) been labeled as risky by some algorithm may suffer a large number of those "no" decisions.
Systematically being excluded from jobs, air travel, insurance coverage, property rental, financial
services, and other key aspects of society is such a large constraint on the individual's freedom
that it has been called "algorithmic prison" [^10]. In countries that respect human
rights, the criminal justice system presumes innocence until proven guilty; on the other hand,
automated systems can systematically and arbitrarily exclude a person from participating in society
without any proof of guilt, and with little chance of appeal.
### Bias and Discrimination {#id370}
Decisions made by an algorithm are not necessarily any better or any worse than those made by a
human. Every person is likely to have biases, even if they actively try to counteract them, and
discriminatory practices can become culturally institutionalized. There is hope that basing
decisions on data, rather than subjective and instinctive assessments by people, could be more fair
and give a better chance to people who are often overlooked in the traditional system
[^11].
When we develop predictive analytics and AI systems, we are not merely automating a human's decision
by using software to specify the rules for when to say yes or no; we are even leaving the rules
themselves to be inferred from data. However, the patterns learned by these systems are opaque: even
if there is some correlation in the data, we may not know why. If there is a systematic bias in the
input to an algorithm, the system will most likely learn and amplify that bias in its output
[^12].
In many countries, anti-discrimination laws prohibit treating people differently depending on
protected traits such as ethnicity, age, gender, sexuality, disability, or beliefs. Other features
of a person's data may be analyzed, but what happens if they are correlated with protected traits?
For example, in racially segregated neighborhoods, a person's postal code or even their IP address
is a strong predictor of race. Put like this, it seems ridiculous to believe that an algorithm could
somehow take biased data as input and produce fair and impartial output from it [^13],
[^14]. Yet this belief often seems to be implied by proponents of data-driven decision
making, an attitude that has been satirized as "machine learning is like money laundering for bias"
[^15].
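A toy calculation shows how a proxy variable lets bias through even when the protected trait is never used. The data below is entirely made up for illustration: eight hypothetical applicants in two postal codes, where the decision rule only ever sees the postal code.

```python
from collections import defaultdict

# Hypothetical toy data: (postal_code, protected_group) for eight applicants.
# Postal code 10001 is mostly group A; postal code 20002 is mostly group B.
applicants = [
    ("10001", "A"), ("10001", "A"), ("10001", "A"), ("10001", "B"),
    ("20002", "B"), ("20002", "B"), ("20002", "B"), ("20002", "A"),
]


def approve(postal_code: str) -> bool:
    """A 'group-blind' rule: it only looks at the postal code."""
    return postal_code == "10001"


def approval_rate_by_group(rows: list[tuple[str, str]]) -> dict[str, float]:
    approved: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for postal_code, group in rows:
        total[group] += 1
        approved[group] += int(approve(postal_code))
    return {group: approved[group] / total[group] for group in total}


rates = approval_rate_by_group(applicants)
```

In this made-up dataset, three of the four applicants in group A are approved but only one of the four in group B, even though the rule never sees the group. Removing the protected attribute from the input is therefore not enough to make the output fair.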
Predictive analytics systems merely extrapolate from the past; if the past is discriminatory, they
codify and amplify that discrimination [^16]. If we want the future to be better than the
past, moral imagination is required, and that's something only humans can provide [^17].
Data and models should be our tools, not our masters.
### Responsibility and Accountability {#id371}
Automated decision making raises the question of responsibility and accountability [^17].
If a human makes a mistake, they can be held accountable, and the person affected by the decision
can appeal. Algorithms make mistakes too, but who is accountable if they go wrong [^18]?
When a self-driving car causes an accident, who is responsible? If an automated credit scoring
algorithm systematically discriminates against people of a particular race or religion, is there any
recourse? If a decision by your machine learning system comes under judicial review, can you explain
to the judge how the algorithm made its decision? People should not be able to evade their
responsibility by blaming an algorithm.
Credit rating agencies are an old example of collecting data to make decisions about people. A bad
credit score makes life difficult, but at least a credit score is normally based on relevant facts
about a person's actual borrowing history, and any errors in the record can be corrected (although
the agencies normally do not make this easy). However, scoring algorithms based on machine learning
typically use a much wider range of inputs and are much more opaque, making it harder to understand
how a particular decision has come about and whether someone is being treated in an unfair or
discriminatory way [^19].
A credit score summarizes "How did you behave in the past?" whereas predictive analytics usually
work on the basis of "Who is similar to you, and how did people like you behave in the past?"
Drawing parallels to others' behavior implies stereotyping people, for example based on where they
live (a close proxy for race and socioeconomic class). What about people who get put in the wrong
bucket? Furthermore, if a decision is incorrect due to erroneous data, recourse is almost impossible
[^17].
Much data is statistical in nature, which means that even if the probability distribution on the
whole is correct, individual cases may well be wrong. For example, if the average life expectancy in
your country is 80 years, that doesn't mean you're expected to drop dead on your 80th birthday. From
the average and the probability distribution, you can't say much about the age to which one
particular person will live. Similarly, the output of a prediction system is probabilistic and may
well be wrong in individual cases.
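The gap between an average and an individual outcome can be made concrete with a small sketch. The ages below are invented for illustration (chosen so that the mean is exactly 80); they are not real demographic data.

```python
from statistics import mean

# Hypothetical ages at death for ten people. The numbers are made up so that
# the average comes out at exactly 80 years.
ages_at_death = [52, 67, 74, 79, 81, 84, 86, 88, 92, 97]

average = mean(ages_at_death)

# Suppose we "predict" that every individual lives to the average age.
errors = [abs(age - average) for age in ages_at_death]

# How many people does that prediction get right to within one year?
within_one_year = sum(1 for e in errors if e <= 1)

# How far off is the prediction for the worst-served individual?
worst_error = max(errors)

print(average, within_one_year, worst_error)
```

In this toy dataset the average is correct for the group as a whole, yet predicting the average is within one year for only two of the ten people and is off by 28 years for one of them.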
A blind belief in the supremacy of data for making decisions is not only delusional, it is
positively dangerous. As data-driven decision making becomes more widespread, we will need to figure
out how to make algorithms accountable and transparent, how to avoid reinforcing existing biases,
and how to fix them when they inevitably make mistakes.
We will also need to figure out how to prevent data being used to harm people, and realize its
positive potential instead. For example, analytics can reveal financial and social characteristics
of people's lives. On the one hand, this power could be used to focus aid and support to help those
people who most need it. On the other hand, it is sometimes used by predatory businesses seeking to
identify vulnerable people and sell them risky products such as high-cost loans and worthless
college degrees [^17], [^20].
### Feedback Loops {#id372}
Even with predictive applications that have less immediately far-reaching effects on people, such as
recommendation systems, there are difficult issues that we must confront. When services become good
at predicting what content users want to see, they may end up showing people only opinions they
already agree with, leading to echo chambers in which stereotypes, misinformation, and polarization
can breed. We are already seeing the impact of social media echo chambers on election campaigns.
When predictive analytics affect people's lives, particularly pernicious problems arise due to
self-reinforcing feedback loops. For example, consider the case of employers using credit scores to
evaluate potential hires. You may be a good worker with a good credit score, but suddenly find
yourself in financial difficulties due to a misfortune outside of your control. As you miss payments
on your bills, your credit score suffers, and you will be less likely to find work. Joblessness
pushes you toward poverty, which further worsens your scores, making it even harder to find
employment [^17]. It's a downward spiral due to poisonous assumptions, hidden behind a
camouflage of mathematical rigor and data.
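The downward spiral can be sketched as a toy simulation. Everything in it is an assumption made for illustration: the starting score of 700, the 40-point monthly penalty for missed payments, the 10-point monthly recovery, the floor of 300, the three-month job search, and the hiring threshold of 600 are hypothetical numbers, not a model of any real scoring system.

```python
def simulate(months, employers_check_credit, search_months=3, hiring_threshold=600):
    """Toy model of a person who has just lost their job.

    Returns the (hypothetical) credit score at the end of each month.
    """
    score = 700
    employed = False
    months_unemployed = 0
    history = []
    for _ in range(months):
        if employed:
            # With income, payments resume and the score slowly recovers.
            score = min(700, score + 10)
        else:
            # Without income, bills go unpaid and the score drops.
            months_unemployed += 1
            score = max(300, score - 40)
            offer_ready = months_unemployed >= search_months
            # If employers screen on credit score, a low score blocks the hire.
            passes_screen = (not employers_check_credit) or score >= hiring_threshold
            if offer_ready and passes_screen:
                employed = True
        history.append(score)
    return history


without_screening = simulate(12, employers_check_credit=False)
with_screening = simulate(12, employers_check_credit=True)
print(without_screening[-1], with_screening[-1])
```

Under these assumptions the same misfortune is temporary when employers ignore credit scores, but becomes permanent when they screen on them: the score falls below the threshold before the job search completes, which prevents the income that would have allowed it to recover.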
As another example of a feedback loop, economists found that when gas stations in Germany introduced
algorithmic prices, competition was reduced and prices for consumers went up because the algorithms
learned to collude [^21].
We can't always predict when such feedback loops happen. However, many consequences can be predicted
by thinking about the entire system (not just the computerized parts, but also the people
interacting with it)---an approach known as *systems thinking* [^22]. We can try to
understand how a data analysis system responds to different behaviors, structures, or
characteristics. Does the system reinforce and amplify existing differences between people (e.g.,
making the rich richer or the poor poorer), or does it try to combat injustice? And even with the
best intentions, we must beware of unintended consequences.
## Privacy and Tracking {#id373}
Besides the problems of predictive analytics---i.e., using data to make automated decisions about
people---there are ethical problems with data collection itself. What is the relationship between
the organizations collecting data and the people whose data is being collected?
When a system only stores data that a user has explicitly entered, because they want the system to
store and process it in a certain way, the system is performing a service for the user: the user is
the customer. But when a user's activity is tracked and logged as a side effect of other things they
are doing, the relationship is less clear. The service no longer just does what the user tells it to
do, but it takes on interests of its own, which may conflict with the user's interests.
Tracking behavioral data has become increasingly important for user-facing features of many online
services: tracking which search results are clicked helps improve the ranking of search results;
recommending "people who liked X also liked Y" helps users discover interesting and useful things;
A/B tests and user flow analysis can help indicate how a user interface might be improved. Those
features require some amount of tracking of user behavior, and users benefit from them.
However, depending on a company's business model, tracking often doesn't stop there. If the service
is funded through advertising, the advertisers are the actual customers, and the users' interests
take second place. Tracking data becomes more detailed, analyses become further-reaching, and data
is retained for a long time in order to build up detailed profiles of each person for marketing
purposes.
Now the relationship between the company and the user whose data is being collected starts looking
quite different. The user is given a free service and is coaxed into engaging with it as much as
possible. The tracking of the user serves not primarily that individual, but rather the needs of the
advertisers who are funding the service. This relationship can be appropriately described with a
word that has more sinister connotations: *surveillance*.
### Surveillance {#id374}
As a thought experiment, try replacing the word *data* with *surveillance*, and observe if common
phrases still sound so good [^23]. How about this: "In our surveillance-driven
organization we collect real-time surveillance streams and store them in our surveillance warehouse.
Our surveillance scientists use advanced analytics and surveillance processing in order to derive
new insights."
This thought experiment is unusually polemic for this book, *Designing Surveillance-Intensive
Applications*, but strong words are needed to emphasize this point. In our attempts to make software
"eat the world" [^24], we have built the greatest mass surveillance infrastructure the
world has ever seen. We are rapidly approaching a world in which every inhabited space contains at
least one internet-connected microphone, in the form of smartphones, smart TVs, voice-controlled
assistant devices, baby monitors, and even children's toys that use cloud-based speech recognition.
Many of these devices have a terrible security record [^25].
What is new compared to the past is that digitization has made it easy to collect large amounts of
data about people. Surveillance of our location and movements, our social relationships and
communications, our purchases and payments, and data about our health has become almost
unavoidable. A surveillance organization may end up knowing more about a person than that person
knows about themselves---for example, identifying illnesses or economic problems before the person
themselves is aware of them.
Even the most totalitarian and repressive regimes of the past could only dream of putting a
microphone in every room and forcing every person to constantly carry a device capable of tracking
their location and movements. Yet the benefits that we get from digital technology are so great that
we now voluntarily accept this world of total surveillance. The difference is just that the data is
being collected by corporations to provide us with services, rather than government agencies seeking
control [^26].
Not all data collection necessarily qualifies as surveillance, but examining it as such can help us
understand our relationship with the data collector. Why are we seemingly happy to accept
surveillance by corporations? Perhaps you feel you have nothing to hide---in other words, you are
totally in line with existing power structures, you are not a marginalized minority, and you needn't
fear persecution [^27]. Not everyone is so fortunate. Or perhaps it's because the purpose
seems benign---it's not overt coercion and conformance, but merely better recommendations and more
personalized marketing. However, combined with the discussion of predictive analytics from the last
section, that distinction seems less clear.
We are already seeing behavioral data on car driving, tracked by cars without drivers' consent,
affecting their insurance premiums [^28], and health insurance coverage that depends on
people wearing a fitness tracking device. When surveillance is used to determine things that hold
sway over important aspects of life, such as insurance coverage or employment, it starts to appear
less benign. Moreover, data analysis can reveal surprisingly intrusive things: for example, the
movement sensor in a smartwatch or fitness tracker can be used to work out what you are typing (for
example, passwords) with fairly good accuracy [^29]. Sensor accuracy and algorithms for
analysis are only going to get better.
### Consent and Freedom of Choice {#id375}
We might assert that users voluntarily choose to use a service that tracks their activity, and they
have agreed to the terms of service and privacy policy, so they consent to data collection. We might
even claim that users are receiving a valuable service in return for the data they provide, and that
the tracking is necessary in order to provide the service. Undoubtedly, social networks, search
engines, and various other free online services are valuable to users---but there are problems with
this argument.
First, we should ask in what way the tracking is necessary. Some forms of tracking directly feed
into improving features for users: for example, tracking the click-through rate on search results
can help improve a search engine's result ranking and relevance, and tracking which products
customers tend to buy together can help an online shop suggest related products. However, when
tracking user interaction for content recommendations, or to build user profiles for advertising
purposes, it is less clear whether this is genuinely in the user's interest---or is it only
necessary because the ads pay for the service?
Second, users have little knowledge of what data they are feeding into our databases, or how it is
retained and processed---and most privacy policies do more to obscure than to illuminate. Without
understanding what happens to their data, users cannot give any meaningful consent. Often, data from
one user also says things about other people who are not users of the service and who have not
agreed to any terms. The derived datasets that we discussed in this part of the book---in which data
from the entire user base may have been combined with behavioral tracking and external data
sources---are precisely the kinds of data of which users cannot have any meaningful understanding.
Moreover, data is extracted from users through a one-way process, not a relationship with true
reciprocity, and not a fair value exchange. There is no dialog, no option for users to negotiate how
much data they provide and what service they receive in return: the relationship between the service
and the user is very asymmetric and one-sided. The terms are set by the service, not by the user
[^30], [^31].
In the European Union, the *General Data Protection Regulation* (GDPR) requires that consent must be
"freely given, specific, informed, and unambiguous", and that the user must be able to "refuse or
withdraw consent without detriment"---otherwise it is not considered "freely given". Any request for
consent must be written "in an intelligible and easily accessible form, using clear and plain
language". Moreover, "silence, pre-ticked boxes or inactivity \[do not\] constitute consent"
[^32]. There are other bases for lawful processing of personal data besides consent, such
as *legitimate interest*, which permits certain uses of data such as fraud prevention
[^33].
You might argue that a user who does not consent to surveillance can simply choose not to use a
service. But this choice is not free either: if a service is so popular that it is "regarded by most
people as essential for basic social participation" [^30], then it is not reasonable to
expect people to opt out of this service---using it is *de facto* mandatory. For example, in most
Western social communities, it has become the norm to carry a smartphone, to use social networks for
socializing, and to use Google for finding information. Especially when a service has network
effects, there is a social cost to people choosing *not* to use it.
Declining to use a service due to its user tracking policies is easier said than done. These
platforms are designed specifically to engage users. Many use game mechanics and tactics common in
gambling to keep users coming back [^34]. Even if a user gets past this, declining to
engage is only an option for the small number of people who are privileged enough to have the time
and knowledge to understand its privacy policy, and who can afford to potentially miss out on social
participation or professional opportunities that may have arisen if they had participated in the
service. For people in a less privileged position, there is no meaningful freedom of choice:
surveillance becomes inescapable.
### Privacy and Use of Data {#id457}
Sometimes people claim that "privacy is dead" on the grounds that some users are willing to post all
sorts of things about their lives to social media, sometimes mundane and sometimes deeply personal.
However, this claim is false and rests on a misunderstanding of the word *privacy*.
Having privacy does not mean keeping everything secret; it means having the freedom to choose which
things to reveal to whom, what to make public, and what to keep secret. The right to privacy is a
decision right: it enables each person to decide where they want to be on the spectrum between
secrecy and transparency in each situation [^30]. It is an important aspect of a person's
freedom and autonomy.
For example, someone who suffers from a rare medical condition might be very happy to provide their
private medical data to researchers if there is a chance that it might help the development of
treatments for their condition. However, the important thing is that this person has a choice over
who may access this data, and for what purpose. If there was a risk that information about their
medical condition would harm their access to medical insurance or employment or other important
things, this person would probably be much more cautious about sharing their data.
When data is extracted from people through surveillance infrastructure, privacy rights are not
necessarily eroded, but rather transferred to the data collector. Companies that acquire data
essentially say "trust us to do the right thing with your data," which means that the right to
decide what to reveal and what to keep secret is transferred from the individual to the company.
The companies in turn choose to keep much of the outcome of this surveillance secret, because to
reveal it would be perceived as creepy, and would harm their business model (which relies on knowing
more about people than other companies do). Intimate information about users is only revealed
indirectly, for example in the form of tools for targeting advertisements to specific groups of
people (such as those suffering from a particular illness).
Even if particular users cannot be personally reidentified from the bucket of people targeted by a
particular ad, they have lost their agency about the disclosure of some intimate information. It is
not the user who decides what is revealed to whom on the basis of their personal preferences---it is
the company that exercises the privacy right with the goal of maximizing its profit.
Many companies have a goal of not being *perceived* as creepy---avoiding the question of how
intrusive their data collection actually is, and instead focusing on managing user perceptions. And
even these perceptions are often managed poorly: for example, something may be factually correct,
but if it triggers painful memories, the user may not want to be reminded about it [^35].
With any kind of data we should expect the possibility that it is wrong, undesirable, or
inappropriate in some way, and we need to build mechanisms for handling those failures. Whether
something is "undesirable" or "inappropriate" is of course down to human judgment; algorithms are
oblivious to such notions unless we explicitly program them to respect human needs. As engineers of
these systems we must be humble, accepting and planning for such failings.
Privacy settings that allow a user of an online service to control which aspects of their data other
users can see are a starting point for handing back some control to users. However, regardless of
the setting, the service itself still has unfettered access to the data, and is free to use it in
any way permitted by the privacy policy. Even if the service promises not to sell the data to third
parties, it usually grants itself unrestricted rights to process and analyze the data internally,
often going much further than what is overtly visible to users.
This kind of large-scale transfer of privacy rights from individuals to corporations is historically
unprecedented [^30]. Surveillance has always existed, but it used to be expensive and
manual, not scalable and automated. Trust relationships have always existed, for example between a
patient and their doctor, or between a defendant and their attorney---but in these cases the use of
data has been strictly governed by ethical, legal, and regulatory constraints. Internet services
have made it much easier to amass huge amounts of sensitive information without meaningful consent,
and to use it at massive scale without users understanding what is happening to their private data.
### Data as Assets and Power {#id376}
Since behavioral data is a byproduct of users interacting with a service, it is sometimes called
"data exhaust"---suggesting that the data is worthless waste material. Viewed this way, behavioral
and predictive analytics can be seen as a form of recycling that extracts value from data that would
have otherwise been thrown away.
It would be more accurate to view it the other way round: from an economic point of view, if targeted
advertising is what pays for a service, then the user activity that generates behavioral data could
be regarded as a form of labor [^36]. One could go even further and argue that the
application with which the user interacts is merely a means to lure users into feeding more and more
personal information into the surveillance infrastructure [^30]. The delightful human
creativity and social relationships that often find expression in online services are cynically
exploited by the data extraction machine.
Personal data is a valuable asset, as evidenced by the existence of data brokers, a shady industry
operating in secrecy, purchasing, aggregating, analyzing, inferring, and reselling intrusive
personal data about people, mostly for marketing purposes [^20]. Startups are valued by
their user numbers, by "eyeballs"---i.e., by their surveillance capabilities.
Because the data is valuable, many people want it. Of course companies want it---that's why they
collect it in the first place. But governments want to obtain it too: by means of secret deals,
coercion, legal compulsion, or simply stealing it [^37]. When a company goes bankrupt, the
personal data it has collected is one of the assets that gets sold. Moreover, the data is difficult
to secure, so breaches happen disconcertingly often.
These observations have led critics to say that data is not just an asset, but a "toxic asset"
[^37], or at least "hazardous material" [^38]. Maybe data is not the new gold,
nor the new oil, but rather the new uranium [^39]. Even if we think that we are capable of
preventing abuse of data, whenever we collect data, we need to balance the benefits with the risk of
it falling into the wrong hands: computer systems may be compromised by criminals or hostile foreign
intelligence services, data may be leaked by insiders, the company may fall into the hands of
unscrupulous management that does not share our values, or the country may be taken over by a regime
that has no qualms about compelling us to hand over the data.
When collecting data, we need to consider not just today's political environment, but all possible
future governments. There is no guarantee that every government elected in future will respect human
rights and civil liberties, so "it is poor civic hygiene to install technologies that could someday
facilitate a police state" [^40].
"Knowledge is power," as the old adage goes. And furthermore, "to scrutinize others while avoiding
scrutiny oneself is one of the most important forms of power" [^41]. This is why
totalitarian governments want surveillance: it gives them the power to control the population.
Although today's technology companies are not overtly seeking political power, the data and
knowledge they have accumulated nevertheless gives them a lot of power over our lives, much of which
is surreptitious, outside of public oversight [^42].
### Remembering the Industrial Revolution {#id377}
Data is the defining feature of the information age. The internet, data storage, processing, and
software-driven automation are having a major impact on the global economy and human society. As our
daily lives and social organization have been changed by information technology, and will probably
continue to radically change in the coming decades, comparisons to the Industrial Revolution come to
mind [^17], [^26].
The Industrial Revolution came about through major technological and agricultural advances, and it
brought sustained economic growth and significantly improved living standards in the long run. Yet
it also came with major problems: pollution of the air (due to smoke and chemical processes) and the
water (from industrial and human waste) was dreadful. Factory owners lived in splendor, while urban
workers often lived in very poor housing and worked long hours in harsh conditions. Child labor was
common, including dangerous and poorly paid work in mines.
It took a long time before safeguards were established, such as environmental protection
regulations, safety protocols for workplaces, outlawing child labor, and health inspections for
food. Undoubtedly the cost of doing business increased when factories were no longer allowed to dump
their waste into rivers, sell tainted foods, or exploit workers. But society as a whole benefited
hugely from these regulations, and few of us would want to return to a time before [^17].
Just as the Industrial Revolution had a dark side that needed to be managed, our transition to the
information age has major problems that we need to confront and solve [^43], [^44].
The collection and use of data is one of those problems. In the words of Bruce Schneier
[^26]:
> Data is the pollution problem of the information age, and protecting privacy is the environmental
> challenge. Almost all computers produce information. It stays around, festering. How we deal with
> it---how we contain it and how we dispose of it---is central to the health of our information
> economy. Just as we look back today at the early decades of the industrial age and wonder how our
> ancestors could have ignored pollution in their rush to build an industrial world, our
> grandchildren will look back at us during these early decades of the information age and judge us
> on how we addressed the challenge of data collection and misuse.
>
> We should try to make them proud.
### Legislation and Self-Regulation {#sec_future_legislation}
Data protection laws might be able to help preserve individuals' rights. For example, the European
GDPR states that personal data must be "collected for specified, explicit and legitimate purposes
and not further processed in a manner that is incompatible with those purposes", and furthermore
that data must be "adequate, relevant and limited to what is necessary in relation to the purposes
for which they are processed" [^32].
However, this principle of *data minimization* runs directly counter to the philosophy of Big Data,
which is to maximize data collection, to combine it with other datasets, to experiment and to
explore in order to generate new insights. Exploration means using data for unforeseen purposes,
which is the opposite of the "specified and explicit" purposes for which the data must have been
collected. While the GDPR has had some effect on the online advertising industry [^45],
the regulation has been weakly enforced [^46], and it does not seem to have led to much of
a change in culture and practices across the wider tech industry.
Companies that collect lots of data about people oppose regulation as being a burden and a hindrance
to innovation. To some extent that opposition is justified. For example, when sharing medical data,
there are clear risks to privacy, but there are also potential opportunities: how many deaths could
be prevented if data analysis was able to help us achieve better diagnostics or find better
treatments [^47]? Over-regulation may prevent such breakthroughs. It is difficult to
balance such potential opportunities with the risks [^41].
Fundamentally, we need a culture shift in the tech industry with regard to personal data. We should
stop regarding users as metrics to be optimized, and remember that they are humans who deserve
respect, dignity, and agency. We should self-regulate our data collection and processing practices
in order to establish and maintain the trust of the people who depend on our software
[^48]. And we should take it upon ourselves to educate end users about how their data is
used, rather than keeping them in the dark.
We should allow each individual to maintain their privacy---i.e., their control over their own data---and
not steal that control from them through surveillance. Our individual right to control our data is
like the natural environment of a national park: if we don't explicitly protect and care for it, it
will be destroyed. It will be the tragedy of the commons, and we will all be worse off for it.
Ubiquitous surveillance is not inevitable---we are still able to stop it.
As a first step, we should not retain data forever, but purge it as soon as it is no longer needed,
and minimize what we collect in the first place [^48], [^49]. Data you don't have is
data that can't be leaked, stolen, or compelled by governments to be handed over. Overall, culture
and attitude changes will be necessary. As people working in technology, if we don't consider the
societal impact of our work, we're not doing our job [^50].
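As a minimal sketch of what purging and minimization could look like in code, the following filters stored events against a per-purpose retention period. The `Event` type, the purpose names, and the retention periods are hypothetical, chosen only for illustration; appropriate values depend on the application and on applicable law.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Event:
    user_id: str
    purpose: str          # why this record was collected
    recorded_at: datetime


# Hypothetical retention periods, one per explicitly declared purpose.
RETENTION = {
    "fraud_prevention": timedelta(days=90),
    "search_ranking": timedelta(days=30),
}


def purge_expired(events, now):
    """Return only the events that may still be kept.

    An event is dropped if its retention period has elapsed, or if it was
    collected for a purpose that has no declared retention period at all.
    """
    kept = []
    for event in events:
        limit = RETENTION.get(event.purpose)
        if limit is not None and now - event.recorded_at <= limit:
            kept.append(event)
    return kept


now = datetime(2025, 1, 31, tzinfo=timezone.utc)
events = [
    Event("u1", "search_ranking", now - timedelta(days=10)),
    Event("u2", "search_ranking", now - timedelta(days=45)),
    Event("u3", "fraud_prevention", now - timedelta(days=45)),
    Event("u4", "ad_profile", now - timedelta(days=1)),
]
kept = purge_expired(events, now)
print([event.user_id for event in kept])
```

Treating an undeclared purpose as "do not retain" is a deliberate design choice in this sketch: it makes keeping data the exception that has to be justified, rather than the default.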
## Summary {#id594}
This brings us to the end of the book. We have covered a lot of ground:
- In [Chapter 1](/en/ch1#ch_tradeoffs) we contrasted analytical and operational systems, compared
the cloud to self-hosting, weighed up distributed and single-node systems, and discussed balancing
the needs of your business with the needs of your users.
- In [Chapter 2](/en/ch2#ch_nonfunctional) we saw how to define several nonfunctional requirements
such as performance, reliability, scalability, and maintainability.
- In [Chapter 3](/en/ch3#ch_datamodels) we explored a spectrum of data models, including the
relational, document, and graph models, event sourcing, and DataFrames. We also looked at examples
of various query languages, including SQL, Cypher, SPARQL, Datalog, and GraphQL.
- In [Chapter 4](/en/ch4#ch_storage) we discussed storage engines for OLTP (LSM-trees and B-trees),
for analytics (column-oriented storage), and indexes for information retrieval (full-text and
vector search).
- In [Chapter 5](/en/ch5#ch_encoding) we examined different ways of encoding data objects as bytes,
and how to support evolution as requirements change. We also compared several ways in which data flows
between processes: via databases, service calls, workflow engines, or event-driven architectures.
- In [Chapter 6](/en/ch6#ch_replication) we studied the trade-offs between single-leader,
multi-leader, and leaderless replication. We also looked at consistency models such as
read-after-write consistency, and sync engines that allow clients to work offline.
- In [Chapter 7](/en/ch7#ch_sharding) we went into sharding, including strategies for rebalancing,
request routing, and secondary indexing.
- In [Chapter 8](/en/ch8#ch_transactions) we covered transactions: durability, how various isolation
levels (read committed, snapshot isolation, and serializable) can be achieved, and how atomicity
can be ensured in distributed transactions.
- In [Chapter 9](/en/ch9#ch_distributed) we surveyed fundamental problems that occur in distributed
systems (network faults and delays, clock errors, process pauses, crashes), and saw how they make
it difficult to correctly implement even something seemingly simple like a lock.
- In [Chapter 10](/en/ch10#ch_consistency) we went on a deep dive into various forms of consensus
and the consistency model (linearizability) it enables.
- In [Chapter 11](/en/ch11#ch_batch) we dug into batch processing, building up from simple chains of
Unix tools to large-scale distributed batch processors using distributed filesystems or object
stores.
- In [Chapter 12](/en/ch12#ch_stream) we generalized batch processing to stream processing,
discussed the underlying message brokers, change data capture, fault tolerance, and processing
patterns such as streaming joins.
- In [Chapter 13](/en/ch13#ch_philosophy) we explored a philosophy of streaming systems that allows
disparate data systems to be integrated, systems to be evolved, and applications to be scaled more
easily.
Finally, in this last chapter, we took a step back and examined some ethical aspects of building
data-intensive applications. We saw that although data can be used to do good, it can also do
significant harm: making decisions that seriously affect people's lives and are difficult to appeal
against, leading to discrimination and exploitation, normalizing surveillance, and exposing intimate
information. We also run the risk of data breaches, and we may find that a well-intentioned use of
data has unintended consequences.
As software and data are having such a large impact on the world, we as engineers must remember that
we carry a responsibility to work toward the kind of world that we want to live in: a world that
treats people with humanity and respect. Let's work together towards that goal.
### References {#references}
[^1]: David Schmudde. [What If Data Is a Bad Idea?](https://schmud.de/posts/2024-08-18-data-is-a-bad-idea.html). *schmud.de*, August 2024. Archived at [perma.cc/ZXU5-XMCT](https://perma.cc/ZXU5-XMCT)
[^2]: [ACM Code of Ethics and Professional Conduct](https://www.acm.org/code-of-ethics). Association for Computing Machinery, *acm.org*, 2018. Archived at [perma.cc/SEA8-CMB8](https://perma.cc/SEA8-CMB8)
[^3]: Igor Perisic. [Making Hard Choices: The Quest for Ethics in Machine Learning](https://www.linkedin.com/blog/engineering/archive/making-hard-choices-the-quest-for-ethics-in-machine-learning). *linkedin.com*, November 2016. Archived at [perma.cc/DGF8-KNT7](https://perma.cc/DGF8-KNT7)
[^4]: John Naughton. [Algorithm Writers Need a Code of Conduct](https://www.theguardian.com/commentisfree/2015/dec/06/algorithm-writers-should-have-code-of-conduct). *theguardian.com*, December 2015. Archived at [perma.cc/TBG2-3NG6](https://perma.cc/TBG2-3NG6)
[^5]: Ben Green. ["Good" isn't good enough](https://www.benzevgreen.com/wp-content/uploads/2019/11/19-ai4sg.pdf). At *NeurIPS Joint Workshop on AI for Social Good*, December 2019. Archived at [perma.cc/H4LN-7VY3](https://perma.cc/H4LN-7VY3)
[^6]: Deborah G. Johnson and Mario Verdicchio. [Ethical AI is Not about AI](https://cacm.acm.org/opinion/ethical-ai-is-not-about-ai/). *Communications of the ACM*, volume 66, issue 2, pages 32--34, January 2023. [doi:10.1145/3576932](https://doi.org/10.1145/3576932)
[^7]: Marc Steen. [Ethics as a Participatory and Iterative Process](https://cacm.acm.org/opinion/ethics-as-a-participatory-and-iterative-process/). *Communications of the ACM*, volume 66, issue 5, pages 27--29, April 2023. [doi:10.1145/3550069](https://doi.org/10.1145/3550069)
[^8]: Logan Kugler. [What Happens When Big Data Blunders?](https://cacm.acm.org/news/what-happens-when-big-data-blunders/) *Communications of the ACM*, volume 59, issue 6, pages 15--16, June 2016. [doi:10.1145/2911975](https://doi.org/10.1145/2911975)
[^9]: Miri Zilka. [Algorithms and the criminal justice system: promises and challenges in deployment and research](https://www.cl.cam.ac.uk/research/security/seminars/archive/video/2023-03-07-t196231.html). At *University of Cambridge Security Seminar Series*, March 2023.
[^10]: Bill Davidow. [Welcome to Algorithmic Prison](https://www.theatlantic.com/technology/archive/2014/02/welcome-to-algorithmic-prison/283985/). *theatlantic.com*, February 2014. Archived at [archive.org](https://web.archive.org/web/20171019201812/https://www.theatlantic.com/technology/archive/2014/02/welcome-to-algorithmic-prison/283985/)
[^11]: Don Peck. [They're Watching You at Work](https://www.theatlantic.com/magazine/archive/2013/12/theyre-watching-you-at-work/354681/). *theatlantic.com*, December 2013. Archived at [perma.cc/YR9T-6M38](https://perma.cc/YR9T-6M38)
[^12]: Leigh Alexander. [Is an Algorithm Any Less Racist Than a Human?](https://www.theguardian.com/technology/2016/aug/03/algorithm-racist-human-employers-work) *theguardian.com*, August 2016. Archived at [perma.cc/XP93-DSVX](https://perma.cc/XP93-DSVX)
[^13]: Jesse Emspak. [How a Machine Learns Prejudice](https://www.scientificamerican.com/article/how-a-machine-learns-prejudice/). *scientificamerican.com*, December 2016. Archived at [perma.cc/R3L5-55E6](https://perma.cc/R3L5-55E6)
[^14]: Rohit Chopra, Kristen Clarke, Charlotte A. Burrows, and Lina M. Khan. [Joint Statement on Enforcement Efforts Against Discrimination and Bias in Automated Systems](https://www.ftc.gov/system/files/ftc_gov/pdf/EEOC-CRT-FTC-CFPB-AI-Joint-Statement%28final%29.pdf). *ftc.gov*, April 2023. Archived at [perma.cc/YY4Y-RCCA](https://perma.cc/YY4Y-RCCA)
[^15]: Maciej Cegłowski. [The Moral Economy of Tech](https://idlewords.com/talks/sase_panel.htm). *idlewords.com*, June 2016. Archived at [perma.cc/L8XV-BKTD](https://perma.cc/L8XV-BKTD)
[^16]: Greg Nichols. [Artificial Intelligence in healthcare is racist](https://www.zdnet.com/article/artificial-intelligence-in-healthcare-is-racist/). *zdnet.com*, November 2020. Archived at [perma.cc/3MKW-YKRS](https://perma.cc/3MKW-YKRS)
[^17]: Cathy O'Neil. *Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy*. Crown Publishing, 2016. ISBN: 978-0-553-41881-1
[^18]: Julia Angwin. [Make Algorithms Accountable](https://www.nytimes.com/2016/08/01/opinion/make-algorithms-accountable.html). *nytimes.com*, August 2016. Archived at [archive.org](https://web.archive.org/web/20230819055242/https://www.nytimes.com/2016/08/01/opinion/make-algorithms-accountable.html)
[^19]: Bryce Goodman and Seth Flaxman. [European Union Regulations on Algorithmic Decision-Making and a 'Right to Explanation'](https://arxiv.org/abs/1606.08813). At *ICML Workshop on Human Interpretability in Machine Learning*, June 2016. Archived at [arxiv.org/abs/1606.08813](https://arxiv.org/abs/1606.08813)
[^20]: [A Review of the Data Broker Industry: Collection, Use, and Sale of Consumer Data for Marketing Purposes](https://www.commerce.senate.gov/services/files/0d2b3642-6221-4888-a631-08f2f255b577). Staff Report, *United States Senate Committee on Commerce, Science, and Transportation*, *commerce.senate.gov*, December 2013. Archived at [perma.cc/32NV-YWLQ](https://perma.cc/32NV-YWLQ)
[^21]: Stephanie Assad, Robert Clark, Daniel Ershov, and Lei Xu. [Algorithmic Pricing and Competition: Empirical Evidence from the German Retail Gasoline Market](https://economics.yale.edu/sites/default/files/clark_acex_jan_2021.pdf). *Journal of Political Economy*, volume 132, issue 3, pages 723--771, March 2024. [doi:10.1086/726906](https://doi.org/10.1086/726906)
[^22]: Donella H. Meadows and Diana Wright. *Thinking in Systems: A Primer*. Chelsea Green Publishing, 2008. ISBN: 978-1-603-58055-7
[^23]: Daniel J. Bernstein. [Listening to a "big data"/"data science" talk. Mentally translating "data" to "surveillance": "\...everything starts with surveillance\..."](https://x.com/hashbreaker/status/598076230437568512) *x.com*, May 2015. Archived at [perma.cc/EY3D-WBBJ](https://perma.cc/EY3D-WBBJ)
[^24]: Marc Andreessen. [Why Software Is Eating the World](https://a16z.com/why-software-is-eating-the-world/). *a16z.com*, August 2011. Archived at [perma.cc/3DCC-W3G6](https://perma.cc/3DCC-W3G6)
[^25]: J. M. Porup. ['Internet of Things' Security Is Hilariously Broken and Getting Worse](https://arstechnica.com/information-technology/2016/01/how-to-search-the-internet-of-things-for-photos-of-sleeping-babies/). *arstechnica.com*, January 2016. Archived at [archive.org](https://web.archive.org/web/20250823001716/https://arstechnica.com/information-technology/2016/01/how-to-search-the-internet-of-things-for-photos-of-sleeping-babies/)
[^26]: Bruce Schneier. [*Data and Goliath: The Hidden Battles to Collect Your Data and Control Your World*](https://www.schneier.com/books/data_and_goliath/). W. W. Norton, 2015. ISBN: 978-0-393-35217-7
[^27]: The Grugq. [Nothing to Hide](https://grugq.tumblr.com/post/142799983558/nothing-to-hide). *grugq.tumblr.com*, April 2016. Archived at [perma.cc/BL95-8W5M](https://perma.cc/BL95-8W5M)
[^28]: Federal Trade Commission. [FTC Takes Action Against General Motors for Sharing Drivers' Precise Location and Driving Behavior Data Without Consent](https://www.ftc.gov/news-events/news/press-releases/2025/01/ftc-takes-action-against-general-motors-sharing-drivers-precise-location-driving-behavior-data). *ftc.gov*, January 2025. Archived at [perma.cc/3XGV-3HRD](https://perma.cc/3XGV-3HRD)
[^29]: Tony Beltramelli. [Deep-Spying: Spying Using Smartwatch and Deep Learning](https://arxiv.org/abs/1512.05616). Masters Thesis, IT University of Copenhagen, December 2015. Archived at [arxiv.org/abs/1512.05616](https://arxiv.org/abs/1512.05616)
[^30]: Shoshana Zuboff. [Big Other: Surveillance Capitalism and the Prospects of an Information Civilization](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2594754). *Journal of Information Technology*, volume 30, issue 1, pages 75--89, April 2015. [doi:10.1057/jit.2015.5](https://doi.org/10.1057/jit.2015.5)
[^31]: Michiel Rhoen. [Beyond Consent: Improving Data Protection Through Consumer Protection Law](https://policyreview.info/articles/analysis/beyond-consent-improving-data-protection-through-consumer-protection-law). *Internet Policy Review*, volume 5, issue 1, March 2016. [doi:10.14763/2016.1.404](https://doi.org/10.14763/2016.1.404)
[^32]: [Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016](https://eur-lex.europa.eu/eli/reg/2016/679/oj/eng). *Official Journal of the European Union*, L 119/1, May 2016.
[^33]: UK Information Commissioner's Office. [What is the 'legitimate interests' basis?](https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/lawful-basis/legitimate-interests/what-is-the-legitimate-interests-basis/) *ico.org.uk*. Archived at [perma.cc/W8XR-F7ML](https://perma.cc/W8XR-F7ML)
[^34]: Tristan Harris. [How a handful of tech companies control billions of minds every day](https://www.ted.com/talks/tristan_harris_how_a_handful_of_tech_companies_control_billions_of_minds_every_day). At *TED2017*, April 2017.
[^35]: Carina C. Zona. [Consequences of an Insightful Algorithm](https://www.youtube.com/watch?v=YRI40A4tyWU). At *GOTO Berlin*, November 2016.
[^36]: Imanol Arrieta Ibarra, Leonard Goff, Diego Jiménez Hernández, Jaron Lanier, and E. Glen Weyl. [Should We Treat Data as Labor? Moving Beyond 'Free'](https://www.aeaweb.org/conference/2018/preliminary/paper/2Y7N88na). *American Economic Association Papers Proceedings*, volume 1, issue 1, December 2017.
[^37]: Bruce Schneier. [Data Is a Toxic Asset, So Why Not Throw It Out?](https://www.schneier.com/essays/archives/2016/03/data_is_a_toxic_asse.html) *schneier.com*, March 2016. Archived at [perma.cc/4GZH-WR3D](https://perma.cc/4GZH-WR3D)
[^38]: Cory Scott. [Data is not toxic - which implies no benefit - but rather hazardous material, where we must balance need vs. want](https://x.com/cory_scott/status/706586399483437056). *x.com*, March 2016. Archived at [perma.cc/CLV7-JF2E](https://perma.cc/CLV7-JF2E)
[^39]: Mark Pesce. [Data is the new uranium -- incredibly powerful and amazingly dangerous](https://www.theregister.com/2024/11/20/data_is_the_new_uranium/). *theregister.com*, November 2024. Archived at [perma.cc/NV8B-GYGV](https://perma.cc/NV8B-GYGV)
[^40]: Bruce Schneier. [Mission Creep: When Everything Is Terrorism](https://www.schneier.com/essays/archives/2013/07/mission_creep_when_e.html). *schneier.com*, July 2013. Archived at [perma.cc/QB2C-5RCE](https://perma.cc/QB2C-5RCE)
[^41]: Lena Ulbricht and Maximilian von Grafenstein. [Big Data: Big Power Shifts?](https://policyreview.info/articles/analysis/big-data-big-power-shifts) *Internet Policy Review*, volume 5, issue 1, March 2016. [doi:10.14763/2016.1.406](https://doi.org/10.14763/2016.1.406)
[^42]: Ellen P. Goodman and Julia Powles. [Facebook and Google: Most Powerful and Secretive Empires We've Ever Known](https://www.theguardian.com/technology/2016/sep/28/google-facebook-powerful-secretive-empire-transparency). *theguardian.com*, September 2016. Archived at [perma.cc/8UJA-43G6](https://perma.cc/8UJA-43G6)
[^43]: Judy Estrin and Sam Gill. [The World Is Choking on Digital Pollution](https://washingtonmonthly.com/2019/01/13/the-world-is-choking-on-digital-pollution/). *washingtonmonthly.com*, January 2019. Archived at [perma.cc/3VHF-C6UC](https://perma.cc/3VHF-C6UC)
[^44]: A. Michael Froomkin. [Regulating Mass Surveillance as Privacy Pollution: Learning from Environmental Impact Statements](https://repository.law.miami.edu/cgi/viewcontent.cgi?article=1062&context=fac_articles). *University of Illinois Law Review*, volume 2015, issue 5, August 2015. Archived at [perma.cc/24ZL-VK2T](https://perma.cc/24ZL-VK2T)
[^45]: Pengyuan Wang, Li Jiang, and Jian Yang. [The Early Impact of GDPR Compliance on Display Advertising: The Case of an Ad Publisher](https://openreview.net/pdf?id=TUnLHNo19S). *Journal of Marketing Research*, volume 61, issue 1, April 2023. [doi:10.1177/00222437231171848](https://doi.org/10.1177/00222437231171848)
[^46]: Johnny Ryan. [Don't be fooled by Meta's fine for data breaches](https://www.economist.com/by-invitation/2023/05/24/dont-be-fooled-by-metas-fine-for-data-breaches-says-johnny-ryan). *The Economist*, May 2023. Archived at [perma.cc/VCR6-55HR](https://perma.cc/VCR6-55HR)
[^47]: Jessica Leber. [Your Data Footprint Is Affecting Your Life in Ways You Can't Even Imagine](https://www.fastcompany.com/3057514/your-data-footprint-is-affecting-your-life-in-ways-you-cant-even-imagine). *fastcompany.com*, March 2016. Archived at [archive.org](https://web.archive.org/web/20161128133016/https://www.fastcoexist.com/3057514/your-data-footprint-is-affecting-your-life-in-ways-you-cant-even-imagine)
[^48]: Maciej Cegłowski. [Haunted by Data](https://idlewords.com/talks/haunted_by_data.htm). *idlewords.com*, October 2015. Archived at [archive.org](https://web.archive.org/web/20161130143932/https://idlewords.com/talks/haunted_by_data.htm)
[^49]: Sam Thielman. [You Are Not What You Read: Librarians Purge User Data to Protect Privacy](https://www.theguardian.com/us-news/2016/jan/13/us-library-records-purged-data-privacy). *theguardian.com*, January 2016. Archived at [archive.org](https://web.archive.org/web/20250828224851/https://www.theguardian.com/us-news/2016/jan/13/us-library-records-purged-data-privacy)
[^50]: Jez Humble. [It's a cliché that people get into tech to "change the world". So then, you have to actually consider what the impact of your work is on the world. The idea that you can or should exclude societal and political discussions in tech is idiotic. It means you're not doing your job](https://x.com/jezhumble/status/1386758340894597122). *x.com*, April 2021. Archived at [perma.cc/3NYS-MHLC](https://perma.cc/3NYS-MHLC)
================================================
FILE: content/en/ch2.md
================================================
---
title: "2. Defining Nonfunctional Requirements"
weight: 102
breadcrumbs: false
---

> *The Internet was done so well that most people think of it as a natural resource like the Pacific
> Ocean, rather than something that was man-made. When was the last time a technology with a scale
> like that was so error-free?*
>
> [Alan Kay](https://www.drdobbs.com/architecture-and-design/interview-with-alan-kay/240003442),
> in interview with *Dr Dobb’s Journal* (2012)
If you are building an application, you will be driven by a list of requirements. At the top of your
list is most likely the functionality that the application must offer: what screens and what buttons
you need, and what each operation is supposed to do in order to fulfill the purpose of your
software. These are your *functional requirements*.
In addition, you probably also have some *nonfunctional requirements*: for example, the app should
be fast, reliable, secure, legally compliant, and easy to maintain. These requirements might not be
explicitly written down, because they may seem somewhat obvious, but they are just as important as
the app’s functionality: an app that is unbearably slow or unreliable might as well not exist.
Many nonfunctional requirements, such as security, fall outside the scope of this book. But there
are a few nonfunctional requirements that we will consider, and this chapter will help you
articulate them for your own systems:
* How to define and measure the *performance* of a system (see [“Describing Performance”](/en/ch2#sec_introduction_percentiles));
* What it means for a service to be *reliable*—namely, continuing to work correctly, even when
things go wrong (see [“Reliability and Fault Tolerance”](/en/ch2#sec_introduction_reliability));
* How to make a system *scalable* by having efficient ways of adding computing
capacity as the load on the system grows (see [“Scalability”](/en/ch2#sec_introduction_scalability)); and
* How to make a system easier to maintain in the long term (see [“Maintainability”](/en/ch2#sec_introduction_maintainability)).
The terminology introduced in this chapter will also be useful in the following chapters, when we go
into the details of how data-intensive systems are implemented. However, abstract definitions can be
quite dry; to make the ideas more concrete, we will start this chapter with a case study of how a
social networking service might work, which will provide practical examples of performance and
scalability.
## Case Study: Social Network Home Timelines {#sec_introduction_twitter}
Imagine you are given the task of implementing a social network in the style of X (formerly
Twitter), in which users can post messages and follow other users. This will be a huge
simplification of how such a service actually works [^1] [^2] [^3],
but it will help illustrate some of the issues that arise in large-scale systems.
Let’s assume that users make 500 million posts per day, or 5,700 posts per second on average.
Occasionally, the rate can spike as high as 150,000 posts/second [^4].
Let’s also assume that the average user follows 200 people and has 200 followers (although there is
a very wide range: most people have only a handful of followers, and a few celebrities such as
Barack Obama have over 100 million followers).
### Representing Users, Posts, and Follows {#id20}
Imagine we keep all of the data in a relational database as shown in [Figure 2-1](/en/ch2#fig_twitter_relational). We
have one table for users, one table for posts, and one table for follow relationships.
{{< figure src="/fig/ddia_0201.png" id="fig_twitter_relational" caption="Figure 2-1. Simple relational schema for a social network in which users can follow each other." class="w-full my-4" >}}
Let’s say the main read operation that our social network must support is the *home timeline*, which
displays recent posts by people you are following (for simplicity we will ignore ads, suggested
posts from people you are not following, and other extensions). We could write the following SQL
query to get the home timeline for a particular user:
```sql
SELECT posts.*, users.* FROM posts
JOIN follows ON posts.sender_id = follows.followee_id
JOIN users ON posts.sender_id = users.id
WHERE follows.follower_id = current_user
ORDER BY posts.timestamp DESC
LIMIT 1000
```
To execute this query, the database will use the `follows` table to find everybody who
`current_user` is following, look up recent posts by those users, and sort them by timestamp to get
the most recent 1,000 posts by any of the followed users.
Posts are supposed to be timely, so let’s assume that after somebody makes a post, we want their
followers to be able to see it within 5 seconds. One way of doing that would be for the user’s
client to repeat the query above every 5 seconds while the user is online (this is known as
*polling*). If we assume that 10 million users are online and logged in at the same time, that would
mean running the query 2 million times per second. Even if you increase the polling interval, this
is a lot.
Moreover, the query above is quite expensive: if you are following 200 people, it needs to fetch a
list of recent posts by each of those 200 people, and merge those lists. 2 million timeline queries
per second then means that the database needs to look up the recent posts from some sender 400
million times per second—a huge number. And that is the average case. Some users follow tens of
thousands of accounts; for them, this query is very expensive to execute, and difficult to make
fast.
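The arithmetic behind these figures can be reproduced in a few lines. The sketch below uses only the numbers assumed in the text (10 million online users, a 5-second polling interval, 200 followed accounts per user); the variable names are illustrative.

```python
# Back-of-envelope load estimate for the polling approach.
# All inputs are the assumptions stated in the text, not measurements.
ONLINE_USERS = 10_000_000   # users online and logged in at the same time
POLL_INTERVAL_S = 5         # each client repeats the timeline query this often
AVG_FOLLOWING = 200         # accounts followed by the average user

# Every online user issues one timeline query per polling interval.
timeline_queries_per_second = ONLINE_USERS / POLL_INTERVAL_S

# Each timeline query has to fetch recent posts from every followed account.
sender_lookups_per_second = timeline_queries_per_second * AVG_FOLLOWING

print(f"timeline queries per second: {timeline_queries_per_second:,.0f}")
print(f"per-sender lookups per second: {sender_lookups_per_second:,.0f}")
```

Doubling the polling interval halves both numbers, which is why a longer interval helps but does not change the order of magnitude.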
### Materializing and Updating Timelines {#sec_introduction_materializing}
How can we do better? Firstly, instead of polling, it would be better if the server actively pushed
new posts to any followers who are currently online. Secondly, we should precompute the results of
the query above so that a user’s request for their home timeline can be served from a cache.
Imagine that for each user we store a data structure containing their home timeline, i.e., the
recent posts by people they are following. Every time a user makes a post, we look up all of their
followers, and insert that post into the home timeline of each follower—like delivering a message to
a mailbox. Now when a user logs in, we can simply give them this home timeline that we precomputed.
Moreover, to receive a notification about any new posts on their timeline, the user’s client simply
needs to subscribe to the stream of posts being added to their home timeline.
The downside of this approach is that we now need to do more work every time a user makes a post,
because the home timelines are derived data that needs to be updated. The process is illustrated in
[Figure 2-2](/en/ch2#fig_twitter_timelines). When one initial request results in several downstream requests being
carried out, we use the term *fan-out* to describe the factor by which the number of requests
increases.
{{< figure src="/fig/ddia_0202.png" id="fig_twitter_timelines" caption="Figure 2-2. Fan-out: delivering new posts to every follower of the user who made the post." class="w-full my-4" >}}
At a rate of 5,700 posts per second, if the average post reaches 200 followers (i.e., a
fan-out factor of 200), we will need to do just over 1 million home timeline writes per second. This
is a lot, but it’s still a significant saving compared to the 400 million per-sender post lookups
per second that we would otherwise have to do.
If the rate of posts spikes due to some special event, we don’t have to do the timeline
deliveries immediately—we can enqueue them and accept that it will temporarily take a bit longer for
posts to show up in followers’ timelines. Even during such load spikes, timelines remain fast to
load, since we simply serve them from a cache.
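To make the write path concrete, here is a minimal in-memory sketch of fan-out on write. It is an illustration under simplifying assumptions, not how a production service stores timelines: the names `followers`, `timelines`, `publish`, and `home_timeline` are hypothetical, everything lives in one process, and posts are assumed to arrive in timestamp order.

```python
from collections import defaultdict, deque

TIMELINE_LIMIT = 1000  # keep only the most recent posts in each timeline

# followee_id -> set of follower_ids (the "follows" relationship, inverted)
followers = defaultdict(set)
# user_id -> materialized home timeline, newest post first
timelines = defaultdict(lambda: deque(maxlen=TIMELINE_LIMIT))


def follow(follower_id, followee_id):
    followers[followee_id].add(follower_id)


def publish(sender_id, text, timestamp):
    """Insert one post into the home timeline of every follower.

    Returns the number of timeline writes performed, i.e. the fan-out.
    """
    post = (timestamp, sender_id, text)
    recipients = followers[sender_id]
    for follower_id in recipients:
        # appendleft on a full deque silently drops the oldest post.
        timelines[follower_id].appendleft(post)
    return len(recipients)


def home_timeline(user_id):
    """Reading is a cheap lookup: the work was already done at write time."""
    return list(timelines[user_id])
```

With the numbers from the text, 5,700 posts per second at a fan-out of 200 comes to 1.14 million calls to `appendleft` per second. In a real system the loop inside `publish` is the part that would be placed behind a queue, so that a spike in posts delays delivery rather than slowing down reads.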
This process of precomputing and updating the results of a query is called *materialization*, and
the timeline cache is an example of a *materialized view* (a concept we will discuss further in
[“Maintaining materialized views”](/en/ch12#sec_stream_mat_view)). The materialized view speeds up reads, but in return we have to do more work on
write. The cost of writes for most users is modest, but a social network also has to consider some
extreme cases:
* If a user is following a very large number of accounts, and those accounts post a lot, that user
will have a high rate of writes to their materialized timeline. However, in this case it’s
unlikely that the user is actually reading all of the posts in their timeline, and therefore it’s
okay to simply drop some of their timeline writes and show the user only a sample of the posts
from the accounts they’re following
[^5].
* When a celebrity account with a very large number of followers makes a post, we have to do a large
amount of work to insert that post into the home timelines of each of their millions of followers.
In this case it’s not okay to drop some of those writes. One way of solving this problem is to
handle celebrity posts separately from everyone else’s posts: we can save ourselves the effort of
adding them to millions of timelines by storing the celebrity posts separately and merging them
with the materialized timeline when it is read. Despite such optimizations, handling celebrities
on a social network can require a lot of infrastructure
[^6].
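The celebrity case amounts to a merge at read time. The following sketch assumes that both the materialized timeline and each celebrity's own list of posts are already sorted newest first; the function name `read_timeline` and the tuple layout are hypothetical choices made for this example.

```python
import heapq
from itertools import islice


def read_timeline(materialized, celebrity_feeds, limit=1000):
    """Merge a precomputed timeline with the posts of followed celebrities.

    materialized:     list of (timestamp, sender_id, text), newest first
    celebrity_feeds:  list of such lists, one per followed celebrity
    """
    merged = heapq.merge(
        materialized,
        *celebrity_feeds,
        key=lambda post: post[0],  # order by timestamp
        reverse=True,              # inputs and output are newest first
    )
    # heapq.merge is lazy, so only the first `limit` posts are ever produced.
    return list(islice(merged, limit))
```

The cost of this read grows with the number of celebrities a user follows rather than with the number of followers a celebrity has, which is the trade that makes the approach attractive.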
## Describing Performance {#sec_introduction_percentiles}
Most discussions of software performance consider two main types of metric:
Response time
: The elapsed time from the moment when a user makes a request until they receive the requested
answer. The unit of measurement is seconds (or milliseconds, or microseconds).
Throughput
: The number of requests per second, or the data volume per second, that the system is processing.
For a given allocation of hardware resources, there is a *maximum throughput* that can be handled.
The unit of measurement is “somethings per second”.
In the social network case study, “posts per second” and “timeline writes per second” are throughput
metrics, whereas the “time it takes to load the home timeline” or the “time until a post is
delivered to followers” are response time metrics.
There is often a connection between throughput and response time; an example of such a relationship
for an online service is sketched in [Figure 2-3](/en/ch2#fig_throughput). The service has a low response time when
request throughput is low, but response time increases as load increases. This is because of
*queueing*: when a request arrives on a highly loaded system, it’s likely that the CPU is already in
the process of handling an earlier request, and therefore the incoming request needs to wait until
the earlier request has been completed. As throughput approaches the maximum that the hardware can
handle, queueing delays increase sharply.
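One way to get a feel for the shape of this curve is a textbook queueing model. The sketch below assumes an M/M/1 queue (a single server, random arrivals, exponentially distributed service times), which is a simplification introduced here and not something the text claims about any real service. Under that model the mean response time is the mean service time divided by one minus the utilization.

```python
def mean_response_time(service_time_s, utilization):
    """Mean response time of an M/M/1 queue: R = S / (1 - rho).

    service_time_s: mean time to process one request with no queueing
    utilization:    fraction of capacity in use (throughput / max throughput)
    """
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in the range [0, 1)")
    return service_time_s / (1 - utilization)


# A hypothetical service that needs 10 ms per request when idle.
for rho in (0.1, 0.5, 0.8, 0.9, 0.95, 0.99):
    millis = mean_response_time(0.010, rho) * 1000
    print(f"utilization {rho:.0%}: mean response time {millis:.0f} ms")
```

At 50% utilization the mean response time is twice the service time, and at 90% it is ten times the service time, which matches the sharp rise near capacity described above.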
{{< figure src="/fig/ddia_0203.png" id="fig_throughput" caption="Figure 2-3. As the throughput of a service approaches its capacity, the response time increases dramatically due to queueing." class="w-full my-4" >}}
--------
> [!TIP] WHEN AN OVERLOADED SYSTEM WON'T RECOVER
If a system is close to overload, with throughput pushed close to the limit, it can sometimes enter a
vicious cycle where it becomes less efficient and hence even more overloaded. For example, if there
is a long queue of requests waiting to be handled, response times may increase so much that clients
time out and resend their request. This causes the rate of requests to increase even further, making
the problem worse—a *retry storm*. Even when the load is reduced again, such a system may remain in
an overloaded state until it is rebooted or otherwise reset. This phenomenon is called a *metastable
failure*, and it can cause serious outages in production systems [^7] [^8].
To avoid retries overloading a service, you can increase and randomize the time between successive
retries on the client side (*exponential backoff* [^9] [^10]), and temporarily stop sending requests to a service that has returned errors or timed out recently
(using a *circuit breaker* [^11] [^12] or *token bucket* algorithm [^13]).
The server can also detect when it is approaching overload and start proactively rejecting requests
(*load shedding* [^14]), and send back responses asking clients to slow down (*backpressure* [^1] [^15]).
The choice of queueing and load-balancing algorithms can also make a difference [^16].
--------
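As an illustration of client-side exponential backoff, here is a minimal sketch. The function name, the default parameters, and the "full jitter" strategy (waiting a random fraction of an exponentially growing cap) are assumptions chosen for this example; other jitter schemes exist, and real code would retry only on errors known to be transient.

```python
import random
import time


def call_with_backoff(request, max_attempts=5, base_delay_s=0.1, max_delay_s=10.0,
                      sleep=time.sleep, rng=random.random):
    """Call request() until it succeeds, waiting longer after each failure.

    `sleep` and `rng` are injectable so the function can be tested
    without real waiting or real randomness.
    """
    for attempt in range(max_attempts):
        try:
            return request()
        except Exception:  # simplification: treat every error as retryable
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            # The upper bound doubles on every attempt, up to max_delay_s.
            cap = min(max_delay_s, base_delay_s * 2 ** attempt)
            # Waiting a random fraction of the cap spreads clients out,
            # so they do not all retry at the same instant.
            sleep(rng() * cap)
```

Bounding the number of attempts matters as much as the delay: a client that retries forever keeps adding load to a service that is already struggling.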
In terms of performance metrics, the response time is usually what users care about the most,
whereas the throughput determines the required computing resources (e.g., how many servers you need),
and hence the cost of serving a particular workload. If throughput is likely to increase beyond what
the current hardware can handle, the capacity needs to be expanded; a system is said to be
*scalable* if its maximum throughput can be significantly increased by adding computing resources.
In this section we will focus primarily on response times, and we will return to throughput and
scalability in [“Scalability”](/en/ch2#sec_introduction_scalability).
### Latency and Response Time {#id23}
“Latency” and “response time” are sometimes used interchangeably, but in this book we will use the
terms in a specific way (illustrated in [Figure 2-4](/en/ch2#fig_response_time)):
* The *response time* is what the client sees; it includes all delays incurred anywhere in the
system.
* The *service time* is the duration for which the service is actively processing the user request.
* *Queueing delays* can occur at several points in the flow: for example, after a request is
received, it might need to wait until a CPU is available before it can be processed; a response
packet might need to be buffered before it is sent over the network if other tasks on the same
machine are sending a lot of data via the outbound network interface.
* *Latency* is a catch-all term for time during which a request is not being actively processed,
i.e., during which it is *latent*. In particular, *network latency* or *network delay* refers to
the time that request and response spend traveling through the network.
{{< figure src="/fig/ddia_0204.png" id="fig_response_time" caption="Figure 2-4. Response time, service time, network latency, and queueing delay." class="w-full my-4" >}}
In [Figure 2-4](/en/ch2#fig_response_time), time flows from left to right, each communicating node is shown as a
horizontal line, and a request or response message is shown as a thick diagonal arrow from one node
to another. You will encounter this style of diagram frequently over the course of this book.
The response time can vary significantly from one request to the next, even if you keep making the
same request over and over again. Many factors can add random delays: for example, a context switch
to a background process, the loss of a network packet and TCP retransmission, a garbage collection
pause, a page fault forcing a read from disk, mechanical vibrations in the server rack [^17],
or many other causes. We will discuss this topic in more detail in [“Timeouts and Unbounded Delays”](/en/ch9#sec_distributed_queueing).
Queueing delays often account for a large part of the variability in response times. As a server
can only process a small number of things in parallel (limited, for example, by its number of CPU
cores), it only takes a small number of slow requests to hold up the processing of subsequent
requests—an effect known as *head-of-line blocking*. Even if those subsequent requests have fast
service times, the client will see a slow overall response time due to the time waiting for the
prior request to complete. The queueing delay is not part of the service time, and for this reason
it is important to measure response times on the client side.
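Head-of-line blocking is easy to reproduce in a toy simulation. The sketch below assumes a single worker that serves requests strictly in arrival order and ignores network latency; the function name and the example numbers are made up for illustration.

```python
def fifo_response_times(requests):
    """Simulate one worker that serves requests in arrival order.

    requests: list of (arrival_ms, service_ms), sorted by arrival time
    Returns the response time of each request in milliseconds,
    i.e. its queueing delay plus its service time.
    """
    worker_free_at = 0
    response_times = []
    for arrival_ms, service_ms in requests:
        start = max(arrival_ms, worker_free_at)  # wait while the worker is busy
        worker_free_at = start + service_ms
        response_times.append(worker_free_at - arrival_ms)
    return response_times


# One slow request (1,000 ms) followed by three fast ones (10 ms each).
requests = [(0, 1000), (100, 10), (200, 10), (300, 10)]
print(fifo_response_times(requests))  # [1000, 910, 820, 730]
```

Each fast request has a service time of only 10 ms, yet the client observes between 730 and 910 ms. A measurement taken inside the service around the processing step alone would report 10 ms and miss the problem entirely.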
### Average, Median, and Percentiles {#id24}
Because the response time varies from one request to the next, we need to think of it not as a
single number, but as a *distribution* of values that you can measure. In [Figure 2-5](/en/ch2#fig_lognormal), each
gray bar represents a request to a service, and its height shows how long that request took. Most
requests are reasonably fast, but there are occasional *outliers* that take much longer.
Variation in network delay is also known as *jitter*.
{{< figure src="/fig/ddia_0205.png" id="fig_lognormal" caption="Figure 2-5. Illustrating mean and percentiles: response times for a sample of 100 requests to a service." class="w-full my-4" >}}
It’s common to report the *average* response time of a service (technically, the *arithmetic mean*:
that is, sum all the response times, and divide by the number of requests). The mean response time
is useful for estimating throughput limits [^18].
However, the mean is not a very good metric if you want to know your “typical” response time,
because it doesn’t tell you how many users actually experienced that delay.
Usually it is better to use *percentiles*. If you take your list of response times and sort it from
fastest to slowest, then the *median* is the halfway point: for example, if your median response
time is 200 ms, that means half your requests return in less than 200 ms, and half your
requests take longer than that. This makes the median a good metric if you want to know how long
users typically have to wait. The median is also known as the *50th percentile*, and sometimes
abbreviated as *p50*.
In order to figure out how bad your outliers are, you can look at higher percentiles: the *95th*,
*99th*, and *99.9th* percentiles are common (abbreviated *p95*, *p99*, and *p999*). They are the
response time thresholds below which 95%, 99%, or 99.9% of requests fall. For example, if the 95th
percentile response time is 1.5 seconds, that means 95 out of
100 requests take less than 1.5 seconds, and 5 out of 100 requests take 1.5 seconds or more. This is
illustrated in [Figure 2-5](/en/ch2#fig_lognormal).
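For a small, complete set of measurements, percentiles can be computed directly by sorting. The sketch below uses the nearest-rank definition, which is one of several conventions in use; the sample data is invented to show how the mean can differ from what any user actually experienced.

```python
import math


def percentile(sorted_times, p):
    """Nearest-rank percentile of an ascending list of response times.

    Returns the smallest sample such that at least p percent of all
    samples are less than or equal to it.
    """
    if not sorted_times:
        raise ValueError("need at least one sample")
    rank = max(1, math.ceil(p * len(sorted_times) / 100))
    return sorted_times[rank - 1]


# 100 invented response times in milliseconds: mostly fast, a few outliers.
times = sorted([100] * 90 + [500] * 9 + [3000])

print("mean:", sum(times) / len(times))  # 165.0
print("p50: ", percentile(times, 50))    # 100
print("p95: ", percentile(times, 95))    # 500
print("p99: ", percentile(times, 99))    # 500
print("max: ", percentile(times, 100))   # 3000
```

No request in this sample took 165 ms: the typical request took 100 ms, and the slow ones took several times longer, which the median and the higher percentiles show and the mean hides.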
High percentiles of response times, also known as *tail latencies*, are important because they
directly affect users’ experience of the service. For example, Amazon describes response time
requirements for internal services in terms of the 99.9th percentile, even though it only affects 1
in 1,000 requests. This is because the customers with the slowest requests are often those who have
the most data on their accounts because they have made many purchases—that is, they’re the most
valuable customers [^19].
It’s important to keep those customers happy by ensuring the website is fast for them.
On the other hand, optimizing the 99.99th percentile (the slowest 1 in 10,000 requests) was deemed
too expensive and to yield too little benefit for Amazon’s purposes. Reducing response times at very
high percentiles is difficult because they are easily affected by random events outside of your
control, and the benefits are diminishing.
--------
> [!TIP] THE USER IMPACT OF RESPONSE TIMES
It seems intuitively obvious that a fast service is better for users than a slow service [^20].
However, it is surprisingly difficult to get hold of reliable data to quantify the effect that
latency has on user behavior.
Some often-cited statistics are unreliable. In 2006 Google reported that a slowdown in search
results from 400 ms to 900 ms was associated with a 20% drop in traffic and revenue [^21].
However, another Google study from 2009 reported that a 400 ms increase in latency resulted in
only 0.6% fewer searches per day [^22],
and in the same year Bing found that a two-second increase in load time reduced ad revenue by 4.3% [^23].
Newer data from these companies appears not to be publicly available.
A more recent Akamai study [^24]
claims that a 100 ms increase in response time reduced the conversion rate of e-commerce sites
by up to 7%; however, on closer inspection, the same study reveals that very *fast* page load times
are also correlated with lower conversion rates! This seemingly paradoxical result is explained by
the fact that the pages that load fastest are often those that have no useful content (e.g., 404
error pages). However, since the study makes no effort to separate the effects of page content from
the effects of load time, its results are probably not meaningful.
A study by Yahoo [^25] compares click-through rates on fast-loading versus slow-loading search results, controlling for
quality of search results. It finds 20–30% more clicks on fast searches when the difference between
fast and slow responses is 1.25 seconds or more.
--------
### Use of Response Time Metrics {#sec_introduction_slo_sla}
High percentiles are especially important in backend services that are called multiple times as
part of serving a single end-user request. Even if you make the calls in parallel, the end-user
request still needs to wait for the slowest of the parallel calls to complete. It takes just one
slow call to make the entire end-user request slow, as illustrated in [Figure 2-6](/en/ch2#fig_tail_amplification).
Even if only a small percentage of backend calls are slow, the chance of getting a slow call
increases if an end-user request requires multiple backend calls, and so a higher proportion of
end-user requests end up being slow (an effect known as *tail latency amplification* [^26]).
{{< figure src="/fig/ddia_0206.png" id="fig_tail_amplification" caption="Figure 2-6. When several backend calls are needed to serve a request, it takes just a single slow backend request to slow down the entire end-user request." class="w-full my-4" >}}
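The amplification effect can be estimated with a short calculation. The sketch below assumes that backend calls are slow independently of each other with the same probability, which is a simplification (in practice slow calls are often correlated); the function name and the numbers are illustrative only.

```python
def prob_slow_request(p_slow_call, num_calls):
    """Probability that at least one of num_calls backend calls is slow,
    assuming each call is slow independently with probability p_slow_call."""
    return 1 - (1 - p_slow_call) ** num_calls

# If 1% of backend calls are slow, how many end-user requests are affected?
one_call = prob_slow_request(0.01, 1)
ten_calls = prob_slow_request(0.01, 10)
hundred_calls = prob_slow_request(0.01, 100)
```

Under these assumptions, a request that needs 10 backend calls is slow a little under 10% of the time, and a request that needs 100 backend calls is slow about 63% of the time, even though each individual call is slow only 1% of the time.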
Percentiles are often used in *service level objectives* (SLOs) and *service level agreements*
(SLAs) as ways of defining the expected performance and availability of a service [^27].
For example, an SLO may set a target for a service to have a median response time of less than
200 ms and a 99th percentile under 1 s, and a target that at least 99.9% of valid requests
result in non-error responses. An SLA is a contract that specifies what happens if the SLO is not
met (for example, customers may be entitled to a refund). That is the basic idea, at least; in
practice, defining good availability metrics for SLOs and SLAs is not straightforward [^28] [^29].
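To make the example SLO concrete, here is a minimal sketch that checks the three targets mentioned above against a batch of measurements. The function name `meets_slo`, its parameters, and the way errors are counted are assumptions made for this illustration; real SLO tooling usually evaluates such targets over rolling time windows.

```python
import math
import statistics

def meets_slo(latencies_ms, error_count,
              median_target_ms=200, p99_target_ms=1000, min_success_ratio=0.999):
    """Check a median target, a p99 target, and a success-ratio target.

    latencies_ms holds one response time per valid request; error_count is
    how many of those requests resulted in an error response.
    """
    total = len(latencies_ms)
    if total == 0:
        raise ValueError("no requests to evaluate")
    ordered = sorted(latencies_ms)
    p99 = ordered[math.ceil(99 * total / 100) - 1]  # nearest-rank p99
    success_ratio = (total - error_count) / total
    return (statistics.median(ordered) < median_target_ms
            and p99 < p99_target_ms
            and success_ratio >= min_success_ratio)

# Hypothetical batch: 1,000 requests, mostly 150 ms, ten of them 900 ms.
batch = [150] * 990 + [900] * 10
within_slo = meets_slo(batch, error_count=1)
too_many_errors = meets_slo(batch, error_count=2)
```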
--------
> [!TIP] COMPUTING PERCENTILES
If you want to add response time percentiles to the monitoring dashboards for your services, you
need to efficiently calculate them on an ongoing basis. For example, you may want to keep a rolling
window of response times of requests in the last 10 minutes. Every minute, you calculate the median
and various percentiles over the values in that window and plot those metrics on a graph.
The simplest implementation is to keep a list of response times for all requests within the time
window and to sort that list every minute. If that is too inefficient for you, there are algorithms
that can calculate a good approximation of percentiles at minimal CPU and memory cost.
Open source percentile estimation libraries include HdrHistogram,
t-digest [^30] [^31],
OpenHistogram [^32], and DDSketch [^33].
Beware that averaging percentiles, e.g., to reduce the time resolution or to combine data from
several machines, is mathematically meaningless—the right way of aggregating response time data
is to add the histograms [^34].
--------
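The warning in the sidebar about averaging percentiles can be demonstrated with a small experiment. The sketch below uses a deliberately simple histogram with fixed-width buckets (an assumption made for brevity; the libraries named above use their own, more refined data structures) and compares the average of two per-machine p99 values with the p99 of the merged histogram. All names and numbers are hypothetical.

```python
import math
from collections import Counter

BUCKET_MS = 10  # hypothetical fixed bucket width

def to_histogram(latencies_ms):
    """Count measurements per bucket; each key is the bucket's upper bound in ms."""
    return Counter(math.ceil(t / BUCKET_MS) * BUCKET_MS for t in latencies_ms)

def histogram_percentile(hist, p):
    """Approximate the p-th percentile by the upper bound of the bucket containing it."""
    total = sum(hist.values())
    if total == 0:
        raise ValueError("empty histogram")
    rank = math.ceil(p * total / 100)
    seen = 0
    for upper_bound in sorted(hist):
        seen += hist[upper_bound]
        if seen >= rank:
            return upper_bound

# Two hypothetical machines: one busy and healthy, one slow and nearly idle.
machine_a = to_histogram([20] * 990 + [100] * 10)  # 1,000 requests
machine_b = to_histogram([500] * 10)               # 10 requests

# Wrong: average the two per-machine p99 values.
averaged_p99 = (histogram_percentile(machine_a, 99)
                + histogram_percentile(machine_b, 99)) / 2

# Right: add the histograms bucket by bucket, then take the percentile.
merged_p99 = histogram_percentile(machine_a + machine_b, 99)
```

With this data the averaged value is 260 ms, whereas the p99 of the combined traffic is 100 ms: the nearly idle machine dominates the average even though it served only about 1% of the requests.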
## Reliability and Fault Tolerance {#sec_introduction_reliability}
Everybody has an intuitive idea of what it means for something to be reliable or unreliable. For
software, typical expectations include:
* The application performs the function that the user expected.
* It can tolerate the user making mistakes or using the software in unexpected ways.
* Its performance is good enough for the required use case, under the expected load and data volume.
* The system prevents any unauthorized access and abuse.
If all those things together mean “working correctly,” then we can understand *reliability* as
meaning, roughly, “continuing to work correctly, even when things go wrong.” To be more precise
about things going wrong, we will distinguish between *faults* and *failures* [^35] [^36] [^37]:
Fault
: A fault is when a particular *part* of a system stops working correctly: for example, if a
single hard drive malfunctions, or a single machine crashes, or an external service (that the
system depends on) has an outage.
Failure
: A failure is when the system *as a whole* stops providing the required service to the user; in
other words, when it does not meet the service level objective (SLO).
The distinction between fault and failure can be confusing because they are the same thing, just at
different levels. For example, if a hard drive stops working, we say that the hard drive has failed:
if the system consists only of that one hard drive, it has stopped providing the required service.
However, if the system you’re talking about contains many hard drives, then the failure of a single
hard drive is only a fault from the point of view of the bigger system, and the bigger system might
be able to tolerate that fault by having a copy of the data on another hard drive.
### Fault Tolerance {#id27}
We call a system *fault-tolerant* if it continues providing the required service to the user in
spite of certain faults occurring. If a system cannot tolerate a certain part becoming faulty, we
call that part a *single point of failure* (SPOF), because a fault in that part escalates to cause
the failure of the whole system.
For example, in the social network case study, a fault that might happen is that during the fan-out
process, a machine involved in updating the materialized timelines crashes or becomes unavailable.
To make this process fault-tolerant, we would need to ensure that another machine can take over this
task without missing any posts that should have been delivered, and without duplicating any posts.
(This idea is known as *exactly-once semantics*, and we will examine it in detail in [“The End-to-End Argument for Databases”](/en/ch13#sec_future_end_to_end).)
Fault tolerance is always limited to a certain number of certain types of faults. For example, a
system might be able to tolerate a maximum of two hard drives failing at the same time, or a maximum
of one out of three nodes crashing. It would not make sense to tolerate any number of faults: if all
nodes crash, there is nothing that can be done. If the entire planet Earth (and all servers on it)
were swallowed by a black hole, tolerance of that fault would require web hosting in space—good luck
getting that budget item approved.
Counter-intuitively, in such fault-tolerant systems, it can make sense to *increase* the rate of
faults by triggering them deliberately—for example, by randomly killing individual processes
without warning. This is called *fault injection*. Many critical bugs are actually due to poor error
handling [^38]; by deliberately inducing faults, you ensure
that the fault-tolerance machinery is continually exercised and tested, which can increase your
confidence that faults will be handled correctly when they occur naturally. *Chaos engineering* is
a discipline that aims to improve confidence in fault-tolerance mechanisms through experiments such
as deliberately injecting faults [^39].
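A toy version of fault injection can be sketched in a few lines. In the example below, a wrapper makes a fraction of calls fail on purpose so that the retry logic, which stands in for the fault-tolerance machinery, is exercised on every run. The names `with_fault_injection`, `call_with_retries`, and `lookup` are hypothetical, and real fault injection tools operate at the level of processes, machines, or networks rather than function calls.

```python
import random

class InjectedFault(Exception):
    """Raised by the fault injector in place of a real crash."""

def with_fault_injection(func, failure_rate, rng):
    """Wrap func so that roughly failure_rate of calls fail before doing any work."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault("injected fault")
        return func(*args, **kwargs)
    return wrapper

def call_with_retries(func, max_attempts, *args):
    """The machinery under test: retry until success or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func(*args)
        except InjectedFault:
            if attempt == max_attempts:
                raise  # give up and surface the fault to the caller

def lookup(key):
    """Hypothetical operation that we want to keep working despite faults."""
    return key.upper()

rng = random.Random(42)  # fixed seed so that the experiment is repeatable
flaky_lookup = with_fault_injection(lookup, failure_rate=0.3, rng=rng)
results = [call_with_retries(flaky_lookup, 20, "k%d" % i) for i in range(100)]
```

If the retry logic had a bug, for example swallowing the exception and returning nothing, this experiment would expose it quickly, rather than leaving it to be discovered during a real outage.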
Although we generally prefer tolerating faults over preventing faults, there are cases where
prevention is better than cure (e.g., because no cure exists). This is the case with security
matters, for example: if an attacker has compromised a system and gained access to sensitive data,
that event cannot be undone. However, this book mostly deals with the kinds of faults that can be
cured, as described in the following sections.
### Hardware and Software Faults {#sec_introduction_hardware_faults}
When we think of causes of system failure, hardware faults quickly come to mind:
* Approximately 2–5% of magnetic hard drives fail per year [^40] [^41]; in a storage cluster with 10,000 disks, we should therefore expect on average one disk failure per day.
Recent data suggests that disks are getting more reliable, but failure rates remain significant [^42].
* Approximately 0.5–1% of solid state drives (SSDs) fail per year [^43]. Small numbers of bit errors are corrected automatically [^44], but uncorrectable errors occur approximately once per year per drive, even in drives that are
fairly new (i.e., that have experienced little wear); this error rate is higher than that of
magnetic hard drives [^45] [^46].
* Other hardware components such as power supplies, RAID controllers, and memory modules also fail, although less frequently than hard drives [^47] [^48].
* Approximately one in 1,000 machines has a CPU core that occasionally computes the wrong result,
likely due to manufacturing defects [^49] [^50] [^51]. In some cases, an erroneous computation leads to a crash, but in other cases it leads to a program simply returning the wrong result.
* Data in RAM can also be corrupted, either due to random events such as cosmic rays, or due to
permanent physical defects. Even when memory with error-correcting codes (ECC) is used, more than
1% of machines encounter an uncorrectable error in a given year, which typically crashes the
machine and requires the affected memory module to be replaced [^52].
Moreover, certain pathological memory access patterns can flip bits with high probability [^53].
* An entire datacenter might become unavailable (for example, due to power outage or network
misconfiguration) or even be permanently destroyed (for example, by fire, flood, or earthquake [^54]).
A solar storm, which induces large electrical currents in long-distance wires when the sun ejects
a large mass of charged particles, could damage power grids and undersea network cables [^55].
Although such large-scale failures are rare, their impact can be catastrophic if a service cannot tolerate the loss of a datacenter [^56].
These events are rare enough that you often don’t need to worry about them when working on a small
system, as long as you can easily replace hardware that becomes faulty. However, in a large-scale
system, hardware faults happen often enough that they become part of the normal system operation.
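The estimate of roughly one disk failure per day in the first bullet point above follows from simple arithmetic, sketched below. The function name is made up, and the calculation assumes that failures are spread evenly over the year.

```python
def expected_failures_per_day(num_disks, annual_failure_rate):
    """Average number of disk failures per day, assuming a uniform rate over the year."""
    return num_disks * annual_failure_rate / 365

low = expected_failures_per_day(10_000, 0.02)   # 200 failures/year, about 0.55 per day
high = expected_failures_per_day(10_000, 0.05)  # 500 failures/year, about 1.37 per day
```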
#### Tolerating hardware faults through redundancy {#tolerating-hardware-faults-through-redundancy}
Our first response to unreliable hardware is usually to add redundancy to the individual hardware
components in order to reduce the failure rate of the system. Disks may be set up in a RAID
configuration (spreading data across multiple disks in the same machine so that a failed disk does
not cause data loss), servers may have dual power supplies and hot-swappable CPUs, and datacenters
may have batteries and diesel generators for backup power. Such redundancy can often keep a machine
running uninterrupted for years.
Redundancy is most effective when component faults are independent, that is, the occurrence of one
fault does not change how likely it is that another fault will occur. However, experience has shown
that there are often significant correlations between component failures [^41] [^57] [^58];
unavailability of an entire server rack or an entire datacenter still happens more often than we
would like.
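A back-of-the-envelope model shows why independence matters so much. The sketch below compares two redundant components that fail independently with two components that additionally share a common cause of failure, such as a rack-level power problem. The model and all probabilities are assumptions chosen purely for illustration, not measurements.

```python
def prob_both_fail_independent(p):
    """Both redundant components fail, each independently with probability p."""
    return p * p

def prob_both_fail_with_shared_cause(p, q):
    """Both fail if a shared cause (probability q) strikes; otherwise they
    fail independently with probability p each."""
    return q + (1 - q) * p * p

independent = prob_both_fail_independent(0.01)                # 0.0001
correlated = prob_both_fail_with_shared_cause(0.01, 0.001)    # about 0.0011
```

With these made-up numbers, a shared cause that occurs only one time in a thousand makes a simultaneous failure about eleven times more likely than the independent case would suggest.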
Hardware redundancy increases the uptime of a single machine; however, as discussed in
[“Distributed versus Single-Node Systems”](/en/ch1#sec_introduction_distributed), there are advantages to using a distributed system, such as being
able to tolerate a complete outage of one datacenter.
For this reason, cloud systems tend to focus less on the reliability of individual machines, and
instead aim to make services highly available by tolerating faulty nodes at the software level.
Cloud providers use *availability zones* to identify which resources are physically co-located;
resources in the same place are more likely to fail at the same time than geographically separated
resources.
The fault-tolerance techniques we discuss in this book are designed to tolerate the loss of entire
machines, racks, or availability zones. They generally work by allowing a machine in one datacenter
to take over when a machine in another datacenter fails or becomes unreachable. We will discuss such
techniques for fault tolerance in [Chapter 6](/en/ch6#ch_replication), [Chapter 10](/en/ch10#ch_consistency), and at various other
points in this book.
Systems that can tolerate the loss of entire machines also have operational advantages: a
single-server system requires planned downtime if you need to reboot the machine (to apply operating
system security patches, for example), whereas a multi-node fault-tolerant system can be patched by
restarting one node at a time, without affecting the service for users. This is called a *rolling
upgrade*, and we will discuss it further in [Chapter 5](/en/ch5#ch_encoding).
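The control flow of a rolling upgrade can be sketched as a simple loop. In the simulation below, nodes are plain dictionaries, and assigning a new version stands in for patching and restarting a machine; the function name, the node fields, and the health check are all hypothetical.

```python
def rolling_upgrade(nodes, new_version, is_healthy):
    """Upgrade nodes one at a time; stop if an upgraded node fails its health check."""
    for node in nodes:
        node["serving"] = False        # take this one node out of rotation
        node["version"] = new_version  # stand-in for patching and restarting
        if not is_healthy(node):
            return False               # halt; remaining nodes keep the old version
        node["serving"] = True         # put it back before touching the next node
        # Invariant: every node is serving again before the next one is restarted.
        assert all(n["serving"] for n in nodes)
    return True

cluster = [{"name": f"node{i}", "version": "1.0", "serving": True} for i in range(3)]
upgraded = rolling_upgrade(cluster, "1.1", is_healthy=lambda node: True)
```

Because at most one node is out of rotation at any time, the remaining nodes continue to serve requests throughout, provided they have enough spare capacity to absorb the load of the node being restarted.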
#### Software faults {#software-faults}
Although hardware failures can be weakly correlated, they are still mostly independent: for
example, if one disk fails, it’s likely that other disks in the same machine will be fine for
a while longer. On the other hand, software faults are often very highly correlated, because it is
common for many nodes to run the same software and thus have the same bugs [^59] [^60].
Such faults are harder to anticipate, and they tend to cause many more system failures than
uncorrelated hardware faults [^47]. For example:
* A software bug that causes every node to fail at the same time in particular circumstances. For
example, on June 30, 2012, a leap second caused many Java applications to hang simultaneously due
to a bug in the Linux kernel, bringing down many Internet services [^61].
Due to a firmware bug, all SSDs of certain models suddenly fail after precisely 32,768 hours of
operation (less than 4 years), rendering the data on them unrecoverable [^62].
* A runaway process that uses up some shared, limited resource, such as CPU time, memory, disk
space, network bandwidth, or threads [^63]. For example, a process that consumes too much memory while processing a large request may be
killed by the operating system. A bug in a client library could cause a much higher request volume than anticipated [^64].
* A service that the system depends on slows down, becomes unresponsive, or starts returning corrupted responses.
* An interaction between different systems results in emergent behavior that does not occur when each system is tested in isolation [^65].
* Cascading failures, where a problem in one component causes another component to become overloaded
and slow down, which in turn brings down another component [^66] [^67].
The bugs that cause these kinds of software faults often lie dormant for a long time until they are
triggered by an unusual set of circumstances. In those circumstances, it is revealed that the
software is making some kind of assumption about its environment—and while that assumption is
usually true, it eventually stops being true for some reason [^68] [^69].
There is no quick solution to the problem of systematic faults in software. Lots of small things can
help: carefully thinking about assumptions and interactions in the system; thorough testing; process
isolation; allowing processes to crash and restart; avoiding feedback loops such as retry storms
(see [“When an overloaded system won’t recover”](/en/ch2#sidebar_metastable)); measuring, monitoring, and analyzing system behavior in production.
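One commonly used way of dampening retry storms is for clients to wait an increasing, randomized amount of time between retries (exponential backoff with jitter). The sketch below only computes the delays; the function name, the base delay, and the cap are assumptions chosen for illustration, and this is one of several possible mitigations rather than a complete solution.

```python
import random

def backoff_delays(max_retries, base_s=0.1, cap_s=10.0, rng=random):
    """Yield one randomized delay (in seconds) per retry attempt.

    The upper bound doubles with each attempt, up to cap_s, and the actual
    delay is drawn uniformly between 0 and that bound so that many clients
    do not all retry at the same moment.
    """
    for attempt in range(max_retries):
        upper = min(cap_s, base_s * 2 ** attempt)
        yield rng.uniform(0, upper)

# Fixed seed only to make the example repeatable.
delays = list(backoff_delays(8, rng=random.Random(1)))
```

A client would sleep for each delay in turn before retrying, and give up once the delays are exhausted.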
### Humans and Reliability {#id31}
Humans design and build software systems, and the operators who keep the systems running are also
human. Unlike machines, humans don’t just follow rules; their strength is being creative and
adaptive in getting their job done. However, this characteristic also leads to unpredictability, and
sometimes mistakes that can lead to failures, despite best intentions. For example, one study of
large internet services found that configuration changes by operators were the leading cause of
outages, whereas hardware faults (servers or network) played a role in only 10–25% of outages [^70].
It is tempting to label such problems as “human error” and to wish that they could be solved by
better controlling human behavior through tighter procedures and compliance with rules. However,
blaming people for mistakes is counterproductive. What we call “human error” is not really the cause
of an incident, but rather a symptom of a problem with the sociotechnical system in which people are
trying their best to do their jobs [^71].
Often complex systems have emergent behavior, in which unexpected interactions between components
may also lead to failures [^72].
Various technical measures can help minimize the impact of human mistakes, including thorough
testing (both hand-written tests and *property testing* on lots of random inputs) [^38], rollback mechanisms for quickly
reverting configuration changes, gradual roll-outs of new code, detailed and clear monitoring,
observability tools for diagnosing production issues (see [“Problems with Distributed Systems”](/en/ch1#sec_introduction_dist_sys_problems)),
and well-designed interfaces that encourage “the right thing” and discourage “the wrong thing.”
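To illustrate what property testing on random inputs looks like, the sketch below checks a round-trip property for a small run-length encoder: decoding an encoded string should always return the original. It uses only the standard library; dedicated property testing frameworks additionally shrink failing inputs to minimal examples. The encoder and all names are hypothetical.

```python
import random

def rle_encode(s):
    """Run-length encode a string into a list of [character, count] pairs."""
    runs = []
    for ch in s:
        if runs and runs[-1][0] == ch:
            runs[-1][1] += 1
        else:
            runs.append([ch, 1])
    return runs

def rle_decode(runs):
    """Inverse of rle_encode."""
    return "".join(ch * count for ch, count in runs)

def check_round_trip(trials=1000, seed=0):
    """Property: for any input string, decode(encode(s)) == s."""
    rng = random.Random(seed)
    for _ in range(trials):
        length = rng.randint(0, 20)
        s = "".join(rng.choice("ab") for _ in range(length))
        assert rle_decode(rle_encode(s)) == s, f"round trip failed for {s!r}"
    return trials

checked = check_round_trip()
```

Rather than listing specific examples by hand, the test states a rule that must hold for every input and lets random generation search for counterexamples.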
However, these things require an investment of time and money, and in the pragmatic reality of
everyday business, organizations often prioritize revenue-generating activities over measures that
increase their resilience against mistakes. If there is a choice between more features and more
testing, many organizations understandably choose features. Given this choice, when a preventable
mistake inevitably occurs, it does not make sense to blame the person who made the mistake—the
problem is the organization’s priorities.
Increasingly, organizations are adopting a culture of *blameless postmortems*: after an incident,
the people involved are encouraged to share full details about what happened, without fear of
punishment, since this allows others in the organization to learn how to prevent similar problems in the future [^73].
This process may uncover a need to change business priorities, a need to invest in areas that have
been neglected, a need to change the incentives for the people involved, or some other systemic
issue that needs to be brought to management’s attention.
As a general principle, when investigating an incident, you should be suspicious of simplistic
answers. “Bob should have been more careful when deploying that change” is not productive, but
neither is “We must rewrite the backend in Haskell.” Instead, management should take the opportunity
to learn the details of how the sociotechnical system works from the point of view of the people who
work with it every day, and take steps to improve it based on this feedback [^71].
--------
> [!TIP] HOW IMPORTANT IS RELIABILITY?
Reliability is not just for nuclear power stations and air traffic control—more mundane applications
are also expected to work reliably. Bugs in business applications cause lost productivity (and legal
risks if figures are reported incorrectly), and outages of e-commerce sites can have huge costs in
terms of lost revenue and damage to reputation.
In many applications, a temporary outage of a few minutes or even a few hours is tolerable [^74],
but permanent data loss or corruption would be catastrophic. Consider a parent who stores all their
pictures and videos of their children in your photo application [^75]. How would they
feel if that database was suddenly corrupted? Would they know how to restore it from a backup?
As another example of how unreliable software can harm people, consider the Post Office Horizon
scandal. Between 1999 and 2019, hundreds of people managing Post Office branches in Britain were
convicted of theft or fraud because the accounting software showed a shortfall in their accounts.
Eventually it became clear that many of these shortfalls were due to bugs in the software, and many
convictions have since been overturned [^76].
What led to this, probably the largest miscarriage of justice in British history, is the fact that
English law assumes that computers operate correctly (and hence, evidence produced by computers is
reliable) unless there is evidence to the contrary [^77].
Software engineers may laugh at the idea that software could ever be bug-free, but this is little
solace to the people who were wrongfully imprisoned or declared bankrupt, or who even committed suicide as
a result of a wrongful conviction due to an unreliable computer system.
There are situations in which we may choose to sacrifice reliability in order to reduce development
cost (e.g., when developing a prototype product for an unproven market)—but we should be very
conscious of when we are cutting corners and keep in mind the potential consequences.
--------
## Scalability {#sec_introduction_scalability}
Even if a system is working reliably today, that doesn’t mean it will necessarily work reliably in
the future. One common reason for degradation is increased load: perhaps the system has grown from
10,000 concurrent users to 100,000 concurrent users, or from 1 million to 10 million. Perhaps it is
processing much larger volumes of data than it did before.
*Scalability* is the term we use to describe a system’s ability to cope with increased load.
Sometimes, when discussing scalability, people make comments along the lines of, “You’re not Google
or Amazon. Stop worrying about scale and just use a relational database.” Whether this maxim applies
to you depends on the type of application you are building.
If you are building a new product that currently only has a small number of users, perhaps at a
startup, the overriding engineering goal is usually to keep the system as simple and flexible as
possible, so that you can easily modify and adapt the features of your product as you learn more
about customers’ needs [^78].
In such an environment, it is counterproductive to worry about hypothetical scale that might be
needed in the future: in the best case, investments in scalability are wasted effort and premature
optimization; in the worst case, they lock you into an inflexible design and make it harder to
evolve your application.
The reason is that scalability is not a one-dimensional label: it is meaningless to say “X is
scalable” or “Y doesn’t scale.” Rather, discussing scalability means considering questions like:
* “If the system grows in a particular way, what are our options for coping with the growth?”
* “How can we add computing resources to handle the additional load?”
* “Based on current growth projections, when will we hit the limits of our current architecture?”
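The last of these questions can be given a rough first answer with a compound-growth calculation. The sketch below assumes a constant monthly growth rate and a single known capacity limit, both of which are simplifications; the function name and the numbers are hypothetical.

```python
import math

def months_until_limit(current_load, capacity, monthly_growth_rate):
    """Whole months of compound growth until the load reaches the capacity."""
    if current_load >= capacity:
        return 0
    growth_needed = capacity / current_load
    return math.ceil(math.log(growth_needed) / math.log(1 + monthly_growth_rate))

# Hypothetical service: 2,000 requests/sec today, architecture tops out at
# 10,000 requests/sec, and load grows by 10% per month.
months = months_until_limit(current_load=2_000, capacity=10_000,
                            monthly_growth_rate=0.10)
```

In this example the limit is reached after 17 months, which gives an indication of how much time remains before the architecture needs to be revisited.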
If you succeed in making your application popular, and therefore handling a growing amount of load,
you will learn where your performance bottlenecks lie, and therefore you will know along which
dimensions you need to scale. At that point it’s time to start worrying about techniques for
scalability.
### Describing Load {#id33}
First, we need to succinctly describe the current load on the system; only then can we discuss
growth questions (what happens if our load doubles?). Often this will be a measure of throughput:
for example, the number of requests per second to a service, how many gigabytes of new data arrive
per day, or the number of shopping cart checkouts per hour. Sometimes you care about the peak of
some variable quantity, such as the number of simultaneously online users in
[“Case Study: Social Network Home Timelines”](/en/ch2#sec_introduction_twitter).
Often there are other statistical characteristics of the load that also affect the access patterns
and hence the scalability requirements. For example, you may need to know the ratio of reads to
writes in a database, the hit rate on a cache, or the number of data items per user (for example,
the number of followers in the social network case study). Perhaps the average case is what matters
for you, or perhaps your bottleneck is dominated by a small number of extreme cases. It all depends
on the details of your particular application.
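As a small illustration, the following sketch derives a few such load parameters from a list of request records. The record format (a `kind` field and a `cache_hit` flag) and the function name are assumptions invented for this example; in practice these numbers would come from your metrics or logging system.

```python
def describe_load(requests, window_seconds):
    """Summarize request records into a few load parameters."""
    reads = [r for r in requests if r["kind"] == "read"]
    writes = [r for r in requests if r["kind"] == "write"]
    cache_hits = sum(1 for r in reads if r["cache_hit"])
    return {
        "requests_per_second": len(requests) / window_seconds,
        "read_write_ratio": len(reads) / len(writes) if writes else float("inf"),
        "cache_hit_rate": cache_hits / len(reads) if reads else 0.0,
    }

# Hypothetical 10-second sample: 90 reads (72 served from cache) and 10 writes.
sample = (
    [{"kind": "read", "cache_hit": True}] * 72
    + [{"kind": "read", "cache_hit": False}] * 18
    + [{"kind": "write", "cache_hit": False}] * 10
)
load = describe_load(sample, window_seconds=10)
```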
Once you have described the load on your system, you can investigate what happens when the load
increases. You can look at it in two ways:
* When you increase the load in a certain way and keep the system resources (CPUs, memory, network
bandwidth, etc.) unchanged, how is the performance of your system affected?
* When you increase the load in a certain way, how much do you need to increase the resources if you
want to keep performance unchanged?
Usually our goal is to keep the performance of the system within the requirements of the SLA
(see [“Use of Response Time Metrics”](/en/ch2#sec_introduction_slo_sla)) while also minimizing the cost of running the system. The greater
the required computing resources, the higher the cost. It might be that some types of hardware are
more cost-effective than others, and these factors may change over time as new types of hardware
become available.
If you can double the resources in order to handle twice the load, while keeping performance the
same, we say that you have *linear scalability*, and this is considered a good thing. Occasionally
it is possible to handle twice the load with less than double the resources, due to economies of
scale or a better distribution of peak load
[^79] [^80].
Much more likely is that the cost grows faster than linearly, and there may be many reasons for the
inefficiency. For example, if you have a lot of data, then processing a single write request may
involve more work than if you have a small amount of data, even if the size of the request is the
same.
### Shared-Memory, Shared-Disk, and Shared-Nothing Architecture {#sec_introduction_shared_nothing}
The simplest way of increasing the hardware resources of a service is to move it to a more powerful
machine. Individual CPU cores are no longer getting significantly faster, but you can buy a machine
(or rent a cloud instance) with more CPU cores, more RAM, and more disk space. This approach is
called *vertical scaling* or *scaling up*.
You can get parallelism on a single machine by using multiple processes or threads. All the threads
belonging to the same process can access the same RAM, and hence this approach is also called a
*shared-memory architecture*. The problem with a shared-memory approach is that the cost grows
faster than linearly: a high-end machine with twice the hardware resources typically costs
significantly more than twice as much. And due to bottlenecks, a machine twice the size can often
handle less than twice the load.
Another approach is the *shared-disk architecture*, which uses several machines with independent
CPUs and RAM, but which stores data on an array of disks that is shared between the machines, which
are connected via a fast network: *Network-Attached Storage* (NAS) or *Storage Area Network* (SAN).
This architecture has traditionally been used for on-premises data warehousing workloads, but
contention and the overhead of locking limit the scalability of the shared-disk approach [^81].
By contrast, the *shared-nothing architecture* [^82]
(also called *horizontal scaling* or *scaling out*) has gained a lot of popularity. In this
approach, we use a distributed system with multiple nodes, each of which has its own CPUs, RAM, and
disks. Any coordination between nodes is done at the software level, via a conventional network.
The advantages of shared-nothing are that it has the potential to scale linearly, it can use
whatever hardware offers the best price/performance ratio (especially in the cloud), it can more
easily adjust its hardware resources as load increases or decreases, and it can achieve greater
fault tolerance by distributing the system across multiple datacenters and regions. The downsides
are that it requires explicit sharding (see [Chapter 7](/en/ch7#ch_sharding)), and it incurs all the complexity of
distributed systems ([Chapter 9](/en/ch9#ch_distributed)).
Some cloud-native database systems use separate services for storage and transaction execution (see
[“Separation of storage and compute”](/en/ch1#sec_introduction_storage_compute)), with multiple compute nodes sharing access to the same
storage service. This model has some similarity to a shared-disk architecture, but it avoids the
scalability problems of older systems: instead of providing a filesystem (NAS) or block device (SAN)
abstraction, the storage service offers a specialized API that is designed for the specific needs of
the database [^83].
### Principles for Scalability {#id35}
The architecture of systems that operate at large scale is usually highly specific to the
application—there is no such thing as a generic, one-size-fits-all scalable architecture
(informally known as *magic scaling sauce*). For example, a system that is designed to handle
100,000 requests per second, each 1 kB in size, looks very different from a system that is
designed for 3 requests per minute, each 2 GB in size—even though the two systems have the same
data throughput (100 MB/sec).
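The claim that both systems have the same data throughput can be verified with a quick calculation, assuming decimal units (1 kB = 1,000 bytes, 1 GB = 1,000,000,000 bytes):

```python
KB = 1_000
MB = 1_000_000
GB = 1_000_000_000

# 100,000 requests per second, each 1 kB in size
small_requests_bytes_per_s = 100_000 * 1 * KB

# 3 requests per minute, each 2 GB in size
large_requests_bytes_per_s = 3 * 2 * GB / 60

small_mb_per_s = small_requests_bytes_per_s / MB  # 100.0
large_mb_per_s = large_requests_bytes_per_s / MB  # 100.0
```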
Moreover, an architecture that is appropriate for one level of load is unlikely to cope with 10
times that load. If you are working on a fast-growing service, it is therefore likely that you will
need to rethink your architecture on every order of magnitude load increase. As the needs of the
application are likely to evolve, it is usually not worth planning future scaling needs more than
one order of magnitude in advance.
A good general principle for scalability is to break a system down into smaller components that can
operate largely independently from each other. This is the underlying principle behind microservices
(see [“Microservices and Serverless”](/en/ch1#sec_introduction_microservices)), sharding ([Chapter 7](/en/ch7#ch_sharding)), stream processing
([Chapter 12](/en/ch12#ch_stream)), and shared-nothing architectures. However, the challenge is in knowing where to
draw the line between things that should be together, and things that should be apart. Design
guidelines for microservices can be found in other books [^84],
and we discuss sharding of shared-nothing systems in [Chapter 7](/en/ch7#ch_sharding).
Another good principle is not to make things more complicated than necessary. If a single-machine
database will do the job, it’s probably preferable to a complicated distributed setup. Auto-scaling
systems (which automatically add or remove resources in response to demand) are cool, but if your
load is fairly predictable, a manually scaled system may have fewer operational surprises (see
[“Operations: Automatic or Manual Rebalancing”](/en/ch7#sec_sharding_operations)). A system with five services is simpler than one with fifty. Good
architectures usually involve a pragmatic mixture of approaches.
## Maintainability {#sec_introduction_maintainability}
Software does not wear out or suffer material fatigue, so it does not break in the same ways as
mechanical objects do. But the requirements for an application frequently change, the environment
that the software runs in changes (such as its dependencies and the underlying platform), and it has
bugs that need fixing.
It is widely recognized that the majority of the cost of software is not in its initial development,
but in its ongoing maintenance—fixing bugs, keeping its systems operational, investigating failures,
adapting it to new platforms, modifying it for new use cases, repaying technical debt, and adding
new features [^85] [^86].
However, maintenance is also difficult. If a system has been successfully running for a long time,
it may well use outdated technologies that not many engineers understand today (such as mainframes
and COBOL code); institutional knowledge of how and why a system was designed in a certain way may
have been lost as people have left the organization; it might be necessary to fix other people’s
mistakes. Moreover, the computer system is often intertwined with the human organization that it
supports, which means that maintenance of such *legacy* systems is as much a people problem as a
technical one [^87].
Every system we create today will one day become a legacy system if it is valuable enough to survive
for a long time. In order to minimize the pain for future generations who need to maintain our
software, we should design it with maintenance concerns in mind. Although we cannot always predict
which decisions might create maintenance headaches in the future, in this book we will pay attention
to several principles that are widely applicable:
Operability
: Make it easy for the organization to keep the system running smoothly.
Simplicity
: Make it easy for new engineers to understand the system, by implementing it using well-understood,
consistent patterns and structures, and avoiding unnecessary complexity.
Evolvability
: Make it easy for engineers to make changes to the system in the future, adapting it and extending
it for unanticipated use cases as requirements change.
### Operability: Making Life Easy for Operations {#id37}
We previously discussed the role of operations in [“Operations in the Cloud Era”](/en/ch1#sec_introduction_operations), and we saw that
human processes are at least as important for reliable operations as software tools. In fact, it has
been suggested that “good operations can often work around the limitations of bad (or incomplete)
software, but good software cannot run reliably with bad operations” [^60].
In large-scale systems consisting of many thousands of machines, manual maintenance would be
unreasonably expensive, and automation is essential. However, automation can be a two-edged sword:
there will always be edge cases (such as rare failure scenarios) that require manual intervention
from the operations team. Since the cases that cannot be handled automatically are the most complex
issues, greater automation requires a *more* skilled operations team that can resolve those issues [^88].
Moreover, if an automated system goes wrong, it is often harder to troubleshoot than a system that
relies on an operator to perform some actions manually. For that reason, more automation is not
always better for operability. However, some amount of automation is important,
and the sweet spot will depend on the specifics of your particular application and organization.
Good operability means making routine tasks easy, allowing the operations team to focus their efforts
on high-value activities. Data systems can do various things to make routine tasks easy, including [^89]:
* Allowing monitoring tools to check the system’s key metrics, and supporting observability tools
(see [“Problems with Distributed Systems”](/en/ch1#sec_introduction_dist_sys_problems)) to give insights into the system’s runtime behavior.
A variety of commercial and open source tools can help here [^90].
* Avoiding dependency on individual machines (allowing machines to be taken down for maintenance
while the system as a whole continues running uninterrupted)
* Providing good documentation and an easy-to-understand operational model (“If I do X, Y will happen”)
* Providing good default behavior, but also giving administrators the freedom to override defaults when needed
* Self-healing where appropriate, but also giving administrators manual control over the system state when needed
* Exhibiting predictable behavior, minimizing surprises
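As one possible illustration of the "good defaults, but overridable" and "predictable behavior" points above, the following is a minimal sketch of a configuration loader. The class `StoreConfig`, its option names, and the function `load_config` are hypothetical and not taken from any real system; the point is only that defaults work out of the box, administrators can override them, and a misspelled option fails loudly instead of being silently ignored.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class StoreConfig:
    """Hypothetical settings for an imaginary data store (names are illustrative)."""
    max_connections: int = 100       # default intended to suit most deployments
    request_timeout_s: float = 5.0   # default request timeout in seconds
    auto_heal: bool = True           # self-healing is on unless an operator disables it


def load_config(overrides: dict) -> StoreConfig:
    """Start from the defaults and apply only the overrides an administrator supplied."""
    known = {f.name for f in fields(StoreConfig)}
    unknown = set(overrides) - known
    if unknown:
        # Rejecting a misspelled option is less surprising than ignoring it.
        raise ValueError(f"unknown config options: {sorted(unknown)}")
    return replace(StoreConfig(), **overrides)


default_cfg = load_config({})
manual_cfg = load_config({"auto_heal": False, "request_timeout_s": 30.0})
print(default_cfg)
print(manual_cfg)
```

Here `manual_cfg` represents an administrator taking manual control (disabling self-healing and lengthening a timeout) while every option they did not mention keeps its default.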
### Simplicity: Managing Complexity {#id38}
Small software projects can have delightfully simple and expressive code, but as projects get
larger, they often become very complex and difficult to understand. This complexity slows down
everyone who needs to work on the system, further increasing the cost of maintenance. A software
project mired in complexity is sometimes described as a *big ball of mud* [^91].
When complexity makes maintenance hard, budgets and schedules are often overrun. In complex
software, there is also a greater risk of introducing bugs when making a change: when the system is
harder for developers to understand and reason about, hidden assumptions, unintended consequences,
and unexpected interactions are more easily overlooked [^69].
Conversely, reducing complexity greatly improves the maintainability of software, and thus
simplicity should be a key goal for the systems we build.
Simple systems are easier to understand, and therefore we should try to solve a given problem in the
simplest way possible. Unfortunately, this is easier said than done. Whether something is simple or
not is often a subjective matter of taste, as there is no objective standard of simplicity [^92].
For example, one system may hide a complex implementation behind a simple interface, whereas another
may have a simple implementation that exposes more internal detail to its users—which one is simpler?
One attempt at reasoning about complexity has been to break it down into two categories, *essential* and *accidental* complexity [^93].
The idea is that essential complexity is inherent in the problem domain of the application, while
accidental complexity arises only because of limitations of our tooling. Unfortunately, this
distinction is also flawed, because boundaries between the essential and the accidental shift as our tooling evolves [^94].
One of the best tools we have for managing complexity is *abstraction*. A good abstraction can hide
a great deal of implementation detail behind a clean, simple-to-understand façade. A good
abstraction can also be used for a wide range of different applications. Not only is this reuse more
efficient than reimplementing a similar thing multiple times, but it also leads to higher-quality
software, as quality improvements in the abstracted component benefit all applications that use it.
For example, high-level programming languages are abstractions that hide machine code, CPU registers,
and syscalls. SQL is an abstraction that hides complex on-disk and in-memory data structures,
concurrent requests from other clients, and inconsistencies after crashes. Of course, when
programming in a high-level language, we are still using machine code; we are just not using it
*directly*, because the programming language abstraction saves us from having to think about it.
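To make the idea of a façade concrete, here is a small sketch under stated assumptions: `KeyValueStore`, `InMemoryStore`, `LogFileStore`, and `record_login` are hypothetical names invented for this example, and the file-backed implementation is deliberately naive. What it is meant to show is that application code written against the interface does not change when the implementation behind it does.

```python
import json
import os
import tempfile
from abc import ABC, abstractmethod
from typing import Optional


class KeyValueStore(ABC):
    """The façade: callers see only put and get."""

    @abstractmethod
    def put(self, key: str, value: str) -> None: ...

    @abstractmethod
    def get(self, key: str) -> Optional[str]: ...


class InMemoryStore(KeyValueStore):
    """Keeps everything in a dictionary."""

    def __init__(self) -> None:
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)


class LogFileStore(KeyValueStore):
    """Appends every write to a file; a read scans the file and keeps the last match."""

    def __init__(self, path: str) -> None:
        self._path = path

    def put(self, key, value):
        with open(self._path, "a", encoding="utf-8") as f:
            f.write(json.dumps([key, value]) + "\n")

    def get(self, key):
        result = None
        if os.path.exists(self._path):
            with open(self._path, encoding="utf-8") as f:
                for line in f:
                    k, v = json.loads(line)
                    if k == key:
                        result = v  # later entries overwrite earlier ones
        return result


def record_login(store: KeyValueStore, user: str, when: str) -> None:
    """Application code depends only on the interface, not on the storage details."""
    store.put(f"last_login:{user}", when)


with tempfile.TemporaryDirectory() as tmp:
    stores = [InMemoryStore(), LogFileStore(os.path.join(tmp, "kv.log"))]
    results = []
    for store in stores:
        record_login(store, "alice", "2024-01-01")
        record_login(store, "alice", "2024-02-01")
        results.append(store.get("last_login:alice"))
print(results)
```

Both implementations return the most recent value for the key, even though one holds a dictionary in memory and the other scans a file. The caller never has to think about which one it is using.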
Abstractions for application code, which aim to reduce its complexity,
can be created using methodologies such as *design patterns* [^95] and *domain-driven design* (DDD) [^96].
This book is not about such application-specific abstractions, but rather about general-purpose
abstractions on top of which you can build your applications, such as database transactions,
indexes, and event logs. If you want to use techniques such as DDD, you can implement them on top of
the foundations described in this book.
### Evolvability: Making Change Easy {#sec_introduction_evolvability}
It’s extremely unlikely that your system’s requirements will remain unchanged forever. They are much more
likely to be in constant flux: you learn new facts, previously unanticipated use cases emerge,
business priorities change, users request new features, new platforms replace old platforms, legal
or regulatory requirements change, growth of the system forces architectural changes, etc.
In terms of organizational processes, *Agile* working patterns provide a framework for adapting to
change. The Agile community has also developed technical tools and processes that are helpful when
developing software in a frequently changing environment, such as test-driven development (TDD) and
refactoring. In this book, we search for ways of increasing agility at the level of a system
consisting of several different applications or services with different characteristics.
The ease with which you can modify a data system, and adapt it to changing requirements, is closely
linked to its simplicity and its abstractions: loosely-coupled, simple systems are usually easier to
modify than tightly-coupled, complex ones. Since this is such an important idea, we will use a
different word to refer to agility on a data system level: *evolvability* [^97].
One major factor that makes change difficult in large systems is when some action is irreversible,
and therefore that action needs to be taken very carefully [^98].
For example, say you are migrating from one database to another: if you cannot switch back to the
old system in case of problems with the new one, the stakes are much higher than if you can easily go
back. Minimizing irreversibility improves flexibility.
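One way a migration like this might be kept reversible is to write to both systems for a while and use a switch to decide which one serves reads. The sketch below is a simplified illustration, not a production recipe: `MigratingStore` is a hypothetical class, plain dictionaries stand in for the two databases, and it ignores real-world problems such as a write succeeding on one system but failing on the other.

```python
class MigratingStore:
    """Writes go to both stores; a flag picks which one serves reads."""

    def __init__(self, old: dict, new: dict) -> None:
        self.old = old
        self.new = new
        # Flip to True to cut over to the new store, back to False to roll back.
        self.read_from_new = False

    def put(self, key, value):
        # Dual writes keep the old store up to date, so switching back loses no data.
        self.old[key] = value
        self.new[key] = value

    def get(self, key):
        source = self.new if self.read_from_new else self.old
        return source.get(key)


old_db = {"user:1": "alice"}
new_db = dict(old_db)  # one-off backfill of existing data, assumed done beforehand

store = MigratingStore(old_db, new_db)
store.read_from_new = True           # cut over: reads now come from the new store
store.put("user:2", "bob")           # written while the new store is serving reads
store.read_from_new = False          # a problem is found, so switch back
after_rollback = store.get("user:2")
print(after_rollback)  # bob
```

Because the old store kept receiving writes during the cutover, rolling back is just a matter of flipping the flag. Once the old store stops receiving writes, the switch becomes a one-way decision.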
## Summary {#summary}
In this chapter we examined several examples of nonfunctional requirements: performance,
reliability, scalability, and maintainability. Through these topics we have also encountered
principles and terminology that we will need throughout the rest of the book. We started with a case
study of how one might implement home timelines in a social network, which illustrated some of the
challenges that arise at scale.
We discussed how to measure performance (e.g., using response time percentiles), the load on a
system (e.g., using throughput metrics), and how they are used in SLAs. Scalability is a closely
related concept: that is, ensuring performance stays the same when the load grows. We saw some
general principles for scalability, such as breaking a task down into smaller parts that can operate
independently, and we will dive into deep technical detail on scalability techniques in the
following chapters.
To achieve reliability, you can use fault tolerance techniques, which allow a system to continue
providing its service even if some component (e.g., a disk, a machine, or another service) is
faulty. We saw examples of hardware faults that can occur, and distinguished them from software
faults, which can be harder to deal with because they are often strongly correlated. Another aspect
of achieving reliability is to build resilience against humans making mistakes, and we saw blameless
postmortems as a technique for learning from incidents.
Finally, we examined several facets of maintainability, including supporting the work of operations
teams, managing complexity, and making it easy to evolve an application’s functionality over time.
There are no easy answers on how to achieve these things, but one thing that can help is to build
applications using well-understood building blocks that provide useful abstractions. The rest of
this book will cover a selection of building blocks that have proved to be valuable in practice.
### References
[^1]: Mike Cvet. [How We Learned to Stop Worrying and Love Fan-In at Twitter](https://www.youtube.com/watch?v=WEgCjwyXvwc). At *QCon San Francisco*, December 2016.
[^2]: Raffi Krikorian. [Timelines at Scale](https://www.infoq.com/presentations/Twitter-Timeline-Scalability/). At *QCon San Francisco*, November 2012. Archived at [perma.cc/V9G5-KLYK](https://perma.cc/V9G5-KLYK)
[^3]: Twitter. [Twitter’s Recommendation Algorithm](https://blog.twitter.com/engineering/en_us/topics/open-source/2023/twitter-recommendation-algorithm). *blog.twitter.com*, March 2023. Archived at [perma.cc/L5GT-229T](https://perma.cc/L5GT-229T)
[^4]: Raffi Krikorian. [New Tweets per second record, and how!](https://blog.twitter.com/engineering/en_us/a/2013/new-tweets-per-second-record-and-how) *blog.twitter.com*, August 2013. Archived at [perma.cc/6JZN-XJYN](https://perma.cc/6JZN-XJYN)
[^5]: Jaz Volpert. [When Imperfect Systems are Good, Actually: Bluesky’s Lossy Timelines](https://jazco.dev/2025/02/19/imperfection/). *jazco.dev*, February 2025. Archived at [perma.cc/2PVE-L2MX](https://perma.cc/2PVE-L2MX)
[^6]: Samuel Axon. [3% of Twitter’s Servers Dedicated to Justin Bieber](https://mashable.com/archive/justin-bieber-twitter). *mashable.com*, September 2010. Archived at [perma.cc/F35N-CGVX](https://perma.cc/F35N-CGVX)
[^7]: Nathan Bronson, Abutalib Aghayev, Aleksey Charapko, and Timothy Zhu. [Metastable Failures in Distributed Systems](https://sigops.org/s/conferences/hotos/2021/papers/hotos21-s11-bronson.pdf). At *Workshop on Hot Topics in Operating Systems* (HotOS), May 2021. [doi:10.1145/3458336.3465286](https://doi.org/10.1145/3458336.3465286)
[^8]: Marc Brooker. [Metastability and Distributed Systems](https://brooker.co.za/blog/2021/05/24/metastable.html). *brooker.co.za*, May 2021. Archived at [perma.cc/7FGJ-7XRK](https://perma.cc/7FGJ-7XRK)
[^9]: Marc Brooker. [Exponential Backoff And Jitter](https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/). *aws.amazon.com*, March 2015. Archived at [perma.cc/R6MS-AZKH](https://perma.cc/R6MS-AZKH)
[^10]: Marc Brooker. [What is Backoff For?](https://brooker.co.za/blog/2022/08/11/backoff.html) *brooker.co.za*, August 2022. Archived at [perma.cc/PW9N-55Q5](https://perma.cc/PW9N-55Q5)
[^11]: Michael T. Nygard. [*Release It!*](https://learning.oreilly.com/library/view/release-it-2nd/9781680504552/), 2nd Edition. Pragmatic Bookshelf, January 2018. ISBN: 9781680502398
[^12]: Frank Chen. [Slowing Down to Speed Up – Circuit Breakers for Slack’s CI/CD](https://slack.engineering/circuit-breakers/). *slack.engineering*, August 2022. Archived at [perma.cc/5FGS-ZPH3](https://perma.cc/5FGS-ZPH3)
[^13]: Marc Brooker. [Fixing retries with token buckets and circuit breakers](https://brooker.co.za/blog/2022/02/28/retries.html). *brooker.co.za*, February 2022. Archived at [perma.cc/MD6N-GW26](https://perma.cc/MD6N-GW26)
[^14]: David Yanacek. [Using load shedding to avoid overload](https://aws.amazon.com/builders-library/using-load-shedding-to-avoid-overload/). Amazon Builders’ Library, *aws.amazon.com*. Archived at [perma.cc/9SAW-68MP](https://perma.cc/9SAW-68MP)
[^15]: Matthew Sackman. [Pushing Back](https://wellquite.org/posts/lshift/pushing_back/). *wellquite.org*, May 2016. Archived at [perma.cc/3KCZ-RUFY](https://perma.cc/3KCZ-RUFY)
[^16]: Dmitry Kopytkov and Patrick Lee. [Meet Bandaid, the Dropbox service proxy](https://dropbox.tech/infrastructure/meet-bandaid-the-dropbox-service-proxy). *dropbox.tech*, March 2018. Archived at [perma.cc/KUU6-YG4S](https://perma.cc/KUU6-YG4S)
[^17]: Haryadi S. Gunawi, Riza O. Suminto, Russell Sears, Casey Golliher, Swaminathan Sundararaman, Xing Lin, Tim Emami, Weiguang Sheng, Nematollah Bidokhti, Caitie McCaffrey, Gary Grider, Parks M. Fields, Kevin Harms, Robert B. Ross, Andree Jacobson, Robert Ricci, Kirk Webb, Peter Alvaro, H. Birali Runesha, Mingzhe Hao, and Huaicheng Li. [Fail-Slow at Scale: Evidence of Hardware Performance Faults in Large Production Systems](https://www.usenix.org/system/files/conference/fast18/fast18-gunawi.pdf). At *16th USENIX Conference on File and Storage Technologies*, February 2018.
[^18]: Marc Brooker. [Is the Mean Really Useless?](https://brooker.co.za/blog/2017/12/28/mean.html) *brooker.co.za*, December 2017. Archived at [perma.cc/U5AE-CVEM](https://perma.cc/U5AE-CVEM)
[^19]: Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. [Dynamo: Amazon’s Highly Available Key-Value Store](https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf). At *21st ACM Symposium on Operating Systems Principles* (SOSP), October 2007. [doi:10.1145/1294261.1294281](https://doi.org/10.1145/1294261.1294281)
[^20]: Kathryn Whitenton. [The Need for Speed, 23 Years Later](https://www.nngroup.com/articles/the-need-for-speed/). *nngroup.com*, May 2020. Archived at [perma.cc/C4ER-LZYA](https://perma.cc/C4ER-LZYA)
[^21]: Greg Linden. [Marissa Mayer at Web 2.0](https://glinden.blogspot.com/2006/11/marissa-mayer-at-web-20.html). *glinden.blogspot.com*, November 2006. Archived at [perma.cc/V7EA-3VXB](https://perma.cc/V7EA-3VXB)
[^22]: Jake Brutlag. [Speed Matters for Google Web Search](https://services.google.com/fh/files/blogs/google_delayexp.pdf). *services.google.com*, June 2009. Archived at [perma.cc/BK7R-X7M2](https://perma.cc/BK7R-X7M2)
[^23]: Eric Schurman and Jake Brutlag. [Performance Related Changes and their User Impact](https://www.youtube.com/watch?v=bQSE51-gr2s). Talk at *Velocity 2009*.
[^24]: Akamai Technologies, Inc. [The State of Online Retail Performance](https://web.archive.org/web/20210729180749/https%3A//www.akamai.com/us/en/multimedia/documents/report/akamai-state-of-online-retail-performance-spring-2017.pdf). *akamai.com*, April 2017. Archived at [perma.cc/UEK2-HYCS](https://perma.cc/UEK2-HYCS)
[^25]: Xiao Bai, Ioannis Arapakis, B. Barla Cambazoglu, and Ana Freire. [Understanding and Leveraging the Impact of Response Latency on User Behaviour in Web Search](https://iarapakis.github.io/papers/TOIS17.pdf). *ACM Transactions on Information Systems*, volume 36, issue 2, article 21, April 2018. [doi:10.1145/3106372](https://doi.org/10.1145/3106372)
[^26]: Jeffrey Dean and Luiz André Barroso. [The Tail at Scale](https://cacm.acm.org/research/the-tail-at-scale/). *Communications of the ACM*, volume 56, issue 2, pages 74–80, February 2013. [doi:10.1145/2408776.2408794](https://doi.org/10.1145/2408776.2408794)
[^27]: Alex Hidalgo. [*Implementing Service Level Objectives: A Practical Guide to SLIs, SLOs, and Error Budgets*](https://www.oreilly.com/library/view/implementing-service-level/9781492076803/). O’Reilly Media, September 2020. ISBN: 1492076813
[^28]: Jeffrey C. Mogul and John Wilkes. [Nines are Not Enough: Meaningful Metrics for Clouds](https://research.google/pubs/pub48033/). At *17th Workshop on Hot Topics in Operating Systems* (HotOS), May 2019. [doi:10.1145/3317550.3321432](https://doi.org/10.1145/3317550.3321432)
[^29]: Tamás Hauer, Philipp Hoffmann, John Lunney, Dan Ardelean, and Amer Diwan. [Meaningful Availability](https://www.usenix.org/conference/nsdi20/presentation/hauer). At *17th USENIX Symposium on Networked Systems Design and Implementation* (NSDI), February 2020.
[^30]: Ted Dunning. [The t-digest: Efficient estimates of distributions](https://www.sciencedirect.com/science/article/pii/S2665963820300403). *Software Impacts*, volume 7, article 100049, February 2021. [doi:10.1016/j.simpa.2020.100049](https://doi.org/10.1016/j.simpa.2020.100049)
[^31]: David Kohn. [How percentile approximation works (and why it’s more useful than averages)](https://www.timescale.com/blog/how-percentile-approximation-works-and-why-its-more-useful-than-averages/). *timescale.com*, September 2021. Archived at [perma.cc/3PDP-NR8B](https://perma.cc/3PDP-NR8B)
[^32]: Heinrich Hartmann and Theo Schlossnagle. [Circllhist — A Log-Linear Histogram Data Structure for IT Infrastructure Monitoring](https://arxiv.org/pdf/2001.06561.pdf). *arxiv.org*, January 2020.
[^33]: Charles Masson, Jee E. Rim, and Homin K. Lee. [DDSketch: A Fast and Fully-Mergeable Quantile Sketch with Relative-Error Guarantees](https://www.vldb.org/pvldb/vol12/p2195-masson.pdf). *Proceedings of the VLDB Endowment*, volume 12, issue 12, pages 2195–2205, August 2019. [doi:10.14778/3352063.3352135](https://doi.org/10.14778/3352063.3352135)
[^34]: Baron Schwartz. [Why Percentiles Don’t Work the Way You Think](https://orangematter.solarwinds.com/2016/11/18/why-percentiles-dont-work-the-way-you-think/). *solarwinds.com*, November 2016. Archived at [perma.cc/469T-6UGB](https://perma.cc/469T-6UGB)
[^35]: Walter L. Heimerdinger and Charles B. Weinstock. [A Conceptual Framework for System Fault Tolerance](https://resources.sei.cmu.edu/asset_files/TechnicalReport/1992_005_001_16112.pdf). Technical Report CMU/SEI-92-TR-033, Software Engineering Institute, Carnegie Mellon University, October 1992. Archived at [perma.cc/GD2V-DMJW](https://perma.cc/GD2V-DMJW)
[^36]: Felix C. Gärtner. [Fundamentals of fault-tolerant distributed computing in asynchronous environments](https://dl.acm.org/doi/pdf/10.1145/311531.311532). *ACM Computing Surveys*, volume 31, issue 1, pages 1–26, March 1999. [doi:10.1145/311531.311532](https://doi.org/10.1145/311531.311532)
[^37]: Algirdas Avižienis, Jean-Claude Laprie, Brian Randell, and Carl Landwehr. [Basic Concepts and Taxonomy of Dependable and Secure Computing](https://hdl.handle.net/1903/6459). *IEEE Transactions on Dependable and Secure Computing*, volume 1, issue 1, January 2004. [doi:10.1109/TDSC.2004.2](https://doi.org/10.1109/TDSC.2004.2)
[^38]: Ding Yuan, Yu Luo, Xin Zhuang, Guilherme Renna Rodrigues, Xu Zhao, Yongle Zhang, Pranay U. Jain, and Michael Stumm. [Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems](https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-yuan.pdf). At *11th USENIX Symposium on Operating Systems Design and Implementation* (OSDI), October 2014.
[^39]: Casey Rosenthal and Nora Jones. [*Chaos Engineering*](https://learning.oreilly.com/library/view/chaos-engineering/9781492043850/). O’Reilly Media, April 2020. ISBN: 9781492043867
[^40]: Eduardo Pinheiro, Wolf-Dietrich Weber, and Luiz Andre Barroso. [Failure Trends in a Large Disk Drive Population](https://www.usenix.org/legacy/events/fast07/tech/full_papers/pinheiro/pinheiro_old.pdf). At *5th USENIX Conference on File and Storage Technologies* (FAST), February 2007.
[^41]: Bianca Schroeder and Garth A. Gibson. [Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?](https://www.usenix.org/legacy/events/fast07/tech/schroeder/schroeder.pdf) At *5th USENIX Conference on File and Storage Technologies* (FAST), February 2007.
[^42]: Andy Klein. [Backblaze Drive Stats for Q2 2021](https://www.backblaze.com/blog/backblaze-drive-stats-for-q2-2021/). *backblaze.com*, August 2021. Archived at [perma.cc/2943-UD5E](https://perma.cc/2943-UD5E)
[^43]: Iyswarya Narayanan, Di Wang, Myeongjae Jeon, Bikash Sharma, Laura Caulfield, Anand Sivasubramaniam, Ben Cutler, Jie Liu, Badriddine Khessib, and Kushagra Vaid. [SSD Failures in Datacenters: What? When? and Why?](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/08/a7-narayanan.pdf) At *9th ACM International on Systems and Storage Conference* (SYSTOR), June 2016. [doi:10.1145/2928275.2928278](https://doi.org/10.1145/2928275.2928278)
[^44]: Alibaba Cloud Storage Team. [Storage System Design Analysis: Factors Affecting NVMe SSD Performance (1)](https://www.alibabacloud.com/blog/594375). *alibabacloud.com*, January 2019. Archived at [archive.org](https://web.archive.org/web/20230522005034/https%3A//www.alibabacloud.com/blog/594375)
[^45]: Bianca Schroeder, Raghav Lagisetty, and Arif Merchant. [Flash Reliability in Production: The Expected and the Unexpected](https://www.usenix.org/system/files/conference/fast16/fast16-papers-schroeder.pdf). At *14th USENIX Conference on File and Storage Technologies* (FAST), February 2016.
[^46]: Jacob Alter, Ji Xue, Alma Dimnaku, and Evgenia Smirni. [SSD failures in the field: symptoms, causes, and prediction models](https://dl.acm.org/doi/pdf/10.1145/3295500.3356172). At *International Conference for High Performance Computing, Networking, Storage and Analysis* (SC), November 2019. [doi:10.1145/3295500.3356172](https://doi.org/10.1145/3295500.3356172)
[^47]: Daniel Ford, François Labelle, Florentina I. Popovici, Murray Stokely, Van-Anh Truong, Luiz Barroso, Carrie Grimes, and Sean Quinlan. [Availability in Globally Distributed Storage Systems](https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Ford.pdf). At *9th USENIX Symposium on Operating Systems Design and Implementation* (OSDI), October 2010.
[^48]: Kashi Venkatesh Vishwanath and Nachiappan Nagappan. [Characterizing Cloud Computing Hardware Reliability](https://www.microsoft.com/en-us/research/wp-content/uploads/2010/06/socc088-vishwanath.pdf). At *1st ACM Symposium on Cloud Computing* (SoCC), June 2010. [doi:10.1145/1807128.1807161](https://doi.org/10.1145/1807128.1807161)
[^49]: Peter H. Hochschild, Paul Turner, Jeffrey C. Mogul, Rama Govindaraju, Parthasarathy Ranganathan, David E. Culler, and Amin Vahdat. [Cores that don’t count](https://sigops.org/s/conferences/hotos/2021/papers/hotos21-s01-hochschild.pdf). At *Workshop on Hot Topics in Operating Systems* (HotOS), June 2021. [doi:10.1145/3458336.3465297](https://doi.org/10.1145/3458336.3465297)
[^50]: Harish Dattatraya Dixit, Sneha Pendharkar, Matt Beadon, Chris Mason, Tejasvi Chakravarthy, Bharath Muthiah, and Sriram Sankar. [Silent Data Corruptions at Scale](https://arxiv.org/abs/2102.11245). *arXiv:2102.11245*, February 2021.
[^51]: Diogo Behrens, Marco Serafini, Sergei Arnautov, Flavio P. Junqueira, and Christof Fetzer. [Scalable Error Isolation for Distributed Systems](https://www.usenix.org/conference/nsdi15/technical-sessions/presentation/behrens). At *12th USENIX Symposium on Networked Systems Design and Implementation* (NSDI), May 2015.
[^52]: Bianca Schroeder, Eduardo Pinheiro, and Wolf-Dietrich Weber. [DRAM Errors in the Wild: A Large-Scale Field Study](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/35162.pdf). At *11th International Joint Conference on Measurement and Modeling of Computer Systems* (SIGMETRICS), June 2009. [doi:10.1145/1555349.1555372](https://doi.org/10.1145/1555349.1555372)
[^53]: Yoongu Kim, Ross Daly, Jeremie Kim, Chris Fallin, Ji Hye Lee, Donghyuk Lee, Chris Wilkerson, Konrad Lai, and Onur Mutlu. [Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors](https://users.ece.cmu.edu/~yoonguk/papers/kim-isca14.pdf). At *41st Annual International Symposium on Computer Architecture* (ISCA), June 2014. [doi:10.5555/2665671.2665726](https://doi.org/10.5555/2665671.2665726)
[^54]: Tim Bray. [Worst Case](https://www.tbray.org/ongoing/When/202x/2021/10/08/The-WOrst-Case). *tbray.org*, October 2021. Archived at [perma.cc/4QQM-RTHN](https://perma.cc/4QQM-RTHN)
[^55]: Sangeetha Abdu Jyothi. [Solar Superstorms: Planning for an Internet Apocalypse](https://ics.uci.edu/~sabdujyo/papers/sigcomm21-cme.pdf). At *ACM SIGCOMM Conference*, August 2021. [doi:10.1145/3452296.3472916](https://doi.org/10.1145/3452296.3472916)
[^56]: Adrian Cockcroft. [Failure Modes and Continuous Resilience](https://adrianco.medium.com/failure-modes-and-continuous-resilience-6553078caad5). *adrianco.medium.com*, November 2019. Archived at [perma.cc/7SYS-BVJP](https://perma.cc/7SYS-BVJP)
[^57]: Shujie Han, Patrick P. C. Lee, Fan Xu, Yi Liu, Cheng He, and Jiongzhou Liu. [An In-Depth Study of Correlated Failures in Production SSD-Based Data Centers](https://www.usenix.org/conference/fast21/presentation/han). At *19th USENIX Conference on File and Storage Technologies* (FAST), February 2021.
[^58]: Edmund B. Nightingale, John R. Douceur, and Vince Orgovan. [Cycles, Cells and Platters: An Empirical Analysis of Hardware Failures on a Million Consumer PCs](https://eurosys2011.cs.uni-salzburg.at/pdf/eurosys2011-nightingale.pdf). At *6th European Conference on Computer Systems* (EuroSys), April 2011. [doi:10.1145/1966445.1966477](https://doi.org/10.1145/1966445.1966477)
[^59]: Haryadi S. Gunawi, Mingzhe Hao, Tanakorn Leesatapornwongsa, Tiratat Patana-anake, Thanh Do, Jeffry Adityatama, Kurnia J. Eliazar, Agung Laksono, Jeffrey F. Lukman, Vincentius Martin, and Anang D. Satria. [What Bugs Live in the Cloud?](https://ucare.cs.uchicago.edu/pdf/socc14-cbs.pdf) At *5th ACM Symposium on Cloud Computing* (SoCC), November 2014. [doi:10.1145/2670979.2670986](https://doi.org/10.1145/2670979.2670986)
[^60]: Jay Kreps. [Getting Real About Distributed System Reliability](https://blog.empathybox.com/post/19574936361/getting-real-about-distributed-system-reliability). *blog.empathybox.com*, March 2012. Archived at [perma.cc/9B5Q-AEBW](https://perma.cc/9B5Q-AEBW)
[^61]: Nelson Minar. [Leap Second Crashes Half the Internet](https://www.somebits.com/weblog/tech/bad/leap-second-2012.html). *somebits.com*, July 2012. Archived at [perma.cc/2WB8-D6EU](https://perma.cc/2WB8-D6EU)
[^62]: Hewlett Packard Enterprise. [Support Alerts – Customer Bulletin a00092491en\_us](https://support.hpe.com/hpesc/public/docDisplay?docId=emr_na-a00092491en_us). *support.hpe.com*, November 2019. Archived at [perma.cc/S5F6-7ZAC](https://perma.cc/S5F6-7ZAC)
[^63]: Lorin Hochstein. [awesome limits](https://github.com/lorin/awesome-limits). *github.com*, November 2020. Archived at [perma.cc/3R5M-E5Q4](https://perma.cc/3R5M-E5Q4)
[^64]: Caitie McCaffrey. [Clients Are Jerks: AKA How Halo 4 DoSed the Services at Launch & How We Survived](https://www.caitiem.com/2015/06/23/clients-are-jerks-aka-how-halo-4-dosed-the-services-at-launch-how-we-survived/). *caitiem.com*, June 2015. Archived at [perma.cc/MXX4-W373](https://perma.cc/MXX4-W373)
[^65]: Lilia Tang, Chaitanya Bhandari, Yongle Zhang, Anna Karanika, Shuyang Ji, Indranil Gupta, and Tianyin Xu. [Fail through the Cracks: Cross-System Interaction Failures in Modern Cloud Systems](https://tianyin.github.io/pub/csi-failures.pdf). At *18th European Conference on Computer Systems* (EuroSys), May 2023. [doi:10.1145/3552326.3587448](https://doi.org/10.1145/3552326.3587448)
[^66]: Mike Ulrich. [Addressing Cascading Failures](https://sre.google/sre-book/addressing-cascading-failures/). In Betsy Beyer, Jennifer Petoff, Chris Jones, and Niall Richard Murphy (eds). [*Site Reliability Engineering: How Google Runs Production Systems*](https://www.oreilly.com/library/view/site-reliability-engineering/9781491929117/). O’Reilly Media, 2016. ISBN: 9781491929124
[^67]: Harri Faßbender. [Cascading failures in large-scale distributed systems](https://blog.mi.hdm-stuttgart.de/index.php/2022/03/03/cascading-failures-in-large-scale-distributed-systems/). *blog.mi.hdm-stuttgart.de*, March 2022. Archived at [perma.cc/K7VY-YJRX](https://perma.cc/K7VY-YJRX)
[^68]: Richard I. Cook. [How Complex Systems Fail](https://www.adaptivecapacitylabs.com/HowComplexSystemsFail.pdf). Cognitive Technologies Laboratory, April 2000. Archived at [perma.cc/RDS6-2YVA](https://perma.cc/RDS6-2YVA)
[^69]: David D. Woods. [STELLA: Report from the SNAFUcatchers Workshop on Coping With Complexity](https://snafucatchers.github.io/). *snafucatchers.github.io*, March 2017. Archived at [archive.org](https://web.archive.org/web/20230306130131/https%3A//snafucatchers.github.io/)
[^70]: David Oppenheimer, Archana Ganapathi, and David A. Patterson. [Why Do Internet Services Fail, and What Can Be Done About It?](https://static.usenix.org/events/usits03/tech/full_papers/oppenheimer/oppenheimer.pdf) At *4th USENIX Symposium on Internet Technologies and Systems* (USITS), March 2003.
[^71]: Sidney Dekker. [*The Field Guide to Understanding ‘Human Error’*](https://learning.oreilly.com/library/view/the-field-guide/9781317031833/), 3rd Edition. CRC Press, November 2017. ISBN: 9781472439055
[^72]: Sidney Dekker. [*Drift into Failure: From Hunting Broken Components to Understanding Complex Systems*](https://www.taylorfrancis.com/books/mono/10.1201/9781315257396/drift-failure-sidney-dekker). CRC Press, 2011. ISBN: 9781315257396
[^73]: John Allspaw. [Blameless PostMortems and a Just Culture](https://www.etsy.com/codeascraft/blameless-postmortems/). *etsy.com*, May 2012. Archived at [perma.cc/YMJ7-NTAP](https://perma.cc/YMJ7-NTAP)
[^74]: Itzy Sabo. [Uptime Guarantees — A Pragmatic Perspective](https://world.hey.com/itzy/uptime-guarantees-a-pragmatic-perspective-736d7ea4). *world.hey.com*, March 2023. Archived at [perma.cc/F7TU-78JB](https://perma.cc/F7TU-78JB)
[^75]: Michael Jurewitz. [The Human Impact of Bugs](http://jury.me/blog/2013/3/14/the-human-impact-of-bugs). *jury.me*, March 2013. Archived at [perma.cc/5KQ4-VDYL](https://perma.cc/5KQ4-VDYL)
[^76]: Mark Halper. [How Software Bugs led to ‘One of the Greatest Miscarriages of Justice’ in British History](https://cacm.acm.org/news/how-software-bugs-led-to-one-of-the-greatest-miscarriages-of-justice-in-british-history/). *Communications of the ACM*, January 2025. [doi:10.1145/3703779](https://doi.org/10.1145/3703779)
[^77]: Nicholas Bohm, James Christie, Peter Bernard Ladkin, Bev Littlewood, Paul Marshall, Stephen Mason, Martin Newby, Steven J. Murdoch, Harold Thimbleby, and Martyn Thomas. [The legal rule that computers are presumed to be operating correctly – unforeseen and unjust consequences](https://www.benthamsgaze.org/wp-content/uploads/2022/06/briefing-presumption-that-computers-are-reliable.pdf). Briefing note, *benthamsgaze.org*, June 2022. Archived at [perma.cc/WQ6X-TMW4](https://perma.cc/WQ6X-TMW4)
[^78]: Dan McKinley. [Choose Boring Technology](https://mcfunley.com/choose-boring-technology). *mcfunley.com*, March 2015. Archived at [perma.cc/7QW7-J4YP](https://perma.cc/7QW7-J4YP)
[^79]: Andy Warfield. [Building and operating a pretty big storage system called S3](https://www.allthingsdistributed.com/2023/07/building-and-operating-a-pretty-big-storage-system.html). *allthingsdistributed.com*, July 2023. Archived at [perma.cc/7LPK-TP7V](https://perma.cc/7LPK-TP7V)
[^80]: Marc Brooker. [Surprising Scalability of Multitenancy](https://brooker.co.za/blog/2023/03/23/economics.html). *brooker.co.za*, March 2023. Archived at [perma.cc/ZZD9-VV8T](https://perma.cc/ZZD9-VV8T)
[^81]: Ben Stopford. [Shared Nothing vs. Shared Disk Architectures: An Independent View](http://www.benstopford.com/2009/11/24/understanding-the-shared-nothing-architecture/). *benstopford.com*, November 2009. Archived at [perma.cc/7BXH-EDUR](https://perma.cc/7BXH-EDUR)
[^82]: Michael Stonebraker. [The Case for Shared Nothing](https://dsf.berkeley.edu/papers/hpts85-nothing.pdf). *IEEE Database Engineering Bulletin*, volume 9, issue 1, pages 4–9, March 1986.
[^83]: Panagiotis Antonopoulos, Alex Budovski, Cristian Diaconu, Alejandro Hernandez Saenz, Jack Hu, Hanuma Kodavalla, Donald Kossmann, Sandeep Lingam, Umar Farooq Minhas, Naveen Prakash, Vijendra Purohit, Hugh Qu, Chaitanya Sreenivas Ravella, Krystyna Reisteter, Sheetal Shrotri, Dixin Tang, and Vikram Wakade. [Socrates: The New SQL Server in the Cloud](https://www.microsoft.com/en-us/research/uploads/prod/2019/05/socrates.pdf). At *ACM International Conference on Management of Data* (SIGMOD), pages 1743–1756, June 2019. [doi:10.1145/3299869.3314047](https://doi.org/10.1145/3299869.3314047)
[^84]: Sam Newman. [*Building Microservices*, second edition](https://www.oreilly.com/library/view/building-microservices-2nd/9781492034018/). O’Reilly Media, 2021. ISBN: 9781492034025
[^85]: Nathan Ensmenger. [When Good Software Goes Bad: The Surprising Durability of an Ephemeral Technology](https://themaintainers.wpengine.com/wp-content/uploads/2021/04/ensmenger-maintainers-v2.pdf). At *The Maintainers Conference*, April 2016. Archived at [perma.cc/ZXT4-HGZB](https://perma.cc/ZXT4-HGZB)
[^86]: Robert L. Glass. [*Facts and Fallacies of Software Engineering*](https://learning.oreilly.com/library/view/facts-and-fallacies/0321117425/). Addison-Wesley Professional, October 2002. ISBN: 9780321117427
[^87]: Marianne Bellotti. [*Kill It with Fire*](https://learning.oreilly.com/library/view/kill-it-with/9781098128883/). No Starch Press, April 2021. ISBN: 9781718501188
[^88]: Lisanne Bainbridge. [Ironies of automation](https://www.adaptivecapacitylabs.com/IroniesOfAutomation-Bainbridge83.pdf). *Automatica*, volume 19, issue 6, pages 775–779, November 1983. [doi:10.1016/0005-1098(83)90046-8](https://doi.org/10.1016/0005-1098%2883%2990046-8)
[^89]: James Hamilton. [On Designing and Deploying Internet-Scale Services](https://www.usenix.org/legacy/events/lisa07/tech/full_papers/hamilton/hamilton.pdf). At *21st Large Installation System Administration Conference* (LISA), November 2007.
[^90]: Dotan Horovits. [Open Source for Better Observability](https://horovits.medium.com/open-source-for-better-observability-8c65b5630561). *horovits.medium.com*, October 2021. Archived at [perma.cc/R2HD-U2ZT](https://perma.cc/R2HD-U2ZT)
[^91]: Brian Foote and Joseph Yoder. [Big Ball of Mud](http://www.laputan.org/pub/foote/mud.pdf). At *4th Conference on Pattern Languages of Programs* (PLoP), September 1997. Archived at [perma.cc/4GUP-2PBV](https://perma.cc/4GUP-2PBV)
[^92]: Marc Brooker. [What is a simple system?](https://brooker.co.za/blog/2022/05/03/simplicity.html) *brooker.co.za*, May 2022. Archived at [perma.cc/U72T-BFVE](https://perma.cc/U72T-BFVE)
[^93]: Frederick P. Brooks. [No Silver Bullet – Essence and Accident in Software Engineering](https://worrydream.com/refs/Brooks_1986_-_No_Silver_Bullet.pdf). In [*The Mythical Man-Month*](https://www.oreilly.com/library/view/mythical-man-month-the/0201835959/), Anniversary edition, Addison-Wesley, 1995. ISBN: 9780201835953
[^94]: Dan Luu. [Against essential and accidental complexity](https://danluu.com/essential-complexity/). *danluu.com*, December 2020. Archived at [perma.cc/H5ES-69KC](https://perma.cc/H5ES-69KC)
[^95]: Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides. [*Design Patterns: Elements of Reusable Object-Oriented Software*](https://learning.oreilly.com/library/view/design-patterns-elements/0201633612/). Addison-Wesley Professional, October 1994. ISBN: 9780201633610
[^96]: Eric Evans. [*Domain-Driven Design: Tackling Complexity in the Heart of Software*](https://learning.oreilly.com/library/view/domain-driven-design-tackling/0321125215/). Addison-Wesley Professional, August 2003. ISBN: 9780321125217
[^97]: Hongyu Pei Breivold, Ivica Crnkovic, and Peter J. Eriksson. [Analyzing Software Evolvability](https://www.es.mdh.se/pdf_publications/1251.pdf). At *32nd Annual IEEE International Computer Software and Applications Conference* (COMPSAC), July 2008. [doi:10.1109/COMPSAC.2008.50](https://doi.org/10.1109/COMPSAC.2008.50)
[^98]: Enrico Zaninotto. [From X programming to the X organisation](https://martinfowler.com/articles/zaninotto.pdf). At *XP Conference*, May 2002. Archived at [perma.cc/R9AR-QCKZ](https://perma.cc/R9AR-QCKZ)
================================================
FILE: content/en/ch3.md
================================================
---
title: "3. Data Models and Query Languages"
weight: 103
breadcrumbs: false
---

> *The limits of my language mean the limits of my world.*
>
> Ludwig Wittgenstein, *Tractatus Logico-Philosophicus* (1922)

Data models are perhaps the most important part of developing software, because they have such a
profound effect: not only on how the software is written, but also on how we *think about the problem*
that we are solving.
Most applications are built by layering one data model on top of another. For each layer, the key
question is: how is it *represented* in terms of the next-lower layer? For example:
1. As an application developer, you look at the real world (in which there are people,
organizations, goods, actions, money flows, sensors, etc.) and model it in terms of objects or
data structures, and APIs that manipulate those data structures. Those structures are often
specific to your application.
2. When you want to store those data structures, you express them in terms of a general-purpose
data model, such as JSON or XML documents, tables in a relational database, or vertices and
edges in a graph. Those data models are the topic of this chapter.
3. The engineers who built your database software decided on a way of representing that
document/relational/graph data in terms of bytes in memory, on disk, or on a network. The
representation may allow the data to be queried, searched, manipulated, and processed in various
ways. We will discuss these storage engine designs in [Chapter 4](/en/ch4#ch_storage).
4. On yet lower levels, hardware engineers have figured out how to represent bytes in terms of
electrical currents, pulses of light, magnetic fields, and more.

In a complex application there may be more intermediary levels, such as APIs built upon APIs, but
the basic idea is still the same: each layer hides the complexity of the layers below it by
providing a clean data model. These abstractions allow different groups of people—for example,
the engineers at the database vendor and the application developers using their database—to work together effectively.
Several different data models are widely used in practice, often for different purposes. Some types
of data and some queries are easy to express in one model, and awkward in another. In this chapter
we will explore those trade-offs by comparing the relational model, the document model, graph-based
data models, event sourcing, and dataframes. We will also briefly look at query languages that allow
you to work with these models. This comparison will help you decide when to use which model.

--------
> [!TIP] TERMINOLOGY: DECLARATIVE QUERY LANGUAGES
Many of the query languages in this chapter (such as SQL, Cypher, SPARQL, or Datalog) are
*declarative*, which means that you specify the pattern of the data you want—what conditions the
results must meet, and how you want the data to be transformed (e.g., sorted, grouped, and
aggregated)—but not *how* to achieve that goal. The database system’s query optimizer can decide
which indexes and which join algorithms to use, and in which order to execute various parts of the query.
In contrast, with most programming languages you would have to write an *algorithm*—i.e., telling
the computer which operations to perform in which order. A declarative query language is attractive
because it is typically more concise and easier to write than an explicit algorithm. But more
importantly, it also hides implementation details of the query engine, which makes it possible for
the database system to introduce performance improvements without requiring any changes to queries [^1].
For example, a database might be able to execute a declarative query in parallel across multiple CPU
cores and machines, without you having to worry about how to implement that parallelism [^2].
In a hand-coded algorithm it would be a lot of work to implement such parallel execution yourself.
--------
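To make the contrast between a declarative query and an explicit algorithm concrete, here is a minimal sketch in Python using the standard-library `sqlite3` module. The `people` data and both function names are hypothetical and chosen only for illustration. The first function spells out how to group and average; the second states only what result is wanted and leaves the execution strategy to the database.

```python
import sqlite3

# Hypothetical sample data: (name, region, age)
people = [
    ("Alice", "EU", 34),
    ("Bob", "US", 27),
    ("Carol", "EU", 45),
    ("Dave", "US", 51),
]

def average_age_imperative(rows):
    """Explicit algorithm: we decide how to group, sum, and divide."""
    totals = {}  # region -> (sum of ages, number of people)
    for _name, region, age in rows:
        total, count = totals.get(region, (0, 0))
        totals[region] = (total + age, count + 1)
    return sorted((region, total / count) for region, (total, count) in totals.items())

def average_age_declarative(rows):
    """Declarative query: we state the desired result, the database plans the execution."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE people (name TEXT, region TEXT, age INTEGER)")
    conn.executemany("INSERT INTO people VALUES (?, ?, ?)", rows)
    result = conn.execute(
        "SELECT region, AVG(age) FROM people GROUP BY region ORDER BY region"
    ).fetchall()
    conn.close()
    return result

print(average_age_imperative(people))
print(average_age_declarative(people))
```

Both functions return the same rows, but only the hand-written loop would need to be rewritten if you wanted it to use an index or run in parallel.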
## Relational Model versus Document Model {#sec_datamodels_history}
The best-known data model today is probably that of SQL, based on the relational model proposed by Edgar Codd in 1970 [^3]:
data is organized into *relations* (called *tables* in SQL), where each relation is an unordered collection of *tuples* (*rows* in SQL).
The relational model was originally a theoretical proposal, and many people at the time doubted whether it
could be implemented efficiently. However, by the mid-1980s, relational database management systems
(RDBMS) and SQL had become the tools of choice for most people who needed to store and query data
with some kind of regular structure. Many data management use cases are still dominated by
relational data decades later—for example, business analytics (see [“Stars and Snowflakes: Schemas for Analytics”](/en/ch3#sec_datamodels_analytics)).
Over the years, there have been many competing approaches to data storage and querying. In the 1970s
and early 1980s, the *network model* and the *hierarchical model* were the main alternatives, but
the relational model came to dominate them. Object databases came and went again in the late 1980s
and early 1990s. XML databases appeared in the early 2000s, but have only seen niche adoption. Each
competitor to the relational model generated a lot of hype in its time, but it never lasted [^4].
Instead, SQL has grown to incorporate other data types besides its relational core—for example,
adding support for XML, JSON, and graph data [^5].
In the 2010s, *NoSQL* was the latest buzzword that tried to overthrow the dominance of relational
databases. NoSQL refers not to a single technology, but a loose set of ideas around new data models,
schema flexibility, scalability, and a move towards open source licensing models. Some databases
branded themselves as *NewSQL*, as they aim to provide the scalability of NoSQL systems along with
the data model and transactional guarantees of traditional relational databases. The NoSQL and
NewSQL ideas have been very influential in the design of data systems, but as the principles have
become widely adopted, use of those terms has faded.
One lasting effect of the NoSQL movement is the popularity of the *document model*, which usually
represents data as JSON. This model was originally popularized by specialized document databases
such as MongoDB and Couchbase, although most relational databases have now also added JSON support.
Compared to relational tables, which are often seen as having a rigid and inflexible schema, JSON
documents are thought to be more flexible.
The pros and cons of document and relational data have been debated extensively; let’s examine some
of the key points of that debate.
### The Object-Relational Mismatch {#sec_datamodels_document}
Much application development today is done in object-oriented programming languages, which leads to
a common criticism of the SQL data model: if data is stored in relational tables, an awkward
translation layer is required between the objects in the application code and the database model of
tables, rows, and columns. The disconnect between the models is sometimes called an *impedance mismatch*.

--------
> [!NOTE]
> The term *impedance mismatch* is borrowed from electronics. Every electric circuit has a certain
> impedance (resistance to alternating current) on its inputs and outputs. When you connect one
> circuit’s output to another one’s input, the power transfer across the connection is maximized if
> the output and input impedances of the two circuits match. An impedance mismatch can lead to signal
> reflections and other troubles.
--------
#### Object-relational mapping (ORM) {#object-relational-mapping-orm}
Object-relational mapping (ORM) frameworks like ActiveRecord and Hibernate reduce the amount of
boilerplate code required for this translation layer, but they are often criticized [^6].
Some commonly cited problems are:
* ORMs are complex and can’t completely hide the differences between the two models, so developers
still end up having to think about both the relational and the object representations of the data.
* ORMs are generally only used for OLTP app development (see [“Characterizing Transaction Processing and Analytics”](/en/ch1#sec_introduction_oltp)); data
engineers making the data available for analytics purposes still need to work with the underlying
relational representation, so the design of the relational schema still matters when using an ORM.
* Many ORMs work only with relational OLTP databases. Organizations with diverse data systems such
as search engines, graph databases, and NoSQL systems might find ORM support lacking.
* Some ORMs generate relational schemas automatically, but these might be awkward for the users who
are accessing the relational data directly, and they might be inefficient on the underlying
database. Customizing the ORM’s schema and query generation can be complex and negate the benefit of using the ORM in the first place.
* ORMs make it easy to accidentally write inefficient queries, such as the *N+1 query problem* [^7].
For example, say you want to display a list of user comments on a page, so you perform one query
that returns *N* comments, each containing the ID of its author. To show the name of the comment
author you need to look up the ID in the users table. In hand-written SQL you would probably
perform this join in the query and return the author name along with each comment, but with an ORM
you might end up making a separate query on the users table for each of the *N* comments to look
up its author, resulting in *N*+1 database queries in total, which is slower than performing the
join in the database. To avoid this problem, you may need to tell the ORM to fetch the author
information at the same time as fetching the comments.
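The N+1 pattern described in the last bullet can be reproduced without any ORM. The following sketch uses Python's `sqlite3` module with a hypothetical `users`/`comments` schema and a small `run` helper (an assumption for this example) that counts queries. It mimics what a lazily loading ORM might do behind the scenes, and compares it with a single join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE comments (
        id INTEGER PRIMARY KEY,
        author_id INTEGER REFERENCES users(id),
        body TEXT
    );
    INSERT INTO users VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO comments VALUES
        (10, 1, 'First!'), (11, 2, 'Nice post'), (12, 1, 'Thanks');
""")

query_count = 0

def run(sql, params=()):
    """Execute one query and count it, so the two strategies can be compared."""
    global query_count
    query_count += 1
    return conn.execute(sql, params).fetchall()

def comments_n_plus_one():
    # One query fetches the N comments...
    comments = run("SELECT id, author_id, body FROM comments ORDER BY id")
    result = []
    for _id, author_id, body in comments:
        # ...and then one more query per comment looks up its author.
        (name,) = run("SELECT name FROM users WHERE id = ?", (author_id,))[0]
        result.append((name, body))
    return result

def comments_with_join():
    # A single query: the database performs the join.
    return run("""
        SELECT users.name, comments.body
        FROM comments
        JOIN users ON comments.author_id = users.id
        ORDER BY comments.id
    """)

query_count = 0
slow = comments_n_plus_one()
n_plus_one_queries = query_count

query_count = 0
fast = comments_with_join()
join_queries = query_count

print(n_plus_one_queries, join_queries)
```

With three comments, the first strategy issues four queries and the second issues one, while both return the same rows. Many ORMs offer some form of eager loading to get the second behavior; the exact option name depends on the framework.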
Nevertheless, ORMs also have advantages:
* For data that is well suited to a relational model, some kind of translation between the
persistent relational and the in-memory object representation is inevitable, and ORMs reduce the
amount of boilerplate code required for this translation. Complicated queries may still need to be
handled outside of the ORM, but the ORM can help with the simple and repetitive cases.
* Some ORMs help with caching the results of database queries, which can help reduce the load on the database.
* ORMs can also help with managing schema migrations and other administrative activities.
#### The document data model for one-to-many relationships {#the-document-data-model-for-one-to-many-relationships}
Not all data lends itself well to a relational representation; let’s look at an example to explore a
limitation of the relational model. [Figure 3-1](/en/ch3#fig_obama_relational) illustrates how a résumé (a LinkedIn
profile) could be expressed in a relational schema. The profile as a whole can be identified by a
unique identifier, `user_id`. Fields like `first_name` and `last_name` appear exactly once per user,
so they can be modeled as columns on the `users` table.
Most people have had more than one job in their career (positions), and people may have varying
numbers of periods of education and any number of pieces of contact information. One way of
representing such *one-to-many relationships* is to put positions, education, and contact
information in separate tables, with a foreign key reference to the `users` table, as in
[Figure 3-1](/en/ch3#fig_obama_relational).
{{< figure src="/fig/ddia_0301.png" id="fig_obama_relational" caption="Figure 3-1. Representing a LinkedIn profile using a relational schema." class="w-full my-4" >}}
Another way of representing the same information, which is perhaps more natural and maps more
closely to an object structure in application code, is as a JSON document as shown in
[Example 3-1](/en/ch3#fig_obama_json).
{{< figure id="fig_obama_json" title="Example 3-1. Representing a LinkedIn profile as a JSON document" class="w-full my-4" >}}
```json
{
"user_id": 251,
"first_name": "Barack",
"last_name": "Obama",
"headline": "Former President of the United States of America",
"region_id": "us:91",
"photo_url": "/p/7/000/253/05b/308dd6e.jpg",
"positions": [
{"job_title": "President", "organization": "United States of America"},
{"job_title": "US Senator (D-IL)", "organization": "United States Senate"}
],
"education": [
{"school_name": "Harvard University", "start": 1988, "end": 1991},
{"school_name": "Columbia University", "start": 1981, "end": 1983}
],
"contact_info": {
"website": "https://barackobama.com",
"twitter": "https://twitter.com/barackobama"
}
}
```
Some developers feel that the JSON model reduces the impedance mismatch between the application code
and the storage layer. However, as we shall see in [Chapter 5](/en/ch5#ch_encoding), there are also problems with
JSON as a data encoding format. The lack of a schema is often cited as an advantage; we will discuss
this in [“Schema flexibility in the document model”](/en/ch3#sec_datamodels_schema_flexibility).
The JSON representation has better *locality* than the multi-table schema in
[Figure 3-1](/en/ch3#fig_obama_relational) (see [“Data locality for reads and writes”](/en/ch3#sec_datamodels_document_locality)). If you want to fetch a profile
in the relational example, you need to either perform multiple queries (query each table by
`user_id`) or perform a messy multi-way join between the `users` table and its subordinate tables [^8].
In the JSON representation, all the relevant information is in one place, making the query both
faster and simpler.
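As a rough illustration of the multiple-queries approach, the sketch below assembles a profile document from three tables using Python's `sqlite3` module. The schema is a simplified assumption based on Figure 3-1 (the column names `start_year` and `end_year` and the function `fetch_profile` are made up for this example), and it omits the contact information table for brevity.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # rows can be converted to dicts
conn.executescript("""
    CREATE TABLE users (user_id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT);
    CREATE TABLE positions (
        id INTEGER PRIMARY KEY,
        user_id INTEGER REFERENCES users(user_id),
        job_title TEXT, organization TEXT
    );
    CREATE TABLE education (
        id INTEGER PRIMARY KEY,
        user_id INTEGER REFERENCES users(user_id),
        school_name TEXT, start_year INTEGER, end_year INTEGER
    );
    INSERT INTO users VALUES (251, 'Barack', 'Obama');
    INSERT INTO positions (user_id, job_title, organization) VALUES
        (251, 'President', 'United States of America'),
        (251, 'US Senator (D-IL)', 'United States Senate');
    INSERT INTO education (user_id, school_name, start_year, end_year) VALUES
        (251, 'Harvard University', 1988, 1991),
        (251, 'Columbia University', 1981, 1983);
""")

def fetch_profile(user_id):
    """Assemble one profile from three tables, using one query per table."""
    user = conn.execute(
        "SELECT user_id, first_name, last_name FROM users WHERE user_id = ?",
        (user_id,),
    ).fetchone()
    if user is None:
        return None
    profile = dict(user)
    profile["positions"] = [dict(row) for row in conn.execute(
        "SELECT job_title, organization FROM positions WHERE user_id = ? ORDER BY id",
        (user_id,),
    )]
    profile["education"] = [dict(row) for row in conn.execute(
        "SELECT school_name, start_year, end_year FROM education "
        "WHERE user_id = ? ORDER BY id",
        (user_id,),
    )]
    return profile

profile = fetch_profile(251)
print(json.dumps(profile, indent=2))
```

The application code has to know about every subordinate table and stitch the results together, whereas a document database would return the equivalent of `profile` from a single lookup by ID.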
The one-to-many relationships from the user profile to the user’s positions, educational history, and
contact information imply a tree structure in the data, and the JSON representation makes this tree
structure explicit (see [Figure 3-2](/en/ch3#fig_json_tree)).
{{< figure src="/fig/ddia_0302.png" id="fig_json_tree" caption="Figure 3-2. One-to-many relationships forming a tree structure." class="w-full my-4" >}}
--------
> [!NOTE]
> This type of relationship is sometimes called *one-to-few* rather than *one-to-many*, since a résumé typically has a small number of positions [^9] [^10].
> In situations where there may be a genuinely large number of related items—say, comments on a
> celebrity’s social media post, of which there could be many thousands—embedding them all in the same
> document may be too unwieldy, so the relational approach in [Figure 3-1](/en/ch3#fig_obama_relational) is preferable.
--------
### Normalization, Denormalization, and Joins {#sec_datamodels_normalization}
In [Example 3-1](/en/ch3#fig_obama_json) in the preceding section, `region_id` is given as an ID, not as the plain-text
string `"Washington, DC, United States"`. Why?
If the user interface has a free-text field for entering the region, it makes sense to store it as a
plain-text string. But there are advantages to having standardized lists of geographic regions, and
letting users choose from a drop-down list or autocompleter:
* Consistent style and spelling across profiles
* Avoiding ambiguity if there are several places with the same name (if the string were just
“Washington”, would it refer to DC or to the state?)
* Ease of updating—the name is stored in only one place, so it is easy to update across the board if
it ever needs to be changed (e.g., change of a city name due to political events)
* Localization support—when the site is translated into other languages, the standardized lists can
be localized, so the region can be displayed in the viewer’s language
* Better search—e.g., a search for people on the US East Coast can match this profile, because the
list of regions can encode the fact that Washington is located on the East Coast (which is not
apparent from the string `"Washington, DC"`)

Whether you store an ID or a text string is a question of *normalization*. When you use an ID, your
data is more normalized: the information that is meaningful to humans (such as the text *Washington,
DC*) is stored in only one place, and everything that refers to it uses an ID (which only has
meaning within the database). When you store the text directly, you are duplicating the
human-meaningful information in every record that uses it; this representation is *denormalized*.
The advantage of using an ID is that because it has no meaning to humans, it never needs to change:
the ID can remain the same, even if the information it identifies changes. Anything that is
meaningful to humans may need to change sometime in the future—and if that information is
duplicated, all the redundant copies need to be updated. That requires more code, more write
operations, more disk space, and risks inconsistencies (where some copies of the information are
updated but others aren’t).
The downside of a normalized representation is that every time you want to display a record
containing an ID, you have to do an additional lookup to resolve the ID into something
human-readable. In a relational data model, this is done using a *join*, for example:
```sql
SELECT users.*, regions.region_name
FROM users
JOIN regions ON users.region_id = regions.id
WHERE users.id = 251;
```
Document databases can store both normalized and denormalized data, but they are often associated
with denormalization—partly because the JSON data model makes it easy to store additional,
denormalized fields, and partly because the weak support for joins in many document databases makes
normalization inconvenient. Some document databases don’t support joins at all, so you have to
perform them in application code—that is, you first fetch a document containing an ID, and then
perform a second query to resolve that ID into another document. In MongoDB, it is also possible to
perform a join using the `$lookup` operator in an aggregation pipeline:
```mongodb-json
db.users.aggregate([
{ $match: { _id: 251 } },
{ $lookup: {
from: "regions",
localField: "region_id",
foreignField: "_id",
as: "region"
} }
])
```
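For databases without join support, the two-step lookup in application code might look like the following sketch. The dictionaries `users` and `regions` are in-memory stand-ins for two document collections, and `find_by_id` and `fetch_user_with_region` are hypothetical helpers, not the API of any particular database driver.

```python
# In-memory stand-ins for two document collections, keyed by _id (assumed data).
users = {
    251: {"_id": 251, "first_name": "Barack", "last_name": "Obama", "region_id": "us:91"},
}
regions = {
    "us:91": {"_id": "us:91", "region_name": "Washington, DC, United States"},
}

def find_by_id(collection, doc_id):
    """Stand-in for a single-document query; returns a copy, or None if not found."""
    doc = collection.get(doc_id)
    return dict(doc) if doc is not None else None

def fetch_user_with_region(user_id):
    # First query: fetch the document that contains the ID.
    user = find_by_id(users, user_id)
    if user is None:
        return None
    # Second query: resolve that ID into another document.
    # The result may be None if the reference points to a missing document.
    user["region"] = find_by_id(regions, user["region_id"])
    return user

user = fetch_user_with_region(251)
print(user["region"]["region_name"])
```

Each resolved reference costs an extra round trip to the database, and the application has to handle references that point to a document that no longer exists.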
#### Trade-offs of normalization {#trade-offs-of-normalization}
In the résumé example, while the `region_id` field is a reference into a standardized set of
regions, the name of the `organization` (the company or government where the person worked) and
`school_name` (where they studied) are just strings. This representation is denormalized: many
people may have worked at the same company, but there is no ID linking them.
Perhaps the organization and school should be entities instead, and the profile should reference
their IDs instead of their names? The same arguments for referencing the ID of a region also apply
here. For example, say we wanted to include the logo of the school or company in addition to their
name:
* In a denormalized representation, we would include the image URL of the logo on every individual
person’s profile; this makes the JSON document self-contained, but it creates a headache if we
ever need to change the logo, because we now need to find all of the occurrences of the old URL
and update them [^9].
* In a normalized representation, we would create an entity representing an organization or school,
and store its name, logo URL, and perhaps other attributes (description, news feed, etc.) once on
that entity. Every résumé that mentions the organization would then simply reference its ID, and
updating the logo is easy.
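The difference in update cost between the two representations can be sketched with plain Python data structures. All names, organizations, and URLs below are hypothetical; the point is only to count how many places must be written when a logo changes.

```python
# Denormalized: every résumé carries its own copy of the logo URL (assumed data).
denormalized_resumes = [
    {"user_id": 1, "positions": [
        {"organization": "Example Corp", "logo_url": "/logos/old.png"},
    ]},
    {"user_id": 2, "positions": [
        {"organization": "Example Corp", "logo_url": "/logos/old.png"},
        {"organization": "Other Inc", "logo_url": "/logos/other.png"},
    ]},
]

def update_logo_denormalized(resumes, organization, new_url):
    """Find and rewrite every copy; returns the number of copies written."""
    written = 0
    for resume in resumes:
        for position in resume["positions"]:
            if position["organization"] == organization:
                position["logo_url"] = new_url
                written += 1
    return written

# Normalized: the logo URL is stored once, and résumés reference the organization by ID.
organizations = {
    513: {"name": "Example Corp", "logo_url": "/logos/old.png"},
    514: {"name": "Other Inc", "logo_url": "/logos/other.png"},
}
normalized_resumes = [
    {"user_id": 1, "positions": [{"org_id": 513}]},
    {"user_id": 2, "positions": [{"org_id": 513}, {"org_id": 514}]},
]

def update_logo_normalized(orgs, org_id, new_url):
    """A single write, regardless of how many résumés mention the organization."""
    orgs[org_id]["logo_url"] = new_url
    return 1

def logos_for(resume, orgs):
    """Reading now needs one lookup (a join) per referenced organization."""
    return [orgs[position["org_id"]]["logo_url"] for position in resume["positions"]]

copies_written = update_logo_denormalized(denormalized_resumes, "Example Corp", "/logos/new.png")
rows_written = update_logo_normalized(organizations, 513, "/logos/new.png")
print(copies_written, rows_written)
print(logos_for(normalized_resumes[1], organizations))
```

The denormalized update touches one copy per résumé that mentions the organization, while the normalized update touches one record but shifts work to the read path in `logos_for`.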
As a general principle, normalized data is usually faster to write (since there is only one copy),
but slower to query (since it requires joins); denormalized data is usually faster to read (fewer
joins), but more expensive to write (more copies to update, more disk space used). You might find it
helpful to view denormalization as a form of derived data (see [“Systems of Record and Derived Data”](/en/ch1#sec_introduction_derived)), since you
need to set up a process for updating the redundant copies of the data.
Besides the cost of performing all these updates, you also need to consider the consistency of the
database if a process crashes halfway through making its updates. Databases that offer atomic
transactions (see [“Atomicity”](/en/ch8#sec_transactions_acid_atomicity)) make it easier to remain consistent, but not
all databases offer atomicity across multiple documents. It is also possible to ensure consistency
through stream processing, which we discuss in [“Keeping Systems in Sync”](/en/ch12#sec_stream_sync).
Normalization tends to be better for OLTP systems, where both reads and updates need to be fast;
analytics systems often fare better with denormalized data, since they perform updates in bulk, and
the performance of read-only queries is the dominant concern. Moreover, in systems of small to
moderate scale, a normalized data model is often best, because you don’t have to worry about keeping
multiple copies of the data consistent with each other, and the cost of performing joins is
acceptable. However, in very large-scale systems, the cost of joins can become problematic.
#### Denormalization in the social networking case study {#denormalization-in-the-social-networking-case-study}
In [“Case Study: Social Network Home Timelines”](/en/ch2#sec_introduction_twitter) we compared a normalized representation ([Figure 2-1](/en/ch2#fig_twitter_relational))
and a denormalized one (precomputed, materialized timelines): here, the join between `posts` and
`follows` was too expensive, and the materialized timeline is a cache of the result of that join.
The fan-out process that inserts a new post into followers’ timelines was our way of keeping the
denormalized representation consistent.
However, the implementation of materialized timelines at X (formerly Twitter) does not store the
actual text of each post: each entry actually only stores the post ID, the ID of the user who posted
it, and a little bit of extra information to identify reposts and replies [^11].
In other words, it is a precomputed result of (approximately) the following query:
```sql
SELECT posts.id, posts.sender_id
FROM posts
JOIN follows ON posts.sender_id = follows.followee_id
WHERE follows.follower_id = current_user
ORDER BY posts.timestamp DESC
LIMIT 1000
```
This means that whenever the timeline is read, the service still needs to perform two joins: look up
the post ID to fetch the actual post content (as well as statistics such as the number of likes
and replies), and look up the sender’s profile by ID (to get their username, profile picture, and
other details). This process of looking up the human-readable information by ID is called
*hydrating* the IDs, and it is essentially a join performed in application code [^11].
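A minimal sketch of hydration, assuming the timeline is a list of `(post_id, sender_id)` pairs and that posts and profiles live in separate key-value stores. The stores, the `multi_get` helper, and the field names are hypothetical stand-ins, not a description of how X actually implements this.

```python
# Hypothetical in-memory stand-ins for the post store and the profile store.
posts = {
    1001: {"text": "Hello, world", "likes": 3},
    1002: {"text": "Reading about data models", "likes": 12},
}
profiles = {
    7: {"username": "alice"},
    8: {"username": "bob"},
}

# Materialized timeline: only (post_id, sender_id) pairs, newest first.
# Post 1003 is assumed to have been deleted after the timeline was computed.
timeline = [(1003, 7), (1002, 8), (1001, 7)]

def multi_get(store, ids):
    """Stand-in for one batched lookup; IDs that are not found are omitted."""
    return {key: store[key] for key in ids if key in store}

def hydrate(entries):
    """Resolve post IDs and sender IDs into displayable timeline items."""
    found_posts = multi_get(posts, {post_id for post_id, _ in entries})
    found_profiles = multi_get(profiles, {sender_id for _, sender_id in entries})
    hydrated = []
    for post_id, sender_id in entries:
        post = found_posts.get(post_id)
        profile = found_profiles.get(sender_id)
        if post is None or profile is None:
            continue  # skip entries whose post or author no longer exists
        hydrated.append({
            "post_id": post_id,
            "username": profile["username"],
            "text": post["text"],
            "likes": post["likes"],
        })
    return hydrated

hydrated_timeline = hydrate(timeline)
for item in hydrated_timeline:
    print(item)
```

Because the lookups are by primary key and independent of each other, they can be batched and spread across many storage nodes.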
The reason for storing only IDs in the precomputed timeline is that the data they refer to is
fast-changing: the number of likes and replies may change multiple times per second on a popular
post, and some users regularly change their username or profile photo. Since the timeline should
show the latest like count and profile picture when it is viewed, it would not make sense to
denormalize this information into the materialized timeline. Moreover, the storage cost would be
increased significantly by such denormalization.
This example shows that having to perform joins when reading data is not, as sometimes claimed, an
impediment to creating high-performance, scalable services. Hydrating post IDs and user IDs is
actually a fairly easy operation to scale, since it parallelizes well, and the cost doesn’t depend
on the number of accounts you are following or the number of followers you have.
If you need to decide whether to denormalize something in your application, the social network case
study shows that the choice is not immediately obvious: the most scalable approach may involve
denormalizing some things and leaving other things normalized. You will have to carefully consider
how often the information changes, and the cost of reads and writes (which might be dominated by
outliers, such as users with many follows/followers in the case of a typical social network).
Normalization and denormalization are not inherently good or bad—they are just a trade-off in terms
of performance of reads and writes, as well as the amount of effort to implement.
### Many-to-One and Many-to-Many Relationships {#sec_datamodels_many_to_many}
While `positions` and `education` in [Figure 3-1](/en/ch3#fig_obama_relational) are examples of one-to-many or
one-to-few relationships (one résumé has several positions, but each position belongs only to one
résumé), the `region_id` field is an example of a *many-to-one* relationship (many people live in
the same region, but we assume that each person lives in only one region at any one time).
If we introduce entities for organizations and schools, and reference them by ID from the résumé,
then we also have *many-to-many* relationships (one person has worked for several organizations, and
an organization has several past or present employees). In a relational model, such a relationship
is usually represented as an *associative table* or *join table*, as shown in
[Figure 3-3](/en/ch3#fig_datamodels_m2m_rel): each position associates one user ID with one organization ID.
{{< figure src="/fig/ddia_0303.png" id="fig_datamodels_m2m_rel" caption="Figure 3-3. Many-to-many relationships in the relational model." class="w-full my-4" >}}
Many-to-one and many-to-many relationships do not easily fit within one self-contained JSON
document; they lend themselves more to a normalized representation. In a document model, one
possible representation is given in [Example 3-2](/en/ch3#fig_datamodels_m2m_json) and illustrated in
[Figure 3-4](/en/ch3#fig_datamodels_many_to_many): the data within each dotted rectangle can be grouped into one
document, but the links to organizations and schools are best represented as references to other
documents.
{{< figure id="fig_datamodels_m2m_json" title="Example 3-2. A résumé that references organizations by ID." class="w-full my-4" >}}
```json
{
"user_id": 251,
"first_name": "Barack",
"last_name": "Obama",
"positions": [
{"start": 2009, "end": 2017, "job_title": "President", "org_id": 513},
{"start": 2005, "end": 2008, "job_title": "US Senator (D-IL)", "org_id": 514}
],
...
}
```
{{< figure src="/fig/ddia_0304.png" id="fig_datamodels_many_to_many" caption="Figure 3-4. Many-to-many relationships in the document model: the data within each dotted box can be grouped into one document." class="w-full my-4" >}}
Many-to-many relationships often need to be queried in “both directions”: for example, finding all
of the organizations that a particular person has worked for, and finding all of the people who have
worked at a particular organization. One way of enabling such queries is to store ID references on
both sides, i.e., a résumé includes the ID of each organization where the person has worked, and the
organization document includes the IDs of the résumés that mention that organization. This
representation is denormalized, since the relationship is stored in two places, which could become
inconsistent with each other.
A normalized representation stores the relationship in only one place, and relies on *secondary
indexes* (which we discuss in [Chapter 4](/en/ch4#ch_storage)) to allow the relationship to be efficiently queried in
both directions. In the relational schema of [Figure 3-3](/en/ch3#fig_datamodels_m2m_rel), we would tell the database
to create indexes on both the `user_id` and the `org_id` columns of the `positions` table.
In the document model of [Example 3-2](/en/ch3#fig_datamodels_m2m_json), the database needs to index the `org_id` field
of objects inside the `positions` array. Many document databases and relational databases with JSON
support are able to create such indexes on values inside a document.
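As a rough illustration of the normalized approach, the following sketch uses Python's built-in `sqlite3` module to store the relationship once, in a `positions` join table, and indexes it in both directions. The `users` and `organizations` tables, the organization names, and the second user are made up for the example; only the IDs and job titles come from [Example 3-2](/en/ch3#fig_datamodels_m2m_json).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE organizations (org_id INTEGER PRIMARY KEY, name TEXT);
    -- The join table: each row links one user to one organization.
    CREATE TABLE positions (
        user_id   INTEGER REFERENCES users (user_id),
        org_id    INTEGER REFERENCES organizations (org_id),
        job_title TEXT
    );
    -- One secondary index per direction of the relationship.
    CREATE INDEX positions_by_user ON positions (user_id);
    CREATE INDEX positions_by_org  ON positions (org_id);
""")

# Hypothetical sample data (organization names and user 252 are invented).
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(251, "Barack Obama"), (252, "Example Colleague")])
conn.executemany("INSERT INTO organizations VALUES (?, ?)",
                 [(513, "Example Org A"), (514, "Example Org B")])
conn.executemany("INSERT INTO positions VALUES (?, ?, ?)", [
    (251, 513, "President"),
    (251, 514, "US Senator (D-IL)"),
    (252, 514, "Staff member"),
])

def orgs_for_user(user_id):
    """Direction 1: all organizations a particular person has worked for."""
    rows = conn.execute(
        "SELECT o.name FROM positions p "
        "JOIN organizations o ON o.org_id = p.org_id "
        "WHERE p.user_id = ? ORDER BY o.org_id", (user_id,))
    return [name for (name,) in rows]

def users_for_org(org_id):
    """Direction 2: all people who have worked at a particular organization."""
    rows = conn.execute(
        "SELECT u.name FROM positions p "
        "JOIN users u ON u.user_id = p.user_id "
        "WHERE p.org_id = ? ORDER BY u.user_id", (org_id,))
    return [name for (name,) in rows]
```

Because the relationship lives only in `positions`, there is no second copy that could drift out of sync; the two indexes are what make both lookups cheap.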
### Stars and Snowflakes: Schemas for Analytics {#sec_datamodels_analytics}
Data warehouses (see [“Data Warehousing”](/en/ch1#sec_introduction_dwh)) are usually relational, and there are a few
widely used conventions for the structure of tables in a data warehouse: a *star schema*,
*snowflake schema*, *dimensional modeling* [^12],
and *one big table* (OBT). These structures are optimized for the needs of business analysts. ETL
processes translate data from operational systems into this schema.
[Figure 3-5](/en/ch3#fig_dwh_schema) shows an example of a star schema that might be found in the data warehouse of a grocery
retailer. At the center of the schema is a so-called *fact table* (in this example, it is called
`fact_sales`). Each row of the fact table represents an event that occurred at a particular time
(here, each row represents a customer’s purchase of a product). If we were analyzing website traffic
rather than retail sales, each row might represent a page view or a click by a user.
{{< figure src="/fig/ddia_0305.png" id="fig_dwh_schema" caption="Figure 3-5. Example of a star schema for use in a data warehouse." class="w-full my-4" >}}
Usually, facts are captured as individual events, because this allows maximum flexibility of
analysis later. However, this means that the fact table can become extremely large. A big enterprise
may have many petabytes of transaction history in its data warehouse, mostly represented as fact tables.
Some of the columns in the fact table are attributes, such as the price at which the product was
sold and the cost of buying it from the supplier (allowing the profit margin to be calculated).
Other columns in the fact table are foreign key references to other tables, called *dimension
tables*. As each row in the fact table represents an event, the dimensions represent the *who*,
*what*, *where*, *when*, *how*, and *why* of the event.
For example, in [Figure 3-5](/en/ch3#fig_dwh_schema), one of the dimensions is the product that was sold. Each row in
the `dim_product` table represents one type of product that is for sale, including its stock-keeping
unit (SKU), description, brand name, category, fat content, package size, etc. Each row in the
`fact_sales` table uses a foreign key to indicate which product was sold in that particular
transaction. Queries often involve multiple joins to multiple dimension tables.
Even date and time are often represented using dimension tables, because this allows additional
information about dates (such as public holidays) to be encoded, allowing queries to differentiate
between sales on holidays and non-holidays.
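To make the shape of such a query concrete, here is a small sketch using Python's `sqlite3` module. The table names follow [Figure 3-5](/en/ch3#fig_dwh_schema), but the columns (`product_sk`, `date_key`, `is_holiday`, `price_cents`) and all of the rows are simplified assumptions chosen for the example, not the figure's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_sk INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, is_holiday INTEGER);
    -- Each fact row is one event: a product sold on a particular date.
    CREATE TABLE fact_sales (
        date_key    INTEGER REFERENCES dim_date (date_key),
        product_sk  INTEGER REFERENCES dim_product (product_sk),
        quantity    INTEGER,
        price_cents INTEGER
    );
""")
conn.executemany("INSERT INTO dim_product VALUES (?, ?)",
                 [(1, "fruit"), (2, "bakery")])
conn.executemany("INSERT INTO dim_date VALUES (?, ?)",
                 [(20240101, 1), (20240102, 0)])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)", [
    (20240101, 1, 2, 300),
    (20240101, 2, 1, 450),
    (20240102, 1, 5, 300),
    (20240102, 1, 1, 300),
])

# Join the fact table to two dimensions, then aggregate: revenue per
# product category, split by whether the sale happened on a holiday.
revenue = conn.execute("""
    SELECT p.category, d.is_holiday,
           SUM(f.quantity * f.price_cents) AS revenue_cents
    FROM fact_sales f
    JOIN dim_product p ON p.product_sk = f.product_sk
    JOIN dim_date d    ON d.date_key   = f.date_key
    GROUP BY p.category, d.is_holiday
    ORDER BY p.category, d.is_holiday
""").fetchall()
```

The holiday flag is available only because dates are modeled as a dimension table; a bare timestamp column on the fact table would not carry it.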
[Figure 3-5](/en/ch3#fig_dwh_schema) is an example of a star schema. The name comes from the fact that when the table
relationships are visualized, the fact table is in the middle, surrounded by its dimension tables;
the connections to these tables are like the rays of a star.
A variation of this template is known as the *snowflake schema*, where dimensions are further broken
down into subdimensions. For example, there could be separate tables for brands and
product categories, and each row in the `dim_product` table could reference the brand and category
as foreign keys, rather than storing them as strings in the `dim_product` table. Snowflake schemas
are more normalized than star schemas, but star schemas are often preferred because
they are simpler for analysts to work with [^12].
In a typical data warehouse, tables are often quite wide: fact tables often have over 100 columns,
sometimes several hundred. Dimension tables can also be wide, as they include all the metadata that
may be relevant for analysis—for example, the `dim_store` table may include details of which
services are offered at each store, whether it has an in-store bakery, the square footage, the date
when the store was first opened, when it was last remodeled, how far it is from the nearest highway, etc.
A star or snowflake schema consists mostly of many-to-one relationships (e.g., many sales occur for
one particular product, in one particular store), represented as the fact table having foreign keys
into dimension tables, or dimensions into subdimensions. In principle, other types of relationship
could exist, but they are often denormalized in order to simplify queries. For example, if a
customer buys several different products at once, that multi-item transaction is not represented
explicitly; instead, there is a separate row in the fact table for each product purchased, and those
facts all just happen to have the same customer ID, store ID, and timestamp.
Some data warehouse schemas take denormalization even further and leave out the dimension tables
entirely, folding the information in the dimensions into denormalized columns on the fact table
instead (essentially, precomputing the join between the fact table and the dimension tables). This
approach is known as *one big table* (OBT), and while it requires more storage space, it sometimes
enables faster queries [^13].
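The idea of precomputing the join can be sketched in a few lines of Python. This is only an illustration under invented data: the dimension attributes, the `store_sk` key, and the `product_`/`store_` column prefixes are assumptions, and a real warehouse would do this in its ETL pipeline rather than in application code.

```python
# Hypothetical dimension tables, keyed by surrogate key.
dim_product = {
    1: {"category": "fruit", "brand": "Brand A"},
    2: {"category": "bakery", "brand": "Brand B"},
}
dim_store = {10: {"city": "Springfield"}}

# Hypothetical fact rows holding foreign keys into the dimensions.
fact_sales = [
    {"product_sk": 1, "store_sk": 10, "quantity": 2},
    {"product_sk": 2, "store_sk": 10, "quantity": 1},
]

def build_one_big_table(facts, products, stores):
    """Replace each foreign key with the dimension attributes it points to."""
    wide_rows = []
    for fact in facts:
        # Keep the measures, drop the foreign key columns.
        row = {k: v for k, v in fact.items()
               if k not in ("product_sk", "store_sk")}
        # Copy the dimension attributes onto the fact row (denormalization).
        for key, value in products[fact["product_sk"]].items():
            row[f"product_{key}"] = value
        for key, value in stores[fact["store_sk"]].items():
            row[f"store_{key}"] = value
        wide_rows.append(row)
    return wide_rows

one_big_table = build_one_big_table(fact_sales, dim_product, dim_store)
```

Every wide row now repeats the store and product attributes, which is where the extra storage goes; in exchange, a query that filters by `product_category` no longer needs a join.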
In the context of analytics, such denormalization is unproblematic, since the data typically
represents a log of historical data that is not going to change (except maybe for occasionally
correcting an error). The issues of data consistency and write overheads that occur with
denormalization in OLTP systems are not as pressing in analytics.
### When to Use Which Model {#sec_datamodels_document_summary}
The main arguments in favor of the document data model are schema flexibility, better performance
due to locality, and that for some applications it is closer to the object model used by the
application. The relational model counters by providing better support for joins, many-to-one, and
many-to-many relationships. Let’s examine these arguments in more detail.
If the data in your application has a document-like structure (i.e., a tree of one-to-many
relationships, where typically the entire tree is loaded at once), then it’s probably a good idea to
use a document model. The relational technique of *shredding* (splitting a document-like structure
into multiple tables such as `positions`, `education`, and `contact_info` in [Figure 3-1](/en/ch3#fig_obama_relational))
can lead to cumbersome schemas and unnecessarily complicated application code.
The document model has limitations: for example, you cannot refer directly to a nested item within a
document, but instead you need to say something like “the second item in the list of positions for
user 251”. If you do need to reference nested items, a relational approach works better, since you
can refer to any item directly by its ID.
Some applications allow the user to choose the order of items: for example, imagine a to-do list or
issue tracker where the user can drag and drop tasks to reorder them. The document model supports
such applications well, because the items (or their IDs) can simply be stored in a JSON array to
determine their order. In relational databases there isn’t a standard way of representing such
reorderable lists, and various tricks are used: sorting by an integer column (requiring renumbering
when you insert into the middle), a linked list of IDs, or fractional indexing [^14] [^15] [^16].
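To give a feel for the last of these tricks, here is a minimal sketch of fractional indexing in Python. It assumes each item carries a sort key and uses exact rationals from the standard `fractions` module for simplicity; the function name and the task names are invented, and production implementations may encode keys differently.

```python
from fractions import Fraction

def position_between(before, after):
    """Return a sort key strictly between two neighboring keys.

    `before` or `after` may be None, meaning the start or end of the list.
    """
    if before is None and after is None:
        return Fraction(1)          # first item in an empty list
    if before is None:
        return after - 1            # insert at the front
    if after is None:
        return before + 1           # insert at the end
    return (before + after) / 2     # midpoint: no renumbering needed

# Three tasks with integer-valued keys, in their initial order.
tasks = {"write": Fraction(1), "test": Fraction(2), "ship": Fraction(3)}

# The user drags "ship" between "write" and "test": only one row changes.
tasks["ship"] = position_between(tasks["write"], tasks["test"])

ordered = sorted(tasks, key=tasks.get)
```

Unlike the integer-column approach, moving an item updates a single key, at the cost of keys that grow longer after many insertions into the same gap.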
#### Schema flexibility in the document model {#sec_datamodels_schema_flexibility}
Most document databases, and the JSON support in relational databases, do not enforce any schema on
the data in documents. XML support in relational databases usually comes with optional schema
validation. No schema means that arbitrary keys and values can be added to a document, and when
reading, clients have no guarantees as to what fields the documents may contain.
Document databases are sometimes called *schemaless*, but that’s misleading, as the code that reads
the data usually assumes some kind of structure—i.e., there is an implicit schema, but it is not
enforced by the database [^17].
A more accurate term is *schema-on-read* (the structure of the data is implicit, and only
interpreted when the data is read), in contrast with *schema-on-write* (the traditional approach of
relational databases, where the schema is explicit and the database ensures all data conforms to it
when the data is written) [^18].
Schema-on-read is similar to dynamic (runtime) type checking in programming languages, whereas
schema-on-write is similar to static (compile-time) type checking. Just as the advocates of static
and dynamic type checking have big debates about their relative merits [^19],
enforcement of schemas in databases is a contentious topic, and in general there’s no right or wrong
answer.
The difference between the approaches is particularly noticeable in situations where an application
wants to change the format of its data. For example, say you are currently storing each user’s full
name in one field, and you instead want to store the first name and last name separately [^20].
In a document database, you would just start writing new documents with the new fields and have
code in the application that handles the case when old documents are read. For example:
```mongodb-json
if (user && user.name && !user.first_name) {
// Documents written before Dec 8, 2023 don't have first_name
user.first_name = user.name.split(" ")[0];
}
```
The downside of this approach is that every part of your application that reads from the database
now needs to deal with documents in old formats that may have been written a long time in the past.
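One way to contain that cost, sketched here in Python, is to route every read through a single helper that upgrades old documents to the current shape. The function name `upgrade_user` is hypothetical, and this is only one possible convention, not something a document database provides for you.

```python
def upgrade_user(user):
    """Return a user document in the current format.

    Documents written in the old format have only a full `name`; for those,
    derive `first_name` at read time. The input document is not modified.
    """
    if user and user.get("name") and not user.get("first_name"):
        user = dict(user)  # shallow copy, so the caller's dict is untouched
        user["first_name"] = user["name"].split(" ")[0]
    return user

old_doc = {"user_id": 251, "name": "Barack Obama"}
new_doc = {"user_id": 251, "first_name": "Barack", "last_name": "Obama"}

upgraded = upgrade_user(old_doc)
```

Code that calls `upgrade_user` after every read can assume `first_name` is present, but the helper has to be kept for as long as any old-format document remains in the database.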
On the other hand, in a schema-on-write database, you would typically perform a *migration* along
the lines of:
```sql
ALTER TABLE users ADD COLUMN first_name text DEFAULT NULL;
UPDATE users SET first_name = split_part(name, ' ', 1); -- PostgreSQL
UPDATE users SET first_name = substring_index(name, ' ', 1); -- MySQL
```
In most relational databases, adding a column with a default value is fast and unproblematic, even
on large tables. However, running the `UPDATE` statement is likely to be slow on a large table,
since every row needs to be rewritten, and other schema operations (such as changing the data type
of a column) also typically require the entire table to be copied.
Various tools exist to allow this type of schema change to be performed in the background without downtime [^21] [^22] [^23] [^24],
but performing such migrations on large databases remains operationally challenging. Complicated
migrations can be avoided by only adding the `first_name` column with a default value of `NULL`
(which is fast), and filling it in at read time, like you would with a document database.
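The following sketch shows that lighter-weight variant using Python's `sqlite3` module (SQLite stands in for the relational database here, and the second user row is invented). The column is added without rewriting existing rows, and the fallback from `name` happens in a hypothetical `read_first_name` helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users VALUES (251, 'Barack Obama')")

# Cheap step: add the nullable column. No UPDATE over the whole table.
conn.execute("ALTER TABLE users ADD COLUMN first_name TEXT DEFAULT NULL")

# Rows written after the change populate the new column directly.
conn.execute("INSERT INTO users VALUES (252, 'Ada Lovelace', 'Ada')")

def read_first_name(user_id):
    """Read first_name, deriving it from name for rows not yet backfilled."""
    name, first_name = conn.execute(
        "SELECT name, first_name FROM users WHERE user_id = ?",
        (user_id,)).fetchone()
    if first_name is None and name:
        first_name = name.split(" ")[0]
    return first_name

# The old row is still NULL in storage; only the read path fills it in.
stored = conn.execute(
    "SELECT first_name FROM users WHERE user_id = 251").fetchone()[0]
```

The trade-off is the same as in the document case: the table change is cheap, but every reader must go through the fallback logic until the column has been backfilled.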
The schema-on-read approach is advantageous if the items in the collection don’t all have the same
structure for some reason (i.e., the data is heterogeneous)—for example, because:
* There are many different types of objects, and it is not practicable to put each type of object in its own table.
* The structure of the data is determined by external systems over which you have no control and which may change at any time.
In situations like these, a schema may hurt more than it helps, and schemaless documents can be a
much more natural data model. But in cases where all records are expected to have the same
structure, schemas are a useful mechanism for documenting and enforcing that structure. We will
discuss schemas and schema evolution in more detail in [Chapter 5](/en/ch5#ch_encoding).
#### Data locality for reads and writes {#sec_datamodels_document_locality}
A document is usually stored as a single continuous string, encoded as JSON, XML, or a binary variant
thereof (such as MongoDB’s BSON). If your application often needs to access the entire document
(for example, to render it on a web page), there is a performance advantage to this *storage
locality*. If data is split across multiple tables, like in [Figure 3-1](/en/ch3#fig_obama_relational), multiple
index lookups are required to retrieve it all, which may require more disk seeks and take more time.
The locality advantage only applies if you need large parts of the document at the same time. The
database typically needs to load the entire document, which can be wasteful if you only need to
access a small part of a large document. On updates to a document, the entire document usually needs
to be rewritten. For these reasons, it is generally recommended that you keep documents fairly small
and avoid frequent small updates to a document.
However, the idea of storing related data together for locality is not limited to the document
model. For example, Google’s Spanner database offers the same locality properties in a relational
data model, by allowing the schema to declare that a table’s rows should be interleaved (nested) within a parent table [^25].
Oracle allows the same, using a feature called *multi-table index cluster tables* [^26].
The *wide-column* data model popularized by Google’s Bigtable, and used e.g. in HBase and Accumulo,
has a concept of *column families*, which have a similar purpose of managing locality [^27].
#### Query languages for documents {#query-languages-for-documents}
Another difference between a relational and a document database is the language or API that you use
to query it. Most relational databases are queried using SQL, but document databases are more
varied. Some allow only key-value access by primary key, while others also offer secondary indexes
to query for values inside documents, and some provide rich query languages.
XML databases are often queried using XQuery and XPath, which are designed to allow complex queries,
including joins across multiple documents, and also format their results as XML [^28]. JSON Pointer [^29] and JSONPath [^30] provide an equivalent to XPath for JSON.
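As a small illustration of what such a path language does, the sketch below resolves a JSON Pointer against the résumé from [Example 3-2](/en/ch3#fig_datamodels_m2m_json). It is a simplified resolver written for this example (the function name `resolve_pointer` is our own), covering the `/`-separated tokens and the `~0`/`~1` escapes, not a complete or validated implementation of the specification.

```python
def resolve_pointer(document, pointer):
    """Follow a JSON Pointer such as "/positions/1/job_title"."""
    if pointer == "":
        return document                      # empty pointer = whole document
    if not pointer.startswith("/"):
        raise ValueError("a non-empty JSON Pointer must start with '/'")
    current = document
    for token in pointer[1:].split("/"):
        # Unescape in this order: "~1" means "/", then "~0" means "~".
        token = token.replace("~1", "/").replace("~0", "~")
        if isinstance(current, list):
            if not (token.isascii() and token.isdigit()):
                raise KeyError(f"invalid array index: {token!r}")
            current = current[int(token)]    # zero-based array index
        else:
            current = current[token]         # object member lookup
    return current

resume = {
    "user_id": 251,
    "first_name": "Barack",
    "last_name": "Obama",
    "positions": [
        {"start": 2009, "end": 2017, "job_title": "President", "org_id": 513},
        {"start": 2005, "end": 2008, "job_title": "US Senator (D-IL)", "org_id": 514},
    ],
}

# "The second item in the list of positions" is index 1.
second_title = resolve_pointer(resume, "/positions/1/job_title")
```

Note that the pointer addresses the nested item by its position in the array, which is exactly the kind of indirect reference to nested items that the document model requires.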
MongoDB’s aggregation pipeline, whose `$lookup` operator for joins we saw in
[“Normalization, Denormalization, and Joins”](/en/ch3#sec_datamodels_normalization), is an example of a query language for collections of JSON documents.
Let’s look at another example to get a feel for this language—this time an aggregation, which is
especially needed for analytics. Imagine you are a marine biologist, and you add an observation
record to your database every time you see animals in the ocean. Now you want to generate a report
saying how many sharks you have sighted per month. In PostgreSQL you might express that query like this:
```sql
SELECT date_trunc('month', observation_timestamp) AS observation_month, ❶
sum(num_animals) AS total_animals
FROM observations
WHERE family = 'Sharks'
GROUP BY observation_month;
```
❶ : The `date_trunc('month', timestamp)` function determines the calendar month
containing `timestamp`, and returns another timestamp representing the beginning of that month. In
other words, it rounds a timestamp down to the start of its month.
This query first filters the observations to only show species in the `Sharks` family, then groups
the observations by the calendar month in which they occurred, and finally adds up the number of
animals seen in all observations in that month. The same query can be expressed using MongoDB’s
aggregation pipeline as follows:
```mongodb-json
db.observations.aggregate([
{ $match: { family: "Sharks" } },
{ $group: {
_id: {
year: { $year: "$observationTimestamp" },
month: { $month: "$observationTimestamp" }
},
totalAnimals: { $sum: "$numAnimals" }
} }
]);
```
The aggregation pipeline language is similar in expressiveness to a subset of SQL, but it uses a
JSON-based syntax rather than SQL’s English-sentence-style syntax; the difference is perhaps a
matter of taste.
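For comparison, the same filter, group, and sum steps can be written imperatively. The sketch below does so in plain Python over a handful of invented observation records, using the field names from the MongoDB version; it is meant only to spell out what the two declarative queries compute.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical observation records.
observations = [
    {"observationTimestamp": datetime(2024, 1, 5), "family": "Sharks", "numAnimals": 3},
    {"observationTimestamp": datetime(2024, 1, 20), "family": "Sharks", "numAnimals": 4},
    {"observationTimestamp": datetime(2024, 1, 10), "family": "Whales", "numAnimals": 2},
    {"observationTimestamp": datetime(2024, 2, 2), "family": "Sharks", "numAnimals": 1},
]

def sharks_per_month(records):
    """Total number of sharks sighted, keyed by (year, month)."""
    totals = defaultdict(int)
    for record in records:
        if record["family"] != "Sharks":      # the WHERE / $match step
            continue
        ts = record["observationTimestamp"]
        totals[(ts.year, ts.month)] += record["numAnimals"]  # GROUP BY + SUM
    return dict(totals)

report = sharks_per_month(observations)
```

The imperative version fixes the order of operations, whereas the SQL and aggregation-pipeline versions leave the database free to choose how to execute them.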
#### Convergence of document and relational databases {#convergence-of-document-and-relational-databases}
Document databases and relational databases started out as very different approaches to data
management, but they have grown more similar over time [^31].
Relational databases added support for JSON types and query operators, and the ability to index
properties inside documents. Some document databases (such as MongoDB, Couchbase, and RethinkDB)
added support for joins, secondary indexes, and declarative query languages.
This convergence of the models is good news for application developers, because the relational model
and the document model work best when you can combine both in the same database. Many document
databases need relational-style references to other documents, and many relational databases have
sections where schema flexibility is beneficial. Relational-document hybrids are a powerful combination.
--------
> [!NOTE]
> Codd’s original description of the relational model [^3] actually allowed something similar to JSON
> within a relational schema. He called it *nonsimple domains*. The idea was that a value in a row
> doesn’t have to just be a primitive datatype like a number or a string, but it could also be a
> nested relation (table)—so you can have an arbitrarily nested tree structure as a value, much like
> the JSON or XML support that was added to SQL over 30 years later.
--------
## Graph-Like Data Models {#sec_datamodels_graph}
We saw earlier that the type of relationships is an important distinguishing feature between
different data models. If your application has mostly one-to-many relationships (tree-structured
data) and few other relationships between records, the document model is appropriate.
But what if many-to-many relationships are very common in your data? The relational model can handle
simple cases of many-to-many relationships, but as the connections within your data become more
complex, it becomes more natural to start modeling your data as a graph.
A graph consists of two kinds of objects: *vertices* (also known as *nodes* or *entities*) and
*edges* (also known as *relationships* or *arcs*). Many kinds of data can be modeled as a graph.
Typical examples include:
Social graphs
: Vertices are people, and edges indicate which people know each other.
The web graph
: Vertices are web pages, and edges indicate HTML links to other pages.
Road or rail networks
: Vertices are junctions, and edges represent the roads or railway lines between them.
Well-known algorithms can operate on these graphs: for example, map navigation apps search for
the shortest path between two points in a road network, and
PageRank can be used on the web graph to determine the
popularity of a web page and thus its ranking in search results [^32].
Graphs can be represented in several different ways. In the *adjacency list* model, each vertex
stores the IDs of its neighbor vertices that are one edge away. Alternatively, you can use an
*adjacency matrix*, a two-dimensional array where each row and each column corresponds to a vertex,
where the value is zero when there is no edge between the row vertex and the column vertex, and
where the value is one if there is an edge. The adjacency list is good for graph traversals, and the
matrix is good for machine learning (see [“Dataframes, Matrices, and Arrays”](/en/ch3#sec_datamodels_dataframes)).
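The two representations can be sketched in Python as follows. The four-vertex directed graph is made up, with vertices numbered from 0, and the breadth-first search is included only to show why the adjacency list suits traversals.

```python
from collections import deque

# A small directed graph given as (tail, head) pairs.
edges = [(0, 1), (0, 2), (2, 3)]
num_vertices = 4

# Adjacency list: each vertex stores the vertices one edge away.
adjacency_list = {v: [] for v in range(num_vertices)}
# Adjacency matrix: entry [row][col] is 1 if there is an edge row -> col.
adjacency_matrix = [[0] * num_vertices for _ in range(num_vertices)]

for tail, head in edges:
    adjacency_list[tail].append(head)
    adjacency_matrix[tail][head] = 1

def hops(start, goal):
    """Number of edges on a shortest path from start to goal, or None."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        vertex = queue.popleft()
        if vertex == goal:
            return distance[vertex]
        # Only the actual neighbors are visited, not all num_vertices columns.
        for neighbor in adjacency_list[vertex]:
            if neighbor not in distance:
                distance[neighbor] = distance[vertex] + 1
                queue.append(neighbor)
    return None
```

The list uses space proportional to the number of edges, while the matrix always has one entry per pair of vertices, which is wasteful for sparse graphs but convenient for linear-algebra operations.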
In the examples just given, all the vertices in a graph represent the same kind of thing (people, web
pages, or road junctions, respectively). However, graphs are not limited to such *homogeneous* data:
an equally powerful use of graphs is to provide a consistent way of storing completely different
types of objects in a single database. For example:
* Facebook maintains a single graph with many different types of vertices and edges: vertices
represent people, locations, events, checkins, and comments made by users; edges indicate which
people are friends with each other, which checkin happened in which location, who commented on
which post, who attended which event, and so on [^33].
* Knowledge graphs are used by search engines to record facts about entities that often occur in
search queries, such as organizations, people, and places [^34].
This information is obtained by crawling and analyzing the text on websites; some websites, such
as Wikidata, also publish graph data in a structured form.
There are several different, but related, ways of structuring and querying data in graphs. In this
section we will discuss the *property graph* model (implemented by Neo4j, Memgraph, KùzuDB [^35], and others [^36])
and the *triple-store* model (implemented by Datomic, AllegroGraph, Blazegraph, and others). These
models are fairly similar in what they can express, and some graph databases (such as Amazon
Neptune) support both models.
We will also look at four query languages for graphs (Cypher, SPARQL, Datalog, and GraphQL), as well
as SQL support for querying graphs. Other graph query languages exist, such as Gremlin [^37],
but these will give us a representative overview.
To illustrate these different languages and models, this section uses the graph shown in
[Figure 3-6](/en/ch3#fig_datamodels_graph) as a running example. It could be taken from a social network or a
genealogical database: it shows two people, Lucy from Idaho and Alain from Saint-Lô, France. They
are married and living in London. Each person and each location is represented as a vertex, and the
relationships between them as edges. This example will help demonstrate some queries that are easy
in graph databases, but difficult in other models.
{{< figure src="/fig/ddia_0306.png" id="fig_datamodels_graph" caption="Figure 3-6. Example of graph-structured data (boxes represent vertices, arrows represent edges)." class="w-full my-4" >}}
### Property Graphs {#id56}
In the *property graph* (also known as *labeled property graph*) model, each vertex consists of:
* A unique identifier
* A label (string) to describe what type of object this vertex represents
* A set of outgoing edges
* A set of incoming edges
* A collection of properties (key-value pairs)
Each edge consists of:
* A unique identifier
* The vertex at which the edge starts (the *tail vertex*)
* The vertex at which the edge ends (the *head vertex*)
* A label to describe the kind of relationship between the two vertices
* A collection of properties (key-value pairs)
You can think of a graph store as consisting of two relational tables, one for vertices and one for
edges, as shown in [Example 3-3](/en/ch3#fig_graph_sql_schema) (this schema uses the PostgreSQL `jsonb` datatype to
store the properties of each vertex or edge). The head and tail vertex are stored for each edge; if
you want the set of incoming or outgoing edges for a vertex, you can query the `edges` table by
`head_vertex` or `tail_vertex`, respectively.
{{< figure id="fig_graph_sql_schema" title="Example 3-3. Representing a property graph using a relational schema" class="w-full my-4" >}}
```sql
CREATE TABLE vertices (
vertex_id integer PRIMARY KEY,
label text,
properties jsonb
);
CREATE TABLE edges (
edge_id integer PRIMARY KEY,
tail_vertex integer REFERENCES vertices (vertex_id),
head_vertex integer REFERENCES vertices (vertex_id),
label text,
properties jsonb
);
CREATE INDEX edges_tails ON edges (tail_vertex);
CREATE INDEX edges_heads ON edges (head_vertex);
```
Some important aspects of this model are:
1. Any vertex can have an edge connecting it with any other vertex. There is no schema that
restricts which kinds of things can or cannot be associated.
2. Given any vertex, you can efficiently find both its incoming and its outgoing edges, and thus
*traverse* the graph—i.e., follow a path through a chain of vertices—both forward and backward.
(That’s why [Example 3-3](/en/ch3#fig_graph_sql_schema) has indexes on both the `tail_vertex` and `head_vertex`
columns.)
3. By using different labels for different kinds of vertices and relationships, you can store
several different kinds of information in a single graph, while still maintaining a clean data
model.
The edges table is like the many-to-many associative table/join table we saw in
[“Many-to-One and Many-to-Many Relationships”](/en/ch3#sec_datamodels_many_to_many), generalized to allow many different types of relationship to be
stored in the same table. There may also be indexes on the labels and the properties, allowing
vertices or edges with certain properties to be found efficiently.
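The same structure can be sketched as an in-memory data structure in Python. The class and method names below are our own, chosen for illustration; the two dictionaries `outgoing` and `incoming` play the role of the two indexes on the `edges` table, and the sample data is the Idaho portion of the running example.

```python
from collections import defaultdict

class PropertyGraph:
    def __init__(self):
        self.vertices = {}                  # vertex_id -> (label, properties)
        self.edges = {}                     # edge_id -> (tail, head, label, properties)
        self.outgoing = defaultdict(list)   # tail vertex -> edge IDs
        self.incoming = defaultdict(list)   # head vertex -> edge IDs

    def add_vertex(self, vertex_id, label, **properties):
        self.vertices[vertex_id] = (label, properties)

    def add_edge(self, edge_id, tail, head, label, **properties):
        self.edges[edge_id] = (tail, head, label, properties)
        self.outgoing[tail].append(edge_id)   # like the index on tail_vertex
        self.incoming[head].append(edge_id)   # like the index on head_vertex

    def out_neighbors(self, vertex_id, label):
        """Follow outgoing edges with the given label (forward traversal)."""
        return [self.edges[e][1] for e in self.outgoing[vertex_id]
                if self.edges[e][2] == label]

    def in_neighbors(self, vertex_id, label):
        """Follow incoming edges with the given label (backward traversal)."""
        return [self.edges[e][0] for e in self.incoming[vertex_id]
                if self.edges[e][2] == label]

graph = PropertyGraph()
graph.add_vertex("usa", "Location", name="United States", type="country")
graph.add_vertex("idaho", "Location", name="Idaho", type="state")
graph.add_vertex("lucy", "Person", name="Lucy")
graph.add_edge(1, "idaho", "usa", "WITHIN")
graph.add_edge(2, "lucy", "idaho", "BORN_IN")
```

Nothing in `add_edge` restricts which vertex labels an edge may connect, which mirrors the absence of a schema on associations in this model.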
--------
> [!NOTE]
> A limitation of graph models is that an edge can only associate two vertices with each other,
> whereas a relational join table can represent three-way or even higher-degree relationships by
> having multiple foreign key references on a single row. Such relationships can be represented in a
> graph by creating an additional vertex corresponding to each row of the join table, and edges
> to/from that vertex, or by using a *hypergraph*.
--------
Those features give graphs a great deal of flexibility for data modeling, as illustrated in
[Figure 3-6](/en/ch3#fig_datamodels_graph). The figure shows a few things that would be difficult to express in a
traditional relational schema, such as different kinds of regional structures in different countries
(France has *départements* and *régions*, whereas the US has *counties* and *states*), quirks of
history such as a country within a country (ignoring for now the intricacies of sovereign states and
nations), and varying granularity of data (Lucy’s current residence is specified as a city, whereas
her place of birth is specified only at the level of a state).
You could imagine extending the graph to also include many other facts about Lucy and Alain, or
other people. For instance, you could use it to indicate any food allergies they have (by
introducing a vertex for each allergen, and an edge between a person and an allergen to indicate an
allergy), and link the allergens with a set of vertices that show which foods contain which
substances. Then you could write a query to find out what is safe for each person to eat.
Graphs are good for evolvability: as you add features to your application, a graph can easily be
extended to accommodate changes in your application’s data structures.
### The Cypher Query Language {#id57}
*Cypher* is a query language for property graphs, originally created for the Neo4j graph database,
and later developed into an open standard as *openCypher* [^38]. Besides Neo4j, Cypher is supported by Memgraph, KùzuDB [^35],
Amazon Neptune, Apache AGE (with storage in PostgreSQL), and others. It is named after a character
in the movie *The Matrix* and is not related to ciphers in cryptography [^39].
[Example 3-4](/en/ch3#fig_cypher_create) shows the Cypher query to insert the lefthand portion of
[Figure 3-6](/en/ch3#fig_datamodels_graph) into a graph database. The rest of the graph can be added similarly. Each
vertex is given a symbolic name like `usa` or `idaho`. That name is not stored in the database, but
only used internally within the query to create edges between the vertices, using an arrow notation:
`(idaho) -[:WITHIN]-> (usa)` creates an edge labeled `WITHIN`, with `idaho` as the tail vertex and
`usa` as the head vertex.
{{< figure id="fig_cypher_create" title="Example 3-4. A subset of the data in [Figure 3-6](/en/ch3#fig_datamodels_graph), represented as a Cypher query" class="w-full my-4" >}}
```
CREATE
(namerica :Location {name:'North America', type:'continent'}),
(usa :Location {name:'United States', type:'country' }),
(idaho :Location {name:'Idaho', type:'state' }),
(lucy :Person {name:'Lucy' }),
(idaho) -[:WITHIN ]-> (usa) -[:WITHIN]-> (namerica),
(lucy) -[:BORN_IN]-> (idaho)
```
When all the vertices and edges of [Figure 3-6](/en/ch3#fig_datamodels_graph) are added to the database, we can start
asking interesting questions: for example, *find the names of all the people who emigrated from the
United States to Europe*. That is, find all the vertices that have a `BORN_IN` edge to a location
within the US, and also a `LIVES_IN` edge to a location within Europe, and return the `name`
property of each of those vertices.
[Example 3-5](/en/ch3#fig_cypher_query) shows how to express that query in Cypher. The same arrow notation is used in a
`MATCH` clause to find patterns in the graph: `(person) -[:BORN_IN]-> ()` matches any two vertices
that are related by an edge labeled `BORN_IN`. The tail vertex of that edge is bound to the
variable `person`, and the head vertex is left unnamed.
{{< figure id="fig_cypher_query" title="Example 3-5. Cypher query to find people who emigrated from the US to Europe" class="w-full my-4" >}}
```
MATCH
(person) -[:BORN_IN]-> () -[:WITHIN*0..]-> (:Location {name:'United States'}),
(person) -[:LIVES_IN]-> () -[:WITHIN*0..]-> (:Location {name:'Europe'})
RETURN person.name
```
The query can be read as follows:
> Find any vertex (call it `person`) that meets *both* of the following conditions:
>
> 1. `person` has an outgoing `BORN_IN` edge to some vertex. From that vertex, you can follow a chain
> of outgoing `WITHIN` edges until eventually you reach a vertex of type `Location`, whose `name`
> property is equal to `"United States"`.
> 2. That same `person` vertex also has an outgoing `LIVES_IN` edge. Following that edge, and then a
> chain of outgoing `WITHIN` edges, you eventually reach a vertex of type `Location`, whose `name`
> property is equal to `"Europe"`.
>
> For each such `person` vertex, return the `name` property.
There are several possible ways of executing the query. The description given here suggests that you
start by scanning all the people in the database, examine each person’s birthplace and residence,
and return only those people who meet the criteria.
But equivalently, you could start with the two `Location` vertices and work backward. If there is an
index on the `name` property, you can efficiently find the two vertices representing the US and
Europe. Then you can proceed to find all locations (states, regions, cities, etc.) in the US and
Europe respectively by following all incoming `WITHIN` edges. Finally, you can look for people who
can be found through an incoming `BORN_IN` or `LIVES_IN` edge at one of the location vertices.
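The second strategy, working backward from the two locations, might look roughly like the following Python sketch. It is not how any particular graph database executes the query. The edge list is a simplified, partly assumed subset of the running example (the intermediate levels of the location hierarchy are shortened), and vertices are identified by short string IDs instead of being looked up through an index on `name`.

```python
from collections import defaultdict

# (tail, label, head) triples; the hierarchy levels are a simplification.
edges = [
    ("idaho", "WITHIN", "usa"),
    ("usa", "WITHIN", "namerica"),
    ("london", "WITHIN", "england"),
    ("england", "WITHIN", "uk"),
    ("uk", "WITHIN", "europe"),
    ("saint_lo", "WITHIN", "france"),
    ("france", "WITHIN", "europe"),
    ("lucy", "BORN_IN", "idaho"),
    ("lucy", "LIVES_IN", "london"),
    ("alain", "BORN_IN", "saint_lo"),
    ("alain", "LIVES_IN", "london"),
]
names = {"lucy": "Lucy", "alain": "Alain"}

# Index edges by (head vertex, label) so they can be followed backward.
incoming = defaultdict(list)
for tail, label, head in edges:
    incoming[(head, label)].append(tail)

def locations_within(root):
    """The root plus every location that reaches it via WITHIN edges."""
    found, stack = {root}, [root]
    while stack:
        for tail in incoming[(stack.pop(), "WITHIN")]:
            if tail not in found:
                found.add(tail)
                stack.append(tail)
    return found

def people_with_edge_into(label, locations):
    """People with an edge of the given label into any of the locations."""
    return {person for loc in locations for person in incoming[(loc, label)]}

def emigrated(from_root, to_root):
    born = people_with_edge_into("BORN_IN", locations_within(from_root))
    living = people_with_edge_into("LIVES_IN", locations_within(to_root))
    return sorted(names[p] for p in born & living)
```

Here `emigrated("usa", "europe")` touches only the vertices under the two locations, whereas the forward strategy would examine every person in the database; which one is cheaper depends on the data, and a declarative query leaves that choice to the query optimizer.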
### Graph Queries in SQL {#id58}
[Example 3-3](/en/ch3#fig_graph_sql_schema) suggested that graph data can be represented in a relational database. But
if we put graph data in a relational structure, can we also query it using SQL?
The answer is yes, but with some difficulty. Every edge that you traverse in a graph query is
effectively a join with the `edges` table. In a relational database, you usually know in advance
which joins you need in your query. On the other hand, in a graph query, you may need to traverse a
variable number of edges before you find the vertex you’re looking for—that is, the number of joins
is not fixed in advance.
In our example, that happens in the `() -[:WITHIN*0..]-> ()` pattern in the Cypher query. A person’s
`LIVES_IN` edge may point at any kind of location: a street, a city, a district, a region, a state,
etc. A city may be `WITHIN` a region, a region `WITHIN` a state, a state `WITHIN` a country, etc.
The `LIVES_IN` edge may point directly at the location vertex you’re looking for, or it may be
several levels away in the location hierarchy.
In Cypher, `:WITHIN*0..` expresses that fact very concisely: it means “follow a `WITHIN` edge, zero
or more times.” It is like the `*` operator in a regular expression.
Since SQL:1999, this idea of variable-length traversal paths in a query can be expressed using
something called *recursive common table expressions* (the `WITH RECURSIVE` syntax).
[Example 3-6](/en/ch3#fig_graph_sql_query) shows the same query—finding the names of people who emigrated from the US
to Europe—expressed in SQL using this technique. However, the syntax is very clumsy in comparison to
Cypher.
{{< figure id="fig_graph_sql_query" title="Example 3-6. The same query as [Example 3-5](/en/ch3#fig_cypher_query), written in SQL using recursive common table expressions" class="w-full my-4" >}}
```sql
WITH RECURSIVE
-- in_usa is the set of vertex IDs of all locations within the United States
in_usa(vertex_id) AS (
SELECT vertex_id FROM vertices
WHERE label = 'Location' AND properties->>'name' = 'United States' ❶
UNION
SELECT edges.tail_vertex FROM edges ❷
JOIN in_usa ON edges.head_vertex = in_usa.vertex_id
WHERE edges.label = 'within'
),
-- in_europe is the set of vertex IDs of all locations within Europe
in_europe(vertex_id) AS (
SELECT vertex_id FROM vertices
WHERE label = 'Location' AND properties->>'name' = 'Europe' ❸
UNION
SELECT edges.tail_vertex FROM edges
JOIN in_europe ON edges.head_vertex = in_europe.vertex_id
WHERE edges.label = 'within'
),
-- born_in_usa is the set of vertex IDs of all people born in the US
born_in_usa(vertex_id) AS ( ❹
SELECT edges.tail_vertex FROM edges
JOIN in_usa ON edges.head_vertex = in_usa.vertex_id
WHERE edges.label = 'born_in'
),
-- lives_in_europe is the set of vertex IDs of all people living in Europe
lives_in_europe(vertex_id) AS ( ❺
SELECT edges.tail_vertex FROM edges
JOIN in_europe ON edges.head_vertex = in_europe.vertex_id
WHERE edges.label = 'lives_in'
)
SELECT vertices.properties->>'name'
FROM vertices
-- join to find those people who were both born in the US *and* live in Europe
JOIN born_in_usa ON vertices.vertex_id = born_in_usa.vertex_id ❻
JOIN lives_in_europe ON vertices.vertex_id = lives_in_europe.vertex_id;
```
❶: First find the vertex whose `name` property has the value `"United States"`, and make it the first element of the set
of vertices `in_usa`.
❷: Follow all incoming `within` edges from vertices in the set `in_usa`, and add them to the same
set, until all incoming `within` edges have been visited.
❸: Do the same starting with the vertex whose `name` property has the value `"Europe"`, and build up
the set of vertices `in_europe`.
❹: For each of the vertices in the set `in_usa`, follow incoming `born_in` edges to find people
who were born in some place within the United States.
❺: Similarly, for each of the vertices in the set `in_europe`, follow incoming `lives_in` edges to find people who live in Europe.
❻: Finally, intersect the set of people born in the USA with the set of people living in Europe, by
joining them.
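To experiment with this technique, the following sketch runs a simplified variant of the query using Python's built-in `sqlite3` module. It makes several assumptions that differ from the example: the schema is reduced to the columns the query needs, the vertex name is stored in a plain `name` column instead of a JSON `properties` column, and the rows (including the person named Sam) are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE vertices (vertex_id INTEGER PRIMARY KEY, label TEXT, name TEXT);
    CREATE TABLE edges (tail_vertex INTEGER, head_vertex INTEGER, label TEXT);
    INSERT INTO vertices VALUES
        (1, 'Location', 'United States'), (2, 'Location', 'Idaho'),
        (3, 'Location', 'Europe'), (4, 'Location', 'England'),
        (5, 'Location', 'London'),
        (100, 'Person', 'Lucy'), (101, 'Person', 'Sam');
    INSERT INTO edges VALUES
        (2, 1, 'within'), (4, 3, 'within'), (5, 4, 'within'),
        (100, 2, 'born_in'), (100, 5, 'lives_in'),   -- Lucy: Idaho -> London
        (101, 4, 'born_in'), (101, 5, 'lives_in');   -- Sam: England -> London
""")

QUERY = """
WITH RECURSIVE
  in_usa(vertex_id) AS (
    SELECT vertex_id FROM vertices
      WHERE label = 'Location' AND name = 'United States'
    UNION
    SELECT edges.tail_vertex FROM edges
      JOIN in_usa ON edges.head_vertex = in_usa.vertex_id
      WHERE edges.label = 'within'
  ),
  in_europe(vertex_id) AS (
    SELECT vertex_id FROM vertices
      WHERE label = 'Location' AND name = 'Europe'
    UNION
    SELECT edges.tail_vertex FROM edges
      JOIN in_europe ON edges.head_vertex = in_europe.vertex_id
      WHERE edges.label = 'within'
  ),
  born_in_usa(vertex_id) AS (
    SELECT edges.tail_vertex FROM edges
      JOIN in_usa ON edges.head_vertex = in_usa.vertex_id
      WHERE edges.label = 'born_in'
  ),
  lives_in_europe(vertex_id) AS (
    SELECT edges.tail_vertex FROM edges
      JOIN in_europe ON edges.head_vertex = in_europe.vertex_id
      WHERE edges.label = 'lives_in'
  )
SELECT vertices.name FROM vertices
  JOIN born_in_usa ON vertices.vertex_id = born_in_usa.vertex_id
  JOIN lives_in_europe ON vertices.vertex_id = lives_in_europe.vertex_id
"""

names = [row[0] for row in conn.execute(QUERY)]
```

Because the recursive parts use `UNION` rather than `UNION ALL`, duplicate vertex IDs are discarded, which also means the recursion terminates even if the `within` edges were to contain a cycle.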
The fact that a 4-line Cypher query requires 31 lines in SQL shows how much of a difference the
right choice of data model and query language can make. And this is just the beginning; there are
more details to consider, e.g., around handling cycles, and choosing between breadth-first or
depth-first traversal [^40].
Oracle has a different SQL extension for recursive queries, which it calls *hierarchical* [^41].
However, the situation may be improving: at the time of writing, there are plans to add a graph
query language called GQL to the SQL standard [^42] [^43], which will provide a syntax inspired by Cypher, GSQL [^44], and PGQL [^45].
### Triple-Stores and SPARQL {#id59}
The triple-store model is mostly equivalent to the property graph model, using different words to
describe the same ideas. It is nevertheless worth discussing, because there are various tools and
languages for triple-stores that can be valuable additions to your toolbox for building
applications.
In a triple-store, all information is stored in the form of very simple three-part statements:
(*subject*, *predicate*, *object*). For example, in the triple (*Jim*, *likes*, *bananas*), *Jim* is
the subject, *likes* is the predicate (verb), and *bananas* is the object.
The subject of a triple is equivalent to a vertex in a graph. The object is one of two things:
1. A value of a primitive datatype, such as a string or a number. In that case, the predicate and
object of the triple are equivalent to the key and value of a property on the subject vertex.
Using the example from [Figure 3-6](/en/ch3#fig_datamodels_graph), (*lucy*, *birthYear*, *1989*) is like a vertex
`lucy` with properties `{"birthYear": 1989}`.
2. Another vertex in the graph. In that case, the predicate is an edge in the
graph, the subject is the tail vertex, and the object is the head vertex. For example, in
(*lucy*, *marriedTo*, *alain*) the subject and object *lucy* and *alain* are both vertices, and
the predicate *marriedTo* is the label of the edge that connects them.
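The two cases can be made concrete with a small Python sketch that converts a list of triples into vertex properties and edges. The `Vertex` wrapper class and the function name are assumptions introduced for this illustration; they are one simple way of telling a vertex apart from a primitive value.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Vertex:
    """Marks a subject or object as a graph vertex rather than a primitive value."""
    id: str

lucy, alain = Vertex("lucy"), Vertex("alain")
triples = [
    (lucy, "birthYear", 1989),     # object is a primitive value
    (lucy, "marriedTo", alain),    # object is another vertex
]

def to_property_graph(triples):
    properties = {}   # vertex id -> {property key: value}
    edges = []        # (tail vertex id, edge label, head vertex id)
    for subject, predicate, obj in triples:
        properties.setdefault(subject.id, {})
        if isinstance(obj, Vertex):
            # Case 2: the predicate is an edge from subject (tail) to object (head).
            properties.setdefault(obj.id, {})
            edges.append((subject.id, predicate, obj.id))
        else:
            # Case 1: the predicate and object form a property of the subject.
            properties[subject.id][predicate] = obj
    return properties, edges

properties, edges = to_property_graph(triples)
```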
> [!NOTE]
> To be precise, databases that offer a triple-like data model often need to store some additional
> metadata on each tuple. For example, AWS Neptune uses quads (4-tuples) by adding a graph ID to each
> triple [^46];
> Datomic uses 5-tuples, extending each triple with a transaction ID and a boolean to indicate
> deletion [^47].
> Since these databases retain the basic *subject-predicate-object* structure explained above, this
> book nevertheless calls them triple-stores.
[Example 3-7](/en/ch3#fig_graph_n3_triples) shows the same data as in [Example 3-4](/en/ch3#fig_cypher_create), written as
triples in a format called *Turtle*, a subset of *Notation3* (*N3*) [^48].
{{< figure id="fig_graph_n3_triples" title="Example 3-7. A subset of the data in [Figure 3-6](/en/ch3#fig_datamodels_graph), represented as Turtle triples" class="w-full my-4" >}}
```
@prefix : <urn:example:>.
_:lucy a :Person.
_:lucy :name "Lucy".
_:lucy :bornIn _:idaho.
_:idaho a :Location.
_:idaho :name "Idaho".
_:idaho :type "state".
_:idaho :within _:usa.
_:usa a :Location.
_:usa :name "United States".
_:usa :type "country".
_:usa :within _:namerica.
_:namerica a :Location.
_:namerica :name "North America".
_:namerica :type "continent".
```
In this example, vertices of the graph are written as `_:someName`. The name doesn’t mean anything
outside of this file; it exists only because we otherwise wouldn’t know which triples refer to the
same vertex. When the predicate represents an edge, the object is a vertex, as in `_:idaho :within
_:usa`. When the predicate is a property, the object is a string literal, as in `_:usa :name "United States"`.
It’s quite repetitive to repeat the same subject over and over again, but fortunately you can use
semicolons to say multiple things about the same subject. This makes the Turtle format quite
readable: see [Example 3-8](/en/ch3#fig_graph_n3_shorthand).
{{< figure id="fig_graph_n3_shorthand" title="Example 3-8. A more concise way of writing the data in [Example 3-7](/en/ch3#fig_graph_n3_triples)" class="w-full my-4" >}}
```
@prefix : <urn:example:>.
_:lucy a :Person; :name "Lucy"; :bornIn _:idaho.
_:idaho a :Location; :name "Idaho"; :type "state"; :within _:usa.
_:usa a :Location; :name "United States"; :type "country"; :within _:namerica.
_:namerica a :Location; :name "North America"; :type "continent".
```
--------
> [!TIP] THE SEMANTIC WEB
Some of the research and development effort on triple stores was motivated by the *Semantic Web*, an
early-2000s effort to facilitate internet-wide data exchange by publishing data not only as
human-readable web pages, but also in a standardized, machine-readable format. Although the Semantic
Web as originally envisioned did not succeed [^49] [^50],
the legacy of the Semantic Web project lives on in several specific technologies: *linked data*
standards such as JSON-LD [^51], *ontologies* used in biomedical science [^52], Facebook’s Open Graph protocol [^53]
(which is used for link unfurling [^54]), knowledge graphs such as Wikidata, and standardized vocabularies for structured data maintained by [`schema.org`](https://schema.org/).
Triple-stores are another Semantic Web technology that has found use outside of its original use
case: even if you have no interest in the Semantic Web, triples can be a good internal data model for applications.
--------
#### The RDF data model {#the-rdf-data-model}
The Turtle language we used in [Example 3-8](/en/ch3#fig_graph_n3_shorthand) is actually a way of encoding data in the
*Resource Description Framework* (RDF) [^55],
a data model that was designed for the Semantic Web. RDF data can also be encoded in other ways, for
example (more verbosely) in XML, as shown in [Example 3-9](/en/ch3#fig_graph_rdf_xml). Tools like Apache Jena can
automatically convert between different RDF encodings.
{{< figure id="fig_graph_rdf_xml" title="Example 3-9. The data of [Example 3-8](/en/ch3#fig_graph_n3_shorthand), expressed using RDF/XML syntax" class="w-full my-4" >}}
```xml
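<rdf:RDF xmlns="urn:example:"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <Location rdf:nodeID="idaho">
    <name>Idaho</name>
    <type>state</type>
    <within>
      <Location rdf:nodeID="usa">
        <name>United States</name>
        <type>country</type>
        <within>
          <Location rdf:nodeID="namerica">
            <name>North America</name>
            <type>continent</type>
          </Location>
        </within>
      </Location>
    </within>
  </Location>
  <Person rdf:nodeID="lucy">
    <name>Lucy</name>
    <bornIn rdf:nodeID="idaho"/>
  </Person>
</rdf:RDF>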
```
RDF has a few quirks due to the fact that it is designed for internet-wide data exchange. The
subject, predicate, and object of a triple are often URIs. For example, a predicate might be a URI
such as `<http://my-company.com/namespace#within>` or `<http://my-company.com/namespace#lives_in>`,
rather than just `WITHIN` or `LIVES_IN`. The reasoning behind this design is that you should be able
to combine your data with someone else’s data, and if they attach a different meaning to the word
`within` or `lives_in`, you won’t get a conflict because their predicates are actually
`<http://other.org/foo#within>` and `<http://other.org/foo#lives_in>`.
The URL `<http://my-company.com/namespace>` doesn’t necessarily need to resolve to anything; from
RDF’s point of view, it is simply a namespace. To avoid potential confusion with `http://` URLs, the
examples in this section use non-resolvable URIs such as `urn:example:within`. Fortunately, you can
just specify this prefix once at the top of the file, and then forget about it.
#### The SPARQL query language {#the-sparql-query-language}
*SPARQL* is a query language for triple-stores using the RDF data model [^56].
(It is an acronym for *SPARQL Protocol and RDF Query Language*, pronounced “sparkle.”)
It predates Cypher, and since Cypher’s pattern matching is borrowed from SPARQL, they look quite
similar.
The same query as before—finding people who have moved from the US to Europe—is similarly concise in
SPARQL as it is in Cypher (see [Example 3-10](/en/ch3#fig_sparql_query)).
{{< figure id="fig_sparql_query" title="Example 3-10. The same query as [Example 3-5](/en/ch3#fig_cypher_query), expressed in SPARQL" class="w-full my-4" >}}
```
PREFIX : <urn:example:>
SELECT ?personName WHERE {
?person :name ?personName.
?person :bornIn / :within* / :name "United States".
?person :livesIn / :within* / :name "Europe".
}
```
The structure is very similar. The following two expressions are equivalent (variables start with a
question mark in SPARQL):
```
(person) -[:BORN_IN]-> () -[:WITHIN*0..]-> (location) # Cypher
?person :bornIn / :within* ?location. # SPARQL
```
Because RDF doesn’t distinguish between properties and edges but just uses predicates for both, you
can use the same syntax for matching properties. In the following expression, the variable `usa` is
bound to any vertex that has a `name` property whose value is the string `"United States"`:
```
(usa {name:'United States'}) # Cypher
?usa :name "United States". # SPARQL
```
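To see what a property path such as `:bornIn / :within* / :name` does, it can help to evaluate it by hand over a set of triples. The following Python sketch is a toy evaluator, not how a SPARQL engine is implemented. The function names are assumptions, the triples about Lucy's residence are made up for the example, and for brevity it represents both vertices and string literals as plain Python strings.

```python
# Triples in the style of Example 3-7, plus hypothetical ones for Lucy's residence.
TRIPLES = {
    ("lucy", "name", "Lucy"),
    ("lucy", "bornIn", "idaho"),
    ("idaho", "name", "Idaho"),
    ("idaho", "within", "usa"),
    ("usa", "name", "United States"),
    ("usa", "within", "namerica"),
    ("namerica", "name", "North America"),
    ("lucy", "livesIn", "london"),
    ("london", "name", "London"),
    ("london", "within", "england"),
    ("england", "within", "europe"),
    ("europe", "name", "Europe"),
}

def step(nodes, predicate):
    """Follow one `predicate` edge from every node in `nodes`."""
    return {o for (s, p, o) in TRIPLES if p == predicate and s in nodes}

def star(nodes, predicate):
    """Follow `predicate` zero or more times (the `*` in a property path)."""
    reached = set(nodes)
    frontier = set(nodes)
    while frontier:
        frontier = step(frontier, predicate) - reached
        reached |= frontier
    return reached

def matches(person, first, place_name):
    """Evaluate the pattern `?person :first / :within* / :name "place_name"`."""
    return place_name in step(star(step({person}, first), "within"), "name")

def us_to_europe():
    people = {s for (s, p, o) in TRIPLES if p == "name"}
    return sorted(
        name
        for person in people
        if matches(person, "bornIn", "United States")
        and matches(person, "livesIn", "Europe")
        for name in step({person}, "name")
    )
```

The `/` operator in the path corresponds to nesting the calls: each step starts from the set of nodes that the previous step reached.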
SPARQL is supported by Amazon Neptune, AllegroGraph, Blazegraph, OpenLink Virtuoso, Apache Jena, and
various other triple stores [^36].
### Datalog: Recursive Relational Queries {#id62}
Datalog is a much older language than SPARQL or Cypher: it arose from academic research in the 1980s [^57] [^58] [^59].
It is less well known among software engineers and not widely supported in mainstream databases, but
it ought to be better-known since it is a very expressive language that is particularly powerful for
complex queries. Several niche databases, including Datomic, LogicBlox, CozoDB, and LinkedIn’s
LIquid [^60] use Datalog as their query language.
Datalog is actually based on a relational data model, not a graph, but it appears in the graph
databases section of this book because recursive queries on graphs are a particular strength of
Datalog.
The contents of a Datalog database consist of *facts*, and each fact corresponds to a row in a
relational table. For example, say we have a table *location* containing locations, and it has three
columns: *ID*, *name*, and *type*. The fact that the US is a country could then be written as
`location(2, "United States", "country")`, where `2` is the ID of the US. In general, the statement
`table(val1, val2, …)` means that `table` contains a row where the first column contains `val1`,
the second column contains `val2`, and so on.
[Example 3-11](/en/ch3#fig_datalog_triples) shows how to write the data from the left-hand side of
[Figure 3-6](/en/ch3#fig_datamodels_graph) in Datalog. The edges of the graph (`within`, `born_in`, and `lives_in`)
are represented as two-column join tables. For example, Lucy has the ID 100 and Idaho has the ID 3,
so the relationship “Lucy was born in Idaho” is represented as `born_in(100, 3)`.
{{< figure id="fig_datalog_triples" title="Example 3-11. A subset of the data in [Figure 3-6](/en/ch3#fig_datamodels_graph), represented as Datalog facts" class="w-full my-4" >}}
```
location(1, "North America", "continent").
location(2, "United States", "country").
location(3, "Idaho", "state").
within(2, 1). /* US is in North America */
within(3, 2). /* Idaho is in the US */
person(100, "Lucy").
born_in(100, 3). /* Lucy was born in Idaho */
```
Now that we have defined the data, we can write the same query as before, as shown in
[Example 3-12](/en/ch3#fig_datalog_query). It looks a bit different from the equivalent in Cypher or SPARQL, but don’t
let that put you off. Datalog is a subset of Prolog, a programming language that you might have seen
before if you’ve studied computer science.
{{< figure id="fig_datalog_query" title="Example 3-12. The same query as [Example 3-5](/en/ch3#fig_cypher_query), expressed in Datalog" class="w-full my-4" >}}
```sql
within_recursive(LocID, PlaceName) :- location(LocID, PlaceName, _). /* Rule 1 */
within_recursive(LocID, PlaceName) :- within(LocID, ViaID), /* Rule 2 */
within_recursive(ViaID, PlaceName).
migrated(PName, BornIn, LivingIn) :- person(PersonID, PName), /* Rule 3 */
born_in(PersonID, BornID),
within_recursive(BornID, BornIn),
lives_in(PersonID, LivingID),
within_recursive(LivingID, LivingIn).
us_to_europe(Person) :- migrated(Person, "United States", "Europe"). /* Rule 4 */
/* us_to_europe contains the row "Lucy". */
```
Cypher and SPARQL jump in right away with `MATCH` or `SELECT`, but Datalog takes a small step at a time. We
define *rules* that derive new virtual tables from the underlying facts. These derived tables are
like (virtual) SQL views: they are not stored in the database, but you can query them in the same
way as a table containing stored facts.
In [Example 3-12](/en/ch3#fig_datalog_query) we define three derived tables: `within_recursive`, `migrated`, and
`us_to_europe`. The name and columns of the virtual tables are defined by what appears before the
`:-` symbol of each rule. For example, `migrated(PName, BornIn, LivingIn)` is a virtual table with
three columns: the name of a person, the name of the place where they were born, and the name of the
place where they are living.
The content of a virtual table is defined by the part of the rule after the `:-` symbol, where we
try to find rows that match a certain pattern in the tables. For example, `person(PersonID, PName)`
matches the row `person(100, "Lucy")`, with the variable `PersonID` bound to the value `100` and the
variable `PName` bound to the value `"Lucy"`. A rule applies if the system can find a match for
*all* patterns on the right-hand side of the `:-` operator. When the rule applies, it’s as though the
left-hand side of the `:-` was added to the database (with variables replaced by the values they matched).
One possible way of applying the rules, as illustrated in [Figure 3-7](/en/ch3#fig_datalog_naive), is as follows:
1. `location(1, "North America", "continent")` exists in the database, so rule 1 applies. It generates `within_recursive(1, "North America")`.
2. `within(2, 1)` exists in the database and the previous step generated `within_recursive(1, "North America")`, so rule 2 applies. It generates `within_recursive(2, "North America")`.
3. `within(3, 2)` exists in the database and the previous step generated `within_recursive(2, "North America")`, so rule 2 applies. It generates `within_recursive(3, "North America")`.
By repeated application of rules 1 and 2, the `within_recursive` virtual table can tell us all the
locations in North America (or any other location) contained in our database.
{{< figure link="#fig_datalog_query" src="/fig/ddia_0307.png" id="fig_datalog_naive" title="Figure 3-7. Determining that Idaho is in North America, using the Datalog rules from Example 3-12." class="w-full my-4" >}}
> Figure 3-7. Determining that Idaho is in North America, using the Datalog rules from [Example 3-12](/en/ch3#fig_datalog_query).
Now rule 3 can find people who were born in some location `BornIn` and live in some location
`LivingIn`. Rule 4 invokes rule 3 with `BornIn = 'United States'` and
`LivingIn = 'Europe'`, and returns only the names of the people who match the
search. By querying the contents of the virtual `us_to_europe` table, the Datalog system finally
gets the same answer as in the earlier Cypher and SPARQL queries.
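The repeated application of rules until nothing new can be derived is straightforward to imitate in Python with sets of tuples. The sketch below is a naive evaluation written for illustration only; real Datalog engines use more efficient strategies. The facts for Europe, England, London, and Lucy's residence are hypothetical additions, included so that the final rule has something to find.

```python
# Facts: each set is a table, each tuple is a row.
location = {(1, "North America", "continent"),
            (2, "United States", "country"),
            (3, "Idaho", "state"),
            (4, "Europe", "continent"),    # hypothetical additions from here on
            (5, "England", "country"),
            (6, "London", "city")}
within = {(2, 1), (3, 2), (5, 4), (6, 5)}
person = {(100, "Lucy")}
born_in = {(100, 3)}
lives_in = {(100, 6)}

def derive_within_recursive():
    # Rule 1: every location is within itself.
    derived = {(loc_id, name) for (loc_id, name, _type) in location}
    # Rule 2: apply repeatedly until no new rows appear (a fixpoint).
    while True:
        new = {(loc_id, name)
               for (loc_id, via_id) in within
               for (via, name) in derived if via == via_id}
        if new <= derived:
            return derived
        derived |= new

within_recursive = derive_within_recursive()

# Rule 3: join person, born_in, lives_in, and within_recursive.
migrated = {(p_name, born, living)
            for (pid, p_name) in person
            for (pid_b, born_id) in born_in if pid_b == pid
            for (loc_b, born) in within_recursive if loc_b == born_id
            for (pid_l, living_id) in lives_in if pid_l == pid
            for (loc_l, living) in within_recursive if loc_l == living_id}

# Rule 4: select the rows of interest and keep only the person's name.
us_to_europe = {name for (name, born, living) in migrated
                if born == "United States" and living == "Europe"}
```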
The Datalog approach requires a different kind of thinking compared to the other query languages
discussed in this chapter. It allows complex queries to be built up rule by rule, with one rule
referring to other rules, similarly to the way that you break down code into functions that call
each other. Just like functions can be recursive, Datalog rules can also invoke themselves, like
rule 2 in [Example 3-12](/en/ch3#fig_datalog_query), which enables graph traversals in Datalog queries.
### GraphQL {#id63}
GraphQL is a query language that, by design, is much more restrictive than the other query languages
we have seen in this chapter. The purpose of GraphQL is to allow client software running on a user’s
device (such as a mobile app or a JavaScript web app frontend) to request a JSON document with a
particular structure, containing the fields necessary for rendering its user interface. GraphQL
interfaces allow developers to rapidly change queries in client code without changing server-side APIs.
GraphQL’s flexibility comes at a cost. Organizations that adopt GraphQL often need tooling to
convert GraphQL queries into requests to internal services, which often use REST or gRPC (see
[Chapter 5](/en/ch5#ch_encoding)). Authorization, rate limiting, and performance challenges are additional concerns [^61].
GraphQL’s query language is also limited since GraphQL queries come from an untrusted source. The language
does not allow anything that could be expensive to execute, since otherwise users could perform
denial-of-service attacks on a server by running lots of expensive queries. In particular, GraphQL
does not allow recursive queries (unlike Cypher, SPARQL, SQL, or Datalog), and it does not allow
arbitrary search conditions such as “find people who were born in the US and are now living in
Europe” (unless the service owners specifically choose to offer such search functionality).
Nevertheless, GraphQL is useful. [Example 3-13](/en/ch3#fig_graphql_query) shows how you might implement a group chat
application such as Discord or Slack using GraphQL. The query requests all the channels that the
user has access to, including the channel name and the 50 most recent messages in each channel. For
each message it requests the timestamp, the message content, and the name and profile picture URL
for the sender of the message. Moreover, if a message is a reply to another message, the query also
requests the sender name and the content of the message it is replying to (which might be rendered
in a smaller font above the reply, in order to provide some context).
{{< figure id="fig_graphql_query" title="Example 3-13. Example GraphQL query for a group chat application" class="w-full my-4" >}}
```
query ChatApp {
channels {
name
recentMessages(latest: 50) {
timestamp
content
sender {
fullName
imageUrl
}
replyTo {
content
sender {
fullName
}
}
}
}
}
```
[Example 3-14](/en/ch3#fig_graphql_response) shows what a response to the query in [Example 3-13](/en/ch3#fig_graphql_query) might look
like. The response is a JSON document that mirrors the structure of the query: it contains exactly
those attributes that were requested, no more and no less. This approach has the advantage that the
server does not need to know which attributes the client requires in order to render the user
interface; instead, the client can simply request what it needs. For example, this query does not
request a profile picture URL for the sender of the `replyTo` message, but if the user interface
were changed to add that profile picture, it would be easy for the client to add the required
`imageUrl` attribute to the query without changing the server.
{{< figure id="fig_graphql_response" title="Example 3-14. A possible response to the query in [Example 3-13](/en/ch3#fig_graphql_query)" class="w-full my-4" >}}
```json
{
"data": {
"channels": [
{
"name": "#general",
"recentMessages": [
{
"timestamp": 1693143014,
"content": "Hey! How are y'all doing?",
"sender": {"fullName": "Aaliyah", "imageUrl": "https://..."},
"replyTo": null
},
{
"timestamp": 1693143024,
"content": "Great! And you?",
"sender": {"fullName": "Caleb", "imageUrl": "https://..."},
"replyTo": {
"content": "Hey! How are y'all doing?",
"sender": {"fullName": "Aaliyah"}
}
},
...
```
In [Example 3-14](/en/ch3#fig_graphql_response) the name and image URL of a message sender are embedded directly in the
message object. If the same user sends multiple messages, this information is repeated on each
message. In principle, it would be possible to reduce this duplication, but GraphQL makes the design
choice to accept a larger response size in order to make it simpler to render the user interface
based on the data.
The `replyTo` field is similar: in [Example 3-14](/en/ch3#fig_graphql_response), the second message is a reply to the
first, and the content (“Hey!…”) and sender Aaliyah are duplicated under `replyTo`. It would be
possible to instead return the ID of the message being replied to, but then the client would have to
make an additional request to the server if that ID is not among the 50 most recent messages
returned. Duplicating the content makes it much simpler to work with the data.
The server’s database can store the data in a more normalized form, and perform the necessary joins
to process a query. For example, the server might store a message along with the user ID of the
sender and the ID of the message it is replying to; when it receives a query like the one above, the
server would then resolve those IDs to find the records they refer to. However, the client can only
ask the server to perform joins that are explicitly offered in the GraphQL schema.
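As a rough illustration of that resolution step, the following Python sketch builds a nested response from normalized records. It does not use any GraphQL library and skips parsing and schema validation entirely: the `users` and `messages` tables, the `JOINS` mapping, and the representation of a query as a nested dictionary are all assumptions made for the example.

```python
# Hypothetical normalized storage: messages reference users and other messages by ID.
users = {
    1: {"fullName": "Aaliyah", "imageUrl": "https://..."},
    2: {"fullName": "Caleb", "imageUrl": "https://..."},
}
messages = {
    10: {"timestamp": 1693143014, "content": "Hey! How are y'all doing?",
         "sender_id": 1, "reply_to_id": None},
    11: {"timestamp": 1693143024, "content": "Great! And you?",
         "sender_id": 2, "reply_to_id": 10},
}

# The joins offered to clients: field name -> (foreign key column, target table).
JOINS = {"sender": ("sender_id", users), "replyTo": ("reply_to_id", messages)}

def resolve(record, selection):
    """Return exactly the fields named in `selection`, following joins as needed.

    `selection` maps a field name to None (a scalar field) or to a nested selection.
    """
    if record is None:
        return None
    result = {}
    for field, sub in selection.items():
        if sub is None:
            result[field] = record[field]
        else:
            key, table = JOINS[field]   # raises KeyError if the join is not offered
            result[field] = resolve(table.get(record[key]), sub)
    return result

# The per-message part of the query, written as a nested dictionary.
selection = {
    "timestamp": None,
    "content": None,
    "sender": {"fullName": None, "imageUrl": None},
    "replyTo": {"content": None, "sender": {"fullName": None}},
}
response = [resolve(messages[mid], selection) for mid in sorted(messages)]
```

The response repeats the sender and the replied-to content inside each message, even though the stored records hold only IDs.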
Even though the response to a GraphQL query looks similar to a response from a document database,
and even though it has “graph” in the name, GraphQL can be implemented on top of any type of
database—relational, document, or graph.
## Event Sourcing and CQRS {#sec_datamodels_events}
In all the data models we have discussed so far, the data is queried in the same form as it is
written—be it JSON documents, rows in tables, or vertices and edges in a graph. However, in complex
applications it can sometimes be difficult to find a single data representation that is able to
satisfy all the different ways that the data needs to be queried and presented. In such situations,
it can be beneficial to write data in one form, and then to derive from it several representations
that are optimized for different types of reads.
We previously saw this idea in [“Systems of Record and Derived Data”](/en/ch1#sec_introduction_derived), and ETL (see [“Data Warehousing”](/en/ch1#sec_introduction_dwh))
is one example of such a derivation process. Now we will take the idea further. If we are going to
derive one data representation from another anyway, we can choose different representations that are
optimized for writing and for reading, respectively. How would you model your data if you only
wanted to optimize it for writing, and if efficient queries were of no concern?
Perhaps the simplest, fastest, and most expressive way of writing data is an *event log*: every time
you want to write some data, you encode it as a self-contained string (perhaps as JSON), including a
timestamp, and then append it to a sequence of events. Events in this log are *immutable*: you never
change or delete them, you only ever append more events to the log (which may supersede earlier
events). An event can contain arbitrary properties.
[Figure 3-8](/en/ch3#fig_event_sourcing) shows an example that could be taken from a conference management system. A
conference can be a complex business domain: not only can individual attendees register and pay by
card, but companies can also order seats in bulk, pay by invoice, and then later assign the seats to
individual people. Some number of seats may be reserved for speakers, sponsors, volunteer helpers,
and so on. Reservations may also be cancelled, and meanwhile, the conference organizer might change
the capacity of the event by moving it to a different room. With all of this going on, simply
calculating the number of available seats becomes a challenging query.
{{< figure src="/fig/ddia_0308.png" id="fig_event_sourcing" title="Figure 3-8. Using a log of immutable events as source of truth, and deriving materialized views from it." class="w-full my-4" >}}
In [Figure 3-8](/en/ch3#fig_event_sourcing), every change to the state of the conference (such as the organizer
opening registrations, or attendees making and cancelling registrations) is first stored as an
event. Whenever an event is appended to the log, several *materialized views* (also known as
*projections* or *read models*) are also updated to reflect the effect of that event. In the
conference example, there might be one materialized view that collects all information related to
the status of each booking, another that computes charts for the conference organizer’s dashboard,
and a third that generates files for the printer that produces the attendees’ badges.
The idea of using events as the source of truth, and expressing every state change as an event, is
known as *event sourcing* [^62] [^63].
The principle of maintaining separate read-optimized representations and deriving them from the
write-optimized representation is called *command query responsibility segregation (CQRS)* [^64].
These terms originated in the domain-driven design (DDD) community, although similar ideas have been
around for a long time, for example in *state machine replication* (see [“Using shared logs”](/en/ch10#sec_consistency_smr)).
When a request from a user comes in, it is called a *command*, and it first needs to be validated.
Only once the command has been executed and determined to be valid (e.g., there were
enough available seats for a requested reservation) does it become a fact, and the corresponding event
is added to the log. Consequently, the event log should contain only valid events, and a consumer
of the event log that builds a materialized view is not allowed to reject an event.
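A minimal sketch of this flow in Python might look as follows. The event names, their properties, and the function names are hypothetical, and the view is recomputed from the whole log on every command purely to keep the example short; a real system would keep the view up to date incrementally.

```python
import json
import time

event_log = []  # append-only list of JSON-encoded events

def append_event(event_type, **properties):
    """Encode an event as a self-contained JSON string and append it to the log."""
    event = {"type": event_type, "timestamp": time.time(), **properties}
    event_log.append(json.dumps(event))

def available_seats(log):
    """Materialized view: fold over the log to compute the remaining seats."""
    capacity = booked = 0
    for raw in log:
        event = json.loads(raw)
        if event["type"] == "RegistrationsOpened":
            capacity = event["capacity"]
        elif event["type"] == "SeatsBooked":
            booked += event["seats"]
        elif event["type"] == "BookingCancelled":
            booked -= event["seats"]
    return capacity - booked

def handle_book_seats(attendee, seats):
    """Command handler: validate first, and append an event only if valid."""
    if seats <= 0 or seats > available_seats(event_log):
        return False   # a rejected command leaves no trace in the log
    append_event("SeatsBooked", attendee=attendee, seats=seats)
    return True

append_event("RegistrationsOpened", capacity=3)
accepted = handle_book_seats("alice", 2)   # enough seats: becomes an event
rejected = handle_book_seats("bob", 2)     # only one seat left: no event
```

Since validation happens before anything is appended, `available_seats` never has to decide whether an event in the log should be accepted.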
When modelling your data in an event sourcing style, it is recommended that you name your events in
the past tense (e.g., “the seats were booked”), because an event is a record of the fact that
something has happened in the past. Even if the user later decides to change or cancel, the fact
remains true that they formerly held a booking, and the change or cancellation is a separate event
that is added later.
A similarity between event sourcing and a star schema fact table, as discussed in
[“Stars and Snowflakes: Schemas for Analytics”](/en/ch3#sec_datamodels_analytics), is that both are collections of events that happened in the past.
However, rows in a fact table all have the same set of columns, whereas in event sourcing there may
be many different event types, each with different properties. Moreover, a fact table is an
unordered collection, while in event sourcing the order of events is important: if a booking is
first made and then cancelled, processing those events in the wrong order would not make sense.
Event sourcing and CQRS have several advantages:
* For the people developing the system, events better communicate the intent of *why* something
happened. For example, it’s easier to understand the event “the booking was cancelled” than “the
`active` column on row 4001 of the `bookings` table was set to `false`, three rows associated with
that booking were deleted from the `seat_assignments` table, and a row representing the refund was
inserted into the `payments` table”. Those row modifications may still happen when a materialized
view processes the cancellation event, but when they are driven by an event, the reason for the
updates becomes much clearer.
* A key principle of event sourcing is that the materialized views are derived from the event log in
a reproducible way: you should always be able to delete the materialized views and recompute them
by processing the same events in the same order, using the same code. If there was a bug in the
view maintenance code, you can just delete the view and recompute it with the new code. It’s also
easier to find the bug because you can re-run the view maintenance code as often as you like and
inspect its behavior.
* You can have multiple materialized views that are optimized for the particular queries that your
application requires. They can be stored either in the same database as the events or a different
one, depending on your needs. They can use any data model, and they can be denormalized for fast
reads. You can even keep a view only in memory and avoid persisting it, as long as it’s okay to
recompute the view from the event log whenever the service restarts.
* If you decide you want to present the existing information in a new way, it is easy to build a new
materialized view from the existing event log. You can also evolve the system to support new
features by adding new types of events, or new properties to existing event types (any older
events remain unmodified). You can also chain new behaviors off existing events (for example, when
a conference attendee cancels, their seat could be offered to the next person on the waiting
list).
* If an event was written in error you can delete it again, and then you can rebuild the views
without the deleted event. On the other hand, in a database where you update and delete data
directly, a committed transaction is often difficult to reverse. Event sourcing can therefore
reduce the number of irreversible actions in the system, making it easier to change
(see [“Evolvability: Making Change Easy”](/en/ch2#sec_introduction_evolvability)).
* The event log can also serve as an audit log of everything that happened in the system, which is
valuable in regulated industries that require such auditability.
However, event sourcing and CQRS also have downsides:
* You need to be careful if external information is involved. For example, say an event contains a
price given in one currency, and for one of the views it needs to be converted into another
currency. Since the exchange rate may fluctuate, it would be problematic to fetch the exchange
rate from an external source when processing the event, since you would get a different result if
you recompute the materialized view on another date. To make the event processing logic
deterministic, you either need to include the exchange rate in the event itself, or have a way of
querying the historical exchange rate at the timestamp indicated in the event, ensuring that this
query always returns the same result for the same timestamp.
* The requirement that events are immutable creates problems if events contain personal data from
users, since users may exercise their right (e.g., under the GDPR) to request deletion of their
data. If the event log is on a per-user basis, you can just delete the whole log for that user,
but that doesn’t work if your event log contains events relating to multiple users. You can try
storing the personal data outside of the actual event, or encrypting it with a key that you can
later choose to delete, but that also makes it harder to recompute derived state when needed.
* Reprocessing events requires care if there are externally visible side-effects—for example, you
probably don’t want to resend confirmation emails every time you rebuild a materialized view.
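The first of these downsides can be illustrated with a small Python sketch. It assumes the exchange rate that applied at the time is stored in the event itself; the field names `amount_eur` and `eur_to_usd` and the rates are made up for the example.

```python
from decimal import Decimal

# Each event carries the rate that applied when it was written
# (hypothetical field names and made-up values).
events = [
    {"order": 1, "amount_eur": Decimal("100.00"), "eur_to_usd": Decimal("1.10")},
    {"order": 2, "amount_eur": Decimal("50.00"), "eur_to_usd": Decimal("1.08")},
]


def revenue_in_usd(events):
    """Derive a view using only data stored in the events, never an external lookup."""
    return sum((e["amount_eur"] * e["eur_to_usd"] for e in events), Decimal("0"))


first_run = revenue_in_usd(events)
second_run = revenue_in_usd(events)  # a later rebuild of the same view
```

Since the function reads nothing but the events, recomputing the view on another date gives the same total. Fetching the current rate inside `revenue_in_usd` would break that property.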
You can implement event sourcing on top of any database, but there are also some systems that are
specifically designed to support this pattern, such as EventStoreDB, MartenDB (based on PostgreSQL),
and Axon Framework. You can also use message brokers such as Apache Kafka to store the event log,
and stream processors can keep the materialized views up-to-date; we will return to these topics in
[“Change data capture versus event sourcing”](/en/ch12#sec_stream_event_sourcing).
The only important requirement is that the event storage system must guarantee that all materialized
views process the events in exactly the same order as they appear in the log; as we shall see in
[Chapter 10](/en/ch10#ch_consistency), this is not always easy to achieve in a distributed system.
## Dataframes, Matrices, and Arrays {#sec_datamodels_dataframes}
The data models we have seen so far in this chapter are generally used for both transaction
processing and analytics purposes (see [“Analytical versus Operational Systems”](/en/ch1#sec_introduction_analytics)). There are also some data
models that you are likely to encounter in an analytical or scientific context, but that rarely
feature in OLTP systems: dataframes and multidimensional arrays of numbers such as matrices.
Dataframes are a data model supported by the R language, the Pandas library for Python, Apache
Spark, ArcticDB, Dask, and other systems. They are a popular tool for data scientists preparing data
for training machine learning models, but they are also widely used for data exploration,
statistical data analysis, data visualization, and similar purposes.
At first glance, a dataframe is similar to a table in a relational database or a spreadsheet. It
supports relational-like operators that perform bulk operations on the contents of the dataframe:
for example, applying a function to all of the rows, filtering the rows based on some condition,
grouping rows by some columns and aggregating other columns, and joining the rows in one dataframe
with another dataframe based on some key (what a relational database calls *join* is typically
called *merge* on dataframes).
Instead of a declarative query such as SQL, a dataframe is typically manipulated through a series of
commands that modify its structure and content. This matches the typical workflow of data
scientists, who incrementally “wrangle” the data into a form that allows them to find answers to the
questions they are asking. These manipulations usually take place on the data scientist’s private
copy of the dataset, often on their local machine, although the end result may be shared with other
users.
Dataframe APIs also offer a wide variety of operations that go far beyond what relational databases
offer, and the data model is often used in ways that are very different from typical relational data modeling [^65].
For example, a common use of dataframes is to transform data from a relational-like representation
into a matrix or multidimensional array representation, which is the form that many machine learning
algorithms expect of their input.
A simple example of such a transformation is shown in [Figure 3-9](/en/ch3#fig_dataframe_to_matrix). On the left we
have a relational table of how different users have rated various movies (on a scale of 1 to 5), and
on the right the data has been transformed into a matrix where each column is a movie and each row
is a user (similarly to a *pivot table* in a spreadsheet). The matrix is *sparse*, which means there
is no data for many user-movie combinations, but this is fine. This matrix may have many thousands
of columns and would therefore not fit well in a relational database, but dataframes and libraries
that offer sparse arrays (such as SciPy for Python) can handle such data easily.
{{< figure src="/fig/ddia_0309.png" id="fig_dataframe_to_matrix" title="Figure 3-9. Transforming a relational database of movie ratings into a matrix representation." class="w-full my-4" >}}
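The transformation in the figure can be sketched in a few lines of Python. To stay self-contained, the example uses plain lists and dictionaries rather than a dataframe library, and the ratings are made-up values, not the ones in the figure.

```python
# (user, movie, rating) rows, as in a relational ratings table (made-up values)
ratings = [
    ("alice", "Matrix", 5),
    ("alice", "Amelie", 3),
    ("bob", "Matrix", 4),
    ("carol", "Up", 2),
]

users = sorted({user for user, _, _ in ratings})
movies = sorted({movie for _, movie, _ in ratings})
row_of = {user: i for i, user in enumerate(users)}
col_of = {movie: j for j, movie in enumerate(movies)}

# Dense form: one row per user, one column per movie, None where no rating exists.
matrix = [[None] * len(movies) for _ in users]
for user, movie, rating in ratings:
    matrix[row_of[user]][col_of[movie]] = rating

# Sparse form: store only the cells that actually contain a rating.
sparse = {(row_of[user], col_of[movie]): rating for user, movie, rating in ratings}
```

The dense form allocates a cell for every user-movie combination, whereas the sparse form grows only with the number of ratings, which is why sparse representations are preferred when most combinations are empty.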
A matrix can only contain numbers, and various techniques are used to transform non-numerical data
into numbers in the matrix. For example:
* Dates (which are omitted from the example matrix in [Figure 3-9](/en/ch3#fig_dataframe_to_matrix)) could be scaled
to be floating-point numbers within some suitable range.
* For columns that can only take one of a small, fixed set of values (for example, the genre of a
movie in a database of movies), a *one-hot encoding* is often used: we create a column for each
possible value (one for “comedy”, one for “drama”, one for “horror”, etc.), and for each row
representing a movie, we put a 1 in the column corresponding to the genre of that movie, and a 0
in all the other columns. This representation also easily generalizes to movies that fit within
several genres.
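Both techniques can be sketched briefly in Python. The genre list, the helper names `one_hot` and `scale_date`, and the choice of scaling dates linearly into the range 0 to 1 are assumptions made for this example; other encodings are equally possible.

```python
from datetime import date

GENRES = ["comedy", "drama", "horror"]  # the fixed set of possible values


def one_hot(movie_genres):
    """Return one column per genre: 1 if the movie has that genre, otherwise 0."""
    return [1 if genre in movie_genres else 0 for genre in GENRES]


def scale_date(d, earliest, latest):
    """Map a date linearly onto a float between 0.0 (earliest) and 1.0 (latest)."""
    span = latest.toordinal() - earliest.toordinal()
    return (d.toordinal() - earliest.toordinal()) / span


single_genre = one_hot({"drama"})
two_genres = one_hot({"comedy", "horror"})  # generalizes to several genres
start_of_year = scale_date(date(2020, 1, 1), date(2020, 1, 1), date(2020, 12, 31))
end_of_year = scale_date(date(2020, 12, 31), date(2020, 1, 1), date(2020, 12, 31))
```

A movie with several genres simply gets a 1 in more than one column, so no change to the encoding is needed to support it.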
Once the data is in the form of a matrix of numbers, it is amenable to linear algebra operations,
which form the basis of many machine learning algorithms. For example, the data in
[Figure 3-9](/en/ch3#fig_dataframe_to_matrix) could be a part of a system for recommending movies that the user may
like. Dataframes are flexible enough to allow data to be gradually evolved from a relational form
into a matrix representation, while giving the data scientist control over the representation that
is most suitable for achieving the goals of the data analysis or model training process.
There are also databases such as TileDB [^66] that specialize in storing large multidimensional arrays of numbers; they are called *array
databases* and are most commonly used for scientific datasets such as geospatial measurements
(raster data on a regularly spaced grid), medical imaging, or observations from astronomical telescopes [^67].
Dataframes are also used in the financial industry for representing *time series data*, such as the
prices of assets and trades over time [^68].
## Summary {#summary}
Data models are a huge subject, and in this chapter we have taken a quick look at a broad variety of
different models. We didn’t have space to go into all the details of each model, but hopefully the
overview has been enough to whet your appetite to find out more about the model that best fits your
application’s requirements.
The *relational model*, despite being more than half a century old, remains an important data model
for many applications—especially in data warehousing and business analytics, where relational star
or snowflake schemas and SQL queries are ubiquitous. However, several alternatives to relational
data have also become popular in other domains:
* The *document model* targets use cases where data comes in self-contained JSON documents, and
where relationships between one document and another are rare.
* *Graph data models* go in the opposite direction, targeting use cases where anything is potentially
related to everything, and where queries potentially need to traverse multiple hops to find the
data of interest (which can be expressed using recursive queries in Cypher, SPARQL, or Datalog).
* *Dataframes* generalize relational data to large numbers of columns, and thereby provide a bridge
between databases and the multidimensional arrays that form the basis of much machine learning,
statistical data analysis, and scientific computing.
To some degree, one model can be emulated in terms of another model—for example, graph data can be
represented in a relational database—but the result can be awkward, as we saw with the support for
recursive queries in SQL.
Various specialist databases have therefore been developed for each data model, providing query
languages and storage engines that are optimized for a particular model. However, there is also a
trend for databases to expand into neighboring niches by adding support for other data models: for
example, relational databases have added support for document data in the form of JSON columns,
document databases have added relational-like joins, and support for graph data within SQL is
gradually improving.
Another model we discussed is *event sourcing*, which represents data as an append-only log of
immutable events, and which can be advantageous for modeling activities in complex business domains.
An append-only log is good for writing data (as we shall see in [Chapter 4](/en/ch4#ch_storage)); in order to support
efficient queries, the event log is translated into read-optimized materialized views through CQRS.
One thing that non-relational data models have in common is that they typically don’t enforce a
schema for the data they store, which can make it easier to adapt applications to changing
requirements. However, your application most likely still assumes that data has a certain structure;
it’s just a question of whether the schema is explicit (enforced on write) or implicit (assumed on read).
Although we have covered a lot of ground, there are still data models left unmentioned. To give just
a few brief examples:
* Researchers working with genome data often need to perform *sequence-similarity searches*, which
means taking one very long string (representing a DNA molecule) and matching it against a large
database of strings that are similar, but not identical. None of the databases described here can
handle this kind of usage, which is why researchers have written specialized genome database
software like GenBank [^69].
* Many financial systems use *ledgers* with double-entry accounting as their data model. This type
of data can be represented in relational databases, but there are also databases such as
TigerBeetle that specialize in this data model. Cryptocurrencies and blockchains are typically
based on distributed ledgers, which also have value transfer built into their data model.
* *Full-text search* is arguably a kind of data model that is frequently used alongside databases.
Information retrieval is a large specialist subject that we won’t cover in great detail in this
book, but we’ll touch on search indexes and vector search in [“Full-Text Search”](/en/ch4#sec_storage_full_text).
We have to leave it there for now. In the next chapter we will discuss some of the trade-offs that
come into play when *implementing* the data models described in this chapter.
### References
[^1]: Jamie Brandon. [Unexplanations: query optimization works because sql is declarative](https://www.scattered-thoughts.net/writing/unexplanations-sql-declarative/). *scattered-thoughts.net*, February 2024. Archived at [perma.cc/P6W2-WMFZ](https://perma.cc/P6W2-WMFZ)
[^2]: Joseph M. Hellerstein. [The Declarative Imperative: Experiences and Conjectures in Distributed Logic](https://www2.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-90.pdf). Tech report UCB/EECS-2010-90, Electrical Engineering and Computer Sciences, University of California at Berkeley, June 2010. Archived at [perma.cc/K56R-VVQM](https://perma.cc/K56R-VVQM)
[^3]: Edgar F. Codd. [A Relational Model of Data for Large Shared Data Banks](https://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf). *Communications of the ACM*, volume 13, issue 6, pages 377–387, June 1970. [doi:10.1145/362384.362685](https://doi.org/10.1145/362384.362685)
[^4]: Michael Stonebraker and Joseph M. Hellerstein. [What Goes Around Comes Around](http://mitpress2.mit.edu/books/chapters/0262693143chapm1.pdf). In *Readings in Database Systems*, 4th edition, MIT Press, pages 2–41, 2005. ISBN: 9780262693141
[^5]: Markus Winand. [Modern SQL: Beyond Relational](https://modern-sql.com/). *modern-sql.com*, 2015. Archived at [perma.cc/D63V-WAPN](https://perma.cc/D63V-WAPN)
[^6]: Martin Fowler. [OrmHate](https://martinfowler.com/bliki/OrmHate.html). *martinfowler.com*, May 2012. Archived at [perma.cc/VCM8-PKNG](https://perma.cc/VCM8-PKNG)
[^7]: Vlad Mihalcea. [N+1 query problem with JPA and Hibernate](https://vladmihalcea.com/n-plus-1-query-problem/). *vladmihalcea.com*, January 2023. Archived at [perma.cc/79EV-TZKB](https://perma.cc/79EV-TZKB)
[^8]: Jens Schauder. [This is the Beginning of the End of the N+1 Problem: Introducing Single Query Loading](https://spring.io/blog/2023/08/31/this-is-the-beginning-of-the-end-of-the-n-1-problem-introducing-single-query). *spring.io*, August 2023. Archived at [perma.cc/6V96-R333](https://perma.cc/6V96-R333)
[^9]: William Zola. [6 Rules of Thumb for MongoDB Schema Design](https://www.mongodb.com/blog/post/6-rules-of-thumb-for-mongodb-schema-design). *mongodb.com*, June 2014. Archived at [perma.cc/T2BZ-PPJB](https://perma.cc/T2BZ-PPJB)
[^10]: Sidney Andrews and Christopher McClister. [Data modeling in Azure Cosmos DB](https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/modeling-data). *learn.microsoft.com*, February 2023. Archived at [archive.org](https://web.archive.org/web/20230207193233/https%3A//learn.microsoft.com/en-us/azure/cosmos-db/nosql/modeling-data)
[^11]: Raffi Krikorian. [Timelines at Scale](https://www.infoq.com/presentations/Twitter-Timeline-Scalability/). At *QCon San Francisco*, November 2012. Archived at [perma.cc/V9G5-KLYK](https://perma.cc/V9G5-KLYK)
[^12]: Ralph Kimball and Margy Ross. [*The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling*](https://learning.oreilly.com/library/view/the-data-warehouse/9781118530801/), 3rd edition. John Wiley & Sons, July 2013. ISBN: 9781118530801
[^13]: Michael Kaminsky. [Data warehouse modeling: Star schema vs. OBT](https://www.fivetran.com/blog/star-schema-vs-obt). *fivetran.com*, August 2022. Archived at [perma.cc/2PZK-BFFP](https://perma.cc/2PZK-BFFP)
[^14]: Joe Nelson. [User-defined Order in SQL](https://begriffs.com/posts/2018-03-20-user-defined-order.html). *begriffs.com*, March 2018. Archived at [perma.cc/GS3W-F7AD](https://perma.cc/GS3W-F7AD)
[^15]: Evan Wallace. [Realtime Editing of Ordered Sequences](https://www.figma.com/blog/realtime-editing-of-ordered-sequences/). *figma.com*, March 2017. Archived at [perma.cc/K6ER-CQZW](https://perma.cc/K6ER-CQZW)
[^16]: David Greenspan. [Implementing Fractional Indexing](https://observablehq.com/%40dgreensp/implementing-fractional-indexing). *observablehq.com*, October 2020. Archived at [perma.cc/5N4R-MREN](https://perma.cc/5N4R-MREN)
[^17]: Martin Fowler. [Schemaless Data Structures](https://martinfowler.com/articles/schemaless/). *martinfowler.com*, January 2013.
[^18]: Amr Awadallah. [Schema-on-Read vs. Schema-on-Write](https://www.slideshare.net/awadallah/schemaonread-vs-schemaonwrite). At *Berkeley EECS RAD Lab Retreat*, Santa Cruz, CA, May 2009. Archived at [perma.cc/DTB2-JCFR](https://perma.cc/DTB2-JCFR)
[^19]: Martin Odersky. [The Trouble with Types](https://www.infoq.com/presentations/data-types-issues/). At *Strange Loop*, September 2013. Archived at [perma.cc/85QE-PVEP](https://perma.cc/85QE-PVEP)
[^20]: Conrad Irwin. [MongoDB—Confessions of a PostgreSQL Lover](https://speakerdeck.com/conradirwin/mongodb-confessions-of-a-postgresql-lover). At *HTML5DevConf*, October 2013. Archived at [perma.cc/C2J6-3AL5](https://perma.cc/C2J6-3AL5)
[^21]: [Percona Toolkit Documentation: pt-online-schema-change](https://docs.percona.com/percona-toolkit/pt-online-schema-change.html). *docs.percona.com*, 2023. Archived at [perma.cc/9K8R-E5UH](https://perma.cc/9K8R-E5UH)
[^22]: Shlomi Noach. [gh-ost: GitHub’s Online Schema Migration Tool for MySQL](https://github.blog/2016-08-01-gh-ost-github-s-online-migration-tool-for-mysql/). *github.blog*, August 2016. Archived at [perma.cc/7XAG-XB72](https://perma.cc/7XAG-XB72)
[^23]: Shayon Mukherjee. [pg-osc: Zero downtime schema changes in PostgreSQL](https://www.shayon.dev/post/2022/47/pg-osc-zero-downtime-schema-changes-in-postgresql/). *shayon.dev*, February 2022. Archived at [perma.cc/35WN-7WMY](https://perma.cc/35WN-7WMY)
[^24]: Carlos Pérez-Aradros Herce. [Introducing pgroll: zero-downtime, reversible, schema migrations for Postgres](https://xata.io/blog/pgroll-schema-migrations-postgres). *xata.io*, October 2023. Archived at [archive.org](https://web.archive.org/web/20231008161750/https%3A//xata.io/blog/pgroll-schema-migrations-postgres)
[^25]: James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, JJ Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Dale Woodford, Yasushi Saito, Christopher Taylor, Michal Szymaniak, and Ruth Wang. [Spanner: Google’s Globally-Distributed Database](https://research.google/pubs/pub39966/). At *10th USENIX Symposium on Operating System Design and Implementation* (OSDI), October 2012.
[^26]: Donald K. Burleson. [Reduce I/O with Oracle Cluster Tables](http://www.dba-oracle.com/oracle_tip_hash_index_cluster_table.htm). *dba-oracle.com*. Archived at [perma.cc/7LBJ-9X2C](https://perma.cc/7LBJ-9X2C)
[^27]: Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber. [Bigtable: A Distributed Storage System for Structured Data](https://research.google/pubs/pub27898/). At *7th USENIX Symposium on Operating System Design and Implementation* (OSDI), November 2006.
[^28]: Priscilla Walmsley. [*XQuery, 2nd Edition*](https://learning.oreilly.com/library/view/xquery-2nd-edition/9781491915080/). O’Reilly Media, December 2015. ISBN: 9781491915080
[^29]: Paul C. Bryan, Kris Zyp, and Mark Nottingham. [JavaScript Object Notation (JSON) Pointer](https://www.rfc-editor.org/rfc/rfc6901). RFC 6901, IETF, April 2013.
[^30]: Stefan Gössner, Glyn Normington, and Carsten Bormann. [JSONPath: Query Expressions for JSON](https://www.rfc-editor.org/rfc/rfc9535.html). RFC 9535, IETF, February 2024.
[^31]: Michael Stonebraker and Andrew Pavlo. [What Goes Around Comes Around… And Around…](https://db.cs.cmu.edu/papers/2024/whatgoesaround-sigmodrec2024.pdf). *ACM SIGMOD Record*, volume 53, issue 2, pages 21–37, June 2024. [doi:10.1145/3685980.3685984](https://doi.org/10.1145/3685980.3685984)
[^32]: Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. [The PageRank Citation Ranking: Bringing Order to the Web](http://ilpubs.stanford.edu:8090/422/). Technical Report 1999-66, Stanford University InfoLab, November 1999. Archived at [perma.cc/UML9-UZHW](https://perma.cc/UML9-UZHW)
[^33]: Nathan Bronson, Zach Amsden, George Cabrera, Prasad Chakka, Peter Dimov, Hui Ding, Jack Ferris, Anthony Giardullo, Sachin Kulkarni, Harry Li, Mark Marchukov, Dmitri Petrov, Lovro Puzar, Yee Jiun Song, and Venkat Venkataramani. [TAO: Facebook’s Distributed Data Store for the Social Graph](https://www.usenix.org/conference/atc13/technical-sessions/presentation/bronson). At *USENIX Annual Technical Conference* (ATC), June 2013.
[^34]: Natasha Noy, Yuqing Gao, Anshu Jain, Anant Narayanan, Alan Patterson, and Jamie Taylor. [Industry-Scale Knowledge Graphs: Lessons and Challenges](https://cacm.acm.org/magazines/2019/8/238342-industry-scale-knowledge-graphs/fulltext). *Communications of the ACM*, volume 62, issue 8, pages 36–43, August 2019. [doi:10.1145/3331166](https://doi.org/10.1145/3331166)
[^35]: Xiyang Feng, Guodong Jin, Ziyi Chen, Chang Liu, and Semih Salihoğlu. [KÙZU Graph Database Management System](https://www.cidrdb.org/cidr2023/papers/p48-jin.pdf). At *13th Annual Conference on Innovative Data Systems Research* (CIDR 2023), January 2023.
[^36]: Maciej Besta, Emanuel Peter, Robert Gerstenberger, Marc Fischer, Michał Podstawski, Claude Barthels, Gustavo Alonso, and Torsten Hoefler. [Demystifying Graph Databases: Analysis and Taxonomy of Data Organization, System Designs, and Graph Queries](https://arxiv.org/pdf/1910.09017.pdf). *arxiv.org*, October 2019.
[^37]: [Apache TinkerPop 3.6.3 Documentation](https://tinkerpop.apache.org/docs/3.6.3/reference/). *tinkerpop.apache.org*, May 2023. Archived at [perma.cc/KM7W-7PAT](https://perma.cc/KM7W-7PAT)
[^38]: Nadime Francis, Alastair Green, Paolo Guagliardo, Leonid Libkin, Tobias Lindaaker, Victor Marsault, Stefan Plantikow, Mats Rydberg, Petra Selmer, and Andrés Taylor. [Cypher: An Evolving Query Language for Property Graphs](https://core.ac.uk/download/pdf/158372754.pdf). At *International Conference on Management of Data* (SIGMOD), pages 1433–1445, May 2018. [doi:10.1145/3183713.3190657](https://doi.org/10.1145/3183713.3190657)
[^39]: Emil Eifrem. [Twitter correspondence](https://twitter.com/emileifrem/status/419107961512804352), January 2014. Archived at [perma.cc/WM4S-BW64](https://perma.cc/WM4S-BW64)
[^40]: Francesco Tisiot. [Explore the new SEARCH and CYCLE features in PostgreSQL® 14](https://aiven.io/blog/explore-the-new-search-and-cycle-features-in-postgresql-14). *aiven.io*, December 2021. Archived at [perma.cc/J6BT-83UZ](https://perma.cc/J6BT-83UZ)
[^41]: Gaurav Goel. [Understanding Hierarchies in Oracle](https://towardsdatascience.com/understanding-hierarchies-in-oracle-43f85561f3d9). *towardsdatascience.com*, May 2020. Archived at [perma.cc/5ZLR-Q7EW](https://perma.cc/5ZLR-Q7EW)
[^42]: Alin Deutsch, Nadime Francis, Alastair Green, Keith Hare, Bei Li, Leonid Libkin, Tobias Lindaaker, Victor Marsault, Wim Martens, Jan Michels, Filip Murlak, Stefan Plantikow, Petra Selmer, Oskar van Rest, Hannes Voigt, Domagoj Vrgoč, Mingxi Wu, and Fred Zemke. [Graph Pattern Matching in GQL and SQL/PGQ](https://arxiv.org/abs/2112.06217). At *International Conference on Management of Data* (SIGMOD), pages 2246–2258, June 2022. [doi:10.1145/3514221.3526057](https://doi.org/10.1145/3514221.3526057)
[^43]: Alastair Green. [SQL... and now GQL](https://opencypher.org/articles/2019/09/12/SQL-and-now-GQL/). *opencypher.org*, September 2019. Archived at [perma.cc/AFB2-3SY7](https://perma.cc/AFB2-3SY7)
[^44]: Alin Deutsch, Yu Xu, and Mingxi Wu. [Seamless Syntactic and Semantic Integration of Query Primitives over Relational and Graph Data in GSQL](https://cdn2.hubspot.net/hubfs/4114546/IntegrationQuery%20PrimitivesGSQL.pdf). *tigergraph.com*, November 2018. Archived at [perma.cc/JG7J-Y35X](https://perma.cc/JG7J-Y35X)
[^45]: Oskar van Rest, Sungpack Hong, Jinha Kim, Xuming Meng, and Hassan Chafi. [PGQL: a property graph query language](https://event.cwi.nl/grades/2016/07-VanRest.pdf). At *4th International Workshop on Graph Data Management Experiences and Systems* (GRADES), June 2016. [doi:10.1145/2960414.2960421](https://doi.org/10.1145/2960414.2960421)
[^46]: Amazon Web Services. [Neptune Graph Data Model](https://docs.aws.amazon.com/neptune/latest/userguide/feature-overview-data-model.html). Amazon Neptune User Guide, *docs.aws.amazon.com*. Archived at [perma.cc/CX3T-EZU9](https://perma.cc/CX3T-EZU9)
[^47]: Cognitect. [Datomic Data Model](https://docs.datomic.com/cloud/whatis/data-model.html). Datomic Cloud Documentation, *docs.datomic.com*. Archived at [perma.cc/LGM9-LEUT](https://perma.cc/LGM9-LEUT)
[^48]: David Beckett and Tim Berners-Lee. [Turtle – Terse RDF Triple Language](https://www.w3.org/TeamSubmission/turtle/). W3C Team Submission, March 2011.
[^49]: Sinclair Target. [Whatever Happened to the Semantic Web?](https://twobithistory.org/2018/05/27/semantic-web.html) *twobithistory.org*, May 2018. Archived at [perma.cc/M8GL-9KHS](https://perma.cc/M8GL-9KHS)
[^50]: Gavin Mendel-Gleason. [The Semantic Web is Dead – Long Live the Semantic Web!](https://terminusdb.com/blog/the-semantic-web-is-dead/) *terminusdb.com*, August 2022. Archived at [perma.cc/G2MZ-DSS3](https://perma.cc/G2MZ-DSS3)
[^51]: Manu Sporny. [JSON-LD and Why I Hate the Semantic Web](http://manu.sporny.org/2014/json-ld-origins-2/). *manu.sporny.org*, January 2014. Archived at [perma.cc/7PT4-PJKF](https://perma.cc/7PT4-PJKF)
[^52]: University of Michigan Library. [Biomedical Ontologies and Controlled Vocabularies](https://guides.lib.umich.edu/ontology), *guides.lib.umich.edu/ontology*. Archived at [perma.cc/Q5GA-F2N8](https://perma.cc/Q5GA-F2N8)
[^53]: Facebook. [The Open Graph protocol](https://ogp.me/), *ogp.me*. Archived at [perma.cc/C49A-GUSY](https://perma.cc/C49A-GUSY)
[^54]: Matt Haughey. [Everything you ever wanted to know about unfurling but were afraid to ask /or/ How to make your site previews look amazing in Slack](https://medium.com/slack-developer-blog/everything-you-ever-wanted-to-know-about-unfurling-but-were-afraid-to-ask-or-how-to-make-your-e64b4bb9254). *medium.com*, November 2015. Archived at [perma.cc/C7S8-4PZN](https://perma.cc/C7S8-4PZN)
[^55]: W3C RDF Working Group. [Resource Description Framework (RDF)](https://www.w3.org/RDF/). *w3.org*, February 2004.
[^56]: Steve Harris, Andy Seaborne, and Eric Prud’hommeaux. [SPARQL 1.1 Query Language](https://www.w3.org/TR/sparql11-query/). W3C Recommendation, March 2013.
[^57]: Todd J. Green, Shan Shan Huang, Boon Thau Loo, and Wenchao Zhou. [Datalog and Recursive Query Processing](http://blogs.evergreen.edu/sosw/files/2014/04/Green-Vol5-DBS-017.pdf). *Foundations and Trends in Databases*, volume 5, issue 2, pages 105–195, November 2013. [doi:10.1561/1900000017](https://doi.org/10.1561/1900000017)
[^58]: Stefano Ceri, Georg Gottlob, and Letizia Tanca. [What You Always Wanted to Know About Datalog (And Never Dared to Ask)](https://www.researchgate.net/profile/Letizia_Tanca/publication/3296132_What_you_always_wanted_to_know_about_Datalog_and_never_dared_to_ask/links/0fcfd50ca2d20473ca000000.pdf). *IEEE Transactions on Knowledge and Data Engineering*, volume 1, issue 1, pages 146–166, March 1989. [doi:10.1109/69.43410](https://doi.org/10.1109/69.43410)
[^59]: Serge Abiteboul, Richard Hull, and Victor Vianu. [*Foundations of Databases*](http://webdam.inria.fr/Alice/). Addison-Wesley, 1995. ISBN: 9780201537710, available online at [*webdam.inria.fr/Alice*](http://webdam.inria.fr/Alice/)
[^60]: Scott Meyer, Andrew Carter, and Andrew Rodriguez. [LIquid: The soul of a new graph database, Part 2](https://engineering.linkedin.com/blog/2020/liquid--the-soul-of-a-new-graph-database--part-2). *engineering.linkedin.com*, September 2020. Archived at [perma.cc/K9M4-PD6Q](https://perma.cc/K9M4-PD6Q)
[^61]: Matt Bessey. [Why, after 6 years, I’m over GraphQL](https://bessey.dev/blog/2024/05/24/why-im-over-graphql/). *bessey.dev*, May 2024. Archived at [perma.cc/2PAU-JYRA](https://perma.cc/2PAU-JYRA)
[^62]: Dominic Betts, Julián Domínguez, Grigori Melnik, Fernando Simonazzi, and Mani Subramanian. [*Exploring CQRS and Event Sourcing*](https://learn.microsoft.com/en-us/previous-versions/msp-n-p/jj554200%28v%3Dpandp.10%29). Microsoft Patterns & Practices, July 2012. ISBN: 1621140164, archived at [perma.cc/7A39-3NM8](https://perma.cc/7A39-3NM8)
[^63]: Greg Young. [CQRS and Event Sourcing](https://www.youtube.com/watch?v=JHGkaShoyNs). At *Code on the Beach*, August 2014.
[^64]: Greg Young. [CQRS Documents](https://cqrs.files.wordpress.com/2010/11/cqrs_documents.pdf). *cqrs.wordpress.com*, November 2010. Archived at [perma.cc/X5R6-R47F](https://perma.cc/X5R6-R47F)
[^65]: Devin Petersohn, Stephen Macke, Doris Xin, William Ma, Doris Lee, Xiangxi Mo, Joseph E. Gonzalez, Joseph M. Hellerstein, Anthony D. Joseph, and Aditya Parameswaran. [Towards Scalable Dataframe Systems](https://www.vldb.org/pvldb/vol13/p2033-petersohn.pdf). *Proceedings of the VLDB Endowment*, volume 13, issue 11, pages 2033–2046. [doi:10.14778/3407790.3407807](https://doi.org/10.14778/3407790.3407807)
[^66]: Stavros Papadopoulos, Kushal Datta, Samuel Madden, and Timothy Mattson. [The TileDB Array Data Storage Manager](https://www.vldb.org/pvldb/vol10/p349-papadopoulos.pdf). *Proceedings of the VLDB Endowment*, volume 10, issue 4, pages 349–360, November 2016. [doi:10.14778/3025111.3025117](https://doi.org/10.14778/3025111.3025117)
[^67]: Florin Rusu. [Multidimensional Array Data Management](https://faculty.ucmerced.edu/frusu/Papers/Report/2022-09-fntdb-arrays.pdf). *Foundations and Trends in Databases*, volume 12, numbers 2–3, pages 69–220, February 2023. [doi:10.1561/1900000069](https://doi.org/10.1561/1900000069)
[^68]: Ed Targett. [Bloomberg, Man Group team up to develop open source “ArcticDB” database](https://www.thestack.technology/bloomberg-man-group-arcticdb-database-dataframe/). *thestack.technology*, March 2023. Archived at [perma.cc/M5YD-QQYV](https://perma.cc/M5YD-QQYV)
[^69]: Dennis A. Benson, Ilene Karsch-Mizrachi, David J. Lipman, James Ostell, and David L. Wheeler. [GenBank](https://academic.oup.com/nar/article/36/suppl_1/D25/2507746). *Nucleic Acids Research*, volume 36, database issue, pages D25–D30, December 2007. [doi:10.1093/nar/gkm929](https://doi.org/10.1093/nar/gkm929)
================================================
FILE: content/en/ch4.md
================================================
---
title: "4. Storage and Retrieval"
weight: 104
breadcrumbs: false
---

> *One of the miseries of life is that everybody names things a little bit wrong. And so it makes
> everything a little harder to understand in the world than it would be if it were named
> differently. A computer does not primarily compute in the sense of doing arithmetic. […] They
> primarily are filing systems.*
>
> [Richard Feynman](https://www.youtube.com/watch?v=EKWGGDXe5MA&t=296s),
> *Idiosyncratic Thinking* seminar (1985)
On the most fundamental level, a database needs to do two things: when you give it some data, it
should store the data, and when you ask it again later, it should give the data back to you.
In [Chapter 3](/en/ch3#ch_datamodels) we discussed data models and query languages—i.e., the format in which you give
the database your data, and the interface through which you can ask for it again later. In this
chapter we discuss the same from the database’s point of view: how the database can store the data
that you give it, and how it can find the data again when you ask for it.
Why should you, as an application developer, care how the database handles storage and retrieval
internally? You’re probably not going to implement your own storage engine from scratch, but you
*do* need to select a storage engine that is appropriate for your application, from the many that
are available. In order to configure a storage engine to perform well on your kind of workload, you
need to have a rough idea of what the storage engine is doing under the hood.
In particular, there is a big difference between storage engines that are optimized for
transactional workloads (OLTP) and those that are optimized for analytics (we introduced this
distinction in [“Analytical versus Operational Systems”](/en/ch1#sec_introduction_analytics)). This chapter starts by examining two families of
storage engines for OLTP: *log-structured* storage engines that write out immutable data files, and
storage engines such as *B-trees* that update data in-place. These structures are used for both
key-value storage and secondary indexes.
Later in [“Data Storage for Analytics”](/en/ch4#sec_storage_analytics) we’ll discuss a family of storage engines that is optimized for
analytics, and in [“Multidimensional and Full-Text Indexes”](/en/ch4#sec_storage_multidimensional) we’ll briefly look at indexes for more advanced
queries, such as text retrieval.
## Storage and Indexing for OLTP {#sec_storage_oltp}
Consider the world’s simplest database, implemented as two Bash functions:
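```bash
#!/bin/bash

# Append a "key,value" line to the end of the database file.
db_set () {
    echo "$1,$2" >> database
}

# Print the most recent value stored for the given key.
db_get () {
    grep "^$1," database | sed -e "s/^$1,//" | tail -n 1
}
```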
These two functions implement a key-value store. You can call `db_set key value`, which will store
`key` and `value` in the database. The key and value can be (almost) anything you like—for
example, the value could be a JSON document. You can then call `db_get key`, which looks up the most
recent value associated with that particular key and returns it.
And it works:
```bash
$ db_set 12 '{"name":"London","attractions":["Big Ben","London Eye"]}'
$ db_set 42 '{"name":"San Francisco","attractions":["Golden Gate Bridge"]}'
$ db_get 42
{"name":"San Francisco","attractions":["Golden Gate Bridge"]}
```
The storage format is very simple: a text file where each line contains a key-value pair, separated
by a comma (roughly like a CSV file, ignoring escaping issues). Every call to `db_set` appends to
the end of the file. If you update a key several times, old versions of the value are not
overwritten—you need to look at the last occurrence of a key in a file to find the latest value
(hence the `tail -n 1` in `db_get`):
```bash
$ db_set 42 '{"name":"San Francisco","attractions":["Exploratorium"]}'
$ db_get 42
{"name":"San Francisco","attractions":["Exploratorium"]}
$ cat database
12,{"name":"London","attractions":["Big Ben","London Eye"]}
42,{"name":"San Francisco","attractions":["Golden Gate Bridge"]}
42,{"name":"San Francisco","attractions":["Exploratorium"]}
```
The `db_set` function actually has pretty good performance for something that is so simple, because
appending to a file is generally very efficient. Similarly to what `db_set` does, many databases
internally use a *log*, which is an append-only data file. Real databases have more issues to deal
with (such as handling concurrent writes, reclaiming disk space so that the log doesn’t grow
forever, and handling partially written records when recovering from a crash), but the basic
principle is the same. Logs are incredibly useful, and we will encounter them several times in this
book.
--------
> [!NOTE]
> The word *log* is often used to refer to application logs, where an application outputs text that
> describes what’s happening. In this book, *log* is used in the more general sense: an append-only
> sequence of records on disk. It doesn’t have to be human-readable; it might be binary and intended
> only for internal use by the database system.
--------
On the other hand, the `db_get` function has terrible performance if you have a large number of
records in your database. Every time you want to look up a key, `db_get` has to scan the entire
database file from beginning to end, looking for occurrences of the key. In algorithmic terms, the
cost of a lookup is *O*(*n*): if you double the number of records *n* in your database, a lookup
takes twice as long. That’s not good.
In order to efficiently find the value for a particular key in the database, we need a different
data structure: an *index*. In this chapter we will look at a range of indexing structures and see
how they compare; the general idea is to structure the data in a particular way (e.g., sorted by
some key) that makes it faster to locate the data you want. If you want to search the same data in
several different ways, you may need several different indexes on different parts of the data.
An index is an *additional* structure that is derived from the primary data. Many databases allow
you to add and remove indexes, and this doesn’t affect the contents of the database; it only affects
the performance of queries. Maintaining additional structures incurs overhead, especially on writes. For
writes, it’s hard to beat the performance of simply appending to a file, because that’s the simplest
possible write operation. Any kind of index usually slows down writes, because the index also needs
to be updated every time data is written.
This is an important trade-off in storage systems: well-chosen indexes speed up read queries, but
every index consumes additional disk space and slows down writes, sometimes substantially [^1].
For this reason, databases don’t usually index everything by default, but require you—the person
writing the application or administering the database—to choose indexes manually, using your
knowledge of the application’s typical query patterns. You can then choose the indexes that give
your application the greatest benefit, without introducing more overhead on writes than necessary.
### Log-Structured Storage {#sec_storage_log_structured}
To start, let’s assume that you want to continue storing data in the append-only file written by
`db_set`, and you just want to speed up reads. One way you could do this is by keeping a hash map in
memory, in which every key is mapped to the byte offset in the file at which the most recent value
for that key can be found, as illustrated in [Figure 4-1](/en/ch4#fig_storage_csv_hash_index).
{{< figure src="/fig/ddia_0401.png" id="fig_storage_csv_hash_index" caption="Figure 4-1. Storing a log of key-value pairs in a CSV-like format, indexed with an in-memory hash map." class="w-full my-4" >}}
Whenever you append a new key-value pair to the file, you also update the hash map to reflect the
offset of the data you just wrote. When you want to look up a value, you use the hash map to find
the offset in the log file, seek to that location, and read the value. If that part of the data file
is already in the filesystem cache, a read doesn’t require any disk I/O at all.
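The hash-indexed log described above can be sketched in Python as follows. This is a minimal illustration, not a real storage engine: the class name `HashIndexedLog` and its methods are hypothetical, the record format is the same comma-separated line used by `db_set`, and escaping, concurrency, and crash recovery are all ignored.

```python
import os
import tempfile


class HashIndexedLog:
    """Append-only key-value log with an in-memory hash map of byte offsets."""

    def __init__(self, path):
        self.path = path
        self.index = {}  # key -> byte offset of the most recent record
        open(path, "ab").close()  # create the file if it does not exist yet

    def set(self, key, value):
        record = f"{key},{value}\n".encode("utf-8")
        with open(self.path, "ab") as f:
            offset = f.seek(0, os.SEEK_END)  # offset where the record will start
            f.write(record)
        self.index[key] = offset  # later writes replace the older offset

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        with open(self.path, "rb") as f:
            f.seek(offset)  # jump straight to the record, no scan needed
            line = f.readline().decode("utf-8").rstrip("\n")
        return line.split(",", 1)[1]  # split on the first comma only


path = os.path.join(tempfile.mkdtemp(), "database")
db = HashIndexedLog(path)
db.set("42", '{"name":"San Francisco","attractions":["Golden Gate Bridge"]}')
db.set("42", '{"name":"San Francisco","attractions":["Exploratorium"]}')
print(db.get("42"))
```

A lookup now costs one hash map access and one seek, regardless of how many records the file contains. The old record for key `42` is still in the file, which is the disk space problem discussed next.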
This approach is much faster, but it still suffers from several problems:
* You never free up disk space occupied by old log entries that have been overwritten; if you keep
writing to the database you might run out of disk space.
* The hash map is not persisted, so you have to rebuild it when you restart the database—for
example, by scanning the whole log file to find the latest byte offset for each key. This makes
restarts slow if you have a lot of data.
* The hash table must fit in memory. In principle, you could maintain a hash table on disk, but
unfortunately it is difficult to make an on-disk hash map perform well. It requires a lot of
random access I/O, it is expensive to grow when it becomes full, and hash collisions require fiddly logic [^2].
* Range queries are not efficient. For example, you cannot easily scan over all keys between `10000`
and `19999`—you’d have to look up each key individually in the hash map.
#### The SSTable file format {#the-sstable-file-format}
In practice, hash tables are not used very often for database indexes, and instead it is much more
common to keep data in a structure that is *sorted by key* [^3].
One example of such a structure is a *Sorted String Table*, or *SSTable* for short, as shown in
[Figure 4-2](/en/ch4#fig_storage_sstable_index). This file format also stores key-value pairs, but it ensures that
they are sorted by key, and each key only appears once in the file.
{{< figure src="/fig/ddia_0402.png" id="fig_storage_sstable_index" caption="Figure 4-2. An SSTable with a sparse index, allowing queries to jump to the right block." class="w-full my-4" >}}
Now you do not need to keep all the keys in memory: you can group the key-value pairs within an
SSTable into *blocks* of a few kilobytes, and then store the first key of each block in the index.
This kind of index, which stores only some of the keys, is called *sparse*. This index is stored in
a separate part of the SSTable, for example using an immutable B-tree, a trie, or another data
structure that allows queries to quickly look up a particular key [^4].
For example, in [Figure 4-2](/en/ch4#fig_storage_sstable_index), the first key of one block is `handbag`, and the
first key of the next block is `handsome`. Now say you’re looking for the key `handiwork`, which
doesn’t appear in the sparse index. Because of the sorting you know that `handiwork` must appear
between `handbag` and `handsome`. This means you can seek to the offset for `handbag` and scan the
file from there until you find `handiwork` (or not, if the key is not present in the file). A block
of a few kilobytes can be scanned very quickly.
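As a rough sketch of this lookup, the following Python snippet models an SSTable as a list of in-memory blocks and uses binary search over the sparse index. The block contents and the function name `sstable_get` are made up for illustration; a real SSTable would store compressed blocks on disk and the index would map keys to byte offsets.

```python
from bisect import bisect_right

# Hypothetical SSTable contents: blocks of (key, value) pairs, sorted by key.
blocks = [
    [("hammock", 1), ("hand", 2)],
    [("handbag", 3), ("handful", 4), ("handiwork", 5)],
    [("handsome", 6), ("hazard", 7)],
]
# Sparse index: only the first key of each block is kept.
sparse_index = [block[0][0] for block in blocks]


def sstable_get(key):
    # Find the last block whose first key is <= the search key.
    i = bisect_right(sparse_index, key) - 1
    if i < 0:
        return None  # the key sorts before the first block
    for k, v in blocks[i]:  # scan only the single candidate block
        if k == key:
            return v
        if k > key:
            break  # passed the position where the key would have been
    return None
```

Looking up `handiwork` selects the block starting at `handbag` and scans it; a key that sorts inside that block but is absent is rejected as soon as a larger key is seen.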
Moreover, each block of records can be compressed (indicated by the shaded area in
[Figure 4-2](/en/ch4#fig_storage_sstable_index)). Besides saving disk space, compression also reduces the I/O
bandwidth use, at the cost of using a bit more CPU time.
#### Constructing and merging SSTables {#constructing-and-merging-sstables}
The SSTable file format is better for reading than an append-only log, but it makes writes more
difficult. We can’t simply append at the end, because then the file would no longer be sorted
(unless the keys happen to be written in ascending order). If we had to rewrite the whole SSTable
every time a key is inserted somewhere in the middle, writes would become far too expensive.
We can solve this problem with a *log-structured* approach, which is a hybrid between an append-only
log and a sorted file:
1. When a write comes in, add it to an in-memory ordered map data structure, such as a red-black
tree, skip list [^5], or trie [^6].
With these data structures, you can insert keys in any order, look them up efficiently, and read
them back in sorted order. This in-memory data structure is called the *memtable*.
2. When the memtable gets bigger than some threshold—typically a few megabytes—write it out to
disk in sorted order as an SSTable file. We call this new SSTable file the most recent *segment*
of the database, and it is stored as a separate file alongside the older segments. Each segment
has a separate index of its contents. While the new segment is being written out to disk, the
database can continue writing to a new memtable instance, and the old memtable’s memory is freed
when the writing of the SSTable is complete.
3. In order to read the value for some key, first try to find the key in the memtable and the most
recent on-disk segment. If it’s not there, look in the next-older segment, etc. until you either
find the key or reach the oldest segment. If the key does not appear in any of the segments, it
does not exist in the database.
4. From time to time, run a merging and compaction process in the background to combine segment files
and to discard overwritten or deleted values.
Merging segments works similarly to the *mergesort* algorithm [^5]. The process is illustrated in
[Figure 4-3](/en/ch4#fig_storage_sstable_merging): start reading the input files side by side, look at the first key
in each file, copy the lowest key (according to the sort order) to the output file, and repeat. If
the same key appears in more than one input file, keep only the more recent value. This produces a
new merged segment file, also sorted by key, with one value per key, and it uses minimal memory
because we can iterate over the SSTables one key at a time.
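The merge step can be sketched in Python using the standard library's `heapq.merge`. The function name `merge_segments` and the sample data are illustrative assumptions, and segments are plain in-memory lists here rather than files read incrementally.

```python
import heapq


def merge_segments(segments):
    """Merge sorted segments; segments[0] is the most recent.

    Each segment is a list of (key, value) pairs sorted by key, with each
    key appearing at most once per segment.
    """
    # Tag each record with its segment's age so that, for equal keys,
    # the record from the most recent segment sorts first.
    tagged = [
        [(key, age, value) for key, value in segment]
        for age, segment in enumerate(segments)
    ]
    merged = []
    for key, _age, value in heapq.merge(*tagged):
        if merged and merged[-1][0] == key:
            continue  # an older value for a key that was already emitted
        merged.append((key, value))
    return merged


newer = [("handbag", 8786), ("handful", 44662), ("handicap", 70836)]
older = [("handbag", 1), ("handcuffs", 2248), ("handful", 2)]
print(merge_segments([newer, older]))
```

Because every input is already sorted, the merge only ever needs to compare the current head of each input, which is why the memory use stays small when the inputs are streamed from disk.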
{{< figure src="/fig/ddia_0403.png" id="fig_storage_sstable_merging" caption="Figure 4-3. Merging several SSTable segments, retaining only the most recent value for each key." class="w-full my-4" >}}
To ensure that the data in the memtable is not lost if the database crashes, the storage engine
keeps a separate log on disk to which every write is immediately appended. This log is not sorted by
key, but that doesn’t matter, because its only purpose is to restore the memtable after a crash.
Every time the memtable has been written out to an SSTable, the corresponding part of the log can be
discarded.
If you want to delete a key and its associated value, you have to append a special deletion record
called a *tombstone* to the data file. When log segments are merged, the tombstone tells the merging
process to discard any previous values for the deleted key. Once the tombstone is merged into the
oldest segment, it can be dropped.
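To make the write path, read path, and tombstones concrete, here is a deliberately tiny Python sketch. Everything in it is an assumption for illustration: the class `TinyLSM` is hypothetical, a plain `dict` stands in for the ordered memtable, segments live in memory instead of on disk, and the crash-recovery log and compaction are omitted.

```python
TOMBSTONE = object()  # sentinel value marking a deleted key


class TinyLSM:
    """In-memory illustration of the memtable and segment read/write path."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}    # stand-in for an ordered map such as a skip list
        self.segments = []    # newest segment first; each is a sorted list
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def delete(self, key):
        self.put(key, TOMBSTONE)  # a delete is just a write of a tombstone

    def _flush(self):
        # Write the memtable out in sorted order as the newest segment.
        self.segments.insert(0, sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        # Check the memtable first, then segments from newest to oldest.
        for source in [self.memtable] + [dict(s) for s in self.segments]:
            if key in source:
                value = source[key]
                return None if value is TOMBSTONE else value
        return None
```

The tombstone has to shadow older values during reads as well as during merging: `get` stops at the first source that mentions the key, so a newer tombstone hides a value that still exists in an older segment.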
The algorithm described here is essentially what is used in RocksDB [^7], Cassandra, Scylla, and HBase [^8],
all of which were inspired by Google’s Bigtable paper [^9] (which introduced the terms *SSTable* and *memtable*).
The algorithm was originally published in 1996 under the name *Log-Structured Merge-Tree* or *LSM-Tree* [^10],
building on earlier work on log-structured filesystems [^11].
For this reason, storage engines that are based on the principle of merging and compacting sorted files are often called *LSM storage engines*.
In LSM storage engines, a segment file is written in one pass (either by writing out the memtable or
by merging some existing segments), and thereafter it is immutable. The merging and compaction of
segments can be done in a background thread, and while it is going on, we can still continue to
serve reads using the old segment files. When the merging process is complete, we switch read
requests to using the new merged segment instead of the old segments, and then the old segment files
can be deleted.
The segment files don’t necessarily have to be stored on local disk: they are also well suited for
writing to object storage. SlateDB and Delta Lake [^12] take this approach, for example.
Having immutable segment files also simplifies crash recovery: if a crash happens while writing out
the memtable or while merging segments, the database can just delete the unfinished SSTable and
start afresh. The log that persists writes to the memtable could contain incomplete records if there
was a crash halfway through writing a record, or if the disk was full; these are typically detected
by including checksums in the log, and discarding corrupted or incomplete log entries. We will talk
more about durability and crash recovery in [Chapter 8](/en/ch8#ch_transactions).
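The checksum-based detection of incomplete log records might look like the following Python sketch. The record layout (a 4-byte length and a 4-byte CRC32 followed by the payload) and the function names are assumptions chosen for illustration, not the format of any particular database.

```python
import struct
import zlib


def encode_record(payload: bytes) -> bytes:
    # Hypothetical record layout: 4-byte length, 4-byte CRC32, then payload.
    return struct.pack(">II", len(payload), zlib.crc32(payload)) + payload


def recover(log: bytes) -> list:
    """Return the payloads of all intact records, stopping at the first bad one."""
    records, pos = [], 0
    while pos + 8 <= len(log):
        length, checksum = struct.unpack_from(">II", log, pos)
        payload = log[pos + 8 : pos + 8 + length]
        if len(payload) < length or zlib.crc32(payload) != checksum:
            break  # truncated or corrupted record: discard it and the rest
        records.append(payload)
        pos += 8 + length
    return records


log = encode_record(b"set a=1") + encode_record(b"set b=2")
print(recover(log))       # both records are intact
print(recover(log[:-3]))  # simulates a crash partway through the last record
```

Stopping at the first damaged record is a simplification that suits a log whose tail was being written when the crash occurred.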
#### Bloom filters {#bloom-filters}
With LSM storage it can be slow to read a key that was last updated a long time ago, or that does
not exist, since the storage engine needs to check several segment files. In order to speed up such
reads, LSM storage engines often include a *Bloom filter* [^13]
in each segment, which provides a fast but approximate way of checking whether a particular key
appears in a particular SSTable.
[Figure 4-4](/en/ch4#fig_storage_bloom) shows an example of a Bloom filter containing two keys and 16 bits (in
reality, it would contain more keys and more bits). For every key in the SSTable we compute a hash
function, producing a set of numbers that are then interpreted as indexes into the array of bits [^14].
We set the bits corresponding to those indexes to 1, and leave the rest as 0. For example, the key
`handbag` hashes to the numbers (2, 9, 4), so we set the 2nd, 9th, and 4th bits to 1. The bitmap
is then stored as part of the SSTable, along with the sparse index of keys. This takes a bit of
extra space, but the Bloom filter is generally small compared to the rest of the SSTable.
{{< figure src="/fig/ddia_0404.png" id="fig_storage_bloom" caption="Figure 4-4. A Bloom filter provides a fast, probabilistic check whether a particular key exists in a particular SSTable." class="w-full my-4" >}}
When we want to know whether a key appears in the SSTable, we compute the same hash of that key as
before, and check the bits at those indexes. For example, in [Figure 4-4](/en/ch4#fig_storage_bloom), we’re querying
the key `handheld`, which hashes to (6, 11, 2). One of those bits is 1 (namely, bit number 2),
while the other two are 0. These checks can be made extremely fast using the bitwise operations that
all CPUs support.
If at least one of the bits is 0, we know that the key definitely does not appear in the SSTable.
If the bits in the query are all 1, it’s likely that the key is in the SSTable, but it’s also
possible that by coincidence all of those bits were set to 1 by other keys. This case, in which it looks
as if a key is present even though it isn’t, is called a *false positive*.
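A Bloom filter along these lines can be sketched in a few lines of Python. The class and method names are hypothetical, and deriving the bit positions from salted SHA-256 digests is an arbitrary choice made for this sketch; production storage engines typically use faster non-cryptographic hash functions.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=16, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bitmap

    def _positions(self, key):
        # Derive several bit positions from one key by salting the hash input.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits & (1 << pos) for pos in self._positions(key))


bf = BloomFilter(num_bits=1024, num_hashes=3)
for key in ["handbag", "handful"]:
    bf.add(key)
print(bf.might_contain("handbag"))
```

Keys that were added always report `True`, so there are no false negatives. A key that was never added reports `True` only if all of its positions happen to collide with bits set by other keys.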
The probability of false positives depends on the number of keys, the number of bits set per key,
and the total number of bits in the Bloom filter. You can use an online calculator tool to work out
the right parameters for your application [^15].
As a rule of thumb, you need to allocate 10 bits of Bloom filter space for every key in the SSTable
to get a false positive probability of 1%, and the probability is reduced tenfold for every 5
additional bits you allocate per key.
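The rule of thumb can be checked against the standard approximation for the false positive rate, (1 − e^(−k/b))^k, where b is the number of bits per key and k the number of bit positions set per key. The Python sketch below assumes k is chosen near its optimum of b · ln 2; the function name is illustrative.

```python
import math


def false_positive_rate(bits_per_key, num_hashes=None):
    """Standard approximation (1 - e^(-k/b))^k for a Bloom filter."""
    if num_hashes is None:
        # Near-optimal number of bit positions for the given space budget.
        num_hashes = max(1, round(bits_per_key * math.log(2)))
    return (1 - math.exp(-num_hashes / bits_per_key)) ** num_hashes


for b in (10, 15, 20):
    print(b, f"{false_positive_rate(b):.5%}")
```

Under these assumptions, 10 bits per key gives a rate a little below 1%, and each additional 5 bits per key lowers it by roughly a factor of ten, in line with the rule of thumb.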
In the context of LSM storage engines, false positives are no problem:
* If the Bloom filter says that a key *is not* present, we can safely skip that SSTable, since we
can be sure that it doesn’t contain the key.
* If the Bloom filter says the key *is* present, we have to consult the sparse index and decode the
block of key-value pairs to check whether the key really is there. If it was a false positive, we
have done a bit of unnecessary work, but otherwise no harm is done—we just continue the search
with the next-oldest segment.
#### Compaction strategies {#sec_storage_lsm_compaction}
An important detail is how the LSM storage chooses when to perform compaction, and which SSTables to
include in a compaction. Many LSM-based storage systems allow you to configure which compaction
strategy to use, and some of the common choices are [^16] [^17]:
Size-tiered compaction
: Newer and smaller SSTables are successively merged into older and larger SSTables. The SSTables
containing older data can get very large, and merging them requires a lot of temporary disk space.
The advantage of this strategy is that it can handle very high write throughput.
Leveled compaction
: The key range is split up into smaller SSTables and older data is moved into separate “levels,”
which allows the compaction to proceed more incrementally and use less disk space than the
size-tiered strategy. This strategy is more efficient for reads than size-tiered compaction
because the storage engine needs to read fewer SSTables to check whether they contain the key.
As a rule of thumb, size-tiered compaction performs better if you have mostly writes and few reads,
whereas leveled compaction performs better if your workload is dominated by reads. If you write a
small number of keys frequently and a large number of keys rarely, then leveled compaction can also
be advantageous [^18].
Even though there are many subtleties, the basic idea of LSM-trees—keeping a cascade of SSTables
that are merged in the background—is simple and effective. We discuss their performance
characteristics in more detail in [“Comparing B-Trees and LSM-Trees”](/en/ch4#sec_storage_btree_lsm_comparison).
--------
> [!TIP] EMBEDDED STORAGE ENGINES
Many databases run as a service that accepts queries over a network, but there are also *embedded*
databases that don’t expose a network API. Instead, they are libraries that run in the same process
as your application code, typically reading and writing files on the local disk, and you interact
with them through normal function calls. Examples of embedded storage engines include RocksDB,
SQLite, LMDB, DuckDB, and KùzuDB [^19].
Embedded databases are very commonly used in mobile apps to store the local user’s data. On the
backend, they can be an appropriate choice if the data is small enough to fit on a single machine,
and if there are not many concurrent transactions. For example, in a multitenant system in which
each tenant is small enough and completely separate from others (i.e., you do not need to run
queries that combine data from multiple tenants), you can potentially use a separate embedded
database instance per tenant [^20].
The storage and retrieval methods we discuss in this chapter are used in both embedded and
client-server databases. In [Chapter 6](/en/ch6#ch_replication) and [Chapter 7](/en/ch7#ch_sharding) we will discuss techniques
for scaling a database across multiple machines.
--------
### B-Trees {#sec_storage_b_trees}
The log-structured approach is popular, but it is not the only form of key-value storage. The most
widely used structure for reading and writing database records by key is the *B-tree*.
Introduced in 1970 [^21] and called “ubiquitous” less than 10 years later [^22],
B-trees have stood the test of time very well. They remain the standard index implementation in
almost all relational databases, and many nonrelational databases use them too.
Like SSTables, B-trees keep key-value pairs sorted by key, which allows efficient key-value lookups
and range queries. But that’s where the similarity ends: B-trees have a very different design
philosophy.
The log-structured indexes we saw earlier break the database down into variable-size *segments*,
typically several megabytes or more in size, that are written once and are then immutable. By
contrast, B-trees break the database down into fixed-size *blocks* or *pages*, and may overwrite a
page in-place. A page is traditionally 4 KiB in size, but PostgreSQL now uses 8 KiB and
MySQL uses 16 KiB by default.
Each page can be identified using a page number, which allows one page to refer to another—similar
to a pointer, but on disk instead of in memory. If all the pages are stored in the same file,
multiplying the page number by the page size gives us the byte offset in the file where the page is
located. We can use these page references to construct a tree of pages, as illustrated in
[Figure 4-5](/en/ch4#fig_storage_b_tree).
{{< figure src="/fig/ddia_0405.png" id="fig_storage_b_tree" caption="Figure 4-5. Looking up the key 251 using a B-tree index. From the root page we first follow the reference to the page for keys 200–300, then the page for keys 250–270." class="w-full my-4" >}}
One page is designated as the *root* of the B-tree; whenever you want to look up a key in the index,
you start here. The page contains several keys and references to child pages.
Each child is responsible for a continuous range of keys, and the keys between the references indicate
where the boundaries between those ranges lie.
(This structure is sometimes called a B+ tree, but we don’t need to distinguish it
from other B-tree variants.)
In the example in [Figure 4-5](/en/ch4#fig_storage_b_tree), we are looking for the key 251, so we know that we need to
follow the page reference between the boundaries 200 and 300. That takes us to a similar-looking
page that further breaks down the 200–300 range into subranges. Eventually we get down to a
page containing individual keys (a *leaf page*), which either contains the value for each key
inline or contains references to the pages where the values can be found.
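The descent from the root to a leaf can be sketched in Python as follows. The page layout is a made-up toy: pages are dictionaries keyed by page number rather than fixed-size blocks on disk, the branching factor is tiny, and leaves store values inline.

```python
from bisect import bisect_right

# Hypothetical pages keyed by page number. Interior pages hold boundary keys
# and child page numbers (one more child than keys); leaves hold key-value pairs.
pages = {
    1: {"leaf": False, "keys": [200], "children": [2, 3]},
    2: {"leaf": False, "keys": [100], "children": [4, 5]},
    3: {"leaf": False, "keys": [250, 270], "children": [6, 7, 8]},
    4: {"leaf": True, "items": {42: "a"}},
    5: {"leaf": True, "items": {100: "b", 150: "c"}},
    6: {"leaf": True, "items": {200: "d", 210: "e"}},
    7: {"leaf": True, "items": {250: "f", 251: "g"}},
    8: {"leaf": True, "items": {270: "h", 300: "i"}},
}
ROOT = 1


def btree_get(key):
    """Return (value or None, number of pages read) for the given key."""
    page = pages[ROOT]
    pages_read = 1
    while not page["leaf"]:
        # Keys greater than or equal to a boundary belong to the child on its right.
        child = page["children"][bisect_right(page["keys"], key)]
        page = pages[child]
        pages_read += 1
    return page["items"].get(key), pages_read


print(btree_get(251))
```

In this toy tree every lookup reads exactly three pages, one per level, whether or not the key exists.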
The number of references to child pages in one page of the B-tree is called the *branching factor*.
For example, in [Figure 4-5](/en/ch4#fig_storage_b_tree) the branching factor is six. In practice, the branching
factor depends on the amount of space required to store the page references and the range
boundaries, but typically it is several hundred.
If you want to update the value for an existing key in a B-tree, you search for the leaf page
containing that key, and overwrite that page on disk with a version that contains the new value.
If you want to add a new key, you need to find the page whose range encompasses the new key and add
it to that page. If there isn’t enough free space in the page to accommodate the new key, the page
is split into two half-full pages, and the parent page is updated to account for the new subdivision
of key ranges.
{{< figure src="/fig/ddia_0406.png" id="fig_storage_b_tree_split" caption="Figure 4-6. Growing a B-tree by splitting a page on the boundary key 337. The parent page is updated to reference both children." class="w-full my-4" >}}
In the example of [Figure 4-6](/en/ch4#fig_storage_b_tree_split), we want to insert the key 334, but the page for the
range 333–345 is already full. We therefore split it into a page for the range 333–337 (including
the new key), and a page for 337–345. We also have to update the parent page to have references to
both children, with a boundary value of 337 between them. If the parent page doesn’t have enough
space for the new reference, it may also need to be split, and the splits can continue all the way
to the root of the tree. When the root is split, we make a new root above it. Deleting keys (which
may require nodes to be merged) is more complex [^5].
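The leaf split can be illustrated with a small Python sketch. It models a leaf as a sorted list of keys with a fixed capacity; the function names, the capacity of five, and the example keys are assumptions for illustration, and updating the parent page is left to the caller.

```python
def split_leaf(keys):
    """Split a sorted, overfull leaf into two halves; return (left, boundary, right)."""
    mid = len(keys) // 2
    left, right = keys[:mid], keys[mid:]
    return left, right[0], right  # the first key on the right is the boundary


def insert_into_leaf(keys, new_key, capacity):
    """Insert a key into a sorted leaf, splitting the leaf if it overflows.

    Returns (pages, boundaries): one page and no boundary if the key fits,
    otherwise two pages and the boundary key to add to the parent.
    """
    keys = sorted(keys + [new_key])
    if len(keys) <= capacity:
        return [keys], []
    left, boundary, right = split_leaf(keys)
    return [left, right], [boundary]


print(insert_into_leaf([333, 335, 337, 340, 342], 334, capacity=5))
```

With these example keys the split produces two half-full leaves with 337 as the boundary between them. If the parent has no room for the new boundary, the same procedure would be applied one level up.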
This algorithm ensures that the tree remains *balanced*: a B-tree with *n* keys always has a depth
of *O*(log *n*). Most databases can fit into a B-tree that is three or four levels deep, so
you don’t need to follow many page references to find the page you are looking for. (A four-level
tree of 4 KiB pages with a branching factor of 500 can store up to 250 TB.)
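One way to arrive at a figure of this magnitude is the back-of-the-envelope calculation below. It rests on an assumption that is not spelled out in the text: each of the four levels fans out by the branching factor, and every reference at the bottom level points to one page of data.

```python
PAGE_SIZE = 4096          # 4 KiB pages
BRANCHING_FACTOR = 500
LEVELS = 4

# Assumption: four levels of fan-out, each bottom-level reference
# addressing one page of data.
pages_addressable = BRANCHING_FACTOR ** LEVELS
capacity_bytes = pages_addressable * PAGE_SIZE
print(f"{pages_addressable:,} pages, about {capacity_bytes / 1e12:.0f} TB")
```

Counting a page as 4,096 bytes gives about 256 TB; rounding the page size down to 4,000 bytes gives the rounder figure of 250 TB.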
#### Making B-trees reliable {#sec_storage_btree_wal}
The basic underlying write operation of a B-tree is to overwrite a page on disk with new data. It is
assumed that the overwrite does not change the location of the page; i.e., all references to that
page remain intact when the page is overwritten. This is in stark contrast to log-structured indexes
such as LSM-trees, which only append to files (and eventually delete obsolete files) but never
modify files in place.
Overwriting several pages at once, like in a page split, is a dangerous operation: if the database
crashes after only some of the pages have been written, you end up with a corrupted tree (e.g.,
there may be an *orphan* page that is not a child of any parent). If the hardware can’t atomically
write an entire page, you can also end up with a partially written page (this is known as a *torn page* [^23]).
In order to make the database resilient to crashes, it is common for B-tree implementations to
include an additional data structure on disk: a *write-ahead log* (WAL). This is an append-only file
to which every B-tree modification must be written before it can be applied to the pages of the tree
itself. When the database comes back up after a crash, this log is used to restore the B-tree back
to a consistent state [^2] [^24].
In filesystems, the equivalent mechanism is known as *journaling*.
To improve performance, B-tree implementations typically don’t immediately write every modified page
to disk, but buffer the B-tree pages in memory for a while first. The write-ahead log then also
ensures that data is not lost in the case of a crash: as long as data has been written to the WAL,
and flushed to disk using the `fsync()` system call, the data will be durable as the database will
be able to recover it after a crash [^25].
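The ordering that gives a write-ahead log its name can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the function names are hypothetical, the log stores full page images, and checksums, log truncation, and concurrency are omitted.

```python
import os
import tempfile

PAGE_SIZE = 4096


def write_page(wal_path, data_path, page_number, page_bytes):
    """Log the new page contents durably, then overwrite the page in place."""
    if len(page_bytes) != PAGE_SIZE:
        raise ValueError("page_bytes must be exactly one page long")
    # 1. Append the intended change to the write-ahead log and force it to disk.
    with open(wal_path, "ab") as wal:
        wal.write(page_number.to_bytes(8, "big") + page_bytes)
        wal.flush()
        os.fsync(wal.fileno())
    # 2. Only now modify the page itself; a crash here can be repaired from the log.
    with open(data_path, "r+b") as data:
        data.seek(page_number * PAGE_SIZE)  # byte offset = page number * page size
        data.write(page_bytes)


def replay(wal_path, data_path):
    """After a crash, reapply every complete logged page image in order."""
    record_size = 8 + PAGE_SIZE
    with open(wal_path, "rb") as wal, open(data_path, "r+b") as data:
        while len(record := wal.read(record_size)) == record_size:
            page_number = int.from_bytes(record[:8], "big")
            data.seek(page_number * PAGE_SIZE)
            data.write(record[8:])


workdir = tempfile.mkdtemp()
wal_path = os.path.join(workdir, "wal")
data_path = os.path.join(workdir, "pages")
with open(data_path, "wb") as f:
    f.write(bytes(3 * PAGE_SIZE))  # three empty pages
write_page(wal_path, data_path, 1, b"x" * PAGE_SIZE)
```

Because the log record is flushed with `os.fsync` before the page file is touched, replaying the log after a crash brings the page file back to a state that includes every acknowledged write.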
#### B-tree variants {#b-tree-variants}
As B-trees have been around for so long, many variants have been developed over the years. To
mention just a few:
* Instead of overwriting pages and maintaining a WAL for crash recovery, some databases (like LMDB)
use a copy-on-write scheme [^26].
A modified page is written to a different location, and a new version of the parent pages in the tree
is created, pointing at the new location. This approach is also useful for concurrency control, as we shall
see in [“Snapshot Isolation and Repeatable Read”](/en/ch8#sec_transactions_snapshot_isolation).
* We can save space in pages by not storing the entire key, but abbreviating it. Especially in pages
on the interior of the tree, keys only need to provide enough information to act as boundaries
between key ranges. Packing more keys into a page allows the tree to have a higher branching
factor, and thus fewer levels.
* To speed up scans over the key range in sorted order, some B-tree implementations try to lay out
the tree so that leaf pages appear in sequential order on disk, reducing the number of disk seeks.
However, it’s difficult to maintain that order as the tree grows.
* Additional pointers have been added to the tree. For example, each leaf page may have references to
its sibling pages to the left and right, which allows scanning keys in order without jumping back
to parent pages.
### Comparing B-Trees and LSM-Trees {#sec_storage_btree_lsm_comparison}
As a rule of thumb, LSM-trees are better suited for write-heavy applications, whereas B-trees are faster for reads [^27] [^28].
However, benchmarks are often sensitive to details of the workload. You need to test systems with
your particular workload in order to make a valid comparison. Moreover, it’s not a strict either/or
choice between LSM and B-trees: storage engines sometimes blend characteristics of both approaches,
for example by having multiple B-trees and merging them LSM-style. In this section we will briefly
discuss a few things that are worth considering when measuring the performance of a storage engine.
#### Read performance {#read-performance}
In a B-tree, looking up a key involves reading one page at each level of the B-tree. Since the
number of levels is usually quite small, this means that reads from a B-tree are generally fast and
have predictable performance. In an LSM storage engine, reads often have to check several different
SSTables at different stages of compaction, but Bloom filters help reduce the number of actual disk
I/O operations required. Both approaches can perform well, and which is faster depends on the
details of the storage engine and the workload.
Range queries are simple and fast on B-trees, as they can use the sorted structure of the tree. On
LSM storage, range queries can also take advantage of the SSTable sorting, but they need to scan all
the segments in parallel and combine the results. Bloom filters don’t help for range queries (since
you would need to compute the hash of every possible key within the range, which is impractical),
making range queries more expensive than point queries in the LSM approach [^29].
High write throughput can cause latency spikes in a log-structured storage engine if the
memtable fills up. This happens if data can’t be written out to disk fast enough, perhaps because
the compaction process cannot keep up with incoming writes. Many storage engines, including RocksDB,
perform *backpressure* in this situation: they suspend all reads and writes until the memtable has
been written out to disk [^30] [^31].
Regarding read throughput, modern SSDs (and especially NVMe) can perform many independent read
requests in parallel. Both LSM-trees and B-trees are able to provide high read throughput, but
storage engines need to be carefully designed to take advantage of this parallelism [^32].
#### Sequential vs. random writes {#sidebar_sequential}
With a B-tree, if the application writes keys that are scattered all over the key space, the
resulting disk operations are also scattered randomly, since the pages that the storage engine needs
to overwrite could be located anywhere on disk. On the other hand, a log-structured storage engine
writes entire segment files at a time (either writing out the memtable or while compacting existing
segments), which are much bigger than a page in a B-tree.
The pattern of many small, scattered writes (as found in B-trees) is called *random writes*, while
the pattern of fewer large writes (as found in LSM-trees) is called *sequential writes*. Disks
generally have higher sequential write throughput than random write throughput, which means that a
log-structured storage engine can generally handle higher write throughput on the same hardware than
a B-tree. This difference is particularly big on spinning-disk hard drives (HDDs); on the solid-state
drives (SSDs) that most databases use today, the difference is smaller, but still noticeable
(see [“Sequential vs. Random Writes on SSDs”](/en/ch4#sidebar_sequential)).
--------
> [!TIP] SEQUENTIAL VS. RANDOM WRITES ON SSDS
On spinning-disk hard drives (HDDs), sequential writes are much faster than random writes: a random
write has to mechanically move the disk head to a new position and wait for the right part of the
platter to pass underneath the disk head, which takes several milliseconds—an eternity in computing
timescales. However, SSDs (solid-state drives) including NVMe (Non-Volatile Memory Express, i.e.
flash memory attached to the PCI Express bus) have now overtaken HDDs for many use cases, and they
are not subject to such mechanical limitations.
Nevertheless, SSDs also have higher throughput for sequential writes than for random writes.
The reason is that flash memory can be read or written one page (typically 4 KiB) at a time,
but it can only be erased one block (typically 512 KiB) at a time. Some of the pages in a block
may contain valid data, whereas others may contain data that is no longer needed. Before erasing a
block, the controller must first move pages containing valid data into other blocks; this process is
called *garbage collection* (GC) [^33].
A sequential write workload writes larger chunks of data at a time, so it is likely that a whole
512 KiB block belongs to a single file; when that file is later deleted again, the whole block
can be erased without having to perform any GC. On the other hand, with a random write workload, it
is more likely that a block contains a mixture of pages with valid and invalid data, so the GC has
to perform more work before a block can be erased [^34] [^35] [^36].
The write bandwidth consumed by GC is then not available for the application. Moreover, the
additional writes performed by GC contribute to wear on the flash memory; therefore, random writes
wear out the drive faster than sequential writes.
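As a rough illustration of why mixed blocks cost more to reclaim, the following toy model counts how many still-valid pages would have to be copied before every block touched by a deleted file can be erased. The layout functions, the file names, and the figure of 128 pages per block (512 KiB / 4 KiB) are simplifying assumptions for this sketch; real flash controllers are far more sophisticated.

```python
PAGES_PER_BLOCK = 128  # assumption: 512 KiB block / 4 KiB page


def sequential_layout(files, pages_per_file):
    """Each file's pages are written contiguously, so files fill whole blocks."""
    pages = [f for f in files for _ in range(pages_per_file)]
    return [pages[i:i + PAGES_PER_BLOCK] for i in range(0, len(pages), PAGES_PER_BLOCK)]


def interleaved_layout(files, pages_per_file):
    """Pages of different files arrive interleaved, as with many small scattered writes."""
    pages = [f for _ in range(pages_per_file) for f in files]
    return [pages[i:i + PAGES_PER_BLOCK] for i in range(0, len(pages), PAGES_PER_BLOCK)]


def pages_to_copy(blocks, deleted_file):
    """Valid pages that must be moved so that blocks holding the deleted file can be erased."""
    return sum(
        sum(1 for owner in block if owner != deleted_file)
        for block in blocks
        if deleted_file in block
    )


files = ["a", "b", "c", "d"]
seq = sequential_layout(files, PAGES_PER_BLOCK)
mixed = interleaved_layout(files, PAGES_PER_BLOCK)
print(pages_to_copy(seq, "a"))    # the block belongs entirely to "a": nothing to copy
print(pages_to_copy(mixed, "a"))  # every block mixes files: many pages to copy
```

In this model the sequential layout needs no copying at all, whereas the interleaved layout forces three quarters of every affected block to be rewritten.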
--------
#### Write amplification {#write-amplification}
With any type of storage engine, one write request from the application turns into multiple I/O
operations on the underlying disk. With LSM-trees, a value is first written to the log for
durability, then again when the memtable is written to disk, and again every time the key-value pair
is part of a compaction. (If the values are significantly larger than the keys, this overhead can be
reduced by storing values separately from keys, and performing compaction only on SSTables
containing keys and references to values [^37].)
A B-tree index must write every piece of data at least twice: once to the write-ahead log, and once
to the tree page itself. In addition, it sometimes needs to write out an entire page, even if only
a few bytes in that page changed, to ensure the B-tree can be correctly recovered after a crash or
power failure [^38] [^39].
If you take the total number of bytes written to disk in some workload, and divide by the number of
bytes you would have to write if you simply wrote an append-only log with no index, you get the
*write amplification*. (Sometimes write amplification is defined in terms of I/O operations rather
than bytes.) In write-heavy applications, the bottleneck might be the rate at which the database can
write to disk. In this case, the higher the write amplification, the fewer writes per second it can
handle within the available disk bandwidth.
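To make the definition concrete, here is a small worked calculation. The byte counts (one log write, one memtable flush, and four compaction rewrites per value) and the 600 MB/s disk bandwidth are made-up numbers for illustration, not measurements of any real engine.

```python
GIB = 1024 ** 3


def write_amplification(bytes_written_to_disk, bytes_appended_to_plain_log):
    """Total bytes written divided by the bytes an index-free append-only log would write."""
    return bytes_written_to_disk / bytes_appended_to_plain_log


def sustainable_write_rate(disk_bandwidth, amplification):
    """Application write rate that fits within the given disk write bandwidth."""
    return disk_bandwidth / amplification


app_bytes = 1 * GIB
disk_bytes = (
    1 * GIB    # write-ahead log
    + 1 * GIB  # memtable flushed to an SSTable
    + 4 * GIB  # assumption: each value is rewritten by four compactions
)

amplification = write_amplification(disk_bytes, app_bytes)
rate = sustainable_write_rate(600, amplification)  # disk bandwidth in MB/s
print(amplification, rate)
```

Under these assumptions the write amplification is 6, so a disk that can sustain 600 MB/s of writes leaves only 100 MB/s for the application.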
Write amplification is a problem in both LSM-trees and B-trees. Which one is better depends on
various factors, such as the length of your keys and values, and how often you overwrite existing
keys versus insert new ones. For typical workloads, LSM-trees tend to have lower write amplification
because they don’t have to write entire pages and they can compress chunks of the SSTable [^40].
This is another factor that makes LSM storage engines well suited for write-heavy workloads.
Besides affecting throughput, write amplification is also relevant for the wear on SSDs: a storage
engine with lower write amplification will wear out the SSD less quickly.
When measuring the write throughput of a storage engine, it is important to run the experiment for
long enough that the effects of write amplification become clear. When writing to an empty LSM-tree,
there are no compactions going on yet, so all of the disk bandwidth is available for new writes. As
the database grows, new writes need to share the disk bandwidth with compaction.
#### Disk space usage {#disk-space-usage}
B-trees can become *fragmented* over time: for example, if a large number of keys are deleted, the
database file may contain a lot of pages that are no longer used by the B-tree. Subsequent additions
to the B-tree can use those free pages, but they can’t easily be returned to the operating system
because they are in the middle of the file, so they still take up space on the filesystem. Databases
therefore need a background process that moves pages around to place them better, such as the vacuum
process in PostgreSQL [^25].
Fragmentation is less of a problem in LSM-trees, since the compaction process periodically rewrites
the data files anyway, and SSTables don’t have pages with unused space. Moreover, blocks of
key-value pairs can better be compressed in SSTables, and thus often produce smaller files on disk
than B-trees. Keys and values that have been overwritten continue to consume space until they are
removed by a compaction, but this overhead is quite low when using leveled compaction [^40] [^41].
Size-tiered compaction (see [“Compaction strategies”](/en/ch4#sec_storage_lsm_compaction)) uses more disk space, especially
temporarily during compaction.
Having multiple copies of some data on disk can also be a problem when you need to delete some data,
and be confident that it really has been deleted (perhaps to comply with data protection
regulations). For example, in most LSM storage engines a deleted record may still exist in the higher
levels until the tombstone representing the deletion has been propagated through all of the
compaction levels, which may take a long time. Specialist storage engine designs can propagate
deletions faster [^42].
On the other hand, the immutable nature of SSTable segment files is useful if you want to take a
snapshot of a database at some point in time (e.g. for a backup or to create a copy of the database
for testing): you can write out the memtable and record which segment files existed at that point in
time. As long as you don’t delete the files that are part of the snapshot, you don’t need to
actually copy them. In a B-tree whose pages are overwritten, taking such a snapshot efficiently is
more difficult.
### Multi-Column and Secondary Indexes {#sec_storage_index_multicolumn}
So far we have only discussed key-value indexes, which are like a *primary key* index in the
relational model. A primary key uniquely identifies one row in a relational table, or one document
in a document database, or one vertex in a graph database. Other records in the database can refer
to that row/document/vertex by its primary key (or ID), and the index is used to resolve such references.
It is also very common to have *secondary indexes*. In relational databases, you can create several
secondary indexes on the same table using the `CREATE INDEX` command, allowing you to search by
columns other than the primary key. For example, in [Figure 3-1](/en/ch3#fig_obama_relational) in [Chapter 3](/en/ch3#ch_datamodels)
you would most likely have a secondary index on the `user_id` columns so that you can find all the
rows belonging to the same user in each of the tables.
A secondary index can easily be constructed from a key-value index. The main difference is that
in a secondary index, the indexed values are not necessarily unique; that is,
there might be many rows (documents, vertices) under the same index entry. This can be
solved in two ways: either by making each value in the index a list of matching row identifiers (like a
postings list in a full-text index) or by making each entry unique by appending a row identifier to
it. Storage engines with in-place updates, like B-trees, and log-structured storage can both be used
to implement an index.
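The two options can be sketched on top of ordinary in-memory structures. The rows, the column name, and the function names below are hypothetical; a dictionary and a sorted list stand in for whatever key-value index the storage engine provides.

```python
import bisect
from collections import defaultdict

# Hypothetical rows keyed by primary key; user_id is the indexed column.
rows = {
    101: {"user_id": 251, "title": "Co-chair"},
    102: {"user_id": 251, "title": "President"},
    103: {"user_id": 300, "title": "Engineer"},
}


def build_postings_index(rows, column):
    """Option 1: each indexed value maps to a list of matching row identifiers."""
    index = defaultdict(list)
    for row_id, row in rows.items():
        index[row[column]].append(row_id)
    return index


def build_unique_entry_index(rows, column):
    """Option 2: append the row identifier so that every index entry is unique."""
    return sorted((row[column], row_id) for row_id, row in rows.items())


def lookup_unique_entry_index(index, value):
    """Scan the contiguous range of entries whose first component is the value."""
    start = bisect.bisect_left(index, (value,))
    matches = []
    for indexed_value, row_id in index[start:]:
        if indexed_value != value:
            break
        matches.append(row_id)
    return matches


postings = build_postings_index(rows, "user_id")
entries = build_unique_entry_index(rows, "user_id")
print(postings[251])                            # row IDs for user 251
print(lookup_unique_entry_index(entries, 251))  # the same rows via a range scan
```

The second option turns an equality lookup into a short range scan, which is why it fits naturally into sorted structures such as B-trees and SSTables.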
#### Storing values within the index {#sec_storage_index_heap}
The key in an index is the thing that queries search by, but the value can be one of several things:
* If the actual data (row, document, vertex) is stored directly within the index structure, it is
called a *clustered index*. For example, in MySQL’s InnoDB storage engine, the primary key of a
table is always a clustered index, and in SQL Server, you can specify one clustered index per table [^43].
* Alternatively, the value can be a reference to the actual data: either the primary key of the row
in question (InnoDB does this for secondary indexes), or a direct reference to a location on disk.
In the latter case, the place where rows are stored is known as a *heap file*, and it stores data
in no particular order (it may be append-only, or it may keep track of deleted rows in order to
overwrite them with new data later). For example, Postgres uses the heap file approach [^44].
* A middle ground between the two is a *covering index* or *index with included columns*, which
stores *some* of a table’s columns within the index, in addition to storing the full row on the
heap or in the primary key clustered index [^45].
This allows some queries to be answered by using the index alone, without having to resolve the
primary key or look in the heap file (in which case, the index is said to *cover* the query).
This can make some queries faster, but the duplication of data means the index uses more disk space and slows down
writes.
The indexes discussed so far only map a single key to a value. If you need to query multiple columns
of a table (or multiple fields in a document) simultaneously, see [“Multidimensional and Full-Text Indexes”](/en/ch4#sec_storage_multidimensional).
When updating a value without changing the key, the heap file approach can allow the record to be
overwritten in place, provided that the new value is not larger than the old value. The situation is
more complicated if the new value is larger, as it probably needs to be moved to a new location in
the heap where there is enough space. In that case, either all indexes need to be updated to point
at the new heap location of the record, or a forwarding pointer is left behind in the old heap location [^2].
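The forwarding-pointer variant can be sketched as follows. This `HeapFile` class is a toy model invented for illustration (slots in a Python list rather than pages on disk) and does not reflect how any particular database lays out its heap.

```python
class HeapFile:
    """Toy heap: each slot is (kind, capacity_or_target, value)."""

    def __init__(self):
        self.slots = []

    def insert(self, value: bytes) -> int:
        """Append a record and return its location (slot number)."""
        self.slots.append(("data", len(value), value))
        return len(self.slots) - 1

    def _resolve(self, slot: int) -> int:
        """Follow forwarding pointers until a data slot is reached."""
        while self.slots[slot][0] == "forward":
            slot = self.slots[slot][1]
        return slot

    def read(self, slot: int) -> bytes:
        return self.slots[self._resolve(slot)][2]

    def update(self, slot: int, value: bytes) -> None:
        slot = self._resolve(slot)
        _, capacity, _ = self.slots[slot]
        if len(value) <= capacity:
            # New value fits: overwrite in place, indexes stay valid.
            self.slots[slot] = ("data", capacity, value)
        else:
            # New value is larger: move it and leave a forwarding pointer behind.
            new_slot = self.insert(value)
            self.slots[slot] = ("forward", new_slot, None)


heap = HeapFile()
index = {"user:1": heap.insert(b"alice")}  # index stores the heap location
heap.update(index["user:1"], b"bob")        # fits, overwritten in place
heap.update(index["user:1"], b"alexandra")  # too large, record is moved
print(heap.read(index["user:1"]))
```

The index entry never changes in this sketch; the cost of that convenience is an extra hop on every read of a moved record.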
### Keeping everything in memory {#sec_storage_inmemory}
The data structures discussed so far in this chapter have all been answers to the limitations of
disks. Compared to main memory, disks are awkward to deal with. With both magnetic disks and SSDs,
data on disk needs to be laid out carefully if you want good performance on reads and writes.
However, we tolerate this awkwardness because disks have two significant advantages: they are
durable (their contents are not lost if the power is turned off), and they have a lower cost per
gigabyte than RAM.
As RAM becomes cheaper, the cost-per-gigabyte argument is eroded. Many datasets are simply not that
big, so it’s quite feasible to keep them entirely in memory, potentially distributed across several
machines. This has led to the development of *in-memory databases*.
Some in-memory key-value stores, such as Memcached, are intended for caching use only, where it’s
acceptable for data to be lost if a machine is restarted. But other in-memory databases aim for
durability, which can be achieved with special hardware (such as battery-powered RAM), by writing a
log of changes to disk, by writing periodic snapshots to disk, or by replicating the in-memory state
to other machines.
When an in-memory database is restarted, it needs to reload its state, either from disk or over the
network from a replica (unless special hardware is used). Despite writing to disk, it’s still an
in-memory database, because the disk is merely used as an append-only log for durability, and reads
are served entirely from memory. Writing to disk also has operational advantages: files on disk can
easily be backed up, inspected, and analyzed by external utilities.
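A minimal sketch of this arrangement, assuming a JSON-lines log file and a hypothetical `InMemoryStore` class: every write is appended to the log before the in-memory state is updated, reads never touch the disk, and a restart replays the log.

```python
import json
import os
import tempfile


class InMemoryStore:
    """Key-value store that serves reads from a dict and logs writes to disk."""

    def __init__(self, log_path):
        self.data = {}
        if os.path.exists(log_path):
            # Restart: rebuild the in-memory state by replaying the log.
            with open(log_path, encoding="utf-8") as log:
                for line in log:
                    key, value = json.loads(line)
                    self.data[key] = value
        self.log = open(log_path, "a", encoding="utf-8")

    def put(self, key, value):
        # Append to the log and force it to disk before acknowledging the write.
        self.log.write(json.dumps([key, value]) + "\n")
        self.log.flush()
        os.fsync(self.log.fileno())
        self.data[key] = value

    def get(self, key):
        return self.data.get(key)  # served entirely from memory

    def close(self):
        self.log.close()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "changes.log")
    store = InMemoryStore(path)
    store.put("a", 1)
    store.put("a", 2)
    store.close()

    restarted = InMemoryStore(path)  # simulates a process restart
    recovered = restarted.get("a")
    restarted.close()

print(recovered)
```

A real system would also write periodic snapshots so that the log does not grow without bound and recovery does not have to replay every change ever made.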
Products such as VoltDB, SingleStore, and Oracle TimesTen are in-memory databases with a relational model,
and the vendors claim that they can offer big performance improvements by removing all the overheads
associated with managing on-disk data structures [^46] [^47].
RAMCloud is an open source, in-memory key-value store with durability (using a log-structured
approach for the data in memory as well as the data on disk) [^48].
Redis and Couchbase provide weak durability by writing to disk asynchronously.
Counterintuitively, the performance advantage of in-memory databases is not due to the fact that
they don’t need to read from disk. Even a disk-based storage engine may never need to read from disk
if you have enough memory, because the operating system caches recently used disk blocks in memory
anyway. Rather, they can be faster because they can avoid the overheads of encoding in-memory data
structures in a form that can be written to disk [^49].
Besides performance, another interesting area for in-memory databases is providing data models that
are difficult to implement with disk-based indexes. For example, Redis offers a database-like
interface to various data structures such as priority queues and sets. Because it keeps all data in
memory, its implementation is comparatively simple.
## Data Storage for Analytics {#sec_storage_analytics}
The data model of a data warehouse is most commonly relational, because SQL is generally a good fit
for analytic queries. There are many graphical data analysis tools that generate SQL queries,
visualize the results, and allow analysts to explore the data (through operations such as
*drill-down* and *slicing and dicing*).
On the surface, a data warehouse and a relational OLTP database look similar, because they both have
a SQL query interface. However, the internals of the systems can look quite different, because they
are optimized for very different query patterns. Many database vendors now focus on supporting
either transaction processing or analytics workloads, but not both.
Some databases, such as Microsoft SQL Server, SAP HANA, and SingleStore, have support for
transaction processing and data warehousing in the same product. However, these hybrid transactional
and analytical processing (HTAP) databases (introduced in [“Data Warehousing”](/en/ch1#sec_introduction_dwh)) are increasingly
becoming two separate storage and query engines, which happen to be accessible through a common SQL
interface [^50] [^51] [^52] [^53].
### Cloud Data Warehouses {#sec_cloud_data_warehouses}
Data warehouse vendors such as Teradata, Vertica, and SAP HANA sell both on-premises warehouses
under commercial licenses and cloud-based solutions. But as many of their customers move to the
cloud, new cloud data warehouses such as Google Cloud BigQuery, Amazon Redshift, and Snowflake have
also become widely adopted. Unlike traditional data warehouses, cloud data warehouses take advantage
of scalable cloud infrastructure like object storage and serverless computation platforms.
Cloud data warehouses tend to integrate better with other cloud services and to be more elastic.
For example, many cloud warehouses support automatic log ingestion, and offer easy integration with
data processing frameworks such as Google Cloud’s Dataflow or Amazon Web Services’ Kinesis. These
warehouses are also more elastic because they decouple query computation from the storage layer [^54].
Data is persisted on object storage rather than local disks, which makes it easy to adjust storage
capacity and compute resources for queries independently, as we previously saw in
[“Cloud-Native System Architecture”](/en/ch1#sec_introduction_cloud_native).
Open source data warehouses such as Apache Hive, Trino, and Apache Spark have also evolved with the
cloud. As data storage for analytics has moved to data lakes on object storage, open source warehouses
have begun to break apart [^55]. The following
components, which were previously integrated in a single system such as Apache Hive, are now often
implemented as separate components:
Query engine
: Query engines such as Trino, Apache DataFusion, and Presto parse SQL queries, optimize them into
execution plans, and execute them against the data. Execution usually requires parallel,
distributed data processing tasks. Some query engines provide built-in task execution, while
others choose to use third party execution frameworks such as Apache Spark or Apache Flink.
Storage format
: The storage format determines how the rows of a table are encoded as bytes in a file, which is
then typically stored in object storage or a distributed filesystem [^12].
This data can then be accessed by the query engine, but also by other applications using the data
lake. Examples of such storage formats are Parquet, ORC, Lance, or Nimble, and we will see more
about them in the next section.
Table format
: Files written in Apache Parquet and similar storage formats are typically immutable once written.
To support row inserts and deletions, a table format such as Apache Iceberg or Databricks’s Delta
format is used. Table formats specify a file format that defines which files constitute a table
along with the table’s schema. Such formats also offer advanced features such as time travel (the
ability to query a table as it was at a previous point in time), garbage collection, and even
transactions.
Data catalog
: Much like a table format defines which files make up a table, a data catalog defines which tables
comprise a database. Catalogs are used to create, rename, and drop tables. Unlike storage and table
formats, data catalogs such as Snowflake’s Polaris and Databricks’s Unity Catalog usually run as a
standalone service that can be queried using a REST interface. Apache Iceberg also offers a
catalog, which can be run inside a client or as a separate process. Query engines use catalog
information when reading and writing tables. Traditionally, catalogs and query engines have been
integrated, but decoupling them has enabled data discovery and data governance systems
(discussed in [“Data Systems, Law, and Society”](/en/ch1#sec_introduction_compliance)) to access a catalog’s metadata as well.
### Column-Oriented Storage {#sec_storage_column}
As discussed in [“Stars and Snowflakes: Schemas for Analytics”](/en/ch3#sec_datamodels_analytics), data warehouses by convention often use a relational
schema with a big fact table that contains foreign key references into dimension tables.
If you have trillions of rows and petabytes of data in your fact tables, storing and querying them
efficiently becomes a challenging problem. Dimension tables are usually much smaller (millions of
rows), so in this section we will focus on storage of facts.
Although fact tables are often over 100 columns wide, a typical data warehouse query only accesses 4
or 5 of them at one time (`"SELECT *"` queries are rarely needed for analytics) [^52]. Take the query in
[Example 4-1](/en/ch4#fig_storage_analytics_query): it accesses a large number of rows (every occurrence of someone
buying fruit or candy during the 2024 calendar year), but it only needs to access three columns of
the `fact_sales` table: `date_key`, `product_sk`,
and `quantity`. The query ignores all other columns.
{{< figure id="fig_storage_analytics_query" title="Example 4-1. Analyzing whether people are more inclined to buy fresh fruit or candy, depending on the day of the week" class="w-full my-4" >}}
```sql
SELECT
dim_date.weekday, dim_product.category,
SUM(fact_sales.quantity) AS quantity_sold
FROM fact_sales
JOIN dim_date ON fact_sales.date_key = dim_date.date_key
JOIN dim_product ON fact_sales.product_sk = dim_product.product_sk
WHERE
dim_date.year = 2024 AND
dim_product.category IN ('Fresh fruit', 'Candy')
GROUP BY
dim_date.weekday, dim_product.category;
```
How can we execute this query efficiently?
In most OLTP databases, storage is laid out in a *row-oriented* fashion: all the values from one row
of a table are stored next to each other. Document databases are similar: an entire document is
typically stored as one contiguous sequence of bytes. You can see this in the CSV example of [Figure 4-1](/en/ch4#fig_storage_csv_hash_index).
In order to process a query like [Example 4-1](/en/ch4#fig_storage_analytics_query), you may have indexes on
`fact_sales.date_key` and/or `fact_sales.product_sk` that tell the storage engine where to find
all the sales for a particular date or for a particular product. But then, a row-oriented storage
engine still needs to load all of those rows (each consisting of over 100 attributes) from disk into
memory, parse them, and filter out those that don’t meet the required conditions. That can take a
long time.
The idea behind *column-oriented* (or *columnar*) storage is simple: don’t store all the values from
one row together, but store all the values from each *column* together instead [^56].
If each column is stored separately, a query only needs to read and parse those columns that are
used in that query, which can save a lot of work. [Figure 4-7](/en/ch4#fig_column_store) shows this principle using
an expanded version of the fact table from [Figure 3-5](/en/ch3#fig_dwh_schema).
--------
> [!NOTE]
> Column storage is easiest to understand in a relational data model, but it applies equally to
> nonrelational data. For example, Parquet [^57] is a columnar storage format that supports a document data model, based on Google’s Dremel [^58],
> using a technique known as *shredding* or *striping* [^59].
--------
{{< figure src="/fig/ddia_0407.png" id="fig_column_store" caption="Figure 4-7. Storing relational data by column, rather than by row." class="w-full my-4" >}}
The column-oriented storage layout relies on each column storing the rows in the same order.
Thus, if you need to reassemble an entire row, you can take the 23rd entry from each of the
individual columns and put them together to form the 23rd row of the table.
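The following sketch shows the same idea in miniature, using a few made-up rows (the values are illustrative and not taken from the figure). A query that sums one column touches only that column's list, while reassembling a row picks the *k*th entry from every column.

```python
# Hypothetical fact table rows.
rows = [
    {"date_key": 20240102, "product_sk": 69, "store_sk": 4, "quantity": 1},
    {"date_key": 20240102, "product_sk": 74, "store_sk": 3, "quantity": 3},
    {"date_key": 20240103, "product_sk": 31, "store_sk": 2, "quantity": 5},
]


def to_columns(rows):
    """Store the values of each column together, preserving row order."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def reassemble_row(columns, k):
    """Rebuild the k-th row by taking the k-th entry of every column."""
    return {name: values[k] for name, values in columns.items()}


columns = to_columns(rows)
total_quantity = sum(columns["quantity"])  # reads a single column only
print(total_quantity)
print(reassemble_row(columns, 1))
```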
In fact, columnar storage engines don’t actually store an entire column (containing perhaps
trillions of rows) in one go. Instead, they break the table into blocks of thousands or millions of
rows, and within each block they store the values from each column separately [^60].
Since many queries are restricted to a particular date range, it is common to make each block
contain the rows for a particular timestamp range. A query then only needs to load the columns it
needs in those blocks that overlap with the required date range.
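One way to picture this is to keep the minimum and maximum date alongside each block, so that a query can skip blocks that cannot contain matching rows. The `Block` class and its fields below are assumptions made for this sketch, not the layout of any specific file format.

```python
from dataclasses import dataclass


@dataclass
class Block:
    min_date: int
    max_date: int
    columns: dict  # column name -> list of values, same row order in every list


def blocks_for_range(blocks, start, end):
    """Keep only blocks whose date range overlaps [start, end]."""
    return [b for b in blocks if b.min_date <= end and b.max_date >= start]


def sum_quantity(blocks, start, end):
    total = 0
    for block in blocks_for_range(blocks, start, end):
        # Only the two columns needed by the query are read from each block.
        dates, quantities = block.columns["date_key"], block.columns["quantity"]
        total += sum(q for d, q in zip(dates, quantities) if start <= d <= end)
    return total


blocks = [
    Block(20240101, 20240131, {"date_key": [20240105, 20240120], "quantity": [3, 1]}),
    Block(20240201, 20240229, {"date_key": [20240203, 20240214], "quantity": [5, 2]}),
    Block(20240301, 20240331, {"date_key": [20240310], "quantity": [7]}),
]

print(len(blocks_for_range(blocks, 20240115, 20240210)))  # March block is skipped
print(sum_quantity(blocks, 20240115, 20240210))
```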
Columnar storage is used in almost all analytic databases nowadays [^60], ranging from large-scale cloud data warehouses such as Snowflake [^61]
to single-node embedded databases such as DuckDB [^62], and product analytics systems such as Pinot [^63] and Druid [^64].
It is used in storage formats such as Parquet, ORC [^65] [^66], Lance [^67], and Nimble [^68], and in-memory analytics formats like Apache Arrow
[^65] [^69] and Pandas/NumPy [^70]. Some time-series databases, such as InfluxDB IOx [^71] and TimescaleDB [^72], are also based on column-oriented storage.
#### Column Compression {#sec_storage_column_compression}
Besides only loading those columns from disk that are required for a query, we can further reduce
the demands on disk throughput and network bandwidth by compressing data. Fortunately,
column-oriented storage often lends itself very well to compression.
Take a look at the sequences of values for each column in [Figure 4-7](/en/ch4#fig_column_store): they often look quite
repetitive, which is a good sign for compression. Depending on the data in the column, different
compression techniques can be used. One technique that is particularly effective in data warehouses
is *bitmap encoding*, illustrated in [Figure 4-8](/en/ch4#fig_bitmap_index).
{{< figure src="/fig/ddia_0408.png" id="fig_bitmap_index" caption="Figure 4-8. Compressed, bitmap-indexed storage of a single column." class="w-full my-4" >}}
Often, the number of distinct values in a column is small compared to the number of rows (for
example, a retailer may have billions of sales transactions, but only 100,000 distinct products).
We can now take a column with *n* distinct values and turn it into *n* separate bitmaps: one bitmap
for each distinct value, with one bit for each row. The bit is 1 if the row has that value, and 0 if
not.
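As a small sketch of this transformation, the function below turns a column into one bitmap per distinct value. The column values are made up, and plain Python lists of 0 and 1 stand in for real packed bitmaps.

```python
def bitmap_encode(column):
    """Return one bitmap per distinct value, with one bit per row."""
    return {
        value: [1 if v == value else 0 for v in column]
        for value in sorted(set(column))
    }


product_sk = [69, 69, 74, 31, 31, 69]  # hypothetical column values
bitmaps = bitmap_encode(product_sk)
for value, bits in bitmaps.items():
    print(value, bits)
```

Every row has exactly one bit set across all of the bitmaps, so the original column can be reconstructed from them.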
One option is to store those bitmaps using one bit per row. However, these bitmaps typically contain
a lot of zeros (we say that they are *sparse*). In that case, the bitmaps can additionally be
run-length encoded: counting the number of consecutive zeros or ones and storing that number, as
shown at the bottom of [Figure 4-8](/en/ch4#fig_bitmap_index). Techniques such as *roaring bitmaps* switch between the
two bitmap representations, using whichever is the most compact [^73].
This can make the encoding of a column remarkably efficient.
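Run-length encoding of a bitmap can be sketched as follows. The convention used here (alternating run lengths, always starting with a run of zeros that may have length 0) is one possible choice made for this example; actual formats differ in the details.

```python
def run_length_encode(bits):
    """Lengths of alternating runs, starting with a (possibly empty) run of zeros."""
    runs = []
    current, length = 0, 0
    for bit in bits:
        if bit == current:
            length += 1
        else:
            runs.append(length)
            current, length = bit, 1
    runs.append(length)
    return runs


def run_length_decode(runs):
    """Inverse of run_length_encode."""
    bits, bit = [], 0
    for length in runs:
        bits.extend([bit] * length)
        bit = 1 - bit
    return bits


sparse = [0, 0, 0, 0, 1, 1, 0, 0, 0]
encoded = run_length_encode(sparse)
print(encoded)  # 4 zeros, 2 ones, 3 zeros
```

The sparser the bitmap, the longer the runs and the fewer numbers are needed to describe it.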
Bitmap indexes such as these are very well suited for the kinds of queries that are common in a data
warehouse. For example:
`WHERE product_sk IN (31, 68, 69)`:
: Load the three bitmaps for `product_sk = 31`, `product_sk = 68`, and `product_sk = 69`, and
calculate the bitwise *OR* of the three bitmaps, which can be done very efficiently.
`WHERE product_sk = 30 AND store_sk = 3`:
: Load the bitmaps for `product_sk = 30` and `store_sk = 3`, and calculate the bitwise *AND*. This
works because the columns contain the rows in the same order, so the *k*th bit in one column’s
bitmap corresponds to the same row as the *k*th bit in another column’s bitmap.
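Both queries above can be sketched with Python integers used as bitmaps, where bit *k* represents row *k*. The column contents are invented for illustration.

```python
def bitmap_for(column, value):
    """Bitmap as an int: bit k is set if row k of the column has the given value."""
    bits = 0
    for k, v in enumerate(column):
        if v == value:
            bits |= 1 << k
    return bits


def matching_rows(bitmap):
    """Row numbers whose bit is set."""
    return [k for k in range(bitmap.bit_length()) if (bitmap >> k) & 1]


# Hypothetical columns; both store the rows in the same order.
product_sk = [69, 31, 30, 68, 30, 31, 74]
store_sk = [4, 3, 3, 5, 2, 3, 3]

# WHERE product_sk IN (31, 68, 69): bitwise OR of three bitmaps.
in_query = (
    bitmap_for(product_sk, 31) | bitmap_for(product_sk, 68) | bitmap_for(product_sk, 69)
)

# WHERE product_sk = 30 AND store_sk = 3: bitwise AND across two columns.
and_query = bitmap_for(product_sk, 30) & bitmap_for(store_sk, 3)

print(matching_rows(in_query))
print(matching_rows(and_query))
```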
Bitmaps can also be used to answer graph queries, such as finding all users of a social network who are followed by user *X* and who also follow user *Y* [^74].
There are also various other compression schemes for columnar databases, which you can find in the references [^75].
--------
> [!NOTE]
> Don’t confuse column-oriented databases with the *wide-column* (also known as *column-family*) data
> model, in which a row can have thousands of columns, and there is no need for all the rows to have the same columns [^9].
> Despite the similarity in name, wide-column databases are row-oriented, since they store all values from a row together.
> Google’s Bigtable, Apache Accumulo, and HBase are examples of the wide-column model.
--------
#### Sort Order in Column Storage {#sort-order-in-column-storage}
In a column store, it doesn’t necessarily matter in which order the rows are stored. It’s easiest to
store them in the order in which they were inserted, since then inserting a new row just means
appending to each of the columns. However, we can choose to impose an order, like we did with
SSTables previously, and use that as an indexing mechanism.
Note that it wouldn’t make sense to sort each column independently, because then we would no longer
know which items in the columns belong to the same row. We can only reconstruct a row because we
know that the *k*th item in one column belongs to the same row as the *k*th item in another
column.
Rather, the data needs to be sorted an entire row at a time, even though it is stored by column.
The administrator of the database can choose the columns by which the table should be sorted, using
their knowledge of common queries. For example, if queries often target date ranges, such as the
last month, it might make sense to make `date_key` the first sort key. Then the query can
scan only the rows from the last month, which will be much faster than scanning all rows.
A second column can determine the sort order of any rows that have the same value in the first
column. For example, if `date_key` is the first sort key in [Figure 4-7](/en/ch4#fig_column_store), it might make
sense for `product_sk` to be the second sort key so that all sales for the same product on the same
day are grouped together in storage. That will help queries that need to group or filter sales by
product within a certain date range.
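The effect of sorting by `date_key` and then `product_sk` can be sketched like this. The rows are made up, and binary search over an in-memory list stands in for whatever mechanism a real engine uses to locate the start of a date range.

```python
import bisect

# Hypothetical rows: (date_key, product_sk, quantity), in insertion order.
rows = [
    (20240203, 31, 2),
    (20240101, 69, 1),
    (20240115, 30, 4),
    (20240115, 29, 3),
    (20240310, 69, 5),
    (20240203, 30, 1),
]

# Sort entire rows by (date_key, product_sk), then store them column by column.
rows.sort(key=lambda r: (r[0], r[1]))
date_key = [r[0] for r in rows]
product_sk = [r[1] for r in rows]
quantity = [r[2] for r in rows]


def row_range_for_dates(date_column, start, end):
    """Positions [lo, hi) of the rows whose date lies in [start, end]."""
    lo = bisect.bisect_left(date_column, start)
    hi = bisect.bisect_right(date_column, end)
    return lo, hi


lo, hi = row_range_for_dates(date_key, 20240110, 20240205)
total = sum(quantity[lo:hi])  # only this slice of the column is scanned
print((lo, hi), total)
```

Because all columns share the same row order, the positions found in `date_key` can be applied directly to `quantity` and `product_sk`.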
Another advantage of sorted order is that it can help with compression of columns. If the primary
sort column does not have many distinct values, then after sorting, it will have long sequences
where the same value is repeated many times in a row. A simple run-length encoding, like we used for
the bitmaps in [Figure 4-8](/en/ch4#fig_bitmap_index), could compress that column down to a few kilobytes—even if
the table has billions of rows.
That compression effect is strongest on the first sort key. The second and third sort keys will be
more jumbled up, and thus not have such long runs of repeated values. Columns further down the
sorting priority appear in essentially random order, so they probably won’t compress as well. But
having the first few columns sorted is still a win overall.
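The difference between sort keys can be seen by counting runs in a tiny example. The rows are invented, and the `(value, run length)` representation is one simple form of run-length encoding chosen for this sketch.

```python
from itertools import groupby


def rle(values):
    """Run-length encode a column as (value, run length) pairs."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(values)]


# Hypothetical rows: (date_key, product), in insertion order.
rows = [(2, "b"), (1, "a"), (2, "a"), (1, "b"), (1, "a"), (2, "b"), (1, "c"), (2, "a")]
date_unsorted = [r[0] for r in rows]

sorted_rows = sorted(rows)  # date_key is the first sort key, product the second
date_sorted = [r[0] for r in sorted_rows]
product_sorted = [r[1] for r in sorted_rows]

print(len(rle(date_unsorted)))   # many short runs
print(rle(date_sorted))          # one run per distinct date
print(len(rle(product_sorted)))  # second sort key: more, shorter runs
```

With only eight rows the savings are modest, but the number of runs in the first sort column stays equal to its number of distinct values no matter how many rows the table has.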
#### Writing to Column-Oriented Storage {#writing-to-column-oriented-storage}
We saw in [“Characterizing Transaction Processing and Analytics”](/en/ch1#sec_introduction_oltp) that reads in data warehouses tend to consist of aggregations
over a large number of rows; column-oriented storage, compression, and sorting all help to make
those read queries faster. Writes in a data warehouse tend to be a bulk import of data, often via an ETL process.
With columnar storage, writing an individual row somewhere in the middle of a sorted table would be
very inefficient, as you would have to rewrite all the compressed columns from the insertion
position onwards. However, a bulk write of many rows at once amortizes the cost of rewriting those
columns, making it efficient.
A log-structured approach is often used to perform writes in batches. All writes first go to a
row-oriented, sorted, in-memory store. When enough writes have accumulated, they are merged with the
column-encoded files on disk and written to new files in bulk. As old files remain immutable, and
new files are written in one go, object storage is well suited for storing these files.
Queries need to examine both the column data on disk and the recent writes in memory, and combine
the two. The query execution engine hides this distinction from the user. From an analyst’s point
of view, data that has been modified with inserts, updates, or deletes is immediately reflected in
subsequent queries. Snowflake, Vertica, Apache Pinot, Apache Druid, and many others do this [^61] [^63] [^64] [^76].
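A heavily simplified sketch of this write path is shown below. The `ColumnStore` class, its flush threshold, and the use of tuples as immutable "files" are assumptions for illustration; in particular, the sketch writes each batch as a new file and omits the step of merging with existing files.

```python
class ColumnStore:
    def __init__(self, flush_threshold=3):
        self.buffer = []  # recent writes, row-oriented, in memory
        self.files = []   # immutable column-encoded files (dicts of tuples)
        self.flush_threshold = flush_threshold

    def insert(self, row):
        self.buffer.append(row)
        if len(self.buffer) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Write the buffered rows out in bulk as one new sorted, columnar file."""
        if not self.buffer:
            return
        rows = sorted(self.buffer, key=lambda r: r["date_key"])
        self.files.append({name: tuple(r[name] for r in rows) for name in rows[0]})
        self.buffer = []

    def total_quantity(self):
        """Queries combine the column data in files with the recent writes in memory."""
        in_files = sum(sum(f["quantity"]) for f in self.files)
        in_memory = sum(r["quantity"] for r in self.buffer)
        return in_files + in_memory


store = ColumnStore(flush_threshold=3)
for day, qty in [(20240103, 1), (20240101, 2), (20240102, 3), (20240104, 4)]:
    store.insert({"date_key": day, "quantity": qty})

print(len(store.files), len(store.buffer))  # one flushed file, one buffered row
print(store.total_quantity())               # includes the row not yet flushed
```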
### Query Execution: Compilation and Vectorization {#sec_storage_vectorized}
A complex SQL query for analytics is broken down into a *query plan* consisting of multiple stages,
called *operators*, which may be distributed across multiple machines for parallel execution. Query
planners can perform a lot of optimizations by choosing which operators to use, in which order to
perform them, and where to run each operator.
Within each operator, the query engine needs to do various things with the values in a column, such
as finding all the rows where the value is among a particular set of values (perhaps as part of a
join), or checking whether the value is greater than 15. It also needs to look at several columns
for the same row, for example to find all sales transactions where the product is bananas and the
store is a particular store of interest.
For data warehouse queries that need to scan over millions of rows, we need to worry not only about
the amount of data they need to read off disk, but also the CPU time required to execute complex
operators. The simplest kind of operator is like an interpreter for a programming language: while
iterating over each row, it checks a data structure representing the query to find out which
comparisons or calculations it needs to perform on which columns. Unfortunately, this is too slow
for many analytics purposes. Two alternative approaches for efficient query execution have emerged [^77]:
Query compilation
: The query engine takes the SQL query and generates code for executing it. The code iterates over
the rows one by one, looks at the values in the columns of interest, performs whatever comparisons
or calculations are needed, and copies the necessary values to an output buffer if the required
conditions are satisfied. The query engine compiles the generated code to machine code (often
using an existing compiler such as LLVM), and then runs it on the column-encoded data that has
been loaded into memory. This approach to code generation is similar to the just-in-time (JIT)
compilation approach that is used in the Java Virtual Machine (JVM) and similar runtimes.
Vectorized processing
: The query is interpreted, not compiled, but it is made fast by processing many values from a
column in a batch, instead of iterating over rows one by one. A fixed set of predefined operators
are built into the database; we can pass arguments to them and get back a batch of results [^50] [^75].
For example, we could pass the `product_sk` column and the ID of “bananas” to an equality operator,
and get back a bitmap (one bit per value in the input column, which is 1 if it’s a banana); we could
then pass the `store_sk` column and the ID of the store of interest to the same equality operator,
and get back another bitmap; and then we could pass the two bitmaps to a “bitwise AND” operator, as
shown in [Figure 4-9](/en/ch4#fig_bitmap_and). The result would be a bitmap containing a 1 for all sales of bananas in
a particular store.
{{< figure src="/fig/ddia_0409.png" id="fig_bitmap_and" caption="Figure 4-9. A bitwise AND between two bitmaps lends itself to vectorization." class="w-full my-4" >}}
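The bananas-in-a-store example can be sketched in a few lines of Python. This is only an illustration of the idea, not how any particular engine implements it: the column values are made up, the function names `equals`, `bitwise_and`, and `matching_rows` are hypothetical, and a Python integer stands in for the packed bitmap that a real engine would process with SIMD instructions.

```python
def equals(column, value):
    """Equality operator over a whole column: bit i of the result is 1 if column[i] == value."""
    bitmap = 0
    for i, v in enumerate(column):
        if v == value:
            bitmap |= 1 << i
    return bitmap

def bitwise_and(a, b):
    """Combine two bitmaps; a bit is 1 only if it is 1 in both inputs."""
    return a & b

def matching_rows(bitmap):
    """Translate a bitmap back into the list of row numbers whose bit is set."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

# Made-up column data; assume 69 is the ID of bananas and 4 is the store of interest.
product_sk = [69, 69, 74, 31, 69, 31]
store_sk = [4, 5, 4, 4, 4, 2]

bananas = equals(product_sk, 69)    # bits set for rows 0, 1, 4
in_store = equals(store_sk, 4)      # bits set for rows 0, 2, 3, 4
result = matching_rows(bitwise_and(bananas, in_store))   # rows 0 and 4
```

Each operator is called once per column rather than once per row, so the per-row interpretation overhead is paid only once per batch.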
The two approaches are very different in terms of their implementation, but both are used in practice [^77]. Both can achieve very good
performance by taking advantage of the characteristics of modern CPUs:
* preferring sequential memory access over random access to reduce cache misses [^78],
* doing most of the work in tight inner loops (that is, with a small number of instructions and no
function calls) to keep the CPU instruction processing pipeline busy and avoid branch
mispredictions,
* making use of parallelism such as multiple threads and single-instruction-multiple-data (SIMD) instructions [^79] [^80], and
* operating directly on compressed data without decoding it into a separate in-memory
representation, which saves memory allocation and copying costs.
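To make the query compilation approach described above a little more concrete, here is a minimal sketch that generates source code for a filter and compiles it before running it. It makes several simplifying assumptions: the function name `compile_filter` and the sample rows are hypothetical, rows are Python dictionaries rather than column-encoded data, and Python's built-in `compile` produces bytecode rather than machine code, so this shows only the structure of the technique and not its performance benefit.

```python
def compile_filter(conditions):
    """Generate and compile a scan function for a list of (column, operator, value) conditions."""
    allowed_ops = {"==", "!=", "<", "<=", ">", ">="}
    clauses = []
    for column, op, value in conditions:
        if op not in allowed_ops:
            raise ValueError(f"unsupported operator: {op}")
        # repr() turns the column name and constant into Python literals in the generated code
        clauses.append(f"row[{column!r}] {op} {value!r}")
    predicate = " and ".join(clauses) or "True"

    # The comparisons are baked into the loop body, so nothing about the
    # query has to be looked up while iterating over the rows.
    source = (
        "def scan(rows):\n"
        "    out = []\n"
        "    for row in rows:\n"
        f"        if {predicate}:\n"
        "            out.append(row)\n"
        "    return out\n"
    )
    namespace = {}
    exec(compile(source, "<generated query>", "exec"), namespace)
    return namespace["scan"]

# Made-up rows for illustration
rows = [
    {"product_sk": 69, "quantity": 20},
    {"product_sk": 69, "quantity": 3},
    {"product_sk": 31, "quantity": 40},
]
scan = compile_filter([("product_sk", "==", 69), ("quantity", ">", 15)])
matches = scan(rows)   # only the first row satisfies both conditions
```

A real query compiler would typically emit an intermediate representation for a compiler such as LLVM instead of Python source, but the shape is the same: the query is turned into a specialized loop once, and then that loop runs over the data.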
### Materialized Views and Data Cubes {#sec_storage_materialized_views}
We previously encountered *materialized views* in [“Materializing and Updating Timelines”](/en/ch2#sec_introduction_materializing):
in a relational data model, they are table-like objects whose contents are the results of some
query, much like standard (virtual) views. The difference is that a materialized view is an actual copy of the query results, written to
disk, whereas a virtual view is just a shortcut for writing queries. When you read from a virtual
view, the SQL engine expands it into the view’s underlying query on the fly and then processes the
expanded query.
When the underlying data changes, a materialized view needs to be updated accordingly.
Some databases can do that automatically, and there are also systems such as Materialize that specialize in materialized view maintenance [^81].
Performing such updates means more work on writes, but materialized views can improve read
performance in workloads that repeatedly need to perform the same queries.
*Materialized aggregates* are a type of materialized view that can be useful in data warehouses. As
discussed earlier, data warehouse queries often involve an aggregate function, such as `COUNT`, `SUM`,
`AVG`, `MIN`, or `MAX` in SQL. If the same aggregates are used by many different queries, it can be
wasteful to crunch through the raw data every time. Why not cache some of the counts or sums that
queries use most often? A *data cube* or *OLAP cube* does this by creating a grid of aggregates grouped by different dimensions [^82].
[Figure 4-10](/en/ch4#fig_data_cube) shows an example.
{{< figure src="/fig/ddia_0410.png" id="fig_data_cube" caption="Figure 4-10. Two dimensions of a data cube, aggregating data by summing." class="w-full my-4" >}}
Imagine for now that each fact has foreign keys to only two dimension tables—in [Figure 4-10](/en/ch4#fig_data_cube),
these are `date_key` and `product_sk`. You can now draw a two-dimensional table, with
dates along one axis and products along the other. Each cell contains the aggregate (e.g., `SUM`) of
an attribute (e.g., `net_price`) of all facts with that date-product combination. Then you can apply
the same aggregate along each row or column and get a summary that has been reduced by one
dimension (the sales by product regardless of date, or the sales by date regardless of product).
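The two-dimensional case can be sketched as follows. The fact rows and their values are invented for illustration, and a dictionary keyed by `(date_key, product_sk)` stands in for the grid of cells; a real data warehouse would store the cube quite differently.

```python
from collections import defaultdict

# Made-up facts: (date_key, product_sk, net_price)
facts = [
    (140101, 31, 2.50),
    (140101, 69, 1.00),
    (140101, 69, 3.00),
    (140102, 31, 4.00),
    (140102, 74, 5.50),
]

# Each cell holds SUM(net_price) for one date-product combination.
cube = defaultdict(float)
for date_key, product_sk, net_price in facts:
    cube[(date_key, product_sk)] += net_price

# Summing along one axis removes that dimension from the summary.
sales_by_date = defaultdict(float)
sales_by_product = defaultdict(float)
for (date_key, product_sk), total in cube.items():
    sales_by_date[date_key] += total        # regardless of product
    sales_by_product[product_sk] += total   # regardless of date
```

Once the cube exists, a query such as "total sales on 140101" is a single lookup in `sales_by_date` rather than a scan over all the facts.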
In general, facts often have more than two dimensions. In [Figure 3-5](/en/ch3#fig_dwh_schema) there are five
dimensions: date, product, store, promotion, and customer. It’s a lot harder to imagine what a
five-dimensional hypercube would look like, but the principle remains the same: each cell contains
the sales for a particular date-product-store-promotion-customer combination. These values can then
repeatedly be summarized along each of the dimensions.
The advantage of a materialized data cube is that certain queries become very fast because they
have effectively been precomputed. For example, if you want to know the total sales per store
yesterday, you just need to look at the totals along the appropriate dimension—no need to scan
millions of rows.
The disadvantage is that a data cube doesn’t have the same flexibility as querying the raw data. For example,
there is no way of calculating which proportion of sales comes from items that cost more than $100,
because the price isn’t one of the dimensions. Most data warehouses therefore try to keep as much
raw data as possible, and use aggregates such as data cubes only as a performance boost for certain queries.
## Multidimensional and Full-Text Indexes {#sec_storage_multidimensional}
The B-trees and LSM-trees we saw in the first half of this chapter allow range queries over a single
attribute: for example, if the key is a username, you can use them as an index to efficiently find
all names starting with an L. But sometimes, searching by a single attribute is not enough.
The most common type of multi-column index is called a *concatenated index*, which simply combines
several fields into one key by appending one column to another (the index definition specifies in
which order the fields are concatenated). This is like an old-fashioned paper phone book, which
provides an index from (*lastname*, *firstname*) to phone number. Due to the sort order, the index
can be used to find all the people with a particular last name, or all the people with a particular
*lastname-firstname* combination. However, the index is useless if you want to find all the people
with a particular first name.
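The phone book behavior can be reproduced with a sorted list and binary search. This is a sketch under simple assumptions: the names and numbers are invented, and a sorted Python list searched with `bisect` stands in for a B-tree over the concatenated key.

```python
import bisect

# Entries sorted by the concatenated key (lastname, firstname); data is made up.
entries = sorted([
    ("Hopper", "Grace", "555-0101"),
    ("Lovelace", "Ada", "555-0100"),
    ("Lovelace", "Grace", "555-0102"),
    ("Turing", "Alan", "555-0103"),
])
keys = [(last, first) for last, first, _ in entries]

def find_by_lastname(lastname):
    """Binary search to the first matching key, then read adjacent entries."""
    i = bisect.bisect_left(keys, (lastname,))
    matches = []
    while i < len(keys) and keys[i][0] == lastname:
        matches.append(entries[i])
        i += 1
    return matches

def find_by_firstname(firstname):
    """The sort order does not help here, so every entry must be examined."""
    return [entry for entry in entries if entry[1] == firstname]
```

All the Lovelaces sit next to each other in the sort order, so `find_by_lastname` touches only the matching entries, whereas the people called Grace are scattered throughout the index.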
On the other hand, *multi-dimensional indexes* allow you to query several columns at once.
One case where this is particularly important is geospatial data. For example, a restaurant-search
website may have a database containing the latitude and longitude of each restaurant. When a user is
looking at the restaurants on a map, the website needs to search for all the restaurants within the
rectangular map area that the user is currently viewing. This requires a two-dimensional range query
like the following:
```sql
SELECT * FROM restaurants WHERE latitude > 51.4946 AND latitude < 51.5079
AND longitude > -0.1162 AND longitude < -0.1004;
```
A concatenated index over the latitude and longitude columns is not able to answer that kind of
query efficiently: it can give you either all the restaurants in a range of latitudes (but at any
longitude), or all the restaurants in a range of longitudes (but anywhere between the North and
South poles), but not both simultaneously.
One option is to translate a two-dimensional location into a single number using a space-filling
curve, and then to use a regular B-tree index [^83]. More commonly, specialized spatial indexes such as R-trees or Bkd-trees [^84]
are used; they divide up the space so that nearby data points tend to be grouped in the same subtree. For example, PostGIS implements geospatial indexes as R-trees using PostgreSQL’s
Generalized Search Tree indexing facility [^85]. It is also possible to use regularly spaced grids of triangles, squares, or hexagons [^86].
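As an illustration of the space-filling curve option, the following sketch computes a Z-order (Morton) code by interleaving the bits of the two coordinates. The Z-order curve is just one possible choice of curve, picked here because it is easy to show; the function names and the 16 bits of precision per coordinate are assumptions made for the example.

```python
def interleave_bits(x, y, bits=16):
    """Z-order (Morton) code: bits of x go to even positions, bits of y to odd positions."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code

def quantize(value, lo, hi, bits=16):
    """Map a number in the range [lo, hi] to an integer in [0, 2**bits - 1]."""
    scale = (1 << bits) - 1
    return round((value - lo) / (hi - lo) * scale)

def z_order_key(latitude, longitude, bits=16):
    """Turn a location into a single integer that can be used as a B-tree key."""
    x = quantize(longitude, -180.0, 180.0, bits)
    y = quantize(latitude, -90.0, 90.0, bits)
    return interleave_bits(x, y, bits)

# A location inside the map rectangle from the query above
key = z_order_key(51.5, -0.11)
```

Points that are close together on the map tend to get nearby keys, although not always, so a rectangular search typically becomes one or more key ranges whose results are then filtered by the exact coordinates.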
Multi-dimensional indexes are not just for geographic locations. For example, on an ecommerce
website you could use a three-dimensional index on the dimensions (*red*, *green*, *blue*) to search
for products in a certain range of colors, or in a database of weather observations you could have a
two-dimensional index on (*date*, *temperature*) in order to efficiently search for all the
observations during the year 2013 where the temperature was between 25 and 30℃. With a
one-dimensional index, you would have to either scan over all the records from 2013 (regardless of
temperature) and then filter them by temperature, or vice versa. A 2D index could narrow down by
timestamp and temperature simultaneously [^87].
### Full-Text Search {#sec_storage_full_text}
Full-text search allows you to search a collection of text documents (web pages, product
descriptions, etc.) by keywords that might appear anywhere in the text [^88].
Information retrieval is a big, specialist topic that often involves language-specific processing:
for example, several Asian languages are written without spaces or punctuation between words, and
therefore splitting text into words requires a model that indicates which character sequences
constitute a word. Full-text search also often involves matching words that are similar but not
identical (such as typos or different grammatical forms of words) and synonyms. Those problems go
beyond the scope of this book.
However, at its core, you can think of full-text search as another kind of multidimensional query:
in this case, each word that might appear in a text (a *term*) is a dimension. A document that
contains term *x* has a value of 1 in dimension *x*, and a document that doesn’t contain *x* has a
value of 0. Searching for documents mentioning “red apples” means a query that looks for a 1 in the
*red* dimension, and simultaneously a 1 in the *apples* dimension. The number of dimensions may thus be very large.
The data structure that many search engines use to answer such queries is called an *inverted
index*. This is a key-value structure where the key is a term, and the value is the list of IDs of
all the documents that contain the term (the *postings list*). If the document IDs are sequential
numbers, the postings list can also be represented as a sparse bitmap, like in [Figure 4-8](/en/ch4#fig_bitmap_index):
the *n*th bit in the bitmap for term *x* is a 1 if the document with ID *n* contains the term *x* [^89].
Finding all the documents that contain both terms *x* and *y* is now similar to a vectorized data
warehouse query that searches for rows matching two conditions ([Figure 4-9](/en/ch4#fig_bitmap_and)): load the two
bitmaps for terms *x* and *y* and compute their bitwise AND. Even if the bitmaps are run-length
encoded, this can be done very efficiently.
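A toy version of such an inverted index might look like this. The documents are made up, splitting on whitespace is a stand-in for real language-specific tokenization, and an uncompressed Python integer is used as the bitmap for each term; the function names are hypothetical.

```python
from collections import defaultdict

# Made-up documents with sequential IDs
documents = {
    0: "red apples and green pears",
    1: "green apples",
    2: "red wine",
    3: "apples are red or green",
}

def build_inverted_index(docs):
    """Map each term to a bitmap in which bit n is 1 if document n contains the term."""
    index = defaultdict(int)
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index[term] |= 1 << doc_id
    return index

def search_all(index, terms):
    """Return the IDs of documents that contain every one of the given terms."""
    bitmaps = [index.get(term, 0) for term in terms]
    if not bitmaps:
        return []
    result = bitmaps[0]
    for bitmap in bitmaps[1:]:
        result &= bitmap   # bitwise AND keeps only documents matching all terms so far
    return [doc_id for doc_id in range(result.bit_length()) if (result >> doc_id) & 1]

index = build_inverted_index(documents)
hits = search_all(index, ["red", "apples"])   # documents 0 and 3
```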
For example, Lucene, the full-text indexing engine used by Elasticsearch and Solr, works like this [^90].
It stores the mapping from term to postings list in SSTable-like sorted files, which are merged in
the background using the same log-structured approach we saw earlier in this chapter [^91].
PostgreSQL’s GIN index type also uses postings lists to support full-text search and indexing inside
JSON documents [^92] [^93].
Instead of breaking text into words, an alternative is to find all the substrings of length *n*,
which are called *n*-grams. For example, the trigrams (*n* = 3) of the string
`"hello"` are `"hel"`, `"ell"`, and `"llo"`. If we build an inverted index of all trigrams, we can
search the documents for arbitrary substrings that are at least three characters long. Trigram
indexes even allow regular expressions in search queries; the downside is that they are quite large [^94].
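The following sketch shows how a trigram index can answer substring queries. It is a simplified illustration with made-up documents and hypothetical function names: the index narrows the search down to candidate documents that contain every trigram of the query, and each candidate is then checked against the actual text, because a document can contain all the trigrams without containing them next to each other.

```python
from collections import defaultdict

def ngrams(text, n=3):
    """All substrings of length n, in order of appearance."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def build_trigram_index(docs):
    """Map each trigram to the set of IDs of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in ngrams(text):
            index[gram].add(doc_id)
    return index

def substring_search(index, docs, query):
    """Find documents containing the query string, which must have at least 3 characters."""
    grams = ngrams(query)
    if not grams:
        raise ValueError("query must be at least three characters long")
    # Candidates must contain every trigram of the query...
    candidates = set.intersection(*(index.get(gram, set()) for gram in grams))
    # ...but that is only a necessary condition, so verify against the text.
    return sorted(doc_id for doc_id in candidates if query in docs[doc_id])

docs = {0: "hello world", 1: "shell ello", 2: "yellow"}
index = build_trigram_index(docs)
found = substring_search(index, docs, "hello")
```

Here document 1 contains all three trigrams of `"hello"` but not the string itself, so it is a candidate that the final check removes.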
To cope with typos in documents or queries, Lucene is able to search text for words within a certain
edit distance (an edit distance of 1 means that one letter has been added, removed, or replaced) [^95].
It does this by storing the set of terms as a finite state automaton over the characters in the keys, similar to a *trie* [^96],
and transforming it into a *Levenshtein automaton*, which supports efficient search for words within a given edit distance [^97].
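To make the notion of edit distance concrete, here is the textbook dynamic-programming computation of the Levenshtein distance between two words. Note that this is not the automaton-based technique that Lucene uses: the sketch compares the query against every term one by one, which is exactly the work that a Levenshtein automaton avoids. The term list and function names are made up for illustration.

```python
def edit_distance(a, b):
    """Minimum number of single-letter additions, removals, or replacements to turn a into b."""
    previous = list(range(len(b) + 1))   # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                      # remove char_a
                current[j - 1] + 1,                   # add char_b
                previous[j - 1] + (char_a != char_b), # replace, or free if the letters match
            ))
        previous = current
    return previous[-1]

def fuzzy_match(terms, query, max_distance=1):
    """Brute-force fuzzy search: keep the terms within max_distance of the query."""
    return [term for term in terms if edit_distance(term, query) <= max_distance]

terms = ["apple", "apples", "ample", "maple", "banana"]
close = fuzzy_match(terms, "apple")
```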
### Vector Embeddings {#id92}
Semantic search goes beyond synonyms and typos to try to understand document concepts
and user intentions. For example, if your help pages contain a page titled “cancelling your
subscription”, users should still be able to find that page when searching for “how to close my
account” or “terminate contract”, which are close in terms of meaning even though they use
completely different words.
To understand a document’s semantics—its meaning—semantic search indexes use embedding models to
translate a document into a vector of floating-point values, called a *vector embedding*. The vector
represents a point in a multi-dimensional space, and each floating-point value represents the document’s
location along one dimension’s axis. Embedding models generate vector embeddings that are near
each other (in this multi-dimensional space) when the embedding’s input documents are semantically
similar.
--------
> [!NOTE]
> We saw the term *vectorized processing* in [“Query Execution: Compilation and Vectorization”](/en/ch4#sec_storage_vectorized).
> Vectors in semantic search have a different meaning. In vectorized processing, the vector refers to
> a batch of values that can be processed with specially optimized code. In embedding models, a vector is a list of
> floating-point numbers that represents a location in multi-dimensional space.
--------
For example, a three-dimensional vector embedding for a Wikipedia page about agriculture might be
`[0.1, 0.22, 0.11]`. A Wikipedia page about vegetables would be quite near, perhaps with an embedding
of `[0.13, 0.19, 0.24]`. A page about star schemas might have an embedding of `[0.82, 0.39, -0.74]`,
comparatively far away. We can tell by looking that the first two vectors are closer to each other than either is to the third.
Embedding models use much larger vectors (often over 1,000 numbers), but the principles are the
same. We don’t try to understand what the individual numbers mean;
they’re simply a way for embedding models to point to a location in an abstract multi-dimensional
space. Search engines use measures such as cosine similarity or Euclidean distance to
quantify how close two vectors are. Cosine similarity is the cosine of the angle between two
vectors (a value nearer 1 means more similar), while Euclidean distance measures the straight-line
distance between two points in space.
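Both measures are short enough to write out directly. The sketch below applies them to the three example embeddings for the pages about agriculture, vegetables, and star schemas; the function and variable names are our own, and real systems would use optimized numerical libraries rather than plain Python loops.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two points; smaller means closer."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; larger means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

agriculture = [0.1, 0.22, 0.11]
vegetables = [0.13, 0.19, 0.24]
star_schemas = [0.82, 0.39, -0.74]

near = euclidean_distance(agriculture, vegetables)
far = euclidean_distance(agriculture, star_schemas)
similar = cosine_similarity(agriculture, vegetables)
dissimilar = cosine_similarity(agriculture, star_schemas)
```

Both measures agree with the intuition for these three pages: `near` is much smaller than `far`, and `similar` is much larger than `dissimilar`. Note that the two measures point in opposite directions, since a small distance and a large similarity both indicate closeness.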
Many early embedding models such as Word2Vec [^98], BERT [^99], and GPT [^100]
worked with text data. Such models are usually implemented as neural networks. Researchers went on to
create embedding models for video, audio, and images as well. More recently, model
architecture has become *multimodal*: a single model can generate vector embeddings for multiple
modalities such as text and images.
Semantic search engines use an embedding model to generate a vector embedding when a user enters a
query. The user’s query and related context (such as a user’s location) are fed into the embedding
model. After the embedding model generates the query’s vector embedding, the search engine must find
documents with similar vector embeddings using a vector index.
Vector indexes store the vector embeddings of a collection of documents. To query the index, you
pass in the vector embedding of the query, and the index returns the documents whose vectors are
closest to the query vector. Since the R-trees we saw previously don’t work well for vectors with
many dimensions, specialized vector indexes are used, such as:
Flat indexes
: Vectors are stored in the index as they are. A query must read every vector and measure its
distance to the query vector. Flat indexes are accurate, but measuring the distance between the
query and each vector is slow.
Inverted file (IVF) indexes
: The vector space is clustered into partitions, each represented by its center point (its *centroid*), to reduce the number
of vectors that must be compared. IVF indexes are faster than flat indexes, but can give only
approximate results: the query and a document may fall into different partitions, even though they
are close to each other. A query on an IVF index first defines *probes*, which are simply the number
of partitions to check. Queries that use more probes will be more accurate, but will be slower, as
more vectors must be compared.
Hierarchical Navigable Small World (HNSW)
: HNSW indexes maintain multiple layers of the vector space, as illustrated in [Figure 4-11](/en/ch4#fig_vector_hnsw).
Each layer is represented as a graph, where nodes represent vectors, and edges represent proximity
to nearby vectors. A query starts by locating the nearest vector in the topmost layer, which has a
small number of nodes. The query then moves to the same node in the layer below and follows the
edges in that layer, which is more densely connected, looking for a vector that is closer to the
query vector. The process continues until the last layer is reached. As with IVF indexes, HNSW
indexes are approximate.
{{< figure src="/fig/ddia_0411.png" id="fig_vector_hnsw" caption="Figure 4-11. Searching for the database entry that is closest to a given query vector in an HNSW index." class="w-full my-4" >}}
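The trade-off between flat and IVF indexes can be seen in a small two-dimensional sketch. Everything here is an assumption made for illustration: the vectors and the two centroids are hand-picked (a real IVF index would usually learn its centroids with a clustering algorithm such as k-means), the function names are hypothetical, and Euclidean distance is computed with Python's `math.dist`.

```python
import math

def flat_search(vectors, query):
    """Flat index: compare the query with every stored vector (exact but slow)."""
    return min(vectors, key=lambda doc_id: math.dist(vectors[doc_id], query))

def build_ivf(vectors, centroids):
    """Assign every vector to the partition of its nearest centroid."""
    partitions = [[] for _ in centroids]
    for doc_id, vector in vectors.items():
        nearest = min(range(len(centroids)), key=lambda c: math.dist(centroids[c], vector))
        partitions[nearest].append(doc_id)
    return partitions

def ivf_search(vectors, centroids, partitions, query, probes=1):
    """Search only the `probes` partitions whose centroids are nearest to the query."""
    ranked = sorted(range(len(centroids)), key=lambda c: math.dist(centroids[c], query))
    candidates = [doc_id for c in ranked[:probes] for doc_id in partitions[c]]
    if not candidates:
        return None
    return min(candidates, key=lambda doc_id: math.dist(vectors[doc_id], query))

# Hand-picked data: A and D lie close together but on opposite sides of the partition boundary.
vectors = {"A": (4.9, 0.0), "B": (1.0, 3.0), "C": (9.0, 1.0), "D": (5.2, 0.0)}
centroids = [(0.0, 0.0), (10.0, 0.0)]
partitions = build_ivf(vectors, centroids)

query = (5.02, 0.0)
exact = flat_search(vectors, query)                                   # "A"
one_probe = ivf_search(vectors, centroids, partitions, query, 1)      # "D": A's partition is skipped
two_probes = ivf_search(vectors, centroids, partitions, query, 2)     # "A"
```

With one probe the IVF search compares the query with only two of the four vectors, but misses the true nearest neighbor because it falls just across the partition boundary; with two probes it finds the right answer at the cost of comparing all four.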
Many popular vector databases implement IVF and HNSW indexes. Facebook’s Faiss library has many variations of each [^101], and PostgreSQL’s pgvector supports both as well [^102].
The full details of the IVF and HNSW algorithms are beyond the scope of this book, but their papers are an excellent resource [^103] [^104].
## Summary {#summary}
In this chapter we tried to get to the bottom of how databases perform storage and retrieval. What
happens when you store data in a database, and what does the database do when you query for the
data again later?
[“Analytical versus Operational Systems”](/en/ch1#sec_introduction_analytics) introduced the distinction between transaction processing (OLTP) and
analytics (OLAP). In this chapter we saw that storage engines optimized for OLTP look very different
from those optimized for analytics:
* OLTP systems are optimized for a high volume of requests, each of which reads and writes a small
number of records, and which need fast responses. The records are typically accessed via a primary
key or a secondary index, and these indexes are typically ordered mappings from key to record,
which also support range queries.
* Data warehouses and similar analytic systems are optimized for complex read queries that scan over
a large number of records. They generally use a column-oriented storage layout with compression
that minimizes the amount of data that such a query needs to read off disk, and just-in-time
compilation of queries or vectorization to minimize the amount of CPU time spent processing the
data.
On the OLTP side, we saw storage engines from two main schools of thought:
* The log-structured approach, which only permits appending to files and deleting obsolete files,
but never updates a file that has been written. SSTables, LSM-trees, RocksDB, Cassandra, HBase,
Scylla, Lucene, and others belong to this group. In general, log-structured storage engines tend
to provide high write throughput.
* The update-in-place approach, which treats the disk as a set of fixed-size pages that can be
overwritten. B-trees, the biggest example of this philosophy, are used in all major relational
OLTP databases and also many nonrelational ones. As a rule of thumb, B-trees tend to be better for
reads, providing higher read throughput and lower response times than log-structured storage.
We then looked at indexes that can search for multiple conditions at the same time: multidimensional
indexes such as R-trees that can search for points on a map by latitude and longitude at the same
time, and full-text search indexes that can search for multiple keywords appearing in the same text.
Finally, vector databases are used for semantic search on text documents and other media; they use
vectors with a larger number of dimensions and find similar documents by comparing vector
similarity.
As an application developer, if you’re armed with this knowledge about the internals of storage
engines, you are in a much better position to know which tool is best suited for your particular
application. If you need to adjust a database’s tuning parameters, this understanding allows you to
imagine what effect a higher or a lower value may have.
Although this chapter couldn’t make you an expert in tuning any one particular storage engine, it
has hopefully equipped you with enough vocabulary and ideas that you can make sense of the
documentation for the database of your choice.
### References
[^1]: Nikolay Samokhvalov. [How partial, covering, and multicolumn indexes may slow down UPDATEs in PostgreSQL](https://postgres.ai/blog/20211029-how-partial-and-covering-indexes-affect-update-performance-in-postgresql). *postgres.ai*, October 2021. Archived at [perma.cc/PBK3-F4G9](https://perma.cc/PBK3-F4G9)
[^2]: Goetz Graefe. [Modern B-Tree Techniques](https://w6113.github.io/files/papers/btreesurvey-graefe.pdf). *Foundations and Trends in Databases*, volume 3, issue 4, pages 203–402, August 2011. [doi:10.1561/1900000028](https://doi.org/10.1561/1900000028)
[^3]: Evan Jones. [Why databases use ordered indexes but programming uses hash tables](https://www.evanjones.ca/ordered-vs-unordered-indexes.html). *evanjones.ca*, December 2019. Archived at [perma.cc/NJX8-3ZZD](https://perma.cc/NJX8-3ZZD)
[^4]: Branimir Lambov. [CEP-25: Trie-indexed SSTable format](https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-25%3A%2BTrie-indexed%2BSSTable%2Bformat). *cwiki.apache.org*, November 2022. Archived at [perma.cc/HD7W-PW8U](https://perma.cc/HD7W-PW8U). Linked Google Doc archived at [perma.cc/UL6C-AAAE](https://perma.cc/UL6C-AAAE)
[^5]: Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. *Introduction to Algorithms*, 3rd edition. MIT Press, 2009. ISBN: 978-0-262-53305-8
[^6]: Branimir Lambov. [Trie Memtables in Cassandra](https://www.vldb.org/pvldb/vol15/p3359-lambov.pdf). *Proceedings of the VLDB Endowment*, volume 15, issue 12, pages 3359–3371, August 2022. [doi:10.14778/3554821.3554828](https://doi.org/10.14778/3554821.3554828)
[^7]: Dhruba Borthakur. [The History of RocksDB](https://rocksdb.blogspot.com/2013/11/the-history-of-rocksdb.html). *rocksdb.blogspot.com*, November 2013. Archived at [perma.cc/Z7C5-JPSP](https://perma.cc/Z7C5-JPSP)
[^8]: Matteo Bertozzi. [Apache HBase I/O – HFile](https://blog.cloudera.com/apache-hbase-i-o-hfile/). *blog.cloudera.com*, June 2012. Archived at [perma.cc/U9XH-L2KL](https://perma.cc/U9XH-L2KL)
[^9]: Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber. [Bigtable: A Distributed Storage System for Structured Data](https://research.google/pubs/pub27898/). At *7th USENIX Symposium on Operating System Design and Implementation* (OSDI), November 2006.
[^10]: Patrick O’Neil, Edward Cheng, Dieter Gawlick, and Elizabeth O’Neil. [The Log-Structured Merge-Tree (LSM-Tree)](https://www.cs.umb.edu/~poneil/lsmtree.pdf). *Acta Informatica*, volume 33, issue 4, pages 351–385, June 1996. [doi:10.1007/s002360050048](https://doi.org/10.1007/s002360050048)
[^11]: Mendel Rosenblum and John K. Ousterhout. [The Design and Implementation of a Log-Structured File System](https://research.cs.wisc.edu/areas/os/Qual/papers/lfs.pdf). *ACM Transactions on Computer Systems*, volume 10, issue 1, pages 26–52, February 1992. [doi:10.1145/146941.146943](https://doi.org/10.1145/146941.146943)
[^12]: Michael Armbrust, Tathagata Das, Liwen Sun, Burak Yavuz, Shixiong Zhu, Mukul Murthy, Joseph Torres, Herman van Hovell, Adrian Ionescu, Alicja Łuszczak, Michał Świtakowski, Michał Szafrański, Xiao Li, Takuya Ueshin, Mostafa Mokhtar, Peter Boncz, Ali Ghodsi, Sameer Paranjpye, Pieter Senster, Reynold Xin, and Matei Zaharia. [Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores](https://vldb.org/pvldb/vol13/p3411-armbrust.pdf). *Proceedings of the VLDB Endowment*, volume 13, issue 12, pages 3411–3424, August 2020. [doi:10.14778/3415478.3415560](https://doi.org/10.14778/3415478.3415560)
[^13]: Burton H. Bloom. [Space/Time Trade-offs in Hash Coding with Allowable Errors](https://people.cs.umass.edu/~emery/classes/cmpsci691st/readings/Misc/p422-bloom.pdf). *Communications of the ACM*, volume 13, issue 7, pages 422–426, July 1970. [doi:10.1145/362686.362692](https://doi.org/10.1145/362686.362692)
[^14]: Adam Kirsch and Michael Mitzenmacher. [Less Hashing, Same Performance: Building a Better Bloom Filter](https://www.eecs.harvard.edu/~michaelm/postscripts/tr-02-05.pdf). *Random Structures & Algorithms*, volume 33, issue 2, pages 187–218, September 2008. [doi:10.1002/rsa.20208](https://doi.org/10.1002/rsa.20208)
[^15]: Thomas Hurst. [Bloom Filter Calculator](https://hur.st/bloomfilter/). *hur.st*, September 2023. Archived at [perma.cc/L3AV-6VC2](https://perma.cc/L3AV-6VC2)
[^16]: Chen Luo and Michael J. Carey. [LSM-based storage techniques: a survey](https://arxiv.org/abs/1812.07527). *The VLDB Journal*, volume 29, pages 393–418, July 2019. [doi:10.1007/s00778-019-00555-y](https://doi.org/10.1007/s00778-019-00555-y)
[^17]: Subhadeep Sarkar and Manos Athanassoulis. [Dissecting, Designing, and Optimizing LSM-based Data Stores](https://www.youtube.com/watch?v=hkMkBZn2mGs). Tutorial at *ACM International Conference on Management of Data* (SIGMOD), June 2022. Slides archived at [perma.cc/93B3-E827](https://perma.cc/93B3-E827)
[^18]: Mark Callaghan. [Name that compaction algorithm](https://smalldatum.blogspot.com/2018/08/name-that-compaction-algorithm.html). *smalldatum.blogspot.com*, August 2018. Archived at [perma.cc/CN4M-82DY](https://perma.cc/CN4M-82DY)
[^19]: Prashanth Rao. [Embedded databases (1): The harmony of DuckDB, KùzuDB and LanceDB](https://thedataquarry.com/posts/embedded-db-1/). *thedataquarry.com*, August 2023. Archived at [perma.cc/PA28-2R35](https://perma.cc/PA28-2R35)
[^20]: Hacker News discussion. [Bluesky migrates to single-tenant SQLite](https://news.ycombinator.com/item?id=38171322). *news.ycombinator.com*, October 2023. Archived at [perma.cc/69LM-5P6X](https://perma.cc/69LM-5P6X)
[^21]: Rudolf Bayer and Edward M. McCreight. [Organization and Maintenance of Large Ordered Indices](https://dl.acm.org/doi/pdf/10.1145/1734663.1734671). Boeing Scientific Research Laboratories, Mathematical and Information Sciences Laboratory, report no. 20, July 1970. [doi:10.1145/1734663.1734671](https://doi.org/10.1145/1734663.1734671)
[^22]: Douglas Comer. [The Ubiquitous B-Tree](https://web.archive.org/web/20170809145513id_/http%3A//sites.fas.harvard.edu/~cs165/papers/comer.pdf). *ACM Computing Surveys*, volume 11, issue 2, pages 121–137, June 1979. [doi:10.1145/356770.356776](https://doi.org/10.1145/356770.356776)
[^23]: Alex Miller. [Torn Write Detection and Protection](https://transactional.blog/blog/2025-torn-writes). *transactional.blog*, April 2025. Archived at [perma.cc/G7EB-33EW](https://perma.cc/G7EB-33EW)
[^24]: C. Mohan and Frank Levine. [ARIES/IM: An Efficient and High Concurrency Index Management Method Using Write-Ahead Logging](https://ics.uci.edu/~cs223/papers/p371-mohan.pdf). At *ACM International Conference on Management of Data* (SIGMOD), June 1992. [doi:10.1145/130283.130338](https://doi.org/10.1145/130283.130338)
[^25]: Hironobu Suzuki. [The Internals of PostgreSQL](https://www.interdb.jp/pg/). *interdb.jp*, 2017.
[^26]: Howard Chu. [LDAP at Lightning Speed](https://buildstuff14.sched.com/event/08a1a368e272eb599a52e08b4c3c779d). At *Build Stuff ’14*, November 2014. Archived at [perma.cc/GB6Z-P8YH](https://perma.cc/GB6Z-P8YH)
[^27]: Manos Athanassoulis, Michael S. Kester, Lukas M. Maas, Radu Stoica, Stratos Idreos, Anastasia Ailamaki, and Mark Callaghan. [Designing Access Methods: The RUM Conjecture](https://openproceedings.org/2016/conf/edbt/paper-12.pdf). At *19th International Conference on Extending Database Technology* (EDBT), March 2016. [doi:10.5441/002/edbt.2016.42](https://doi.org/10.5441/002/edbt.2016.42)
[^28]: Ben Stopford. [Log Structured Merge Trees](http://www.benstopford.com/2015/02/14/log-structured-merge-trees/). *benstopford.com*, February 2015. Archived at [perma.cc/E5BV-KUJ6](https://perma.cc/E5BV-KUJ6)
[^29]: Mark Callaghan. [The Advantages of an LSM vs a B-Tree](https://smalldatum.blogspot.com/2016/01/summary-of-advantages-of-lsm-vs-b-tree.html). *smalldatum.blogspot.co.uk*, January 2016. Archived at [perma.cc/3TYZ-EFUD](https://perma.cc/3TYZ-EFUD)
[^30]: Oana Balmau, Florin Dinu, Willy Zwaenepoel, Karan Gupta, Ravishankar Chandhiramoorthi, and Diego Didona. [SILK: Preventing Latency Spikes in Log-Structured Merge Key-Value Stores](https://www.usenix.org/conference/atc19/presentation/balmau). At *USENIX Annual Technical Conference*, July 2019.
[^31]: Igor Canadi, Siying Dong, Mark Callaghan, et al. [RocksDB Tuning Guide](https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide). *github.com*, 2023. Archived at [perma.cc/UNY4-MK6C](https://perma.cc/UNY4-MK6C)
[^32]: Gabriel Haas and Viktor Leis. [What Modern NVMe Storage Can Do, and How to Exploit it: High-Performance I/O for High-Performance Storage Engines](https://www.vldb.org/pvldb/vol16/p2090-haas.pdf). *Proceedings of the VLDB Endowment*, volume 16, issue 9, pages 2090–2102. [doi:10.14778/3598581.3598584](https://doi.org/10.14778/3598581.3598584)
[^33]: Emmanuel Goossaert. [Coding for SSDs](https://codecapsule.com/2014/02/12/coding-for-ssds-part-1-introduction-and-table-of-contents/). *codecapsule.com*, February 2014.
[^34]: Jack Vanlightly. [Is sequential IO dead in the era of the NVMe drive?](https://jack-vanlightly.com/blog/2023/5/9/is-sequential-io-dead-in-the-era-of-the-nvme-drive) *jack-vanlightly.com*, May 2023. Archived at [perma.cc/7TMZ-TAPU](https://perma.cc/7TMZ-TAPU)
[^35]: Alibaba Cloud Storage Team. [Storage System Design Analysis: Factors Affecting NVMe SSD Performance (2)](https://www.alibabacloud.com/blog/594376). *alibabacloud.com*, January 2019. Archived at [archive.org](https://web.archive.org/web/20230510065132/https%3A//www.alibabacloud.com/blog/594376)
[^36]: Xiao-Yu Hu and Robert Haas. [The Fundamental Limit of Flash Random Write Performance: Understanding, Analysis and Performance Modelling](https://dominoweb.draco.res.ibm.com/reports/rz3771.pdf). *dominoweb.draco.res.ibm.com*, March 2010. Archived at [perma.cc/8JUL-4ZDS](https://perma.cc/8JUL-4ZDS)
[^37]: Lanyue Lu, Thanumalayan Sankaranarayana Pillai, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. [WiscKey: Separating Keys from Values in SSD-conscious Storage](https://www.usenix.org/system/files/conference/fast16/fast16-papers-lu.pdf). At *4th USENIX Conference on File and Storage Technologies* (FAST), February 2016.
[^38]: Peter Zaitsev. [Innodb Double Write](https://www.percona.com/blog/innodb-double-write/). *percona.com*, August 2006. Archived at [perma.cc/NT4S-DK7T](https://perma.cc/NT4S-DK7T)
[^39]: Tomas Vondra. [On the Impact of Full-Page Writes](https://www.2ndquadrant.com/en/blog/on-the-impact-of-full-page-writes/). *2ndquadrant.com*, November 2016. Archived at [perma.cc/7N6B-CVL3](https://perma.cc/7N6B-CVL3)
[^40]: Mark Callaghan. [Read, write & space amplification - B-Tree vs LSM](https://smalldatum.blogspot.com/2015/11/read-write-space-amplification-b-tree.html). *smalldatum.blogspot.com*, November 2015. Archived at [perma.cc/S487-WK5P](https://perma.cc/S487-WK5P)
[^41]: Mark Callaghan. [Choosing Between Efficiency and Performance with RocksDB](https://codemesh.io/codemesh2016/mark-callaghan). At *Code Mesh*, November 2016. Video at [youtube.com/watch?v=tgzkgZVXKB4](https://www.youtube.com/watch?v=tgzkgZVXKB4)
[^42]: Subhadeep Sarkar, Tarikul Islam Papon, Dimitris Staratzis, Zichen Zhu, and Manos Athanassoulis. [Enabling Timely and Persistent Deletion in LSM-Engines](https://subhadeep.net/assets/fulltext/Enabling_Timely_and_Persistent_Deletion_in_LSM-Engines.pdf). *ACM Transactions on Database Systems*, volume 48, issue 3, article no. 8, August 2023. [doi:10.1145/3599724](https://doi.org/10.1145/3599724)
[^43]: Lukas Fittl. [Postgres vs. SQL Server: B-Tree Index Differences & the Benefit of Deduplication](https://pganalyze.com/blog/postgresql-vs-sql-server-btree-index-deduplication). *pganalyze.com*, April 2025. Archived at [perma.cc/XY6T-LTPX](https://perma.cc/XY6T-LTPX)
[^44]: Drew Silcock. [How Postgres stores data on disk – this one’s a page turner](https://drew.silcock.dev/blog/how-postgres-stores-data-on-disk/). *drew.silcock.dev*, August 2024. Archived at [perma.cc/8K7K-7VJ2](https://perma.cc/8K7K-7VJ2)
[^45]: Joe Webb. [Using Covering Indexes to Improve Query Performance](https://www.red-gate.com/simple-talk/databases/sql-server/learn/using-covering-indexes-to-improve-query-performance/). *simple-talk.com*, September 2008. Archived at [perma.cc/6MEZ-R5VR](https://perma.cc/6MEZ-R5VR)
[^46]: Michael Stonebraker, Samuel Madden, Daniel J. Abadi, Stavros Harizopoulos, Nabil Hachem, and Pat Helland. [The End of an Architectural Era (It’s Time for a Complete Rewrite)](https://vldb.org/conf/2007/papers/industrial/p1150-stonebraker.pdf). At *33rd International Conference on Very Large Data Bases* (VLDB), September 2007.
[^47]: [VoltDB Technical Overview White Paper](https://www.voltactivedata.com/wp-content/uploads/2017/03/hv-white-paper-voltdb-technical-overview.pdf). VoltDB, 2017. Archived at [perma.cc/B9SF-SK5G](https://perma.cc/B9SF-SK5G)
[^48]: Stephen M. Rumble, Ankita Kejriwal, and John K. Ousterhout. [Log-Structured Memory for DRAM-Based Storage](https://www.usenix.org/system/files/conference/fast14/fast14-paper_rumble.pdf). At *12th USENIX Conference on File and Storage Technologies* (FAST), February 2014.
[^49]: Stavros Harizopoulos, Daniel J. Abadi, Samuel Madden, and Michael Stonebraker. [OLTP Through the Looking Glass, and What We Found There](https://hstore.cs.brown.edu/papers/hstore-lookingglass.pdf). At *ACM International Conference on Management of Data* (SIGMOD), June 2008. [doi:10.1145/1376616.1376713](https://doi.org/10.1145/1376616.1376713)
[^50]: Per-Åke Larson, Cipri Clinciu, Campbell Fraser, Eric N. Hanson, Mostafa Mokhtar, Michal Nowakiewicz, Vassilis Papadimos, Susan L. Price, Srikumar Rangarajan, Remus Rusanu, and Mayukh Saubhasik. [Enhancements to SQL Server Column Stores](https://web.archive.org/web/20131203001153id_/http%3A//research.microsoft.com/pubs/193599/Apollo3%20-%20Sigmod%202013%20-%20final.pdf). At *ACM International Conference on Management of Data* (SIGMOD), June 2013. [doi:10.1145/2463676.2463708](https://doi.org/10.1145/2463676.2463708)
[^51]: Franz Färber, Norman May, Wolfgang Lehner, Philipp Große, Ingo Müller, Hannes Rauhe, and Jonathan Dees. [The SAP HANA Database – An Architecture Overview](https://web.archive.org/web/20220208081111id_/http%3A//sites.computer.org/debull/A12mar/hana.pdf). *IEEE Data Engineering Bulletin*, volume 35, issue 1, pages 28–33, March 2012.
[^52]: Michael Stonebraker. [The Traditional RDBMS Wisdom Is (Almost Certainly) All Wrong](https://slideshot.epfl.ch/talks/166). Presentation at *EPFL*, May 2013.
[^53]: Adam Prout, Szu-Po Wang, Joseph Victor, Zhou Sun, Yongzhu Li, Jack Chen, Evan Bergeron, Eric Hanson, Robert Walzer, Rodrigo Gomes, and Nikita Shamgunov. [Cloud-Native Transactions and Analytics in SingleStore](https://dl.acm.org/doi/pdf/10.1145/3514221.3526055). At *ACM International Conference on Management of Data* (SIGMOD), June 2022. [doi:10.1145/3514221.3526055](https://doi.org/10.1145/3514221.3526055)
[^54]: Tino Tereshko and Jordan Tigani. [BigQuery under the hood](https://cloud.google.com/blog/products/bigquery/bigquery-under-the-hood). *cloud.google.com*, January 2016. Archived at [perma.cc/WP2Y-FUCF](https://perma.cc/WP2Y-FUCF)
[^55]: Wes McKinney. [The Road to Composable Data Systems: Thoughts on the Last 15 Years and the Future](https://wesmckinney.com/blog/looking-back-15-years/). *wesmckinney.com*, September 2023. Archived at [perma.cc/6L2M-GTJX](https://perma.cc/6L2M-GTJX)
[^56]: Michael Stonebraker, Daniel J. Abadi, Adam Batkin, Xuedong Chen, Mitch Cherniack, Miguel Ferreira, Edmond Lau, Amerson Lin, Sam Madden, Elizabeth O’Neil, Pat O’Neil, Alex Rasin, Nga Tran, and Stan Zdonik. [C-Store: A Column-oriented DBMS](https://www.vldb.org/archives/website/2005/program/paper/thu/p553-stonebraker.pdf). At *31st International Conference on Very Large Data Bases* (VLDB), pages 553–564, September 2005.
[^57]: Julien Le Dem. [Dremel Made Simple with Parquet](https://blog.twitter.com/engineering/en_us/a/2013/dremel-made-simple-with-parquet.html). *blog.twitter.com*, September 2013.
[^58]: Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, and Theo Vassilakis. [Dremel: Interactive Analysis of Web-Scale Datasets](https://vldb.org/pvldb/vol3/R29.pdf). At *36th International Conference on Very Large Data Bases* (VLDB), pages 330–339, September 2010. [doi:10.14778/1920841.1920886](https://doi.org/10.14778/1920841.1920886)
[^59]: Joe Kearney. [Understanding Record Shredding: storing nested data in columns](https://www.joekearney.co.uk/posts/understanding-record-shredding). *joekearney.co.uk*, December 2016. Archived at [perma.cc/ZD5N-AX5D](https://perma.cc/ZD5N-AX5D)
[^60]: Jamie Brandon. [A shallow survey of OLAP and HTAP query engines](https://www.scattered-thoughts.net/writing/a-shallow-survey-of-olap-and-htap-query-engines). *scattered-thoughts.net*, September 2023. Archived at [perma.cc/L3KH-J4JF](https://perma.cc/L3KH-J4JF)
[^61]: Benoit Dageville, Thierry Cruanes, Marcin Zukowski, Vadim Antonov, Artin Avanes, Jon Bock, Jonathan Claybaugh, Daniel Engovatov, Martin Hentschel, Jiansheng Huang, Allison W. Lee, Ashish Motivala, Abdul Q. Munir, Steven Pelley, Peter Povinec, Greg Rahn, Spyridon Triantafyllis, and Philipp Unterbrunner. [The Snowflake Elastic Data Warehouse](https://dl.acm.org/doi/pdf/10.1145/2882903.2903741). At *ACM International Conference on Management of Data* (SIGMOD), pages 215–226, June 2016. [doi:10.1145/2882903.2903741](https://doi.org/10.1145/2882903.2903741)
[^62]: Mark Raasveldt and Hannes Mühleisen. [Data Management for Data Science Towards Embedded Analytics](https://duckdb.org/pdf/CIDR2020-raasveldt-muehleisen-duckdb.pdf). At *10th Conference on Innovative Data Systems Research* (CIDR), January 2020.
[^63]: Jean-François Im, Kishore Gopalakrishna, Subbu Subramaniam, Mayank Shrivastava, Adwait Tumbde, Xiaotian Jiang, Jennifer Dai, Seunghyun Lee, Neha Pawar, Jialiang Li, and Ravi Aringunram. [Pinot: Realtime OLAP for 530 Million Users](https://cwiki.apache.org/confluence/download/attachments/103092375/Pinot.pdf). At *ACM International Conference on Management of Data* (SIGMOD), pages 583–594, May 2018. [doi:10.1145/3183713.3190661](https://doi.org/10.1145/3183713.3190661)
[^64]: Fangjin Yang, Eric Tschetter, Xavier Léauté, Nelson Ray, Gian Merlino, and Deep Ganguli. [Druid: A Real-time Analytical Data Store](https://static.druid.io/docs/druid.pdf). At *ACM International Conference on Management of Data* (SIGMOD), June 2014. [doi:10.1145/2588555.2595631](https://doi.org/10.1145/2588555.2595631)
[^65]: Chunwei Liu, Anna Pavlenko, Matteo Interlandi, and Brandon Haynes. [Deep Dive into Common Open Formats for Analytical DBMSs](https://www.vldb.org/pvldb/vol16/p3044-liu.pdf). *Proceedings of the VLDB Endowment*, volume 16, issue 11, pages 3044–3056, July 2023. [doi:10.14778/3611479.3611507](https://doi.org/10.14778/3611479.3611507)
[^66]: Xinyu Zeng, Yulong Hui, Jiahong Shen, Andrew Pavlo, Wes McKinney, and Huanchen Zhang. [An Empirical Evaluation of Columnar Storage Formats](https://www.vldb.org/pvldb/vol17/p148-zeng.pdf). *Proceedings of the VLDB Endowment*, volume 17, issue 2, pages 148–161. [doi:10.14778/3626292.3626298](https://doi.org/10.14778/3626292.3626298)
[^67]: Weston Pace. [Lance v2: A columnar container format for modern data](https://blog.lancedb.com/lance-v2/). *blog.lancedb.com*, April 2024. Archived at [perma.cc/ZK3Q-S9VJ](https://perma.cc/ZK3Q-S9VJ)
[^68]: Yoav Helfman. [Nimble, A New Columnar File Format](https://www.youtube.com/watch?v=bISBNVtXZ6M). At *VeloxCon*, April 2024.
[^69]: Wes McKinney. [Apache Arrow: High-Performance Columnar Data Framework](https://www.youtube.com/watch?v=YhF8YR0OEFk). At *CMU Database Group – Vaccination Database Tech Talks*, December 2021.
[^70]: Wes McKinney. [Python for Data Analysis, 3rd Edition](https://learning.oreilly.com/library/view/python-for-data/9781098104023/). O’Reilly Media, August 2022. ISBN: 9781098104023
[^71]: Paul Dix. [The Design of InfluxDB IOx: An In-Memory Columnar Database Written in Rust with Apache Arrow](https://www.youtube.com/watch?v=_zbwz-4RDXg). At *CMU Database Group – Vaccination Database Tech Talks*, May 2021.
[^72]: Carlota Soto and Mike Freedman. [Building Columnar Compression for Large PostgreSQL Databases](https://www.timescale.com/blog/building-columnar-compression-in-a-row-oriented-database/). *timescale.com*, March 2024. Archived at [perma.cc/7KTF-V3EH](https://perma.cc/7KTF-V3EH)
[^73]: Daniel Lemire, Gregory Ssi‐Yan‐Kai, and Owen Kaser. [Consistently faster and smaller compressed bitmaps with Roaring](https://arxiv.org/pdf/1603.06549). *Software: Practice and Experience*, volume 46, issue 11, pages 1547–1569, November 2016. [doi:10.1002/spe.2402](https://doi.org/10.1002/spe.2402)
[^74]: Jaz Volpert. [An entire Social Network in 1.6GB (GraphD Part 2)](https://jazco.dev/2024/04/20/roaring-bitmaps/). *jazco.dev*, April 2024. Archived at [perma.cc/L27Z-QVMG](https://perma.cc/L27Z-QVMG)
[^75]: Daniel J. Abadi, Peter Boncz, Stavros Harizopoulos, Stratos Idreos, and Samuel Madden. [The Design and Implementation of Modern Column-Oriented Database Systems](https://www.cs.umd.edu/~abadi/papers/abadi-column-stores.pdf). *Foundations and Trends in Databases*, volume 5, issue 3, pages 197–280, December 2013. [doi:10.1561/1900000024](https://doi.org/10.1561/1900000024)
[^76]: Andrew Lamb, Matt Fuller, Ramakrishna Varadarajan, Nga Tran, Ben Vandiver, Lyric Doshi, and Chuck Bear. [The Vertica Analytic Database: C-Store 7 Years Later](https://vldb.org/pvldb/vol5/p1790_andrewlamb_vldb2012.pdf). *Proceedings of the VLDB Endowment*, volume 5, issue 12, pages 1790–1801, August 2012. [doi:10.14778/2367502.2367518](https://doi.org/10.14778/2367502.2367518)
[^77]: Timo Kersten, Viktor Leis, Alfons Kemper, Thomas Neumann, Andrew Pavlo, and Peter Boncz. [Everything You Always Wanted to Know About Compiled and Vectorized Queries But Were Afraid to Ask](https://www.vldb.org/pvldb/vol11/p2209-kersten.pdf). *Proceedings of the VLDB Endowment*, volume 11, issue 13, pages 2209–2222, September 2018. [doi:10.14778/3275366.3284966](https://doi.org/10.14778/3275366.3284966)
[^78]: Forrest Smith. [Memory Bandwidth Napkin Math](https://www.forrestthewoods.com/blog/memory-bandwidth-napkin-math/). *forrestthewoods.com*, February 2020. Archived at [perma.cc/Y8U4-PS7N](https://perma.cc/Y8U4-PS7N)
[^79]: Peter Boncz, Marcin Zukowski, and Niels Nes. [MonetDB/X100: Hyper-Pipelining Query Execution](https://www.cidrdb.org/cidr2005/papers/P19.pdf). At *2nd Biennial Conference on Innovative Data Systems Research* (CIDR), January 2005.
[^80]: Jingren Zhou and Kenneth A. Ross. [Implementing Database Operations Using SIMD Instructions](https://www1.cs.columbia.edu/~kar/pubsk/simd.pdf). At *ACM International Conference on Management of Data* (SIGMOD), pages 145–156, June 2002. [doi:10.1145/564691.564709](https://doi.org/10.1145/564691.564709)
[^81]: Kevin Bartley. [OLTP Queries: Transfer Expensive Workloads to Materialize](https://materialize.com/blog/oltp-queries/). *materialize.com*, August 2024. Archived at [perma.cc/4TYM-TYD8](https://perma.cc/4TYM-TYD8)
[^82]: Jim Gray, Surajit Chaudhuri, Adam Bosworth, Andrew Layman, Don Reichart, Murali Venkatrao, Frank Pellow, and Hamid Pirahesh. [Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals](https://arxiv.org/pdf/cs/0701155). *Data Mining and Knowledge Discovery*, volume 1, issue 1, pages 29–53, March 2007. [doi:10.1023/A:1009726021843](https://doi.org/10.1023/A%3A1009726021843)
[^83]: Frank Ramsak, Volker Markl, Robert Fenk, Martin Zirkel, Klaus Elhardt, and Rudolf Bayer. [Integrating the UB-Tree into a Database System Kernel](https://www.vldb.org/conf/2000/P263.pdf). At *26th International Conference on Very Large Data Bases* (VLDB), September 2000.
[^84]: Octavian Procopiuc, Pankaj K. Agarwal, Lars Arge, and Jeffrey Scott Vitter. [Bkd-Tree: A Dynamic Scalable kd-Tree](https://users.cs.duke.edu/~pankaj/publications/papers/bkd-sstd.pdf). At *8th International Symposium on Spatial and Temporal Databases* (SSTD), pages 46–65, July 2003. [doi:10.1007/978-3-540-45072-6\_4](https://doi.org/10.1007/978-3-540-45072-6_4)
[^85]: Joseph M. Hellerstein, Jeffrey F. Naughton, and Avi Pfeffer. [Generalized Search Trees for Database Systems](https://dsf.berkeley.edu/papers/vldb95-gist.pdf). At *21st International Conference on Very Large Data Bases* (VLDB), September 1995.
[^86]: Isaac Brodsky. [H3: Uber’s Hexagonal Hierarchical Spatial Index](https://eng.uber.com/h3/). *eng.uber.com*, June 2018. Archived at [archive.org](https://web.archive.org/web/20240722003854/https%3A//www.uber.com/blog/h3/)
[^87]: Robert Escriva, Bernard Wong, and Emin Gün Sirer. [HyperDex: A Distributed, Searchable Key-Value Store](https://www.cs.princeton.edu/courses/archive/fall13/cos518/papers/hyperdex.pdf). At *ACM SIGCOMM Conference*, August 2012. [doi:10.1145/2377677.2377681](https://doi.org/10.1145/2377677.2377681)
[^88]: Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. [*Introduction to Information Retrieval*](https://nlp.stanford.edu/IR-book/). Cambridge University Press, 2008. ISBN: 978-0-521-86571-5, available online at [nlp.stanford.edu/IR-book](https://nlp.stanford.edu/IR-book/)
[^89]: Jianguo Wang, Chunbin Lin, Yannis Papakonstantinou, and Steven Swanson. [An Experimental Study of Bitmap Compression vs. Inverted List Compression](https://cseweb.ucsd.edu/~swanson/papers/SIGMOD2017-ListCompression.pdf). At *ACM International Conference on Management of Data* (SIGMOD), pages 993–1008, May 2017. [doi:10.1145/3035918.3064007](https://doi.org/10.1145/3035918.3064007)
[^90]: Adrien Grand. [What is in a Lucene Index?](https://speakerdeck.com/elasticsearch/what-is-in-a-lucene-index) At *Lucene/Solr Revolution*, November 2013. Archived at [perma.cc/Z7QN-GBYY](https://perma.cc/Z7QN-GBYY)
[^91]: Michael McCandless. [Visualizing Lucene’s Segment Merges](https://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html). *blog.mikemccandless.com*, February 2011. Archived at [perma.cc/3ZV8-72W6](https://perma.cc/3ZV8-72W6)
[^92]: Lukas Fittl. [Understanding Postgres GIN Indexes: The Good and the Bad](https://pganalyze.com/blog/gin-index). *pganalyze.com*, December 2021. Archived at [perma.cc/V3MW-26H6](https://perma.cc/V3MW-26H6)
[^93]: Jimmy Angelakos. [The State of (Full) Text Search in PostgreSQL 12](https://www.youtube.com/watch?v=c8IrUHV70KQ). At *FOSDEM*, February 2020. Archived at [perma.cc/J6US-3WZS](https://perma.cc/J6US-3WZS)
[^94]: Alexander Korotkov. [Index support for regular expression search](https://wiki.postgresql.org/images/6/6c/Index_support_for_regular_expression_search.pdf). At *PGConf.EU Prague*, October 2012. Archived at [perma.cc/5RFZ-ZKDQ](https://perma.cc/5RFZ-ZKDQ)
[^95]: Michael McCandless. [Lucene’s FuzzyQuery Is 100 Times Faster in 4.0](https://blog.mikemccandless.com/2011/03/lucenes-fuzzyquery-is-100-times-faster.html). *blog.mikemccandless.com*, March 2011. Archived at [perma.cc/E2WC-GHTW](https://perma.cc/E2WC-GHTW)
[^96]: Steffen Heinz, Justin Zobel, and Hugh E. Williams. [Burst Tries: A Fast, Efficient Data Structure for String Keys](https://web.archive.org/web/20130903070248id_/http%3A//ww2.cs.mu.oz.au%3A80/~jz/fulltext/acmtois02.pdf). *ACM Transactions on Information Systems*, volume 20, issue 2, pages 192–223, April 2002. [doi:10.1145/506309.506312](https://doi.org/10.1145/506309.506312)
[^97]: Klaus U. Schulz and Stoyan Mihov. [Fast String Correction with Levenshtein Automata](https://dmice.ohsu.edu/bedricks/courses/cs655/pdf/readings/2002_Schulz.pdf). *International Journal on Document Analysis and Recognition*, volume 5, issue 1, pages 67–85, November 2002. [doi:10.1007/s10032-002-0082-8](https://doi.org/10.1007/s10032-002-0082-8)
[^98]: Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/pdf/1301.3781). At *International Conference on Learning Representations* (ICLR), May 2013. [doi:10.48550/arXiv.1301.3781](https://doi.org/10.48550/arXiv.1301.3781)
[^99]: Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/pdf/1810.04805). At *Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies*, volume 1, pages 4171–4186, June 2019. [doi:10.18653/v1/N19-1423](https://doi.org/10.18653/v1/N19-1423)
[^100]: Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. [Improving Language Understanding by Generative Pre-Training](https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf). *openai.com*, June 2018. Archived at [perma.cc/5N3C-DJ4C](https://perma.cc/5N3C-DJ4C)
[^101]: Matthijs Douze, Maria Lomeli, and Lucas Hosseini. [Faiss indexes](https://github.com/facebookresearch/faiss/wiki/Faiss-indexes). *github.com*, August 2024. Archived at [perma.cc/2EWG-FPBS](https://perma.cc/2EWG-FPBS)
[^102]: Varik Matevosyan. [Understanding pgvector’s HNSW Index Storage in Postgres](https://lantern.dev/blog/pgvector-storage). *lantern.dev*, August 2024. Archived at [perma.cc/B2YB-JB59](https://perma.cc/B2YB-JB59)
[^103]: Dmitry Baranchuk, Artem Babenko, and Yury Malkov. [Revisiting the Inverted Indices for Billion-Scale Approximate Nearest Neighbors](https://arxiv.org/pdf/1802.02422). At *European Conference on Computer Vision* (ECCV), pages 202–216, September 2018. [doi:10.1007/978-3-030-01258-8\_13](https://doi.org/10.1007/978-3-030-01258-8_13)
[^104]: Yury A. Malkov and Dmitry A. Yashunin. [Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs](https://arxiv.org/pdf/1603.09320). *IEEE Transactions on Pattern Analysis and Machine Intelligence*, volume 42, issue 4, pages 824–836, April 2020. [doi:10.1109/TPAMI.2018.2889473](https://doi.org/10.1109/TPAMI.2018.2889473)
================================================
FILE: content/en/ch5.md
================================================
---
title: "5. Encoding and Evolution"
weight: 105
breadcrumbs: false
---

> *Everything changes and nothing stands still.*
>
> Heraclitus of Ephesus, as quoted by Plato in *Cratylus* (360 BCE)

Applications inevitably change over time. Features are added or modified as new products are
launched, user requirements become better understood, or business circumstances change. In
[Chapter 2](/en/ch2#ch_nonfunctional) we introduced the idea of *evolvability*: we should aim to build systems that
make it easy to adapt to change (see [“Evolvability: Making Change Easy”](/en/ch2#sec_introduction_evolvability)).
In most cases, a change to an application’s features also requires a change to data that it stores:
perhaps a new field or record type needs to be captured, or perhaps existing data needs to be
presented in a new way.
The data models we discussed in [Chapter 3](/en/ch3#ch_datamodels) have different ways of coping with such change.
Relational databases generally assume that all data in the database conforms to one schema: although
that schema can be changed (through schema migrations; i.e., `ALTER` statements), there is exactly
one schema in force at any one point in time. By contrast, schema-on-read (“schemaless”) databases
don’t enforce a schema, so the database can contain a mixture of older and newer data formats
written at different times (see [“Schema flexibility in the document model”](/en/ch3#sec_datamodels_schema_flexibility)).
When a data format or schema changes, a corresponding change to application code often needs to
happen (for example, you add a new field to a record, and the application code starts reading
and writing that field). However, in a large application, code changes often cannot happen
instantaneously:
* With server-side applications you may want to perform a *rolling upgrade*
(also known as a *staged rollout*), deploying the new version to a few nodes at a time, checking
whether the new version is running smoothly, and gradually working your way through all the nodes.
This allows new versions to be deployed without service downtime, and thus encourages more frequent releases and better evolvability.
* With client-side applications you’re at the mercy of the user, who may not install the update for some time.
This means that old and new versions of the code, and old and new data formats, may potentially all
coexist in the system at the same time. In order for the system to continue running smoothly, we
need to maintain compatibility in both directions:
Backward compatibility
: Newer code can read data that was written by older code.
Forward compatibility
: Older code can read data that was written by newer code.
Backward compatibility is normally not hard to achieve: as author of the newer code, you know the
format of data written by older code, and so you can explicitly handle it (if necessary by simply
keeping the old code to read the old data). Forward compatibility can be trickier, because it
requires older code to ignore additions made by a newer version of the code.
Another challenge with forward compatibility is illustrated in [Figure 5-1](/en/ch5#fig_encoding_preserve_field).
Say you add a field to a record schema, and the newer code creates a record containing that new
field and stores it in a database. Subsequently, an older version of the code (which doesn’t yet
know about the new field) reads the record, updates it, and writes it back. In this situation, the
desirable behavior is usually for the old code to keep the new field intact, even though it couldn’t
be interpreted. But if the record is decoded into a model object that does not explicitly
preserve unknown fields, data can be lost, as in [Figure 5-1](/en/ch5#fig_encoding_preserve_field).
{{< figure src="/fig/ddia_0501.png" id="fig_encoding_preserve_field" caption="When an older version of the application updates data previously written by a newer version of the application, data may be lost if you’re not careful." class="w-full my-4" >}}
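The read-modify-write hazard described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `OldUser` class, the `photo_url` field, and both update functions are hypothetical names invented for the example, and JSON stands in for whatever encoding the application actually uses. The first function decodes into a fixed model object and silently drops the field it does not know about; the second changes only the field it understands and carries the rest through untouched.

```python
import json
from dataclasses import dataclass

# Record written by newer code, which has added a hypothetical "photo_url" field.
new_record = json.dumps(
    {"id": 42, "name": "Alice", "photo_url": "https://example.com/a.png"}
)


@dataclass
class OldUser:
    """Model object in the old code: it knows only about id and name."""
    id: int
    name: str


def lossy_update(encoded: str, new_name: str) -> str:
    """Decode into a fixed model object, so unknown fields are dropped."""
    data = json.loads(encoded)
    user = OldUser(id=data["id"], name=data["name"])
    user.name = new_name
    return json.dumps({"id": user.id, "name": user.name})


def preserving_update(encoded: str, new_name: str) -> str:
    """Change only the known field and carry everything else through."""
    data = json.loads(encoded)
    data["name"] = new_name
    return json.dumps(data)


lossy = json.loads(lossy_update(new_record, "Alicia"))
kept = json.loads(preserving_update(new_record, "Alicia"))
print("photo_url" in lossy)  # False: the old code lost the new field
print("photo_url" in kept)   # True: the unknown field survived the update
```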
In this chapter we will look at several formats for encoding data, including JSON, XML, Protocol
Buffers, and Avro. In particular, we will look at how they handle schema changes and how they
support systems where old and new data and code need to coexist. We will then discuss how those
formats are used for data storage and for communication: in databases, web services, REST APIs,
remote procedure calls (RPC), workflow engines, and event-driven systems such as actors and
message queues.
## Formats for Encoding Data {#sec_encoding_formats}
Programs usually work with data in (at least) two different representations:
1. In memory, data is kept in objects, structs, lists, arrays, hash tables, trees, and so on. These
data structures are optimized for efficient access and manipulation by the CPU (typically using
pointers).
2. When you want to write data to a file or send it over the network, you have to encode it as some
kind of self-contained sequence of bytes (for example, a JSON document). Since a pointer wouldn’t
make sense to any other process, this sequence-of-bytes representation often looks quite
different from the data structures that are normally used in memory.
Thus, we need some kind of translation between the two representations. The translation from the
in-memory representation to a byte sequence is called *encoding* (also known as *serialization* or
*marshalling*), and the reverse is called *decoding* (*parsing*, *deserialization*,
*unmarshalling*).
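As a small illustration of the two representations, the following Python sketch encodes an in-memory structure to bytes and decodes it again. JSON and the sample record are arbitrary choices made for the example; any encoding format would show the same round trip.

```python
import json

# In-memory representation: objects connected by references (pointers).
record = {"user": "alice", "interests": ["hiking", "databases"]}

# Encoding: in-memory structure -> self-contained sequence of bytes.
encoded = json.dumps(record).encode("utf-8")

# Decoding: byte sequence -> a new in-memory structure with equal contents.
decoded = json.loads(encoded.decode("utf-8"))

print(type(encoded).__name__)  # bytes
print(decoded == record)       # True: same contents
print(decoded is record)       # False: a separate object, rebuilt from bytes
```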
--------
> [!TIP] TERMINOLOGY CLASH
*Serialization* is unfortunately also used in the context of transactions (see [Chapter 8](/en/ch8#ch_transactions)),
with a completely different meaning. To avoid overloading the word we’ll stick with *encoding* in
this book, even though *serialization* is perhaps a more common term.
--------
There are exceptions in which encoding/decoding is not needed—for example, when a database operates
directly on compressed data loaded from disk, as discussed in [“Query Execution: Compilation and Vectorization”](/en/ch4#sec_storage_vectorized). There are
also *zero-copy* data formats that are designed to be used both at runtime and on disk/on the
network, without an explicit conversion step, such as Cap’n Proto and FlatBuffers.
However, most systems need to convert between in-memory objects and flat byte sequences. As this is
such a common problem, there are myriad different libraries and encoding formats to choose from.
Let’s do a brief overview.
### Language-Specific Formats {#id96}
Many programming languages come with built-in support for encoding in-memory objects into byte
sequences. For example, Java has `java.io.Serializable`, Python has `pickle`, Ruby has `Marshal`,
and so on. Many third-party libraries also exist, such as Kryo for Java.
These encoding libraries are very convenient, because they allow in-memory objects to be saved and
restored with minimal additional code. However, they also have a number of deep problems:
* The encoding is often tied to a particular programming language, and reading the data in another
language is very difficult. If you store or transmit data in such an encoding, you are committing
yourself to your current programming language for potentially a very long time, and precluding
integrating your systems with those of other organizations (which may use different languages).
* In order to restore data in the same object types, the decoding process needs to be able to
instantiate arbitrary classes. This is frequently a source of security problems [^1]:
if an attacker can get your application to decode an arbitrary byte sequence, they can instantiate
arbitrary classes, which in turn often allows them to do terrible things such as remotely
executing arbitrary code [^2] [^3].
* Versioning data is often an afterthought in these libraries: as they are intended for quick and
easy encoding of data, they often neglect the inconvenient problems of forward and backward compatibility [^4].
* Efficiency (CPU time taken to encode or decode, and the size of the encoded structure) is also
often an afterthought. For example, Java’s built-in serialization is notorious for its bad
performance and bloated encoding [^5].
For these reasons it’s generally a bad idea to use your language’s built-in encoding for anything
other than very transient purposes.
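To make the security concern concrete, here is a deliberately harmless sketch using Python's `pickle`. The class name `Innocent` is hypothetical. The point is that the encoded bytes, not the decoding program, decide which callable runs during decoding: here it is the built-in `len`, but an attacker who controls the bytes could name a far more dangerous function instead.

```python
import pickle


class Innocent:
    def __reduce__(self):
        # Tells pickle how to "rebuild" this object: by calling len("abc").
        # The callable and its arguments are recorded in the byte sequence.
        return (len, ("abc",))


payload = pickle.dumps(Innocent())

# Decoding invokes the callable named in the payload. The result is whatever
# that call returns, not an instance of the class that was encoded.
result = pickle.loads(payload)

print(result)                 # 3
print(type(result).__name__)  # int
```

Nothing in the decoding code mentions `len`, yet it was called. This is why byte sequences from untrusted sources should not be passed to such decoders.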
### JSON, XML, and Binary Variants {#sec_encoding_json}
When moving to standardized encodings that can be written and read by many programming languages, JSON
and XML are the obvious contenders. They are widely known, widely supported, and almost as widely
disliked. XML is often criticized for being too verbose and unnecessarily complicated [^6].
JSON’s popularity is mainly due to its built-in support in web browsers and simplicity relative to
XML. CSV is another popular language-independent format, but it only supports tabular data without
nesting.
JSON, XML, and CSV are textual formats, and thus somewhat human-readable (although the syntax is a
popular topic of debate). Besides the superficial syntactic issues, they also have some subtle
problems:
* There is a lot of ambiguity around the encoding of numbers. In XML and CSV, you cannot distinguish
between a number and a string that happens to consist of digits (except by referring to an external
schema). JSON distinguishes strings and numbers, but it doesn’t distinguish integers and
floating-point numbers, and it doesn’t specify a precision.
This is a problem when dealing with large numbers; for example, integers greater than 2^53 cannot
be exactly represented in an IEEE 754 double-precision floating-point number, so such numbers become
inaccurate when parsed in a language that uses floating-point numbers, such as JavaScript [^7].
An example of numbers larger than 2^53 occurs on X (formerly Twitter), which uses a 64-bit number to
identify each post. The JSON returned by the API includes post IDs twice, once as a JSON number and
once as a decimal string, to work around the fact that the numbers are not correctly parsed by
JavaScript applications [^8].
* JSON and XML have good support for Unicode character strings (i.e., human-readable text), but they
don’t support binary strings (sequences of bytes without a character encoding). Binary strings are a
useful feature, so people get around this limitation by encoding the binary data as text using
Base64. The schema is then used to indicate that the value should be interpreted as Base64-encoded.
This works, but it’s somewhat hacky and increases the data size by about 33%.
* XML Schema and JSON Schema are powerful, and thus quite
complicated to learn and implement. Since the correct interpretation of data (such as numbers and
binary strings) depends on information in the schema, applications that don’t use XML/JSON schemas
need to potentially hard-code the appropriate encoding/decoding logic instead.
* CSV does not have any schema, so it is up to the application to define the meaning of each row and
column. If an application change adds a new row or column, you have to handle that change manually.
CSV is also quite a vague format (what happens if a value contains a comma or a newline character?).
Although its escaping rules have been formally specified [^9],
not all parsers implement them correctly.
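Two of the problems listed above are easy to reproduce. The following Python sketch uses only the standard library. Passing `parse_int=float` to `json.loads` is used here as a stand-in for a parser that maps every JSON number to a double, as JavaScript does. The 768-byte sample payload is arbitrary, chosen as a multiple of 3 so that the Base64 output needs no padding.

```python
import base64
import json

text = str(2**53 + 1)  # "9007199254740993", a valid JSON number

# Python's default JSON parser keeps integers exact...
exact = json.loads(text)

# ...but a parser that treats every number as an IEEE 754 double rounds it.
rounded = json.loads(text, parse_int=float)

print(exact == 2**53 + 1)  # True
print(int(rounded))        # 9007199254740992: off by one

# Binary data must be smuggled through JSON as text, for example with Base64,
# which emits 4 output characters for every 3 input bytes.
raw = bytes(range(256)) * 3  # 768 bytes of arbitrary binary data
as_text = base64.b64encode(raw).decode("ascii")
document = json.dumps({"data": as_text})

print(len(raw), len(as_text))  # 768 1024
print(base64.b64decode(json.loads(document)["data"]) == raw)  # True
```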
Despite these flaws, JSON, XML, and CSV are good enough for many purposes. It’s likely that they will
remain popular, especially as data interchange formats (i.e., for sending data from one organization to
another). In these situations, as long as people agree on what the format is, it often doesn’t
matter how pretty or efficient the format is. The difficulty of getting different organizations to
agree on *anything* outweighs most other concerns.
#### JSON Schema {#json-schema}
JSON Schema has become widely adopted as a way to model data whenever it’s exchanged between systems
or written to storage. You’ll find JSON schemas in web services (see [“Web services”](/en/ch5#sec_web_services)) as part
of the OpenAPI web service specification, schema registries such as Confluent’s Schema Registry and
Red Hat’s Apicurio Registry, and in databases such as PostgreSQL’s pg\_jsonschema validator extension
and MongoDB’s `$jsonSchema` validator syntax.
The JSON Schema specification offers a number of features. Schemas can use the standard primitive
types: strings, numbers, integers, objects, arrays, booleans, and nulls. JSON Schema also
offers a separate validation specification that allows developers to overlay constraints on fields.
For example, a `port` field might have a minimum of 1 and a maximum of 65535.
JSON Schemas can have either open or closed content models. An open content model permits any field
not defined in the schema to exist with any data type, whereas a closed content model only allows
fields that are explicitly defined. The open content model in JSON Schema is enabled when
`additionalProperties` is set to `true`, which is the default. Thus, JSON Schemas are usually a
definition of what *isn’t* permitted (namely, invalid values on any of the defined fields), rather
than what *is* permitted in a schema.
Open content models are powerful, but can be complex. For example, say you want to define a map from
integers (such as IDs) to strings. JSON does not have a map or dictionary type, only an “object”
type that can contain string keys, and values of any type. You can then constrain this type with
JSON Schema so that keys may only contain digits, and values can only be strings, using
`patternProperties` and `additionalProperties` as shown in [Example 5-1](/en/ch5#fig_encoding_json_schema).
{{< figure id="fig_encoding_json_schema" title="Example 5-1. Example JSON Schema with integer keys and string values. Integer keys are represented as strings containing only integers since JSON Schema requires all keys to be strings." class="w-full my-4" >}}
```json
{
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "patternProperties": {
        "^[0-9]+$": {
            "type": "string"
        }
    },
    "additionalProperties": false
}
```
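To make the effect of this schema concrete, the following Python sketch hand-checks the same rules using only the standard library. It is not a JSON Schema validator: the function name `conforms_to_example_schema` is invented for illustration, and the check covers only the three keywords used in Example 5-1.

```python
import json
import re

# Same character class as the "patternProperties" key in Example 5-1.
# fullmatch is used because Python's "$" also matches before a trailing newline.
DIGITS_ONLY = re.compile(r"[0-9]+")


def conforms_to_example_schema(document: str) -> bool:
    """Hand-written check mirroring Example 5-1 (not a general validator)."""
    value = json.loads(document)
    if not isinstance(value, dict):  # "type": "object"
        return False
    for key, item in value.items():
        # "additionalProperties": false rejects keys that match no pattern.
        if not DIGITS_ONLY.fullmatch(key):
            return False
        # Keys that match the pattern must have string values.
        if not isinstance(item, str):
            return False
    return True


accepted = conforms_to_example_schema('{"1": "one", "42": "forty-two"}')
rejected_key = conforms_to_example_schema('{"1": "one", "two": "2"}')
rejected_value = conforms_to_example_schema('{"1": 1}')
```

Here `accepted` is true, while the other two documents are rejected: one has a key that is not made of digits, and the other has a non-string value.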
In addition to open and closed content models and validators, JSON Schema supports conditional
if/else schema logic, named types, references to remote schemas, and much more. All of this makes
for a very powerful schema language. Such features also make for unwieldy definitions. It can be
challenging to resolve remote schemas, reason about conditional rules, or evolve schemas in a
forwards or backwards compatible way [^10]. Similar concerns apply to XML Schema [^11].
#### Binary encoding {#binary-encoding}
JSON is less verbose than XML, but both still use a lot of space compared to binary formats. This
observation led to the development of a profusion of binary encodings for JSON (MessagePack, CBOR,
BSON, BJSON, UBJSON, BISON, Hessian, and Smile, to name a few) and for XML (WBXML and Fast Infoset,
for example). These formats have been adopted in various niches, as they are more compact and
sometimes faster to parse, but none of them are as widely adopted as the textual versions of JSON and XML [^12].
Some of these formats extend the set of datatypes (e.g., distinguishing integers and floating-point numbers,
or adding support for binary strings), but otherwise they keep the JSON/XML data model unchanged. In
particular, since they don’t prescribe a schema, they need to include all the object field names within
the encoded data. That is, in a binary encoding of the JSON document in [Example 5-2](/en/ch5#fig_encoding_json), they
will need to include the strings `userName`, `favoriteNumber`, and `interests` somewhere.
{{< figure id="fig_encoding_json" title="Example 5-2. Example record which we will encode in several binary formats in this chapter" class="w-full my-4" >}}
```json
{
    "userName": "Martin",
    "favoriteNumber": 1337,
    "interests": ["daydreaming", "hacking"]
}
```
Let’s look at an example of MessagePack, a binary encoding for JSON. [Figure 5-2](/en/ch5#fig_encoding_messagepack)
shows the byte sequence that you get if you encode the JSON document in [Example 5-2](/en/ch5#fig_encoding_json) with
MessagePack. The first few bytes are as follows:
1. The first byte, `0x83`, indicates that what follows is an object (top four bits = `0x80`) with three
fields (bottom four bits = `0x03`). (In case you’re wondering what happens if an object has more
than 15 fields, so that the number of fields doesn’t fit in four bits, it then gets a different type
indicator, and the number of fields is encoded in two or four bytes.)
2. The second byte, `0xa8`, indicates that what follows is a string (top four bits = `0xa0`) that is eight
bytes long (bottom four bits = `0x08`).
3. The next eight bytes are the field name `userName` in ASCII. Since the length was indicated
previously, there’s no need for any marker to tell us where the string ends (or any escaping).
4. The next seven bytes encode the six-letter string value `Martin` with a prefix `0xa6`, and so on.
The binary encoding is 66 bytes long, which is only a little less than the 81 bytes taken by the
textual JSON encoding (with whitespace removed). All the binary encodings of JSON are similar in
this regard. It’s not clear whether such a small space reduction (and perhaps a speedup in parsing)
is worth the loss of human-readability.
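The byte counts above can be reproduced with a small hand-written encoder. The Python sketch below covers only the MessagePack type indicators needed for the example record (short strings, small unsigned integers, short arrays, and small objects); the function name `pack` is our own, and a real MessagePack library handles many more cases.

```python
import json
import struct


def pack(value) -> bytes:
    """Encode a small subset of JSON values in the MessagePack format."""
    if isinstance(value, str):
        data = value.encode("utf-8")
        if len(data) > 31:
            raise ValueError("sketch only handles strings of up to 31 bytes")
        return bytes([0xA0 | len(data)]) + data  # string: top bits 0xa0 + length
    if isinstance(value, int) and not isinstance(value, bool):
        if 0 <= value <= 0x7F:
            return bytes([value])  # small integers fit in a single byte
        if 0 <= value <= 0xFF:
            return b"\xcc" + bytes([value])  # 8-bit unsigned integer
        if 0 <= value <= 0xFFFF:
            return b"\xcd" + struct.pack(">H", value)  # 16-bit, big-endian
        raise ValueError("sketch only handles integers from 0 to 65535")
    if isinstance(value, list):
        if len(value) > 15:
            raise ValueError("sketch only handles arrays of up to 15 items")
        return bytes([0x90 | len(value)]) + b"".join(pack(v) for v in value)
    if isinstance(value, dict):
        if len(value) > 15:
            raise ValueError("sketch only handles objects of up to 15 fields")
        out = bytes([0x80 | len(value)])  # object: top bits 0x80 + field count
        for field_name, field_value in value.items():
            out += pack(field_name) + pack(field_value)
        return out
    raise TypeError(f"unsupported type: {type(value).__name__}")


record = {
    "userName": "Martin",
    "favoriteNumber": 1337,
    "interests": ["daydreaming", "hacking"],
}
encoded = pack(record)
json_text = json.dumps(record, separators=(",", ":"))  # JSON without whitespace
```

With this record, `encoded` is 66 bytes long and starts with `0x83 0xa8` followed by `userName`, while `json_text` is 81 characters long.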
In the following sections we will see how we can do much better, and encode the same record in just 32 bytes.
{{< figure link="#fig_encoding_json" src="/fig/ddia_0502.png" id="fig_encoding_messagepack" caption="Figure 5-2. Example record Example 5-2 encoded using MessagePack." class="w-full my-4" >}}
### Protocol Buffers {#sec_encoding_protobuf}
Protocol Buffers (protobuf) is a binary encoding library developed at Google.
It is similar to Apache Thrift, which was originally developed by Facebook [^13];
most of what this section says about Protocol Buffers applies also to Thrift.
Protocol Buffers requires a schema for any data that is encoded. To encode the data
in [Example 5-2](/en/ch5#fig_encoding_json) in Protocol Buffers, you would describe the schema in the Protocol Buffers
interface definition language (IDL) like this:
```protobuf
syntax = "proto3";
message Person {
    string user_name = 1;
    int64 favorite_number = 2;
    repeated string interests = 3;
}
```
Protocol Buffers comes with a code generation tool that takes a schema definition like the one
shown here, and produces classes that implement the schema in various programming languages. Your
application code can call this generated code to encode or decode records of the schema. The schema
language is very simple compared to JSON Schema: it only defines the fields of records and their
types, but it does not support other restrictions on the possible values of fields.
Encoding [Example 5-2](/en/ch5#fig_encoding_json) using a Protocol Buffers encoder requires 33 bytes, as shown in [Figure 5-3](/en/ch5#fig_encoding_protobuf) [^14].
{{< figure src="/fig/ddia_0503.png" id="fig_encoding_protobuf" caption="Figure 5-3. Example record encoded using Protocol Buffers." class="w-full my-4" >}}
Similarly to [Figure 5-2](/en/ch5#fig_encoding_messagepack), each field has a type annotation (to indicate whether it
is a string, integer, etc.) and, where required, a length indication (such as the length of a
string). The strings that appear in the data (“Martin”, “daydreaming”, “hacking”) are also encoded
as ASCII (to be precise, UTF-8), similar to before.
The big difference compared to [Figure 5-2](/en/ch5#fig_encoding_messagepack) is that there are no field names
(`userName`, `favoriteNumber`, `interests`). Instead, the encoded data contains *field tags*, which
are numbers (`1`, `2`, and `3`). Those are the numbers that appear in the schema definition. Field tags
are like aliases for fields—they are a compact way of saying what field we’re talking about,
without having to spell out the field name.
As you can see, Protocol Buffers saves even more space by packing the field type and tag number into
a single byte. It uses variable-length integers: the number 1337 is encoded in two bytes, with the
top bit of each byte used to indicate whether there are still more bytes to come. This means that an
`int64` value between 0 and 127 is encoded in one byte, a value between 128 and 16383 is encoded in
two bytes, etc. Bigger numbers use more bytes. (A negative `int64` value always takes ten bytes; the
`sint64` type applies a zigzag encoding first, so that numbers between -64 and 63 fit in one byte
and numbers between -8192 and 8191 fit in two.)
Protocol Buffers doesn’t have an explicit list or array datatype. Instead, the `repeated` modifier
on the `interests` field indicates that the field contains a list of values, rather than a single
value. In the binary encoding, the list elements are represented simply as repeated occurrences of
the same field tag within the same record.
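The encoding described above is simple enough to reproduce by hand. The following Python sketch is a minimal encoder for the `Person` message, assuming only the rules just described: a key that combines the tag number and a type code, variable-length integers, and length-prefixed strings. The function names are invented for illustration; this is not the code that the Protocol Buffers compiler generates, and it ignores details such as negative numbers and the omission of default values.

```python
VARINT = 0  # type code for variable-length integers
LEN = 2     # type code for length-prefixed values such as strings


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer seven bits at a time, low bits first."""
    if n < 0:
        raise ValueError("sketch only handles non-negative integers")
    out = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            out.append(low7 | 0x80)  # top bit set: more bytes to come
        else:
            out.append(low7)         # top bit clear: last byte
            return bytes(out)


def key(tag: int, type_code: int) -> bytes:
    """Pack the field tag and the type code into one varint."""
    return encode_varint((tag << 3) | type_code)


def string_field(tag: int, text: str) -> bytes:
    data = text.encode("utf-8")
    return key(tag, LEN) + encode_varint(len(data)) + data


def int_field(tag: int, number: int) -> bytes:
    return key(tag, VARINT) + encode_varint(number)


def encode_person(user_name: str, favorite_number: int, interests: list) -> bytes:
    out = string_field(1, user_name)
    out += int_field(2, favorite_number)
    for interest in interests:
        out += string_field(3, interest)  # repeated: same tag for every element
    return out


encoded = encode_person("Martin", 1337, ["daydreaming", "hacking"])
```

For the example record, `encoded` is 33 bytes long, and `encode_varint(1337)` yields the two bytes `0xb9 0x0a`.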
#### Field tags and schema evolution {#field-tags-and-schema-evolution}
We said previously that schemas inevitably need to change over time. We call this *schema
evolution*. How does Protocol Buffers handle schema changes while keeping backward and forward compatibility?
As you can see from the examples, an encoded record is just the concatenation of its encoded fields.
Each field is identified by its tag number (the numbers `1`, `2`, `3` in the sample schema) and
annotated with a datatype (e.g., string or integer). If a field value is not set, it is simply
omitted from the encoded record. From this you can see that field tags are critical to the meaning
of the encoded data. You can change the name of a field in the schema, since the encoded data never
refers to field names, but you cannot change a field’s tag, since that would make all existing
encoded data invalid.
You can add new fields to the schema, provided that you give each field a new tag number. If old
code (which doesn’t know about the new tag numbers you added) tries to read data written by new
code, including a new field with a tag number it doesn’t recognize, it can simply ignore that field.
The datatype annotation allows the parser to determine how many bytes it needs to skip, and preserve
the unknown fields to avoid the problem in [Figure 5-1](/en/ch5#fig_encoding_preserve_field). This maintains forward
compatibility: old code can read records that were written by new code.
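To illustrate how old code can skip a field it doesn't know, here is a minimal Python sketch of a decoder for an older version of the `Person` schema that has only tags 1 and 2. It assumes the same simplified wire format as before (type code 0 for variable-length integers, 2 for length-prefixed values); the names `decode_old_person` and `NEW_RECORD` are hypothetical, and a real parser supports more type codes.

```python
def read_varint(buf: bytes, pos: int) -> tuple:
    """Read a variable-length integer; return (value, position after it)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:  # top bit clear: this was the last byte
            return result, pos
        shift += 7


def decode_old_person(buf: bytes) -> tuple:
    """Decode with an old schema that only knows tags 1 and 2."""
    person = {"user_name": "", "favorite_number": 0}  # defaults for unset fields
    unknown = []  # fields with unrecognized tags, preserved rather than dropped
    pos = 0
    while pos < len(buf):
        field_key, pos = read_varint(buf, pos)
        tag, type_code = field_key >> 3, field_key & 0x07
        # The type code tells us how many bytes belong to this field,
        # even if we have never heard of the tag.
        if type_code == 0:
            value, pos = read_varint(buf, pos)
        elif type_code == 2:
            length, pos = read_varint(buf, pos)
            value = buf[pos:pos + length]
            pos += length
        else:
            raise ValueError("type code not handled in this sketch")
        if tag == 1:
            person["user_name"] = value.decode("utf-8")
        elif tag == 2:
            person["favorite_number"] = value
        else:
            unknown.append((tag, type_code, value))
    return person, unknown


# A record written by newer code that also knows tag 3 (interests).
NEW_RECORD = b"\x0a\x06Martin\x10\xb9\x0a\x1a\x0bdaydreaming\x1a\x07hacking"
person, unknown = decode_old_person(NEW_RECORD)
```

The old decoder recovers `user_name` and `favorite_number` as usual, and the two occurrences of tag 3 end up in `unknown`, from where they could be written back unchanged.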
What about backward compatibility? As long as each field has a unique tag number, new code can
always read old data, because the tag numbers still have the same meaning. If a field was added in
the new schema, and you read old data that does not yet contain that field, it is filled in with a
default value (for example, the empty string if the field type is string, or zero if it’s a number).
Removing a field is just like adding a field, with backward and forward compatibility concerns
reversed. You can never use the same tag number again, because you may still have data written
somewhere that includes the old tag number, and that field must be ignored by new code. Tag numbers
used in the past can be reserved in the schema definition to ensure they are not forgotten.
What about changing the datatype of a field? That is possible with some types—check the
documentation for details—but there is a risk that values will get truncated. For example, say you
change a 32-bit integer into a 64-bit integer. New code can easily read data written by old code,
because the parser can fill in any missing bits with zeros. However, if old code reads data written
by new code, the old code is still using a 32-bit variable to hold the value. If the decoded 64-bit
value won’t fit in 32 bits, it will be truncated.
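The effect of truncation can be imitated in Python, whose integers are not limited to a fixed width. The helper `read_as_int32` below is a hypothetical stand-in for what happens when a 64-bit value is stored in a signed 32-bit variable; the exact behavior in practice depends on the programming language and the generated code.

```python
def read_as_int32(value: int) -> int:
    """Keep the low 32 bits and interpret them as a signed 32-bit integer."""
    low = value & 0xFFFF_FFFF   # discard everything above bit 31
    if low >= 0x8000_0000:      # bit 31 set: negative in two's complement
        low -= 0x1_0000_0000
    return low


small = read_as_int32(1337)           # fits in 32 bits, so it is unchanged
large = read_as_int32(5_000_000_000)  # does not fit, so high bits are lost
```

A value such as 1337 survives unchanged, but 5,000,000,000 silently becomes 705,032,704.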
### Avro {#sec_encoding_avro}
Apache Avro is another binary encoding format that is interestingly different from Protocol Buffers.
It was started in 2009 as a subproject of Hadoop, as a result of Protocol Buffers not being a good
fit for Hadoop’s use cases [^15].
Avro also uses a schema to specify the structure of the data being encoded. It has two schema
languages: one (Avro IDL) intended for human editing, and one (based on JSON) that is more easily
machine-readable. Like Protocol Buffers, this schema language specifies only fields and their types,
and not complex validation rules like in JSON Schema.
Our example schema, written in Avro IDL, might look like this:
```c
record Person {
    string               userName;
    union { null, long } favoriteNumber = null;
    array<string>        interests;
}
```
The equivalent JSON representation of that schema is as follows:
```json
{
    "type": "record",
    "name": "Person",
    "fields": [
        {"name": "userName",       "type": "string"},
        {"name": "favoriteNumber", "type": ["null", "long"], "default": null},
        {"name": "interests",      "type": {"type": "array", "items": "string"}}
    ]
}
```
First of all, notice that there are no tag numbers in the schema. If we encode our example record
([Example 5-2](/en/ch5#fig_encoding_json)) using this schema, the Avro binary encoding is just 32 bytes long—the
most compact of all the encodings we have seen. The breakdown of the encoded byte sequence is shown
in [Figure 5-4](/en/ch5#fig_encoding_avro).
If you examine the byte sequence, you can see that there is nothing to identify fields or their
datatypes. The encoding simply consists of values concatenated together. A string is just a length
prefix followed by UTF-8 bytes, but there’s nothing in the encoded data that tells you that it is a
string. It could just as well be an integer, or something else entirely. An integer is encoded using
a variable-length encoding.
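To see where the 32 bytes come from, the following Python sketch encodes the example record by hand, following the field order of the schema. It assumes the rules of Avro's binary encoding as we understand them: integers are zigzag-encoded and then written as variable-length integers, a string is a length followed by UTF-8 bytes, a union is the index of the chosen branch followed by the value, and an array is a count of items followed by the items and a terminating zero. The function names are our own, and this is not the Avro library's API.

```python
def encode_long(n: int) -> bytes:
    """Zigzag-encode a signed 64-bit integer, then write it as a varint."""
    z = (n << 1) ^ (n >> 63)  # zigzag maps 0, -1, 1, -2, ... to 0, 1, 2, 3, ...
    out = bytearray()
    while True:
        low7 = z & 0x7F
        z >>= 7
        if z:
            out.append(low7 | 0x80)  # top bit set: more bytes to come
        else:
            out.append(low7)
            return bytes(out)


def encode_string(text: str) -> bytes:
    data = text.encode("utf-8")
    return encode_long(len(data)) + data  # length prefix, then UTF-8 bytes


def encode_person(user_name, favorite_number, interests) -> bytes:
    # Fields are written in schema order, with no tags or type markers.
    out = encode_string(user_name)
    if favorite_number is None:
        out += encode_long(0)  # union branch 0 (null) has no further bytes
    else:
        out += encode_long(1) + encode_long(favorite_number)  # branch 1 (long)
    if interests:
        out += encode_long(len(interests))  # number of items that follow
        for interest in interests:
            out += encode_string(interest)
    out += encode_long(0)  # a count of zero marks the end of the array
    return out


encoded = encode_person("Martin", 1337, ["daydreaming", "hacking"])
```

The result is 32 bytes long. Nothing in those bytes says which value is a string and which is a number; only the schema tells the decoder how to interpret them.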
{{< figure src="/fig/ddia_0504.png" id="fig_encoding_avro" caption="Figure 5-4. Example record encoded using Avro." class="w-full my-4" >}}
To parse the binary data, you go through the fields in the order that they appear in the schema and
use the schema to tell you the datatype of each field. This means that the binary data can only be
decoded correctly if the code reading the data is using the *exact same schema* as the code that
wrote the data. Any mismatch in the schema between the reader and the writer would mean incorrectly
decoded data.
So, how does Avro support schema evolution?
#### The writer’s schema and the reader’s schema {#the-writers-schema-and-the-readers-schema}
When an application wants to encode some data (to write it to a file or database, to send it over
the network, etc.), it encodes the data using whatever version of the schema it knows about—for
example, that schema may be compiled into the application. This is known as the *writer’s schema*.
When an application wants to decode some data (read it from a file or database, receive it from the
network, etc.), it uses two schemas: the writer’s schema that is identical to the one used for
encoding, and the *reader’s schema*, which may be different. This is illustrated in
[Figure 5-5](/en/ch5#fig_encoding_avro_schemas). The reader’s schema defines the fields of each record that the
application code is expecting, and their types.
{{< figure src="/fig/ddia_0505.png" id="fig_encoding_avro_schemas" caption="Figure 5-5. In Protocol Buffers, encoding and decoding can use different versions of a schema. In Avro, decoding uses two schemas: the writer's schema must be identical to the one used for encoding, but the reader's schema can be an older or newer version." class="w-full my-4" >}}
If the reader’s and writer’s schema are the same, decoding is easy. If they are different, Avro
resolves the differences by looking at the writer’s schema and the reader’s schema side by side and
translating the data from the writer’s schema into the reader’s schema. The Avro specification [^16] [^17]
defines exactly how this resolution works, and it is illustrated in [Figure 5-6](/en/ch5#fig_encoding_avro_resolution).
For example, it’s no problem if the writer’s schema and the reader’s schema have their fields in a
different order, because the schema resolution matches up the fields by field name. If the code
reading the data encounters a field that appears in the writer’s schema but not in the reader’s
schema, it is ignored. If the code reading the data expects some field, but the writer’s schema does
not contain a field of that name, it is filled in with a default value declared in the reader’s
schema.
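The resolution rules for record fields can be sketched in a few lines of Python. The function `resolve` below is a simplified illustration that works on already-decoded records represented as dictionaries; it only matches fields by name and fills in defaults, and it does not cover type conversions, aliases, or nested records. The `userID` field in the reader's schema is made up for the example.

```python
def resolve(record: dict, writer_schema: dict, reader_schema: dict) -> dict:
    """Translate a record from the writer's schema into the reader's schema."""
    written = {field["name"] for field in writer_schema["fields"]}
    result = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in written:
            result[name] = record[name]      # matched by name; order is irrelevant
        elif "default" in field:
            result[name] = field["default"]  # missing in writer: use reader's default
        else:
            raise ValueError(f"field {name!r} is missing and has no default")
    # Fields that only the writer knows about are never copied, so they are ignored.
    return result


writer_schema = {"type": "record", "name": "Person", "fields": [
    {"name": "userName", "type": "string"},
    {"name": "favoriteNumber", "type": ["null", "long"], "default": None},
    {"name": "interests", "type": {"type": "array", "items": "string"}},
]}
reader_schema = {"type": "record", "name": "Person", "fields": [
    {"name": "userID", "type": ["null", "long"], "default": None},
    {"name": "favoriteNumber", "type": ["null", "long"], "default": None},
    {"name": "userName", "type": "string"},
]}
record = {"userName": "Martin", "favoriteNumber": 1337,
          "interests": ["daydreaming", "hacking"]}
resolved = resolve(record, writer_schema, reader_schema)
```

In `resolved`, the `interests` field has been dropped, `userID` has been filled in with its default of `None`, and the remaining fields are matched up despite their different order.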
{{< figure src="/fig/ddia_0506.png" id="fig_encoding_avro_resolution" caption="Figure 5-6. An Avro reader resolves differences between the writer's schema and the reader's schema." class="w-full my-4" >}}
#### Schema evolution rules {#schema-evolution-rules}
With Avro, forward compatibility means that you can have a new version of the schema as writer and
an old version of the schema as reader. Conversely, backward compatibility means that you can have a
new version of the schema as reader and an old version as writer.
To maintain compatibility, you may only add or remove a field that has a default value. (The field
`favoriteNumber` in our Avro schema has a default value of `null`.) For example, say you add a
field with a default value, so this new field exists in the new schema but not the old one. When a
reader using the new schema reads a record written with the old schema, the default value is filled
in for the missing field.
If you were to add a field that has no default value, new readers wouldn’t be able to read data
written by old writers, so you would break backward compatibility. If you were to remove a field
that has no default value, old readers wouldn’t be able to read data written by new writers, so you
would break forward compatibility.
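These rules lend themselves to an automated check. The Python sketch below tests only the rule about default values for added and removed fields; the function names are hypothetical, and a real compatibility checker would also need to consider type changes, aliases, and unions.

```python
def missing_defaults(reader_schema: dict, writer_schema: dict) -> list:
    """Reader fields that the writer lacks and that have no default value."""
    written = {field["name"] for field in writer_schema["fields"]}
    return [field["name"] for field in reader_schema["fields"]
            if field["name"] not in written and "default" not in field]


def check_compatibility(old_schema: dict, new_schema: dict) -> dict:
    return {
        # Backward: new schema as reader, old schema as writer.
        "backward": not missing_defaults(new_schema, old_schema),
        # Forward: old schema as reader, new schema as writer.
        "forward": not missing_defaults(old_schema, new_schema),
    }


old = {"fields": [{"name": "userName", "type": "string"}]}
added_with_default = {"fields": [
    {"name": "userName", "type": "string"},
    {"name": "favoriteNumber", "type": ["null", "long"], "default": None},
]}
added_without_default = {"fields": [
    {"name": "userName", "type": "string"},
    {"name": "favoriteNumber", "type": "long"},
]}

safe_change = check_compatibility(old, added_with_default)
unsafe_change = check_compatibility(old, added_without_default)
```

Adding `favoriteNumber` with a default keeps both directions compatible, whereas adding it without a default breaks backward compatibility while leaving forward compatibility intact.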
In some programming languages, `null` is an acceptable default for any variable, but this is not the
case in Avro: if you want to allow a field to be null, you have to use a *union type*. For example,
`union { null, long, string } field;` indicates that `field` can be a number, or a string, or null.
You can only use `null` as a default value if it is the first branch of the union. This is a little
more verbose than having everything nullable by default, but it helps prevent bugs by being explicit
about what can and cannot be null [^18].
Changing the datatype of a field is possible, provided that Avro can convert the type. Changing the
name of a field is possible but a little tricky: the reader’s schema can contain aliases for field
names, so it can match an old writer’s schema field names against the aliases. This means that
changing a field name is backward compatible but not forward compatible. Similarly, adding a branch
to a union type is backward compatible but not forward compatible.
#### But what is the writer’s schema? {#but-what-is-the-writers-schema}
There is an important question that we’ve glossed over so far: how does the reader know the writer’s
schema with which a particular piece of data was encoded? We can’t just include the entire schema
with every record, because the schema would likely be much bigger than the encoded data, making all
the space savings from the binary encoding futile.
The answer depends on the context in which Avro is being used. To give a few examples:
Large file with lots of records
: A common use for Avro is for storing a large file containing millions of records, all encoded with
the same schema. (We will discuss this kind of situation in [Chapter 11](/en/ch11#ch_batch).) In this case, the
writer of that file can just include the writer’s schema once at the beginning of the file. Avro
specifies a file format (object container files) to do this.
Database with individually written records
: In a database, different records may be written at different points in time using different
writer’s schemas—you cannot assume that all the records will have the same schema. The simplest
solution is to include a version number at the beginning of every encoded record, and to keep a
list of schema versions in your database. A reader can fetch a record, extract the version number,
and then fetch the writer’s schema for that version number from the database. Using that writer’s
schema, it can decode the rest of the record.
Confluent’s schema registry for Apache Kafka [^19] and LinkedIn’s Espresso [^20] work this way, for example.
Sending records over a network connection
: When two processes are communicating over a bidirectional network connection, they can negotiate
the schema version on connection setup and then use that schema for the lifetime of the
connection. The Avro RPC protocol (see [“Dataflow Through Services: REST and RPC”](/en/ch5#sec_encoding_dataflow_rpc)) works like this.
A database of schema versions is a useful thing to have in any case, since it acts as documentation
and gives you a chance to check schema compatibility [^21].
As the version number, you could use a simple incrementing integer, or you could use a hash of the schema.
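As a rough illustration of the hash-based variant, the following Python sketch prefixes each encoded record with a short schema identifier and looks the schema up again when reading. Everything here is an assumption made for the example: the in-memory dictionary stands in for a schema database, and the 8-byte truncated SHA-256 hash is an arbitrary choice, not Avro's fingerprint algorithm or the wire format of any particular schema registry.

```python
import hashlib
import json

ID_LENGTH = 8          # arbitrary prefix length chosen for this sketch
schema_registry = {}   # hypothetical stand-in for a database of schema versions


def register(schema: dict) -> bytes:
    """Store a schema and return the identifier derived from its contents."""
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    schema_id = hashlib.sha256(canonical.encode("utf-8")).digest()[:ID_LENGTH]
    schema_registry[schema_id] = schema
    return schema_id


def frame(schema_id: bytes, payload: bytes) -> bytes:
    """Writer side: put the schema identifier in front of the encoded record."""
    return schema_id + payload


def unframe(message: bytes) -> tuple:
    """Reader side: look up the writer's schema, return it with the payload."""
    schema_id, payload = message[:ID_LENGTH], message[ID_LENGTH:]
    return schema_registry[schema_id], payload


person_v1 = {"type": "record", "name": "Person",
             "fields": [{"name": "userName", "type": "string"}]}
schema_id = register(person_v1)
message = frame(schema_id, b"\x0cMartin")  # payload: an encoded record
writer_schema, payload = unframe(message)
```

An advantage of deriving the identifier from the schema contents is that two writers using the same schema arrive at the same identifier without coordinating; an incrementing integer is shorter but needs a central place that hands out numbers.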
#### Dynamically generated schemas {#dynamically-generated-schemas}
One advantage of Avro’s approach, compared to Protocol Buffers, is that the schema doesn’t contain
any tag numbers. But why is this important? What’s the problem with keeping a couple of numbers in
the schema?
The difference is that Avro is friendlier to *dynamically generated* schemas. For example, say
you have a relational database whose contents you want to dump to a file, and you want to use a
binary format to avoid the aforementioned problems with textual formats (JSON, CSV, XML). If you use
Avro, you can fairly easily generate an Avro schema (in the JSON representation we saw earlier) from the
relational schema and encode the database contents using that schema, dumping it all to an Avro
object container file [^22].
You can generate a record schema for each database table, and each column becomes a field in that
record. The column name in the database maps to the field name in Avro.
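For example, such a generator might look like the following Python sketch, which uses SQLite from the standard library as the relational database. The function name and the type mapping are assumptions for illustration, the mapping covers only four declared column types, and the table name is interpolated into the query, so it must come from a trusted source.

```python
import sqlite3

# Assumed mapping from SQLite declared column types to Avro primitive types.
SQL_TO_AVRO = {"INTEGER": "long", "TEXT": "string", "REAL": "double", "BLOB": "bytes"}


def avro_schema_for_table(conn: sqlite3.Connection, table: str) -> dict:
    """Generate an Avro record schema (JSON representation) for one table."""
    fields = []
    # PRAGMA table_info returns (cid, name, type, notnull, dflt_value, pk) per column.
    for _cid, name, sql_type, notnull, _default, _pk in conn.execute(
            f"PRAGMA table_info({table})"):
        avro_type = SQL_TO_AVRO[sql_type.upper()]
        if notnull:
            fields.append({"name": name, "type": avro_type})
        else:
            # Nullable columns become a union with null, defaulting to null.
            fields.append({"name": name, "type": ["null", avro_type], "default": None})
    return {"type": "record", "name": table, "fields": fields}


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (user_name TEXT NOT NULL, favorite_number INTEGER)")
schema = avro_schema_for_table(conn, "person")
```

Note that the generator never has to invent or remember any identifiers beyond the column names that the database already provides.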
Now, if the database schema changes (for example, a table has one column added and one column
removed), you can just generate a new Avro schema from the updated database schema and export data in
the new Avro schema. The data export process does not need to pay any attention to the schema
change—it can simply do the schema conversion every time it runs. Anyone who reads the new data
files will see that the fields of the record have changed, but since the fields are identified by
name, the updated writer’s schema can still be matched up with the old reader’s schema.
By contrast, if you were using Protocol Buffers for this purpose, the field tags would likely have
to be assigned by hand: every time the database schema changes, an administrator would have to
manually update the mapping from database column names to field tags. (It might be possible to
automate this, but the schema generator would have to be very careful to not assign previously used
field tags.) This kind of dynamically generated schema simply wasn’t a design goal of Protocol
Buffers, whereas it was for Avro.
### The Merits of Schemas {#sec_encoding_schemas}
As we saw, Protocol Buffers and Avro both use a schema to describe a binary encoding format. Their
schema languages are much simpler than XML Schema or JSON Schema, which support much more detailed
validation rules (e.g., “the string value of this field must match this regular expression” or “the
integer value of this field must be between 0 and 100”). As Protocol Buffers and Avro are simpler to
implement and simpler to use, they have grown to support a fairly wide range of programming
languages.
The ideas on which these encodings are based are by no means new. For example, they have a lot in
common with ASN.1, a schema definition language that was first standardized in 1984 [^23] [^24].
It was used to define various network protocols, and its binary encoding (DER) is still used to encode
SSL certificates (X.509), for example [^25].
ASN.1 supports schema evolution using tag numbers, similar to Protocol Buffers [^26].
However, it’s also very complex and badly documented, so ASN.1 is probably not a good choice for new applications.
Many data systems also implement some kind of proprietary binary encoding for their data. For
example, most relational databases have a network protocol over which you can send queries to the
database and get back responses. Those protocols are generally specific to a particular database,
and the database vendor provides a driver (e.g., using the ODBC or JDBC APIs) that decodes responses
from the database’s network protocol into in-memory data structures.
So, we can see that although textual data formats such as JSON, XML, and CSV are widespread, binary
encodings based on schemas are also a viable option. They have a number of nice properties:
* They can be much more compact than the various “binary JSON” variants, since they can omit field
names from the encoded data.
* The schema is a valuable form of documentation, and because the schema is required for decoding,
you can be sure that it is up to date (whereas manually maintained documentation may easily
diverge from reality).
* Keeping a database of schemas allows you to check forward and backward compatibility of schema
changes, before anything is deployed.
* For users of statically typed programming languages, the ability to generate code from the schema
is useful, since it enables type-checking at compile time.
In summary, schema evolution allows the same kind of flexibility as schemaless/schema-on-read JSON
databases provide (see [“Schema flexibility in the document model”](/en/ch3#sec_datamodels_schema_flexibility)), while also providing better
guarantees about your data and better tooling.
## Modes of Dataflow {#sec_encoding_dataflow}
At the beginning of this chapter we said that whenever you want to send some data to another process
with which you don’t share memory—for example, whenever you want to send data over the network or
write it to a file—you need to encode it as a sequence of bytes. We then discussed a variety of
different encodings for doing this.
We talked about forward and backward compatibility, which are important for evolvability (making
change easy by allowing you to upgrade different parts of your system independently, and not having
to change everything at once). Compatibility is a relationship between one process that encodes the
data, and another process that decodes it.
That’s a fairly abstract idea—there are many ways data can flow from one process to another.
Who encodes the data, and who decodes it? In the rest of this chapter we will explore some of the
most common ways that data flows between processes:
* Via databases (see [“Dataflow Through Databases”](/en/ch5#sec_encoding_dataflow_db))
* Via service calls (see [“Dataflow Through Services: REST and RPC”](/en/ch5#sec_encoding_dataflow_rpc))
* Via workflow engines (see [“Durable Execution and Workflows”](/en/ch5#sec_encoding_dataflow_workflows))
* Via asynchronous messages (see [“Event-Driven Architectures”](/en/ch5#sec_encoding_dataflow_msg))
### Dataflow Through Databases {#sec_encoding_dataflow_db}
In a database, the process that writes to the database encodes the data, and the process that reads
from the database decodes it. There may just be a single process accessing the database, in which
case the reader is simply a later version of the same process. You can then think of
storing something in the database as *sending a message to your future self*.
Backward compatibility is clearly necessary here; otherwise your future self won’t be able to decode
what you previously wrote.
In general, it’s common for several different processes to be accessing a database at the same time.
Those processes might be several different applications or services, or they may simply be several
instances of the same service (running in parallel for scalability or fault tolerance). Either way,
in an environment where the application is changing, it is likely that some processes accessing the
database will be running newer code and some will be running older code—for example because a new
version is currently being deployed in a rolling upgrade, so some instances have been updated while
others haven’t yet.
This means that a value in the database may be written by a *newer* version of the code, and
subsequently read by an *older* version of the code that is still running. Thus, forward
compatibility is also often required for databases.
#### Different values written at different times {#different-values-written-at-different-times}
A database generally allows any value to be updated at any time. This means that within a single
database you may have some values that were written five milliseconds ago, and some values that were
written five years ago.
When you deploy a new version of your application (of a server-side application, at least), you may
entirely replace the old version with the new version within a few minutes. The same is not true of
database contents: the five-year-old data will still be there, in the original encoding, unless you
have explicitly rewritten it since then. This observation is sometimes summed up as *data outlives
code*.
Rewriting (*migrating*) data into a new schema is certainly possible, but it’s an expensive thing to
do on a large dataset, so most databases avoid it if possible. Most relational databases allow
simple schema changes, such as adding a new column with a `null` default value, without rewriting
existing data. When an old row is read, the database fills in `null`s for any columns that are
missing from the encoded data on disk.
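This behavior is easy to observe with SQLite, which ships with Python. The sketch below shows only what a reader sees after the schema change; it does not demonstrate how the rows are laid out on disk, and the table and values are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (user_name TEXT)")
conn.execute("INSERT INTO person VALUES ('Martin')")  # written with the old schema

# Add a column without specifying a default, so the default is null.
conn.execute("ALTER TABLE person ADD COLUMN favorite_number INTEGER")
conn.execute("INSERT INTO person VALUES ('Alice', 1337)")  # written with the new schema

rows = conn.execute(
    "SELECT user_name, favorite_number FROM person ORDER BY rowid"
).fetchall()
```

The row that was inserted before the schema change is returned with `None` (SQL `NULL`) in the new column, alongside the row written afterward.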
Schema evolution thus allows the entire database to appear as if it was encoded with a single
schema, even though the underlying storage may contain records encoded with various historical
versions of the schema.
More complex schema changes—for example, changing a single-valued attribute to be multi-valued, or
moving some data into a separate table—still require data to be rewritten, often at the application level [^27].
Maintaining forward and backward compatibility across such migrations is still a research problem [^28].
#### Archival storage {#archival-storage}
Perhaps you take a snapshot of your database from time to time, say for backup purposes or for
loading into a data warehouse (see [“Data Warehousing”](/en/ch1#sec_introduction_dwh)). In this case, the data dump will typically
be encoded using the latest schema, even if the original encoding in the source database contained a
mixture of schema versions from different eras. Since you’re copying the data anyway, you might as
well encode the copy of the data consistently.
As the data dump is written in one go and is thereafter immutable, formats like Avro object
container files are a good fit. This is also a good opportunity to encode the data in an
analytics-friendly column-oriented format such as Parquet (see [“Column Compression”](/en/ch4#sec_storage_column_compression)).
In [Chapter 11](/en/ch11#ch_batch) we will talk more about using data in archival storage.
### Dataflow Through Services: REST and RPC {#sec_encoding_dataflow_rpc}
When you have processes that need to communicate over a network, there are a few different ways of
arranging that communication. The most common arrangement is to have two roles: *clients* and
*servers*. The servers expose an API over the network, and the clients can connect to the servers
to make requests to that API. The API exposed by the server is known as a *service*.
The web works this way: clients (web browsers) make requests to web servers, making `GET` requests
to download HTML, CSS, JavaScript, images, etc., and making `POST` requests to submit data to the
server. The API consists of a standardized set of protocols and data formats (HTTP, URLs, SSL/TLS,
HTML, etc.). Because web browsers, web servers, and website authors mostly agree on these standards,
you can use any web browser to access any website (at least in theory!).
Web browsers are not the only type of client. For example, native apps running on mobile devices and
desktop computers often talk to servers, and client-side JavaScript applications running inside web
browsers can also make HTTP requests.
In this case, the server’s response is typically not HTML for displaying to a human, but rather data
in an encoding that is convenient for further processing by the client-side application code (most
often JSON). Although HTTP may be used as the transport protocol, the API implemented on top is
application-specific, and the client and server need to agree on the details of that API.
In some ways, services are similar to databases: they typically allow clients to submit and query
data. However, while databases allow arbitrary queries using the query languages we discussed in
[Chapter 3](/en/ch3#ch_datamodels), services expose an application-specific API that only allows inputs and outputs
that are predetermined by the business logic (application code) of the service [^29]. This restriction provides a degree of encapsulation: services can impose
fine-grained restrictions on what clients can and cannot do.
A key design goal of a service-oriented/microservices architecture is to make the application easier
to change and maintain by making services independently deployable and evolvable. A common principle
is that each service should be owned by one team, and that team should be able to release new
versions of the service frequently, without having to coordinate with other teams. We should
therefore expect old and new versions of servers and clients to be running at the same time, and so
the data encoding used by servers and clients must be compatible across versions of the service API.
#### Web services {#sec_web_services}
When HTTP is used as the underlying protocol for talking to the service, it is called a *web
service*. Web services are commonly used when building a service-oriented or microservices
architecture (discussed earlier in [“Microservices and Serverless”](/en/ch1#sec_introduction_microservices)). The term “web service” is
perhaps a slight misnomer, because web services are not only used on the web, but in several
different contexts. For example:
1. A client application running on a user’s device (e.g., a native app on a mobile device, or a
JavaScript web app in a browser) making requests to a service over HTTP. These requests typically go over the public internet.
2. One service making requests to another service owned by the same organization, often located
within the same datacenter, as part of a service-oriented/microservices architecture.
3. One service making requests to a service owned by a different organization, usually via the
internet. This is used for data exchange between different organizations’ backend systems. This
category includes public APIs provided by online services, such as credit card processing
systems, or OAuth for shared access to user data.
The most popular service design philosophy is REST, which builds upon the principles of HTTP [^30] [^31].
It emphasizes simple data formats, using URLs for identifying resources and using HTTP features for
cache control, authentication, and content type negotiation. An API designed according to the
principles of REST is called *RESTful*.
Code that needs to invoke a web service API must know which HTTP endpoint to query, and what data
format to send and expect in response. Even if a service adopts RESTful design principles, clients
need to somehow find out these details. Service developers often use an interface definition
language (IDL) to define and document their service’s API endpoints and data models, and to evolve
them over time. Other developers can then use the service definition to determine how to query the
service. The two most popular service IDLs are OpenAPI (also known as Swagger [^32])
and gRPC. OpenAPI is used for web services that send and receive JSON data, while gRPC services send
and receive Protocol Buffers.
Developers typically write OpenAPI service definitions in JSON or YAML; see [Example 5-3](/en/ch5#fig_open_api_def).
The service definition allows developers to define service endpoints, documentation, versions, data
models, and much more. gRPC definitions look similar, but are defined using Protocol Buffers service definitions.
{{< figure id="fig_open_api_def" title="Example 5-3. Example OpenAPI service definition in YAML" class="w-full my-4" >}}
```yaml
openapi: 3.0.0
info:
  title: Ping, Pong
  version: 1.0.0
servers:
  - url: http://localhost:8080
paths:
  /ping:
    get:
      summary: Given a ping, returns a pong message
      responses:
        '200':
          description: A pong
          content:
            application/json:
              schema:
                type: object
                properties:
                  message:
                    type: string
                    example: Pong!
```
Even if a design philosophy and IDL are adopted, developers must still write the code that
implements their service’s API calls. A service framework is often adopted to simplify this
effort. Service frameworks such as Spring Boot, FastAPI, and gRPC allow developers to write the
business logic for each API endpoint while the framework code handles routing, metrics, caching,
authentication, and so on. [Example 5-4](/en/ch5#fig_fastapi_def) shows an example Python implementation of the service
defined in [Example 5-3](/en/ch5#fig_open_api_def).
{{< figure id="fig_fastapi_def" title="Example 5-4. Example FastAPI service implementing the definition from [Example 5-3](/en/ch5#fig_open_api_def)" class="w-full my-4" >}}
```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Ping, Pong", version="1.0.0")

class PongResponse(BaseModel):
    message: str = "Pong!"

@app.get("/ping", response_model=PongResponse,
         summary="Given a ping, returns a pong message")
async def ping():
    return PongResponse()
```
Many frameworks couple service definitions and server code together. In some cases, such as with the
popular Python FastAPI framework, servers are written in code and an IDL is generated automatically.
In other cases, such as with gRPC, the service definition is written first, and server code
scaffolding is generated. Both approaches allow developers to generate client libraries and SDKs
in a variety of languages from the service definition. In addition to code generation, IDL tools
such as Swagger’s can generate documentation, verify schema change compatibility, and provide
graphical user interfaces for developers to query and test services.
#### The problems with remote procedure calls (RPCs) {#sec_problems_with_rpc}
Web services are merely the latest incarnation of a long line of technologies for making API
requests over a network, many of which received a lot of hype but have serious problems. Enterprise
JavaBeans (EJB) and Java’s Remote Method Invocation (RMI) are limited to Java. The Distributed
Component Object Model (DCOM) is limited to Microsoft platforms. The Common Object Request Broker
Architecture (CORBA) is excessively complex, and does not provide backward or forward compatibility [^33].
SOAP and the WS-\* web services framework aim to provide interoperability across vendors, but are
also plagued by complexity and compatibility problems [^34] [^35] [^36].
All of these are based on the idea of a *remote procedure call* (RPC), which has been around since the 1970s [^37].
The RPC model tries to make a request to a remote network service look the same as calling a function or
method in your programming language, within the same process (this abstraction is called *location
transparency*). Although RPC seems convenient at first, the approach is fundamentally flawed [^38] [^39].
A network request is very different from a local function call:
* A local function call is predictable and either succeeds or fails, depending only on parameters
that are under your control. A network request is unpredictable: the request or response may be
lost due to a network problem, or the remote machine may be slow or unavailable, and such problems
are entirely outside of your control. Network problems are common, so you have to anticipate them,
for example by retrying a failed request.
* A local function call either returns a result, or throws an exception, or never returns (because
it goes into an infinite loop or the process crashes). A network request has another possible
outcome: it may return without a result, due to a *timeout*. In that case, you simply don’t know
what happened: if you don’t get a response from the remote service, you have no way of knowing
whether the request got through or not. (We discuss this issue in more detail in [Chapter 9](/en/ch9#ch_distributed).)
* If you retry a failed network request, it could happen that the previous request actually got
through, and only the response was lost. In that case, retrying will cause the action to
be performed multiple times, unless you build a mechanism for deduplication (*idempotence*) into the protocol [^40].
Local function calls don’t have this problem. (We discuss idempotence in more detail in [“Idempotence”](/en/ch12#sec_stream_idempotence).)
* Every time you call a local function, it normally takes about the same time to execute. A network
request is much slower than a function call, and its latency is also wildly variable: at good
times it may complete in less than a millisecond, but when the network is congested or the remote
service is overloaded it may take many seconds to do exactly the same thing.
* When you call a local function, you can efficiently pass it references (pointers) to objects in
local memory. When you make a network request, all those parameters need to be encoded into a
sequence of bytes that can be sent over the network. That’s okay if the parameters are immutable
primitives like numbers or short strings, but it quickly becomes problematic with larger amounts
of data and mutable objects.
* The client and the service may be implemented in different programming languages, so the RPC
framework must translate datatypes from one language into another. This can end up ugly, since not
all languages have the same types: recall, for example, JavaScript’s problems with numbers greater than 2^53
(see [“JSON, XML, and Binary Variants”](/en/ch5#sec_encoding_json)).
This problem doesn’t exist in a single process written in a single language.
All of these factors mean that there’s no point trying to make a remote service look too much like a
local object in your programming language, because it’s a fundamentally different thing. Part of the
appeal of REST is that it treats state transfer over a network as a process that is distinct from a
function call.
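The retry and deduplication problem described above can be sketched in a few lines. This is a minimal, in-process simulation rather than a real network protocol: `PaymentServer`, `FlakyNetwork`, and `deposit_with_retries` are hypothetical names invented for illustration, and the "network" simply loses a configurable number of responses after the request has already been applied.

```python
import uuid


class PaymentServer:
    """Toy server that deduplicates requests by a client-supplied key."""

    def __init__(self):
        self.balance = 0
        self._results = {}  # idempotency key -> result of the first execution

    def deposit(self, idempotency_key: str, amount: int) -> int:
        # A retry of a request we already processed returns the stored result
        # instead of applying the deposit a second time.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.balance += amount
        self._results[idempotency_key] = self.balance
        return self.balance


class FlakyNetwork:
    """Delivers every request, but loses the first `lose_responses` responses."""

    def __init__(self, server: PaymentServer, lose_responses: int):
        self.server = server
        self.lose_responses = lose_responses

    def call_deposit(self, idempotency_key: str, amount: int) -> int:
        result = self.server.deposit(idempotency_key, amount)  # request got through
        if self.lose_responses > 0:
            self.lose_responses -= 1
            raise TimeoutError("no response received")  # client cannot tell why
        return result


def deposit_with_retries(network: FlakyNetwork, amount: int, max_attempts: int = 3) -> int:
    key = str(uuid.uuid4())  # one key per logical operation, reused on every retry
    for _ in range(max_attempts):
        try:
            return network.call_deposit(key, amount)
        except TimeoutError:
            continue  # we don't know whether the deposit happened, so retry
    raise TimeoutError(f"gave up after {max_attempts} attempts")


server = PaymentServer()
network = FlakyNetwork(server, lose_responses=2)
print(deposit_with_retries(network, 100))  # 100
print(server.balance)                      # 100
```

Without the key check in `deposit`, the same run would apply the deposit three times. None of this bookkeeping is needed for a local function call, which is one reason the two should not be made to look alike.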
#### Load balancers, service discovery, and service meshes {#sec_encoding_service_discovery}
All services communicate over the network. For this reason, a client must know the address of the
service it’s connecting to—a problem known as *service discovery*. The simplest approach is to
configure a client to connect to the IP address and port where the service is running. This
configuration will work, but if the server goes offline, is transferred to a new machine, or becomes
overloaded, the client has to be manually reconfigured.
To provide higher availability and scalability, there are usually multiple instances of a service
running on different machines, any of which can handle an incoming request. Spreading requests
across these instances is called *load balancing* [^41].
There are many load balancing and service discovery solutions available:
* *Hardware load balancers* are specialized pieces of equipment that are installed in data centers.
They allow clients to connect to a single host and port, and incoming connections are routed to
one of the servers running the service. Such load balancers detect network failures when
connecting to a downstream server and shift the traffic to other servers.
* *Software load balancers* behave in much the same way as hardware load balancers. But rather than
requiring a special appliance, software load balancers such as Nginx and HAProxy are applications
that can be installed on a standard machine.
* The *Domain Name System (DNS)* is how domain names are resolved on the Internet when you open a
webpage. It supports load balancing by allowing multiple IP addresses to be associated with a
single domain name. Clients can then be configured to connect to a service using a domain name
rather than IP address, and the client’s network layer picks which IP address to use when making a
connection. One drawback of this approach is that DNS is designed to propagate changes over longer
periods of time, and to cache DNS entries. If servers are started, stopped, or moved frequently,
clients might see stale IP addresses that no longer have a server running on them.
* *Service discovery systems* use a centralized registry rather than DNS to track which service
endpoints are available. When a new service instance starts up, it registers itself with the
service discovery system by declaring the host and port it’s listening on, along with relevant
metadata such as shard ownership information (see [Chapter 7](/en/ch7#ch_sharding)), data center location,
and more. The service then periodically sends a heartbeat signal to the discovery system to signal
that the service is still available.
When a client wishes to connect to a service, it first queries the discovery system to get a list of
available endpoints, and then connects directly to the endpoint. Compared to DNS, service discovery
supports a much more dynamic environment where service instances change frequently. Discovery
systems also give clients more metadata about the service they’re connecting to, which enables
clients to make smarter load balancing decisions.
* *Service meshes* are a sophisticated form of load balancing that combine software load balancers
and service discovery. Unlike traditional software load balancers, which run on a separate
machine, service mesh load balancers are typically deployed as an in-process client library or as
a process or “sidecar” container on both the client and server. Client applications connect
to their own local service load balancer, which connects to the server’s load balancer. From
there, the connection is routed to the local server process.
Though complicated, this topology offers a number of advantages. Because the clients and servers are
routed entirely through local connections, connection encryption can be handled entirely at the load
balancer level. This shields clients and servers from having to deal with the complexities of SSL
certificates and TLS. Mesh systems also provide sophisticated observability. They can track which
services are calling each other in realtime, detect failures, track traffic load, and more.
Which solution is appropriate depends on an organization’s needs. Those running in a very dynamic
service environment with an orchestrator such as Kubernetes often choose to run a service mesh such
as Istio or Linkerd. Specialized infrastructure such as databases or messaging systems might require
their own purpose-built load balancer. Simpler deployments are best served with software load
balancers.
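To make the registration, heartbeat, and lookup steps of a service discovery system more concrete, here is a minimal in-memory sketch. `ServiceRegistry` and `RoundRobinClient` are hypothetical classes written for this illustration, not the API of any real discovery system; the time-to-live and the round-robin policy are assumptions, and the clock is injectable only so that the example is deterministic.

```python
import itertools
import time


class ServiceRegistry:
    """In-memory registry; instances that stop heartbeating expire after `ttl` seconds."""

    def __init__(self, ttl: float = 10.0, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self._last_seen = {}  # (service, host, port) -> time of last heartbeat

    def register(self, service: str, host: str, port: int) -> None:
        self._last_seen[(service, host, port)] = self.clock()

    # In this sketch a heartbeat simply refreshes the registration timestamp.
    heartbeat = register

    def lookup(self, service: str) -> list:
        now = self.clock()
        return sorted(
            (host, port)
            for (name, host, port), seen in self._last_seen.items()
            if name == service and now - seen <= self.ttl
        )


class RoundRobinClient:
    """Client-side load balancing: query the registry, then rotate through endpoints."""

    def __init__(self, registry: ServiceRegistry, service: str):
        self.registry = registry
        self.service = service
        self._counter = itertools.count()

    def pick_endpoint(self) -> tuple:
        endpoints = self.registry.lookup(self.service)
        if not endpoints:
            raise LookupError(f"no live instances of {self.service}")
        return endpoints[next(self._counter) % len(endpoints)]


now = [0.0]  # fake clock so the example does not need to sleep
registry = ServiceRegistry(ttl=10.0, clock=lambda: now[0])
registry.register("payments", "10.0.0.1", 8080)
registry.register("payments", "10.0.0.2", 8080)

client = RoundRobinClient(registry, "payments")
first, second = client.pick_endpoint(), client.pick_endpoint()

now[0] = 8.0
registry.heartbeat("payments", "10.0.0.2", 8080)  # only one instance stays alive
now[0] = 15.0
live = registry.lookup("payments")
print(first, second, live)
```

A production system would also have to replicate the registry itself and cope with clients that hold a stale endpoint list, which this sketch ignores.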
#### Data encoding and evolution for RPC {#data-encoding-and-evolution-for-rpc}
For evolvability, it is important that RPC clients and servers can be changed and deployed
independently. Compared to data flowing through databases (as described in the last section), we can make a
simplifying assumption in the case of dataflow through services: it is reasonable to assume that
all the servers will be updated first, and all the clients second. Thus, you only need backward
compatibility on requests, and forward compatibility on responses.
The backward and forward compatibility properties of an RPC scheme are inherited from whatever encoding it uses:
* gRPC (Protocol Buffers) and Avro RPC can be evolved according to the compatibility rules of the respective encoding format.
* RESTful APIs most commonly use JSON for responses, and JSON or URI-encoded/form-encoded request
parameters for requests. Adding optional request parameters and adding new fields to response
objects are usually considered changes that maintain compatibility.
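As a rough illustration of these two kinds of compatible change for a JSON API, the sketch below pairs a newer server with an older client. The function names, the `limit` parameter, and the `total` field are all hypothetical; the server is called directly instead of over HTTP to keep the example self-contained.

```python
import json


def handle_search_v2(request_json: str) -> str:
    """Newer server: accepts an optional `limit` parameter and returns a new `total` field."""
    request = json.loads(request_json)
    query = request["query"]
    limit = request.get("limit", 10)  # optional: old clients never send it
    matches = [word for word in ["avro", "json", "jsonb", "xml"] if query in word]
    return json.dumps({"results": matches[:limit], "total": len(matches)})


def old_client_search(query: str) -> list:
    """Older client: sends only `query` and reads only `results`."""
    response = json.loads(handle_search_v2(json.dumps({"query": query})))
    return response["results"]  # the unknown `total` field is simply ignored


def new_client_search(query: str, limit: int) -> dict:
    """Newer client: uses both the new parameter and the new response field."""
    return json.loads(handle_search_v2(json.dumps({"query": query, "limit": limit})))


print(old_client_search("json"))
print(new_client_search("json", limit=1))
```

The old client keeps working because the server supplies a default for the parameter it does not know about, and because the client never rejects response fields it does not recognize.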
Service compatibility is made harder by the fact that RPC is often used for communication across
organizational boundaries, so the provider of a service often has no control over its clients and
cannot force them to upgrade. Thus, compatibility needs to be maintained for a long time, perhaps
indefinitely. If a compatibility-breaking change is required, the service provider often ends up
maintaining multiple versions of the service API side by side.
There is no agreement on how API versioning should work (i.e., how a client can indicate which
version of the API it wants to use [^42]).
For RESTful APIs, common approaches are to use a version
number in the URL or in the HTTP `Accept` header. For services that use API keys to identify a
particular client, another option is to store a client’s requested API version on the server and to
allow this version selection to be updated through a separate administrative interface [^43].
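The first two approaches can be sketched as a small helper that a server might run before dispatching a request. The `/v2/...` path layout and the `application/vnd.example.v2+json` media type are assumptions made up for this example, not a standard; the sketch also assumes the header dictionary uses the exact key `Accept`.

```python
import re

DEFAULT_VERSION = 1  # assumed fallback when the client states no preference
_PATH_VERSION = re.compile(r"^/v(\d+)/")
_ACCEPT_VERSION = re.compile(r"application/vnd\.example\.v(\d+)\+json")


def requested_version(path: str, headers: dict) -> int:
    """Return the API version a request asks for, preferring the URL over the header."""
    match = _PATH_VERSION.match(path)
    if match:
        return int(match.group(1))
    match = _ACCEPT_VERSION.search(headers.get("Accept", ""))
    if match:
        return int(match.group(1))
    return DEFAULT_VERSION


print(requested_version("/v2/ping", {}))
print(requested_version("/ping", {"Accept": "application/vnd.example.v3+json"}))
print(requested_version("/ping", {"Accept": "application/json"}))
```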
### Durable Execution and Workflows {#sec_encoding_dataflow_workflows}
By definition, service-based architectures have multiple services that are all responsible for
different portions of an application. Consider a payment processing application that charges a
credit card and deposits the funds into a bank account. This system would likely have different
services responsible for fraud detection, credit card integration, bank integration, and so on.
Processing a single payment in our example requires many service calls. A payment processor service
might invoke the fraud detection service to check for fraud, call the credit card service to debit
the credit card, and call the banking service to deposit debited funds, as shown in
[Figure 5-7](/en/ch5#fig_encoding_workflow). We call this sequence of steps a *workflow*, and each step a *task*.
Workflows are typically defined as a graph of tasks. Workflow definitions may be written in a
general-purpose programming language, a domain specific language (DSL), or a markup language such as
Business Process Execution Language (BPEL) [^44].
--------
> [!TIP] TASKS, ACTIVITIES, AND FUNCTIONS
> Different workflow engines use different names for tasks. Temporal, for example, uses the term
> *activity*. Others refer to tasks as *durable functions*. Though the names differ, the concepts are the same.
--------
{{< figure src="/fig/ddia_0507.png" id="fig_encoding_workflow" title="Figure 5-7. Example of a workflow expressed using Business Process Model and Notation (BPMN), a graphical notation." class="w-full my-4" >}}
Workflows are run, or executed, by a *workflow engine*. Workflow engines determine when to run each
task, on which machine a task must be run, what to do if a task fails (e.g., if the machine crashes
while the task is running), how many tasks are allowed to execute in parallel, and more.
Workflow engines are typically composed of an orchestrator and an executor. The orchestrator is
responsible for scheduling tasks to be executed and the executor is responsible for executing tasks.
Execution begins when a workflow is triggered. The orchestrator triggers the workflow itself if
users define a time-based schedule, such as hourly execution. External sources such as a web service
or even a human can also trigger workflow executions. Once triggered, executors are invoked to run
tasks.
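The split between defining a workflow as a graph of tasks and having an orchestrator decide what to run next can be sketched with Python's standard `graphlib` module. This is a toy, single-machine illustration rather than how any particular workflow engine is built; the task names follow the payment example, except `send_receipt`, which is a hypothetical extra task added so that the graph is not a simple chain.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it can start.
workflow = {
    "check_fraud": set(),
    "debit_credit_card": {"check_fraud"},
    "deposit_to_bank": {"debit_credit_card"},
    "send_receipt": {"debit_credit_card"},  # hypothetical task, for illustration
}


def run_workflow(graph: dict, execute) -> list:
    """Orchestrator loop: hand every ready task to the executor, in dependency order."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the graph contains a cycle
    completed = []
    while sorter.is_active():
        # Tasks returned together have no unmet dependencies and could run in parallel;
        # here the "executor" just runs them one after another.
        for task in sorted(sorter.get_ready()):
            execute(task)
            sorter.done(task)
            completed.append(task)
    return completed


order = run_workflow(workflow, execute=lambda task: print("running", task))
```

A real engine would dispatch ready tasks to executors on other machines and would persist which tasks have completed, so that a crash does not force the whole workflow to start over.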
There are many kinds of workflow engines that address a diverse set of use cases. Some, such as
Airflow, Dagster, and Prefect, integrate with data systems and orchestrate ETL tasks. Others, such
as Camunda and Orkes, provide a graphical notation for workflows (such as BPMN, used in
[Figure 5-7](/en/ch5#fig_encoding_workflow)) so that non-engineers can more easily define and execute workflows. Still
others, such as Temporal and Restate, provide *durable execution*.
#### Durable execution {#durable-execution}
Durable execution frameworks have become a popular way to build service-based architectures that
require transactionality. In our payment example, we would like to process each payment exactly
once. A failure while the workflow is executing could result in a credit card charge, but no
corresponding bank account deposit. In a service-based architecture, we can’t simply wrap the two
tasks in a database transaction. Moreover, we might be interacting with third-party payment gateways
that we have limited control over.
Durable execution frameworks are a way to provide *exactly-once semantics* for workflows. If a
task fails, the framework will re-execute the task, but will skip any RPC calls or state changes
that the task made successfully before failing. Rather than repeating such a call, the framework
pretends to make it and returns the result recorded from the previous execution. This is possible because durable
execution frameworks log all RPCs and state changes to durable storage like a write-ahead log [^45] [^46].
[Example 5-5](/en/ch5#fig_temporal_workflow) shows an example of a workflow definition that supports durable execution
using Temporal.
{{< figure id="fig_temporal_workflow" title="Example 5-5. A Temporal workflow definition fragment for the payment workflow in [Figure 5-7](/en/ch5#fig_encoding_workflow)." class="w-full my-4" >}}
```python
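from datetime import timedelta

from temporalio import workflow

# check_fraud and debit_credit_card are activities; they and the names
# PaymentRequest, PaymentResult, and PaymentResultFraudulent are assumed
# to be defined elsewhere in the application.

@workflow.defn
class PaymentWorkflow:
    @workflow.run
    async def run(self, payment: PaymentRequest) -> PaymentResult:
        is_fraud = await workflow.execute_activity(
            check_fraud,
            payment,
            start_to_close_timeout=timedelta(seconds=15),
        )
        if is_fraud:
            return PaymentResultFraudulent
        credit_card_response = await workflow.execute_activity(
            debit_credit_card,
            payment,
            start_to_close_timeout=timedelta(seconds=15),
        )
        # ...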
```
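The replay mechanism behind durable execution can be illustrated without any framework. The following is a deliberately simplified sketch, not how Temporal or any other engine is implemented: `Journal` stands in for durable storage, `DurableContext.call` is a hypothetical helper that either replays a recorded result or performs the call and records it, and the simulated bank failure exists only to force a re-execution.

```python
class Journal:
    """Stand-in for durable storage: results of completed steps, in call order."""

    def __init__(self):
        self.entries = []  # list of (step_name, result)


class DurableContext:
    def __init__(self, journal: Journal):
        self.journal = journal
        self.position = 0  # index of the next journal entry to replay

    def call(self, name, fn, *args):
        if self.position < len(self.journal.entries):
            recorded_name, result = self.journal.entries[self.position]
            if recorded_name != name:
                # The code no longer makes the same calls in the same order.
                raise RuntimeError(f"replay mismatch: expected {recorded_name}, got {name}")
            self.position += 1
            return result  # skip the real call and return the recorded result
        result = fn(*args)
        self.journal.entries.append((name, result))
        self.position += 1
        return result


charges = []          # side effects we want to happen exactly once
bank_is_down = [True]


def check_fraud(payment):
    return False


def debit_credit_card(payment):
    charges.append(payment)
    return "charged"


def deposit_to_bank(payment):
    if bank_is_down[0]:
        bank_is_down[0] = False
        raise ConnectionError("bank unavailable")
    return "deposited"


def payment_workflow(ctx, payment):
    if ctx.call("check_fraud", check_fraud, payment):
        return "fraudulent"
    ctx.call("debit_credit_card", debit_credit_card, payment)
    return ctx.call("deposit_to_bank", deposit_to_bank, payment)


journal = Journal()
try:
    payment_workflow(DurableContext(journal), 42)
except ConnectionError:
    pass  # the first execution fails after the card was charged; the journal survives

result = payment_workflow(DurableContext(journal), 42)  # re-execution replays the journal
print(result, charges)  # deposited [42]
```

The card is charged once even though the workflow function ran twice. The sketch also hints at the limitations: if the process crashed after `debit_credit_card` succeeded but before the journal entry was written, the charge would be repeated, and reordering the `ctx.call` lines would trigger the replay mismatch error.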
Frameworks like Temporal are not without their challenges. External services, such as the
third-party payment gateway in our example, must still provide an idempotent API. Developers must
remember to use unique IDs for these APIs to prevent duplicate execution [^47].
And because durable execution frameworks log each RPC call in order, they expect a subsequent
execution to make the same RPC calls in the same order. This makes code changes brittle: you
might introduce undefined behavior simply by re-ordering function calls [^48].
Instead of modifying the code of an existing workflow, it is safer to deploy a new version of the
code separately, so that re-executions of existing workflow invocations continue to use the old
version, and only new invocations use the new code [^49].
Similarly, because durable execution frameworks expect to replay all code deterministically (the
same inputs produce the same outputs), nondeterministic code such as random number generators or system clocks is problematic [^48].
Frameworks often provide their own, deterministic implementations of such library functions, but
you have to remember to use them. In some cases, such as with Temporal’s workflowcheck tool,
frameworks provide static analysis tools to determine if nondeterministic behavior has been introduced.
--------
> [!NOTE]
> Making code deterministic is a powerful idea, but tricky to do robustly. In
> [“The Power of Determinism”](/en/ch9#sidebar_distributed_determinism) we will return to this topic.
--------
### Event-Driven Architectures {#sec_encoding_dataflow_msg}
In this final section, we will briefly look at *event-driven architectures*, which are another way
that encoded data can flow from one process to another. A request is called an *event* or *message*;
unlike RPC, the sender usually does not wait for the recipient to process the event. Moreover,
events are typically not sent to the recipient via a direct network connection, but go via an
intermediary called a *message broker* (also called an *event broker*, *message queue*, or
*message-oriented middleware*), which stores the message temporarily [^50].
Using a message broker has several advantages compared to direct RPC:
* It can act as a buffer if the recipient is unavailable or overloaded, and thus improve system reliability.
* It can automatically redeliver messages to a process that has crashed, and thus prevent messages from being lost.
* It avoids the need for service discovery, since senders do not need to directly connect to the IP address of the recipient.
* It allows the same message to be sent to several recipients.
* It logically decouples the sender from the recipient (the sender just publishes messages and doesn’t care who consumes them).
The communication via a message broker is *asynchronous*: the sender doesn’t wait for the message to
be delivered, but simply sends it and then forgets about it. It’s possible to implement a
synchronous RPC-like model by having the sender wait for a response on a separate channel.
#### Message brokers {#message-brokers}
In the past, the landscape of message brokers was dominated by commercial enterprise software from
companies such as TIBCO, IBM WebSphere, and webMethods, before open source implementations such as
RabbitMQ, ActiveMQ, HornetQ, NATS, and Apache Kafka became popular. More recently, cloud services
such as Amazon Kinesis, Azure Service Bus, and Google Cloud Pub/Sub have gained adoption. We will
compare them in more detail in [“Messaging Systems”](/en/ch12#sec_stream_messaging).
The detailed delivery semantics vary by implementation and configuration, but in general, two
message distribution patterns are most often used:
* One process adds a message to a named *queue*, and the broker delivers that message to a
*consumer* of that queue. If there are multiple consumers, one of them receives the message.
* One process publishes a message to a named *topic*, and the broker delivers that message to all
*subscribers* of that topic. If there are multiple subscribers, they all receive the message.
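The difference between the two distribution patterns can be sketched with a toy in-memory broker. The `Broker` class and its method names are hypothetical, and it delivers messages synchronously instead of buffering them; rotating through queue consumers in round-robin order is also just one possible policy, chosen here for determinism.

```python
import itertools
from collections import defaultdict


class Broker:
    def __init__(self):
        self._queue_consumers = defaultdict(list)
        self._queue_cursor = defaultdict(itertools.count)  # per-queue round-robin counter
        self._topic_subscribers = defaultdict(list)

    def consume(self, queue: str, handler) -> None:
        self._queue_consumers[queue].append(handler)

    def subscribe(self, topic: str, handler) -> None:
        self._topic_subscribers[topic].append(handler)

    def send(self, queue: str, message: bytes) -> None:
        """Queue semantics: exactly one of the consumers receives the message."""
        consumers = self._queue_consumers[queue]
        if not consumers:
            raise LookupError(f"no consumers for queue {queue}")
        index = next(self._queue_cursor[queue]) % len(consumers)
        consumers[index](message)

    def publish(self, topic: str, message: bytes) -> None:
        """Topic semantics: every subscriber receives the message."""
        for handler in self._topic_subscribers[topic]:
            handler(message)


broker = Broker()

worker_a, worker_b = [], []
broker.consume("jobs", worker_a.append)
broker.consume("jobs", worker_b.append)
for job in (b"job-1", b"job-2", b"job-3"):
    broker.send("jobs", job)

audit_log, email_service = [], []
broker.subscribe("payments", audit_log.append)
broker.subscribe("payments", email_service.append)
broker.publish("payments", b"payment-received")

print(worker_a, worker_b)        # the jobs are split between the workers
print(audit_log, email_service)  # both subscribers see the same message
```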
Message brokers typically don’t enforce any particular data model—a message is just a sequence of
bytes with some metadata, so you can use any encoding format. A common approach is to use Protocol
Buffers, Avro, or JSON, and to deploy a schema registry alongside the message broker to store all
the valid schema versions and check their compatibility [^19] [^21].
AsyncAPI, a messaging-based equivalent of OpenAPI, can also be used to specify the schema of messages.
Message brokers differ in terms of how durable their messages are. Many write messages to disk, so
that they are not lost in case the message broker crashes or needs to be restarted. Unlike
databases, many message brokers automatically delete messages again after they have been consumed.
Some brokers can be configured to store messages indefinitely, which you would require if you want
to use event sourcing (see [“Event Sourcing and CQRS”](/en/ch3#sec_datamodels_events)).
If a consumer republishes messages to another topic, you may need to be careful to preserve unknown
fields, to prevent the issue described previously in the context of databases
([Figure 5-1](/en/ch5#fig_encoding_preserve_field)).
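A small sketch shows how unknown fields get lost in a republishing consumer, and one way to keep them. It assumes JSON-encoded messages; the field names and the two `republish_*` functions are hypothetical. The consumer was written when messages had only `user_id` and `amount`, and a newer producer has since added `currency`.

```python
import json


def republish_lossy(raw: bytes) -> bytes:
    """Copies only the fields this consumer knows about, silently dropping the rest."""
    event = json.loads(raw)
    out = {"user_id": event["user_id"], "amount": event["amount"], "checked": True}
    return json.dumps(out).encode()


def republish_preserving(raw: bytes) -> bytes:
    """Modifies the decoded message in place, so unknown fields pass through untouched."""
    event = json.loads(raw)
    event["checked"] = True
    return json.dumps(event).encode()


incoming = json.dumps({"user_id": 7, "amount": 1200, "currency": "EUR"}).encode()

print(json.loads(republish_lossy(incoming)))       # `currency` is gone
print(json.loads(republish_preserving(incoming)))  # `currency` survives
```

With schema-driven formats the same concern applies when a message is decoded into a generated class and re-encoded; whether unknown fields survive that round trip depends on the format and library in use.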
#### Distributed actor frameworks {#distributed-actor-frameworks}
The *actor model* is a programming model for concurrency in a single process. Rather than dealing
directly with threads (and the associated problems of race conditions, locking, and deadlock), logic
is encapsulated in *actors*. Each actor typically represents one client or entity; it may have some
local state (which is not shared with any other actor), and it communicates with other actors by
sending and receiving asynchronous messages. Message delivery is not guaranteed: in certain error
scenarios, messages will be lost. Since each actor processes only one message at a time, it doesn’t
need to worry about threads, and each actor can be scheduled independently by the framework.
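A single-process actor can be sketched with a mailbox and a loop that handles one message at a time. This example uses `asyncio` from the Python standard library as a stand-in for an actor framework; `CounterActor`, `tell`, and the string messages are hypothetical names chosen for illustration, and unlike a real actor system this toy never loses messages.

```python
import asyncio


class CounterActor:
    def __init__(self):
        self.count = 0                    # private state, never shared with other actors
        self.mailbox = asyncio.Queue()

    def tell(self, message: str) -> None:
        self.mailbox.put_nowait(message)  # asynchronous send: the caller does not wait

    async def run(self) -> None:
        # Messages are handled strictly one at a time, so no locking is needed.
        while True:
            message = await self.mailbox.get()
            if message == "stop":
                return
            if message == "increment":
                self.count += 1


async def main() -> int:
    actor = CounterActor()
    task = asyncio.create_task(actor.run())
    for _ in range(1000):
        actor.tell("increment")
    actor.tell("stop")
    await task
    return actor.count


final_count = asyncio.run(main())
print(final_count)  # 1000
```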
In *distributed actor frameworks* such as Akka, Orleans [^51],
and Erlang/OTP, this programming model is used to scale an application across
multiple nodes. The same message-passing mechanism is used, no matter whether the sender and recipient
are on the same node or different nodes. If they are on different nodes, the message is
transparently encoded into a byte sequence, sent over the network, and decoded on the other side.
Location transparency works better in the actor model than in RPC, because the actor model already
assumes that messages may be lost, even within a single process. Although latency over the network
is likely higher than within the same process, there is less of a fundamental mismatch between local
and remote communication when using the actor model.
A distributed actor framework essentially integrates a message broker and the actor programming
model into a single framework. However, if you want to perform rolling upgrades of your actor-based
application, you still have to worry about forward and backward compatibility, as messages may be
sent from a node running the new version to a node running the old version, and vice versa. This can
be achieved by using one of the encodings discussed in this chapter.
## Summary {#summary}
In this chapter we looked at several ways of turning data structures into bytes on the network or
bytes on disk. We saw how the details of these encodings affect not only their efficiency, but more
importantly also the architecture of applications and your options for evolving them.
In particular, many services need to support rolling upgrades, where a new version of a service is
gradually deployed to a few nodes at a time, rather than deploying to all nodes simultaneously.
Rolling upgrades allow new versions of a service to be released without downtime (thus encouraging
frequent small releases over rare big releases) and make deployments less risky (allowing faulty
releases to be detected and rolled back before they affect a large number of users). These
properties are hugely beneficial for *evolvability*, the ease of making changes to an application.
During rolling upgrades, or for various other reasons, we must assume that different nodes are
running different versions of our application’s code. Thus, it is important that all data
flowing around the system is encoded in a way that provides backward compatibility (new code can
read old data) and forward compatibility (old code can read new data).
We discussed several data encoding formats and their compatibility properties:
* Programming language–specific encodings are restricted to a single programming language and often
fail to provide forward and backward compatibility.
* Textual formats like JSON, XML, and CSV are widespread, and their compatibility depends on how you
use them. They have optional schema languages, which are sometimes helpful and sometimes a
hindrance. These formats are somewhat vague about datatypes, so you have to be careful with things
like numbers and binary strings.
* Binary schema–driven formats like Protocol Buffers and Avro allow compact, efficient encoding with
clearly defined forward and backward compatibility semantics. The schemas can be useful for
documentation and code generation in statically typed languages. However, these formats have the
downside that data needs to be decoded before it is human-readable.
We also discussed several modes of dataflow, illustrating different scenarios in which data
encodings are important:
* Databases, where the process writing to the database encodes the data and the process reading
from the database decodes it
* RPC and REST APIs, where the client encodes a request, the server decodes the request and encodes
a response, and the client finally decodes the response
* Event-driven architectures (using message brokers or actors), where nodes communicate by sending
each other messages that are encoded by the sender and decoded by the recipient
We can conclude that with a bit of care, backward/forward compatibility and rolling upgrades are
quite achievable. May your application’s evolution be rapid and your deployments be frequent.
### References
[^1]: [CWE-502: Deserialization of Untrusted Data](https://cwe.mitre.org/data/definitions/502.html). Common Weakness Enumeration, *cwe.mitre.org*, July 2006. Archived at [perma.cc/26EU-UK9Y](https://perma.cc/26EU-UK9Y)
[^2]: Steve Breen. [What Do WebLogic, WebSphere, JBoss, Jenkins, OpenNMS, and Your Application Have in Common? This Vulnerability](https://foxglovesecurity.com/2015/11/06/what-do-weblogic-websphere-jboss-jenkins-opennms-and-your-application-have-in-common-this-vulnerability/). *foxglovesecurity.com*, November 2015. Archived at [perma.cc/9U97-UVVD](https://perma.cc/9U97-UVVD)
[^3]: Patrick McKenzie. [What the Rails Security Issue Means for Your Startup](https://www.kalzumeus.com/2013/01/31/what-the-rails-security-issue-means-for-your-startup/). *kalzumeus.com*, January 2013. Archived at [perma.cc/2MBJ-7PZ6](https://perma.cc/2MBJ-7PZ6)
[^4]: Brian Goetz. [Towards Better Serialization](https://openjdk.org/projects/amber/design-notes/towards-better-serialization). *openjdk.org*, June 2019. Archived at [perma.cc/UK6U-GQDE](https://perma.cc/UK6U-GQDE)
[^5]: Eishay Smith. [jvm-serializers wiki](https://github.com/eishay/jvm-serializers/wiki). *github.com*, October 2023. Archived at [perma.cc/PJP7-WCNG](https://perma.cc/PJP7-WCNG)
[^6]: [XML Is a Poor Copy of S-Expressions](https://wiki.c2.com/?XmlIsaPoorCopyOfEssExpressions). *wiki.c2.com*, May 2013. Archived at [perma.cc/7FAN-YBKL](https://perma.cc/7FAN-YBKL)
[^7]: Julia Evans. [Examples of floating point problems](https://jvns.ca/blog/2023/01/13/examples-of-floating-point-problems/). *jvns.ca*, January 2023. Archived at [perma.cc/M57L-QKKW](https://perma.cc/M57L-QKKW)
[^8]: Matt Harris. [Snowflake: An Update and Some Very Important Information](https://groups.google.com/g/twitter-development-talk/c/ahbvo3VTIYI). Email to *Twitter Development Talk* mailing list, October 2010. Archived at [perma.cc/8UBV-MZ3D](https://perma.cc/8UBV-MZ3D)
[^9]: Yakov Shafranovich. [RFC 4180: Common Format and MIME Type for Comma-Separated Values (CSV) Files](https://tools.ietf.org/html/rfc4180). IETF, October 2005.
[^10]: Andy Coates. [Evolving JSON Schemas - Part I](https://www.creekservice.org/articles/2024/01/08/json-schema-evolution-part-1.html) and [Part II](https://www.creekservice.org/articles/2024/01/09/json-schema-evolution-part-2.html). *creekservice.org*, January 2024. Archived at [perma.cc/MZW3-UA54](https://perma.cc/MZW3-UA54) and [perma.cc/GT5H-WKZ5](https://perma.cc/GT5H-WKZ5)
[^11]: Pierre Genevès, Nabil Layaïda, and Vincent Quint. [Ensuring Query Compatibility with Evolving XML Schemas](https://arxiv.org/abs/0811.4324). INRIA Technical Report 6711, November 2008.
[^12]: Tim Bray. [Bits On the Wire](https://www.tbray.org/ongoing/When/201x/2019/11/17/Bits-On-the-Wire). *tbray.org*, November 2019. Archived at [perma.cc/3BT3-BQU3](https://perma.cc/3BT3-BQU3)
[^13]: Mark Slee, Aditya Agarwal, and Marc Kwiatkowski. [Thrift: Scalable Cross-Language Services Implementation](https://thrift.apache.org/static/files/thrift-20070401.pdf). Facebook technical report, April 2007. Archived at [perma.cc/22BS-TUFB](https://perma.cc/22BS-TUFB)
[^14]: Martin Kleppmann. [Schema Evolution in Avro, Protocol Buffers and Thrift](https://martin.kleppmann.com/2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html). *martin.kleppmann.com*, December 2012. Archived at [perma.cc/E4R2-9RJT](https://perma.cc/E4R2-9RJT)
[^15]: Doug Cutting, Chad Walters, Jim Kellerman, et al. [[PROPOSAL] New Subproject: Avro](https://lists.apache.org/thread/z571w0r5jmfsjvnl0fq4fgg0vh28d3bk). Email thread on *hadoop-general* mailing list, *lists.apache.org*, April 2009. Archived at [perma.cc/4A79-BMEB](https://perma.cc/4A79-BMEB)
[^16]: Apache Software Foundation. [Apache Avro 1.12.0 Specification](https://avro.apache.org/docs/1.12.0/specification/). *avro.apache.org*, August 2024. Archived at [perma.cc/C36P-5EBQ](https://perma.cc/C36P-5EBQ)
[^17]: Apache Software Foundation. [Avro schemas as LL(1) CFG definitions](https://avro.apache.org/docs/1.12.0/api/java/org/apache/avro/io/parsing/doc-files/parsing.html). *avro.apache.org*, August 2024. Archived at [perma.cc/JB44-EM9Q](https://perma.cc/JB44-EM9Q)
[^18]: Tony Hoare. [Null References: The Billion Dollar Mistake](https://www.infoq.com/presentations/Null-References-The-Billion-Dollar-Mistake-Tony-Hoare/). Talk at *QCon London*, March 2009.
[^19]: Confluent, Inc. [Schema Registry Overview](https://docs.confluent.io/platform/current/schema-registry/index.html). *docs.confluent.io*, 2024. Archived at [perma.cc/92C3-A9JA](https://perma.cc/92C3-A9JA)
[^20]: Aditya Auradkar and Tom Quiggle. [Introducing Espresso—LinkedIn’s Hot New Distributed Document Store](https://engineering.linkedin.com/espresso/introducing-espresso-linkedins-hot-new-distributed-document-store). *engineering.linkedin.com*, January 2015. Archived at [perma.cc/FX4P-VW9T](https://perma.cc/FX4P-VW9T)
[^21]: Jay Kreps. [Putting Apache Kafka to Use: A Practical Guide to Building a Stream Data Platform (Part 2)](https://www.confluent.io/blog/event-streaming-platform-2/). *confluent.io*, February 2015. Archived at [perma.cc/8UA4-ZS5S](https://perma.cc/8UA4-ZS5S)
[^22]: Gwen Shapira. [The Problem of Managing Schemas](https://www.oreilly.com/content/the-problem-of-managing-schemas/). *oreilly.com*, November 2014. Archived at [perma.cc/BY8Q-RYV3](https://perma.cc/BY8Q-RYV3)
[^23]: John Larmouth. [*ASN.1 Complete*](https://www.oss.com/asn1/resources/books-whitepapers-pubs/larmouth-asn1-book.pdf). Morgan Kaufmann, 1999. ISBN: 978-0-122-33435-1. Archived at [perma.cc/GB7Y-XSXQ](https://perma.cc/GB7Y-XSXQ)
[^24]: Burton S. Kaliski Jr. [A Layman’s Guide to a Subset of ASN.1, BER, and DER](https://luca.ntop.org/Teaching/Appunti/asn1.html). Technical Note, RSA Data Security, Inc., November 1993. Archived at [perma.cc/2LMN-W9U8](https://perma.cc/2LMN-W9U8)
[^25]: Jacob Hoffman-Andrews. [A Warm Welcome to ASN.1 and DER](https://letsencrypt.org/docs/a-warm-welcome-to-asn1-and-der/). *letsencrypt.org*, April 2020. Archived at [perma.cc/CYT2-GPQ8](https://perma.cc/CYT2-GPQ8)
[^26]: Lev Walkin. [Question: Extensibility and Dropping Fields](https://lionet.info/asn1c/blog/2010/09/21/question-extensibility-removing-fields/). *lionet.info*, September 2010. Archived at [perma.cc/VX8E-NLH3](https://perma.cc/VX8E-NLH3)
[^27]: Jacqueline Xu. [Online migrations at scale](https://stripe.com/blog/online-migrations). *stripe.com*, February 2017. Archived at [perma.cc/X59W-DK7Y](https://perma.cc/X59W-DK7Y)
[^28]: Geoffrey Litt, Peter van Hardenberg, and Orion Henry. [Project Cambria: Translate your data with lenses](https://www.inkandswitch.com/cambria/). Technical Report, *Ink & Switch*, October 2020. Archived at [perma.cc/WA4V-VKDB](https://perma.cc/WA4V-VKDB)
[^29]: Pat Helland. [Data on the Outside Versus Data on the Inside](https://www.cidrdb.org/cidr2005/papers/P12.pdf). At *2nd Biennial Conference on Innovative Data Systems Research* (CIDR), January 2005.
[^30]: Roy Thomas Fielding. [Architectural Styles and the Design of Network-Based Software Architectures](https://ics.uci.edu/~fielding/pubs/dissertation/fielding_dissertation.pdf). PhD Thesis, University of California, Irvine, 2000. Archived at [perma.cc/LWY9-7BPE](https://perma.cc/LWY9-7BPE)
[^31]: Roy Thomas Fielding. [REST APIs must be hypertext-driven](https://roy.gbiv.com/untangled/2008/rest-apis-must-be-hypertext-driven). *roy.gbiv.com*, October 2008. Archived at [perma.cc/M2ZW-8ATG](https://perma.cc/M2ZW-8ATG)
[^32]: [OpenAPI Specification Version 3.1.0](https://swagger.io/specification/). *swagger.io*, February 2021. Archived at [perma.cc/3S6S-K5M4](https://perma.cc/3S6S-K5M4)
[^33]: Michi Henning. [The Rise and Fall of CORBA](https://cacm.acm.org/practice/the-rise-and-fall-of-corba/). *Communications of the ACM*, volume 51, issue 8, pages 52–57, August 2008. [doi:10.1145/1378704.1378718](https://doi.org/10.1145/1378704.1378718)
[^34]: Pete Lacey. [The S Stands for Simple](https://harmful.cat-v.org/software/xml/soap/simple). *harmful.cat-v.org*, November 2006. Archived at [perma.cc/4PMK-Z9X7](https://perma.cc/4PMK-Z9X7)
[^35]: Stefan Tilkov. [Interview: Pete Lacey Criticizes Web Services](https://www.infoq.com/articles/pete-lacey-ws-criticism/). *infoq.com*, December 2006. Archived at [perma.cc/JWF4-XY3P](https://perma.cc/JWF4-XY3P)
[^36]: Tim Bray. [The Loyal WS-Opposition](https://www.tbray.org/ongoing/When/200x/2004/09/18/WS-Oppo). *tbray.org*, September 2004. Archived at [perma.cc/J5Q8-69Q2](https://perma.cc/J5Q8-69Q2)
[^37]: Andrew D. Birrell and Bruce Jay Nelson. [Implementing Remote Procedure Calls](https://www.cs.princeton.edu/courses/archive/fall03/cs518/papers/rpc.pdf). *ACM Transactions on Computer Systems* (TOCS), volume 2, issue 1, pages 39–59, February 1984. [doi:10.1145/2080.357392](https://doi.org/10.1145/2080.357392)
[^38]: Jim Waldo, Geoff Wyant, Ann Wollrath, and Sam Kendall. [A Note on Distributed Computing](https://m.mirror.facebook.net/kde/devel/smli_tr-94-29.pdf). Sun Microsystems Laboratories, Inc., Technical Report TR-94-29, November 1994. Archived at [perma.cc/8LRZ-BSZR](https://perma.cc/8LRZ-BSZR)
[^39]: Steve Vinoski. [Convenience over Correctness](https://steve.vinoski.net/pdf/IEEE-Convenience_Over_Correctness.pdf). *IEEE Internet Computing*, volume 12, issue 4, pages 89–92, July 2008. [doi:10.1109/MIC.2008.75](https://doi.org/10.1109/MIC.2008.75)
[^40]: Brandur Leach. [Designing robust and predictable APIs with idempotency](https://stripe.com/blog/idempotency). *stripe.com*, February 2017. Archived at [perma.cc/JD22-XZQT](https://perma.cc/JD22-XZQT)
[^41]: Sam Rose. [Load Balancing](https://samwho.dev/load-balancing/). *samwho.dev*, April 2023. Archived at [perma.cc/Q7BA-9AE2](https://perma.cc/Q7BA-9AE2)
[^42]: Troy Hunt. [Your API versioning is wrong, which is why I decided to do it 3 different wrong ways](https://www.troyhunt.com/your-api-versioning-is-wrong-which-is/). *troyhunt.com*, February 2014. Archived at [perma.cc/9DSW-DGR5](https://perma.cc/9DSW-DGR5)
[^43]: Brandur Leach. [APIs as infrastructure: future-proofing Stripe with versioning](https://stripe.com/blog/api-versioning). *stripe.com*, August 2017. Archived at [perma.cc/L63K-USFW](https://perma.cc/L63K-USFW)
[^44]: Alexandre Alves, Assaf Arkin, Sid Askary, et al. [Web Services Business Process Execution Language Version 2.0](https://docs.oasis-open.org/wsbpel/2.0/wsbpel-v2.0.html). *docs.oasis-open.org*, April 2007.
[^45]: [What is a Temporal Service?](https://docs.temporal.io/clusters) *docs.temporal.io*, 2024. Archived at [perma.cc/32P3-CJ9V](https://perma.cc/32P3-CJ9V)
[^46]: Stephan Ewen. [Why we built Restate](https://restate.dev/blog/why-we-built-restate/). *restate.dev*, August 2023. Archived at [perma.cc/BJJ2-X75K](https://perma.cc/BJJ2-X75K)
[^47]: Keith Tenzer and Joshua Smith. [Idempotency and Durable Execution](https://temporal.io/blog/idempotency-and-durable-execution). *temporal.io*, February 2024. Archived at [perma.cc/9LGW-PCLU](https://perma.cc/9LGW-PCLU)
[^48]: [What is a Temporal Workflow?](https://docs.temporal.io/workflows) *docs.temporal.io*, 2024. Archived at [perma.cc/B5C5-Y396](https://perma.cc/B5C5-Y396)
[^49]: Jack Kleeman. [Solving durable execution’s immutability problem](https://restate.dev/blog/solving-durable-executions-immutability-problem/). *restate.dev*, February 2024. Archived at [perma.cc/G55L-EYH5](https://perma.cc/G55L-EYH5)
[^50]: Srinath Perera. [Exploring Event-Driven Architecture: A Beginner’s Guide for Cloud Native Developers](https://wso2.com/blogs/thesource/exploring-event-driven-architecture-a-beginners-guide-for-cloud-native-developers/). *wso2.com*, August 2023. Archived at [archive.org](https://web.archive.org/web/20240716204613/https%3A//wso2.com/blogs/thesource/exploring-event-driven-architecture-a-beginners-guide-for-cloud-native-developers/)
[^51]: Philip A. Bernstein, Sergey Bykov, Alan Geller, Gabriel Kliot, and Jorgen Thelin. [Orleans: Distributed Virtual Actors for Programmability and Scalability](https://www.microsoft.com/en-us/research/publication/orleans-distributed-virtual-actors-for-programmability-and-scalability/). Microsoft Research Technical Report MSR-TR-2014-41, March 2014. Archived at [perma.cc/PD3U-WDMF](https://perma.cc/PD3U-WDMF)
================================================
FILE: content/en/ch6.md
================================================
---
title: "6. Replication"
weight: 206
breadcrumbs: false
---

> *The major difference between a thing that might go wrong and a thing that cannot possibly go wrong
> is that when a thing that cannot possibly go wrong goes wrong it usually turns out to be impossible
> to get at or repair.*
>
> Douglas Adams, *Mostly Harmless* (1992)

*Replication* means keeping a copy of the same data on multiple machines that are connected via a
network. As discussed in [“Distributed versus Single-Node Systems”](/en/ch1#sec_introduction_distributed), there are several reasons
why you might want to replicate data:

* To keep data geographically close to your users (and thus reduce access latency)
* To allow the system to continue working even if some of its parts have failed (and thus increase availability)
* To scale out the number of machines that can serve read queries (and thus increase read throughput)

In this chapter we will assume that your dataset is small enough that each machine can hold a copy of
the entire dataset. In [Chapter 7](/en/ch7#ch_sharding) we will relax that assumption and discuss *sharding*
(*partitioning*) of datasets that are too big for a single machine. In later chapters we will discuss
various kinds of faults that can occur in a replicated data system, and how to deal with them.

If the data that you’re replicating does not change over time, then replication is easy: you just
need to copy the data to every node once, and you’re done. All of the difficulty in replication lies
in handling *changes* to replicated data, and that’s what this chapter is about. We will discuss
three families of algorithms for replicating changes between nodes: *single-leader*, *multi-leader*,
and *leaderless* replication. Almost all distributed databases use one of these three approaches.
They all have various pros and cons, which we will examine in detail.

There are many trade-offs to consider with replication: for example, whether to use synchronous or
asynchronous replication, and how to handle failed replicas. Those are often configuration options
in databases, and although the details vary by database, the general principles are similar across
many different implementations. We will discuss the consequences of such choices in this chapter.

Replication of databases is an old topic—the principles haven’t changed much since they were
studied in the 1970s [^1], because the fundamental constraints of networks have remained the same. Despite being so old,
concepts such as *eventual consistency* still cause confusion. In [“Problems with Replication Lag”](/en/ch6#sec_replication_lag) we will
get more precise about eventual consistency and discuss things like the *read-your-writes* and
*monotonic reads* guarantees.
--------
> [!TIP] BACKUPS AND REPLICATION
You might be wondering whether you still need backups if you have replication. The answer is yes,
because they have different purposes: replicas quickly reflect writes from one node on other nodes,
but backups store old snapshots of the data so that you can go back in time. If you accidentally
delete some data, replication doesn’t help since the deletion will have also been propagated to the
replicas, so you need a backup if you want to restore the deleted data.
In fact, replication and backups are often complementary to each other. Backups are sometimes part
of the process of setting up replication, as we shall see in [“Setting Up New Followers”](/en/ch6#sec_replication_new_replica).
Conversely, archiving replication logs can be part of a backup process.
Some databases internally maintain immutable snapshots of past states, which serve as a kind of
internal backup. However, this means keeping old versions of the data on the same storage media as
the current state. If you have a large amount of data, it can be cheaper to keep the backups of old
data in an object store that is optimized for infrequently-accessed data, and to store only the
current state of the database in primary storage.
--------
## Single-Leader Replication {#sec_replication_leader}
Each node that stores a copy of the database is called a *replica*. With multiple replicas, a
question inevitably arises: how do we ensure that all the data ends up on all the replicas?
Every write to the database needs to be processed by every replica; otherwise, the replicas would no
longer contain the same data. The most common solution is called *leader-based replication*,
*primary-backup*, or *active/passive*. It works as follows (see [Figure 6-1](/en/ch6#fig_replication_leader_follower)):
1. One of the replicas is designated the *leader* (also known as *primary* or *source* [^2]).
When clients want to write to the database, they must send their requests to the leader, which
first writes the new data to its local storage.
2. The other replicas are known as *followers* (*read replicas*, *secondaries*, or *hot standbys*).
Whenever the leader writes new data to its local storage, it also sends the data change to all of
its followers as part of a *replication log* or *change stream*. Each follower takes the log
from the leader and updates its local copy of the database accordingly, by applying all writes in
the same order as they were processed on the leader.
3. When a client wants to read from the database, it can query either the leader or any of the
followers. However, writes are only accepted on the leader (the followers are read-only from the
client’s point of view).
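
The three steps above can be sketched as a toy in-memory model. This is only an illustration under simplifying assumptions: the class names `Leader` and `Replica` and the key `user:1234:picture` are made up for this example, the "network" is a direct method call, and every follower is updated immediately, which real systems do not guarantee.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """A follower: applies entries from the leader's log strictly in log order."""
    data: dict = field(default_factory=dict)
    applied: int = 0  # number of log entries applied so far

    def apply(self, log: list) -> None:
        # Apply only the entries this replica has not seen yet, in order.
        for key, value in log[self.applied:]:
            self.data[key] = value
            self.applied += 1

    def read(self, key):
        return self.data.get(key)


@dataclass
class Leader:
    """The leader: the only node that accepts writes."""
    data: dict = field(default_factory=dict)
    log: list = field(default_factory=list)
    followers: list = field(default_factory=list)

    def write(self, key, value) -> None:
        self.data[key] = value            # write to local storage first
        self.log.append((key, value))     # record the change in the replication log
        for follower in self.followers:   # send the change stream to every follower
            follower.apply(self.log)

    def read(self, key):
        return self.data.get(key)


leader = Leader(followers=[Replica(), Replica()])
leader.write("user:1234:picture", "old.jpg")
leader.write("user:1234:picture", "new.jpg")
```

Because each follower applies the log in the same order as the leader, all replicas end up with `new.jpg`, regardless of which node a client reads from.
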
{{< figure src="/fig/ddia_0601.png" id="fig_replication_leader_follower" caption="Figure 6-1. Single-leader replication directs all writes to a designated leader, which sends a stream of changes to the follower replicas." class="w-full my-4" >}}
If the database is sharded (see [Chapter 7](/en/ch7#ch_sharding)), each shard has one leader. Different shards may
have their leaders on different nodes, but each shard must nevertheless have one leader node. In
[“Multi-Leader Replication”](/en/ch6#sec_replication_multi_leader) we will discuss an alternative model in which a system may have
multiple leaders for the same shard at the same time.
Single-leader replication is very widely used. It’s a built-in feature of many relational databases,
such as PostgreSQL, MySQL, Oracle Data Guard [^3], and SQL Server’s Always On Availability Groups [^4].
It is also used in some document databases such as MongoDB and DynamoDB [^5],
message brokers such as Kafka, replicated block devices such as DRBD, and some network filesystems.
Many consensus algorithms such as Raft, which is used for replication in CockroachDB [^6], TiDB [^7],
etcd, and RabbitMQ quorum queues (among others), are also based on a single leader, and automatically
elect a new leader if the old one fails (we will discuss consensus in more detail in [Chapter 10](/en/ch10#ch_consistency)).
--------
> [!NOTE]
> In older documents you may see the term *master–slave replication*. It means the same as
> leader-based replication, but the term should be avoided as it is widely considered offensive [^8].
--------
### Synchronous Versus Asynchronous Replication {#sec_replication_sync_async}
An important detail of a replicated system is whether the replication happens *synchronously* or
*asynchronously*. (In relational databases, this is often a configurable option; other systems are
often hardcoded to be either one or the other.)
Think about what happens in [Figure 6-1](/en/ch6#fig_replication_leader_follower), where the user of a website updates
their profile image. At some point in time, the client sends the update request to the leader;
shortly afterward, it is received by the leader. At some point, the leader forwards the data change
to the followers. Eventually, the leader notifies the client that the update was successful.
[Figure 6-2](/en/ch6#fig_replication_sync_replication) shows one possible way the timings could work out.
{{< figure src="/fig/ddia_0602.png" id="fig_replication_sync_replication" caption="Figure 6-2. Leader-based replication with one synchronous and one asynchronous follower." class="w-full my-4" >}}
In the example of [Figure 6-2](/en/ch6#fig_replication_sync_replication), the replication to follower 1 is
*synchronous*: the leader waits until follower 1 has confirmed that it received the write before
reporting success to the user, and before making the write visible to other clients. The replication
to follower 2 is *asynchronous*: the leader sends the message, but doesn’t wait for a response from
the follower.
The diagram shows that there is a substantial delay before follower 2 processes the message.
Normally, replication is quite fast: most database systems apply changes to followers in less than a
second. However, there is no guarantee of how long it might take. There are circumstances when
followers might fall behind the leader by several minutes or more; for example, if a follower is
recovering from a failure, if the system is operating near maximum capacity, or if there are network
problems between the nodes.
The advantage of synchronous replication is that the follower is guaranteed to have an up-to-date
copy of the data that is consistent with the leader. If the leader suddenly fails, we can be sure
that the data is still available on the follower. The disadvantage is that if the synchronous
follower doesn’t respond (because it has crashed, or there is a network fault, or for any other
reason), the write cannot be processed. The leader must block all writes and wait until the
synchronous replica is available again.
For that reason, it is impracticable for all followers to be synchronous: any one node outage would
cause the whole system to grind to a halt. In practice, if a database offers synchronous
replication, it often means that *one* of the followers is synchronous, and the others are
asynchronous. If the synchronous follower becomes unavailable or slow, one of the asynchronous
followers is made synchronous. This guarantees that you have an up-to-date copy of the data on at
least two nodes: the leader and one synchronous follower. This configuration is sometimes also
called *semi-synchronous*.
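
The following is a minimal sketch of the semi-synchronous idea, not how any particular database implements it. The names `Follower`, `Leader`, `receive_sync`, and `receive_async` are hypothetical, a follower's `pending` list stands in for messages that have been sent but not yet processed, and unavailability is modeled by a simple flag.

```python
class Follower:
    def __init__(self, name):
        self.name = name
        self.available = True
        self.log = []      # writes this follower has processed
        self.pending = []  # writes sent asynchronously, not yet processed

    def receive_sync(self, entry):
        """Process the write and acknowledge it; fail if unreachable."""
        if not self.available:
            raise ConnectionError(f"{self.name} did not acknowledge")
        self.log.extend(self.pending)  # catch up on any backlog first
        self.pending.clear()
        self.log.append(entry)

    def receive_async(self, entry):
        """Fire-and-forget: the leader does not wait for processing."""
        self.pending.append(entry)


class Leader:
    def __init__(self, followers):
        self.log = []
        self.sync = followers[0]          # exactly one synchronous follower
        self.asyncs = list(followers[1:])

    def write(self, entry):
        if not self.sync.available:
            self._replace_sync_follower()
        self.log.append(entry)
        self.sync.receive_sync(entry)     # wait for this acknowledgment
        for follower in self.asyncs:
            follower.receive_async(entry) # do not wait for these

    def _replace_sync_follower(self):
        for candidate in self.asyncs:
            if candidate.available:
                self.asyncs.remove(candidate)
                self.asyncs.append(self.sync)  # demote the unresponsive one
                self.sync = candidate
                return
        raise TimeoutError("no follower available to acknowledge the write")


f1, f2 = Follower("f1"), Follower("f2")
leader = Leader([f1, f2])
leader.write("a")     # f1 acknowledges synchronously, f2 lags behind
f1.available = False
leader.write("b")     # f2 is made synchronous and must catch up first
```

After the second write, the leader and `f2` both hold `a` and `b`, so two nodes still have an up-to-date copy even though `f1` is down.
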
In some systems, a *majority* (e.g., 3 out of 5 replicas, including the leader) of replicas is
updated synchronously, and the remaining minority is asynchronous. This is an example of a *quorum*,
which we will discuss further in [“Quorums for reading and writing”](/en/ch6#sec_replication_quorum_condition). Majority quorums are often
used in systems that use a consensus protocol for automatic leader election, which we will return to
in [Chapter 10](/en/ch10#ch_consistency).
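
As a small worked example of the arithmetic, assuming the leader counts as one of the replicas (the function names here are illustrative only):

```python
def majority(n_replicas: int) -> int:
    """Smallest number of replicas that is more than half of n_replicas."""
    return n_replicas // 2 + 1


def is_committed(follower_acks: int, n_replicas: int) -> bool:
    """A write is confirmed once the leader plus enough followers hold it."""
    return 1 + follower_acks >= majority(n_replicas)


print(majority(5))         # 3
print(is_committed(2, 5))  # True: leader + 2 followers = 3 of 5
print(is_committed(1, 5))  # False: leader + 1 follower = 2 of 5
```

With five replicas, the leader therefore needs acknowledgments from only two followers, and the remaining two may lag without blocking writes.
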
Sometimes, leader-based replication is configured to be completely asynchronous. In this case, if the
leader fails and is not recoverable, any writes that have not yet been replicated to followers are
lost. This means that a write is not guaranteed to be durable, even if it has been confirmed to the
client. However, a fully asynchronous configuration has the advantage that the leader can continue
processing writes, even if all of its followers have fallen behind.
Weakening durability may sound like a bad trade-off, but asynchronous replication is nevertheless
widely used, especially if there are many followers or if they are geographically distributed [^9].
We will return to this issue in [“Problems with Replication Lag”](/en/ch6#sec_replication_lag).
### Setting Up New Followers {#sec_replication_new_replica}
From time to time, you need to set up new followers—perhaps to increase the number of replicas,
or to replace failed nodes. How do you ensure that the new follower has an accurate copy of the
leader’s data?
Simply copying data files from one node to another is typically not sufficient: clients are
constantly writing to the database, and the data is always in flux, so a standard file copy would
see different parts of the database at different points in time. The result might not make any
sense.
You could make the files on disk consistent by locking the database (making it unavailable for
writes), but that would go against our goal of high availability. Fortunately, setting up a
follower can usually be done without downtime. Conceptually, the process looks like this:
1. Take a consistent snapshot of the leader’s database at some point in time—if possible, without
taking a lock on the entire database. Most databases have this feature, as it is also required
for backups. In some cases, third-party tools are needed, such as Percona XtraBackup for MySQL.
2. Copy the snapshot to the new follower node.
3. The follower connects to the leader and requests all the data changes that have happened since
the snapshot was taken. This requires that the snapshot is associated with an exact position in
the leader’s replication log. That position has various names: for example, PostgreSQL calls it
the *log sequence number*; MySQL has two mechanisms, *binlog coordinates* and *global transaction
identifiers* (GTIDs).
4. When the follower has processed the backlog of data changes since the snapshot, we say it has
*caught up*. It can now continue to process data changes from the leader as they happen.
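
The four steps can be sketched in a few lines. This is a simplified in-memory model with hypothetical names (`Leader.snapshot`, `Leader.changes_since`, `Follower.catch_up`); the log position is simply the number of log entries written so far, standing in for something like a log sequence number.

```python
import copy


class Leader:
    def __init__(self):
        self.data = {}
        self.log = []  # entry i is the (i+1)-th write; len(log) is the current position

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))

    def snapshot(self):
        """Step 1: a consistent copy of the data, tagged with its log position."""
        return copy.deepcopy(self.data), len(self.log)

    def changes_since(self, position):
        """Step 3: every change made after the given log position."""
        return self.log[position:]


class Follower:
    def __init__(self, snapshot_data, position):
        # Step 2: the snapshot has been copied to the new node.
        self.data = snapshot_data
        self.position = position

    def catch_up(self, leader):
        # Steps 3 and 4: apply the backlog in order until caught up.
        for key, value in leader.changes_since(self.position):
            self.data[key] = value
            self.position += 1


leader = Leader()
leader.write("x", 1)
leader.write("y", 2)
snapshot_data, position = leader.snapshot()

# Clients keep writing while the snapshot is being copied.
leader.write("x", 3)
leader.write("z", 4)

follower = Follower(snapshot_data, position)
follower.catch_up(leader)
```

The snapshot alone would miss the writes to `x` and `z`; it is the recorded position that tells the follower exactly where to resume in the log.
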
The practical steps of setting up a follower vary significantly by database. In some systems the
process is fully automated, whereas in others it can be a somewhat arcane multi-step workflow that
needs to be manually performed by an administrator.
You can also archive the replication log to an object store; along with periodic snapshots of the
whole database in the object store this is a good way of implementing database backups and disaster
recovery. You can also perform steps 1 and 2 of setting up a new follower by downloading those files
from the object store. For example, WAL-G does this for PostgreSQL, MySQL, and SQL Server, and
Litestream does the equivalent for SQLite.
--------
> [!TIP] DATABASES BACKED BY OBJECT STORAGE
Object storage can be used for more than archiving data. Many databases are beginning to use object
stores such as Amazon Web Services S3, Google Cloud Storage, and Azure Blob Storage to serve data
for live queries. Storing database data in object storage has many benefits:
* Object storage is inexpensive compared to other cloud storage options, which allows cloud databases
to store less-often queried data on cheaper, higher-latency storage while serving the working set
from memory, SSDs, and NVMe.
* Object stores also provide multi-zone, dual-region, or multi-region replication with very high
durability guarantees. This also allows databases to bypass inter-zone network fees.
* Databases can use an object store’s *conditional write* feature—essentially, a *compare-and-set*
(CAS) operation—to implement transactions and leadership election [^10] [^11].
* Storing data from multiple databases in the same object store can simplify data integration,
particularly when open formats such as Apache Parquet and Apache Iceberg are used.
These benefits dramatically simplify the database architecture by shifting the responsibility of
transactions, leadership election, and replication to object storage.
Systems that adopt object storage for replication must grapple with some trade-offs. Notably, object
stores have much higher read and write latencies than local disks or virtual block devices such as
EBS. Many cloud providers also charge a per-API call fee, which forces systems to batch reads and
writes to reduce cost. Such batching further increases latency. Moreover, many object stores do not
offer standard filesystem interfaces. This prevents systems that lack object storage integration
from leveraging object storage. Interfaces such as *filesystem in userspace* (FUSE) allow operators
to mount object store buckets as filesystems that applications can use without knowing their data is
stored on object storage. Still, many FUSE interfaces to object stores lack POSIX features such as
non-sequential writes or symlinks, which systems might depend on.
Different systems deal with these trade-offs in various ways. Some introduce a *tiered storage*
architecture that places less frequently accessed data on object storage while new or frequently
accessed data is kept on faster storage devices such as SSDs, NVMe, or even in memory. Other systems
use object storage as their primary storage tier, but use a separate low-latency storage system (such
as Amazon’s EBS or Neon’s Safekeepers [^12]) to store their WAL. Recently, some systems have gone even further by adopting a
*zero-disk architecture* (ZDA). ZDA-based systems persist all data to object storage and use disks
and memory strictly for caching. This allows nodes to have no persistent state, which dramatically
simplifies operations. WarpStream, Confluent Freight, Buf’s Bufstream, and Redpanda Serverless are
all Kafka-compatible systems built using a zero-disk architecture. Nearly every modern cloud data
warehouse also adopts such an architecture, as does Turbopuffer (a vector search engine), and
SlateDB (a cloud-native LSM storage engine).
--------
### Handling Node Outages {#sec_replication_failover}
Any node in the system can go down, perhaps unexpectedly due to a fault, but just as likely due to
planned maintenance (for example, rebooting a machine to install a kernel security patch). Being
able to reboot individual nodes without downtime is a big advantage for operations and maintenance.
Thus, our goal is to keep the system as a whole running despite individual node failures, and to keep
the impact of a node outage as small as possible.
How do you achieve high availability with leader-based replication?
#### Follower failure: Catch-up recovery {#follower-failure-catch-up-recovery}
On its local disk, each follower keeps a log of the data changes it has received from the leader. If
a follower crashes and is restarted, or if the network between the leader and the follower is
temporarily interrupted, the follower can recover quite easily: from its log, it knows the last
transaction that was processed before the fault occurred. Thus, the follower can connect to the
leader and request all the data changes that occurred during the time when the follower was
disconnected. When it has applied these changes, it has caught up to the leader and can continue
receiving a stream of data changes as before.
Although follower recovery is conceptually simple, it can be challenging in terms of performance: if
the database has a high write throughput or if the follower has been offline for a long time, there
might be a lot of writes to catch up on. There will be high load on both the recovering follower and
the leader (which needs to send the backlog of writes to the follower) while this catch-up is ongoing.
The leader can delete its log of writes once all followers have confirmed that they have processed
it, but if a follower is unavailable for a long time, the leader faces a choice: either it retains
the log until the follower recovers and catches up (at the risk of running out of disk space on the
leader), or it deletes the log that the unavailable follower has not yet acknowledged (in which case
the follower won’t be able to recover from the log, and will have to be restored from a backup when
it comes back).
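
The leader's choice can be made concrete with a small sketch. It assumes a hypothetical `LeaderLog` class that deletes entries once every follower has acknowledged them, but also enforces a `max_retained` limit to protect the leader's disk; a follower that falls behind the truncation point can no longer be served from the log.

```python
class LeaderLog:
    def __init__(self, max_retained):
        self.first_position = 0   # log position of entries[0]
        self.entries = []
        self.max_retained = max_retained
        self.acked = {}           # follower name -> next position it needs

    def append(self, entry):
        self.entries.append(entry)
        self._truncate()

    def ack(self, follower, position):
        """The follower reports that it has processed everything before position."""
        self.acked[follower] = position
        self._truncate()

    def _truncate(self):
        end = self.first_position + len(self.entries)
        # Entries below this position are no longer needed by any follower.
        safe = min(self.acked.values(), default=end)
        # Entries below this position are dropped regardless, to bound disk usage.
        forced = end - self.max_retained
        cut = max(safe, forced, self.first_position)
        del self.entries[: cut - self.first_position]
        self.first_position = cut

    def changes_since(self, position):
        if position < self.first_position:
            raise LookupError("log truncated; restore the follower from a backup")
        return self.entries[position - self.first_position:]


log = LeaderLog(max_retained=3)
log.ack("a", 0)
log.ack("b", 0)
for i in range(5):
    log.append(f"w{i}")   # neither follower acknowledges anything meanwhile
```

Here only the last three writes are retained. A follower that needs the log from position 3 can still catch up, whereas one that needs it from position 0 gets an error and has to be rebuilt from a backup or snapshot. Setting `max_retained` very high corresponds to the other choice: keep everything and risk filling the disk.
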
#### Leader failure: Failover {#leader-failure-failover}
Handling a failure of the leader is trickier: one of the followers needs to be promoted to be the
new leader, clients need to be reconfigured to send their writes to the new leader, and the other
followers need to start consuming data changes from the new leader. This process is called
*failover*.
Failover can happen manually (an administrator is notified that the leader has failed and takes the
necessary steps to make a new leader) or automatically. An automatic failover process usually
consists of the following steps:
1. *Determining that the leader has failed.* There are many things that could potentially go wrong:
crashes, power outages, network issues, and more. There is no foolproof way of detecting what
has gone wrong, so most systems simply use a timeout: nodes frequently bounce messages back and
forth between each other, and if a node doesn’t respond for some period of time—say, 30
seconds—it is assumed to be dead. (If the leader is deliberately taken down for planned
maintenance, this doesn’t apply.)
2. *Choosing a new leader.* This could be done through an election process (where the leader is chosen by
a majority of the remaining replicas), or a new leader could be appointed by a previously
established *controller node* [^13].
The best candidate for leadership is usually the replica with the most up-to-date data changes
from the old leader (to minimize any data loss). Getting all the nodes to agree on a new leader
is a consensus problem, discussed in detail in [Chapter 10](/en/ch10#ch_consistency).
3. *Reconfiguring the system to use the new leader.* Clients now need to send
their write requests to the new leader (we discuss this
in [“Request Routing”](/en/ch7#sec_sharding_routing)). If the old leader comes back, it might still believe that it is
the leader, not realizing that the other replicas have
forced it to step down. The system needs to ensure that the old leader becomes a follower and
recognizes the new leader.
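
As a rough sketch of the first two steps, the function below combines timeout-based failure detection with picking the most up-to-date follower. The function name, its parameters, and the node names are assumptions made for illustration. It deliberately ignores the hard part: in a real system the nodes must *agree* on the outcome, which requires a consensus protocol rather than a single function call.

```python
def choose_leader(current_leader, last_heartbeat, log_position, now, timeout=30.0):
    """Return the node that should be leader at time `now`.

    last_heartbeat: node name -> time a message was last received from it
    log_position:   follower name -> last replicated log position
    """
    def alive(node):
        return now - last_heartbeat[node] <= timeout

    # Step 1: the leader is only presumed dead after the timeout elapses.
    if alive(current_leader):
        return current_leader

    # Step 2: prefer the live follower with the most up-to-date log,
    # to minimize data loss. Ties are broken by name for determinism.
    candidates = [n for n in log_position if n != current_leader and alive(n)]
    if not candidates:
        raise RuntimeError("no live follower to promote")
    return max(sorted(candidates), key=lambda n: log_position[n])


heartbeats = {"leader": 10.0, "f1": 44.0, "f2": 43.5}
positions = {"f1": 97, "f2": 100}

print(choose_leader("leader", heartbeats, positions, now=30.0))  # leader
print(choose_leader("leader", heartbeats, positions, now=45.0))  # f2
```

At time 45 the leader has been silent for 35 seconds, so `f2` is promoted because it has replicated more of the log than `f1`. Any writes the old leader accepted beyond position 100 are not on `f2`.
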
Failover is fraught with things that can go wrong:
* If asynchronous replication is used, the new leader may not have received all the writes from the old
leader before it failed. If the former leader rejoins the cluster after a new leader has been
chosen, what should happen to those writes? The new leader may have received conflicting writes
in the meantime. The most common solution is for the old leader’s unreplicated writes to simply be
discarded, which means that writes you believed to be committed actually weren’t durable after all.
* Discarding writes is especially dangerous if other storage systems outside of the database need to
be coordinated with the database contents. For example, in one incident at GitHub [^14],
an out-of-date MySQL follower
was promoted to leader. The database used an autoincrementing counter to assign primary keys to
new rows, but because the new leader’s counter lagged behind the old leader’s, it reused some
primary keys that were previously assigned by the old leader. These primary keys were also used in
a Redis store, so the reuse of primary keys resulted in inconsistency between MySQL and Redis,
which caused some private data to be disclosed to the wrong users.
* In certain fault scenarios (see [Chapter 9](/en/ch9#ch_distributed)), it could happen that two nodes both believe
that they are the leader. This situation is called *split brain*, and it is dangerous: if both
leaders accept writes, and there is no process for resolving conflicts (see
[“Multi-Leader Replication”](/en/ch6#sec_replication_multi_leader)), data is likely to be lost or corrupted. As a safety catch, some
systems have a mechanism to shut down one node if two leaders are detected. However, if this
mechanism is not carefully designed, you can end up with both nodes being shut down [^15].
Moreover, there is a risk that by the time the split brain is detected and the old node is shut
down, it is already too late and data has already been corrupted.
* What is the right timeout before the leader is declared dead? A longer timeout means a longer
time to recovery in the case where the leader fails. However, if the timeout is too short, there
could be unnecessary failovers. For example, a temporary load spike could cause a node’s response
time to increase above the timeout, or a network glitch could cause delayed packets. If the system
is already struggling with high load or network problems, an unnecessary failover is likely to
make the situation worse, not better.
--------
> [!NOTE]
> Guarding against split brain by limiting or shutting down old leaders is known as *fencing* or, more
> emphatically, *Shoot The Other Node In The Head* (STONITH). We will discuss fencing in more detail
> in [“Distributed Locks and Leases”](/en/ch9#sec_distributed_lock_fencing).
--------
There are no easy solutions to these problems. For this reason, some operations teams prefer to
perform failovers manually, even if the software supports automatic failover.
The most important thing with failover is to pick an up-to-date follower as the new leader—if
synchronous or semi-synchronous replication is used, this would be the follower that the old leader
waited for before acknowledging writes. With asynchronous replication, you can pick the follower
with the greatest log sequence number. This minimizes the amount of data that is lost during
failover: losing a fraction of a second of writes may be tolerable, but picking a follower that is
behind by several days could be catastrophic.
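As a rough sketch of the asynchronous case, the following picks the reachable follower with the greatest log sequence number. The `Follower` type, its fields, and `choose_new_leader` are hypothetical names for illustration; a real system would also need the nodes to agree on the outcome, which is the consensus problem mentioned above.

```python
from dataclasses import dataclass


@dataclass
class Follower:
    name: str
    last_applied_lsn: int  # highest log sequence number this follower has applied
    reachable: bool = True


def choose_new_leader(followers):
    """Return the reachable follower with the greatest log sequence number."""
    candidates = [f for f in followers if f.reachable]
    if not candidates:
        raise RuntimeError("no reachable follower to promote")
    return max(candidates, key=lambda f: f.last_applied_lsn)
```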
These issues—node failures; unreliable networks; and trade-offs around replica consistency,
durability, availability, and latency—are in fact fundamental problems in distributed systems.
In [Chapter 9](/en/ch9#ch_distributed) and [Chapter 10](/en/ch10#ch_consistency) we will discuss them in greater depth.
### Implementation of Replication Logs {#sec_replication_implementation}
How does leader-based replication work under the hood? Several different replication methods are
used in practice, so let’s look at each one briefly.
#### Statement-based replication {#statement-based-replication}
In the simplest case, the leader logs every write request (*statement*) that it executes and sends
that statement log to its followers. For a relational database, this means that every `INSERT`,
`UPDATE`, or `DELETE` statement is forwarded to followers, and each follower parses and executes
that SQL statement as if it had been received from a client.
Although this may sound reasonable, there are various ways in which this approach to replication can
break down:
* Any statement that calls a nondeterministic function, such as `NOW()` to get the current date
and time or `RAND()` to get a random number, is likely to generate a different value on each
replica.
* If statements use an autoincrementing column, or if they depend on the existing data in the
database (e.g., `UPDATE … WHERE <some condition>`), they must be executed in exactly the same
order on each replica, or else they may have a different effect. This can be limiting when there
are multiple concurrently executing transactions.
* Statements that have side effects (e.g., triggers, stored procedures, user-defined functions) may
result in different side effects occurring on each replica, unless the side effects are absolutely
deterministic.
It is possible to work around those issues—for example, the leader can replace any nondeterministic
function calls with a fixed return value when the statement is logged so that the followers all get
the same value. The idea of executing deterministic statements in a fixed order is similar to the
event sourcing model that we previously discussed in [“Event Sourcing and CQRS”](/en/ch3#sec_datamodels_events). This approach is
also known as *state machine replication*, and we will discuss the theory behind it in
[“Using shared logs”](/en/ch10#sec_consistency_smr).
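To illustrate the workaround of fixing nondeterministic values at logging time, here is a deliberately simplified sketch. The function `make_deterministic` is hypothetical, and it uses plain text substitution rather than a real SQL parser, so it would also rewrite `NOW()` inside a string literal; a real database does this on the parsed statement.

```python
import random
import re
from datetime import datetime, timezone


def make_deterministic(statement, now=None, rng=random):
    """Replace NOW() and RAND() calls with literal values chosen once on the leader."""
    now = now or datetime.now(timezone.utc)
    timestamp_literal = "'" + now.strftime("%Y-%m-%d %H:%M:%S") + "'"
    statement = re.sub(r"\bNOW\(\)", timestamp_literal, statement, flags=re.IGNORECASE)
    # Each RAND() call gets its own value, but that value is fixed in the logged text,
    # so every follower that executes the logged statement sees the same number.
    statement = re.sub(
        r"\bRAND\(\)", lambda _match: repr(rng.random()), statement, flags=re.IGNORECASE
    )
    return statement
```

The leader would execute and log the rewritten statement, so that followers replay exactly the values the leader used.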
Statement-based replication was used in MySQL before version 5.1. It is still sometimes used today,
as it is quite compact, but by default MySQL now switches to row-based replication (discussed shortly) if
there is any nondeterminism in a statement. VoltDB uses statement-based replication, and makes it
safe by requiring transactions to be deterministic [^16]. However, determinism can be hard to guarantee
in practice, so many databases prefer other replication methods.
#### Write-ahead log (WAL) shipping {#write-ahead-log-wal-shipping}
In [Chapter 4](/en/ch4#ch_storage) we saw that a write-ahead log is needed to make B-tree storage engines robust:
every modification is first written to the WAL so that the tree can be restored to a consistent
state after a crash. Since the WAL contains all the information necessary to restore the indexes and
heap into a consistent state, we can use the exact same log to build a replica on another node:
besides writing the log to disk, the leader also sends it across the network to its followers. When
the follower processes this log, it builds a copy of the exact same files as found on the leader.
This method of replication is used in PostgreSQL and Oracle, among others [^17] [^18].
The main disadvantage is that the log describes the data on a very low level: a WAL contains details
of which bytes were changed in which disk blocks. This makes replication tightly coupled to the
storage engine. If the database changes its storage format from one version to another, it is
typically not possible to run different versions of the database software on the leader and the
followers.
That may seem like a minor implementation detail, but it can have a big operational impact. If the
replication protocol allows the follower to use a newer software version than the leader, you can
perform a zero-downtime upgrade of the database software by first upgrading the followers and then
performing a failover to make one of the upgraded nodes the new leader. If the replication protocol
does not allow this version mismatch, as is often the case with WAL shipping, such upgrades require
downtime.
#### Logical (row-based) log replication {#logical-row-based-log-replication}
An alternative is to use different log formats for replication and for the storage engine, which
allows the replication log to be decoupled from the storage engine internals. This kind of
replication log is called a *logical log*, to distinguish it from the storage engine’s (*physical*)
data representation.
A logical log for a relational database is usually a sequence of records describing writes to
database tables at the granularity of a row:
* For an inserted row, the log contains the new values of all columns.
* For a deleted row, the log contains enough information to uniquely identify the row that was
deleted. Typically this would be the primary key, but if there is no primary key on the table, the
old values of all columns need to be logged.
* For an updated row, the log contains enough information to uniquely identify the updated row, and
the new values of all columns (or at least the new values of all columns that changed).
A transaction that modifies several rows generates several such log records, followed by a record
indicating that the transaction was committed. MySQL keeps a separate logical replication log,
called the *binlog*, in addition to the WAL (when configured to use row-based replication).
PostgreSQL implements logical replication by decoding the physical WAL into row
insertion/update/delete events [^19].
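The shape of such a log can be sketched with a small in-memory model. The `LogRecord` layout and `apply_transaction` function below are assumptions for illustration only and do not correspond to the MySQL binlog or PostgreSQL logical decoding formats. The follower buffers row changes and applies them only when it sees the commit record.

```python
from dataclasses import dataclass, field


@dataclass
class LogRecord:
    op: str                     # "insert", "update", "delete", or "commit"
    table: str = ""
    key: object = None          # primary key identifying the row
    values: dict = field(default_factory=dict)  # new column values


def apply_transaction(tables, records):
    """Apply buffered row changes to `tables` once the commit record is seen."""
    pending = []
    for record in records:
        if record.op == "commit":
            for change in pending:
                rows = tables.setdefault(change.table, {})
                if change.op == "insert":
                    rows[change.key] = dict(change.values)
                elif change.op == "update":
                    rows[change.key].update(change.values)  # only changed columns
                elif change.op == "delete":
                    del rows[change.key]
            pending = []
        else:
            pending.append(record)
    return tables
```

Because the records refer to tables, keys, and column values rather than bytes in disk blocks, nothing in them depends on how the storage engine lays out its files.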
Since a logical log is decoupled from the storage engine internals, it can more easily be kept
backward compatible, allowing the leader and the follower to run different versions of the database
software. This in turn enables upgrading to a new version with minimal downtime [^20].
A logical log format is also easier for external applications to parse. This aspect is useful if you want
to send the contents of a database to an external system, such as a data warehouse for offline
analysis, or for building custom indexes and caches [^21].
This technique is called *change data capture*, and we will return to it in [“Change Data Capture”](/en/ch12#sec_stream_cdc).
## Problems with Replication Lag {#sec_replication_lag}
Being able to tolerate node failures is just one reason for wanting replication. As mentioned
in [“Distributed versus Single-Node Systems”](/en/ch1#sec_introduction_distributed), other reasons are scalability (processing more
requests than a single machine can handle) and latency (placing replicas geographically closer to users).
Leader-based replication requires all writes to go through a single node, but read-only queries can
go to any replica. For workloads that consist of mostly reads and only a small percentage of writes
(which is often the case with online services), there is an attractive option: create many followers, and distribute
the read requests across those followers. This removes load from the leader and allows read requests to be
served by nearby replicas.
In this *read-scaling* architecture, you can increase the capacity for serving read-only requests
simply by adding more followers. However, this approach only realistically works with asynchronous
replication—if you tried to synchronously replicate to all followers, a single node failure or
network outage would make the entire system unavailable for writing. And the more nodes you have,
the likelier it is that one will be down, so a fully synchronous configuration would be very unreliable.
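A back-of-the-envelope calculation makes this concrete. The sketch below assumes, purely for illustration, that each node is independently available 99% of the time; real failures are often correlated, so treat the numbers as indicative only.

```python
def all_nodes_up_probability(per_node_availability, node_count):
    """Probability that every one of `node_count` independent nodes is up at once."""
    return per_node_availability ** node_count


# With fully synchronous replication, a write needs every node to be up.
for n in (1, 3, 10, 50):
    print(n, round(all_nodes_up_probability(0.99, n), 3))
```

The probability shrinks geometrically with the number of nodes, which is why adding followers to a fully synchronous setup makes writes less available rather than more.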
Unfortunately, if an application reads from an *asynchronous* follower, it may see outdated
information if the follower has fallen behind. This leads to apparent inconsistencies in the
database: if you run the same query on the leader and a follower at the same time, you may get
different results, because not all writes have been reflected in the follower. This inconsistency is
just a temporary state—if you stop writing to the database and wait a while, the followers will
eventually catch up and become consistent with the leader. For that reason, this effect is known
as *eventual consistency* [^22].
--------
> [!NOTE]
> The term *eventual consistency* was coined by Douglas Terry et al. [^23], popularized by Werner Vogels [^24],
> and became the battle cry of many NoSQL projects. However, not only NoSQL databases are eventually
> consistent: followers in an asynchronously replicated relational database have the same characteristics.
--------
The term “eventually” is deliberately vague: in general, there is no limit to how far a replica can
fall behind. In normal operation, the delay between a write happening on the leader and being
reflected on a follower—the *replication lag*—may be only a fraction of a second, and not
noticeable in practice. However, if the system is operating near capacity or if there is a problem
in the network, the lag can easily increase to several seconds or even minutes.
When the lag is so large, the inconsistencies it introduces are not just a theoretical issue but a
real problem for applications. In this section we will highlight three examples of problems that are
likely to occur when there is replication lag. We’ll also outline some approaches to solving them.
### Reading Your Own Writes {#sec_replication_ryw}
Many applications let the user submit some data and then view what they have submitted. This might
be a record in a customer database, or a comment on a discussion thread, or something else of that sort.
When new data is submitted, it must be sent to the leader, but when the user views the data, it can
be read from a follower. This is especially appropriate if data is frequently viewed but only
occasionally written.
With asynchronous replication, there is a problem, illustrated in
[Figure 6-3](/en/ch6#fig_replication_read_your_writes): if the user views the data shortly after making a write, the
new data may not yet have reached the replica. To the user, it looks as though the data they
submitted was lost, so they will be understandably unhappy.
{{< figure src="/fig/ddia_0603.png" id="fig_replication_read_your_writes" caption="Figure 6-3. A user makes a write, followed by a read from a stale replica. To prevent this anomaly, we need read-after-write consistency." class="w-full my-4" >}}
In this situation, we need *read-after-write consistency*, also known as *read-your-writes consistency* [^23].
This is a guarantee that if the user reloads the page, they will always see any updates they
submitted themselves. It makes no promises about other users: other users’ updates may not be
visible until some later time. However, it reassures the user that their own input has been saved
correctly.
How can we implement read-after-write consistency in a system with leader-based replication? There
are various possible techniques. To mention a few:
* When reading something that the user may have modified, read it from the leader or a synchronously
updated follower; otherwise, read it from an asynchronously updated follower.
This requires that you have some way of knowing whether something might have been
modified, without actually querying it. For example, user profile information on a social network
is normally only editable by the owner of the profile, not by anybody else. Thus, a simple
rule is: always read the user’s own profile from the leader, and any other users’ profiles from a
follower.
* If most things in the application are potentially editable by the user, that approach won’t be
effective, as most things would have to be read from the leader (negating the benefit of read
scaling). In that case, other criteria may be used to decide whether to read from the leader. For
example, you could track the time of the last update and, for one minute after the last update, make all
reads from the leader [^25].
You could also monitor the replication lag on followers and prevent queries on any follower that
is more than one minute behind the leader.
* The client can remember the timestamp of its most recent write—then the system can ensure that the
replica serving any reads for that user reflects updates at least until that timestamp. If a
replica is not sufficiently up to date, either the read can be handled by another replica or the
query can wait until the replica has caught up [^26].
The timestamp could be a *logical timestamp* (something that indicates ordering of writes, such as
the log sequence number) or the actual system clock (in which case clock synchronization becomes
critical; see [“Unreliable Clocks”](/en/ch9#sec_distributed_clocks)).
* If your replicas are distributed across regions (for geographical proximity to users or for
availability), there is additional complexity. Any request that needs to be served by the leader
must be routed to the region that contains the leader.
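The logical-timestamp technique from the list above might look like the following sketch. The `Replica` and `ReadRouter` names are hypothetical, and the example keeps each user’s last write position in the router rather than on the client, which is one possible design rather than the only one.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    applied_lsn: int  # position in the replication log this replica has applied


class ReadRouter:
    def __init__(self, leader, followers):
        self.leader = leader
        self.followers = followers
        self.last_write_lsn = {}  # user_id -> LSN of that user's most recent write

    def record_write(self, user_id, lsn):
        self.last_write_lsn[user_id] = max(lsn, self.last_write_lsn.get(user_id, 0))

    def choose_replica(self, user_id):
        """Return a follower that reflects the user's own writes, else the leader."""
        required = self.last_write_lsn.get(user_id, 0)
        for follower in self.followers:
            if follower.applied_lsn >= required:
                return follower
        return self.leader  # no follower has caught up far enough yet
```

Users who have not written recently can be served by any follower, so the leader only absorbs reads from users whose writes are still in flight.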
Another complication arises when the same user is accessing your service from multiple devices, for
example a desktop web browser and a mobile app. In this case you may want to provide *cross-device*
read-after-write consistency: if the user enters some information on one device and then views it
on another device, they should see the information they just entered.
In this case, there are some additional issues to consider:
* Approaches that require remembering the timestamp of the user’s last update become more difficult,
because the code running on one device doesn’t know what updates have happened on the other
device. This metadata will need to be centralized.
* If your replicas are distributed across different regions, there is no guarantee that connections
from different devices will be routed to the same region. (For example, if the user’s desktop
computer uses the home broadband connection and their mobile device uses the cellular data network,
the devices’ network routes may be completely different.) If your approach requires reading from the
leader, you may first need to route requests from all of a user’s devices to the same region.
--------
> [!TIP] Regions and Availability Zones
> We use the term *region* to refer to one or more datacenters in a single geographic location. Cloud
> providers locate multiple datacenters in the same geographic region. Each datacenter is referred to
> as an *availability zone* or simply *zone*. Thus, a single cloud region is made up of multiple
> zones. Each zone is a separate datacenter located in a separate physical facility with its own
> power, cooling, and so on.
>
> Zones in the same region are connected by very high speed network connections. Latency is low enough
> that most distributed systems can run with nodes spread across multiple zones in the same region as
> though they were in a single zone. Multi-zone configurations allow distributed systems to survive
> zonal outages where one zone goes offline, but they do not protect against regional outages where
> all zones in a region are unavailable. To survive a regional outage, a distributed system must be
> deployed across multiple regions, which can result in higher latencies, lower throughput, and
> increased cloud networking bills. We will discuss these trade-offs more in
> [“Multi-leader replication topologies”](/en/ch6#sec_replication_topologies). For now, just know that when we say region, we mean a collection of
> zones/datacenters in a single geographic location.
--------
### Monotonic Reads {#sec_replication_monotonic_reads}
Our second example of an anomaly that can occur when reading from asynchronous followers is that it’s
possible for a user to see things *moving backward in time*.
This can happen if a user makes several reads from different replicas. For example,
[Figure 6-4](/en/ch6#fig_replication_monotonic_reads) shows user 2345 making the same query twice, first to a follower
with little lag, then to a follower with greater lag. (This scenario is quite likely if the user
refreshes a web page, and each request is routed to a random server.) The first query returns a
comment that was recently added by user 1234, but the second query doesn’t return anything because
the lagging follower has not yet picked up that write. In effect, the second query observes the
system state at an earlier point in time than the first query. This wouldn’t be so bad if the first query
hadn’t returned anything, because user 2345 probably wouldn’t know that user 1234 had recently added
a comment. However, it’s very confusing for user 2345 if they first see user 1234’s comment appear,
and then see it disappear again.
{{< figure src="/fig/ddia_0604.png" id="fig_replication_monotonic_reads" caption="Figure 6-4. A user first reads from a fresh replica, then from a stale replica. Time appears to go backward. To prevent this anomaly, we need monotonic reads." class="w-full my-4" >}}
*Monotonic reads* [^22] is a guarantee that this
kind of anomaly does not happen. It’s a lesser guarantee than strong consistency, but a stronger
guarantee than eventual consistency. When you read data, you may see an old value; monotonic reads
only means that if one user makes several reads in sequence, they will not see time go
backward—i.e., they will not read older data after having previously read newer data.
One way of achieving monotonic reads is to make sure that each user always makes their reads from
the same replica (different users can read from different replicas). For example, the replica can be
chosen based on a hash of the user ID, rather than randomly. However, if that replica fails, the
user’s queries will need to be rerouted to another replica.
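A minimal sketch of hash-based replica selection is shown below; `replica_for_user` is a hypothetical helper. It uses `hashlib` rather than Python’s built-in `hash()`, because the built-in string hash is randomized per process and would send the same user to different replicas on different application servers.

```python
import hashlib


def replica_for_user(user_id, replicas):
    """Pick a replica deterministically from a hash of the user ID."""
    digest = hashlib.sha256(str(user_id).encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(replicas)
    return replicas[index]
```

If the list of replicas changes, for example because one fails, many users will be mapped to a different replica, and the monotonic reads guarantee can be violated at that moment unless the rerouting logic takes extra care.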
### Consistent Prefix Reads {#sec_replication_consistent_prefix}
Our third example of replication lag anomalies concerns violation of causality. Imagine the
following short dialog between Mr. Poons and Mrs. Cake:
Mr. Poons
: How far into the future can you see, Mrs. Cake?
Mrs. Cake
: About ten seconds usually, Mr. Poons.
There is a causal dependency between those two sentences: Mrs. Cake heard Mr. Poons’s question and
answered it.
Now, imagine a third person is listening to this conversation through followers. The things said by
Mrs. Cake go through a follower with little lag, but the things said by Mr. Poons have a longer
replication lag (see [Figure 6-5](/en/ch6#fig_replication_consistent_prefix)). This observer would hear the following:
Mrs. Cake
: About ten seconds usually, Mr. Poons.
Mr. Poons
: How far into the future can you see, Mrs. Cake?
To the observer it looks as though Mrs. Cake is answering the question before Mr. Poons has even asked
it. Such psychic powers are impressive, but very confusing [^27].
{{< figure src="/fig/ddia_0605.png" id="fig_replication_consistent_prefix" caption="Figure 6-5. If some shards are replicated slower than others, an observer may see the answer before they see the question." class="w-full my-4" >}}
Preventing this kind of anomaly requires another type of guarantee: *consistent prefix reads* [^22].
This guarantee says that if a sequence of writes happens in a certain order,
then anyone reading those writes will see them appear in the same order.
This is a particular problem in sharded (partitioned) databases, which we will discuss in
[Chapter 7](/en/ch7#ch_sharding). If the database always applies writes in the same order, reads always see a
consistent prefix, so this anomaly cannot happen. However, in many distributed databases, different
shards operate independently, so there is no global ordering of writes: when a user reads from the
database, they may see some parts of the database in an older state and some in a newer state.
One solution is to make sure that any writes that are causally related to each other are written to
the same shard—but in some applications that cannot be done efficiently. There are also algorithms
that explicitly keep track of causal dependencies, a topic that we will return to in
[“The “happens-before” relation and concurrency”](/en/ch6#sec_replication_happens_before).
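The effect of placing causally related writes on the same shard can be seen in a toy model. Everything below is an illustrative assumption: each shard has an ordered log, and a lagging follower is modeled as having applied only a prefix of that log.

```python
QUESTION = (1, "Mr. Poons", "How far into the future can you see, Mrs. Cake?")
ANSWER = (2, "Mrs. Cake", "About ten seconds usually, Mr. Poons.")


def visible(shard_logs, applied):
    """Return the writes an observer can see when each shard's follower has
    applied only the first `applied[shard]` entries of that shard's log."""
    seen = []
    for shard, log in shard_logs.items():
        seen.extend(log[: applied[shard]])
    return sorted(seen)  # order by the sequence number assigned at write time


# Causally related writes on different shards: the answer can become visible first.
split = {"shard_a": [QUESTION], "shard_b": [ANSWER]}
print(visible(split, {"shard_a": 0, "shard_b": 1}))

# Same writes on one shard: every prefix of the log respects the causal order.
together = {"shard_a": [QUESTION, ANSWER]}
for n in range(3):
    print(visible(together, {"shard_a": n}))
```

With a single shard, the observer may see nothing, or only the question, but never the answer without the question.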
### Solutions for Replication Lag {#id131}
When working with an eventually consistent system, it is worth thinking about how the application
behaves if the replication lag increases to several minutes or even hours. If the answer is “no
problem,” that’s great. However, if the result is a bad experience for users, it’s important to
design the system to provide a stronger guarantee, such as read-after-write. Pretending that
replication is synchronous when in fact it is asynchronous is a recipe for problems down the line.
As discussed earlier, there are ways in which an application can provide a stronger guarantee than
the underlying database—for example, by performing certain kinds of reads on the leader or a
synchronously updated follower. However, dealing with these issues in application code is complex
and easy to get wrong.
The simplest programming model for application developers is to choose a database that provides a
strong consistency guarantee for replicas, such as linearizability (see [Chapter 10](/en/ch10#ch_consistency)), and ACID
transactions (see [Chapter 8](/en/ch8#ch_transactions)). This allows you to mostly ignore the challenges that arise
from replication, and treat the database as if it had just a single node. In the early 2010s the
*NoSQL* movement promoted the view that these features limited scalability, and that large-scale
systems would have to embrace eventual consistency.
However, since then, a number of databases have started providing strong consistency and transactions
while also offering the fault tolerance, high availability, and scalability advantages of a
distributed database. As mentioned in [“Relational Model versus Document Model”](/en/ch3#sec_datamodels_history), this trend is known as *NewSQL* to
contrast with NoSQL (although it’s less about SQL specifically, and more about new approaches to
scalable transaction management).
Even though scalable, strongly consistent distributed databases are now available, there are still
good reasons why some applications choose to use different forms of replication that offer weaker
consistency guarantees: they can offer stronger resilience in the face of network interruptions, and
have lower overheads compared to transactional systems. We will explore such approaches in the rest
of this chapter.
## Multi-Leader Replication {#sec_replication_multi_leader}
So far in this chapter we have only considered replication architectures using a single leader.
Although that is a common approach, there are interesting alternatives.
Single-leader replication has one major downside: all writes must go through the one leader. If you
can’t connect to the leader for any reason, for example due to a network interruption between you
and the leader, you can’t write to the database.
A natural extension of the single-leader replication model is to allow more than one node to accept
writes. Replication still happens in the same way: each node that processes a write must forward
that data change to all the other nodes. We call this a *multi-leader* configuration (also known as
*active/active* or *bidirectional* replication). In this setup, each leader simultaneously acts as a
follower to the other leaders.
As with single-leader replication, there is a choice between making it synchronous or asynchronous.
Let’s say you have two leaders, *A* and *B*, and you’re trying to write to *A*. If writes are
synchronously replicated from *A* to *B*, and the network between the two nodes is interrupted, you
can’t write to *A* until the network comes back. Synchronous multi-leader replication thus gives you
a model that is very similar to single-leader replication: it is as if you had made *B* the leader and
*A* simply forwarded any write requests to *B* to be executed.
For that reason, we won’t go further into synchronous multi-leader replication, and simply treat it
as equivalent to single-leader replication. The rest of this section focuses on asynchronous
multi-leader replication, in which any leader can process writes even when its connection to the
other leaders is interrupted.
### Geographically Distributed Operation {#sec_replication_multi_dc}
It rarely makes sense to use a multi-leader setup within a single region, because the benefits
rarely outweigh the added complexity. However, there are some situations in which this configuration
is reasonable.
Imagine you have a database with replicas in several different regions (perhaps so that you can
tolerate the failure of an entire region, or perhaps in order to be closer to your users). This is
known as a *geographically distributed*, *geo-distributed* or *geo-replicated* setup. With
single-leader replication, the leader has to be in *one* of the regions, and all writes must go
through that region.
In a multi-leader configuration, you can have a leader in *each* region.
[Figure 6-6](/en/ch6#fig_replication_multi_dc) shows what this architecture might look like. Within each region,
regular leader–follower replication is used (with followers maybe in a different availability zone
from the leader); between regions, each region’s leader replicates its changes to the leaders in
other regions.
{{< figure src="/fig/ddia_0606.png" id="fig_replication_multi_dc" caption="Figure 6-6. Multi-leader replication across multiple regions." class="w-full my-4" >}}
Let’s compare how the single-leader and multi-leader configurations fare in a multi-region deployment:
Performance
: In a single-leader configuration, every write must go over the internet to the region with the
leader. This can add significant latency to
writes and might contravene the purpose of having multiple regions in the first place. In a
multi-leader configuration, every write can be processed in the local region and is replicated
asynchronously to the other regions. Thus, the inter-region network delay is hidden from
users, which means the perceived performance may be better.
Tolerance of regional outages
: In a single-leader configuration, if the region with the leader becomes unavailable, failover can
promote a follower in another region to be leader. In a multi-leader configuration, each region
can continue operating independently of the others, and replication catches up when the offline
region comes back online.
Tolerance of network problems
: Even with dedicated connections, traffic between regions
can be less reliable than traffic between zones in the same region or within a single zone. A
single-leader configuration is very sensitive to problems in this inter-region link, because when
a client in one region wants to write to a leader in another region, it has to send its request
over that link and wait for the response before it can complete.
A multi-leader configuration with asynchronous replication can tolerate network problems better:
during a temporary network interruption, each region’s leader can continue independently processing writes.
Consistency
: A single-leader system can provide strong consistency guarantees, such as serializable
transactions, which we will discuss in [Chapter 8](/en/ch8#ch_transactions). The biggest downside of multi-leader
systems is that the consistency they can achieve is much weaker. For example, you can’t guarantee
that a bank account won’t go negative or that a username is unique: it’s always possible for
different leaders to process writes that are individually fine (paying out some of the money in an
account, registering a particular username), but which violate the constraint when taken together
with another write on another leader.
This is simply a fundamental limitation of distributed systems [^28].
If you need to enforce such constraints, you’re therefore better off with a single-leader system.
However, as we will see in [“Dealing with Conflicting Writes”](/en/ch6#sec_replication_write_conflicts), multi-leader systems can still
achieve consistency properties that are useful in a wide range of apps that don’t need such constraints.
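The bank account example can be made concrete with a small simulation. This is a sketch under simplifying assumptions (two leaders, one account, replication modeled as exchanging lists of balance changes); the `Leader` class and `exchange` function are hypothetical.

```python
class Leader:
    def __init__(self, name, balance):
        self.name = name
        self.balance = balance
        self.outbox = []  # balance changes not yet replicated to the other leader

    def withdraw(self, amount):
        if amount > self.balance:  # constraint is checked against local state only
            return False
        self.balance -= amount
        self.outbox.append(-amount)
        return True

    def receive(self, deltas):
        for delta in deltas:
            self.balance += delta


def exchange(a, b):
    """Asynchronous replication catching up: each leader applies the other's writes."""
    a_out, b_out = a.outbox, b.outbox
    a.outbox, b.outbox = [], []
    a.receive(b_out)
    b.receive(a_out)
```

If both leaders start with a balance of 100 and each accepts a withdrawal of 80 before replication catches up, both withdrawals pass the local check, yet after `exchange` both leaders hold a balance of -60.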
Multi-leader replication is less common than single-leader replication, but it is still supported by
many databases, including MySQL, Oracle, SQL Server, and YugabyteDB. In some cases it is an external
add-on feature, for example in Redis Enterprise, EDB Postgres Distributed, and pglogical [^29].
As multi-leader replication is a somewhat retrofitted feature in many databases, there are often
subtle configuration pitfalls and surprising interactions with other database features. For example,
autoincrementing keys, triggers, and integrity constraints can be problematic. For this reason,
multi-leader replication is often considered dangerous territory that should be avoided if possible [^30].
#### Multi-leader replication topologies {#sec_replication_topologies}
A *replication topology* describes the communication paths along which writes are propagated from
one node to another. If you have two leaders, like in [Figure 6-9](/en/ch6#fig_replication_write_conflict), there is
only one plausible topology: leader 1 must send all of its writes to leader 2, and vice versa. With
more than two leaders, various different topologies are possible. Some examples are illustrated in
[Figure 6-7](/en/ch6#fig_replication_topologies).
{{< figure src="/fig/ddia_0607.png" id="fig_replication_topologies" caption="Figure 6-7. Three example topologies in which multi-leader replication can be set up." class="w-full my-4" >}}
The most general topology is *all-to-all*, shown in [Figure 6-7](/en/ch6#fig_replication_topologies)(c),
in which every leader sends its writes to every other leader. However, more restricted topologies
are also used: for example, a *circular topology* in which each node receives writes from one node
and forwards those writes (plus any writes of its own) to one other node. Another popular topology
has the shape of a *star*: one designated root node forwards writes to all of the other nodes. The
star topology can be generalized to a tree.
--------
> [!NOTE]
> Don’t confuse a star-shaped network topology with a *star schema* (see
> [“Stars and Snowflakes: Schemas for Analytics”](/en/ch3#sec_datamodels_analytics)), which describes the structure of a data model.
--------
In circular and star topologies, a write may need to pass through several nodes before it reaches
all replicas. Therefore, nodes need to forward data changes they receive from other nodes. To
prevent infinite replication loops, each node is given a unique identifier, and in the replication
log, each write is tagged with the identifiers of all the nodes it has passed through [^31].
When a node receives a data change that is tagged with its own identifier, that data change is
ignored, because the node knows that it has already been processed.
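The tagging scheme can be sketched as follows. This is a minimal illustration, assuming a three-node circular topology and hypothetical `Write` and `Node` classes; a real database would record the node identifiers in its replication log rather than pass objects around in memory.

```python
from dataclasses import dataclass, field


@dataclass
class Write:
    key: str
    value: str
    # Identifiers of every node this write has already passed through.
    seen_by: frozenset = frozenset()


@dataclass
class Node:
    node_id: str
    data: dict = field(default_factory=dict)
    next_node: "Node | None" = None  # successor in the ring (hypothetical field)
    applied: int = 0  # how many writes were applied here, for illustration

    def local_write(self, key, value):
        self.receive(Write(key, value))

    def receive(self, write):
        if self.node_id in write.seen_by:
            return  # already processed here: ignoring it breaks the loop
        self.data[write.key] = write.value
        self.applied += 1
        if self.next_node is not None:
            tagged = Write(write.key, write.value, write.seen_by | {self.node_id})
            self.next_node.receive(tagged)


a, b, c = Node("A"), Node("B"), Node("C")
a.next_node, b.next_node, c.next_node = b, c, a  # circular: A -> B -> C -> A

a.local_write("title", "B")
```

The write travels A, B, C and returns to A, where it is dropped because A's identifier is already in the tag set, so each node applies it exactly once.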
#### Problems with different topologies {#problems-with-different-topologies}
A problem with circular and star topologies is that if just one node fails, it can interrupt the
flow of replication messages between other nodes, leaving them unable to communicate until the
node is fixed. The topology could be reconfigured to work around the failed node, but in most
deployments such reconfiguration would have to be done manually. The fault tolerance of a more
densely connected topology (such as all-to-all) is better because it allows messages to travel
along different paths, avoiding a single point of failure.
On the other hand, all-to-all topologies can have issues too. In particular, some network links may
be faster than others (e.g., due to network congestion), with the result that some replication
messages may “overtake” others, as illustrated in [Figure 6-8](/en/ch6#fig_replication_causality).
{{< figure src="/fig/ddia_0608.png" id="fig_replication_causality" caption="Figure 6-8. With multi-leader replication, writes may arrive in the wrong order at some replicas." class="w-full my-4" >}}
In [Figure 6-8](/en/ch6#fig_replication_causality), client A inserts a row into a table on leader 1, and client B
updates that row on leader 3. However, leader 2 may receive the writes in a different order: it may
first receive the update (which, from its point of view, is an update to a row that does not exist
in the database) and only later receive the corresponding insert (which should have preceded the
update).
This is a problem of causality, similar to the one we saw in [“Consistent Prefix Reads”](/en/ch6#sec_replication_consistent_prefix):
the update depends on the prior insert, so we need to make sure that all nodes process the insert
first, and then the update. Simply attaching a timestamp to every write is not sufficient, because
clocks cannot be trusted to be sufficiently in sync to correctly order these events at leader 2 (see
[Chapter 9](/en/ch9#ch_distributed)).
To order these events correctly, a technique called *version vectors* can be used, which we will
discuss later in this chapter (see [“Detecting Concurrent Writes”](/en/ch6#sec_replication_concurrent)). However, many multi-leader
replication systems don’t use good techniques for ordering updates, leaving them vulnerable to
issues like the one in [Figure 6-8](/en/ch6#fig_replication_causality). If you are using multi-leader replication, it
is worth being aware of these issues, carefully reading the documentation, and thoroughly testing
your database to ensure that it really does provide the guarantees you believe it to have.
### Sync Engines and Local-First Software {#sec_replication_offline_clients}
Another situation in which multi-leader replication is appropriate is if you have an application
that needs to continue to work while it is disconnected from the internet.
For example, consider the calendar apps on your mobile phone, your laptop, and other devices. You
need to be able to see your meetings (make read requests) and enter new meetings (make write
requests) at any time, regardless of whether your device currently has an internet connection. If
you make any changes while you are offline, they need to be synced with a server and your other
devices when the device is next online.
In this case, every device has a local database replica that acts as a leader (it accepts write
requests), and there is an asynchronous multi-leader replication process (sync) between the replicas
of your calendar on all of your devices. The replication lag may be hours or even days, depending on
when you have internet access available.
From an architectural point of view, this setup is very similar to multi-leader replication between
regions, taken to the extreme: each device is a “region,” and the network connection between them is
extremely unreliable.
#### Real-time collaboration, offline-first, and local-first apps {#real-time-collaboration-offline-first-and-local-first-apps}
Moreover, many modern web apps offer *real-time collaboration* features, such as Google Docs and
Sheets for text documents and spreadsheets, Figma for graphics, and Linear for project management.
What makes these apps so responsive is that user input is immediately reflected in the user
interface, without waiting for a network round-trip to the server, and edits by one user are shown
to their collaborators with low latency [^32] [^33] [^34].
This again results in a multi-leader architecture: each web browser tab that has opened the shared
file is a replica, and any updates that you make to the file are asynchronously replicated to the
devices of the other users who have opened the same file. Even if the app does not allow you to
continue editing a file while offline, the fact that multiple users can make edits without waiting
for a response from the server already makes it multi-leader.
Both offline editing and real-time collaboration require a similar replication infrastructure: the
application needs to capture any changes that the user makes to a file, and either send them to
collaborators immediately (if online), or store them locally for sending later (if offline).
Additionally, the application needs to receive changes from collaborators, merge them into the
user’s local copy of the file, and update the user interface to reflect the latest version. If
multiple users have changed the file concurrently, conflict resolution logic may be needed to merge
those changes.
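The capture, send-or-queue, and receive steps described above might look roughly like this. It is a simplified sketch: the `SyncClient` class and its `send` callback are hypothetical names, and incoming changes simply overwrite local values where a real sync engine would apply conflict resolution.

```python
class SyncClient:
    def __init__(self, send):
        self.send = send        # callback that delivers one change to collaborators
        self.online = False
        self.local_doc = {}     # the user's local copy of the file
        self.outbox = []        # changes captured while offline

    def edit(self, key, value):
        self.local_doc[key] = value        # reflected locally right away
        self.outbox.append((key, value))   # captured for replication
        self.flush()

    def flush(self):
        # Send queued changes in order, but only while we are online.
        while self.online and self.outbox:
            self.send(self.outbox.pop(0))

    def go_online(self):
        self.online = True
        self.flush()

    def receive(self, change):
        key, value = change
        self.local_doc[key] = value  # placeholder for a real merge step


sent = []
client = SyncClient(send=sent.append)
client.edit("title", "Lunch")   # made while offline: stored locally only
sent_while_offline = list(sent)
client.go_online()              # queued change is delivered now
```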
A software library that supports this process is called a *sync engine*. Although the idea has
existed for a long time, the term has recently gained attention [^35] [^36] [^37].
An application that allows a user to continue editing a file while offline (which may be implemented
using a sync engine) is called *offline-first* [^38].
The term *local-first software* refers to collaborative apps that are not only offline-first, but
are also designed to continue working even if the developer who made the software shuts down all of
their online services [^39].
This can be achieved by using a sync engine with an open standard sync protocol for which multiple
service providers are available [^40].
For example, Git is a local-first collaboration system (albeit one that doesn’t support real-time
collaboration) since you can sync via GitHub, GitLab, or any other repository hosting service.
#### Pros and cons of sync engines {#pros-and-cons-of-sync-engines}
The dominant way of building web apps today is to keep very little persistent state on the client,
and to rely on making requests to a server whenever a new piece of data needs to be displayed or
some data needs to be updated. In contrast, when using a sync engine, you have persistent state on
the client, and communication with the server is moved into a background process. The sync engine
approach has a number of advantages:
* Having the data locally means the user interface can be much faster to respond than if it had to
wait for a service call to fetch some data. Some apps aim to respond to user input in the *next
frame* of the graphics system, which means rendering within 16 ms on a display with a
60 Hz refresh rate.
* Allowing users to continue working while offline is valuable, especially on mobile devices with
intermittent connectivity. With a sync engine, an app doesn’t need a separate offline mode: being
offline is the same as having very large network delay.
* A sync engine simplifies the programming model for frontend apps, compared to performing explicit
service calls in application code. Every service call requires error handling, as discussed in
[“The problems with remote procedure calls (RPCs)”](/en/ch5#sec_problems_with_rpc): for example, if a request to update data on a server fails, the user
interface needs to somehow reflect that error. A sync engine allows the app to perform reads and
writes on local data, which almost never fails, leading to a more declarative programming style [^41].
* In order to display edits from other users in real-time, you need to receive notifications of
those edits and efficiently update the user interface accordingly. A sync engine combined with a
*reactive programming* model is a good way of implementing this [^42].
Sync engines work best when all the data that the user may need is downloaded in advance and stored
persistently on the client. This means that the data is available for offline access when needed,
but it also means that sync engines are not suitable if the user has access to a very large amount
of data. For example, downloading all the files that the user themselves created is probably fine
(one user generally doesn’t generate that much data), but downloading the entire catalog of an
e-commerce website probably doesn’t make sense.
Sync engines were pioneered by Lotus Notes in the 1980s (without using that term) [^43],
and sync for specific apps such as calendars has also existed for a long
time. Today there are a number of general-purpose sync engines. Some use a proprietary
backend service (e.g., Google Firestore, Realm, or Ditto), while others have an open source backend,
making them suitable for creating local-first software (e.g., PouchDB/CouchDB, Automerge, or Yjs).
Multiplayer video games have a similar need to respond immediately to the user’s local actions, and
reconcile them with other players’ actions received asynchronously over the network. In game
development jargon the equivalent of a sync engine is called *netcode*. The techniques used in
netcode are quite specific to the requirements of games [^44], and don’t directly
carry over to other types of software, so we won’t consider them further in this book.
### Dealing with Conflicting Writes {#sec_replication_write_conflicts}
The biggest problem with multi-leader replication—both in a geo-distributed server-side database and
a local-first sync engine on end user devices—is that concurrent writes on different leaders can
lead to conflicts that need to be resolved.
For example, consider a wiki page that is simultaneously being edited by two users, as shown in
[Figure 6-9](/en/ch6#fig_replication_write_conflict). User 1 changes the title of the page from A to B, and user 2
independently changes the title from A to C. Each user’s change is successfully applied to their
local leader. However, when the changes are asynchronously replicated, a conflict is detected.
This problem does not occur in a single-leader database.
{{< figure src="/fig/ddia_0609.png" id="fig_replication_write_conflict" caption="Figure 6-9. A write conflict caused by two leaders concurrently updating the same record." class="w-full my-4" >}}
> [!NOTE]
> We say that the two writes in [Figure 6-9](/en/ch6#fig_replication_write_conflict) are *concurrent* because neither
> was “aware” of the other at the time the write was originally made. It doesn’t matter whether the
> writes literally happened at the same time; indeed, if the writes were made while offline, they
> might have actually happened some time apart. What matters is whether one write occurred in a state
> where the other write has already taken effect.
In [“Detecting Concurrent Writes”](/en/ch6#sec_replication_concurrent) we will tackle the question of how a database can determine
whether two writes are concurrent. For now we will assume that we can detect conflicts, and we want
to figure out the best way of resolving them.
#### Conflict avoidance {#conflict-avoidance}
One strategy for dealing with conflicts is to prevent them from occurring in the first place. For example, if the
application can ensure that all writes for a particular record go through the same leader, then
conflicts cannot occur, even if the database as a whole is multi-leader. This approach is not
possible in the case of a sync engine client being updated offline, but it is sometimes possible in
geo-replicated server systems [^30].
For example, in an application where a user can only edit their own data, you can ensure that
requests from a particular user are always routed to the same region and use the leader in that
region for reading and writing. Different users may have different “home” regions (perhaps picked
based on geographic proximity to the user), but from any one user’s point of view the configuration
is essentially single-leader.
However, sometimes you might want to change the designated leader for a record—perhaps because
one region is unavailable and you need to reroute traffic to another region, or perhaps because
a user has moved to a different location and is now closer to a different region. There is now a
risk that the user performs a write while the change of designated leader is in progress, leading to
a conflict that would have to be resolved using one of the methods below. Thus, conflict avoidance
breaks down if you allow the leader to be changed.
Another example of conflict avoidance: imagine you want to insert new records and generate unique
IDs for them based on an auto-incrementing counter. If you have two leaders, you could set them up
so that one leader only generates odd numbers and the other only generates even numbers. That way
you can be sure that the two leaders won’t concurrently assign the same ID to different records.
We will discuss other ID assignment schemes in [“ID Generators and Logical Clocks”](/en/ch10#sec_consistency_logical).
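The odd/even scheme generalizes to any number of leaders by giving each one its own offset and a shared step size. A minimal sketch, where `id_generator` is a hypothetical helper rather than any database's API:

```python
import itertools


def id_generator(leader_index, num_leaders):
    """Return an iterator over IDs that no other leader will generate."""
    return itertools.count(start=leader_index, step=num_leaders)


leader_1_ids = id_generator(1, 2)  # odd numbers
leader_2_ids = id_generator(2, 2)  # even numbers

first_from_1 = [next(leader_1_ids) for _ in range(3)]
first_from_2 = [next(leader_2_ids) for _ in range(3)]
```

Because the two sequences never intersect, the leaders can assign IDs without coordinating, as long as the number of leaders is fixed in advance.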
#### Last write wins (discarding concurrent writes) {#sec_replication_lww}
If conflicts can’t be avoided, the simplest way of resolving them is to attach a timestamp to each
write, and to always use the value with the greatest timestamp. For example, in
[Figure 6-9](/en/ch6#fig_replication_write_conflict), let’s say that the timestamp of user 1’s write is greater than
the timestamp of user 2’s write. In that case, both leaders will determine that the new title of the
page should be B, and they discard the write that sets it to C. If the writes coincidentally have
the same timestamp, the winner can be chosen by comparing the values (e.g., in the case of strings,
taking the one that’s earlier in the alphabet).
This approach is called *last write wins* (LWW) because the write with the greatest timestamp can be
considered the “last” one. The term is misleading though, because when two writes are concurrent
like in [Figure 6-9](/en/ch6#fig_replication_write_conflict), which one is earlier and which is later is undefined, and
so the timestamp order of concurrent writes is essentially random.
Therefore the real meaning of LWW is: when the same record is concurrently written on different
leaders, one of those writes is randomly chosen to be the winner, and the other writes are silently
discarded, even though they were successfully processed at their respective leaders. This achieves
the goal that eventually all replicas end up in a consistent state, but at the cost of data loss.
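As a sketch of this rule, the function below picks a winner between two versions of a record. The `Versioned` type and the timestamp values are assumptions made for illustration.

```python
from typing import NamedTuple


class Versioned(NamedTuple):
    timestamp: int
    value: str


def lww_merge(a, b):
    """Keep the write with the greater timestamp; break ties by comparing values."""
    if a.timestamp != b.timestamp:
        return a if a.timestamp > b.timestamp else b
    return a if a.value <= b.value else b  # tie: value earlier in the alphabet


user1 = Versioned(timestamp=1002, value="B")
user2 = Versioned(timestamp=1001, value="C")

winner_on_leader_1 = lww_merge(user1, user2)
winner_on_leader_2 = lww_merge(user2, user1)
```

The function returns the same result regardless of argument order, which is why both leaders converge on the same title. The write that sets the title to C is dropped without any error being reported to user 2.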
If you can avoid conflicts—for example, by only inserting records with a unique key such as a UUID,
and never updating them—then LWW is no problem. But if you update existing
records, or if different leaders may insert records with the same key, then you have to decide
whether lost updates are a problem for your application. If lost updates are not acceptable, you
need to use one of the conflict resolution approaches described below.
Another problem with LWW is that if a real-time clock (e.g., a Unix timestamp) is used as the timestamp
for the writes, the system becomes very sensitive to clock synchronization. If one node has a clock
that is ahead of the others, and you try to overwrite a value written by that node, your write may
be ignored as it may have a lower timestamp, even though it clearly occurred later. This problem can
be solved by using a *logical clock*, which we will discuss in [“ID Generators and Logical Clocks”](/en/ch10#sec_consistency_logical).
#### Manual conflict resolution {#manual-conflict-resolution}
If randomly discarding some of your writes is not desirable, the next option is to resolve the
conflict manually. You may be familiar with manual conflict resolution from Git and other version
control systems: if commits on two different branches edit the same lines of the same file, and you
try to merge those branches, you will get a merge conflict that needs to be resolved before the
merge is complete.
In a database, it would be impractical for a conflict to stop the entire replication process until a
human has resolved it. Instead, databases typically store all the concurrently written values for a
given record—for example, both B and C in [Figure 6-9](/en/ch6#fig_replication_write_conflict). These values are
sometimes called *siblings*. The next time you query that record, the database returns *all* those
values, rather than just the latest one. You can then resolve those values in whatever way you want,
either automatically in application code (for example, you could concatenate B and C into “B/C”), or
by asking the user. You then write back a new value to the database to resolve the conflict.
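The read-resolve-write cycle could be sketched like this. `SiblingRegister` is a hypothetical class, not the API of CouchDB or any other database, and it omits the version tracking a real system needs in order to know which siblings a resolving write supersedes.

```python
class SiblingRegister:
    """A record that keeps every concurrently written value until resolved."""

    def __init__(self, value):
        self.siblings = {value}

    def add_concurrent(self, value):
        # A concurrent write from another leader arrives: keep both values.
        self.siblings.add(value)

    def read(self):
        # Return all siblings, sorted so every reader sees the same order.
        return sorted(self.siblings)

    def resolve(self, value):
        # The application writes back a single merged value.
        self.siblings = {value}


title = SiblingRegister("B")
title.add_concurrent("C")

values_seen = title.read()
merged = "/".join(values_seen)
title.resolve(merged)
```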
This approach to conflict resolution is used in some systems, such as CouchDB. However, it also
suffers from a number of problems:
* The API of the database changes: for example, where previously the title of the wiki page was just
a string, it now becomes a set of strings that usually contains one element, but may sometimes
contain multiple elements if there is a conflict. This can make the data awkward to work with in
application code.
* Asking the user to manually merge the siblings is a lot of work, both for the app developer (who
needs to build the user interface for conflict resolution) and for the user (who may be confused
about what they are being asked to do, and why). In many cases, it’s better to merge automatically
than to bother the user.
* Merging siblings automatically can lead to surprising behavior if it is not done carefully. For
example, the shopping cart on Amazon used to allow concurrent updates, which were then merged by
keeping all the shopping cart items that appeared in any of the siblings (i.e., taking the set
union of the carts). This meant that if the customer had removed an item from their cart in one
sibling, but another sibling still contained that old item, the removed item would unexpectedly
reappear in the customer’s cart [^45]. [Figure 6-10](/en/ch6#fig_replication_amazon_anomaly) shows an example where Device 1 removes Book from the shopping
cart and concurrently Device 2 removes DVD, but after merging the conflict both items reappear.
* If multiple nodes observe the conflict and concurrently resolve it, the conflict resolution
process can itself introduce a new conflict. Those resolutions could even be inconsistent: for
example, one node may merge B and C into “B/C” and another may merge them into “C/B” if you are
not careful to order them consistently. When the conflict between “B/C” and “C/B” is merged, it
may result in “B/C/C/B” or something similarly surprising.
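The shopping cart anomaly is easy to reproduce with plain sets. In this sketch the cart initially contains Book, DVD, and Soap, which is an assumption consistent with the example discussed in this section.

```python
initial_cart = {"Book", "DVD", "Soap"}

device_1 = initial_cart - {"Book"}  # Device 1 removes Book
device_2 = initial_cart - {"DVD"}   # Device 2 concurrently removes DVD

# Merging the siblings by set union loses the information that items were removed.
merged = device_1 | device_2
```

After the merge, `merged` equals `initial_cart`: both removals have been undone, because a union cannot distinguish "never removed" from "removed on the other device".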
{{< figure src="/fig/ddia_0610.png" id="fig_replication_amazon_anomaly" caption="Figure 6-10. Example of Amazon's shopping cart anomaly: if conflicts on a shopping cart are merged by taking the union, deleted items may reappear." class="w-full my-4" >}}
#### Automatic conflict resolution {#automatic-conflict-resolution}
For many applications, the best way of handling conflicts is to use an algorithm that automatically
merges concurrent writes into a consistent state. Automatic conflict resolution ensures that all
replicas *converge* to the same state—i.e., all replicas that have processed the same set of writes
have the same state, regardless of the order in which the writes arrived.
LWW is a simple example of a conflict resolution algorithm. More sophisticated merge algorithms have
been developed for different types of data, with the goal of preserving the intended effect of all
updates as much as possible, and hence avoiding data loss:
* If the data is text (e.g., the title or body of a wiki page), we can detect which characters have
been inserted or deleted from one version to the next. The merged result then preserves all the
insertions and deletions made in any of the siblings. If users concurrently insert text at the
same position, it can be ordered deterministically so that all nodes get the same merged outcome.
* If the data is a collection of items (ordered like a to-do list, or unordered like a shopping
cart), we can merge it similarly to text by tracking insertions and deletions. To avoid the
shopping cart issue in [Figure 6-10](/en/ch6#fig_replication_amazon_anomaly), the algorithms track the fact that Book
and DVD were deleted, so the merged result is Cart = {Soap}.
* If the data is an integer representing a counter that can be incremented or decremented (e.g., the
number of likes on a social media post), the merge algorithm can tell how many increments and
decrements happened on each sibling, and add them together correctly so that the result does not
double-count and does not drop updates.
* If the data is a key-value mapping, we can merge updates to the same key by applying one of the
other conflict resolution algorithms to the values under that key. Updates to different keys can
be handled independently from each other.
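Two of these merges can be illustrated with a three-way merge against the common ancestor of the siblings. This is a simplification chosen for brevity: the function names are hypothetical, and practical algorithms usually attach metadata to individual elements or operations instead of comparing whole states.

```python
def merge_set(base, a, b):
    """Merge two siblings of a set, preserving removals as well as insertions."""
    removed = (base - a) | (base - b)
    added = (a - base) | (b - base)
    return (base - removed) | added


def merge_counter(base, a, b):
    """Merge two siblings of a counter by adding up each side's net change."""
    return base + (a - base) + (b - base)


cart = {"Book", "DVD", "Soap"}
merged_cart = merge_set(cart, cart - {"Book"}, cart - {"DVD"})

# The counter starts at 10; one sibling saw two increments, the other one decrement.
likes = merge_counter(10, 12, 9)
```

Here `merged_cart` contains only Soap, and `likes` is 11: every update is counted once and none is dropped.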
There are limits to what is possible with conflict resolution. For example, if you want to enforce
that a list contains no more than five items, and multiple users concurrently add items to the list
so that there are more than five in total, your only option is to drop some of the items.
Nevertheless, automatic conflict resolution is sufficient to build many useful apps. And if you
start from the requirement of wanting to build a collaborative offline-first or local-first app,
then conflict resolution is inevitable, and automating it is often the best approach.
### CRDTs and Operational Transformation {#sec_replication_crdts}
Two families of algorithms are commonly used to implement automatic conflict resolution:
*Conflict-free replicated datatypes* (CRDTs) [^46] and *Operational Transformation* (OT) [^47].
They have different design philosophies and performance characteristics, but both are able to
perform automatic merges for all the aforementioned types of data.
[Figure 6-11](/en/ch6#fig_replication_ot_crdt) shows an example of how OT and a CRDT merge concurrent updates to a
text. Assume you have two replicas that both start off with the text “ice”. One replica prepends the
letter “n” to make “nice”, while concurrently the other replica appends an exclamation mark to make “ice!”.
{{< figure src="/fig/ddia_0611.png" id="fig_replication_ot_crdt" caption="Figure 6-11. How two concurrent insertions into a string are merged by OT and a CRDT respectively." class="w-full my-4" >}}
The merged result “nice!” is achieved differently by the two types of algorithms:
OT
: We record the index at which characters are inserted or deleted: “n” is inserted at index 0, and
“!” at index 3. Next, the replicas exchange their operations. The insertion of “n” at 0 can be
applied as-is, but if the insertion of “!” at 3 were applied to the state “nice” we would get
“nic!e”, which is incorrect. We therefore need to transform the index of each operation to account
for concurrent operations that have already been applied; in this case, the insertion of “!” is
transformed to index 4 to account for the insertion of “n” at an earlier index.
CRDT
: Most CRDTs give each character a unique, immutable ID and use those to determine the positions of
insertions/deletions, instead of indexes. For example, in [Figure 6-11](/en/ch6#fig_replication_ot_crdt) we assign
the ID 1A to “i”, the ID 2A to “c”, etc. When inserting the exclamation mark, we generate an
operation containing the ID of the new character (4B) and the ID of the existing character after
which we want to insert (3A). To insert at the beginning of the string we give “nil” as the
preceding character ID. Concurrent insertions at the same position are ordered by the IDs of the
characters. This ensures that replicas converge without performing any transformation.
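The index transformation used by OT can be sketched for the insert-only case. The `Insert` type, the `transform` function, and the replica-name tie-break are assumptions made for this illustration; production OT algorithms also handle deletions and longer histories of concurrent operations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Insert:
    index: int
    char: str
    replica: str  # only used to break ties between inserts at the same index


def apply(text, op):
    return text[:op.index] + op.char + text[op.index:]


def transform(op, applied):
    """Adjust `op` so it can be applied after the concurrent insert `applied`."""
    shifts = applied.index < op.index or (
        applied.index == op.index and applied.replica < op.replica
    )
    return Insert(op.index + 1, op.char, op.replica) if shifts else op


insert_n = Insert(0, "n", "A")
insert_bang = Insert(3, "!", "B")

# Replica A applied "n" first and then receives "!", which is shifted to index 4.
replica_a = apply(apply("ice", insert_n), transform(insert_bang, insert_n))
# Replica B applied "!" first and then receives "n", which stays at index 0.
replica_b = apply(apply("ice", insert_bang), transform(insert_n, insert_bang))
```

Both replicas end up with the text "nice!" even though they applied the operations in different orders.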
There are many algorithms based on variations of these ideas. Lists/arrays can be supported
similarly, using list elements instead of characters, and other datatypes such as key-value maps can
be added quite easily. There are some performance and functionality trade-offs between OT and CRDTs,
but it’s possible to combine the advantages of CRDTs and OT in one algorithm [^48].
OT is most often used for real-time collaborative editing of text, e.g., in Google Docs [^32], whereas CRDTs can be found in
distributed databases such as Redis Enterprise, Riak, and Azure Cosmos DB [^49].
Sync engines for JSON data can be implemented both with CRDTs (e.g., Automerge or Yjs) and with OT (e.g., ShareDB).
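To make the ID-based approach of text CRDTs concrete, here is a heavily simplified sketch. It assumes IDs are `(counter, replica)` tuples, that "n" receives the ID `(4, "A")`, and that each operation arrives after the character it refers to; the `integrate` function is a hypothetical name, and deletion is not handled.

```python
def integrate(doc, char_id, char, after_id):
    """Insert (char_id, char) after the element with ID `after_id`.

    `doc` is a list of (id, char) pairs; `after_id` of None means the start.
    """
    if after_id is None:
        pos = 0
    else:
        pos = next(i for i, (cid, _) in enumerate(doc) if cid == after_id) + 1
    # Concurrent inserts at the same position are ordered by ID:
    # skip past any existing elements whose ID is greater than the new one.
    while pos < len(doc) and doc[pos][0] > char_id:
        pos += 1
    doc.insert(pos, (char_id, char))


def text_of(doc):
    return "".join(char for _, char in doc)


def initial_doc():
    return [((1, "A"), "i"), ((2, "A"), "c"), ((3, "A"), "e")]


insert_n = ((4, "A"), "n", None)        # insert "n" at the beginning
insert_bang = ((4, "B"), "!", (3, "A"))  # insert "!" after "e"

replica_1 = initial_doc()
for op in (insert_n, insert_bang):
    integrate(replica_1, *op)

replica_2 = initial_doc()
for op in (insert_bang, insert_n):
    integrate(replica_2, *op)
```

Both replicas produce "nice!" without rewriting any operation, because each insertion refers to a stable character ID rather than to an index that shifts as the text changes.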
#### What is a conflict? {#what-is-a-conflict}
Some kinds of conflict are obvious. In the example in [Figure 6-9](/en/ch6#fig_replication_write_conflict), two writes
concurrently modified the same field in the same record, setting it to two different values. There
is little doubt that this is a conflict.
Other kinds of conflict can be more subtle to detect. For example, consider a meeting room booking
system: it tracks which room is booked by which group of people at which time. This application
needs to ensure that each room is only booked by one group of people at any one time (i.e., there
must not be any overlapping bookings for the same room). In this case, a conflict may arise if two
different bookings are created for the same room at the same time. Even if the application checks
availability before allowing a user to make a booking, there can be a conflict if the two bookings
are made on two different leaders.
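The following sketch shows why a local availability check is not enough. The booking dictionaries, integer hours, and the `try_book` helper are hypothetical; each leader is modeled as a plain list of bookings it knows about.

```python
def overlaps(a, b):
    """Two bookings conflict if they are for the same room and their times intersect."""
    return a["room"] == b["room"] and a["start"] < b["end"] and b["start"] < a["end"]


def try_book(leader_bookings, booking):
    # The check only sees bookings that this leader already knows about.
    if any(overlaps(booking, existing) for existing in leader_bookings):
        return False
    leader_bookings.append(booking)
    return True


leader_1, leader_2 = [], []
booking_a = {"room": "R1", "start": 10, "end": 11, "group": "A"}
booking_b = {"room": "R1", "start": 10, "end": 11, "group": "B"}

accepted_1 = try_book(leader_1, booking_a)  # passes the check on leader 1
accepted_2 = try_book(leader_2, booking_b)  # passes the check on leader 2

# Once the leaders exchange their writes, the double booking becomes visible.
after_replication = leader_1 + leader_2
conflict = overlaps(after_replication[0], after_replication[1])
```

Both bookings are accepted, and the overlap only appears after replication, when it is too late to reject either request.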
There isn’t a quick ready-made answer to handling such conflicts, but in the following chapters we will trace a path toward a
good understanding of this problem. We will see some more examples of conflicts in
[Chapter 8](/en/ch8#ch_transactions), and in [“Ordering events to capture causality”](/en/ch13#sec_future_capture_causality) we will discuss scalable approaches for detecting and
resolving conflicts in a replicated system.
## Leaderless Replication {#sec_replication_leaderless}
The replication approaches we have discussed so far in this chapter—single-leader and
multi-leader replication—are based on the idea that a client sends a write request to one node
(the leader), and the database system takes care of copying that write to the other replicas. A
leader determines the order in which writes should be processed, and followers apply the leader’s
writes in the same order.
Some data storage systems take a different approach, abandoning the concept of a leader and
allowing any replica to directly accept writes from clients. Some of the earliest replicated data
systems were leaderless [^1] [^50], but the idea was mostly forgotten during the era of dominance of relational databases. It once again became
a fashionable architecture for databases after Amazon used it for its in-house *Dynamo* system in
2007 [^45]. Riak, Cassandra, and ScyllaDB are open source datastores with leaderless replication models inspired
by Dynamo, so this kind of database is also known as *Dynamo-style*.
--------
> [!NOTE]
> The original *Dynamo* system was only described in a paper [^45], but never released outside of Amazon.
> The similarly-named *DynamoDB* is a more recent cloud database from AWS, but it has a completely different architecture:
> it uses single-leader replication based on the Multi-Paxos consensus algorithm [^5].
--------
In some leaderless implementations, the client directly sends its writes to several replicas, while
in others, a coordinator node does this on behalf of the client. However, unlike a leader database,
that coordinator does not enforce a particular ordering of writes. As we shall see, this difference in design has
profound consequences for the way the database is used.
### Writing to the Database When a Node Is Down {#id287}
Imagine you have a database with three replicas, and one of the replicas is currently
unavailable—perhaps it is being rebooted to install a system update. In a single-leader
configuration, if you want to continue processing writes, you may need to perform a failover (see
[“Handling Node Outages”](/en/ch6#sec_replication_failover)).
On the other hand, in a leaderless configuration, failover does not exist.
[Figure 6-12](/en/ch6#fig_replication_quorum_node_outage) shows what happens: the client (user 1234) sends the write to
all three replicas in parallel, and the two available replicas accept the write but the unavailable
replica misses it. Let’s say that it’s sufficient for two out of three replicas to
acknowledge the write: after user 1234 has received two *ok* responses, we consider the write to be
successful. The client simply ignores the fact that one of the replicas missed the write.
{{< figure src="/fig/ddia_0612.png" id="fig_replication_quorum_node_outage" caption="Figure 6-12. A quorum write, quorum read, and read repair after a node outage." class="w-full my-4" >}}
Now imagine that the unavailable node comes back online, and clients start reading from it. Any
writes that happened while the node was down are missing from that node. Thus, if you read from that
node, you may get *stale* (outdated) values as responses.
To solve that problem, when a client reads from the database, it doesn’t just send its request to
one replica: *read requests are also sent to several nodes in parallel*. The client may get
different responses from different nodes; for example, the up-to-date value from one node and a
stale value from another.
In order to tell which responses are up-to-date and which are outdated, every value that is written
needs to be tagged with a version number or timestamp, similarly to what we saw in
[“Last write wins (discarding concurrent writes)”](/en/ch6#sec_replication_lww). When a client receives multiple values in response to a read, it uses the
one with the greatest timestamp (even if that value was only returned by one replica, and several
other replicas returned older values). See [“Detecting Concurrent Writes”](/en/ch6#sec_replication_concurrent) for more details.
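The write and read paths described so far can be sketched with in-memory replicas. The `Replica` class, the key `"user:1234"`, and the stored values are made up for illustration; version numbers stand in for whatever versioning scheme the database uses.

```python
class Replica:
    def __init__(self):
        self.available = True
        self.store = {}  # key -> (version, value)

    def write(self, key, version, value):
        if not self.available:
            return False  # the write is missed; no acknowledgment is sent
        current = self.store.get(key)
        if current is None or version > current[0]:
            self.store[key] = (version, value)
        return True

    def read(self, key):
        return self.store.get(key) if self.available else None


def quorum_write(replicas, key, version, value, w=2):
    # Send the write to all replicas; succeed if at least w acknowledge it.
    acks = sum(r.write(key, version, value) for r in replicas)
    return acks >= w


def quorum_read(replicas, key):
    # Ask all replicas and keep the response with the greatest version.
    responses = [r.read(key) for r in replicas]
    found = [resp for resp in responses if resp is not None]
    return max(found, key=lambda resp: resp[0])


replicas = [Replica(), Replica(), Replica()]
quorum_write(replicas, "user:1234", 6, "old")

replicas[2].available = False  # one replica goes down
ok = quorum_write(replicas, "user:1234", 7, "new")
replicas[2].available = True   # it comes back with stale data

latest = quorum_read(replicas, "user:1234")
```

The second write succeeds with two acknowledgments, and the read returns version 7 even though the third replica still holds version 6.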
#### Catching up on missed writes {#sec_replication_read_repair}
The replication system should ensure that eventually all the data is copied to every replica. After
an unavailable node comes back online, how does it catch up on the writes that it missed? Several
mechanisms are used in Dynamo-style datastores:
Read repair
: When a client makes a read from several nodes in parallel, it can detect any stale responses.
For example, in [Figure 6-12](/en/ch6#fig_replication_quorum_node_outage), user 2345 gets a version 6 value from
replica 3 and a version 7 value from replicas 1 and 2. The client sees that replica 3 has a stale
value and writes the newer value back to that replica. This approach works well for values that are
frequently read.
Hinted handoff
: If one replica is unavailable, another replica may store writes on its behalf in the form of
*hints*. When the replica that was supposed to receive those writes comes back, the replica
storing the hints sends them to the recovered replica, and then deletes the hints. This *handoff*
process helps bring replicas up-to-date even for values that are never read, and therefore not
handled by read repair.
Anti-entropy
: In addition, there is a background process that periodically looks for differences in
the data between replicas and copies any missing data from one replica to another. Unlike the
replication log in leader-based replication, this *anti-entropy process* does not copy writes in
any particular order, and there may be a significant delay before data is copied.
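The read path and read repair can be sketched in a few lines. This is a minimal in-memory illustration, not the implementation of any particular database: the `Replica` class, the `quorum_read` function, the key name, and the use of plain integer versions are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical in-memory replica mapping key -> (version, value)."""
    data: dict = field(default_factory=dict)

    def get(self, key):
        # A replica that has never seen the key reports version 0.
        return self.data.get(key, (0, None))

    def put(self, key, version, value):
        # Only accept the write if it is newer than what we already hold.
        if version > self.get(key)[0]:
            self.data[key] = (version, value)


def quorum_read(replicas, key):
    """Read from the given replicas, return the newest value, repair stale ones."""
    responses = [(replica, replica.get(key)) for replica in replicas]
    newest_version, newest_value = max(
        (resp for _, resp in responses), key=lambda resp: resp[0]
    )
    for replica, (version, _) in responses:
        if version < newest_version:
            # Read repair: write the newer value back to the stale replica.
            replica.put(key, newest_version, newest_value)
    return newest_value


# Scenario modeled on Figure 6-12: replica 3 missed the version 7 write.
r1, r2, r3 = Replica(), Replica(), Replica()
for r in (r1, r2, r3):
    r.put("user:1234:picture", 6, "old.png")
for r in (r1, r2):
    r.put("user:1234:picture", 7, "new.png")

result = quorum_read([r1, r2, r3], "user:1234:picture")
```

After the read, `result` holds the version 7 value and replica 3 has been brought up to date as a side effect. A real client would send the requests in parallel and wait only for the first *r* responses; the sequential loop here is a simplification.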
#### Quorums for reading and writing {#sec_replication_quorum_condition}
In the example of [Figure 6-12](/en/ch6#fig_replication_quorum_node_outage), we considered the write to be successful
even though it was only processed on two out of three replicas. What if only one out of three
replicas accepted the write? How far can we push this?
If we know that every successful write is guaranteed to be present on at least two out of three
replicas, that means at most one replica can be stale. Thus, if we read from at least two replicas,
we can be sure that at least one of the two is up to date. If the third replica is down or slow to
respond, reads can nevertheless continue returning an up-to-date value.
More generally, if there are *n* replicas, every write must be confirmed by *w* nodes to be
considered successful, and we must query at least *r* nodes for each read. (In our example,
*n* = 3, *w* = 2, *r* = 2.) As long as *w* + *r* > *n*,
we expect to get an up-to-date value when reading, because at least one of the *r* nodes we’re
reading from must be up to date. Reads and writes that obey these *r* and *w* values are called *quorum* reads and writes [^50].
You can think of *r* and *w* as the minimum number of votes required for the read or write to be valid.
In Dynamo-style databases, the parameters *n*, *w*, and *r* are typically configurable. A common
choice is to make *n* an odd number (typically 3 or 5) and to set *w* = *r* =
(*n* + 1) / 2 (rounded up). However, you can vary the numbers as you see fit.
For example, a workload with few writes and many reads may benefit from setting *w* = *n* and
*r* = 1. This makes reads faster, but has the disadvantage that just one failed node causes all
database writes to fail.
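The overlap argument can be checked by brute force for small clusters. The following sketch (the function names are chosen for illustration) enumerates every possible set of *w* nodes that might have acknowledged a write and every set of *r* nodes that might answer a read, and tests whether they always share at least one node:

```python
from itertools import combinations


def quorums_always_overlap(n, w, r):
    """True if every write set of size w shares a node with every read set of size r."""
    nodes = range(n)
    return all(
        set(write_set) & set(read_set)  # an empty intersection is falsy
        for write_set in combinations(nodes, w)
        for read_set in combinations(nodes, r)
    )


def majority(n):
    """Smallest number of nodes that is more than half of n."""
    return n // 2 + 1


print(quorums_always_overlap(3, 2, 2))  # w + r > n
print(quorums_always_overlap(3, 3, 1))  # w = n, r = 1
print(quorums_always_overlap(3, 1, 2))  # w + r = n, so the sets can miss each other
print(majority(3), majority(5))
```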
--------
> [!NOTE]
> There may be more than *n* nodes in the cluster, but any given value is stored only on *n*
> nodes. This allows the dataset to be sharded, supporting datasets that are larger than you can fit
> on one node. We will return to sharding in [Chapter 7](/en/ch7#ch_sharding).
--------
The quorum condition, *w* + *r* > *n*, allows the system to tolerate unavailable nodes
as follows:
* If *w* < *n*, we can still process writes if a node is unavailable.
* If *r* < *n*, we can still process reads if a node is unavailable.
* With *n* = 3, *w* = 2, *r* = 2 we can tolerate one unavailable
node, like in [Figure 6-12](/en/ch6#fig_replication_quorum_node_outage).
* With *n* = 5, *w* = 3, *r* = 3 we can tolerate two unavailable nodes.
This case is illustrated in [Figure 6-13](/en/ch6#fig_replication_quorum_overlap).
Normally, reads and writes are sent to all *n* replicas in parallel. The parameters *w* and *r*
determine how many nodes we wait for—i.e., how many of the *n* nodes need to report success
before we consider the read or write to be successful.
{{< figure src="/fig/ddia_0613.png" id="fig_replication_quorum_overlap" caption="Figure 6-13. If *w* + *r* > *n*, at least one of the *r* replicas you read from must have seen the most recent successful write." class="w-full my-4" >}}
If fewer than the required *w* or *r* nodes are available, writes or reads return an error. A node
could be unavailable for many reasons: because the node is down (crashed, powered down), due to an
error executing the operation (can’t write because the disk is full), due to a network interruption
between the client and the node, or for any number of other reasons. We only care whether the node
returned a successful response and don’t need to distinguish between different kinds of fault.
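As a rough sketch of this counting logic, the following code treats each replica as a callable that either returns normally or raises an exception. The names `quorum_write`, `QuorumError`, and the three stand-in replicas are hypothetical, and the requests are issued one after another for simplicity, whereas a real client would send them in parallel.

```python
class QuorumError(Exception):
    """Raised when fewer than the required number of replicas succeeded."""


def quorum_write(replica_calls, w):
    """Try the write on every replica; succeed if at least w acknowledge it.

    `replica_calls` is a list of zero-argument callables, one per replica,
    each of which raises an exception on any kind of failure.
    """
    acks = 0
    for call in replica_calls:
        try:
            call()
            acks += 1
        except Exception:
            # The cause (crash, full disk, network fault) is deliberately
            # ignored: only the number of successful responses matters.
            pass
    if acks < w:
        raise QuorumError(
            f"only {acks} of {len(replica_calls)} replicas acknowledged, need {w}"
        )
    return acks


# Stand-ins for replicas in different states (assumed for illustration).
def ok():
    pass


def disk_full():
    raise OSError("disk full")


def unreachable():
    raise ConnectionError("timed out")
```

With *w* = 2, `quorum_write([ok, ok, unreachable], 2)` succeeds with two acknowledgments, while `quorum_write([ok, disk_full, unreachable], 2)` raises `QuorumError`, regardless of why the two replicas failed.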
### Limitations of Quorum Consistency {#sec_replication_quorum_limitations}
If you have *n* replicas, and you choose *w* and *r* such that *w* + *r* > *n*, you can
generally expect every read to return the most recent value written for a key. This is the case because the
set of nodes to which you’ve written and the set of nodes from which you’ve read must overlap. That
is, among the nodes you read there must be at least one node with the latest value (illustrated in
[Figure 6-13](/en/ch6#fig_replication_quorum_overlap)).
Often, *r* and *w* are chosen to be a majority (more than *n*/2) of nodes, because that ensures
*w* + *r* > *n* while still tolerating the failure of any minority of nodes (fewer than *n*/2). But quorums are
not necessarily majorities—it only matters that the sets of nodes used by the read and write
operations overlap in at least one node. Other quorum assignments are possible, which allows some
flexibility in the design of distributed algorithms [^51].
You may also set *w* and *r* to smaller numbers, so that *w* + *r* ≤ *n* (i.e.,
the quorum condition is not satisfied). In this case, reads and writes will still be sent to *n*
nodes, but a smaller number of successful responses is required for the operation to succeed.
With a smaller *w* and *r* you are more likely to read stale values, because it’s more likely that
your read didn’t include the node with the latest value. On the upside, this configuration allows
lower latency and higher availability: if there is a network interruption and many replicas become
unreachable, there’s a higher chance that you can continue processing reads and writes. Only after
the number of reachable replicas falls below *w* or *r* does the database become unavailable for
writing or reading, respectively.
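To get a feel for this trade-off, consider a deliberately simplified model: exactly *w* of the *n* replicas hold the latest value, and a read is answered by *r* replicas chosen uniformly at random. Both are assumptions made for the sake of the arithmetic; in a real system the write keeps propagating to the remaining replicas, so actual staleness also depends on timing. Under this model, the probability that a read misses every up-to-date replica is C(*n* − *w*, *r*) / C(*n*, *r*):

```python
from fractions import Fraction
from math import comb


def miss_probability(n, w, r):
    """Probability that r randomly chosen replicas include none of the w fresh ones."""
    # comb(n - w, r) is 0 whenever r > n - w, i.e. whenever w + r > n.
    return Fraction(comb(n - w, r), comb(n, r))


print(miss_probability(3, 2, 2))  # quorum condition holds
print(miss_probability(3, 1, 1))  # w + r <= n
print(miss_probability(5, 2, 2))  # w + r <= n
```

The result is zero exactly when *w* + *r* > *n*, and grows as *w* and *r* shrink relative to *n*.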
However, even with *w* + *r* > *n*, there are edge cases in which the consistency
properties can be confusing. Some scenarios include:
* If a node carrying a new value fails, and its data is restored from a replica carrying an old
value, the number of replicas storing the new value may fall below *w*, breaking the quorum
condition.
* While a rebalancing is in progress, where some data is moved from one node to another (see
[Chapter 7](/en/ch7#ch_sharding)), nodes may have inconsistent views of which nodes should be holding the *n*
replicas for a particular value. This can result in the read and write quorums no longer
overlapping.
* If a read is concurrent with a write operation, the read may or may not see the concurrently
written value. In particular, it’s possible for one read to see the new value, and a subsequent
read to see the old value, as we shall see in [“Linearizability and quorums”](/en/ch10#sec_consistency_quorum_linearizable).
* If a write succeeded on some replicas but failed on others (for example because the disks on some
nodes are full), and overall succeeded on fewer than *w* replicas, it is not rolled back on the
replicas where it succeeded. This means that if a write was reported as failed, subsequent reads
may or may not return the value from that write [^52].
* If the database uses timestamps from a real-time clock to determine which write is newer (as
Cassandra and ScyllaDB do, for example), writes might be silently dropped if another node with a
faster clock has written to the same key—an issue we previously saw in [“Last write wins (discarding concurrent writes)”](/en/ch6#sec_replication_lww).
We will discuss this in more detail in [“Relying on Synchronized Clocks”](/en/ch9#sec_distributed_clocks_relying).
* If two writes occur concurrently, one of them might be processed first on one replica, and the
other might be processed first on another replica. This leads to a conflict, similarly to what we
saw for multi-leader replication (see [“Dealing with Conflicting Writes”](/en/ch6#sec_replication_write_conflicts)). We will return to this
topic in [“Detecting Concurrent Writes”](/en/ch6#sec_replication_concurrent).
Thus, although quorums appear to guarantee that a read returns the latest written value, in practice
it is not so simple. Dynamo-style databases are generally optimized for use cases that can tolerate
eventual consistency. The parameters *w* and *r* allow you to adjust the probability of stale values
being read [^53], but it’s wise to not take them as absolute guarantees.
#### Monitoring staleness {#monitoring-staleness}
From an operational perspective, it’s important to monitor whether your databases are
returning up-to-date results. Even if your application can tolerate stale reads, you need to be
aware of the health of your replication. If it falls behind significantly, it should alert you so
that you can investigate the cause (for example, a problem in the network or an overloaded node).
For leader-based replication, the database typically exposes metrics for the replication lag, which
you can feed into a monitoring system. This is possible because writes are applied to the leader and
to followers in the same order, and each node has a position in the replication log (the number of
writes it has applied locally). By subtracting a follower’s current position from the leader’s
current position, you can measure the amount of replication lag.
However, in systems with leaderless replication, there is no fixed order in which writes are
applied, which makes monitoring more difficult. The number of hints that a replica stores for
handoff can be one measure of system health, but it’s difficult to interpret usefully [^54].
Eventual consistency is a deliberately vague guarantee, but for operability it’s important to be
able to quantify “eventual.”
### Single-Leader vs. Leaderless Replication Performance {#sec_replication_leaderless_perf}
A replication system based on a single leader can provide strong consistency guarantees that are
difficult or impossible to achieve in a leaderless system. However, as we have seen in
[“Problems with Replication Lag”](/en/ch6#sec_replication_lag), reads in a leader-based replicated system can also return stale values if
you make them on an asynchronously updated follower.
Reading from the leader ensures up-to-date responses, but it suffers from performance problems:
* Read throughput is limited by the leader’s capacity to handle requests (in contrast with read
scaling, which distributes reads across asynchronously updated replicas that may return stale
values).
* If the leader fails, you have to wait for the fault to be detected, and for the failover to
complete before you can continue handling requests. Even if the failover process is very quick,
users will notice it because of the temporarily increased response times; if failover takes a long
time, the system is unavailable for its duration.
* The system is very sensitive to performance problems on the leader: if the leader is slow to
respond, e.g. due to overload or some resource contention, the increased response times
immediately affect users as well.
A big advantage of a leaderless architecture is that it is more resilient against such issues.
Because there is no failover, and requests go to multiple replicas in parallel anyway, one replica
becoming slow or unavailable has very little impact on response times: the client simply uses the
responses from the other replicas that are faster to respond. Using the fastest responses is called
*request hedging*, and it can significantly reduce tail latency [^55].
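A small calculation illustrates why. If requests go to all replicas in parallel and the client waits for the fastest *r* responses, the response time is the *r*-th smallest of the per-replica latencies. The latency figures below are invented for the example:

```python
def quorum_latency(latencies_ms, required):
    """Time until `required` responses have arrived, given per-replica latencies."""
    return sorted(latencies_ms)[required - 1]


healthy = [12, 15, 14]    # hypothetical latencies in milliseconds
one_slow = [12, 900, 14]  # the second replica is in a degraded state

print(quorum_latency(healthy, 2))
print(quorum_latency(one_slow, 2))  # the slow replica is simply not waited for
print(quorum_latency(one_slow, 3))  # waiting for all replicas exposes the slow one
```

With *r* = 2 the degraded replica has no effect on the response time in this example, whereas a request that must wait for that particular node (such as a read from a slow leader) would take 900 ms.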
At its core, the resilience of a leaderless system comes from the fact that it doesn’t distinguish
between the normal case and the failure case. This is especially helpful when handling so-called
*gray failures*, in which a node isn’t completely down, but running in a degraded state where it is
unusually slow to handle requests [^56], or when a node is simply overloaded (for example, if a node has been offline for a while, recovery
via hinted handoff can cause a lot of additional load). A leader-based system has to decide whether
the situation is bad enough to warrant a failover (which can itself cause further disruption),
whereas in a leaderless system that question doesn’t even arise.
That said, leaderless systems can have performance problems as well:
* Even though the system doesn’t need to perform failover, one replica does need to detect when
another replica is unavailable so that it can store hints about writes that the unavailable
replica missed. When the unavailable replica comes back, the handoff process needs to send it
those hints. This puts additional load on the replicas at a time when the system is already under strain [^54].
* The more replicas you have, the bigger the size of your quorums, and the more responses you have
to wait for before a request can complete. Even if you wait only for the fastest *r* or *w*
replicas to respond, and even if you make the requests in parallel, a bigger *r* or *w* increases
the chance that you hit a slow replica, increasing the overall response time (see
[“Use of Response Time Metrics”](/en/ch2#sec_introduction_slo_sla)).
* A large-scale network interruption that disconnects a client from a large number of replicas can
make it impossible to form a quorum. Some leaderless databases offer a configuration option that
allows any reachable replica to accept writes, even if it’s not one of the usual replicas for that
key (Riak and Dynamo call this a *sloppy quorum* [^45];
Cassandra and ScyllaDB call it *consistency level ANY*). There is no guarantee that subsequent
reads will see the written value, but depending on the application it may still be better than
having the write fail.
Multi-leader replication can offer even greater resilience against network interruptions than
leaderless replication, since reads and writes only require communication with one leader, which can
be co-located with the client. However, since a write on one leader is propagated asynchronously to
the others, reads can be arbitrarily out-of-date. Quorum reads and writes provide a compromise: good
fault tolerance while also having a high likelihood of reading up-to-date data.
#### Multi-region operation {#multi-region-operation}
We previously discussed cross-region replication as a use case for multi-leader replication (see
[“Multi-Leader Replication”](/en/ch6#sec_replication_multi_leader)). Leaderless replication is also suitable for
multi-region operation, since it is designed to tolerate conflicting concurrent writes, network
interruptions, and latency spikes.
Cassandra and ScyllaDB implement their multi-region support within the normal leaderless model: the
client sends its writes directly to the replicas in all regions, and you can choose from a variety
of consistency levels that determine how many responses are required for a request to be successful.
For example, you can request a quorum across the replicas in all the regions, a separate quorum in
each of the regions, or a quorum only in the client’s local region. A local quorum avoids having to
wait for slow requests to other regions, but it is also more likely to return stale results.
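The difference between these options comes down to how many responses are required, and from where. The sketch below assumes that a quorum means a simple majority and uses made-up region names and function names; it is not the API of Cassandra or ScyllaDB.

```python
def majority(n):
    """Smallest number of replicas that is more than half of n."""
    return n // 2 + 1


def global_quorum(replicas_per_region):
    """Responses required from the replicas of all regions taken together."""
    return majority(sum(replicas_per_region.values()))


def each_region_quorum(replicas_per_region):
    """Responses required separately within every region."""
    return {region: majority(n) for region, n in replicas_per_region.items()}


def local_quorum(replicas_per_region, local_region):
    """Responses required from the client's own region only."""
    return majority(replicas_per_region[local_region])


# Hypothetical deployment: three replicas in each of two regions.
replicas_per_region = {"us-east": 3, "eu-west": 3}

print(global_quorum(replicas_per_region))
print(each_region_quorum(replicas_per_region))
print(local_quorum(replicas_per_region, "us-east"))
```

In this deployment a quorum across all regions needs four of the six replicas, so at least one response must cross a region boundary, while a local quorum needs only two responses from nearby replicas.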
Riak keeps all communication between clients and database nodes local to one region, so *n*
describes the number of replicas within one region. Cross-region replication between
database clusters happens asynchronously in the background, in a style that is similar to
multi-leader replication.
### Detecting Concurrent Writes {#sec_replication_concurrent}
As with multi-leader replication, leaderless databases allow concurrent writes to the same key,
resulting in conflicts that need to be resolved. Such conflicts may occur as the writes happen, but
not always: they could also be detected later during read repair, hinted handoff, or anti-entropy.
The problem is that events may arrive in a different order at different nodes, due to variable
network delays and partial failures. For example, [Figure 6-14](/en/ch6#fig_replication_concurrency) shows two clients,
A and B, simultaneously writing to a key *X* in a three-node datastore:
* Node 1 receives the write from A, but never receives the write from B due to a transient outage.
* Node 2 first receives the write from A, then the write from B.
* Node 3 first receives the write from B, then the write from A.
{{< figure src="/fig/ddia_0614.png" id="fig_replication_concurrency" caption="Figure 6-14. Concurrent writes in a Dynamo-style datastore: there is no well-defined ordering." class="w-full my-4" >}}
If each node simply overwrote the value for a key whenever it received a write request from a
client, the nodes would become permanently inconsistent, as shown by the final *get* request in
[Figure 6-14](/en/ch6#fig_replication_concurrency): node 2 thinks that the final value of *X* is B, whereas the other
nodes think that the value is A.
In order to become eventually consistent, the replicas should converge toward the same value. For
this, we can use any of the conflict resolution mechanisms we previously discussed in
[“Dealing with Conflicting Writes”](/en/ch6#sec_replication_write_conflicts), such as last-write-wins (used by Cassandra and ScyllaDB),
manual resolution, or CRDTs (described in [“CRDTs and Operational Transformation”](/en/ch6#sec_replication_crdts), and used by Riak).
Last-write-wins is easy to implement: each write is tagged with a timestamp, and a value with a
higher timestamp always overwrites a value with a lower timestamp. However, a timestamp doesn’t tell
you whether two values are actually conflicting (i.e., they were written concurrently) or not (they
were written one after another). If you want to resolve conflicts explicitly, the system needs to
take more care to detect concurrent writes.
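The scenario of Figure 6-14 can be replayed in a few lines to compare naive overwriting with last-write-wins. The timestamps and function names are assumptions for the example; the arrival orders follow the figure.

```python
def apply_naive(arrivals):
    """Each write overwrites the value, in whatever order it arrives."""
    value = None
    for _timestamp, written in arrivals:
        value = written
    return value


def apply_lww(arrivals):
    """Keep only the write with the highest timestamp, whatever the arrival order."""
    current = None
    for write in arrivals:
        # Tuples compare by timestamp first; the value acts as a tie-breaker.
        if current is None or write > current:
            current = write
    return current[1] if current else None


write_a = (1001, "A")  # hypothetical timestamps
write_b = (1002, "B")

arrivals = {
    "node1": [write_a],           # never receives B
    "node2": [write_a, write_b],
    "node3": [write_b, write_a],
}

naive = {node: apply_naive(seq) for node, seq in arrivals.items()}
lww = {node: apply_lww(seq) for node, seq in arrivals.items()}

# Once read repair or anti-entropy delivers B to node 1, it agrees as well.
node1_after_repair = apply_lww(arrivals["node1"] + [write_b])
```

With naive overwriting, nodes 2 and 3 end up with different values even though they saw the same two writes. With last-write-wins they both settle on B, and node 1 does too once it receives the missed write. The cost is that A is silently discarded, even though it was written concurrently rather than superseded.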
#### The “happens-before” relation and concurrency {#sec_replication_happens_before}
How do we decide whether two operations are concurrent or not? To develop an intuition, let’s look
at some examples:
* In [Figure 6-8](/en/ch6#fig_replication_causality), the two writes are not concurrent: A’s insert *happens before*
B’s increment, because the value incremented by B is the value inserted by A. In other words, B’s
operation builds upon A’s operation, so B’s operation must have happened later.
We also say that B is *causally dependent* on A.
* On the other hand, the two writes in [Figure 6-14](/en/ch6#fig_replication_concurrency) are concurrent: when each
client starts the operation, it does not know that another client is also performing an operation
on the same key. Thus, there is no causal dependency between the operations.
An operation A *happens before* another operation B if B knows about A, or depends on A, or builds
upon A in some way. Whether one operation happens before another operation is the key to defining
what concurrency means. In fact, we can simply say that two operations are *concurrent* if neither
happens before the other (i.e., neither knows about the other) [^57].
Thus, whenever you have two operations A and B, there are three possibilities: either A happened
before B, or B happened before A, or A and B are concurrent. What we need is an algorithm to tell us
whether two operations are concurrent or not. If one operation happened before another, the later
operation should overwrite the earlier operation, but if the operations are concurrent, we have a
conflict that needs to be resolved.
--------
> [!TIP] Concurrency, Time, and Relativity
It may seem that two operations should be called concurrent if they occur “at the same time”—but
in fact, it is not important whether they literally overlap in time. Because of problems with clocks
in distributed systems, it is actually quite difficult to tell whether two things happened
at exactly the same time—an issue we will discuss in more detail in [Chapter 9](/en/ch9#ch_distributed).
For defining concurrency, exact time doesn’t matter: we simply call two operations concurrent if
they are both unaware of each other, regardless of the physical time at which they occurred. People
sometimes make a connection between this principle and the special theory of relativity in physics
[^57], which introduced the idea that
information cannot travel faster than the speed of light. Consequently, two events that occur some
distance apart cannot possibly affect each other if the time between the events is shorter than the
time it takes light to travel the distance between them.
In computer systems, two operations might be concurrent even though the speed of light would in
principle have allowed one operation to affect the other. For example, if the network was slow or
interrupted at the time, two operations can occur some time apart and still be concurrent, because
the network problems prevented one operation from being able to know about the other.
--------
#### Capturing the happens-before relationship {#capturing-the-happens-before-relationship}
Let’s look at an algorithm that determines whether two operations are concurrent, or whether one
happened before another. To keep things simple, let’s start with a database that has only one
replica. Once we have worked out how to do this on a single replica, we can generalize the approach
to a leaderless database with multiple replicas.
[Figure 6-15](/en/ch6#fig_replication_causality_single) shows two clients concurrently adding items to the same
shopping cart. (If that example strikes you as too inane, imagine instead two air traffic
controllers concurrently adding aircraft to the sector they are tracking.) Initially, the cart is
empty. Between them, the clients make five writes to the database:
1. Client 1 adds `milk` to the cart. This is the first write to that key, so the server successfully
stores it and assigns it version 1. The server also echoes the value back to the client, along
with the version number.
2. Client 2 adds `eggs` to the cart, not knowing that client 1 concurrently added `milk` (client 2
thought that its `eggs` were the only item in the cart). The server assigns version 2 to this
write, and stores `eggs` and `milk` as two separate values (siblings). It then returns *both*
values to the client, along with the version number of 2.
3. Client 1, oblivious to client 2’s write, wants to add `flour` to the cart, so it thinks the
current cart contents should be `[milk, flour]`. It sends this value to the server, along with
the version number 1 that the server gave client 1 previously. The server can tell from the
version number that the write of `[milk, flour]` supersedes the prior value of `[milk]` but that
it is concurrent with `[eggs]`. Thus, the server assigns version 3 to `[milk, flour]`, overwrites
the version 1 value `[milk]`, but keeps the version 2 value `[eggs]` and returns both remaining
values to the client.
4. Meanwhile, client 2 wants to add `ham` to the cart, unaware that client 1 just added `flour`.
Client 2 received the two values `[milk]` and `[eggs]` from the server in the last response, so
the client now merges those values and adds `ham` to form a new value, `[eggs, milk, ham]`. It
sends that value to the server, along with the previous version number 2. The server detects that
version 2 overwrites `[eggs]` but is concurrent with `[milk, flour]`, so the two remaining
values are `[milk, flour]` with version 3, and `[eggs, milk, ham]` with version 4.
5. Finally, client 1 wants to add `bacon`. It previously received `[milk, flour]` and `[eggs]` from
the server at version 3, so it merges those, adds `bacon`, and sends the final value
`[milk, flour, eggs, bacon]` to the server, along with the version number 3. This overwrites
`[milk, flour]` (note that `[eggs]` was already overwritten in the last step) but is concurrent
with `[eggs, milk, ham]`, so the server keeps those two concurrent values.
{{< figure src="/fig/ddia_0615.png" id="fig_replication_causality_single" caption="Figure 6-15. Capturing causal dependencies between two clients concurrently editing a shopping cart." class="w-full my-4" >}}
The dataflow between the operations in [Figure 6-15](/en/ch6#fig_replication_causality_single) is illustrated
graphically in [Figure 6-16](/en/ch6#fig_replication_causal_dependencies). The arrows indicate which operation
*happened before* which other operation, in the sense that the later operation *knew about* or
*depended on* the earlier one. In this example, the clients are never fully up to date with the data
on the server, since there is always another operation going on concurrently. But old versions of
the value do get overwritten eventually, and no writes are lost.
{{< figure link="#fig_replication_causality_single" src="/fig/ddia_0616.png" id="fig_replication_causal_dependencies" caption="Figure 6-16. Graph of causal dependencies in Figure 6-15." class="w-full my-4" >}}
Note that the server can determine whether two operations are concurrent by looking at the version
numbers—it does not need to interpret the value itself (so the value could be any data
structure). The algorithm works as follows:
* The server maintains a version number for every key, increments the version number every time that
key is written, and stores the new version number along with the value written.
* When a client reads a key, the server returns all siblings, i.e., all values that have not been
overwritten, as well as the latest version number. A client must read a key before writing.
* When a client writes a key, it must include the version number from the prior read, and it must
merge together all values that it received in the prior read, e.g. using a CRDT or by asking the
user. The response from a write request is like a read, returning all siblings, which allows us to
chain several writes like in the shopping cart example.
* When the server receives a write with a particular version number, it can overwrite all values
with that version number or below (since it knows that they have been merged into the new value),
but it must keep all values with a higher version number (because those values are concurrent with
the incoming write).
When a write includes the version number from a prior read, that tells us which previous state the
write is based on. If you make a write without including a version number, it is concurrent with all
other writes, so it will not overwrite anything—it will just be returned as one of the values
on subsequent reads.
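The algorithm is short enough to write out in full. The following is a minimal sketch for a single key on a single replica; the class name and method signatures are assumptions for illustration, and merging siblings is left to the client, as in the text. A write without a version number is modeled as `based_on=0`.

```python
class SingleReplicaStore:
    """Hypothetical store for one key, tracking siblings by version number."""

    def __init__(self):
        self.version = 0    # latest version number assigned for this key
        self.siblings = {}  # version number -> value not yet overwritten

    def read(self):
        """Return the latest version number and all current siblings."""
        return self.version, [self.siblings[v] for v in sorted(self.siblings)]

    def write(self, value, based_on=0):
        """Store `value`, which the client derived from a read at `based_on`."""
        # Values at or below `based_on` were merged into `value` by the client,
        # so they can be dropped; higher versions are concurrent and are kept.
        self.siblings = {v: x for v, x in self.siblings.items() if v > based_on}
        self.version += 1
        self.siblings[self.version] = value
        return self.read()


# Replaying the five writes of Figure 6-15.
cart = SingleReplicaStore()
v1, _ = cart.write(["milk"])                                  # client 1
v2, _ = cart.write(["eggs"])                                  # client 2
v3, after_3 = cart.write(["milk", "flour"], based_on=v1)      # client 1
v4, _ = cart.write(["eggs", "milk", "ham"], based_on=v2)      # client 2
v5, final = cart.write(["milk", "flour", "eggs", "bacon"], based_on=v3)  # client 1
```

After the third write the store holds `[eggs]` and `[milk, flour]`; after the fifth it holds `[eggs, milk, ham]` at version 4 and `[milk, flour, eggs, bacon]` at version 5, matching the walkthrough above.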
#### Version vectors {#version-vectors}
The example in [Figure 6-15](/en/ch6#fig_replication_causality_single) used only a single replica. How does the
algorithm change when there are multiple replicas, but no leader?
[Figure 6-15](/en/ch6#fig_replication_causality_single) uses a single version number to capture dependencies between
operations, but that is not sufficient when there are multiple replicas accepting writes
concurrently. Instead, we need to use a version number *per replica* as well as per key. Each
replica increments its own version number when processing a write, and also keeps track of the
version numbers it has seen from each of the other replicas. This information indicates which values
to overwrite and which values to keep as siblings.
The collection of version numbers from all the replicas is called a *version vector* [^58].
A few variants of this idea are in use, but the most interesting is probably the *dotted version vector* [^59] [^60],
which is used in Riak 2.0 [^61] [^62].
We won’t go into the details, but the way it works is quite similar to what we saw in our cart example.
Like the version numbers in [Figure 6-15](/en/ch6#fig_replication_causality_single), version vectors are sent from the
database replicas to clients when values are read, and need to be sent back to the database when a
value is subsequently written. (Riak encodes the version vector as a string that it calls *causal
context*.) The version vector allows the database to distinguish between overwrites and concurrent
writes.
The version vector also ensures that it is safe to read from one replica and subsequently write back
to another replica. Doing so may result in siblings being created, but no data is lost as long as
siblings are merged correctly.
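To make the idea concrete, here is a sketch of how two basic version vectors can be compared. It represents a vector as a dictionary from replica ID to counter, which is an assumption for the example; it is not the dotted version vector used by Riak, and the function names are hypothetical.

```python
def compare(a, b):
    """Relate version vector a to b: 'before', 'after', 'equal', or 'concurrent'."""
    replicas = set(a) | set(b)
    # A missing entry means no writes from that replica have been seen yet.
    a_ahead = any(a.get(r, 0) > b.get(r, 0) for r in replicas)
    b_ahead = any(b.get(r, 0) > a.get(r, 0) for r in replicas)
    if a_ahead and b_ahead:
        return "concurrent"  # neither has seen everything the other has
    if a_ahead:
        return "after"       # a has seen everything in b, and more
    if b_ahead:
        return "before"      # b has seen everything in a, and more
    return "equal"


def merge(a, b):
    """Version vector describing a value that combines both inputs."""
    return {r: max(a.get(r, 0), b.get(r, 0)) for r in set(a) | set(b)}


print(compare({"r1": 1}, {"r1": 2}))                    # overwrite
print(compare({"r1": 2}, {"r1": 1, "r2": 1}))           # keep both as siblings
print(merge({"r1": 2}, {"r1": 1, "r2": 1}))
```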
--------
> [!TIP] VERSION VECTORS AND VECTOR CLOCKS
A *version vector* is sometimes also called a *vector clock*, even though they are not quite the
same. The difference is subtle—please see the references for details [^60] [^63] [^64]. In brief, when
comparing the state of replicas, version vectors are the right data structure to use.
--------
## Summary {#summary}
In this chapter we looked at the issue of replication. Replication can serve several purposes:
*High availability*
: Keeping the system running, even when one machine (or several machines, a
zone, or even an entire region) goes down
*Disconnected operation*
: Allowing an application to continue working when there is a network
interruption
*Latency*
: Placing data geographically close to users, so that users can interact with it faster
*Scalability*
: Being able to handle a higher volume of reads than a single machine could handle,
by performing reads on replicas
Despite being a simple goal—keeping a copy of the same data on several machines—replication turns out
to be a remarkably tricky problem. It requires carefully thinking about concurrency and about all
the things that can go wrong, and dealing with the consequences of those faults. At a minimum, we
need to deal with unavailable nodes and network interruptions (and that’s not even considering the
more insidious kinds of fault, such as silent data corruption due to software bugs or hardware errors).
We discussed three main approaches to replication:
*Single-leader replication*
: Clients send all writes to a single node (the leader), which sends a
stream of data change events to the other replicas (followers). Reads can be performed on any
replica, but reads from followers might be stale.
*Multi-leader replication*
: Clients send each write to one of several leader nodes, any of which
can accept writes. The leaders send streams of data change events to each other and to any
follower nodes.
*Leaderless replication*
: Clients send each write to several nodes, and read from several nodes
in parallel in order to detect and correct nodes with stale data.
Each approach has advantages and disadvantages. Single-leader replication is popular because it is
fairly easy to understand and it offers strong consistency. Multi-leader and leaderless replication
can be more robust in the presence of faulty nodes, network interruptions, and latency spikes—at the
cost of requiring conflict resolution and providing weaker consistency guarantees.
Replication can be synchronous or asynchronous, which has a profound effect on the system behavior
when there is a fault. Although asynchronous replication can be fast when the system is running
smoothly, it’s important to figure out what happens when replication lag increases and servers fail.
If a leader fails and you promote an asynchronously updated follower to be the new leader, recently
committed data may be lost.
We looked at some strange effects that can be caused by replication lag, and we discussed a few
consistency models which are helpful for deciding how an application should behave under replication
lag:
*Read-after-write consistency*
: Users should always see data that they submitted themselves.
*Monotonic reads*
: After users have seen the data at one point in time, they shouldn’t later see
the data from some earlier point in time.
*Consistent prefix reads*
: Users should see the data in a state that makes causal sense:
for example, seeing a question and its reply in the correct order.
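One way to picture monotonic reads is a client session that remembers the newest replication log position it has observed and skips any replica that is further behind. The following is a minimal sketch under that assumption; the `Replica` and `Session` classes and the `position` field are hypothetical names chosen for illustration, not the API of any particular database.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """A toy replica; `position` is how far it has applied the replication log."""
    name: str
    position: int = 0
    data: dict = field(default_factory=dict)


@dataclass
class Session:
    """Tracks the newest log position this user has already observed."""
    last_seen: int = 0

    def read(self, replicas, key):
        # Accept only a replica that is at least as up to date as what the
        # user has already seen; otherwise time would appear to go backward.
        for replica in replicas:
            if replica.position >= self.last_seen:
                self.last_seen = replica.position
                return replica.data.get(key)
        raise RuntimeError("no sufficiently up-to-date replica available")


fresh = Replica("follower-1", position=5, data={"comment": "new"})
stale = Replica("follower-2", position=3, data={"comment": "old"})

session = Session()
first = session.read([fresh, stale], "comment")   # served by follower-1
second = session.read([stale, fresh], "comment")  # follower-2 is skipped as too stale
```

Both reads return the newer value, even though the second request tried the stale follower first.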
Finally, we discussed how multi-leader and leaderless replication ensure that all replicas
eventually converge to a consistent state: by using a version vector or similar algorithm to detect
which writes are concurrent, and by using a conflict resolution algorithm such as a CRDT to merge
the concurrently written values. Last-write-wins and manual conflict resolution are also possible.
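To make the role of version vectors concrete, here is a small sketch of how two vectors can be compared to decide whether one write happened after the other or whether the two are concurrent. Representing a vector as a dictionary from replica ID to counter, and the function names `compare` and `merge`, are assumptions made for this illustration.

```python
def compare(a: dict, b: dict) -> str:
    """Compare two version vectors (maps from replica ID to counter)."""
    replicas = set(a) | set(b)
    a_ahead = any(a.get(r, 0) > b.get(r, 0) for r in replicas)
    b_ahead = any(b.get(r, 0) > a.get(r, 0) for r in replicas)
    if a_ahead and b_ahead:
        return "concurrent"  # neither write saw the other: a conflict to resolve
    if a_ahead:
        return "a after b"   # a has seen everything b has, so a supersedes b
    if b_ahead:
        return "b after a"
    return "equal"


def merge(a: dict, b: dict) -> dict:
    """Element-wise maximum: the vector of a value that incorporates both writes."""
    return {r: max(a.get(r, 0), b.get(r, 0)) for r in set(a) | set(b)}
```

Only the `"concurrent"` case needs a conflict resolution step such as a CRDT merge or last-write-wins; in the other cases the newer value can simply replace the older one.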
This chapter has assumed that every replica stores a full copy of the whole database, which is
unrealistic for large datasets. In the next chapter we will look at *sharding*, which allows each
machine to store only a subset of the data.
### References
[^1]: B. G. Lindsay, P. G. Selinger, C. Galtieri, J. N. Gray, R. A. Lorie, T. G. Price, F. Putzolu, I. L. Traiger, and B. W. Wade. [Notes on Distributed Databases](https://dominoweb.draco.res.ibm.com/reports/RJ2571.pdf). IBM Research, Research Report RJ2571(33471), July 1979. Archived at [perma.cc/EPZ3-MHDD](https://perma.cc/EPZ3-MHDD)
[^2]: Kenny Gryp. [MySQL Terminology Updates](https://dev.mysql.com/blog-archive/mysql-terminology-updates/). *dev.mysql.com*, July 2020. Archived at [perma.cc/S62G-6RJ2](https://perma.cc/S62G-6RJ2)
[^3]: Oracle Corporation. [Oracle (Active) Data Guard 19c: Real-Time Data Protection and Availability](https://www.oracle.com/technetwork/database/availability/dg-adg-technical-overview-wp-5347548.pdf). White Paper, *oracle.com*, March 2019. Archived at [perma.cc/P5ST-RPKE](https://perma.cc/P5ST-RPKE)
[^4]: Microsoft. [What is an Always On availability group?](https://learn.microsoft.com/en-us/sql/database-engine/availability-groups/windows/overview-of-always-on-availability-groups-sql-server) *learn.microsoft.com*, September 2024. Archived at [perma.cc/ABH6-3MXF](https://perma.cc/ABH6-3MXF)
[^5]: Mostafa Elhemali, Niall Gallagher, Nicholas Gordon, Joseph Idziorek, Richard Krog, Colin Lazier, Erben Mo, Akhilesh Mritunjai, Somu Perianayagam, Tim Rath, Swami Sivasubramanian, James Christopher Sorenson III, Sroaj Sosothikul, Doug Terry, and Akshat Vig. [Amazon DynamoDB: A Scalable, Predictably Performant, and Fully Managed NoSQL Database Service](https://www.usenix.org/conference/atc22/presentation/elhemali). At *USENIX Annual Technical Conference* (ATC), July 2022.
[^6]: Rebecca Taft, Irfan Sharif, Andrei Matei, Nathan VanBenschoten, Jordan Lewis, Tobias Grieger, Kai Niemi, Andy Woods, Anne Birzin, Raphael Poss, Paul Bardea, Amruta Ranade, Ben Darnell, Bram Gruneir, Justin Jaffray, Lucy Zhang, and Peter Mattis. [CockroachDB: The Resilient Geo-Distributed SQL Database](https://dl.acm.org/doi/abs/10.1145/3318464.3386134). At *ACM SIGMOD International Conference on Management of Data* (SIGMOD), pages 1493–1509, June 2020. [doi:10.1145/3318464.3386134](https://doi.org/10.1145/3318464.3386134)
[^7]: Dongxu Huang, Qi Liu, Qiu Cui, Zhuhe Fang, Xiaoyu Ma, Fei Xu, Li Shen, Liu Tang, Yuxing Zhou, Menglong Huang, Wan Wei, Cong Liu, Jian Zhang, Jianjun Li, Xuelian Wu, Lingyu Song, Ruoxi Sun, Shuaipeng Yu, Lei Zhao, Nicholas Cameron, Liquan Pei, and Xin Tang. [TiDB: a Raft-based HTAP database](https://www.vldb.org/pvldb/vol13/p3072-huang.pdf). *Proceedings of the VLDB Endowment*, volume 13, issue 12, pages 3072–3084. [doi:10.14778/3415478.3415535](https://doi.org/10.14778/3415478.3415535)
[^8]: Mallory Knodel and Niels ten Oever. [Terminology, Power, and Inclusive Language in Internet-Drafts and RFCs](https://www.ietf.org/archive/id/draft-knodel-terminology-14.html). *IETF Internet-Draft*, August 2023. Archived at [perma.cc/5ZY9-725E](https://perma.cc/5ZY9-725E)
[^9]: Buck Hodges. [Postmortem: VSTS 4 September 2018](https://devblogs.microsoft.com/devopsservice/?p=17485). *devblogs.microsoft.com*, September 2018. Archived at [perma.cc/ZF5R-DYZS](https://perma.cc/ZF5R-DYZS)
[^10]: Gunnar Morling. [Leader Election With S3 Conditional Writes](https://www.morling.dev/blog/leader-election-with-s3-conditional-writes/). *www.morling.dev*, August 2024. Archived at [perma.cc/7V2N-J78Y](https://perma.cc/7V2N-J78Y)
[^11]: Vignesh Chandramohan, Rohan Desai, and Chris Riccomini. [SlateDB Manifest Design](https://github.com/slatedb/slatedb/blob/main/rfcs/0001-manifest.md). *github.com*, May 2024. Archived at [perma.cc/8EUY-P32Z](https://perma.cc/8EUY-P32Z)
[^12]: Stas Kelvich. [Why does Neon use Paxos instead of Raft, and what’s the difference?](https://neon.tech/blog/paxos) *neon.tech*, August 2022. Archived at [perma.cc/SEZ4-2GXU](https://perma.cc/SEZ4-2GXU)
[^13]: Dimitri Fontaine. [An introduction to the pg\_auto\_failover project](https://tapoueh.org/blog/2021/11/an-introduction-to-the-pg_auto_failover-project/). *tapoueh.org*, November 2021. Archived at [perma.cc/3WH5-6BAF](https://perma.cc/3WH5-6BAF)
[^14]: Jesse Newland. [GitHub availability this week](https://github.blog/news-insights/the-library/github-availability-this-week/). *github.blog*, September 2012. Archived at [perma.cc/3YRF-FTFJ](https://perma.cc/3YRF-FTFJ)
[^15]: Mark Imbriaco. [Downtime last Saturday](https://github.blog/news-insights/the-library/downtime-last-saturday/). *github.blog*, December 2012. Archived at [perma.cc/M7X5-E8SQ](https://perma.cc/M7X5-E8SQ)
[^16]: John Hugg. [‘All In’ with Determinism for Performance and Testing in Distributed Systems](https://www.youtube.com/watch?v=gJRj3vJL4wE). At *Strange Loop*, September 2015.
[^17]: Hironobu Suzuki. [The Internals of PostgreSQL](https://www.interdb.jp/pg/). *interdb.jp*, 2017.
[^18]: Amit Kapila. [WAL Internals of PostgreSQL](https://www.pgcon.org/2012/schedule/attachments/258_212_Internals%20Of%20PostgreSQL%20Wal.pdf). At *PostgreSQL Conference* (PGCon), May 2012. Archived at [perma.cc/6225-3SUX](https://perma.cc/6225-3SUX)
[^19]: Amit Kapila. [Evolution of Logical Replication](https://amitkapila16.blogspot.com/2023/09/evolution-of-logical-replication.html). *amitkapila16.blogspot.com*, September 2023. Archived at [perma.cc/F9VX-JLER](https://perma.cc/F9VX-JLER)
[^20]: Aru Petchimuthu. [Upgrade your Amazon RDS for PostgreSQL or Amazon Aurora PostgreSQL database, Part 2: Using the pglogical extension](https://aws.amazon.com/blogs/database/part-2-upgrade-your-amazon-rds-for-postgresql-database-using-the-pglogical-extension/). *aws.amazon.com*, August 2021. Archived at [perma.cc/RXT8-FS2T](https://perma.cc/RXT8-FS2T)
[^21]: Yogeshwer Sharma, Philippe Ajoux, Petchean Ang, David Callies, Abhishek Choudhary, Laurent Demailly, Thomas Fersch, Liat Atsmon Guz, Andrzej Kotulski, Sachin Kulkarni, Sanjeev Kumar, Harry Li, Jun Li, Evgeniy Makeev, Kowshik Prakasam, Robbert van Renesse, Sabyasachi Roy, Pratyush Seth, Yee Jiun Song, Benjamin Wester, Kaushik Veeraraghavan, and Peter Xie. [Wormhole: Reliable Pub-Sub to Support Geo-Replicated Internet Services](https://www.usenix.org/system/files/conference/nsdi15/nsdi15-paper-sharma.pdf). At *12th USENIX Symposium on Networked Systems Design and Implementation* (NSDI), May 2015.
[^22]: Douglas B. Terry. [Replicated Data Consistency Explained Through Baseball](https://www.microsoft.com/en-us/research/publication/replicated-data-consistency-explained-through-baseball/). Microsoft Research, Technical Report MSR-TR-2011-137, October 2011. Archived at [perma.cc/F4KZ-AR38](https://perma.cc/F4KZ-AR38)
[^23]: Douglas B. Terry, Alan J. Demers, Karin Petersen, Mike J. Spreitzer, Marvin M. Theimer, and Brent B. Welch. [Session Guarantees for Weakly Consistent Replicated Data](https://csis.pace.edu/~marchese/CS865/Papers/SessionGuaranteesPDIS.pdf). At *3rd International Conference on Parallel and Distributed Information Systems* (PDIS), September 1994. [doi:10.1109/PDIS.1994.331722](https://doi.org/10.1109/PDIS.1994.331722)

[^24]: Werner Vogels. [Eventually Consistent](https://queue.acm.org/detail.cfm?id=1466448). *ACM Queue*, volume 6, issue 6, pages 14–19, October 2008. [doi:10.1145/1466443.1466448](https://doi.org/10.1145/1466443.1466448)
[^25]: Simon Willison. [Reply to: “My thoughts about Fly.io (so far) and other newish technology I’m getting into”](https://news.ycombinator.com/item?id=31434055). *news.ycombinator.com*, May 2022. Archived at [perma.cc/ZRV4-WWV8](https://perma.cc/ZRV4-WWV8)
[^26]: Nithin Tharakan. [Scaling Bitbucket’s Database](https://www.atlassian.com/blog/bitbucket/scaling-bitbuckets-database). *atlassian.com*, October 2020. Archived at [perma.cc/JAB7-9FGX](https://perma.cc/JAB7-9FGX)
[^27]: Terry Pratchett. *Reaper Man: A Discworld Novel*. Victor Gollancz, 1991. ISBN: 978-0-575-04979-6
[^28]: Peter Bailis, Alan Fekete, Michael J. Franklin, Ali Ghodsi, Joseph M. Hellerstein, and Ion Stoica. [Coordination Avoidance in Database Systems](https://arxiv.org/abs/1402.2237). *Proceedings of the VLDB Endowment*, volume 8, issue 3, pages 185–196, November 2014. [doi:10.14778/2735508.2735509](https://doi.org/10.14778/2735508.2735509)
[^29]: Yaser Raja and Peter Celentano. [PostgreSQL bi-directional replication using pglogical](https://aws.amazon.com/blogs/database/postgresql-bi-directional-replication-using-pglogical/). *aws.amazon.com*, January 2022.
[^30]: Robert Hodges. [If You \*Must\* Deploy Multi-Master Replication, Read This First](https://scale-out-blog.blogspot.com/2012/04/if-you-must-deploy-multi-master.html). *scale-out-blog.blogspot.com*, April 2012. Archived at [perma.cc/C2JN-F6Y8](https://perma.cc/C2JN-F6Y8)
[^31]: Lars Hofhansl. [HBASE-7709: Infinite Loop Possible in Master/Master Replication](https://issues.apache.org/jira/browse/HBASE-7709). *issues.apache.org*, January 2013. Archived at [perma.cc/24G2-8NLC](https://perma.cc/24G2-8NLC)
[^32]: John Day-Richter. [What’s Different About the New Google Docs: Making Collaboration Fast](https://drive.googleblog.com/2010/09/whats-different-about-new-google-docs.html). *drive.googleblog.com*, September 2010. Archived at [perma.cc/5TL8-TSJ2](https://perma.cc/5TL8-TSJ2)
[^33]: Evan Wallace. [How Figma’s multiplayer technology works](https://www.figma.com/blog/how-figmas-multiplayer-technology-works/). *figma.com*, October 2019. Archived at [perma.cc/L49H-LY4D](https://perma.cc/L49H-LY4D)
[^34]: Tuomas Artman. [Scaling the Linear Sync Engine](https://linear.app/blog/scaling-the-linear-sync-engine). *linear.app*, June 2023.
[^35]: Amr Saafan. [Why Sync Engines Might Be the Future of Web Applications](https://www.nilebits.com/blog/2024/09/sync-engines-future-web-applications/). *nilebits.com*, September 2024. Archived at [perma.cc/5N73-5M3V](https://perma.cc/5N73-5M3V)
[^36]: Isaac Hagoel. [Are Sync Engines The Future of Web Applications?](https://dev.to/isaachagoel/are-sync-engines-the-future-of-web-applications-1bbi) *dev.to*, July 2024. Archived at [perma.cc/R9HF-BKKL](https://perma.cc/R9HF-BKKL)
[^37]: Sujay Jayakar. [A Map of Sync](https://stack.convex.dev/a-map-of-sync). *stack.convex.dev*, October 2024. Archived at [perma.cc/82R3-H42A](https://perma.cc/82R3-H42A)
[^38]: Alex Feyerke. [Designing Offline-First Web Apps](https://alistapart.com/article/offline-first/). *alistapart.com*, December 2013. Archived at [perma.cc/WH7R-S2DS](https://perma.cc/WH7R-S2DS)
[^39]: Martin Kleppmann, Adam Wiggins, Peter van Hardenberg, and Mark McGranaghan. [Local-first software: You own your data, in spite of the cloud](https://www.inkandswitch.com/local-first/). At *ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software* (Onward!), October 2019, pages 154–178. [doi:10.1145/3359591.3359737](https://doi.org/10.1145/3359591.3359737)
[^40]: Martin Kleppmann. [The past, present, and future of local-first](https://martin.kleppmann.com/2024/05/30/local-first-conference.html). At *Local-First Conference*, May 2024.
[^41]: Conrad Hofmeyr. [API Calling is to Sync Engines as jQuery is to React](https://www.powersync.com/blog/api-calling-is-to-sync-engines-as-jquery-is-to-react). *powersync.com*, November 2024. Archived at [perma.cc/2FP9-7WJJ](https://perma.cc/2FP9-7WJJ)
[^42]: Peter van Hardenberg and Martin Kleppmann. [PushPin: Towards Production-Quality Peer-to-Peer Collaboration](https://martin.kleppmann.com/papers/pushpin-papoc20.pdf). At *7th Workshop on Principles and Practice of Consistency for Distributed Data* (PaPoC), April 2020. [doi:10.1145/3380787.3393683](https://doi.org/10.1145/3380787.3393683)
[^43]: Leonard Kawell, Jr., Steven Beckhardt, Timothy Halvorsen, Raymond Ozzie, and Irene Greif. [Replicated document management in a group communication system](https://dl.acm.org/doi/pdf/10.1145/62266.1024798). At *ACM Conference on Computer-Supported Cooperative Work* (CSCW), September 1988. [doi:10.1145/62266.1024798](https://doi.org/10.1145/62266.1024798)
[^44]: Ricky Pusch. [Explaining how fighting games use delay-based and rollback netcode](https://words.infil.net/w02-netcode.html). *words.infil.net* and *arstechnica.com*, October 2019. Archived at [perma.cc/DE7W-RDJ8](https://perma.cc/DE7W-RDJ8)
[^45]: Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. [Dynamo: Amazon’s Highly Available Key-Value Store](https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf). At *21st ACM Symposium on Operating Systems Principles* (SOSP), October 2007. [doi:10.1145/1323293.1294281](https://doi.org/10.1145/1323293.1294281)
[^46]: Marc Shapiro, Nuno Preguiça, Carlos Baquero, and Marek Zawirski. [A Comprehensive Study of Convergent and Commutative Replicated Data Types](https://inria.hal.science/inria-00555588v1/document). INRIA Research Report no. 7506, January 2011.
[^47]: Chengzheng Sun and Clarence Ellis. [Operational Transformation in Real-Time Group Editors: Issues, Algorithms, and Achievements](https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=aef660812c5a9c4d3f06775f9455eeb090a4ff0f). At *ACM Conference on Computer Supported Cooperative Work* (CSCW), November 1998. [doi:10.1145/289444.289469](https://doi.org/10.1145/289444.289469)
[^48]: Joseph Gentle and Martin Kleppmann. [Collaborative Text Editing with Eg-walker: Better, Faster, Smaller](https://arxiv.org/abs/2409.14252). At *20th European Conference on Computer Systems* (EuroSys), March 2025. [doi:10.1145/3689031.3696076](https://doi.org/10.1145/3689031.3696076)
[^49]: Dharma Shukla. [Azure Cosmos DB: Pushing the frontier of globally distributed databases](https://azure.microsoft.com/en-us/blog/azure-cosmos-db-pushing-the-frontier-of-globally-distributed-databases/). *azure.microsoft.com*, September 2018. Archived at [perma.cc/UT3B-HH6R](https://perma.cc/UT3B-HH6R)
[^50]: David K. Gifford. [Weighted Voting for Replicated Data](https://www.cs.cmu.edu/~15-749/READINGS/required/availability/gifford79.pdf). At *7th ACM Symposium on Operating Systems Principles* (SOSP), December 1979. [doi:10.1145/800215.806583](https://doi.org/10.1145/800215.806583)
[^51]: Heidi Howard, Dahlia Malkhi, and Alexander Spiegelman. [Flexible Paxos: Quorum Intersection Revisited](https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.OPODIS.2016.25). At *20th International Conference on Principles of Distributed Systems* (OPODIS), December 2016. [doi:10.4230/LIPIcs.OPODIS.2016.25](https://doi.org/10.4230/LIPIcs.OPODIS.2016.25)
[^52]: Joseph Blomstedt. [Bringing Consistency to Riak](https://vimeo.com/51973001). At *RICON West*, October 2012.
[^53]: Peter Bailis, Shivaram Venkataraman, Michael J. Franklin, Joseph M. Hellerstein, and Ion Stoica. [Quantifying eventual consistency with PBS](http://www.bailis.org/papers/pbs-vldbj2014.pdf). *The VLDB Journal*, volume 23, pages 279–302, April 2014. [doi:10.1007/s00778-013-0330-1](https://doi.org/10.1007/s00778-013-0330-1)
[^54]: Colin Breck. [Shared-Nothing Architectures for Server Replication and Synchronization](https://blog.colinbreck.com/shared-nothing-architectures-for-server-replication-and-synchronization/). *blog.colinbreck.com*, December 2019. Archived at [perma.cc/48P3-J6CJ](https://perma.cc/48P3-J6CJ)
[^55]: Jeffrey Dean and Luiz André Barroso. [The Tail at Scale](https://cacm.acm.org/research/the-tail-at-scale/). *Communications of the ACM*, volume 56, issue 2, pages 74–80, February 2013. [doi:10.1145/2408776.2408794](https://doi.org/10.1145/2408776.2408794)
[^56]: Peng Huang, Chuanxiong Guo, Lidong Zhou, Jacob R. Lorch, Yingnong Dang, Murali Chintalapati, and Randolph Yao. [Gray Failure: The Achilles’ Heel of Cloud-Scale Systems](https://www.microsoft.com/en-us/research/wp-content/uploads/2017/06/paper-1.pdf). At *16th Workshop on Hot Topics in Operating Systems* (HotOS), May 2017. [doi:10.1145/3102980.3103005](https://doi.org/10.1145/3102980.3103005)
[^57]: Leslie Lamport. [Time, Clocks, and the Ordering of Events in a Distributed System](https://www.microsoft.com/en-us/research/publication/time-clocks-ordering-events-distributed-system/). *Communications of the ACM*, volume 21, issue 7, pages 558–565, July 1978. [doi:10.1145/359545.359563](https://doi.org/10.1145/359545.359563)
[^58]: D. Stott Parker Jr., Gerald J. Popek, Gerard Rudisin, Allen Stoughton, Bruce J. Walker, Evelyn Walton, Johanna M. Chow, David Edwards, Stephen Kiser, and Charles Kline. [Detection of Mutual Inconsistency in Distributed Systems](https://pages.cs.wisc.edu/~remzi/Classes/739/Papers/parker83detection.pdf). *IEEE Transactions on Software Engineering*, volume SE-9, issue 3, pages 240–247, May 1983. [doi:10.1109/TSE.1983.236733](https://doi.org/10.1109/TSE.1983.236733)
[^59]: Nuno Preguiça, Carlos Baquero, Paulo Sérgio Almeida, Victor Fonte, and Ricardo Gonçalves. [Dotted Version Vectors: Logical Clocks for Optimistic Replication](https://arxiv.org/abs/1011.5808). arXiv:1011.5808, November 2010.
[^60]: Giridhar Manepalli. [Clocks and Causality - Ordering Events in Distributed Systems](https://www.exhypothesi.com/clocks-and-causality/). *exhypothesi.com*, November 2022. Archived at [perma.cc/8REU-KVLQ](https://perma.cc/8REU-KVLQ)
[^61]: Sean Cribbs. [A Brief History of Time in Riak](https://speakerdeck.com/seancribbs/a-brief-history-of-time-in-riak). At *RICON*, October 2014. Archived at [perma.cc/7U9P-6JFX](https://perma.cc/7U9P-6JFX)
[^62]: Russell Brown. [Vector Clocks Revisited Part 2: Dotted Version Vectors](https://riak.com/posts/technical/vector-clocks-revisited-part-2-dotted-version-vectors/). *riak.com*, November 2015. Archived at [perma.cc/96QP-W98R](https://perma.cc/96QP-W98R)
[^63]: Carlos Baquero. [Version Vectors Are Not Vector Clocks](https://haslab.wordpress.com/2011/07/08/version-vectors-are-not-vector-clocks/). *haslab.wordpress.com*, July 2011. Archived at [perma.cc/7PNU-4AMG](https://perma.cc/7PNU-4AMG)
[^64]: Reinhard Schwarz and Friedemann Mattern. [Detecting Causal Relationships in Distributed Computations: In Search of the Holy Grail](https://disco.ethz.ch/courses/hs08/seminar/papers/mattern4.pdf). *Distributed Computing*, volume 7, issue 3, pages 149–174, March 1994. [doi:10.1007/BF02277859](https://doi.org/10.1007/BF02277859)
================================================
FILE: content/en/ch7.md
================================================
---
title: "7. Sharding"
weight: 207
breadcrumbs: false
---

> *Clearly, we must break away from the sequential and not limit the computers. We must state
> definitions and provide for priorities and descriptions of data. We must state relationships, not
> procedures.*
>
> Grace Murray Hopper, *Management and the Computer of the Future* (1962)
A distributed database typically distributes data across nodes in two ways:
1. Having a copy of the same data on multiple nodes: this is *replication*, which we discussed in [Chapter 6](/en/ch6#ch_replication).
2. If we don’t want every node to store all the data, we can split up a large amount of data into
smaller *shards* or *partitions*, and store different shards on different nodes. We’ll discuss
sharding in this chapter.
Normally, shards are defined in such a way that each piece of data (each record, row, or document)
belongs to exactly one shard. There are various ways of achieving this, which we discuss in depth in
this chapter. In effect, each shard is a small database of its own, although some database systems
support operations that touch multiple shards at the same time.
Sharding is usually combined with replication so that copies of each shard are stored on multiple
nodes. This means that, even though each record belongs to exactly one shard, it may still be stored
on several different nodes for fault tolerance.
A node may store more than one shard. If a single-leader replication model is used, the combination
of sharding and replication can look like [Figure 7-1](/en/ch7#fig_sharding_replicas), for example. Each shard’s
leader is assigned to one node, and its followers are assigned to other nodes. Each node may be the
leader for some shards and a follower for other shards, but each shard still only has one leader.
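The layout described above can be sketched as a simple placement function. This is only an illustration: the round-robin strategy, the function name `place_replicas`, and the node names are assumptions, and real systems use more elaborate placement policies.

```python
def place_replicas(num_shards: int, nodes: list, replication_factor: int) -> dict:
    """Assign each shard one leader node and several follower nodes."""
    if replication_factor > len(nodes):
        raise ValueError("need at least as many nodes as replicas per shard")
    placement = {}
    for shard in range(num_shards):
        # Rotate the starting node so that leadership is spread across nodes.
        start = shard % len(nodes)
        replicas = [nodes[(start + i) % len(nodes)] for i in range(replication_factor)]
        placement[shard] = {"leader": replicas[0], "followers": replicas[1:]}
    return placement


placement = place_replicas(4, ["node-a", "node-b", "node-c", "node-d"], 3)
```

With four shards on four nodes and three replicas per shard, every node ends up as the leader of one shard and a follower for two others, while each shard still has exactly one leader.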
{{< figure src="/fig/ddia_0701.png" id="fig_sharding_replicas" caption="Figure 7-1. Combining replication and sharding: each node acts as leader for some shards and follower for other shards." class="w-full my-4" >}}
Everything we discussed in [Chapter 6](/en/ch6#ch_replication) about replication of databases applies equally to
replication of shards. Since the choice of sharding scheme is mostly independent of the choice of
replication scheme, we will ignore replication in this chapter for the sake of simplicity.
--------
> [!TIP] SHARDING AND PARTITIONING
What we call a *shard* in this chapter has many different names depending on which software you’re
using: it’s called a *partition* in Kafka, a *range* in CockroachDB, a *region* in HBase and TiDB, a
*tablet* in Bigtable and YugabyteDB, a *vnode* in Cassandra, ScyllaDB, and Riak, and a *vBucket* in
Couchbase, to name just a few.
Some databases treat partitions and shards as two distinct concepts. For example, in PostgreSQL,
partitioning is a way of splitting a large table into several files that are stored on the same
machine (which has several advantages, such as making it very fast to delete an entire partition),
whereas sharding splits a dataset across multiple machines [^1] [^2].
In many other systems, partitioning is just another word for sharding.
While *partitioning* is quite descriptive, the term *sharding* is perhaps surprising. According to
one theory, the term arose from the online role-playing game *Ultima Online*, in which a magic crystal
was shattered into pieces, and each of those shards refracted a copy of the game world [^3].
The term *shard* thus came to mean one of a set of parallel game servers, and later was carried over
to databases. Another theory is that *shard* was originally an acronym of *System for Highly
Available Replicated Data*—reportedly a 1980s database, details of which are lost to history.
By the way, partitioning has nothing to do with *network partitions* (netsplits), a type of fault in
the network between nodes. We will discuss such faults in [Chapter 9](/en/ch9#ch_distributed).
--------
## Pros and Cons of Sharding {#sec_sharding_reasons}
The primary reason for sharding a database is *scalability*: it’s a solution if the volume of data
or the write throughput has become too great for a single node to handle, as it allows you to spread
that data and those writes across multiple nodes. (If read throughput is the problem, you don’t
necessarily need sharding—you can use *read scaling* as discussed in [Chapter 6](/en/ch6#ch_replication).)
In fact, sharding is one of the main tools we have for achieving *horizontal scaling* (a *scale-out*
architecture), as discussed in [“Shared-Memory, Shared-Disk, and Shared-Nothing Architecture”](/en/ch2#sec_introduction_shared_nothing): that is, allowing a system to
grow its capacity not by moving to a bigger machine, but by adding more (smaller) machines. If you
can divide the workload such that each shard handles a roughly equal share, you can then assign
those shards to different machines in order to process their data and queries in parallel.
While replication is useful at both small and large scale, because it enables fault tolerance and
offline operation, sharding is a heavyweight solution that is mostly relevant at large scale. If
your data volume and write throughput are such that you can process them on a single machine (and a
single machine can do a lot nowadays!), it’s often better to avoid sharding and stick with a
single-shard database.
The reason for this recommendation is that sharding often adds complexity: you typically have to
decide which records to put in which shard by choosing a *partition key*; all records with the
same partition key are placed in the same shard [^4].
This choice matters for two reasons. Accessing a record is fast if you know which shard it’s in, but
if you don’t know the shard you have to do an inefficient search across all shards. Moreover, the
sharding scheme is difficult to change once it is in place.
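The difference between the two access paths can be sketched as follows. This is a toy in-memory model, not a real database client: the `user_id` partition key, the `email` field, and the `shard_for` mapping (a CRC-32 checksum modulo the number of shards) are placeholders chosen for illustration, and how the mapping should actually be chosen is the subject of the rest of this chapter.

```python
import zlib

NUM_SHARDS = 4
shards = [[] for _ in range(NUM_SHARDS)]  # each shard holds a list of records


def shard_for(partition_key: str) -> int:
    # Placeholder mapping from partition key to shard number.
    return zlib.crc32(partition_key.encode("utf-8")) % NUM_SHARDS


def insert(record: dict) -> None:
    # All records with the same partition key land in the same shard.
    shards[shard_for(record["user_id"])].append(record)


def find_by_user(user_id: str) -> list:
    # Partition key known: exactly one shard is consulted.
    return [r for r in shards[shard_for(user_id)] if r["user_id"] == user_id]


def find_by_email(email: str) -> list:
    # Partition key unknown: every shard has to be searched.
    return [r for shard in shards for r in shard if r["email"] == email]


insert({"user_id": "alice", "email": "alice@example.com"})
insert({"user_id": "bob", "email": "bob@example.com"})
```

Here `find_by_user("alice")` touches a single shard, whereas `find_by_email("alice@example.com")` scans all four to return the same record.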
Thus, sharding often works well for key-value data, where you can easily shard by key, but it’s
harder with relational data where you may want to search by a secondary index, or join records that
may be distributed across different shards. We will discuss this further in
[“Sharding and Secondary Indexes”](/en/ch7#sec_sharding_secondary_indexes).
Another problem with sharding is that a write may need to update related records in several
different shards. While transactions on a single node are quite common (see [Chapter 8](/en/ch8#ch_transactions)),
ensuring consistency across multiple shards requires a *distributed transaction*. As we shall see in
[Chapter 8](/en/ch8#ch_transactions), distributed transactions are available in some databases, but they are usually
much slower than single-node transactions, may become a bottleneck for the system as a whole, and
some systems don’t support them at all.
Some systems use sharding even on a single machine, typically running one single-threaded process
per CPU core to make use of the parallelism in the CPU, or to take advantage of a *nonuniform memory
access* (NUMA) architecture in which some banks of memory are closer to one CPU than to others [^5].
For example, Redis, VoltDB, and FoundationDB use one process per core, and rely on sharding to
spread load across CPU cores in the same machine [^6].
### Sharding for Multitenancy {#sec_sharding_multitenancy}
Software as a Service (SaaS) products and cloud services are often *multitenant*, where each tenant
is a customer. Multiple users may have logins on the same tenant, but each tenant has a
self-contained dataset that is separate from other tenants. For example, in an email marketing
service, each business that signs up is typically a separate tenant, since one business’s newsletter
signups, delivery data, and so on are separate from those of other businesses.
Sometimes sharding is used to implement multitenant systems: either each tenant is given a separate
shard, or multiple small tenants may be grouped together into a larger shard. These shards might be
physically separate databases (which we previously touched on in [“Embedded storage engines”](/en/ch4#sidebar_embedded)), or
separately manageable portions of a larger logical database [^7].
Using sharding for multitenancy has several advantages:
Resource isolation
: If one tenant performs a computationally expensive operation, it is less likely that other
tenants’ performance will be affected if they are running on different shards.
Permission isolation
: If there is a bug in your access control logic, it’s less likely that you will accidentally give
one tenant access to another tenant’s data if those tenants’ datasets are stored physically
separately from each other.
Cell-based architecture
: You can apply sharding not only at the data storage level, but also for the services running your
application code. In a *cell-based architecture*, the services and storage for a particular set of
tenants are grouped into a self-contained *cell*, and different cells are set up such that they
can run largely independently from each other. This approach provides *fault isolation*: that is,
a fault in one cell remains limited to that cell, and tenants in other cells are not affected [^8].
Per-tenant backup and restore
: Backing up each tenant’s shard separately makes it possible to restore a tenant’s state from a
backup without affecting other tenants, which can be useful in case the tenant accidentally
deletes or overwrites important data [^9].
Regulatory compliance
: Data privacy regulation such as the GDPR gives individuals the right to access and delete all data
stored about them. If each person’s data is stored in a separate shard, this translates into
simple data export and deletion operations on their shard [^10].
Data residency
: If a particular tenant’s data needs to be stored in a particular jurisdiction in order to comply
with data residency laws, a region-aware database can allow you to assign that tenant’s shard to a particular region.
Gradual schema rollout
: Schema migrations (previously discussed in [“Schema flexibility in the document model”](/en/ch3#sec_datamodels_schema_flexibility)) can be rolled
out gradually, one tenant at a time. This reduces risk, as you can detect problems before they
affect all tenants, but it can be difficult to do transactionally [^11].
The main challenges around using sharding for multitenancy are:
* It assumes that each individual tenant is small enough to fit on a single node. If that is not the
case, and you have a single tenant that’s too big for one machine, you would need to additionally
perform sharding within a single tenant, which brings us back to the topic of sharding for
scalability [^12].
* If you have many small tenants, then creating a separate shard for each one may incur too much
overhead. You could group several small tenants together into a bigger shard, but then you have
the problem of how you move tenants from one shard to another as they grow.
* If you ever need to support features that connect data across multiple tenants, these become
harder to implement if you need to join data across multiple shards.
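The grouping of small tenants and the later move of a growing tenant can be pictured with a small directory that maps tenants to shards. This is a sketch under simplifying assumptions: `TenantDirectory`, the tenant names, and the shard names are all hypothetical, and the hard part (copying the tenant's data while it is in use) is deliberately left out.

```python
class TenantDirectory:
    """Maps each tenant to the shard that holds its data (hypothetical helper)."""

    def __init__(self):
        self.shard_of = {}    # tenant ID -> shard ID
        self.tenants_on = {}  # shard ID -> set of tenant IDs

    def assign(self, tenant: str, shard: str) -> None:
        # Moving a tenant is only a reassignment here; a real system would
        # also have to copy the tenant's data before switching over.
        previous = self.shard_of.get(tenant)
        if previous is not None:
            self.tenants_on[previous].discard(tenant)
        self.shard_of[tenant] = shard
        self.tenants_on.setdefault(shard, set()).add(tenant)

    def lookup(self, tenant: str) -> str:
        return self.shard_of[tenant]


directory = TenantDirectory()
for small_tenant in ["bakery", "florist", "bike-shop"]:
    directory.assign(small_tenant, "shared-1")   # small tenants share one shard

directory.assign("bakery", "dedicated-bakery")   # the bakery outgrew the shared shard
```

Because every request first consults the directory, a tenant can be moved without changing application code, at the cost of one extra lookup per request.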
## Sharding of Key-Value Data {#sec_sharding_key_value}
Say you have a large amount of data, and you want to shard it. How do you decide which records to
store on which nodes?
Our goal with sharding is to spread the data and the query load evenly across nodes. If every node
takes a fair share, then—in theory—10 nodes should be able to handle 10 times as much data and 10
times the read and write throughput of a single node (ignoring replication). Moreover, if we add or
remove a node, we want to be able to *rebalance* the load so that it is evenly distributed across
the 11 (when adding) or the remaining 9 (when removing) nodes.
If the sharding is unfair, so that some shards have more data or queries than others, we call it
*skewed*. The presence of skew makes sharding much less effective. In an extreme case, all the load
could end up on one shard, so 9 out of 10 nodes are idle and your bottleneck is the single busy
node. A shard with disproportionately high load is called a *hot shard* or *hot spot*. If there’s
one key with a particularly high load (e.g., a celebrity in a social network), we call it a *hot key*.
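A rough way to quantify skew is to compare the busiest shard with the average. The ratio below is an illustrative metric chosen for this sketch (the name `skew_ratio` is an assumption, not standard terminology), and the load numbers are made up.

```python
def skew_ratio(load_per_shard: list) -> float:
    """Load of the busiest shard divided by the mean load (1.0 means perfectly even)."""
    mean = sum(load_per_shard) / len(load_per_shard)
    if mean == 0:
        raise ValueError("no load to measure")
    return max(load_per_shard) / mean


even = [100] * 10            # every shard takes a fair share
extreme = [1000] + [0] * 9   # all load on one shard, nine shards idle

even_ratio = skew_ratio(even)        # 1.0
extreme_ratio = skew_ratio(extreme)  # 10.0
```

In the extreme case the ratio equals the number of shards: the hot shard carries ten times its fair share, so the ten-node system can handle no more load than a single node.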
Therefore we need an algorithm that takes as input the partition key of a record, and tells us which
shard that record is in. In a key-value store the partition key is usually the key, or the first
part of the key. In a relational model the partition key might be some column of a table (not
necessarily its primary key). That algorithm needs to be amenable to rebalancing in order to relieve
hot spots.
### Sharding by Key Range {#sec_sharding_key_range}
One way of sharding is to assign a contiguous range of partition keys (from some minimum to some
maximum) to each shard, like the volumes of a paper encyclopedia, as illustrated in
[Figure 7-2](/en/ch7#fig_sharding_encyclopedia). In this example, an entry’s partition key is its title. If you want
to look up the entry for a particular title, you can easily determine which shard contains that
entry by finding the volume whose key range contains the title you’re looking for, and thus pick the
correct book off the shelf.
{{< figure src="/fig/ddia_0702.png" id="fig_sharding_encyclopedia" caption="Figure 7-2. A print encyclopedia is sharded by key range." class="w-full my-4" >}}
The ranges of keys are not necessarily evenly spaced, because your data may not be evenly
distributed. For example, in [Figure 7-2](/en/ch7#fig_sharding_encyclopedia), volume 1 contains words starting with A
and B, but volume 12 contains words starting with T, U, V, W, X, Y, and Z. Simply having one volume
per two letters of the alphabet would lead to some volumes being much bigger than others. In order
to distribute the data evenly, the shard boundaries need to adapt to the data.
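As a minimal sketch of the lookup described above, a sorted list of shard boundaries can be searched with binary search. The boundary letters and the function name `shard_for_key` are made up for illustration and do not match the volumes in the figure.

```python
import bisect

# Hypothetical shard boundaries, chosen unevenly to follow the data.
# Shard i holds keys k with BOUNDARIES[i-1] <= k < BOUNDARIES[i];
# the first shard has no lower bound and the last has no upper bound.
BOUNDARIES = ["c", "e", "h", "m", "t"]

def shard_for_key(title: str) -> int:
    """Return the index of the shard whose key range contains the title."""
    return bisect.bisect_right(BOUNDARIES, title.lower())

print(shard_for_key("Aardvark"))  # 0: sorts before "c"
print(shard_for_key("sharding"))  # 4: between "m" and "t"
print(shard_for_key("Zebra"))     # 5: at or after "t"
```

Because the boundaries are just data, adapting them to the key distribution means editing the list, not changing the lookup.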
The shard boundaries might be chosen manually by an administrator, or the database can choose them
automatically. Manual key-range sharding is used by Vitess (a sharding layer for MySQL), for
example; the automatic variant is used by Bigtable, its open source equivalent HBase, the
range-based sharding option in MongoDB, CockroachDB, RethinkDB, and FoundationDB [^6]. YugabyteDB offers both manual and automatic
tablet splitting.
Within each shard, keys are stored in sorted order (e.g., in a B-tree or SSTables, as discussed in
[Chapter 4](/en/ch4#ch_storage)). This has the advantage that range scans are easy, and you can treat the key as a
concatenated index in order to fetch several related records in one query (see
[“Multidimensional and Full-Text Indexes”](/en/ch4#sec_storage_multidimensional)). For example, consider an application that stores data from a
network of sensors, where the key is the timestamp of the measurement. Range scans are very useful
in this case, because they let you easily fetch, say, all the readings from a particular month.
A downside of key range sharding is that you can easily get a hot shard if there are a
lot of writes to nearby keys. For example, if the key is a timestamp, then the shards correspond to
ranges of time—e.g., one shard per month. Unfortunately, if you write data from the sensors to the
database as the measurements happen, all the writes end up going to the same shard (the one for
this month), so that shard can be overloaded with writes while others sit idle [^13].
To avoid this problem in the sensor database, you need to use something other than the timestamp as
the first element of the key. For example, you could prefix each timestamp with the sensor ID so
that the key ordering is first by sensor ID and then by timestamp. Assuming you have many sensors
active at the same time, the write load will end up more evenly spread across the shards. The
downside is that when you want to fetch the values of multiple sensors within a time range, you now
need to perform a separate range query for each sensor.
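The trade-off can be sketched with a sorted in-memory list standing in for the sorted keys of a key-range sharded store. The sensor names, timestamps, and function names here are illustrative assumptions.

```python
import bisect

# Records sorted by compound key (sensor_id, timestamp), followed by a value.
readings = sorted([
    ("sensor-a", 1000, 20.5),
    ("sensor-a", 2000, 21.0),
    ("sensor-a", 3000, 21.4),
    ("sensor-b", 1500, 18.2),
    ("sensor-b", 2500, 18.9),
])

def scan(sensor_id, start_ts, end_ts):
    """Range scan for one sensor over timestamps in [start_ts, end_ts)."""
    lo = bisect.bisect_left(readings, (sensor_id, start_ts))
    hi = bisect.bisect_left(readings, (sensor_id, end_ts))
    return readings[lo:hi]

def scan_all(sensor_ids, start_ts, end_ts):
    """Fetching several sensors needs one separate range scan per sensor."""
    return [row for sensor_id in sensor_ids
            for row in scan(sensor_id, start_ts, end_ts)]

print(scan("sensor-a", 1000, 3000))
print(scan_all(["sensor-a", "sensor-b"], 1500, 3000))
```

Writes from different sensors land in different parts of the key space, while a time-range query across sensors turns into a loop over sensor IDs.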
#### Rebalancing key-range sharded data {#rebalancing-key-range-sharded-data}
When you first set up your database, there are no key ranges to split into shards. Some databases,
such as HBase and MongoDB, allow you to configure an initial set of shards on an empty database,
which is called *pre-splitting*. This requires that you already have some idea of what the key
distribution is going to look like, so that you can choose appropriate key range boundaries [^14].
Later on, as your data volume and write throughput grow, a system with key-range sharding grows by
splitting an existing shard into two or more smaller shards, each of which holds a contiguous
sub-range of the original shard’s key range. The resulting smaller shards can then be distributed
across multiple nodes. If large amounts of data are deleted, you may also need to merge several
adjacent shards that have become small into one bigger one.
This process is similar to what happens at the top level of a B-tree (see [“B-Trees”](/en/ch4#sec_storage_b_trees)).
With databases that manage shard boundaries automatically, a shard split is typically triggered by:
* the shard reaching a configured size (for example, on HBase, the default is 10 GB), or
* in some systems, the write throughput being persistently above some threshold. Thus, a hot shard
may be split even if it is not storing a lot of data, so that its write load can be distributed more uniformly.
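The size-based trigger can be illustrated with a toy model in which a shard splits at its median key once it holds too many keys. The threshold, the `Shard` class, and the choice of the median as split point are assumptions for this sketch; real systems measure bytes and may also consider write throughput.

```python
import bisect
from dataclasses import dataclass, field
from typing import List, Optional

MAX_KEYS_PER_SHARD = 4  # hypothetical threshold; real systems use a size in bytes

@dataclass
class Shard:
    lower: str                 # inclusive lower bound of the key range
    upper: Optional[str]       # exclusive upper bound; None means unbounded
    keys: List[str] = field(default_factory=list)  # kept in sorted order

def split(shard: Shard) -> List[Shard]:
    """Split a shard into two contiguous sub-ranges at its median key."""
    mid = len(shard.keys) // 2
    split_key = shard.keys[mid]
    return [Shard(shard.lower, split_key, shard.keys[:mid]),
            Shard(split_key, shard.upper, shard.keys[mid:])]

def insert(shards: List[Shard], key: str) -> None:
    """Insert a key into the shard that owns it, splitting if it grows too big."""
    for i, shard in enumerate(shards):
        if shard.lower <= key and (shard.upper is None or key < shard.upper):
            bisect.insort(shard.keys, key)
            if len(shard.keys) > MAX_KEYS_PER_SHARD:
                shards[i:i + 1] = split(shard)
            return

shards = [Shard(lower="", upper=None)]
for key in ["apple", "banana", "cherry", "date", "elderberry"]:
    insert(shards, key)
print(shards)
```

After the fifth insert, the single shard is replaced by two shards whose ranges meet at `"cherry"`.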
An advantage of key-range sharding is that the number of shards adapts to the data volume. If there
is only a small amount of data, a small number of shards is sufficient, so overheads are small; if
there is a huge amount of data, the size of each individual shard is limited to a configurable maximum [^15].
A downside of this approach is that splitting a shard is an expensive operation, since it requires
all of its data to be rewritten into new files, similarly to a compaction in a log-structured
storage engine. A shard that needs splitting is often also one that is under high load, and the cost
of splitting can exacerbate that load, risking it becoming overloaded.
### Sharding by Hash of Key {#sec_sharding_hash}
Key-range sharding is useful if you want records with nearby (but different) partition keys to be
grouped into the same shard; for example, this might be the case with timestamps. If you don’t care
whether partition keys are near each other (e.g., if they are tenant IDs in a multitenant
application), a common approach is to first hash the partition key before mapping it to a shard.
A good hash function takes skewed data and makes it uniformly distributed. Say you have a 32-bit
hash function that takes a string. Whenever you give it a new string, it returns a seemingly random
number between 0 and 2³² − 1. Even if the input strings are very similar, their hashes are evenly
distributed across that range of numbers (but the same input always produces the same output).
For sharding purposes, the hash function need not be cryptographically strong: for example, MongoDB
uses MD5, whereas Cassandra and ScyllaDB use Murmur3. Many programming languages have simple hash
functions built in (as they are used for hash tables), but they may not be suitable for sharding:
for example, with Java’s `Object.hashCode()` and Ruby’s `Object#hash`, the same key may have a
different hash value in different processes [^16].
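A hash suitable for sharding has to give the same result in every process. The sketch below derives a 32-bit value from MD5 using Python's standard `hashlib`; the function name `stable_hash32` and the choice to keep the first four bytes of the digest are assumptions made for illustration. Python's built-in `hash()` has the same limitation as the Java and Ruby examples, since string hashes are salted per process by default.

```python
import hashlib

def stable_hash32(key: str) -> int:
    """Deterministic 32-bit hash: the first 4 bytes of the key's MD5 digest."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")

# Similar inputs map to unrelated-looking values, but the same input
# always yields the same value, in this process and in any other.
for key in ["user:1", "user:2", "user:3"]:
    print(key, stable_hash32(key))
```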
#### Hash modulo number of nodes {#hash-modulo-number-of-nodes}
Once you have hashed the key, how do you choose which shard to store it in? Maybe your first thought
is to take the hash value *modulo* the number of nodes in the system (using the `%` operator in many
programming languages). For example, *hash*(*key*) % 10 would return a number between
0 and 9 (if we write the hash as a decimal number, the hash % 10 would be the last digit).
If we have 10 nodes, numbered 0 to 9, that seems like an easy way of assigning each key to a node.
The problem with the *mod N* approach is that if the number of nodes *N* changes, most of the keys
have to be moved from one node to another. [Figure 7-3](/en/ch7#fig_sharding_hash_mod_n) shows what happens when you
have three nodes and add a fourth. Before the rebalancing, node 0 stored the keys whose hashes are
0, 3, 6, 9, and so on. After adding the fourth node, the key with hash 3 has moved to node 3, the
key with hash 6 has moved to node 2, the key with hash 9 has moved to node 1, and so on.
{{< figure src="/fig/ddia_0703.png" id="fig_sharding_hash_mod_n" caption="Figure 7-3. Assigning keys to nodes by hashing the key and taking it modulo the number of nodes. Changing the number of nodes results in many keys moving from one node to another." class="w-full my-4" >}}
The *mod N* function is easy to compute, but it leads to very inefficient rebalancing because there
is a lot of unnecessary movement of records from one node to another. We need an approach that
doesn’t move data around more than necessary.
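The amount of movement is easy to quantify. The following sketch counts how many hash values change node when the node count changes; the function name `moved_fraction` is illustrative, and consecutive integers stand in for uniformly distributed hash values.

```python
def moved_fraction(hashes, old_n: int, new_n: int) -> float:
    """Fraction of hash values whose node changes under hash % N."""
    moved = sum(1 for h in hashes if h % old_n != h % new_n)
    return moved / len(hashes)

print(moved_fraction(range(12_000), 3, 4))    # 0.75
print(moved_fraction(range(11_000), 10, 11))  # about 0.91
```

Going from three nodes to four, ideally only a quarter of the keys would move (the share taken over by the new node), yet three quarters of them change node.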
#### Fixed number of shards {#fixed-number-of-shards}
One simple but widely used solution is to create many more shards than there are nodes, and to
assign several shards to each node. For example, a database running on a cluster of 10 nodes may be
split into 1,000 shards from the outset so that 100 shards are assigned to each node. A key is then
stored in shard number *hash*(*key*) % 1,000, and the system separately keeps track of
which shard is stored on which node.
Now, if a node is added to the cluster, the system can reassign some of the shards from existing
nodes to the new node until they are fairly distributed once again. This process is illustrated in
[Figure 7-4](/en/ch7#fig_sharding_rebalance_fixed). If a node is removed from the cluster, the same happens in reverse.
{{< figure src="/fig/ddia_0704.png" id="fig_sharding_rebalance_fixed" caption="Figure 7-4. Adding a new node to a database cluster with multiple shards per node." class="w-full my-4" >}}
In this model, only entire shards are moved between nodes, which is cheaper than splitting shards.
The number of shards does not change, nor does the assignment of keys to shards. The only thing that
changes is the assignment of shards to nodes. This change of assignment is not immediate—it takes
some time to transfer a large amount of data over the network—so the old assignment of shards is
used for any reads and writes that happen while the transfer is in progress.
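A rough sketch of this scheme keeps two mappings: a fixed function from key to shard, and a mutable table from shard to node. The function names and the policy of taking shards from the most loaded node are assumptions for illustration, and the sketch switches the assignment instantly rather than modelling the transfer period.

```python
import hashlib
from collections import Counter

NUM_SHARDS = 1000  # fixed when the database is created

def shard_for_key(key: str) -> int:
    """Key-to-shard mapping; never changes when nodes come and go."""
    h = int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:4], "big")
    return h % NUM_SHARDS

def initial_assignment(num_nodes: int) -> dict:
    """Spread the shards evenly over nodes numbered 0..num_nodes-1."""
    return {shard: shard % num_nodes for shard in range(NUM_SHARDS)}

def add_node(assignment: dict, new_node: int) -> list:
    """Reassign whole shards to new_node until it holds a fair share."""
    num_nodes = len(set(assignment.values())) + 1
    target = NUM_SHARDS // num_nodes
    moved = []
    while len(moved) < target:
        counts = Counter(assignment.values())
        donor = max(counts, key=counts.get)  # currently most loaded node
        shard = next(s for s, n in assignment.items() if n == donor)
        assignment[shard] = new_node
        moved.append(shard)
    return moved

assignment = initial_assignment(10)
moved = add_node(assignment, new_node=10)
print(len(moved), Counter(assignment.values()))
```

Only 90 of the 1,000 shards change node, and no key ever changes shard.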
It’s common to choose a number of shards that has many divisors, so that
the dataset can be evenly split across various different numbers of nodes—not requiring the number
of nodes to be a power of 2, for example [^4].
You can even account for mismatched hardware in your cluster: by assigning more shards to nodes that
are more powerful, you can make those nodes take a greater share of the load.
This approach to sharding is used in Citus (a sharding layer for PostgreSQL), Riak, Elasticsearch,
and Couchbase, among others. It works well as long as you have a good estimate of how many shards
you will need when you first create the database. You can then add or remove nodes easily, subject
to the limitation that you can’t have more nodes than you have shards.
If you find the originally configured number of shards to be wrong—for example, if you have reached
a scale where you need more nodes than you have shards—then an expensive resharding operation is
required. It needs to split each shard and write it out to new files, using a lot of additional disk
space in the process. Some systems don’t allow resharding while concurrently writing to the
database, which makes it difficult to change the number of shards without downtime.
Choosing the right number of shards is difficult if the total size of the dataset is highly variable
(for example, if it starts small but may grow much larger over time). Since each shard contains a
fixed fraction of the total data, the size of each shard grows proportionally to the total amount of
data in the cluster. If shards are very large, rebalancing and recovery from node failures become
expensive. But if shards are too small, they incur too much overhead. The best performance is
achieved when the size of shards is “just right,” neither too big nor too small, which can be hard
to achieve if the number of shards is fixed but the dataset size varies.
#### Sharding by hash range {#sharding-by-hash-range}
If the required number of shards can’t be predicted in advance, it’s better to use a scheme in which
the number of shards can adapt easily to the workload. The aforementioned key-range sharding scheme
has this property, but it has a risk of hot spots when there are a lot of writes to nearby keys. One
solution is to combine key-range sharding with a hash function so that each shard contains a range
of *hash values* rather than a range of *keys*.
[Figure 7-5](/en/ch7#fig_sharding_hash_range) shows an example using a 16-bit hash function that returns a number
between 0 and 65,535 = 2¹⁶ − 1 (in reality, the hash is usually 32 bits or more).
Even if the input keys are very similar (e.g., consecutive timestamps), their hashes are uniformly
distributed across that range. We can then assign a range of hash values to each shard: for example,
values between 0 and 16,383 to shard 0, values between 16,384 and 32,767 to shard 1, and so on.
{{< figure src="/fig/ddia_0705.png" id="fig_sharding_hash_range" caption="Figure 7-5. Assigning a contiguous range of hash values to each shard." class="w-full my-4" >}}
As with key-range sharding, a shard in hash-range sharding can be split when it becomes too big or
too heavily loaded. This is still an expensive operation, but it can happen as needed, so the number
of shards adapts to the volume of data rather than being fixed in advance.
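Hash-range lookup and splitting can be sketched as follows, using the same 16-bit range as the example. The `hash16` helper (two bytes of an MD5 digest) and the midpoint split policy are assumptions; a real system would choose its own hash function and split point.

```python
import bisect
import hashlib

HASH_SPACE = 65_536  # 16-bit hash values: 0 to 65,535

def hash16(key: str) -> int:
    """Illustrative 16-bit hash: the first 2 bytes of the key's MD5 digest."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:2], "big")

# starts[i] is the first hash value of shard i;
# shard i covers starts[i] up to (but not including) starts[i + 1].
starts = [0, 16_384, 32_768, 49_152]

def shard_for_hash(h: int) -> int:
    """Find the shard whose hash range contains h."""
    return bisect.bisect_right(starts, h) - 1

def split_shard(i: int) -> None:
    """Split shard i into two halves at the midpoint of its hash range."""
    upper = starts[i + 1] if i + 1 < len(starts) else HASH_SPACE
    starts.insert(i + 1, (starts[i] + upper) // 2)

print(shard_for_hash(16_383), shard_for_hash(16_384))  # 0 1
print(shard_for_hash(hash16("2024-01-01T00:00:00")))
```

Calling `split_shard(0)` inserts a new boundary at 8,192, so later shards are renumbered while every hash value still falls in exactly one range.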
The downside compared to key-range sharding is that range queries over the partition key are not
efficient, as keys in the range are now scattered across all the shards. However, if keys consist of
two or more columns, and the partition key is only the first of these columns, you can still perform
efficient range queries over the second and later columns: as long as all records in the range query
have the same partition key, they will be in the same shard.
--------
> [!TIP] PARTITIONING AND RANGE QUERIES IN DATA WAREHOUSES
Data warehouses such as BigQuery, Snowflake, and Delta Lake support a similar indexing approach,
though the terminology differs. In BigQuery, for example, the partition key determines which
partition a record resides in while “cluster columns” determine how records are sorted within the
partition. Snowflake assigns records to “micro-partitions” automatically, but allows users to define
cluster keys for a table. Delta Lake supports both manual and automatic partition assignment, and
supports cluster keys. Clustering data not only improves range scan performance, but can
improve compression and filtering performance as well.
--------
Hash-range sharding is used in YugabyteDB and DynamoDB [^17], and is an option in MongoDB.
Cassandra and ScyllaDB use a variant of this approach that is illustrated in
[Figure 7-6](/en/ch7#fig_sharding_cassandra): the space of hash values is split into a number of ranges proportional
to the number of nodes (3 ranges per node in [Figure 7-6](/en/ch7#fig_sharding_cassandra), but actual numbers are 8
per node in Cassandra by default, and 256 per node in ScyllaDB), with random boundaries between
those ranges. This means some ranges are bigger than others, but by having multiple ranges per node
those imbalances tend to even out [^15] [^18].
{{< figure src="/fig/ddia_0706.png" id="fig_sharding_cassandra" caption="Figure 7-6. Cassandra and ScyllaDB split the range of possible hash values (here 0–1023) into contiguous ranges with random boundaries, and assign several ranges to each node." class="w-full my-4" >}}
When nodes are added or removed, range boundaries are added and removed, and shards are split or
merged accordingly [^19].
In the example of [Figure 7-6](/en/ch7#fig_sharding_cassandra), when node 3 is added, node 1
transfers parts of two of its ranges to node 3, and node 2 transfers part of one of its ranges to
node 3. This has the effect of giving the new node an approximately fair share of the dataset,
without transferring more data than necessary from one node to another.
#### Consistent hashing {#sec_sharding_consistent_hashing}
A *consistent hashing* algorithm is a hash function that maps keys to a specified number of shards
in a way that satisfies two properties:
1. the number of keys mapped to each shard is roughly equal, and
2. when the number of shards changes, as few keys as possible are moved from one shard to another.
Note that *consistent* here has nothing to do with replica consistency (see [Chapter 6](/en/ch6#ch_replication)) or
ACID consistency (see [Chapter 8](/en/ch8#ch_transactions)), but rather describes the tendency of a key to stay in
the same shard as much as possible.
The sharding algorithm used by Cassandra and ScyllaDB is similar to the original definition of consistent hashing [^20],
but several other consistent hashing algorithms have also been proposed [^21], such as *highest random weight*, also known as *rendezvous hashing* [^22],
and *jump consistent hash* [^23].
With Cassandra’s algorithm, if one node is added, a small number of existing shards are split into
sub-ranges; on the other hand, with rendezvous and jump consistent hashes, the new node is assigned
individual keys that were previously scattered across all of the other nodes. Which one is
preferable depends on the application.
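To make the second property concrete, here is a minimal sketch of rendezvous (highest random weight) hashing: each key goes to the node for which a combined hash of node and key is largest. The node names, the `weight` function, and the use of MD5 are assumptions for illustration.

```python
import hashlib

def weight(key: str, node: str) -> int:
    """Pseudo-random weight of a (node, key) pair."""
    data = f"{node}:{key}".encode("utf-8")
    return int.from_bytes(hashlib.md5(data).digest()[:8], "big")

def node_for_key(key: str, nodes: list) -> str:
    """Pick the node with the highest weight for this key."""
    return max(nodes, key=lambda node: weight(key, node))

keys = [f"key-{i}" for i in range(1000)]
before = {k: node_for_key(k, ["n0", "n1", "n2"]) for k in keys}
after = {k: node_for_key(k, ["n0", "n1", "n2", "n3"]) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
print(len(moved), all(after[k] == "n3" for k in moved))
```

A key moves only if the new node wins for that key, so every moved key lands on `n3`, and roughly a quarter of the keys move. Those keys come from all three existing nodes, matching the behavior described above for rendezvous hashing.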
### Skewed Workloads and Relieving Hot Spots {#sec_sharding_skew}
Consistent hashing ensures that keys are uniformly distributed across nodes, but that doesn’t mean
that the actual load is uniformly distributed. If the workload is highly skewed—that is, the amount
of data under some partition keys is much greater than other keys, or if the rate of requests to
some keys is much higher than to others—you can still end up with some servers being overloaded
while others sit almost idle.
For example, on a social media site, a celebrity user with millions of followers may cause a storm
of activity when they do something [^24].
This event can result in a large volume of reads and writes to the same key (where the partition key
is perhaps the user ID of the celebrity, or the ID of the action that people are commenting on).
In such situations, a more flexible sharding policy is required [^25] [^26].
A system that defines shards based on ranges of keys (or ranges of hashes) makes it possible to put
an individual hot key in a shard of its own, and perhaps even to assign it a dedicated machine [^27].
It’s also possible to compensate for skew at the application level. For example, if one key is known
to be very hot, a simple technique is to add a random number to the beginning or end of the key.
Just a two-digit decimal random number would split the writes to the key evenly across 100 different
keys, allowing those keys to be distributed to different shards.
However, having split the writes across different keys, any reads now have to do additional work, as
they have to read the data from all 100 keys and combine it. The volume of reads to each shard of
the hot key is not reduced; only the write load is split. This technique also requires additional
bookkeeping: it only makes sense to append the random number for the small number of hot keys; for
the vast majority of keys with low write throughput this would be unnecessary overhead. Thus, you
also need some way of keeping track of which keys are being split, and a process for converting a
regular key into a specially managed hot key.
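The key-splitting technique and its bookkeeping can be sketched with a counter workload. The `HOT_KEYS` set, the `#` separator, and the in-memory `store` are hypothetical stand-ins; how hot keys are detected and recorded is left out.

```python
import random
from collections import defaultdict

HOT_KEYS = {"celebrity:42"}  # hypothetical record of which keys are split
NUM_SPLITS = 100             # a two-digit decimal suffix gives 100 sub-keys

store = defaultdict(int)     # stand-in for a sharded key-value store of counters

def key_for_write(key: str) -> str:
    """Append a random two-digit suffix, but only for known hot keys."""
    if key in HOT_KEYS:
        return f"{key}#{random.randrange(NUM_SPLITS):02d}"
    return key

def increment(key: str) -> None:
    store[key_for_write(key)] += 1

def read_total(key: str) -> int:
    """Reads of a hot key must fetch and combine all of its sub-keys."""
    if key in HOT_KEYS:
        return sum(store.get(f"{key}#{i:02d}", 0) for i in range(NUM_SPLITS))
    return store.get(key, 0)

for _ in range(1000):
    increment("celebrity:42")
increment("user:7")
print(read_total("celebrity:42"), read_total("user:7"))  # 1000 1
```

The write path touches one sub-key chosen at random, while the read path fans out to all 100, which is the extra read work described above.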
The problem is further compounded by change of load over time: for example, a particular social
media post that has gone viral may experience high load for a couple of days, but thereafter it’s
likely to calm down again. Moreover, some keys may be hot for writes while others are hot for reads,
necessitating different strategies for handling them.
Some systems (especially cloud services designed for large scale) have automated approaches for
dealing with hot shards; for example, Amazon calls it *heat management* [^28] or *adaptive capacity* [^17].
The details of how these systems work go beyond the scope of this book.
### Operations: Automatic or Manual Rebalancing {#sec_sharding_operations}
There is one important question with regard to rebalancing that we have glossed over: does the
splitting of shards and rebalancing happen automatically or manually?
Some systems automatically decide when to split shards and when to move them from one node to
another, without any human interaction, while others leave sharding to be explicitly configured by
an administrator. There is also a middle ground: for example, Couchbase and Riak generate a
suggested shard assignment automatically, but require an administrator to commit it before it takes effect.
Fully automated rebalancing can be convenient, because there is less operational work to do for
normal maintenance, and such systems can even auto-scale to adapt to changes in workload. Cloud
databases such as DynamoDB are promoted as being able to automatically add and remove shards to
adapt to big increases or decreases of load within a matter of minutes [^17] [^29].
However, automatic shard management can also be unpredictable. Rebalancing is an expensive
operation, because it requires rerouting requests and moving a large amount of data from one node to
another. If it is not done carefully, this process can overload the network or the nodes, and it
might harm the performance of other requests. The system must continue processing writes while the
rebalancing is in progress; if a system is near its maximum write throughput, the shard-splitting
process might not even be able to keep up with the rate of incoming writes [^29].
Such automation can be dangerous in combination with automatic failure detection. For example, say
one node is overloaded and is temporarily slow to respond to requests. The other nodes conclude that
the overloaded node is dead, and automatically rebalance the cluster to move load away from it. This
puts additional load on other nodes and the network, making the situation worse. There is a risk of
causing a cascading failure where other nodes become overloaded and are also falsely suspected of being down.
For that reason, it can be a good thing to have a human in the loop for rebalancing. It’s slower
than a fully automatic process, but it can help prevent operational surprises.
## Request Routing {#sec_sharding_routing}
We have discussed how to shard a dataset across multiple nodes, and how to rebalance those shards as
nodes are added or removed. Now let’s move on to the question: if you want to read or write a
particular key, how do you know which node—i.e., which IP address and port number—you need to
connect to?
We call this problem *request routing*, and it’s very similar to *service discovery*, which we
previously discussed in [“Load balancers, service discovery, and service meshes”](/en/ch5#sec_encoding_service_discovery). The biggest difference between the two
is that with services running application code, each instance is usually stateless, and a load
balancer can send a request to any of the instances. With sharded databases, a request for a key can
only be handled by a node that is a replica for the shard containing that key.
This means that request routing has to be aware of the assignment from keys to shards, and from
shards to nodes. On a high level, there are a few different approaches to this problem
(illustrated in [Figure 7-7](/en/ch7#fig_sharding_routing)):
1. Allow clients to contact any node (e.g., via a round-robin load balancer). If that node
coincidentally owns the shard to which the request applies, it can handle the request directly;
otherwise, it forwards the request to the appropriate node, receives the reply, and passes the
reply along to the client.
2. Send all requests from clients to a routing tier first, which determines the node that should
handle each request and forwards it accordingly. This routing tier does not itself handle any
requests; it only acts as a shard-aware load balancer.
3. Require that clients be aware of the sharding and the assignment of shards to nodes. In this
case, a client can connect directly to the appropriate node, without any intermediary.
{{< figure src="/fig/ddia_0707.png" id="fig_sharding_routing" caption="Figure 7-7. Three different ways of routing a request to the right node." class="w-full my-4" >}}
In all cases, there are some key problems:
* Who decides which shard should live on which node? It’s simplest to have a single coordinator
making that decision, but in that case how do you make it fault-tolerant in case the node running
the coordinator goes down? And if the coordinator role can fail over to another node, how do you
prevent a split-brain situation (see [“Handling Node Outages”](/en/ch6#sec_replication_failover)) where two different
coordinators make contradictory shard assignments?
* How does the component performing the routing (which may be one of the nodes, or the routing tier,
or the client) learn about changes in the assignment of shards to nodes?
* While a shard is being moved from one node to another, there is a cutover period during which the
new node has taken over, but requests to the old node may still be in flight. How do you handle
those?
Many distributed data systems rely on a separate coordination service such as ZooKeeper or etcd to
keep track of shard assignments, as illustrated in [Figure 7-8](/en/ch7#fig_sharding_zookeeper). They use consensus
algorithms (see [Chapter 10](/en/ch10#ch_consistency)) to provide fault tolerance and protection against split-brain.
Each node registers itself in ZooKeeper, and ZooKeeper maintains the authoritative mapping of shards
to nodes. Other actors, such as the routing tier or the sharding-aware client, can subscribe to this
information in ZooKeeper. Whenever a shard changes ownership, or a node is added or removed,
ZooKeeper notifies the routing tier so that it can keep its routing information up to date.
{{< figure src="/fig/ddia_0708.png" id="fig_sharding_zookeeper" caption="Figure 7-8. Using ZooKeeper to keep track of assignment of shards to nodes." class="w-full my-4" >}}
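The subscription pattern can be sketched with an in-memory registry that stands in for the coordination service. The classes `ShardMapRegistry` and `Router` and their methods are hypothetical and do not correspond to the ZooKeeper or etcd client APIs; the sketch also ignores failures and in-flight requests.

```python
class ShardMapRegistry:
    """In-memory stand-in for a coordination service holding shard -> node."""

    def __init__(self):
        self._assignment = {}
        self._watchers = []

    def subscribe(self, callback) -> None:
        """Register a watcher and send it the current assignment."""
        self._watchers.append(callback)
        callback(dict(self._assignment))

    def assign(self, shard: int, node: str) -> None:
        """Change a shard's owner and notify all subscribers."""
        self._assignment[shard] = node
        for callback in self._watchers:
            callback(dict(self._assignment))

class Router:
    """Routing tier that keeps a local copy of the shard assignment."""

    def __init__(self, registry: ShardMapRegistry, num_shards: int):
        self.num_shards = num_shards
        self.table = {}
        registry.subscribe(self._on_change)

    def _on_change(self, assignment: dict) -> None:
        self.table = assignment

    def route(self, key_hash: int) -> str:
        """Map a key's hash to its shard, then to the node owning that shard."""
        return self.table[key_hash % self.num_shards]

registry = ShardMapRegistry()
for shard, node in enumerate(["node-a", "node-b", "node-a", "node-b"]):
    registry.assign(shard, node)

router = Router(registry, num_shards=4)
print(router.route(5))         # shard 1, owned by node-b
registry.assign(1, "node-c")   # shard 1 changes ownership
print(router.route(5))         # now node-c
```

The router never queries the registry on the request path; it only consults its local table, which the registry keeps current through notifications.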
For example, HBase and SolrCloud use ZooKeeper to manage shard assignment, and Kubernetes uses etcd
to keep track of which service instance is running where. MongoDB has a similar architecture, but it
relies on its own *config server* implementation and *mongos* daemons as the routing tier. Kafka,
YugabyteDB, and TiDB use built-in implementations of the Raft consensus protocol to perform this
coordination function.
Cassandra, ScyllaDB, and Riak take a different approach: they use a *gossip protocol* among the
nodes to disseminate any changes in cluster state. This provides much weaker consistency than a
consensus protocol; it is possible to have split brain, in which different parts of the cluster have
different node assignments for the same shard. Leaderless databases can tolerate this because they
generally make weak consistency guarantees anyway (see [“Limitations of Quorum Consistency”](/en/ch6#sec_replication_quorum_limitations)).
When using a routing tier or when sending requests to a random node, clients still need to find the
IP addresses to connect to. These are not as fast-changing as the assignment of shards to nodes,
so it is often sufficient to use DNS for this purpose.
This discussion of request routing has focused on finding the shard for an individual key, which is
most relevant for sharded OLTP databases. Analytic databases often use sharding as well, but they
typically have a very different kind of query execution: rather than executing in a single shard, a
query typically needs to aggregate and join data from many different shards in parallel. We will
discuss techniques for such parallel query execution in [“JOIN and GROUP BY”](/en/ch11#sec_batch_join).
## Sharding and Secondary Indexes {#sec_sharding_secondary_indexes}
The sharding schemes we have discussed so far rely on the client knowing the partition key for any
record it wants to access. This is most easily done in a key-value data model, where the partition
key is the first part of the primary key (or the entire primary key), and so we can use the
partition key to determine the shard, and thus route reads and writes to the node that is
responsible for that key.
The situation becomes more complicated if secondary indexes are involved (see also
[“Multi-Column and Secondary Indexes”](/en/ch4#sec_storage_index_multicolumn)). A secondary index usually doesn’t identify a record uniquely but
rather is a way of searching for occurrences of a particular value: find all actions by user `123`,
find all articles containing the word `hogwash`, find all cars whose color is `red`, and so on.
Key-value stores often don’t have secondary indexes, but they are the bread and butter of relational
databases, they are common in document databases too, and they are the *raison d’être* of full-text
search engines such as Solr and Elasticsearch. The problem with secondary indexes is that they don’t
map neatly to shards. There are two main approaches to sharding a database with secondary indexes:
local and global indexes.
### Local Secondary Indexes {#id166}
For example, imagine you are operating a website for selling used cars (illustrated in
[Figure 7-9](/en/ch7#fig_sharding_local_secondary)). Each listing has a unique ID, and you use that ID as partition
key for sharding (for example, IDs 0 to 499 in shard 0, IDs 500 to 999 in shard 1, etc.).
If you want to let users search for cars, allowing them to filter by color and by make, you need a
secondary index on `color` and `make` (in a document database these would be fields; in a relational
database they would be columns). If you have declared the index, the database can perform the
indexing automatically. For example, whenever a red car is added to the database, the database shard
automatically adds its ID to the list of IDs for the index entry `color:red`. As discussed in
[Chapter 4](/en/ch4#ch_storage), that list of IDs is also called a *postings list*.
{{< figure src="/fig/ddia_0709.png" id="fig_sharding_local_secondary" caption="Figure 7-9. Local secondary indexes: each shard indexes only the records within its own shard." class="w-full my-4" >}}
> [!WARNING] WARNING
If your database only supports a key-value model, you might be tempted to implement a secondary
index yourself by creating a mapping from values to IDs in application code. If you go down this
route, you need to take great care to ensure your indexes remain consistent with the underlying
data. Race conditions and intermittent write failures (where some changes were saved but others
weren’t) can very easily cause the data to go out of sync—see [“The need for multi-object transactions”](/en/ch8#sec_transactions_need).
--------
In this indexing approach, each shard is completely separate: each shard maintains its own secondary
indexes, covering only the records in that shard. It doesn’t care what data is stored in other
shards. Whenever you write to the database (to add, remove, or update a record), you only need to
deal with the shard that contains the record that you are writing. For that reason, this type of
secondary index is known as a *local index*. In an information retrieval context it is also known as
a *document-partitioned index* [^30].
When reading from a local secondary index, if you already know the partition key of the record
you’re looking for, you can just perform the search on the appropriate shard. Moreover, if you only
want *some* results, and you don’t need all, you can send the request to any shard.
However, if you want all the results and don’t know their partition key in advance, you need to send
the query to all shards, and combine the results you get back, because the matching records might be
scattered across all the shards. In [Figure 7-9](/en/ch7#fig_sharding_local_secondary), red cars appear in both shard
0 and shard 1.
This approach to querying a sharded database can make read queries on secondary indexes quite
expensive. Even if you query the shards in parallel, it is prone to tail latency amplification (see
[“Use of Response Time Metrics”](/en/ch2#sec_introduction_slo_sla)). It also limits the scalability of your application: adding more
shards lets you store more data, but it doesn’t increase your query throughput if every shard has to
process every query anyway.
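A minimal sketch of local secondary indexes follows, with two shards split by listing ID as in the example. The sample listings, the `Shard` class, and the function names are made up for illustration; a real database would query the shards in parallel rather than in a loop.

```python
from collections import defaultdict

class Shard:
    """One shard: its own records plus an index covering only those records."""

    def __init__(self):
        self.records = {}
        self.index = defaultdict(set)  # e.g. "color:red" -> set of listing IDs

    def put(self, car_id: int, record: dict) -> None:
        old = self.records.get(car_id)
        if old is not None:            # drop index entries of the old version
            for name, value in old.items():
                self.index[f"{name}:{value}"].discard(car_id)
        self.records[car_id] = record
        for name, value in record.items():
            self.index[f"{name}:{value}"].add(car_id)

    def search(self, term: str) -> list:
        return sorted(self.index.get(term, ()))

shards = [Shard(), Shard()]  # IDs 0 to 499 in shard 0, 500 to 999 in shard 1

def put(car_id: int, record: dict) -> None:
    """A write touches only the one shard that owns the listing ID."""
    shards[car_id // 500].put(car_id, record)

def search_all(term: str) -> list:
    """A read by indexed value must ask every shard and merge the results."""
    return sorted(car_id for shard in shards for car_id in shard.search(term))

put(191, {"color": "red", "make": "Honda"})
put(306, {"color": "red", "make": "Ford"})
put(768, {"color": "red", "make": "Volvo"})
put(515, {"color": "silver", "make": "Ford"})
print(search_all("color:red"))  # [191, 306, 768]
```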
Nevertheless, local secondary indexes are widely used [^31]: for example, MongoDB, Riak, Cassandra [^32], Elasticsearch [^33],
SolrCloud, and VoltDB [^34] all use local secondary indexes.
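The read and write paths described above can be sketched in a few lines. This is a minimal in-memory illustration, not how any of the listed databases implements it: the `Shard` class, the `shard_for` function (which assumes integer IDs and assigns them by modulo), and the `query_all` helper are all hypothetical names introduced here.

```python
from collections import defaultdict


class Shard:
    """One shard: its records plus a local index covering only those records."""

    def __init__(self):
        self.records = {}                    # primary key -> record (dict)
        self.local_index = defaultdict(set)  # (field, value) -> primary keys

    def put(self, key, record):
        # If the record already exists, drop its old index entries first.
        old = self.records.get(key)
        if old is not None:
            for field, value in old.items():
                self.local_index[(field, value)].discard(key)
        self.records[key] = record
        for field, value in record.items():
            self.local_index[(field, value)].add(key)

    def find(self, field, value):
        return sorted(self.local_index.get((field, value), ()))


def shard_for(key, num_shards):
    # Assumption for this sketch: integer IDs, assigned to shards by modulo.
    return key % num_shards


def write(shards, key, record):
    # A write touches exactly one shard, including its index update.
    shards[shard_for(key, len(shards))].put(key, record)


def query_all(shards, field, value):
    # A read by secondary index must ask every shard and merge the answers.
    results = []
    for shard in shards:
        results.extend(shard.find(field, value))
    return sorted(results)
```

With two shards and a handful of cars, `query_all(shards, "color", "red")` has to visit both shards, whereas each `write` call modifies only one. In a real system the loop in `query_all` would issue the requests in parallel, but the slowest shard would still determine the response time.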
### Global Secondary Indexes {#id167}
Rather than each shard having its own, local secondary index, we can construct a *global index* that
covers data in all shards. However, we can’t just store that index on one node, since it would
likely become a bottleneck and defeat the purpose of sharding. A global index must also be sharded,
but it can be sharded differently from the primary key index.
[Figure 7-10](/en/ch7#fig_sharding_global_secondary) illustrates what this could look like: the IDs of red cars from
all shards appear under `color:red` in the index, but the index is sharded so that colors starting
with the letters *a* to *r* appear in shard 0 and colors starting with *s* to *z* appear in shard 1.
The index on the make of car is partitioned similarly (with the shard boundary being between *f* and *h*).
{{< figure src="/fig/ddia_0710.png" id="fig_sharding_global_secondary" caption="Figure 7-10. A global secondary index reflects data from all shards, and is itself sharded by the indexed value." class="w-full my-4" >}}
This kind of index is also called *term-partitioned* [^30]:
recall from [“Full-Text Search”](/en/ch4#sec_storage_full_text) that in full-text search, a *term* is a keyword in a text that
you can search for. Here we generalise it to mean any value that you can search for in the secondary index.
The global index uses the term as partition key, so that when you’re looking for a particular term
or value, you can figure out which shard you need to query. As before, a shard can contain a
contiguous range of terms (as in [Figure 7-10](/en/ch7#fig_sharding_global_secondary)), or you can assign terms to
shards based on a hash of the term.
Global indexes have the advantage that a query with a single condition (such as *color = red*) only
needs to read from a single shard to fetch the postings list. However, if you want to fetch records
and not just IDs, you still have to read from all the shards that are responsible for those IDs.
If you have multiple search conditions or terms (e.g., searching for cars of a certain color and a
certain make, or searching for multiple words occurring in the same text), it’s likely that those
terms will be assigned to different shards. To compute the logical AND of the two conditions, the
system needs to find all the IDs that occur in both of the postings lists. That’s no problem if the
postings lists are short, but if they are long, it can be slow to send them over the network to
compute their intersection [^30].
Another challenge with global secondary indexes is that writes are more complicated than with local
indexes, because writing a single record might affect multiple shards of the index (every term in
the document might be on a different shard). This makes it harder to keep the secondary index in
sync with the underlying data. One option is to use a distributed transaction to atomically update
the shards storing the primary record and its secondary indexes (see [Chapter 8](/en/ch8#ch_transactions)).
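As a rough sketch of the trade-off, the following code models a term-partitioned index in memory. The names `GlobalIndex` and `index_shard_for` are hypothetical, and the sketch simplifies the figure by using the same boundary (between *r* and *s*) for every field. It also updates the index shards one after another with no transaction, which is exactly the situation in which index and data could drift apart if a write failed halfway.

```python
from collections import defaultdict

NUM_INDEX_SHARDS = 2


def index_shard_for(value):
    # Range partitioning on the first letter of the indexed value:
    # a-r -> index shard 0, s-z -> index shard 1 (simplified boundary).
    return 0 if value[0].lower() <= "r" else 1


class GlobalIndex:
    """A secondary index sharded by term rather than by primary key."""

    def __init__(self):
        self.shards = [defaultdict(set) for _ in range(NUM_INDEX_SHARDS)]

    def add(self, record_id, record):
        # One record may touch several index shards; return which ones.
        touched = set()
        for field, value in record.items():
            shard = index_shard_for(value)
            self.shards[shard][f"{field}:{value}"].add(record_id)
            touched.add(shard)
        return touched

    def postings(self, field, value):
        # A single-term lookup reads from exactly one index shard.
        shard = index_shard_for(value)
        return set(self.shards[shard].get(f"{field}:{value}", ()))

    def lookup_and(self, *conditions):
        # Logical AND: fetch each postings list (possibly from different
        # shards) and intersect them on the coordinator.
        lists = [self.postings(field, value) for field, value in conditions]
        return sorted(set.intersection(*lists)) if lists else []
```

Adding a silver Ford returns `{0, 1}` from `add`, because `color:silver` and `make:Ford` live on different index shards. The cost of `lookup_and` is dominated by shipping the postings lists to one place, which is why long lists make multi-term queries slow.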
Global secondary indexes are used by CockroachDB, TiDB, and YugabyteDB; DynamoDB supports both local
and global secondary indexes. In the case of DynamoDB, writes are asynchronously reflected in global
indexes, so reads from a global index may be stale (similarly to replication lag, as in [“Problems with Replication Lag”](/en/ch6#sec_replication_lag)).
Nevertheless, global indexes are useful if read throughput is higher than write throughput, and if
the postings lists are not too long.
## Summary {#summary}
In this chapter we explored different ways of sharding a large dataset into smaller subsets.
Sharding is necessary when you have so much data that storing and processing it on a single machine
is no longer feasible.
The goal of sharding is to spread the data and query load evenly across multiple machines, avoiding
hot spots (nodes with disproportionately high load). This requires choosing a sharding scheme that
is appropriate to your data, and rebalancing the shards when nodes are added to or removed from the cluster.
We discussed two main approaches to sharding:
* *Key range sharding*, where keys are sorted, and a shard owns all the keys from some minimum up to
some maximum. Sorting has the advantage that efficient range queries are possible, but there is a
risk of hot spots if the application often accesses keys that are close together in the sorted
order.
In this approach, shards are typically rebalanced by splitting the range into two subranges when a
shard gets too big.
* *Hash sharding*, where a hash function is applied to each key, and a shard owns a range of hash
values (or another consistent hashing algorithm may be used to map hashes to shards). This method
destroys the ordering of keys, making range queries inefficient, but it may distribute load more
evenly.
When sharding by hash, it is common to create a fixed number of shards in advance, to assign several
shards to each node, and to move entire shards from one node to another when nodes are added or
removed. As with key ranges, splitting shards is also possible.
It is common to use the first part of the key as the partition key (i.e., to identify the shard),
and to sort records within that shard by the rest of the key. That way you can still have efficient
range queries among the records with the same partition key.
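The compound-key idea can be illustrated with a small sketch. It assumes keys of the form `(partition_key, sort_key)`, hashes only the partition key to choose a shard, and keeps each shard's keys in sorted order. The class `ShardedStore`, the constant `NUM_SHARDS`, and the use of SHA-256 as the hash function are illustrative choices, not taken from any particular database.

```python
import bisect
import hashlib

NUM_SHARDS = 4


def shard_for(partition_key):
    # Hash only the partition key, so all records sharing it land together.
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


class ShardedStore:
    def __init__(self):
        # Per shard: a sorted list of (partition_key, sort_key) and the values.
        self.keys = [[] for _ in range(NUM_SHARDS)]
        self.values = [{} for _ in range(NUM_SHARDS)]

    def put(self, partition_key, sort_key, value):
        shard = shard_for(partition_key)
        key = (partition_key, sort_key)
        if key not in self.values[shard]:
            bisect.insort(self.keys[shard], key)  # keep keys sorted
        self.values[shard][key] = value

    def range_query(self, partition_key, low, high):
        # Only one shard is consulted, and within it only a contiguous slice.
        shard = shard_for(partition_key)
        keys = self.keys[shard]
        start = bisect.bisect_left(keys, (partition_key, low))
        end = bisect.bisect_right(keys, (partition_key, high))
        return [(k[1], self.values[shard][k]) for k in keys[start:end]]
```

A range query across different partition keys would still have to contact every shard, since hashing scatters them; the efficient case is limited to ranges within one partition key.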
We also discussed the interaction between sharding and secondary indexes. A secondary index also
needs to be sharded, and there are two methods:
* *Local secondary indexes*, where the secondary indexes are stored
in the same shard as the primary key and value. This means that only a single shard needs to be
updated on write, but a lookup of the secondary index requires reading from all shards.
* *Global secondary indexes*, which are sharded separately based on
the indexed values. An entry in the secondary index may refer to records from all shards of the
primary key. When a record is written, several secondary index shards may need to be updated;
however, a read of the postings list can be served from a single shard (fetching the actual
records still requires reading from multiple shards).
Finally, we discussed techniques for routing queries to the appropriate shard, and how a
coordination service is often used to keep track of the assignment of shards to nodes.
By design, every shard operates mostly independently—that’s what allows a sharded database to scale
to multiple machines. However, operations that need to write to several shards can be problematic:
for example, what happens if the write to one shard succeeds, but another fails? We will address
that question in the following chapters.
### References
[^1]: Claire Giordano. [Understanding partitioning and sharding in Postgres and Citus](https://www.citusdata.com/blog/2023/08/04/understanding-partitioning-and-sharding-in-postgres-and-citus/). *citusdata.com*, August 2023. Archived at [perma.cc/8BTK-8959](https://perma.cc/8BTK-8959)
[^2]: Brandur Leach. [Partitioning in Postgres, 2022 edition](https://brandur.org/fragments/postgres-partitioning-2022). *brandur.org*, October 2022. Archived at [perma.cc/Z5LE-6AKX](https://perma.cc/Z5LE-6AKX)
[^3]: Raph Koster. [Database “sharding” came from UO?](https://www.raphkoster.com/2009/01/08/database-sharding-came-from-uo/) *raphkoster.com*, January 2009. Archived at [perma.cc/4N9U-5KYF](https://perma.cc/4N9U-5KYF)
[^4]: Garrett Fidalgo. [Herding elephants: Lessons learned from sharding Postgres at Notion](https://www.notion.com/blog/sharding-postgres-at-notion). *notion.com*, October 2021. Archived at [perma.cc/5J5V-W2VX](https://perma.cc/5J5V-W2VX)
[^5]: Ulrich Drepper. [What Every Programmer Should Know About Memory](https://www.akkadia.org/drepper/cpumemory.pdf). *akkadia.org*, November 2007. Archived at [perma.cc/NU6Q-DRXZ](https://perma.cc/NU6Q-DRXZ)
[^6]: Jingyu Zhou, Meng Xu, Alexander Shraer, Bala Namasivayam, Alex Miller, Evan Tschannen, Steve Atherton, Andrew J. Beamon, Rusty Sears, John Leach, Dave Rosenthal, Xin Dong, Will Wilson, Ben Collins, David Scherer, Alec Grieser, Young Liu, Alvin Moore, Bhaskar Muppana, Xiaoge Su, and Vishesh Yadav. [FoundationDB: A Distributed Unbundled Transactional Key Value Store](https://www.foundationdb.org/files/fdb-paper.pdf). At *ACM International Conference on Management of Data* (SIGMOD), June 2021. [doi:10.1145/3448016.3457559](https://doi.org/10.1145/3448016.3457559)
[^7]: Marco Slot. [Citus 12: Schema-based sharding for PostgreSQL](https://www.citusdata.com/blog/2023/07/18/citus-12-schema-based-sharding-for-postgres/). *citusdata.com*, July 2023. Archived at [perma.cc/R874-EC9W](https://perma.cc/R874-EC9W)
[^8]: Robisson Oliveira. [Reducing the Scope of Impact with Cell-Based Architecture](https://docs.aws.amazon.com/pdfs/wellarchitected/latest/reducing-scope-of-impact-with-cell-based-architecture/reducing-scope-of-impact-with-cell-based-architecture.pdf). AWS Well-Architected white paper, Amazon Web Services, September 2023. Archived at [perma.cc/4KWW-47NR](https://perma.cc/4KWW-47NR)
[^9]: Gwen Shapira. [Things DBs Don’t Do - But Should](https://www.thenile.dev/blog/things-dbs-dont-do). *thenile.dev*, February 2023. Archived at [perma.cc/C3J4-JSFW](https://perma.cc/C3J4-JSFW)
[^10]: Malte Schwarzkopf, Eddie Kohler, M. Frans Kaashoek, and Robert Morris. [Position: GDPR Compliance by Construction](https://cs.brown.edu/people/malte/pub/papers/2019-poly-gdpr.pdf). At *Towards Polystores that manage multiple Databases, Privacy, Security and/or Policy Issues for Heterogenous Data* (Poly), August 2019. [doi:10.1007/978-3-030-33752-0\_3](https://doi.org/10.1007/978-3-030-33752-0_3)
[^11]: Gwen Shapira. [Introducing pg\_karnak: Transactional schema migration across tenant databases](https://www.thenile.dev/blog/distributed-ddl). *thenile.dev*, November 2024. Archived at [perma.cc/R5RD-8HR9](https://perma.cc/R5RD-8HR9)
[^12]: Arka Ganguli, Guido Iaquinti, Maggie Zhou, and Rafael Chacón. [Scaling Datastores at Slack with Vitess](https://slack.engineering/scaling-datastores-at-slack-with-vitess/). *slack.engineering*, December 2020. Archived at [perma.cc/UW8F-ALJK](https://perma.cc/UW8F-ALJK)
[^13]: Ikai Lan. [App Engine Datastore Tip: Monotonically Increasing Values Are Bad](https://ikaisays.com/2011/01/25/app-engine-datastore-tip-monotonically-increasing-values-are-bad/). *ikaisays.com*, January 2011. Archived at [perma.cc/BPX8-RPJB](https://perma.cc/BPX8-RPJB)
[^14]: Enis Soztutar. [Apache HBase Region Splitting and Merging](https://www.cloudera.com/blog/technical/apache-hbase-region-splitting-and-merging.html). *cloudera.com*, February 2013. Archived at [perma.cc/S9HS-2X2C](https://perma.cc/S9HS-2X2C)
[^15]: Eric Evans. [Rethinking Topology in Cassandra](https://www.youtube.com/watch?v=Qz6ElTdYjjU). At *Cassandra Summit*, June 2013. Archived at [perma.cc/2DKM-F438](https://perma.cc/2DKM-F438)
[^16]: Martin Kleppmann. [Java’s hashCode Is Not Safe for Distributed Systems](https://martin.kleppmann.com/2012/06/18/java-hashcode-unsafe-for-distributed-systems.html). *martin.kleppmann.com*, June 2012. Archived at [perma.cc/LK5U-VZSN](https://perma.cc/LK5U-VZSN)
[^17]: Mostafa Elhemali, Niall Gallagher, Nicholas Gordon, Joseph Idziorek, Richard Krog, Colin Lazier, Erben Mo, Akhilesh Mritunjai, Somu Perianayagam, Tim Rath, Swami Sivasubramanian, James Christopher Sorenson III, Sroaj Sosothikul, Doug Terry, and Akshat Vig. [Amazon DynamoDB: A Scalable, Predictably Performant, and Fully Managed NoSQL Database Service](https://www.usenix.org/conference/atc22/presentation/elhemali). At *USENIX Annual Technical Conference* (ATC), July 2022.
[^18]: Brandon Williams. [Virtual Nodes in Cassandra 1.2](https://www.datastax.com/blog/virtual-nodes-cassandra-12). *datastax.com*, December 2012. Archived at [perma.cc/N385-EQXV](https://perma.cc/N385-EQXV)
[^19]: Branimir Lambov. [New Token Allocation Algorithm in Cassandra 3.0](https://www.datastax.com/blog/new-token-allocation-algorithm-cassandra-30). *datastax.com*, January 2016. Archived at [perma.cc/2BG7-LDWY](https://perma.cc/2BG7-LDWY)
[^20]: David Karger, Eric Lehman, Tom Leighton, Rina Panigrahy, Matthew Levine, and Daniel Lewin. [Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web](https://people.csail.mit.edu/karger/Papers/web.pdf). At *29th Annual ACM Symposium on Theory of Computing* (STOC), May 1997. [doi:10.1145/258533.258660](https://doi.org/10.1145/258533.258660)
[^21]: Damian Gryski. [Consistent Hashing: Algorithmic Tradeoffs](https://dgryski.medium.com/consistent-hashing-algorithmic-tradeoffs-ef6b8e2fcae8). *dgryski.medium.com*, April 2018. Archived at [perma.cc/B2WF-TYQ8](https://perma.cc/B2WF-TYQ8)
[^22]: David G. Thaler and Chinya V. Ravishankar. [Using name-based mappings to increase hit rates](https://www.cs.kent.edu/~javed/DL/web/p1-thaler.pdf). *IEEE/ACM Transactions on Networking*, volume 6, issue 1, pages 1–14, February 1998. [doi:10.1109/90.663936](https://doi.org/10.1109/90.663936)
[^23]: John Lamping and Eric Veach. [A Fast, Minimal Memory, Consistent Hash Algorithm](https://arxiv.org/abs/1406.2294). *arxiv.org*, June 2014.
[^24]: Samuel Axon. [3% of Twitter’s Servers Dedicated to Justin Bieber](https://mashable.com/archive/justin-bieber-twitter). *mashable.com*, September 2010. Archived at [perma.cc/F35N-CGVX](https://perma.cc/F35N-CGVX)
[^25]: Gerald Guo and Thawan Kooburat. [Scaling services with Shard Manager](https://engineering.fb.com/2020/08/24/production-engineering/scaling-services-with-shard-manager/). *engineering.fb.com*, August 2020. Archived at [perma.cc/EFS3-XQYT](https://perma.cc/EFS3-XQYT)
[^26]: Sangmin Lee, Zhenhua Guo, Omer Sunercan, Jun Ying, Thawan Kooburat, Suryadeep Biswal, Jun Chen, Kun Huang, Yatpang Cheung, Yiding Zhou, Kaushik Veeraraghavan, Biren Damani, Pol Mauri Ruiz, Vikas Mehta, and Chunqiang Tang. [Shard Manager: A Generic Shard Management Framework for Geo-distributed Applications](https://dl.acm.org/doi/pdf/10.1145/3477132.3483546). *28th ACM SIGOPS Symposium on Operating Systems Principles* (SOSP), pages 553–569, October 2021. [doi:10.1145/3477132.3483546](https://doi.org/10.1145/3477132.3483546)
[^27]: Scott Lystig Fritchie. [A Critique of Resizable Hash Tables: Riak Core & Random Slicing](https://www.infoq.com/articles/dynamo-riak-random-slicing/). *infoq.com*, August 2018. Archived at [perma.cc/RPX7-7BLN](https://perma.cc/RPX7-7BLN)
[^28]: Andy Warfield. [Building and operating a pretty big storage system called S3](https://www.allthingsdistributed.com/2023/07/building-and-operating-a-pretty-big-storage-system.html). *allthingsdistributed.com*, July 2023. Archived at [perma.cc/6S7P-GLM4](https://perma.cc/6S7P-GLM4)
[^29]: Rich Houlihan. [DynamoDB adaptive capacity: smooth performance for chaotic workloads (DAT327)](https://www.youtube.com/watch?v=kMY0_m29YzU). At *AWS re:Invent*, November 2017.
[^30]: Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. [*Introduction to Information Retrieval*](https://nlp.stanford.edu/IR-book/). Cambridge University Press, 2008. ISBN: 978-0-521-86571-5, available online at [nlp.stanford.edu/IR-book](https://nlp.stanford.edu/IR-book/)
[^31]: Michael Busch, Krishna Gade, Brian Larson, Patrick Lok, Samuel Luckenbill, and Jimmy Lin. [Earlybird: Real-Time Search at Twitter](https://cs.uwaterloo.ca/~jimmylin/publications/Busch_etal_ICDE2012.pdf). At *28th IEEE International Conference on Data Engineering* (ICDE), April 2012. [doi:10.1109/ICDE.2012.149](https://doi.org/10.1109/ICDE.2012.149)
[^32]: Nadav Har’El. [Indexing in Cassandra 3](https://github.com/scylladb/scylladb/wiki/Indexing-in-Cassandra-3). *github.com*, April 2017. Archived at [perma.cc/3ENV-8T9P](https://perma.cc/3ENV-8T9P)
[^33]: Zachary Tong. [Customizing Your Document Routing](https://www.elastic.co/blog/customizing-your-document-routing/). *elastic.co*, June 2013. Archived at [perma.cc/97VM-MREN](https://perma.cc/97VM-MREN)
[^34]: Andrew Pavlo. [H-Store Frequently Asked Questions](https://hstore.cs.brown.edu/documentation/faq/). *hstore.cs.brown.edu*, October 2013. Archived at [perma.cc/X3ZA-DW6Z](https://perma.cc/X3ZA-DW6Z)
================================================
FILE: content/en/ch8.md
================================================
---
title: "8. Transactions"
weight: 208
breadcrumbs: false
---

> *Some authors have claimed that general two-phase commit is too expensive to support, because of the
> performance or availability problems that it brings. We believe it is better to have application
> programmers deal with performance problems due to overuse of transactions as bottlenecks arise,
> rather than always coding around the lack of transactions.*
>
> James Corbett et al., *Spanner: Google’s Globally-Distributed Database* (2012)
In the harsh reality of data systems, many things can go wrong:
* The database software or hardware may fail at any time (including in the middle of a write
operation).
* The application may crash at any time (including halfway through a series of operations).
* Interruptions in the network can unexpectedly cut off the application from the database, or one
database node from another.
* Several clients may write to the database at the same time, overwriting each other’s changes.
* A client may read data that doesn’t make sense because it has only partially been updated.
* Race conditions between clients can cause surprising bugs.
In order to be reliable, a system has to deal with these faults and ensure that they don’t cause
catastrophic failure of the entire system. However, implementing fault-tolerance mechanisms is a lot
of work. It requires a lot of careful thinking about all the things that can go wrong, and a lot of
testing to ensure that the solution actually works.
For decades, *transactions* have been the mechanism of choice for simplifying these issues. A
transaction is a way for an application to group several reads and writes together into a logical
unit. Conceptually, all the reads and writes in a transaction are executed as one operation: either
the entire transaction succeeds (*commit*) or it fails (*abort*, *rollback*). If it fails, the
application can safely retry. With transactions, error handling becomes much simpler for an
application, because it doesn’t need to worry about partial failure—i.e., the case where some
operations succeed and some fail (for whatever reason).
If you have spent years working with transactions, they may seem obvious, but we shouldn’t take them
for granted. Transactions are not a law of nature; they were created with a purpose, namely to
*simplify the programming model* for applications accessing a database. By using transactions, the
application is free to ignore certain potential error scenarios and concurrency issues, because the
database takes care of them instead (we call these *safety guarantees*).
Not every application needs transactions, and sometimes there are advantages to weakening
transactional guarantees or abandoning them entirely (for example, to achieve higher performance or
higher availability). Some safety properties can be achieved without transactions. On the other
hand, transactions can prevent a lot of grief: for example, the technical cause behind the Post
Office Horizon scandal (see [“How Important Is Reliability?”](/en/ch2#sidebar_reliability_importance)) was probably a lack of ACID
transactions in the underlying accounting system [^1].
How do you figure out whether you need transactions? In order to answer that question, we first need
to understand exactly what safety guarantees transactions can provide, and what costs are associated
with them. Although transactions seem straightforward at first glance, there are actually many
subtle but important details that come into play.
In this chapter, we will examine many examples of things that can go wrong, and explore the
algorithms that databases use to guard against those issues. We will go especially deep in the area
of concurrency control, discussing various kinds of race conditions that can occur and how
databases implement isolation levels such as *read committed*, *snapshot isolation*, and
*serializability*.
Concurrency control is relevant for both single-node and distributed databases. Later in this
chapter, in [“Distributed Transactions”](/en/ch8#sec_transactions_distributed), we will examine the *two-phase commit* protocol and
the challenge of achieving atomicity in a distributed transaction.
## What Exactly Is a Transaction? {#sec_transactions_overview}
Almost all relational databases today, and some nonrelational databases, support transactions. Most
of them follow the style that was introduced in 1975 by IBM System R, the first SQL database [^2] [^3] [^4].
Although some implementation details have changed, the general idea has remained virtually the same
for 50 years: the transaction support in MySQL, PostgreSQL, Oracle, SQL Server, etc., is uncannily
similar to that of System R.
In the late 2000s, nonrelational (NoSQL) databases started gaining popularity. They aimed to
improve upon the relational status quo by offering a choice of new data models (see
[Chapter 3](/en/ch3#ch_datamodels)), and by including replication ([Chapter 6](/en/ch6#ch_replication)) and sharding
([Chapter 7](/en/ch7#ch_sharding)) by default. Transactions were the main casualty of this movement: many of this
generation of databases abandoned transactions entirely, or redefined the word to describe a
much weaker set of guarantees than had previously been understood.
The hype around NoSQL distributed databases led to a popular belief that transactions were
fundamentally unscalable, and that any large-scale system would have to abandon transactions in
order to maintain good performance and high availability. More recently, that belief has turned out
to be wrong. So-called “NewSQL” databases such as CockroachDB [^5], TiDB [^6], Spanner [^7], FoundationDB [^8],
and YugabyteDB have shown that transactional systems can scale to large data volumes and high
throughput. These systems combine sharding with consensus protocols ([Chapter 10](/en/ch10#ch_consistency)) to provide
strong ACID guarantees at scale.
However, that doesn’t mean that every system must be transactional either: like every other
technical design choice, transactions have advantages and limitations. In order to understand those
trade-offs, let’s go into the details of the guarantees that transactions can provide—both in normal
operation and in various extreme (but realistic) circumstances.
### The Meaning of ACID {#sec_transactions_acid}
The safety guarantees provided by transactions are often described by the well-known acronym *ACID*,
which stands for *Atomicity*, *Consistency*, *Isolation*, and *Durability*. It was coined in 1983 by
Theo Härder and Andreas Reuter [^9] in an effort to establish precise terminology for fault-tolerance mechanisms in databases.
However, in practice, one database’s implementation of ACID does not equal another’s implementation.
For example, as we shall see, there is a lot of ambiguity around the meaning of *isolation* [^10].
The high-level idea is sound, but the devil is in the details. Today, when a system claims to be
“ACID compliant,” it’s unclear what guarantees you can actually expect. ACID has unfortunately
become mostly a marketing term.
(Systems that do not meet the ACID criteria are sometimes called *BASE*, which stands for
*Basically Available*, *Soft state*, and *Eventual consistency* [^11].
This is even more vague than the definition of ACID. It seems that the only sensible definition of
BASE is “not ACID”; i.e., it can mean almost anything you want.)
Let’s dig into the definitions of atomicity, consistency, isolation, and durability, as this will let
us refine our idea of transactions.
#### Atomicity {#sec_transactions_acid_atomicity}
In general, *atomic* refers to something that cannot be broken down into smaller parts. The word
means similar but subtly different things in different branches of computing. For example, in
multi-threaded programming, if one thread executes an atomic operation, that means there is no way
that another thread could see the half-finished result of the operation. The system can only be in
the state it was before the operation or after the operation, not something in between.
By contrast, in the context of ACID, atomicity is *not* about concurrency. It does not describe
what happens if several processes try to access the same data at the same time, because that is
covered under the letter *I*, for *isolation* (see [“Isolation”](/en/ch8#sec_transactions_acid_isolation)).
Rather, ACID atomicity describes what happens if a client wants to make several writes, but a fault
occurs after some of the writes have been processed—for example, a process crashes, a network
connection is interrupted, a disk becomes full, or some integrity constraint is violated.
If the writes are grouped together into an atomic transaction, and the transaction cannot be
completed (*committed*) due to a fault, then the transaction is *aborted* and the database must
discard or undo any writes it has made so far in that transaction.
Without atomicity, if an error occurs partway through making multiple changes, it’s difficult to
know which changes have taken effect and which haven’t. The application could try again, but that
risks making the same change twice, leading to duplicate or incorrect data. Atomicity simplifies
this problem: if a transaction was aborted, the application can be sure that it didn’t change
anything, so it can safely be retried.
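To make this concrete, here is a small sketch using Python's built-in `sqlite3` module. The `accounts` table and the `transfer` function are invented for illustration. The transfer makes two writes, and a `CHECK` constraint causes the second one to fail when the source account has insufficient funds; because both writes are in one transaction, the first write is undone as well.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    " id INTEGER PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES (1, 100), (2, 50)")
conn.commit()


def balances(conn):
    return conn.execute("SELECT id, balance FROM accounts ORDER BY id").fetchall()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; either both writes apply or neither."""
    try:
        # Used as a context manager, the connection commits if the block
        # succeeds and rolls back if an exception escapes from it.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            # This write violates the CHECK constraint if funds are short.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

Calling `transfer(conn, 1, 2, 500)` returns `False` and leaves both balances unchanged, even though the credit to account 2 had already been applied inside the transaction. The caller can then retry or report the error without having to work out which half of the transfer took effect.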
The ability to abort a transaction on error and have all writes from that transaction discarded is
the defining feature of ACID atomicity. Perhaps *abortability* would have been a better term than
*atomicity*, but we will stick with *atomicity* since that’s the usual word.
#### Consistency {#sec_transactions_acid_consistency}
The word *consistency* is terribly overloaded:
* In [Chapter 6](/en/ch6#ch_replication) we discussed *replica consistency* and the issue of *eventual consistency*
that arises in asynchronously replicated systems (see [“Problems with Replication Lag”](/en/ch6#sec_replication_lag)).
* A *consistent snapshot* of a database, e.g. for a backup, is a snapshot of the entire database as
it existed at one moment in time. More precisely, it is consistent with the happens-before
relation (see [“The “happens-before” relation and concurrency”](/en/ch6#sec_replication_happens_before)): that is, if the snapshot contains a value that
was written at a particular time, then it also reflects all the writes that happened before that
value was written.
* *Consistent hashing* is an approach to sharding that some systems use for rebalancing (see
[“Consistent hashing”](/en/ch7#sec_sharding_consistent_hashing)).
* In the CAP theorem (see [Chapter 10](/en/ch10#ch_consistency)), the word *consistency* is used to mean
*linearizability* (see [“Linearizability”](/en/ch10#sec_consistency_linearizability)).
* In the context of ACID, *consistency* refers to an application-specific notion of the database
being in a “good state.”
It’s unfortunate that the same word is used with at least five different meanings.
The idea of ACID consistency is that you have certain statements about your data (*invariants*) that
must always be true—for example, in an accounting system, credits and debits across all accounts
must always be balanced. If a transaction starts with a database that is valid according to these
invariants, and any writes during the transaction preserve the validity, then you can be sure that
the invariants are always satisfied. (An invariant may be temporarily violated during transaction
execution, but it should be satisfied again at transaction commit.)
If you want the database to enforce your invariants, you need to declare them as *constraints* as
part of the schema. For example, foreign key constraints, uniqueness constraints, or check
constraints (which restrict the values that can appear in an individual row) are often used to
model specific types of invariants. More complex consistency requirements can sometimes be modeled
using triggers or materialized views [^12].
However, complex invariants can be difficult or impossible to model using the constraints that
databases usually provide. In that case, it’s the application’s responsibility to define its
transactions correctly so that they preserve consistency. If you write bad data that violates your
invariants, but you haven’t declared those invariants, the database can’t stop you. As such, the C
in ACID often depends on how the application uses the database, and it’s not a property of the
database alone.
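The difference between declared and undeclared invariants can be seen in a short sketch, again using Python's `sqlite3` module. The `users` table and the helper `try_insert` are hypothetical. The schema declares that usernames are unique and ages are non-negative; it says nothing about an upper bound on age, so the database has no way to reject an implausible value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE users (
        id       INTEGER PRIMARY KEY,
        username TEXT NOT NULL UNIQUE,      -- declared invariant
        age      INTEGER CHECK (age >= 0)   -- declared invariant
    )
    """
)


def try_insert(username, age):
    """Attempt one insert in its own transaction; report what happened."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO users (username, age) VALUES (?, ?)",
                (username, age),
            )
        return "ok"
    except sqlite3.IntegrityError:
        return "rejected"
```

Here `try_insert("alice", 30)` succeeds, a second `"alice"` and an age of `-5` are both rejected, but `try_insert("carol", 900)` is accepted: the invariant "ages are plausible" exists only in the application's head, so enforcing it is the application's job.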
#### Isolation {#sec_transactions_acid_isolation}
Most databases are accessed by several clients at the same time. That is no problem if they are
reading and writing different parts of the database, but if they are accessing the same database
records, you can run into concurrency problems (race conditions).
[Figure 8-1](/en/ch8#fig_transactions_increment) is a simple example of this kind of problem. Say you have two clients
simultaneously incrementing a counter that is stored in a database. Each client needs to read the
current value, add 1, and write the new value back (assuming there is no increment operation built
into the database). In [Figure 8-1](/en/ch8#fig_transactions_increment) the counter should have increased from 42 to
44, because two increments happened, but it actually only went to 43 because of the race condition.
{{< figure src="/fig/ddia_0801.png" id="fig_transactions_increment" caption="Figure 8-1. A race condition between two clients concurrently incrementing a counter." class="w-full my-4" >}}
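The interleaving in the figure can be reproduced deterministically with a small simulation. This is only a model of the race, not database code: the dictionary `store` stands in for the database, each client is a generator that pauses between its read and its write, and the function names `increment` and `run` are made up for this sketch.

```python
def increment(store):
    """One client's read-modify-write cycle, pausing between the two steps."""
    value = store["counter"]      # step 1: read the current value
    yield                         # another client may run at this point
    store["counter"] = value + 1  # step 2: write back the incremented value


def run(schedule, initial=42):
    """Execute client steps in the given order and return the final counter."""
    store = {"counter": initial}
    clients = {}
    for name in schedule:
        if name not in clients:
            clients[name] = increment(store)
        next(clients[name], None)  # advance this client by one step
    return store["counter"]


# Both clients read 42 before either writes: one increment is lost.
interleaved = run(["A", "B", "A", "B"])  # 43

# Each client finishes before the other starts: both increments count.
serial = run(["A", "A", "B", "B"])       # 44
```

The only difference between the two runs is the order of the steps, which is why bugs of this kind tend to appear only under load and are hard to reproduce in testing.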
*Isolation* in the sense of ACID means that concurrently executing transactions are isolated from
each other: they cannot step on each other’s toes. The classic database textbooks formalize
isolation as *serializability*, which means that each transaction can pretend that it is the only
transaction running on the entire database. The database ensures that when the transactions have
committed, the result is the same as if they had run *serially* (one after another), even though in
reality they may have run concurrently [^13].
However, serializability has a performance cost. In practice, many databases use forms of isolation
that are weaker than serializability: that is, they allow concurrent transactions to interfere with
each other in limited ways. Some popular databases, such as Oracle, don’t even implement it (Oracle
has an isolation level called “serializable,” but it actually implements *snapshot isolation*, which
is a weaker guarantee than serializability [^10] [^14]).
This means that some kinds of race conditions can still occur. We will explore snapshot isolation
and other forms of isolation in [“Weak Isolation Levels”](/en/ch8#sec_transactions_isolation_levels).
#### Durability {#durability}
The purpose of a database system is to provide a safe place where data can be stored without fear of
losing it. *Durability* is the promise that once a transaction has committed successfully, any data it
has written will not be forgotten, even if there is a hardware fault or the database crashes.
In a single-node database, durability typically means that the data has been written to nonvolatile
storage such as a hard drive or SSD. Regular file writes are usually buffered in memory before being
sent to the disk sometime later, which means they would be lost if there is a sudden power failure;
many databases therefore use the `fsync()` system call to ensure the data really has been written to
disk. Databases usually also have a write-ahead log or similar (see [“Making B-trees reliable”](/en/ch4#sec_storage_btree_wal)),
which allows them to recover in the event that a crash occurs part way through a write.
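At the level of a single file, the pattern looks roughly like the sketch below, which appends a record to a log file and waits for `fsync` before reporting success. The function name `durable_append` is hypothetical. The sketch deliberately ignores several things a real storage engine must handle, such as syncing the containing directory after creating a file, detecting torn or partial records on recovery, and dealing with errors returned by `fsync` itself.

```python
import os
import tempfile


def durable_append(path, data: bytes):
    """Append data to a file and return only after fsync has completed."""
    with open(path, "ab") as f:
        f.write(data)         # lands in the process's userspace buffer
        f.flush()             # hand the bytes to the operating system
        os.fsync(f.fileno())  # ask the OS to push them to the storage device


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as directory:
        log_path = os.path.join(directory, "wal.log")
        durable_append(log_path, b"txn 1 committed\n")
        durable_append(log_path, b"txn 2 committed\n")
        with open(log_path, "rb") as f:
            print(f.read().decode())
```

Without the `flush` and `fsync` calls, the write would typically return sooner, but a power failure shortly afterwards could lose data that the application believed was safely stored.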
In a replicated database, durability may mean that the data has been successfully copied to some
number of nodes. In order to provide a durability guarantee, a database must wait until these writes
or replications are complete before reporting a transaction as successfully committed. However,
as discussed in [“Reliability and Fault Tolerance”](/en/ch2#sec_introduction_reliability), perfect durability does not exist: if all your
hard disks and all your backups are destroyed at the same time, there’s obviously nothing your
database can do to save you.
--------
> [!TIP] REPLICATION AND DURABILITY
Historically, durability meant writing to an archive tape. Then it was understood as writing to a disk
or SSD. More recently, it has been adapted to mean replication. Which implementation is better?
The truth is, nothing is perfect:
* If you write to disk and the machine dies, even though your data isn’t lost, it is inaccessible
until you either fix the machine or transfer the disk to another machine. Replicated systems can
remain available.
* A correlated fault—a power outage or a bug that crashes every node on a particular input—can
knock out all replicas at once (see [“Reliability and Fault Tolerance”](/en/ch2#sec_introduction_reliability)), losing any data that is
only in memory. Writing to disk is therefore still relevant for replicated databases.
* In an asynchronously replicated system, recent writes may be lost when the leader becomes
unavailable (see [“Handling Node Outages”](/en/ch6#sec_replication_failover)).
* When the power is suddenly cut, SSDs in particular have been shown to sometimes violate the
guarantees they are supposed to provide: even `fsync` isn’t guaranteed to work correctly [^15].
Disk firmware can have bugs, just like any other kind of software [^16] [^17],
e.g. causing drives to fail after exactly 32,768 hours of operation [^18].
And `fsync` is hard to use; even PostgreSQL used it incorrectly for over 20 years [^19] [^20] [^21].
* Subtle interactions between the storage engine and the filesystem implementation can lead to bugs
that are hard to track down, and may cause files on disk to be corrupted after a crash [^22] [^23].
Filesystem errors on one replica can sometimes spread to other replicas as well [^24].
* Data on disk can gradually become corrupted without this being detected [^25].
If data has been corrupted for some time, replicas and recent backups may also be corrupted. In
this case, you will need to try to restore the data from a historical backup.
* One study of SSDs found that between 30% and 80% of drives develop at least one bad block during
the first four years of operation, and only some of these can be corrected by the firmware [^26].
Magnetic hard drives have a lower rate of bad sectors, but a higher rate of complete failure than SSDs.
* When a worn-out SSD (that has gone through many write/erase cycles) is disconnected from power,
it can start losing data within a timescale of weeks to months, depending on the temperature [^27].
This is less of a problem for drives with lower wear levels [^28].
In practice, there is no one technique that can provide absolute guarantees. There are only various
risk-reduction techniques, including writing to disk, replicating to remote machines, and
backups—and they can and should be used together. As always, it’s wise to take any theoretical
“guarantees” with a healthy grain of salt.
--------
### Single-Object and Multi-Object Operations {#sec_transactions_multi_object}
To recap, in ACID, atomicity and isolation describe what the database should do if a client makes
several writes within the same transaction:
Atomicity
: If an error occurs halfway through a sequence of writes, the transaction should be aborted, and
the writes made up to that point should be discarded. In other words, the database saves you from
having to worry about partial failure, by giving an all-or-nothing guarantee.
Isolation
: Concurrently running transactions shouldn’t interfere with each other. For example, if one
transaction makes several writes, then another transaction should see either all or none of those
writes, but not some subset.
These definitions assume that you want to modify several objects (rows, documents, records) at once.
Such *multi-object transactions* are often needed if several pieces of data need to be kept in sync.
[Figure 8-2](/en/ch8#fig_transactions_read_uncommitted) shows an example from an email application. To display the
number of unread messages for a user, you could query something like:
```sql
SELECT COUNT(*) FROM emails WHERE recipient_id = 2 AND unread_flag = true;
```
{{< figure src="/fig/ddia_0802.png" id="fig_transactions_read_uncommitted" caption="Figure 8-2. Violating isolation: one transaction reads another transaction's uncommitted writes (a \"dirty read\")." class="w-full my-4" >}}
However, you might find this query to be too slow if there are many emails, and decide to store the
number of unread messages in a separate field (a kind of denormalization, which we discuss in
[“Normalization, Denormalization, and Joins”](/en/ch3#sec_datamodels_normalization)). Now, whenever a new message comes in, you have to increment the
unread counter as well, and whenever a message is marked as read, you also have to decrement the
unread counter.
In [Figure 8-2](/en/ch8#fig_transactions_read_uncommitted), user 2 experiences an anomaly: the mailbox listing shows
an unread message, but the counter shows zero unread messages because the counter increment has not
yet happened. (If an incorrect counter in an email application seems too insignificant, think of a
customer account balance instead of an unread counter, and a payment transaction instead of an
email.) Isolation would have prevented this issue by ensuring that user 2 sees either both the
inserted email and the updated counter, or neither, but not an inconsistent halfway point.
[Figure 8-3](/en/ch8#fig_transactions_atomicity) illustrates the need for atomicity: if an error occurs somewhere
over the course of the transaction, the contents of the mailbox and the unread counter might become out
of sync. In an atomic transaction, if the update to the counter fails, the transaction is aborted
and the inserted email is rolled back.
{{< figure src="/fig/ddia_0803.png" id="fig_transactions_atomicity" caption="Figure 8-3. Atomicity ensures that if an error occurs any prior writes from that transaction are undone, to avoid an inconsistent state." class="w-full my-4" >}}
Multi-object transactions require some way of determining which read and write operations belong to
the same transaction. In relational databases, that is typically done based on the client’s TCP
connection to the database server: on any particular connection, everything between a
`BEGIN TRANSACTION` and a `COMMIT` statement is considered to be part of the same transaction. If the TCP
connection is interrupted, the transaction must be aborted.
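To make the email example concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The `emails` columns come from the query above; the `mailboxes` table, its `unread` column, and the `fail_before_counter` switch are hypothetical names introduced only for this illustration.

```python
import sqlite3


def make_db():
    # isolation_level=None: Python issues no implicit BEGIN, so we control transactions.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE emails (id INTEGER PRIMARY KEY, recipient_id INTEGER,"
                 " body TEXT, unread_flag INTEGER)")
    conn.execute("CREATE TABLE mailboxes (recipient_id INTEGER PRIMARY KEY, unread INTEGER)")
    conn.execute("INSERT INTO mailboxes VALUES (2, 0)")
    return conn


def deliver_email(conn, recipient_id, body, fail_before_counter=False):
    """Insert the email and bump the unread counter as one atomic unit."""
    conn.execute("BEGIN")
    try:
        conn.execute("INSERT INTO emails (recipient_id, body, unread_flag) VALUES (?, ?, 1)",
                     (recipient_id, body))
        if fail_before_counter:
            raise RuntimeError("simulated error between the two writes")
        conn.execute("UPDATE mailboxes SET unread = unread + 1 WHERE recipient_id = ?",
                     (recipient_id,))
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")  # discards the inserted email as well
        raise


def unread_state(conn, recipient_id):
    """Return (number of unread emails, value of the denormalized counter)."""
    emails = conn.execute("SELECT COUNT(*) FROM emails WHERE recipient_id = ?"
                          " AND unread_flag = 1", (recipient_id,)).fetchall()[0][0]
    counter = conn.execute("SELECT unread FROM mailboxes WHERE recipient_id = ?",
                           (recipient_id,)).fetchall()[0][0]
    return emails, counter
```

If the simulated error fires, the rollback removes the inserted email too, so the mailbox contents and the counter stay in agreement.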
On the other hand, many nonrelational databases don’t have such a way of grouping operations
together. Even if there is a multi-object API (for example, a key-value store may have a *multi-put*
operation that updates several keys in one operation), that doesn’t necessarily mean it has
transaction semantics: the command may succeed for some keys and fail for others, leaving the
database in a partially updated state.
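The following toy sketch shows how a multi-put without transaction semantics can leave data partially updated. The `ShardedStore` class, its shard-assignment rule, and the way unavailability is simulated are all assumptions made for this example, not the behavior of any particular product.

```python
class ShardedStore:
    """Toy key-value store whose keys are spread over several shards."""

    def __init__(self, shard_count):
        self.shards = [{} for _ in range(shard_count)]
        self.unavailable = set()  # indexes of shards that are currently down

    def _shard_for(self, key):
        # Deterministic toy placement rule (hypothetical).
        return sum(key.encode()) % len(self.shards)

    def multi_put(self, items):
        """Write every key it can; return the keys whose write failed."""
        failed = []
        for key, value in items.items():
            index = self._shard_for(key)
            if index in self.unavailable:
                failed.append(key)  # earlier successful writes are NOT undone
            else:
                self.shards[index][key] = value
        return failed

    def get(self, key):
        return self.shards[self._shard_for(key)].get(key)
```

A caller that receives a non-empty list of failed keys has to decide for itself whether to retry them or to undo the writes that did succeed.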
#### Single-object writes {#sec_transactions_single_object}
Atomicity and isolation also apply when a single object is being changed. For example, imagine you
are writing a 20 KB JSON document to a database:
* If the network connection is interrupted after the first 10 KB have been sent, does the
database store that unparseable 10 KB fragment of JSON?
* If the power fails while the database is in the middle of overwriting the previous value on disk,
do you end up with the old and new values spliced together?
* If another client reads that document while the write is in progress, will it see a partially
updated value?
Those issues would be incredibly confusing, so storage engines almost universally aim to provide
atomicity and isolation on the level of a single object (such as a key-value pair) on one node.
Atomicity can be implemented using a log for crash recovery (see [“Making B-trees reliable”](/en/ch4#sec_storage_btree_wal)), and
isolation can be implemented using a lock on each object (allowing only one thread to access an
object at any one time).
Some databases also provide more complex atomic operations, such as an increment operation, which
removes the need for a read-modify-write cycle like that in [Figure 8-1](/en/ch8#fig_transactions_increment).
Equally popular is a *conditional write* operation, which allows a write to happen only if the value
has not been concurrently changed by someone else (see [“Conditional writes (compare-and-set)”](/en/ch8#sec_transactions_compare_and_set)).
It is analogous to a compare-and-set or compare-and-swap (CAS) operation in shared-memory concurrency.
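A minimal in-memory sketch of these two single-object operations might look as follows. The `Register` class is hypothetical, and it uses one lock per object, as suggested above, to make each operation appear indivisible to other threads.

```python
import threading


class Register:
    """A single object offering an atomic increment and a conditional write."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()  # one lock per object

    def get(self):
        with self._lock:
            return self._value

    def increment(self, delta=1):
        # The read-modify-write cycle happens entirely inside the lock.
        with self._lock:
            self._value += delta
            return self._value

    def compare_and_set(self, expected, new):
        # Write only if nobody has changed the value since the caller read it.
        with self._lock:
            if self._value != expected:
                return False
            self._value = new
            return True
```

A caller whose `compare_and_set` returns `False` would typically re-read the value and try again.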
--------
> [!NOTE]
> Strictly speaking, the term *atomic increment* uses the word *atomic* in the sense of multi-threaded
> programming. In the context of ACID, it should actually be called an *isolated* or *serializable*
> increment, but that’s not the usual term.
--------
These single-object operations are useful, as they can prevent lost updates when several clients try
to write to the same object concurrently (see [“Preventing Lost Updates”](/en/ch8#sec_transactions_lost_update)). However, they are
not transactions in the usual sense of the word. For example, the “lightweight transactions” feature
of Cassandra and ScyllaDB and Aerospike’s “strong consistency” mode offer linearizable reads and
conditional writes on a single object (see [“Linearizability”](/en/ch10#sec_consistency_linearizability)), but no
guarantees across multiple objects.
#### The need for multi-object transactions {#sec_transactions_need}
Do we need multi-object transactions at all? Would it be possible to implement any application with
only a key-value data model and single-object operations?
There are some use cases in which single-object inserts, updates, and deletes are sufficient.
However, in many other cases writes to several different objects need to be coordinated:
* In a relational data model, a row in one table often has a foreign key reference to a row in
another table. Similarly, in a graph-like data model, a vertex has edges to other vertices.
Multi-object transactions allow you to ensure that these references remain valid: when inserting
several records that refer to one another, the foreign keys have to be correct and up to date,
or the data becomes nonsensical.
* In a document data model, the fields that need to be updated together are often within the same
document, which is treated as a single object—no multi-object transactions are needed when
updating a single document. However, document databases lacking join functionality also encourage
denormalization (see [“When to Use Which Model”](/en/ch3#sec_datamodels_document_summary)). When denormalized information needs to
be updated, like in the example of [Figure 8-2](/en/ch8#fig_transactions_read_uncommitted), you need to update
several documents in one go. Transactions are very useful in this situation to prevent
denormalized data from going out of sync.
* In databases with secondary indexes (almost everything except pure key-value stores), the indexes
also need to be updated every time you change a value. These indexes are different database
objects from a transaction point of view: for example, without transaction isolation, it’s
possible for a record to appear in one index but not another, because the update to the second
index hasn’t happened yet (see [“Sharding and Secondary Indexes”](/en/ch7#sec_sharding_secondary_indexes)).
Such applications can still be implemented without transactions. However, error handling becomes
much more complicated without atomicity, and the lack of isolation can cause concurrency problems.
We will discuss those in [“Weak Isolation Levels”](/en/ch8#sec_transactions_isolation_levels), and explore alternative approaches
in [“Derived data versus distributed transactions”](/en/ch13#sec_future_derived_vs_transactions).
#### Handling errors and aborts {#handling-errors-and-aborts}
A key feature of a transaction is that it can be aborted and safely retried if an error occurs.
ACID databases are based on this philosophy: if the database is in danger of violating its guarantee
of atomicity, isolation, or durability, it would rather abandon the transaction entirely than allow
it to remain half-finished.
Not all systems follow that philosophy, though. In particular, datastores with leaderless
replication (see [“Leaderless Replication”](/en/ch6#sec_replication_leaderless)) work much more on a “best effort” basis, which
could be summarized as “the database will do as much as it can, and if it runs into an error, it
won’t undo something it has already done”—so it’s the application’s responsibility to recover from
errors.
Errors will inevitably happen, but many software developers prefer to think only about the happy
path rather than the intricacies of error handling. For example, popular object-relational mapping
(ORM) frameworks such as Rails’s ActiveRecord and Django don’t retry aborted transactions—the
error usually results in an exception bubbling up the stack, so any user input is thrown away and
the user gets an error message. This is a shame, because the whole point of aborts is to enable safe
retries.
Although retrying an aborted transaction is a simple and effective error handling mechanism, it
isn’t perfect:
* If the transaction actually succeeded, but the network was interrupted while the server tried to
acknowledge the successful commit to the client (so it timed out from the client’s point of view),
then retrying the transaction causes it to be performed twice—unless you have an additional
application-level deduplication mechanism in place.
* If the error is due to overload or high contention between concurrent transactions, retrying the
transaction will make the problem worse, not better. To avoid such feedback cycles, you can limit
the number of retries, use exponential backoff, and handle overload-related errors differently
from other errors (see [“When an overloaded system won’t recover”](/en/ch2#sidebar_metastable)).
* It is only worth retrying after transient errors (for example due to deadlock, isolation
violation, temporary network interruptions, and failover); after a permanent error (e.g.,
constraint violation) a retry would be pointless.
* If the transaction also has side effects outside of the database, those side effects may happen
even if the transaction is aborted. For example, if you’re sending an email, you wouldn’t want to
send the email again every time you retry the transaction. If you want to make sure that several
different systems either commit or abort together, two-phase commit can help (we will discuss this
in [“Two-Phase Commit (2PC)”](/en/ch8#sec_transactions_2pc)).
* If the client process crashes while retrying, any data it was trying to write to the database is lost.
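Several of the points above can be combined into a small retry helper. This is only a sketch: the exception classes `TransientError` and `PermanentError` are hypothetical stand-ins for whatever error codes your database driver reports, and the random jitter added to the backoff is an extra assumption not discussed in the text.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical: deadlock, isolation violation, network blip, failover."""


class PermanentError(Exception):
    """Hypothetical: constraint violation or other error a retry cannot fix."""


def run_with_retries(transaction, max_attempts=5, base_delay=0.05, sleep=time.sleep):
    """Run transaction(); retry transient failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return transaction()
        except TransientError:
            if attempt == max_attempts:
                raise  # bounded retries: give up instead of adding to overload
            # The upper bound doubles each attempt: base_delay, 2x, 4x, ...
            upper = base_delay * 2 ** (attempt - 1)
            sleep(random.uniform(0, upper))
        # PermanentError (and anything else) is not caught and propagates at once.
```

The helper does not address duplicate execution after a lost commit acknowledgment, nor side effects outside the database; the `transaction` callable must be safe to run more than once.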
## Weak Isolation Levels {#sec_transactions_isolation_levels}
If two transactions don’t access the same data, or if both are read-only, they can safely be run in
parallel, because neither depends on the other. Concurrency issues (race conditions) only come into
play when one transaction reads data that is concurrently modified by another transaction, or when
the two transactions try to modify the same data.
Concurrency bugs are hard to find by testing, because such bugs are only triggered when you get
unlucky with the timing. Such timing issues might occur very rarely, and are usually difficult to
reproduce. Concurrency is also very difficult to reason about, especially in a large application
where you don’t necessarily know which other pieces of code are accessing the database. Application
development is difficult enough if you just have one user at a time; having many concurrent users
makes it much harder still, because any piece of data could unexpectedly change at any time.
For that reason, databases have long tried to hide concurrency issues from application developers by
providing *transaction isolation*. In theory, isolation should make your life easier by letting you
pretend that no concurrency is happening: *serializable* isolation means that the database
guarantees that transactions have the same effect as if they ran *serially* (i.e., one at a time,
without any concurrency).
In practice, isolation is unfortunately not that simple. Serializable isolation has a performance
cost, and many databases don’t want to pay that price [^10]. It’s therefore common for systems to use
weaker levels of isolation, which protect against *some* concurrency issues, but not all. Those
levels of isolation are much harder to understand, and they can lead to subtle bugs, but they are
nevertheless used in practice [^29].
Concurrency bugs caused by weak transaction isolation are not just a theoretical problem. They have
caused substantial loss of money [^30] [^31] [^32], led to investigation by financial auditors [^33],
and caused customer data to be corrupted [^34]. A popular comment on revelations of such problems is “Use an ACID database if you’re handling
financial data!”—but that misses the point. Even many popular relational database systems (which
are usually considered “ACID”) use weak isolation, so they wouldn’t necessarily have prevented these
bugs from occurring.
--------
> [!NOTE]
> Incidentally, much of the banking system relies on text files that are exchanged via secure FTP [^35].
> In this context, having an audit trail and some human-level fraud prevention measures is actually
> more important than ACID properties.
--------
Those examples also highlight an important point: even if concurrency issues are rare in normal
operation, you have to consider the possibility that an attacker deliberately sends a burst of
highly concurrent requests to your API in an attempt to exploit concurrency bugs [^30]. Therefore, in order to build
applications that are reliable and secure, you have to ensure that such bugs are systematically
prevented.
In this section we will look at several weak (nonserializable) isolation levels that are used in
practice, and discuss in detail what kinds of race conditions can and cannot occur, so that you can
decide what level is appropriate to your application. Once we’ve done that, we will discuss
serializability in detail (see [“Serializability”](/en/ch8#sec_transactions_serializability)). Our discussion of isolation
levels will be informal, using examples. If you want rigorous definitions and analyses of their
properties, you can find them in the academic literature [^36] [^37] [^38] [^39].
### Read Committed {#sec_transactions_read_committed}
The most basic level of transaction isolation is *read committed*. It makes two guarantees:
1. When reading from the database, you will only see data that has been committed (no *dirty reads*).
2. When writing to the database, you will only overwrite data that has been committed (no *dirty writes*).
Some databases support an even weaker isolation level called *read uncommitted*. It prevents dirty
writes, but does not prevent dirty reads. Let’s discuss these two guarantees in more detail.
#### No dirty reads {#no-dirty-reads}
Imagine a transaction has written some data to the database, but the transaction has not yet committed or aborted.
Can another transaction see that uncommitted data? If yes, that is called a
*dirty read* [^3].
Transactions running at the read committed isolation level must prevent dirty reads. This means that
any writes by a transaction only become visible to others when that transaction commits (and then
all of its writes become visible at once). This is illustrated in [Figure 8-4](/en/ch8#fig_transactions_read_committed), where user 1 has set *x* = 3, but user 2’s *get x* still
returns the old value, 2, while user 1 has not yet committed.
{{< figure src="/fig/ddia_0804.png" id="fig_transactions_read_committed" caption="Figure 8-4. No dirty reads: user 2 sees the new value for x only after user 1's transaction has committed." class="w-full my-4" >}}
There are a few reasons why it’s useful to prevent dirty reads:
* If a transaction needs to update several rows, a dirty read means that another transaction may
see some of the updates but not others. For example, in [Figure 8-2](/en/ch8#fig_transactions_read_uncommitted), the
user sees the new unread email but not the updated counter. This is a dirty read of the email.
Seeing the database in a partially updated state is confusing to users and may cause other
transactions to take incorrect decisions.
* If a transaction aborts, any writes it has made need to be rolled back (like in
[Figure 8-3](/en/ch8#fig_transactions_atomicity)). If the database allows dirty reads, that means a transaction may
see data that is later rolled back—i.e., which is never actually committed to the database. Any
transaction that read uncommitted data would also need to be aborted, leading to a problem called
*cascading aborts*.
#### No dirty writes {#sec_transactions_dirty_write}
What happens if two transactions concurrently try to update the same row in a database? We don’t
know in which order the writes will happen, but we normally assume that the later write overwrites
the earlier write.
However, what happens if the earlier write is part of a transaction that has not yet committed, so
the later write overwrites an uncommitted value? This is called a *dirty write* [^36]. Transactions running at the read
committed isolation level must prevent dirty writes, usually by delaying the second write until the
first write’s transaction has committed or aborted.
By preventing dirty writes, this isolation level avoids some kinds of concurrency problems:
* If transactions update multiple rows, dirty writes can lead to a bad outcome. For example,
consider [Figure 8-5](/en/ch8#fig_transactions_dirty_writes), which illustrates a used car sales website on which
two people, Aaliyah and Bryce, are simultaneously trying to buy the same car. Buying a car requires
two database writes: the listing on the website needs to be updated to reflect the buyer, and the
sales invoice needs to be sent to the buyer. In the case of [Figure 8-5](/en/ch8#fig_transactions_dirty_writes), the
sale is awarded to Bryce (because he performs the winning update to the `listings` table), but the
invoice is sent to Aaliyah (because she performs the winning update to the `invoices` table). Read
committed prevents such mishaps.
* However, read committed does *not* prevent the race condition between two counter increments in
[Figure 8-1](/en/ch8#fig_transactions_increment). In this case, the second write happens after the first transaction
has committed, so it’s not a dirty write. It’s still incorrect, but for a different reason—in
[“Preventing Lost Updates”](/en/ch8#sec_transactions_lost_update) we will discuss how to make such counter increments safe.
{{< figure src="/fig/ddia_0805.png" id="fig_transactions_dirty_writes" caption="Figure 8-5. With dirty writes, conflicting writes from different transactions can be mixed up." class="w-full my-4" >}}
#### Implementing read committed {#sec_transactions_read_committed_impl}
Read committed is a very popular isolation level. It is the default setting in Oracle Database,
PostgreSQL, SQL Server, and many other databases [^10].
Most commonly, databases prevent dirty writes by using row-level locks: when a transaction wants to
modify a particular row (or document or some other object), it must first acquire a lock on that
row. It must then hold that lock until the transaction is committed or aborted. Only one transaction
can hold the lock for any given row; if another transaction wants to write to the same row, it must
wait until the first transaction is committed or aborted before it can acquire the lock and
continue. This locking is done automatically by databases in read committed mode (or stronger
isolation levels).
How do we prevent dirty reads? One option would be to use the same lock, and to require any
transaction that wants to read a row to briefly acquire the lock and then release it again
immediately after reading. This would ensure that a read couldn’t happen while a row has a
dirty, uncommitted value (because during that time the lock would be held by the transaction that
has made the write).
However, the approach of requiring read locks does not work well in practice, because one
long-running write transaction can force many other transactions to wait until the long-running
transaction has completed, even if the other transactions only read and do not write anything to the
database. This harms the response time of read-only transactions and is bad for
operability: a slowdown in one part of an application can have a knock-on effect in a completely
different part of the application, due to waiting for locks.
Nevertheless, locks are used to prevent dirty reads in some databases, such as IBM
Db2 and Microsoft SQL Server with the `read_committed_snapshot=off` setting [^29].
A more commonly used approach to preventing dirty reads is the one illustrated in [Figure 8-4](/en/ch8#fig_transactions_read_committed): for every
row that is written, the database remembers both the old committed value and the new value
set by the transaction that currently holds the write lock. While the transaction is ongoing, any
other transactions that read the row are simply given the old value. Only when the new value is
committed do transactions switch over to reading the new value (see
[“Multi-version concurrency control (MVCC)”](/en/ch8#sec_transactions_snapshot_impl) for more detail).
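The two mechanisms described in this section, a row-level write lock and a remembered old value, can be sketched in a few lines of single-threaded Python. The `ReadCommittedStore` class is hypothetical. Where a real database would make a conflicting writer wait, this toy raises `WriteConflict` instead, and it assumes that a transaction can read its own uncommitted writes.

```python
class WriteConflict(Exception):
    """Raised where a real database would make the writer wait for the row lock."""


class ReadCommittedStore:
    def __init__(self):
        self._committed = {}  # key -> latest committed value
        self._pending = {}    # key -> (txid holding the write lock, uncommitted value)

    def read(self, txid, key):
        holder = self._pending.get(key)
        if holder is not None and holder[0] == txid:
            return holder[1]             # a transaction sees its own writes
        return self._committed.get(key)  # everyone else gets the old committed value

    def write(self, txid, key, value):
        holder = self._pending.get(key)
        if holder is not None and holder[0] != txid:
            raise WriteConflict(f"row {key!r} is locked by transaction {holder[0]}")
        self._pending[key] = (txid, value)  # acquires (or keeps) the row lock

    def commit(self, txid):
        for key in self._keys_locked_by(txid):
            self._committed[key] = self._pending.pop(key)[1]

    def abort(self, txid):
        for key in self._keys_locked_by(txid):
            del self._pending[key]  # the old committed value was never touched

    def _keys_locked_by(self, txid):
        return [k for k, (owner, _) in self._pending.items() if owner == txid]
```

Replaying Figure 8-4 against this store, a reader keeps receiving the old value of *x* until the writing transaction commits, and a second writer to the same row is held off until then.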
### Snapshot Isolation and Repeatable Read {#sec_transactions_snapshot_isolation}
If you look superficially at read committed isolation, you could be forgiven for thinking that it
does everything that a transaction needs to do: it allows aborts (required for atomicity), it
prevents reading the incomplete results of transactions, and it prevents concurrent writes from
getting intermingled. Indeed, those are useful features, and much stronger guarantees than you can
get from a system that has no transactions.
However, there are still plenty of ways in which you can have concurrency bugs when using this
isolation level. For example, [Figure 8-6](/en/ch8#fig_transactions_item_many_preceders) illustrates a problem that
can occur with read committed.
{{< figure src="/fig/ddia_0806.png" id="fig_transactions_item_many_preceders" caption="Figure 8-6. Read skew: Aaliyah observes the database in an inconsistent state." class="w-full my-4" >}}
Say Aaliyah has $1,000 of savings at a bank, split across two accounts with $500 each. Now a
transaction transfers $100 from one of her accounts to the other. If she is unlucky enough to look at her
list of account balances at the same moment as that transaction is being processed, she may see one
account balance as it was before the incoming payment arrived (a balance of $500), and the
other account as it was after the outgoing transfer was made (the new balance being $400). To Aaliyah it
now appears as though she only has a total of $900 in her accounts—it seems that $100 has
vanished into thin air.
This anomaly is called *read skew*, and it is an example of a *nonrepeatable read*:
if Aaliyah were to read the balance of account 1 again at the end of the transaction, she would see a different value ($600) than she saw
in her previous query. Read skew is considered acceptable under read committed isolation: the
account balances that Aaliyah saw were indeed committed at the time when she read them.
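The interleaving can be replayed with a plain dictionary standing in for the database. This is a deliberately simplified sketch: the function name is hypothetical, and the transfer is modeled as committing in full between Aaliyah's two reads.

```python
def demo_read_skew():
    accounts = {1: 500, 2: 500}

    seen_account_1 = accounts[1]  # Aaliyah's first read: before the transfer

    # The transfer transaction commits between her two reads.
    accounts[1] += 100
    accounts[2] -= 100

    seen_account_2 = accounts[2]  # Aaliyah's second read: after the transfer

    apparent_total = seen_account_1 + seen_account_2
    actual_total = sum(accounts.values())
    return apparent_total, actual_total
```

Each individual read returned a committed value, yet the apparent total is $900 while the actual total never left $1,000.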
--------
> [!NOTE]
> The term *skew* is unfortunately overloaded: we previously used it in the sense of an *unbalanced
> workload with hot spots* (see [“Skewed Workloads and Relieving Hot Spots”](/en/ch7#sec_sharding_skew)), whereas here it means *timing anomaly*.
--------
In Aaliyah’s case, this is not a lasting problem, because she will most likely see consistent account
balances if she reloads the online banking website a few seconds later. However, some situations
cannot tolerate such temporary inconsistency:
Backups
: Taking a backup requires making a copy of the entire database, which may take hours on a large
database. During the time that the backup process is running, writes will continue to be made to
the database. Thus, you could end up with some parts of the backup containing an older version of
the data, and other parts containing a newer version. If you need to restore from such a backup,
the inconsistencies (such as disappearing money) become permanent.
Analytic queries and integrity checks
: Sometimes, you may want to run a query that scans over large parts of the database. Such queries
are common in analytics (see [“Analytical versus Operational Systems”](/en/ch1#sec_introduction_analytics)), or may be part of a periodic integrity
check that everything is in order (monitoring for data corruption). These queries are likely to
return nonsensical results if they observe parts of the database at different points in time.
*Snapshot isolation* [^36] is the most common
solution to this problem. The idea is that each transaction reads from a *consistent snapshot* of
the database—that is, the transaction sees all the data that was committed in the database at the
start of the transaction. Even if the data is subsequently changed by another transaction, each
transaction sees only the old data from that particular point in time.
Snapshot isolation is a boon for long-running, read-only queries such as backups and analytics. It
is very hard to reason about the meaning of a query if the data on which it operates is changing at
the same time as the query is executing. When a transaction can see a consistent snapshot of the
database, frozen at a particular point in time, it is much easier to understand.
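To see a consistent snapshot in action, the sketch below replays the account transfer against SQLite through Python's `sqlite3` module. It relies on SQLite's write-ahead-log mode, in which a read transaction keeps seeing the database as of its first read. The table layout and function names are assumptions made for this example.

```python
import os
import sqlite3
import tempfile


def balance(conn, account_id):
    rows = conn.execute("SELECT balance FROM accounts WHERE id = ?",
                        (account_id,)).fetchall()
    return rows[0][0]


def demo_snapshot_read():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "bank.db")
        # isolation_level=None: no implicit BEGIN, we issue BEGIN/COMMIT ourselves.
        writer = sqlite3.connect(path, isolation_level=None)
        writer.execute("PRAGMA journal_mode=WAL")
        writer.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
        writer.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 500), (2, 500)])
        reader = sqlite3.connect(path, isolation_level=None)
        try:
            reader.execute("BEGIN")
            first = balance(reader, 1)   # the snapshot is fixed at this first read

            # A concurrent transaction moves $100 from account 2 to account 1.
            writer.execute("BEGIN IMMEDIATE")
            writer.execute("UPDATE accounts SET balance = balance + 100 WHERE id = 1")
            writer.execute("UPDATE accounts SET balance = balance - 100 WHERE id = 2")
            writer.execute("COMMIT")

            second = balance(reader, 2)  # still reads from the same snapshot
            reader.execute("COMMIT")

            after = (balance(reader, 1), balance(reader, 2))  # a new snapshot
        finally:
            reader.close()
            writer.close()
        return (first, second), after
```

Within the reading transaction both balances come from the same point in time, so they add up to $1,000; the effect of the transfer only becomes visible to a transaction that starts later.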
Snapshot isolation is a popular feature: variants of it are supported by PostgreSQL, MySQL with the
InnoDB storage engine, Oracle, SQL Server, and others, although the detailed behavior varies from
one system to the next [^29] [^40] [^41].
Some databases, such as Oracle, TiDB, and Aurora DSQL, even choose snapshot isolation as their
highest isolation level.
#### Multi-version concurrency control (MVCC) {#sec_transactions_snapshot_impl}
Like read committed isolation, implementations of snapshot isolation typically use write locks to
prevent dirty writes (see [“Implementing read committed”](/en/ch8#sec_transactions_read_committed_impl)), which means that a transaction
that makes a write can block the progress of another transaction that writes to the same row.
However, reads do not require any locks. From a performance point of view, a key principle of
snapshot isolation is *readers never block writers, and writers never block readers*. This allows a
database to handle long-running read queries on a consistent snapshot at the same time as processing
writes normally, without any lock contention between the two.
To implement snapshot isolation, databases use a generalization of the mechanism we saw for
preventing dirty reads in [Figure 8-4](/en/ch8#fig_transactions_read_committed). Instead of two versions of each row
(the committed version and the overwritten-but-not-yet-committed version), the database must
potentially keep several different committed versions of a row, because various in-progress
transactions may need to see the state of the database at different points in time. Because it
maintains several versions of a row side by side, this technique is known as *multi-version
concurrency control* (MVCC).
[Figure 8-7](/en/ch8#fig_transactions_mvcc) illustrates how MVCC-based snapshot isolation is implemented in PostgreSQL
[^40] [^42] [^43] (other implementations are similar).
When a transaction is started, it is given a unique, always-increasing transaction ID (`txid`).
Whenever a transaction writes anything to the database, the data it writes is tagged with the
transaction ID of the writer. (To be precise, transaction IDs in PostgreSQL are 32-bit integers, so
they overflow after approximately 4 billion transactions. The vacuum process performs cleanup to
ensure that overflow does not affect the data.)
{{< figure src="/fig/ddia_0807.png" id="fig_transactions_mvcc" caption="Figure 8-7. Implementing snapshot isolation using multi-version concurrency control." class="w-full my-4" >}}
Each row in a table has an `inserted_by` field, containing the ID of the transaction that inserted
this row into the table. Moreover, each row has a `deleted_by` field, which is initially empty. If a
transaction deletes a row, the row isn’t actually removed from the database, but it is marked for
deletion by setting the `deleted_by` field to the ID of the transaction that requested the deletion.
At some later time, when it is certain that no transaction can any longer access the deleted data, a
garbage collection process in the database removes any rows marked for deletion and frees their
space.
An update is internally translated into a delete and an insert [^44].
For example, in [Figure 8-7](/en/ch8#fig_transactions_mvcc), transaction 13 deducts $100 from account 2, changing the
balance from $500 to $400. The `accounts` table now actually contains two rows for account 2: a row
with a balance of $500 which was marked as deleted by transaction 13, and a row with a balance of
$400 which was inserted by transaction 13.
All of the versions of a row are stored within the same database heap (see
[“Storing values within the index”](/en/ch4#sec_storage_index_heap)), regardless of whether the transactions that wrote them have committed
or not. The versions of the same row form a linked list, going either from newest version to oldest
version or the other way round, so that queries can internally iterate over all versions of a row [^45] [^46].
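
To make the bookkeeping concrete, here is a minimal Python sketch of how an update could be recorded as a delete plus an insert. The `RowVersion` class, the `update_balance` function, and the transaction ID 5 for the original insert are illustrative assumptions, not PostgreSQL's actual on-disk format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RowVersion:
    key: int                          # logical row identity, e.g. the account ID
    balance: int
    inserted_by: int                  # txid that created this version
    deleted_by: Optional[int] = None  # txid that deleted it, or None if still live


def update_balance(heap: list, txid: int, key: int, new_balance: int) -> None:
    """Record an update as 'mark the live version deleted' plus 'append a new version'."""
    for version in heap:
        if version.key == key and version.deleted_by is None:
            version.deleted_by = txid
    heap.append(RowVersion(key, new_balance, inserted_by=txid))


# Account 2 starts with $500 (the inserting txid 5 is an arbitrary assumption).
heap = [RowVersion(key=2, balance=500, inserted_by=5)]
update_balance(heap, txid=13, key=2, new_balance=400)

for version in heap:
    print(version)
```

After the update the heap holds both versions of account 2: the $500 version with `deleted_by=13`, and the $400 version with `inserted_by=13`. Nothing has been overwritten in place.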
#### Visibility rules for observing a consistent snapshot {#sec_transactions_mvcc_visibility}
When a transaction reads from the database, transaction IDs are used to decide which row versions it
can see and which are invisible. By carefully defining visibility rules, the database can present a
consistent snapshot of the database to the application. This works roughly as follows [^43]:
1. At the start of each transaction, the database makes a list of all the other transactions that
are in progress (not yet committed or aborted) at that time. Any writes that those
transactions have made are ignored, even if the transactions subsequently commit. This ensures
that we see a consistent snapshot that is not affected by another transaction committing.
2. Any writes made by transactions with a later transaction ID (i.e., which started after the current
transaction started, and which are therefore not included in the list of in-progress
transactions) are ignored, regardless of whether those transactions have committed.
3. Any writes made by aborted transactions are ignored, regardless of when that abort happened.
This has the advantage that when a transaction aborts, we don’t need to immediately remove the
rows it wrote from storage, since the visibility rule filters them out. The garbage collection
process can remove them later.
4. All other writes are visible to the application’s queries.
These rules apply to both insertion and deletion of rows. In [Figure 8-7](/en/ch8#fig_transactions_mvcc), when
transaction 12 reads from account 2, it sees a balance of $500 because the deletion of the $500
balance was made by transaction 13 (according to rule 2, transaction 12 cannot see a deletion made
by transaction 13), and the insertion of the $400 balance is not yet visible (by the same rule).
Put another way, a row is visible if both of the following conditions are true:
* At the time when the reader’s transaction started, the transaction that inserted the row had
already committed.
* The row is not marked for deletion, or if it is, the transaction that requested deletion had
not yet committed at the time when the reader’s transaction started.
A long-running transaction may continue using a snapshot for a long time, continuing to read values
that (from other transactions’ point of view) have long been overwritten or deleted. By never
updating values in place but instead inserting a new version every time a value is changed, the
database can provide a consistent snapshot while incurring only a small overhead.
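
The visibility rules can be sketched in a few lines of Python. This is a simplified model, not any database's real implementation: the `Snapshot` and `Version` classes are hypothetical, the set of aborted transactions is assumed to be known up front, and we add the assumption (not stated in the rules above) that a transaction always sees its own writes. Transaction 14 is a made-up later reader that starts after transaction 13 has committed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Version:
    balance: int
    inserted_by: int
    deleted_by: Optional[int] = None


@dataclass(frozen=True)
class Snapshot:
    txid: int                            # ID of the reading transaction
    in_progress: frozenset = frozenset() # txids in progress when the reader started
    aborted: frozenset = frozenset()     # txids known to have aborted

    def sees_writes_of(self, writer: int) -> bool:
        if writer == self.txid:
            return True    # assumption: own writes are visible
        if writer in self.in_progress:
            return False   # rule 1: was in progress when we started
        if writer > self.txid:
            return False   # rule 2: started after we did
        if writer in self.aborted:
            return False   # rule 3: aborted
        return True        # rule 4: everything else is visible

    def is_visible(self, version: Version) -> bool:
        if not self.sees_writes_of(version.inserted_by):
            return False
        return version.deleted_by is None or not self.sees_writes_of(version.deleted_by)


# The two versions of account 2 after transaction 13's update.
account_2 = [Version(500, inserted_by=5, deleted_by=13), Version(400, inserted_by=13)]

tx12 = Snapshot(txid=12)
tx14 = Snapshot(txid=14)  # hypothetical reader that starts after tx 13 committed

print([v.balance for v in account_2 if tx12.is_visible(v)])  # [500]
print([v.balance for v in account_2 if tx14.is_visible(v)])  # [400]
```

Each reader sees exactly one version of the account, and which one it sees depends only on the reader's own snapshot, not on when the query happens to run.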
#### Indexes and snapshot isolation {#indexes-and-snapshot-isolation}
How do indexes work in a multi-version database? The most common approach is that each index entry
points at one of the versions of a row that matches the entry (either the oldest or the newest
version). Each row version may contain a reference to the next-oldest or next-newest version. A
query that uses the index must then iterate over the rows to find one that is visible, and where the
value matches what the query is looking for. When garbage collection removes old row versions that
are no longer visible to any transaction, the corresponding index entries can also be removed.
Many implementation details affect the performance of multi-version concurrency control [^45] [^46].
For example, PostgreSQL has optimizations for avoiding index updates if different versions of the
same row can fit on the same page [^40]. Some other databases avoid storing full copies of modified rows,
and only store differences between versions to save space.
Another approach is used in CouchDB, Datomic, and LMDB. Although they also use B-trees (see
[“B-Trees”](/en/ch4#sec_storage_b_trees)), they use an *immutable* (copy-on-write) variant that does not overwrite
pages of the tree when they are updated, but instead creates a new copy of each modified page.
Parent pages, up to the root of the tree, are copied and updated to point to the new versions of
their child pages. Any pages that are not affected by a write do not need to be copied, and can be
shared with the new tree [^47].
With immutable B-trees, every write transaction (or batch of transactions) creates a new B-tree
root, and a particular root is a consistent snapshot of the database at the point in time when it
was created. There is no need to filter out rows based on transaction IDs because subsequent
writes cannot modify an existing B-tree; they can only create new tree roots. This approach also
requires a background process for compaction and garbage collection.
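
The copy-on-write idea is easier to see in a much smaller structure. The following sketch uses an immutable binary search tree rather than a B-tree (a deliberate simplification, and none of these names come from CouchDB, Datomic, or LMDB): a write copies only the nodes on the path from the root to the modified key, and every old root remains a usable snapshot.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    key: int
    value: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def put(root: Optional[Node], key: int, value: str) -> Node:
    """Return a new root. Only nodes on the path to `key` are copied."""
    if root is None:
        return Node(key, value)
    if key < root.key:
        return Node(root.key, root.value, put(root.left, key, value), root.right)
    if key > root.key:
        return Node(root.key, root.value, root.left, put(root.right, key, value))
    return Node(key, value, root.left, root.right)


def get(root: Optional[Node], key: int) -> Optional[str]:
    while root is not None:
        if key == root.key:
            return root.value
        root = root.left if key < root.key else root.right
    return None


v1 = put(put(put(None, 2, "b"), 1, "a"), 3, "c")
v2 = put(v1, 3, "C")  # a later write produces a new root

print(get(v1, 3), get(v2, 3))  # c C
print(v1.left is v2.left)      # True: the untouched subtree is shared
```

A reader holding `v1` keeps seeing the old value without any visibility check, because nothing reachable from `v1` is ever modified.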
#### Snapshot isolation, repeatable read, and naming confusion {#snapshot-isolation-repeatable-read-and-naming-confusion}
MVCC is a commonly used implementation technique for databases, and often it is used to implement
snapshot isolation. However, different databases sometimes use different terms to refer to the same
thing: for example, snapshot isolation is called “repeatable read” in PostgreSQL, and “serializable”
in Oracle [^29]. Sometimes different systems
use the same term to mean different things: for example, while in PostgreSQL “repeatable read” means
snapshot isolation, in MySQL it means an implementation of MVCC with weaker consistency than
snapshot isolation [^41].
The reason for this naming confusion is that the SQL standard doesn’t have the concept of snapshot
isolation, because the standard is based on System R’s 1975 definition of isolation levels [^3] and snapshot isolation hadn’t yet been
invented then. Instead, it defines repeatable read, which looks superficially similar to snapshot
isolation. PostgreSQL calls its snapshot isolation level “repeatable read” because it meets the
requirements of the standard, which allows it to claim standards compliance.
Unfortunately, the SQL standard’s definition of isolation levels is flawed—it is ambiguous,
imprecise, and not as implementation-independent as a standard should be [^36]. Even though several databases
implement repeatable read, there are big differences in the guarantees they actually provide,
despite being ostensibly standardized [^29]. There has been a formal definition of
repeatable read in the research literature [^37] [^38], but most implementations don’t satisfy that
formal definition. And to top it off, IBM Db2 uses “repeatable read” to refer to serializability [^10].
As a result, nobody really knows what repeatable read means.
### Preventing Lost Updates {#sec_transactions_lost_update}
The read committed and snapshot isolation levels we’ve discussed so far have been primarily about the guarantees
of what a read-only transaction can see in the presence of concurrent writes. We have mostly ignored
the issue of two transactions writing concurrently—we have only discussed dirty writes (see
[“No dirty writes”](/en/ch8#sec_transactions_dirty_write)), one particular type of write-write conflict that can occur.
There are several other interesting kinds of conflicts that can occur between concurrently writing
transactions. The best known of these is the *lost update* problem, illustrated in
[Figure 8-1](/en/ch8#fig_transactions_increment) with the example of two concurrent counter increments.
The lost update problem can occur if an application reads some value from the database, modifies it,
and writes back the modified value (a *read-modify-write cycle*). If two transactions do this
concurrently, one of the modifications can be lost, because the second write does not include the
first modification. (We sometimes say that the later write *clobbers* the earlier write.) This
pattern occurs in various scenarios:
* Incrementing a counter or updating an account balance (requires reading the current value,
calculating the new value, and writing back the updated value)
* Making a local change to a complex value, e.g., adding an element to a list within a JSON document
(requires parsing the document, making the change, and writing back the modified document)
* Two users editing a wiki page at the same time, where each user saves their changes by sending the
entire page contents to the server, overwriting whatever is currently in the database
Because this is such a common problem, a variety of solutions have been developed [^48].
#### Atomic write operations {#atomic-write-operations}
Many databases provide atomic update operations, which remove the need to implement
read-modify-write cycles in application code. They are usually the best solution if your code can be
expressed in terms of those operations. For example, the following instruction is concurrency-safe
in most relational databases:
```sql
UPDATE counters SET value = value + 1 WHERE key = 'foo';
```
Similarly, document databases such as MongoDB provide atomic operations for making local
modifications to a part of a JSON document, and Redis provides atomic operations for modifying data
structures such as priority queues. Not all writes can easily be expressed in terms of atomic
operations—for example, updates to a wiki page involve arbitrary text editing, which can be handled
using algorithms discussed in [“CRDTs and Operational Transformation”](/en/ch6#sec_replication_crdts)—but in situations where atomic operations
can be used, they are usually the best choice.
Atomic operations are usually implemented by taking an exclusive lock on the object when it is read
so that no other transaction can read it until the update has been applied.
Another option is to simply force all atomic operations to be executed on a single thread.
Unfortunately, object-relational mapping (ORM) frameworks make it easy to accidentally write code
that performs unsafe read-modify-write cycles instead of using atomic operations provided by the
database [^49] [^50] [^51].
This can be a source of subtle bugs that are difficult to find by testing.
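
The difference between the two styles can be demonstrated with Python's built-in `sqlite3` module. This is only a sketch: the interleaving of two clients is simulated by hand on a single connection, and the `counters` table and starting value of 42 are assumptions chosen to mirror the counter example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE counters (key TEXT PRIMARY KEY, value INTEGER)")
conn.execute("INSERT INTO counters VALUES ('foo', 42)")


def read_value() -> int:
    return conn.execute("SELECT value FROM counters WHERE key = 'foo'").fetchone()[0]


# Unsafe read-modify-write: both clients read before either one writes.
seen_by_a = read_value()
seen_by_b = read_value()
conn.execute("UPDATE counters SET value = ? WHERE key = 'foo'", (seen_by_a + 1,))
conn.execute("UPDATE counters SET value = ? WHERE key = 'foo'", (seen_by_b + 1,))
after_unsafe = read_value()
print(after_unsafe)  # 43: one of the two increments was lost

# Atomic update: the database computes the new value itself.
conn.execute("UPDATE counters SET value = 42 WHERE key = 'foo'")
for _ in range(2):
    conn.execute("UPDATE counters SET value = value + 1 WHERE key = 'foo'")
after_atomic = read_value()
print(after_atomic)  # 44: both increments are applied
```

ORM code that loads an object, changes a field in memory, and saves it typically produces the first pattern.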
#### Explicit locking {#explicit-locking}
Another option for preventing lost updates, if the database’s built-in atomic operations don’t
provide the necessary functionality, is for the application to explicitly lock objects that are
going to be updated. Then the application can perform a read-modify-write cycle, and if any other
transaction tries to concurrently update or lock the same object, it is forced to wait until the
first read-modify-write cycle has completed.
For example, consider a multiplayer game in which several players can move the same figure
concurrently. In this case, an atomic operation may not be sufficient, because the application also
needs to ensure that a player’s move abides by the rules of the game, which involves some logic that
you cannot sensibly implement as a database query. Instead, you may use a lock to prevent two
players from concurrently moving the same piece, as illustrated in [Example 8-1](/en/ch8#fig_transactions_select_for_update).
{{< figure id="fig_transactions_select_for_update" title="Example 8-1. Explicitly locking rows to prevent lost updates" class="w-full my-4" >}}
```sql
BEGIN TRANSACTION;
SELECT * FROM figures
WHERE name = 'robot' AND game_id = 222
FOR UPDATE; ❶
-- Check whether move is valid, then update the position
-- of the piece that was returned by the previous SELECT.
UPDATE figures SET position = 'c4' WHERE id = 1234;
COMMIT;
```
❶: The `FOR UPDATE` clause indicates that the database should take a lock on all rows returned by this query.
This works, but to get it right, you need to carefully think about your application logic. It’s easy
to forget to add a necessary lock somewhere in the code, and thus introduce a race condition.
Moreover, if you lock multiple objects there is a risk of deadlock, where two or more transactions
are waiting for each other to release their locks. Many databases automatically detect deadlocks,
and abort one of the involved transactions so that the system can make progress. You can handle this
situation at the application level by retrying the aborted transaction.
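
A retry loop for transactions that were aborted as deadlock victims might look like the following sketch. The `DeadlockDetected` exception is hypothetical (real database drivers each raise their own error type), and the randomized exponential backoff is our own addition rather than something the database requires.

```python
import random
import time


class DeadlockDetected(Exception):
    """Hypothetical error raised when the database aborts a deadlock victim."""


def run_with_retry(transaction, max_attempts=5, base_delay=0.01):
    """Run `transaction()` and retry it if it is aborted due to a deadlock."""
    for attempt in range(1, max_attempts + 1):
        try:
            return transaction()
        except DeadlockDetected:
            if attempt == max_attempts:
                raise
            # Back off for a random, growing interval so that the competing
            # transactions are less likely to collide again immediately.
            time.sleep(random.uniform(0, base_delay * 2 ** attempt))


# Stand-in for a real transaction: it is aborted twice, then succeeds.
attempts = []


def move_piece():
    attempts.append(1)
    if len(attempts) < 3:
        raise DeadlockDetected()
    return "moved"


result = run_with_retry(move_piece)
print(result, len(attempts))  # moved 3
```

The whole read-modify-write cycle must be inside `transaction`, so that each retry re-reads the data rather than reusing values from the aborted attempt.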
#### Automatically detecting lost updates {#automatically-detecting-lost-updates}
Atomic operations and locks are ways of preventing lost updates by forcing the read-modify-write
cycles to happen sequentially. An alternative is to allow them to execute in parallel and, if the
transaction manager detects a lost update, abort the transaction and force it to retry
its read-modify-write cycle.
An advantage of this approach is that databases can perform this check efficiently in conjunction
with snapshot isolation. Indeed, PostgreSQL’s repeatable read, Oracle’s serializable, and SQL
Server’s snapshot isolation levels automatically detect when a lost update has occurred and abort
the offending transaction. However, MySQL/InnoDB’s repeatable read does not detect lost updates [^29] [^41].
Some authors [^36] [^38] argue that a database must prevent lost
updates in order to qualify as providing snapshot isolation, so MySQL does not provide snapshot
isolation under this definition.
Lost update detection is a great feature, because it doesn’t require application code to use any
special database features—you may forget to use a lock or an atomic operation and thus introduce
a bug, but lost update detection happens automatically and is thus less error-prone. However, you
also have to retry aborted transactions at the application level.
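
To illustrate how such a check can piggyback on snapshot isolation, here is a toy in-memory model. It is a sketch under several assumptions: the `Store` and `Transaction` classes are invented for this example, keys are assumed to exist already, transactions are interleaved by hand rather than run in threads, and the conflict check happens at commit time (real databases often detect the conflict earlier, when the row is written).

```python
class LostUpdateError(Exception):
    """Raised when another transaction has overwritten a key since our snapshot."""


class Store:
    def __init__(self, initial):
        self.commit_seq = 0
        # key -> list of (commit sequence number, value), oldest first
        self.versions = {key: [(0, value)] for key, value in initial.items()}

    def begin(self):
        return Transaction(self)


class Transaction:
    def __init__(self, store):
        self.store = store
        self.snapshot_seq = store.commit_seq  # we see commits up to this point
        self.writes = {}

    def read(self, key):
        if key in self.writes:
            return self.writes[key]
        visible = [v for seq, v in self.store.versions[key] if seq <= self.snapshot_seq]
        return visible[-1]

    def write(self, key, value):
        self.writes[key] = value

    def commit(self):
        for key in self.writes:
            newest_seq, _ = self.store.versions[key][-1]
            if newest_seq > self.snapshot_seq:
                raise LostUpdateError(key)  # someone else committed a write first
        self.store.commit_seq += 1
        for key, value in self.writes.items():
            self.store.versions[key].append((self.store.commit_seq, value))


def increment(store, key):
    tx = store.begin()
    tx.write(key, tx.read(key) + 1)
    return tx


store = Store({"counter": 42})
t1, t2 = increment(store, "counter"), increment(store, "counter")
t1.commit()
try:
    t2.commit()
    aborted = False
except LostUpdateError:
    aborted = True
    increment(store, "counter").commit()  # retry on a fresh snapshot

final = store.begin().read("counter")
print(aborted, final)  # True 44
```

The application code in `increment` contains no locks or special operations. The second transaction is simply aborted, and the retry produces the correct result.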
#### Conditional writes (compare-and-set) {#sec_transactions_compare_and_set}
In databases that don’t provide transactions, you sometimes find a *conditional write* operation
that can prevent lost updates by allowing an update to happen only if the value has not changed
since you last read it (previously mentioned in [“Single-object writes”](/en/ch8#sec_transactions_single_object)). If the current
value does not match what you previously read, the update has no effect, and the read-modify-write
cycle must be retried. It is the database equivalent of an atomic *compare-and-set* or
*compare-and-swap* (CAS) instruction that is supported by many CPUs.
For example, to prevent two users concurrently updating the same wiki page, you might try something
like this, expecting the update to occur only if the content of the page hasn’t changed since the
user started editing it:
```sql
-- This may or may not be safe, depending on the database implementation
UPDATE wiki_pages SET content = 'new content'
WHERE id = 1234 AND content = 'old content';
```
If the content has changed and no longer matches `'old content'`, this update will have no effect,
so you need to check whether the update took effect and retry if necessary. Instead of comparing the
full content, you could also use a version number column that you increment on every update, and
apply the update only if the current version number hasn’t changed. This approach is sometimes
called *optimistic locking* [^52].
Note that if another transaction has concurrently modified `content`, the new content may not be
visible under the MVCC visibility rules (see [“Visibility rules for observing a consistent snapshot”](/en/ch8#sec_transactions_mvcc_visibility)). Many
implementations of MVCC have an exception to the visibility rules for this scenario, where values
written by other transactions are visible to the evaluation of the `WHERE` clause of `UPDATE` and
`DELETE` queries, even though those writes are not otherwise visible in the snapshot.
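
The version-number variant can be sketched with Python's built-in `sqlite3` module. The `version` column and the `save_page` helper are assumptions made for this example. SQLite allows only one writer at a time, so the MVCC visibility question discussed above does not arise here, and other databases may behave differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE wiki_pages (id INTEGER PRIMARY KEY, content TEXT, version INTEGER)"
)
conn.execute("INSERT INTO wiki_pages VALUES (1234, 'old content', 1)")


def save_page(page_id: int, new_content: str, expected_version: int) -> bool:
    """Write only if nobody has saved the page since `expected_version` was read."""
    cursor = conn.execute(
        "UPDATE wiki_pages SET content = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (new_content, page_id, expected_version),
    )
    return cursor.rowcount == 1  # zero rows matched means the condition failed


# Both users loaded the page when it was at version 1.
first = save_page(1234, "edit by user A", expected_version=1)
second = save_page(1234, "edit by user B", expected_version=1)
print(first, second)  # True False
```

User B's write is rejected rather than silently overwriting user A's edit. The application then has to reload the page and decide how to reconcile the two edits.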
#### Conflict resolution and replication {#conflict-resolution-and-replication}
In replicated databases (see [Chapter 6](/en/ch6#ch_replication)), preventing lost updates takes on another
dimension: since they have copies of the data on multiple nodes, and the data can potentially be
modified concurrently on different nodes, some additional steps need to be taken to prevent lost
updates.
Locks and conditional write operations assume that there is a single up-to-date copy of the data.
However, databases with multi-leader or leaderless replication usually allow several writes to
happen concurrently and replicate them asynchronously, so they cannot guarantee that there is a
single up-to-date copy of the data. Thus, techniques based on locks or conditional writes do not apply
in this context. (We will revisit this issue in more detail in [“Linearizability”](/en/ch10#sec_consistency_linearizability).)
Instead, as discussed in [“Dealing with Conflicting Writes”](/en/ch6#sec_replication_write_conflicts), a common approach in such replicated
databases is to allow concurrent writes to create several conflicting versions of a value (also
known as *siblings*), and to use application code or special data structures to resolve and merge
these versions after the fact.
Merging conflicting values can prevent lost updates if the updates are commutative (i.e., you can
apply them in a different order on different replicas, and still get the same result). For example,
incrementing a counter or adding an element to a set are commutative operations. That is the idea
behind CRDTs, which we encountered in [“CRDTs and Operational Transformation”](/en/ch6#sec_replication_crdts). However, some operations such as
conditional writes cannot be made commutative.
On the other hand, the *last write wins* (LWW) conflict resolution method is prone to lost updates,
as discussed in [“Last write wins (discarding concurrent writes)”](/en/ch6#sec_replication_lww).
Unfortunately, LWW is the default in many replicated databases.
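
The contrast between merging and LWW can be shown with a grow-only counter (G-counter), one well-known CRDT. This is a minimal sketch: the replica names and timestamps are made up, and the per-replica dictionary is just one possible representation.

```python
def merge(a: dict, b: dict) -> dict:
    """Merge two G-counter states by taking the maximum count per replica."""
    return {r: max(a.get(r, 0), b.get(r, 0)) for r in a.keys() | b.keys()}


def value(counter: dict) -> int:
    return sum(counter.values())


# Each replica records only its own increments, in its own slot.
on_replica_a = {"a": 1}
on_replica_b = {"b": 1}

merged = merge(on_replica_a, on_replica_b)
total = value(merged)
print(total)  # 2: both increments survive

# The merge gives the same result regardless of the order it is applied in.
print(merge(on_replica_a, on_replica_b) == merge(on_replica_b, on_replica_a))  # True

# With LWW, each replica stores (timestamp, counter value) and the later one wins.
lww_a = (10, 1)  # replica A incremented 0 -> 1 at time 10
lww_b = (11, 1)  # replica B incremented 0 -> 1 at time 11
lww_winner = max(lww_a, lww_b)
print(lww_winner[1])  # 1: one increment has been lost
```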
### Write Skew and Phantoms {#sec_transactions_write_skew}
In the previous sections we saw *dirty writes* and *lost updates*, two kinds of race conditions that
can occur when different transactions concurrently try to write to the same objects. In order to
avoid data corruption, those race conditions need to be prevented—either automatically by the
database, or by manual safeguards such as using locks or atomic write operations.
However, that is not the end of the list of potential race conditions that can occur between
concurrent writes. In this section we will see some subtler examples of conflicts.
To begin, imagine this example: you are writing an application for doctors to manage their on-call
shifts at a hospital. The hospital usually tries to have several doctors on call at any one time,
but it absolutely must have at least one doctor on call. Doctors can give up their shifts (e.g., if
they are sick themselves), provided that at least one colleague remains on call in that shift [^53] [^54].
Now imagine that Aaliyah and Bryce are the two on-call doctors for a particular shift. Both are
feeling unwell, so they both decide to request leave. Unfortunately, they happen to click the button
to go off call at approximately the same time. What happens next is illustrated in
[Figure 8-8](/en/ch8#fig_transactions_write_skew).
{{< figure src="/fig/ddia_0808.png" id="fig_transactions_write_skew" caption="Figure 8-8. Example of write skew causing an application bug." class="w-full my-4" >}}
In each transaction, your application first checks that two or more doctors are currently on call;
if yes, it assumes it’s safe for one doctor to go off call. Since the database is using snapshot
isolation, both checks return `2`, so both transactions proceed to the next stage. Aaliyah updates her
own record to take herself off call, and Bryce updates his own record likewise. Both transactions
commit, and now no doctor is on call. Your requirement of having at least one doctor on call has been violated.
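
The race can be reproduced in a few lines of Python. This is a hand-simulated sketch rather than a real database: each transaction is given a copy of the committed state as its snapshot, and the `request_leave` function is a stand-in for the application logic.

```python
def request_leave(snapshot: dict, doctor: str) -> dict:
    """Return the writes a transaction wants to make, based on its own snapshot."""
    if sum(snapshot.values()) >= 2:  # are two or more doctors on call?
        return {doctor: False}
    return {}


# Concurrent execution under snapshot isolation.
on_call = {"Aaliyah": True, "Bryce": True}
snapshot_a = dict(on_call)  # both transactions start from the same committed state
snapshot_b = dict(on_call)
writes_a = request_leave(snapshot_a, "Aaliyah")
writes_b = request_leave(snapshot_b, "Bryce")

# The two write sets touch different rows, so there is no write-write conflict.
print(writes_a.keys() & writes_b.keys())  # set()
on_call.update(writes_a)
on_call.update(writes_b)
print(sum(on_call.values()))  # 0 doctors on call

# Serial execution: the second transaction sees the first one's write.
serial = {"Aaliyah": True, "Bryce": True}
serial.update(request_leave(dict(serial), "Aaliyah"))
serial.update(request_leave(dict(serial), "Bryce"))
print(sum(serial.values()))  # 1 doctor on call
```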
#### Characterizing write skew {#characterizing-write-skew}
This anomaly is called *write skew* [^36]. It
is neither a dirty write nor a lost update, because the two transactions are updating two different
objects (Aaliyah’s and Bryce’s on-call records, respectively). It is less obvious that a conflict occurred
here, but it’s definitely a race condition: if the two transactions had run one after another, the
second doctor would have been prevented from going off call. The anomalous behavior was only
possible because the transactions ran concurrently.
You can think of write skew as a generalization of the lost update problem. Write skew can occur if two
transactions read the same objects, and then update some of those objects (different transactions
may update different objects). In the special case where different transactions update the same
object, you get a dirty write or lost update anomaly (depending on the timing).
We saw that there are various ways of preventing lost updates. With write skew, our
options are more restricted:
* Atomic single-object operations don’t help, as multiple objects are involved.
* The automatic detection of lost updates that you find in some implementations of snapshot
isolation unfortunately doesn’t help either: write skew is not automatically detected in
PostgreSQL’s repeatable read, MySQL/InnoDB’s repeatable read, Oracle’s serializable, or SQL
Server’s snapshot isolation level [^29].
Automatically preventing write skew requires true serializable isolation (see [“Serializability”](/en/ch8#sec_transactions_serializability)).
* Some databases allow you to configure constraints, which are then enforced by the database (e.g.,
uniqueness, foreign key constraints, or restrictions on a particular value). However, in order to
specify that at least one doctor must be on call, you would need a constraint that involves
multiple objects. Most databases do not have built-in support for such constraints, but you may be
able to implement them with triggers or materialized views, as discussed in
[“Consistency”](/en/ch8#sec_transactions_acid_consistency) [^12].
* If you can’t use a serializable isolation level, the second-best option in this case is probably
to explicitly lock the rows that the transaction depends on. In the doctors example, you could
write something like the following:
```sql
BEGIN TRANSACTION;
SELECT * FROM doctors
WHERE on_call = true
AND shift_id = 1234 FOR UPDATE; ❶
UPDATE doctors
SET on_call = false
WHERE name = 'Aaliyah'
AND shift_id = 1234;
COMMIT;
```
❶: As before, `FOR UPDATE` tells the database to lock all rows returned by this query.
#### More examples of write skew {#more-examples-of-write-skew}
Write skew may seem like an esoteric issue at first, but once you’re aware of it, you may notice
more situations in which it can occur. Here are some more examples:
Meeting room booking system
: Say you want to enforce that there cannot be two bookings for the same meeting room at the same time [^55].
When someone wants to make a booking, you first check for any conflicting bookings (i.e.,
bookings for the same room with an overlapping time range), and if none are found, you create the
meeting (see [Example 8-2](/en/ch8#fig_transactions_meeting_rooms)).
{{< figure id="fig_transactions_meeting_rooms" title="Example 8-2. A meeting room booking system tries to avoid double-booking (not safe under snapshot isolation)" class="w-full my-4" >}}
```sql
BEGIN TRANSACTION;
-- Check for any existing bookings that overlap with the period of noon-1pm
SELECT COUNT(*) FROM bookings
WHERE room_id = 123 AND
end_time > '2025-01-01 12:00' AND start_time < '2025-01-01 13:00';
-- If the previous query returned zero:
INSERT INTO bookings (room_id, start_time, end_time, user_id)
VALUES (123, '2025-01-01 12:00', '2025-01-01 13:00', 666);
COMMIT;
```
Unfortunately, snapshot isolation does not prevent another user from concurrently inserting a conflicting
meeting. In order to guarantee you won’t get scheduling conflicts, you once again need serializable
isolation.
Multiplayer game
: In [Example 8-1](/en/ch8#fig_transactions_select_for_update), we used a lock to prevent lost updates (that is, making
sure that two players can’t move the same figure at the same time). However, the lock doesn’t
prevent players from moving two different figures to the same position on the board or potentially
making some other move that violates the rules of the game. Depending on the kind of rule you are
enforcing, you might be able to use a unique constraint, but otherwise you’re vulnerable to write
skew.
Claiming a username
: On a website where each user has a unique username, two users may try to create accounts with the
same username at the same time. You may use a transaction to check whether a name is taken and, if
not, create an account with that name. However, as in the previous examples, that is not safe
under snapshot isolation. Fortunately, a unique constraint is a simple solution here (the second
transaction that tries to register the username will be aborted due to violating the constraint).
Preventing double-spending
: A service that allows users to spend money or points needs to check that a user doesn’t spend more
than they have. You might implement this by inserting a tentative spending item into a user’s
account, listing all the items in the account, and checking that the sum is positive.
With write skew, it could happen that two spending items are inserted concurrently that together
cause the balance to go negative, but that neither transaction notices the other.
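
The overlap condition in Example 8-2 is easy to get subtly wrong, so it can help to state it as a small function and check the boundary cases. The sketch below treats bookings as half-open intervals (the start is included, the end is not), which is an assumption consistent with the strict comparisons in the query. The `overlaps` function name is our own.

```python
from datetime import datetime


def overlaps(start_a, end_a, start_b, end_b) -> bool:
    """Two half-open intervals overlap if each one starts before the other ends."""
    return end_a > start_b and start_a < end_b


def at(hour, minute=0):
    return datetime(2025, 1, 1, hour, minute)


new_booking = (at(12), at(13))  # noon to 1pm

print(overlaps(at(11, 30), at(12, 30), *new_booking))  # True: partial overlap
print(overlaps(at(13), at(14), *new_booking))          # False: back-to-back
print(overlaps(at(12, 15), at(12, 45), *new_booking))  # True: fully contained
```

Checking this condition correctly is necessary but not sufficient: under snapshot isolation, two transactions can each find no overlap and then both insert.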
#### Phantoms causing write skew {#sec_transactions_phantom}
All of these examples follow a similar pattern:
1. A `SELECT` query checks whether some requirement is satisfied by searching for rows that
match some search condition (there are at least two doctors on call, there are no existing
bookings for that room at that time, the position on the board doesn’t already have another
figure on it, the username isn’t already taken, there is still money in the account).
2. Depending on the result of the first query, the application code decides how to continue (perhaps
to go ahead with the operation, or perhaps to report an error to the user and abort).
3. If the application decides to go ahead, it makes a write (`INSERT`, `UPDATE`, or `DELETE`) to the
database and commits the transaction.
The effect of this write changes the precondition of the decision of step 2. In other words, if you
were to repeat the `SELECT` query from step 1 after committing the write, you would get a different
result, because the write changed the set of rows matching the search condition (there is now one
fewer doctor on call, the meeting room is now booked for that time, the position on the board is now
taken by the figure that was moved, the username is now taken, there is now less money in the
account).
The steps may occur in a different order. For example, you could first make the write, then the
`SELECT` query, and finally decide whether to abort or commit based on the result of the query.
In the case of the doctor on call example, the row being modified in step 3 was one of the rows
returned in step 1, so we could make the transaction safe and avoid write skew by locking the rows
in step 1 (`SELECT FOR UPDATE`). However, the other four examples are different: they check for the
*absence* of rows matching some search condition, and the write *adds* a row matching the same
condition. If the query in step 1 doesn’t return any rows, `SELECT FOR UPDATE` can’t attach locks to
anything [^56].
This effect, where a write in one transaction changes the result of a search query in another
transaction, is called a *phantom* [^4].
Snapshot isolation avoids phantoms in read-only queries, but in read-write transactions like the
examples we discussed, phantoms can lead to particularly tricky cases of write skew. The SQL
generated by ORMs is also prone to write skew [^50] [^51].
#### Materializing conflicts {#materializing-conflicts}
If the problem of phantoms is that there is no object to which we can attach the locks, perhaps we
can artificially introduce a lock object into the database?
For example, in the meeting room booking case you could imagine creating a table of time slots and
rooms. Each row in this table corresponds to a particular room for a particular time period (say, 15
minutes). You create rows for all possible combinations of rooms and time periods ahead of time,
e.g., for the next six months.
Now a transaction that wants to create a booking can lock (`SELECT FOR UPDATE`) the rows in the
table that correspond to the desired room and time period. After it has acquired the locks, it can
check for overlapping bookings and insert a new booking as before. Note that the additional table
isn’t used to store information about the booking—it’s purely a collection of locks which is used
to prevent bookings on the same room and time range from being modified concurrently.
This approach is called *materializing conflicts*, because it takes a phantom and turns it into a
lock conflict on a concrete set of rows that exist in the database [^14]. Unfortunately, it can be hard and
error-prone to figure out how to materialize conflicts, and it’s ugly to let a concurrency control
mechanism leak into the application data model. For those reasons, materializing conflicts should be
considered a last resort if no alternative is possible. A serializable isolation level is much
preferable in most cases.
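
If you do have to materialize conflicts for the meeting room example, the application needs to work out which lock rows a booking touches. The following sketch computes them for 15-minute slots. The `slots_to_lock` function and the `(room_id, slot_start)` key format are assumptions for illustration, and the code assumes a booking does not span midnight in any way that matters for slot alignment.

```python
from datetime import datetime, timedelta

SLOT = timedelta(minutes=15)


def slots_to_lock(room_id: int, start: datetime, end: datetime) -> list:
    """Return (room_id, slot_start) keys for every slot that the booking touches."""
    midnight = start.replace(hour=0, minute=0, second=0, microsecond=0)
    # Round the start time down to the beginning of its slot.
    slot_start = midnight + ((start - midnight) // SLOT) * SLOT
    keys = []
    while slot_start < end:
        keys.append((room_id, slot_start))
        slot_start += SLOT
    return keys


first = slots_to_lock(123, datetime(2025, 1, 1, 12, 0), datetime(2025, 1, 1, 13, 0))
second = slots_to_lock(123, datetime(2025, 1, 1, 12, 50), datetime(2025, 1, 1, 13, 10))

print(len(first))                # 4 slots: 12:00, 12:15, 12:30, 12:45
print(set(first) & set(second))  # the shared 12:45 slot forces a lock conflict
```

Any two overlapping bookings for the same room share at least one slot key, so locking those rows with `SELECT FOR UPDATE` makes the second transaction wait. The keys come out in ascending time order, and acquiring locks in a consistent order is a common way to reduce the risk of deadlock.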
## Serializability {#sec_transactions_serializability}
In this chapter we have seen several examples of transactions that are prone to race conditions.
Some race conditions are prevented by the read committed and snapshot isolation levels, but
others are not. We encountered some particularly tricky examples with write skew and phantoms. It’s
a sad situation:
* Isolation levels are hard to understand, and inconsistently implemented in different databases
(e.g., the meaning of “repeatable read” varies significantly).
* If you look at your application code, it’s difficult to tell whether it is safe to run at a
particular isolation level—especially in a large application, where you might not be aware of
all the things that may be happening concurrently.
* There are no good tools to help us detect race conditions. In principle, static analysis may
help [^33], but research techniques have not
yet found their way into practical use. Testing for concurrency issues is hard, because they are
usually nondeterministic—problems only occur if you get unlucky with the timing.
This is not a new problem—it has been like this since the 1970s, when weak isolation levels were
first introduced [^3]. All along, the answer
from researchers has been simple: use *serializable* isolation!
Serializable isolation is the strongest isolation level. It guarantees that even
though transactions may execute in parallel, the end result is the same as if they had executed one
at a time, *serially*, without any concurrency. Thus, the database guarantees that if the
transactions behave correctly when run individually, they continue to be correct when run
concurrently—in other words, the database prevents *all* possible race conditions.
But if serializable isolation is so much better than the mess of weak isolation levels, then why
isn’t everyone using it? To answer this question, we need to look at the options for implementing
serializability, and how they perform. Most databases that provide serializability today use one of
three techniques, which we will explore in the rest of this chapter:
* Literally executing transactions in a serial order (see [“Actual Serial Execution”](/en/ch8#sec_transactions_serial))
* Two-phase locking (see [“Two-Phase Locking (2PL)”](/en/ch8#sec_transactions_2pl)), which for several decades was the only viable option
* Optimistic concurrency control techniques such as serializable snapshot isolation (see
[“Serializable Snapshot Isolation (SSI)”](/en/ch8#sec_transactions_ssi))
### Actual Serial Execution {#sec_transactions_serial}
The simplest way of avoiding concurrency problems is to remove the concurrency entirely: to
execute only one transaction at a time, in serial order, on a single thread. By doing so, we completely
sidestep the problem of detecting and preventing conflicts between transactions: the resulting
isolation is by definition serializable.
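To make the idea concrete, here is a minimal sketch of a single-threaded execution loop in Python. The `SerialExecutor` class and the `withdraw` helper are hypothetical names invented for this illustration. Copying the whole state for each transaction is a simplification to get all-or-nothing behavior; it is not how a real database implements atomicity.

```python
from collections import deque


class SerialExecutor:
    """Runs queued transactions one at a time on the calling thread."""

    def __init__(self, initial_state=None):
        self.state = dict(initial_state or {})
        self.queue = deque()

    def submit(self, txn):
        # txn is a function that receives the state dict and may modify it.
        self.queue.append(txn)

    def run(self):
        results = []
        while self.queue:
            txn = self.queue.popleft()
            # Work on a copy so that a failed transaction leaves no partial writes.
            working = dict(self.state)
            try:
                results.append(("committed", txn(working)))
                self.state = working
            except Exception as exc:
                results.append(("aborted", str(exc)))
        return results


def withdraw(account, amount):
    def txn(state):
        if state[account] < amount:
            raise ValueError("insufficient funds")
        state[account] -= amount
        return state[account]
    return txn


executor = SerialExecutor({"alice": 100})
executor.submit(withdraw("alice", 70))
executor.submit(withdraw("alice", 70))
outcomes = executor.run()
```

Because the second withdrawal only starts after the first has finished, it sees the reduced balance and aborts. No locks or conflict detection are involved.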
Even though this seems like an obvious idea, it was only in the 2000s that database designers
decided that a single-threaded loop for executing transactions was feasible [^57].
If multi-threaded concurrency was considered essential for getting good performance during the
previous 30 years, what changed to make single-threaded execution possible?
Two developments caused this rethink:
* RAM became cheap enough that for many use cases it is now feasible to keep the entire
active dataset in memory (see [“Keeping everything in memory”](/en/ch4#sec_storage_inmemory)). When all data that a transaction needs to
access is in memory, transactions can execute much faster than if they have to wait for data to be
loaded from disk.
* Database designers realized that OLTP transactions are usually short and only make a small number
of reads and writes (see [“Analytical versus Operational Systems”](/en/ch1#sec_introduction_analytics)). By contrast, long-running analytic queries
are typically read-only, so they can be run on a consistent snapshot (using snapshot isolation)
outside of the serial execution loop.
The approach of executing transactions serially is implemented in VoltDB/H-Store, Redis, and Datomic,
for example [^58] [^59] [^60].
A system designed for single-threaded execution can sometimes perform better than a system that
supports concurrency, because it can avoid the coordination overhead of locking. However, its
throughput is limited to that of a single CPU core. In order to make the most of that single thread,
transactions need to be structured differently from their traditional form.
#### Encapsulating transactions in stored procedures {#encapsulating-transactions-in-stored-procedures}
In the early days of databases, the intention was that a database transaction could encompass an
entire flow of user activity. For example, booking an airline ticket is a multi-stage process
(searching for routes, fares, and available seats; deciding on an itinerary; booking seats on
each of the flights of the itinerary; entering passenger details; making payment). Database
designers thought that it would be neat if that entire process was one transaction so that it could
be committed atomically.
Unfortunately, humans are very slow to make up their minds and respond. If a database transaction
needs to wait for input from a user, the database needs to support a potentially huge number of
concurrent transactions, most of them idle. Most databases cannot do that efficiently, and so almost
all OLTP applications keep transactions short by avoiding interactively waiting for a user within a
transaction. On the web, this means that a transaction is committed within the same HTTP request—a
transaction does not span multiple requests. A new HTTP request starts a new transaction.
Even though the human has been taken out of the critical path, transactions have continued to be
executed in an interactive client/server style, one statement at a time. An application makes a
query, reads the result, perhaps makes another query depending on the result of the first query, and
so on. The queries and results are sent back and forth between the application code (running on one
machine) and the database server (on another machine).
In this interactive style of transaction, a lot of time is spent in network communication between
the application and the database. If you were to disallow concurrency in the database and only
process one transaction at a time, the throughput would be dreadful because the database would
spend most of its time waiting for the application to issue the next query for the current
transaction. In this kind of database, it’s necessary to process multiple transactions concurrently
in order to get reasonable performance.
For this reason, systems with single-threaded serial transaction processing don’t allow interactive
multi-statement transactions. Instead, the application must either limit itself to transactions
containing a single statement, or submit the entire transaction code to the database ahead of time,
as a *stored procedure* [^61].
The difference between interactive transactions and stored procedures is illustrated in
[Figure 8-9](/en/ch8#fig_transactions_stored_proc). Provided that all data required by a transaction is in memory, the
stored procedure can execute very quickly, without waiting for any network or disk I/O.
{{< figure src="/fig/ddia_0809.png" id="fig_transactions_stored_proc" caption="Figure 8-9. The difference between an interactive transaction and a stored procedure (using the example transaction of [Figure 8-8](/en/ch8#fig_transactions_write_skew))." class="w-full my-4" >}}
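The contrast can be sketched with a toy in-process "database" that merely counts simulated network round trips. Everything here (`ToyDatabase`, `register`, `call`, the doctor names) is a hypothetical stand-in for illustration, not the API of any real system.

```python
class ToyDatabase:
    def __init__(self):
        self.doctors = {"Aaliyah": True, "Bryce": True}  # name -> on_call
        self.round_trips = 0
        self.procedures = {}

    # Interactive style: each statement is a separate round trip.
    def count_on_call(self):
        self.round_trips += 1
        return sum(self.doctors.values())

    def set_on_call(self, name, on_call):
        self.round_trips += 1
        self.doctors[name] = on_call

    # Stored procedure style: the code is registered ahead of time,
    # and one call runs the whole transaction inside the database.
    def register(self, name, func):
        self.procedures[name] = func

    def call(self, name, *args):
        self.round_trips += 1
        return self.procedures[name](self.doctors, *args)


def go_off_call(doctors, name):
    # Only allow going off call if at least two doctors are on call.
    if sum(doctors.values()) >= 2:
        doctors[name] = False
        return True
    return False


interactive = ToyDatabase()
if interactive.count_on_call() >= 2:          # round trip 1
    interactive.set_on_call("Aaliyah", False)  # round trip 2

stored = ToyDatabase()
stored.register("go_off_call", go_off_call)
stored.call("go_off_call", "Aaliyah")          # a single round trip
```

Both variants leave the data in the same state, but in the interactive version the database sits idle between the two statements while it waits for the application.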
#### Pros and cons of stored procedures {#sec_transactions_stored_proc_tradeoffs}
Stored procedures have existed for some time in relational databases, and they have been part of the
SQL standard (SQL/PSM) since 1999. They have gained a somewhat bad reputation, for various reasons:
* Traditionally, each database vendor had its own language for stored procedures (Oracle has PL/SQL, SQL Server
has T-SQL, PostgreSQL has PL/pgSQL, etc.). These languages haven’t kept up with developments in
general-purpose programming languages, so they look quite ugly and archaic from today’s point of
view, and they lack the ecosystem of libraries that you find with most programming languages.
* Code running in a database is difficult to manage: compared to an application server, it’s harder
to debug, more awkward to keep in version control and deploy, trickier to test, and difficult to
integrate with a metrics collection system for monitoring.
* A database is often much more performance-sensitive than an application server, because a single
database instance is often shared by many application servers. A badly written stored procedure
(e.g., using a lot of memory or CPU time) in a database can cause much more trouble than equivalent
badly written code in an application server.
* In a multitenant system that allows tenants to write their own stored procedures, it’s a security
risk to execute untrusted code in the same process as the database kernel [^62].
However, those issues can be overcome. Modern implementations of stored procedures have abandoned
PL/SQL and use existing general-purpose programming languages instead: VoltDB uses Java or Groovy,
Datomic uses Java or Clojure, Redis uses Lua, and MongoDB uses JavaScript.
Stored procedures are also useful in cases where application logic can’t easily be embedded
elsewhere. Applications that use GraphQL, for example, might directly expose their database through
a GraphQL proxy. If the proxy doesn’t support complex validation logic, you can embed such logic
directly in the database using a stored procedure. If the database doesn’t support stored
procedures, you would have to deploy a validation service between the proxy and the database to do validation.
With stored procedures and in-memory data, executing all transactions on a single thread becomes
feasible. When stored procedures don’t need to wait for I/O and avoid the overhead of other
concurrency control mechanisms, they can achieve quite good throughput on a single thread.
VoltDB also uses stored procedures for replication: instead of copying a transaction’s writes from
one node to another, it executes the same stored procedure on each replica. VoltDB therefore
requires that stored procedures are *deterministic* (when run on different nodes, they must produce
the same result). If a transaction needs to use the current date and time, for example, it must do
so through special deterministic APIs (see [“Durable Execution and Workflows”](/en/ch5#sec_encoding_dataflow_workflows) for more details on
deterministic operations). This approach is called *state machine replication*, and we will return
to it in [Chapter 10](/en/ch10#ch_consistency).
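The following sketch shows why determinism matters when the same procedure is executed on every replica. It is an illustration only and does not reflect VoltDB's actual API: `Replica`, `record_login`, and the timestamps are assumptions made for this example. The key point is that the timestamp is chosen once and passed in as an argument, rather than each replica reading its own clock.

```python
class Replica:
    def __init__(self):
        self.state = {}

    def apply(self, procedure, args):
        # Every replica runs the same procedure with the same arguments.
        return procedure(self.state, *args)


def record_login(state, user, now):
    # Deterministic: the result depends only on the state and the arguments.
    # Calling time.time() here instead would let replicas diverge.
    previous = state.get(user, {"logins": 0})
    state[user] = {"last_login": now, "logins": previous["logins"] + 1}
    return state[user]["logins"]


replicas = [Replica(), Replica(), Replica()]
commands = [
    (record_login, ("aaliyah", 1735732800)),
    (record_login, ("aaliyah", 1735736400)),
]
for procedure, args in commands:
    for replica in replicas:
        replica.apply(procedure, args)
```

After both commands have been applied in the same order, all three replicas hold identical state.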
#### Sharding {#sharding}
Executing all transactions serially makes concurrency control much simpler, but limits the
transaction throughput of the database to the speed of a single CPU core on a single machine.
Read-only transactions may execute elsewhere, using snapshot isolation, but for applications with
high write throughput, the single-threaded transaction processor can become a serious bottleneck.
In order to scale to multiple CPU cores, and multiple nodes, you can shard your data
(see [Chapter 7](/en/ch7#ch_sharding)), which is supported in VoltDB. If you can find a way of sharding your dataset
so that each transaction only needs to read and write data within a single shard, then each shard
can have its own transaction processing thread running independently from the others. In this case,
you can give each CPU core its own shard, which allows your transaction throughput to scale linearly
with the number of CPU cores [^59].
However, for any transaction that needs to access multiple shards, the database must coordinate the
transaction across all the shards that it touches. The stored procedure needs to be performed in
lock-step across all shards to ensure serializability across the whole system.
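As a rough sketch of the routing decision, the code below assigns each key to a shard by hashing and classifies a transaction by the set of shards its keys fall into. The shard count, the hash function, and the names `shard_of` and `route` are assumptions for this example; real systems choose their own sharding schemes (see Chapter 7).

```python
import zlib

NUM_SHARDS = 4  # hypothetical: one shard per CPU core


def shard_of(key):
    # crc32 is stable across processes, unlike Python's built-in hash() for strings.
    return zlib.crc32(key.encode("utf-8")) % NUM_SHARDS


def shards_touched(keys):
    return sorted({shard_of(key) for key in keys})


def route(keys):
    shards = shards_touched(keys)
    if len(shards) == 1:
        # Can run on that shard's own thread, independently of the others.
        return ("single-shard", shards)
    # Needs coordination across every shard in the list.
    return ("cross-shard", shards)
```

A transaction that touches only one key is always single-shard. A transaction that touches many unrelated keys will almost certainly span several shards and incur the coordination cost.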
Since cross-shard transactions have additional coordination overhead, they are vastly slower than
single-shard transactions. VoltDB reports a throughput of about 1,000 cross-shard writes per second,
which is orders of magnitude below its single-shard throughput and cannot be increased by adding
more machines [^61]. More recent research
has explored ways of making multi-shard transactions more scalable [^63].
Whether transactions can be single-shard depends very much on the structure of the data used by the
application. Simple key-value data can often be sharded very easily, but data with multiple
secondary indexes is likely to require a lot of cross-shard coordination (see
[“Sharding and Secondary Indexes”](/en/ch7#sec_sharding_secondary_indexes)).
#### Summary of serial execution {#summary-of-serial-execution}
Serial execution of transactions has become a viable way of achieving serializable isolation within
certain constraints:
* Every transaction must be small and fast, because it takes only one slow transaction to stall all transaction processing.
* It is most appropriate in situations where the active dataset can fit in memory. Rarely accessed
data could potentially be moved to disk, but if it needed to be accessed in a single-threaded
transaction, the system would get very slow.
* Write throughput must be low enough to be handled on a single CPU core, or else transactions need
to be sharded without requiring cross-shard coordination.
* Cross-shard transactions are possible, but their throughput is hard to scale.
### Two-Phase Locking (2PL) {#sec_transactions_2pl}
For around 30 years, there was only one widely used algorithm for serializability in databases:
*two-phase locking* (2PL), sometimes called *strong strict two-phase locking* (SS2PL) to distinguish
it from other variants of 2PL.
--------
> [!TIP] 2PL IS NOT 2PC
Two-phase *locking* (2PL) and two-phase *commit* (2PC) are two very different things. 2PL provides
serializable isolation, whereas 2PC provides atomic commit in a distributed database (see
[“Two-Phase Commit (2PC)”](/en/ch8#sec_transactions_2pc)). To avoid confusion, it’s best to think of them as entirely separate
concepts and to ignore the unfortunate similarity in the names.
--------
We saw previously that locks are often used to prevent dirty writes (see
[“No dirty writes”](/en/ch8#sec_transactions_dirty_write)): if two transactions concurrently try to write to the same object,
the lock ensures that the second writer must wait until the first one has finished its transaction
(aborted or committed) before it may continue.
Two-phase locking is similar, but makes the lock requirements much stronger. Several transactions
are allowed to concurrently read the same object as long as nobody is writing to it. But as soon as
anyone wants to write (modify or delete) an object, exclusive access is required:
* If transaction A has read an object and transaction B wants to write to that object, B must wait
until A commits or aborts before it can continue. (This ensures that B can’t change the object
unexpectedly behind A’s back.)
* If transaction A has written an object and transaction B wants to read that object, B must wait
until A commits or aborts before it can continue. (Reading an old version of the object, like in
[Figure 8-4](/en/ch8#fig_transactions_read_committed), is not acceptable under 2PL.)
In 2PL, writers don’t just block other writers; they also block readers and vice
versa. Snapshot isolation has the mantra *readers never block writers, and writers never block
readers* (see [“Multi-version concurrency control (MVCC)”](/en/ch8#sec_transactions_snapshot_impl)), which captures this key difference between
snapshot isolation and two-phase locking. On the other hand, because 2PL provides serializability,
it protects against all the race conditions discussed earlier, including lost updates and write skew.
#### Implementation of two-phase locking {#implementation-of-two-phase-locking}
2PL is used by the serializable isolation level in MySQL (InnoDB) and SQL Server, and the
repeatable read isolation level in Db2 [^29].
The blocking of readers and writers is implemented by having a lock on each object in the
database. The lock can be held either in *shared mode* or in *exclusive mode* (a lock of this kind
is also known as a *multi-reader single-writer* lock). The lock is used as follows:
* If a transaction wants to read an object, it must first acquire the lock in shared mode. Several
transactions are allowed to hold the lock in shared mode simultaneously, but if another
transaction already has an exclusive lock on the object, these transactions must wait.
* If a transaction wants to write to an object, it must first acquire the lock in exclusive mode. No
other transaction may hold the lock at the same time (either in shared or in exclusive mode), so
if there is any existing lock on the object, the transaction must wait.
* If a transaction first reads and then writes an object, it may upgrade its shared lock to an
exclusive lock. The upgrade works the same as getting an exclusive lock directly.
* After a transaction has acquired the lock, it must continue to hold the lock until the end of the
transaction (commit or abort). This is where the name “two-phase” comes from: the first phase
(while the transaction is executing) is when the locks are acquired, and the second phase (at the
end of the transaction) is when all the locks are released.
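The rules above can be sketched as a small lock table. This is a simplified illustration, not any database's actual lock manager: the `LockManager` class is hypothetical, and instead of blocking, the acquire methods return `False` when the caller would have to wait.

```python
class LockManager:
    def __init__(self):
        self.shared = {}     # object -> set of transactions holding it in shared mode
        self.exclusive = {}  # object -> transaction holding it in exclusive mode

    def acquire_shared(self, txn, obj):
        owner = self.exclusive.get(obj)
        if owner is not None and owner != txn:
            return False  # another transaction is writing: must wait
        self.shared.setdefault(obj, set()).add(txn)
        return True

    def acquire_exclusive(self, txn, obj):
        owner = self.exclusive.get(obj)
        if owner is not None and owner != txn:
            return False  # another writer holds the lock: must wait
        if self.shared.get(obj, set()) - {txn}:
            return False  # other readers hold the lock: must wait
        # Also covers upgrading this transaction's own shared lock.
        self.exclusive[obj] = txn
        return True

    def release_all(self, txn):
        # Second phase: called once, at commit or abort.
        for holders in self.shared.values():
            holders.discard(txn)
        for obj in [o for o, owner in self.exclusive.items() if owner == txn]:
            del self.exclusive[obj]


locks = LockManager()
a_reads = locks.acquire_shared("A", "x")            # True
b_reads = locks.acquire_shared("B", "x")            # True: readers share
a_writes_early = locks.acquire_exclusive("A", "x")  # False: B is still reading
locks.release_all("B")                              # B commits
a_upgrades = locks.acquire_exclusive("A", "x")      # True: upgrade succeeds
b_reads_again = locks.acquire_shared("B", "x")      # False: A is writing
```

Note that there is no method for releasing a single lock: a transaction gives up all of its locks together when it ends.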
Since so many locks are in use, it can happen quite easily that transaction A is stuck waiting for
transaction B to release its lock, and vice versa. This situation is called *deadlock*. The database
automatically detects deadlocks between transactions and aborts one of them so that the others can
make progress. The aborted transaction needs to be retried by the application.
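One common way to detect deadlocks is to look for a cycle in a *waits-for graph*, in which each transaction points to the transactions whose locks it is waiting for. The text above does not say which detection method a given database uses, so the following is a sketch of that one approach, with a hypothetical `find_deadlock` function.

```python
def find_deadlock(waits_for):
    """Return a list of transactions that form a cycle, or None if there is none.

    waits_for maps each transaction to the transactions it is waiting for.
    """
    visiting, done = set(), set()

    def visit(txn, path):
        if txn in visiting:
            # We came back to a transaction on the current path: a cycle.
            return path[path.index(txn):]
        if txn in done:
            return None
        visiting.add(txn)
        path.append(txn)
        for other in waits_for.get(txn, ()):
            cycle = visit(other, path)
            if cycle:
                return cycle
        path.pop()
        visiting.discard(txn)
        done.add(txn)
        return None

    for txn in list(waits_for):
        cycle = visit(txn, [])
        if cycle:
            return cycle
    return None


deadlocked = find_deadlock({"A": ["B"], "B": ["A"]})
no_deadlock = find_deadlock({"A": ["B"], "B": []})
```

Once a cycle is found, aborting any one transaction in it breaks the deadlock; which victim to choose is a policy decision that this sketch leaves open.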
#### Performance of two-phase locking {#performance-of-two-phase-locking}
The big downside of two-phase locking, and the reason why it hasn’t been used by everybody since the
1970s, is performance: transaction throughput and response times of queries are significantly worse
under two-phase locking than under weak isolation.
This is partly due to the overhead of acquiring and releasing all those locks, but more importantly
due to reduced concurrency. By design, if two concurrent transactions try to do anything that may
in any way result in a race condition, one has to wait for the other to complete.
For example, if you have a transaction that needs to read an entire table (e.g., a backup, analytics
query, or integrity check, as discussed in [“Snapshot Isolation and Repeatable Read”](/en/ch8#sec_transactions_snapshot_isolation)), that
transaction has to take a shared lock on the entire table. Therefore, the reading transaction first
has to wait until all in-progress transactions writing to that table have completed; then, while the
whole table is being read (which may take a long time on a large table), all other transactions that
want to write to that table are blocked until the big read-only transaction commits. In effect, the
database becomes unavailable for writes for an extended time.
For this reason, databases running 2PL can have quite unstable latencies, and they can be very slow at
high percentiles (see [“Describing Performance”](/en/ch2#sec_introduction_percentiles)) if there is contention in the workload. It
may take just one slow transaction, or one transaction that accesses a lot of data and acquires many
locks, to cause the rest of the system to grind to a halt.
Although deadlocks can happen with the lock-based read committed isolation level, they occur much
more frequently under 2PL serializable isolation (depending on the access patterns of your
transaction). This can be an additional performance problem: when a transaction is aborted due to
deadlock and is retried, it needs to do its work all over again. If deadlocks are frequent, this can
mean significant wasted effort.
#### Predicate locks {#predicate-locks}
In the preceding description of locks, we glossed over a subtle but important detail. In
[“Phantoms causing write skew”](/en/ch8#sec_transactions_phantom) we discussed the problem of *phantoms*—that is, one transaction
changing the results of another transaction’s search query. A database with serializable isolation
must prevent phantoms.
In the meeting room booking example this means that if one transaction has searched for existing
bookings for a room within a certain time window (see [Example 8-2](/en/ch8#fig_transactions_meeting_rooms)), another
transaction is not allowed to concurrently insert or update another booking for the same room and
time range. (It’s okay to concurrently insert bookings for other rooms, or for the same room at a
different time that doesn’t affect the proposed booking.)
How do we implement this? Conceptually, we need a *predicate lock* [^4]. It works similarly to the
shared/exclusive lock described earlier, but rather than belonging to a particular object (e.g., one
row in a table), it belongs to all objects that match some search condition, such as:
```sql
SELECT * FROM bookings
WHERE room_id = 123 AND
end_time > '2025-01-01 12:00' AND
start_time < '2025-01-01 13:00';
```
A predicate lock restricts access as follows:
* If transaction A wants to read objects matching some condition, like in that `SELECT` query, it
must acquire a shared-mode predicate lock on the conditions of the query. If another transaction B
currently has an exclusive lock on any object matching those conditions, A must wait until B
releases its lock before it is allowed to make its query.
* If transaction A wants to insert, update, or delete any object, it must first check whether either the old
or the new value matches any existing predicate lock. If there is a matching predicate lock held by
transaction B, then A must wait until B has committed or aborted before it can continue.
The key idea here is that a predicate lock applies even to objects that do not yet exist in the
database, but which might be added in the future (phantoms). If two-phase locking includes predicate locks,
the database prevents all forms of write skew and other race conditions, and so its isolation
becomes serializable.
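The writer's side of this check (the second rule above) can be sketched by representing each predicate lock as a function over rows. The `PredicateLock` class and the helper names are hypothetical, and timestamps are compared as strings, which works only because they share the fixed `YYYY-MM-DD HH:MM` format.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PredicateLock:
    owner: str
    matches: Callable[[dict], bool]


def overlaps_noon_booking(row):
    # Same condition as the WHERE clause of the query above.
    return (row["room_id"] == 123
            and row["end_time"] > "2025-01-01 12:00"
            and row["start_time"] < "2025-01-01 13:00")


def blocking_owner(locks, txn, old_row, new_row):
    """Return the transaction that txn must wait for, or None.

    old_row is None for an insert; new_row is None for a delete.
    """
    for lock in locks:
        if lock.owner == txn:
            continue
        for row in (old_row, new_row):
            if row is not None and lock.matches(row):
                return lock.owner
    return None


def booking(room_id, start, end):
    return {"room_id": room_id,
            "start_time": f"2025-01-01 {start}",
            "end_time": f"2025-01-01 {end}"}


locks = [PredicateLock("A", overlaps_noon_booking)]
insert_overlapping = blocking_owner(locks, "B", None, booking(123, "12:30", "13:30"))
insert_other_room = blocking_owner(locks, "B", None, booking(456, "12:30", "13:30"))
insert_later = blocking_owner(locks, "B", None, booking(123, "14:00", "15:00"))
move_away = blocking_owner(locks, "B", booking(123, "12:00", "12:30"),
                           booking(123, "15:00", "16:00"))
```

The inserted row did not exist when transaction A ran its query, yet the lock still applies to it. Moving an existing booking out of the time window is also blocked, because the old value matches the predicate.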
#### Index-range locks {#sec_transactions_2pl_range}
Unfortunately, predicate locks do not perform well: if there are many locks by active transactions,
checking for matching locks becomes time-consuming. For that reason, most databases with 2PL
actually implement *index-range locking* (also known as *next-key locking*), which is a simplified
approximation of predicate locking [^54] [^64].
It’s safe to simplify a predicate by making it match a greater set of objects. For example, if you
have a predicate lock for bookings of room 123 between noon and 1 p.m., you can approximate it by
locking bookings for room 123 at any time, or you can approximate it by locking all rooms (not just
room 123) between noon and 1 p.m. This is safe because any write that matches the original predicate
will definitely also match the approximations.
In the room bookings database you would probably have an index on the `room_id` column, and/or
indexes on `start_time` and `end_time` (otherwise the preceding query would be very slow on a large database):
* Say your index is on `room_id`, and the database uses this index to find existing bookings for
room 123. Now the database can simply attach a shared lock to this index entry, indicating that a
transaction has searched for bookings of room 123.
* Alternatively, if the database uses a time-based index to find existing bookings, it can attach a
shared lock to a range of values in that index, indicating that a transaction has searched for
bookings that overlap with the time period of noon to 1 p.m. on January 1, 2025.
Either way, an approximation of the search condition is attached to one of the indexes. Now, if
another transaction wants to insert, update, or delete a booking for the same room and/or an
overlapping time period, it will have to update the same part of the index. In the process of doing
so, it will encounter the shared lock, and it will be forced to wait until the lock is released.
This provides effective protection against phantoms and write skew. Index-range locks are not as
precise as predicate locks would be (they may lock a bigger range of objects than is strictly
necessary to maintain serializability), but since they have much lower overheads, they are a good
compromise.
If there is no suitable index where a range lock can be attached, the database can fall back to a
shared lock on the entire table. This will not be good for performance, since it will stop all
other transactions writing to the table, but it’s a safe fallback position.
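As an illustration of the coarser approximation, the sketch below attaches shared locks to entries of an index on `room_id` and ignores the time range entirely. The `RoomIndexLocks` class is a hypothetical simplification; real next-key locking operates on ranges within a B-tree index.

```python
class RoomIndexLocks:
    """Shared locks attached to entries of a hypothetical index on room_id."""

    def __init__(self):
        self.shared = {}  # room_id -> set of transactions that searched this entry

    def lock_search(self, txn, room_id):
        self.shared.setdefault(room_id, set()).add(txn)

    def blockers_for_write(self, txn, room_id):
        # A writer touching this index entry must wait for these transactions.
        return self.shared.get(room_id, set()) - {txn}

    def release_all(self, txn):
        for holders in self.shared.values():
            holders.discard(txn)


locks = RoomIndexLocks()
locks.lock_search("A", 123)  # A searched for bookings of room 123, noon to 1 p.m.

same_room_at_3pm = locks.blockers_for_write("B", 123)
other_room = locks.blockers_for_write("B", 456)
locks.release_all("A")       # A commits
after_release = locks.blockers_for_write("B", 123)
```

Transaction B is blocked even when it books room 123 at 3 p.m., which a precise predicate lock would have allowed. That loss of precision is the price paid for a check that is a simple index lookup.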
### Serializable Snapshot Isolation (SSI) {#sec_transactions_ssi}
This chapter has painted a bleak picture of concurrency control in databases. On the one hand, we
have implementations of serializability that don’t perform well (two-phase locking) or don’t scale
well (serial execution). On the other hand, we have weak isolation levels that have good
performance, but are prone to various race conditions (lost updates, write skew, phantoms, etc.). Are
serializable isolation and good performance fundamentally at odds with each other?
It seems not: an algorithm called *serializable snapshot isolation* (SSI) provides full
serializability with only a small performance penalty compared to snapshot isolation. SSI is
comparatively new: it was first described in 2008 [^53] [^65].
Today SSI and similar algorithms are used in single-node databases (the serializable isolation level
in PostgreSQL [^54], SQL Server’s In-Memory OLTP/Hekaton [^66], and HyPer [^67]), distributed databases (CockroachDB [^5] and
FoundationDB [^8]), and embedded storage engines such as BadgerDB.
#### Pessimistic versus optimistic concurrency control {#pessimistic-versus-optimistic-concurrency-control}
Two-phase locking is a so-called *pessimistic* concurrency control mechanism: it is based on the
principle that if anything might possibly go wrong (as indicated by a lock held by another
transaction), it’s better to wait until the situation is safe again before doing anything. It is
like *mutual exclusion*, which is used to protect data structures in multi-threaded programming.
Serial execution is, in a sense, pessimistic to the extreme: it is essentially equivalent to each
transaction having an exclusive lock on the entire database (or one shard of the database) for the
duration of the transaction. We compensate for the pessimism by making each transaction very fast to
execute, so it only needs to hold the “lock” for a short time.
By contrast, serializable snapshot isolation is an *optimistic* concurrency control technique.
Optimistic in this context means that instead of blocking if something potentially dangerous
happens, transactions continue anyway, in the hope that everything will turn out all right. When a
transaction wants to commit, the database checks whether anything bad happened (i.e., whether
isolation was violated); if so, the transaction is aborted and has to be retried. Only transactions
that executed serializably are allowed to commit.
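The general shape of an optimistic scheme can be sketched with per-object version numbers that are validated at commit time. This is a generic illustration of the optimistic approach, not the SSI algorithm itself, and the `OptimisticStore` class is a hypothetical name. It assumes that `commit` runs atomically.

```python
class OptimisticStore:
    def __init__(self, data):
        self.data = {key: (value, 0) for key, value in data.items()}  # key -> (value, version)

    def begin(self):
        return {"reads": {}, "writes": {}}

    def read(self, txn, key):
        if key in txn["writes"]:
            return txn["writes"][key]
        value, version = self.data[key]
        txn["reads"].setdefault(key, version)  # remember which version we saw
        return value

    def write(self, txn, key, value):
        txn["writes"][key] = value  # buffered until commit; nobody is blocked

    def commit(self, txn):
        # Validation: abort if anything we read has changed since we read it.
        for key, seen_version in txn["reads"].items():
            if self.data[key][1] != seen_version:
                return False
        for key, value in txn["writes"].items():
            _, version = self.data.get(key, (None, -1))
            self.data[key] = (value, version + 1)
        return True


def go_off_call(store, txn, name):
    on_call = sum(store.read(txn, doctor) for doctor in ("aaliyah", "bryce"))
    if on_call >= 2:
        store.write(txn, name, False)


store = OptimisticStore({"aaliyah": True, "bryce": True})
t1, t2 = store.begin(), store.begin()
go_off_call(store, t1, "aaliyah")
go_off_call(store, t2, "bryce")
first = store.commit(t1)    # True
second = store.commit(t2)   # False: t2 read a value that t1 has since changed

retry = store.begin()
go_off_call(store, retry, "bryce")  # now sees only one doctor on call
retried = store.commit(retry)       # True, but nothing was written
```

Neither transaction waited for the other while executing. The cost of the conflict is paid only at commit time, as an abort and a retry.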
Optimistic concurrency control is an old idea [^68], and its advantages and disadvantages have been debated for a long time [^69].
It performs badly if there is high contention (many transactions trying to access the same objects),
as this leads to a high proportion of transactions needing to abort. If the system is already close
to its maximum throughput, the additional transaction load from retried transactions can make
performance worse.
However, if there is enough spare capacity, and if contention between transactions is not too high,
optimistic concurrency control techniques tend to perform better than pessimistic ones. Contention
can be reduced with commutative atomic operations: for example, if several transactions concurrently
want to increment a counter, it doesn’t matter in which order the increments are applied (as long as
the counter isn’t read in the same transaction), so the concurrent increments can all be applied
without conflicting.
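A quick way to convince yourself that increments commute, while blind overwrites do not, is to apply the same operations in every possible order. The function names here are made up for this example.

```python
from itertools import permutations


def apply_increments(start, deltas):
    value = start
    for delta in deltas:
        value += delta
    return value


def apply_overwrites(start, values):
    value = start
    for new_value in values:
        value = new_value  # last writer wins
    return value


operations = (1, 5, 10)
increment_results = {apply_increments(0, order) for order in permutations(operations)}
overwrite_results = {apply_overwrites(0, order) for order in permutations(operations)}
# increment_results == {16}
# overwrite_results == {1, 5, 10}
```

Every ordering of the increments produces the same final value, so the database is free to apply them in whatever order they arrive. The overwrites produce a different result depending on which one happens last, so their order matters.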
As the name suggests, SSI is based on snapshot isolation—that is, all reads within a transaction
are made from a consistent snapshot of the database (see [“Snapshot Isolation and Repeatable Read”](/en/ch8#sec_transactions_snapshot_isolation)).
On top of snapshot isolation, SSI adds an algorithm for detecting serialization conflicts among
reads and writes, and determining which transactions to abort.
#### Decisions based on an outdated premise {#decisions-based-on-an-outdated-premise}
When we previously discussed write skew in snapshot isolation (see [“Write Skew and Phantoms”](/en/ch8#sec_transactions_write_skew)),
we observed a recurring pattern: a transaction reads some data from the database, examines the
result of the query, and decides to take some action (write to the database) based on the result
that it saw. However, under snapshot isolation, the result from the original query may no longer be
up-to-date by the time the transaction commits, because the data may have been modified in the meantime.
Put another way, the transaction is taking an action based on a *premise* (a fact that was true at
the beginning of the transaction, e.g., “There are currently two doctors on call”). Later, when the
transaction wants to commit, the original data may have changed—the premise may no longer be
true.
When the application makes a query (e.g., “How many doctors are currently on call?”), the database
doesn’t know how the application logic uses the result of that query. To be safe, the database needs
to assume that any change in the query result (the premise) means that writes in that transaction
may be invalid. In other words, there may be a causal dependency between the queries and the writes
in the transaction. In order to provide serializable isolation, the database must detect situations
in which a transaction may have acted on an outdated premise and abort the transaction in that case.
How does the database know if a query result might have changed? There are two cases to consider:
* Detecting reads of a stale MVCC object version (an uncommitted write occurred before the read)
* Detecting writes that affect prior reads (the write occurs after the read)
#### Detecting stale MVCC reads {#detecting-stale-mvcc-reads}
Recall that snapshot isolation is usually implemented by multi-version concurrency control (MVCC;
see [“Multi-version concurrency control (MVCC)”](/en/ch8#sec_transactions_snapshot_impl)). When a transaction reads from a consistent snapshot in an
MVCC database, it ignores writes that were made by any other transactions that hadn’t yet committed
at the time when the snapshot was taken.
In [Figure 8-10](/en/ch8#fig_transactions_detect_mvcc), transaction 43 sees
Aaliyah as having `on_call = true`, because transaction 42 (which modified Aaliyah’s on-call status) is
uncommitted. However, by the time transaction 43 wants to commit, transaction 42 has already
committed. This means that the write that was ignored when reading from the consistent snapshot has
now taken effect, and transaction 43’s premise is no longer true. Things get even more complicated
when a writer inserts data that didn’t exist before (see [“Phantoms causing write skew”](/en/ch8#sec_transactions_phantom)). We’ll
discuss detecting phantom writes for SSI in [“Detecting writes that affect prior reads”](/en/ch8#sec_detecting_writes_affect_reads).
{{< figure src="/fig/ddia_0810.png" id="fig_transactions_detect_mvcc" caption="Figure 8-10. Detecting when a transaction reads outdated values from an MVCC snapshot." class="w-full my-4" >}}
In order to prevent this anomaly, the database needs to track when a transaction ignores another
transaction’s writes due to MVCC visibility rules. When the transaction wants to commit, the
database checks whether any of the ignored writes have now been committed. If so, the transaction
must be aborted.
Why wait until committing? Why not abort transaction 43 immediately when the stale read is detected?
Well, if transaction 43 was a read-only transaction, it wouldn’t need to be aborted, because there
is no risk of write skew. At the time when transaction 43 makes its read, the database doesn’t yet
know whether that transaction is going to later perform a write. Moreover, transaction 42 may yet
abort or may still be uncommitted at the time when transaction 43 is committed, and so the read may
turn out not to have been stale after all. By avoiding unnecessary aborts, SSI preserves snapshot
isolation’s support for long-running reads from a consistent snapshot.
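The bookkeeping described in this section can be sketched as follows. The `StaleReadTracker` class is a hypothetical simplification that follows the rule as stated here: abort at commit time if the transaction has written and an ignored write has since been committed. Real implementations track considerably more detail.

```python
class StaleReadTracker:
    def __init__(self):
        self.status = {}      # txn -> "active", "committed", or "aborted"
        self.ignored = {}     # reader -> writers whose writes its snapshot ignored
        self.writers = set()  # transactions that have performed a write

    def begin(self, txn):
        self.status[txn] = "active"
        self.ignored[txn] = set()

    def ignore_write(self, reader, writer):
        # Called when MVCC visibility rules hide writer's change from reader.
        self.ignored[reader].add(writer)

    def write(self, txn):
        self.writers.add(txn)

    def abort(self, txn):
        self.status[txn] = "aborted"

    def try_commit(self, txn):
        premise_outdated = any(self.status[w] == "committed" for w in self.ignored[txn])
        if premise_outdated and txn in self.writers:
            self.status[txn] = "aborted"
            return False
        self.status[txn] = "committed"
        return True


db = StaleReadTracker()
for txn in (42, 43, 44):
    db.begin(txn)
db.write(42)             # 42 takes Aaliyah off call (not yet committed)
db.ignore_write(43, 42)  # 43 still sees on_call = true
db.ignore_write(44, 42)  # 44 is a read-only transaction
committed_42 = db.try_commit(42)  # True
db.write(43)
committed_43 = db.try_commit(43)  # False: its premise is outdated
committed_44 = db.try_commit(44)  # True: read-only, so no risk of write skew
```

If transaction 42 had aborted instead of committing, the ignored write would never have taken effect, and transaction 43 would have been allowed to commit.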
#### Detecting writes that affect prior reads {#sec_detecting_writes_affect_reads}
The second case to consider is when another transaction modifies data after it has been read. This
case is illustrated in [Figure 8-11](/en/ch8#fig_transactions_detect_index_range).
{{< figure src="/fig/ddia_0811.png" id="fig_transactions_detect_index_range" caption="Figure 8-11. In serializable snapshot isolation, detecting when one transaction modifies another transaction's reads." class="w-full my-4" >}}
In the context of two-phase locking we discussed index-range locks (see
[“Index-range locks”](/en/ch8#sec_transactions_2pl_range)), which allow the database to lock access to all rows matching some
search query, such as `WHERE shift_id = 1234`. We can use a similar technique here, except that SSI
locks don’t block other transactions.
In [Figure 8-11](/en/ch8#fig_transactions_detect_index_range), transactions 42 and 43 both search for on-call doctors
during shift `1234`. If there is an index on `shift_id`, the database can use the index entry 1234 to
record the fact that transactions 42 and 43 read this data. (If there is no index, this information
can be tracked at the table level.) This information only needs to be kept for a while: after a
transaction has finished (committed or aborted), and all concurrent transactions have finished, the
database can forget what data it read.
When a transaction writes to the database, it must look in the indexes for any other transactions
that have recently read the affected data. This process is similar to acquiring a write lock on the affected
key range, but rather than blocking until the readers have committed, the lock acts as a tripwire:
it simply notifies the transactions that the data they read may no longer be up to date.
In [Figure 8-11](/en/ch8#fig_transactions_detect_index_range), transaction 43 notifies transaction 42 that its prior
read is outdated, and vice versa. Transaction 42 is first to commit, and it is successful: although
transaction 43’s write affected 42, 43 hasn’t yet committed, so the write has not yet taken effect.
However, when transaction 43 wants to commit, the conflicting write from 42 has already been
committed, so 43 must abort.
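The tripwire idea can be illustrated with a small Python sketch. The class `ReadTripwires` and its methods are hypothetical names chosen for this example. It tracks reads per index key only, and it omits the cleanup of read records once all concurrent transactions have finished.

```python
class ReadTripwires:
    """Non-blocking read tracking: writes notify earlier readers instead of waiting."""

    def __init__(self):
        self.readers = {}       # index key -> txids that have read it
        self.conflicts = {}     # txid -> txids whose writes affected its reads
        self.committed = set()  # txids that have committed

    def read(self, txid, key):
        self.readers.setdefault(key, set()).add(txid)

    def write(self, txid, key):
        # Trip the wire: tell every other reader of this key that its read may be outdated.
        for reader in self.readers.get(key, set()):
            if reader != txid:
                self.conflicts.setdefault(reader, set()).add(txid)

    def try_commit(self, txid):
        # A conflicting write only matters once the writer has committed.
        if self.conflicts.get(txid, set()) & self.committed:
            return False
        self.committed.add(txid)
        return True


tripwires = ReadTripwires()
tripwires.read(42, 1234)   # both transactions look up shift 1234
tripwires.read(43, 1234)
tripwires.write(42, 1234)  # each then modifies rows under that index entry
tripwires.write(43, 1234)
first = tripwires.try_commit(42)   # succeeds: 43 has not committed yet
second = tripwires.try_commit(43)  # fails: 42's conflicting write is committed
```

Neither `read` nor `write` ever waits. The only point at which a transaction can be stopped is `try_commit`.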
#### Performance of serializable snapshot isolation {#performance-of-serializable-snapshot-isolation}
As always, many engineering details affect how well an algorithm works in practice. For example, one
trade-off is the granularity at which transactions’ reads and writes are tracked. If the database
keeps track of each transaction’s activity in great detail, it can be precise about which
transactions need to abort, but the bookkeeping overhead can become significant. Less detailed
tracking is faster, but may lead to more transactions being aborted than strictly necessary.
In some cases, it’s okay for a transaction to read information that was overwritten by another
transaction: depending on what else happened, it’s sometimes possible to prove that the result of
the execution is nevertheless serializable. PostgreSQL uses this theory to reduce the number of
unnecessary aborts [^14] [^54].
Compared to two-phase locking, the big advantage of serializable snapshot isolation is that one
transaction doesn’t need to block waiting for locks held by another transaction. As under snapshot
isolation, writers don’t block readers, and vice versa. This design principle makes query latency
much more predictable and less variable. In particular, read-only queries can run on a consistent
snapshot without requiring any locks, which is very appealing for read-heavy workloads.
Compared to serial execution, serializable snapshot isolation is not limited to the throughput of a
single CPU core: for example, FoundationDB distributes the detection of serialization conflicts across multiple
machines, allowing it to scale to very high throughput. Even though data may be sharded across
multiple machines, transactions can read and write data in multiple shards while ensuring
serializable isolation.
Compared to non-serializable snapshot isolation, the need to check for serializability violations
introduces some performance overheads. How significant these overheads are is a matter of debate:
some believe that serializability checking is not worth it [^70],
while others believe that the performance of serializability is now so good that there is no need to
use the weaker snapshot isolation any more [^67].
The rate of aborts significantly affects the overall performance of SSI. For example, a transaction
that reads and writes data over a long period of time is likely to run into conflicts and abort, so
SSI requires that read-write transactions be fairly short (long-running read-only transactions are
okay). However, SSI is less sensitive to slow transactions than two-phase locking or serial
execution.
## Distributed Transactions {#sec_transactions_distributed}
The last few sections have focused on concurrency control for isolation, the I in ACID. The
algorithms we have seen apply to both single-node and distributed databases: although there are
challenges in making concurrency control algorithms scalable (for example, performing distributed
serializability checking for SSI), the high-level ideas for distributed concurrency control are
similar to single-node concurrency control [^8].
Consistency and durability also don’t change much when we move to distributed transactions. However,
atomicity requires more care.
For transactions that execute at a single database node, atomicity is commonly implemented by the
storage engine. When the client asks the database node to commit the transaction, the database makes
the transaction’s writes durable (typically in a write-ahead log; see [“Making B-trees reliable”](/en/ch4#sec_storage_btree_wal)) and
then appends a commit record to the log on disk. If the database crashes in the middle of this
process, the transaction is recovered from the log when the node restarts: if the commit record was
successfully written to disk before the crash, the transaction is considered committed; if not, any
writes from that transaction are rolled back.
Thus, on a single node, transaction commitment crucially depends on the *order* in which data is
durably written to disk: first the data, then the commit record [^22].
The key deciding moment for whether the transaction commits or aborts is the moment at which the
disk finishes writing the commit record: before that moment, it is still possible to abort (due to a
crash), but after that moment, the transaction is committed (even if the database crashes). Thus, it
is a single device (the controller of one particular disk drive, attached to one particular node)
that makes the commit atomic.
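As a rough illustration of this rule, the following Python sketch replays a log after a crash and keeps only the writes of transactions whose commit record made it into the log. The tuple-based record format and the function name `recover` are assumptions made for this example, and an in-memory list stands in for the durable log on disk.

```python
def recover(log):
    """Rebuild state from a write-ahead log, ignoring uncommitted transactions.

    Assumed record formats (hypothetical):
      ("write", txid, key, value)
      ("commit", txid)
    """
    # First pass: find every transaction whose commit record reached the log.
    committed = {record[1] for record in log if record[0] == "commit"}

    # Second pass: apply writes in log order, but only for committed transactions.
    state = {}
    for record in log:
        if record[0] == "write" and record[1] in committed:
            _, _txid, key, value = record
            state[key] = value
    return state


log = [
    ("write", 1, "x", "a"),
    ("commit", 1),
    ("write", 2, "x", "b"),
    ("write", 2, "y", "c"),
    # The crash happens here, before transaction 2's commit record is written.
]
state = recover(log)
```

Transaction 2's writes are in the log, but without a commit record they are discarded on recovery, so the presence of that single record decides the outcome.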
However, what if multiple nodes are involved in a transaction? For example, perhaps you have a
multi-object transaction in a sharded database, or a global secondary index (in which the
index entry may be on a different node from the primary data; see
[“Sharding and Secondary Indexes”](/en/ch7#sec_sharding_secondary_indexes)). Most “NoSQL” distributed datastores do not support such
distributed transactions, but various distributed relational databases do.
In these cases, it is not sufficient to simply send a commit request to all of the nodes and
independently commit the transaction on each one. It could easily happen that the commit succeeds on
some nodes and fails on other nodes, as shown in [Figure 8-12](/en/ch8#fig_transactions_non_atomic):
* Some nodes may detect a constraint violation or conflict, making an abort necessary, while other
nodes are successfully able to commit.
* Some of the commit requests might be lost in the network, eventually aborting due to a timeout,
while other commit requests get through.
* Some nodes may crash before the commit record is fully written and roll back on recovery, while
others successfully commit.
{{< figure src="/fig/ddia_0812.png" id="fig_transactions_non_atomic" caption="Figure 8-12. When a transaction involves multiple database nodes, it may commit on some and fail on others." class="w-full my-4" >}}
If some nodes commit the transaction but others abort it, the nodes become inconsistent with each
other. And once a transaction has been committed on one node, it cannot be retracted again if it
later turns out that it was aborted on another node. This is because once data has been committed,
it becomes visible to other transactions under *read committed* or stronger isolation. For example,
in [Figure 8-12](/en/ch8#fig_transactions_non_atomic), by the time user 1 notices that its commit failed on database 1,
user 2 has already read the data from the same transaction on database 2. If user 1’s transaction
was later aborted, user 2’s transaction would have to be reverted as well, since it was based on
data that was retroactively declared not to have existed.
A better approach is to ensure that the nodes involved in a transaction either all commit or all
abort, and to prevent a mixture of the two. Ensuring this is known as the *atomic commitment* problem.
### Two-Phase Commit (2PC) {#sec_transactions_2pc}
Two-phase commit is an algorithm for achieving atomic transaction commit across multiple nodes. It
is a classic algorithm in distributed databases [^13] [^71] [^72]. 2PC is used
internally in some databases and also made available to applications in the form of *XA transactions* [^73]
(which are supported by the Java Transaction API, for example) or via WS-AtomicTransaction for SOAP
web services [^74] [^75].
The basic flow of 2PC is illustrated in [Figure 8-13](/en/ch8#fig_transactions_two_phase_commit). Instead of a single
commit request, as with a single-node transaction, the commit/abort process in 2PC is split into two
phases (hence the name).
{{< figure src="/fig/ddia_0813.png" id="fig_transactions_two_phase_commit" title="Figure 8-13. A successful execution of two-phase commit (2PC)." class="w-full my-4" >}}
2PC uses a new component that does not normally appear in single-node transactions: a
*coordinator* (also known as *transaction manager*). The coordinator is often implemented as a
library within the same application process that is requesting the transaction (e.g., embedded in a
Java EE container), but it can also be a separate process or service. Examples of such coordinators
include Narayana, JOTM, BTM, and MSDTC.
When 2PC is used, a distributed
transaction begins with the application reading and writing data on multiple database nodes,
as normal. We call these database nodes *participants* in the transaction. When the application is
ready to commit, the coordinator begins phase 1: it sends a *prepare* request to each of the nodes,
asking them whether they are able to commit. The coordinator then tracks the responses from the
participants:
* If all participants reply “yes,” indicating they are ready to commit, then the coordinator sends
out a *commit* request in phase 2, and the commit actually takes place.
* If any of the participants replies “no,” the coordinator sends an *abort* request to all nodes in phase 2.
This process is somewhat like the traditional marriage ceremony in Western cultures: the minister
asks the bride and groom individually whether each wants to marry the other, and typically receives
the answer “I do” from both. After receiving both acknowledgments, the minister pronounces the
couple husband and wife: the transaction is committed, and the happy fact is broadcast to all
attendees. If either bride or groom does not say “yes,” the ceremony is aborted [^76].
#### A system of promises {#a-system-of-promises}
From this short description it might not be clear why two-phase commit ensures atomicity, while
one-phase commit across several nodes does not. Surely the prepare and commit requests can just
as easily be lost in the two-phase case. What makes 2PC different?
To understand why it works, we have to break down the process in a bit more detail:
1. When the application wants to begin a distributed transaction, it requests a transaction ID from
the coordinator. This transaction ID is globally unique.
2. The application begins a single-node transaction on each of the participants, and attaches the
globally unique transaction ID to the single-node transaction. All reads and writes are done in
one of these single-node transactions. If anything goes wrong at this stage (for example, a node
crashes or a request times out), the coordinator or any of the participants can abort.
3. When the application is ready to commit, the coordinator sends a prepare request to all
participants, tagged with the global transaction ID. If any of these requests fails or times out,
the coordinator sends an abort request for that transaction ID to all participants.
4. When a participant receives the prepare request, it makes sure that it can definitely commit
the transaction under all circumstances.
This includes writing all transaction data to disk (a crash, a power failure, or running out of
disk space is not an acceptable excuse for refusing to commit later), and checking for any
conflicts or constraint violations. By replying “yes” to the coordinator, the node promises to
commit the transaction without error if requested. In other words, the participant surrenders the
right to abort the transaction, but without actually committing it.
5. When the coordinator has received responses to all prepare requests, it makes a definitive
decision on whether to commit or abort the transaction (committing only if all participants voted
“yes”). The coordinator must write that decision to its transaction log on disk so that it knows
which way it decided in case it subsequently crashes. This is called the *commit point*.
6. Once the coordinator’s decision has been written to disk, the commit or abort request is sent
to all participants. If this request fails or times out, the coordinator must retry forever until
it succeeds. There is no more going back: if the decision was to commit, that decision must be
enforced, no matter how many retries it takes. If a participant has crashed in the meantime, the
transaction will be committed when it recovers—since the participant voted “yes,” it cannot
refuse to commit when it recovers.
Thus, the protocol contains two crucial “points of no return”: when a participant votes “yes,” it
promises that it will definitely be able to commit later (although the coordinator may still choose to
abort); and once the coordinator decides, that decision is irrevocable. Those promises ensure the
atomicity of 2PC. (Single-node atomic commit lumps these two events into one: writing the commit
record to the transaction log.)
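The steps above can be condensed into a small Python sketch. It is a simplified model rather than a production protocol: the classes `Participant` and `Coordinator` are hypothetical, calls are local method invocations instead of network requests, a list stands in for the coordinator's on-disk log, and timeouts and retries are left out.

```python
class Participant:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit  # whether local checks (constraints, conflicts) pass
        self.state = "active"

    def prepare(self, txid):
        if not self.can_commit:
            self.state = "aborted"
            return False
        # Voting "yes" is a promise: from here on, this node may not abort unilaterally.
        self.state = "prepared"
        return True

    def commit(self, txid):
        if self.state != "prepared":
            raise RuntimeError(f"{self.name} was asked to commit without having prepared")
        self.state = "committed"

    def abort(self, txid):
        if self.state == "committed":
            raise RuntimeError(f"{self.name} cannot abort a committed transaction")
        self.state = "aborted"


class Coordinator:
    def __init__(self):
        self.log = []  # stands in for the durable transaction log

    def commit_transaction(self, txid, participants):
        # Phase 1: collect votes from all participants.
        votes = [p.prepare(txid) for p in participants]
        decision = "commit" if all(votes) else "abort"
        # Commit point: the decision is recorded before anyone is told about it.
        self.log.append((txid, decision))
        # Phase 2: inform every participant of the decision.
        for p in participants:
            if decision == "commit":
                p.commit(txid)
            else:
                p.abort(txid)
        return decision


coordinator = Coordinator()
ok = [Participant("db1"), Participant("db2")]
outcome_ok = coordinator.commit_transaction("tx1", ok)
mixed = [Participant("db1"), Participant("db2", can_commit=False)]
outcome_mixed = coordinator.commit_transaction("tx2", mixed)
```

In the second run, `db1` has already promised to commit when `db2` votes "no", yet it ends up aborted along with `db2` because the coordinator's decision is the only thing that determines the outcome.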
Returning to the marriage analogy, before saying “I do,” you and your bride/groom have the freedom
to abort the transaction by saying “No way!” (or something to that effect). However, after saying “I
do,” you cannot retract that statement. If you faint after saying “I do” and you don’t hear the
minister speak the words “You are now husband and wife,” that doesn’t change the fact that the
transaction was committed. When you recover consciousness later, you can find out whether you are
married or not by querying the minister for the status of your global transaction ID, or you can
wait for the minister’s next retry of the commit request (since the retries will have continued
throughout your period of unconsciousness).
#### Coordinator failure {#coordinator-failure}
We have discussed what happens if one of the participants or the network fails during 2PC: if any of
the prepare requests fails or times out, the coordinator aborts the transaction; if any of the
commit or abort requests fails, the coordinator retries them indefinitely. However, it is less
clear what happens if the coordinator crashes.
If the coordinator fails before sending the prepare requests, a participant can safely abort the
transaction. But once the participant has received a prepare request and voted “yes,” it can no
longer abort unilaterally—it must wait to hear back from the coordinator whether the transaction
was committed or aborted. If the coordinator crashes or the network fails at this point, the
participant can do nothing but wait. A participant’s transaction in this state is called *in doubt*
or *uncertain*.
The situation is illustrated in [Figure 8-14](/en/ch8#fig_transactions_2pc_crash). In this particular example, the
coordinator actually decided to commit, and database 2 received the commit request. However, the
coordinator crashed before it could send the commit request to database 1, and so database 1 does
not know whether to commit or abort. Even a timeout does not help here: if database 1 unilaterally
aborts after a timeout, it will end up inconsistent with database 2, which has committed. Similarly,
it is not safe to unilaterally commit, because another participant may have aborted.
{{< figure src="/fig/ddia_0814.png" id="fig_transactions_2pc_crash" title="Figure 8-14. The coordinator crashes after participants vote \"yes.\" Database 1 does not know whether to commit or abort." class="w-full my-4" >}}
Without hearing from the coordinator, the participant has no way of knowing whether to commit or
abort. In principle, the participants could communicate among themselves to find out how each
participant voted and come to some agreement, but that is not part of the 2PC protocol.
The only way 2PC can complete is by waiting for the coordinator to recover. This is why the
coordinator must write its commit or abort decision to a transaction log on disk before sending
commit or abort requests to participants: when the coordinator recovers, it determines the status of
all in-doubt transactions by reading its transaction log. Any transactions that don’t have a commit
record in the coordinator’s log are aborted. Thus, the commit point of 2PC comes down to a regular
single-node atomic commit on the coordinator.
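The recovery rule can be sketched as follows. The log format (a list of `(txid, decision)` tuples) and the function name `resolve_in_doubt` are assumptions made for illustration.

```python
def resolve_in_doubt(coordinator_log, in_doubt_txids):
    """Decide the fate of in-doubt transactions after the coordinator restarts.

    A transaction commits only if the log holds a commit record for it;
    anything else is aborted.
    """
    committed = {txid for txid, decision in coordinator_log if decision == "commit"}
    return {
        txid: ("commit" if txid in committed else "abort")
        for txid in in_doubt_txids
    }


# tx3 reached the prepare phase, but the coordinator crashed before logging a decision.
coordinator_log = [("tx1", "commit"), ("tx2", "abort")]
outcomes = resolve_in_doubt(coordinator_log, ["tx1", "tx2", "tx3"])
```

Aborting `tx3` is safe here because, without a logged decision, the coordinator cannot have sent a commit request to any participant.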
#### Three-phase commit {#three-phase-commit}
Two-phase commit is called a *blocking* atomic commit protocol because 2PC can become
stuck waiting for the coordinator to recover. It is possible to make an atomic commit protocol
*nonblocking*, so that it does not get stuck if a node fails. However, making this work in practice
is not so straightforward.
As an alternative to 2PC, an algorithm called *three-phase commit* (3PC) has been proposed [^13] [^77].
However, 3PC assumes a network with bounded delay and nodes with bounded response times; in most
practical systems with unbounded network delay and process pauses (see [Chapter 9](/en/ch9#ch_distributed)), it
cannot guarantee atomicity.
A better solution in practice is to replace the single-node coordinator with a fault-tolerant
consensus protocol. We will see how to do this in [Chapter 10](/en/ch10#ch_consistency).
### Distributed Transactions Across Different Systems {#sec_transactions_xa}
Distributed transactions and two-phase commit have a mixed reputation. On the one hand, they are
seen as providing an important safety guarantee that would be hard to achieve otherwise; on the
other hand, they are criticized for causing operational problems, killing performance, and promising
more than they can deliver [^78] [^79] [^80] [^81].
Many cloud services choose not to implement distributed transactions due to the operational problems they engender [^82].
Some implementations of distributed transactions carry a heavy performance penalty. Much of the
performance cost inherent in two-phase commit is due to the additional disk forcing (`fsync`) that
is required for crash recovery, and the additional network round-trips.
However, rather than dismissing distributed transactions outright, we should examine them in some
more detail, because there are important lessons to be learned from them. To begin, we should be
precise about what we mean by “distributed transactions.” Two quite different types of distributed
transactions are often conflated:
Database-internal distributed transactions
: Some distributed databases (i.e., databases that use replication and sharding in their standard
configuration) support internal transactions among the nodes of that database. For example,
YugabyteDB, TiDB, FoundationDB, Spanner, VoltDB, and MySQL Cluster’s NDB storage engine have such
internal transaction support. In this case, all the nodes participating in the transaction are
running the same database software.
Heterogeneous distributed transactions
: In a *heterogeneous* transaction, the participants are two or more different technologies: for
example, two databases from different vendors, or even non-database systems such as message
brokers. A distributed transaction across these systems must ensure atomic commit, even though
the systems may be entirely different under the hood.
Database-internal transactions do not have to be compatible with any other system, so they can
use any protocol and apply optimizations specific to that particular technology. For that reason,
database-internal distributed transactions can often work quite well. On the other hand,
transactions spanning heterogeneous technologies are a lot more challenging.
#### Exactly-once message processing {#sec_transactions_exactly_once}
Heterogeneous distributed transactions allow diverse systems to be integrated in powerful ways. For
example, a message from a message queue can be acknowledged as processed if and only if the database
transaction for processing the message was successfully committed. This is implemented by atomically
committing the message acknowledgment and the database writes in a single transaction. With
distributed transaction support, this is possible, even if the message broker and the database are
two unrelated technologies running on different machines.
If either the message delivery or the database transaction fails, both are aborted, and so the
message broker may safely redeliver the message later. Thus, by atomically committing the message
and the side effects of its processing, we can ensure that the message is *effectively* processed
exactly once, even if it required a few retries before it succeeded. The abort discards any side
effects of the partially completed transaction. This is known as *exactly-once semantics*.
Such a distributed transaction is only possible if all systems affected by the transaction are able
to use the same atomic commit protocol, however. For example, say a side effect of processing a
message is to send an email, and the email server does not support two-phase commit: it could happen
that the email is sent two or more times if message processing fails and is retried. But if all side
effects of processing a message are rolled back on transaction abort, then the processing step can
safely be retried as if nothing had happened.
We will return to the topic of exactly-once semantics later in this chapter. Let’s look first at the
atomic commit protocol that allows such heterogeneous distributed transactions.
#### XA transactions {#xa-transactions}
*X/Open XA* (short for *eXtended Architecture*) is a standard for implementing two-phase commit
across heterogeneous technologies [^73]. It was introduced in 1991 and has been widely
implemented: XA is supported by many traditional relational databases (including PostgreSQL, MySQL,
Db2, SQL Server, and Oracle) and message brokers (including ActiveMQ, HornetQ, MSMQ, and IBM MQ).
XA is not a network protocol—it is merely a C API for interfacing with a transaction coordinator.
Bindings for this API exist in other languages; for example, in the world of Java EE applications,
XA transactions are implemented using the Java Transaction API (JTA), which in turn is supported by
many drivers for databases using Java Database Connectivity (JDBC) and drivers for message brokers
using the Java Message Service (JMS) APIs.
XA assumes that your application uses a network driver or client library to communicate with the
participant databases or messaging services. If the driver supports XA, that means it calls the XA
API to find out whether an operation should be part of a distributed transaction—and if so, it
sends the necessary information to the database server. The driver also exposes callbacks through
which the coordinator can ask the participant to prepare, commit, or abort.
The transaction coordinator implements the XA API. The standard does not specify how it should be
implemented, but in practice the coordinator is often simply a library that is loaded into the same
process as the application issuing the transaction (not a separate service). It keeps track of the
participants in a transaction, collects participants’ responses after asking them to prepare (via a
callback into the driver), and uses a log on the local disk to keep track of the commit/abort
decision for each transaction.
If the application process crashes, or the machine on which the application is running dies, the
coordinator goes with it. Any participants with prepared but uncommitted transactions are then stuck
in doubt. Since the coordinator’s log is on the application server’s local disk, that server must be
restarted, and the coordinator library must read the log to recover the commit/abort outcome of each
transaction. Only then can the coordinator use the database driver’s XA callbacks to ask
participants to commit or abort, as appropriate. The database server cannot contact the coordinator
directly, since all communication must go via its client library.
#### Holding locks while in doubt {#holding-locks-while-in-doubt}
Why do we care so much about a transaction being stuck in doubt? Can’t the rest of the system just
get on with its work, and ignore the in-doubt transaction that will be cleaned up eventually?
The problem is with *locking*. As discussed in [“Read Committed”](/en/ch8#sec_transactions_read_committed), database
transactions usually take a row-level exclusive lock on any rows they modify, to prevent dirty
writes. In addition, if you want serializable isolation, a database using two-phase locking would
also have to take a shared lock on any rows *read* by the transaction.
The database cannot release those locks until the transaction commits or aborts (illustrated as a
shaded area in [Figure 8-13](/en/ch8#fig_transactions_two_phase_commit)). Therefore, when using two-phase commit, a
transaction must hold onto the locks throughout the time it is in doubt. If the coordinator has
crashed and takes 20 minutes to start up again, those locks will be held for 20 minutes. If the
coordinator’s log is entirely lost for some reason, those locks will be held forever—or at least
until the situation is manually resolved by an administrator.
While those locks are held, no other transaction can modify those rows. Depending on the isolation
level, other transactions may even be blocked from reading those rows. Thus, other transactions
cannot simply continue with their business—if they want to access that same data, they will be
blocked. This can cause large parts of your application to become unavailable until the in-doubt
transaction is resolved.
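A toy lock table shows the effect. The `LockTable` class below is a hypothetical sketch that models only exclusive row locks. In a real database the second transaction would wait or time out rather than get an immediate refusal.

```python
class LockTable:
    """Exclusive row locks that are released only when a transaction is resolved."""

    def __init__(self):
        self.owner = {}  # row -> txid currently holding the exclusive lock

    def acquire(self, txid, row):
        # Grants the lock if the row is free or already held by this transaction.
        holder = self.owner.setdefault(row, txid)
        return holder == txid

    def release_all(self, txid):
        # Called only once the transaction has committed or aborted.
        self.owner = {row: t for row, t in self.owner.items() if t != txid}


locks = LockTable()
locks.acquire("tx1", "row:42")
# tx1 votes "yes" and the coordinator crashes: tx1 is now in doubt and keeps its lock.
blocked = not locks.acquire("tx2", "row:42")
# Only after the coordinator recovers and resolves tx1 are the locks released.
locks.release_all("tx1")
acquired_after = locks.acquire("tx2", "row:42")
```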
#### Recovering from coordinator failure {#recovering-from-coordinator-failure}
In theory, if the coordinator crashes and is restarted, it should cleanly recover its state from the
log and resolve any in-doubt transactions. However, in practice, *orphaned* in-doubt transactions do occur [^83] [^84] — that is,
transactions for which the coordinator cannot decide the outcome for whatever reason (e.g., because
the transaction log has been lost or corrupted due to a software bug). These transactions cannot be
resolved automatically, so they sit forever in the database, holding locks and blocking other
transactions.
Even rebooting your database servers will not fix this problem, since a correct implementation of
2PC must preserve the locks of an in-doubt transaction even across restarts (otherwise it would risk
violating the atomicity guarantee). It’s a sticky situation.
The only way out is for an administrator to manually decide whether to commit or roll back the
transactions. The administrator must examine the participants of each in-doubt transaction,
determine whether any participant has committed or aborted already, and then apply the same outcome
to the other participants. Resolving the problem potentially requires a lot of manual effort, and
most likely needs to be done under high stress and time pressure during a serious production outage
(otherwise, why would the coordinator be in such a bad state?).
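The reasoning an administrator has to follow can be written down as a short Python function. This is a sketch of the decision logic only: the state names and the function `decide_outcome` are hypothetical, and gathering the actual participant states is the hard, manual part.

```python
def decide_outcome(participant_states):
    """Derive the outcome of an orphaned transaction from its participants' states.

    `participant_states` maps a participant name to "committed", "aborted",
    or "in_doubt". Returns "commit", "abort", or None if nothing can be inferred.
    """
    states = set(participant_states.values())
    if "committed" in states and "aborted" in states:
        raise ValueError("participants disagree: atomicity has already been violated")
    if "committed" in states:
        return "commit"  # someone committed, so everyone else must commit too
    if "aborted" in states:
        return "abort"   # someone aborted, so everyone else must abort too
    return None          # every participant is in doubt: no safe automatic answer


outcome_a = decide_outcome({"db1": "in_doubt", "db2": "committed"})
outcome_b = decide_outcome({"db1": "aborted", "db2": "in_doubt"})
outcome_c = decide_outcome({"db1": "in_doubt", "db2": "in_doubt"})
```

The `None` case is the difficult one: when every participant is still in doubt, nothing in their states reveals what the coordinator decided.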
Many XA implementations have an emergency escape hatch called *heuristic decisions*: allowing a
participant to unilaterally decide to abort or commit an in-doubt transaction without a definitive
decision from the coordinator [^73]. To be clear,
*heuristic* here is a euphemism for *probably breaking atomicity*, since the heuristic decision
violates the system of promises in two-phase commit. Thus, heuristic decisions are intended only for
getting out of catastrophic situations, and not for regular use.
#### Problems with XA transactions {#problems-with-xa-transactions}
A single-node coordinator is a single point of failure for the entire system, and making it part of
the application server is also problematic because the coordinator’s logs on its local disk become a
crucial part of the durable system state—as important as the databases themselves.
In principle, the coordinator of an XA transaction could be highly available and replicated, just
like we would expect of any other important database. Unfortunately, this still doesn’t solve a
fundamental problem with XA, which is that it provides no way for the coordinator and the
participants of a transaction to communicate with each other directly. They can only communicate via
the application code that invoked the transaction, and the database drivers through which it calls
the participants.
Even if the coordinator were replicated, the application code would therefore be a single point of
failure. Solving this problem would require totally redesigning how application code is run to make
it replicated or restartable, which could perhaps look similar to durable execution (see
[“Durable Execution and Workflows”](/en/ch5#sec_encoding_dataflow_workflows)). However, there don’t seem to be any tools that actually take
this approach in practice.
Another problem is that since XA needs to be compatible with a wide range of data systems, it is
necessarily a lowest common denominator. For example, it cannot detect deadlocks across different
systems (since that would require a standardized protocol for systems to exchange information on the
locks that each transaction is waiting for), and it does not work with SSI (see
[“Serializable Snapshot Isolation (SSI)”](/en/ch8#sec_transactions_ssi)), since that would require a protocol for identifying conflicts across
different systems.
These problems are somewhat inherent in performing transactions across heterogeneous technologies.
However, keeping several heterogeneous data systems consistent with each other is still a real and
important problem, so we need to find a different solution to it. This can be done, as we will see
in the next section and in [“Derived data versus distributed transactions”](/en/ch13#sec_future_derived_vs_transactions).
### Database-internal Distributed Transactions {#sec_transactions_internal}
As explained previously, there is a big difference between distributed transactions that span
multiple heterogeneous storage technologies, and those that are internal to a system—i.e., where all
the participating nodes are shards of the same database running the same software. Such internal
distributed transactions are a defining feature of “NewSQL” databases such as
CockroachDB [^5], TiDB [^6], Spanner [^7], FoundationDB [^8], and YugabyteDB, for example.
Some message brokers such as Kafka also support internal distributed transactions [^85].
Many of these systems use two-phase commit to ensure atomicity of transactions that write to multiple
shards, and yet they don’t suffer the same problems as XA transactions. The reason is that
their distributed transactions don’t need to interface with any other technologies, so they avoid the
lowest-common-denominator trap: the designers of these systems are free to use better protocols that
are more reliable and faster.
The biggest problems with XA can be fixed by:
* Replicating the coordinator, with automatic failover to another coordinator node if the primary one crashes;
* Allowing the coordinator and data shards to communicate directly without going via application code;
* Replicating the participating shards, so that the risk of having to abort a transaction because of a fault in one of the shards is reduced; and
* Coupling the atomic commitment protocol with a distributed concurrency control protocol that supports deadlock detection and consistent reads across shards.
Consensus algorithms are commonly used to replicate the coordinator and the database shards. We will
see in [Chapter 10](/en/ch10#ch_consistency) how atomic commitment for distributed transactions can be implemented
using a consensus algorithm. These algorithms tolerate faults by automatically failing over from one
node to another without any human intervention, while continuing to guarantee strong consistency
properties.
The isolation levels offered for distributed transactions depend on the system, but snapshot
isolation and serializable snapshot isolation are both possible across shards. The details of how
this works can be found in the papers referenced at the end of this chapter.
#### Exactly-once message processing revisited {#exactly-once-message-processing-revisited}
We saw in [“Exactly-once message processing”](/en/ch8#sec_transactions_exactly_once) that an important use case for distributed transactions
is to ensure that some operation takes effect exactly once, even if a crash occurs while it is being
processed and the processing needs to be retried. If you can atomically commit a transaction across
a message broker and a database, you can acknowledge the message to the broker if and only if it was
successfully processed and the database writes resulting from processing it were committed.
However, you don’t actually need such distributed transactions to achieve exactly-once semantics. The
following alternative approach requires only transactions within the database:
1. Assume every message has a unique ID, and in the database you have a table of message IDs that
have been processed. When you start processing a message from the broker, you begin a new
transaction on the database, and check the message ID. If the same message ID is already present
in the database, you know that it has already been processed, so you can acknowledge the message
to the broker and drop it.
2. If the message ID is not already in the database, you add it to the table. You then process the
message, which may result in additional writes to the database within the same transaction. When
you finish processing the message, you commit the transaction on the database.
3. Once the database transaction is successfully committed, you can acknowledge the message to the
broker.
4. Once the message has successfully been acknowledged to the broker, you know that the broker
won’t deliver the same message again, so you can delete the message ID from the database (in a
separate transaction).
If the message processor crashes before committing the database transaction, the transaction is
aborted and the message broker will redeliver the message. If the processor crashes after committing
but before acknowledging the message to the broker, the broker will also redeliver the message, but
the retry will see the message ID in the database and drop it. If the processor crashes after acknowledging the message but before
deleting the message ID from the database, you will have an old message ID lying around, which
doesn’t do any harm besides taking a little bit of storage space. If a retry happens before the
database transaction is aborted (which could happen if communication between the message processor
and the database is interrupted), a uniqueness constraint on the table of message IDs should prevent
the same message ID from being inserted by two concurrent transactions.
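The steps above can be sketched with an ordinary single-node database. The following is a minimal illustration using Python's built-in `sqlite3` module. The table names, the `accounts` example, and the functions `process_message` and `forget_message` are assumptions made for this sketch, and the broker is not modeled: a redelivery is simulated by calling the function twice.

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode, so the
# transaction boundaries below are exactly the BEGIN/COMMIT we issue.
db = sqlite3.connect(":memory:", isolation_level=None)
# The primary key doubles as the uniqueness constraint on message IDs.
db.execute("CREATE TABLE processed_messages (msg_id TEXT PRIMARY KEY)")
db.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL)")
db.execute("INSERT INTO accounts VALUES (1, 100)")


def process_message(db, msg_id, amount):
    """Apply the message unless it was seen before. Returns True if applied.

    Either way, once this function returns, the caller may acknowledge
    the message to the broker (step 3).
    """
    db.execute("BEGIN")
    try:
        # Step 1: check whether this message ID has already been processed.
        seen = db.execute(
            "SELECT 1 FROM processed_messages WHERE msg_id = ?", (msg_id,)
        ).fetchone()
        if seen is None:
            # Step 2: record the ID and apply the effect in the same transaction.
            db.execute("INSERT INTO processed_messages (msg_id) VALUES (?)", (msg_id,))
            db.execute("UPDATE accounts SET balance = balance + ? WHERE id = 1", (amount,))
        db.execute("COMMIT")
    except Exception:
        db.execute("ROLLBACK")
        raise
    return seen is None


def forget_message(db, msg_id):
    """Step 4: after the acknowledgment succeeded, the ID can be deleted."""
    db.execute("DELETE FROM processed_messages WHERE msg_id = ?", (msg_id,))


first = process_message(db, "m-42", 10)
second = process_message(db, "m-42", 10)  # simulated redelivery of the same message
balance = db.execute("SELECT balance FROM accounts WHERE id = 1").fetchone()[0]
```

The second call finds the ID and changes nothing, so the balance is incremented only once. With concurrent processors, the primary key on `msg_id` is what would make one of two racing inserts fail.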
Thus, achieving exactly-once processing only requires transactions within the database—atomicity
across database and message broker is not necessary for this use case. Recording the message ID in
the database makes the message processing *idempotent*, so that message processing can be safely
retried without duplicating its side-effects. A similar approach is used in stream processing
frameworks such as Kafka Streams to achieve exactly-once semantics, as we shall see in [“Fault Tolerance”](/en/ch12#sec_stream_fault_tolerance).
However, internal distributed transactions within the database are still useful for the scalability
of patterns such as these: for example, they would allow the message IDs to be stored on one shard
and the main data updated by the message processing to be stored on other shards, and to ensure
atomicity of the transaction commit across those shards.
## Summary {#summary}
Transactions are an abstraction layer that allows an application to pretend that certain concurrency
problems and certain kinds of hardware and software faults don’t exist. A large class of errors is
reduced down to a simple *transaction abort*, and the application just needs to try again.
In this chapter we saw many examples of problems that transactions help prevent. Not all
applications are susceptible to all those problems: an application with very simple access patterns,
such as reading and writing only a single record, can probably manage without transactions. However,
for more complex access patterns, transactions can hugely reduce the number of potential error cases
you need to think about.
Without transactions, various error scenarios (processes crashing, network interruptions, power
outages, disk full, unexpected concurrency, etc.) mean that data can become inconsistent in various
ways. For example, denormalized data can easily go out of sync with the source data. Without
transactions, it becomes very difficult to reason about the effects that complex interacting accesses
can have on the database.
In this chapter, we went particularly deep into the topic of concurrency control. We discussed
several widely used isolation levels, in particular *read committed*, *snapshot isolation*
(sometimes called *repeatable read*), and *serializable*. We characterized those isolation levels by
discussing various examples of race conditions, summarized in [Table 8-1](/en/ch8#ch_transactions_isolation_levels):
{{< figure id="ch_transactions_isolation_levels" title="Table 8-1. Summary of anomalies that can occur at various isolation levels" class="w-full my-4" >}}
| Isolation level | Dirty reads | Read skew | Phantom reads | Lost updates | Write skew |
|--------------------|-------------|-------------|---------------|--------------|-------------|
| Read uncommitted | ✗ Possible | ✗ Possible | ✗ Possible | ✗ Possible | ✗ Possible |
| Read committed | ✓ Prevented | ✗ Possible | ✗ Possible | ✗ Possible | ✗ Possible |
| Snapshot isolation | ✓ Prevented | ✓ Prevented | ✓ Prevented | ? Depends | ✗ Possible |
| Serializable | ✓ Prevented | ✓ Prevented | ✓ Prevented | ✓ Prevented | ✓ Prevented |
Dirty reads
: One client reads another client’s writes before they have been committed. The read committed
isolation level and stronger levels prevent dirty reads.
Dirty writes
: One client overwrites data that another client has written, but not yet committed. Almost all
transaction implementations prevent dirty writes.
Read skew
: A client sees different parts of the database at different points in time. Some cases of read
skew are also known as *nonrepeatable reads*. This issue is most commonly prevented with snapshot
isolation, which allows a transaction to read from a consistent snapshot corresponding to one
particular point in time. Snapshot isolation is usually implemented with *multi-version concurrency control*
(MVCC).
Lost updates
: Two clients concurrently perform a read-modify-write cycle. One overwrites the other’s write
without incorporating its changes, so data is lost. Some implementations of snapshot isolation
prevent this anomaly automatically, while others require a manual lock (`SELECT FOR UPDATE`).
Write skew
: A transaction reads something, makes a decision based on the value it saw, and writes the decision
to the database. However, by the time the write is made, the premise of the decision is no longer
true. Only serializable isolation prevents this anomaly.
Phantom reads
: A transaction reads objects that match some search condition. Another client makes a write that
affects the results of that search. Snapshot isolation prevents straightforward phantom reads, but
phantoms in the context of write skew require special treatment, such as index-range locks.
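As a concrete instance of the lost update anomaly described above, the following sketch interleaves two simulated clients that each increment a counter. It uses Python's built-in `sqlite3` module, and the `counters` table and helper names are assumptions made for illustration. SQLite has no `SELECT FOR UPDATE`, so instead of a lock the sketch contrasts the unsafe cycle with two other remedies: an atomic write and a conditional (compare-and-set) write.

```python
import sqlite3


def new_counter():
    """Create an in-memory database holding one counter that starts at 0."""
    db = sqlite3.connect(":memory:", isolation_level=None)  # autocommit mode
    db.execute("CREATE TABLE counters (id INTEGER PRIMARY KEY, value INTEGER NOT NULL)")
    db.execute("INSERT INTO counters VALUES (1, 0)")
    return db


def read_value(db):
    return db.execute("SELECT value FROM counters WHERE id = 1").fetchone()[0]


# 1. Unsafe read-modify-write: both clients read before either one writes.
db = new_counter()
seen_by_a = read_value(db)
seen_by_b = read_value(db)
db.execute("UPDATE counters SET value = ? WHERE id = 1", (seen_by_a + 1,))
db.execute("UPDATE counters SET value = ? WHERE id = 1", (seen_by_b + 1,))
after_unsafe = read_value(db)  # one of the two increments has been lost

# 2. Atomic write: the database performs the read-modify-write in one statement.
db = new_counter()
db.execute("UPDATE counters SET value = value + 1 WHERE id = 1")
db.execute("UPDATE counters SET value = value + 1 WHERE id = 1")
after_atomic = read_value(db)

# 3. Conditional write: the update applies only if the value is still the one read.
db = new_counter()
seen_by_a = read_value(db)
seen_by_b = read_value(db)
cas = "UPDATE counters SET value = ? WHERE id = 1 AND value = ?"
a_rows = db.execute(cas, (seen_by_a + 1, seen_by_a)).rowcount
b_rows = db.execute(cas, (seen_by_b + 1, seen_by_b)).rowcount  # matches no row
```

In the third variant the second client's update affects zero rows, which tells it that its read is stale and that it must re-read and retry.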
Weak isolation levels protect against some of those anomalies but leave you, the application
developer, to handle others manually (e.g., using explicit locking). Only serializable isolation
protects against all of these issues. We discussed three different approaches to implementing
serializable transactions:
Literally executing transactions in a serial order
: If you can make each transaction very fast to execute (typically by using stored procedures), and
the transaction throughput is low enough to process on a single CPU core or can be sharded, this
is a simple and effective option.
Two-phase locking
: For decades this has been the standard way of implementing serializability, but many applications
avoid using it because of its poor performance.
Serializable snapshot isolation (SSI)
: A comparatively new algorithm that avoids most of the downsides of the previous approaches. It
uses an optimistic approach, allowing transactions to proceed without blocking. When a transaction
wants to commit, it is checked, and it is aborted if the execution was not serializable.
Finally, we examined how to achieve atomicity when a transaction is distributed across multiple
nodes, using two-phase commit. If those nodes are all running the same database software,
distributed transactions can work quite well, but across different storage technologies (using XA
transactions), 2PC is problematic: it is very sensitive to faults in the coordinator and the
application code driving the transaction, and it interacts poorly with concurrency control
mechanisms. Fortunately, idempotence can ensure exactly-once semantics without requiring atomic
commit across different storage technologies, and we will see more on this in later chapters.
The examples in this chapter used a relational data model. However, as discussed in
[“The need for multi-object transactions”](/en/ch8#sec_transactions_need), transactions are a valuable database feature, no matter which data model is used.
### References
[^1]: Steven J. Murdoch. [What went wrong with Horizon: learning from the Post Office Trial](https://www.benthamsgaze.org/2021/07/15/what-went-wrong-with-horizon-learning-from-the-post-office-trial/). *benthamsgaze.org*, July 2021. Archived at [perma.cc/CNM4-553F](https://perma.cc/CNM4-553F)
[^2]: Donald D. Chamberlin, Morton M. Astrahan, Michael W. Blasgen, James N. Gray, W. Frank King, Bruce G. Lindsay, Raymond Lorie, James W. Mehl, Thomas G. Price, Franco Putzolu, Patricia Griffiths Selinger, Mario Schkolnick, Donald R. Slutz, Irving L. Traiger, Bradford W. Wade, and Robert A. Yost. [A History and Evaluation of System R](https://dsf.berkeley.edu/cs262/2005/SystemR.pdf). *Communications of the ACM*, volume 24, issue 10, pages 632–646, October 1981. [doi:10.1145/358769.358784](https://doi.org/10.1145/358769.358784)
[^3]: Jim N. Gray, Raymond A. Lorie, Gianfranco R. Putzolu, and Irving L. Traiger. [Granularity of Locks and Degrees of Consistency in a Shared Data Base](https://citeseerx.ist.psu.edu/pdf/e127f0a6a912bb9150ecfe03c0ebf7fbc289a023). In *Modelling in Data Base Management Systems: Proceedings of the IFIP Working Conference on Modelling in Data Base Management Systems*, edited by G. M. Nijssen, pages 364–394, Elsevier/North Holland Publishing, 1976. Also in *Readings in Database Systems*, 4th edition, edited by Joseph M. Hellerstein and Michael Stonebraker, MIT Press, 2005. ISBN: 978-0-262-69314-1
[^4]: Kapali P. Eswaran, Jim N. Gray, Raymond A. Lorie, and Irving L. Traiger. [The Notions of Consistency and Predicate Locks in a Database System](https://jimgray.azurewebsites.net/papers/On%20the%20Notions%20of%20Consistency%20and%20Predicate%20Locks%20in%20a%20Database%20System%20CACM.pdf?from=https://research.microsoft.com/en-us/um/people/gray/papers/On%20the%20Notions%20of%20Consistency%20and%20Predicate%20Locks%20in%20a%20Database%20System%20CACM.pdf). *Communications of the ACM*, volume 19, issue 11, pages 624–633, November 1976. [doi:10.1145/360363.360369](https://doi.org/10.1145/360363.360369)
[^5]: Rebecca Taft, Irfan Sharif, Andrei Matei, Nathan VanBenschoten, Jordan Lewis, Tobias Grieger, Kai Niemi, Andy Woods, Anne Birzin, Raphael Poss, Paul Bardea, Amruta Ranade, Ben Darnell, Bram Gruneir, Justin Jaffray, Lucy Zhang, and Peter Mattis. [CockroachDB: The Resilient Geo-Distributed SQL Database](https://dl.acm.org/doi/pdf/10.1145/3318464.3386134). At *ACM SIGMOD International Conference on Management of Data* (SIGMOD), pages 1493–1509, June 2020. [doi:10.1145/3318464.3386134](https://doi.org/10.1145/3318464.3386134)
[^6]: Dongxu Huang, Qi Liu, Qiu Cui, Zhuhe Fang, Xiaoyu Ma, Fei Xu, Li Shen, Liu Tang, Yuxing Zhou, Menglong Huang, Wan Wei, Cong Liu, Jian Zhang, Jianjun Li, Xuelian Wu, Lingyu Song, Ruoxi Sun, Shuaipeng Yu, Lei Zhao, Nicholas Cameron, Liquan Pei, and Xin Tang. [TiDB: a Raft-based HTAP database](https://www.vldb.org/pvldb/vol13/p3072-huang.pdf). *Proceedings of the VLDB Endowment*, volume 13, issue 12, pages 3072–3084, August 2020. [doi:10.14778/3415478.3415535](https://doi.org/10.14778/3415478.3415535)
[^7]: James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, JJ Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Dale Woodford, Yasushi Saito, Christopher Taylor, Michal Szymaniak, and Ruth Wang. [Spanner: Google’s Globally-Distributed Database](https://research.google/pubs/pub39966/). At *10th USENIX Symposium on Operating System Design and Implementation* (OSDI), October 2012.
[^8]: Jingyu Zhou, Meng Xu, Alexander Shraer, Bala Namasivayam, Alex Miller, Evan Tschannen, Steve Atherton, Andrew J. Beamon, Rusty Sears, John Leach, Dave Rosenthal, Xin Dong, Will Wilson, Ben Collins, David Scherer, Alec Grieser, Young Liu, Alvin Moore, Bhaskar Muppana, Xiaoge Su, and Vishesh Yadav. [FoundationDB: A Distributed Unbundled Transactional Key Value Store](https://www.foundationdb.org/files/fdb-paper.pdf). At *ACM International Conference on Management of Data* (SIGMOD), June 2021. [doi:10.1145/3448016.3457559](https://doi.org/10.1145/3448016.3457559)
[^9]: Theo Härder and Andreas Reuter. [Principles of Transaction-Oriented Database Recovery](https://citeseerx.ist.psu.edu/pdf/11ef7c142295aeb1a28a0e714c91fc8d610c3047). *ACM Computing Surveys*, volume 15, issue 4, pages 287–317, December 1983. [doi:10.1145/289.291](https://doi.org/10.1145/289.291)
[^10]: Peter Bailis, Alan Fekete, Ali Ghodsi, Joseph M. Hellerstein, and Ion Stoica. [HAT, not CAP: Towards Highly Available Transactions](https://www.usenix.org/system/files/conference/hotos13/hotos13-final80.pdf). At *14th USENIX Workshop on Hot Topics in Operating Systems* (HotOS), May 2013.
[^11]: Armando Fox, Steven D. Gribble, Yatin Chawathe, Eric A. Brewer, and Paul Gauthier. [Cluster-Based Scalable Network Services](https://people.eecs.berkeley.edu/~brewer/cs262b/TACC.pdf). At *16th ACM Symposium on Operating Systems Principles* (SOSP), October 1997. [doi:10.1145/268998.266662](https://doi.org/10.1145/268998.266662)
[^12]: Tony Andrews. [Enforcing Complex Constraints in Oracle](https://tonyandrews.blogspot.com/2004/10/enforcing-complex-constraints-in.html). *tonyandrews.blogspot.co.uk*, October 2004. Archived at [archive.org](https://web.archive.org/web/20220201190625/https%3A//tonyandrews.blogspot.com/2004/10/enforcing-complex-constraints-in.html)
[^13]: Philip A. Bernstein, Vassos Hadzilacos, and Nathan Goodman. [*Concurrency Control and Recovery in Database Systems*](https://www.microsoft.com/en-us/research/people/philbe/book/). Addison-Wesley, 1987. ISBN: 978-0-201-10715-9, available online at [*microsoft.com*](https://www.microsoft.com/en-us/research/people/philbe/book/).
[^14]: Alan Fekete, Dimitrios Liarokapis, Elizabeth O’Neil, Patrick O’Neil, and Dennis Shasha. [Making Snapshot Isolation Serializable](https://www.cse.iitb.ac.in/infolab/Data/Courses/CS632/2009/Papers/p492-fekete.pdf). *ACM Transactions on Database Systems*, volume 30, issue 2, pages 492–528, June 2005. [doi:10.1145/1071610.1071615](https://doi.org/10.1145/1071610.1071615)
[^15]: Mai Zheng, Joseph Tucek, Feng Qin, and Mark Lillibridge. [Understanding the Robustness of SSDs Under Power Fault](https://www.usenix.org/system/files/conference/fast13/fast13-final80.pdf). At *11th USENIX Conference on File and Storage Technologies* (FAST), February 2013.
[^16]: Laurie Denness. [SSDs: A Gift and a Curse](https://laur.ie/blog/2015/06/ssds-a-gift-and-a-curse/). *laur.ie*, June 2015. Archived at [perma.cc/6GLP-BX3T](https://perma.cc/6GLP-BX3T)
[^17]: Adam Surak. [When Solid State Drives Are Not That Solid](https://www.algolia.com/blog/engineering/when-solid-state-drives-are-not-that-solid). *blog.algolia.com*, June 2015. Archived at [perma.cc/CBR9-QZEE](https://perma.cc/CBR9-QZEE)
[^18]: Hewlett Packard Enterprise. [Bulletin: (Revision) HPE SAS Solid State Drives - Critical Firmware Upgrade Required for Certain HPE SAS Solid State Drive Models to Prevent Drive Failure at 32,768 Hours of Operation](https://support.hpe.com/hpesc/public/docDisplay?docId=emr_na-a00092491en_us). *support.hpe.com*, November 2019. Archived at [perma.cc/CZR4-AQBS](https://perma.cc/CZR4-AQBS)
[^19]: Craig Ringer et al. [PostgreSQL’s handling of fsync() errors is unsafe and risks data loss at least on XFS](https://www.postgresql.org/message-id/flat/CAMsr%2BYHh%2B5Oq4xziwwoEfhoTZgr07vdGG%2Bhu%3D1adXx59aTeaoQ%40mail.gmail.com). Email thread on pgsql-hackers mailing list, *postgresql.org*, March 2018. Archived at [perma.cc/5RKU-57FL](https://perma.cc/5RKU-57FL)
[^20]: Anthony Rebello, Yuvraj Patel, Ramnatthan Alagappan, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. [Can Applications Recover from fsync Failures?](https://www.usenix.org/conference/atc20/presentation/rebello) At *USENIX Annual Technical Conference* (ATC), July 2020.
[^21]: Thanumalayan Sankaranarayana Pillai, Vijay Chidambaram, Ramnatthan Alagappan, Samer Al-Kiswany, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. [Crash Consistency: Rethinking the Fundamental Abstractions of the File System](https://dl.acm.org/doi/pdf/10.1145/2800695.2801719). *ACM Queue*, volume 13, issue 7, pages 20–28, July 2015. [doi:10.1145/2800695.2801719](https://doi.org/10.1145/2800695.2801719)
[^22]: Thanumalayan Sankaranarayana Pillai, Vijay Chidambaram, Ramnatthan Alagappan, Samer Al-Kiswany, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. [All File Systems Are Not Created Equal: On the Complexity of Crafting Crash-Consistent Applications](https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-pillai.pdf). At *11th USENIX Symposium on Operating Systems Design and Implementation* (OSDI), October 2014.
[^23]: Chris Siebenmann. [Unix’s File Durability Problem](https://utcc.utoronto.ca/~cks/space/blog/unix/FileSyncProblem). *utcc.utoronto.ca*, April 2016. Archived at [perma.cc/VSS8-5MC4](https://perma.cc/VSS8-5MC4)
[^24]: Aishwarya Ganesan, Ramnatthan Alagappan, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. [Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://www.usenix.org/conference/fast17/technical-sessions/presentation/ganesan). At *15th USENIX Conference on File and Storage Technologies* (FAST), February 2017.
[^25]: Lakshmi N. Bairavasundaram, Garth R. Goodson, Bianca Schroeder, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. [An Analysis of Data Corruption in the Storage Stack](https://www.usenix.org/legacy/event/fast08/tech/full_papers/bairavasundaram/bairavasundaram.pdf). At *6th USENIX Conference on File and Storage Technologies* (FAST), February 2008.
[^26]: Bianca Schroeder, Raghav Lagisetty, and Arif Merchant. [Flash Reliability in Production: The Expected and the Unexpected](https://www.usenix.org/conference/fast16/technical-sessions/presentation/schroeder). At *14th USENIX Conference on File and Storage Technologies* (FAST), February 2016.
[^27]: Don Allison. [SSD Storage – Ignorance of Technology Is No Excuse](https://blog.korelogic.com/blog/2015/03/24). *blog.korelogic.com*, March 2015. Archived at [perma.cc/9QN4-9SNJ](https://perma.cc/9QN4-9SNJ)
[^28]: Gordon Mah Ung. [Debunked: Your SSD won’t lose data if left unplugged after all](https://www.pcworld.com/article/427602/debunked-your-ssd-wont-lose-data-if-left-unplugged-after-all.html). *pcworld.com*, May 2015. Archived at [perma.cc/S46H-JUDU](https://perma.cc/S46H-JUDU)
[^29]: Martin Kleppmann. [Hermitage: Testing the ‘I’ in ACID](https://martin.kleppmann.com/2014/11/25/hermitage-testing-the-i-in-acid.html). *martin.kleppmann.com*, November 2014. Archived at [perma.cc/KP2Y-AQGK](https://perma.cc/KP2Y-AQGK)
[^30]: Todd Warszawski and Peter Bailis. [ACIDRain: Concurrency-Related Attacks on Database-Backed Web Applications](http://www.bailis.org/papers/acidrain-sigmod2017.pdf). At *ACM International Conference on Management of Data* (SIGMOD), May 2017. [doi:10.1145/3035918.3064037](https://doi.org/10.1145/3035918.3064037)
[^31]: Tristan D’Agosta. [BTC Stolen from Poloniex](https://bitcointalk.org/index.php?topic=499580). *bitcointalk.org*, March 2014. Archived at [perma.cc/YHA6-4C5D](https://perma.cc/YHA6-4C5D)
[^32]: bitcointhief2. [How I Stole Roughly 100 BTC from an Exchange and How I Could Have Stolen More!](https://www.reddit.com/r/Bitcoin/comments/1wtbiu/how_i_stole_roughly_100_btc_from_an_exchange_and/) *reddit.com*, February 2014. Archived at [archive.org](https://web.archive.org/web/20250118042610/https%3A//www.reddit.com/r/Bitcoin/comments/1wtbiu/how_i_stole_roughly_100_btc_from_an_exchange_and/)
[^33]: Sudhir Jorwekar, Alan Fekete, Krithi Ramamritham, and S. Sudarshan. [Automating the Detection of Snapshot Isolation Anomalies](https://www.vldb.org/conf/2007/papers/industrial/p1263-jorwekar.pdf). At *33rd International Conference on Very Large Data Bases* (VLDB), September 2007.
[^34]: Michael Melanson. [Transactions: The Limits of Isolation](https://www.michaelmelanson.net/posts/transactions-the-limits-of-isolation/). *michaelmelanson.net*, November 2014. Archived at [perma.cc/RG5R-KMYZ](https://perma.cc/RG5R-KMYZ)
[^35]: Edward Kim. [How ACH works: A developer perspective — Part 1](https://engineering.gusto.com/how-ach-works-a-developer-perspective-part-1-339d3e7bea1). *engineering.gusto.com*, April 2014. Archived at [perma.cc/7B2H-PU94](https://perma.cc/7B2H-PU94)
[^36]: Hal Berenson, Philip A. Bernstein, Jim N. Gray, Jim Melton, Elizabeth O’Neil, and Patrick O’Neil. [A Critique of ANSI SQL Isolation Levels](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tr-95-51.pdf). At *ACM International Conference on Management of Data* (SIGMOD), May 1995. [doi:10.1145/568271.223785](https://doi.org/10.1145/568271.223785)
[^37]: Atul Adya. [Weak Consistency: A Generalized Theory and Optimistic Implementations for Distributed Transactions](https://pmg.csail.mit.edu/papers/adya-phd.pdf). PhD Thesis, Massachusetts Institute of Technology, March 1999. Archived at [perma.cc/E97M-HW5Q](https://perma.cc/E97M-HW5Q)
[^38]: Peter Bailis, Aaron Davidson, Alan Fekete, Ali Ghodsi, Joseph M. Hellerstein, and Ion Stoica. [Highly Available Transactions: Virtues and Limitations](https://www.vldb.org/pvldb/vol7/p181-bailis.pdf). At *40th International Conference on Very Large Data Bases* (VLDB), September 2014.
[^39]: Natacha Crooks, Youer Pu, Lorenzo Alvisi, and Allen Clement. [Seeing is Believing: A Client-Centric Specification of Database Isolation](https://www.cs.cornell.edu/lorenzo/papers/Crooks17Seeing.pdf). At *ACM Symposium on Principles of Distributed Computing* (PODC), pages 73–82, July 2017. [doi:10.1145/3087801.3087802](https://doi.org/10.1145/3087801.3087802)
[^40]: Bruce Momjian. [MVCC Unmasked](https://momjian.us/main/writings/pgsql/mvcc.pdf). *momjian.us*, July 2014. Archived at [perma.cc/KQ47-9GYB](https://perma.cc/KQ47-9GYB)
[^41]: Peter Alvaro and Kyle Kingsbury. [MySQL 8.0.34](https://jepsen.io/analyses/mysql-8.0.34). *jepsen.io*, December 2023. Archived at [perma.cc/HGE2-Z878](https://perma.cc/HGE2-Z878)
[^42]: Egor Rogov. [PostgreSQL 14 Internals](https://postgrespro.com/community/books/internals). *postgrespro.com*, April 2023. Archived at [perma.cc/FRK2-D7WB](https://perma.cc/FRK2-D7WB)
[^43]: Hironobu Suzuki. [The Internals of PostgreSQL](https://www.interdb.jp/pg/). *interdb.jp*, 2017.
[^44]: Rohan Reddy Alleti. [Internals of MVCC in Postgres: Hidden costs of Updates vs Inserts](https://medium.com/%40rohanjnr44/internals-of-mvcc-in-postgres-hidden-costs-of-updates-vs-inserts-381eadd35844). *medium.com*, March 2025. Archived at [perma.cc/3ACX-DFXT](https://perma.cc/3ACX-DFXT)
[^45]: Andy Pavlo and Bohan Zhang. [The Part of PostgreSQL We Hate the Most](https://www.cs.cmu.edu/~pavlo/blog/2023/04/the-part-of-postgresql-we-hate-the-most.html). *cs.cmu.edu*, April 2023. Archived at [perma.cc/XSP6-3JBN](https://perma.cc/XSP6-3JBN)
[^46]: Yingjun Wu, Joy Arulraj, Jiexi Lin, Ran Xian, and Andrew Pavlo. [An empirical evaluation of in-memory multi-version concurrency control](https://vldb.org/pvldb/vol10/p781-Wu.pdf). *Proceedings of the VLDB Endowment*, volume 10, issue 7, pages 781–792, March 2017. [doi:10.14778/3067421.3067427](https://doi.org/10.14778/3067421.3067427)
[^47]: Nikita Prokopov. [Unofficial Guide to Datomic Internals](https://tonsky.me/blog/unofficial-guide-to-datomic-internals/). *tonsky.me*, May 2014.
[^48]: Daniil Svetlov. [A Practical Guide to Taming Postgres Isolation Anomalies](https://dansvetlov.me/postgres-anomalies/). *dansvetlov.me*, March 2025. Archived at [perma.cc/L7LE-TDLS](https://perma.cc/L7LE-TDLS)
[^49]: Nate Wiger. [An Atomic Rant](https://nateware.com/2010/02/18/an-atomic-rant/). *nateware.com*, February 2010. Archived at [perma.cc/5ZYB-PE44](https://perma.cc/5ZYB-PE44)
[^50]: James Coglan. [Reading and writing, part 3: web applications](https://blog.jcoglan.com/2020/10/12/reading-and-writing-part-3/). *blog.jcoglan.com*, October 2020. Archived at [perma.cc/A7EK-PJVS](https://perma.cc/A7EK-PJVS)
[^51]: Peter Bailis, Alan Fekete, Michael J. Franklin, Ali Ghodsi, Joseph M. Hellerstein, and Ion Stoica. [Feral Concurrency Control: An Empirical Investigation of Modern Application Integrity](http://www.bailis.org/papers/feral-sigmod2015.pdf). At *ACM International Conference on Management of Data* (SIGMOD), June 2015. [doi:10.1145/2723372.2737784](https://doi.org/10.1145/2723372.2737784)
[^52]: Jaana Dogan. [Things I Wished More Developers Knew About Databases](https://rakyll.medium.com/things-i-wished-more-developers-knew-about-databases-2d0178464f78). *rakyll.medium.com*, April 2020. Archived at [perma.cc/6EFK-P2TD](https://perma.cc/6EFK-P2TD)
[^53]: Michael J. Cahill, Uwe Röhm, and Alan Fekete. [Serializable Isolation for Snapshot Databases](https://www.cs.cornell.edu/~sowell/dbpapers/serializable_isolation.pdf). At *ACM International Conference on Management of Data* (SIGMOD), June 2008. [doi:10.1145/1376616.1376690](https://doi.org/10.1145/1376616.1376690)
[^54]: Dan R. K. Ports and Kevin Grittner. [Serializable Snapshot Isolation in PostgreSQL](https://drkp.net/papers/ssi-vldb12.pdf). At *38th International Conference on Very Large Databases* (VLDB), August 2012.
[^55]: Douglas B. Terry, Marvin M. Theimer, Karin Petersen, Alan J. Demers, Mike J. Spreitzer, and Carl H. Hauser. [Managing Update Conflicts in Bayou, a Weakly Connected Replicated Storage System](https://pdos.csail.mit.edu/6.824/papers/bayou-conflicts.pdf). At *15th ACM Symposium on Operating Systems Principles* (SOSP), December 1995. [doi:10.1145/224056.224070](https://doi.org/10.1145/224056.224070)
[^56]: Hans-Jürgen Schönig. [Constraints over multiple rows in PostgreSQL](https://www.cybertec-postgresql.com/en/postgresql-constraints-over-multiple-rows/). *cybertec-postgresql.com*, June 2021. Archived at [perma.cc/2TGH-XUPZ](https://perma.cc/2TGH-XUPZ)
[^57]: Michael Stonebraker, Samuel Madden, Daniel J. Abadi, Stavros Harizopoulos, Nabil Hachem, and Pat Helland. [The End of an Architectural Era (It’s Time for a Complete Rewrite)](https://vldb.org/conf/2007/papers/industrial/p1150-stonebraker.pdf). At *33rd International Conference on Very Large Data Bases* (VLDB), September 2007.
[^58]: John Hugg. [H-Store/VoltDB Architecture vs. CEP Systems and Newer Streaming Architectures](https://www.youtube.com/watch?v=hD5M4a1UVz8). At *Data @Scale Boston*, November 2014.
[^59]: Robert Kallman, Hideaki Kimura, Jonathan Natkins, Andrew Pavlo, Alexander Rasin, Stanley Zdonik, Evan P. C. Jones, Samuel Madden, Michael Stonebraker, Yang Zhang, John Hugg, and Daniel J. Abadi. [H-Store: A High-Performance, Distributed Main Memory Transaction Processing System](https://www.vldb.org/pvldb/vol1/1454211.pdf). *Proceedings of the VLDB Endowment*, volume 1, issue 2, pages 1496–1499, August 2008.
[^60]: Rich Hickey. [The Architecture of Datomic](https://www.infoq.com/articles/Architecture-Datomic/). *infoq.com*, November 2012. Archived at [perma.cc/5YWU-8XJK](https://perma.cc/5YWU-8XJK)
[^61]: John Hugg. [Debunking Myths About the VoltDB In-Memory Database](https://dzone.com/articles/debunking-myths-about-voltdb). *dzone.com*, May 2014. Archived at [perma.cc/2Z9N-HPKF](https://perma.cc/2Z9N-HPKF)
[^62]: Xinjing Zhou, Viktor Leis, Xiangyao Yu, and Michael Stonebraker. [OLTP Through the Looking Glass 16 Years Later: Communication is the New Bottleneck](https://www.vldb.org/cidrdb/papers/2025/p17-zhou.pdf). At *15th Annual Conference on Innovative Data Systems Research* (CIDR), January 2025.
[^63]: Xinjing Zhou, Xiangyao Yu, Goetz Graefe, and Michael Stonebraker. [Lotus: scalable multi-partition transactions on single-threaded partitioned databases](https://www.vldb.org/pvldb/vol15/p2939-zhou.pdf). *Proceedings of the VLDB Endowment* (PVLDB), volume 15, issue 11, pages 2939–2952, July 2022. [doi:10.14778/3551793.3551843](https://doi.org/10.14778/3551793.3551843)
[^64]: Joseph M. Hellerstein, Michael Stonebraker, and James Hamilton. [Architecture of a Database System](https://dsf.berkeley.edu/papers/fntdb07-architecture.pdf). *Foundations and Trends in Databases*, volume 1, issue 2, pages 141–259, November 2007. [doi:10.1561/1900000002](https://doi.org/10.1561/1900000002)
[^65]: Michael J. Cahill. [Serializable Isolation for Snapshot Databases](https://ses.library.usyd.edu.au/bitstream/handle/2123/5353/michael-cahill-2009-thesis.pdf). PhD Thesis, University of Sydney, July 2009. Archived at [perma.cc/727J-NTMP](https://perma.cc/727J-NTMP)
[^66]: Cristian Diaconu, Craig Freedman, Erik Ismert, Per-Åke Larson, Pravin Mittal, Ryan Stonecipher, Nitin Verma, and Mike Zwilling. [Hekaton: SQL Server’s Memory-Optimized OLTP Engine](https://www.microsoft.com/en-us/research/wp-content/uploads/2013/06/Hekaton-Sigmod2013-final.pdf). At *ACM SIGMOD International Conference on Management of Data* (SIGMOD), pages 1243–1254, June 2013. [doi:10.1145/2463676.2463710](https://doi.org/10.1145/2463676.2463710)
[^67]: Thomas Neumann, Tobias Mühlbauer, and Alfons Kemper. [Fast Serializable Multi-Version Concurrency Control for Main-Memory Database Systems](https://db.in.tum.de/~muehlbau/papers/mvcc.pdf). At *ACM SIGMOD International Conference on Management of Data* (SIGMOD), pages 677–689, May 2015. [doi:10.1145/2723372.2749436](https://doi.org/10.1145/2723372.2749436)
[^68]: D. Z. Badal. [Correctness of Concurrency Control and Implications in Distributed Databases](https://ieeexplore.ieee.org/abstract/document/762563). At *3rd International IEEE Computer Software and Applications Conference* (COMPSAC), November 1979. [doi:10.1109/CMPSAC.1979.762563](https://doi.org/10.1109/CMPSAC.1979.762563)
[^69]: Rakesh Agrawal, Michael J. Carey, and Miron Livny. [Concurrency Control Performance Modeling: Alternatives and Implications](https://people.eecs.berkeley.edu/~brewer/cs262/ConcControl.pdf). *ACM Transactions on Database Systems* (TODS), volume 12, issue 4, pages 609–654, December 1987. [doi:10.1145/32204.32220](https://doi.org/10.1145/32204.32220)
[^70]: Marc Brooker. [Snapshot Isolation vs Serializability](https://brooker.co.za/blog/2024/12/17/occ-and-isolation.html). *brooker.co.za*, December 2024. Archived at [perma.cc/5TRC-CR5G](https://perma.cc/5TRC-CR5G)
[^71]: B. G. Lindsay, P. G. Selinger, C. Galtieri, J. N. Gray, R. A. Lorie, T. G. Price, F. Putzolu, I. L. Traiger, and B. W. Wade. [Notes on Distributed Databases](https://dominoweb.draco.res.ibm.com/reports/RJ2571.pdf). IBM Research, Research Report RJ2571(33471), July 1979. Archived at [perma.cc/EPZ3-MHDD](https://perma.cc/EPZ3-MHDD)
[^72]: C. Mohan, Bruce G. Lindsay, and Ron Obermarck. [Transaction Management in the R\* Distributed Database Management System](https://cs.brown.edu/courses/csci2270/archives/2012/papers/dtxn/p378-mohan.pdf). *ACM Transactions on Database Systems*, volume 11, issue 4, pages 378–396, December 1986. [doi:10.1145/7239.7266](https://doi.org/10.1145/7239.7266)
[^73]: X/Open Company Ltd. [Distributed Transaction Processing: The XA Specification](https://pubs.opengroup.org/onlinepubs/009680699/toc.pdf). Technical Standard XO/CAE/91/300, December 1991. ISBN: 978-1-872-63024-3, archived at [perma.cc/Z96H-29JB](https://perma.cc/Z96H-29JB)
[^74]: Ivan Silva Neto and Francisco Reverbel. [Lessons Learned from Implementing WS-Coordination and WS-AtomicTransaction](https://www.ime.usp.br/~reverbel/papers/icis2008.pdf). At *7th IEEE/ACIS International Conference on Computer and Information Science* (ICIS), May 2008. [doi:10.1109/ICIS.2008.75](https://doi.org/10.1109/ICIS.2008.75)
[^75]: James E. Johnson, David E. Langworthy, Leslie Lamport, and Friedrich H. Vogt. [Formal Specification of a Web Services Protocol](https://www.microsoft.com/en-us/research/publication/formal-specification-of-a-web-services-protocol/). At *1st International Workshop on Web Services and Formal Methods* (WS-FM), February 2004. [doi:10.1016/j.entcs.2004.02.022](https://doi.org/10.1016/j.entcs.2004.02.022)
[^76]: Jim Gray. [The Transaction Concept: Virtues and Limitations](https://jimgray.azurewebsites.net/papers/thetransactionconcept.pdf). At *7th International Conference on Very Large Data Bases* (VLDB), September 1981.
[^77]: Dale Skeen. [Nonblocking Commit Protocols](https://www.cs.utexas.edu/~lorenzo/corsi/cs380d/papers/Ske81.pdf). At *ACM International Conference on Management of Data* (SIGMOD), April 1981. [doi:10.1145/582318.582339](https://doi.org/10.1145/582318.582339)
[^78]: Gregor Hohpe. [Your Coffee Shop Doesn’t Use Two-Phase Commit](https://www.martinfowler.com/ieeeSoftware/coffeeShop.pdf). *IEEE Software*, volume 22, issue 2, pages 64–66, March 2005. [doi:10.1109/MS.2005.52](https://doi.org/10.1109/MS.2005.52)
[^79]: Pat Helland. [Life Beyond Distributed Transactions: An Apostate’s Opinion](https://www.cidrdb.org/cidr2007/papers/cidr07p15.pdf). At *3rd Biennial Conference on Innovative Data Systems Research* (CIDR), January 2007.
[^80]: Jonathan Oliver. [My Beef with MSDTC and Two-Phase Commits](https://blog.jonathanoliver.com/my-beef-with-msdtc-and-two-phase-commits/). *blog.jonathanoliver.com*, April 2011. Archived at [perma.cc/K8HF-Z4EN](https://perma.cc/K8HF-Z4EN)
[^81]: Oren Eini (Ayende Rahien). [The Fallacy of Distributed Transactions](https://ayende.com/blog/167362/the-fallacy-of-distributed-transactions). *ayende.com*, July 2014. Archived at [perma.cc/VB87-2JEF](https://perma.cc/VB87-2JEF)
[^82]: Clemens Vasters. [Transactions in Windows Azure (with Service Bus) – An Email Discussion](https://learn.microsoft.com/en-gb/archive/blogs/clemensv/transactions-in-windows-azure-with-service-bus-an-email-discussion). *learn.microsoft.com*, July 2012. Archived at [perma.cc/4EZ9-5SKW](https://perma.cc/4EZ9-5SKW)
[^83]: Ajmer Dhariwal. [Orphaned MSDTC Transactions (-2 spids)](https://www.eraofdata.com/posts/2008/orphaned-msdtc-transactions-2-spids/). *eraofdata.com*, December 2008. Archived at [perma.cc/YG6F-U34C](https://perma.cc/YG6F-U34C)
[^84]: Paul Randal. [Real World Story of DBCC PAGE Saving the Day](https://www.sqlskills.com/blogs/paul/real-world-story-of-dbcc-page-saving-the-day/). *sqlskills.com*, June 2013. Archived at [perma.cc/2MJN-A5QH](https://perma.cc/2MJN-A5QH)
[^85]: Guozhang Wang, Lei Chen, Ayusman Dikshit, Jason Gustafson, Boyang Chen, Matthias J. Sax, John Roesler, Sophie Blee-Goldman, Bruno Cadonna, Apurva Mehta, Varun Madan, and Jun Rao. [Consistency and Completeness: Rethinking Distributed Stream Processing in Apache Kafka](https://dl.acm.org/doi/pdf/10.1145/3448016.3457556). At *ACM International Conference on Management of Data* (SIGMOD), June 2021. [doi:10.1145/3448016.3457556](https://doi.org/10.1145/3448016.3457556)
================================================
FILE: content/en/ch9.md
================================================
---
title: "9. The Trouble with Distributed Systems"
weight: 209
breadcrumbs: false
---

> *They’re funny things, Accidents. You never have them till you’re having them.*
>
> A.A. Milne, *The House at Pooh Corner* (1928)

As discussed in [“Reliability and Fault Tolerance”](/en/ch2#sec_introduction_reliability), making a system reliable means ensuring that the
system as a whole continues working, even when things go wrong (i.e., when there is a fault).
However, anticipating all the possible faults and handling them is not that easy. As a developer, it
is very tempting to focus mostly on the happy path (after all, most of the time things work fine!)
and to neglect faults, since they introduce a lot of edge cases.
If you want your system to be reliable in the presence of faults you have to radically change your
mindset, and focus on the things that could go wrong, even though they may be unlikely. It doesn’t
matter whether there is only a one-in-a-million chance of a thing going wrong: in a large enough
system, one-in-a-million events happen every day. Experienced systems operators will tell you that
anything that *can* go wrong *will* go wrong.
Moreover, working with distributed systems is fundamentally different from writing software on a
single computer—and the main difference is that there are lots of new and exciting ways for things
to go wrong [^1] [^2].
In this chapter, you will get a taste of the problems that arise in practice, and an understanding
of the things you can and cannot rely on.
To understand what challenges we are up against, we will now turn our pessimism to the maximum and
explore the things that may go wrong in a distributed system. We will look into problems with
networks ([“Unreliable Networks”](/en/ch9#sec_distributed_networks)) as well as clocks and timing issues
([“Unreliable Clocks”](/en/ch9#sec_distributed_clocks)). The consequences of all these issues are disorienting, so we’ll
explore how to think about the state of a distributed system and how to reason about things that
have happened ([“Knowledge, Truth, and Lies”](/en/ch9#sec_distributed_truth)). Later, in [Chapter 10](/en/ch10#ch_consistency), we will look at some
examples of how we can achieve fault tolerance in the face of those faults.
## Faults and Partial Failures {#sec_distributed_partial_failure}
When you are writing a program on a single computer, it normally behaves in a fairly predictable
way: either it works or it doesn’t. Buggy software may give the appearance that the computer is
sometimes “having a bad day” (a problem that is often fixed by a reboot), but that is mostly just
a consequence of badly written software.
There is no fundamental reason why software on a single computer should be flaky: when the hardware
is working correctly, the same operation always produces the same result (it is *deterministic*). If
there is a hardware problem (e.g., memory corruption or a loose connector), the consequence is usually a
total system failure (e.g., kernel panic, “blue screen of death,” failure to start up). An individual
computer with good software is usually either fully functional or entirely broken, but not something
in between.
This is a deliberate choice in the design of computers: if an internal fault occurs, we prefer a
computer to crash completely rather than returning a wrong result, because wrong results are difficult
and confusing to deal with. Thus, computers hide the fuzzy physical reality on which they are
implemented and present an idealized system model that operates with mathematical perfection. A CPU
instruction always does the same thing; if you write some data to memory or disk, that data remains
intact and doesn’t get randomly corrupted. As discussed in [“Hardware and Software Faults”](/en/ch2#sec_introduction_hardware_faults),
this is not actually true—in reality, data does get silently corrupted and CPUs do sometimes
silently return the wrong result—but it happens rarely enough that we can get away with ignoring it.
When you are writing software that runs on several computers, connected by a network, the situation
is fundamentally different. In distributed systems, faults occur much more frequently, and so we can
no longer ignore them—we have no choice but to confront the messy reality of the physical world. And
in the physical world, a remarkably wide range of things can go wrong, as illustrated by this
anecdote [^3]:
> In my limited experience I’ve dealt with long-lived network partitions in a single data center (DC),
> PDU [power distribution unit] failures, switch failures, accidental power cycles of whole racks,
> whole-DC backbone failures, whole-DC power failures, and a hypoglycemic driver smashing his Ford
> pickup truck into a DC’s HVAC [heating, ventilation, and air conditioning] system. And I’m not even
> an ops guy.
>
> —— Coda Hale

In a distributed system, there may well be some parts of the system that are broken in some
unpredictable way, even though other parts of the system are working fine. This is known as a
*partial failure*. The difficulty is that partial failures are *nondeterministic*: if you try to do
anything involving multiple nodes and the network, it may sometimes work and sometimes unpredictably
fail. As we shall see, you may not even *know* whether something succeeded or not!
This nondeterminism and possibility of partial failures is what makes distributed systems hard to work with [^4].
On the other hand, if a distributed system can tolerate partial failures, that opens up powerful
possibilities: for example, it allows you to perform a rolling upgrade, rebooting one node at a time
to install software updates while the system as a whole continues working without interruption.
Fault tolerance therefore allows us to make distributed systems more reliable than single-node
systems: we can build a reliable system from unreliable components.
But before we can implement fault tolerance, we need to know more about the faults that we’re
supposed to tolerate. It is important to consider a wide range of possible faults—even fairly
unlikely ones—and to artificially create such situations in your testing environment to see what
happens. In distributed systems, suspicion, pessimism, and paranoia pay off.
## Unreliable Networks {#sec_distributed_networks}
As discussed in [“Shared-Memory, Shared-Disk, and Shared-Nothing Architecture”](/en/ch2#sec_introduction_shared_nothing), the distributed systems we focus on
in this book are mostly *shared-nothing systems*: i.e., a bunch of machines connected by a network.
The network is the only way those machines can communicate—we assume that each machine has its
own memory and disk, and one machine cannot access another machine’s memory or disk (except by
making requests to a service over the network). Even when storage is shared, such as with Amazon’s
S3, machines communicate with shared storage services over the network.
The internet and most internal networks in datacenters (often Ethernet) are *asynchronous packet
networks*. In this kind of network, one node can send a message (a packet) to another node, but the
network gives no guarantees as to when it will arrive, or whether it will arrive at all. If you send
a request and expect a response, many things could go wrong (some of which are illustrated in
[Figure 9-1](/en/ch9#fig_distributed_network)):
1. Your request may have been lost (perhaps someone unplugged a network cable).
2. Your request may be waiting in a queue and will be delivered later (perhaps the network or the
recipient is overloaded).
3. The remote node may have failed (perhaps it crashed or it was powered down).
4. The remote node may have temporarily stopped responding (perhaps it is experiencing a long
garbage collection pause; see [“Process Pauses”](/en/ch9#sec_distributed_clocks_pauses)), but it will start responding
again later.
5. The remote node may have processed your request, but the response has been lost on the network
(perhaps a network switch has been misconfigured).
6. The remote node may have processed your request, but the response has been delayed and will be
delivered later (perhaps the network or your own machine is overloaded).

{{< figure src="/fig/ddia_0901.png" id="fig_distributed_network" caption="Figure 9-1. If you send a request and don't get a response, it's not possible to distinguish whether (a) the request was lost, (b) the remote node is down, or (c) the response was lost." class="w-full my-4" >}}

The sender can’t even tell whether the packet was delivered: the only option is for the recipient to
send a response message, which may in turn be lost or delayed. These issues are indistinguishable in
an asynchronous network: the only information you have is that you haven’t received a response yet.
If you send a request to another node and don’t receive a response, it is *impossible* to tell why.
The usual way of handling this issue is a *timeout*: after some time you give up waiting and assume that
the response is not going to arrive. However, when a timeout occurs, you still don’t know whether
the remote node got your request or not (and if the request is still queued somewhere, it may still
be delivered to the recipient, even if the sender has given up on it).
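
As a minimal sketch of this behavior, the following Python snippet sends a request over a connected socket and gives up after a timeout. The function name `request_with_timeout` and the outcome labels are illustrative assumptions, not part of any particular library.

```python
import socket


def request_with_timeout(sock, payload, timeout_s):
    """Send payload and wait up to timeout_s seconds for a reply.

    Returns (outcome, data), where outcome is "reply", "closed", or "timeout".
    """
    sock.settimeout(timeout_s)
    sock.sendall(payload)
    try:
        data = sock.recv(4096)
    except socket.timeout:
        # No response yet. The request may have been lost, may still be queued,
        # or may already have been processed; the sender cannot tell which.
        return ("timeout", None)
    if data == b"":
        # The peer closed the connection without sending a reply.
        return ("closed", None)
    return ("reply", data)


if __name__ == "__main__":
    client, server = socket.socketpair()  # two connected local sockets
    # The "server" end never answers, so the client times out...
    print(request_with_timeout(client, b"ping", 0.1))
    # ...even though the request was in fact delivered to the other end.
    print(server.recv(4096))
```

In the demo the client reports a timeout although the request did reach the other side, which is exactly the ambiguity described above.
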
### The Limitations of TCP {#sec_distributed_tcp}
Network packets have a maximum size (generally a few kilobytes), but many applications need to send
messages (requests, responses) that are too big to fit in one packet. These applications most often
use TCP, the Transmission Control Protocol, to establish a *connection* that breaks up large data
streams into individual packets, and puts them back together again on the receiving side.

--------
> [!NOTE]
> Most of what we say about TCP applies also to its more recent alternative QUIC, as well as the
> Stream Control Transmission Protocol (SCTP) used in WebRTC, the BitTorrent uTP protocol, and
> other transport protocols. For a comparison to UDP, see [“TCP Versus UDP”](/en/ch9#sidebar_distributed_tcp_udp).
--------
TCP is often described as providing “reliable” delivery, in the sense that it detects and
retransmits dropped packets, it detects reordered packets and puts them back in the correct order,
and it detects packet corruption using a simple checksum. It also figures out how fast it can send
data so that it is transferred as quickly as possible, but without overloading the network or the
receiving node; this is known as *congestion control*, *flow control*, or *backpressure* [^5].
When you “send” some data by writing it to a socket, it isn’t actually sent immediately;
it is only placed in a buffer managed by your operating system. When the congestion control
algorithm decides that it has capacity to send a packet, it takes the next packet-worth of data from
that buffer and passes it to the network interface. The packet passes through several switches and
routers, and eventually the receiving node’s operating system places the packet’s data in a receive
buffer and sends an acknowledgment packet back to the sender. Only then does the receiving operating
system notify the application that some more data has arrived [^6].
So, if TCP provides “reliability”, does that mean we no longer need to worry about networks being
unreliable? Unfortunately not. TCP decides that a packet must have been lost if no acknowledgment
arrives within some timeout, but it can’t tell whether it was the outbound packet or the
acknowledgment that was lost. Although TCP can resend the packet, it can’t guarantee that the new
packet will get through either. If the network cable is unplugged, TCP can’t plug it back in for
you. Eventually, after a configurable timeout, TCP gives up and signals an error to the application.
If a TCP connection is closed with an error—perhaps because the remote node crashed, or perhaps
because the network was interrupted—you unfortunately have no way of knowing how much data was
actually processed by the remote node [^6].
Even if TCP acknowledged that a packet was delivered, this only means that the operating system
kernel on the remote node received it, but the application may have crashed before it handled that
data. If you want to be sure that a request was successful, you need a positive response from the
application itself [^7].
Nevertheless, TCP is very useful, because it provides a convenient way of sending and receiving
messages that are too big to fit in one packet. Once a TCP connection is established, you can also
use it to send multiple requests and responses. This is usually done by first sending a header that
indicates the length of the following message in bytes, followed by the actual message. HTTP and
many RPC protocols (see [“Dataflow Through Services: REST and RPC”](/en/ch5#sec_encoding_dataflow_rpc)) work like this.
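
The length-prefix idea can be sketched in a few lines of Python. The 4-byte big-endian binary header and the helper names below are assumptions chosen for illustration; real protocols differ in detail (HTTP, for instance, carries the length in a textual `Content-Length` header).

```python
import socket
import struct

# Hypothetical wire format: 4-byte unsigned big-endian length, then the payload.
HEADER = struct.Struct("!I")


def send_message(sock, payload):
    """Write one length-prefixed message to the stream."""
    sock.sendall(HEADER.pack(len(payload)) + payload)


def recv_exact(sock, n):
    """Read exactly n bytes; recv() may return fewer bytes than requested."""
    chunks = []
    remaining = n
    while remaining > 0:
        chunk = sock.recv(remaining)
        if not chunk:
            # The connection ended before the full message arrived.
            raise ConnectionError("connection closed mid-message")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def recv_message(sock):
    """Read one length-prefixed message from the stream."""
    (length,) = HEADER.unpack(recv_exact(sock, HEADER.size))
    return recv_exact(sock, length)


if __name__ == "__main__":
    a, b = socket.socketpair()
    send_message(a, b"first request")
    send_message(a, b"second request")
    print(recv_message(b))
    print(recv_message(b))
```

Because TCP delivers a byte stream rather than discrete messages, the receiver loops in `recv_exact` until the announced number of bytes has arrived.
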
### Network Faults in Practice {#sec_distributed_network_faults}
We have been building computer networks for decades—one might hope that by now we would have figured
out how to make them reliable. Unfortunately, we have not yet succeeded. There are some systematic
studies, and plenty of anecdotal evidence, showing that network problems can be surprisingly common,
even in controlled environments like a datacenter operated by one company [^8]:
* One study in a medium-sized datacenter found about 12 network faults per month, of which half
disconnected a single machine, and half disconnected an entire rack [^9].
* Another study measured the failure rates of components like top-of-rack switches, aggregation
switches, and load balancers [^10].
It found that adding redundant networking gear doesn’t reduce faults as much as you might hope,
since it doesn’t guard against human error (e.g., misconfigured switches), which is a major cause
of outages.
* Interruptions of wide-area fiber links have been blamed on cows [^11], beavers [^12], and sharks [^13]
(though shark bites have become rarer due to better shielding of submarine cables [^14]).
Humans are also at fault, be it due to accidental misconfiguration [^15], scavenging [^16], or sabotage [^17].
* Across different cloud regions, round-trip times of up to several *minutes* have been observed at
high percentiles [^18].
Even within a single datacenter, packet delay of more than a minute can occur during a network
topology reconfiguration, triggered by a problem during a software upgrade for a switch [^19].
Thus, we have to assume that messages might be delayed arbitrarily.
* Sometimes communications are partially interrupted, depending on who you’re talking to: for
example, A and B can communicate, B and C can communicate, but A and C cannot [^20] [^21].
Other surprising faults include a network interface that sometimes drops all inbound packets but
sends outbound packets successfully [^22]:
a network link that works in one direction is not guaranteed to work in the opposite direction.
* Even a brief network interruption can have repercussions that last for much longer than the
original issue [^8] [^20] [^23].
--------
> [!TIP] NETWORK PARTITIONS
When one part of the network is cut off from the rest due to a network fault, that is sometimes
called a *network partition* or *netsplit*, but it is not fundamentally different from other kinds
of network interruption. Network partitions are not related to sharding of a storage system, which
is sometimes also called *partitioning* (see [Chapter 7](/en/ch7#ch_sharding)).
--------
Even if network faults are rare in your environment, the fact that faults *can* occur means that
your software needs to be able to handle them. Whenever any communication happens over a network, it
may fail—there is no way around it.
If the error handling of network faults is not defined and tested, arbitrarily bad things could
happen: for example, the cluster could become deadlocked and permanently unable to serve requests,
even when the network recovers [^24],
or it could even delete all of your data [^25].
If software is put in an unanticipated situation, it may do arbitrary unexpected things.
Handling network faults doesn’t necessarily mean *tolerating* them: if your network is normally
fairly reliable, a valid approach may be to simply show an error message to users while your network
is experiencing problems. However, you do need to know how your software reacts to network problems
and ensure that the system can recover from them.
It may make sense to deliberately trigger network problems and test the system’s response (this is
known as *fault injection*; see [“Fault injection”](/en/ch9#sec_fault_injection)).
### Detecting Faults {#id307}
Many systems need to automatically detect faulty nodes. For example:
* A load balancer needs to stop sending requests to a node that is dead (i.e., take it *out of rotation*).
* In a distributed database with single-leader replication, if the leader fails, one of the
followers needs to be promoted to be the new leader (see [“Handling Node Outages”](/en/ch6#sec_replication_failover)).
Unfortunately, the uncertainty about the network makes it difficult to tell whether a node is
working or not. In some specific circumstances you might get some feedback to explicitly tell you
that something is not working:
* If you can reach the machine on which the node should be running, but no process is listening on
the destination port (e.g., because the process crashed), the operating system will helpfully close
or refuse TCP connections by sending a `RST` or `FIN` packet in reply.
* If a node process crashed (or was killed by an administrator) but the node’s operating system is
still running, a script can notify other nodes about the crash so that another node can take over
quickly without having to wait for a timeout to expire. For example, HBase does this [^26].
* If you have access to the management interface of the network switches in your datacenter, you can
query them to detect link failures at a hardware level (e.g., if the remote machine is powered
down). This option is ruled out if you’re connecting via the internet, or if you’re in a shared
datacenter with no access to the switches themselves, or if you can’t reach the management
interface due to a network problem.
* If a router is sure that the IP address you’re trying to connect to is unreachable, it may reply
to you with an ICMP Destination Unreachable packet. However, the router doesn’t have a magic
failure detection capability either—it is subject to the same limitations as other participants
of the network.
Rapid feedback about a remote node being down is useful, but you can’t count on it. If something has
gone wrong, you may get an error response at some level of the stack, but in general you have to
assume that you will get no response at all. You can retry a few times, wait for a timeout to
elapse, and eventually declare the node dead if you don’t hear back within the timeout.
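
The retry-then-give-up approach might look like the following Python sketch. The function name `probe`, its return labels, and the injectable `connect` parameter are assumptions made for illustration; only `socket.create_connection` is a real standard-library call.

```python
import socket


def probe(host, port, timeout_s=1.0, attempts=3, connect=socket.create_connection):
    """Classify a node as "alive", "refused", or "presumed dead".

    `connect` is injectable so the logic can be exercised without a real network.
    """
    for _ in range(attempts):
        try:
            conn = connect((host, port), timeout=timeout_s)
        except ConnectionRefusedError:
            # Explicit feedback: the machine is reachable, but nothing is
            # listening on the port (for example, the process crashed).
            return "refused"
        except OSError:
            # Timeouts and other network errors carry no information about
            # the remote node's state, so try again.
            continue
        conn.close()
        return "alive"
    # No answer after several attempts. This is a guess, not a certainty:
    # the node may only be slow or temporarily unreachable.
    return "presumed dead"
```

Note that "presumed dead" is the honest name for the final outcome: the caller has learned only that no response arrived in time.
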
### Timeouts and Unbounded Delays {#sec_distributed_queueing}
If a timeout is the only sure way of detecting a fault, then how long should the timeout be? There
is unfortunately no simple answer.
A long timeout means a long wait until a node is declared dead (and during this time, users may have
to wait or see error messages). A short timeout detects faults faster, but carries a higher risk of
incorrectly declaring a node dead when in fact it has only suffered a temporary slowdown (e.g., due
to a load spike on the node or the network).
Prematurely declaring a node dead is problematic: if the node is actually alive and in the middle of
performing some action (for example, sending an email), and another node takes over, the action may
end up being performed twice. We will discuss this issue in more detail in
[“Knowledge, Truth, and Lies”](/en/ch9#sec_distributed_truth), [Chapter 10](/en/ch10#ch_consistency), and [“The End-to-End Argument for Databases”](/en/ch13#sec_future_end_to_end).
When a node is declared dead, its responsibilities need to be transferred to other nodes, which
places additional load on other nodes and the network. If the system is already struggling with high
load, declaring nodes dead prematurely can make the problem worse. In particular, it could happen
that the node actually wasn’t dead but only slow to respond due to overload; transferring its load
to other nodes can cause a cascading failure (in the extreme case, all nodes declare each other
dead, and everything stops working—see [“When an overloaded system won’t recover”](/en/ch2#sidebar_metastable)).
Imagine a fictitious system with a network that guaranteed a maximum delay for packets—every packet
is either delivered within some time *d*, or it is lost, but delivery never takes longer than *d*.
Furthermore, assume that you can guarantee that a non-failed node always handles a request within
some time *r*. In this case, you could guarantee that every successful request receives a response
within time 2*d* + *r*—and if you don’t receive a response within that time, you know
that either the network or the remote node is not working. If this were true,
2*d* + *r* would be a reasonable timeout to use.
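
As a worked example of this bound, with made-up numbers (a maximum one-way delay of 10 ms and a maximum processing time of 50 ms, both purely hypothetical):

```python
def synchronous_timeout_ms(max_delay_ms, max_processing_ms):
    """Upper bound on response time: request delay + processing + response delay."""
    return 2 * max_delay_ms + max_processing_ms


# Hypothetical values: d = 10 ms, r = 50 ms
print(synchronous_timeout_ms(10, 50))  # 70
```
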
Unfortunately, most systems we work with have neither of those guarantees: asynchronous networks
have *unbounded delays* (that is, they try to deliver packets as quickly as possible, but there is
no upper limit on the time it may take for a packet to arrive), and most server implementations
cannot guarantee that they can handle requests within some maximum time (see
[“Response time guarantees”](/en/ch9#sec_distributed_clocks_realtime)). For failure detection, it’s not sufficient for the system to
be fast most of the time: if your timeout is low, it only takes a transient spike in round-trip
times to throw the system off-balance.
#### Network congestion and queueing {#network-congestion-and-queueing}
When you drive a car, most of the variation in travel time is due to traffic congestion.
Similarly, the variability of packet delays on computer networks is most often due to queueing [^27]:
* If several different nodes simultaneously try to send packets to the same destination, the network
switch must queue them up and feed them into the destination network link one by one (as illustrated
in [Figure 9-2](/en/ch9#fig_distributed_switch_queueing)). On a busy network link, a packet may have to wait a while
until it can get a slot (this is called *network congestion*). If there is so much incoming data
that the switch queue fills up, the packet is dropped, so it needs to be resent—even though
the network is functioning fine.
* When a packet reaches the destination machine, if all CPU cores are currently busy, the incoming
request from the network is queued by the operating system until the application is ready to
handle it. Depending on the load on the machine, this may take an arbitrary length of time [^28].
* In virtualized environments, a running operating system is often paused for tens of milliseconds
while another virtual machine uses a CPU core. During this time, the VM cannot consume any data
from the network, so the incoming data is queued (buffered) by the virtual machine monitor [^29],
further increasing the variability of network delays.
* As mentioned earlier, in order to avoid overloading the network, TCP limits the rate at which it
sends data. This means additional queueing at the sender before the data even enters the network.

{{< figure src="/fig/ddia_0902.png" id="fig_distributed_switch_queueing" caption="Figure 9-2. If several machines send network traffic to the same destination, its switch queue can fill up. Here, ports 1, 2, and 4 are all trying to send packets to port 3." class="w-full my-4" >}}

Moreover, when TCP detects and automatically retransmits a lost packet, although the application
does not see the packet loss directly, it does see the resulting delay (waiting for the timeout to
expire, and then waiting for the retransmitted packet to be acknowledged).

--------
> [!TIP] TCP VERSUS UDP
Some latency-sensitive applications, such as videoconferencing and Voice over IP (VoIP), use UDP
rather than TCP. It’s a trade-off between reliability and variability of delays: as UDP does not
perform flow control and does not retransmit lost packets, it avoids some of the reasons for
variable network delays (although it is still susceptible to switch queues and scheduling delays).
UDP is a good choice in situations where delayed data is worthless. For example, in a VoIP phone
call, there probably isn’t enough time to retransmit a lost packet before its data is due to be
played over the loudspeakers. In this case, there’s no point in retransmitting the packet—the
application must instead fill the missing packet’s time slot with silence (causing a brief
interruption in the sound) and move on in the stream. The retry happens at the human layer instead.
(“Could you repeat that please? The sound just cut out for a moment.”)
--------
All of these factors contribute to the variability of network delays. Queueing delays have an
especially wide range when a system is close to its maximum capacity: a system with plenty of spare
capacity can easily drain queues, whereas in a highly utilized system, long queues can build up very
quickly.
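
To get a feel for why delays grow so sharply near full capacity, here is a back-of-the-envelope sketch using the textbook M/M/1 queueing model (one queue, random arrivals, random service times). That model is an assumption introduced here for illustration only; real switches and servers see burstier traffic, so treat the numbers as indicative rather than predictive.

```python
def mean_time_in_system(service_time_ms, utilization):
    """Mean time a request spends queued plus being served, under M/M/1.

    Assumes the M/M/1 model, where the mean is service_time / (1 - utilization).
    """
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_time_ms / (1 - utilization)


if __name__ == "__main__":
    # With a 1 ms service time, the mean grows to roughly 2, 10, and 100 ms.
    for utilization in (0.5, 0.9, 0.99):
        print(utilization, mean_time_in_system(1.0, utilization))
```
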
In public clouds and multitenant datacenters, resources are shared among many customers: the
network links and switches, and even each machine’s network interface and CPUs (when running on
virtual machines), are shared. Processing large amounts of data can use the entire capacity of
network links (*saturate* them). As you have no control over or insight into other customers’ usage of the shared
resources, network delays can be highly variable if someone near you (a *noisy neighbor*) is
using a lot of resources [^30] [^31].
In such environments, you can only choose timeouts experimentally: measure the distribution of
network round-trip times over an extended period, and over many machines, to determine the expected
variability of delays. Then, taking into account your application’s characteristics, you can
determine an appropriate trade-off between failure detection delay and risk of premature timeouts.
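
One simple way to turn such measurements into a timeout is to take a high percentile of the observed round-trip times and add some headroom. The sketch below assumes a 99th-percentile cutoff and a safety factor of 2; both are arbitrary illustrative choices that you would tune for your own application.

```python
import statistics


def timeout_from_samples(rtt_samples_ms, percentile=99, safety_factor=2.0):
    """Pick a timeout from measured round-trip times.

    percentile must be an integer from 1 to 99; safety_factor is an assumed
    margin on top of the chosen percentile.
    """
    # With n=100, quantiles() returns the 99 cut points between percentiles.
    cut_points = statistics.quantiles(rtt_samples_ms, n=100, method="inclusive")
    return cut_points[percentile - 1] * safety_factor


if __name__ == "__main__":
    samples = list(range(1, 102))  # hypothetical RTTs of 1..101 ms
    print(timeout_from_samples(samples))  # 200.0
```

A higher percentile or larger safety factor reduces premature timeouts at the cost of slower failure detection.
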
Even better, rather than using configured constant timeouts, systems can continually measure
response times and their variability (*jitter*), and automatically adjust timeouts according to the
observed response time distribution. The Phi Accrual failure detector [^32],
which is used for example in Akka and Cassandra [^33],
is one way of doing this. TCP retransmission timeouts work similarly [^5].
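
To illustrate the adaptive approach, here is a simplified Python sketch of the smoothed round-trip-time estimator that RFC 6298 specifies for TCP retransmission timeouts. It is not the Phi Accrual detector, and it omits parts of the RFC (clock granularity, the minimum timeout, and backoff after a retransmission); the class and method names are hypothetical.

```python
class RtoEstimator:
    """Adaptive timeout from a smoothed RTT and its smoothed variation."""

    ALPHA = 1 / 8  # weight of a new sample in the smoothed RTT
    BETA = 1 / 4   # weight of a new sample in the RTT variation
    K = 4          # how many "variations" of headroom to add

    def __init__(self):
        self.srtt = None    # smoothed round-trip time
        self.rttvar = None  # smoothed deviation of the round-trip time

    def observe(self, rtt):
        """Feed one measured round-trip time; returns the updated timeout."""
        if self.srtt is None:
            # First measurement initializes both estimates.
            self.srtt = rtt
            self.rttvar = rtt / 2
        else:
            # Update the variation first, using the previous smoothed RTT.
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - rtt)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * rtt
        return self.timeout()

    def timeout(self):
        return self.srtt + self.K * self.rttvar


if __name__ == "__main__":
    estimator = RtoEstimator()
    for rtt_ms in (100, 100, 500):  # hypothetical samples with one spike
        print(rtt_ms, estimator.observe(rtt_ms))
```

Stable round-trip times shrink the timeout toward the typical RTT, while a spike widens it quickly because the variation term is multiplied by `K`.
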
### Synchronous Versus Asynchronous Networks {#sec_distributed_sync_networks}
Distributed systems would be a lot simpler if we could rely on the network to deliver packets with
some fixed maximum delay, and not to drop packets. Why can’t we solve this at the hardware level
and make the network reliable so that the software doesn’t need to worry about it?
To answer this question, it’s interesting to compare datacenter networks to the traditional fixed-line
telephone network (non-cellular, non-VoIP), which is extremely reliable: delayed audio
frames and dropped calls are very rare. A phone call requires a constantly low end-to-end latency
and enough bandwidth to transfer the audio samples of your voice. Wouldn’t it be nice to have
similar reliability and predictability in computer networks?
When you make a call over the telephone network, it establishes a *circuit*: a fixed, guaranteed
amount of bandwidth is allocated for the call, along the entire route between the two callers. This
circuit remains in place until the call ends [^34].
For example, an ISDN network runs at a fixed rate of 4,000 frames per second. When a call is
established, it is allocated 16 bits of space within each frame (in each direction). Thus, for the
duration of the call, each side is guaranteed to be able to send exactly 16 bits of audio data every
250 microseconds [^35].
This kind of network is *synchronous*: even as data passes through several routers, it does not
suffer from queueing, because the 16 bits of space for the call have already been reserved in the
next hop of the network. And because there is no queueing, the maximum end-to-end latency of the
network is fixed. We call this a *bounded delay*.
#### Can we not simply make network delays predictable? {#can-we-not-simply-make-network-delays-predictable}
Note that a circuit in a telephone network is very different from a TCP connection: a circuit is a
fixed amount of reserved bandwidth which nobody else can use while the circuit is established,
whereas the packets of a TCP connection opportunistically use whatever network bandwidth is
available. You can give TCP a variable-sized block of data (e.g., an email or a web page), and it
will try to transfer it in the shortest time possible. While a TCP connection is idle, it doesn’t
use any bandwidth (except perhaps for an occasional keepalive packet).
If datacenter networks and the internet were circuit-switched networks, it would be possible to
establish a guaranteed maximum round-trip time when a circuit was set up. However, they are not:
Ethernet and IP are packet-switched protocols, which suffer from queueing and thus unbounded delays
in the network. These protocols do not have the concept of a circuit.
Why do datacenter networks and the internet use packet switching? The answer is that they are
optimized for *bursty traffic*. A circuit is good for an audio or video call, which needs to
transfer a fairly constant number of bits per second for the duration of the call. On the other
hand, requesting a web page, sending an email, or transferring a file doesn’t have any particular
bandwidth requirement—we just want it to complete as quickly as possible.
If you wanted to transfer a file over a circuit, you would have to guess a bandwidth allocation. If
you guess too low, the transfer is unnecessarily slow, leaving network capacity unused. If you guess
too high, the circuit cannot be set up (because the network cannot allow a circuit to be created if
its bandwidth allocation cannot be guaranteed). Thus, using circuits for bursty data transfers
wastes network capacity and makes transfers unnecessarily slow. By contrast, TCP dynamically adapts
the rate of data transfer to the available network capacity.
There have been some attempts to build hybrid networks that support both circuit switching and
packet switching. *Asynchronous Transfer Mode* (ATM) was a competitor to Ethernet in the 1980s, but
it didn’t gain much adoption outside of telephone network core switches. InfiniBand has some similarities [^36]:
it implements end-to-end flow control at the link layer, which reduces the need for queueing in the
network, although it can still suffer from delays due to link congestion [^37].
With careful use of *quality of service* (QoS, prioritization and scheduling of packets) and *admission
control* (rate-limiting senders), it is possible to emulate circuit switching on packet networks, or
provide statistically bounded delay [^27] [^34]. New network algorithms like Low Latency, Low
Loss, and Scalable Throughput (L4S) attempt to mitigate some of the queueing and congestion control
problems both at the client and router level. Linux’s traffic controller (TC) also allows
applications to reprioritize packets for QoS purposes.
--------
> [!TIP] LATENCY AND RESOURCE UTILIZATION
More generally, you can think of variable delays as a consequence of dynamic resource partitioning.
Say you have a wire between two telephone switches that can carry up to 10,000 simultaneous calls.
Each circuit that is switched over this wire occupies one of those call slots. Thus, you can think of
the wire as a resource that can be shared by up to 10,000 simultaneous users. The resource is
divided up in a *static* way: even if you’re the only call on the wire right now, and all other
9,999 slots are unused, your circuit is still allocated the same fixed amount of bandwidth as when
the wire is fully utilized.
By contrast, the internet shares network bandwidth *dynamically*. Senders push and jostle with each
other to get their packets over the wire as quickly as possible, and the network switches decide
which packet to send (i.e., the bandwidth allocation) from one moment to the next. This approach has the
downside of queueing, but the advantage is that it maximizes utilization of the wire. The wire has a
fixed cost, so if you utilize it better, each byte you send over the wire is cheaper.
A similar situation arises with CPUs: if you share each CPU core dynamically between several
threads, one thread sometimes has to wait in the operating system’s run queue while another thread
is running, so a thread can be paused for varying lengths of time [^38].
However, this utilizes the hardware better than if you allocated a static number of CPU cycles to
each thread (see [“Response time guarantees”](/en/ch9#sec_distributed_clocks_realtime)). Better hardware utilization is also why cloud
platforms run several virtual machines from different customers on the same physical machine.
Latency guarantees are achievable in certain environments, if resources are statically partitioned
(e.g., dedicated hardware and exclusive bandwidth allocations). However, it comes at the cost of
reduced utilization—in other words, it is more expensive. On the other hand, multitenancy with
dynamic resource partitioning provides better utilization, so it is cheaper, but it has the downside of variable delays.
Variable delays in networks are not a law of nature, but simply the result of a cost/benefit trade-off.
--------
However, such quality of service is currently not enabled in multitenant datacenters and public clouds, or when communicating via the internet.
Currently deployed technology does not allow us to make any guarantees about delays or reliability
of the network: we have to assume that network congestion, queueing, and unbounded delays will
happen. Consequently, there’s no “correct” value for timeouts—they need to be determined experimentally.
Peering agreements between internet service providers and the establishment of routes through the
Border Gateway Protocol (BGP) bear closer resemblance to circuit switching than IP itself. At this
level, it is possible to buy dedicated bandwidth. However, internet routing operates at the level of
networks, not individual connections between hosts, and at a much longer timescale.
## Unreliable Clocks {#sec_distributed_clocks}
Clocks and time are important. Applications depend on clocks in various ways to answer questions
like the following:
1. Has this request timed out yet?
2. What’s the 99th percentile response time of this service?
3. How many queries per second did this service handle on average in the last five minutes?
4. How long did the user spend on our site?
5. When was this article published?
6. At what date and time should the reminder email be sent?
7. When does this cache entry expire?
8. What is the timestamp on this error message in the log file?
Examples 1–4 measure *durations* (e.g., the time interval between a request being sent and a
response being received), whereas examples 5–8 describe *points in time* (events that occur on a
particular date, at a particular time).
In a distributed system, time is a tricky business, because communication is not instantaneous: it
takes time for a message to travel across the network from one machine to another. The time when a
message is received is always later than the time when it is sent, but due to variable delays in the
network, we don’t know how much later. This fact sometimes makes it difficult to determine the order
in which things happened when multiple machines are involved.
Moreover, each machine on the network has its own clock, which is an actual hardware device: usually
a quartz crystal oscillator. These devices are not perfectly accurate, so each machine has its own
notion of time, which may be slightly faster or slower than on other machines. It is possible to
synchronize clocks to some degree: the most commonly used mechanism is the Network Time Protocol (NTP), which
allows the computer clock to be adjusted according to the time reported by a group of servers [^39].
The servers in turn get their time from a more accurate time source, such as a GPS receiver.
### Monotonic Versus Time-of-Day Clocks {#sec_distributed_monotonic_timeofday}
Modern computers have at least two different kinds of clocks: a *time-of-day clock* and a *monotonic
clock*. Although they both measure time, it is important to distinguish the two, since they serve
different purposes.
#### Time-of-day clocks {#time-of-day-clocks}
A time-of-day clock does what you intuitively expect of a clock: it returns the current date and
time according to some calendar (also known as *wall-clock time*). For example,
`clock_gettime(CLOCK_REALTIME)` on Linux and
`System.currentTimeMillis()` in Java return the number of seconds (or milliseconds) since the
*epoch*: midnight UTC on January 1, 1970, according to the Gregorian calendar, not counting leap
seconds. Some systems use other dates as their reference point.
(Although the Linux clock is called *real-time*, it has nothing to do with real-time operating
systems, as discussed in [“Response time guarantees”](/en/ch9#sec_distributed_clocks_realtime).)
Time-of-day clocks are usually synchronized with NTP, which means that a timestamp from one machine
(ideally) means the same as a timestamp on another machine. However, time-of-day clocks also have
various oddities, as described in the next section. In particular, if the local clock is too far
ahead of the NTP server, it may be forcibly reset and appear to jump back to a previous point in
time. These jumps, as well as similar jumps caused by leap seconds, make time-of-day clocks
unsuitable for measuring elapsed time [^40].
Time-of-day clocks can experience jumps due to the start and end of Daylight Saving Time (DST);
these can be avoided by always using UTC as the time zone, which does not have DST.
Time-of-day clocks have also historically had quite a coarse-grained resolution, e.g., moving forward
in steps of 10 ms on older Windows systems [^41].
On recent systems, this is less of a problem.
#### Monotonic clocks {#monotonic-clocks}
A monotonic clock is suitable for measuring a duration (time interval), such as a timeout or a
service’s response time: `clock_gettime(CLOCK_MONOTONIC)` or `clock_gettime(CLOCK_BOOTTIME)` on Linux [^42]
and `System.nanoTime()` in Java are monotonic clocks, for example. The name comes from the fact that
they are guaranteed to always move forward (whereas a time-of-day clock may jump back in time).
You can check the value of the monotonic clock at one point in time, do something, and then check
the clock again at a later time. The *difference* between the two values tells you how much time
elapsed between the two checks — more like a stopwatch than a wall clock. However, the *absolute*
value of the clock is meaningless: it might be the number of nanoseconds since the computer was
booted up, or something similarly arbitrary. In particular, it makes no sense to compare monotonic
clock values from two different computers, because they don’t mean the same thing.
On a server with multiple CPU sockets, there may be a separate timer per CPU, which is not
necessarily synchronized with other CPUs [^43].
Operating systems compensate for any discrepancy and try
to present a monotonic view of the clock to application threads, even as they are scheduled across
different CPUs. However, it is wise to take this guarantee of monotonicity with a pinch of salt [^44].
NTP may adjust the frequency at which the monotonic clock moves forward (this is known as *slewing*
the clock) if it detects that the computer’s local quartz is moving faster or slower than the NTP
server. By default, NTP allows the clock rate to be speeded up or slowed down by up to 0.05%, but
NTP cannot cause the monotonic clock to jump forward or backward. The resolution of monotonic
clocks is usually quite good: on most systems they can measure time intervals in microseconds or
less.
In a distributed system, using a monotonic clock for measuring elapsed time (e.g., timeouts) is
usually fine, because it doesn’t assume any synchronization between different nodes’ clocks and is
not sensitive to slight inaccuracies of measurement.
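Python exposes the same distinction: `time.monotonic()` reads a monotonic clock, whereas `time.time()` reads the time-of-day clock. The sketch below uses the monotonic clock for two duration-style tasks from the list at the start of this section, measuring a response time and checking a timeout. The helper names `timed` and `has_timed_out` are made up for this example.

```python
import time


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed_seconds)."""
    start = time.monotonic()            # the absolute value is meaningless on its own
    result = fn(*args)
    elapsed = time.monotonic() - start  # only the difference is meaningful
    return result, elapsed


def has_timed_out(deadline):
    """deadline is a value previously derived from time.monotonic() on this machine."""
    return time.monotonic() >= deadline


result, elapsed = timed(sum, range(1_000_000))
print(f"sum took {elapsed * 1000:.2f} ms")

deadline = time.monotonic() + 0.5       # "time out half a second from now"
print(has_timed_out(deadline))          # False, unless the process was paused for 0.5 s
```

A deadline computed this way is only valid within the process that computed it. Sending it to another machine would be comparing monotonic clock values from two different computers.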
### Clock Synchronization and Accuracy {#sec_distributed_clock_accuracy}
Monotonic clocks don’t need synchronization, but time-of-day clocks need to be set according to an
NTP server or other external time source in order to be useful. Unfortunately, our methods for
getting a clock to tell the correct time aren’t nearly as reliable or accurate as you might
hope—hardware clocks and NTP can be fickle beasts. To give just a few examples:
* The quartz clock in a computer is not very accurate: it *drifts* (runs faster or slower than it
should). Clock drift varies depending on the temperature of the machine. Google assumes a clock
drift of up to 200 ppm (parts per million) for its servers [^45],
which is equivalent to 6 ms drift for a clock that is resynchronized with a server every 30
seconds, or 17 seconds drift for a clock that is resynchronized once a day. This drift limits the best
possible accuracy you can achieve, even if everything is working correctly.
* If a computer’s clock differs too much from an NTP server, it may refuse to synchronize, or the
local clock will be forcibly reset [^39]. Any applications observing the time before and after this reset may see time go backward or suddenly jump forward.
* If a node is accidentally firewalled off from NTP servers, the misconfiguration may go
unnoticed for some time, during which the drift may add up to large discrepancies between
different nodes’ clocks. Anecdotal evidence suggests that this does happen in practice.
* NTP synchronization can only be as good as the network delay, so there is a limit to its
accuracy when you’re on a congested network with variable packet delays. One experiment showed
that a minimum error of 35 ms is achievable when synchronizing over the internet [^46],
though occasional spikes in network delay lead to errors of around a second. Depending on the
configuration, large network delays can cause the NTP client to give up entirely.
* Some NTP servers are wrong or misconfigured, reporting time that is off by hours [^47] [^48].
NTP clients mitigate such errors by querying several servers and ignoring outliers.
Nevertheless, it’s somewhat worrying to bet the correctness of your systems on the time that you
were told by a stranger on the internet.
* Leap seconds result in a minute that is 59 seconds or 61 seconds long, which messes up timing
assumptions in systems that are not designed with leap seconds in mind [^49].
The fact that leap seconds have crashed many large systems [^40] [^50]
shows how easy it is for incorrect assumptions about clocks to sneak into a system. The best
way of handling leap seconds may be to make NTP servers “lie,” by performing the leap second
adjustment gradually over the course of a day (this is known as *smearing*) [^51] [^52],
although actual NTP server behavior varies in practice [^53].
Leap seconds will no longer be used from 2035 onwards, so this problem will fortunately go away.
* In virtual machines, the hardware clock is virtualized, which raises additional challenges for applications that need accurate timekeeping [^54].
When a CPU core is shared between virtual machines, each VM is paused for tens of milliseconds
while another VM is running. From an application’s point of view, this pause manifests itself as
the clock suddenly jumping forward [^29].
If a VM pauses for several seconds, the clock may then be several seconds behind the actual time,
but NTP may continue to report that the clock is almost perfectly in sync [^55].
* If you run software on devices that you don’t fully control (e.g., mobile or embedded devices), you
probably cannot trust the device’s hardware clock at all. Some users deliberately set their
hardware clock to an incorrect date and time, for example to cheat in games [^56].
As a result, the clock might be set to a time wildly in the past or the future.
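The drift figures in the first bullet point follow from multiplying the drift rate by the time since the last synchronization. The following Python lines reproduce that arithmetic; the function name is an assumption made for this example.

```python
def max_drift_seconds(drift_ppm, seconds_since_sync):
    """Worst-case clock error accumulated since the last synchronization."""
    return drift_ppm * 1e-6 * seconds_since_sync


# 200 ppm, resynchronized every 30 seconds
print(f"{max_drift_seconds(200, 30) * 1000:.1f} ms")   # 6.0 ms

# 200 ppm, resynchronized once a day (86,400 seconds)
print(f"{max_drift_seconds(200, 86_400):.1f} s")       # 17.3 s
```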
It is possible to achieve very good clock accuracy if you care about it sufficiently to invest
significant resources. For example, the MiFID II European regulation for financial
institutions requires all high-frequency trading funds to synchronize their clocks to within 100
microseconds of UTC, in order to help debug market anomalies such as “flash crashes” and to help
detect market manipulation [^57].
Such accuracy can be achieved with some special hardware (GPS receivers and/or atomic clocks), the
Precision Time Protocol (PTP) and careful deployment and monitoring [^58] [^59].
Relying on GPS alone can be risky because GPS signals can easily be jammed. In some locations this
happens frequently, e.g., close to military facilities [^60].
Some cloud providers have begun offering high-accuracy clock synchronization for their virtual machines [^61].
However, clock synchronization still requires a lot of care. If your NTP daemon is misconfigured, or
a firewall is blocking NTP traffic, the clock error due to drift can quickly become large.
### Relying on Synchronized Clocks {#sec_distributed_clocks_relying}
The problem with clocks is that while they seem simple and easy to use, they have a surprising
number of pitfalls: a day may not have exactly 86,400 seconds, time-of-day clocks may move backward
in time, and the time according to one node’s clock may be quite different from another node’s clock.
Earlier in this chapter we discussed networks dropping and arbitrarily delaying packets. Even though
networks are well behaved most of the time, software must be designed on the assumption that the
network will occasionally be faulty, and the software must handle such faults gracefully. The same
is true with clocks: although they work quite well most of the time, robust software needs to be
prepared to deal with incorrect clocks.
Part of the problem is that incorrect clocks easily go unnoticed. If a machine’s CPU is defective or
its network is misconfigured, it most likely won’t work at all, so it will quickly be noticed and
fixed. On the other hand, if its quartz clock is defective or its NTP client is misconfigured, most
things will seem to work fine, even though its clock gradually drifts further and further away from
reality. If some piece of software is relying on an accurately synchronized clock, the result is
more likely to be silent and subtle data loss than a dramatic crash [^62] [^63].
Thus, if you use software that requires synchronized clocks, it is essential that you also carefully
monitor the clock offsets between all the machines. Any node whose clock drifts too far from the
others should be declared dead and removed from the cluster. Such monitoring ensures that you notice
the broken clocks before they can cause too much damage.
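A monitoring check of this kind could be as simple as the following Python sketch. It assumes that some other component has already measured each node's clock offset in milliseconds, and it uses the median offset as the reference point. Both the function name and the choice of the median are assumptions for illustration, not a description of any particular monitoring system.

```python
from statistics import median


def nodes_with_bad_clocks(offsets_ms, max_offset_ms):
    """Return the nodes whose clock offset is too far from the cluster median.

    offsets_ms maps a node name to its measured clock offset in milliseconds.
    """
    reference = median(offsets_ms.values())
    return sorted(
        node
        for node, offset in offsets_ms.items()
        if abs(offset - reference) > max_offset_ms
    )


measured = {"a": 0.5, "b": -0.3, "c": 250.0, "d": 0.1}
print(nodes_with_bad_clocks(measured, max_offset_ms=10.0))  # ['c']
```

Comparing against the median rather than the mean means that one wildly wrong clock does not drag the reference point along with it.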
#### Timestamps for ordering events {#sec_distributed_lww}
Let’s consider one particular situation in which it is tempting, but dangerous, to rely on clocks:
ordering of events across multiple nodes [^64].
For example, if two clients write to a distributed database, who got there first? Which write is the
more recent one?
[Figure 9-3](/en/ch9#fig_distributed_timestamps) illustrates a dangerous use of time-of-day clocks in a database with
multi-leader replication (the example is similar to [Figure 6-8](/en/ch6#fig_replication_causality)). Client A writes
*x* = 1 on node 1; the write is replicated to node 3; client B increments *x* on node
3 (we now have *x* = 2); and finally, both writes are replicated to node 2.
{{< figure src="/fig/ddia_0903.png" id="fig_distributed_timestamps" caption="Figure 9-3. The write by client B is causally later than the write by client A, but B's write has an earlier timestamp." class="w-full my-4" >}}
In [Figure 9-3](/en/ch9#fig_distributed_timestamps), when a write is replicated to other nodes, it is tagged with a
timestamp according to the time-of-day clock on the node where the write originated. The clock
synchronization is very good in this example: the skew between node 1 and node 3 is less than
3 ms, which is probably better than you can expect in practice.
Since the increment builds upon the earlier write of *x* = 1, we might expect that the
write of *x* = 2 should have the greater timestamp of the two. Unfortunately, that is
not what happens in [Figure 9-3](/en/ch9#fig_distributed_timestamps): the write *x* = 1 has a timestamp of
42.004 seconds, but the write *x* = 2 has a timestamp of 42.003 seconds.
As discussed in [“Last write wins (discarding concurrent writes)”](/en/ch6#sec_replication_lww), one way of resolving conflicts between concurrently written
values on different nodes is *last write wins* (LWW), which means keeping the write with the
greatest timestamp for a given key and discarding all writes with older timestamps. In the example
of [Figure 9-3](/en/ch9#fig_distributed_timestamps), when node 2 receives these two events, it will incorrectly
conclude that *x* = 1 is the more recent value and drop the write *x* = 2,
so the increment is lost.
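The lost increment can be reproduced in a few lines of Python. The sketch below models node 2 as a last-write-wins store and feeds it the two writes with the timestamps from the figure. The `Write` and `LwwStore` names are hypothetical, and a real database would track much more state than this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    key: str
    value: int
    timestamp: float  # time-of-day clock of the originating node, in seconds


class LwwStore:
    """Keeps, for each key, only the write with the greatest timestamp."""

    def __init__(self):
        self.latest = {}

    def apply(self, write):
        existing = self.latest.get(write.key)
        if existing is None or write.timestamp > existing.timestamp:
            self.latest[write.key] = write
        # Otherwise the incoming write is silently discarded.

    def get(self, key):
        return self.latest[key].value


node2 = LwwStore()
node2.apply(Write("x", 1, 42.004))  # client A's write, timestamped by node 1
node2.apply(Write("x", 2, 42.003))  # client B's increment, timestamped by node 3
print(node2.get("x"))               # 1
```

The outcome does not depend on the order in which the two writes reach node 2: the increment is dropped either way, and no error is raised.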
This problem can be prevented by ensuring that when a value is overwritten, the new value always has
a higher timestamp than the overwritten value, even if that timestamp is ahead of the writer’s local
clock. However, that incurs the cost of an additional read to find the greatest existing timestamp.
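A minimal sketch of that rule, assuming integer millisecond timestamps and a hypothetical helper that is given the greatest existing timestamp (the result of the additional read):

```python
def next_write_timestamp(local_clock_ms, overwritten_ts_ms):
    """Timestamp for a write that overwrites a value tagged overwritten_ts_ms.

    The result is strictly greater than the overwritten timestamp, even if
    that puts it ahead of the writer's own clock.
    """
    return max(local_clock_ms, overwritten_ts_ms + 1)


# Node 3's clock reads 42,003 ms, but the value it overwrites is tagged 42,004 ms.
print(next_write_timestamp(42_003, 42_004))  # 42005
```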
Some systems, including Cassandra and ScyllaDB, want to write to all replicas in a single round
trip, and therefore they simply use the client clock’s timestamp along with a last write wins
policy [^62]. This approach has some serious problems:
* Database writes can mysteriously disappear: a node with a lagging clock is unable to overwrite
values previously written by a node with a fast clock until the clock skew between the nodes has elapsed [^63] [^65].
This scenario can cause arbitrary amounts of data to be silently dropped without any error being
reported to the application.
* LWW cannot distinguish between writes that occurred sequentially in quick succession (in
[Figure 9-3](/en/ch9#fig_distributed_timestamps), client B’s increment definitely occurs *after* client A’s write)
and writes that were truly concurrent (neither writer was aware of the other). Additional
causality tracking mechanisms, such as version vectors, are needed in order to prevent violations
of causality (see [“Detecting Concurrent Writes”](/en/ch6#sec_replication_concurrent)).
* It is possible for two nodes to independently generate writes with the same timestamp, especially
when the clock only has millisecond resolution. An additional tiebreaker value (which can simply
be a large random number) is required to resolve such conflicts, but this approach can also lead to
violations of causality [^62].
Thus, even though it is tempting to resolve conflicts by keeping the most “recent” value and
discarding others, it’s important to be aware that the definition of “recent” depends on a local
time-of-day clock, which may well be incorrect. Even with tightly NTP-synchronized clocks, you could
send a packet at timestamp 100 ms (according to the sender’s clock) and have it arrive at
timestamp 99 ms (according to the recipient’s clock)—so it appears as though the packet
arrived before it was sent, which is impossible.
Could NTP synchronization be made accurate enough that such incorrect orderings cannot occur?
Probably not, because NTP’s synchronization accuracy is itself limited by the network round-trip
time, in addition to other sources of error such as quartz drift. To guarantee a correct ordering,
you would need the clock error to be significantly lower than the network delay, which is not possible.
So-called *logical clocks* [^66], which are based on incrementing counters rather than an oscillating quartz crystal, are a safer
alternative for ordering events (see [“Detecting Concurrent Writes”](/en/ch6#sec_replication_concurrent)). Logical clocks do not measure
the time of day or the number of seconds elapsed, only the relative ordering of events (whether one
event happened before or after another). In contrast, time-of-day and monotonic clocks, which
measure actual elapsed time, are also known as *physical clocks*. We’ll look at logical clocks in
more detail in [“ID Generators and Logical Clocks”](/en/ch10#sec_consistency_logical).
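To give a flavor of what a counter-based clock looks like, here is a Python sketch of one well-known form, the Lamport clock. It is offered only as an illustration; the chapter referenced above covers logical clocks properly, and the class and method names here are chosen for this example.

```python
class LamportClock:
    """A logical clock: a counter that captures ordering, not elapsed time."""

    def __init__(self):
        self.counter = 0

    def tick(self):
        """Call for every local event, such as accepting a write."""
        self.counter += 1
        return self.counter

    def receive(self, remote_counter):
        """Call when a message carrying another node's counter arrives."""
        self.counter = max(self.counter, remote_counter) + 1
        return self.counter


node1, node3 = LamportClock(), LamportClock()

ts_a = node1.tick()    # client A writes x = 1 on node 1
node3.receive(ts_a)    # the write is replicated to node 3
ts_b = node3.tick()    # client B increments x on node 3

print(ts_a, ts_b)      # 1 3
```

Because node 3 advances its counter past every counter value it has seen, the increment is guaranteed a greater timestamp than the write it builds upon, regardless of how the nodes' quartz clocks behave.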
#### Clock readings with a confidence interval {#clock-readings-with-a-confidence-interval}
You may be able to read a machine’s time-of-day clock with microsecond or even nanosecond
resolution. But even if you can get such a fine-grained measurement, that doesn’t mean the value is
actually accurate to such precision. In fact, it most likely is not—as mentioned previously, the
drift in an imprecise quartz clock can easily be several milliseconds, even if you synchronize with
an NTP server on the local network every minute. With an NTP server on the public internet, the best
possible accuracy is probably to the tens of milliseconds, and the error may easily spike to over
100 ms when there is network congestion.
Thus, it doesn’t make sense to think of a clock reading as a point in time—it is more like a
range of times, within a confidence interval: for example, a system may be 95% confident that the
time now is between 10.3 and 10.5 seconds past the minute, but it doesn’t know any more precisely than that [^67].
If we only know the time to within ±100 ms, the microsecond digits in the timestamp are essentially meaningless.
The uncertainty bound can be calculated based on your time source. If you have a GPS receiver or
atomic clock directly attached to your computer, the expected error range is determined by
the device and, in the case of GPS, by the quality of the signal from the satellites. If you’re
getting the time from a server, the uncertainty is based on the expected quartz drift since your
last sync with the server, plus the NTP server’s uncertainty, plus the network round-trip time to
the server (to a first approximation, and assuming you trust the server).
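For the server case, the first-approximation sum can be written down directly. The following Python sketch simply adds the three terms named above; the function and parameter names are assumptions, and a real implementation would estimate each term far more carefully.

```python
def clock_uncertainty_seconds(seconds_since_sync, drift_ppm,
                              server_uncertainty_s, round_trip_s):
    """First-approximation error bound for a clock synchronized from a server."""
    drift_s = drift_ppm * 1e-6 * seconds_since_sync
    return drift_s + server_uncertainty_s + round_trip_s


# 30 s since the last sync at 200 ppm drift, a server that is itself accurate
# to 1 ms, and a 2 ms network round trip to that server
bound = clock_uncertainty_seconds(30, 200, 0.001, 0.002)
print(f"{bound * 1000:.1f} ms")  # 9.0 ms
```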
Unfortunately, most systems don’t expose this uncertainty: for example, when you call
`clock_gettime()`, the return value doesn’t tell you the expected error of the timestamp, so you
don’t know if its confidence interval is five milliseconds or five years.
There are exceptions: the *TrueTime* API in Google’s Spanner [^45] and Amazon’s ClockBound explicitly report the
confidence interval on the local clock. When you ask such an API for the current time, you get back two
values: `[earliest, latest]`, which are the *earliest possible* and the *latest possible*
timestamp. Based on its uncertainty calculations, the clock knows that the actual current time is
somewhere within that interval. The width of the interval depends, among other things, on how long
it has been since the local quartz clock was last synchronized with a more accurate clock source.
#### Synchronized clocks for global snapshots {#sec_distributed_spanner}
In [“Snapshot Isolation and Repeatable Read”](/en/ch8#sec_transactions_snapshot_isolation) we discussed *multi-version concurrency control* (MVCC),
which is a very useful feature in databases that need to support both small, fast read-write
transactions and large, long-running read-only transactions (e.g., for backups or analytics). It
allows read-only transactions to see a *snapshot* of the database, a consistent state at a
particular point in time, without locking and interfering with read-write transactions.
Generally, MVCC requires a monotonically increasing transaction ID. If a write happened later than
the snapshot (i.e., the write has a greater transaction ID than the snapshot), that write is
invisible to the snapshot transaction. On a single-node database, a simple counter is sufficient for
generating transaction IDs.
However, when a database is distributed across many machines, potentially in multiple datacenters, a
global, monotonically increasing transaction ID (across all shards) is difficult to generate,
because it requires coordination. The transaction ID must reflect causality: if transaction B reads
or overwrites a value that was previously written by transaction A, then B must have a higher
transaction ID than A—otherwise, the snapshot would not be consistent. With lots of small, rapid
transactions, creating transaction IDs in a distributed system becomes an untenable
bottleneck. (We will discuss such ID generators in [“ID Generators and Logical Clocks”](/en/ch10#sec_consistency_logical).)
Can we use the timestamps from synchronized time-of-day clocks as transaction IDs? If we could get
the synchronization good enough, they would have the right properties: later transactions have a
higher timestamp. The problem, of course, is the uncertainty about clock accuracy.
Spanner implements snapshot isolation across datacenters in this way [^68] [^69].
It uses the clock’s confidence interval as reported by the TrueTime API, and is based on the
following observation: if you have two confidence intervals, each consisting of an earliest and
latest possible timestamp ($A = [A_{\text{earliest}}, A_{\text{latest}}]$ and $B = [B_{\text{earliest}}, B_{\text{latest}}]$), and those two intervals do not overlap
(i.e., $A_{\text{earliest}} < A_{\text{latest}} < B_{\text{earliest}} < B_{\text{latest}}$), then B definitely happened after A: there
can be no doubt. Only if the intervals overlap are we unsure in which order A and B happened.
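The observation translates into a very small comparison function. In the Python sketch below, `TimeInterval` is a hypothetical type standing in for whatever a clock with a confidence interval returns; it is not the TrueTime API.

```python
from typing import NamedTuple


class TimeInterval(NamedTuple):
    earliest: float  # earliest possible timestamp, in seconds
    latest: float    # latest possible timestamp, in seconds


def definitely_before(a, b):
    """True only if a's interval ends before b's interval begins."""
    return a.latest < b.earliest


def ordering(a, b):
    if definitely_before(a, b):
        return "a before b"
    if definitely_before(b, a):
        return "b before a"
    return "uncertain"  # the intervals overlap


print(ordering(TimeInterval(10.0, 10.2), TimeInterval(10.3, 10.5)))  # a before b
print(ordering(TimeInterval(10.0, 10.4), TimeInterval(10.3, 10.5)))  # uncertain
```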
In order to ensure that transaction timestamps reflect causality, Spanner deliberately waits for the
length of the confidence interval before committing a read-write transaction. By doing so, it
ensures that any transaction that may read the data is at a sufficiently later time, so their
confidence intervals do not overlap. In order to keep the wait time as short as possible, Spanner
needs to keep the clock uncertainty as small as possible; for this purpose, Google deploys a GPS
receiver or atomic clock in each datacenter, allowing clocks to be synchronized to within about 7 ms [^45].
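The waiting step can be sketched as follows. This is a simplified illustration rather than Spanner's implementation: `IntervalClock` is a hypothetical stand-in that wraps the local time-of-day clock with a fixed uncertainty, whereas a real interval clock would compute the uncertainty from its time sources. The transaction takes the latest possible current time as its timestamp and then waits until even the earliest possible current time has passed it.

```python
import time


class IntervalClock:
    """Hypothetical clock that reports (earliest, latest) instead of one value."""

    def __init__(self, uncertainty_s, read_clock=time.time):
        self.uncertainty_s = uncertainty_s
        self.read_clock = read_clock

    def now(self):
        t = self.read_clock()
        return (t - self.uncertainty_s, t + self.uncertainty_s)


def commit_with_wait(clock, sleep=time.sleep):
    """Pick a commit timestamp, then wait until it is definitely in the past."""
    _, commit_ts = clock.now()          # latest possible current time
    while True:
        earliest, _ = clock.now()
        if earliest > commit_ts:
            return commit_ts            # safe: no clock in doubt about commit_ts
        # Sleep for the remaining gap, with a small floor to avoid busy-waiting.
        sleep(max(commit_ts - earliest, 0.0001))


clock = IntervalClock(uncertainty_s=0.0035)  # an interval about 7 ms wide
start = time.monotonic()
commit_with_wait(clock)
print(f"waited {(time.monotonic() - start) * 1000:.1f} ms")
```

With a fixed uncertainty, the wait is roughly the width of the interval, which is why keeping the clock uncertainty small directly reduces commit latency.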
The atomic clocks and GPS receivers are not strictly necessary in Spanner: the important thing is to
have a confidence interval, and the accurate clock sources only help keep that interval small. Other
systems are beginning to adopt similar approaches: for example, YugabyteDB can leverage ClockBound
when running on AWS [^70], and several other systems now also rely on clock synchronization to various degrees [^71] [^72].
### Process Pauses {#sec_distributed_clocks_pauses}
Let’s consider another example of dangerous clock use in a distributed system. Say you have a
database with a single leader per shard. Only the leader is allowed to accept writes. How does a
node know that it is still leader (that it hasn’t been declared dead by the others), and that it may
safely accept writes?
One option is for the leader to obtain a *lease* from the other nodes, which is similar to a lock with a timeout [^73].
Only one node can hold the lease at any one time—thus, when a node obtains a lease, it knows that
it is the leader for some amount of time, until the lease expires. In order to remain leader, the
node must periodically renew the lease before it expires. If the node fails, it stops renewing the
lease, so another node can take over when it expires.
You can imagine the request-handling loop looking something like this:
```java
while (true) {
    request = getIncomingRequest();

    // Ensure that the lease always has at least 10 seconds remaining
    if (lease.expiryTimeMillis - System.currentTimeMillis() < 10000) {
        lease = lease.renew();
    }

    if (lease.isValid()) {
        process(request);
    }
}
```
What’s wrong with this code? Firstly, it’s relying on synchronized clocks: the expiry time on the
lease is set by a different machine (where the expiry may be calculated as the current time plus 30
seconds, for example), and it’s being compared to the local system clock. If the clocks are out of
sync by more than a few seconds, this code will start doing strange things.
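One way to avoid comparing timestamps from different machines, sketched here under stated assumptions, is for the lease to be granted as a *duration* rather than an absolute expiry time, and for the client to measure that duration on its own monotonic clock, starting from a reading taken before it sent the request. The `LocalLease` class and its parameters are hypothetical.

```java
import java.util.function.LongSupplier;

// Hypothetical client-side view of a lease, timed only by the local monotonic clock.
class LocalLease {
    private final LongSupplier monotonicNanos; // for example System::nanoTime
    private final long deadlineNanos;

    // requestSentNanos must be read before the lease request is sent, so that
    // network delay can only shorten the local view of the lease, never extend it.
    LocalLease(LongSupplier monotonicNanos, long requestSentNanos, long grantedDurationNanos) {
        this.monotonicNanos = monotonicNanos;
        this.deadlineNanos = requestSentNanos + grantedDurationNanos;
    }

    // Computed by subtraction, which is the safe way to compare nanoTime readings.
    long remainingNanos() {
        return deadlineNanos - monotonicNanos.getAsLong();
    }

    boolean isValid() {
        return remainingNanos() > 0;
    }
}
```

This removes the dependence on synchronized time-of-day clocks, although it still assumes that the clocks on both machines tick at roughly the same rate.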
Secondly, even if we change the protocol to only use the local monotonic clock, there is another
problem: the code assumes that very little time passes between the point that it checks the time
(`System.currentTimeMillis()`) and the time when the request is processed (`process(request)`).
Normally this code runs very quickly, so the 10 second buffer is more than enough to ensure that the
lease doesn’t expire in the middle of processing a request.
However, what if there is an unexpected pause in the execution of the program? For example, imagine
the thread stops for 15 seconds around the line `lease.isValid()` before finally continuing. In
that case, it’s likely that the lease will have expired by the time the request is processed, and
another node has already taken over as leader. However, there is nothing to tell this thread that it
was paused for so long, so this code won’t notice that the lease has expired until the next
iteration of the loop—by which time it may have already done something unsafe by processing the
request.
Is it reasonable to assume that a thread might be paused for so long? Unfortunately yes. There are
various reasons why this could happen:
* Contention among threads accessing a shared resource, such as a lock or queue, can cause threads
to spend a lot of their time waiting. Moving to a machine with more CPU cores can make such
problems worse, and contention problems can be difficult to diagnose [^74].
* Many programming language runtimes (such as the Java Virtual Machine) have a *garbage collector*
(GC) that occasionally needs to stop all running threads. In the past, such *“stop-the-world” GC
pauses* would sometimes last for several minutes [^75]!
With modern GC algorithms this is less of a problem, but GC pauses can still be noticeable (see
[“Limiting the impact of garbage collection”](/en/ch9#sec_distributed_gc_impact)).
* In virtualized environments, a virtual machine can be *suspended* (pausing the execution of all
processes and saving the contents of memory to disk) and *resumed* (restoring the contents of
memory and continuing execution). This pause can occur at any time in a process’s execution and can
last for an arbitrary length of time. This feature is sometimes used for *live migration* of
virtual machines from one host to another without a reboot, in which case the length of the pause
depends on the rate at which processes are writing to memory [^76].
* On end-user devices such as laptops and phones, execution may also be suspended and resumed
arbitrarily, e.g., when the user closes the lid of their laptop.
* When the operating system context-switches to another thread, or when the hypervisor switches to a
different virtual machine (when running in a virtual machine), the currently running thread can be
paused at any arbitrary point in the code. In the case of a virtual machine, the CPU time spent in
other virtual machines is known as *steal time*. If the machine is under heavy load—i.e., if
there is a long queue of threads waiting to run—it may take some time before the paused thread
gets to run again.
* If the application performs synchronous disk access, a thread may be paused waiting for a slow
disk I/O operation to complete [^77]. In many languages, disk access can happen at unexpected
moments, even if the code doesn’t explicitly mention file access. For example, the Java
classloader lazily loads class files when they are first used, which could happen at any time in
the program execution. I/O pauses and GC pauses may even conspire to combine their delays [^78].
If the disk is actually a network filesystem or network block device (such as Amazon’s EBS), the
I/O latency is further subject to the variability of network delays [^31].
* If the operating system is configured to allow *swapping to disk* (*paging*), a simple memory
access may result in a page fault that requires a page from disk to be loaded into memory. The
thread is paused while this slow I/O operation takes place. If memory pressure is high, this may
in turn require a different page to be swapped out to disk. In extreme circumstances, the
operating system may spend most of its time swapping pages in and out of memory and getting little
actual work done (this is known as *thrashing*). To avoid this problem, paging is often disabled
on server machines (if you would rather kill a process to free up memory than risk thrashing).
* A Unix process can be paused by sending it the `SIGSTOP` signal. (Pressing Ctrl-Z in a shell sends
the closely related `SIGTSTP` signal, which stops the process in the same way unless the process
handles it.) `SIGSTOP` immediately stops the process from getting any more CPU cycles until it is
resumed with `SIGCONT`, at which point it continues running where it left off. Even if your
environment does not normally use `SIGSTOP`, it might be sent accidentally by an operations
engineer.
All of these occurrences can *preempt* the running thread at any point and resume it at some later time,
without the thread even noticing. The problem is similar to making multi-threaded code on a single
machine thread-safe: you can’t assume anything about timing, because arbitrary context switches and
parallelism may occur.
When writing multi-threaded code on a single machine, we have fairly good tools for making it
thread-safe: mutexes, semaphores, atomic counters, lock-free data structures, blocking queues, and
so on. Unfortunately, these tools don’t directly translate to distributed systems, because a
distributed system has no shared memory—only messages sent over an unreliable network.
A node in a distributed system must assume that its execution can be paused for a significant length
of time at any point, even in the middle of a function. During the pause, the rest of the world
keeps moving and may even declare the paused node dead because it’s not responding. Eventually,
the paused node may continue running, without even noticing that it was asleep until it checks its
clock sometime later.
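A node cannot observe a pause while it is happening, but it can notice one afterwards by comparing readings of its monotonic clock. The following sketch, with the hypothetical name `PauseDetector`, measures how much longer than expected the gap between two periodic checks was. It only detects pauses after the fact and does nothing to prevent them.

```java
import java.util.function.LongSupplier;

// Hypothetical helper: reports how late a periodic check ran.
class PauseDetector {
    private final LongSupplier monotonicNanos; // for example System::nanoTime
    private long lastTickNanos;

    PauseDetector(LongSupplier monotonicNanos) {
        this.monotonicNanos = monotonicNanos;
        this.lastTickNanos = monotonicNanos.getAsLong();
    }

    // Call at a regular interval. Returns how much longer than expected the gap
    // since the previous call was, or 0 if the call was on time or early.
    long tick(long expectedIntervalNanos) {
        long now = monotonicNanos.getAsLong();
        long elapsed = now - lastTickNanos;
        lastTickNanos = now;
        return Math.max(0L, elapsed - expectedIntervalNanos);
    }
}
```

A background thread might call `tick` every few milliseconds and log any large excess, which can help to explain timeouts that other nodes observed during the pause.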
#### Response time guarantees {#sec_distributed_clocks_realtime}
In many programming languages and operating systems, threads and processes may pause for an
unbounded amount of time, as discussed. Those reasons for pausing *can* be eliminated if you try
hard enough.
Some software runs in environments where a failure to respond within a specified time can cause
serious damage: computers that control aircraft, rockets, robots, cars, and other physical objects
must respond quickly and predictably to their sensor inputs. In these systems, there is a specified
*deadline* by which the software must respond; if it doesn’t meet the deadline, that may cause a
failure of the entire system. These are so-called *hard real-time* systems.
--------
> [!NOTE]
> In embedded systems, *real-time* means that a system is carefully designed and tested to meet
> specified timing guarantees in all circumstances. This meaning is in contrast to the more vague use of the
> term *real-time* on the web, where it describes servers pushing data to clients and stream
> processing without hard response time constraints (see [Chapter 12](/en/ch12#ch_stream)).
--------
For example, if your car’s onboard sensors detect that you are currently experiencing a crash, you
wouldn’t want the release of the airbag to be delayed due to an inopportune GC pause in the airbag
release system.
Providing real-time guarantees in a system requires support from all levels of the software stack:
you need a *real-time operating system* (RTOS) that allows processes to be scheduled with a guaranteed
allocation of CPU time in specified intervals; library functions must document their
worst-case execution times; dynamic memory allocation may be restricted or disallowed entirely
(real-time garbage collectors exist, but the application must still ensure that it doesn’t give the
GC too much work to do); and an enormous amount of testing and measurement must be done to ensure
that guarantees are being met.
All of this requires a large amount of additional work and severely restricts the range of
programming languages, libraries, and tools that can be used (since most languages and tools do not
provide real-time guarantees). For these reasons, developing real-time systems is very expensive,
and they are most commonly used in safety-critical embedded devices. Moreover, “real-time” is not the
same as “high-performance”—in fact, real-time systems may have lower throughput, since they have to
prioritize timely responses above all else (see also [“Latency and Resource Utilization”](/en/ch9#sidebar_distributed_latency_utilization)).
For most server-side data processing systems, real-time guarantees are simply not economical or
appropriate. Consequently, these systems must suffer the pauses and clock instability that come from
operating in a non-real-time environment.
#### Limiting the impact of garbage collection {#sec_distributed_gc_impact}
Garbage collection used to be one of the biggest reasons for process pauses [^79],
but fortunately GC algorithms have improved a lot: a properly tuned collector will now usually pause
for no more than a few milliseconds. The Java runtime offers collectors such as concurrent mark
sweep (CMS), garbage-first (G1), the Z garbage collector (ZGC), Epsilon, and Shenandoah. Each of
these is optimized for different memory profiles such as high-frequency object creation, large
heaps, and so on. By contrast, Go offers a simpler concurrent mark sweep garbage collector that
attempts to optimize itself.
If you need to avoid GC pauses entirely, one option is to use a language that doesn’t have a garbage
collector at all. For example, Swift uses automatic reference counting to determine when memory can
be freed; Rust and Mojo track lifetimes of objects using the type system so the compiler can
determine how long memory must be allocated for.
It’s also possible to use a garbage-collected language while mitigating the impact of pauses.
One approach is to treat GC pauses like brief planned outages of a node, and to let other nodes
handle requests from clients while one node is collecting its garbage. If the runtime can warn the
application that a node soon requires a GC pause, the application can stop sending new requests to
that node, wait for it to finish processing outstanding requests, and then perform the GC while no
requests are in progress. This trick hides GC pauses from clients and reduces the high percentiles
of the response time [^80] [^81].
A variant of this idea is to use the garbage collector only for short-lived objects (which are fast
to collect) and to restart processes periodically, before they accumulate enough long-lived objects
to require a full GC of long-lived objects [^79] [^82].
One node can be restarted at a time, and traffic can be shifted away from the node before the
planned restart, like in a rolling upgrade (see [Chapter 5](/en/ch5#ch_encoding)).
These measures cannot fully prevent garbage collection pauses, but they can usefully reduce their
impact on the application.
## Knowledge, Truth, and Lies {#sec_distributed_truth}
So far in this chapter we have explored the ways in which distributed systems are different from
programs running on a single computer: there is no shared memory, only message passing via an
unreliable network with variable delays, and the systems may suffer from partial failures, unreliable clocks,
and processing pauses.
The consequences of these issues are profoundly disorienting if you’re not used to distributed
systems. A node in the network cannot *know* anything for sure about other nodes—it can only make
guesses based on the messages it receives (or doesn’t receive). A node can only find out what state
another node is in (what data it has stored, whether it is correctly functioning, etc.) by
exchanging messages with it. If a remote node doesn’t respond, there is no way of knowing what state
it is in, because problems in the network cannot reliably be distinguished from problems at a node.
Discussions of these systems border on the philosophical: What do we know to be true or false in our
system? How sure can we be of that knowledge, if the mechanisms for perception and measurement are unreliable [^83]?
Should software systems obey the laws that we expect of the physical world, such as cause and effect?
Fortunately, we don’t need to go as far as figuring out the meaning of life. In a distributed
system, we can state the assumptions we are making about the behavior (the *system model*) and
design the actual system in such a way that it meets those assumptions. Algorithms can be proved to
function correctly within a certain system model. This means that reliable behavior is achievable,
even if the underlying system model provides very few guarantees.
However, although it is possible to make software well behaved in an unreliable system model, it
is not straightforward to do so. In the rest of this chapter we will further explore the notions of
knowledge and truth in distributed systems, which will help us think about the kinds of assumptions
we can make and the guarantees we may want to provide. In [Chapter 10](/en/ch10#ch_consistency) we will proceed to
look at some examples of distributed algorithms that provide particular guarantees under particular
assumptions.
### The Majority Rules {#sec_distributed_majority}
Imagine a network with an asymmetric fault: a node is able to receive all messages sent to it, but
any outgoing messages from that node are dropped or delayed [^22]. Even though that node is working
perfectly well, and is receiving requests from other nodes, the other nodes cannot hear its
responses. After some timeout, the other nodes declare it dead, because they haven’t heard from the
node. The situation unfolds like a nightmare: the semi-disconnected node is dragged to the
graveyard, kicking and screaming “I’m not dead!”—but since nobody can hear its screaming, the
funeral procession continues with stoic determination.
In a slightly less nightmarish scenario, the semi-disconnected node may notice that the messages it
is sending are not being acknowledged by other nodes, and so realize that there must be a fault
in the network. Nevertheless, the node is wrongly declared dead by the other nodes, and the
semi-disconnected node cannot do anything about it.
As a third scenario, imagine a node that pauses execution for one minute. During that time, no
requests are processed and no responses are sent. The other nodes wait, retry, grow impatient, and
eventually declare the node dead and load it onto the hearse. Finally, the pause finishes and the
node’s threads continue as if nothing had happened. The other nodes are surprised as the supposedly
dead node suddenly raises its head out of the coffin, in full health, and starts cheerfully chatting
with bystanders. At first, the paused node doesn’t even realize that an entire minute has passed and
that it was declared dead—from its perspective, hardly any time has passed since it was last talking
to the other nodes.
The moral of these stories is that a node cannot necessarily trust its own judgment of a situation.
A distributed system cannot exclusively rely on a single node, because a node may fail at any time,
potentially leaving the system stuck and unable to recover. Instead, many distributed algorithms
rely on a *quorum*, that is, voting among the nodes (see [“Quorums for reading and writing”](/en/ch6#sec_replication_quorum_condition)):
decisions require some minimum number of votes from several nodes in order to reduce the dependence
on any one particular node.
That includes decisions about declaring nodes dead. If a quorum of nodes declares another node
dead, then it must be considered dead, even if that node still very much feels alive. The individual
node must abide by the quorum decision and step down.
Most commonly, the quorum is an absolute majority of more than half the nodes (although other kinds
of quorums are possible). A majority quorum allows the system to continue working if a minority of nodes
are faulty (with three nodes, one faulty node can be tolerated; with five nodes, two faulty nodes can be
tolerated). However, it is still safe, because there can be only one majority in the
system—there cannot be two majorities with conflicting decisions at the same time. We will discuss
the use of quorums in more detail when we get to *consensus algorithms* in [Chapter 10](/en/ch10#ch_consistency).
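The arithmetic behind majority quorums is small enough to write down directly. The class and method names below are chosen for illustration only.

```java
// Illustrative helper for majority quorum arithmetic.
class Quorum {
    // Smallest number of votes that is more than half of n nodes.
    static int majority(int n) {
        if (n < 1) {
            throw new IllegalArgumentException("need at least one node");
        }
        return n / 2 + 1;
    }

    // Number of nodes that can be faulty while a majority remains available.
    static int tolerated(int n) {
        return n - majority(n);
    }

    // Two majorities together hold more than n votes, so at least one node
    // must belong to both of them.
    static boolean majoritiesMustOverlap(int n) {
        return 2 * majority(n) > n;
    }
}
```

For three nodes the majority is two and one fault is tolerated; for five nodes the majority is three and two faults are tolerated. Adding a fourth node to a three-node cluster raises the majority to three without tolerating any additional faults, which is one reason why odd cluster sizes are common.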
### Distributed Locks and Leases {#sec_distributed_lock_fencing}
Locks and leases in distributed applications are prone to misuse, and are a common source of bugs [^84].
Let’s look at one particular case of how they can go wrong.
In [“Process Pauses”](/en/ch9#sec_distributed_clocks_pauses) we saw that a lease is a kind of lock that times out and can be
assigned to a new owner if the old owner stops responding (perhaps because it crashed, it paused for
too long, or it was disconnected from the network). You can use leases in situations where a system
requires there to be only one of some thing. For example:
* Only one node is allowed to be the leader for a database shard, to avoid split brain (see
[“Handling Node Outages”](/en/ch6#sec_replication_failover)).
* Only one transaction or client is allowed to update a particular resource or object, to prevent
it being corrupted by concurrent writes.
* Only one node should process a given input file of a large processing job, to avoid wasted effort
due to multiple nodes redundantly doing the same work.
It is worth thinking carefully about what happens if several nodes simultaneously believe that they
hold the lease, perhaps due to a process pause. In the third example, the consequence is only some
wasted computational resources, which is not a big deal. But in the first two cases, the consequence
could be lost or corrupted data, which is much more serious.
For example, [Figure 9-4](/en/ch9#fig_distributed_lease_pause) shows a data corruption bug due to an incorrect
implementation of locking. (The bug is not theoretical: HBase used to have this problem [^85] [^86].)
Say you want to ensure that a file in a storage service can only be
accessed by one client at a time, because if multiple clients tried to write to it, the file would
become corrupted. You try to implement this by requiring a client to obtain a lease from a lock
service before accessing the file. Such a lock service is often implemented using a consensus
algorithm; we will discuss this further in [Chapter 10](/en/ch10#ch_consistency).
{{< figure src="/fig/ddia_0904.png" id="fig_distributed_lease_pause" caption="Figure 9-4. Incorrect implementation of a distributed lock: client 1 believes that it still has a valid lease, even though it has expired, and thus corrupts a file in storage." class="w-full my-4" >}}
The problem is an example of what we discussed in [“Process Pauses”](/en/ch9#sec_distributed_clocks_pauses): if the client
holding the lease is paused for too long, its lease expires. Another client can obtain a lease for
the same file, and start writing to the file. When the paused client comes back, it believes
(incorrectly) that it still has a valid lease and proceeds to also write to the file. We now have a
split brain situation: the clients’ writes clash and corrupt the file.
[Figure 9-5](/en/ch9#fig_distributed_lease_delay) shows a different problem that has similar consequences. In this
example there is no process pause, only a crash by client 1. Just before client 1 crashes it sends a
write request to the storage service, but this request is delayed for a long time in the network.
(Remember from [“Network Faults in Practice”](/en/ch9#sec_distributed_network_faults) that packets can sometimes be delayed by a minute
or more.) By the time the write request arrives at the storage service, the lease has already timed
out, allowing client 2 to acquire it and issue a write of its own. The result is corruption similar
to [Figure 9-4](/en/ch9#fig_distributed_lease_pause).
{{< figure src="/fig/ddia_0905.png" id="fig_distributed_lease_delay" caption="Figure 9-5. A message from a former leaseholder might be delayed for a long time, and arrive after another node has taken over the lease." class="w-full my-4" >}}
#### Fencing off zombies and delayed requests {#sec_distributed_fencing_tokens}
The term *zombie* is sometimes used to describe a former leaseholder that has not yet found out that
it lost the lease, and that is still acting as if it were the current leaseholder. Since we cannot
rule out zombies entirely, we have to instead ensure that they can’t do any damage in the form of
split brain. This is called *fencing off* the zombie.
Some systems attempt to fence off zombies by shutting them down, for example by disconnecting them
from the network [^9], shutting down the VM via
the cloud provider’s management interface, or even physically powering down the machine [^87].
This approach is known as *Shoot The Other Node In The Head* or STONITH. Unfortunately, it suffers
from some problems: it does not protect against large network delays like in
[Figure 9-5](/en/ch9#fig_distributed_lease_delay); it can happen that all of the nodes shut each other down [^19]; and by the time the zombie has been
detected and shut down, it may already be too late and data may already have been corrupted.
A more robust fencing solution, which protects against both zombies and delayed requests, is
illustrated in [Figure 9-6](/en/ch9#fig_distributed_fencing).
{{< figure src="/fig/ddia_0906.png" id="fig_distributed_fencing" caption="Figure 9-6. Making access to storage safe by allowing writes only in the order of increasing fencing tokens." class="w-full my-4" >}}
Let’s assume that every time the lock service grants a lock or lease, it also returns a *fencing
token*, which is a number that increases every time a lock is granted (e.g., incremented by the lock
service). We can then require that every time a client sends a write request to the storage service,
it must include its current fencing token.
--------
> [!NOTE]
> There are several alternative names for fencing tokens. In Chubby, Google’s lock service, they are
> called *sequencers* [^88], and in Kafka they are called *epoch numbers*.
> In consensus algorithms, which we will discuss in [Chapter 10](/en/ch10#ch_consistency), the *ballot number* (Paxos) or
> *term number* (Raft) serves a similar purpose.
--------
In [Figure 9-6](/en/ch9#fig_distributed_fencing), client 1 acquires the lease with a token of 33, but then
it goes into a long pause and the lease expires. Client 2 acquires the lease with a token of 34 (the
number always increases) and then sends its write request to the storage service, including the
token of 34. Later, client 1 comes back to life and sends its write to the storage service,
including its token value 33. However, the storage service remembers that it has already processed a
write with a higher token number (34), and so it rejects the request with token 33. A client that
has just acquired the lease must immediately make a write to the storage service, and once that
write has completed, any zombies are fenced off.
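The check performed by the storage service can be sketched as follows. This is a minimal in-memory illustration, assuming a hypothetical `FencedStorage` class that tracks the highest token seen for each file; a real storage service would also need to persist that token durably.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory storage service that enforces fencing tokens.
class FencedStorage {
    private final Map<String, Long> highestToken = new HashMap<>();
    private final Map<String, String> contents = new HashMap<>();

    // Accepts the write only if its token is at least as high as any token
    // already seen for this file; returns false for a stale token.
    synchronized boolean write(String file, long fencingToken, String data) {
        long highest = highestToken.getOrDefault(file, Long.MIN_VALUE);
        if (fencingToken < highest) {
            return false; // the writer is a zombie or the request was delayed
        }
        highestToken.put(file, fencingToken);
        contents.put(file, data);
        return true;
    }

    synchronized String read(String file) {
        return contents.get(file);
    }
}
```

Once a write with token 34 has been accepted, a later write carrying token 33 is rejected, while further writes with token 34 from the current leaseholder still succeed.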
If ZooKeeper is your lock service, you can use the transaction ID `zxid` or the node version
`cversion` as a fencing token [^85].
With etcd, the revision number along with the lease ID serves a similar purpose [^89].
The FencedLock API in Hazelcast explicitly generates a fencing token [^90].
This mechanism requires that the storage service has some way of checking whether a write is based
on an outdated token. Alternatively, it’s sufficient for the service to support a write that
succeeds only if the object has not been written by another client since the current client last
read it, similar to an atomic compare-and-set (CAS) operation. For example, object storage
services support such a check: Amazon S3 calls it *conditional writes*, Azure Blob Storage calls it
*conditional headers*, and Google Cloud Storage calls it *request preconditions*.
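The idea of a write that succeeds only if the object is unchanged can be sketched with a version counter. The `VersionedObject` class below is a hypothetical in-memory stand-in and does not reflect the API of any of the object stores mentioned above.

```java
// Hypothetical in-memory object that supports a conditional write.
class VersionedObject {
    private long version = 0;
    private String value = "";

    synchronized long version() {
        return version;
    }

    synchronized String value() {
        return value;
    }

    // Succeeds only if nobody has written since the caller read expectedVersion.
    synchronized boolean writeIfUnchanged(long expectedVersion, String newValue) {
        if (version != expectedVersion) {
            return false; // another client wrote in the meantime
        }
        value = newValue;
        version++;
        return true;
    }
}
```

A client that was paused between reading and writing would present an outdated version number, so its write would fail instead of silently overwriting a newer value.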
#### Fencing with multiple replicas {#fencing-with-multiple-replicas}
If your clients need to write only to one storage service that supports such conditional writes, the
lock service is somewhat redundant [^91] [^92], since the lease assignment could have been implemented directly based on that storage service [^93].
However, once you have a fencing token you can also use it with multiple services or replicas, and
ensure that the old leaseholder is fenced off on all of those services.
For example, imagine the storage service is a leaderless replicated key-value store with
last-write-wins conflict resolution (see [“Leaderless Replication”](/en/ch6#sec_replication_leaderless)). In such a system, the
client sends writes directly to each replica, and each replica independently decides whether to
accept a write based on a timestamp assigned by the client.
As illustrated in [Figure 9-7](/en/ch9#fig_distributed_fencing_leaderless), you can put the writer’s fencing token in
the most significant bits or digits of the timestamp. You can then be sure that any timestamp
generated by the new leaseholder will be greater than any timestamp from the old leaseholder, even
if the old leaseholder’s writes happened later.
{{< figure src="/fig/ddia_0907.png" id="fig_distributed_fencing_leaderless" caption="Figure 9-7. Using fencing tokens to protect writes to a leaderless replicated database." class="w-full my-4" >}}
In [Figure 9-7](/en/ch9#fig_distributed_fencing_leaderless), Client 2 has a fencing token of 34, so all of its
timestamps starting with 34… are greater than any timestamps starting with 33… that are
generated by Client 1. Client 2 writes to a quorum of replicas but it can’t reach Replica 3. This
means that when the zombie Client 1 later tries to write, its write may succeed at Replica 3 even
though it is ignored by replicas 1 and 2. This is not a problem, since a subsequent quorum read will
prefer the write from Client 2 with the greater timestamp, and read repair or anti-entropy will
eventually overwrite the value written by Client 1.
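Placing the token in the most significant bits can be sketched as follows. The split into a 15-bit token and a 48-bit millisecond clock value is an assumption made for this example, as is the class name `FencedTimestamp`; a real system would choose widths to suit its own timestamp format.

```java
// Hypothetical encoding: the bit split below is an assumption for illustration.
class FencedTimestamp {
    static final int CLOCK_BITS = 48;                      // low bits: clock milliseconds
    static final long MAX_CLOCK = (1L << CLOCK_BITS) - 1;
    static final long MAX_TOKEN = (1L << 15) - 1;          // high bits: token (sign bit unused)

    // Combines the fencing token and the clock reading into one comparable number.
    static long encode(long fencingToken, long clockMillis) {
        if (fencingToken < 0 || fencingToken > MAX_TOKEN) {
            throw new IllegalArgumentException("token out of range");
        }
        if (clockMillis < 0 || clockMillis > MAX_CLOCK) {
            throw new IllegalArgumentException("clock value out of range");
        }
        return (fencingToken << CLOCK_BITS) | clockMillis;
    }
}
```

Because the token occupies the high bits, `encode(34, t1)` is greater than `encode(33, t2)` for any valid clock values `t1` and `t2`, so a last-write-wins comparison prefers the new leaseholder regardless of when the zombie's write was timestamped.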
As you can see from these examples, it is not safe to assume that there is only one node holding a
lease at any one time. Fortunately, with a bit of care you can use fencing tokens to prevent zombies
and delayed requests from doing any damage.
### Byzantine Faults {#sec_distributed_byzantine}
Fencing tokens can detect and block a node that is *inadvertently* acting in error (e.g., because it
hasn’t yet found out that its lease has expired). However, if the node deliberately wanted to
subvert the system’s guarantees, it could easily do so by sending messages with a fake fencing
token.
In this book we assume that nodes are unreliable but honest: they may be slow or never respond (due
to a fault), and their state may be outdated (due to a GC pause or network delays), but we assume
that if a node *does* respond, it is telling the “truth”: to the best of its knowledge, it is
playing by the rules of the protocol.
Distributed systems problems become much harder if there is a risk that nodes may “lie” (send
arbitrary faulty or corrupted responses); for example, a node might cast multiple contradictory votes in
the same election. Such behavior is known as a *Byzantine fault*, and the problem of reaching
consensus in this untrusting environment is known as the *Byzantine Generals Problem* [^94].
> [!TIP] THE BYZANTINE GENERALS PROBLEM
>
> The Byzantine Generals Problem is a generalization of the so-called *Two Generals Problem* [^95],
> which imagines a situation in which two army generals need to agree on a battle plan. As they
> have set up camp on two different sites, they can only communicate by messenger, and the messengers
> sometimes get delayed or lost (like packets in a network). We will discuss this problem of
> *consensus* in [Chapter 10](/en/ch10#ch_consistency).
>
> In the Byzantine version of the problem, there are *n* generals who need to agree, and their
> endeavor is hampered by the fact that there are some traitors in their midst. Most of the generals
> are loyal, and thus send truthful messages, but the traitors may try to deceive and confuse the
> others by sending fake or untrue messages. It is not known in advance who the traitors are.
>
> Byzantium was an ancient Greek city that later became Constantinople, in the place which is now
> Istanbul in Turkey. There isn’t any historic evidence that the generals of Byzantium were any more
> prone to intrigue and conspiracy than those elsewhere. Rather, the name is derived from *Byzantine*
> in the sense of *excessively complicated, bureaucratic, devious*, which was used in politics long
> before computers [^96].
> Lamport wanted to choose a nationality that would not offend any readers, and he was advised that
> calling it *The Albanian Generals Problem* was not such a good idea [^97].
--------
A system is *Byzantine fault-tolerant* if it continues to operate correctly even if some of the
nodes are malfunctioning and not obeying the protocol, or if malicious attackers are interfering
with the network. This concern is relevant in certain specific circumstances. For example:
* In aerospace environments, the data in a computer’s memory or CPU register could become corrupted
by radiation, leading it to respond to other nodes in arbitrarily unpredictable ways. Since a
system failure would be very expensive (e.g., an aircraft crashing and killing everyone on board,
or a rocket colliding with the International Space Station), flight control systems must tolerate
Byzantine faults [^98] [^99].
* In a system with multiple participating parties, some participants may attempt to cheat or
defraud others. In such circumstances, it is not safe for a node to simply trust another node’s
messages, since they may be sent with malicious intent. For example, cryptocurrencies like
Bitcoin and other blockchains can be considered to be a way of getting mutually untrusting parties
to agree whether a transaction happened or not, without relying on a central authority [^100].
However, in the kinds of systems we discuss in this book, we can usually safely assume that there
are no Byzantine faults. In a datacenter, all the nodes are controlled by your organization (so
they can hopefully be trusted) and radiation levels are low enough that memory corruption is not a
major problem (although datacenters in orbit are being considered [^101]).
Multitenant systems have mutually untrusting tenants, but they are isolated from each
other using firewalls, virtualization, and access control policies, not using Byzantine fault
tolerance. Protocols for making systems Byzantine fault-tolerant are quite expensive [^102],
and fault-tolerant embedded systems rely on support from the hardware level [^98]. In most server-side data systems, the
cost of deploying Byzantine fault-tolerant solutions makes them impracticable.
Web applications do need to expect arbitrary and malicious behavior of clients that are under
end-user control, such as web browsers. This is why input validation, sanitization, and output
escaping are so important: to prevent SQL injection and cross-site scripting, for example. However,
we typically don’t use Byzantine fault-tolerant protocols here, but simply make the server the
authority on deciding what client behavior is and isn’t allowed. In peer-to-peer networks, where
there is no such central authority, Byzantine fault tolerance is more relevant [^103] [^104].
A bug in the software could be regarded as a Byzantine fault, but if you deploy the same software to
all nodes, then a Byzantine fault-tolerant algorithm cannot save you. Most Byzantine fault-tolerant
algorithms require a supermajority of more than two-thirds of the nodes to be functioning correctly
(for example, if you have four nodes, at most one may malfunction). To use this approach against bugs, you
would have to have four independent implementations of the same software and hope that a bug only
appears in one of the four implementations.
Similarly, it would be appealing if a protocol could protect us from vulnerabilities, security
compromises, and malicious attacks. Unfortunately, this is not realistic either: in most systems, if
an attacker can compromise one node, they can probably compromise all of them, because they are
probably running the same software. Thus, traditional mechanisms (authentication, access control,
encryption, firewalls, and so on) continue to be the main protection against attackers.
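The four-node example above follows from the usual bound that tolerating *f* Byzantine nodes requires at least 3*f* + 1 nodes in total. A minimal sketch of that arithmetic, with a function name chosen purely for illustration:

```python
def max_byzantine_faults(n: int) -> int:
    """Largest f such that more than two-thirds of n nodes remain correct.

    Requiring n - f > 2n/3 is the same as f < n/3, i.e. n >= 3f + 1.
    """
    if n < 1:
        raise ValueError("need at least one node")
    return (n - 1) // 3


for n in (3, 4, 7, 10):
    print(n, "nodes tolerate", max_byzantine_faults(n), "Byzantine fault(s)")
```

Note that three nodes tolerate no Byzantine faults at all, which is why the smallest useful deployment has four nodes (or four independent implementations, if the faults in question are software bugs).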
#### Weak forms of lying {#weak-forms-of-lying}
Although we assume that nodes are generally honest, it can be worth adding mechanisms to software
that guard against weak forms of “lying”—for example, invalid messages due to hardware issues,
software bugs, and misconfiguration. Such protection mechanisms are not full-blown Byzantine fault
tolerance, as they would not withstand a determined adversary, but they are nevertheless simple and
pragmatic steps toward better reliability. For example:
* Network packets do sometimes get corrupted due to hardware issues or bugs in operating systems,
drivers, routers, etc. Usually, corrupted packets are caught by the checksums built into TCP and
UDP, but sometimes they evade detection [^105] [^106] [^107].
Simple measures are usually sufficient protection against such corruption, such as checksums in
the application-level protocol. TLS-encrypted connections also offer protection against corruption.
* A publicly accessible application must carefully sanitize any inputs from users, for example
checking that a value is within a reasonable range and limiting the size of strings to prevent
denial of service through large memory allocations. An internal service behind a firewall may be
able to get away with less strict checks on inputs, but basic checks in protocol parsers are still a good idea [^105].
* NTP clients can be configured with multiple server addresses. When synchronizing, the client
contacts all of them, estimates their errors, and checks that a majority of servers agree on some
time range. As long as most of the servers are okay, a misconfigured NTP server that is reporting an
incorrect time is detected as an outlier and is excluded from synchronization [^39]. The use of multiple servers makes NTP
more robust than if it only uses a single server.
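As an illustration of the application-level checksum mentioned in the first bullet, the following sketch prefixes each message with a CRC-32 of its payload. The framing format and the function names `frame` and `unframe` are assumptions made for this example, not part of any particular protocol. A CRC guards against accidental corruption only; it offers no protection against a deliberate attacker.

```python
import struct
import zlib


def frame(payload: bytes) -> bytes:
    """Prefix the payload with its CRC-32 as a 4-byte big-endian integer."""
    return struct.pack(">I", zlib.crc32(payload)) + payload


def unframe(message: bytes) -> bytes:
    """Return the payload, or raise ValueError if the checksum does not match."""
    if len(message) < 4:
        raise ValueError("message too short to contain a checksum")
    (expected,) = struct.unpack(">I", message[:4])
    payload = message[4:]
    if zlib.crc32(payload) != expected:
        raise ValueError("checksum mismatch: message was corrupted")
    return payload


msg = frame(b"set x = 42")
print(unframe(msg))

# Flip one bit of the payload to simulate corruption in transit.
corrupted = msg[:-1] + bytes([msg[-1] ^ 0x01])
try:
    unframe(corrupted)
except ValueError as err:
    print("rejected:", err)
```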
### System Model and Reality {#sec_distributed_system_model}
Many algorithms have been designed to solve distributed systems problems—for example, we will
examine solutions for the consensus problem in [Chapter 10](/en/ch10#ch_consistency). In order to be useful, these
algorithms need to tolerate the various faults of distributed systems that we discussed in this
chapter.
Algorithms need to be written in a way that does not depend too heavily on the details of the
hardware and software configuration on which they are run. This in turn requires that we somehow
formalize the kinds of faults that we expect to happen in a system. We do this by defining a *system
model*, which is an abstraction that describes what things an algorithm may assume.
With regard to timing assumptions, three system models are in common use:
Synchronous model
: The synchronous model assumes bounded network delay, bounded process pauses, and bounded clock
error. This does not imply exactly synchronized clocks or zero network delay; it just means you
know that network delay, pauses, and clock drift will never exceed some fixed upper bound [^108].
The synchronous model is not a realistic model of most practical
systems, because (as discussed in this chapter) unbounded delays and pauses do occur.
Partially synchronous model
: Partial synchrony means that a system behaves like a synchronous system *most of the time*, but it
sometimes exceeds the bounds for network delay, process pauses, and clock drift [^108]. This is a realistic model of many
systems: most of the time, networks and processes are quite well behaved—otherwise we would never
be able to get anything done—but we have to reckon with the fact that any timing assumptions
may be shattered occasionally. When this happens, network delay, pauses, and clock error may become
arbitrarily large.
Asynchronous model
: In this model, an algorithm is not allowed to make any timing assumptions—in fact, it does not
even have a clock (so it cannot use timeouts). Some algorithms can be designed for the
asynchronous model, but it is very restrictive.
Moreover, besides timing issues, we have to consider node failures. Some common system models for
nodes are:
Crash-stop faults
: In the *crash-stop* (or *fail-stop*) model, an algorithm may assume that a node can fail in only
one way, namely by crashing [^109].
This means that the node may suddenly stop responding at any moment, and thereafter that node is
gone forever—it never comes back.
Crash-recovery faults
: We assume that nodes may crash at any moment, and perhaps start responding again after some
unknown time. In the crash-recovery model, nodes are assumed to have stable storage (i.e.,
nonvolatile disk storage) that is preserved across crashes, while the in-memory state is assumed
to be lost.
Degraded performance and partial functionality
: In addition to crashing and restarting, nodes may go slow: they may still be able to respond to
health check requests, while being too slow to get any real work done. For example, a Gigabit
network interface could suddenly drop to 1 Kb/s throughput due to a driver bug [^110];
a process that is under memory pressure may spend most of its time performing garbage collection [^111];
worn-out SSDs can have erratic performance; and hardware can be affected by high temperature,
loose connectors, mechanical vibration, power supply problems, firmware bugs, and more [^112].
Such a situation is called a *limping node*, *gray failure*, or *fail-slow* [^113],
and it can be even more difficult to deal with than a cleanly failed node. A related problem is
when a process stops doing some of the things it is supposed to do while other aspects continue
working, for example because a background thread has crashed or is deadlocked [^114].
Byzantine (arbitrary) faults
: Nodes may do absolutely anything, including trying to trick and deceive other nodes, as described
in the last section.
For modeling real systems, the partially synchronous model with crash-recovery faults is generally
the most useful model. It allows for unbounded network delay, process pauses, and slow nodes. But
how do distributed algorithms cope with that model?
#### Defining the correctness of an algorithm {#defining-the-correctness-of-an-algorithm}
To define what it means for an algorithm to be *correct*, we can describe its *properties*. For
example, the output of a sorting algorithm has the property that for any two distinct elements of
the output list, the element further to the left is smaller than the element further to the right.
That is simply a formal way of defining what it means for a list to be sorted.
Similarly, we can write down the properties we want of a distributed algorithm to define what it
means to be correct. For example, if we are generating fencing tokens for a lock (see
[“Fencing off zombies and delayed requests”](/en/ch9#sec_distributed_fencing_tokens)), we may require the algorithm to have the following properties:
Uniqueness
: No two requests for a fencing token return the same value.
Monotonic sequence
: If request *x* returned token *t*<sub>*x*</sub>, and request *y* returned token *t*<sub>*y*</sub>, and
*x* completed before *y* began, then *t*<sub>*x*</sub> < *t*<sub>*y*</sub>.
Availability
: A node that requests a fencing token and does not crash eventually receives a response.
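To make the first two properties concrete, here is a sketch of how they could be checked against a recorded history of completed requests. The `Request` record and the checker functions are hypothetical names introduced for this example. Availability is not checked, because a finite log of completed requests cannot show that a pending request will never be answered.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Request:
    start: float  # time at which the request began
    end: float    # time at which the request completed
    token: int    # fencing token that was returned


def check_uniqueness(history):
    """No two requests return the same token."""
    tokens = [r.token for r in history]
    return len(tokens) == len(set(tokens))


def check_monotonic(history):
    """If x completed before y began, then x's token is smaller than y's."""
    for x, y in combinations(history, 2):
        for a, b in ((x, y), (y, x)):
            if a.end < b.start and not a.token < b.token:
                return False
    return True


good = [Request(0, 1, 33), Request(2, 3, 34), Request(2.5, 4, 35)]
bad = [Request(0, 1, 34), Request(2, 3, 33)]  # later request got a smaller token

print(check_uniqueness(good), check_monotonic(good))
print(check_uniqueness(bad), check_monotonic(bad))
```

Requests that overlap in time (such as the second and third entries of `good`) may receive their tokens in either order without violating the monotonic sequence property.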
An algorithm is correct in some system model if it always satisfies its properties in all situations
that we assume may occur in that system model. However, if all nodes crash, or all network delays
suddenly become infinitely long, then no algorithm will be able to get anything done. How can we
still make useful guarantees even in a system model that allows complete failures?
#### Safety and liveness {#sec_distributed_safety_liveness}
To clarify the situation, it is worth distinguishing between two different kinds of properties:
*safety* and *liveness* properties. In the example just given, *uniqueness* and *monotonic sequence* are
safety properties, but *availability* is a liveness property.
What distinguishes the two kinds of properties? A giveaway is that liveness properties often include
the word “eventually” in their definition. (And yes, you guessed it—*eventual consistency* is a
liveness property [^115].)
Safety is often informally defined as *nothing bad happens*, and liveness as *something good
eventually happens*. However, it’s best to not read too much into those informal definitions,
because “good” and “bad” are value judgements that don’t apply well to algorithms. The actual
definitions of safety and liveness are more precise [^116]:
* If a safety property is violated, we can point at a particular point in time at which it was
broken (for example, if the uniqueness property was violated, we can identify the particular
operation in which a duplicate fencing token was returned). After a safety property has been
violated, the violation cannot be undone—the damage is already done.
* A liveness property works the other way round: it may not hold at some point in time (for example,
a node may have sent a request but not yet received a response), but there is always hope that it
may be satisfied in the future (namely by receiving a response).
An advantage of distinguishing between safety and liveness properties is that it helps us deal with
difficult system models. For distributed algorithms, it is common to require that safety properties
*always* hold, in all possible situations of a system model [^108]. That is, even if all nodes crash, or
the entire network fails, the algorithm must nevertheless ensure that it does not return a wrong
result (i.e., that the safety properties remain satisfied).
However, with liveness properties we are allowed to make caveats: for example, we could say that a
request needs to receive a response only if a majority of nodes have not crashed, and only if the
network eventually recovers from an outage. The definition of the partially synchronous model
requires that eventually the system returns to a synchronous state—that is, any period of network
interruption lasts only for a finite duration and is then repaired.
#### Mapping system models to the real world {#mapping-system-models-to-the-real-world}
Safety and liveness properties and system models are very useful for reasoning about the correctness
of a distributed algorithm. However, when implementing an algorithm in practice, the messy facts of
reality come back to bite you again, and it becomes clear that the system model is a simplified
abstraction of reality.
For example, algorithms in the crash-recovery model generally assume that data in stable storage
survives crashes. However, what happens if the data on disk is corrupted, or the data is wiped out
due to hardware error or misconfiguration [^117]?
What happens if a server has a firmware bug and fails to recognize
its hard drives on reboot, even though the drives are correctly attached to the server [^118]?
Quorum algorithms (see [“Quorums for reading and writing”](/en/ch6#sec_replication_quorum_condition)) rely on a node remembering the data
that it claims to have stored. If a node may suffer from amnesia and forget previously stored data,
that breaks the quorum condition, and thus breaks the correctness of the algorithm. Perhaps a new
system model is needed, in which we assume that stable storage mostly survives crashes, but may
sometimes be lost. But that model then becomes harder to reason about.
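The following toy sketch shows how amnesia breaks the quorum condition. It assumes three nodes with write and read quorums of two, and models each node as an in-memory `(version, value)` pair. All names are illustrative only.

```python
n, w, r = 3, 2, 2
assert w + r > n  # quorum condition: every read set overlaps every write set

# Each node stores a (version, value) pair; all start empty at version 0.
nodes = {name: (0, None) for name in ("A", "B", "C")}


def write(targets, version, value):
    for name in targets:
        nodes[name] = (version, value)


def read(targets):
    """Return the value with the highest version among the contacted nodes."""
    return max((nodes[name] for name in targets), key=lambda pair: pair[0])[1]


write(["A", "B"], version=1, value="x=1")  # acknowledged by w = 2 nodes
print(read(["B", "C"]))  # the read quorum overlaps the write quorum at B

nodes["B"] = (0, None)   # B forgets what it stored (amnesia)
print(read(["B", "C"]))  # the overlap no longer helps: the write is invisible
```

After B loses its data, a read from B and C still contacts *r* = 2 nodes and still satisfies *w* + *r* > *n* on paper, yet it misses an acknowledged write.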
The theoretical description of an algorithm can declare that certain things are simply assumed not
to happen—and in non-Byzantine systems, we do have to make some assumptions about faults that can
and cannot happen. However, a real implementation may still have to include code to handle the
case where something happens that was assumed to be impossible, even if that handling boils down to
`printf("Sucks to be you")` and `exit(666)`—i.e., letting a human operator clean up the mess [^119].
(This is one difference between computer science and software engineering.)
That is not to say that theoretical, abstract system models are worthless—quite the opposite.
They are incredibly helpful for distilling down the complexity of real systems to a manageable set
of faults that we can reason about, so that we can understand the problem and try to solve it
systematically.
### Formal Methods and Randomized Testing {#sec_distributed_formal}
How do we know that an algorithm satisfies the required properties? Due to concurrency, partial
failures, and network delays, there are a huge number of potential states. We need to guarantee
that the properties hold in every possible state, and ensure that we haven’t forgotten about any
edge cases.
One approach is to formally verify an algorithm by describing it mathematically, and using proof
techniques to show that it satisfies the required properties in all situations that the system model
allows. Proving an algorithm correct does not mean its *implementation* on a real system will
necessarily always behave correctly. But it’s a very good first step, because the theoretical
analysis can uncover problems in an algorithm that might remain hidden for a long time in a real
system, and that only come to bite you when your assumptions (e.g., about timing) are defeated due
to unusual circumstances.
It is prudent to combine theoretical analysis with empirical testing to verify that implementations
behave as expected. Techniques such as property-based testing, fuzzing, and deterministic simulation
testing (DST) use randomization to test a system in a wide range of situations. Companies such as
Amazon Web Services have successfully used a combination of these techniques on many of their
products [^120] [^121].
#### Model checking and specification languages {#model-checking-and-specification-languages}
*Model checkers* are tools that help verify that an algorithm or system behaves as expected. An algorithm
specification is written in a purpose-built language such as TLA+, Gallina, or FizzBee. These
languages make it easier to focus on an algorithm’s behavior without worrying about code
implementation details. Model checkers then use these models to verify that invariants hold across
all of an algorithm’s states by systematically trying all the things that could happen.
Model checking can’t actually prove that an algorithm’s invariants hold for every possible state
since most real-world algorithms have an infinite state space. A true verification of all states
would require a formal proof, which can be done, but which is typically more difficult than running
a model checker. Instead, model checkers encourage you to reduce the algorithm’s model to an
approximation that can be fully verified, or to limit the execution to some upper bound (for
example, by setting a maximum number of messages that can be sent). Any bugs that only occur with
longer executions would then not be found.
Still, model checkers strike a nice balance between ease of use and the ability to find non-obvious
bugs. CockroachDB, TiDB, Kafka, and many other distributed systems use model specifications to find
and fix bugs [^122] [^123] [^124]. For example,
using TLA+, researchers were able to demonstrate the potential for data loss in viewstamped
replication (VR) caused by ambiguity in the prose description of the algorithm [^125].
By design, model checkers don’t run your actual code, but rather a simplified model that specifies
only the core ideas of your protocol. This makes it more tractable to systematically explore the
state space, but it creates a risk that your specification and your implementation drift out of sync with each other [^126].
It is possible to check whether the model and the real implementation have equivalent behavior, but
this requires instrumentation in the real implementation [^127].
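To give a feel for what a model checker does, here is a toy explicit-state checker written in plain Python rather than a specification language such as TLA+. The model is an assumption made for this example: two processes each read a shared counter and then write back the value plus one, without any locking. The checker explores interleavings breadth-first up to a depth bound and reports a trace that violates the invariant.

```python
from collections import deque

# State: (counter, (pc0, local0), (pc1, local1))
# pc: 0 = about to read, 1 = about to write, 2 = done
INITIAL = (0, (0, 0), (0, 0))


def next_states(state):
    counter, *procs = state
    for i, (pc, local) in enumerate(procs):
        if pc == 0:    # read the shared counter into a local variable
            new_proc, new_counter = (1, counter), counter
        elif pc == 1:  # write back local + 1, overwriting the counter
            new_proc, new_counter = (2, local), local + 1
        else:
            continue
        new_procs = list(procs)
        new_procs[i] = new_proc
        label = f"p{i} " + ("reads" if pc == 0 else "writes")
        yield label, (new_counter, *new_procs)


def invariant(state):
    """Once every process is done, no increment may have been lost."""
    counter, *procs = state
    all_done = all(pc == 2 for pc, _ in procs)
    return not all_done or counter == len(procs)


def check(initial, max_depth):
    """Return the steps leading to an invariant violation, or None."""
    queue = deque([(initial, [])])
    seen = {initial}
    while queue:
        state, trace = queue.popleft()
        if not invariant(state):
            return trace
        if len(trace) >= max_depth:
            continue  # bound reached: do not explore further from here
        for label, nxt in next_states(state):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, trace + [label]))
    return None


print(check(INITIAL, max_depth=3))  # bound too small to reach the bug
print(check(INITIAL, max_depth=4))  # an interleaving that loses an update
```

With a bound of three steps the checker reports no violation, because the lost update needs four steps to appear. This is the limitation described above: bugs that only occur in longer executions are missed when the bound is too small.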
#### Fault injection {#sec_fault_injection}
Many bugs are triggered when machine and network failures occur. Fault injection is an effective
(and sometimes scary) technique that verifies whether a system’s implementation works as expected when
things go wrong. The idea is simple: inject faults into a running system’s environment and see how it
behaves. Faults can be network failures, machine crashes, disk corruption, paused
processes—anything you can imagine going wrong with a computer.
Fault injection tests are typically run in an environment that closely resembles the production
environment where the system will run. Some even inject faults directly into their production
environment. Netflix popularized this approach with their Chaos Monkey tool [^128]. Production fault
injection is often referred to as *chaos engineering*, which we discussed in
[“Reliability and Fault Tolerance”](/en/ch2#sec_introduction_reliability).
To run fault injection tests, the system under test is first deployed along with fault injection
coordinators and scripts. Coordinators are responsible for deciding what faults to execute and when
to execute them. Local or remote scripts are responsible for injecting failures into individual
nodes or processes. Injection scripts use many different tools to trigger faults. A Linux process
can be paused or killed using Linux’s `kill` command, a disk can be unmounted with `umount`, and
network connections can be disrupted through firewall settings. You can inspect system behavior
during and after faults are injected to make sure things work as expected.
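As a minimal sketch of an injection script, the following Python program pauses, resumes, and then kills a child process using POSIX signals, which is what the `kill` command does under the hood. It assumes a POSIX system such as Linux, and the sleeping child process is only a stand-in for a real system under test.

```python
import os
import signal
import subprocess
import sys
import time

# Stand-in for the system under test: a child process that just sleeps.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])

os.kill(child.pid, signal.SIGSTOP)  # inject fault: pause the process
time.sleep(0.2)                     # other nodes would now see it as unresponsive
os.kill(child.pid, signal.SIGCONT)  # heal: the process resumes, unaware of the pause

child.kill()                        # inject fault: crash the process (SIGKILL)
child.wait()
print("exit status:", child.returncode)
```

A real test would run a workload against the system while the faults are injected and then check the recorded results against the properties the system is supposed to guarantee.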
The many different tools needed to trigger failures make fault injection tests cumbersome to write.
To simplify the process, it’s common to adopt a fault injection framework such as Jepsen. Such
frameworks come with integrations for various operating systems and many pre-built fault
injectors [^129].
Jepsen has been remarkably effective at finding critical bugs in many widely-used systems [^130] [^131].
#### Deterministic simulation testing {#deterministic-simulation-testing}
Deterministic simulation testing (DST) has also become a popular complement to model checking and
fault injection. It explores the state space in a way similar to a model checker, but it tests
your actual code, not a model.
In DST, a simulation automatically runs through a large number of randomized executions of the
system. Network communication, I/O, and clock timing during the simulation are all replaced with
mocks that allow the simulator to control the exact order in which things happen, including various
timings and failure scenarios. This allows the simulator to explore many more situations than
hand-written tests or fault injection could. If a test fails, it can be re-run since the simulator
knows the exact order of operations that triggered the failure—in contrast to fault injection, which
does not have such fine-grained control over the system.
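The core idea can be sketched in a few lines. In this toy example, which is an illustration and not how any particular DST tool is implemented, three clients each send one write to a register, and a seeded random number generator decides the delivery order and which messages are dropped. Because the seed is the only source of nondeterminism, any execution can be replayed exactly.

```python
import random


def simulate(seed: int, drop_probability: float = 0.2):
    """Run one simulated execution; all nondeterminism comes from rng."""
    rng = random.Random(seed)
    # Messages in flight: three clients each send one write to a register.
    in_flight = [("client-a", 1), ("client-b", 2), ("client-c", 3)]
    register = None
    trace = []
    while in_flight:
        # The simulator, not the OS or the network, decides what happens next.
        sender, value = in_flight.pop(rng.randrange(len(in_flight)))
        if rng.random() < drop_probability:
            trace.append(f"drop {sender}")
            continue
        register = value
        trace.append(f"deliver {sender}")
    return register, trace


first = simulate(seed=42)
replay = simulate(seed=42)
assert first == replay  # same seed, same execution, step for step

# Different seeds explore different orderings and failure scenarios.
for seed in range(5):
    print(seed, simulate(seed))
```

When a seed produces a failing execution, recording that seed is enough to reproduce the failure as often as needed while debugging.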
DST requires the simulator to be able to control all sources of nondeterminism, such as network
delays. One of three strategies is generally adopted to make code deterministic:
Application-level
: Some systems are built from the ground up to make it easy to execute code deterministically. For
example, FoundationDB, one of the pioneers in the DST space, is built using an asynchronous
communication library called Flow. Flow provides a point for developers to inject a deterministic
network simulation into the system [^132].
Similarly, TigerBeetle is an online transaction processing (OLTP) database with first-class DST
support. The system’s state is modeled as a state machine, with all mutations occurring within a
single event loop. When combined with mock deterministic primitives such as clocks, such an
architecture is able to run deterministically [^133].
Runtime-level
: Languages with asynchronous runtimes and commonly used libraries provide an insertion point
to introduce determinism. A single-threaded runtime is used to force all asynchronous code to run
sequentially. FrostDB, for example, patches Go’s runtime to execute goroutines sequentially [^134].
Rust’s madsim library works in a similar manner. Madsim provides deterministic implementations of
Tokio’s asynchronous runtime API, AWS’s S3 library, Kafka’s Rust library, and many others.
Applications can swap in deterministic libraries and runtimes to get deterministic test executions
without changing their code.
Machine-level
: Rather than patching code at runtime, an entire machine can be made deterministic. This is a
delicate process that requires a machine to respond to all normally nondeterministic calls with
deterministic responses. Tools such as Antithesis do this by building a custom hypervisor that
replaces normally nondeterministic operations with deterministic ones. Everything from clocks
to network and storage needs to be accounted for. Once done, though, developers can run their
entire distributed system in a collection of containers within the hypervisor and get a completely
deterministic distributed system.
DST provides several advantages beyond replayability. Tools such as Antithesis attempt to explore
many different code paths in application code by branching a test execution into multiple
sub-executions when it discovers less common behavior. And because deterministic tests often use
mocked clocks and network calls, such tests can run faster than wall-clock time. For example,
TigerBeetle’s time abstraction allows simulations to simulate network latency and timeouts without
actually taking the full length of time to trigger the timeout. Such techniques allow the simulator
to explore more code paths faster.
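A simulated clock is what makes this speed-up possible. The sketch below is a generic illustration, not TigerBeetle's actual time abstraction: the hypothetical `SimClock` class keeps a heap of timers and jumps straight to the next deadline instead of sleeping, so a 30-second timeout fires almost instantly in wall-clock terms.

```python
import heapq
import time


class SimClock:
    """Virtual clock: time advances only when the simulator fires a timer."""

    def __init__(self):
        self.now = 0.0
        self._timers = []  # heap of (deadline, sequence number, callback)
        self._seq = 0      # tie-breaker so equal deadlines fire in FIFO order

    def call_later(self, delay, callback):
        heapq.heappush(self._timers, (self.now + delay, self._seq, callback))
        self._seq += 1

    def run(self):
        while self._timers:
            deadline, _, callback = heapq.heappop(self._timers)
            self.now = deadline  # jump to the deadline without sleeping
            callback()


clock = SimClock()
events = []
clock.call_later(30.0, lambda: events.append((clock.now, "request timed out")))
clock.call_later(0.5, lambda: events.append((clock.now, "reply delayed by 500 ms")))

started = time.monotonic()
clock.run()
elapsed = time.monotonic() - started

print(events)
print(f"simulated {clock.now} s in {elapsed:.6f} s of wall-clock time")
```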
#### The Power of Determinism {#sidebar_distributed_determinism}
Nondeterminism is at the core of all of the distributed systems challenges we discussed in this
chapter: concurrency, network delay, process pauses, clock jumps, and crashes all happen in
unpredictable ways that vary from one run of a system to the next. Conversely, if you can make a
system deterministic, that can hugely simplify things.
In fact, making things deterministic is a simple but powerful idea that arises again and again in
distributed system design. Besides deterministic simulation testing, we have seen several ways of
using determinism over the past chapters:
* A key advantage of event sourcing (see [“Event Sourcing and CQRS”](/en/ch3#sec_datamodels_events)) is that you can
deterministically replay a log of events to reconstruct derived materialized views.
* Workflow engines (see [“Durable Execution and Workflows”](/en/ch5#sec_encoding_dataflow_workflows)) rely on workflow definitions being
deterministic to provide durable execution semantics.
* *State machine replication*, which we will discuss in [“Using shared logs”](/en/ch10#sec_consistency_smr), replicates data by
independently executing the same sequence of deterministic transactions on each replica. We have
already seen two variants of that idea: statement-based replication (see
[“Implementation of Replication Logs”](/en/ch6#sec_replication_implementation)) and serial transaction execution using stored procedures
(see [“Pros and cons of stored procedures”](/en/ch8#sec_transactions_stored_proc_tradeoffs)).
However, making code fully deterministic requires care. Even once you have removed all concurrency
and replaced I/O, network communication, clocks, and random number generators with deterministic
simulations, elements of nondeterminism may remain. For example, in some programming languages, the
order in which you iterate over the elements of a hash table may be nondeterministic. Whether you
run into a resource limit (memory allocation failure, stack overflow) is also nondeterministic.
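Python is one language in which this pitfall shows up: string hashes are randomized per process by default, so the iteration order of a set of strings can differ from one run to the next. A common remedy, sketched here with a hypothetical helper function, is to impose an explicit order before doing anything order-sensitive.

```python
def apply_in_fixed_order(pending):
    """Process items in sorted order so that every run produces the same result."""
    return [f"apply {item}" for item in sorted(pending)]


pending = {"txn-b", "txn-c", "txn-a"}

# Iterating over the set directly may yield a different order on each run
# of the interpreter, unless the PYTHONHASHSEED environment variable is fixed.
print(list(pending))

# Sorting first removes the dependence on hash order.
print(apply_in_fixed_order(pending))
```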
## Summary {#summary}
In this chapter we have discussed a wide range of problems that can occur in distributed systems,
including:
* Whenever you try to send a packet over the network, it may be lost or arbitrarily delayed.
Likewise, the reply may be lost or delayed, so if you don’t get a reply, you have no idea whether
the message got through.
* A node’s clock may be significantly out of sync with other nodes (despite your best efforts to set
up NTP), it may suddenly jump forward or back in time, and relying on it is dangerous because you
most likely don’t have a good measure of your clock’s confidence interval.
* A process may pause for a substantial amount of time at any point in its execution, be declared
dead by other nodes, and then come back to life again without realizing that it was paused.
The fact that such *partial failures* can occur is the defining characteristic of distributed
systems. Whenever software tries to do anything involving other nodes, there is the possibility that
it may occasionally fail, or randomly go slow, or not respond at all (and eventually time out). In
distributed systems, we try to build tolerance of partial failures into software, so that the system
as a whole may continue functioning even when some of its constituent parts are broken.
To tolerate faults, the first step is to *detect* them, but even that is hard. Most systems
don’t have an accurate mechanism for detecting whether a node has failed, so most distributed
algorithms rely on timeouts to determine whether a remote node is still available. However, timeouts
can’t distinguish between network and node failures, and variable network delay sometimes causes a
node to be falsely suspected of crashing. Handling limping nodes, which are responding but are too
slow to do anything useful, is even harder.
Once a fault is detected, making a system tolerate it is not easy either: there is no global
variable, no shared memory, no common knowledge or any other kind of shared state between the machines [^83].
Nodes can’t even agree on what time it is, let alone on anything more profound. The only way
information can flow from one node to another is by sending it over the unreliable network. Major
decisions cannot be safely made by a single node, so we require protocols that enlist help from
other nodes and try to get a quorum to agree.
If you’re used to writing software in the idealized mathematical perfection of a single computer,
where the same operation always deterministically returns the same result, then moving to the messy
physical reality of distributed systems can be a bit of a shock. Conversely, distributed systems
engineers will often regard a problem as trivial if it can be solved on a single computer [^4],
and indeed a single computer can do a lot nowadays. If you can avoid opening Pandora’s box and
simply keep things on a single machine, for example by using an embedded storage engine (see [“Embedded storage engines”](/en/ch4#sidebar_embedded)), it is generally worth doing so.
However, as discussed in [“Distributed versus Single-Node Systems”](/en/ch1#sec_introduction_distributed), scalability is not the only reason for
wanting to use a distributed system. Fault tolerance and low latency (by placing data geographically
close to users) are equally important goals, and those things cannot be achieved with a single node.
The power of distributed systems is that in principle, they can run forever without being
interrupted at the service level, because all faults and maintenance can be handled at the node
level. (In practice, if a bad configuration change is rolled out to all nodes, that will still bring
a distributed system to its knees.)
In this chapter we also went on some tangents to explore whether the unreliability of networks,
clocks, and processes is an inevitable law of nature. We saw that it isn’t: it is possible to give
hard real-time response guarantees and bounded delays in networks, but doing so is very expensive and
results in lower utilization of hardware resources. Most non-safety-critical systems choose cheap
and unreliable over expensive and reliable.
This chapter has been all about problems, and has given us a bleak outlook. In the next chapter we
will move on to solutions, and discuss some algorithms that have been designed to cope with the
problems in distributed systems.
### References
[^1]: Mark Cavage. [There’s Just No Getting Around It: You’re Building a Distributed System](https://queue.acm.org/detail.cfm?id=2482856). *ACM Queue*, volume 11, issue 4, pages 80–89, April 2013. [doi:10.1145/2466486.2482856](https://doi.org/10.1145/2466486.2482856)
[^2]: Jay Kreps. [Getting Real About Distributed System Reliability](https://blog.empathybox.com/post/19574936361/getting-real-about-distributed-system-reliability). *blog.empathybox.com*, March 2012. Archived at [perma.cc/9B5Q-AEBW](https://perma.cc/9B5Q-AEBW)
[^3]: Coda Hale. [You Can’t Sacrifice Partition Tolerance](https://codahale.com/you-cant-sacrifice-partition-tolerance/). *codahale.com*, October 2010.
[^4]: Jeff Hodges. [Notes on Distributed Systems for Young Bloods](https://www.somethingsimilar.com/2013/01/14/notes-on-distributed-systems-for-young-bloods/). *somethingsimilar.com*, January 2013. Archived at [perma.cc/B636-62CE](https://perma.cc/B636-62CE)
[^5]: Van Jacobson. [Congestion Avoidance and Control](https://www.cs.usask.ca/ftp/pub/discus/seminars2002-2003/p314-jacobson.pdf). At *ACM Symposium on Communications Architectures and Protocols* (SIGCOMM), August 1988. [doi:10.1145/52324.52356](https://doi.org/10.1145/52324.52356)
[^6]: Bert Hubert. [The Ultimate SO\_LINGER Page, or: Why Is My TCP Not Reliable](https://blog.netherlabs.nl/articles/2009/01/18/the-ultimate-so_linger-page-or-why-is-my-tcp-not-reliable). *blog.netherlabs.nl*, January 2009. Archived at [perma.cc/6HDX-L2RR](https://perma.cc/6HDX-L2RR)
[^7]: Jerome H. Saltzer, David P. Reed, and David D. Clark. [End-To-End Arguments in System Design](https://groups.csail.mit.edu/ana/Publications/PubPDFs/End-to-End%20Arguments%20in%20System%20Design.pdf). *ACM Transactions on Computer Systems*, volume 2, issue 4, pages 277–288, November 1984. [doi:10.1145/357401.357402](https://doi.org/10.1145/357401.357402)
[^8]: Peter Bailis and Kyle Kingsbury. [The Network Is Reliable](https://queue.acm.org/detail.cfm?id=2655736). *ACM Queue*, volume 12, issue 7, pages 48–55, July 2014. [doi:10.1145/2639988.2639988](https://doi.org/10.1145/2639988.2639988)
[^9]: Joshua B. Leners, Trinabh Gupta, Marcos K. Aguilera, and Michael Walfish. [Taming Uncertainty in Distributed Systems with Help from the Network](https://cs.nyu.edu/~mwalfish/papers/albatross-eurosys15.pdf). At *10th European Conference on Computer Systems* (EuroSys), April 2015. [doi:10.1145/2741948.2741976](https://doi.org/10.1145/2741948.2741976)
[^10]: Phillipa Gill, Navendu Jain, and Nachiappan Nagappan. [Understanding Network Failures in Data Centers: Measurement, Analysis, and Implications](https://conferences.sigcomm.org/sigcomm/2011/papers/sigcomm/p350.pdf). At *ACM SIGCOMM Conference*, August 2011. [doi:10.1145/2018436.2018477](https://doi.org/10.1145/2018436.2018477)
[^11]: Urs Hölzle. [But recently a farmer had started grazing a herd of cows nearby. And whenever they stepped on the fiber link, they bent it enough to cause a blip](https://x.com/uhoelzle/status/1263333283107991558). *x.com*, May 2020. Archived at [perma.cc/WX8X-ZZA5](https://perma.cc/WX8X-ZZA5)
[^12]: CBC News. [Hundreds lose internet service in northern B.C. after beaver chews through cable](https://www.cbc.ca/news/canada/british-columbia/beaver-internet-down-tumbler-ridge-1.6001594). *cbc.ca*, April 2021. Archived at [perma.cc/UW8C-H2MY](https://perma.cc/UW8C-H2MY)
[^13]: Will Oremus. [The Global Internet Is Being Attacked by Sharks, Google Confirms](https://slate.com/technology/2014/08/shark-attacks-threaten-google-s-undersea-internet-cables-video.html). *slate.com*, August 2014. Archived at [perma.cc/P6F3-C6YG](https://perma.cc/P6F3-C6YG)
[^14]: Jess Auerbach Jahajeeah. [Down to the wire: The ship fixing our internet](https://continent.substack.com/p/down-to-the-wire-the-ship-fixing). *continent.substack.com*, November 2023. Archived at [perma.cc/DP7B-EQ7S](https://perma.cc/DP7B-EQ7S)
[^15]: Santosh Janardhan. [More details about the October 4 outage](https://engineering.fb.com/2021/10/05/networking-traffic/outage-details/). *engineering.fb.com*, October 2021. Archived at [perma.cc/WW89-VSXH](https://perma.cc/WW89-VSXH)
[^16]: Tom Parfitt. [Georgian woman cuts off web access to whole of Armenia](https://www.theguardian.com/world/2011/apr/06/georgian-woman-cuts-web-access). *theguardian.com*, April 2011. Archived at [perma.cc/KMC3-N3NZ](https://perma.cc/KMC3-N3NZ)
[^17]: Antonio Voce, Tural Ahmedzade, and Ashley Kirk. [‘Shadow fleets’ and subaquatic sabotage: are Europe’s undersea internet cables under attack?](https://www.theguardian.com/world/ng-interactive/2025/mar/05/shadow-fleets-subaquatic-sabotage-europe-undersea-internet-cables-under-attack) *theguardian.com*, March 2025. Archived at [perma.cc/HA7S-ZDBV](https://perma.cc/HA7S-ZDBV)
[^18]: Shengyun Liu, Paolo Viotti, Christian Cachin, Vivien Quéma, and Marko Vukolić. [XFT: Practical Fault Tolerance beyond Crashes](https://www.usenix.org/system/files/conference/osdi16/osdi16-liu.pdf). At *12th USENIX Symposium on Operating Systems Design and Implementation* (OSDI), November 2016.
[^19]: Mark Imbriaco. [Downtime last Saturday](https://github.blog/news-insights/the-library/downtime-last-saturday/). *github.blog*, December 2012. Archived at [perma.cc/M7X5-E8SQ](https://perma.cc/M7X5-E8SQ)
[^20]: Tom Lianza and Chris Snook. [A Byzantine failure in the real world](https://blog.cloudflare.com/a-byzantine-failure-in-the-real-world/). *blog.cloudflare.com*, November 2020. Archived at [perma.cc/83EZ-ALCY](https://perma.cc/83EZ-ALCY)
[^21]: Mohammed Alfatafta, Basil Alkhatib, Ahmed Alquraan, and Samer Al-Kiswany. [Toward a Generic Fault Tolerance Technique for Partial Network Partitioning](https://www.usenix.org/conference/osdi20/presentation/alfatafta). At *14th USENIX Symposium on Operating Systems Design and Implementation* (OSDI), November 2020.
[^22]: Marc A. Donges. [Re: bnx2 cards Intermittantly Going Offline](https://www.spinics.net/lists/netdev/msg210485.html). Message to Linux *netdev* mailing list, *spinics.net*, September 2012. Archived at [perma.cc/TXP6-H8R3](https://perma.cc/TXP6-H8R3)
[^23]: Troy Toman. [Inside a CODE RED: Network Edition](https://signalvnoise.com/svn3/inside-a-code-red-network-edition/). *signalvnoise.com*, September 2020. Archived at [perma.cc/BET6-FY25](https://perma.cc/BET6-FY25)
[^24]: Kyle Kingsbury. [Call Me Maybe: Elasticsearch](https://aphyr.com/posts/317-call-me-maybe-elasticsearch). *aphyr.com*, June 2014. Archived at [perma.cc/JK47-S89J](https://perma.cc/JK47-S89J)
[^25]: Salvatore Sanfilippo. [A Few Arguments About Redis Sentinel Properties and Fail Scenarios](https://antirez.com/news/80). *antirez.com*, October 2014. Archived at [perma.cc/8XEU-CLM8](https://perma.cc/8XEU-CLM8)
[^26]: Nicolas Liochon. [CAP: If All You Have Is a Timeout, Everything Looks Like a Partition](http://blog.thislongrun.com/2015/05/CAP-theorem-partition-timeout-zookeeper.html). *blog.thislongrun.com*, May 2015. Archived at [perma.cc/FS57-V2PZ](https://perma.cc/FS57-V2PZ)
[^27]: Matthew P. Grosvenor, Malte Schwarzkopf, Ionel Gog, Robert N. M. Watson, Andrew W. Moore, Steven Hand, and Jon Crowcroft. [Queues Don’t Matter When You Can JUMP Them!](https://www.usenix.org/system/files/conference/nsdi15/nsdi15-paper-grosvenor_update.pdf) At *12th USENIX Symposium on Networked Systems Design and Implementation* (NSDI), May 2015.
[^28]: Theo Julienne. [Debugging network stalls on Kubernetes](https://github.blog/engineering/debugging-network-stalls-on-kubernetes/). *github.blog*, November 2019. Archived at [perma.cc/K9M8-XVGL](https://perma.cc/K9M8-XVGL)
[^29]: Guohui Wang and T. S. Eugene Ng. [The Impact of Virtualization on Network Performance of Amazon EC2 Data Center](https://www.cs.rice.edu/~eugeneng/papers/INFOCOM10-ec2.pdf). At *29th IEEE International Conference on Computer Communications* (INFOCOM), March 2010. [doi:10.1109/INFCOM.2010.5461931](https://doi.org/10.1109/INFCOM.2010.5461931)
[^30]: Brandon Philips. [etcd: Distributed Locking and Service Discovery](https://www.youtube.com/watch?v=HJIjTTHWYnE). At *Strange Loop*, September 2014.
[^31]: Steve Newman. [A Systematic Look at EC2 I/O](https://www.sentinelone.com/blog/a-systematic-look-at-ec2-i-o/). *blog.scalyr.com*, October 2012. Archived at [perma.cc/FL4R-H2VE](https://perma.cc/FL4R-H2VE)
[^32]: Naohiro Hayashibara, Xavier Défago, Rami Yared, and Takuya Katayama. [The ϕ Accrual Failure Detector](https://hdl.handle.net/10119/4784). Japan Advanced Institute of Science and Technology, School of Information Science, Technical Report IS-RR-2004-010, May 2004. Archived at [perma.cc/NSM2-TRYA](https://perma.cc/NSM2-TRYA)
[^33]: Jeffrey Wang. [Phi Accrual Failure Detector](https://ternarysearch.blogspot.com/2013/08/phi-accrual-failure-detector.html). *ternarysearch.blogspot.co.uk*, August 2013. Archived at [perma.cc/L452-AMLV](https://perma.cc/L452-AMLV)
[^34]: Srinivasan Keshav. *An Engineering Approach to Computer Networking: ATM Networks, the Internet, and the Telephone Network*. Addison-Wesley Professional, May 1997. ISBN: 978-0-201-63442-6
[^35]: Othmar Kyas. *ATM Networks*. International Thomson Publishing, 1995. ISBN: 978-1-850-32128-6
[^36]: Mellanox Technologies. [InfiniBand FAQ, Rev 1.3](https://network.nvidia.com/related-docs/whitepapers/InfiniBandFAQ_FQ_100.pdf). *network.nvidia.com*, December 2014. Archived at [perma.cc/LQJ4-QZVK](https://perma.cc/LQJ4-QZVK)
[^37]: Jose Renato Santos, Yoshio Turner, and G. (John) Janakiraman. [End-to-End Congestion Control for InfiniBand](https://infocom2003.ieee-infocom.org/papers/28_01.PDF). At *22nd Annual Joint Conference of the IEEE Computer and Communications Societies* (INFOCOM), April 2003. Also published by HP Laboratories Palo Alto, Tech Report HPL-2002-359. [doi:10.1109/INFCOM.2003.1208949](https://doi.org/10.1109/INFCOM.2003.1208949)
[^38]: Jialin Li, Naveen Kr. Sharma, Dan R. K. Ports, and Steven D. Gribble. [Tales of the Tail: Hardware, OS, and Application-level Sources of Tail Latency](https://syslab.cs.washington.edu/papers/latency-socc14.pdf). At *ACM Symposium on Cloud Computing* (SOCC), November 2014. [doi:10.1145/2670979.2670988](https://doi.org/10.1145/2670979.2670988)
[^39]: Ulrich Windl, David Dalton, Marc Martinec, and Dale R. Worley. [The NTP FAQ and HOWTO](https://www.ntp.org/ntpfaq/). *ntp.org*, November 2006.
[^40]: John Graham-Cumming. [How and why the leap second affected Cloudflare DNS](https://blog.cloudflare.com/how-and-why-the-leap-second-affected-cloudflare-dns/). *blog.cloudflare.com*, January 2017. Archived at [archive.org](https://web.archive.org/web/20250202041444/https%3A//blog.cloudflare.com/how-and-why-the-leap-second-affected-cloudflare-dns/)
[^41]: David Holmes. [Inside the Hotspot VM: Clocks, Timers and Scheduling Events – Part I – Windows](https://web.archive.org/web/20160308031939/https%3A//blogs.oracle.com/dholmes/entry/inside_the_hotspot_vm_clocks). *blogs.oracle.com*, October 2006. Archived at [archive.org](https://web.archive.org/web/20160308031939/https%3A//blogs.oracle.com/dholmes/entry/inside_the_hotspot_vm_clocks)
[^42]: Joran Dirk Greef. [Three Clocks are Better than One](https://tigerbeetle.com/blog/2021-08-30-three-clocks-are-better-than-one/). *tigerbeetle.com*, August 2021. Archived at [perma.cc/5RXG-EU6B](https://perma.cc/5RXG-EU6B)
[^43]: Oliver Yang. [Pitfalls of TSC usage](https://oliveryang.net/2015/09/pitfalls-of-TSC-usage/). *oliveryang.net*, September 2015. Archived at [perma.cc/Z2QY-5FRA](https://perma.cc/Z2QY-5FRA)
[^44]: Steve Loughran. [Time on Multi-Core, Multi-Socket Servers](https://steveloughran.blogspot.com/2015/09/time-on-multi-core-multi-socket-servers.html). *steveloughran.blogspot.co.uk*, September 2015. Archived at [perma.cc/7M4S-D4U6](https://perma.cc/7M4S-D4U6)
[^45]: James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, JJ Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Dale Woodford, Yasushi Saito, Christopher Taylor, Michal Szymaniak, and Ruth Wang. [Spanner: Google’s Globally-Distributed Database](https://research.google/pubs/pub39966/). At *10th USENIX Symposium on Operating Systems Design and Implementation* (OSDI), October 2012.
[^46]: M. Caporaloni and R. Ambrosini. [How Closely Can a Personal Computer Clock Track the UTC Timescale Via the Internet?](https://iopscience.iop.org/0143-0807/23/4/103/) *European Journal of Physics*, volume 23, issue 4, pages L17–L21, June 2002. [doi:10.1088/0143-0807/23/4/103](https://doi.org/10.1088/0143-0807/23/4/103)
[^47]: Nelson Minar. [A Survey of the NTP Network](https://alumni.media.mit.edu/~nelson/research/ntp-survey99/). *alumni.media.mit.edu*, December 1999. Archived at [perma.cc/EV76-7ZV3](https://perma.cc/EV76-7ZV3)
[^48]: Viliam Holub. [Synchronizing Clocks in a Cassandra Cluster Pt. 1 – The Problem](https://blog.rapid7.com/2014/03/14/synchronizing-clocks-in-a-cassandra-cluster-pt-1-the-problem/). *blog.rapid7.com*, March 2014. Archived at [perma.cc/N3RV-5LNL](https://perma.cc/N3RV-5LNL)
[^49]: Poul-Henning Kamp. [The One-Second War (What Time Will You Die?)](https://queue.acm.org/detail.cfm?id=1967009) *ACM Queue*, volume 9, issue 4, pages 44–48, April 2011. [doi:10.1145/1966989.1967009](https://doi.org/10.1145/1966989.1967009)
[^50]: Nelson Minar. [Leap Second Crashes Half the Internet](https://www.somebits.com/weblog/tech/bad/leap-second-2012.html). *somebits.com*, July 2012. Archived at [perma.cc/2WB8-D6EU](https://perma.cc/2WB8-D6EU)
[^51]: Christopher Pascoe. [Time, Technology and Leaping Seconds](https://googleblog.blogspot.com/2011/09/time-technology-and-leaping-seconds.html). *googleblog.blogspot.co.uk*, September 2011. Archived at [perma.cc/U2JL-7E74](https://perma.cc/U2JL-7E74)
[^52]: Mingxue Zhao and Jeff Barr. [Look Before You Leap – The Coming Leap Second and AWS](https://aws.amazon.com/blogs/aws/look-before-you-leap-the-coming-leap-second-and-aws/). *aws.amazon.com*, May 2015. Archived at [perma.cc/KPE9-XMFM](https://perma.cc/KPE9-XMFM)
[^53]: Darryl Veitch and Kanthaiah Vijayalayan. [Network Timing and the 2015 Leap Second](https://opus.lib.uts.edu.au/bitstream/10453/43923/1/LeapSecond_camera.pdf). At *17th International Conference on Passive and Active Measurement* (PAM), April 2016. [doi:10.1007/978-3-319-30505-9\_29](https://doi.org/10.1007/978-3-319-30505-9_29)
[^54]: VMware, Inc. [Timekeeping in VMware Virtual Machines](https://www.vmware.com/docs/vmware_timekeeping). *vmware.com*, October 2008. Archived at [perma.cc/HM5R-T5NF](https://perma.cc/HM5R-T5NF)
[^55]: Victor Yodaiken. [Clock Synchronization in Finance and Beyond](https://www.yodaiken.com/wp-content/uploads/2018/05/financeandbeyond.pdf). *yodaiken.com*, November 2017. Archived at [perma.cc/9XZD-8ZZN](https://perma.cc/9XZD-8ZZN)
[^56]: Mustafa Emre Acer, Emily Stark, Adrienne Porter Felt, Sascha Fahl, Radhika Bhargava, Bhanu Dev, Matt Braithwaite, Ryan Sleevi, and Parisa Tabriz. [Where the Wild Warnings Are: Root Causes of Chrome HTTPS Certificate Errors](https://acmccs.github.io/papers/p1407-acerA.pdf). At *ACM SIGSAC Conference on Computer and Communications Security* (CCS), pages 1407–1420, October 2017. [doi:10.1145/3133956.3134007](https://doi.org/10.1145/3133956.3134007)
[^57]: European Securities and Markets Authority. [MiFID II / MiFIR: Regulatory Technical and Implementing Standards – Annex I](https://www.esma.europa.eu/sites/default/files/library/2015/11/2015-esma-1464_annex_i_-_draft_rts_and_its_on_mifid_ii_and_mifir.pdf). *esma.europa.eu*, Report ESMA/2015/1464, September 2015. Archived at [perma.cc/ZLX9-FGQ3](https://perma.cc/ZLX9-FGQ3)
[^58]: Luke Bigum. [Solving MiFID II Clock Synchronisation With Minimum Spend (Part 1)](https://catach.blogspot.com/2015/11/solving-mifid-ii-clock-synchronisation.html). *catach.blogspot.com*, November 2015. Archived at [perma.cc/4J5W-FNM4](https://perma.cc/4J5W-FNM4)
[^59]: Oleg Obleukhov and Ahmad Byagowi. [How Precision Time Protocol is being deployed at Meta](https://engineering.fb.com/2022/11/21/production-engineering/precision-time-protocol-at-meta/). *engineering.fb.com*, November 2022. Archived at [perma.cc/29G6-UJNW](https://perma.cc/29G6-UJNW)
[^60]: John Wiseman. [gpsjam.org](https://gpsjam.org/), July 2022.
[^61]: Josh Levinson, Julien Ridoux, and Chris Munns. [It’s About Time: Microsecond-Accurate Clocks on Amazon EC2 Instances](https://aws.amazon.com/blogs/compute/its-about-time-microsecond-accurate-clocks-on-amazon-ec2-instances/). *aws.amazon.com*, November 2023. Archived at [perma.cc/56M6-5VMZ](https://perma.cc/56M6-5VMZ)
[^62]: Kyle Kingsbury. [Call Me Maybe: Cassandra](https://aphyr.com/posts/294-call-me-maybe-cassandra/). *aphyr.com*, September 2013. Archived at [perma.cc/4MBR-J96V](https://perma.cc/4MBR-J96V)
[^63]: John Daily. [Clocks Are Bad, or, Welcome to the Wonderful World of Distributed Systems](https://riak.com/clocks-are-bad-or-welcome-to-distributed-systems/). *riak.com*, November 2013. Archived at [perma.cc/4XB5-UCXY](https://perma.cc/4XB5-UCXY)
[^64]: Marc Brooker. [It’s About Time!](https://brooker.co.za/blog/2023/11/27/about-time.html) *brooker.co.za*, November 2023. Archived at [perma.cc/N6YK-DRPA](https://perma.cc/N6YK-DRPA)
[^65]: Kyle Kingsbury. [The Trouble with Timestamps](https://aphyr.com/posts/299-the-trouble-with-timestamps). *aphyr.com*, October 2013. Archived at [perma.cc/W3AM-5VAV](https://perma.cc/W3AM-5VAV)
[^66]: Leslie Lamport. [Time, Clocks, and the Ordering of Events in a Distributed System](https://www.microsoft.com/en-us/research/publication/time-clocks-ordering-events-distributed-system/). *Communications of the ACM*, volume 21, issue 7, pages 558–565, July 1978. [doi:10.1145/359545.359563](https://doi.org/10.1145/359545.359563)
[^67]: Justin Sheehy. [There Is No Now: Problems With Simultaneity in Distributed Systems](https://queue.acm.org/detail.cfm?id=2745385). *ACM Queue*, volume 13, issue 3, pages 36–41, March 2015. [doi:10.1145/2733108](https://doi.org/10.1145/2733108)
[^68]: Murat Demirbas. [Spanner: Google’s Globally-Distributed Database](https://muratbuffalo.blogspot.com/2013/07/spanner-googles-globally-distributed_4.html). *muratbuffalo.blogspot.co.uk*, July 2013. Archived at [perma.cc/6VWR-C9WB](https://perma.cc/6VWR-C9WB)
[^69]: Dahlia Malkhi and Jean-Philippe Martin. [Spanner’s Concurrency Control](https://www.cs.cornell.edu/~ie53/publications/DC-col51-Sep13.pdf). *ACM SIGACT News*, volume 44, issue 3, pages 73–77, September 2013. [doi:10.1145/2527748.2527767](https://doi.org/10.1145/2527748.2527767)
[^70]: Franck Pachot. [Achieving Precise Clock Synchronization on AWS](https://www.yugabyte.com/blog/aws-clock-synchronization/). *yugabyte.com*, December 2024. Archived at [perma.cc/UYM6-RNBS](https://perma.cc/UYM6-RNBS)
[^71]: Spencer Kimball. [Living Without Atomic Clocks: Where CockroachDB and Spanner diverge](https://www.cockroachlabs.com/blog/living-without-atomic-clocks/). *cockroachlabs.com*, January 2022. Archived at [perma.cc/AWZ7-RXFT](https://perma.cc/AWZ7-RXFT)
[^72]: Murat Demirbas. [Use of Time in Distributed Databases (part 4): Synchronized clocks in production databases](https://muratbuffalo.blogspot.com/2025/01/use-of-time-in-distributed-databases.html). *muratbuffalo.blogspot.com*, January 2025. Archived at [perma.cc/9WNX-Q9U3](https://perma.cc/9WNX-Q9U3)
[^73]: Cary G. Gray and David R. Cheriton. [Leases: An Efficient Fault-Tolerant Mechanism for Distributed File Cache Consistency](https://courses.cs.duke.edu/spring11/cps210/papers/p202-gray.pdf). At *12th ACM Symposium on Operating Systems Principles* (SOSP), December 1989. [doi:10.1145/74850.74870](https://doi.org/10.1145/74850.74870)
[^74]: Daniel Sturman, Scott Delap, Max Ross, et al. [Roblox Return to Service](https://corp.roblox.com/newsroom/2022/01/roblox-return-to-service-10-28-10-31-2021). *corp.roblox.com*, January 2022. Archived at [perma.cc/8ALT-WAS4](https://perma.cc/8ALT-WAS4)
[^75]: Todd Lipcon. [Avoiding Full GCs with MemStore-Local Allocation Buffers](https://www.slideshare.net/slideshow/hbase-hug-presentation/7038178). *slideshare.net*, February 2011.
[^76]: Christopher Clark, Keir Fraser, Steven Hand, Jacob Gorm Hansen, Eric Jul, Christian Limpach, Ian Pratt, and Andrew Warfield. [Live Migration of Virtual Machines](https://www.usenix.org/legacy/publications/library/proceedings/nsdi05/tech/full_papers/clark/clark.pdf). At *2nd USENIX Symposium on Networked Systems Design & Implementation* (NSDI), May 2005.
[^77]: Mike Shaver. [fsyncers and Curveballs](https://web.archive.org/web/20220107141023/http%3A//shaver.off.net/diary/2008/05/25/fsyncers-and-curveballs/). *shaver.off.net*, May 2008. Archived at [archive.org](https://web.archive.org/web/20220107141023/http%3A//shaver.off.net/diary/2008/05/25/fsyncers-and-curveballs/)
[^78]: Zhenyun Zhuang and Cuong Tran. [Eliminating Large JVM GC Pauses Caused by Background IO Traffic](https://engineering.linkedin.com/blog/2016/02/eliminating-large-jvm-gc-pauses-caused-by-background-io-traffic). *engineering.linkedin.com*, February 2016. Archived at [perma.cc/ML2M-X9XT](https://perma.cc/ML2M-X9XT)
[^79]: Martin Thompson. [Java Garbage Collection Distilled](https://mechanical-sympathy.blogspot.com/2013/07/java-garbage-collection-distilled.html). *mechanical-sympathy.blogspot.co.uk*, July 2013. Archived at [perma.cc/DJT3-NQLQ](https://perma.cc/DJT3-NQLQ)
[^80]: David Terei and Amit Levy. [Blade: A Data Center Garbage Collector](https://arxiv.org/pdf/1504.02578). arXiv:1504.02578, April 2015.
[^81]: Martin Maas, Tim Harris, Krste Asanović, and John Kubiatowicz. [Trash Day: Coordinating Garbage Collection in Distributed Systems](https://timharris.uk/papers/2015-hotos.pdf). At *15th USENIX Workshop on Hot Topics in Operating Systems* (HotOS), May 2015.
[^82]: Martin Fowler. [The LMAX Architecture](https://martinfowler.com/articles/lmax.html). *martinfowler.com*, July 2011. Archived at [perma.cc/5AV4-N6RJ](https://perma.cc/5AV4-N6RJ)
[^83]: Joseph Y. Halpern and Yoram Moses. [Knowledge and common knowledge in a distributed environment](https://groups.csail.mit.edu/tds/papers/Halpern/JACM90.pdf). *Journal of the ACM* (JACM), volume 37, issue 3, pages 549–587, July 1990. [doi:10.1145/79147.79161](https://doi.org/10.1145/79147.79161)
[^84]: Chuzhe Tang, Zhaoguo Wang, Xiaodong Zhang, Qianmian Yu, Binyu Zang, Haibing Guan, and Haibo Chen. [Ad Hoc Transactions in Web Applications: The Good, the Bad, and the Ugly](https://ipads.se.sjtu.edu.cn/_media/publications/concerto-sigmod22.pdf). At *ACM International Conference on Management of Data* (SIGMOD), June 2022. [doi:10.1145/3514221.3526120](https://doi.org/10.1145/3514221.3526120)
[^85]: Flavio P. Junqueira and Benjamin Reed. [*ZooKeeper: Distributed Process Coordination*](https://www.oreilly.com/library/view/zookeeper/9781449361297/). O’Reilly Media, 2013. ISBN: 978-1-449-36130-3
[^86]: Enis Söztutar. [HBase and HDFS: Understanding Filesystem Usage in HBase](https://www.slideshare.net/slideshow/hbase-and-hdfs-understanding-filesystem-usage/22990858). At *HBaseCon*, June 2013. Archived at [perma.cc/4DXR-9P88](https://perma.cc/4DXR-9P88)
[^87]: SUSE LLC. [SUSE Linux Enterprise High Availability 15 SP6 Administration Guide, Section 12: Fencing and STONITH](https://documentation.suse.com/sle-ha/15-SP6/html/SLE-HA-all/cha-ha-fencing.html). *documentation.suse.com*, March 2025. Archived at [perma.cc/8LAR-EL9D](https://perma.cc/8LAR-EL9D)
[^88]: Mike Burrows. [The Chubby Lock Service for Loosely-Coupled Distributed Systems](https://research.google/pubs/pub27897/). At *7th USENIX Symposium on Operating Systems Design and Implementation* (OSDI), November 2006.
[^89]: Kyle Kingsbury. [etcd 3.4.3](https://jepsen.io/analyses/etcd-3.4.3). *jepsen.io*, January 2020. Archived at [perma.cc/2P3Y-MPWU](https://perma.cc/2P3Y-MPWU)
[^90]: Ensar Basri Kahveci. [Distributed Locks are Dead; Long Live Distributed Locks!](https://hazelcast.com/blog/long-live-distributed-locks/) *hazelcast.com*, April 2019. Archived at [perma.cc/7FS5-LDXE](https://perma.cc/7FS5-LDXE)
[^91]: Martin Kleppmann. [How to do distributed locking](https://martin.kleppmann.com/2016/02/08/how-to-do-distributed-locking.html). *martin.kleppmann.com*, February 2016. Archived at [perma.cc/Y24W-YQ5L](https://perma.cc/Y24W-YQ5L)
[^92]: Salvatore Sanfilippo. [Is Redlock safe?](https://antirez.com/news/101) *antirez.com*, February 2016. Archived at [perma.cc/B6GA-9Q6A](https://perma.cc/B6GA-9Q6A)
[^93]: Gunnar Morling. [Leader Election With S3 Conditional Writes](https://www.morling.dev/blog/leader-election-with-s3-conditional-writes/). *www.morling.dev*, August 2024. Archived at [perma.cc/7V2N-J78Y](https://perma.cc/7V2N-J78Y)
[^94]: Leslie Lamport, Robert Shostak, and Marshall Pease. [The Byzantine Generals Problem](https://www.microsoft.com/en-us/research/publication/byzantine-generals-problem/). *ACM Transactions on Programming Languages and Systems* (TOPLAS), volume 4, issue 3, pages 382–401, July 1982. [doi:10.1145/357172.357176](https://doi.org/10.1145/357172.357176)
[^95]: Jim N. Gray. [Notes on Data Base Operating Systems](https://jimgray.azurewebsites.net/papers/dbos.pdf). In *Operating Systems: An Advanced Course*, Lecture Notes in Computer Science, volume 60, edited by R. Bayer, R. M. Graham, and G. Seegmüller, pages 393–481, Springer-Verlag, 1978. ISBN: 978-3-540-08755-7. Archived at [perma.cc/7S9M-2LZU](https://perma.cc/7S9M-2LZU)
[^96]: Brian Palmer. [How Complicated Was the Byzantine Empire?](https://slate.com/news-and-politics/2011/10/the-byzantine-tax-code-how-complicated-was-byzantium-anyway.html) *slate.com*, October 2011. Archived at [perma.cc/AN7X-FL3N](https://perma.cc/AN7X-FL3N)
[^97]: Leslie Lamport. [My Writings](https://lamport.azurewebsites.net/pubs/pubs.html). *lamport.azurewebsites.net*, December 2014. Archived at [perma.cc/5NNM-SQGR](https://perma.cc/5NNM-SQGR)
[^98]: John Rushby. [Bus Architectures for Safety-Critical Embedded Systems](https://www.csl.sri.com/papers/emsoft01/emsoft01.pdf). At *1st International Workshop on Embedded Software* (EMSOFT), October 2001. [doi:10.1007/3-540-45449-7\_22](https://doi.org/10.1007/3-540-45449-7_22)
[^99]: Jake Edge. [ELC: SpaceX Lessons Learned](https://lwn.net/Articles/540368/). *lwn.net*, March 2013. Archived at [perma.cc/AYX8-QP5X](https://perma.cc/AYX8-QP5X)
[^100]: Shehar Bano, Alberto Sonnino, Mustafa Al-Bassam, Sarah Azouvi, Patrick McCorry, Sarah Meiklejohn, and George Danezis. [SoK: Consensus in the Age of Blockchains](https://smeiklej.com/files/aft19a.pdf). At *1st ACM Conference on Advances in Financial Technologies* (AFT), October 2019. [doi:10.1145/3318041.3355458](https://doi.org/10.1145/3318041.3355458)
[^101]: Ezra Feilden, Adi Oltean, and Philip Johnston. [Why we should train AI in space](https://www.starcloud.com/wp). White Paper, *starcloud.com*, September 2024. Archived at [perma.cc/7Y3S-8UB6](https://perma.cc/7Y3S-8UB6)
[^102]: James Mickens. [The Saddest Moment](https://www.usenix.org/system/files/login-logout_1305_mickens.pdf). *USENIX ;login*, May 2013. Archived at [perma.cc/T7BZ-XCFR](https://perma.cc/T7BZ-XCFR)
[^103]: Martin Kleppmann and Heidi Howard. [Byzantine Eventual Consistency and the Fundamental Limits of Peer-to-Peer Databases](https://arxiv.org/abs/2012.00472). *arxiv.org*, December 2020. [doi:10.48550/arXiv.2012.00472](https://doi.org/10.48550/arXiv.2012.00472)
[^104]: Martin Kleppmann. [Making CRDTs Byzantine Fault Tolerant](https://martin.kleppmann.com/papers/bft-crdt-papoc22.pdf). At *9th Workshop on Principles and Practice of Consistency for Distributed Data* (PaPoC), April 2022. [doi:10.1145/3517209.3524042](https://doi.org/10.1145/3517209.3524042)
[^105]: Evan Gilman. [The Discovery of Apache ZooKeeper’s Poison Packet](https://www.pagerduty.com/blog/the-discovery-of-apache-zookeepers-poison-packet/). *pagerduty.com*, May 2015. Archived at [perma.cc/RV6L-Y5CQ](https://perma.cc/RV6L-Y5CQ)
[^106]: Jonathan Stone and Craig Partridge. [When the CRC and TCP Checksum Disagree](https://conferences2.sigcomm.org/sigcomm/2000/conf/paper/sigcomm2000-9-1.pdf). At *ACM Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication* (SIGCOMM), August 2000. [doi:10.1145/347059.347561](https://doi.org/10.1145/347059.347561)
[^107]: Evan Jones. [How Both TCP and Ethernet Checksums Fail](https://www.evanjones.ca/tcp-and-ethernet-checksums-fail.html). *evanjones.ca*, October 2015. Archived at [perma.cc/9T5V-B8X5](https://perma.cc/9T5V-B8X5)
[^108]: Cynthia Dwork, Nancy Lynch, and Larry Stockmeyer. [Consensus in the Presence of Partial Synchrony](https://groups.csail.mit.edu/tds/papers/Lynch/jacm88.pdf). *Journal of the ACM*, volume 35, issue 2, pages 288–323, April 1988. [doi:10.1145/42282.42283](https://doi.org/10.1145/42282.42283)
[^109]: Richard D. Schlichting and Fred B. Schneider. [Fail-stop processors: an approach to designing fault-tolerant computing systems](https://www.cs.cornell.edu/fbs/publications/Fail_Stop.pdf). *ACM Transactions on Computer Systems* (TOCS), volume 1, issue 3, pages 222–238, August 1983. [doi:10.1145/357369.357371](https://doi.org/10.1145/357369.357371)
[^110]: Thanh Do, Mingzhe Hao, Tanakorn Leesatapornwongsa, Tiratat Patana-anake, and Haryadi S. Gunawi. [Limplock: Understanding the Impact of Limpware on Scale-out Cloud Systems](https://ucare.cs.uchicago.edu/pdf/socc13-limplock.pdf). At *4th ACM Symposium on Cloud Computing* (SoCC), October 2013. [doi:10.1145/2523616.2523627](https://doi.org/10.1145/2523616.2523627)
[^111]: Josh Snyder and Joseph Lynch. [Garbage collecting unhealthy JVMs, a proactive approach](https://netflixtechblog.medium.com/introducing-jvmquake-ec944c60ba70). Netflix Technology Blog, *netflixtechblog.medium.com*, November 2019. Archived at [perma.cc/8BTA-N3YB](https://perma.cc/8BTA-N3YB)
[^112]: Haryadi S. Gunawi, Riza O. Suminto, Russell Sears, Casey Golliher, Swaminathan Sundararaman, Xing Lin, Tim Emami, Weiguang Sheng, Nematollah Bidokhti, Caitie McCaffrey, Gary Grider, Parks M. Fields, Kevin Harms, Robert B. Ross, Andree Jacobson, Robert Ricci, Kirk Webb, Peter Alvaro, H. Birali Runesha, Mingzhe Hao, and Huaicheng Li. [Fail-Slow at Scale: Evidence of Hardware Performance Faults in Large Production Systems](https://www.usenix.org/system/files/conference/fast18/fast18-gunawi.pdf). At *16th USENIX Conference on File and Storage Technologies* (FAST), February 2018.
[^113]: Peng Huang, Chuanxiong Guo, Lidong Zhou, Jacob R. Lorch, Yingnong Dang, Murali Chintalapati, and Randolph Yao. [Gray Failure: The Achilles’ Heel of Cloud-Scale Systems](https://www.microsoft.com/en-us/research/wp-content/uploads/2017/06/paper-1.pdf). At *16th Workshop on Hot Topics in Operating Systems* (HotOS), May 2017. [doi:10.1145/3102980.3103005](https://doi.org/10.1145/3102980.3103005)
[^114]: Chang Lou, Peng Huang, and Scott Smith. [Understanding, Detecting and Localizing Partial Failures in Large System Software](https://www.usenix.org/conference/nsdi20/presentation/lou). At *17th USENIX Symposium on Networked Systems Design and Implementation* (NSDI), February 2020.
[^115]: Peter Bailis and Ali Ghodsi. [Eventual Consistency Today: Limitations, Extensions, and Beyond](https://queue.acm.org/detail.cfm?id=2462076). *ACM Queue*, volume 11, issue 3, pages 55–63, March 2013. [doi:10.1145/2460276.2462076](https://doi.org/10.1145/2460276.2462076)
[^116]: Bowen Alpern and Fred B. Schneider. [Defining Liveness](https://www.cs.cornell.edu/fbs/publications/DefLiveness.pdf). *Information Processing Letters*, volume 21, issue 4, pages 181–185, October 1985. [doi:10.1016/0020-0190(85)90056-0](https://doi.org/10.1016/0020-0190%2885%2990056-0)
[^117]: Flavio P. Junqueira. [Dude, Where’s My Metadata?](https://fpj.me/2015/05/28/dude-wheres-my-metadata/) *fpj.me*, May 2015. Archived at [perma.cc/D2EU-Y9S5](https://perma.cc/D2EU-Y9S5)
[^118]: Scott Sanders. [January 28th Incident Report](https://github.com/blog/2106-january-28th-incident-report). *github.com*, February 2016. Archived at [perma.cc/5GZR-88TV](https://perma.cc/5GZR-88TV)
[^119]: Jay Kreps. [A Few Notes on Kafka and Jepsen](https://blog.empathybox.com/post/62279088548/a-few-notes-on-kafka-and-jepsen). *blog.empathybox.com*, September 2013. Archived at [perma.cc/XJ5C-F583](https://perma.cc/XJ5C-F583)
[^120]: Marc Brooker and Ankush Desai. [Systems Correctness Practices at AWS](https://dl.acm.org/doi/pdf/10.1145/3712057). *ACM Queue*, volume 22, issue 6, November/December 2024. [doi:10.1145/3712057](https://doi.org/10.1145/3712057)
[^121]: Andrey Satarin. [Testing Distributed Systems: Curated list of resources on testing distributed systems](https://asatarin.github.io/testing-distributed-systems/). *asatarin.github.io*. Archived at [perma.cc/U5V8-XP24](https://perma.cc/U5V8-XP24)
[^122]: Jack Vanlightly. [Verifying Kafka transactions - Diary entry 2 - Writing an initial TLA+ spec](https://jack-vanlightly.com/analyses/2024/12/3/verifying-kafka-transactions-diary-entry-2-writing-an-initial-tla-spec). *jack-vanlightly.com*, December 2024. Archived at [perma.cc/NSQ8-MQ5N](https://perma.cc/NSQ8-MQ5N)
[^123]: Siddon Tang. [From Chaos to Order — Tools and Techniques for Testing TiDB, A Distributed NewSQL Database](https://www.pingcap.com/blog/chaos-practice-in-tidb/). *pingcap.com*, April 2018. Archived at [perma.cc/5EJB-R29F](https://perma.cc/5EJB-R29F)
[^124]: Nathan VanBenschoten. [Parallel Commits: An atomic commit protocol for globally distributed transactions](https://www.cockroachlabs.com/blog/parallel-commits/). *cockroachlabs.com*, November 2019. Archived at [perma.cc/5FZ7-QK6J](https://perma.cc/5FZ7-QK6J)
[^125]: Jack Vanlightly. [Paper: VR Revisited - State Transfer (part 3)](https://jack-vanlightly.com/analyses/2022/12/28/paper-vr-revisited-state-transfer-part-3). *jack-vanlightly.com*, December 2022. Archived at [perma.cc/KNK3-K6WS](https://perma.cc/KNK3-K6WS)
[^126]: Hillel Wayne. [What if the spec doesn’t match the code?](https://buttondown.com/hillelwayne/archive/what-if-the-spec-doesnt-match-the-code/) *buttondown.com*, March 2024. Archived at [perma.cc/8HEZ-KHER](https://perma.cc/8HEZ-KHER)
[^127]: Lingzhi Ouyang, Xudong Sun, Ruize Tang, Yu Huang, Madhav Jivrajani, Xiaoxing Ma, and Tianyin Xu. [Multi-Grained Specifications for Distributed System Model Checking and Verification](https://arxiv.org/abs/2409.14301). At *20th European Conference on Computer Systems* (EuroSys), March 2025. [doi:10.1145/3689031.3696069](https://doi.org/10.1145/3689031.3696069)
[^128]: Yury Izrailevsky and Ariel Tseitlin. [The Netflix Simian Army](https://netflixtechblog.com/the-netflix-simian-army-16e57fbab116). *netflixtechblog.com*, July 2011. Archived at [perma.cc/M3NY-FJW6](https://perma.cc/M3NY-FJW6)
[^129]: Kyle Kingsbury. [Jepsen: On the perils of network partitions](https://aphyr.com/posts/281-jepsen-on-the-perils-of-network-partitions). *aphyr.com*, May 2013. Archived at [perma.cc/W98G-6HQP](https://perma.cc/W98G-6HQP)
[^130]: Kyle Kingsbury. [Jepsen Analyses](https://jepsen.io/analyses). *jepsen.io*, 2024. Archived at [perma.cc/8LDN-D2T8](https://perma.cc/8LDN-D2T8)
[^131]: Rupak Majumdar and Filip Niksic. [Why is random testing effective for partition tolerance bugs?](https://dl.acm.org/doi/pdf/10.1145/3158134) *Proceedings of the ACM on Programming Languages* (PACMPL), volume 2, issue POPL, article no. 46, December 2017. [doi:10.1145/3158134](https://doi.org/10.1145/3158134)
[^132]: FoundationDB project authors. [Simulation and Testing](https://apple.github.io/foundationdb/testing.html). *apple.github.io*. Archived at [perma.cc/NQ3L-PM4C](https://perma.cc/NQ3L-PM4C)
[^133]: Alex Kladov. [Simulation Testing For Liveness](https://tigerbeetle.com/blog/2023-07-06-simulation-testing-for-liveness/). *tigerbeetle.com*, July 2023. Archived at [perma.cc/RKD4-HGCR](https://perma.cc/RKD4-HGCR)
[^134]: Alfonso Subiotto Marqués. [(Mostly) Deterministic Simulation Testing in Go](https://www.polarsignals.com/blog/posts/2024/05/28/mostly-dst-in-go). *polarsignals.com*, May 2024. Archived at [perma.cc/ULD6-TSA4](https://perma.cc/ULD6-TSA4)
================================================
FILE: content/en/colophon.md
================================================
---
title: Colophon
weight: 600
breadcrumbs: false
---
## About the Author
**Martin Kleppmann** is an Associate Professor at the University of Cambridge, UK, where he teaches on distributed systems and cryptographic protocols.
The first edition of *Designing Data-Intensive Applications* in 2017 established him as an authority on data systems,
and through his research on distributed systems he helped start the local-first software movement.
Previously he was a software engineer and entrepreneur at internet companies including LinkedIn and Rapportive,
where he worked on large-scale data infrastructure.

**Chris Riccomini** is a software engineer, startup investor, and author with 15+ years of experience at PayPal,
LinkedIn, and WePay. He runs Materialized View Capital, where he invests in infrastructure startups.
He is also the co-creator of Apache Samza and SlateDB,
and co-author of *The Missing README: A Guide for the New Software Engineer*.
## Colophon
The animal on the cover of *Designing Data-Intensive Applications* is an Indian wild boar (*Sus scrofa cristatus*), a subspecies of wild boar found in India, Myanmar, Nepal, Sri Lanka, and Thailand. Indian wild boars are distinct from European boars in that they have higher back bristles, no woolly undercoat, and a larger, straighter skull.
The Indian wild boar has a coat of gray or black hair, with stiff bristles running along the spine. Males have protruding canine teeth (called tushes) that are used to fight with rivals or fend off predators. Males are larger than females, but the species averages 33–35 inches tall at the shoulder and 200–300 pounds in weight. Their natural predators include bears, tigers, and various big cats.
These animals are nocturnal and omnivorous—they eat a wide variety of things, including roots, insects, carrion, nuts, berries, and small animals. Wild boars are also known to root through garbage and crop fields, causing a great deal of destruction and earning the enmity of farmers. They need to eat 4,000–4,500 calories a day. Boars have a well-developed sense of smell, which helps them forage for underground plant material and burrowing animals. However, their eyesight is poor.
Wild boars have long held significance in human culture. In Hindu lore, the boar is an avatar of the god Vishnu. In ancient Greek funerary monuments, it was a symbol of a gallant loser (in contrast to the victorious lion). Due to its aggression, it was depicted on the armor and weapons of Scandinavian, Germanic, and Anglo-Saxon warriors. In the Chinese zodiac, it symbolizes determination and impetuosity.
Many of the animals on O’Reilly covers are endangered; all of them are important to the world. To learn more about how you can help, go to *animals.oreilly.com*.
The cover image is from Shaw’s *Zoology*. The cover fonts are URW Typewriter and Guardian Sans. The text font is Adobe Minion Pro; the font in diagrams is Adobe Myriad Pro; the heading font is Adobe Myriad Condensed; and the code font is Dalton Maag’s Ubuntu Mono.
================================================
FILE: content/en/glossary.md
================================================
---
title: Glossary
weight: 500
breadcrumbs: false
---
> Please note that the definitions in this glossary are short and simple, intended to convey the core idea but not the full subtleties of a term. For more detail, please follow the references into the main text.
### asynchronous
Not waiting for something to complete (e.g., sending data over the network to another node), and not making any assumptions about how long it is going to take. See [“Synchronous Versus Asynchronous Replication”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_sync_async), [“Synchronous Versus Asynchronous Networks”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch09.html#sec_distributed_sync_networks), and [“System Model and Reality”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch09.html#sec_distributed_system_model).
### atomic
1. In the context of concurrency: describing an operation that appears to take effect at a single point in time, so another concurrent process can never encounter the operation in a “half-finished” state. See also *isolation*.
2. In the context of transactions: grouping together a set of writes that must either all be committed or all be rolled back, even if faults occur. See [“Atomicity”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch08.html#sec_transactions_acid_atomicity) and [“Two-Phase Commit (2PC)”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch08.html#sec_transactions_2pc).
### backpressure
Forcing the sender of some data to slow down when the recipient cannot keep up with it. Also known as *flow control*. See [“When an Overloaded System Won’t Recover”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch02.html#sidebar_metastable).
### batch process
A computation that takes some fixed (and usually large) set of data as input and produces some other data as output, without modifying the input. See [Chapter 11](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch11.html#ch_batch).
### bounded
Having some known upper limit or size. Used for example in the context of network delay (see [“Timeouts and Unbounded Delays”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch09.html#sec_distributed_queueing)) and datasets (see the introduction to [Chapter 12](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch12.html#ch_stream)).
### Byzantine fault
A node that behaves incorrectly in some arbitrary way, for example by sending contradictory or malicious messages to other nodes. See [“Byzantine Faults”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch09.html#sec_distributed_byzantine).
### cache
A component that remembers recently used data in order to speed up future reads of the same data. It is generally not complete: thus, if some data is missing from the cache, it has to be fetched from some underlying, slower data storage system that has a complete copy of the data.
### CAP theorem
A widely misunderstood theoretical result that is not useful in practice. See [“The CAP theorem”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch10.html#sec_consistency_cap).
### causality
The dependency between events that arises when one thing “happens before” another thing in a system. For example, a later event that is in response to an earlier event, or builds upon an earlier event, or should be understood in the light of an earlier event. See [“The “happens-before” relation and concurrency”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_happens_before).
### consensus
A fundamental problem in distributed computing, concerning getting several nodes to agree on something (for example, which node should be the leader for a database cluster). The problem is much harder than it seems at first glance. See [“Consensus”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch10.html#sec_consistency_consensus).
### data warehouse
A database in which data from several different OLTP systems has been combined and prepared to be used for analytics purposes. See [“Data Warehousing”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#sec_introduction_dwh).
### declarative
Describing the properties that something should have, but not the exact steps for how to achieve it. In the context of database queries, a query optimizer takes a declarative query and decides how it should best be executed. See [“Terminology: Declarative Query Languages”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch03.html#sidebar_declarative).
### denormalize
To introduce some amount of redundancy or duplication in a *normalized* dataset, typically in the form of a *cache* or *index*, in order to speed up reads. A denormalized value is a kind of precomputed query result, similar to a materialized view. See [“Normalization, Denormalization, and Joins”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch03.html#sec_datamodels_normalization).
### derived data
A dataset that is created from some other data through a repeatable process, which you could run again if necessary. Usually, derived data is needed to speed up a particular kind of read access to the data. Indexes, caches, and materialized views are examples of derived data. See [“Systems of Record and Derived Data”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#sec_introduction_derived).
### deterministic
Describing a function that always produces the same output if you give it the same input. This means it cannot depend on random numbers, the time of day, network communication, or other unpredictable things. See [“The Power of Determinism”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch09.html#sidebar_distributed_determinism).
### distributed
Running on several nodes connected by a network. Characterized by *partial failures*: some part of the system may be broken while other parts are still working, and it is often impossible for the software to know what exactly is broken. See [“Faults and Partial Failures”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch09.html#sec_distributed_partial_failure).
### durable
Storing data in a way such that you believe it will not be lost, even if various faults occur. See [“Durability”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch08.html#sec_transactions_acid_durability).
### ETL
Extract–Transform–Load. The process of extracting data from a source database, transforming it into a form that is more suitable for analytic queries, and loading it into a data warehouse or batch processing system. See [“Data Warehousing”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#sec_introduction_dwh).
### failover
In systems that have a single leader, failover is the process of moving the leadership role from one node to another. See [“Handling Node Outages”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_failover).
### fault-tolerant
Able to recover automatically if something goes wrong (e.g., if a machine crashes or a network link fails). See [“Reliability and Fault Tolerance”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch02.html#sec_introduction_reliability).
### flow control
See *backpressure*.
### follower
A replica that does not directly accept any writes from clients, but only processes data changes that it receives from a leader. Also known as a *secondary*, *read replica*, or *hot standby*. See [“Single-Leader Replication”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_leader).
### full-text search
Searching text by arbitrary keywords, often with additional features such as matching similarly spelled words or synonyms. A full-text index is a kind of *secondary index* that supports such queries. See [“Full-Text Search”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch04.html#sec_storage_full_text).
### graph
A data structure consisting of *vertices* (things that you can refer to, also known as *nodes* or *entities*) and *edges* (connections from one vertex to another, also known as *relationships* or *arcs*). See [“Graph-Like Data Models”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch03.html#sec_datamodels_graph).
### hash
A function that turns an input into a random-looking number. The same input always returns the same number as output. Two different inputs are very likely to have two different numbers as output, although it is possible that two different inputs produce the same output (this is called a *collision*). See [“Sharding by Hash of Key”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch07.html#sec_sharding_hash).
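
As a rough illustration (a sketch, not taken from the main text), the snippet below uses SHA-256 from Python's standard library as the hash function and reduces the result to a small number of buckets. The helper name `bucket` and the example keys are hypothetical. Because there are more keys than buckets, at least two keys must end up with the same output, which is a collision.

```python
import hashlib

def bucket(key: str, num_buckets: int) -> int:
    """Map a key to a bucket number using a hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Interpret the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % num_buckets

keys = ["alice", "bob", "carol", "dave", "erin"]
assignments = {key: bucket(key, 4) for key in keys}

# The same input always produces the same output.
assert bucket("alice", 4) == assignments["alice"]

# Five keys cannot all occupy distinct buckets when there are only four,
# so at least one collision is guaranteed here.
assert len(set(assignments.values())) < len(keys)
```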
### idempotent
Describing an operation that can be safely retried; if it is executed more than once, it has the same effect as if it was only executed once. See [“Idempotence”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch12.html#sec_stream_idempotence).
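
The difference can be sketched in a few lines of Python. This is a minimal example under assumed names (`add_to_balance`, `apply_payment`, and the `pay-1` identifier are all hypothetical): the first function changes the result on every retry, while the second remembers which payment IDs it has already applied and ignores repeats.

```python
def add_to_balance(account: dict, amount: int) -> None:
    """Not idempotent: every retry adds the amount again."""
    account["balance"] += amount

def apply_payment(account: dict, payment_id: str, amount: int) -> None:
    """Idempotent: a payment ID that has already been applied is skipped."""
    if payment_id in account["applied"]:
        return
    account["applied"].add(payment_id)
    account["balance"] += amount

naive = {"balance": 0}
add_to_balance(naive, 50)
add_to_balance(naive, 50)         # retry after a lost acknowledgment

safe = {"balance": 0, "applied": set()}
apply_payment(safe, "pay-1", 50)
apply_payment(safe, "pay-1", 50)  # same retry, no further effect

print(naive["balance"], safe["balance"])  # prints "100 50"
```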
### index
A data structure that lets you efficiently search for all records that have a particular value in a particular field. See [“Storage and Indexing for OLTP”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch04.html#sec_storage_oltp).
### isolation
In the context of transactions, describing the degree to which concurrently executing transactions can interfere with each other. *Serializable* isolation provides the strongest guarantees, but weaker isolation levels are also used. See [“Isolation”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch08.html#sec_transactions_acid_isolation).
### join
To bring together records that have something in common. Most commonly used in the case where one record has a reference to another (a foreign key, a document reference, an edge in a graph) and a query needs to get the record that the reference points to. See [“Normalization, Denormalization, and Joins”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch03.html#sec_datamodels_normalization) and [“JOIN and GROUP BY”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch11.html#sec_batch_join).
### leader
When data or a service is replicated across several nodes, the leader is the designated replica that is allowed to make changes. A leader may be elected through some protocol, or manually chosen by an administrator. Also known as the *primary* or *source*. See [“Single-Leader Replication”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_leader).
### linearizable
Behaving as if there was only a single copy of data in the system, which is updated by atomic operations. See [“Linearizability”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch10.html#sec_consistency_linearizability).
### locality
A performance optimization: putting several pieces of data in the same place if they are frequently needed at the same time. See [“Data locality for reads and writes”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch03.html#sec_datamodels_document_locality).
### lock
A mechanism to ensure that only one thread, node, or transaction can access something, and anyone else who wants to access the same thing must wait until the lock is released. See [“Two-Phase Locking (2PL)”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch08.html#sec_transactions_2pl) and [“Distributed Locks and Leases”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch09.html#sec_distributed_lock_fencing).
### log
An append-only file for storing data. A *write-ahead log* is used to make a storage engine resilient against crashes (see [“Making B-trees reliable”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch04.html#sec_storage_btree_wal)), a *log-structured* storage engine uses logs as its primary storage format (see [“Log-Structured Storage”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch04.html#sec_storage_log_structured)), a *replication log* is used to copy writes from a leader to followers (see [“Single-Leader Replication”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_leader)), and an *event log* can represent a data stream (see [“Log-based Message Brokers”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch12.html#sec_stream_log)).
### materialize
To perform a computation eagerly and write out its result, as opposed to calculating it on demand when requested. See [“Event Sourcing and CQRS”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch03.html#sec_datamodels_events).
### node
An instance of some software running on a computer, which communicates with other nodes via a network in order to accomplish some task.
### normalized
Structured in such a way that there is no redundancy or duplication. In a normalized database, when some piece of data changes, you only need to change it in one place, not many copies in many different places. See [“Normalization, Denormalization, and Joins”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch03.html#sec_datamodels_normalization).
### OLAP
Online analytic processing. Access pattern characterized by aggregating (e.g., count, sum, average) over a large number of records. See [“Operational Versus Analytical Systems”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#sec_introduction_analytics).
### OLTP
Online transaction processing. Access pattern characterized by fast queries that read or write a small number of records, usually indexed by key. See [“Operational Versus Analytical Systems”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#sec_introduction_analytics).
### sharding
Splitting up a large dataset or computation that is too big for a single machine into smaller parts and spreading them across several machines. Also known as *partitioning*. See [Chapter 7](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch07.html#ch_sharding).
### percentile
A way of measuring the distribution of values by counting how many values are above or below some threshold. For example, the 95th percentile response time during some period is the time *t* such that 95% of requests in that period complete in less than *t*, and 5% take longer than *t*. See [“Describing Performance”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch02.html#sec_introduction_percentiles).
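
One simple way to compute such a value is the nearest-rank method, sketched below. The function name and the sample response times are made up for illustration, and nearest-rank is only one of several common percentile definitions (it returns the smallest observed value with at least *p* percent of the values at or below it).

```python
import math

def percentile(values, p):
    """Nearest-rank percentile of a non-empty list, for 0 < p <= 100."""
    if not values:
        raise ValueError("need at least one value")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based position in sorted order
    return ordered[rank - 1]

# Hypothetical response times in milliseconds, including two slow outliers.
response_times_ms = [12, 15, 11, 240, 14, 13, 16, 18, 950, 17,
                     12, 14, 15, 13, 19, 21, 16, 14, 13, 15]

print(percentile(response_times_ms, 50))  # prints 15
print(percentile(response_times_ms, 95))  # prints 240
```

In this sample the median is 15 ms, but the 95th percentile is 240 ms, showing how high percentiles expose outliers that the median hides.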
### primary key
A value (typically a number or a string) that uniquely identifies a record. In many applications, primary keys are generated by the system when a record is created (e.g., sequentially or randomly); they are not usually set by users. See also *secondary index*.
### quorum
The minimum number of nodes that need to vote on an operation before it can be considered successful. See [“Quorums for reading and writing”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_quorum_condition).
### rebalance
To move data or services from one node to another in order to spread the load fairly. See [“Sharding of Key-Value Data”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch07.html#sec_sharding_key_value).
### replication
Keeping a copy of the same data on several nodes (*replicas*) so that it remains accessible if a node becomes unreachable. See [Chapter 6](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#ch_replication).
### schema
A description of the structure of some data, including its fields and datatypes. Whether some data conforms to a schema can be checked at various points in the data’s lifetime (see [“Schema flexibility in the document model”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch03.html#sec_datamodels_schema_flexibility)), and a schema can change over time (see [Chapter 5](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch05.html#ch_encoding)).
### secondary index
An additional data structure that is maintained alongside the primary data storage and which allows you to efficiently search for records that match a certain kind of condition. See [“Multi-Column and Secondary Indexes”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch04.html#sec_storage_index_multicolumn) and [“Sharding and Secondary Indexes”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch07.html#sec_sharding_secondary_indexes).
### serializable
An *isolation* guarantee that if several transactions execute concurrently, they behave the same as if they had executed one at a time, in some serial order. See [“Serializability”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch08.html#sec_transactions_serializability).
### shared-nothing
An architecture in which independent nodes—each with their own CPUs, memory, and disks—are connected via a conventional network, in contrast to shared-memory or shared-disk architectures. See [“Shared-Memory, Shared-Disk, and Shared-Nothing Architecture”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch02.html#sec_introduction_shared_nothing).
### skew
1. Imbalanced load across shards, such that some shards have lots of requests or data, and others have much less. Also known as *hot spots*. See [“Skewed Workloads and Relieving Hot Spots”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch07.html#sec_sharding_skew).
2. A timing anomaly that causes events to appear in an unexpected, nonsequential order. See the discussions of *read skew* in [“Snapshot Isolation and Repeatable Read”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch08.html#sec_transactions_snapshot_isolation), *write skew* in [“Write Skew and Phantoms”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch08.html#sec_transactions_write_skew), and *clock skew* in [“Timestamps for ordering events”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch09.html#sec_distributed_lww).
### split brain
A scenario in which two nodes simultaneously believe themselves to be the leader, and which may cause system guarantees to be violated. See [“Handling Node Outages”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_failover) and [“The Majority Rules”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch09.html#sec_distributed_majority).
### stored procedure
A way of encoding the logic of a transaction such that it can be entirely executed on a database server, without communicating back and forth with a client during the transaction. See [“Actual Serial Execution”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch08.html#sec_transactions_serial).
### stream process
A continually running computation that consumes a never-ending stream of events as input, and derives some output from it. See [Chapter 12](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch12.html#ch_stream).
### synchronous
The opposite of *asynchronous*.
### system of record
A system that holds the primary, authoritative version of some data, also known as the *source of truth*. Changes are first written here, and other datasets may be derived from the system of record. See [“Systems of Record and Derived Data”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#sec_introduction_derived).
### timeout
One of the simplest ways of detecting a fault, namely by observing the lack of a response within some amount of time. However, it is impossible to know whether a timeout is due to a problem with the remote node, or an issue in the network. See [“Timeouts and Unbounded Delays”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch09.html#sec_distributed_queueing).
### total order
A way of comparing things (e.g., timestamps) that allows you to always say which one of two things is greater and which one is lesser. An ordering in which some things are incomparable (you cannot say which is greater or smaller) is called a *partial order*.
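
As a small illustration (the helper name `comparable` is hypothetical), Python integers are totally ordered under `<=`, whereas Python sets under the subset relation are only partially ordered: two sets that each contain an element the other lacks are incomparable.

```python
def comparable(x, y) -> bool:
    """True if x and y are ordered one way or the other under <=."""
    return x <= y or y <= x

# Integers: any two values can be compared, so the order is total.
assert comparable(3, 5)
assert comparable(5, 3)

# Sets: <= means "is a subset of", which is only a partial order.
assert comparable({1}, {1, 2})          # {1} is a subset of {1, 2}
assert not comparable({1, 2}, {2, 3})   # neither is a subset of the other
```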
### transaction
Grouping together several reads and writes into a logical unit, in order to simplify error handling and concurrency issues. See [Chapter 8](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch08.html#ch_transactions).
### two-phase commit (2PC)
An algorithm to ensure that several database nodes either all *atomically* commit or all abort a transaction. See [“Two-Phase Commit (2PC)”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch08.html#sec_transactions_2pc).
### two-phase locking (2PL)
An algorithm for achieving *serializable isolation* that works by a transaction acquiring a lock on all data it reads or writes, and holding the lock until the end of the transaction. See [“Two-Phase Locking (2PL)”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch08.html#sec_transactions_2pl).
### unbounded
Not having any known upper limit or size. The opposite of *bounded*.
================================================
FILE: content/en/indexes.md
================================================
---
title: Indexes
weight: 550
breadcrumbs: false
---
### Symbols
- 3FS (distributed filesystem), [Distributed Filesystems](/en/ch11#sec_batch_dfs)
### A
- aborts (transactions), [Transactions](/en/ch8#ch_transactions), [Atomicity](/en/ch8#sec_transactions_acid_atomicity)
- cascading, [No dirty reads](/en/ch8#no-dirty-reads)
- in two-phase commit, [Two-Phase Commit (2PC)](/en/ch8#sec_transactions_2pc)
- performance of optimistic concurrency control, [Performance of serializable snapshot isolation](/en/ch8#performance-of-serializable-snapshot-isolation)
- retrying aborted transactions, [Handling errors and aborts](/en/ch8#handling-errors-and-aborts)
- abstraction, [Layering of cloud services](/en/ch1#layering-of-cloud-services), [Simplicity: Managing Complexity](/en/ch2#id38), [Data Models and Query Languages](/en/ch3#ch_datamodels), [Transactions](/en/ch8#ch_transactions), [Summary](/en/ch8#summary)
- accidental complexity, [Simplicity: Managing Complexity](/en/ch2#id38)
- accountability, [Responsibility and Accountability](/en/ch14#id371)
- accounting (financial data), [Summary](/en/ch3#summary), [Advantages of immutable events](/en/ch12#sec_stream_immutability_pros)
- Accumulo (database)
- wide-column data model, [Data locality for reads and writes](/en/ch3#sec_datamodels_document_locality), [Column Compression](/en/ch4#sec_storage_column_compression)
- ACID properties (transactions), [The Meaning of ACID](/en/ch8#sec_transactions_acid)
- atomicity, [Atomicity](/en/ch8#sec_transactions_acid_atomicity), [Single-Object and Multi-Object Operations](/en/ch8#sec_transactions_multi_object)
- consistency, [Consistency](/en/ch8#sec_transactions_acid_consistency), [Maintaining integrity in the face of software bugs](/en/ch13#id455)
- durability, [Making B-trees reliable](/en/ch4#sec_storage_btree_wal), [Durability](/en/ch8#durability)
- isolation, [Isolation](/en/ch8#sec_transactions_acid_isolation), [Single-Object and Multi-Object Operations](/en/ch8#sec_transactions_multi_object)
- acknowledgements (messaging), [Acknowledgments and redelivery](/en/ch12#sec_stream_reordering)
- active/active replication (see multi-leader replication)
- active/passive replication (see leader-based replication)
- ActiveMQ (messaging), [Message brokers](/en/ch5#message-brokers), [Message brokers compared to databases](/en/ch12#id297)
- distributed transaction support, [XA transactions](/en/ch8#xa-transactions)
- ActiveRecord (object-relational mapper), [Object-relational mapping (ORM)](/en/ch3#object-relational-mapping-orm), [Handling errors and aborts](/en/ch8#handling-errors-and-aborts)
- activity (workflows) (see workflow engines)
- actor model, [Distributed actor frameworks](/en/ch5#distributed-actor-frameworks)
- (see also event-driven architecture)
- comparison to stream processing, [Event-Driven Architectures and RPC](/en/ch12#sec_stream_actors_drpc)
- adaptive capacity, [Skewed Workloads and Relieving Hot Spots](/en/ch7#sec_sharding_skew)
- Advanced Message Queuing Protocol (see AMQP)
- aerospace systems, [Byzantine Faults](/en/ch9#sec_distributed_byzantine)
- Aerospike (database)
- strong consistency mode, [Single-object writes](/en/ch8#sec_transactions_single_object)
- AGE (graph database), [The Cypher Query Language](/en/ch3#id57)
- aggregation
- data cubes and materialized views, [Materialized Views and Data Cubes](/en/ch4#sec_storage_materialized_views)
- in batch processes, [Sorting Versus In-memory Aggregation](/en/ch11#id275)
- in stream processes, [Stream analytics](/en/ch12#id318)
- aggregation pipeline (MongoDB), [Normalization, Denormalization, and Joins](/en/ch3#sec_datamodels_normalization), [Query languages for documents](/en/ch3#query-languages-for-documents)
- Agile, [Evolvability: Making Change Easy](/en/ch2#sec_introduction_evolvability)
- minimizing irreversibility, [Batch Processing](/en/ch11#ch_batch), [Reprocessing data for application evolution](/en/ch13#sec_future_reprocessing)
- moving faster with confidence, [The end-to-end argument again](/en/ch13#id456)
- agreement, [Single-value consensus](/en/ch10#single-value-consensus), [Atomic commitment as consensus](/en/ch10#atomic-commitment-as-consensus)
- (see also consensus)
- AI (artificial intelligence) (see machine learning)
- AI Act (European Union), [Data Systems, Law, and Society](/en/ch1#sec_introduction_compliance)
- AirByte, [Data Warehousing](/en/ch1#sec_introduction_dwh)
- Airflow (workflow scheduler), [Durable Execution and Workflows](/en/ch5#sec_encoding_dataflow_workflows), [Batch Processing](/en/ch11#ch_batch), [Scheduling Workflows](/en/ch11#sec_batch_workflows)
- cloud data warehouse integration, [Query languages](/en/ch11#sec_batch_query_lanauges)
- use for ETL, [Extract–Transform–Load (ETL)](/en/ch11#sec_batch_etl_usage)
- Akamai
- response time study, [Average, Median, and Percentiles](/en/ch2#id24)
- algorithms
- algorithm correctness, [Defining the correctness of an algorithm](/en/ch9#defining-the-correctness-of-an-algorithm)
- B-trees, [B-Trees](/en/ch4#sec_storage_b_trees)-[B-tree variants](/en/ch4#b-tree-variants)
- for distributed systems, [System Model and Reality](/en/ch9#sec_distributed_system_model)
- mergesort, [Constructing and merging SSTables](/en/ch4#constructing-and-merging-sstables), [Shuffling Data](/en/ch11#sec_shuffle)
- scheduling, [Resource Allocation](/en/ch11#id279)
- SSTables and LSM-trees, [The SSTable file format](/en/ch4#the-sstable-file-format)-[Compaction strategies](/en/ch4#sec_storage_lsm_compaction)
- all-to-all replication topologies, [Multi-leader replication topologies](/en/ch6#sec_replication_topologies)
- AllegroGraph (database), [Graph-Like Data Models](/en/ch3#sec_datamodels_graph)
- SPARQL query language, [The SPARQL query language](/en/ch3#the-sparql-query-language)
- ALTER TABLE statement (SQL), [Schema flexibility in the document model](/en/ch3#sec_datamodels_schema_flexibility), [Encoding and Evolution](/en/ch5#ch_encoding)
- Amazon
- Dynamo (see Dynamo (database))
- response time study, [Average, Median, and Percentiles](/en/ch2#id24)
- Amazon Web Services (AWS)
- Aurora (see Aurora (cloud database))
- ClockBound (see ClockBound (time sync))
- correctness testing, [Formal Methods and Randomized Testing](/en/ch9#sec_distributed_formal)
- DynamoDB (see DynamoDB (database))
- EBS (see EBS (virtual block device))
- Kinesis (see Kinesis (messaging))
- Neptune (see Neptune (graph database))
- network reliability, [Network Faults in Practice](/en/ch9#sec_distributed_network_faults)
- S3 (see S3 (object storage))
- amplification
- of bias, [Bias and Discrimination](/en/ch14#id370)
- of failures, [Maintaining derived state](/en/ch13#id446)
- of tail latency, [Use of Response Time Metrics](/en/ch2#sec_introduction_slo_sla), [Local Secondary Indexes](/en/ch7#id166)
- write amplification, [Write amplification](/en/ch4#write-amplification)
- AMQP (Advanced Message Queuing Protocol), [Message brokers compared to databases](/en/ch12#id297)
- (see also messaging systems)
- comparison to log-based messaging, [Logs compared to traditional messaging](/en/ch12#sec_stream_logs_vs_messaging), [Replaying old messages](/en/ch12#sec_stream_replay)
- message ordering, [Acknowledgments and redelivery](/en/ch12#sec_stream_reordering)
- analytical systems, [Operational Versus Analytical Systems](/en/ch1#sec_introduction_analytics)
- as derived data systems, [Systems of Record and Derived Data](/en/ch1#sec_introduction_derived)
- ETL from operational systems, [Data Warehousing](/en/ch1#sec_introduction_dwh)
- governance, [Beyond the data lake](/en/ch1#beyond-the-data-lake)
- analytics, [Operational Versus Analytical Systems](/en/ch1#sec_introduction_analytics)-[Systems of Record and Derived Data](/en/ch1#sec_introduction_derived)
- comparison to transaction processing, [Characterizing Transaction Processing and Analytics](/en/ch1#sec_introduction_oltp)
- data normalization, [Trade-offs of normalization](/en/ch3#trade-offs-of-normalization)
- data warehousing (see data warehousing)
- predictive (see predictive analytics)
- relation to batch processing, [Analytics](/en/ch11#sec_batch_olap)
- schemas for, [Stars and Snowflakes: Schemas for Analytics](/en/ch3#sec_datamodels_analytics)
- snapshot isolation for queries, [Snapshot Isolation and Repeatable Read](/en/ch8#sec_transactions_snapshot_isolation)
- stream analytics, [Stream analytics](/en/ch12#id318)
- analytics engineering, [Operational Versus Analytical Systems](/en/ch1#sec_introduction_analytics)
- anti-entropy, [Catching up on missed writes](/en/ch6#sec_replication_read_repair)
- Antithesis (deterministic simulation testing), [Deterministic simulation testing](/en/ch9#deterministic-simulation-testing)
- Apache Accumulo (see Accumulo)
- Apache ActiveMQ (see ActiveMQ)
- Apache AGE (see AGE)
- Apache Arrow (see Arrow (data format))
- Apache Avro (see Avro)
- Apache Beam (see Beam)
- Apache BookKeeper (see BookKeeper)
- Apache Cassandra (see Cassandra)
- Apache Curator (see Curator)
- Apache DataFusion (see DataFusion (query engine))
- Apache Druid (see Druid (database))
- Apache Flink (see Flink (processing framework))
- Apache HBase (see HBase)
- Apache Iceberg (see Iceberg (table format))
- Apache Jena (see Jena)
- Apache Kafka (see Kafka)
- Apache Lucene (see Lucene)
- Apache Oozie (see Oozie (workflow scheduler))
- Apache ORC (see ORC (data format))
- Apache Parquet (see Parquet (data format))
- Apache Pig (query language), [Query languages](/en/ch11#sec_batch_query_lanauges)
- Apache Pinot (see Pinot (database))
- Apache Pulsar (see Pulsar)
- Apache Qpid (see Qpid)
- Apache Samza (see Samza)
- Apache Solr (see Solr)
- Apache Spark (see Spark (processing framework))
- Apache Storm (see Storm)
- Apache Superset (see Superset (data visualization software))
- Apache Thrift (see Thrift)
- Apache ZooKeeper (see ZooKeeper)
- Apama (stream analytics), [Complex event processing](/en/ch12#id317)
- append-only files (see logs)
- Application Programming Interfaces (APIs), [Data Models and Query Languages](/en/ch3#ch_datamodels)
- for change streams, [API support for change streams](/en/ch12#sec_stream_change_api)
- for distributed transactions, [XA transactions](/en/ch8#xa-transactions)
- for services, [Dataflow Through Services: REST and RPC](/en/ch5#sec_encoding_dataflow_rpc)-[Data encoding and evolution for RPC](/en/ch5#data-encoding-and-evolution-for-rpc)
- (see also services)
- evolvability, [Data encoding and evolution for RPC](/en/ch5#data-encoding-and-evolution-for-rpc)
- RESTful, [Web services](/en/ch5#sec_web_services)
- application state (see state)
- approximate search (see similarity search)
- archival storage, data from databases, [Archival storage](/en/ch5#archival-storage)
- arcs (see edges)
- ArcticDB (database), [DataFrames, Matrices, and Arrays](/en/ch3#sec_datamodels_dataframes)
- arithmetic mean, [Average, Median, and Percentiles](/en/ch2#id24)
- arrays
- array databases, [DataFrames, Matrices, and Arrays](/en/ch3#sec_datamodels_dataframes)
- multidimensional, [DataFrames, Matrices, and Arrays](/en/ch3#sec_datamodels_dataframes)
- Arrow (data format), [Column-Oriented Storage](/en/ch4#sec_storage_column), [DataFrames](/en/ch11#id287)
- artificial intelligence (see machine learning)
- ASCII text, [Protocol Buffers](/en/ch5#sec_encoding_protobuf)
- ASN.1 (schema language), [The Merits of Schemas](/en/ch5#sec_encoding_schemas)
- associative table, [Many-to-One and Many-to-Many Relationships](/en/ch3#sec_datamodels_many_to_many), [Property Graphs](/en/ch3#id56)
- asynchronous networks, [Unreliable Networks](/en/ch9#sec_distributed_networks), [Glossary](/en/glossary)
- comparison to synchronous networks, [Synchronous Versus Asynchronous Networks](/en/ch9#sec_distributed_sync_networks)
- system model, [System Model and Reality](/en/ch9#sec_distributed_system_model)
- asynchronous replication, [Synchronous Versus Asynchronous Replication](/en/ch6#sec_replication_sync_async), [Glossary](/en/glossary)
- data loss on failover, [Leader failure: Failover](/en/ch6#leader-failure-failover)
- reads from asynchronous follower, [Problems with Replication Lag](/en/ch6#sec_replication_lag)
- with multiple leaders, [Multi-Leader Replication](/en/ch6#sec_replication_multi_leader)
- Asynchronous Transfer Mode (ATM), [Can we not simply make network delays predictable?](/en/ch9#can-we-not-simply-make-network-delays-predictable)
- atomic broadcast, [Shared logs as consensus](/en/ch10#sec_consistency_shared_logs)
- atomic clocks, [Clock readings with a confidence interval](/en/ch9#clock-readings-with-a-confidence-interval), [Synchronized clocks for global snapshots](/en/ch9#sec_distributed_spanner)
- (see also clocks)
- atomicity (concurrency), [Glossary](/en/glossary)
- atomic increment, [Single-object writes](/en/ch8#sec_transactions_single_object)
- compare-and-set (CAS), [Conditional writes (compare-and-set)](/en/ch8#sec_transactions_compare_and_set), [What Makes a System Linearizable?](/en/ch10#sec_consistency_lin_definition)
- (see also compare-and-set (CAS))
- denormalized data, [Trade-offs of normalization](/en/ch3#trade-offs-of-normalization)
- fetch-and-add/increment, [ID Generators and Logical Clocks](/en/ch10#sec_consistency_logical), [Consensus](/en/ch10#sec_consistency_consensus), [Fetch-and-add as consensus](/en/ch10#fetch-and-add-as-consensus)
- write operations, [Atomic write operations](/en/ch8#atomic-write-operations)
- atomicity (transactions), [Atomicity](/en/ch8#sec_transactions_acid_atomicity), [Single-Object and Multi-Object Operations](/en/ch8#sec_transactions_multi_object), [Glossary](/en/glossary)
- atomic commit
- avoiding, [Multi-shard request processing](/en/ch13#id360), [Coordination-avoiding data systems](/en/ch13#id454)
- blocking and nonblocking, [Three-phase commit](/en/ch8#three-phase-commit)
- in stream processing, [Exactly-once message processing](/en/ch8#sec_transactions_exactly_once), [Exactly-once message processing revisited](/en/ch8#exactly-once-message-processing-revisited), [Atomic commit revisited](/en/ch12#sec_stream_atomic_commit)
- maintaining derived data, [Keeping Systems in Sync](/en/ch12#sec_stream_sync)
- distributed transactions, [Distributed Transactions](/en/ch8#sec_transactions_distributed)-[Exactly-once message processing revisited](/en/ch8#exactly-once-message-processing-revisited)
- for multi-object transactions, [Single-Object and Multi-Object Operations](/en/ch8#sec_transactions_multi_object)
- for single-object writes, [Single-object writes](/en/ch8#sec_transactions_single_object)
- relation to consensus, [Atomic commitment as consensus](/en/ch10#atomic-commitment-as-consensus)
- auditability, [Trust, but Verify](/en/ch13#sec_future_verification)-[Tools for auditable data systems](/en/ch13#id366)
- designing for, [Designing for auditability](/en/ch13#id365)
- self-auditing systems, [Don't just blindly trust what they promise](/en/ch13#id364)
- through immutability, [Advantages of immutable events](/en/ch12#sec_stream_immutability_pros)
- tools for auditable data systems, [Tools for auditable data systems](/en/ch13#id366)
- Aurora (cloud database), [Cloud-Native System Architecture](/en/ch1#sec_introduction_cloud_native)
- Aurora DSQL (database)
- snapshot isolation support, [Snapshot Isolation and Repeatable Read](/en/ch8#sec_transactions_snapshot_isolation)
- auto-scaling, [Operations: Automatic or Manual Rebalancing](/en/ch7#sec_sharding_operations)
- Automerge (CRDT library), [Pros and cons of sync engines](/en/ch6#pros-and-cons-of-sync-engines)
- availability, [Reliability and Fault Tolerance](/en/ch2#sec_introduction_reliability)
- (see also fault tolerance)
- in CAP theorem, [The CAP theorem](/en/ch10#the-cap-theorem)
- in leader election, [Subtleties of consensus](/en/ch10#subtleties-of-consensus)
- in service level agreements (SLAs), [Use of Response Time Metrics](/en/ch2#sec_introduction_slo_sla)
- availability zones, [Tolerating hardware faults through redundancy](/en/ch2#tolerating-hardware-faults-through-redundancy), [Reading Your Own Writes](/en/ch6#sec_replication_ryw)
- Avro (data format), [Avro](/en/ch5#sec_encoding_avro)-[Dynamically generated schemas](/en/ch5#dynamically-generated-schemas)
- dynamically generated schemas, [Dynamically generated schemas](/en/ch5#dynamically-generated-schemas)
- object container files, [But what is the writer's schema?](/en/ch5#but-what-is-the-writers-schema), [Archival storage](/en/ch5#archival-storage)
- reader determining writer's schema, [But what is the writer's schema?](/en/ch5#but-what-is-the-writers-schema)
- schema evolution, [The writer's schema and the reader's schema](/en/ch5#the-writers-schema-and-the-readers-schema)
- use in batch processing, [MapReduce](/en/ch11#sec_batch_mapreduce)
- awk (Unix tool), [Simple Log Analysis](/en/ch11#sec_batch_log_analysis), [Distributed Job Orchestration](/en/ch11#id278)
- Axon Framework, [Event Sourcing and CQRS](/en/ch3#sec_datamodels_events)
- Azkaban (workflow scheduler), [Batch Processing](/en/ch11#ch_batch)
- Azure Blob Storage (object storage), [Layering of cloud services](/en/ch1#layering-of-cloud-services), [Setting Up New Followers](/en/ch6#sec_replication_new_replica)
- conditional headers, [Fencing off zombies and delayed requests](/en/ch9#sec_distributed_fencing_tokens)
- Azure managed disks, [Separation of storage and compute](/en/ch1#sec_introduction_storage_compute)
- Azure SQL DB (database), [Cloud-Native System Architecture](/en/ch1#sec_introduction_cloud_native)
- Azure Storage, [Object Stores](/en/ch11#id277)
- Azure Synapse Analytics (database), [Cloud-Native System Architecture](/en/ch1#sec_introduction_cloud_native)
- Azure Virtual Machines
- spot virtual machines, [Handling Faults](/en/ch11#id281)
### B
- B-trees (indexes), [B-Trees](/en/ch4#sec_storage_b_trees)-[B-tree variants](/en/ch4#b-tree-variants)
- B+ trees, [B-tree variants](/en/ch4#b-tree-variants)
- branching factor, [B-Trees](/en/ch4#sec_storage_b_trees)
- comparison to LSM-trees, [Comparing B-Trees and LSM-Trees](/en/ch4#sec_storage_btree_lsm_comparison)-[Disk space usage](/en/ch4#disk-space-usage)
- crash recovery, [Making B-trees reliable](/en/ch4#sec_storage_btree_wal)
- growing by splitting a page, [B-Trees](/en/ch4#sec_storage_b_trees)
- immutable variants, [B-tree variants](/en/ch4#b-tree-variants), [Indexes and snapshot isolation](/en/ch8#indexes-and-snapshot-isolation)
- similarity to shard splitting, [Rebalancing key-range sharded data](/en/ch7#rebalancing-key-range-sharded-data)
- variants, [B-tree variants](/en/ch4#b-tree-variants)
- B2 (object storage), [Distributed Filesystems](/en/ch11#sec_batch_dfs)
- Backblaze B2 (see B2 (object storage))
- backend, [Trade-offs in Data Systems Architecture](/en/ch1#ch_tradeoffs)
- backoff, exponential, [Describing Performance](/en/ch2#sec_introduction_percentiles), [Handling errors and aborts](/en/ch8#handling-errors-and-aborts)
- backpressure, [Describing Performance](/en/ch2#sec_introduction_percentiles), [Read performance](/en/ch4#read-performance), [Messaging Systems](/en/ch12#sec_stream_messaging), [Glossary](/en/glossary)
- in batch processing, [Scheduling Workflows](/en/ch11#sec_batch_workflows)
- in TCP, [The Limitations of TCP](/en/ch9#sec_distributed_tcp)
- backups
- database snapshot for replication, [Setting Up New Followers](/en/ch6#sec_replication_new_replica)
- in multitenant systems, [Sharding for Multitenancy](/en/ch7#sec_sharding_multitenancy)
- integrity of, [Don't just blindly trust what they promise](/en/ch13#id364)
- snapshot isolation for, [Snapshot Isolation and Repeatable Read](/en/ch8#sec_transactions_snapshot_isolation)
- using object storage, [Setting Up New Followers](/en/ch6#sec_replication_new_replica)
- versus replication, [Replication](/en/ch6#ch_replication)
- backward compatibility, [Encoding and Evolution](/en/ch5#ch_encoding)
- BadgerDB (database)
- serializable transactions, [Serializable Snapshot Isolation (SSI)](/en/ch8#sec_transactions_ssi)
- BASE, contrast to ACID, [The Meaning of ACID](/en/ch8#sec_transactions_acid)
- bash shell (Unix), [Storage and Indexing for OLTP](/en/ch4#sec_storage_oltp)
- batch processing, [Batch Processing](/en/ch11#ch_batch)-[Summary](/en/ch11#id292), [Glossary](/en/glossary)
- and functional programming, [MapReduce](/en/ch11#sec_batch_mapreduce)
- benefits of, [Batch Processing](/en/ch11#ch_batch)
- combining with stream processing, [Unifying batch and stream processing](/en/ch13#id338)
- comparison to stream processing, [Processing Streams](/en/ch12#sec_stream_processing)
- dataflow engines, [Dataflow Engines](/en/ch11#sec_batch_dataflow)
- fault tolerance, [Handling Faults](/en/ch11#id281), [Messaging Systems](/en/ch12#sec_stream_messaging)
- for data integration, [Batch and Stream Processing](/en/ch13#sec_future_batch_streaming)-[Unifying batch and stream processing](/en/ch13#id338)
- graphs and iterative processing, [Machine Learning](/en/ch11#id290)
- high-level APIs and languages, [Query languages](/en/ch11#sec_batch_query_lanauges)
- in cloud data warehouses, [Query languages](/en/ch11#sec_batch_query_lanauges)
- in distributed systems, [Batch Processing in Distributed Systems](/en/ch11#sec_batch_distributed)
- join and group by, [JOIN and GROUP BY](/en/ch11#sec_batch_join)
- limitations, [Batch Processing](/en/ch11#ch_batch)
- log-based messaging and, [Replaying old messages](/en/ch12#sec_stream_replay)
- maintaining derived state, [Maintaining derived state](/en/ch13#id446)
- measuring performance, [Batch Processing](/en/ch11#ch_batch)
- models of, [Batch Processing Models](/en/ch11#id431)
- resource allocation, [Resource Allocation](/en/ch11#id279)
- resource managers, [Distributed Job Orchestration](/en/ch11#id278)
- schedulers, [Distributed Job Orchestration](/en/ch11#id278)
- serving derived data, [Serving Derived Data](/en/ch11#sec_batch_serving_derived)
- shuffling data, [Shuffling Data](/en/ch11#sec_shuffle)
- task execution, [Distributed Job Orchestration](/en/ch11#id278)
- use cases, [Batch Use Cases](/en/ch11#sec_batch_output)-[Serving Derived Data](/en/ch11#sec_batch_serving_derived)
- using Unix tools (example), [Batch Processing with Unix Tools](/en/ch11#sec_batch_unix)-[Sorting Versus In-memory Aggregation](/en/ch11#id275)
- batch processing frameworks
- comparison to operating systems, [Batch Processing in Distributed Systems](/en/ch11#sec_batch_distributed)
- Beam (dataflow library), [Unifying batch and stream processing](/en/ch13#id338)
- BERT (language model), [Vector Embeddings](/en/ch4#id92)
- bias, [Bias and Discrimination](/en/ch14#id370)
- bidirectional replication (see multi-leader replication)
- big ball of mud, [Simplicity: Managing Complexity](/en/ch2#id38)
- big data
- versus data minimization, [Data Systems, Law, and Society](/en/ch1#sec_introduction_compliance), [Legislation and Self-Regulation](/en/ch14#sec_future_legislation)
- BigQuery (database), [Cloud-Native System Architecture](/en/ch1#sec_introduction_cloud_native), [Cloud Data Warehouses](/en/ch4#sec_cloud_data_warehouses), [Batch Processing](/en/ch11#ch_batch)
- DataFrames, [Query languages](/en/ch11#sec_batch_query_lanauges)
- sharding and clustering, [Sharding by hash range](/en/ch7#sharding-by-hash-range)
- shuffling data, [Shuffling Data](/en/ch11#sec_shuffle)
- snapshot isolation support, [Snapshot Isolation and Repeatable Read](/en/ch8#sec_transactions_snapshot_isolation)
- Bigtable (database)
- sharding scheme, [Sharding by Key Range](/en/ch7#sec_sharding_key_range)
- storage layout, [Constructing and merging SSTables](/en/ch4#constructing-and-merging-sstables)
- tablets (sharding), [Sharding](/en/ch7#ch_sharding)
- wide-column data model, [Data locality for reads and writes](/en/ch3#sec_datamodels_document_locality), [Column Compression](/en/ch4#sec_storage_column_compression)
- binary data encodings, [Binary encoding](/en/ch5#binary-encoding)-[The Merits of Schemas](/en/ch5#sec_encoding_schemas)
- Avro, [Avro](/en/ch5#sec_encoding_avro)-[Dynamically generated schemas](/en/ch5#dynamically-generated-schemas)
- MessagePack, [Binary encoding](/en/ch5#binary-encoding)
- Protocol Buffers, [Protocol Buffers](/en/ch5#sec_encoding_protobuf)-[Field tags and schema evolution](/en/ch5#field-tags-and-schema-evolution)
- binary encoding
- based on schemas, [The Merits of Schemas](/en/ch5#sec_encoding_schemas)
- by network drivers, [The Merits of Schemas](/en/ch5#sec_encoding_schemas)
- binary strings, lack of support in JSON and XML, [JSON, XML, and Binary Variants](/en/ch5#sec_encoding_json)
- Bitcoin (cryptocurrency), [Tools for auditable data systems](/en/ch13#id366)
- Byzantine fault tolerance, [Byzantine Faults](/en/ch9#sec_distributed_byzantine)
- concurrency bugs in exchanges, [Weak Isolation Levels](/en/ch8#sec_transactions_isolation_levels)
- bitmap indexes, [Column Compression](/en/ch4#sec_storage_column_compression)
- BitTorrent uTP protocol, [The Limitations of TCP](/en/ch9#sec_distributed_tcp)
- Bkd-trees (indexes), [Multidimensional and Full-Text Indexes](/en/ch4#sec_storage_multidimensional)
- blameless postmortems, [Humans and Reliability](/en/ch2#id31)
- Blazegraph (database), [Graph-Like Data Models](/en/ch3#sec_datamodels_graph)
- SPARQL query language, [The SPARQL query language](/en/ch3#the-sparql-query-language)
- blob storage (see object storage)
- block (file system), [Distributed Filesystems](/en/ch11#sec_batch_dfs)
- block device (disk), [Separation of storage and compute](/en/ch1#sec_introduction_storage_compute)
- blockchains, [Summary](/en/ch3#summary)
- Byzantine fault tolerance, [Byzantine Faults](/en/ch9#sec_distributed_byzantine), [Consensus](/en/ch10#sec_consistency_consensus), [Tools for auditable data systems](/en/ch13#id366)
- blocking atomic commit, [Three-phase commit](/en/ch8#three-phase-commit)
- Bloom filter (algorithm), [Bloom filters](/en/ch4#bloom-filters), [Read performance](/en/ch4#read-performance), [Stream analytics](/en/ch12#id318)
- BookKeeper (replicated log), [Allocating work to nodes](/en/ch10#allocating-work-to-nodes)
- bounded datasets, [Stream Processing](/en/ch12#ch_stream), [Glossary](/en/glossary)
- (see also batch processing)
- bounded delays, [Glossary](/en/glossary)
- in networks, [Synchronous Versus Asynchronous Networks](/en/ch9#sec_distributed_sync_networks)
- process pauses, [Response time guarantees](/en/ch9#sec_distributed_clocks_realtime)
- broadcast
- total order broadcast (see shared logs)
- brokerless messaging, [Direct messaging from producers to consumers](/en/ch12#id296)
- Brubeck (metrics aggregator), [Direct messaging from producers to consumers](/en/ch12#id296)
- BTM (transaction coordinator), [Two-Phase Commit (2PC)](/en/ch8#sec_transactions_2pc)
- Buf
- Bufstream (messaging), [Setting Up New Followers](/en/ch6#sec_replication_new_replica), [Disk space usage](/en/ch12#sec_stream_disk_usage)
- build or buy, [Cloud Versus Self-Hosting](/en/ch1#sec_introduction_cloud)
- bursty network traffic patterns, [Can we not simply make network delays predictable?](/en/ch9#can-we-not-simply-make-network-delays-predictable)
- business analyst, [Operational Versus Analytical Systems](/en/ch1#sec_introduction_analytics), [From data warehouse to data lake](/en/ch1#from-data-warehouse-to-data-lake)
- business data processing, [Characterizing Transaction Processing and Analytics](/en/ch1#sec_introduction_oltp)
- business intelligence, [Operational Versus Analytical Systems](/en/ch1#sec_introduction_analytics)-[Data Warehousing](/en/ch1#sec_introduction_dwh)
- Business Process Execution Language (BPEL), [Durable Execution and Workflows](/en/ch5#sec_encoding_dataflow_workflows)
- Business Process Model and Notation (BPMN), [Durable Execution and Workflows](/en/ch5#sec_encoding_dataflow_workflows)
- example, [Durable Execution and Workflows](/en/ch5#sec_encoding_dataflow_workflows)
- byte sequence, encoding data in, [Formats for Encoding Data](/en/ch5#sec_encoding_formats)
- Byzantine faults, [Byzantine Faults](/en/ch9#sec_distributed_byzantine)-[Weak forms of lying](/en/ch9#weak-forms-of-lying), [System Model and Reality](/en/ch9#sec_distributed_system_model), [Glossary](/en/glossary)
- Byzantine fault-tolerant systems, [Byzantine Faults](/en/ch9#sec_distributed_byzantine)
- Byzantine Generals Problem, [Byzantine Faults](/en/ch9#sec_distributed_byzantine)
- consensus algorithms and, [Consensus](/en/ch10#sec_consistency_consensus), [Tools for auditable data systems](/en/ch13#id366)
### C
- caches, [Keeping everything in memory](/en/ch4#sec_storage_inmemory), [Glossary](/en/glossary)
- and materialized views, [Materialized Views and Data Cubes](/en/ch4#sec_storage_materialized_views)
- as derived data, [Systems of Record and Derived Data](/en/ch1#sec_introduction_derived), [Composing Data Storage Technologies](/en/ch13#id447)-[Unbundled versus integrated systems](/en/ch13#id448)
- in CPUs, [Query Execution: Compilation and Vectorization](/en/ch4#sec_storage_vectorized), [Linearizability and network delays](/en/ch10#linearizability-and-network-delays)
- invalidation and maintenance, [Keeping Systems in Sync](/en/ch12#sec_stream_sync), [Maintaining materialized views](/en/ch12#sec_stream_mat_view)
- linearizability, [Linearizability](/en/ch10#sec_consistency_linearizability)
- local disks in the cloud, [Separation of storage and compute](/en/ch1#sec_introduction_storage_compute)
- calendar sync, [Sync Engines and Local-First Software](/en/ch6#sec_replication_offline_clients), [Pros and cons of sync engines](/en/ch6#pros-and-cons-of-sync-engines)
- California Consumer Privacy Act (CCPA), [Data Systems, Law, and Society](/en/ch1#sec_introduction_compliance)
- Camunda (workflow engine), [Durable Execution and Workflows](/en/ch5#sec_encoding_dataflow_workflows)
- canonical version (of data), [Systems of Record and Derived Data](/en/ch1#sec_introduction_derived)
- CAP theorem, [The CAP theorem](/en/ch10#the-cap-theorem), [Glossary](/en/glossary)
- capacity planning, [Operations in the Cloud Era](/en/ch1#sec_introduction_operations)
- Cap'n Proto (data format), [Formats for Encoding Data](/en/ch5#sec_encoding_formats)
- carbon emissions, [Distributed Versus Single-Node Systems](/en/ch1#sec_introduction_distributed)
- cascading aborts, [No dirty reads](/en/ch8#no-dirty-reads)
- cascading failures, [Software faults](/en/ch2#software-faults), [Operations: Automatic or Manual Rebalancing](/en/ch7#sec_sharding_operations), [Timeouts and Unbounded Delays](/en/ch9#sec_distributed_queueing)
- Cassandra (database)
- change data capture, [Implementing change data capture](/en/ch12#id307), [API support for change streams](/en/ch12#sec_stream_change_api)
- compaction strategy, [Compaction strategies](/en/ch4#sec_storage_lsm_compaction)
- consistency level ANY, [Single-Leader Versus Leaderless Replication Performance](/en/ch6#sec_replication_leaderless_perf)
- hash-range sharding, [Sharding by Hash of Key](/en/ch7#sec_sharding_hash), [Sharding by hash range](/en/ch7#sharding-by-hash-range)
- last-write-wins conflict resolution, [Detecting Concurrent Writes](/en/ch6#sec_replication_concurrent)
- leaderless replication, [Leaderless Replication](/en/ch6#sec_replication_leaderless)
- lightweight transactions, [Single-object writes](/en/ch8#sec_transactions_single_object)
- linearizability, lack of, [Implementing Linearizable Systems](/en/ch10#sec_consistency_implementing_linearizable)
- log-structured storage, [Constructing and merging SSTables](/en/ch4#constructing-and-merging-sstables)
- multi-region support, [Multi-region operation](/en/ch6#multi-region-operation)
- secondary indexes, [Local Secondary Indexes](/en/ch7#id166)
- use of clocks, [Limitations of Quorum Consistency](/en/ch6#sec_replication_quorum_limitations), [Timestamps for ordering events](/en/ch9#sec_distributed_lww)
- vnodes (sharding), [Sharding](/en/ch7#ch_sharding)
- cat (Unix tool), [Simple Log Analysis](/en/ch11#sec_batch_log_analysis)
- catalog, [Cloud Data Warehouses](/en/ch4#sec_cloud_data_warehouses)
- causal context, [Version vectors](/en/ch6#version-vectors)
- (see also causal dependencies)
- causal dependencies, [The "happens-before" relation and concurrency](/en/ch6#sec_replication_happens_before)-[Version vectors](/en/ch6#version-vectors)
- capturing, [Version vectors](/en/ch6#version-vectors), [Ordering events to capture causality](/en/ch13#sec_future_capture_causality), [Reads are events too](/en/ch13#sec_future_read_events)
- by total ordering, [The limits of total ordering](/en/ch13#id335)
- in transactions, [Decisions based on an outdated premise](/en/ch8#decisions-based-on-an-outdated-premise)
- sending message to friends (example), [Ordering events to capture causality](/en/ch13#sec_future_capture_causality)
- causality, [Glossary](/en/glossary)
- causal ordering
- total order consistent with, [Logical Clocks](/en/ch10#sec_consistency_timestamps)
- consistency with, [Logical Clocks](/en/ch10#sec_consistency_timestamps)-[Enforcing constraints using logical clocks](/en/ch10#enforcing-constraints-using-logical-clocks)
- happens-before relation, [The "happens-before" relation and concurrency](/en/ch6#sec_replication_happens_before)
- in serializable transactions, [Decisions based on an outdated premise](/en/ch8#decisions-based-on-an-outdated-premise)-[Detecting writes that affect prior reads](/en/ch8#sec_detecting_writes_affect_reads)
- mismatch with clocks, [Timestamps for ordering events](/en/ch9#sec_distributed_lww)
- ordering events to capture, [Ordering events to capture causality](/en/ch13#sec_future_capture_causality)
- violations of, [Consistent Prefix Reads](/en/ch6#sec_replication_consistent_prefix), [Problems with different topologies](/en/ch6#problems-with-different-topologies), [Timestamps for ordering events](/en/ch9#sec_distributed_lww)
- with synchronized clocks, [Synchronized clocks for global snapshots](/en/ch9#sec_distributed_spanner)
- cell-based architecture, [Sharding for Multitenancy](/en/ch7#sec_sharding_multitenancy)
- CEP (see complex event processing)
- CephFS (distributed filesystem), [Batch Processing](/en/ch11#ch_batch), [Object Stores](/en/ch11#id277)
- certificate transparency, [Tools for auditable data systems](/en/ch13#id366)
- cgroups, [Distributed Job Orchestration](/en/ch11#id278)
- change data capture, [Logical (row-based) log replication](/en/ch6#logical-row-based-log-replication), [Change Data Capture](/en/ch12#sec_stream_cdc)
- API support for change streams, [API support for change streams](/en/ch12#sec_stream_change_api)
- comparison to event sourcing, [Change data capture versus event sourcing](/en/ch12#sec_stream_event_sourcing)
- implementing, [Implementing change data capture](/en/ch12#id307)
- initial snapshot, [Initial snapshot](/en/ch12#sec_stream_cdc_snapshot)
- log compaction, [Log compaction](/en/ch12#sec_stream_log_compaction)
- changelogs, [State, Streams, and Immutability](/en/ch12#sec_stream_immutability)
- change data capture, [Change Data Capture](/en/ch12#sec_stream_cdc)
- for operator state, [Rebuilding state after a failure](/en/ch12#sec_stream_state_fault_tolerance)
- in stream joins, [Stream-table join (stream enrichment)](/en/ch12#sec_stream_table_joins)
- log compaction, [Log compaction](/en/ch12#sec_stream_log_compaction)
- maintaining derived state, [Databases and Streams](/en/ch12#sec_stream_databases)
- chaos engineering, [Fault Tolerance](/en/ch2#id27), [Fault injection](/en/ch9#sec_fault_injection)
- checkpointing
- in high-performance computing, [Cloud Computing Versus Supercomputing](/en/ch1#id17)
- in stream processors, [Microbatching and checkpointing](/en/ch12#id329)
- circuit breaker (limiting retries), [Describing Performance](/en/ch2#sec_introduction_percentiles)
- circuit-switched networks, [Synchronous Versus Asynchronous Networks](/en/ch9#sec_distributed_sync_networks)
- circular buffers, [Disk space usage](/en/ch12#sec_stream_disk_usage)
- circular replication topologies, [Multi-leader replication topologies](/en/ch6#sec_replication_topologies)
- Citus (database)
- hash sharding, [Fixed number of shards](/en/ch7#fixed-number-of-shards)
- ClickHouse (database), [Characterizing Transaction Processing and Analytics](/en/ch1#sec_introduction_oltp), [Cloud-Native System Architecture](/en/ch1#sec_introduction_cloud_native)
- incremental view maintenance, [Maintaining materialized views](/en/ch12#sec_stream_mat_view)
- clickstream data, analysis of, [JOIN and GROUP BY](/en/ch11#sec_batch_join)
- clients
- calling services, [Dataflow Through Services: REST and RPC](/en/ch5#sec_encoding_dataflow_rpc)
- offline-capable, [Sync Engines and Local-First Software](/en/ch6#sec_replication_offline_clients), [Stateful, offline-capable clients](/en/ch13#id347)
- pushing state changes to, [Pushing state changes to clients](/en/ch13#id348)
- request routing, [Request Routing](/en/ch7#sec_sharding_routing)
- ClockBound (time sync), [Clock readings with a confidence interval](/en/ch9#clock-readings-with-a-confidence-interval)
- use in YugabyteDB, [Synchronized clocks for global snapshots](/en/ch9#sec_distributed_spanner)
- clocks, [Unreliable Clocks](/en/ch9#sec_distributed_clocks)-[Limiting the impact of garbage collection](/en/ch9#sec_distributed_gc_impact)
- atomic clocks, [Clock readings with a confidence interval](/en/ch9#clock-readings-with-a-confidence-interval), [Synchronized clocks for global snapshots](/en/ch9#sec_distributed_spanner)
- confidence interval, [Clock readings with a confidence interval](/en/ch9#clock-readings-with-a-confidence-interval)-[Synchronized clocks for global snapshots](/en/ch9#sec_distributed_spanner)
- for global snapshots, [Synchronized clocks for global snapshots](/en/ch9#sec_distributed_spanner)
- hybrid logical clocks, [Hybrid logical clocks](/en/ch10#hybrid-logical-clocks)
- logical (see logical clocks)
- skew, [Last write wins (discarding concurrent writes)](/en/ch6#sec_replication_lww), [Limitations of Quorum Consistency](/en/ch6#sec_replication_quorum_limitations), [Relying on Synchronized Clocks](/en/ch9#sec_distributed_clocks_relying)-[Clock readings with a confidence interval](/en/ch9#clock-readings-with-a-confidence-interval), [Implementing Linearizable Systems](/en/ch10#sec_consistency_implementing_linearizable)
- slewing, [Monotonic clocks](/en/ch9#monotonic-clocks)
- synchronization and accuracy, [Clock Synchronization and Accuracy](/en/ch9#sec_distributed_clock_accuracy)
- synchronization using GPS, [Unreliable Clocks](/en/ch9#sec_distributed_clocks), [Clock Synchronization and Accuracy](/en/ch9#sec_distributed_clock_accuracy), [Clock readings with a confidence interval](/en/ch9#clock-readings-with-a-confidence-interval), [Synchronized clocks for global snapshots](/en/ch9#sec_distributed_spanner)
- time-of-day versus monotonic clocks, [Monotonic Versus Time-of-Day Clocks](/en/ch9#sec_distributed_monotonic_timeofday)
- timestamping events, [Whose clock are you using, anyway?](/en/ch12#id438)
- cloud services, [Cloud Versus Self-Hosting](/en/ch1#sec_introduction_cloud)-[Cloud Computing Versus Supercomputing](/en/ch1#id17)
- availability zones, [Tolerating hardware faults through redundancy](/en/ch2#tolerating-hardware-faults-through-redundancy), [Reading Your Own Writes](/en/ch6#sec_replication_ryw)
- data warehouses, [Cloud Data Warehouses](/en/ch4#sec_cloud_data_warehouses)
- need for service discovery, [Service discovery](/en/ch10#service-discovery)
- network glitches, [Network Faults in Practice](/en/ch9#sec_distributed_network_faults)
- pros and cons, [Pros and Cons of Cloud Services](/en/ch1#sec_introduction_cloud_tradeoffs)
- quotas, [Operations in the Cloud Era](/en/ch1#sec_introduction_operations)
- regions (see regions (geographic distribution))
- serverless, [Microservices and Serverless](/en/ch1#sec_introduction_microservices)
- shared resources, [Network congestion and queueing](/en/ch9#network-congestion-and-queueing)
- versus supercomputing, [Cloud Computing Versus Supercomputing](/en/ch1#id17)
- cloud-native, [Cloud-Native System Architecture](/en/ch1#sec_introduction_cloud_native)-[Operations in the Cloud Era](/en/ch1#sec_introduction_operations)
- Cloudflare
- R2 (see R2 (object storage))
- clustered indexes, [Storing values within the index](/en/ch4#sec_storage_index_heap)
- clustering (record ordering), [Sharding by hash range](/en/ch7#sharding-by-hash-range)
- CockroachDB (database)
- consensus-based replication, [Single-Leader Replication](/en/ch6#sec_replication_leader)
- consistency model, [What Makes a System Linearizable?](/en/ch10#sec_consistency_lin_definition)
- key-range sharding, [Sharding](/en/ch7#ch_sharding), [Sharding by Key Range](/en/ch7#sec_sharding_key_range)
- serializable transactions, [Serializable Snapshot Isolation (SSI)](/en/ch8#sec_transactions_ssi)
- sharded secondary indexes, [Global Secondary Indexes](/en/ch7#id167)
- transactions, [What Exactly Is a Transaction?](/en/ch8#sec_transactions_overview), [Database-internal Distributed Transactions](/en/ch8#sec_transactions_internal)
- use of model-checking, [Model checking and specification languages](/en/ch9#model-checking-and-specification-languages)
- code generation
- for query execution, [Query Execution: Compilation and Vectorization](/en/ch4#sec_storage_vectorized)
- with Protocol Buffers, [Protocol Buffers](/en/ch5#sec_encoding_protobuf)
- collaborative editing, [Real-time collaboration, offline-first, and local-first apps](/en/ch6#real-time-collaboration-offline-first-and-local-first-apps)
- column families (Bigtable), [Data locality for reads and writes](/en/ch3#sec_datamodels_document_locality), [Column Compression](/en/ch4#sec_storage_column_compression)
- column-oriented storage, [Column-Oriented Storage](/en/ch4#sec_storage_column)-[Query Execution: Compilation and Vectorization](/en/ch4#sec_storage_vectorized)
- column compression, [Column Compression](/en/ch4#sec_storage_column_compression)
- Parquet, [Column-Oriented Storage](/en/ch4#sec_storage_column), [Archival storage](/en/ch5#archival-storage)
- sort order in, [Sort Order in Column Storage](/en/ch4#sort-order-in-column-storage)
- vectorized processing, [Query Execution: Compilation and Vectorization](/en/ch4#sec_storage_vectorized)
- versus wide-column model, [Column Compression](/en/ch4#sec_storage_column_compression)
- writing to, [Writing to Column-Oriented Storage](/en/ch4#writing-to-column-oriented-storage)
- comma-separated values (see CSV)
- command query responsibility segregation (CQRS), [Event Sourcing and CQRS](/en/ch3#sec_datamodels_events), [Deriving several views from the same event log](/en/ch12#sec_stream_deriving_views)
- commands (event sourcing), [Event Sourcing and CQRS](/en/ch3#sec_datamodels_events)
- commits (transactions), [Transactions](/en/ch8#ch_transactions)
- atomic commit, [Distributed Transactions](/en/ch8#sec_transactions_distributed)-[Exactly-once message processing revisited](/en/ch8#exactly-once-message-processing-revisited)
- (see also atomicity; transactions)
- read committed isolation, [Read Committed](/en/ch8#sec_transactions_read_committed)
- three-phase commit (3PC), [Three-phase commit](/en/ch8#three-phase-commit)
- two-phase commit (2PC), [Two-Phase Commit (2PC)](/en/ch8#sec_transactions_2pc)-[Coordinator failure](/en/ch8#coordinator-failure)
- commutative operations, [Conflict resolution and replication](/en/ch8#conflict-resolution-and-replication)
- compaction
- of changelogs, [Log compaction](/en/ch12#sec_stream_log_compaction)
- (see also log compaction)
- for stream operator state, [Rebuilding state after a failure](/en/ch12#sec_stream_state_fault_tolerance)
- of log-structured storage, [Constructing and merging SSTables](/en/ch4#constructing-and-merging-sstables)
- issues with, [Read performance](/en/ch4#read-performance)
- size-tiered and leveled approaches, [Compaction strategies](/en/ch4#sec_storage_lsm_compaction), [Disk space usage](/en/ch4#disk-space-usage)
- compare-and-set (CAS), [Conditional writes (compare-and-set)](/en/ch8#sec_transactions_compare_and_set), [What Makes a System Linearizable?](/en/ch10#sec_consistency_lin_definition)
- implementing locks, [Coordination Services](/en/ch10#sec_consistency_coordination)
- implementing uniqueness constraints, [Constraints and uniqueness guarantees](/en/ch10#sec_consistency_uniqueness)
- on object storage, [Setting Up New Followers](/en/ch6#sec_replication_new_replica)
- relation to consensus, [Linearizability and quorums](/en/ch10#sec_consistency_quorum_linearizable), [Consensus](/en/ch10#sec_consistency_consensus), [Compare-and-set as consensus](/en/ch10#compare-and-set-as-consensus)
- relation to fencing tokens, [Fencing off zombies and delayed requests](/en/ch9#sec_distributed_fencing_tokens)
- relation to transactions, [Single-object writes](/en/ch8#sec_transactions_single_object)
- compatibility, [Encoding and Evolution](/en/ch5#ch_encoding), [Modes of Dataflow](/en/ch5#sec_encoding_dataflow)
- calling services, [Data encoding and evolution for RPC](/en/ch5#data-encoding-and-evolution-for-rpc)
- properties of encoding formats, [Summary](/en/ch5#summary)
- using databases, [Dataflow Through Databases](/en/ch5#sec_encoding_dataflow_db)-[Archival storage](/en/ch5#archival-storage)
- compensating transactions, [Advantages of immutable events](/en/ch12#sec_stream_immutability_pros), [Loosely interpreted constraints](/en/ch13#id362)
- compilation, [Query Execution: Compilation and Vectorization](/en/ch4#sec_storage_vectorized)
- complex event processing (CEP), [Complex event processing](/en/ch12#id317)
- complexity
- distilling in theoretical models, [Mapping system models to the real world](/en/ch9#mapping-system-models-to-the-real-world)
- essential and accidental, [Simplicity: Managing Complexity](/en/ch2#id38)
- hiding using abstraction, [Data Models and Query Languages](/en/ch3#ch_datamodels)
- managing, [Simplicity: Managing Complexity](/en/ch2#id38)
- composing data systems (see unbundling databases)
- compression
- in SSTables, [The SSTable file format](/en/ch4#the-sstable-file-format)
- compute-intensive applications, [Trade-offs in Data Systems Architecture](/en/ch1#ch_tradeoffs)
- computer games, [Pros and cons of sync engines](/en/ch6#pros-and-cons-of-sync-engines)
- concatenated indexes, [Multidimensional and Full-Text Indexes](/en/ch4#sec_storage_multidimensional)
- in hash-sharded systems, [Sharding by hash range](/en/ch7#sharding-by-hash-range)
- concurrency
- actor programming model, [Distributed actor frameworks](/en/ch5#distributed-actor-frameworks), [Event-Driven Architectures and RPC](/en/ch12#sec_stream_actors_drpc)
- (see also event-driven architecture)
- bugs from weak transaction isolation, [Weak Isolation Levels](/en/ch8#sec_transactions_isolation_levels)
- conflict resolution, [Dealing with Conflicting Writes](/en/ch6#sec_replication_write_conflicts)-[Types of conflict](/en/ch6#sec_replication_write_conflicts)
- definition, [Dealing with Conflicting Writes](/en/ch6#sec_replication_write_conflicts)
- detecting concurrent writes, [Detecting Concurrent Writes](/en/ch6#sec_replication_concurrent)-[Version vectors](/en/ch6#version-vectors)
- dual writes, problems with, [Keeping Systems in Sync](/en/ch12#sec_stream_sync)
- happens-before relation, [The "happens-before" relation and concurrency](/en/ch6#sec_replication_happens_before)
- in replicated systems, [Problems with Replication Lag](/en/ch6#sec_replication_lag)-[Version vectors](/en/ch6#version-vectors), [Linearizability](/en/ch10#sec_consistency_linearizability)-[Linearizability and network delays](/en/ch10#linearizability-and-network-delays)
- lost updates, [Preventing Lost Updates](/en/ch8#sec_transactions_lost_update)
- multi-version concurrency control (MVCC), [Multi-version concurrency control (MVCC)](/en/ch8#sec_transactions_snapshot_impl), [Synchronized clocks for global snapshots](/en/ch9#sec_distributed_spanner)
- optimistic concurrency control, [Pessimistic versus optimistic concurrency control](/en/ch8#pessimistic-versus-optimistic-concurrency-control)
- ordering of operations, [What Makes a System Linearizable?](/en/ch10#sec_consistency_lin_definition)
- reducing, through event logs, [Concurrency control](/en/ch12#sec_stream_concurrency), [Dataflow: Interplay between state changes and application code](/en/ch13#id450)
- time and relativity, [The "happens-before" relation and concurrency](/en/ch6#sec_replication_happens_before)
- transaction isolation, [Isolation](/en/ch8#sec_transactions_acid_isolation)
- write skew (transaction isolation), [Write Skew and Phantoms](/en/ch8#sec_transactions_write_skew)-[Materializing conflicts](/en/ch8#materializing-conflicts)
- conditional write, [Conditional writes (compare-and-set)](/en/ch8#sec_transactions_compare_and_set)
- in transactions, [Single-object writes](/en/ch8#sec_transactions_single_object)
- on object storage, [Setting Up New Followers](/en/ch6#sec_replication_new_replica)
- conference management system (example), [Event Sourcing and CQRS](/en/ch3#sec_datamodels_events)
- conflict-free replicated datatypes (CRDTs), [CRDTs and Operational Transformation](/en/ch6#sec_replication_crdts)
- for leaderless replication, [Capturing the happens-before relationship](/en/ch6#capturing-the-happens-before-relationship)
- preventing lost updates, [Conflict resolution and replication](/en/ch8#conflict-resolution-and-replication)
- conflicts
- avoidance, [Conflict avoidance](/en/ch6#conflict-avoidance)
- causal dependencies, [The "happens-before" relation and concurrency](/en/ch6#sec_replication_happens_before)
- conflict detection
- in distributed transactions, [Problems with XA transactions](/en/ch8#problems-with-xa-transactions)
- in log-based systems, [Uniqueness constraints require consensus](/en/ch13#id452)
- in serializable snapshot isolation (SSI), [Detecting writes that affect prior reads](/en/ch8#sec_detecting_writes_affect_reads)
- in two-phase commit, [A system of promises](/en/ch8#a-system-of-promises)
- conflict resolution
- by aborting transactions, [Pessimistic versus optimistic concurrency control](/en/ch8#pessimistic-versus-optimistic-concurrency-control)
- by apologizing, [Loosely interpreted constraints](/en/ch13#id362)
- last write wins (LWW), [Timestamps for ordering events](/en/ch9#sec_distributed_lww)
- using atomic operations, [Conflict resolution and replication](/en/ch8#conflict-resolution-and-replication)
- determining what is a conflict, [Types of conflict](/en/ch6#sec_replication_write_conflicts), [Uniqueness in log-based messaging](/en/ch13#sec_future_uniqueness_log)
- in leaderless replication, [Detecting Concurrent Writes](/en/ch6#sec_replication_concurrent)
- lost updates, [Preventing Lost Updates](/en/ch8#sec_transactions_lost_update)-[Conflict resolution and replication](/en/ch8#conflict-resolution-and-replication)
- materializing, [Materializing conflicts](/en/ch8#materializing-conflicts)
- resolution, [Dealing with Conflicting Writes](/en/ch6#sec_replication_write_conflicts)-[Types of conflict](/en/ch6#sec_replication_write_conflicts)
- automatic, [Automatic conflict resolution](/en/ch6#automatic-conflict-resolution)
- in leaderless systems, [Detecting Concurrent Writes](/en/ch6#sec_replication_concurrent)
- last write wins (LWW), [Last write wins (discarding concurrent writes)](/en/ch6#sec_replication_lww)
- using custom logic, [Manual conflict resolution](/en/ch6#manual-conflict-resolution), [Capturing the happens-before relationship](/en/ch6#capturing-the-happens-before-relationship)
- siblings, [Manual conflict resolution](/en/ch6#manual-conflict-resolution), [Capturing the happens-before relationship](/en/ch6#capturing-the-happens-before-relationship)
- merging, [Capturing the happens-before relationship](/en/ch6#capturing-the-happens-before-relationship)
- write skew (transaction isolation), [Write Skew and Phantoms](/en/ch8#sec_transactions_write_skew)-[Materializing conflicts](/en/ch8#materializing-conflicts)
- Confluent
- Freight (messaging), [Setting Up New Followers](/en/ch6#sec_replication_new_replica), [Disk space usage](/en/ch12#sec_stream_disk_usage)
- schema registry, [JSON Schema](/en/ch5#json-schema), [But what is the writer's schema?](/en/ch5#but-what-is-the-writers-schema)
- congestion (networks)
- avoidance, [The Limitations of TCP](/en/ch9#sec_distributed_tcp)
- limiting accuracy of clocks, [Clock readings with a confidence interval](/en/ch9#clock-readings-with-a-confidence-interval)
- queueing delays, [Network congestion and queueing](/en/ch9#network-congestion-and-queueing)
- consensus, [Consensus](/en/ch10#sec_consistency_consensus)-[Summary](/en/ch10#summary), [Glossary](/en/glossary)
- algorithms, [Consensus](/en/ch10#sec_consistency_consensus), [Consensus in Practice](/en/ch10#sec_consistency_total_order)
- consensus numbers, [Fetch-and-add as consensus](/en/ch10#fetch-and-add-as-consensus)
- coordination services, [Coordination Services](/en/ch10#sec_consistency_coordination)-[Service discovery](/en/ch10#service-discovery)
- cost of, [Pros and cons of consensus](/en/ch10#pros-and-cons-of-consensus)
- impossibility of, [Consensus](/en/ch10#sec_consistency_consensus)
- preventing split brain, [From single-leader replication to consensus](/en/ch10#from-single-leader-replication-to-consensus)
- reconfiguration, [Subtleties of consensus](/en/ch10#subtleties-of-consensus)
- relation to atomic commitment, [Atomic commitment as consensus](/en/ch10#atomic-commitment-as-consensus)
- relation to compare-and-set (CAS), [Linearizability and quorums](/en/ch10#sec_consistency_quorum_linearizable), [Compare-and-set as consensus](/en/ch10#compare-and-set-as-consensus)
- relation to fetch-and-add, [Fetch-and-add as consensus](/en/ch10#fetch-and-add-as-consensus)
- relation to replication, [Using shared logs](/en/ch10#sec_consistency_smr)
- relation to shared logs, [Shared logs as consensus](/en/ch10#sec_consistency_shared_logs)
- relation to uniqueness constraints, [Uniqueness constraints require consensus](/en/ch13#id452)
- safety and liveness properties, [Single-value consensus](/en/ch10#single-value-consensus)
- single-value consensus, [Single-value consensus](/en/ch10#single-value-consensus)
- consent (GDPR), [Consent and Freedom of Choice](/en/ch14#id375)
- consistency, [Consistency](/en/ch8#sec_transactions_acid_consistency), [Timeliness and Integrity](/en/ch13#sec_future_integrity)
- across different databases, [Leader failure: Failover](/en/ch6#leader-failure-failover), [Keeping Systems in Sync](/en/ch12#sec_stream_sync), [Deriving several views from the same event log](/en/ch12#sec_stream_deriving_views), [Derived data versus distributed transactions](/en/ch13#sec_future_derived_vs_transactions)
- causal, [Consistent Prefix Reads](/en/ch6#sec_replication_consistent_prefix), [Problems with different topologies](/en/ch6#problems-with-different-topologies), [Ordering events to capture causality](/en/ch13#sec_future_capture_causality)
- consistent prefix reads, [Consistent Prefix Reads](/en/ch6#sec_replication_consistent_prefix)
- consistent snapshots, [Setting Up New Followers](/en/ch6#sec_replication_new_replica), [Snapshot Isolation and Repeatable Read](/en/ch8#sec_transactions_snapshot_isolation)-[Snapshot isolation, repeatable read, and naming confusion](/en/ch8#snapshot-isolation-repeatable-read-and-naming-confusion), [Synchronized clocks for global snapshots](/en/ch9#sec_distributed_spanner), [Initial snapshot](/en/ch12#sec_stream_cdc_snapshot), [Creating an index](/en/ch13#id340)
- (see also snapshots)
- crash recovery, [Making B-trees reliable](/en/ch4#sec_storage_btree_wal)
- enforcing constraints (see constraints)
- eventual, [Problems with Replication Lag](/en/ch6#sec_replication_lag)
- (see also eventual consistency)
- in ACID transactions, [Consistency](/en/ch8#sec_transactions_acid_consistency), [Maintaining integrity in the face of software bugs](/en/ch13#id455)
- in CAP theorem, [The CAP theorem](/en/ch10#the-cap-theorem)
- in leader election, [Subtleties of consensus](/en/ch10#subtleties-of-consensus)
- in microservices, [Problems with Distributed Systems](/en/ch1#sec_introduction_dist_sys_problems)
- linearizability, [Solutions for Replication Lag](/en/ch6#id131), [Linearizability](/en/ch10#sec_consistency_linearizability)-[Linearizability and network delays](/en/ch10#linearizability-and-network-delays)
- meanings of, [Consistency](/en/ch8#sec_transactions_acid_consistency)
- monotonic reads, [Monotonic Reads](/en/ch6#sec_replication_monotonic_reads)
- of secondary indexes, [The need for multi-object transactions](/en/ch8#sec_transactions_need), [Indexes and snapshot isolation](/en/ch8#indexes-and-snapshot-isolation), [Reasoning about dataflows](/en/ch13#id443), [Creating an index](/en/ch13#id340)
- read-after-write, [Reading Your Own Writes](/en/ch6#sec_replication_ryw)
- in derived data systems, [Derived data versus distributed transactions](/en/ch13#sec_future_derived_vs_transactions)
- strong (see linearizability)
- timeliness and integrity, [Timeliness and Integrity](/en/ch13#sec_future_integrity)
- using quorums, [Limitations of Quorum Consistency](/en/ch6#sec_replication_quorum_limitations), [Linearizability and quorums](/en/ch10#sec_consistency_quorum_linearizable)
- consistent hashing, [Consistent hashing](/en/ch7#sec_sharding_consistent_hashing)
- consistent prefix reads, [Consistent Prefix Reads](/en/ch6#sec_replication_consistent_prefix)
- constraints (databases), [Consistency](/en/ch8#sec_transactions_acid_consistency), [Characterizing write skew](/en/ch8#characterizing-write-skew)
- asynchronously checked, [Loosely interpreted constraints](/en/ch13#id362)
- coordination avoidance, [Coordination-avoiding data systems](/en/ch13#id454)
- ensuring idempotence, [Uniquely identifying requests](/en/ch13#id355)
- in log-based systems, [Enforcing Constraints](/en/ch13#sec_future_constraints)-[Multi-shard request processing](/en/ch13#id360)
- across multiple shards, [Multi-shard request processing](/en/ch13#id360)
- in two-phase commit, [Distributed Transactions](/en/ch8#sec_transactions_distributed), [A system of promises](/en/ch8#a-system-of-promises)
- relation to consensus, [Uniqueness constraints require consensus](/en/ch13#id452)
- requiring linearizability, [Constraints and uniqueness guarantees](/en/ch10#sec_consistency_uniqueness)
- Consul (coordination service), [Coordination Services](/en/ch10#sec_consistency_coordination)
- use for service discovery, [Service discovery](/en/ch10#service-discovery)
- consumers (message streams), [Message brokers](/en/ch5#message-brokers), [Transmitting Event Streams](/en/ch12#sec_stream_transmit)
- backpressure, [Messaging Systems](/en/ch12#sec_stream_messaging)
- consumer groups, [Multiple consumers](/en/ch12#id298)
- consumer offsets in logs, [Consumer offsets](/en/ch12#sec_stream_log_offsets)
- failures, [Acknowledgments and redelivery](/en/ch12#sec_stream_reordering), [Consumer offsets](/en/ch12#sec_stream_log_offsets)
- fan-out, [Materializing and Updating Timelines](/en/ch2#sec_introduction_materializing), [Multiple consumers](/en/ch12#id298), [Logs compared to traditional messaging](/en/ch12#sec_stream_logs_vs_messaging)
- load balancing, [Multiple consumers](/en/ch12#id298), [Logs compared to traditional messaging](/en/ch12#sec_stream_logs_vs_messaging)
- not keeping up with producers, [Messaging Systems](/en/ch12#sec_stream_messaging), [Disk space usage](/en/ch12#sec_stream_disk_usage), [Making unbundling work](/en/ch13#sec_future_unbundling_favor)
- content models (JSON Schema), [JSON Schema](/en/ch5#json-schema)
- contention
- between transactions, [Handling errors and aborts](/en/ch8#handling-errors-and-aborts)
- blocking threads, [Process Pauses](/en/ch9#sec_distributed_clocks_pauses)
- performance of optimistic concurrency control, [Pessimistic versus optimistic concurrency control](/en/ch8#pessimistic-versus-optimistic-concurrency-control)
- under two-phase locking, [Performance of two-phase locking](/en/ch8#performance-of-two-phase-locking)
- context switches, [Latency and Response Time](/en/ch2#id23), [Process Pauses](/en/ch9#sec_distributed_clocks_pauses)
- convergence (conflict resolution), [Automatic conflict resolution](/en/ch6#automatic-conflict-resolution)-[CRDTs and Operational Transformation](/en/ch6#sec_replication_crdts)
- coordination
- avoidance, [Coordination-avoiding data systems](/en/ch13#id454)
- cross-datacenter, [The limits of total ordering](/en/ch13#id335)
- cross-region, [Geographically Distributed Operation](/en/ch6#sec_replication_multi_dc)
- cross-shard ordering, [Sharding](/en/ch8#sharding), [Synchronized clocks for global snapshots](/en/ch9#sec_distributed_spanner), [Using shared logs](/en/ch10#sec_consistency_smr), [Multi-shard request processing](/en/ch13#id360)
- routing requests to shards, [Request Routing](/en/ch7#sec_sharding_routing)
- services, [Locking and leader election](/en/ch10#locking-and-leader-election), [Coordination Services](/en/ch10#sec_consistency_coordination)-[Service discovery](/en/ch10#service-discovery)
- coordinator (in 2PC), [Two-Phase Commit (2PC)](/en/ch8#sec_transactions_2pc)
- failure, [Coordinator failure](/en/ch8#coordinator-failure)
- in XA transactions, [XA transactions](/en/ch8#xa-transactions)-[Problems with XA transactions](/en/ch8#problems-with-xa-transactions)
- recovery, [Recovering from coordinator failure](/en/ch8#recovering-from-coordinator-failure)
- copy-on-write (B-trees), [B-tree variants](/en/ch4#b-tree-variants), [Indexes and snapshot isolation](/en/ch8#indexes-and-snapshot-isolation)
- CORBA (Common Object Request Broker Architecture), [The problems with remote procedure calls (RPCs)](/en/ch5#sec_problems_with_rpc)
- coronal mass ejection (see solar storm)
- correctness
- auditability, [Trust, but Verify](/en/ch13#sec_future_verification)-[Tools for auditable data systems](/en/ch13#id366)
- Byzantine fault tolerance, [Byzantine Faults](/en/ch9#sec_distributed_byzantine)
- dealing with partial failures, [Faults and Partial Failures](/en/ch9#sec_distributed_partial_failure)
- in log-based systems, [Enforcing Constraints](/en/ch13#sec_future_constraints)-[Multi-shard request processing](/en/ch13#id360)
- of algorithm within system model, [Defining the correctness of an algorithm](/en/ch9#defining-the-correctness-of-an-algorithm)
- of derived data, [Designing for auditability](/en/ch13#id365)
- of immutable data, [Advantages of immutable events](/en/ch12#sec_stream_immutability_pros)
- of personal data, [Responsibility and Accountability](/en/ch14#id371), [Privacy and Use of Data](/en/ch14#id457)
- of time, [Problems with different topologies](/en/ch6#problems-with-different-topologies), [Clock Synchronization and Accuracy](/en/ch9#sec_distributed_clock_accuracy)-[Synchronized clocks for global snapshots](/en/ch9#sec_distributed_spanner)
- of transactions, [Consistency](/en/ch8#sec_transactions_acid_consistency), [Aiming for Correctness](/en/ch13#sec_future_correctness), [Maintaining integrity in the face of software bugs](/en/ch13#id455)
- timeliness and integrity, [Timeliness and Integrity](/en/ch13#sec_future_integrity)-[Coordination-avoiding data systems](/en/ch13#id454)
- corruption of data
- detecting, [The end-to-end argument](/en/ch13#sec_future_e2e_argument), [Don't just blindly trust what they promise](/en/ch13#id364)-[Tools for auditable data systems](/en/ch13#id366)
- due to pathological memory access, [Hardware and Software Faults](/en/ch2#sec_introduction_hardware_faults)
- due to radiation, [Byzantine Faults](/en/ch9#sec_distributed_byzantine)
- due to split brain, [Leader failure: Failover](/en/ch6#leader-failure-failover), [Distributed Locks and Leases](/en/ch9#sec_distributed_lock_fencing)
- due to weak transaction isolation, [Weak Isolation Levels](/en/ch8#sec_transactions_isolation_levels)
- integrity as absence of, [Timeliness and Integrity](/en/ch13#sec_future_integrity)
- network packets, [Weak forms of lying](/en/ch9#weak-forms-of-lying)
- on disks, [Durability](/en/ch8#durability)
- preventing using write-ahead logs, [Making B-trees reliable](/en/ch4#sec_storage_btree_wal)
- recovering from, [Batch Processing](/en/ch11#ch_batch), [Advantages of immutable events](/en/ch12#sec_stream_immutability_pros)
- cosine similarity (semantic search), [Vector Embeddings](/en/ch4#id92)
- Couchbase (database)
- document data model, [Relational Model versus Document Model](/en/ch3#sec_datamodels_history)
- durability, [Keeping everything in memory](/en/ch4#sec_storage_inmemory)
- hash sharding, [Fixed number of shards](/en/ch7#fixed-number-of-shards)
- join support, [Convergence of document and relational databases](/en/ch3#convergence-of-document-and-relational-databases)
- rebalancing, [Operations: Automatic or Manual Rebalancing](/en/ch7#sec_sharding_operations)
- vBuckets (sharding), [Sharding](/en/ch7#ch_sharding)
- CouchDB (database)
- as sync engine, [Pros and cons of sync engines](/en/ch6#pros-and-cons-of-sync-engines)
- B-tree storage, [Indexes and snapshot isolation](/en/ch8#indexes-and-snapshot-isolation)
- conflict resolution, [Manual conflict resolution](/en/ch6#manual-conflict-resolution)
- coupling (loose and tight), [Evolvability: Making Change Easy](/en/ch2#sec_introduction_evolvability)
- covering indexes, [Storing values within the index](/en/ch4#sec_storage_index_heap)
- CozoDB (database), [Datalog: Recursive Relational Queries](/en/ch3#id62)
- CPUs
- cache coherence and memory barriers, [Linearizability and network delays](/en/ch10#linearizability-and-network-delays)
- caching and pipelining, [Query Execution: Compilation and Vectorization](/en/ch4#sec_storage_vectorized)
- computing the wrong result, [Hardware and Software Faults](/en/ch2#sec_introduction_hardware_faults)
- SIMD instructions, [Query Execution: Compilation and Vectorization](/en/ch4#sec_storage_vectorized)
- crash-stop and crash-recovery faults, [System Model and Reality](/en/ch9#sec_distributed_system_model)
- CRDTs (see conflict-free replicated datatypes)
- CREATE INDEX statement (SQL), [Multi-Column and Secondary Indexes](/en/ch4#sec_storage_index_multicolumn), [Creating an index](/en/ch13#id340)
- credit rating agencies, [Responsibility and Accountability](/en/ch14#id371)
- crypto-shredding, [Event Sourcing and CQRS](/en/ch3#sec_datamodels_events), [Limitations of immutability](/en/ch12#sec_stream_immutability_limitations)
- cryptocurrencies, [Summary](/en/ch3#summary)
- cryptography
- defense against attackers, [Byzantine Faults](/en/ch9#sec_distributed_byzantine)
- end-to-end encryption and authentication, [The end-to-end argument](/en/ch13#sec_future_e2e_argument)
- CSV (comma-separated values), [Storage and Indexing for OLTP](/en/ch4#sec_storage_oltp), [JSON, XML, and Binary Variants](/en/ch5#sec_encoding_json)
- Curator (ZooKeeper recipes), [Locking and leader election](/en/ch10#locking-and-leader-election), [Allocating work to nodes](/en/ch10#allocating-work-to-nodes)
- Cypher (query language), [The Cypher Query Language](/en/ch3#id57)
- comparison to SPARQL, [The SPARQL query language](/en/ch3#the-sparql-query-language)
### D
- Daft (processing framework)
- DataFrames, [DataFrames](/en/ch11#id287)
- shuffling data, [Shuffling Data](/en/ch11#sec_shuffle)
- Dagster (workflow scheduler), [Durable Execution and Workflows](/en/ch5#sec_encoding_dataflow_workflows), [Batch Processing](/en/ch11#ch_batch), [Scheduling Workflows](/en/ch11#sec_batch_workflows)
- cloud data warehouse integration, [Query languages](/en/ch11#sec_batch_query_lanauges)
- dashboard (business intelligence), [Characterizing Transaction Processing and Analytics](/en/ch1#sec_introduction_oltp)
- Dask (processing framework), [DataFrames, Matrices, and Arrays](/en/ch3#sec_datamodels_dataframes)
- data catalog, [Cloud Data Warehouses](/en/ch4#sec_cloud_data_warehouses)
- data connectors, [Data Warehousing](/en/ch1#sec_introduction_dwh)
- data contracts, [Extract--Transform--Load (ETL)](/en/ch11#sec_batch_etl_usage)
- change data capture, [Change data capture versus event sourcing](/en/ch12#sec_stream_event_sourcing)
- data corruption (see corruption of data)
- data cubes, [Materialized Views and Data Cubes](/en/ch4#sec_storage_materialized_views)
- data engineering, [Operational Versus Analytical Systems](/en/ch1#sec_introduction_analytics)
- data fabric, [Extract--Transform--Load (ETL)](/en/ch11#sec_batch_etl_usage)
- data formats (see encoding)
- data infrastructure, [Trade-offs in Data Systems Architecture](/en/ch1#ch_tradeoffs)
- data integration, [Data Integration](/en/ch13#sec_future_integration)-[Unifying batch and stream processing](/en/ch13#id338), [Summary](/en/ch13#id367)
- batch and stream processing, [Batch and Stream Processing](/en/ch13#sec_future_batch_streaming)-[Unifying batch and stream processing](/en/ch13#id338)
- maintaining derived state, [Maintaining derived state](/en/ch13#id446)
- reprocessing data, [Reprocessing data for application evolution](/en/ch13#sec_future_reprocessing)
- unifying, [Unifying batch and stream processing](/en/ch13#id338)
- by unbundling databases, [Unbundling Databases](/en/ch13#sec_future_unbundling)-[Multi-shard data processing](/en/ch13#sec_future_unbundled_multi_shard)
- comparison to federated databases, [The meta-database of everything](/en/ch13#id341)
- combining tools by deriving data, [Combining Specialized Tools by Deriving Data](/en/ch13#id442)-[Ordering events to capture causality](/en/ch13#sec_future_capture_causality)
- derived data versus distributed transactions, [Derived data versus distributed transactions](/en/ch13#sec_future_derived_vs_transactions)
- limits of total ordering, [The limits of total ordering](/en/ch13#id335)
- ordering events to capture causality, [Ordering events to capture causality](/en/ch13#sec_future_capture_causality)
- reasoning about dataflows, [Reasoning about dataflows](/en/ch13#id443)
- need for, [Systems of Record and Derived Data](/en/ch1#sec_introduction_derived)
- using batch processing, [Batch Processing](/en/ch11#ch_batch), [Extract--Transform--Load (ETL)](/en/ch11#sec_batch_etl_usage)
- data lake, [From data warehouse to data lake](/en/ch1#from-data-warehouse-to-data-lake)
- data lakehouse, [Cloud Data Warehouses](/en/ch4#sec_cloud_data_warehouses), [Analytics](/en/ch11#sec_batch_olap)
- data locality (see locality)
- data mesh, [Extract--Transform--Load (ETL)](/en/ch11#sec_batch_etl_usage)
- data minimization, [Data Systems, Law, and Society](/en/ch1#sec_introduction_compliance), [Legislation and Self-Regulation](/en/ch14#sec_future_legislation)
- data models, [Data Models and Query Languages](/en/ch3#ch_datamodels)-[Summary](/en/ch3#summary)
- DataFrames and arrays, [DataFrames, Matrices, and Arrays](/en/ch3#sec_datamodels_dataframes)
- graph-like models, [Graph-Like Data Models](/en/ch3#sec_datamodels_graph)-[GraphQL](/en/ch3#id63)
- Datalog language, [Datalog: Recursive Relational Queries](/en/ch3#id62)
- property graphs, [Property Graphs](/en/ch3#id56)
- RDF and triple-stores, [Triple-Stores and SPARQL](/en/ch3#id59)-[The SPARQL query language](/en/ch3#the-sparql-query-language)
- relational model versus document model, [Relational Model versus Document Model](/en/ch3#sec_datamodels_history)-[Convergence of document and relational databases](/en/ch3#convergence-of-document-and-relational-databases)
- supporting multiple, [Event Sourcing and CQRS](/en/ch3#sec_datamodels_events)
- data pipelines, [From data warehouse to data lake](/en/ch1#from-data-warehouse-to-data-lake), [Systems of Record and Derived Data](/en/ch1#sec_introduction_derived), [Extract--Transform--Load (ETL)](/en/ch11#sec_batch_etl_usage)
- data products, [Beyond the data lake](/en/ch1#beyond-the-data-lake)
- data protection regulations (see GDPR)
- data residency laws, [Distributed Versus Single-Node Systems](/en/ch1#sec_introduction_distributed), [Sharding for Multitenancy](/en/ch7#sec_sharding_multitenancy)
- data science, [Operational Versus Analytical Systems](/en/ch1#sec_introduction_analytics), [From data warehouse to data lake](/en/ch1#from-data-warehouse-to-data-lake)
- data silo, [Data Warehousing](/en/ch1#sec_introduction_dwh)
- data systems
- correctness, constraints, and integrity, [Aiming for Correctness](/en/ch13#sec_future_correctness)-[Tools for auditable data systems](/en/ch13#id366)
- data integration, [Data Integration](/en/ch13#sec_future_integration)-[Unifying batch and stream processing](/en/ch13#id338)
- goals for using, [Trade-offs in Data Systems Architecture](/en/ch1#ch_tradeoffs)
- heterogeneous, keeping in sync, [Keeping Systems in Sync](/en/ch12#sec_stream_sync)
- maintainability, [Maintainability](/en/ch2#sec_introduction_maintainability)-[Evolvability: Making Change Easy](/en/ch2#sec_introduction_evolvability)
- possible faults in, [Transactions](/en/ch8#ch_transactions)
- reliability, [Reliability and Fault Tolerance](/en/ch2#sec_introduction_reliability)-[Humans and Reliability](/en/ch2#id31)
- hardware faults, [Hardware and Software Faults](/en/ch2#sec_introduction_hardware_faults)
- human errors, [Humans and Reliability](/en/ch2#id31)
- importance of, [Humans and Reliability](/en/ch2#id31)
- software faults, [Software faults](/en/ch2#software-faults)
- scalability, [Scalability](/en/ch2#sec_introduction_scalability)-[Principles for Scalability](/en/ch2#id35)
- unbundling databases, [Unbundling Databases](/en/ch13#sec_future_unbundling)-[Multi-shard data processing](/en/ch13#sec_future_unbundled_multi_shard)
- unreliable clocks, [Unreliable Clocks](/en/ch9#sec_distributed_clocks)-[Limiting the impact of garbage collection](/en/ch9#sec_distributed_gc_impact)
- data warehousing, [Data Warehousing](/en/ch1#sec_introduction_dwh), [Glossary](/en/glossary)
- cloud-based solutions, [Cloud Data Warehouses](/en/ch4#sec_cloud_data_warehouses)
- ETL (extract-transform-load), [Data Warehousing](/en/ch1#sec_introduction_dwh), [Keeping Systems in Sync](/en/ch12#sec_stream_sync)
- for batch processing, [Batch Processing](/en/ch11#ch_batch)
- keeping data systems in sync, [Keeping Systems in Sync](/en/ch12#sec_stream_sync)
- schema design, [Stars and Snowflakes: Schemas for Analytics](/en/ch3#sec_datamodels_analytics)
- sharding and clustering, [Sharding by hash range](/en/ch7#sharding-by-hash-range)
- slowly changing dimension (SCD), [Time-dependence of joins](/en/ch12#sec_stream_join_time)
- data-intensive applications, [Trade-offs in Data Systems Architecture](/en/ch1#ch_tradeoffs)
- database administrator, [Operations in the Cloud Era](/en/ch1#sec_introduction_operations)
- database-internal distributed transactions, [Distributed Transactions Across Different Systems](/en/ch8#sec_transactions_xa), [Database-internal Distributed Transactions](/en/ch8#sec_transactions_internal), [Atomic commit revisited](/en/ch12#sec_stream_atomic_commit)
- databases
- archival storage, [Archival storage](/en/ch5#archival-storage)
- comparison of message brokers to, [Message brokers compared to databases](/en/ch12#id297)
- dataflow through, [Dataflow Through Databases](/en/ch5#sec_encoding_dataflow_db)
- end-to-end argument for, [The end-to-end argument](/en/ch13#sec_future_e2e_argument)-[Applying end-to-end thinking in data systems](/en/ch13#id357)
- checking integrity, [The end-to-end argument again](/en/ch13#id456)
- relation to event streams, [Databases and Streams](/en/ch12#sec_stream_databases)-[Limitations of immutability](/en/ch12#sec_stream_immutability_limitations)
- (see also changelogs)
- API support for change streams, [API support for change streams](/en/ch12#sec_stream_change_api), [Separation of application code and state](/en/ch13#id344)
- change data capture, [Change Data Capture](/en/ch12#sec_stream_cdc)-[API support for change streams](/en/ch12#sec_stream_change_api)
- event sourcing, [Change data capture versus event sourcing](/en/ch12#sec_stream_event_sourcing)
- keeping systems in sync, [Keeping Systems in Sync](/en/ch12#sec_stream_sync)
- philosophy of immutable events, [State, Streams, and Immutability](/en/ch12#sec_stream_immutability)-[Limitations of immutability](/en/ch12#sec_stream_immutability_limitations)
- unbundling, [Unbundling Databases](/en/ch13#sec_future_unbundling)-[Multi-shard data processing](/en/ch13#sec_future_unbundled_multi_shard)
- composing data storage technologies, [Composing Data Storage Technologies](/en/ch13#id447)-[Unbundled versus integrated systems](/en/ch13#id448)
- designing applications around dataflow, [Designing Applications Around Dataflow](/en/ch13#sec_future_dataflow)-[Stream processors and services](/en/ch13#id345)
- observing derived state, [Observing Derived State](/en/ch13#sec_future_observing)-[Multi-shard data processing](/en/ch13#sec_future_unbundled_multi_shard)
- datacenters
- failures of, [Hardware and Software Faults](/en/ch2#sec_introduction_hardware_faults)
- geographically distributed (see regions (geographic distribution))
- multitenancy and shared resources, [Network congestion and queueing](/en/ch9#network-congestion-and-queueing)
- network architecture, [Cloud Computing Versus Supercomputing](/en/ch1#id17)
- network faults, [Network Faults in Practice](/en/ch9#sec_distributed_network_faults)
- dataflow, [Modes of Dataflow](/en/ch5#sec_encoding_dataflow)-[Distributed actor frameworks](/en/ch5#distributed-actor-frameworks), [Designing Applications Around Dataflow](/en/ch13#sec_future_dataflow)-[Stream processors and services](/en/ch13#id345)
- correctness of dataflow systems, [Correctness of dataflow systems](/en/ch13#id453)
- dataflow engines, [Dataflow Engines](/en/ch11#sec_batch_dataflow)
- comparison to stream processing, [Processing Streams](/en/ch12#sec_stream_processing)
- DataFrames, [DataFrames](/en/ch11#id287)
- support in batch processing frameworks, [Batch Processing](/en/ch11#ch_batch)
- event-driven, [Event-Driven Architectures](/en/ch5#sec_encoding_dataflow_msg)-[Distributed actor frameworks](/en/ch5#distributed-actor-frameworks)
- reasoning about, [Reasoning about dataflows](/en/ch13#id443)
- through databases, [Dataflow Through Databases](/en/ch5#sec_encoding_dataflow_db)
- through services, [Dataflow Through Services: REST and RPC](/en/ch5#sec_encoding_dataflow_rpc)-[Data encoding and evolution for RPC](/en/ch5#data-encoding-and-evolution-for-rpc)
- workflow engines (see workflow engines)
- DataFrames, [DataFrames, Matrices, and Arrays](/en/ch3#sec_datamodels_dataframes)
- implementation, [DataFrames](/en/ch11#id287)
- in batch processing, [DataFrames](/en/ch11#id287)
- in notebooks, [Machine Learning](/en/ch11#id290)
- support in batch processing frameworks, [Batch Processing](/en/ch11#ch_batch)
- DataFusion (query engine), [Cloud Data Warehouses](/en/ch4#sec_cloud_data_warehouses)
- Datalog (query language), [Datalog: Recursive Relational Queries](/en/ch3#id62)
- Datastream (change data capture), [API support for change streams](/en/ch12#sec_stream_change_api)
- datatypes
- binary strings in XML and JSON, [JSON, XML, and Binary Variants](/en/ch5#sec_encoding_json)
- conflict-free, [CRDTs and Operational Transformation](/en/ch6#sec_replication_crdts)
- in Avro encodings, [Avro](/en/ch5#sec_encoding_avro)
- in Protocol Buffers, [Field tags and schema evolution](/en/ch5#field-tags-and-schema-evolution)
- numbers in XML and JSON, [JSON, XML, and Binary Variants](/en/ch5#sec_encoding_json)
- Datensparsamkeit, [Data Systems, Law, and Society](/en/ch1#sec_introduction_compliance)
- Datomic (database)
- B-tree storage, [Indexes and snapshot isolation](/en/ch8#indexes-and-snapshot-isolation)
- data model, [Graph-Like Data Models](/en/ch3#sec_datamodels_graph), [Triple-Stores and SPARQL](/en/ch3#id59)
- Datalog query language, [Datalog: Recursive Relational Queries](/en/ch3#id62)
- excision (deleting data), [Limitations of immutability](/en/ch12#sec_stream_immutability_limitations)
- languages for transactions, [Pros and cons of stored procedures](/en/ch8#sec_transactions_stored_proc_tradeoffs)
- serial execution of transactions, [Actual Serial Execution](/en/ch8#sec_transactions_serial)
- Daylight Saving Time (DST), [Time-of-day clocks](/en/ch9#time-of-day-clocks)
- Db2 (database)
- change data capture, [Implementing change data capture](/en/ch12#id307)
- DBA (database administrator), [Operations in the Cloud Era](/en/ch1#sec_introduction_operations)
- deadlocks, [Explicit locking](/en/ch8#explicit-locking)
- detection, in distributed transactions, [Problems with XA transactions](/en/ch8#problems-with-xa-transactions)
- in two-phase locking (2PL), [Implementation of two-phase locking](/en/ch8#implementation-of-two-phase-locking)
- Debezium (change data capture), [Implementing change data capture](/en/ch12#id307)
- Cassandra, [API support for change streams](/en/ch12#sec_stream_change_api)
- for data integration, [Unbundled versus integrated systems](/en/ch13#id448)
- declarative languages, [Data Models and Query Languages](/en/ch3#ch_datamodels), [Glossary](/en/glossary)
- and sync engines, [Pros and cons of sync engines](/en/ch6#pros-and-cons-of-sync-engines)
- Datalog, [Datalog: Recursive Relational Queries](/en/ch3#id62)
- in document databases, [Convergence of document and relational databases](/en/ch3#convergence-of-document-and-relational-databases)
- recursive SQL queries, [Graph Queries in SQL](/en/ch3#id58)
- SPARQL, [The SPARQL query language](/en/ch3#the-sparql-query-language)
- DeepSeek
- 3FS (see 3FS)
- delays
- bounded network delays, [Synchronous Versus Asynchronous Networks](/en/ch9#sec_distributed_sync_networks)
- bounded process pauses, [Response time guarantees](/en/ch9#sec_distributed_clocks_realtime)
- unbounded network delays, [Timeouts and Unbounded Delays](/en/ch9#sec_distributed_queueing)
- unbounded process pauses, [Process Pauses](/en/ch9#sec_distributed_clocks_pauses)
- deleting data, [Limitations of immutability](/en/ch12#sec_stream_immutability_limitations)
- in LSM storage, [Disk space usage](/en/ch4#disk-space-usage)
- legal basis, [Data Systems, Law, and Society](/en/ch1#sec_introduction_compliance)
- Delta Lake (table format), [Constructing and merging SSTables](/en/ch4#constructing-and-merging-sstables), [Cloud Data Warehouses](/en/ch4#sec_cloud_data_warehouses)
- sharding and clustering, [Sharding by hash range](/en/ch7#sharding-by-hash-range)
- demilitarized zone (networking), [Serving Derived Data](/en/ch11#sec_batch_serving_derived)
- denormalization (data representation), [Normalization, Denormalization, and Joins](/en/ch3#sec_datamodels_normalization)-[Many-to-One and Many-to-Many Relationships](/en/ch3#sec_datamodels_many_to_many), [Glossary](/en/glossary)
- in derived data systems, [Systems of Record and Derived Data](/en/ch1#sec_introduction_derived)
- in event sourcing/CQRS, [Event Sourcing and CQRS](/en/ch3#sec_datamodels_events)
- in social network case study, [Denormalization in the social networking case study](/en/ch3#denormalization-in-the-social-networking-case-study)
- materialized views, [Materialized Views and Data Cubes](/en/ch4#sec_storage_materialized_views)
- updating derived data, [Single-Object and Multi-Object Operations](/en/ch8#sec_transactions_multi_object), [The need for multi-object transactions](/en/ch8#sec_transactions_need), [Combining Specialized Tools by Deriving Data](/en/ch13#id442)
- versus normalization, [Deriving several views from the same event log](/en/ch12#sec_stream_deriving_views)
- derived data, [Systems of Record and Derived Data](/en/ch1#sec_introduction_derived), [Stream Processing](/en/ch12#ch_stream), [Glossary](/en/glossary)
- batch processing, [Batch Processing](/en/ch11#ch_batch)
- event sourcing and CQRS, [Event Sourcing and CQRS](/en/ch3#sec_datamodels_events)
- from change data capture, [Implementing change data capture](/en/ch12#id307)
- maintaining derived state through logs, [Databases and Streams](/en/ch12#sec_stream_databases)-[API support for change streams](/en/ch12#sec_stream_change_api), [State, Streams, and Immutability](/en/ch12#sec_stream_immutability)-[Concurrency control](/en/ch12#sec_stream_concurrency)
- observing, by subscribing to streams, [End-to-end event streams](/en/ch13#id349)
- outputs of batch and stream processing, [Batch and Stream Processing](/en/ch13#sec_future_batch_streaming)
- through application code, [Application code as a derivation function](/en/ch13#sec_future_dataflow_derivation)
- versus distributed transactions, [Derived data versus distributed transactions](/en/ch13#sec_future_derived_vs_transactions)
- design patterns, [Simplicity: Managing Complexity](/en/ch2#id38)
- deterministic operations, [Pros and cons of stored procedures](/en/ch8#sec_transactions_stored_proc_tradeoffs), [Faults and Partial Failures](/en/ch9#sec_distributed_partial_failure), [Glossary](/en/glossary)
- and idempotence, [Idempotence](/en/ch12#sec_stream_idempotence), [Reasoning about dataflows](/en/ch13#id443)
- computing derived data, [Maintaining derived state](/en/ch13#id446), [Correctness of dataflow systems](/en/ch13#id453), [Designing for auditability](/en/ch13#id365)
- in event sourcing, [Event Sourcing and CQRS](/en/ch3#sec_datamodels_events)
- in state machine replication, [Using shared logs](/en/ch10#sec_consistency_smr), [Databases and Streams](/en/ch12#sec_stream_databases)
- in statement-based replication, [Statement-based replication](/en/ch6#statement-based-replication)
- in testing, [Deterministic simulation testing](/en/ch9#deterministic-simulation-testing)
- joins, [Time-dependence of joins](/en/ch12#sec_stream_join_time)
- making code deterministic, [Deterministic simulation testing](/en/ch9#deterministic-simulation-testing)
- overview, [Deterministic simulation testing](/en/ch9#deterministic-simulation-testing)
- deterministic simulation testing (DST), [Deterministic simulation testing](/en/ch9#deterministic-simulation-testing)
- DevOps, [Operations in the Cloud Era](/en/ch1#sec_introduction_operations)
- dimension tables, [Stars and Snowflakes: Schemas for Analytics](/en/ch3#sec_datamodels_analytics)
- dimensional modeling (see star schemas)
- directed acyclic graphs (DAG)
- workflows, [Scheduling Workflows](/en/ch11#sec_batch_workflows)
- (see also workflow engines)
- dirty reads (transaction isolation), [No dirty reads](/en/ch8#no-dirty-reads)
- dirty writes (transaction isolation), [No dirty writes](/en/ch8#sec_transactions_dirty_write)
- disaggregation
- of storage and compute, [Separation of storage and compute](/en/ch1#sec_introduction_storage_compute)
- Discord (group chat)
- GraphQL example, [GraphQL](/en/ch3#id63)
- discrimination, [Bias and Discrimination](/en/ch14#id370)
- disks (see hard disks)
- distributed actor frameworks, [Distributed actor frameworks](/en/ch5#distributed-actor-frameworks)
- distributed filesystems, [Distributed Filesystems](/en/ch11#sec_batch_dfs)
- comparison to object storage, [Object Stores](/en/ch11#id277)
- use by Flink, [Rebuilding state after a failure](/en/ch12#sec_stream_state_fault_tolerance)
- distributed ledgers, [Summary](/en/ch3#summary)
- distributed systems, [The Trouble with Distributed Systems](/en/ch9#ch_distributed)-[Summary](/en/ch9#summary), [Glossary](/en/glossary)
- Byzantine faults, [Byzantine Faults](/en/ch9#sec_distributed_byzantine)-[Weak forms of lying](/en/ch9#weak-forms-of-lying)
- detecting network faults, [Detecting Faults](/en/ch9#id307)
- faults and partial failures, [Faults and Partial Failures](/en/ch9#sec_distributed_partial_failure)
- formalization of consensus, [Single-value consensus](/en/ch10#single-value-consensus)
- impossibility results, [The CAP theorem](/en/ch10#the-cap-theorem), [Consensus](/en/ch10#sec_consistency_consensus)
- issues with failover, [Leader failure: Failover](/en/ch6#leader-failure-failover)
- multi-region (see regions (geographic distribution))
- network problems, [Unreliable Networks](/en/ch9#sec_distributed_networks)-[Can we not simply make network delays predictable?](/en/ch9#can-we-not-simply-make-network-delays-predictable)
- problems with, [Problems with Distributed Systems](/en/ch1#sec_introduction_dist_sys_problems)
- quorums, relying on, [The Majority Rules](/en/ch9#sec_distributed_majority)
- reasons for using, [Distributed Versus Single-Node Systems](/en/ch1#sec_introduction_distributed), [Replication](/en/ch6#ch_replication)
- synchronized clocks, relying on, [Relying on Synchronized Clocks](/en/ch9#sec_distributed_clocks_relying)-[Synchronized clocks for global snapshots](/en/ch9#sec_distributed_spanner)
- system models, [System Model and Reality](/en/ch9#sec_distributed_system_model)-[Deterministic simulation testing](/en/ch9#deterministic-simulation-testing)
- use of clocks and time, [Unreliable Clocks](/en/ch9#sec_distributed_clocks)
- distributed transactions (see transactions)
- Django (web framework), [Handling errors and aborts](/en/ch8#handling-errors-and-aborts)
- DMZ (demilitarized zone), [Serving Derived Data](/en/ch11#sec_batch_serving_derived)
- DNS (Domain Name System), [Request Routing](/en/ch7#sec_sharding_routing), [Service discovery](/en/ch10#service-discovery)
- for load balancing, [Load balancers, service discovery, and service meshes](/en/ch5#sec_encoding_service_discovery)
- Docker (container manager), [Separation of application code and state](/en/ch13#id344)
- document data model, [Relational Model versus Document Model](/en/ch3#sec_datamodels_history)-[Convergence of document and relational databases](/en/ch3#convergence-of-document-and-relational-databases)
- comparison to relational model, [When to Use Which Model](/en/ch3#sec_datamodels_document_summary)-[Convergence of document and relational databases](/en/ch3#convergence-of-document-and-relational-databases)
- multi-object transactions, need for, [The need for multi-object transactions](/en/ch8#sec_transactions_need)
- sharded secondary indexes, [Sharding and Secondary Indexes](/en/ch7#sec_sharding_secondary_indexes)
- versus relational model
- convergence of models, [Convergence of document and relational databases](/en/ch3#convergence-of-document-and-relational-databases)
- data locality, [Data locality for reads and writes](/en/ch3#sec_datamodels_document_locality)
- document-partitioned indexes (see local secondary indexes)
- domain-driven design (DDD), [Simplicity: Managing Complexity](/en/ch2#id38), [Event Sourcing and CQRS](/en/ch3#sec_datamodels_events)
- dotted version vectors, [Version vectors](/en/ch6#version-vectors)
- double-entry bookkeeping, [Summary](/en/ch3#summary)
- DRBD (Distributed Replicated Block Device), [Single-Leader Replication](/en/ch6#sec_replication_leader)
- drift (clocks), [Clock Synchronization and Accuracy](/en/ch9#sec_distributed_clock_accuracy)
- Druid (database), [Characterizing Transaction Processing and Analytics](/en/ch1#sec_introduction_oltp), [Column-Oriented Storage](/en/ch4#sec_storage_column), [Deriving several views from the same event log](/en/ch12#sec_stream_deriving_views)
- handling writes, [Writing to Column-Oriented Storage](/en/ch4#writing-to-column-oriented-storage)
- pre-aggregation, [Analytics](/en/ch11#sec_batch_olap)
- serving derived data, [Serving Derived Data](/en/ch11#sec_batch_serving_derived)
- Dryad (dataflow engine), [Dataflow Engines](/en/ch11#sec_batch_dataflow)
- dual writes, problems with, [Keeping Systems in Sync](/en/ch12#sec_stream_sync)
- DuckDB (database), [Problems with Distributed Systems](/en/ch1#sec_introduction_dist_sys_problems), [Compaction strategies](/en/ch4#sec_storage_lsm_compaction)
- column-oriented storage, [Column-Oriented Storage](/en/ch4#sec_storage_column)
- use for ETL, [Extract--Transform--Load (ETL)](/en/ch11#sec_batch_etl_usage)
- duplicates, suppression of, [Duplicate suppression](/en/ch13#id354)
- (see also idempotence)
- using a unique ID, [Uniquely identifying requests](/en/ch13#id355), [Multi-shard request processing](/en/ch13#id360)
- durability (transactions), [Making B-trees reliable](/en/ch4#sec_storage_btree_wal), [Durability](/en/ch8#durability), [Glossary](/en/glossary)
- durable execution, [Durable Execution and Workflows](/en/ch5#sec_encoding_dataflow_workflows)
- reliance on determinism, [Deterministic simulation testing](/en/ch9#deterministic-simulation-testing)
- Restate (see Restate (workflow engine))
- Temporal (see Temporal (workflow engine))
- durable functions (see workflow engines)
- duration (time), [Unreliable Clocks](/en/ch9#sec_distributed_clocks)
- measurement with monotonic clocks, [Monotonic clocks](/en/ch9#monotonic-clocks)
- dynamically typed languages
- analogy to schema-on-read, [Schema flexibility in the document model](/en/ch3#sec_datamodels_schema_flexibility)
- Dynamo (database), [Leaderless Replication](/en/ch6#sec_replication_leaderless)
- Dynamo-style databases (see leaderless replication)
- DynamoDB (database)
- auto-scaling, [Operations: Automatic or Manual Rebalancing](/en/ch7#sec_sharding_operations)
- hash-range sharding, [Sharding by hash range](/en/ch7#sharding-by-hash-range)
- leader-based replication, [Single-Leader Replication](/en/ch6#sec_replication_leader)
- sharded secondary indexes, [Global Secondary Indexes](/en/ch7#id167)
### E
- EBS (virtual block device), [Separation of storage and compute](/en/ch1#sec_introduction_storage_compute)
- compared to object storage, [Setting Up New Followers](/en/ch6#sec_replication_new_replica)
- ECC (see error-correcting codes)
- EDB Postgres Distributed (database), [Geographically Distributed Operation](/en/ch6#sec_replication_multi_dc)
- edges (in graphs), [Graph-Like Data Models](/en/ch3#sec_datamodels_graph)
- property graph model, [Property Graphs](/en/ch3#id56)
- edit distance (full-text search), [Full-Text Search](/en/ch4#sec_storage_full_text)
- effectively-once semantics, [Fault Tolerance](/en/ch12#sec_stream_fault_tolerance), [Exactly-once execution of an operation](/en/ch13#id353)
- (see also exactly-once semantics)
- preservation of integrity, [Correctness of dataflow systems](/en/ch13#id453)
- Elastic Compute Cloud (EC2)
- spot instances, [Handling Faults](/en/ch11#id281)
- elasticity, [Distributed Versus Single-Node Systems](/en/ch1#sec_introduction_distributed)
- cloud data warehouses, [Cloud Data Warehouses](/en/ch4#sec_cloud_data_warehouses), [Query languages](/en/ch11#sec_batch_query_lanauges)
- Elasticsearch (search server)
- local secondary indexes, [Local Secondary Indexes](/en/ch7#id166)
- percolator (stream search), [Search on streams](/en/ch12#id320)
- serving derived data, [Serving Derived Data](/en/ch11#sec_batch_serving_derived)
- shard rebalancing, [Fixed number of shards](/en/ch7#fixed-number-of-shards)
- use of Lucene, [Full-Text Search](/en/ch4#sec_storage_full_text)
- Elm (programming language), [End-to-end event streams](/en/ch13#id349)
- ELT (extract-load-transform), [Data Warehousing](/en/ch1#sec_introduction_dwh)
- relation to batch processing, [Extract--Transform--Load (ETL)](/en/ch11#sec_batch_etl_usage)
- embarrassingly parallel (algorithms)
- ETL (see ETL (extract-transform-load))
- MapReduce, [MapReduce](/en/ch11#sec_batch_mapreduce)
- (see also MapReduce)
- embedded storage engines, [Compaction strategies](/en/ch4#sec_storage_lsm_compaction)
- embedding (vector), [Vector Embeddings](/en/ch4#id92)
- encodings (data formats), [Encoding and Evolution](/en/ch5#ch_encoding)-[The Merits of Schemas](/en/ch5#sec_encoding_schemas)
- Avro, [Avro](/en/ch5#sec_encoding_avro)-[Dynamically generated schemas](/en/ch5#dynamically-generated-schemas)
- binary variants of JSON and XML, [Binary encoding](/en/ch5#binary-encoding)
- compatibility, [Encoding and Evolution](/en/ch5#ch_encoding)
- calling services, [Data encoding and evolution for RPC](/en/ch5#data-encoding-and-evolution-for-rpc)
- using databases, [Dataflow Through Databases](/en/ch5#sec_encoding_dataflow_db)-[Archival storage](/en/ch5#archival-storage)
- defined, [Formats for Encoding Data](/en/ch5#sec_encoding_formats)
- JSON, XML, and CSV, [JSON, XML, and Binary Variants](/en/ch5#sec_encoding_json)
- language-specific formats, [Language-Specific Formats](/en/ch5#id96)
- merits of schemas, [The Merits of Schemas](/en/ch5#sec_encoding_schemas)
- Protocol Buffers, [Protocol Buffers](/en/ch5#sec_encoding_protobuf)-[Field tags and schema evolution](/en/ch5#field-tags-and-schema-evolution)
- representations of data, [Formats for Encoding Data](/en/ch5#sec_encoding_formats)
- end-to-end argument, [The end-to-end argument](/en/ch13#sec_future_e2e_argument)-[Applying end-to-end thinking in data systems](/en/ch13#id357)
- checking integrity, [The end-to-end argument again](/en/ch13#id456)
- publish/subscribe streams, [End-to-end event streams](/en/ch13#id349)
- enrichment (stream), [Stream-table join (stream enrichment)](/en/ch12#sec_stream_table_joins)
- Enterprise JavaBeans (EJB), [The problems with remote procedure calls (RPCs)](/en/ch5#sec_problems_with_rpc)
- enterprise software, [Trade-offs in Data Systems Architecture](/en/ch1#ch_tradeoffs)
- entities (see vertices)
- ephemeral storage, [Separation of storage and compute](/en/ch1#sec_introduction_storage_compute)
- epoch (consensus algorithms), [From single-leader replication to consensus](/en/ch10#from-single-leader-replication-to-consensus)
- epoch (Unix timestamps), [Time-of-day clocks](/en/ch9#time-of-day-clocks)
- erasure coding (error correction), [Distributed Filesystems](/en/ch11#sec_batch_dfs)
- error handling
- for network faults, [Network Faults in Practice](/en/ch9#sec_distributed_network_faults)
- in transactions, [Handling errors and aborts](/en/ch8#handling-errors-and-aborts)
- error-correcting codes, [Hardware and Software Faults](/en/ch2#sec_introduction_hardware_faults), [Distributed Filesystems](/en/ch11#sec_batch_dfs)
- Esper (CEP engine), [Complex event processing](/en/ch12#id317)
- essential complexity, [Simplicity: Managing Complexity](/en/ch2#id38)
- etcd (coordination service), [Coordination Services](/en/ch10#sec_consistency_coordination)-[Service discovery](/en/ch10#service-discovery)
- generating fencing tokens, [Fencing off zombies and delayed requests](/en/ch9#sec_distributed_fencing_tokens), [Coordination Services](/en/ch10#sec_consistency_coordination)
- linearizable operations, [Implementing Linearizable Systems](/en/ch10#sec_consistency_implementing_linearizable), [Subtleties of consensus](/en/ch10#subtleties-of-consensus)
- locks and leader election, [Locking and leader election](/en/ch10#locking-and-leader-election)
- use for service discovery, [Load balancers, service discovery, and service meshes](/en/ch5#sec_encoding_service_discovery), [Service discovery](/en/ch10#service-discovery)
- use for shard assignment, [Request Routing](/en/ch7#sec_sharding_routing)
- use of Raft algorithm, [Single-Leader Replication](/en/ch6#sec_replication_leader)
- Ethereum (blockchain), [Tools for auditable data systems](/en/ch13#id366)
- Ethernet (networks), [Cloud Computing Versus Supercomputing](/en/ch1#id17), [Unreliable Networks](/en/ch9#sec_distributed_networks), [Can we not simply make network delays predictable?](/en/ch9#can-we-not-simply-make-network-delays-predictable)
- packet checksums, [Weak forms of lying](/en/ch9#weak-forms-of-lying), [The end-to-end argument](/en/ch13#sec_future_e2e_argument)
- ethics, [Doing the Right Thing](/en/ch14)-[Legislation and Self-Regulation](/en/ch14#sec_future_legislation)
- code of ethics and professional practice, [Doing the Right Thing](/en/ch14)
- legislation and self-regulation, [Legislation and Self-Regulation](/en/ch14#sec_future_legislation)
- predictive analytics, [Predictive Analytics](/en/ch14#id369)-[Feedback Loops](/en/ch14#id372)
- amplifying bias, [Bias and Discrimination](/en/ch14#id370)
- feedback loops, [Feedback Loops](/en/ch14#id372)
- privacy and tracking, [Privacy and Tracking](/en/ch14#id373)-[Legislation and Self-Regulation](/en/ch14#sec_future_legislation)
- consent and freedom of choice, [Consent and Freedom of Choice](/en/ch14#id375)
- data as assets and power, [Data as Assets and Power](/en/ch14#id376)
- meaning of privacy, [Privacy and Use of Data](/en/ch14#id457)
- surveillance, [Surveillance](/en/ch14#id374)
- respect, dignity, and agency, [Legislation and Self-Regulation](/en/ch14#sec_future_legislation)
- unintended consequences, [Doing the Right Thing](/en/ch14), [Feedback Loops](/en/ch14#id372)
- ETL (extract-transform-load), [Data Warehousing](/en/ch1#sec_introduction_dwh), [Keeping Systems in Sync](/en/ch12#sec_stream_sync), [Glossary](/en/glossary)
- relation to batch processing, [Extract--Transform--Load (ETL)](/en/ch11#sec_batch_etl_usage)
- using batch processing, [Batch Processing](/en/ch11#ch_batch)
- Euclidean distance (semantic search), [Vector Embeddings](/en/ch4#id92)
- European Union
- AI Act (see AI Act)
- GDPR (see GDPR)
- event sourcing, [Event Sourcing and CQRS](/en/ch3#sec_datamodels_events)
- and change data capture, [Change data capture versus event sourcing](/en/ch12#sec_stream_event_sourcing)
- comparison to change data capture, [Change data capture versus event sourcing](/en/ch12#sec_stream_event_sourcing)
- immutability and auditability, [State, Streams, and Immutability](/en/ch12#sec_stream_immutability), [Designing for auditability](/en/ch13#id365)
- large, reliable data systems, [Uniquely identifying requests](/en/ch13#id355), [Correctness of dataflow systems](/en/ch13#id453)
- reliance on determinism, [Deterministic simulation testing](/en/ch9#deterministic-simulation-testing)
- event streams (see streams)
- event-driven architecture, [Event-Driven Architectures](/en/ch5#sec_encoding_dataflow_msg)-[Distributed actor frameworks](/en/ch5#distributed-actor-frameworks)
- distributed actor frameworks, [Distributed actor frameworks](/en/ch5#distributed-actor-frameworks)
- events, [Transmitting Event Streams](/en/ch12#sec_stream_transmit)
- deciding on total order of, [The limits of total ordering](/en/ch13#id335)
- deriving views from event log, [Deriving several views from the same event log](/en/ch12#sec_stream_deriving_views)
- event time versus processing time, [Event time versus processing time](/en/ch12#id322), [Microbatching and checkpointing](/en/ch12#id329), [Unifying batch and stream processing](/en/ch13#id338)
- immutable, advantages of, [Advantages of immutable events](/en/ch12#sec_stream_immutability_pros), [Designing for auditability](/en/ch13#id365)
- ordering to capture causality, [Ordering events to capture causality](/en/ch13#sec_future_capture_causality)
- reads as, [Reads are events too](/en/ch13#sec_future_read_events)
- stragglers, [Handling straggler events](/en/ch12#id323)
- timestamp of, in stream processing, [Whose clock are you using, anyway?](/en/ch12#id438)
- EventSource (browser API), [Pushing state changes to clients](/en/ch13#id348)
- EventStoreDB (database), [Event Sourcing and CQRS](/en/ch3#sec_datamodels_events)
- eventual consistency, [Replication](/en/ch6#ch_replication), [Problems with Replication Lag](/en/ch6#sec_replication_lag), [Safety and liveness](/en/ch9#sec_distributed_safety_liveness)
- (see also conflicts)
- and perpetual inconsistency, [Timeliness and Integrity](/en/ch13#sec_future_integrity)
- strong eventual consistency, [Automatic conflict resolution](/en/ch6#automatic-conflict-resolution)
- evidence
- data used as, [Humans and Reliability](/en/ch2#id31)
- evolvability, [Evolvability: Making Change Easy](/en/ch2#sec_introduction_evolvability), [Encoding and Evolution](/en/ch5#ch_encoding)
- calling services, [Data encoding and evolution for RPC](/en/ch5#data-encoding-and-evolution-for-rpc)
- event sourcing, [Event Sourcing and CQRS](/en/ch3#sec_datamodels_events)
- graph-structured data, [Property Graphs](/en/ch3#id56)
- of databases, [Schema flexibility in the document model](/en/ch3#sec_datamodels_schema_flexibility), [Dataflow Through Databases](/en/ch5#sec_encoding_dataflow_db)-[Archival storage](/en/ch5#archival-storage), [Deriving several views from the same event log](/en/ch12#sec_stream_deriving_views), [Reprocessing data for application evolution](/en/ch13#sec_future_reprocessing)
- reprocessing data, [Reprocessing data for application evolution](/en/ch13#sec_future_reprocessing), [Unifying batch and stream processing](/en/ch13#id338)
- schema evolution in Avro, [The writer's schema and the reader's schema](/en/ch5#the-writers-schema-and-the-readers-schema)
- schema evolution in Protocol Buffers, [Field tags and schema evolution](/en/ch5#field-tags-and-schema-evolution)
- schema-on-read, [Schema flexibility in the document model](/en/ch3#sec_datamodels_schema_flexibility), [Encoding and Evolution](/en/ch5#ch_encoding), [The Merits of Schemas](/en/ch5#sec_encoding_schemas)
- exactly-once semantics, [Exactly-once message processing](/en/ch8#sec_transactions_exactly_once), [Exactly-once message processing revisited](/en/ch8#exactly-once-message-processing-revisited), [Fault Tolerance](/en/ch12#sec_stream_fault_tolerance), [Exactly-once execution of an operation](/en/ch13#id353)
- parity with batch processors, [Unifying batch and stream processing](/en/ch13#id338)
- preservation of integrity, [Correctness of dataflow systems](/en/ch13#id453)
- using durable execution, [Durable execution](/en/ch5#durable-execution)
- exclusive mode (locks), [Implementation of two-phase locking](/en/ch8#implementation-of-two-phase-locking)
- exponential backoff, [Describing Performance](/en/ch2#sec_introduction_percentiles), [Handling errors and aborts](/en/ch8#handling-errors-and-aborts)
- ext4 (file system), [Distributed Filesystems](/en/ch11#sec_batch_dfs)
- eXtended Architecture transactions (see XA transactions)
- extract-transform-load (see ETL)
### F
- Facebook
- Faiss (vector index), [Vector Embeddings](/en/ch4#id92)
- React (user interface library), [End-to-end event streams](/en/ch13#id349)
- social graphs, [Graph-Like Data Models](/en/ch3#sec_datamodels_graph)
- facts
- fact table (star schema), [Stars and Snowflakes: Schemas for Analytics](/en/ch3#sec_datamodels_analytics)
- in Datalog, [Datalog: Recursive Relational Queries](/en/ch3#id62)
- in event sourcing, [Event Sourcing and CQRS](/en/ch3#sec_datamodels_events)
- fail-slow faults, [System Model and Reality](/en/ch9#sec_distributed_system_model)
- fail-stop model, [System Model and Reality](/en/ch9#sec_distributed_system_model)
- failover, [Leader failure: Failover](/en/ch6#leader-failure-failover), [Glossary](/en/glossary)
- (see also leader-based replication)
- in leaderless replication, absence of, [Writing to the Database When a Node Is Down](/en/ch6#id287)
- leader election, [Distributed Locks and Leases](/en/ch9#sec_distributed_lock_fencing), [Consensus](/en/ch10#sec_consistency_consensus), [From single-leader replication to consensus](/en/ch10#from-single-leader-replication-to-consensus)
- potential problems, [Leader failure: Failover](/en/ch6#leader-failure-failover)
- failures
- amplification by distributed transactions, [Maintaining derived state](/en/ch13#id446)
- failure detection, [Detecting Faults](/en/ch9#id307)
- automatic rebalancing causing cascading failures, [Operations: Automatic or Manual Rebalancing](/en/ch7#sec_sharding_operations)
- timeouts and unbounded delays, [Timeouts and Unbounded Delays](/en/ch9#sec_distributed_queueing), [Network congestion and queueing](/en/ch9#network-congestion-and-queueing)
- using a coordination service, [Coordination Services](/en/ch10#sec_consistency_coordination)
- faults versus, [Reliability and Fault Tolerance](/en/ch2#sec_introduction_reliability)
- partial failures, [Faults and Partial Failures](/en/ch9#sec_distributed_partial_failure), [Summary](/en/ch9#summary)
- Faiss (vector index), [Vector Embeddings](/en/ch4#id92)
- false positive (Bloom filters), [Bloom filters](/en/ch4#bloom-filters)
- fan-out (messaging systems), [Materializing and Updating Timelines](/en/ch2#sec_introduction_materializing), [Multiple consumers](/en/ch12#id298)
- fault injection, [Fault Tolerance](/en/ch2#id27), [Network Faults in Practice](/en/ch9#sec_distributed_network_faults), [Fault injection](/en/ch9#sec_fault_injection)
- fault isolation, [Sharding for Multitenancy](/en/ch7#sec_sharding_multitenancy)
- fault tolerance, [Reliability and Fault Tolerance](/en/ch2#sec_introduction_reliability)-[Humans and Reliability](/en/ch2#id31), [Glossary](/en/glossary)
- formalization in consensus, [Single-value consensus](/en/ch10#single-value-consensus)
- human fault tolerance, [Batch Processing](/en/ch11#ch_batch)
- in batch processing, [Handling Faults](/en/ch11#id281)
- in log-based systems, [Applying end-to-end thinking in data systems](/en/ch13#id357), [Timeliness and Integrity](/en/ch13#sec_future_integrity)-[Correctness of dataflow systems](/en/ch13#id453)
- in stream processing, [Fault Tolerance](/en/ch12#sec_stream_fault_tolerance)-[Rebuilding state after a failure](/en/ch12#sec_stream_state_fault_tolerance)
- atomic commit, [Atomic commit revisited](/en/ch12#sec_stream_atomic_commit)
- idempotence, [Idempotence](/en/ch12#sec_stream_idempotence)
- maintaining derived state, [Maintaining derived state](/en/ch13#id446)
- microbatching and checkpointing, [Microbatching and checkpointing](/en/ch12#id329)
- rebuilding state after a failure, [Rebuilding state after a failure](/en/ch12#sec_stream_state_fault_tolerance)
- of distributed transactions, [XA transactions](/en/ch8#xa-transactions)-[Exactly-once message processing revisited](/en/ch8#exactly-once-message-processing-revisited)
- of leader-based and leaderless replication, [Single-Leader Versus Leaderless Replication Performance](/en/ch6#sec_replication_leaderless_perf)
- transaction atomicity, [Atomicity](/en/ch8#sec_transactions_acid_atomicity), [Distributed Transactions](/en/ch8#sec_transactions_distributed)-[Exactly-once message processing](/en/ch8#sec_transactions_exactly_once)
- faults
- Byzantine faults, [Byzantine Faults](/en/ch9#sec_distributed_byzantine)-[Weak forms of lying](/en/ch9#weak-forms-of-lying)
- failures versus, [Reliability and Fault Tolerance](/en/ch2#sec_introduction_reliability)
- handled by transactions, [Transactions](/en/ch8#ch_transactions)
- handling in supercomputers and cloud computing, [Cloud Computing Versus Supercomputing](/en/ch1#id17)
- hardware, [Hardware and Software Faults](/en/ch2#sec_introduction_hardware_faults)
- in distributed systems, [Faults and Partial Failures](/en/ch9#sec_distributed_partial_failure)
- introducing deliberately (see fault injection)
- network faults, [Network Faults in Practice](/en/ch9#sec_distributed_network_faults)-[Detecting Faults](/en/ch9#id307)
- asymmetric faults, [The Majority Rules](/en/ch9#sec_distributed_majority)
- detecting, [Detecting Faults](/en/ch9#id307)
- tolerance of, in multi-leader replication, [Geographically Distributed Operation](/en/ch6#sec_replication_multi_dc)
- software faults, [Software faults](/en/ch2#software-faults)
- tolerating (see fault tolerance)
- feature engineering (machine learning), [From data warehouse to data lake](/en/ch1#from-data-warehouse-to-data-lake)
- federated databases, [The meta-database of everything](/en/ch13#id341)
- Feldera (database)
- incremental view maintenance, [Maintaining materialized views](/en/ch12#sec_stream_mat_view)
- fence (CPU instruction), [Linearizability and network delays](/en/ch10#linearizability-and-network-delays)
- fencing (preventing split brain), [Leader failure: Failover](/en/ch6#leader-failure-failover), [Fencing off zombies and delayed requests](/en/ch9#sec_distributed_fencing_tokens)-[Fencing with multiple replicas](/en/ch9#fencing-with-multiple-replicas)
- generating fencing tokens, [Using shared logs](/en/ch10#sec_consistency_smr), [Coordination Services](/en/ch10#sec_consistency_coordination)
- properties of fencing tokens, [Defining the correctness of an algorithm](/en/ch9#defining-the-correctness-of-an-algorithm)
- stream processors writing to databases, [Idempotence](/en/ch12#sec_stream_idempotence), [Exactly-once execution of an operation](/en/ch13#id353)
- fetch-and-add
- relation to consensus, [Fetch-and-add as consensus](/en/ch10#fetch-and-add-as-consensus)
- Fibre Channel (networks), [Distributed Filesystems](/en/ch11#sec_batch_dfs)
- field tags (Protocol Buffers), [Protocol Buffers](/en/ch5#sec_encoding_protobuf)-[Field tags and schema evolution](/en/ch5#field-tags-and-schema-evolution)
- Figma (graphics software), [Real-time collaboration, offline-first, and local-first apps](/en/ch6#real-time-collaboration-offline-first-and-local-first-apps)
- filesystem in userspace (FUSE), [Setting Up New Followers](/en/ch6#sec_replication_new_replica), [Distributed Filesystems](/en/ch11#sec_batch_dfs)
- on object storage, [Object Stores](/en/ch11#id277)
- financial data
- accounting ledgers, [Summary](/en/ch3#summary)
- immutability, [Advantages of immutable events](/en/ch12#sec_stream_immutability_pros)
- time series data, [DataFrames, Matrices, and Arrays](/en/ch3#sec_datamodels_dataframes)
- Fivetran, [Data Warehousing](/en/ch1#sec_introduction_dwh)
- FizzBee (specification language), [Model checking and specification languages](/en/ch9#model-checking-and-specification-languages)
- flat index (vector index), [Vector Embeddings](/en/ch4#id92)
- FlatBuffers (data format), [Formats for Encoding Data](/en/ch5#sec_encoding_formats)
- Flink (processing framework), [Batch Processing](/en/ch11#ch_batch), [Dataflow Engines](/en/ch11#sec_batch_dataflow)
- cost efficiency, [Query languages](/en/ch11#sec_batch_query_lanauges)
- DataFrames, [DataFrames, Matrices, and Arrays](/en/ch3#sec_datamodels_dataframes), [DataFrames](/en/ch11#id287)
- fault tolerance, [Handling Faults](/en/ch11#id281), [Microbatching and checkpointing](/en/ch12#id329), [Rebuilding state after a failure](/en/ch12#sec_stream_state_fault_tolerance)
- FlinkML, [Machine Learning](/en/ch11#id290)
- for data warehouses, [Cloud Data Warehouses](/en/ch4#sec_cloud_data_warehouses)
- high availability using ZooKeeper, [Coordination Services](/en/ch10#sec_consistency_coordination)
- integration of batch and stream processing, [Unifying batch and stream processing](/en/ch13#id338)
- query optimizer, [Query languages](/en/ch11#sec_batch_query_lanauges)
- shuffling data, [Shuffling Data](/en/ch11#sec_shuffle)
- stream processing, [Stream analytics](/en/ch12#id318)
- streaming SQL support, [Complex event processing](/en/ch12#id317)
- flow control, [The Limitations of TCP](/en/ch9#sec_distributed_tcp), [Messaging Systems](/en/ch12#sec_stream_messaging), [Glossary](/en/glossary)
- FLP result (on consensus), [Consensus](/en/ch10#sec_consistency_consensus)
- Flyte (workflow scheduler), [Machine Learning](/en/ch11#id290)
- followers, [Single-Leader Replication](/en/ch6#sec_replication_leader), [Glossary](/en/glossary)
- (see also leader-based replication)
- formal methods, [Formal Methods and Randomized Testing](/en/ch9#sec_distributed_formal)-[Deterministic simulation testing](/en/ch9#deterministic-simulation-testing)
- forward compatibility, [Encoding and Evolution](/en/ch5#ch_encoding)
- forward decay (algorithm), [Use of Response Time Metrics](/en/ch2#sec_introduction_slo_sla)
- Fossil (version control system), [Concurrency control](/en/ch12#sec_stream_concurrency)
- shunning (deleting data), [Limitations of immutability](/en/ch12#sec_stream_immutability_limitations)
- FoundationDB (database)
- consistency model, [What Makes a System Linearizable?](/en/ch10#sec_consistency_lin_definition)
- deterministic simulation testing, [Deterministic simulation testing](/en/ch9#deterministic-simulation-testing)
- key-range sharding, [Sharding by Key Range](/en/ch7#sec_sharding_key_range)
- process-per-core model, [Pros and Cons of Sharding](/en/ch7#sec_sharding_reasons)
- serializable transactions, [Serializable Snapshot Isolation (SSI)](/en/ch8#sec_transactions_ssi), [Performance of serializable snapshot isolation](/en/ch8#performance-of-serializable-snapshot-isolation)
- transactions, [What Exactly Is a Transaction?](/en/ch8#sec_transactions_overview), [Database-internal Distributed Transactions](/en/ch8#sec_transactions_internal)
- fractional indexing, [When to Use Which Model](/en/ch3#sec_datamodels_document_summary)
- fragmentation (of B-trees), [Disk space usage](/en/ch4#disk-space-usage)
- frame (computer graphics), [Pros and cons of sync engines](/en/ch6#pros-and-cons-of-sync-engines)
- frontend (web development), [Trade-offs in Data Systems Architecture](/en/ch1#ch_tradeoffs)
- FrostDB (database)
- deterministic simulation testing (DST), [Deterministic simulation testing](/en/ch9#deterministic-simulation-testing)
- fsync (system call), [Making B-trees reliable](/en/ch4#sec_storage_btree_wal), [Durability](/en/ch8#durability)
- full-text search, [Full-Text Search](/en/ch4#sec_storage_full_text), [Glossary](/en/glossary)
- and fuzzy indexes, [Full-Text Search](/en/ch4#sec_storage_full_text)
- Lucene storage engine, [Full-Text Search](/en/ch4#sec_storage_full_text)
- sharded indexes, [Sharding and Secondary Indexes](/en/ch7#sec_sharding_secondary_indexes)
- Function as a Service (FaaS), [Microservices and Serverless](/en/ch1#sec_introduction_microservices)
- functional programming
- inspiration for MapReduce, [MapReduce](/en/ch11#sec_batch_mapreduce)
- functional requirements, [Defining Nonfunctional Requirements](/en/ch2#ch_nonfunctional)
- FUSE (see filesystem in userspace (FUSE))
- fuzzing, [Formal Methods and Randomized Testing](/en/ch9#sec_distributed_formal)
- fuzzy search (see similarity search)
### G
- Gallina (specification language), [Model checking and specification languages](/en/ch9#model-checking-and-specification-languages)
- game development, [Pros and cons of sync engines](/en/ch6#pros-and-cons-of-sync-engines)
- garbage collection
- immutability and, [Limitations of immutability](/en/ch12#sec_stream_immutability_limitations)
- process pauses for, [Latency and Response Time](/en/ch2#id23), [Process Pauses](/en/ch9#sec_distributed_clocks_pauses)-[Limiting the impact of garbage collection](/en/ch9#sec_distributed_gc_impact), [The Majority Rules](/en/ch9#sec_distributed_majority)
- (see also process pauses)
- gas stations, algorithmic pricing, [Feedback Loops](/en/ch14#id372)
- GDPR (regulation), [Data Systems, Law, and Society](/en/ch1#sec_introduction_compliance), [Limitations of immutability](/en/ch12#sec_stream_immutability_limitations)
- consent, [Consent and Freedom of Choice](/en/ch14#id375)
- data minimization, [Legislation and Self-Regulation](/en/ch14#sec_future_legislation)
- legitimate interest, [Consent and Freedom of Choice](/en/ch14#id375)
- right of access, [Sharding for Multitenancy](/en/ch7#sec_sharding_multitenancy)
- right to erasure, [Data Systems, Law, and Society](/en/ch1#sec_introduction_compliance), [Disk space usage](/en/ch4#disk-space-usage), [Sharding for Multitenancy](/en/ch7#sec_sharding_multitenancy)
- GenBank (genome database), [Summary](/en/ch3#summary)
- General Data Protection Regulation (see GDPR (regulation))
- genome analysis, [Summary](/en/ch3#summary)
- geographic distribution (see regions (geographic distribution))
- geospatial indexes, [Multidimensional and Full-Text Indexes](/en/ch4#sec_storage_multidimensional)
- Git (version control system), [Concurrency control](/en/ch12#sec_stream_concurrency)
- local-first software, [Real-time collaboration, offline-first, and local-first apps](/en/ch6#real-time-collaboration-offline-first-and-local-first-apps)
- merge conflicts, [Manual conflict resolution](/en/ch6#manual-conflict-resolution)
- GitHub, postmortems, [Leader failure: Failover](/en/ch6#leader-failure-failover), [Mapping system models to the real world](/en/ch9#mapping-system-models-to-the-real-world)
- global secondary indexes, [Global Secondary Indexes](/en/ch7#id167), [Summary](/en/ch7#summary)
- globally unique identifiers (see UUIDs)
- GlusterFS (distributed filesystem), [Batch Processing](/en/ch11#ch_batch), [Distributed Filesystems](/en/ch11#sec_batch_dfs), [Object Stores](/en/ch11#id277)
- GNU Coreutils (Linux), [Sorting Versus In-memory Aggregation](/en/ch11#id275)
- Go (programming language)
- garbage collection, [Limiting the impact of garbage collection](/en/ch9#sec_distributed_gc_impact)
- GoldenGate (change data capture), [Implementing change data capture](/en/ch12#id307)
- (see also Oracle)
- Google
- BigQuery (see BigQuery (database))
- Bigtable (see Bigtable (database))
- Chubby (lock service), [Coordination Services](/en/ch10#sec_consistency_coordination)
- Cloud Storage (object storage), [Setting Up New Followers](/en/ch6#sec_replication_new_replica), [Object Stores](/en/ch11#id277)
- request preconditions, [Fencing off zombies and delayed requests](/en/ch9#sec_distributed_fencing_tokens)
- Compute Engine
- preemptible instances, [Handling Faults](/en/ch11#id281)
- Dataflow (stream processor), [Stream analytics](/en/ch12#id318), [Atomic commit revisited](/en/ch12#sec_stream_atomic_commit), [Unifying batch and stream processing](/en/ch13#id338)
- (see also Beam)
- data warehouse integration, [Cloud Data Warehouses](/en/ch4#sec_cloud_data_warehouses)
- shuffling data, [Shuffling Data](/en/ch11#sec_shuffle)
- Datastream (change data capture), [API support for change streams](/en/ch12#sec_stream_change_api)
- Docs (collaborative editor), [Real-time collaboration, offline-first, and local-first apps](/en/ch6#real-time-collaboration-offline-first-and-local-first-apps), [CRDTs and Operational Transformation](/en/ch6#sec_replication_crdts)
- operational transformation, [CRDTs and Operational Transformation](/en/ch6#sec_replication_crdts)
- Dremel (query engine), [Column-Oriented Storage](/en/ch4#sec_storage_column)
- Firestore (database), [Pros and cons of sync engines](/en/ch6#pros-and-cons-of-sync-engines)
- MapReduce (batch processing), [Batch Processing](/en/ch11#ch_batch)
- (see also MapReduce)
- Percolator (transaction system), [Implementing a linearizable ID generator](/en/ch10#implementing-a-linearizable-id-generator)
- persistent disks (cloud service), [Separation of storage and compute](/en/ch1#sec_introduction_storage_compute)
- Pub/Sub (messaging), [Message brokers](/en/ch5#message-brokers), [Message brokers compared to databases](/en/ch12#id297), [Using logs for message storage](/en/ch12#id300)
- response time study, [Average, Median, and Percentiles](/en/ch2#id24)
- Sheets (collaborative spreadsheet), [Real-time collaboration, offline-first, and local-first apps](/en/ch6#real-time-collaboration-offline-first-and-local-first-apps), [CRDTs and Operational Transformation](/en/ch6#sec_replication_crdts)
- Spanner (see Spanner (database))
- TrueTime (clock API), [Clock readings with a confidence interval](/en/ch9#clock-readings-with-a-confidence-interval)
- gossip protocol, [Request Routing](/en/ch7#sec_sharding_routing)
- governance, [Beyond the data lake](/en/ch1#beyond-the-data-lake)
- government use of data, [Data as Assets and Power](/en/ch14#id376)
- GPS (Global Positioning System)
- use for clock synchronization, [Unreliable Clocks](/en/ch9#sec_distributed_clocks), [Clock Synchronization and Accuracy](/en/ch9#sec_distributed_clock_accuracy), [Clock readings with a confidence interval](/en/ch9#clock-readings-with-a-confidence-interval), [Synchronized clocks for global snapshots](/en/ch9#sec_distributed_spanner)
- GPT (language model), [Vector Embeddings](/en/ch4#id92)
- GPU (graphics processing unit), [Layering of cloud services](/en/ch1#layering-of-cloud-services), [Distributed Versus Single-Node Systems](/en/ch1#sec_introduction_distributed)
- gradual rollout (see rolling upgrades)
- GraphQL (query language), [GraphQL](/en/ch3#id63)
- validation, [Pros and cons of stored procedures](/en/ch8#sec_transactions_stored_proc_tradeoffs)
- graphs, [Glossary](/en/glossary)
- as data models, [Graph-Like Data Models](/en/ch3#sec_datamodels_graph)-[GraphQL](/en/ch3#id63)
- property graphs, [Property Graphs](/en/ch3#id56)
- RDF and triple-stores, [Triple-Stores and SPARQL](/en/ch3#id59)-[The SPARQL query language](/en/ch3#the-sparql-query-language)
- DAGs (see directed acyclic graphs)
- processing and analysis, [Machine Learning](/en/ch11#id290)
- query languages
- Cypher, [The Cypher Query Language](/en/ch3#id57)
- Datalog, [Datalog: Recursive Relational Queries](/en/ch3#id62)
- GraphQL, [GraphQL](/en/ch3#id63)
- Gremlin, [Graph-Like Data Models](/en/ch3#sec_datamodels_graph)
- recursive SQL queries, [Graph Queries in SQL](/en/ch3#id58)
- SPARQL, [The SPARQL query language](/en/ch3#the-sparql-query-language)
- traversal, [Property Graphs](/en/ch3#id56)
- gray failures, [System Model and Reality](/en/ch9#sec_distributed_system_model)
- in leaderless replication, [Single-Leader Versus Leaderless Replication Performance](/en/ch6#sec_replication_leaderless_perf)
- Gremlin (graph query language), [Graph-Like Data Models](/en/ch3#sec_datamodels_graph)
- grep (Unix tool), [Simple Log Analysis](/en/ch11#sec_batch_log_analysis)
- gRPC (service calls), [Microservices and Serverless](/en/ch1#sec_introduction_microservices), [Web services](/en/ch5#sec_web_services)
- forward and backward compatibility, [Data encoding and evolution for RPC](/en/ch5#data-encoding-and-evolution-for-rpc)
- GUIDs (see UUIDs)
### H
- Hadoop (data infrastructure)
- comparison to distributed databases, [Batch Processing](/en/ch11#ch_batch)
- MapReduce (see MapReduce)
- NodeManager, [Distributed Job Orchestration](/en/ch11#id278)
- YARN (see YARN (job scheduler))
- HANA (see SAP HANA (database))
- happens-before relation, [The "happens-before" relation and concurrency](/en/ch6#sec_replication_happens_before)
- hard disks
- access patterns, [Sequential versus random writes](/en/ch4#sidebar_sequential)
- detecting corruption, [The end-to-end argument](/en/ch13#sec_future_e2e_argument), [Don't just blindly trust what they promise](/en/ch13#id364)
- faults in, [Hardware and Software Faults](/en/ch2#sec_introduction_hardware_faults), [Durability](/en/ch8#durability)
- sequential vs. random writes, [Sequential versus random writes](/en/ch4#sidebar_sequential)
- sequential write throughput, [Disk space usage](/en/ch12#sec_stream_disk_usage)
- hardware faults, [Hardware and Software Faults](/en/ch2#sec_introduction_hardware_faults)
- hash function
- in Bloom filters, [Bloom filters](/en/ch4#bloom-filters)
- hash join
- in stream processing, [Stream-table join (stream enrichment)](/en/ch12#sec_stream_table_joins)
- hash sharding, [Sharding by Hash of Key](/en/ch7#sec_sharding_hash)-[Consistent hashing](/en/ch7#sec_sharding_consistent_hashing), [Summary](/en/ch7#summary)
- consistent hashing, [Consistent hashing](/en/ch7#sec_sharding_consistent_hashing)
- problems with hash mod N, [Hash modulo number of nodes](/en/ch7#hash-modulo-number-of-nodes)
- range queries, [Sharding by hash range](/en/ch7#sharding-by-hash-range)
- suitable hash functions, [Sharding by Hash of Key](/en/ch7#sec_sharding_hash)
- with fixed number of shards, [Fixed number of shards](/en/ch7#fixed-number-of-shards)
- hash tables, [Log-Structured Storage](/en/ch4#sec_storage_log_structured)
- Hazelcast (in-memory data grid)
- FencedLock, [Fencing off zombies and delayed requests](/en/ch9#sec_distributed_fencing_tokens)
- Flake ID Generator, [ID Generators and Logical Clocks](/en/ch10#sec_consistency_logical)
- HBase (database)
- bug due to lack of fencing, [Distributed Locks and Leases](/en/ch9#sec_distributed_lock_fencing)
- key-range sharding, [Sharding by Key Range](/en/ch7#sec_sharding_key_range)
- log-structured storage, [Constructing and merging SSTables](/en/ch4#constructing-and-merging-sstables)
- regions (sharding), [Sharding](/en/ch7#ch_sharding)
- request routing, [Request Routing](/en/ch7#sec_sharding_routing)
- size-tiered compaction, [Compaction strategies](/en/ch4#sec_storage_lsm_compaction)
- wide-column data model, [Data locality for reads and writes](/en/ch3#sec_datamodels_document_locality), [Column Compression](/en/ch4#sec_storage_column_compression)
- HDFS (Hadoop Distributed File System), [Batch Processing](/en/ch11#ch_batch), [Distributed Filesystems](/en/ch11#sec_batch_dfs)
- (see also distributed filesystems)
- checking data integrity, [Don't just blindly trust what they promise](/en/ch13#id364)
- DataNode, [Distributed Filesystems](/en/ch11#sec_batch_dfs)
- NameNode, [Distributed Filesystems](/en/ch11#sec_batch_dfs)
- use in MapReduce, [MapReduce](/en/ch11#sec_batch_mapreduce)
- workflow example, [Scheduling Workflows](/en/ch11#sec_batch_workflows)
- HdrHistogram (numerical library), [Use of Response Time Metrics](/en/ch2#sec_introduction_slo_sla)
- head (Unix tool), [Simple Log Analysis](/en/ch11#sec_batch_log_analysis), [Distributed Job Orchestration](/en/ch11#id278)
- head vertex (property graphs), [Property Graphs](/en/ch3#id56)
- head-of-line blocking, [Latency and Response Time](/en/ch2#id23)
- heap files (databases), [Storing values within the index](/en/ch4#sec_storage_index_heap)
- in multiversion concurrency control, [Multi-version concurrency control (MVCC)](/en/ch8#sec_transactions_snapshot_impl)
- heat management, [Skewed Workloads and Relieving Hot Spots](/en/ch7#sec_sharding_skew)
- hedged requests, [Single-Leader Versus Leaderless Replication Performance](/en/ch6#sec_replication_leaderless_perf)
- heterogeneous distributed transactions, [Distributed Transactions Across Different Systems](/en/ch8#sec_transactions_xa), [Problems with XA transactions](/en/ch8#problems-with-xa-transactions)
- heuristic decisions (in 2PC), [Recovering from coordinator failure](/en/ch8#recovering-from-coordinator-failure)
- Hex (notebook), [Machine Learning](/en/ch11#id290)
- hexagons
- for geospatial indexing, [Multidimensional and Full-Text Indexes](/en/ch4#sec_storage_multidimensional)
- Hibernate (object-relational mapper), [Object-relational mapping (ORM)](/en/ch3#object-relational-mapping-orm)
- hierarchical model, [Relational Model versus Document Model](/en/ch3#sec_datamodels_history)
- hierarchical navigable small world (vector index), [Vector Embeddings](/en/ch4#id92)
- hierarchical queries (see recursive common table expressions)
- high availability (see fault tolerance)
- high-frequency trading, [Clock Synchronization and Accuracy](/en/ch9#sec_distributed_clock_accuracy)
- high-performance computing (HPC), [Cloud Computing Versus Supercomputing](/en/ch1#id17)
- hinted handoff (leaderless replication), [Catching up on missed writes](/en/ch6#sec_replication_read_repair)
- histograms, [Use of Response Time Metrics](/en/ch2#sec_introduction_slo_sla)
- Hive (data warehouse), [Cloud Data Warehouses](/en/ch4#sec_cloud_data_warehouses)
- query optimizer, [Query languages](/en/ch11#sec_batch_query_lanauges)
- HNSW (vector index), [Vector Embeddings](/en/ch4#id92)
- hopping windows (stream processing), [Types of windows](/en/ch12#id324)
- (see also windows)
- Hoptimator (query engine), [The meta-database of everything](/en/ch13#id341)
- Horizon scandal, [Humans and Reliability](/en/ch2#id31)
- lack of transactions, [Transactions](/en/ch8#ch_transactions)
- horizontal scaling (see scaling out)
- by sharding, [Pros and Cons of Sharding](/en/ch7#sec_sharding_reasons)
- HornetQ (messaging), [Message brokers](/en/ch5#message-brokers), [Message brokers compared to databases](/en/ch12#id297)
- distributed transaction support, [XA transactions](/en/ch8#xa-transactions)
- hot keys, [Sharding of Key-Value Data](/en/ch7#sec_sharding_key_value)
- hot spots, [Sharding of Key-Value Data](/en/ch7#sec_sharding_key_value)
- due to celebrities, [Skewed Workloads and Relieving Hot Spots](/en/ch7#sec_sharding_skew)
- for time-series data, [Sharding by Key Range](/en/ch7#sec_sharding_key_range)
- relieving, [Skewed Workloads and Relieving Hot Spots](/en/ch7#sec_sharding_skew)
- hot standbys (see leader-based replication)
- HTAP (see hybrid transactional/analytic processing)
- HTTP, use in APIs (see services)
- human errors, [Humans and Reliability](/en/ch2#id31), [Network Faults in Practice](/en/ch9#sec_distributed_network_faults), [Batch Processing](/en/ch11#ch_batch)
- hybrid logical clocks, [Hybrid logical clocks](/en/ch10#hybrid-logical-clocks)
- hybrid transactional/analytic processing, [Data Warehousing](/en/ch1#sec_introduction_dwh), [Data Storage for Analytics](/en/ch4#sec_storage_analytics)
- hydrating IDs (join), [Denormalization in the social networking case study](/en/ch3#denormalization-in-the-social-networking-case-study)
- hypergraph, [Property Graphs](/en/ch3#id56)
- HyperLogLog (algorithm), [Stream analytics](/en/ch12#id318)
### I
- I/O operations, waiting for, [Process Pauses](/en/ch9#sec_distributed_clocks_pauses)
- IaaS (see infrastructure as a service (IaaS))
- IBM
- Db2 (database)
- distributed transaction support, [XA transactions](/en/ch8#xa-transactions)
- serializable isolation, [Snapshot isolation, repeatable read, and naming confusion](/en/ch8#snapshot-isolation-repeatable-read-and-naming-confusion), [Implementation of two-phase locking](/en/ch8#implementation-of-two-phase-locking)
- MQ (messaging), [Message brokers compared to databases](/en/ch12#id297)
- distributed transaction support, [XA transactions](/en/ch8#xa-transactions)
- System R (database), [What Exactly Is a Transaction?](/en/ch8#sec_transactions_overview)
- WebSphere (messaging), [Message brokers](/en/ch5#message-brokers)
- Iceberg (table format), [Cloud Data Warehouses](/en/ch4#sec_cloud_data_warehouses)
- databases on object storage, [Setting Up New Followers](/en/ch6#sec_replication_new_replica)
- log-based message broker storage, [Disk space usage](/en/ch12#sec_stream_disk_usage)
- idempotence, [The problems with remote procedure calls (RPCs)](/en/ch5#sec_problems_with_rpc), [Idempotence](/en/ch12#sec_stream_idempotence), [Glossary](/en/glossary)
- by giving operations unique IDs, [Multi-shard request processing](/en/ch13#id360)
- by giving requests unique IDs, [Uniquely identifying requests](/en/ch13#id355)
- for exactly-once semantics, [Exactly-once message processing revisited](/en/ch8#exactly-once-message-processing-revisited)
- idempotent operations, [Exactly-once execution of an operation](/en/ch13#id353)
- in workflow engines, [Durable execution](/en/ch5#durable-execution)
- immutability
- advantages of, [Advantages of immutable events](/en/ch12#sec_stream_immutability_pros), [Designing for auditability](/en/ch13#id365)
- and right to erasure, [Data Systems, Law, and Society](/en/ch1#sec_introduction_compliance), [Disk space usage](/en/ch4#disk-space-usage)
- crypto-shredding for deletion, [Event Sourcing and CQRS](/en/ch3#sec_datamodels_events), [Limitations of immutability](/en/ch12#sec_stream_immutability_limitations)
- deriving state from event log, [State, Streams, and Immutability](/en/ch12#sec_stream_immutability)-[Limitations of immutability](/en/ch12#sec_stream_immutability_limitations)
- for crash recovery, [Constructing and merging SSTables](/en/ch4#constructing-and-merging-sstables)
- in B-trees, [B-tree variants](/en/ch4#b-tree-variants), [Indexes and snapshot isolation](/en/ch8#indexes-and-snapshot-isolation)
- in event sourcing, [Event Sourcing and CQRS](/en/ch3#sec_datamodels_events), [Change data capture versus event sourcing](/en/ch12#sec_stream_event_sourcing)
- limitations of, [Concurrency control](/en/ch12#sec_stream_concurrency)
- impedance mismatch, [The Object-Relational Mismatch](/en/ch3#sec_datamodels_document)
- in doubt (transaction status), [Coordinator failure](/en/ch8#coordinator-failure)
- holding locks, [Holding locks while in doubt](/en/ch8#holding-locks-while-in-doubt)
- orphaned transactions, [Recovering from coordinator failure](/en/ch8#recovering-from-coordinator-failure)
- in-memory databases, [Keeping everything in memory](/en/ch4#sec_storage_inmemory)
- durability, [Durability](/en/ch8#durability)
- serial transaction execution, [Actual Serial Execution](/en/ch8#sec_transactions_serial)
- incidents
- accounting software bugs leading to wrongful convictions, [Humans and Reliability](/en/ch2#id31)
- blameless postmortems, [Humans and Reliability](/en/ch2#id31)
- crashes due to leap seconds, [Clock Synchronization and Accuracy](/en/ch9#sec_distributed_clock_accuracy)
- data corruption and financial losses due to concurrency bugs, [Weak Isolation Levels](/en/ch8#sec_transactions_isolation_levels)
- data corruption on hard disks, [Durability](/en/ch8#durability)
- data loss due to last-write-wins, [Timestamps for ordering events](/en/ch9#sec_distributed_lww)
- data on disks unreadable, [Mapping system models to the real world](/en/ch9#mapping-system-models-to-the-real-world)
- disclosure of sensitive data due to primary key reuse, [Leader failure: Failover](/en/ch6#leader-failure-failover)
- errors in transaction serializability, [Maintaining integrity in the face of software bugs](/en/ch13#id455)
- gigabit network interface with 1 Kb/s throughput, [System Model and Reality](/en/ch9#sec_distributed_system_model)
- leap second crash, [Software faults](/en/ch2#software-faults)
- network faults, [Network Faults in Practice](/en/ch9#sec_distributed_network_faults)
- network interface dropping only inbound packets, [Network Faults in Practice](/en/ch9#sec_distributed_network_faults)
- network partitions and whole-datacenter failures, [Faults and Partial Failures](/en/ch9#sec_distributed_partial_failure)
- poor handling of network faults, [Network Faults in Practice](/en/ch9#sec_distributed_network_faults)
- sending message to ex-partner, [Ordering events to capture causality](/en/ch13#sec_future_capture_causality)
- sharks biting undersea cables, [Network Faults in Practice](/en/ch9#sec_distributed_network_faults)
- split brain due to 1-minute packet delay, [Leader failure: Failover](/en/ch6#leader-failure-failover), [Network Faults in Practice](/en/ch9#sec_distributed_network_faults)
- SSD failure after 32,768 hours, [Software faults](/en/ch2#software-faults)
- thread contention bringing down a service, [Process Pauses](/en/ch9#sec_distributed_clocks_pauses)
- vibrations in server rack, [Latency and Response Time](/en/ch2#id23)
- violation of uniqueness constraint, [Maintaining integrity in the face of software bugs](/en/ch13#id455)
- incremental view maintenance (IVM), [Maintaining materialized views](/en/ch12#sec_stream_mat_view)
- for data integration, [Unbundled versus integrated systems](/en/ch13#id448)
- indexes, [Storage and Indexing for OLTP](/en/ch4#sec_storage_oltp), [Glossary](/en/glossary)
- and snapshot isolation, [Indexes and snapshot isolation](/en/ch8#indexes-and-snapshot-isolation)
- as derived data, [Systems of Record and Derived Data](/en/ch1#sec_introduction_derived), [Composing Data Storage Technologies](/en/ch13#id447)-[Unbundled versus integrated systems](/en/ch13#id448)
- B-trees, [B-Trees](/en/ch4#sec_storage_b_trees)-[B-tree variants](/en/ch4#b-tree-variants)
- clustered, [Storing values within the index](/en/ch4#sec_storage_index_heap)
- comparison of B-trees and LSM-trees, [Comparing B-Trees and LSM-Trees](/en/ch4#sec_storage_btree_lsm_comparison)-[Disk space usage](/en/ch4#disk-space-usage)
- covering (with included columns), [Storing values within the index](/en/ch4#sec_storage_index_heap)
- creating, [Creating an index](/en/ch13#id340)
- full-text search, [Full-Text Search](/en/ch4#sec_storage_full_text)
- geospatial, [Multidimensional and Full-Text Indexes](/en/ch4#sec_storage_multidimensional)
- index-range locking, [Index-range locks](/en/ch8#sec_transactions_2pl_range)
- multi-column (concatenated), [Multidimensional and Full-Text Indexes](/en/ch4#sec_storage_multidimensional)
- secondary, [Multi-Column and Secondary Indexes](/en/ch4#sec_storage_index_multicolumn)
- (see also secondary indexes)
- problems with dual writes, [Keeping Systems in Sync](/en/ch12#sec_stream_sync), [Reasoning about dataflows](/en/ch13#id443)
- sharding and secondary indexes, [Sharding and Secondary Indexes](/en/ch7#sec_sharding_secondary_indexes)-[Global Secondary Indexes](/en/ch7#id167), [Summary](/en/ch7#summary)
- sparse, [The SSTable file format](/en/ch4#the-sstable-file-format)
- SSTables and LSM-trees, [The SSTable file format](/en/ch4#the-sstable-file-format)-[Compaction strategies](/en/ch4#sec_storage_lsm_compaction)
- updating when data changes, [Keeping Systems in Sync](/en/ch12#sec_stream_sync), [Maintaining materialized views](/en/ch12#sec_stream_mat_view)
- Industrial Revolution, [Remembering the Industrial Revolution](/en/ch14#id377)
- InfiniBand (networks), [Can we not simply make network delays predictable?](/en/ch9#can-we-not-simply-make-network-delays-predictable)
- InfluxDB IOx (storage engine), [Column-Oriented Storage](/en/ch4#sec_storage_column)
- information retrieval (see full-text search)
- infrastructure as a service (IaaS), [Cloud Versus Self-Hosting](/en/ch1#sec_introduction_cloud), [Layering of cloud services](/en/ch1#layering-of-cloud-services)
- InnoDB (storage engine)
- clustered index on primary key, [Storing values within the index](/en/ch4#sec_storage_index_heap)
- not preventing lost updates, [Automatically detecting lost updates](/en/ch8#automatically-detecting-lost-updates)
- preventing write skew, [Characterizing write skew](/en/ch8#characterizing-write-skew), [Implementation of two-phase locking](/en/ch8#implementation-of-two-phase-locking)
- serializable isolation, [Implementation of two-phase locking](/en/ch8#implementation-of-two-phase-locking)
- snapshot isolation support, [Snapshot Isolation and Repeatable Read](/en/ch8#sec_transactions_snapshot_isolation)
- instance (cloud computing), [Layering of cloud services](/en/ch1#layering-of-cloud-services)
- integrating different data systems (see data integration)
- integrity, [Timeliness and Integrity](/en/ch13#sec_future_integrity)
- coordination-avoiding data systems, [Coordination-avoiding data systems](/en/ch13#id454)
- correctness of dataflow systems, [Correctness of dataflow systems](/en/ch13#id453)
- in consensus formalization, [Single-value consensus](/en/ch10#single-value-consensus), [Atomic commitment as consensus](/en/ch10#atomic-commitment-as-consensus)
- integrity checks, [Don't just blindly trust what they promise](/en/ch13#id364)
- (see also auditing)
- end-to-end, [The end-to-end argument](/en/ch13#sec_future_e2e_argument), [The end-to-end argument again](/en/ch13#id456)
- use of snapshot isolation, [Snapshot Isolation and Repeatable Read](/en/ch8#sec_transactions_snapshot_isolation)
- maintaining despite software bugs, [Maintaining integrity in the face of software bugs](/en/ch13#id455)
- Interface Definition Language (IDL), [Protocol Buffers](/en/ch5#sec_encoding_protobuf), [Avro](/en/ch5#sec_encoding_avro), [Web services](/en/ch5#sec_web_services)
- invariants, [Consistency](/en/ch8#sec_transactions_acid_consistency)
- (see also constraints)
- inverted file index (vector index), [Vector Embeddings](/en/ch4#id92)
- inverted index, [Full-Text Search](/en/ch4#sec_storage_full_text)
- irreversibility, minimizing, [Evolvability: Making Change Easy](/en/ch2#sec_introduction_evolvability), [Event Sourcing and CQRS](/en/ch3#sec_datamodels_events), [Batch Processing](/en/ch11#ch_batch)
- ISDN (Integrated Services Digital Network), [Synchronous Versus Asynchronous Networks](/en/ch9#sec_distributed_sync_networks)
- isolation (in operating systems)
- cgroups (see cgroups)
- isolation (in transactions), [Isolation](/en/ch8#sec_transactions_acid_isolation), [Single-Object and Multi-Object Operations](/en/ch8#sec_transactions_multi_object), [Glossary](/en/glossary)
- correctness and, [Aiming for Correctness](/en/ch13#sec_future_correctness)
- for single-object writes, [Single-object writes](/en/ch8#sec_transactions_single_object)
- serializability, [Serializability](/en/ch8#sec_transactions_serializability)-[Performance of serializable snapshot isolation](/en/ch8#performance-of-serializable-snapshot-isolation)
- actual serial execution, [Actual Serial Execution](/en/ch8#sec_transactions_serial)-[Summary of serial execution](/en/ch8#summary-of-serial-execution)
- serializable snapshot isolation (SSI), [Serializable Snapshot Isolation (SSI)](/en/ch8#sec_transactions_ssi)-[Performance of serializable snapshot isolation](/en/ch8#performance-of-serializable-snapshot-isolation)
- two-phase locking (2PL), [Two-Phase Locking (2PL)](/en/ch8#sec_transactions_2pl)-[Index-range locks](/en/ch8#sec_transactions_2pl_range)
- violating, [Single-Object and Multi-Object Operations](/en/ch8#sec_transactions_multi_object)
- weak isolation levels, [Weak Isolation Levels](/en/ch8#sec_transactions_isolation_levels)-[Materializing conflicts](/en/ch8#materializing-conflicts)
- preventing lost updates, [Preventing Lost Updates](/en/ch8#sec_transactions_lost_update)-[Conflict resolution and replication](/en/ch8#conflict-resolution-and-replication)
- read committed, [Read Committed](/en/ch8#sec_transactions_read_committed)-[Implementing read committed](/en/ch8#sec_transactions_read_committed_impl)
- snapshot isolation, [Snapshot Isolation and Repeatable Read](/en/ch8#sec_transactions_snapshot_isolation)-[Snapshot isolation, repeatable read, and naming confusion](/en/ch8#snapshot-isolation-repeatable-read-and-naming-confusion)
- IVF (vector index), [Vector Embeddings](/en/ch4#id92)
### J
- Java Database Connectivity (JDBC)
- distributed transaction support, [XA transactions](/en/ch8#xa-transactions)
- network drivers, [The Merits of Schemas](/en/ch5#sec_encoding_schemas)
- Java Enterprise Edition (EE), [The problems with remote procedure calls (RPCs)](/en/ch5#sec_problems_with_rpc), [Two-Phase Commit (2PC)](/en/ch8#sec_transactions_2pc), [XA transactions](/en/ch8#xa-transactions)
- Java Message Service (JMS), [Message brokers compared to databases](/en/ch12#id297)
- (see also messaging systems)
- comparison to log-based messaging, [Logs compared to traditional messaging](/en/ch12#sec_stream_logs_vs_messaging), [Replaying old messages](/en/ch12#sec_stream_replay)
- distributed transaction support, [XA transactions](/en/ch8#xa-transactions)
- message ordering, [Acknowledgments and redelivery](/en/ch12#sec_stream_reordering)
- Java Transaction API (JTA), [Two-Phase Commit (2PC)](/en/ch8#sec_transactions_2pc), [XA transactions](/en/ch8#xa-transactions)
- Java Virtual Machine (JVM)
- garbage collection, [Process Pauses](/en/ch9#sec_distributed_clocks_pauses), [Limiting the impact of garbage collection](/en/ch9#sec_distributed_gc_impact)
- JIT compilation, [Query Execution: Compilation and Vectorization](/en/ch4#sec_storage_vectorized)
- process reuse in batch processors, [Dataflow Engines](/en/ch11#sec_batch_dataflow)
- Jena (RDF framework), [The RDF data model](/en/ch3#the-rdf-data-model)
- SPARQL query language, [The SPARQL query language](/en/ch3#the-sparql-query-language)
- Jepsen (fault tolerance testing), [Fault injection](/en/ch9#sec_fault_injection), [Aiming for Correctness](/en/ch13#sec_future_correctness)
- jitter (network delay), [Average, Median, and Percentiles](/en/ch2#id24), [Network congestion and queueing](/en/ch9#network-congestion-and-queueing)
- JMESPath (query language), [Query languages](/en/ch11#sec_batch_query_lanauges)
- join table, [Many-to-One and Many-to-Many Relationships](/en/ch3#sec_datamodels_many_to_many), [Property Graphs](/en/ch3#id56)
- joins, [Glossary](/en/glossary)
- expressing as relational operators, [Query languages](/en/ch11#sec_batch_query_lanauges)
- handling GraphQL query, [GraphQL](/en/ch3#id63)
- in application code, [Normalization, Denormalization, and Joins](/en/ch3#sec_datamodels_normalization), [Denormalization in the social networking case study](/en/ch3#denormalization-in-the-social-networking-case-study)
- in DataFrames, [DataFrames, Matrices, and Arrays](/en/ch3#sec_datamodels_dataframes)
- in relational and document databases, [Normalization, Denormalization, and Joins](/en/ch3#sec_datamodels_normalization)
- secondary indexes and, [Multi-Column and Secondary Indexes](/en/ch4#sec_storage_index_multicolumn)
- sort-merge joins, [JOIN and GROUP BY](/en/ch11#sec_batch_join)
- stream joins, [Stream Joins](/en/ch12#sec_stream_joins)-[Time-dependence of joins](/en/ch12#sec_stream_join_time)
- stream-stream join, [Stream-stream join (window join)](/en/ch12#id440)
- stream-table join, [Stream-table join (stream enrichment)](/en/ch12#sec_stream_table_joins)
- table-table join, [Table-table join (materialized view maintenance)](/en/ch12#id326)
- time-dependence of, [Time-dependence of joins](/en/ch12#sec_stream_join_time)
- support in document databases, [Convergence of document and relational databases](/en/ch3#convergence-of-document-and-relational-databases)
- JOTM (transaction coordinator), [Two-Phase Commit (2PC)](/en/ch8#sec_transactions_2pc)
- journaling (filesystems), [Making B-trees reliable](/en/ch4#sec_storage_btree_wal)
- JSON
- aggregation pipeline (query language), [Query languages for documents](/en/ch3#query-languages-for-documents)
- Avro schema representation, [Avro](/en/ch5#sec_encoding_avro)
- binary variants, [Binary encoding](/en/ch5#binary-encoding)
- data locality, [Data locality for reads and writes](/en/ch3#sec_datamodels_document_locality)
- document data model, [Relational Model versus Document Model](/en/ch3#sec_datamodels_history)
- for application data, issues with, [JSON, XML, and Binary Variants](/en/ch5#sec_encoding_json)
- GraphQL response, [GraphQL](/en/ch3#id63)
- in relational databases, [Schema flexibility in the document model](/en/ch3#sec_datamodels_schema_flexibility)
- representing a résumé (example), [The document data model for one-to-many relationships](/en/ch3#the-document-data-model-for-one-to-many-relationships)
- Schema, [JSON Schema](/en/ch5#json-schema)
- JSON-LD, [Triple-Stores and SPARQL](/en/ch3#id59)
- JsonPath (query language), [Query languages](/en/ch11#sec_batch_query_lanauges)
- JuiceFS (distributed filesystem), [Distributed Filesystems](/en/ch11#sec_batch_dfs), [Object Stores](/en/ch11#id277)
- Jupyter (notebook), [Machine Learning](/en/ch11#id290)
- just-in-time (JIT) compilation, [Query Execution: Compilation and Vectorization](/en/ch4#sec_storage_vectorized)
### K
- Kafka (messaging), [Message brokers](/en/ch5#message-brokers), [Using logs for message storage](/en/ch12#id300)
- consumer groups, [Multiple consumers](/en/ch12#id298)
- for data integration, [Unbundled versus integrated systems](/en/ch13#id448)
- for event sourcing, [Event Sourcing and CQRS](/en/ch3#sec_datamodels_events)
- Kafka Connect (database integration), [Implementing change data capture](/en/ch12#id307), [API support for change streams](/en/ch12#sec_stream_change_api), [Deriving several views from the same event log](/en/ch12#sec_stream_deriving_views)
- Kafka Streams (stream processor), [Stream analytics](/en/ch12#id318), [Maintaining materialized views](/en/ch12#sec_stream_mat_view)
- exactly-once semantics, [Exactly-once message processing revisited](/en/ch8#exactly-once-message-processing-revisited)
- fault tolerance, [Rebuilding state after a failure](/en/ch12#sec_stream_state_fault_tolerance)
- ksqlDB (stream database), [Maintaining materialized views](/en/ch12#sec_stream_mat_view)
- leader-based replication, [Single-Leader Replication](/en/ch6#sec_replication_leader)
- log compaction, [Log compaction](/en/ch12#sec_stream_log_compaction), [Maintaining materialized views](/en/ch12#sec_stream_mat_view)
- message offsets, [Using logs for message storage](/en/ch12#id300), [Idempotence](/en/ch12#sec_stream_idempotence)
- partitions (sharding), [Sharding](/en/ch7#ch_sharding)
- request routing, [Request Routing](/en/ch7#sec_sharding_routing)
- schema registry, [But what is the writer's schema?](/en/ch5#but-what-is-the-writers-schema)
- serving derived data, [Serving Derived Data](/en/ch11#sec_batch_serving_derived)
- tiered storage, [Disk space usage](/en/ch12#sec_stream_disk_usage)
- transactions, [Database-internal Distributed Transactions](/en/ch8#sec_transactions_internal), [Atomic commit revisited](/en/ch12#sec_stream_atomic_commit)
- unclean leader election, [Subtleties of consensus](/en/ch10#subtleties-of-consensus)
- use of model-checking, [Model checking and specification languages](/en/ch9#model-checking-and-specification-languages)
- kappa architecture, [Unifying batch and stream processing](/en/ch13#id338)
- key-value stores, [Storage and Indexing for OLTP](/en/ch4#sec_storage_oltp)
- comparison to object stores, [Object Stores](/en/ch11#id277)
- in-memory, [Keeping everything in memory](/en/ch4#sec_storage_inmemory)
- LSM storage, [Log-Structured Storage](/en/ch4#sec_storage_log_structured)-[Disk space usage](/en/ch4#disk-space-usage)
- sharding, [Sharding of Key-Value Data](/en/ch7#sec_sharding_key_value)-[Skewed Workloads and Relieving Hot Spots](/en/ch7#sec_sharding_skew)
- by hash of key, [Sharding by Hash of Key](/en/ch7#sec_sharding_hash), [Summary](/en/ch7#summary)
- by key range, [Sharding by Key Range](/en/ch7#sec_sharding_key_range), [Summary](/en/ch7#summary)
- skew and hot spots, [Skewed Workloads and Relieving Hot Spots](/en/ch7#sec_sharding_skew)
- Kinesis (messaging), [Message brokers](/en/ch5#message-brokers), [Using logs for message storage](/en/ch12#id300)
- data warehouse integration, [Cloud Data Warehouses](/en/ch4#sec_cloud_data_warehouses)
- Kryo (Java), [Language-Specific Formats](/en/ch5#id96)
- ksqlDB (stream database), [Maintaining materialized views](/en/ch12#sec_stream_mat_view)
- Kubernetes (cluster manager), [Cloud Versus Self-Hosting](/en/ch1#sec_introduction_cloud), [Microservices and Serverless](/en/ch1#sec_introduction_microservices), [Distributed Job Orchestration](/en/ch11#id278), [Separation of application code and state](/en/ch13#id344)
- Kubeflow, [Machine Learning](/en/ch11#id290)
- kubelet, [Distributed Job Orchestration](/en/ch11#id278)
- operators, [Distributed Job Orchestration](/en/ch11#id278)
- use of etcd, [Request Routing](/en/ch7#sec_sharding_routing), [Coordination Services](/en/ch10#sec_consistency_coordination)
- KùzuDB (database), [Problems with Distributed Systems](/en/ch1#sec_introduction_dist_sys_problems), [Graph-Like Data Models](/en/ch3#sec_datamodels_graph)
- as embedded storage engine, [Compaction strategies](/en/ch4#sec_storage_lsm_compaction)
- Cypher query language, [The Cypher Query Language](/en/ch3#id57)
### L
- labeled property graphs (see property graphs)
- lambda architecture, [Unifying batch and stream processing](/en/ch13#id338)
- Lamport timestamps, [Lamport timestamps](/en/ch10#lamport-timestamps)
- Lance (data format), [Cloud Data Warehouses](/en/ch4#sec_cloud_data_warehouses), [Column-Oriented Storage](/en/ch4#sec_storage_column)
- (see also column-oriented storage)
- large language models (LLMs)
- pre-processing training data, [Machine Learning](/en/ch11#id290)
- last write wins (LWW), [Last write wins (discarding concurrent writes)](/en/ch6#sec_replication_lww), [Detecting Concurrent Writes](/en/ch6#sec_replication_concurrent), [Implementing Linearizable Systems](/en/ch10#sec_consistency_implementing_linearizable)
- problems with, [Timestamps for ordering events](/en/ch9#sec_distributed_lww)
- prone to lost updates, [Conflict resolution and replication](/en/ch8#conflict-resolution-and-replication)
- latency, [Latency and Response Time](/en/ch2#id23)
- (see also response time)
- across regions, [Distributed Versus Single-Node Systems](/en/ch1#sec_introduction_distributed)
- instability under two-phase locking, [Performance of two-phase locking](/en/ch8#performance-of-two-phase-locking)
- network latency and resource utilization, [Can we not simply make network delays predictable?](/en/ch9#can-we-not-simply-make-network-delays-predictable)
- reducing by request hedging, [Single-Leader Versus Leaderless Replication Performance](/en/ch6#sec_replication_leaderless_perf)
- response time versus, [Latency and Response Time](/en/ch2#id23)
- tail latency, [Average, Median, and Percentiles](/en/ch2#id24), [Use of Response Time Metrics](/en/ch2#sec_introduction_slo_sla), [Local Secondary Indexes](/en/ch7#id166)
- law (see legal matters)
- layering (of cloud services), [Layering of cloud services](/en/ch1#layering-of-cloud-services)
- leader-based replication, [Single-Leader Replication](/en/ch6#sec_replication_leader)-[Logical (row-based) log replication](/en/ch6#logical-row-based-log-replication)
- (see also replication)
- failover, [Leader failure: Failover](/en/ch6#leader-failure-failover), [Distributed Locks and Leases](/en/ch9#sec_distributed_lock_fencing)
- handling node outages, [Handling Node Outages](/en/ch6#sec_replication_failover)
- implementation of replication logs
- change data capture, [Change Data Capture](/en/ch12#sec_stream_cdc)-[API support for change streams](/en/ch12#sec_stream_change_api)
- (see also changelogs)
- statement-based, [Statement-based replication](/en/ch6#statement-based-replication)
- write-ahead log (WAL) shipping, [Write-ahead log (WAL) shipping](/en/ch6#write-ahead-log-wal-shipping)
- linearizability of operations, [Implementing Linearizable Systems](/en/ch10#sec_consistency_implementing_linearizable)
- locking and leader election, [Locking and leader election](/en/ch10#locking-and-leader-election)
- log sequence number, [Setting Up New Followers](/en/ch6#sec_replication_new_replica), [Consumer offsets](/en/ch12#sec_stream_log_offsets)
- read-scaling architecture, [Problems with Replication Lag](/en/ch6#sec_replication_lag), [Single-Leader Versus Leaderless Replication Performance](/en/ch6#sec_replication_leaderless_perf)
- relation to consensus, [Consensus](/en/ch10#sec_consistency_consensus), [From single-leader replication to consensus](/en/ch10#from-single-leader-replication-to-consensus), [Pros and cons of consensus](/en/ch10#pros-and-cons-of-consensus)
- setting up new followers, [Setting Up New Followers](/en/ch6#sec_replication_new_replica)
- synchronous versus asynchronous, [Synchronous Versus Asynchronous Replication](/en/ch6#sec_replication_sync_async)
- leaderless replication, [Leaderless Replication](/en/ch6#sec_replication_leaderless)-[Version vectors](/en/ch6#version-vectors)
- (see also replication)
- catching up on missed writes, [Catching up on missed writes](/en/ch6#sec_replication_read_repair)
- detecting concurrent writes, [Detecting Concurrent Writes](/en/ch6#sec_replication_concurrent)-[Version vectors](/en/ch6#version-vectors)
- version vectors, [Version vectors](/en/ch6#version-vectors)
- multi-region, [Multi-region operation](/en/ch6#multi-region-operation)
- quorums, [Quorums for reading and writing](/en/ch6#sec_replication_quorum_condition)-[Multi-region operation](/en/ch6#multi-region-operation)
- consistency limitations, [Limitations of Quorum Consistency](/en/ch6#sec_replication_quorum_limitations)-[Monitoring staleness](/en/ch6#monitoring-staleness), [Linearizability and quorums](/en/ch10#sec_consistency_quorum_linearizable)
- leap seconds, [Software faults](/en/ch2#software-faults), [Clock Synchronization and Accuracy](/en/ch9#sec_distributed_clock_accuracy)
- in time-of-day clocks, [Time-of-day clocks](/en/ch9#time-of-day-clocks)
- leases, [Process Pauses](/en/ch9#sec_distributed_clocks_pauses)
- implementation with coordination service, [Coordination Services](/en/ch10#sec_consistency_coordination)
- need for fencing, [Distributed Locks and Leases](/en/ch9#sec_distributed_lock_fencing)
- relation to consensus, [Single-value consensus](/en/ch10#single-value-consensus)
- ledgers (accounting), [Summary](/en/ch3#summary)
- immutability, [Advantages of immutable events](/en/ch12#sec_stream_immutability_pros)
- legacy systems, maintenance of, [Maintainability](/en/ch2#sec_introduction_maintainability)
- legal matters, [Data Systems, Law, and Society](/en/ch1#sec_introduction_compliance)
- data deletion, [Data Systems, Law, and Society](/en/ch1#sec_introduction_compliance), [Disk space usage](/en/ch4#disk-space-usage)
- data residency, [Distributed Versus Single-Node Systems](/en/ch1#sec_introduction_distributed), [Sharding for Multitenancy](/en/ch7#sec_sharding_multitenancy)
- privacy regulation, [Data Systems, Law, and Society](/en/ch1#sec_introduction_compliance), [Legislation and Self-Regulation](/en/ch14#sec_future_legislation)
- legitimate interest (GDPR), [Consent and Freedom of Choice](/en/ch14#id375)
- leveled compaction, [Compaction strategies](/en/ch4#sec_storage_lsm_compaction), [Disk space usage](/en/ch4#disk-space-usage)
- Levenshtein automata, [Full-Text Search](/en/ch4#sec_storage_full_text)
- limping (partial failure), [System Model and Reality](/en/ch9#sec_distributed_system_model)
- Linear (project management software), [Real-time collaboration, offline-first, and local-first apps](/en/ch6#real-time-collaboration-offline-first-and-local-first-apps)
- linear algebra, [DataFrames, Matrices, and Arrays](/en/ch3#sec_datamodels_dataframes)
- linear scalability, [Describing Load](/en/ch2#id33)
- linearizability, [Solutions for Replication Lag](/en/ch6#id131), [Linearizability](/en/ch10#sec_consistency_linearizability)-[Linearizability and network delays](/en/ch10#linearizability-and-network-delays), [Glossary](/en/glossary)
- and consensus, [Consensus](/en/ch10#sec_consistency_consensus)
- cost of, [The Cost of Linearizability](/en/ch10#sec_linearizability_cost)-[Linearizability and network delays](/en/ch10#linearizability-and-network-delays)
- CAP theorem, [The CAP theorem](/en/ch10#the-cap-theorem)
- memory on multi-core CPUs, [Linearizability and network delays](/en/ch10#linearizability-and-network-delays)
- definition, [What Makes a System Linearizable?](/en/ch10#sec_consistency_lin_definition)
- ID generation, [Linearizable ID Generators](/en/ch10#sec_consistency_linearizable_id)
- in coordination services, [Coordination Services](/en/ch10#sec_consistency_coordination)
- of derived data systems
- avoiding coordination, [Coordination-avoiding data systems](/en/ch13#id454)
- of different replication methods, [Implementing Linearizable Systems](/en/ch10#sec_consistency_implementing_linearizable)-[Linearizability and quorums](/en/ch10#sec_consistency_quorum_linearizable)
- using quorums, [Linearizability and quorums](/en/ch10#sec_consistency_quorum_linearizable)
- reads in consensus systems, [Subtleties of consensus](/en/ch10#subtleties-of-consensus)
- relying on, [Relying on Linearizability](/en/ch10#sec_consistency_linearizability_usage)-[Cross-channel timing dependencies](/en/ch10#cross-channel-timing-dependencies)
- constraints and uniqueness, [Constraints and uniqueness guarantees](/en/ch10#sec_consistency_uniqueness)
- cross-channel timing dependencies, [Cross-channel timing dependencies](/en/ch10#cross-channel-timing-dependencies)
- locking and leader election, [Locking and leader election](/en/ch10#locking-and-leader-election)
- versus serializability, [What Makes a System Linearizable?](/en/ch10#sec_consistency_lin_definition)
- linked data, [Triple-Stores and SPARQL](/en/ch3#id59)
- LinkedIn
- Espresso (database), [But what is the writer's schema?](/en/ch5#but-what-is-the-writers-schema)
- LIquid (database), [Datalog: Recursive Relational Queries](/en/ch3#id62)
- profile (example), [The document data model for one-to-many relationships](/en/ch3#the-document-data-model-for-one-to-many-relationships)
- Linux, leap second bug, [Software faults](/en/ch2#software-faults), [Clock Synchronization and Accuracy](/en/ch9#sec_distributed_clock_accuracy)
- Litestream (backup tool), [Setting Up New Followers](/en/ch6#sec_replication_new_replica)
- liveness properties, [Safety and liveness](/en/ch9#sec_distributed_safety_liveness)
- LLVM (compiler), [Query Execution: Compilation and Vectorization](/en/ch4#sec_storage_vectorized)
- LMDB (storage engine), [Compaction strategies](/en/ch4#sec_storage_lsm_compaction), [B-tree variants](/en/ch4#b-tree-variants), [Indexes and snapshot isolation](/en/ch8#indexes-and-snapshot-isolation)
- load
- coping with, [Principles for Scalability](/en/ch2#id35)
- describing, [Describing Load](/en/ch2#id33)
- load balancing, [Describing Performance](/en/ch2#sec_introduction_percentiles), [Load balancers, service discovery, and service meshes](/en/ch5#sec_encoding_service_discovery)
- in hardware, [Load balancers, service discovery, and service meshes](/en/ch5#sec_encoding_service_discovery)
- in software, [Load balancers, service discovery, and service meshes](/en/ch5#sec_encoding_service_discovery)
- using message brokers, [Multiple consumers](/en/ch12#id298)
- load shedding, [Describing Performance](/en/ch2#sec_introduction_percentiles)
- local secondary indexes, [Local Secondary Indexes](/en/ch7#id166), [Summary](/en/ch7#summary)
- local-first software, [Real-time collaboration, offline-first, and local-first apps](/en/ch6#real-time-collaboration-offline-first-and-local-first-apps)
- locality (data access), [The document data model for one-to-many relationships](/en/ch3#the-document-data-model-for-one-to-many-relationships), [Data locality for reads and writes](/en/ch3#sec_datamodels_document_locality), [Glossary](/en/glossary)
- in batch processing, [Dataflow Engines](/en/ch11#sec_batch_dataflow)
- in stateful clients, [Sync Engines and Local-First Software](/en/ch6#sec_replication_offline_clients), [Stateful, offline-capable clients](/en/ch13#id347)
- in stream processing, [Stream-table join (stream enrichment)](/en/ch12#sec_stream_table_joins), [Rebuilding state after a failure](/en/ch12#sec_stream_state_fault_tolerance), [Stream processors and services](/en/ch13#id345), [Uniqueness in log-based messaging](/en/ch13#sec_future_uniqueness_log)
- location transparency, [The problems with remote procedure calls (RPCs)](/en/ch5#sec_problems_with_rpc)
- in the actor model, [Distributed actor frameworks](/en/ch5#distributed-actor-frameworks)
- lock-in, [Pros and Cons of Cloud Services](/en/ch1#sec_introduction_cloud_tradeoffs)
- locks, [Glossary](/en/glossary)
- deadlock, [Explicit locking](/en/ch8#explicit-locking), [Implementation of two-phase locking](/en/ch8#implementation-of-two-phase-locking)
- distributed locking, [Distributed Locks and Leases](/en/ch9#sec_distributed_lock_fencing)-[Fencing with multiple replicas](/en/ch9#fencing-with-multiple-replicas), [Locking and leader election](/en/ch10#locking-and-leader-election)
- fencing tokens, [Fencing off zombies and delayed requests](/en/ch9#sec_distributed_fencing_tokens)
- implementation with coordination service, [Coordination Services](/en/ch10#sec_consistency_coordination)
- relation to consensus, [Single-value consensus](/en/ch10#single-value-consensus)
- for transaction isolation
- in snapshot isolation, [Multi-version concurrency control (MVCC)](/en/ch8#sec_transactions_snapshot_impl)
- in two-phase locking (2PL), [Two-Phase Locking (2PL)](/en/ch8#sec_transactions_2pl)-[Index-range locks](/en/ch8#sec_transactions_2pl_range)
- making operations atomic, [Atomic write operations](/en/ch8#atomic-write-operations)
- performance, [Performance of two-phase locking](/en/ch8#performance-of-two-phase-locking)
- preventing dirty writes, [Implementing read committed](/en/ch8#sec_transactions_read_committed_impl)
- preventing phantoms with index-range locks, [Index-range locks](/en/ch8#sec_transactions_2pl_range), [Detecting writes that affect prior reads](/en/ch8#sec_detecting_writes_affect_reads)
- read locks (shared mode), [Implementing read committed](/en/ch8#sec_transactions_read_committed_impl), [Implementation of two-phase locking](/en/ch8#implementation-of-two-phase-locking)
- shared mode and exclusive mode, [Implementation of two-phase locking](/en/ch8#implementation-of-two-phase-locking)
- in distributed transactions
- deadlock detection, [Problems with XA transactions](/en/ch8#problems-with-xa-transactions)
- in-doubt transactions holding locks, [Holding locks while in doubt](/en/ch8#holding-locks-while-in-doubt)
- materializing conflicts with, [Materializing conflicts](/en/ch8#materializing-conflicts)
- preventing lost updates by explicit locking, [Explicit locking](/en/ch8#explicit-locking)
- log sequence number, [Setting Up New Followers](/en/ch6#sec_replication_new_replica), [Consumer offsets](/en/ch12#sec_stream_log_offsets)
- logical clocks, [Timestamps for ordering events](/en/ch9#sec_distributed_lww), [ID Generators and Logical Clocks](/en/ch10#sec_consistency_logical)-[Enforcing constraints using logical clocks](/en/ch10#enforcing-constraints-using-logical-clocks), [Ordering events to capture causality](/en/ch13#sec_future_capture_causality)
- for last-write-wins, [Last write wins (discarding concurrent writes)](/en/ch6#sec_replication_lww)
- for read-after-write consistency, [Reading Your Own Writes](/en/ch6#sec_replication_ryw)
- hybrid logical clocks, [Hybrid logical clocks](/en/ch10#hybrid-logical-clocks)
- insufficiency for enforcing constraints, [Enforcing constraints using logical clocks](/en/ch10#enforcing-constraints-using-logical-clocks)
- Lamport timestamps, [Lamport timestamps](/en/ch10#lamport-timestamps)
- logical replication, [Logical (row-based) log replication](/en/ch6#logical-row-based-log-replication)
- for change data capture, [Implementing change data capture](/en/ch12#id307)
- LogicBlox (database), [Datalog: Recursive Relational Queries](/en/ch3#id62)
- logs (data structure), [Storage and Indexing for OLTP](/en/ch4#sec_storage_oltp), [Shared logs as consensus](/en/ch10#sec_consistency_shared_logs), [Glossary](/en/glossary)
- (see also shared logs)
- advantages of immutability, [Advantages of immutable events](/en/ch12#sec_stream_immutability_pros)
- and right to erasure, [Data Systems, Law, and Society](/en/ch1#sec_introduction_compliance), [Disk space usage](/en/ch4#disk-space-usage)
- compaction, [Constructing and merging SSTables](/en/ch4#constructing-and-merging-sstables), [Compaction strategies](/en/ch4#sec_storage_lsm_compaction), [Log compaction](/en/ch12#sec_stream_log_compaction), [State, Streams, and Immutability](/en/ch12#sec_stream_immutability)
- for stream operator state, [Rebuilding state after a failure](/en/ch12#sec_stream_state_fault_tolerance)
- implementing uniqueness constraints, [Uniqueness in log-based messaging](/en/ch13#sec_future_uniqueness_log)
- log-based messaging, [Log-based Message Brokers](/en/ch12#sec_stream_log)-[Replaying old messages](/en/ch12#sec_stream_replay)
- comparison to traditional messaging, [Logs compared to traditional messaging](/en/ch12#sec_stream_logs_vs_messaging), [Replaying old messages](/en/ch12#sec_stream_replay)
- consumer offsets, [Consumer offsets](/en/ch12#sec_stream_log_offsets)
- disk space usage, [Disk space usage](/en/ch12#sec_stream_disk_usage)
- replaying old messages, [Replaying old messages](/en/ch12#sec_stream_replay), [Reprocessing data for application evolution](/en/ch13#sec_future_reprocessing), [Unifying batch and stream processing](/en/ch13#id338)
- slow consumers, [When consumers cannot keep up with producers](/en/ch12#id459)
- using logs for message storage, [Using logs for message storage](/en/ch12#id300)
- log-structured storage, [Storage and Indexing for OLTP](/en/ch4#sec_storage_oltp)-[Compaction strategies](/en/ch4#sec_storage_lsm_compaction)
- log-structured merge tree (see LSM-trees)
- relation to consensus, [Shared logs as consensus](/en/ch10#sec_consistency_shared_logs)
- replication, [Single-Leader Replication](/en/ch6#sec_replication_leader), [Implementation of Replication Logs](/en/ch6#sec_replication_implementation)-[Logical (row-based) log replication](/en/ch6#logical-row-based-log-replication)
- change data capture, [Change Data Capture](/en/ch12#sec_stream_cdc)-[API support for change streams](/en/ch12#sec_stream_change_api)
- (see also changelogs)
- coordination with snapshot, [Setting Up New Followers](/en/ch6#sec_replication_new_replica)
- logical (row-based) replication, [Logical (row-based) log replication](/en/ch6#logical-row-based-log-replication)
- statement-based replication, [Statement-based replication](/en/ch6#statement-based-replication)
- write-ahead log (WAL) shipping, [Write-ahead log (WAL) shipping](/en/ch6#write-ahead-log-wal-shipping)
- scalability limits, [The limits of total ordering](/en/ch13#id335)
- Looker (business intelligence software), [Characterizing Transaction Processing and Analytics](/en/ch1#sec_introduction_oltp), [Analytics](/en/ch11#sec_batch_olap)
- loose coupling, [Making unbundling work](/en/ch13#sec_future_unbundling_favor)
- lost updates (see updates)
- Lotus Notes (sync engine), [Pros and cons of sync engines](/en/ch6#pros-and-cons-of-sync-engines)
- LSM-trees (indexes), [The SSTable file format](/en/ch4#the-sstable-file-format)-[Compaction strategies](/en/ch4#sec_storage_lsm_compaction)
- comparison to B-trees, [Comparing B-Trees and LSM-Trees](/en/ch4#sec_storage_btree_lsm_comparison)-[Disk space usage](/en/ch4#disk-space-usage)
- Lucene (storage engine), [Full-Text Search](/en/ch4#sec_storage_full_text)
- similarity search, [Full-Text Search](/en/ch4#sec_storage_full_text)
- LWW (see last write wins)
### M
- machine learning
- batch inference, [Machine Learning](/en/ch11#id290)
- data preparation with DataFrames, [DataFrames, Matrices, and Arrays](/en/ch3#sec_datamodels_dataframes)
- deleting training data, [Data Systems, Law, and Society](/en/ch1#sec_introduction_compliance)
- deploying data products, [Beyond the data lake](/en/ch1#beyond-the-data-lake)
- ethical considerations, [Predictive Analytics](/en/ch14#id369)
- (see also ethics)
- feature engineering, [From data warehouse to data lake](/en/ch1#from-data-warehouse-to-data-lake), [Machine Learning](/en/ch11#id290)
- in analytics systems, [Operational Versus Analytical Systems](/en/ch1#sec_introduction_analytics)
- iterative processing, [Machine Learning](/en/ch11#id290)
- LLMs (see large language models (LLMs))
- models derived from training data, [Application code as a derivation function](/en/ch13#sec_future_dataflow_derivation)
- relation to batch processing, [Machine Learning](/en/ch11#id290)
- using a data lake, [From data warehouse to data lake](/en/ch1#from-data-warehouse-to-data-lake)
- using GPUs, [Layering of cloud services](/en/ch1#layering-of-cloud-services), [Distributed Versus Single-Node Systems](/en/ch1#sec_introduction_distributed)
- using matrices, [DataFrames, Matrices, and Arrays](/en/ch3#sec_datamodels_dataframes)
- madsim (deterministic simulation testing), [Deterministic simulation testing](/en/ch9#deterministic-simulation-testing)
- magic scaling sauce, [Principles for Scalability](/en/ch2#id35)
- maintainability, [Maintainability](/en/ch2#sec_introduction_maintainability)-[Evolvability: Making Change Easy](/en/ch2#sec_introduction_evolvability), [A Philosophy of Streaming Systems](/en/ch13#ch_philosophy)
- evolvability (see evolvability)
- operability, [Operability: Making Life Easy for Operations](/en/ch2#id37)
- simplicity and managing complexity, [Simplicity: Managing Complexity](/en/ch2#id38)
- many-to-many relationships, [Many-to-One and Many-to-Many Relationships](/en/ch3#sec_datamodels_many_to_many)
- modeling as graphs, [Graph-Like Data Models](/en/ch3#sec_datamodels_graph)
- many-to-one relationships, [Many-to-One and Many-to-Many Relationships](/en/ch3#sec_datamodels_many_to_many)
- in star schema, [Stars and Snowflakes: Schemas for Analytics](/en/ch3#sec_datamodels_analytics)
- MapReduce (batch processing), [Batch Processing](/en/ch11#ch_batch), [MapReduce](/en/ch11#sec_batch_mapreduce)
- analysis of user activity events (example), [JOIN and GROUP BY](/en/ch11#sec_batch_join)
- comparison to stream processing, [Processing Streams](/en/ch12#sec_stream_processing)
- disadvantages and limitations of, [MapReduce](/en/ch11#sec_batch_mapreduce)
- fault tolerance, [Handling Faults](/en/ch11#id281)
- higher-level tools, [Query languages](/en/ch11#sec_batch_query_lanauges)
- mapper and reducer functions, [MapReduce](/en/ch11#sec_batch_mapreduce)
- shuffling data, [Shuffling Data](/en/ch11#sec_shuffle)
- sort-merge joins, [JOIN and GROUP BY](/en/ch11#sec_batch_join)
- workflows, [Scheduling Workflows](/en/ch11#sec_batch_workflows)
- (see also workflow engines)
- marshalling (see encoding)
- MartenDB (database), [Event Sourcing and CQRS](/en/ch3#sec_datamodels_events)
- master-slave replication (obsolete term), [Single-Leader Replication](/en/ch6#sec_replication_leader)
- materialization, [Glossary](/en/glossary)
- aggregate values, [Materialized Views and Data Cubes](/en/ch4#sec_storage_materialized_views)
- conflicts, [Materializing conflicts](/en/ch8#materializing-conflicts)
- materialized views, [Materialized Views and Data Cubes](/en/ch4#sec_storage_materialized_views)
- as derived data, [Systems of Record and Derived Data](/en/ch1#sec_introduction_derived), [Composing Data Storage Technologies](/en/ch13#id447)-[Unbundled versus integrated systems](/en/ch13#id448)
- in event sourcing, [Event Sourcing and CQRS](/en/ch3#sec_datamodels_events)
- incremental view maintenance, [Maintaining materialized views](/en/ch12#sec_stream_mat_view)
- (see also incremental view maintenance (IVM))
- maintaining, using stream processing, [Maintaining materialized views](/en/ch12#sec_stream_mat_view), [Table-table join (materialized view maintenance)](/en/ch12#id326)
- social network timeline example, [Materializing and Updating Timelines](/en/ch2#sec_introduction_materializing)
- Materialize (database), [Materialized Views and Data Cubes](/en/ch4#sec_storage_materialized_views)
- incremental view maintenance, [Maintaining materialized views](/en/ch12#sec_stream_mat_view)
- matrices, [DataFrames, Matrices, and Arrays](/en/ch3#sec_datamodels_dataframes)
- sparse, [DataFrames, Matrices, and Arrays](/en/ch3#sec_datamodels_dataframes)
- Maxwell (change data capture), [Implementing change data capture](/en/ch12#id307)
- mean, [Average, Median, and Percentiles](/en/ch2#id24)
- media monitoring, [Search on streams](/en/ch12#id320)
- median, [Average, Median, and Percentiles](/en/ch2#id24)
- meeting room booking (example), [More examples of write skew](/en/ch8#more-examples-of-write-skew), [Predicate locks](/en/ch8#predicate-locks), [Enforcing Constraints](/en/ch13#sec_future_constraints)
- Memcached (caching server), [Keeping everything in memory](/en/ch4#sec_storage_inmemory)
- Memgraph (database), [Graph-Like Data Models](/en/ch3#sec_datamodels_graph)
- Cypher query language, [The Cypher Query Language](/en/ch3#id57)
- memory
- barrier (CPU instruction), [Linearizability and network delays](/en/ch10#linearizability-and-network-delays)
- corruption, [Hardware and Software Faults](/en/ch2#sec_introduction_hardware_faults)
- in-memory databases, [Keeping everything in memory](/en/ch4#sec_storage_inmemory)
- durability, [Durability](/en/ch8#durability)
- serial transaction execution, [Actual Serial Execution](/en/ch8#sec_transactions_serial)
- in-memory representation of data, [Formats for Encoding Data](/en/ch5#sec_encoding_formats)
- memtable (in LSM-trees), [Constructing and merging SSTables](/en/ch4#constructing-and-merging-sstables)
- random bit-flips in, [Trust, but Verify](/en/ch13#sec_future_verification)
- use by indexes, [Log-Structured Storage](/en/ch4#sec_storage_log_structured)
- memtable (in LSM-trees), [Constructing and merging SSTables](/en/ch4#constructing-and-merging-sstables)
- Mercurial (version control system), [Concurrency control](/en/ch12#sec_stream_concurrency)
- merge (DataFrame operator), [DataFrames, Matrices, and Arrays](/en/ch3#sec_datamodels_dataframes)
- merging sorted files, [Constructing and merging SSTables](/en/ch4#constructing-and-merging-sstables), [Shuffling Data](/en/ch11#sec_shuffle)
- Merkle trees, [Tools for auditable data systems](/en/ch13#id366)
- Mesos (cluster manager), [Separation of application code and state](/en/ch13#id344)
- message brokers (see messaging systems)
- message-passing (see event-driven architecture)
- MessagePack (encoding format), [Binary encoding](/en/ch5#binary-encoding)
- messaging systems, [Stream Processing](/en/ch12#ch_stream)-[Replaying old messages](/en/ch12#sec_stream_replay)
- (see also streams)
- backpressure, buffering, or dropping messages, [Messaging Systems](/en/ch12#sec_stream_messaging)
- brokerless messaging, [Direct messaging from producers to consumers](/en/ch12#id296)
- event logs, [Log-based Message Brokers](/en/ch12#sec_stream_log)-[Replaying old messages](/en/ch12#sec_stream_replay)
- as data model, [Event Sourcing and CQRS](/en/ch3#sec_datamodels_events)
- comparison to traditional messaging, [Logs compared to traditional messaging](/en/ch12#sec_stream_logs_vs_messaging), [Replaying old messages](/en/ch12#sec_stream_replay)
- consumer offsets, [Consumer offsets](/en/ch12#sec_stream_log_offsets)
- replaying old messages, [Replaying old messages](/en/ch12#sec_stream_replay), [Reprocessing data for application evolution](/en/ch13#sec_future_reprocessing), [Unifying batch and stream processing](/en/ch13#id338)
- slow consumers, [When consumers cannot keep up with producers](/en/ch12#id459)
- exactly-once semantics, [Exactly-once message processing](/en/ch8#sec_transactions_exactly_once), [Exactly-once message processing revisited](/en/ch8#exactly-once-message-processing-revisited), [Fault Tolerance](/en/ch12#sec_stream_fault_tolerance)
- message brokers, [Message brokers](/en/ch12#id433)-[Acknowledgments and redelivery](/en/ch12#sec_stream_reordering)
- acknowledgments and redelivery, [Acknowledgments and redelivery](/en/ch12#sec_stream_reordering)
- comparison to event logs, [Logs compared to traditional messaging](/en/ch12#sec_stream_logs_vs_messaging), [Replaying old messages](/en/ch12#sec_stream_replay)
- multiple consumers of same topic, [Multiple consumers](/en/ch12#id298)
- versus RPC, [Event-Driven Architectures](/en/ch5#sec_encoding_dataflow_msg)
- message loss, [Messaging Systems](/en/ch12#sec_stream_messaging)
- reliability, [Messaging Systems](/en/ch12#sec_stream_messaging)
- uniqueness in log-based messaging, [Uniqueness in log-based messaging](/en/ch13#sec_future_uniqueness_log)
- metastable failure, [Describing Performance](/en/ch2#sec_introduction_percentiles)
- metered billing
- serverless, [Microservices and Serverless](/en/ch1#sec_introduction_microservices)
- storage, [Operations in the Cloud Era](/en/ch1#sec_introduction_operations)
- microbatching, [Microbatching and checkpointing](/en/ch12#id329)
- microservices, [Microservices and Serverless](/en/ch1#sec_introduction_microservices)
- (see also services)
- causal dependencies across services, [The limits of total ordering](/en/ch13#id335)
- loose coupling, [Making unbundling work](/en/ch13#sec_future_unbundling_favor)
- relation to batch/stream processors, [Batch Processing](/en/ch11#ch_batch), [Stream processors and services](/en/ch13#id345)
- Microsoft
- Azure Blob Storage (see Azure Blob Storage)
- Azure managed disks, [Separation of storage and compute](/en/ch1#sec_introduction_storage_compute)
- Azure Service Bus (messaging), [Message brokers](/en/ch5#message-brokers), [Message brokers compared to databases](/en/ch12#id297)
- Azure SQL DB (database), [Cloud-Native System Architecture](/en/ch1#sec_introduction_cloud_native)
- Azure Storage, [Object Stores](/en/ch11#id277)
- Azure Stream Analytics, [Stream analytics](/en/ch12#id318)
- Azure Synapse Analytics (database), [Cloud-Native System Architecture](/en/ch1#sec_introduction_cloud_native)
- DCOM (Distributed Component Object Model), [The problems with remote procedure calls (RPCs)](/en/ch5#sec_problems_with_rpc)
- MSDTC (transaction coordinator), [Two-Phase Commit (2PC)](/en/ch8#sec_transactions_2pc)
- SQL Server (see SQL Server)
- Microsoft Power BI (see Power BI (business intelligence software))
- migrating (rewriting) data, [Schema flexibility in the document model](/en/ch3#sec_datamodels_schema_flexibility), [Different values written at different times](/en/ch5#different-values-written-at-different-times), [Deriving several views from the same event log](/en/ch12#sec_stream_deriving_views), [Reprocessing data for application evolution](/en/ch13#sec_future_reprocessing)
- MinIO (object storage), [Distributed Filesystems](/en/ch11#sec_batch_dfs)
- mobile apps, [Trade-offs in Data Systems Architecture](/en/ch1#ch_tradeoffs)
- embedded databases, [Compaction strategies](/en/ch4#sec_storage_lsm_compaction)
- model checking, [Model checking and specification languages](/en/ch9#model-checking-and-specification-languages)
- modulus operator (%), [Hash modulo number of nodes](/en/ch7#hash-modulo-number-of-nodes)
- Mojo (programming language)
- memory management, [Limiting the impact of garbage collection](/en/ch9#sec_distributed_gc_impact)
- MongoDB (database)
- aggregation pipeline, [Query languages for documents](/en/ch3#query-languages-for-documents)
- atomic operations, [Atomic write operations](/en/ch8#atomic-write-operations)
- BSON, [Data locality for reads and writes](/en/ch3#sec_datamodels_document_locality)
- document data model, [Relational Model versus Document Model](/en/ch3#sec_datamodels_history)
- hash-range sharding, [Sharding by Hash of Key](/en/ch7#sec_sharding_hash), [Sharding by hash range](/en/ch7#sharding-by-hash-range)
- in the cloud, [Cloud-Native System Architecture](/en/ch1#sec_introduction_cloud_native)
- join support, [Convergence of document and relational databases](/en/ch3#convergence-of-document-and-relational-databases)
- joins (\$lookup operator), [Normalization, Denormalization, and Joins](/en/ch3#sec_datamodels_normalization)
- JSON Schema validation, [JSON Schema](/en/ch5#json-schema)
- leader-based replication, [Single-Leader Replication](/en/ch6#sec_replication_leader)
- ObjectIds, [ID Generators and Logical Clocks](/en/ch10#sec_consistency_logical)
- range-based sharding, [Sharding by Key Range](/en/ch7#sec_sharding_key_range)
- request routing, [Request Routing](/en/ch7#sec_sharding_routing)
- secondary indexes, [Local Secondary Indexes](/en/ch7#id166)
- shard splitting, [Rebalancing key-range sharded data](/en/ch7#rebalancing-key-range-sharded-data)
- stored procedures, [Pros and cons of stored procedures](/en/ch8#sec_transactions_stored_proc_tradeoffs)
- monitoring, [Operations in the Cloud Era](/en/ch1#sec_introduction_operations), [Humans and Reliability](/en/ch2#id31), [Operability: Making Life Easy for Operations](/en/ch2#id37)
- monotonic clocks, [Monotonic clocks](/en/ch9#monotonic-clocks)
- monotonic reads, [Monotonic Reads](/en/ch6#sec_replication_monotonic_reads)
- Morel (query language), [Query languages](/en/ch11#sec_batch_query_lanauges)
- MSMQ (messaging), [XA transactions](/en/ch8#xa-transactions)
- multi-column indexes, [Multidimensional and Full-Text Indexes](/en/ch4#sec_storage_multidimensional)
- multi-leader replication, [Multi-Leader Replication](/en/ch6#sec_replication_multi_leader)-[Types of conflict](/en/ch6#sec_replication_write_conflicts)
- (see also replication)
- collaborative editing, [Real-time collaboration, offline-first, and local-first apps](/en/ch6#real-time-collaboration-offline-first-and-local-first-apps)
- conflict detection, [Types of conflict](/en/ch6#sec_replication_write_conflicts)
- conflict resolution, [Dealing with Conflicting Writes](/en/ch6#sec_replication_write_conflicts)
- for multi-region replication, [Geographically Distributed Operation](/en/ch6#sec_replication_multi_dc), [The Cost of Linearizability](/en/ch10#sec_linearizability_cost)
- linearizability, lack of, [Implementing Linearizable Systems](/en/ch10#sec_consistency_implementing_linearizable)
- offline-capable clients, [Sync Engines and Local-First Software](/en/ch6#sec_replication_offline_clients)
- replication topologies, [Multi-leader replication topologies](/en/ch6#sec_replication_topologies)-[Problems with different topologies](/en/ch6#problems-with-different-topologies)
- multi-object transactions, [Single-Object and Multi-Object Operations](/en/ch8#sec_transactions_multi_object)
- need for, [The need for multi-object transactions](/en/ch8#sec_transactions_need)
- Multi-Paxos (consensus algorithm), [Consensus in Practice](/en/ch10#sec_consistency_total_order)
- multi-reader single-writer lock, [Implementation of two-phase locking](/en/ch8#implementation-of-two-phase-locking)
- multi-table index cluster tables (Oracle), [Data locality for reads and writes](/en/ch3#sec_datamodels_document_locality)
- multi-version concurrency control (MVCC), [Multi-version concurrency control (MVCC)](/en/ch8#sec_transactions_snapshot_impl), [Summary](/en/ch8#summary)
- detecting stale MVCC reads, [Detecting stale MVCC reads](/en/ch8#detecting-stale-mvcc-reads)
- indexes and snapshot isolation, [Indexes and snapshot isolation](/en/ch8#indexes-and-snapshot-isolation)
- using synchronized clocks, [Synchronized clocks for global snapshots](/en/ch9#sec_distributed_spanner)
- multidimensional arrays, [DataFrames, Matrices, and Arrays](/en/ch3#sec_datamodels_dataframes)
- multitenancy, [Separation of storage and compute](/en/ch1#sec_introduction_storage_compute), [Network congestion and queueing](/en/ch9#network-congestion-and-queueing)
- by sharding, [Sharding for Multitenancy](/en/ch7#sec_sharding_multitenancy)
- using embedded databases, [Compaction strategies](/en/ch4#sec_storage_lsm_compaction)
- versus Byzantine fault tolerance, [Byzantine Faults](/en/ch9#sec_distributed_byzantine)
- mutual exclusion, [Pessimistic versus optimistic concurrency control](/en/ch8#pessimistic-versus-optimistic-concurrency-control)
- (see also locks)
- MySQL (database)
- archiving WAL to object stores, [Setting Up New Followers](/en/ch6#sec_replication_new_replica)
- binlog coordinates, [Setting Up New Followers](/en/ch6#sec_replication_new_replica)
- change data capture, [Implementing change data capture](/en/ch12#id307), [API support for change streams](/en/ch12#sec_stream_change_api)
- circular replication topology, [Multi-leader replication topologies](/en/ch6#sec_replication_topologies)
- consistent snapshots, [Setting Up New Followers](/en/ch6#sec_replication_new_replica)
- distributed transaction support, [XA transactions](/en/ch8#xa-transactions)
- global transaction identifiers (GTIDs), [Setting Up New Followers](/en/ch6#sec_replication_new_replica)
- in the cloud, [Cloud-Native System Architecture](/en/ch1#sec_introduction_cloud_native)
- InnoDB storage engine (see InnoDB)
- leader-based replication, [Single-Leader Replication](/en/ch6#sec_replication_leader)
- multi-leader replication, [Geographically Distributed Operation](/en/ch6#sec_replication_multi_dc)
- row-based replication, [Logical (row-based) log replication](/en/ch6#logical-row-based-log-replication)
- sharding (see Vitess (database))
- snapshot isolation support, [Snapshot isolation, repeatable read, and naming confusion](/en/ch8#snapshot-isolation-repeatable-read-and-naming-confusion)
- (see also InnoDB)
- statement-based replication, [Statement-based replication](/en/ch6#statement-based-replication)
### N
- N+1 query problem, [Object-relational mapping (ORM)](/en/ch3#object-relational-mapping-orm)
- nanomsg (messaging library), [Direct messaging from producers to consumers](/en/ch12#id296)
- Narayana (transaction coordinator), [Two-Phase Commit (2PC)](/en/ch8#sec_transactions_2pc)
- NATS (messaging), [Message brokers](/en/ch5#message-brokers)
- natural language processing, [From data warehouse to data lake](/en/ch1#from-data-warehouse-to-data-lake)
- Neo4j (database)
- Cypher query language, [The Cypher Query Language](/en/ch3#id57)
- graph data model, [Graph-Like Data Models](/en/ch3#sec_datamodels_graph)
- Neon (database), [Setting Up New Followers](/en/ch6#sec_replication_new_replica)
- Nephele (dataflow engine), [Dataflow Engines](/en/ch11#sec_batch_dataflow)
- Neptune (graph database), [Graph-Like Data Models](/en/ch3#sec_datamodels_graph)
- Cypher query language, [The Cypher Query Language](/en/ch3#id57)
- SPARQL query language, [The SPARQL query language](/en/ch3#the-sparql-query-language)
- netcode (game development), [Pros and cons of sync engines](/en/ch6#pros-and-cons-of-sync-engines)
- Network Attached Storage (NAS), [Shared-Memory, Shared-Disk, and Shared-Nothing Architecture](/en/ch2#sec_introduction_shared_nothing), [Distributed Filesystems](/en/ch11#sec_batch_dfs)
- network model (data representation), [Relational Model versus Document Model](/en/ch3#sec_datamodels_history)
- Network Time Protocol (see NTP)
- networks
- congestion and queueing, [Network congestion and queueing](/en/ch9#network-congestion-and-queueing)
- datacenter network topologies, [Cloud Computing Versus Supercomputing](/en/ch1#id17)
- faults (see faults)
- linearizability and network delays, [Linearizability and network delays](/en/ch10#linearizability-and-network-delays)
- network partitions, [Network Faults in Practice](/en/ch9#sec_distributed_network_faults)
- in CAP theorem, [The Cost of Linearizability](/en/ch10#sec_linearizability_cost)
- timeouts and unbounded delays, [Timeouts and Unbounded Delays](/en/ch9#sec_distributed_queueing)
- NewSQL, [Relational Model versus Document Model](/en/ch3#sec_datamodels_history), [Solutions for Replication Lag](/en/ch6#id131)
- transactions and, [What Exactly Is a Transaction?](/en/ch8#sec_transactions_overview), [Database-internal Distributed Transactions](/en/ch8#sec_transactions_internal)
- next-key locking, [Index-range locks](/en/ch8#sec_transactions_2pl_range)
- NFS (network file system), [Distributed Filesystems](/en/ch11#sec_batch_dfs)
- on object storage, [Object Stores](/en/ch11#id277)
- Nimble (data format), [Cloud Data Warehouses](/en/ch4#sec_cloud_data_warehouses), [Column-Oriented Storage](/en/ch4#sec_storage_column)
- (see also column-oriented storage)
- node (in graphs) (see vertices)
- nodes (processes), [Distributed Versus Single-Node Systems](/en/ch1#sec_introduction_distributed), [Glossary](/en/glossary)
- handling outages in leader-based replication, [Handling Node Outages](/en/ch6#sec_replication_failover)
- system models for failure, [System Model and Reality](/en/ch9#sec_distributed_system_model)
- noisy neighbors, [Network congestion and queueing](/en/ch9#network-congestion-and-queueing)
- nonblocking atomic commit, [Three-phase commit](/en/ch8#three-phase-commit)
- nondeterministic operations, [Statement-based replication](/en/ch6#statement-based-replication)
- (see also deterministic operations)
- in distributed systems, [Deterministic simulation testing](/en/ch9#deterministic-simulation-testing)
- in workflow engines, [Durable execution](/en/ch5#durable-execution)
- partial failures, [Faults and Partial Failures](/en/ch9#sec_distributed_partial_failure)
- sources of nondeterminism, [Deterministic simulation testing](/en/ch9#deterministic-simulation-testing)
- nonfunctional requirements, [Defining Nonfunctional Requirements](/en/ch2#ch_nonfunctional), [Summary](/en/ch2#summary)
- nonrepeatable reads, [Snapshot Isolation and Repeatable Read](/en/ch8#sec_transactions_snapshot_isolation)
- (see also read skew)
- normalization (data representation), [Normalization, Denormalization, and Joins](/en/ch3#sec_datamodels_normalization)-[Many-to-One and Many-to-Many Relationships](/en/ch3#sec_datamodels_many_to_many), [Glossary](/en/glossary)
- foreign key references, [The need for multi-object transactions](/en/ch8#sec_transactions_need)
- in social network case study, [Denormalization in the social networking case study](/en/ch3#denormalization-in-the-social-networking-case-study)
- in systems of record, [Systems of Record and Derived Data](/en/ch1#sec_introduction_derived)
- versus denormalization, [Deriving several views from the same event log](/en/ch12#sec_stream_deriving_views)
- NoSQL, [Relational Model versus Document Model](/en/ch3#sec_datamodels_history), [Solutions for Replication Lag](/en/ch6#id131), [Unbundling Databases](/en/ch13#sec_future_unbundling)
- transactions and, [What Exactly Is a Transaction?](/en/ch8#sec_transactions_overview)
- Notation3 (N3), [Triple-Stores and SPARQL](/en/ch3#id59)
- NTP (Network Time Protocol), [Unreliable Clocks](/en/ch9#sec_distributed_clocks)
- accuracy, [Clock Synchronization and Accuracy](/en/ch9#sec_distributed_clock_accuracy), [Timestamps for ordering events](/en/ch9#sec_distributed_lww)
- adjustments to monotonic clocks, [Monotonic clocks](/en/ch9#monotonic-clocks)
- multiple server addresses, [Weak forms of lying](/en/ch9#weak-forms-of-lying)
- numbers, in XML and JSON encodings, [JSON, XML, and Binary Variants](/en/ch5#sec_encoding_json)
- NumPy (Python library), [DataFrames, Matrices, and Arrays](/en/ch3#sec_datamodels_dataframes), [Column-Oriented Storage](/en/ch4#sec_storage_column)
- NVMe (Non-Volatile Memory Express) (see solid state drives (SSDs))
### O
- object databases, [Relational Model versus Document Model](/en/ch3#sec_datamodels_history)
- object storage, [Layering of cloud services](/en/ch1#layering-of-cloud-services), [Object Stores](/en/ch11#id277)
- Azure Blob Storage (see Azure Blob Storage)
- comparison to distributed filesystems, [Object Stores](/en/ch11#id277)
- comparison to key-value stores, [Object Stores](/en/ch11#id277)
- databases backed by, [Setting Up New Followers](/en/ch6#sec_replication_new_replica)
- for backups, [Replication](/en/ch6#ch_replication)
- for cloud data warehouses, [Cloud Data Warehouses](/en/ch4#sec_cloud_data_warehouses), [Writing to Column-Oriented Storage](/en/ch4#writing-to-column-oriented-storage)
- for database replication, [Setting Up New Followers](/en/ch6#sec_replication_new_replica)
- Google Cloud Storage (see Google Cloud Storage)
- object size, [Separation of storage and compute](/en/ch1#sec_introduction_storage_compute)
- S3 (see S3 (object storage))
- storing LSM segment files, [Constructing and merging SSTables](/en/ch4#constructing-and-merging-sstables)
- support for fencing, [Fencing off zombies and delayed requests](/en/ch9#sec_distributed_fencing_tokens)
- use in data lakes, [From data warehouse to data lake](/en/ch1#from-data-warehouse-to-data-lake)
- object-relational mapping (ORM) frameworks, [Object-relational mapping (ORM)](/en/ch3#object-relational-mapping-orm)
- error handling and aborted transactions, [Handling errors and aborts](/en/ch8#handling-errors-and-aborts)
- unsafe read-modify-write cycle code, [Atomic write operations](/en/ch8#atomic-write-operations)
- object-relational mismatch, [The Object-Relational Mismatch](/en/ch3#sec_datamodels_document)
- observability, [Problems with Distributed Systems](/en/ch1#sec_introduction_dist_sys_problems), [Humans and Reliability](/en/ch2#id31), [Operability: Making Life Easy for Operations](/en/ch2#id37)
- observer pattern, [Separation of application code and state](/en/ch13#id344)
- OBT (one big table), [Stars and Snowflakes: Schemas for Analytics](/en/ch3#sec_datamodels_analytics)
- offline systems, [Batch Processing](/en/ch11#ch_batch)
- (see also batch processing)
- offline-first applications, [Real-time collaboration, offline-first, and local-first apps](/en/ch6#real-time-collaboration-offline-first-and-local-first-apps), [Stateful, offline-capable clients](/en/ch13#id347)
- offsets
- consumer offsets in sharded logs, [Consumer offsets](/en/ch12#sec_stream_log_offsets)
- messages in sharded logs, [Using logs for message storage](/en/ch12#id300)
- OLAP (online analytic processing), [Characterizing Transaction Processing and Analytics](/en/ch1#sec_introduction_oltp), [Glossary](/en/glossary)
- data cubes, [Materialized Views and Data Cubes](/en/ch4#sec_storage_materialized_views)
- OLTP (online transaction processing), [Characterizing Transaction Processing and Analytics](/en/ch1#sec_introduction_oltp), [Glossary](/en/glossary)
- analytics queries versus, [Analytics](/en/ch11#sec_batch_olap)
- data normalization, [Trade-offs of normalization](/en/ch3#trade-offs-of-normalization)
- workload characteristics, [Actual Serial Execution](/en/ch8#sec_transactions_serial)
- on-premises deployment, [Cloud Versus Self-Hosting](/en/ch1#sec_introduction_cloud)
- data warehouses, [Cloud Data Warehouses](/en/ch4#sec_cloud_data_warehouses)
- one big table (data warehouse schema), [Stars and Snowflakes: Schemas for Analytics](/en/ch3#sec_datamodels_analytics)
- one-hot encoding, [DataFrames, Matrices, and Arrays](/en/ch3#sec_datamodels_dataframes)
- one-to-few relationships, [The document data model for one-to-many relationships](/en/ch3#the-document-data-model-for-one-to-many-relationships)
- one-to-many relationships, [The document data model for one-to-many relationships](/en/ch3#the-document-data-model-for-one-to-many-relationships)
- JSON representation, [The document data model for one-to-many relationships](/en/ch3#the-document-data-model-for-one-to-many-relationships)
- online systems, [Batch Processing](/en/ch11#ch_batch)
- (see also services)
- versus scientific computing, [Cloud Computing Versus Supercomputing](/en/ch1#id17)
- ontologies, [Triple-Stores and SPARQL](/en/ch3#id59)
- Oozie (workflow scheduler), [Batch Processing](/en/ch11#ch_batch)
- OpenAPI (service definition format), [Microservices and Serverless](/en/ch1#sec_introduction_microservices), [Web services](/en/ch5#sec_web_services)
- use of JSON Schema, [JSON Schema](/en/ch5#json-schema)
- openCypher (see Cypher (query language))
- OpenLink Virtuoso (see Virtuoso (database))
- OpenStack
- Swift (object storage), [Object Stores](/en/ch11#id277)
- operability, [Operability: Making Life Easy for Operations](/en/ch2#id37)
- operating systems versus databases, [Unbundling Databases](/en/ch13#sec_future_unbundling)
- operational systems, [Operational Versus Analytical Systems](/en/ch1#sec_introduction_analytics)
- (see also OLTP)
- as systems of record, [Systems of Record and Derived Data](/en/ch1#sec_introduction_derived)
- ETL into analytical systems, [Data Warehousing](/en/ch1#sec_introduction_dwh)
- operational transformation, [CRDTs and Operational Transformation](/en/ch6#sec_replication_crdts)
- operations teams, [Operations in the Cloud Era](/en/ch1#sec_introduction_operations)
- operators (query execution), [Query Execution: Compilation and Vectorization](/en/ch4#sec_storage_vectorized)
- in stream processing, [Processing Streams](/en/ch12#sec_stream_processing)
- optimistic concurrency control, [Pessimistic versus optimistic concurrency control](/en/ch8#pessimistic-versus-optimistic-concurrency-control)
- optimistic locking, [Conditional writes (compare-and-set)](/en/ch8#sec_transactions_compare_and_set)
- Oracle (database)
- distributed transaction support, [XA transactions](/en/ch8#xa-transactions)
- GoldenGate (change data capture), [Implementing change data capture](/en/ch12#id307)
- hierarchical queries, [Graph Queries in SQL](/en/ch3#id58)
- lack of serializability, [Isolation](/en/ch8#sec_transactions_acid_isolation)
- leader-based replication, [Single-Leader Replication](/en/ch6#sec_replication_leader)
- multi-leader replication, [Geographically Distributed Operation](/en/ch6#sec_replication_multi_dc)
- multi-table index cluster tables, [Data locality for reads and writes](/en/ch3#sec_datamodels_document_locality)
- not preventing write skew, [Characterizing write skew](/en/ch8#characterizing-write-skew)
- PL/SQL language, [Pros and cons of stored procedures](/en/ch8#sec_transactions_stored_proc_tradeoffs)
- preventing lost updates, [Automatically detecting lost updates](/en/ch8#automatically-detecting-lost-updates)
- read committed isolation, [Implementing read committed](/en/ch8#sec_transactions_read_committed_impl)
- Real Application Clusters (RAC), [Locking and leader election](/en/ch10#locking-and-leader-election)
- snapshot isolation support, [Snapshot Isolation and Repeatable Read](/en/ch8#sec_transactions_snapshot_isolation), [Snapshot isolation, repeatable read, and naming confusion](/en/ch8#snapshot-isolation-repeatable-read-and-naming-confusion)
- TimesTen (in-memory database), [Keeping everything in memory](/en/ch4#sec_storage_inmemory)
- WAL-based replication, [Write-ahead log (WAL) shipping](/en/ch6#write-ahead-log-wal-shipping)
- ORC (data format), [Cloud Data Warehouses](/en/ch4#sec_cloud_data_warehouses), [Column-Oriented Storage](/en/ch4#sec_storage_column)
- (see also column-oriented storage)
- orchestration (service deployment), [Cloud Versus Self-Hosting](/en/ch1#sec_introduction_cloud), [Microservices and Serverless](/en/ch1#sec_introduction_microservices)
- batch job execution, [Distributed Job Orchestration](/en/ch11#id278)
- workflow engines, [Batch Processing](/en/ch11#ch_batch)
- ordering
- event logs, [Event Sourcing and CQRS](/en/ch3#sec_datamodels_events)
- limits of total ordering, [The limits of total ordering](/en/ch13#id335)
- logical timestamps, [Logical Clocks](/en/ch10#sec_consistency_timestamps)
- of auto-incrementing IDs, [ID Generators and Logical Clocks](/en/ch10#sec_consistency_logical)
- shared logs, [Consensus in Practice](/en/ch10#sec_consistency_total_order)-[Pros and cons of consensus](/en/ch10#pros-and-cons-of-consensus)
- Orkes (workflow engine), [Durable Execution and Workflows](/en/ch5#sec_encoding_dataflow_workflows)
- orphan pages (B-trees), [Making B-trees reliable](/en/ch4#sec_storage_btree_wal)
- outbox pattern, [Change data capture versus event sourcing](/en/ch12#sec_stream_event_sourcing)
- outliers (response time), [Average, Median, and Percentiles](/en/ch2#id24)
- outsourcing, [Cloud Versus Self-Hosting](/en/ch1#sec_introduction_cloud)
- overload, [Describing Performance](/en/ch2#sec_introduction_percentiles), [Handling errors and aborts](/en/ch8#handling-errors-and-aborts)
### P
- PACELC principle, [The CAP theorem](/en/ch10#the-cap-theorem)
- package managers, [Separation of application code and state](/en/ch13#id344)
- packet switching, [Can we not simply make network delays predictable?](/en/ch9#can-we-not-simply-make-network-delays-predictable)
- packets
- corruption of, [Weak forms of lying](/en/ch9#weak-forms-of-lying)
- sending via UDP, [Direct messaging from producers to consumers](/en/ch12#id296)
- PageRank (algorithm), [Graph-Like Data Models](/en/ch3#sec_datamodels_graph), [Query languages](/en/ch11#sec_batch_query_lanauges), [Machine Learning](/en/ch11#id290)
- paging (see virtual memory)
- pandas (Python library), [From data warehouse to data lake](/en/ch1#from-data-warehouse-to-data-lake), [DataFrames, Matrices, and Arrays](/en/ch3#sec_datamodels_dataframes), [Column-Oriented Storage](/en/ch4#sec_storage_column), [DataFrames](/en/ch11#id287)
- Parquet (data format), [Cloud Data Warehouses](/en/ch4#sec_cloud_data_warehouses), [Column-Oriented Storage](/en/ch4#sec_storage_column), [Archival storage](/en/ch5#archival-storage), [Query languages](/en/ch11#sec_batch_query_lanauges)
- (see also column-oriented storage)
- databases on object storage, [Setting Up New Followers](/en/ch6#sec_replication_new_replica)
- document data model, [Column-Oriented Storage](/en/ch4#sec_storage_column)
- use in batch processing, [MapReduce](/en/ch11#sec_batch_mapreduce)
- partial failures, [Faults and Partial Failures](/en/ch9#sec_distributed_partial_failure), [Summary](/en/ch9#summary)
- limping, [System Model and Reality](/en/ch9#sec_distributed_system_model)
- partial synchrony (system model), [System Model and Reality](/en/ch9#sec_distributed_system_model)
- partition key, [Pros and Cons of Sharding](/en/ch7#sec_sharding_reasons), [Sharding of Key-Value Data](/en/ch7#sec_sharding_key_value)
- partitioning (see sharding)
- Paxos (consensus algorithm), [Consensus](/en/ch10#sec_consistency_consensus), [Consensus in Practice](/en/ch10#sec_consistency_total_order)
- ballot number, [From single-leader replication to consensus](/en/ch10#from-single-leader-replication-to-consensus)
- Multi-Paxos, [Consensus in Practice](/en/ch10#sec_consistency_total_order)
- payment card industry (PCI), [Data Systems, Law, and Society](/en/ch1#sec_introduction_compliance)
- PCI (payment card industry) compliance, [Data Systems, Law, and Society](/en/ch1#sec_introduction_compliance)
- percentiles, [Average, Median, and Percentiles](/en/ch2#id24), [Glossary](/en/glossary)
- calculating efficiently, [Use of Response Time Metrics](/en/ch2#sec_introduction_slo_sla)
- importance of high percentiles, [Use of Response Time Metrics](/en/ch2#sec_introduction_slo_sla)
- use in service level agreements (SLAs), [Use of Response Time Metrics](/en/ch2#sec_introduction_slo_sla)
- Percolator (Google), [Implementing a linearizable ID generator](/en/ch10#implementing-a-linearizable-id-generator)
- Percona XtraBackup (MySQL tool), [Setting Up New Followers](/en/ch6#sec_replication_new_replica)
- performance
- degradation as fault, [System Model and Reality](/en/ch9#sec_distributed_system_model)
- describing, [Describing Performance](/en/ch2#sec_introduction_percentiles)
- of distributed transactions, [Distributed Transactions Across Different Systems](/en/ch8#sec_transactions_xa)
- of in-memory databases, [Keeping everything in memory](/en/ch4#sec_storage_inmemory)
- of linearizability, [Linearizability and network delays](/en/ch10#linearizability-and-network-delays)
- of multi-leader replication, [Geographically Distributed Operation](/en/ch6#sec_replication_multi_dc)
- permission isolation, [Sharding for Multitenancy](/en/ch7#sec_sharding_multitenancy)
- perpetual inconsistency, [Timeliness and Integrity](/en/ch13#sec_future_integrity)
- pessimistic concurrency control, [Pessimistic versus optimistic concurrency control](/en/ch8#pessimistic-versus-optimistic-concurrency-control)
- pglogical (PostgreSQL extension), [Geographically Distributed Operation](/en/ch6#sec_replication_multi_dc)
- pgvector (vector index), [Vector Embeddings](/en/ch4#id92)
- phantoms (transaction isolation), [Phantoms causing write skew](/en/ch8#sec_transactions_phantom)
- materializing conflicts, [Materializing conflicts](/en/ch8#materializing-conflicts)
- preventing, in serializability, [Predicate locks](/en/ch8#predicate-locks)
- physical clocks (see clocks)
- pickle (Python), [Language-Specific Formats](/en/ch5#id96)
- Pinot (database), [Characterizing Transaction Processing and Analytics](/en/ch1#sec_introduction_oltp), [Column-Oriented Storage](/en/ch4#sec_storage_column)
- handling writes, [Writing to Column-Oriented Storage](/en/ch4#writing-to-column-oriented-storage)
- pre-aggregation, [Analytics](/en/ch11#sec_batch_olap)
- serving derived data, [Serving Derived Data](/en/ch11#sec_batch_serving_derived)
- pipelined execution
- in data warehouse queries, [Query Execution: Compilation and Vectorization](/en/ch4#sec_storage_vectorized)
- pivot table, [DataFrames, Matrices, and Arrays](/en/ch3#sec_datamodels_dataframes)
- point in time, [Unreliable Clocks](/en/ch9#sec_distributed_clocks)
- point query, [Characterizing Transaction Processing and Analytics](/en/ch1#sec_introduction_oltp)
- Polaris (data catalog), [Cloud Data Warehouses](/en/ch4#sec_cloud_data_warehouses)
- polling, [Representing Users, Posts, and Follows](/en/ch2#id20)
- polystores, [The meta-database of everything](/en/ch13#id341)
- POSIX (portable operating system interface)
- compliant filesystems, [Setting Up New Followers](/en/ch6#sec_replication_new_replica), [Distributed Filesystems](/en/ch11#sec_batch_dfs), [Object Stores](/en/ch11#id277)
- Post Office Horizon scandal, [Humans and Reliability](/en/ch2#id31)
- lack of transactions, [Transactions](/en/ch8#ch_transactions)
- PostgreSQL (database)
- archiving WAL to object stores, [Setting Up New Followers](/en/ch6#sec_replication_new_replica)
- change data capture, [Implementing change data capture](/en/ch12#id307), [API support for change streams](/en/ch12#sec_stream_change_api)
- distributed transaction support, [XA transactions](/en/ch8#xa-transactions)
- foreign data wrappers, [The meta-database of everything](/en/ch13#id341)
- full text search support, [Combining Specialized Tools by Deriving Data](/en/ch13#id442)
- in the cloud, [Cloud-Native System Architecture](/en/ch1#sec_introduction_cloud_native)
- JSON Schema validation, [JSON Schema](/en/ch5#json-schema)
- leader-based replication, [Single-Leader Replication](/en/ch6#sec_replication_leader)
- log sequence number, [Setting Up New Followers](/en/ch6#sec_replication_new_replica)
- logical decoding, [Logical (row-based) log replication](/en/ch6#logical-row-based-log-replication)
- materialized view maintenance, [Maintaining materialized views](/en/ch12#sec_stream_mat_view)
- multi-leader replication, [Geographically Distributed Operation](/en/ch6#sec_replication_multi_dc)
- MVCC implementation, [Multi-version concurrency control (MVCC)](/en/ch8#sec_transactions_snapshot_impl), [Indexes and snapshot isolation](/en/ch8#indexes-and-snapshot-isolation)
- partitioning versus sharding, [Sharding](/en/ch7#ch_sharding)
- pgvector (vector index), [Vector Embeddings](/en/ch4#id92)
- PL/pgSQL language, [Pros and cons of stored procedures](/en/ch8#sec_transactions_stored_proc_tradeoffs)
- PostGIS geospatial indexes, [Multidimensional and Full-Text Indexes](/en/ch4#sec_storage_multidimensional)
- preventing lost updates, [Automatically detecting lost updates](/en/ch8#automatically-detecting-lost-updates)
- preventing write skew, [Characterizing write skew](/en/ch8#characterizing-write-skew), [Serializable Snapshot Isolation (SSI)](/en/ch8#sec_transactions_ssi)
- read committed isolation, [Implementing read committed](/en/ch8#sec_transactions_read_committed_impl)
- representing graphs, [Property Graphs](/en/ch3#id56)
- serializable snapshot isolation (SSI), [Serializable Snapshot Isolation (SSI)](/en/ch8#sec_transactions_ssi)
- sharding (see Citus (database))
- snapshot isolation support, [Snapshot Isolation and Repeatable Read](/en/ch8#sec_transactions_snapshot_isolation), [Snapshot isolation, repeatable read, and naming confusion](/en/ch8#snapshot-isolation-repeatable-read-and-naming-confusion)
- WAL-based replication, [Write-ahead log (WAL) shipping](/en/ch6#write-ahead-log-wal-shipping)
- postings list, [Full-Text Search](/en/ch4#sec_storage_full_text)
- in sharded indexes, [Local Secondary Indexes](/en/ch7#id166)
- postmortems, blameless, [Humans and Reliability](/en/ch2#id31)
- PouchDB (database), [Pros and cons of sync engines](/en/ch6#pros-and-cons-of-sync-engines)
- Power BI (business intelligence software), [Characterizing Transaction Processing and Analytics](/en/ch1#sec_introduction_oltp), [Analytics](/en/ch11#sec_batch_olap)
- pre-aggregation, [Analytics](/en/ch11#sec_batch_olap)
- serving derived data, [Serving Derived Data](/en/ch11#sec_batch_serving_derived)
- pre-splitting, [Rebalancing key-range sharded data](/en/ch7#rebalancing-key-range-sharded-data)
- Precision Time Protocol (PTP), [Clock Synchronization and Accuracy](/en/ch9#sec_distributed_clock_accuracy)
- predicate locks, [Predicate locks](/en/ch8#predicate-locks)
- predictive analytics, [Operational Versus Analytical Systems](/en/ch1#sec_introduction_analytics), [Predictive Analytics](/en/ch14#id369)-[Feedback Loops](/en/ch14#id372)
- amplifying bias, [Bias and Discrimination](/en/ch14#id370)
- ethics of (see ethics)
- feedback loops, [Feedback Loops](/en/ch14#id372)
- preemption, [Resource Allocation](/en/ch11#id279)
- in distributed schedulers, [Handling Faults](/en/ch11#id281)
- of threads, [Process Pauses](/en/ch9#sec_distributed_clocks_pauses)
- Prefect (workflow scheduler), [Durable Execution and Workflows](/en/ch5#sec_encoding_dataflow_workflows), [Batch Processing](/en/ch11#ch_batch), [Scheduling Workflows](/en/ch11#sec_batch_workflows)
- cloud data warehouse integration, [Query languages](/en/ch11#sec_batch_query_lanauges)
- Presto (query engine), [Cloud Data Warehouses](/en/ch4#sec_cloud_data_warehouses)
- primary keys, [Multi-Column and Secondary Indexes](/en/ch4#sec_storage_index_multicolumn), [Glossary](/en/glossary)
- auto-incrementing, [ID Generators and Logical Clocks](/en/ch10#sec_consistency_logical)
- versus partition key, [Sharding by hash range](/en/ch7#sharding-by-hash-range)
- primary-backup replication (see leader-based replication)
- privacy, [Privacy and Tracking](/en/ch14#id373)-[Legislation and Self-Regulation](/en/ch14#sec_future_legislation)
- consent and freedom of choice, [Consent and Freedom of Choice](/en/ch14#id375)
- data as assets and power, [Data as Assets and Power](/en/ch14#id376)
- deleting data, [Limitations of immutability](/en/ch12#sec_stream_immutability_limitations)
- ethical considerations (see ethics)
- legislation and self-regulation, [Legislation and Self-Regulation](/en/ch14#sec_future_legislation)
- meaning of, [Privacy and Use of Data](/en/ch14#id457)
- regulation, [Data Systems, Law, and Society](/en/ch1#sec_introduction_compliance)
- surveillance, [Surveillance](/en/ch14#id374)
- tracking behavioral data, [Privacy and Tracking](/en/ch14#id373)
- probabilistic algorithms, [Use of Response Time Metrics](/en/ch2#sec_introduction_slo_sla), [Stream analytics](/en/ch12#id318)
- process pauses, [Process Pauses](/en/ch9#sec_distributed_clocks_pauses)-[Limiting the impact of garbage collection](/en/ch9#sec_distributed_gc_impact)
- processing time (of events), [Reasoning About Time](/en/ch12#sec_stream_time)
- producers (message streams), [Transmitting Event Streams](/en/ch12#sec_stream_transmit)
- product analytics, [Characterizing Transaction Processing and Analytics](/en/ch1#sec_introduction_oltp)
- column-oriented storage, [Column-Oriented Storage](/en/ch4#sec_storage_column)
- programming languages
- for stored procedures, [Pros and cons of stored procedures](/en/ch8#sec_transactions_stored_proc_tradeoffs)
- projections (event sourcing), [Event Sourcing and CQRS](/en/ch3#sec_datamodels_events)
- Prolog (language), [Datalog: Recursive Relational Queries](/en/ch3#id62)
- (see also Datalog)
- property graphs, [Property Graphs](/en/ch3#id56)
- Cypher query language, [The Cypher Query Language](/en/ch3#id57)
- Property Graph Query Language (PGQL), [Graph Queries in SQL](/en/ch3#id58)
- property-based testing, [Humans and Reliability](/en/ch2#id31), [Formal Methods and Randomized Testing](/en/ch9#sec_distributed_formal)
- Protocol Buffers (data format), [Protocol Buffers](/en/ch5#sec_encoding_protobuf)-[Field tags and schema evolution](/en/ch5#field-tags-and-schema-evolution)
- field tags and schema evolution, [Field tags and schema evolution](/en/ch5#field-tags-and-schema-evolution)
- provenance of data, [Designing for auditability](/en/ch13#id365)
- publish/subscribe model, [Messaging Systems](/en/ch12#sec_stream_messaging)
- publishers (message streams), [Transmitting Event Streams](/en/ch12#sec_stream_transmit)
- Pulsar (streaming platform), [Acknowledgments and redelivery](/en/ch12#sec_stream_reordering)
- PyTorch (machine learning library), [Machine Learning](/en/ch11#id290)
### Q
- Qpid (messaging), [Message brokers compared to databases](/en/ch12#id297)
- quality of service (QoS), [Can we not simply make network delays predictable?](/en/ch9#can-we-not-simply-make-network-delays-predictable)
- Quantcast File System (distributed filesystem), [Object Stores](/en/ch11#id277)
- query engines
- compilation and vectorization, [Query Execution: Compilation and Vectorization](/en/ch4#sec_storage_vectorized)
- in cloud data warehouse, [Cloud Data Warehouses](/en/ch4#sec_cloud_data_warehouses)
- operators, [Query Execution: Compilation and Vectorization](/en/ch4#sec_storage_vectorized)
- optimizing declarative queries, [Data Models and Query Languages](/en/ch3#ch_datamodels)
- query languages
- Cypher, [The Cypher Query Language](/en/ch3#id57)
- Datalog, [Datalog: Recursive Relational Queries](/en/ch3#id62)
- GraphQL, [GraphQL](/en/ch3#id63)
- MongoDB aggregation pipeline, [Normalization, Denormalization, and Joins](/en/ch3#sec_datamodels_normalization), [Query languages for documents](/en/ch3#query-languages-for-documents)
- recursive SQL queries, [Graph Queries in SQL](/en/ch3#id58)
- SPARQL, [The SPARQL query language](/en/ch3#the-sparql-query-language)
- SQL, [Normalization, Denormalization, and Joins](/en/ch3#sec_datamodels_normalization)
- query optimizers, [Query languages](/en/ch11#sec_batch_query_lanauges)
- query plans, [Query Execution: Compilation and Vectorization](/en/ch4#sec_storage_vectorized)
- queueing delays, [Network congestion and queueing](/en/ch9#network-congestion-and-queueing)
- head-of-line blocking, [Latency and Response Time](/en/ch2#id23)
- latency and response time, [Latency and Response Time](/en/ch2#id23)
- queues (messaging), [Message brokers](/en/ch5#message-brokers)
- QUIC (protocol), [The Limitations of TCP](/en/ch9#sec_distributed_tcp)
- quorums, [Quorums for reading and writing](/en/ch6#sec_replication_quorum_condition)-[Multi-region operation](/en/ch6#multi-region-operation), [Glossary](/en/glossary)
- for leaderless replication, [Quorums for reading and writing](/en/ch6#sec_replication_quorum_condition)
- in consensus algorithms, [From single-leader replication to consensus](/en/ch10#from-single-leader-replication-to-consensus)
- limitations of consistency, [Limitations of Quorum Consistency](/en/ch6#sec_replication_quorum_limitations)-[Monitoring staleness](/en/ch6#monitoring-staleness), [Linearizability and quorums](/en/ch10#sec_consistency_quorum_linearizable)
- making decisions in distributed systems, [The Majority Rules](/en/ch9#sec_distributed_majority)
- monitoring staleness, [Monitoring staleness](/en/ch6#monitoring-staleness)
- multi-region replication, [Multi-region operation](/en/ch6#multi-region-operation)
- relying on durability, [Mapping system models to the real world](/en/ch9#mapping-system-models-to-the-real-world)
- quotas, [Operations in the Cloud Era](/en/ch1#sec_introduction_operations)
### R
- R (language), [From data warehouse to data lake](/en/ch1#from-data-warehouse-to-data-lake), [DataFrames, Matrices, and Arrays](/en/ch3#sec_datamodels_dataframes), [DataFrames](/en/ch11#id287)
- R-trees (indexes), [Multidimensional and Full-Text Indexes](/en/ch4#sec_storage_multidimensional)
- R2 (object storage), [Layering of cloud services](/en/ch1#layering-of-cloud-services), [Distributed Filesystems](/en/ch11#sec_batch_dfs)
- RabbitMQ (messaging), [Message brokers](/en/ch5#message-brokers), [Message brokers compared to databases](/en/ch12#id297)
- quorum queues (replication), [Single-Leader Replication](/en/ch6#sec_replication_leader)
- race conditions, [Isolation](/en/ch8#sec_transactions_acid_isolation)
- (see also concurrency)
- avoiding with linearizability, [Cross-channel timing dependencies](/en/ch10#cross-channel-timing-dependencies)
- caused by dual writes, [Keeping Systems in Sync](/en/ch12#sec_stream_sync)
- causing loss of money, [Weak Isolation Levels](/en/ch8#sec_transactions_isolation_levels)
- dirty writes, [No dirty writes](/en/ch8#sec_transactions_dirty_write)
- in counter increments, [No dirty writes](/en/ch8#sec_transactions_dirty_write)
- lost updates, [Preventing Lost Updates](/en/ch8#sec_transactions_lost_update)-[Conflict resolution and replication](/en/ch8#conflict-resolution-and-replication)
- preventing with event logs, [Concurrency control](/en/ch12#sec_stream_concurrency), [Dataflow: Interplay between state changes and application code](/en/ch13#id450)
- preventing with serializable isolation, [Serializability](/en/ch8#sec_transactions_serializability)
- weak transaction isolation, [Weak Isolation Levels](/en/ch8#sec_transactions_isolation_levels)
- write skew, [Write Skew and Phantoms](/en/ch8#sec_transactions_write_skew)-[Materializing conflicts](/en/ch8#materializing-conflicts)
- Raft (consensus algorithm), [Consensus](/en/ch10#sec_consistency_consensus), [Consensus in Practice](/en/ch10#sec_consistency_total_order)
- leader-based replication, [Single-Leader Replication](/en/ch6#sec_replication_leader)
- sensitivity to network problems, [Pros and cons of consensus](/en/ch10#pros-and-cons-of-consensus)
- term number, [From single-leader replication to consensus](/en/ch10#from-single-leader-replication-to-consensus)
- use in etcd, [Implementing Linearizable Systems](/en/ch10#sec_consistency_implementing_linearizable)
- RAID (Redundant Array of Independent Disks), [Separation of storage and compute](/en/ch1#sec_introduction_storage_compute), [Tolerating hardware faults through redundancy](/en/ch2#tolerating-hardware-faults-through-redundancy), [Distributed Filesystems](/en/ch11#sec_batch_dfs)
- railways, schema migration on, [Reprocessing data for application evolution](/en/ch13#sec_future_reprocessing)
- RAM (see memory)
- RAMCloud (in-memory storage), [Keeping everything in memory](/en/ch4#sec_storage_inmemory)
- random writes (access pattern), [Sequential versus random writes](/en/ch4#sidebar_sequential)
- range queries
- in B-trees, [B-Trees](/en/ch4#sec_storage_b_trees), [Read performance](/en/ch4#read-performance)
- in LSM-trees, [Read performance](/en/ch4#read-performance)
- not efficient in hash maps, [Log-Structured Storage](/en/ch4#sec_storage_log_structured)
- with hash sharding, [Sharding by hash range](/en/ch7#sharding-by-hash-range)
- ranking algorithms, [Machine Learning](/en/ch11#id290)
- Ray (workflow scheduler), [Machine Learning](/en/ch11#id290)
- RDF (Resource Description Framework), [The RDF data model](/en/ch3#the-rdf-data-model)
- querying with SPARQL, [The SPARQL query language](/en/ch3#the-sparql-query-language)
- RDMA (Remote Direct Memory Access), [Layering of cloud services](/en/ch1#layering-of-cloud-services), [Cloud Computing Versus Supercomputing](/en/ch1#id17)
- React (user interface library), [End-to-end event streams](/en/ch13#id349)
- reactive programming, [Pros and cons of sync engines](/en/ch6#pros-and-cons-of-sync-engines)
- read committed isolation level, [Read Committed](/en/ch8#sec_transactions_read_committed)-[Implementing read committed](/en/ch8#sec_transactions_read_committed_impl)
- implementing, [Implementing read committed](/en/ch8#sec_transactions_read_committed_impl)
- multi-version concurrency control (MVCC), [Multi-version concurrency control (MVCC)](/en/ch8#sec_transactions_snapshot_impl)
- no dirty reads, [No dirty reads](/en/ch8#no-dirty-reads)
- no dirty writes, [No dirty writes](/en/ch8#sec_transactions_dirty_write)
- read models (event sourcing), [Event Sourcing and CQRS](/en/ch3#sec_datamodels_events)
- read path (derived data), [Observing Derived State](/en/ch13#sec_future_observing)
- read repair (leaderless replication), [Catching up on missed writes](/en/ch6#sec_replication_read_repair)
- for linearizability, [Linearizability and quorums](/en/ch10#sec_consistency_quorum_linearizable)
- read replicas (see leader-based replication)
- read skew (transaction isolation), [Snapshot Isolation and Repeatable Read](/en/ch8#sec_transactions_snapshot_isolation), [Summary](/en/ch8#summary)
- read uncommitted isolation level, [Implementing read committed](/en/ch8#sec_transactions_read_committed_impl)
- read-after-write consistency, [Reading Your Own Writes](/en/ch6#sec_replication_ryw), [Timeliness and Integrity](/en/ch13#sec_future_integrity)
- cross-device, [Reading Your Own Writes](/en/ch6#sec_replication_ryw)
- in derived data systems, [Derived data versus distributed transactions](/en/ch13#sec_future_derived_vs_transactions)
- read-modify-write cycle, [Preventing Lost Updates](/en/ch8#sec_transactions_lost_update)
- read-scaling architecture, [Problems with Replication Lag](/en/ch6#sec_replication_lag), [Single-Leader Versus Leaderless Replication Performance](/en/ch6#sec_replication_leaderless_perf)
- versus sharding, [Pros and Cons of Sharding](/en/ch7#sec_sharding_reasons)
- reads as events, [Reads are events too](/en/ch13#sec_future_read_events)
- real-time
- analytics (see product analytics)
- collaborative editing, [Real-time collaboration, offline-first, and local-first apps](/en/ch6#real-time-collaboration-offline-first-and-local-first-apps)
- publish/subscribe dataflow, [End-to-end event streams](/en/ch13#id349)
- response time guarantees, [Response time guarantees](/en/ch9#sec_distributed_clocks_realtime)
- time-of-day clocks, [Time-of-day clocks](/en/ch9#time-of-day-clocks)
- Realm (database), [Pros and cons of sync engines](/en/ch6#pros-and-cons-of-sync-engines)
- rebalancing shards, [Rebalancing key-range sharded data](/en/ch7#rebalancing-key-range-sharded-data)-[Operations: Automatic or Manual Rebalancing](/en/ch7#sec_sharding_operations), [Glossary](/en/glossary)
- (see also sharding)
- automatic or manual rebalancing, [Operations: Automatic or Manual Rebalancing](/en/ch7#sec_sharding_operations)
- fixed number of shards, [Fixed number of shards](/en/ch7#fixed-number-of-shards)
- fixed number of shards per node, [Sharding by hash range](/en/ch7#sharding-by-hash-range)
- problems with hash mod N, [Hash modulo number of nodes](/en/ch7#hash-modulo-number-of-nodes)
- recency guarantee, [Linearizability](/en/ch10#sec_consistency_linearizability)
- recommendation engines, [Operational Versus Analytical Systems](/en/ch1#sec_introduction_analytics)
- building using DataFrames, [DataFrames, Matrices, and Arrays](/en/ch3#sec_datamodels_dataframes)
- iterative processing, [Machine Learning](/en/ch11#id290)
- reconfiguration (consensus), [Subtleties of consensus](/en/ch10#subtleties-of-consensus)
- records, [MapReduce](/en/ch11#sec_batch_mapreduce)
- events in stream processing, [Transmitting Event Streams](/en/ch12#sec_stream_transmit)
- recursive queries
- in Cypher, [The Cypher Query Language](/en/ch3#id57)
- in Datalog, [Datalog: Recursive Relational Queries](/en/ch3#id62)
- in SPARQL, [The SPARQL query language](/en/ch3#the-sparql-query-language)
- lack of, in GraphQL, [GraphQL](/en/ch3#id63)
- SQL common table expressions, [Graph Queries in SQL](/en/ch3#id58)
- Red Hat
- Apicurio Registry, [JSON Schema](/en/ch5#json-schema)
- red-black tree, [Constructing and merging SSTables](/en/ch4#constructing-and-merging-sstables)
- redelivery (messaging), [Acknowledgments and redelivery](/en/ch12#sec_stream_reordering)
- Redis (database)
- atomic operations, [Atomic write operations](/en/ch8#atomic-write-operations)
- CRDT support, [CRDTs and Operational Transformation](/en/ch6#sec_replication_crdts)
- durability, [Keeping everything in memory](/en/ch4#sec_storage_inmemory)
- Lua scripting, [Pros and cons of stored procedures](/en/ch8#sec_transactions_stored_proc_tradeoffs)
- multi-leader replication, [Geographically Distributed Operation](/en/ch6#sec_replication_multi_dc)
- process-per-core model, [Pros and Cons of Sharding](/en/ch7#sec_sharding_reasons)
- single-threaded execution, [Actual Serial Execution](/en/ch8#sec_transactions_serial)
- redo log (see write-ahead log)
- Redpanda (messaging), [Message brokers](/en/ch5#message-brokers), [Setting Up New Followers](/en/ch6#sec_replication_new_replica)
- tiered storage, [Disk space usage](/en/ch12#sec_stream_disk_usage)
- Redshift (database), [Cloud Data Warehouses](/en/ch4#sec_cloud_data_warehouses)
- redundancy
- hardware components, [Tolerating hardware faults through redundancy](/en/ch2#tolerating-hardware-faults-through-redundancy)
- of derived data, [Systems of Record and Derived Data](/en/ch1#sec_introduction_derived)
- (see also derived data)
- Reed–Solomon codes (error correction), [Distributed Filesystems](/en/ch11#sec_batch_dfs)
- refactoring, [Evolvability: Making Change Easy](/en/ch2#sec_introduction_evolvability)
- (see also evolvability)
- regions (geographic distribution), [Reading Your Own Writes](/en/ch6#sec_replication_ryw)
- (see also datacenters)
- consensus across, [Pros and cons of consensus](/en/ch10#pros-and-cons-of-consensus)
- definition, [Reading Your Own Writes](/en/ch6#sec_replication_ryw)
- latency, [Distributed Versus Single-Node Systems](/en/ch1#sec_introduction_distributed)
- linearizable ID generation, [Implementing a linearizable ID generator](/en/ch10#implementing-a-linearizable-id-generator)
- replication across, [Geographically Distributed Operation](/en/ch6#sec_replication_multi_dc)-[Problems with different topologies](/en/ch6#problems-with-different-topologies), [The Cost of Linearizability](/en/ch10#sec_linearizability_cost), [The limits of total ordering](/en/ch13#id335)
- leaderless, [Multi-region operation](/en/ch6#multi-region-operation)
- multi-leader, [Geographically Distributed Operation](/en/ch6#sec_replication_multi_dc)
- regions (sharding), [Sharding](/en/ch7#ch_sharding)
- register (data structure), [What Makes a System Linearizable?](/en/ch10#sec_consistency_lin_definition)
- regulation (see legal matters)
- relational data model, [From data warehouse to data lake](/en/ch1#from-data-warehouse-to-data-lake), [Relational Model versus Document Model](/en/ch3#sec_datamodels_history)-[Convergence of document and relational databases](/en/ch3#convergence-of-document-and-relational-databases)
- comparison to document model, [When to Use Which Model](/en/ch3#sec_datamodels_document_summary)-[Convergence of document and relational databases](/en/ch3#convergence-of-document-and-relational-databases)
- graph queries in SQL, [Graph Queries in SQL](/en/ch3#id58)
- in-memory databases with, [Keeping everything in memory](/en/ch4#sec_storage_inmemory)
- many-to-one and many-to-many relationships, [Many-to-One and Many-to-Many Relationships](/en/ch3#sec_datamodels_many_to_many)
- multi-object transactions, need for, [The need for multi-object transactions](/en/ch8#sec_transactions_need)
- object-relational mismatch, [The Object-Relational Mismatch](/en/ch3#sec_datamodels_document)
- representing a reorderable list, [When to Use Which Model](/en/ch3#sec_datamodels_document_summary)
- versus document model
- convergence of models, [Convergence of document and relational databases](/en/ch3#convergence-of-document-and-relational-databases)
- data locality, [Data locality for reads and writes](/en/ch3#sec_datamodels_document_locality)
- relational databases
- eventual consistency, [Problems with Replication Lag](/en/ch6#sec_replication_lag)
- history, [Relational Model versus Document Model](/en/ch3#sec_datamodels_history)
- leader-based replication, [Single-Leader Replication](/en/ch6#sec_replication_leader)
- logical logs, [Logical (row-based) log replication](/en/ch6#logical-row-based-log-replication)
- philosophy compared to Unix, [Unbundling Databases](/en/ch13#sec_future_unbundling), [The meta-database of everything](/en/ch13#id341)
- schema changes, [Schema flexibility in the document model](/en/ch3#sec_datamodels_schema_flexibility), [Encoding and Evolution](/en/ch5#ch_encoding), [Different values written at different times](/en/ch5#different-values-written-at-different-times)
- sharded secondary indexes, [Sharding and Secondary Indexes](/en/ch7#sec_sharding_secondary_indexes)
- statement-based replication, [Statement-based replication](/en/ch6#statement-based-replication)
- use of B-tree indexes, [B-Trees](/en/ch4#sec_storage_b_trees)
- relationships (see edges)
- reliability, [Reliability and Fault Tolerance](/en/ch2#sec_introduction_reliability)-[Humans and Reliability](/en/ch2#id31), [A Philosophy of Streaming Systems](/en/ch13#ch_philosophy)
- building a reliable system from unreliable components, [Faults and Partial Failures](/en/ch9#sec_distributed_partial_failure)
- hardware faults, [Hardware and Software Faults](/en/ch2#sec_introduction_hardware_faults)
- human errors, [Humans and Reliability](/en/ch2#id31)
- importance of, [Humans and Reliability](/en/ch2#id31)
- of messaging systems, [Messaging Systems](/en/ch12#sec_stream_messaging)
- software faults, [Software faults](/en/ch2#software-faults)
- Remote Method Invocation (Java RMI), [The problems with remote procedure calls (RPCs)](/en/ch5#sec_problems_with_rpc)
- remote procedure calls (RPCs), [The problems with remote procedure calls (RPCs)](/en/ch5#sec_problems_with_rpc)-[Data encoding and evolution for RPC](/en/ch5#data-encoding-and-evolution-for-rpc)
- (see also services)
- data encoding and evolution, [Data encoding and evolution for RPC](/en/ch5#data-encoding-and-evolution-for-rpc)
- issues with, [The problems with remote procedure calls (RPCs)](/en/ch5#sec_problems_with_rpc)
- using Avro, [But what is the writer's schema?](/en/ch5#but-what-is-the-writers-schema)
- versus message brokers, [Event-Driven Architectures](/en/ch5#sec_encoding_dataflow_msg)
- renewable energy, [Distributed Versus Single-Node Systems](/en/ch1#sec_introduction_distributed)
- repeatable reads (transaction isolation), [Snapshot isolation, repeatable read, and naming confusion](/en/ch8#snapshot-isolation-repeatable-read-and-naming-confusion)
- replicas, [Single-Leader Replication](/en/ch6#sec_replication_leader)
- replication, [Replication](/en/ch6#ch_replication)-[Summary](/en/ch6#summary), [Glossary](/en/glossary)
- and durability, [Durability](/en/ch8#durability)
- conflict resolution and, [Conflict resolution and replication](/en/ch8#conflict-resolution-and-replication)
- consistency properties, [Problems with Replication Lag](/en/ch6#sec_replication_lag)-[Solutions for Replication Lag](/en/ch6#id131)
- consistent prefix reads, [Consistent Prefix Reads](/en/ch6#sec_replication_consistent_prefix)
- monotonic reads, [Monotonic Reads](/en/ch6#sec_replication_monotonic_reads)
- reading your own writes, [Reading Your Own Writes](/en/ch6#sec_replication_ryw)
- in distributed filesystems, [Distributed Filesystems](/en/ch11#sec_batch_dfs)
- leaderless, [Leaderless Replication](/en/ch6#sec_replication_leaderless)-[Version vectors](/en/ch6#version-vectors)
- detecting concurrent writes, [Detecting Concurrent Writes](/en/ch6#sec_replication_concurrent)-[Version vectors](/en/ch6#version-vectors)
- limitations of quorum consistency, [Limitations of Quorum Consistency](/en/ch6#sec_replication_quorum_limitations)-[Monitoring staleness](/en/ch6#monitoring-staleness), [Linearizability and quorums](/en/ch10#sec_consistency_quorum_linearizable)
- monitoring staleness, [Monitoring staleness](/en/ch6#monitoring-staleness)
- multi-leader, [Multi-Leader Replication](/en/ch6#sec_replication_multi_leader)-[Types of conflict](/en/ch6#sec_replication_write_conflicts)
- across multiple regions, [Geographically Distributed Operation](/en/ch6#sec_replication_multi_dc), [The Cost of Linearizability](/en/ch10#sec_linearizability_cost)
- conflict resolution, [Dealing with Conflicting Writes](/en/ch6#sec_replication_write_conflicts)-[Types of conflict](/en/ch6#sec_replication_write_conflicts)
- replication topologies, [Multi-leader replication topologies](/en/ch6#sec_replication_topologies)-[Problems with different topologies](/en/ch6#problems-with-different-topologies)
- reasons for using, [Distributed Versus Single-Node Systems](/en/ch1#sec_introduction_distributed), [Replication](/en/ch6#ch_replication)
- sharding and, [Sharding](/en/ch7#ch_sharding)
- single-leader, [Single-Leader Replication](/en/ch6#sec_replication_leader)-[Logical (row-based) log replication](/en/ch6#logical-row-based-log-replication)
- failover, [Leader failure: Failover](/en/ch6#leader-failure-failover)
- implementation of replication logs, [Implementation of Replication Logs](/en/ch6#sec_replication_implementation)-[Logical (row-based) log replication](/en/ch6#logical-row-based-log-replication)
- relation to consensus, [From single-leader replication to consensus](/en/ch10#from-single-leader-replication-to-consensus), [Pros and cons of consensus](/en/ch10#pros-and-cons-of-consensus)
- setting up new followers, [Setting Up New Followers](/en/ch6#sec_replication_new_replica)
- synchronous versus asynchronous, [Synchronous Versus Asynchronous Replication](/en/ch6#sec_replication_sync_async)
- state machine replication, [Statement-based replication](/en/ch6#statement-based-replication), [Pros and cons of stored procedures](/en/ch8#sec_transactions_stored_proc_tradeoffs), [Using shared logs](/en/ch10#sec_consistency_smr), [Databases and Streams](/en/ch12#sec_stream_databases)
- event sourcing, [Event Sourcing and CQRS](/en/ch3#sec_datamodels_events)
- reliance on determinism, [Deterministic simulation testing](/en/ch9#deterministic-simulation-testing)
- using consensus, [Pros and cons of consensus](/en/ch10#pros-and-cons-of-consensus)
- using erasure coding, [Distributed Filesystems](/en/ch11#sec_batch_dfs)
- using object storage, [Setting Up New Followers](/en/ch6#sec_replication_new_replica)
- versus backups, [Replication](/en/ch6#ch_replication)
- with heterogeneous data systems, [Keeping Systems in Sync](/en/ch12#sec_stream_sync)
- replication logs (see logs)
- representations of data (see data models)
- reprocessing data, [Reprocessing data for application evolution](/en/ch13#sec_future_reprocessing), [Unifying batch and stream processing](/en/ch13#id338)
- (see also evolvability)
- from log-based messaging, [Replaying old messages](/en/ch12#sec_stream_replay)
- request hedging, [Single-Leader Versus Leaderless Replication Performance](/en/ch6#sec_replication_leaderless_perf)
- request identifiers, [Uniquely identifying requests](/en/ch13#id355), [Multi-shard request processing](/en/ch13#id360)
- request routing, [Request Routing](/en/ch7#sec_sharding_routing)
- approaches to, [Request Routing](/en/ch7#sec_sharding_routing)
- residence laws for data, [Distributed Versus Single-Node Systems](/en/ch1#sec_introduction_distributed), [Sharding for Multitenancy](/en/ch7#sec_sharding_multitenancy)
- resilient systems, [Reliability and Fault Tolerance](/en/ch2#sec_introduction_reliability)
- (see also fault tolerance)
- resource isolation, [Cloud Computing Versus Supercomputing](/en/ch1#id17), [Sharding for Multitenancy](/en/ch7#sec_sharding_multitenancy)
- resource limits, [Operations in the Cloud Era](/en/ch1#sec_introduction_operations)
- response time
- as performance metric, [Describing Performance](/en/ch2#sec_introduction_percentiles), [Batch Processing](/en/ch11#ch_batch)
- guarantees on, [Response time guarantees](/en/ch9#sec_distributed_clocks_realtime)
- impact on users, [Average, Median, and Percentiles](/en/ch2#id24)
- in replicated systems, [Single-Leader Versus Leaderless Replication Performance](/en/ch6#sec_replication_leaderless_perf)
- latency versus, [Latency and Response Time](/en/ch2#id23)
- mean and percentiles, [Average, Median, and Percentiles](/en/ch2#id24)
- user experience, [Average, Median, and Percentiles](/en/ch2#id24)
- responsibility and accountability, [Responsibility and Accountability](/en/ch14#id371)
- REST (Representational State Transfer), [Web services](/en/ch5#sec_web_services)
- (see also services)
- Restate (workflow engine), [Durable Execution and Workflows](/en/ch5#sec_encoding_dataflow_workflows)
- RethinkDB (database)
- join support, [Convergence of document and relational databases](/en/ch3#convergence-of-document-and-relational-databases)
- key-range sharding, [Sharding by Key Range](/en/ch7#sec_sharding_key_range)
- retry storm, [Describing Performance](/en/ch2#sec_introduction_percentiles), [Software faults](/en/ch2#software-faults)
- reverse ETL, [Beyond the data lake](/en/ch1#beyond-the-data-lake)
- Riak (database)
- CRDT support, [CRDTs and Operational Transformation](/en/ch6#sec_replication_crdts), [Detecting Concurrent Writes](/en/ch6#sec_replication_concurrent)
- dotted version vectors, [Version vectors](/en/ch6#version-vectors)
- gossip protocol, [Request Routing](/en/ch7#sec_sharding_routing)
- hash sharding, [Fixed number of shards](/en/ch7#fixed-number-of-shards)
- leaderless replication, [Leaderless Replication](/en/ch6#sec_replication_leaderless)
- linearizability, lack of, [Linearizability and quorums](/en/ch10#sec_consistency_quorum_linearizable)
- multi-region support, [Multi-region operation](/en/ch6#multi-region-operation)
- rebalancing, [Operations: Automatic or Manual Rebalancing](/en/ch7#sec_sharding_operations)
- secondary indexes, [Local Secondary Indexes](/en/ch7#id166)
- sloppy quorums, [Single-Leader Versus Leaderless Replication Performance](/en/ch6#sec_replication_leaderless_perf)
- vnodes (sharding), [Sharding](/en/ch7#ch_sharding)
- ring buffers, [Disk space usage](/en/ch12#sec_stream_disk_usage)
- RisingWave (database)
- incremental view maintenance, [Maintaining materialized views](/en/ch12#sec_stream_mat_view)
- rockets, [Byzantine Faults](/en/ch9#sec_distributed_byzantine)
- RocksDB (storage engine), [Constructing and merging SSTables](/en/ch4#constructing-and-merging-sstables)
- as embedded storage engine, [Compaction strategies](/en/ch4#sec_storage_lsm_compaction)
- leveled compaction, [Compaction strategies](/en/ch4#sec_storage_lsm_compaction)
- serving derived data, [Serving Derived Data](/en/ch11#sec_batch_serving_derived)
- rollbacks (transactions), [Transactions](/en/ch8#ch_transactions)
- rolling upgrades, [Tolerating hardware faults through redundancy](/en/ch2#tolerating-hardware-faults-through-redundancy), [Encoding and Evolution](/en/ch5#ch_encoding), [Faults and Partial Failures](/en/ch9#sec_distributed_partial_failure)
- in a multitenant system, [Sharding for Multitenancy](/en/ch7#sec_sharding_multitenancy)
- routing (see request routing)
- row-based replication, [Logical (row-based) log replication](/en/ch6#logical-row-based-log-replication)
- row-oriented storage, [Column-Oriented Storage](/en/ch4#sec_storage_column)
- rowhammer (memory corruption), [Hardware and Software Faults](/en/ch2#sec_introduction_hardware_faults)
- RPCs (see remote procedure calls)
- rules (Datalog), [Datalog: Recursive Relational Queries](/en/ch3#id62)
- Rust (programming language)
- memory management, [Limiting the impact of garbage collection](/en/ch9#sec_distributed_gc_impact)
### S
- S3 (object storage), [Layering of cloud services](/en/ch1#layering-of-cloud-services), [Setting Up New Followers](/en/ch6#sec_replication_new_replica), [Batch Processing](/en/ch11#ch_batch), [Distributed Filesystems](/en/ch11#sec_batch_dfs), [Object Stores](/en/ch11#id277)
- checking data integrity, [Don't just blindly trust what they promise](/en/ch13#id364)
- conditional writes, [Fencing off zombies and delayed requests](/en/ch9#sec_distributed_fencing_tokens)
- object size, [Separation of storage and compute](/en/ch1#sec_introduction_storage_compute)
- S3 Express One Zone, [Object Stores](/en/ch11#id277)
- use in MapReduce, [MapReduce](/en/ch11#sec_batch_mapreduce)
- workflow example, [Scheduling Workflows](/en/ch11#sec_batch_workflows)
- SaaS (see software as a service (SaaS))
- safety and liveness properties, [Safety and liveness](/en/ch9#sec_distributed_safety_liveness)
- in consensus algorithms, [Single-value consensus](/en/ch10#single-value-consensus)
- in transactions, [Transactions](/en/ch8#ch_transactions)
- sagas (see compensating transactions)
- Samza (stream processor), [Stream analytics](/en/ch12#id318)
- SAP HANA (database), [Data Storage for Analytics](/en/ch4#sec_storage_analytics)
- scalability, [Scalability](/en/ch2#sec_introduction_scalability)-[Principles for Scalability](/en/ch2#id35), [A Philosophy of Streaming Systems](/en/ch13#ch_philosophy)
- auto-scaling, [Operations: Automatic or Manual Rebalancing](/en/ch7#sec_sharding_operations)
- by sharding, [Pros and Cons of Sharding](/en/ch7#sec_sharding_reasons)
- describing load, [Describing Load](/en/ch2#id33)
- describing performance, [Describing Performance](/en/ch2#sec_introduction_percentiles)
- linear, [Describing Load](/en/ch2#id33)
- principles for, [Principles for Scalability](/en/ch2#id35)
- replication and, [Problems with Replication Lag](/en/ch6#sec_replication_lag)
- scaling up versus scaling out, [Shared-Memory, Shared-Disk, and Shared-Nothing Architecture](/en/ch2#sec_introduction_shared_nothing)
- scaling out, [Shared-Memory, Shared-Disk, and Shared-Nothing Architecture](/en/ch2#sec_introduction_shared_nothing)
- (see also shared-nothing architecture)
- by sharding, [Pros and Cons of Sharding](/en/ch7#sec_sharding_reasons)
- scaling up, [Shared-Memory, Shared-Disk, and Shared-Nothing Architecture](/en/ch2#sec_introduction_shared_nothing)
- SCD (slowly changing dimension), [Time-dependence of joins](/en/ch12#sec_stream_join_time)
- scheduling
- algorithms, [Resource Allocation](/en/ch11#id279)
- batch jobs, [Distributed Job Orchestration](/en/ch11#id278)-[Scheduling Workflows](/en/ch11#sec_batch_workflows)
- gang scheduling, [Resource Allocation](/en/ch11#id279)
- schema-on-read, [Schema flexibility in the document model](/en/ch3#sec_datamodels_schema_flexibility)
- comparison to evolvable schema, [The Merits of Schemas](/en/ch5#sec_encoding_schemas)
- schema-on-write, [Schema flexibility in the document model](/en/ch3#sec_datamodels_schema_flexibility)
- schemaless databases (see schema-on-read)
- schemas, [Glossary](/en/glossary)
- Avro, [Avro](/en/ch5#sec_encoding_avro)-[Dynamically generated schemas](/en/ch5#dynamically-generated-schemas)
- reader determining writer's schema, [But what is the writer's schema?](/en/ch5#but-what-is-the-writers-schema)
- schema evolution, [The writer's schema and the reader's schema](/en/ch5#the-writers-schema-and-the-readers-schema)
- dynamically generated, [Dynamically generated schemas](/en/ch5#dynamically-generated-schemas)
- evolution of, [Reprocessing data for application evolution](/en/ch13#sec_future_reprocessing)
- affecting application code, [Encoding and Evolution](/en/ch5#ch_encoding)
- compatibility checking, [But what is the writer's schema?](/en/ch5#but-what-is-the-writers-schema)
- in databases, [Dataflow Through Databases](/en/ch5#sec_encoding_dataflow_db)-[Archival storage](/en/ch5#archival-storage)
- in service calls, [Data encoding and evolution for RPC](/en/ch5#data-encoding-and-evolution-for-rpc)
- flexibility in document model, [Schema flexibility in the document model](/en/ch3#sec_datamodels_schema_flexibility)
- for analytics, [Stars and Snowflakes: Schemas for Analytics](/en/ch3#sec_datamodels_analytics)
- for JSON and XML, [JSON, XML, and Binary Variants](/en/ch5#sec_encoding_json), [JSON Schema](/en/ch5#json-schema)
- generation and migration using ORMs, [Object-relational mapping (ORM)](/en/ch3#object-relational-mapping-orm)
- merits of, [The Merits of Schemas](/en/ch5#sec_encoding_schemas)
- migration, [Schema flexibility in the document model](/en/ch3#sec_datamodels_schema_flexibility)
- Protocol Buffers, [Protocol Buffers](/en/ch5#sec_encoding_protobuf)-[Field tags and schema evolution](/en/ch5#field-tags-and-schema-evolution)
- schema evolution, [Field tags and schema evolution](/en/ch5#field-tags-and-schema-evolution)
- schema migration on railways, [Reprocessing data for application evolution](/en/ch13#sec_future_reprocessing)
- traditional approach to design, fallacy in, [Deriving several views from the same event log](/en/ch12#sec_stream_deriving_views)
- scientific computing, [Cloud Computing Versus Supercomputing](/en/ch1#id17)
- scikit-learn (Python library), [From data warehouse to data lake](/en/ch1#from-data-warehouse-to-data-lake)
- ScyllaDB (database)
- cluster metadata, [Request Routing](/en/ch7#sec_sharding_routing)
- consistency level ANY, [Single-Leader Versus Leaderless Replication Performance](/en/ch6#sec_replication_leaderless_perf)
- hash-range sharding, [Sharding by Hash of Key](/en/ch7#sec_sharding_hash), [Sharding by hash range](/en/ch7#sharding-by-hash-range)
- last-write-wins conflict resolution, [Detecting Concurrent Writes](/en/ch6#sec_replication_concurrent)
- leaderless replication, [Leaderless Replication](/en/ch6#sec_replication_leaderless)
- lightweight transactions, [Single-object writes](/en/ch8#sec_transactions_single_object)
- linearizability, lack of, [Implementing Linearizable Systems](/en/ch10#sec_consistency_implementing_linearizable)
- log-structured storage, [Constructing and merging SSTables](/en/ch4#constructing-and-merging-sstables)
- multi-region support, [Multi-region operation](/en/ch6#multi-region-operation)
- use of clocks, [Limitations of Quorum Consistency](/en/ch6#sec_replication_quorum_limitations), [Timestamps for ordering events](/en/ch9#sec_distributed_lww)
- vnodes (sharding), [Sharding](/en/ch7#ch_sharding)
- search engines (see full-text search)
- searching on streams, [Search on streams](/en/ch12#id320)
- secondaries (see leader-based replication)
- secondary indexes, [Multi-Column and Secondary Indexes](/en/ch4#sec_storage_index_multicolumn), [Glossary](/en/glossary)
- for many-to-many relationships, [Many-to-One and Many-to-Many Relationships](/en/ch3#sec_datamodels_many_to_many)
- problems with dual writes, [Keeping Systems in Sync](/en/ch12#sec_stream_sync), [Reasoning about dataflows](/en/ch13#id443)
- sharding, [Sharding and Secondary Indexes](/en/ch7#sec_sharding_secondary_indexes)-[Global Secondary Indexes](/en/ch7#id167), [Summary](/en/ch7#summary)
- global, [Global Secondary Indexes](/en/ch7#id167)
- index maintenance, [Maintaining derived state](/en/ch13#id446)
- local, [Local Secondary Indexes](/en/ch7#id166)
- updating, transaction isolation and, [The need for multi-object transactions](/en/ch8#sec_transactions_need)
- secondary sort (MapReduce), [JOIN and GROUP BY](/en/ch11#sec_batch_join)
- sed (Unix tool), [Simple Log Analysis](/en/ch11#sec_batch_log_analysis)
- self-hosting, [Cloud Versus Self-Hosting](/en/ch1#sec_introduction_cloud)
- data warehouses, [Cloud Data Warehouses](/en/ch4#sec_cloud_data_warehouses)
- self-joins, [Summary](/en/ch12#id332)
- self-validating systems, [Don't just blindly trust what they promise](/en/ch13#id364)
- semantic search, [Vector Embeddings](/en/ch4#id92)
- semantic similarity, [Vector Embeddings](/en/ch4#id92)
- semantic web, [Triple-Stores and SPARQL](/en/ch3#id59)
- semi-synchronous replication, [Synchronous Versus Asynchronous Replication](/en/ch6#sec_replication_sync_async)
- sequential writes (access pattern), [Sequential versus random writes](/en/ch4#sidebar_sequential)
- serializability, [Isolation](/en/ch8#sec_transactions_acid_isolation), [Weak Isolation Levels](/en/ch8#sec_transactions_isolation_levels), [Serializability](/en/ch8#sec_transactions_serializability)-[Performance of serializable snapshot isolation](/en/ch8#performance-of-serializable-snapshot-isolation), [Glossary](/en/glossary)
- linearizability versus, [What Makes a System Linearizable?](/en/ch10#sec_consistency_lin_definition)
- pessimistic versus optimistic concurrency control, [Pessimistic versus optimistic concurrency control](/en/ch8#pessimistic-versus-optimistic-concurrency-control)
- serial execution, [Actual Serial Execution](/en/ch8#sec_transactions_serial)-[Summary of serial execution](/en/ch8#summary-of-serial-execution)
- sharding, [Sharding](/en/ch8#sharding)
- using stored procedures, [Encapsulating transactions in stored procedures](/en/ch8#encapsulating-transactions-in-stored-procedures), [Using shared logs](/en/ch10#sec_consistency_smr)
- serializable snapshot isolation (SSI), [Serializable Snapshot Isolation (SSI)](/en/ch8#sec_transactions_ssi)-[Performance of serializable snapshot isolation](/en/ch8#performance-of-serializable-snapshot-isolation)
- detecting stale MVCC reads, [Detecting stale MVCC reads](/en/ch8#detecting-stale-mvcc-reads)
- detecting writes that affect prior reads, [Detecting writes that affect prior reads](/en/ch8#sec_detecting_writes_affect_reads)
- distributed execution, [Performance of serializable snapshot isolation](/en/ch8#performance-of-serializable-snapshot-isolation), [Database-internal Distributed Transactions](/en/ch8#sec_transactions_internal)
- performance of SSI, [Performance of serializable snapshot isolation](/en/ch8#performance-of-serializable-snapshot-isolation)
- preventing write skew, [Decisions based on an outdated premise](/en/ch8#decisions-based-on-an-outdated-premise)-[Detecting writes that affect prior reads](/en/ch8#sec_detecting_writes_affect_reads)
- strict serializability, [What Makes a System Linearizable?](/en/ch10#sec_consistency_lin_definition)
- timeliness versus integrity, [Timeliness and Integrity](/en/ch13#sec_future_integrity)
- two-phase locking (2PL), [Two-Phase Locking (2PL)](/en/ch8#sec_transactions_2pl)-[Index-range locks](/en/ch8#sec_transactions_2pl_range)
- index-range locks, [Index-range locks](/en/ch8#sec_transactions_2pl_range)
- performance, [Performance of two-phase locking](/en/ch8#performance-of-two-phase-locking)
- Serializable (Java), [Language-Specific Formats](/en/ch5#id96)
- serialization, [Formats for Encoding Data](/en/ch5#sec_encoding_formats)
- (see also encoding)
- serverless, [Microservices and Serverless](/en/ch1#sec_introduction_microservices)
- service discovery, [Load balancers, service discovery, and service meshes](/en/ch5#sec_encoding_service_discovery), [Request Routing](/en/ch7#sec_sharding_routing), [Service discovery](/en/ch10#service-discovery)
- registration, [Load balancers, service discovery, and service meshes](/en/ch5#sec_encoding_service_discovery)
- using DNS, [Load balancers, service discovery, and service meshes](/en/ch5#sec_encoding_service_discovery), [Request Routing](/en/ch7#sec_sharding_routing), [Service discovery](/en/ch10#service-discovery)
- service level agreements (SLAs), [Use of Response Time Metrics](/en/ch2#sec_introduction_slo_sla), [Describing Load](/en/ch2#id33)
- service mesh, [Load balancers, service discovery, and service meshes](/en/ch5#sec_encoding_service_discovery)
- Service Organization Control (SOC), [Data Systems, Law, and Society](/en/ch1#sec_introduction_compliance)
- service time, [Latency and Response Time](/en/ch2#id23)
- service-oriented architecture (SOA), [Microservices and Serverless](/en/ch1#sec_introduction_microservices)
- (see also services)
- services, [Dataflow Through Services: REST and RPC](/en/ch5#sec_encoding_dataflow_rpc)-[Data encoding and evolution for RPC](/en/ch5#data-encoding-and-evolution-for-rpc)
- microservices, [Microservices and Serverless](/en/ch1#sec_introduction_microservices)
- causal dependencies across services, [The limits of total ordering](/en/ch13#id335)
- loose coupling, [Making unbundling work](/en/ch13#sec_future_unbundling_favor)
- relation to batch/stream processors, [Batch Processing](/en/ch11#ch_batch), [Stream processors and services](/en/ch13#id345)
- remote procedure calls (RPCs), [The problems with remote procedure calls (RPCs)](/en/ch5#sec_problems_with_rpc)-[Data encoding and evolution for RPC](/en/ch5#data-encoding-and-evolution-for-rpc)
- issues with, [The problems with remote procedure calls (RPCs)](/en/ch5#sec_problems_with_rpc)
- similarity to databases, [Dataflow Through Services: REST and RPC](/en/ch5#sec_encoding_dataflow_rpc)
- web services, [Web services](/en/ch5#sec_web_services)
- session windows (stream processing), [Types of windows](/en/ch12#id324)
- (see also windows)
- sharding, [Sharding](/en/ch7#ch_sharding)-[Summary](/en/ch7#summary), [Glossary](/en/glossary)
- and consensus, [Using shared logs](/en/ch10#sec_consistency_smr)
- and replication, [Sharding](/en/ch7#ch_sharding)
- distributed transactions across shards, [Distributed Transactions](/en/ch8#sec_transactions_distributed)
- hot shards, [Sharding of Key-Value Data](/en/ch7#sec_sharding_key_value)
- in batch processing, [Batch Processing](/en/ch11#ch_batch)
- key-range splitting, [Rebalancing key-range sharded data](/en/ch7#rebalancing-key-range-sharded-data)
- multi-shard operations, [Multi-shard data processing](/en/ch13#sec_future_unbundled_multi_shard)
- enforcing constraints, [Multi-shard request processing](/en/ch13#id360)
- secondary index maintenance, [Maintaining derived state](/en/ch13#id446)
- of key-value data, [Sharding of Key-Value Data](/en/ch7#sec_sharding_key_value)-[Skewed Workloads and Relieving Hot Spots](/en/ch7#sec_sharding_skew)
- by key range, [Sharding by Key Range](/en/ch7#sec_sharding_key_range)
- skew and hot spots, [Skewed Workloads and Relieving Hot Spots](/en/ch7#sec_sharding_skew)
- origin of the term, [Sharding](/en/ch7#ch_sharding)
- partition key, [Pros and Cons of Sharding](/en/ch7#sec_sharding_reasons), [Sharding of Key-Value Data](/en/ch7#sec_sharding_key_value)
- rebalancing
- of key-range sharded data, [Rebalancing key-range sharded data](/en/ch7#rebalancing-key-range-sharded-data)
- rebalancing shards, [Rebalancing key-range sharded data](/en/ch7#rebalancing-key-range-sharded-data)-[Operations: Automatic or Manual Rebalancing](/en/ch7#sec_sharding_operations)
- automatic or manual rebalancing, [Operations: Automatic or Manual Rebalancing](/en/ch7#sec_sharding_operations)
- problems with hash mod N, [Hash modulo number of nodes](/en/ch7#hash-modulo-number-of-nodes)
- using fixed number of shards, [Fixed number of shards](/en/ch7#fixed-number-of-shards)
- using N shards per node, [Sharding by hash range](/en/ch7#sharding-by-hash-range)
- request routing, [Request Routing](/en/ch7#sec_sharding_routing)
- secondary indexes, [Sharding and Secondary Indexes](/en/ch7#sec_sharding_secondary_indexes)-[Global Secondary Indexes](/en/ch7#id167)
- global, [Global Secondary Indexes](/en/ch7#id167)
- local, [Local Secondary Indexes](/en/ch7#id166)
- serial execution of transactions and, [Sharding](/en/ch8#sharding)
- sorting sharded data, [Shuffling Data](/en/ch11#sec_shuffle)
- shared logs, [Consensus in Practice](/en/ch10#sec_consistency_total_order)-[Pros and cons of consensus](/en/ch10#pros-and-cons-of-consensus), [The limits of total ordering](/en/ch13#id335), [Uniqueness in log-based messaging](/en/ch13#sec_future_uniqueness_log)
- algorithms, [Consensus in Practice](/en/ch10#sec_consistency_total_order)
- for event sourcing, [Event Sourcing and CQRS](/en/ch3#sec_datamodels_events)
- for messaging, [Log-based Message Brokers](/en/ch12#sec_stream_log)-[Replaying old messages](/en/ch12#sec_stream_replay)
- relation to consensus, [Shared logs as consensus](/en/ch10#sec_consistency_shared_logs)
- using, [Using shared logs](/en/ch10#sec_consistency_smr)
- shared mode (locks), [Implementation of two-phase locking](/en/ch8#implementation-of-two-phase-locking)
- shared-disk architecture, [Shared-Memory, Shared-Disk, and Shared-Nothing Architecture](/en/ch2#sec_introduction_shared_nothing), [Distributed Filesystems](/en/ch11#sec_batch_dfs)
- shared-memory architecture, [Shared-Memory, Shared-Disk, and Shared-Nothing Architecture](/en/ch2#sec_introduction_shared_nothing)
- shared-nothing architecture, [Shared-Memory, Shared-Disk, and Shared-Nothing Architecture](/en/ch2#sec_introduction_shared_nothing), [Glossary](/en/glossary)
- distributed filesystems, [Distributed Filesystems](/en/ch11#sec_batch_dfs)
- (see also distributed filesystems)
- use of network, [Unreliable Networks](/en/ch9#sec_distributed_networks)
- sharks
- biting undersea cables, [Network Faults in Practice](/en/ch9#sec_distributed_network_faults)
- counting (example), [Query languages for documents](/en/ch3#query-languages-for-documents)
- shredding (deletion) (see crypto-shredding)
- shredding (in columnar encoding), [Column-Oriented Storage](/en/ch4#sec_storage_column)
- shredding (in relational model), [When to Use Which Model](/en/ch3#sec_datamodels_document_summary)
- shuffle (batch processing), [Shuffling Data](/en/ch11#sec_shuffle)
- siblings (concurrent values), [Manual conflict resolution](/en/ch6#manual-conflict-resolution), [Capturing the happens-before relationship](/en/ch6#capturing-the-happens-before-relationship), [Conflict resolution and replication](/en/ch8#conflict-resolution-and-replication)
- (see also conflicts)
- silo, [Data Warehousing](/en/ch1#sec_introduction_dwh)
- similarity search
- edit distance, [Full-Text Search](/en/ch4#sec_storage_full_text)
- genome data, [Summary](/en/ch3#summary)
- simplicity, [Simplicity: Managing Complexity](/en/ch2#id38)
- Singer, [Data Warehousing](/en/ch1#sec_introduction_dwh)
- single-instruction-multi-data (SIMD) instructions, [Query Execution: Compilation and Vectorization](/en/ch4#sec_storage_vectorized)
- single-leader replication (see leader-based replication)
- single-threaded execution, [Atomic write operations](/en/ch8#atomic-write-operations), [Actual Serial Execution](/en/ch8#sec_transactions_serial)
- in stream processing, [Logs compared to traditional messaging](/en/ch12#sec_stream_logs_vs_messaging), [Concurrency control](/en/ch12#sec_stream_concurrency), [Uniqueness in log-based messaging](/en/ch13#sec_future_uniqueness_log)
- SingleStore (database)
- in-memory storage, [Keeping everything in memory](/en/ch4#sec_storage_inmemory)
- site reliability engineer, [Operations in the Cloud Era](/en/ch1#sec_introduction_operations)
- size-tiered compaction, [Compaction strategies](/en/ch4#sec_storage_lsm_compaction), [Disk space usage](/en/ch4#disk-space-usage)
- skew, [Glossary](/en/glossary)
- clock skew, [Relying on Synchronized Clocks](/en/ch9#sec_distributed_clocks_relying)-[Clock readings with a confidence interval](/en/ch9#clock-readings-with-a-confidence-interval), [Implementing Linearizable Systems](/en/ch10#sec_consistency_implementing_linearizable)
- in transaction isolation
- read skew, [Snapshot Isolation and Repeatable Read](/en/ch8#sec_transactions_snapshot_isolation), [Summary](/en/ch8#summary)
- write skew, [Write Skew and Phantoms](/en/ch8#sec_transactions_write_skew)-[Materializing conflicts](/en/ch8#materializing-conflicts), [Decisions based on an outdated premise](/en/ch8#decisions-based-on-an-outdated-premise)-[Detecting writes that affect prior reads](/en/ch8#sec_detecting_writes_affect_reads)
- (see also write skew)
- meanings of, [Snapshot Isolation and Repeatable Read](/en/ch8#sec_transactions_snapshot_isolation)
- unbalanced workload, [Sharding of Key-Value Data](/en/ch7#sec_sharding_key_value)
- compensating for, [Skewed Workloads and Relieving Hot Spots](/en/ch7#sec_sharding_skew)
- due to celebrities, [Skewed Workloads and Relieving Hot Spots](/en/ch7#sec_sharding_skew)
- for time-series data, [Sharding by Key Range](/en/ch7#sec_sharding_key_range)
- skip list, [Constructing and merging SSTables](/en/ch4#constructing-and-merging-sstables)
- SLA (see service level agreements)
- Slack (group chat)
- GraphQL example, [GraphQL](/en/ch3#id63)
- SlateDB (database), [Constructing and merging SSTables](/en/ch4#constructing-and-merging-sstables), [Setting Up New Followers](/en/ch6#sec_replication_new_replica)
- sliding windows (stream processing), [Types of windows](/en/ch12#id324)
- (see also windows)
- sloppy quorums, [Single-Leader Versus Leaderless Replication Performance](/en/ch6#sec_replication_leaderless_perf)
- slowly changing dimension (data warehouses), [Time-dependence of joins](/en/ch12#sec_stream_join_time)
- smearing (leap seconds adjustments), [Clock Synchronization and Accuracy](/en/ch9#sec_distributed_clock_accuracy)
- snapshots (databases)
- as backups, [Replication](/en/ch6#ch_replication)
- computing derived data, [Creating an index](/en/ch13#id340)
- in change data capture, [Initial snapshot](/en/ch12#sec_stream_cdc_snapshot)
- serializable snapshot isolation (SSI), [Serializable Snapshot Isolation (SSI)](/en/ch8#sec_transactions_ssi)-[Performance of serializable snapshot isolation](/en/ch8#performance-of-serializable-snapshot-isolation)
- setting up a new replica, [Setting Up New Followers](/en/ch6#sec_replication_new_replica)
- snapshot isolation and repeatable read, [Snapshot Isolation and Repeatable Read](/en/ch8#sec_transactions_snapshot_isolation)-[Snapshot isolation, repeatable read, and naming confusion](/en/ch8#snapshot-isolation-repeatable-read-and-naming-confusion)
- implementing with MVCC, [Multi-version concurrency control (MVCC)](/en/ch8#sec_transactions_snapshot_impl)
- indexes and MVCC, [Indexes and snapshot isolation](/en/ch8#indexes-and-snapshot-isolation)
- visibility rules, [Visibility rules for observing a consistent snapshot](/en/ch8#sec_transactions_mvcc_visibility)
- synchronized clocks for global snapshots, [Synchronized clocks for global snapshots](/en/ch9#sec_distributed_spanner)
- Snowflake (database), [Cloud-Native System Architecture](/en/ch1#sec_introduction_cloud_native), [Layering of cloud services](/en/ch1#layering-of-cloud-services), [Cloud Data Warehouses](/en/ch4#sec_cloud_data_warehouses), [Batch Processing](/en/ch11#ch_batch)
- column-oriented storage, [Column-Oriented Storage](/en/ch4#sec_storage_column)
- handling writes, [Writing to Column-Oriented Storage](/en/ch4#writing-to-column-oriented-storage)
- sharding and clustering, [Sharding by hash range](/en/ch7#sharding-by-hash-range)
- Snowpark, [Query languages](/en/ch11#sec_batch_query_lanauges)
- Snowflake (ID generator), [ID Generators and Logical Clocks](/en/ch10#sec_consistency_logical)
- snowflake schemas, [Stars and Snowflakes: Schemas for Analytics](/en/ch3#sec_datamodels_analytics)
- SOAP (web services), [The problems with remote procedure calls (RPCs)](/en/ch5#sec_problems_with_rpc)
- SOC2 (see Service Organization Control (SOC))
- social graph, [Graph-Like Data Models](/en/ch3#sec_datamodels_graph)
- society
- responsibility towards, [Data Systems, Law, and Society](/en/ch1#sec_introduction_compliance), [Legislation and Self-Regulation](/en/ch14#sec_future_legislation)
- sociotechnical systems, [Humans and Reliability](/en/ch2#id31)
- software as a service (SaaS), [Trade-offs in Data Systems Architecture](/en/ch1#ch_tradeoffs), [Cloud Versus Self-Hosting](/en/ch1#sec_introduction_cloud)
- ETL from, [Data Warehousing](/en/ch1#sec_introduction_dwh)
- multitenancy, [Sharding for Multitenancy](/en/ch7#sec_sharding_multitenancy)
- software bugs, [Software faults](/en/ch2#software-faults)
- maintaining integrity, [Maintaining integrity in the face of software bugs](/en/ch13#id455)
- solar storm, [Hardware and Software Faults](/en/ch2#sec_introduction_hardware_faults)
- solid state drives (SSDs)
- access patterns, [Sequential versus random writes](/en/ch4#sidebar_sequential)
- compared to object storage, [Setting Up New Followers](/en/ch6#sec_replication_new_replica)
- detecting corruption, [The end-to-end argument](/en/ch13#sec_future_e2e_argument), [Don't just blindly trust what they promise](/en/ch13#id364)
- failure rate, [Hardware and Software Faults](/en/ch2#sec_introduction_hardware_faults)
- faults in, [Durability](/en/ch8#durability)
- firmware bugs, [Software faults](/en/ch2#software-faults)
- read throughput, [Read performance](/en/ch4#read-performance)
- sequential vs. random writes, [Sequential versus random writes](/en/ch4#sidebar_sequential)
- Solr (search server)
- local secondary indexes, [Local Secondary Indexes](/en/ch7#id166)
- request routing, [Request Routing](/en/ch7#sec_sharding_routing)
- use of Lucene, [Full-Text Search](/en/ch4#sec_storage_full_text)
- sort (Unix tool), [Simple Log Analysis](/en/ch11#sec_batch_log_analysis), [Sorting Versus In-memory Aggregation](/en/ch11#id275), [Distributed Job Orchestration](/en/ch11#id278)
- sort-merge joins (MapReduce), [JOIN and GROUP BY](/en/ch11#sec_batch_join)
- Sorted String Tables (see SSTables)
- sorting
- sort order in column storage, [Sort Order in Column Storage](/en/ch4#sort-order-in-column-storage)
- source of truth (see systems of record)
- Spanner (database)
- consistency model, [What Makes a System Linearizable?](/en/ch10#sec_consistency_lin_definition)
- data locality, [Data locality for reads and writes](/en/ch3#sec_datamodels_document_locality)
- in the cloud, [Cloud-Native System Architecture](/en/ch1#sec_introduction_cloud_native)
- snapshot isolation using clocks, [Synchronized clocks for global snapshots](/en/ch9#sec_distributed_spanner)
- transactions, [What Exactly Is a Transaction?](/en/ch8#sec_transactions_overview), [Database-internal Distributed Transactions](/en/ch8#sec_transactions_internal)
- TrueTime API, [Clock readings with a confidence interval](/en/ch9#clock-readings-with-a-confidence-interval)
- Spark (processing framework), [From data warehouse to data lake](/en/ch1#from-data-warehouse-to-data-lake), [Cloud-Native System Architecture](/en/ch1#sec_introduction_cloud_native), [Batch Processing](/en/ch11#ch_batch), [Dataflow Engines](/en/ch11#sec_batch_dataflow)
- cost efficiency, [Query languages](/en/ch11#sec_batch_query_lanauges)
- DataFrames, [DataFrames, Matrices, and Arrays](/en/ch3#sec_datamodels_dataframes), [DataFrames](/en/ch11#id287)
- fault tolerance, [Handling Faults](/en/ch11#id281)
- for data warehouses, [Cloud Data Warehouses](/en/ch4#sec_cloud_data_warehouses)
- high availability using ZooKeeper, [Coordination Services](/en/ch10#sec_consistency_coordination)
- MLlib, [Machine Learning](/en/ch11#id290)
- query optimizer, [Query languages](/en/ch11#sec_batch_query_lanauges)
- shuffling data, [Shuffling Data](/en/ch11#sec_shuffle)
- Spark Streaming, [Stream analytics](/en/ch12#id318)
- microbatching, [Microbatching and checkpointing](/en/ch12#id329)
- streaming SQL support, [Complex event processing](/en/ch12#id317)
- use for ETL, [Extract–Transform–Load (ETL)](/en/ch11#sec_batch_etl_usage)
- SPARQL (query language), [The SPARQL query language](/en/ch3#the-sparql-query-language)
- sparse index, [The SSTable file format](/en/ch4#the-sstable-file-format)
- sparse matrices, [DataFrames, Matrices, and Arrays](/en/ch3#sec_datamodels_dataframes)
- split brain, [Leader failure: Failover](/en/ch6#leader-failure-failover), [Request Routing](/en/ch7#sec_sharding_routing), [Glossary](/en/glossary)
- enforcing constraints, [Uniqueness constraints require consensus](/en/ch13#id452)
- in consensus algorithms, [Consensus](/en/ch10#sec_consistency_consensus), [From single-leader replication to consensus](/en/ch10#from-single-leader-replication-to-consensus)
- preventing, [Implementing Linearizable Systems](/en/ch10#sec_consistency_implementing_linearizable)
- using fencing tokens to avoid, [Fencing off zombies and delayed requests](/en/ch9#sec_distributed_fencing_tokens)-[Fencing with multiple replicas](/en/ch9#fencing-with-multiple-replicas)
- spot instances, [Handling Faults](/en/ch11#id281)
- spreadsheets, [Trade-offs in Data Systems Architecture](/en/ch1#ch_tradeoffs), [DataFrames, Matrices, and Arrays](/en/ch3#sec_datamodels_dataframes)
- dataflow programming, [Designing Applications Around Dataflow](/en/ch13#sec_future_dataflow)
- pivot table, [DataFrames, Matrices, and Arrays](/en/ch3#sec_datamodels_dataframes)
- SQL (Structured Query Language), [Simplicity: Managing Complexity](/en/ch2#id38), [Relational Model versus Document Model](/en/ch3#sec_datamodels_history), [Cloud Data Warehouses](/en/ch4#sec_cloud_data_warehouses)
- for analytics, [Data Warehousing](/en/ch1#sec_introduction_dwh), [Column-Oriented Storage](/en/ch4#sec_storage_column)
- graph queries in, [Graph Queries in SQL](/en/ch3#id58)
- isolation levels standard, issues with, [Snapshot isolation, repeatable read, and naming confusion](/en/ch8#snapshot-isolation-repeatable-read-and-naming-confusion)
- joins, [Normalization, Denormalization, and Joins](/en/ch3#sec_datamodels_normalization)
- résumé (example), [The document data model for one-to-many relationships](/en/ch3#the-document-data-model-for-one-to-many-relationships)
- social network home timelines (example), [Representing Users, Posts, and Follows](/en/ch2#id20)
- SQL injection vulnerability, [Byzantine Faults](/en/ch9#sec_distributed_byzantine)
- statement-based replication, [Statement-based replication](/en/ch6#statement-based-replication)
- stored procedures, [Pros and cons of stored procedures](/en/ch8#sec_transactions_stored_proc_tradeoffs)
- support in batch processing frameworks, [Batch Processing](/en/ch11#ch_batch)
- views, [Datalog: Recursive Relational Queries](/en/ch3#id62)
- SQL Server (database)
- archiving WAL to object stores, [Setting Up New Followers](/en/ch6#sec_replication_new_replica)
- change data capture, [Implementing change data capture](/en/ch12#id307)
- data warehousing support, [Data Storage for Analytics](/en/ch4#sec_storage_analytics)
- distributed transaction support, [XA transactions](/en/ch8#xa-transactions)
- leader-based replication, [Single-Leader Replication](/en/ch6#sec_replication_leader)
- multi-leader replication, [Geographically Distributed Operation](/en/ch6#sec_replication_multi_dc)
- preventing lost updates, [Automatically detecting lost updates](/en/ch8#automatically-detecting-lost-updates)
- preventing write skew, [Characterizing write skew](/en/ch8#characterizing-write-skew), [Implementation of two-phase locking](/en/ch8#implementation-of-two-phase-locking)
- read committed isolation, [Implementing read committed](/en/ch8#sec_transactions_read_committed_impl)
- serializable isolation, [Implementation of two-phase locking](/en/ch8#implementation-of-two-phase-locking)
- snapshot isolation support, [Snapshot Isolation and Repeatable Read](/en/ch8#sec_transactions_snapshot_isolation)
- T-SQL language, [Pros and cons of stored procedures](/en/ch8#sec_transactions_stored_proc_tradeoffs)
- SQLite (database), [Problems with Distributed Systems](/en/ch1#sec_introduction_dist_sys_problems), [Compaction strategies](/en/ch4#sec_storage_lsm_compaction)
- archiving WAL to object stores, [Setting Up New Followers](/en/ch6#sec_replication_new_replica)
- SRE (site reliability engineer), [Operations in the Cloud Era](/en/ch1#sec_introduction_operations)
- SSDs (see solid state drives)
- SSTables (storage format), [The SSTable file format](/en/ch4#the-sstable-file-format)-[Compaction strategies](/en/ch4#sec_storage_lsm_compaction)
- constructing and maintaining, [Constructing and merging SSTables](/en/ch4#constructing-and-merging-sstables)
- making LSM-Tree from, [Constructing and merging SSTables](/en/ch4#constructing-and-merging-sstables)
- staged rollout (see rolling upgrades)
- staleness (old data), [Reading Your Own Writes](/en/ch6#sec_replication_ryw)
- cross-channel timing dependencies, [Cross-channel timing dependencies](/en/ch10#cross-channel-timing-dependencies)
- in leaderless databases, [Writing to the Database When a Node Is Down](/en/ch6#id287)
- in multi-version concurrency control, [Detecting stale MVCC reads](/en/ch8#detecting-stale-mvcc-reads)
- monitoring for, [Monitoring staleness](/en/ch6#monitoring-staleness)
- of client state, [Pushing state changes to clients](/en/ch13#id348)
- versus linearizability, [Linearizability](/en/ch10#sec_consistency_linearizability)
- versus timeliness, [Timeliness and Integrity](/en/ch13#sec_future_integrity)
- standbys (see leader-based replication)
- star replication topologies, [Multi-leader replication topologies](/en/ch6#sec_replication_topologies)
- star schemas, [Stars and Snowflakes: Schemas for Analytics](/en/ch3#sec_datamodels_analytics)
- Star Wars analogy (event time versus processing time), [Event time versus processing time](/en/ch12#id322)
- starvation (scheduling), [Resource Allocation](/en/ch11#id279)
- state
- derived from log of immutable events, [State, Streams, and Immutability](/en/ch12#sec_stream_immutability)
- interplay between state changes and application code, [Dataflow: Interplay between state changes and application code](/en/ch13#id450)
- maintaining derived state, [Maintaining derived state](/en/ch13#id446)
- maintenance by stream processor in stream-stream joins, [Stream-stream join (window join)](/en/ch12#id440)
- observing derived state, [Observing Derived State](/en/ch13#sec_future_observing)-[Multi-shard data processing](/en/ch13#sec_future_unbundled_multi_shard)
- rebuilding after stream processor failure, [Rebuilding state after a failure](/en/ch12#sec_stream_state_fault_tolerance)
- separation of application code and, [Separation of application code and state](/en/ch13#id344)
- state machine replication, [Statement-based replication](/en/ch6#statement-based-replication), [Pros and cons of stored procedures](/en/ch8#sec_transactions_stored_proc_tradeoffs), [Using shared logs](/en/ch10#sec_consistency_smr), [Databases and Streams](/en/ch12#sec_stream_databases)
- event sourcing, [Event Sourcing and CQRS](/en/ch3#sec_datamodels_events)
- reliance on determinism, [Deterministic simulation testing](/en/ch9#deterministic-simulation-testing)
- stateless systems, [Trade-offs in Data Systems Architecture](/en/ch1#ch_tradeoffs)
- statement-based replication, [Statement-based replication](/en/ch6#statement-based-replication)
- reliance on determinism, [Deterministic simulation testing](/en/ch9#deterministic-simulation-testing)
- statically typed languages
- analogy to schema-on-write, [Schema flexibility in the document model](/en/ch3#sec_datamodels_schema_flexibility)
- statistical and numerical algorithms, [DataFrames, Matrices, and Arrays](/en/ch3#sec_datamodels_dataframes)
- StatsD (metrics aggregator), [Direct messaging from producers to consumers](/en/ch12#id296)
- stock market feeds, [Direct messaging from producers to consumers](/en/ch12#id296)
- STONITH (Shoot The Other Node In The Head), [Leader failure: Failover](/en/ch6#leader-failure-failover)
- problems with, [Fencing off zombies and delayed requests](/en/ch9#sec_distributed_fencing_tokens)
- stop-the-world (see garbage collection)
- storage
- composing data storage technologies, [Composing Data Storage Technologies](/en/ch13#id447)-[Unbundled versus integrated systems](/en/ch13#id448)
- Storage Area Network (SAN), [Shared-Memory, Shared-Disk, and Shared-Nothing Architecture](/en/ch2#sec_introduction_shared_nothing), [Distributed Filesystems](/en/ch11#sec_batch_dfs)
- storage engines, [Storage and Retrieval](/en/ch4#ch_storage)-[Summary](/en/ch4#summary)
- column-oriented, [Column-Oriented Storage](/en/ch4#sec_storage_column)-[Query Execution: Compilation and Vectorization](/en/ch4#sec_storage_vectorized)
- column compression, [Column Compression](/en/ch4#sec_storage_column_compression)
- defined, [Column-Oriented Storage](/en/ch4#sec_storage_column)
- Parquet, [Cloud Data Warehouses](/en/ch4#sec_cloud_data_warehouses), [Column-Oriented Storage](/en/ch4#sec_storage_column), [Archival storage](/en/ch5#archival-storage)
- sort order in, [Sort Order in Column Storage](/en/ch4#sort-order-in-column-storage)
- versus wide-column model, [Column Compression](/en/ch4#sec_storage_column_compression)
- writing to, [Writing to Column-Oriented Storage](/en/ch4#writing-to-column-oriented-storage)
- in-memory storage, [Keeping everything in memory](/en/ch4#sec_storage_inmemory)
- durability, [Durability](/en/ch8#durability)
- row-oriented, [Storage and Indexing for OLTP](/en/ch4#sec_storage_oltp)-[Keeping everything in memory](/en/ch4#sec_storage_inmemory)
- B-trees, [B-Trees](/en/ch4#sec_storage_b_trees)-[B-tree variants](/en/ch4#b-tree-variants)
- comparing B-trees and LSM-trees, [Comparing B-Trees and LSM-Trees](/en/ch4#sec_storage_btree_lsm_comparison)-[Disk space usage](/en/ch4#disk-space-usage)
- defined, [Column-Oriented Storage](/en/ch4#sec_storage_column)
- log-structured, [Log-Structured Storage](/en/ch4#sec_storage_log_structured)-[Compaction strategies](/en/ch4#sec_storage_lsm_compaction)
- stored procedures, [Encapsulating transactions in stored procedures](/en/ch8#encapsulating-transactions-in-stored-procedures)-[Pros and cons of stored procedures](/en/ch8#sec_transactions_stored_proc_tradeoffs), [Glossary](/en/glossary)
- and shared logs, [Using shared logs](/en/ch10#sec_consistency_smr)
- pros and cons of, [Pros and cons of stored procedures](/en/ch8#sec_transactions_stored_proc_tradeoffs)
- similarity to stream processors, [Application code as a derivation function](/en/ch13#sec_future_dataflow_derivation)
- Storm (stream processor), [Stream analytics](/en/ch12#id318)
- distributed RPC, [Event-Driven Architectures and RPC](/en/ch12#sec_stream_actors_drpc), [Multi-shard data processing](/en/ch13#sec_future_unbundled_multi_shard)
- Trident state handling, [Idempotence](/en/ch12#sec_stream_idempotence)
- straggler events, [Handling straggler events](/en/ch12#id323)
- Stream Control Transmission Protocol (SCTP), [The Limitations of TCP](/en/ch9#sec_distributed_tcp)
- stream processing, [Processing Streams](/en/ch12#sec_stream_processing)-[Summary](/en/ch12#id332), [Glossary](/en/glossary)
- accessing external services within job, [Stream-table join (stream enrichment)](/en/ch12#sec_stream_table_joins), [Microbatching and checkpointing](/en/ch12#id329), [Idempotence](/en/ch12#sec_stream_idempotence), [Exactly-once execution of an operation](/en/ch13#id353)
- combining with batch processing, [Unifying batch and stream processing](/en/ch13#id338)
- comparison to batch processing, [Processing Streams](/en/ch12#sec_stream_processing)
- complex event processing (CEP), [Complex event processing](/en/ch12#id317)
- fault tolerance, [Fault Tolerance](/en/ch12#sec_stream_fault_tolerance)-[Rebuilding state after a failure](/en/ch12#sec_stream_state_fault_tolerance)
- atomic commit, [Atomic commit revisited](/en/ch12#sec_stream_atomic_commit)
- idempotence, [Idempotence](/en/ch12#sec_stream_idempotence)
- microbatching and checkpointing, [Microbatching and checkpointing](/en/ch12#id329)
- rebuilding state after a failure, [Rebuilding state after a failure](/en/ch12#sec_stream_state_fault_tolerance)
- for data integration, [Batch and Stream Processing](/en/ch13#sec_future_batch_streaming)-[Unifying batch and stream processing](/en/ch13#id338)
- for event sourcing, [Event Sourcing and CQRS](/en/ch3#sec_datamodels_events)
- maintaining derived state, [Maintaining derived state](/en/ch13#id446)
- maintenance of materialized views, [Maintaining materialized views](/en/ch12#sec_stream_mat_view)
- messaging systems (see messaging systems)
- reasoning about time, [Reasoning About Time](/en/ch12#sec_stream_time)-[Types of windows](/en/ch12#id324)
- event time versus processing time, [Event time versus processing time](/en/ch12#id322), [Microbatching and checkpointing](/en/ch12#id329), [Unifying batch and stream processing](/en/ch13#id338)
- knowing when window is ready, [Handling straggler events](/en/ch12#id323)
- types of windows, [Types of windows](/en/ch12#id324)
- relation to databases (see streams)
- relation to services, [Stream processors and services](/en/ch13#id345)
- relationship to batch processing, [Batch Processing](/en/ch11#ch_batch)
- search on streams, [Search on streams](/en/ch12#id320)
- single-threaded execution, [Logs compared to traditional messaging](/en/ch12#sec_stream_logs_vs_messaging), [Concurrency control](/en/ch12#sec_stream_concurrency)
- stream analytics, [Stream analytics](/en/ch12#id318)
- stream joins, [Stream Joins](/en/ch12#sec_stream_joins)-[Time-dependence of joins](/en/ch12#sec_stream_join_time)
- stream-stream join, [Stream-stream join (window join)](/en/ch12#id440)
- stream-table join, [Stream-table join (stream enrichment)](/en/ch12#sec_stream_table_joins)
- table-table join, [Table-table join (materialized view maintenance)](/en/ch12#id326)
- time-dependence of, [Time-dependence of joins](/en/ch12#sec_stream_join_time)
- streams, [Stream Processing](/en/ch12#ch_stream)-[Replaying old messages](/en/ch12#sec_stream_replay)
- end-to-end, pushing events to clients, [End-to-end event streams](/en/ch13#id349)
- messaging systems (see messaging systems)
- processing (see stream processing)
- relation to databases, [Databases and Streams](/en/ch12#sec_stream_databases)-[Limitations of immutability](/en/ch12#sec_stream_immutability_limitations)
- (see also changelogs)
- API support for change streams, [API support for change streams](/en/ch12#sec_stream_change_api)
- change data capture, [Change Data Capture](/en/ch12#sec_stream_cdc)-[API support for change streams](/en/ch12#sec_stream_change_api)
- derivative of state by time, [State, Streams, and Immutability](/en/ch12#sec_stream_immutability)
- event sourcing, [Change data capture versus event sourcing](/en/ch12#sec_stream_event_sourcing)
- keeping systems in sync, [Keeping Systems in Sync](/en/ch12#sec_stream_sync)
- philosophy of immutable events, [State, Streams, and Immutability](/en/ch12#sec_stream_immutability)-[Limitations of immutability](/en/ch12#sec_stream_immutability_limitations)
- topics, [Transmitting Event Streams](/en/ch12#sec_stream_transmit)
- strict serializability, [What Makes a System Linearizable?](/en/ch10#sec_consistency_lin_definition)
- timeliness vs. integrity, [Timeliness and Integrity](/en/ch13#sec_future_integrity)
- striping (in columnar encoding), [Column-Oriented Storage](/en/ch4#sec_storage_column)
- strong consistency (see linearizability)
- strong eventual consistency, [Automatic conflict resolution](/en/ch6#automatic-conflict-resolution)
- strong one-copy serializability, [What Makes a System Linearizable?](/en/ch10#sec_consistency_lin_definition)
- subjects, predicates, and objects (in triple-stores), [Triple-Stores and SPARQL](/en/ch3#id59)
- subscribers (message streams), [Transmitting Event Streams](/en/ch12#sec_stream_transmit)
- (see also consumers)
- supercomputers, [Cloud Computing Versus Supercomputing](/en/ch1#id17)
- Superset (data visualization software), [Analytics](/en/ch11#sec_batch_olap)
- surveillance, [Surveillance](/en/ch14#id374)
- (see also privacy)
- sushi principle, [From data warehouse to data lake](/en/ch1#from-data-warehouse-to-data-lake)
- sustainability, [Distributed Versus Single-Node Systems](/en/ch1#sec_introduction_distributed)
- Swagger (service definition format), [Web services](/en/ch5#sec_web_services)
- swapping to disk (see virtual memory)
- Swift (programming language)
- memory management, [Limiting the impact of garbage collection](/en/ch9#sec_distributed_gc_impact)
- sync engines, [Sync Engines and Local-First Software](/en/ch6#sec_replication_offline_clients)-[Pros and cons of sync engines](/en/ch6#pros-and-cons-of-sync-engines)
- examples of, [Pros and cons of sync engines](/en/ch6#pros-and-cons-of-sync-engines)
- for local-first software, [Real-time collaboration, offline-first, and local-first apps](/en/ch6#real-time-collaboration-offline-first-and-local-first-apps)
- synchronous networks, [Synchronous Versus Asynchronous Networks](/en/ch9#sec_distributed_sync_networks), [Glossary](/en/glossary)
- comparison to asynchronous networks, [Synchronous Versus Asynchronous Networks](/en/ch9#sec_distributed_sync_networks)
- system model, [System Model and Reality](/en/ch9#sec_distributed_system_model)
- synchronous replication, [Synchronous Versus Asynchronous Replication](/en/ch6#sec_replication_sync_async), [Glossary](/en/glossary)
- with multiple leaders, [Multi-Leader Replication](/en/ch6#sec_replication_multi_leader)
- system administrator, [Operations in the Cloud Era](/en/ch1#sec_introduction_operations)
- system models, [Knowledge, Truth, and Lies](/en/ch9#sec_distributed_truth), [System Model and Reality](/en/ch9#sec_distributed_system_model)-[Deterministic simulation testing](/en/ch9#deterministic-simulation-testing)
- assumptions in, [Trust, but Verify](/en/ch13#sec_future_verification)
- correctness of algorithms, [Defining the correctness of an algorithm](/en/ch9#defining-the-correctness-of-an-algorithm)
- mapping to the real world, [Mapping system models to the real world](/en/ch9#mapping-system-models-to-the-real-world)
- safety and liveness, [Safety and liveness](/en/ch9#sec_distributed_safety_liveness)
- systems of record, [Systems of Record and Derived Data](/en/ch1#sec_introduction_derived), [Glossary](/en/glossary)
- change data capture, [Implementing change data capture](/en/ch12#id307), [Reasoning about dataflows](/en/ch13#id443)
- event logs, [Event Sourcing and CQRS](/en/ch3#sec_datamodels_events)
- treating event log as, [State, Streams, and Immutability](/en/ch12#sec_stream_immutability)
- systems thinking, [Feedback Loops](/en/ch14#id372)
### T
- t-digest (algorithm), [Use of Response Time Metrics](/en/ch2#sec_introduction_slo_sla)
- table-table joins, [Table-table join (materialized view maintenance)](/en/ch12#id326)
- Tableau (data visualization software), [Characterizing Transaction Processing and Analytics](/en/ch1#sec_introduction_oltp), [Analytics](/en/ch11#sec_batch_olap)
- tail (Unix tool), [Using logs for message storage](/en/ch12#id300)
- tail latency (see latency)
- tail vertex (property graphs), [Property Graphs](/en/ch3#id56)
- task (workflows) (see workflow engines)
- TCP (Transmission Control Protocol), [The Limitations of TCP](/en/ch9#sec_distributed_tcp)
- comparison to circuit switching, [Can we not simply make network delays predictable?](/en/ch9#can-we-not-simply-make-network-delays-predictable)
- comparison to UDP, [Network congestion and queueing](/en/ch9#network-congestion-and-queueing)
- connection failures, [Detecting Faults](/en/ch9#id307)
- flow control, [Network congestion and queueing](/en/ch9#network-congestion-and-queueing), [Messaging Systems](/en/ch12#sec_stream_messaging)
- packet checksums, [Weak forms of lying](/en/ch9#weak-forms-of-lying), [The end-to-end argument](/en/ch13#sec_future_e2e_argument), [Trust, but Verify](/en/ch13#sec_future_verification)
- reliability and duplicate suppression, [Duplicate suppression](/en/ch13#id354)
- retransmission timeouts, [Network congestion and queueing](/en/ch9#network-congestion-and-queueing)
- use for transaction sessions, [Single-Object and Multi-Object Operations](/en/ch8#sec_transactions_multi_object)
- Temporal (workflow engine), [Durable Execution and Workflows](/en/ch5#sec_encoding_dataflow_workflows)
- TensorFlow (machine learning library), [Machine Learning](/en/ch11#id290)
- Teradata (database), [Cloud-Native System Architecture](/en/ch1#sec_introduction_cloud_native), [Cloud Data Warehouses](/en/ch4#sec_cloud_data_warehouses)
- term-partitioned indexes (see global secondary indexes)
- termination (consensus), [Single-value consensus](/en/ch10#single-value-consensus), [Atomic commitment as consensus](/en/ch10#atomic-commitment-as-consensus)
- testing, [Humans and Reliability](/en/ch2#id31)
- thrashing (out of memory), [Process Pauses](/en/ch9#sec_distributed_clocks_pauses)
- threads (concurrency)
- actor model, [Distributed actor frameworks](/en/ch5#distributed-actor-frameworks), [Event-Driven Architectures and RPC](/en/ch12#sec_stream_actors_drpc)
- (see also event-driven architecture)
- atomic operations, [Atomicity](/en/ch8#sec_transactions_acid_atomicity)
- background threads, [Constructing and merging SSTables](/en/ch4#constructing-and-merging-sstables)
- execution pauses, [Can we not simply make network delays predictable?](/en/ch9#can-we-not-simply-make-network-delays-predictable), [Process Pauses](/en/ch9#sec_distributed_clocks_pauses)
- memory barriers, [Linearizability and network delays](/en/ch10#linearizability-and-network-delays)
- preemption, [Process Pauses](/en/ch9#sec_distributed_clocks_pauses)
- single (see single-threaded execution)
- three-phase commit, [Three-phase commit](/en/ch8#three-phase-commit)
- three-way relationships, [Property Graphs](/en/ch3#id56)
- Thrift (data format), [Protocol Buffers](/en/ch5#sec_encoding_protobuf)
- throughput, [Describing Performance](/en/ch2#sec_introduction_percentiles), [Describing Load](/en/ch2#id33), [Batch Processing](/en/ch11#ch_batch)
- TIBCO, [Message brokers](/en/ch5#message-brokers)
- Enterprise Message Service, [Message brokers compared to databases](/en/ch12#id297)
- StreamBase (stream analytics), [Complex event processing](/en/ch12#id317)
- TiDB (database)
- consensus-based replication, [Single-Leader Replication](/en/ch6#sec_replication_leader)
- regions (sharding), [Sharding](/en/ch7#ch_sharding)
- request routing, [Request Routing](/en/ch7#sec_sharding_routing)
- serving derived data, [Serving Derived Data](/en/ch11#sec_batch_serving_derived)
- sharded secondary indexes, [Global Secondary Indexes](/en/ch7#id167)
- snapshot isolation support, [Snapshot Isolation and Repeatable Read](/en/ch8#sec_transactions_snapshot_isolation)
- timestamp oracle, [Implementing a linearizable ID generator](/en/ch10#implementing-a-linearizable-id-generator)
- transactions, [What Exactly Is a Transaction?](/en/ch8#sec_transactions_overview), [Database-internal Distributed Transactions](/en/ch8#sec_transactions_internal)
- use of model-checking, [Model checking and specification languages](/en/ch9#model-checking-and-specification-languages)
- tiered storage, [Setting Up New Followers](/en/ch6#sec_replication_new_replica), [Disk space usage](/en/ch12#sec_stream_disk_usage)
- TigerBeetle (database), [Summary](/en/ch3#summary)
- deterministic simulation testing, [Deterministic simulation testing](/en/ch9#deterministic-simulation-testing)
- TigerGraph (database)
- GSQL language, [Graph Queries in SQL](/en/ch3#id58)
- Tigris (object storage), [Distributed Filesystems](/en/ch11#sec_batch_dfs)
- TileDB (database), [DataFrames, Matrices, and Arrays](/en/ch3#sec_datamodels_dataframes)
- time
- concurrency and, [The "happens-before" relation and concurrency](/en/ch6#sec_replication_happens_before)
- cross-channel timing dependencies, [Cross-channel timing dependencies](/en/ch10#cross-channel-timing-dependencies)
- in distributed systems, [Unreliable Clocks](/en/ch9#sec_distributed_clocks)-[Limiting the impact of garbage collection](/en/ch9#sec_distributed_gc_impact)
- (see also clocks)
- clock synchronization and accuracy, [Clock Synchronization and Accuracy](/en/ch9#sec_distributed_clock_accuracy)
- relying on synchronized clocks, [Relying on Synchronized Clocks](/en/ch9#sec_distributed_clocks_relying)-[Synchronized clocks for global snapshots](/en/ch9#sec_distributed_spanner)
- process pauses, [Process Pauses](/en/ch9#sec_distributed_clocks_pauses)-[Limiting the impact of garbage collection](/en/ch9#sec_distributed_gc_impact)
- reasoning about, in stream processors, [Reasoning About Time](/en/ch12#sec_stream_time)-[Types of windows](/en/ch12#id324)
- event time versus processing time, [Event time versus processing time](/en/ch12#id322), [Microbatching and checkpointing](/en/ch12#id329), [Unifying batch and stream processing](/en/ch13#id338)
- knowing when window is ready, [Handling straggler events](/en/ch12#id323)
- timestamp of events, [Whose clock are you using, anyway?](/en/ch12#id438)
- types of windows, [Types of windows](/en/ch12#id324)
- system models for distributed systems, [System Model and Reality](/en/ch9#sec_distributed_system_model)
- time-dependence in stream joins, [Time-dependence of joins](/en/ch12#sec_stream_join_time)
- time series data
- as DataFrames, [DataFrames, Matrices, and Arrays](/en/ch3#sec_datamodels_dataframes)
- column-oriented storage, [Column-Oriented Storage](/en/ch4#sec_storage_column)
- time-of-day clocks, [Time-of-day clocks](/en/ch9#time-of-day-clocks)
- hybrid logical clocks, [Hybrid logical clocks](/en/ch10#hybrid-logical-clocks)
- timeliness, [Timeliness and Integrity](/en/ch13#sec_future_integrity)
- coordination-avoiding data systems, [Coordination-avoiding data systems](/en/ch13#id454)
- correctness of dataflow systems, [Correctness of dataflow systems](/en/ch13#id453)
- timeouts, [Unreliable Networks](/en/ch9#sec_distributed_networks), [Glossary](/en/glossary)
- dynamic configuration of, [Network congestion and queueing](/en/ch9#network-congestion-and-queueing)
- for failover, [Leader failure: Failover](/en/ch6#leader-failure-failover)
- length of, [Timeouts and Unbounded Delays](/en/ch9#sec_distributed_queueing)
- TimescaleDB (database), [Column-Oriented Storage](/en/ch4#sec_storage_column)
- timestamps, [Logical Clocks](/en/ch10#sec_consistency_timestamps)
- assigning to events in stream processing, [Whose clock are you using, anyway?](/en/ch12#id438)
- for read-after-write consistency, [Reading Your Own Writes](/en/ch6#sec_replication_ryw)
- for transaction ordering, [Synchronized clocks for global snapshots](/en/ch9#sec_distributed_spanner)
- insufficiency for enforcing constraints, [Enforcing constraints using logical clocks](/en/ch10#enforcing-constraints-using-logical-clocks)
- key range sharding by, [Sharding by Key Range](/en/ch7#sec_sharding_key_range)
- Lamport, [Lamport timestamps](/en/ch10#lamport-timestamps)
- logical, [Ordering events to capture causality](/en/ch13#sec_future_capture_causality)
- ordering events, [Timestamps for ordering events](/en/ch9#sec_distributed_lww)
- timestamp oracle, [Implementing a linearizable ID generator](/en/ch10#implementing-a-linearizable-id-generator)
- TLA+ (specification language), [Model checking and specification languages](/en/ch9#model-checking-and-specification-languages)
- token bucket (limiting retries), [Describing Performance](/en/ch2#sec_introduction_percentiles)
- tombstones, [Constructing and merging SSTables](/en/ch4#constructing-and-merging-sstables), [Disk space usage](/en/ch4#disk-space-usage), [Log compaction](/en/ch12#sec_stream_log_compaction)
- topics (messaging), [Message brokers](/en/ch5#message-brokers), [Transmitting Event Streams](/en/ch12#sec_stream_transmit)
- torn pages (B-trees), [Making B-trees reliable](/en/ch4#sec_storage_btree_wal)
- total order, [Glossary](/en/glossary)
- broadcast (see shared logs)
- limits of, [The limits of total ordering](/en/ch13#id335)
- on logical timestamps, [Logical Clocks](/en/ch10#sec_consistency_timestamps)
- tracing, [Problems with Distributed Systems](/en/ch1#sec_introduction_dist_sys_problems)
- tracking behavioral data, [Privacy and Tracking](/en/ch14#id373)
- (see also privacy)
- trade-offs, [Trade-offs in Data Systems Architecture](/en/ch1#ch_tradeoffs)-[Data Systems, Law, and Society](/en/ch1#sec_introduction_compliance)
- transaction coordinator (see coordinator)
- transaction manager (see coordinator)
- transaction processing, [Characterizing Transaction Processing and Analytics](/en/ch1#sec_introduction_oltp)
- comparison to analytics, [Characterizing Transaction Processing and Analytics](/en/ch1#sec_introduction_oltp)
- comparison to data warehousing, [Data Storage for Analytics](/en/ch4#sec_storage_analytics)
- transactions, [Transactions](/en/ch8#ch_transactions)-[Summary](/en/ch8#summary), [Glossary](/en/glossary)
- ACID properties of, [The Meaning of ACID](/en/ch8#sec_transactions_acid)
- atomicity, [Atomicity](/en/ch8#sec_transactions_acid_atomicity)
- consistency, [Consistency](/en/ch8#sec_transactions_acid_consistency)
- durability, [Making B-trees reliable](/en/ch4#sec_storage_btree_wal), [Durability](/en/ch8#durability)
- isolation, [Isolation](/en/ch8#sec_transactions_acid_isolation)
- and derived data integrity, [Timeliness and Integrity](/en/ch13#sec_future_integrity)
- and replication, [Solutions for Replication Lag](/en/ch6#id131)
- compensating (see compensating transactions)
- concept of, [What Exactly Is a Transaction?](/en/ch8#sec_transactions_overview)
- distributed transactions, [Distributed Transactions](/en/ch8#sec_transactions_distributed)-[Exactly-once message processing revisited](/en/ch8#exactly-once-message-processing-revisited)
- avoiding, [Derived data versus distributed transactions](/en/ch13#sec_future_derived_vs_transactions), [Making unbundling work](/en/ch13#sec_future_unbundling_favor), [Enforcing Constraints](/en/ch13#sec_future_constraints)-[Coordination-avoiding data systems](/en/ch13#id454)
- failure amplification, [Maintaining derived state](/en/ch13#id446)
- for sharded systems, [Pros and Cons of Sharding](/en/ch7#sec_sharding_reasons)
- in doubt/uncertain status, [Coordinator failure](/en/ch8#coordinator-failure), [Holding locks while in doubt](/en/ch8#holding-locks-while-in-doubt)
- two-phase commit, [Two-Phase Commit (2PC)](/en/ch8#sec_transactions_2pc)-[Three-phase commit](/en/ch8#three-phase-commit)
- use of, [Distributed Transactions Across Different Systems](/en/ch8#sec_transactions_xa)-[Exactly-once message processing](/en/ch8#sec_transactions_exactly_once)
- XA transactions, [XA transactions](/en/ch8#xa-transactions)-[Problems with XA transactions](/en/ch8#problems-with-xa-transactions)
- OLTP versus analytics queries, [Analytics](/en/ch11#sec_batch_olap)
- purpose of, [Transactions](/en/ch8#ch_transactions)
- serializability, [Serializability](/en/ch8#sec_transactions_serializability)-[Performance of serializable snapshot isolation](/en/ch8#performance-of-serializable-snapshot-isolation)
- actual serial execution, [Actual Serial Execution](/en/ch8#sec_transactions_serial)-[Summary of serial execution](/en/ch8#summary-of-serial-execution)
- pessimistic versus optimistic concurrency control, [Pessimistic versus optimistic concurrency control](/en/ch8#pessimistic-versus-optimistic-concurrency-control)
- serializable snapshot isolation (SSI), [Serializable Snapshot Isolation (SSI)](/en/ch8#sec_transactions_ssi)-[Performance of serializable snapshot isolation](/en/ch8#performance-of-serializable-snapshot-isolation)
- two-phase locking (2PL), [Two-Phase Locking (2PL)](/en/ch8#sec_transactions_2pl)-[Index-range locks](/en/ch8#sec_transactions_2pl_range)
- single-object and multi-object, [Single-Object and Multi-Object Operations](/en/ch8#sec_transactions_multi_object)-[Handling errors and aborts](/en/ch8#handling-errors-and-aborts)
- handling errors and aborts, [Handling errors and aborts](/en/ch8#handling-errors-and-aborts)
- need for multi-object transactions, [The need for multi-object transactions](/en/ch8#sec_transactions_need)
- single-object writes, [Single-object writes](/en/ch8#sec_transactions_single_object)
- snapshot isolation (see snapshots)
- strict serializability, [What Makes a System Linearizable?](/en/ch10#sec_consistency_lin_definition)
- weak isolation levels, [Weak Isolation Levels](/en/ch8#sec_transactions_isolation_levels)-[Materializing conflicts](/en/ch8#materializing-conflicts)
- preventing lost updates, [Preventing Lost Updates](/en/ch8#sec_transactions_lost_update)-[Conflict resolution and replication](/en/ch8#conflict-resolution-and-replication)
- read committed, [Read Committed](/en/ch8#sec_transactions_read_committed)-[Snapshot Isolation and Repeatable Read](/en/ch8#sec_transactions_snapshot_isolation)
- traversal (graphs), [Property Graphs](/en/ch3#id56)
- trie (data structure), [Constructing and merging SSTables](/en/ch4#constructing-and-merging-sstables), [Full-Text Search](/en/ch4#sec_storage_full_text)
- as SSTable index, [The SSTable file format](/en/ch4#the-sstable-file-format)
- triggers (databases), [Transmitting Event Streams](/en/ch12#sec_stream_transmit)
- Trino (data warehouse), [Cloud Data Warehouses](/en/ch4#sec_cloud_data_warehouses)
- federated databases, [The meta-database of everything](/en/ch13#id341)
- query optimizer, [Query languages](/en/ch11#sec_batch_query_lanauges)
- use for ETL, [Extract–Transform–Load (ETL)](/en/ch11#sec_batch_etl_usage)
- workflow example, [Scheduling Workflows](/en/ch11#sec_batch_workflows)
- triple-stores, [Triple-Stores and SPARQL](/en/ch3#id59)-[The SPARQL query language](/en/ch3#the-sparql-query-language)
- SPARQL query language, [The SPARQL query language](/en/ch3#the-sparql-query-language)
- tumbling windows (stream processing), [Types of windows](/en/ch12#id324)
- (see also windows)
- in microbatching, [Microbatching and checkpointing](/en/ch12#id329)
- Turbopuffer (vector search), [Setting Up New Followers](/en/ch6#sec_replication_new_replica)
- Turtle (RDF data format), [Triple-Stores and SPARQL](/en/ch3#id59)
- Twitter (see X (social network))
- two-phase commit (2PC), [Two-Phase Commit (2PC)](/en/ch8#sec_transactions_2pc)-[Coordinator failure](/en/ch8#coordinator-failure), [Glossary](/en/glossary)
- confusion with two-phase locking, [Two-Phase Locking (2PL)](/en/ch8#sec_transactions_2pl)
- coordinator failure, [Coordinator failure](/en/ch8#coordinator-failure)
- coordinator recovery, [Recovering from coordinator failure](/en/ch8#recovering-from-coordinator-failure)
- how it works, [A system of promises](/en/ch8#a-system-of-promises)
- performance cost, [Distributed Transactions Across Different Systems](/en/ch8#sec_transactions_xa)
- problems with XA transactions, [Problems with XA transactions](/en/ch8#problems-with-xa-transactions)
- transactions holding locks, [Holding locks while in doubt](/en/ch8#holding-locks-while-in-doubt)
- two-phase locking (2PL), [Two-Phase Locking (2PL)](/en/ch8#sec_transactions_2pl)-[Index-range locks](/en/ch8#sec_transactions_2pl_range), [What Makes a System Linearizable?](/en/ch10#sec_consistency_lin_definition), [Glossary](/en/glossary)
- confusion with two-phase commit, [Two-Phase Locking (2PL)](/en/ch8#sec_transactions_2pl)
- growing and shrinking phases, [Implementation of two-phase locking](/en/ch8#implementation-of-two-phase-locking)
- index-range locks, [Index-range locks](/en/ch8#sec_transactions_2pl_range)
- performance of, [Performance of two-phase locking](/en/ch8#performance-of-two-phase-locking)
- type checking, dynamic versus static, [Schema flexibility in the document model](/en/ch3#sec_datamodels_schema_flexibility)
### U
- UDP (User Datagram Protocol)
- comparison to TCP, [Network congestion and queueing](/en/ch9#network-congestion-and-queueing)
- multicast, [Direct messaging from producers to consumers](/en/ch12#id296)
- Ultima Online (game), [Sharding](/en/ch7#ch_sharding)
- unbounded datasets, [Stream Processing](/en/ch12#ch_stream), [Glossary](/en/glossary)
- (see also streams)
- unbounded delays, [Glossary](/en/glossary)
- in networks, [Timeouts and Unbounded Delays](/en/ch9#sec_distributed_queueing)
- process pauses, [Process Pauses](/en/ch9#sec_distributed_clocks_pauses)
- unbundling databases, [Unbundling Databases](/en/ch13#sec_future_unbundling)-[Multi-shard data processing](/en/ch13#sec_future_unbundled_multi_shard)
- composing data storage technologies, [Composing Data Storage Technologies](/en/ch13#id447)-[Unbundled versus integrated systems](/en/ch13#id448)
- federation versus unbundling, [The meta-database of everything](/en/ch13#id341)
- designing applications around dataflow, [Designing Applications Around Dataflow](/en/ch13#sec_future_dataflow)-[Stream processors and services](/en/ch13#id345)
- observing derived state, [Observing Derived State](/en/ch13#sec_future_observing)-[Multi-shard data processing](/en/ch13#sec_future_unbundled_multi_shard)
- materialized views and caching, [Materialized views and caching](/en/ch13#id451)
- multi-shard data processing, [Multi-shard data processing](/en/ch13#sec_future_unbundled_multi_shard)
- pushing state changes to clients, [Pushing state changes to clients](/en/ch13#id348)
- uncertain (transaction status) (see in doubt)
- union type (in Avro), [Schema evolution rules](/en/ch5#schema-evolution-rules)
- uniq (Unix tool), [Simple Log Analysis](/en/ch11#sec_batch_log_analysis), [Distributed Job Orchestration](/en/ch11#id278)
- uniqueness constraints
- asynchronously checked, [Loosely interpreted constraints](/en/ch13#id362)
- requiring consensus, [Uniqueness constraints require consensus](/en/ch13#id452)
- requiring linearizability, [Constraints and uniqueness guarantees](/en/ch10#sec_consistency_uniqueness)
- uniqueness in log-based messaging, [Uniqueness in log-based messaging](/en/ch13#sec_future_uniqueness_log)
- Unity (data catalog), [Cloud Data Warehouses](/en/ch4#sec_cloud_data_warehouses)
- universally unique identifiers (see UUIDs)
- Unix philosophy
- comparison to relational databases, [Unbundling Databases](/en/ch13#sec_future_unbundling), [The meta-database of everything](/en/ch13#id341)
- comparison to stream processing, [Processing Streams](/en/ch12#sec_stream_processing)
- Unix pipes, [Simple Log Analysis](/en/ch11#sec_batch_log_analysis)
- compared to distributed batch processing, [Scheduling Workflows](/en/ch11#sec_batch_workflows)
- UPDATE statement (SQL), [Schema flexibility in the document model](/en/ch3#sec_datamodels_schema_flexibility)
- updates
- preventing lost updates, [Preventing Lost Updates](/en/ch8#sec_transactions_lost_update)-[Conflict resolution and replication](/en/ch8#conflict-resolution-and-replication)
- atomic write operations, [Atomic write operations](/en/ch8#atomic-write-operations)
- automatically detecting lost updates, [Automatically detecting lost updates](/en/ch8#automatically-detecting-lost-updates)
- compare-and-set (CAS), [Conditional writes (compare-and-set)](/en/ch8#sec_transactions_compare_and_set)
- conflict resolution and replication, [Conflict resolution and replication](/en/ch8#conflict-resolution-and-replication)
- using explicit locking, [Explicit locking](/en/ch8#explicit-locking)
- preventing write skew, [Write Skew and Phantoms](/en/ch8#sec_transactions_write_skew)-[Materializing conflicts](/en/ch8#materializing-conflicts)
- utilization
- batch process scheduling, [Resource Allocation](/en/ch11#id279)
- increasing through preemption, [Handling Faults](/en/ch11#id281)
- trade-off with latency, [Can we not simply make network delays predictable?](/en/ch9#can-we-not-simply-make-network-delays-predictable)
- uTP protocol (BitTorrent), [The Limitations of TCP](/en/ch9#sec_distributed_tcp)
- UUIDs, [ID Generators and Logical Clocks](/en/ch10#sec_consistency_logical)
### V
- validity (consensus), [Single-value consensus](/en/ch10#single-value-consensus), [Atomic commitment as consensus](/en/ch10#atomic-commitment-as-consensus)
- vBuckets (sharding), [Sharding](/en/ch7#ch_sharding)
- vector clocks, [Version vectors](/en/ch6#version-vectors)
- (see also version vectors)
- and Lamport/hybrid logical clocks, [Lamport/hybrid logical clocks versus vector clocks](/en/ch10#lamporthybrid-logical-clocks-vs-vector-clocks)
- and version vectors, [Version vectors](/en/ch6#version-vectors)
- vector embedding, [Vector Embeddings](/en/ch4#id92)
- vectorized processing, [Query Execution: Compilation and Vectorization](/en/ch4#sec_storage_vectorized)
- vendor lock-in, [Pros and Cons of Cloud Services](/en/ch1#sec_introduction_cloud_tradeoffs)
- Venice (database), [Serving Derived Data](/en/ch11#sec_batch_serving_derived)
- verification, [Trust, but Verify](/en/ch13#sec_future_verification)-[Tools for auditable data systems](/en/ch13#id366)
- avoiding blind trust, [Don't just blindly trust what they promise](/en/ch13#id364)
- designing for auditability, [Designing for auditability](/en/ch13#id365)
- end-to-end integrity checks, [The end-to-end argument again](/en/ch13#id456)
- tools for auditable data systems, [Tools for auditable data systems](/en/ch13#id366)
- version control systems
- merge conflicts, [Manual conflict resolution](/en/ch6#manual-conflict-resolution)
- reliance on immutable data, [Concurrency control](/en/ch12#sec_stream_concurrency)
- version vectors, [Problems with different topologies](/en/ch6#problems-with-different-topologies), [Version vectors](/en/ch6#version-vectors)
- dotted, [Version vectors](/en/ch6#version-vectors)
- versus vector clocks, [Version vectors](/en/ch6#version-vectors)
- Vertica (database), [Cloud Data Warehouses](/en/ch4#sec_cloud_data_warehouses)
- handling writes, [Writing to Column-Oriented Storage](/en/ch4#writing-to-column-oriented-storage)
- vertical scaling (see scaling up)
- vertices (in graphs), [Graph-Like Data Models](/en/ch3#sec_datamodels_graph)
- property graph model, [Property Graphs](/en/ch3#id56)
- video games, [Pros and cons of sync engines](/en/ch6#pros-and-cons-of-sync-engines)
- video transcoding (example), [Cross-channel timing dependencies](/en/ch10#cross-channel-timing-dependencies)
- views (SQL queries), [Datalog: Recursive Relational Queries](/en/ch3#id62)
- materialized views (see materialization)
- Viewstamped Replication (consensus algorithm), [Consensus](/en/ch10#sec_consistency_consensus), [Consensus in Practice](/en/ch10#sec_consistency_total_order)
- use of model-checking, [Model checking and specification languages](/en/ch9#model-checking-and-specification-languages)
- view number, [From single-leader replication to consensus](/en/ch10#from-single-leader-replication-to-consensus)
- virtual block device, [Separation of storage and compute](/en/ch1#sec_introduction_storage_compute)
- virtual file system, [Distributed Filesystems](/en/ch11#sec_batch_dfs)
- comparison to distributed filesystems, [Distributed Filesystems](/en/ch11#sec_batch_dfs)
- virtual machines, [Layering of cloud services](/en/ch1#layering-of-cloud-services)
- context switches, [Process Pauses](/en/ch9#sec_distributed_clocks_pauses)
- network performance, [Network congestion and queueing](/en/ch9#network-congestion-and-queueing)
- noisy neighbors, [Network congestion and queueing](/en/ch9#network-congestion-and-queueing)
- virtualized clocks in, [Clock Synchronization and Accuracy](/en/ch9#sec_distributed_clock_accuracy)
- virtual memory
- process pauses due to page faults, [Latency and Response Time](/en/ch2#id23), [Process Pauses](/en/ch9#sec_distributed_clocks_pauses)
- Virtuoso (database), [The SPARQL query language](/en/ch3#the-sparql-query-language)
- VisiCalc (spreadsheets), [Designing Applications Around Dataflow](/en/ch13#sec_future_dataflow)
- Vitess (database)
- key-range sharding, [Sharding by Key Range](/en/ch7#sec_sharding_key_range)
- vnodes (sharding), [Sharding](/en/ch7#ch_sharding)
- vocabularies, [Triple-Stores and SPARQL](/en/ch3#id59)
- Voice over IP (VoIP), [Network congestion and queueing](/en/ch9#network-congestion-and-queueing)
- VoltDB (database)
- cross-shard serializability, [Sharding](/en/ch8#sharding)
- deterministic stored procedures, [Pros and cons of stored procedures](/en/ch8#sec_transactions_stored_proc_tradeoffs)
- in-memory storage, [Keeping everything in memory](/en/ch4#sec_storage_inmemory)
- process-per-core model, [Pros and Cons of Sharding](/en/ch7#sec_sharding_reasons)
- secondary indexes, [Local Secondary Indexes](/en/ch7#id166)
- serial execution of transactions, [Actual Serial Execution](/en/ch8#sec_transactions_serial)
- statement-based replication, [Statement-based replication](/en/ch6#statement-based-replication), [Rebuilding state after a failure](/en/ch12#sec_stream_state_fault_tolerance)
- transactions in stream processing, [Atomic commit revisited](/en/ch12#sec_stream_atomic_commit)
### W
- WAL (write-ahead log), [Making B-trees reliable](/en/ch4#sec_storage_btree_wal)
- WAL-G (backup tool), [Setting Up New Followers](/en/ch6#sec_replication_new_replica)
- WarpStream (messaging), [Disk space usage](/en/ch12#sec_stream_disk_usage)
- web services (see services)
- webhooks, [Direct messaging from producers to consumers](/en/ch12#id296)
- webMethods (messaging), [Message brokers](/en/ch5#message-brokers)
- WebSocket (protocol), [Pushing state changes to clients](/en/ch13#id348)
- wide-column data model, [Data locality for reads and writes](/en/ch3#sec_datamodels_document_locality)
- versus column-oriented storage, [Column Compression](/en/ch4#sec_storage_column_compression)
- windows (stream processing), [Stream analytics](/en/ch12#id318), [Reasoning About Time](/en/ch12#sec_stream_time)-[Types of windows](/en/ch12#id324)
- infinite windows for changelogs, [Maintaining materialized views](/en/ch12#sec_stream_mat_view), [Stream-table join (stream enrichment)](/en/ch12#sec_stream_table_joins)
- knowing when all events have arrived, [Handling straggler events](/en/ch12#id323)
- stream joins within a window, [Stream-stream join (window join)](/en/ch12#id440)
- types of windows, [Types of windows](/en/ch12#id324)
- WITH RECURSIVE syntax (SQL), [Graph Queries in SQL](/en/ch3#id58)
- Word2Vec (language model), [Vector Embeddings](/en/ch4#id92)
- workflow engines, [Durable Execution and Workflows](/en/ch5#sec_encoding_dataflow_workflows)
- Airflow (see Airflow (workflow scheduler))
- batch processing, [Scheduling Workflows](/en/ch11#sec_batch_workflows)
- Camunda (see Camunda (workflow engine))
- Dagster (see Dagster (workflow scheduler))
- durable execution, [Durable Execution and Workflows](/en/ch5#sec_encoding_dataflow_workflows)
- ETL (see ETL (extract-transform-load))
- executor, [Durable Execution and Workflows](/en/ch5#sec_encoding_dataflow_workflows)
- orchestrators, [Durable Execution and Workflows](/en/ch5#sec_encoding_dataflow_workflows), [Batch Processing](/en/ch11#ch_batch)
- Orkes (see Orkes (workflow engine))
- Prefect (see Prefect (workflow scheduler))
- reliance on determinism, [Deterministic simulation testing](/en/ch9#deterministic-simulation-testing)
- Restate (see Restate (workflow engine))
- Temporal (see Temporal (workflow engine))
- working set, [Sorting Versus In-memory Aggregation](/en/ch11#id275)
- write amplification, [Write amplification](/en/ch4#write-amplification)
- write path (derived data), [Observing Derived State](/en/ch13#sec_future_observing)
- write skew (transaction isolation), [Write Skew and Phantoms](/en/ch8#sec_transactions_write_skew)-[Materializing conflicts](/en/ch8#materializing-conflicts)
- characterizing, [Write Skew and Phantoms](/en/ch8#sec_transactions_write_skew)-[Phantoms causing write skew](/en/ch8#sec_transactions_phantom), [Decisions based on an outdated premise](/en/ch8#decisions-based-on-an-outdated-premise)
- examples of, [Write Skew and Phantoms](/en/ch8#sec_transactions_write_skew), [More examples of write skew](/en/ch8#more-examples-of-write-skew)
- materializing conflicts, [Materializing conflicts](/en/ch8#materializing-conflicts)
- occurrence in practice, [Maintaining integrity in the face of software bugs](/en/ch13#id455)
- phantoms, [Phantoms causing write skew](/en/ch8#sec_transactions_phantom)
- preventing
- in snapshot isolation, [Decisions based on an outdated premise](/en/ch8#decisions-based-on-an-outdated-premise)-[Detecting writes that affect prior reads](/en/ch8#sec_detecting_writes_affect_reads)
- in two-phase locking, [Predicate locks](/en/ch8#predicate-locks)-[Index-range locks](/en/ch8#sec_transactions_2pl_range)
- options for, [Characterizing write skew](/en/ch8#characterizing-write-skew)
- write-ahead log (WAL), [Making B-trees reliable](/en/ch4#sec_storage_btree_wal), [Write-ahead log (WAL) shipping](/en/ch6#write-ahead-log-wal-shipping)
- in durable execution, [Durable execution](/en/ch5#durable-execution)
- writes (database)
- atomic write operations, [Atomic write operations](/en/ch8#atomic-write-operations)
- detecting writes affecting prior reads, [Detecting writes that affect prior reads](/en/ch8#sec_detecting_writes_affect_reads)
- preventing dirty writes with read committed, [No dirty writes](/en/ch8#sec_transactions_dirty_write)
- WS-\* framework, [The problems with remote procedure calls (RPCs)](/en/ch5#sec_problems_with_rpc)
- WS-AtomicTransaction (2PC), [Two-Phase Commit (2PC)](/en/ch8#sec_transactions_2pc)
### X
- X (social network)
- constructing home timelines (example), [Case Study: Social Network Home Timelines](/en/ch2#sec_introduction_twitter), [Deriving several views from the same event log](/en/ch12#sec_stream_deriving_views), [Table-table join (materialized view maintenance)](/en/ch12#id326), [Materialized views and caching](/en/ch13#id451)
- cost of joins, [Denormalization in the social networking case study](/en/ch3#denormalization-in-the-social-networking-case-study)
- describing load, [Describing Load](/en/ch2#id33)
- fault tolerance, [Fault Tolerance](/en/ch2#id27)
- performance metrics, [Describing Performance](/en/ch2#sec_introduction_percentiles)
- DistributedLog (event log), [Using logs for message storage](/en/ch12#id300)
- Snowflake (ID generator), [ID Generators and Logical Clocks](/en/ch10#sec_consistency_logical)
- XA transactions, [Two-Phase Commit (2PC)](/en/ch8#sec_transactions_2pc), [XA transactions](/en/ch8#xa-transactions)-[Problems with XA transactions](/en/ch8#problems-with-xa-transactions)
- heuristic decisions, [Recovering from coordinator failure](/en/ch8#recovering-from-coordinator-failure)
- problems with, [Problems with XA transactions](/en/ch8#problems-with-xa-transactions)
- xargs (Unix tool), [Simple Log Analysis](/en/ch11#sec_batch_log_analysis)
- XFS (file system), [Distributed Filesystems](/en/ch11#sec_batch_dfs)
- XGBoost (machine learning library), [Machine Learning](/en/ch11#id290)
- XML
- binary variants, [Binary encoding](/en/ch5#binary-encoding)
- data locality, [Data locality for reads and writes](/en/ch3#sec_datamodels_document_locality)
- encoding RDF data, [The RDF data model](/en/ch3#the-rdf-data-model)
- for application data, issues with, [JSON, XML, and Binary Variants](/en/ch5#sec_encoding_json)
- in relational databases, [Schema flexibility in the document model](/en/ch3#sec_datamodels_schema_flexibility)
- XML databases, [Relational Model versus Document Model](/en/ch3#sec_datamodels_history), [Query languages for documents](/en/ch3#query-languages-for-documents)
- Xorq (query engine), [The meta-database of everything](/en/ch13#id341)
- XPath, [Query languages for documents](/en/ch3#query-languages-for-documents)
- XQuery, [Query languages for documents](/en/ch3#query-languages-for-documents)
### Y
- Yahoo
- response time study, [Average, Median, and Percentiles](/en/ch2#id24)
- YARN (job scheduler), [Distributed Job Orchestration](/en/ch11#id278), [Separation of application code and state](/en/ch13#id344)
- ApplicationMaster, [Distributed Job Orchestration](/en/ch11#id278)
- Yjs (CRDT library), [Pros and cons of sync engines](/en/ch6#pros-and-cons-of-sync-engines)
- YugabyteDB (database)
- hash-range sharding, [Sharding by hash range](/en/ch7#sharding-by-hash-range)
- key-range sharding, [Sharding by Key Range](/en/ch7#sec_sharding_key_range)
- multi-leader replication, [Geographically Distributed Operation](/en/ch6#sec_replication_multi_dc)
- request routing, [Request Routing](/en/ch7#sec_sharding_routing)
- sharded secondary indexes, [Global Secondary Indexes](/en/ch7#id167)
- tablets (sharding), [Sharding](/en/ch7#ch_sharding)
- transactions, [What Exactly Is a Transaction?](/en/ch8#sec_transactions_overview), [Database-internal Distributed Transactions](/en/ch8#sec_transactions_internal)
- use of clock synchronization, [Synchronized clocks for global snapshots](/en/ch9#sec_distributed_spanner)
### Z
- Zab (consensus algorithm), [Consensus](/en/ch10#sec_consistency_consensus), [Consensus in Practice](/en/ch10#sec_consistency_total_order)
- use in ZooKeeper, [Implementing Linearizable Systems](/en/ch10#sec_consistency_implementing_linearizable)
- zero-copy, [Formats for Encoding Data](/en/ch5#sec_encoding_formats)
- zero-disk architecture (ZDA), [Setting Up New Followers](/en/ch6#sec_replication_new_replica)
- ZeroMQ (messaging library), [Direct messaging from producers to consumers](/en/ch12#id296)
- zombies (split brain), [Fencing off zombies and delayed requests](/en/ch9#sec_distributed_fencing_tokens)
- zones (cloud computing) (see availability zones)
- ZooKeeper (coordination service), [Coordination Services](/en/ch10#sec_consistency_coordination)-[Service discovery](/en/ch10#service-discovery)
- generating fencing tokens, [Fencing off zombies and delayed requests](/en/ch9#sec_distributed_fencing_tokens), [Using shared logs](/en/ch10#sec_consistency_smr), [Coordination Services](/en/ch10#sec_consistency_coordination)
- linearizable operations, [Implementing Linearizable Systems](/en/ch10#sec_consistency_implementing_linearizable)
- locks and leader election, [Locking and leader election](/en/ch10#locking-and-leader-election)
- observers, [Service discovery](/en/ch10#service-discovery)
- use for service discovery, [Load balancers, service discovery, and service meshes](/en/ch5#sec_encoding_service_discovery), [Service discovery](/en/ch10#service-discovery)
- use for shard assignment, [Request Routing](/en/ch7#sec_sharding_routing)
- use of Zab algorithm, [Consensus](/en/ch10#sec_consistency_consensus)
================================================
FILE: content/en/part-i.md
================================================
---
title: "PART I: Foundations of Data Systems"
weight: 100
breadcrumbs: false
---
{{< callout type="warning" >}}
This page is from the 1st edition; the 2nd edition is not available yet.
{{< /callout >}}
The first five chapters go through the fundamental ideas that apply to all data systems, whether running on a single machine or distributed across a cluster of machines:
1. [Chapter 1](/en/ch1) introduces the tradeoffs that data systems must make, such as the balance between consistency and availability, and how these tradeoffs affect system design.
2. [Chapter 2](/en/ch2) discusses the nonfunctional requirements of data systems, such as availability, consistency, and latency, and how we can try to achieve these goals.
3. [Chapter 3](/en/ch3) compares several different data models and query languages—the most visible distinguishing factor between databases from a developer’s point of view. We will see how different models are appropriate to different situations.
4. [Chapter 4](/en/ch4) turns to the internals of storage engines and looks at how databases lay out data on disk. Different storage engines are optimized for different workloads, and choosing the right one can have a huge effect on performance.
5. [Chapter 5](/en/ch5) compares various formats for data encoding (serialization) and especially examines how they fare in an environment where application requirements change and schemas need to adapt over time.
Later, [Part II](/en/part-ii) will turn to the particular issues of distributed data systems.
## [1. Trade-offs in Data Systems Architecture](/en/ch1)
- [Analytical versus Operational Systems](/en/ch1#sec_introduction_analytics)
- [Cloud versus Self-Hosting](/en/ch1#sec_introduction_cloud)
- [Distributed versus Single-Node Systems](/en/ch1#sec_introduction_distributed)
- [Data Systems, Law, and Society](/en/ch1#sec_introduction_compliance)
- [Summary](/en/ch1#summary)
## [2. Defining Nonfunctional Requirements](/en/ch2)
- [Case Study: Social Network Home Timelines](/en/ch2#sec_introduction_twitter)
- [Describing Performance](/en/ch2#sec_introduction_percentiles)
- [Reliability and Fault Tolerance](/en/ch2#sec_introduction_reliability)
- [Scalability](/en/ch2#sec_introduction_scalability)
- [Maintainability](/en/ch2#sec_introduction_maintainability)
- [Summary](/en/ch2#summary)
## [3. Data Models and Query Languages](/en/ch3)
- [Relational Model versus Document Model](/en/ch3#sec_datamodels_history)
- [Graph-Like Data Models](/en/ch3#sec_datamodels_graph)
- [Event Sourcing and CQRS](/en/ch3#sec_datamodels_events)
- [Dataframes, Matrices, and Arrays](/en/ch3#sec_datamodels_dataframes)
- [Summary](/en/ch3#summary)
## [4. Storage and Retrieval](/en/ch4)
- [Storage and Indexing for OLTP](/en/ch4#sec_storage_oltp)
- [Data Storage for Analytics](/en/ch4#sec_storage_analytics)
- [Multidimensional and Full-Text Indexes](/en/ch4#sec_storage_multidimensional)
- [Summary](/en/ch4#summary)
## [5. Encoding and Evolution](/en/ch5)
- [Formats for Encoding Data](/en/ch5#sec_encoding_formats)
- [Modes of Dataflow](/en/ch5#sec_encoding_dataflow)
- [Summary](/en/ch5#summary)
================================================
FILE: content/en/part-ii.md
================================================
---
title: "PART II: Distributed Data"
weight: 200
breadcrumbs: false
---
{{< callout type="warning" >}}
This page is from the 1st edition; the 2nd edition is not available yet.
{{< /callout >}}
> *For a successful technology, reality must take precedence over public relations, for nature cannot be fooled.*
>
> —Richard Feynman, *Rogers Commission Report* (1986)
-------
In [Part I](/en/part-i) of this book, we discussed aspects of data systems that apply when data is stored on a single machine. Now, in [Part II](/en/part-ii),
we move up a level and ask: what happens if multiple machines are involved in storage and retrieval of data?
There are various reasons why you might want to distribute a database across multiple machines:
***Scalability***
If your data volume, read load, or write load grows bigger than a single machine can handle, you can potentially spread the load across multiple machines.
***Fault tolerance/high availability***
If your application needs to continue working even if one machine (or several machines, or the network, or an entire datacenter) goes down,
you can use multiple machines to give you redundancy. When one fails, another one can take over.
***Latency***
If you have users around the world, you might want to have servers at various locations worldwide so that each user can be served from a datacenter that is geographically close to them.
That avoids the users having to wait for network packets to travel halfway around the world.
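To get a rough feel for the latency argument, the following back-of-the-envelope sketch estimates a lower bound on the round-trip time for a request that travels halfway around the world. The constants are approximations introduced here for illustration (light in optical fiber is assumed to travel at roughly two-thirds of its speed in a vacuum), and the helper `min_round_trip_ms` is hypothetical. Real networks add routing, queueing, and processing delays on top of this bound.

```python
EARTH_CIRCUMFERENCE_KM = 40_075     # approximate equatorial circumference
SPEED_IN_FIBER_KM_PER_S = 200_000   # assumption: roughly two-thirds of c


def min_round_trip_ms(distance_km: float) -> float:
    """Lower bound on round-trip time over fiber, ignoring routing and queueing."""
    one_way_s = distance_km / SPEED_IN_FIBER_KM_PER_S
    return 2 * one_way_s * 1000


# Distance to the opposite side of the planet along the surface.
halfway_km = EARTH_CIRCUMFERENCE_KM / 2
print(f"{min_round_trip_ms(halfway_km):.0f} ms")  # prints "200 ms"
```

Under these assumptions, no amount of server-side optimization can bring the round trip below about 200 ms, whereas a datacenter 1,000 km away has a bound of about 10 ms.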
## Scaling to Higher Load
If all you need is to scale to higher load, the simplest approach is to buy a more powerful machine (sometimes called *vertical scaling* or *scaling up*). Many CPUs, many RAM chips, and many disks can be joined together under one operating system,
and a fast interconnect allows any CPU to access any part of the memory or disk. In this kind of *shared-memory architecture*, all the components can be treated as a single machine [^1].
> [!NOTE]
> In a large machine, although any CPU can access any part of memory, some banks of memory are closer to one CPU than to others (this is called nonuniform memory access, or NUMA [^1]).
> To make efficient use of this architecture, the processing needs to be broken down so that each CPU mostly accesses memory that is nearby—which means that partitioning is still required, even when ostensibly running on one machine.
The problem with a shared-memory approach is that the cost grows faster than linearly: a machine with twice as many CPUs, twice as much RAM, and twice as much disk capacity as another typically costs significantly more than twice as much.
And due to bottlenecks, a machine twice the size cannot necessarily handle twice the load.
A shared-memory architecture may offer limited fault tolerance, since high-end machines have hot-swappable components (you can replace disks, memory modules, and even CPUs without shutting down the machine), but it is definitely limited to a single geographic location.
Another approach is the *shared-disk architecture*, which uses several machines with independent CPUs and RAM, but stores data on an array of disks that is shared between the machines, which are connected via a fast network.
This architecture is used for some data warehousing workloads, but contention and the overhead of locking limit the scalability of the shared-disk approach [^2].
> [!NOTE]
> The shared array of disks is typically provided as Network Attached Storage (NAS) or a Storage Area Network (SAN).
### Shared-Nothing Architectures
By contrast, *shared-nothing architectures* [^3] (sometimes called *horizontal scaling* or *scaling out*) have gained a lot of popularity.
In this approach, each machine or virtual machine running the database software is called a *node*.
Each node uses its CPUs, RAM, and disks independently. Any coordination between nodes is done at the software level, using a conventional network.
No special hardware is required by a shared-nothing system, so you can use whatever machines have the best price/performance ratio.
You can potentially distribute data across multiple geographic regions, and thus reduce latency for users and potentially be able to survive the loss of an entire datacenter.
With cloud deployments of virtual machines, you don’t need to be operating at Google scale: even for small companies, a multi-region distributed architecture is now feasible.
In this part of the book, we focus on shared-nothing architectures—not because they are necessarily the best choice for every use case, but rather because they require the most caution from you, the application developer.
If your data is distributed across multiple nodes, you need to be aware of the constraints and trade-offs that occur in such a distributed system—the database cannot magically hide these from you.
While a distributed shared-nothing architecture has many advantages, it usually also incurs additional complexity for applications and sometimes limits the expressiveness of the data models you can use.
In some cases, a simple single-threaded program can perform significantly better than a cluster with over 100 CPU cores [^4]. On the other hand, shared-nothing systems can be very powerful.
The next few chapters go into details on the issues that arise when data is distributed.
### Replication Versus Partitioning
There are two common ways data is distributed across multiple nodes:
***Replication***
Keeping a copy of the same data on several different nodes, potentially in different locations.
Replication provides redundancy: if some nodes are unavailable, the data can still be served from the remaining nodes.
Replication can also help improve performance. We discuss replication in [Chapter 6](/en/ch6).
***Partitioning***
Splitting a big database into smaller subsets called *partitions* so that different partitions can be assigned to different nodes (also known as *sharding*).
We discuss partitioning in [Chapter 7](/en/ch7).
These are separate mechanisms, but they often go hand in hand, as illustrated in [Figure II-1](#fig_replication_partitioning).
{{< figure src="/v1/ddia_part-ii_01.png" id="fig_replication_partitioning" caption="Figure II-1. A database split into two partitions, with two replicas per partition." class="w-full my-4" >}}
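As a minimal sketch of how the two mechanisms combine, the following Python snippet assigns each key to one of two partitions and keeps two replicas of every partition on separate nodes. The node names and the functions `partition_for`, `replica_nodes`, and `nodes_for_key` are hypothetical, and the placement shown is only one possible layout. Real databases use more sophisticated schemes, which Chapters 6 and 7 cover.

```python
import hashlib

NUM_PARTITIONS = 2
REPLICAS_PER_PARTITION = 2
NODES = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical node names


def partition_for(key: str) -> int:
    """Partitioning: map a key to one partition using a stable hash."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS


def replica_nodes(partition: int) -> list:
    """Replication: return the nodes that each hold a copy of the partition."""
    start = partition * REPLICAS_PER_PARTITION
    return NODES[start:start + REPLICAS_PER_PARTITION]


def nodes_for_key(key: str) -> list:
    """Any node in this list holds a copy of the data for the key."""
    return replica_nodes(partition_for(key))
```

With this layout, if `node-a` becomes unavailable, the keys in partition 0 can still be read from `node-b`, while partition 1 is unaffected.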
With an understanding of those concepts, we can discuss the difficult trade-offs that you need to make in a distributed system.
We’ll discuss *transactions* in [Chapter 8](/en/ch8), as that will help you understand all the many things that can go wrong in a data system, and what you can do about them.
We’ll conclude this part of the book by discussing the fundamental limitations of distributed systems in [Chapters 9](/en/ch9) and [10](/en/ch10).
Later, in [Part III](/en/part-iii) of this book, we will discuss how you can take several (potentially distributed) datastores and integrate them into a larger system,
satisfying the needs of a complex application. But first, let’s talk about distributed data.
## [6. Replication](/en/ch6)
- [Single-Leader Replication](/en/ch6#sec_replication_leader)
- [Problems with Replication Lag](/en/ch6#sec_replication_lag)
- [Multi-Leader Replication](/en/ch6#sec_replication_multi_leader)
- [Leaderless Replication](/en/ch6#sec_replication_leaderless)
- [Summary](/en/ch6#summary)
## [7. Sharding](/en/ch7)
- [Pros and Cons of Sharding](/en/ch7#sec_sharding_reasons)
- [Sharding of Key-Value Data](/en/ch7#sec_sharding_key_value)
- [Request Routing](/en/ch7#sec_sharding_routing)
- [Sharding and Secondary Indexes](/en/ch7#sec_sharding_secondary_indexes)
- [Summary](/en/ch7#summary)
## [8. Transactions](/en/ch8)
- [What Exactly Is a Transaction?](/en/ch8#sec_transactions_overview)
- [Weak Isolation Levels](/en/ch8#sec_transactions_isolation_levels)
- [Serializability](/en/ch8#sec_transactions_serializability)
- [Distributed Transactions](/en/ch8#sec_transactions_distributed)
- [Summary](/en/ch8#summary)
## [9. The Trouble with Distributed Systems](/en/ch9)
- [Faults and Partial Failures](/en/ch9#sec_distributed_partial_failure)
- [Unreliable Networks](/en/ch9#sec_distributed_networks)
- [Unreliable Clocks](/en/ch9#sec_distributed_clocks)
- [Knowledge, Truth, and Lies](/en/ch9#sec_distributed_truth)
- [Summary](/en/ch9#summary)
## [10. Consistency and Consensus](/en/ch10)
- [Linearizability](/en/ch10#sec_consistency_linearizability)
- [ID Generators and Logical Clocks](/en/ch10#sec_consistency_logical)
- [Consensus](/en/ch10#sec_consistency_consensus)
- [Summary](/en/ch10#summary)
### References
[^1]: Ulrich Drepper: “[What Every Programmer Should Know About Memory](https://people.freebsd.org/~lstewart/articles/cpumemory.pdf),” akkadia.org, November 21, 2007.
[^2]: Ben Stopford: “[Shared Nothing vs. Shared Disk Architectures: An Independent View](http://www.benstopford.com/2009/11/24/understanding-the-shared-nothing-architecture/),” benstopford.com, November 24, 2009.
[^3]: Michael Stonebraker: “[The Case for Shared Nothing](http://db.cs.berkeley.edu/papers/hpts85-nothing.pdf),” IEEE Database Engineering Bulletin, volume 9, number 1, pages 4–9, March 1986.
[^4]: Frank McSherry, Michael Isard, and Derek G. Murray: “[Scalability! But at What COST?](http://www.frankmcsherry.org/assets/COST.pdf),” at 15th USENIX Workshop on Hot Topics in Operating Systems (HotOS), May 2015.
================================================
FILE: content/en/part-iii.md
================================================
---
title: "PART III: Derived Data"
weight: 300
breadcrumbs: false
---
{{< callout type="warning" >}}
This page is from the 1st edition; the 2nd edition is not available yet.
{{< /callout >}}
In Parts [I](/en/part-i) and [II](/en/part-ii) of this book, we assembled from the ground up all the major considerations that go into a distributed database,
from the layout of data on disk all the way to the limits of distributed consistency in the presence of faults. However, this discussion assumed that there was only one database in the application.
In reality, data systems are often more complex. In a large application you often need to be able to access and process data in many different ways,
and there is no one database that can satisfy all those different needs simultaneously. Applications thus commonly use a combination of several different datastores,
indexes, caches, analytics systems, etc., and implement mechanisms for moving data from one store to another.
In this final part of the book, we will examine the issues around integrating multiple different data systems,
potentially with different data models and optimized for different access patterns, into one coherent application architecture.
This aspect of system-building is often overlooked by vendors who claim that their product can satisfy all your needs.
In reality, integrating disparate systems is one of the most important things that needs to be done in a nontrivial application.
## Systems of Record and Derived Data
On a high level, systems that store and process data can be grouped into two broad categories:
***Systems of record***
A system of record, also known as *source of truth*, holds the authoritative version of your data.
When new data comes in, e.g., as user input, it is first written here.
Each fact is represented exactly once (the representation is typically *normalized*).
If there is any discrepancy between another system and the system of record,
then the value in the system of record is (by definition) the correct one.
***Derived data systems***
Data in a derived system is the result of taking some existing data from another system and transforming or processing it in some way.
If you lose derived data, you can recreate it from the original source. A classic example is a cache: data can be served from the cache if present,
but if the cache doesn’t contain what you need, you can fall back to the underlying database. Denormalized values, indexes,
and materialized views also fall into this category. In recommendation systems, predictive summary data is often derived from usage logs.
Technically speaking, derived data is *redundant*, in the sense that it duplicates existing information.
However, it is often essential for getting good performance on read queries. It is commonly *denormalized*.
You can derive several different datasets from a single source, enabling you to look at the data from different “points of view.”
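The cache example can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: both stores are in-memory dictionaries, and the names `system_of_record`, `cache`, `read`, `write`, and `rebuild_cache` are hypothetical rather than taken from any particular database.

```python
# System of record: the authoritative copy (here just an in-memory dict).
system_of_record = {"user:1": "Alice", "user:2": "Bob"}

# Derived data: a cache that may be lost or emptied at any time.
cache = {}


def read(key):
    """Serve from the cache if present, else fall back to the source."""
    if key in cache:
        return cache[key]
    value = system_of_record[key]  # authoritative lookup
    cache[key] = value             # repopulate the derived copy
    return value


def write(key, value):
    """New data is first written to the system of record."""
    system_of_record[key] = value
    cache.pop(key, None)  # invalidate so a stale derived value is not served


def rebuild_cache():
    """Derived data can be recreated from the original source."""
    cache.clear()
    cache.update(system_of_record)
```

Losing `cache` entirely costs only performance, because `rebuild_cache()` or subsequent calls to `read` restore it. Losing `system_of_record` would lose the data itself.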
Not all systems make a clear distinction between systems of record and derived data in their architecture,
but it’s a very helpful distinction to make, because it clarifies the dataflow through your system:
it makes explicit which parts of the system have which inputs and which outputs, and how they depend on each other.
Most databases, storage engines, and query languages are not inherently either a system of record or a derived system.
A database is just a tool: how you use it is up to you.
The distinction between system of record and derived data system depends not on the tool, but on how you use it in your application.
By being clear about which data is derived from which other data, you can bring clarity to an otherwise confusing system architecture.
This point will be a running theme throughout this part of the book.
## Overview of Chapters
We will start in [Chapter 11](/en/ch11) by examining batch-oriented dataflow systems such as MapReduce, and see how they give us good tools and principles for building large-scale data systems.
In [Chapter 12](/en/ch12) we will take those ideas and apply them to data streams, which allow us to do the same kinds of things with lower delays.
In [Chapter 13](/en/ch13) we explore ideas about how we might use these tools to build reliable, scalable, and maintainable applications in the future.
[Chapter 14](/en/ch14) concludes the book with ethics, privacy, and the social impact of data systems.
## Index
- [11. Batch Processing](/en/ch11) (WIP)
- [12. Stream Processing](/en/ch12) (WIP)
- [13. A Philosophy of Streaming Systems](/en/ch13) (WIP)
- [14. Doing the Right Thing](/en/ch14) (WIP)
================================================
FILE: content/en/preface.md
================================================
---
title: Preface
weight: 50
breadcrumbs: false
---
{{< callout type="warning" >}}
This page is from the 1st edition; the 2nd edition is not available yet.
{{< /callout >}}
If you have worked in software engineering in recent years, especially in server-side and backend systems, you have probably been bombarded with a plethora of buzzwords relating to storage and processing of data. NoSQL! Big Data! Web-scale! Sharding! Eventual consistency! ACID! CAP theorem! Cloud services! MapReduce! Real-time!
In the last decade we have seen many interesting developments in databases, in distributed systems, and in the ways we build applications on top of them. There are various driving forces for these developments:
- Internet companies such as Google, Yahoo!, Amazon, Facebook, LinkedIn, Microsoft, and Twitter are handling huge volumes of data and traffic, forcing them to create new tools that enable them to efficiently handle such scale.
- Businesses need to be agile, test hypotheses cheaply, and respond quickly to new market insights by keeping development cycles short and data models flexible.
- Free and open source software has become very successful and is now preferred to commercial or bespoke in-house software in many environments.
- CPU clock speeds are barely increasing, but multi-core processors are standard, and networks are getting faster. This means parallelism is only going to increase.
- Even if you work on a small team, you can now build systems that are distributed across many machines and even multiple geographic regions, thanks to infrastructure as a service (IaaS) such as Amazon Web Services.
- Many services are now expected to be highly available; extended downtime due to outages or maintenance is becoming increasingly unacceptable.
*Data-intensive applications* are pushing the boundaries of what is possible by making use of these technological developments. We call an application *data-intensive* if data is its primary challenge—the quantity of data, the complexity of data, or the speed at which it is changing—as opposed to *compute-intensive*, where CPU cycles are the bottleneck.
The tools and technologies that help data-intensive applications store and process data have been rapidly adapting to these changes. New types of database systems (“NoSQL”) have been getting lots of attention, but message queues, caches, search indexes, frameworks for batch and stream processing, and related technologies are very important too. Many applications use some combination of these.
The buzzwords that fill this space are a sign of enthusiasm for the new possibilities, which is a great thing. However, as software engineers and architects, we also need to have a technically accurate and precise understanding of the various technologies and their trade-offs if we want to build good applications. For that understanding, we have to dig deeper than buzzwords.
Fortunately, behind the rapid changes in technology, there are enduring principles that remain true, no matter which version of a particular tool you are using. If you understand those principles, you’re in a position to see where each tool fits in, how to make good use of it, and how to avoid its pitfalls. That’s where this book comes in.
The goal of this book is to help you navigate the diverse and fast-changing landscape of technologies for processing and storing data. This book is not a tutorial for one particular tool, nor is it a textbook full of dry theory. Instead, we will look at examples of successful data systems: technologies that form the foundation of many popular applications and that have to meet scalability, performance, and reliability requirements in production every day.
We will dig into the internals of those systems, tease apart their key algorithms, discuss their principles and the trade-offs they have to make. On this journey, we will try to find useful ways of *thinking about* data systems—not just *how* they work, but also *why* they work that way, and what questions we need to ask.
After reading this book, you will be in a great position to decide which kind of technology is appropriate for which purpose, and understand how tools can be combined to form the foundation of a good application architecture. You won’t be ready to build your own database storage engine from scratch, but fortunately that is rarely necessary. You will, however, develop a good intuition for what your systems are doing under the hood so that you can reason about their behavior, make good design decisions, and track down any problems that may arise.
## Who Should Read This Book?
If you develop applications that have some kind of server/backend for storing or processing data, and your applications use the internet (e.g., web applications, mobile apps, or internet-connected sensors), then this book is for you.
This book is for software engineers, software architects, and technical managers who love to code. It is especially relevant if you need to make decisions about the architecture of the systems you work on—for example, if you need to choose tools for solving a given problem and figure out how best to apply them. But even if you have no choice over your tools, this book will help you better understand their strengths and weaknesses.
You should have some experience building web-based applications or network services, and you should be familiar with relational databases and SQL. Any non-relational databases and other data-related tools you know are a bonus, but not required. A general understanding of common network protocols like TCP and HTTP is helpful. Your choice of programming language or framework makes no difference for this book.
If any of the following are true for you, you’ll find this book valuable:
- You want to learn how to make data systems scalable, for example, to support web or mobile apps with millions of users.
- You need to make applications highly available (minimizing downtime) and operationally robust.
- You are looking for ways of making systems easier to maintain in the long run, even as they grow and as requirements and technologies change.
- You have a natural curiosity for the way things work and want to know what goes on inside major websites and online services. This book breaks down the internals of various databases and data processing systems, and it’s great fun to explore the bright thinking that went into their design.
Sometimes, when discussing scalable data systems, people make comments along the lines of, “You’re not Google or Amazon. Stop worrying about scale and just use a relational database.” There is truth in that statement: building for scale that you don’t need is wasted effort and may lock you into an inflexible design. In effect, it is a form of premature optimization. However, it’s also important to choose the right tool for the job, and different technologies each have their own strengths and weaknesses. As we shall see, relational databases are important but not the final word on dealing with data.
## Scope of This Book
This book does not attempt to give detailed instructions on how to install or use specific software packages or APIs, since there is already plenty of documentation for those things. Instead we discuss the various principles and trade-offs that are fundamental to data systems, and we explore the different design decisions taken by different products.
In the ebook editions we have included links to the full text of online resources. All links were verified at the time of publication, but unfortunately links tend to break frequently due to the nature of the web. If you come across a broken link, or if you are reading a print copy of this book, you can look up references using a search engine. For academic papers, you can search for the title in Google Scholar to find open-access PDF files. Alternatively, you can find all of the references at [*https://github.com/ept/ddia-references*](https://github.com/ept/ddia-references), where we maintain up-to-date links.
We look primarily at the *architecture* of data systems and the ways they are integrated into data-intensive applications. This book doesn’t have space to cover deployment, operations, security, management, and other areas—those are complex and important topics, and we wouldn’t do them justice by making them superficial side notes in this book. They deserve books of their own.
Many of the technologies described in this book fall within the realm of the *Big Data* buzzword. However, the term “Big Data” is so overused and underdefined that it is not useful in a serious engineering discussion. This book uses less ambiguous terms, such as single-node versus distributed systems, or online/interactive versus offline/batch processing systems.
This book has a bias toward free and open source software (FOSS), because reading, modifying, and executing source code is a great way to understand how something works in detail. Open platforms also reduce the risk of vendor lock-in. However, where appropriate, we also discuss proprietary software (closed-source software, software as a service, or companies’ in-house software that is only described in literature but not released publicly).
## Outline of This Book
This book is arranged into three parts:
1. In [Part I](/en/part-i), we discuss the fundamental ideas that underpin the design of data-intensive applications. We start in [Chapter 1](/en/ch1) by discussing what we’re actually trying to achieve: reliability, scalability, and maintainability; how we need to think about them; and how we can achieve them. In [Chapter 2](/en/ch2) we compare several different data models and query languages, and see how they are appropriate to different situations. In [Chapter 3](/en/ch3) we talk about storage engines: how databases arrange data on disk so that we can find it again efficiently. [Chapter 4](/en/ch4) turns to formats for data encoding (serialization) and evolution of schemas over time.
2. In [Part II](/en/part-ii), we move from data stored on one machine to data that is distributed across multiple machines. This is often necessary for scalability, but brings with it a variety of unique challenges. We first discuss replication ([Chapter 5](/en/ch5)), partitioning/sharding ([Chapter 6](/en/ch6)), and transactions ([Chapter 7](/en/ch7)). We then go into more detail on the problems with distributed systems ([Chapter 8](/en/ch8)) and what it means to achieve consistency and consensus in a distributed system ([Chapter 9](/en/ch9)).
3. In [Part III](/en/part-iii), we discuss systems that derive some datasets from other datasets. Derived data often occurs in heterogeneous systems: when there is no one database that can do everything well, applications need to integrate several different databases, caches, indexes, and so on. In [Chapter 10](/en/ch10) we start with a batch processing approach to derived data, and we build upon it with stream processing in [Chapter 11](/en/ch11). Finally, in [Chapter 12](/en/ch12) we put everything together and discuss approaches for building reliable, scalable, and maintainable applications in the future.
## References and Further Reading
Most of what we discuss in this book has already been said elsewhere in some form or another—in conference presentations, research papers, blog posts, code, bug trackers, mailing lists, and engineering folklore. This book summarizes the most important ideas from many different sources, and it includes pointers to the original literature throughout the text. The references at the end of each chapter are a great resource if you want to explore an area in more depth, and most of them are freely available online.
## O'Reilly Safari
[Safari](http://oreilly.com/safari) (formerly Safari Books Online) is a membership-based training and reference platform for enterprise, government, educators, and individuals.
Members have access to thousands of books, training videos, Learning Paths, interactive tutorials, and curated playlists from over 250 publishers, including O’Reilly Media, Harvard Business Review, Prentice Hall Professional, Addison-Wesley Professional, Microsoft Press, Sams, Que, Peachpit Press, Adobe, Focal Press, Cisco Press, John Wiley & Sons, Syngress, Morgan Kaufmann, IBM Redbooks, Packt, Adobe Press, FT Press, Apress, Manning, New Riders, McGraw-Hill, Jones & Bartlett, and Course Technology, among others.
For more information, please visit http://oreilly.com/safari.
## How to Contact Us
Please address comments and questions concerning this book to the publisher:
O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707-829-0104 (fax)
We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at *http://bit.ly/designing-data-intensive-apps*.
To comment or ask technical questions about this book, send email to *bookquestions@oreilly.com*.
For more information about our books, courses, conferences, and news, see our website at *http://www.oreilly.com*.
* Find us on Facebook: [http://facebook.com/oreilly](http://facebook.com/oreilly)
* Follow us on Twitter: [http://twitter.com/oreillymedia](http://twitter.com/oreillymedia)
* Watch us on YouTube: [http://www.youtube.com/oreillymedia](http://www.youtube.com/oreillymedia)
## Acknowledgments
This book is an amalgamation and systematization of a large number of other people’s ideas and knowledge, combining experience from both academic research and industrial practice. In computing we tend to be attracted to things that are new and shiny, but I think we have a huge amount to learn from things that have been done before. This book has over 800 references to articles, blog posts, talks, documentation, and more, and they have been an invaluable learning resource for me. I am very grateful to the authors of this material for sharing their knowledge.
I have also learned a lot from personal conversations, thanks to a large number of people who have taken the time to discuss ideas or patiently explain things to me. In particular, I would like to thank Joe Adler, Ross Anderson, Peter Bailis, Márton Balassi, Alastair Beresford, Mark Callaghan, Mat Clayton, Patrick Collison, Sean Cribbs, Shirshanka Das, Niklas Ekström, Stephan Ewen, Alan Fekete, Gyula Fóra, Camille Fournier, Andres Freund, John Garbutt, Seth Gilbert, Tom Haggett, Pat Helland, Joe Hellerstein, Jakob Homan, Heidi Howard, John Hugg, Julian Hyde, Conrad Irwin, Evan Jones, Flavio Junqueira, Jessica Kerr, Kyle Kingsbury, Jay Kreps, Carl Lerche, Nicolas Liochon, Steve Loughran, Lee Mallabone, Nathan Marz, Caitie McCaffrey, Josie McLellan, Christopher Meiklejohn, Ian Meyers, Neha Narkhede, Neha Narula, Cathy O’Neil, Onora O’Neill, Ludovic Orban, Zoran Perkov, Julia Powles, Chris Riccomini, Henry Robinson, David Rosenthal, Jennifer Rullmann, Matthew Sackman, Martin Scholl, Amit Sela, Gwen Shapira, Greg Spurrier, Sam Stokes, Ben Stopford, Tom Stuart, Diana Vasile, Rahul Vohra, Pete Warden, and Brett Wooldridge.
Several more people have been invaluable to the writing of this book by reviewing drafts and providing feedback. For these contributions I am particularly indebted to Raul Agepati, Tyler Akidau, Mattias Andersson, Sasha Baranov, Veena Basavaraj, David Beyer, Jim Brikman, Paul Carey, Raul Castro Fernandez, Joseph Chow, Derek Elkins, Sam Elliott, Alexander Gallego, Mark Grover, Stu Halloway, Heidi Howard, Nicola Kleppmann, Stefan Kruppa, Bjorn Madsen, Sander Mak, Stefan Podkowinski, Phil Potter, Hamid Ramazani, Sam Stokes, and Ben Summers. Of course, I take all responsibility for any remaining errors or unpalatable opinions in this book.
For helping this book become real, and for their patience with my slow writing and unusual requests, I am grateful to my editors Marie Beaugureau, Mike Loukides, Ann Spencer, and all the team at O’Reilly. For helping find the right words, I thank Rachel Head. For giving me the time and freedom to write in spite of other work commitments, I thank Alastair Beresford, Susan Goodhue, Neha Narkhede, and Kevin Scott.
Very special thanks are due to Shabbir Diwan and Edie Freedman, who illustrated with great care the maps that accompany the chapters. It’s wonderful that they took on the unconventional idea of creating maps, and made them so beautiful and compelling.
Finally, my love goes to my family and friends, without whom I would not have been able to get through this writing process that has taken almost four years. You’re the best.
================================================
FILE: content/en/toc.md
================================================
---
title: "Table of Contents"
linkTitle: "Table of Contents"
weight: 10
breadcrumbs: false
---

## [Preface](/en/preface)
- [Who Should Read This Book?](/en/preface#who-should-read-this-book)
- [Scope of This Book](/en/preface#scope-of-this-book)
- [Outline of This Book](/en/preface#outline-of-this-book)
- [References and Further Reading](/en/preface#references-and-further-reading)
- [O'Reilly Safari](/en/preface#oreilly-safari)
- [How to Contact Us](/en/preface#how-to-contact-us)
- [Acknowledgments](/en/preface#acknowledgments)
## [1. Trade-offs in Data Systems Architecture](/en/ch1)
- [Analytical versus Operational Systems](/en/ch1#sec_introduction_analytics)
- [Characterizing Transaction Processing and Analytics](/en/ch1#sec_introduction_oltp)
- [Data Warehousing](/en/ch1#sec_introduction_dwh)
- [From data warehouse to data lake](/en/ch1#from-data-warehouse-to-data-lake)
- [Beyond the data lake](/en/ch1#beyond-the-data-lake)
- [Systems of Record and Derived Data](/en/ch1#sec_introduction_derived)
- [Cloud versus Self-Hosting](/en/ch1#sec_introduction_cloud)
- [Pros and Cons of Cloud Services](/en/ch1#sec_introduction_cloud_tradeoffs)
- [Cloud-Native System Architecture](/en/ch1#sec_introduction_cloud_native)
- [Layering of cloud services](/en/ch1#layering-of-cloud-services)
- [Separation of storage and compute](/en/ch1#sec_introduction_storage_compute)
- [Operations in the Cloud Era](/en/ch1#sec_introduction_operations)
- [Distributed versus Single-Node Systems](/en/ch1#sec_introduction_distributed)
- [Problems with Distributed Systems](/en/ch1#sec_introduction_dist_sys_problems)
- [Microservices and Serverless](/en/ch1#sec_introduction_microservices)
- [Cloud Computing versus Supercomputing](/en/ch1#id17)
- [Data Systems, Law, and Society](/en/ch1#sec_introduction_compliance)
- [Summary](/en/ch1#summary)
- [References](/en/ch1#references)
## [2. Defining Nonfunctional Requirements](/en/ch2)
- [Case Study: Social Network Home Timelines](/en/ch2#sec_introduction_twitter)
- [Representing Users, Posts, and Follows](/en/ch2#id20)
- [Materializing and Updating Timelines](/en/ch2#sec_introduction_materializing)
- [Describing Performance](/en/ch2#sec_introduction_percentiles)
- [Latency and Response Time](/en/ch2#id23)
- [Average, Median, and Percentiles](/en/ch2#id24)
- [Use of Response Time Metrics](/en/ch2#sec_introduction_slo_sla)
- [Reliability and Fault Tolerance](/en/ch2#sec_introduction_reliability)
- [Fault Tolerance](/en/ch2#id27)
- [Hardware and Software Faults](/en/ch2#sec_introduction_hardware_faults)
- [Tolerating hardware faults through redundancy](/en/ch2#tolerating-hardware-faults-through-redundancy)
- [Software faults](/en/ch2#software-faults)
- [Humans and Reliability](/en/ch2#id31)
- [Scalability](/en/ch2#sec_introduction_scalability)
- [Describing Load](/en/ch2#id33)
- [Shared-Memory, Shared-Disk, and Shared-Nothing Architecture](/en/ch2#sec_introduction_shared_nothing)
- [Principles for Scalability](/en/ch2#id35)
- [Maintainability](/en/ch2#sec_introduction_maintainability)
- [Operability: Making Life Easy for Operations](/en/ch2#id37)
- [Simplicity: Managing Complexity](/en/ch2#id38)
- [Evolvability: Making Change Easy](/en/ch2#sec_introduction_evolvability)
- [Summary](/en/ch2#summary)
- [References](/en/ch2#references)
## [3. Data Models and Query Languages](/en/ch3)
- [Relational Model versus Document Model](/en/ch3#sec_datamodels_history)
- [The Object-Relational Mismatch](/en/ch3#sec_datamodels_document)
- [Object-relational mapping (ORM)](/en/ch3#object-relational-mapping-orm)
- [The document data model for one-to-many relationships](/en/ch3#the-document-data-model-for-one-to-many-relationships)
- [Normalization, Denormalization, and Joins](/en/ch3#sec_datamodels_normalization)
- [Trade-offs of normalization](/en/ch3#trade-offs-of-normalization)
- [Denormalization in the social networking case study](/en/ch3#denormalization-in-the-social-networking-case-study)
- [Many-to-One and Many-to-Many Relationships](/en/ch3#sec_datamodels_many_to_many)
- [Stars and Snowflakes: Schemas for Analytics](/en/ch3#sec_datamodels_analytics)
- [When to Use Which Model](/en/ch3#sec_datamodels_document_summary)
- [Schema flexibility in the document model](/en/ch3#sec_datamodels_schema_flexibility)
- [Data locality for reads and writes](/en/ch3#sec_datamodels_document_locality)
- [Query languages for documents](/en/ch3#query-languages-for-documents)
- [Convergence of document and relational databases](/en/ch3#convergence-of-document-and-relational-databases)
- [Graph-Like Data Models](/en/ch3#sec_datamodels_graph)
- [Property Graphs](/en/ch3#id56)
- [The Cypher Query Language](/en/ch3#id57)
- [Graph Queries in SQL](/en/ch3#id58)
- [Triple-Stores and SPARQL](/en/ch3#id59)
- [The RDF data model](/en/ch3#the-rdf-data-model)
- [The SPARQL query language](/en/ch3#the-sparql-query-language)
- [Datalog: Recursive Relational Queries](/en/ch3#id62)
- [GraphQL](/en/ch3#id63)
- [Event Sourcing and CQRS](/en/ch3#sec_datamodels_events)
- [Dataframes, Matrices, and Arrays](/en/ch3#sec_datamodels_dataframes)
- [Summary](/en/ch3#summary)
- [References](/en/ch3#references)
## [4. Storage and Retrieval](/en/ch4)
- [Storage and Indexing for OLTP](/en/ch4#sec_storage_oltp)
- [Log-Structured Storage](/en/ch4#sec_storage_log_structured)
- [The SSTable file format](/en/ch4#the-sstable-file-format)
- [Constructing and merging SSTables](/en/ch4#constructing-and-merging-sstables)
- [Bloom filters](/en/ch4#bloom-filters)
- [Compaction strategies](/en/ch4#sec_storage_lsm_compaction)
- [B-Trees](/en/ch4#sec_storage_b_trees)
- [Making B-trees reliable](/en/ch4#sec_storage_btree_wal)
- [B-tree variants](/en/ch4#b-tree-variants)
- [Comparing B-Trees and LSM-Trees](/en/ch4#sec_storage_btree_lsm_comparison)
- [Read performance](/en/ch4#read-performance)
- [Sequential vs. random writes](/en/ch4#sidebar_sequential)
- [Write amplification](/en/ch4#write-amplification)
- [Disk space usage](/en/ch4#disk-space-usage)
- [Multi-Column and Secondary Indexes](/en/ch4#sec_storage_index_multicolumn)
- [Storing values within the index](/en/ch4#sec_storage_index_heap)
- [Keeping everything in memory](/en/ch4#sec_storage_inmemory)
- [Data Storage for Analytics](/en/ch4#sec_storage_analytics)
- [Cloud Data Warehouses](/en/ch4#sec_cloud_data_warehouses)
- [Column-Oriented Storage](/en/ch4#sec_storage_column)
- [Column Compression](/en/ch4#sec_storage_column_compression)
- [Sort Order in Column Storage](/en/ch4#sort-order-in-column-storage)
- [Writing to Column-Oriented Storage](/en/ch4#writing-to-column-oriented-storage)
- [Query Execution: Compilation and Vectorization](/en/ch4#sec_storage_vectorized)
- [Materialized Views and Data Cubes](/en/ch4#sec_storage_materialized_views)
- [Multidimensional and Full-Text Indexes](/en/ch4#sec_storage_multidimensional)
- [Full-Text Search](/en/ch4#sec_storage_full_text)
- [Vector Embeddings](/en/ch4#id92)
- [Summary](/en/ch4#summary)
- [References](/en/ch4#references)
## [5. Encoding and Evolution](/en/ch5)
- [Formats for Encoding Data](/en/ch5#sec_encoding_formats)
- [Language-Specific Formats](/en/ch5#id96)
- [JSON, XML, and Binary Variants](/en/ch5#sec_encoding_json)
- [JSON Schema](/en/ch5#json-schema)
- [Binary encoding](/en/ch5#binary-encoding)
- [Protocol Buffers](/en/ch5#sec_encoding_protobuf)
- [Field tags and schema evolution](/en/ch5#field-tags-and-schema-evolution)
- [Avro](/en/ch5#sec_encoding_avro)
- [The writer’s schema and the reader’s schema](/en/ch5#the-writers-schema-and-the-readers-schema)
- [Schema evolution rules](/en/ch5#schema-evolution-rules)
- [But what is the writer’s schema?](/en/ch5#but-what-is-the-writers-schema)
- [Dynamically generated schemas](/en/ch5#dynamically-generated-schemas)
- [The Merits of Schemas](/en/ch5#sec_encoding_schemas)
- [Modes of Dataflow](/en/ch5#sec_encoding_dataflow)
- [Dataflow Through Databases](/en/ch5#sec_encoding_dataflow_db)
- [Different values written at different times](/en/ch5#different-values-written-at-different-times)
- [Archival storage](/en/ch5#archival-storage)
- [Dataflow Through Services: REST and RPC](/en/ch5#sec_encoding_dataflow_rpc)
- [Web services](/en/ch5#sec_web_services)
- [The problems with remote procedure calls (RPCs)](/en/ch5#sec_problems_with_rpc)
- [Load balancers, service discovery, and service meshes](/en/ch5#sec_encoding_service_discovery)
- [Data encoding and evolution for RPC](/en/ch5#data-encoding-and-evolution-for-rpc)
- [Durable Execution and Workflows](/en/ch5#sec_encoding_dataflow_workflows)
- [Durable execution](/en/ch5#durable-execution)
- [Event-Driven Architectures](/en/ch5#sec_encoding_dataflow_msg)
- [Message brokers](/en/ch5#message-brokers)
- [Distributed actor frameworks](/en/ch5#distributed-actor-frameworks)
- [Summary](/en/ch5#summary)
- [References](/en/ch5#references)
## [6. Replication](/en/ch6)
- [Single-Leader Replication](/en/ch6#sec_replication_leader)
- [Synchronous Versus Asynchronous Replication](/en/ch6#sec_replication_sync_async)
- [Setting Up New Followers](/en/ch6#sec_replication_new_replica)
- [Handling Node Outages](/en/ch6#sec_replication_failover)
- [Follower failure: Catch-up recovery](/en/ch6#follower-failure-catch-up-recovery)
- [Leader failure: Failover](/en/ch6#leader-failure-failover)
- [Implementation of Replication Logs](/en/ch6#sec_replication_implementation)
- [Statement-based replication](/en/ch6#statement-based-replication)
- [Write-ahead log (WAL) shipping](/en/ch6#write-ahead-log-wal-shipping)
- [Logical (row-based) log replication](/en/ch6#logical-row-based-log-replication)
- [Problems with Replication Lag](/en/ch6#sec_replication_lag)
- [Reading Your Own Writes](/en/ch6#sec_replication_ryw)
- [Monotonic Reads](/en/ch6#sec_replication_monotonic_reads)
- [Consistent Prefix Reads](/en/ch6#sec_replication_consistent_prefix)
- [Solutions for Replication Lag](/en/ch6#id131)
- [Multi-Leader Replication](/en/ch6#sec_replication_multi_leader)
- [Geographically Distributed Operation](/en/ch6#sec_replication_multi_dc)
- [Multi-leader replication topologies](/en/ch6#sec_replication_topologies)
- [Problems with different topologies](/en/ch6#problems-with-different-topologies)
- [Sync Engines and Local-First Software](/en/ch6#sec_replication_offline_clients)
- [Real-time collaboration, offline-first, and local-first apps](/en/ch6#real-time-collaboration-offline-first-and-local-first-apps)
- [Pros and cons of sync engines](/en/ch6#pros-and-cons-of-sync-engines)
- [Dealing with Conflicting Writes](/en/ch6#sec_replication_write_conflicts)
- [Conflict avoidance](/en/ch6#conflict-avoidance)
- [Last write wins (discarding concurrent writes)](/en/ch6#sec_replication_lww)
- [Manual conflict resolution](/en/ch6#manual-conflict-resolution)
- [Automatic conflict resolution](/en/ch6#automatic-conflict-resolution)
- [CRDTs and Operational Transformation](/en/ch6#sec_replication_crdts)
- [What is a conflict?](/en/ch6#what-is-a-conflict)
- [Leaderless Replication](/en/ch6#sec_replication_leaderless)
- [Writing to the Database When a Node Is Down](/en/ch6#id287)
- [Catching up on missed writes](/en/ch6#sec_replication_read_repair)
- [Quorums for reading and writing](/en/ch6#sec_replication_quorum_condition)
- [Limitations of Quorum Consistency](/en/ch6#sec_replication_quorum_limitations)
- [Monitoring staleness](/en/ch6#monitoring-staleness)
- [Single-Leader vs. Leaderless Replication Performance](/en/ch6#sec_replication_leaderless_perf)
- [Multi-region operation](/en/ch6#multi-region-operation)
- [Detecting Concurrent Writes](/en/ch6#sec_replication_concurrent)
- [The “happens-before” relation and concurrency](/en/ch6#sec_replication_happens_before)
- [Capturing the happens-before relationship](/en/ch6#capturing-the-happens-before-relationship)
- [Version vectors](/en/ch6#version-vectors)
- [Summary](/en/ch6#summary)
- [References](/en/ch6#references)
## [7. Sharding](/en/ch7)
- [Pros and Cons of Sharding](/en/ch7#sec_sharding_reasons)
- [Sharding for Multitenancy](/en/ch7#sec_sharding_multitenancy)
- [Sharding of Key-Value Data](/en/ch7#sec_sharding_key_value)
- [Sharding by Key Range](/en/ch7#sec_sharding_key_range)
- [Rebalancing key-range sharded data](/en/ch7#rebalancing-key-range-sharded-data)
- [Sharding by Hash of Key](/en/ch7#sec_sharding_hash)
- [Hash modulo number of nodes](/en/ch7#hash-modulo-number-of-nodes)
- [Fixed number of shards](/en/ch7#fixed-number-of-shards)
- [Sharding by hash range](/en/ch7#sharding-by-hash-range)
- [Consistent hashing](/en/ch7#sec_sharding_consistent_hashing)
- [Skewed Workloads and Relieving Hot Spots](/en/ch7#sec_sharding_skew)
- [Operations: Automatic or Manual Rebalancing](/en/ch7#sec_sharding_operations)
- [Request Routing](/en/ch7#sec_sharding_routing)
- [Sharding and Secondary Indexes](/en/ch7#sec_sharding_secondary_indexes)
- [Local Secondary Indexes](/en/ch7#id166)
- [Global Secondary Indexes](/en/ch7#id167)
- [Summary](/en/ch7#summary)
- [References](/en/ch7#references)
## [8. Transactions](/en/ch8)
- [What Exactly Is a Transaction?](/en/ch8#sec_transactions_overview)
- [The Meaning of ACID](/en/ch8#sec_transactions_acid)
- [Atomicity](/en/ch8#sec_transactions_acid_atomicity)
- [Consistency](/en/ch8#sec_transactions_acid_consistency)
- [Isolation](/en/ch8#sec_transactions_acid_isolation)
- [Durability](/en/ch8#durability)
- [Single-Object and Multi-Object Operations](/en/ch8#sec_transactions_multi_object)
- [Single-object writes](/en/ch8#sec_transactions_single_object)
- [The need for multi-object transactions](/en/ch8#sec_transactions_need)
- [Handling errors and aborts](/en/ch8#handling-errors-and-aborts)
- [Weak Isolation Levels](/en/ch8#sec_transactions_isolation_levels)
- [Read Committed](/en/ch8#sec_transactions_read_committed)
- [No dirty reads](/en/ch8#no-dirty-reads)
- [No dirty writes](/en/ch8#sec_transactions_dirty_write)
- [Implementing read committed](/en/ch8#sec_transactions_read_committed_impl)
- [Snapshot Isolation and Repeatable Read](/en/ch8#sec_transactions_snapshot_isolation)
- [Multi-version concurrency control (MVCC)](/en/ch8#sec_transactions_snapshot_impl)
- [Visibility rules for observing a consistent snapshot](/en/ch8#sec_transactions_mvcc_visibility)
- [Indexes and snapshot isolation](/en/ch8#indexes-and-snapshot-isolation)
- [Snapshot isolation, repeatable read, and naming confusion](/en/ch8#snapshot-isolation-repeatable-read-and-naming-confusion)
- [Preventing Lost Updates](/en/ch8#sec_transactions_lost_update)
- [Atomic write operations](/en/ch8#atomic-write-operations)
- [Explicit locking](/en/ch8#explicit-locking)
- [Automatically detecting lost updates](/en/ch8#automatically-detecting-lost-updates)
- [Conditional writes (compare-and-set)](/en/ch8#sec_transactions_compare_and_set)
- [Conflict resolution and replication](/en/ch8#conflict-resolution-and-replication)
- [Write Skew and Phantoms](/en/ch8#sec_transactions_write_skew)
- [Characterizing write skew](/en/ch8#characterizing-write-skew)
- [More examples of write skew](/en/ch8#more-examples-of-write-skew)
- [Phantoms causing write skew](/en/ch8#sec_transactions_phantom)
- [Materializing conflicts](/en/ch8#materializing-conflicts)
- [Serializability](/en/ch8#sec_transactions_serializability)
- [Actual Serial Execution](/en/ch8#sec_transactions_serial)
- [Encapsulating transactions in stored procedures](/en/ch8#encapsulating-transactions-in-stored-procedures)
- [Pros and cons of stored procedures](/en/ch8#sec_transactions_stored_proc_tradeoffs)
- [Sharding](/en/ch8#sharding)
- [Summary of serial execution](/en/ch8#summary-of-serial-execution)
- [Two-Phase Locking (2PL)](/en/ch8#sec_transactions_2pl)
- [Implementation of two-phase locking](/en/ch8#implementation-of-two-phase-locking)
- [Performance of two-phase locking](/en/ch8#performance-of-two-phase-locking)
- [Predicate locks](/en/ch8#predicate-locks)
- [Index-range locks](/en/ch8#sec_transactions_2pl_range)
- [Serializable Snapshot Isolation (SSI)](/en/ch8#sec_transactions_ssi)
- [Pessimistic versus optimistic concurrency control](/en/ch8#pessimistic-versus-optimistic-concurrency-control)
- [Decisions based on an outdated premise](/en/ch8#decisions-based-on-an-outdated-premise)
- [Detecting stale MVCC reads](/en/ch8#detecting-stale-mvcc-reads)
- [Detecting writes that affect prior reads](/en/ch8#sec_detecting_writes_affect_reads)
- [Performance of serializable snapshot isolation](/en/ch8#performance-of-serializable-snapshot-isolation)
- [Distributed Transactions](/en/ch8#sec_transactions_distributed)
- [Two-Phase Commit (2PC)](/en/ch8#sec_transactions_2pc)
- [A system of promises](/en/ch8#a-system-of-promises)
- [Coordinator failure](/en/ch8#coordinator-failure)
- [Three-phase commit](/en/ch8#three-phase-commit)
- [Distributed Transactions Across Different Systems](/en/ch8#sec_transactions_xa)
- [Exactly-once message processing](/en/ch8#sec_transactions_exactly_once)
- [XA transactions](/en/ch8#xa-transactions)
- [Holding locks while in doubt](/en/ch8#holding-locks-while-in-doubt)
- [Recovering from coordinator failure](/en/ch8#recovering-from-coordinator-failure)
- [Problems with XA transactions](/en/ch8#problems-with-xa-transactions)
- [Database-internal Distributed Transactions](/en/ch8#sec_transactions_internal)
- [Exactly-once message processing revisited](/en/ch8#exactly-once-message-processing-revisited)
- [Summary](/en/ch8#summary)
- [References](/en/ch8#references)
## [9. The Trouble with Distributed Systems](/en/ch9)
- [Faults and Partial Failures](/en/ch9#sec_distributed_partial_failure)
- [Unreliable Networks](/en/ch9#sec_distributed_networks)
- [The Limitations of TCP](/en/ch9#sec_distributed_tcp)
- [Network Faults in Practice](/en/ch9#sec_distributed_network_faults)
- [Detecting Faults](/en/ch9#id307)
- [Timeouts and Unbounded Delays](/en/ch9#sec_distributed_queueing)
- [Network congestion and queueing](/en/ch9#network-congestion-and-queueing)
- [Synchronous Versus Asynchronous Networks](/en/ch9#sec_distributed_sync_networks)
- [Can we not simply make network delays predictable?](/en/ch9#can-we-not-simply-make-network-delays-predictable)
- [Unreliable Clocks](/en/ch9#sec_distributed_clocks)
- [Monotonic Versus Time-of-Day Clocks](/en/ch9#sec_distributed_monotonic_timeofday)
- [Time-of-day clocks](/en/ch9#time-of-day-clocks)
- [Monotonic clocks](/en/ch9#monotonic-clocks)
- [Clock Synchronization and Accuracy](/en/ch9#sec_distributed_clock_accuracy)
- [Relying on Synchronized Clocks](/en/ch9#sec_distributed_clocks_relying)
- [Timestamps for ordering events](/en/ch9#sec_distributed_lww)
- [Clock readings with a confidence interval](/en/ch9#clock-readings-with-a-confidence-interval)
- [Synchronized clocks for global snapshots](/en/ch9#sec_distributed_spanner)
- [Process Pauses](/en/ch9#sec_distributed_clocks_pauses)
- [Response time guarantees](/en/ch9#sec_distributed_clocks_realtime)
- [Limiting the impact of garbage collection](/en/ch9#sec_distributed_gc_impact)
- [Knowledge, Truth, and Lies](/en/ch9#sec_distributed_truth)
- [The Majority Rules](/en/ch9#sec_distributed_majority)
- [Distributed Locks and Leases](/en/ch9#sec_distributed_lock_fencing)
- [Fencing off zombies and delayed requests](/en/ch9#sec_distributed_fencing_tokens)
- [Fencing with multiple replicas](/en/ch9#fencing-with-multiple-replicas)
- [Byzantine Faults](/en/ch9#sec_distributed_byzantine)
- [Weak forms of lying](/en/ch9#weak-forms-of-lying)
- [System Model and Reality](/en/ch9#sec_distributed_system_model)
- [Defining the correctness of an algorithm](/en/ch9#defining-the-correctness-of-an-algorithm)
- [Safety and liveness](/en/ch9#sec_distributed_safety_liveness)
- [Mapping system models to the real world](/en/ch9#mapping-system-models-to-the-real-world)
- [Formal Methods and Randomized Testing](/en/ch9#sec_distributed_formal)
- [Model checking and specification languages](/en/ch9#model-checking-and-specification-languages)
- [Fault injection](/en/ch9#sec_fault_injection)
- [Deterministic simulation testing](/en/ch9#deterministic-simulation-testing)
- [Summary](/en/ch9#summary)
- [References](/en/ch9#references)
## [10. Consistency and Consensus](/en/ch10)
- [Linearizability](/en/ch10#sec_consistency_linearizability)
- [What Makes a System Linearizable?](/en/ch10#sec_consistency_lin_definition)
- [Relying on Linearizability](/en/ch10#sec_consistency_linearizability_usage)
- [Locking and leader election](/en/ch10#locking-and-leader-election)
- [Constraints and uniqueness guarantees](/en/ch10#sec_consistency_uniqueness)
- [Cross-channel timing dependencies](/en/ch10#cross-channel-timing-dependencies)
- [Implementing Linearizable Systems](/en/ch10#sec_consistency_implementing_linearizable)
- [Linearizability and quorums](/en/ch10#sec_consistency_quorum_linearizable)
- [The Cost of Linearizability](/en/ch10#sec_linearizability_cost)
- [The CAP theorem](/en/ch10#the-cap-theorem)
- [Linearizability and network delays](/en/ch10#linearizability-and-network-delays)
- [ID Generators and Logical Clocks](/en/ch10#sec_consistency_logical)
- [Logical Clocks](/en/ch10#sec_consistency_timestamps)
- [Lamport timestamps](/en/ch10#lamport-timestamps)
- [Hybrid logical clocks](/en/ch10#hybrid-logical-clocks)
- [Lamport/hybrid logical clocks vs. vector clocks](/en/ch10#lamporthybrid-logical-clocks-vs-vector-clocks)
- [Linearizable ID Generators](/en/ch10#sec_consistency_linearizable_id)
- [Implementing a linearizable ID generator](/en/ch10#implementing-a-linearizable-id-generator)
- [Enforcing constraints using logical clocks](/en/ch10#enforcing-constraints-using-logical-clocks)
- [Consensus](/en/ch10#sec_consistency_consensus)
- [The Many Faces of Consensus](/en/ch10#sec_consistency_faces)
- [Single-value consensus](/en/ch10#single-value-consensus)
- [Compare-and-set as consensus](/en/ch10#compare-and-set-as-consensus)
- [Shared logs as consensus](/en/ch10#sec_consistency_shared_logs)
- [Fetch-and-add as consensus](/en/ch10#fetch-and-add-as-consensus)
- [Atomic commitment as consensus](/en/ch10#atomic-commitment-as-consensus)
- [Consensus in Practice](/en/ch10#sec_consistency_total_order)
- [Using shared logs](/en/ch10#sec_consistency_smr)
- [From single-leader replication to consensus](/en/ch10#from-single-leader-replication-to-consensus)
- [Subtleties of consensus](/en/ch10#subtleties-of-consensus)
- [Pros and cons of consensus](/en/ch10#pros-and-cons-of-consensus)
- [Coordination Services](/en/ch10#sec_consistency_coordination)
- [Allocating work to nodes](/en/ch10#allocating-work-to-nodes)
- [Service discovery](/en/ch10#service-discovery)
- [Summary](/en/ch10#summary)
- [References](/en/ch10#references)
## [11. Batch Processing](/en/ch11)
- [……](/en/ch11#)
- [Summary](/en/ch11#id292)
- [References](/en/ch11#references)
## [12. Stream Processing](/en/ch12)
- [……](/en/ch12#)
- [Summary](/en/ch12#id332)
- [References](/en/ch12#references)
## [13. A Philosophy of Streaming Systems](/en/ch13)
- [……](/en/ch13#)
- [Summary](/en/ch13#id367)
- [References](/en/ch13#references)
## [14. Doing the Right Thing](/en/ch14)
- [……](/en/ch14#)
- [Summary](/en/ch14#id594)
- [References](/en/ch14#references)
## [Glossary](/en/glossary)
## [Colophon](/en/colophon)
- [About the Author](/en/colophon#about-the-author)
- [Colophon](/en/colophon#colophon)
================================================
FILE: content/tw/_index.md
================================================
---
title: Designing Data-Intensive Applications (2nd Edition)
linkTitle: DDIA
cascade:
type: docs
breadcrumbs: false
---
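**Author**: [Martin Kleppmann](https://martin.kleppmann.com), [*Designing Data-Intensive Applications, 2nd Edition*](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html): distributed systems researcher at the University of Cambridge, UK; speaker, blogger, and open source contributor; software engineer and entrepreneur who previously worked on data infrastructure at LinkedIn and Rapportive.
**Translator**: [**Feng Ruohang (馮若航)**](https://vonng.com), online handle [@Vonng](https://github.com/Vonng).
PostgreSQL expert, database veteran, and cloud computing maverick.
Author and founder of [**Pigsty**](https://pgsty.com).
Architect, DBA, and full-stack engineer @ TanTan, Alibaba, Apple.
Independent open source contributor, [GitStar Ranking 600](https://gitstar-ranking.com/Vonng), [Top 20 most active in China](https://committers.top/china).
Translator of the Chinese editions of [DDIA](https://ddia.pigsty.io) / [PG Internal](https://pgint.vonng.com); WeChat official account: 《老馮雲數》; database KOL.
**Proofreading**: [@yingang](https://github.com/yingang) | [Traditional Chinese](/tw) **edition maintained** by [@afunTW](https://github.com/afunTW) | [Full contributor list](/contrib)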
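> [!NOTE]
> The translation of **DDIA, 2nd Edition** is in progress (on the [`main`](https://github.com/Vonng/ddia/tree/main) branch). You are welcome to join in and share your feedback.
> [!TIP] Note for Early Release Readers
> With Early Release ebooks, you get the author's raw and unedited content as it is being written, so you can take advantage of these technologies long before the official release.
> If you would like to take an active part in reviewing and commenting on this draft, please get in touch on GitHub. The GitHub repository for this book is [ept/ddia2-feedback](https://github.com/ept/ddia2-feedback), and the repository for the Chinese translation is [Vonng/ddia](https://github.com/Vonng/ddia).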
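## Translator's Preface
> A full-stack engineer who does not understand databases is not a good architect. - Feng Ruohang / Vonng
Today, especially on the internet, most applications are data-intensive. This book walks through the essentials of data system design, from low-level data structures to top-level architecture. The hard-won experience in it is valuable to architects, DBAs, and backend engineers, and even to product managers.
This is a book that combines theory with practice. The translator has run into many of the problems it discusses in real-world settings, and reading about them is both thrilling and painful. Had I read this book sooner, how many detours I could have avoided!
It is also a book that makes deep material accessible: it explains where concepts come from rather than showing off definitions, and it traces how things evolved rather than piling up facts. Complex concepts are explained plainly, yet the explanations get to the essence without losing depth. The references at the end of each chapter are of excellent quality and make a superb index for studying each topic further.
The book provides a good conceptual framework for designing, implementing, and evaluating data systems. After reading and understanding it, readers can easily see through most technical hype and hold their own in arguments with self-styled experts.
This is the best technical book the translator read in 2017, and it would be a real pity for such a good book to have no Chinese translation. Modest as my abilities are, I would like to contribute to spreading advanced technical culture. It is a chance to study interesting technical topics in depth while also sharpening my writing in both Chinese and English, so why not?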
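## Foreword
> Technology is a powerful force in our society. Data, software, and communication can be used for bad: to entrench unfair class divisions, to undermine civil rights, and to protect vested interests. But they can also be used for good: to let people at the bottom of society make their voices heard, to give everyone opportunities, and to avert disasters. This book is dedicated to everyone who puts technology to good use.
> Computing is pop culture, and pop culture holds history in contempt. Pop culture is about individual identity and a feeling of participation, but it has nothing to do with cooperation. Pop culture lives in the present and has nothing to do with the past or the future either. I think most people who write code (for money) are like this: they have no idea where their culture came from.
>
> Alan Kay, in an interview with *Dr. Dobb's Journal* (2012)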
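## Table of Contents
### [Preface](/tw/preface)
### [Part I: Foundations of Data Systems](/tw/part-i)
- [1. Trade-offs in Data Systems Architecture](/tw/ch1)
- [2. Defining Nonfunctional Requirements](/tw/ch2)
- [3. Data Models and Query Languages](/tw/ch3)
- [4. Storage and Retrieval](/tw/ch4)
- [5. Encoding and Evolution](/tw/ch5)
### [Part II: Distributed Data](/tw/part-ii)
- [6. Replication](/tw/ch6)
- [7. Sharding](/tw/ch7)
- [8. Transactions](/tw/ch8)
- [9. The Trouble with Distributed Systems](/tw/ch9)
- [10. Consistency and Consensus](/tw/ch10)
### [Part III: Derived Data](/tw/part-iii)
- [11. Batch Processing](/tw/ch11)
- [12. Stream Processing](/tw/ch12)
- [13. A Philosophy of Streaming Systems](/tw/ch13)
- [14. Doing the Right Thing](/tw/ch14)
- [Glossary](/tw/glossary)
- [Index](/tw/indexes)
- [Colophon](/tw/colophon)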
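## Legal Notice
According to the original author, a Simplified Chinese translation was already planned, to be completed by the end of 2018. [Where to buy](https://search.jd.com/Search?keyword=設計資料密集型應用)
The translator translated this book purely for **learning purposes** and out of **personal interest**, and seeks no financial gain.
The translator retains the right of attribution for this version of the translation; all other rights are subject to the claims of the original author and the publisher.
This translation is for study and research reference only. It must not be publicly distributed or used commercially. If you are able to read books in English, please buy the official edition to show your support.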
## Contributions
0. Full-text proofreading by [@yingang](https://github.com/Vonng/ddia/commits?author=yingang)
1. [Corrections to the first-draft translation of the preface](https://github.com/Vonng/ddia/commit/afb5edab55c62ed23474149f229677e3b42dfc2c) by [@seagullbird](https://github.com/Vonng/ddia/commits?author=seagullbird)
2. [Grammar and punctuation corrections to Chapter 1](https://github.com/Vonng/ddia/commit/973b12cd8f8fcdf4852f1eb1649ddd9d187e3644) by [@nevertiree](https://github.com/Vonng/ddia/commits?author=nevertiree)
3. [Partial corrections to Chapter 6](https://github.com/Vonng/ddia/commit/d4eb0852c0ec1e93c8aacc496c80b915bb1e6d48) and [first-draft translation of Chapter 10](https://github.com/Vonng/ddia/commit/9de8dbd1bfe6fbb03b3bf6c1a1aa2291aed2490e) by [@MuAlex](https://github.com/Vonng/ddia/commits?author=MuAlex)
4. Introduction to [Part I](/tw/part-i) and corrections to [ch2](/tw/ch2) by [@jiajiadebug](https://github.com/Vonng/ddia/commits?author=jiajiadebug)
5. [Glossary](/tw/glossary) and the part of the [colophon](/tw/colophon) about the wild boar by [@Chowss](https://github.com/Vonng/ddia/commits?author=Chowss)
6. [Traditional Chinese](https://github.com/Vonng/ddia/pulls) edition and conversion script by [@afunTW](https://github.com/afunTW)
7. Numerous translation fixes by [@songzhibin97](https://github.com/Vonng/ddia/commits?author=songzhibin97) [@MamaShip](https://github.com/Vonng/ddia/commits?author=MamaShip) [@FangYuan33](https://github.com/Vonng/ddia/commits?author=FangYuan33)
8. [Thanks to everyone who has contributed and offered suggestions](/contrib):
Pull Requests & Issues
| ISSUE & Pull Requests | USER | Title |
|-------------------------------------------------|------------------------------------------------------------|----------------------------------------------------------------|
| [386](https://github.com/Vonng/ddia/pull/386) | [@uncle-lv](https://github.com/uncle-lv) | ch2: 最佳化一處翻譯 |
| [384](https://github.com/Vonng/ddia/pull/384) | [@PanggNOTlovebean](https://github.com/PanggNOTlovebean) | docs: 最佳化中文文件的措辭和表達 |
| [383](https://github.com/Vonng/ddia/pull/383) | [@PanggNOTlovebean](https://github.com/PanggNOTlovebean) | docs: 修正 ch4 中的術語和表達錯誤 |
| [382](https://github.com/Vonng/ddia/pull/382) | [@uncle-lv](https://github.com/uncle-lv) | ch1: 最佳化一處翻譯 |
| [381](https://github.com/Vonng/ddia/pull/381) | [@Max-Tortoise](https://github.com/Max-Tortoise) | ch4: 修正一處術語不完整問題 |
| [377](https://github.com/Vonng/ddia/pull/377) | [@huang06](https://github.com/huang06) | 最佳化翻譯術語 |
| [375](https://github.com/Vonng/ddia/issues/375) | [@z-soulx](https://github.com/z-soulx) | 對於是否100%全中文翻譯的必要性討論?個人-沒必要100%,特別是“名詞”,有原單詞更加適合it人員 |
| [371](https://github.com/Vonng/ddia/pull/371) | [@lewiszlw](https://github.com/lewiszlw) | CPU core -> CPU 核心 |
| [369](https://github.com/Vonng/ddia/pull/369) | [@bbwang-gl](https://github.com/bbwang-gl) | ch7: 可序列化快照隔離檢測一個事務何時修改另一個事務的讀取 |
| [368](https://github.com/Vonng/ddia/pull/368) | [@yhao3](https://github.com/yhao3) | 更新 zh-tw.py 與 zh-tw 內容 |
| [367](https://github.com/Vonng/ddia/pull/367) | [@yhao3](https://github.com/yhao3) | 修正拼寫、格式和標點問題 |
| [366](https://github.com/Vonng/ddia/pull/366) | [@yangshangde](https://github.com/yangshangde) | ch8: 將“電源失敗”改為“電源失效” |
| [365](https://github.com/Vonng/ddia/pull/365) | [@xyohn](https://github.com/xyohn) | ch1: 最佳化“儲存與計算分離”相關翻譯 |
| [364](https://github.com/Vonng/ddia/issues/364) | [@xyohn](https://github.com/xyohn) | ch1: 最佳化“儲存與計算分離”相關翻譯 |
| [363](https://github.com/Vonng/ddia/pull/363) | [@xyohn](https://github.com/xyohn) | #362: 最佳化一處翻譯 |
| [362](https://github.com/Vonng/ddia/issues/362) | [@xyohn](https://github.com/xyohn) | ch1: 最佳化一處翻譯 |
| [359](https://github.com/Vonng/ddia/pull/359) | [@c25423](https://github.com/c25423) | ch10: 修正一處拼寫錯誤 |
| [358](https://github.com/Vonng/ddia/pull/358) | [@lewiszlw](https://github.com/lewiszlw) | ch4: 修正一處拼寫錯誤 |
| [356](https://github.com/Vonng/ddia/pull/356) | [@lewiszlw](https://github.com/lewiszlw) | ch2: 修正一處標點錯誤 |
| [355](https://github.com/Vonng/ddia/pull/355) | [@DuroyGeorge](https://github.com/DuroyGeorge) | ch12: 修正一處格式錯誤 |
| [354](https://github.com/Vonng/ddia/pull/354) | [@justlorain](https://github.com/justlorain) | ch7: 修正一處參考連結 |
| [353](https://github.com/Vonng/ddia/pull/353) | [@fantasyczl](https://github.com/fantasyczl) | ch3&9: 修正兩處引用錯誤 |
| [352](https://github.com/Vonng/ddia/pull/352) | [@fantasyczl](https://github.com/fantasyczl) | 支援輸出為 EPUB 格式 |
| [349](https://github.com/Vonng/ddia/pull/349) | [@xiyihan0](https://github.com/xiyihan0) | ch1: 修正一處格式錯誤 |
| [348](https://github.com/Vonng/ddia/pull/348) | [@omegaatt36](https://github.com/omegaatt36) | ch3: 修正一處影像連結 |
| [346](https://github.com/Vonng/ddia/issues/346) | [@Vermouth1995](https://github.com/Vermouth1995) | ch1: 最佳化一處翻譯 |
| [343](https://github.com/Vonng/ddia/pull/343) | [@kehao-chen](https://github.com/kehao-chen) | ch10: 最佳化一處翻譯 |
| [341](https://github.com/Vonng/ddia/pull/341) | [@YKIsTheBest](https://github.com/YKIsTheBest) | ch3: 最佳化兩處翻譯 |
| [340](https://github.com/Vonng/ddia/pull/340) | [@YKIsTheBest](https://github.com/YKIsTheBest) | ch2: 最佳化多處翻譯 |
| [338](https://github.com/Vonng/ddia/pull/338) | [@YKIsTheBest](https://github.com/YKIsTheBest) | ch1: 最佳化一處翻譯 |
| [335](https://github.com/Vonng/ddia/pull/335) | [@kimi0230](https://github.com/kimi0230) | 修正一處繁體中文錯誤 |
| [334](https://github.com/Vonng/ddia/pull/334) | [@soulrrrrr](https://github.com/soulrrrrr) | ch2: 修正一處繁體中文錯誤 |
| [332](https://github.com/Vonng/ddia/pull/332) | [@justlorain](https://github.com/justlorain) | ch5: 修正一處翻譯錯誤 |
| [331](https://github.com/Vonng/ddia/pull/331) | [@Lyianu](https://github.com/Lyianu) | ch9: 更正幾處拼寫錯誤 |
| [330](https://github.com/Vonng/ddia/pull/330) | [@Lyianu](https://github.com/Lyianu) | ch7: 最佳化一處翻譯 |
| [329](https://github.com/Vonng/ddia/issues/329) | [@Lyianu](https://github.com/Lyianu) | ch6: 指出一處翻譯錯誤 |
| [328](https://github.com/Vonng/ddia/pull/328) | [@justlorain](https://github.com/justlorain) | ch4: 更正一處翻譯遺漏 |
| [326](https://github.com/Vonng/ddia/pull/326) | [@liangGTY](https://github.com/liangGTY) | ch1: 最佳化一處翻譯 |
| [323](https://github.com/Vonng/ddia/pull/323) | [@marvin263](https://github.com/marvin263) | ch5: 最佳化一處翻譯 |
| [322](https://github.com/Vonng/ddia/pull/322) | [@marvin263](https://github.com/marvin263) | ch8: 最佳化一處翻譯 |
| [304](https://github.com/Vonng/ddia/pull/304) | [@spike014](https://github.com/spike014) | ch11: 最佳化一處翻譯 |
| [298](https://github.com/Vonng/ddia/pull/298) | [@Makonike](https://github.com/Makonike) | ch11&12: 修正兩處錯誤 |
| [284](https://github.com/Vonng/ddia/pull/284) | [@WAangzE](https://github.com/WAangzE) | ch4: 更正一處列表錯誤 |
| [283](https://github.com/Vonng/ddia/pull/283) | [@WAangzE](https://github.com/WAangzE) | ch3: 更正一處錯別字 |
| [282](https://github.com/Vonng/ddia/pull/282) | [@WAangzE](https://github.com/WAangzE) | ch2: 更正一處公式問題 |
| [281](https://github.com/Vonng/ddia/pull/281) | [@lyuxi99](https://github.com/lyuxi99) | 更正多處內部連結錯誤 |
| [280](https://github.com/Vonng/ddia/pull/280) | [@lyuxi99](https://github.com/lyuxi99) | ch9: 更正內部連結錯誤 |
| [279](https://github.com/Vonng/ddia/issues/279) | [@codexvn](https://github.com/codexvn) | ch9: 指出公式在 GitHub Pages 顯示的問題 |
| [278](https://github.com/Vonng/ddia/pull/278) | [@LJlkdskdjflsa](https://github.com/LJlkdskdjflsa) | 發現了繁體中文版本中的錯誤翻譯 |
| [275](https://github.com/Vonng/ddia/pull/275) | [@117503445](https://github.com/117503445) | 更正 LICENSE 連結 |
| [274](https://github.com/Vonng/ddia/pull/274) | [@uncle-lv](https://github.com/uncle-lv) | ch7: 修正錯別字 |
| [273](https://github.com/Vonng/ddia/pull/273) | [@Sdot-Python](https://github.com/Sdot-Python) | ch7: 統一了 write skew 的翻譯 |
| [271](https://github.com/Vonng/ddia/pull/271) | [@Makonike](https://github.com/Makonike) | ch6: 統一了 rebalancing 的翻譯 |
| [270](https://github.com/Vonng/ddia/pull/270) | [@Ynjxsjmh](https://github.com/Ynjxsjmh) | ch7: 修正不一致的翻譯 |
| [263](https://github.com/Vonng/ddia/pull/263) | [@zydmayday](https://github.com/zydmayday) | ch5: 修正譯文中的重複單詞 |
| [260](https://github.com/Vonng/ddia/pull/260) | [@haifeiWu](https://github.com/haifeiWu) | ch4: 修正部分不準確的翻譯 |
| [258](https://github.com/Vonng/ddia/pull/258) | [@bestgrc](https://github.com/bestgrc) | ch3: 修正一處翻譯錯誤 |
| [257](https://github.com/Vonng/ddia/pull/257) | [@UnderSam](https://github.com/UnderSam) | ch8: 修正一處拼寫錯誤 |
| [256](https://github.com/Vonng/ddia/pull/256) | [@AlphaWang](https://github.com/AlphaWang) | ch7: 修正“可序列化”相關內容的多處翻譯不當 |
| [255](https://github.com/Vonng/ddia/pull/255) | [@AlphaWang](https://github.com/AlphaWang) | ch7: 修正“可重複讀”相關內容的多處翻譯不當 |
| [253](https://github.com/Vonng/ddia/pull/253) | [@AlphaWang](https://github.com/AlphaWang) | ch7: 修正“讀已提交”相關內容的多處翻譯不當 |
| [246](https://github.com/Vonng/ddia/pull/246) | [@derekwu0101](https://github.com/derekwu0101) | ch3: 修正繁體中文的轉譯錯誤 |
| [245](https://github.com/Vonng/ddia/pull/245) | [@skyran1278](https://github.com/skyran1278) | ch12: 修正繁體中文的轉譯錯誤 |
| [244](https://github.com/Vonng/ddia/pull/244) | [@Axlgrep](https://github.com/Axlgrep) | ch9: 修正不通順的翻譯 |
| [242](https://github.com/Vonng/ddia/pull/242) | [@lynkeib](https://github.com/lynkeib) | ch9: 修正不通順的翻譯 |
| [241](https://github.com/Vonng/ddia/pull/241) | [@lynkeib](https://github.com/lynkeib) | ch8: 修正不正確的公式格式 |
| [240](https://github.com/Vonng/ddia/pull/240) | [@8da2k](https://github.com/8da2k) | ch9: 修正不通順的翻譯 |
| [239](https://github.com/Vonng/ddia/pull/239) | [@BeBraveBeCurious](https://github.com/BeBraveBeCurious) | ch7: 修正不一致的翻譯 |
| [237](https://github.com/Vonng/ddia/pull/237) | [@zhangnew](https://github.com/zhangnew) | ch3: 修正錯誤的圖片連結 |
| [229](https://github.com/Vonng/ddia/pull/229) | [@lis186](https://github.com/lis186) | 指出繁體中文的轉譯錯誤:複雜 |
| [226](https://github.com/Vonng/ddia/pull/226) | [@chroming](https://github.com/chroming) | ch1: 修正導航欄中的章節名稱 |
| [220](https://github.com/Vonng/ddia/pull/220) | [@skyran1278](https://github.com/skyran1278) | ch9: 修正線性一致的繁體中文翻譯 |
| [194](https://github.com/Vonng/ddia/pull/194) | [@BeBraveBeCurious](https://github.com/BeBraveBeCurious) | ch4: 修正錯誤的翻譯 |
| [193](https://github.com/Vonng/ddia/pull/193) | [@BeBraveBeCurious](https://github.com/BeBraveBeCurious) | ch4: 最佳化譯文 |
| [192](https://github.com/Vonng/ddia/pull/192) | [@BeBraveBeCurious](https://github.com/BeBraveBeCurious) | ch4: 修正不一致和不通順的翻譯 |
| [190](https://github.com/Vonng/ddia/pull/190) | [@Pcrab](https://github.com/Pcrab) | ch1: 修正不準確的翻譯 |
| [187](https://github.com/Vonng/ddia/pull/187) | [@narojay](https://github.com/narojay) | ch9: 修正生硬的翻譯 |
| [186](https://github.com/Vonng/ddia/pull/186) | [@narojay](https://github.com/narojay) | ch8: 修正錯別字 |
| [185](https://github.com/Vonng/ddia/issues/185) | [@8da2k](https://github.com/8da2k) | 指出小標題跳轉的問題 |
| [184](https://github.com/Vonng/ddia/pull/184) | [@DavidZhiXing](https://github.com/DavidZhiXing) | ch10: 修正失效的網址 |
| [183](https://github.com/Vonng/ddia/pull/183) | [@OneSizeFitsQuorum](https://github.com/OneSizeFitsQuorum) | ch8: 修正錯別字 |
| [182](https://github.com/Vonng/ddia/issues/182) | [@lroolle](https://github.com/lroolle) | 建議docsify的主題風格 |
| [181](https://github.com/Vonng/ddia/pull/181) | [@YunfengGao](https://github.com/YunfengGao) | ch2: 修正翻譯錯誤 |
| [180](https://github.com/Vonng/ddia/pull/180) | [@skyran1278](https://github.com/skyran1278) | ch3: 指出繁體中文的轉譯錯誤 |
| [177](https://github.com/Vonng/ddia/pull/177) | [@exzhawk](https://github.com/exzhawk) | 支援 Github Pages 裡的公式顯示 |
| [176](https://github.com/Vonng/ddia/pull/176) | [@haifeiWu](https://github.com/haifeiWu) | ch2: 語義網相關翻譯更正 |
| [175](https://github.com/Vonng/ddia/pull/175) | [@cwr31](https://github.com/cwr31) | ch7: 不變式相關翻譯更正 |
| [174](https://github.com/Vonng/ddia/pull/174) | [@BeBraveBeCurious](https://github.com/BeBraveBeCurious) | README & preface: 更正不正確的中文用詞和標點符號 |
| [173](https://github.com/Vonng/ddia/pull/173) | [@ZvanYang](https://github.com/ZvanYang) | ch12: 修正不完整的翻譯 |
| [171](https://github.com/Vonng/ddia/pull/171) | [@ZvanYang](https://github.com/ZvanYang) | ch12: 修正重複的譯文 |
| [169](https://github.com/Vonng/ddia/pull/169) | [@ZvanYang](https://github.com/ZvanYang) | ch12: 更正不太通順的翻譯 |
| [166](https://github.com/Vonng/ddia/pull/166) | [@bp4m4h94](https://github.com/bp4m4h94) | ch1: 發現錯誤的文獻索引 |
| [164](https://github.com/Vonng/ddia/pull/164) | [@DragonDriver](https://github.com/DragonDriver) | preface: 更正錯誤的標點符號 |
| [163](https://github.com/Vonng/ddia/pull/163) | [@llmmddCoder](https://github.com/llmmddCoder) | ch1: 更正錯誤字 |
| [160](https://github.com/Vonng/ddia/pull/160) | [@Zhayhp](https://github.com/Zhayhp) | ch2: 建議將 network model 翻譯為網狀模型 |
| [159](https://github.com/Vonng/ddia/pull/159) | [@1ess](https://github.com/1ess) | ch4: 更正錯誤字 |
| [157](https://github.com/Vonng/ddia/pull/157) | [@ZvanYang](https://github.com/ZvanYang) | ch7: 更正不太通順的翻譯 |
| [155](https://github.com/Vonng/ddia/pull/155) | [@ZvanYang](https://github.com/ZvanYang) | ch7: 更正不太通順的翻譯 |
| [153](https://github.com/Vonng/ddia/pull/153) | [@DavidZhiXing](https://github.com/DavidZhiXing) | ch9: 修正縮圖的錯別字 |
| [152](https://github.com/Vonng/ddia/pull/152) | [@ZvanYang](https://github.com/ZvanYang) | ch7: 除重->去重 |
| [151](https://github.com/Vonng/ddia/pull/151) | [@ZvanYang](https://github.com/ZvanYang) | ch5: 修訂sibling相關的翻譯 |
| [147](https://github.com/Vonng/ddia/pull/147) | [@ZvanYang](https://github.com/ZvanYang) | ch5: 更正一處不準確的翻譯 |
| [145](https://github.com/Vonng/ddia/pull/145) | [@Hookey](https://github.com/Hookey) | 識別了當前簡繁轉譯過程中處理不當的地方,暫透過轉換指令碼規避 |
| [144](https://github.com/Vonng/ddia/issues/144) | [@secret4233](https://github.com/secret4233) | ch7: 不翻譯`next-key locking` |
| [143](https://github.com/Vonng/ddia/issues/143) | [@imcheney](https://github.com/imcheney) | ch3: 更新殘留的機翻段落 |
| [142](https://github.com/Vonng/ddia/issues/142) | [@XIJINIAN](https://github.com/XIJINIAN) | 建議去除段首的製表符 |
| [141](https://github.com/Vonng/ddia/issues/141) | [@Flyraty](https://github.com/Flyraty) | ch5: 發現一處錯誤格式的章節引用 |
| [140](https://github.com/Vonng/ddia/pull/140) | [@Bowser1704](https://github.com/Bowser1704) | ch5: 修正章節Summary中多處不通順的翻譯 |
| [139](https://github.com/Vonng/ddia/pull/139) | [@Bowser1704](https://github.com/Bowser1704) | ch2&ch3: 修正多處不通順的或錯誤的翻譯 |
| [137](https://github.com/Vonng/ddia/pull/137) | [@fuxuemingzhu](https://github.com/fuxuemingzhu) | ch5&ch6: 最佳化多處不通順的或錯誤的翻譯 |
| [134](https://github.com/Vonng/ddia/pull/134) | [@fuxuemingzhu](https://github.com/fuxuemingzhu) | ch4: 最佳化多處不通順的或錯誤的翻譯 |
| [133](https://github.com/Vonng/ddia/pull/133) | [@fuxuemingzhu](https://github.com/fuxuemingzhu) | ch3: 最佳化多處錯誤的或不通順的翻譯 |
| [132](https://github.com/Vonng/ddia/pull/132) | [@fuxuemingzhu](https://github.com/fuxuemingzhu) | ch3: 最佳化一處容易產生歧義的翻譯 |
| [131](https://github.com/Vonng/ddia/pull/131) | [@rwwg4](https://github.com/rwwg4) | ch6: 修正兩處錯誤的翻譯 |
| [129](https://github.com/Vonng/ddia/pull/129) | [@anaer](https://github.com/anaer) | ch4: 修正兩處強調文字和四處程式碼變數名稱 |
| [128](https://github.com/Vonng/ddia/pull/128) | [@meilin96](https://github.com/meilin96) | ch5: 修正一處錯誤的引用 |
| [126](https://github.com/Vonng/ddia/pull/126) | [@cwr31](https://github.com/cwr31) | ch10: 修正一處錯誤的翻譯(功能 -> 函式) |
| [125](https://github.com/Vonng/ddia/pull/125) | [@dch1228](https://github.com/dch1228) | ch2: 最佳化 how best 的翻譯(如何以最佳方式) |
| [123](https://github.com/Vonng/ddia/pull/123) | [@yingang](https://github.com/yingang) | translation updates (chapter 9, TOC in readme, glossary, etc.) |
| [121](https://github.com/Vonng/ddia/pull/121) | [@yingang](https://github.com/yingang) | translation updates (chapter 5 to chapter 8) |
| [120](https://github.com/Vonng/ddia/pull/120) | [@jiong-han](https://github.com/jiong-han) | Typo fix: 呲之以鼻 -> 嗤之以鼻 |
| [119](https://github.com/Vonng/ddia/pull/119) | [@cclauss](https://github.com/cclauss) | Streamline file operations in convert() |
| [118](https://github.com/Vonng/ddia/pull/118) | [@yingang](https://github.com/yingang) | translation updates (chapter 2 to chapter 4) |
| [117](https://github.com/Vonng/ddia/pull/117) | [@feeeei](https://github.com/feeeei) | 統一每章的標題格式 |
| [115](https://github.com/Vonng/ddia/pull/115) | [@NageNalock](https://github.com/NageNalock) | 第七章病句修改: 重複詞語 |
| [114](https://github.com/Vonng/ddia/pull/114) | [@Sunt-ing](https://github.com/Sunt-ing) | Update README.md: correct the book name |
| [113](https://github.com/Vonng/ddia/pull/113) | [@lpxxn](https://github.com/lpxxn) | 修改語句 |
| [112](https://github.com/Vonng/ddia/pull/112) | [@ibyte2011](https://github.com/ibyte2011) | Update ch9.md |
| [110](https://github.com/Vonng/ddia/pull/110) | [@lpxxn](https://github.com/lpxxn) | 讀已寫入資料 |
| [107](https://github.com/Vonng/ddia/pull/107) | [@abbychau](https://github.com/abbychau) | 單調鐘和好死還是賴活著 |
| [106](https://github.com/Vonng/ddia/pull/106) | [@enochii](https://github.com/enochii) | typo in ch2: fix braces typo |
| [105](https://github.com/Vonng/ddia/pull/105) | [@LiminCode](https://github.com/LiminCode) | Chronicle translation error |
| [104](https://github.com/Vonng/ddia/pull/104) | [@Sunt-ing](https://github.com/Sunt-ing) | several advice for better translation |
| [103](https://github.com/Vonng/ddia/pull/103) | [@Sunt-ing](https://github.com/Sunt-ing) | typo in ch4: should be 完成 rather than 完全 |
| [102](https://github.com/Vonng/ddia/pull/102) | [@Sunt-ing](https://github.com/Sunt-ing) | ch4: better-translation: 扼殺 → 破壞 |
| [101](https://github.com/Vonng/ddia/pull/101) | [@Sunt-ing](https://github.com/Sunt-ing) | typo in Ch4: should be "改變" rathr than "蓋面" |
| [100](https://github.com/Vonng/ddia/pull/100) | [@LiminCode](https://github.com/LiminCode) | fix missing translation |
| [99 ](https://github.com/Vonng/ddia/pull/99) | [@mrdrivingduck](https://github.com/mrdrivingduck) | ch6: fix the word rebalancing |
| [98 ](https://github.com/Vonng/ddia/pull/98) | [@jacklightChen](https://github.com/jacklightChen) | fix ch7.md: fix wrong references |
| [97 ](https://github.com/Vonng/ddia/pull/97) | [@jenac](https://github.com/jenac) | 96 |
| [96 ](https://github.com/Vonng/ddia/pull/96) | [@PragmaTwice](https://github.com/PragmaTwice) | ch2: fix typo about 'may or may not be' |
| [95 ](https://github.com/Vonng/ddia/pull/95) | [@EvanMu96](https://github.com/EvanMu96) | fix translation of "the battle cry" in ch5 |
| [94 ](https://github.com/Vonng/ddia/pull/94) | [@kemingy](https://github.com/kemingy) | ch6: fix markdown and punctuations |
| [93 ](https://github.com/Vonng/ddia/pull/93) | [@kemingy](https://github.com/kemingy) | ch5: fix markdown and some typos |
| [92 ](https://github.com/Vonng/ddia/pull/92) | [@Gilbert1024](https://github.com/Gilbert1024) | Merge pull request #1 from Vonng/master |
| [88 ](https://github.com/Vonng/ddia/pull/88) | [@kemingy](https://github.com/kemingy) | fix typo for ch1, ch2, ch3, ch4 |
| [87 ](https://github.com/Vonng/ddia/pull/87) | [@wynn5a](https://github.com/wynn5a) | Update ch3.md |
| [86 ](https://github.com/Vonng/ddia/pull/86) | [@northmorn](https://github.com/northmorn) | Update ch1.md |
| [85 ](https://github.com/Vonng/ddia/pull/85) | [@sunbuhui](https://github.com/sunbuhui) | fix ch2.md: fix ch2 ambiguous translation |
| [84 ](https://github.com/Vonng/ddia/pull/84) | [@ganler](https://github.com/ganler) | Fix translation: use up |
| [83 ](https://github.com/Vonng/ddia/pull/83) | [@afunTW](https://github.com/afunTW) | Using OpenCC to convert from zh-cn to zh-tw |
| [82 ](https://github.com/Vonng/ddia/pull/82) | [@kangni](https://github.com/kangni) | fix gitbook url |
| [78 ](https://github.com/Vonng/ddia/pull/78) | [@hanyu2](https://github.com/hanyu2) | Fix unappropriated translation |
| [77 ](https://github.com/Vonng/ddia/pull/77) | [@Ozarklake](https://github.com/Ozarklake) | fix typo |
| [75 ](https://github.com/Vonng/ddia/pull/75) | [@2997ms](https://github.com/2997ms) | Fix typo |
| [74 ](https://github.com/Vonng/ddia/pull/74) | [@2997ms](https://github.com/2997ms) | Update ch9.md |
| [70 ](https://github.com/Vonng/ddia/pull/70) | [@2997ms](https://github.com/2997ms) | Update ch7.md |
| [67 ](https://github.com/Vonng/ddia/pull/67) | [@jiajiadebug](https://github.com/jiajiadebug) | fix issues in ch2 - ch9 and glossary |
| [66 ](https://github.com/Vonng/ddia/pull/66) | [@blindpirate](https://github.com/blindpirate) | Fix typo |
| [63 ](https://github.com/Vonng/ddia/pull/63) | [@haifeiWu](https://github.com/haifeiWu) | Update ch10.md |
| [62 ](https://github.com/Vonng/ddia/pull/62) | [@ych](https://github.com/ych) | fix ch1.md typesetting problem |
| [61 ](https://github.com/Vonng/ddia/pull/61) | [@xianlaioy](https://github.com/xianlaioy) | docs:鍾-->種,去掉ou |
| [60 ](https://github.com/Vonng/ddia/pull/60) | [@Zombo1296](https://github.com/Zombo1296) | 否則 -> 或者 |
| [59 ](https://github.com/Vonng/ddia/pull/59) | [@AlexanderMisel](https://github.com/AlexanderMisel) | 呼叫->呼叫,顯著->顯著 |
| [58 ](https://github.com/Vonng/ddia/pull/58) | [@ibyte2011](https://github.com/ibyte2011) | Update ch8.md |
| [55 ](https://github.com/Vonng/ddia/pull/55) | [@saintube](https://github.com/saintube) | ch8: 修改連結錯誤 |
| [54 ](https://github.com/Vonng/ddia/pull/54) | [@Panmax](https://github.com/Panmax) | Update ch2.md |
| [53 ](https://github.com/Vonng/ddia/pull/53) | [@ibyte2011](https://github.com/ibyte2011) | Update ch9.md |
| [52 ](https://github.com/Vonng/ddia/pull/52) | [@hecenjie](https://github.com/hecenjie) | Update ch1.md |
| [51 ](https://github.com/Vonng/ddia/pull/51) | [@latavin243](https://github.com/latavin243) | fix 修正ch3 ch4幾處翻譯 |
| [50 ](https://github.com/Vonng/ddia/pull/50) | [@AlexZFX](https://github.com/AlexZFX) | 幾個疏漏和格式錯誤 |
| [49 ](https://github.com/Vonng/ddia/pull/49) | [@haifeiWu](https://github.com/haifeiWu) | Update ch1.md |
| [48 ](https://github.com/Vonng/ddia/pull/48) | [@scaugrated](https://github.com/scaugrated) | fix typo |
| [47 ](https://github.com/Vonng/ddia/pull/47) | [@lzwill](https://github.com/lzwill) | Fixed typos in ch2 |
| [45 ](https://github.com/Vonng/ddia/pull/45) | [@zenuo](https://github.com/zenuo) | 刪除一個多餘的右括號 |
| [44 ](https://github.com/Vonng/ddia/pull/44) | [@akxxsb](https://github.com/akxxsb) | 修正第七章底部連結錯誤 |
| [43 ](https://github.com/Vonng/ddia/pull/43) | [@baijinping](https://github.com/baijinping) | "更假簡單"->"更加簡單" |
| [42 ](https://github.com/Vonng/ddia/pull/42) | [@tisonkun](https://github.com/tisonkun) | 修復 ch1 中的無序列表格式 |
| [38 ](https://github.com/Vonng/ddia/pull/38) | [@renjie-c](https://github.com/renjie-c) | 糾正多處的翻譯小錯誤 |
| [37 ](https://github.com/Vonng/ddia/pull/37) | [@tankilo](https://github.com/tankilo) | fix translation mistakes in ch4.md |
| [36 ](https://github.com/Vonng/ddia/pull/36) | [@wwek](https://github.com/wwek) | 1.修復多個連結錯誤 2.名詞最佳化修訂 3.錯誤修訂 |
| [35 ](https://github.com/Vonng/ddia/pull/35) | [@wwek](https://github.com/wwek) | fix ch7.md to ch8.md link error |
| [34 ](https://github.com/Vonng/ddia/pull/34) | [@wwek](https://github.com/wwek) | Merge pull request #1 from Vonng/master |
| [33 ](https://github.com/Vonng/ddia/pull/33) | [@wwek](https://github.com/wwek) | fix part-ii.md link error |
| [32 ](https://github.com/Vonng/ddia/pull/32) | [@JCYoky](https://github.com/JCYoky) | Update ch2.md |
| [31 ](https://github.com/Vonng/ddia/pull/31) | [@elsonLee](https://github.com/elsonLee) | Update ch7.md |
| [26 ](https://github.com/Vonng/ddia/pull/26) | [@yjhmelody](https://github.com/yjhmelody) | 修復一些明顯錯誤 |
| [25 ](https://github.com/Vonng/ddia/pull/25) | [@lqbilbo](https://github.com/lqbilbo) | 修復連結錯誤 |
| [24 ](https://github.com/Vonng/ddia/pull/24) | [@artiship](https://github.com/artiship) | 修改詞語順序 |
| [23 ](https://github.com/Vonng/ddia/pull/23) | [@artiship](https://github.com/artiship) | 修正錯別字 |
| [22 ](https://github.com/Vonng/ddia/pull/22) | [@artiship](https://github.com/artiship) | 糾正翻譯錯誤 |
| [21 ](https://github.com/Vonng/ddia/pull/21) | [@zhtisi](https://github.com/zhtisi) | 修正目錄和本章標題不符的情況 |
| [20 ](https://github.com/Vonng/ddia/pull/20) | [@rentiansheng](https://github.com/rentiansheng) | Update ch7.md |
| [19 ](https://github.com/Vonng/ddia/pull/19) | [@LHRchina](https://github.com/LHRchina) | 修復語句小bug |
| [16 ](https://github.com/Vonng/ddia/pull/16) | [@MuAlex](https://github.com/MuAlex) | Master |
| [15 ](https://github.com/Vonng/ddia/pull/15) | [@cg-zhou](https://github.com/cg-zhou) | Update translation progress |
| [14 ](https://github.com/Vonng/ddia/pull/14) | [@cg-zhou](https://github.com/cg-zhou) | Translate glossary |
| [13 ](https://github.com/Vonng/ddia/pull/13) | [@cg-zhou](https://github.com/cg-zhou) | 詳細修改了後記中和印度野豬相關的描述 |
| [12 ](https://github.com/Vonng/ddia/pull/12) | [@ibyte2011](https://github.com/ibyte2011) | 修改了部分翻譯 |
| [11 ](https://github.com/Vonng/ddia/pull/11) | [@jiajiadebug](https://github.com/jiajiadebug) | ch2 100% |
| [10 ](https://github.com/Vonng/ddia/pull/10) | [@jiajiadebug](https://github.com/jiajiadebug) | ch2 20% |
| [9 ](https://github.com/Vonng/ddia/pull/9) | [@jiajiadebug](https://github.com/jiajiadebug) | Preface, ch1, part-i translation minor fixes |
| [7 ](https://github.com/Vonng/ddia/pull/7) | [@MuAlex](https://github.com/MuAlex) | Ch6 translation pull request |
| [6 ](https://github.com/Vonng/ddia/pull/6) | [@MuAlex](https://github.com/MuAlex) | Ch6 change version1 |
| [5 ](https://github.com/Vonng/ddia/pull/5) | [@nevertiree](https://github.com/nevertiree) | Chapter 01語法微調 |
| [2 ](https://github.com/Vonng/ddia/pull/2) | [@seagullbird](https://github.com/seagullbird) | 序言初翻 |
---------
## License
This project is licensed under [CC-BY 4.0](https://github.com/Vonng/ddia/blob/master/LICENSE). You can find the full terms here:
- [Attribution 4.0 International CC BY 4.0 Deed (Simplified Chinese)](https://creativecommons.org/licenses/by/4.0/deed.zh-hans)
- [Attribution 4.0 International CC BY 4.0](https://creativecommons.org/licenses/by/4.0/deed.en)
================================================
FILE: content/tw/ch1.md
================================================
---
title: "1. Trade-offs in Data Systems Architecture"
weight: 101
breadcrumbs: false
---
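> *There are no perfect solutions, only trade-offs. […] All you can do is try to get the best trade-off you can, and that is all you can hope for.*
>
> [Thomas Sowell](https://www.youtube.com/watch?v=2YUtKr8-_Fg), interview with Fred Barnes (2005)
> [!TIP] A note for early release readers
> With Early Release ebooks, you get to read the author's raw, unedited content at the earliest stage of writing, so you can start using these techniques well before the official release.
>
> This will be Chapter 1 of the final book. The GitHub repository for this book is https://github.com/ept/ddia2-feedback.
> If you would like to take an active part in reviewing and commenting on this draft, please get in touch on GitHub.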
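Data is at the heart of application development today. With the spread of web and mobile apps, software as a service (SaaS), and cloud services, it has become normal to keep the data of many different users in shared server-side data infrastructure. Data from user activity, business transactions, devices, and sensors needs to be stored and made available for analysis. Every time users interact with an application, they both read existing data and generate new data.
When the volume of data is small enough to be stored and processed on a single machine, things are often not very complicated. But as the data size or the query rate grows, the data has to be distributed across multiple machines, and challenges follow. As requirements become more complex, a single system is usually no longer enough, and you may need to combine several storage and processing systems with different capabilities.
If managing data is one of the main challenges in developing an application, we call it a **data-intensive** application [^1]. By contrast, in **compute-intensive** systems the difficulty lies in parallelizing some very large computation. In data-intensive applications we more often care about how to store and process large volumes of data, how to manage changes to data, how to keep data consistent in the face of failures and concurrency, and how to keep services highly available.
Such applications are typically built from a handful of standard building blocks, each of which provides a commonly needed capability. For example, many applications need to:
* Store data so that they, or another application, can find it again later (**databases**)
* Remember the results of expensive operations to speed up reads (**caches**)
* Allow users to search data by keyword or filter it in various ways (**search indexes**)
* Handle events and data changes as soon as they occur (**stream processing**)
* Periodically process large amounts of accumulated data (**batch processing**)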
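When building an application, we usually pick several software systems or services (such as databases or APIs) and glue them together with application code. If your needs fall squarely within what those systems were designed for, this is not hard.
But problems appear when the application's goals are more ambitious. There are many kinds of databases, each with different characteristics and suited to different purposes: how do you choose? There are several approaches to caching and several ways of building search indexes: how do you weigh them? When no single tool can do the job alone, how do you combine multiple tools reliably? None of this is easy.
This book is meant to help you make decisions of this kind: which technologies to use and how to combine them. As you will see, no approach is fundamentally always better than another; each has its pros and cons. With this book, you will learn to ask the right questions for evaluating and comparing data systems, so that you can find a better fit for your particular application.
We will start with the typical ways data is used in organizations today. Many of these ideas have their origins in **enterprise software** (that is, the software needs and engineering practices of large organizations, such as big corporations and government agencies), because historically only such organizations had data volumes large enough to justify investing in sophisticated technical solutions. If your data is small enough, a spreadsheet may well do; but in recent years it has become increasingly common for small companies and startups to build data-intensive systems too.
One of the core difficulties with data systems is that different people need to do completely different things with the same data. In a company, you and your team have your own priorities, while another team may have entirely different goals even though it works with the same dataset. Worse, these goals are often not stated explicitly, which easily leads to misunderstanding and disagreement.
To help you understand what choices you can make, this chapter compares several contrasting concepts and explores their trade-offs:
* the difference between transactional and analytical systems (["Analytical versus Transactional Systems"](#sec_introduction_analytics));
* the pros and cons of cloud services and self-hosted systems (["Cloud versus Self-Hosting"](#sec_introduction_cloud));
* when to move from single-node systems to distributed systems (["Distributed versus Single-Node Systems"](#sec_introduction_distributed)); and
* balancing the needs of the business with the rights of users (["Data Systems, Law, and Society"](#sec_introduction_compliance)).
In addition, this chapter introduces key terminology used throughout the book.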
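> [!TIP] Terminology: frontend and backend
> Most of what this book discusses relates to **backend development**. For web applications, the client-side code that runs in a web browser is called the **frontend**, and the server-side code that handles user requests is called the **backend**. Mobile apps are similar to frontends: they provide the user interface and usually communicate over the internet with a server-side backend. Frontends sometimes manage data locally on the device [^2], but the thornier data infrastructure problems usually arise in the backend: a frontend handles only one user's data, whereas the backend manages data on behalf of **all** users.
>
> A backend service is usually reachable over HTTP (sometimes WebSocket). At its core is application code that reads and writes data in one or more databases and, as needed, connects to other systems such as caches and message queues (collectively, **data infrastructure**). The application code is often **stateless**: once it has finished handling an HTTP request, it retains none of that request's context. Consequently, any information that needs to persist across requests must be stored either on the client or in the server-side data infrastructure.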
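## Analytical versus Transactional Systems {#sec_introduction_analytics}
If you work on data systems in an enterprise, you are likely to encounter several different kinds of people who work with data. The first kind are **backend engineers**, who build services that handle requests to read and update data; these services often serve external users, either directly or indirectly through other services (see ["Microservices and Serverless"](#sec_introduction_microservices)). Sometimes services are only for use within the organization.
Besides the teams managing backend services, two other groups of people typically need access to an organization's data: **business analysts**, who generate reports about the organization's activities to help management make better decisions (**business intelligence**, or **BI**), and **data scientists**, who look for new insights in the data or build user-facing product features powered by data analysis and machine learning (AI). Examples of such features include "people who bought X also bought Y" recommendations on an e-commerce site, predictive analytics such as risk scoring or spam filtering, and the ranking of search results.
Although business analysts and data scientists tend to use different tools and work in different ways, they have something in common: both perform **analytics**, meaning that they look at the data generated by users and backend services but generally do not modify it (except perhaps to fix mistakes). They may create derived datasets in which the original data has been processed in some way. This has led to a split between two types of systems, a distinction that we will use throughout this book:
* **Transactional systems** consist of the backend services and data infrastructure where data is created, for example by serving external users. Here, the application code both reads and modifies the data in its databases, based on the actions performed by users.
* **Analytical systems** serve the needs of business analysts and data scientists. They contain a read-only copy of the data from the transactional systems and are optimized for the kinds of data processing that analytics requires.
As we shall see in the next section, transactional and analytical systems are usually kept separate, for good reasons. As these systems have matured, two new specialized roles have emerged: **data engineers** and **analytics engineers**. Data engineers are the people who know how to integrate the transactional and analytical systems and who take responsibility for the organization's data infrastructure more broadly [^3]. Analytics engineers model and transform data to make it more useful to the business analysts and data scientists in the organization [^4].
Many engineers specialize in either the transactional or the analytical side. This book nevertheless covers both kinds of data system, since both play key roles in the lifecycle of data within an organization. We will look in depth at the data infrastructure needed to deliver services to both internal and external users, to help you work better with your colleagues on the other side.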
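### Characteristics of Transaction Processing and Analytics {#sec_introduction_oltp}
In the early days of business data processing, a write to the database typically corresponded to a **commercial transaction** taking place: making a sale, placing an order with a supplier, paying an employee's salary, and so on. As databases expanded into areas that did not involve money changing hands, the term **transaction** nevertheless stuck, referring to a group of reads and writes that form a logical unit.
> [!NOTE]
> [Chapter 8](/tw/ch8#ch_transactions) explores in detail what we mean by a transaction. This chapter uses the term loosely to refer to low-latency reads and writes.
Even though databases came to be used for many different kinds of data (posts on social media, moves in games, contacts in address books, and so on), the basic access pattern remained similar to processing commercial transactions. A transactional system typically looks up a small number of records by some key (this is called a **point query**) and inserts, updates, or deletes records based on the user's input. Because these applications are interactive, this access pattern became known as **online transaction processing** (OLTP).
However, databases are also increasingly used for analytics, which has very different access patterns from OLTP. Usually an analytic query scans over a huge number of records and calculates aggregate statistics (such as count, sum, or average) rather than returning individual records to the user. For example, a business analyst at a supermarket chain might want to answer analytic queries such as:
* What was the total revenue of each of our stores in January?
* How many more bananas than usual did we sell during our latest promotion?
* Which brand of baby food is most often purchased together with brand X diapers?
The reports that result from these kinds of queries are important for business intelligence and help management decide what to do next. To distinguish this pattern of using databases from transaction processing, it has been called **online analytic processing** (OLAP) [^5]. The difference between OLTP and analytics is not always clear-cut, but [Table 1-1](#tab_oltp_vs_olap) lists some typical characteristics.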
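{{< figure id="tab_oltp_vs_olap" title="Table 1-1. Comparing characteristics of transactional and analytical systems" class="w-full my-4" >}}
| Property            | Transactional systems (OLTP)                        | Analytical systems (OLAP)                  |
|---------------------|-----------------------------------------------------|--------------------------------------------|
| Main read pattern   | Point queries (fetch individual records by key)     | Aggregate over a large number of records   |
| Main write pattern  | Create, update, and delete individual records       | Bulk import (ETL) or event stream          |
| Human user example  | End user of a web or mobile application             | Internal analyst, for decision support     |
| Machine use example | Checking whether an action is authorized            | Detecting fraud/abuse patterns             |
| Type of queries     | Fixed set of queries, predefined by the application | Analyst can make arbitrary queries         |
| Data represents     | Latest state of data (current point in time)        | History of events that happened over time  |
| Dataset size        | Gigabytes to terabytes                              | Terabytes to petabytes                     |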
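> [!NOTE]
> The meaning of **online** in OLAP is unclear; it may refer to the fact that queries are not just for predefined reports, or to the fact that analysts use OLAP systems interactively for exploratory queries.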
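To make the two read patterns in Table 1-1 concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The `sales` table, its columns, and the sample rows are hypothetical and exist only for illustration; a real deployment would normally run these two kinds of queries on different systems.

```python
import sqlite3

# Hypothetical schema: a tiny in-memory table standing in for a real database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, store TEXT, product TEXT, "
    "sold_on TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales (id, store, product, sold_on, amount) VALUES (?, ?, ?, ?, ?)",
    [
        (1, "north", "bananas", "2024-01-03", 4.0),
        (2, "north", "diapers", "2024-01-15", 12.5),
        (3, "south", "bananas", "2024-01-20", 6.0),
        (4, "south", "bananas", "2024-02-02", 3.0),
    ],
)


def get_sale(sale_id):
    """OLTP-style point query: fetch a single record by its key."""
    return conn.execute(
        "SELECT id, store, product, sold_on, amount FROM sales WHERE id = ?",
        (sale_id,),
    ).fetchone()


def january_revenue_by_store():
    """OLAP-style query: scan many records and return an aggregate per store."""
    rows = conn.execute(
        "SELECT store, SUM(amount) FROM sales "
        "WHERE sold_on >= '2024-01-01' AND sold_on < '2024-02-01' "
        "GROUP BY store ORDER BY store"
    ).fetchall()
    return dict(rows)
```

The point query touches one row identified by its primary key, whereas the aggregate query has to read every January sale before it can return anything, which is why the two patterns favor different storage layouts.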
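In transactional systems, users are generally not allowed to construct custom SQL queries and run them on the database, since that could allow them to read or modify data that they are not permitted to access. They might also write queries that are expensive to execute and so degrade database performance for other users. For these reasons, OLTP systems mostly run a fixed set of queries that are baked into the application code, and use one-off custom queries only occasionally, for maintenance or troubleshooting. Analytic databases, on the other hand, usually give their users the freedom to write arbitrary SQL queries by hand, or to generate queries automatically using a data visualization or dashboard tool such as Tableau, Looker, or Microsoft Power BI.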
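One way to picture a "fixed set of queries" is a lookup table of parameterized statements that the application code owns. The sketch below is only an illustration under assumed names (`FIXED_QUERIES`, `run_fixed_query`, and the `users` table are all hypothetical); it uses Python's `sqlite3` placeholders so that callers supply values but never SQL text.

```python
import sqlite3

# The only statements this hypothetical service will ever execute.
FIXED_QUERIES = {
    "get_profile": "SELECT id, name FROM users WHERE id = ?",
    "rename_user": "UPDATE users SET name = ? WHERE id = ?",
}


def run_fixed_query(conn, name, params):
    """Run one of the predefined queries; reject anything not on the list."""
    if name not in FIXED_QUERIES:
        raise ValueError(f"unknown query: {name}")
    # Values are bound through placeholders, so user input is never parsed as SQL.
    return conn.execute(FIXED_QUERIES[name], params).fetchall()


# Small demo against an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (id, name) VALUES (1, 'alice')")
run_fixed_query(conn, "rename_user", ("alicia", 1))
profile = run_fixed_query(conn, "get_profile", (1,))
```

An analytical database would typically skip the allow-list and accept whatever SQL the analyst or dashboard tool sends, relying on access controls and resource limits instead.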
還有一種型別的系統是為分析型的工作負載(對許多記錄進行聚合的查詢)設計的,但嵌入到面向使用者的產品中。這一類別被稱為 **產品分析** 或 **即時分析**,為這種用途設計的系統包括 Pinot、Druid 和 ClickHouse [^6]。
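### Data Warehousing {#sec_introduction_dwh}
At first, the same databases were used for both transaction processing and analytic queries. SQL turned out to be quite flexible in this regard: it works well for both types of queries. Nevertheless, in the late 1980s and early 1990s, there was a trend for companies to stop using their OLTP systems for analytics purposes, and to run the analytics on a separate database system instead. This separate database was called a **data warehouse**.
A large enterprise may have dozens or even hundreds of online transaction processing systems: systems powering the customer-facing website, controlling point-of-sale (checkout) systems in physical stores, tracking inventory in warehouses, planning routes for vehicles, managing suppliers, administering employees, and performing many other tasks. Each of these systems is complex and needs a team of people to maintain it, so these systems end up operating mostly independently of each other.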
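It is usually undesirable for business analysts and data scientists to query these OLTP systems directly, for several reasons:
* the data of interest may be spread across multiple transactional systems, making it difficult to combine those datasets in a single query (a problem known as **data silos**);
* the kinds of schemas and data layouts that are good for OLTP are less well suited for analytics (see ["Stars and Snowflakes: Schemas for Analytics"](/tw/ch3#sec_datamodels_analytics));
* analytic queries can be quite expensive, and running them on an OLTP database would impact the performance for other users; and
* the OLTP systems might reside in a separate network that users are not allowed to access directly, for security or compliance reasons.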
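A **data warehouse**, by contrast, is a separate database that analysts can query to their hearts' content, without affecting OLTP operations [^7]. As we shall see in [Chapter 4](/tw/ch4#ch_storage), data warehouses often store data in a way that is very different from OLTP databases, in order to optimize for the types of queries that are common in analytics.
The data warehouse contains a read-only copy of the data in all the various OLTP systems in the company. Data is extracted from OLTP databases (using either a periodic data dump or a continuous stream of updates), transformed into an analysis-friendly schema, cleaned up, and then loaded into the data warehouse. This process of getting data into the data warehouse is known as **Extract-Transform-Load** (ETL) and is illustrated in [Figure 1-1](#fig_dwh_etl). Sometimes the order of the **transform** and **load** steps is swapped (i.e., the data is loaded first and then transformed inside the data warehouse), resulting in **ELT**.
{{< figure src="/fig/ddia_0101.png" id="fig_dwh_etl" caption="Figure 1-1. Simplified outline of ETL into a data warehouse." class="w-full my-4" >}}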
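The extract, transform, and load steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how any particular ETL tool works: two in-memory SQLite databases stand in for the OLTP system and the warehouse, and the `orders` and `fact_sales` tables and their columns are hypothetical.

```python
import sqlite3

# A hypothetical OLTP database holding individual orders.
oltp = sqlite3.connect(":memory:")
oltp.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, store TEXT, amount_cents INTEGER, placed_at TEXT)"
)
oltp.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, " North ", 1250, "2024-01-03T09:15:00"),
        (2, "north", 800, "2024-01-03T17:40:00"),
        (3, "South", 400, "2024-01-04T11:05:00"),
    ],
)

def extract(source):
    """Extract: read the raw rows out of the transactional database."""
    return source.execute("SELECT store, amount_cents, placed_at FROM orders").fetchall()

def transform(rows):
    """Transform: clean up store names and reshape rows into (store, day, amount)."""
    return [
        (store.strip().lower(), placed_at[:10], cents / 100)
        for store, cents, placed_at in rows
    ]

def load(warehouse, rows):
    """Load: write the cleaned rows into an analysis-friendly table."""
    warehouse.execute(
        "CREATE TABLE IF NOT EXISTS fact_sales (store TEXT, day TEXT, amount REAL)"
    )
    warehouse.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", rows)

warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(extract(oltp)))

daily_revenue = warehouse.execute(
    "SELECT store, day, SUM(amount) FROM fact_sales GROUP BY store, day ORDER BY store, day"
).fetchall()
```

In an ELT variant, `load` would run on the raw rows first, and the cleanup in `transform` would instead be expressed as a query inside the warehouse.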
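In some cases the data sources of the ETL processes are external SaaS products such as customer relationship management (CRM), email marketing, or credit card processing systems. In those cases, you do not have direct access to the original database, since it is accessible only via the software vendor's API. Bringing the data from these external systems into your own data warehouse can enable analyses that are not possible via the SaaS API. ETL for SaaS APIs is often implemented by specialist data connector services such as Fivetran, Singer, or AirByte.
Some database systems offer **hybrid transactional/analytic processing** (HTAP), which aims to enable OLTP and analytics in a single system without requiring ETL from one system into another [^8] [^9]. However, many HTAP systems internally consist of an OLTP system coupled with a separate analytical system, hidden behind a common interface, so the distinction between the two remains important for understanding how these systems work.
Moreover, even though HTAP exists, it is common to keep transactional and analytical systems separate because of their different goals and constraints. In particular, it is considered good practice for each transactional system to have its own database (see ["Microservices and Serverless"](#sec_introduction_microservices)), leading to hundreds of separate transactional databases; on the other hand, an enterprise usually has a single data warehouse, so that analysts can combine data from several transactional systems in a single query.
HTAP therefore does not replace data warehouses. Rather, it is useful in scenarios where the same application needs both to run analytics queries that scan a large number of rows, and to read and update individual records with low latency. Fraud detection can involve such workloads, for example [^10].
The separation between transactional and analytical systems is part of a wider trend: as workloads have become more demanding, systems have become more specialized and optimized for particular workloads. General-purpose systems can handle small data volumes comfortably, but the greater the scale, the more specialized systems tend to become [^11].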
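#### From data warehouse to data lake {#from-data-warehouse-to-data-lake}
A data warehouse often uses a **relational** data model that is queried through SQL (see [Chapter 3](/tw/ch3#ch_datamodels)), perhaps using specialized business intelligence software. This model works well for the types of queries that business analysts need to make, but it is less well suited to the needs of data scientists, who might need to perform tasks such as:
* Transform data into a form that is suitable for training a machine learning model; often this requires turning the rows and columns of a database table into a vector or matrix of numerical values called **features**. The process of performing this transformation in a way that maximizes the performance of the trained model is called **feature engineering**, and it often requires custom code that is difficult to express using SQL.
* Take textual data (e.g., reviews of a product) and use natural language processing techniques to try to extract structured information from it (e.g., the sentiment of the author, or which topics they mention). Similarly, they might need to extract structured information from photos using computer vision techniques.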
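Although there have been efforts to add machine learning operators to a SQL data model [^12] and to build efficient machine learning systems on top of a relational foundation [^13], many data scientists prefer not to work in a relational database such as a data warehouse. Instead, many prefer to use Python data analysis libraries such as pandas and scikit-learn, statistical analysis languages such as R, and distributed analytics frameworks such as Spark [^14]. We discuss these further in ["Dataframes, Matrices, and Arrays"](/tw/ch3#sec_datamodels_dataframes).
Consequently, organizations face a need to make data available in a form that is suitable for use by data scientists. The answer is a **data lake**: a centralized data repository that holds a copy of any data that might be useful for analysis, obtained from transactional systems via ETL processes. The difference from a data warehouse is that a data lake simply contains files, without imposing any particular file format or data model. Files in a data lake might be collections of database records, encoded using a file format such as Avro or Parquet (see [Chapter 5](/tw/ch5#ch_encoding)), but they can equally well contain text, images, videos, sensor readings, sparse matrices, feature vectors, genome sequences, or any other kind of data [^15]. Besides being more flexible, this is also often cheaper than relational data storage, since the data lake can use commoditized file storage such as object stores (see ["Cloud-Native System Architecture"](#sec_introduction_cloud_native)).
ETL processes have been generalized to **data pipelines**, and in some cases the data lake has become an intermediate stop on the path from the transactional systems to the data warehouse. The data lake contains data in the "raw" form produced by the transactional systems, without the transformation into a relational data warehouse schema. This approach has the advantage that each consumer of the data can transform the raw data into the form that best suits their needs. It has been dubbed the **sushi principle**: "raw data is better" [^16].
Besides loading data from a data lake into a separate data warehouse, it is also possible to run typical data warehousing workloads (SQL queries and business analytics) directly on the files in the data lake, alongside data science and machine learning workloads. This architecture is known as a **data lakehouse**, and it requires a query execution engine and a metadata (e.g., schema management) layer that extend the data lake's file storage [^17].
Apache Hive, Spark SQL, Presto, and Trino are examples of this approach.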
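#### Beyond the data lake {#beyond-the-data-lake}
As analytics practices have matured, organizations have paid increasing attention to the management and operations of analytics systems and data pipelines, as captured for example in the DataOps manifesto [^18]. Part of this is governance, privacy, and compliance with regulations such as GDPR and CCPA, which we discuss in ["Data Systems, Law, and Society"](#sec_introduction_compliance) and ["Legislation and Self-Regulation"](/ch14#sec_future_legislation).
Moreover, analytical data is increasingly made available not only as files and relational tables, but also as streams of events (see [Chapter 12](/tw/ch12#ch_stream)). File-based analysis typically responds to changes in the data by re-running periodically (e.g., once a day), whereas stream processing allows analytics systems to respond to events within seconds. This is valuable for time-sensitive use cases, such as identifying and blocking potentially fraudulent or abusive activity.
In some cases the outputs of analytics systems are fed back into transactional systems (a process sometimes known as **reverse ETL** [^19]). For example, a machine learning model that was trained in an analytical system may be deployed to production so that it can generate recommendations for end users, such as "people who bought X also bought Y". Such deployed outputs of analytics systems are also known as **data products** [^20]. Machine learning models can be deployed to transactional systems using specialized tools such as TFX, Kubeflow, or MLflow.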
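### Systems of Record and Derived Data {#sec_introduction_derived}
Related to the distinction between transactional and analytical systems, this book also distinguishes between **systems of record** and **derived data systems**. These terms help you clarify the flow of data through a system:
Systems of record
: A system of record, also known as **source of truth**, holds the authoritative or **canonical** version of some data. When new data comes in, e.g., as user input, it is first written here. Each fact is represented exactly once (the representation is typically **normalized**; see ["Normalization, Denormalization, and Joins"](/tw/ch3#sec_datamodels_normalization)). If there is any discrepancy between another system and the system of record, then the value in the system of record is (by definition) the correct one.
Derived data systems
: Data in a derived system is the result of taking some existing data from another system and transforming or processing it in some way. If you lose derived data, you can recreate it from the original source. A classic example is a cache: data can be served from the cache if present, but if the cache does not contain what you need, you can fall back to the underlying database. Denormalized values, indexes, materialized views, transformed data representations, and models trained on a dataset also fall into this category.
Technically speaking, derived data is **redundant**, in the sense that it duplicates existing information. However, it is often essential for getting good performance on read queries. You can derive several different datasets from a single source, enabling you to look at the same facts from different "points of view".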
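The cache example can be made concrete with a small sketch. It assumes a hypothetical in-memory dictionary standing in for the primary database, and a simple invalidate-on-write policy that is only one of several possible ways to keep a cache up to date:

```python
# System of record: a hypothetical in-memory stand-in for the primary database.
system_of_record = {"user:1": "Alice", "user:2": "Bob"}

# Derived data: a cache that can always be rebuilt from the system of record.
cache = {}
stats = {"hits": 0, "misses": 0}

def read(key):
    """Serve from the cache if possible, otherwise fall back to the database."""
    if key in cache:
        stats["hits"] += 1
        return cache[key]
    stats["misses"] += 1
    value = system_of_record[key]
    cache[key] = value
    return value

def write(key, value):
    """New data is written to the system of record first."""
    system_of_record[key] = value
    cache.pop(key, None)  # drop the stale derived copy

first = read("user:1")    # miss: loaded from the system of record
second = read("user:1")   # hit: served from the cache
write("user:1", "Alicia")
third = read("user:1")    # miss again, because the write invalidated the entry
cache.clear()             # losing derived data is harmless...
fourth = read("user:2")   # ...because it can be recreated from the source
```

Clearing the cache changes only performance, never correctness, which is the defining property of derived data.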
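Analytical systems are usually derived data systems, because they are consumers of data created elsewhere. Transactional services often contain a mixture of systems of record and derived data systems: the former are the primary databases to which data is first written, while the latter are the indexes and caches that speed up common read operations, especially for queries that the system of record cannot answer efficiently.
Most databases, storage engines, and query languages are not inherently a system of record or a derived system. A database is just a tool; what matters is how you use it. The distinction depends not on the tool itself, but on the role it plays in your application. By being clear about which data is derived from which other data, you can bring clarity to an otherwise confusing system architecture.
When the data in one system is derived from the data in another, you need a process for updating the derived data when the original in the system of record changes. Unfortunately, many databases are designed on the assumption that your application only ever needs that one database, and they do not make it easy to propagate such updates across multiple systems. In ["Data Integration"](/tw/ch13#sec_future_integration) we will discuss approaches to combining multiple data systems to achieve things that one system alone cannot do.
That brings us to the end of our comparison of analytics and transaction processing. In the next section, we will examine another trade-off that is frequently debated.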
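## Cloud Versus Self-Hosting {#sec_introduction_cloud}
With anything that an organization needs to do, one of the first questions is: should it be done in-house, or should it be outsourced? Should you build or should you buy?
Ultimately, this is a question about business priorities. The received management wisdom is that things that are a core competency or a competitive advantage of your organization should be done in-house, whereas things that are non-core, routine, or commonplace should be left to a vendor [^21].
To give an extreme example, most companies do not generate their own electricity (unless they are an energy company, and leaving aside emergency backup power), since it is cheaper to buy electricity from the grid.
With software, two important decisions to be made are who builds the software and who deploys it. There is a spectrum of possibilities that outsource each decision to various degrees, as illustrated in [Figure 1-2](#fig_cloud_spectrum).
At one extreme is bespoke software that you write and run in-house; at the other extreme are widely used cloud services or Software as a Service (SaaS) products that are implemented and operated by an external vendor, and which you only access through a web interface or API.
{{< figure src="/fig/ddia_0102.png" id="fig_cloud_spectrum" caption="Figure 1-2. A spectrum of types of software and its operations." class="w-full my-4" >}}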
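The middle ground is off-the-shelf software (open source or commercial) that you **self-host**, i.e., deploy yourself, for example if you download MySQL and install it on a server you control.
This could be on your own hardware (often called **on-premises**, even if the server is actually in a rented datacenter rack and not literally on your own premises),
or on a virtual machine in the cloud (**Infrastructure as a Service** or IaaS). There are still more points along this spectrum, e.g., taking open source software and running a modified version of it.
Separately from this spectrum there is also the question of **how** you deploy services, either in the cloud or on-premises: for example, whether you use an orchestration framework such as Kubernetes.
However, the choice of deployment tooling is out of scope for this book, since other factors have a greater influence on the architecture of data systems.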
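### Pros and Cons of Cloud Services {#sec_introduction_cloud_tradeoffs}
Using a cloud service, rather than running comparable software yourself, essentially outsources the operation of that software to the cloud provider.
There are good arguments for and against cloud services. Cloud providers claim that using their services saves you time and money, and allows you to move faster compared to setting up your own infrastructure.
Whether a cloud service is actually cheaper and easier than self-hosting depends very much on your skills and the workload on your systems.
If you already have experience setting up and operating the systems you need, and if your load is quite predictable (i.e., the number of machines you need does not fluctuate wildly),
then it is often cheaper to buy your own machines and run the software on them yourself [^22] [^23].
On the other hand, if you need a system that you do not already know how to deploy and operate, then adopting a cloud service is often easier and quicker than learning to manage the system yourself.
If you have to hire and train staff specifically to maintain and operate the system, that can get very expensive.
You still need an operations team when you are using the cloud (see ["Operations in the Cloud Era"](#sec_introduction_operations)), but outsourcing the basic system administration can free up your team to focus on higher-level concerns.
When you outsource the operation of a system to a company that specializes in running that service, that can result in a better service, since the provider gains operational expertise from serving many customers.
On the other hand, if you run the service yourself, you can configure and tune it to perform well on your particular workload, whereas a cloud service is unlikely to be willing to make such customizations on your behalf.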
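Cloud services are particularly valuable if the load on your systems varies a lot over time. If you provision your machines to be able to handle peak load, but those computing resources are idle most of the time, the system becomes less cost-effective.
In this situation, cloud services have the advantage that they make it easier to scale your computing resources up or down in response to changes in demand.
For example, analytics systems often have extremely variable load: running a large analytical query quickly requires a lot of computing resources in parallel, but once the query completes, those resources sit idle until the user makes the next query.
Predefined queries (e.g., for daily reports) can be enqueued and scheduled to smooth out the load, but for interactive queries, the faster you want them to complete, the more variable the workload becomes.
If your dataset is so large that querying it quickly requires significant computing resources, using the cloud can save money, since you can return unused resources to the provider rather than leaving them idle. For smaller datasets, this difference is less significant.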
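A back-of-the-envelope calculation illustrates why load variability matters. Every number below is a made-up assumption (machine counts, hourly prices, and the shape of the load); the point is the shape of the arithmetic, not the specific figures.

```python
# All numbers are hypothetical and chosen only to illustrate the arithmetic.
HOURS_PER_DAY = 24

def fixed_cost(peak_machines, price_per_machine_hour):
    """Daily cost when provisioning enough machines for peak load all day."""
    return peak_machines * HOURS_PER_DAY * price_per_machine_hour

def elastic_cost(machines_by_hour, price_per_machine_hour):
    """Daily cost when paying only for the machines used in each hour."""
    return sum(machines_by_hour) * price_per_machine_hour

# A bursty analytics workload: 100 machines for 2 hours a day, 2 otherwise.
bursty = [100] * 2 + [2] * 22
# A predictable workload: 10 machines around the clock.
steady = [10] * 24

owned_price = 0.5  # assumed amortized cost of an owned machine, per hour
cloud_price = 1.0  # assumed on-demand cloud price, per machine-hour

bursty_fixed = fixed_cost(max(bursty), owned_price)      # 1200.0
bursty_elastic = elastic_cost(bursty, cloud_price)       # 244.0
steady_fixed = fixed_cost(max(steady), owned_price)      # 120.0
steady_elastic = elastic_cost(steady, cloud_price)       # 240.0
```

Under these assumptions the elastic option is much cheaper for the bursty workload even at twice the hourly price, while the fixed option wins for the steady workload.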
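The biggest downside of a cloud service is that you have no control over it:
* If it is lacking a feature you need, all you can do is politely ask the vendor whether they will add it; you generally cannot implement it yourself.
* If the service goes down, all you can do is wait for it to recover.
* If you are using the service in a way that triggers a bug or causes performance problems, it will be difficult for you to diagnose the issue. With software that you run yourself, you can get performance metrics and debugging information from the operating system to help you understand its behavior, and you can look at the server logs, but with a service hosted by a vendor you usually do not have access to these internals.
* Moreover, if the service shuts down or becomes unacceptably expensive, or if the vendor decides to change their product in a way you do not like, you are at their mercy: continuing to run an old version of the software is usually not an option, so you will be forced to migrate to an alternative service [^24].
  This risk is mitigated if there are alternative services that expose a compatible API, but for many cloud services there are no standard APIs, which raises the cost of switching and makes vendor lock-in a problem.
* The cloud provider needs to be trusted to keep the data secure, which can complicate the process of complying with privacy and security regulations.
Despite all these risks, it has become more and more popular for organizations to build new applications on top of cloud services, or to adopt a hybrid approach in which cloud services are used for some parts of a system. However, cloud services will not subsume all in-house data systems: many older systems predate the cloud, and for any service with specialist requirements that existing cloud services cannot meet, in-house systems remain necessary. For example, very latency-sensitive applications such as high-frequency trading require full control of the hardware.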
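### Cloud-Native System Architecture {#sec_introduction_cloud_native}
Besides having a different economic model (subscribing to a service instead of buying hardware and licensing software to run on it), the rise of the cloud has also had a profound effect on how data systems are implemented on a technical level.
The term **cloud-native** is used to describe an architecture that is designed to take advantage of cloud services.
In principle, almost any software that you can self-host could also be provided as a cloud service, and indeed managed versions are now available for many popular data systems.
However, systems that have been designed from the ground up to be cloud-native have been shown to have several advantages: better performance on the same hardware, faster recovery from failures, the ability to quickly scale computing resources to match the load, and support for larger datasets [^25] [^26] [^27]. [Table 1-2](#tab_cloud_native_dbs) lists some examples of both types of systems.
{{< figure id="tab_cloud_native_dbs" title="Table 1-2. Examples of self-hosted and cloud-native database systems" class="w-full my-4" >}}
| Category           | Self-hosted systems         | Cloud-native systems                                                  |
|--------------------|-----------------------------|-----------------------------------------------------------------------|
| Transactional/OLTP | MySQL, PostgreSQL, MongoDB  | AWS Aurora [^25], Azure SQL DB Hyperscale [^26], Google Cloud Spanner |
| Analytical/OLAP    | Teradata, ClickHouse, Spark | Snowflake [^27], Google BigQuery, Azure Synapse Analytics             |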
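#### Layering of cloud services {#layering-of-cloud-services}
Many self-hosted data systems have very simple system requirements: they run on a conventional operating system such as Linux or Windows, they store their data as files on the filesystem, and they communicate via standard network protocols such as TCP/IP.
A few systems depend on special hardware such as GPUs (for machine learning) or RDMA network interfaces, but on the whole, self-hosted software tends to use very generic computing resources: CPU, RAM, a filesystem, and an IP network.
In a cloud, this type of software can be run in an Infrastructure-as-a-Service (IaaS) environment, using one or more virtual machines (or **instances**) with a certain allocation of CPUs, memory, disk, and network bandwidth.
Compared to physical machines, cloud instances can be provisioned faster and they come in a greater variety of sizes, but otherwise they are similar to a traditional computer: you can run any software you like on them, but you are responsible for administering it yourself.
In contrast, the key idea of cloud-native services is not only to use the computing resources managed by your operating system, but also to build higher-level services on top of lower-level cloud services. For example:
* **Object storage** services such as Amazon S3, Azure Blob Storage, and Cloudflare R2 store large files. They provide more limited APIs than a typical filesystem (basic file reads and writes), but they have the advantage that they hide the underlying physical machines: the service automatically distributes the data across many machines, so that you do not have to worry about running out of disk space on any one machine. Even if some machines or their disks fail entirely, no data is lost.
* Many other services are in turn built upon object storage and other cloud services: for instance, Snowflake is a cloud-based analytic database (data warehouse) that relies on S3 for data storage [^27], and some other services in turn build upon Snowflake.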
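To make the idea of a narrow API that hides the underlying machines more tangible, here is a toy in-memory sketch. It is not how S3 or any real object store is implemented, and the class, its methods, and the replica-placement rule are all hypothetical; it only illustrates that callers see `put` and `get`, while replication across machines happens behind the interface.

```python
import hashlib

class ToyObjectStore:
    """A toy, in-memory stand-in for an object store (not a real S3 client)."""

    def __init__(self, num_machines=5, replicas=3):
        # Each "machine" is a dict; None marks a machine that has failed.
        self.machines = [dict() for _ in range(num_machines)]
        self.replicas = replicas

    def _placement(self, key):
        # Pick `replicas` consecutive machines, starting from a hash of the key.
        start = int(hashlib.sha256(key.encode()).hexdigest(), 16) % len(self.machines)
        return [(start + i) % len(self.machines) for i in range(self.replicas)]

    def put(self, key, data):
        for m in self._placement(key):
            if self.machines[m] is not None:
                self.machines[m][key] = data

    def get(self, key):
        for m in self._placement(key):
            machine = self.machines[m]
            if machine is not None and key in machine:
                return machine[key]
        raise KeyError(key)

    def fail_machine(self, index):
        self.machines[index] = None  # simulate losing a machine and its disks

store = ToyObjectStore()
store.put("reports/2024-01.parquet", b"example bytes")
failed = store._placement("reports/2024-01.parquet")[0]
store.fail_machine(failed)
recovered = store.get("reports/2024-01.parquet")
```

The caller never learns which machine held the object or that one of them failed; a higher-level service built on this interface inherits that property.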
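As always with abstractions in computing, there is no one right answer to what you should use. As a general rule, higher-level abstractions tend to be more oriented toward particular use cases. If your needs match the situations for which a higher-level system is designed, using the existing higher-level system will probably meet your needs with much less hassle than building it yourself from lower-level systems. On the other hand, if there is no high-level system that meets your needs, then building it yourself from lower-level components is the only option.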
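#### Separation of storage and compute {#sec_introduction_storage_compute}
In traditional computing, disk storage is regarded as durable (we assume that once something is written to disk, it will not be lost). To tolerate the failure of an individual hard disk, RAID (Redundant Array of Independent Disks) is often used to maintain copies of the data on several disks attached to the same machine. RAID can be performed either in hardware or in software by the operating system, and it is transparent to the applications accessing the filesystem.
In the cloud, compute instances (virtual machines) may also have local disks attached, but cloud-native systems typically treat these disks more like an ephemeral cache, and less like long-term storage. This is because the local disk becomes inaccessible if the associated instance fails, or if the instance is replaced with a bigger or a smaller one (on a different physical machine) in order to adapt to changes in load.
As an alternative to local disks, cloud services also offer virtual disk storage that can be detached from one instance and attached to a different one (Amazon EBS, Azure managed disks, and persistent disks in Google Cloud). Such a virtual disk is not actually a physical disk, but rather a cloud service provided by a separate set of machines, which emulates the behavior of a disk (a **block device**, where each block is typically 4 KiB in size). This technology makes it possible to run traditional disk-based software in the cloud, but the block device emulation introduces overheads that can be avoided in systems that are designed from the ground up for the cloud [^25]. It also makes the application very sensitive to network glitches, since every I/O on the virtual block device is actually a network call [^28].
To address this problem, cloud-native services generally avoid using virtual disks, and instead build on dedicated storage services that are optimized for particular workloads. Object storage services such as S3 are designed for long-term storage of fairly large files, ranging from hundreds of kilobytes to several gigabytes in size. The individual rows or values stored in a database are typically much smaller than this; cloud databases therefore typically manage smaller values in a separate service, and store larger data blocks (containing many individual values) in an object store [^26] [^29]. We will see ways of doing this in [Chapter 4](/tw/ch4#ch_storage).
In a traditional systems architecture, the same computer is responsible for both storage (disk) and computation (CPU and RAM), but in cloud-native systems, these two responsibilities have become somewhat separated or **disaggregated** [^9] [^27] [^30] [^31]: for example, S3 only stores files, and if you want to analyze that data, you will have to run the analysis code somewhere outside of S3. This implies transferring the data over the network, which we will discuss further in ["Distributed Versus Single-Node Systems"](#sec_introduction_distributed).
Moreover, cloud-native systems are often **multitenant**, which means that rather than having a separate machine for each customer, data and computation from several different customers are handled on the same shared hardware by the same service [^32].
Multitenancy can enable better hardware utilization, easier scalability, and easier management by the cloud provider, but it also requires careful engineering to ensure that one customer's activity does not affect the performance or security of the system for other customers [^33].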
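### Operations in the Cloud Era {#sec_introduction_operations}
Traditionally, the people managing an organization's server-side data infrastructure were known as **database administrators** (DBAs) or **system administrators** (sysadmins). More recently, many organizations have tried to integrate the roles of software development and operations into teams with a shared responsibility for both backend services and data infrastructure; the **DevOps** philosophy has guided this trend. **Site Reliability Engineers** (SREs) are Google's implementation of this idea [^34].
The role of operations is to ensure services are reliably delivered to users (including configuring infrastructure and deploying applications), and to ensure a stable production environment (including monitoring and diagnosing any problems that may affect reliability). For self-hosted systems, operations traditionally involves a significant amount of work at the level of individual machines, such as capacity planning (e.g., monitoring available disk space and adding more disks before you run out of space), provisioning new machines, moving services from one machine to another, and installing operating system patches.
Many cloud services present an API that hides the individual machines that actually implement the service. For example, cloud storage replaces fixed-size disks with **metered billing**, where you can store data without planning your capacity needs in advance, and you are then charged based on the space actually used. Moreover, many cloud services remain highly available, even when individual machines have failed (see ["Reliability and Fault Tolerance"](/tw/ch2#sec_introduction_reliability)).
This shift in emphasis from individual machines to services has been accompanied by a change in the role of operations. The high-level goal of providing a reliable service remains the same, but the processes and tools have evolved. The DevOps/SRE philosophy places greater emphasis on:
* automation: preferring repeatable processes over manual one-off jobs,
* preferring ephemeral virtual machines and services over long-running servers,
* enabling frequent application updates,
* learning from incidents, and
* preserving the organization's knowledge about the system, even as individual people come and go [^35].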
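With the rise of cloud services, there has been a bifurcation of roles: operations teams at infrastructure companies specialize in the details of providing a reliable service to a large number of customers, while the customers of the service spend as little time and effort as possible on infrastructure [^36].
Customers of cloud services still require operations, but they focus on different aspects, such as choosing the most appropriate service for a given task, integrating different services with each other, and migrating from one service to another. Even though metered billing removes the need for capacity planning in the traditional sense, it is still important to know what resources you are using for which purpose, so that you do not waste money on cloud resources that are not needed: capacity planning becomes financial planning, and performance optimization becomes cost optimization [^37].
Moreover, cloud services do have resource limits or **quotas** (such as the maximum number of processes you can run concurrently), which you need to know about and plan for before you run into them [^38].
Adopting a cloud service can be easier and quicker than running your own infrastructure, although even here there is a cost in learning how to use it, and perhaps in working around its limitations. Integration between different services becomes a particular challenge as a growing number of vendors offer an ever broader range of cloud services targeting different use cases [^39] [^40].
ETL (see ["Data Warehousing"](#sec_introduction_dwh)) is only part of the story; transactional cloud services also need to be integrated with each other. At present, there is a lack of standards that would facilitate this sort of integration, so it often involves significant manual effort.
Other operational aspects that cannot fully be outsourced to cloud services include maintaining the security of an application and the libraries it uses, managing the interactions between your own services, monitoring the load on your services, and tracking down the cause of problems such as performance degradations or outages. While the cloud is changing the role of operations, the need for operations is as great as ever.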
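## Distributed Versus Single-Node Systems {#sec_introduction_distributed}
A system that involves several machines communicating via a network is called a **distributed system**. Each of the processes participating in a distributed system is called a **node**. There are various reasons why you might want a system to be distributed:
Inherently distributed systems
: If an application involves two or more interacting users, each using their own device, then the system is unavoidably distributed: the communication between the devices has to go via a network.
Requests between cloud services
: If data is stored in one service but processed in another, it must be transferred over the network from one service to the other.
Fault tolerance/high availability
: If your application needs to continue working even if one machine (or several machines, or the network, or an entire datacenter) goes down, you can use multiple machines to give you redundancy. When one fails, another one can take over. See ["Reliability and Fault Tolerance"](/tw/ch2#sec_introduction_reliability) and [Chapter 6](/tw/ch6#ch_replication) on replication.
Scalability
: If your data volume or computing requirements grow bigger than a single machine can handle, you can potentially spread the load across multiple machines. See ["Scalability"](/tw/ch2#sec_introduction_scalability).
Latency
: If you have users around the world, you might want to have servers in various regions worldwide so that each user can be served from a server that is geographically close to them. That avoids users having to wait for network packets to travel halfway around the world to answer their requests. See ["Describing Performance"](/tw/ch2#sec_introduction_percentiles).
Elasticity
: If your application is busy at some times and idle at other times, a cloud deployment can scale up or down to meet the demand, so that you pay only for resources you are actively using. This is more difficult on a single machine, which needs to be provisioned to handle the maximum load, even at times when it is barely used.
Using specialized hardware
: Different parts of the system can take advantage of different types of hardware to match their workload. For example, an object store may use machines with many disks but few CPUs, whereas a data analysis system may use machines with lots of CPU and memory but no disks, and a machine learning system may use machines with GPUs (which are much more efficient than CPUs for training deep neural networks and other machine learning tasks).
Legal compliance
: Some countries have data residency laws that require data about people in their jurisdiction to be stored and processed geographically within that country [^41]. The scope of these rules varies: for example, in some cases they apply only to medical or financial data, while in others they are broader. A service with users in several such jurisdictions will therefore have to distribute their data across servers in several locations.
Sustainability
: If you have flexibility on where and when to run your jobs, you might be able to run them at a time and place where plenty of renewable electricity is available, and avoid running them when the power grid is under strain. This can reduce your carbon emissions and allow you to take advantage of cheap power [^42] [^43].
These reasons apply both to services that you write yourself (application code) and to services consisting of off-the-shelf software (such as databases).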
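### Problems with Distributed Systems {#sec_introduction_dist_sys_problems}
Distributed systems also have downsides. Every request and API call that goes via the network needs to deal with the possibility of failure: the network may be interrupted, or the service may be overloaded or crashed, and therefore any request may time out without receiving a response. In this case, we do not know whether the service received the request, and simply retrying it might not be safe. We will discuss these problems in detail in [Chapter 9](/tw/ch9#ch_distributed).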
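One common way to make retries safe is to attach an idempotency key to each logical operation so that the server can recognize duplicates. The text above does not prescribe this technique; the sketch below is an illustration under that assumption, and `PaymentService`, `FlakyNetwork`, and `charge_with_retry` are hypothetical names rather than any real API.

```python
import uuid

class PaymentService:
    """Hypothetical server that deduplicates requests by idempotency key."""

    def __init__(self):
        self.balance = 100
        self.processed = {}  # idempotency key -> result of the first execution

    def charge(self, idempotency_key, amount):
        if idempotency_key in self.processed:
            return self.processed[idempotency_key]  # duplicate: do not charge again
        self.balance -= amount
        self.processed[idempotency_key] = self.balance
        return self.balance

class FlakyNetwork:
    """Simulates a network that delivers the request but loses the first response."""

    def __init__(self, service):
        self.service = service
        self.calls = 0

    def charge(self, idempotency_key, amount):
        self.calls += 1
        result = self.service.charge(idempotency_key, amount)
        if self.calls == 1:
            raise TimeoutError("response lost")
        return result

def charge_with_retry(network, amount, attempts=3):
    key = str(uuid.uuid4())  # one key per logical operation, reused on every retry
    for _ in range(attempts):
        try:
            return network.charge(key, amount)
        except TimeoutError:
            continue  # we cannot tell whether the request was processed
    raise TimeoutError("gave up after retries")

service = PaymentService()
network = FlakyNetwork(service)
final_balance = charge_with_retry(network, 30)
```

Without the key, the retry after the lost response would have charged the account twice; with it, the second request is recognized as a repeat of the first.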
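Although datacenter networks are fast, making a call to another service is still vastly slower than calling a function in the same process [^44].
When operating on large volumes of data, rather than transferring the data from its storage to a separate machine that processes it, it can be faster to bring the computation to the machine that already has the data [^45].
More nodes are not always faster: in some cases, a simple single-threaded program on one computer can perform better than a cluster with over 100 CPU cores [^46].
Troubleshooting a distributed system is often difficult: if the system is slow to respond, how do you figure out where the problem lies? **Observability** [^47] [^48] techniques can be used to diagnose problems in distributed systems; they involve collecting data about the execution of a system and allowing it to be queried in ways that support analysis of both high-level metrics and individual events. **Tracing** tools such as OpenTelemetry, Zipkin, and Jaeger allow you to track which client called which server for which operation, and how long each call took [^49].
Databases provide various mechanisms for ensuring data consistency, as we shall see in [Chapter 6](/tw/ch6#ch_replication) and [Chapter 8](/tw/ch8#ch_transactions). However, when each service has its own database, maintaining consistency of data across those different services becomes the application's problem. Distributed transactions, which we explore in [Chapter 8](/tw/ch8#ch_transactions), are a possible technique for ensuring consistency, but they are rarely used in a microservices context because they run counter to the goal of making services independent from each other, and because many databases do not support them [^50].
For all these reasons, if you can do something on a single machine, this is often much simpler and cheaper than setting up a distributed system [^23] [^46] [^51]. CPUs, memory, and disks have grown larger, faster, and more reliable. When combined with single-node databases such as DuckDB, SQLite, and KùzuDB, many workloads can now run on a single node. We will explore this topic further in [Chapter 4](/tw/ch4#ch_storage).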
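### Microservices and Serverless {#sec_introduction_microservices}
The most common way of distributing a system across multiple machines is to divide them into clients and servers, and let the clients make requests to the servers. Most commonly HTTP is used for this communication, as we will discuss in ["Dataflow Through Services: REST and RPC"](/tw/ch5#sec_encoding_dataflow_rpc). The same process may be both a server (handling incoming requests) and a client (making outbound requests to other services).
This way of building applications has traditionally been called a **service-oriented architecture** (SOA); more recently the idea has been refined into a **microservices** architecture [^52] [^53]. In this architecture, a service has one well-defined purpose (for example, in the case of S3, this would be file storage); each service exposes an API that can be called by clients via the network, and each service has one team that is responsible for its maintenance. A complex application can thus be decomposed into multiple interacting services, each managed by a separate team.
There are several advantages to breaking down a complex piece of software into multiple services: each service can be updated independently, reducing coordination effort among teams; each service can be assigned the hardware resources it needs; and by hiding the implementation details behind an API, the service owners are free to change the implementation without affecting clients. In terms of data storage, it is common for each service to have its own database, and not to share databases between services: sharing a database would effectively make the entire database structure a part of the service's API, and then that structure would be difficult to change. Shared databases could also cause one service's queries to negatively impact the performance of other services.
On the other hand, having many services can itself breed complexity: each service requires infrastructure for deploying new releases, adjusting the allocated hardware resources to match the load, collecting logs, monitoring service health, and alerting an on-call engineer in the case of a problem. **Orchestration** frameworks such as Kubernetes have become a popular way of deploying services, since they provide a foundation for this infrastructure. Testing a service during development can be complicated, since you also need to run all the other services that it depends on.
Evolving microservice APIs can be challenging. Clients that call an API expect the API to have certain fields. Developers might wish to add or remove fields in an API as business needs change, but doing so can cause clients to fail. Worse still, such failures are often not discovered until late in the development cycle, when the updated service API is deployed to a staging or production environment. API description standards such as OpenAPI and gRPC help manage the relationship between client and server APIs; we discuss these further in [Chapter 5](/tw/ch5#ch_encoding).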
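A client can reduce its exposure to such changes by ignoring fields it does not know about and supplying defaults for optional fields that disappear. The following sketch illustrates that idea with plain JSON; the field names and the default value are hypothetical, and it does not replace the schema-based approaches mentioned above.

```python
import json

def parse_user(payload):
    """Client-side parsing that tolerates added and removed optional fields."""
    data = json.loads(payload)
    return {
        "id": data["id"],                       # required: fail loudly if absent
        "name": data.get("name", "(unknown)"),  # optional: fall back to a default
        # Fields this client does not know about are simply ignored.
    }

old_response = '{"id": 1, "name": "Alice"}'
new_response = '{"id": 1, "name": "Alice", "locale": "en-GB"}'  # field added
trimmed_response = '{"id": 1}'                                   # optional field removed

parsed = [parse_user(r) for r in (old_response, new_response, trimmed_response)]
```

Removing a field that the client treats as required, such as `id` here, would still break it, which is why such changes need coordination between the service and its clients.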
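Microservices are primarily a technical solution to a people problem: allowing different teams to make progress independently without having to coordinate with each other. This is valuable in a large company, but in a small company where there are not many teams, using microservices is likely to be unnecessary overhead, and it is preferable to implement the application in the simplest way possible [^52].
**Serverless**, or **function-as-a-service** (FaaS), is another approach to deploying services, in which the management of the infrastructure is outsourced further to a cloud vendor [^33]. When using virtual machines, you have to explicitly choose when to start up or shut down an instance; in contrast, with the serverless model, the cloud provider automatically allocates and frees computing resources based on the incoming requests to your service [^54]. Serverless deployment shifts more of the operational burden to cloud providers and enables billing by usage rather than by machine instance. To offer such benefits, many serverless platforms impose a time limit on function execution, restrict the runtime environment, and may suffer from slow cold starts when a function is first invoked. The term "serverless" can also be misleading: each function execution still runs on a server, but subsequent executions might run on a different one. Moreover, BigQuery and various Kafka offerings have adopted "serverless" terminology to signal that their services auto-scale and bill by usage.
Just as cloud storage replaced traditional capacity planning (deciding in advance how many disks to buy) with metered billing, the serverless approach brings the same billing model to code execution: you only pay for the time that your code is actually running, rather than having to provision fixed resources in advance.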
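### Cloud Computing Versus Supercomputing {#id17}
Cloud computing is not the only way of building large-scale computing systems; an alternative is **high-performance computing** (HPC), also known as **supercomputing**. Although there are overlaps, HPC often has different design priorities and uses different techniques compared to cloud computing and enterprise datacenter systems. Some of those differences are:
* Supercomputers are typically used for computationally intensive scientific computing tasks, such as weather forecasting, climate modeling, molecular dynamics (simulating the movement of atoms and molecules), complex optimization problems, and solving partial differential equations. On the other hand, cloud computing tends to be used for online services, business data systems, and similar systems that need to serve user requests with high availability.
* A supercomputer typically runs large batch jobs that checkpoint the state of their computation to disk from time to time. If a node fails, a common solution is to simply stop the entire cluster workload, repair the faulty node, and then restart the computation from the last checkpoint [^55] [^56]. With cloud services, it is usually not desirable to stop the entire cluster, since the services need to continually serve users with minimal interruptions.
* Supercomputer nodes typically communicate through shared memory and remote direct memory access (RDMA), which support high bandwidth and low latency, but assume a high level of trust among the users of the system [^57]. In cloud computing, the network and the machines are often shared by mutually untrusting organizations, requiring stronger security mechanisms such as resource isolation (e.g., virtual machines), encryption, and authentication.
* Cloud datacenter networks are often based on IP and Ethernet, arranged in Clos topologies to provide high bisection bandwidth, a commonly used measure of a network's overall performance [^55] [^58]. Supercomputers often use specialized network topologies, such as multidimensional meshes and toruses [^59], which yield better performance for HPC workloads with known communication patterns.
* Cloud computing allows nodes to be distributed across multiple geographic regions, whereas supercomputers generally assume that all of their nodes are close together.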
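The checkpoint-and-restart approach mentioned in the list above can be sketched as follows. This is a single-process toy under stated assumptions: the "computation" is just a running sum, the checkpoint is a small JSON file, and the failure is simulated by raising an exception; real HPC checkpointing deals with far larger state spread across many nodes.

```python
import json
import os
import tempfile

def run_job(total_steps, checkpoint_path, crash_at=None, checkpoint_every=10):
    """Sum the integers 0..total_steps-1, checkpointing progress to disk."""
    state = {"next_step": 0, "total": 0}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)  # resume from the last checkpoint

    for step in range(state["next_step"], total_steps):
        if crash_at is not None and step == crash_at:
            raise RuntimeError("simulated node failure")
        state["total"] += step
        state["next_step"] = step + 1
        if state["next_step"] % checkpoint_every == 0:
            tmp = checkpoint_path + ".tmp"
            with open(tmp, "w") as f:
                json.dump(state, f)
            os.replace(tmp, checkpoint_path)  # atomically publish the checkpoint
    return state["total"]

with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "checkpoint.json")
    try:
        run_job(100, path, crash_at=57)
    except RuntimeError:
        pass  # the whole job stops; work done since the last checkpoint is lost
    with open(path) as f:
        resumed_from = json.load(f)["next_step"]
    result = run_job(100, path)  # restart from the last checkpoint
```

The work between the last checkpoint and the failure is redone, and the job is unavailable in the meantime, which is acceptable for a batch computation but not for an online service.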
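Large-scale analytics systems sometimes share some characteristics with supercomputing, which is why it can be worth knowing about these techniques if you are working in this area. However, this book is mostly concerned with services that need to be continually available, as discussed in ["Reliability and Fault Tolerance"](/tw/ch2#sec_introduction_reliability).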
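## Data Systems, Law, and Society {#sec_introduction_compliance}
So far you have seen in this chapter that the architecture of data systems is influenced not only by technical goals and requirements, but also by the human needs of the organizations that they support. Increasingly, data systems engineers are realizing that serving the needs of their own business is not enough: we also have a responsibility toward society at large.
One particular concern is systems that store data about people and their behavior. Since 2018 the **General Data Protection Regulation** (GDPR) has given residents of many European countries greater control and legal rights over their personal data, and similar privacy regulations have been adopted in various other countries and states around the world, for example the California Consumer Privacy Act (CCPA). Regulations around AI, such as the **EU AI Act**, place further restrictions on how personal data can be used.
Moreover, even in areas that are not directly subject to regulation, there is increasing recognition of the effects that computer systems have on people and society. Social media has changed how individuals consume news, which influences their political opinions and hence may affect the outcome of elections. Automated systems increasingly make decisions that have profound consequences for individuals, such as deciding who should be given a loan or insurance coverage, who should be invited to a job interview, or who should be suspected of a crime [^60].
Everyone who works on such systems shares a responsibility for considering the ethical impact and ensuring that they comply with relevant law. It is not necessary for everybody to become an expert in law and ethics, but a basic awareness of legal and ethical principles is just as important as some foundational knowledge of distributed systems.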
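Legal considerations are influencing the very foundations of how data systems are designed [^61]. For example, the GDPR grants individuals the right to have their data erased on request (sometimes known as the **right to be forgotten**). However, as we shall see in this book, many data systems rely on immutable constructs such as append-only logs as part of their design; how can we ensure deletion of some data in the middle of a file that is supposed to be immutable? How do we handle deletion of data that has been incorporated into derived datasets (see ["Systems of Record and Derived Data"](#sec_introduction_derived)), such as training data for machine learning models? Answering these questions creates new engineering challenges.
At present we do not have clear guidelines on which particular technologies or system architectures should be considered "GDPR-compliant". The regulation deliberately does not mandate particular technologies, because these may quickly change as technology progresses. Instead, the legal texts set out high-level principles that are subject to interpretation. This means that there are no simple answers to the question of how to comply with privacy regulation, but we will look at some of the technologies in this book through this lens.
In general, we store data because we think that its value is greater than the cost of storing it. However, it is worth remembering that the cost of storage is not just the bill you pay for Amazon S3 or another service: the cost-benefit calculation should also take into account the risks of liability and reputational damage if the data were to be leaked or compromised by adversaries, and the risk of legal costs and fines if the storage and processing of the data is found not to comply with the law [^51].
Governments or police forces might also compel companies to hand over data. When there is a risk that the data may reveal criminalized behaviors (for example, homosexuality in several Middle Eastern and African countries, or seeking an abortion in several US states), storing that data creates real safety risks for users. Travel to an abortion clinic, for example, could easily be revealed by location data, perhaps even by a log of the user's IP addresses over time (which indicate approximate location).
Once all the risks are taken into account, it might be reasonable to decide that some data is simply not worth storing, and that it should therefore be deleted. This principle of **data minimization** (sometimes known by the German term **Datensparsamkeit**) runs counter to the "big data" philosophy of storing lots of data speculatively in case it turns out to be useful in the future [^62]. But it fits with the GDPR, which mandates that personal data may only be collected for a specified, explicit purpose, that this data may not later be used for any other purpose, and that the data must not be kept for longer than necessary for the purposes for which it was collected [^63].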
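One simple way to act on data minimization is to attach a retention period to each kind of record and to purge whatever has outlived it. The sketch below assumes a hypothetical 30-day period and an in-memory list of records; the appropriate period in practice depends on the purpose of collection and the applicable law, and a real system would also have to purge copies held in backups and derived datasets.

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=30)  # hypothetical retention period

def purge_expired(records, now):
    """Keep only the records that are still within the retention period."""
    return [r for r in records if now - r["collected_at"] <= RETENTION]

now = datetime(2024, 6, 1, tzinfo=timezone.utc)
records = [
    {"user": "a", "collected_at": now - timedelta(days=5)},
    {"user": "b", "collected_at": now - timedelta(days=45)},
]
kept = purge_expired(records, now)
```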
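Businesses have also taken notice of privacy and security concerns. Credit card companies require payment processing businesses to adhere to strict payment card industry (PCI) standards. Processors undergo frequent evaluations from independent auditors to verify continued compliance. Software vendors have also seen increased scrutiny. Many buyers now require their vendors to comply with Service Organization Control (SOC) Type 2 standards. As with PCI compliance, vendors undergo third-party audits to verify adherence.
Generally, it is important to balance the needs of your business against the needs of the people whose data you are collecting and processing. There is much more to this topic; in [Chapter 14](/ch14#ch_right_thing) we will go deeper into ethics and legal compliance, including the problems of bias and discrimination.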
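## Summary {#summary}
The theme of this chapter has been understanding trade-offs: for many questions there is not one right answer, but several different approaches that each have their pros and cons. We explored some of the most important choices that affect the architecture of data systems, and introduced terminology that will be used throughout the rest of this book.
We started by making a distinction between transactional (OLTP) and analytical (OLAP) systems. They not only handle different access patterns and types of data, but also serve different audiences. We also encountered data warehouses and data lakes, which receive data from transactional systems via ETL. In [Chapter 4](/tw/ch4#ch_storage) we will see that transactional and analytical systems often use very different internal data layouts because of the different types of queries they need to serve.
We then compared cloud services, a comparatively recent development, to the self-hosted paradigm that has long dominated data systems architecture. Which of these approaches is more cost-effective depends a lot on your particular situation, but it is undeniable that cloud-native approaches are bringing big changes to the way data systems are architected, for example in the way they separate storage and compute.
Cloud systems are intrinsically distributed, and we briefly examined some of the trade-offs of distributed systems compared to using a single machine. There are situations in which you cannot avoid going distributed, but it is advisable not to rush into making a system distributed if it is possible to keep it on a single machine. In [Chapter 9](/tw/ch9#ch_distributed) we will cover the challenges of distributed systems in more detail.
Finally, we saw that data systems architecture is determined not only by the needs of the business deploying the system, but also by privacy regulation that protects the rights of the people whose data is being processed, an aspect that engineering practice often overlooks. How to translate legal requirements into technical implementations does not yet have a standard answer, but it is important to keep this question in mind as we move through the rest of this book.
### References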
[^1]: Richard T. Kouzes, Gordon A. Anderson, Stephen T. Elbert, Ian Gorton, and Deborah K. Gracio. [The Changing Paradigm of Data-Intensive Computing](http://www2.ic.uff.br/~boeres/slides_AP/papers/TheChanginParadigmDataIntensiveComputing_2009.pdf). *IEEE Computer*, volume 42, issue 1, January 2009. [doi:10.1109/MC.2009.26](https://doi.org/10.1109/MC.2009.26)
[^2]: Martin Kleppmann, Adam Wiggins, Peter van Hardenberg, and Mark McGranaghan. [Local-first software: you own your data, in spite of the cloud](https://www.inkandswitch.com/local-first/). At *2019 ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software* (Onward!), October 2019. [doi:10.1145/3359591.3359737](https://doi.org/10.1145/3359591.3359737)
[^3]: Joe Reis and Matt Housley. [*Fundamentals of Data Engineering*](https://www.oreilly.com/library/view/fundamentals-of-data/9781098108298/). O’Reilly Media, 2022. ISBN: 9781098108304
[^4]: Rui Pedro Machado and Helder Russa. [*Analytics Engineering with SQL and dbt*](https://www.oreilly.com/library/view/analytics-engineering-with/9781098142377/). O’Reilly Media, 2023. ISBN: 9781098142384
[^5]: Edgar F. Codd, S. B. Codd, and C. T. Salley. [Providing OLAP to User-Analysts: An IT Mandate](https://www.estgv.ipv.pt/PaginasPessoais/jloureiro/ESI_AID2007_2008/fichas/codd.pdf). E. F. Codd Associates, 1993. Archived at [perma.cc/RKX8-2GEE](https://perma.cc/RKX8-2GEE)
[^6]: Chinmay Soman and Neha Pawar. [Comparing Three Real-Time OLAP Databases: Apache Pinot, Apache Druid, and ClickHouse](https://startree.ai/blog/a-tale-of-three-real-time-olap-databases). *startree.ai*, April 2023. Archived at [perma.cc/8BZP-VWPA](https://perma.cc/8BZP-VWPA)
[^7]: Surajit Chaudhuri and Umeshwar Dayal. [An Overview of Data Warehousing and OLAP Technology](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/sigrecord.pdf). *ACM SIGMOD Record*, volume 26, issue 1, pages 65–74, March 1997. [doi:10.1145/248603.248616](https://doi.org/10.1145/248603.248616)
[^8]: Fatma Özcan, Yuanyuan Tian, and Pinar Tözün. [Hybrid Transactional/Analytical Processing: A Survey](https://humming80.github.io/papers/sigmod-htaptut.pdf). At *ACM International Conference on Management of Data* (SIGMOD), May 2017. [doi:10.1145/3035918.3054784](https://doi.org/10.1145/3035918.3054784)
[^9]: Adam Prout, Szu-Po Wang, Joseph Victor, Zhou Sun, Yongzhu Li, Jack Chen, Evan Bergeron, Eric Hanson, Robert Walzer, Rodrigo Gomes, and Nikita Shamgunov. [Cloud-Native Transactions and Analytics in SingleStore](https://dl.acm.org/doi/abs/10.1145/3514221.3526055). At *International Conference on Management of Data* (SIGMOD), June 2022. [doi:10.1145/3514221.3526055](https://doi.org/10.1145/3514221.3526055)
[^10]: Chao Zhang, Guoliang Li, Jintao Zhang, Xinning Zhang, and Jianhua Feng. [HTAP Databases: A Survey](https://arxiv.org/pdf/2404.15670). *IEEE Transactions on Knowledge and Data Engineering*, April 2024. [doi:10.1109/TKDE.2024.3389693](https://doi.org/10.1109/TKDE.2024.3389693)
[^11]: Michael Stonebraker and Uğur Çetintemel. [‘One Size Fits All’: An Idea Whose Time Has Come and Gone](https://pages.cs.wisc.edu/~shivaram/cs744-readings/fits_all.pdf). At *21st International Conference on Data Engineering* (ICDE), April 2005. [doi:10.1109/ICDE.2005.1](https://doi.org/10.1109/ICDE.2005.1)
[^12]: Jeffrey Cohen, Brian Dolan, Mark Dunlap, Joseph M. Hellerstein, and Caleb Welton. [MAD Skills: New Analysis Practices for Big Data](https://www.vldb.org/pvldb/vol2/vldb09-219.pdf). *Proceedings of the VLDB Endowment*, volume 2, issue 2, pages 1481–1492, August 2009. [doi:10.14778/1687553.1687576](https://doi.org/10.14778/1687553.1687576)
[^13]: Dan Olteanu. [The Relational Data Borg is Learning](https://www.vldb.org/pvldb/vol13/p3502-olteanu.pdf). *Proceedings of the VLDB Endowment*, volume 13, issue 12, August 2020. [doi:10.14778/3415478.3415572](https://doi.org/10.14778/3415478.3415572)
[^14]: Matt Bornstein, Martin Casado, and Jennifer Li. [Emerging Architectures for Modern Data Infrastructure: 2020](https://future.a16z.com/emerging-architectures-for-modern-data-infrastructure-2020/). *future.a16z.com*, October 2020. Archived at [perma.cc/LF8W-KDCC](https://perma.cc/LF8W-KDCC)
[^15]: Martin Fowler. [DataLake](https://www.martinfowler.com/bliki/DataLake.html). *martinfowler.com*, February 2015. Archived at [perma.cc/4WKN-CZUK](https://perma.cc/4WKN-CZUK)
[^16]: Bobby Johnson and Joseph Adler. [The Sushi Principle: Raw Data Is Better](https://learning.oreilly.com/videos/strata-hadoop/9781491924143/9781491924143-video210840/). At *Strata+Hadoop World*, February 2015.
[^17]: Michael Armbrust, Ali Ghodsi, Reynold Xin, and Matei Zaharia. [Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics](https://www.cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf). At *11th Annual Conference on Innovative Data Systems Research* (CIDR), January 2021.
[^18]: DataKitchen, Inc. [The DataOps Manifesto](https://dataopsmanifesto.org/en/). *dataopsmanifesto.org*, 2017. Archived at [perma.cc/3F5N-FUQ4](https://perma.cc/3F5N-FUQ4)
[^19]: Tejas Manohar. [What is Reverse ETL: A Definition & Why It’s Taking Off](https://hightouch.io/blog/reverse-etl/). *hightouch.io*, November 2021. Archived at [perma.cc/A7TN-GLYJ](https://perma.cc/A7TN-GLYJ)
[^20]: Simon O’Regan. [Designing Data Products](https://towardsdatascience.com/designing-data-products-b6b93edf3d23). *towardsdatascience.com*, August 2018. Archived at [perma.cc/HU67-3RV8](https://perma.cc/HU67-3RV8)
[^21]: Camille Fournier. [Why is it so hard to decide to buy?](https://skamille.medium.com/why-is-it-so-hard-to-decide-to-buy-d86fee98e88e) *skamille.medium.com*, July 2021. Archived at [perma.cc/6VSG-HQ5X](https://perma.cc/6VSG-HQ5X)
[^22]: David Heinemeier Hansson. [Why we’re leaving the cloud](https://world.hey.com/dhh/why-we-re-leaving-the-cloud-654b47e0). *world.hey.com*, October 2022. Archived at [perma.cc/82E6-UJ65](https://perma.cc/82E6-UJ65)
[^23]: Nima Badizadegan. [Use One Big Server](https://specbranch.com/posts/one-big-server/). *specbranch.com*, August 2022. Archived at [perma.cc/M8NB-95UK](https://perma.cc/M8NB-95UK)
[^24]: Steve Yegge. [Dear Google Cloud: Your Deprecation Policy is Killing You](https://steve-yegge.medium.com/dear-google-cloud-your-deprecation-policy-is-killing-you-ee7525dc05dc). *steve-yegge.medium.com*, August 2020. Archived at [perma.cc/KQP9-SPGU](https://perma.cc/KQP9-SPGU)
[^25]: Alexandre Verbitski, Anurag Gupta, Debanjan Saha, Murali Brahmadesam, Kamal Gupta, Raman Mittal, Sailesh Krishnamurthy, Sandor Maurice, Tengiz Kharatishvili, and Xiaofeng Bao. [Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases](https://media.amazonwebservices.com/blog/2017/aurora-design-considerations-paper.pdf). At *ACM International Conference on Management of Data* (SIGMOD), pages 1041–1052, May 2017. [doi:10.1145/3035918.3056101](https://doi.org/10.1145/3035918.3056101)
[^26]: Panagiotis Antonopoulos, Alex Budovski, Cristian Diaconu, Alejandro Hernandez Saenz, Jack Hu, Hanuma Kodavalla, Donald Kossmann, Sandeep Lingam, Umar Farooq Minhas, Naveen Prakash, Vijendra Purohit, Hugh Qu, Chaitanya Sreenivas Ravella, Krystyna Reisteter, Sheetal Shrotri, Dixin Tang, and Vikram Wakade. [Socrates: The New SQL Server in the Cloud](https://www.microsoft.com/en-us/research/uploads/prod/2019/05/socrates.pdf). At *ACM International Conference on Management of Data* (SIGMOD), pages 1743–1756, June 2019. [doi:10.1145/3299869.3314047](https://doi.org/10.1145/3299869.3314047)
[^27]: Midhul Vuppalapati, Justin Miron, Rachit Agarwal, Dan Truong, Ashish Motivala, and Thierry Cruanes. [Building An Elastic Query Engine on Disaggregated Storage](https://www.usenix.org/system/files/nsdi20-paper-vuppalapati.pdf). At *17th USENIX Symposium on Networked Systems Design and Implementation* (NSDI), February 2020.
[^28]: Nick Van Wiggeren. [The Real Failure Rate of EBS](https://planetscale.com/blog/the-real-fail-rate-of-ebs). *planetscale.com*, March 2025. Archived at [perma.cc/43CR-SAH5](https://perma.cc/43CR-SAH5)
[^29]: Colin Breck. [Predicting the Future of Distributed Systems](https://blog.colinbreck.com/predicting-the-future-of-distributed-systems/). *blog.colinbreck.com*, August 2024. Archived at [perma.cc/K5FC-4XX2](https://perma.cc/K5FC-4XX2)
[^30]: Gwen Shapira. [Compute-Storage Separation Explained](https://www.thenile.dev/blog/storage-compute). *thenile.dev*, January 2023. Archived at [perma.cc/QCV3-XJNZ](https://perma.cc/QCV3-XJNZ)
[^31]: Ravi Murthy and Gurmeet Goindi. [AlloyDB for PostgreSQL under the hood: Intelligent, database-aware storage](https://cloud.google.com/blog/products/databases/alloydb-for-postgresql-intelligent-scalable-storage). *cloud.google.com*, May 2022. Archived at [archive.org](https://web.archive.org/web/20220514021120/https%3A//cloud.google.com/blog/products/databases/alloydb-for-postgresql-intelligent-scalable-storage)
[^32]: Jack Vanlightly. [The Architecture of Serverless Data Systems](https://jack-vanlightly.com/blog/2023/11/14/the-architecture-of-serverless-data-systems). *jack-vanlightly.com*, November 2023. Archived at [perma.cc/UDV4-TNJ5](https://perma.cc/UDV4-TNJ5)
[^33]: Eric Jonas, Johann Schleier-Smith, Vikram Sreekanti, Chia-Che Tsai, Anurag Khandelwal, Qifan Pu, Vaishaal Shankar, Joao Carreira, Karl Krauth, Neeraja Yadwadkar, Joseph E. Gonzalez, Raluca Ada Popa, Ion Stoica, David A. Patterson. [Cloud Programming Simplified: A Berkeley View on Serverless Computing](https://arxiv.org/abs/1902.03383). *arxiv.org*, February 2019.
[^34]: Betsy Beyer, Jennifer Petoff, Chris Jones, and Niall Richard Murphy. [*Site Reliability Engineering: How Google Runs Production Systems*](https://www.oreilly.com/library/view/site-reliability-engineering/9781491929117/). O’Reilly Media, 2016. ISBN: 9781491929124
[^35]: Thomas Limoncelli. [The Time I Stole $10,000 from Bell Labs](https://queue.acm.org/detail.cfm?id=3434773). *ACM Queue*, volume 18, issue 5, November 2020. [doi:10.1145/3434571.3434773](https://doi.org/10.1145/3434571.3434773)
[^36]: Charity Majors. [The Future of Ops Jobs](https://acloudguru.com/blog/engineering/the-future-of-ops-jobs). *acloudguru.com*, August 2020. Archived at [perma.cc/GRU2-CZG3](https://perma.cc/GRU2-CZG3)
[^37]: Boris Cherkasky. [(Over)Pay As You Go for Your Datastore](https://medium.com/riskified-technology/over-pay-as-you-go-for-your-datastore-11a29ae49a8b). *medium.com*, September 2021. Archived at [perma.cc/Q8TV-2AM2](https://perma.cc/Q8TV-2AM2)
[^38]: Shlomi Kushchi. [Serverless Doesn’t Mean DevOpsLess or NoOps](https://thenewstack.io/serverless-doesnt-mean-devopsless-or-noops/). *thenewstack.io*, February 2023. Archived at [perma.cc/3NJR-AYYU](https://perma.cc/3NJR-AYYU)
[^39]: Erik Bernhardsson. [Storm in the stratosphere: how the cloud will be reshuffled](https://erikbern.com/2021/11/30/storm-in-the-stratosphere-how-the-cloud-will-be-reshuffled.html). *erikbern.com*, November 2021. Archived at [perma.cc/SYB2-99P3](https://perma.cc/SYB2-99P3)
[^40]: Benn Stancil. [The data OS](https://benn.substack.com/p/the-data-os). *benn.substack.com*, September 2021. Archived at [perma.cc/WQ43-FHS6](https://perma.cc/WQ43-FHS6)
[^41]: Maria Korolov. [Data residency laws pushing companies toward residency as a service](https://www.csoonline.com/article/3647761/data-residency-laws-pushing-companies-toward-residency-as-a-service.html). *csoonline.com*, January 2022. Archived at [perma.cc/CHE4-XZZ2](https://perma.cc/CHE4-XZZ2)
[^42]: Severin Borenstein. [Can Data Centers Flex Their Power Demand?](https://energyathaas.wordpress.com/2025/04/14/can-data-centers-flex-their-power-demand/) *energyathaas.wordpress.com*, April 2025. Archived at
[^43]: Bilge Acun, Benjamin Lee, Fiodar Kazhamiaka, Aditya Sundarrajan, Kiwan Maeng, Manoj Chakkaravarthy, David Brooks, and Carole-Jean Wu. [Carbon Dependencies in Datacenter Design and Management](https://hotcarbon.org/assets/2022/pdf/hotcarbon22-acun.pdf). *ACM SIGENERGY Energy Informatics Review*, volume 3, issue 3, pages 21–26. [doi:10.1145/3630614.3630619](https://doi.org/10.1145/3630614.3630619)
[^44]: Kousik Nath. [These are the numbers every computer engineer should know](https://www.freecodecamp.org/news/must-know-numbers-for-every-computer-engineer/). *freecodecamp.org*, September 2019. Archived at [perma.cc/RW73-36RL](https://perma.cc/RW73-36RL)
[^45]: Joseph M. Hellerstein, Jose Faleiro, Joseph E. Gonzalez, Johann Schleier-Smith, Vikram Sreekanti, Alexey Tumanov, and Chenggang Wu. [Serverless Computing: One Step Forward, Two Steps Back](https://arxiv.org/abs/1812.03651). At *Conference on Innovative Data Systems Research* (CIDR), January 2019.
[^46]: Frank McSherry, Michael Isard, and Derek G. Murray. [Scalability! But at What COST?](https://www.usenix.org/system/files/conference/hotos15/hotos15-paper-mcsherry.pdf) At *15th USENIX Workshop on Hot Topics in Operating Systems* (HotOS), May 2015.
[^47]: Cindy Sridharan. *[Distributed Systems Observability: A Guide to Building Robust Systems](https://unlimited.humio.com/rs/756-LMY-106/images/Distributed-Systems-Observability-eBook.pdf)*. Report, O’Reilly Media, May 2018. Archived at [perma.cc/M6JL-XKCM](https://perma.cc/M6JL-XKCM)
[^48]: Charity Majors. [Observability — A 3-Year Retrospective](https://thenewstack.io/observability-a-3-year-retrospective/). *thenewstack.io*, August 2019. Archived at [perma.cc/CG62-TJWL](https://perma.cc/CG62-TJWL)
[^49]: Benjamin H. Sigelman, Luiz André Barroso, Mike Burrows, Pat Stephenson, Manoj Plakal, Donald Beaver, Saul Jaspan, and Chandan Shanbhag. [Dapper, a Large-Scale Distributed Systems Tracing Infrastructure](https://research.google/pubs/pub36356/). Google Technical Report dapper-2010-1, April 2010. Archived at [perma.cc/K7KU-2TMH](https://perma.cc/K7KU-2TMH)
[^50]: Rodrigo Laigner, Yongluan Zhou, Marcos Antonio Vaz Salles, Yijian Liu, and Marcos Kalinowski. [Data management in microservices: State of the practice, challenges, and research directions](https://www.vldb.org/pvldb/vol14/p3348-laigner.pdf). *Proceedings of the VLDB Endowment*, volume 14, issue 13, pages 3348–3361, September 2021. [doi:10.14778/3484224.3484232](https://doi.org/10.14778/3484224.3484232)
[^51]: Jordan Tigani. [Big Data is Dead](https://motherduck.com/blog/big-data-is-dead/). *motherduck.com*, February 2023. Archived at [perma.cc/HT4Q-K77U](https://perma.cc/HT4Q-K77U)
[^52]: Sam Newman. [*Building Microservices*, second edition](https://www.oreilly.com/library/view/building-microservices-2nd/9781492034018/). O’Reilly Media, 2021. ISBN: 9781492034025
[^53]: Chris Richardson. [Microservices: Decomposing Applications for Deployability and Scalability](https://www.infoq.com/articles/microservices-intro/). *infoq.com*, May 2014. Archived at [perma.cc/CKN4-YEQ2](https://perma.cc/CKN4-YEQ2)
[^54]: Mohammad Shahrad, Rodrigo Fonseca, Íñigo Goiri, Gohar Chaudhry, Paul Batum, Jason Cooke, Eduardo Laureano, Colby Tresness, Mark Russinovich, Ricardo Bianchini. [Serverless in the Wild: Characterizing and Optimizing the Serverless Workload at a Large Cloud Provider](https://www.usenix.org/system/files/atc20-shahrad.pdf). At *USENIX Annual Technical Conference* (ATC), July 2020.
[^55]: Luiz André Barroso, Urs Hölzle, and Parthasarathy Ranganathan. [The Datacenter as a Computer: Designing Warehouse-Scale Machines](https://www.morganclaypool.com/doi/10.2200/S00874ED3V01Y201809CAC046), third edition. Morgan & Claypool Synthesis Lectures on Computer Architecture, October 2018. [doi:10.2200/S00874ED3V01Y201809CAC046](https://doi.org/10.2200/S00874ED3V01Y201809CAC046)
[^56]: David Fiala, Frank Mueller, Christian Engelmann, Rolf Riesen, Kurt Ferreira, and Ron Brightwell. [Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing](https://arcb.csc.ncsu.edu/~mueller/ftp/pub/mueller/papers/sc12.pdf). At *International Conference for High Performance Computing, Networking, Storage and Analysis* (SC), November 2012. [doi:10.1109/SC.2012.49](https://doi.org/10.1109/SC.2012.49)
[^57]: Anna Kornfeld Simpson, Adriana Szekeres, Jacob Nelson, and Irene Zhang. [Securing RDMA for High-Performance Datacenter Storage Systems](https://www.usenix.org/conference/hotcloud20/presentation/kornfeld-simpson). At *12th USENIX Workshop on Hot Topics in Cloud Computing* (HotCloud), July 2020.
[^58]: Arjun Singh, Joon Ong, Amit Agarwal, Glen Anderson, Ashby Armistead, Roy Bannon, Seb Boving, Gaurav Desai, Bob Felderman, Paulie Germano, Anand Kanagala, Jeff Provost, Jason Simmons, Eiichi Tanda, Jim Wanderer, Urs Hölzle, Stephen Stuart, and Amin Vahdat. [Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google’s Datacenter Network](https://conferences.sigcomm.org/sigcomm/2015/pdf/papers/p183.pdf). At *Annual Conference of the ACM Special Interest Group on Data Communication* (SIGCOMM), August 2015. [doi:10.1145/2785956.2787508](https://doi.org/10.1145/2785956.2787508)
[^59]: Glenn K. Lockwood. [Hadoop’s Uncomfortable Fit in HPC](https://blog.glennklockwood.com/2014/05/hadoops-uncomfortable-fit-in-hpc.html). *glennklockwood.blogspot.co.uk*, May 2014. Archived at [perma.cc/S8XX-Y67B](https://perma.cc/S8XX-Y67B)
[^60]: Cathy O’Neil. *Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy*. Crown Publishing, 2016. ISBN: 9780553418811
[^61]: Supreeth Shastri, Vinay Banakar, Melissa Wasserman, Arun Kumar, and Vijay Chidambaram. [Understanding and Benchmarking the Impact of GDPR on Database Systems](https://www.vldb.org/pvldb/vol13/p1064-shastri.pdf). *Proceedings of the VLDB Endowment*, volume 13, issue 7, pages 1064–1077, March 2020. [doi:10.14778/3384345.3384354](https://doi.org/10.14778/3384345.3384354)
[^62]: Martin Fowler. [Datensparsamkeit](https://www.martinfowler.com/bliki/Datensparsamkeit.html). *martinfowler.com*, December 2013. Archived at [perma.cc/R9QX-CME6](https://perma.cc/R9QX-CME6)
[^63]: [Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 (General Data Protection Regulation)](https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=CELEX:32016R0679&from=EN). *Official Journal of the European Union* L 119/1, May 2016.
================================================
FILE: content/tw/ch10.md
================================================
---
title: "10. Consistency and Consensus"
weight: 210
breadcrumbs: false
---

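> *An ancient adage warns: "Never go to sea with two chronometers; take one or three."*
>
> Frederick P. Brooks, *The Mythical Man-Month: Essays on Software Engineering* (1995)
As discussed in [Chapter 9](/tw/ch9), many things can go wrong in distributed systems. If we want a service to keep working correctly despite those problems, we need to find ways of tolerating faults.
One of the best tools we have for fault tolerance is *replication*. However, as we saw in [Chapter 6](/tw/ch6), keeping multiple copies of the data on multiple replicas opens up the risk of inconsistencies. A read might be handled by a replica that is not up to date, yielding a stale result. If multiple replicas can accept writes, we have to deal with conflicts between values that were written concurrently on different replicas. At a high level, there are two competing philosophies for dealing with these issues:
Eventual consistency
: In this philosophy, the fact that the system is replicated is visible to the application, and you, the application developer, have to deal with the inconsistencies and conflicts that may arise. This approach is often used in systems with multi-leader replication (see ["Multi-Leader Replication"](/tw/ch6#sec_replication_multi_leader)) and leaderless replication (see ["Leaderless Replication"](/tw/ch6#sec_replication_leaderless)).
Strong consistency
: This philosophy holds that applications should not have to worry about the internal details of replication, and that the system should behave as if it were a single node. The advantage is that it is simpler for you, the application developer. The disadvantage is that stronger consistency has a performance cost, and some kinds of fault that an eventually consistent system can tolerate will cause outages in a strongly consistent system.
As always, which approach is better depends on your application. If users can make changes to data while offline, eventual consistency is unavoidable, as discussed in ["Sync Engines and Local-First Software"](/tw/ch6#sec_replication_offline_clients). However, eventual consistency can also be hard for applications to deal with. If your replicas are located in datacenters with fast, reliable communication, strong consistency is often appropriate because its cost is acceptable.
In this chapter we will dive deeper into the strongly consistent approach, focusing on three areas:
1. One challenge is that "strong consistency" is quite vague, so we will develop a more precise definition of what we want to achieve: *linearizability*.
2. We will look at the problem of generating IDs and timestamps. This may sound unrelated to consistency, but it is actually closely connected.
3. We will explore how distributed systems can achieve linearizability while remaining fault-tolerant; the answer is *consensus* algorithms.
Along the way, we will see that there are fundamental limits on what is and is not possible in a distributed system.
The topics of this chapter are notorious for being hard to get right. It is very easy to build a system that behaves well when there are no faults but falls apart completely when faced with an unlucky combination of faults that its designer did not consider. A large body of theory has been developed to help us think through these edge cases, and it enables us to build systems that tolerate faults robustly.
This chapter only scratches the surface: we will stick with informal intuitions and avoid algorithmic details, formal models, and proofs. If you want to do serious work on consensus systems and similar infrastructure, you will need to study the theory in much more depth to have any chance of making your system robust. As usual, the literature references in this chapter provide some initial pointers.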
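## Linearizability {#sec_consistency_linearizability}
If you want a replicated database to be as simple to use as possible, you should make it behave as if it were not replicated at all. Users then do not have to worry about replication lag, conflicts, and other inconsistencies. That gives us the advantage of fault tolerance without the complexity of having to think about multiple replicas.
This is the idea behind *linearizability* [^1] (also known as *atomic consistency* [^2], *strong consistency*, *immediate consistency*, or *external consistency* [^3]). The exact definition of linearizability is quite subtle, and we will explore it in the rest of this section. The basic idea, however, is to make a system appear as if there were only one copy of the data and all operations on it were atomic. With this guarantee, even though there may be multiple replicas in reality, the application does not need to worry about them.
In a linearizable system, as soon as one client successfully completes a write, all clients reading from the database must be able to see the value just written. Maintaining the illusion of a single copy of the data means guaranteeing that the value read is the most recent one, and does not come from a stale cache or replica. In other words, linearizability is a *recency guarantee*. To clarify this idea, let's look at an example of a system that is not linearizable.
{{< figure src="/fig/ddia_1001.png" id="fig_consistency_linearizability_0" caption="Figure 10-1. This system is not linearizable, which leaves sports fans confused." class="w-full my-4" >}}
[Figure 10-1](#fig_consistency_linearizability_0) shows an example of a nonlinearizable sports website [^4]. Aaliyah and Bryce are sitting in the same room, both checking their phones to find out the outcome of a game their favorite team is playing. Just after the final score is announced, Aaliyah refreshes the page, sees the winner announced, and excitedly tells Bryce about it. Bryce incredulously hits *reload* on his own phone, but his request goes to a database replica that is lagging, so his phone shows that the game is still ongoing.
If Aaliyah and Bryce had hit reload at the same time, it would have been less surprising for them to get two different query results, because they do not know the exact time at which their respective requests were processed by the server. However, Bryce knows that he hit the reload button (initiated his query) *after* he heard Aaliyah exclaim the final score, so he expects his query result to be at least as recent as Aaliyah's. The fact that his query returned a stale result is a violation of linearizability.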
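### What Makes a System Linearizable? {#sec_consistency_lin_definition}
To understand linearizability better, let's look at some more examples. [Figure 10-2](#fig_consistency_linearizability_1) shows three clients concurrently reading and writing the same object *x* in a linearizable database. In distributed systems theory, *x* is called a *register*; in practice, it could be one key in a key-value store, one row in a relational database, or one document in a document database, for example.
{{< figure src="/fig/ddia_1002.png" id="fig_consistency_linearizability_1" caption="Figure 10-2. If a read request is concurrent with a write request, it may return either the old or the new value." class="w-full my-4" >}}
For simplicity, [Figure 10-2](#fig_consistency_linearizability_1) shows only the requests from the clients' point of view, not the internals of the database. Each bar is a request made by a client, where the start of the bar is the time when the request was sent and the end of the bar is the time when the client received the response. Because network delays are variable, a client does not know exactly when the database processed its request; it only knows that this must have happened sometime between sending the request and receiving the response.
In this example, the register has two types of operations:
* *read*(*x*) ⇒ *v* means the client requested to read the value of register *x*, and the database returned the value *v*.
* *write*(*x*, *v*) ⇒ *r* means the client requested to set register *x* to the value *v*, and the database returned the response *r* (which could be *ok* or *error*).
In [Figure 10-2](#fig_consistency_linearizability_1), the value of *x* is initially 0, and client C performs a write request to set it to 1. While this is happening, clients A and B repeatedly poll the database to read the latest value. What responses might A and B get for their read requests?
* The first read operation by client A completes before the write begins, so it must definitely return the old value 0.
* The last read by client A begins after the write has completed, so it must definitely return the new value 1 if the database is linearizable, because the read must have been processed after the write.
* Any read operation that overlaps in time with the write operation might return either 0 or 1, because we do not know whether the write had taken effect at the time the read was processed. These operations are *concurrent* with the write.
However, that is not yet sufficient to fully describe linearizability: if reads that are concurrent with a write can return either the old or the new value, then readers could see the value flip back and forth between old and new several times while the write is in progress. That is not what we expect of a system that emulates a "single copy of the data."
To make the system linearizable, we need to add another constraint, illustrated in [Figure 10-3](#fig_consistency_linearizability_2).
{{< figure src="/fig/ddia_1003.png" id="fig_consistency_linearizability_2" caption="Figure 10-3. After any one read has returned the new value, all subsequent reads (by the same or other clients) must also return the new value." class="w-full my-4" >}}
In a linearizable system we imagine that there must be some point in time (between the start and end of the write operation) at which the value of *x* atomically flips from 0 to 1. Thus, if one client's read returns the new value 1, all subsequent reads must also return the new value, even if the write operation has not yet completed.
This timing dependency is illustrated with an arrow in [Figure 10-3](#fig_consistency_linearizability_2). Client A is the first to read the new value, 1. Just after A's read returns, B begins a new read. Since B's read occurs strictly after A's read, it must also return 1, even though the write by C is still ongoing. (It is the same situation as with Aaliyah and Bryce in [Figure 10-1](#fig_consistency_linearizability_0): after Aaliyah has read the new value, Bryce also expects to read the new value.)
We can further refine this timing diagram to visualize each operation taking effect atomically at some point in time [^5], as in the more complex example shown in [Figure 10-4](#fig_consistency_linearizability_3). In this example we add a third type of operation besides *read* and *write*:
* *cas*(*x*, *v_old*, *v_new*) ⇒ *r* means the client requested an atomic *compare-and-set* operation (see ["Conditional Writes (Compare-and-Set)"](/tw/ch8#sec_transactions_compare_and_set)). If the current value of register *x* equals *v_old*, it should be atomically set to *v_new*. If the value of *x* differs from *v_old*, the operation should leave the register unchanged and return an error. *r* is the database's response (*ok* or *error*).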
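The three operations can be sketched as a single in-memory register. This is only a minimal illustration: the class name `Register` and the use of a lock to make each operation atomic are assumptions of this sketch, and a single object in one process says nothing about how a replicated system achieves the same behavior.

```python
import threading


class Register:
    """A single-node register; the lock makes every operation atomic."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def read(self):
        with self._lock:
            return self._value

    def write(self, value):
        with self._lock:
            self._value = value
            return "ok"

    def cas(self, expected, new):
        # Set the register to `new` only if it currently holds `expected`.
        with self._lock:
            if self._value != expected:
                return "error"
            self._value = new
            return "ok"


x = Register(0)
results = [x.write(1), x.cas(1, 2), x.cas(0, 3), x.read()]
```

The second `cas` fails because the register no longer holds 0 by the time it runs, which mirrors the failing *cas* request discussed for Figure 10-4.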
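Each operation in [Figure 10-4](#fig_consistency_linearizability_3) is marked with a vertical line (inside the bar for each operation) at the time when we think the operation was executed. Those markers are joined up in sequential order, and the result must be a valid sequence of reads and writes for a register (every read must return the value set by the most recent write).
The requirement of linearizability is that the lines joining up the operation markers always move forward in time (from left to right), never backward. This requirement ensures the recency guarantee we discussed earlier: once a new value has been written or read, all subsequent reads see the value that was written, until it is overwritten again.
{{< figure src="/fig/ddia_1004.png" id="fig_consistency_linearizability_3" caption="Figure 10-4. Visualizing the points in time at which the reads and writes appear to take effect. The final read by B is not linearizable." class="w-full my-4" >}}
There are a few interesting details to point out in [Figure 10-4](#fig_consistency_linearizability_3):
* First client B sent a request to read *x*, then client D sent a request to set *x* to 0, and then client A sent a request to set *x* to 1. Nevertheless, the value returned to B's read is 1 (the value written by A). This is okay: it means that the database first processed D's write, then A's write, and finally B's read. Although this is not the order in which the requests were sent, it is an acceptable order, because the three requests are concurrent. Perhaps B's read request was slightly delayed in the network, so it reached the database only after the two writes.
* Client B's read returned 1 before client A received its response from the database saying that the write of the value 1 was successful. This is also okay: it simply means that the *ok* response from the database to client A was slightly delayed in the network.
* This model does not assume any transaction isolation: another client may change a value at any time. For example, C first reads 1 and then reads 2, because the value was changed by B between the two reads. An atomic compare-and-set (*cas*) operation can be used to check that the value has not been concurrently changed by another client: B's and C's *cas* requests succeed, but D's *cas* request fails (by the time the database processes it, the value of *x* is no longer 0).
* The final read by client B (in a shaded bar) is not linearizable. The operation is concurrent with C's *cas* write, which updates *x* from 2 to 4. In the absence of other requests, it would be okay for B's read to return 2. However, client A has already read the new value 4 before B's read started, so B is not allowed to read an older value than A. Again, it is the same situation as with Aaliyah and Bryce in [Figure 10-1](#fig_consistency_linearizability_0).
That is the intuition behind linearizability; the formal definition [^1] describes it more precisely. It is possible (though computationally expensive) to test whether a system's behavior is linearizable by recording the timings of all requests and responses, and checking whether they can be arranged into a valid sequential order [^6] [^7].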
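To make the idea of checking a recorded history concrete, here is a brute-force sketch for a register that supports only reads and writes. The `Op` record, its field names, and the helper functions are hypothetical names invented for this illustration. The search tries every permutation of the operations, so it is only usable for tiny histories; practical checkers use much smarter search strategies.

```python
from itertools import permutations
from typing import Any, NamedTuple


class Op(NamedTuple):
    start: int   # time at which the client sent the request
    end: int     # time at which the client received the response
    kind: str    # "read" or "write"
    value: Any   # value written, or value returned by the read


def is_valid_sequence(ops, initial):
    """Every read must return the value set by the most recent write."""
    current = initial
    for op in ops:
        if op.kind == "write":
            current = op.value
        elif op.value != current:
            return False
    return True


def is_linearizable(history, initial=0):
    """Try every total order of the operations (factorial cost)."""
    n = len(history)
    for order in permutations(range(n)):
        position = {op_index: pos for pos, op_index in enumerate(order)}
        # Real-time constraint: an operation that finished before another
        # one started must also come first in the total order.
        respects_time = all(
            position[i] < position[j]
            for i in range(n)
            for j in range(n)
            if history[i].end < history[j].start
        )
        if respects_time and is_valid_sequence(
            [history[i] for i in order], initial
        ):
            return True
    return False


# C's write of 1 overlaps both reads; A's read finishes before B's begins.
write_c = Op(start=1, end=10, kind="write", value=1)
read_a = Op(start=2, end=4, kind="read", value=1)
fresh_b = Op(start=5, end=7, kind="read", value=1)
stale_b = Op(start=5, end=7, kind="read", value=0)

ok_history = is_linearizable([write_c, read_a, fresh_b])
bad_history = is_linearizable([write_c, read_a, stale_b])
```

The second history is rejected because no ordering can place A's read of 1 before B's read of 0 while still having every read return the latest written value.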
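Just as there are various weak isolation levels for transactions besides serializability (see ["Weak Isolation Levels"](/tw/ch8#sec_transactions_isolation_levels)), there are also various weaker consistency models for replicated systems besides linearizability [^8]. In fact, the *read-after-write*, *monotonic reads*, and *consistent prefix reads* properties we saw in ["Problems with Replication Lag"](/tw/ch6#sec_replication_lag) are examples of such weaker consistency models. Linearizability guarantees all of these weaker properties, and more. In this chapter we will focus on linearizability, which is the strongest consistency model in common use.
--------
> [!TIP] Linearizability versus serializability
Linearizability is easily confused with serializability (see ["Serializability"](/tw/ch8#sec_transactions_serializability)), because both words seem to mean something like "can be arranged in a sequential order." However, they are quite different guarantees, and it is important to distinguish between them:
Serializability
: Serializability is an isolation property of transactions, where every transaction may read and write *multiple objects* (rows, documents, records). It guarantees that transactions behave the same as if they had executed in *some* serial order: that is, as if you first performed all of one transaction's operations, then all of another transaction's operations, and so on, without interleaving them. That serial order may differ from the order in which the transactions were actually run [^9].
Linearizability
: Linearizability is a guarantee on reads and writes of a register (an *individual object*). It does not group operations together into transactions, so it does not prevent problems that involve multiple objects, such as write skew (see ["Write Skew and Phantoms"](/tw/ch8#sec_transactions_write_skew)). However, linearizability is a *recency* guarantee: it requires that if one operation finishes before another one starts, then the later operation must observe a state that is at least as new as the earlier one. Serializability has no such requirement: for example, stale reads are allowed under serializability [^10].
(*Sequential consistency* is something else again [^8], but we will not discuss it here.)
A database may provide both serializability and linearizability, and this combination is known as *strict serializability* or *strong one-copy serializability* (*strong-1SR*) [^11] [^12]. Single-node databases are typically linearizable. With distributed databases that use optimistic methods such as serializable snapshot isolation (see ["Serializable Snapshot Isolation (SSI)"](/tw/ch8#sec_transactions_ssi)), the situation is more complicated: for example, CockroachDB provides serializability and some recency guarantees on reads, but not strict serializability [^13], because that would require expensive coordination between transactions [^14].
It is also possible to combine a weaker isolation level with linearizability, or a weaker consistency model with serializability; in fact, the consistency model and the isolation level can be chosen largely independently of each other [^15] [^16].
--------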
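### Relying on Linearizability {#sec_consistency_linearizability_usage}
In what circumstances is linearizability useful? Viewing the final score of a sporting match is perhaps a frivolous example: a result that is outdated by a few seconds is unlikely to cause any real harm in this situation. However, there are a few areas in which linearizability is an important requirement for making a system work correctly.
#### Locking and Leader Election {#locking-and-leader-election}
A system that uses single-leader replication needs to ensure that there is indeed only one leader, not several (split brain). One way of electing a leader is to use a lease: every node that starts up tries to acquire the lease, and the one that succeeds becomes the leader [^17]. No matter how this mechanism is implemented, it must be linearizable: it should not be possible for two different nodes to acquire the lease at the same time.
Coordination services such as Apache ZooKeeper [^18] and etcd are often used to implement distributed leases and leader election. They use consensus algorithms to implement linearizable operations in a fault-tolerant way (we discuss such algorithms later in this chapter). There are still many subtle details to getting leases and leader election right (see, for example, the fencing issue in ["Distributed Locks and Leases"](/tw/ch9#sec_distributed_lock_fencing)), and libraries such as Apache Curator help by providing higher-level recipes on top of ZooKeeper. However, a linearizable storage service is the basic foundation for these coordination tasks.
--------
> [!NOTE]
> Strictly speaking, ZooKeeper provides linearizable writes, but reads may be stale, since there is no guarantee that they are served by the current leader [^18]. etcd has provided linearizable reads by default since version 3.
--------
Distributed locking is also used at a much more granular level in some distributed databases, such as Oracle Real Application Clusters (RAC) [^19]. RAC uses a lock per disk page, with multiple nodes sharing access to the same disk storage system. Since these linearizable locks are on the critical path of transaction execution, RAC deployments usually have a dedicated cluster interconnect network for communication between database nodes.
#### Constraints and Uniqueness Guarantees {#sec_consistency_uniqueness}
Uniqueness constraints are common in databases: for example, a username or email address must uniquely identify one user, and in a file storage service there cannot be two files with the same path and filename. If you want to enforce this constraint as the data is written (such that if two people try to concurrently create a user or a file with the same name, one of them is returned an error), you need linearizability.
This situation is actually similar to a lock: when a user registers for your service, you can think of them as acquiring a "lock" on their chosen username. The operation is also very similar to an atomic compare-and-set, which sets the username to the ID of the user who claimed it, provided that the username is not already taken.
Similar issues arise if you want to ensure that a bank account balance never goes negative, that you do not sell more items than you have in stock in the warehouse, or that two people do not concurrently book the same seat on a flight or in a theater. These constraints all require there to be a single up-to-date value (the account balance, the stock level, the seat occupancy) that all nodes agree on.
In real applications it is sometimes acceptable to treat such constraints loosely (for example, if a flight is overbooked, you can move customers to a different flight and offer them compensation for the inconvenience). In such cases linearizability may not be needed, and we will discuss such loosely interpreted constraints in ["Timeliness and Integrity"](/tw/ch13#sec_future_integrity).
However, a hard uniqueness constraint, such as the one you typically find in relational databases, requires linearizability. Other kinds of constraints, such as foreign key or attribute constraints, can be implemented without linearizability [^20].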
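#### Cross-Channel Timing Dependencies {#cross-channel-timing-dependencies}
Notice a detail in [Figure 10-1](#fig_consistency_linearizability_0): if Aaliyah had not exclaimed the score, Bryce would not have known that the result of his query was stale. He would simply have refreshed the page again a few seconds later and eventually seen the final score. The linearizability violation was noticed only because there was an additional communication channel in the system (Aaliyah's voice to Bryce's ears).
Similar situations can arise in computer systems. For example, say you have a website where users can upload a video, and a background process transcodes the videos to lower quality so that they can be streamed on slow internet connections. The architecture and dataflow of this system is illustrated in [Figure 10-5](#fig_consistency_transcoder).
The video transcoder needs to be explicitly instructed to perform a transcoding job, and this instruction is sent from the web server to the transcoder via a message queue (see ["Messaging Systems"](/tw/ch12#sec_stream_messaging)). The web server does not place the entire video on the queue, since most message brokers are designed for small messages and a video may be many megabytes in size. Instead, the video is first written to a file storage service, and once the write is complete, the instruction to the transcoder is placed on the queue.
{{< figure src="/fig/ddia_1005.png" id="fig_consistency_transcoder" caption="Figure 10-5. The web server and the video transcoder communicate through both file storage and a message queue, which opens up the potential for race conditions." class="w-full my-4" >}}
If the file storage service is linearizable, this system should work fine. If it is not linearizable, there is the risk of a race condition: the message queue (steps 3 and 4 in [Figure 10-5](#fig_consistency_transcoder)) might be faster than the internal replication inside the storage service. In this case, when the transcoder fetches the original video (step 5), it might see an old version of the file, or nothing at all. If it processes an old version of the video, the original and transcoded videos in the file storage become permanently inconsistent.
This problem arises because there are two different communication channels between the web server and the transcoder: the file storage and the message queue. Without the recency guarantee of linearizability, race conditions between these two channels are possible. This situation is analogous to [Figure 10-1](#fig_consistency_linearizability_0), where there was also a race condition between two communication channels: the database replication and the real-life audio channel between Aaliyah's mouth and Bryce's ears.
A similar race condition occurs if you have a mobile app that can receive push notifications, and the app fetches some data from a server when it receives a push notification. If the data fetch might go to a lagging replica, it could happen that the push notification goes through quickly, but the subsequent fetch does not see the data that the push notification was about.
Linearizability is not the only way of avoiding this race condition, but it is the simplest to understand. If you control the additional communication channel (as in the case of the message queue, but not in the case of Aaliyah and Bryce), you can use alternative approaches similar to what we discussed in ["Reading Your Own Writes"](/tw/ch6#sec_replication_ryw), at the cost of additional complexity.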
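### Implementing Linearizable Systems {#sec_consistency_implementing_linearizable}
Now that we have looked at a few examples in which linearizability is useful, let's think about how we might implement a system that offers linearizable semantics.
Since linearizability essentially means "behave as though there is only a single copy of the data, and all operations on it are atomic," the simplest answer would be to really use only a single copy of the data. However, that approach would not be able to tolerate faults: if the node holding that one copy failed, the data would be lost, or at least inaccessible until the node was brought up again.
Let's revisit the replication methods from [Chapter 6](/tw/ch6), and compare whether they can be made linearizable:
Single-leader replication (potentially linearizable)
: In a system with single-leader replication, the leader has the primary copy of the data that is used for writes, and the followers maintain backup copies of the data on other nodes. As long as you perform all reads and writes on the leader, they are likely to be linearizable. However, this assumes that you know for sure who the leader is. As discussed in ["Distributed Locks and Leases"](/tw/ch9#sec_distributed_lock_fencing), it is quite possible for a node to think that it is the leader when in fact it is not. If this "delusional leader" continues to serve requests, it is likely to violate linearizability [^21]. With asynchronous replication, failover may even lose committed writes, which violates both durability and linearizability.
Sharding a single-leader database, with a separate leader per shard, does not affect linearizability, since it is only a single-object guarantee. Cross-shard transactions are a different matter (see ["Distributed Transactions"](/tw/ch8#sec_transactions_distributed)).
Consensus algorithms (likely linearizable)
: Some consensus algorithms are essentially single-leader replication with automatic leader election and failover. They are carefully designed to prevent split brain, which allows them to implement linearizable storage safely. ZooKeeper uses the Zab consensus algorithm [^22] and etcd uses Raft [^23], for example. However, the fact that a system uses consensus does not by itself guarantee that all operations on it are linearizable: if it allows reads on a node without checking that the node is still the leader, the result of a read could be stale if a new leader has just been elected.
Multi-leader replication (not linearizable)
: Systems with multi-leader replication are generally not linearizable, because they concurrently process writes on multiple nodes and asynchronously replicate them to other nodes. For this reason, they can produce conflicting writes that require resolution (see ["Dealing with Conflicting Writes"](/tw/ch6#sec_replication_write_conflicts)).
Leaderless replication (probably not linearizable)
: For systems with leaderless replication (Dynamo-style; see ["Leaderless Replication"](/tw/ch6#sec_replication_leaderless)), people sometimes claim that you can obtain "strong consistency" by requiring quorum reads and writes (*w* + *r* > *n*). Depending on the exact algorithm, and on how you define strong consistency, this is not quite true.
"Last write wins" conflict resolution based on time-of-day clocks (for example, in Cassandra and ScyllaDB) is almost certainly nonlinearizable, because clock timestamps cannot be guaranteed to be consistent with the actual ordering of events due to clock skew (see ["Relying on Synchronized Clocks"](/tw/ch9#sec_distributed_clocks_relying)). Even with quorums, nonlinearizable behavior is possible, as demonstrated in the next section.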
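#### Linearizability and Quorums {#sec_consistency_quorum_linearizable}
Intuitively, it seems as though quorum reads and writes should be linearizable in a Dynamo-style model. However, when we have variable network delays, race conditions are possible, as demonstrated in [Figure 10-6](#fig_consistency_leaderless).
{{< figure src="/fig/ddia_1006.png" id="fig_consistency_leaderless" caption="Figure 10-6. If network delays are variable, quorums are not sufficient to ensure linearizability." class="w-full my-4" >}}
In [Figure 10-6](#fig_consistency_leaderless), the initial value of *x* is 0, and a writer client is updating *x* to 1 by sending the write to all three replicas (*n* = 3, *w* = 3). Concurrently, client A reads from a quorum of two nodes (*r* = 2) and sees the new value 1 on one of the nodes. Also concurrently with the write, client B reads from a different quorum of two nodes, and gets back the old value 0 from both.
The quorum condition is met (*w* + *r* > *n*), but this execution is nevertheless not linearizable: B's request begins after A's request completes, yet B returns the old value while A returns the new value. (It is once again the situation of Aaliyah and Bryce from [Figure 10-1](#fig_consistency_linearizability_0).)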
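The race can be replayed step by step in a few lines. This is a toy, single-process sketch: the replica list, the `(version, value)` pairs, and the function names are assumptions made for illustration, and the order in which the write reaches the replicas is simply scripted rather than caused by real network delays.

```python
# x on each of the n = 3 replicas, stored as (version, value).
replicas = [(0, 0), (0, 0), (0, 0)]


def apply_write(node, version, value):
    """The writer's request arrives at one replica."""
    replicas[node] = (version, value)


def quorum_read(nodes):
    """Read r = 2 replicas and return the value with the highest version."""
    return max(replicas[i] for i in nodes)[1]


apply_write(0, version=1, value=1)   # the write has reached replica 0 only
a = quorum_read([0, 1])              # client A sees the new value on replica 0
b = quorum_read([1, 2])              # client B starts after A has finished
apply_write(1, version=1, value=1)   # the write reaches the other replicas
apply_write(2, version=1, value=1)   # only now (w = 3) is the write complete
```

Client B reads strictly after client A, yet gets the older value, even though every read contacted a quorum.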
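It is possible to make Dynamo-style quorums linearizable at the cost of reduced performance: a reader must perform read repair synchronously (see ["Catching Up on Missed Writes"](/tw/ch6#sec_replication_read_repair)) before returning results to the application [^24]. Moreover, before writing, a writer must read the latest state of a quorum of nodes to fetch the latest timestamp of any prior write, and ensure that the new write has a greater timestamp [^25] [^26]. However, Riak does not perform synchronous read repair because of the performance penalty. Cassandra does wait for read repair to complete on quorum reads [^27], but it loses linearizability because it uses time-of-day clocks for timestamps.
Moreover, only linearizable read and write operations can be implemented in this way; a linearizable compare-and-set operation cannot, because it requires a consensus algorithm [^28].
In summary, it is safest to assume that a leaderless system with Dynamo-style replication does not provide linearizability, even with quorum reads and writes.
### The Cost of Linearizability {#sec_linearizability_cost}
As some replication methods can provide linearizability and others cannot, it is interesting to explore the pros and cons of linearizability in more depth.
We already discussed some use cases for different replication methods in [Chapter 6](/tw/ch6); for example, we saw that multi-leader replication is often a good choice for multi-region replication (see ["Geographically Distributed Operation"](/tw/ch6#sec_replication_multi_dc)). An example of such a deployment is illustrated in [Figure 10-7](#fig_consistency_cap_availability).
{{< figure src="/fig/ddia_1007.png" id="fig_consistency_cap_availability" caption="Figure 10-7. A network interruption between regions forces a choice between linearizability and availability." class="w-full my-4" >}}
Consider what happens if there is a network interruption between the two regions. Let's assume that the network within each region is working, and clients can reach their local region, but the regions cannot connect to each other. This is known as a *network partition*.
With a multi-leader database, each region can continue operating normally: since writes from one region are asynchronously replicated to the other, the writes are simply queued up and exchanged when network connectivity is restored.
On the other hand, if single-leader replication is used, then the leader must be in one of the regions. Any writes and any linearizable reads must be sent to the leader. Thus, for any clients connected to a follower region, those read and write requests must be sent synchronously over the network to the leader region.
If the network between regions is interrupted in a single-leader setup, clients connected to follower regions cannot contact the leader, so they can make neither any writes to the database nor any linearizable reads. They can still read from the follower, but those reads might be stale (nonlinearizable). If the application requires linearizable reads and writes, the network interruption causes the application to become unavailable in the regions that cannot contact the leader.
If clients can connect directly to the leader region, this is not a problem, since the application continues to work normally there. But clients that can only reach a follower region will experience an outage until the network link is repaired.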
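#### The CAP Theorem {#the-cap-theorem}
This issue is not just a consequence of single-leader and multi-leader replication: any linearizable database has this problem, no matter how it is implemented. The issue also is not specific to multi-region deployments, but can occur on any unreliable network, even within one region. The trade-off is as follows:
* If your application *requires* linearizability, and some replicas are disconnected from the other replicas due to a network problem, then some replicas cannot process requests while they are disconnected: they must either wait until the network problem is fixed, or return an error (either way, they become *unavailable*). This choice is sometimes known as *CP* (consistent under network partitions).
* If your application *does not require* linearizability, then it can be written in a way that each replica can process requests independently, even if it is disconnected from other replicas (for example, multi-leader). In this case, the application can remain *available* in the face of a network problem, but its behavior is not linearizable. This choice is known as *AP* (available under network partitions).
Thus, applications that do not require linearizability can be more tolerant of network problems. This insight is popularly known as the *CAP theorem* [^29] [^30] [^31] [^32], named by Eric Brewer in 2000, although the trade-off had been known to designers of distributed databases since the 1970s [^33] [^34] [^35].
CAP was originally proposed as a rule of thumb, without precise definitions, with the goal of starting a discussion about trade-offs in databases. At the time, many distributed databases focused on providing linearizable semantics on a cluster of machines with shared storage [^19], and CAP encouraged database engineers to explore a wider design space of distributed shared-nothing systems, which were more suitable for implementing large-scale web services [^36]. CAP deserves credit for this culture shift: it helped trigger the NoSQL movement, a burst of new database technologies around the mid-2000s.
> [!TIP] The Unhelpful CAP Theorem
CAP is sometimes presented as *Consistency, Availability, Partition tolerance: pick 2 out of 3*. Unfortunately, putting it this way is misleading [^32], because network partitions are a kind of fault, so they are not something about which you have a choice: they will happen whether you like it or not.
At times when the network is working correctly, a system can provide both consistency (linearizability) and total availability. When a network fault occurs, you have to choose between linearizability and total availability. Thus, a better way of phrasing CAP would be *either Consistent or Available when Partitioned* [^37]. A more reliable network needs to make this choice less often, but at some point the choice is inevitable.
The CP/AP classification scheme has several further flaws [^4]. *Consistency* is formalized as linearizability (the theorem does not say anything about weaker consistency models), and the formalization of *availability* [^30] does not match the usual meaning of the term [^38]. Many highly available (fault-tolerant) systems actually do not meet CAP's idiosyncratic definition of availability. Moreover, some system designers choose (with good reason) to provide neither linearizability nor the form of availability that the CAP theorem assumes, so those systems are neither CP nor AP [^39] [^40].
All in all, there is a lot of misunderstanding and confusion around CAP, and it does not help us understand systems better, so CAP is best avoided.
The CAP theorem as formally defined [^30] is of very narrow scope: it considers only one consistency model (namely linearizability) and one kind of fault (network partitions, which according to data from Google are the cause of fewer than 8% of incidents [^41]). It does not say anything about network delays, dead nodes, or other trade-offs. Thus, although CAP has been historically influential, it has little practical value for designing systems [^4] [^38].
There have been efforts to generalize CAP. For example, the *PACELC principle* observes that system designers might also choose to weaken consistency at times when the network is working fine in order to reduce latency [^39] [^40] [^42]. Thus, during a network partition (P), we need to choose between availability (A) and consistency (C); else (E), when there is no partition, we may choose between low latency (L) and consistency (C). However, this definition inherits several problems with CAP, such as the counterintuitive definitions of consistency and availability.
There are many more interesting impossibility results in distributed systems [^43], and CAP has now been superseded by more precise results [^44] [^45], so it is of mostly historical interest today.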
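#### Linearizability and Network Delays {#linearizability-and-network-delays}
Although linearizability is a useful guarantee, surprisingly few systems are actually linearizable in practice. For example, even RAM on a modern multi-core CPU is not linearizable [^46]: if a thread running on one CPU core writes to a memory address, and a thread on another CPU core reads the same address shortly afterward, it is not guaranteed to read the value written by the first thread (unless a *memory barrier* or *fence* [^47] is used).
The reason for this behavior is that every CPU core has its own memory cache and store buffer. Memory access first goes to the cache by default, and any changes are asynchronously written out to main memory. Since accessing data in the cache is much faster than going to main memory [^48], this feature is essential for good performance on modern CPUs. However, there are now several copies of the data (one in main memory, and perhaps several more in various caches), and these copies are asynchronously updated, so linearizability is lost.
Why make this trade-off? It makes no sense to use the CAP theorem to justify the multi-core memory consistency model: within one computer we usually assume reliable communication, and we do not expect one CPU core to be able to continue operating normally if it is disconnected from the rest of the computer. The reason for dropping linearizability is *performance*, not fault tolerance [^39].
The same is true of many distributed databases that choose not to provide linearizable guarantees: they do so primarily to increase performance, not so much for fault tolerance [^42]. Linearizability is slow, and this is true all the time, not only during a network fault.
Can't we find a more efficient implementation of linearizable storage? It seems the answer is no: Attiya and Welch [^49] prove that if you want linearizability, the response time of read and write requests is at least proportional to the uncertainty of delays in the network. In a network with highly variable delays, like most computer networks (see ["Timeouts and Unbounded Delays"](/tw/ch9#sec_distributed_queueing)), the response time of linearizable reads and writes is inevitably going to be high. A faster algorithm for linearizability does not exist, but weaker consistency models can be much faster, so this trade-off is important for latency-sensitive systems. In ["Timeliness and Integrity"](/tw/ch13#sec_future_integrity) we will discuss some approaches for avoiding linearizability without sacrificing correctness.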
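## ID Generators and Logical Clocks {#sec_consistency_logical}
In many applications you need to assign some sort of unique ID to database records when they are created, which gives you a primary key by which you can refer to those records. In single-node databases it is common to use an auto-incrementing integer, which has the advantage that it can be stored in only 64 bits (or even 32 bits if you are sure that you will never have more than 4 billion records, but that is risky).
Another advantage of such auto-incrementing IDs is that the order of the IDs tells you the order in which the records were created. For example, [Figure 10-8](#fig_consistency_id_generator) shows a chat application that assigns auto-incrementing IDs to chat messages as they are posted. You can then display the messages in order of increasing ID, and the resulting chat threads will make sense: Aaliyah posts a question that is assigned ID 1, and Bryce's answer to the question is assigned a greater ID, namely 3.
{{< figure src="/fig/ddia_1008.png" id="fig_consistency_id_generator" caption="Figure 10-8. A chat application that assigns auto-incrementing IDs to messages as they are posted." class="w-full my-4" >}}
This single-node ID generator is another example of a linearizable system. Each request to fetch an ID is an operation that atomically increments a counter and returns the old counter value (a *fetch-and-add* operation); linearizability ensures that if the posting of Aaliyah's message completes before Bryce's posting begins, then Bryce's ID must be greater than Aaliyah's. The messages by Aaliyah and Caleb in [Figure 10-8](#fig_consistency_id_generator) are concurrent, so linearizability does not specify how their IDs must be ordered, as long as they are unique.
An in-memory single-node ID generator is easy to implement: you can use the atomic increment instruction provided by the CPU, which allows multiple threads to safely increment the same counter. It takes a bit more effort to make the counter persistent, so that the node can crash and restart without resetting the counter value, which would result in duplicate IDs. But the real problems are:
* A single-node ID generator is not fault-tolerant, because that node is a single point of failure.
* It is slow if you want to create a record in another region, since you potentially have to make a round trip to the other side of the planet just to get an ID.
* If you have high write throughput, that single node could become a bottleneck.
There are various alternative options for ID generators that you can consider:
Sharded ID assignment
: You could have multiple nodes that assign IDs: for example, one that generates only even numbers and one that generates only odd numbers. In general, you can reserve some bits in the ID to contain a shard number. Those IDs are still compact, but you lose the ordering property: for example, if you have chat messages with IDs 16 and 17, you do not know whether message 16 was actually sent first, because the IDs were assigned by different nodes, and one node might have been ahead of the other.
Preallocated blocks of IDs
: Instead of requesting individual IDs from the single-node ID generator, it could hand out blocks of IDs. For example, node A might claim the block of IDs from 1 to 1,000, and node B might claim the block from 1,001 to 2,000. Each node can then independently hand out IDs from its block, and request a new block from the single-node ID generator when its supply of sequence numbers begins to run low. However, this scheme does not ensure correct ordering either: it could happen that one message is assigned an ID in the range from 1,001 to 2,000, and a later message is assigned an ID in the range from 1 to 1,000, if the IDs were assigned by different nodes.
Random UUIDs
: You can use *universally unique identifiers* (UUIDs), also known as *globally unique identifiers* (GUIDs). They have the big advantage that they can be generated locally on any node without requiring communication, but they need more space (128 bits). There are several different versions of UUIDs; the simplest is version 4, which is essentially a random number that is so long that it is very unlikely for two nodes to ever pick the same one. Unfortunately, the order of such IDs is also random, so comparing two IDs tells you nothing about which one is newer.
Wall-clock timestamp made unique
: If your nodes' time-of-day clocks are kept approximately correct using NTP, you can generate IDs by putting a timestamp from that clock in the most significant bits, and filling the remaining bits with extra information that ensures the ID is unique even if the timestamp is not: for example, a shard number and a per-shard incrementing sequence number, or a long random value. This approach is used in version 7 UUIDs [^50], Twitter's Snowflake [^51], ULIDs [^52], Hazelcast's Flake ID generator, MongoDB ObjectIDs, and many similar schemes [^50]. You can implement these ID generators in application code or within a database [^53].
All of these schemes generate IDs that are unique (at least with high enough probability that collisions are vanishingly rare), but they have much weaker ordering guarantees for IDs than the single-node auto-incrementing scheme.
As discussed in ["Timestamps for Ordering Events"](/tw/ch9#sec_distributed_lww), wall-clock timestamps can provide at best approximate ordering: if an earlier write gets a timestamp from a slightly fast clock, and a later write's timestamp is from a slightly slow clock, the timestamp order may be inconsistent with the order in which the events actually happened. With clock jumps caused by using a nonmonotonic clock, even the timestamps generated by a single node might be ordered incorrectly. ID generators based on wall-clock time are therefore unlikely to be linearizable.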
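As a rough sketch of the timestamp-based approach and of its ordering problem, the generator below packs a millisecond timestamp, a shard number, and a per-shard sequence number into one integer. The class name and the bit widths (10 bits for the shard, 12 for the sequence) are choices made for this illustration rather than the layout of any particular scheme named above, and the clock is injected as a plain function so that skew can be simulated.

```python
import time


class TimestampIdGenerator:
    SHARD_BITS = 10   # illustrative bit widths, not a specific standard
    SEQ_BITS = 12

    def __init__(self, shard, clock=lambda: time.time_ns() // 1_000_000):
        if not 0 <= shard < (1 << self.SHARD_BITS):
            raise ValueError("shard number out of range")
        self.shard = shard
        self.clock = clock      # returns the time of day in milliseconds
        self.last_ms = -1
        self.seq = 0

    def next_id(self):
        now = max(self.clock(), self.last_ms)    # never go backward locally
        if now == self.last_ms:
            self.seq += 1
            if self.seq == 1 << self.SEQ_BITS:   # sequence numbers used up:
                now, self.seq = now + 1, 0       # borrow the next millisecond
        else:
            self.seq = 0
        self.last_ms = now
        return (
            (now << (self.SHARD_BITS + self.SEQ_BITS))
            | (self.shard << self.SEQ_BITS)
            | self.seq
        )


# Two nodes whose clocks disagree by one second (readings in milliseconds).
fast_node = TimestampIdGenerator(shard=1, clock=lambda: 2_000)
slow_node = TimestampIdGenerator(shard=2, clock=lambda: 1_000)

earlier_id = fast_node.next_id()   # generated first, on the fast clock
later_id = slow_node.next_id()     # generated afterward, on the slow clock
```

Here `later_id` is smaller than `earlier_id` even though it was generated afterward. IDs from a single node stay ordered, but IDs from nodes with skewed clocks do not.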
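You can reduce such ordering inconsistencies by relying on high-precision clock synchronization, using atomic clocks or GPS receivers. But it would also be nice to be able to generate IDs that are unique and correctly ordered without relying on special hardware. That is what *logical clocks* are for.
### Logical Clocks {#sec_consistency_timestamps}
In ["Unreliable Clocks"](/tw/ch9#sec_distributed_clocks) we discussed time-of-day clocks and monotonic clocks. Both are *physical clocks*: they measure the passing of seconds (or milliseconds, microseconds, and so on).
In a distributed system it is common to also use another kind of clock, called a *logical clock*. While a physical clock is a hardware device that counts the seconds that have elapsed, a logical clock is an algorithm that counts the events that have occurred. A timestamp from a logical clock therefore does not tell you what time it is, but you *can* compare two timestamps from a logical clock to tell which one is earlier and which one is later.
The requirements for a logical clock are typically:
* its timestamps are compact (a few bytes in size) and unique;
* you can compare any two timestamps (that is, they are *totally ordered*); and
* the order of timestamps is *consistent with causality*: if operation A happened before B, then A's timestamp is less than B's timestamp. (We discussed causality previously in ["The Happens-Before Relation and Concurrency"](/tw/ch6#sec_replication_happens_before).)
A single-node ID generator meets these requirements, but the distributed ID generators we just discussed do not meet the causal ordering requirement.
#### Lamport Timestamps {#lamport-timestamps}
Fortunately, there is a simple method for generating logical timestamps that *is* consistent with causality, and which you can use as a distributed ID generator. It is called a *Lamport clock*, proposed in 1978 by Leslie Lamport [^54], in what is now one of the most-cited papers in the field of distributed systems.
[Figure 10-9](#fig_consistency_lamport_ts) shows how a Lamport clock would work in the chat example of [Figure 10-8](#fig_consistency_id_generator). Each node has a unique identifier, which in [Figure 10-9](#fig_consistency_lamport_ts) is the name "Aaliyah", "Bryce", or "Caleb", but which in practice could be a random UUID or something similar. Moreover, each node keeps a counter of the number of operations it has processed. A Lamport timestamp is then simply a pair of (*counter*, *node ID*). Two nodes may sometimes have the same counter value, but by including the node ID in the timestamp, each timestamp is made unique.
{{< figure src="/fig/ddia_1009.png" id="fig_consistency_lamport_ts" caption="Figure 10-9. Lamport timestamps provide a total ordering that is consistent with causality." class="w-full my-4" >}}
Every time a node generates a timestamp, it increments its counter value and uses the new value. Moreover, every time a node sees a timestamp from another node, if the counter value in that timestamp is greater than its local counter value, it increases its local counter to match the value in the timestamp.
In [Figure 10-9](#fig_consistency_lamport_ts), Aaliyah had not yet seen Caleb's message when she posted her own, and vice versa. Assuming both users start with an initial counter value of 0, both therefore increment their local counter and attach the new counter value of 1 to their message. When Bryce receives those messages, he increases his local counter value to 1. Finally, Bryce sends a reply to Aaliyah's message, for which he increments his local counter and attaches the new value of 2 to the message.
To compare two Lamport timestamps, we first compare their counter values: for example, (2, "Bryce") is greater than (1, "Aaliyah") and also greater than (1, "Caleb"). If two timestamps have the same counter, we compare their node IDs instead, using the usual lexicographic string comparison. Thus, the timestamp order in this example is (1, "Aaliyah") < (1, "Caleb") < (2, "Bryce").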
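The rules above fit in a few lines of code. The sketch below replays the chat example; the class and method names (`LamportClock`, `tick`, `observe`) are made up for this illustration, and timestamps are represented as Python tuples because tuple comparison is already lexicographic (counter first, then node ID).

```python
class LamportClock:
    def __init__(self, node_id):
        self.node_id = node_id
        self.counter = 0

    def tick(self):
        """Generate a new timestamp: increment the counter, use the new value."""
        self.counter += 1
        return (self.counter, self.node_id)

    def observe(self, timestamp):
        """On seeing another node's timestamp, catch up if it is ahead."""
        self.counter = max(self.counter, timestamp[0])


aaliyah, bryce, caleb = (LamportClock(n) for n in ("Aaliyah", "Bryce", "Caleb"))

t_aaliyah = aaliyah.tick()    # (1, "Aaliyah")
t_caleb = caleb.tick()        # (1, "Caleb"), concurrent with Aaliyah's message

bryce.observe(t_aaliyah)      # Bryce receives both messages ...
bryce.observe(t_caleb)
t_bryce = bryce.tick()        # ... and replies with (2, "Bryce")

ordered = sorted([t_bryce, t_caleb, t_aaliyah])
```

Sorting the tuples yields the same order as in the text, with Bryce's reply after the message it responds to.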
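#### Hybrid Logical Clocks {#hybrid-logical-clocks}
Lamport timestamps are good at capturing the order in which things happened, but they have some limitations:
* Since they have no direct relation to physical time, you cannot use them to find, say, all the messages that were posted on a particular date; you would need to store the physical time separately.
* If two nodes never communicate, one node's counter increments will never be reflected in the other node's counter. Thus, it could happen that events generated around the same time on different nodes have wildly different counter values.
A *hybrid logical clock* combines the advantages of physical time-of-day clocks with the ordering guarantees of Lamport clocks [^55]. Like a physical clock, it counts seconds or microseconds. Like a Lamport clock, when one node sees a timestamp from another node that is greater than its local clock value, it moves its own local value forward to match the other node's timestamp. As a result, if one node's clock is running fast, the other nodes will similarly move their clocks forward when they communicate.
Every time a timestamp from a hybrid logical clock is generated, it is also incremented, which ensures that the clock moves forward monotonically, even if the underlying physical clock jumps backward due to NTP adjustments. Thus, the hybrid logical clock might be slightly ahead of the underlying physical clock. The details of the algorithm ensure that this discrepancy remains as small as possible.
As a result, you can treat a timestamp from a hybrid logical clock almost like a timestamp from a conventional time-of-day clock, with the added property that its ordering is consistent with the happens-before relation. It does not depend on any special hardware, and requires only roughly synchronized clocks. Hybrid logical clocks are used by CockroachDB, for example.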
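One simplified way to realize this behavior is sketched below. It is not the exact published algorithm: the `(wall, logical, node_id)` representation, the method names, and the injected `physical_clock` function are assumptions of this sketch, chosen so that clock skew and backward jumps can be simulated.

```python
import time


class HybridLogicalClock:
    def __init__(self, node_id, physical_clock=lambda: time.time_ns() // 1_000):
        self.node_id = node_id
        self.physical_clock = physical_clock   # microseconds since the epoch
        self.wall = 0       # highest physical time seen so far
        self.logical = 0    # tie-breaking counter within one `wall` value

    def now(self):
        """Generate a timestamp that is greater than any earlier one."""
        pt = self.physical_clock()
        if pt > self.wall:
            self.wall, self.logical = pt, 0
        else:
            # The physical clock is behind (skew or a backward jump):
            # keep the wall part and advance the logical counter instead.
            self.logical += 1
        return (self.wall, self.logical, self.node_id)

    def observe(self, timestamp):
        """Move the local clock forward if another node's timestamp is ahead."""
        wall, logical, _ = timestamp
        if (wall, logical) > (self.wall, self.logical):
            self.wall, self.logical = wall, logical


fast = HybridLogicalClock("fast", physical_clock=lambda: 2_000)
slow = HybridLogicalClock("slow", physical_clock=lambda: 1_000)

sent = fast.now()         # (2000, 0, "fast")
slow.observe(sent)        # the slow node receives a message from the fast node
reply = slow.now()        # (2000, 1, "slow"), ordered after `sent`

readings = iter([100, 90])    # a physical clock that jumps backward
jumpy = HybridLogicalClock("jumpy", physical_clock=lambda: next(readings))
first, second = jumpy.now(), jumpy.now()
```

In both scenarios the generated timestamps keep increasing: the reply is ordered after the message it follows despite the slow clock, and the backward jump from 100 to 90 is absorbed by the logical counter.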
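#### Lamport/Hybrid Logical Clocks vs. Vector Clocks {#lamporthybrid-logical-clocks-vs-vector-clocks}
In ["Multi-Version Concurrency Control (MVCC)"](/tw/ch8#sec_transactions_snapshot_impl) we discussed how snapshot isolation is often implemented: essentially, by giving each transaction a transaction ID, and allowing each transaction to see writes that were made by transactions with a lower ID, while making writes by transactions with a higher ID invisible. Lamport clocks and hybrid logical clocks are a good way of generating these transaction IDs, because they ensure that the snapshot is consistent with causality [^56].
When multiple timestamps are generated concurrently, these algorithms order them arbitrarily. This means that when you look at two timestamps, you generally cannot tell whether they were generated concurrently or whether one happened before the other. (In the example of [Figure 10-9](#fig_consistency_lamport_ts) you actually can tell that Aaliyah's and Caleb's messages must have been concurrent, because they have the same counter value, but when the counter values are different you cannot tell whether they were concurrent.)
If you want to be able to determine when records were created concurrently, you need a different algorithm, such as a *vector clock*. The downside is that the timestamps from a vector clock are much bigger: potentially one integer for every node in the system. See ["Detecting Concurrent Writes"](/tw/ch6#sec_replication_concurrent) for more details on detecting concurrency.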
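To illustrate what a vector clock adds, the sketch below represents each timestamp as a dictionary with one counter per node and compares two of them. The `compare` function and the dictionary representation are assumptions made for this illustration; the example vectors correspond to the chat scenario of Figure 10-9.

```python
def compare(a, b):
    """Compare two vector timestamps, given as {node_id: counter} dicts."""
    nodes = a.keys() | b.keys()
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"      # a happened before b
    if b_le_a:
        return "after"       # b happened before a
    return "concurrent"      # neither timestamp dominates the other


aaliyah_msg = {"Aaliyah": 1}
caleb_msg = {"Caleb": 1}
# Bryce had seen both messages before sending his reply.
bryce_reply = {"Aaliyah": 1, "Caleb": 1, "Bryce": 1}

concurrency = compare(aaliyah_msg, caleb_msg)
causality = compare(aaliyah_msg, bryce_reply)
```

Unlike a Lamport timestamp, the comparison can return `"concurrent"`, at the cost of carrying one counter per node.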
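### Linearizable ID Generators {#sec_consistency_linearizable_id}
Although Lamport clocks and hybrid logical clocks provide useful ordering guarantees, that ordering is still weaker than the linearizable single-node ID generator we discussed previously. Recall that linearizability requires that if request A completed before request B began, then B must have the higher ID, even if A and B never communicated with each other. Lamport clocks, on the other hand, can only ensure that a node generates timestamps that are greater than any other timestamp that node has seen, but they cannot say anything about timestamps that the node has not seen.
[Figure 10-10](#fig_consistency_permissions) shows how a nonlinearizable ID generator could cause problems. Imagine a social media website where user A wants to share an embarrassing photo privately with their friends. A's account is initially public, but using their laptop, A first changes their account settings to private. Then A uses their phone to upload the photo. Since A performed these updates in sequence, they might reasonably expect the photo upload to be subject to the new, restricted account permissions.
{{< figure src="/fig/ddia_1010.png" id="fig_consistency_permissions" caption="Figure 10-10. An example of a permission system using Lamport timestamps." class="w-full my-4" >}}
The account permissions and the photo are stored in two separate databases (or separate shards of the same database), and let's assume they use a Lamport clock or hybrid logical clock to assign a timestamp to every write. Since the photos database did not read from the accounts database, it is possible that the local counter in the photos database is slightly behind, and therefore the photo upload is assigned a lower timestamp than the update of the account settings.
Next, say that a viewer (who is not friends with A) is looking at A's profile, and their read uses an MVCC implementation of snapshot isolation. It could happen that the viewer's read has a timestamp that is greater than that of the photo upload, but less than that of the account settings update. As a result, the system will determine that the account was still public at the time of the read, and therefore show the viewer the embarrassing photo that they were not supposed to see.
You can imagine several possible ways of fixing this problem. Maybe the photos database should have read the user's account status before performing the write, but it is easy to forget such a check. If A's actions had been performed on the same device, maybe an app on that device could have kept track of the latest timestamp of that user's writes; but if the user uses a laptop and a phone, as in the example, that is not so easy.
The simplest solution in this case would be to use a linearizable ID generator, which would ensure that the photo upload is assigned a greater ID than the account permissions change.
#### Implementing a Linearizable ID Generator {#implementing-a-linearizable-id-generator}
The simplest way of ensuring that ID assignment is linearizable is actually to use a single node for this purpose. That node only needs to atomically increment a counter and return its value when requested, persist the counter value (so that it does not generate duplicate IDs if the node crashes and restarts), and replicate it for fault tolerance using single-leader replication. This approach is used in practice: for example, TiDB/TiKV calls it a *timestamp oracle*, inspired by Google's Percolator [^57].
As an optimization, you can avoid performing a disk write and replication on every single request. Instead, the ID generator can write a record describing a batch of IDs; once that record is persisted and replicated, the node can start handing out those IDs to clients in sequence. Before it runs out of IDs in that batch, it can persist and replicate a record for the next batch. That way, some IDs will be skipped if the node crashes and restarts, or if you fail over to a follower, but no duplicate or out-of-order IDs will be issued.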
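The batching optimization can be sketched as follows. This is a simplified, single-threaded illustration: a plain dictionary stands in for the durable, replicated record, the names `BatchedIdGenerator` and `reserved_up_to` are invented for this example, and the next batch is reserved only when the current one is exhausted rather than slightly ahead of time.

```python
class BatchedIdGenerator:
    def __init__(self, durable_store, batch_size=1000):
        # `durable_store` is a stand-in for a persisted, replicated record.
        self.store = durable_store
        self.batch_size = batch_size
        # After a restart, resume beyond everything that was ever reserved.
        self.next_id = self.limit = self.store.get("reserved_up_to", 0)

    def get_id(self):
        if self.next_id >= self.limit:
            # Reserve a new batch; in a real system this record must be
            # persisted and replicated before any ID from it is handed out.
            self.limit = self.next_id + self.batch_size
            self.store["reserved_up_to"] = self.limit
        value = self.next_id
        self.next_id += 1
        return value


store = {}
generator = BatchedIdGenerator(store, batch_size=10)
before_crash = [generator.get_id() for _ in range(3)]

# Simulated crash and restart: only the contents of `store` survive.
restarted = BatchedIdGenerator(store, batch_size=10)
after_crash = restarted.get_id()
```

After the simulated restart, the IDs 3 to 9 are skipped, but no ID is issued twice and the sequence still only moves upward.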
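You cannot easily shard the ID generator, since if you have multiple shards independently handing out IDs, you can no longer guarantee that their order is linearizable. You also cannot easily distribute the ID generator across multiple regions; thus, in a geographically distributed database, all requests for an ID have to go to a node in a single region. On the upside, the ID generator's job is very simple, so a single node can handle a large request throughput.
If you do not want to use a single-node ID generator, an alternative is possible: you can do what Google's Spanner does, as discussed in ["Synchronized Clocks for Global Snapshots"](/tw/ch9#sec_distributed_spanner). It relies on a physical clock that returns not just a single timestamp, but a range of timestamps indicating the uncertainty in the clock reading. It then waits for the duration of that uncertainty interval to elapse before returning.
Assuming that the uncertainty interval is correct (that is, the true current physical time always lies within that interval), this process also ensures that if one request completes before another begins, the later request will have a greater timestamp. This approach ensures linearizable ID assignment without any communication: even requests in different regions will be ordered correctly, without waiting for cross-region requests. The downside is that you need hardware and software support for clocks to be tightly synchronized and to compute the necessary uncertainty interval.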
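The waiting step can be sketched as follows. Everything here is hypothetical: `system_interval` is a made-up stand-in for a clock API that returns bounds on the current time (with an arbitrarily chosen uncertainty of 5,000 microseconds), and `FakeClock` exists only so that the example runs deterministically. It is not Spanner's actual interface.

```python
import time


def system_interval(uncertainty_us=5_000):
    """Hypothetical clock API: (earliest, latest) bounds in microseconds."""
    now = time.time_ns() // 1_000
    return (now - uncertainty_us, now + uncertainty_us)


def assign_timestamp(interval=system_interval,
                     sleep_us=lambda us: time.sleep(us / 1_000_000)):
    _, latest = interval()
    timestamp = latest                   # upper bound on the current true time
    while interval()[0] <= timestamp:    # wait until it is definitely past
        sleep_us(1_000)
    return timestamp


class FakeClock:
    """A deterministic clock for demonstration; time advances when sleeping."""

    def __init__(self, now_us, uncertainty_us):
        self.now_us = now_us
        self.uncertainty_us = uncertainty_us

    def interval(self):
        return (self.now_us - self.uncertainty_us,
                self.now_us + self.uncertainty_us)

    def sleep_us(self, us):
        self.now_us += us


clock = FakeClock(now_us=100_000_000, uncertainty_us=5_000)
first = assign_timestamp(clock.interval, clock.sleep_us)
second = assign_timestamp(clock.interval, clock.sleep_us)
```

Because each call returns only once its timestamp is guaranteed to lie in the past, a request that starts afterward (even on another node, provided its interval is also correct) picks a strictly greater timestamp.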
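#### Enforcing Constraints Using Logical Clocks {#enforcing-constraints-using-logical-clocks}
In ["Constraints and Uniqueness Guarantees"](#sec_consistency_uniqueness) we saw that a linearizable compare-and-set operation can be used to implement locks, uniqueness constraints, and similar constructs in a distributed system. This raises the question: is a logical clock or a linearizable ID generator also sufficient to implement these things?
The answer is: not quite. When you have several nodes that are all trying to acquire the same lock or register the same username, you could use a logical clock to assign a timestamp to those requests, and pick the one with the lowest timestamp as the winner. If the clock is linearizable, you know that any future requests will always generate greater timestamps, and therefore you can be sure that no future request will receive an even lower timestamp than the winner.
Unfortunately, part of the problem is still unsolved: how does a node know whether its own timestamp is the lowest? To be sure, it needs to hear from *every* other node that might have generated a timestamp [^54]. If one of the other nodes has failed in the meantime, or cannot be reached due to a network problem, this system would grind to a halt, because we cannot be sure whether that node might have the lowest timestamp. This is not the kind of fault-tolerant system that we need.
To implement locks, leases, and similar constructs in a fault-tolerant way, we need something stronger than logical clocks or ID generators: we need consensus.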
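## Consensus {#sec_consistency_consensus}
In this chapter we have seen several examples of things that are easy when you have only a single node, but which get a lot harder if you want fault tolerance:
* A database can be linearizable if you have only a single leader, and you make all reads and writes on that leader. But how do you fail over if that leader fails, while avoiding split brain? How do you ensure that a node that believes itself to be the leader has not actually been voted out?
* A linearizable ID generator on a single node is just a counter with an atomic fetch-and-add instruction, but what if it crashes?
* An atomic compare-and-set (CAS) operation is useful for many things, such as deciding who gets a lock or lease when several processes are racing to acquire it, or ensuring the uniqueness of a file or user with a given name. On a single node, CAS may be as simple as one CPU instruction, but how do you make it fault-tolerant?
It turns out that all of these are instances of the same fundamental distributed systems problem: *consensus*. Consensus is one of the most important and fundamental problems in distributed computing; it is also infamously difficult to get right [^58] [^59], and many systems have gotten it wrong in the past. Now that we have discussed replication ([Chapter 6](/tw/ch6)), transactions ([Chapter 8](/tw/ch8)), system models ([Chapter 9](/tw/ch9)), and linearizability (this chapter), we are finally ready to tackle the consensus problem.
The best-known consensus algorithms are Viewstamped Replication [^60] [^61], Paxos [^58] [^62] [^63] [^64], Raft [^23] [^65] [^66], and Zab [^18] [^22] [^67]. There are quite a few similarities between these algorithms, but they are not the same [^68] [^69]. These algorithms work in a non-Byzantine system model: that is, network communication may be arbitrarily delayed or dropped, and nodes may crash, restart, and become disconnected, but the algorithms assume that nodes otherwise follow the protocol correctly and do not behave maliciously.
There are also consensus algorithms that can tolerate some Byzantine nodes, that is, nodes that do not follow the protocol correctly (for example, by sending contradictory messages to other nodes). A common assumption is that fewer than one-third of the nodes are Byzantine-faulty [^26] [^70]. Such *Byzantine fault tolerant* (BFT) consensus algorithms are used in blockchains [^71]. However, as explained in ["Byzantine Faults"](/tw/ch9#sec_distributed_byzantine), BFT algorithms are beyond the scope of this book.
--------
> [!TIP] The Impossibility of Consensus
You may have heard about the FLP result [^72], named after the authors Fischer, Lynch, and Paterson, which proves that there is no algorithm that is always able to reach consensus if there is a risk that a node may crash. In a distributed system, we must assume that nodes may crash, so reliable consensus is impossible. Yet, here we are, discussing algorithms for achieving consensus. What is going on here?
First, FLP does not say that we can never reach consensus; it only says that we cannot guarantee that a consensus algorithm will *always* terminate. Moreover, the FLP result is proved assuming a deterministic algorithm in the asynchronous system model (see ["System Model and Reality"](/tw/ch9#sec_distributed_system_model)), which means that the algorithm cannot use any clocks or timeouts. If it can use timeouts to suspect that another node may have crashed (even if the suspicion is sometimes wrong), then consensus becomes solvable [^73]. Even just allowing the algorithm to use random numbers is sufficient to get around the impossibility result [^74].
Thus, although the FLP result about the impossibility of consensus is of great theoretical importance, distributed systems can usually achieve consensus in practice.
--------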
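### The Many Faces of Consensus {#sec_consistency_faces}
Consensus can be expressed in several different ways:
* *Single-value consensus* is very similar to an atomic *compare-and-set* operation, and it can be used to implement locks, leases, and uniqueness constraints.
* Constructing an *append-only log* also requires consensus; it is usually formalized as *total order broadcast*. With a log you can build *state machine replication*, leader-based replication, event sourcing, and other useful things.
* *Atomic commitment* of a multi-database or multi-shard transaction requires that all participants agree on whether to commit or abort the transaction.
We will explore all of these shortly. In fact, these problems are all equivalent to each other: if you have an algorithm that solves one of these problems, you can convert it into a solution for any of the others. This is quite a profound and perhaps surprising insight! And that is why we can lump all of these things together under "consensus," even though they look quite different on the surface.
#### Single-Value Consensus {#single-value-consensus}
The standard formulation of consensus involves getting several nodes to agree on a single value. For example:
* When a database with single-leader replication first starts up, or when the existing leader fails, several nodes may concurrently try to become the leader. Similarly, several nodes may race to acquire a lock or lease. Consensus allows them to decide which one wins.
* If several people concurrently try to book the last seat on an airplane, or the same seat in a theater, or try to register an account with the same username, then a consensus algorithm could determine which one should succeed.
More generally, one or more nodes may *propose* values, and the consensus algorithm *decides* on one of those values. In the examples above, each node could propose its own ID, and the algorithm decides which node ID should become the new leader, the holder of the lease, or the buyer of the airplane or theater seat. In this formalism, a consensus algorithm must satisfy the following properties [^26]:
Uniform agreement
: No two nodes decide differently.
Integrity
: Once a node has decided one value, it cannot change its mind by deciding another value.
Validity
: If a node decides value *v*, then *v* was proposed by some node.
Termination
: Every node that does not crash eventually decides some value.
If you want to decide multiple values, you can run a separate instance of the consensus algorithm for each. For example, you could have a separate consensus run for each bookable seat in a theater, so that you get one decision (one buyer) for each seat.
The uniform agreement and integrity properties define the core idea of consensus: everyone decides on the same outcome, and once you have decided, you cannot change your mind. The validity property rules out trivial solutions: for example, you could have an algorithm that always decides `null`, no matter what was proposed; this algorithm would satisfy the agreement and integrity properties, but not the validity property.
If you do not care about fault tolerance, then satisfying the first three properties is easy: you can just hardcode one node to be the "dictator," and let that node make all of the decisions. However, if that one node fails, then the system can no longer make any decisions, just like single-leader replication without failover. All the difficulty arises from the need for fault tolerance.
The termination property formalizes the idea of fault tolerance. It essentially says that a consensus algorithm cannot simply sit around and do nothing forever; in other words, it must make progress. Even if some nodes fail, the other nodes must still reach a decision. (Termination is a liveness property, whereas the other three are safety properties; see ["Safety and Liveness"](/tw/ch9#sec_distributed_safety_liveness).)
If a crashed node may recover, you could just wait for it to come back. However, consensus must ensure that it makes a decision even if a crashed node suddenly disappears and never comes back. (Instead of a software crash, imagine that there is an earthquake, and the datacenter containing your node is destroyed by a landslide. You must assume that your node is buried under 30 feet of mud and is never going to come back online.)
Of course, if *all* nodes crash and none of them are running, then it is not possible for any algorithm to decide anything. There is a limit to the number of failures that an algorithm can tolerate: in fact, it can be proved that any consensus algorithm requires at least a majority of nodes to be functioning correctly in order to assure termination [^73]. That majority can safely form a quorum (see ["Quorums for Reading and Writing"](/tw/ch6#sec_replication_quorum_condition)).
Thus, the termination property is subject to the assumption that fewer than half of the nodes are crashed or unreachable. However, most consensus algorithms ensure that the safety properties (agreement, integrity, and validity) are always met, even if a majority of nodes fail or there is a severe network problem [^75]. Thus, a large-scale outage can stop the system from being able to process requests, but it cannot corrupt the consensus system by causing it to make inconsistent decisions.
#### Compare-and-Set as Consensus {#compare-and-set-as-consensus}
A compare-and-set (CAS) operation checks whether the current value of some object equals some expected value; if yes, it atomically updates the object to some new value; if no, it leaves the object unchanged and returns an error.
If you have a fault-tolerant, linearizable CAS operation, it is easy to solve the consensus problem: initially set the object to a null value; each node that wants to propose a value invokes CAS with the expected value being null, and the new value being the value it wants to propose (assuming it is non-null). The decided value is then whatever value the object is set to.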
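The construction just described is short enough to write out. In this sketch the `CasRegister` class is a local, lock-protected stand-in for what would have to be a fault-tolerant, linearizable CAS service; the class, the `propose` function, and the node names are all invented for illustration.

```python
import threading


class CasRegister:
    """Local stand-in for a fault-tolerant, linearizable CAS object."""

    def __init__(self):
        self._value = None              # the object initially holds null
        self._lock = threading.Lock()

    def cas(self, expected, new):
        with self._lock:
            if self._value != expected:
                return False
            self._value = new
            return True

    def read(self):
        with self._lock:
            return self._value


def propose(register, value):
    """Propose a non-null value and return the decided value."""
    if value is None:
        raise ValueError("proposed values must be non-null")
    register.cas(None, value)    # succeeds only for the first proposer
    return register.read()       # whatever the object holds is the decision


register = CasRegister()
decisions = [propose(register, node) for node in ("node-b", "node-a", "node-c")]
```

Every proposer learns the same decision, namely the value from the one CAS call that succeeded.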
同樣,如果你有共識的解決方案,你可以實現 CAS:每當一個或多個節點想要使用相同的期望值執行 CAS 時,你使用共識協議提議 CAS 呼叫中的新值,然後將物件設定為共識決定的任何值。任何新值未被決定的 CAS 呼叫都返回錯誤。具有不同期望值的 CAS 呼叫使用共識協議的單獨執行。
這表明 CAS 和共識彼此等價 [^28] [^73]。同樣,兩者在單個節點上都很簡單,但要使其容錯則具有挑戰性。作為分散式環境中 CAS 的示例,我們在 ["由物件儲存支援的資料庫"](/tw/ch6#sec_replication_object_storage) 中看到了物件儲存的條件寫入操作,它允許寫入僅在自當前客戶端上次讀取以來具有相同名稱的物件未被另一個客戶端建立或修改時發生。
然而,線性一致的讀寫暫存器不足以解決共識。FLP 結果告訴我們,共識不能由非同步崩潰停止模型中的確定性演算法解決 [^72],但我們在 ["線性一致性與仲裁"](#sec_consistency_quorum_linearizable) 中看到,線性一致的暫存器可以使用此模型中的仲裁讀/寫來實現 [^24] [^25] [^26]。由此可見,線性一致的暫存器無法解決共識。
#### 共享日誌作為共識 {#sec_consistency_shared_logs}
我們已經看到了幾個日誌的例子,例如複製日誌、事務日誌和預寫日誌。日誌儲存一系列 *日誌條目*,任何讀取它的人都會看到相同順序的相同條目。有時日誌有一個允許追加新條目的單個寫入者,但 *共享日誌* 是多個節點可以請求追加條目的日誌。單主複製就是一個例子:任何客戶端都可以要求主節點進行寫入,主節點將其追加到複製日誌,然後所有備庫按照與主節點相同的順序應用寫入。
更正式地說,共享日誌支援兩種操作:你可以請求將值新增到日誌中,並且可以讀取日誌中的條目。它必須滿足以下屬性:
最終追加
: 如果節點請求將某個值新增到日誌中,並且節點不會崩潰,那麼該節點最終必須在日誌條目中讀取該值。
可靠交付
: 沒有日誌條目丟失:如果一個節點讀取某個日誌條目,那麼最終每個未崩潰的節點也必須讀取該日誌條目。
僅追加
: 一旦節點讀取了某個日誌條目,它就是不可變的,新的日誌條目只能在它之後新增,而不能在之前。節點可能會重新讀取日誌,在這種情況下,它會以與最初讀取它們時相同的順序看到相同的日誌條目(即使節點崩潰並重新啟動)。
一致性
: 如果兩個節點都讀取某個日誌條目 *e*,那麼在 *e* 之前,它們必須以相同的順序讀取完全相同的日誌條目序列。
有效性
: 如果節點讀取包含某個值的日誌條目,那麼某個節點先前請求將該值新增到日誌中。
--------
> [!NOTE]
> 共享日誌在形式上被稱為 *全序廣播*、*原子廣播* 或 *全序組播* 協議 [^26] [^76] [^77]。這是用不同的詞描述的同一件事:請求將值新增到日誌中然後稱為"廣播"它,讀取日誌條目稱為"交付"它。
--------
如果你有共享日誌的實現,很容易解決共識問題:每個想要提議值的節點都請求將其新增到日誌中,第一個日誌條目中讀回的任何值就是決定的值。由於所有節點以相同的順序讀取日誌條目,它們保證就首先交付哪個值達成一致 [^28]。
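Conversely, if you have a solution for consensus, you can implement a shared log. The details are a bit more complicated, but the basic idea is this [^73]:
1. You have a slot in the log for every future log entry, and you run a separate instance of the consensus algorithm for every such slot to decide what value should go in that entry.
2. When a node wants to add a value to the log, it proposes that value for one of the slots that has not yet been decided.
3. When the consensus algorithm decides for one of the slots, and all of the previous slots have already been decided, then the decided value is appended as a new log entry, and any consecutive slots that have already been decided also have their decided values appended to the log.
4. If a proposed value was not chosen for some slot, the node that wanted to add it retries by proposing it for a later slot.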
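The four steps above can be sketched as follows. This is a simplified illustration, not a real protocol: `SingleValueConsensus` is a hypothetical stand-in in which the first proposal simply wins, everything runs in one process, and proposed values are assumed to be unique so that a node can tell whether its own value was chosen.

```python
class SingleValueConsensus:
    """Hypothetical stand-in for one consensus instance: first proposal wins."""

    def __init__(self):
        self.decided = None

    def propose(self, value):
        if self.decided is None:
            self.decided = value
        return self.decided

class SlotLog:
    def __init__(self):
        self.slots = {}   # slot number -> consensus instance (step 1)
        self.log = []     # decided values, delivered in slot order without gaps

    def instance(self, slot):
        return self.slots.setdefault(slot, SingleValueConsensus())

    def append(self, value):
        slot = len(self.log)                 # step 2: first slot not yet delivered
        while True:
            decided = self.instance(slot).propose(value)
            self._deliver()
            if decided == value:
                return slot
            slot += 1                        # step 4: retry in a later slot

    def _deliver(self):
        # Step 3: append decided values only once all earlier slots are decided.
        while True:
            inst = self.slots.get(len(self.log))
            if inst is None or inst.decided is None:
                return
            self.log.append(inst.decided)

shared = SlotLog()
shared.instance(0).propose("from-another-node")   # another node won slot 0
my_slot = shared.append("mine")
```

Here the local value loses slot 0, is retried, and ends up in slot 1.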
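This shows that consensus is equivalent to total order broadcast and shared logs. Single-leader replication without failover does not meet the liveness requirements, since it stops delivering messages if the leader crashes. As usual, the challenge is in performing failover safely and automatically.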
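#### Fetch-and-add as consensus {#fetch-and-add-as-consensus}
The linearizable ID generator we saw in ["Linearizable ID generators"](#sec_consistency_linearizable_id) comes close to solving consensus, but it falls slightly short. We can implement such an ID generator using a fetch-and-add operation, which atomically increments a counter and returns the old counter value.
If you have a CAS operation, it is easy to implement fetch-and-add: first read the counter value, then perform a CAS where the expected value is the value you read, and the new value is that value plus one. If the CAS fails, you retry the whole process until the CAS succeeds. This is less efficient than a native fetch-and-add operation when there is contention, but it is functionally equivalent. Since you can implement CAS using consensus, you can also implement fetch-and-add using consensus.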
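The read-then-CAS retry loop looks roughly like this. `CasCounter` is a hypothetical in-process counter whose `compare_and_set` is made atomic with a lock; in a distributed setting it would be backed by a linearizable CAS service instead.

```python
import threading

class CasCounter:
    """Hypothetical counter that offers only read and compare-and-set."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def read(self):
        with self._lock:
            return self._value

    def compare_and_set(self, expected, new):
        with self._lock:
            if self._value != expected:
                return False
            self._value = new
            return True

def fetch_and_add(counter):
    """Increment the counter by one and return the old value."""
    while True:
        old = counter.read()
        if counter.compare_and_set(old, old + 1):
            return old
        # Another caller got in first: re-read and try again.

counter = CasCounter()
seen = []

def worker():
    for _ in range(100):
        seen.append(fetch_and_add(counter))

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Even under contention, each old value is handed out exactly once, which is what makes the operation usable as an ID generator.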
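Conversely, if you have a fault-tolerant fetch-and-add operation, can you solve the consensus problem? Let's say you initialize the counter to zero, and every node that wants to propose a value invokes the fetch-and-add operation to increment the counter. Since the fetch-and-add operation is atomic, one of the nodes will read the initial value of zero, and the others will all read a value that has been incremented at least once.
Now let's say that the node that reads zero is the winner, and its value is decided. That works for the node that read zero, but the other nodes have a problem: they know that they are not the winner, but they don't know which of the other nodes has won. The winner could send a message to the other nodes to let them know that it has won, but what if the winner crashes before it has a chance to send this message? In that case, the other nodes would be left hanging, unable to decide any value, and thus the consensus would not terminate. The other nodes also cannot fall back to another node, because the node that read zero may come back and rightly decide the value it proposed.
An exception is if we know for sure that no more than two nodes will propose a value. In that case, the nodes can send each other the values they want to propose, and then each perform the fetch-and-add operation. The node that reads zero decides its own value, and the node that reads one decides the other node's value. This solves the consensus problem between two nodes, which is why we can say that fetch-and-add has a *consensus number* of two [^28]. In contrast, CAS and shared logs solve consensus for any number of nodes that may propose values, so they have a consensus number of ∞ (infinity).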
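The two-node case can be sketched as below. The `FetchAndAdd` class is a hypothetical single-process stand-in for a fault-tolerant fetch-and-add object, and the sketch assumes the two nodes have already exchanged their proposed values before calling `decide`.

```python
import itertools

class FetchAndAdd:
    """Hypothetical stand-in for a fault-tolerant fetch-and-add counter."""

    def __init__(self):
        self._next = itertools.count()   # yields 0, 1, 2, ...

    def fetch_and_add(self):
        return next(self._next)          # returns the old counter value

def decide(counter, my_value, other_value):
    """Two-node consensus: whoever reads zero wins."""
    ticket = counter.fetch_and_add()
    return my_value if ticket == 0 else other_value

counter = FetchAndAdd()
decision_a = decide(counter, my_value="A", other_value="B")   # node A reads 0
decision_b = decide(counter, my_value="B", other_value="A")   # node B reads 1
```

With a third proposer, a node that reads a non-zero value could no longer tell which of the other two values had won, which is the limitation described above.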
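#### Atomic commitment as consensus {#atomic-commitment-as-consensus}
In ["Distributed transactions"](/tw/ch8#sec_transactions_distributed) we saw the *atomic commitment* problem, which is to ensure that the databases or shards involved in a distributed transaction all either commit or abort the transaction. We also saw the *two-phase commit* algorithm, which relies on a coordinator that is a single point of failure.
What is the relationship between consensus and atomic commitment? At first glance, they seem very similar: both require nodes to come to some form of agreement. However, there is one important difference: with consensus it is fine to decide any value that was proposed, whereas with atomic commitment the algorithm *must* abort if *any* of the participants voted to abort. More precisely, atomic commitment requires the following properties [^78]:
Uniform agreement
: No two nodes decide on different outcomes.
Integrity
: Once a node has decided one outcome, it cannot change its mind by deciding another outcome.
Validity
: If a node decides to commit, then all nodes must have previously voted to commit. If any node voted to abort, the nodes must abort.
Non-triviality
: If all nodes vote to commit, and no communication timeouts occur, then all nodes must decide to commit.
Termination
: Every node that does not crash eventually decides to either commit or abort.
The validity property ensures that a transaction can only commit if all nodes agree; the non-triviality property ensures that the algorithm cannot simply always abort (but it allows an abort if any of the communication among the nodes times out). The other three properties are basically the same as for consensus.
If you have a solution for consensus, there are multiple ways you could solve atomic commitment [^78] [^79]. One works like this: when you want to commit the transaction, every node sends its vote to commit or abort to every other node. Nodes that receive a vote to commit from themselves and every other node propose "commit" using the consensus algorithm; nodes that receive a vote to abort, or which experience a timeout, propose "abort" using the consensus algorithm. When a node finds out what the consensus algorithm decided, it commits or aborts accordingly.
In this algorithm, "commit" will only be proposed if all nodes voted to commit. If any node voted to abort, all proposals in the consensus algorithm will be "abort". If all nodes voted to commit but some communication timed out, it could happen that some nodes propose "abort" while others propose "commit"; in this case it doesn't matter whether the nodes commit or abort, as long as they all do the same.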
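The rule each node uses to pick its proposal can be written down directly. This sketch covers only that local rule; the consensus algorithm that then chooses among the proposals is not shown, and the function name and vote representation are assumptions made for illustration.

```python
COMMIT, ABORT = "commit", "abort"

def choose_proposal(received_votes, all_nodes):
    """Return what this node proposes to the consensus algorithm.

    `received_votes` maps node name -> vote, for the votes (including this
    node's own) that arrived before the timeout. A missing entry means the
    vote timed out.
    """
    if all(received_votes.get(node) == COMMIT for node in all_nodes):
        return COMMIT
    return ABORT   # some node voted abort, or some vote never arrived

nodes = ["n1", "n2", "n3"]
all_yes = {"n1": COMMIT, "n2": COMMIT, "n3": COMMIT}
one_no = {"n1": COMMIT, "n2": ABORT, "n3": COMMIT}
timed_out = {"n1": COMMIT, "n2": COMMIT}   # n3's vote never arrived

proposals = [choose_proposal(v, nodes) for v in (all_yes, one_no, timed_out)]
```

Because a node proposes "commit" only after seeing a commit vote from every node, an abort vote anywhere guarantees that no "commit" proposal exists, so consensus can only decide "abort".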
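If you have a fault-tolerant atomic commitment protocol, you can also solve consensus. Every node that wants to propose a value starts a transaction on a quorum of nodes, and on each node it performs a single-node CAS to set a register to the proposed value if its value has not already been set by another transaction. If the CAS succeeds, the node votes to commit, otherwise it votes to abort. If the atomic commitment protocol decides to commit a transaction, its value is decided for consensus; if atomic commit aborts, the proposing node retries with a new transaction.
This shows that atomic commit and consensus are also equivalent to each other.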
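### Consensus in Practice {#sec_consistency_total_order}
We have seen that single-value consensus, CAS, shared logs, and atomic commitment are all equivalent to each other: you can convert a solution to one of them into a solution to any of the others. That is a valuable theoretical insight, but it doesn't answer the question: which of these many formulations of consensus is the most useful in practice?
The answer is that most consensus systems provide shared logs, also known as total order broadcast. Raft, Viewstamped Replication, and Zab provide shared logs directly. Paxos provides single-value consensus, but in practice most systems using Paxos actually use the extension called Multi-Paxos, which also provides a shared log.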
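#### Using shared logs {#sec_consistency_smr}
A shared log is a good fit for database replication: if every log entry represents a write to the database, and every replica processes the same writes in the same order using deterministic logic, then the replicas will all be in a consistent state. This idea is known as *state machine replication* [^80], and it is the principle behind event sourcing, which we saw in ["Event sourcing and CQRS"](/tw/ch3#sec_datamodels_events). Shared logs are also useful for stream processing, as we shall see in [Chapter 12](/tw/ch12#ch_stream).
Similarly, a shared log can be used to implement serializable transactions: as discussed in ["Actual serial execution"](/tw/ch8#sec_transactions_serial), if every log entry represents a deterministic transaction to be executed as a stored procedure, and if every node executes those transactions in the same order, then the transactions will be serializable [^81] [^82].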
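A toy version of state machine replication makes the role of ordering and determinism concrete. The operation names (`set`, `incr`) and the tuple layout of log entries are invented for this sketch.

```python
def apply_entry(state, entry):
    """Deterministically apply one log entry to a key-value state."""
    op, key, value = entry
    new_state = dict(state)
    if op == "set":
        new_state[key] = value
    elif op == "incr":
        new_state[key] = new_state.get(key, 0) + value
    return new_state

def replay(log):
    """Build a replica's state by processing the log from the beginning."""
    state = {}
    for entry in log:
        state = apply_entry(state, entry)
    return state

log = [("set", "x", 1), ("incr", "x", 5), ("set", "y", 2)]

replica_a = replay(log)
replica_b = replay(log)                        # same entries, same order
reordered = replay([log[1], log[0], log[2]])   # same entries, different order
```

Both replicas reach `{"x": 6, "y": 2}`, whereas the reordered replay ends with `x` equal to 1; this is why the log must deliver the same entries in the same order to every replica.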
---------
> [!NOTE]
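> Sharded databases with a strong consistency model often maintain a separate log per shard, which improves scalability, but limits the consistency guarantees (e.g., consistent snapshots, foreign key references) they can offer across shards. Serializable transactions across shards are possible, but require additional coordination [^83].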
--------
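A shared log is also powerful because it can easily be adapted to other forms of consensus:
* We saw previously how to use it to implement single-value consensus and CAS: simply decide the value that appears first in the log.
* If you want many instances of single-value consensus (e.g., one per seat in a theater that several people are trying to book), include the seat number in the log entries, and decide the first log entry that contains a given seat number.
* If you want an atomic fetch-and-add, put the number to add to the counter into a log entry; the current counter value is the sum of all of the log entries so far. A simple counter over log entries can be used to generate fencing tokens (see ["Fencing off zombies and delayed requests"](/tw/ch9#sec_distributed_fencing_tokens)); for example, in ZooKeeper, this sequence number is called `zxid` [^18].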
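The seat-booking and fetch-and-add adaptations can be illustrated with plain functions over a list of entries. The entry layout (dictionaries with `type`, `seat`, `customer`, and `amount` fields) is an assumption made for this sketch; the list stands in for a log that every node reads in the same order.

```python
def seat_owner(log, seat):
    """Decide a seat: the first booking entry mentioning it wins."""
    for entry in log:
        if entry["type"] == "book" and entry["seat"] == seat:
            return entry["customer"]
    return None   # nobody has booked this seat yet

def counter_value(log):
    """Current counter value: the sum of all 'add' entries so far."""
    return sum(e["amount"] for e in log if e["type"] == "add")

def counter_before(log, index):
    """Value that fetch-and-add returns for the 'add' entry at `index`."""
    return sum(e["amount"] for e in log[:index] if e["type"] == "add")

log = [
    {"type": "book", "seat": 12, "customer": "alice"},
    {"type": "add", "amount": 5},
    {"type": "book", "seat": 12, "customer": "bob"},   # too late: seat 12 is taken
    {"type": "book", "seat": 7, "customer": "bob"},
    {"type": "add", "amount": 3},
]
```

Since all nodes evaluate these functions over the same sequence of entries, they all reach the same answers without any further coordination.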
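#### From single-leader replication to consensus {#from-single-leader-replication-to-consensus}
We saw previously that single-value consensus is easy if you have a single "dictator" node that makes the decision, and likewise a shared log is easy if a single leader is the only node that is allowed to append entries to it. The question is how to provide fault tolerance if that node fails.
Traditionally, databases with single-leader replication didn't solve this problem: they left leader failover as an action that a human administrator had to perform manually. Unfortunately, this means a significant amount of downtime, since there is a limit to how fast humans can react, and it doesn't satisfy the termination property of consensus. For consensus, we require that the algorithm can automatically choose a new leader. (Not all consensus algorithms have a leader, but the commonly used algorithms do [^84].)
However, there is a problem. We previously discussed the problem of split brain, and said that all nodes need to agree who the leader is; otherwise two different nodes could each believe themselves to be the leader, and consequently make inconsistent decisions. Thus, it seems like we need consensus in order to elect a leader, and we need a leader in order to solve consensus. How do we break out of this conundrum?
In fact, consensus algorithms don't require that there is only one leader at any one time. Instead, they make a weaker guarantee: they define an *epoch number* (called the *ballot number* in Paxos, *view number* in Viewstamped Replication, and *term number* in Raft) and guarantee that within each epoch, the leader is unique.
When a node believes the current leader to be dead because it hasn't heard from the leader within some timeout, it may start a vote to elect a new leader. This election is given a new epoch number that is greater than any previous epoch. If there is a conflict between two different leaders in two different epochs (perhaps because the previous leader actually wasn't dead after all), then the leader with the higher epoch number prevails.
Before a leader is allowed to append the next entry to the shared log, it must first check that there isn't some other leader with a higher epoch number that might append a different entry. It can do this by collecting votes from a quorum of nodes, typically (but not always) a majority of nodes [^85]. A node votes yes only if it is not aware of any other leader with a higher epoch.
Thus, we have two rounds of voting: once to choose a leader, and a second time to vote on the leader's proposal for the next entry to append to the log. The quorums for those two votes must overlap: if a vote on a proposal succeeds, at least one of the nodes that voted for it must have also participated in the most recent successful leader election [^85]. Thus, if the vote on a proposal passes without revealing any higher-numbered epoch, the current leader can conclude that no leader with a higher epoch number has been elected, and therefore it can safely append the proposed entry to the log [^26] [^86].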
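The effect of epoch numbers and overlapping quorums can be seen in a heavily simplified simulation. It is not Raft or Paxos: the same `vote` rule is used for both elections and log appends, no log contents are exchanged, and the `Node` class and `wins_quorum` helper are names invented here.

```python
class Node:
    def __init__(self):
        self.highest_epoch = 0   # highest epoch this node has voted for

    def vote(self, epoch):
        """Vote yes unless this node already knows of a higher epoch."""
        if epoch < self.highest_epoch:
            return False
        self.highest_epoch = epoch
        return True

def wins_quorum(reachable_nodes, epoch, cluster_size):
    """True if a strict majority of the whole cluster votes yes."""
    yes_votes = sum(node.vote(epoch) for node in reachable_nodes)
    return yes_votes > cluster_size // 2

n1, n2, n3 = Node(), Node(), Node()
cluster = [n1, n2, n3]

elected_epoch_1 = wins_quorum([n1, n2], epoch=1, cluster_size=3)   # old leader
elected_epoch_2 = wins_quorum([n2, n3], epoch=2, cluster_size=3)   # new leader

# The old leader, unaware of epoch 2, tries to append an entry:
old_leader_append = wins_quorum(cluster, epoch=1, cluster_size=3)
# The new leader tries to append an entry:
new_leader_append = wins_quorum(cluster, epoch=2, cluster_size=3)
```

The two election quorums share `n2`, so the old leader's append can gather at most one yes vote (from `n1`) and fails, while the new leader's append succeeds.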
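These two rounds of voting look superficially similar to two-phase commit, but they are very different protocols. In consensus algorithms, any node can start an election and it requires only a quorum of nodes to respond; in 2PC, only the coordinator can request votes, and it requires a "yes" vote from *every* participant before it can commit.
#### Subtleties of consensus {#subtleties-of-consensus}
This basic structure is common to all of Raft, Multi-Paxos, Zab, and Viewstamped Replication: a vote by a quorum of nodes elects a leader, and then another quorum vote is required for every entry that the leader wants to append to the log [^68] [^69]. Every new log entry is synchronously replicated to a quorum of nodes before it is confirmed to the client that requested the write. This ensures that the log entry won't be lost if the current leader fails.
However, the devil is in the details, and that is also where these algorithms take different approaches. For example, when the old leader fails and a new one is elected, the algorithm needs to ensure that the new leader honors any log entries that had already been appended by the old leader before it failed. Raft does this by only allowing a node to become the new leader if its log is at least as up-to-date as a majority of its followers [^69]. In contrast, Paxos allows any node to become the new leader, but requires it to bring its log up-to-date with other nodes before it starts appending new entries of its own.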
--------
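> [!TIP] Consistency versus availability in leader election
If you want the consensus algorithm to strictly guarantee the properties listed in ["Shared logs as consensus"](#sec_consistency_shared_logs), it is essential that the new leader is aware of any confirmed log entries before it processes any writes or linearizable reads. If a node with stale data were to become the new leader, it might write a new value to a log entry that has already been written by the old leader, violating the shared log's append-only property.
In some cases, you might choose to weaken the consensus properties in order to recover more quickly from a leader failure. For example, Kafka offers the option of enabling *unclean leader election*, which allows any replica to become leader, even if it is not up-to-date. Also, in databases with asynchronous replication, you cannot guarantee that any follower is up-to-date when the leader fails.
If you drop the requirement that the new leader must be up-to-date, you may improve performance and availability, but you are on thin ice since the theory of consensus no longer applies. While things will work fine as long as there are no faults, the problems discussed in [Chapter 9](/tw/ch9) can easily cause large amounts of data loss or corruption.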
--------
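Another subtlety is how the algorithm deals with log entries that were proposed by the old leader before it failed, but for which the vote on appending to the log had not yet completed. You can find discussions of these details in the references for this chapter [^23] [^69] [^86].
For databases that use a consensus algorithm for replication, it is not only writes that need to be turned into log entries and replicated to a quorum. If you want to guarantee linearizable reads, they also have to go through a quorum vote just like a write, to confirm that the node that believes itself to be the leader really is still up-to-date. Linearizable reads in etcd work like this, for example.
In their standard form, most consensus algorithms assume a fixed set of nodes; that is, nodes may go down and come back up again, but the set of nodes that is allowed to vote is fixed when the cluster is created. In practice, it is often necessary to add new nodes or remove old nodes in a system configuration. Consensus algorithms have been extended with *reconfiguration* features that make this possible. This is especially useful when adding new regions to a system, or when migrating from one location to another (by first adding the new nodes, and then removing the old nodes).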
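#### Pros and cons of consensus {#pros-and-cons-of-consensus}
Although they are complex and subtle, consensus algorithms are a huge breakthrough for distributed systems. Consensus is essentially "single-leader replication done right", with automatic failover on leader failure, ensuring that no committed data is lost and no split brain is possible, even in the face of all the problems we discussed in [Chapter 9](/tw/ch9).
Since single-leader replication with automatic failover is essentially one of the definitions of consensus, any system that provides automatic failover but does not use a proven consensus algorithm is likely to be unsafe [^87]. Using a proven consensus algorithm is not a guarantee of the correctness of the whole system (there are still plenty of other places where bugs can lurk), but it is a good start.
Nevertheless, consensus is not used everywhere, because the benefits come at a cost. Consensus systems always require a strict majority to operate: three nodes to tolerate one failure, or five nodes to tolerate two failures. Every operation needs to communicate with a quorum, so you can't increase throughput by adding more nodes (in fact, every node you add makes the algorithm slower). If a network partition cuts off some nodes from the rest, only the majority portion of the network can make progress, and the rest is blocked.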
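The majority arithmetic behind "three nodes for one failure, five nodes for two" is easy to check. The helper names below are chosen for this sketch.

```python
def quorum_size(n):
    """Smallest strict majority of n nodes."""
    return n // 2 + 1

def failures_tolerated(n):
    """How many nodes can be down while a majority is still reachable."""
    return (n - 1) // 2

def nodes_needed(f):
    """Smallest cluster that still has a majority after f failures."""
    return 2 * f + 1

sizes = {n: (quorum_size(n), failures_tolerated(n)) for n in range(3, 8)}
```

Note that an even-sized cluster tolerates no more failures than the next smaller odd size (four nodes tolerate one failure, just like three), which is one reason odd cluster sizes are the usual choice.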
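Consensus systems generally rely on timeouts to detect failed nodes. In environments with highly variable network delays, especially systems that are distributed across multiple geographic regions, it can be difficult to tune these timeouts: if they are too large, it takes a long time to recover from a failure; if they are too small, there can be lots of unnecessary leader elections, resulting in terrible performance as the system ends up spending more time choosing a leader than doing useful work.
Sometimes, consensus algorithms are particularly sensitive to network problems. For example, Raft has been shown to have unpleasant edge cases [^88] [^89]: if the entire network is working correctly except for one particular network link that is consistently unreliable, Raft can get into situations where leadership continually bounces between two nodes, or the current leader is continually forced to resign, so the system effectively never makes progress. Designing algorithms that are more robust to unreliable networks is still an open research problem.
For systems that want to be highly available, but don't want to accept the cost of consensus, the only real option is to use a weaker consistency model instead, such as those offered by leaderless or multi-leader replication as discussed in [Chapter 6](/tw/ch6). These approaches typically don't offer linearizability, but for applications that don't need it that is fine.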
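### Coordination Services {#sec_consistency_coordination}
Consensus algorithms are valuable for any distributed database that wants to offer linearizable operations, and many modern distributed databases use consensus for replication. But there is one family of systems that are particularly heavy users of consensus: *coordination services* such as ZooKeeper, etcd, and Consul. Although they look superficially like ordinary key-value stores, they are not designed for general-purpose data storage.
Instead, their purpose is to coordinate multiple nodes in another distributed system. For example, Kubernetes relies on etcd, while Spark and Flink in high availability mode rely on ZooKeeper running in the background. Coordination services typically hold only small amounts of data that fit entirely in memory (although they still write to disk for durability), and replicate it across multiple nodes using a fault-tolerant consensus algorithm.
The design of coordination services is modeled after Google's Chubby lock service [^17] [^58]. It combines a consensus algorithm with several features that turn out to be particularly useful in distributed systems:
Locks and leases
: We saw earlier that consensus systems can implement a fault-tolerant atomic compare-and-set (CAS) operation. Coordination services rely on this to implement locks and leases: if several nodes concurrently try to acquire the same lease, only one of them will succeed.
Support for fencing
: As discussed in ["Distributed locks and leases"](/tw/ch9#sec_distributed_lock_fencing), when a resource is protected by a lease, you need *fencing* to prevent clients from interfering with each other in the case of a process pause or a large network delay. Consensus systems can generate fencing tokens by giving each log entry a monotonically increasing ID (`zxid` and `cversion` in ZooKeeper, revision in etcd).
Failure detection
: Clients maintain a long-lived session on the coordination service, and periodically exchange heartbeats to check that the other side is still alive. Even if the connection is temporarily interrupted, or a server fails, any leases held by the client remain active. However, if no heartbeat arrives for longer than the lease timeout, the coordination service assumes the client is dead and releases the lease (ZooKeeper calls these *ephemeral nodes*).
Change notifications
: A client can request that the coordination service sends it a notification whenever certain keys change. This allows a client to find out when another node joins the cluster (based on the value it writes), or when it fails (because its session times out and its ephemeral nodes disappear). These notifications avoid the need for the client to poll frequently.
Failure detection and change notifications do not by themselves require consensus, but they are very useful for distributed coordination in combination with the atomic operations and fencing support that do require consensus.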
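To show how a fencing token is used on the resource side, here is a minimal sketch of a storage service that rejects writes carrying an older token than one it has already seen. The `FencedStore` class and the token values 33 and 34 are illustrative assumptions; in practice the tokens would come from the coordination service (for example a `zxid` or an etcd revision).

```python
class FencedStore:
    """Hypothetical storage service that checks fencing tokens on every write."""

    def __init__(self):
        self.highest_token = 0
        self.data = {}

    def write(self, token, key, value):
        if token < self.highest_token:
            return False              # stale lease holder: reject the write
        self.highest_token = token
        self.data[key] = value
        return True

store = FencedStore()

# Client A acquired the lease with token 33, then paused for a long time.
# Meanwhile the lease expired and client B acquired it with token 34.
accepted_b = store.write(34, "file", "written by B")
# Client A wakes up, still believing it holds the lease:
accepted_a = store.write(33, "file", "written by A")
```

Client A's delayed write is refused, so it cannot overwrite the data written by the current lease holder.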
--------
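> [!TIP] Managing configuration with coordination services
Applications and infrastructure often have configuration parameters such as timeouts, thread pool sizes, and so on. Such configuration data is sometimes stored as key-value pairs in a coordination service. Processes load the latest settings when they start up, and subscribe to notifications of subsequent changes. When the configuration is updated, a process can either apply the new settings immediately or restart to pick them up.
Configuration management does not itself need the consensus aspect of a coordination service, but if the system is already running a coordination service anyway, it is convenient to reuse its notification mechanism. Alternatively, a process can periodically poll for configuration updates from a file or URL, which avoids the need for a dedicated coordination service.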
--------
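#### Allocating work to nodes {#allocating-work-to-nodes}
A coordination service is useful if you have several instances of a process or service, and one of them needs to be chosen as the leader. If the leader fails, one of the other nodes should take over. This applies not only to single-leader databases, but also to stateful systems such as job schedulers.
Another scenario is when you have some sharded resource (a database, message stream, file storage, distributed actor system, and so on) and need to decide which node is responsible for each shard. As new nodes join the cluster, some of the shards need to be moved from existing nodes to the new nodes in order to rebalance the load. When nodes are removed or fail, other nodes need to take over their work.
Tasks like these can be accomplished by combining atomic operations, ephemeral nodes, and notifications in a coordination service. If done correctly, this allows the application to recover from faults automatically, without human intervention. It is not easy, even with higher-level libraries such as Apache Curator that wrap the ZooKeeper client API, but it is still much better than implementing a consensus algorithm from scratch, which is very prone to bugs.
A dedicated coordination service has another advantage: no matter how many nodes the coordinated system has, the coordination service itself can usually run on a fixed set of nodes (commonly three or five). For example, a storage system with thousands of shards would be very inefficient if it ran consensus directly across thousands of nodes; it usually makes more sense to "outsource" consensus to a small number of coordination service nodes.
Normally, the kind of data managed by a coordination service changes infrequently: information such as "the node with IP address 10.1.1.23 is currently the leader for shard 7" typically changes on a timescale of minutes or hours. Coordination services are not suited to storing data that changes thousands of times per second. For rapidly changing data, you should use a conventional database, or use tools like Apache BookKeeper [^90] [^91] to replicate the fast-changing internal state of a service.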
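#### Service discovery {#service-discovery}
ZooKeeper, etcd, and Consul are also often used for *service discovery*, that is, to find out which IP address you need to connect to in order to reach a particular service (see ["Load balancers, service discovery, and service meshes"](/tw/ch5#sec_encoding_service_discovery)). In cloud environments, where virtual machines frequently come and go, you often don't know the addresses of your services ahead of time. A common approach is for a service to register its network endpoint in a service registry when it starts up, where it can then be found by other services.
Using a coordination service for service discovery is convenient, because its failure detection and change notifications let clients keep track of service instances as they come and go. And if you are already using a coordination service for leases, locks, or leader election, it is often natural to reuse it for service discovery too, since it already knows which node should receive requests.
However, using consensus for service discovery is often overkill: this use case usually doesn't require linearizability, and high availability and low latency matter more, since without service discovery the whole system grinds to a halt. It is therefore generally preferable to cache service discovery results and accept that they may be slightly stale. DNS-based service discovery, for example, uses multiple layers of caching to achieve good performance and availability.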
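One way to picture "cache the result and accept some staleness" is a small client-side cache with a time-to-live. Everything here is hypothetical: the `CachedResolver` class, the `lookup` callback standing in for a query to a service registry, the 30-second TTL, and the choice to serve an expired entry when the registry is unreachable.

```python
import time

class CachedResolver:
    def __init__(self, lookup, ttl_seconds, clock=time.monotonic):
        self._lookup = lookup      # function: service name -> list of addresses
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache = {}           # service name -> (addresses, fetched_at)

    def resolve(self, service):
        now = self._clock()
        cached = self._cache.get(service)
        if cached is not None and now - cached[1] < self._ttl:
            return cached[0]                 # fresh enough: no registry call
        try:
            addresses = self._lookup(service)
        except ConnectionError:
            if cached is not None:
                return cached[0]             # registry down: serve stale data
            raise
        self._cache[service] = (addresses, now)
        return addresses

# Simulated registry and clock, so the example runs without a network.
registry = {"orders": ["10.0.0.5:8080"]}
status = {"up": True}
now = [0.0]

def lookup(service):
    if not status["up"]:
        raise ConnectionError("registry unreachable")
    return list(registry[service])

resolver = CachedResolver(lookup, ttl_seconds=30, clock=lambda: now[0])
first = resolver.resolve("orders")
registry["orders"] = ["10.0.0.9:8080"]       # the service moves
within_ttl = resolver.resolve("orders")      # still the old, slightly stale address
now[0] = 60.0
refreshed = resolver.resolve("orders")       # TTL expired: registry is asked again
status["up"] = False
now[0] = 120.0
during_outage = resolver.resolve("orders")   # stale, but still available
```

The trade-off is the one described above: lookups stay fast and survive a registry outage, at the price of occasionally returning an address that is out of date.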
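To support this kind of use, ZooKeeper provides *observer* nodes: an observer receives the log and maintains a copy of the ZooKeeper data, but does not take part in consensus votes. Reads from an observer are not linearizable (they may be stale), but they remain available even during a network interruption, and by acting as a cache they increase the read throughput the system can support.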
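## Summary {#summary}
In this chapter we examined the topic of strong consistency in fault-tolerant systems: what it is, and how to achieve it. We looked in depth at linearizability, a popular formalization of strong consistency: it means that replicated data appears as though there were only a single copy, and all operations act on it atomically. We saw that linearizability is useful when you need some data to be up-to-date when you read it, or if you need to resolve a race condition (e.g., if multiple nodes are concurrently trying to do the same thing, such as creating files with the same name).
Although linearizability is appealing because it is easy to understand (it makes a database behave like a variable in a single-threaded program), it has the downside of being slow, especially in environments with large network delays. Many replication algorithms don't guarantee linearizability, even though it superficially might seem like they provide strong consistency.
Next, we applied the concept of linearizability in the context of ID generators. A single-node auto-incrementing counter is linearizable, but not fault-tolerant. Many distributed ID generation schemes don't guarantee that the IDs are ordered consistently with the order in which the events actually happened. Logical clocks such as Lamport clocks and hybrid logical clocks provide ordering that is consistent with causality, but no linearizability.
This led us to the concept of consensus. We saw that achieving consensus means deciding something in such a way that all nodes agree on what was decided, and such that they can't change their mind. A wide range of problems are actually reducible to consensus and are equivalent to each other (i.e., if you have a solution for one of them, you can transform it into a solution for all of the others). Such equivalent problems include: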
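Linearizable compare-and-set operation
: The register needs to atomically **decide** whether to set its value, based on whether its current value equals the parameter given in the operation.
Locks and leases
: When several clients are concurrently trying to grab a lock or lease, the lock **decides** which one successfully acquired it.
Uniqueness constraints
: When several transactions concurrently try to create conflicting records with the same key, the constraint must **decide** which one to allow and which should fail with a constraint violation.
Shared logs
: When several nodes concurrently want to append entries to a log, the log **decides** in which order they are appended. Total order broadcast is also equivalent.
Atomic transaction commit
: The database nodes involved in a distributed transaction must all **decide** the same way whether to commit or abort the transaction.
Linearizable fetch-and-add operation
: This operation can be used to implement an ID generator. Several nodes can concurrently invoke the operation, and it **decides** the order in which they increment the counter. This case actually solves consensus only between two nodes, while the others work for any number of nodes.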
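All of these are straightforward if you only have a single node, or if you are willing to assign the decision-making capability to a single node. This is what happens in a single-leader database: all the power to make decisions is vested in the leader, which is why such databases are able to provide linearizable operations, uniqueness constraints, a replication log, and more.
However, if that single leader fails, or if a network interruption makes the leader unreachable, such a system becomes unable to make any progress until a human performs a manual failover. Widely used consensus algorithms like Raft and Paxos are essentially single-leader replication with built-in automatic leader election and failover.
Consensus algorithms are carefully designed to ensure that no committed writes are lost during a failover, and that the system cannot get into a split brain state in which multiple nodes are accepting writes. This requires that every write, and every linearizable read, is confirmed by a quorum (typically a majority) of nodes. This can be expensive, especially across geographic regions, but it is unavoidable if you want the strong consistency and fault tolerance that consensus provides.
Coordination services like ZooKeeper and etcd are also built on top of consensus algorithms. They provide locks, leases, failure detection, and change notification features that are useful for managing the state of distributed applications. If you find yourself wanting to do one of those things that is reducible to consensus, and you want it to be fault-tolerant, it is advisable to use a coordination service. It won't guarantee that you will get it right, but it will probably help.
Consensus algorithms are complicated and subtle, but they are supported by a rich body of theory that has been developed since the 1980s. This theory makes it possible to build systems that can tolerate the faults that we discussed in [Chapter 9](/tw/ch9#ch_distributed), and still ensure that your data is not corrupted. This is an important achievement of distributed systems engineering, and the references at the end of this chapter feature some of the key work.
Nevertheless, consensus is not always the right tool: in some systems, the strong consistency properties it provides are not needed, and it is better to have weaker consistency with higher availability and better performance. In these cases, it is common to use leaderless or multi-leader replication, which we previously discussed in [Chapter 6](/tw/ch6#ch_replication). The logical clocks we discussed in this chapter are helpful in that context.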
### References
[^1]: Maurice P. Herlihy and Jeannette M. Wing. [Linearizability: A Correctness Condition for Concurrent Objects](https://cs.brown.edu/~mph/HerlihyW90/p463-herlihy.pdf). *ACM Transactions on Programming Languages and Systems* (TOPLAS), volume 12, issue 3, pages 463–492, July 1990. [doi:10.1145/78969.78972](https://doi.org/10.1145/78969.78972)
[^2]: Leslie Lamport. [On interprocess communication](https://www.microsoft.com/en-us/research/publication/interprocess-communication-part-basic-formalism-part-ii-algorithms/). *Distributed Computing*, volume 1, issue 2, pages 77–101, June 1986. [doi:10.1007/BF01786228](https://doi.org/10.1007/BF01786228)
[^3]: David K. Gifford. [Information Storage in a Decentralized Computer System](https://bitsavers.org/pdf/xerox/parc/techReports/CSL-81-8_Information_Storage_in_a_Decentralized_Computer_System.pdf). Xerox Palo Alto Research Centers, CSL-81-8, June 1981. Archived at [perma.cc/2XXP-3JPB](https://perma.cc/2XXP-3JPB)
[^4]: Martin Kleppmann. [Please Stop Calling Databases CP or AP](https://martin.kleppmann.com/2015/05/11/please-stop-calling-databases-cp-or-ap.html). *martin.kleppmann.com*, May 2015. Archived at [perma.cc/MJ5G-75GL](https://perma.cc/MJ5G-75GL)
[^5]: Kyle Kingsbury. [Call Me Maybe: MongoDB Stale Reads](https://aphyr.com/posts/322-call-me-maybe-mongodb-stale-reads). *aphyr.com*, April 2015. Archived at [perma.cc/DXB4-J4JC](https://perma.cc/DXB4-J4JC)
[^6]: Kyle Kingsbury. [Computational Techniques in Knossos](https://aphyr.com/posts/314-computational-techniques-in-knossos). *aphyr.com*, May 2014. Archived at [perma.cc/2X5M-EHTU](https://perma.cc/2X5M-EHTU)
[^7]: Kyle Kingsbury and Peter Alvaro. [Elle: Inferring Isolation Anomalies from Experimental Observations](https://www.vldb.org/pvldb/vol14/p268-alvaro.pdf). *Proceedings of the VLDB Endowment*, volume 14, issue 3, pages 268–280, November 2020. [doi:10.14778/3430915.3430918](https://doi.org/10.14778/3430915.3430918)
[^8]: Paolo Viotti and Marko Vukolić. [Consistency in Non-Transactional Distributed Storage Systems](https://arxiv.org/abs/1512.00168). *ACM Computing Surveys* (CSUR), volume 49, issue 1, article no. 19, June 2016. [doi:10.1145/2926965](https://doi.org/10.1145/2926965)
[^9]: Peter Bailis. [Linearizability Versus Serializability](http://www.bailis.org/blog/linearizability-versus-serializability/). *bailis.org*, September 2014. Archived at [perma.cc/386B-KAC3](https://perma.cc/386B-KAC3)
[^10]: Daniel Abadi. [Correctness Anomalies Under Serializable Isolation](https://dbmsmusings.blogspot.com/2019/06/correctness-anomalies-under.html). *dbmsmusings.blogspot.com*, June 2019. Archived at [perma.cc/JGS7-BZFY](https://perma.cc/JGS7-BZFY)
[^11]: Peter Bailis, Aaron Davidson, Alan Fekete, Ali Ghodsi, Joseph M. Hellerstein, and Ion Stoica. [Highly Available Transactions: Virtues and Limitations](https://www.vldb.org/pvldb/vol7/p181-bailis.pdf). *Proceedings of the VLDB Endowment*, volume 7, issue 3, pages 181–192, November 2013. [doi:10.14778/2732232.2732237](https://doi.org/10.14778/2732232.2732237), extended version published as [arXiv:1302.0309](https://arxiv.org/abs/1302.0309)
[^12]: Philip A. Bernstein, Vassos Hadzilacos, and Nathan Goodman. [*Concurrency Control and Recovery in Database Systems*](https://www.microsoft.com/en-us/research/people/philbe/book/). Addison-Wesley, 1987. ISBN: 978-0-201-10715-9, available online at [*microsoft.com*](https://www.microsoft.com/en-us/research/people/philbe/book/).
[^13]: Andrei Matei. [CockroachDB’s consistency model](https://www.cockroachlabs.com/blog/consistency-model/). *cockroachlabs.com*, February 2021. Archived at [perma.cc/MR38-883B](https://perma.cc/MR38-883B)
[^14]: Murat Demirbas. [Strict-serializability, but at what cost, for what purpose?](https://muratbuffalo.blogspot.com/2022/08/strict-serializability-but-at-what-cost.html) *muratbuffalo.blogspot.com*, August 2022. Archived at [perma.cc/T8AY-N3U9](https://perma.cc/T8AY-N3U9)
[^15]: Ben Darnell. [How to talk about consistency and isolation in distributed DBs](https://www.cockroachlabs.com/blog/db-consistency-isolation-terminology/). *cockroachlabs.com*, February 2022. Archived at [perma.cc/53SV-JBGK](https://perma.cc/53SV-JBGK)
[^16]: Daniel Abadi. [An explanation of the difference between Isolation levels vs. Consistency levels](https://dbmsmusings.blogspot.com/2019/08/an-explanation-of-difference-between.html). *dbmsmusings.blogspot.com*, August 2019. Archived at [perma.cc/QSF2-CD4P](https://perma.cc/QSF2-CD4P)
[^17]: Mike Burrows. [The Chubby Lock Service for Loosely-Coupled Distributed Systems](https://research.google/pubs/pub27897/). At *7th USENIX Symposium on Operating System Design and Implementation* (OSDI), November 2006.
[^18]: Flavio P. Junqueira and Benjamin Reed. [*ZooKeeper: Distributed Process Coordination*](https://www.oreilly.com/library/view/zookeeper/9781449361297/). O’Reilly Media, 2013. ISBN: 978-1-449-36130-3
[^19]: Murali Vallath. [*Oracle 10g RAC Grid, Services & Clustering*](https://www.oreilly.com/library/view/oracle-10g-rac/9781555583217/). Elsevier Digital Press, 2006. ISBN: 978-1-555-58321-7
[^20]: Peter Bailis, Alan Fekete, Michael J. Franklin, Ali Ghodsi, Joseph M. Hellerstein, and Ion Stoica. [Coordination Avoidance in Database Systems](https://arxiv.org/abs/1402.2237). *Proceedings of the VLDB Endowment*, volume 8, issue 3, pages 185–196, November 2014. [doi:10.14778/2735508.2735509](https://doi.org/10.14778/2735508.2735509)
[^21]: Kyle Kingsbury. [Call Me Maybe: etcd and Consul](https://aphyr.com/posts/316-call-me-maybe-etcd-and-consul). *aphyr.com*, June 2014. Archived at [perma.cc/XL7U-378K](https://perma.cc/XL7U-378K)
[^22]: Flavio P. Junqueira, Benjamin C. Reed, and Marco Serafini. [Zab: High-Performance Broadcast for Primary-Backup Systems](https://marcoserafini.github.io/assets/pdf/zab.pdf). At *41st IEEE International Conference on Dependable Systems and Networks* (DSN), June 2011. [doi:10.1109/DSN.2011.5958223](https://doi.org/10.1109/DSN.2011.5958223)
[^23]: Diego Ongaro and John K. Ousterhout. [In Search of an Understandable Consensus Algorithm](https://www.usenix.org/system/files/conference/atc14/atc14-paper-ongaro.pdf). At *USENIX Annual Technical Conference* (ATC), June 2014.
[^24]: Hagit Attiya, Amotz Bar-Noy, and Danny Dolev. [Sharing Memory Robustly in Message-Passing Systems](https://www.cs.huji.ac.il/course/2004/dist/p124-attiya.pdf). *Journal of the ACM*, volume 42, issue 1, pages 124–142, January 1995. [doi:10.1145/200836.200869](https://doi.org/10.1145/200836.200869)
[^25]: Nancy Lynch and Alex Shvartsman. [Robust Emulation of Shared Memory Using Dynamic Quorum-Acknowledged Broadcasts](https://groups.csail.mit.edu/tds/papers/Lynch/FTCS97.pdf). At *27th Annual International Symposium on Fault-Tolerant Computing* (FTCS), June 1997. [doi:10.1109/FTCS.1997.614100](https://doi.org/10.1109/FTCS.1997.614100)
[^26]: Christian Cachin, Rachid Guerraoui, and Luís Rodrigues. [*Introduction to Reliable and Secure Distributed Programming*](https://www.distributedprogramming.net/), 2nd edition. Springer, 2011. ISBN: 978-3-642-15259-7, [doi:10.1007/978-3-642-15260-3](https://doi.org/10.1007/978-3-642-15260-3)
[^27]: Niklas Ekström, Mikhail Panchenko, and Jonathan Ellis. [Possible Issue with Read Repair?](https://lists.apache.org/thread/wwsjnnc93mdlpw8nb0d5gn4q1bmpzbon) Email thread on *cassandra-dev* mailing list, October 2012.
[^28]: Maurice P. Herlihy. [Wait-Free Synchronization](https://cs.brown.edu/~mph/Herlihy91/p124-herlihy.pdf). *ACM Transactions on Programming Languages and Systems* (TOPLAS), volume 13, issue 1, pages 124–149, January 1991. [doi:10.1145/114005.102808](https://doi.org/10.1145/114005.102808)
[^29]: Armando Fox and Eric A. Brewer. [Harvest, Yield, and Scalable Tolerant Systems](https://radlab.cs.berkeley.edu/people/fox/static/pubs/pdf/c18.pdf). At *7th Workshop on Hot Topics in Operating Systems* (HotOS), March 1999. [doi:10.1109/HOTOS.1999.798396](https://doi.org/10.1109/HOTOS.1999.798396)
[^30]: Seth Gilbert and Nancy Lynch. [Brewer’s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services](https://www.comp.nus.edu.sg/~gilbert/pubs/BrewersConjecture-SigAct.pdf). *ACM SIGACT News*, volume 33, issue 2, pages 51–59, June 2002. [doi:10.1145/564585.564601](https://doi.org/10.1145/564585.564601)
[^31]: Seth Gilbert and Nancy Lynch. [Perspectives on the CAP Theorem](https://groups.csail.mit.edu/tds/papers/Gilbert/Brewer2.pdf). *IEEE Computer Magazine*, volume 45, issue 2, pages 30–36, February 2012. [doi:10.1109/MC.2011.389](https://doi.org/10.1109/MC.2011.389)
[^32]: Eric A. Brewer. [CAP Twelve Years Later: How the ‘Rules’ Have Changed](https://sites.cs.ucsb.edu/~rich/class/cs293-cloud/papers/brewer-cap.pdf). *IEEE Computer Magazine*, volume 45, issue 2, pages 23–29, February 2012. [doi:10.1109/MC.2012.37](https://doi.org/10.1109/MC.2012.37)
[^33]: Susan B. Davidson, Hector Garcia-Molina, and Dale Skeen. [Consistency in Partitioned Networks](https://www.cs.rice.edu/~alc/old/comp520/papers/DGS85.pdf). *ACM Computing Surveys*, volume 17, issue 3, pages 341–370, September 1985. [doi:10.1145/5505.5508](https://doi.org/10.1145/5505.5508)
[^34]: Paul R. Johnson and Robert H. Thomas. [RFC 677: The Maintenance of Duplicate Databases](https://tools.ietf.org/html/rfc677). Network Working Group, January 1975.
[^35]: Michael J. Fischer and Alan Michael. [Sacrificing Serializability to Attain High Availability of Data in an Unreliable Network](https://sites.cs.ucsb.edu/~agrawal/spring2011/ugrad/p70-fischer.pdf). At *1st ACM Symposium on Principles of Database Systems* (PODS), March 1982. [doi:10.1145/588111.588124](https://doi.org/10.1145/588111.588124)
[^36]: Eric A. Brewer. [NoSQL: Past, Present, Future](https://www.infoq.com/presentations/NoSQL-History/). At *QCon San Francisco*, November 2012.
[^37]: Adrian Cockcroft. [Migrating to Microservices](https://www.infoq.com/presentations/migration-cloud-native/). At *QCon London*, March 2014.
[^38]: Martin Kleppmann. [A Critique of the CAP Theorem](https://arxiv.org/abs/1509.05393). arXiv:1509.05393, September 2015.
[^39]: Daniel Abadi. [Problems with CAP, and Yahoo’s little known NoSQL system](https://dbmsmusings.blogspot.com/2010/04/problems-with-cap-and-yahoos-little.html). *dbmsmusings.blogspot.com*, April 2010. Archived at [perma.cc/4NTZ-CLM9](https://perma.cc/4NTZ-CLM9)
[^40]: Daniel Abadi. [Hazelcast and the Mythical PA/EC System](https://dbmsmusings.blogspot.com/2017/10/hazelcast-and-mythical-paec-system.html). *dbmsmusings.blogspot.com*, October 2017. Archived at [perma.cc/J5XM-U5C2](https://perma.cc/J5XM-U5C2)
[^41]: Eric Brewer. [Spanner, TrueTime & The CAP Theorem](https://research.google.com/pubs/archive/45855.pdf). *research.google.com*, February 2017. Archived at [perma.cc/59UW-RH7N](https://perma.cc/59UW-RH7N)
[^42]: Daniel J. Abadi. [Consistency Tradeoffs in Modern Distributed Database System Design](https://www.cs.umd.edu/~abadi/papers/abadi-pacelc.pdf). *IEEE Computer Magazine*, volume 45, issue 2, pages 37–42, February 2012. [doi:10.1109/MC.2012.33](https://doi.org/10.1109/MC.2012.33)
[^43]: Nancy A. Lynch. [A Hundred Impossibility Proofs for Distributed Computing](https://groups.csail.mit.edu/tds/papers/Lynch/podc89.pdf). At *8th ACM Symposium on Principles of Distributed Computing* (PODC), August 1989. [doi:10.1145/72981.72982](https://doi.org/10.1145/72981.72982)
[^44]: Prince Mahajan, Lorenzo Alvisi, and Mike Dahlin. [Consistency, Availability, and Convergence](https://apps.cs.utexas.edu/tech_reports/reports/tr/TR-2036.pdf). University of Texas at Austin, Department of Computer Science, Tech Report UTCS TR-11-22, May 2011. Archived at [perma.cc/SAV8-9JAJ](https://perma.cc/SAV8-9JAJ)
[^45]: Hagit Attiya, Faith Ellen, and Adam Morrison. [Limitations of Highly-Available Eventually-Consistent Data Stores](https://www.cs.tau.ac.il/~mad/publications/podc2015-replds.pdf). At *ACM Symposium on Principles of Distributed Computing* (PODC), July 2015. [doi:10.1145/2767386.2767419](https://doi.org/10.1145/2767386.2767419)
[^46]: Peter Sewell, Susmit Sarkar, Scott Owens, Francesco Zappa Nardelli, and Magnus O. Myreen. [x86-TSO: A Rigorous and Usable Programmer’s Model for x86 Multiprocessors](https://www.cl.cam.ac.uk/~pes20/weakmemory/cacm.pdf). *Communications of the ACM*, volume 53, issue 7, pages 89–97, July 2010. [doi:10.1145/1785414.1785443](https://doi.org/10.1145/1785414.1785443)
[^47]: Martin Thompson. [Memory Barriers/Fences](https://mechanical-sympathy.blogspot.com/2011/07/memory-barriersfences.html). *mechanical-sympathy.blogspot.co.uk*, July 2011. Archived at [perma.cc/7NXM-GC5U](https://perma.cc/7NXM-GC5U)
[^48]: Ulrich Drepper. [What Every Programmer Should Know About Memory](https://www.akkadia.org/drepper/cpumemory.pdf). *akkadia.org*, November 2007. Archived at [perma.cc/NU6Q-DRXZ](https://perma.cc/NU6Q-DRXZ)
[^49]: Hagit Attiya and Jennifer L. Welch. [Sequential Consistency Versus Linearizability](https://courses.csail.mit.edu/6.852/01/papers/p91-attiya.pdf). *ACM Transactions on Computer Systems* (TOCS), volume 12, issue 2, pages 91–122, May 1994. [doi:10.1145/176575.176576](https://doi.org/10.1145/176575.176576)
[^50]: Kyzer R. Davis, Brad G. Peabody, and Paul J. Leach. [Universally Unique IDentifiers (UUIDs)](https://www.rfc-editor.org/rfc/rfc9562). RFC 9562, IETF, May 2024.
[^51]: Ryan King. [Announcing Snowflake](https://blog.x.com/engineering/en_us/a/2010/announcing-snowflake). *blog.x.com*, June 2010. Archived at [archive.org](https://web.archive.org/web/20241128214604/https%3A//blog.x.com/engineering/en_us/a/2010/announcing-snowflake)
[^52]: Alizain Feerasta. [Universally Unique Lexicographically Sortable Identifier](https://github.com/ulid/spec). *github.com*, 2016. Archived at [perma.cc/NV2Y-ZP8U](https://perma.cc/NV2Y-ZP8U)
[^53]: Rob Conery. [A Better ID Generator for PostgreSQL](https://bigmachine.io/2014/05/29/a-better-id-generator-for-postgresql/). *bigmachine.io*, May 2014. Archived at [perma.cc/K7QV-3KFC](https://perma.cc/K7QV-3KFC)
[^54]: Leslie Lamport. [Time, Clocks, and the Ordering of Events in a Distributed System](https://www.microsoft.com/en-us/research/publication/time-clocks-ordering-events-distributed-system/). *Communications of the ACM*, volume 21, issue 7, pages 558–565, July 1978. [doi:10.1145/359545.359563](https://doi.org/10.1145/359545.359563)
[^55]: Sandeep S. Kulkarni, Murat Demirbas, Deepak Madeppa, Bharadwaj Avva, and Marcelo Leone. [Logical Physical Clocks](https://cse.buffalo.edu/~demirbas/publications/hlc.pdf). *18th International Conference on Principles of Distributed Systems* (OPODIS), December 2014. [doi:10.1007/978-3-319-14472-6\_2](https://doi.org/10.1007/978-3-319-14472-6_2)
[^56]: Manuel Bravo, Nuno Diegues, Jingna Zeng, Paolo Romano, and Luís Rodrigues. [On the use of Clocks to Enforce Consistency in the Cloud](http://sites.computer.org/debull/A15mar/p18.pdf). *IEEE Data Engineering Bulletin*, volume 38, issue 1, pages 18–31, March 2015. Archived at [perma.cc/68ZU-45SH](https://perma.cc/68ZU-45SH)
[^57]: Daniel Peng and Frank Dabek. [Large-Scale Incremental Processing Using Distributed Transactions and Notifications](https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Peng.pdf). At *9th USENIX Conference on Operating Systems Design and Implementation* (OSDI), October 2010.
[^58]: Tushar Deepak Chandra, Robert Griesemer, and Joshua Redstone. [Paxos Made Live – An Engineering Perspective](https://www.read.seas.harvard.edu/~kohler/class/08w-dsi/chandra07paxos.pdf). At *26th ACM Symposium on Principles of Distributed Computing* (PODC), June 2007. [doi:10.1145/1281100.1281103](https://doi.org/10.1145/1281100.1281103)
[^59]: Will Portnoy. [Lessons Learned from Implementing Paxos](https://blog.willportnoy.com/2012/06/lessons-learned-from-paxos.html). *blog.willportnoy.com*, June 2012. Archived at [perma.cc/QHD9-FDD2](https://perma.cc/QHD9-FDD2)
[^60]: Brian M. Oki and Barbara H. Liskov. [Viewstamped Replication: A New Primary Copy Method to Support Highly-Available Distributed Systems](https://pmg.csail.mit.edu/papers/vr.pdf). At *7th ACM Symposium on Principles of Distributed Computing* (PODC), August 1988. [doi:10.1145/62546.62549](https://doi.org/10.1145/62546.62549)
[^61]: Barbara H. Liskov and James Cowling. [Viewstamped Replication Revisited](https://pmg.csail.mit.edu/papers/vr-revisited.pdf). Massachusetts Institute of Technology, Tech Report MIT-CSAIL-TR-2012-021, July 2012. Archived at [perma.cc/56SJ-WENQ](https://perma.cc/56SJ-WENQ)
[^62]: Leslie Lamport. [The Part-Time Parliament](https://www.microsoft.com/en-us/research/publication/part-time-parliament/). *ACM Transactions on Computer Systems*, volume 16, issue 2, pages 133–169, May 1998. [doi:10.1145/279227.279229](https://doi.org/10.1145/279227.279229)
[^63]: Leslie Lamport. [Paxos Made Simple](https://www.microsoft.com/en-us/research/publication/paxos-made-simple/). *ACM SIGACT News*, volume 32, issue 4, pages 51–58, December 2001. Archived at [perma.cc/82HP-MNKE](https://perma.cc/82HP-MNKE)
[^64]: Robbert van Renesse and Deniz Altinbuken. [Paxos Made Moderately Complex](https://people.cs.umass.edu/~arun/590CC/papers/paxos-moderately-complex.pdf). *ACM Computing Surveys* (CSUR), volume 47, issue 3, article no. 42, February 2015. [doi:10.1145/2673577](https://doi.org/10.1145/2673577)
[^65]: Diego Ongaro. [Consensus: Bridging Theory and Practice](https://github.com/ongardie/dissertation). PhD Thesis, Stanford University, August 2014. Archived at [perma.cc/5VTZ-2ADH](https://perma.cc/5VTZ-2ADH)
[^66]: Heidi Howard, Malte Schwarzkopf, Anil Madhavapeddy, and Jon Crowcroft. [Raft Refloated: Do We Have Consensus?](https://www.cl.cam.ac.uk/research/srg/netos/papers/2015-raftrefloated-osr.pdf) *ACM SIGOPS Operating Systems Review*, volume 49, issue 1, pages 12–21, January 2015. [doi:10.1145/2723872.2723876](https://doi.org/10.1145/2723872.2723876)
[^67]: André Medeiros. [ZooKeeper’s Atomic Broadcast Protocol: Theory and Practice](http://www.tcs.hut.fi/Studies/T-79.5001/reports/2012-deSouzaMedeiros.pdf). Aalto University School of Science, March 2012. Archived at [perma.cc/FVL4-JMVA](https://perma.cc/FVL4-JMVA)
[^68]: Robbert van Renesse, Nicolas Schiper, and Fred B. Schneider. [Vive La Différence: Paxos vs. Viewstamped Replication vs. Zab](https://arxiv.org/abs/1309.5671). *IEEE Transactions on Dependable and Secure Computing*, volume 12, issue 4, pages 472–484, September 2014. [doi:10.1109/TDSC.2014.2355848](https://doi.org/10.1109/TDSC.2014.2355848)
[^69]: Heidi Howard and Richard Mortier. [Paxos vs Raft: Have we reached consensus on distributed consensus?](https://arxiv.org/abs/2004.05074). At *7th Workshop on Principles and Practice of Consistency for Distributed Data* (PaPoC), April 2020. [doi:10.1145/3380787.3393681](https://doi.org/10.1145/3380787.3393681)
[^70]: Miguel Castro and Barbara H. Liskov. [Practical Byzantine Fault Tolerance and Proactive Recovery](https://www.microsoft.com/en-us/research/wp-content/uploads/2017/01/p398-castro-bft-tocs.pdf). *ACM Transactions on Computer Systems*, volume 20, issue 4, pages 396–461, November 2002. [doi:10.1145/571637.571640](https://doi.org/10.1145/571637.571640)
[^71]: Shehar Bano, Alberto Sonnino, Mustafa Al-Bassam, Sarah Azouvi, Patrick McCorry, Sarah Meiklejohn, and George Danezis. [SoK: Consensus in the Age of Blockchains](https://smeiklej.com/files/aft19a.pdf). At *1st ACM Conference on Advances in Financial Technologies* (AFT), October 2019. [doi:10.1145/3318041.3355458](https://doi.org/10.1145/3318041.3355458)
[^72]: Michael J. Fischer, Nancy Lynch, and Michael S. Paterson. [Impossibility of Distributed Consensus with One Faulty Process](https://groups.csail.mit.edu/tds/papers/Lynch/jacm85.pdf). *Journal of the ACM*, volume 32, issue 2, pages 374–382, April 1985. [doi:10.1145/3149.214121](https://doi.org/10.1145/3149.214121)
[^73]: Tushar Deepak Chandra and Sam Toueg. [Unreliable Failure Detectors for Reliable Distributed Systems](https://courses.csail.mit.edu/6.852/08/papers/CT96-JACM.pdf). *Journal of the ACM*, volume 43, issue 2, pages 225–267, March 1996. [doi:10.1145/226643.226647](https://doi.org/10.1145/226643.226647)
[^74]: Michael Ben-Or. [Another Advantage of Free Choice: Completely Asynchronous Agreement Protocols](https://homepage.cs.uiowa.edu/~ghosh/BenOr.pdf). At *2nd ACM Symposium on Principles of Distributed Computing* (PODC), August 1983. [doi:10.1145/800221.806707](https://doi.org/10.1145/800221.806707)
[^75]: Cynthia Dwork, Nancy Lynch, and Larry Stockmeyer. [Consensus in the Presence of Partial Synchrony](https://groups.csail.mit.edu/tds/papers/Lynch/jacm88.pdf). *Journal of the ACM*, volume 35, issue 2, pages 288–323, April 1988. [doi:10.1145/42282.42283](https://doi.org/10.1145/42282.42283)
[^76]: Xavier Défago, André Schiper, and Péter Urbán. [Total Order Broadcast and Multicast Algorithms: Taxonomy and Survey](https://dspace.jaist.ac.jp/dspace/bitstream/10119/4883/1/defago_et_al.pdf). *ACM Computing Surveys*, volume 36, issue 4, pages 372–421, December 2004. [doi:10.1145/1041680.1041682](https://doi.org/10.1145/1041680.1041682)
[^77]: Hagit Attiya and Jennifer Welch. *Distributed Computing: Fundamentals, Simulations and Advanced Topics*, 2nd edition. John Wiley & Sons, 2004. ISBN: 978-0-471-45324-6, [doi:10.1002/0471478210](https://doi.org/10.1002/0471478210)
[^78]: Rachid Guerraoui. [Revisiting the Relationship Between Non-Blocking Atomic Commitment and Consensus](https://citeseerx.ist.psu.edu/pdf/5d06489503b6f791aa56d2d7942359c2592e44b0). At *9th International Workshop on Distributed Algorithms* (WDAG), September 1995. [doi:10.1007/BFb0022140](https://doi.org/10.1007/BFb0022140)
[^79]: Jim N. Gray and Leslie Lamport. [Consensus on Transaction Commit](https://dsf.berkeley.edu/cs286/papers/paxoscommit-tods2006.pdf). *ACM Transactions on Database Systems* (TODS), volume 31, issue 1, pages 133–160, March 2006. [doi:10.1145/1132863.1132867](https://doi.org/10.1145/1132863.1132867)
[^80]: Fred B. Schneider. [Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial](https://www.cs.cornell.edu/fbs/publications/SMSurvey.pdf). *ACM Computing Surveys*, volume 22, issue 4, pages 299–319, December 1990. [doi:10.1145/98163.98167](https://doi.org/10.1145/98163.98167)
[^81]: Alexander Thomson, Thaddeus Diamond, Shu-Chun Weng, Kun Ren, Philip Shao, and Daniel J. Abadi. [Calvin: Fast Distributed Transactions for Partitioned Database Systems](https://cs.yale.edu/homes/thomson/publications/calvin-sigmod12.pdf). At *ACM International Conference on Management of Data* (SIGMOD), May 2012. [doi:10.1145/2213836.2213838](https://doi.org/10.1145/2213836.2213838)
[^82]: Mahesh Balakrishnan, Dahlia Malkhi, Ted Wobber, Ming Wu, Vijayan Prabhakaran, Michael Wei, John D. Davis, Sriram Rao, Tao Zou, and Aviad Zuck. [Tango: Distributed Data Structures over a Shared Log](https://www.microsoft.com/en-us/research/publication/tango-distributed-data-structures-over-a-shared-log/). At *24th ACM Symposium on Operating Systems Principles* (SOSP), November 2013. [doi:10.1145/2517349.2522732](https://doi.org/10.1145/2517349.2522732)
[^83]: Mahesh Balakrishnan, Dahlia Malkhi, Vijayan Prabhakaran, Ted Wobber, Michael Wei, and John D. Davis. [CORFU: A Shared Log Design for Flash Clusters](https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final30.pdf). At *9th USENIX Symposium on Networked Systems Design and Implementation* (NSDI), April 2012.
[^84]: Vasilis Gavrielatos, Antonios Katsarakis, and Vijay Nagarajan. [Odyssey: the impact of modern hardware on strongly-consistent replication protocols](https://vasigavr1.github.io/files/Odyssey_Eurosys_2021.pdf). At *16th European Conference on Computer Systems* (EuroSys), April 2021. [doi:10.1145/3447786.3456240](https://doi.org/10.1145/3447786.3456240)
[^85]: Heidi Howard, Dahlia Malkhi, and Alexander Spiegelman. [Flexible Paxos: Quorum Intersection Revisited](https://drops.dagstuhl.de/opus/volltexte/2017/7094/pdf/LIPIcs-OPODIS-2016-25.pdf). At *20th International Conference on Principles of Distributed Systems* (OPODIS), December 2016. [doi:10.4230/LIPIcs.OPODIS.2016.25](https://doi.org/10.4230/LIPIcs.OPODIS.2016.25)
[^86]: Martin Kleppmann. [Distributed Systems lecture notes](https://www.cl.cam.ac.uk/teaching/2425/ConcDisSys/dist-sys-notes.pdf). *University of Cambridge*, October 2024. Archived at [perma.cc/SS3Q-FNS5](https://perma.cc/SS3Q-FNS5)
[^87]: Kyle Kingsbury. [Call Me Maybe: Elasticsearch 1.5.0](https://aphyr.com/posts/323-call-me-maybe-elasticsearch-1-5-0). *aphyr.com*, April 2015. Archived at [perma.cc/37MZ-JT7H](https://perma.cc/37MZ-JT7H)
[^88]: Heidi Howard and Jon Crowcroft. [Coracle: Evaluating Consensus at the Internet Edge](https://conferences.sigcomm.org/sigcomm/2015/pdf/papers/p85.pdf). At *Annual Conference of the ACM Special Interest Group on Data Communication* (SIGCOMM), August 2015. [doi:10.1145/2829988.2790010](https://doi.org/10.1145/2829988.2790010)
[^89]: Tom Lianza and Chris Snook. [A Byzantine failure in the real world](https://blog.cloudflare.com/a-byzantine-failure-in-the-real-world/). *blog.cloudflare.com*, November 2020. Archived at [perma.cc/83EZ-ALCY](https://perma.cc/83EZ-ALCY)
[^90]: Ivan Kelly. [BookKeeper Tutorial](https://github.com/ivankelly/bookkeeper-tutorial). *github.com*, October 2014. Archived at [perma.cc/37Y6-VZWU](https://perma.cc/37Y6-VZWU)
[^91]: Jack Vanlightly. [Apache BookKeeper Insights Part 1 — External Consensus and Dynamic Membership](https://medium.com/splunk-maas/apache-bookkeeper-insights-part-1-external-consensus-and-dynamic-membership-c259f388da21). *medium.com*, November 2021. Archived at [perma.cc/3MDB-8GFB](https://perma.cc/3MDB-8GFB)
================================================
FILE: content/tw/ch11.md
================================================
---
title: "Chapter 11: Batch Processing"
linkTitle: "11. Batch Processing"
weight: 311
breadcrumbs: false
---

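> *A system cannot be successful if it is too strongly influenced by a single person. Once the initial design is complete and fairly robust, the real test begins as people with many different viewpoints undertake their own experiments.*
>
> Donald Knuth

So far, most of this book has revolved around *requests* and *queries*, and the corresponding *responses* or *results*. Many modern data systems assume this style of processing by default: you send a request or instruction, and the system returns an answer as quickly as it can.
A web browser requesting a page, a service calling a remote API, databases, caches, search indexes, and many other systems all work this way. We call these *online systems*. Response time is usually their primary performance metric, and they often need good fault tolerance to ensure high availability.
Sometimes, however, you need to run a computation much larger than a single interactive request, or process far more data than a single request can handle. Examples include training an AI model, transforming huge volumes of data from one form into another, or running analytics over a very large dataset. We call such tasks *batch processing* jobs, and sometimes refer to them as *offline systems*.
A batch job reads a set of input data (read-only) and produces a set of output data (generated from scratch on every run). It usually does not modify data in place the way a read-write transaction does. The output is therefore *derived data* computed from the input (see ["Systems of Record and Derived Data"](/tw/ch1#sec_introduction_derived)): if you are unhappy with the output, you can simply delete it, change the job logic, and run the job again. Treating the input as immutable and avoiding side effects as far as possible (such as writing directly to an external database) not only helps performance, but also brings other benefits: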
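- If you introduce a bug into the code and the output is wrong or corrupted, you can roll back the code and rerun the job, and the output will be correct again. Even simpler, you can keep the old output in a different directory and switch back to it. Most object stores and open table formats (see ["Cloud Data Warehouses"](/tw/ch4#sec_cloud_data_warehouses)) support this capability, often called *time travel*. Most databases with read-write transactions do not have this property: if buggy code writes bad data to the database, rolling back the code does nothing to fix the data already written. The ability to recover from buggy code has been called *human fault tolerance* [^1].
- Because rollback is easy, feature development can proceed faster than in an environment where mistakes cause irreversible damage. This principle of *minimizing irreversibility* is very beneficial for agile development [^2].
- The same set of files can serve as input to many different jobs, including monitoring jobs that compute metrics or check whether a job's output has the expected characteristics (for example, by comparing it with the previous run and measuring the difference).
- Batch processing frameworks make more efficient use of computing resources. Batch work can also be done with online systems such as OLTP databases and application servers, but the resource cost is usually significantly higher.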
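Batch processing also has its challenges. In most frameworks, a job's output can be processed by downstream jobs only after the whole job has finished. Batch processing can also be inefficient: even if only a single byte of the input changes, the entire input dataset may have to be reprocessed. Nevertheless, batch processing remains very useful in a wide range of situations, and we will return to this topic in ["Batch Use Cases"](#sec_batch_output).
A batch job may run for a long time: minutes, hours, or even days. Many jobs are scheduled periodically (for example, once a day). The main performance metric is usually throughput: how much data can be processed per unit of time. Some batch systems handle faults by aborting and restarting the whole job, while others have finer-grained fault tolerance and can let a job complete even if some nodes crash.
> [!NOTE]
> An alternative to batch processing is *stream processing*: the job does not finish once it has processed its input, but instead keeps listening to the input and processes changes shortly after they occur. We discuss stream processing in [Chapter 12](/tw/ch12#ch_stream).
The boundary between online and batch processing is not always clear: a long-running database query looks a lot like a batch process. But batch processing has some distinctive properties that make it an important building block for reliable, scalable, and maintainable applications. For example, it often plays a role in *data integration*, that is, combining multiple data systems to achieve what no single system can. ETL (see ["Data Warehousing"](/tw/ch1#sec_introduction_dwh)) is a typical example.
Modern batch processing has been heavily influenced by MapReduce. Google published this batch processing algorithm in 2004 [^3], and it was subsequently implemented in open source systems including Hadoop, CouchDB, and MongoDB. MapReduce is a fairly low-level programming model, and less sophisticated than the parallel query execution engines found in data warehouses [^4] [^5]. When it appeared it was a major step up in the scale of processing achievable on commodity hardware, but today it is largely obsolete and is no longer used at Google [^6] [^7].
Today batch processing is more often done with Spark, Flink, or data warehouse query engines. Like MapReduce, they rely heavily on sharding (see [Chapter 7](/tw/ch7#ch_sharding)) and parallel execution, but their caching and execution strategies are more sophisticated. As these systems have matured, operational concerns have largely been addressed, and the focus has shifted to usability: dataflow APIs, query languages, and DataFrame APIs are widely supported, and job and workflow orchestration has evolved considerably. Hadoop-centric schedulers such as Oozie and Azkaban are being replaced by more general solutions such as Airflow, Dagster, and Prefect, which can coordinate many different batch frameworks and cloud data warehouses.
Cloud computing is now ubiquitous. The storage layer for batch processing is also moving from distributed filesystems (DFSs) such as HDFS, GlusterFS, and CephFS to object stores such as S3. Scalable cloud data warehouses such as BigQuery and Snowflake are blurring the boundary between data warehouses and batch processing systems.
To build intuition, this chapter starts with an example using Unix tools on a single machine, then extends it to distributed processing across many machines. You will see that a distributed batch processing framework resembles an operating system in many ways: it too has a scheduler and a filesystem. We then discuss several processing models for writing batch jobs, and finish with common application scenarios.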
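## Batch Processing with Unix Tools {#sec_batch_unix}
Say you have a web server that appends a line to a log file every time it serves a request. For example, using the default nginx access log format, a line might look like this:
216.58.210.78 - - [27/Jun/2025:17:55:11 +0000] "GET /css/typography.css HTTP/1.1"
200 3377 "https://martin.kleppmann.com/" "Mozilla/5.0 (Macintosh; Intel Mac OS X
10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/137.0.0.0 Safari/537.36"
(This is actually one line; it is broken across several lines here only for readability.) There is a lot of information in this line. To interpret it, you need the definition of the log format:
$remote_addr - $remote_user [$time_local] "$request"
$status $body_bytes_sent "$http_referer" "$http_user_agent"
This means that on 27 June 2025 at 17:55:11 UTC, the server received a request for `/css/typography.css` from the client IP address `216.58.210.78`. The user was not authenticated, so `$remote_user` is a hyphen (`-`). The response status was 200 (success), and the response body was 3,377 bytes. The browser was Chrome 137, and the file was referenced from the page at [https://martin.kleppmann.com/](https://martin.kleppmann.com/).
Parsing logs may seem mundane, but it is a core capability at modern technology companies, relied on heavily for everything from ad pipelines to payment processing. In fact, it was a major driver behind the rapid rise of MapReduce and the "big data" movement.
### Simple Log Analysis {#sec_batch_log_analysis}
Many tools can take log files and produce nice reports on website traffic. Here, for the sake of exercise, we build our own using only basic Unix tools. For example, say you want to find the five most popular pages on your website. You can do that in a shell as follows: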
```bash
cat /var/log/nginx/access.log | #1
awk '{print $7}' | #2
sort | #3
uniq -c | #4
sort -r -n | #5
head -n 5 #6
```
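1. Read the log file. (Strictly speaking, `cat` is unnecessary here, since the file could be passed directly as an argument to `awk`; but written this way the linear pipeline is easier to see.)
2. Split each line into fields on whitespace, and output only the seventh field, which is the requested URL. In the sample line above, that is `/css/typography.css`.
3. Sort the URLs lexicographically. If a URL was requested *n* times, it appears on *n* consecutive lines after sorting.
4. `uniq` removes duplicates by checking whether two adjacent lines are the same. The `-c` option makes it output a count: the number of times each distinct URL appeared.
5. The second `sort` sorts by the number at the start of each line (`-n`) and reverses the order (`-r`), so that the most frequently requested URLs come first.
6. `head` keeps only the first five lines (`-n 5`) and discards the rest.
The output looks something like this: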
```
4189 /favicon.ico
3631 /2016/02/08/how-to-do-distributed-locking.html
2124 /2020/11/18/distributed-systems-and-elliptic-curves.html
1369 /
915 /css/typography.css
```
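If you are not familiar with Unix tools, this command may look a bit obscure, but it is very powerful. It can process gigabytes of log files in a matter of seconds, and the analysis is easy to modify: for example, to exclude CSS files, change the `awk` argument to `'$7 !~ /\.css$/ {print $7}'`; to count the most frequent client IP addresses instead, change the `awk` argument to `'{print $1}'`.
We don't have space in this book to cover Unix tools in detail, but they are well worth learning. Surprisingly many data analyses can be done in a few minutes using some combination of `awk`, `sed`, `grep`, `sort`, `uniq`, and `xargs`, and they perform rather well [^8].
### Chain of Commands Versus Custom Program {#sec_batch_custom_program}
Instead of a Unix pipeline, you could write a small program to do the same thing. For example, in Python: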
```python
from collections import defaultdict
counts = defaultdict(int) #1
with open('/var/log/nginx/access.log', 'r') as file:
    for line in file:
        url = line.split()[6]  #2
        counts[url] += 1  #3
top5 = sorted(((count, url) for url, count in counts.items()), reverse=True)[:5]  #4
for count, url in top5:  #5
    print(f"{count} {url}")
```
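1. `counts` is a hash table that records how many times each URL has been seen, with a default value of 0.
2. Split each line on whitespace and take the seventh field as the URL (Python lists are zero-indexed, so the index is 6).
3. Increment the counter for the URL in the current line.
4. Sort by count in descending order and take the top five entries.
5. Print the top five entries.
This program is not as concise as the Unix pipeline, but it is fairly readable, and which you prefer is a matter of taste. However, besides the syntactic differences, the two have a very different execution flow, which becomes apparent when you run the analysis on a large file.
### Sorting Versus In-Memory Aggregation {#id275}
The Python script keeps an in-memory hash table mapping each URL to the number of times it has been seen. The Unix pipeline example has no such hash table; instead it relies on sorting to bring multiple occurrences of the same URL together.
Which approach is better? It depends on how many distinct URLs you have. For most small to mid-sized websites, you can probably fit all distinct URLs and their counters in (say) 1 GB of memory. The *working set* of this job (the amount of memory to which it needs random access) depends only on the number of distinct URLs: even if a million log entries all refer to the same URL, the hash table stores just one URL and one counter. If the working set is small enough, an in-memory hash table works well, even on a laptop.
But if the working set is larger than the available memory, the sorting approach has the advantage that it can use disks efficiently. It is the same principle as in ["Log-Structured Storage"](/tw/ch4#sec_storage_log_structured): chunks of data are sorted in memory and written out as segment files, and then multiple sorted segments are merged into a larger sorted file. Mergesort has sequential access patterns that perform well on disks (see ["Sequential Versus Random Writes on SSDs"](/tw/ch4#sidebar_sequential)).
The `sort` utility in GNU Coreutils (Linux) automatically spills datasets larger than memory to disk, and automatically parallelizes sorting across multiple CPU cores [^9]. This means the Unix command chain above scales naturally to large datasets without running out of memory; the bottleneck is usually the rate at which the input file can be read from disk.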
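
To make the spill-and-merge idea concrete, here is a minimal Python sketch of an external merge sort. It is only an illustration of the principle, not how GNU `sort` is implemented; the function name `external_sort`, the `chunk_size` parameter, and the use of temporary files are assumptions made for this example, and the input lines are assumed not to contain newline characters.

```python
import heapq
import os
import tempfile
from itertools import islice


def external_sort(lines, chunk_size=100_000):
    """Yield lines in sorted order while holding at most chunk_size lines in memory."""
    lines = iter(lines)
    run_paths = []
    try:
        while True:
            chunk = sorted(islice(lines, chunk_size))  # sort one chunk in memory
            if not chunk:
                break
            fd, path = tempfile.mkstemp(suffix=".run")
            with os.fdopen(fd, "w") as run:
                run.writelines(line + "\n" for line in chunk)  # spill a sorted run to disk
            run_paths.append(path)

        files = [open(path) for path in run_paths]
        try:
            runs = [(line.rstrip("\n") for line in f) for f in files]
            yield from heapq.merge(*runs)  # sequential k-way merge of the sorted runs
        finally:
            for f in files:
                f.close()
    finally:
        for path in run_paths:
            os.remove(path)
```

Each run is read strictly front to back during the merge, which is the sequential access pattern that makes this approach work well on disks.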
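One limitation of Unix tools is that they run only on a single machine. When the data is too large to fit in the memory or on the local disks of one machine, you need a distributed batch processing framework.
## Batch Processing in Distributed Systems {#sec_batch_distributed}
In the Unix example above, several components on a single machine cooperate to process the log:
- Storage devices accessed through the operating system's filesystem interface.
- A scheduler that decides when processes run and how CPU resources are allocated to them.
- A series of Unix programs whose `stdin` and `stdout` are connected by pipes.
Distributed batch processing frameworks have corresponding components. In a sense, you can think of a distributed processing framework as a distributed operating system: it has a filesystem, a job scheduler, and programs that pass data to each other through the filesystem or other channels.
### Distributed Filesystems {#sec_batch_dfs}
The filesystem provided by an operating system consists of several layers:
- At the lowest level, block device drivers talk directly to the disks and offer raw block reads and writes to the layers above.
- Above the block layer sits the page cache, which caches recently accessed blocks to speed up reads.
- On top of the block API is the filesystem layer, which breaks large files into blocks and maintains metadata such as inodes, directories, and files. Common implementations on Linux include ext4 and XFS.
- At the top, the operating system exposes different filesystems to applications through a uniform API (the virtual filesystem, or VFS), so that applications can read and write in the same way regardless of the underlying implementation.
Distributed filesystems (DFSs) work similarly: files are split into blocks that are spread across many machines. DFS blocks are usually much larger than those of local filesystems: HDFS defaults to 128 MB, JuiceFS and many object stores commonly use 4 MB, whereas the default ext4 block is usually 4,096 bytes. Larger blocks mean less metadata to maintain, which matters greatly for petabyte-scale data; they also reduce the relative overhead of seeks.
Most physical storage devices cannot write partial blocks: a full block must be written even if the data does not fill it. DFS blocks are larger and are usually built on top of operating system filesystems, so they generally do not have this constraint. For example, a 900 MB file stored with 128 MB blocks would consist of seven 128 MB blocks and one 4 MB block.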
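
The arithmetic in that example can be checked with a few lines of Python. The helper `split_into_blocks` is a hypothetical name used only for this sketch; it is not part of any DFS API.

```python
def split_into_blocks(file_size, block_size):
    """Return the sizes of the blocks needed to store a file of file_size bytes."""
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)  # the final block holds only the leftover data
    return sizes


MB = 1024 * 1024
blocks = split_into_blocks(900 * MB, 128 * MB)
print(len(blocks), blocks[-1] // MB)  # prints "8 4": seven full blocks plus one 4 MB block
```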
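Reading a DFS block requires a network request to a node in the cluster that holds that block. Each machine runs a daemon that exposes an API allowing remote processes to read and write blocks, which are stored as files on its local filesystem. HDFS calls these daemons DataNodes, and GlusterFS calls them glusterfsd processes. We refer to them as *data nodes* in the rest of this chapter.
DFSs also implement a distributed equivalent of the page cache. Because DFS blocks are stored as files on the data nodes' local filesystems, reads and writes go through the data node's operating system and its in-memory page cache, so frequently read blocks are cached in memory. Some DFSs provide additional cache layers, such as JuiceFS's client-side cache and local disk cache.
Filesystems such as ext4 and XFS maintain metadata including free space, block locations, directory structure, and permissions. A DFS likewise needs to track which machines hold the blocks of each file, what the permissions are, and so on. Hadoop uses a NameNode to maintain cluster metadata; DeepSeek's 3FS uses a metadata service that persists metadata in a key-value store such as FoundationDB.
Above the filesystem is the VFS. The closest equivalent in batch processing systems is the DFS protocol: batch frameworks need a protocol or interface through which to read and write storage. Any storage system that implements the protocol can be plugged in. For example, the S3 API is supported by many compatible systems, including MinIO, Cloudflare R2, Tigris, and Backblaze B2. Batch processing systems with S3 support can usually use any of these storage systems directly.
Some DFSs also offer a POSIX-compatible filesystem that appears to the operating system's VFS as an ordinary filesystem. FUSE and the NFS protocol are common ways of integrating it. NFS is perhaps the best-known distributed filesystem protocol; it was originally designed to let multiple clients read and write data on a single server. More recently, services such as AWS EFS and Archil have provided more scalable NFS-compatible implementations. NFS clients still connect to a single endpoint, but behind it the system interacts with distributed metadata services and data nodes to perform reads and writes.
> [!TIP] Distributed Filesystems and Network Storage
> Distributed filesystems are based on the *shared-nothing* principle (see ["Shared-Memory, Shared-Disk, and Shared-Nothing Architecture"](/tw/ch2#sec_introduction_shared_nothing)), in contrast to *shared-disk* approaches such as network attached storage (NAS) and storage area networks (SANs). Shared-disk storage typically relies on a centralized storage appliance, custom hardware, and special network infrastructure such as Fibre Channel. The shared-nothing approach requires no special hardware, only machines connected by a conventional datacenter network.
Many DFSs are built on commodity hardware, which is cheaper but has higher failure rates than enterprise-grade hardware. To tolerate machine and disk failures, file blocks are usually replicated on multiple machines. This also makes it easier for the scheduler to balance load, since a task can run on any node that holds a replica of its input. Replication may mean several full copies (see [Chapter 6](/tw/ch6#ch_replication)), or an *erasure coding* scheme such as Reed-Solomon codes, which allows lost data to be recovered with lower storage overhead [^10] [^11] [^12]. The idea is similar to RAID, except that RAID works across several disks attached to the same machine, whereas in a DFS access and replication happen across machines over a conventional datacenter network.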
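### Object Stores {#id277}
Object stores such as Amazon S3, Google Cloud Storage, Azure Blob Storage, and OpenStack Swift have become a mainstream alternative to DFSs for batch processing. In fact, the line between the two is increasingly blurred: as described in the previous section and in ["Databases Backed by Object Storage"](/tw/ch6#sec_replication_object_storage), FUSE can mount an object store such as S3 as a filesystem, and systems such as JuiceFS and Ceph offer both object and filesystem APIs. However, the interfaces, performance, and consistency guarantees vary widely, so even when APIs look compatible, you need to verify carefully that the behavior matches your expectations.
Each object in an object store has a URL such as `s3://my-photo-bucket/2025/04/01/birthday.png`. The host part (`my-photo-bucket`) is the name of the bucket, and the rest is the object's *key* (`/2025/04/01/birthday.png` in the example). Bucket names are globally unique, and an object's key must be unique within its bucket.
Objects are read with `get` and written with `put`. Unlike files in a filesystem, objects are usually immutable once written; updating an object requires rewriting it in full with `put`, much as in a key-value store. Azure Blob Storage and S3 Express One Zone support appends, but most object stores do not. There are also no file-handle APIs such as `fopen` and `fseek`.
Objects may look as if they are organized into directories, which is misleading: object stores have no real concept of directories. The path is just a convention, and the slashes are part of the key. This convention lets you list objects by prefix, much like a directory listing, but it differs from listing a filesystem directory in two ways:
- A prefix `list` behaves more like the recursive Unix `ls -R`: it returns all objects whose key starts with the prefix, including objects in "subpaths".
- There are no empty directories. If you delete all the objects under `s3://my-photo-bucket/2025/04/01` and then list `s3://my-photo-bucket/2025/04`, `01` no longer appears. A common practice is to create a zero-byte object to represent an empty directory (for example, creating an empty object `s3://my-photo-bucket/2025/04/01` as a placeholder).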
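
Both points can be seen in a small in-memory model of prefix listing. This is a sketch, not a client for any real object store: the function `list_objects`, its `delimiter` parameter (loosely modeled on the delimiter option that S3-style list APIs offer), and the sample keys are all assumptions made for illustration.

```python
def list_objects(keys, prefix, delimiter=None):
    """List keys that start with prefix.

    Without a delimiter the listing is flat (like `ls -R`). With a delimiter,
    deeper keys are rolled up into "directory-like" common prefixes.
    """
    objects, common_prefixes = [], set()
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter and delimiter in rest:
            common_prefixes.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        else:
            objects.append(key)
    return objects, sorted(common_prefixes)


# Hypothetical keys within one bucket
keys = {
    "2025/04/01/birthday.png",
    "2025/04/02/cake.png",
    "2025/05/01/trip.png",
}

flat, _ = list_objects(keys, "2025/04/")             # both April objects, at any depth
_, dirs = list_objects(keys, "2025/04/", "/")        # ["2025/04/01/", "2025/04/02/"]

keys.discard("2025/04/01/birthday.png")
_, dirs_after = list_objects(keys, "2025/04/", "/")  # "2025/04/01/" has vanished
```

The "directories" exist only as a by-product of the keys that are present, which is why deleting the last object under a prefix makes the prefix disappear.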
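DFSs often support filesystem operations such as hard links, symbolic links, file locking, and atomic renames, which object stores usually lack: links and locks are mostly unsupported, and renames are not atomic, typically being implemented as a copy to the new key followed by deletion of the old key. To rename a "directory", you actually have to rename every object individually, because the directory name is part of each key.
The key-value stores discussed in [Chapter 4](/tw/ch4#ch_storage) are usually designed for small values (typically kilobytes) and frequent, low-latency reads and writes. By contrast, DFSs and object stores are usually optimized for large objects (megabytes to gigabytes) and less frequent, larger reads and writes. Recently, however, object stores have been improving support for frequent access to small objects: for example, S3 Express One Zone offers single-digit millisecond latency and a pricing model closer to that of a key-value store.
Another difference between DFSs and object stores is that a DFS such as HDFS can schedule computation on a machine that holds a replica of a file, so that the task reads the file locally and network transfer is reduced (which pays off especially when the task code is much smaller than the file it reads). Object stores usually decouple storage from compute; this may use more bandwidth, but modern datacenter networks are fast, so it is usually acceptable. The decoupling also allows CPU and memory to be scaled independently of storage capacity.
### Distributed Job Orchestration {#id278}
The operating system analogy also applies to job orchestration. When you run a Unix batch job on a single machine, something has to actually run the `awk`, `sort`, `uniq`, and `head` processes; feed the output of one process into the input of another; allocate memory to each process; schedule CPU instructions fairly; isolate memory and I/O boundaries; and so on. On a single machine this is the job of the operating system kernel; in a distributed environment it is the job of an orchestrator.
A batch processing framework sends a request to the orchestrator's scheduler to run a job. The request usually contains metadata such as:
- the number of tasks to run;
- the memory, CPU, and disk required by each task;
- a job identifier;
- access credentials;
- job parameters such as inputs and outputs;
- required hardware details (such as GPUs or disk types);
- the location of the job's executable code.
Orchestrators such as Kubernetes and Hadoop YARN (Yet Another Resource Negotiator) [^13] combine these requests with the state of the cluster and use the following components to run tasks: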
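Task executors
: An executor daemon runs on every node, such as YARN's *NodeManager* or Kubernetes' *kubelet*. The executor is responsible for launching tasks, reporting liveness through heartbeats, and tracking the status and resource usage of tasks on the node. On receiving a request to start a task, the executor fetches the job code and runs the startup command; it then monitors the process until it finishes or fails, updating the task's status metadata accordingly.
Many executors also work with the operating system to provide security and performance isolation; for example, both YARN and Kubernetes use Linux *cgroups*. This prevents tasks from accessing data without permission, or from harming other tasks on the same machine by using excessive resources.
Resource manager
: The resource manager maintains metadata about every node: available hardware (CPUs, GPUs, memory, disks, and so on), task status, network location, node health, and so on, giving it a global view of the cluster. Its centralized nature can make it a bottleneck for availability and scalability. YARN uses ZooKeeper and Kubernetes uses etcd to store cluster state (see ["Coordination Services"](/tw/ch10#sec_consistency_coordination)).
Scheduler
: Orchestrators usually contain a centralized scheduling subsystem that receives requests to start and stop jobs and to query their status. For example, on receiving a request to start 10 tasks using a specified Docker image on nodes with a certain type of GPU, the scheduler decides which tasks run on which nodes, based on the request and the state held by the resource manager, and then instructs the executors to run them.
Different orchestrators use different names, but nearly all of them have these core components.
> [!NOTE]
> Some scheduling decisions need input from an application-specific scheduler that can take more specific constraints into account, such as automatically scaling out read replicas when query volume reaches a threshold. The central scheduler and the application scheduler work together to decide how tasks run. YARN calls these sub-schedulers *ApplicationMasters*, and in Kubernetes they are usually called *operators*.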
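#### Resource Allocation {#id279}
One of the most challenging responsibilities of the scheduler in an orchestration system is to allocate resources sensibly when they are limited and jobs have competing demands. In essence, it must balance fairness against efficiency.
Imagine a small cluster of five nodes with a total of 160 CPU cores. The scheduler receives two job requests, each wanting 100 cores. What is the best way to schedule them?
- The scheduler could start 80 tasks of each job and launch the remaining 20 of each as earlier tasks finish.
- It could run one job to completion and start the other only when 100 cores are free. This is called *gang scheduling*.
- If one request arrives before the other, the scheduler must also decide whether to give it all 100 cores immediately or hold some resources back for future requests.
This is a very simplified example, but the difficult trade-offs are already visible. Take gang scheduling: if the scheduler holds resources in reserve for a long time in order to assemble 100 cores, nodes sit idle and utilization falls; and if other jobs are also reserving resources in the same way, deadlock can result.
Conversely, if the scheduler just waits passively for 100 cores to become available, other jobs may grab cores in the meantime, so that the full set is not assembled for a long time, resulting in *starvation*. The scheduler could also *preempt* some tasks of the earlier job, killing them to free resources for the later job; but the killed tasks must be rerun later, which also reduces overall efficiency.
Scale this problem up to hundreds or even millions of requests, and finding a globally optimal solution becomes practically infeasible. In fact the problem is *NP-hard*: except for very small instances, an optimal solution is hard to compute in an acceptable time [^14] [^15].
In practice, therefore, schedulers use heuristics to make decisions that are good enough rather than optimal. Common algorithms include FIFO, dominant resource fairness (DRF), priority queues, capacity or quota-based scheduling, and various bin-packing algorithms. The details are beyond the scope of this book, but it is a fascinating area of research.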
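
To illustrate how differently two simple heuristics treat the 160-core example, the following sketch compares an even (max-min fair) split with strict FIFO gang scheduling. It models a single resource and ignores nodes, priorities, and preemption; the function names `fair_share` and `fifo_gang` and the job names are hypothetical, and neither function is how YARN or Kubernetes actually allocates resources.

```python
def fair_share(capacity, demands):
    """Max-min fair split of `capacity` cores among jobs, capped at each job's demand."""
    allocation = {job: 0 for job in demands}
    unsatisfied = dict(demands)
    remaining = capacity
    while unsatisfied and remaining > 0:
        share = remaining // len(unsatisfied)
        if share == 0:
            break
        for job, demand in list(unsatisfied.items()):
            grant = min(share, demand - allocation[job])
            allocation[job] += grant
            remaining -= grant
            if allocation[job] == demand:
                del unsatisfied[job]  # fully served; its leftover share is redistributed
    return allocation


def fifo_gang(capacity, demands):
    """Admit jobs in arrival order, but only when their full demand fits (gang scheduling)."""
    allocation, remaining = {}, capacity
    for job, demand in demands.items():
        if demand > remaining:
            break  # strict FIFO: later jobs wait behind the blocked one
        allocation[job] = demand
        remaining -= demand
    return allocation, remaining


requests = {"job_a": 100, "job_b": 100}
print(fair_share(160, requests))  # {'job_a': 80, 'job_b': 80}
print(fifo_gang(160, requests))   # ({'job_a': 100}, 60)
```

The fair split keeps every core busy but neither job has all its tasks running; the gang scheduler runs one job at full width and leaves 60 cores idle until it finishes.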
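#### Scheduling Workflows {#sec_batch_workflows}
The Unix example at the start of this chapter chained several commands together. The same is common in distributed batch processing: the output of one job becomes the input to one or more subsequent jobs, and each job may depend on several upstream inputs. This dependency structure is called a *workflow*, or a *directed acyclic graph (DAG)* of jobs.
> [!NOTE]
> In ["Durable Execution and Workflows"](/tw/ch5#sec_encoding_dataflow_workflows) we discussed workflow engines that execute RPCs step by step. In the context of batch processing, a workflow is a sequence of batch processes, each reading input and producing output, usually without making external RPCs directly. Durable execution engines usually handle less data per request than batch systems, but the boundary between the two is not absolute.
Workflows of multiple jobs are commonly needed for the following reasons:
- The output of one job may be consumed by downstream jobs maintained by several different teams. In that case it makes sense to write the output to a shared location first; downstream jobs can then be triggered by data updates or run on a schedule.
- You may need to pass data between several processing tools. For example, a Spark job writes to HDFS, and then a Python script triggers a Trino SQL query (see ["Cloud Data Warehouses"](/tw/ch4#sec_cloud_data_warehouses)) that processes the data further and writes it to S3.
- Some pipelines inherently need several stages. For example, if the first stage shards by one key and the next stage shards by another, the first stage must first produce data laid out as the second stage requires.
In Unix, a pipe connects the preceding and following commands through a small in-memory buffer, without writing to disk. If the buffer fills up, the upstream process must wait for the downstream process to consume some data, which is a form of *backpressure*. Batch execution engines such as Spark and Flink support a similar pattern: the output of one task is passed directly to the next (over the network if they are on different machines).
In workflows, however, it is still more common for the upstream job to write to a DFS or object store from which the downstream job reads, as this decouples the jobs in time. If a job has several inputs, the workflow scheduler usually waits until all upstream inputs have been produced successfully before starting it.
The YARN ResourceManager and Spark's built-in scheduler mainly schedule within a job, and do not manage entire workflows. To manage dependencies between jobs, workflow schedulers such as Airflow, Dagster, and Prefect were developed. They are essential when maintaining large numbers of batch jobs: workflows of 50 to 100 jobs are not unusual, and in large organizations many teams consume each other's output across systems. Without tool support, such complex dataflows are very difficult to manage.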
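
The core rule of a workflow scheduler, starting a job only once all of its upstream jobs have succeeded, amounts to processing a DAG in topological order. The following minimal sketch uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later). The job names and the `run_workflow` helper are hypothetical, and real schedulers such as Airflow add retries, timing, and parallel execution on top of this idea.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each job maps to the set of upstream jobs whose output it reads.
workflow = {
    "extract_events": set(),
    "extract_users": set(),
    "join_events_users": {"extract_events", "extract_users"},
    "daily_report": {"join_events_users"},
    "train_model": {"join_events_users"},
}


def run_workflow(dag, run_job):
    """Run every job after all of its upstream jobs; return the execution order."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises graphlib.CycleError if the graph is not acyclic
    order = []
    while sorter.is_active():
        # Every job returned here has had all of its inputs produced already.
        for job in sorted(sorter.get_ready()):
            run_job(job)
            order.append(job)
            sorter.done(job)  # unblocks downstream jobs
    return order


order = run_workflow(workflow, run_job=lambda job: None)
# order == ["extract_events", "extract_users", "join_events_users",
#           "daily_report", "train_model"]
```

Jobs returned together by `get_ready()` do not depend on each other, so a real scheduler could run them in parallel.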
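#### Handling Faults {#id281}
Batch jobs often run for a long time. For a long-running job with many parallel tasks, at least one task failure during execution is almost the norm. As discussed in ["Hardware and Software Faults"](/tw/ch2#sec_introduction_hardware_faults) and ["Unreliable Networks"](/tw/ch9#sec_distributed_networks), the cause may be a hardware fault (especially on commodity hardware), a network interruption, and so on.
Another reason a task may not finish is that the scheduler deliberately preempts (kills) it. This is common when a system has queues of different priorities: low-priority tasks are cheap and high-priority tasks are expensive. Low-priority tasks can run on spare capacity, but may be preempted as soon as higher-priority tasks arrive. Cloud providers offer this under the names *spot instances* (AWS), *spot virtual machines* (Azure), and *preemptible instances* (GCP) [^16].
Batch processing often has loose timeliness requirements, so it is well suited to using low-priority resources or preemptible instances to reduce cost: it essentially uses computing capacity that would otherwise sit idle, raising cluster utilization. The price is a higher probability of being killed: in practice, preemption is often more common than hardware failure [^17].
Because a batch job generates its output from scratch on every run, task failures are easier to handle than in online systems: delete the partial output of the failed task and reschedule the task on another machine. Rerunning an entire job because of a single task failure would be very wasteful, so MapReduce and its successors try to keep parallel tasks independent of each other, so that retries can be performed at the granularity of a single task [^3].
Fault tolerance is more complex when the output of one task becomes the input to another (that is, when data is passed within a workflow). MapReduce's approach is to always write intermediate data back to the DFS, and to let downstream tasks read it only after the task that wrote it has succeeded. This works even in environments with frequent preemption, but it results in a lot of writes to the DFS, which is inefficient.
Spark tends to keep intermediate data in memory or spill it to local disk, writing only the final result to the DFS; it also records the lineage of how intermediate data was computed, so that it can be recomputed if lost [^18]. Flink instead uses periodic checkpoint snapshots [^19]. We will return to this in ["Dataflow Engines"](#sec_batch_dataflow).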
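## Batch Processing Models {#id431}
So far we have discussed how batch jobs are scheduled in a distributed environment. We now turn to how batch processing frameworks actually process data. The two most common models are MapReduce and dataflow engines. Although dataflow engines have largely replaced MapReduce in practice, it is still important to understand MapReduce, since it has deeply influenced modern batch processing frameworks.
Both MapReduce and dataflow engines have grown a range of programming interfaces: low-level APIs, relational query languages, and DataFrame APIs. These allow application engineers, analytics engineers, business analysts, and even non-technical staff to take part in data processing. We discuss these uses in ["Batch Use Cases"](#sec_batch_output).
### MapReduce {#sec_batch_mapreduce}
The pattern of data processing in MapReduce is almost identical to that in ["Simple Log Analysis"](#sec_batch_log_analysis):
1. Read the input files and break them into *records*. In the log example, each record is one line (that is, `\n` is the record separator). In Hadoop MapReduce, the input is usually stored in HDFS or an object store such as S3, in a file format such as Parquet (column-oriented, see ["Column-Oriented Storage"](/tw/ch4#sec_storage_column)) or Avro (row-oriented, see ["Avro"](/tw/ch5#sec_encoding_avro)).
2. Call the mapper to extract a key and value from each input record. In the Unix example the mapper corresponds to `awk '{print $7}'`: the URL (`$7`) is the key, and the value can be left empty.
3. Sort all the key-value pairs by key. In the log example this corresponds to the first `sort`.
4. Call the reducer to iterate over the sorted key-value pairs. Records with the same key are adjacent, so they can be combined while keeping very little state in memory. In the Unix example the reducer is equivalent to `uniq -c`, which counts the adjacent records with the same key.
These four steps make up one MapReduce job. Steps 2 (map) and 4 (reduce) are where you write your own processing logic. Step 1 (breaking files into records) is handled by the input format parser. Step 3, the sort, is implicit and built into MapReduce, so you don't have to write it yourself. This step is a fundamental batch processing algorithm, which we will revisit in ["Shuffling Data"](#sec_shuffle).
To create a MapReduce job, you implement two callbacks, the mapper and the reducer, which behave as follows.
Mapper
: Called once for every input record. It extracts a key and value from the input record, and may generate any number of key-value pairs (including none) for each input. It keeps no state from one record to the next, so each record is handled independently.
Reducer
: The framework collects the key-value pairs produced by the mappers and passes the collection of values belonging to the same key to the reducer (as an iterator). The reducer can produce output records (such as the number of occurrences of the same URL).
In the log example there was a second `sort` in step 5, which ranked URLs by number of requests. In MapReduce, a second sorting stage usually requires writing a second job, with the output of the first as input to the second. Viewed another way, the role of the mapper is to prepare data in a form suitable for sorting, and the role of the reducer is to process data that has been sorted.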
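
As a rough single-process illustration of the mapper and reducer callbacks described above, the following sketch reproduces the URL count from the log example. The tiny `run_mapreduce` driver is a stand-in for the framework and is an assumption of this example, not the Hadoop API; for convenience the mapper emits the value `1` rather than an empty value, so that the reducer can simply sum.

```python
from itertools import groupby
from operator import itemgetter


def mapper(line):
    """Emit (URL, 1) for a log line; lines with too few fields produce nothing."""
    fields = line.split()
    if len(fields) > 6:
        yield fields[6], 1


def reducer(url, values):
    """Called once per key with an iterator over all values for that key."""
    yield url, sum(values)


def run_mapreduce(records, mapper, reducer):
    pairs = [kv for record in records for kv in mapper(record)]  # step 2: map
    pairs.sort(key=itemgetter(0))                                # step 3: sort by key
    for key, group in groupby(pairs, key=itemgetter(0)):         # step 4: reduce
        yield from reducer(key, (value for _, value in group))
```

Because the pairs are sorted before the reduce step, `groupby` only ever has to look at adjacent records, just like `uniq -c` in the Unix pipeline.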
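> [!TIP] MapReduce and Functional Programming
> Although MapReduce is used for batch processing, its programming model comes from functional programming. Lisp introduced *map* and *reduce* (or *fold*) as higher-order functions on lists, and they later found their way into mainstream languages such as Python, Rust, and Java. Many data processing operations, including SQL, can be expressed on top of MapReduce. Several properties of map, reduce, and functional programming suit MapReduce well: they are composable and fit naturally into data processing chains; map is a classic embarrassingly parallel operation (each input is processed independently); and reduce can be run in parallel for different keys.
However, writing complex processing against the raw MapReduce API is laborious; for example, you have to implement any join algorithms yourself [^20]. MapReduce is also slow compared with modern batch processors. One important reason is that its file-centric I/O makes jobs hard to pipeline: a downstream job can hardly start processing output before the upstream job has finished.
### Dataflow Engines {#sec_batch_dataflow}
To fix MapReduce's limitations, several distributed batch execution engines were developed, the best known being Spark [^18] [^21] and Flink [^19]. They differ in design details, but have one thing in common: they handle an entire workflow as one job, rather than breaking it up into independent subjobs.
Because they explicitly model the flow of data through several processing stages, they are known as *dataflow engines*. Like MapReduce, they provide low-level APIs that repeatedly call a user-defined function to process one record at a time, and also higher-level operators such as *join* and *group by*. They parallelize work by sharding the input, and they pass the output of one task to the input of another over the network. Unlike in MapReduce, operators need not strictly alternate between map and reduce roles, but can be combined more flexibly.
These APIs usually express a computation using relational-style building blocks: joining datasets on the value of some field, grouping by key, filtering by a condition, and aggregating with functions such as count or sum. Internally, these are implemented using the shuffle algorithm described in the next section.
This style of processing engine goes back to research systems such as Dryad [^22] and Nephele [^23]. It has several advantages over MapReduce:
- Expensive operations such as sorting are performed only where they are actually needed, rather than by default between every map and reduce stage.
- Several consecutive operators that do not change the sharding of the dataset (such as map or filter) can be fused into a single task, reducing data copying overhead.
- Because all joins and data dependencies in a workflow are explicitly declared, the scheduler can optimize data locality globally. For example, it can place a task that consumes some data on the same machine as the task that produces it, so that the data is exchanged through a shared memory buffer rather than copied over the network.
- Intermediate state between operators can usually be kept in memory or on local disk, which requires less I/O than writing it to a DFS or object store (where it must be replicated and written to disk on several machines). MapReduce applied this optimization only to mapper output; dataflow engines generalize it to all intermediate state.
- Downstream operators can start as soon as their input is ready, without waiting for the entire preceding stage to finish.
- Existing processes can be reused to run new operators, reducing startup overhead; MapReduce often launches a new JVM for each task.
Dataflow engines can therefore implement the same computations as MapReduce workflows, but they usually run significantly faster.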
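### Shuffling Data {#sec_shuffle}
Both the Unix tools example at the start of the chapter and MapReduce are built on sorting. Batch processing systems need to be able to sort petabytes of data, which does not fit on a single machine, so they must use a distributed sorting algorithm in which both the input and the output are sharded. This is called a *shuffle*.
> [!NOTE] Shuffle Is Not Random
> The term "shuffle" can be misleading. Shuffling a deck of cards produces a random order; the shuffle described here produces a sorted, deterministic order with no randomness.
The shuffle is a fundamental algorithm of batch processing systems, on which both joins and aggregations depend. MapReduce, Spark, Flink, Daft, Dataflow, and BigQuery [^24] all implement highly scalable, high-performance shuffle mechanisms to handle large datasets. We use the shuffle in Hadoop MapReduce for illustration here [^25], but the core ideas apply to other systems too.
[Figure 11-1](#fig_batch_mapreduce) shows the dataflow in a MapReduce job. We assume the input is already sharded; the shards are labeled *m1*, *m2*, and *m3*. For example, each shard may be a file in HDFS or an object in an object store, and all the shards belonging to the same dataset may be placed in the same HDFS directory or share the same object prefix.
{{< figure src="/fig/ddia_1101.png" id="fig_batch_mapreduce" caption="Figure 11-1. A MapReduce job with three mappers and three reducers." class="w-full my-4" >}}
The framework starts one map task for each input shard. The task reads its assigned file and calls the mapper callback for each record. The reduce side is also sharded. The number of map tasks is determined by the number of input shards; the number of reduce tasks is configured by the job author (and can differ from the number of map tasks).
The output of the mapper consists of key-value pairs. The framework must ensure that if different mappers output the same key, those key-value pairs end up being processed by the same reducer. To achieve this, each mapper maintains a separate output file on its local disk for every reducer (for example, the file *m1,r2* in [Figure 11-1](#fig_batch_mapreduce) is produced by mapper 1 and destined for reducer 2). For each key-value pair it outputs, the mapper typically uses a hash of the key to decide which reducer's file to write it to (similar to ["Sharding by Hash of Key"](/tw/ch7#sec_sharding_hash)).
While writing these files, the mapper also sorts the contents of each file by key. This can be done using the techniques from ["Log-Structured Storage"](/tw/ch4#sec_storage_log_structured): a batch of key-value pairs is first accumulated in a sorted in-memory structure, written out as a sorted segment file, and smaller segments are then progressively merged into larger ones.
When each mapper finishes, the reducers connect to it and copy the sorted files destined for them to their local disks. Once a reducer has its share of the output from all mappers, it merges the files using mergesort, preserving the sort order. Records with the same key are adjacent after merging, even if they came from different mappers. The reducer is then called once per key, each time with an iterator over all the values for that key.
The records output by the reducer are written sequentially to a file, one per reduce task. The files *r1*, *r2*, and *r3* in [Figure 11-1](#fig_batch_mapreduce) are the shards of the output dataset, which are written back to the DFS or object store.
MapReduce performs the shuffle between map and reduce; modern dataflow engines and cloud data warehouses are more sophisticated. Systems such as BigQuery have optimized the shuffle to keep data in memory as far as possible and to write it to an external sorting service [^24], which improves speed and provides resilience through replication.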
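
The map-side partitioning and reduce-side merge described in this section can be sketched in memory as follows. Lists stand in for the per-reducer files on local disk, and the choice of CRC32 as the hash function is an assumption of this sketch (Python's built-in `hash()` for strings varies between processes, so it would not be stable across machines); Hadoop's actual partitioner and file handling differ.

```python
import heapq
import zlib
from itertools import groupby
from operator import itemgetter


def partition_for(key, num_reducers):
    """Choose the reducer for a key using a hash that is stable across processes."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def map_side(pairs, num_reducers):
    """Split one mapper's output into one sorted run per reducer (m1,r1; m1,r2; ...)."""
    runs = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        runs[partition_for(key, num_reducers)].append((key, value))
    return [sorted(run, key=itemgetter(0)) for run in runs]


def reduce_side(runs_for_reducer):
    """Merge the sorted runs fetched from every mapper, then group values by key."""
    merged = heapq.merge(*runs_for_reducer, key=itemgetter(0))
    for key, group in groupby(merged, key=itemgetter(0)):
        yield key, [value for _, value in group]


def shuffle(mapper_outputs, num_reducers):
    """Return, for each reducer, its sorted list of (key, values) pairs."""
    map_files = [map_side(pairs, num_reducers) for pairs in mapper_outputs]
    return [
        list(reduce_side([files[r] for files in map_files]))
        for r in range(num_reducers)
    ]
```

Whatever the hash function, every occurrence of a key lands in the same reducer's partition, and within a partition the keys arrive in sorted order.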
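#### JOIN and GROUP BY {#sec_batch_join}
Let's now see how sorted data simplifies distributed joins and aggregations. We continue to use MapReduce for illustration, but the concepts apply to most batch processing systems.
A typical join in batch processing is shown in [Figure 11-2](#fig_batch_join_example). On the left is a log of user activity (known as *activity events* or *clickstream data*), and on the right is a database of users. You can think of it as part of a star schema (see ["Stars and Snowflakes: Schemas for Analytics"](/tw/ch3#sec_datamodels_analytics)): the activity log is the fact table, and the user database is one of the dimension tables.
{{< figure src="/fig/ddia_1102.png" id="fig_batch_join_example" caption="Figure 11-2. A join between a log of user activity events and a database of user profiles." class="w-full my-4" >}}
If you want to analyze activity in combination with information from the user database (for example, using the date-of-birth field to determine which pages are more popular with younger or older users), you need to join the two tables. How would you do that if both are so large that they must be sharded?
You can use a key property of MapReduce: the shuffle brings all key-value pairs with the same key to the same reducer, no matter which shard they were originally on. Here, the user ID can serve as the key. You can therefore write one mapper that scans the activity log and emits the viewed page URL keyed by user ID (see [Figure 11-3](#fig_batch_join_reduce)), and another mapper that goes over the user table row by row, extracting the user ID as the key and the date of birth as the value.
{{< figure src="/fig/ddia_1103.png" id="fig_batch_join_reduce" caption="Figure 11-3. A sort-merge join on user ID. If the input datasets are sharded into multiple files, each can be processed by multiple mappers in parallel." class="w-full my-4" >}}
The shuffle ensures that a reducer receives both a user's date of birth and all of that user's page view events. MapReduce can even arrange the records so that the reducer sees the user record first, followed by the activity events in timestamp order; this is known as a *secondary sort* [^25].
The reducer can then implement the join logic easily: it first receives the date of birth and stores it in a local variable, then iterates over the activity events with the same user ID, outputting pairs of viewed URL and viewer's date of birth. Because the reducer processes all of the records for one user in one go, it needs to keep only one user record in memory, and it never needs to make any network requests. This algorithm is known as a *sort-merge join*: mapper output is sorted by key, and the reducers then merge the sorted records from both sides of the join.
The next MapReduce job in the workflow can then compute the age distribution of viewers for each URL: it first shuffles by URL, then in the reducer iterates over all page views (with date of birth) for the same URL, maintaining a counter for each age group and incrementing it one record at a time, thus implementing a *group by* and aggregation.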
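
The reduce-side logic of the sort-merge join can be sketched in a few lines of Python. A single in-memory `sorted()` call stands in for the distributed shuffle, and the composite key `(user_id, tag, timestamp)` is one possible way of modeling the secondary sort; the record layouts and function names are assumptions made for this example.

```python
from itertools import groupby

USER, EVENT = 0, 1  # the tag makes a user record sort ahead of that user's events


def map_users(users):
    for user_id, date_of_birth in users:
        yield (user_id, USER, ""), date_of_birth


def map_events(events):
    for user_id, timestamp, url in events:
        yield (user_id, EVENT, timestamp), url


def reduce_join(sorted_pairs):
    """Join each page view with the viewer's date of birth, one user at a time."""
    for user_id, group in groupby(sorted_pairs, key=lambda kv: kv[0][0]):
        date_of_birth = None  # the only per-user state kept in memory
        for (_, tag, _), value in group:
            if tag == USER:
                date_of_birth = value       # arrives first thanks to the secondary sort
            else:
                yield value, date_of_birth  # (viewed URL, viewer's date of birth)


def join(users, events):
    pairs = sorted([*map_users(users), *map_events(events)])  # stands in for the shuffle
    return list(reduce_join(pairs))
```

An event whose user ID has no matching user record is emitted with `None` as the date of birth, which corresponds to a left outer join from the activity log; an inner join would simply skip such events.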
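### Query Languages {#sec_batch_query_lanauges}
Over the years, distributed batch execution engines have matured. The infrastructure is now robust enough to store and process many petabytes of data on clusters of over 10,000 machines. As the problem of actually operating systems at this scale has largely been solved, attention has turned to the usability of the programming model.
MapReduce, dataflow engines, and cloud data warehouses have all adopted SQL as the lingua franca of batch processing. This is natural: traditional data warehouses already used SQL, data analytics and ETL tools all support SQL, and almost every developer and analyst knows it.
Compared with handwritten MapReduce, query language interfaces not only require less code but also allow interactive use: you can write analytical SQL in a terminal or GUI and run it directly. This kind of interactive querying is very productive for people in roles such as business analysis, product, sales, and finance who explore data. Although it is not quite batch processing in the classic sense, SQL allows exploratory queries to run efficiently on distributed batch processing systems.
High-level query languages not only make people more productive, they also improve execution efficiency on the machine. As described in ["Cloud Data Warehouses"](/tw/ch4#sec_cloud_data_warehouses), the query engine must translate SQL into batch jobs that run on the cluster. This translation from query to syntax tree to physical operators gives the engine an opportunity to optimize. Query engines such as Hive, Trino, Spark, and Flink have cost-based optimizers that can analyze the properties of join inputs, automatically choose a suitable join algorithm, and even reorder joins to minimize intermediate state [^19] [^26] [^27] [^28].
SQL is the most popular general-purpose batch processing language, but other languages are still used in some niches. Apache Pig provides a way of describing data pipelines step by step using relational operators, rather than as one huge SQL query. DataFrames (see the next section) have similar characteristics, and Morel is a more modern language influenced by Pig. Some users also adopt JSON query languages such as jq, JMESPath, or JsonPath.
In ["Graph-Like Data Models"](/tw/ch3#sec_datamodels_graph) we discussed graph modeling and how graph query languages traverse edges and vertices. Many graph processing frameworks also support batch computation through query languages, such as Apache TinkerPop's Gremlin. We look at graph processing use cases further in ["Batch Use Cases"](#sec_batch_output).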
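> [!TIP] Batch Processing and Cloud Data Warehouses Are Converging
> Historically, data warehouses ran on specialized hardware appliances and mainly provided SQL analytics queries over relational data, whereas batch processing frameworks such as MapReduce emphasized greater scalability and flexibility, allowing processing logic to be written in a general-purpose programming language and arbitrary data formats to be read and written.
>
> Over time, the two have become much more alike. Modern batch processing frameworks support SQL and achieve good performance on relational queries thanks to columnar formats such as Parquet and optimized execution engines (see ["Query Execution: Compilation and Vectorization"](/tw/ch4#sec_storage_vectorized)). Meanwhile, data warehouses have become more scalable by moving to the cloud (see ["Cloud Data Warehouses"](/tw/ch4#sec_cloud_data_warehouses)), and implement many of the same scheduling, fault tolerance, and shuffle techniques as distributed batch frameworks; many also use distributed filesystems.
>
> Just as batch processing systems have adopted SQL, cloud warehouses are adopting alternative processing models such as DataFrames (see the next section). For example, BigQuery offers BigQuery DataFrames, and Snowflake's Snowpark integrates with Pandas. Batch workflow orchestrators such as Airflow, Prefect, and Dagster also integrate widely with cloud warehouses.
>
> Of course, not every batch job is easy to express in SQL. Iterative graph algorithms such as PageRank and complex machine learning tasks are difficult to write in SQL. The same goes for AI processing of non-relational, multimodal data such as images, video, and audio.
>
> In addition, cloud data warehouses are not ideal for some workloads. Row-by-row computation is a poor match for columnar storage and is therefore inefficient; in such cases it is better to use the warehouse's other APIs or a batch processing system. Cloud warehouses also tend to be more expensive than other batch processing systems, so it can be more cost-effective to run some large jobs in systems such as Spark or Flink.
>
> The choice between a batch processing system and a data warehouse therefore comes down to a combination of factors such as cost, convenience, implementation complexity, and availability. Large enterprises often run several systems side by side to keep their options open; smaller companies can usually get by with one.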
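### DataFrames {#id287}
As data scientists and statisticians began using distributed batch frameworks for machine learning, they found the existing processing models awkward, being more accustomed to the DataFrame data model in R and Pandas (see ["DataFrames, Matrices, and Arrays"](/tw/ch3#sec_datamodels_dataframes)). A DataFrame is similar to a table in a relational database: it consists of rows, and all the values in a column have the same type. Rather than writing one large SQL query, you filter, join, sort, group, and so on by calling functions that correspond to relational operators.
Early DataFrame operations mostly ran in local memory, so they could handle only datasets that fit on a single machine. Data scientists wanted to process large datasets in batch environments while still using the familiar DataFrame API. Distributed frameworks such as Spark, Flink, and Daft therefore provide DataFrame APIs. Note that local DataFrames are usually indexed and ordered, whereas distributed DataFrames often are not [^29], which can lead to performance surprises when migrating.
DataFrame APIs look similar to dataflow APIs, but the implementations differ considerably. Pandas usually executes an operation immediately when a method is called, whereas Spark first translates the DataFrame API calls into a query plan, runs query optimization, and only then executes the plan on the distributed dataflow engine, which gives better performance.
Frameworks such as Daft even support both client-side and server-side computation: small in-memory operations run on the client, while large datasets and heavy computation run on the server. Columnar formats such as Apache Arrow provide a unified data model that the execution engines on both sides can share.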
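## Batch Use Cases {#sec_batch_output}
Now that we have seen how batch processing works, let's look at how it is applied in different applications. Batch processing is very well suited to bulk computation over large volumes of data, but not to low-latency scenarios. Batch jobs therefore turn up almost wherever there is a lot of data and freshness requirements are modest. That may sound limiting, but a great deal of real-world work fits this model:
- Accounting reconciliation and inventory checks: companies periodically verify that transactions, bank accounts, and inventory are consistent, often using batch jobs [^30].
- Demand forecasting in manufacturing is usually computed by periodic batch jobs [^31].
- Training recommendation models for ecommerce, media, and social platforms relies heavily on batch processing [^32] [^33].
- Many financial systems are batch-driven too. For example, the US banking network runs almost entirely on batch jobs [^34].
The following sections discuss several batch processing use cases found in almost every industry.
### Extract-Transform-Load (ETL) {#sec_batch_etl_usage}
["Data Warehousing"](/tw/ch1#sec_introduction_dwh) introduced ETL and ELT, in which data is extracted from production databases, transformed, and loaded into downstream systems. In this section we use "ETL" to refer to both kinds of workload. ETL is often run as batch jobs, especially when the downstream system is a data warehouse.
Batch processing is inherently parallel, which makes it a good fit for data transformation. Many transformation tasks are embarrassingly parallel: filtering, projecting fields, and many other common warehouse transformations can all be performed in parallel.
Batch processing environments usually come with mature workflow schedulers that make it easy to schedule, orchestrate, and debug ETL pipelines. When a failure occurs, the scheduler often retries automatically to cover transient problems; if the job keeps failing, it is clearly marked as failed, helping engineers quickly find where the pipeline broke. Schedulers such as Airflow also have many built-in source, sink, and query operators that connect directly to dozens of systems, including MySQL, PostgreSQL, Snowflake, Spark, and Flink. This tight integration between schedulers and data processing systems considerably simplifies data integration.
We have also seen that batch processing makes it easy to troubleshoot and fix things after something goes wrong, which is crucial when debugging data pipelines. Failed files can be inspected directly, and ETL jobs can be fixed and rerun. For example, if an input file no longer contains a field that the transformation logic depends on, a data engineer can update the transformation logic or fix the upstream job that produces the file.
In the past, data pipelines were often maintained centrally by a single data engineering team, because it was not realistic to expect product teams to write and maintain complex batch pipelines themselves. In recent years, improvements in processing models and metadata management have enabled more teams across an organization to contribute to and maintain their own pipelines. Practices such as *data mesh* [^35] [^36], *data contracts* [^37], and *data fabric* [^38] use standards and tooling to help teams safely publish data that can be consumed across the organization.
Today, data pipelines and analytics queries share not only processing models but often execution engines as well. Many ETL jobs run in the same system as the analytics queries that consume their output: for example, both may be executed as SparkSQL, Trino, or DuckDB queries. Such architectures further blur the lines between application engineering, data engineering, analytics engineering, and business analysis.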
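### Analytics {#sec_batch_olap}
In ["Operational Versus Analytical Systems"](/tw/ch1#sec_introduction_analytics) we saw that analytics queries (OLAP) often scan a large number of records, performing grouping and aggregation. Such workloads can run on a batch processing system alongside other batch jobs. Analysts write SQL that is executed by a query engine, which reads and writes the underlying DFS or object store. Table metadata, such as the mapping from tables to files, names, and types, is usually managed by a table format such as Apache Iceberg and a catalog such as Unity (see ["Cloud Data Warehouses"](/tw/ch4#sec_cloud_data_warehouses)). This architecture is known as a *data lakehouse* [^39].
As with ETL, improvements in SQL interfaces have led many organizations to run analytics directly on batch frameworks such as Spark. There are two common patterns:
- Pre-aggregation queries: data is rolled up into OLAP cubes or data marts to make queries faster (see ["Materialized Views and Data Cubes"](/tw/ch4#sec_storage_materialized_views)). The pre-aggregated results can be queried in the warehouse or pushed to a real-time OLAP system such as Apache Druid or Apache Pinot. Pre-aggregation usually runs at fixed intervals, typically managed by the schedulers mentioned in ["Scheduling Workflows"](#sec_batch_workflows).
- Ad hoc queries: users run these at any time to answer specific business questions, analyze user behavior, investigate operational issues, and so on. Response time matters a great deal here, since analysts usually iterate, asking further questions based on each result. Fast batch query engines can significantly shorten the wait.
SQL support also makes it easier to connect batch processing systems to spreadsheets and visualization tools such as Tableau, Power BI, Looker, and Apache Superset. For example, Tableau has SparkSQL and Presto connectors, and Superset supports Trino, Hive, Spark SQL, Presto, and many other data systems that ultimately trigger batch jobs.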
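### Machine Learning {#id290}
Machine learning (ML) relies heavily on batch processing. Data scientists, ML engineers, and AI engineers use batch frameworks to explore patterns in data, transform data, and train models. Common uses include:
- Feature engineering: raw data is filtered and transformed into data that a model can be trained on. Predictive models often require numeric features, so data such as text or discrete values must first be converted to the required format.
- Model training: the training data is the input to the batch process, and the weights of the trained model are the output.
- Batch inference: when the dataset is large and real-time results are not required, predictions can be made over a whole batch of data. This includes evaluating a model's predictions on a test dataset.
Many frameworks provide dedicated tools for these uses. For example, Spark's MLlib and Flink's FlinkML include a rich set of feature engineering tools, statistical functions, and classifiers.
ML applications such as recommendation and ranking systems also make heavy use of graph processing (see ["Graph-Like Data Models"](/tw/ch3#sec_datamodels_graph)). Many graph algorithms are expressed by propagating information along edges one step at a time, repeatedly: a vertex is joined with its adjacent vertices to pass some information along, and this is repeated until a stopping condition is met, for example there are no more edges to follow, or some metric converges.
The *bulk synchronous parallel* (BSP) model of computation [^40] has become a popular model for batch graph processing. It is implemented by Apache Giraph [^20], Spark GraphX, and Flink Gelly [^41], among others. It is also often called the *Pregel* model, because Google's Pregel paper popularized this approach [^42].
Batch processing is also an important part of data preparation and training for large language models (LLMs). Raw text such as web pages is usually stored in a DFS or object store and must be preprocessed before it can be used for training. Preprocessing steps that are well suited to batch frameworks include:
- extracting plain text from HTML and repairing corrupted text;
- detecting and removing low-quality, irrelevant, or duplicate documents;
- tokenizing the text and converting it into embeddings (numeric representations of words or fragments).
Frameworks such as Kubeflow, Flyte, and Ray are built specifically for such workloads. OpenAI, for example, uses Ray in the training process for ChatGPT [^43]. These frameworks usually have built-in integrations with LLM and AI libraries such as PyTorch, TensorFlow, and XGBoost, and support feature engineering, model training, batch inference, fine-tuning, and more.
Finally, data scientists often experiment with data in interactive notebooks such as Jupyter or Hex. A notebook consists of *cells*, each a small piece of Markdown, Python, or SQL; running the cells in order produces tables, charts, or data. Many notebooks call batch processing systems behind the scenes through DataFrame APIs or SQL.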
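### Serving Derived Data {#sec_batch_serving_derived}
Batch processing is often used to build precomputed or derived datasets such as product recommendations, user-facing reports, and machine learning features. These datasets are usually served from a production database, key-value store, or search engine. Whatever the target system, the output of the batch job in the DFS or object store has to be loaded back into the database that serves online traffic.
The most obvious approach is to use the database's client library directly within the batch job and write to the production database one record at a time (assuming the firewall allows it). This works, but it is usually a bad idea, for three reasons:
- Making a network request for every record is orders of magnitude slower than the normal throughput of a batch task. Even if the client supports batched writes, performance is usually poor.
- Batch processing frameworks often run many tasks in parallel. If all the tasks write to the same database concurrently at batch rates, the database is easily overwhelmed, which degrades its performance for online queries and can cause failures in other parts of the system [^44].
- Batch jobs usually provide clean all-or-nothing output semantics: if a job succeeds, the result is equivalent to running every task exactly once; if it fails, there is no valid output. But writing to an external system from inside a job produces externally visible side effects that are hard to hide: other systems may see partially completed results, and restarting failed tasks may cause duplicate writes.
A better solution is to push the precomputed results to a streaming system such as Kafka first (we discuss this in depth in [Chapter 12](/tw/ch12#ch_stream)). Derived data stores such as Elasticsearch, Apache Pinot, Apache Druid, and Venice [^45], as well as cloud data warehouses such as ClickHouse, support ingesting data from Kafka. Going via a streaming system mitigates the problems above:
- streaming systems are optimized for sequential writes, which suits the high-throughput write pattern of batch jobs;
- the streaming system acts as a buffer between the batch job and the production database, and downstream consumers can throttle their reads to their own capacity, avoiding any impact on online traffic;
- the output of one batch job can be consumed by several downstream systems at once;
- the streaming system can also serve as a security boundary between the batch network and the production network (it can be deployed in a DMZ).
However, going via a stream does not automatically provide all-or-nothing semantics. To achieve that, the batch job must send a notification downstream once it has finished, indicating that the job is complete and its output can be made visible. Stream consumers must keep new data invisible to queries until they receive the completion notification, much like a *read committed* transaction (see ["Read Committed"](/tw/ch8#sec_transactions_read_committed)).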
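
One way a consumer might implement this rule is sketched below. The message format (dictionaries with `type`, `run_id`, `key`, and `value` fields), the `abort` message, and the class name `ReadCommittedView` are all assumptions made for illustration; they are not the format of Kafka or of any particular derived data store.

```python
class ReadCommittedView:
    """Buffers records from a batch run and exposes them only after the completion marker."""

    def __init__(self):
        self.visible = {}  # what queries are allowed to see
        self.pending = {}  # run_id -> records received but not yet visible

    def consume(self, message):
        run_id = message["run_id"]
        if message["type"] == "record":
            # Keyed writes make a retried task's duplicate records harmless.
            self.pending.setdefault(run_id, {})[message["key"]] = message["value"]
        elif message["type"] == "complete":
            self.visible.update(self.pending.pop(run_id, {}))  # publish in one step
        elif message["type"] == "abort":
            self.pending.pop(run_id, None)  # discard the partial output


view = ReadCommittedView()
view.consume({"type": "record", "run_id": "run-1", "key": "user:1", "value": ["/a"]})
view.consume({"type": "record", "run_id": "run-1", "key": "user:1", "value": ["/a"]})  # retry
assert view.visible == {}  # nothing is visible before the job completes
view.consume({"type": "complete", "run_id": "run-1"})
assert view.visible == {"user:1": ["/a"]}
```

In a real system, the switch from pending to visible would itself have to be atomic and durable, which this in-memory sketch does not attempt.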
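Another pattern, more common when bootstrapping a database, is to build a brand-new database inside the batch job and then bulk-load the files from the DFS, object store, or local filesystem into the target database. Many systems offer such bulk import tools, including TiDB Lightning and the Hadoop import jobs for Apache Pinot and Apache Druid; RocksDB also provides an API for bulk-importing SST files from batch jobs.
Building in batch and bulk importing is very fast, and makes it easier to switch atomically between versions of the data. But for use cases that need continuous incremental updates, building a brand-new database every time is harder. Many systems take a hybrid approach that supports both bootstrapping and incremental loading. Venice, for example, supports hybrid stores that allow both row-based batch updates and full dataset swaps.
## Summary {#id292}
This chapter discussed the design and implementation of batch processing systems. We started with the classic Unix toolchain (awk, sort, uniq, and so on) to illustrate basic batch processing primitives such as sorting and counting.
We then broadened our view to distributed batch processing systems. Batch processing works on an immutable, bounded input dataset and produces output data, which allows jobs to be rerun and debugged without introducing side effects. Around this pattern, batch frameworks usually comprise three core layers: an orchestration layer that decides when and where jobs run, a storage layer that persists data, and a compute layer that performs the actual computation.
We saw how distributed filesystems and object stores manage large files through replicated blocks, caching, and metadata services, and how modern batch frameworks interact with this storage through pluggable APIs. We also discussed how orchestrators schedule tasks, allocate resources, and handle faults in large clusters, and the difference between orchestrators that schedule individual jobs and workflow orchestrators that manage the lifecycle of a whole set of jobs according to a dependency graph.
In terms of processing models, we reviewed MapReduce and its classic map and reduce functions, and then introduced dataflow engines such as Spark and Flink, which are easier to use and perform better. To understand how batch jobs scale, we focused on the shuffle algorithm, the fundamental operation that enables grouping, joins, and aggregation.
As batch processing systems have matured, the focus has shifted to usability. High-level query languages (especially SQL) and DataFrame APIs make batch jobs easier to write and easier for optimizers to optimize. Query optimizers translate declarative queries into efficient execution plans.
Finally, we surveyed common use cases for batch processing:
- ETL pipelines, which extract, transform, and load data between different systems using scheduled workflows;
- analytics, which supports both pre-aggregated reports and ad hoc exploratory queries;
- machine learning, where batch processing is used to prepare and process large training datasets;
- loading batch output into systems that serve production traffic, often through streaming systems or bulk import tools, in order to serve derived data to users.
In the next chapter we turn to stream processing. Unlike batch processing, the input to stream processing is *unbounded*: there is still a job, but its input is a never-ending stream of data, so the job never "completes". We will see that stream and batch processing are similar in some respects, but the assumption of unbounded input also changes system design significantly.
### References {#references}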
[^1]: Nathan Marz. [How to Beat the CAP Theorem](http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html). *nathanmarz.com*, October 2011. Archived at [perma.cc/4BS9-R9A4](https://perma.cc/4BS9-R9A4)
[^2]: Molly Bartlett Dishman and Martin Fowler. [Agile Architecture](https://www.youtube.com/watch?v=VjKYO6DP3fo&list=PL055Epbe6d5aFJdvWNtTeg_UEHZEHdInE). At *O'Reilly Software Architecture Conference*, March 2015.
[^3]: Jeffrey Dean and Sanjay Ghemawat. [MapReduce: Simplified Data Processing on Large Clusters](https://www.usenix.org/legacy/publications/library/proceedings/osdi04/tech/full_papers/dean/dean.pdf). At *6th USENIX Symposium on Operating System Design and Implementation* (OSDI), December 2004.
[^4]: Shivnath Babu and Herodotos Herodotou. [Massively Parallel Databases and MapReduce Systems](https://www.microsoft.com/en-us/research/wp-content/uploads/2013/11/db-mr-survey-final.pdf). *Foundations and Trends in Databases*, volume 5, issue 1, pages 1–104, November 2013. [doi:10.1561/1900000036](https://doi.org/10.1561/1900000036)
[^5]: David J. DeWitt and Michael Stonebraker. [MapReduce: A Major Step Backwards](https://homes.cs.washington.edu/~billhowe/mapreduce_a_major_step_backwards.html). Originally published at *databasecolumn.vertica.com*, January 2008. Archived at [perma.cc/U8PA-K48V](https://perma.cc/U8PA-K48V)
[^6]: Henry Robinson. [The Elephant Was a Trojan Horse: On the Death of Map-Reduce at Google](https://www.the-paper-trail.org/post/2014-06-25-the-elephant-was-a-trojan-horse-on-the-death-of-map-reduce-at-google/). *the-paper-trail.org*, June 2014. Archived at [perma.cc/9FEM-X787](https://perma.cc/9FEM-X787)
[^7]: Urs Hölzle. [R.I.P. MapReduce. After having served us well since 2003, today we removed the remaining internal codebase for good](https://twitter.com/uhoelzle/status/1177360023976067077). *twitter.com*, September 2019. Archived at [perma.cc/B34T-LLY7](https://perma.cc/B34T-LLY7)
[^8]: Adam Drake. [Command-Line Tools Can Be 235x Faster than Your Hadoop Cluster](https://adamdrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html). *aadrake.com*, January 2014. Archived at [perma.cc/87SP-ZMCY](https://perma.cc/87SP-ZMCY)
[^9]: [`sort`: Sort text files](https://www.gnu.org/software/coreutils/manual/html_node/sort-invocation.html). GNU Coreutils 9.7 Documentation, Free Software Foundation, Inc., 2025.
[^10]: Michael Ovsiannikov, Silvius Rus, Damian Reeves, Paul Sutter, Sriram Rao, and Jim Kelly. [The Quantcast File System](https://db.disi.unitn.eu/pages/VLDBProgram/pdf/industry/p808-ovsiannikov.pdf). *Proceedings of the VLDB Endowment*, volume 6, issue 11, pages 1092--1101, August 2013. [doi:10.14778/2536222.2536234](https://doi.org/10.14778/2536222.2536234)
[^11]: Andrew Wang, Zhe Zhang, Kai Zheng, Uma Maheswara G., and Vinayakumar B. [Introduction to HDFS Erasure Coding in Apache Hadoop](https://www.cloudera.com/blog/technical/introduction-to-hdfs-erasure-coding-in-apache-hadoop.html). *blog.cloudera.com*, September 2015. Archived at [archive.org](https://web.archive.org/web/20250731115546/https://www.cloudera.com/blog/technical/introduction-to-hdfs-erasure-coding-in-apache-hadoop.html)
[^12]: Andy Warfield. [Building and operating a pretty big storage system called S3](https://www.allthingsdistributed.com/2023/07/building-and-operating-a-pretty-big-storage-system.html). *allthingsdistributed.com*, July 2023. Archived at [perma.cc/7LPK-TP7V](https://perma.cc/7LPK-TP7V)
[^13]: Vinod Kumar Vavilapalli, Arun C. Murthy, Chris Douglas, Sharad Agarwal, Mahadev Konar, Robert Evans, Thomas Graves, Jason Lowe, Hitesh Shah, Siddharth Seth, Bikas Saha, Carlo Curino, Owen O'Malley, Sanjay Radia, Benjamin Reed, and Eric Baldeschwieler. [Apache Hadoop YARN: Yet Another Resource Negotiator](https://opencourse.inf.ed.ac.uk/sites/default/files/2023-10/yarn-socc13.pdf). At *4th Annual Symposium on Cloud Computing* (SoCC), October 2013. [doi:10.1145/2523616.2523633](https://doi.org/10.1145/2523616.2523633)
[^14]: Richard M. Karp. [Reducibility Among Combinatorial Problems](https://www.cs.purdue.edu/homes/hosking/197/canon/karp.pdf). *Complexity of Computer Computations. The IBM Research Symposia Series*. Springer, 1972. [doi:10.1007/978-1-4684-2001-2_9](https://doi.org/10.1007/978-1-4684-2001-2_9)
[^15]: J. D. Ullman. [NP-Complete Scheduling Problems](https://www.cs.montana.edu/bhz/classes/fall-2018/csci460/paper4.pdf). *Journal of Computer and System Sciences*, volume 10, issue 3, June 1975. [doi:10.1016/S0022-0000(75)80008-0](https://doi.org/10.1016/S0022-0000(75)80008-0)
[^16]: Gilad David Maayan. [The complete guide to spot instances on AWS, Azure and GCP](https://www.datacenterdynamics.com/en/opinions/complete-guide-spot-instances-aws-azure-and-gcp/). *datacenterdynamics.com*, March 2021. Archived at [archive.org](https://web.archive.org/web/20250722114617/https://www.datacenterdynamics.com/en/opinions/complete-guide-spot-instances-aws-azure-and-gcp/)
[^17]: Abhishek Verma, Luis Pedrosa, Madhukar Korupolu, David Oppenheimer, Eric Tune, and John Wilkes. [Large-Scale Cluster Management at Google with Borg](https://dl.acm.org/doi/pdf/10.1145/2741948.2741964). At *10th European Conference on Computer Systems* (EuroSys), April 2015. [doi:10.1145/2741948.2741964](https://doi.org/10.1145/2741948.2741964)
[^18]: Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. [Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing](https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf). At *9th USENIX Symposium on Networked Systems Design and Implementation* (NSDI), April 2012.
[^19]: Paris Carbone, Stephan Ewen, Seif Haridi, Asterios Katsifodimos, Volker Markl, and Kostas Tzoumas. [Apache Flink™: Stream and Batch Processing in a Single Engine](http://sites.computer.org/debull/A15dec/p28.pdf). *Bulletin of the IEEE Computer Society Technical Committee on Data Engineering*, volume 38, issue 4, December 2015. Archived at [perma.cc/G3N3-BKX5](https://perma.cc/G3N3-BKX5)
[^20]: Mark Grover, Ted Malaska, Jonathan Seidman, and Gwen Shapira. *[Hadoop Application Architectures](https://learning.oreilly.com/library/view/hadoop-application-architectures/9781491910313/)*. O'Reilly Media, 2015. ISBN: 978-1-491-90004-8
[^21]: Jules S. Damji, Brooke Wenig, Tathagata Das, and Denny Lee. *[Learning Spark, 2nd Edition](https://learning.oreilly.com/library/view/learning-spark-2nd/9781492050032/)*. O'Reilly Media, 2020. ISBN: 978-1492050049
[^22]: Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. [Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks](https://www.microsoft.com/en-us/research/publication/dryad-distributed-data-parallel-programs-from-sequential-building-blocks/). At *2nd European Conference on Computer Systems* (EuroSys), March 2007. [doi:10.1145/1272996.1273005](https://doi.org/10.1145/1272996.1273005)
[^23]: Daniel Warneke and Odej Kao. [Nephele: Efficient Parallel Data Processing in the Cloud](https://stratosphere2.dima.tu-berlin.de/assets/papers/Nephele_09.pdf). At *2nd Workshop on Many-Task Computing on Grids and Supercomputers* (MTAGS), November 2009. [doi:10.1145/1646468.1646476](https://doi.org/10.1145/1646468.1646476)
[^24]: Hossein Ahmadi. [In-memory query execution in Google BigQuery](https://cloud.google.com/blog/products/bigquery/in-memory-query-execution-in-google-bigquery). *cloud.google.com*, August 2016. Archived at [perma.cc/DGG2-FL9W](https://perma.cc/DGG2-FL9W)
[^25]: Tom White. *[Hadoop: The Definitive Guide](https://learning.oreilly.com/library/view/hadoop-the-definitive/9781491901687/)*, 4th edition. O'Reilly Media, 2015. ISBN: 978-1-491-90163-2
[^26]: Fabian Hüske. [Peeking into Apache Flink's Engine Room](https://flink.apache.org/2015/03/13/peeking-into-apache-flinks-engine-room/). *flink.apache.org*, March 2015. Archived at [perma.cc/44BW-ALJX](https://perma.cc/44BW-ALJX)
[^27]: Mostafa Mokhtar. [Hive 0.14 Cost Based Optimizer (CBO) Technical Overview](https://web.archive.org/web/20170607112708/http://hortonworks.com/blog/hive-0-14-cost-based-optimizer-cbo-technical-overview/). *hortonworks.com*, March 2015. Archived on [archive.org](https://web.archive.org/web/20170607112708/http://hortonworks.com/blog/hive-0-14-cost-based-optimizer-cbo-technical-overview/)
[^28]: Michael Armbrust, Reynold S. Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph K. Bradley, Xiangrui Meng, Tomer Kaftan, Michael J. Franklin, Ali Ghodsi, and Matei Zaharia. [Spark SQL: Relational Data Processing in Spark](https://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf). At *ACM International Conference on Management of Data* (SIGMOD), June 2015. [doi:10.1145/2723372.2742797](https://doi.org/10.1145/2723372.2742797)
[^29]: Kaya Kupferschmidt. [Spark vs Pandas, part 2 -- Spark](https://towardsdatascience.com/spark-vs-pandas-part-2-spark-c57f8ea3a781/). *towardsdatascience.com*, October 2020. Archived at [perma.cc/5BRK-G4N5](https://perma.cc/5BRK-G4N5)
[^30]: Ammar Chalifah. [Tracking payments at scale](https://bolt.eu/en/blog/tracking-payments-at-scale). *bolt.eu*, June 2025. Archived at [perma.cc/Q4KX-8K3J](https://perma.cc/Q4KX-8K3J)
[^31]: Nafi Ahmet Turgut, Hamza Akyıldız, Hasan Burak Yel, Mehmet İkbal Özmen, Mutlu Polatcan, Pinar Baki, and Esra Kayabali. [Demand forecasting at Getir built with Amazon Forecast](https://aws.amazon.com/blogs/machine-learning/demand-forecasting-at-getir-built-with-amazon-forecast). *aws.amazon.com*, May 2023. Archived at [perma.cc/H3H6-GNL7](https://perma.cc/H3H6-GNL7)
[^32]: Jason (Siyu) Zhu. [Enhancing homepage feed relevance by harnessing the power of large corpus sparse ID embeddings](https://www.linkedin.com/blog/engineering/feed/enhancing-homepage-feed-relevance-by-harnessing-the-power-of-lar). *linkedin.com*, August 2023. Archived at [archive.org](https://web.archive.org/web/20250225094424/https://www.linkedin.com/blog/engineering/feed/enhancing-homepage-feed-relevance-by-harnessing-the-power-of-lar)
[^33]: Avery Ching, Sital Kedia, and Shuojie Wang. [Apache Spark \@Scale: A 60 TB+ production use case](https://engineering.fb.com/2016/08/31/core-infra/apache-spark-scale-a-60-tb-production-use-case/). *engineering.fb.com*, August 2016. Archived at [perma.cc/F7R5-YFAV](https://perma.cc/F7R5-YFAV)
[^34]: Edward Kim. [How ACH works: A developer perspective --- Part 1](https://engineering.gusto.com/how-ach-works-a-developer-perspective-part-1-339d3e7bea1). *engineering.gusto.com*, April 2014. Archived at [perma.cc/F67P-VBLK](https://perma.cc/F67P-VBLK)
[^35]: Zhamak Dehghani. [How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh](https://martinfowler.com/articles/data-monolith-to-mesh.html). *martinfowler.com*, May 2019. Archived at [perma.cc/LN2L-L4VC](https://perma.cc/LN2L-L4VC)
[^36]: Chris Riccomini. [What the Heck is a Data Mesh?!](https://cnr.sh/essays/what-the-heck-data-mesh) *cnr.sh*, June 2021. Archived at [perma.cc/NEJ2-BAX3](https://perma.cc/NEJ2-BAX3)
[^37]: Chad Sanderson, Mark Freeman, B. E. Schmidt. [*Data Contracts*](https://www.oreilly.com/library/view/data-contracts/9781098157623/). O'Reilly Media, 2025. ISBN: 9781098157623
[^38]: Daniel Abadi. [Data Fabric vs. Data Mesh: What's the Difference?](https://www.starburst.io/blog/data-fabric-vs-data-mesh-whats-the-difference/) *starburst.io*, November 2021. Archived at [perma.cc/RSK3-HXDK](https://perma.cc/RSK3-HXDK)
[^39]: Michael Armbrust, Ali Ghodsi, Reynold Xin, and Matei Zaharia. [Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics](https://www.cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf). At *11th Annual Conference on Innovative Data Systems Research* (CIDR), January 2021.
[^40]: Leslie G. Valiant. [A Bridging Model for Parallel Computation](https://dl.acm.org/doi/pdf/10.1145/79173.79181). *Communications of the ACM*, volume 33, issue 8, pages 103--111, August 1990. [doi:10.1145/79173.79181](https://doi.org/10.1145/79173.79181)
[^41]: Stephan Ewen, Kostas Tzoumas, Moritz Kaufmann, and Volker Markl. [Spinning Fast Iterative Data Flows](https://vldb.org/pvldb/vol5/p1268_stephanewen_vldb2012.pdf). *Proceedings of the VLDB Endowment*, volume 5, issue 11, pages 1268-1279, July 2012. [doi:10.14778/2350229.2350245](https://doi.org/10.14778/2350229.2350245)
[^42]: Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. [Pregel: A System for Large-Scale Graph Processing](https://kowshik.github.io/JPregel/pregel_paper.pdf). At *ACM International Conference on Management of Data* (SIGMOD), June 2010. [doi:10.1145/1807167.1807184](https://doi.org/10.1145/1807167.1807184)
[^43]: Richard MacManus. [OpenAI Chats about Scaling LLMs at Anyscale's Ray Summit](https://thenewstack.io/openai-chats-about-scaling-llms-at-anyscales-ray-summit/). *thenewstack.io*, September 2023. Archived at [perma.cc/YJD6-KUXU](https://perma.cc/YJD6-KUXU)
[^44]: Jay Kreps. [Why Local State is a Fundamental Primitive in Stream Processing](https://www.oreilly.com/ideas/why-local-state-is-a-fundamental-primitive-in-stream-processing). *oreilly.com*, July 2014. Archived at [perma.cc/P8HU-R5LA](https://perma.cc/P8HU-R5LA)
[^45]: Félix GV. [Open Sourcing Venice -- LinkedIn's Derived Data Platform](https://www.linkedin.com/blog/engineering/open-source/open-sourcing-venice-linkedin-s-derived-data-platform). *linkedin.com*, September 2022. Archived at [archive.org](https://web.archive.org/web/20250226160927/https://www.linkedin.com/blog/engineering/open-source/open-sourcing-venice-linkedin-s-derived-data-platform)
================================================
FILE: content/tw/ch12.md
================================================
---
title: "第十二章:流處理"
linkTitle: "12. 流處理"
weight: 312
math: true
breadcrumbs: false
---

> 有效的複雜系統總是從簡單的系統演化而來。反之亦然:從零設計的複雜系統沒一個能有效工作的。
>
> —— 約翰・加爾,Systemantics(1975)
在 [第十一章](/tw/ch11) 中,我們討論了批處理技術,它讀取一組檔案作為輸入,並生成一組新的檔案作為輸出。輸出是 **派生資料(derived data)** 的一種形式;也就是說,如果需要,可以透過再次執行批處理過程來重新建立資料集。我們看到了如何使用這個簡單而強大的想法來建立搜尋索引、推薦系統、做分析等等。
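However, one big assumption remained throughout [Chapter 11](/tw/ch11): the input is bounded, that is, of a known and finite size, so the batch process knows when it has finished reading its input. For example, the sorting operation that is central to MapReduce must read its entire input before it can start producing output. The very last input record could be the one with the lowest key, and would therefore need to be the very first output record, so starting the output early is not an option.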
實際上,很多資料是 **無界限** 的,因為它隨著時間的推移而逐漸到達:你的使用者在昨天和今天產生了資料,明天他們將繼續產生更多的資料。除非你停業,否則這個過程永遠都不會結束,所以資料集從來就不會以任何有意義的方式 “完成”[^1]。因此,批處理程式必須將資料人為地分成固定時間段的資料塊,例如,在每天結束時處理一天的資料,或者在每小時結束時處理一小時的資料。
日常批處理中的問題是,輸入的變更只會在一天之後的輸出中反映出來,這對於許多急躁的使用者來說太慢了。為了減少延遲,我們可以更頻繁地執行處理 —— 比如說,在每秒鐘的末尾 —— 或者甚至更連續一些,完全拋開固定的時間切片,當事件發生時就立即進行處理,這就是 **流處理(stream processing)** 背後的想法。
一般來說,“流” 是指隨著時間的推移逐漸可用的資料。這個概念出現在很多地方:Unix 的 stdin 和 stdout、程式語言(惰性列表)[^2]、檔案系統 API(如 Java 的 `FileInputStream`)、TCP 連線、透過網際網路傳送音訊和影片等等。
在本章中,我們將把 **事件流(event stream)** 視為一種資料管理機制:無界限,增量處理,與上一章中的批次資料相對應。我們將首先討論怎樣表示、儲存、透過網路傳輸流。在 “[資料庫與流](#sec_stream_databases)” 中,我們將研究流和資料庫之間的關係。最後在 “[流處理](#sec_stream_processing)” 中,我們將研究連續處理這些流的方法和工具,以及它們用於應用構建的方式。
## 傳遞事件流 {#sec_stream_transmit}
在批處理領域,作業的輸入和輸出是檔案(也許在分散式檔案系統上)。流處理領域中的等價物看上去是什麼樣子的?
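When the input is a file (a sequence of bytes), the first processing step is usually to parse it into a sequence of records. In a stream processing context, a record is more commonly known as an **event**, but it is essentially the same thing: a small, self-contained, immutable object containing the details of something that happened at some point in time. An event usually contains a timestamp from a time-of-day clock indicating when it happened (see "[Monotonic Versus Time-of-Day Clocks](/tw/ch9#sec_distributed_monotonic_timeofday)").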
例如,發生的事件可能是使用者採取的行動,例如檢視頁面或進行購買。它也可能來源於機器,例如對溫度感測器或 CPU 利用率的週期性測量。在 “[使用 Unix 工具的批處理](/tw/ch11#sec_batch_unix)” 的示例中,Web 伺服器日誌的每一行都是一個事件。
事件可能被編碼為文字字串或 JSON,或者某種二進位制編碼,如 [第五章](/tw/ch5) 所述。這種編碼允許你儲存一個事件,例如將其追加到一個檔案,將其插入關係表,或將其寫入文件資料庫。它還允許你透過網路將事件傳送到另一個節點以進行處理。
在批處理中,檔案被寫入一次,然後可能被多個作業讀取。類似地,在流處理術語中,一個事件由 **生產者(producer)** (也稱為 **釋出者(publisher)** 或 **傳送者(sender)** )生成一次,然後可能由多個 **消費者(consumer)** ( **訂閱者(subscribers)** 或 **接收者(recipients)** )進行處理[^3]。在檔案系統中,檔名標識一組相關記錄;在流式系統中,相關的事件通常被聚合為一個 **主題(topic)** 或 **流(stream)** 。
原則上講,檔案或資料庫就足以連線生產者和消費者:生產者將其生成的每個事件寫入資料儲存,且每個消費者定期輪詢資料儲存,檢查自上次執行以來新出現的事件。這實際上正是批處理在每天結束時處理當天資料時所做的事情。
但當我們想要進行低延遲的連續處理時,如果資料儲存不是為這種用途專門設計的,那麼輪詢開銷就會很大。輪詢的越頻繁,能返回新事件的請求比例就越低,而額外開銷也就越高。相比之下,最好能在新事件出現時直接通知消費者。
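Databases have traditionally not supported this kind of notification mechanism very well. Relational databases commonly have **triggers**, which can react to a change (for example, a row being inserted into a table), but they are very limited in what they can do and have been somewhat of an afterthought in database design[^4]. Instead, specialized tools have been developed for the purpose of delivering event notifications.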
### 訊息傳遞系統 {#sec_stream_messaging}
向消費者通知新事件的常用方式是使用 **訊息傳遞系統(messaging system)**:生產者傳送包含事件的訊息,然後將訊息推送給消費者。我們之前在 “[訊息傳遞中的資料流](/tw/ch5#sec_encoding_dataflow_msg)” 中談到了這些系統,但現在我們將詳細介紹這些系統。
像生產者和消費者之間的 Unix 管道或 TCP 連線這樣的直接通道,是實現訊息傳遞系統的簡單方法。但是,大多數訊息傳遞系統都在這一基本模型上進行了擴充套件。特別的是,Unix 管道和 TCP 將恰好一個傳送者與恰好一個接收者連線,而一個訊息傳遞系統允許多個生產者節點將訊息傳送到同一個主題,並允許多個消費者節點接收主題中的訊息。
在這個 **釋出 / 訂閱** 模式中,不同的系統採取各種各樣的方法,並沒有針對所有目的的通用答案。為了區分這些系統,問一下這兩個問題會特別有幫助:
1. **如果生產者傳送訊息的速度比消費者能夠處理的速度快會發生什麼?** 一般來說,有三種選擇:系統可以丟掉訊息,將訊息放入緩衝佇列,或使用 **背壓**(backpressure,也稱為 **流量控制**,即 flow control:阻塞生產者,以免其傳送更多的訊息)。例如 Unix 管道和 TCP 就使用了背壓:它們有一個固定大小的小緩衝區,如果填滿,傳送者會被阻塞,直到接收者從緩衝區中取出資料(請參閱 “[網路擁塞和排隊](/tw/ch9#sec_distributed_congestion)”)。
如果訊息被快取在佇列中,那麼理解佇列增長會發生什麼是很重要的。當佇列裝不進記憶體時系統會崩潰嗎?還是將訊息寫入磁碟?如果是這樣,磁碟訪問又會如何影響訊息傳遞系統的效能[^5],磁碟寫滿又會發生什麼[^6]?
2. **如果節點崩潰或暫時離線,會發生什麼情況? —— 是否會有訊息丟失?** 與資料庫一樣,永續性可能需要寫入磁碟和 / 或複製的某種組合(請參閱 “[複製與永續性](/tw/ch8#sidebar_transactions_durability)”),這是有代價的。如果你能接受有時訊息會丟失,則可能在同一硬體上獲得更高的吞吐量和更低的延遲。
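Whether message loss is acceptable depends on the application. For example, with sensor readings and metrics that are transmitted periodically, an occasional missing data point is perhaps unimportant, since an updated value will be sent a short time later anyway. However, beware that if a large number of messages are dropped, it may not be immediately apparent that the metrics are incorrect[^7]. If you are counting events, reliable delivery matters more, since every lost message makes the count wrong.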
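The three overload options from the first question can be sketched with Python's standard `queue` module. This is only an illustration under assumptions: the `send` helper and the policy names are hypothetical, and a real messaging system would apply these policies across a network rather than inside one process.

```python
import queue


def send(buffer: queue.Queue, message, policy: str) -> bool:
    """Hand one message to a buffer under the given overload policy.

    Returns True if the message was enqueued, False if it was dropped.
    """
    if policy == "drop":
        try:
            buffer.put_nowait(message)  # never blocks the producer
            return True
        except queue.Full:
            return False                # the message is lost
    if policy == "backpressure":
        buffer.put(message)             # blocks the producer until there is room
        return True
    raise ValueError(f"unknown policy: {policy}")


# A small fixed-size buffer, similar in spirit to a Unix pipe's buffer.
buf = queue.Queue(maxsize=2)
results = [send(buf, i, "drop") for i in range(4)]
# The first two messages fit; the last two are dropped.

# Buffering without a limit corresponds to queue.Queue() with no maxsize:
# put() then never blocks, but memory use grows with the backlog.
```

With the `"backpressure"` policy the same call would simply block once the buffer is full, until a consumer calls `buf.get()`.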
我們在 [第十一章](/tw/ch11) 中探討的批處理系統的一個很好的特性是,它們提供了強大的可靠性保證:失敗的任務會自動重試,失敗任務的部分輸出會自動丟棄。這意味著輸出與沒有發生故障一樣,這有助於簡化程式設計模型。在本章的後面,我們將研究如何在流處理的上下文中提供類似的保證。
#### 直接從生產者傳遞給消費者 {#id296}
許多訊息傳遞系統使用生產者和消費者之間的直接網路通訊,而不透過中間節點:
* UDP 組播廣泛應用於金融行業,例如股票市場,其中低時延非常重要[^8]。雖然 UDP 本身是不可靠的,但應用層的協議可以恢復丟失的資料包(生產者必須記住它傳送的資料包,以便能按需重新發送資料包)。
* 無代理的訊息庫,如 ZeroMQ 和 nanomsg 採取類似的方法,透過 TCP 或 IP 多播實現釋出 / 訂閱訊息傳遞。
* 一些指標採集代理(例如 StatsD [^9])使用不可靠的 UDP 訊息傳遞來收集網路中所有機器的指標並進行監控。(在 StatsD 協議中,計數器指標只有在所有訊息都被接收時才是準確的;使用 UDP 使得這些指標至多是近似值[^10]。另請參閱 “[TCP 與 UDP](/tw/ch9#sidebar_distributed_tcp_udp)”)。
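* If the consumer exposes a service on the network, producers can make a direct HTTP or RPC request (see "[Dataflow Through Services: REST and RPC](/tw/ch5#sec_encoding_dataflow_rpc)") to push messages to the consumer. This is the idea behind webhooks[^11]: a callback URL of one service is registered with another service, which makes a request to that URL whenever an event occurs.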
儘管這些直接訊息傳遞系統在設計它們的環境中執行良好,但是它們通常要求應用程式碼意識到訊息丟失的可能性。它們的容錯程度極為有限:即使協議檢測到並重傳在網路中丟失的資料包,它們通常也只是假設生產者和消費者始終線上。
如果消費者處於離線狀態,則可能會丟失其不可達時傳送的訊息。一些協議允許生產者重試失敗的訊息傳遞,但當生產者崩潰時,它可能會丟失訊息緩衝區及其本應傳送的訊息,這種方法可能就沒用了。
#### 訊息代理 {#id433}
一種廣泛使用的替代方法是透過 **訊息代理**(message broker,也稱為 **訊息佇列**,即 message queue)傳送訊息,訊息代理實質上是一種針對處理訊息流而最佳化的資料庫[^12]。它作為伺服器執行,生產者和消費者作為客戶端連線到伺服器。生產者將訊息寫入代理,消費者透過從代理那裡讀取來接收訊息。
透過將資料集中在代理上,這些系統可以更容易地容忍來來去去的客戶端(連線,斷開連線和崩潰),而永續性問題則轉移到代理的身上。一些訊息代理只將訊息儲存在記憶體中,而另一些訊息代理(取決於配置)將其寫入磁碟,以便在代理崩潰的情況下不會丟失。針對緩慢的消費者,它們通常會允許無上限的排隊(而不是丟棄訊息或背壓),儘管這種選擇也可能取決於配置。
排隊的結果是,消費者通常是 **非同步(asynchronous)** 的:當生產者傳送訊息時,通常只會等待代理確認訊息已經被快取,而不等待訊息被消費者處理。向消費者遞送訊息將發生在未來某個未定的時間點 —— 通常在幾分之一秒之內,但有時當訊息堆積時會顯著延遲。
#### 訊息代理與資料庫的對比 {#id297}
有些訊息代理甚至可以使用 XA 或 JTA 參與兩階段提交協議(請參閱 “[實踐中的分散式事務](/tw/ch8#sec_transactions_xa)”)。這個功能與資料庫在本質上非常相似,儘管訊息代理和資料庫之間仍存在實踐上很重要的差異:
* 資料庫通常保留資料直至顯式刪除,而大多數訊息代理在訊息成功遞送給消費者時會自動刪除訊息。這樣的訊息代理不適合長期的資料儲存。
* 由於它們很快就能刪除訊息,大多數訊息代理都認為它們的工作集相當小 —— 即佇列很短。如果代理需要緩衝很多訊息,比如因為消費者速度較慢(如果記憶體裝不下訊息,可能會溢位到磁碟),每個訊息需要更長的處理時間,整體吞吐量可能會惡化[^5]。
* 資料庫通常支援次級索引和各種搜尋資料的方式,而訊息代理通常支援按照某種模式匹配主題,訂閱其子集。雖然機制並不一樣,但對於客戶端選擇想要了解的資料的一部分,都是基本的方式。
* 查詢資料庫時,結果通常基於某個時間點的資料快照;如果另一個客戶端隨後向資料庫寫入一些改變了查詢結果的內容,則第一個客戶端不會發現其先前結果現已過期(除非它重複查詢或輪詢變更)。相比之下,訊息代理不支援任意查詢,但是當資料發生變化時(即新訊息可用時),它們會通知客戶端。
這是關於訊息代理的傳統觀點,它被封裝在諸如 JMS [^13] 和 AMQP [^14] 的標準中,並且被諸如 RabbitMQ、ActiveMQ、HornetQ、Qpid、TIBCO 企業訊息服務、IBM MQ、Azure Service Bus 和 Google Cloud Pub/Sub 所實現[^15]。儘管可以把資料庫當作佇列來用,但要調優到理想效能並不容易[^16]。
#### 多個消費者 {#id298}
當多個消費者從同一主題中讀取訊息時,有兩種主要的訊息傳遞模式,如 [圖 12-1](#fig_stream_broker_patterns) 所示:
負載均衡(load balancing)
: 每條訊息都被傳遞給消費者 **之一**,所以處理該主題下訊息的工作能被多個消費者共享。代理可以為消費者任意分配訊息。當處理訊息的代價高昂,希望能並行處理訊息時,此模式非常有用(在 AMQP 中,可以透過讓多個客戶端從同一個佇列中消費來實現負載均衡,而在 JMS 中則稱之為 **共享訂閱**,即 shared subscription)。
扇出(fan-out)
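: Each message is delivered to **all** of the consumers. Fan-out allows several independent consumers to each "tune in" to the same broadcast of messages without affecting each other; it is the streaming equivalent of having several different batch jobs that read the same input file. (This feature is provided by topic subscriptions in JMS and exchange bindings in AMQP.)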
{{< figure src="/fig/ddia_1201.png" id="fig_stream_broker_patterns" caption="圖 12-1. (a)負載均衡:在消費者間共享消費主題;(b)扇出:將每條訊息傳遞給多個消費者。" class="w-full my-4" >}}
兩種模式可以組合使用:例如,兩個獨立的消費者組可以每組各訂閱同一個主題,每一組都共同收到所有訊息,但在每一組內部,每條訊息僅由單個節點處理。
#### 確認與重新傳遞 {#sec_stream_reordering}
消費者隨時可能會崩潰,所以有一種可能的情況是:代理向消費者遞送訊息,但消費者沒有處理,或者在消費者崩潰之前只進行了部分處理。為了確保訊息不會丟失,訊息代理使用 **確認(acknowledgments)**:客戶端必須顯式告知代理訊息處理完畢的時間,以便代理能將訊息從佇列中移除。
如果與客戶端的連線關閉,或者代理超出一段時間未收到確認,代理則認為訊息沒有被處理,因此它將訊息再遞送給另一個消費者。(請注意可能發生這樣的情況,訊息 **實際上是** 處理完畢的,但 **確認** 在網路中丟失了。需要一種原子提交協議才能處理這種情況,正如在 “[實踐中的分散式事務](/tw/ch8#sec_transactions_xa)” 中所討論的那樣)
當與負載均衡相結合時,這種重傳行為對訊息的順序有種有趣的影響。在 [圖 12-2](#fig_stream_redelivery_reordering) 中,消費者通常按照生產者傳送的順序處理訊息。然而消費者 2 在處理訊息 m3 時崩潰,與此同時消費者 1 正在處理訊息 m4。未確認的訊息 m3 隨後被重新發送給消費者 1,結果消費者 1 按照 m4,m3,m5 的順序處理訊息。因此 m3 和 m4 的交付順序與生產者 1 的傳送順序不同。
{{< figure src="/fig/ddia_1202.png" id="fig_stream_redelivery_reordering" caption="圖 12-2. 在處理 m3 時消費者 2 崩潰,因此稍後重傳至消費者 1。" class="w-full my-4" >}}
即使訊息代理試圖保留訊息的順序(如 JMS 和 AMQP 標準所要求的),負載均衡與重傳的組合也不可避免地導致訊息被重新排序。為避免此問題,你可以讓每個消費者使用單獨的佇列(即不使用負載均衡功能)。如果訊息是完全獨立的,則訊息順序重排並不是一個問題。但正如我們將在本章後續部分所述,如果訊息之間存在因果依賴關係,這就是一個很重要的問題。
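Redelivery can also waste resources, starve other work, or block a stream permanently. A common scenario is a producer that serializes a message incorrectly, for example a JSON object that lacks a required key. Every consumer that reads the message fails because of the missing key and never sends an acknowledgment, so the broker keeps redelivering it and other consumers keep failing too. If the broker enforces strict ordering, all later messages may be stuck behind the bad one; even if reordering is allowed, resources are continually wasted on a message that can never be acknowledged.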
這類問題通常透過 **死信佇列(dead letter queue, DLQ)** 處理:不再無限重試,而是把問題訊息移到另一條佇列中,從而解堵主消費鏈路[^17] [^18]。運維通常會對死信佇列設定告警 —— 只要有訊息進入,就代表出現了錯誤。收到告警後,操作員可以決定永久丟棄該訊息、人工修復後重新投遞,或修復消費者程式碼以正確處理該訊息。除了傳統佇列系統,基於日誌的訊息系統和流處理系統也開始支援 DLQ[^19]。
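A dead letter queue can be sketched as a retry loop with an upper bound. In the sketch below the retry limit, the `handle` function, and the `amount` key are all hypothetical; real brokers each configure DLQs in their own way.

```python
import json

MAX_ATTEMPTS = 3  # hypothetical retry limit


def handle(raw: str) -> int:
    """Process one message; raises KeyError if the required key is missing."""
    event = json.loads(raw)
    return event["amount"]


def consume(messages, max_attempts=MAX_ATTEMPTS):
    """Process messages in order, diverting poison messages to a dead letter list."""
    processed, dead_letters = [], []
    for raw in messages:
        for attempt in range(1, max_attempts + 1):
            try:
                processed.append(handle(raw))
                break  # success: the message would be acknowledged here
            except (KeyError, json.JSONDecodeError) as exc:
                if attempt == max_attempts:
                    # Give up and move the message aside instead of retrying forever.
                    dead_letters.append((raw, repr(exc)))
    return processed, dead_letters


processed, dead = consume(['{"amount": 5}', '{"amt": 7}', '{"amount": 9}'])
# processed == [5, 9]; the malformed middle message ends up in `dead`.
```

The message after the malformed one is still processed, which is the point: the bad message no longer blocks the main consumption path.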
### 基於日誌的訊息代理 {#sec_stream_log}
透過網路傳送資料包或向網路服務傳送請求通常是短暫的操作,不會留下永久的痕跡。儘管可以永久記錄(透過抓包與日誌),但我們通常不這麼做。即使是將訊息持久地寫入磁碟的訊息代理,在送達給消費者之後也會很快刪除訊息,因為它們建立在短暫訊息傳遞的思維方式上。
資料庫和檔案系統採用截然相反的方法論:至少在某人顯式刪除前,通常寫入資料庫或檔案的所有內容都要被永久記錄下來。
這種思維方式上的差異對建立派生資料的方式有巨大影響。如 [第十一章](/tw/ch11) 所述,批處理過程的一個關鍵特性是,你可以反覆執行它們,試驗處理步驟,不用擔心損壞輸入(因為輸入是隻讀的)。而 AMQP/JMS 風格的訊息傳遞並非如此:收到訊息是具有破壞性的,因為確認可能導致訊息從代理中被刪除,因此你不能期望再次運行同一個消費者能得到相同的結果。
如果你將新的消費者新增到訊息傳遞系統,通常只能接收到消費者註冊之後開始傳送的訊息。先前的任何訊息都隨風而逝,一去不復返。作為對比,你可以隨時為檔案和資料庫新增新的客戶端,且能讀取任意久遠的資料(只要應用沒有顯式覆蓋或刪除這些資料)。
為什麼我們不能把它倆雜交一下,既有資料庫的持久儲存方式,又有訊息傳遞的低延遲通知?這就是 **基於日誌的訊息代理(log-based message brokers)** 背後的想法。
#### 使用日誌進行訊息儲存 {#id300}
日誌只是磁碟上簡單的僅追加記錄序列。我們先前在 [第四章](/tw/ch4) 中日誌結構儲存引擎和預寫式日誌的上下文中討論了日誌,在 [第六章](/tw/ch6) 複製的上下文裡也討論了它。
同樣的結構可以用於實現訊息代理:生產者透過將訊息追加到日誌末尾來發送訊息,而消費者透過依次讀取日誌來接收訊息。如果消費者讀到日誌末尾,則會等待新訊息追加的通知。Unix 工具 `tail -f` 能監視檔案被追加寫入的資料,基本上就是這樣工作的。
為了伸縮超出單個磁碟所能提供的更高吞吐量,可以對日誌進行 **分割槽**(按 [第七章](/tw/ch7) 的定義)。不同的分割槽可以託管在不同的機器上,使得每個分割槽都有一份能獨立於其他分割槽進行讀寫的日誌。一個主題可以定義為一組攜帶相同型別訊息的分割槽。這種方法如 [圖 12-3](#fig_stream_log_partitions) 所示。
在每個分割槽內,代理為每個訊息分配一個單調遞增的序列號或 **偏移量**(offset,在 [圖 12-3](#fig_stream_log_partitions) 中,框中的數字是訊息偏移量)。這種序列號是有意義的,因為分割槽是僅追加寫入的,所以分割槽內的訊息是完全有序的。沒有跨不同分割槽的順序保證。
{{< figure src="/fig/ddia_1203.png" id="fig_stream_log_partitions" caption="圖 12-3. 生產者透過將訊息追加寫入主題分割槽檔案來發送訊息,消費者依次讀取這些檔案。" class="w-full my-4" >}}
Apache Kafka [^20] 和 Amazon Kinesis Streams 都是按這種方式工作的基於日誌的訊息代理。Google Cloud Pub/Sub 在架構上類似,但對外暴露的是 JMS 風格的 API,而不是日誌抽象[^15]。儘管這些訊息代理將所有訊息寫入磁碟,但透過跨多臺機器分割槽,依然能夠達到每秒數百萬條訊息的吞吐量,並透過複製訊息實現容錯[^21] [^22]。
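The partitioned log described above can be modelled in memory in a few lines. This is a toy sketch, not how Kafka or Kinesis is implemented: the class name is made up, and choosing the partition by a CRC32 hash of the key is an assumption made for illustration.

```python
import zlib


class PartitionedLog:
    """A toy topic: a fixed number of append-only partitions."""

    def __init__(self, num_partitions: int):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key: str, value) -> tuple[int, int]:
        """Append a message and return (partition, offset)."""
        # Assumption: messages with the same key go to the same partition.
        p = zlib.crc32(key.encode()) % len(self.partitions)
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1  # offsets increase by one per append

    def read(self, partition: int, offset: int):
        """Return all messages in a partition from the given offset onward."""
        return self.partitions[partition][offset:]


log = PartitionedLog(num_partitions=3)
p0, o0 = log.append("user-1", "page_view")
p1, o1 = log.append("user-1", "purchase")
# Both messages share a partition, with offsets 0 and 1, so their order is preserved.
```

Messages in different partitions get independent offsets, which is why no ordering is guaranteed across partitions.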
#### 日誌與傳統的訊息傳遞相比 {#sec_stream_logs_vs_messaging}
基於日誌的方法天然支援扇出式訊息傳遞,因為多個消費者可以獨立讀取日誌,而不會相互影響 —— 讀取訊息不會將其從日誌中刪除。為了在一組消費者之間實現負載平衡,代理可以將整個分割槽分配給消費者組中的節點,而不是將單條訊息分配給消費者客戶端。
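Each client then consumes **all** the messages in the partitions it has been assigned. Typically, when a consumer has been assigned a log partition, it reads the messages in the partition sequentially, in a straightforward single-threaded manner. This coarse-grained load balancing approach has some downsides: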
* 共享消費主題工作的節點數,最多為該主題中的日誌分割槽數,因為同一個分割槽內的所有訊息被遞送到同一個節點。
* 如果某條訊息處理緩慢,則它會阻塞該分割槽中後續訊息的處理(一種頭部阻塞的形式;請參閱 “[描述效能](/tw/ch2#sec_introduction_percentiles)”)。
因此在訊息處理代價高昂,希望逐條並行處理,以及訊息的順序並沒有那麼重要的情況下,JMS/AMQP 風格的訊息代理是可取的。另一方面,在訊息吞吐量很高,處理迅速,順序很重要的情況下,基於日誌的方法表現得非常好[^23] [^24]。不過,基於日誌與傳統訊息系統的邊界並不絕對:例如,一個主題分割槽通常一次只分配給一個消費者[^25] [^26]。
#### 消費者偏移量 {#sec_stream_log_offsets}
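Consuming a partition sequentially makes it easy to tell which messages have been processed: all messages with an offset less than a consumer's current offset have already been processed, and all messages with a greater offset have not yet been seen. Thus, the broker does not need to track acknowledgments for every single message; it only needs to periodically record the consumer offsets. The reduced bookkeeping overhead, together with the opportunities for batching and pipelining that this approach offers, helps increase the throughput of log-based systems.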
實際上,這種偏移量與單領導者資料庫複製中常見的日誌序列號非常相似,我們在 “[設定新從庫](/tw/ch6#sec_replication_new_replica)” 中討論了這種情況。在資料庫複製中,日誌序列號允許跟隨者斷開連線後,重新連線到領導者,並在不跳過任何寫入的情況下恢復複製。這裡原理完全相同:訊息代理表現得像一個主庫,而消費者就像一個從庫。
如果消費者節點失效,則失效消費者的分割槽將指派給其他節點,並從最後記錄的偏移量開始消費訊息。如果消費者已經處理了後續的訊息,但還沒有記錄它們的偏移量,那麼重啟後這些訊息將被處理兩次。我們將在本章後面討論這個問題的處理方法。
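The duplicate processing described above can be reproduced with a small simulation. The function and its parameters are hypothetical; offsets are committed only every `commit_every` messages, and a crash loses whatever progress was not yet committed.

```python
def run_consumer(log, committed_offset, commit_every, crash_after=None):
    """Process log entries starting at committed_offset.

    Returns (messages processed in this run, committed offset afterwards).
    """
    processed = []
    for offset in range(committed_offset, len(log)):
        if crash_after is not None and len(processed) == crash_after:
            # Crash: progress since the last commit is forgotten.
            return processed, committed_offset
        processed.append(log[offset])
        if (offset + 1) % commit_every == 0:
            committed_offset = offset + 1  # the next offset to read
    # Assumption: a clean run commits its final position.
    return processed, len(log)


log = ["m0", "m1", "m2", "m3", "m4", "m5"]
first_run, committed = run_consumer(log, 0, commit_every=4, crash_after=5)
second_run, committed_after = run_consumer(log, committed, commit_every=4)
# first_run handled m0..m4 but only offset 4 was committed,
# so the restarted consumer processes m4 a second time.
```

Committing more often shrinks the window of reprocessed messages at the cost of more bookkeeping; it does not remove the duplicates entirely.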
#### 磁碟空間使用 {#sec_stream_disk_usage}
如果只追加寫入日誌,則磁碟空間終究會耗盡。為了回收磁碟空間,日誌實際上被分割成段,並不時地將舊段刪除或移動到歸檔儲存。(我們將在後面討論一種更為複雜的磁碟空間釋放方式)
這就意味著如果一個慢消費者跟不上訊息產生的速率而落後得太多,它的消費偏移量指向了刪除的段,那麼它就會錯過一些訊息。實際上,日誌實現了一個有限大小的緩衝區,當緩衝區填滿時會丟棄舊訊息,它也被稱為 **迴圈緩衝區(circular buffer)** 或 **環形緩衝區(ring buffer)**。不過由於緩衝區在磁碟上,因此緩衝區可能相當的大。
讓我們做個粗略估算。在撰寫本文時,典型的大容量硬碟約為 20 TB,順序寫入吞吐量約為 250 MB/s。如果持續以最高速率寫入訊息,磁碟大約 22 小時就會寫滿並開始刪除最舊訊息。這意味著,即使在滿速寫入下,磁碟日誌也至少可以緩衝約 22 小時的資料。實踐中部署很少持續打滿磁碟頻寬,因此通常可以保留數天甚至數週的訊息緩衝區。
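The back-of-the-envelope figure above can be checked directly. The sketch assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes), as disk vendors typically quote them.

```python
disk_capacity_bytes = 20 * 10**12        # 20 TB
write_throughput_bytes = 250 * 10**6     # 250 MB/s, sustained sequential writes

seconds_to_fill = disk_capacity_bytes / write_throughput_bytes
hours_to_fill = seconds_to_fill / 3600

print(f"{seconds_to_fill:.0f} s = {hours_to_fill:.1f} h")  # 80000 s = 22.2 h
```

At a tenth of that write rate, the same disk would hold a little over nine days of messages, which is why retention of days or weeks is common in practice.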
許多基於日誌的訊息代理現在也將訊息分層儲存到物件儲存中,以進一步提升容量,方式與我們在第六章中討論“物件儲存支撐資料庫”時類似。像 Apache Kafka 和 Redpanda 可以把較舊訊息放在物件儲存中按需讀取;還有一些系統直接將全部訊息儲存在物件儲存中。除了成本優勢外,這種架構也有資料整合優勢:如果物件儲存中的訊息以 Iceberg 表形式組織,批處理和資料倉庫作業可以直接在這些資料上執行,而無需再複製一份資料。
#### 當消費者跟不上生產者時 {#id459}
在 “[訊息傳遞系統](#sec_stream_messaging)” 中,如果消費者無法跟上生產者傳送資訊的速度時,我們討論了三種選擇:丟棄資訊,進行緩衝或施加背壓。在這種分類法裡,基於日誌的方法是緩衝的一種形式,具有很大但大小固定的緩衝區(受可用磁碟空間的限制)。
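If a consumer falls so far behind that the messages it requires are older than what is retained on disk, it will not be able to read those messages; the broker effectively drops old messages that go back further than the buffer can accommodate. You can monitor how far a consumer is behind the head of the log, and raise an alert if it falls behind significantly. As the buffer is large, there is enough time for a human operator to fix the slow consumer and allow it to catch up before it starts missing messages.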
即使消費者真的落後太多開始丟失訊息,也只有那個消費者受到影響;它不會中斷其他消費者的服務。這是一個巨大的運維優勢:你可以實驗性地消費生產日誌,以進行開發,測試或除錯,而不必擔心會中斷生產服務。當消費者關閉或崩潰時,會停止消耗資源,唯一剩下的只有消費者偏移量。
這種行為也與傳統的訊息代理形成了鮮明對比,在那種情況下,你需要小心地刪除那些消費者已經關閉的佇列 —— 否則那些佇列就會累積不必要的訊息,從其他仍活躍的消費者那裡佔走記憶體。
#### 重播舊訊息 {#sec_stream_replay}
我們之前提到,使用 AMQP 和 JMS 風格的訊息代理,處理和確認訊息是一個破壞性的操作,因為它會導致訊息在代理上被刪除。另一方面,在基於日誌的訊息代理中,使用訊息更像是從檔案中讀取資料:這是隻讀操作,不會更改日誌。
除了消費者的任何輸出之外,處理的唯一副作用是消費者偏移量的前進。但偏移量是在消費者的控制之下的,所以如果需要的話可以很容易地操縱:例如你可以用昨天的偏移量跑一個消費者副本,並將輸出寫到不同的位置,以便重新處理最近一天的訊息。你可以使用各種不同的處理程式碼重複任意次。
這一方面使得基於日誌的訊息傳遞更像上一章的批處理,其中派生資料透過可重複的轉換過程與輸入資料顯式分離。它允許進行更多的實驗,更容易從錯誤和漏洞中恢復,使其成為在組織內整合資料流的良好工具[^27]。
## 資料庫與流 {#sec_stream_databases}
我們已經在訊息代理和資料庫之間進行了一些比較。儘管傳統上它們被視為單獨的工具類別,但是我們看到基於日誌的訊息代理已經成功地從資料庫中獲取靈感並將其應用於訊息傳遞。我們也可以反過來:從訊息傳遞和流中獲取靈感,並將它們應用於資料庫。
我們之前曾經說過,事件是某個時刻發生的事情的記錄。發生的事情可能是使用者操作(例如鍵入搜尋查詢)或讀取感測器,但也可能是 **寫入資料庫**。某些東西被寫入資料庫的事實是可以被捕獲、儲存和處理的事件。這一觀察結果表明,資料庫和資料流之間的聯絡不僅僅是磁碟日誌的物理儲存 —— 而是更深層的聯絡。
事實上,複製日誌(請參閱 “[複製日誌的實現](/tw/ch6#sec_replication_implementation)”)是一個由資料庫寫入事件組成的流,由主庫在處理事務時生成。從庫將寫入流應用到它們自己的資料庫副本,從而最終得到相同資料的精確副本。複製日誌中的事件描述發生的資料更改。
我們還在 “[使用共享日誌](/tw/ch10#sec_consistency_smr)” 中遇到了狀態機複製原理,其中指出:如果每個事件代表對資料庫的寫入,並且每個副本按相同的順序處理相同的事件,則副本將達到相同的最終狀態(假設事件處理是一個確定性的操作)。這是事件流的又一種場景!
在本節中,我們將首先看看異構資料系統中出現的一個問題,然後探討如何透過將事件流的想法帶入資料庫來解決這個問題。
### 保持系統同步 {#sec_stream_sync}
正如我們在本書中所看到的,沒有一個系統能夠滿足所有的資料儲存、查詢和處理需求。在實踐中,大多數重要應用都需要組合使用幾種不同的技術來滿足所有的需求:例如,使用 OLTP 資料庫來為使用者請求提供服務,使用快取來加速常見請求,使用全文索引來處理搜尋查詢,使用資料倉庫用於分析。每一種技術都有自己的資料副本,並根據自己的目的進行儲存方式的最佳化。
由於相同或相關的資料出現在了不同的地方,因此相互間需要保持同步:如果某個專案在資料庫中被更新,它也應當在快取、搜尋索引和資料倉庫中被更新。對於資料倉庫,這種同步通常由 ETL 程序執行(請參閱 “[資料倉庫](/tw/ch1#sec_introduction_dwh)”),通常是先取得資料庫的完整副本,然後執行轉換,並批次載入到資料倉庫中 —— 換句話說,批處理。我們在 “[批處理工作流的輸出](/tw/ch11#sec_batch_output)” 中同樣看到了如何使用批處理建立搜尋索引、推薦系統和其他派生資料系統。
如果週期性的完整資料庫轉儲過於緩慢,有時會使用的替代方法是 **雙寫(dual write)**,其中應用程式碼在資料變更時明確寫入每個系統:例如,首先寫入資料庫,然後更新搜尋索引,然後使快取項失效(甚至同時執行這些寫入)。
但是,雙寫有一些嚴重的問題,其中一個是競爭條件,如 [圖 12-4](#fig_stream_dual_write_race) 所示。在這個例子中,兩個客戶端同時想要更新一個專案 X:客戶端 1 想要將值設定為 A,客戶端 2 想要將其設定為 B。兩個客戶端首先將新值寫入資料庫,然後將其寫入到搜尋索引。因為運氣不好,這些請求的時序是交錯的:資料庫首先看到來自客戶端 1 的寫入將值設定為 A,然後來自客戶端 2 的寫入將值設定為 B,因此資料庫中的最終值為 B。搜尋索引首先看到來自客戶端 2 的寫入,然後是客戶端 1 的寫入,所以搜尋索引中的最終值是 A。即使沒發生錯誤,這兩個系統現在也永久地不一致了。
{{< figure src="/fig/ddia_1204.png" id="fig_stream_dual_write_race" caption="圖 12-4. 在資料庫中 X 首先被設定為 A,然後被設定為 B,而在搜尋索引處,寫入以相反的順序到達。" class="w-full my-4" >}}
除非有一些額外的併發檢測機制,例如我們在 “[檢測併發寫入](/tw/ch6#sec_replication_concurrent)” 中討論的版本向量,否則你甚至不會意識到發生了併發寫入 —— 一個值將簡單地以無提示方式覆蓋另一個值。
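The interleaving in Figure 12-4 can be replayed step by step. The two dictionaries below merely stand in for the database and the search index; nothing fails, yet the stores end up disagreeing.

```python
database, search_index = {}, {}

# Each client writes to the database first and to the search index second,
# but the requests of the two clients interleave unluckily.
interleaving = [
    ("client 1", database, "A"),
    ("client 2", database, "B"),
    ("client 2", search_index, "B"),
    ("client 1", search_index, "A"),
]

for _client, store, value in interleaving:
    store["X"] = value  # last write wins, silently overwriting the other value

print(database["X"], search_index["X"])  # B A
```

Neither store has any record that a concurrent write was overwritten, which is why the inconsistency goes unnoticed without extra conflict detection.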
雙重寫入的另一個問題是,其中一個寫入可能會失敗,而另一個成功。這是一個容錯問題,而不是一個併發問題,但也會造成兩個系統互相不一致的結果。確保它們要麼都成功要麼都失敗,是原子提交問題的一個例子,解決這個問題的代價是昂貴的(請參閱 “[原子提交與兩階段提交](/tw/ch8#sec_transactions_2pc)”)。
如果你只有一個單領導者複製的資料庫,那麼這個領導者決定了寫入順序,而狀態機複製方法可以在資料庫副本上工作。然而,在 [圖 12-4](#fig_stream_dual_write_race) 中,沒有單個主庫:資料庫可能有一個領導者,搜尋索引也可能有一個領導者,但是兩者都不追隨對方,所以可能會發生衝突(請參閱 “[多主複製](/tw/ch6#sec_replication_multi_leader)”)。
如果實際上只有一個領導者 —— 例如,資料庫 —— 而且我們能讓搜尋索引成為資料庫的追隨者,情況要好得多。但這在實踐中可能嗎?
### 資料變更捕獲 {#sec_stream_cdc}
大多數資料庫的複製日誌的問題在於,它們一直被當做資料庫的內部實現細節,而不是公開的 API。客戶端應該透過其資料模型和查詢語言來查詢資料庫,而不是解析複製日誌並嘗試從中提取資料。
數十年來,許多資料庫根本沒有記錄在檔的獲取變更日誌的方式。由於這個原因,捕獲資料庫中所有的變更,然後將其複製到其他儲存技術(搜尋索引、快取或資料倉庫)中是相當困難的。
最近,人們對 **資料變更捕獲(change data capture, CDC)** 越來越感興趣,這是一種觀察寫入資料庫的所有資料變更,並將其提取並轉換為可以複製到其他系統中的形式的過程。CDC 是非常有意思的,尤其是當變更能在被寫入後立刻用於流時[^28]。
例如,你可以捕獲資料庫中的變更,並不斷將相同的變更應用至搜尋索引。如果變更日誌以相同的順序應用,則可以預期搜尋索引中的資料與資料庫中的資料是匹配的。搜尋索引和任何其他派生資料系統只是變更流的消費者,如 [圖 12-5](#fig_stream_cdc_flow) 所示。
{{< figure src="/fig/ddia_1205.png" id="fig_stream_cdc_flow" caption="圖 12-5. 將資料按順序寫入一個數據庫,然後按照相同的順序將這些更改應用到其他系統。" class="w-full my-4" >}}
#### 資料變更捕獲的實現 {#id307}
我們可以將日誌消費者叫做 **派生資料系統**,正如在 [第一章](/tw/ch1#sec_introduction_derived) 討論“記錄系統與派生資料”時所述:儲存在搜尋索引和資料倉庫中的資料,只是 **記錄系統** 資料的額外檢視。資料變更捕獲是一種機制,可確保對記錄系統所做的所有更改都反映在派生資料系統中,以便派生系統具有資料的準確副本。
從本質上說,資料變更捕獲使得一個數據庫成為領導者(被捕獲變化的資料庫),並將其他元件變為追隨者。基於日誌的訊息代理非常適合從源資料庫傳輸變更事件,因為它保留了訊息的順序(避免了 [圖 12-2](#fig_stream_redelivery_reordering) 的重新排序問題)。
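Database triggers can be used to implement change data capture (see "[Trigger-based replication](/tw/ch6#sec_replication_logical)") by registering triggers that observe all changes to data tables and add corresponding entries to a changelog table. However, they tend to be fragile and have significant performance overheads.
Parsing the replication log is generally a more robust approach (see "[Logical (row-based) log replication](/tw/ch6#sec_replication_logical)"), although it comes with challenges of its own, such as handling schema changes and correctly modeling updates. The open source Debezium project addresses these challenges, offering source connectors for MySQL, PostgreSQL, Oracle, SQL Server, Db2, Cassandra, and other databases. Kafka Connect also offers CDC connectors for a variety of databases; Maxwell provides similar capabilities for MySQL by parsing the binlog[^29], GoldenGate does so for Oracle, and pgcapture for PostgreSQL.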
類似於訊息代理,資料變更捕獲通常是非同步的:記錄資料庫系統在提交變更之前不會等待消費者應用變更。這種設計具有的運維優勢是,新增緩慢的消費者不會過度影響記錄系統。不過,所有複製延遲可能有的問題在這裡都可能出現(請參閱 “[複製延遲問題](/tw/ch6#sec_replication_lag)”)。
#### 初始快照 {#sec_stream_cdc_snapshot}
如果你擁有 **所有** 對資料庫進行變更的日誌,則可以透過重播該日誌,來重建資料庫的完整狀態。但是在許多情況下,永遠保留所有更改會耗費太多磁碟空間,且重播過於費時,因此日誌需要被截斷。
例如,構建新的全文索引需要整個資料庫的完整副本 —— 僅僅應用最近變更的日誌是不夠的,因為這樣會丟失最近未曾更新的專案。因此,如果你沒有完整的歷史日誌,則需要從一個一致的快照開始,如先前的 “[設定新從庫](/tw/ch6#sec_replication_new_replica)” 中所述。
資料庫的快照必須與變更日誌中的已知位置或偏移量相對應,以便在處理完快照後知道從哪裡開始應用變更。一些 CDC 工具集成了這種快照功能,而其他工具則把它留給你手動執行。Debezium 使用 Netflix 的 DBLog 水位線演算法提供增量快照能力[^30] [^31]。
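The relationship between a snapshot and its changelog position can be sketched as follows. The `bootstrap` function and its data are hypothetical; `snapshot_offset` is taken to mean the offset of the first change that is not yet reflected in the snapshot, and a value of `None` stands for a deletion.

```python
def bootstrap(snapshot: dict, snapshot_offset: int, changelog: list) -> dict:
    """Build a derived store from a snapshot, then apply later changes in log order."""
    derived = dict(snapshot)
    for offset, (key, value) in enumerate(changelog):
        if offset < snapshot_offset:
            continue                    # already reflected in the snapshot
        if value is None:
            derived.pop(key, None)      # deletion
        else:
            derived[key] = value
    return derived


changelog = [("a", 1), ("b", 2), ("a", 3), ("b", None)]
snapshot = {"a": 1, "b": 2}             # consistent with changelog offsets 0 and 1
index = bootstrap(snapshot, snapshot_offset=2, changelog=changelog)
# index == {"a": 3}
```

If the snapshot's position were unknown, the consumer could not tell which changes to skip, and would risk either missing changes or reapplying old ones over newer values.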
#### 日誌壓縮 {#sec_stream_log_compaction}
如果你只能保留有限的歷史日誌,則每次要新增新的派生資料系統時,都需要做一次快照。但 **日誌壓縮(log compaction)** 提供了一個很好的備選方案。
我們之前在 “[日誌結構儲存](/tw/ch4#sec_storage_log_structured)” 的上下文中討論過日誌壓縮(可參閱 [圖 4-3](/tw/ch4#fig_storage_sstable_merging) 的示例)。原理很簡單:儲存引擎定期在日誌中查詢具有相同鍵的記錄,丟掉所有重複的內容,並只保留每個鍵的最新更新。這個壓縮與合併過程在後臺執行,如 [圖 12-6](#fig_stream_compaction) 所示。
{{< figure src="/fig/ddia_1206.png" id="fig_stream_compaction" caption="圖 12-6. 一個鍵值對日誌,其中鍵是貓影片的 ID(mew、purr、scratch、yawn),值是播放次數。日誌壓縮只保留每個鍵的最新值。" class="w-full my-4" >}}
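In a log-structured storage engine, an update with a special null value (a **tombstone**) indicates that a key was deleted, and causes it to be removed during log compaction. But as long as a key is not overwritten or deleted, it stays in the log forever. The disk space required for such a compacted log depends only on the current contents of the database, not on the number of writes that have ever occurred. If the same key is frequently overwritten, previous values will eventually be garbage-collected, and only the latest value will be retained.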
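A compaction pass can be sketched as a single scan that remembers the last record per key. The keys below follow Figure 12-6, but the play counts are made up for illustration, and `None` is used as the tombstone. Real systems compact segment by segment in the background rather than in one pass over the whole log.

```python
def compact(log):
    """Keep only the most recent record per key; drop keys whose latest value is a tombstone."""
    latest = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)   # later records replace earlier ones
    survivors = sorted(
        (offset, key, value)
        for key, (offset, value) in latest.items()
        if value is not None            # tombstoned keys are removed
    )
    return [(key, value) for _offset, key, value in survivors]


log = [
    ("mew", 1078), ("purr", 2108), ("mew", 1079), ("scratch", 252),
    ("purr", 2109), ("yawn", 511), ("scratch", None),
]
compacted = compact(log)
# compacted == [("mew", 1079), ("purr", 2109), ("yawn", 511)]
```

The output keeps the surviving records in their original log order, so a consumer that scans the compacted log still sees writes in the order they occurred.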
在基於日誌的訊息代理與資料變更捕獲的上下文中也適用相同的想法。如果 CDC 系統被配置為,每個變更都包含一個主鍵,且每個鍵的更新都替換了該鍵以前的值,那麼只需要保留對鍵的最新寫入就足夠了。
現在,無論何時需要重建派生資料系統(如搜尋索引),你可以從壓縮日誌主題的零偏移量處啟動新的消費者,然後依次掃描日誌中的所有訊息。日誌能保證包含資料庫中每個鍵的最新值(也可能是一些較舊的值)—— 換句話說,你可以使用它來獲取資料庫內容的完整副本,而無需從 CDC 源資料庫取一個快照。
Apache Kafka 支援這種日誌壓縮功能。正如我們將在本章後面看到的,它允許訊息代理被當成永續性儲存使用,而不僅僅是用於臨時訊息。
#### 變更流的 API 支援 {#sec_stream_change_api}
如今許多主流資料庫都把變更流作為一等介面提供,而不再像過去那樣主要依賴“事後補丁式”或逆向工程式的 CDC。MySQL、PostgreSQL 等關係資料庫通常透過與自身複製相同的日誌通道輸出變更;各大雲廠商也提供了對應的 CDC 服務,例如 Google Cloud 的 Datastream 可向關係資料庫與資料倉庫提供流式資料訪問。
即使是 Cassandra 這類最終一致、基於法定票數的資料庫,也開始支援資料變更捕獲。正如我們在第十章關於線性一致與法定票數中看到的,寫入是否“可見”取決於讀寫一致性設定,這使得其 CDC 的統一抽象更困難。Cassandra 的做法通常是公開各節點原始日誌段,而不是提供單一統一的變更流;消費方需要自己讀取併合並各節點日誌,生成業務可用的單一事件流[^32]。
Kafka Connect[^33]提供了大量資料庫系統與 Kafka 的 CDC 整合能力。變更事件一旦進入 Kafka,就可以用於更新搜尋索引等派生系統,也可以繼續送入後續流處理鏈路。
#### 資料變更捕獲與事件溯源 {#sec_stream_event_sourcing}
資料變更捕獲與事件溯源都把狀態變化表示成事件日誌,但二者抽象層級不同:
* 在資料變更捕獲中,應用仍以可變方式使用資料庫,任意更新/刪除記錄;變更日誌從資料庫底層抽取(如複製日誌),因此能保證抽取順序與真實寫入順序一致,避免 [圖 12-4](#fig_stream_dual_write_race) 這類競態問題。
* 在事件溯源中,應用邏輯從一開始就構建在不可變事件之上,事件儲存通常是僅追加寫入,更新和刪除被限制或禁止。事件語義是應用層行為,而非底層狀態差異。
二者孰優取決於場景。對未採用事件溯源的系統而言,引入它通常是一次較大架構變更;而資料變更捕獲通常可在現有資料庫上以較小改動接入,應用層甚至可以感知不到 CDC 的存在。
> [!TIP] 資料變更捕獲與資料庫模式
> 資料變更捕獲看上去比事件溯源更容易落地,但它也有自己的工程挑戰。
>
> 在微服務架構中,資料庫通常只由所屬服務直接訪問;其他服務透過該服務 API 互動,因此資料庫模式本應是服務內部實現細節,可隨服務演化。
>
> 但 CDC 往往直接複用上游資料庫模式做複製,這會把原本“內部模式”變成“外部契約”。刪除某個列可能會直接破壞下游消費者[^34]。
>
> 一種常見解法是 **Outbox 模式**:專門維護對外發布的 outbox 表,讓 CDC 讀取 outbox,而不是直接讀取內部領域模型表。這樣可以在儘量不影響外部消費者的前提下演化內部模式[^35] [^36]。它看起來像雙寫,實際上也是雙寫;但它把兩次寫入留在同一個資料庫系統內,因此可放進同一事務,規避跨系統雙寫的一致性問題。
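
The outbox pattern can be sketched with SQLite from the Python standard library. The table names, columns, and event format are hypothetical; the point is only that the domain write and the outbox write share one transaction, so a CDC process reading `outbox` never sees an event for an order that was not committed.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total_cents INTEGER)")
conn.execute("CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT, event_type TEXT, payload TEXT)")


def place_order(conn, order_id, customer, total_cents):
    """Write the internal row and the published event in a single transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("INSERT INTO orders VALUES (?, ?, ?)", (order_id, customer, total_cents))
        payload = json.dumps({"order_id": order_id, "total_cents": total_cents})
        conn.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            ("OrderPlaced", payload),
        )


place_order(conn, 1, "alice", 1250)
try:
    place_order(conn, 1, "bob", 300)  # duplicate primary key: the transaction is aborted
except sqlite3.IntegrityError:
    pass                              # no order row and no outbox row were written

outbox_rows = conn.execute("SELECT event_type, payload FROM outbox").fetchall()
order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

The `outbox` payload deliberately omits the `customer` column: the internal table can change shape without changing the published event.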
和資料變更捕獲一樣,重放事件日誌也能重建當前狀態,但日誌壓縮策略不同:
* 對於 CDC,更新事件通常攜帶記錄的完整新版本,因此同一主鍵的最新事件就足以決定當前值,舊事件可被壓縮。
* 對於事件溯源,事件通常描述使用者意圖而非狀態覆蓋,後續事件一般不會“覆蓋”先前事件,因此重建狀態通常需要完整歷史,不能按 CDC 的方式壓縮。
採用事件溯源的系統通常會儲存由事件日誌匯出的狀態快照,以降低讀取與恢復成本;但快照本質上是效能最佳化。其核心假設仍是:原始事件可長期儲存,並在需要時可完整重放。我們將在“不變性的侷限性”中討論這一假設的邊界。
### 狀態、流和不變性 {#sec_stream_immutability}
我們在 [第十一章](/tw/ch11) 中看到,批處理因其輸入檔案不變性而受益良多,你可以在現有輸入檔案上執行實驗性處理作業,而不用擔心損壞它們。這種不變性原則也是使得事件溯源與資料變更捕獲如此強大的原因。
我們通常將資料庫視為應用程式當前狀態的儲存 —— 這種表示針對讀取進行了最佳化,而且通常對於服務查詢而言是最為方便的表示。狀態的本質是,它會變化,所以資料庫才會支援資料的增刪改。這又該如何匹配不變性呢?
只要你的狀態發生了變化,那麼這個狀態就是這段時間中事件修改的結果。例如,當前可用的座位列表是你已處理的預訂所產生的結果,當前帳戶餘額是帳戶中的借與貸的結果,而 Web 伺服器的響應時間圖,是所有已發生 Web 請求的獨立響應時間的聚合結果。
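No matter how the state changes, there was always a sequence of events that caused those changes. Even as things are done and undone, the fact remains that those events occurred. The key idea is that mutable state and an append-only log of immutable events do not contradict each other: they are two sides of the same coin. The log of all changes, the **changelog**, represents the evolution of state over time.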
如果你傾向於數學表示,那麼你可能會說,應用狀態是事件流對時間求積分得到的結果,而變更流是狀態對時間求微分的結果,如 [圖 12-7](#fig_stream_state_derivative) 所示[^37] [^38]。這個比喻有一些侷限性(例如,狀態的二階導似乎沒有意義),但這是考慮資料的一個實用出發點。
$$
\begin{aligned}
state(now) &= \int_{t=0}^{now} stream(t)\,dt \\
stream(t) &= \frac{d\,state(t)}{dt}
\end{aligned}
$$
{{< figure src="/fig/ddia_1207.png" id="fig_stream_state_derivative" caption="圖 12-7. 應用當前狀態與事件流之間的關係。" class="w-full my-4" >}}
如果你持久儲存了變更日誌,那麼重現狀態就非常簡單。如果你將事件日誌視為記錄系統,而把可變狀態視為其派生結果,那麼系統中的資料流就更容易推理。正如 Jim Gray 和 Andreas Reuter 在 1992 年所說[^39]:
> 從原理上講,資料庫並非必需;日誌已經包含了全部資訊。之所以要保留資料庫(即日誌末端的當前狀態),只是為了提高讀取效能。
日誌壓縮(如 “[日誌壓縮](#sec_stream_log_compaction)” 中所述)是連線日誌與資料庫狀態之間的橋樑:它只保留每條記錄的最新版本,並丟棄被覆蓋的版本。
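The integral and derivative analogy can be made concrete with a discrete example. The account events below are made up; a running sum plays the role of the integral, and successive differences play the role of the derivative.

```python
from itertools import accumulate

# Change events (the "stream"): deposits are positive, withdrawals negative.
changes = [100, -30, 45, -15]

# "Integrating" the stream: the state after each event is the running sum.
balances = list(accumulate(changes))

# "Differentiating" the state: successive differences recover the change stream.
recovered = [after - before for before, after in zip([0] + balances, balances)]

print(balances)   # [100, 70, 115, 100]
print(recovered)  # [100, -30, 45, -15]
```

The final element of `balances` is the current state; everything before it is history that the state alone does not preserve.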
#### 不可變事件的優點 {#sec_stream_immutability_pros}
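Immutability in databases is an old idea. Accountants, for example, have applied it to financial bookkeeping for centuries. When a transaction takes place, it is recorded in an append-only ledger, which is in effect a log of events describing money, goods, or services changing hands. The accounts, such as profit and loss or the balance sheet, are derived by summing the transactions in the ledger[^40].
If a mistake is made, accountants do not erase or alter the incorrect transaction in the ledger. Instead, they add another transaction that compensates for the mistake, for example by refunding an incorrect charge. The incorrect transaction stays in the ledger forever, because it may matter for auditing. If incorrect figures derived from the incorrect ledger have already been published, the figures for the next accounting period include a correction. This process is entirely normal in accounting[^41].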
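Although this kind of auditability matters most in financial systems, it also helps many other systems that are not subject to such strict regulation. As discussed in “[Batch Use Cases](/tw/ch11#sec_batch_output)”, if you accidentally deploy buggy code that writes bad data to a database, recovery is much harder when that code destructively overwrites data. With an append-only log of immutable events, it is much easier to diagnose what happened and to recover from the problem.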
不可變的事件也包含了比當前狀態更多的資訊。例如在購物網站上,顧客可以將物品新增到他們的購物車,然後再將其移除。雖然從履行訂單的角度,第二個事件取消了第一個事件,但對分析目的而言,知道客戶考慮過某個特定項而之後又反悔,可能是很有用的。也許他們會選擇在未來購買,或者他們已經找到了替代品。這個資訊被記錄在事件日誌中,但對於移出購物車就刪除記錄的資料庫而言,這個資訊在移出購物車時可能就丟失了。
#### 從同一事件日誌中派生多個檢視 {#sec_stream_deriving_views}
此外,透過從不變的事件日誌中分離出可變的狀態,你可以針對不同的讀取方式,從相同的事件日誌中派生出幾種不同的表現形式。效果就像一個流的多個消費者一樣([圖 12-5](#fig_stream_cdc_flow)):例如,Kafka Connect 能將來自 Kafka 的資料匯出到各種不同的資料庫與索引[^33]。這對於許多其他儲存和索引系統(如搜尋伺服器)來說也是有意義的,當系統要從分散式日誌中獲取輸入時尤其如此(請參閱 “[保持系統同步](#sec_stream_sync)”)。
新增從事件日誌到資料庫的顯式轉換,能夠使應用更容易地隨時間演進:如果你想要引入一個新功能,以新的方式表示現有資料,則可以使用事件日誌來構建一個單獨的、針對新功能的讀取最佳化檢視,無需修改現有系統而與之共存。並行執行新舊系統通常比在現有系統中執行複雜的模式遷移更容易。一旦不再需要舊的系統,你可以簡單地關閉它並回收其資源[^42] [^43]。
如果你不需要擔心如何查詢與訪問資料,那麼儲存資料通常是非常簡單的。模式設計、索引和儲存引擎的許多複雜性,都是希望支援某些特定查詢和訪問模式的結果(請參閱 [第三章](/tw/ch3))。出於這個原因,透過將資料寫入的形式與讀取形式相分離,並允許幾個不同的讀取檢視,你能獲得很大的靈活性。這個想法有時被稱為 **命令查詢責任分離(command query responsibility segregation, CQRS)**[^44]。
資料庫和模式設計的傳統方法是基於這樣一種謬論,資料必須以與查詢相同的形式寫入。如果可以將資料從針對寫入最佳化的事件日誌轉換為針對讀取最佳化的應用狀態,那麼有關正規化和反正規化的爭論就變得無關緊要了(請參閱 “[多對一和多對多的關係](/tw/ch3#sec_datamodels_normalization)”):在針對讀取最佳化的檢視中對資料進行反正規化是完全合理的,因為翻譯過程提供了使其與事件日誌保持一致的機制。
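As a rough sketch of this separation, the snippet below derives two differently shaped read views from one event log: a per-user cart for serving requests and a per-item removal count for analytics. The event fields (`type`, `user`, `item`) and both function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical write-optimized event log
events = [
    {"type": "item_added", "user": "ann", "item": "book"},
    {"type": "item_added", "user": "bob", "item": "book"},
    {"type": "item_removed", "user": "ann", "item": "book"},
    {"type": "item_added", "user": "ann", "item": "pen"},
]

def cart_view(log):
    """Read view 1: the current cart contents per user."""
    carts = defaultdict(set)
    for e in log:
        if e["type"] == "item_added":
            carts[e["user"]].add(e["item"])
        elif e["type"] == "item_removed":
            carts[e["user"]].discard(e["item"])
    return dict(carts)

def removal_view(log):
    """Read view 2: how often each item was removed from a cart."""
    removed = defaultdict(int)
    for e in log:
        if e["type"] == "item_removed":
            removed[e["item"]] += 1
    return dict(removed)

carts = cart_view(events)
removals = removal_view(events)
```

Adding a third view later only needs another function that reads the same log; neither existing view has to change.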
在 “[描述負載](/tw/ch2#sec_introduction_twitter)” 中,我們討論了推特主頁時間線,它是特定使用者關注的人群所發推特的快取(類似郵箱)。這是 **針對讀取最佳化的狀態** 的又一個例子:主頁時間線是高度反正規化的,因為你的推文與你所有粉絲的時間線都構成了重複。然而,扇出服務保持了這種重複狀態與新推特以及新關注關係的同步,從而保證了重複的可管理性。
#### 併發控制 {#sec_stream_concurrency}
事件溯源和資料變更捕獲的最大缺點是,事件日誌的消費者通常是非同步的,所以可能會出現這樣的情況:使用者會寫入日誌,然後從日誌派生檢視中讀取,結果發現他的寫入還沒有反映在讀取檢視中。我們之前在 “[讀己之寫](/tw/ch6#sec_replication_ryw)” 中討論了這個問題以及可能的解決方案。
一種解決方案是將事件追加到日誌時同步執行讀取檢視的更新。而將這些寫入操作合併為一個原子單元需要 **事務**,所以要麼將事件日誌和讀取檢視儲存在同一個儲存系統中,要麼就需要跨不同系統進行分散式事務。或者,你也可以使用在 “[使用共享日誌](/tw/ch10#sec_consistency_smr)” 中討論的方法。
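On the other hand, deriving the current state from an event log also simplifies some aspects of concurrency control. Much of the need for multi-object transactions (see “[Single-Object and Multi-Object Operations](/tw/ch8#sec_transactions_multi_object)”) stems from a single user action requiring data to be changed in several different places. With event sourcing, you can design an event so that it is a self-contained description of one user action. The user action then requires only a single write in one place, namely appending the event to the log, and a single append is easy to make atomic.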
如果事件日誌與應用狀態以相同的方式分割槽(例如,處理分割槽 3 中的客戶事件只需要更新分割槽 3 中的應用狀態),那麼直接使用單執行緒日誌消費者就不需要寫入併發控制了。它從設計上一次只處理一個事件(請參閱 “[真的序列執行](/tw/ch8#sec_transactions_serial)”)。日誌透過在分割槽中定義事件的序列順序,消除了併發性的不確定性[^27]。如果一個事件觸及多個狀態分割槽,那麼需要做更多的工作,我們將在 [第十三章](/tw/ch13) 討論。
#### 不變性的侷限性 {#sec_stream_immutability_limitations}
許多不使用事件溯源模型的系統也還是依賴不可變性:各種資料庫在內部使用不可變的資料結構或多版本資料來支援時間點快照(請參閱 “[索引和快照隔離](/tw/ch8#sec_transactions_snapshot_indexes)” )。Git、Mercurial 和 Fossil 等版本控制系統也依靠不可變的資料來儲存檔案的版本歷史記錄。
永遠保持所有變更的不變歷史,在多大程度上是可行的?答案取決於資料集的流失率。一些工作負載主要是新增資料,很少更新或刪除;它們很容易保持不變。其他工作負載在相對較小的資料集上有較高的更新 / 刪除率;在這些情況下,不可變的歷史可能增至難以接受的巨大,碎片化可能成為一個問題,壓縮與垃圾收集的表現對於運維的穩健性變得至關重要[^45] [^46]。
除了效能方面的原因外,也可能有出於管理方面的原因需要刪除資料的情況,儘管這些資料都是不可變的。例如,隱私條例可能要求在使用者關閉帳戶後刪除他們的個人資訊,資料保護立法可能要求刪除錯誤的資訊,或者可能需要阻止敏感資訊的意外洩露。
在這種情況下,僅僅在日誌中新增另一個事件來指明先前的資料應該被視為刪除是不夠的 —— 你實際上是想改寫歷史,並假裝資料從一開始就沒有寫入。例如,Datomic 管這個特性叫 **切除(excision)**[^47],而 Fossil 版本控制系統有一個類似的概念叫 **避免(shunning)**[^48]。
真正刪除資料是非常非常困難的[^49],因為副本可能存在於很多地方:例如,儲存引擎、檔案系統和 SSD 通常會向新位置寫入,而不是原地覆蓋舊資料[^41];而備份往往刻意設計為不可變,以防誤刪或損壞。
一種支援刪除不可變資料的方法是 **加密粉碎(crypto-shredding)**[^50]:將未來可能需要刪除的資料以加密形式儲存,刪除時僅銷燬金鑰。這樣,密文仍在,但不可再被使用。從某種意義上說,這只是把可變性從“資料本身”轉移到“金鑰管理”上。
此外,你需要預先決定哪些資料共享同一金鑰、哪些資料使用不同金鑰,因為後續你能“粉碎”的粒度通常是“該金鑰加密的全部資料”或“都不刪”,很難只刪其中一部分。若為每條記錄單獨存金鑰,金鑰儲存規模又會變得不可控。像 puncturable encryption 這樣的高階方案[^51]可以提供更細粒度的撤銷能力,但尚未廣泛落地。
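Overall, deletion is more a matter of “making the data harder to retrieve” than of “making the data impossible to recover.” Even so, in some situations you still have to make your best effort, as we shall see in “[Legislation and Self-Regulation](/tw/ch14#sec_future_legislation)”.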
## 流處理 {#sec_stream_processing}
到目前為止,本章中我們已經討論了流的來源(使用者活動事件,感測器和寫入資料庫),我們討論了流如何傳輸(直接透過訊息傳送,透過訊息代理,透過事件日誌)。
剩下的就是討論一下你可以用流做什麼 —— 也就是說,你可以處理它。一般來說,有三種選項:
1. 你可以將事件中的資料寫入資料庫、快取、搜尋索引或類似的儲存系統,然後能被其他客戶端查詢。如 [圖 12-5](#fig_stream_cdc_flow) 所示,這是資料庫與系統其他部分所發生的變更保持同步的好方法 —— 特別是當流消費者是寫入資料庫的唯一客戶端時。如 “[批處理工作流的輸出](/tw/ch11#sec_batch_output)” 中所討論的,它是寫入儲存系統的流等價物。
2. 你能以某種方式將事件推送給使用者,例如傳送報警郵件或推送通知,或將事件流式傳輸到可即時顯示的儀表板上。在這種情況下,人是流的最終消費者。
3. 你可以處理一個或多個輸入流,併產生一個或多個輸出流。流可能會經過由幾個這樣的處理階段組成的流水線,最後再輸出(選項 1 或 2)。
在本章的剩餘部分中,我們將討論選項 3:處理流以產生其他派生流。處理這樣的流的程式碼片段,被稱為 **運算元(operator)** 或 **作業(job)**。它與我們在 [第十一章](/tw/ch11) 中討論過的 Unix 程序和 MapReduce 作業密切相關,資料流的模式是相似的:一個流處理器以只讀的方式使用輸入流,並將其輸出以僅追加的方式寫入一個不同的位置。
流處理中的分割槽和並行化模式也非常類似於 [第十一章](/tw/ch11) 中介紹的 MapReduce 和資料流引擎,因此我們不再重複這些主題。基本的 Map 操作(如轉換和過濾記錄)也是一樣的。
與批次作業相比的一個關鍵區別是,流不會結束。這種差異會帶來很多隱含的結果。正如本章開始部分所討論的,排序對無界資料集沒有意義,因此無法使用 **排序合併連線**(請參閱 “[Reduce 側連線與分組](/tw/ch11#sec_batch_join)”)。容錯機制也必須改變:對於已經運行了幾分鐘的批處理作業,可以簡單地從頭開始重啟失敗任務,但是對於已經執行數年的流作業,重啟後從頭開始跑可能並不是一個可行的選項。
### 流處理的應用 {#sec_stream_uses}
長期以來,流處理一直用於監控目的,如果某個事件發生,組織希望能得到警報。例如:
* 欺詐檢測系統需要確定信用卡的使用模式是否有意外地變化,如果信用卡可能已被盜刷,則鎖卡。
* 交易系統需要檢查金融市場的價格變化,並根據指定的規則進行交易。
* 製造系統需要監控工廠中機器的狀態,如果出現故障,可以快速定位問題。
* 軍事和情報系統需要跟蹤潛在侵略者的活動,並在出現襲擊徵兆時發出警報。
這些型別的應用需要非常精密複雜的模式匹配與相關檢測。然而隨著時代的進步,流處理的其他用途也開始出現。在本節中,我們將簡要比較一下這些應用。
#### 複合事件處理 {#id317}
**複合事件處理(complex event processing, CEP)** 是 20 世紀 90 年代為分析事件流而開發出的一種方法,尤其適用於需要搜尋某些事件模式的應用[^52]。與正則表示式允許你在字串中搜索特定字元模式的方式類似,CEP 允許你指定規則以在流中搜索某些事件模式。
CEP 系統通常使用高層次的宣告式查詢語言,比如 SQL,或者圖形使用者介面,來描述應該檢測到的事件模式。這些查詢被提交給處理引擎,該引擎消費輸入流,並在內部維護一個執行所需匹配的狀態機。當發現匹配時,引擎發出一個 **複合事件**(即 complex event,CEP 因此得名),並附有檢測到的事件模式詳情[^53]。
在這些系統中,查詢和資料之間的關係與普通資料庫相比是顛倒的。通常情況下,資料庫會持久儲存資料,並將查詢視為臨時的:當查詢進入時,資料庫搜尋與查詢匹配的資料,然後在查詢完成時丟掉查詢。CEP 引擎反轉了角色:查詢是長期儲存的,來自輸入流的事件不斷流過它們,搜尋匹配事件模式的查詢[^54]。
CEP 的實現包括 Esper、Apama 和 TIBCO StreamBase。像 Flink 和 Spark Streaming 這樣的分散式流處理框架,也支援在流上使用 SQL 進行宣告式查詢。
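To make the idea of a stored query backed by a state machine concrete, here is a minimal hand-written matcher in Python. It is not how Esper, Apama, or any other engine is implemented; the pattern (several failed logins followed by a success), the class name `FailedThenSuccess`, and the event fields are assumptions made purely for illustration.

```python
from collections import defaultdict

class FailedThenSuccess:
    """Stored query: `threshold` or more failed logins, then a successful one."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = defaultdict(int)  # per-user matching state

    def on_event(self, event):
        user, kind = event["user"], event["kind"]
        if kind == "login_failed":
            self.failures[user] += 1
            return None
        if kind == "login_ok":
            count = self.failures.pop(user, 0)  # a success resets the state
            if count >= self.threshold:
                # Emit a complex event describing the detected pattern
                return {"pattern": "suspicious_login", "user": user,
                        "failed_attempts": count}
        return None

detector = FailedThenSuccess(threshold=3)
stream = [
    {"user": "ann", "kind": "login_failed"},
    {"user": "ann", "kind": "login_failed"},
    {"user": "bob", "kind": "login_failed"},
    {"user": "ann", "kind": "login_failed"},
    {"user": "bob", "kind": "login_ok"},
    {"user": "ann", "kind": "login_ok"},
]
matches = []
for event in stream:
    result = detector.on_event(event)
    if result is not None:
        matches.append(result)
```

The query object lives for as long as the stream does, while each event is examined once and then forgotten, which is the inversion described above.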
#### 流分析 {#id318}
使用流處理的另一個領域是對流進行分析。CEP 與流分析之間的邊界是模糊的,但一般來說,分析往往對找出特定事件序列並不關心,而更關注大量事件上的聚合與統計指標 —— 例如:
* 測量某種型別事件的速率(每個時間間隔內發生的頻率)
* 滾動計算一段時間視窗內某個值的平均值
* 將當前的統計值與先前的時間區間的值對比(例如,檢測趨勢,當指標與上週同比異常偏高或偏低時報警)
這些統計值通常是在固定時間區間內進行計算的,例如,你可能想知道在過去 5 分鐘內服務每秒查詢次數的均值,以及此時間段內響應時間的第 99 百分位點。在幾分鐘內取平均,能抹平秒和秒之間的無關波動,且仍然能向你展示流量模式的時間圖景。聚合的時間間隔稱為 **視窗(window)**,我們將在 “[時間推理](#sec_stream_time)” 中更詳細地討論視窗。
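Stream analytics systems sometimes use probabilistic algorithms, such as Bloom filters (which we encountered in “[Performance Optimizations](/tw/ch4#sec_storage_bloom_filter)”) for set membership, HyperLogLog[^55] for cardinality estimation, and various percentile estimation algorithms (see “[Percentiles in Practice](/tw/ch2#sec_introduction_percentiles)”). Probabilistic algorithms produce approximate results, but they need far less memory than exact algorithms. This use of approximation sometimes leads people to believe that stream processing systems are always lossy and inexact, but that is wrong: there is nothing inherently approximate about stream processing, and probabilistic algorithms are merely an optimization[^56].
Many open source distributed stream processing frameworks are designed with analytics in mind: for example, Apache Storm, Spark Streaming, Flink, Samza, Apache Beam, and Kafka Streams[^57]. Hosted services include Google Cloud Dataflow and Azure Stream Analytics.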
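As an illustration of the memory-versus-accuracy trade-off, the following is a deliberately small Bloom filter built on `hashlib`. The sizing (`num_bits`, `num_hashes`) and the hashing scheme are arbitrary choices for this sketch, not a tuned or recommended configuration.

```python
import hashlib

class BloomFilter:
    """Approximate set membership: no false negatives, some false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive several bit positions by salting the item with an index
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

seen = BloomFilter()
for user_id in ("u1", "u2", "u3"):
    seen.add(user_id)
```

An item that was added is always reported as possibly present; an item that was never added is usually, but not always, reported as absent. The filter's size stays fixed no matter how many items pass through it.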
#### 維護物化檢視 {#sec_stream_mat_view}
我們在 “[資料庫與流](#sec_stream_databases)” 中看到,資料庫的變更流可以用於維護派生資料系統(如快取、搜尋索引和資料倉庫),並使其與源資料庫保持最新。我們可以將這些示例視作維護 **物化檢視(materialized view)** 的一種具體場景:在某個資料集上派生出一個替代檢視以便高效查詢,並在底層資料變更時更新檢視[^37]。
同樣,在事件溯源中,應用程式的狀態是透過應用事件日誌來維護的;這裡的應用程式狀態也是一種物化檢視。與流分析場景不同的是,僅考慮某個時間視窗內的事件通常是不夠的:構建物化檢視可能需要任意時間段內的 **所有** 事件,除了那些可能由日誌壓縮丟棄的過時事件(請參閱 “[日誌壓縮](#sec_stream_log_compaction)”)。實際上,你需要一個可以一直延伸到時間開端的視窗。
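In principle, any stream processor could be used to maintain materialized views, although the need to keep events forever runs counter to the assumption of some analytics-oriented frameworks that they mostly operate on windows of limited duration. Kafka Streams and Confluent's ksqlDB support this kind of usage, building on Kafka's support for log compaction[^58].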
> [!TIP] 增量檢視維護
> 資料庫看起來很適合做物化檢視維護:它們本來就擅長儲存完整資料副本,也常常支援物化檢視。
>
> 但很多資料庫重新整理物化檢視仍依賴批處理或按需觸發(例如 PostgreSQL 的 `REFRESH MATERIALIZED VIEW`),而不是在源資料變化時做增量維護。這會帶來兩個問題:
>
> 1. 效率低:每次重新整理都重算全量資料,而不是隻處理變化部分[^38] [^59] [^60]。
> 2. 不夠即時:重新整理間隔內的變化不會立刻反映在視圖裡。
>
> Materialize、RisingWave、ClickHouse、Feldera 等系統都在探索更即時的增量維護路徑[^61]。
#### 在流上搜索 {#id320}
除了允許搜尋由多個事件構成模式的 CEP 外,有時也存在基於複雜標準(例如全文檢索查詢)來搜尋單個事件的需求。
例如,媒體監測服務可以訂閱新聞文章 Feed 與來自媒體的播客,搜尋任何關於公司、產品或感興趣的話題的新聞。這是透過預先構建一個搜尋查詢來完成的,然後不斷地將新聞項的流與該查詢進行匹配。在一些網站上也有類似的功能:例如,當市場上出現符合其搜尋條件的新房產時,房地產網站的使用者可以要求網站通知他們。Elasticsearch 的 percolator 功能,是實現這種流搜尋的一種選擇[^62]。
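A conventional search engine first indexes the documents and then runs queries over the index. Searching a stream turns this around: the queries are stored, and the documents run past the queries, as in CEP. In the simplest case, you can test every document against every query, but this can become slow if you have a large number of queries. To optimize the process, you can index the queries as well as the documents, and thereby narrow down the set of queries that may match[^63].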
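One simple way to index queries, sketched below, is to file each stored query under a single one of its required terms: a document that lacks that term cannot match, so only the queries filed under words that actually occur in the document need to be checked. This is an illustrative toy for conjunctive keyword queries, not a description of how Elasticsearch's percolator works; the class `StoredQueries` and its methods are hypothetical.

```python
from collections import defaultdict

class StoredQueries:
    """Stored AND-of-keywords queries, indexed by one required term each."""

    def __init__(self):
        self.queries = {}                # query_id -> set of required terms
        self.by_term = defaultdict(set)  # term -> ids of queries filed under it

    def register(self, query_id, terms):
        terms = {t.lower() for t in terms}  # assumes at least one term
        self.queries[query_id] = terms
        # Any single required term works as the index key; pick one
        # deterministically.
        self.by_term[min(terms)].add(query_id)

    def match(self, document):
        words = set(document.lower().split())
        candidates = set()
        for w in words:
            candidates |= self.by_term.get(w, set())
        # Verify only the candidates, not every stored query
        return sorted(q for q in candidates if self.queries[q] <= words)

stored = StoredQueries()
stored.register("q1", ["kafka", "outage"])
stored.register("q2", ["flink"])
stored.register("q3", ["kafka", "release"])
hits = stored.match("Kafka release notes published")
```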
#### 事件驅動架構與 RPC {#sec_stream_actors_drpc}
在 “[訊息傳遞中的資料流](/tw/ch5#sec_encoding_dataflow_msg)” 中我們討論過,訊息傳遞系統可以作為 RPC 的替代方案,即作為一種服務間通訊的機制,比如在 Actor 模型中所使用的那樣。儘管這些系統也是基於訊息和事件,但我們通常不會將其視作流處理元件:
* Actor 框架主要是管理模組通訊的併發和分散式執行的一種機制,而流處理主要是一種資料管理技術。
* Actor 之間的交流往往是短暫的、一對一的;而事件日誌則是持久的、多訂閱者的。
* Actor 可以以任意方式進行通訊(包括迴圈的請求 / 響應模式),但流處理通常配置在無環流水線中,其中每個流都是一個特定作業的輸出,由良好定義的輸入流中派生而來。
也就是說,RPC 類系統與流處理之間有一些交叉領域。例如,Apache Storm 有一個稱為 **分散式 RPC** 的功能,它允許將使用者查詢分散到一系列也處理事件流的節點上;然後這些查詢與來自輸入流的事件交織,而結果可以被彙總併發回給使用者(另請參閱 “[多分割槽資料處理](/tw/ch13#sec_future_unbundled_multi_shard)”)。
也可以使用 Actor 框架來處理流。但是,很多這樣的框架在崩潰時不能保證訊息的傳遞,除非你實現了額外的重試邏輯,否則這種處理不是容錯的。
### 時間推理 {#sec_stream_time}
流處理通常需要與時間打交道,尤其是用於分析目的時候,會頻繁使用時間視窗,例如 “過去五分鐘的平均值”。“過去五分鐘” 的含義看上去似乎是清晰而無歧義的,但不幸的是,這個概念非常棘手。
在批處理過程中,大量的歷史事件被快速地處理。如果需要按時間來分析,批處理器需要檢查每個事件中嵌入的時間戳。讀取執行批處理機器的系統時鐘沒有任何意義,因為處理執行的時間與事件實際發生的時間無關。
批處理可以在幾分鐘內讀取一年的歷史事件;在大多數情況下,感興趣的時間線是歷史中的一年,而不是處理中的幾分鐘。而且使用事件中的時間戳,使得處理是 **確定性** 的:在相同的輸入上再次執行相同的處理過程會得到相同的結果。
另一方面,許多流處理框架使用處理機器上的本地系統時鐘(**處理時間**,即 processing time)來確定 **視窗(windowing)**[^64]。這種方法的優點是簡單,如果事件建立與事件處理之間的延遲可以忽略不計,那也是合理的。然而,如果存在任何顯著的處理延遲 —— 即,事件處理顯著地晚於事件實際發生的時間,這種處理方式就失效了。
#### 事件時間與處理時間 {#id322}
很多原因都可能導致處理延遲:排隊,網路故障(請參閱 “[不可靠的網路](/tw/ch9#sec_distributed_networks)”),效能問題導致訊息代理 / 訊息處理器出現爭用,流消費者重啟,從故障中恢復時重新處理過去的事件(請參閱 “[重播舊訊息](#sec_stream_replay)”),或者在修復程式碼 BUG 之後。
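Moreover, message delays can also lead to unpredictable ordering of messages. For example, say a user first makes one web request (handled by web server A) and then a second request (handled by server B). A and B each emit an event describing the request they handled, but B's event reaches the message broker before A's event does. The stream processor will now see the B event first and then the A event, even though they actually occurred in the opposite order.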
有一個類比也許能幫助理解,“星球大戰” 電影:第四集於 1977 年發行,第五集於 1980 年,第六集於 1983 年,緊隨其後的是 1999 年的第一集、2002 年的第二集、2005 年的第三集,以及 2015 年、2017 年和 2019 年的第七至第九集[^65]。如果你按照它們上映的順序觀看電影,你處理電影的順序與它們敘事的順序就是不一致的。(集數編號就像事件時間戳,而你觀看電影的日期就是處理時間)作為人類,我們能夠應對這種不連續性,但是流處理演算法需要專門編寫,以適應這種時序與順序的問題。
將事件時間和處理時間搞混會導致錯誤的資料。例如,假設你有一個流處理器用於測量請求速率(計算每秒請求數)。如果你重新部署流處理器,它可能會停止一分鐘,並在恢復之後處理積壓的事件。如果你按處理時間來衡量速率,那麼在處理積壓日誌時,請求速率看上去就像有一個異常的突發尖峰,而實際上請求速率是穩定的([圖 12-8](#fig_stream_processing_time_skew))。
{{< figure src="/fig/ddia_1208.png" id="fig_stream_processing_time_skew" caption="圖 12-8. 按處理時間分窗,會因為處理速率的變動引入人為因素。" class="w-full my-4" >}}
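The effect can be reproduced with a handful of made-up events. In the hypothetical data below, requests arrive at a steady two per second, but the processor is down during seconds 2 and 3 and works through the backlog at second 4.

```python
from collections import Counter

# (event_time, processing_time) in seconds; values are invented for illustration
events = [(0, 0), (0, 0), (1, 1), (1, 1),
          (2, 4), (2, 4), (3, 4), (3, 4), (4, 4), (4, 4)]

rate_by_event_time = Counter(event_t for event_t, _ in events)
rate_by_processing_time = Counter(proc_t for _, proc_t in events)
```

Counting by event time shows a flat rate of two requests in every second. Counting by processing time shows nothing for seconds 2 and 3, followed by an apparent spike of six requests in second 4.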
#### 處理滯留事件 {#id323}
用事件時間來定義視窗的一個棘手的問題是,你永遠也無法確定是不是已經收到了特定視窗的所有事件,還是說還有一些事件正在來的路上。
例如,假設你將事件分組為一分鐘的視窗,以便統計每分鐘的請求數。你已經計數了一些帶有本小時內第 37 分鐘時間戳的事件,時間流逝,現在進入的主要都是本小時內第 38 和第 39 分鐘的事件。什麼時候才能宣佈你已經完成了第 37 分鐘的視窗計數,並輸出其計數器值?
在一段時間沒有看到任何新的事件之後,你可以超時並宣佈一個視窗已經就緒,但仍然可能發生這種情況:某些事件被緩衝在另一臺機器上,由於網路中斷而延遲。你需要能夠處理這種在視窗宣告完成之後到達的 **滯留(straggler)** 事件。大體上,你有兩種選擇[^1]:
1. 忽略這些滯留事件,因為在正常情況下它們可能只是事件中的一小部分。你可以將丟棄事件的數量作為一個監控指標,並在出現大量丟訊息的情況時報警。
2. 釋出一個 **更正(correction)**,一個包括滯留事件的更新視窗值。你可能還需要收回以前的輸出。
在某些情況下,可以使用特殊的訊息來指示 “從現在開始,不會有比 t 更早時間戳的訊息了”,消費者可以使用它來觸發視窗[^66]。但是,如果不同機器上的多個生產者都在生成事件,每個生產者都有自己的最小時間戳閾值,則消費者需要分別跟蹤每個生產者。在這種情況下,新增和刪除生產者都是比較棘手的。
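A consumer that tracks such per-producer thresholds might look like the sketch below. It assumes a fixed, known set of producers and that a window covers the half-open interval ending at `window_end`; the class `WatermarkTracker` and its method names are hypothetical.

```python
class WatermarkTracker:
    """Tracks each producer's promise: 'no more events earlier than t'."""

    def __init__(self, producers):
        self.latest = {p: None for p in producers}

    def update(self, producer, watermark):
        prev = self.latest[producer]
        # A producer's threshold should only ever move forward
        self.latest[producer] = watermark if prev is None else max(prev, watermark)

    def combined(self):
        """The slowest producer determines overall progress."""
        if any(w is None for w in self.latest.values()):
            return None  # still waiting to hear from some producer
        return min(self.latest.values())

    def can_close(self, window_end):
        w = self.combined()
        return w is not None and w >= window_end

tracker = WatermarkTracker(["a", "b"])
tracker.update("a", 120)
before_b = tracker.combined()  # producer b has not reported yet
tracker.update("b", 90)
```

The fixed producer set is exactly the part that becomes awkward in practice: adding a producer means the combined threshold is undefined until it reports, and removing one requires deciding when its last threshold may safely be ignored.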
#### 你用的是誰的時鐘? {#id438}
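Assigning timestamps to events is even more difficult when events can be buffered at several points in the system. For example, consider a mobile app that reports usage events to a server. The app may be used while the device is offline, in which case it buffers events locally on the device and sends them to the server when an internet connection is next available (which may be hours or even days later). To any consumer of this stream, the events look like extremely delayed stragglers.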
在這種情況下,事件上的時間戳實際上應當是使用者交互發生的時間,取決於移動裝置的本地時鐘。然而使用者控制的裝置上的時鐘通常是不可信的,因為它可能會被無意或故意設定成錯誤的時間(請參閱 “[時鐘同步與準確性](/tw/ch9#sec_distributed_clock_accuracy)”)。伺服器收到事件的時間(取決於伺服器的時鐘)可能是更準確的,因為伺服器在你的控制之下,但在描述使用者互動方面意義不大。
要校正不正確的裝置時鐘,一種方法是記錄三個時間戳[^67]:
* 事件發生的時間,取決於裝置時鐘
* 事件傳送往伺服器的時間,取決於裝置時鐘
* 事件被伺服器接收的時間,取決於伺服器時鐘
透過從第三個時間戳中減去第二個時間戳,可以估算裝置時鐘和伺服器時鐘之間的偏移(假設網路延遲與所需的時間戳精度相比可忽略不計)。然後可以將該偏移應用於事件時間戳,從而估計事件實際發生的真實時間(假設裝置時鐘偏移在事件發生時與送往伺服器之間沒有變化)。
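The arithmetic is short enough to write out. The function name `estimate_event_time` and the example numbers are made up; timestamps are plain seconds, and the same two assumptions apply as in the text (negligible network delay, constant device clock offset).

```python
def estimate_event_time(device_event_ts, device_send_ts, server_receive_ts):
    """Estimate when an event really happened, in server-clock terms."""
    # Offset between the two clocks, measured at the moment of sending
    offset = server_receive_ts - device_send_ts
    # Apply the same offset to the (untrusted) device event timestamp
    return device_event_ts + offset

# Hypothetical device whose clock runs 3600 s ahead of the server:
# the event really happened at t=1000 and was uploaded at t=5000.
estimated = estimate_event_time(
    device_event_ts=1000 + 3600,
    device_send_ts=5000 + 3600,
    server_receive_ts=5000,
)
```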
這並不是流處理獨有的問題,批處理有著完全一樣的時間推理問題。只是在流處理的上下文中,我們更容易意識到時間的流逝。
#### 視窗的型別 {#id324}
當你知道如何確定一個事件的時間戳後,下一步就是如何定義時間段的視窗。然後視窗就可以用於聚合,例如事件計數,或計算視窗內值的平均值。有幾種視窗很常用[^64] [^68]:
滾動視窗(Tumbling Window)
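: A tumbling window has a fixed length, and every event belongs to exactly one window. For example, with a 1-minute tumbling window, all events with timestamps between `10:03:00` and `10:03:59` are grouped into one window, events between `10:04:00` and `10:04:59` into the next window, and so on. You could implement a 1-minute tumbling window by rounding each event timestamp down to the start of its minute to determine the window it belongs to.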
跳動視窗(Hopping Window)
: 跳動視窗也有著固定的長度,但允許視窗重疊以提供一些平滑。例如,一個帶有 1 分鐘跳躍步長的 5 分鐘視窗將包含 `10:03:00` 至 `10:07:59` 之間的事件,而下一個視窗將覆蓋 `10:04:00` 至 `10:08:59` 之間的事件,等等。透過首先計算 1 分鐘的滾動視窗(tumbling window),然後在幾個相鄰視窗上進行聚合,可以實現這種跳動視窗。
滑動視窗(Sliding Window)
: 滑動視窗包含了彼此間距在特定時長內的所有事件。例如,一個 5 分鐘的滑動視窗應當覆蓋 `10:03:39` 和 `10:08:12` 的事件,因為它們相距不超過 5 分鐘(注意滾動視窗與步長 5 分鐘的跳動視窗可能不會把這兩個事件分組到同一個視窗中,因為它們使用固定的邊界)。透過維護一個按時間排序的事件緩衝區,並不斷從視窗中移除過期的舊事件,可以實現滑動視窗。
會話視窗(Session window)
: 與其他視窗型別不同,會話視窗沒有固定的持續時間,而定義為:將同一使用者出現時間相近的所有事件分組在一起,而當用戶一段時間沒有活動時(例如,如果 30 分鐘內沒有事件)視窗結束。會話切分是網站分析的常見需求(請參閱 “[JOIN 與 GROUP BY](/tw/ch11#sec_batch_join)”)。
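Window assignment for three of these types can be sketched as below. Timestamps are integer seconds, windows are half-open intervals `(start, end)`, and all function names are hypothetical. Real frameworks also handle late events and state cleanup, which this sketch ignores.

```python
def tumbling_window(ts, size):
    """The single window containing ts: round down to a multiple of size."""
    start = ts - ts % size
    return (start, start + size)

def hopping_windows(ts, size, hop):
    """All windows of length size, starting every hop seconds, that contain ts."""
    last_start = ts - ts % hop
    starts = range(last_start, ts - size, -hop)  # walk back while ts still fits
    return [(s, s + size) for s in reversed(starts)]

def session_windows(timestamps, gap):
    """Group one user's events; a pause longer than gap starts a new session."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions

one_minute = tumbling_window(225, 60)         # an event at 00:03:45
five_by_one = hopping_windows(330, 300, 60)   # an event at 00:05:30
sessions = session_windows([0, 100, 5000, 5100, 1700], gap=1800)
```

An event falls into exactly one tumbling window but into `size / hop` hopping windows, which is where the smoothing comes from.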
### 流連線 {#sec_stream_joins}
在 [第十一章](/tw/ch11) 中,我們討論了批處理作業如何透過鍵來連線資料集,以及這種連線是如何成為資料管道的重要組成部分的。由於流處理將資料管道泛化為對無限資料集進行增量處理,因此對流進行連線的需求也是完全相同的。
然而,新事件隨時可能出現在一個流中,這使得流連線要比批處理連線更具挑戰性。為了更好地理解情況,讓我們先來區分三種不同型別的連線:**流 - 流** 連線,**流 - 表** 連線,與 **表 - 表** 連線。我們將在下面的章節中透過例子來說明。
#### 流流連線(視窗連線) {#id440}
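Say you have a search feature on your website and you want to detect recent trends in searched-for URLs. Every time someone types a search query, you log an event containing the query and the results returned. Every time someone clicks one of the search results, you log another event recording the click. To calculate the click-through rate for each URL in the search results, you need to bring together the events for the search action and the click action, which are connected by having the same session ID. Similar analyses are needed in advertising systems[^69].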
如果使用者丟棄了搜尋結果,點選可能永遠不會發生,即使它出現了,搜尋與點選之間的時間可能是高度可變的:在很多情況下,它可能是幾秒鐘,但也可能長達幾天或幾周(如果使用者執行搜尋,忘掉了這個瀏覽器頁面,過了一段時間後重新回到這個瀏覽器頁面上,並點選了一個結果)。由於可變的網路延遲,點選事件甚至可能先於搜尋事件到達。你可以選擇合適的連線視窗 —— 例如,如果點選與搜尋之間的時間間隔在一小時內,你可能會選擇連線兩者。
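Note that embedding the details of the search in the click event is not equivalent to joining the events: doing so would only tell you about the cases where the user clicked a search result, not about the searches where the user clicked none of the results. To measure search quality, you need accurate click-through rates, for which you need both the search events and the click events.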
為了實現這種型別的連線,流處理器需要維護 **狀態**:例如,按會話 ID 索引最近一小時內發生的所有事件。無論何時發生搜尋事件或點選事件,都會被新增到合適的索引中,而流處理器也會檢查另一個索引是否有具有相同會話 ID 的事件到達。如果有匹配事件就會發出一個表示搜尋結果被點選的事件;如果搜尋事件直到過期都沒看見有匹配的點選事件,就會發出一個表示搜尋結果未被點選的事件。
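A minimal in-memory version of this join is sketched below. It simplifies heavily: one search and at most one click per session ID, timestamps in seconds, and expiry triggered by an explicit `expire(now)` call rather than by a framework timer. The class `SearchClickJoin` and the output tuples are hypothetical.

```python
class SearchClickJoin:
    def __init__(self, window=3600):
        self.window = window
        self.searches = {}  # session_id -> (timestamp, query)
        self.clicks = {}    # session_id -> (timestamp, url), click seen first

    def on_search(self, session_id, ts, query):
        click = self.clicks.pop(session_id, None)
        if click is not None:  # the click overtook the search in transit
            return [("clicked", query, click[1])]
        self.searches[session_id] = (ts, query)
        return []

    def on_click(self, session_id, ts, url):
        search = self.searches.pop(session_id, None)
        if search is not None:
            return [("clicked", search[1], url)]
        self.clicks[session_id] = (ts, url)
        return []

    def expire(self, now):
        """Drop state older than the window; report searches with no click."""
        old = [s for s, (ts, _) in self.searches.items() if now - ts > self.window]
        out = [("not_clicked", self.searches.pop(s)[1], None) for s in old]
        for s in [s for s, (ts, _) in self.clicks.items() if now - ts > self.window]:
            del self.clicks[s]
        return out

join = SearchClickJoin(window=3600)
results = []
results += join.on_search("s1", 0, "kafka")
results += join.on_click("s2", 10, "/b")       # arrives before its search
results += join.on_search("s2", 11, "flink")
results += join.on_click("s1", 50, "/a")
results += join.on_search("s3", 100, "storm")  # never clicked
results += join.expire(now=4000)
```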
#### 流表連線(流擴充) {#sec_stream_table_joins}
在 “[示例:使用者活動事件分析](/tw/ch11#sec_batch_join)”([圖 11-2](/tw/ch11#fig_batch_join_example))中,我們看到了連線兩個資料集的批處理作業示例:一組使用者活動事件和一個使用者檔案資料庫。將使用者活動事件視為流,並在流處理器中連續執行相同的連線是很自然的想法:輸入是包含使用者 ID 的活動事件流,而輸出還是活動事件流,但其中使用者 ID 已經被擴充套件為使用者的檔案資訊。這個過程有時被稱為使用資料庫的資訊來 **擴充(enriching)** 活動事件。
要執行此連線,流處理器需要一次處理一個活動事件,在資料庫中查詢事件的使用者 ID,並將檔案資訊新增到活動事件中。資料庫查詢可以透過查詢遠端資料庫來實現。但正如在 “[示例:使用者活動事件分析](/tw/ch11#sec_batch_join)” 一節中討論的,此類遠端查詢可能會很慢,並且有可能導致資料庫過載[^58]。
另一種方法是將資料庫副本載入到流處理器中,以便在本地進行查詢而無需網路往返。這種技術與我們在 “[JOIN 與 GROUP BY](/tw/ch11#sec_batch_join)” 中討論的雜湊連線非常相似:如果資料庫的本地副本足夠小,則可以是記憶體中的散列表,比較大的話也可以是本地磁碟上的索引。
與批處理作業的區別在於,批處理作業使用資料庫的時間點快照作為輸入,而流處理器是長時間執行的,且資料庫的內容可能隨時間而改變,所以流處理器資料庫的本地副本需要保持更新。這個問題可以透過資料變更捕獲來解決:流處理器可以訂閱使用者檔案資料庫的更新日誌,如同活動事件流一樣。當增添或修改檔案時,流處理器會更新其本地副本。因此,我們有了兩個流之間的連線:活動事件和檔案更新。
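A stream-table join is actually very similar to a stream-stream join. The biggest difference is that for the table changelog stream, the join uses a window that reaches back to the “beginning of time” (a conceptually infinite window), with newer versions of records overwriting older ones. For the other input, the activity event stream, the join might not maintain a window at all.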
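The enrichment join can be sketched as a processor with two inputs: profile changes update a local table, and activity events are looked up against it. The names `EnrichmentJoin`, `on_profile_change`, and `on_activity`, and the convention that a `None` profile means deletion, are assumptions for this example.

```python
class EnrichmentJoin:
    def __init__(self):
        self.profiles = {}  # local copy of the profile table

    def on_profile_change(self, user_id, profile):
        """Changelog input: newer versions overwrite older ones."""
        if profile is None:
            self.profiles.pop(user_id, None)  # row was deleted upstream
        else:
            self.profiles[user_id] = profile

    def on_activity(self, event):
        """Activity input: enrich using whatever the table holds right now."""
        return {**event, "profile": self.profiles.get(event["user_id"])}

join = EnrichmentJoin()
join.on_profile_change(1, {"name": "Ann", "country": "DE"})
first = join.on_activity({"user_id": 1, "action": "view"})
join.on_profile_change(1, {"name": "Ann", "country": "FR"})
second = join.on_activity({"user_id": 1, "action": "buy"})
```

The two activity events are enriched with different profile versions purely because of the order in which the inputs were processed.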
#### 表表連線(維護物化檢視) {#id326}
我們在 “[描述負載](/tw/ch2#sec_introduction_twitter)” 中討論的推特時間線例子時說過,當用戶想要檢視他們的主頁時間線時,迭代使用者所關注人群的推文併合並它們是一個開銷巨大的操作。
相反,我們需要一個時間線快取:一種每個使用者的 “收件箱”,在傳送推文的時候寫入這些資訊,因而讀取時間線時只需要簡單地查詢即可。物化與維護這個快取需要處理以下事件:
* 當用戶 u 傳送新的推文時,它將被新增到每個關注使用者 u 的時間線上。
* 使用者刪除推文時,推文將從所有使用者的時間線中刪除。
* 當用戶 *u*~1~ 開始關注使用者 *u*~2~ 時,*u*~2~ 最近的推文將被新增到 *u*~1~ 的時間線上。
* 當用戶 *u*~1~ 取消關注使用者 *u*~2~ 時,*u*~2~ 的推文將從 *u*~1~ 的時間線中移除。
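To implement this cache maintenance in a stream processor, you need streams of events for tweets (sending and deleting) and for follow relationships (following and unfollowing). The stream processor needs to maintain a database containing the set of followers for each user, so that it knows which timelines need to be updated when a new tweet arrives.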
觀察這個流處理過程的另一種視角是:它維護了一個連線了兩個表(推文與關注)的物化檢視,如下所示:
```sql
SELECT follows.follower_id AS timeline_id,
array_agg(tweets.* ORDER BY tweets.timestamp DESC)
FROM tweets
JOIN follows ON follows.followee_id = tweets.sender_id
GROUP BY follows.follower_id
```
流連線直接對應於這個查詢中的表連線。時間線實際上是這個查詢結果的快取,每當底層的表發生變化時都會更新。
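The four event types listed earlier can be turned into an incremental maintainer of that view, sketched below. It keeps everything in memory, assumes tweet IDs are globally unique, and on a new follow copies all of the followee's stored tweets rather than only recent ones. The class `TimelineView` and its method names are hypothetical.

```python
from collections import defaultdict

class TimelineView:
    def __init__(self):
        self.tweets = defaultdict(dict)     # sender_id -> {tweet_id: text}
        self.followers = defaultdict(set)   # followee_id -> follower ids
        self.timelines = defaultdict(dict)  # follower_id -> {tweet_id: text}

    # Changes to the tweets table are joined with the current follows table
    def tweet(self, sender, tweet_id, text):
        self.tweets[sender][tweet_id] = text
        for follower in self.followers[sender]:
            self.timelines[follower][tweet_id] = text

    def delete_tweet(self, sender, tweet_id):
        self.tweets[sender].pop(tweet_id, None)
        for follower in self.followers[sender]:
            self.timelines[follower].pop(tweet_id, None)

    # Changes to the follows table are joined with the current tweets table
    def follow(self, follower, followee):
        self.followers[followee].add(follower)
        self.timelines[follower].update(self.tweets[followee])

    def unfollow(self, follower, followee):
        self.followers[followee].discard(follower)
        for tweet_id in self.tweets[followee]:
            self.timelines[follower].pop(tweet_id, None)

view = TimelineView()
view.tweet("bob", 1, "hi")
view.follow("ann", "bob")
view.tweet("bob", 2, "yo")
after_tweets = dict(view.timelines["ann"])
view.delete_tweet("bob", 1)
after_delete = dict(view.timelines["ann"])
view.unfollow("ann", "bob")
```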
> [!NOTE]
> 如果你將流視作表的導數(如 [圖 12-7](#fig_stream_state_derivative) 所示),並把連線看作兩個表 *u·v* 的乘積,那麼會出現一個有趣現象:物化連線的變化流遵循乘積法則 \( (u \cdot v)' = u'v + uv' \)。換句話說,任何推文變化都要和當前關注關係連線,任何關注關係變化都要和當前推文連線[^37]。
#### 連線的時間依賴性 {#sec_stream_join_time}
這裡描述的三種連線(流流,流表,表表)有很多共通之處:它們都需要流處理器維護連線一側的一些狀態(搜尋與點選事件,使用者檔案,關注列表),然後當連線另一側的訊息到達時查詢該狀態。
用於維護狀態的事件順序是很重要的(先關注然後取消關注,或者其他類似操作)。在分割槽日誌中,單個分割槽內的事件順序是保留下來的。但典型情況下是沒有跨流或跨分割槽的順序保證的。
這就產生了一個問題:如果不同流中的事件發生在近似的時間範圍內,則應該按照什麼樣的順序進行處理?在流表連線的例子中,如果使用者更新了它們的檔案,哪些活動事件與舊檔案連線(在檔案更新前處理),哪些又與新檔案連線(在檔案更新之後處理)?換句話說:你需要對一些狀態做連線,如果狀態會隨著時間推移而變化,那應當使用什麼時間點來連線呢?
這種時序依賴可能出現在很多地方。例如銷售東西需要對發票應用適當的稅率,這取決於所處的國家 / 州,產品型別,銷售日期(因為稅率時不時會變化)。當連線銷售額與稅率表時,你可能期望的是使用銷售時的稅率參與連線。如果你正在重新處理歷史資料,銷售時的稅率可能和現在的稅率有所不同。
如果跨越流的事件順序是未定的,則連線會變為不確定性的[^70],這意味著你在同樣輸入上重跑相同的作業未必會得到相同的結果:當你重跑任務時,輸入流上的事件可能會以不同的方式交織。
在資料倉庫中,這個問題被稱為 **緩慢變化的維度(slowly changing dimension, SCD)**,通常透過對特定版本的記錄使用唯一的識別符號來解決:例如,每當稅率改變時都會獲得一個新的識別符號,而發票在銷售時會帶有稅率的識別符號[^71] [^72]。這種變化使連線變為確定性的,但也會導致日誌壓縮無法進行:表中所有的記錄版本都需要保留。
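A minimal sketch of the versioned-identifier approach follows. Every rate change gets a new version ID, each sale records the ID that was current when it was made, and the join looks up that exact version. The names `TaxRateTable`, `record_sale`, and `join_sale`, and the use of a simple counter as the identifier, are assumptions for illustration.

```python
class TaxRateTable:
    def __init__(self):
        self.versions = {}  # version_id -> rate; old versions are never removed
        self.current = None

    def change_rate(self, rate):
        version_id = len(self.versions) + 1  # hypothetical ID scheme
        self.versions[version_id] = rate
        self.current = version_id
        return version_id

def record_sale(amount, tax_table):
    """At sale time, pin the sale to the tax rate version then in force."""
    return {"amount": amount, "tax_version": tax_table.current}

def join_sale(sale, tax_table):
    """Deterministic join: the result depends only on the pinned version."""
    rate = tax_table.versions[sale["tax_version"]]
    return round(sale["amount"] * rate, 2)

rates = TaxRateTable()
rates.change_rate(0.19)
early_sale = record_sale(100, rates)
rates.change_rate(0.21)
late_sale = record_sale(100, rates)
```

Reprocessing `early_sale` after the rate change still yields the old tax amount, but only because version 1 was kept, which is why this approach rules out compacting the rate table by key.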
### 容錯 {#sec_stream_fault_tolerance}
在本章的最後一節中,讓我們看一看流處理是如何容錯的。我們在 [第十一章](/tw/ch11) 中看到,批處理框架可以很容易地容錯:如果 MapReduce 作業中的任務失敗,可以簡單地在另一臺機器上再次啟動,並且丟棄失敗任務的輸出。這種透明的重試是可能的,因為輸入檔案是不可變的,每個任務都將其輸出寫入到 HDFS 上的獨立檔案中,而輸出僅當任務成功完成後可見。
特別是,批處理容錯方法可確保批處理作業的輸出與沒有出錯的情況相同,即使實際上某些任務失敗了。看起來好像每條輸入記錄都被處理了恰好一次 —— 沒有記錄被跳過,而且沒有記錄被處理兩次。儘管重啟任務意味著實際上可能會多次處理記錄,但輸出中的可見效果看上去就像只處理過一次。這個原則被稱為 **恰好一次語義(exactly-once semantics)**,儘管 **等效一次(effectively-once)** 可能會是一個更寫實的術語[^73]。
在流處理中也出現了同樣的容錯問題,但是處理起來沒有那麼直觀:等待某個任務完成之後再使其輸出可見並不是一個可行選項,因為你永遠無法處理完一個無限的流。
#### 微批次與存檔點 {#id329}
一個解決方案是將流分解成小塊,並像微型批處理一樣處理每個塊。這種方法被稱為 **微批次(microbatching)**,它被用於 Spark Streaming[^74]。批次的大小通常約為 1 秒,這是對效能妥協的結果:較小的批次會導致更大的排程與協調開銷,而較大的批次意味著流處理器結果可見之前的延遲要更長。
微批次也隱式提供了一個與批次大小相等的滾動視窗(按處理時間而不是事件時間戳分窗)。任何需要更大視窗的作業都需要顯式地將狀態從一個微批次轉移到下一個微批次。
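Apache Flink takes a different approach: it periodically generates rolling checkpoints of state and writes them to durable storage[^75] [^76]. If a stream operator crashes, it can restart from its most recent checkpoint and discard any output generated between that checkpoint and the crash. The checkpoints are triggered by **barriers** in the message stream, similar to the boundaries between microbatches, but without forcing a particular window size.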
在流處理框架的範圍內,微批次與存檔點方法提供了與批處理一樣的 **恰好一次語義**。但是,只要輸出離開流處理器(例如,寫入資料庫,向外部訊息代理傳送訊息,或傳送電子郵件),框架就無法拋棄失敗批次的輸出了。在這種情況下,重啟失敗任務會導致外部副作用發生兩次,只有微批次或存檔點不足以阻止這一問題。
#### 原子提交再現 {#sec_stream_atomic_commit}
為了在出現故障時表現出恰好處理一次的樣子,我們需要確保事件處理的所有輸出和副作用 **當且僅當** 處理成功時才會生效。這些影響包括傳送給下游運算元或外部訊息傳遞系統(包括電子郵件或推送通知)的任何訊息,任何資料庫寫入,對運算元狀態的任何變更,以及對輸入訊息的任何確認(包括在基於日誌的訊息代理中將消費者偏移量前移)。
這些事情要麼都原子地發生,要麼都不發生,但是它們不應當失去同步。如果這種方法聽起來很熟悉,那是因為我們在分散式事務和兩階段提交的上下文中討論過它(請參閱 “[恰好一次的訊息處理](/tw/ch8#sec_transactions_exactly_once)”)。
在 [第十章](/tw/ch10) 中,我們討論了分散式事務傳統實現中的問題(如 XA)。然而在限制更為嚴苛的環境中,也是有可能高效實現這種原子提交機制的。Google Cloud Dataflow[^66] [^75]、VoltDB[^77] 和 Apache Kafka[^78] [^79] 中都使用了這種方法。與 XA 不同,這些實現不會嘗試跨異構技術提供事務,而是透過在流處理框架中同時管理狀態變更與訊息傳遞來內化事務。事務協議的開銷可以透過在單個事務中處理多個輸入訊息來分攤。
#### 冪等性 {#sec_stream_idempotence}
我們的目標是丟棄任何失敗任務的部分輸出,以便能安全地重試,而不會生效兩次。分散式事務是實現這個目標的一種方式,而另一種方式是依賴 **冪等性(idempotence)**[^80]。
冪等操作是多次重複執行與單次執行效果相同的操作。例如,將鍵值儲存中的某個鍵設定為某個特定值是冪等的(再次寫入該值,只是用同樣的值替代),而遞增一個計數器不是冪等的(再次執行遞增意味著該值遞增兩次)。
即使一個操作不是天生冪等的,往往可以透過一些額外的元資料做成冪等的。例如,在使用來自 Kafka 的訊息時,每條訊息都有一個持久的、單調遞增的偏移量。將值寫入外部資料庫時可以將這個偏移量帶上,這樣你就可以判斷一條更新是不是已經執行過了,因而避免重複執行。
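The offset trick can be sketched with an in-memory stand-in for the external database. The class `OffsetTrackingCounter` is hypothetical; in a real system the value and the last applied offset would have to be written together atomically (for example in one database transaction), which this sketch only simulates by updating both fields in the same method.

```python
class OffsetTrackingCounter:
    """A counter whose increments become idempotent by remembering offsets."""

    def __init__(self):
        self.value = 0
        self.last_offset = -1  # offset of the last message already applied

    def apply(self, offset, increment):
        if offset <= self.last_offset:
            return False  # already applied: a redelivery, so skip it
        self.value += increment
        self.last_offset = offset
        return True

counter = OffsetTrackingCounter()
counter.apply(0, 5)
counter.apply(1, 2)
# After a crash, the consumer restarts and replays from offset 0
replayed = [counter.apply(0, 5), counter.apply(1, 2), counter.apply(2, 1)]
```

The check relies on messages being replayed in the same order with the same offsets, and on no other writer updating the counter concurrently.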
Storm 的 Trident 基於類似的想法來處理狀態。依賴冪等性意味著隱含了一些假設:重啟一個失敗的任務必須以相同的順序重播相同的訊息(基於日誌的訊息代理能做這些事),處理必須是確定性的,沒有其他節點能同時更新相同的值[^81] [^82]。
當從一個處理節點故障切換到另一個節點時,可能需要進行 **防護**(fencing,請參閱 “[領導者和鎖](/tw/ch9#sec_distributed_lock_fencing)”),以防止被假死節點干擾。儘管有這麼多注意事項,冪等操作是一種實現 **恰好一次語義** 的有效方式,僅需很小的額外開銷。
#### 失敗後重建狀態 {#sec_stream_state_fault_tolerance}
任何需要狀態的流處理 —— 例如,任何視窗聚合(例如計數器,平均值和直方圖)以及任何用於連線的表和索引,都必須確保在失敗之後能恢復其狀態。
一種選擇是將狀態儲存在遠端資料儲存中,並進行復制,然而正如在 “[流表連線(流擴充)](#sec_stream_table_joins)” 中所述,每個訊息都要查詢遠端資料庫可能會很慢。另一種方法是在流處理器本地儲存狀態,並定期複製。然後當流處理器從故障中恢復時,新任務可以讀取狀態副本,恢復處理而不丟失資料。
例如,Flink 定期捕獲運算元狀態的快照,並將它們寫入 HDFS 等持久儲存中[^75] [^76]。Kafka Streams 透過將狀態變更傳送到具有日誌壓縮功能的專用 Kafka 主題來複制狀態變更,這與資料變更捕獲類似[^83]。VoltDB 透過在多個節點上對每個輸入訊息進行冗餘處理來複制狀態(請參閱 “[真的序列執行](/tw/ch8#sec_transactions_serial)”)。
在某些情況下,甚至可能都不需要複製狀態,因為它可以從輸入流重建。例如,如果狀態是從相當短的視窗中聚合而成,則簡單地重播該視窗中的輸入事件可能是足夠快的。如果狀態是透過資料變更捕獲來維護的資料庫的本地副本,那麼也可以從日誌壓縮的變更流中重建資料庫(請參閱 “[日誌壓縮](#sec_stream_log_compaction)”)。
然而,所有這些權衡取決於底層基礎架構的效能特徵:在某些系統中,網路延遲可能低於磁碟訪問延遲,網路頻寬也可能與磁碟頻寬相當。沒有針對所有情況的普適理想權衡,隨著儲存和網路技術的發展,本地狀態與遠端狀態的優點也可能會互換。
## 本章小結 {#id332}
在本章中,我們討論了事件流,它們所服務的目的,以及如何處理它們。在某些方面,流處理非常類似於在 [第十一章](/tw/ch11) 中討論的批處理,不過是在無限的(永無止境的)流而不是固定大小的輸入上持續進行[^84]。從這個角度來看,訊息代理和事件日誌可以視作檔案系統的流式等價物。
我們花了一些時間比較兩種訊息代理:
AMQP/JMS 風格的訊息代理
: 代理將單條訊息分配給消費者,消費者在成功處理單條訊息後確認訊息。訊息被確認後從代理中刪除。這種方法適合作為一種非同步形式的 RPC(另請參閱 “[事件驅動的架構](/tw/ch5#sec_encoding_dataflow_msg)”),例如在任務佇列中,訊息處理的確切順序並不重要,而且訊息在處理完之後,不需要回頭重新讀取舊訊息。
基於日誌的訊息代理
: 代理將一個分割槽中的所有訊息分配給同一個消費者節點,並始終以相同的順序傳遞訊息。並行是透過分割槽實現的,消費者透過存檔最近處理訊息的偏移量來跟蹤工作進度。訊息代理將訊息保留在磁碟上,因此如有必要的話,可以回跳並重新讀取舊訊息。
基於日誌的方法與資料庫中的複製日誌(請參閱 [第六章](/tw/ch6))和日誌結構儲存引擎(請參閱 [第四章](/tw/ch4))有相似之處。我們看到,這種方法對於消費輸入流,併產生派生狀態或派生輸出資料流的系統而言特別適用。
就流的來源而言,我們討論了幾種可能性:使用者活動事件,定期讀數的感測器,和 Feed 資料(例如,金融中的市場資料)能夠自然地表示為流。我們發現將資料庫寫入視作流也是很有用的:我們可以捕獲變更日誌 —— 即對資料庫所做的所有變更的歷史記錄 —— 隱式地透過資料變更捕獲,或顯式地透過事件溯源。日誌壓縮允許流也能保有資料庫內容的完整副本。
將資料庫表示為流為系統整合帶來了很多強大機遇。透過消費變更日誌並將其應用至派生系統,你能使諸如搜尋索引、快取以及分析系統這類派生資料系統不斷保持更新。你甚至能從頭開始,透過讀取從創世至今的所有變更日誌,為現有資料建立全新的檢視。
像流一樣維護狀態以及訊息重播的基礎設施,是在各種流處理框架中實現流連線和容錯的基礎。我們討論了流處理的幾種目的,包括搜尋事件模式(複雜事件處理),計算分窗聚合(流分析),以及保證派生資料系統處於最新狀態(物化檢視)。
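We then discussed the difficulties of reasoning about time in a stream processor, including the distinction between processing time and event timestamps, and the problem of dealing with straggler events that arrive after you thought your window was complete.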
我們區分了流處理中可能出現的三種連線型別:
流流連線
: 兩個輸入流都由活動事件組成,而連線運算元在某個時間視窗內搜尋相關的事件。例如,它可能會將同一個使用者 30 分鐘內進行的兩個活動聯絡在一起。如果你想要找出一個流內的相關事件,連線的兩側輸入可能實際上都是同一個流(**自連線**,即 self-join)。
流表連線
: 一個輸入流由活動事件組成,另一個輸入流是資料庫變更日誌。變更日誌保證了資料庫的本地副本是最新的。對於每個活動事件,連線運算元將查詢資料庫,並輸出一個擴充套件的活動事件。
表表連線
: 兩個輸入流都是資料庫變更日誌。在這種情況下,一側的每一個變化都與另一側的最新狀態相連線。結果是兩表連線所得物化檢視的變更流。
最後,我們討論了在流處理中實現容錯和恰好一次語義的技術。與批處理一樣,我們需要放棄任何失敗任務的部分輸出。然而由於流處理長時間執行並持續產生輸出,所以不能簡單地丟棄所有的輸出。相反,可以使用更細粒度的恢復機制,基於微批次、存檔點、事務或冪等寫入。
### 參考文獻 {#references}
[^1]: Tyler Akidau, Robert Bradshaw, Craig Chambers, Slava Chernyak, Rafael J. Fernández-Moctezuma, Reuven Lax, Sam McVeety, Daniel Mills, Frances Perry, Eric Schmidt, and Sam Whittle. [The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing](https://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf). *Proceedings of the VLDB Endowment*, volume 8, issue 12, pages 1792--1803, August 2015. [doi:10.14778/2824032.2824076](https://doi.org/10.14778/2824032.2824076)
[^2]: Harold Abelson, Gerald Jay Sussman, and Julie Sussman. [*Structure and Interpretation of Computer Programs*](https://web.mit.edu/6.001/6.037/sicp.pdf), 2nd edition. MIT Press, 1996. ISBN: 978-0-262-51087-5, archived at [archive.org/details/sicp_20211010](https://archive.org/details/sicp_20211010)
[^3]: Patrick Th. Eugster, Pascal A. Felber, Rachid Guerraoui, and Anne-Marie Kermarrec. [The Many Faces of Publish/Subscribe](https://www.cs.ru.nl/~pieter/oss/manyfaces.pdf). *ACM Computing Surveys*, volume 35, issue 2, pages 114--131, June 2003. [doi:10.1145/857076.857078](https://doi.org/10.1145/857076.857078)
[^4]: Don Carney, Uğur Çetintemel, Mitch Cherniack, Christian Convey, Sangdon Lee, Greg Seidman, Michael Stonebraker, Nesime Tatbul, and Stan Zdonik. [Monitoring Streams -- A New Class of Data Management Applications](https://www.vldb.org/conf/2002/S07P02.pdf). At *28th International Conference on Very Large Data Bases* (VLDB), August 2002. [doi:10.1016/B978-155860869-6/50027-5](https://doi.org/10.1016/B978-155860869-6/50027-5)
[^5]: Matthew Sackman. [Pushing Back](https://wellquite.org/posts/lshift/pushing_back/). *wellquite.org*, May 2016. Archived at [perma.cc/3KCZ-RUFY](https://perma.cc/3KCZ-RUFY)
[^6]: Thomas Figg (tef). [how (not) to write a pipeline](https://web.archive.org/web/20250107135013/https://cohost.org/tef/post/1764930-how-not-to-write-a). *cohost.org*, June 2023. Archived at [perma.cc/A3V8-NYCM](https://perma.cc/A3V8-NYCM)
[^7]: Vicent Martí. [Brubeck, a statsd-Compatible Metrics Aggregator](https://github.blog/news-insights/the-library/brubeck/). *github.blog*, June 2015. Archived at [perma.cc/TP3Q-DJYM](https://perma.cc/TP3Q-DJYM)
[^8]: Seth Lowenberger. [MoldUDP64 Protocol Specification V 1.00](https://www.nasdaqtrader.com/content/technicalsupport/specifications/dataproducts/moldudp64.pdf). *nasdaqtrader.com*, July 2009. Archived at
[^9]: Ian Malpass. [Measure Anything, Measure Everything](https://codeascraft.com/2011/02/15/measure-anything-measure-everything/). *codeascraft.com*, February 2011. Archived at [archive.org](https://web.archive.org/web/20250820034209/https://www.etsy.com/codeascraft/measure-anything-measure-everything/)
[^10]: Dieter Plaetinck. [25 Graphite, Grafana and statsd Gotchas](https://grafana.com/blog/2016/03/03/25-graphite-grafana-and-statsd-gotchas/). *grafana.com*, March 2016. Archived at [perma.cc/3NP3-67U7](https://perma.cc/3NP3-67U7)
[^11]: Jeff Lindsay. [Web Hooks to Revolutionize the Web](https://progrium.github.io/blog/2007/05/03/web-hooks-to-revolutionize-the-web/). *progrium.com*, May 2007. Archived at [perma.cc/BF9U-XNX4](https://perma.cc/BF9U-XNX4)
[^12]: Jim N. Gray. [Queues Are Databases](https://arxiv.org/pdf/cs/0701158.pdf). Microsoft Research Technical Report MSR-TR-95-56, December 1995. Archived at [arxiv.org](https://arxiv.org/pdf/cs/0701158)
[^13]: Mark Hapner, Rich Burridge, Rahul Sharma, Joseph Fialli, Kate Stout, and Nigel Deakin. [JSR-343 Java Message Service (JMS) 2.0 Specification](https://jcp.org/en/jsr/detail?id=343). *jms-spec.java.net*, March 2013. Archived at [perma.cc/E4YG-46TA](https://perma.cc/E4YG-46TA)
[^14]: Sanjay Aiyagari, Matthew Arrott, Mark Atwell, Jason Brome, Alan Conway, Robert Godfrey, Robert Greig, Pieter Hintjens, John O'Hara, Matthias Radestock, Alexis Richardson, Martin Ritchie, Shahrokh Sadjadi, Rafael Schloming, Steven Shaw, Martin Sustrik, Carl Trieloff, Kim van der Riet, and Steve Vinoski. [AMQP: Advanced Message Queuing Protocol Specification](https://www.rabbitmq.com/resources/specs/amqp0-9-1.pdf). Version 0-9-1, November 2008. Archived at [perma.cc/6YJJ-GM9X](https://perma.cc/6YJJ-GM9X)
[^15]: [Architectural overview of Pub/Sub](https://cloud.google.com/pubsub/architecture). *cloud.google.com*, 2025. Archived at [perma.cc/VWF5-ABP4](https://perma.cc/VWF5-ABP4)
[^16]: Aris Tzoumas. [Lessons from scaling PostgreSQL queues to 100k events per second](https://www.rudderstack.com/blog/scaling-postgres-queue/). *rudderstack.com*, July 2025. Archived at [perma.cc/QD8C-VA4Y](https://perma.cc/QD8C-VA4Y)
[^17]: Robin Moffatt. [Kafka Connect Deep Dive -- Error Handling and Dead Letter Queues](https://www.confluent.io/blog/kafka-connect-deep-dive-error-handling-dead-letter-queues/). *confluent.io*, March 2019. Archived at [perma.cc/KQ5A-AB28](https://perma.cc/KQ5A-AB28)
[^18]: Dunith Danushka. [Message reprocessing: How to implement the dead letter queue](https://redpanda.com/blog/reliable-message-processing-with-dead-letter-queue). *redpanda.com*. Archived at [perma.cc/R7UB-WEWF](https://perma.cc/R7UB-WEWF)
[^19]: Damien Gasparina, Loic Greffier, and Sebastien Viale. [KIP-1034: Dead letter queue in Kafka Streams](https://cwiki.apache.org/confluence/display/KAFKA/KIP-1034%3A+Dead+letter+queue+in+Kafka+Streams). *cwiki.apache.org*, April 2024. Archived at [perma.cc/3VXV-QXAN](https://perma.cc/3VXV-QXAN)
[^20]: Jay Kreps, Neha Narkhede, and Jun Rao. [Kafka: A Distributed Messaging System for Log Processing](https://www.microsoft.com/en-us/research/wp-content/uploads/2017/09/Kafka.pdf). At *6th International Workshop on Networking Meets Databases* (NetDB), June 2011. Archived at [perma.cc/CSW7-TCQ5](https://perma.cc/CSW7-TCQ5)
[^21]: Jay Kreps. [Benchmarking Apache Kafka: 2 Million Writes Per Second (On Three Cheap Machines)](https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines). *engineering.linkedin.com*, April 2014. Archived at [archive.org](https://web.archive.org/web/20140921000742/https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines)
[^22]: Kartik Paramasivam. [How We're Improving and Advancing Kafka at LinkedIn](https://engineering.linkedin.com/apache-kafka/how-we_re-improving-and-advancing-kafka-linkedin). *engineering.linkedin.com*, September 2015. Archived at [perma.cc/3S3V-JCYJ](https://perma.cc/3S3V-JCYJ)
[^23]: Philippe Dobbelaere and Kyumars Sheykh Esmaili. [Kafka versus RabbitMQ: A comparative study of two industry reference publish/subscribe implementations](https://arxiv.org/abs/1709.00333). At *11th ACM International Conference on Distributed and Event-based Systems* (DEBS), June 2017. [doi:10.1145/3093742.3093908](https://doi.org/10.1145/3093742.3093908)
[^24]: Kate Holterhoff. [Why Message Queues Endure: A History](https://redmonk.com/kholterhoff/2024/12/12/why-message-queues-endure-a-history/). *redmonk.com*, December 2024. Archived at [perma.cc/6DX8-XK4W](https://perma.cc/6DX8-XK4W)
[^25]: Andrew Schofield. [KIP-932: Queues for Kafka](https://cwiki.apache.org/confluence/display/KAFKA/KIP-932%3A+Queues+for+Kafka). *cwiki.apache.org*, May 2023. Archived at [perma.cc/LBE4-BEMK](https://perma.cc/LBE4-BEMK)
[^26]: Jack Vanlightly. [The advantages of queues on logs](https://jack-vanlightly.com/blog/2023/10/2/the-advantages-of-queues-on-logs). *jack-vanlightly.com*, October 2023. Archived at [perma.cc/WJ7V-287K](https://perma.cc/WJ7V-287K)
[^27]: Jay Kreps. [The Log: What Every Software Engineer Should Know About Real-Time Data's Unifying Abstraction](https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying). *engineering.linkedin.com*, December 2013. Archived at [perma.cc/2JHR-FR64](https://perma.cc/2JHR-FR64)
[^28]: Andy Hattemer. [Change Data Capture is having a moment. Why?](https://materialize.com/blog/change-data-capture-is-having-a-moment-why/) *materialize.com*, September 2021. Archived at [perma.cc/AL37-P53C](https://perma.cc/AL37-P53C)
[^29]: Prem Santosh Udaya Shankar. [Streaming MySQL Tables in Real-Time to Kafka](https://engineeringblog.yelp.com/2016/08/streaming-mysql-tables-in-real-time-to-kafka.html). *engineeringblog.yelp.com*, August 2016. Archived at [perma.cc/5ZR3-2GVV](https://perma.cc/5ZR3-2GVV)
[^30]: Andreas Andreakis, Ioannis Papapanagiotou. [DBLog: A Watermark Based Change-Data-Capture Framework](https://arxiv.org/pdf/2010.12597). October 2020. Archived at [arxiv.org](https://arxiv.org/pdf/2010.12597)
[^31]: Jiri Pechanec. [Incremental Snapshots in Debezium](https://debezium.io/blog/2021/10/07/incremental-snapshots/). *debezium.io*, October 2021. Archived at [perma.cc/EQ8E-W6KQ](https://perma.cc/EQ8E-W6KQ)
[^32]: Debezium maintainers. [Debezium Connector for Cassandra](https://debezium.io/documentation/reference/stable/connectors/cassandra.html). *debezium.io*. Archived at [perma.cc/WR6K-EKMD](https://perma.cc/WR6K-EKMD)
[^33]: Neha Narkhede. [Announcing Kafka Connect: Building Large-Scale Low-Latency Data Pipelines](https://www.confluent.io/blog/announcing-kafka-connect-building-large-scale-low-latency-data-pipelines/). *confluent.io*, February 2016. Archived at [perma.cc/8WXJ-L6GF](https://perma.cc/8WXJ-L6GF)
[^34]: Chris Riccomini. [Kafka change data capture breaks database encapsulation](https://cnr.sh/posts/2018-11-05-kafka-change-data-capture-breaks-database-encapsulation/). *cnr.sh*, November 2018. Archived at [perma.cc/P572-9MKF](https://perma.cc/P572-9MKF)
[^35]: Gunnar Morling. ["Change Data Capture Breaks Encapsulation". Does it, though?](https://www.decodable.co/blog/change-data-capture-breaks-encapsulation-does-it-though) *decodable.co*, November 2023. Archived at [perma.cc/YX2P-WNWR](https://perma.cc/YX2P-WNWR)
[^36]: Gunnar Morling. [Revisiting the Outbox Pattern](https://www.decodable.co/blog/revisiting-the-outbox-pattern). *decodable.co*, October 2024. Archived at [perma.cc/M5ZL-RPS9](https://perma.cc/M5ZL-RPS9)
[^37]: Ashish Gupta and Inderpal Singh Mumick. [Maintenance of Materialized Views: Problems, Techniques, and Applications](https://web.archive.org/web/20220407025818id_/http://sites.computer.org/debull/95JUN-CD.pdf#page=5). *IEEE Data Engineering Bulletin*, volume 18, issue 2, pages 3--18, June 1995. Archived at [archive.org](https://web.archive.org/web/20220407025818id_/http://sites.computer.org/debull/95JUN-CD.pdf#page=5)
[^38]: Mihai Budiu, Tej Chajed, Frank McSherry, Leonid Ryzhyk, Val Tannen. [DBSP: Incremental Computation on Streams and Its Applications to Databases](https://sigmodrecord.org/publications/sigmodRecord/2403/pdfs/20_dbsp-budiu.pdf). *SIGMOD Record*, volume 53, issue 1, pages 87--95, March 2024. [doi:10.1145/3665252.3665271](https://doi.org/10.1145/3665252.3665271)
[^39]: Jim Gray and Andreas Reuter. [*Transaction Processing: Concepts and Techniques*](https://learning.oreilly.com/library/view/transaction-processing/9780080519555/). Morgan Kaufmann, 1992. ISBN: 9781558601901
[^40]: Martin Kleppmann. [Accounting for Computer Scientists](https://martin.kleppmann.com/2011/03/07/accounting-for-computer-scientists.html). *martin.kleppmann.com*, March 2011. Archived at [perma.cc/9EGX-P38N](https://perma.cc/9EGX-P38N)
[^41]: Pat Helland. [Immutability Changes Everything](https://www.cidrdb.org/cidr2015/Papers/CIDR15_Paper16.pdf). At *7th Biennial Conference on Innovative Data Systems Research* (CIDR), January 2015.
[^42]: Martin Kleppmann. [*Making Sense of Stream Processing*](https://martin.kleppmann.com/papers/stream-processing.pdf). Report, O'Reilly Media, May 2016. Archived at [perma.cc/RAY4-JDVX](https://perma.cc/RAY4-JDVX)
[^43]: Kartik Paramasivam. [Stream Processing Hard Problems -- Part 1: Killing Lambda](https://engineering.linkedin.com/blog/2016/06/stream-processing-hard-problems-part-1-killing-lambda). *engineering.linkedin.com*, June 2016. Archived at [archive.org](https://web.archive.org/web/20240621211312/https://www.linkedin.com/blog/engineering/data-streaming-processing/stream-processing-hard-problems-part-1-killing-lambda)
[^44]: Stéphane Derosiaux. [CQRS: What? Why? How?](https://sderosiaux.medium.com/cqrs-what-why-how-945543482313) *sderosiaux.medium.com*, September 2019. Archived at [perma.cc/FZ3U-HVJ4](https://perma.cc/FZ3U-HVJ4)
[^45]: Baron Schwartz. [Immutability, MVCC, and Garbage Collection](https://web.archive.org/web/20220122020806/http://www.xaprb.com/blog/2013/12/28/immutability-mvcc-and-garbage-collection/). *xaprb.com*, December 2013. Archived at [archive.org](https://web.archive.org/web/20220122020806/http://www.xaprb.com/blog/2013/12/28/immutability-mvcc-and-garbage-collection/)
[^46]: Daniel Eloff, Slava Akhmechet, Jay Kreps, et al. [Re: Turning the Database Inside-out with Apache Samza](https://news.ycombinator.com/item?id=9145197). Hacker News discussion, *news.ycombinator.com*, March 2015. Archived at [perma.cc/ML9E-JC83](https://perma.cc/ML9E-JC83)
[^47]: [Datomic Documentation: Excision](https://docs.datomic.com/operation/excision.html). Cognitect, Inc., *docs.datomic.com*. Archived at [perma.cc/J5QQ-SH32](https://perma.cc/J5QQ-SH32)
[^48]: [Fossil Documentation: Deleting Content from Fossil](https://fossil-scm.org/home/doc/trunk/www/shunning.wiki). *fossil-scm.org*, 2025. Archived at [perma.cc/DS23-GTNG](https://perma.cc/DS23-GTNG)
[^49]: Jay Kreps. [The irony of distributed systems is that data loss is really easy but deleting data is surprisingly hard.](https://x.com/jaykreps/status/582580836425330688) *x.com*, March 2015. Archived at [perma.cc/7RRZ-V7B7](https://perma.cc/7RRZ-V7B7)
[^50]: Brent Robinson. [Crypto shredding: How it can solve modern data retention challenges](https://medium.com/@brentrobinson5/crypto-shredding-how-it-can-solve-modern-data-retention-challenges-da874b01745b). *medium.com*, January 2019. Archived at
[^51]: Matthew D. Green and Ian Miers. [Forward Secure Asynchronous Messaging from Puncturable Encryption](https://isi.jhu.edu/~mgreen/forward_sec.pdf). At *IEEE Symposium on Security and Privacy*, May 2015. [doi:10.1109/SP.2015.26](https://doi.org/10.1109/SP.2015.26)
[^52]: David C. Luckham. [What's the Difference Between ESP and CEP?](https://complexevents.com/2020/06/15/whats-the-difference-between-esp-and-cep-2/) *complexevents.com*, June 2019. Archived at [perma.cc/E7PZ-FDEF](https://perma.cc/E7PZ-FDEF)
[^53]: Arvind Arasu, Shivnath Babu, and Jennifer Widom. [The CQL Continuous Query Language: Semantic Foundations and Query Execution](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/cql.pdf). *The VLDB Journal*, volume 15, issue 2, pages 121--142, June 2006. [doi:10.1007/s00778-004-0147-z](https://doi.org/10.1007/s00778-004-0147-z)
[^54]: Julian Hyde. [Data in Flight: How Streaming SQL Technology Can Help Solve the Web 2.0 Data Crunch](https://queue.acm.org/detail.cfm?id=1667562). *ACM Queue*, volume 7, issue 11, December 2009. [doi:10.1145/1661785.1667562](https://doi.org/10.1145/1661785.1667562)
[^55]: Philippe Flajolet, Éric Fusy, Olivier Gandouet, and Frédéric Meunier. [HyperLogLog: The Analysis of a Near-Optimal Cardinality Estimation Algorithm](https://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf). At *Conference on Analysis of Algorithms* (AofA), June 2007. [doi:10.46298/dmtcs.3545](https://doi.org/10.46298/dmtcs.3545)
[^56]: Jay Kreps. [Questioning the Lambda Architecture](https://www.oreilly.com/ideas/questioning-the-lambda-architecture). *oreilly.com*, July 2014. Archived at [perma.cc/2WY5-HC8Y](https://perma.cc/2WY5-HC8Y)
[^57]: Ian Reppel. [An Overview of Apache Streaming Technologies](https://ianreppel.org/an-overview-of-apache-streaming-technologies/). *ianreppel.org*, March 2016. Archived at [perma.cc/BB3E-QJLW](https://perma.cc/BB3E-QJLW)
[^58]: Jay Kreps. [Why Local State is a Fundamental Primitive in Stream Processing](https://www.oreilly.com/ideas/why-local-state-is-a-fundamental-primitive-in-stream-processing). *oreilly.com*, July 2014. Archived at [perma.cc/P8HU-R5LA](https://perma.cc/P8HU-R5LA)
[^59]: RisingWave Labs. [Deep Dive Into the RisingWave Stream Processing Engine - Part 2: Computational Model](https://risingwave.com/blog/deep-dive-into-the-risingwave-stream-processing-engine-part-2-computational-model/). *risingwave.com*, November 2023. Archived at [perma.cc/LM74-XDEL](https://perma.cc/LM74-XDEL)
[^60]: Frank McSherry, Derek G. Murray, Rebecca Isaacs, and Michael Isard. [Differential dataflow](https://www.cidrdb.org/cidr2013/Papers/CIDR13_Paper111.pdf). At *6th Biennial Conference on Innovative Data Systems Research* (CIDR), January 2013.
[^61]: Andy Hattemer. [Incremental Computation in the Database](https://materialize.com/guides/incremental-computation/). *materialize.com*, March 2020. Archived at [perma.cc/AL94-YVRN](https://perma.cc/AL94-YVRN)
[^62]: Shay Banon. [Percolator](https://www.elastic.co/blog/percolator). *elastic.co*, February 2011. Archived at [perma.cc/LS5R-4FQX](https://perma.cc/LS5R-4FQX)
[^63]: Alan Woodward and Martin Kleppmann. [Real-Time Full-Text Search with Luwak and Samza](https://martin.kleppmann.com/2015/04/13/real-time-full-text-search-luwak-samza.html). *martin.kleppmann.com*, April 2015. Archived at [perma.cc/2U92-Q7R4](https://perma.cc/2U92-Q7R4)
[^64]: Tyler Akidau. [The World Beyond Batch: Streaming 102](https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102). *oreilly.com*, January 2016. Archived at [perma.cc/4XF9-8M2K](https://perma.cc/4XF9-8M2K)
[^65]: Stephan Ewen. [Streaming Analytics with Apache Flink](https://www.slideshare.net/slideshow/advanced-streaming-analytics-with-apache-flink-and-apache-kafka-stephan-ewen/61920008). At *Kafka Summit*, April 2016. Archived at [perma.cc/QBQ4-F9MR](https://perma.cc/QBQ4-F9MR)
[^66]: Tyler Akidau, Alex Balikov, Kaya Bekiroğlu, Slava Chernyak, Josh Haberman, Reuven Lax, Sam McVeety, Daniel Mills, Paul Nordstrom, and Sam Whittle. [MillWheel: Fault-Tolerant Stream Processing at Internet Scale](https://www.vldb.org/pvldb/vol6/p1033-akidau.pdf). *Proceedings of the VLDB Endowment*, volume 6, issue 11, pages 1033--1044, August 2013. [doi:10.14778/2536222.2536229](https://doi.org/10.14778/2536222.2536229)
[^67]: Alex Dean. [Improving Snowplow's Understanding of Time](https://snowplow.io/blog/improving-snowplows-understanding-of-time). *snowplow.io*, September 2015. Archived at [perma.cc/6CT9-Z3Q2](https://perma.cc/6CT9-Z3Q2)
[^68]: [Azure Stream Analytics: Windowing functions](https://learn.microsoft.com/en-gb/stream-analytics-query/windowing-azure-stream-analytics). Microsoft Azure Reference, *learn.microsoft.com*, July 2025. Archived at [archive.org](https://web.archive.org/web/20250901140013/https://learn.microsoft.com/en-gb/stream-analytics-query/windowing-azure-stream-analytics)
[^69]: Rajagopal Ananthanarayanan, Venkatesh Basker, Sumit Das, Ashish Gupta, Haifeng Jiang, Tianhao Qiu, Alexey Reznichenko, Deomid Ryabkov, Manpreet Singh, and Shivakumar Venkataraman. [Photon: Fault-Tolerant and Scalable Joining of Continuous Data Streams](https://research.google.com/pubs/archive/41529.pdf). At *ACM International Conference on Management of Data* (SIGMOD), June 2013. [doi:10.1145/2463676.2465272](https://doi.org/10.1145/2463676.2465272)
[^70]: Ben Kirwin. [Doing the Impossible: Exactly-Once Messaging Patterns in Kafka](https://ben.kirw.in/2014/11/28/kafka-patterns/). *ben.kirw.in*, November 2014. Archived at [perma.cc/A5QL-QRX7](https://perma.cc/A5QL-QRX7)
[^71]: Pat Helland. [Data on the Outside Versus Data on the Inside](https://www.cidrdb.org/cidr2005/papers/P12.pdf). At *2nd Biennial Conference on Innovative Data Systems Research* (CIDR), January 2005.
[^72]: Ralph Kimball and Margy Ross. [*The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling*](https://learning.oreilly.com/library/view/the-data-warehouse/9781118530801/), 3rd edition. John Wiley & Sons, 2013. ISBN: 978-1-118-53080-1
[^73]: Viktor Klang. [I'm coining the phrase 'effectively-once' for message processing with at-least-once + idempotent operations](https://x.com/viktorklang/status/789036133434978304). *x.com*, October 2016. Archived at [perma.cc/7DT9-TDG2](https://perma.cc/7DT9-TDG2)
[^74]: Matei Zaharia, Tathagata Das, Haoyuan Li, Scott Shenker, and Ion Stoica. [Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters](https://www.usenix.org/system/files/conference/hotcloud12/hotcloud12-final28.pdf). At *4th USENIX Conference in Hot Topics in Cloud Computing* (HotCloud), June 2012.
[^75]: Kostas Tzoumas, Stephan Ewen, and Robert Metzger. [High-Throughput, Low-Latency, and Exactly-Once Stream Processing with Apache Flink](https://web.archive.org/web/20250429165534/https://www.ververica.com/blog/high-throughput-low-latency-and-exactly-once-stream-processing-with-apache-flink). *ververica.com*, August 2015. Archived at [archive.org](https://web.archive.org/web/20250429165534/https://www.ververica.com/blog/high-throughput-low-latency-and-exactly-once-stream-processing-with-apache-flink)
[^76]: Paris Carbone, Gyula Fóra, Stephan Ewen, Seif Haridi, and Kostas Tzoumas. [Lightweight Asynchronous Snapshots for Distributed Dataflows](https://arxiv.org/abs/1506.08603). arXiv:1506.08603 [cs.DC], June 2015.
[^77]: Ryan Betts and John Hugg. [*Fast Data: Smart and at Scale*](https://www.voltactivedata.com/wp-content/uploads/2017/03/hv-ebook-fast-data-smart-and-at-scale.pdf). Report, O'Reilly Media, October 2015. Archived at [perma.cc/VQ6S-XQQY](https://perma.cc/VQ6S-XQQY)
[^78]: Neha Narkhede and Guozhang Wang. [Exactly-Once Semantics Are Possible: Here's How Kafka Does It](https://www.confluent.io/blog/exactly-once-semantics-are-possible-heres-how-apache-kafka-does-it/). *confluent.io*, June 2019. Archived at [perma.cc/Q2AU-Q2ED](https://perma.cc/Q2AU-Q2ED)
[^79]: Jason Gustafson, Flavio Junqueira, Apurva Mehta, Sriram Subramanian, and Guozhang Wang. [KIP-98 -- Exactly Once Delivery and Transactional Messaging](https://cwiki.apache.org/confluence/display/KAFKA/KIP-98+-+Exactly+Once+Delivery+and+Transactional+Messaging). *cwiki.apache.org*, November 2016. Archived at [perma.cc/95PT-RCTG](https://perma.cc/95PT-RCTG)
[^80]: Pat Helland. [Idempotence Is Not a Medical Condition](https://dl.acm.org/doi/pdf/10.1145/2160718.2160734). *Communications of the ACM*, volume 55, issue 5, page 56, May 2012. [doi:10.1145/2160718.2160734](https://doi.org/10.1145/2160718.2160734)
[^81]: Jay Kreps. [Re: Trying to Achieve Deterministic Behavior on Recovery/Rewind](https://lists.apache.org/thread/n0sz6zld72nvjtnytv09pxc57mdcf9ft). Email to *samza-dev* mailing list, September 2014. Archived at [perma.cc/7DPD-GJNL](https://perma.cc/7DPD-GJNL)
[^82]: E. N. (Mootaz) Elnozahy, Lorenzo Alvisi, Yi-Min Wang, and David B. Johnson. [A Survey of Rollback-Recovery Protocols in Message-Passing Systems](https://www.cs.utexas.edu/~lorenzo/papers/SurveyFinal.pdf). *ACM Computing Surveys*, volume 34, issue 3, pages 375--408, September 2002. [doi:10.1145/568522.568525](https://doi.org/10.1145/568522.568525)
[^83]: Adam Warski. [Kafka Streams -- How Does It Fit the Stream Processing Landscape?](https://softwaremill.com/kafka-streams-how-does-it-fit-stream-landscape/) *softwaremill.com*, June 2016. Archived at [perma.cc/WQ5Q-H2J2](https://perma.cc/WQ5Q-H2J2)
[^84]: Stephan Ewen, Fabian Hueske, and Xiaowei Jiang. [Batch as a Special Case of Streaming and Alibaba's contribution of Blink](https://flink.apache.org/2019/02/13/batch-as-a-special-case-of-streaming-and-alibabas-contribution-of-blink/). *flink.apache.org*, February 2019. Archived at [perma.cc/A529-SKA9](https://perma.cc/A529-SKA9)
================================================
FILE: content/tw/ch13.md
================================================
---
title: "Chapter 13: A Philosophy of Streaming Systems"
linkTitle: "13. A Philosophy of Streaming Systems"
weight: 313
breadcrumbs: false
---

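> If the highest aim of a captain were to preserve his ship, he would keep it in port forever.
>
> -- St. Thomas Aquinas, *Summa Theologica* (1265-1274)

[Chapter 2](/tw/ch2) discussed the goals of building applications and systems that are **reliable**, **scalable**, and **maintainable**. These themes have run through the whole book: for example, we discussed many fault-tolerance algorithms that improve reliability, partitioning approaches that improve scalability, and mechanisms for evolution and abstraction that improve maintainability.
In this chapter we bring these ideas together and, building in particular on the streaming/event-driven architecture ideas from [Chapter 12](/tw/ch12), propose a philosophy of application development that meets these goals. Compared with earlier chapters, this chapter is more opinionated: rather than comparing several approaches side by side, it develops one particular design philosophy in depth.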
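## Data Integration {#sec_future_integration}
A recurring theme in this book is that for any given problem there are several solutions, all with different pros, cons, and trade-offs. For example, when discussing storage engines in [Chapter 4](/tw/ch4), we saw log-structured storage, B-trees, and column-oriented storage. When discussing replication in [Chapter 6](/tw/ch6), we saw single-leader, multi-leader, and leaderless approaches.
If you have a problem such as "I want to store some data and look it up again later," there is no one right solution, but many different approaches that are each appropriate in different circumstances. A software implementation typically has to pick one particular approach. It is hard enough to make a single code path robust and performant; trying to do everything in one piece of software almost guarantees that the implementation will be poor.
So the best choice of software tool also depends on the circumstances. Every piece of software, even a so-called "general-purpose" database, is designed for a particular usage pattern.
Faced with a bewildering array of alternatives, the first challenge is to figure out the mapping between software products and the circumstances in which they are a good fit. Vendors are understandably reluctant to tell you about the workloads for which their software is poorly suited, but hopefully the previous chapters have equipped you with some questions to ask in order to read between the lines and better understand the trade-offs.
However, even if you perfectly understand the mapping between tools and the circumstances they suit, there is another challenge: in complex applications, data is often used in several different ways. There is unlikely to be one piece of software that suits **all** the different ways in which the data is used, so you inevitably end up having to cobble together several different pieces of software to provide your application's functionality.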
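### Combining Specialized Tools by Deriving Data {#id442}
For example, it is common to need to integrate an OLTP database with a full-text search index in order to handle queries for arbitrary keywords. Although some databases (such as PostgreSQL) include a full-text indexing feature that can be sufficient for simple applications[^1], more sophisticated search facilities require specialist information retrieval tools. Conversely, search indexes are generally not well suited as a durable system of record, so many applications need to combine the two different tools to satisfy all of their requirements.
We touched on the problem of integrating data systems in "[Keeping Systems in Sync](/tw/ch12#sec_stream_sync)". As the number of different representations of the data grows, the integration problem becomes harder. Besides the database and the search index, you may need to keep copies of the data in analytics systems (data warehouses, or batch and stream processing systems); maintain caches or denormalized versions of the data derived from the original; feed the data into machine learning, classification, ranking, or recommendation systems; or send notifications based on changes to the data.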
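#### Reasoning About Dataflows {#id443}
When copies of the same data need to be maintained in several storage systems to satisfy different access patterns, you need to be very clear about inputs and outputs: which data is written first, and which representations are derived from which sources? How do you get all the data into all the right places, in the right formats?
For example, you might first write data to a **system of record** database, capture the changes made to that database (see "[Change Data Capture](/tw/ch12#sec_stream_cdc)"), and then apply the changes to the search index in the same order. If change data capture (CDC) is the only way of updating the index, you can be confident that the index is entirely derived from the system of record, and therefore consistent with it (barring bugs in the software). Writing to the database is the only way of supplying new input to this system.
Allowing the application to write directly to both the search index and the database introduces the problem shown in [Figure 12-4](/tw/ch12#fig_stream_dual_write_race), in which two clients concurrently send conflicting writes and the two storage systems process them in a different order. In this case, neither the database nor the search index is "in charge" of determining the order of writes, so they may make contradictory decisions and become permanently inconsistent with each other.
If you can funnel all user input through a single system that decides on an ordering for all writes, it becomes much easier to derive other representations of the data by processing the writes in the same order. This is an application of the state machine replication approach that we saw in "[Total Order Broadcast](/tw/ch10#sec_consistency_total_order)". Whether you use change data capture or an event sourcing log matters less than the principle of deciding on a total order.
Systems that update derived data from an event log can often be made **deterministic** and **idempotent** (see "[Idempotence](/tw/ch12#sec_stream_idempotence)"), which makes it quite easy to recover from faults.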
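To make the idea of a deterministic, idempotent consumer concrete, here is a minimal sketch in plain Python. The `DerivedIndex` class, the shape of the change records, and the use of the list position as the log offset are all assumptions made for illustration; they do not correspond to any particular CDC tool.

```python
class DerivedIndex:
    """A toy search index derived from an ordered change log (hypothetical)."""

    def __init__(self):
        self.docs = {}             # doc_id -> text
        self.applied_offset = -1   # highest log offset applied so far

    def apply(self, offset, change):
        # Idempotence: ignore changes that were already applied,
        # for example because the log was replayed after a crash.
        if offset <= self.applied_offset:
            return False
        op, doc_id, text = change
        if op == "upsert":
            self.docs[doc_id] = text
        elif op == "delete":
            self.docs.pop(doc_id, None)
        self.applied_offset = offset
        return True


# The log is the single source of ordering; offsets are list positions here.
log = [("upsert", 1, "hello"), ("upsert", 1, "hello world"), ("delete", 2, None)]

index = DerivedIndex()
for offset, change in enumerate(log):
    index.apply(offset, change)

# Simulated recovery: replaying the whole log again changes nothing.
for offset, change in enumerate(log):
    index.apply(offset, change)
```

Because the index remembers the last offset it applied, replaying the log from any earlier point leaves it in the same state, which is what makes recovery after a fault straightforward.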
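#### Derived Data Versus Distributed Transactions {#sec_future_derived_vs_transactions}
The classic approach for keeping different data systems consistent with each other involves distributed transactions, as discussed in "[Atomic Commit and Two-Phase Commit](/tw/ch8#sec_transactions_2pc)". How does the approach of using derived data systems compare with distributed transactions?
At an abstract level, they achieve a similar goal by different means. Distributed transactions decide on an ordering of writes by using **locks** for mutual exclusion (see "[Two-Phase Locking](/tw/ch8#sec_transactions_2pl)"), while CDC and event sourcing use a log for ordering. Distributed transactions use atomic commit to ensure that changes take effect exactly once, while log-based systems are often based on **deterministic retry** and **idempotence**.
The biggest difference is that transaction systems usually provide [linearizability](/tw/ch10#sec_consistency_linearizability), which implies useful guarantees such as [reading your own writes](/tw/ch6#sec_replication_ryw). Derived data systems, on the other hand, are often updated asynchronously, so by default they do not offer the same timing guarantees.
Within limited environments that are willing to pay the cost of distributed transactions, they have been used successfully. However, I think that XA has poor fault tolerance and performance characteristics (see "[Distributed Transactions in Practice](/tw/ch8#sec_transactions_xa)"), which severely limit its usefulness. I believe that it is possible to design a better protocol for distributed transactions, but getting such a protocol widely adopted by existing tools would be challenging, and is unlikely to happen soon.
In the absence of widespread support for a good distributed transaction protocol, I believe that log-based derived data is the most promising approach for integrating different data systems. However, guarantees such as reading your own writes are useful, and I do not think it is productive to tell everyone "eventual consistency is inevitable: suck it up and learn to deal with it" (at least not without good guidance on **how** to deal with it).
Later in this chapter we will discuss some approaches for implementing stronger guarantees on top of asynchronously derived systems, and work toward a middle ground between distributed transactions and asynchronous log-based systems.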
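#### The Limits of Total Ordering {#id335}
With systems that are small enough, constructing a totally ordered event log is entirely feasible (as demonstrated by the popularity of databases with single-leader replication, which construct precisely such a log). However, as systems are scaled toward bigger and more complex workloads, limitations begin to emerge:
* In most cases, constructing a totally ordered log requires all events to pass through a **single leader node** that decides on the ordering. If the throughput of events is greater than a single machine can handle, you need to partition the log across multiple machines (see "[Partitioned Logs](/tw/ch12#sec_stream_log)"). The order of events in two different partitions is then ambiguous.
* If the servers are spread across multiple **geographically distributed** datacenters, for example in order to tolerate an entire datacenter going offline, you typically have a separate leader in each datacenter, because network delays make synchronous cross-datacenter coordination inefficient (see "[Multi-Leader Replication](/tw/ch6#sec_replication_multi_leader)"). This implies an undefined ordering of events that originate in two different datacenters.
* When applications are deployed as microservices (see "[Dataflow Through Services: REST and RPC](/tw/ch5#sec_encoding_dataflow_rpc)"), a common design choice is to deploy each service and its durable state as an independent unit, with no durable state shared between services. When two events originate in different services, there is no defined order for those events.
* Some applications maintain client-side state that is updated immediately on user input (without waiting for confirmation from a server), and even continue to work offline (see "[Clients with Offline Operation](/tw/ch6#sec_replication_offline_clients)"). With such applications, clients and servers are very likely to see events in different orders.
In formal terms, deciding on a total order of events is known as **total order broadcast**, which is equivalent to **consensus** (see "[Consensus Algorithms and Total Order Broadcast](/tw/ch10#sec_consistency_faces)"). Most consensus algorithms are designed for situations in which the throughput of a single node is sufficient to process the entire stream of events, and these algorithms do not provide a mechanism for multiple nodes to share the work of ordering the events. It is still an open research problem to design consensus algorithms that can scale beyond the throughput of a single node and that work well in a geographically distributed setting.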
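#### Ordering Events to Capture Causality {#sec_future_capture_causality}
In cases where there is no causal link between events, the lack of a total order is not a big problem, since concurrent events can be ordered arbitrarily. Some other cases are easy to handle: for example, when there are multiple updates of the same object, they can be totally ordered by routing all updates for a particular object ID to the same log partition. However, causal dependencies sometimes arise in more subtle ways (see "[Ordering and Causality](/tw/ch10#sec_consistency_logical)").
For example, consider a social networking service, and two users who were in a relationship but have just broken up. One of the users removes the other as a friend, and then sends a message to their remaining friends complaining about their ex-partner. The user's intention is that their ex-partner should not see the rude message, since the message was sent after the friend status was revoked.
However, in a system that stores friendship status in one place and messages in another place, the causal dependency between the **unfriend** event and the **message-send** event may be lost. If the causal dependency is not captured, a service that sends notifications about new messages may process the **message-send** event before the **unfriend** event, and thus incorrectly send a notification to the ex-partner.
In this example, the notifications are effectively a join between the messages and the friend list, making it related to the timing issues of joins that we discussed previously (see "[Time-Dependence of Joins](/tw/ch12#sec_stream_join_time)"). Unfortunately, there does not seem to be a simple answer to this problem[^2] [^3]. Starting points include:
* Logical timestamps can provide total ordering without coordination (see "[Sequence Number Ordering](/tw/ch10#sec_consistency_logical)"), so they may help in cases where total order broadcast is not feasible. However, they still require recipients to handle events that are delivered out of order, and they require additional metadata to be passed around.
* If you can log an event to record the state of the system that the user saw before making a decision, and give that event a unique identifier, then any later events can reference that event identifier in order to record the causal dependency[^4]. We will return to this idea in "[Reads Are Events Too](#sec_future_read_events)".
* Conflict resolution algorithms (see "[Automatic Conflict Resolution](/tw/ch6#automatic-conflict-resolution)") help with processing events that are delivered in an unexpected order. They are useful for maintaining state, but they do not help if actions have external side effects (such as sending a notification to a user).
Perhaps, over time, patterns for application development will emerge that allow causal dependencies to be captured efficiently, and derived state to be maintained correctly, without forcing all events to go through the bottleneck of total order broadcast.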
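As an illustration of the first starting point, the following sketch uses Lamport-style logical timestamps to order the unfriend and message-send events from the example. The class and the service names are hypothetical, and the sketch assumes that the timestamp of the unfriend event is passed along (as extra metadata) with the user's next request to the message service.

```python
class LamportClock:
    """A logical clock; timestamps are (counter, node_id) pairs."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.counter = 0

    def tick(self):
        """Assign a timestamp to a local event."""
        self.counter += 1
        return (self.counter, self.node_id)

    def observe(self, timestamp):
        """Merge a timestamp carried in from another node."""
        self.counter = max(self.counter, timestamp[0])


friends = LamportClock("friends-service")    # hypothetical service names
messages = LamportClock("message-service")

unfriend_ts = friends.tick()
messages.observe(unfriend_ts)   # metadata passed along with the next request
send_ts = messages.tick()

# A consumer that receives the events out of order can sort them by timestamp;
# ties on the counter are broken by node_id, giving a total order.
events = [(send_ts, "message-send"), (unfriend_ts, "unfriend")]
ordered = [name for _, name in sorted(events)]
```

Sorting restores an order consistent with causality, but the notification service would still have to decide how long to wait before acting, since it cannot know from the timestamps alone whether an earlier event is still in flight.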
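### Batch and Stream Processing {#sec_future_batch_streaming}
I would say that the goal of data integration is to make sure that data ends up in the right form in all the right places. Doing so requires consuming inputs, transforming, joining, filtering, aggregating, training models, evaluating, and eventually writing to the appropriate outputs. Batch and stream processors are the tools for achieving this goal.
The outputs of batch and stream processes are derived datasets such as search indexes, materialized views, recommendations to show to users, aggregate metrics, and so on (see "[The Output of Batch Workflows](/tw/ch11#sec_batch_output)" and "[Uses of Stream Processing](/tw/ch12#sec_stream_uses)").
As we saw in [Chapter 11](/tw/ch11) and [Chapter 12](/tw/ch12), batch and stream processing have a lot of principles in common, and the main fundamental difference is that stream processors operate on unbounded datasets whereas batch process inputs are of a known, finite size.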
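#### Maintaining Derived State {#id446}
Batch processing has a quite strong functional flavor (even if the code is not written in a functional programming language): it encourages deterministic, pure functions whose output depends only on the input and which have no side effects other than the explicit outputs, treating inputs as immutable and outputs as append-only. Stream processing is similar, but it extends operators to allow managed, fault-tolerant state (see "[Rebuilding State After a Failure](/tw/ch12#sec_stream_state_fault_tolerance)").
The principle of deterministic functions with well-defined inputs and outputs is not only good for fault tolerance (see "[Idempotence](/tw/ch12#sec_stream_idempotence)"), but also simplifies reasoning about the dataflows in an organization[^7]. No matter whether the derived data is a search index, a statistical model, or a cache, it is helpful to think in terms of data pipelines that derive one thing from another, pushing state changes in one system through functional application code and applying the effects to derived systems.
In principle, derived data systems could be maintained synchronously, just as a relational database updates secondary indexes synchronously within the same transaction as writes to the table being indexed. However, asynchrony is what makes systems based on event logs robust: it allows a fault in one part of the system to be contained locally, whereas distributed transactions abort if any one participant fails, so they tend to amplify failures by spreading them to the rest of the system (see "[Limitations of Distributed Transactions](/tw/ch8#sec_transactions_xa)").
We saw in "[Partitioning and Secondary Indexes](/tw/ch7#sec_sharding_secondary_indexes)" that secondary indexes often cross partition boundaries. A partitioned system with secondary indexes either needs to send writes to multiple partitions (if the index is term-partitioned) or send reads to all partitions (if the index is document-partitioned). Such cross-partition communication is also most reliable and scalable if the index is maintained asynchronously[^8] (see also "[Multi-Partition Data Processing](#sec_future_unbundled_multi_shard)").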
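#### Reprocessing Data for Application Evolution {#sec_future_reprocessing}
When maintaining derived data, batch and stream processing are both useful. Stream processing allows changes in the input to be reflected in derived views with low delay, whereas batch processing allows large amounts of accumulated historical data to be reprocessed in order to derive new views onto an existing dataset.
In particular, reprocessing existing data provides a good mechanism for maintaining a system, evolving it to support new features and changed requirements (see [Chapter 4](/tw/ch4)). Without reprocessing, schema evolution is limited to simple changes like adding a new optional field to a record, or adding a new type of record. This is the case both in a schema-on-write and in a schema-on-read context (see "[Schema Flexibility in the Document Model](/tw/ch3#sec_datamodels_schema_flexibility)"). On the other hand, with reprocessing it is possible to restructure a dataset into a completely different model in order to better serve new requirements.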
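> ### Schema Migrations on Railways
>
> Large-scale "schema migrations" occur in noncomputer systems as well. For example, in the early days of railway building in 19th-century England there were various competing standards for the gauge (the distance between the two rails). Trains built for one gauge could not run on tracks of another gauge, which restricted the possible interconnections in the train network[^9].
>
> After a single standard gauge was finally decided upon in 1846, tracks with other gauges had to be converted. But how do you do this without shutting down the train line for months or years? The solution is to first convert the track to **dual gauge** or **mixed gauge** by adding a third rail. This conversion can be done gradually, and when it is done, trains of both gauges can run on the line, using two of the three rails. Eventually, once all trains have been converted to the standard gauge, the rail providing the nonstandard gauge can be removed.
>
> "Reprocessing" the existing tracks in this way, and allowing the old and new versions to exist side by side, makes it possible to change the gauge gradually over the course of years. Nevertheless, it is an expensive undertaking, which is why nonstandard gauges still exist today. For example, the BART system in the San Francisco Bay Area uses a different gauge from the majority of the US.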
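Derived views allow **gradual evolution**. If you want to restructure a dataset, you do not need to perform the migration as a sudden switch. Instead, you can maintain the old schema and the new schema side by side as two independently derived views onto the same underlying data. You can then start shifting a small number of users to the new view in order to test its performance and find any bugs, while most users continue to be routed to the old view. Gradually, you can increase the proportion of users accessing the new view, and eventually you can drop the old view[^10].
The beauty of such a gradual migration is that every stage of the process is easily reversible if something goes wrong: you always have a working system to go back to. By reducing the risk of irreversible damage, you can be more confident about going ahead, and thus move faster to improve your system[^11].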
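One possible way to route a growing fraction of users to the new view is to hash each user ID into a fixed bucket, as in the sketch below. The function names, the 100-bucket scheme, and the dictionary-backed views are assumptions for illustration, not a description of any specific rollout tool.

```python
import hashlib


def uses_new_view(user_id: str, rollout_percent: int) -> bool:
    """Deterministically place each user in one of 100 buckets."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < rollout_percent


def read_profile(user_id, old_view, new_view, rollout_percent):
    """Serve the read from whichever derived view the user is routed to."""
    view = new_view if uses_new_view(user_id, rollout_percent) else old_view
    return view[user_id]
```

Because the bucket depends only on the user ID, a user who has been moved to the new view stays there as the percentage grows, and setting the percentage back to 0 routes everyone to the old view again, which is the reversibility described above.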
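#### Unifying Batch and Stream Processing {#id338}
An early proposal for unifying batch and stream processing was the **lambda architecture**[^12], but it had a number of problems and has gradually fallen out of the mainstream. More recent systems allow both batch computations (reprocessing historical data) and stream computations (processing events as they arrive) to be implemented in the same system[^15].
Unifying batch and stream processing in one system requires the following features, which are becoming increasingly widely available:
* The ability to replay historical events through the same processing engine that handles the stream of recent events. For example, log-based message brokers have the ability to replay messages (see "[Replaying Old Messages](/tw/ch12#sec_stream_replay)"), and some stream processors can read input from a distributed filesystem like HDFS.
* Exactly-once semantics for stream processors: that is, ensuring that the output is the same as if no faults had occurred, even if faults did in fact occur (see "[Fault Tolerance](/tw/ch12#sec_stream_fault_tolerance)"). As with batch processing, this requires discarding the partial output of any failed tasks.
* Tools for windowing by event time, not by processing time, since processing time is meaningless when reprocessing historical events (see "[Reasoning About Time](/tw/ch12#sec_stream_time)"). For example, Apache Beam provides an API for expressing such computations, which can then be run using Apache Flink or Google Cloud Dataflow.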
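The difference between event time and processing time can be shown with a small sketch. It is written in plain Python rather than the Apache Beam API, and it assumes that each event carries a Unix timestamp assigned by its producer; watermarks and late-arriving events are deliberately left out.

```python
from collections import Counter


def count_per_window(events, window_seconds=60):
    """Count events per tumbling window, keyed by the window's start time.

    Each event is an (event_time, payload) pair, where event_time is the
    timestamp embedded in the event, not the time at which it is processed.
    """
    counts = Counter()
    for event_time, _payload in events:
        window_start = event_time - (event_time % window_seconds)
        counts[window_start] += 1
    return dict(counts)


# Events arrive out of order, as they might when replaying a historical log.
events = [(125, "c"), (10, "a"), (59, "b"), (61, "d")]
per_minute = count_per_window(events)
```

Since the window is computed from the timestamp inside each event, the result is the same whether the events are processed live or replayed years later in a different order.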
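## Unbundling Databases {#sec_future_unbundling}
At a most abstract level, databases, batch/stream processors, and operating systems all perform similar functions: they store some data, and they allow you to process and query that data[^16]. A database stores data in records of some data model (rows in tables, documents, vertices in a graph, etc.) while an operating system's filesystem stores data in files, but at their core, both are "information management" systems[^17]. As we saw in [Chapter 11](/tw/ch11), batch processing systems are in many ways like a distributed version of Unix.
Of course, there are many practical differences. For example, many filesystems do not cope very well with a directory containing 10 million small files, whereas a database containing 10 million small records is completely normal and unremarkable. Nevertheless, the similarities and differences between operating systems and databases are worth exploring.
Unix and relational databases have approached the information management problem with very different philosophies. Unix viewed its purpose as presenting programmers with a logical but fairly low-level hardware abstraction, whereas relational databases wanted to give application programmers a high-level abstraction that would hide the complexities of data structures on disk, concurrency, crash recovery, and so on. Unix developed pipes and files that are just sequences of bytes, whereas databases developed SQL and transactions.
Which approach is better? Of course, it depends what you want. Unix is "simpler" in the sense that it is a fairly thin wrapper around hardware resources; relational databases are "simpler" in the sense that a short declarative query can draw on a lot of powerful infrastructure (query optimization, indexes, join methods, concurrency control, replication, etc.) without the author of the query needing to understand the implementation details.
The tension between these philosophies has lasted for decades (both Unix and the relational model emerged in the early 1970s) and still is not resolved. For example, I would interpret the NoSQL movement as wanting to apply a Unix-esque approach of low-level abstractions to the domain of distributed OLTP data storage.
In this section I will attempt to reconcile the two philosophies, in the hope that we can combine the best of both worlds.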
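### Composing Data Storage Technologies {#id447}
Over the course of this book we have discussed various features provided by databases and how they work, including:
* Secondary indexes, which allow you to efficiently search for records based on the value of a field (see "[Other Indexing Structures](/tw/ch4#sec_storage_index_multicolumn)")
* Materialized views, which are a kind of precomputed cache of query results (see "[Aggregation: Data Cubes and Materialized Views](/tw/ch4#sec_storage_materialized_views)")
* Replication logs, which keep copies of the data on other nodes up to date (see "[Implementation of Replication Logs](/tw/ch6#sec_replication_implementation)")
* Full-text search indexes, which allow keyword search in text (see "[Full-Text Search and Fuzzy Indexes](/tw/ch4#sec_storage_full_text)") and which are built into some relational databases[^1]
In [Chapter 11](/tw/ch11) and [Chapter 12](/tw/ch12), similar themes emerged. We talked about building full-text search indexes (see "[The Output of Batch Workflows](/tw/ch11#sec_batch_output)"), about maintaining materialized views (see "[Maintaining Materialized Views](/tw/ch12#sec_stream_mat_view)"), and about replicating changes from a database to derived data systems (see "[Change Data Capture](/tw/ch12#sec_stream_cdc)").
It seems that there are parallels between the features that are built into databases and the derived data systems that people are building with batch and stream processors.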
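#### Creating an Index {#id340}
Think about what happens when you run `CREATE INDEX` to create a new index in a relational database. The database has to scan over a consistent snapshot of a table, pick out all of the field values being indexed, sort them, and write out the index. Then it must process the backlog of writes that have been made since the consistent snapshot was taken (assuming the table was not locked while creating the index, so writes could continue). Once that is done, the database must continue to keep the index up to date whenever a transaction writes to the table.
This process is remarkably similar to setting up a new follower replica (see "[Setting Up New Followers](/tw/ch6#sec_replication_new_replica)"), and also very similar to **bootstrapping** change data capture in a streaming system (see "[Initial Snapshot](/tw/ch12#sec_stream_cdc_snapshot)").
Whenever you run `CREATE INDEX`, the database essentially reprocesses the existing dataset (as discussed in "[Reprocessing Data for Application Evolution](#sec_future_reprocessing)") and derives the index as a new view onto the existing data. The existing data may be a snapshot of the state rather than a log of all changes that ever happened, but the two are closely related (see "[State, Streams, and Immutability](/tw/ch12#sec_stream_immutability)").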
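The two phases described above (scan a snapshot, then catch up on the backlog) can be sketched as follows. This is a simplified model rather than how any real database implements `CREATE INDEX`: the snapshot is a dictionary, the change log is a list whose positions serve as offsets, and deletes are not modeled.

```python
from bisect import insort


def build_index(snapshot, snapshot_offset, change_log, field):
    """Build a sorted secondary index on `field`, then catch up on new writes.

    snapshot:        dict of row_id -> row, consistent as of snapshot_offset
    change_log:      list of (row_id, new_row) writes; list position = offset
    Returns a sorted list of (field_value, row_id) index entries.
    """
    # Phase 1: scan the snapshot, pick out the indexed values, and sort them.
    index = sorted((row[field], row_id) for row_id, row in snapshot.items())
    rows = dict(snapshot)
    # Phase 2: apply the backlog of writes made after the snapshot was taken.
    for row_id, new_row in change_log[snapshot_offset + 1:]:
        if row_id in rows:
            index.remove((rows[row_id][field], row_id))
        insort(index, (new_row[field], row_id))
        rows[row_id] = new_row
    return index


change_log = [
    (1, {"name": "bob"}),     # offset 0, already reflected in the snapshot
    (2, {"name": "alice"}),   # offset 1, already reflected in the snapshot
    (3, {"name": "carol"}),   # offset 2, written while the index was built
    (1, {"name": "zed"}),     # offset 3, written while the index was built
]
snapshot = {1: {"name": "bob"}, 2: {"name": "alice"}}
name_index = build_index(snapshot, 1, change_log, "name")
```

The same structure (initial snapshot plus replay of later changes from a known offset) is what bootstrapping a follower or a CDC consumer looks like.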
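#### The Meta-Database of Everything {#id341}
In this light, I think that the dataflow across an entire organization starts looking like one huge database[^7]. Whenever a batch, stream, or ETL process transports data from one place and form to another place and form, it is acting like the database subsystem that keeps indexes or materialized views up to date.
Viewed like this, batch and stream processors are like elaborate implementations of triggers, stored procedures, and materialized view maintenance routines. The derived data systems they maintain are like different index types. For example, a relational database may support B-tree indexes, hash indexes, spatial indexes (see "[Multi-Column Indexes](/tw/ch4#sec_storage_index_multicolumn)"), and other types of indexes. In the emerging architecture of derived data systems, instead of implementing those facilities as features of a single integrated database product, they are provided by various different pieces of software, running on different machines, administered by different teams.
Where will these developments take us in the future? If we start from the premise that there is no single data model or storage format that is suitable for all access patterns, I speculate that there are two avenues by which different storage and processing tools can nevertheless be composed into a cohesive system:
**Federated databases: unifying reads**
It is possible to provide a unified query interface to a wide variety of underlying storage engines and processing methods, an approach known as a **federated database** or **polystore**[^18] [^19]. For example, PostgreSQL's **foreign data wrapper** feature fits this pattern[^20]. Applications that need a specialized data model or query interface can still access the underlying storage engines directly, while users who want to combine data from disparate places can do so easily through the federated interface.
A federated query interface follows the relational tradition of a single integrated system with a high-level query language and elegant semantics, but a complicated implementation.
**Unbundled databases: unifying writes**
While federation addresses read-only querying across several different systems, it does not have a good answer to **synchronizing** writes across those systems. We said that within a single database, creating a consistent index is a built-in feature. When we compose several storage systems, we similarly need to ensure that all data changes end up in all the right places, even in the face of faults. Making it easier to reliably plug together storage systems (e.g., through change data capture and event logs) is like unbundling a database's index-maintenance features in a way that can synchronize writes across disparate technologies[^7] [^21].
The unbundled approach follows the Unix tradition of small tools that do one thing well[^22], that communicate through a uniform low-level API (pipes), and that can be composed using a higher-level language (the shell)[^16].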
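#### Making Unbundling Work {#sec_future_unbundling_favor}
Federation and unbundling are two sides of the same coin: composing a reliable, scalable, and maintainable system out of diverse components. Federated read-only querying requires mapping one data model into another, which takes some thought but is ultimately quite a manageable problem. I think that keeping the writes to several storage systems in sync is the harder engineering problem, and so I will focus on it.
The traditional approach to synchronizing writes requires distributed transactions across heterogeneous storage systems[^18], which I think is the wrong solution (see "[Derived Data Versus Distributed Transactions](#sec_future_derived_vs_transactions)"). Transactions within a single storage or stream processing system are feasible, but when data crosses the boundary between different technologies, I believe that an asynchronous event log with idempotent writes is a much more robust and practical approach.
For example, distributed transactions are used within some stream processors to achieve **exactly-once** semantics (see "[Atomic Commit Revisited](/tw/ch12#sec_stream_atomic_commit)"), and this can work quite well. However, when a transaction would need to involve systems written by different groups of people (e.g., when data is written from a stream processor to a distributed key-value store or search index), the lack of a standardized transaction protocol makes integration much harder. An ordered log of events with idempotent consumers (see "[Idempotence](/tw/ch12#sec_stream_idempotence)") is a much simpler abstraction, and thus much more feasible to implement across heterogeneous systems[^7].
The big advantage of log-based integration is **loose coupling** between the various components, which manifests itself in two ways:
1. At a system level, asynchronous event streams make the system as a whole more robust to outages or performance degradation of individual components. If a consumer runs slow or fails, the event log can buffer messages (see "[Disk Space Usage](/tw/ch12#sec_stream_disk_usage)"), allowing the producer and any other consumers to continue running unaffected. The faulty consumer can catch up when it is fixed, so it does not miss any data, and the fault is contained. By contrast, the synchronous interaction of distributed transactions tends to escalate local faults into large-scale failures (see "[Limitations of Distributed Transactions](/tw/ch8#sec_transactions_xa)").
2. At a human level, unbundling data systems allows different software components and services to be developed, improved, and maintained independently from each other by different teams. Specialization allows each team to focus on doing one thing well, with well-defined interfaces to other teams' systems. Event logs provide an interface that is powerful enough to capture fairly strong consistency properties (due to durability and ordering of events), but also general enough to be applicable to almost any kind of data.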
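#### Unbundled Versus Integrated Systems {#id448}
If unbundling does indeed become the way of the future, it will not replace databases in their current form: they will still be needed as much as ever. Databases are still required for maintaining state in stream processors, and in order to serve queries for the output of batch and stream processors (see "[The Output of Batch Workflows](/tw/ch11#sec_batch_output)" and "[Stream Processing](/tw/ch12#sec_stream_processing)"). Specialized query engines will continue to be important for particular workloads: for example, query engines in MPP data warehouses are optimized for exploratory analytic queries and handle this kind of workload very well (see "[Comparing Hadoop to Distributed Databases](/tw/ch11#sec_batch_distributed)").
The complexity of running several different pieces of infrastructure can be a problem: each piece of software has a learning curve, configuration issues, and operational quirks, and so it is worth deploying as few moving parts as possible. A single integrated software product may also be able to achieve better and more predictable performance on the kinds of workloads for which it is designed, compared to a system consisting of several tools that you have composed with application code[^23]. As I said in the Preface, building for scale that you do not need is wasted effort and may lock you into an inflexible design. In effect, it is a form of premature optimization.
The goal of unbundling is not to compete with individual databases on performance for particular workloads; the goal is to allow you to combine several different databases in order to achieve good performance for a much wider range of workloads than is possible with a single piece of software. It is about breadth, not depth, in the same vein as the diversity of storage and processing models that we discussed in "[Comparing Hadoop to Distributed Databases](/tw/ch11#sec_batch_distributed)".
Thus, if there is a single technology that does everything you need, you are most likely best off simply using that product rather than trying to reimplement it yourself from lower-level components. The advantages of unbundling and federation only come into the picture when there is no single piece of software that satisfies all your requirements.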
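### Designing Applications Around Dataflow {#sec_future_dataflow}
The idea of updating derived data when its underlying data changes is nothing new. For example, spreadsheets have powerful dataflow programming capabilities[^33]: you can put a formula in one cell (for example, the sum of cells in another column), and whenever any input to the formula changes, the result is automatically recalculated. This is exactly what we want at a data system level: when a record in a database changes, we want any index for that record to be automatically updated, and any cached views or aggregations that depend on the record to be automatically refreshed, without the application developer having to worry about the details of how this refresh happens.
In this sense, many data systems today still have something to learn from the features that VisiCalc already had in 1979[^34]. Unlike spreadsheets, modern data systems also need to be fault-tolerant, scalable, and store data durably; they need to integrate heterogeneous technologies written by different teams, and they need to be able to reuse existing libraries and services. It is unrealistic to expect all software to be developed using one particular language, framework, or tool.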
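#### Application Code as a Derivation Function {#sec_future_dataflow_derivation}
When one dataset is derived from another, it goes through some kind of transformation function. For example:
* A secondary index is a kind of derived dataset with a straightforward transformation function: for each row or document in the base table, it picks out the values in the columns or fields being indexed, and sorts by those values (assuming a B-tree or SSTable index, which are sorted by key, as discussed in [Chapter 4](/tw/ch4)).
* A full-text search index is created by applying various natural language processing functions such as language detection, word segmentation, stemming or lemmatization, spelling correction, and synonym identification, followed by building a data structure for efficient lookups (such as an inverted index).
* In a machine learning system, we can consider the model as being derived from the training data by applying various feature extraction and statistical analysis functions. When the model is applied to new input data, the output of the model is derived from the input and the model (and hence, indirectly, from the training data).
* A cache often contains an aggregation of data in the form in which it is going to be displayed in a user interface (UI). Populating the cache thus requires knowledge of what fields are referenced in the UI; changes in the UI may require updating the definition of how the cache is populated and rebuilding the cache.
The derivation function for a secondary index is so commonly required that it is built into many databases as a core feature, and you can invoke it by merely saying `CREATE INDEX`. For full-text indexing, basic linguistic features for common languages may be built into a database, but the more sophisticated features often require domain-specific tuning. In machine learning, feature engineering is notoriously application-specific, and often has to incorporate detailed knowledge about the user interaction and deployment of an application[^35].
When the function that creates a derived dataset is not a standard cookie-cutter function like creating a secondary index, custom code is required to handle the application-specific aspects. And this custom code is where many databases struggle. Although relational databases commonly support triggers, stored procedures, and user-defined functions, which can be used to execute application code within the database, they have been somewhat of an afterthought in database design (see "[Transmitting Event Streams](/tw/ch12#sec_stream_transmit)").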
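#### Separation of Application Code and State {#id344}
In theory, databases could be deployment environments for arbitrary application code, like an operating system. However, in practice they have turned out to be poorly suited for this purpose. They do not fit well with the requirements of modern application development, such as dependency and package management, version control, rolling upgrades, evolvability, monitoring, metrics, calls to network services, and integration with external systems.
On the other hand, deployment and cluster management tools such as Mesos, YARN, Docker, and Kubernetes are designed specifically for the purpose of running application code. By focusing on doing one thing well, they are able to do it much better than a database that provides execution of user-defined functions as one of its many features.
I think it makes sense to have some parts of a system that specialize in durable data storage, and other parts that specialize in running application code. The two can interact while still remaining independent.
Most web applications today are deployed as stateless services, in which any user request can be routed to any application server, and the server forgets everything about the request once it has sent the response. This style of deployment is convenient, as servers can be added or removed at will, but the state has to go somewhere: typically, a database. The trend has been to keep stateless application logic separate from state management (databases): not putting application logic in the database and not putting persistent state in the application[^36]. As people in the functional programming community like to joke, "We believe in the separation of **Church** and **state**"[^37].
In this typical web application model, the database acts as a kind of mutable shared variable that can be accessed synchronously over the network. The application can read and update the variable, and the database takes care of making it durable, providing some concurrency control and fault tolerance.
However, in most programming languages you cannot subscribe to changes in a mutable variable; you can only read it periodically. Unlike in a spreadsheet, readers of the variable do not get notified if the value of the variable changes. (You can implement such notifications in your own code, which is known as the **observer pattern**, but most languages do not have this pattern as a built-in feature.)
Databases have inherited this passive approach to mutable data: if you want to find out whether the content of the database has changed, often your only option is to poll (i.e., to repeat your query periodically). Subscribing to changes is only just beginning to emerge as a feature (see "[API Support for Change Streams](/tw/ch12#sec_stream_change_api)").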
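For readers unfamiliar with the observer pattern mentioned above, here is a minimal sketch of a variable that notifies its readers when it changes. The `ObservableVariable` class is hypothetical and exists only to contrast subscription with polling.

```python
class ObservableVariable:
    """A mutable variable whose readers can subscribe to changes."""

    def __init__(self, value):
        self._value = value
        self._subscribers = []

    def get(self):
        """Polling-style access: returns whatever the value is right now."""
        return self._value

    def subscribe(self, callback):
        """Register a callback that is invoked as callback(old, new)."""
        self._subscribers.append(callback)

    def set(self, new_value):
        old_value, self._value = self._value, new_value
        if new_value != old_value:
            for callback in self._subscribers:
                callback(old_value, new_value)


price = ObservableVariable(100)
seen = []
price.subscribe(lambda old, new: seen.append((old, new)))
price.set(120)
price.set(120)   # value unchanged, so subscribers are not notified
```

A reader that only calls `get()` has to guess how often to check, whereas a subscriber is told about every change exactly when it happens.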
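#### Dataflow: Interplay Between State Changes and Application Code {#id450}
Thinking about applications in terms of dataflow implies renegotiating the relationship between application code and state management. Instead of treating a database as a passive variable that is manipulated by the application, we think much more about the interplay and collaboration between state, state changes, and the code that processes them. Application code responds to state changes in one place by triggering state changes in another place.
We saw this line of thought in "[Databases and Streams](/tw/ch12#sec_stream_databases)", where we discussed treating the log of changes to a database as a stream of events that we can subscribe to. Message-passing systems such as actors (see "[Message-Passing Dataflow](/tw/ch5#sec_encoding_dataflow_msg)") also have this concept of responding to events. As early as the 1980s, the **tuple space** model explored expressing distributed computations in terms of processes that observe state changes and react to them[^38] [^39].
As discussed, similar things happen inside a database when a trigger fires due to a data change, or when a secondary index is updated to reflect a change in the table being indexed. Unbundling the database means taking this idea and applying it to the creation of derived datasets outside of the primary database: caches, full-text search indexes, machine learning, or analytics systems. We can use stream processing and messaging systems for this purpose.
The important thing to remember is that maintaining derived data is not the same as asynchronous job execution, for which messaging systems are traditionally designed (see "[Logs Compared to Traditional Messaging](/tw/ch12#sec_stream_logs_vs_messaging)"):
* When maintaining derived data, the order of state changes is often important (if several views are derived from an event log, they need to process the events in the same order so that they remain consistent with each other). As discussed in "[Acknowledgments and Redelivery](/tw/ch12#sec_stream_reordering)", many message brokers do not have this property when redelivering unacknowledged messages. Dual writes are also ruled out (see "[Keeping Systems in Sync](/tw/ch12#sec_stream_sync)").
* Fault tolerance is key for derived data: losing just a single message causes the derived dataset to go permanently out of sync with its data source. Both message delivery and derived state updates must be reliable. For example, many actor systems by default maintain actor state and messages in memory, so they are lost if the machine running the actor crashes.
Stable message ordering and fault-tolerant message processing are quite stringent demands, but they are much less expensive and more operationally robust than distributed transactions. Modern stream processors can provide these ordering and reliability guarantees, and they allow application code to be run as stream operators.
This application code can do the arbitrary processing that built-in derivation functions in databases generally do not provide. Like Unix tools chained by pipes, stream operators can be composed to build large systems around dataflow. Each operator takes streams of state changes as input, and produces other streams of state changes as output.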
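#### Stream Processors and Services {#id345}
The currently popular style of application development involves breaking down functionality into a set of **services** that communicate via synchronous network requests such as REST APIs (see "[Dataflow Through Services: REST and RPC](/tw/ch5#sec_encoding_dataflow_rpc)"). The advantage of such a service-oriented architecture over a single monolithic application is primarily organizational scalability through loose coupling: different teams can work on different services, which reduces coordination effort between teams (since the services can be deployed and updated independently).
Composing stream operators into dataflow systems has a lot of similar characteristics to the microservices approach[^40]. However, the underlying communication mechanism is very different: one-directional, asynchronous message streams rather than synchronous request/response interactions.
Besides the advantages listed in "[Message-Passing Dataflow](/tw/ch5#sec_encoding_dataflow_msg)", such as better fault tolerance, dataflow systems can also achieve better performance. For example, say a customer is purchasing an item that is priced in one currency but paid for in another currency. In order to perform the currency conversion, you need to know the current exchange rate. This operation could be implemented in two ways[^40] [^41]:
1. In the microservices approach, the code that processes the purchase would probably query an exchange-rate service or database in order to obtain the current rate for a particular currency.
2. In the dataflow approach, the code that processes purchases would subscribe to a stream of exchange rate updates ahead of time, and record the current rate in a local database whenever it changes. When it comes to processing the purchase, it only needs to query the local database.
The second approach has replaced a synchronous network request to another service with a query to a local database (which may be on the same machine, even in the same process). Not only is the dataflow approach faster, but it is also more robust to the failure of another service. The fastest and most reliable network request is no network request at all! Instead of RPC, we now have a stream join between purchase events and exchange rate update events (see "[Stream-Table Join (Stream Enrichment)](/tw/ch12#sec_stream_table_joins)").
The join is time-dependent: if the purchase events are reprocessed at a later point in time, the exchange rate will have changed. If you want to reconstruct the original output, you will need to obtain the historical exchange rate at the original time of purchase. No matter whether you query a service or subscribe to a stream of exchange rate updates, you will need to handle this time dependence (see "[Time-Dependence of Joins](/tw/ch12#sec_stream_join_time)").
Subscribing to a stream of changes, rather than querying the current state when needed, brings us closer to a spreadsheet-like model of computation: when some piece of data changes, any derived data that depends on it can swiftly be updated. There are still many open questions, for example around issues like time-dependent joins, but I believe that building applications around dataflow ideas is a very promising direction to go in.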
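The dataflow variant of the currency conversion example can be sketched as a small stream-table join. The `PurchaseProcessor` class, its method names, and the exchange rates are invented for illustration; a real stream processor would deliver the two event types from separate input streams and keep the table in fault-tolerant local state.

```python
from decimal import Decimal


class PurchaseProcessor:
    """Joins purchase events against a locally maintained table of rates."""

    def __init__(self):
        # Local table: (price_currency, pay_currency) -> latest known rate.
        self.rates = {}

    def on_rate_update(self, from_currency, to_currency, rate):
        """Handle an event from the exchange-rate update stream."""
        self.rates[(from_currency, to_currency)] = Decimal(rate)

    def on_purchase(self, amount, price_currency, pay_currency):
        """Handle a purchase event using only local state (no RPC)."""
        rate = self.rates[(price_currency, pay_currency)]
        return (Decimal(amount) * rate).quantize(Decimal("0.01"))


processor = PurchaseProcessor()
processor.on_rate_update("USD", "EUR", "0.90")   # made-up rate
first = processor.on_purchase("20.00", "USD", "EUR")
processor.on_rate_update("USD", "EUR", "0.95")   # made-up rate
second = processor.on_purchase("20.00", "USD", "EUR")
```

The two identical purchases yield different results because they are joined with different versions of the rate table, which is exactly the time dependence discussed above: reprocessing the first purchase later would need the rate as of the original purchase time, not the latest one.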
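### Observing Derived State {#sec_future_observing}
At an abstract level, the dataflow systems discussed in the last section give you a process for creating derived datasets (such as search indexes, materialized views, and predictive models) and keeping them up to date. Let's call that process the **write path**: whenever some piece of information is written to the system, it may go through multiple stages of batch and stream processing, and eventually every derived dataset is updated to incorporate the data that was written. [Figure 13-1](#fig_future_write_read_paths) shows an example of updating a search index.
{{< figure src="/fig/ddia_1301.png" id="fig_future_write_read_paths" caption="Figure 13-1. In a search index, writes (document updates) meet reads (queries)." class="w-full my-4" >}}
But why do you create the derived dataset in the first place? Most likely because you want to query it again at a later time. This is the **read path**: when serving a user request you read from the derived dataset, perhaps perform some more processing on the results, and construct the response to the user.
Taken together, the write path and the read path encompass the whole journey of the data, from the point where it is collected to the point where it is consumed (probably by another human). The write path is the portion of the journey that is precomputed, i.e., that is done eagerly as soon as the data comes in, regardless of whether anyone has asked to see it. The read path is the portion of the journey that only happens when someone asks for it. If you are familiar with functional programming languages, you might notice that the write path is similar to eager evaluation, and the read path is similar to lazy evaluation.
The derived dataset is the place where the write path and the read path meet, as illustrated in [Figure 13-1](#fig_future_write_read_paths). It represents a trade-off between the amount of work that needs to be done at write time and the amount that needs to be done at read time.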
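#### Materialized Views and Caching {#id451}
A full-text search index is a good example: the write path updates the index, and the read path searches the index for keywords. Both reads and writes need to do some work. Writes need to update the index entries for all terms that appear in the document. Reads need to search for each of the words in the query, and apply Boolean logic to find documents that contain all of the words in the query (an AND operator), or any synonym of each of the words (an OR operator).
If you did not have an index, a search query would have to scan over all documents (like grep), which would get very expensive if you had a large number of documents. No index means less work on the write path (no index to update), but a lot more work on the read path.
On the other hand, you could imagine precomputing the search results for all possible queries. In that case, you would have less work to do on the read path: no Boolean logic, just find the results for your query and return them. However, the write path would be a lot more expensive: the set of possible search queries is infinite, and thus precomputing all possible search results would require infinite time and storage space, which is not feasible in practice.
Another option would be to precompute the search results for only a fixed set of the most common queries, so that they can be served quickly without having to go to the index. The uncommon queries can still be served from the index. This would generally be called a **cache** of common queries, although we could also call it a **materialized view**, as it would need to be updated when new documents appear that should be included in the results of one of the common queries.
From this example we can see that an index is not the only possible boundary between the write path and the read path. Caching of common search results is possible, and grep-like scanning without the index is also possible on a small number of documents. Viewed like this, the role of caches, indexes, and materialized views is simple: they shift the boundary between the read path and the write path. They allow us to do more work on the write path, by precomputing results, in order to save effort on the read path.
The shifting of the boundary between work done on the write path and the read path was in fact the topic of the Twitter example at the beginning of this book, in "[Describing Load](/tw/ch2#sec_introduction_twitter)". In that example, we also saw how the boundary between write path and read path might be drawn differently for celebrities compared to ordinary users. After 500 pages we have come full circle!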
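The search example can be sketched with an inverted index plus materialized results for a fixed set of common queries. Everything here is a simplification made for illustration: the `SearchService` class is hypothetical, tokenization is a lowercase whitespace split, only AND queries are supported, and documents are never updated or deleted.

```python
from collections import defaultdict


class SearchService:
    def __init__(self, common_queries):
        self.index = defaultdict(set)   # term -> set of doc_ids
        # Materialized results for a fixed set of common AND-queries.
        self.cached = {frozenset(q.lower().split()): set() for q in common_queries}

    def write(self, doc_id, text):
        """Write path: update index entries and any affected cached results."""
        terms = set(text.lower().split())
        for term in terms:
            self.index[term].add(doc_id)
        for query_terms, results in self.cached.items():
            if query_terms <= terms:
                results.add(doc_id)

    def search(self, query):
        """Read path: serve from the cache if possible, otherwise the index."""
        query_terms = frozenset(query.lower().split())
        if query_terms in self.cached:
            return set(self.cached[query_terms])
        postings = [self.index.get(term, set()) for term in query_terms]
        return set.intersection(*postings) if postings else set()


service = SearchService(common_queries=["stream processing"])
service.write(1, "Stream processing with logs")
service.write(2, "Batch processing")
```

Adding a query to `common_queries` moves work for that query from `search` to `write`; removing it moves the work back. That is the boundary shift between read path and write path in miniature.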
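#### Stateful, Offline-Capable Clients {#id347}
I find the idea of a boundary between write and read paths interesting because we can discuss shifting that boundary and explore what that shift means in practical terms. Let's look at the idea in a different context.
The huge popularity of web applications in the last two decades has led us to certain assumptions about application development that are easy to take for granted. In particular, the client/server model, in which clients are largely stateless and servers have the authority over data, is so common that we almost forget that anything else exists. However, technology keeps moving on, and I think it is important to question the status quo from time to time.
Traditionally, web browsers have been stateless clients that can only do useful things when you have an internet connection (just about the only thing you could do offline was to scroll up and down in a page that you had previously loaded while online). However, recent "single-page" JavaScript web apps have gained a lot of stateful capabilities, including client-side user interface interaction and persistent local storage in the web browser. Mobile apps can similarly store a lot of state on the device and do not require a round-trip to the server for most user interactions.
These changing capabilities have led to a renewed interest in **offline-first** applications that do as much as possible using a local database on the same device, without requiring an internet connection, and sync with remote servers in the background when a network connection is available[^42]. Since mobile devices often have slow and unreliable cellular internet connections, it is a big advantage for users if their user interface does not have to wait for synchronous network requests, and if apps mostly work offline (see "[Clients with Offline Operation](/tw/ch6#sec_replication_offline_clients)").
When we move away from the assumption of stateless clients talking to a central database and toward state that is maintained on end-user devices, a world of new opportunities opens up. In particular, we can think of the on-device state as a **cache of state on the server**. The pixels on the screen are a materialized view onto model objects in the client app; the model objects are a local replica of state in a remote datacenter[^27].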
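#### Pushing State Changes to Clients {#id348}
In a typical web page, if you load the page in a web browser and the data subsequently changes on the server, the browser does not find out about the change until you reload the page. The browser only reads the data at one point in time, assuming that it is static; it does not subscribe to updates from the server. Thus, the state on the device is a stale cache that is not updated unless you explicitly poll for changes. (HTTP-based feed subscription protocols like RSS are really just a basic form of polling.)
More recent protocols have moved beyond the basic request/response pattern of HTTP: server-sent events (the EventSource API) and WebSockets provide communication channels by which a web browser can keep an open TCP connection to a server, and the server can actively push messages to the browser as long as it remains connected. This provides an opportunity for the server to actively inform the end-user client about any changes to the state it has stored locally, reducing the staleness of the client-side state.
In terms of our model of write path and read path, actively pushing state changes all the way to client devices means extending the write path all the way to the end user. When a client is first initialized, it would still need to use a read path to get its initial state, but thereafter it could rely on a stream of state changes sent by the server. The ideas we discussed around stream processing and messaging are not restricted to running only in a datacenter: we can take the ideas further, and extend them all the way to end-user devices[^43].
The devices will be offline some of the time, and unable to receive any notifications of state changes from the server during that time. But we already solved that problem: in "[Consumer Offsets](/tw/ch12#sec_stream_log_offsets)" we discussed how a consumer of a log-based message broker can reconnect after failing or becoming disconnected, and ensure that it does not miss any messages that arrived while it was disconnected. The same technique works for individual users, where each device is a small subscriber to a small stream of events.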
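The consumer-offset technique applied to a single device can be sketched as follows. The per-user log is modeled as a Python list on the "server", and the `DeviceClient` class is hypothetical; a real client would persist `next_offset` on the device and fetch changes over the network.

```python
class DeviceClient:
    """A client that replicates state by consuming a per-user change log."""

    def __init__(self):
        self.state = {}
        self.next_offset = 0   # would be persisted on the device in practice

    def sync(self, log):
        """Fetch and apply every change at or after next_offset."""
        for key, value in log[self.next_offset:]:
            self.state[key] = value
        self.next_offset = len(log)


server_log = [("title", "Draft"), ("status", "open")]
device = DeviceClient()
device.sync(server_log)            # initial catch-up

# The device goes offline while the server keeps appending changes.
server_log.append(("title", "Final"))
server_log.append(("status", "closed"))

device.sync(server_log)            # reconnect and resume from the saved offset
```

Because the device remembers only a single number, reconnecting after any period offline is the same operation as the initial sync, just starting from a later offset.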
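#### End-to-end event streams {#id349}
Recent tools for building stateful clients and user interfaces, such as the Elm language[^30] and Facebook's toolchain of React, Flux, and Redux, already manage internal client-side state by subscribing to a stream of events that represent user input or responses from a server, in a structure similar to event sourcing (see “[Event Sourcing](/tw/ch12#sec_stream_event_sourcing)”).
It is a natural step to extend this programming model so that a server can also push state-change events into the client-side event pipeline. State changes can then flow through an **end-to-end** write path: from an interaction on one device that triggers a state change, through event logs and several derived data systems and stream processors, all the way to the user interface of a person watching the state change on another device. Such changes can propagate with fairly low delay, say under one second from one end to the other.
Some applications, such as instant messaging and online games, already have this kind of “real-time” architecture (in the sense of low-latency interaction, not in the sense of “[Response time guarantees](/tw/ch9#sec_distributed_clocks_realtime)”). So why don't we build all applications this way?
The difficulty is that the assumption of stateless clients and request/response interaction is deeply ingrained in our databases, libraries, frameworks, and protocols. Many data stores support read and write operations in which a request returns a single response, but very few offer the ability to subscribe to changes, that is, a request that returns a stream of responses over time (see “[API support for change streams](/tw/ch12#sec_stream_change_api)”).
To extend the write path all the way to the end user, we would have to fundamentally rethink how we build many of these systems, moving from request/response interaction toward publish/subscribe dataflow[^27]. I think the benefits, namely more responsive user interfaces and better offline support, would make the effort worthwhile. If you are designing data systems, I hope you will keep in mind the option of subscribing to changes, not just querying the current state.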
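The contrast between querying current state and subscribing to changes can be illustrated with a small in-memory sketch. `ObservableStore` is a hypothetical class made up for this example and does not correspond to the API of any particular database.

```python
class ObservableStore:
    """Toy key-value store that supports both reads and change subscriptions."""

    def __init__(self):
        self._data = {}
        self._subscribers = []

    def get(self, key):
        # Request/response: returns a snapshot that may become stale.
        return self._data.get(key)

    def subscribe(self, callback):
        # Publish/subscribe: the callback receives every later change.
        self._subscribers.append(callback)

    def put(self, key, value):
        self._data[key] = value
        for callback in self._subscribers:
            callback(key, value)


store = ObservableStore()
store.put("unread", 1)

polled = store.get("unread")  # a one-off read; not updated by later writes

ui_state = {}
store.subscribe(lambda key, value: ui_state.__setitem__(key, value))
store.put("unread", 2)
store.put("unread", 3)
```

After the two later writes, `polled` still holds the old snapshot, while `ui_state` has followed every change without polling.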
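#### Reads are events too {#sec_future_read_events}
We discussed how, when a stream processor writes derived data to a store (a database, cache, or index) and user requests query that store, the store acts as the boundary between the write path and the read path. The store allows random-access read queries over data that would otherwise require scanning the entire event log.
In many cases the data store is separate from the stream processing system. Recall, however, that stream processors also need to maintain state in order to perform aggregations and joins (see “[Stream joins](/tw/ch12#sec_stream_joins)”). This state is normally hidden inside the stream processor, but some frameworks allow it to be queried by external clients as well[^45], which turns the stream processor itself into a simple kind of database.
I would like to take this idea further. As discussed so far, writes to the store go through an event log, whereas reads are transient network requests that go straight to the nodes holding the data being queried. This is a reasonable design, but it is not the only possible one. Read requests can also be represented as a stream of events, with both read events and write events sent to a stream processor; the processor responds to a read event by emitting the result of the read to an output stream[^46].
When both writes and reads are represented as events and routed to the same stream operator for processing, we are in effect performing a stream-table join between the stream of read queries and the database. A read event has to be sent to the database partition that holds the data (see “[Request routing](/tw/ch7#sec_sharding_routing)”), just as batch and stream processors have to partition their inputs on the same key when performing a join (see “[Reduce-side joins and grouping](/tw/ch11#sec_batch_join)”).
This correspondence between serving requests and performing joins is quite fundamental[^47]. A one-off read request simply passes through the join operator and is then immediately forgotten, whereas a subscribe request is a persistent join with past and future events on the other side of the join.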
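The following sketch shows one way a single stream operator might handle write, read, and subscribe events for one partition. The event format (`type`, `key`, `value`, `reply_to`) and the function name `stream_operator` are assumptions made for this illustration, not the interface of any real stream processor.

```python
def stream_operator(events):
    """Process interleaved write, read, and subscribe events in log order.

    Returns the output stream as a list of (recipient, key, value) tuples.
    """
    table = {}          # local state built up from write events
    subscriptions = {}  # key -> recipients that asked for future changes
    output = []
    for event in events:
        kind, key = event["type"], event["key"]
        if kind == "write":
            table[key] = event["value"]
            # A subscription is a persistent join: notify on each new write.
            for recipient in subscriptions.get(key, []):
                output.append((recipient, key, event["value"]))
        elif kind == "read":
            # A one-off read joins against current state and is then forgotten.
            output.append((event["reply_to"], key, table.get(key)))
        elif kind == "subscribe":
            subscriptions.setdefault(key, []).append(event["reply_to"])
            output.append((event["reply_to"], key, table.get(key)))
    return output


events = [
    {"type": "write", "key": "stock:42", "value": 5},
    {"type": "read", "key": "stock:42", "reply_to": "client-a"},
    {"type": "subscribe", "key": "stock:42", "reply_to": "client-b"},
    {"type": "write", "key": "stock:42", "value": 4},
    {"type": "read", "key": "stock:42", "reply_to": "client-a"},
]
responses = stream_operator(events)
```

Here `client-a` receives one response per read, while `client-b` receives the current value when it subscribes and again when the value changes.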
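Keeping a log of read events can also help with tracking causal dependencies and data provenance across a system: it lets you reconstruct what a user saw before they made a particular decision. In an online shop, for example, the predicted delivery date and the stock status shown to a customer are likely to influence whether they decide to buy an item[^4]. To analyze that connection, you need to record the result of the user's query about shipping and stock status.
Writing read events to durable storage makes causal dependencies easier to track (see “[Ordering events to capture causality](#sec_future_capture_causality)”), but it incurs extra storage and I/O cost. Optimizing such systems to reduce this overhead is still an open research problem[^2]. However, if you already keep a log of read requests for operational purposes, as a side effect of request processing, then making that log the source of the requests is not a very big change.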
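#### Multi-partition data processing {#sec_future_unbundled_multi_shard}
For queries that touch only a single partition, sending the query through a stream and collecting the response from a stream is probably overkill. The idea does, however, open up the possibility of distributed execution of complex queries that need to combine data from several partitions, making use of the message routing, partitioning, and joining infrastructure that stream processors already provide.
Storm's distributed RPC feature supports this usage pattern (see “[Message passing and RPC](/tw/ch12#sec_stream_actors_drpc)”). For example, it has been used to compute the number of people who have seen a given URL on Twitter, that is, the union of the follower sets of everyone who has tweeted that URL[^48]. Because Twitter's users are partitioned, this computation has to combine results from many partitions.
Another example of this pattern is fraud prevention: to assess whether a particular purchase event is likely to be fraudulent, you can examine the reputation scores of the user's IP address, email address, billing address, and shipping address. Each of these reputation databases is itself partitioned, so collecting the scores for a particular purchase event requires a series of joins against differently partitioned datasets[^49].
The internal query execution graphs of MPP databases have similar characteristics (see “[Comparing Hadoop to distributed databases](/tw/ch11#sec_batch_distributed)”). If you need to perform this kind of multi-partition join, using a database that provides the feature directly is probably simpler than implementing it with a stream processor. Nevertheless, treating queries as streams offers an option for building large-scale applications that go beyond what conventional off-the-shelf solutions can handle.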
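## Aiming for Correctness {#sec_future_correctness}
With stateless services that only read data, it is no great disaster if something goes wrong: you can fix the bug and restart the service, and everything is back to normal. Stateful systems such as databases are not so simple: they are designed to remember things forever (more or less), so if something goes wrong, the (erroneous) effects can potentially last forever as well, which means they need more careful thought[^50].
We want to build applications that are reliable and **correct** (that is, programs whose semantics are well defined and well understood even in the face of various faults). For about four decades, the transaction properties of atomicity, isolation, and durability ([Chapter 8](/tw/ch8)) have been the tools of choice for building correct applications. Those foundations, however, are less solid than they look: the confusion surrounding weak isolation levels is one piece of evidence (see “[Weak isolation levels](/tw/ch8#sec_transactions_isolation_levels)”).
In some areas, transactions have been abandoned altogether and replaced by models that offer better performance and scalability but have more complicated semantics (see, for example, “[Leaderless replication](/tw/ch6#sec_replication_leaderless)”). **Consistency** is often talked about but poorly defined (see “[Consistency](/tw/ch8#sec_transactions_acid_consistency)” and [Chapter 10](/tw/ch10)). Some people assert that we should “embrace weak consistency” for the sake of high availability, without having a clear idea of what those terms actually mean.
For a topic this important, our understanding and our engineering methods are surprisingly shaky. For example, it is very hard to determine whether a particular application is safe to run at a particular transaction isolation level or replication configuration[^51] [^52]. Simple solutions often appear to work correctly when concurrency is low and there are no faults, but turn out to have many subtle bugs under more demanding conditions.
For example, Kyle Kingsbury's Jepsen experiments[^53] have highlighted stark differences between the safety guarantees some products claim and their actual behavior in the presence of network problems and crashes. Even if infrastructure products such as databases were free of problems, application code would still need to use their features correctly, and that is error-prone when the configuration is hard to understand (as is the case with weak isolation levels, quorum configurations, and so on).
If your application can tolerate occasional crashes, and data being corrupted or lost in unpredictable ways, life is a lot simpler, and you may get away with crossing your fingers and hoping for the best. If, on the other hand, you need stronger assurances of correctness, then serializability and atomic commit are well-tested approaches, but they come at a price: they typically work only within a single datacenter (which rules out geographically distributed architectures), and they limit the scale and fault-tolerance properties you can achieve.
The traditional transaction approach is not going away, but I also believe it is not the last word on making applications correct and resilient to faults. In this section I suggest some ways of thinking about correctness in the context of dataflow architectures.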
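### The End-to-End Argument for Databases {#sec_future_end_to_end}
Just because an application uses a data system with comparatively strong safety properties, such as serializable transactions, it does not follow that the application is guaranteed to be free from data loss or corruption. For example, if an application has a bug that causes it to write incorrect data or to delete data from the database, serializable transactions will not save you.
This example may seem trivial, but it deserves to be taken seriously: applications have bugs, and people make mistakes. I used this example in “[State, streams, and immutability](/tw/ch12#sec_stream_immutability)” to argue for immutable, append-only data: taking away the ability of faulty code to destroy good data makes it easier to recover from mistakes.
Immutability is useful, but it is not a cure-all by itself. Let's look at a more subtle example of data corruption that can occur.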
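#### Exactly-once execution of an operation {#id353}
In “[Fault tolerance](/tw/ch12#sec_stream_fault_tolerance)” we met the idea of **exactly-once** (or **effectively-once**) semantics. If something goes wrong while processing a message, you can either give up (drop the message, which means losing data) or try again. If you try again, there is a risk that the first attempt actually succeeded and you simply did not find out, so the message ends up being processed twice.
Processing something twice is a form of data corruption: charging a customer twice for the same service (billing them too much) or incrementing a counter twice (overstating a metric) is not what we want. In this context, exactly-once means arranging the computation so that the final effect is the same as if no fault had occurred, even if the operation was in fact retried because of some fault. We discussed several approaches for achieving this goal earlier.
One of the most effective approaches is to make the operation **idempotent** (see “[Idempotence](/tw/ch12#sec_stream_idempotence)”), that is, to ensure that it has the same effect whether it is executed once or several times. Taking an operation that is not naturally idempotent and making it idempotent takes some effort and care, however: you may need to maintain additional metadata (such as the set of operation IDs that have updated a value), and you need to ensure fencing when failing over from one node to another (see “[The leader and the lock](/tw/ch9#sec_distributed_lock_fencing)”).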
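As a small illustration of the metadata mentioned above, the sketch below makes a counter increment idempotent by remembering the IDs of operations that have already been applied. The class and parameter names are hypothetical, and the sketch ignores durability and failover, which a real implementation would have to handle.

```python
class IdempotentCounter:
    """Counter whose increments take effect at most once per operation ID."""

    def __init__(self):
        self.value = 0
        self._applied = set()  # IDs of operations that have already been applied

    def increment(self, op_id, amount=1):
        """Apply the increment unless op_id was seen before; return True if applied."""
        if op_id in self._applied:
            return False  # a retry of an operation that already succeeded
        self._applied.add(op_id)
        self.value += amount
        return True


counter = IdempotentCounter()
first = counter.increment("op-1")
retry = counter.increment("op-1")      # e.g. the client never saw the first reply
other = counter.increment("op-2", 5)
```

In practice the set of applied IDs would have to be stored atomically together with the value, and would need some policy for eventually discarding old IDs.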
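#### Duplicate suppression {#id354}
The same pattern of needing to suppress duplicates occurs in many places besides stream processing. TCP, for example, uses sequence numbers on packets so that the recipient can put them in the correct order and determine whether any packets were lost or duplicated on the network. The TCP stack retransmits any lost packets and removes any duplicates before handing the data to the application.
This duplicate suppression, however, works only within the context of a single TCP connection. Suppose the TCP connection is a client's connection to a database, and it is currently executing the transaction in [Example 13-1](#fig_future_non_idempotent). In many databases, a transaction is tied to a client connection (if the client sends several queries, the database knows they belong to the same transaction because they were sent on the same TCP connection). If the client suffers a network interruption and connection timeout after sending the `COMMIT` but before hearing back from the database server, it does not know whether the transaction was committed (see [Figure 9-1](/tw/ch9#fig_distributed_network)).
##### Example 13-1. A non-idempotent transfer of money from one account to another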
```sql
BEGIN TRANSACTION;
UPDATE accounts SET balance = balance + 11.00 WHERE account_id = 1234;
UPDATE accounts SET balance = balance - 11.00 WHERE account_id = 4321;
COMMIT;
```
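The client can reconnect to the database and retry the transaction, but now it is outside the scope of TCP duplicate suppression. Because the transaction in [Example 13-1](#fig_future_non_idempotent) is not idempotent, it could happen that \$22 is transferred instead of the intended \$11. So although [Example 13-1](#fig_future_non_idempotent) is a standard example of transaction atomicity, it is not actually correct, and real banks do not work like this[^3].
Two-phase commit protocols (see “[Atomic commit and two-phase commit](/tw/ch8#sec_transactions_2pc)”) break the 1:1 mapping between a TCP connection and a transaction, because they must allow a transaction coordinator to reconnect to a database after a fault and tell it whether to commit or abort an in-doubt transaction. Is that enough to ensure the transaction is executed exactly once? Unfortunately not.
Even if we can suppress duplicate transactions between the database client and the server, we still have to worry about the network between the end-user device and the application server. For example, if the end-user client is a web browser, it probably uses an HTTP POST request to submit an instruction to the server. Perhaps the user is on a weak cellular data connection, and they manage to send the POST but lose signal before they can receive the response from the server.
In that case the user will probably be shown an error message, and they may retry manually. The web browser warns, “Are you sure you want to submit this form again?”, and the user says yes, because they want the operation to happen. (The Post/Redirect/Get pattern[^54] avoids this warning in normal operation, but it does not help if the POST request times out.) From the web server's point of view the retry is a separate request, and from the database's point of view it is a separate transaction. The usual deduplication mechanisms are of no help.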
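#### Operation identifiers {#id355}
To make an operation idempotent across several hops of network communication, relying on the transaction mechanism provided by a database is not enough; you need to consider the **end-to-end** flow of the request.
For example, you could generate a unique identifier for the operation (such as a UUID) and include it as a hidden form field in the client application, or compute a hash of all the relevant form fields to derive the operation ID[^3]. If the browser submits the POST twice, both requests carry the same operation ID. You can then pass that ID all the way through to the database and make sure that only one operation is ever executed for a given ID, as shown in [Example 13-2](#fig_future_request_id).
##### Example 13-2. Suppressing duplicate requests using a unique ID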
```sql
ALTER TABLE requests ADD UNIQUE (request_id);
BEGIN TRANSACTION;
INSERT INTO requests
(request_id, from_account, to_account, amount)
VALUES('0286FDB8-D7E1-423F-B40B-792B3608036C', 4321, 1234, 11.00);
UPDATE accounts SET balance = balance + 11.00 WHERE account_id = 1234;
UPDATE accounts SET balance = balance - 11.00 WHERE account_id = 4321;
COMMIT;
```
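[Example 13-2](#fig_future_request_id) relies on a uniqueness constraint on the `request_id` column. If a transaction tries to insert an ID that already exists, the `INSERT` fails and the transaction is aborted, so the request cannot take effect twice. Relational databases can generally maintain a uniqueness constraint correctly even at weak isolation levels (whereas an application-level check-then-insert may fail under non-serializable isolation, as discussed in “[Write skew and phantoms](/tw/ch8#sec_transactions_write_skew)”).
Besides suppressing duplicate requests, the `requests` table in [Example 13-2](#fig_future_request_id) also acts as a kind of event log that can be used for event sourcing or change data capture. The updates to the account balances do not actually have to happen in the same transaction as the insertion of the event, because the balances are redundant state that a downstream consumer could derive from the request events. As long as each request event is processed exactly once (which can again be enforced using the request ID), correctness is preserved.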
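The behavior of Example 13-2 under a retry can be tried out with a minimal runnable sketch. It uses Python's built-in `sqlite3` module with an in-memory database as a stand-in for whatever relational database an application actually uses; the table layout follows the example, while the `transfer` helper, the starting balances, and the whole-number amounts are assumptions made for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (account_id INTEGER PRIMARY KEY, balance NUMERIC NOT NULL);
    CREATE TABLE requests (
        request_id   TEXT UNIQUE,
        from_account INTEGER,
        to_account   INTEGER,
        amount       NUMERIC
    );
    INSERT INTO accounts VALUES (1234, 100), (4321, 100);
""")


def transfer(request_id, from_account, to_account, amount):
    """Apply the transfer at most once per request_id; return True if applied."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "INSERT INTO requests VALUES (?, ?, ?, ?)",
                (request_id, from_account, to_account, amount),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE account_id = ?",
                (amount, to_account),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE account_id = ?",
                (amount, from_account),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # the unique constraint rejected a duplicate request ID


request_id = "0286FDB8-D7E1-423F-B40B-792B3608036C"
applied_first = transfer(request_id, 4321, 1234, 11)
applied_retry = transfer(request_id, 4321, 1234, 11)  # client retried after a timeout
balances = dict(conn.execute("SELECT account_id, balance FROM accounts"))
```

The retry is rejected by the database itself, so the application does not need a separate check-then-insert step that could race with a concurrent request.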
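#### The end-to-end argument {#sec_future_e2e_argument}
This scenario of suppressing duplicate transactions is just one example of a more general principle called the **end-to-end argument**, which was articulated by Saltzer, Reed, and Clark in 1984[^55]:
> The function in question can completely and correctly be implemented only with the knowledge and help of the application standing at the endpoints of the communication system. Therefore, providing that questioned function as a feature of the communication system itself is not possible. (Sometimes an incomplete version of the function provided by the communication system may be useful as a performance enhancement.)
>
In our example, **the function in question** is duplicate suppression. We saw that TCP suppresses duplicate packets at the level of a TCP connection, and that some stream processors provide so-called exactly-once semantics at the level of message processing, but neither of these can stop a user from submitting a duplicate request themselves when a request times out. TCP, database transactions, and stream processors cannot rule out these duplicates by themselves. Solving the problem requires an end-to-end solution: a transaction identifier that is passed all the way from the end-user client to the database.
The end-to-end argument also applies to checking the integrity of data. The checksums built into Ethernet, TCP, and TLS can detect corruption of packets on the network, but they cannot detect corruption caused by bugs in the software at the sending and receiving ends of the connection, or corruption on the disks where the data is stored. If you want to catch every possible source of data corruption, you also need end-to-end checksums.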
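A very small sketch of an end-to-end checksum follows: the producing application attaches a digest to the payload, and the consuming application verifies it, regardless of what the layers in between do. The `seal` and `verify` helpers and the record layout are hypothetical. Note that a plain hash like this detects accidental corruption only; protecting against deliberate tampering would require a keyed MAC or a signature.

```python
import hashlib


def seal(payload: bytes) -> dict:
    """Attach a SHA-256 digest computed by the producing application."""
    return {"payload": payload, "sha256": hashlib.sha256(payload).hexdigest()}


def verify(record: dict) -> bool:
    """Recompute the digest at the consuming end and compare."""
    return hashlib.sha256(record["payload"]).hexdigest() == record["sha256"]


record = seal(b"transfer 11.00 from 4321 to 1234")
intact = verify(record)

# Simulate corruption introduced somewhere between the two endpoints,
# for example by a buggy serialization layer or a failing disk.
corrupted = dict(record, payload=b"transfer 71.00 from 4321 to 1234")
detected = not verify(corrupted)
```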
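A similar argument applies to encryption[^55]: the password on your home WiFi network protects against people snooping on your WiFi traffic, but not against attackers elsewhere on the internet; TLS/SSL between your client and the server protects against network attackers, but not against a compromised server. Only end-to-end encryption and authentication can protect against all of these.
Although the low-level features (TCP duplicate suppression, Ethernet checksums, WiFi encryption) cannot provide the desired end-to-end features by themselves, they are still useful, because they reduce the probability of problems at the higher levels. For example, HTTP requests would often get mangled if we did not have TCP to put the packets back in the right order. We just need to remember that low-level reliability features are not sufficient by themselves to ensure end-to-end correctness.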
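#### Applying end-to-end thinking in data systems {#id357}
This brings me back to my original point: just because an application uses a data system that provides comparatively strong safety properties, such as serializable transactions, that does not mean the application is guaranteed to be free from data loss or corruption. The application itself needs to take end-to-end measures, such as duplicate suppression, as well.
That is a shame, because fault-tolerance mechanisms are hard to get right. Low-level reliability mechanisms, such as those in TCP, work quite well, so the remaining higher-level faults occur fairly rarely. It would be really nice to wrap the remaining high-level fault-tolerance machinery in an abstraction so that application code need not worry about it, but I fear that we have not yet found the right abstraction.
Transactions have long been regarded as a good abstraction, and I do believe they are useful. As discussed in [Chapter 8](/tw/ch8), they take a wide range of possible problems (concurrent writes, constraint violations, crashes, network interruptions, disk failures) and collapse them into two possible outcomes: commit or abort. That is a huge simplification of the programming model, but it is not enough.
Transactions are expensive, especially when they involve heterogeneous storage technologies (see “[Distributed transactions in practice](/tw/ch8#sec_transactions_xa)”). When we refuse to use distributed transactions because they are too expensive, we end up having to reimplement fault-tolerance mechanisms in application code. As numerous examples throughout this book have shown, reasoning about concurrency and partial failure is difficult and counterintuitive, so I suspect that most application-level mechanisms do not work correctly. The result is lost or corrupted data.
For these reasons, I think it is worth exploring fault-tolerance abstractions that make it easy to provide application-specific end-to-end correctness properties, while also maintaining good performance and good operational characteristics in a large-scale distributed environment.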
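### Enforcing Constraints {#sec_future_constraints}
Let's think about **correctness** in the context of [unbundling databases](#sec_future_unbundling). We saw that end-to-end duplicate suppression can be achieved with a request ID that is passed all the way from the client to the database. What about other kinds of constraints?
In particular, let's focus on **uniqueness constraints**, such as the one we relied on in [Example 13-2](#fig_future_request_id). In “[Constraints and uniqueness guarantees](/tw/ch10#sec_consistency_uniqueness)” we saw several other examples of application features that need to enforce uniqueness: a username or email address must uniquely identify a user, a file storage service cannot contain more than one file with the same name, and two people cannot book the same seat on a flight or in a theater.
Other kinds of constraints are very similar: for example, ensuring that an account balance never goes negative, that you do not sell more items than you have in stock, or that a meeting room does not have overlapping bookings. Techniques that enforce uniqueness can often be used for these kinds of constraints too.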
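#### Uniqueness constraints require consensus {#id452}
In [Chapter 10](/tw/ch10) we saw that, in a distributed setting, enforcing a uniqueness constraint requires consensus: if there are several concurrent requests with the same value, the system somehow has to decide which one of the conflicting operations is accepted, and reject the others as violations of the constraint.
The most common way of achieving this consensus is to make a single node the leader and put it in charge of making all the decisions. That works as long as you do not mind funneling all requests through a single node (even if the client is on the other side of the world), and as long as that node does not fail. If you need to tolerate failure of the leader, you are back at the consensus problem again (see “[Single-leader replication and consensus](/tw/ch10#from-single-leader-replication-to-consensus)”).
Uniqueness checking can be scaled out by partitioning on the value that needs to be unique. For example, if you need to ensure uniqueness by request ID (as in [Example 13-2](#fig_future_request_id)), you can make sure that all requests with the same request ID are routed to the same partition (see [Chapter 7](/tw/ch7)). If you need usernames to be unique, you can partition by hash of username.
Asynchronous multi-leader replication, however, is ruled out, because different leaders could concurrently accept conflicting writes, and the values would then no longer be unique (see “[Implementing linearizable systems](/tw/ch10#sec_consistency_implementing_linearizable)”). If you want to be able to reject immediately any write that would violate the constraint, synchronous coordination is unavoidable[^56].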
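#### Uniqueness in log-based messaging {#sec_future_uniqueness_log}
A log ensures that all consumers see messages in the same order. This guarantee is formally known as **total order broadcast**, and it is equivalent to consensus (see “[Total order broadcast](/tw/ch10#sec_consistency_total_order)”). In the unbundled database approach with log-based messaging, we can use the same idea to enforce uniqueness constraints.
A stream processor consumes all the messages in a log partition sequentially on a single thread (see “[Logs compared to traditional messaging](/tw/ch12#sec_stream_logs_vs_messaging)”). So if the log is partitioned on the value that needs to be unique, a stream processor can unambiguously and deterministically decide which one of several conflicting operations came first. For example, when several users try to claim the same username[^57]:
1. Every request for a username is encoded as a message and appended to the partition determined by the hash of the username.
2. A stream processor reads the requests in the log sequentially, using a local database to keep track of which usernames are taken. For every request for a username that is available, it records the name as taken and emits a success message to an output stream. For every request for a username that is already taken, it emits a rejection message to the output stream.
3. The client that requested the username watches the output stream and waits for the success or rejection message that corresponds to its request.
This algorithm is basically the same as the one in “[Implementing linearizable storage using total order broadcast](/tw/ch10#sec_consistency_total_order)”. It scales easily to a large request throughput by increasing the number of partitions, because each partition can be processed independently.
The approach works not only for uniqueness constraints but also for many other kinds of constraints. Its fundamental principle is that any writes that may conflict are routed to the same partition and processed sequentially. As discussed in “[What is a conflict?](/tw/ch6#what-is-a-conflict)” and “[Write skew and phantoms](/tw/ch8#sec_transactions_write_skew)”, the definition of a conflict may depend on the application, but the stream processor can use arbitrary logic to validate a request. This idea is similar to the approach pioneered by Bayou in the 1990s[^58].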
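The three steps for claiming a username can be sketched as follows. This is an in-memory simulation under stated assumptions: the partitions are plain lists rather than a real log-based broker, `zlib.crc32` stands in for whatever stable hash the broker would use, and the names `partition_for` and `UsernameProcessor` are hypothetical.

```python
import zlib

NUM_PARTITIONS = 4


def partition_for(username):
    # A stable hash, so every claim for a given username lands in one partition.
    return zlib.crc32(username.encode("utf-8")) % NUM_PARTITIONS


class UsernameProcessor:
    """Single-threaded consumer of one log partition."""

    def __init__(self):
        self.taken = {}  # username -> ID of the request that claimed it

    def process(self, request):
        name, request_id = request["username"], request["request_id"]
        # The first request in log order wins; replaying it yields the same answer.
        owner = self.taken.setdefault(name, request_id)
        status = "success" if owner == request_id else "rejected"
        return {"request_id": request_id, "username": name, "status": status}


# Step 1: append each request to the partition chosen by the username's hash.
partitions = [[] for _ in range(NUM_PARTITIONS)]
requests = [
    {"request_id": "r1", "username": "alice"},
    {"request_id": "r2", "username": "alice"},
    {"request_id": "r3", "username": "bob"},
]
for request in requests:
    partitions[partition_for(request["username"])].append(request)

# Step 2: one processor per partition reads its log sequentially.
output_stream = []
for partition_log in partitions:
    processor = UsernameProcessor()
    output_stream.extend(processor.process(request) for request in partition_log)

# Step 3: each client looks up the message that matches its own request ID.
results = {message["request_id"]: message["status"] for message in output_stream}
```

Because both claims for `alice` end up in the same partition, the processor sees them in a single well-defined order and no coordination between partitions is needed.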
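#### Multi-partition request processing {#id360}
When a request involves several partitions, guaranteeing an atomic effect while still satisfying constraints becomes more challenging. In [Example 13-2](#fig_future_request_id), at least three partitions may be involved: the one containing the request ID, the one containing the payee account, and the one containing the payer account. These are independent of one another and do not necessarily live in the same partition.
In the traditional database approach, a transaction like this usually requires an atomic commit across partitions. That forces the transaction into a total order across partitions, which introduces synchronous coordination overhead and reduces throughput.
With partitioned logs and stream processors, however, equivalent correctness can be achieved without a cross-partition atomic commit.
{{< figure src="/fig/ddia_1302.png" id="fig_future_multi_shard" caption="Figure 13-2. Using an event log and stream processors to check whether the source account has sufficient funds and to atomically transfer money to the destination account and the fees account." class="w-full my-4" >}}
1. The client generates a globally unique request ID for the transfer request and routes the request to the appropriate log partition based on the source account ID.
2. A stream processor consumes that request log and maintains local state for the source account together with the set of request IDs it has already processed. When it sees a new request ID, it first checks whether the balance is sufficient. If it is, the processor reserves the amount in its local state and emits several follow-up events: an outgoing payment event for the source account, an incoming payment event for the destination account, and an incoming payment event for the fees account. All of these events carry the same request ID.
3. The source account processor later receives the outgoing payment event. From the request ID it recognizes this as the payment it reserved earlier, performs the actual debit, and updates its local state. If the event arrives again, it is ignored.
4. The destination account and the fees account are each consumed by an independent processing task. On receiving an incoming payment event, the task updates its local state and deduplicates by request ID.
Figure 13-2 shows the three accounts in three different partitions, but the approach works equally well if they are in the same partition. The key requirements are that events for the same account are processed in log order, that message delivery has at-least-once semantics, and that the processing logic is deterministic.
If the source account processor crashes partway through, then after recovery it replays the same request, makes the same decision, and emits follow-up events with the same request ID. Downstream consumers deduplicate by request ID, so nothing takes effect twice.
Atomicity in this system comes not from a distributed transaction but from the single atomic act of writing the initial request event to the source account's log. Once that initial event has been written successfully, all of the follow-up events will eventually appear: they may be delayed by failure recovery, and they may briefly be duplicated, but they will eventually arrive.
By breaking the multi-partition transaction into several stages that are partitioned by different keys, and by carrying an end-to-end request ID throughout, we can still guarantee in the presence of faults that every request takes effect exactly once for both the payer and the payee, while avoiding an atomic commit protocol.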
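The staged processing described above can be simulated in a few lines. This is only a sketch under simplifying assumptions: the per-account logs are in-memory queues, the replay after a crash is simulated by calling the request handler twice, and all names (`AccountProcessor`, `handle_request`, `FEE_ACCOUNT`, the event fields) are invented for this illustration.

```python
from collections import defaultdict, deque

FEE_ACCOUNT = "fees"


class AccountProcessor:
    """Applies events for one account in log order, deduplicating by request ID."""

    def __init__(self, balance=0):
        self.balance = balance
        self.reserved = {}    # request_id -> amount held but not yet debited
        self.applied = set()  # (request_id, kind) pairs that already took effect

    def available(self):
        return self.balance - sum(self.reserved.values())

    def apply(self, event):
        key = (event["request_id"], event["kind"])
        if key in self.applied:
            return  # duplicate delivery: ignore
        self.applied.add(key)
        if event["kind"] == "debit":
            self.reserved.pop(event["request_id"], None)
            self.balance -= event["amount"]
        else:  # "credit"
            self.balance += event["amount"]


def handle_request(accounts, logs, decisions, req):
    """Validate the request against the source account and emit follow-up events."""
    rid, total = req["request_id"], req["amount"] + req["fee"]
    source = accounts[req["source"]]
    if rid not in decisions:  # the decision is made once and reused on replay
        decisions[rid] = source.available() >= total
        if decisions[rid]:
            source.reserved[rid] = total
    if not decisions[rid]:
        return
    # Events are emitted on every (re)processing; consumers deduplicate them.
    logs[req["source"]].append({"request_id": rid, "kind": "debit", "amount": total})
    logs[req["target"]].append({"request_id": rid, "kind": "credit", "amount": req["amount"]})
    logs[FEE_ACCOUNT].append({"request_id": rid, "kind": "credit", "amount": req["fee"]})


accounts = {
    "alice": AccountProcessor(100),
    "bob": AccountProcessor(0),
    FEE_ACCOUNT: AccountProcessor(0),
}
logs = defaultdict(deque)
decisions = {}
request = {"request_id": "req-1", "source": "alice", "target": "bob", "amount": 50, "fee": 1}

handle_request(accounts, logs, decisions, request)
handle_request(accounts, logs, decisions, request)  # simulated replay after a crash

for account_name, account_log in logs.items():
    while account_log:
        accounts[account_name].apply(account_log.popleft())

balances = {name: account.balance for name, account in accounts.items()}
```

Even though every follow-up event was emitted twice, each account changes only once, because each processor ignores events whose request ID it has already applied.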
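### Timeliness and Integrity {#sec_future_integrity}
A convenient property of transactions is that they are typically linearizable (see “[Linearizability](/tw/ch10#sec_consistency_linearizability)”): a writer waits until the transaction is committed, and from then on its writes are immediately visible to all readers.
That is not the case when we split an operation across several stages of stream processors: consumers of a log are asynchronous by design, so a sender does not wait until its message has been processed by consumers. It is possible, however, for a client to wait for a particular message to appear on an output stream. This is what we did in “[Uniqueness in log-based messaging](#sec_future_uniqueness_log)” when checking whether a uniqueness constraint was satisfied.
In that example, the correctness of the uniqueness check does not depend on whether the sender of the message waits for the outcome. Waiting serves only to inform the sender synchronously whether the uniqueness check succeeded, and this notification can be decoupled from the effects of processing the message.
More generally, I think the term **consistency** conflates two different requirements that are worth considering separately:
* Timeliness
Timeliness means ensuring that users observe the system in an up-to-date state. We saw earlier that a user who reads from a stale copy of the data may observe the system in an inconsistent state (see “[Problems with replication lag](/tw/ch6#sec_replication_lag)”). That inconsistency is temporary, however, and will eventually be resolved simply by waiting and trying again.
The CAP theorem (see “[The cost of linearizability](/tw/ch10#sec_linearizability_cost)”) uses consistency in the sense of **linearizability**, which is a strong way of achieving timeliness. Weaker timeliness properties such as **read-after-write** consistency can also be useful (see “[Reading your own writes](/tw/ch6#sec_replication_ryw)”).
* Integrity
Integrity means the absence of corruption: no data loss, and no contradictory or false data. In particular, if some derived dataset is maintained as a view onto underlying data (see “[Deriving current state from the event log](/tw/ch12#sec_stream_deriving_views)”), the derivation must be correct. For example, a database index must correctly reflect the contents of the database; an index in which some records are missing is not very useful.
If integrity is violated, the inconsistency is permanent: in most cases, waiting and trying again will not fix database corruption. Instead, explicit checking and repair are needed. In the context of ACID transactions (see “[The meaning of ACID](/tw/ch8#sec_transactions_acid)”), consistency is usually understood as some kind of application-specific notion of integrity. Atomicity and durability are important tools for preserving integrity.
In slogan form: violations of timeliness are “eventual consistency,” whereas violations of integrity are “perpetual inconsistency.”
I am going to assert that in most applications, integrity is much more important than timeliness. Violations of timeliness can be confusing and annoying, but violations of integrity can be catastrophic.
For example, it is not surprising if a transaction you made within the last 24 hours does not yet appear on your credit card statement; it is normal for these systems to have some lag. We know that banks reconcile and settle transactions asynchronously, and timeliness is not very important here[^3]. It would be very bad, however, if the statement balance did not equal the previous statement balance plus the sum of the transactions (an error in the sums), or if a transaction was charged to you but not paid to the merchant (disappearing money). Problems like these would be violations of the system's integrity.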
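#### Correctness of dataflow systems {#id453}
ACID transactions usually provide both timeliness (for example, linearizability) and integrity (for example, atomic commit) guarantees. So if you approach application correctness from the point of view of ACID transactions, the distinction between timeliness and integrity is fairly inconsequential.
On the other hand, an interesting property of the event-based dataflow systems discussed in this chapter is that they decouple timeliness and integrity. When processing event streams asynchronously, there is no guarantee of timeliness unless you explicitly build a consumer that waits for a particular message to arrive before returning. Integrity, however, is in fact central to stream processing systems.
**Exactly-once** or **effectively-once** semantics (see “[Fault tolerance](/tw/ch12#sec_stream_fault_tolerance)”) is a mechanism for preserving integrity. If an event is lost, or if an event takes effect twice, the integrity of a data system could be violated. Fault-tolerant message delivery and duplicate suppression (for example, idempotent operations) are therefore important for maintaining the integrity of a data system in the face of faults.
As we saw in the previous section, reliable stream processing systems can preserve integrity without requiring distributed transactions or an atomic commit protocol, which means they have the potential to achieve comparable correctness with much better performance and operational robustness. We achieved this correctness through a combination of mechanisms:
* Representing the content of a write operation as a single message, which can easily be written atomically. This fits very well with event sourcing (see “[Event Sourcing](/tw/ch12#sec_stream_event_sourcing)”).
* Deriving all other state updates from that single message using deterministic derivation functions, similar to stored procedures (see “[Actual serial execution](/tw/ch8#sec_transactions_serial)” and “[Application code as a derivation function](#sec_future_dataflow_derivation)”).
* Passing a client-generated request ID through all of the processing layers, which enables end-to-end duplicate suppression and idempotence.
* Making messages immutable and allowing derived data to be reprocessed at any time, which makes it easier to recover from bugs (see “[Advantages of immutable events](/tw/ch12#sec_stream_immutability_pros)”).
This combination of mechanisms seems to me a very promising direction for building fault-tolerant applications in the future.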
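#### Loosely interpreted constraints {#id362}
As discussed previously, enforcing a uniqueness constraint requires consensus, typically implemented by funneling all events in a particular partition through a single node. This limitation is unavoidable if we want the traditional form of uniqueness constraint, and stream processing cannot get around it.
Another thing to realize, however, is that many real applications can actually get away with much weaker notions of uniqueness:
* If two people concurrently register the same username or book the same seat, you can send one of them a message to apologize and ask them to choose a different one. This kind of change to correct a mistake is called a **compensating transaction**[^59] [^60].
* If customers order more items than you have in your warehouse, you can order in more stock, apologize to customers for the delay, and offer them a discount. This is actually the same thing you would have to do if, say, a forklift truck ran over some of the items in your warehouse and left you with fewer items in stock than you thought you had[^61]. So the apology workflow already needs to be part of your business processes anyway, and it may be unnecessary to require a linearizable constraint on the number of items in stock.
* Similarly, many airlines overbook flights in the expectation that some passengers will miss their flight, and many hotels overbook rooms in the expectation that some guests will cancel. In these cases, the constraint of “one person per seat” is deliberately violated for business reasons, and compensation processes (refunds, upgrades, a complimentary room at a neighboring hotel) are put in place to handle situations in which demand exceeds supply. Even without overbooking, apology and compensation processes would be needed to deal with flights being canceled because of bad weather or staff on strike. Recovering from such issues is simply a normal part of business.
* If someone withdraws more money than they have in their account, the bank can charge them an overdraft fee and ask them to pay back what they owe. By limiting the total withdrawals per day, the risk to the bank is bounded.
In many business contexts, it is actually acceptable to temporarily violate a constraint and fix it up later by apologizing. The cost of the apology (in terms of money or reputation) varies, but it is often quite low: you cannot unsend an email, but you can send a follow-up email with a correction. If you accidentally charge a credit card twice, you can refund one of the charges, and the cost to you is just the processing fees and perhaps a customer complaint. Once money has been paid out of an ATM you cannot directly get it back, although in principle you can send debt collectors to recover the money if the account was overdrawn and the customer will not pay it back.
Whether the cost of the apology is acceptable is a business decision. If it is acceptable, the traditional model of checking all constraints before even writing the data is unnecessarily restrictive, and a linearizable constraint is not needed. It may well be a reasonable choice to go ahead with a write optimistically and to check the constraint after the fact. You can still ensure that validation occurs before doing things that would be expensive to recover from, but that does not mean you must do the validation before you even write the data.
These applications **do** require integrity: you would not want to lose a reservation, or have money disappear because credits and debits do not match up. But they **do not** require timeliness in the enforcement of the constraint: if you have sold more items than you have in the warehouse, you can patch up the problem after the fact by apologizing. Doing so is similar to the conflict resolution approaches we discussed in “[Handling write conflicts](/tw/ch6#sec_replication_write_conflicts)”.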
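As one way of picturing optimistic writes with after-the-fact checking, the sketch below accepts every order immediately and later runs an asynchronous check that emits a compensating action for each order that oversold the stock. The function name `audit_orders`, the order fields, and the `apologize_and_backorder` action are hypothetical names chosen for this illustration.

```python
def audit_orders(stock, orders):
    """Check already-accepted orders against stock and propose compensations.

    The orders are assumed to have been durably recorded when they were placed
    (integrity); this check may run later and asynchronously (no timeliness).
    """
    remaining = dict(stock)
    compensations = []
    for order in orders:  # processed in the order in which they were recorded
        item, quantity = order["item"], order["qty"]
        if remaining.get(item, 0) >= quantity:
            remaining[item] -= quantity
        else:
            compensations.append(
                {"order_id": order["order_id"], "action": "apologize_and_backorder"}
            )
    return remaining, compensations


stock = {"widget": 3}
orders = [
    {"order_id": "o1", "item": "widget", "qty": 2},
    {"order_id": "o2", "item": "widget", "qty": 2},  # oversells the remaining stock
    {"order_id": "o3", "item": "widget", "qty": 1},
]
remaining, compensations = audit_orders(stock, orders)
```

No order is lost or silently dropped; the violated constraint shows up as an explicit compensation record that a business process can act on.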
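#### Coordination-avoiding data systems {#id454}
We have now made two interesting observations:
1. Dataflow systems can maintain integrity guarantees on derived data without atomic commit, linearizability, or synchronous cross-partition coordination.
2. Although strict uniqueness constraints require timeliness and coordination, many applications are actually fine with loose constraints that may be temporarily violated and fixed up later, as long as integrity is preserved throughout.
Taken together, these observations mean that dataflow systems can provide the data management services for many applications without requiring coordination, while still giving strong integrity guarantees. Such **coordination-avoiding** data systems have a lot of appeal: they can achieve better performance and fault tolerance than systems that need to perform synchronous coordination[^56].
For example, such a system could operate in a multi-leader configuration distributed across several datacenters, replicating asynchronously between regions. Any one datacenter can continue operating independently of the others, because no synchronous cross-region coordination is required. Such a system would have weak timeliness guarantees (it could not be linearizable without introducing coordination), but it can still have strong integrity guarantees.
In this context, serializable transactions are still useful as part of maintaining derived state, but they can be run at a small scope where they work well[^8]. Heterogeneous distributed transactions such as XA transactions (see “[Distributed transactions in practice](/tw/ch8#sec_transactions_xa)”) are not required. Synchronous coordination can still be introduced in the places where it is needed (for example, to enforce strict constraints before an operation from which recovery is not possible), but there is no need for everything to pay the cost of coordination if only a small part of an application needs it[^43].
Another way of looking at coordination and constraints: they reduce the number of apologies you have to make because of inconsistencies, but they may also reduce the performance and availability of your system, and so potentially increase the number of apologies you have to make because of outages. You cannot reduce the number of apologies to zero, but you can aim to find the best trade-off for your needs: the sweet spot where there are neither too many inconsistencies nor too many availability problems.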
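### Trust, but Verify {#sec_future_verification}
All of our discussion of correctness, integrity, and fault tolerance has rested on the assumption that certain things might go wrong while other things will not. We call these assumptions our **system model** (see “[Mapping system models to the real world](/tw/ch9#sec_distributed_system_model)”): for example, we should assume that processes can crash, machines can suddenly lose power, and the network can arbitrarily delay or drop messages. But we might also assume that data written to disk is not lost after `fsync`, that data in memory is not corrupted, and that the multiplication instruction of our CPU always returns the correct result.
These assumptions are quite reasonable, since they are true most of the time, and it would be hard to get anything done if we had to worry constantly about our computers making mistakes. Traditionally, system models take a binary approach to faults: we assume that some things can happen and that other things can **never** happen. In reality, it is more a question of probabilities: some things are more likely, others less likely. The question is whether violations of our assumptions happen often enough that we might encounter them in practice.
We have seen that data can become corrupted in memory, on disk, and in transit over the network. Perhaps this deserves more of our attention: when a system is large enough, even very unlikely problems do happen in practice.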
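#### Maintaining integrity in the face of software bugs {#id455}
Besides such hardware issues, there is always the risk of software bugs, which would not be caught by lower-level network, memory, or filesystem checksums. Even widely used database software has bugs: I have personally seen cases of MySQL failing to maintain a uniqueness constraint correctly[^65] and of PostgreSQL's serializable isolation level exhibiting write skew anomalies[^66], even though MySQL and PostgreSQL are robust and well-regarded databases that have been battle-tested by many people for many years. In less mature software, the situation is likely to be far worse.
Despite considerable effort in careful design, testing, and review, bugs still creep in. Although they are rare, and they eventually get found and fixed, there is still a period during which such bugs can corrupt data.
When it comes to application code, we have to assume many more bugs, since most applications receive nowhere near the amount of review and testing that database code does. Many applications do not even correctly use the features that databases offer for preserving integrity, such as foreign key or uniqueness constraints[^36].
Consistency in the sense of ACID (see “[Consistency](/tw/ch8#sec_transactions_acid_consistency)”) is based on the idea that the database starts off in a consistent state and a transaction transforms it from one consistent state to another consistent state. Thus, we expect the database always to be in a consistent state. This notion makes sense, however, only if you assume that the transaction is free from bugs. If the application uses the database incorrectly in some way, for example by using a weak isolation level unsafely, the integrity of the database cannot be guaranteed.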
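#### Don't just blindly trust what they promise {#id364}
With both hardware and software not always living up to the ideal we would like them to be, it seems that data corruption is inevitable sooner or later. We should therefore at least have a way of finding out whether data has been corrupted, so that we can fix it and try to track down the source of the error. Checking the integrity of data is known as **auditing**.
As discussed in “[Advantages of immutable events](/tw/ch12#sec_stream_immutability_pros)”, auditing is not just for financial applications. Auditability is, however, highly important in finance, precisely because everyone knows that mistakes happen and we all recognize the need to be able to detect and fix problems.
Mature systems similarly tend to consider the possibility of unlikely things going wrong, and to manage that risk. For example, large-scale storage systems such as HDFS and Amazon S3 do not fully trust disks: they run background processes that continually read back files, compare them with other replicas, and move files from one disk to another, in order to mitigate the risk of silent corruption[^67].
If you want to be sure that your data is still there, you have to actually read it and check. Most of the time it will still be there, but if it is not, you really want to find out sooner rather than later. By the same argument, it is important to try restoring from your backups from time to time; otherwise you will find out that your backup is broken only when it is too late and you have already lost data. Don't just blindly trust that it is all working.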
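#### Designing for auditability {#id365}
If a transaction mutates several objects in a database, it is difficult to tell after the fact what that transaction meant. Even if you capture the transaction logs (see “[Change data capture](/tw/ch12#sec_stream_cdc)”), the insertions, updates, and deletions in various tables do not necessarily give a clear picture of **why** those mutations were performed. The invocation of the application logic that decided on those mutations is transient and cannot be reproduced.
By contrast, event-based systems can provide better auditability. In the event sourcing approach, user input to the system is represented as a single immutable event, and any resulting state updates are derived from that event. The derivation can be made deterministic and repeatable, so that running the same log of events through the same version of the derivation code results in the same state updates.
Being explicit about dataflow (see “[Philosophy of batch process outputs](/tw/ch11#sec_batch_output)”) makes the **provenance** of data much clearer, which makes integrity checking much more feasible. For the event log, we can use hashes to check that the event storage has not been corrupted. For any derived state, we can rerun the batch and stream processors that derived it from the event log in order to check whether we get the same result, or even run a redundant derivation in parallel.
A deterministic and well-defined dataflow also makes it easier to debug and trace the execution of a system in order to determine **why** it did something[^4] [^69]. If something unexpected occurred, it is valuable to have the diagnostic capability to reproduce the exact circumstances that led to the unexpected event: a kind of time-travel debugging capability.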
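The two checks mentioned above (hashing the event log, and rerunning the derivation to compare against stored derived state) can be sketched as follows. The event layout, the balance-summing derivation, and the function names are assumptions made for this example; they are not taken from any particular system.

```python
import hashlib
import json


def chain_hashes(events):
    """Return one hash per event, each covering the event and the previous hash."""
    hashes, previous = [], ""
    for event in events:
        data = previous + json.dumps(event, sort_keys=True)
        previous = hashlib.sha256(data.encode("utf-8")).hexdigest()
        hashes.append(previous)
    return hashes


def derive_balances(events):
    """Deterministic derivation: fold the event log into per-account balances."""
    balances = {}
    for event in events:
        balances[event["account"]] = balances.get(event["account"], 0) + event["delta"]
    return balances


def audit(events, recorded_hashes, stored_view):
    """Check the log against its recorded hashes and the view against a re-derivation."""
    log_ok = chain_hashes(events) == recorded_hashes
    view_ok = derive_balances(events) == stored_view
    return log_ok, view_ok


events = [
    {"account": "a", "delta": 100},
    {"account": "b", "delta": 20},
    {"account": "a", "delta": -30},
]
recorded = chain_hashes(events)   # stored at write time
view = derive_balances(events)    # the derived state being audited

clean = audit(events, recorded, view)

# Simulate corruption of the first event in storage.
tampered_events = [dict(events[0], delta=1000)] + events[1:]
after_corruption = audit(tampered_events, recorded, view)
```

Chaining each hash to the previous one means that altering or reordering an earlier event changes every later hash, so a single comparison of the final hash already reveals the damage.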
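#### The end-to-end argument again {#id456}
If we cannot fully trust that every individual component of the system will be free from corruption (that every piece of hardware is fault-free and that every piece of software is bug-free), then we must at least periodically check the integrity of our data. If we do not check, we will not find out about corruption until it is too late and it has caused some downstream damage, at which point it will be much harder and more expensive to track down the problem.
Checking the integrity of data systems is best done in an end-to-end fashion (see “[The End-to-End Argument for Databases](#sec_future_end_to_end)”): the more of the system we can include in an integrity check, the fewer opportunities there are for corruption to go unnoticed at some stage of the processing. If we can check that an entire derived data pipeline is correct end to end, then any disks, networks, services, and algorithms along the path are implicitly included in the check.
Continuous end-to-end integrity checks steadily increase your confidence in the correctness of your systems, which in turn allows you to move faster[^70]. Like automated testing, auditing increases the chances that bugs will be found quickly, and so reduces the risk that a change to the system or a new storage technology will cause damage. If you are not afraid of making changes, you can much better evolve an application to meet changing requirements.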
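#### Tools for auditable data systems {#id366}
At present, not many data systems make auditability a first-class concern. Some applications implement their own audit mechanisms (for example, by logging all changes to a separate audit table), but it remains difficult to guarantee that both the audit log and the state of the main database are tamper-proof.
Blockchains such as Bitcoin and Ethereum are essentially shared append-only logs with cryptographic consistency checks; transactions can be viewed as events, and smart contracts as stream processors. They use a consensus protocol to make all nodes agree on the same sequence of events. Compared with the consensus protocols of [Chapter 10](/tw/ch10), one difference is that blockchains emphasize Byzantine fault tolerance: the participating nodes continually check one another's integrity[^71] [^72] [^73].
For most applications, the overall overhead of a blockchain is still too high, but some of its cryptographic tools can be reused in more lightweight settings. For example, a **Merkle tree**[^74] can efficiently prove that a record belongs to some dataset. **Certificate transparency** uses verifiable append-only logs and Merkle trees to check the validity of TLS/SSL certificates[^75] [^76].
In the future, integrity-checking and auditing algorithms of this kind may become more widely used in general-purpose data systems. Making them as scalable as auditing systems that do not use cryptography, while keeping the performance overhead low enough, still requires engineering work, but the direction deserves attention.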
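To show how a Merkle tree can prove that a record belongs to a dataset, here is a deliberately simplified sketch. It is not the exact scheme used by certificate transparency or by any particular blockchain: it duplicates the last node on odd-sized levels and omits the leaf/interior domain separation that production designs use, and all function names are hypothetical.

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_levels(leaves):
    """Return every level of the tree, from the leaf hashes up to the root."""
    level = [h(leaf) for leaf in leaves]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:  # duplicate the last node on odd-sized levels
            level = level + [level[-1]]
            levels[-1] = level
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def inclusion_proof(levels, index):
    """Collect the sibling hash at each level on the path from a leaf to the root."""
    proof = []
    for level in levels[:-1]:
        sibling = index ^ 1
        proof.append((level[sibling], sibling < index))  # (hash, sibling_is_left)
        index //= 2
    return proof


def verify(leaf, proof, root):
    """Recompute the root from the leaf and its proof, then compare."""
    node = h(leaf)
    for sibling, sibling_is_left in proof:
        node = h(sibling + node) if sibling_is_left else h(node + sibling)
    return node == root


records = [b"tx1", b"tx2", b"tx3"]
levels = build_levels(records)
root = levels[-1][0]

proof = inclusion_proof(levels, 2)        # proof that b"tx3" is in the dataset
genuine = verify(b"tx3", proof, root)
forged = verify(b"tx9", proof, root)
```

The proof contains one hash per level of the tree, so its size grows logarithmically with the number of records, which is what makes this kind of check cheap for a verifier that knows only the root hash.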
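## Summary {#id367}
In this chapter we discussed new approaches to designing data systems, and I included my personal opinions and speculations about the future. We started with the observation that no single tool can efficiently serve all possible use cases, so applications necessarily have to compose several different pieces of software to accomplish their goals. We discussed how to solve this **data integration** problem by using batch processing and event streams to let data changes flow between different systems.
In this approach, certain systems are designated as systems of record, and other data is derived from them through transformations. In this way we can maintain indexes, materialized views, machine learning models, statistical summaries, and more. Making these derivations and transformations asynchronous and loosely coupled prevents a problem in one area from spreading to unrelated parts of the system, which increases the robustness and fault tolerance of the system as a whole.
Expressing dataflows as transformations from one dataset to another also helps applications evolve: if you want to change one of the processing steps, for example to change the structure of an index or cache, you can simply rerun the new transformation code over the whole input dataset in order to rederive the output. Similarly, if something goes wrong, you can fix the code and reprocess the data in order to recover.
These processes are quite similar to what databases already do internally, so we recast the idea of dataflow applications as **unbundling** the components of a database and building an application by composing these loosely coupled components.
Derived state can be updated by observing changes in the underlying data. Moreover, the derived state itself can in turn be observed by downstream consumers. We can even take this dataflow all the way through to the end-user device that is displaying the data, and so build user interfaces that update dynamically to reflect data changes and continue to work offline.
Next, we discussed how to ensure that all of this processing remains correct in the presence of faults. We saw that strong integrity guarantees can be implemented scalably with asynchronous event processing, by using end-to-end operation identifiers to make operations idempotent and by checking constraints asynchronously. Clients can either wait until the check has passed, or go ahead without waiting and risk having to apologize for a constraint violation. This approach is much more scalable and robust than the traditional approach of using distributed transactions, and it fits with how many business processes work in practice.
By structuring applications around dataflow and checking constraints asynchronously, we can avoid most coordination and build systems that maintain integrity and still perform well, even when geographically distributed and in the presence of faults. We then discussed how to use audits to verify integrity and detect corruption, and noted that some of the mechanisms used by blockchains and distributed ledgers share ideas with event-driven systems.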
##### Footnotes
### References {#references}
[^1]: Rachid Belaid: “[Postgres Full-Text Search is Good Enough!](http://rachbelaid.com/postgres-full-text-search-is-good-enough/),” *rachbelaid.com*, July 13, 2015.
[^2]: Philippe Ajoux, Nathan Bronson, Sanjeev Kumar, et al.: “[Challenges to Adopting Stronger Consistency at Scale](https://www.usenix.org/system/files/conference/hotos15/hotos15-paper-ajoux.pdf),” at *15th USENIX Workshop on Hot Topics in Operating Systems* (HotOS), May 2015.
[^3]: Pat Helland and Dave Campbell: “[Building on Quicksand](https://web.archive.org/web/20220606172817/https://database.cs.wisc.edu/cidr/cidr2009/Paper_133.pdf),” at *4th Biennial Conference on Innovative Data Systems Research* (CIDR), January 2009.
[^4]: Jessica Kerr: “[Provenance and Causality in Distributed Systems](https://web.archive.org/web/20190425150540/http://blog.jessitron.com/2016/09/provenance-and-causality-in-distributed.html),” *blog.jessitron.com*, September 25, 2016.
[^5]: Kostas Tzoumas: “[Batch Is a Special Case of Streaming](http://data-artisans.com/blog/batch-is-a-special-case-of-streaming/),” *data-artisans.com*, September 15, 2015.
[^6]: Shinji Kim and Robert Blafford: “[Stream Windowing Performance Analysis: Concord and Spark Streaming](https://web.archive.org/web/20180125074821/http://concord.io/posts/windowing_performance_analysis_w_spark_streaming),” *concord.io*, July 6, 2016.
[^7]: Jay Kreps: “[The Log: What Every Software Engineer Should Know About Real-Time Data's Unifying Abstraction](http://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying),” *engineering.linkedin.com*, December 16, 2013.
[^8]: Pat Helland: “[Life Beyond Distributed Transactions: An Apostate’s Opinion](https://web.archive.org/web/20200730171311/http://www-db.cs.wisc.edu/cidr/cidr2007/papers/cidr07p15.pdf),” at *3rd Biennial Conference on Innovative Data Systems Research* (CIDR), January 2007.
[^9]: “[Great Western Railway (1835–1948)](https://web.archive.org/web/20160122155425/https://www.networkrail.co.uk/VirtualArchive/great-western/),” Network Rail Virtual Archive, *networkrail.co.uk*.
[^10]: Jacqueline Xu: “[Online Migrations at Scale](https://stripe.com/blog/online-migrations),” *stripe.com*, February 2, 2017.
[^11]: Molly Bartlett Dishman and Martin Fowler: “[Agile Architecture](https://web.archive.org/web/20161130034721/http://conferences.oreilly.com/software-architecture/sa2015/public/schedule/detail/40388),” at *O'Reilly Software Architecture Conference*, March 2015.
[^12]: Nathan Marz and James Warren: [*Big Data: Principles and Best Practices of Scalable Real-Time Data Systems*](https://www.manning.com/books/big-data). Manning, 2015. ISBN: 978-1-617-29034-3
[^13]: Oscar Boykin, Sam Ritchie, Ian O'Connell, and Jimmy Lin: “[Summingbird: A Framework for Integrating Batch and Online MapReduce Computations](http://www.vldb.org/pvldb/vol7/p1441-boykin.pdf),” at *40th International Conference on Very Large Data Bases* (VLDB), September 2014.
[^14]: Jay Kreps: “[Questioning the Lambda Architecture](https://www.oreilly.com/ideas/questioning-the-lambda-architecture),” *oreilly.com*, July 2, 2014.
[^15]: Raul Castro Fernandez, Peter Pietzuch, Jay Kreps, et al.: “[Liquid: Unifying Nearline and Offline Big Data Integration](http://cidrdb.org/cidr2015/Papers/CIDR15_Paper25u.pdf),” at *7th Biennial Conference on Innovative Data Systems Research* (CIDR), January 2015.
[^16]: Dennis M. Ritchie and Ken Thompson: “[The UNIX Time-Sharing System](http://web.eecs.utk.edu/~qcao1/cs560/papers/paper-unix.pdf),” *Communications of the ACM*, volume 17, number 7, pages 365–375, July 1974. [doi:10.1145/361011.361061](http://dx.doi.org/10.1145/361011.361061)
[^17]: Eric A. Brewer and Joseph M. Hellerstein: “[CS262a: Advanced Topics in Computer Systems](http://people.eecs.berkeley.edu/~brewer/cs262/systemr.html),” lecture notes, University of California, Berkeley, *cs.berkeley.edu*, August 2011.
[^18]: Michael Stonebraker: “[The Case for Polystores](http://wp.sigmod.org/?p=1629),” *wp.sigmod.org*, July 13, 2015.
[^19]: Jennie Duggan, Aaron J. Elmore, Michael Stonebraker, et al.: “[The BigDAWG Polystore System](https://dspace.mit.edu/handle/1721.1/100936),” *ACM SIGMOD Record*, volume 44, number 2, pages 11–16, June 2015. [doi:10.1145/2814710.2814713](http://dx.doi.org/10.1145/2814710.2814713)
[^20]: Patrycja Dybka: “[Foreign Data Wrappers for PostgreSQL](https://web.archive.org/web/20221003115732/https://www.vertabelo.com/blog/foreign-data-wrappers-for-postgresql/),” *vertabelo.com*, March 24, 2015.
[^21]: David B. Lomet, Alan Fekete, Gerhard Weikum, and Mike Zwilling: “[Unbundling Transaction Services in the Cloud](https://www.microsoft.com/en-us/research/publication/unbundling-transaction-services-in-the-cloud/),” at *4th Biennial Conference on Innovative Data Systems Research* (CIDR), January 2009.
[^22]: Martin Kleppmann and Jay Kreps: “[Kafka, Samza and the Unix Philosophy of Distributed Data](http://martin.kleppmann.com/papers/kafka-debull15.pdf),” *IEEE Data Engineering Bulletin*, volume 38, number 4, pages 4–14, December 2015.
[^23]: John Hugg: “[Winning Now and in the Future: Where VoltDB Shines](https://voltdb.com/blog/winning-now-and-future-where-voltdb-shines),” *voltdb.com*, March 23, 2016.
[^24]: Frank McSherry, Derek G. Murray, Rebecca Isaacs, and Michael Isard: “[Differential Dataflow](http://cidrdb.org/cidr2013/Papers/CIDR13_Paper111.pdf),” at *6th Biennial Conference on Innovative Data Systems Research* (CIDR), January 2013.
[^25]: Derek G Murray, Frank McSherry, Rebecca Isaacs, et al.: “[Naiad: A Timely Dataflow System](http://sigops.org/s/conferences/sosp/2013/papers/p439-murray.pdf),” at *24th ACM Symposium on Operating Systems Principles* (SOSP), pages 439–455, November 2013. [doi:10.1145/2517349.2522738](http://dx.doi.org/10.1145/2517349.2522738)
[^26]: Gwen Shapira: “[We have a bunch of customers who are implementing ‘database inside-out’ concept and they all ask ‘is anyone else doing it? are we crazy?’](https://twitter.com/gwenshap/status/758800071110430720)” *twitter.com*, July 28, 2016.
[^27]: Martin Kleppmann: “[Turning the Database Inside-out with Apache Samza](http://martin.kleppmann.com/2015/03/04/turning-the-database-inside-out.html),” at *Strange Loop*, September 2014.
[^28]: Peter Van Roy and Seif Haridi: [*Concepts, Techniques, and Models of Computer Programming*](https://www.info.ucl.ac.be/~pvr/book.html). MIT Press, 2004. ISBN: 978-0-262-22069-9
[^29]: “[Juttle Documentation](http://juttle.github.io/juttle/),” *juttle.github.io*, 2016.
[^30]: Evan Czaplicki and Stephen Chong: “[Asynchronous Functional Reactive Programming for GUIs](http://people.seas.harvard.edu/~chong/pubs/pldi13-elm.pdf),” at *34th ACM SIGPLAN Conference on Programming Language Design and Implementation* (PLDI), June 2013. [doi:10.1145/2491956.2462161](http://dx.doi.org/10.1145/2491956.2462161)
[^31]: Engineer Bainomugisha, Andoni Lombide Carreton, Tom van Cutsem, Stijn Mostinckx, and Wolfgang de Meuter: “[A Survey on Reactive Programming](http://soft.vub.ac.be/Publications/2012/vub-soft-tr-12-13.pdf),” *ACM Computing Surveys*, volume 45, number 4, pages 1–34, August 2013. [doi:10.1145/2501654.2501666](http://dx.doi.org/10.1145/2501654.2501666)
[^32]: Peter Alvaro, Neil Conway, Joseph M. Hellerstein, and William R. Marczak: “[Consistency Analysis in Bloom: A CALM and Collected Approach](https://dsf.berkeley.edu/cs286/papers/calm-cidr2011.pdf),” at *5th Biennial Conference on Innovative Data Systems Research* (CIDR), January 2011.
[^33]: Felienne Hermans: “[Spreadsheets Are Code](https://vimeo.com/145492419),” at *Code Mesh*, November 2015.
[^34]: Dan Bricklin and Bob Frankston: “[VisiCalc: Information from Its Creators](http://danbricklin.com/visicalc.htm),” *danbricklin.com*.
[^35]: D. Sculley, Gary Holt, Daniel Golovin, et al.: “[Machine Learning: The High-Interest Credit Card of Technical Debt](http://research.google.com/pubs/pub43146.html),” at *NIPS Workshop on Software Engineering for Machine Learning* (SE4ML), December 2014.
[^36]: Peter Bailis, Alan Fekete, Michael J Franklin, et al.: “[Feral Concurrency Control: An Empirical Investigation of Modern Application Integrity](http://www.bailis.org/papers/feral-sigmod2015.pdf),” at *ACM International Conference on Management of Data* (SIGMOD), June 2015. [doi:10.1145/2723372.2737784](http://dx.doi.org/10.1145/2723372.2737784)
[^37]: Guy Steele: “[Re: Need for Macros (Was Re: Icon)](https://people.csail.mit.edu/gregs/ll1-discuss-archive-html/msg01134.html),” email to *ll1-discuss* mailing list, *people.csail.mit.edu*, December 24, 2001.
[^38]: David Gelernter: “[Generative Communication in Linda](http://cseweb.ucsd.edu/groups/csag/html/teaching/cse291s03/Readings/p80-gelernter.pdf),” *ACM Transactions on Programming Languages and Systems* (TOPLAS), volume 7, number 1, pages 80–112, January 1985. [doi:10.1145/2363.2433](http://dx.doi.org/10.1145/2363.2433)
[^39]: Patrick Th. Eugster, Pascal A. Felber, Rachid Guerraoui, and Anne-Marie Kermarrec: “[The Many Faces of Publish/Subscribe](http://www.cs.ru.nl/~pieter/oss/manyfaces.pdf),” *ACM Computing Surveys*, volume 35, number 2, pages 114–131, June 2003. [doi:10.1145/857076.857078](http://dx.doi.org/10.1145/857076.857078)
[^40]: Ben Stopford: “[Microservices in a Streaming World](https://www.infoq.com/presentations/microservices-streaming),” at *QCon London*, March 2016.
[^41]: Christian Posta: “[Why Microservices Should Be Event Driven: Autonomy vs Authority](http://blog.christianposta.com/microservices/why-microservices-should-be-event-driven-autonomy-vs-authority/),” *blog.christianposta.com*, May 27, 2016.
[^42]: Alex Feyerke: “[Say Hello to Offline First](https://web.archive.org/web/20210420014747/http://hood.ie/blog/say-hello-to-offline-first.html),” *hood.ie*, November 5, 2013.
[^43]: Sebastian Burckhardt, Daan Leijen, Jonathan Protzenko, and Manuel Fähndrich: “[Global Sequence Protocol: A Robust Abstraction for Replicated Shared State](http://drops.dagstuhl.de/opus/volltexte/2015/5238/),” at *29th European Conference on Object-Oriented Programming* (ECOOP), July 2015. [doi:10.4230/LIPIcs.ECOOP.2015.568](http://dx.doi.org/10.4230/LIPIcs.ECOOP.2015.568)
[^44]: Mark Soper: “[Clearing Up React Data Management Confusion with Flux, Redux, and Relay](https://medium.com/@marksoper/clearing-up-react-data-management-confusion-with-flux-redux-and-relay-aad504e63cae),” *medium.com*, December 3, 2015.
[^45]: Eno Thereska, Damian Guy, Michael Noll, and Neha Narkhede: “[Unifying Stream Processing and Interactive Queries in Apache Kafka](http://www.confluent.io/blog/unifying-stream-processing-and-interactive-queries-in-apache-kafka/),” *confluent.io*, October 26, 2016.
[^46]: Frank McSherry: “[Dataflow as Database](https://github.com/frankmcsherry/blog/blob/master/posts/2016-07-17.md),” *github.com*, July 17, 2016.
[^47]: Peter Alvaro: “[I See What You Mean](https://www.youtube.com/watch?v=R2Aa4PivG0g),” at *Strange Loop*, September 2015.
[^48]: Nathan Marz: “[Trident: A High-Level Abstraction for Realtime Computation](https://blog.twitter.com/2012/trident-a-high-level-abstraction-for-realtime-computation),” *blog.twitter.com*, August 2, 2012.
[^49]: Edi Bice: “[Low Latency Web Scale Fraud Prevention with Apache Samza, Kafka and Friends](http://www.slideshare.net/edibice/extremely-low-latency-web-scale-fraud-prevention-with-apache-samza-kafka-and-friends),” at *Merchant Risk Council MRC Vegas Conference*, March 2016.
[^50]: Charity Majors: “[The Accidental DBA](https://charity.wtf/2016/10/02/the-accidental-dba/),” *charity.wtf*, October 2, 2016.
[^51]: Arthur J. Bernstein, Philip M. Lewis, and Shiyong Lu: “[Semantic Conditions for Correctness at Different Isolation Levels](http://db.cs.berkeley.edu/cs286/papers/isolation-icde2000.pdf),” at *16th International Conference on Data Engineering* (ICDE), February 2000. [doi:10.1109/ICDE.2000.839387](http://dx.doi.org/10.1109/ICDE.2000.839387)
[^52]: Sudhir Jorwekar, Alan Fekete, Krithi Ramamritham, and S. Sudarshan: “[Automating the Detection of Snapshot Isolation Anomalies](http://www.vldb.org/conf/2007/papers/industrial/p1263-jorwekar.pdf),” at *33rd International Conference on Very Large Data Bases* (VLDB), September 2007.
[^53]: Kyle Kingsbury: [Jepsen blog post series](https://aphyr.com/tags/jepsen), *aphyr.com*, 2013–2016.
[^54]: Michael Jouravlev: “[Redirect After Post](http://www.theserverside.com/news/1365146/Redirect-After-Post),” *theserverside.com*, August 1, 2004.
[^55]: Jerome H. Saltzer, David P. Reed, and David D. Clark: “[End-to-End Arguments in System Design](https://groups.csail.mit.edu/ana/Publications/PubPDFs/End-to-End%20Arguments%20in%20System%20Design.pdf),” *ACM Transactions on Computer Systems*, volume 2, number 4, pages 277–288, November 1984. [doi:10.1145/357401.357402](http://dx.doi.org/10.1145/357401.357402)
[^56]: Peter Bailis, Alan Fekete, Michael J. Franklin, et al.: “[Coordination-Avoiding Database Systems](http://arxiv.org/pdf/1402.2237.pdf),” *Proceedings of the VLDB Endowment*, volume 8, number 3, pages 185–196, November 2014.
[^57]: Alex Yarmula: “[Strong Consistency in Manhattan](https://blog.twitter.com/2016/strong-consistency-in-manhattan),” *blog.twitter.com*, March 17, 2016.
[^58]: Douglas B Terry, Marvin M Theimer, Karin Petersen, et al.: “[Managing Update Conflicts in Bayou, a Weakly Connected Replicated Storage System](http://css.csail.mit.edu/6.824/2014/papers/bayou-conflicts.pdf),” at *15th ACM Symposium on Operating Systems Principles* (SOSP), pages 172–182, December 1995. [doi:10.1145/224056.224070](http://dx.doi.org/10.1145/224056.224070)
[^59]: Jim Gray: “[The Transaction Concept: Virtues and Limitations](http://jimgray.azurewebsites.net/papers/thetransactionconcept.pdf),” at *7th International Conference on Very Large Data Bases* (VLDB), September 1981.
[^60]: Hector Garcia-Molina and Kenneth Salem: “[Sagas](http://www.cs.cornell.edu/andru/cs711/2002fa/reading/sagas.pdf),” at *ACM International Conference on Management of Data* (SIGMOD), May 1987. [doi:10.1145/38713.38742](http://dx.doi.org/10.1145/38713.38742)
[^61]: Pat Helland: “[Memories, Guesses, and Apologies](https://web.archive.org/web/20160304020907/http://blogs.msdn.com/b/pathelland/archive/2007/05/15/memories-guesses-and-apologies.aspx),” *blogs.msdn.com*, May 15, 2007.
[^62]: Yoongu Kim, Ross Daly, Jeremie Kim, et al.: “[Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors](https://users.ece.cmu.edu/~yoonguk/papers/kim-isca14.pdf),” at *41st Annual International Symposium on Computer Architecture* (ISCA), June 2014. [doi:10.1145/2678373.2665726](http://dx.doi.org/10.1145/2678373.2665726)
[^63]: Mark Seaborn and Thomas Dullien: “[Exploiting the DRAM Rowhammer Bug to Gain Kernel Privileges](https://googleprojectzero.blogspot.co.uk/2015/03/exploiting-dram-rowhammer-bug-to-gain.html),” *googleprojectzero.blogspot.co.uk*, March 9, 2015.
[^64]: Jim N. Gray and Catharine van Ingen: “[Empirical Measurements of Disk Failure Rates and Error Rates](https://www.microsoft.com/en-us/research/publication/empirical-measurements-of-disk-failure-rates-and-error-rates/),” Microsoft Research, MSR-TR-2005-166, December 2005.
[^65]: Annamalai Gurusami and Daniel Price: “[Bug #73170: Duplicates in Unique Secondary Index Because of Fix of Bug#68021](http://bugs.mysql.com/bug.php?id=73170),” *bugs.mysql.com*, July 2014.
[^66]: Gary Fredericks: “[Postgres Serializability Bug](https://github.com/gfredericks/pg-serializability-bug),” *github.com*, September 2015.
[^67]: Xiao Chen: “[HDFS DataNode Scanners and Disk Checker Explained](http://blog.cloudera.com/blog/2016/12/hdfs-datanode-scanners-and-disk-checker-explained/),” *blog.cloudera.com*, December 20, 2016.
[^68]: Jay Kreps: “[Getting Real About Distributed System Reliability](http://blog.empathybox.com/post/19574936361/getting-real-about-distributed-system-reliability),” *blog.empathybox.com*, March 19, 2012.
[^69]: Martin Fowler: “[The LMAX Architecture](http://martinfowler.com/articles/lmax.html),” *martinfowler.com*, July 12, 2011.
[^70]: Sam Stokes: “[Move Fast with Confidence](http://blog.samstokes.co.uk/blog/2016/07/11/move-fast-with-confidence/),” *blog.samstokes.co.uk*, July 11, 2016.
[^71]: “[Hyperledger Sawtooth documentation](https://web.archive.org/web/20220120211548/https://sawtooth.hyperledger.org/docs/core/releases/latest/introduction.html),” Intel Corporation, *sawtooth.hyperledger.org*, 2017.
[^72]: Richard Gendal Brown: “[Introducing R3 Corda™: A Distributed Ledger Designed for Financial Services](https://gendal.me/2016/04/05/introducing-r3-corda-a-distributed-ledger-designed-for-financial-services/),” *gendal.me*, April 5, 2016.
[^73]: Trent McConaghy, Rodolphe Marques, Andreas Müller, et al.: “[BigchainDB: A Scalable Blockchain Database](https://www.bigchaindb.com/whitepaper/bigchaindb-whitepaper.pdf),” *bigchaindb.com*, June 8, 2016.
[^74]: Ralph C. Merkle: “[A Digital Signature Based on a Conventional Encryption Function](https://people.eecs.berkeley.edu/~raluca/cs261-f15/readings/merkle.pdf),” at *CRYPTO '87*, August 1987. [doi:10.1007/3-540-48184-2_32](http://dx.doi.org/10.1007/3-540-48184-2_32)
[^75]: Ben Laurie: “[Certificate Transparency](http://queue.acm.org/detail.cfm?id=2668154),” *ACM Queue*, volume 12, number 8, pages 10-19, August 2014. [doi:10.1145/2668152.2668154](http://dx.doi.org/10.1145/2668152.2668154)
[^76]: Mark D. Ryan: “[Enhanced Certificate Transparency and End-to-End Encrypted Mail](https://www.ndss-symposium.org/wp-content/uploads/2017/09/12_2_1.pdf),” at *Network and Distributed System Security Symposium* (NDSS), February 2014. [doi:10.14722/ndss.2014.23379](http://dx.doi.org/10.14722/ndss.2014.23379)
================================================
FILE: content/tw/ch14.md
================================================
---
title: "14. Doing the Right Thing"
weight: 314
breadcrumbs: false
---

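> *Feeding AI systems on the world's beauty, ugliness, and cruelty, while expecting them to reflect only the beauty, is a fantasy.*
>
> Vinay Uday Prabhu and Abeba Birhane, *Large Datasets: A Pyrrhic Win for Computer Vision?* (2020)
In the final chapter of this book, let's take a step back. Throughout the book we have examined a wide range of architectures for data systems, evaluated their pros and cons, and explored how to build reliable, scalable, and maintainable applications. However, we have left out an important and fundamental part of the discussion, which we should now fill in.
Every system is built for a purpose; every action we take has both intended and unintended consequences. The purpose may be as simple as making money, but the consequences for the world may reach far beyond that original purpose. We, the engineers building these systems, have a responsibility to carefully consider those consequences and to consciously decide what kind of world we want to live in.
We often talk about data as an abstract thing, but remember that many datasets are about people: their behavior, their interests, their identity. We must treat such data with humanity and respect. Users are humans too, and human dignity is paramount [^1].
Software development increasingly involves making important ethical choices. There are guidelines to help software engineers navigate these issues, such as the ACM Code of Ethics and Professional Conduct [^2], but they are rarely discussed, applied, and enforced in practice. As a result, engineers and product managers sometimes take a cavalier attitude to privacy and to the potential negative consequences of their products [^3], [^4].
A technology is not good or bad in itself; what matters is how it is used and how it affects people. This is true of a software system such as a search engine in much the same way as it is of a weapon such as a gun. It is not sufficient for software engineers to focus exclusively on the technology and ignore its consequences: the ethical responsibility is ours to bear as well. Reasoning about ethics is difficult, but it is too important to ignore.
However, what counts as "good" or "bad" is not well defined, and most people in computing do not even discuss the question [^5]. Unlike many concepts in computing, the core concepts of ethics do not have a single precise, determinate meaning; they require interpretation, which may be subjective [^6]. Ethics is not a matter of going through a checklist to confirm that you comply; it is a participatory and iterative process of reflection, carried out in dialogue with the people affected, and with accountability for the results [^7].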
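## Predictive Analytics {#id369}
For example, predictive analytics is a major part of why people are excited about big data and AI. Using data analysis to predict the weather or the spread of diseases is one thing [^8]; it is another matter to predict whether a convict is likely to reoffend, whether an applicant for a loan is likely to default, or whether an insurance customer is likely to make expensive claims [^9]. The latter have a direct effect on individual people's lives.
Naturally, payment networks want to prevent fraudulent transactions, banks want to avoid bad loans, airlines want to avoid hijackings, and companies want to avoid hiring ineffective or untrustworthy people. From their point of view, the cost of a missed business opportunity is low, whereas the cost of a bad loan or a problematic employee is much higher, so it is entirely understandable that organizations err on the side of caution. If in doubt, they are better off saying no.
However, as algorithmic decision making becomes more widespread, someone who has been labeled as "risky" by some algorithm (whether accurately or not) may suffer a large number of those "no" decisions. Systematically being excluded from jobs, air travel, insurance coverage, property rental, financial services, and other key aspects of society is such a large constraint on an individual's freedom that it has been called "algorithmic prison" [^10]. In countries that respect human rights, the criminal justice system presumes innocence until guilt is proven; automated systems, on the other hand, can systematically and arbitrarily exclude a person from participating in society without any proof of guilt, and with little chance of appeal.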
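### Bias and Discrimination {#id370}
Decisions made by an algorithm are not necessarily any better or any worse than those made by a human. Every person is likely to have biases, even if they actively try to counteract them, and discriminatory practices can become culturally institutionalized. There is hope that basing decisions on data, rather than on people's subjective and instinctive assessments, could be fairer and could give a better chance to people who are often overlooked in the traditional system [^11].
When we develop predictive analytics and AI systems, we are not merely automating a human's decision by using software to specify the rules for when to say yes or no; we are leaving even the rules themselves to be inferred from data. However, the patterns learned by these systems are often opaque: even if there is some correlation in the data, we may not know why. If there is a systematic bias in the input to an algorithm, the system will most likely learn and amplify that bias in its output [^12].
In many countries, anti-discrimination laws prohibit treating people differently depending on protected traits such as ethnicity, age, gender, sexuality, disability, or beliefs. Other features of a person's data may be analyzed, but what happens if they are correlated with protected traits? For example, in racially segregated neighborhoods, a person's postal code or even their IP address is a strong predictor of race. Put like this, it seems almost absurd to believe that an algorithm could take biased data as input and produce fair and impartial output from it [^13], [^14]. Yet this belief is often implied by proponents of data-driven decision making, an attitude that has been satirized as "machine learning is like money laundering for bias" [^15].
Predictive analytics systems merely extrapolate from the past; if the past is discriminatory, they codify and amplify that discrimination [^16]. If we want the future to be better than the past, moral imagination is required, and that is something only humans can provide [^17]. Data and models should be our tools, not our masters.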
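### Responsibility and Accountability {#id371}
Automated decision making raises the question of responsibility and accountability [^17]. If a human makes a mistake, they can be held accountable, and the person affected by the decision can appeal. Algorithms make mistakes too, but who is accountable if they go wrong [^18]? When a self-driving car causes an accident, who is responsible? If an automated credit scoring algorithm systematically discriminates against people of a particular race or religion, is there any recourse? If a decision by your machine learning system comes under judicial review, can you explain to the judge how the algorithm made its decision? People should not be able to evade their responsibility by blaming an algorithm.
Credit rating agencies are an older example of collecting data to make decisions about people. A bad credit score makes life difficult, but at least a credit score is normally based on relevant facts about a person's actual borrowing history, and any errors in the record can be corrected (although the agencies normally do not make this easy). By contrast, scoring algorithms based on machine learning typically use a much wider range of inputs and are much more opaque, making it harder to understand how a particular decision has come about and whether someone is being treated in an unfair or discriminatory way [^19].
A credit score summarizes "How did you behave in the past?" whereas predictive analytics usually works on the basis of "Who is similar to you, and how did people like you behave in the past?" Drawing parallels to the behavior of similar people implies stereotyping, for example based on where someone lives (often a close proxy for race and socioeconomic class). What about people who get put in the wrong bucket? Furthermore, if a decision is incorrect due to erroneous data, recourse is almost impossible [^17].
Much data is statistical in nature, which means that even if the probability distribution on the whole is correct, individual cases may well be wrong. For example, if the average life expectancy in a country is 80 years, that does not mean you will die on your 80th birthday. From the average and the probability distribution alone, we cannot say much about the age to which one particular person will live. Similarly, the output of a prediction system is probabilistic and may well be wrong in individual cases.
A blind belief in the supremacy of data for making decisions is not only delusional, it is positively dangerous. As data-driven decision making becomes more widespread, we will need to figure out how to make algorithms accountable and transparent, how to avoid reinforcing existing biases, and how to fix them when they inevitably make mistakes.
We will also need to figure out how to prevent data being used to harm people, and how to realize its positive potential instead. For example, analytics can reveal financial and social characteristics of people's lives. On the one hand, this power could be used to focus aid and support on the people who most need it. On the other hand, it is sometimes used by predatory businesses seeking to identify vulnerable people and sell them risky products such as high-cost loans and near-worthless degree programs [^17], [^20].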
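### Feedback Loops {#id372}
Even with predictive applications that have less immediate effects on people, such as recommendation systems, there are difficult issues that we must confront. When services become good at predicting what content users want to see, they may end up showing people only opinions they already agree with, leading to echo chambers in which stereotypes, misinformation, and polarization can breed. We have already seen the impact of social media echo chambers on election campaigns.
When predictive analytics affect people's lives, particularly pernicious problems arise due to self-reinforcing feedback loops. For example, consider the case of employers using credit scores to evaluate potential hires. You may be a good worker with a good credit score, but suddenly find yourself in financial difficulties due to a misfortune outside of your control. As you miss payments on your bills, your credit score suffers, and you become less likely to find work. Joblessness pushes you toward poverty, which further worsens your score, making it even harder to find employment [^17]. This is a downward spiral: poisonous assumptions hidden behind a camouflage of mathematical rigor and objective data.
As another example of a feedback loop, economists found that when gas stations in Germany introduced algorithmic pricing, competition was reduced and prices for consumers went up, because the algorithms learned to collude [^21].
We cannot always predict when such feedback loops will happen. However, many consequences can be anticipated by thinking about the entire system (not just the computerized parts, but also the people interacting with it), an approach known as **systems thinking** [^22]. We can try to understand how a data analysis system responds to different behaviors, structures, or characteristics. Does the system reinforce and amplify existing differences between people (for example, making the rich richer and the poor poorer), or does it try to combat injustice? And even with the best intentions, we must beware of unintended consequences.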
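## Privacy and Tracking {#id373}
Besides the problems of predictive analytics (that is, using data to make automated decisions about people), there are ethical problems with data collection itself. What is the relationship between the organizations collecting data and the people whose data is being collected?
When a system only stores data that a user has explicitly entered, because they want the system to store and process it in a certain way, the system is performing a service for the user: the user is the customer. But when a user's activity is tracked and logged as a side effect of other things they are doing, the relationship is less clear. The service no longer just does what the user tells it to do, but takes on interests of its own, which may conflict with the user's interests.
Tracking behavioral data has become an important part of the user-facing features of many online services: tracking which search results are clicked helps improve the ranking of search results; recommending "people who liked X also liked Y" helps users discover interesting and useful things; A/B tests and user flow analysis can help improve a user interface. Those features require some amount of tracking of user behavior, and users benefit from them.
However, depending on a company's business model, tracking often does not stop there. If the service is funded through advertising, the advertisers are the actual customers, and the users' interests take second place. Tracking data becomes more detailed, analyses become further-reaching, and data is retained for a long time in order to build up detailed profiles of each person for marketing purposes.
At this point the relationship between the company and the user whose data is being collected starts to look quite different. The user is given a "free" service and is coaxed into engaging with it as much as possible. The tracking of the user primarily serves not that individual, but the needs of the advertisers who are funding the service. This relationship is more aptly described with a word that has more sinister connotations: **surveillance**.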
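### Surveillance {#id374}
As a thought experiment, try replacing the word *data* with *surveillance*, and observe whether common phrases still sound so good [^23]. How about this: "In our surveillance-driven organization we collect real-time surveillance streams and store them in our surveillance warehouse. Our surveillance scientists use advanced analytics and surveillance processing in order to derive new insights."
This thought experiment is unusually polemic for this book, as if its title had become *Designing Surveillance-Intensive Applications*, but sharper words are needed to emphasize the point. In our attempts to make software "eat the world" [^24], we have built the greatest mass surveillance infrastructure the world has ever seen. We are rapidly approaching a world in which nearly every inhabited space contains at least one internet-connected microphone, in the form of smartphones, smart TVs, voice-controlled assistant devices, baby monitors, and even children's toys that use cloud-based speech recognition. Many of these devices have a terrible security record [^25].
What is new compared to the past is that digitization has made it easy to collect data about people at large scale. Surveillance of our location and movements, our social relationships and communications, our purchases and payments, and our health has become almost unavoidable. A surveillance organization may end up knowing more about a person than that person knows about themselves, for example by identifying an illness or financial difficulties before the person concerned is aware of them.
Even the most totalitarian and repressive regimes of the past could only dream of putting a microphone in every room and forcing every person to constantly carry a device capable of tracking their location and movements. Yet because the benefits of digital technology are so great, we now voluntarily accept this world of total surveillance. The difference is just that the data is being collected by corporations to provide us with services, rather than by government agencies seeking control [^26].
Not all data collection necessarily qualifies as surveillance, but examining it as such can help us understand our relationship with the data collector. Why are we seemingly happy to accept surveillance by corporations? Perhaps you feel you have "nothing to hide"; in other words, you are totally in line with existing power structures, you are not a marginalized minority, and you need not fear persecution [^27]. Not everyone is so fortunate. Or perhaps it is because the purpose seems benign: not overt coercion and conformance, but merely better recommendations and more personalized marketing. However, combined with the discussion of predictive analytics in the previous section, that distinction seems less clear.
We are already seeing cars that track their drivers' behavior without consent and thereby affect insurance premiums [^28], and health insurance coverage that is tied to wearing a fitness tracking device. When surveillance is used to determine things that have a large influence on important aspects of life, such as insurance coverage or employment, it starts to appear less benign. Moreover, data analysis can reveal surprisingly intrusive things: for example, the movement sensor in a smartwatch or fitness tracker can be used to work out what you are typing (including passwords) with fairly good accuracy [^29]. Sensor accuracy and analysis algorithms are only going to get better.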
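### Consent and Freedom of Choice {#id375}
We might assert that users voluntarily choose to use a service that tracks their activity, and that they have agreed to the terms of service and privacy policy, so they consent to data collection. We might even claim that users are receiving a valuable service in return for the data they provide, and that the tracking is necessary in order to provide the service. Undoubtedly, social networks, search engines, and various other free online services are valuable to users, but there are problems with this argument.
First, we should ask in what sense the tracking is "necessary". Some forms of tracking directly feed into improving features for users: for example, tracking the click-through rate on search results helps improve the ranking and relevance of search results, and tracking which products customers tend to buy together helps an online shop suggest related products. However, when user interaction is tracked for content recommendations, or to build user profiles for advertising purposes, it is less clear whether this is genuinely in the user's interest; or is it only "necessary" because the ads pay for the service?
Second, users have little knowledge of what data they are feeding into our databases, or how it is retained and processed, and most privacy policies do more to obscure than to illuminate. Without understanding what happens to their data, users cannot give any meaningful consent. Moreover, data from one user often also reveals things about other people who are not users of the service and who have not agreed to any terms. The derived datasets that we discussed in this part of the book, in which data from the entire user base may be combined with behavioral tracking and external data sources, are precisely the kinds of data of which users cannot have any meaningful understanding.
Furthermore, data is extracted from users through a one-way process, not a relationship with true reciprocity, and not a fair value exchange. There is no dialog, no option for users to negotiate how much data they provide and what service they receive in return: the relationship between the service and the user is very asymmetric and one-sided. The terms are set by the service, not by the user [^30], [^31].
In the European Union, the General Data Protection Regulation (GDPR) requires that consent must be "freely given, specific, informed, and unambiguous", and that the user must be able to "refuse or withdraw consent without detriment"; otherwise it is not considered "freely given". Any request for consent must be written "in an intelligible and easily accessible form, using clear and plain language". Moreover, "silence, pre-ticked boxes or inactivity \[do not\] constitute consent" [^32]. Besides consent, the processing of personal data can also be based on other lawful bases, such as *legitimate interest*, which permits certain uses of data such as fraud prevention [^33].
You might argue that a user who does not consent to surveillance can simply choose not to use a service. But this choice is not free either: if a service is so popular that it is "regarded by most people as essential for basic social participation" [^30], then it is not reasonable to expect people to opt out of it; using it is *de facto* mandatory. For example, in most Western social communities, it has become the norm to carry a smartphone, to socialize via social networks, and to use Google for finding information. Especially when a service has network effects, there is a social cost to choosing *not* to use it.
Declining to use a service because of its tracking policies is easier said than done. These platforms are designed specifically to engage users. Many use game mechanics and tactics common in gambling to keep users coming back [^34]. Even if a user gets past this, declining to engage is often only an option for the small number of people who are privileged enough to have the time and knowledge to understand the privacy policy, and who can afford the potential cost, such as missing out on social participation or professional opportunities that the service would have offered. For people in a less privileged position, there is no meaningful freedom of choice: surveillance becomes inescapable.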
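### Privacy and Use of Data {#id457}
Sometimes people claim that "privacy is dead" on the grounds that some users are willing to post all sorts of things about their lives to social media, some mundane and some deeply personal. However, this claim is false and rests on a misunderstanding of the word *privacy*.
Having privacy does not mean keeping everything secret; it means having the freedom to choose which things to reveal to whom, what to make public, and what to keep secret. The right to privacy is a decision right: it enables each person to decide where they want to be on the spectrum between secrecy and transparency in each situation [^30]. It is an important aspect of a person's freedom and autonomy.
For example, someone who suffers from a rare medical condition might be very happy to provide their private medical data to researchers if it helps the development of treatments. The important thing, however, is that this person has a choice over who may access the data, and for what purpose. If there were a risk that information about their condition would harm their access to medical insurance, employment, or other important things, this person would probably be much more cautious about sharing their data.
When data is extracted from people through surveillance infrastructure, privacy rights are not necessarily eroded, but rather transferred to the data collector. Companies that acquire data essentially say "trust us to do the right thing with your data", which means that the right to decide what to reveal and what to keep secret is transferred from the individual to the company.
The companies in turn keep much of the outcome of this surveillance secret, because revealing it would be perceived as creepy and would harm their business model (which relies on knowing more about people than other companies do). Intimate information about users is usually revealed only indirectly, for example through tools for targeting advertisements at specific groups of people (such as those suffering from a particular illness).
Even if particular users cannot be personally reidentified from the bucket of people targeted by a particular ad, they have lost their agency over the disclosure of some intimate information. It is no longer the user who decides what is revealed to whom on the basis of their own preferences; it is the company that exercises the privacy right, with the goal of maximizing its profit.
Many companies have a goal of not being *perceived* as creepy, avoiding the question of how intrusive their data collection actually is, and instead focusing on managing user perceptions. And even these perceptions are often managed poorly: for example, something may be factually correct, but if it triggers painful memories, the user may not want to be reminded of it [^35]. With any kind of data we should expect that it may be wrong, undesirable, or inappropriate in some way, and we need to build mechanisms for handling those failures. Whether something is "undesirable" or "inappropriate" is of course a matter of human judgment; algorithms are oblivious to such notions unless we explicitly program them to respect human needs. As engineers of these systems we must be humble, accepting and planning for such failings.
Privacy settings that allow a user of an online service to control which aspects of their data other users can see are a starting point for handing some control back to users. However, regardless of the setting, the service itself still has unfettered access to the data and is free to use it in any way permitted by the privacy policy. Even if the service promises not to sell the data to third parties, it usually grants itself broad rights to process and analyze the data internally, often going much further than what is visible to users.
This kind of large-scale transfer of privacy rights from individuals to corporations is historically unprecedented [^30]. Surveillance has always existed, but it used to be expensive and manual, not scalable and automated. Trust relationships have always existed too, for example between a patient and their doctor, or between a defendant and their attorney, but in these cases the use of data has long been governed by ethical, legal, and regulatory constraints. Internet services have made it much easier to amass huge amounts of sensitive information without meaningful consent, and to use it at massive scale without users knowing.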
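### Data as Assets and Power {#id376}
Since behavioral data is a byproduct of users interacting with a service, it is sometimes called "data exhaust", suggesting that the data is worthless waste material. Viewed this way, behavioral and predictive analytics can be seen as a form of recycling that extracts value from data that would otherwise have been thrown away.
It may be more accurate to view it the other way round: from an economic point of view, if targeted advertising is what pays for a service, then the user activity that generates behavioral data could be regarded as a form of labor [^36]. One could go even further and argue that the application with which the user interacts is merely a means of luring users into feeding more and more personal information into the surveillance infrastructure [^30]. The human creativity and social relationships that often find expression in online services are coldly exploited by the data extraction machine.
Personal data is a valuable asset, as evidenced by the existence of data brokers, a shady industry operating in secrecy that purchases, aggregates, analyzes, infers, and resells intrusive personal data about people, mostly for marketing purposes [^20]. Startups are often valued by their user numbers, by "eyeballs"; that is, by their surveillance capabilities.
Because the data is valuable, many people want it. Of course companies want it; that is why they collect it in the first place. But governments want to obtain it too: by means of secret deals, coercion, legal compulsion, or simply stealing it [^37]. When a company goes bankrupt, the personal data it has collected is sold as one of its assets. Moreover, data is difficult to secure, so breaches happen disconcertingly often.
These observations have led critics to say that data is not just an asset, but a "toxic asset" [^37], or at least "hazardous material" [^38]. Maybe data is not the new gold, nor the new oil, but rather the new uranium [^39]. Even if we think that we are capable of preventing abuse of data, whenever we collect data we need to balance the benefits against the risk of it falling into the wrong hands: computer systems may be compromised by criminals or hostile foreign intelligence services, data may be leaked by insiders, the company may fall into the hands of management that does not share our values, or the country may be taken over by an unscrupulous regime that compels us to hand over the data.
When collecting data, we need to consider not just today's political environment, but all possible future governments. There is no guarantee that every future government will respect human rights and civil liberties, so "it is poor civic hygiene to install technologies that could someday facilitate a police state" [^40].
"Knowledge is power," as the old adage goes. And furthermore, "to scrutinize others while avoiding scrutiny oneself is one of the most important forms of power" [^41]. This is why totalitarian governments want surveillance: it gives them the power to control the population. Although today's technology companies are not overtly seeking political power, the data and knowledge they have accumulated nevertheless gives them a great deal of influence over our lives, much of which is surreptitious and outside of public oversight [^42].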
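### Remembering the Industrial Revolution {#id377}
Data is the defining feature of the information age. The internet, data storage and processing, and software-driven automation are having a major impact on the global economy and human society. Our daily lives and social organization have been changed by information technology, and will probably continue to change radically in the coming decades, which readily brings comparisons to the Industrial Revolution to mind [^17], [^26].
The Industrial Revolution came about through major technological and agricultural advances, and in the long run it brought sustained economic growth and significantly improved living standards. Yet it also came with major problems: pollution of the air (due to smoke and chemical processes) and of the water (from industrial and human waste) was dreadful. Factory owners lived in splendor, while urban workers often lived in very poor housing and worked long hours in harsh conditions. Child labor was common, including dangerous and poorly paid work in mines.
It took a long time before safeguards were established, such as environmental protection regulations, safety protocols for workplaces, outlawing child labor, and health inspections for food. Undoubtedly the cost of doing business increased when factories were no longer allowed to dump their waste into rivers, sell tainted foods, or exploit workers. But society as a whole benefited hugely from these regulations, and few of us would want to return to a time before them [^17].
Just as the Industrial Revolution had a dark side that needed to be managed, our transition to the information age has major problems that we need to confront and solve [^43], [^44]. The collection and use of data is one of those problems. In the words of Bruce Schneier [^26]:
> Data is the pollution problem of the information age, and protecting privacy is the environmental challenge. Almost all computers produce information. It stays around, festering. How we deal with it, how we contain it and how we dispose of it, is central to the health of our information economy. Just as we look back today at the early decades of the industrial age and wonder how our ancestors could have ignored pollution in their rush to build an industrial world, our grandchildren will look back at us during these early decades of the information age and judge us on how we addressed the challenge of data collection and misuse.
>
> We should try to make them proud.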
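### Legislation and Self-Regulation {#sec_future_legislation}
Data protection laws might be able to help preserve individuals' rights. For example, the GDPR in the European Union states that personal data must be "collected for specified, explicit and legitimate purposes and not further processed in a manner that is incompatible with those purposes", and furthermore that data must be "adequate, relevant and limited to what is necessary in relation to the purposes for which they are processed" [^32].
However, this principle of **data minimization** runs directly counter to the philosophy of big data, which is to maximize data collection, to combine it with other datasets, and to experiment and explore continually in order to generate new insights. Exploration means using data for unforeseen purposes, which is the opposite of "specified and explicit" purposes. Although the GDPR has had some effect on the online advertising industry [^45], enforcement has been weak overall [^46], and it does not seem to have led to much of a change in culture and practices across the wider tech industry.
Companies that collect lots of data about people regard regulation as a burden and a hindrance to innovation. To some extent that opposition is justified. For example, when sharing medical data, there are clear risks to privacy, but there are also potential opportunities: how many deaths could be prevented if data analysis was able to help us achieve better diagnostics or find better treatments [^47]? Overregulation may prevent such breakthroughs. It is difficult to balance such opportunities against the risks [^41].
Fundamentally, the tech industry needs a culture shift with regard to personal data. We should stop regarding users as metrics to be optimized, and remember that they are humans who deserve respect, dignity, and agency. We should self-regulate our data collection and processing practices in order to establish and maintain the trust of the people who depend on our software [^48]. And we should take it upon ourselves to educate end users about how their data is used, rather than keeping them in the dark.
We should allow each individual to maintain their privacy, that is, their control over their own data, and not steal that control from them through surveillance. An individual's right to control their own data is like the natural environment of a national park: if we do not explicitly protect and care for it, it will be destroyed. It will be the tragedy of the commons, and we will all be worse off for it. Ubiquitous surveillance is not inevitable: we are still able to stop it.
As a first step, we should not retain data forever, but purge it as soon as it is no longer needed, and minimize what we collect in the first place [^48], [^49]. Data that does not exist cannot be leaked, stolen, or compelled by governments to be handed over. Overall, changes in culture and attitude will be necessary. As people working in technology, if we do not consider the societal impact of our work, we are not doing our job [^50].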
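The idea of purging data once it is no longer needed can be sketched as a small retention job. This is only an illustration under stated assumptions: the `activity_events` table, its columns, and the 30-day window are hypothetical and not taken from the text, and SQLite stands in for whatever datastore a real service would use.

```python
import sqlite3
import time

# Hypothetical retention window; a real policy would depend on the purpose
# for which the data was collected.
RETENTION_SECONDS = 30 * 24 * 60 * 60


def purge_expired(conn, now=None):
    """Delete activity events older than the retention window.

    Returns the number of rows removed.
    """
    now = time.time() if now is None else now
    cutoff = now - RETENTION_SECONDS
    with conn:  # commit on success, roll back on error
        cur = conn.execute(
            "DELETE FROM activity_events WHERE recorded_at < ?", (cutoff,)
        )
    return cur.rowcount


# Usage with an in-memory database and two made-up events.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE activity_events (user_id INTEGER, action TEXT, recorded_at REAL)"
)
now = 1_700_000_000.0
conn.executemany(
    "INSERT INTO activity_events VALUES (?, ?, ?)",
    [
        (1, "search", now - 90 * 86400),  # 90 days old: past retention
        (1, "click", now - 1 * 86400),    # 1 day old: still retained
    ],
)
removed = purge_expired(conn, now=now)
remaining = conn.execute("SELECT COUNT(*) FROM activity_events").fetchone()[0]
print(removed, remaining)
```

A job like this only helps if it also covers backups, logs, and derived datasets; the sketch deliberately ignores those.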
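## Summary {#id594}
This brings us to the end of the book. We have covered a lot of ground:
- In [Chapter 1](/tw/ch1#ch_tradeoffs) we contrasted analytical and transactional systems, compared the cloud to self-hosting, weighed up distributed and single-node systems, and discussed how to balance the needs of the business with the needs of the user.
- In [Chapter 2](/tw/ch2#ch_nonfunctional) we saw how to define nonfunctional requirements such as performance, reliability, scalability, and maintainability.
- In [Chapter 3](/tw/ch3#ch_datamodels) we explored a range of data models, from relational and document models to graph models, and also discussed event sourcing and DataFrames. We also looked at examples of various query languages, including SQL, Cypher, SPARQL, Datalog, and GraphQL.
- In [Chapter 4](/tw/ch4#ch_storage) we discussed storage engines for OLTP (LSM-trees and B-trees), storage for analytics (column-oriented storage), and indexes for information retrieval (full-text and vector search).
- In [Chapter 5](/tw/ch5#ch_encoding) we examined different ways of encoding data objects as bytes, and how to support evolution as requirements change. We also compared several ways in which data flows between processes: via databases, service calls, workflow engines, or event-driven architectures.
- In [Chapter 6](/tw/ch6#ch_replication) we studied the trade-offs between single-leader, multi-leader, and leaderless replication. We also discussed consistency models such as read-after-write consistency, and sync engines that allow clients to work offline.
- In [Chapter 7](/tw/ch7#ch_sharding) we went into sharding, including strategies for rebalancing, request routing, and secondary indexes.
- In [Chapter 8](/tw/ch8#ch_transactions) we covered transactions: durability, how various isolation levels (read committed, snapshot isolation, and serializable) can be implemented, and how atomicity can be ensured in distributed transactions.
- In [Chapter 9](/tw/ch9#ch_distributed) we surveyed fundamental problems in distributed systems (network faults and delays, clock errors, process pauses, crashes), and saw how they make it difficult to correctly implement even something as seemingly simple as a lock.
- In [Chapter 10](/tw/ch10#ch_consistency) we analyzed various forms of consensus in depth, along with the consistency model (linearizability) that it enables.
- In [Chapter 11](/tw/ch11#ch_batch) we dug into batch processing, building up from simple chains of Unix tools to large-scale distributed batch processors based on distributed filesystems or object stores.
- In [Chapter 12](/tw/ch12#ch_stream) we generalized batch processing to stream processing, and discussed the underlying message brokers, change data capture, fault tolerance, and processing patterns such as streaming joins.
- In [Chapter 13](/tw/ch13#ch_philosophy) we explored a philosophy of streaming systems that allows disparate data systems to be integrated, systems to be evolved, and applications to be scaled more easily.
Finally, in this chapter, we took a step back and examined some ethical aspects of building data-intensive applications. We saw that although data can be used to do good, it can also do significant harm: making decisions that seriously affect people's lives and are difficult to appeal against, leading to discrimination and exploitation, normalizing surveillance, and exposing intimate information. We also run the risk of data breaches, and we may find that a well-intentioned use of data has unintended consequences.
As software and data are having such a large impact on the world, we as engineers must remember that we carry a responsibility to work toward the kind of world that we want to live in: a world that treats people with humanity and respect. Let's work together toward that goal.
### References {#references}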
[^1]: David Schmudde. [What If Data Is a Bad Idea?](https://schmud.de/posts/2024-08-18-data-is-a-bad-idea.html). *schmud.de*, August 2024. Archived at [perma.cc/ZXU5-XMCT](https://perma.cc/ZXU5-XMCT)
[^2]: [ACM Code of Ethics and Professional Conduct](https://www.acm.org/code-of-ethics). Association for Computing Machinery, *acm.org*, 2018. Archived at [perma.cc/SEA8-CMB8](https://perma.cc/SEA8-CMB8)
[^3]: Igor Perisic. [Making Hard Choices: The Quest for Ethics in Machine Learning](https://www.linkedin.com/blog/engineering/archive/making-hard-choices-the-quest-for-ethics-in-machine-learning). *linkedin.com*, November 2016. Archived at [perma.cc/DGF8-KNT7](https://perma.cc/DGF8-KNT7)
[^4]: John Naughton. [Algorithm Writers Need a Code of Conduct](https://www.theguardian.com/commentisfree/2015/dec/06/algorithm-writers-should-have-code-of-conduct). *theguardian.com*, December 2015. Archived at [perma.cc/TBG2-3NG6](https://perma.cc/TBG2-3NG6)
[^5]: Ben Green. ["Good" isn't good enough](https://www.benzevgreen.com/wp-content/uploads/2019/11/19-ai4sg.pdf). At *NeurIPS Joint Workshop on AI for Social Good*, December 2019. Archived at [perma.cc/H4LN-7VY3](https://perma.cc/H4LN-7VY3)
[^6]: Deborah G. Johnson and Mario Verdicchio. [Ethical AI is Not about AI](https://cacm.acm.org/opinion/ethical-ai-is-not-about-ai/). *Communications of the ACM*, volume 66, issue 2, pages 32--34, January 2023. [doi:10.1145/3576932](https://doi.org/10.1145/3576932)
[^7]: Marc Steen. [Ethics as a Participatory and Iterative Process](https://cacm.acm.org/opinion/ethics-as-a-participatory-and-iterative-process/). *Communications of the ACM*, volume 66, issue 5, pages 27--29, April 2023. [doi:10.1145/3550069](https://doi.org/10.1145/3550069)
[^8]: Logan Kugler. [What Happens When Big Data Blunders?](https://cacm.acm.org/news/what-happens-when-big-data-blunders/) *Communications of the ACM*, volume 59, issue 6, pages 15--16, June 2016. [doi:10.1145/2911975](https://doi.org/10.1145/2911975)
[^9]: Miri Zilka. [Algorithms and the criminal justice system: promises and challenges in deployment and research](https://www.cl.cam.ac.uk/research/security/seminars/archive/video/2023-03-07-t196231.html). At *University of Cambridge Security Seminar Series*, March 2023.
[^10]: Bill Davidow. [Welcome to Algorithmic Prison](https://www.theatlantic.com/technology/archive/2014/02/welcome-to-algorithmic-prison/283985/). *theatlantic.com*, February 2014. Archived at [archive.org](https://web.archive.org/web/20171019201812/https://www.theatlantic.com/technology/archive/2014/02/welcome-to-algorithmic-prison/283985/)
[^11]: Don Peck. [They're Watching You at Work](https://www.theatlantic.com/magazine/archive/2013/12/theyre-watching-you-at-work/354681/). *theatlantic.com*, December 2013. Archived at [perma.cc/YR9T-6M38](https://perma.cc/YR9T-6M38)
[^12]: Leigh Alexander. [Is an Algorithm Any Less Racist Than a Human?](https://www.theguardian.com/technology/2016/aug/03/algorithm-racist-human-employers-work) *theguardian.com*, August 2016. Archived at [perma.cc/XP93-DSVX](https://perma.cc/XP93-DSVX)
[^13]: Jesse Emspak. [How a Machine Learns Prejudice](https://www.scientificamerican.com/article/how-a-machine-learns-prejudice/). *scientificamerican.com*, December 2016. Archived at [perma.cc/R3L5-55E6](https://perma.cc/R3L5-55E6)
[^14]: Rohit Chopra, Kristen Clarke, Charlotte A. Burrows, and Lina M. Khan. [Joint Statement on Enforcement Efforts Against Discrimination and Bias in Automated Systems](https://www.ftc.gov/system/files/ftc_gov/pdf/EEOC-CRT-FTC-CFPB-AI-Joint-Statement%28final%29.pdf). *ftc.gov*, April 2023. Archived at [perma.cc/YY4Y-RCCA](https://perma.cc/YY4Y-RCCA)
[^15]: Maciej Cegłowski. [The Moral Economy of Tech](https://idlewords.com/talks/sase_panel.htm). *idlewords.com*, June 2016. Archived at [perma.cc/L8XV-BKTD](https://perma.cc/L8XV-BKTD)
[^16]: Greg Nichols. [Artificial Intelligence in healthcare is racist](https://www.zdnet.com/article/artificial-intelligence-in-healthcare-is-racist/). *zdnet.com*, November 2020. Archived at [perma.cc/3MKW-YKRS](https://perma.cc/3MKW-YKRS)
[^17]: Cathy O'Neil. *Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy*. Crown Publishing, 2016. ISBN: 978-0-553-41881-1
[^18]: Julia Angwin. [Make Algorithms Accountable](https://www.nytimes.com/2016/08/01/opinion/make-algorithms-accountable.html). *nytimes.com*, August 2016. Archived at [archive.org](https://web.archive.org/web/20230819055242/https://www.nytimes.com/2016/08/01/opinion/make-algorithms-accountable.html)
[^19]: Bryce Goodman and Seth Flaxman. [European Union Regulations on Algorithmic Decision-Making and a 'Right to Explanation'](https://arxiv.org/abs/1606.08813). At *ICML Workshop on Human Interpretability in Machine Learning*, June 2016. Archived at [arxiv.org/abs/1606.08813](https://arxiv.org/abs/1606.08813)
[^20]: [A Review of the Data Broker Industry: Collection, Use, and Sale of Consumer Data for Marketing Purposes](https://www.commerce.senate.gov/services/files/0d2b3642-6221-4888-a631-08f2f255b577). Staff Report, *United States Senate Committee on Commerce, Science, and Transportation*, *commerce.senate.gov*, December 2013. Archived at [perma.cc/32NV-YWLQ](https://perma.cc/32NV-YWLQ)
[^21]: Stephanie Assad, Robert Clark, Daniel Ershov, and Lei Xu. [Algorithmic Pricing and Competition: Empirical Evidence from the German Retail Gasoline Market](https://economics.yale.edu/sites/default/files/clark_acex_jan_2021.pdf). *Journal of Political Economy*, volume 132, issue 3, pages 723-771, March 2024. [doi:10.1086/726906](https://doi.org/10.1086/726906)
[^22]: Donella H. Meadows and Diana Wright. *Thinking in Systems: A Primer*. Chelsea Green Publishing, 2008. ISBN: 978-1-603-58055-7
[^23]: Daniel J. Bernstein. [Listening to a "big data"/"data science" talk. Mentally translating "data" to "surveillance": "...everything starts with surveillance..."](https://x.com/hashbreaker/status/598076230437568512) *x.com*, May 2015. Archived at [perma.cc/EY3D-WBBJ](https://perma.cc/EY3D-WBBJ)
[^24]: Marc Andreessen. [Why Software Is Eating the World](https://a16z.com/why-software-is-eating-the-world/). *a16z.com*, August 2011. Archived at [perma.cc/3DCC-W3G6](https://perma.cc/3DCC-W3G6)
[^25]: J. M. Porup. ['Internet of Things' Security Is Hilariously Broken and Getting Worse](https://arstechnica.com/information-technology/2016/01/how-to-search-the-internet-of-things-for-photos-of-sleeping-babies/). *arstechnica.com*, January 2016. Archived at [archive.org](https://web.archive.org/web/20250823001716/https://arstechnica.com/information-technology/2016/01/how-to-search-the-internet-of-things-for-photos-of-sleeping-babies/)
[^26]: Bruce Schneier. [*Data and Goliath: The Hidden Battles to Collect Your Data and Control Your World*](https://www.schneier.com/books/data_and_goliath/). W. W. Norton, 2015. ISBN: 978-0-393-35217-7
[^27]: The Grugq. [Nothing to Hide](https://grugq.tumblr.com/post/142799983558/nothing-to-hide). *grugq.tumblr.com*, April 2016. Archived at [perma.cc/BL95-8W5M](https://perma.cc/BL95-8W5M)
[^28]: Federal Trade Commission. [FTC Takes Action Against General Motors for Sharing Drivers' Precise Location and Driving Behavior Data Without Consent](https://www.ftc.gov/news-events/news/press-releases/2025/01/ftc-takes-action-against-general-motors-sharing-drivers-precise-location-driving-behavior-data). *ftc.gov*, January 2025. Archived at [perma.cc/3XGV-3HRD](https://perma.cc/3XGV-3HRD)
[^29]: Tony Beltramelli. [Deep-Spying: Spying Using Smartwatch and Deep Learning](https://arxiv.org/abs/1512.05616). Masters Thesis, IT University of Copenhagen, December 2015. Archived at [arxiv.org/abs/1512.05616](https://arxiv.org/abs/1512.05616)
[^30]: Shoshana Zuboff. [Big Other: Surveillance Capitalism and the Prospects of an Information Civilization](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2594754). *Journal of Information Technology*, volume 30, issue 1, pages 75--89, April 2015. [doi:10.1057/jit.2015.5](https://doi.org/10.1057/jit.2015.5)
[^31]: Michiel Rhoen. [Beyond Consent: Improving Data Protection Through Consumer Protection Law](https://policyreview.info/articles/analysis/beyond-consent-improving-data-protection-through-consumer-protection-law). *Internet Policy Review*, volume 5, issue 1, March 2016. [doi:10.14763/2016.1.404](https://doi.org/10.14763/2016.1.404)
[^32]: [Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016](https://eur-lex.europa.eu/eli/reg/2016/679/oj/eng). *Official Journal of the European Union*, L 119/1, May 2016.
[^33]: UK Information Commissioner's Office. [What is the 'legitimate interests' basis?](https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/lawful-basis/legitimate-interests/what-is-the-legitimate-interests-basis/) *ico.org.uk*. Archived at [perma.cc/W8XR-F7ML](https://perma.cc/W8XR-F7ML)
[^34]: Tristan Harris. [How a handful of tech companies control billions of minds every day](https://www.ted.com/talks/tristan_harris_how_a_handful_of_tech_companies_control_billions_of_minds_every_day). At *TED2017*, April 2017.
[^35]: Carina C. Zona. [Consequences of an Insightful Algorithm](https://www.youtube.com/watch?v=YRI40A4tyWU). At *GOTO Berlin*, November 2016.
[^36]: Imanol Arrieta Ibarra, Leonard Goff, Diego Jiménez Hernández, Jaron Lanier, and E. Glen Weyl. [Should We Treat Data as Labor? Moving Beyond 'Free'](https://www.aeaweb.org/conference/2018/preliminary/paper/2Y7N88na). *American Economic Association Papers Proceedings*, volume 1, issue 1, December 2017.
[^37]: Bruce Schneier. [Data Is a Toxic Asset, So Why Not Throw It Out?](https://www.schneier.com/essays/archives/2016/03/data_is_a_toxic_asse.html) *schneier.com*, March 2016. Archived at [perma.cc/4GZH-WR3D](https://perma.cc/4GZH-WR3D)
[^38]: Cory Scott. [Data is not toxic - which implies no benefit - but rather hazardous material, where we must balance need vs. want](https://x.com/cory_scott/status/706586399483437056). *x.com*, March 2016. Archived at [perma.cc/CLV7-JF2E](https://perma.cc/CLV7-JF2E)
[^39]: Mark Pesce. [Data is the new uranium -- incredibly powerful and amazingly dangerous](https://www.theregister.com/2024/11/20/data_is_the_new_uranium/). *theregister.com*, November 2024. Archived at [perma.cc/NV8B-GYGV](https://perma.cc/NV8B-GYGV)
[^40]: Bruce Schneier. [Mission Creep: When Everything Is Terrorism](https://www.schneier.com/essays/archives/2013/07/mission_creep_when_e.html). *schneier.com*, July 2013. Archived at [perma.cc/QB2C-5RCE](https://perma.cc/QB2C-5RCE)
[^41]: Lena Ulbricht and Maximilian von Grafenstein. [Big Data: Big Power Shifts?](https://policyreview.info/articles/analysis/big-data-big-power-shifts) *Internet Policy Review*, volume 5, issue 1, March 2016. [doi:10.14763/2016.1.406](https://doi.org/10.14763/2016.1.406)
[^42]: Ellen P. Goodman and Julia Powles. [Facebook and Google: Most Powerful and Secretive Empires We've Ever Known](https://www.theguardian.com/technology/2016/sep/28/google-facebook-powerful-secretive-empire-transparency). *theguardian.com*, September 2016. Archived at [perma.cc/8UJA-43G6](https://perma.cc/8UJA-43G6)
[^43]: Judy Estrin and Sam Gill. [The World Is Choking on Digital Pollution](https://washingtonmonthly.com/2019/01/13/the-world-is-choking-on-digital-pollution/). *washingtonmonthly.com*, January 2019. Archived at [perma.cc/3VHF-C6UC](https://perma.cc/3VHF-C6UC)
[^44]: A. Michael Froomkin. [Regulating Mass Surveillance as Privacy Pollution: Learning from Environmental Impact Statements](https://repository.law.miami.edu/cgi/viewcontent.cgi?article=1062&context=fac_articles). *University of Illinois Law Review*, volume 2015, issue 5, August 2015. Archived at [perma.cc/24ZL-VK2T](https://perma.cc/24ZL-VK2T)
[^45]: Pengyuan Wang, Li Jiang, and Jian Yang. [The Early Impact of GDPR Compliance on Display Advertising: The Case of an Ad Publisher](https://openreview.net/pdf?id=TUnLHNo19S). *Journal of Marketing Research*, volume 61, issue 1, April 2023. [doi:10.1177/00222437231171848](https://doi.org/10.1177/00222437231171848)
[^46]: Johnny Ryan. [Don't be fooled by Meta's fine for data breaches](https://www.economist.com/by-invitation/2023/05/24/dont-be-fooled-by-metas-fine-for-data-breaches-says-johnny-ryan). *The Economist*, May 2023. Archived at [perma.cc/VCR6-55HR](https://perma.cc/VCR6-55HR)
[^47]: Jessica Leber. [Your Data Footprint Is Affecting Your Life in Ways You Can't Even Imagine](https://www.fastcompany.com/3057514/your-data-footprint-is-affecting-your-life-in-ways-you-cant-even-imagine). *fastcompany.com*, March 2016. Archived at [archive.org](https://web.archive.org/web/20161128133016/https://www.fastcoexist.com/3057514/your-data-footprint-is-affecting-your-life-in-ways-you-cant-even-imagine)
[^48]: Maciej Cegłowski. [Haunted by Data](https://idlewords.com/talks/haunted_by_data.htm). *idlewords.com*, October 2015. Archived at [archive.org](https://web.archive.org/web/20161130143932/https://idlewords.com/talks/haunted_by_data.htm)
[^49]: Sam Thielman. [You Are Not What You Read: Librarians Purge User Data to Protect Privacy](https://www.theguardian.com/us-news/2016/jan/13/us-library-records-purged-data-privacy). *theguardian.com*, January 2016. Archived at [archive.org](https://web.archive.org/web/20250828224851/https://www.theguardian.com/us-news/2016/jan/13/us-library-records-purged-data-privacy)
[^50]: Jez Humble. [It's a cliché that people get into tech to "change the world". So then, you have to actually consider what the impact of your work is on the world. The idea that you can or should exclude societal and political discussions in tech is idiotic. It means you're not doing your job](https://x.com/jezhumble/status/1386758340894597122). *x.com*, April 2021. Archived at [perma.cc/3NYS-MHLC](https://perma.cc/3NYS-MHLC)
================================================
FILE: content/tw/ch2.md
================================================
---
title: "2. Defining Nonfunctional Requirements"
weight: 102
breadcrumbs: false
---

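> *The Internet was done so well that most people think of it as a natural resource like the Pacific Ocean, rather than something that was man-made. When was the last time a technology with a scale like that was so error-free?*
>
> [Alan Kay](https://www.drdobbs.com/architecture-and-design/interview-with-alan-kay/240003442),
> in an interview with *Dr Dobb's Journal* (2012)
If you are building an application, you will usually start with a list of requirements. At the top of the list is most likely the functionality that the application must offer: what pages and buttons you need, and what each operation is supposed to do in order to fulfill the purpose of your software. These are your ***functional requirements***.
In addition, you probably also have some ***nonfunctional requirements***: for example, the app should be fast, reliable, secure, legally compliant, and easy to maintain. These requirements might not be explicitly written down, because they may seem like common sense, but they are just as important as the functional requirements. An app that is unbearably slow or that frequently fails might as well not exist.
Many nonfunctional requirements, such as security, fall outside the scope of this book. But this chapter discusses a few core ones and helps you articulate them for your own systems:
* How to define and measure the **performance** of a system (see ["Describing Performance"](#sec_introduction_percentiles));
* What it means for a service to be **reliable**, namely, continuing to work correctly even when things go wrong (see ["Reliability and Fault Tolerance"](#sec_introduction_reliability));
* How to keep a system **scalable** by having efficient ways of adding computing capacity as the load on the system grows (see ["Scalability"](#sec_introduction_scalability)); and
* How to keep a system **maintainable** as it evolves over the long term (see ["Maintainability"](#sec_introduction_maintainability)).
The terminology introduced in this chapter will also be used repeatedly in later chapters, when we go into the details of implementation. However, pure definitions tend to be abstract. To make the ideas concrete, we start this chapter with a case study of how a social networking service might be implemented, and use it to discuss performance and scalability.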
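## Case Study: Social Network Home Timelines {#sec_introduction_twitter}
Imagine you are implementing a social network in the style of X (formerly Twitter), in which users can post messages and follow other users. What follows is a drastic simplification of how such a service actually works [^1] [^2] [^3], but it is enough to illustrate some of the key issues that arise in large-scale systems.
Let's assume that users make 500 million posts per day, or about 5,700 posts per second on average, and that during special events the rate may spike to 150,000 posts per second [^4]. Let's also assume that the average user follows 200 people and has 200 followers (the real distribution is very uneven: most people have only a handful of followers, while a few celebrities such as Barack Obama have over 100 million).
### Representing Users, Posts, and Follows {#id20}
Suppose we keep all of the data in a relational database, as shown in [Figure 2-1](#fig_twitter_relational). We have one table for users, one for posts, and one for follow relationships.
{{< figure src="/fig/ddia_0201.png" id="fig_twitter_relational" caption="Figure 2-1. A simple relational schema for a social network in which users can follow each other." class="w-full my-4" >}}
Let's say the most important read operation of the social network is the *home timeline*, which shows recent posts by the people you follow (for simplicity we ignore ads, suggested posts from people you don't follow, and other extensions). The SQL to fetch the home timeline for a particular user might look like this: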
```sql
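-- Home timeline: the 1,000 most recent posts by accounts that current_user follows.
-- current_user stands for the viewer's user ID; bind it as a query parameter in real code.
SELECT posts.*, users.* FROM posts
  JOIN follows ON posts.sender_id = follows.followee_id
  JOIN users   ON posts.sender_id = users.id
  WHERE follows.follower_id = current_user
  ORDER BY posts.timestamp DESC
  LIMIT 1000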
```
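To execute this query, the database uses the `follows` table to find everyone whom `current_user` follows, looks up recent posts by those users, and sorts them by timestamp to get the 1,000 most recent posts by any of the followed accounts.
Posts are supposed to be timely, so let's assume that after somebody makes a post, their followers should be able to see it within 5 seconds. One way of doing that is for each client to repeat the query above every 5 seconds (this is known as *polling*). If 10 million users are online and logged in at the same time, that means running the query 2 million times per second. Even with a longer polling interval, that is a lot.
Moreover, the query itself is expensive. If you follow 200 people, it needs to fetch a list of recent posts by each of those 200 people and merge the lists. Two million timeline queries per second then means that the database has to perform about 400 million lookups of recent posts by a given sender every second. And that is only the average case. A few users follow tens of thousands of accounts; for them, the query is especially expensive and hard to make fast.
### Materializing and Updating Timelines {#sec_introduction_materializing}
How can we do better? First, instead of polling, the server could actively push new posts to any followers who are currently online. Second, we should precompute the result of the query above so that a request for a home timeline can be served from a cache.
Imagine that for each user we store a data structure containing their home timeline, that is, the recent posts by the people they follow. Every time a user makes a post, we look up all of their followers and insert the post into the home timeline of each follower, like delivering a letter to a mailbox. When a user logs in, we can simply return the home timeline that was precomputed. To receive notifications about new posts, the client only needs to subscribe to the stream of posts being added to that timeline.
The downside of this approach is that we now have to do more work every time a user makes a post, because the home timelines are derived data that needs to be kept up to date. The process is illustrated in [Figure 2-2](#fig_twitter_timelines). When one initial request results in several downstream requests, we use the term *fan-out* to describe the factor by which the number of requests increases.
{{< figure src="/fig/ddia_0202.png" id="fig_twitter_timelines" caption="Figure 2-2. Fan-out: delivering new posts to every follower of the user who made the post." class="w-full my-4" >}}
At a rate of 5,700 posts per second, if the average post reaches 200 followers (a fan-out factor of 200), we need to do just over 1 million home timeline writes per second. That is a lot, but it is still a significant saving compared to the 400 million per-sender post lookups per second that polling would require.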
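
The back-of-the-envelope numbers in this case study can be reproduced with a few lines of arithmetic. The sketch below uses only the assumptions stated in the text (10 million online users, a 5-second polling interval, 200 follows and 200 followers per user, 5,700 posts per second); the variable names are illustrative.

```python
# Inputs are the case study's assumptions, not measurements.
ONLINE_USERS = 10_000_000      # concurrently logged-in users
POLL_INTERVAL_S = 5            # each client polls every 5 seconds
FOLLOWS_PER_USER = 200         # average number of accounts a user follows
POSTS_PER_SECOND = 5_700       # average posting rate
FOLLOWERS_PER_POSTER = 200     # average fan-out factor

# Polling: every timeline query looks up recent posts for each followed account.
timeline_queries_per_s = ONLINE_USERS // POLL_INTERVAL_S
sender_lookups_per_s = timeline_queries_per_s * FOLLOWS_PER_USER

# Fan-out on write: every post is inserted into each follower's timeline.
timeline_writes_per_s = POSTS_PER_SECOND * FOLLOWERS_PER_POSTER

print(f"polling: {timeline_queries_per_s:,} timeline queries/s")
print(f"polling: {sender_lookups_per_s:,} posts-by-sender lookups/s")
print(f"fan-out: {timeline_writes_per_s:,} timeline writes/s")
```

Under these assumptions, materializing the timelines replaces roughly 400 million lookups per second with about 1.14 million writes per second.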
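If the rate of posts spikes because of some special event, we don't have to deliver to timelines immediately. We can enqueue the posts and accept that it will temporarily take longer for them to show up in followers' timelines. Even during such load spikes, timelines remain fast to load, since they are still served from a cache.
This process of precomputing and continually updating the result of a query is called *materialization*, and the timeline cache is an example of a *materialized view* (a concept discussed in ["Maintaining Materialized Views"](/tw/ch12#sec_stream_mat_view)). A materialized view speeds up reads at the cost of more work on the write side. For most users this write cost is acceptable, but a social network also has to handle some extreme cases:
* If a user follows a very large number of accounts and those accounts post a lot, that user's materialized timeline will have a high write rate. However, in this situation the user is unlikely to read every post anyway, so it is acceptable to drop some of the timeline writes and show only a sample of the posts from the accounts they follow [^5].
* When a celebrity account with a huge number of followers makes a post, we have to insert that post into the home timelines of each of their millions of followers, which is a large amount of work. Here it is not acceptable to drop writes. A common approach is to handle celebrity posts separately from everyone else's: they are stored on their own and merged with the materialized timeline when it is read, which saves the cost of writing to millions of timelines. Even so, serving celebrity accounts can require a lot of infrastructure [^6].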
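
To make the combination of fan-out on write and read-time merging for celebrity accounts more concrete, here is a minimal in-memory sketch. It is not how any real service is implemented: the data structures, the function names, and the tiny `CELEBRITY_THRESHOLD` are all hypothetical, and it assumes that posts are published in increasing timestamp order.

```python
import heapq
from collections import defaultdict, deque
from itertools import islice

TIMELINE_LIMIT = 1000          # posts kept per materialized timeline
CELEBRITY_THRESHOLD = 3        # hypothetical cutoff; a real one would be far higher

followers = defaultdict(set)   # followee -> set of followers
following = defaultdict(set)   # follower -> set of followees
timelines = defaultdict(lambda: deque(maxlen=TIMELINE_LIMIT))        # newest first
celebrity_posts = defaultdict(lambda: deque(maxlen=TIMELINE_LIMIT))  # newest first

def follow(follower, followee):
    followers[followee].add(follower)
    following[follower].add(followee)

def is_celebrity(user):
    return len(followers[user]) >= CELEBRITY_THRESHOLD

def publish(sender, timestamp, text):
    post = (timestamp, sender, text)
    if is_celebrity(sender):
        celebrity_posts[sender].appendleft(post)   # stored once, merged on read
    else:
        for follower in followers[sender]:         # fan-out on write
            timelines[follower].appendleft(post)

def home_timeline(user, limit=20):
    # Merge the materialized timeline with posts of followed celebrities.
    sources = [timelines[user]]
    sources += [celebrity_posts[c] for c in following[user] if is_celebrity(c)]
    merged = heapq.merge(*sources, reverse=True)   # every source is newest-first
    return list(islice(merged, limit))
```

A post by an ordinary account costs one write per follower, whereas a post by a celebrity costs a single write and a little extra merge work on every read by one of their followers.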
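## Describing Performance {#sec_introduction_percentiles}
Most discussions of software performance consider two main types of metric:
Response time
: The time that elapses from the moment a user makes a request until they receive the response. The unit of measurement is seconds (or milliseconds, or microseconds).
Throughput
: The number of requests per second, or the data volume per second, that the system is processing. For a given allocation of hardware resources, there is a *maximum throughput* that can be handled. The unit of measurement is "some amount of work per second".
In the social network case study, "posts per second" and "timeline writes per second" are throughput metrics, whereas "time to load the home timeline" and "time until a post is delivered to followers" are response time metrics.
Throughput and response time are usually related. [Figure 2-3](#fig_throughput) sketches the typical relationship for an online service: response time is low when throughput is low, and it rises as load increases. The reason is *queueing*. When a request arrives at a heavily loaded system, the CPU is likely to be busy with an earlier request, so the new request has to wait; as throughput approaches the maximum the hardware can handle, queueing delays increase sharply.
{{< figure src="/fig/ddia_0203.png" id="fig_throughput" caption="Figure 2-3. As the throughput of a service approaches its capacity, the response time increases dramatically due to queueing." class="w-full my-4" >}}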
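
The shape of this curve can be illustrated with the simplest textbook queueing model. The sketch below assumes an M/M/1 queue (a single server, Poisson arrivals, exponentially distributed service times), which is a modeling assumption introduced here rather than something the chapter relies on. In that model the mean response time is the service time divided by one minus the utilization. The 10 ms service time is a made-up figure.

```python
def mm1_response_time(service_time_s: float, arrival_rate_per_s: float) -> float:
    """Mean response time (queueing plus service) of an M/M/1 queue."""
    utilization = arrival_rate_per_s * service_time_s
    if utilization >= 1:
        return float("inf")    # arrivals outpace the server: the queue grows without bound
    return service_time_s / (1 - utilization)

SERVICE_TIME_S = 0.010   # hypothetical: 10 ms per request, so capacity is 100 requests/s
for rate in (10, 50, 90, 99):
    ms = mm1_response_time(SERVICE_TIME_S, rate) * 1000
    print(f"{rate:>3} req/s -> mean response time {ms:7.1f} ms")
```

Going from 50% to 90% utilization multiplies the mean response time by five in this model, and the last few percent before saturation are far worse. Real systems rarely match the M/M/1 assumptions exactly, but the qualitative behavior is the same.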
--------
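> [!TIP] When an overloaded system won't recover
If a system is close to overload, with throughput pushed near its limit, it can sometimes enter a vicious cycle in which it becomes less efficient and hence even more overloaded. For example, when there is a long queue of requests waiting to be handled, response times may grow so much that clients time out and resend their requests. This increases the request rate even further and makes the problem worse: a *retry storm*. Even after the load drops again, such a system may remain stuck in an overloaded state until it is rebooted or otherwise reset. This phenomenon is called a *metastable failure*, and it can cause serious outages in production systems [^7] [^8].
To prevent retries from overloading a service, clients can increase and randomize the time between successive retries (*exponential backoff* [^9] [^10]) and temporarily stop sending requests to a service that has recently returned errors or timed out (using a *circuit breaker* [^11] [^12] or a *token bucket* [^13], for example). The server can also detect when it is approaching overload and proactively reject requests (*load shedding* [^14]), and send back responses asking clients to slow down (*backpressure* [^1] [^15]). The choice of queueing and load-balancing algorithms also makes a difference [^16].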
--------
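
As an illustration of the client-side part of this advice, the following is a minimal sketch of retries with capped exponential backoff and random jitter. The function name and its parameters are hypothetical; a production client would usually also retry only on errors known to be transient and combine this with a circuit breaker or a retry budget.

```python
import random
import time

def call_with_backoff(operation, max_attempts=5, base_delay_s=0.1, max_delay_s=10.0,
                      sleep=time.sleep, rng=random.random):
    """Call operation(); on an exception, retry with capped exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise                      # retries exhausted: surface the error
            # The upper bound doubles with every failed attempt, up to max_delay_s.
            cap = min(max_delay_s, base_delay_s * 2 ** attempt)
            # Randomizing the delay keeps many clients from retrying in lockstep.
            sleep(rng() * cap)
```

The `sleep` and `rng` parameters exist only so that the delay schedule can be tested without actually waiting.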
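In terms of performance metrics, response time is usually what users care about most, whereas throughput determines the computing resources required (for example, how many servers you need) and hence the cost of serving a particular workload. If throughput is likely to grow beyond what the current hardware can handle, capacity needs to be expanded; a system is said to be *scalable* if its maximum throughput can be increased significantly by adding computing resources.
This section focuses mainly on response times; we return to throughput and scalability in ["Scalability"](#sec_introduction_scalability).
### Latency and Response Time {#id23}
"Latency" and "response time" are sometimes used interchangeably, but this book uses them in specific ways (see [Figure 2-4](#fig_response_time)):
* The *response time* is what the client sees: it includes all delays incurred anywhere along the path.
* The *service time* is the duration for which the service is actively processing the request.
* *Queueing delays* can occur at several points in the flow. For example, after a request arrives it may have to wait until a CPU is free before it can be processed, and a response packet may have to be buffered before it is sent if other tasks on the same machine are saturating the outbound network interface.
* *Latency* is a catch-all term for the time during which a request is not being actively processed, that is, during which it is *latent*. In particular, *network latency* (or *network delay*) refers to the time that the request and response spend traveling through the network.
{{< figure src="/fig/ddia_0204.png" id="fig_response_time" caption="Figure 2-4. Response time, service time, network latency, and queueing delay." class="w-full my-4" >}}
In [Figure 2-4](#fig_response_time), time flows from left to right. Each communicating node is drawn as a horizontal line, and a request or response message is drawn as a thick diagonal arrow from one node to another. You will encounter this style of diagram frequently in the rest of this book.
The response time can vary significantly from one request to the next, even if you keep making the same request over and over. Many factors can add random delays: for example, a context switch to a background process, the loss of a network packet and TCP retransmission, a garbage collection pause, a page fault forcing a read from disk, or mechanical vibrations in the server rack [^17]. We discuss this topic in more detail in ["Timeouts and Unbounded Delays"](/tw/ch9#sec_distributed_queueing).
Queueing delays often account for a large part of the variability in response times. Since a server can process only a limited number of things in parallel (limited, for example, by its number of CPU cores), it takes only a few slow requests to hold up the processing of subsequent requests, an effect known as *head-of-line blocking*. Even if those subsequent requests have short service times, the client sees a slow overall response because of the time spent waiting for the earlier requests to finish. Queueing delay is not part of the service time, which is why it is important to measure response times on the client side.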
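### Average, Median, and Percentiles {#id24}
Because the response time varies from one request to the next, we need to think of it not as a single number but as a *distribution* of values that you can measure. In [Figure 2-5](#fig_lognormal), each gray bar represents a request to a service, and its height shows how long that request took. Most requests are reasonably fast, but there are occasional *outliers* that take much longer. Variation in network delay is also known as *jitter*.
{{< figure src="/fig/ddia_0205.png" id="fig_lognormal" caption="Figure 2-5. Illustrating mean and percentiles: response times for a sample of 100 requests to a service." class="w-full my-4" >}}
It is common to report the *average* response time of a service (strictly speaking, the *arithmetic mean*: the sum of all response times divided by the number of requests). The mean is useful for estimating throughput limits [^18]. However, it is not a very good metric if you want to know your "typical" response time, because it doesn't tell you how many users actually experienced that delay.
Usually it is better to use *percentiles*. If you sort the response times from fastest to slowest, the *median* is the halfway point: for example, a median response time of 200 ms means that half of your requests return in less than 200 ms and the other half take longer. This makes the median a good metric for how long users typically have to wait. The median is also known as the *50th percentile*, often abbreviated as *p50*.
To figure out how bad your outliers are, you can look at higher percentiles: *p95*, *p99*, and *p999* are common. They are the thresholds below which 95%, 99%, and 99.9% of requests fall. For example, if the p95 response time is 1.5 seconds, then 95 out of 100 requests take less than 1.5 seconds and the other 5 take 1.5 seconds or more. [Figure 2-5](#fig_lognormal) illustrates this.
High percentiles of response times, also known as *tail latencies*, are important because they directly affect users' experience of the service. For example, Amazon describes response time requirements for internal services in terms of the 99.9th percentile, even though it affects only 1 in 1,000 requests. The reason is that the customers with the slowest requests are often those who have the most data on their accounts, and they tend to be the most valuable customers [^19]. Keeping the service fast for them matters to the business.
On the other hand, optimizing the 99.99th percentile (the slowest 1 in 10,000 requests) is usually too expensive for too little benefit. The higher the percentile, the more it is affected by random events outside your control, and the more the returns diminish.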
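
The difference between the mean and the percentiles is easy to see on a small sample. The sketch below uses the nearest-rank definition of a percentile (one of several common definitions) and a made-up set of 100 response times; neither comes from the data behind Figure 2-5.

```python
import math

def percentile(sorted_samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not sorted_samples:
        raise ValueError("no samples")
    rank = max(1, math.ceil(p * len(sorted_samples) / 100))
    return sorted_samples[rank - 1]

# Hypothetical response times in milliseconds: mostly fast, with a few slow outliers.
samples = sorted([20] * 90 + [100] * 5 + [400] * 4 + [3000])

mean = sum(samples) / len(samples)
print(f"mean={mean:.0f} ms  p50={percentile(samples, 50)} ms  "
      f"p95={percentile(samples, 95)} ms  p99={percentile(samples, 99)} ms")
```

In this sample the mean is 69 ms, even though 90% of requests took 20 ms and the slowest one took 3 seconds. The mean describes neither the typical request nor the outliers, whereas p50, p95, and p99 (20, 100, and 400 ms) show how the distribution is spread.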
--------
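> [!TIP] The effect of response time on users
It seems intuitively obvious that a fast service is better for users than a slow one [^20]. However, it is surprisingly difficult to get hold of reliable data that quantifies the effect of latency on user behavior.
Some often-cited statistics are unreliable. In 2006 Google reported that slowing search results from 400 ms to 900 ms was associated with a 20% drop in traffic and revenue [^21]. However, another Google study from 2009 reported that a 400 ms increase in latency resulted in only 0.6% fewer searches per day [^22], and in the same year Bing found that a two-second increase in load time reduced ad revenue by 4.3% [^23]. Newer data from these companies does not appear to be publicly available.
A more recent study by Akamai [^24] claims that a 100 ms increase in response time reduced the conversion rate of e-commerce sites by up to 7%. On closer inspection, however, the same study shows that very *fast* page loads are also correlated with lower conversion rates. This seemingly paradoxical result is probably explained by the fact that the fastest-loading pages are often those with no useful content (such as 404 error pages). Since the study does not separate the effect of page content from the effect of load time, its conclusions may not be reliable.
A study by Yahoo [^25] compared click-through rates on fast-loading versus slow-loading search results, controlling for the quality of the results. It found 20–30% more clicks on fast searches when the difference between fast and slow responses was 1.25 seconds or more.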
--------
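### Use of Response Time Metrics {#sec_introduction_slo_sla}
High percentiles are especially important in backend services that are called multiple times as part of serving a single end-user request. Even if you make the calls in parallel, the end-user request still has to wait for the slowest of them to complete. As [Figure 2-6](#fig_tail_amplification) shows, it takes just one slow call to make the entire end-user request slow. Even if only a small percentage of backend calls are slow, the chance of hitting a slow call increases when an end-user request requires many backend calls, so a higher proportion of end-user requests end up being slow (an effect known as *tail latency amplification* [^26]).
{{< figure src="/fig/ddia_0206.png" id="fig_tail_amplification" caption="Figure 2-6. When several backend calls are needed to serve a request, it takes just a single slow backend request to slow down the entire end-user request." class="w-full my-4" >}}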
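
The size of this effect can be estimated with a short calculation. The sketch below assumes that backend calls are slow independently of each other, each with the same probability; that independence is a simplifying assumption, and the 1% figure is only an example.

```python
def fraction_slow(p_slow_backend: float, backend_calls: int) -> float:
    """Probability that at least one of n independent backend calls is slow."""
    return 1 - (1 - p_slow_backend) ** backend_calls

# Example: 1% of backend calls are slow (that is, slower than the backend's p99).
for n in (1, 10, 100):
    print(f"{n:>3} backend calls -> {fraction_slow(0.01, n):.1%} of end-user requests are slow")
```

With a single backend call, 1% of end-user requests are slow; with 10 calls it is about 9.6%, and with 100 calls about 63%. The backend's 99th percentile therefore ends up determining the experience of most users.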
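Percentiles are also often used in *service level objectives* (SLOs) and *service level agreements* (SLAs) [^27]. For example, an SLO may set a target for a service to have a median response time of less than 200 ms and a 99th percentile under 1 s, and to return non-error responses for at least 99.9% of valid requests. An SLA is a contract that specifies what happens if the SLO is not met (for example, customers may be entitled to compensation). That is the basic idea; in practice, defining good availability metrics for SLOs and SLAs is not straightforward [^28] [^29].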
--------
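> [!TIP] Computing percentiles
If you want to add response time percentiles to the monitoring dashboards for your services, you need to calculate them efficiently on an ongoing basis. For example, you may want to keep a rolling window of the response times of requests in the last 10 minutes. Every minute, you calculate the median and various percentiles over the values in that window and plot them on a graph.
The simplest implementation is to keep a list of the response times for all requests within the time window and to sort that list every minute. If that is too inefficient, there are algorithms that approximate percentiles at minimal CPU and memory cost. Open source percentile estimation libraries include HdrHistogram, t-digest [^30] [^31], OpenHistogram [^32], and DDSketch [^33].
Beware that averaging percentiles, for example to reduce the time resolution or to combine data from several machines, is mathematically meaningless. The right way of aggregating response time data is to add the histograms [^34].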
--------
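
The warning about averaging percentiles can be demonstrated with a small example. The sketch below represents each machine's response times as an exact histogram (a `Counter` from response time in milliseconds to request count) and compares the average of the two per-machine p99 values with the p99 of the merged histogram. The two machines and their numbers are invented, and the histograms are exact rather than the approximate sketches used by the libraries mentioned in the sidebar.

```python
from collections import Counter

def percentile_from_histogram(hist: Counter, p: float) -> int:
    """Nearest-rank percentile from a histogram of response time (ms) -> request count."""
    total = sum(hist.values())
    target = p * total / 100
    seen = 0
    for value in sorted(hist):
        seen += hist[value]
        if seen >= target:
            return value
    raise ValueError("empty histogram")

# Hypothetical machines: A handles many fast requests, B handles few requests, half of them slow.
machine_a = Counter({10: 9900, 50: 100})
machine_b = Counter({10: 50, 2000: 50})

avg_of_p99 = (percentile_from_histogram(machine_a, 99)
              + percentile_from_histogram(machine_b, 99)) / 2
true_p99 = percentile_from_histogram(machine_a + machine_b, 99)   # add the histograms

print(f"average of per-machine p99s: {avg_of_p99:.0f} ms")
print(f"p99 of the merged histogram: {true_p99} ms")
```

The averaged value (1,005 ms) is about twenty times the actual 99th percentile of all requests (50 ms), because averaging ignores how many requests each machine served. Adding the histograms preserves that information.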
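## Reliability and Fault Tolerance {#sec_introduction_reliability}
Everybody has an intuitive idea of what it means for something to be reliable or unreliable. For software, typical expectations include:
* The application performs the function that the user expected.
* It can tolerate the user making mistakes or using the software in unexpected ways.
* Its performance is good enough for the required use case, under the expected load and data volume.
* The system prevents any unauthorized access and abuse.
If all those things together mean "working correctly", then we can understand *reliability* as meaning, roughly, "continuing to work correctly, even when things go wrong". To be more precise about things going wrong, we distinguish between *faults* and *failures* [^35] [^36] [^37]:
Fault
: A fault is when a particular *part* of a system stops working correctly: for example, a single hard drive malfunctions, a single machine crashes, or an external service that the system depends on has an outage.
Failure
: A failure is when the system *as a whole* stops providing the required service to the user; in other words, when it does not meet its service level objective (SLO).
The distinction between fault and failure can be confusing because they are the same thing at different levels. For example, when a hard drive stops working, that is a failure from the point of view of the drive as a system; but from the point of view of a larger system made up of many drives, it is only a fault. The larger system might be able to tolerate that fault by having a copy of the data on another drive.
### Fault Tolerance {#id27}
We call a system *fault-tolerant* if it continues providing the required service to the user in spite of certain faults occurring. If a system cannot tolerate a certain part becoming faulty, we call that part a *single point of failure* (SPOF), because a fault in that part escalates to cause the failure of the whole system.
For example, in the social network case study, a machine involved in the fan-out process could crash or become unavailable, interrupting the update of the materialized timelines. To make this process fault-tolerant, we would need to ensure that another machine can take over the task without missing any posts that should have been delivered and without delivering any post twice. (This idea is known as *exactly-once semantics*, and we examine it in detail in ["The End-to-End Argument for Databases"](/tw/ch13#sec_future_end_to_end).)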
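
One common building block for avoiding duplicate deliveries is to make delivery idempotent, so that a machine taking over the fan-out can safely retry posts it is unsure about. The sketch below is only an illustration of that idea, not the approach the book develops later: the `Timeline` class is hypothetical, it assumes every post carries a unique ID, and it keeps state in memory, whereas a real system would need the duplicate check and the insert to be atomic and durable.

```python
class Timeline:
    """A follower's timeline that ignores repeated deliveries of the same post."""

    def __init__(self):
        self.posts = []            # delivered posts, oldest first
        self._seen_ids = set()     # IDs of posts that have already been delivered

    def deliver(self, post_id, text):
        """Insert the post unless it was delivered before; return True if it was inserted."""
        if post_id in self._seen_ids:
            return False           # duplicate caused by a retry: no effect
        self._seen_ids.add(post_id)
        self.posts.append((post_id, text))
        return True
```

With idempotent delivery, retrying everything that might not have been delivered (at-least-once delivery) gives the same result as if each post had been delivered exactly once.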
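Fault tolerance is always limited to a certain number of certain types of faults. For example, a system might be able to tolerate at most two hard drives failing at the same time, or at most one out of three nodes crashing. If all nodes crash, there is nothing that can be done, so it would not make sense to tolerate any number of faults. If the entire planet Earth (and all servers on it) were swallowed by a black hole, tolerating that fault would require web hosting in space; good luck getting that budget item approved.
Counterintuitively, in such fault-tolerant systems it can make sense to *increase* the rate of faults by triggering them deliberately, for example by randomly killing individual processes without warning. This is called *fault injection*. Many critical bugs are actually due to poor error handling [^38]; by deliberately inducing faults, you ensure that the fault-tolerance machinery is continually exercised and tested, which increases your confidence that faults will be handled correctly when they occur naturally. *Chaos engineering* is a discipline built around experiments of this kind [^39].
Although we generally prefer tolerating faults over preventing them, there are cases where prevention is better than cure (for example, because no cure exists). This is the case with security matters: if an attacker has compromised a system and gained access to sensitive data, that event cannot be undone. However, this book mostly deals with the kinds of faults that can be cured.
### Hardware and Software Faults {#sec_introduction_hardware_faults}
When we think of causes of system failure, hardware faults quickly come to mind:
* Approximately 2–5% of magnetic hard drives fail per year [^40] [^41]; in a storage cluster with 10,000 disks, we should therefore expect on average about one disk failure per day. Recent data suggests that disks are getting more reliable, but failure rates remain significant [^42].
* Approximately 0.5–1% of solid state drives (SSDs) fail per year [^43]. Small numbers of bit errors are corrected automatically [^44], but uncorrectable errors occur approximately once per year per drive, even in drives that are fairly new and have experienced little wear; this error rate is higher than that of magnetic hard drives [^45] [^46].
* Other hardware components, such as power supplies, RAID controllers, and memory modules, also fail, although less frequently than hard drives [^47] [^48].
* Approximately one in 1,000 machines has a CPU core that occasionally computes the wrong result, probably due to manufacturing defects [^49] [^50] [^51]. In some cases an erroneous computation leads to a crash, but in other cases the program silently returns the wrong result.
* Data in RAM can also be corrupted, either by random events such as cosmic rays or by permanent physical defects. Even when ECC memory is used, more than 1% of machines encounter an uncorrectable error in a given year, which typically leads to a crash of the machine and the affected memory module needing to be replaced [^52]. Moreover, certain pathological memory access patterns can flip bits with high probability [^53].
* An entire datacenter might become unavailable (for example, due to a power outage or a network misconfiguration) or even be permanently destroyed (for example, by fire, flood, or earthquake [^54]). A solar storm, which induces large electrical currents in long-distance wires, could damage power grids and undersea network cables [^55]. Although such large-scale failures are rare, their impact can be catastrophic if a service cannot tolerate the loss of a datacenter [^56].
These events are rare enough that you often don't need to worry about them when working on a small system, as long as you can easily replace hardware that becomes faulty. In a large-scale system, however, hardware faults happen often enough that they become part of normal system operation.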
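
The claim of roughly one failed disk per day follows directly from the annual failure rates quoted above. The short calculation below assumes that failures are spread evenly over the year, which is a simplification; the function name is illustrative.

```python
def expected_failures_per_day(disks: int, annual_failure_rate: float) -> float:
    """Expected number of disk failures per day, assuming failures are spread evenly over a year."""
    return disks * annual_failure_rate / 365

for afr in (0.02, 0.05):
    per_day = expected_failures_per_day(10_000, afr)
    print(f"annual failure rate {afr:.0%}: about {per_day:.1f} failed disks per day")
```

For 10,000 disks this gives between roughly 0.5 and 1.4 failures per day, so on the order of one per day.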
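#### Tolerating hardware faults through redundancy {#tolerating-hardware-faults-through-redundancy}
Our first response to unreliable hardware is usually to add redundancy to the individual hardware components in order to reduce the failure rate of the system. Disks may be set up in a RAID configuration (spreading data across multiple disks in the same machine so that a failed disk does not cause data loss), servers may have dual power supplies and hot-swappable CPUs, and datacenters may have batteries and diesel generators for backup power. Such redundancy can often keep a machine running uninterrupted for years.
Redundancy is most effective when component faults are independent, that is, when the occurrence of one fault does not change how likely it is that another fault will occur. However, experience has shown that there are often significant correlations between component failures [^41] [^57] [^58]; the unavailability of an entire server rack or an entire datacenter still happens more often than we would expect.
Hardware redundancy does increase the uptime of a single machine; however, as discussed in ["Distributed versus Single-Node Systems"](/tw/ch1#sec_introduction_distributed), a distributed system offers additional advantages, such as being able to tolerate a complete outage of one datacenter. For this reason, cloud systems tend to focus less on the reliability of individual machines and instead aim to make services highly available by tolerating faulty nodes at the software level. Cloud providers use *availability zones* to identify which resources are physically co-located; resources in the same availability zone are more likely to fail at the same time than geographically separated resources.
The fault-tolerance techniques we discuss in this book are designed to tolerate the loss of entire machines, racks, or availability zones. They generally work by allowing a machine in one datacenter to take over when a machine in another datacenter fails or becomes unreachable. We discuss such techniques in [Chapter 6](/tw/ch6), [Chapter 10](/tw/ch10), and at various other points in this book.
Systems that can tolerate the loss of entire machines also have operational advantages. A single-server system requires planned downtime if you need to reboot the machine (to apply operating system security patches, for example), whereas a multi-node fault-tolerant system can be patched by restarting one node at a time without affecting the service for users. This is called a *rolling upgrade*, and we discuss it further in [Chapter 5](/tw/ch5).
#### Software faults {#software-faults}
Although hardware failures can be weakly correlated, they are still mostly independent: for example, if one disk fails, the other disks in the same machine are likely to keep working for a while longer. Software faults, on the other hand, are often highly correlated, because many nodes run the same software and therefore share the same bugs [^59] [^60]. Such faults are harder to anticipate, and they tend to cause many more system failures than uncorrelated hardware faults [^47]. For example:
* A software bug that causes every node to fail at the same time under particular circumstances. For example, on June 30, 2012, a leap second caused many Java applications to hang simultaneously because of a bug in the Linux kernel [^61]. Due to a firmware bug, all SSDs of certain models suddenly failed after precisely 32,768 hours of operation (less than 4 years), rendering the data on them unrecoverable [^62].
* A runaway process that uses up some shared, limited resource, such as CPU time, memory, disk space, network bandwidth, or threads [^63]. For example, a process that consumes too much memory while handling a large request may be killed by the operating system. A bug in a client library could cause a much higher request volume than anticipated [^64].
* A service that the system depends on slows down, becomes unresponsive, or starts returning corrupted responses.
* An interaction between different systems results in emergent behavior that does not occur when each system is tested in isolation [^65].
* Cascading failures, in which a problem in one component causes another component to become overloaded and slow down, which in turn brings down yet another component [^66] [^67].
The bugs that cause these kinds of software faults often lie dormant for a long time until they are triggered by an unusual set of circumstances. In those circumstances, it is revealed that the software was making some kind of assumption about its environment; while that assumption is usually true, it eventually stops being true for some reason [^68] [^69].
There is no quick solution to the problem of systematic faults in software. Lots of small things can help: carefully thinking about assumptions and interactions in the system; thorough testing; process isolation; allowing processes to crash and restart; avoiding feedback loops such as retry storms (see ["When an overloaded system won't recover"](#sidebar_metastable)); and measuring, monitoring, and analyzing system behavior in production.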
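### Humans and Reliability {#id31}
Humans design, build, and operate software systems. Unlike machines, humans don't just follow rules; their strength is being creative and adaptive. However, this also makes them unpredictable, and even with the best intentions they sometimes make mistakes that lead to failures. For example, one study of large internet services found that configuration changes by operators were the leading cause of outages, whereas hardware faults (servers or network) played a role in only 10–25% of outages [^70].
It is tempting to label such problems as "human error" and to try to control human behavior through tighter procedures and stricter rules. However, blaming individuals is usually counterproductive. What we call "human error" is often not the root cause of an incident but a symptom of a problem with the sociotechnical system in which people are doing their jobs [^71]. Complex systems frequently also exhibit emergent behavior, in which unexpected interactions between components lead to failures [^72].
Various technical measures can help minimize the impact of human mistakes, including thorough testing (both hand-written tests and *property testing* on lots of random inputs) [^38], rollback mechanisms for quickly reverting configuration changes, gradual rollouts of new code, detailed and clear monitoring, observability tools for diagnosing production issues (see ["Problems with Distributed Systems"](/tw/ch1#sec_introduction_dist_sys_problems)), and well-designed interfaces that encourage "the right thing" and discourage "the wrong thing".
However, these measures require an investment of time and money. Under the pressure of everyday business, organizations often prioritize revenue-generating activities over work that increases their resilience against mistakes. If there is a choice between more features and more testing, many organizations understandably choose features. Given this choice, when a preventable mistake eventually occurs, it does not make sense to blame the person who made it; the problem lies in the organization's priorities.
Increasingly, organizations are adopting a culture of *blameless postmortems*: after an incident, the people involved are encouraged to share full details about what happened, without fear of punishment, so that others in the organization can learn how to prevent similar problems in the future [^73]. This process may uncover a need to change business priorities, to invest in areas that have long been neglected, to change the incentives of the people involved, or some other systemic issue that needs to be brought to management's attention.
As a general principle, when investigating an incident you should be suspicious of simplistic answers. "Bob should have been more careful when deploying that change" is not productive, but neither is "We must rewrite the backend in Haskell." Instead, management should take the opportunity to learn how the sociotechnical system really works from the point of view of the people who work with it every day, and take steps to improve it based on that feedback [^71].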
--------
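> [!TIP] How important is reliability?
Reliability is not just for nuclear power stations and air traffic control; more mundane applications are also expected to work reliably. Bugs in business applications cause lost productivity (and legal risks if figures are reported incorrectly), and outages of e-commerce sites cause direct losses of revenue and damage to reputation.
In many applications, a temporary outage of a few minutes or even a few hours is tolerable [^74], but permanent data loss or corruption would be catastrophic. Consider a parent who stores all the pictures and videos of their children in your photo application [^75]. How would they feel if that database was suddenly corrupted? Would they know how to restore it from a backup?
As another example of how unreliable software can harm people, consider the Post Office Horizon scandal in Britain. Between 1999 and 2019, hundreds of people managing Post Office branches were convicted of theft or fraud because the accounting software showed a shortfall in their accounts. It later became clear that many of these shortfalls were due to bugs in the software, and many convictions have since been overturned [^76]. What led to this, probably the largest miscarriage of justice in British history, is the fact that the law presumes that computers operate correctly (and hence that evidence produced by computers is reliable) unless there is evidence to the contrary [^77]. Software engineers may laugh at the idea that software could ever be bug-free, but this is little solace to the people who were wrongfully imprisoned, declared bankrupt, or even driven to suicide as a result.
There are situations in which we may choose to sacrifice reliability in order to reduce development cost (for example, when developing a prototype product for an unproven market). But we should be very conscious of where we are cutting corners and keep the potential consequences in mind.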
--------
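## Scalability {#sec_introduction_scalability}
Even if a system is working reliably today, that doesn't mean it will necessarily work reliably in the future. One common reason for degradation is increased load: perhaps the system has grown from 10,000 concurrent users to 100,000, or from 1 million to 10 million. Perhaps it is processing much larger volumes of data than it did before.
*Scalability* is the term we use to describe a system's ability to cope with increased load. When the topic comes up, people often say something along the lines of "You're not Google or Amazon. Stop worrying about scale and just use a relational database." Whether this maxim applies to you depends on the type of application you are building.
If you are building a new product that currently has only a small number of users, perhaps at a startup, the overriding engineering goal is usually to keep the system as simple and flexible as possible, so that you can easily modify and adapt the features of your product as you learn more about customers' needs [^78]. In such an environment, it is counterproductive to worry about hypothetical scale that might be needed in the future: in the best case, investments in scalability are wasted effort and premature optimization; in the worst case, they lock you into an inflexible design and make it harder to evolve your application.
The reason is that scalability is not a one-dimensional label: it is meaningless to say "X is scalable" or "Y doesn't scale." Rather, discussing scalability means considering questions like:
* "If the system grows in a particular way, what are our options for coping with the growth?"
* "How can we add computing resources to handle the additional load?"
* "Based on current growth projections, when will we hit the limits of our current architecture?"
If you succeed in making your application popular and the load keeps growing, you will learn where your performance bottlenecks lie, and therefore you will know along which dimensions you need to scale. That is usually the right time to start investing systematically in scalability techniques.
### Describing Load {#id33}
First, we need to succinctly describe the current load on the system; only then can we discuss growth questions (what happens if our load doubles?). Often this will be a measure of throughput: for example, the number of requests per second to a service, how many gigabytes of new data arrive per day, or the number of shopping cart checkouts per hour. Sometimes you care about the peak of some variable quantity, such as the number of simultaneously online users in ["Case Study: Social Network Home Timelines"](#sec_introduction_twitter).
Often there are other statistical characteristics of the load that also affect the access patterns and hence the scalability requirements. For example, you may need to know the ratio of reads to writes in a database, the hit rate of a cache, or the number of data items per user (for example, the number of followers in the social network case study). Perhaps the average case is what matters for you, or perhaps your bottleneck is dominated by a small number of extreme cases. It all depends on the details of your particular application.
Once you have described the load on your system, you can investigate what happens when the load increases. You can look at it in two ways:
* When you increase the load in a certain way and keep the system resources (CPUs, memory, network bandwidth, and so on) unchanged, how is the performance of your system affected?
* When you increase the load in a certain way, how much do you need to increase the resources if you want to keep performance unchanged?
Usually the goal is to keep the performance of the system within the requirements of the SLA (see ["Use of Response Time Metrics"](#sec_introduction_slo_sla)) while also minimizing the cost of running the system. The greater the required computing resources, the higher the cost. Some types of hardware are more cost-effective than others, and these factors change over time as new types of hardware become available.
If you can double the resources in order to handle twice the load while keeping performance the same, we say that you have *linear scalability*, which is usually the ideal. Occasionally it is possible to handle twice the load with less than double the resources, thanks to economies of scale or a better distribution of peak load [^79] [^80]. Much more commonly, the cost grows faster than linearly, and there may be many reasons for the inefficiency. For example, if you have a lot of data, then processing a single write request may involve more work than if you have a small amount of data, even if the size of the request is the same.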
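### Shared-Memory, Shared-Disk, and Shared-Nothing Architecture {#sec_introduction_shared_nothing}
The simplest way of increasing the hardware resources of a service is to move it to a more powerful machine. Individual CPU cores are no longer getting significantly faster, but you can buy (or rent) a machine with more CPU cores, more RAM, and more disk space. This approach is called *vertical scaling* or *scaling up*.
On a single machine you can get parallelism by using multiple processes or threads. All the threads belonging to the same process can access the same RAM, so this approach is also called a *shared-memory architecture*. The problem is that its cost grows faster than linearly: a high-end machine with twice the hardware resources typically costs significantly more than twice as much, and because of bottlenecks, it can often handle less than twice the load.
Another approach is the *shared-disk architecture*, which uses several machines with independent CPUs and RAM, but which stores data on an array of disks that is shared between the machines and connected via a fast network (NAS or SAN). This architecture has traditionally been used for on-premises data warehousing workloads, but contention and the overhead of locking limit its scalability [^81].
By contrast, the *shared-nothing architecture* [^82] (also called *horizontal scaling* or *scaling out*) has gained a lot of popularity. In this approach, we use a distributed system with multiple nodes, each of which has its own CPUs, RAM, and disks. Any coordination between nodes is done at the software level, over a conventional network.
The advantages of shared-nothing are that it has the potential to scale linearly, it can use whatever hardware offers the best price/performance ratio (especially in the cloud), it can more easily adjust its hardware resources as load increases or decreases, and it can achieve greater fault tolerance by distributing the system across multiple datacenters and regions. The downsides are that it requires explicit sharding (see [Chapter 7](/tw/ch7)) and that it incurs all the complexity of distributed systems (see [Chapter 9](/tw/ch9)).
Some cloud native database systems use separate services for storage and transaction execution (see ["Separation of Storage and Compute"](/tw/ch1#sec_introduction_storage_compute)), with multiple compute nodes sharing access to the same storage service. This model has some similarity to a shared-disk architecture, but it avoids the scalability problems of older systems: instead of providing a filesystem (NAS) or block device (SAN) abstraction, the storage service offers a specialized API that is designed for the specific needs of the database [^83].
### Principles for Scalability {#id35}
The architecture of systems that operate at large scale is usually highly specific to the application; there is no such thing as a generic, one-size-fits-all scalable architecture (informally known as *magic scaling sauce*). For example, a system that is designed to handle 100,000 requests per second, each 1 kB in size, looks very different from a system that is designed for 3 requests per minute, each 2 GB in size, even though the two systems have the same data throughput of about 100 MB/s.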
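
The claim that both workloads have the same data throughput is easy to verify. The calculation below uses decimal units (1 kB = 1,000 bytes, 1 GB = 10^9 bytes), which is an assumption about the units intended in the text.

```python
KB, MB, GB = 10**3, 10**6, 10**9   # decimal units, in bytes

small_requests = 100_000 * 1 * KB      # bytes per second: 100,000 requests/s of 1 kB each
large_requests = 3 * 2 * GB / 60       # bytes per second: 3 requests/min of 2 GB each

print(f"many small requests: {small_requests / MB:.0f} MB/s")
print(f"few large requests:  {large_requests / MB:.0f} MB/s")
```

Both come to 100 MB/s, yet the first workload is dominated by per-request overhead and concurrency, while the second is dominated by moving and storing large objects.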
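Moreover, an architecture that is appropriate for one level of load is unlikely to cope with 10 times that load. If you are working on a fast-growing service, it is therefore likely that you will need to rethink your architecture on every order-of-magnitude increase in load. As the needs of the application are likely to evolve, it is usually not worth planning future scaling needs more than one order of magnitude in advance.
A good general principle for scalability is to break a system down into smaller components that can operate largely independently from each other. This is the underlying principle behind microservices (see ["Microservices and Serverless"](/tw/ch1#sec_introduction_microservices)), sharding ([Chapter 7](/tw/ch7)), stream processing ([Chapter 12](/tw/ch12#ch_stream)), and shared-nothing architectures. The challenge lies in knowing where to draw the line between things that should be together and things that should be apart. Design guidelines for microservices can be found in other books [^84], and we discuss sharding of shared-nothing systems in [Chapter 7](/tw/ch7).
Another good principle is not to make things more complicated than necessary. If a single-machine database will do the job, it is probably preferable to a complicated distributed setup. Auto-scaling systems (which automatically add or remove resources in response to demand) are appealing, but if your load is fairly predictable, a manually scaled system may have fewer operational surprises (see ["Operations: Automatic or Manual Rebalancing"](/tw/ch7#sec_sharding_operations)). A system with five services is usually simpler than one with fifty. Good architectures usually involve a pragmatic mixture of approaches.
## Maintainability {#sec_introduction_maintainability}
Software does not wear out or suffer material fatigue the way mechanical objects do. But the requirements for an application frequently change, the environment that the software runs in changes (such as its dependencies and the underlying platform), and it has bugs that need fixing.
It is widely recognized that the majority of the cost of software is not in its initial development, but in its ongoing maintenance: fixing bugs, keeping its systems operational, investigating failures, adapting it to new platforms, modifying it for new use cases, repaying technical debt, and adding new features [^85] [^86].
However, maintenance is also difficult. If a system has been successfully running for a long time, it may well use outdated technologies that not many engineers understand today (such as mainframes and COBOL code); institutional knowledge of how and why the system was designed in a certain way may have been lost as people have left the organization; and maintainers may have to fix other people's mistakes. Moreover, the computer system is often intertwined with the human organization that it supports, which means that maintenance of such *legacy* systems is as much a people problem as a technical one [^87].
Every system we create today will one day become a legacy system if it is valuable enough to survive for a long time. In order to minimize the pain for future generations who need to maintain our software, we should design it with maintenance concerns in mind. Although we cannot always predict which decisions might create maintenance headaches in the future, in this book we will pay attention to several principles that are widely applicable:
Operability
: Make it easy for the organization to keep the system running smoothly.
Simplicity
: Make it easy for new engineers to understand the system, by implementing it using well-understood, consistent patterns and structures, and avoiding unnecessary complexity.
Evolvability
: Make it easy for engineers to make changes to the system in the future, adapting it and extending it for unanticipated use cases as requirements change.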
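### Operability: Making Life Easy for Operations {#id37}
We previously discussed the role of operations in ["Operations in the Cloud Era"](/tw/ch1#sec_introduction_operations), and we saw that human processes are at least as important for reliable operations as software tools. In fact, it has been suggested that "good operations can often work around the limitations of bad (or incomplete) software, but good software cannot run reliably with bad operations" [^60].
In large-scale systems consisting of many thousands of machines, manual maintenance would be unreasonably expensive, and automation is essential. However, automation can be a two-edged sword: there will always be edge cases (such as rare failure scenarios) that require manual intervention from the operations team. Since the cases that cannot be handled automatically are the most complex issues, greater automation requires a **more** skilled operations team that can resolve those issues [^88].
Moreover, if an automated system goes wrong, it is often harder to troubleshoot than a system that relies on an operator to perform some actions manually. For that reason, more automation is not always better for operability. The right degree of automation depends on the specifics of your particular application and organization.
Good operability means making routine tasks easy, allowing the operations team to focus their efforts on high-value activities. Data systems can do various things to make routine tasks easy, including [^89]:
* Allowing monitoring tools to check the system's key metrics, and supporting observability tools (see ["Problems with Distributed Systems"](/tw/ch1#sec_introduction_dist_sys_problems)) that give insights into the system's runtime behavior. A variety of commercial and open source tools can help here [^90].
* Avoiding dependency on individual machines (allowing machines to be taken down for maintenance while the system as a whole continues running uninterrupted).
* Providing good documentation and an easy-to-understand operational model ("If I do X, Y will happen").
* Providing good default behavior, while also giving administrators the freedom to override defaults when needed.
* Self-healing where appropriate, while also giving administrators manual control over the system state when needed.
* Exhibiting predictable behavior, minimizing surprises.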
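### Simplicity: Managing Complexity {#id38}
Small software projects can have delightfully simple and expressive code, but as projects get larger, they often become very complex and difficult to understand. This complexity slows down everyone who needs to work on the system, further increasing the cost of maintenance. A software project mired in complexity is sometimes described as a *big ball of mud* [^91].
When complexity makes maintenance hard, budgets and schedules are often overrun. In complex software, there is also a greater risk of introducing bugs when making a change: when the system is harder for developers to understand and reason about, hidden assumptions, unintended consequences, and unexpected interactions are more easily overlooked [^69]. Conversely, reducing complexity greatly improves the maintainability of software, and thus simplicity should be a key goal for the systems we build.
Simple systems are easier to understand, and therefore we should try to solve a given problem in the simplest way possible. Unfortunately, this is easier said than done. Whether something is simple or not is often a subjective matter of taste, as there is no objective standard of simplicity [^92]. For example, one system may hide a complex implementation behind a simple interface, whereas another may have a simple implementation that exposes more internal detail to its users; which one is simpler does not always have a clear answer.
One attempt at reasoning about complexity has been to break it down into two categories, **essential** and **accidental** complexity [^93]. The idea is that essential complexity is inherent in the problem domain of the application, while accidental complexity arises only because of limitations of our tooling. Unfortunately, this distinction is also flawed, because the boundary between the essential and the accidental shifts as our tooling evolves [^94].
One of the best tools we have for managing complexity is **abstraction**. A good abstraction can hide a great deal of implementation detail behind a clean facade, and it can be used for a wide range of different applications. Not only is this reuse more efficient than reimplementing a similar thing multiple times, but it also leads to higher-quality software, as quality improvements in the abstracted component benefit all applications that use it.
For example, high-level programming languages are abstractions that hide machine code, CPU registers, and syscalls. SQL is an abstraction that hides complex on-disk and in-memory data structures, concurrent requests from other clients, and inconsistencies after crashes. When programming in a high-level language, we are still using machine code; we are just not using it *directly*, because the programming language abstraction saves us from having to think about it.
Abstractions for application code can be created using methodologies such as *design patterns* [^95] and *domain-driven design* (DDD) [^96]. This book is not about such application-specific abstractions, but rather about general-purpose abstractions on top of which you can build your applications, such as database transactions, indexes, and event logs. If you want to use techniques such as DDD, you can implement them on top of the foundations described in this book.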
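### Evolvability: Making Change Easy {#sec_introduction_evolvability}
It is extremely unlikely that your system's requirements will remain unchanged forever. They are much more likely to be in constant flux: you learn new facts, previously unanticipated use cases emerge, business priorities change, users request new features, new platforms replace old platforms, legal or regulatory requirements change, and growth of the system forces architectural changes.
In terms of organizational processes, *Agile* working patterns provide a framework for adapting to change. The Agile community has also developed technical tools and processes that are helpful when developing software in a frequently changing environment, such as test-driven development (TDD) and refactoring. In this book, we search for ways of increasing agility at the level of a system consisting of several different applications or services.
The ease with which you can modify a data system and adapt it to changing requirements is closely linked to its simplicity and its abstractions: loosely coupled, simple systems are usually easier to modify than tightly coupled, complex ones. Since this is such an important idea, we use a separate word to refer to agility on a data system level: *evolvability* [^97].
One major factor that makes change difficult in large systems is when some action is irreversible, and therefore needs to be taken very carefully [^98]. For example, say you are migrating from one database to another: if you cannot switch back to the old system in case of problems with the new one, the stakes are much higher than if you can easily go back. Minimizing irreversibility improves flexibility.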
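## Summary {#summary}
In this chapter we examined several core nonfunctional requirements: performance, reliability, scalability, and maintainability. Through these topics we also established a set of concepts and terminology that are used throughout the rest of the book. We started with a case study of how one might implement home timelines in a social network, which illustrated some of the challenges that arise at scale.
We discussed how to measure performance (for example, using response time percentiles), how to describe the load on a system (for example, using throughput metrics), and how these are used in SLAs. Scalability is a closely related concept: ensuring that performance stays the same when the load grows. We saw some general principles for scalability, such as breaking a task down into smaller parts that can operate independently. Later chapters dive into the technical detail of these techniques.
To achieve reliability, you can use fault-tolerance techniques, which allow a system to continue providing its service even if some component (such as a disk, a machine, or an external service) is faulty. We distinguished between hardware faults and software faults, and noted that software faults are often harder to deal with because they are often strongly correlated. Another aspect of reliability is resilience against humans making mistakes, where *blameless postmortems* are an important technique for learning from incidents.
Finally, we examined several facets of maintainability, including supporting the work of operations teams, managing complexity, and making it easy to evolve a system over time. There is no silver bullet for achieving these things, but one thing that helps is to build applications from well-understood building blocks that provide useful abstractions. The rest of this book covers a selection of building blocks that have proved valuable in practice.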
### References
[^1]: Mike Cvet. [How We Learned to Stop Worrying and Love Fan-In at Twitter](https://www.youtube.com/watch?v=WEgCjwyXvwc). At *QCon San Francisco*, December 2016.
[^2]: Raffi Krikorian. [Timelines at Scale](https://www.infoq.com/presentations/Twitter-Timeline-Scalability/). At *QCon San Francisco*, November 2012. Archived at [perma.cc/V9G5-KLYK](https://perma.cc/V9G5-KLYK)
[^3]: Twitter. [Twitter's Recommendation Algorithm](https://blog.twitter.com/engineering/en_us/topics/open-source/2023/twitter-recommendation-algorithm). *blog.twitter.com*, March 2023. Archived at [perma.cc/L5GT-229T](https://perma.cc/L5GT-229T)
[^4]: Raffi Krikorian. [New Tweets per second record, and how!](https://blog.twitter.com/engineering/en_us/a/2013/new-tweets-per-second-record-and-how) *blog.twitter.com*, August 2013. Archived at [perma.cc/6JZN-XJYN](https://perma.cc/6JZN-XJYN)
[^5]: Jaz Volpert. [When Imperfect Systems are Good, Actually: Bluesky's Lossy Timelines](https://jazco.dev/2025/02/19/imperfection/). *jazco.dev*, February 2025. Archived at [perma.cc/2PVE-L2MX](https://perma.cc/2PVE-L2MX)
[^6]: Samuel Axon. [3% of Twitter's Servers Dedicated to Justin Bieber](https://mashable.com/archive/justin-bieber-twitter). *mashable.com*, September 2010. Archived at [perma.cc/F35N-CGVX](https://perma.cc/F35N-CGVX)
[^7]: Nathan Bronson, Abutalib Aghayev, Aleksey Charapko, and Timothy Zhu. [Metastable Failures in Distributed Systems](https://sigops.org/s/conferences/hotos/2021/papers/hotos21-s11-bronson.pdf). At *Workshop on Hot Topics in Operating Systems* (HotOS), May 2021. [doi:10.1145/3458336.3465286](https://doi.org/10.1145/3458336.3465286)
[^8]: Marc Brooker. [Metastability and Distributed Systems](https://brooker.co.za/blog/2021/05/24/metastable.html). *brooker.co.za*, May 2021. Archived at [perma.cc/7FGJ-7XRK](https://perma.cc/7FGJ-7XRK)
[^9]: Marc Brooker. [Exponential Backoff And Jitter](https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/). *aws.amazon.com*, March 2015. Archived at [perma.cc/R6MS-AZKH](https://perma.cc/R6MS-AZKH)
[^10]: Marc Brooker. [What is Backoff For?](https://brooker.co.za/blog/2022/08/11/backoff.html) *brooker.co.za*, August 2022. Archived at [perma.cc/PW9N-55Q5](https://perma.cc/PW9N-55Q5)
[^11]: Michael T. Nygard. [*Release It!*](https://learning.oreilly.com/library/view/release-it-2nd/9781680504552/), 2nd Edition. Pragmatic Bookshelf, January 2018. ISBN: 9781680502398
[^12]: Frank Chen. [Slowing Down to Speed Up – Circuit Breakers for Slack's CI/CD](https://slack.engineering/circuit-breakers/). *slack.engineering*, August 2022. Archived at [perma.cc/5FGS-ZPH3](https://perma.cc/5FGS-ZPH3)
[^13]: Marc Brooker. [Fixing retries with token buckets and circuit breakers](https://brooker.co.za/blog/2022/02/28/retries.html). *brooker.co.za*, February 2022. Archived at [perma.cc/MD6N-GW26](https://perma.cc/MD6N-GW26)
[^14]: David Yanacek. [Using load shedding to avoid overload](https://aws.amazon.com/builders-library/using-load-shedding-to-avoid-overload/). Amazon Builders' Library, *aws.amazon.com*. Archived at [perma.cc/9SAW-68MP](https://perma.cc/9SAW-68MP)
[^15]: Matthew Sackman. [Pushing Back](https://wellquite.org/posts/lshift/pushing_back/). *wellquite.org*, May 2016. Archived at [perma.cc/3KCZ-RUFY](https://perma.cc/3KCZ-RUFY)
[^16]: Dmitry Kopytkov and Patrick Lee. [Meet Bandaid, the Dropbox service proxy](https://dropbox.tech/infrastructure/meet-bandaid-the-dropbox-service-proxy). *dropbox.tech*, March 2018. Archived at [perma.cc/KUU6-YG4S](https://perma.cc/KUU6-YG4S)
[^17]: Haryadi S. Gunawi, Riza O. Suminto, Russell Sears, Casey Golliher, Swaminathan Sundararaman, Xing Lin, Tim Emami, Weiguang Sheng, Nematollah Bidokhti, Caitie McCaffrey, Gary Grider, Parks M. Fields, Kevin Harms, Robert B. Ross, Andree Jacobson, Robert Ricci, Kirk Webb, Peter Alvaro, H. Birali Runesha, Mingzhe Hao, and Huaicheng Li. [Fail-Slow at Scale: Evidence of Hardware Performance Faults in Large Production Systems](https://www.usenix.org/system/files/conference/fast18/fast18-gunawi.pdf). At *16th USENIX Conference on File and Storage Technologies*, February 2018.
[^18]: Marc Brooker. [Is the Mean Really Useless?](https://brooker.co.za/blog/2017/12/28/mean.html) *brooker.co.za*, December 2017. Archived at [perma.cc/U5AE-CVEM](https://perma.cc/U5AE-CVEM)
[^19]: Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. [Dynamo: Amazon's Highly Available Key-Value Store](https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf). At *21st ACM Symposium on Operating Systems Principles* (SOSP), October 2007. [doi:10.1145/1294261.1294281](https://doi.org/10.1145/1294261.1294281)
[^20]: Kathryn Whitenton. [The Need for Speed, 23 Years Later](https://www.nngroup.com/articles/the-need-for-speed/). *nngroup.com*, May 2020. Archived at [perma.cc/C4ER-LZYA](https://perma.cc/C4ER-LZYA)
[^21]: Greg Linden. [Marissa Mayer at Web 2.0](https://glinden.blogspot.com/2006/11/marissa-mayer-at-web-20.html). *glinden.blogspot.com*, November 2006. Archived at [perma.cc/V7EA-3VXB](https://perma.cc/V7EA-3VXB)
[^22]: Jake Brutlag. [Speed Matters for Google Web Search](https://services.google.com/fh/files/blogs/google_delayexp.pdf). *services.google.com*, June 2009. Archived at [perma.cc/BK7R-X7M2](https://perma.cc/BK7R-X7M2)
[^23]: Eric Schurman and Jake Brutlag. [Performance Related Changes and their User Impact](https://www.youtube.com/watch?v=bQSE51-gr2s). Talk at *Velocity 2009*.
[^24]: Akamai Technologies, Inc. [The State of Online Retail Performance](https://web.archive.org/web/20210729180749/https%3A//www.akamai.com/us/en/multimedia/documents/report/akamai-state-of-online-retail-performance-spring-2017.pdf). *akamai.com*, April 2017. Archived at [perma.cc/UEK2-HYCS](https://perma.cc/UEK2-HYCS)
[^25]: Xiao Bai, Ioannis Arapakis, B. Barla Cambazoglu, and Ana Freire. [Understanding and Leveraging the Impact of Response Latency on User Behaviour in Web Search](https://iarapakis.github.io/papers/TOIS17.pdf). *ACM Transactions on Information Systems*, volume 36, issue 2, article 21, April 2018. [doi:10.1145/3106372](https://doi.org/10.1145/3106372)
[^26]: Jeffrey Dean and Luiz André Barroso. [The Tail at Scale](https://cacm.acm.org/research/the-tail-at-scale/). *Communications of the ACM*, volume 56, issue 2, pages 74–80, February 2013. [doi:10.1145/2408776.2408794](https://doi.org/10.1145/2408776.2408794)
[^27]: Alex Hidalgo. [*Implementing Service Level Objectives: A Practical Guide to SLIs, SLOs, and Error Budgets*](https://www.oreilly.com/library/view/implementing-service-level/9781492076803/). O'Reilly Media, September 2020. ISBN: 1492076813
[^28]: Jeffrey C. Mogul and John Wilkes. [Nines are Not Enough: Meaningful Metrics for Clouds](https://research.google/pubs/pub48033/). At *17th Workshop on Hot Topics in Operating Systems* (HotOS), May 2019. [doi:10.1145/3317550.3321432](https://doi.org/10.1145/3317550.3321432)
[^29]: Tamás Hauer, Philipp Hoffmann, John Lunney, Dan Ardelean, and Amer Diwan. [Meaningful Availability](https://www.usenix.org/conference/nsdi20/presentation/hauer). At *17th USENIX Symposium on Networked Systems Design and Implementation* (NSDI), February 2020.
[^30]: Ted Dunning. [The t-digest: Efficient estimates of distributions](https://www.sciencedirect.com/science/article/pii/S2665963820300403). *Software Impacts*, volume 7, article 100049, February 2021. [doi:10.1016/j.simpa.2020.100049](https://doi.org/10.1016/j.simpa.2020.100049)
[^31]: David Kohn. [How percentile approximation works (and why it's more useful than averages)](https://www.timescale.com/blog/how-percentile-approximation-works-and-why-its-more-useful-than-averages/). *timescale.com*, September 2021. Archived at [perma.cc/3PDP-NR8B](https://perma.cc/3PDP-NR8B)
[^32]: Heinrich Hartmann and Theo Schlossnagle. [Circllhist — A Log-Linear Histogram Data Structure for IT Infrastructure Monitoring](https://arxiv.org/pdf/2001.06561.pdf). *arxiv.org*, January 2020.
[^33]: Charles Masson, Jee E. Rim, and Homin K. Lee. [DDSketch: A Fast and Fully-Mergeable Quantile Sketch with Relative-Error Guarantees](https://www.vldb.org/pvldb/vol12/p2195-masson.pdf). *Proceedings of the VLDB Endowment*, volume 12, issue 12, pages 2195–2205, August 2019. [doi:10.14778/3352063.3352135](https://doi.org/10.14778/3352063.3352135)
[^34]: Baron Schwartz. [Why Percentiles Don't Work the Way You Think](https://orangematter.solarwinds.com/2016/11/18/why-percentiles-dont-work-the-way-you-think/). *solarwinds.com*, November 2016. Archived at [perma.cc/469T-6UGB](https://perma.cc/469T-6UGB)
[^35]: Walter L. Heimerdinger and Charles B. Weinstock. [A Conceptual Framework for System Fault Tolerance](https://resources.sei.cmu.edu/asset_files/TechnicalReport/1992_005_001_16112.pdf). Technical Report CMU/SEI-92-TR-033, Software Engineering Institute, Carnegie Mellon University, October 1992. Archived at [perma.cc/GD2V-DMJW](https://perma.cc/GD2V-DMJW)
[^36]: Felix C. Gärtner. [Fundamentals of fault-tolerant distributed computing in asynchronous environments](https://dl.acm.org/doi/pdf/10.1145/311531.311532). *ACM Computing Surveys*, volume 31, issue 1, pages 1–26, March 1999. [doi:10.1145/311531.311532](https://doi.org/10.1145/311531.311532)
[^37]: Algirdas Avižienis, Jean-Claude Laprie, Brian Randell, and Carl Landwehr. [Basic Concepts and Taxonomy of Dependable and Secure Computing](https://hdl.handle.net/1903/6459). *IEEE Transactions on Dependable and Secure Computing*, volume 1, issue 1, January 2004. [doi:10.1109/TDSC.2004.2](https://doi.org/10.1109/TDSC.2004.2)
[^38]: Ding Yuan, Yu Luo, Xin Zhuang, Guilherme Renna Rodrigues, Xu Zhao, Yongle Zhang, Pranay U. Jain, and Michael Stumm. [Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems](https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-yuan.pdf). At *11th USENIX Symposium on Operating Systems Design and Implementation* (OSDI), October 2014.
[^39]: Casey Rosenthal and Nora Jones. [*Chaos Engineering*](https://learning.oreilly.com/library/view/chaos-engineering/9781492043850/). O'Reilly Media, April 2020. ISBN: 9781492043867
[^40]: Eduardo Pinheiro, Wolf-Dietrich Weber, and Luiz Andre Barroso. [Failure Trends in a Large Disk Drive Population](https://www.usenix.org/legacy/events/fast07/tech/full_papers/pinheiro/pinheiro_old.pdf). At *5th USENIX Conference on File and Storage Technologies* (FAST), February 2007.
[^41]: Bianca Schroeder and Garth A. Gibson. [Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?](https://www.usenix.org/legacy/events/fast07/tech/schroeder/schroeder.pdf) At *5th USENIX Conference on File and Storage Technologies* (FAST), February 2007.
[^42]: Andy Klein. [Backblaze Drive Stats for Q2 2021](https://www.backblaze.com/blog/backblaze-drive-stats-for-q2-2021/). *backblaze.com*, August 2021. Archived at [perma.cc/2943-UD5E](https://perma.cc/2943-UD5E)
[^43]: Iyswarya Narayanan, Di Wang, Myeongjae Jeon, Bikash Sharma, Laura Caulfield, Anand Sivasubramaniam, Ben Cutler, Jie Liu, Badriddine Khessib, and Kushagra Vaid. [SSD Failures in Datacenters: What? When? and Why?](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/08/a7-narayanan.pdf) At *9th ACM International on Systems and Storage Conference* (SYSTOR), June 2016. [doi:10.1145/2928275.2928278](https://doi.org/10.1145/2928275.2928278)
[^44]: Alibaba Cloud Storage Team. [Storage System Design Analysis: Factors Affecting NVMe SSD Performance (1)](https://www.alibabacloud.com/blog/594375). *alibabacloud.com*, January 2019. Archived at [archive.org](https://web.archive.org/web/20230522005034/https%3A//www.alibabacloud.com/blog/594375)
[^45]: Bianca Schroeder, Raghav Lagisetty, and Arif Merchant. [Flash Reliability in Production: The Expected and the Unexpected](https://www.usenix.org/system/files/conference/fast16/fast16-papers-schroeder.pdf). At *14th USENIX Conference on File and Storage Technologies* (FAST), February 2016.
================================================
FILE: content/tw/ch3.md
================================================
---
title: "3. Data Models and Query Languages"
weight: 103
breadcrumbs: false
---

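> *The limits of my language mean the limits of my world.*
>
> Ludwig Wittgenstein, *Tractatus Logico-Philosophicus* (1922)

Data models are perhaps the most important part of developing software, because they have such a profound effect: not only on how the software is written, but also on how we **think about the problem** that we are solving.
Most applications are built by layering one data model on top of another. For each layer, the key question is: how is it **represented** in terms of the next-lower layer? For example:
1. As an application developer, you look at the real world (in which there are people, organizations, goods, actions, money flows, sensors, etc.) and model it in terms of objects or data structures, and APIs that manipulate those data structures. Those structures are often specific to your application.
2. When you want to store those data structures, you express them in terms of a general-purpose data model, such as JSON or XML documents, tables in a relational database, or vertices and edges in a graph. Those data models are the topic of this chapter.
3. The engineers who built your database software decided on a way of representing that document/relational/graph data in terms of bytes in memory, on disk, or on a network. The representation may allow the data to be queried, searched, manipulated, and processed in various ways. We will discuss these storage engine designs in [Chapter 4](/tw/ch4#ch_storage).
4. On yet lower levels, hardware engineers have figured out how to represent bytes in terms of electrical currents, pulses of light, magnetic fields, and more.
In a complex application there may be more intermediary levels, such as APIs built upon APIs, but the basic idea is still the same: each layer hides the complexity of the layers below it by providing a clean data model. These abstractions allow different groups of people (for example, the engineers at the database vendor and the application developers using their database) to work together effectively.
Several different data models are widely used in practice, often for different purposes. Some types of data and some queries are easy to express in one model and awkward in another. In this chapter we will explore those trade-offs by comparing the relational model, the document model, graph-based data models, event sourcing, and dataframes. We will also briefly look at query languages that allow you to work with these models. This comparison will help you decide when to use which model.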
--------
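> [!TIP] Terminology: Declarative Query Languages
>
> Many of the query languages in this chapter (such as SQL, Cypher, SPARQL, or Datalog) are **declarative**, which means that you specify the pattern of the data you want:
> what conditions the results must meet, and how you want the data to be transformed (e.g., sorted, grouped, and aggregated), but not **how** to achieve that goal.
> The database system's query optimizer can decide which indexes and which join algorithms to use, and in which order to execute the various parts of the query.
>
> In contrast, with most programming languages you would have to write an **algorithm**, i.e., tell the computer which operations to perform in which order.
> A declarative query language is attractive because it is typically more concise and easier to write than an explicit algorithm.
> But more importantly, it also hides the implementation details of the query engine, which makes it possible for the database system to introduce performance improvements without requiring any changes to queries [^1].
>
> For example, a database might be able to execute a declarative query in parallel across multiple CPU cores and machines, without you having to worry about how to implement that parallelism [^2].
> In a hand-coded algorithm it would be a lot of work to implement such parallel execution yourself.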
--------
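## Relational Model versus Document Model {#sec_datamodels_history}
The best-known data model today is probably that of SQL, based on the relational model proposed by Edgar Codd in 1970 [^3]:
data is organized into **relations** (called **tables** in SQL), where each relation is an unordered collection of **tuples** (**rows** in SQL).
The relational model was originally a theoretical proposal, and many people at the time doubted whether it could be implemented efficiently.
However, by the mid-1980s, relational database management systems (RDBMS) and SQL had become the tools of choice for most people who needed to store and query data with some kind of regular structure.
Many data management use cases are still dominated by relational data decades later, for example business analytics (see ["Stars and Snowflakes: Schemas for Analytics"](#sec_datamodels_analytics)).
Over the years, there have been many competing approaches to data storage and querying. In the 1970s and early 1980s, the **network model** and the **hierarchical model** were the main alternatives, but the relational model came to dominate them.
Object databases came and went again in the late 1980s and early 1990s. XML databases appeared in the early 2000s but have only seen niche adoption.
Each competitor to the relational model generated a lot of hype in its time, but none of them lasted [^4].
Instead, SQL has grown to incorporate other data types besides its relational core, for example by adding support for XML, JSON, and graph data [^5].
In the 2010s, **NoSQL** was the latest buzzword that tried to overthrow the dominance of relational databases.
NoSQL refers not to a single technology, but to a loose set of ideas around new data models, schema flexibility, scalability, and a move towards open source licensing models.
Some databases branded themselves as *NewSQL*, as they aim to provide the scalability of NoSQL systems along with the data model and transactional guarantees of traditional relational databases.
The NoSQL and NewSQL ideas have been very influential in the design of data systems, but as the principles have become widely adopted, use of those terms has faded.
One lasting effect of the NoSQL movement is the popularity of the **document model**, which usually represents data as JSON.
This model was originally popularized by specialized document databases such as MongoDB and Couchbase, although most relational databases have now also added JSON support.
Compared to relational tables, which are often seen as having a rigid and inflexible schema, JSON documents are thought to be more flexible.
The pros and cons of document and relational data have been debated extensively; let's examine some of the key points of that debate.
### The Object-Relational Mismatch {#sec_datamodels_document}
Much application development today is done in object-oriented programming languages, which leads to a common criticism of the SQL data model: if data is stored in relational tables, an awkward translation layer is required between the objects in the application code and the database model of tables, rows, and columns. The disconnect between the models is sometimes called an *impedance mismatch*.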
--------
> [!NOTE]
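> The term *impedance mismatch* is borrowed from electronics. Every electric circuit has a certain impedance (resistance to alternating current) on its inputs and outputs. When you connect one circuit's output to another one's input, the power transfer across the connection is maximized if the output and input impedances of the two circuits match. An impedance mismatch can lead to signal reflections and other troubles.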
--------
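#### Object-relational mapping (ORM) {#object-relational-mapping-orm}
Object-relational mapping (ORM) frameworks like ActiveRecord and Hibernate reduce the amount of boilerplate code required for this translation layer, but they are often criticized [^6]. Some commonly cited problems are:
* ORMs are complex and can't completely hide the differences between the two models, so developers still end up having to think about both the relational and the object representations of the data.
* ORMs are generally only used for OLTP app development (see ["Characterizing Transaction Processing and Analytics"](/tw/ch1#sec_introduction_oltp)); data engineers making the data available for analytics purposes still need to work with the underlying relational representation, so the design of the relational schema still matters when using an ORM.
* Many ORMs work only with relational OLTP databases. Organizations with diverse data systems such as search engines, graph databases, and NoSQL systems might find ORM support lacking.
* Some ORMs generate relational schemas automatically, but these might be awkward for the users who are accessing the relational data directly, and they might be inefficient on the underlying database. Customizing the ORM's schema and query generation can be complex and negate the benefit of using the ORM in the first place.
* ORMs make it easy to accidentally write inefficient queries, such as the *N+1 query problem* [^7]. For example, say you want to display a list of user comments on a page, so you perform one query that returns *N* comments, each containing the ID of its author. To show the name of the comment author you need to look up the ID in the users table. In hand-written SQL you would probably perform this join in the query and return the author name along with each comment, but with an ORM you might end up making a separate query on the users table for each of the *N* comments to look up its author, making *N*+1 database queries in total, which is slower than performing the join in the database. To avoid this problem, you may need to tell the ORM to fetch the author information at the same time as fetching the comments.
Nevertheless, ORMs also have advantages:
* For data that is well suited to a relational model, some kind of translation between the persistent relational and the in-memory object representation is inevitable, and ORMs reduce the amount of boilerplate code required for this translation. Complicated queries may still need to be handled outside of the ORM, but the ORM can help with the simple and repetitive cases.
* Some ORMs help with caching the results of database queries, which can help reduce the load on the database.
* ORMs can also help with managing schema migrations and other administrative activities.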
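To make the *N+1 query problem* mentioned in the list of criticisms concrete, here is a minimal sketch using Python's built-in `sqlite3` module rather than a real ORM. The `users` and `comments` tables, their columns, and the sample rows are assumptions made up for illustration; the first function imitates what a lazily loading ORM might do, and the second performs the join in the database.

```python
import sqlite3

# Hypothetical schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE comments (
        id INTEGER PRIMARY KEY,
        author_id INTEGER REFERENCES users (id),
        body TEXT
    );
    INSERT INTO users VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO comments VALUES
        (10, 1, 'First!'), (11, 2, 'Nice post'), (12, 1, 'Thanks');
""")

def comments_n_plus_one(conn):
    """One query for the N comments, then one extra query per comment."""
    queries = 1
    rows = conn.execute(
        "SELECT author_id, body FROM comments ORDER BY id").fetchall()
    result = []
    for author_id, body in rows:
        (name,) = conn.execute(
            "SELECT name FROM users WHERE id = ?", (author_id,)).fetchone()
        queries += 1
        result.append((name, body))
    return result, queries

def comments_with_join(conn):
    """A single query; the database resolves author IDs via a join."""
    rows = conn.execute("""
        SELECT users.name, comments.body
        FROM comments
        JOIN users ON comments.author_id = users.id
        ORDER BY comments.id
    """).fetchall()
    return rows, 1

slow_rows, slow_queries = comments_n_plus_one(conn)
fast_rows, fast_queries = comments_with_join(conn)
```

Both functions return the same rows, but with three comments the first one issues four queries and the second issues one. Most ORMs offer some form of eager loading to get the second behavior; the exact option depends on the framework.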
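#### The document data model for one-to-many relationships {#the-document-data-model-for-one-to-many-relationships}
Not all data lends itself well to a relational representation; let's look at an example to explore a limitation of the relational model. [Figure 3-1](#fig_obama_relational) illustrates how a résumé (a LinkedIn profile) could be expressed in a relational schema. The profile as a whole can be identified by a unique identifier, `user_id`. Fields like `first_name` and `last_name` appear exactly once per user, so they can be modeled as columns on the `users` table.
Most people have had more than one job in their career (positions), and people may have varying numbers of periods of education and any number of pieces of contact information. One way of representing such *one-to-many relationships* is to put positions, education, and contact information in separate tables, with a foreign key reference to the `users` table, as in [Figure 3-1](#fig_obama_relational).
{{< figure src="/fig/ddia_0301.png" id="fig_obama_relational" caption="Figure 3-1. Representing a LinkedIn profile using a relational schema." class="w-full my-4" >}}
Another way of representing the same information, which is perhaps more natural and maps more closely to an object structure in application code, is as a JSON document, as shown in [Example 3-1](#fig_obama_json).
{{< figure id="fig_obama_json" title="Example 3-1. Representing a LinkedIn profile as a JSON document" class="w-full my-4" >}}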
```json
{
"user_id": 251,
"first_name": "Barack",
"last_name": "Obama",
"headline": "Former President of the United States of America",
"region_id": "us:91",
"photo_url": "/p/7/000/253/05b/308dd6e.jpg",
"positions": [
{"job_title": "President", "organization": "United States of America"},
{"job_title": "US Senator (D-IL)", "organization": "United States Senate"}
],
"education": [
{"school_name": "Harvard University", "start": 1988, "end": 1991},
{"school_name": "Columbia University", "start": 1981, "end": 1983}
],
"contact_info": {
"website": "https://barackobama.com",
"twitter": "https://twitter.com/barackobama"
}
}
```
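Some developers feel that the JSON model reduces the impedance mismatch between the application code and the storage layer. However, as we shall see in [Chapter 5](/tw/ch5#ch_encoding), there are also problems with JSON as a data encoding format. The lack of a schema is often cited as an advantage; we will discuss this in ["Schema flexibility in the document model"](#sec_datamodels_schema_flexibility).
The JSON representation has better *locality* than the multitable schema in [Figure 3-1](#fig_obama_relational) (see ["Data locality for reads and writes"](#sec_datamodels_document_locality)). If you want to fetch a profile in the relational example, you need to either perform multiple queries (query each table by `user_id`) or perform a messy multiway join between the `users` table and its subordinate tables [^8]. In the JSON representation, all the relevant information is in one place, making the query both faster and simpler.
The one-to-many relationships from the user profile to the user's positions, educational history, and contact information imply a tree structure in the data, and the JSON representation makes this tree structure explicit (see [Figure 3-2](#fig_json_tree)).
{{< figure src="/fig/ddia_0302.png" id="fig_json_tree" caption="Figure 3-2. One-to-many relationships forming a tree structure." class="w-full my-4" >}}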
--------
> [!NOTE]
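> This type of relationship is sometimes called *one-to-few* rather than *one-to-many*, since a résumé typically has a small number of positions [^9] [^10]. In situations where there may be a genuinely large number of related items (say, comments on a celebrity's social media post, of which there could be many thousands), embedding them all in the same document may be too unwieldy, so the relational approach in [Figure 3-1](#fig_obama_relational) is preferable.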
--------
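### Normalization, Denormalization, and Joins {#sec_datamodels_normalization}
In [Example 3-1](#fig_obama_json) in the preceding section, `region_id` is given as an ID, not as the plain-text string `"Washington, DC, United States"`. Why?
If the user interface has a free-text field for entering the region, it makes sense to store it as a plain-text string. But there are advantages to having standardized lists of geographic regions, and letting users choose from a drop-down list or autocompleter:
* Consistent style and spelling across profiles
* Avoiding ambiguity if there are several places with the same name (if the string were just "Washington", would it refer to DC or to the state?)
* Ease of updating: the name is stored in only one place, so it is easy to update across the board if it ever needs to be changed (e.g., change of a city name due to political events)
* Localization support: when the site is translated into other languages, the standardized lists can be localized, so the region can be displayed in the viewer's language
* Better search: e.g., a search for people on the US East Coast can match this profile, because the list of regions can encode the fact that Washington is located on the East Coast (which is not apparent from the string `"Washington, DC"`)
Whether you store an ID or a text string is a question of *normalization*. When you use an ID, your data is more normalized: the information that is meaningful to humans (such as the text *Washington, DC*) is stored in only one place, and everything that refers to it uses an ID (which only has meaning within the database). When you store the text directly, you are duplicating the human-meaningful information in every record that uses it; this representation is *denormalized*.
The advantage of using an ID is that because it has no meaning to humans, it never needs to change: the ID can remain the same, even if the information it identifies changes. Anything that is meaningful to humans may need to change sometime in the future, and if that information is duplicated, all the redundant copies need to be updated. That requires more code, more write operations, more disk space, and risks inconsistencies (where some copies of the information are updated but others aren't).
The downside of a normalized representation is that every time you want to display a record containing an ID, you have to do an additional lookup to resolve the ID into something human-readable. In a relational data model, this is done using a *join*, for example: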
```sql
SELECT users.*, regions.region_name
FROM users
JOIN regions ON users.region_id = regions.id
WHERE users.id = 251;
```
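Document databases can store both normalized and denormalized data, but they are often associated with denormalization, partly because the JSON data model makes it easy to store additional, denormalized fields, and partly because the weak support for joins in many document databases makes normalization inconvenient. Some document databases don't support joins at all, so you have to perform them in application code; that is, you first fetch a document containing an ID, and then perform a second query to resolve that ID into another document. In MongoDB, it is also possible to perform a join using the `$lookup` operator in an aggregation pipeline: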
```mongodb-json
db.users.aggregate([
{ $match: { _id: 251 } },
{ $lookup: {
from: "regions",
localField: "region_id",
foreignField: "_id",
as: "region"
} }
])
```
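#### Trade-offs of normalization {#trade-offs-of-normalization}
In the résumé example, while the `region_id` field is a reference into a standardized set of regions, the names of the `organization` (the company or government where the person worked) and `school_name` (where they studied) are just strings. This representation is denormalized: many people may have worked at the same company, but there is no ID linking them.
Perhaps the organization and school should be entities instead, and the profile should reference their IDs instead of their names? The same arguments for referencing the ID of a region also apply here. For example, say we wanted to include the logo of the school or company in addition to their name:
* In a denormalized representation, we would include the image URL of the logo on every individual person's profile; this makes the JSON document self-contained, but it creates a headache if we ever need to change the logo, because we now need to find all of the occurrences of the old URL and update them [^9].
* In a normalized representation, we would create an entity representing an organization or school, and store its name, logo URL, and perhaps other attributes (description, news feed, etc.) once on that entity. Every résumé that mentions the organization would then simply reference its ID, and updating the logo is easy.
As a general principle, normalized data is usually faster to write (since there is only one copy), but slower to query (since it requires joins); denormalized data is usually faster to read (fewer joins), but more expensive to write (more copies to update, more disk space used). You might find it helpful to view denormalization as a form of derived data (["Systems of Record and Derived Data"](/tw/ch1#sec_introduction_derived)), since you need to set up a process for updating the redundant copies of the data.
Besides the cost of performing all these updates, you also need to consider the consistency of the database if a process crashes halfway through making its updates. Databases that offer atomic transactions (see ["Atomicity"](/tw/ch8#sec_transactions_acid_atomicity)) make it easier to remain consistent, but not all databases offer atomicity across multiple documents. It is also possible to ensure consistency through stream processing, which we discuss in ["Keeping Systems in Sync"](/tw/ch12#sec_stream_sync).
Normalization tends to be better for OLTP systems, where both reads and updates need to be fast; analytics systems often fare better with denormalized data, since they perform updates in bulk, and the performance of read-only queries is the dominant concern. Moreover, in systems of small to moderate scale, a normalized data model is often best, because you don't have to worry about keeping multiple copies of the data consistent with each other, and the cost of performing joins is acceptable. However, in very large-scale systems, the cost of joins can become problematic.
#### Denormalization in the social networking case study {#denormalization-in-the-social-networking-case-study}
In ["Case Study: Social Network Home Timelines"](/tw/ch2#sec_introduction_twitter) we compared a normalized representation ([Figure 2-1](/tw/ch2#fig_twitter_relational)) and a denormalized one (precomputed, materialized timelines): here, the join between `posts` and `follows` was too expensive, and the materialized timeline is a cache of the result of that join. The fan-out process that inserts a new post into followers' timelines was our way of keeping the denormalized representation consistent.
However, the implementation of materialized timelines at X (formerly Twitter) does not actually store the text of each post: each entry stores only the post ID, the ID of the user who posted it, and a little bit of extra information to identify reposts and replies [^11]. In other words, it is a precomputed result of (approximately) the following query: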
```sql
SELECT posts.id, posts.sender_id
FROM posts
JOIN follows ON posts.sender_id = follows.followee_id
WHERE follows.follower_id = current_user
ORDER BY posts.timestamp DESC
LIMIT 1000
```
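This means that whenever the timeline is read, the service still needs to perform two joins: look up the post by ID to fetch the actual post content (as well as statistics such as the number of likes and replies), and look up the sender's profile by ID (to get their username, profile picture, and other details). This process of looking up the human-readable information by ID is called *hydrating* the IDs, and it is essentially a join performed in application code [^11].
The reason for storing only IDs in the precomputed timeline is that the data they refer to is fast-changing: the number of likes and replies may change multiple times per second on a popular post, and some users regularly change their username or profile photo. Since the timeline should show the latest like count and profile picture when it is viewed, it would not make sense to denormalize this information into the materialized timeline. Moreover, the storage cost would be increased significantly by such denormalization.
This example shows that having to perform joins when reading data is not, as is sometimes claimed, an impediment to creating high-performance, scalable services. Hydrating post IDs and user IDs is actually a fairly easy operation to scale, since it parallelizes well, and the cost doesn't depend on the number of accounts you are following or the number of followers you have.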
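As a rough illustration of hydration as an application-side join, the following Python sketch resolves the IDs in a timeline with two batched lookups. The `POSTS` and `USERS` dictionaries, the field names, and the helper functions are hypothetical stand-ins for real post and user stores; nothing here reflects how X actually implements it.

```python
# Hypothetical in-memory stand-ins for the post store and the user store.
POSTS = {
    101: {"text": "Hello world", "likes": 12},
    102: {"text": "Graphs are fun", "likes": 3},
}
USERS = {
    7: {"name": "lucy"},
    8: {"name": "alain"},
}

def fetch_posts(post_ids):
    """Batched lookup by ID; a real service might issue one multi-get."""
    return {pid: POSTS[pid] for pid in post_ids}

def fetch_users(user_ids):
    """Batched lookup of user profiles by ID."""
    return {uid: USERS[uid] for uid in user_ids}

def hydrate_timeline(entries):
    """Turn (post_id, sender_id) entries into displayable records."""
    posts = fetch_posts({e["post_id"] for e in entries})
    users = fetch_users({e["sender_id"] for e in entries})
    return [
        {
            "sender": users[e["sender_id"]]["name"],
            "text": posts[e["post_id"]]["text"],
            "likes": posts[e["post_id"]]["likes"],
        }
        for e in entries
    ]

# The materialized timeline stores only IDs, newest first.
timeline = [{"post_id": 102, "sender_id": 8}, {"post_id": 101, "sender_id": 7}]
hydrated = hydrate_timeline(timeline)
```

The number of lookups depends only on the length of the timeline page being rendered, and the two batched fetches are independent of each other, which is why this step parallelizes well.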
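If you need to decide whether to denormalize something in your application, the social network case study shows that the choice is not immediately obvious: the most scalable approach may involve denormalizing some things and leaving others normalized. You will have to carefully consider how often the information changes, and the cost of reads and writes (which might be dominated by outliers, such as users with many follows/followers in the case of a typical social network). Normalization and denormalization are not inherently good or bad; they are just a trade-off in terms of performance of reads and writes, as well as the amount of effort to implement.
### Many-to-One and Many-to-Many Relationships {#sec_datamodels_many_to_many}
While `positions` and `education` in [Figure 3-1](#fig_obama_relational) are examples of one-to-many or one-to-few relationships (one résumé has several positions, but each position belongs only to one résumé), the `region_id` field is an example of a *many-to-one* relationship (many people live in the same region, but we assume that each person lives in only one region at any one time).
If we introduce entities for organizations and schools, and reference them by ID from the résumé, then we also have *many-to-many* relationships (one person has worked for several organizations, and an organization has several past or present employees). In a relational model, such a relationship is usually represented as an *associative table* or *join table*, as shown in [Figure 3-3](#fig_datamodels_m2m_rel): each position associates one user ID with one organization ID.
{{< figure src="/fig/ddia_0303.png" id="fig_datamodels_m2m_rel" caption="Figure 3-3. Many-to-many relationships in the relational model." class="w-full my-4" >}}
Many-to-one and many-to-many relationships do not easily fit within one self-contained JSON document; they lend themselves more to a normalized representation. In a document model, one possible representation is given in [Example 3-2](#fig_datamodels_m2m_json) and illustrated in [Figure 3-4](#fig_datamodels_many_to_many): the data within each dotted rectangle can be grouped into one document, but the links to organizations and schools are best represented as references to other documents.
{{< figure id="fig_datamodels_m2m_json" title="Example 3-2. A résumé that references organizations by ID." class="w-full my-4" >}}
```json
{
"user_id": 251,
"first_name": "Barack",
"last_name": "Obama",
"positions": [
{"start": 2009, "end": 2017, "job_title": "President", "org_id": 513},
{"start": 2005, "end": 2008, "job_title": "US Senator (D-IL)", "org_id": 514}
],
...
}
```
{{< figure src="/fig/ddia_0304.png" id="fig_datamodels_many_to_many" caption="Figure 3-4. Many-to-many relationships in the document model: the data within each dotted box can be grouped into one document." class="w-full my-4" >}}
Many-to-many relationships often need to be queried in "both directions": for example, finding all of the organizations that a particular person has worked for, and finding all of the people who have worked at a particular organization. One way of enabling such queries is to store ID references on both sides, i.e., a résumé includes the ID of each organization where the person has worked, and the organization document includes the IDs of the résumés that mention that organization. This representation is denormalized, since the relationship is stored in two places, which could become inconsistent with each other.
A normalized representation stores the relationship in only one place, and relies on *secondary indexes* (which we discuss in [Chapter 4](/tw/ch4#ch_storage)) to allow the relationship to be efficiently queried in both directions. In the relational schema of [Figure 3-3](#fig_datamodels_m2m_rel), we would tell the database to create indexes on both the `user_id` and the `org_id` columns of the `positions` table.
In the document model of [Example 3-2](#fig_datamodels_m2m_json), the database needs to index the `org_id` field of the objects inside the `positions` array. Many document databases and relational databases with JSON support are able to create such indexes on values inside a document.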
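The normalized, relational variant of this many-to-many relationship can be sketched with Python's built-in `sqlite3` module. The table layout is an assumption modeled loosely on the `positions` join table described above (the exact columns in the figure may differ), and the index names are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (user_id INTEGER PRIMARY KEY, last_name TEXT);
    CREATE TABLE organizations (org_id INTEGER PRIMARY KEY, name TEXT);
    -- Join table: each row associates one user with one organization.
    CREATE TABLE positions (
        user_id INTEGER REFERENCES users (user_id),
        org_id INTEGER REFERENCES organizations (org_id),
        job_title TEXT
    );
    -- Secondary indexes so the relationship can be queried in both directions.
    CREATE INDEX positions_user ON positions (user_id);
    CREATE INDEX positions_org ON positions (org_id);

    INSERT INTO users VALUES (251, 'Obama');
    INSERT INTO organizations VALUES
        (513, 'United States of America'), (514, 'United States Senate');
    INSERT INTO positions VALUES
        (251, 513, 'President'), (251, 514, 'US Senator (D-IL)');
""")

def orgs_for_user(user_id):
    """Direction 1: all organizations a given person has worked for."""
    rows = conn.execute("""
        SELECT organizations.name
        FROM positions
        JOIN organizations ON positions.org_id = organizations.org_id
        WHERE positions.user_id = ?
        ORDER BY organizations.org_id
    """, (user_id,))
    return [name for (name,) in rows]

def users_for_org(org_id):
    """Direction 2: all people who have worked at a given organization."""
    rows = conn.execute("""
        SELECT users.last_name
        FROM positions
        JOIN users ON positions.user_id = users.user_id
        WHERE positions.org_id = ?
        ORDER BY users.user_id
    """, (org_id,))
    return [name for (name,) in rows]
```

The relationship is stored once, in `positions`, so there is no second copy that could drift out of sync; the two indexes are what keep both lookups efficient as the table grows.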
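### Stars and Snowflakes: Schemas for Analytics {#sec_datamodels_analytics}
Data warehouses (see ["Data Warehousing"](/tw/ch1#sec_introduction_dwh)) are usually relational, and there are a few widely used conventions for the structure of tables in a data warehouse: a *star schema*, *snowflake schema*, *dimensional modeling* [^12], and *one big table* (OBT). These structures are optimized for the needs of business analysts. ETL processes translate data from operational systems into this schema.
[Figure 3-5](#fig_dwh_schema) shows an example of a star schema that might be found in the data warehouse of a grocery retailer. At the center of the schema is a so-called *fact table* (in this example, it is called `fact_sales`). Each row of the fact table represents an event that occurred at a particular time (here, each row represents a customer's purchase of a product). If we were analyzing website traffic rather than retail sales, each row might represent a page view or a click by a user.
{{< figure src="/fig/ddia_0305.png" id="fig_dwh_schema" caption="Figure 3-5. Example of a star schema for use in a data warehouse." class="w-full my-4" >}}
Usually, facts are captured as individual events, because this allows maximum flexibility of analysis later. However, this means that the fact table can become extremely large. A big enterprise may have many petabytes of transaction history in its data warehouse, mostly represented as fact tables.
Some of the columns in the fact table are attributes, such as the price at which the product was sold and the cost of buying it from the supplier (allowing the profit margin to be calculated). Other columns in the fact table are foreign key references to other tables, called *dimension tables*. As each row in the fact table represents an event, the dimensions represent the *who*, *what*, *where*, *when*, *how*, and *why* of the event.
For example, in [Figure 3-5](#fig_dwh_schema), one of the dimensions is the product that was sold. Each row in the `dim_product` table represents one type of product that is for sale, including its stock-keeping unit (SKU), description, brand name, category, fat content, package size, etc. Each row in the `fact_sales` table uses a foreign key to indicate which product was sold in that particular transaction. Queries often involve multiple joins to multiple dimension tables.
Even date and time are often represented using dimension tables, because this allows additional information about dates (such as public holidays) to be encoded, allowing queries to differentiate between sales on holidays and non-holidays.
[Figure 3-5](#fig_dwh_schema) is an example of a star schema. The name comes from the fact that when the table relationships are visualized, the fact table is in the middle, surrounded by its dimension tables; the connections to these tables are like the rays of a star.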
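A typical query against a star schema joins the fact table to several dimension tables and aggregates the result. The sketch below uses Python's `sqlite3` module with a heavily cut-down schema; apart from the table names `fact_sales` and `dim_product`, the column names (`date_key`, `product_sk`, `is_holiday`, `net_price`, and so on) and the sample rows are assumptions for illustration and are not taken from the figure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: the "what" and the "when" of each sale.
    CREATE TABLE dim_product (product_sk INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, is_holiday INTEGER);
    -- Fact table: one row per product purchased.
    CREATE TABLE fact_sales (
        date_key INTEGER REFERENCES dim_date (date_key),
        product_sk INTEGER REFERENCES dim_product (product_sk),
        quantity INTEGER,
        net_price REAL
    );
    INSERT INTO dim_product VALUES (1, 'Fresh fruit'), (2, 'Candy');
    INSERT INTO dim_date VALUES (20240101, 1), (20240102, 0);
    INSERT INTO fact_sales VALUES
        (20240101, 1, 3, 2.0), (20240101, 2, 1, 1.5),
        (20240102, 1, 2, 2.0), (20240102, 2, 4, 1.5);
""")

# Revenue per product category, split into holiday and non-holiday sales.
revenue = conn.execute("""
    SELECT dim_date.is_holiday,
           dim_product.category,
           SUM(fact_sales.quantity * fact_sales.net_price) AS revenue
    FROM fact_sales
    JOIN dim_date ON fact_sales.date_key = dim_date.date_key
    JOIN dim_product ON fact_sales.product_sk = dim_product.product_sk
    GROUP BY dim_date.is_holiday, dim_product.category
    ORDER BY dim_date.is_holiday, dim_product.category
""").fetchall()
```

Because the holiday flag lives in the date dimension, the query can separate holiday from non-holiday sales without any date arithmetic on the fact table itself.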
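A variation of this template is known as the *snowflake schema*, where dimensions are further broken down into subdimensions. For example, there could be separate tables for brands and product categories, and each row in the `dim_product` table could reference the brand and category as foreign keys, rather than storing them as strings in the `dim_product` table. Snowflake schemas are more normalized than star schemas, but star schemas are often preferred because they are simpler for analysts to work with [^12].
In a typical data warehouse, tables are often quite wide: fact tables often have over 100 columns, sometimes several hundred. Dimension tables can also be wide, as they include all the metadata that may be relevant for analysis: for example, the `dim_store` table may include details of which services are offered at each store, whether it has an in-store bakery, the square footage, the date when the store was first opened, when it was last remodeled, how far it is from the nearest highway, etc.
A star or snowflake schema consists mostly of many-to-one relationships (e.g., many sales occur for one particular product, in one particular store), represented as the fact table having foreign keys into dimension tables, or dimensions into subdimensions. In principle, other types of relationship could exist, but they are often denormalized in order to simplify queries. For example, if a customer buys several different products at once, that multi-item transaction is not represented explicitly; instead, there is a separate row in the fact table for each product purchased, and those facts all just happen to have the same customer ID, store ID, and timestamp.
Some data warehouse schemas take denormalization even further and leave out the dimension tables entirely, folding the information in the dimensions into denormalized columns on the fact table instead (essentially, precomputing the join between the fact table and the dimension tables). This approach is known as *one big table* (OBT), and while it requires more storage space, it sometimes enables faster queries [^13].
In the context of analytics, such denormalization is unproblematic, since the data typically represents a log of historical data that is not going to change (except maybe for occasionally correcting an error). The issues of data consistency and write overheads that occur with denormalization in OLTP systems are not as pressing in analytics.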
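### When to Use Which Model {#sec_datamodels_document_summary}
The main arguments in favor of the document data model are schema flexibility, better performance due to locality, and that for some applications it is closer to the object model used by the application. The relational model counters by providing better support for joins, many-to-one, and many-to-many relationships. Let's examine these arguments in more detail.
If the data in your application has a document-like structure (i.e., a tree of one-to-many relationships, where typically the entire tree is loaded at once), then it's probably a good idea to use a document model. The relational technique of *shredding* (splitting a document-like structure into multiple tables, like `positions`, `education`, and `contact_info` in [Figure 3-1](#fig_obama_relational)) can lead to cumbersome schemas and unnecessarily complicated application code.
The document model has limitations: for example, you cannot refer directly to a nested item within a document, but instead you need to say something like "the second item in the list of positions for user 251". If you do need to reference nested items, a relational approach works better, since you can refer to any item directly by its ID.
Some applications allow the user to choose the order of items: for example, imagine a to-do list or issue tracker where the user can drag and drop tasks to reorder them. The document model supports such applications well, because the items (or their IDs) can simply be stored in a JSON array to determine their order. In relational databases there isn't a standard way of representing such reorderable lists, and various tricks are used: sorting by an integer column (requiring renumbering when you insert into the middle), a linked list of IDs, or fractional indexing [^14] [^15] [^16].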
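To give a feel for the fractional indexing trick, here is a deliberately simplified Python sketch that uses exact rational numbers as sort keys. The function name `position_between` and the task names are invented for the example, and production implementations typically use more compact string keys and also handle concurrent edits, which this sketch ignores.

```python
from fractions import Fraction

def position_between(before, after):
    """Return a sort key strictly between two neighbouring keys.

    `None` means "no neighbour on that side". Assumes all keys are positive.
    """
    if before is None and after is None:
        return Fraction(1)
    if before is None:
        return after / 2          # insert at the front of the list
    if after is None:
        return before + 1         # append at the end of the list
    return (before + after) / 2   # insert between two existing items

# Each task stores its own sort key, e.g. in a column of its row.
tasks = {"write": Fraction(1), "test": Fraction(2), "ship": Fraction(3)}

# Drag "ship" between "write" and "test": only one row's key changes.
tasks["ship"] = position_between(tasks["write"], tasks["test"])

ordered = sorted(tasks, key=tasks.get)
```

Unlike an integer position column, moving an item never requires renumbering its neighbours; the cost is that keys can grow longer after many insertions in the same spot.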
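#### Schema flexibility in the document model {#sec_datamodels_schema_flexibility}
Most document databases, and the JSON support in relational databases, do not enforce any schema on the data in documents. XML support in relational databases usually comes with optional schema validation. No schema means that arbitrary keys and values can be added to a document, and when reading, clients have no guarantees as to what fields the documents may contain.
Document databases are sometimes called *schemaless*, but that's misleading, as the code that reads the data usually assumes some kind of structure; i.e., there is an implicit schema, but it is not enforced by the database [^17]. A more accurate term is *schema-on-read* (the structure of the data is implicit, and only interpreted when the data is read), in contrast with *schema-on-write* (the traditional approach of relational databases, where the schema is explicit and the database ensures all data conforms to it when the data is written) [^18].
Schema-on-read is similar to dynamic (runtime) type checking in programming languages, whereas schema-on-write is similar to static (compile-time) type checking. Just as the advocates of static and dynamic type checking have big debates about their relative merits [^19], enforcement of schemas in databases is a contentious topic, and in general there's no right or wrong answer.
The difference between the approaches is particularly noticeable in situations where an application wants to change the format of its data. For example, say you are currently storing each user's full name in one field, and you instead want to store the first name and last name separately [^20]. In a document database, you would just start writing new documents with the new fields and have code in the application that handles the case when old documents are read. For example: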
```mongodb-json
if (user && user.name && !user.first_name) {
// Documents written before Dec 8, 2023 don't have first_name
user.first_name = user.name.split(" ")[0];
}
```
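The downside of this approach is that every part of your application that reads from the database now needs to deal with documents in old formats that may have been written a long time in the past. On the other hand, in a schema-on-write database, you would typically perform a *migration* along the lines of: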
```sql
ALTER TABLE users ADD COLUMN first_name text DEFAULT NULL;
UPDATE users SET first_name = split_part(name, ' ', 1); -- PostgreSQL
UPDATE users SET first_name = substring_index(name, ' ', 1); -- MySQL
```
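In most relational databases, adding a column with a default value is fast and unproblematic, even on large tables. However, running the `UPDATE` statement is likely to be slow on a large table, since every row needs to be rewritten, and other schema operations (such as changing the data type of a column) also typically require the entire table to be copied.
Various tools exist to allow this type of schema change to be performed in the background without downtime [^21] [^22] [^23] [^24], but performing such migrations on large databases remains operationally challenging. Complicated migrations can be avoided by only adding the `first_name` column with a default value of `NULL` (which is fast), and filling it in at read time, like you would with a document database.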
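One way to contain the cost of schema-on-read is to funnel every read through a single function that upgrades old records, so that the rest of the application only ever sees the new format. The following Python sketch assumes records arrive as plain dictionaries; the function name `read_user` is hypothetical, and the splitting rule is the same simplistic one used in the earlier snippet.

```python
def read_user(record):
    """Return a copy of `record` in the new format, filling in first_name
    at read time for records written before the field existed."""
    user = dict(record)  # don't mutate the caller's copy
    if user.get("name") and not user.get("first_name"):
        user["first_name"] = user["name"].split(" ")[0]
    return user

old_record = {"name": "Barack Obama"}                       # old format
new_record = {"name": "Barack Obama", "first_name": "Barack"}
sparse_row = {"name": "Lucy", "first_name": None}           # NULL column value

upgraded = read_user(old_record)
```

The same function covers both cases described above: a document that predates the new field, and a relational row whose `first_name` column is still `NULL` because no bulk `UPDATE` was run.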
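The schema-on-read approach is advantageous if the items in the collection don't all have the same structure for some reason (i.e., the data is heterogeneous), for example because:
* There are many different types of objects, and it is not practicable to put each type of object in its own table.
* The structure of the data is determined by external systems over which you have no control and which may change at any time.
In situations like these, a schema may hurt more than it helps, and schemaless documents can be a much more natural data model. But in cases where all records are expected to have the same structure, schemas are a useful mechanism for documenting and enforcing that structure. We will discuss schemas and schema evolution in more detail in [Chapter 5](/tw/ch5#ch_encoding).
#### Data locality for reads and writes {#sec_datamodels_document_locality}
A document is usually stored as a single continuous string, encoded as JSON, XML, or a binary variant thereof (such as MongoDB's BSON). If your application often needs to access the entire document (for example, to render it on a web page), there is a performance advantage to this *storage locality*. If data is split across multiple tables, like in [Figure 3-1](#fig_obama_relational), multiple index lookups are required to retrieve it all, which may require more disk seeks and take more time.
The locality advantage applies only if you need large parts of the document at the same time. The database typically needs to load the entire document, which can be wasteful if you only need to access a small part of a large document. On updates to a document, the entire document usually needs to be rewritten. For these reasons, it is generally recommended that you keep documents fairly small and avoid frequent small updates to a document.
However, the idea of storing related data together for locality is not limited to the document model. For example, Google's Spanner database offers the same locality properties in a relational data model, by allowing the schema to declare that a table's rows should be interleaved (nested) within a parent table [^25]. Oracle allows the same, using a feature called *multitable index cluster tables* [^26]. The *wide-column* data model popularized by Google's Bigtable, and used e.g. in HBase and Accumulo, has a concept of *column families*, which have a similar purpose of managing locality [^27].
#### Query languages for documents {#query-languages-for-documents}
Another difference between a relational and a document database is the language or API that you use to query it. Most relational databases are queried using SQL, but document databases are more varied. Some allow only key-value access by primary key, while others also offer secondary indexes to query for values inside documents, and some provide rich query languages.
XML databases are often queried using XQuery and XPath, which are designed to allow complex queries, including joins across multiple documents, and also format their results as XML [^28]. JSON Pointer [^29] and JSONPath [^30] provide an equivalent to XPath for JSON.
MongoDB's aggregation pipeline, whose `$lookup` operator for joins we saw in ["Normalization, Denormalization, and Joins"](#sec_datamodels_normalization), is an example of a query language for collections of JSON documents.
Let's look at another example to get a feel for this language, this time an aggregation, which is especially needed for analytics. Imagine you are a marine biologist, and you add an observation record to your database every time you see animals in the ocean. Now you want to generate a report saying how many sharks you have sighted per month. In PostgreSQL you might express that query like this: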
```sql
SELECT date_trunc('month', observation_timestamp) AS observation_month, ❶
sum(num_animals) AS total_animals
FROM observations
WHERE family = 'Sharks'
GROUP BY observation_month;
```
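❶ : The `date_trunc('month', timestamp)` function determines the calendar month containing `timestamp`, and returns another timestamp representing the beginning of that month. In other words, it rounds a timestamp down to the nearest month.
This query first filters the observations to only show species in the `Sharks` family, then groups the observations by the calendar month in which they occurred, and finally adds up the number of animals seen in all observations in that month. The same query can be expressed using MongoDB's aggregation pipeline as follows: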
```mongodb-json
db.observations.aggregate([
{ $match: { family: "Sharks" } },
{ $group: {
_id: {
year: { $year: "$observationTimestamp" },
month: { $month: "$observationTimestamp" }
},
totalAnimals: { $sum: "$numAnimals" }
} }
]);
```
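The aggregation pipeline language is similar in expressiveness to a subset of SQL, but it uses a JSON-based syntax rather than SQL's English-sentence-style syntax; the difference is perhaps a matter of taste.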
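For contrast with the two declarative formulations, the same report can be written as an explicit algorithm. The following is a minimal Python sketch; the sample records are made up, and the field names simply mirror those in the MongoDB example.

```python
from collections import defaultdict
from datetime import datetime

# Made-up observation records for illustration.
observations = [
    {"observationTimestamp": datetime(2024, 1, 5), "family": "Sharks", "numAnimals": 3},
    {"observationTimestamp": datetime(2024, 1, 20), "family": "Sharks", "numAnimals": 4},
    {"observationTimestamp": datetime(2024, 1, 21), "family": "Dolphins", "numAnimals": 9},
    {"observationTimestamp": datetime(2024, 2, 2), "family": "Sharks", "numAnimals": 1},
]

def sharks_per_month(records):
    """Filter, group by (year, month), and sum, spelled out step by step."""
    totals = defaultdict(int)
    for obs in records:
        if obs["family"] != "Sharks":       # filter (WHERE / $match)
            continue
        ts = obs["observationTimestamp"]
        month = (ts.year, ts.month)         # grouping key (GROUP BY / $group)
        totals[month] += obs["numAnimals"]  # aggregate (sum / $sum)
    return dict(totals)

report = sharks_per_month(observations)
```

Here the code fixes the order of operations and the access path (a full scan of a list held in memory), whereas with a declarative query those decisions are left to the database's query optimizer.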
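#### Convergence of document and relational databases {#convergence-of-document-and-relational-databases}
Document databases and relational databases started out as very different approaches to data management, but they have grown more similar over time [^31]. Relational databases added support for JSON types and query operators, and the ability to index properties inside documents. Some document databases (such as MongoDB, Couchbase, and RethinkDB) added support for joins, secondary indexes, and declarative query languages.
This convergence of the models is good news for application developers, because the relational model and the document model work best when you can combine both in the same database. Many document databases need relational-style references to other documents, and many relational databases have sections where schema flexibility is beneficial. Relational-document hybrids are a powerful combination.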
--------
> [!NOTE]
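> Codd's original description of the relational model [^3] actually allowed something similar to JSON within a relational schema. He called it *nonsimple domains*. The idea was that a value in a row doesn't have to just be a primitive datatype like a number or a string, but it could also be a nested relation (table), so you can have an arbitrarily nested tree structure as a value, much like the JSON or XML support that was added to SQL over 30 years later.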
--------
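## Graph-Like Data Models {#sec_datamodels_graph}
We saw earlier that the type of relationships is an important distinguishing feature between different data models. If your application has mostly one-to-many relationships (tree-structured data) and few other relationships between records, the document model is appropriate.
But what if many-to-many relationships are very common in your data? The relational model can handle simple cases of many-to-many relationships, but as the connections within your data become more complex, it becomes more natural to start modeling your data as a graph.
A graph consists of two kinds of objects: *vertices* (also known as *nodes* or *entities*) and *edges* (also known as *relationships* or *arcs*). Many kinds of data can be modeled as a graph. Typical examples include:
Social graphs
: Vertices are people, and edges indicate which people know each other.
The web graph
: Vertices are web pages, and edges indicate HTML links to other pages.
Road or rail networks
: Vertices are junctions, and edges represent the roads or railway lines between them.
Well-known algorithms can operate on these graphs: for example, map navigation apps search for the shortest path between two points in a road network, and PageRank can be used on the web graph to determine the popularity of a web page and thus its ranking in search results [^32].
Graphs can be represented in several different ways. In the *adjacency list* model, each vertex stores the IDs of its neighbor vertices that are one edge away. Alternatively, you can use an *adjacency matrix*, a two-dimensional array where each row and each column corresponds to a vertex, where the value is zero when there is no edge between the row vertex and the column vertex, and where the value is one if there is an edge. The adjacency list is good for graph traversals, and the matrix is good for machine learning (see ["Dataframes, Matrices, and Arrays"](#sec_datamodels_dataframes)).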
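The two representations can be compared side by side on a tiny graph. This Python sketch builds both an adjacency list and an adjacency matrix for an undirected "knows" graph with three invented people; the vertex names and variable names are assumptions for the example.

```python
# A small undirected social graph: alice knows bob, bob knows carol.
edges = [("alice", "bob"), ("bob", "carol")]

vertices = sorted({v for edge in edges for v in edge})
index = {v: i for i, v in enumerate(vertices)}  # vertex name -> row/column

# Adjacency list: each vertex maps to the set of its direct neighbours.
adjacency_list = {v: set() for v in vertices}

# Adjacency matrix: matrix[i][j] is 1 if there is an edge between i and j.
adjacency_matrix = [[0] * len(vertices) for _ in vertices]

for a, b in edges:
    adjacency_list[a].add(b)
    adjacency_list[b].add(a)
    adjacency_matrix[index[a]][index[b]] = 1
    adjacency_matrix[index[b]][index[a]] = 1
```

The list uses space proportional to the number of edges and makes "who are bob's neighbours?" a single lookup, which suits traversals. The matrix uses space proportional to the square of the number of vertices, but its regular numeric layout is what linear-algebra and machine learning libraries expect.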
在剛才給出的示例中,圖中的所有頂點都表示相同型別的事物(分別是人、網頁或道路交叉點)。然而,圖不限於這種 *同質* 資料:圖的一個同樣強大的用途是提供一種一致的方式在單個數據庫中儲存完全不同型別的物件。例如:
* Facebook 維護一個包含許多不同型別頂點和邊的單一圖:頂點表示人員、位置、事件、簽到和使用者發表的評論;邊表示哪些人彼此是朋友、哪個簽到發生在哪個位置、誰評論了哪個帖子、誰參加了哪個事件等等 [^33]。
* 知識圖被搜尋引擎用來記錄搜尋查詢中經常出現的實體(如組織、人員和地點)的事實 [^34]。這些資訊透過爬取和分析網站上的文字獲得;一些網站(如 Wikidata)也以結構化形式釋出圖資料。
在圖中構建和查詢資料有幾種不同但相關的方式。在本節中,我們將討論 *屬性圖* 模型(由 Neo4j、Memgraph、KùzuDB [^35] 和其他 [^36] 實現)和 *三元組儲存* 模型(由 Datomic、AllegroGraph、Blazegraph 和其他實現)。這些模型在它們可以表達的內容方面相當相似,一些圖資料庫(如 Amazon Neptune)支援兩種模型。
我們還將檢視圖的四種查詢語言(Cypher、SPARQL、Datalog 和 GraphQL),以及用於查詢圖的 SQL 支援。還存在其他圖查詢語言,如 Gremlin [^37],但這些將為我們提供代表性的概述。
為了說明這些不同的語言和模型,本節使用 [圖 3-6](#fig_datamodels_graph) 中顯示的圖作為執行示例。它可能取自社交網路或家譜資料庫:它顯示了兩個人,來自愛達荷州的 Lucy 和來自法國聖洛的 Alain。他們已婚並住在倫敦。每個人和每個位置都表示為頂點,它們之間的關係表示為邊。此示例將幫助演示一些在圖資料庫中很容易但在其他模型中很困難的查詢。
{{< figure src="/fig/ddia_0306.png" id="fig_datamodels_graph" caption="圖 3-6. 圖結構資料示例(框表示頂點,箭頭表示邊)。" class="w-full my-4" >}}
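### Property Graphs {#id56}
In the *property graph* (also known as *labeled property graph*) model, each vertex consists of:
* A unique identifier
* A label (string) describing what type of object this vertex represents
* A set of outgoing edges
* A set of incoming edges
* A collection of properties (key-value pairs)
Each edge consists of:
* A unique identifier
* The vertex at which the edge starts (the *tail vertex*)
* The vertex at which the edge ends (the *head vertex*)
* A label describing the kind of relationship between the two vertices
* A collection of properties (key-value pairs)
You can think of a graph store as consisting of two relational tables, one for vertices and one for edges, as shown in [Example 3-3](#fig_graph_sql_schema) (this schema uses the PostgreSQL `jsonb` datatype to store the properties of each vertex or edge). Each edge stores its head and tail vertex; if you want the set of incoming or outgoing edges for a vertex, you can query the `edges` table by `head_vertex` or `tail_vertex`, respectively.
{{< figure id="fig_graph_sql_schema" title="Example 3-3. Representing a property graph using a relational schema" class="w-full my-4" >}}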
```sql
CREATE TABLE vertices (
vertex_id integer PRIMARY KEY,
label text,
properties jsonb
);
CREATE TABLE edges (
edge_id integer PRIMARY KEY,
tail_vertex integer REFERENCES vertices (vertex_id),
head_vertex integer REFERENCES vertices (vertex_id),
label text,
properties jsonb
);
CREATE INDEX edges_tails ON edges (tail_vertex);
CREATE INDEX edges_heads ON edges (head_vertex);
```
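Some important aspects of this model are:
1. Any vertex can have an edge connecting it with any other vertex. There is no schema that restricts which kinds of things can or cannot be associated.
2. Given any vertex, you can efficiently find both its incoming and its outgoing edges, and thus *traverse* the graph (that is, follow a path through a chain of vertices) both forward and backward. (That is why [Example 3-3](#fig_graph_sql_schema) has indexes on both the `tail_vertex` and `head_vertex` columns.)
3. By using different labels for different kinds of vertices and relationships, you can store several different kinds of information in a single graph while still maintaining a clean data model.
The edges table is like the many-to-many associative table (join table) we saw in ["Many-to-One and Many-to-Many Relationships"](#sec_datamodels_many_to_many), generalized to allow many different types of relationship to be stored in the same table. There may also be indexes on the labels and the properties, allowing vertices or edges with certain properties to be found efficiently.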
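As a rough in-memory analogue of the two-table schema, the following Python sketch keeps vertices and edges in dictionaries plus two lookup maps that play the role of the `edges_tails` and `edges_heads` indexes. The class and method names (`PropertyGraph`, `outgoing`, `incoming`) and the IDs are illustrative assumptions, not the API of any graph database.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    vertex_id: int
    label: str
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    edge_id: int
    tail_vertex: int  # vertex at which the edge starts
    head_vertex: int  # vertex at which the edge ends
    label: str
    properties: dict = field(default_factory=dict)


class PropertyGraph:
    def __init__(self):
        self.vertices = {}
        self.edges = {}
        self.out_edges = {}  # tail_vertex -> [edge_id], like the edges_tails index
        self.in_edges = {}   # head_vertex -> [edge_id], like the edges_heads index

    def add_vertex(self, vertex_id, label, **properties):
        self.vertices[vertex_id] = Vertex(vertex_id, label, properties)

    def add_edge(self, edge_id, tail, head, label, **properties):
        self.edges[edge_id] = Edge(edge_id, tail, head, label, properties)
        self.out_edges.setdefault(tail, []).append(edge_id)
        self.in_edges.setdefault(head, []).append(edge_id)

    def outgoing(self, vertex_id, label=None):
        """Head vertices reached by following outgoing edges (forward traversal)."""
        return [self.edges[e].head_vertex for e in self.out_edges.get(vertex_id, [])
                if label is None or self.edges[e].label == label]

    def incoming(self, vertex_id, label=None):
        """Tail vertices found by following incoming edges (backward traversal)."""
        return [self.edges[e].tail_vertex for e in self.in_edges.get(vertex_id, [])
                if label is None or self.edges[e].label == label]


g = PropertyGraph()
g.add_vertex(1, "Location", name="United States", type="country")
g.add_vertex(2, "Location", name="Idaho", type="state")
g.add_vertex(3, "Person", name="Lucy")
g.add_edge(10, 2, 1, "WITHIN")
g.add_edge(11, 3, 2, "BORN_IN")
```

Because both lookup maps are maintained on every insert, a traversal in either direction avoids scanning the whole edge set, which is the same reason the relational schema indexes both columns.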
--------
> [!NOTE]
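> One limitation of graph models is that an edge can only associate two vertices with each other, whereas a relational join table can represent three-way or even higher-degree relationships by having multiple foreign key references on a single row. Such relationships can be represented in a graph by creating an additional vertex for each row of the join table, with edges to and from that vertex, or by using a *hypergraph*.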
--------
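These features give graphs a great deal of flexibility for data modeling, as illustrated in [Figure 3-6](#fig_datamodels_graph). The figure shows a few things that would be difficult to express in a traditional relational schema, such as different kinds of regional structures in different countries (France has *départements* and *régions*, whereas the US has *counties* and *states*), quirks of history such as a country within a country (ignoring for now the intricacies of sovereign states and nations), and varying granularity of data (Lucy's current residence is specified as a city, whereas her place of birth is specified only at the level of a state).
You could imagine extending the graph to include many other facts about Lucy and Alain, or other people. For instance, you could use it to indicate any food allergies they have (by introducing a vertex for each allergen, and an edge between a person and an allergen to indicate an allergy), and link the allergens with a set of vertices that show which foods contain which substances. Then you could write a query to find out what is safe for each person to eat. Graphs are good for evolvability: as you add features to your application, a graph can easily be extended to accommodate changes in the application's data structures.
### The Cypher Query Language {#id57}
*Cypher* is a query language for property graphs, originally created for the Neo4j graph database and later developed into an open standard as *openCypher* [^38]. Besides Neo4j, Cypher is supported by Memgraph, KùzuDB [^35], Amazon Neptune, Apache AGE (with storage in PostgreSQL), and others. It is named after a character in the movie *The Matrix* and is not related to ciphers in cryptography [^39].
[Example 3-4](#fig_cypher_create) shows the Cypher query to insert the left-hand portion of [Figure 3-6](#fig_datamodels_graph) into a graph database. The rest of the graph can be added similarly. Each vertex is given a symbolic name like `usa` or `idaho`. That name is not stored in the database; it is only used within the query to create edges between the vertices, using an arrow notation: `(idaho) -[:WITHIN]-> (usa)` creates an edge labeled `WITHIN`, with `idaho` as the tail node and `usa` as the head node.
{{< figure link="#fig_datamodels_graph" id="fig_cypher_create" title="Example 3-4. A subset of the data in Figure 3-6, represented as a Cypher query" class="w-full my-4" >}}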
```
CREATE
(namerica :Location {name:'North America', type:'continent'}),
(usa :Location {name:'United States', type:'country' }),
(idaho :Location {name:'Idaho', type:'state' }),
(lucy :Person {name:'Lucy' }),
(idaho) -[:WITHIN ]-> (usa) -[:WITHIN]-> (namerica),
(lucy) -[:BORN_IN]-> (idaho)
```
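When all the vertices and edges of [Figure 3-6](#fig_datamodels_graph) have been added to the database, we can start asking interesting questions: for example, *find the names of all the people who emigrated from the United States to Europe*. That is, find all the vertices that have a `BORN_IN` edge to a location within the US, and also a `LIVES_IN` edge to a location within Europe, and return the `name` property of each of those vertices.
[Example 3-5](#fig_cypher_query) shows how to express that query in Cypher. The same arrow notation is used in a `MATCH` clause to find patterns in the graph: `(person) -[:BORN_IN]-> ()` matches any two vertices that are related by an edge labeled `BORN_IN`. The tail vertex of that edge is bound to the variable `person`, and the head vertex is left unnamed.
{{< figure id="fig_cypher_query" title="Example 3-5. Cypher query to find people who emigrated from the US to Europe" class="w-full my-4" >}}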
```
MATCH
(person) -[:BORN_IN]-> () -[:WITHIN*0..]-> (:Location {name:'United States'}),
(person) -[:LIVES_IN]-> () -[:WITHIN*0..]-> (:Location {name:'Europe'})
RETURN person.name
```
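The query can be read as follows:
> Find any vertex (call it `person`) that meets *both* of the following conditions:
>
> 1. `person` has an outgoing `BORN_IN` edge to some vertex. From that vertex, you can follow a chain of outgoing `WITHIN` edges until eventually you reach a vertex of type `Location`, whose `name` property is equal to `"United States"`.
> 2. That same `person` vertex also has an outgoing `LIVES_IN` edge. Following that edge, and then a chain of outgoing `WITHIN` edges, you eventually reach a vertex of type `Location`, whose `name` property is equal to `"Europe"`.
>
> For each such `person` vertex, return the `name` property.
There are several possible ways of executing the query. The description given here suggests that you start by scanning all the people in the database, examine each person's birthplace and residence, and return only those people who meet the criteria.
But equivalently, you could start with the two `Location` vertices and work backward. If there is an index on the `name` property, you can efficiently find the two vertices representing the US and Europe. Then you can find all locations (states, regions, cities, etc.) in the US and Europe respectively by following all incoming `WITHIN` edges. Finally, you can look for people who can be found through an incoming `BORN_IN` or `LIVES_IN` edge at one of the location vertices.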
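The backward strategy can be sketched in plain Python. The edge list below is a hypothetical, simplified version of the example graph (the vertex names and the exact location hierarchy are assumptions made for illustration), and the linear scans stand in for index lookups that a real database would use.

```python
# Hypothetical edges in (tail, label, head) form, loosely based on the running example.
EDGES = [
    ("idaho", "WITHIN", "usa"), ("usa", "WITHIN", "namerica"),
    ("london", "WITHIN", "england"), ("england", "WITHIN", "uk"),
    ("uk", "WITHIN", "europe"),
    ("saint-lo", "WITHIN", "normandie"), ("normandie", "WITHIN", "france"),
    ("france", "WITHIN", "europe"),
    ("lucy", "BORN_IN", "idaho"), ("lucy", "LIVES_IN", "london"),
    ("alain", "BORN_IN", "saint-lo"), ("alain", "LIVES_IN", "london"),
]


def locations_within(root):
    """The root location plus everything found by following WITHIN edges backward."""
    found, frontier = {root}, [root]
    while frontier:
        current = frontier.pop()
        for tail, label, head in EDGES:
            if label == "WITHIN" and head == current and tail not in found:
                found.add(tail)
                frontier.append(tail)
    return found


def people_with_edge_into(label, locations):
    """Tail vertices of edges with the given label that point into the location set."""
    return {tail for tail, edge_label, head in EDGES
            if edge_label == label and head in locations}


def emigrated(born_root, lives_root):
    born = people_with_edge_into("BORN_IN", locations_within(born_root))
    lives = people_with_edge_into("LIVES_IN", locations_within(lives_root))
    return born & lives  # people satisfying both conditions
```

With this toy data, `emigrated("usa", "europe")` yields only `lucy`. A query optimizer is free to choose between this plan and the forward scan over all people, since both produce the same result.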
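### Graph Queries in SQL {#id58}
[Example 3-3](#fig_graph_sql_schema) suggested that graph data can be represented in a relational database. But if we put graph data in a relational structure, can we also query it using SQL?
The answer is yes, but with some difficulty. Every edge that you traverse in a graph query is effectively a join with the `edges` table. In a relational database, you usually know in advance which joins you need in your query. In a graph query, on the other hand, you may need to traverse a variable number of edges before you find the vertex you are looking for; that is, the number of joins is not fixed in advance.
In our example, that happens in the `() -[:WITHIN*0..]-> ()` pattern in the Cypher query. A person's `LIVES_IN` edge may point at any kind of location: a street, a city, a district, a region, a state, and so on. A city may be `WITHIN` a region, a region `WITHIN` a state, a state `WITHIN` a country, and so on. The `LIVES_IN` edge may point directly at the location vertex you are looking for, or it may be several levels away in the location hierarchy.
In Cypher, `:WITHIN*0..` expresses that fact very concisely: it means "follow a `WITHIN` edge, zero or more times." It is like the `*` operator in a regular expression.
Since SQL:1999, this idea of variable-length traversal paths in a query can be expressed using *recursive common table expressions* (the `WITH RECURSIVE` syntax). [Example 3-6](#fig_graph_sql_query) shows the same query (finding the names of people who emigrated from the US to Europe) expressed in SQL using this technique. However, the syntax is very clumsy in comparison to Cypher.
{{< figure link="#fig_cypher_query" id="fig_graph_sql_query" title="Example 3-6. The same query as Example 3-5, written in SQL using recursive common table expressions" class="w-full my-4" >}}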
```sql
WITH RECURSIVE
  -- in_usa is the set of vertex IDs of all locations within the United States
  in_usa(vertex_id) AS (
      SELECT vertex_id FROM vertices
        WHERE label = 'Location' AND properties->>'name' = 'United States'  -- ❶
    UNION
      SELECT edges.tail_vertex FROM edges  -- ❷
        JOIN in_usa ON edges.head_vertex = in_usa.vertex_id
        WHERE edges.label = 'within'
  ),
  -- in_europe is the set of vertex IDs of all locations within Europe
  in_europe(vertex_id) AS (
      SELECT vertex_id FROM vertices
        WHERE label = 'Location' AND properties->>'name' = 'Europe'  -- ❸
    UNION
      SELECT edges.tail_vertex FROM edges
        JOIN in_europe ON edges.head_vertex = in_europe.vertex_id
        WHERE edges.label = 'within'
  ),
  -- born_in_usa is the set of vertex IDs of all people born in the US
  born_in_usa(vertex_id) AS (  -- ❹
    SELECT edges.tail_vertex FROM edges
      JOIN in_usa ON edges.head_vertex = in_usa.vertex_id
      WHERE edges.label = 'born_in'
  ),
  -- lives_in_europe is the set of vertex IDs of all people living in Europe
  lives_in_europe(vertex_id) AS (  -- ❺
    SELECT edges.tail_vertex FROM edges
      JOIN in_europe ON edges.head_vertex = in_europe.vertex_id
      WHERE edges.label = 'lives_in'
  )
SELECT vertices.properties->>'name'
FROM vertices
-- join to find those people who were both born in the US *and* live in Europe
JOIN born_in_usa     ON vertices.vertex_id = born_in_usa.vertex_id  -- ❻
JOIN lives_in_europe ON vertices.vertex_id = lives_in_europe.vertex_id;
```
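❶: First find the vertex whose `name` property has the value `"United States"`, and make it the first element of the set of vertices `in_usa`.
❷: Follow all incoming `within` edges from vertices in the set `in_usa`, and add them to the same set, until all incoming `within` edges have been visited.
❸: Do the same starting with the vertex whose `name` property has the value `"Europe"`, and build up the set of vertices `in_europe`.
❹: For each of the vertices in the set `in_usa`, follow incoming `born_in` edges to find people who were born in some place within the United States.
❺: Similarly, for each of the vertices in the set `in_europe`, follow incoming `lives_in` edges to find people who live in Europe.
❻: Finally, intersect the set of people born in the US with the set of people living in Europe, by joining them.
The fact that a 4-line Cypher query requires more than 30 lines in SQL shows how much of a difference the right choice of data model and query language can make. And this is just the beginning; there are more details to consider, such as handling cycles and choosing between breadth-first and depth-first traversal [^40].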
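To experiment with the recursive part in isolation, here is a small runnable sketch using Python's built-in `sqlite3` module. It is a simplification under stated assumptions: the tables use a plain `name` column instead of a `jsonb` properties column, the sample rows are hypothetical, and the helper `names_within` is invented for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE vertices (vertex_id INTEGER PRIMARY KEY, label TEXT, name TEXT);
    CREATE TABLE edges (tail_vertex INTEGER, head_vertex INTEGER, label TEXT);
    INSERT INTO vertices VALUES
        (1, 'Location', 'United States'), (2, 'Location', 'Idaho'),
        (3, 'Location', 'Europe'),        (4, 'Location', 'London'),
        (5, 'Person',   'Lucy');
    INSERT INTO edges VALUES
        (2, 1, 'within'), (4, 3, 'within'),
        (5, 2, 'born_in'), (5, 4, 'lives_in');
""")


def names_within(conn, root_name, edge_label):
    """Names of vertices with an `edge_label` edge into root_name or anything within it."""
    rows = conn.execute("""
        WITH RECURSIVE inside(vertex_id) AS (
            SELECT vertex_id FROM vertices
             WHERE label = 'Location' AND name = ?
            UNION  -- UNION (not UNION ALL) drops repeated rows, so cycles terminate
            SELECT edges.tail_vertex FROM edges
              JOIN inside ON edges.head_vertex = inside.vertex_id
             WHERE edges.label = 'within'
        )
        SELECT vertices.name FROM vertices
          JOIN edges  ON edges.tail_vertex = vertices.vertex_id
          JOIN inside ON edges.head_vertex = inside.vertex_id
         WHERE edges.label = ?
         ORDER BY vertices.name
    """, (root_name, edge_label))
    return [name for (name,) in rows]
```

Using `UNION` rather than `UNION ALL` is one simple way to cope with cycles: a vertex that has already been added to `inside` is not added again, so the recursion stops once no new vertices are found.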
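Oracle has a different SQL extension for recursive queries, which it calls *hierarchical queries* [^41].
However, the situation may be improving: at the time of writing, there are plans to add a graph query language called GQL to the SQL standard [^42] [^43], which will provide a syntax inspired by Cypher, GSQL [^44], and PGQL [^45].
### Triple-Stores and SPARQL {#id59}
The triple-store model is mostly equivalent to the property graph model, using different words to describe the same ideas. It is nevertheless worth discussing, because there are various tools and languages for triple-stores that can be valuable additions to your toolbox for building applications.
In a triple-store, all information is stored in the form of very simple three-part statements: (*subject*, *predicate*, *object*). For example, in the triple (*Jim*, *likes*, *bananas*), *Jim* is the subject, *likes* is the predicate (verb), and *bananas* is the object.
The subject of a triple is equivalent to a vertex in a graph. The object is one of two things:
1. A value of a primitive datatype, such as a string or a number. In that case, the predicate and object of the triple are equivalent to the key and value of a property on the subject vertex. Using the example from [Figure 3-6](#fig_datamodels_graph), (*lucy*, *birthYear*, *1989*) is like a vertex `lucy` with properties `{"birthYear": 1989}`.
2. Another vertex in the graph. In that case, the predicate is an edge in the graph, the subject is the tail vertex, and the object is the head vertex. For example, in (*lucy*, *marriedTo*, *alain*) the subject and object *lucy* and *alain* are both vertices, and the predicate *marriedTo* is the label of the edge that connects them.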
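The correspondence between the two cases and a property graph can be sketched in Python. The `Ref` wrapper is a hypothetical device used here to mark an object that names another vertex rather than a primitive value; real triple-stores make this distinction through their own type system.

```python
from typing import NamedTuple


class Ref(NamedTuple):
    """Marks an object that refers to another vertex rather than a plain value."""
    name: str


triples = [
    ("lucy", "birthYear", 1989),          # case 1: the object is a primitive value
    ("lucy", "marriedTo", Ref("alain")),  # case 2: the object is another vertex
    ("alain", "name", "Alain"),
]


def to_property_graph(triples):
    """Split triples into per-vertex properties and labeled edges."""
    properties = {}  # vertex -> {key: value}
    edges = []       # (tail vertex, label, head vertex)
    for subject, predicate, obj in triples:
        properties.setdefault(subject, {})
        if isinstance(obj, Ref):
            properties.setdefault(obj.name, {})
            edges.append((subject, predicate, obj.name))
        else:
            properties[subject][predicate] = obj
    return properties, edges
```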
> [!NOTE]
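> To be precise, databases that offer a triple-like data model often need to store some additional metadata on each tuple. For example, AWS Neptune uses quads (4-tuples) by adding a graph ID to each triple [^46]; Datomic uses 5-tuples, extending each triple with a transaction ID and a boolean to indicate deletion [^47]. Since these databases retain the basic *subject-predicate-object* structure explained above, this book nevertheless calls them triple-stores.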
[Example 3-7](#fig_graph_n3_triples) shows the same data as in [Example 3-4](#fig_cypher_create), written as triples in a format called *Turtle*, a subset of *Notation3* (*N3*) [^48].
{{< figure link="#fig_datamodels_graph" id="fig_graph_n3_triples" title="Example 3-7. A subset of the data in Figure 3-6, represented as Turtle triples" class="w-full my-4" >}}
```
@prefix : <urn:example:>.
_:lucy a :Person.
_:lucy :name "Lucy".
_:lucy :bornIn _:idaho.
_:idaho a :Location.
_:idaho :name "Idaho".
_:idaho :type "state".
_:idaho :within _:usa.
_:usa a :Location.
_:usa :name "United States".
_:usa :type "country".
_:usa :within _:namerica.
_:namerica a :Location.
_:namerica :name "North America".
_:namerica :type "continent".
```
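In this example, vertices of the graph are written as `_:someName`. The name does not mean anything outside of this file; it exists only because we otherwise would not know which triples refer to the same vertex. When the predicate represents an edge, the object is a vertex, as in `_:idaho :within _:usa`. When the predicate is a property, the object is a string literal, as in `_:usa :name "United States"`.
It is quite repetitive to repeat the same subject over and over again, but fortunately you can use semicolons to say multiple things about the same subject. This makes the Turtle format quite readable: see [Example 3-8](#fig_graph_n3_shorthand).
{{< figure link="#fig_graph_n3_triples" id="fig_graph_n3_shorthand" title="Example 3-8. A more concise way of writing the data in Example 3-7" class="w-full my-4" >}}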
```
@prefix : <urn:example:>.
_:lucy a :Person; :name "Lucy"; :bornIn _:idaho.
_:idaho a :Location; :name "Idaho"; :type "state"; :within _:usa.
_:usa a :Location; :name "United States"; :type "country"; :within _:namerica.
_:namerica a :Location; :name "North America"; :type "continent".
```
--------
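> [!TIP] The Semantic Web
> Some of the research and development effort on triple-stores was motivated by the *Semantic Web*, an early-2000s effort to facilitate internet-wide data exchange by publishing data not only as human-readable web pages but also in a standardized, machine-readable format. Although the Semantic Web as originally envisioned did not succeed [^49] [^50], the legacy of the project lives on in several specific technologies: *linked data* standards such as JSON-LD [^51], *ontologies* used in biomedical science [^52], Facebook's Open Graph protocol [^53] (which is used for link unfurling [^54]), knowledge graphs such as Wikidata, and standardized vocabularies for structured data maintained by [`schema.org`](https://schema.org/).
>
> Triple-stores are another Semantic Web technology that has found use outside of its original use case: even if you have no interest in the Semantic Web, triples can be a good internal data model for applications.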
--------
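#### The RDF data model {#the-rdf-data-model}
The Turtle language we used in [Example 3-8](#fig_graph_n3_shorthand) is actually a way of encoding data in the *Resource Description Framework* (RDF) [^55], a data model that was designed for the Semantic Web. RDF data can also be encoded in other ways, for example (more verbosely) in XML, as shown in [Example 3-9](#fig_graph_rdf_xml). Tools like Apache Jena can automatically convert between different RDF encodings.
{{< figure link="#fig_graph_n3_shorthand" id="fig_graph_rdf_xml" title="Example 3-9. The data of Example 3-8, expressed using RDF/XML syntax" class="w-full my-4" >}}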
```xml
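<rdf:RDF xmlns="urn:example:"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <Location rdf:nodeID="idaho">
    <name>Idaho</name>
    <type>state</type>
    <within>
      <Location rdf:nodeID="usa">
        <name>United States</name>
        <type>country</type>
        <within>
          <Location rdf:nodeID="namerica">
            <name>North America</name>
            <type>continent</type>
          </Location>
        </within>
      </Location>
    </within>
  </Location>
  <Person rdf:nodeID="lucy">
    <name>Lucy</name>
    <bornIn rdf:nodeID="idaho"/>
  </Person>
</rdf:RDF>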
```
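RDF has a few quirks due to the fact that it was designed for internet-wide data exchange. The subject, predicate, and object of a triple are often URIs. For example, a predicate might be a full URI in a namespace that you control, rather than just `WITHIN` or `LIVES_IN`. The reasoning behind this design is that you should be able to combine your data with someone else's data, and if they attach a different meaning to the word `within` or `lives_in`, you will not get a conflict, because their predicates are actually URIs in a different namespace.
The namespace URL does not necessarily need to resolve to anything; from RDF's point of view, it is simply a namespace. To avoid potential confusion with `http://` URLs, the examples in this section use non-resolvable URIs such as `urn:example:within`. Fortunately, you can specify this prefix once at the top of the file and then forget about it.
#### The SPARQL query language {#the-sparql-query-language}
*SPARQL* is a query language for triple-stores using the RDF data model [^56]. (It is an acronym for *SPARQL Protocol and RDF Query Language*, pronounced "sparkle.") It predates Cypher, and since Cypher's pattern matching is borrowed from SPARQL, the two look quite similar.
The same query as before (finding people who have moved from the US to Europe) is similarly concise in SPARQL as it is in Cypher (see [Example 3-10](#fig_sparql_query)).
{{< figure id="fig_sparql_query" title="Example 3-10. The same query as [Example 3-5](#fig_cypher_query), expressed in SPARQL" class="w-full my-4" >}}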
```
PREFIX : <urn:example:>
SELECT ?personName WHERE {
?person :name ?personName.
?person :bornIn / :within* / :name "United States".
?person :livesIn / :within* / :name "Europe".
}
```
The structure is very similar. The following two expressions are equivalent (variables start with a question mark in SPARQL):
```
(person) -[:BORN_IN]-> () -[:WITHIN*0..]-> (location) # Cypher
?person :bornIn / :within* ?location. # SPARQL
```
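Because RDF does not distinguish between properties and edges but just uses predicates for both, you can use the same syntax for matching properties. In the following expressions, the variable `usa` is bound to any vertex that has a `name` property whose value is the string `"United States"`: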
```
(usa {name:'United States'}) # Cypher
?usa :name "United States". # SPARQL
```
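To make the meaning of a path such as `:bornIn / :within* / :name` concrete, here is a small Python sketch that evaluates it over an in-memory set of triples. It is only an illustration of the semantics: the data mirrors the Turtle example in simplified form (plain strings instead of blank nodes and URIs), and the helper names are assumptions for this example.

```python
# Simplified triples: (subject, predicate, object), without URI prefixes.
TRIPLES = {
    ("lucy", "name", "Lucy"), ("lucy", "bornIn", "idaho"),
    ("idaho", "name", "Idaho"), ("idaho", "within", "usa"),
    ("usa", "name", "United States"), ("usa", "within", "namerica"),
    ("namerica", "name", "North America"),
}


def objects(subject, predicate):
    """All objects o such that (subject, predicate, o) is in the store."""
    return {o for s, p, o in TRIPLES if s == subject and p == predicate}


def zero_or_more(start, predicate):
    """Everything reachable from start via predicate*, including start itself."""
    reached, frontier = {start}, [start]
    while frontier:
        for nxt in objects(frontier.pop(), predicate):
            if nxt not in reached:
                reached.add(nxt)
                frontier.append(nxt)
    return reached


def born_within(person, place_name):
    """Evaluate the path  person :bornIn / :within* / :name place_name."""
    return any(
        place_name in objects(place, "name")
        for birthplace in objects(person, "bornIn")
        for place in zero_or_more(birthplace, "within")
    )
```

Note that the `name` step at the end of the path uses exactly the same lookup as the `bornIn` and `within` steps, which reflects the fact that RDF treats properties and edges uniformly as predicates.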
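SPARQL is supported by Amazon Neptune, AllegroGraph, Blazegraph, OpenLink Virtuoso, Apache Jena, and various other triple-stores [^36].
### Datalog: Recursive Relational Queries {#id62}
Datalog is a much older language than SPARQL or Cypher: it arose from academic research in the 1980s [^57] [^58] [^59]. It is less well known among software engineers and not widely supported in mainstream databases, but it ought to be better known, since it is a very expressive language that is particularly powerful for complex queries. Several niche databases, including Datomic, LogicBlox, CozoDB, and LinkedIn's LIquid [^60], use Datalog as their query language.
Datalog is actually based on a relational data model, not a graph, but it appears in the graph databases section of this book because recursive queries on graphs are a particular strength of Datalog.
The contents of a Datalog database consist of *facts*, and each fact corresponds to a row in a relational table. For example, say we have a table *location* containing locations, with three columns: *ID*, *name*, and *type*. The fact that the US is a country could then be written as `location(2, "United States", "country")`, where `2` is the ID of the US. In general, the statement `table(val1, val2, …)` means that `table` contains a row where the first column contains `val1`, the second column contains `val2`, and so on.
[Example 3-11](#fig_datalog_triples) shows how to write the data from the left-hand side of [Figure 3-6](#fig_datamodels_graph) in Datalog. The edges of the graph (`within`, `born_in`, and `lives_in`) are represented as two-column join tables. For example, Lucy has the ID 100 and Idaho has the ID 3, so the relationship "Lucy was born in Idaho" is represented as `born_in(100, 3)`.
{{< figure id="fig_datalog_triples" title="Example 3-11. A subset of the data in [Figure 3-6](#fig_datamodels_graph), represented as Datalog facts" class="w-full my-4" >}}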
```
location(1, "North America", "continent").
location(2, "United States", "country").
location(3, "Idaho", "state").
within(2, 1). /* US is in North America */
within(3, 2). /* Idaho is in the US */
person(100, "Lucy").
born_in(100, 3). /* Lucy was born in Idaho */
```
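Now that we have defined the data, we can write the same query as before, as shown in [Example 3-12](#fig_datalog_query). It looks a bit different from the equivalent in Cypher or SPARQL, but don't let that put you off. Datalog is a subset of Prolog, a programming language that you might have seen before if you have studied computer science.
{{< figure id="fig_datalog_query" title="Example 3-12. The same query as [Example 3-5](#fig_cypher_query), expressed in Datalog" class="w-full my-4" >}}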
```sql
within_recursive(LocID, PlaceName) :- location(LocID, PlaceName, _). /* Rule 1 */
within_recursive(LocID, PlaceName) :- within(LocID, ViaID), /* Rule 2 */
    within_recursive(ViaID, PlaceName).
migrated(PName, BornIn, LivingIn) :- person(PersonID, PName), /* Rule 3 */
    born_in(PersonID, BornID),
    within_recursive(BornID, BornIn),
    lives_in(PersonID, LivingID),
    within_recursive(LivingID, LivingIn).
us_to_europe(Person) :- migrated(Person, "United States", "Europe"). /* Rule 4 */
/* us_to_europe contains the row "Lucy". */
```
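Cypher and SPARQL jump in right away with `MATCH` or `SELECT`, but Datalog takes a small step at a time. We define *rules* that derive new virtual tables from the underlying facts. These derived tables are like (virtual) SQL views: they are not stored in the database, but you can query them in the same way as a table containing stored facts.
In [Example 3-12](#fig_datalog_query) we define three derived tables: `within_recursive`, `migrated`, and `us_to_europe`. The name and columns of a virtual table are defined by what appears before the `:-` symbol of each rule. For example, `migrated(PName, BornIn, LivingIn)` is a virtual table with three columns: the name of a person, the name of the place where they were born, and the name of the place where they are living.
The content of a virtual table is defined by the part of the rule after the `:-` symbol, where we try to find rows that match a certain pattern in the tables. For example, `person(PersonID, PName)` matches the row `person(100, "Lucy")`, with the variable `PersonID` bound to the value `100` and the variable `PName` bound to the value `"Lucy"`. A rule applies if the system can find a match for *all* patterns on the right-hand side of the `:-` operator. When the rule applies, it is as though the left-hand side of the `:-` was added to the database (with variables replaced by the values they matched).
One possible way of applying the rules is thus (as illustrated in [Figure 3-7](#fig_datalog_naive)):
1. `location(1, "North America", "continent")` exists in the database, so rule 1 applies. It generates `within_recursive(1, "North America")`.
2. `within(2, 1)` exists in the database and the previous step generated `within_recursive(1, "North America")`, so rule 2 applies. It generates `within_recursive(2, "North America")`.
3. `within(3, 2)` exists in the database and the previous step generated `within_recursive(2, "North America")`, so rule 2 applies. It generates `within_recursive(3, "North America")`.
By repeated application of rules 1 and 2, the `within_recursive` virtual table can tell us all the locations in North America (or any other location) contained in our database.
{{< figure link="#fig_datalog_query" src="/fig/ddia_0307.png" id="fig_datalog_naive" title="Figure 3-7. Determining that Idaho is in North America, using the Datalog rules from Example 3-12." class="w-full my-4" >}}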
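Now rule 3 can find people who were born in some location `BornIn` and live in some location `LivingIn`. Rule 4 invokes rule 3 with `BornIn = 'United States'` and `LivingIn = 'Europe'`, and returns only the names of the people who match the search. By querying the contents of the virtual `us_to_europe` table, the Datalog system finally gets the same answer as the earlier Cypher and SPARQL queries.
The Datalog approach requires a different kind of thinking compared to the other query languages discussed in this chapter. It allows complex queries to be built up rule by rule, with one rule referring to other rules, similarly to the way you break down code into functions that call each other. Just as functions can be recursive, Datalog rules can also invoke themselves, like rule 2 in [Example 3-12](#fig_datalog_query), which is what enables graph traversals in Datalog queries.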
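The repeated application of rules 1 and 2 can be imitated in Python with a simple loop that keeps deriving rows until nothing new appears (often called naive bottom-up evaluation). This is only a sketch of the idea, not how any particular Datalog engine is implemented; the facts are copied from the Datalog example and stored as Python sets of tuples.

```python
# Facts from the Datalog example, as sets of tuples.
location = {
    (1, "North America", "continent"),
    (2, "United States", "country"),
    (3, "Idaho", "state"),
}
within = {(2, 1), (3, 2)}


def within_recursive(location, within):
    """Derive all (LocID, PlaceName) rows implied by rules 1 and 2."""
    # Rule 1: every location is (trivially) within itself.
    derived = {(loc_id, name) for loc_id, name, _ in location}
    while True:
        # Rule 2: if LocID is within ViaID, and ViaID is within PlaceName,
        # then LocID is within PlaceName.
        new = {
            (loc_id, name)
            for loc_id, via_id in within
            for derived_id, name in derived
            if derived_id == via_id
        }
        if new <= derived:  # fixed point reached: no new rows were derived
            return derived
        derived |= new


rows = within_recursive(location, within)
```

The loop stops once an iteration produces no rows that were not already derived, at which point `rows` contains, among others, the row stating that location 3 (Idaho) is within North America.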
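### GraphQL {#id63}
GraphQL is a query language that, by design, is much more restrictive than the other query languages we have seen in this chapter. The purpose of GraphQL is to allow client software running on a user's device (such as a mobile app or a JavaScript web app frontend) to request a JSON document with a particular structure, containing the fields necessary for rendering its user interface. GraphQL interfaces allow developers to rapidly change queries in client code without changing server-side APIs.
GraphQL's flexibility comes at a cost. Organizations that adopt GraphQL often need tooling to convert GraphQL queries into requests to internal services, which often use REST or gRPC (see [Chapter 5](/tw/ch5#ch_encoding)). Authorization, rate limiting, and performance challenges are additional concerns [^61]. GraphQL's query language is also limited, since GraphQL queries come from an untrusted source. The language does not allow anything that could be expensive to execute, since otherwise users could perform denial-of-service attacks on a server by running lots of expensive queries. In particular, GraphQL does not allow recursive queries (unlike Cypher, SPARQL, SQL, or Datalog), and it does not allow arbitrary search conditions such as "find people who were born in the US and are now living in Europe" (unless the service owners specifically choose to offer such search functionality).
Nevertheless, GraphQL is useful. [Example 3-13](#fig_graphql_query) shows how you might implement a group chat application such as Discord or Slack using GraphQL. The query requests all the channels that the user has access to, including the channel name and the 50 most recent messages in each channel. For each message it requests the timestamp, the message content, and the name and profile picture URL of the sender of the message. Moreover, if a message is a reply to another message, the query also requests the sender name and the content of the message it is replying to (which might be rendered in a smaller font above the reply, in order to provide some context).
{{< figure id="fig_graphql_query" title="Example 3-13. Example GraphQL query for a group chat application" class="w-full my-4" >}}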
```
query ChatApp {
channels {
name
recentMessages(latest: 50) {
timestamp
content
sender {
fullName
imageUrl
}
replyTo {
content
sender {
fullName
}
}
}
}
}
```
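[Example 3-14](#fig_graphql_response) shows what a response to the query in [Example 3-13](#fig_graphql_query) might look like. The response is a JSON document that mirrors the structure of the query: it contains exactly those attributes that were requested, no more and no less. This approach has the advantage that the server does not need to know which attributes the client requires in order to render the user interface; instead, the client can simply request what it needs. For example, this query does not request a profile picture URL for the sender of the `replyTo` message, but if the user interface were changed to add that profile picture, it would be easy for the client to add the required `imageUrl` attribute to the query without changing the server.
{{< figure link="#fig_graphql_query" id="fig_graphql_response" title="Example 3-14. A possible response to the query in Example 3-13" class="w-full my-4" >}}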
```json
{
"data": {
"channels": [
{
"name": "#general",
"recentMessages": [
{
"timestamp": 1693143014,
"content": "Hey! How are y'all doing?",
"sender": {"fullName": "Aaliyah", "imageUrl": "https://..."},
"replyTo": null
},
{
"timestamp": 1693143024,
"content": "Great! And you?",
"sender": {"fullName": "Caleb", "imageUrl": "https://..."},
"replyTo": {
"content": "Hey! How are y'all doing?",
"sender": {"fullName": "Aaliyah"}
}
},
...
```
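In [Example 3-14](#fig_graphql_response) the name and image URL of a message sender are embedded directly in the message object. If the same user sends multiple messages, this information is repeated on each message. In principle it would be possible to reduce this duplication, but GraphQL makes the design choice to accept a larger response size in order to make it simpler to render the user interface based on the data.
The `replyTo` field is similar: in [Example 3-14](#fig_graphql_response), the second message is a reply to the first, and the content ("Hey!…") and sender Aaliyah are duplicated under `replyTo`. It would be possible to instead return the ID of the message being replied to, but then the client would have to make an additional request to the server if that ID is not among the 50 most recent messages returned. Duplicating the content makes it much simpler to work with the data.
The server's database can store the data in a more normalized form, and perform the necessary joins to process a query. For example, the server might store a message along with the user ID of the sender and the ID of the message it is replying to; when it receives a query like the one above, the server would then resolve those IDs to find the records they refer to. However, the client can only ask the server to perform joins that are explicitly offered in the GraphQL schema.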
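The server-side join described here can be sketched in Python. The storage layout, the field names `sender_id` and `reply_to_id`, the resolver functions, and the placeholder image URLs are all hypothetical; a real GraphQL server would use a GraphQL library and a database rather than in-memory dictionaries.

```python
# Hypothetical normalized storage: messages refer to users and other messages by ID.
USERS = {
    1: {"fullName": "Aaliyah", "imageUrl": "https://example.invalid/aaliyah.png"},
    2: {"fullName": "Caleb", "imageUrl": "https://example.invalid/caleb.png"},
}
MESSAGES = {
    10: {"timestamp": 1693143014, "content": "Hey! How are y'all doing?",
         "sender_id": 1, "reply_to_id": None},
    11: {"timestamp": 1693143024, "content": "Great! And you?",
         "sender_id": 2, "reply_to_id": 10},
}


def resolve_sender(user_id, fields):
    """Look up a user by ID and return only the requested fields."""
    user = USERS[user_id]
    return {name: user[name] for name in fields}


def resolve_message(message_id):
    """Build the nested, denormalized shape that the chat query asks for."""
    message = MESSAGES[message_id]
    reply = MESSAGES.get(message["reply_to_id"])  # None if this is not a reply
    return {
        "timestamp": message["timestamp"],
        "content": message["content"],
        "sender": resolve_sender(message["sender_id"], ["fullName", "imageUrl"]),
        "replyTo": None if reply is None else {
            "content": reply["content"],
            "sender": resolve_sender(reply["sender_id"], ["fullName"]),
        },
    }
```

The stored data mentions each user once, while the response repeats the sender details wherever they are needed, which is the trade-off described above.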
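Even though the response to a GraphQL query looks similar to a response from a document database, and even though it has "graph" in the name, GraphQL can be implemented on top of any type of database: relational, document, or graph.
## Event Sourcing and CQRS {#sec_datamodels_events}
In all the data models we have discussed so far, the data is queried in the same form as it is written, be it JSON documents, rows in tables, or vertices and edges in a graph. However, in complex applications it can sometimes be difficult to find a single data representation that is able to satisfy all the different ways that the data needs to be queried and presented. In such situations, it can be beneficial to write data in one form, and then to derive from it several representations that are optimized for different types of reads.
We previously saw this idea in ["Systems of Record and Derived Data"](/tw/ch1#sec_introduction_derived), and ETL (see ["Data Warehousing"](/tw/ch1#sec_introduction_dwh)) is one example of such a derivation process. Now we will take the idea further. If we are going to derive one data representation from another anyway, we can choose different representations that are optimized for writing and for reading, respectively. How would you model your data if you only wanted to optimize it for writing, and efficient queries were of no concern?
Perhaps the simplest, fastest, and most expressive way of writing data is an *event log*: every time you want to write some data, you encode it as a self-contained string (perhaps as JSON), including a timestamp, and then append it to a sequence of events. Events in this log are *immutable*: you never change or delete them, you only ever append more events to the log (which may supersede earlier events). An event can contain arbitrary properties.
[Figure 3-8](#fig_event_sourcing) shows an example that could be taken from a conference management system. A conference can be a complex business domain: not only can individual attendees register and pay by card, but companies can also order seats in bulk, pay by invoice, and then later assign the seats to individual people. Some number of seats may be reserved for speakers, sponsors, volunteer helpers, and so on. Reservations may also be cancelled, and meanwhile the conference organizer might change the capacity of the event by moving it to a different room. With all of this going on, simply calculating the number of available seats becomes a challenging query.
{{< figure src="/fig/ddia_0308.png" id="fig_event_sourcing" title="Figure 3-8. Using a log of immutable events as source of truth, and deriving materialized views from it." class="w-full my-4" >}}
In [Figure 3-8](#fig_event_sourcing), every change to the state of the conference (such as the organizer opening registrations, or attendees making and cancelling registrations) is first stored as an event. Whenever an event is appended to the log, several *materialized views* (also known as *projections* or *read models*) are also updated to reflect the effect of that event. In the conference example, there might be one materialized view that collects all information related to the status of each booking, another that computes charts for the conference organizer's dashboard, and a third that generates files for the printer that produces the attendees' badges.
The idea of using events as the source of truth, and expressing every state change as an event, is known as *event sourcing* [^62] [^63]. The principle of maintaining separate read-optimized representations and deriving them from the write-optimized representation is called *command query responsibility segregation (CQRS)* [^64]. These terms originated in the domain-driven design (DDD) community, although similar ideas have been around for a long time, for example *state machine replication* (see ["Using Shared Logs"](/tw/ch10#sec_consistency_smr)).
When a request from a user comes in, it is called a *command*, and it first needs to be validated. Only once the command has been executed and determined to be valid (for example, there were enough available seats for a requested reservation) does it become a fact, and the corresponding event is added to the log. Consequently, the event log should contain only valid events, and a consumer of the event log that builds a materialized view is not allowed to reject an event.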
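The distinction between commands and events can be sketched in Python. The event type names (`RegistrationOpened`, `SeatsBooked`, `BookingCancelled`), the dictionary layout, and the helper functions are assumptions made for this illustration; the point is only that validation happens before an event is appended, and that the view is a pure function of the log.

```python
class CommandRejected(Exception):
    """Raised when a command fails validation; nothing is appended to the log."""


def apply(seats_available, event):
    """Fold one event into the 'seats available' view (a pure function)."""
    if event["type"] == "RegistrationOpened":
        return event["capacity"]
    if event["type"] == "SeatsBooked":
        return seats_available - event["seats"]
    if event["type"] == "BookingCancelled":
        return seats_available + event["seats"]
    return seats_available  # event types this view does not care about


def replay(log):
    """Derive the view from scratch by processing all events in log order."""
    seats = 0
    for event in log:
        seats = apply(seats, event)
    return seats


def book_seats(log, attendee, seats):
    """Command handler: validate first, then record the fact as an event."""
    if seats > replay(log):
        raise CommandRejected(f"only {replay(log)} seats left")
    log.append({"type": "SeatsBooked", "attendee": attendee, "seats": seats})
```

Because the view consumer (`apply`) never rejects anything, all the checking has to happen in the command handler. Recomputing the view with `replay` on every command is only for brevity; a real system would keep the view up to date incrementally.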
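When modeling your data in an event sourcing style, it is recommended that you name your events in the past tense (for example, "the seats were booked"), because an event is a record of the fact that something has happened in the past. Even if the user later decides to change or cancel, the fact remains true that they formerly held a booking, and the change or cancellation is a separate event that is added later.
A similarity between event sourcing and a star schema fact table (as discussed in ["Stars and Snowflakes: Schemas for Analytics"](#sec_datamodels_analytics)) is that both are collections of events that happened in the past. However, rows in a fact table all have the same set of columns, whereas in event sourcing there may be many different event types, each with different properties. Moreover, a fact table is an unordered collection, while in event sourcing the order of events is important: if a booking is first made and then cancelled, processing those events in the wrong order would not make sense.
Event sourcing and CQRS have several advantages: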
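* For the people developing the system, events better communicate the intent of *why* something happened. For example, it is easier to understand the event "the booking was cancelled" than "the `active` column on row 4001 of the `bookings` table was set to `false`, three rows associated with that booking were deleted from the `seat_assignments` table, and a row representing the refund was inserted into the `payments` table." Those row modifications may still happen when a materialized view processes the cancellation event, but when they are driven by an event, the reason for the updates becomes much clearer.
* A key principle of event sourcing is that the materialized views are derived from the event log in a reproducible way: you should always be able to delete the materialized views and recompute them by processing the same events in the same order, using the same code. If there was a bug in the view maintenance code, you can just delete the view and recompute it with the new code. It is also easier to find the bug, because you can rerun the view maintenance code as often as you like and inspect its behavior.
* You can have multiple materialized views that are optimized for the particular queries that your application requires. They can be stored either in the same database as the events or in a different one, depending on your needs. They can use any data model, and they can be denormalized for fast reads. You can even keep a view only in memory and avoid persisting it, as long as it is acceptable to recompute the view from the event log whenever the service restarts.
* If you decide you want to present the existing information in a new way, it is easy to build a new materialized view from the existing event log. You can also evolve the system to support new features by adding new types of events, or new properties to existing event types (any older events remain unmodified). You can also chain new behaviors off existing events (for example, when a conference attendee cancels, their seat could be offered to the next person on the waiting list).
* If an event was written in error, you can delete it again, and then you can rebuild the views without the deleted event. On the other hand, in a database where you update and delete data directly, a committed transaction is often difficult to reverse. Event sourcing can therefore reduce the number of irreversible actions in the system, making it easier to change (see ["Evolvability: Making Change Easy"](/tw/ch2#sec_introduction_evolvability)).
* The event log can also serve as an audit log of everything that happened in the system, which is valuable in regulated industries that require such auditability.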
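However, event sourcing and CQRS also have downsides:
* You need to be careful if external information is involved. For example, say an event contains a price given in one currency, and for one of the views it needs to be converted into another currency. Since the exchange rate may fluctuate, it would be problematic to fetch the exchange rate from an external source when processing the event, since you would get a different result if you recompute the materialized view on another date. To make the event processing logic deterministic, you either need to include the exchange rate in the event itself, or have a way of querying the historical exchange rate at the timestamp indicated in the event, ensuring that this query always returns the same result for the same timestamp.
* The requirement that events are immutable creates problems if events contain personal data from users, since users may exercise their right (for example, under the GDPR) to request deletion of their data. If the event log is on a per-user basis, you can just delete the whole log for that user, but that does not work if your event log contains events relating to multiple users. You can try storing the personal data outside of the actual event, or encrypting it with a key that you can later choose to delete, but that also makes it harder to recompute derived state when needed.
* Reprocessing events requires care if there are externally visible side effects; for example, you probably do not want to resend confirmation emails every time you rebuild a materialized view.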
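The first of these points, keeping event processing deterministic, can be illustrated with a short Python sketch. The event layout (with `amount`, `currency`, and `exchange_rate` fields stored as strings) and the function name are assumptions for this example; the idea is simply that the view reads the rate recorded in the event instead of asking an external service at processing time.

```python
from decimal import Decimal

# Hypothetical event: the exchange rate that applied at purchase time is stored
# in the event itself, so reprocessing later gives the same result.
event = {
    "type": "TicketPurchased",
    "amount": "100.00",
    "currency": "EUR",
    "exchange_rate": "1.10",  # EUR -> view currency, captured when the event was written
}


def price_in_view_currency(event):
    """Deterministic conversion: depends only on data contained in the event."""
    converted = Decimal(event["amount"]) * Decimal(event["exchange_rate"])
    return converted.quantize(Decimal("0.01"))
```

Rebuilding the materialized view a year later would call the same function on the same event and obtain the same amount, regardless of how the exchange rate has moved in the meantime.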
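You can implement event sourcing on top of any database, but there are also some systems that are specifically designed to support this pattern, such as EventStoreDB, MartenDB (based on PostgreSQL), and Axon Framework. You can also use message brokers such as Apache Kafka to store the event log, and stream processors can keep the materialized views up to date; we will return to these topics in ["Change Data Capture versus Event Sourcing"](/tw/ch12#sec_stream_event_sourcing).
The only important requirement is that the event storage system must guarantee that all materialized views process the events in exactly the same order as they appear in the log; as we shall see in [Chapter 10](/tw/ch10#ch_consistency), this is not always easy to achieve in a distributed system.
## Dataframes, Matrices, and Arrays {#sec_datamodels_dataframes}
The data models we have seen so far in this chapter are generally used for both transaction processing and analytics purposes (see ["Analytical versus Operational Systems"](/tw/ch1#sec_introduction_analytics)). There are also some data models that you are likely to encounter in an analytical or scientific context, but that rarely feature in OLTP systems: dataframes and multidimensional arrays of numbers such as matrices.
Dataframes are a data model supported by the R language, the Pandas library for Python, Apache Spark, ArcticDB, Dask, and other systems. They are a popular tool for data scientists preparing data for training machine learning models, but they are also widely used for data exploration, statistical data analysis, data visualization, and similar purposes.
At first glance, a dataframe is similar to a table in a relational database or a spreadsheet. It supports relational-like operators that perform bulk operations on the contents of the dataframe: for example, applying a function to all of the rows, filtering the rows based on some condition, grouping rows by some columns and aggregating other columns, and joining the rows in one dataframe with another dataframe based on some key (what a relational database calls a *join* is typically called a *merge* on dataframes).
Instead of a declarative query such as SQL, a dataframe is typically manipulated through a series of commands that modify its structure and content. This matches the typical workflow of data scientists, who incrementally "wrangle" the data into a form that allows them to find answers to the questions they are asking. These manipulations usually take place on the data scientist's private copy of the dataset, often on their local machine, although the end result may be shared with other users.
Dataframe APIs also offer a wide variety of operations that go far beyond what relational databases offer, and the data model is often used in ways that are very different from typical relational data modeling [^65]. For example, a common use of dataframes is to transform data from a relational-like representation into a matrix or multidimensional array representation, which is the form that many machine learning algorithms expect of their input.
A simple example of such a transformation is shown in [Figure 3-9](#fig_dataframe_to_matrix). On the left we have a relational table of how different users have rated various movies (on a scale of 1 to 5), and on the right the data has been transformed into a matrix where each column is a movie and each row is a user (similarly to a *pivot table* in a spreadsheet). The matrix is *sparse*, which means there is no data for many user-movie combinations, but this is fine. This matrix may have many thousands of columns and would therefore not fit well in a relational database, but dataframes and libraries that offer sparse arrays (such as SciPy for Python, which builds on NumPy) can handle such data easily.
{{< figure src="/fig/ddia_0309.png" id="fig_dataframe_to_matrix" title="Figure 3-9. Transforming a relational database of movie ratings into a matrix representation." class="w-full my-4" >}}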
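The transformation can be sketched without any libraries. The rating rows below are hypothetical, and `None` is used to mark user-movie combinations without a rating; a dataframe library would offer a built-in pivot operation and a more compact sparse representation.

```python
# Hypothetical relational rows: (user_id, movie_id, rating).
ratings = [
    (1, "m1", 5), (1, "m2", 3),
    (2, "m2", 4),
    (3, "m3", 1),
]


def pivot(ratings):
    """Turn (user, movie, rating) rows into a user-by-movie matrix."""
    users = sorted({user for user, _, _ in ratings})
    movies = sorted({movie for _, movie, _ in ratings})
    cell = {(user, movie): rating for user, movie, rating in ratings}
    # One row per user, one column per movie; None where no rating exists.
    matrix = [[cell.get((user, movie)) for movie in movies] for user in users]
    return users, movies, matrix
```

For the four rows above the result has three rows and three columns, with most cells empty, which illustrates why a sparse representation becomes attractive as the number of movies grows.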
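A matrix can only contain numbers, and various techniques are used to transform non-numerical data into numbers in the matrix. For example:
* Dates (which are omitted from the example matrix in [Figure 3-9](#fig_dataframe_to_matrix)) could be scaled to be floating-point numbers within some suitable range.
* For columns that can only take one of a small, fixed set of values (for example, the genre of a movie in a database of movies), a *one-hot encoding* is often used: we create a column for each possible value (one for "comedy", one for "drama", one for "horror", and so on), and for each row representing a movie, we place a 1 in the column corresponding to the genre of that movie, and a 0 in all the other columns. This representation also easily generalizes to movies that fit within several genres.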
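A one-hot encoding, including its generalization to several genres per movie, fits in a few lines of Python. The function name `multi_hot` and the sample genres are illustrative assumptions.

```python
def multi_hot(movie_genres, categories):
    """Encode each movie's set of genres as a row of 0s and 1s.

    With exactly one genre per movie this is a one-hot encoding; with several
    genres per movie, more than one column in the row is set to 1.
    """
    return [[1 if category in genres else 0 for category in categories]
            for genres in movie_genres]


categories = ["comedy", "drama", "horror"]
encoded = multi_hot([{"comedy"}, {"drama", "horror"}], categories)
```

Here the first movie yields the row `[1, 0, 0]` and the second, which belongs to two genres, yields `[0, 1, 1]`.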
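Once the data is in the form of a matrix of numbers, it is amenable to linear algebra operations, which form the basis of many machine learning algorithms. For example, the data in [Figure 3-9](#fig_dataframe_to_matrix) could be part of a system for recommending movies that the user may like. Dataframes are flexible enough to allow data to be gradually evolved from a relational form into a matrix representation, while giving the data scientist control over the representation that is most suitable for achieving the goals of the data analysis or model training process.
There are also databases such as TileDB [^66] that specialize in storing large multidimensional arrays of numbers; they are called *array databases* and are most commonly used for scientific datasets such as geospatial measurements (raster data on a regularly spaced grid), medical imaging, or observations from astronomical telescopes [^67]. Dataframes are also used in the financial industry for representing *time series data*, such as the prices of assets and trades over time [^68].
## Summary {#summary}
Data models are a huge subject, and in this chapter we have taken a quick look at a broad variety of different models. We did not have the space to go into all the details of each model, but hopefully the overview has been enough to whet your appetite to find out more about the model that best fits your application's requirements.
The *relational model*, despite being more than half a century old, remains an important data model for many applications, especially in data warehousing and business analytics, where relational star or snowflake schemas and SQL queries are ubiquitous. However, several alternatives to relational data have also become popular in other domains:
* The *document model* targets use cases where data comes in self-contained JSON documents, and where relationships between one document and another are rare.
* *Graph data models* go in the opposite direction, targeting use cases where anything is potentially related to everything, and where queries potentially need to traverse multiple hops to find the data of interest (which can be expressed using recursive queries in Cypher, SPARQL, or Datalog).
* *Dataframes* generalize relational data to large numbers of columns, and thereby provide a bridge between databases and the multidimensional arrays that form the basis of much machine learning, statistical data analysis, and scientific computing.
To some degree, one model can be emulated in terms of another model (for example, graph data can be represented in a relational database), but the result can be awkward, as we saw with the support for recursive queries in SQL.
Various specialist databases have therefore been developed for each data model, providing query languages and storage engines that are optimized for a particular model. However, there is also a trend for databases to expand into neighboring niches by adding support for other data models: for example, relational databases have added support for document data in the form of JSON columns, document databases have added relational-like joins, and support for graph data within SQL is gradually improving.
Another model we discussed is *event sourcing*, which represents data as an append-only log of immutable events, and which can be advantageous for modeling activities in complex business domains. An append-only log is good for writing data (as we shall see in [Chapter 4](/tw/ch4#ch_storage)); in order to support efficient queries, the event log is translated into read-optimized materialized views through CQRS.
One thing that non-relational data models have in common is that they typically do not enforce a schema for the data they store, which can make it easier to adapt applications to changing requirements. However, your application most likely still assumes that data has a certain structure; it is just a question of whether the schema is explicit (enforced on write) or implicit (assumed on read).
Although we have covered a lot of ground, there are still data models left unmentioned. To give just a few brief examples:
* Researchers working with genome data often need to perform *sequence-similarity searches*, which means taking one very long string (representing a DNA molecule) and matching it against a large database of strings that are similar, but not identical. None of the databases described here can handle this kind of usage, which is why researchers have written specialized genome database software like GenBank [^69].
* Many financial systems use *ledgers* with double-entry accounting as their data model. This type of data can be represented in relational databases, but there are also databases such as TigerBeetle that specialize in this data model. Cryptocurrencies and blockchains are typically based on distributed ledgers, which also have value transfer built into their data model.
* *Full-text search* is arguably a kind of data model that is frequently used alongside databases. Information retrieval is a large specialist subject that we will not cover in great detail in this book, but we will touch on search indexes and vector search in ["Full-Text Search"](/tw/ch4#sec_storage_full_text).
We have to leave it there for now. In the next chapter we will discuss some of the trade-offs that come into play when *implementing* the data models described in this chapter.
### References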
[^1]: Jamie Brandon. [Unexplanations: query optimization works because sql is declarative](https://www.scattered-thoughts.net/writing/unexplanations-sql-declarative/). *scattered-thoughts.net*, February 2024. Archived at [perma.cc/P6W2-WMFZ](https://perma.cc/P6W2-WMFZ)
[^2]: Joseph M. Hellerstein. [The Declarative Imperative: Experiences and Conjectures in Distributed Logic](https://www2.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-90.pdf). Tech report UCB/EECS-2010-90, Electrical Engineering and Computer Sciences, University of California at Berkeley, June 2010. Archived at [perma.cc/K56R-VVQM](https://perma.cc/K56R-VVQM)
[^3]: Edgar F. Codd. [A Relational Model of Data for Large Shared Data Banks](https://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf). *Communications of the ACM*, volume 13, issue 6, pages 377–387, June 1970. [doi:10.1145/362384.362685](https://doi.org/10.1145/362384.362685)
[^4]: Michael Stonebraker and Joseph M. Hellerstein. [What Goes Around Comes Around](http://mitpress2.mit.edu/books/chapters/0262693143chapm1.pdf). In *Readings in Database Systems*, 4th edition, MIT Press, pages 2–41, 2005. ISBN: 9780262693141
[^5]: Markus Winand. [Modern SQL: Beyond Relational](https://modern-sql.com/). *modern-sql.com*, 2015. Archived at [perma.cc/D63V-WAPN](https://perma.cc/D63V-WAPN)
[^6]: Martin Fowler. [OrmHate](https://martinfowler.com/bliki/OrmHate.html). *martinfowler.com*, May 2012. Archived at [perma.cc/VCM8-PKNG](https://perma.cc/VCM8-PKNG)
[^7]: Vlad Mihalcea. [N+1 query problem with JPA and Hibernate](https://vladmihalcea.com/n-plus-1-query-problem/). *vladmihalcea.com*, January 2023. Archived at [perma.cc/79EV-TZKB](https://perma.cc/79EV-TZKB)
[^8]: Jens Schauder. [This is the Beginning of the End of the N+1 Problem: Introducing Single Query Loading](https://spring.io/blog/2023/08/31/this-is-the-beginning-of-the-end-of-the-n-1-problem-introducing-single-query). *spring.io*, August 2023. Archived at [perma.cc/6V96-R333](https://perma.cc/6V96-R333)
[^9]: William Zola. [6 Rules of Thumb for MongoDB Schema Design](https://www.mongodb.com/blog/post/6-rules-of-thumb-for-mongodb-schema-design). *mongodb.com*, June 2014. Archived at [perma.cc/T2BZ-PPJB](https://perma.cc/T2BZ-PPJB)
[^10]: Sidney Andrews and Christopher McClister. [Data modeling in Azure Cosmos DB](https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/modeling-data). *learn.microsoft.com*, February 2023. Archived at [archive.org](https://web.archive.org/web/20230207193233/https%3A//learn.microsoft.com/en-us/azure/cosmos-db/nosql/modeling-data)
[^11]: Raffi Krikorian. [Timelines at Scale](https://www.infoq.com/presentations/Twitter-Timeline-Scalability/). At *QCon San Francisco*, November 2012. Archived at [perma.cc/V9G5-KLYK](https://perma.cc/V9G5-KLYK)
[^12]: Ralph Kimball and Margy Ross. [*The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling*](https://learning.oreilly.com/library/view/the-data-warehouse/9781118530801/), 3rd edition. John Wiley & Sons, July 2013. ISBN: 9781118530801
[^13]: Michael Kaminsky. [Data warehouse modeling: Star schema vs. OBT](https://www.fivetran.com/blog/star-schema-vs-obt). *fivetran.com*, August 2022. Archived at [perma.cc/2PZK-BFFP](https://perma.cc/2PZK-BFFP)
[^14]: Joe Nelson. [User-defined Order in SQL](https://begriffs.com/posts/2018-03-20-user-defined-order.html). *begriffs.com*, March 2018. Archived at [perma.cc/GS3W-F7AD](https://perma.cc/GS3W-F7AD)
[^15]: Evan Wallace. [Realtime Editing of Ordered Sequences](https://www.figma.com/blog/realtime-editing-of-ordered-sequences/). *figma.com*, March 2017. Archived at [perma.cc/K6ER-CQZW](https://perma.cc/K6ER-CQZW)
[^16]: David Greenspan. [Implementing Fractional Indexing](https://observablehq.com/%40dgreensp/implementing-fractional-indexing). *observablehq.com*, October 2020. Archived at [perma.cc/5N4R-MREN](https://perma.cc/5N4R-MREN)
[^17]: Martin Fowler. [Schemaless Data Structures](https://martinfowler.com/articles/schemaless/). *martinfowler.com*, January 2013.
[^18]: Amr Awadallah. [Schema-on-Read vs. Schema-on-Write](https://www.slideshare.net/awadallah/schemaonread-vs-schemaonwrite). At *Berkeley EECS RAD Lab Retreat*, Santa Cruz, CA, May 2009. Archived at [perma.cc/DTB2-JCFR](https://perma.cc/DTB2-JCFR)
[^19]: Martin Odersky. [The Trouble with Types](https://www.infoq.com/presentations/data-types-issues/). At *Strange Loop*, September 2013. Archived at [perma.cc/85QE-PVEP](https://perma.cc/85QE-PVEP)
[^20]: Conrad Irwin. [MongoDB—Confessions of a PostgreSQL Lover](https://speakerdeck.com/conradirwin/mongodb-confessions-of-a-postgresql-lover). At *HTML5DevConf*, October 2013. Archived at [perma.cc/C2J6-3AL5](https://perma.cc/C2J6-3AL5)
[^21]: [Percona Toolkit Documentation: pt-online-schema-change](https://docs.percona.com/percona-toolkit/pt-online-schema-change.html). *docs.percona.com*, 2023. Archived at [perma.cc/9K8R-E5UH](https://perma.cc/9K8R-E5UH)
[^22]: Shlomi Noach. [gh-ost: GitHub’s Online Schema Migration Tool for MySQL](https://github.blog/2016-08-01-gh-ost-github-s-online-migration-tool-for-mysql/). *github.blog*, August 2016. Archived at [perma.cc/7XAG-XB72](https://perma.cc/7XAG-XB72)
[^23]: Shayon Mukherjee. [pg-osc: Zero downtime schema changes in PostgreSQL](https://www.shayon.dev/post/2022/47/pg-osc-zero-downtime-schema-changes-in-postgresql/). *shayon.dev*, February 2022. Archived at [perma.cc/35WN-7WMY](https://perma.cc/35WN-7WMY)
[^24]: Carlos Pérez-Aradros Herce. [Introducing pgroll: zero-downtime, reversible, schema migrations for Postgres](https://xata.io/blog/pgroll-schema-migrations-postgres). *xata.io*, October 2023. Archived at [archive.org](https://web.archive.org/web/20231008161750/https%3A//xata.io/blog/pgroll-schema-migrations-postgres)
[^25]: James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, JJ Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Dale Woodford, Yasushi Saito, Christopher Taylor, Michal Szymaniak, and Ruth Wang. [Spanner: Google’s Globally-Distributed Database](https://research.google/pubs/pub39966/). At *10th USENIX Symposium on Operating System Design and Implementation* (OSDI), October 2012.
[^26]: Donald K. Burleson. [Reduce I/O with Oracle Cluster Tables](http://www.dba-oracle.com/oracle_tip_hash_index_cluster_table.htm). *dba-oracle.com*. Archived at [perma.cc/7LBJ-9X2C](https://perma.cc/7LBJ-9X2C)
[^27]: Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber. [Bigtable: A Distributed Storage System for Structured Data](https://research.google/pubs/pub27898/). At *7th USENIX Symposium on Operating System Design and Implementation* (OSDI), November 2006.
[^28]: Priscilla Walmsley. [*XQuery, 2nd Edition*](https://learning.oreilly.com/library/view/xquery-2nd-edition/9781491915080/). O’Reilly Media, December 2015. ISBN: 9781491915080
[^29]: Paul C. Bryan, Kris Zyp, and Mark Nottingham. [JavaScript Object Notation (JSON) Pointer](https://www.rfc-editor.org/rfc/rfc6901). RFC 6901, IETF, April 2013.
[^30]: Stefan Gössner, Glyn Normington, and Carsten Bormann. [JSONPath: Query Expressions for JSON](https://www.rfc-editor.org/rfc/rfc9535.html). RFC 9535, IETF, February 2024.
[^31]: Michael Stonebraker and Andrew Pavlo. [What Goes Around Comes Around… And Around…](https://db.cs.cmu.edu/papers/2024/whatgoesaround-sigmodrec2024.pdf). *ACM SIGMOD Record*, volume 53, issue 2, pages 21–37. [doi:10.1145/3685980.3685984](https://doi.org/10.1145/3685980.3685984)
[^32]: Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. [The PageRank Citation Ranking: Bringing Order to the Web](http://ilpubs.stanford.edu:8090/422/). Technical Report 1999-66, Stanford University InfoLab, November 1999. Archived at [perma.cc/UML9-UZHW](https://perma.cc/UML9-UZHW)
[^33]: Nathan Bronson, Zach Amsden, George Cabrera, Prasad Chakka, Peter Dimov, Hui Ding, Jack Ferris, Anthony Giardullo, Sachin Kulkarni, Harry Li, Mark Marchukov, Dmitri Petrov, Lovro Puzar, Yee Jiun Song, and Venkat Venkataramani. [TAO: Facebook’s Distributed Data Store for the Social Graph](https://www.usenix.org/conference/atc13/technical-sessions/presentation/bronson). At *USENIX Annual Technical Conference* (ATC), June 2013.
[^34]: Natasha Noy, Yuqing Gao, Anshu Jain, Anant Narayanan, Alan Patterson, and Jamie Taylor. [Industry-Scale Knowledge Graphs: Lessons and Challenges](https://cacm.acm.org/magazines/2019/8/238342-industry-scale-knowledge-graphs/fulltext). *Communications of the ACM*, volume 62, issue 8, pages 36–43, August 2019. [doi:10.1145/3331166](https://doi.org/10.1145/3331166)
[^35]: Xiyang Feng, Guodong Jin, Ziyi Chen, Chang Liu, and Semih Salihoğlu. [KÙZU Graph Database Management System](https://www.cidrdb.org/cidr2023/papers/p48-jin.pdf). At *13th Annual Conference on Innovative Data Systems Research* (CIDR 2023), January 2023.
[^36]: Maciej Besta, Emanuel Peter, Robert Gerstenberger, Marc Fischer, Michał Podstawski, Claude Barthels, Gustavo Alonso, Torsten Hoefler. [Demystifying Graph Databases: Analysis and Taxonomy of Data Organization, System Designs, and Graph Queries](https://arxiv.org/pdf/1910.09017.pdf). *arxiv.org*, October 2019.
[^37]: [Apache TinkerPop 3.6.3 Documentation](https://tinkerpop.apache.org/docs/3.6.3/reference/). *tinkerpop.apache.org*, May 2023. Archived at [perma.cc/KM7W-7PAT](https://perma.cc/KM7W-7PAT)
[^38]: Nadime Francis, Alastair Green, Paolo Guagliardo, Leonid Libkin, Tobias Lindaaker, Victor Marsault, Stefan Plantikow, Mats Rydberg, Petra Selmer, and Andrés Taylor. [Cypher: An Evolving Query Language for Property Graphs](https://core.ac.uk/download/pdf/158372754.pdf). At *International Conference on Management of Data* (SIGMOD), pages 1433–1445, May 2018. [doi:10.1145/3183713.3190657](https://doi.org/10.1145/3183713.3190657)
[^39]: Emil Eifrem. [Twitter correspondence](https://twitter.com/emileifrem/status/419107961512804352), January 2014. Archived at [perma.cc/WM4S-BW64](https://perma.cc/WM4S-BW64)
[^40]: Francesco Tisiot. [Explore the new SEARCH and CYCLE features in PostgreSQL® 14](https://aiven.io/blog/explore-the-new-search-and-cycle-features-in-postgresql-14). *aiven.io*, December 2021. Archived at [perma.cc/J6BT-83UZ](https://perma.cc/J6BT-83UZ)
[^41]: Gaurav Goel. [Understanding Hierarchies in Oracle](https://towardsdatascience.com/understanding-hierarchies-in-oracle-43f85561f3d9). *towardsdatascience.com*, May 2020. Archived at [perma.cc/5ZLR-Q7EW](https://perma.cc/5ZLR-Q7EW)
[^42]: Alin Deutsch, Nadime Francis, Alastair Green, Keith Hare, Bei Li, Leonid Libkin, Tobias Lindaaker, Victor Marsault, Wim Martens, Jan Michels, Filip Murlak, Stefan Plantikow, Petra Selmer, Oskar van Rest, Hannes Voigt, Domagoj Vrgoč, Mingxi Wu, and Fred Zemke. [Graph Pattern Matching in GQL and SQL/PGQ](https://arxiv.org/abs/2112.06217). At *International Conference on Management of Data* (SIGMOD), pages 2246–2258, June 2022. [doi:10.1145/3514221.3526057](https://doi.org/10.1145/3514221.3526057)
[^43]: Alastair Green. [SQL... and now GQL](https://opencypher.org/articles/2019/09/12/SQL-and-now-GQL/). *opencypher.org*, September 2019. Archived at [perma.cc/AFB2-3SY7](https://perma.cc/AFB2-3SY7)
[^44]: Alin Deutsch, Yu Xu, and Mingxi Wu. [Seamless Syntactic and Semantic Integration of Query Primitives over Relational and Graph Data in GSQL](https://cdn2.hubspot.net/hubfs/4114546/IntegrationQuery%20PrimitivesGSQL.pdf). *tigergraph.com*, November 2018. Archived at [perma.cc/JG7J-Y35X](https://perma.cc/JG7J-Y35X)
[^45]: Oskar van Rest, Sungpack Hong, Jinha Kim, Xuming Meng, and Hassan Chafi. [PGQL: a property graph query language](https://event.cwi.nl/grades/2016/07-VanRest.pdf). At *4th International Workshop on Graph Data Management Experiences and Systems* (GRADES), June 2016. [doi:10.1145/2960414.2960421](https://doi.org/10.1145/2960414.2960421)
[^46]: Amazon Web Services. [Neptune Graph Data Model](https://docs.aws.amazon.com/neptune/latest/userguide/feature-overview-data-model.html). Amazon Neptune User Guide, *docs.aws.amazon.com*. Archived at [perma.cc/CX3T-EZU9](https://perma.cc/CX3T-EZU9)
[^47]: Cognitect. [Datomic Data Model](https://docs.datomic.com/cloud/whatis/data-model.html). Datomic Cloud Documentation, *docs.datomic.com*. Archived at [perma.cc/LGM9-LEUT](https://perma.cc/LGM9-LEUT)
[^48]: David Beckett and Tim Berners-Lee. [Turtle – Terse RDF Triple Language](https://www.w3.org/TeamSubmission/turtle/). W3C Team Submission, March 2011.
[^49]: Sinclair Target. [Whatever Happened to the Semantic Web?](https://twobithistory.org/2018/05/27/semantic-web.html) *twobithistory.org*, May 2018. Archived at [perma.cc/M8GL-9KHS](https://perma.cc/M8GL-9KHS)
[^50]: Gavin Mendel-Gleason. [The Semantic Web is Dead – Long Live the Semantic Web!](https://terminusdb.com/blog/the-semantic-web-is-dead/) *terminusdb.com*, August 2022. Archived at [perma.cc/G2MZ-DSS3](https://perma.cc/G2MZ-DSS3)
[^51]: Manu Sporny. [JSON-LD and Why I Hate the Semantic Web](http://manu.sporny.org/2014/json-ld-origins-2/). *manu.sporny.org*, January 2014. Archived at [perma.cc/7PT4-PJKF](https://perma.cc/7PT4-PJKF)
[^52]: University of Michigan Library. [Biomedical Ontologies and Controlled Vocabularies](https://guides.lib.umich.edu/ontology), *guides.lib.umich.edu/ontology*. Archived at [perma.cc/Q5GA-F2N8](https://perma.cc/Q5GA-F2N8)
[^53]: Facebook. [The Open Graph protocol](https://ogp.me/), *ogp.me*. Archived at [perma.cc/C49A-GUSY](https://perma.cc/C49A-GUSY)
[^54]: Matt Haughey. [Everything you ever wanted to know about unfurling but were afraid to ask /or/ How to make your site previews look amazing in Slack](https://medium.com/slack-developer-blog/everything-you-ever-wanted-to-know-about-unfurling-but-were-afraid-to-ask-or-how-to-make-your-e64b4bb9254). *medium.com*, November 2015. Archived at [perma.cc/C7S8-4PZN](https://perma.cc/C7S8-4PZN)
[^55]: W3C RDF Working Group. [Resource Description Framework (RDF)](https://www.w3.org/RDF/). *w3.org*, February 2004.
[^56]: Steve Harris, Andy Seaborne, and Eric Prud’hommeaux. [SPARQL 1.1 Query Language](https://www.w3.org/TR/sparql11-query/). W3C Recommendation, March 2013.
[^57]: Todd J. Green, Shan Shan Huang, Boon Thau Loo, and Wenchao Zhou. [Datalog and Recursive Query Processing](http://blogs.evergreen.edu/sosw/files/2014/04/Green-Vol5-DBS-017.pdf). *Foundations and Trends in Databases*, volume 5, issue 2, pages 105–195, November 2013. [doi:10.1561/1900000017](https://doi.org/10.1561/1900000017)
[^58]: Stefano Ceri, Georg Gottlob, and Letizia Tanca. [What You Always Wanted to Know About Datalog (And Never Dared to Ask)](https://www.researchgate.net/profile/Letizia_Tanca/publication/3296132_What_you_always_wanted_to_know_about_Datalog_and_never_dared_to_ask/links/0fcfd50ca2d20473ca000000.pdf). *IEEE Transactions on Knowledge and Data Engineering*, volume 1, issue 1, pages 146–166, March 1989. [doi:10.1109/69.43410](https://doi.org/10.1109/69.43410)
[^59]: Serge Abiteboul, Richard Hull, and Victor Vianu. [*Foundations of Databases*](http://webdam.inria.fr/Alice/). Addison-Wesley, 1995. ISBN: 9780201537710, available online at [*webdam.inria.fr/Alice*](http://webdam.inria.fr/Alice/)
[^60]: Scott Meyer, Andrew Carter, and Andrew Rodriguez. [LIquid: The soul of a new graph database, Part 2](https://engineering.linkedin.com/blog/2020/liquid--the-soul-of-a-new-graph-database--part-2). *engineering.linkedin.com*, September 2020. Archived at [perma.cc/K9M4-PD6Q](https://perma.cc/K9M4-PD6Q)
[^61]: Matt Bessey. [Why, after 6 years, I’m over GraphQL](https://bessey.dev/blog/2024/05/24/why-im-over-graphql/). *bessey.dev*, May 2024. Archived at [perma.cc/2PAU-JYRA](https://perma.cc/2PAU-JYRA)
[^62]: Dominic Betts, Julián Domínguez, Grigori Melnik, Fernando Simonazzi, and Mani Subramanian. [*Exploring CQRS and Event Sourcing*](https://learn.microsoft.com/en-us/previous-versions/msp-n-p/jj554200%28v%3Dpandp.10%29). Microsoft Patterns & Practices, July 2012. ISBN: 1621140164, archived at [perma.cc/7A39-3NM8](https://perma.cc/7A39-3NM8)
[^63]: Greg Young. [CQRS and Event Sourcing](https://www.youtube.com/watch?v=JHGkaShoyNs). At *Code on the Beach*, August 2014.
[^64]: Greg Young. [CQRS Documents](https://cqrs.files.wordpress.com/2010/11/cqrs_documents.pdf). *cqrs.wordpress.com*, November 2010. Archived at [perma.cc/X5R6-R47F](https://perma.cc/X5R6-R47F)
[^65]: Devin Petersohn, Stephen Macke, Doris Xin, William Ma, Doris Lee, Xiangxi Mo, Joseph E. Gonzalez, Joseph M. Hellerstein, Anthony D. Joseph, and Aditya Parameswaran. [Towards Scalable Dataframe Systems](https://www.vldb.org/pvldb/vol13/p2033-petersohn.pdf). *Proceedings of the VLDB Endowment*, volume 13, issue 11, pages 2033–2046. [doi:10.14778/3407790.3407807](https://doi.org/10.14778/3407790.3407807)
[^66]: Stavros Papadopoulos, Kushal Datta, Samuel Madden, and Timothy Mattson. [The TileDB Array Data Storage Manager](https://www.vldb.org/pvldb/vol10/p349-papadopoulos.pdf). *Proceedings of the VLDB Endowment*, volume 10, issue 4, pages 349–360, November 2016. [doi:10.14778/3025111.3025117](https://doi.org/10.14778/3025111.3025117)
[^67]: Florin Rusu. [Multidimensional Array Data Management](https://faculty.ucmerced.edu/frusu/Papers/Report/2022-09-fntdb-arrays.pdf). *Foundations and Trends in Databases*, volume 12, numbers 2–3, pages 69–220, February 2023. [doi:10.1561/1900000069](https://doi.org/10.1561/1900000069)
[^68]: Ed Targett. [Bloomberg, Man Group team up to develop open source “ArcticDB” database](https://www.thestack.technology/bloomberg-man-group-arcticdb-database-dataframe/). *thestack.technology*, March 2023. Archived at [perma.cc/M5YD-QQYV](https://perma.cc/M5YD-QQYV)
[^69]: Dennis A. Benson, Ilene Karsch-Mizrachi, David J. Lipman, James Ostell, and David L. Wheeler. [GenBank](https://academic.oup.com/nar/article/36/suppl_1/D25/2507746). *Nucleic Acids Research*, volume 36, database issue, pages D25–D30, December 2007. [doi:10.1093/nar/gkm929](https://doi.org/10.1093/nar/gkm929)
================================================
FILE: content/tw/ch4.md
================================================
---
title: "4. Storage and Retrieval"
weight: 104
breadcrumbs: false
---

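> *One of the miseries of life is that everybody names things a little bit wrong. That makes the world a little harder to understand than it would be if things were named differently. A computer does not primarily compute in the traditional sense of doing arithmetic. […] They are primarily filing systems.*
>
> [Richard Feynman](https://www.youtube.com/watch?v=EKWGGDXe5MA&t=296s),
> *Idiosyncratic Thinking* seminar (1985)
On the most fundamental level, a database needs to do two things: when you give it some data, it should store the data, and when you ask for it again later, it should give the data back to you.
In [Chapter 3](/tw/ch3#ch_datamodels) we discussed data models and query languages, that is, the format in which you give the database your data and the interface through which you ask for it again later. In this chapter we discuss the same question from the database's point of view: how the database stores the data you give it, and how it finds that data again when you ask for it.
Why should you, as an application developer, care how the database handles storage and retrieval internally? You are probably not going to implement your own storage engine from scratch, but you *do* need to select a storage engine that suits your application from the many that are available. To make a storage engine perform well on your kind of workload, you need a rough idea of what it is doing under the hood.
In particular, there is a big difference between storage engines optimized for transactional workloads (OLTP) and those optimized for analytical workloads (we introduced this distinction in ["Analytical versus Transactional Systems"](/tw/ch1#sec_introduction_analytics)). This chapter starts by examining two families of storage engines for OLTP: *log-structured* storage engines, which write immutable data files, and storage engines such as *B-trees*, which update data in place. These structures are used both for key-value storage and for secondary indexes.
Later, in ["Data Storage for Analytics"](#sec_storage_analytics), we will discuss a family of storage engines optimized for analytics, and in ["Multidimensional and Full-Text Indexes"](#sec_storage_multidimensional) we will briefly look at indexes for more advanced queries such as text retrieval.
## Storage and Indexing for OLTP {#sec_storage_oltp}
Consider the world's simplest database, implemented as two Bash functions: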
```bash
#!/bin/bash
db_set () {
echo "$1,$2" >> database
}
db_get () {
grep "^$1," database | sed -e "s/^$1,//" | tail -n 1
}
```
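These two functions implement a key-value store. You can call `db_set key value`, which stores `key` and `value` in the database. The key and value can be (almost) anything you like; for example, the value could be a JSON document. You can then call `db_get key`, which looks up the most recent value associated with that particular key and returns it.
Small as it is, it works: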
```bash
$ db_set 12 '{"name":"London","attractions":["Big Ben","London Eye"]}'
$ db_set 42 '{"name":"San Francisco","attractions":["Golden Gate Bridge"]}'
$ db_get 42
{"name":"San Francisco","attractions":["Golden Gate Bridge"]}
```
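The storage format is very simple: a text file in which each line contains one key-value pair, separated by a comma (roughly like a CSV file, ignoring escaping issues). Every call to `db_set` appends to the end of the file. If you update a key several times, the old versions of the value are not overwritten; you need to look at the last occurrence of the key in the file to find the latest value (hence the `tail -n 1` in `db_get`):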
```bash
$ db_set 42 '{"name":"San Francisco","attractions":["Exploratorium"]}'
$ db_get 42
{"name":"San Francisco","attractions":["Exploratorium"]}
$ cat database
12,{"name":"London","attractions":["Big Ben","London Eye"]}
42,{"name":"San Francisco","attractions":["Golden Gate Bridge"]}
42,{"name":"San Francisco","attractions":["Exploratorium"]}
```
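For such a simple implementation, the `db_set` function actually has pretty good performance, because appending to a file is generally very efficient. Similarly to what `db_set` does, many databases internally use a *log*, which is an append-only data file. Real databases have more issues to deal with (such as handling concurrent writes, reclaiming disk space so that the log does not grow forever, and handling partially written records when recovering from a crash), but the basic principle is the same. Logs are incredibly useful, and we will encounter them several times in this book.
---------
> [!NOTE]
> The word *log* is often used to refer to application logs, where an application outputs text that describes what is happening. In this book, *log* is used in a more general sense: an append-only sequence of records on disk. It does not have to be human-readable; it might be binary and intended only for internal use by the database system.
--------
On the other hand, the `db_get` function has terrible performance if you have a large number of records in your database. Every time you want to look up a key, `db_get` has to scan the entire database file from beginning to end, looking for occurrences of the key. In algorithmic terms, the cost of a lookup is *O*(*n*): if you double the number of records *n* in your database, a lookup takes twice as long. That is not good.
To efficiently find the value for a particular key in the database, we need a different data structure: an *index*. In this chapter we will look at a range of indexing structures and see how they compare; the general idea is to structure the data in a particular way (for example, sorted by some key) so that locating the data you want becomes faster. If you want to search the same data in several different ways, you may need several different indexes on different parts of the data.
An index is an *additional* structure that is derived from the primary data. Many databases allow you to add and remove indexes, and this does not affect the contents of the database; it only affects the performance of queries. Maintaining additional structures incurs overhead, especially on writes. For writes, it is hard to beat the performance of simply appending to a file, because that is the simplest possible write operation. Any kind of index usually slows down writes, because the index also needs to be updated every time data is written.
This is an important trade-off in storage systems: well-chosen indexes speed up read queries, but every index consumes additional disk space and slows down writes, sometimes substantially [^1]. For this reason, databases do not usually index everything by default, but require you (the person writing the application or administering the database) to choose indexes manually, using your knowledge of the application's typical query patterns. You can then choose the indexes that give your application the greatest benefit without introducing more write overhead than necessary.
### Log-Structured Storage {#sec_storage_log_structured}
To start, let's assume that you want to keep storing data in the append-only file written by `db_set`, and you just want to speed up reads. One way to do this is to keep a hash map in memory, in which every key is mapped to the byte offset in the file at which the most recent value for that key can be found, as illustrated in [Figure 4-1](#fig_storage_csv_hash_index).
{{< figure src="/fig/ddia_0401.png" id="fig_storage_csv_hash_index" caption="Figure 4-1. Storing a log of key-value pairs in a CSV-like format, indexed with an in-memory hash map." class="w-full my-4" >}}
Whenever you append a new key-value pair to the file, you also update the hash map to reflect the offset of the data you just wrote. When you want to look up a value, you use the hash map to find the offset in the log file, seek to that location, and read the value. If that part of the data file is already in the filesystem cache, a read does not require any disk I/O at all.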
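The hash-map index described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the class name `HashIndexedLog` and its methods are hypothetical, keys must not contain commas or newlines, and there is no concurrency control or crash recovery.

```python
import os
import tempfile


class HashIndexedLog:
    """Append-only key-value log plus an in-memory map of byte offsets."""

    def __init__(self, path):
        self.path = path
        self.offsets = {}         # key -> byte offset of its latest record
        open(path, "ab").close()  # create the file if it does not exist

    def set(self, key, value):
        record = f"{key},{value}\n".encode("utf-8")
        with open(self.path, "ab") as f:
            offset = f.seek(0, os.SEEK_END)  # the new record starts here
            f.write(record)
        self.offsets[key] = offset           # point the index at the new record

    def get(self, key):
        offset = self.offsets.get(key)
        if offset is None:
            return None
        with open(self.path, "rb") as f:
            f.seek(offset)                   # jump straight to the record
            line = f.readline().decode("utf-8").rstrip("\n")
        return line.split(",", 1)[1]


db = HashIndexedLog(os.path.join(tempfile.mkdtemp(), "database"))
db.set("42", '{"name":"San Francisco","attractions":["Golden Gate Bridge"]}')
db.set("42", '{"name":"San Francisco","attractions":["Exploratorium"]}')
latest = db.get("42")
```

Unlike `db_get`, a lookup here reads a single record instead of scanning the whole file, while the old record for key `42` still occupies space in the log.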
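This approach is much faster, but it still has several problems:
* You never free up the disk space occupied by old log entries that have been overwritten; if you keep writing to the database, you might run out of disk space.
* The hash map is not persisted, so you have to rebuild it when you restart the database, for example by scanning the whole log file to find the latest byte offset for each key. This makes restarts slow if you have a lot of data.
* The hash table must fit in memory. In principle you could maintain a hash table on disk, but unfortunately it is difficult to make an on-disk hash map perform well. It requires a lot of random-access I/O, it is expensive to grow when it becomes full, and hash collisions require fiddly logic [^2].
* Range queries are not efficient. For example, you cannot easily scan over all keys between `10000` and `19999`; you would have to look up each key individually in the hash map.
#### The SSTable file format {#the-sstable-file-format}
In practice, hash tables are rarely used for database indexes; it is much more common to keep data in a structure that is *sorted by key* [^3]. One example of such a structure is a *Sorted String Table*, or *SSTable* for short, as shown in [Figure 4-2](#fig_storage_sstable_index). This file format also stores key-value pairs, but it ensures that they are sorted by key and that each key appears only once in the file.
{{< figure src="/fig/ddia_0402.png" id="fig_storage_sstable_index" caption="Figure 4-2. An SSTable with a sparse index, allowing queries to jump to the right block." class="w-full my-4" >}}
Now you do not need to keep all the keys in memory: you can group the key-value pairs within an SSTable into *blocks* of a few kilobytes, and then store the first key of each block in the index. This kind of index, which stores only some of the keys, is called a *sparse* index. The index is stored in a separate part of the SSTable, for example using an immutable B-tree, a trie, or another data structure that allows queries to quickly look up a particular key [^4].
For example, in [Figure 4-2](#fig_storage_sstable_index), the first key of one block is `handbag`, and the first key of the next block is `handsome`. Now say you are looking for the key `handiwork`, which does not appear in the sparse index. Because of the sorting, you know that `handiwork` must appear between `handbag` and `handsome`. This means you can seek to the offset for `handbag` and scan the file from there until you find `handiwork` (or not, if the key is not present in the file). A block of a few kilobytes can be scanned very quickly.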
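The sparse-index lookup just described can be sketched as follows. The blocks are modeled as in-memory lists and the keys and values are made up for illustration; a real SSTable would store byte offsets into a file and possibly compressed blocks.

```python
import bisect

# Each block is a sorted list of (key, value) pairs (illustrative data).
blocks = [
    [("hammock", 1), ("hand", 2)],
    [("handbag", 3), ("handcuffs", 4), ("handiwork", 5)],
    [("handsome", 6), ("hangout", 7)],
]
# The sparse index holds only the first key of each block.
sparse_index = [block[0][0] for block in blocks]


def sstable_get(key):
    # Find the last block whose first key is <= the key we are looking for.
    i = bisect.bisect_right(sparse_index, key) - 1
    if i < 0:
        return None          # key sorts before the first block
    for k, v in blocks[i]:   # scan only this one block
        if k == key:
            return v
        if k > key:
            break            # we have passed the place where the key would be
    return None
```

Looking up `handiwork` selects the block that starts with `handbag` and scans it, without touching the other blocks.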
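Moreover, each block of records can be compressed (indicated by the shaded area in [Figure 4-2](#fig_storage_sstable_index)). Besides saving disk space, compression also reduces I/O bandwidth use, at the cost of a bit more CPU time.
#### Constructing and merging SSTables {#constructing-and-merging-sstables}
The SSTable file format is better for reads than an append-only log, but it makes writes more difficult. We cannot simply append at the end, because then the file would no longer be sorted (unless the keys happen to be written in ascending order). If we had to rewrite the whole SSTable every time a key is inserted somewhere in the middle, writes would become far too expensive.
We can solve this problem with a *log-structured* approach, which is a hybrid between an append-only log and a sorted file:
1. When a write comes in, add it to an in-memory ordered map data structure, such as a red-black tree, a skip list [^5], or a trie [^6]. With these data structures you can insert keys in any order, look them up efficiently, and read them back in sorted order. This in-memory data structure is called the *memtable*.
2. When the memtable grows beyond some threshold (typically a few megabytes), write it out to disk in sorted order as an SSTable file. We call this new SSTable file the most recent *segment* of the database, and it is stored as a separate file alongside the older segments. Each segment has its own separate index of its contents. While the new segment is being written to disk, the database can continue writing to a new memtable instance, and the memory of the old memtable is freed once the SSTable has been written.
3. To read the value for some key, first try to find the key in the memtable and in the most recent on-disk segment. If it is not there, look in the next-older segment, and so on, until you either find the key or reach the oldest segment. If the key does not appear in any segment, it does not exist in the database.
4. From time to time, run a merging and compaction process in the background to combine segment files and to discard overwritten or deleted values.
Merging segments works similarly to the *mergesort* algorithm [^5]. The process is illustrated in [Figure 4-3](#fig_storage_sstable_merging): start reading the input files side by side, look at the first key in each file, copy the lowest key (according to the sort order) to the output file, and repeat. If the same key appears in more than one input file, keep only the more recent value. This produces a new merged segment file that is also sorted by key, with one value per key, and it uses minimal memory because we can iterate over the SSTables one key at a time.
{{< figure src="/fig/ddia_0403.png" id="fig_storage_sstable_merging" caption="Figure 4-3. Merging several SSTable segments, retaining only the most recent value for each key." class="w-full my-4" >}}
To ensure that the data in the memtable is not lost if the database crashes, the storage engine keeps a separate log on disk, to which every write is immediately appended. This log is not sorted by key, but that does not matter, because its only purpose is to restore the memtable after a crash. Every time the memtable has been written out to an SSTable, the corresponding part of the log can be discarded.
If you want to delete a key and its associated value, you have to append a special deletion record called a *tombstone* to the data file. When log segments are merged, the tombstone tells the merging process to discard any previous values for the deleted key. Once the tombstone has been merged into the oldest segment, it can be dropped.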
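The merge step and the handling of tombstones can be sketched with Python's `heapq.merge`. This is a minimal in-memory illustration, not how any particular engine implements compaction: segments are plain sorted lists, `TOMBSTONE` is a hypothetical marker object, and the `drop_tombstones` flag stands in for "this merge includes the oldest segment".

```python
import heapq

TOMBSTONE = object()  # hypothetical marker standing in for a deletion record


def merge_segments(segments, drop_tombstones=False):
    """Merge sorted segments (given newest first), keeping the newest value per key."""
    # Tag each record with the age of its segment so that, for equal keys,
    # the record from the newest segment (age 0) comes out first.
    tagged = [
        [(key, age, value) for key, value in segment]
        for age, segment in enumerate(segments)
    ]
    merged = []
    last_key = object()  # sentinel that equals no real key
    for key, _age, value in heapq.merge(*tagged, key=lambda r: (r[0], r[1])):
        if key == last_key:
            continue      # an older value for a key we have already handled
        last_key = key
        if value is TOMBSTONE and drop_tombstones:
            continue      # only safe when merging into the oldest segment
        merged.append((key, value))
    return merged


newer = [("handbag", 8786), ("handful", TOMBSTONE)]
older = [("handbag", 1), ("handful", 40308), ("handiwork", 5)]
compacted = merge_segments([newer, older], drop_tombstones=True)
```

Because `heapq.merge` consumes its inputs lazily, only one record per input segment needs to be held at a time, which mirrors the low memory use of the merge described above.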
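The algorithm described here is essentially what is used in RocksDB [^7], Cassandra, Scylla, and HBase [^8], all of which were inspired by Google's Bigtable paper [^9] (which introduced the terms *SSTable* and *memtable*).
The algorithm was originally published in 1996 under the name *Log-Structured Merge-Tree*, or *LSM-Tree* [^10], building on earlier work on log-structured filesystems [^11]. For this reason, storage engines that are based on the principle of merging and compacting sorted files are often called *LSM storage engines*.
In an LSM storage engine, a segment file is written in one pass (either by writing out the memtable or by merging some existing segments), and thereafter it is immutable. The merging and compaction of segments can be done in a background thread, and while it is running, we can continue to serve reads from the old segment files. When the merging process is complete, we switch read requests over to the new merged segment instead of the old segments, and the old segment files can then be deleted.
Segment files do not necessarily have to be stored on local disk: they are also well suited for writing to object storage. SlateDB and Delta Lake [^12], for example, take this approach.
Having immutable segment files also simplifies crash recovery: if a crash happens while writing out the memtable or while merging segments, the database can simply delete the unfinished SSTable and start afresh. The log that persists writes to the memtable may contain incomplete records if there was a crash halfway through writing a record, or if the disk was full; these are typically detected by including checksums in the log, and corrupted or incomplete log entries are discarded. We will talk more about durability and crash recovery in [Chapter 8](/tw/ch8#ch_transactions).
#### Bloom filters {#bloom-filters}
With LSM storage, it can be slow to read a key that was last updated a long time ago, or that does not exist, since the storage engine needs to check several segment files. To speed up such reads, LSM storage engines often include a *Bloom filter* [^13] in each segment, which provides a fast but approximate way of checking whether a particular key appears in a particular SSTable.
[Figure 4-4](#fig_storage_bloom) shows an example of a Bloom filter containing two keys and 16 bits (in reality, it would contain more keys and more bits). For every key in the SSTable we compute a hash function, producing a set of numbers that are then interpreted as indexes into the array of bits [^14]. We set the bits corresponding to those indexes to 1 and leave the rest as 0. For example, the key `handbag` hashes to the numbers (2, 9, 4), so we set bits 2, 9, and 4 to 1. The bitmap is then stored as part of the SSTable, along with the sparse index of keys. This takes a bit of extra space, but the Bloom filter is generally small compared to the rest of the SSTable.
{{< figure src="/fig/ddia_0404.png" id="fig_storage_bloom" caption="Figure 4-4. A Bloom filter provides a fast, probabilistic check of whether a particular key exists in a particular SSTable." class="w-full my-4" >}}
When we want to know whether a key appears in the SSTable, we compute the same hash of that key as before and check the bits at those indexes. For example, in [Figure 4-4](#fig_storage_bloom), we are querying the key `handheld`, which hashes to (6, 11, 2). One of those bits is 1 (namely, bit 2), while the other two are 0. These checks can be made extremely fast using the bitwise operations that all CPUs support.
If at least one of the bits is 0, we know that the key definitely does not appear in the SSTable. If the bits in the query are all 1, the key is probably in the SSTable, but it is also possible that, by coincidence, all of those bits were set to 1 by other keys. This case, in which a key looks as if it is present even though it is not, is called a *false positive*.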
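A Bloom filter along these lines can be sketched as below. The way bit positions are derived (salting SHA-256 with a counter) is an assumption chosen for brevity; real storage engines use faster, non-cryptographic hash functions, and the class and method names are hypothetical.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as an array of bits

    def _indexes(self, key):
        # Derive several bit positions from one key by salting the hash input.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for idx in self._indexes(key):
            self.bits |= 1 << idx

    def might_contain(self, key):
        # False means "definitely absent"; True means "probably present".
        return all((self.bits >> idx) & 1 for idx in self._indexes(key))


bloom = BloomFilter(num_bits=1024, num_hashes=3)
for k in ["handbag", "handsome"]:
    bloom.add(k)
```

A key that has been added always yields `True`; a key that has not been added usually yields `False`, but may occasionally yield `True` (a false positive).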
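The probability of false positives depends on the number of keys, the number of bits set per key, and the total number of bits in the Bloom filter. You can use an online calculator tool to work out the right parameters for your application [^15]. As a rule of thumb, you need to allocate 10 bits of Bloom filter space for every key in the SSTable to get a false positive probability of 1%, and the probability drops by a factor of ten for every 5 additional bits you allocate per key.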
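The rule of thumb can be checked against the standard textbook approximation for the false positive rate, p ≈ (1 − e^(−k/b))^k, where b is the number of bits per key and k is the number of hash functions. That approximation assumes independent, uniformly distributed hashes; the formula is not taken from this chapter, and the function name below is hypothetical.

```python
import math


def bloom_false_positive_rate(bits_per_key, num_hashes=None):
    """Approximate false positive rate, assuming ideal hash functions."""
    if num_hashes is None:
        # The rate is minimized at roughly bits_per_key * ln(2) hash functions.
        num_hashes = max(1, round(bits_per_key * math.log(2)))
    return (1 - math.exp(-num_hashes / bits_per_key)) ** num_hashes


p10 = bloom_false_positive_rate(10)  # a little under 1%
p15 = bloom_false_positive_rate(15)  # roughly ten times smaller
```

Under these assumptions, 10 bits per key gives a rate slightly below 1%, and adding 5 more bits per key reduces it by about an order of magnitude, in line with the rule of thumb.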
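In the context of LSM storage engines, false positives are not a problem:
* If the Bloom filter says that a key *is not* present, we can safely skip that SSTable, since we can be sure that it does not contain the key.
* If the Bloom filter says that the key *is* present, we have to consult the sparse index and decode the block of key-value pairs to check whether the key really is there. If it was a false positive, we have done a bit of unnecessary work, but otherwise no harm is done: we simply continue the search with the next-oldest segment.
#### Compaction strategies {#sec_storage_lsm_compaction}
An important detail is how the LSM storage chooses when to perform compaction, and which SSTables to include in a compaction. Many LSM-based storage systems allow you to configure which compaction strategy to use, and some of the common choices are [^16] [^17]:
Size-tiered compaction
: Newer and smaller SSTables are successively merged into older and larger SSTables. The SSTables containing older data can become very large, and merging them requires a lot of temporary disk space. The advantage of this strategy is that it can handle very high write throughput.
Leveled compaction
: The key range is split up into smaller SSTables, and older data is moved into separate "levels", which allows compaction to proceed more incrementally and to use less disk space than the size-tiered strategy. This strategy is more efficient for reads than size-tiered compaction, because the storage engine needs to read fewer SSTables to check whether they contain the key.
As a rule of thumb, size-tiered compaction performs better if you have mostly writes and few reads, whereas leveled compaction performs better if your workload is dominated by reads. If you write a small number of keys frequently and a large number of keys rarely, leveled compaction can also be advantageous [^18].
Although there are many subtleties, the basic idea of LSM-trees (keeping a cascade of SSTables that are merged in the background) is simple and effective. We discuss their performance characteristics in more detail in ["Comparing B-Trees and LSM-Trees"](#sec_storage_btree_lsm_comparison).
--------
> [!TIP] Embedded storage engines
>
> Many databases run as a service that accepts queries over a network, but there are also *embedded* databases that do not expose a network API. Instead, they are libraries that run in the same process as your application code, typically reading and writing files on the local disk, and you interact with them through normal function calls. Examples of embedded storage engines include RocksDB, SQLite, LMDB, DuckDB, and KùzuDB [^19].
>
> Embedded databases are very commonly used in mobile apps to store the local user's data. On the backend, they can be an appropriate choice if the data is small enough to fit on a single machine and there are not many concurrent transactions. For example, in a multitenant system in which each tenant is small enough and completely separate from the others (i.e., you do not need to run queries that combine data from multiple tenants), you can potentially use a separate embedded database instance per tenant [^20].
>
> The storage and retrieval methods we discuss in this chapter are used in both embedded and client-server databases. In [Chapter 6](/tw/ch6#ch_replication) and [Chapter 7](/tw/ch7#ch_sharding) we will discuss techniques for scaling a database across multiple machines.
--------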
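### B-Trees {#sec_storage_b_trees}
The log-structured approach is popular, but it is not the only form of key-value storage. The most widely used structure for reading and writing database records by key is the *B-tree*.
Introduced in 1970 [^21] and called "ubiquitous" less than 10 years later [^22], B-trees have stood the test of time. They remain the standard index implementation in almost all relational databases, and many nonrelational databases use them too.
Like SSTables, B-trees keep key-value pairs sorted by key, which allows efficient key-value lookups and range queries. But that is where the similarity ends: B-trees have a very different design philosophy.
The log-structured indexes we saw earlier break the database down into variable-size *segments*, typically several megabytes or more in size, which are written once and are then immutable. By contrast, B-trees break the database down into fixed-size *blocks* or *pages*, and may overwrite a page in place. A page is traditionally 4 KiB in size, but PostgreSQL now uses 8 KiB by default, and MySQL uses 16 KiB by default.
Each page can be identified using a page number, which allows one page to refer to another, similar to a pointer, but on disk instead of in memory. If all the pages are stored in the same file, multiplying the page number by the page size gives the byte offset in the file where the page is located. We can use these page references to construct a tree of pages, as illustrated in [Figure 4-5](#fig_storage_b_tree).
{{< figure src="/fig/ddia_0405.png" id="fig_storage_b_tree" caption="Figure 4-5. Looking up the key 251 using a B-tree index. From the root page we first follow the reference to the page for keys 200–300, then to the page for keys 250–270." class="w-full my-4" >}}
One page is designated as the *root* of the B-tree; whenever you want to look up a key in the index, you start here. The page contains several keys and references to child pages. Each child is responsible for a contiguous range of keys, and the keys between the references indicate where the boundaries between those ranges lie. (This structure is sometimes called a B+ tree, but we do not need to distinguish it from other B-tree variants.)
In the example in [Figure 4-5](#fig_storage_b_tree), we are looking for the key 251, so we know that we need to follow the page reference between the boundaries 200 and 300. That takes us to a similar-looking page that further breaks down the 200–300 range into subranges. Eventually we get down to a page containing individual keys (a *leaf page*), which either contains the value for each key inline or contains references to the pages where the values can be found.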
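The lookup procedure can be sketched as follows. This is an in-memory illustration only: the `Page` class is hypothetical, child references are Python object references rather than on-disk page numbers, and the tiny tree below is made up rather than copied from the figure.

```python
import bisect


class Page:
    def __init__(self, keys, children=None, values=None):
        self.keys = keys          # boundary keys (interior page) or keys (leaf page)
        self.children = children  # interior page: len(children) == len(keys) + 1
        self.values = values      # leaf page: one value per key


def btree_get(page, key):
    # Descend from the root: child i covers keys[i-1] <= key < keys[i].
    while page.children is not None:
        page = page.children[bisect.bisect_right(page.keys, key)]
    # In the leaf page, binary-search for the exact key.
    i = bisect.bisect_left(page.keys, key)
    if i < len(page.keys) and page.keys[i] == key:
        return page.values[i]
    return None


leaf_low = Page([210, 230], values=["v210", "v230"])
leaf_high = Page([250, 251, 260], values=["v250", "v251", "v260"])
interior = Page([250], children=[leaf_low, leaf_high])
root = Page(
    [200, 300],
    children=[
        Page([100], values=["v100"]),
        interior,
        Page([300, 310], values=["v300", "v310"]),
    ],
)
found = btree_get(root, 251)
```

Each step of the loop corresponds to following one page reference, so the number of pages read equals the depth of the tree.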
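The number of references to child pages in one page of the B-tree is called the *branching factor*. For example, in [Figure 4-5](#fig_storage_b_tree) the branching factor is six. In practice, the branching factor depends on the amount of space required to store the page references and the range boundaries, but typically it is several hundred.
If you want to update the value for an existing key in a B-tree, you search for the leaf page containing that key and overwrite that page on disk with a version that contains the new value. If you want to add a new key, you need to find the page whose range encompasses the new key and add it to that page. If there is not enough free space in the page to accommodate the new key, the page is split into two half-full pages, and the parent page is updated to account for the new subdivision of key ranges.
{{< figure src="/fig/ddia_0406.png" id="fig_storage_b_tree_split" caption="Figure 4-6. Growing a B-tree by splitting a page on the boundary key 337. The parent page is updated to reference both children." class="w-full my-4" >}}
In the example of [Figure 4-6](#fig_storage_b_tree_split), we want to insert the key 334, but the page for the range 333–345 is already full. We therefore split it into a page for the range 333–337 (including the new key) and a page for 337–344. We also have to update the parent page to reference both children, with a boundary value of 337 between them. If the parent page does not have enough space for the new reference, it may also need to be split, and the splits can continue all the way to the root of the tree. When the root is split, we create a new root above it. Deleting keys (which may require nodes to be merged) is more complex [^5].
This algorithm ensures that the tree remains *balanced*: a B-tree with *n* keys always has a depth of *O*(log *n*). Most databases can fit into a B-tree that is three or four levels deep, so you do not need to follow many page references to find the page you are looking for. (A four-level tree of 4 KiB pages with a branching factor of 500 can store up to 250 TB.)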
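One way to arrive at roughly that figure is the back-of-the-envelope calculation below. It assumes that four levels of branching address 500⁴ pages of 4 KiB each; the exact accounting behind the number in the text is not stated, so treat this as an approximation.

```python
branching_factor = 500
page_size = 4 * 1024   # 4 KiB in bytes
levels = 4

# Each level multiplies the number of reachable pages by the branching factor.
addressable_pages = branching_factor ** levels
capacity_bytes = addressable_pages * page_size
capacity_tb = capacity_bytes / 10**12   # decimal terabytes
```

With these inputs `capacity_tb` comes out at 256, which matches the quoted order of magnitude (rounding a page to 4,000 bytes gives exactly 250).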
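#### Making B-trees reliable {#sec_storage_btree_wal}
The basic underlying write operation of a B-tree is to overwrite a page on disk with new data. It is assumed that the overwrite does not change the location of the page; that is, all references to that page remain intact when the page is overwritten. This is in stark contrast to log-structured indexes such as LSM-trees, which only append to files (and eventually delete obsolete files) but never modify files in place.
Overwriting several pages at once, as in a page split, is a dangerous operation: if the database crashes after only some of the pages have been written, you end up with a corrupted tree (for example, there may be an *orphan* page that is not a child of any parent). If the hardware cannot atomically write an entire page, you can also end up with a partially written page (this is known as a *torn page* [^23]).
To make the database resilient to crashes, B-tree implementations commonly include an additional data structure on disk: a *write-ahead log* (WAL). This is an append-only file to which every B-tree modification must be written before it can be applied to the pages of the tree itself. When the database comes back up after a crash, this log is used to restore the B-tree to a consistent state [^2] [^24]. In filesystems, the equivalent mechanism is known as *journaling*.
To improve performance, B-tree implementations typically do not immediately write every modified page to disk, but first buffer the B-tree pages in memory for a while. The write-ahead log then also ensures that data is not lost in the case of a crash: as long as the data has been written to the WAL and flushed to disk using the `fsync()` system call, the data is durable, because the database will be able to recover it after a crash [^25].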
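The write-ahead principle can be sketched as follows. The record format (a hex CRC-32, a space, and a JSON payload per line) and the function names are assumptions made for this example; real databases use compact binary formats and group many records per `fsync`.

```python
import json
import os
import tempfile
import zlib


def wal_append(path, record):
    """Append one record with a checksum, and force it to disk before returning."""
    payload = json.dumps(record).encode("utf-8")
    line = f"{zlib.crc32(payload):08x} ".encode("ascii") + payload + b"\n"
    with open(path, "ab") as f:
        f.write(line)
        f.flush()
        os.fsync(f.fileno())  # durable once this call returns


def wal_replay(path):
    """Return all intact records, stopping at the first torn or corrupt entry."""
    records = []
    with open(path, "rb") as f:
        for line in f:
            checksum, _, payload = line.rstrip(b"\n").partition(b" ")
            try:
                ok = int(checksum, 16) == zlib.crc32(payload)
            except ValueError:
                ok = False
            if not ok:
                break
            records.append(json.loads(payload))
    return records


wal_path = os.path.join(tempfile.mkdtemp(), "wal.log")
wal_append(wal_path, {"page": 7, "op": "set", "key": 334, "value": "a"})
wal_append(wal_path, {"page": 7, "op": "set", "key": 335, "value": "b"})
with open(wal_path, "ab") as f:
    f.write(b"3f2a")  # simulate a crash partway through writing a record
recovered = wal_replay(wal_path)
```

On recovery, the two complete records are replayed and the partially written third entry is discarded because its checksum does not match.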
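#### B-tree variants {#b-tree-variants}
As B-trees have been around for so long, many variants have been developed over the years. To mention just a few:
* Instead of overwriting pages and maintaining a WAL for crash recovery, some databases (such as LMDB) use a copy-on-write scheme [^26]. A modified page is written to a different location, and new versions of the parent pages in the tree are created, pointing at the new location. This approach is also useful for concurrency control, as we shall see in ["Snapshot Isolation and Repeatable Read"](/tw/ch8#sec_transactions_snapshot_isolation).
* We can save space in pages by not storing the entire key, but abbreviating it. Especially in pages in the interior of the tree, keys only need to provide enough information to act as boundaries between key ranges. Packing more keys into a page allows the tree to have a higher branching factor, and thus fewer levels.
* To speed up scans over a key range in sorted order, some B-tree implementations try to lay out the tree so that leaf pages appear in sequential order on disk, reducing the number of disk seeks. However, it is difficult to maintain that order as the tree grows.
* Additional pointers have been added to the tree. For example, each leaf page may have references to its sibling pages to the left and right, which allows scanning keys in order without jumping back to parent pages.
### Comparing B-Trees and LSM-Trees {#sec_storage_btree_lsm_comparison}
As a rule of thumb, LSM-trees are better suited for write-heavy applications, whereas B-trees are faster for reads [^27] [^28]. However, benchmarks are often sensitive to the details of the workload. You need to test systems with your particular workload in order to make a valid comparison. Moreover, it is not a strict either/or choice between LSM-trees and B-trees: storage engines sometimes blend characteristics of both approaches, for example by having multiple B-trees and merging them LSM-style. In this section we briefly discuss a few things that are worth considering when measuring the performance of a storage engine.
#### Read performance {#read-performance}
In a B-tree, looking up a key involves reading one page at each level of the tree. Since the number of levels is usually quite small, reads from a B-tree are generally fast and have predictable performance. In an LSM storage engine, reads often have to check several different SSTables at different stages of compaction, but Bloom filters help reduce the number of actual disk I/O operations required. Both approaches can perform well, and which is faster depends on the details of the storage engine and the workload.
Range queries are simple and fast on B-trees, because they can use the sorted structure of the tree. On LSM storage, range queries can also take advantage of the SSTable sorting, but they need to scan all the segments in parallel and combine the results. Bloom filters do not help with range queries (since you would need to compute the hash of every possible key within the range, which is impractical), making range queries more expensive than point queries in the LSM approach [^29].
High write throughput can cause latency spikes in a log-structured storage engine if the memtable fills up. This happens if data cannot be written out to disk fast enough, perhaps because the compaction process cannot keep up with incoming writes. Many storage engines, including RocksDB, apply *backpressure* in this situation: they suspend all reads and writes until the memtable has been written out to disk [^30] [^31].
Regarding read throughput, modern SSDs (especially NVMe) can perform many independent read requests in parallel. Both LSM-trees and B-trees are able to provide high read throughput, but storage engines need to be carefully designed to take advantage of this parallelism [^32].
#### Sequential vs. random writes {#sidebar_sequential}
With a B-tree, if the application writes keys that are scattered all over the key space, the resulting disk operations are also scattered randomly, since the pages that the storage engine needs to overwrite could be located anywhere on disk. A log-structured storage engine, on the other hand, writes entire segment files at a time (either when writing out the memtable or when compacting existing segments), and these are much bigger than a page in a B-tree.
A pattern of many small, scattered writes (as found in B-trees) is called *random writes*, while a pattern of fewer, larger writes (as found in LSM-trees) is called *sequential writes*. Disks generally have higher sequential write throughput than random write throughput, which means that a log-structured storage engine can generally handle higher write throughput than a B-tree on the same hardware. This difference is particularly large on spinning-disk hard drives (HDDs); on the solid-state drives (SSDs) that most databases use today, the difference is smaller but still noticeable (see ["Sequential vs. random writes on SSDs"](#sidebar_sequential)).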
--------
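> [!TIP] Sequential vs. random writes on SSDs
>
> On spinning-disk hard drives (HDDs), sequential writes are much faster than random writes: a random write has to mechanically move the disk head to a new position and wait for the right part of the platter to pass underneath the head, which takes several milliseconds, an eternity on computing timescales. However, SSDs (solid-state drives), including NVMe (Non-Volatile Memory Express, i.e., flash memory attached to the PCI Express bus), have now overtaken HDDs for many use cases, and they are not subject to such mechanical limitations.
>
> Nevertheless, SSDs also have higher throughput for sequential writes than for random writes. The reason is that flash memory can be read or written one page (typically 4 KiB) at a time, but it can only be erased one block (typically 512 KiB) at a time. Some of the pages in a block may contain valid data, whereas others may contain data that is no longer needed. Before erasing a block, the controller must first move the pages containing valid data into other blocks; this process is called *garbage collection* (GC) [^33].
>
> A sequential write workload writes larger chunks of data at a time, so it is likely that a whole 512 KiB block belongs to a single file; when that file is later deleted, the whole block can be erased without having to perform any GC. With a random write workload, on the other hand, a block is more likely to contain a mixture of pages with valid and invalid data, so the GC has to do more work before a block can be erased [^34] [^35] [^36].
>
> The write bandwidth consumed by GC is not available to the application. Moreover, the additional writes performed by GC contribute to wear on the flash memory; therefore, random writes wear out the drive faster than sequential writes.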
--------
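#### Write amplification {#write-amplification}
With any type of storage engine, one write request from the application turns into multiple I/O operations on the underlying disk. With LSM-trees, a value is first written to the log for durability, then written again when the memtable is written to disk, and again every time the key-value pair takes part in a compaction. (If the values are significantly larger than the keys, this overhead can be reduced by storing values separately from keys and performing compaction only on SSTables that contain keys and references to values [^37].)
A B-tree index must write every piece of data at least twice: once to the write-ahead log, and once to the tree page itself. In addition, it sometimes needs to write out an entire page, even if only a few bytes in that page have changed, to ensure that the B-tree can be correctly recovered after a crash or power failure [^38] [^39].
If you take the total number of bytes written to disk in some workload and divide it by the number of bytes you would have to write if you simply wrote an append-only log with no index, you get the *write amplification*. (Sometimes write amplification is defined in terms of I/O operations rather than bytes.) In write-heavy applications, the bottleneck might be the rate at which the database can write to disk. In this case, the higher the write amplification, the fewer writes per second it can handle within the available disk bandwidth.
Write amplification is a problem in both LSM-trees and B-trees. Which one is better depends on various factors, such as the length of your keys and values, and how often you overwrite existing keys versus inserting new ones. For typical workloads, LSM-trees tend to have lower write amplification, because they do not have to write entire pages and they can compress blocks of the SSTable [^40]. This is another factor that makes LSM storage engines well suited for write-heavy workloads.
Besides affecting throughput, write amplification is also relevant for the wear on SSDs: a storage engine with lower write amplification will wear out the SSD less quickly.
When measuring the write throughput of a storage engine, it is important to run the experiment for long enough that the effects of write amplification become clear. When writing to an empty LSM-tree, there are no compactions going on yet, so all of the disk bandwidth is available for new writes. As the database grows, new writes need to share the disk bandwidth with compaction.
#### Disk space usage {#disk-space-usage}
B-trees can become *fragmented* over time: for example, if a large number of keys are deleted, the database file may contain many pages that are no longer used by the B-tree. Subsequent additions to the B-tree can use those free pages, but they cannot easily be returned to the operating system because they are in the middle of the file, so they still take up space on the filesystem. Databases therefore need a background process that moves pages around to place them better, such as the vacuum process in PostgreSQL [^25].
Fragmentation is less of a problem in LSM-trees, since the compaction process periodically rewrites the data files anyway, and SSTables do not have pages with unused space. Moreover, blocks of key-value pairs in SSTables compress better, so they often produce smaller files on disk than B-trees. Keys and values that have been overwritten continue to consume space until they are removed by a compaction, but this overhead is quite low when using leveled compaction [^40] [^41]. Size-tiered compaction (see ["Compaction strategies"](#sec_storage_lsm_compaction)) uses more disk space, especially temporarily during compaction.
Having multiple copies of some data on disk can also be a problem when you need to delete data and be confident that it really has been deleted (perhaps to comply with data protection regulations). For example, in most LSM storage engines a deleted record may still exist in the higher levels until the tombstone representing the deletion has propagated through all of the compaction levels, which may take a long time. Specialized storage engine designs can propagate deletions faster [^42].
On the other hand, the immutable nature of SSTable segment files is useful if you want to take a snapshot of a database at some point in time (for example, for a backup, or to create a copy of the database for testing): you can write out the memtable and record which segment files existed at that point in time. As long as you do not delete the files that are part of the snapshot, you do not need to actually copy them. In a B-tree whose pages are overwritten, taking such a snapshot efficiently is more difficult.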
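### Multi-Column and Secondary Indexes {#sec_storage_index_multicolumn}
So far we have only discussed key-value indexes, which are like a *primary key* index in the relational model. A primary key uniquely identifies one row in a relational table, one document in a document database, or one vertex in a graph database. Other records in the database can refer to that row/document/vertex by its primary key (or ID), and the index is used to resolve such references.
It is also very common to have *secondary indexes*. In relational databases, you can create several secondary indexes on the same table using the `CREATE INDEX` command, allowing you to search by columns other than the primary key. For example, in [Figure 3-1](/tw/ch3#fig_obama_relational) in [Chapter 3](/tw/ch3#ch_datamodels) you would most likely have a secondary index on the `user_id` columns, so that you can find all the rows belonging to the same user in each of the tables.
A secondary index can easily be constructed from a key-value index. The main difference is that in a secondary index, the indexed values are not necessarily unique; that is, there might be many rows (documents, vertices) under the same index entry. This can be solved in two ways: either by making each value in the index a list of matching row identifiers (like a postings list in a full-text index), or by making each entry unique by appending a row identifier to it. Both storage engines with in-place updates (such as B-trees) and log-structured storage can be used to implement an index.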
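The two ways of handling non-unique index values can be sketched as follows. The table contents, column names, and helper function are made up for illustration, and a plain dictionary and a sorted list stand in for whatever on-disk structure a real engine would use.

```python
import bisect
from collections import defaultdict

# Primary data: row ID -> row (illustrative values).
rows = {
    1: {"user_id": 251, "job_title": "Co-chair"},
    2: {"user_id": 17, "job_title": "Engineer"},
    3: {"user_id": 251, "job_title": "Co-founder"},
}

# Option 1: each index entry maps a value to a list of matching row IDs.
postings = defaultdict(list)
for row_id, row in rows.items():
    postings[row["user_id"]].append(row_id)

# Option 2: make every entry unique by appending the row ID to the indexed value.
composite = sorted((row["user_id"], row_id) for row_id, row in rows.items())


def lookup_composite(user_id):
    # All entries for user_id form one contiguous range in the sorted index.
    # (Assumes integer values, so user_id + 1 is the next possible value.)
    lo = bisect.bisect_left(composite, (user_id,))
    hi = bisect.bisect_left(composite, (user_id + 1,))
    return [row_id for _, row_id in composite[lo:hi]]
```

Both `postings[251]` and `lookup_composite(251)` return the IDs of the two rows belonging to user 251; the second form works on any sorted key-value structure because every entry is unique.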
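#### Storing values within the index {#sec_storage_index_heap}
The key in an index is the thing that queries search by, but the value can be one of several things:
* If the actual data (row, document, vertex) is stored directly within the index structure, it is called a *clustered index*. For example, in MySQL's InnoDB storage engine, the primary key of a table is always a clustered index, and in SQL Server you can specify one clustered index per table [^43].
* Alternatively, the value can be a reference to the actual data: either the primary key of the row in question (InnoDB does this for secondary indexes) or a direct reference to a location on disk. In the latter case, the place where rows are stored is known as a *heap file*, and it stores data in no particular order (it may be append-only, or it may keep track of deleted rows in order to overwrite them with new data later). For example, Postgres uses the heap file approach [^44].
* A middle ground between the two is a *covering index* or *index with included columns*, which stores *some* of a table's columns within the index, in addition to storing the full row on the heap or in the primary key clustered index [^45]. This allows some queries to be answered using the index alone, without having to resolve the primary key or look in the heap file (in which case the index is said to *cover* the query). This can make some queries faster, but the duplication of data means that the index uses more disk space and slows down writes.
The indexes discussed so far only map a single key to a value. If you need to query multiple columns of a table (or multiple fields in a document) simultaneously, see ["Multidimensional and Full-Text Indexes"](#sec_storage_multidimensional).
When updating a value without changing the key, the heap file approach can allow the record to be overwritten in place, provided that the new value is not larger than the old value. The situation is more complicated if the new value is larger, as it probably needs to be moved to a new location in the heap where there is enough space. In that case, either all indexes need to be updated to point at the new heap location of the record, or a forwarding pointer is left behind in the old heap location [^2].
### Keeping everything in memory {#sec_storage_inmemory}
The data structures discussed so far in this chapter have all been answers to the limitations of disks. Compared to main memory, disks are awkward to deal with. With both magnetic disks and SSDs, data on disk needs to be laid out carefully if you want good performance on reads and writes. However, we tolerate this awkwardness because disks have two significant advantages: they are durable (their contents are not lost if the power is turned off), and they have a lower cost per gigabyte than RAM.
As RAM becomes cheaper, the cost-per-gigabyte argument is eroding. Many datasets are simply not that big, so it is quite feasible to keep them entirely in memory, potentially distributed across several machines. This has led to the development of *in-memory databases*.
Some in-memory key-value stores, such as Memcached, are intended for caching use only, where it is acceptable for data to be lost if a machine is restarted. But other in-memory databases aim for durability, which can be achieved with special hardware (such as battery-powered RAM), by writing a log of changes to disk, by writing periodic snapshots to disk, or by replicating the in-memory state to other machines.
When an in-memory database is restarted, it needs to reload its state, either from disk or over the network from a replica (unless special hardware is used). Despite writing to disk, it is still an in-memory database, because the disk is merely used as an append-only log for durability, and reads are served entirely from memory. Writing to disk also has operational advantages: files on disk can easily be backed up, inspected, and analyzed by external utilities.
Products such as VoltDB, SingleStore, and Oracle TimesTen are in-memory databases with a relational model, and the vendors claim that they can offer big performance improvements by removing all the overheads associated with managing on-disk data structures [^46] [^47]. RAMCloud is an open source, in-memory key-value store with durability (using a log-structured approach for the data in memory as well as the data on disk) [^48].
Redis and Couchbase provide weak durability by writing to disk asynchronously.
Counterintuitively, the performance advantage of in-memory databases is not due to the fact that they do not need to read from disk. Even a disk-based storage engine may never need to read from disk if you have enough memory, because the operating system caches recently used disk blocks in memory anyway. Rather, they can be faster because they avoid the overhead of encoding in-memory data structures in a form that can be written to disk [^49].
Besides performance, another interesting area for in-memory databases is providing data models that are difficult to implement with disk-based indexes. For example, Redis offers a database-like interface to various data structures, such as priority queues and sets. Because it keeps all data in memory, its implementation is comparatively simple.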
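## Data Storage for Analytics {#sec_storage_analytics}
The data model of a data warehouse is most commonly relational, because SQL is generally a good fit for analytic queries. There are many graphical data analysis tools that generate SQL queries, visualize the results, and allow analysts to explore the data (through operations such as *drill-down* and *slicing and dicing*).
On the surface, a data warehouse and a relational OLTP database look similar, because they both have a SQL query interface. However, the internals of the systems can look quite different, because they are optimized for very different query patterns. Many database vendors now focus on supporting either transaction processing or analytics workloads, but not both.
Some databases, such as Microsoft SQL Server, SAP HANA, and SingleStore, support transaction processing and data warehousing in the same product. However, these hybrid transactional and analytical processing (HTAP) databases (introduced in ["Data Warehousing"](/tw/ch1#sec_introduction_dwh)) are increasingly becoming two separate storage and query engines that happen to be accessible through a common SQL interface [^50] [^51] [^52] [^53].
### Cloud Data Warehouses {#sec_cloud_data_warehouses}
Data warehouse vendors such as Teradata, Vertica, and SAP HANA sell both on-premises warehouses under commercial licenses and cloud-based solutions. But as many of their customers move to the cloud, new cloud data warehouses such as Google Cloud BigQuery, Amazon Redshift, and Snowflake have also become widely adopted. Unlike traditional data warehouses, cloud data warehouses take advantage of scalable cloud infrastructure such as object storage and serverless computation platforms.
Cloud data warehouses tend to integrate better with other cloud services and to be more elastic. For example, many cloud warehouses support automatic log ingestion and offer easy integration with data processing frameworks such as Google Cloud's Dataflow or Amazon Web Services' Kinesis. These warehouses are also more elastic because they decouple query computation from the storage layer [^54]. Data is persisted on object storage rather than on local disks, which makes it possible to adjust storage capacity and the compute resources for queries independently, as we saw previously in ["Cloud-Native System Architecture"](/tw/ch1#sec_introduction_cloud_native).
Open source data warehouses such as Apache Hive, Trino, and Apache Spark have also evolved with the cloud. As data storage for analytics has moved to data lakes on object storage, open source warehouses have begun to break apart [^55]. The following components, which were previously integrated in a single system such as Apache Hive, are now often implemented as separate components:
Query engine
: Query engines such as Trino, Apache DataFusion, and Presto parse SQL queries, optimize them into execution plans, and execute those plans against the data. Execution usually requires parallel, distributed data processing tasks. Some query engines provide built-in task execution, while others choose to use third-party execution frameworks such as Apache Spark or Apache Flink.
Storage format
: The storage format determines how the rows of a table are encoded as bytes in a file, which is then typically stored in object storage or a distributed filesystem [^12]. This data can then be accessed by the query engine, but also by other applications that use the data lake. Examples of such storage formats include Parquet, ORC, Lance, and Nimble, and we will see more about them in the next section.
Table format
: Files written in Apache Parquet and similar storage formats are typically immutable once written. To support row inserts and deletions, a table format such as Apache Iceberg or Databricks's Delta format is used. A table format specifies which files constitute a table, along with the format in which the table's schema is defined. Such formats also offer advanced features such as time travel (the ability to query a table as it was at a previous point in time), garbage collection, and even transactions.
Data catalog
: Much as a table format defines which files make up a table, a data catalog defines which tables make up a database. Catalogs are used to create, rename, and drop tables. Unlike storage and table formats, data catalogs such as Snowflake's Polaris and Databricks's Unity Catalog usually run as standalone services that can be queried through a REST interface. Apache Iceberg also offers a catalog, which can be run inside a client or as a separate process. Query engines use catalog information when reading and writing tables. Traditionally, catalogs and query engines have been integrated, but decoupling them has enabled data discovery and data governance systems (discussed in ["Data Systems, Law, and Society"](/tw/ch1#sec_introduction_compliance)) to access a catalog's metadata as well.
### Column-Oriented Storage {#sec_storage_column}
As discussed in ["Stars and Snowflakes: Schemas for Analytics"](/tw/ch3#sec_datamodels_analytics), data warehouses by convention often use a relational schema with a big fact table that contains foreign key references into dimension tables. If you have trillions of rows and petabytes of data in your fact tables, storing and querying them efficiently becomes a challenging problem. Dimension tables are usually much smaller (millions of rows), so in this section we will focus on the storage of facts.
Although fact tables are often over 100 columns wide, a typical data warehouse query only accesses 4 or 5 of them at a time (`"SELECT *"` queries are rarely needed for analytics) [^52]. Take the query in [Example 4-1](#fig_storage_analytics_query): it accesses a large number of rows (every occurrence of someone buying fruit or candy during the 2024 calendar year), but it only needs to access three columns of the `fact_sales` table: `date_key`, `product_sk`, and `quantity`. The query ignores all other columns.
{{< figure id="fig_storage_analytics_query" title="Example 4-1. Analyzing whether people are more inclined to buy fresh fruit or candy, depending on the day of the week" class="w-full my-4" >}}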
```sql
SELECT
dim_date.weekday, dim_product.category,
SUM(fact_sales.quantity) AS quantity_sold
FROM fact_sales
JOIN dim_date ON fact_sales.date_key = dim_date.date_key
JOIN dim_product ON fact_sales.product_sk = dim_product.product_sk
WHERE
dim_date.year = 2024 AND
dim_product.category IN ('Fresh fruit', 'Candy')
GROUP BY
dim_date.weekday, dim_product.category;
```
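How can we execute this query efficiently?
In most OLTP databases, storage is laid out in a *row-oriented* fashion: all the values from one row of a table are stored next to each other. Document databases are similar: an entire document is typically stored as one contiguous sequence of bytes. You can see this in the CSV example of [Figure 4-1](#fig_storage_csv_hash_index).
To process a query like [Example 4-1](#fig_storage_analytics_query), you may have indexes on `fact_sales.date_key` and/or `fact_sales.product_sk` that tell the storage engine where to find all the sales for a particular date or for a particular product. But a row-oriented storage engine still needs to load all of those rows (each consisting of over 100 attributes) from disk into memory, parse them, and filter out those that do not meet the required conditions. That can take a long time.
The idea behind *column-oriented* (or *columnar*) storage is simple: do not store all the values from one row together, but store all the values from each *column* together instead [^56]. If each column is stored separately, a query only needs to read and parse the columns that are used in that query, which can save a lot of work. [Figure 4-7](#fig_column_store) shows this principle using an expanded version of the fact table from [Figure 3-5](/tw/ch3#fig_dwh_schema).
--------
> [!NOTE]
> Column storage is easiest to understand in a relational data model, but it applies equally to nonrelational data. For example, Parquet [^57] is a columnar storage format that supports a document data model, based on Google's Dremel [^58], using a technique known as *shredding* or *striping* [^59].
--------
{{< figure src="/fig/ddia_0407.png" id="fig_column_store" caption="Figure 4-7. Storing relational data by column, rather than by row." class="w-full my-4" >}}
The column-oriented storage layout relies on each column storing the rows in the same order. Thus, if you need to reassemble an entire row, you can take the 23rd entry from each of the individual columns and put them together to form the 23rd row of the table.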
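The principle can be sketched with one Python list per column. The values below are made up for illustration, and the helper names are hypothetical; a real engine would store each column as a compressed file or file section rather than a list.

```python
# One list per column; position k in every list belongs to row k.
columns = {
    "date_key": [240102, 240102, 240103, 240103],
    "product_sk": [69, 30, 31, 69],
    "store_sk": [4, 3, 5, 2],
    "quantity": [1, 3, 5, 2],
}


def get_row(columns, k):
    """Reassemble row k by taking the k-th entry of every column."""
    return {name: values[k] for name, values in columns.items()}


def total_quantity(columns, product_sk):
    """An aggregate that touches only two of the columns."""
    return sum(
        quantity
        for product, quantity in zip(columns["product_sk"], columns["quantity"])
        if product == product_sk
    )
```

`total_quantity` never reads `date_key` or `store_sk`, which is the saving that column-oriented storage aims for; `get_row` shows that whole rows can still be reconstructed when needed.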
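In practice, columnar storage engines do not actually store an entire column (containing perhaps trillions of rows) in one go. Instead, they break the table into blocks of thousands or millions of rows, and within each block they store the values from each column separately [^60]. Since many queries are restricted to a particular date range, it is common to make each block contain the rows for a particular timestamp range. A query then only needs to load the columns it needs from those blocks that overlap with the required date range.
Columnar storage is used in almost all analytic databases nowadays [^60], ranging from large-scale cloud data warehouses such as Snowflake [^61] to single-node embedded databases such as DuckDB [^62] and product analytics systems such as Pinot [^63] and Druid [^64]. It is used in storage formats such as Parquet, ORC [^65] [^66], Lance [^67], and Nimble [^68], and in in-memory analytics formats such as Apache Arrow [^65] [^69] and Pandas/NumPy [^70]. Some time-series databases, such as InfluxDB IOx [^71] and TimescaleDB [^72], are also based on column-oriented storage.
#### Column Compression {#sec_storage_column_compression}
Besides only loading from disk the columns that are required for a query, we can further reduce the demands on disk throughput and network bandwidth by compressing data. Fortunately, column-oriented storage often lends itself very well to compression.
Take a look at the sequences of values for each column in [Figure 4-7](#fig_column_store): they often look quite repetitive, which is a good sign for compression. Depending on the data in the column, different compression techniques can be used. One technique that is particularly effective in data warehouses is *bitmap encoding*, illustrated in [Figure 4-8](#fig_bitmap_index).
{{< figure src="/fig/ddia_0408.png" id="fig_bitmap_index" caption="Figure 4-8. Compressed, bitmap-indexed storage of a single column." class="w-full my-4" >}}
Often, the number of distinct values in a column is small compared to the number of rows (for example, a retailer may have billions of sales transactions, but only 100,000 distinct products). We can take a column with *n* distinct values and turn it into *n* separate bitmaps: one bitmap for each distinct value, with one bit for each row. The bit is 1 if the row has that value, and 0 if not.
One option is to store those bitmaps using one bit per row. However, these bitmaps typically contain a lot of zeros (we say that they are *sparse*). In that case, the bitmaps can additionally be run-length encoded: count the number of consecutive zeros or ones and store that number, as shown at the bottom of [Figure 4-8](#fig_bitmap_index). Techniques such as *roaring bitmaps* switch between the two bitmap representations, using whichever is the most compact [^73]. This can make the encoding of a column remarkably efficient.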
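Bitmap encoding followed by run-length encoding can be sketched as below. The column values are illustrative, and the particular run-length convention (alternating counts of zeros and ones, always starting with zeros) is one reasonable choice made for this example, not necessarily the one used by any given system.

```python
from itertools import groupby


def bitmaps_for(column):
    """One bitmap (list of 0/1) per distinct value in the column."""
    return {v: [1 if x == v else 0 for x in column] for v in sorted(set(column))}


def run_length_encode(bits):
    """Alternating run lengths of zeros and ones, starting with zeros."""
    runs = []
    expected = 0
    for bit, group in groupby(bits):
        if bit != expected:
            runs.append(0)  # bitmap starts with a 1: record an empty run of zeros
        runs.append(len(list(group)))
        expected = 1 - bit
    return runs


product_sk = [69, 69, 69, 69, 74, 31, 31, 31, 31, 29, 30, 30, 31, 31, 31, 68, 69, 69]
bitmaps = bitmaps_for(product_sk)
encoded_31 = run_length_encode(bitmaps[31])  # [5, 4, 3, 3, 3]
encoded_69 = run_length_encode(bitmaps[69])  # [0, 4, 12, 2]
```

For a sparse bitmap over billions of rows, a handful of run lengths like these replaces billions of individual bits.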
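Bitmap indexes like these are very well suited for the kinds of queries that are common in a data warehouse. For example:
`WHERE product_sk IN (31, 68, 69):`
: Load the three bitmaps for `product_sk = 31`, `product_sk = 68`, and `product_sk = 69`, and calculate the bitwise *OR* of the three bitmaps, which can be done very efficiently.
`WHERE product_sk = 30 AND store_sk = 3:`
: Load the bitmaps for `product_sk = 30` and `store_sk = 3`, and calculate the bitwise *AND*. This works because the columns contain the rows in the same order, so the *k*th bit in one column's bitmap corresponds to the same row as the *k*th bit in another column's bitmap.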
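The two queries above can be sketched using Python integers as bitmaps, where bit *k* corresponds to row *k*. The column contents and helper names are made up for illustration.

```python
def bitmap(column, value):
    """Bitmap as a Python int: bit k is 1 if row k has the given value."""
    bits = 0
    for k, x in enumerate(column):
        if x == value:
            bits |= 1 << k
    return bits


def matching_rows(bits):
    """Row numbers whose bit is set."""
    return [k for k in range(bits.bit_length()) if (bits >> k) & 1]


product_sk = [69, 30, 31, 30, 68, 31]
store_sk = [4, 3, 5, 2, 3, 3]

# WHERE product_sk IN (31, 68, 69): bitwise OR of three bitmaps
in_query = bitmap(product_sk, 31) | bitmap(product_sk, 68) | bitmap(product_sk, 69)

# WHERE product_sk = 30 AND store_sk = 3: bitwise AND across two columns
and_query = bitmap(product_sk, 30) & bitmap(store_sk, 3)

in_rows = matching_rows(in_query)    # rows 0, 2, 4, 5
and_rows = matching_rows(and_query)  # row 1
```

Each `|` or `&` processes many rows per machine instruction, which is why these predicates are cheap to evaluate on bitmaps.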
點陣圖也可用於回答圖查詢,例如查詢社交網路中被使用者 *X* 關注並且也關注使用者 *Y* 的所有使用者 [^74]。列式資料庫還有各種其他壓縮方案,你可以在參考文獻中找到 [^75]。
--------
> [!NOTE]
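> Don't confuse column-oriented databases with the *wide-column* (also known as *column-family*) data model, in which a row can have thousands of columns, and there is no need for all the rows to have the same columns [^9]. Despite the similarity in name, wide-column databases are row-oriented, since they store all values from a row together. Google's Bigtable, Apache Accumulo, and HBase are examples of the wide-column model.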
--------
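#### Sort Order in Column Storage {#sort-order-in-column-storage}
In a column store, it doesn't necessarily matter in which order the rows are stored. It's easiest to store them in the order in which they were inserted, since then inserting a new row just means appending to each of the columns. However, we can choose to impose an order, like we did with SSTables previously, and use that as an indexing mechanism.
Note that it wouldn't make sense to sort each column independently, because then we would no longer know which items in the columns belong to the same row. We can only reconstruct a row because we know that the *k*th item in one column belongs to the same row as the *k*th item in another column.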
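Rather, the data needs to be sorted an entire row at a time, even though it is stored by column. The administrator of the database can choose the columns by which the table should be sorted, using their knowledge of common queries. For example, if queries often target date ranges, such as the last month, it might make sense to make `date_key` the first sort key. Then the query can scan only the rows from the last month, which will be much faster than scanning all rows.
A second column can determine the sort order of any rows that have the same value in the first column. For example, if `date_key` is the first sort key in [Figure 4-7](#fig_column_store), it might make sense for `product_sk` to be the second sort key so that all sales for the same product on the same day are grouped together in storage. That will help queries that need to group or filter sales by product within a certain date range.
Another advantage of sorted order is that it can help with compression of columns. If the primary sort column does not have many distinct values, then after sorting, it will have long sequences where the same value is repeated many times in a row. A simple run-length encoding, like we used for the bitmaps in [Figure 4-8](#fig_bitmap_index), could compress that column down to a few kilobytes, even if the table has billions of rows.
That compression effect is strongest on the first sort key. The second and third sort keys will be more jumbled up, and thus not have such long runs of repeated values. Columns further down the sorting priority appear in essentially random order, so they probably won't compress as well. But having the first few columns sorted is still a win overall.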
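To make the effect concrete, here is a small sketch with invented rows. It sorts whole rows by (`date_key`, `product_sk`), splits them into columns, and run-length encodes each column as (value, run length) pairs:

```python
from itertools import groupby

def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]

# Hypothetical rows: (date_key, product_sk, quantity), in insertion order
rows = [
    (140103, 31, 1), (140102, 69, 2), (140103, 69, 1),
    (140102, 31, 4), (140102, 69, 1), (140103, 31, 3),
]

# Sort entire rows, never individual columns, so that row k stays aligned.
sorted_rows = sorted(rows, key=lambda row: (row[0], row[1]))
date_column = [row[0] for row in sorted_rows]
product_column = [row[1] for row in sorted_rows]

unsorted_date_runs = run_length_encode([row[0] for row in rows])
sorted_date_runs = run_length_encode(date_column)
sorted_product_runs = run_length_encode(product_column)
```

In this toy data the first sort key collapses to two runs, while the second sort key still needs four, because its runs restart every time the first key changes.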
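#### Writing to Column-Oriented Storage {#writing-to-column-oriented-storage}
We saw in ["Characterizing Transaction Processing and Analytics"](/tw/ch1#sec_introduction_oltp) that reads in data warehouses tend to consist of aggregations over a large number of rows; column-oriented storage, compression, and sorting all help to make those read queries faster. Writes in a data warehouse tend to be bulk imports of data, often via an ETL process.
With columnar storage, writing an individual row somewhere in the middle of a sorted table would be very inefficient, as you would have to rewrite all the compressed columns from the insertion position onwards. However, a bulk write of many rows at once amortizes the cost of rewriting those columns, making it efficient.
A log-structured approach is often used to perform writes in batches. All writes first go to a row-oriented, sorted, in-memory store. When enough writes have accumulated, they are merged with the column-encoded files on disk and written to new files in bulk. As old files remain immutable, and new files are written in one go, object storage is well suited for storing these files.
Queries need to examine both the column data on disk and the recent writes in memory, and combine the two. The query execution engine hides this distinction from the user. From an analyst's point of view, data that has been modified with inserts, updates, or deletes is immediately reflected in subsequent queries. Snowflake, Vertica, Apache Pinot, Apache Druid, and many others do this [^61] [^63] [^64] [^76].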
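The write path can be sketched roughly as below. `ToyColumnStore` is a hypothetical class invented for this illustration, and it simplifies heavily: "files" are in-memory dictionaries, a flush writes a new segment instead of merging with existing ones, and updates and deletes are not handled.

```python
class ToyColumnStore:
    def __init__(self, column_names, flush_threshold=3):
        self.column_names = column_names
        self.flush_threshold = flush_threshold
        self.buffer = []      # recent writes, row-oriented, in memory
        self.segments = []    # immutable column-encoded "files"

    def insert(self, row):
        self.buffer.append(row)
        if len(self.buffer) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Sort the buffered rows and write them out as one new columnar segment."""
        rows = sorted(self.buffer)
        segment = {name: tuple(row[i] for row in rows)
                   for i, name in enumerate(self.column_names)}
        self.segments.append(segment)   # never modified after this point
        self.buffer = []

    def sum_column(self, name):
        """Combine column data from the segments with rows still in the buffer."""
        i = self.column_names.index(name)
        flushed = sum(sum(segment[name]) for segment in self.segments)
        recent = sum(row[i] for row in self.buffer)
        return flushed + recent

store = ToyColumnStore(("date_key", "quantity"), flush_threshold=3)
for row in [(140103, 2), (140101, 5), (140102, 1), (140104, 4)]:
    store.insert(row)
total = store.sum_column("quantity")
```

The caller of `sum_column` cannot tell which rows were already flushed and which are still buffered, which mirrors how the query engine hides that distinction.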
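### Query Execution: Compilation and Vectorization {#sec_storage_vectorized}
A complex SQL query for analytics is broken down into a *query plan* consisting of multiple stages, called *operators*, which may be distributed across multiple machines for parallel execution. Query planners can perform a lot of optimizations by choosing which operators to use, in which order to perform them, and where to run each operator.
Within each operator, the query engine needs to do various things with the values in a column, such as finding all the rows where the value is among a particular set of values (perhaps as part of a join), or checking whether the value is greater than 15. It also needs to look at several columns for the same row, for example to find all sales transactions where the product is bananas and the store is a particular store of interest.
For data warehouse queries that need to scan over millions of rows, we need to worry not only about the amount of data they need to read off disk, but also about the CPU time required to execute complex operators. The simplest kind of operator is like an interpreter for a programming language: while iterating over each row, it checks a data structure representing the query to find out which comparisons or calculations it needs to perform on which columns. Unfortunately, this is too slow for many analytics purposes. Two alternative approaches for efficient query execution have emerged [^77]: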
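Query compilation
: The query engine takes the SQL query and generates code for executing it. The code iterates over the rows one by one, looks at the values in the columns of interest, performs whatever comparisons or calculations are needed, and copies the necessary values to an output buffer if the required conditions are satisfied. The query engine compiles the generated code to machine code (often using an existing compiler such as LLVM), and then runs it on the column-encoded data that has been loaded into memory. This approach to code generation is similar to the just-in-time (JIT) compilation approach that is used in the Java Virtual Machine (JVM) and similar runtimes.
Vectorized processing
: The query is interpreted, not compiled, but it is made fast by processing many values from a column in a batch, instead of iterating over rows one by one. A fixed set of predefined operators is built into the database; we can pass arguments to them and get back a batch of results [^50] [^75].
For example, we could pass the `product_sk` column and the ID of "bananas" to an equality operator, and get back a bitmap (one bit per value in the input column, which is 1 if it's a banana); we could then pass the `store_sk` column and the ID of the store of interest to the same equality operator, and get back another bitmap; and then we could pass the two bitmaps to a "bitwise AND" operator, as shown in [Figure 4-9](#fig_bitmap_and). The result would be a bitmap containing a 1 for all sales of bananas in a particular store.
{{< figure src="/fig/ddia_0409.png" id="fig_bitmap_and" caption="Figure 4-9. A bitwise AND between two bitmaps lends itself to vectorization." class="w-full my-4" >}}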
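The banana example can be mimicked with a handful of batch operators. This sketch only shows the structure of vectorized execution: each operator consumes whole columns and returns a whole batch. Plain Python lists give none of the SIMD or cache benefits a real engine gets, and the IDs, column values, and operator names are assumptions made for illustration.

```python
def equals(column, constant):
    """Equality operator: one output bit per input value."""
    return [1 if value == constant else 0 for value in column]

def bitwise_and(left, right):
    """AND operator over two equally long bitmaps."""
    return [a & b for a, b in zip(left, right)]

def select(column, bitmap):
    """Keep the values whose bit is set."""
    return [value for value, bit in zip(column, bitmap) if bit]

BANANA_SK, TARGET_STORE_SK = 30, 3   # hypothetical IDs

product_sk = [69, 30, 31, 30, 30]
store_sk = [3, 3, 3, 5, 3]
net_price = [4.0, 1.5, 2.0, 1.5, 3.0]

is_banana = equals(product_sk, BANANA_SK)
in_store = equals(store_sk, TARGET_STORE_SK)
matches = bitwise_and(is_banana, in_store)
banana_revenue = sum(select(net_price, matches))
```

The query itself is expressed as a short sequence of operator calls, so the per-row interpretation overhead is paid once per batch rather than once per row.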
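The two approaches are very different in terms of their implementation, but both are used in practice [^77]. Both can achieve very good performance by taking advantage of the characteristics of modern CPUs:
* preferring sequential memory access over random access to reduce cache misses [^78],
* doing most of the work in tight inner loops (that is, with a small number of instructions and no function calls) to keep the CPU instruction processing pipeline busy and avoid branch mispredictions,
* making use of parallelism such as multiple threads and single-instruction-multi-data (SIMD) instructions [^79] [^80], and
* operating directly on compressed data without decoding it into a separate in-memory representation, which saves memory allocation and copying costs.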
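### Materialized Views and Data Cubes {#sec_storage_materialized_views}
We previously encountered *materialized views* in ["Materializing and Updating Timelines"](/tw/ch2#sec_introduction_materializing): in a relational data model, they are table-like objects whose contents are the results of some query. The difference is that a materialized view is an actual copy of the query results, written to disk, whereas a virtual view is just a shortcut for writing queries. When you read from a virtual view, the SQL engine expands it into the view's underlying query on the fly and then processes the expanded query.
When the underlying data changes, a materialized view needs to be updated accordingly. Some databases can do that automatically, and there are also systems such as Materialize that specialize in materialized view maintenance [^81]. Performing such updates means more work on writes, but materialized views can improve read performance in workloads that repeatedly need to perform the same queries.
*Materialized aggregates* are a type of materialized view that can be useful in data warehouses. As discussed earlier, data warehouse queries often involve an aggregate function, such as `COUNT`, `SUM`, `AVG`, `MIN`, or `MAX` in SQL. If the same aggregates are used by many different queries, it can be wasteful to crunch through the raw data every time. Why not cache some of the counts or sums that queries use most often? A *data cube* or *OLAP cube* does this by creating a grid of aggregates grouped by different dimensions [^82]. [Figure 4-10](#fig_data_cube) shows an example.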
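{{< figure src="/fig/ddia_0410.png" id="fig_data_cube" caption="Figure 4-10. Two dimensions of a data cube, aggregating data by summing." class="w-full my-4" >}}
Imagine for now that each fact has foreign keys to only two dimension tables; in [Figure 4-10](#fig_data_cube), these are `date_key` and `product_sk`. You can now draw a two-dimensional table, with dates along one axis and products along the other. Each cell contains the aggregate (e.g., `SUM`) of an attribute (e.g., `net_price`) of all facts with that date-product combination. Then you can apply the same aggregate along each row or column and get a summary that has been reduced by one dimension (the sales by product regardless of date, or the sales by date regardless of product).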
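A two-dimensional cube of this kind can be sketched with dictionaries. The facts below are invented, and a real data warehouse would maintain these aggregates incrementally rather than recomputing them from scratch:

```python
from collections import defaultdict

# Hypothetical facts: (date_key, product_sk, net_price)
facts = [
    (140101, 32, 3.0), (140101, 33, 5.0), (140101, 32, 2.0),
    (140102, 32, 4.0), (140102, 34, 1.0),
]

# Each cell holds SUM(net_price) for one date-product combination.
cube = defaultdict(float)
for date_key, product_sk, net_price in facts:
    cube[(date_key, product_sk)] += net_price

# Apply the same aggregate along each axis to drop one dimension.
total_by_date = defaultdict(float)
total_by_product = defaultdict(float)
for (date_key, product_sk), cell in cube.items():
    total_by_date[date_key] += cell
    total_by_product[product_sk] += cell

grand_total = sum(total_by_date.values())
```

Answering "total sales on 140101" is now a single lookup in `total_by_date`, with no scan over `facts`.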
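In general, facts often have more than two dimensions. In [Figure 3-5](/tw/ch3#fig_dwh_schema) there are five dimensions: date, product, store, promotion, and customer. It's a lot harder to imagine what a five-dimensional hypercube would look like, but the principle remains the same: each cell contains the sales for a particular date-product-store-promotion-customer combination. These values can then repeatedly be summarized along each of the dimensions.
The advantage of a materialized data cube is that certain queries become very fast because they have effectively been precomputed. For example, if you want to know the total sales per store yesterday, you just need to look at the totals along the appropriate dimension; there is no need to scan millions of rows.
The disadvantage is that a data cube doesn't have the same flexibility as querying the raw data. For example, there is no way of calculating which proportion of sales comes from items that cost more than $100, because the price isn't one of the dimensions. Most data warehouses therefore try to keep as much raw data as possible, and use aggregates such as data cubes only as a performance boost for certain queries.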
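## Multidimensional and Full-Text Indexes {#sec_storage_multidimensional}
The B-trees and LSM-trees we saw in the first half of this chapter allow range queries over a single attribute: for example, if the key is a username, you can use them as an index to efficiently find all names starting with an L. But sometimes, searching by a single attribute is not enough.
The most common type of multi-column index is called a *concatenated index*, which simply combines several fields into one key by appending one column to another (the index definition specifies in which order the fields are concatenated). This is like an old-fashioned paper phone book, which provides an index from (*lastname*, *firstname*) to phone number. Due to the sort order, the index can be used to find all the people with a particular last name, or all the people with a particular *lastname-firstname* combination. However, the index is useless if you want to find all the people with a particular first name.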
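The phone book behaviour can be sketched with a sorted list of tuple keys and binary search. The names, phone numbers, and helper functions are made up, and a sorted Python list merely stands in for a B-tree or SSTable:

```python
import bisect

# Hypothetical phone book: the index key is (last_name, first_name)
entries = sorted([
    (("Ng", "Alice"), "555-0101"),
    (("Smith", "Alice"), "555-0102"),
    (("Smith", "Bob"), "555-0103"),
    (("Taylor", "Bob"), "555-0104"),
])
keys = [key for key, _ in entries]

def find_by_last_name(last_name):
    """Prefix lookup: all matches form one contiguous range of the sorted keys."""
    lo = bisect.bisect_left(keys, (last_name,))
    hi = bisect.bisect_left(keys, (last_name + "\x00",))
    return entries[lo:hi]

def find_by_first_name(first_name):
    """The sort order does not help here, so every entry must be scanned."""
    return [entry for entry in entries if entry[0][1] == first_name]

smiths = [phone for _, phone in find_by_last_name("Smith")]
bobs = [phone for _, phone in find_by_first_name("Bob")]
```

`find_by_last_name` needs two binary searches, whereas `find_by_first_name` touches every entry because the Bobs are scattered across the sort order.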
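On the other hand, *multidimensional indexes* allow you to query several columns at once. One case where this is particularly important is geospatial data. For example, a restaurant-search website may have a database containing the latitude and longitude of each restaurant. When a user is looking at the restaurants on a map, the website needs to search for all the restaurants within the rectangular map area that the user is currently viewing. This requires a two-dimensional range query like the following: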
```sql
SELECT * FROM restaurants WHERE latitude > 51.4946 AND latitude < 51.5079
AND longitude > -0.1162 AND longitude < -0.1004;
```
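A concatenated index over the latitude and longitude columns is not able to answer that kind of query efficiently: it can give you either all the restaurants in a range of latitudes (but at any longitude), or all the restaurants in a range of longitudes (but anywhere between the North and South poles), but not both simultaneously.
One option is to translate a two-dimensional location into a single number using a space-filling curve, and then to use a regular B-tree index [^83]. More commonly, specialized spatial indexes such as R-trees or Bkd-trees [^84] are used; they divide up the space so that nearby data points tend to be grouped in the same subtree. For example, PostGIS implements geospatial indexes as R-trees using PostgreSQL's Generalized Search Tree indexing facility [^85]. It is also possible to use regularly spaced grids of triangles, squares, or hexagons [^86].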
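One well-known space-filling curve is the Z-order (Morton) curve, which interleaves the bits of the two coordinates. The sketch below is only an illustration of that idea: the 16-bit grid resolution and the function names are arbitrary choices, and it does not show how a rectangle query is broken into ranges of keys.

```python
def interleave_bits(x, y, bits=16):
    """Morton (Z-order) code: bit i of x goes to position 2i, bit i of y to 2i + 1."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code

def quantize(value, low, high, bits=16):
    """Map a float in [low, high] onto an integer grid with 2**bits steps."""
    cells = (1 << bits) - 1
    return round((value - low) / (high - low) * cells)

def z_order_key(latitude, longitude, bits=16):
    """Single integer key that can be stored in an ordinary ordered index."""
    x = quantize(longitude, -180.0, 180.0, bits)
    y = quantize(latitude, -90.0, 90.0, bits)
    return interleave_bits(x, y, bits)

key = z_order_key(51.5, -0.11)   # a point in central London
```

Points that are close on the map often, though not always, end up with nearby keys, so a rectangle query typically has to scan several key ranges and filter out false positives.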
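Multidimensional indexes are not just for geographic locations. For example, on an ecommerce website you could use a three-dimensional index on the dimensions (*red*, *green*, *blue*) to search for products in a certain range of colors, or in a database of weather observations you could have a two-dimensional index on (*date*, *temperature*) in order to efficiently search for all the observations during the year 2013 where the temperature was between 25 and 30°C. With a one-dimensional index, you would have to either scan over all the records from 2013 (regardless of temperature) and then filter them by temperature, or vice versa. A 2D index could narrow down by timestamp and temperature simultaneously [^87].
### Full-Text Search {#sec_storage_full_text}
Full-text search allows you to search a collection of text documents (web pages, product descriptions, etc.) by keywords that might appear anywhere in the text [^88]. Information retrieval is a big, specialist subject that often involves language-specific processing: for example, several Asian languages are written without spaces or punctuation between words, and therefore splitting text into words requires a model that indicates which character sequences constitute a word. Full-text search also often involves matching words that are similar but not identical (such as typos or different grammatical forms of words) and synonyms. Those problems go beyond the scope of this book.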
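However, at its core, you can think of full-text search as another kind of multidimensional query: in this case, each word that might appear in a text (a *term*) is a dimension. A document that contains term *x* has a value of 1 in dimension *x*, and a document that doesn't contain *x* has a value of 0. Searching for documents mentioning "red apples" means a query that looks for a 1 in the *red* dimension, and simultaneously a 1 in the *apples* dimension. The number of dimensions may thus be very large.
The data structure that many search engines use to answer such queries is called an *inverted index*. This is a key-value structure where the key is a term, and the value is the list of IDs of all the documents that contain the term (the *postings list*). If the document IDs are sequential numbers, the postings list can also be represented as a sparse bitmap, like in [Figure 4-8](#fig_bitmap_index): the *n*th bit in the bitmap for term *x* is a 1 if the document with ID *n* contains the term *x* [^89].
Finding all the documents that contain both terms *x* and *y* is now similar to a vectorized data warehouse query that searches for rows matching two conditions ([Figure 4-9](#fig_bitmap_and)): load the two bitmaps for terms *x* and *y* and compute their bitwise AND. Even if the bitmaps are run-length encoded, this can be done very efficiently.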
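A toy inverted index along these lines might look as follows. The documents are invented, the tokenizer is deliberately naive (lowercase, split on whitespace), and postings lists are stored as Python integers used as bitmaps:

```python
def tokenize(text):
    """Naive whitespace tokenizer; real engines do language-specific processing."""
    return text.lower().split()

def build_inverted_index(documents):
    """Map each term to a bitmap (int) whose bit n is set if document n contains it."""
    index = {}
    for doc_id, text in enumerate(documents):
        for term in set(tokenize(text)):
            index[term] = index.get(term, 0) | (1 << doc_id)
    return index

def search_all_terms(index, terms, doc_count):
    """Documents containing every term: bitwise AND of the terms' bitmaps."""
    result = (1 << doc_count) - 1          # start with all documents
    for term in terms:
        result &= index.get(term, 0)
    return [n for n in range(doc_count) if (result >> n) & 1]

documents = [
    "red apples and green pears",
    "green apples",
    "red wine",
    "apples red and ripe",
]
index = build_inverted_index(documents)
red_apples = search_all_terms(index, ["red", "apples"], len(documents))
```

Note that this finds documents containing both words anywhere in the text; matching the exact phrase would need positional information that this sketch does not store.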
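For example, Lucene, the full-text indexing engine used by Elasticsearch and Solr, works like this [^90]. It stores the mapping from term to postings list in SSTable-like sorted files, which are merged in the background using the same log-structured approach we saw earlier in this chapter [^91]. PostgreSQL's GIN index type also uses postings lists to support full-text search and indexing inside JSON documents [^92] [^93].
Instead of breaking text into words, an alternative is to find all the substrings of length *n*, which are called *n*-grams. For example, the trigrams (*n* = 3) of the string `"hello"` are `"hel"`, `"ell"`, and `"llo"`. If we build an inverted index of all trigrams, we can search the documents for arbitrary substrings that are at least three characters long. Trigram indexes even allow regular expressions in search queries; the downside is that they are quite large [^94].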
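Here is a small sketch of a trigram index used for substring search. The documents and function names are hypothetical. Because the trigrams of a query may occur in a document without being adjacent, the index only yields candidates, which are then verified against the text:

```python
def ngrams(text, n=3):
    """All substrings of length n, in order of appearance."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def build_trigram_index(documents):
    """Map each trigram to the set of IDs of documents that contain it."""
    index = {}
    for doc_id, text in enumerate(documents):
        for gram in ngrams(text):
            index.setdefault(gram, set()).add(doc_id)
    return index

def substring_search(index, documents, query):
    """Intersect the postings of the query's trigrams, then verify each candidate."""
    grams = ngrams(query)
    if not grams:
        raise ValueError("query must be at least three characters long")
    candidates = set.intersection(*(index.get(gram, set()) for gram in grams))
    return sorted(doc_id for doc_id in candidates if query in documents[doc_id])

documents = ["hello world", "yellow", "help"]
index = build_trigram_index(documents)
hits = substring_search(index, documents, "ello")
```

The index stores one entry per distinct trigram per document, which hints at why such indexes grow large.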
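To cope with typos in documents or queries, Lucene is able to search text for words within a certain edit distance (an edit distance of 1 means that one letter has been added, removed, or replaced) [^95]. It does this by storing the set of terms as a finite state automaton over the characters in the keys, similar to a *trie* [^96], and transforming it into a *Levenshtein automaton*, which supports efficient search for words within a given edit distance [^97].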
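To make the notion of edit distance concrete, the following sketch computes it with the classic dynamic-programming recurrence and then filters a term list by brute force. This is not how Lucene does it: a Levenshtein automaton avoids comparing the query against every term. The term list is invented.

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming, one row at a time."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                        # delete char_a
                current[j - 1] + 1,                     # insert char_b
                previous[j - 1] + (char_a != char_b),   # substitute, or match for free
            ))
        previous = current
    return previous[-1]

def fuzzy_lookup(terms, query, max_distance=1):
    """Brute force: compare the query against every term."""
    return [term for term in terms if edit_distance(term, query) <= max_distance]

terms = ["hello", "help", "world"]
near_helo = fuzzy_lookup(terms, "helo")
```

The brute-force filter costs time proportional to the size of the whole dictionary for every query, which is the cost the automaton-based approach is designed to avoid.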
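### Vector Embeddings {#id92}
Semantic search goes beyond synonyms and typos to try to understand document concepts and user intentions. For example, if your help pages contain a page titled "cancelling your subscription", users should still be able to find that page when searching for "how to close my account" or "terminate contract", which are close in meaning even though they use completely different words.
To understand a document's semantics (its meaning), semantic search indexes use embedding models to translate a document into a vector of floating-point values, called a *vector embedding*. The vector represents a point in a multidimensional space, and each floating-point value represents the document's location along one dimension's axis. Embedding models generate vector embeddings that are near each other (in this multidimensional space) when the embedding's input documents are semantically similar.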
--------
> [!NOTE]
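> We saw the term *vectorized processing* in ["Query Execution: Compilation and Vectorization"](#sec_storage_vectorized). Vectors in semantic search have a different meaning. In vectorized processing, the vector refers to a batch of bits that can be processed with specially optimized code. In embedding models, vectors are a list of floating-point numbers that represent a location in multidimensional space.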
--------
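For example, a three-dimensional vector embedding for a Wikipedia page about agriculture might be `[0.1, 0.22, 0.11]`. A Wikipedia page about vegetables would be quite near, perhaps with an embedding of `[0.13, 0.19, 0.24]`. A page about star schemas might have an embedding of `[0.82, 0.39, -0.74]`, comparatively far away. We can tell by looking that the first two vectors are closer to each other than to the third.
Embedding models use much larger vectors (often more than 1,000 numbers), but the principles are the same. We don't try to understand what the individual numbers mean; they're simply a way for embedding models to point to a location in an abstract multidimensional space. Search engines use distance functions such as cosine similarity or Euclidean distance to measure the distance between vectors. Cosine similarity measures the cosine of the angle between two vectors to determine how close they are, while Euclidean distance measures the straight-line distance between two points in space.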
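Both distance functions are short enough to write out. The sketch below applies them to the three example embeddings from the text; the variable names are ours.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

agriculture = [0.1, 0.22, 0.11]
vegetables = [0.13, 0.19, 0.24]
star_schemas = [0.82, 0.39, -0.74]

near = euclidean_distance(agriculture, vegetables)
far = euclidean_distance(agriculture, star_schemas)
similar = cosine_similarity(agriculture, vegetables)
dissimilar = cosine_similarity(agriculture, star_schemas)
```

Note the opposite orientation of the two measures: a smaller Euclidean distance means closer, whereas a larger cosine similarity means closer.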
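Many early embedding models such as Word2Vec [^98], BERT [^99], and GPT [^100] worked with text data. Such models are usually implemented as neural networks. Researchers went on to create embedding models for video, audio, and images as well. More recently, model architecture has become *multimodal*: a single model can generate vector embeddings for multiple modalities such as text and images.
Semantic search engines use an embedding model to generate a vector embedding when a user enters a query. The user's query and related context (such as the user's location) are fed into the embedding model. After the embedding model generates the query's vector embedding, the search engine must find documents with similar vector embeddings using a vector index.
Vector indexes store the vector embeddings of a collection of documents. To query the index, you pass in the vector embedding of the query, and the index returns the documents whose vectors are closest to the query vector. Since the R-trees we saw previously don't work well for vectors with many dimensions, specialized vector indexes are used, such as: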
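Flat indexes
: Vectors are stored in the index as they are. A query must read every vector and measure its distance to the query vector. Flat indexes are accurate, but measuring the distance between the query and each vector is slow.
Inverted file (IVF) indexes
: The vector space is clustered into partitions of vectors (called *centroids*) to reduce the number of vectors that must be compared. IVF indexes are faster than flat indexes, but can give only approximate results: the query and a document may fall into different partitions, even though they are close to each other. A query on an IVF index first defines *probes*, which are simply the number of partitions to check. Queries that use more probes will be more accurate, but will be slower, as more vectors must be compared.
Hierarchical Navigable Small World (HNSW)
: HNSW indexes maintain multiple layers of the vector space, as illustrated in [Figure 4-11](#fig_vector_hnsw). Each layer is represented as a graph, where nodes represent vectors, and edges represent proximity to nearby vectors. A query starts by locating the nearest vector in the topmost layer, which has a small number of nodes. The query then moves to the same node in the layer below and follows the edges in that layer, which is more densely connected, looking for a vector that is closer to the query vector. The process continues until the last layer is reached. As with IVF indexes, HNSW indexes are approximate.
{{< figure src="/fig/ddia_0411.png" id="fig_vector_hnsw" caption="Figure 4-11. Searching for the database entry that is closest to a given query vector in a HNSW index." class="w-full my-4" >}}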
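The difference between a flat index and an IVF index, and the effect of the number of probes, can be sketched in a few lines. This is a simplified illustration under stated assumptions: the two centroids are picked by hand (real IVF implementations typically learn them by clustering, for example with k-means), the vectors are two-dimensional, and the function names do not correspond to any library's API.

```python
import math

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def flat_search(vectors, query):
    """Exact nearest neighbor: compare the query with every stored vector."""
    return min(range(len(vectors)), key=lambda i: distance(vectors[i], query))

def build_ivf(vectors, centroids):
    """Assign each vector ID to the partition of its nearest centroid."""
    partitions = [[] for _ in centroids]
    for i, vector in enumerate(vectors):
        nearest = min(range(len(centroids)),
                      key=lambda c: distance(centroids[c], vector))
        partitions[nearest].append(i)
    return partitions

def ivf_search(vectors, centroids, partitions, query, probes=1):
    """Approximate search: scan only the `probes` partitions closest to the query."""
    order = sorted(range(len(centroids)),
                   key=lambda c: distance(centroids[c], query))
    candidates = [i for c in order[:probes] for i in partitions[c]]
    if not candidates:
        return None
    return min(candidates, key=lambda i: distance(vectors[i], query))

centroids = [(0.0, 0.0), (10.0, 0.0)]                         # hand-picked
vectors = [(1.0, 0.0), (4.9, 0.0), (5.2, 0.0), (9.0, 0.0)]
partitions = build_ivf(vectors, centroids)

query = (5.01, 0.0)
exact = flat_search(vectors, query)
one_probe = ivf_search(vectors, centroids, partitions, query, probes=1)
two_probes = ivf_search(vectors, centroids, partitions, query, probes=2)
```

In this example the query sits just on the far side of the partition boundary from its true nearest neighbor, so a single probe returns a slightly worse match, while probing both partitions recovers the exact answer at the cost of comparing every vector.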
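Many popular vector databases implement IVF and HNSW indexes. Facebook's Faiss library has many variations of each [^101], and PostgreSQL's pgvector supports both as well [^102]. The full details of the IVF and HNSW algorithms are beyond the scope of this book, but their papers are an excellent resource [^103] [^104].
## Summary {#summary}
In this chapter we tried to get to the bottom of how databases perform storage and retrieval. What happens when you store data in a database, and what does the database do when you query for the data again later?
["Analytical versus Operational Systems"](/tw/ch1#sec_introduction_analytics) introduced the distinction between transaction processing (OLTP) and analytics (OLAP). In this chapter we saw that storage engines optimized for OLTP look very different from those optimized for analytics: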
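* OLTP systems are optimized for a high volume of requests, each of which reads and writes a small number of records, and which need fast responses. The records are typically accessed via a primary key or a secondary index, and these indexes are typically ordered mappings from key to record, which also support range queries.
* Data warehouses and similar analytic systems are optimized for complex read queries that scan over a large number of records. They generally use a column-oriented storage layout with compression that minimizes the amount of data that such a query needs to read off disk, and just-in-time compilation of queries or vectorization to minimize the amount of CPU time spent processing the data.
On the OLTP side, we saw storage engines from two main schools of thought:
* The log-structured approach, which only permits appending to files and deleting obsolete files, but never updates a file that has been written. SSTables, LSM-trees, RocksDB, Cassandra, HBase, Scylla, Lucene, and others belong to this group. In general, log-structured storage engines tend to provide high write throughput.
* The update-in-place approach, which treats the disk as a set of fixed-size pages that can be overwritten. B-trees, the biggest example of this philosophy, are used in all major relational OLTP databases and also many nonrelational ones. As a rule of thumb, B-trees tend to be better for reads, providing higher read throughput and lower response times than log-structured storage.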
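We then looked at indexes that can search for multiple conditions at the same time: multidimensional indexes such as R-trees that can search for points on a map by latitude and longitude at the same time, and full-text search indexes that can search for multiple keywords appearing in the same text. Finally, vector databases are used for semantic search on text documents and other media; they use vectors with a large number of dimensions and find similar documents by comparing vector similarity.
As an application developer, if you're armed with this knowledge about the internals of storage engines, you are in a much better position to know which tool is best suited for your particular application. If you need to adjust a database's tuning parameters, this understanding allows you to imagine what effect a higher or a lower value may have.
Although this chapter couldn't make you an expert in tuning any one particular storage engine, it has hopefully equipped you with enough vocabulary and ideas that you can make sense of the documentation for the database of your choice.
### References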
[^1]: Nikolay Samokhvalov. [How partial, covering, and multicolumn indexes may slow down UPDATEs in PostgreSQL](https://postgres.ai/blog/20211029-how-partial-and-covering-indexes-affect-update-performance-in-postgresql). *postgres.ai*, October 2021. Archived at [perma.cc/PBK3-F4G9](https://perma.cc/PBK3-F4G9)
[^2]: Goetz Graefe. [Modern B-Tree Techniques](https://w6113.github.io/files/papers/btreesurvey-graefe.pdf). *Foundations and Trends in Databases*, volume 3, issue 4, pages 203–402, August 2011. [doi:10.1561/1900000028](https://doi.org/10.1561/1900000028)
[^3]: Evan Jones. [Why databases use ordered indexes but programming uses hash tables](https://www.evanjones.ca/ordered-vs-unordered-indexes.html). *evanjones.ca*, December 2019. Archived at [perma.cc/NJX8-3ZZD](https://perma.cc/NJX8-3ZZD)
[^4]: Branimir Lambov. [CEP-25: Trie-indexed SSTable format](https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-25%3A%2BTrie-indexed%2BSSTable%2Bformat). *cwiki.apache.org*, November 2022. Archived at [perma.cc/HD7W-PW8U](https://perma.cc/HD7W-PW8U). Linked Google Doc archived at [perma.cc/UL6C-AAAE](https://perma.cc/UL6C-AAAE)
[^5]: Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein: *Introduction to Algorithms*, 3rd edition. MIT Press, 2009. ISBN: 978-0-262-53305-8
[^6]: Branimir Lambov. [Trie Memtables in Cassandra](https://www.vldb.org/pvldb/vol15/p3359-lambov.pdf). *Proceedings of the VLDB Endowment*, volume 15, issue 12, pages 3359–3371, August 2022. [doi:10.14778/3554821.3554828](https://doi.org/10.14778/3554821.3554828)
[^7]: Dhruba Borthakur. [The History of RocksDB](https://rocksdb.blogspot.com/2013/11/the-history-of-rocksdb.html). *rocksdb.blogspot.com*, November 2013. Archived at [perma.cc/Z7C5-JPSP](https://perma.cc/Z7C5-JPSP)
[^8]: Matteo Bertozzi. [Apache HBase I/O – HFile](https://blog.cloudera.com/apache-hbase-i-o-hfile/). *blog.cloudera.com*, June 2012. Archived at [perma.cc/U9XH-L2KL](https://perma.cc/U9XH-L2KL)
[^9]: Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber. [Bigtable: A Distributed Storage System for Structured Data](https://research.google/pubs/pub27898/). At *7th USENIX Symposium on Operating System Design and Implementation* (OSDI), November 2006.
[^10]: Patrick O’Neil, Edward Cheng, Dieter Gawlick, and Elizabeth O’Neil. [The Log-Structured Merge-Tree (LSM-Tree)](https://www.cs.umb.edu/~poneil/lsmtree.pdf). *Acta Informatica*, volume 33, issue 4, pages 351–385, June 1996. [doi:10.1007/s002360050048](https://doi.org/10.1007/s002360050048)
[^11]: Mendel Rosenblum and John K. Ousterhout. [The Design and Implementation of a Log-Structured File System](https://research.cs.wisc.edu/areas/os/Qual/papers/lfs.pdf). *ACM Transactions on Computer Systems*, volume 10, issue 1, pages 26–52, February 1992. [doi:10.1145/146941.146943](https://doi.org/10.1145/146941.146943)
[^12]: Michael Armbrust, Tathagata Das, Liwen Sun, Burak Yavuz, Shixiong Zhu, Mukul Murthy, Joseph Torres, Herman van Hovell, Adrian Ionescu, Alicja Łuszczak, Michał Świtakowski, Michał Szafrański, Xiao Li, Takuya Ueshin, Mostafa Mokhtar, Peter Boncz, Ali Ghodsi, Sameer Paranjpye, Pieter Senster, Reynold Xin, and Matei Zaharia. [Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores](https://vldb.org/pvldb/vol13/p3411-armbrust.pdf). *Proceedings of the VLDB Endowment*, volume 13, issue 12, pages 3411–3424, August 2020. [doi:10.14778/3415478.3415560](https://doi.org/10.14778/3415478.3415560)
[^13]: Burton H. Bloom. [Space/Time Trade-offs in Hash Coding with Allowable Errors](https://people.cs.umass.edu/~emery/classes/cmpsci691st/readings/Misc/p422-bloom.pdf). *Communications of the ACM*, volume 13, issue 7, pages 422–426, July 1970. [doi:10.1145/362686.362692](https://doi.org/10.1145/362686.362692)
[^14]: Adam Kirsch and Michael Mitzenmacher. [Less Hashing, Same Performance: Building a Better Bloom Filter](https://www.eecs.harvard.edu/~michaelm/postscripts/tr-02-05.pdf). *Random Structures & Algorithms*, volume 33, issue 2, pages 187–218, September 2008. [doi:10.1002/rsa.20208](https://doi.org/10.1002/rsa.20208)
[^15]: Thomas Hurst. [Bloom Filter Calculator](https://hur.st/bloomfilter/). *hur.st*, September 2023. Archived at [perma.cc/L3AV-6VC2](https://perma.cc/L3AV-6VC2)
[^16]: Chen Luo and Michael J. Carey. [LSM-based storage techniques: a survey](https://arxiv.org/abs/1812.07527). *The VLDB Journal*, volume 29, pages 393–418, July 2019. [doi:10.1007/s00778-019-00555-y](https://doi.org/10.1007/s00778-019-00555-y)
[^17]: Subhadeep Sarkar and Manos Athanassoulis. [Dissecting, Designing, and Optimizing LSM-based Data Stores](https://www.youtube.com/watch?v=hkMkBZn2mGs). Tutorial at *ACM International Conference on Management of Data* (SIGMOD), June 2022. Slides archived at [perma.cc/93B3-E827](https://perma.cc/93B3-E827)
[^18]: Mark Callaghan. [Name that compaction algorithm](https://smalldatum.blogspot.com/2018/08/name-that-compaction-algorithm.html). *smalldatum.blogspot.com*, August 2018. Archived at [perma.cc/CN4M-82DY](https://perma.cc/CN4M-82DY)
[^19]: Prashanth Rao. [Embedded databases (1): The harmony of DuckDB, KùzuDB and LanceDB](https://thedataquarry.com/posts/embedded-db-1/). *thedataquarry.com*, August 2023. Archived at [perma.cc/PA28-2R35](https://perma.cc/PA28-2R35)
[^20]: Hacker News discussion. [Bluesky migrates to single-tenant SQLite](https://news.ycombinator.com/item?id=38171322). *news.ycombinator.com*, October 2023. Archived at [perma.cc/69LM-5P6X](https://perma.cc/69LM-5P6X)
[^21]: Rudolf Bayer and Edward M. McCreight. [Organization and Maintenance of Large Ordered Indices](https://dl.acm.org/doi/pdf/10.1145/1734663.1734671). Boeing Scientific Research Laboratories, Mathematical and Information Sciences Laboratory, report no. 20, July 1970. [doi:10.1145/1734663.1734671](https://doi.org/10.1145/1734663.1734671)
[^22]: Douglas Comer. [The Ubiquitous B-Tree](https://web.archive.org/web/20170809145513id_/http%3A//sites.fas.harvard.edu/~cs165/papers/comer.pdf). *ACM Computing Surveys*, volume 11, issue 2, pages 121–137, June 1979. [doi:10.1145/356770.356776](https://doi.org/10.1145/356770.356776)
[^23]: Alex Miller. [Torn Write Detection and Protection](https://transactional.blog/blog/2025-torn-writes). *transactional.blog*, April 2025. Archived at [perma.cc/G7EB-33EW](https://perma.cc/G7EB-33EW)
[^24]: C. Mohan and Frank Levine. [ARIES/IM: An Efficient and High Concurrency Index Management Method Using Write-Ahead Logging](https://ics.uci.edu/~cs223/papers/p371-mohan.pdf). At *ACM International Conference on Management of Data* (SIGMOD), June 1992. [doi:10.1145/130283.130338](https://doi.org/10.1145/130283.130338)
[^25]: Hironobu Suzuki. [The Internals of PostgreSQL](https://www.interdb.jp/pg/). *interdb.jp*, 2017.
[^26]: Howard Chu. [LDAP at Lightning Speed](https://buildstuff14.sched.com/event/08a1a368e272eb599a52e08b4c3c779d). At *Build Stuff ’14*, November 2014. Archived at [perma.cc/GB6Z-P8YH](https://perma.cc/GB6Z-P8YH)
[^27]: Manos Athanassoulis, Michael S. Kester, Lukas M. Maas, Radu Stoica, Stratos Idreos, Anastasia Ailamaki, and Mark Callaghan. [Designing Access Methods: The RUM Conjecture](https://openproceedings.org/2016/conf/edbt/paper-12.pdf). At *19th International Conference on Extending Database Technology* (EDBT), March 2016. [doi:10.5441/002/edbt.2016.42](https://doi.org/10.5441/002/edbt.2016.42)
[^28]: Ben Stopford. [Log Structured Merge Trees](http://www.benstopford.com/2015/02/14/log-structured-merge-trees/). *benstopford.com*, February 2015. Archived at [perma.cc/E5BV-KUJ6](https://perma.cc/E5BV-KUJ6)
[^29]: Mark Callaghan. [The Advantages of an LSM vs a B-Tree](https://smalldatum.blogspot.com/2016/01/summary-of-advantages-of-lsm-vs-b-tree.html). *smalldatum.blogspot.co.uk*, January 2016. Archived at [perma.cc/3TYZ-EFUD](https://perma.cc/3TYZ-EFUD)
[^30]: Oana Balmau, Florin Dinu, Willy Zwaenepoel, Karan Gupta, Ravishankar Chandhiramoorthi, and Diego Didona. [SILK: Preventing Latency Spikes in Log-Structured Merge Key-Value Stores](https://www.usenix.org/conference/atc19/presentation/balmau). At *USENIX Annual Technical Conference*, July 2019.
[^31]: Igor Canadi, Siying Dong, Mark Callaghan, et al. [RocksDB Tuning Guide](https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide). *github.com*, 2023. Archived at [perma.cc/UNY4-MK6C](https://perma.cc/UNY4-MK6C)
[^32]: Gabriel Haas and Viktor Leis. [What Modern NVMe Storage Can Do, and How to Exploit it: High-Performance I/O for High-Performance Storage Engines](https://www.vldb.org/pvldb/vol16/p2090-haas.pdf). *Proceedings of the VLDB Endowment*, volume 16, issue 9, pages 2090-2102. [doi:10.14778/3598581.3598584](https://doi.org/10.14778/3598581.3598584)
[^33]: Emmanuel Goossaert. [Coding for SSDs](https://codecapsule.com/2014/02/12/coding-for-ssds-part-1-introduction-and-table-of-contents/). *codecapsule.com*, February 2014.
[^34]: Jack Vanlightly. [Is sequential IO dead in the era of the NVMe drive?](https://jack-vanlightly.com/blog/2023/5/9/is-sequential-io-dead-in-the-era-of-the-nvme-drive) *jack-vanlightly.com*, May 2023. Archived at [perma.cc/7TMZ-TAPU](https://perma.cc/7TMZ-TAPU)
[^35]: Alibaba Cloud Storage Team. [Storage System Design Analysis: Factors Affecting NVMe SSD Performance (2)](https://www.alibabacloud.com/blog/594376). *alibabacloud.com*, January 2019. Archived at [archive.org](https://web.archive.org/web/20230510065132/https%3A//www.alibabacloud.com/blog/594376)
[^36]: Xiao-Yu Hu and Robert Haas. [The Fundamental Limit of Flash Random Write Performance: Understanding, Analysis and Performance Modelling](https://dominoweb.draco.res.ibm.com/reports/rz3771.pdf). *dominoweb.draco.res.ibm.com*, March 2010. Archived at [perma.cc/8JUL-4ZDS](https://perma.cc/8JUL-4ZDS)
[^37]: Lanyue Lu, Thanumalayan Sankaranarayana Pillai, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. [WiscKey: Separating Keys from Values in SSD-conscious Storage](https://www.usenix.org/system/files/conference/fast16/fast16-papers-lu.pdf). At *4th USENIX Conference on File and Storage Technologies* (FAST), February 2016.
[^38]: Peter Zaitsev. [Innodb Double Write](https://www.percona.com/blog/innodb-double-write/). *percona.com*, August 2006. Archived at [perma.cc/NT4S-DK7T](https://perma.cc/NT4S-DK7T)
[^39]: Tomas Vondra. [On the Impact of Full-Page Writes](https://www.2ndquadrant.com/en/blog/on-the-impact-of-full-page-writes/). *2ndquadrant.com*, November 2016. Archived at [perma.cc/7N6B-CVL3](https://perma.cc/7N6B-CVL3)
[^40]: Mark Callaghan. [Read, write & space amplification - B-Tree vs LSM](https://smalldatum.blogspot.com/2015/11/read-write-space-amplification-b-tree.html). *smalldatum.blogspot.com*, November 2015. Archived at [perma.cc/S487-WK5P](https://perma.cc/S487-WK5P)
[^41]: Mark Callaghan. [Choosing Between Efficiency and Performance with RocksDB](https://codemesh.io/codemesh2016/mark-callaghan). At *Code Mesh*, November 2016. Video at [youtube.com/watch?v=tgzkgZVXKB4](https://www.youtube.com/watch?v=tgzkgZVXKB4)
[^42]: Subhadeep Sarkar, Tarikul Islam Papon, Dimitris Staratzis, Zichen Zhu, and Manos Athanassoulis. [Enabling Timely and Persistent Deletion in LSM-Engines](https://subhadeep.net/assets/fulltext/Enabling_Timely_and_Persistent_Deletion_in_LSM-Engines.pdf). *ACM Transactions on Database Systems*, volume 48, issue 3, article no. 8, August 2023. [doi:10.1145/3599724](https://doi.org/10.1145/3599724)
[^43]: Lukas Fittl. [Postgres vs. SQL Server: B-Tree Index Differences & the Benefit of Deduplication](https://pganalyze.com/blog/postgresql-vs-sql-server-btree-index-deduplication). *pganalyze.com*, April 2025. Archived at [perma.cc/XY6T-LTPX](https://perma.cc/XY6T-LTPX)
[^44]: Drew Silcock. [How Postgres stores data on disk – this one’s a page turner](https://drew.silcock.dev/blog/how-postgres-stores-data-on-disk/). *drew.silcock.dev*, August 2024. Archived at [perma.cc/8K7K-7VJ2](https://perma.cc/8K7K-7VJ2)
[^45]: Joe Webb. [Using Covering Indexes to Improve Query Performance](https://www.red-gate.com/simple-talk/databases/sql-server/learn/using-covering-indexes-to-improve-query-performance/). *simple-talk.com*, September 2008. Archived at [perma.cc/6MEZ-R5VR](https://perma.cc/6MEZ-R5VR)
[^46]: Michael Stonebraker, Samuel Madden, Daniel J. Abadi, Stavros Harizopoulos, Nabil Hachem, and Pat Helland. [The End of an Architectural Era (It’s Time for a Complete Rewrite)](https://vldb.org/conf/2007/papers/industrial/p1150-stonebraker.pdf). At *33rd International Conference on Very Large Data Bases* (VLDB), September 2007.
[^47]: [VoltDB Technical Overview White Paper](https://www.voltactivedata.com/wp-content/uploads/2017/03/hv-white-paper-voltdb-technical-overview.pdf). VoltDB, 2017. Archived at [perma.cc/B9SF-SK5G](https://perma.cc/B9SF-SK5G)
[^48]: Stephen M. Rumble, Ankita Kejriwal, and John K. Ousterhout. [Log-Structured Memory for DRAM-Based Storage](https://www.usenix.org/system/files/conference/fast14/fast14-paper_rumble.pdf). At *12th USENIX Conference on File and Storage Technologies* (FAST), February 2014.
[^49]: Stavros Harizopoulos, Daniel J. Abadi, Samuel Madden, and Michael Stonebraker. [OLTP Through the Looking Glass, and What We Found There](https://hstore.cs.brown.edu/papers/hstore-lookingglass.pdf). At *ACM International Conference on Management of Data* (SIGMOD), June 2008. [doi:10.1145/1376616.1376713](https://doi.org/10.1145/1376616.1376713)
[^50]: Per-Åke Larson, Cipri Clinciu, Campbell Fraser, Eric N. Hanson, Mostafa Mokhtar, Michal Nowakiewicz, Vassilis Papadimos, Susan L. Price, Srikumar Rangarajan, Remus Rusanu, and Mayukh Saubhasik. [Enhancements to SQL Server Column Stores](https://web.archive.org/web/20131203001153id_/http%3A//research.microsoft.com/pubs/193599/Apollo3%20-%20Sigmod%202013%20-%20final.pdf). At *ACM International Conference on Management of Data* (SIGMOD), June 2013. [doi:10.1145/2463676.2463708](https://doi.org/10.1145/2463676.2463708)
[^51]: Franz Färber, Norman May, Wolfgang Lehner, Philipp Große, Ingo Müller, Hannes Rauhe, and Jonathan Dees. [The SAP HANA Database – An Architecture Overview](https://web.archive.org/web/20220208081111id_/http%3A//sites.computer.org/debull/A12mar/hana.pdf). *IEEE Data Engineering Bulletin*, volume 35, issue 1, pages 28–33, March 2012.
[^52]: Michael Stonebraker. [The Traditional RDBMS Wisdom Is (Almost Certainly) All Wrong](https://slideshot.epfl.ch/talks/166). Presentation at *EPFL*, May 2013.
[^53]: Adam Prout, Szu-Po Wang, Joseph Victor, Zhou Sun, Yongzhu Li, Jack Chen, Evan Bergeron, Eric Hanson, Robert Walzer, Rodrigo Gomes, and Nikita Shamgunov. [Cloud-Native Transactions and Analytics in SingleStore](https://dl.acm.org/doi/pdf/10.1145/3514221.3526055). At *ACM International Conference on Management of Data* (SIGMOD), June 2022. [doi:10.1145/3514221.3526055](https://doi.org/10.1145/3514221.3526055)
[^54]: Tino Tereshko and Jordan Tigani. [BigQuery under the hood](https://cloud.google.com/blog/products/bigquery/bigquery-under-the-hood). *cloud.google.com*, January 2016. Archived at [perma.cc/WP2Y-FUCF](https://perma.cc/WP2Y-FUCF)
[^55]: Wes McKinney. [The Road to Composable Data Systems: Thoughts on the Last 15 Years and the Future](https://wesmckinney.com/blog/looking-back-15-years/). *wesmckinney.com*, September 2023. Archived at [perma.cc/6L2M-GTJX](https://perma.cc/6L2M-GTJX)
[^56]: Michael Stonebraker, Daniel J. Abadi, Adam Batkin, Xuedong Chen, Mitch Cherniack, Miguel Ferreira, Edmond Lau, Amerson Lin, Sam Madden, Elizabeth O’Neil, Pat O’Neil, Alex Rasin, Nga Tran, and Stan Zdonik. [C-Store: A Column-oriented DBMS](https://www.vldb.org/archives/website/2005/program/paper/thu/p553-stonebraker.pdf). At *31st International Conference on Very Large Data Bases* (VLDB), pages 553–564, September 2005.
[^57]: Julien Le Dem. [Dremel Made Simple with Parquet](https://blog.twitter.com/engineering/en_us/a/2013/dremel-made-simple-with-parquet.html). *blog.twitter.com*, September 2013.
[^58]: Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, and Theo Vassilakis. [Dremel: Interactive Analysis of Web-Scale Datasets](https://vldb.org/pvldb/vol3/R29.pdf). At *36th International Conference on Very Large Data Bases* (VLDB), pages 330–339, September 2010. [doi:10.14778/1920841.1920886](https://doi.org/10.14778/1920841.1920886)
[^59]: Joe Kearney. [Understanding Record Shredding: storing nested data in columns](https://www.joekearney.co.uk/posts/understanding-record-shredding). *joekearney.co.uk*, December 2016. Archived at [perma.cc/ZD5N-AX5D](https://perma.cc/ZD5N-AX5D)
[^60]: Jamie Brandon. [A shallow survey of OLAP and HTAP query engines](https://www.scattered-thoughts.net/writing/a-shallow-survey-of-olap-and-htap-query-engines). *scattered-thoughts.net*, September 2023. Archived at [perma.cc/L3KH-J4JF](https://perma.cc/L3KH-J4JF)
[^61]: Benoit Dageville, Thierry Cruanes, Marcin Zukowski, Vadim Antonov, Artin Avanes, Jon Bock, Jonathan Claybaugh, Daniel Engovatov, Martin Hentschel, Jiansheng Huang, Allison W. Lee, Ashish Motivala, Abdul Q. Munir, Steven Pelley, Peter Povinec, Greg Rahn, Spyridon Triantafyllis, and Philipp Unterbrunner. [The Snowflake Elastic Data Warehouse](https://dl.acm.org/doi/pdf/10.1145/2882903.2903741). At *ACM International Conference on Management of Data* (SIGMOD), pages 215–226, June 2016. [doi:10.1145/2882903.2903741](https://doi.org/10.1145/2882903.2903741)
[^62]: Mark Raasveldt and Hannes Mühleisen. [Data Management for Data Science Towards Embedded Analytics](https://duckdb.org/pdf/CIDR2020-raasveldt-muehleisen-duckdb.pdf). At *10th Conference on Innovative Data Systems Research* (CIDR), January 2020.
[^63]: Jean-François Im, Kishore Gopalakrishna, Subbu Subramaniam, Mayank Shrivastava, Adwait Tumbde, Xiaotian Jiang, Jennifer Dai, Seunghyun Lee, Neha Pawar, Jialiang Li, and Ravi Aringunram. [Pinot: Realtime OLAP for 530 Million Users](https://cwiki.apache.org/confluence/download/attachments/103092375/Pinot.pdf). At *ACM International Conference on Management of Data* (SIGMOD), pages 583–594, May 2018. [doi:10.1145/3183713.3190661](https://doi.org/10.1145/3183713.3190661)
[^64]: Fangjin Yang, Eric Tschetter, Xavier Léauté, Nelson Ray, Gian Merlino, and Deep Ganguli. [Druid: A Real-time Analytical Data Store](https://static.druid.io/docs/druid.pdf). At *ACM International Conference on Management of Data* (SIGMOD), June 2014. [doi:10.1145/2588555.2595631](https://doi.org/10.1145/2588555.2595631)
[^65]: Chunwei Liu, Anna Pavlenko, Matteo Interlandi, and Brandon Haynes. [Deep Dive into Common Open Formats for Analytical DBMSs](https://www.vldb.org/pvldb/vol16/p3044-liu.pdf). *Proceedings of the VLDB Endowment*, volume 16, issue 11, pages 3044–3056, July 2023. [doi:10.14778/3611479.3611507](https://doi.org/10.14778/3611479.3611507)
[^66]: Xinyu Zeng, Yulong Hui, Jiahong Shen, Andrew Pavlo, Wes McKinney, and Huanchen Zhang. [An Empirical Evaluation of Columnar Storage Formats](https://www.vldb.org/pvldb/vol17/p148-zeng.pdf). *Proceedings of the VLDB Endowment*, volume 17, issue 2, pages 148–161. [doi:10.14778/3626292.3626298](https://doi.org/10.14778/3626292.3626298)
[^67]: Weston Pace. [Lance v2: A columnar container format for modern data](https://blog.lancedb.com/lance-v2/). *blog.lancedb.com*, April 2024. Archived at [perma.cc/ZK3Q-S9VJ](https://perma.cc/ZK3Q-S9VJ)
[^68]: Yoav Helfman. [Nimble, A New Columnar File Format](https://www.youtube.com/watch?v=bISBNVtXZ6M). At *VeloxCon*, April 2024.
[^69]: Wes McKinney. [Apache Arrow: High-Performance Columnar Data Framework](https://www.youtube.com/watch?v=YhF8YR0OEFk). At *CMU Database Group – Vaccination Database Tech Talks*, December 2021.
[^70]: Wes McKinney. [Python for Data Analysis, 3rd Edition](https://learning.oreilly.com/library/view/python-for-data/9781098104023/). O’Reilly Media, August 2022. ISBN: 9781098104023
[^71]: Paul Dix. [The Design of InfluxDB IOx: An In-Memory Columnar Database Written in Rust with Apache Arrow](https://www.youtube.com/watch?v=_zbwz-4RDXg). At *CMU Database Group – Vaccination Database Tech Talks*, May 2021.
[^72]: Carlota Soto and Mike Freedman. [Building Columnar Compression for Large PostgreSQL Databases](https://www.timescale.com/blog/building-columnar-compression-in-a-row-oriented-database/). *timescale.com*, March 2024. Archived at [perma.cc/7KTF-V3EH](https://perma.cc/7KTF-V3EH)
[^73]: Daniel Lemire, Gregory Ssi‐Yan‐Kai, and Owen Kaser. [Consistently faster and smaller compressed bitmaps with Roaring](https://arxiv.org/pdf/1603.06549). *Software: Practice and Experience*, volume 46, issue 11, pages 1547–1569, November 2016. [doi:10.1002/spe.2402](https://doi.org/10.1002/spe.2402)
[^74]: Jaz Volpert. [An entire Social Network in 1.6GB (GraphD Part 2)](https://jazco.dev/2024/04/20/roaring-bitmaps/). *jazco.dev*, April 2024. Archived at [perma.cc/L27Z-QVMG](https://perma.cc/L27Z-QVMG)
[^75]: Daniel J. Abadi, Peter Boncz, Stavros Harizopoulos, Stratos Idreos, and Samuel Madden. [The Design and Implementation of Modern Column-Oriented Database Systems](https://www.cs.umd.edu/~abadi/papers/abadi-column-stores.pdf). *Foundations and Trends in Databases*, volume 5, issue 3, pages 197–280, December 2013. [doi:10.1561/1900000024](https://doi.org/10.1561/1900000024)
[^76]: Andrew Lamb, Matt Fuller, Ramakrishna Varadarajan, Nga Tran, Ben Vandiver, Lyric Doshi, and Chuck Bear. [The Vertica Analytic Database: C-Store 7 Years Later](https://vldb.org/pvldb/vol5/p1790_andrewlamb_vldb2012.pdf). *Proceedings of the VLDB Endowment*, volume 5, issue 12, pages 1790–1801, August 2012. [doi:10.14778/2367502.2367518](https://doi.org/10.14778/2367502.2367518)
[^77]: Timo Kersten, Viktor Leis, Alfons Kemper, Thomas Neumann, Andrew Pavlo, and Peter Boncz. [Everything You Always Wanted to Know About Compiled and Vectorized Queries But Were Afraid to Ask](https://www.vldb.org/pvldb/vol11/p2209-kersten.pdf). *Proceedings of the VLDB Endowment*, volume 11, issue 13, pages 2209–2222, September 2018. [doi:10.14778/3275366.3284966](https://doi.org/10.14778/3275366.3284966)
[^78]: Forrest Smith. [Memory Bandwidth Napkin Math](https://www.forrestthewoods.com/blog/memory-bandwidth-napkin-math/). *forrestthewoods.com*, February 2020. Archived at [perma.cc/Y8U4-PS7N](https://perma.cc/Y8U4-PS7N)
[^79]: Peter Boncz, Marcin Zukowski, and Niels Nes. [MonetDB/X100: Hyper-Pipelining Query Execution](https://www.cidrdb.org/cidr2005/papers/P19.pdf). At *2nd Biennial Conference on Innovative Data Systems Research* (CIDR), January 2005.
[^80]: Jingren Zhou and Kenneth A. Ross. [Implementing Database Operations Using SIMD Instructions](https://www1.cs.columbia.edu/~kar/pubsk/simd.pdf). At *ACM International Conference on Management of Data* (SIGMOD), pages 145–156, June 2002. [doi:10.1145/564691.564709](https://doi.org/10.1145/564691.564709)
[^81]: Kevin Bartley. [OLTP Queries: Transfer Expensive Workloads to Materialize](https://materialize.com/blog/oltp-queries/). *materialize.com*, August 2024. Archived at [perma.cc/4TYM-TYD8](https://perma.cc/4TYM-TYD8)
[^82]: Jim Gray, Surajit Chaudhuri, Adam Bosworth, Andrew Layman, Don Reichart, Murali Venkatrao, Frank Pellow, and Hamid Pirahesh. [Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals](https://arxiv.org/pdf/cs/0701155). *Data Mining and Knowledge Discovery*, volume 1, issue 1, pages 29–53, March 2007. [doi:10.1023/A:1009726021843](https://doi.org/10.1023/A%3A1009726021843)
[^83]: Frank Ramsak, Volker Markl, Robert Fenk, Martin Zirkel, Klaus Elhardt, and Rudolf Bayer. [Integrating the UB-Tree into a Database System Kernel](https://www.vldb.org/conf/2000/P263.pdf). At *26th International Conference on Very Large Data Bases* (VLDB), September 2000.
[^84]: Octavian Procopiuc, Pankaj K. Agarwal, Lars Arge, and Jeffrey Scott Vitter. [Bkd-Tree: A Dynamic Scalable kd-Tree](https://users.cs.duke.edu/~pankaj/publications/papers/bkd-sstd.pdf). At *8th International Symposium on Spatial and Temporal Databases* (SSTD), pages 46–65, July 2003. [doi:10.1007/978-3-540-45072-6\_4](https://doi.org/10.1007/978-3-540-45072-6_4)
[^85]: Joseph M. Hellerstein, Jeffrey F. Naughton, and Avi Pfeffer. [Generalized Search Trees for Database Systems](https://dsf.berkeley.edu/papers/vldb95-gist.pdf). At *21st International Conference on Very Large Data Bases* (VLDB), September 1995.
[^86]: Isaac Brodsky. [H3: Uber’s Hexagonal Hierarchical Spatial Index](https://eng.uber.com/h3/). *eng.uber.com*, June 2018. Archived at [archive.org](https://web.archive.org/web/20240722003854/https%3A//www.uber.com/blog/h3/)
[^87]: Robert Escriva, Bernard Wong, and Emin Gün Sirer. [HyperDex: A Distributed, Searchable Key-Value Store](https://www.cs.princeton.edu/courses/archive/fall13/cos518/papers/hyperdex.pdf). At *ACM SIGCOMM Conference*, August 2012. [doi:10.1145/2377677.2377681](https://doi.org/10.1145/2377677.2377681)
[^88]: Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. [*Introduction to Information Retrieval*](https://nlp.stanford.edu/IR-book/). Cambridge University Press, 2008. ISBN: 978-0-521-86571-5, available online at [nlp.stanford.edu/IR-book](https://nlp.stanford.edu/IR-book/)
[^89]: Jianguo Wang, Chunbin Lin, Yannis Papakonstantinou, and Steven Swanson. [An Experimental Study of Bitmap Compression vs. Inverted List Compression](https://cseweb.ucsd.edu/~swanson/papers/SIGMOD2017-ListCompression.pdf). At *ACM International Conference on Management of Data* (SIGMOD), pages 993–1008, May 2017. [doi:10.1145/3035918.3064007](https://doi.org/10.1145/3035918.3064007)
[^90]: Adrien Grand. [What is in a Lucene Index?](https://speakerdeck.com/elasticsearch/what-is-in-a-lucene-index) At *Lucene/Solr Revolution*, November 2013. Archived at [perma.cc/Z7QN-GBYY](https://perma.cc/Z7QN-GBYY)
[^91]: Michael McCandless. [Visualizing Lucene’s Segment Merges](https://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html). *blog.mikemccandless.com*, February 2011. Archived at [perma.cc/3ZV8-72W6](https://perma.cc/3ZV8-72W6)
[^92]: Lukas Fittl. [Understanding Postgres GIN Indexes: The Good and the Bad](https://pganalyze.com/blog/gin-index). *pganalyze.com*, December 2021. Archived at [perma.cc/V3MW-26H6](https://perma.cc/V3MW-26H6)
[^93]: Jimmy Angelakos. [The State of (Full) Text Search in PostgreSQL 12](https://www.youtube.com/watch?v=c8IrUHV70KQ). At *FOSDEM*, February 2020. Archived at [perma.cc/J6US-3WZS](https://perma.cc/J6US-3WZS)
[^94]: Alexander Korotkov. [Index support for regular expression search](https://wiki.postgresql.org/images/6/6c/Index_support_for_regular_expression_search.pdf). At *PGConf.EU Prague*, October 2012. Archived at [perma.cc/5RFZ-ZKDQ](https://perma.cc/5RFZ-ZKDQ)
[^95]: Michael McCandless. [Lucene’s FuzzyQuery Is 100 Times Faster in 4.0](https://blog.mikemccandless.com/2011/03/lucenes-fuzzyquery-is-100-times-faster.html). *blog.mikemccandless.com*, March 2011. Archived at [perma.cc/E2WC-GHTW](https://perma.cc/E2WC-GHTW)
[^96]: Steffen Heinz, Justin Zobel, and Hugh E. Williams. [Burst Tries: A Fast, Efficient Data Structure for String Keys](https://web.archive.org/web/20130903070248id_/http%3A//ww2.cs.mu.oz.au%3A80/~jz/fulltext/acmtois02.pdf). *ACM Transactions on Information Systems*, volume 20, issue 2, pages 192–223, April 2002. [doi:10.1145/506309.506312](https://doi.org/10.1145/506309.506312)
[^97]: Klaus U. Schulz and Stoyan Mihov. [Fast String Correction with Levenshtein Automata](https://dmice.ohsu.edu/bedricks/courses/cs655/pdf/readings/2002_Schulz.pdf). *International Journal on Document Analysis and Recognition*, volume 5, issue 1, pages 67–85, November 2002. [doi:10.1007/s10032-002-0082-8](https://doi.org/10.1007/s10032-002-0082-8)
[^98]: Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/pdf/1301.3781). At *International Conference on Learning Representations* (ICLR), May 2013. [doi:10.48550/arXiv.1301.3781](https://doi.org/10.48550/arXiv.1301.3781)
[^99]: Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/pdf/1810.04805). At *Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies*, volume 1, pages 4171–4186, June 2019. [doi:10.18653/v1/N19-1423](https://doi.org/10.18653/v1/N19-1423)
[^100]: Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. [Improving Language Understanding by Generative Pre-Training](https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf). *openai.com*, June 2018. Archived at [perma.cc/5N3C-DJ4C](https://perma.cc/5N3C-DJ4C)
[^101]: Matthijs Douze, Maria Lomeli, and Lucas Hosseini. [Faiss indexes](https://github.com/facebookresearch/faiss/wiki/Faiss-indexes). *github.com*, August 2024. Archived at [perma.cc/2EWG-FPBS](https://perma.cc/2EWG-FPBS)
[^102]: Varik Matevosyan. [Understanding pgvector’s HNSW Index Storage in Postgres](https://lantern.dev/blog/pgvector-storage). *lantern.dev*, August 2024. Archived at [perma.cc/B2YB-JB59](https://perma.cc/B2YB-JB59)
[^103]: Dmitry Baranchuk, Artem Babenko, and Yury Malkov. [Revisiting the Inverted Indices for Billion-Scale Approximate Nearest Neighbors](https://arxiv.org/pdf/1802.02422). At *European Conference on Computer Vision* (ECCV), pages 202–216, September 2018. [doi:10.1007/978-3-030-01258-8\_13](https://doi.org/10.1007/978-3-030-01258-8_13)
[^104]: Yury A. Malkov and Dmitry A. Yashunin. [Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs](https://arxiv.org/pdf/1603.09320). *IEEE Transactions on Pattern Analysis and Machine Intelligence*, volume 42, issue 4, pages 824–836, April 2020. [doi:10.1109/TPAMI.2018.2889473](https://doi.org/10.1109/TPAMI.2018.2889473)
================================================
FILE: content/tw/ch5.md
================================================
---
title: "5. Encoding and Evolution"
weight: 105
math: true
breadcrumbs: false
---

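> *Everything changes and nothing stands still.*
>
> Heraclitus, as quoted by Plato in *Cratylus* (360 BCE)

Applications inevitably change over time. Features are added or modified as new products are launched, user requirements become better understood, or business circumstances change. In [Chapter 2](/tw/ch2#ch_nonfunctional) we introduced the idea of *evolvability*: we should aim to build systems that make it easy to adapt to change (see ["Evolvability: Making Change Easy"](/tw/ch2#sec_introduction_evolvability)).
In most cases, a change to an application's features also requires a change to the data it stores: perhaps a new field or record type needs to be captured, or perhaps existing data needs to be presented in a new way.
The data models we discussed in [Chapter 3](/tw/ch3#ch_datamodels) have different ways of coping with such change. Relational databases generally assume that all data in the database conforms to one schema: although that schema can be changed (through schema migrations; i.e., `ALTER` statements), exactly one schema is in force at any point in time. By contrast, schema-on-read ("schemaless") databases don't enforce a schema, so the database can contain a mixture of older and newer data formats written at different times (see ["Schema flexibility in the document model"](/tw/ch3#sec_datamodels_schema_flexibility)).
When a data format or schema changes, a corresponding change to application code often needs to happen (for example, you add a new field to a record, and the application code starts reading and writing that field). However, in a large application, code changes often cannot happen instantaneously:
* With server-side applications you may want to perform a *rolling upgrade* (also known as a *staged rollout*), deploying the new version to a few nodes at a time, checking whether the new version is running smoothly, and gradually working your way through all the nodes. This allows new versions to be deployed without service downtime, and thus encourages more frequent releases and better evolvability.
* With client-side applications you're at the mercy of the user, who may not install the update for a long time.
This means that old and new versions of the code, and old and new data formats, may all coexist in the system at the same time. For the system to continue running smoothly, we need to maintain compatibility in both directions:
Backward compatibility
: Newer code can read data that was written by older code.
Forward compatibility
: Older code can read data that was written by newer code.
Backward compatibility is normally not hard to achieve: as the author of the newer code, you know the format of data written by older code, so you can explicitly handle it (if necessary, by simply keeping the old code to read the old data). Forward compatibility can be trickier, because it requires older code to ignore additions made by a newer version of the code.
Another challenge with forward compatibility is illustrated in [Figure 5-1](#fig_encoding_preserve_field). Say you add a field to a record schema, and the newer code creates a record containing that new field and stores it in a database. Subsequently, an older version of the code (which doesn't yet know about the new field) reads the record, updates it, and writes it back. In this situation, the desirable behavior is usually for the old code to keep the new field intact, even though it cannot interpret it. But if the record is decoded into a model object that does not explicitly preserve unknown fields, data can be lost, as in [Figure 5-1](#fig_encoding_preserve_field).
{{< figure src="/fig/ddia_0501.png" id="fig_encoding_preserve_field" caption="Figure 5-1. When an older version of the application updates data previously written by a newer version of the application, data may be lost if you're not careful." class="w-full my-4" >}}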
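The read-update-write hazard in Figure 5-1 can be sketched in a few lines of Python. This is only an illustration under assumed names: the `photoURL` field, the two model classes, and `old_code_update` are hypothetical, and JSON stands in for whatever encoding the database uses.

```python
import json

# Hypothetical record written by NEW code, which knows about "photoURL".
stored = json.dumps({"userName": "Martin", "photoURL": "https://example.com/m.png"})

class LossyUser:
    """Old model object: keeps only the fields it knows about."""
    def __init__(self, doc):
        self.user_name = doc["userName"]

    def to_doc(self):
        return {"userName": self.user_name}

class PreservingUser:
    """Old model object that carries unknown fields along unchanged."""
    def __init__(self, doc):
        self.user_name = doc["userName"]
        self.unknown = {k: v for k, v in doc.items() if k != "userName"}

    def to_doc(self):
        return {**self.unknown, "userName": self.user_name}

def old_code_update(model_cls, encoded, new_name):
    user = model_cls(json.loads(encoded))   # decode into a model object
    user.user_name = new_name               # update a field the old code knows
    return json.dumps(user.to_doc())        # re-encode and write back

lossy = json.loads(old_code_update(LossyUser, stored, "Martin K."))
kept = json.loads(old_code_update(PreservingUser, stored, "Martin K."))
print(lossy)  # the photoURL field is gone
print(kept)   # the photoURL field survived the round trip
```

The only difference between the two classes is whether the decoder keeps a bag of fields it did not recognize and writes them back on encode.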
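In this chapter we will look at several formats for encoding data, including JSON, XML, Protocol Buffers, and Avro. In particular, we will look at how they handle schema changes and how they support systems where old and new data and code need to coexist. We will then discuss how those formats are used for data storage and for communication: in databases, web services, REST APIs, remote procedure calls (RPC), workflow engines, and event-driven systems such as actors and message queues.
## Formats for Encoding Data {#sec_encoding_formats}
Programs usually work with data in (at least) two different representations:
1. In memory, data is kept in objects, structs, lists, arrays, hash tables, trees, and so on. These data structures are optimized for efficient access and manipulation by the CPU (typically using pointers).
2. When you want to write data to a file or send it over the network, you have to encode it as some kind of self-contained sequence of bytes (for example, a JSON document). Since a pointer wouldn't make sense to any other process, this sequence-of-bytes representation often looks quite different from the data structures that are normally used in memory.
Thus, we need some kind of translation between the two representations. The translation from the in-memory representation to a byte sequence is called *encoding* (also known as *serialization* or *marshalling*), and the reverse is called *decoding* (*parsing*, *deserialization*, *unmarshalling*).
--------
> [!TIP] Terminology clash
>
> *Serialization* is unfortunately also used in the context of transactions (see [Chapter 8](/tw/ch8#ch_transactions)), with a completely different meaning. To avoid overloading the word, we'll stick with *encoding* in this book, even though *serialization* is perhaps the more common term.
--------
There are exceptions in which encoding/decoding is not needed: for example, when a database operates directly on compressed data loaded from disk, as discussed in ["Query Execution: Compilation and Vectorization"](/tw/ch4#sec_storage_vectorized). There are also *zero-copy* data formats that are designed to be used both at runtime and on disk/on the network, without an explicit conversion step, such as Cap'n Proto and FlatBuffers.
However, most systems need to convert between in-memory objects and flat byte sequences. As this is such a common problem, there are a myriad of different libraries and encoding formats to choose from. Let's do a brief overview.
### Language-Specific Formats {#id96}
Many programming languages come with built-in support for encoding in-memory objects into byte sequences. For example, Java has `java.io.Serializable`, Python has `pickle`, Ruby has `Marshal`, and so on. Many third-party libraries also exist, such as Kryo for Java.
These encoding libraries are very convenient, because they allow in-memory objects to be saved and restored with minimal additional code. However, they also have a number of deep problems:
* The encoding is often tied to a particular programming language, and reading the data in another language is very difficult. If you store or transmit data in such an encoding, you are committing yourself to your current programming language for potentially a very long time, and precluding integration with the systems of other organizations (which may use different languages).
* In order to restore data in the same object types, the decoding process needs to be able to instantiate arbitrary classes. This is frequently a source of security problems [^1]: if an attacker can get your application to decode an arbitrary byte sequence, they can instantiate arbitrary classes, which in turn often allows them to do terrible things such as remotely executing arbitrary code [^2] [^3].
* Versioning data is often an afterthought in these libraries: as they are intended for quick and easy encoding of data, they often neglect the inconvenient problems of forward and backward compatibility [^4].
* Efficiency (CPU time taken to encode or decode, and the size of the encoded structure) is also often an afterthought. For example, Java's built-in serialization is notorious for its bad performance and bloated encoding [^5].
For these reasons it's generally a bad idea to use your language's built-in encoding for anything other than very transient purposes.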
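To make the security point concrete, here is a deliberately harmless sketch using Python's `pickle`. The class name `Innocent` is made up for illustration; the callable chosen is the built-in `sorted`, but the mechanism is the same one an attacker would use to name a dangerous function.

```python
import pickle

class Innocent:
    """Looks like plain data, but controls what the decoder calls."""
    def __reduce__(self):
        # pickle records "call this callable with these arguments".
        # Here the callable is the harmless built-in `sorted`; a malicious
        # payload could reference any importable function instead.
        return (sorted, ([3, 1, 2],))

payload = pickle.dumps(Innocent())
decoded = pickle.loads(payload)   # sorted([3, 1, 2]) runs during decoding
print(decoded)
```

Decoding did not produce an `Innocent` object at all: it ran a function chosen by whoever produced the bytes. That is why decoding untrusted input with a language-specific format is risky.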
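### JSON, XML, and Binary Variants {#sec_encoding_json}
Moving to standardized encodings that can be written and read by many programming languages, JSON and XML are the obvious contenders. They are widely known, widely supported, and almost as widely disliked. XML is often criticized for being too verbose and unnecessarily complicated [^6]. JSON's popularity is mainly due to its built-in support in web browsers and its simplicity relative to XML. CSV is another popular language-independent format, but it only supports tabular data without nesting.
JSON, XML, and CSV are textual formats, and thus somewhat human-readable (although the syntax is a popular topic of debate). Besides the superficial syntactic issues, they also have some subtle problems:
* There is a lot of ambiguity around the encoding of numbers. In XML and CSV, you cannot distinguish between a number and a string that happens to consist of digits (except by referring to an external schema). JSON distinguishes strings and numbers, but it doesn't distinguish integers and floating-point numbers, and it doesn't specify a precision.
  This is a problem when dealing with large numbers; for example, not every integer greater than 2⁵³ can be exactly represented in an IEEE 754 double-precision floating-point number, so such numbers become inaccurate when parsed in a language that uses floating-point numbers, such as JavaScript [^7]. An example of numbers larger than 2⁵³ occurs on X (formerly Twitter), which uses a 64-bit number to identify each post. The JSON returned by the API includes post IDs twice, once as a JSON number and once as a decimal string, to work around the fact that the numbers are not correctly parsed by JavaScript applications [^8].
* JSON and XML have good support for Unicode character strings (i.e., human-readable text), but they don't support binary strings (sequences of bytes without a character encoding). Binary strings are a useful feature, so people get around this limitation by encoding the binary data as text using Base64. The schema is then used to indicate that the value should be interpreted as Base64-encoded. This works, but it's somewhat hacky and increases the data size by 33%.
* XML Schema and JSON Schema are powerful, and thus quite complicated to learn and implement. Since the correct interpretation of data (such as numbers and binary strings) depends on information in the schema, applications that don't use XML/JSON schemas need to potentially hard-code the appropriate encoding/decoding logic instead.
* CSV does not have any schema, so it is up to the application to define the meaning of each row and column. If an application change adds a new row or column, you have to handle that change manually. CSV is also a quite vague format (what happens if a value contains a comma or a newline character?). Although its escaping rules have been formally specified [^9], not all parsers implement them correctly.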
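Two of the pitfalls in the list above are easy to reproduce. The sketch below uses Python's standard library; `parse_int=float` is used here only to imitate a parser that maps every JSON number to a double, as JavaScript does, and the variable names are illustrative.

```python
import base64
import json

big_id = 2**53 + 1                      # 9007199254740993
doc = json.dumps({"id": big_id, "id_str": str(big_id)})

# Python's json module parses integers exactly...
exact = json.loads(doc)
# ...but a parser that stores every number as a double silently rounds it.
as_double = json.loads(doc, parse_int=float)

print(exact["id"])           # exact integer
print(int(as_double["id"]))  # rounded to a neighboring representable value
print(as_double["id_str"])   # the string copy is unaffected

raw = bytes(range(256)) * 3             # 768 bytes of binary data
b64 = base64.b64encode(raw)             # 4 output characters per 3 input bytes
print(len(raw), len(b64))
```

The string copy of the ID is what lets a floating-point client recover the exact value, and the Base64 length shows where the one-third size overhead comes from.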
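Despite these flaws, JSON, XML, and CSV are good enough for many purposes. It's likely that they will remain popular, especially as data interchange formats (i.e., for sending data from one organization to another). In these situations, as long as people agree on what the format is, it often doesn't matter how pretty or efficient the format is. The difficulty of getting different organizations to agree on *anything* outweighs most other concerns.
#### JSON Schema {#json-schema}
JSON Schema has become widely adopted as a way to model data whenever it's exchanged between systems or written to storage. You'll find JSON schemas in web services (see ["Web services"](#sec_web_services)) as part of the OpenAPI web service specification, in schema registries such as Confluent's Schema Registry and Red Hat's Apicurio Registry, and in databases such as PostgreSQL's pg_jsonschema validator extension and MongoDB's `$jsonSchema` validator syntax.
The JSON Schema specification offers a number of features. Schemas include standard primitive types: strings, numbers, integers, objects, arrays, booleans, and nulls. But JSON Schema also offers a separate validation specification that allows developers to overlay constraints on fields. For example, a `port` field might have a minimum value of 1 and a maximum of 65535.
JSON schemas can have either open or closed content models. An open content model permits any field not defined in the schema to exist with any data type, while a closed content model only allows fields that are explicitly defined. The open content model in JSON Schema is enabled when `additionalProperties` is set to `true`, which is the default. Thus, JSON schemas are usually a definition of what *isn't* permitted (namely, invalid values on any of the defined fields), rather than what *is* permitted in a schema.
Open content models are powerful, but can be complex. For example, say you want to define a map from integers (such as IDs) to strings. JSON does not have a map or dictionary type, only an "object" type that can contain string keys and values of any type. You can then constrain this type with JSON Schema so that keys may only contain digits and values can only be strings, using `patternProperties` and `additionalProperties` as shown in [Example 5-1](#fig_encoding_json_schema).
{{< figure id="fig_encoding_json_schema" title="Example 5-1. Example JSON schema with integer keys and string values. Integer keys are represented as strings containing only digits, since JSON Schema requires all keys to be strings." class="w-full my-4" >}}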
```json
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "patternProperties": {
    "^[0-9]+$": {
      "type": "string"
    }
  },
  "additionalProperties": false
}
```
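In addition to open and closed content models and validators, JSON Schema supports conditional if/else schema logic, named types, references to remote schemas, and much more. All of this makes for a very powerful schema language. Such features also make for unwieldy definitions. It can be challenging to resolve remote schemas, reason about conditional rules, or evolve schemas in a forward- or backward-compatible way [^10]. Similar concerns apply to XML Schema [^11].
#### Binary encoding {#binary-encoding}
JSON is less verbose than XML, but both still use a lot of space compared to binary formats. This observation led to the development of a profusion of binary encodings for JSON (MessagePack, CBOR, BSON, BJSON, UBJSON, BISON, Hessian, and Smile, to name a few) and for XML (WBXML and Fast Infoset, for example). These formats have been adopted in various niches, as they are more compact and sometimes faster to parse, but none of them are as widely adopted as the textual versions of JSON and XML [^12].
Some of these formats extend the set of datatypes (e.g., distinguishing integers and floating-point numbers, or adding support for binary strings), but otherwise they keep the JSON/XML data model unchanged. In particular, since they don't prescribe a schema, they need to include all the object field names within the encoded data. That is, in a binary encoding of the JSON document in [Example 5-2](#fig_encoding_json), they will need to include the strings `userName`, `favoriteNumber`, and `interests` somewhere.
{{< figure id="fig_encoding_json" title="Example 5-2. Example record that we will encode in several binary formats in this chapter" class="w-full my-4" >}}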
```json
{
  "userName": "Martin",
  "favoriteNumber": 1337,
  "interests": ["daydreaming", "hacking"]
}
```
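Let's look at an example of MessagePack, a binary encoding for JSON. [Figure 5-2](#fig_encoding_messagepack) shows the byte sequence that you get if you encode the JSON document in [Example 5-2](#fig_encoding_json) with MessagePack. The first few bytes are as follows:
1. The first byte, `0x83`, indicates that what follows is an object (top four bits = `0x80`) with three fields (bottom four bits = `0x03`). (In case you're wondering what happens if an object has more than 15 fields, so that the number of fields doesn't fit in four bits, it then gets a different type indicator, and the number of fields is encoded in two or four bytes.)
2. The second byte, `0xa8`, indicates that what follows is a string (top four bits = `0xa0`) that is eight bytes long (bottom four bits = `0x08`).
3. The next eight bytes are the field name `userName` in ASCII. Since the length was indicated previously, there's no need for any marker to tell us where the string ends (or any escaping).
4. The next seven bytes encode the six-letter string value `Martin` with a prefix `0xa6`, and so on.
The binary encoding is 66 bytes long, which is only a little less than the 81 bytes taken by the textual JSON encoding (with whitespace removed). All the binary encodings of JSON are similar in this regard. It's not clear whether such a small space reduction (and perhaps a speedup in parsing) is worth the loss of human-readability.
In the following sections we will see how we can do much better, and encode the same record in just 32 bytes.
{{< figure link="#fig_encoding_json" src="/fig/ddia_0502.png" id="fig_encoding_messagepack" caption="Figure 5-2. The example record from Example 5-2, encoded using MessagePack." class="w-full my-4" >}}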
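The byte counts above can be checked with a tiny hand-written encoder. This is a sketch covering only the subset of MessagePack needed for the example record (small maps, small arrays, short strings, and small non-negative integers); `mp_encode` is a made-up name, and a real application would use a MessagePack library instead.

```python
import json

def mp_encode(value):
    """Encode a small subset of MessagePack, enough for the example record."""
    if isinstance(value, dict) and len(value) < 16:
        out = bytes([0x80 | len(value)])               # fixmap: 0x80 + field count
        for key, item in value.items():
            out += mp_encode(key) + mp_encode(item)
        return out
    if isinstance(value, list) and len(value) < 16:
        return bytes([0x90 | len(value)]) + b"".join(mp_encode(v) for v in value)
    if isinstance(value, str):
        raw = value.encode("utf-8")
        if len(raw) < 32:
            return bytes([0xA0 | len(raw)]) + raw      # fixstr: 0xa0 + length
    if isinstance(value, int) and not isinstance(value, bool):
        if 0 <= value < 0x80:
            return bytes([value])                      # positive fixint
        if value <= 0xFF:
            return b"\xcc" + value.to_bytes(1, "big")  # uint 8
        if value <= 0xFFFF:
            return b"\xcd" + value.to_bytes(2, "big")  # uint 16
    raise TypeError("value not supported by this sketch")

record = {
    "userName": "Martin",
    "favoriteNumber": 1337,
    "interests": ["daydreaming", "hacking"],
}
encoded = mp_encode(record)
compact_json = json.dumps(record, separators=(",", ":"))
print(len(encoded), len(compact_json))
print(encoded[:10].hex())
```

Most of the 66 bytes are the field names and string values themselves, which is why a schemaless binary encoding saves so little compared with compact JSON.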
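### Protocol Buffers {#sec_encoding_protobuf}
Protocol Buffers (protobuf) is a binary encoding library developed at Google. It is similar to Apache Thrift, which was originally developed by Facebook [^13]; most of what this section says about Protocol Buffers also applies to Thrift.
Protocol Buffers requires a schema for any data that is encoded. To encode the data in [Example 5-2](#fig_encoding_json) in Protocol Buffers, you would describe the schema in the Protocol Buffers interface definition language (IDL) like this: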
```protobuf
syntax = "proto3";

message Person {
    string user_name = 1;
    int64 favorite_number = 2;
    repeated string interests = 3;
}
```
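Protocol Buffers comes with a code generation tool that takes a schema definition like the one shown here, and produces classes that implement the schema in various programming languages. Your application code can call this generated code to encode or decode records of the schema. Encoding [Example 5-2](#fig_encoding_json) using a Protocol Buffers encoder requires 33 bytes, as shown in [Figure 5-3](#fig_encoding_protobuf) [^14].
{{< figure src="/fig/ddia_0503.png" id="fig_encoding_protobuf" caption="Figure 5-3. Example record encoded using Protocol Buffers." class="w-full my-4" >}}
Similarly to [Figure 5-2](#fig_encoding_messagepack), each field has a type annotation (to indicate whether it is a string, integer, etc.) and, where required, a length indication (such as the length of a string). The strings that appear in the data ("Martin", "daydreaming", "hacking") are also encoded as ASCII (to be precise, UTF-8), similar to before.
The big difference compared to [Figure 5-2](#fig_encoding_messagepack) is that there are no field names (`userName`, `favoriteNumber`, `interests`). Instead, the encoded data contains *field tags*, which are numbers (`1`, `2`, and `3`). Those are the numbers that appear in the schema definition. Field tags are like aliases for fields: they are a compact way of saying what field we're talking about, without having to spell out the field name.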
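As you can see, Protocol Buffers saves even more space by packing the field type and tag number into a single byte. It uses variable-length integers: the number 1337 is encoded in two bytes, with the top bit of each byte used to indicate whether there are still more bytes to come. For the `int64` type used in this schema, that means numbers from 0 to 127 are encoded in one byte, numbers from 128 to 16,383 in two bytes, and so on; bigger numbers use more bytes. A negative `int64` value always takes ten bytes. The `sint64` type instead applies a zigzag mapping before the variable-length encoding, so that numbers between -64 and 63 are encoded in one byte and numbers between -8192 and 8191 in two bytes.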
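Protocol Buffers doesn't have an explicit list or array datatype. Instead, the `repeated` modifier on the `interests` field indicates that the field contains a list of values, rather than a single value. In the binary encoding, the list elements are represented simply as repeated occurrences of the same field tag within the same record.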
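The wire layout described above can be reproduced by hand. The sketch below does not use the protobuf library or generated code; the helper names (`varint`, `key`, `string_field`, `int_field`) are made up, and only the two wire types needed for the example record are handled.

```python
def varint(n: int) -> bytes:
    """Encode a non-negative integer as a base-128 variable-length integer."""
    out = bytearray()
    while True:
        low7, n = n & 0x7F, n >> 7
        if n:
            out.append(low7 | 0x80)   # top bit set: more bytes follow
        else:
            out.append(low7)          # top bit clear: last byte
            return bytes(out)

VARINT, LEN = 0, 2                    # the two wire types this record needs

def key(tag: int, wire_type: int) -> bytes:
    # Field tag and wire type are packed into one varint (one byte for small tags).
    return varint((tag << 3) | wire_type)

def string_field(tag: int, text: str) -> bytes:
    raw = text.encode("utf-8")
    return key(tag, LEN) + varint(len(raw)) + raw

def int_field(tag: int, number: int) -> bytes:
    return key(tag, VARINT) + varint(number)

encoded = (
    string_field(1, "Martin")
    + int_field(2, 1337)
    + string_field(3, "daydreaming")   # a repeated field is just the same
    + string_field(3, "hacking")       # tag appearing more than once
)
print(len(encoded), encoded.hex())
```

No field name appears anywhere in the output; the only link between these bytes and the names `user_name`, `favorite_number`, and `interests` is the tag numbers in the schema.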
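#### Field tags and schema evolution {#field-tags-and-schema-evolution}
We said previously that schemas inevitably need to change over time. We call this *schema evolution*. How does Protocol Buffers handle schema changes while keeping backward and forward compatibility?
As you can see from the examples, an encoded record is just the concatenation of its encoded fields. Each field is identified by its tag number (the numbers `1`, `2`, `3` in the sample schema) and annotated with a datatype (e.g., string or integer). If a field value is not set, it is simply omitted from the encoded record. From this you can see that field tags are critical to the meaning of the encoded data. You can change the name of a field in the schema, since the encoded data never refers to field names, but you cannot change a field's tag, since that would make all existing encoded data invalid.
You can add new fields to the schema, provided that you give each field a new tag number. If old code (which doesn't know about the new tag numbers you added) tries to read data written by new code, including a new field with a tag number it doesn't recognize, it can simply ignore that field. The datatype annotation allows the parser to determine how many bytes it needs to skip, and to preserve the unknown fields to avoid the problem in [Figure 5-1](#fig_encoding_preserve_field). This maintains forward compatibility: old code can read records that were written by new code.
What about backward compatibility? As long as each field has a unique tag number, new code can always read old data, because the tag numbers still have the same meaning. If a field was added in the new schema, and you read old data that does not yet contain that field, it is filled in with a default value (for example, the empty string if the field type is string, or zero if it's a number).
Removing a field is just like adding a field, with backward and forward compatibility concerns reversed. You can never use the same tag number again, because you may still have data written somewhere that includes the old tag number, and that field must be ignored by new code. Tag numbers used in the past can be reserved in the schema definition to ensure they are not forgotten.
What about changing the datatype of a field? This is possible with some types (check the documentation for details), but there is a risk that values will get truncated. For example, say you change a 32-bit integer into a 64-bit integer. New code can easily read data written by old code, because the parser can fill in any missing bits with zeros. However, if old code reads data written by new code, the old code is still using a 32-bit variable to hold the value. If the decoded 64-bit value won't fit in 32 bits, it will be truncated.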
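A sketch of how a decoder that only knows tags 1 and 2 might cope with a record that also contains tag 3. It is a hand-rolled parser for illustration, not the protobuf library; `read_varint` and `decode` are hypothetical names, and only the varint and length-delimited wire types are handled.

```python
def read_varint(buf: bytes, pos: int):
    """Return (value, next position) for the varint starting at pos."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7

def decode(buf: bytes, known_tags: set):
    """Split a record into fields this code understands and raw unknown fields."""
    known, unknown, pos = {}, [], 0
    while pos < len(buf):
        start = pos
        field_key, pos = read_varint(buf, pos)
        tag, wire_type = field_key >> 3, field_key & 0x07
        if wire_type == 0:                       # varint value
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:                     # length-delimited value
            length, pos = read_varint(buf, pos)
            value, pos = buf[pos:pos + length], pos + length
        else:
            raise ValueError("wire type not handled in this sketch")
        if tag in known_tags:
            known.setdefault(tag, []).append(value)
        else:
            unknown.append(buf[start:pos])       # keep the raw bytes untouched
    return known, unknown

# Record written by new code: tags 1, 2, and the newer repeated tag 3.
new_record = (b"\x0a\x06Martin" + b"\x10\xb9\x0a"
              + b"\x1a\x0bdaydreaming" + b"\x1a\x07hacking")

known, unknown = decode(new_record, known_tags={1, 2})
print(known)
print(unknown)
```

Because the wire type says how long each value is, the old code can step over tag 3 without understanding it, and appending the bytes in `unknown` when re-encoding keeps the newer field intact.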
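### Avro {#sec_encoding_avro}
Apache Avro is another binary encoding format that is interestingly different from Protocol Buffers. It was started in 2009 as a subproject of Hadoop, as a result of Protocol Buffers not being a good fit for Hadoop's use cases [^15].
Avro also uses a schema to specify the structure of the data being encoded. It has two schema languages: one (Avro IDL) intended for human editing, and one (based on JSON) that is more easily machine-readable. Like Protocol Buffers, this schema language specifies only fields and their types, and not complex validation rules like in JSON Schema.
Our example schema, written in Avro IDL, might look like this: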
```c
record Person {
    string               userName;
    union { null, long } favoriteNumber = null;
    array<string>        interests;
}
```
The equivalent JSON representation of that schema is as follows:
```json
{
    "type": "record",
    "name": "Person",
    "fields": [
        {"name": "userName",       "type": "string"},
        {"name": "favoriteNumber", "type": ["null", "long"], "default": null},
        {"name": "interests",      "type": {"type": "array", "items": "string"}}
    ]
}
```
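First of all, notice that there are no tag numbers in the schema. If we encode our example record ([Example 5-2](#fig_encoding_json)) using this schema, the Avro binary encoding is just 32 bytes long: the most compact of all the encodings we have seen. The breakdown of the encoded byte sequence is shown in [Figure 5-4](#fig_encoding_avro).
If you examine the byte sequence, you can see that there is nothing to identify fields or their datatypes. The encoding simply consists of values concatenated together. A string is just a length prefix followed by UTF-8 bytes, but there's nothing in the encoded data that tells you that it is a string. It could just as well be an integer, or something else entirely. An integer is encoded using a variable-length encoding.
{{< figure src="/fig/ddia_0504.png" id="fig_encoding_avro" caption="Figure 5-4. Example record encoded using Avro." class="w-full my-4" >}}
To parse the binary data, you go through the fields in the order that they appear in the schema and use the schema to tell you the datatype of each field. This means that the binary data can only be decoded correctly if the code reading the data is using the *exact same schema* as the code that wrote the data. Any mismatch in the schema between the reader and the writer would mean incorrectly decoded data.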
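The 32-byte figure can be checked with a hand-written encoder for this one schema. This is a sketch, not the Avro library: `zigzag_varint`, `avro_string`, and `avro_person` are made-up helpers, and the field order is hard-coded from the `Person` schema shown earlier.

```python
def zigzag_varint(n: int) -> bytes:
    """Avro int/long: zigzag-map the sign, then emit a base-128 varint."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while True:
        low7, z = z & 0x7F, z >> 7
        if z:
            out.append(low7 | 0x80)
        else:
            out.append(low7)
            return bytes(out)

def avro_string(text: str) -> bytes:
    raw = text.encode("utf-8")
    return zigzag_varint(len(raw)) + raw          # length prefix, then bytes

def avro_person(user_name, favorite_number, interests) -> bytes:
    out = avro_string(user_name)                  # field 1: string
    if favorite_number is None:
        out += zigzag_varint(0)                   # union branch 0: null
    else:
        out += zigzag_varint(1)                   # union branch 1: long
        out += zigzag_varint(favorite_number)
    if interests:
        out += zigzag_varint(len(interests))      # one block with this many items
        out += b"".join(avro_string(item) for item in interests)
    out += zigzag_varint(0)                       # zero count ends the array
    return out

encoded = avro_person("Martin", 1337, ["daydreaming", "hacking"])
print(len(encoded), encoded.hex())
```

Nothing in the output says which bytes are a string and which are a number; the function body is effectively the schema, which is why a reader needs the same schema to make sense of the data.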
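So, how does Avro support schema evolution?
#### The writer's schema and the reader's schema {#the-writers-schema-and-the-readers-schema}
When an application wants to encode some data (to write it to a file or database, to send it over the network, etc.), it encodes the data using whatever version of the schema it knows about; for example, that schema may be compiled into the application. This is known as the *writer's schema*.
When an application wants to decode some data (read it from a file or database, receive it from the network, etc.), it uses two schemas: the writer's schema that is identical to the one used for encoding, and the *reader's schema*, which may be different. This is illustrated in [Figure 5-5](#fig_encoding_avro_schemas). The reader's schema defines the fields of each record that the application code is expecting, and their types.
{{< figure src="/fig/ddia_0505.png" id="fig_encoding_avro_schemas" caption="Figure 5-5. In Protocol Buffers, encoding and decoding can use different versions of a schema. In Avro, decoding uses two schemas: the writer's schema must be identical to the one used for encoding, but the reader's schema can be an older or newer version." class="w-full my-4" >}}
If the reader's and writer's schemas are the same, decoding is easy. If they are different, Avro resolves the differences by looking at the writer's schema and the reader's schema side by side and translating the data from the writer's schema into the reader's schema. The Avro specification [^16] [^17] defines exactly how this resolution works, and it is illustrated in [Figure 5-6](#fig_encoding_avro_resolution).
For example, it's no problem if the writer's schema and the reader's schema have their fields in a different order, because the schema resolution matches up the fields by field name. If the code reading the data encounters a field that appears in the writer's schema but not in the reader's schema, it is ignored. If the code reading the data expects some field, but the writer's schema does not contain a field of that name, it is filled in with a default value declared in the reader's schema.
{{< figure src="/fig/ddia_0506.png" id="fig_encoding_avro_resolution" caption="Figure 5-6. An Avro reader resolves differences between the writer's schema and the reader's schema." class="w-full my-4" >}}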
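The three resolution rules just described (match by name, ignore writer-only fields, fill reader-only fields from defaults) can be sketched on already-decoded records. This is a simplification under stated assumptions: schemas are reduced to lists of `(name, default)` pairs, the field names `photoURL` and `userID` are illustrative, and type promotion and aliases from the Avro specification are left out.

```python
NO_DEFAULT = object()   # sentinel: the field declares no default value

def resolve(record: dict, writer_fields, reader_fields) -> dict:
    """Translate a record decoded with the writer's schema into the reader's view.

    Each schema is a list of (field name, default) pairs.
    """
    writer_names = {name for name, _default in writer_fields}
    result = {}
    for name, default in reader_fields:
        if name in writer_names:
            result[name] = record[name]          # matched by name, order ignored
        elif default is not NO_DEFAULT:
            result[name] = default               # missing in data: use the default
        else:
            raise ValueError(f"no value and no default for field {name!r}")
    return result                                # writer-only fields are dropped

writer_fields = [("userName", NO_DEFAULT), ("favoriteNumber", None),
                 ("photoURL", NO_DEFAULT)]
reader_fields = [("userID", None), ("userName", NO_DEFAULT),
                 ("favoriteNumber", None)]

record = {"userName": "Martin", "favoriteNumber": 1337,
          "photoURL": "https://example.com/m.png"}
resolved = resolve(record, writer_fields, reader_fields)
print(resolved)
```

The `ValueError` branch is the case that breaks compatibility: a reader field with no counterpart in the writer's schema and no default to fall back on.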
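#### Schema evolution rules {#schema-evolution-rules}
With Avro, forward compatibility means that you can have a new version of the schema as writer and an old version of the schema as reader. Conversely, backward compatibility means that you can have a new version of the schema as reader and an old version as writer.
To maintain compatibility, you may only add or remove a field that has a default value. (The field `favoriteNumber` in our Avro schema has a default value of `null`.) For example, say you add a field with a default value, so this new field exists in the new schema but not the old one. When a reader using the new schema reads a record written with the old schema, the default value is filled in for the missing field.
If you were to add a field that has no default value, new readers wouldn't be able to read data written by old writers, so you would break backward compatibility. If you were to remove a field that has no default value, old readers wouldn't be able to read data written by new writers, so you would break forward compatibility.
In some programming languages, `null` is an acceptable default for any variable, but this is not the case in Avro: if you want to allow a field to be null, you have to use a *union type*. For example, `union { null, long, string } field;` indicates that `field` can be a number, or a string, or null. You can only use `null` as a default value if it is the first branch of the union. This is a little more verbose than having everything nullable by default, but it helps prevent bugs by being explicit about what can and cannot be null [^18].
Changing the datatype of a field is possible, provided that Avro can convert the type. Changing the name of a field is possible but a little tricky: the reader's schema can contain aliases for field names, so it can match an old writer's schema field names against the aliases. This means that changing a field name is backward compatible but not forward compatible. Similarly, adding a branch to a union type is backward compatible but not forward compatible.
#### But what is the writer's schema? {#but-what-is-the-writers-schema}
There is an important question that we've glossed over so far: how does the reader know the writer's schema with which a particular piece of data was encoded? We can't just include the entire schema with every record, because the schema would likely be much bigger than the encoded data, making all the space savings from the binary encoding futile.
The answer depends on the context in which Avro is being used. To give a few examples:
Large file with lots of records
: A common use for Avro is for storing a large file containing millions of records, all encoded with the same schema. (We will discuss this kind of situation in [Chapter 11](/tw/ch11#ch_batch).) In this case, the writer of that file can just include the writer's schema once at the beginning of the file. Avro specifies a file format (object container files) to do this.
Database with individually written records
: In a database, different records may be written at different points in time using different writer's schemas; you cannot assume that all the records will have the same schema. The simplest solution is to include a version number at the beginning of every encoded record, and to keep a list of schema versions in your database. A reader can fetch a record, extract the version number, and then fetch the writer's schema for that version number from the database. Using that writer's schema, it can decode the rest of the record.
    Confluent's schema registry for Apache Kafka [^19] and LinkedIn's Espresso [^20] work this way, for example.
Sending records over a network connection
: When two processes are communicating over a bidirectional network connection, they can negotiate the schema version on connection setup and then use that schema for the lifetime of the connection. The Avro RPC protocol (see ["Dataflow Through Services: REST and RPC"](#sec_encoding_dataflow_rpc)) works like this.
A database of schema versions is a useful thing to have in any case, since it acts as documentation and gives you a chance to check schema compatibility [^21]. As the version number, you could use a simple incrementing integer, or you could use a hash of the schema.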
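The "version number at the beginning of every record" approach can be sketched as a framing layer. Everything here is an assumption for illustration: the in-memory `schema_registry` dict stands in for a real schema database, the 4-byte big-endian prefix is one arbitrary choice of layout, and the payload bytes are opaque.

```python
import struct

schema_registry = {}    # version number -> writer's schema (hypothetical store)

def register(version: int, schema: dict) -> None:
    schema_registry[version] = schema

def frame(version: int, payload: bytes) -> bytes:
    """Prefix an encoded record with the version of the writer's schema."""
    return struct.pack(">I", version) + payload

def unframe(stored: bytes):
    """Return (writer's schema, encoded record) for a stored value."""
    (version,) = struct.unpack_from(">I", stored)
    return schema_registry[version], stored[4:]

register(1, {"name": "Person", "fields": ["userName"]})
register(2, {"name": "Person", "fields": ["userName", "favoriteNumber"]})

stored = frame(2, b"\x0cMartin\x02\xf2\x14")
writer_schema, payload = unframe(stored)
print(writer_schema["fields"], payload)
```

The reader pays four bytes per record instead of carrying the whole schema, and the registry doubles as a record of every schema version that was ever used to write data.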
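#### Dynamically generated schemas {#dynamically-generated-schemas}
One advantage of Avro's approach, compared to Protocol Buffers, is that the schema doesn't contain any tag numbers. But why is this important? What's the problem with keeping a couple of numbers in the schema?
The difference is that Avro is friendlier to *dynamically generated* schemas. For example, say you have a relational database whose contents you want to dump to a file, and you want to use a binary format to avoid the aforementioned problems with textual formats (JSON, CSV, XML). If you use Avro, you can fairly easily generate an Avro schema (in the JSON representation we saw earlier) from the relational schema and encode the database contents using that schema, dumping it all to an Avro object container file [^22]. You can generate a record schema for each database table, and each column becomes a field in that record. The column name in the database maps to the field name in Avro.
Now, if the database schema changes (for example, a table has one column added and one column removed), you can just generate a new Avro schema from the updated database schema and export data in the new Avro schema. The data export process does not need to pay any attention to the schema change: it can simply do the schema conversion every time it runs. Anyone who reads the new data files will see that the fields of the record have changed, but since the fields are identified by name, the updated writer's schema can still be matched up with the old reader's schema.
By contrast, if you were using Protocol Buffers for this purpose, the field tags would likely have to be assigned by hand: every time the database schema changes, an administrator would have to manually update the mapping from database column names to field tags. (It might be possible to automate this, but the schema generator would have to be very careful not to assign previously used field tags.) This kind of dynamically generated schema simply wasn't a design goal of Protocol Buffers, whereas it was for Avro.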
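As a rough sketch of what generating a schema from a relational table could look like, the following uses Python's built-in `sqlite3` module. The type mapping in `AVRO_TYPES` and the `person` table are assumptions made for this example, and a real exporter would need to cover many more column types.

```python
import json
import sqlite3

# Assumed mapping from SQLite declared column types to Avro primitive types.
AVRO_TYPES = {"INTEGER": "long", "REAL": "double", "TEXT": "string", "BLOB": "bytes"}

def avro_schema_for(conn: sqlite3.Connection, table: str) -> dict:
    """Build an Avro record schema (JSON form) from a table's current columns."""
    fields = []
    for _cid, name, sql_type, notnull, _default, _pk in conn.execute(
            f"PRAGMA table_info({table})"):
        avro_type = AVRO_TYPES[sql_type.upper()]
        if notnull:
            fields.append({"name": name, "type": avro_type})
        else:
            # Nullable column: a union with null first, so null can be the default.
            fields.append({"name": name, "type": ["null", avro_type], "default": None})
    return {"type": "record", "name": table, "fields": fields}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (user_name TEXT NOT NULL, favorite_number INTEGER)")
before = avro_schema_for(conn, "person")

conn.execute("ALTER TABLE person ADD COLUMN photo_url TEXT")   # schema change
after = avro_schema_for(conn, "person")
print(json.dumps(after, indent=2))
```

The generator holds no state between runs: the column names are the only identifiers it needs, which is exactly what would not be true if it had to hand out stable tag numbers.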
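### The Merits of Schemas {#sec_encoding_schemas}
As we saw, Protocol Buffers and Avro both use a schema to describe a binary encoding format. Their schema languages are much simpler than XML Schema or JSON Schema, which support much more detailed validation rules (e.g., "the string value of this field must match this regular expression" or "the integer value of this field must be between 0 and 100"). As Protocol Buffers and Avro are simpler to implement and simpler to use, they have grown to support a fairly wide range of programming languages.
The ideas on which these encodings are based are by no means new. For example, they have a lot in common with ASN.1, a schema definition language that was first standardized in 1984 [^23] [^24]. It was used to define various network protocols, and its binary encoding (DER) is still used to encode SSL certificates (X.509), for example [^25]. ASN.1 supports schema evolution using tag numbers, similar to Protocol Buffers [^26]. However, it is also very complex and badly documented, so ASN.1 is probably not a good choice for new applications.
Many data systems also implement some kind of proprietary binary encoding for their data. For example, most relational databases have a network protocol over which you can send queries to the database and get back responses. Those protocols are generally specific to a particular database, and the database vendor provides a driver (e.g., using the ODBC or JDBC APIs) that decodes responses from the database's network protocol into in-memory data structures.
So, we can see that although textual data formats such as JSON, XML, and CSV are widespread, binary encodings based on schemas are also a viable option. They have a number of nice properties:
* They can be much more compact than the various "binary JSON" variants, since they can omit field names from the encoded data.
* The schema is a valuable form of documentation, and because the schema is required for decoding, you can be sure that it is up to date (whereas manually maintained documentation may easily diverge from reality).
* Keeping a database of schemas allows you to check forward and backward compatibility of schema changes, before anything is deployed.
* For users of statically typed programming languages, the ability to generate code from the schema is useful, since it enables type checking at compile time.
In summary, schema evolution allows the same kind of flexibility as schemaless/schema-on-read JSON databases provide (see ["Schema flexibility in the document model"](/tw/ch3#sec_datamodels_schema_flexibility)), while also providing better guarantees about your data and better tooling.
## Modes of Dataflow {#sec_encoding_dataflow}
At the beginning of this chapter we said that whenever you want to send some data to another process with which you don't share memory (for example, whenever you want to send data over the network or write it to a file), you need to encode it as a sequence of bytes. We then discussed a variety of different encodings for doing this.
We talked about forward and backward compatibility, which are important for evolvability (making change easy by allowing you to upgrade different parts of your system independently, and not having to change everything at once). Compatibility is a relationship between one process that encodes the data, and another process that decodes it.
That's a fairly abstract idea: there are many ways data can flow from one process to another. Who encodes the data, and who decodes it? In the rest of this chapter we will explore some of the most common ways how data flows between processes:
* Via databases (see ["Dataflow Through Databases"](#sec_encoding_dataflow_db))
* Via service calls (see ["Dataflow Through Services: REST and RPC"](#sec_encoding_dataflow_rpc))
* Via workflow engines (see ["Durable Execution and Workflows"](#sec_encoding_dataflow_workflows))
* Via asynchronous messages (see ["Event-Driven Architectures"](#sec_encoding_dataflow_msg))
### Dataflow Through Databases {#sec_encoding_dataflow_db}
In a database, the process that writes to the database encodes the data, and the process that reads from the database decodes it. There may just be a single process accessing the database, in which case the reader is simply a later version of the same process; in that case you can think of storing something in the database as *sending a message to your future self*.
Backward compatibility is clearly necessary here; otherwise your future self won't be able to decode what you previously wrote.
In general, it's common for several different processes to be accessing a database at the same time. Those processes might be several different applications or services, or they may simply be several instances of the same service (running in parallel for scalability or fault tolerance). Either way, in an environment where the application is changing, it is likely that some processes accessing the database will be running newer code and some will be running older code; for example, because a new version is currently being deployed in a rolling upgrade, so some instances have been updated while others haven't yet.
This means that a value in the database may be written by a *newer* version of the code, and subsequently read by an *older* version of the code that is still running. Thus, forward compatibility is also often required for databases.
#### Different values written at different times {#different-values-written-at-different-times}
A database generally allows any value to be updated at any time. This means that within a single database you may have some values that were written five milliseconds ago, and some values that were written five years ago.
When you deploy a new version of your application (of a server-side application, at least), you may entirely replace the old version with the new version within a few minutes. The same is not true of database contents: the five-year-old data will still be there, in the original encoding, unless you have explicitly rewritten it since then. This observation is sometimes summed up as *data outlives code*.
Rewriting (*migrating*) data into a new schema is certainly possible, but it's an expensive thing to do on a large dataset, so most databases avoid it if possible. Most relational databases allow simple schema changes, such as adding a new column with a `null` default value, without rewriting existing data. When an old row is read, the database fills in `null` for any columns that are missing from the encoded data on disk. Schema evolution thus allows the entire database to appear as if it was encoded with a single schema, even though the underlying storage may contain records encoded with various historical versions of the schema.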
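The observable effect of adding a nullable column can be seen with Python's built-in `sqlite3` module. The table and column names are made up for this sketch, and it only shows what a reader sees, not how any particular storage engine lays out rows on disk.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

# Row written while the old schema was in force.
conn.execute("INSERT INTO users (name) VALUES ('Martin')")

# Simple schema change: a new column whose default is NULL.
conn.execute("ALTER TABLE users ADD COLUMN favorite_number INTEGER")

# Row written under the new schema.
conn.execute("INSERT INTO users (name, favorite_number) VALUES ('Ada', 1337)")

rows = conn.execute(
    "SELECT name, favorite_number FROM users ORDER BY id").fetchall()
print(rows)
```

Both rows come back in the shape of the current schema, with the old row reporting a null for the column that did not exist when it was written.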
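More complex schema changes (for example, changing a single-valued attribute to be multi-valued, or moving some data into a separate table) still require data to be rewritten, often at the application level [^27]. Maintaining forward and backward compatibility across such migrations is still a research problem [^28].
#### Archival storage {#archival-storage}
Perhaps you take a snapshot of your database from time to time, say for backup purposes or for loading into a data warehouse (see ["Data Warehousing"](/tw/ch1#sec_introduction_dwh)). In this case, the data dump will typically be encoded using the latest schema, even if the original encoding in the source database contained a mixture of schema versions from different eras. Since you're copying the data anyway, you might as well encode the copy of the data consistently.
As the data dump is written in one go and is thereafter immutable, formats like Avro object container files are a good fit. This is also a good opportunity to encode the data in an analytics-friendly column-oriented format such as Parquet (see ["Column Compression"](/tw/ch4#sec_storage_column_compression)).
In [Chapter 11](/tw/ch11#ch_batch) we will talk more about using data in archival storage.
### Dataflow Through Services: REST and RPC {#sec_encoding_dataflow_rpc}
When you have processes that need to communicate over a network, there are a few different ways of arranging that communication. The most common arrangement is to have two roles: *clients* and *servers*. The servers expose an API over the network, and the clients can connect to the servers to make requests to that API. The API exposed by the server is known as a *service*.
The web works this way: clients (web browsers) make requests to web servers, making `GET` requests to download HTML, CSS, JavaScript, images, etc., and making `POST` requests to submit data to the server. The API consists of a standardized set of protocols and data formats (HTTP, URLs, SSL/TLS, HTML, etc.). Because web browsers, web servers, and website authors mostly agree on these standards, you can use any web browser to access any website (at least in theory!).
Web browsers are not the only type of client. For example, native apps running on mobile devices and desktop computers often talk to servers too, and client-side JavaScript applications running inside web browsers can also make HTTP requests. In this case, the server's response is typically not HTML for displaying to a human, but rather data in an encoding that is convenient for further processing by the client-side application code (most often JSON). Although HTTP may be used as the transport protocol, the API implemented on top is application-specific, and the client and server need to agree on the details of that API.
In some ways, services are similar to databases: they typically allow clients to submit and query data. However, while databases allow arbitrary queries using the query languages we discussed in [Chapter 3](/tw/ch3#ch_datamodels), services expose an application-specific API that only allows inputs and outputs that are predetermined by the business logic (application code) of the service [^29]. This restriction provides a degree of encapsulation: services can impose fine-grained restrictions on what clients can and cannot do.
A key design goal of a service-oriented/microservices architecture is to make the application easier to change and maintain by making services independently deployable and evolvable. A common principle is that each service should be owned by one team, and that team should be able to release new versions of the service frequently, without having to coordinate with other teams. We should therefore expect old and new versions of servers and clients to be running at the same time, and so the data encoding used by servers and clients must be compatible across versions of the service API.
#### Web services {#sec_web_services}
When HTTP is used as the underlying protocol for talking to the service, it is called a *web service*. Web services are commonly used when building a service-oriented or microservices architecture (discussed in ["Microservices and Serverless"](/tw/ch1#sec_introduction_microservices)). The term "web service" is perhaps a slight misnomer, because web services are not only used on the web, but in several different contexts. For example:
1. A client application running on a user's device (e.g., a native app on a mobile device, or a JavaScript web app in a browser) making requests to a service over HTTP. These requests typically go over the public internet.
2. One service making requests to another service owned by the same organization, often located within the same datacenter, as part of a service-oriented/microservices architecture.
3. One service making requests to a service owned by a different organization, usually via the internet. This is used for data exchange between different organizations' backend systems. This category includes public APIs provided by online services, such as credit card processing systems, or OAuth for shared access to user data.
The most popular service design philosophy is REST, which builds upon the principles of HTTP [^30] [^31]. It emphasizes simple data formats, using URLs for identifying resources and using HTTP features for cache control, authentication, and content type negotiation. An API designed according to the principles of REST is called *RESTful*.
Code that needs to invoke a web service API must know which HTTP endpoint to query, what data format to send, and what response to expect. Even if a service adopts RESTful design principles, clients need to somehow find out these details. Service developers often use an interface definition language (IDL) to define and document their service's API endpoints and data models, and to evolve them over time. Other developers can then use the service definition to determine how to query the service. The two most popular service IDLs are OpenAPI (also known as Swagger [^32]) and gRPC. OpenAPI is used for web services that send and receive JSON data, while gRPC services send and receive Protocol Buffers.
Developers typically write OpenAPI service definitions in JSON or YAML; see [Example 5-3](#fig_open_api_def). The service definition allows developers to define service endpoints, documentation, versions, data models, and much more. gRPC definitions look similar, but are defined using Protocol Buffers service definitions.
{{< figure id="fig_open_api_def" title="Example 5-3. Example OpenAPI service definition in YAML" class="w-full my-4" >}}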
```yaml
openapi: 3.0.0
info:
  title: Ping, Pong
  version: 1.0.0
servers:
  - url: http://localhost:8080
paths:
  /ping:
    get:
      summary: Given a ping, returns a pong message
      responses:
        '200':
          description: A pong
          content:
            application/json:
              schema:
                type: object
                properties:
                  message:
                    type: string
                    example: Pong!
```
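Even if a design philosophy and IDL are adopted, developers must still write the code that implements their service's API calls. A service framework is often adopted to simplify this effort. Service frameworks such as Spring Boot, FastAPI, and gRPC allow developers to write the business logic for each API endpoint while the framework code handles routing, metrics, caching, authentication, and so on. [Example 5-4](#fig_fastapi_def) shows an example Python implementation of the service defined in [Example 5-3](#fig_open_api_def).
{{< figure id="fig_fastapi_def" title="Example 5-4. Example FastAPI service implementing the definition from [Example 5-3](#fig_open_api_def)" class="w-full my-4" >}}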
```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Ping, Pong", version="1.0.0")

class PongResponse(BaseModel):
    message: str = "Pong!"

@app.get("/ping", response_model=PongResponse,
         summary="Given a ping, returns a pong message")
async def ping():
    return PongResponse()
```
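Many frameworks couple service definitions and server code together. In some cases, such as with the popular Python FastAPI framework, servers are written in code and an IDL is generated automatically. In other cases, such as with gRPC, the service definition is written first, and server code scaffolding is generated. Both approaches allow developers to generate client libraries and SDKs in a variety of languages from the service definition. In addition to code generation, IDL tools such as Swagger's can generate documentation, verify schema change compatibility, and provide a graphical user interface for developers to query and test services.
#### The problems with remote procedure calls (RPCs) {#sec_problems_with_rpc}
Web services are merely the latest incarnation of a long line of technologies for making API requests over a network, many of which received a lot of hype but have serious problems. Enterprise JavaBeans (EJB) and Java's Remote Method Invocation (RMI) are limited to Java. The Distributed Component Object Model (DCOM) is limited to Microsoft platforms. The Common Object Request Broker Architecture (CORBA) is excessively complex, and does not provide backward or forward compatibility [^33]. SOAP and the WS-\* web services framework aim to provide interoperability across vendors, but are also plagued by complexity and compatibility problems [^34] [^35] [^36].
All of these are based on the idea of a *remote procedure call* (RPC), which has been around since the 1970s [^37]. The RPC model tries to make a request to a remote network service look the same as calling a function or method in your programming language, within the same process (this abstraction is called *location transparency*). Although RPC seems convenient at first, the approach is fundamentally flawed [^38] [^39]. A network request is very different from a local function call:
* A local function call is predictable and either succeeds or fails, depending only on parameters that are under your control. A network request is unpredictable: the request or response may be lost due to a network problem, or the remote machine may be slow or unavailable, and such problems are entirely outside of your control. Network problems are common, so you have to anticipate them, for example by retrying a failed request.
* A local function call either returns a result, or throws an exception, or never returns (because it goes into an infinite loop or the process crashes). A network request has another possible outcome: it may return without a result, due to a *timeout*. In that case, you simply don't know what happened: if you don't get a response from the remote service, you have no way of knowing whether the request got through or not. (We discuss this issue in more detail in [Chapter 9](/tw/ch9#ch_distributed).)
* If you retry a failed network request, it could happen that the previous request actually got through, and only the response was lost. In that case, retrying will cause the action to be performed multiple times, unless you build a mechanism for deduplication (*idempotence*) into the protocol [^40]. Local function calls don't have this problem. (We discuss idempotence in more detail in ["Idempotence"](/tw/ch12#sec_stream_idempotence).)
* Every time you call a local function, it normally takes about the same time to execute. A network request is much slower than a function call, and its latency is also wildly variable: at good times it may complete in less than a millisecond, but when the network is congested or the remote service is overloaded it may take many seconds to do exactly the same thing.
* When you call a local function, you can efficiently pass it references (pointers) to objects in local memory. When you make a network request, all those parameters need to be encoded into a sequence of bytes that can be sent over the network. That's okay if the parameters are immutable primitives like numbers or short strings, but it quickly becomes problematic with larger amounts of data and mutable objects.
* The client and the service may be implemented in different programming languages, so the RPC framework must translate datatypes from one language into another. This can end up ugly, since not all languages have the same types: recall JavaScript's problems with numbers greater than 2⁵³, for example (see ["JSON, XML, and Binary Variants"](#sec_encoding_json)). This problem doesn't exist in a single process written in a single language.
All of these factors mean that there's no point trying to make a remote service look too much like a local object in your programming language, because it's a fundamentally different thing. Part of the appeal of REST is that it treats state transfer over a network as a process that is distinct from a function call.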
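The retry problem from the list above, and the deduplication that addresses it, can be sketched without any networking. The `PaymentService` class, the `idempotency_key` parameter, and the simulated lost response are all hypothetical; a real service would persist the key-to-response table and expire old entries.

```python
import uuid

class PaymentService:
    """Hypothetical server that deduplicates requests by idempotency key."""
    def __init__(self):
        self.balance = 100
        self._responses = {}                    # idempotency key -> earlier response

    def debit(self, idempotency_key: str, amount: int) -> dict:
        if idempotency_key in self._responses:
            return self._responses[idempotency_key]   # duplicate: do not re-apply
        self.balance -= amount
        response = {"status": "ok", "balance": self.balance}
        self._responses[idempotency_key] = response
        return response

def call_with_retry(service: PaymentService, amount: int, lose_first_response=True):
    key = str(uuid.uuid4())                     # chosen once, reused on every retry
    for attempt in range(3):
        response = service.debit(key, amount)
        if lose_first_response and attempt == 0:
            continue                            # simulate: request applied, response lost
        return response

service = PaymentService()
response = call_with_retry(service, 30)
print(response, service.balance)
```

The request reaches the server twice, but the balance changes once. Without the key, the client could not safely retry, because it cannot tell a lost request from a lost response.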
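#### Load balancers, service discovery, and service meshes {#sec_encoding_service_discovery}
All services communicate over the network. For this reason, a client must know the address of the service it's connecting to, a problem known as *service discovery*. The simplest approach is to configure a client to connect to the IP address and port where the service is running. This configuration will work, but if the server goes offline, is moved to a new machine, or becomes overloaded, the client has to be manually reconfigured.
To provide higher availability and scalability, there are usually multiple instances of a service running on different machines, any of which can handle an incoming request. Spreading requests across these instances is called *load balancing* [^41]. There are many load balancing and service discovery solutions available:
* *Hardware load balancers* are specialized pieces of equipment that are installed in data centers. They allow clients to connect to a single host and port, and incoming connections are routed to one of the servers running the service. Such load balancers detect network failures when connecting to a downstream server and shift the traffic to other servers.
* *Software load balancers* behave in much the same way as hardware load balancers. But rather than requiring a special appliance, software load balancers such as Nginx and HAProxy are applications that can be installed on a standard machine.
* The *domain name service (DNS)* is how domain names are resolved on the internet when you open a webpage. It supports load balancing by allowing multiple IP addresses to be associated with a single domain name. Clients can then be configured to connect to a service using a domain name rather than an IP address, and the client's network layer picks which IP address to use when making a connection. One drawback of this approach is that DNS is designed to propagate changes over longer periods of time, and to cache DNS entries. If servers are started, stopped, or moved frequently, clients might see stale IP addresses that no longer have a server running on them.
* *Service discovery systems* use a centralized registry rather than DNS to track which service endpoints are available. When a new service instance starts up, it registers itself with the service discovery system by declaring the host and port it's listening on, along with relevant metadata such as shard ownership information (see [Chapter 7](/tw/ch7#ch_sharding)), data center location, and more. The service then periodically sends a heartbeat signal to the discovery system to signal that the service is still available.
  When a client wishes to connect to a service, it first queries the discovery system to get a list of available endpoints, and then connects directly to the endpoint. Compared to DNS, service discovery supports a much more dynamic environment where service instances change frequently. Discovery systems also give clients more metadata about the service they're connecting to, which enables clients to make smarter load balancing decisions.
* *Service meshes* are a sophisticated form of load balancing that combine software load balancers and service discovery. Unlike traditional software load balancers, which run on a separate machine, service mesh load balancers are typically deployed as an in-process client library or as a process or "sidecar" container on both the client and server. Client applications connect to their own local service load balancer, which connects to the server's load balancer. From there, the connection is routed to the local server process.
  Though complicated, this topology offers a number of advantages. Because the clients and servers are routed entirely through local connections, connection encryption can be handled entirely at the load balancer level. This shields clients and servers from having to deal with the complexities of SSL certificates and TLS. Mesh systems also provide sophisticated observability. They can track which services are calling each other in real time, detect failures, track traffic load, and more.
Which solution is appropriate depends on an organization's needs. Those running in a very dynamic service environment with an orchestrator such as Kubernetes often choose to run a service mesh such as Istio or Linkerd. Specialized infrastructure such as databases or messaging systems might require their own purpose-built load balancer. Simpler deployments are best served with software load balancers.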
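The register-and-heartbeat cycle of a service discovery system can be sketched as a small in-memory registry. The `Registry` class, its time-to-live rule, and the endpoint strings are assumptions for illustration; time is passed in explicitly so that the example is deterministic, whereas a real system would use a clock and a replicated store.

```python
class Registry:
    """Hypothetical discovery registry: endpoints expire without heartbeats."""
    def __init__(self, ttl_seconds: float = 10.0):
        self.ttl = ttl_seconds
        self._last_seen = {}        # (service, endpoint) -> time of last heartbeat

    def heartbeat(self, service: str, endpoint: str, now: float) -> None:
        # The first heartbeat doubles as registration.
        self._last_seen[(service, endpoint)] = now

    def lookup(self, service: str, now: float) -> list:
        # Only endpoints heard from within the TTL count as available.
        return sorted(endpoint
                      for (name, endpoint), seen in self._last_seen.items()
                      if name == service and now - seen <= self.ttl)

registry = Registry(ttl_seconds=10.0)
registry.heartbeat("payments", "10.0.0.1:8080", now=0.0)
registry.heartbeat("payments", "10.0.0.2:8080", now=0.0)
registry.heartbeat("payments", "10.0.0.1:8080", now=8.0)   # only one instance stays alive

print(registry.lookup("payments", now=5.0))    # both instances are fresh
print(registry.lookup("payments", now=15.0))   # the silent instance has expired
```

A client would call `lookup` and then connect directly to one of the returned endpoints, which is what distinguishes this arrangement from routing every request through a load balancer.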
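#### Data encoding and evolution for RPC {#data-encoding-and-evolution-for-rpc}
For evolvability, it is important that RPC clients and servers can be changed and deployed independently. Compared to data flowing through databases (as described in the last section), we can make a simplifying assumption in the case of dataflow through services: it is reasonable to assume that all the servers will be updated first, and all the clients second. Thus, you only need backward compatibility on requests, and forward compatibility on responses.
The backward and forward compatibility properties of an RPC scheme are inherited from whatever encoding it uses:
* gRPC (Protocol Buffers) and Avro RPC can be evolved according to the compatibility rules of the respective encoding format.
* RESTful APIs most commonly use JSON for responses, and JSON or URI-encoded/form-encoded request parameters for requests. Adding optional request parameters and adding new fields to response objects are usually considered changes that maintain compatibility.
Service compatibility is made harder by the fact that RPC is often used for communication across organizational boundaries, so the provider of a service often has no control over its clients and cannot force them to upgrade. Thus, compatibility needs to be maintained for a long time, perhaps indefinitely. If a compatibility-breaking change is required, the service provider often ends up maintaining multiple versions of the service API side by side.
There is no agreement on how API versioning should work (i.e., how a client can indicate which version of the API it wants to use) [^42]. For RESTful APIs, common approaches are to use a version number in the URL or in the HTTP `Accept` header. For services that use API keys to identify a particular client, another option is to store a client's requested API version on the server and to allow this version selection to be updated through a separate administrative interface [^43].
### Durable Execution and Workflows {#sec_encoding_dataflow_workflows}
By definition, service-based architectures have multiple services that are all responsible for different portions of an application. Consider a payment processing application that charges a credit card and deposits the funds into a bank account. This system would likely have different services responsible for fraud detection, credit card integration, bank integration, and so on.
Processing a single payment in our example requires many service calls. A payment processor service might invoke the fraud detection service to check for fraud, call the credit card service to debit the credit card, and call the banking service to deposit the debited funds, as shown in [Figure 5-7](#fig_encoding_workflow). We call this sequence of steps a *workflow*, and each step a *task*. Workflows are typically defined as a graph of tasks. Workflow definitions may be written in a general-purpose programming language, a domain-specific language (DSL), or a markup language such as Business Process Execution Language (BPEL) [^44].
--------
> [!TIP] Tasks, activities, and functions
>
> Different workflow engines use different names for tasks. Temporal, for example, uses the term *activity*. Others refer to tasks as *durable functions*. Though the names differ, the concepts are the same.
--------
{{< figure src="/fig/ddia_0507.png" id="fig_encoding_workflow" title="Figure 5-7. Example of a workflow expressed using Business Process Model and Notation (BPMN), a graphical notation." class="w-full my-4" >}}
Workflows are run, or executed, by a *workflow engine*. Workflow engines determine when to run each task, on which machine a task must be run, what to do if a task fails (e.g., if the machine crashes while the task is running), how many tasks are allowed to execute in parallel, and more.
Workflow engines are typically composed of an orchestrator and an executor. The orchestrator is responsible for scheduling tasks to be executed, and the executor is responsible for executing tasks. Execution begins when a workflow is triggered. The orchestrator triggers the workflow itself if the user defines a time-based schedule, such as hourly execution. External sources such as a web service or even a human can also trigger workflow executions. Once triggered, executors are invoked to run tasks.
There are many kinds of workflow engines that address a diverse set of use cases. Some, such as Airflow, Dagster, and Prefect, integrate with data systems and orchestrate ETL tasks. Others, such as Camunda and Orkes, provide a graphical notation for workflows (such as BPMN, used in [Figure 5-7](#fig_encoding_workflow)) so that non-engineers can more easily define and execute workflows. Still others, such as Temporal and Restate, provide *durable execution*.
#### Durable execution {#durable-execution}
Durable execution frameworks have become a popular way to build service-based architectures that require transactionality. In our payment example, we would like to process each payment exactly once. A failure while the workflow is executing could result in a credit card charge, but no corresponding bank account deposit. In a service-based architecture, we can't simply wrap the two tasks in a database transaction. Moreover, we might be interacting with third-party payment gateways that we have limited control over.
Durable execution frameworks are a way to provide *exactly-once semantics* for workflows. If a task fails, the framework will re-execute the task, but will skip any RPC calls or state changes that the task made successfully before failing. Instead, the framework will pretend to make the call, but will actually return the results from the previous call. This is possible because durable execution frameworks log all RPCs and state changes to durable storage such as a write-ahead log [^45] [^46]. [Example 5-5](#fig_temporal_workflow) shows an example of a workflow definition that supports durable execution using Temporal.
{{< figure id="fig_temporal_workflow" title="Example 5-5. A Temporal workflow definition fragment for the payment workflow in [Figure 5-7](#fig_encoding_workflow)." class="w-full my-4" >}}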
```python
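from datetime import timedelta

from temporalio import workflow

# PaymentRequest, PaymentResult, PaymentResultFraudulent, check_fraud, and
# debit_credit_card are assumed to be defined elsewhere in the application.


@workflow.defn
class PaymentWorkflow:
    @workflow.run
    async def run(self, payment: PaymentRequest) -> PaymentResult:
        # Step 1: ask the fraud detection activity whether the payment is suspicious.
        is_fraud = await workflow.execute_activity(
            check_fraud,
            payment,
            start_to_close_timeout=timedelta(seconds=15),
        )
        if is_fraud:
            return PaymentResultFraudulent
        # Step 2: debit the credit card. On replay, the recorded result is reused.
        credit_card_response = await workflow.execute_activity(
            debit_credit_card,
            payment,
            start_to_close_timeout=timedelta(seconds=15),
        )
        # ...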
```
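Frameworks like Temporal are not without their challenges. External services, such as the third-party payment gateway in our example, must still provide an idempotent API. Developers must remember to use unique IDs for these APIs to prevent duplicate execution [^47]. Because a durable execution framework records each RPC call in order, it expects a subsequent execution to make the same RPC calls in the same order. This makes code changes brittle: you might introduce undefined behavior merely by reordering function calls [^48]. Rather than modifying the code of an existing workflow, it is safer to deploy the new version of the code separately, so that re-executions of existing workflow invocations continue to use the old version and only new invocations use the new code [^49].
Similarly, because durable execution frameworks expect all code to replay deterministically (the same inputs produce the same outputs), nondeterministic code such as random number generators or system clocks is problematic [^48]. Frameworks often provide their own deterministic implementations of such library functions, but you have to remember to use them. In some cases, such as Temporal's workflowcheck tool, frameworks also provide static analysis tools that detect whether nondeterministic behavior has been introduced.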
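To make the replay idea concrete, here is a toy journal that records the result of each step and, on re-execution, returns recorded results instead of calling the step again. It is a sketch of the general technique only, not of how Temporal or Restate are implemented; the `Journal` class, the in-memory list standing in for durable storage, and the two payment functions are all invented for this example.

```python
from typing import Any, Callable, List


class Journal:
    """Toy durable-execution journal: records each step's result in call order."""

    def __init__(self) -> None:
        self.entries: List[Any] = []  # stands in for durable storage
        self._cursor = 0

    def start_replay(self) -> None:
        self._cursor = 0

    def step(self, fn: Callable[[], Any]) -> Any:
        if self._cursor < len(self.entries):
            # This step completed in an earlier attempt: reuse the recorded result.
            result = self.entries[self._cursor]
        else:
            result = fn()
            self.entries.append(result)
        self._cursor += 1
        return result


calls = []  # side effects that actually happened


def debit_card() -> str:
    calls.append("debit")
    return "debited"


def deposit(fail: bool) -> str:
    if fail:
        raise RuntimeError("bank unavailable")
    calls.append("deposit")
    return "deposited"


def payment_workflow(journal: Journal, fail_deposit: bool):
    journal.start_replay()
    debit = journal.step(debit_card)
    dep = journal.step(lambda: deposit(fail_deposit))
    return debit, dep


journal = Journal()
try:
    payment_workflow(journal, fail_deposit=True)  # first attempt fails at step 2
except RuntimeError:
    pass
result = payment_workflow(journal, fail_deposit=False)  # retry replays step 1
```

Because the journal is indexed by position, reordering the `step` calls between attempts would hand a step the wrong recorded result, which is the brittleness described above.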
--------
> [!NOTE]
> 使程式碼具有確定性是一個強大的想法,但要穩健地做到這一點很棘手。在 ["確定性的力量"](/tw/ch9#sidebar_distributed_determinism) 中,我們將回到這個話題。
--------
### 事件驅動的架構 {#sec_encoding_dataflow_msg}
在這最後一節中,我們將簡要介紹 *事件驅動架構*,這是編碼資料從一個程序流向另一個程序的另一種方式。請求稱為 *事件* 或 *訊息*;與 RPC 不同,傳送者通常不會等待接收者處理事件。此外,事件通常不是透過直接網路連線傳送給接收者,而是透過稱為 *訊息代理*(也稱為 *事件代理*、*訊息佇列* 或 *面向訊息的中介軟體*)的中介,它臨時儲存訊息 [^50]。
使用訊息代理與直接 RPC 相比有幾個優點:
* 如果接收者不可用或過載,它可以充當緩衝區,從而提高系統可靠性。
* 它可以自動將訊息重新傳遞給已崩潰的程序,從而防止訊息丟失。
* 它避免了服務發現的需要,因為傳送者不需要直接連線到接收者的 IP 地址。
* 它允許將相同的訊息傳送給多個接收者。
* 它在邏輯上將傳送者與接收者解耦(傳送者只是釋出訊息,不關心誰使用它們)。
透過訊息代理的通訊是 *非同步的*:傳送者不會等待訊息被傳遞,而是簡單地傳送它然後忘記它。可以透過讓傳送者在單獨的通道上等待響應來實現類似同步 RPC 的模型。
#### 訊息代理 {#message-brokers}
過去,訊息代理的格局由 TIBCO、IBM WebSphere 和 webMethods 等公司的商業企業軟體主導,然後開源實現(如 RabbitMQ、ActiveMQ、HornetQ、NATS 和 Apache Kafka)變得流行。最近,雲服務(如 Amazon Kinesis、Azure Service Bus 和 Google Cloud Pub/Sub)也獲得了採用。我們將在 [“訊息系統”](/tw/ch12#sec_stream_messaging) 中更詳細地比較它們。
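The detailed delivery semantics vary by implementation and configuration, but in general two message distribution patterns are most often used:
* One process adds a message to a named *queue*, and the broker delivers that message to a *consumer* of that queue. If there are multiple consumers, one of them receives the message.
* One process publishes a message to a named *topic*, and the broker delivers that message to all *subscribers* of that topic. If there are multiple subscribers, they all receive the message.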
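The difference between the two patterns can be shown with a toy in-memory broker. This is a minimal sketch under simplifying assumptions: `ToyBroker` is a made-up class, delivery is a direct function call, queue consumers are picked round-robin, and a message sent to a queue without consumers raises an error, whereas a real broker would buffer it.

```python
from collections import defaultdict
from typing import Any, Callable, DefaultDict, List

Handler = Callable[[Any], None]


class ToyBroker:
    def __init__(self) -> None:
        self._consumers: DefaultDict[str, List[Handler]] = defaultdict(list)
        self._next: DefaultDict[str, int] = defaultdict(int)
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def consume(self, queue: str, handler: Handler) -> None:
        self._consumers[queue].append(handler)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def send(self, queue: str, message: Any) -> None:
        """Queue semantics: exactly one consumer receives the message."""
        consumers = self._consumers[queue]
        if not consumers:
            raise RuntimeError("no consumer registered (a real broker would buffer)")
        index = self._next[queue] % len(consumers)
        self._next[queue] += 1
        consumers[index](message)

    def publish(self, topic: str, message: Any) -> None:
        """Topic semantics: every subscriber receives the message."""
        for handler in self._subscribers[topic]:
            handler(message)


broker = ToyBroker()
worker_a, worker_b = [], []
broker.consume("jobs", worker_a.append)
broker.consume("jobs", worker_b.append)
broker.send("jobs", "job-1")
broker.send("jobs", "job-2")

audit, email = [], []
broker.subscribe("signups", audit.append)
broker.subscribe("signups", email.append)
broker.publish("signups", "user-42")
```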
訊息代理通常不強制執行任何特定的資料模型——訊息只是帶有一些元資料的位元組序列,因此你可以使用任何編碼格式。常見的方法是使用 Protocol Buffers、Avro 或 JSON,並在訊息代理旁邊部署模式登錄檔來儲存所有有效的模式版本並檢查其相容性 [^19] [^21]。AsyncAPI(OpenAPI 的基於訊息傳遞的等效物)也可用於指定訊息的模式。
訊息代理在訊息的永續性方面有所不同。許多將訊息寫入磁碟,以便在訊息代理崩潰或需要重新啟動時不會丟失。與資料庫不同,許多訊息代理在訊息被消費後會自動再次刪除訊息。某些代理可以配置為無限期地儲存訊息,如果你想使用事件溯源,這是必需的(參見 ["事件溯源與 CQRS"](/tw/ch3#sec_datamodels_events))。
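If a consumer republishes messages to another topic, you may need to take care to preserve unknown fields, to prevent the issue described earlier in the context of databases ([Figure 5-1](#fig_encoding_preserve_field)).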
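The following sketch contrasts a consumer that silently drops unknown fields with one that preserves them. JSON is used only for brevity; the field names and the "add a fee" transformation are hypothetical.

```python
import json


def add_fee_lossy(raw: bytes) -> bytes:
    """Old code that rebuilds the event from the fields it knows about."""
    event = json.loads(raw)
    out = {"user_id": event["user_id"], "amount": event["amount"] + 1}
    return json.dumps(out).encode()  # any field added by newer producers is lost


def add_fee_preserving(raw: bytes) -> bytes:
    """Old code that modifies the decoded event in place and re-encodes all of it."""
    event = json.loads(raw)
    event["amount"] += 1
    return json.dumps(event).encode()  # unknown fields pass through untouched


# A newer producer has started sending a "currency" field that this consumer predates.
incoming = json.dumps({"user_id": 7, "amount": 10, "currency": "EUR"}).encode()
lossy = json.loads(add_fee_lossy(incoming))
preserved = json.loads(add_fee_preserving(incoming))
```

With schema-based formats the same care is needed: whether unknown fields survive a decode and re-encode round trip depends on the library and on how the application copies data between objects.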
#### 分散式 actor 框架 {#distributed-actor-frameworks}
*Actor 模型* 是單個程序中併發的程式設計模型。與其直接處理執行緒(以及相關的競態條件、鎖定和死鎖問題),邏輯被封裝在 *actor* 中。每個 actor 通常代表一個客戶端或實體,它可能有一些本地狀態(不與任何其他 actor 共享),並透過傳送和接收非同步訊息與其他 actor 通訊。訊息傳遞不能保證:在某些錯誤場景中,訊息將丟失。由於每個 actor 一次只處理一條訊息,因此它不需要擔心執行緒,並且每個 actor 可以由框架獨立排程。
在 *分散式 actor 框架* 中,如 Akka、Orleans [^51] 和 Erlang/OTP,此程式設計模型用於跨多個節點擴充套件應用程式。無論傳送者和接收者是在同一節點還是不同節點上,都使用相同的訊息傳遞機制。如果它們在不同的節點上,訊息將透明地編碼為位元組序列,透過網路傳送,並在另一端解碼。
位置透明性在 actor 模型中比在 RPC 中效果更好,因為 actor 模型已經假定訊息可能會丟失,即使在單個程序內也是如此。儘管網路上的延遲可能比同一程序內的延遲更高,但在使用 actor 模型時,本地和遠端通訊之間的根本不匹配較少。
分散式 actor 框架本質上將訊息代理和 actor 程式設計模型整合到單個框架中。但是,如果你想對基於 actor 的應用程式執行滾動升級,你仍然必須擔心向前和向後相容性,因為訊息可能從執行新版本的節點發送到執行舊版本的節點,反之亦然。這可以透過使用本章中討論的編碼之一來實現。
## 總結 {#summary}
在本章中,我們研究了將資料結構轉換為網路上的位元組或磁碟上的位元組的幾種方法。我們看到了這些編碼的細節不僅影響其效率,更重要的是還影響應用程式的架構和演化選項。
特別是,許多服務需要支援滾動升級,其中服務的新版本逐步部署到少數節點,而不是同時部署到所有節點。滾動升級允許在不停機的情況下發布服務的新版本(從而鼓勵頻繁的小版本釋出而不是罕見的大版本釋出),並使部署風險更低(允許在影響大量使用者之前檢測和回滾有故障的版本)。這些屬性對 *可演化性* 非常有益,即輕鬆進行應用程式更改。
在滾動升級期間,或出於其他各種原因,我們必須假設不同的節點正在執行我們應用程式程式碼的不同版本。因此,重要的是系統中流動的所有資料都以提供向後相容性(新程式碼可以讀取舊資料)和向前相容性(舊程式碼可以讀取新資料)的方式進行編碼。
我們討論了幾種資料編碼格式及其相容性屬性:
* 特定於程式語言的編碼僅限於單一程式語言,並且通常無法提供向前和向後相容性。
* 文字格式(如 JSON、XML 和 CSV)廣泛存在,其相容性取決於你如何使用它們。它們有可選的模式語言,有時有幫助,有時是障礙。這些格式在資料型別方面有些模糊,因此你必須小心處理數字和二進位制字串等內容。
* 二進位制模式驅動的格式(如 Protocol Buffers 和 Avro)允許使用明確定義的向前和向後相容性語義進行緊湊、高效的編碼。模式可用於文件和程式碼生成,適用於靜態型別語言。但是,這些格式的缺點是資料需要在人類可讀之前進行解碼。
我們還討論了幾種資料流模式,說明了資料編碼很重要的不同場景:
* 資料庫,其中寫入資料庫的程序對資料進行編碼,從資料庫讀取的程序對其進行解碼
* RPC 和 REST API,其中客戶端對請求進行編碼,伺服器對請求進行解碼並對響應進行編碼,客戶端最終對響應進行解碼
* 事件驅動架構(使用訊息代理或 actor),其中節點透過相互發送訊息進行通訊,這些訊息由傳送者編碼並由接收者解碼
我們可以得出結論,透過一點小心,向後/向前相容性和滾動升級是完全可以實現的。願你的應用程式演化迅速,部署頻繁。
### 參考
[^1]: [CWE-502: Deserialization of Untrusted Data](https://cwe.mitre.org/data/definitions/502.html). Common Weakness Enumeration, *cwe.mitre.org*, July 2006. Archived at [perma.cc/26EU-UK9Y](https://perma.cc/26EU-UK9Y)
[^2]: Steve Breen. [What Do WebLogic, WebSphere, JBoss, Jenkins, OpenNMS, and Your Application Have in Common? This Vulnerability](https://foxglovesecurity.com/2015/11/06/what-do-weblogic-websphere-jboss-jenkins-opennms-and-your-application-have-in-common-this-vulnerability/). *foxglovesecurity.com*, November 2015. Archived at [perma.cc/9U97-UVVD](https://perma.cc/9U97-UVVD)
[^3]: Patrick McKenzie. [What the Rails Security Issue Means for Your Startup](https://www.kalzumeus.com/2013/01/31/what-the-rails-security-issue-means-for-your-startup/). *kalzumeus.com*, January 2013. Archived at [perma.cc/2MBJ-7PZ6](https://perma.cc/2MBJ-7PZ6)
[^4]: Brian Goetz. [Towards Better Serialization](https://openjdk.org/projects/amber/design-notes/towards-better-serialization). *openjdk.org*, June 2019. Archived at [perma.cc/UK6U-GQDE](https://perma.cc/UK6U-GQDE)
[^5]: Eishay Smith. [jvm-serializers wiki](https://github.com/eishay/jvm-serializers/wiki). *github.com*, October 2023. Archived at [perma.cc/PJP7-WCNG](https://perma.cc/PJP7-WCNG)
[^6]: [XML Is a Poor Copy of S-Expressions](https://wiki.c2.com/?XmlIsaPoorCopyOfEssExpressions). *wiki.c2.com*, May 2013. Archived at [perma.cc/7FAN-YBKL](https://perma.cc/7FAN-YBKL)
[^7]: Julia Evans. [Examples of floating point problems](https://jvns.ca/blog/2023/01/13/examples-of-floating-point-problems/). *jvns.ca*, January 2023. Archived at [perma.cc/M57L-QKKW](https://perma.cc/M57L-QKKW)
[^8]: Matt Harris. [Snowflake: An Update and Some Very Important Information](https://groups.google.com/g/twitter-development-talk/c/ahbvo3VTIYI). Email to *Twitter Development Talk* mailing list, October 2010. Archived at [perma.cc/8UBV-MZ3D](https://perma.cc/8UBV-MZ3D)
[^9]: Yakov Shafranovich. [RFC 4180: Common Format and MIME Type for Comma-Separated Values (CSV) Files](https://tools.ietf.org/html/rfc4180). IETF, October 2005.
[^10]: Andy Coates. [Evolving JSON Schemas - Part I](https://www.creekservice.org/articles/2024/01/08/json-schema-evolution-part-1.html) and [Part II](https://www.creekservice.org/articles/2024/01/09/json-schema-evolution-part-2.html). *creekservice.org*, January 2024. Archived at [perma.cc/MZW3-UA54](https://perma.cc/MZW3-UA54) and [perma.cc/GT5H-WKZ5](https://perma.cc/GT5H-WKZ5)
[^11]: Pierre Genevès, Nabil Layaïda, and Vincent Quint. [Ensuring Query Compatibility with Evolving XML Schemas](https://arxiv.org/abs/0811.4324). INRIA Technical Report 6711, November 2008.
[^12]: Tim Bray. [Bits On the Wire](https://www.tbray.org/ongoing/When/201x/2019/11/17/Bits-On-the-Wire). *tbray.org*, November 2019. Archived at [perma.cc/3BT3-BQU3](https://perma.cc/3BT3-BQU3)
[^13]: Mark Slee, Aditya Agarwal, and Marc Kwiatkowski. [Thrift: Scalable Cross-Language Services Implementation](https://thrift.apache.org/static/files/thrift-20070401.pdf). Facebook technical report, April 2007. Archived at [perma.cc/22BS-TUFB](https://perma.cc/22BS-TUFB)
[^14]: Martin Kleppmann. [Schema Evolution in Avro, Protocol Buffers and Thrift](https://martin.kleppmann.com/2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html). *martin.kleppmann.com*, December 2012. Archived at [perma.cc/E4R2-9RJT](https://perma.cc/E4R2-9RJT)
[^15]: Doug Cutting, Chad Walters, Jim Kellerman, et al. [[PROPOSAL] New Subproject: Avro](https://lists.apache.org/thread/z571w0r5jmfsjvnl0fq4fgg0vh28d3bk). Email thread on *hadoop-general* mailing list, *lists.apache.org*, April 2009. Archived at [perma.cc/4A79-BMEB](https://perma.cc/4A79-BMEB)
[^16]: Apache Software Foundation. [Apache Avro 1.12.0 Specification](https://avro.apache.org/docs/1.12.0/specification/). *avro.apache.org*, August 2024. Archived at [perma.cc/C36P-5EBQ](https://perma.cc/C36P-5EBQ)
[^17]: Apache Software Foundation. [Avro schemas as LL(1) CFG definitions](https://avro.apache.org/docs/1.12.0/api/java/org/apache/avro/io/parsing/doc-files/parsing.html). *avro.apache.org*, August 2024. Archived at [perma.cc/JB44-EM9Q](https://perma.cc/JB44-EM9Q)
[^18]: Tony Hoare. [Null References: The Billion Dollar Mistake](https://www.infoq.com/presentations/Null-References-The-Billion-Dollar-Mistake-Tony-Hoare/). Talk at *QCon London*, March 2009.
[^19]: Confluent, Inc. [Schema Registry Overview](https://docs.confluent.io/platform/current/schema-registry/index.html). *docs.confluent.io*, 2024. Archived at [perma.cc/92C3-A9JA](https://perma.cc/92C3-A9JA)
[^20]: Aditya Auradkar and Tom Quiggle. [Introducing Espresso—LinkedIn’s Hot New Distributed Document Store](https://engineering.linkedin.com/espresso/introducing-espresso-linkedins-hot-new-distributed-document-store). *engineering.linkedin.com*, January 2015. Archived at [perma.cc/FX4P-VW9T](https://perma.cc/FX4P-VW9T)
[^21]: Jay Kreps. [Putting Apache Kafka to Use: A Practical Guide to Building a Stream Data Platform (Part 2)](https://www.confluent.io/blog/event-streaming-platform-2/). *confluent.io*, February 2015. Archived at [perma.cc/8UA4-ZS5S](https://perma.cc/8UA4-ZS5S)
[^22]: Gwen Shapira. [The Problem of Managing Schemas](https://www.oreilly.com/content/the-problem-of-managing-schemas/). *oreilly.com*, November 2014. Archived at [perma.cc/BY8Q-RYV3](https://perma.cc/BY8Q-RYV3)
[^23]: John Larmouth. [*ASN.1 Complete*](https://www.oss.com/asn1/resources/books-whitepapers-pubs/larmouth-asn1-book.pdf). Morgan Kaufmann, 1999. ISBN: 978-0-122-33435-1. Archived at [perma.cc/GB7Y-XSXQ](https://perma.cc/GB7Y-XSXQ)
[^24]: Burton S. Kaliski Jr. [A Layman’s Guide to a Subset of ASN.1, BER, and DER](https://luca.ntop.org/Teaching/Appunti/asn1.html). Technical Note, RSA Data Security, Inc., November 1993. Archived at [perma.cc/2LMN-W9U8](https://perma.cc/2LMN-W9U8)
[^25]: Jacob Hoffman-Andrews. [A Warm Welcome to ASN.1 and DER](https://letsencrypt.org/docs/a-warm-welcome-to-asn1-and-der/). *letsencrypt.org*, April 2020. Archived at [perma.cc/CYT2-GPQ8](https://perma.cc/CYT2-GPQ8)
[^26]: Lev Walkin. [Question: Extensibility and Dropping Fields](https://lionet.info/asn1c/blog/2010/09/21/question-extensibility-removing-fields/). *lionet.info*, September 2010. Archived at [perma.cc/VX8E-NLH3](https://perma.cc/VX8E-NLH3)
[^27]: Jacqueline Xu. [Online migrations at scale](https://stripe.com/blog/online-migrations). *stripe.com*, February 2017. Archived at [perma.cc/X59W-DK7Y](https://perma.cc/X59W-DK7Y)
[^28]: Geoffrey Litt, Peter van Hardenberg, and Orion Henry. [Project Cambria: Translate your data with lenses](https://www.inkandswitch.com/cambria/). Technical Report, *Ink & Switch*, October 2020. Archived at [perma.cc/WA4V-VKDB](https://perma.cc/WA4V-VKDB)
[^29]: Pat Helland. [Data on the Outside Versus Data on the Inside](https://www.cidrdb.org/cidr2005/papers/P12.pdf). At *2nd Biennial Conference on Innovative Data Systems Research* (CIDR), January 2005.
[^30]: Roy Thomas Fielding. [Architectural Styles and the Design of Network-Based Software Architectures](https://ics.uci.edu/~fielding/pubs/dissertation/fielding_dissertation.pdf). PhD Thesis, University of California, Irvine, 2000. Archived at [perma.cc/LWY9-7BPE](https://perma.cc/LWY9-7BPE)
[^31]: Roy Thomas Fielding. [REST APIs must be hypertext-driven](https://roy.gbiv.com/untangled/2008/rest-apis-must-be-hypertext-driven). *roy.gbiv.com*, October 2008. Archived at [perma.cc/M2ZW-8ATG](https://perma.cc/M2ZW-8ATG)
[^32]: [OpenAPI Specification Version 3.1.0](https://swagger.io/specification/). *swagger.io*, February 2021. Archived at [perma.cc/3S6S-K5M4](https://perma.cc/3S6S-K5M4)
[^33]: Michi Henning. [The Rise and Fall of CORBA](https://cacm.acm.org/practice/the-rise-and-fall-of-corba/). *Communications of the ACM*, volume 51, issue 8, pages 52–57, August 2008. [doi:10.1145/1378704.1378718](https://doi.org/10.1145/1378704.1378718)
[^34]: Pete Lacey. [The S Stands for Simple](https://harmful.cat-v.org/software/xml/soap/simple). *harmful.cat-v.org*, November 2006. Archived at [perma.cc/4PMK-Z9X7](https://perma.cc/4PMK-Z9X7)
[^35]: Stefan Tilkov. [Interview: Pete Lacey Criticizes Web Services](https://www.infoq.com/articles/pete-lacey-ws-criticism/). *infoq.com*, December 2006. Archived at [perma.cc/JWF4-XY3P](https://perma.cc/JWF4-XY3P)
[^36]: Tim Bray. [The Loyal WS-Opposition](https://www.tbray.org/ongoing/When/200x/2004/09/18/WS-Oppo). *tbray.org*, September 2004. Archived at [perma.cc/J5Q8-69Q2](https://perma.cc/J5Q8-69Q2)
[^37]: Andrew D. Birrell and Bruce Jay Nelson. [Implementing Remote Procedure Calls](https://www.cs.princeton.edu/courses/archive/fall03/cs518/papers/rpc.pdf). *ACM Transactions on Computer Systems* (TOCS), volume 2, issue 1, pages 39–59, February 1984. [doi:10.1145/2080.357392](https://doi.org/10.1145/2080.357392)
[^38]: Jim Waldo, Geoff Wyant, Ann Wollrath, and Sam Kendall. [A Note on Distributed Computing](https://m.mirror.facebook.net/kde/devel/smli_tr-94-29.pdf). Sun Microsystems Laboratories, Inc., Technical Report TR-94-29, November 1994. Archived at [perma.cc/8LRZ-BSZR](https://perma.cc/8LRZ-BSZR)
[^39]: Steve Vinoski. [Convenience over Correctness](https://steve.vinoski.net/pdf/IEEE-Convenience_Over_Correctness.pdf). *IEEE Internet Computing*, volume 12, issue 4, pages 89–92, July 2008. [doi:10.1109/MIC.2008.75](https://doi.org/10.1109/MIC.2008.75)
[^40]: Brandur Leach. [Designing robust and predictable APIs with idempotency](https://stripe.com/blog/idempotency). *stripe.com*, February 2017. Archived at [perma.cc/JD22-XZQT](https://perma.cc/JD22-XZQT)
[^41]: Sam Rose. [Load Balancing](https://samwho.dev/load-balancing/). *samwho.dev*, April 2023. Archived at [perma.cc/Q7BA-9AE2](https://perma.cc/Q7BA-9AE2)
[^42]: Troy Hunt. [Your API versioning is wrong, which is why I decided to do it 3 different wrong ways](https://www.troyhunt.com/your-api-versioning-is-wrong-which-is/). *troyhunt.com*, February 2014. Archived at [perma.cc/9DSW-DGR5](https://perma.cc/9DSW-DGR5)
[^43]: Brandur Leach. [APIs as infrastructure: future-proofing Stripe with versioning](https://stripe.com/blog/api-versioning). *stripe.com*, August 2017. Archived at [perma.cc/L63K-USFW](https://perma.cc/L63K-USFW)
[^44]: Alexandre Alves, Assaf Arkin, Sid Askary, et al. [Web Services Business Process Execution Language Version 2.0](https://docs.oasis-open.org/wsbpel/2.0/wsbpel-v2.0.html). *docs.oasis-open.org*, April 2007.
[^45]: [What is a Temporal Service?](https://docs.temporal.io/clusters) *docs.temporal.io*, 2024. Archived at [perma.cc/32P3-CJ9V](https://perma.cc/32P3-CJ9V)
[^46]: Stephan Ewen. [Why we built Restate](https://restate.dev/blog/why-we-built-restate/). *restate.dev*, August 2023. Archived at [perma.cc/BJJ2-X75K](https://perma.cc/BJJ2-X75K)
[^47]: Keith Tenzer and Joshua Smith. [Idempotency and Durable Execution](https://temporal.io/blog/idempotency-and-durable-execution). *temporal.io*, February 2024. Archived at [perma.cc/9LGW-PCLU](https://perma.cc/9LGW-PCLU)
[^48]: [What is a Temporal Workflow?](https://docs.temporal.io/workflows) *docs.temporal.io*, 2024. Archived at [perma.cc/B5C5-Y396](https://perma.cc/B5C5-Y396)
[^49]: Jack Kleeman. [Solving durable execution’s immutability problem](https://restate.dev/blog/solving-durable-executions-immutability-problem/). *restate.dev*, February 2024. Archived at [perma.cc/G55L-EYH5](https://perma.cc/G55L-EYH5)
[^50]: Srinath Perera. [Exploring Event-Driven Architecture: A Beginner’s Guide for Cloud Native Developers](https://wso2.com/blogs/thesource/exploring-event-driven-architecture-a-beginners-guide-for-cloud-native-developers/). *wso2.com*, August 2023. Archived at [archive.org](https://web.archive.org/web/20240716204613/https%3A//wso2.com/blogs/thesource/exploring-event-driven-architecture-a-beginners-guide-for-cloud-native-developers/)
[^51]: Philip A. Bernstein, Sergey Bykov, Alan Geller, Gabriel Kliot, and Jorgen Thelin. [Orleans: Distributed Virtual Actors for Programmability and Scalability](https://www.microsoft.com/en-us/research/publication/orleans-distributed-virtual-actors-for-programmability-and-scalability/). Microsoft Research Technical Report MSR-TR-2014-41, March 2014. Archived at [perma.cc/PD3U-WDMF](https://perma.cc/PD3U-WDMF)
================================================
FILE: content/tw/ch6.md
================================================
---
title: "6. 複製"
weight: 206
breadcrumbs: false
---

> *可能出錯的東西和“不可能”出錯的東西之間,最大的區別在於:後者一旦出錯,往往幾乎無從下手,也難以修復。*
>
> 道格拉斯·亞當斯,《基本無害》(1992)
**複製** 指的是在透過網路連線的多臺機器上保留相同資料的副本。如 ["分散式與單節點系統"](/tw/ch1#sec_introduction_distributed) 中所討論的,你可能出於以下幾個原因希望複製資料:
* 使資料在地理上更接近使用者(從而減少訪問延遲)
* 即使系統的部分元件出現故障,也能讓系統繼續工作(從而提高可用性)
* 擴充套件能夠處理讀查詢的機器數量(從而提高讀吞吐量)
本章假設你的資料集足夠小,每臺機器都可以儲存整個資料集的副本。在 [第 7 章](/tw/ch7#ch_sharding) 中,我們將放寬這一假設,討論單臺機器無法容納的、過大資料集的 **分片**(**分割槽**)。在後續章節中,我們將討論複製資料系統中可能發生的各種故障,以及如何處理它們。
如果需要複製的資料不會隨時間變化,那麼複製就很簡單:只需要將資料複製到每個節點一次就大功告成。處理複製的所有困難都在於處理複製資料的 **變更**,這也是本章的主題。我們將討論三類在節點之間複製變更的演算法:**單主**、**多主** 和 **無主** 複製。幾乎所有分散式資料庫都使用這三種方法之一。它們各有利弊,我們將詳細研究。
複製需要考慮許多權衡:例如,是使用同步還是非同步複製,以及如何處理失敗的副本。這些通常是資料庫中的配置選項,儘管不同資料庫的細節有所不同,但許多不同實現的通用原則是相似的。我們將在本章中討論這些選擇的後果。
資料庫複製是一個古老的話題——自 20 世紀 70 年代研究以來,原理並沒有太大變化 [^1],因為網路的基本約束保持不變。儘管如此古老,像 **最終一致性** 這樣的概念仍然會引起困惑。在 ["複製延遲的問題"](#sec_replication_lag) 中,我們將更準確地瞭解最終一致性,並討論諸如 **讀己之寫** 和 **單調讀** 等保證。
--------
> [!TIP] 備份與複製
>
> 你可能會想,如果有了複製,是否還需要備份。答案是肯定的,因為它們有不同的目的:副本會快速將一個節點的寫入反映到其他節點上,但備份儲存資料的舊快照,以便你可以回到過去的時間點。如果你不小心刪除了一些資料,複製並不能幫助你,因為刪除操作也會傳播到副本,所以如果你想恢復被刪除的資料,就需要備份。
>
> 事實上,複製和備份通常是相互補充的。備份有時是設定複製過程的一部分,正如我們將在 ["設定新的副本"](#sec_replication_new_replica) 中看到的。反過來,歸檔複製日誌可以成為備份過程的一部分。
>
> 一些資料庫在內部維護過去狀態的不可變快照,作為一種內部備份。然而,這意味著在與當前狀態相同的儲存介質上保留資料的舊版本。如果你有大量資料,將舊資料的備份儲存在針對不常訪問資料最佳化的物件儲存中可能會更便宜,而只在主儲存中儲存資料庫的當前狀態。
--------
## 單主複製 {#sec_replication_leader}
儲存資料庫副本的每個節點稱為 **副本**。有了多個副本,不可避免地會出現一個問題:我們如何確保所有資料最終都出現在所有副本上?
每次寫入資料庫都需要由每個副本處理;否則,副本將不再包含相同的資料。最常見的解決方案稱為 **基於領導者的複製**、**主備複製** 或 **主動/被動複製**。它的工作原理如下(見 [圖 6-1](#fig_replication_leader_follower)):
1. 其中一個副本被指定為 **領導者**(也稱為 **主庫** 或 **源** [^2])。當客戶端想要寫入資料庫時,他們必須將請求傳送給領導者,領導者首先將新資料寫入其本地儲存。
2. 其他副本稱為 **追隨者**(**只讀副本**、**從庫** 或 **熱備**)。每當領導者將新資料寫入其本地儲存時,它也會將資料變更作為 **複製日誌** 或 **變更流** 的一部分發送給所有追隨者。每個追隨者從領導者獲取日誌,並透過按照與領導者處理相同的順序應用所有寫入來相應地更新其本地資料庫副本。
3. 當客戶端想要從資料庫讀取時,它可以查詢領導者或任何追隨者。然而,只有領導者接受寫入(從客戶端的角度來看,追隨者是隻讀的)。
{{< figure src="/fig/ddia_0601.png" id="fig_replication_leader_follower" caption="圖 6-1. 單主複製將所有寫入定向到指定的領導者,該領導者向追隨者傳送變更流。" class="w-full my-4" >}}
如果資料庫是分片的(見 [第 7 章](/tw/ch7#ch_sharding)),每個分片都有一個領導者。不同的分片可能在不同的節點上有其領導者,但每個分片仍必須有一個領導者。在 ["多主複製"](#sec_replication_multi_leader) 中,我們將討論一種替代模型,其中系統可能同時為同一分片擁有多個領導者。
單主複製被廣泛使用。它是許多關係資料庫的內建功能,如 PostgreSQL、MySQL、Oracle Data Guard [^3] 和 SQL Server 的 Always On 可用性組 [^4]。它也用於一些文件資料庫,如 MongoDB 和 DynamoDB [^5],訊息代理如 Kafka,複製塊裝置如 DRBD,以及一些網路檔案系統。許多共識演算法(如 Raft)也基於單個領導者,用於 CockroachDB [^6]、TiDB [^7]、etcd 和 RabbitMQ 仲裁佇列(以及其他)中的複製,並在舊領導者失敗時自動選舉新領導者(我們將在 [第 10 章](/tw/ch10#ch_consistency) 中更詳細地討論共識)。
--------
> [!NOTE]
> 在較舊的文件中,你可能會看到術語 **主從複製**。它與基於領導者的複製含義相同,但應該避免使用該術語,因為它被廣泛認為是冒犯性的 [^8]。
--------
### 同步複製與非同步複製 {#sec_replication_sync_async}
複製系統的一個重要細節是複製是 **同步** 發生還是 **非同步** 發生。(在關係資料庫中,這通常是一個可配置選項;其他系統通常硬編碼為其中之一。)
想想 [圖 6-1](#fig_replication_leader_follower) 中發生的情況,一個網站使用者更新他們的個人資料圖片。在某個時間點,客戶端向領導者傳送更新請求;不久之後,領導者收到了它。在某個時間點,領導者將資料變更轉發給追隨者。最終,領導者通知客戶端更新成功。[圖 6-2](#fig_replication_sync_replication) 顯示了時序可能的工作方式。
{{< figure src="/fig/ddia_0602.png" id="fig_replication_sync_replication" caption="圖 6-2. 基於領導者的複製,帶有一個同步和一個非同步追隨者。" class="w-full my-4" >}}
在 [圖 6-2](#fig_replication_sync_replication) 的示例中,對追隨者 1 的複製是 **同步的**:領導者等待追隨者 1 確認它已收到寫入,然後才向用戶報告成功,並使寫入對其他客戶端可見。對追隨者 2 的複製是 **非同步的**:領導者傳送訊息,但不等待追隨者的響應。
圖中顯示,追隨者 2 處理訊息之前有相當大的延遲。通常,複製相當快:大多數資料庫系統在不到一秒的時間內將變更應用到追隨者。然而,不能保證需要多長時間。在某些情況下,追隨者可能落後領導者幾分鐘或更長時間;例如,如果追隨者正在從故障中恢復,如果系統正在接近最大容量執行,或者如果節點之間存在網路問題。
同步複製的優點是追隨者保證擁有與領導者一致的最新資料副本。如果領導者突然失敗,我們可以確信資料仍然在追隨者上可用。缺點是,如果同步追隨者沒有響應(因為它已崩潰,或存在網路故障,或任何其他原因),寫入就無法處理。領導者必須阻塞所有寫入並等待同步副本再次可用。
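For this reason, it is impractical for all followers to be synchronous: an outage of any one node would bring the whole system to a halt. In practice, if a database offers synchronous replication, it usually means that **one** of the followers is synchronous and the others are asynchronous. If the synchronous follower becomes unavailable or slow, one of the asynchronous followers is made synchronous. This guarantees that an up-to-date copy of the data exists on at least two nodes: the leader and one synchronous follower. This configuration is sometimes also called **semi-synchronous**.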
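A semi-synchronous setup with one synchronous and one asynchronous follower can be sketched as follows. The classes are invented for illustration, and "waiting for an acknowledgement" is modeled as a direct method call rather than a network round trip.

```python
class Follower:
    def __init__(self, name):
        self.name = name
        self.log = []    # entries this follower has applied
        self.inbox = []  # entries received but not yet applied

    def receive(self, entry):
        self.inbox.append(entry)

    def apply_pending(self):
        """Apply everything received so far; return the log length as an ack."""
        self.log.extend(self.inbox)
        self.inbox.clear()
        return len(self.log)


class Leader:
    def __init__(self, sync_follower, async_followers):
        self.log = []
        self.sync_follower = sync_follower
        self.async_followers = async_followers

    def write(self, entry):
        self.log.append(entry)
        # Synchronous follower: wait for its ack before reporting success.
        self.sync_follower.receive(entry)
        acked = self.sync_follower.apply_pending()
        if acked != len(self.log):
            raise RuntimeError("synchronous follower is behind")
        # Asynchronous followers: send the change but do not wait for a response.
        for follower in self.async_followers:
            follower.receive(entry)
        return "ok"


f1, f2 = Follower("follower-1"), Follower("follower-2")
leader = Leader(sync_follower=f1, async_followers=[f2])
status = leader.write(("user:1234", "new-picture.jpg"))
```

After `write` returns, `f1` is guaranteed to hold the entry, while `f2` only holds it once it gets around to calling `apply_pending()`.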
在某些系統中,**多數**(例如,包括領導者在內的 5 個副本中的 3 個)副本被同步更新,其餘少數是非同步的。這是 **仲裁** 的一個例子,我們將在 ["讀寫仲裁"](#sec_replication_quorum_condition) 中進一步討論。多數仲裁通常用於使用共識協議進行自動領導者選舉的系統中,我們將在 [第 10 章](/tw/ch10#ch_consistency) 中回到這個話題。
有時,基於領導者的複製被配置為完全非同步。在這種情況下,如果領導者失敗且無法恢復,任何尚未複製到追隨者的寫入都會丟失。這意味著即使已向客戶端確認,寫入也不能保證持久。然而,完全非同步配置的優點是領導者可以繼續處理寫入,即使所有追隨者都已落後。
弱化永續性可能聽起來像是一個糟糕的權衡,但非同步複製仍然被廣泛使用,特別是如果有許多追隨者或者它們在地理上分佈廣泛 [^9]。我們將在 ["複製延遲的問題"](#sec_replication_lag) 中回到這個問題。
### 設定新的副本 {#sec_replication_new_replica}
不時地,你需要設定新的追隨者——也許是為了增加副本的數量,或者替換失敗的節點。如何確保新的追隨者擁有領導者資料的準確副本?
簡單地將資料檔案從一個節點複製到另一個節點通常是不夠的:客戶端不斷向資料庫寫入,資料總是在變化,所以標準檔案複製會在不同的時間點看到資料庫的不同部分。結果可能沒有任何意義。
你可以透過鎖定資料庫(使其不可用於寫入)來使磁碟上的檔案保持一致,但這將違揹我們的高可用性目標。幸運的是,設定追隨者通常可以在不停機的情況下完成。從概念上講,過程如下所示:
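1. Take a consistent snapshot of the leader's database at some point in time, if possible without locking the entire database. Most databases have this feature, since it is also needed for backups. In some cases a third-party tool is required, such as Percona XtraBackup for MySQL.
2. Copy the snapshot to the new follower.
3. The follower connects to the leader and requests all data changes that have happened since the snapshot was taken. This requires the snapshot to be associated with an exact position in the leader's replication log. That position goes by various names: PostgreSQL calls it the **log sequence number**; MySQL has two mechanisms, **binlog coordinates** and **global transaction identifiers** (GTIDs).
4. When the follower has processed the backlog of data changes since the snapshot, we say it has **caught up**. It can now continue to process data changes from the leader as they happen.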
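The four steps above can be sketched with an in-memory key-value store. The names are hypothetical, and the "log position" here is simply the number of log entries, standing in for something like a log sequence number.

```python
import copy


class Leader:
    def __init__(self):
        self.data = {}
        self.log = []  # log[i] is the change that moves the log position from i to i + 1

    def write(self, key, value):
        self.log.append((key, value))
        self.data[key] = value

    def snapshot(self):
        # Step 1: a consistent snapshot plus the exact log position it corresponds to.
        return copy.deepcopy(self.data), len(self.log)

    def changes_since(self, position):
        return self.log[position:]


class Follower:
    def __init__(self, snapshot, position):
        # Step 2: the snapshot has been copied to the new follower.
        self.data = snapshot
        self.position = position

    def catch_up(self, leader):
        # Steps 3 and 4: request and apply everything after the snapshot position.
        for key, value in leader.changes_since(self.position):
            self.data[key] = value
            self.position += 1


leader = Leader()
leader.write("a", 1)
snap, pos = leader.snapshot()
leader.write("b", 2)  # writes continue while the snapshot is being copied
leader.write("a", 3)

follower = Follower(snap, pos)
data_before_catch_up = dict(follower.data)
follower.catch_up(leader)
```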
設定追隨者的實際步驟因資料庫而異。在某些系統中,該過程是完全自動化的,而在其他系統中,它可能是需要管理員手動執行的有些神秘的多步驟工作流程。
你也可以將複製日誌歸檔到物件儲存;連同物件儲存中整個資料庫的定期快照,這是實現資料庫備份和災難恢復的好方法。你還可以透過從物件儲存下載這些檔案來執行設定新追隨者的步驟 1 和 2。例如,WAL-G 為 PostgreSQL、MySQL 和 SQL Server 執行此操作,Litestream 為 SQLite 執行等效操作。
--------
> [!TIP] 由物件儲存支援的資料庫
>
> 物件儲存可用於存檔資料之外的更多用途。許多資料庫開始使用物件儲存(如 Amazon Web Services S3、Google Cloud Storage 和 Azure Blob Storage)來為即時查詢提供資料。在物件儲存中儲存資料庫資料有許多好處:
>
> * 與其他雲端儲存選項相比,物件儲存價格便宜,這使得雲資料庫可以將較少查詢的資料儲存在更便宜、更高延遲的儲存上,同時從記憶體、SSD 和 NVMe 中提供工作集。
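> * Object stores also offer multi-zone, dual-region, or multi-region replication with very high durability guarantees. This also allows databases to bypass inter-zone network fees.
> * Databases can use an object store's **conditional write** feature, which is essentially a **compare-and-set** (CAS) operation, to implement transactions and leader election [^10] [^11].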
> * 將來自多個數據庫的資料儲存在同一物件儲存中可以簡化資料整合,特別是在使用 Apache Parquet 和 Apache Iceberg 等開放格式時。
>
> 這些好處透過將事務、領導者選舉和複製的責任轉移到物件儲存,大大簡化了資料庫架構。
>
> 採用物件儲存進行復制的系統必須應對一些權衡。值得注意的是,物件儲存的讀寫延遲比本地磁碟或 EBS 等虛擬塊裝置要高得多。許多雲提供商還收取每個 API 呼叫費用,這迫使系統批次讀寫以降低成本。這種批處理進一步增加了延遲。此外,許多物件儲存不提供標準檔案系統介面。這阻止了缺乏物件儲存整合的系統利用物件儲存。像 **使用者空間檔案系統**(FUSE)這樣的介面允許操作員將物件儲存桶掛載為檔案系統,應用程式可以在不知道其資料儲存在物件儲存上的情況下使用。儘管如此,許多物件儲存的 FUSE 介面缺乏系統可能依賴的 POSIX 功能,如非順序寫入或符號連結。
>
> 不同的系統以各種方式處理這些權衡。一些引入了 **分層儲存** 架構,將較少訪問的資料放在物件儲存上,而新的或頻繁訪問的資料儲存在更快的儲存裝置上,如 SSD、NVMe,甚至記憶體中。其他系統使用物件儲存作為其主要儲存層,但使用單獨的低延遲儲存系統(如 Amazon 的 EBS 或 Neon 的 Safekeepers [^12])來儲存其 WAL。最近,一些系統更進一步,採用了 **零磁碟架構**(ZDA)。基於 ZDA 的系統將所有資料持久化到物件儲存,並嚴格將磁碟和記憶體用於快取。這允許節點沒有持久狀態,這大大簡化了運維。WarpStream、Confluent Freight、Buf 的 Bufstream 和 Redpanda Serverless 都是使用零磁碟架構構建的相容 Kafka 的系統。幾乎每個現代雲資料倉庫也採用這種架構,Turbopuffer(向量搜尋引擎)和 SlateDB(雲原生 LSM 儲存引擎)也是如此。
--------
### 處理節點故障 {#sec_replication_failover}
系統中的任何節點都可能發生故障,可能是由於故障意外發生,但同樣可能是由於計劃維護(例如,重新啟動機器以安裝核心安全補丁)。能夠在不停機的情況下重新啟動單個節點對於操作和維護來說是一個很大的優勢。因此,我們的目標是儘管單個節點發生故障,但保持整個系統執行,並儘可能減小節點中斷的影響。
如何透過基於領導者的複製實現高可用性?
#### 追隨者故障:追趕恢復 {#follower-failure-catch-up-recovery}
在其本地磁碟上,每個追隨者保留從領導者接收的資料變更日誌。如果追隨者崩潰並重新啟動,或者如果領導者和追隨者之間的網路暫時中斷,追隨者可以很容易地恢復:從其日誌中,它知道在故障發生之前處理的最後一個事務。因此,追隨者可以連線到領導者並請求在追隨者斷開連線期間發生的所有資料變更。當它應用了這些變更後,它就趕上了領導者,可以像以前一樣繼續接收資料變更流。
儘管追隨者恢復在概念上很簡單,但在效能方面可能具有挑戰性:如果資料庫具有高寫入吞吐量,或者如果追隨者已離線很長時間,可能有很多寫入需要趕上。在進行這種追趕時,恢復的追隨者和領導者(需要將寫入積壓傳送到追隨者)都會有高負載。
一旦所有追隨者都確認已處理了日誌,領導者就可以刪除其寫入日誌,但如果追隨者長時間不可用,領導者面臨選擇:要麼保留日誌直到追隨者恢復並趕上(冒著領導者磁碟空間耗盡的風險),要麼刪除不可用追隨者尚未確認的日誌(在這種情況下,追隨者無法從日誌中恢復,並且在它回來時必須從備份中恢復)。
#### 領導者故障:故障轉移 {#leader-failure-failover}
處理領導者故障更加棘手:其中一個追隨者需要被提升為新的領導者,客戶端需要重新配置以將其寫入傳送到新的領導者,其他追隨者需要開始從新的領導者消費資料變更。這個過程稱為 **故障轉移**。
故障轉移可以手動發生(管理員收到領導者失敗的通知並採取必要步驟來建立新的領導者)或自動發生。自動故障轉移過程通常包括以下步驟:
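1. **Determining that the leader has failed.** Many things can go wrong: crashes, power outages, network faults, and more. There is no foolproof way to tell exactly what has happened, so most systems simply rely on a timeout: nodes frequently exchange messages, and if a node does not respond for some period of time (say, 30 seconds), it is assumed to have failed. (This does not apply when the leader is deliberately taken down for planned maintenance.)
2. **Choosing a new leader.** This can be done through an election process (in which the leader is chosen by a majority of the remaining replicas), or a new leader can be appointed by a previously designated **controller node** [^13]. The best candidate for leadership is usually the replica with the most up-to-date data changes from the old leader (to minimize data loss). Getting all nodes to agree on a new leader is a consensus problem, discussed in detail in [Chapter 10](/tw/ch10#ch_consistency).
3. **Reconfiguring the system to use the new leader.** Clients now need to send their write requests to the new leader (we discuss this in ["Request Routing"](/tw/ch7#sec_sharding_routing)). If the old leader comes back, it may still believe that it is the leader, not realizing that the other replicas have forced it to step down. The system needs to ensure that the old leader becomes a follower and recognizes the new leader.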
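The first two steps can be sketched as a timeout check followed by picking the live follower with the highest log position. This deliberately ignores the hard part, which is getting all nodes to agree on the outcome; `NodeStatus` and the numbers used are made up for the example.

```python
from dataclasses import dataclass
from typing import List

TIMEOUT_SECONDS = 30.0


@dataclass
class NodeStatus:
    name: str
    last_heard_from: float  # time at which the last message from this node arrived
    log_position: int       # how much of the replication log the node has applied


def is_suspected_dead(node: NodeStatus, now: float) -> bool:
    # A timeout cannot distinguish a crashed node from a slow node or slow network.
    return now - node.last_heard_from > TIMEOUT_SECONDS


def choose_new_leader(followers: List[NodeStatus], now: float) -> NodeStatus:
    candidates = [f for f in followers if not is_suspected_dead(f, now)]
    if not candidates:
        raise RuntimeError("no live follower available for promotion")
    # Prefer the most up-to-date follower to minimize lost writes.
    return max(candidates, key=lambda f: f.log_position)


now = 1000.0
old_leader = NodeStatus("leader", last_heard_from=960.0, log_position=500)
followers = [
    NodeStatus("f1", last_heard_from=999.0, log_position=480),
    NodeStatus("f2", last_heard_from=998.5, log_position=497),
    NodeStatus("f3", last_heard_from=900.0, log_position=499),  # also unresponsive
]
new_leader = choose_new_leader(followers, now)
```

In this example the writes at log positions 498 to 500 exist only on nodes that are unreachable, so promoting `f2` would discard them.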
故障轉移充滿了可能出錯的事情:
* 如果使用非同步複製,新的領導者可能在失敗之前沒有收到來自舊領導者的所有寫入。如果前領導者在選擇了新領導者後重新加入叢集,那些寫入應該怎麼辦?新的領導者可能同時收到了衝突的寫入。最常見的解決方案是簡單地丟棄舊領導者未複製的寫入,這意味著你認為已提交的寫入實際上並不持久。
* 如果資料庫之外的其他儲存系統需要與資料庫內容協調,丟棄寫入尤其危險。例如,在 GitHub 的一次事故中 [^14],一個過時的 MySQL 追隨者被提升為領導者。資料庫使用自增計數器為新行分配主鍵,但由於新領導者的計數器落後於舊領導者,它重用了舊領導者先前分配的一些主鍵。這些主鍵也在 Redis 儲存中使用,因此主鍵的重用導致 MySQL 和 Redis 之間的不一致,這導致一些私人資料被錯誤地披露給錯誤的使用者。
* 在某些故障場景中(見 [第 9 章](/tw/ch9#ch_distributed)),可能會發生兩個節點都認為自己是領導者的情況。這種情況稱為 **腦裂**,這是危險的:如果兩個領導者都接受寫入,並且沒有解決衝突的過程(見 ["多主複製"](#sec_replication_multi_leader)),資料很可能會丟失或損壞。作為安全措施,一些系統在檢測到兩個領導者時有一種機制來關閉一個節點。然而,如果這種機制設計不當,你最終可能會關閉兩個節點 [^15]。此外,當檢測到腦裂並關閉舊節點時,可能為時已晚,資料已經損壞。
* 在宣佈領導者死亡之前,正確的超時是什麼?更長的超時意味著在領導者失敗的情況下恢復時間更長。然而,如果超時太短,可能會有不必要的故障轉移。例如,臨時負載峰值可能導致節點的響應時間增加到超時以上,或者網路故障可能導致資料包延遲。如果系統已經在高負載或網路問題上掙扎,不必要的故障轉移可能會使情況變得更糟,而不是更好。
--------
> [!NOTE]
> 透過限制或關閉舊領導者來防止腦裂,被稱為 **柵欄機制**(fencing),或者更直白地說,**爆彼之頭**(STONITH)。我們將在 ["分散式鎖和租約"](/tw/ch9#sec_distributed_lock_fencing) 中更詳細地討論柵欄機制。
--------
這些問題沒有簡單的解決方案。因此,一些運維團隊更喜歡手動執行故障轉移,即使軟體支援自動故障轉移。
故障轉移最重要的是選擇一個最新的追隨者作為新的領導者——如果使用同步或半同步複製,這將是舊領導者在確認寫入之前等待的追隨者。使用非同步複製,你可以選擇具有最大日誌序列號的追隨者。這最小化了故障轉移期間丟失的資料量:丟失幾分之一秒的寫入可能是可以容忍的,但選擇落後幾天的追隨者可能是災難性的。
這些問題——節點故障;不可靠的網路;以及圍繞副本一致性、永續性、可用性和延遲的權衡——實際上是分散式系統中的基本問題。在 [第 9 章](/tw/ch9#ch_distributed) 和 [第 10 章](/tw/ch10#ch_consistency) 中,我們將更深入地討論它們。
### 複製日誌的實現 {#sec_replication_implementation}
基於領導者的複製在底層是如何工作的?讓我們簡要地看看實踐中使用的幾種不同的複製方法。
#### 基於語句的複製 {#statement-based-replication}
在最簡單的情況下,領導者記錄它執行的每個寫入請求(**語句**)並將該語句日誌傳送給其追隨者。對於關係資料庫,這意味著每個 `INSERT`、`UPDATE` 或 `DELETE` 語句都被轉發到追隨者,每個追隨者解析並執行該 SQL 語句,就像它是從客戶端接收的一樣。
雖然這聽起來合理,但這種複製方法可能會出現各種問題:
* 任何呼叫非確定性函式的語句,例如 `NOW()` 獲取當前日期和時間或 `RAND()` 獲取隨機數,可能會在每個副本上生成不同的值。
* 如果語句使用自增列,或者如果它們依賴於資料庫中的現有資料(例如,`UPDATE … WHERE <某條件>`),它們必須在每個副本上以完全相同的順序執行,否則它們可能會產生不同的效果。當有多個併發執行的事務時,這可能會受到限制。
* 具有副作用的語句(例如,觸發器、儲存過程、使用者定義的函式)可能會導致每個副本上發生不同的副作用,除非副作用是絕對確定的。
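These issues can be worked around. For example, the leader can replace any nondeterministic function calls with a fixed return value when the statement is logged, so that the followers all get the same value. The idea of executing deterministic statements in a fixed order is similar to the event sourcing model discussed earlier in ["Event Sourcing and CQRS"](/tw/ch3#sec_datamodels_events). This approach is also known as **state machine replication**, and we will discuss the theory behind it in ["Using shared logs"](/tw/ch10#sec_consistency_smr).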
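As a rough illustration of replacing a nondeterministic call with a fixed value, the sketch below rewrites `NOW()` into a literal before the statement is logged and then runs the logged statement on two SQLite databases standing in for a leader and a follower. The plain string replacement is an assumption made for brevity: a real database would rewrite the parsed statement, not the text.

```python
import sqlite3


def rewrite_statement(sql: str, now: float) -> str:
    """Replace NOW() with the value the leader observed (naive text substitution)."""
    return sql.replace("NOW()", repr(now))


def make_replica() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE comments (body TEXT, created_at REAL)")
    return db


statement = "INSERT INTO comments VALUES ('hello', NOW())"
# The leader evaluates the clock once and logs the resulting literal.
logged = rewrite_statement(statement, now=1700000000.0)

leader_db, follower_db = make_replica(), make_replica()
for db in (leader_db, follower_db):
    db.execute(logged)  # every replica executes the same deterministic statement

leader_rows = leader_db.execute("SELECT * FROM comments").fetchall()
follower_rows = follower_db.execute("SELECT * FROM comments").fetchall()
```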
基於語句的複製在 MySQL 5.1 版本之前使用。它今天有時仍在使用,因為它相當緊湊,但預設情況下,如果語句中有任何非確定性,MySQL 現在會切換到基於行的複製(稍後討論)。VoltDB 使用基於語句的複製,並透過要求事務是確定性的來使其安全 [^16]。然而,確定性在實踐中很難保證,因此許多資料庫更喜歡其他複製方法。
#### 預寫日誌(WAL)傳輸 {#write-ahead-log-wal-shipping}
在 [第 4 章](/tw/ch4#ch_storage) 中,我們看到預寫日誌是使 B 樹儲存引擎健壯所必需的:每個修改首先寫入 WAL,以便在崩潰後可以將樹恢復到一致狀態。由於 WAL 包含將索引和堆恢復到一致狀態所需的所有資訊,我們可以使用完全相同的日誌在另一個節點上構建副本:除了將日誌寫入磁碟外,領導者還透過網路將其傳送給其追隨者。當追隨者處理此日誌時,它構建了與領導者上找到的完全相同的檔案副本。
此複製方法在 PostgreSQL 和 Oracle 等中使用 [^17] [^18]。主要缺點是日誌在非常低的級別描述資料:WAL 包含哪些位元組在哪些磁碟塊中被更改的詳細資訊。這使得複製與儲存引擎緊密耦合。如果資料庫從一個版本更改其儲存格式到另一個版本,通常不可能在領導者和追隨者上執行不同版本的資料庫軟體。
這可能看起來像是一個小的實現細節,但它可能會產生很大的操作影響。如果複製協議允許追隨者使用比領導者更新的軟體版本,你可以透過首先升級追隨者然後執行故障轉移以使其中一個升級的節點成為新的領導者來執行資料庫軟體的零停機升級。如果複製協議不允許此版本不匹配(如 WAL 傳輸的情況),此類升級需要停機。
#### 邏輯(基於行)日誌複製 {#logical-row-based-log-replication}
另一種選擇是為複製和儲存引擎使用不同的日誌格式,這允許複製日誌與儲存引擎內部解耦。這種複製日誌稱為 **邏輯日誌**,以區別於儲存引擎的(**物理**)資料表示。
關係資料庫的邏輯日誌通常是描述以行粒度對資料庫表的寫入的記錄序列:
* 對於插入的行,日誌包含所有列的新值。
* 對於刪除的行,日誌包含足夠的資訊來唯一標識被刪除的行。通常這將是主鍵,但如果表上沒有主鍵,則需要記錄所有列的舊值。
* 對於更新的行,日誌包含足夠的資訊來唯一標識更新的行,以及所有列的新值(或至少所有已更改的列的新值)。
修改多行的事務會生成多個這樣的日誌記錄,後跟指示事務已提交的記錄。MySQL 除了 WAL 之外還保留一個單獨的邏輯複製日誌,稱為 **binlog**(當配置為使用基於行的複製時)。PostgreSQL 透過將物理 WAL 解碼為行插入/更新/刪除事件來實現邏輯複製 [^19]。
由於邏輯日誌與儲存引擎內部解耦,因此可以更容易地保持向後相容,允許領導者和追隨者執行不同版本的資料庫軟體。這反過來又可以以最少的停機時間升級到新版本 [^20]。
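A logical log format is also easier for external applications to parse. This is useful if you want to send the contents of a database to an external system, such as a data warehouse for offline analysis, or to build custom indexes and caches [^21]. This technique is called **change data capture**, and we will return to it in ["Change Data Capture"](/tw/ch12#sec_stream_cdc).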
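A consumer of a row-based logical log, whether a follower or an external system, might apply it roughly as follows. The record layout (`op`, `key`, `row`, `changes`) is invented for this sketch and does not correspond to the MySQL binlog or the PostgreSQL logical decoding format.

```python
def apply_record(table: dict, record: dict) -> None:
    op = record["op"]
    if op == "insert":
        table[record["key"]] = dict(record["row"])       # new values of all columns
    elif op == "update":
        table[record["key"]].update(record["changes"])   # new values of changed columns
    elif op == "delete":
        del table[record["key"]]                         # the key identifies the row
    else:
        raise ValueError(f"unknown operation: {op}")


def apply_log(table: dict, log: list) -> None:
    pending = []
    for record in log:
        if record["op"] == "commit":
            # Only apply a transaction's records once its commit record is seen.
            for buffered in pending:
                apply_record(table, buffered)
            pending.clear()
        else:
            pending.append(record)


log = [
    {"op": "insert", "key": 1, "row": {"name": "Alice", "city": "Paris"}},
    {"op": "insert", "key": 2, "row": {"name": "Bob", "city": "Oslo"}},
    {"op": "update", "key": 1, "changes": {"city": "Rome"}},
    {"op": "delete", "key": 2},
    {"op": "commit"},
    {"op": "insert", "key": 3, "row": {"name": "Carol", "city": "Lima"}},  # not committed
]
replica_table = {}
apply_log(replica_table, log)
```

Nothing in the records refers to pages, blocks, or byte offsets, which is why the consumer does not need to share the leader's storage engine.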
## 複製延遲的問題 {#sec_replication_lag}
能夠容忍節點故障只是想要複製的一個原因。如 ["分散式與單節點系統"](/tw/ch1#sec_introduction_distributed) 中所述,其他原因是可伸縮性(處理比單臺機器能夠處理的更多請求)和延遲(將副本在地理上放置得更接近使用者)。
基於領導者的複製要求所有寫入都透過單個節點,但只讀查詢可以轉到任何副本。對於主要由讀取和只有少量寫入組成的工作負載(這通常是線上服務的情況),有一個有吸引力的選擇:建立許多追隨者,並將讀取請求分佈在這些追隨者上。這減輕了領導者的負載,並允許附近的副本提供讀取請求。
在這種 **讀擴充套件** 架構中,你可以透過新增更多追隨者來簡單地增加服務只讀請求的容量。然而,這種方法只有在使用非同步複製時才現實可行——如果你試圖同步複製到所有追隨者,單個節點故障或網路中斷將使整個系統無法寫入。而且你擁有的節點越多,其中一個節點宕機的可能性就越大,因此完全同步的配置將非常不可靠。
不幸的是,如果應用程式從 **非同步** 追隨者讀取,如果追隨者已落後,它可能會看到過時的資訊。這導致資料庫中出現明顯的不一致:如果你同時在領導者和追隨者上執行相同的查詢,你可能會得到不同的結果,因為並非所有寫入都已反映在追隨者中。這種不一致只是一種臨時狀態——如果你停止向資料庫寫入並等待一段時間,追隨者最終將趕上並與領導者保持一致。因此,這種效果被稱為 **最終一致性** [^22]。
--------
> [!NOTE]
> 術語 **最終一致性** 由 Douglas Terry 等人創造 [^23],由 Werner Vogels 推廣 [^24],併成為許多 NoSQL 專案的戰鬥口號。然而,不僅 NoSQL 資料庫是最終一致的:非同步複製的關係資料庫中的追隨者具有相同的特徵。
--------
術語"最終"是故意模糊的:一般來說,副本可以落後多遠沒有限制。在正常操作中,寫入發生在領導者上並反映在追隨者上之間的延遲——**複製延遲**——可能只是幾分之一秒,在實踐中不會被注意到。然而,如果系統在接近容量執行或網路中存在問題,延遲可以輕易增加到幾秒甚至幾分鐘。
當延遲如此之大時,它引入的不一致不僅僅是一個理論問題,而是應用程式的真正問題。在本節中,我們將重點介紹複製延遲時可能發生的三個問題示例。我們還將概述解決它們的一些方法。
### 讀己之寫 {#sec_replication_ryw}
許多應用程式讓使用者提交一些資料,然後檢視他們提交的內容。這可能是客戶資料庫中的記錄,或討論執行緒上的評論,或其他類似的東西。提交新資料時,必須將其傳送到領導者,但當用戶檢視資料時,可以從追隨者讀取。如果資料經常被檢視但只是偶爾被寫入,這尤其合適。
使用非同步複製,存在一個問題,如 [圖 6-3](#fig_replication_read_your_writes) 所示:如果使用者在寫入後不久檢視資料,新資料可能尚未到達副本。對使用者來說,看起來他們提交的資料丟失了,所以他們自然會不高興。
{{< figure src="/fig/ddia_0603.png" id="fig_replication_read_your_writes" caption="圖 6-3. 使用者進行寫入,然後從陳舊副本讀取。為了防止這種異常,我們需要寫後讀一致性。" class="w-full my-4" >}}
在這種情況下,我們需要 **寫後讀一致性**,也稱為 **讀己之寫一致性** [^23]。這是一種保證,如果使用者重新載入頁面,他們將始終看到他們自己提交的任何更新。它不對其他使用者做出承諾:其他使用者的更新可能直到稍後才可見。然而,它向用戶保證他們自己的輸入已正確儲存。
我們如何在基於領導者的複製系統中實現寫後讀一致性?有各種可能的技術。下面舉幾個例子:
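* When reading something that the user may have modified, read it from the leader or from a synchronously updated follower; otherwise, read it from an asynchronously updated follower. This requires some way of knowing whether something might have been modified without actually querying it. For example, profile information on a social network is normally editable only by the owner of the profile, not by anybody else. A simple rule is therefore: always read the user's own profile from the leader, and any other user's profile from a follower.
* If most things in the application are potentially editable by the user, that approach is not effective, because most things would have to be read from the leader (negating the benefit of read scaling). In that case, other criteria can be used to decide whether to read from the leader. For example, you could track the time of the last update and, for one minute after it, serve all reads from the leader [^25]. You could also monitor replication lag on followers and prevent queries on any follower that is more than one minute behind the leader.
* The client can remember the timestamp of its most recent write; the system can then ensure that the replica serving any reads for that user reflects updates at least up to that timestamp. If a replica is not sufficiently up to date, either the read can be handled by another replica, or the query can wait until the replica has caught up [^26]. The timestamp can be a **logical timestamp** (something that indicates the ordering of writes, such as a log sequence number) or the actual system clock (in which case clock synchronization becomes critical; see ["Unreliable Clocks"](/tw/ch9#sec_distributed_clocks)).
* If your replicas are distributed across regions (for geographical proximity to users or for availability), there is additional complexity. Any request that must be served by the leader has to be routed to the region that contains the leader.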
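The logical-timestamp technique from the list above might look like this in a request router. It is a sketch under the assumption that each replica reports the highest log position it has applied; the names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Replica:
    name: str
    applied_position: int  # highest log position this replica has applied


def choose_replica_for_read(
    followers: List[Replica], leader: Replica, last_write_position: int
) -> Replica:
    """Pick a replica that already reflects the user's most recent write."""
    for follower in followers:
        if follower.applied_position >= last_write_position:
            return follower
    # No follower is fresh enough: fall back to the leader (or wait and retry).
    return leader


leader = Replica("leader", applied_position=120)
followers = [Replica("f1", applied_position=95), Replica("f2", applied_position=118)]

fresh_enough = choose_replica_for_read(followers, leader, last_write_position=110)
needs_leader = choose_replica_for_read(followers, leader, last_write_position=120)
never_wrote = choose_replica_for_read(followers, leader, last_write_position=0)
```

The client has to send `last_write_position` with each read, for example in a cookie or request header, which is also why cross-device consistency needs that metadata to be stored centrally.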
當同一使用者從多個裝置訪問你的服務時,會出現另一個複雜情況,例如桌面網路瀏覽器和移動應用程式。在這種情況下,你可能希望提供 **跨裝置** 寫後讀一致性:如果使用者在一個裝置上輸入一些資訊,然後在另一個裝置上檢視它,他們應該看到他們剛剛輸入的資訊。
在這種情況下,需要考慮一些額外的問題:
* 需要記住使用者上次更新的時間戳的方法變得更加困難,因為在一個裝置上執行的程式碼不知道在另一個裝置上發生了什麼更新。此元資料將需要集中化。
* 如果你的副本分佈在不同的地區,則無法保證來自不同裝置的連線將路由到同一地區。(例如,如果使用者的臺式計算機使用家庭寬頻連線,而他們的移動裝置使用蜂窩資料網路,則裝置的網路路由可能完全不同。)如果你的方法需要從領導者讀取,你可能首先需要將來自使用者所有裝置的請求路由到同一地區。
--------
> [!TIP] 地區和可用區
>
> 我們用 **地區**(region)來指代一個地理位置中的一組資料中心。雲服務提供商通常會在同一地區部署多個數據中心,每個資料中心稱為 **可用區**(availability zone,簡稱 AZ)。因此,一個地區由多個可用區組成;每個可用區都是獨立的物理設施,具有自己的供電、製冷等基礎設施。
>
> 同一地區內各可用區通常透過高速網路互聯,延遲足夠低,因此大多數分散式系統可以把同一地區內的多個可用區近似看作一個機房。多可用區部署可以抵禦單個可用區故障,但無法抵禦整個地區不可用。要應對地區級中斷,系統必須跨多個地區部署,這通常會帶來更高延遲、更低吞吐和更高的雲網絡費用。我們將在 ["多主複製拓撲"](#sec_replication_topologies) 中進一步討論這些權衡。這裡你只需記住:本書所說的“地區”,是同一地理位置內多個可用區(資料中心)的集合。
--------
### 單調讀 {#sec_replication_monotonic_reads}
從非同步追隨者讀取時可能發生的第二個異常示例是,使用者可能會看到事物 **在時間上倒退**。
如果使用者從不同的副本進行多次讀取,就可能發生這種情況。例如,[圖 6-4](#fig_replication_monotonic_reads) 顯示使用者 2345 進行相同的查詢兩次,首先到延遲很小的追隨者,然後到延遲更大的追隨者。(如果使用者重新整理網頁,並且每個請求都路由到隨機伺服器,這種情況很可能發生。)第一個查詢返回使用者 1234 最近新增的評論,但第二個查詢沒有返回任何內容,因為滯後的追隨者尚未獲取該寫入。實際上,第二個查詢觀察到的系統狀態比第一個查詢更早的時間點。如果第一個查詢沒有返回任何內容,這不會那麼糟糕,因為使用者 2345 可能不知道使用者 1234 最近添加了評論。然而,如果使用者 2345 首先看到使用者 1234 的評論出現,然後又看到它消失,這對使用者 2345 來說非常令人困惑。
{{< figure src="/fig/ddia_0604.png" id="fig_replication_monotonic_reads" caption="圖 6-4. 使用者首先從新鮮副本讀取,然後從陳舊副本讀取。時間似乎倒退了。為了防止這種異常,我們需要單調讀。" class="w-full my-4" >}}
**單調讀** [^22] 是一種保證這類異常不會發生的會話保證。它比強一致性弱,但比最終一致性強。當你讀取資料時,仍可能看到舊值;單調讀只保證同一使用者按順序進行多次讀取時,不會出現“時間倒退”——也就是先讀到新值,後又讀到更舊的值。
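One way of achieving monotonic reads is to make sure that each user always reads from the same replica (different users can read from different replicas). For example, the replica can be chosen based on a hash of the user ID rather than at random. However, if that replica fails, the user's queries will need to be rerouted to another replica.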
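Choosing a replica by hashing the user ID can be sketched as below. The function name and replica names are placeholders. A cryptographic hash from `hashlib` is used instead of Python's built-in `hash()`, because the built-in is randomized per process for strings and would send the same user to different replicas on different application servers.

```python
import hashlib
from typing import List


def replica_for_user(user_id: str, replicas: List[str]) -> str:
    """Map a user to one replica so that all of their reads see the same history."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(replicas)
    return replicas[index]


replicas = ["follower-1", "follower-2", "follower-3"]
first = replica_for_user("user-2345", replicas)
second = replica_for_user("user-2345", replicas)
```

If the chosen replica fails and the list of replicas changes, the user may be rerouted to a replica that is further behind, so this approach alone does not preserve monotonic reads across failures.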
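### Consistent Prefix Reads {#sec_replication_consistent_prefix}
Our third example of a replication lag anomaly concerns violations of causality. Imagine the following short dialog between Mr. Poons and Mrs. Cake:
Mr. Poons
: How far into the future can you see, Mrs. Cake?
Mrs. Cake
: About ten seconds usually, Mr. Poons.
There is a causal dependency between those two sentences: Mrs. Cake heard Mr. Poons's question and answered it.
Now imagine a third person listening to this conversation through followers. The things said by Mrs. Cake go through a follower with little lag, but the things said by Mr. Poons have a longer replication lag (see [Figure 6-5](#fig_replication_consistent_prefix)). This observer would hear the following:
Mrs. Cake
: About ten seconds usually, Mr. Poons.
Mr. Poons
: How far into the future can you see, Mrs. Cake?
To the observer it looks as though Mrs. Cake is answering the question before Mr. Poons has even asked it. Such psychic powers are impressive, but very confusing [^27].
{{< figure src="/fig/ddia_0605.png" id="fig_replication_consistent_prefix" caption="Figure 6-5. If some shards are replicated more slowly than others, an observer may see the answer before they see the question." class="w-full my-4" >}}
Preventing this kind of anomaly requires another type of guarantee: **consistent prefix reads** [^22]. This guarantee says that if a sequence of writes happens in a certain order, then anyone reading those writes will see them appear in the same order.
This is a particular problem in sharded (partitioned) databases, which we will discuss in [Chapter 7](/tw/ch7#ch_sharding). If the database always applies writes in the same order, reads always see a consistent prefix, so this anomaly cannot happen. However, in many distributed databases, different shards operate independently, so there is no global ordering of writes: when a user reads from the database, they may see some parts of the database in an older state and some in a newer state.
One solution is to make sure that any writes that are causally related to each other are written to the same shard, but in some applications that cannot be done efficiently. There are also algorithms that explicitly keep track of causal dependencies, a topic that we will return to in ["The happens-before relation and concurrency"](#sec_replication_happens_before).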
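### Solutions for Replication Lag {#id131}
When working with an eventually consistent system, it is worth thinking about how the application behaves if the replication lag increases to several minutes or even hours. If the answer is "no problem," that's great. However, if the result is a bad experience for users, the system should be designed to provide a stronger guarantee, such as read-after-write consistency. Pretending that replication is synchronous when in fact it is asynchronous tends to cause problems that surface when the system is under stress.
As discussed earlier, an application can provide a stronger guarantee than the underlying database, for example by performing certain kinds of reads on the leader or on a synchronously updated follower. However, dealing with these issues in application code is complex and easy to get wrong.
The simplest programming model for application developers is to choose a database that provides a strong consistency guarantee for replicas, such as linearizability (see [Chapter 10](/tw/ch10#ch_consistency)) and ACID transactions (see [Chapter 8](/tw/ch8#ch_transactions)). This allows you to mostly ignore the challenges that arise from replication, and to treat the database as if it had just a single node. In the early 2010s the **NoSQL** movement promoted the view that these features limited scalability, and that large-scale systems would have to embrace eventual consistency.
However, since then, a number of databases started providing strong consistency and transactions while also offering the fault tolerance, high availability, and scalability advantages of a distributed database. As mentioned in ["Relational Model versus Document Model"](/tw/ch3#sec_datamodels_history), this trend is known as **NewSQL** to contrast with NoSQL (although it is less about SQL specifically, and more about new approaches to scalable transaction management).
Even though scalable, strongly consistent distributed databases are now available, there are still good reasons why some applications choose to use different forms of replication that offer weaker consistency guarantees: they can offer greater resilience in the face of network interruptions, and they have lower overheads compared to transactional systems. We will explore such approaches in the rest of this chapter.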
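## Multi-Leader Replication {#sec_replication_multi_leader}
So far in this chapter we have only considered replication architectures using a single leader. Although that is a common approach, there are interesting alternatives.
Single-leader replication has one major downside: all writes must go through the one leader. If you can't connect to the leader for any reason, for example due to a network interruption between you and the leader, you can't write to the database.
A natural extension of the single-leader replication model is to allow more than one node to accept writes. Replication still happens in the same way: each node that processes a write must forward that data change to all the other nodes. We call this a **multi-leader** configuration (also known as **active/active** or **bidirectional** replication). In this setup, each leader simultaneously acts as a follower to the other leaders.
As with single-leader replication, there is a choice between making it synchronous or asynchronous. Say you have two leaders, *A* and *B*, and you're trying to write to *A*. If writes are synchronously replicated from *A* to *B*, and the network between the two nodes is interrupted, you can't write to *A* until the network comes back. Synchronous multi-leader replication thus gives you a model that is very similar to single-leader replication: it is as if you had made *B* the leader, and *A* simply forwarded any write requests to *B* to be executed.
For that reason, we won't go further into synchronous multi-leader replication, and simply treat it as equivalent to single-leader replication. The rest of this section focuses on asynchronous multi-leader replication, in which any leader can process writes even when its connection to the other leaders is interrupted.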
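### Geographically Distributed Operation {#sec_replication_multi_dc}
It rarely makes sense to use a multi-leader setup within a single region, because the benefits rarely outweigh the added complexity. However, there are some situations in which this configuration is reasonable.
Imagine you have a database with replicas in several different regions (perhaps so that you can tolerate the failure of an entire region, or perhaps in order to be closer to your users). This is known as a **geographically distributed**, **geo-distributed**, or **geo-replicated** setup. With single-leader replication, the leader has to be in **one** of the regions, and all writes must go through that region.
In a multi-leader configuration, you can have a leader in **each** region. [Figure 6-6](#fig_replication_multi_dc) shows what this architecture might look like: within each region, regular single-leader replication is used (with followers possibly in a different availability zone from the leader); between regions, each region's leader replicates its changes to the leaders in other regions.
{{< figure src="/fig/ddia_0606.png" id="fig_replication_multi_dc" caption="Figure 6-6. Multi-leader replication across multiple regions." class="w-full my-4" >}}
Let's compare how the single-leader and multi-leader configurations fare in a multi-region deployment:
Performance
: In a single-leader configuration, every write must go over the internet to the region with the leader. This can add significant latency to writes and might defeat the purpose of having multiple regions in the first place. In a multi-leader configuration, every write can be processed in the local region and is replicated asynchronously to the other regions. Thus, the inter-region network delay is hidden from users, which means the perceived performance may be better.
Tolerance of regional outages
: In a single-leader configuration, if the region with the leader becomes unavailable, failover can promote a follower in another region to be leader. In a multi-leader configuration, each region can continue operating independently of the others, and replication catches up when the offline region comes back online.
Tolerance of network problems
: Even with dedicated connections, traffic between regions can be less reliable than traffic within the same region or within a single zone. A single-leader configuration is very sensitive to problems in this inter-region link, because when a client in one region wants to write to a leader in another region, it has to send its request over that link and wait for the response before it can complete.
A multi-leader configuration with asynchronous replication can tolerate network problems better: during a temporary network interruption, each region's leader can continue independently processing writes.
Consistency
: A single-leader system can provide strong consistency guarantees, such as serializable transactions, which we will discuss in [Chapter 8](/tw/ch8#ch_transactions). The biggest downside of multi-leader systems is that the consistency they can achieve is much weaker. For example, you can't guarantee that a bank account won't go negative or that a username is unique: it's always possible for different leaders to process writes that are individually fine (paying out some of the money in an account, registering a particular username), but which violate the constraint when taken together with another write on another leader.
This is simply a fundamental limitation of distributed systems [^28]. If you need to enforce such constraints, you should generally choose a single-leader system. However, as we will see in ["Dealing with Conflicting Writes"](#sec_replication_write_conflicts), multi-leader systems can still provide useful consistency properties in a wide range of applications that don't need such constraints.
Multi-leader replication is less common than single-leader replication, but it is still supported by many databases, including MySQL, Oracle, SQL Server, and YugabyteDB. In some cases it is an external add-on feature, for example in Redis Enterprise, EDB Postgres Distributed, and pglogical [^29].
As multi-leader replication is a somewhat retrofitted feature in many databases, there are often subtle configuration pitfalls and surprising interactions with other database features. For example, autoincrementing keys, triggers, and integrity constraints can be problematic. For this reason, multi-leader replication is often considered dangerous territory that should be avoided if possible [^30].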
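#### Multi-leader replication topologies {#sec_replication_topologies}
A **replication topology** describes the communication paths along which writes are propagated from one node to another. If you have two leaders, like in [Figure 6-9](#fig_replication_write_conflict), there is only one plausible topology: leader 1 must send all of its writes to leader 2, and vice versa. With more than two leaders, various different topologies are possible. Some examples are illustrated in [Figure 6-7](#fig_replication_topologies).
{{< figure src="/fig/ddia_0607.png" id="fig_replication_topologies" caption="Figure 6-7. Three example topologies in which multi-leader replication can be set up." class="w-full my-4" >}}
The most general topology is **all-to-all**, shown in [Figure 6-7](#fig_replication_topologies)(c), in which every leader sends its writes to every other leader. However, more restricted topologies are also used: for example a **circular topology**, in which each node receives writes from one node and forwards those writes (plus any writes of its own) to one other node. Another popular topology has the shape of a **star**: one designated root node forwards writes to all of the other nodes. The star topology can be generalized to a tree.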
--------
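> [!NOTE]
> Don't confuse a star-shaped network topology with a **star schema** (see ["Stars and Snowflakes: Schemas for Analytics"](/tw/ch3#sec_datamodels_analytics)), which describes the structure of a data model.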
--------
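In circular and star topologies, a write may need to pass through several nodes before it reaches all replicas. Therefore, nodes need to forward data changes they receive from other nodes. To prevent infinite replication loops, each node is given a unique identifier, and in the replication log, each write is tagged with the identifiers of all the nodes it has passed through [^31]. When a node receives a data change that is tagged with its own identifier, that data change is ignored, because the node knows that it has already been processed.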
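
To make the loop-prevention rule concrete, here is a small simulation of a three-node circular topology. It is only a sketch: the `Change` and `Node` classes are hypothetical, and forwarding is modeled as a direct synchronous method call instead of an asynchronous replication log.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    key: str
    value: str
    seen_by: tuple = ()  # identifiers of all nodes this change has passed through


class Node:
    def __init__(self, node_id: str):
        self.node_id = node_id
        self.data = {}
        self.next_node = None  # successor in the ring
        self.applied = 0       # number of changes applied, for illustration

    def local_write(self, key: str, value: str) -> None:
        self.receive(Change(key, value))

    def receive(self, change: Change) -> None:
        if self.node_id in change.seen_by:
            return  # tagged with our own identifier: already processed, so drop it
        self.data[change.key] = change.value
        self.applied += 1
        if self.next_node is not None:
            tagged = Change(change.key, change.value, change.seen_by + (self.node_id,))
            self.next_node.receive(tagged)


# Build the ring a -> b -> c -> a and perform one write on node a.
a, b, c = Node("a"), Node("b"), Node("c")
a.next_node, b.next_node, c.next_node = b, c, a
a.local_write("title", "B")
```

When the change arrives back at node `a` it carries the tag `("a", "b", "c")`, so `a` ignores it and the forwarding stops after one trip around the ring.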
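#### Problems with different topologies {#problems-with-different-topologies}
A problem with circular and star topologies is that if just one node fails, it can interrupt the flow of replication messages between other nodes, leaving them unable to communicate until the node is fixed. The topology could be reconfigured to work around the failed node, but in most deployments such reconfiguration would have to be done manually. The fault tolerance of a more densely connected topology (such as all-to-all) is better because it allows messages to travel along different paths, avoiding a single point of failure.
On the other hand, all-to-all topologies can have issues too. In particular, some network links may be faster than others (e.g., due to network congestion), with the result that some replication messages may "overtake" others, as illustrated in [Figure 6-8](#fig_replication_causality).
{{< figure src="/fig/ddia_0608.png" id="fig_replication_causality" caption="Figure 6-8. With multi-leader replication, writes may arrive in the wrong order at some replicas." class="w-full my-4" >}}
In [Figure 6-8](#fig_replication_causality), client A inserts a row into a table on leader 1, and client B updates that row on leader 3. However, leader 2 may receive the writes in a different order: it may first receive the update (which, from its point of view, is an update to a row that does not exist in the database) and only later receive the corresponding insert (which should have preceded the update).
This is a problem of causality, similar to the one we saw in ["Consistent Prefix Reads"](#sec_replication_consistent_prefix): the update depends on the prior insert, so we need to make sure that all nodes process the insert first, and then the update. Simply attaching a timestamp to every write is not sufficient, because clocks cannot be trusted to be sufficiently in sync to correctly order these events at leader 2 (see [Chapter 9](/tw/ch9#ch_distributed)).
To order these events correctly, a technique called **version vectors** can be used, which we will discuss later in this chapter (see ["Detecting Concurrent Writes"](#sec_replication_concurrent)). However, many multi-leader replication systems don't use good techniques for ordering updates, leaving them vulnerable to issues like the one in [Figure 6-8](#fig_replication_causality). If you are using multi-leader replication, it is worth being aware of these issues, carefully reading the documentation, and thoroughly testing your database to ensure that it really does provide the guarantees you believe it to have.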
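### Sync Engines and Local-First Software {#sec_replication_offline_clients}
Another situation in which multi-leader replication is appropriate is if you have an application that needs to continue to work while it is disconnected from the internet.
For example, consider the calendar apps on your mobile phone, your laptop, and other devices. You need to be able to see your meetings (make read requests) and enter new meetings (make write requests) at any time, regardless of whether your device currently has an internet connection. If you make any changes while you are offline, they need to be synced with a server and your other devices when the device is next online.
In this case, every device has a local database replica that acts as a leader (it accepts write requests), and there is an asynchronous multi-leader replication process (sync) between the replicas of your calendar on all of your devices. The replication lag may be hours or even days, depending on when you have internet access available.
From an architectural point of view, this setup is very similar to multi-leader replication between regions, taken to the extreme: each device is a "region," and the network connection between them is extremely unreliable.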
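#### Real-time collaboration, offline-first, and local-first apps {#real-time-collaboration-offline-first-and-local-first-apps}
Moreover, many modern web apps offer **real-time collaboration** features, such as Google Docs and Sheets for text documents and spreadsheets, Figma for graphics, and Linear for project management. What makes these apps so responsive is that user input is immediately reflected in the user interface, without waiting for a network round-trip to the server, and edits by one user are shown to their collaborators with low latency [^32] [^33] [^34].
This again results in a multi-leader architecture: each web browser tab that has opened the shared file is a replica, and any updates that you make to the file are asynchronously replicated to the devices of the other users who have opened the same file. Even if the app does not allow you to continue editing a file while offline, the fact that multiple users can make edits without waiting for a response from the server already makes it multi-leader.
Both offline editing and real-time collaboration require a similar replication infrastructure: the application needs to capture any changes that the user makes to a file, and either send them to collaborators immediately (if online) or store them locally for sending later (if offline). Additionally, the application needs to receive changes from collaborators, merge them into the user's local copy of the file, and update the user interface to reflect the latest version. If multiple users have changed the file concurrently, conflict resolution logic may be needed to merge those changes.
A software library that supports this process is called a **sync engine**. Although the idea has existed for a long time, the term has only recently gained attention [^35] [^36] [^37]. An application that allows a user to continue editing a file while offline (which may be implemented using a sync engine) is called **offline-first** [^38]. The term **local-first software** refers to collaborative apps that are not only offline-first, but are also designed to continue working even if the developer who made the software shuts down all of their online services [^39]. This can be achieved by using a sync engine with an open standard sync protocol for which multiple service providers are available [^40]. For example, Git is a local-first collaboration system (albeit one that doesn't support real-time collaboration), since you can sync via GitHub, GitLab, or any other repository hosting service.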
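#### Pros and cons of sync engines {#pros-and-cons-of-sync-engines}
The dominant way of building web apps today is to keep very little persistent state on the client, and to rely on making requests to a server whenever a new piece of data needs to be displayed or some data needs to be updated. In contrast, when using a sync engine, you have persistent state on the client, and communication with the server is moved into a background process. The sync engine approach has a number of advantages:
* Having the data locally means the user interface can be much faster to respond than if it had to wait for a service call to fetch some data. Some apps aim to respond to user input in the **next frame** of the graphics system, which means rendering within 16 ms on a display with a 60 Hz refresh rate.
* Allowing users to continue working while offline is valuable, especially on mobile devices with intermittent connectivity. With a sync engine, an app doesn't need a separate offline mode: being offline is the same as having a very large network delay.
* A sync engine simplifies the programming model for frontend apps, compared to performing explicit service calls in application code. Every service call requires error handling, as discussed in ["The problems with remote procedure calls (RPCs)"](/tw/ch5#sec_problems_with_rpc): for example, if a request to update data on a server fails, the user interface needs to somehow reflect that error. A sync engine allows the app to perform reads and writes on local data, which almost never fails, leading to a more declarative programming style [^41].
* In order to display edits from other users in real time, you need to receive notifications of those edits and efficiently update the user interface accordingly. A sync engine combined with a **reactive programming** model is a good way of implementing this [^42].
Sync engines work best when all the data that the user may need is downloaded in advance and stored persistently on the client. This means that the data is available for offline access, but it also means that sync engines are not suitable if the user has access to a very large amount of data. For example, downloading all the files that the user themselves created is probably fine (one user generally doesn't generate that much data), but downloading the entire catalog of an e-commerce website probably doesn't make sense.
The sync engine was pioneered by Lotus Notes in the 1980s [^43] (without using that term), and sync for specific apps such as calendars has also existed for a long time. Today there are a number of general-purpose sync engines, some of which use a proprietary backend service (e.g., Google Firestore, Realm, or Ditto), while others have an open source backend, making them suitable for creating local-first software (e.g., PouchDB/CouchDB, Automerge, or Yjs).
Multiplayer video games have a similar need to respond immediately to the user's local actions, and to reconcile them with other players' actions received asynchronously over the network. In game development jargon the equivalent of a sync engine is called **netcode**. The techniques used in netcode are quite specific to the requirements of games [^44], and don't directly carry over to other types of software, so we won't consider them further in this book.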
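### Dealing with Conflicting Writes {#sec_replication_write_conflicts}
The biggest problem with multi-leader replication, both in a geo-distributed server-side database and in a local-first sync engine on end user devices, is that concurrent writes on different leaders can lead to conflicts that need to be resolved.
For example, consider a wiki page that is simultaneously being edited by two users, as shown in [Figure 6-9](#fig_replication_write_conflict). User 1 changes the title of the page from A to B, and user 2 independently changes the title from A to C. Each user's change is successfully applied to their local leader. However, when the changes are asynchronously replicated, a conflict is detected. This problem does not occur in a single-leader database.
{{< figure src="/fig/ddia_0609.png" id="fig_replication_write_conflict" caption="Figure 6-9. A write conflict caused by two leaders concurrently updating the same record." class="w-full my-4" >}}
> [!NOTE]
> We say that the two writes in [Figure 6-9](#fig_replication_write_conflict) are **concurrent** because neither was "aware" of the other at the time the write was originally made. It doesn't matter whether the writes literally happened at the same time; indeed, if the writes were made while offline, they might have happened a long time apart in physical time. What matters is whether one write was made on top of a state in which the other write had already taken effect.
In ["Detecting Concurrent Writes"](#sec_replication_concurrent) we will tackle the question of how a database can determine whether two writes are concurrent. For now we will assume that we can detect conflicts, and we want to figure out the best way of resolving them.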
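#### Conflict avoidance {#conflict-avoidance}
One strategy for dealing with conflicts is to avoid them occurring in the first place. For example, if the application can ensure that all writes for a particular record go through the same leader, then conflicts cannot occur, even if the database as a whole is multi-leader. This approach is not possible in the case of a sync engine client being updated offline, but it is sometimes possible in geo-replicated server systems [^30].
For example, in an application where a user can only edit their own data, you can ensure that requests from a particular user are always routed to the same region and use the leader in that region for reading and writing. Different users may have different "home" regions (perhaps picked based on geographic proximity to the user), but from any one user's point of view the configuration is essentially single-leader.
However, sometimes you might want to change the designated leader for a record, perhaps because one region is unavailable and you need to reroute traffic to another region, or perhaps because a user has moved to a different location and is now closer to a different region. There is now a risk that the user performs a write while the change of designated leader is in progress, leading to a conflict that would have to be resolved using one of the methods below. Thus, conflict avoidance breaks down if you allow the leader to be changed.
Another example of conflict avoidance: imagine you want to insert new records and generate unique IDs for them based on an auto-incrementing counter. If you have two leaders, you could set them up so that one leader only generates odd numbers and the other only generates even numbers. That way you can be sure that the two leaders won't concurrently assign the same ID to different records. We will discuss other ID assignment schemes in ["ID Generators and Logical Clocks"](/tw/ch10#sec_consistency_logical).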
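
The odd/even scheme generalizes to any number of leaders by giving each leader its own offset and using the number of leaders as the step size. The helper below is a hypothetical sketch of that idea (the name `id_generator` and its parameters are made up for this example):

```python
import itertools


def id_generator(leader_index: int, leader_count: int):
    """Yield IDs that no other leader (with a different leader_index) will ever produce."""
    return itertools.count(start=leader_index + 1, step=leader_count)


leader_1_ids = id_generator(leader_index=0, leader_count=2)  # 1, 3, 5, ... (odd)
leader_2_ids = id_generator(leader_index=1, leader_count=2)  # 2, 4, 6, ... (even)
```

The sketch assumes that the number of leaders is fixed in advance; adding a leader later would require changing the step on every node.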
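#### Last write wins (discarding concurrent writes) {#sec_replication_lww}
If conflicts can't be avoided, the simplest way of resolving them is to attach a timestamp to each write, and to always use the value with the greatest timestamp. For example, in [Figure 6-9](#fig_replication_write_conflict), let's say that the timestamp of user 1's write is greater than the timestamp of user 2's write. In that case, both leaders will determine that the new title of the page should be B, and they discard the write that sets it to C. If the writes coincidentally have the same timestamp, the winner can be chosen by comparing the values (e.g., in the case of strings, taking the one that's earlier in the alphabet).
This approach is called **last write wins** (LWW) because the write with the greatest timestamp can be considered the "last" one. The term is misleading though, because when two writes are concurrent like in [Figure 6-9](#fig_replication_write_conflict), which one is older and which is newer is undefined, and so the timestamp order of concurrent writes is essentially random.
Therefore the real meaning of LWW is: when the same record is concurrently written on different leaders, one of those writes is randomly chosen to be the winner, and the other writes are silently discarded, even though they were successfully processed at their respective leaders. This achieves the goal that eventually all replicas end up in a consistent state, but at the cost of data loss.
If you can avoid conflicts, for example by only inserting records with a unique key such as a UUID and never updating them, then LWW is no problem. But if you update existing records, or if different leaders may insert records with the same key, then you have to decide whether lost updates are a problem for your application. If lost updates are not acceptable, you need to use one of the conflict resolution approaches described below.
Another problem with LWW is that if a real-time clock (e.g., a Unix timestamp) is used as the timestamp for writes, the system becomes very sensitive to clock synchronization. If one node has a clock that is ahead of the others, and you try to overwrite a value written by that node, your write may be ignored because it may have a lower timestamp, even though it clearly occurred later. This problem can be solved by using a **logical clock**, which we will discuss in ["ID Generators and Logical Clocks"](/tw/ch10#sec_consistency_logical).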
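
A last-write-wins merge can be written in a few lines. The sketch below uses a hypothetical `Write` tuple and `lww_merge` function, and the timestamps are made-up numbers; it illustrates both the tie-breaking rule and the clock-skew problem described above.

```python
from typing import NamedTuple


class Write(NamedTuple):
    timestamp: float  # e.g., a Unix timestamp taken from the writing node's clock
    value: str


def lww_merge(a: Write, b: Write) -> Write:
    """Keep the write with the greatest timestamp; the other write is discarded."""
    if a.timestamp != b.timestamp:
        return a if a.timestamp > b.timestamp else b
    # Equal timestamps: break the tie by taking the value earlier in the alphabet.
    return a if a.value <= b.value else b


# The wiki title example: both leaders pick the same winner, whatever the arrival order.
user_1 = Write(timestamp=1002.0, value="B")
user_2 = Write(timestamp=1001.0, value="C")
winner = lww_merge(user_1, user_2)

# Clock skew: a node whose clock runs ahead wrote "old" with an inflated timestamp.
# A write made later in real time, on a node with a correct clock, loses against it.
skewed = Write(timestamp=2000.0, value="old")
later = Write(timestamp=1500.0, value="new")
survivor = lww_merge(skewed, later)
```

In this example `winner.value` is `"B"` and user 2's title is lost, while `survivor.value` is `"old"` even though `"new"` was written later.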
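#### Manual conflict resolution {#manual-conflict-resolution}
If randomly discarding some of your writes is not desirable, the next option is to resolve the conflict manually. You may be familiar with manual conflict resolution from Git and other version control systems: if commits on two different branches edit the same lines of the same file, and you try to merge those branches, you will get a merge conflict that needs to be resolved before the merge is complete.
In a database, it is usually impractical for a conflict to block the entire replication process until a human has resolved it. More commonly, the database keeps all the concurrently written values for a given record, for example both B and C in [Figure 6-9](#fig_replication_write_conflict). These values are sometimes called **siblings**. The next time you query that record, the database returns **all** those values, rather than just the latest one. You can then resolve those values in whatever way you want, either automatically in application code (for example, by concatenating B and C into "B/C") or by asking the user. You then write a new value back to the database to resolve the conflict.
This approach to conflict resolution is used in some systems, such as CouchDB. However, it also suffers from a number of problems:
* The API of the database changes: for example, where previously the title of the wiki page was just a string, it now becomes a set of strings that usually contains one element, but may sometimes contain multiple elements if there is a conflict. This can make the data awkward to work with in application code.
* Asking the user to manually merge the siblings is a lot of work: the developer needs to build a user interface for conflict resolution, and the user may not understand why they are being asked to do this. In many cases, it is better to merge automatically than to bother the user.
* Merging siblings automatically can lead to surprising behavior if it is not done carefully. For example, the shopping cart on Amazon used to allow concurrent updates, which were merged by taking the set union (keeping all the items that appeared in any of the siblings). This meant that if the customer had removed an item in one sibling, but another sibling still contained that item, the removed item would "come back to life" in the cart [^45]. [Figure 6-10](#fig_replication_amazon_anomaly) shows an example: device 1 removes Book, device 2 concurrently removes DVD, and after the conflict is merged both items reappear.
* If multiple nodes observe the conflict and concurrently resolve it, the conflict resolution process can itself introduce a new conflict. Those resolutions could even be inconsistent: for example, if you are not careful to order them consistently, one node may merge B and C into "B/C" and another may merge them into "C/B". When the conflict between "B/C" and "C/B" is merged, it may result in "B/C/C/B" or something similarly surprising.
{{< figure src="/fig/ddia_0610.png" id="fig_replication_amazon_anomaly" caption="Figure 6-10. Example of Amazon's shopping cart anomaly: if conflicts on a shopping cart are merged by taking the union, deleted items may reappear." class="w-full my-4" >}}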
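#### Automatic conflict resolution {#automatic-conflict-resolution}
For many applications, the best way of handling conflicts is to use an algorithm that automatically merges concurrent writes into a consistent state. Automatic conflict resolution ensures that all replicas **converge** to the same state, i.e., all replicas that have processed the same set of writes have the same state, regardless of the order in which the writes arrived.
LWW is a simple example of a conflict resolution algorithm. More sophisticated merge algorithms have been developed for different types of data, with the goal of preserving the intended effect of all updates as much as possible, and hence avoiding data loss:
* If the data is text (e.g., the title or the body of a wiki page), we can detect which characters have been inserted or deleted from one version to the next. The merged result then preserves all the insertions and deletions made in any of the siblings. If several users concurrently insert text at the same position, it can be ordered deterministically so that all nodes get the same merged outcome.
* If the data is a collection of items (ordered like a to-do list, or unordered like a shopping cart), we can merge it similarly to text by tracking insertions and deletions. To avoid the shopping cart issue in [Figure 6-10](#fig_replication_amazon_anomaly), the algorithm tracks the fact that Book and DVD were deleted, so the merged result is Cart = {Soap}.
* If the data is an integer counter that can be incremented or decremented (e.g., the number of likes on a social media post), the merge algorithm can tell how many increments and decrements happened on each sibling, and add them together correctly so that the result neither double-counts nor drops updates.
* If the data is a key-value mapping, we can merge updates to the same key by applying one of the other conflict resolution algorithms to the values under that key. Updates to different keys can be handled independently from each other.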
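
Two of the merge strategies in the list above can be illustrated with very small data structures. The sketch below is an assumption-laden simplification and not the algorithm of any specific library: `Cart` remembers removals as tombstones (so an item that was removed can never be added again in this simple version), and `PNCounter` keeps separate per-replica totals of increments and decrements.

```python
from dataclasses import dataclass, field


@dataclass
class Cart:
    added: set = field(default_factory=set)
    removed: set = field(default_factory=set)  # tombstones: items known to be deleted

    def items(self) -> set:
        return self.added - self.removed

    def merge(self, other: "Cart") -> "Cart":
        return Cart(self.added | other.added, self.removed | other.removed)


class PNCounter:
    def __init__(self, replica_id: str):
        self.replica_id = replica_id
        self.incs = {}  # replica ID -> total increments made on that replica
        self.decs = {}  # replica ID -> total decrements made on that replica

    def increment(self, n: int = 1) -> None:
        self.incs[self.replica_id] = self.incs.get(self.replica_id, 0) + n

    def decrement(self, n: int = 1) -> None:
        self.decs[self.replica_id] = self.decs.get(self.replica_id, 0) + n

    def value(self) -> int:
        return sum(self.incs.values()) - sum(self.decs.values())

    def merge(self, other: "PNCounter") -> None:
        # Taking the per-replica maximum makes merging idempotent (no double counting).
        for rid, n in other.incs.items():
            self.incs[rid] = max(self.incs.get(rid, 0), n)
        for rid, n in other.decs.items():
            self.decs[rid] = max(self.decs.get(rid, 0), n)


# The shopping cart anomaly: device 1 removes Book, device 2 removes DVD.
everything = {"Book", "DVD", "Soap"}
device_1 = Cart(added=set(everything), removed={"Book"})
device_2 = Cart(added=set(everything), removed={"DVD"})
naive_union = device_1.items() | device_2.items()  # Book and DVD reappear
tracked = device_1.merge(device_2).items()         # only Soap remains
```

Merging by plain union brings back both deleted items, whereas the merge that tracks deletions yields the cart containing only Soap, matching the intended outcome described in the list.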
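There are limits to what is possible with conflict resolution. For example, if you want to enforce that a list contains no more than five items, and multiple users concurrently add items to the list so that there are more than five in total, your only option is to drop some of the items. Nevertheless, automatic conflict resolution is sufficient to build many useful apps. And if you start from the requirement of wanting to build a collaborative offline-first or local-first app, then conflict resolution is inevitable, and automating it is often the best approach.
### CRDTs and Operational Transformation {#sec_replication_crdts}
Two families of algorithms are commonly used to implement automatic conflict resolution: **conflict-free replicated datatypes** (CRDTs) [^46] and **operational transformation** (OT) [^47]. They have different design philosophies and performance characteristics, but both are able to perform automatic merges for all the aforementioned types of data.
[Figure 6-11](#fig_replication_ot_crdt) shows an example of how OT and a CRDT merge concurrent updates to a text. Assume you have two replicas that both start off with the text "ice". One replica prepends the letter "n" to make "nice", while concurrently the other replica appends an exclamation mark to make "ice!".
{{< figure src="/fig/ddia_0611.png" id="fig_replication_ot_crdt" caption="Figure 6-11. How two concurrent insertions into a string are merged by OT and by a CRDT, respectively." class="w-full my-4" >}}
The merged result "nice!" is achieved differently by the two types of algorithms:
OT
: We record the index at which characters are inserted or deleted: "n" is inserted at index 0, and "!" at index 3. Next, the replicas exchange their operations. The insertion of "n" at 0 can be applied as-is, but if the insertion of "!" at 3 were applied to the state "nice" we would get "nic!e", which is incorrect. We therefore need to transform the index of each operation to account for concurrent operations that have already been applied; in this case, the insertion of "!" is transformed to index 4 to account for the insertion of "n" at an earlier index.
CRDT
: Most CRDTs give each character a unique, immutable ID and use those IDs, instead of indexes, to determine the positions of insertions and deletions. For example, in [Figure 6-11](#fig_replication_ot_crdt) we assign the ID 1A to "i", the ID 2A to "c", and so on. When inserting the exclamation mark, we generate an operation containing the ID of the new character (4B) and the ID of the existing character after which we want to insert (3A). To insert at the beginning of the string we give "nil" as the preceding character ID. Concurrent insertions at the same position are ordered by the IDs of the characters. This ensures that replicas converge without performing any transformation.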
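
The index transformation in the OT description can be traced with a few lines of code. This is a deliberately tiny sketch that handles only single-character insertions; the function names are made up for the example, and production OT algorithms also deal with deletions, longer histories, and server coordination.

```python
def apply_insert(text: str, index: int, char: str) -> str:
    return text[:index] + char + text[index:]


def transform_insert(index: int, concurrent_index: int, concurrent_wins_tie: bool) -> int:
    """Adjust an insertion index for a concurrent insertion that was already applied.

    If the concurrent character landed before our position (or at the same position
    and a deterministic tie-break rule says it goes first), our index shifts right by one.
    """
    if concurrent_index < index or (concurrent_index == index and concurrent_wins_tie):
        return index + 1
    return index


# Replica 1 prepends "n" locally, then receives the concurrent insertion of "!" at index 3.
replica_1 = apply_insert("ice", 0, "n")
replica_1 = apply_insert(replica_1, transform_insert(3, 0, concurrent_wins_tie=False), "!")

# Replica 2 appends "!" locally, then receives the concurrent insertion of "n" at index 0.
replica_2 = apply_insert("ice", 3, "!")
replica_2 = apply_insert(replica_2, transform_insert(0, 3, concurrent_wins_tie=True), "n")
```

Both replicas end up with "nice!": on replica 1 the index of "!" is transformed from 3 to 4, while on replica 2 the index of "n" stays at 0.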
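There are many algorithms based on variations of these ideas. Lists and arrays can be supported similarly, using list elements instead of characters, and other datatypes such as key-value maps can be added quite easily. There are some performance and functionality trade-offs between OT and CRDTs, but it is possible to combine the advantages of CRDTs and OT in one algorithm [^48].
OT is most often used for real-time collaborative editing of text, e.g. in Google Docs [^32], whereas CRDTs can be found in distributed databases such as Redis Enterprise, Riak, and Azure Cosmos DB [^49]. Sync engines for JSON data can be implemented both with CRDTs (e.g., Automerge or Yjs) and with OT (e.g., ShareDB).
#### What is a conflict? {#what-is-a-conflict}
Some kinds of conflict are obvious. In the example in [Figure 6-9](#fig_replication_write_conflict), two writes concurrently modified the same field in the same record, setting it to two different values. There is little doubt that this is a conflict.
Other kinds of conflict can be more subtle to detect. For example, consider a meeting room booking system: it tracks which room is booked by which group of people at which time. This application needs to ensure that each room is only booked by one group of people at any one time (i.e., there must not be any overlapping bookings for the same room). In this case, a conflict may arise if two different bookings are created for the same room at the same time. Even if the application checks availability before allowing a user to make a booking, there can be a conflict if the two bookings are made on two different leaders.
There isn't a quick ready-made answer, but in the following chapters we will gradually build up an understanding of this problem. We will see some more examples of conflicts in [Chapter 8](/tw/ch8#ch_transactions), and in ["Capturing causality through event ordering"](/tw/ch13#sec_future_capture_causality) we will discuss scalable approaches for detecting and resolving conflicts in a replicated system.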
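## Leaderless Replication {#sec_replication_leaderless}
The replication approaches we have discussed so far in this chapter, single-leader and multi-leader replication, are based on the idea that a client sends a write request to one node (the leader), and the database system takes care of copying that write to the other replicas. A leader determines the order in which writes should be processed, and followers apply the leader's writes in the same order.
Some data storage systems take a different approach, abandoning the concept of a leader and allowing any replica to directly accept writes from clients. Some of the earliest replicated data systems were leaderless [^1] [^50], but the idea was mostly forgotten during the era of dominance of relational databases. It once again became a fashionable architecture for databases after Amazon used it for its in-house **Dynamo** system in 2007 [^45]. Riak, Cassandra, and ScyllaDB are open source datastores with leaderless replication models inspired by Dynamo, so this kind of database is also known as **Dynamo-style**.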
--------
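> [!NOTE]
> The original **Dynamo** system was only described in a paper [^45], but never released outside of Amazon. The similarly named **DynamoDB** is a more recent cloud database from AWS, but it has a completely different architecture: it uses single-leader replication based on the Multi-Paxos consensus algorithm [^5].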
--------
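In some leaderless implementations, the client directly sends its writes to several replicas, while in others, a coordinator node does this on behalf of the client. However, unlike a leader database, that coordinator does not enforce a particular ordering of writes. As we shall see, this difference in design has profound consequences for the way the database is used.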
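### Writing to the Database When a Node Is Down {#id287}
Imagine you have a database with three replicas, and one of the replicas is currently unavailable, perhaps because it is being rebooted to install a system update. In a single-leader configuration, if you want to continue processing writes, you may need to perform a failover (see ["Handling Node Outages"](#sec_replication_failover)).
On the other hand, in a leaderless configuration, failover does not exist. [Figure 6-12](#fig_replication_quorum_node_outage) shows what happens: the client (user 1234) sends the write to all three replicas in parallel, and the two available replicas accept the write but the unavailable replica misses it. Let's say that it's sufficient for two out of three replicas to acknowledge the write: after user 1234 has received two **ok** responses, we consider the write to be successful. The client simply ignores the fact that one of the replicas missed the write.
{{< figure src="/fig/ddia_0612.png" id="fig_replication_quorum_node_outage" caption="Figure 6-12. A quorum write, quorum read, and read repair after a node outage." class="w-full my-4" >}}
Now imagine that the unavailable node comes back online, and clients start reading from it. Any writes that happened while the node was down are missing from that node. Thus, if you read from that node, you may get **stale** (outdated) values as responses.
To solve that problem, when a client reads from the database, it doesn't just send its request to one replica: **read requests are also sent to several nodes in parallel**. The client may get different responses from different nodes; for example, the up-to-date value from one node and a stale value from another.
In order to tell which responses are up-to-date and which are outdated, every value that is written needs to be tagged with a version number or timestamp, similarly to what we saw in ["Last write wins (discarding concurrent writes)"](#sec_replication_lww). When a client receives multiple values in response to a read, it uses the one with the greatest timestamp (even if that value was only returned by one replica, and several other replicas returned older values). See ["Detecting Concurrent Writes"](#sec_replication_concurrent) for more details.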
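#### Catching up on missed writes {#sec_replication_read_repair}
The replication system should ensure that eventually all the data is copied to every replica. After an unavailable node comes back online, how does it catch up on the writes that it missed? Several mechanisms are used in Dynamo-style datastores:
Read repair
: When a client makes a read from several nodes in parallel, it can detect any stale responses. For example, in [Figure 6-12](#fig_replication_quorum_node_outage), user 2345 gets a version 6 value from replica 3 and a version 7 value from replicas 1 and 2. The client sees that replica 3 has a stale value and writes the newer value back to that replica. This approach works well for values that are frequently read.
Hinted handoff
: If one replica is unavailable, another replica may store writes on its behalf in the form of **hints**. When the replica that was supposed to receive those writes comes back, the replica storing the hints sends them to the recovered replica, and then deletes the hints. This **handoff** process helps bring replicas up-to-date even for values that are never read, and which are therefore not handled by read repair.
Anti-entropy
: In addition, there is a background process that periodically looks for differences in the data between replicas and copies any missing data from one replica to another. Unlike the replication log in leader-based replication, this **anti-entropy process** does not copy writes in any particular order, and there may be a significant delay before data is copied.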
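
A quorum read with read repair might look roughly like the following. This is a simplified sketch under several assumptions: each replica is modeled as an in-memory dictionary mapping a key to a `(version, value)` pair, an unavailable replica is modeled as `None`, and the function name `quorum_read` is invented for the example.

```python
from typing import Optional


def quorum_read(replicas: list[Optional[dict]], key: str, r: int) -> str:
    """Read a key from all reachable replicas, return the newest value, repair stale copies."""
    responses = [(replica, replica[key]) for replica in replicas
                 if replica is not None and key in replica]
    if len(responses) < r:
        raise RuntimeError(f"only {len(responses)} responses, need {r}")
    # Each response is a (version, value) pair; the highest version number wins.
    newest = max((versioned for _, versioned in responses), key=lambda v: v[0])
    # Read repair: write the newest value back to any replica that returned an older one.
    for replica, versioned in responses:
        if versioned[0] < newest[0]:
            replica[key] = newest
    return newest[1]


# Replica 3 missed the write of version 7 while it was unavailable.
replica_1 = {"profile": (7, "new picture")}
replica_2 = {"profile": (7, "new picture")}
replica_3 = {"profile": (6, "old picture")}
result = quorum_read([replica_1, replica_2, replica_3], "profile", r=2)
```

After this call, `result` is the version 7 value and `replica_3` has been brought up to date as a side effect of the read.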
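#### Quorums for reading and writing {#sec_replication_quorum_condition}
In the example of [Figure 6-12](#fig_replication_quorum_node_outage), we considered the write to be successful even though it was only processed on two out of three replicas. What if only one out of three replicas accepted the write? How far can we push this?
If we know that every successful write is guaranteed to be present on at least two out of three replicas, that means at most one replica can be stale. Thus, if we read from at least two replicas, we can be sure that at least one of the two is up to date. If the third replica is down or slow to respond, reads can nevertheless continue returning an up-to-date value.
More generally, if there are *n* replicas, every write must be confirmed by *w* nodes to be considered successful, and we must query at least *r* nodes for each read. (In our example, *n* = 3, *w* = 2, *r* = 2.) As long as *w* + *r* > *n*, we expect to get an up-to-date value when reading, because at least one of the *r* nodes we're reading from must be up to date. Reads and writes that obey these *r* and *w* values are called **quorum** reads and writes [^50]. You can think of *r* and *w* as the minimum number of votes required for the read or write to be valid.
In Dynamo-style databases, the parameters *n*, *w*, and *r* are typically configurable. A common choice is to make *n* an odd number (typically 3 or 5) and to set *w* = *r* = (*n* + 1) / 2 (rounded up). However, you can vary the numbers as you see fit. For example, a workload with few writes and many reads may benefit from setting *w* = *n* and *r* = 1. This makes reads faster, but has the disadvantage that just one failed node causes all database writes to fail.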
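
The claim that *w* + *r* > *n* forces the read set and the write set to overlap can be checked by brute force for small clusters. The helper names below are made up for this sketch; it enumerates every possible set of *w* acknowledging nodes and every possible set of *r* queried nodes.

```python
from itertools import combinations


def quorums_always_overlap(n: int, w: int, r: int) -> bool:
    """True if every possible write set of size w shares a node with every read set of size r."""
    nodes = range(n)
    return all(set(write_set) & set(read_set)
               for write_set in combinations(nodes, w)
               for read_set in combinations(nodes, r))


def tolerated_failures(n: int, w: int, r: int) -> int:
    """Number of unavailable nodes with which both reads and writes can still succeed."""
    return n - max(w, r)


# Exhaustively compare against the quorum condition for clusters of up to six nodes.
for n in range(1, 7):
    for w in range(1, n + 1):
        for r in range(1, n + 1):
            assert quorums_always_overlap(n, w, r) == (w + r > n)
```

For example, `tolerated_failures(3, 2, 2)` is 1 and `tolerated_failures(5, 3, 3)` is 2, while the read-optimized setting *w* = *n*, *r* = 1 tolerates no unavailable node for writes.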
--------
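> [!NOTE]
> There may be more than *n* nodes in the cluster, but any given value is stored only on *n* nodes. This allows the dataset to be sharded, supporting datasets that are larger than you can fit on one node. We will return to sharding in [Chapter 7](/tw/ch7#ch_sharding).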
--------
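The quorum condition, *w* + *r* > *n*, allows the system to tolerate unavailable nodes as follows:
* If *w* < *n*, we can still process writes if a node is unavailable.
* If *r* < *n*, we can still process reads if a node is unavailable.
* With *n* = 3, *w* = 2, *r* = 2 we can tolerate one unavailable node, like in [Figure 6-12](#fig_replication_quorum_node_outage).
* With *n* = 5, *w* = 3, *r* = 3 we can tolerate two unavailable nodes. This case is illustrated in [Figure 6-13](#fig_replication_quorum_overlap).
Normally, reads and writes are always sent to all *n* replicas in parallel. The parameters *w* and *r* determine how many nodes we wait for, i.e., how many of the *n* nodes need to report success before we consider the read or write to be successful.
{{< figure src="/fig/ddia_0613.png" id="fig_replication_quorum_overlap" caption="Figure 6-13. If *w* + *r* > *n*, at least one of the *r* replicas you read from must have seen the most recent successful write." class="w-full my-4" >}}
If fewer than the required *w* or *r* nodes are available, writes or reads return an error. A node could be unavailable for many reasons: because the node is down (crashed, powered down), due to an error executing the operation (can't write because the disk is full), due to a network interruption between the client and the node, or for any number of other reasons. We only care whether the node returned a successful response and don't need to distinguish between different kinds of fault.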
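### Limitations of Quorum Consistency {#sec_replication_quorum_limitations}
If you have *n* replicas, and you choose *w* and *r* such that *w* + *r* > *n*, you can generally expect every read to return the most recent value written for a key. This is the case because the set of nodes to which you've written and the set of nodes from which you've read must overlap. That is, among the nodes you read there must be at least one node with the latest value (illustrated in [Figure 6-13](#fig_replication_quorum_overlap)).
Often, *r* and *w* are chosen to be a majority (more than *n*/2) of nodes, because that ensures *w* + *r* > *n* while still tolerating up to *n*/2 (rounded down) node failures. But quorums are not necessarily majorities: it only matters that the sets of nodes used by the read and write operations overlap in at least one node. Other quorum assignments are possible, which allows some flexibility in the design of distributed algorithms [^51].
You may also set *w* and *r* to smaller numbers, so that *w* + *r* ≤ *n* (i.e., the quorum condition is not satisfied). In this case, reads and writes will still be sent to *n* nodes, but a smaller number of successful responses is required for the operation to succeed.
With a smaller *w* and *r* you are more likely to read stale values, because it's more likely that your read didn't include the node with the latest value. On the upside, this configuration allows lower latency and higher availability: if there is a network interruption and many replicas become unreachable, there's a higher chance that you can continue processing reads and writes. Only after the number of reachable replicas falls below *w* or *r* does the database become unavailable for writing or reading, respectively.
However, even with *w* + *r* > *n*, there are edge cases in which the consistency properties can be confusing. Some scenarios include:
* If a node carrying a new value fails, and its data is restored from a replica carrying an old value, the number of replicas storing the new value may fall below *w*, breaking the quorum condition.
* While a rebalancing is in progress, in which some data is moved from one node to another (see [Chapter 7](/tw/ch7#ch_sharding)), nodes may have inconsistent views of which nodes should be holding the *n* replicas for a particular value. This can result in the read and write quorums no longer overlapping.
* If a read is concurrent with a write operation, the read may or may not see the concurrently written value. In particular, it's possible for one read to see the new value, and a subsequent read to see the old value, as we shall see in ["Linearizability and quorums"](/tw/ch10#sec_consistency_quorum_linearizable).
* If a write succeeded on some replicas but failed on others (for example because the disks on some nodes are full), and overall succeeded on fewer than *w* replicas, it is not rolled back on the replicas where it succeeded. This means that if a write was reported as failed, subsequent reads may or may not return the value from that write [^52].
* If the database uses timestamps from a real-time clock to determine which write is newer (as Cassandra and ScyllaDB do), writes might be silently dropped if another node with a faster clock has written to the same key, an issue we previously saw in ["Last write wins (discarding concurrent writes)"](#sec_replication_lww). We will discuss this in more detail in ["Relying on Synchronized Clocks"](/tw/ch9#sec_distributed_clocks_relying).
* If two writes occur concurrently, one of them might be processed first on one replica, and the other might be processed first on another replica. This leads to a conflict, similarly to what we saw for multi-leader replication (see ["Dealing with Conflicting Writes"](#sec_replication_write_conflicts)). We will return to this topic in ["Detecting Concurrent Writes"](#sec_replication_concurrent).
Thus, although quorums appear to guarantee that a read returns the latest written value, in practice it is not so simple. Dynamo-style databases are generally optimized for use cases that can tolerate eventual consistency. The parameters *w* and *r* allow you to adjust the probability of stale values being read [^53], but it's wise to not take them as absolute guarantees.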
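#### Monitoring staleness {#monitoring-staleness}
From an operational perspective, it's important to monitor whether your databases are returning up-to-date results. Even if your application can tolerate stale reads, you need to be aware of the health of your replication. If it falls behind significantly, it should alert you so that you can investigate the cause (for example, a problem in the network or an overloaded node).
For leader-based replication, the database typically exposes metrics for the replication lag, which you can feed into a monitoring system. This is possible because writes are applied to the leader and to followers in the same order, and each node has a position in the replication log (the number of writes it has applied locally). By subtracting a follower's current position from the leader's current position, you can measure the amount of replication lag.
However, in systems with leaderless replication, there is no fixed order in which writes are applied, which makes monitoring more difficult. The number of hints that a replica stores for handoff can be one measure of system health, but it's difficult to interpret usefully [^54]. Eventual consistency is a deliberately vague guarantee, but for operability it's important to be able to quantify "eventual."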
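### Single-Leader Versus Leaderless Replication Performance {#sec_replication_leaderless_perf}
A replication system based on a single leader can provide strong consistency guarantees that are difficult or impossible to achieve in a leaderless system. However, as we have seen in ["Problems with Replication Lag"](#sec_replication_lag), reads in a leader-based replicated system can also return stale values if you make them on an asynchronously updated follower.
Reading from the leader ensures up-to-date responses, but it suffers from performance problems:
* Read throughput is limited by the leader's capacity to handle requests (in contrast with read scaling, which distributes reads across asynchronously updated replicas that may return stale values).
* If the leader fails, you have to wait for the fault to be detected, and for the failover to complete before you can continue handling requests. Even if the failover process is very quick, users will notice it because of the temporarily increased response times; if failover takes a long time, the system is unavailable for its duration.
* The system is very sensitive to performance problems on the leader: if the leader is slow to respond, e.g. due to overload or some resource contention, the increased response times immediately affect users as well.
A big advantage of a leaderless architecture is that it is more resilient against such issues. Because there is no failover, and requests go to multiple replicas in parallel anyway, one replica becoming slow or unavailable has little impact on response times: the client simply uses the responses from the replicas that are faster to respond. Using whichever response is fastest is called **request hedging**, and it can significantly reduce tail latency [^55].
At its core, the resilience of a leaderless system comes from the fact that it doesn't distinguish between the normal case and the failure case. This is especially helpful when handling so-called **gray failures**, in which a node isn't completely down, but is running in a degraded state where it is unusually slow to handle requests [^56], or when a node is simply overloaded (for example, if a node has been offline for a while, recovery via hinted handoff can cause a lot of additional load). A leader-based system has to decide whether the situation is bad enough to warrant a failover (which can itself cause further disruption), whereas in a leaderless system that question doesn't even arise.
That said, leaderless systems can have performance problems as well:
* Even though the system doesn't need to perform failover, one replica does need to detect when another replica is unavailable so that it can store hints about writes that the unavailable replica missed. When the unavailable replica comes back, the handoff process needs to send it those hints. This puts additional load on the replicas at a time when the system is already under strain [^54].
* The more replicas you have, the bigger your quorums are, and the more responses you have to wait for before a request can complete. Even if you wait only for the fastest *r* or *w* replicas to respond, and even if you make the requests in parallel, a bigger *r* or *w* increases the chance that you hit a slow replica, increasing the overall response time (see ["Use of Response Time Metrics"](/tw/ch2#sec_introduction_slo_sla)).
* A large-scale network interruption that disconnects a client from a large number of replicas can make it impossible to form a quorum. Some leaderless databases offer a configuration option that allows any reachable replica to accept writes, even if it's not one of the usual replicas for that key (Riak and Dynamo call this a **sloppy quorum** [^45]; Cassandra and ScyllaDB call it **consistency level ANY**). There is no guarantee that subsequent reads will see the written value, but depending on the application it may still be better than having the write fail.
Multi-leader replication can offer even greater resilience against network interruptions than leaderless replication, since reads and writes only require communication with one leader, which can be co-located with the client. However, since a write on one leader is propagated asynchronously to the others, reads can be arbitrarily out of date. Quorum reads and writes provide a compromise: good fault tolerance while also having a high likelihood of reading up-to-date data.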
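#### Multi-region operation {#multi-region-operation}
We previously discussed cross-region replication as a use case for multi-leader replication (see ["Multi-Leader Replication"](#sec_replication_multi_leader)). Leaderless replication is also suitable for multi-region operation, since it is designed to tolerate conflicting concurrent writes, network interruptions, and latency spikes.
Cassandra and ScyllaDB implement their multi-region support within the normal leaderless model: the client sends its writes directly to the replicas in all regions, and you can choose from a variety of consistency levels that determine how many responses are required for a request to be successful. For example, you can request a quorum across the replicas in all the regions, a separate quorum in each of the regions, or a quorum only in the client's local region. A local quorum avoids having to wait for slow requests to other regions, but it is also more likely to return stale results.
Riak keeps all communication between clients and database nodes local to one region, so *n* describes the number of replicas within one region. Cross-region replication between database clusters happens asynchronously in the background, in a style that is similar to multi-leader replication.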
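### Detecting Concurrent Writes {#sec_replication_concurrent}
Like multi-leader replication, leaderless databases allow concurrent writes to the same key, resulting in conflicts that need to be resolved. Such conflicts may be detected as the writes happen, but not always: they could also be detected later during read repair, hinted handoff, or anti-entropy.
The problem is that events may arrive in a different order at different nodes, due to variable network delays and partial failures. For example, [Figure 6-14](#fig_replication_concurrency) shows two clients, A and B, simultaneously writing to a key *X* in a three-node datastore:
* Node 1 receives the write from A, but never receives the write from B due to a transient outage.
* Node 2 first receives the write from A, then the write from B.
* Node 3 first receives the write from B, then the write from A.
{{< figure src="/fig/ddia_0614.png" id="fig_replication_concurrency" caption="Figure 6-14. Concurrent writes in a Dynamo-style datastore: there is no well-defined ordering." class="w-full my-4" >}}
If each node simply overwrote the value for a key whenever it received a write request from a client, the nodes would become permanently inconsistent, as shown by the final *get* request in [Figure 6-14](#fig_replication_concurrency): node 2 thinks that the final value of *X* is B, whereas the other nodes think that the value is A.
In order to become eventually consistent, the replicas should converge toward the same value. For this, we can use any of the conflict resolution mechanisms we previously discussed in ["Dealing with Conflicting Writes"](#sec_replication_write_conflicts), such as last write wins (used by Cassandra and ScyllaDB), manual resolution, or CRDTs (described in ["CRDTs and Operational Transformation"](#sec_replication_crdts), and used by Riak).
Last write wins is easy to implement: each write is tagged with a timestamp, and a value with a higher timestamp always overwrites a value with a lower timestamp. However, a timestamp doesn't tell you whether two values are actually conflicting (i.e., they were written concurrently) or not (they were written one after another). If you want to resolve conflicts explicitly, the system needs to take more care to detect concurrent writes.
#### The "happens-before" relation and concurrency {#sec_replication_happens_before}
How do we decide whether two operations are concurrent or not? To develop an intuition, let's look at some examples:
* In [Figure 6-8](#fig_replication_causality), the two writes are not concurrent: A's insert **happens before** B's increment, because the value incremented by B is the value inserted by A. In other words, B's operation builds upon A's operation, so B's operation must have happened later. We also say that B is **causally dependent** on A.
* On the other hand, the two writes in [Figure 6-14](#fig_replication_concurrency) are concurrent: when each client starts the operation, it does not know that another client is also performing an operation on the same key. Thus, there is no causal dependency between the operations.
An operation A **happens before** another operation B if B knows about A, or depends on A, or builds upon A in some way. Whether one operation happens before another operation is the key to defining what concurrency means. In fact, we can simply say that two operations are **concurrent** if neither happens before the other (i.e., neither knows about the other) [^57].
Thus, whenever you have two operations A and B, there are three possibilities: either A happened before B, or B happened before A, or A and B are concurrent. What we need is an algorithm to tell us whether two operations are concurrent or not. If one operation happened before another, the later operation should overwrite the earlier operation, but if the operations are concurrent, we have a conflict that needs to be resolved.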
--------
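> [!TIP] Concurrency, time, and relativity
>
> It may seem that two operations should be called concurrent if they occur "at the same time", but in fact it is not important whether they literally overlap in time. Because of problems with clocks in distributed systems, it is actually quite difficult to tell whether two things happened at exactly the same time, an issue we will discuss in more detail in [Chapter 9](/tw/ch9#ch_distributed).
>
> For defining concurrency, exact time doesn't matter: we simply call two operations concurrent if they are both unaware of each other, regardless of the physical time at which they occurred. People sometimes make a connection between this principle and the special theory of relativity in physics [^57], which introduced the idea that information cannot travel faster than the speed of light. Consequently, two events that occur some distance apart cannot possibly affect each other if the time between the events is shorter than the time it takes light to travel the distance between them.
>
> In computer systems, two operations might be concurrent even though the speed of light would in principle have allowed one operation to affect the other. For example, if the network was slow or interrupted at the time, two operations can occur some time apart and still be concurrent, because the network problems prevented one operation from being able to know about the other.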
--------
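#### Capturing the happens-before relationship {#capturing-the-happens-before-relationship}
Let's look at an algorithm that determines whether two operations are concurrent, or whether one happened before another. To keep things simple, let's start with a database that has only one replica. Once we have worked out how to do this on a single replica, we can generalize the approach to a leaderless database with multiple replicas.
[Figure 6-15](#fig_replication_causality_single) shows two clients concurrently adding items to the same shopping cart. (If that example strikes you as too boring, imagine instead two air traffic controllers concurrently adding aircraft to the sector they are tracking.) Initially, the cart is empty. Between them, the clients make five writes to the database:
1. Client 1 adds `milk` to the cart. This is the first write to that key, so the server successfully stores it and assigns it version 1. The server also echoes the value back to the client, along with the version number.
2. Client 2 adds `eggs` to the cart, not knowing that client 1 concurrently added `milk` (client 2 thought that its `eggs` were the only item in the cart). The server assigns version 2 to this write, and stores `eggs` and `milk` as two separate values (siblings). It then returns **both** values to the client, along with the version number 2.
3. Client 1, oblivious to client 2's write, wants to add `flour` to the cart, so it thinks the current cart contents should be `[milk, flour]`. It sends this value to the server, along with the version number 1 that the server gave client 1 previously. The server can tell from the version number that the write of `[milk, flour]` supersedes the prior value of `[milk]` but that it is concurrent with `[eggs]`. Thus, the server assigns version 3 to `[milk, flour]`, overwrites the version 1 value `[milk]`, but keeps the version 2 value `[eggs]` and returns both remaining values to the client.
4. Meanwhile, client 2 wants to add `ham` to the cart, unaware that client 1 just added `flour`. Client 2 received the two values `[milk]` and `[eggs]` from the server in the last response, so the client now merges those values and adds `ham` to form a new value, `[eggs, milk, ham]`. It sends that value to the server, along with the previous version number 2. The server detects that version 2 overwrites `[eggs]` but is concurrent with `[milk, flour]`, so the two remaining values are `[milk, flour]` with version 3 and `[eggs, milk, ham]` with version 4.
5. Finally, client 1 wants to add `bacon`. It previously received `[milk, flour]` and `[eggs]` from the server at version 3, so it merges those, adds `bacon`, and sends the final value `[milk, flour, eggs, bacon]` to the server, along with the version number 3. This overwrites `[milk, flour]` (note that `[eggs]` was already overwritten in the last step) but is concurrent with `[eggs, milk, ham]`, so the server keeps those two concurrent values.
{{< figure src="/fig/ddia_0615.png" id="fig_replication_causality_single" caption="Figure 6-15. Capturing causal dependencies between two clients concurrently editing a shopping cart." class="w-full my-4" >}}
The dataflow between the operations in [Figure 6-15](#fig_replication_causality_single) is illustrated graphically in [Figure 6-16](#fig_replication_causal_dependencies). The arrows indicate which operation **happened before** which other operation, in the sense that the later operation **knew about** or **depended on** the earlier one. In this example, the clients are never fully up to date with the data on the server, since there is always another operation going on concurrently. But old versions of the value do get overwritten eventually, and no writes are lost.
{{< figure link="#fig_replication_causality_single" src="/fig/ddia_0616.png" id="fig_replication_causal_dependencies" caption="Figure 6-16. Graph of causal dependencies in Figure 6-15." class="w-full my-4" >}}
Note that the server can determine whether two operations are concurrent by looking at the version numbers; it does not need to interpret the value itself (so the value could be any data structure). The algorithm works as follows:
* The server maintains a version number for every key, increments the version number every time that key is written, and stores the new version number along with the value written.
* When a client reads a key, the server returns all siblings, i.e., all values that have not been overwritten, as well as the latest version number. A client must read a key before writing.
* When a client writes a key, it must include the version number from the prior read, and it must merge together all values that it received in the prior read, e.g. using a CRDT or by asking the user. The response from a write request is like a read, returning all siblings, which allows us to chain several writes like in the shopping cart example.
* When the server receives a write with a particular version number, it can overwrite all values with that version number or below (since it knows that they have been merged into the new value), but it must keep all values with a higher version number (because those values are concurrent with the incoming write).
When a write includes the version number from a prior read, that tells us which previous state the write is based on. If you make a write without including a version number, it is concurrent with all other writes, so it will not overwrite anything; it will just be returned as one of the values on subsequent reads.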
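
The server side of this algorithm fits in a few lines. The class below is a hypothetical single-key, single-replica sketch (the names `VersionedRegister`, `write`, and `based_on` are invented here); a write that carries no version number is modeled as `based_on=0`, which is lower than every assigned version and therefore overwrites nothing.

```python
class VersionedRegister:
    """Server-side state for one key on a single replica."""

    def __init__(self):
        self.version = 0    # latest version number assigned for this key
        self.siblings = []  # list of (version, value) pairs not yet overwritten

    def read(self):
        return self.version, [value for _, value in self.siblings]

    def write(self, value, based_on: int = 0):
        self.version += 1
        # Values at or below `based_on` were merged into the new value by the client,
        # so they can be dropped; values with a higher version are concurrent and kept.
        self.siblings = [(v, val) for v, val in self.siblings if v > based_on]
        self.siblings.append((self.version, value))
        return self.read()  # a write responds like a read: version plus all siblings


# Replaying the five writes of the shopping cart example:
cart = VersionedRegister()
cart.write(["milk"])                                       # client 1, no prior read
cart.write(["eggs"])                                       # client 2, no prior read
cart.write(["milk", "flour"], based_on=1)                  # client 1
cart.write(["eggs", "milk", "ham"], based_on=2)            # client 2
cart.write(["milk", "flour", "eggs", "bacon"], based_on=3)  # client 1
```

After the fifth write the server holds exactly the two concurrent values from the walkthrough, `[eggs, milk, ham]` and `[milk, flour, eggs, bacon]`, and its version number is 5.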
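#### Version vectors {#version-vectors}
The example in [Figure 6-15](#fig_replication_causality_single) used only a single replica. How does the algorithm change when there are multiple replicas, but no leader?
[Figure 6-15](#fig_replication_causality_single) uses a single version number to capture dependencies between operations, but that is not sufficient when there are multiple replicas accepting writes concurrently. Instead, we need to use a version number **per replica** as well as per key. Each replica increments its own version number when processing a write, and also keeps track of the version numbers it has seen from each of the other replicas. This information indicates which values to overwrite and which values to keep as siblings.
The collection of version numbers from all the replicas is called a **version vector** [^58]. A few variants of this idea are in use; one of the more notable is the **dotted version vector** [^59] [^60], which is used in Riak 2.0 [^61] [^62]. We won't go into the details here, but the way it works is quite similar to what we saw in our cart example.
Like the version numbers in [Figure 6-15](#fig_replication_causality_single), version vectors are sent from the database replicas to clients when values are read, and need to be sent back to the database when a value is subsequently written. (Riak encodes the version vector as a string that it calls **causal context**.) The version vector allows the database to distinguish between overwrites and concurrent writes.
The version vector also ensures that it is safe to read from one replica and subsequently write back to another replica. Doing so may result in siblings being created, but no data is lost as long as siblings are merged correctly.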
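
The comparison that a version vector makes possible can be sketched as follows. This is the basic textbook comparison and not the dotted variant; representing a vector as a dictionary from replica name to counter, with missing entries counting as zero, is an assumption of this example.

```python
def compare(a: dict, b: dict) -> str:
    """Compare two version vectors (replica name -> counter; missing entries count as 0)."""
    replicas = a.keys() | b.keys()
    a_le_b = all(a.get(r, 0) <= b.get(r, 0) for r in replicas)
    b_le_a = all(b.get(r, 0) <= a.get(r, 0) for r in replicas)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"      # a happened before b, so b may overwrite a
    if b_le_a:
        return "after"       # b happened before a, so a may overwrite b
    return "concurrent"      # neither knows about the other: keep both as siblings


def merge(a: dict, b: dict) -> dict:
    """Version vector of a value that merges both inputs: the element-wise maximum."""
    return {r: max(a.get(r, 0), b.get(r, 0)) for r in a.keys() | b.keys()}
```

For instance, `compare({"r1": 2}, {"r1": 2, "r2": 1})` returns `"before"`, whereas `compare({"r1": 3}, {"r1": 2, "r2": 1})` returns `"concurrent"`, because each side has seen a write that the other has not.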
--------
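> [!TIP] Version vectors and vector clocks
>
> A **version vector** is sometimes also called a **vector clock**, even though they are not quite the same. The difference is subtle; please see the references for details [^60] [^63] [^64]. In brief, when comparing the state of replicas, version vectors are the right data structure to use.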
--------
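## Summary {#summary}
In this chapter we looked at the issue of replication. Replication can serve several purposes:
**High availability**
: Keeping the system running, even when one machine (or several machines, a zone, or even an entire region) goes down
**Disconnected operation**
: Allowing an application to continue working when there is a network interruption
**Latency**
: Placing data geographically close to users, so that users can interact with it faster
**Scalability**
: Being able to handle a higher volume of reads than a single machine could handle, by performing reads on replicas
Despite being a simple goal (keeping a copy of the same data on several machines), replication turns out to be a remarkably tricky problem. It requires carefully thinking about concurrency and about all the things that can go wrong, and dealing with the consequences of those faults. At a minimum, we need to deal with unavailable nodes and network interruptions (and that's not even considering the more insidious kinds of fault, such as silent data corruption due to software bugs or hardware errors).
We discussed three main approaches to replication:
**Single-leader replication**
: Clients send all writes to a single node (the leader), which sends a stream of data change events to the other replicas (followers). Reads can be performed on any replica, but reads from followers might be stale.
**Multi-leader replication**
: Clients send each write to one of several leader nodes, any of which can accept writes. The leaders send streams of data change events to each other and to any follower nodes.
**Leaderless replication**
: Clients send each write to several nodes, and read from several nodes in parallel in order to detect and correct nodes with stale data.
Each approach has advantages and disadvantages. Single-leader replication is popular because it is fairly easy to understand and it offers strong consistency. Multi-leader and leaderless replication can be more robust in the presence of faulty nodes, network interruptions, and latency spikes, at the cost of requiring conflict resolution and providing weaker consistency guarantees.
Replication can be synchronous or asynchronous, which has a profound effect on the system behavior when there is a fault. Although asynchronous replication can be fast when the system is running smoothly, it's important to figure out what happens when replication lag increases and servers fail. If a leader fails and you promote an asynchronously updated follower to be the new leader, recently committed data may be lost.
We looked at some strange effects that can be caused by replication lag, and we discussed a few consistency models which are helpful for deciding how an application should behave under replication lag:
**Read-after-write consistency**
: Users should always see data that they submitted themselves.
**Monotonic reads**
: After users have seen the data at one point in time, they shouldn't later see the data from some earlier point in time.
**Consistent prefix reads**
: Users should see the data in a state that makes causal sense: for example, seeing a question and its reply in the correct order.
Finally, we discussed how multi-leader and leaderless replication ensure that all replicas eventually converge to a consistent state: by using a version vector or a similar algorithm to detect which writes are concurrent, and by using a conflict resolution algorithm such as a CRDT to merge the concurrently written values. Last write wins and manual conflict resolution are also possible.
This chapter has assumed that every replica stores a full copy of the whole database, which is unrealistic for large datasets. In the next chapter we will look at **sharding**, which allows each machine to store only a subset of the data.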
### References
[^1]: B. G. Lindsay, P. G. Selinger, C. Galtieri, J. N. Gray, R. A. Lorie, T. G. Price, F. Putzolu, I. L. Traiger, and B. W. Wade. [Notes on Distributed Databases](https://dominoweb.draco.res.ibm.com/reports/RJ2571.pdf). IBM Research, Research Report RJ2571(33471), July 1979. Archived at [perma.cc/EPZ3-MHDD](https://perma.cc/EPZ3-MHDD)
[^2]: Kenny Gryp. [MySQL Terminology Updates](https://dev.mysql.com/blog-archive/mysql-terminology-updates/). *dev.mysql.com*, July 2020. Archived at [perma.cc/S62G-6RJ2](https://perma.cc/S62G-6RJ2)
[^3]: Oracle Corporation. [Oracle (Active) Data Guard 19c: Real-Time Data Protection and Availability](https://www.oracle.com/technetwork/database/availability/dg-adg-technical-overview-wp-5347548.pdf). White Paper, *oracle.com*, March 2019. Archived at [perma.cc/P5ST-RPKE](https://perma.cc/P5ST-RPKE)
[^4]: Microsoft. [What is an Always On availability group?](https://learn.microsoft.com/en-us/sql/database-engine/availability-groups/windows/overview-of-always-on-availability-groups-sql-server) *learn.microsoft.com*, September 2024. Archived at [perma.cc/ABH6-3MXF](https://perma.cc/ABH6-3MXF)
[^5]: Mostafa Elhemali, Niall Gallagher, Nicholas Gordon, Joseph Idziorek, Richard Krog, Colin Lazier, Erben Mo, Akhilesh Mritunjai, Somu Perianayagam, Tim Rath, Swami Sivasubramanian, James Christopher Sorenson III, Sroaj Sosothikul, Doug Terry, and Akshat Vig. [Amazon DynamoDB: A Scalable, Predictably Performant, and Fully Managed NoSQL Database Service](https://www.usenix.org/conference/atc22/presentation/elhemali). At *USENIX Annual Technical Conference* (ATC), July 2022.
[^6]: Rebecca Taft, Irfan Sharif, Andrei Matei, Nathan VanBenschoten, Jordan Lewis, Tobias Grieger, Kai Niemi, Andy Woods, Anne Birzin, Raphael Poss, Paul Bardea, Amruta Ranade, Ben Darnell, Bram Gruneir, Justin Jaffray, Lucy Zhang, and Peter Mattis. [CockroachDB: The Resilient Geo-Distributed SQL Database](https://dl.acm.org/doi/abs/10.1145/3318464.3386134). At *ACM SIGMOD International Conference on Management of Data* (SIGMOD), pages 1493–1509, June 2020. [doi:10.1145/3318464.3386134](https://doi.org/10.1145/3318464.3386134)
[^7]: Dongxu Huang, Qi Liu, Qiu Cui, Zhuhe Fang, Xiaoyu Ma, Fei Xu, Li Shen, Liu Tang, Yuxing Zhou, Menglong Huang, Wan Wei, Cong Liu, Jian Zhang, Jianjun Li, Xuelian Wu, Lingyu Song, Ruoxi Sun, Shuaipeng Yu, Lei Zhao, Nicholas Cameron, Liquan Pei, and Xin Tang. [TiDB: a Raft-based HTAP database](https://www.vldb.org/pvldb/vol13/p3072-huang.pdf). *Proceedings of the VLDB Endowment*, volume 13, issue 12, pages 3072–3084. [doi:10.14778/3415478.3415535](https://doi.org/10.14778/3415478.3415535)
[^8]: Mallory Knodel and Niels ten Oever. [Terminology, Power, and Inclusive Language in Internet-Drafts and RFCs](https://www.ietf.org/archive/id/draft-knodel-terminology-14.html). *IETF Internet-Draft*, August 2023. Archived at [perma.cc/5ZY9-725E](https://perma.cc/5ZY9-725E)
[^9]: Buck Hodges. [Postmortem: VSTS 4 September 2018](https://devblogs.microsoft.com/devopsservice/?p=17485). *devblogs.microsoft.com*, September 2018. Archived at [perma.cc/ZF5R-DYZS](https://perma.cc/ZF5R-DYZS)
[^10]: Gunnar Morling. [Leader Election With S3 Conditional Writes](https://www.morling.dev/blog/leader-election-with-s3-conditional-writes/). *www.morling.dev*, August 2024. Archived at [perma.cc/7V2N-J78Y](https://perma.cc/7V2N-J78Y)
[^11]: Vignesh Chandramohan, Rohan Desai, and Chris Riccomini. [SlateDB Manifest Design](https://github.com/slatedb/slatedb/blob/main/rfcs/0001-manifest.md). *github.com*, May 2024. Archived at [perma.cc/8EUY-P32Z](https://perma.cc/8EUY-P32Z)
[^12]: Stas Kelvich. [Why does Neon use Paxos instead of Raft, and what’s the difference?](https://neon.tech/blog/paxos) *neon.tech*, August 2022. Archived at [perma.cc/SEZ4-2GXU](https://perma.cc/SEZ4-2GXU)
[^13]: Dimitri Fontaine. [An introduction to the pg\_auto\_failover project](https://tapoueh.org/blog/2021/11/an-introduction-to-the-pg_auto_failover-project/). *tapoueh.org*, November 2021. Archived at [perma.cc/3WH5-6BAF](https://perma.cc/3WH5-6BAF)
[^14]: Jesse Newland. [GitHub availability this week](https://github.blog/news-insights/the-library/github-availability-this-week/). *github.blog*, September 2012. Archived at [perma.cc/3YRF-FTFJ](https://perma.cc/3YRF-FTFJ)
[^15]: Mark Imbriaco. [Downtime last Saturday](https://github.blog/news-insights/the-library/downtime-last-saturday/). *github.blog*, December 2012. Archived at [perma.cc/M7X5-E8SQ](https://perma.cc/M7X5-E8SQ)
[^16]: John Hugg. [‘All In’ with Determinism for Performance and Testing in Distributed Systems](https://www.youtube.com/watch?v=gJRj3vJL4wE). At *Strange Loop*, September 2015.
[^17]: Hironobu Suzuki. [The Internals of PostgreSQL](https://www.interdb.jp/pg/). *interdb.jp*, 2017.
[^18]: Amit Kapila. [WAL Internals of PostgreSQL](https://www.pgcon.org/2012/schedule/attachments/258_212_Internals%20Of%20PostgreSQL%20Wal.pdf). At *PostgreSQL Conference* (PGCon), May 2012. Archived at [perma.cc/6225-3SUX](https://perma.cc/6225-3SUX)
[^19]: Amit Kapila. [Evolution of Logical Replication](https://amitkapila16.blogspot.com/2023/09/evolution-of-logical-replication.html). *amitkapila16.blogspot.com*, September 2023. Archived at [perma.cc/F9VX-JLER](https://perma.cc/F9VX-JLER)
[^20]: Aru Petchimuthu. [Upgrade your Amazon RDS for PostgreSQL or Amazon Aurora PostgreSQL database, Part 2: Using the pglogical extension](https://aws.amazon.com/blogs/database/part-2-upgrade-your-amazon-rds-for-postgresql-database-using-the-pglogical-extension/). *aws.amazon.com*, August 2021. Archived at [perma.cc/RXT8-FS2T](https://perma.cc/RXT8-FS2T)
[^21]: Yogeshwer Sharma, Philippe Ajoux, Petchean Ang, David Callies, Abhishek Choudhary, Laurent Demailly, Thomas Fersch, Liat Atsmon Guz, Andrzej Kotulski, Sachin Kulkarni, Sanjeev Kumar, Harry Li, Jun Li, Evgeniy Makeev, Kowshik Prakasam, Robbert van Renesse, Sabyasachi Roy, Pratyush Seth, Yee Jiun Song, Benjamin Wester, Kaushik Veeraraghavan, and Peter Xie. [Wormhole: Reliable Pub-Sub to Support Geo-Replicated Internet Services](https://www.usenix.org/system/files/conference/nsdi15/nsdi15-paper-sharma.pdf). At *12th USENIX Symposium on Networked Systems Design and Implementation* (NSDI), May 2015.
[^22]: Douglas B. Terry. [Replicated Data Consistency Explained Through Baseball](https://www.microsoft.com/en-us/research/publication/replicated-data-consistency-explained-through-baseball/). Microsoft Research, Technical Report MSR-TR-2011-137, October 2011. Archived at [perma.cc/F4KZ-AR38](https://perma.cc/F4KZ-AR38)
[^23]: Douglas B. Terry, Alan J. Demers, Karin Petersen, Mike J. Spreitzer, Marvin M. Theimer, and Brent B. Welch. [Session Guarantees for Weakly Consistent Replicated Data](https://csis.pace.edu/~marchese/CS865/Papers/SessionGuaranteesPDIS.pdf). At *3rd International Conference on Parallel and Distributed Information Systems* (PDIS), September 1994. [doi:10.1109/PDIS.1994.331722](https://doi.org/10.1109/PDIS.1994.331722)
[^24]: Werner Vogels. [Eventually Consistent](https://queue.acm.org/detail.cfm?id=1466448). *ACM Queue*, volume 6, issue 6, pages 14–19, October 2008. [doi:10.1145/1466443.1466448](https://doi.org/10.1145/1466443.1466448)
[^25]: Simon Willison. [Reply to: “My thoughts about Fly.io (so far) and other newish technology I’m getting into”](https://news.ycombinator.com/item?id=31434055). *news.ycombinator.com*, May 2022. Archived at [perma.cc/ZRV4-WWV8](https://perma.cc/ZRV4-WWV8)
[^26]: Nithin Tharakan. [Scaling Bitbucket’s Database](https://www.atlassian.com/blog/bitbucket/scaling-bitbuckets-database). *atlassian.com*, October 2020. Archived at [perma.cc/JAB7-9FGX](https://perma.cc/JAB7-9FGX)
[^27]: Terry Pratchett. *Reaper Man: A Discworld Novel*. Victor Gollancz, 1991. ISBN: 978-0-575-04979-6
[^28]: Peter Bailis, Alan Fekete, Michael J. Franklin, Ali Ghodsi, Joseph M. Hellerstein, and Ion Stoica. [Coordination Avoidance in Database Systems](https://arxiv.org/abs/1402.2237). *Proceedings of the VLDB Endowment*, volume 8, issue 3, pages 185–196, November 2014. [doi:10.14778/2735508.2735509](https://doi.org/10.14778/2735508.2735509)
[^29]: Yaser Raja and Peter Celentano. [PostgreSQL bi-directional replication using pglogical](https://aws.amazon.com/blogs/database/postgresql-bi-directional-replication-using-pglogical/). *aws.amazon.com*, January 2022. Archived at
[^30]: Robert Hodges. [If You \*Must\* Deploy Multi-Master Replication, Read This First](https://scale-out-blog.blogspot.com/2012/04/if-you-must-deploy-multi-master.html). *scale-out-blog.blogspot.com*, April 2012. Archived at [perma.cc/C2JN-F6Y8](https://perma.cc/C2JN-F6Y8)
[^31]: Lars Hofhansl. [HBASE-7709: Infinite Loop Possible in Master/Master Replication](https://issues.apache.org/jira/browse/HBASE-7709). *issues.apache.org*, January 2013. Archived at [perma.cc/24G2-8NLC](https://perma.cc/24G2-8NLC)
[^32]: John Day-Richter. [What’s Different About the New Google Docs: Making Collaboration Fast](https://drive.googleblog.com/2010/09/whats-different-about-new-google-docs.html). *drive.googleblog.com*, September 2010. Archived at [perma.cc/5TL8-TSJ2](https://perma.cc/5TL8-TSJ2)
[^33]: Evan Wallace. [How Figma’s multiplayer technology works](https://www.figma.com/blog/how-figmas-multiplayer-technology-works/). *figma.com*, October 2019. Archived at [perma.cc/L49H-LY4D](https://perma.cc/L49H-LY4D)
[^34]: Tuomas Artman. [Scaling the Linear Sync Engine](https://linear.app/blog/scaling-the-linear-sync-engine). *linear.app*, June 2023.
[^35]: Amr Saafan. [Why Sync Engines Might Be the Future of Web Applications](https://www.nilebits.com/blog/2024/09/sync-engines-future-web-applications/). *nilebits.com*, September 2024. Archived at [perma.cc/5N73-5M3V](https://perma.cc/5N73-5M3V)
[^36]: Isaac Hagoel. [Are Sync Engines The Future of Web Applications?](https://dev.to/isaachagoel/are-sync-engines-the-future-of-web-applications-1bbi) *dev.to*, July 2024. Archived at [perma.cc/R9HF-BKKL](https://perma.cc/R9HF-BKKL)
[^37]: Sujay Jayakar. [A Map of Sync](https://stack.convex.dev/a-map-of-sync). *stack.convex.dev*, October 2024. Archived at [perma.cc/82R3-H42A](https://perma.cc/82R3-H42A)
[^38]: Alex Feyerke. [Designing Offline-First Web Apps](https://alistapart.com/article/offline-first/). *alistapart.com*, December 2013. Archived at [perma.cc/WH7R-S2DS](https://perma.cc/WH7R-S2DS)
[^39]: Martin Kleppmann, Adam Wiggins, Peter van Hardenberg, and Mark McGranaghan. [Local-first software: You own your data, in spite of the cloud](https://www.inkandswitch.com/local-first/). At *ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software* (Onward!), October 2019, pages 154–178. [doi:10.1145/3359591.3359737](https://doi.org/10.1145/3359591.3359737)
[^40]: Martin Kleppmann. [The past, present, and future of local-first](https://martin.kleppmann.com/2024/05/30/local-first-conference.html). At *Local-First Conference*, May 2024.
[^41]: Conrad Hofmeyr. [API Calling is to Sync Engines as jQuery is to React](https://www.powersync.com/blog/api-calling-is-to-sync-engines-as-jquery-is-to-react). *powersync.com*, November 2024. Archived at [perma.cc/2FP9-7WJJ](https://perma.cc/2FP9-7WJJ)
[^42]: Peter van Hardenberg and Martin Kleppmann. [PushPin: Towards Production-Quality Peer-to-Peer Collaboration](https://martin.kleppmann.com/papers/pushpin-papoc20.pdf). At *7th Workshop on Principles and Practice of Consistency for Distributed Data* (PaPoC), April 2020. [doi:10.1145/3380787.3393683](https://doi.org/10.1145/3380787.3393683)
[^43]: Leonard Kawell, Jr., Steven Beckhardt, Timothy Halvorsen, Raymond Ozzie, and Irene Greif. [Replicated document management in a group communication system](https://dl.acm.org/doi/pdf/10.1145/62266.1024798). At *ACM Conference on Computer-Supported Cooperative Work* (CSCW), September 1988. [doi:10.1145/62266.1024798](https://doi.org/10.1145/62266.1024798)
[^44]: Ricky Pusch. [Explaining how fighting games use delay-based and rollback netcode](https://words.infil.net/w02-netcode.html). *words.infil.net* and *arstechnica.com*, October 2019. Archived at [perma.cc/DE7W-RDJ8](https://perma.cc/DE7W-RDJ8)
[^45]: Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. [Dynamo: Amazon’s Highly Available Key-Value Store](https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf). At *21st ACM Symposium on Operating Systems Principles* (SOSP), October 2007. [doi:10.1145/1323293.1294281](https://doi.org/10.1145/1323293.1294281)
[^46]: Marc Shapiro, Nuno Preguiça, Carlos Baquero, and Marek Zawirski. [A Comprehensive Study of Convergent and Commutative Replicated Data Types](https://inria.hal.science/inria-00555588v1/document). INRIA Research Report no. 7506, January 2011.
[^47]: Chengzheng Sun and Clarence Ellis. [Operational Transformation in Real-Time Group Editors: Issues, Algorithms, and Achievements](https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=aef660812c5a9c4d3f06775f9455eeb090a4ff0f). At *ACM Conference on Computer Supported Cooperative Work* (CSCW), November 1998. [doi:10.1145/289444.289469](https://doi.org/10.1145/289444.289469)
[^48]: Joseph Gentle and Martin Kleppmann. [Collaborative Text Editing with Eg-walker: Better, Faster, Smaller](https://arxiv.org/abs/2409.14252). At *20th European Conference on Computer Systems* (EuroSys), March 2025. [doi:10.1145/3689031.3696076](https://doi.org/10.1145/3689031.3696076)
[^49]: Dharma Shukla. [Azure Cosmos DB: Pushing the frontier of globally distributed databases](https://azure.microsoft.com/en-us/blog/azure-cosmos-db-pushing-the-frontier-of-globally-distributed-databases/). *azure.microsoft.com*, September 2018. Archived at [perma.cc/UT3B-HH6R](https://perma.cc/UT3B-HH6R)
[^50]: David K. Gifford. [Weighted Voting for Replicated Data](https://www.cs.cmu.edu/~15-749/READINGS/required/availability/gifford79.pdf). At *7th ACM Symposium on Operating Systems Principles* (SOSP), December 1979. [doi:10.1145/800215.806583](https://doi.org/10.1145/800215.806583)
[^51]: Heidi Howard, Dahlia Malkhi, and Alexander Spiegelman. [Flexible Paxos: Quorum Intersection Revisited](https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.OPODIS.2016.25). At *20th International Conference on Principles of Distributed Systems* (OPODIS), December 2016. [doi:10.4230/LIPIcs.OPODIS.2016.25](https://doi.org/10.4230/LIPIcs.OPODIS.2016.25)
[^52]: Joseph Blomstedt. [Bringing Consistency to Riak](https://vimeo.com/51973001). At *RICON West*, October 2012.
[^53]: Peter Bailis, Shivaram Venkataraman, Michael J. Franklin, Joseph M. Hellerstein, and Ion Stoica. [Quantifying eventual consistency with PBS](http://www.bailis.org/papers/pbs-vldbj2014.pdf). *The VLDB Journal*, volume 23, pages 279–302, April 2014. [doi:10.1007/s00778-013-0330-1](https://doi.org/10.1007/s00778-013-0330-1)
[^54]: Colin Breck. [Shared-Nothing Architectures for Server Replication and Synchronization](https://blog.colinbreck.com/shared-nothing-architectures-for-server-replication-and-synchronization/). *blog.colinbreck.com*, December 2019. Archived at [perma.cc/48P3-J6CJ](https://perma.cc/48P3-J6CJ)
[^55]: Jeffrey Dean and Luiz André Barroso. [The Tail at Scale](https://cacm.acm.org/research/the-tail-at-scale/). *Communications of the ACM*, volume 56, issue 2, pages 74–80, February 2013. [doi:10.1145/2408776.2408794](https://doi.org/10.1145/2408776.2408794)
[^56]: Peng Huang, Chuanxiong Guo, Lidong Zhou, Jacob R. Lorch, Yingnong Dang, Murali Chintalapati, and Randolph Yao. [Gray Failure: The Achilles’ Heel of Cloud-Scale Systems](https://www.microsoft.com/en-us/research/wp-content/uploads/2017/06/paper-1.pdf). At *16th Workshop on Hot Topics in Operating Systems* (HotOS), May 2017. [doi:10.1145/3102980.3103005](https://doi.org/10.1145/3102980.3103005)
[^57]: Leslie Lamport. [Time, Clocks, and the Ordering of Events in a Distributed System](https://www.microsoft.com/en-us/research/publication/time-clocks-ordering-events-distributed-system/). *Communications of the ACM*, volume 21, issue 7, pages 558–565, July 1978. [doi:10.1145/359545.359563](https://doi.org/10.1145/359545.359563)
[^58]: D. Stott Parker Jr., Gerald J. Popek, Gerard Rudisin, Allen Stoughton, Bruce J. Walker, Evelyn Walton, Johanna M. Chow, David Edwards, Stephen Kiser, and Charles Kline. [Detection of Mutual Inconsistency in Distributed Systems](https://pages.cs.wisc.edu/~remzi/Classes/739/Papers/parker83detection.pdf). *IEEE Transactions on Software Engineering*, volume SE-9, issue 3, pages 240–247, May 1983. [doi:10.1109/TSE.1983.236733](https://doi.org/10.1109/TSE.1983.236733)
[^59]: Nuno Preguiça, Carlos Baquero, Paulo Sérgio Almeida, Victor Fonte, and Ricardo Gonçalves. [Dotted Version Vectors: Logical Clocks for Optimistic Replication](https://arxiv.org/abs/1011.5808). arXiv:1011.5808, November 2010.
[^60]: Giridhar Manepalli. [Clocks and Causality - Ordering Events in Distributed Systems](https://www.exhypothesi.com/clocks-and-causality/). *exhypothesi.com*, November 2022. Archived at [perma.cc/8REU-KVLQ](https://perma.cc/8REU-KVLQ)
[^61]: Sean Cribbs. [A Brief History of Time in Riak](https://speakerdeck.com/seancribbs/a-brief-history-of-time-in-riak). At *RICON*, October 2014. Archived at [perma.cc/7U9P-6JFX](https://perma.cc/7U9P-6JFX)
[^62]: Russell Brown. [Vector Clocks Revisited Part 2: Dotted Version Vectors](https://riak.com/posts/technical/vector-clocks-revisited-part-2-dotted-version-vectors/). *riak.com*, November 2015. Archived at [perma.cc/96QP-W98R](https://perma.cc/96QP-W98R)
[^63]: Carlos Baquero. [Version Vectors Are Not Vector Clocks](https://haslab.wordpress.com/2011/07/08/version-vectors-are-not-vector-clocks/). *haslab.wordpress.com*, July 2011. Archived at [perma.cc/7PNU-4AMG](https://perma.cc/7PNU-4AMG)
[^64]: Reinhard Schwarz and Friedemann Mattern. [Detecting Causal Relationships in Distributed Computations: In Search of the Holy Grail](https://disco.ethz.ch/courses/hs08/seminar/papers/mattern4.pdf). *Distributed Computing*, volume 7, issue 3, pages 149–174, March 1994. [doi:10.1007/BF02277859](https://doi.org/10.1007/BF02277859)
================================================
FILE: content/tw/ch7.md
================================================
---
title: "7. Sharding"
weight: 207
breadcrumbs: false
---

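> *Clearly, we must break away from the sequential and not limit the computers. We must state definitions and provide for priorities and descriptions of data. We must state relationships, not procedures.*
>
> Grace Murray Hopper, *Management and the Computer of the Future* (1962)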
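A distributed database typically distributes data across nodes in two ways:
1. Keeping a copy of the same data on multiple nodes: this is *replication*, which we discussed in [Chapter 6](/tw/ch6#ch_replication).
2. If we don't want every node to store all the data, we can split a large amount of data into smaller *shards* or *partitions*, and store different shards on different nodes. We discuss sharding in this chapter.
Normally, shards are defined in such a way that each piece of data (each record, row, or document) belongs to exactly one shard. There are various ways of achieving this, which we discuss in depth in this chapter. In effect, each shard is a small database of its own, although some database systems support operations that touch multiple shards at the same time.
Sharding is usually combined with replication so that copies of each shard are stored on multiple nodes. This means that, even though each record belongs to exactly one shard, it may still be stored on several different nodes for fault tolerance.
A node may store more than one shard. If a single-leader replication model is used, for example, the combination of sharding and replication can look like [Figure 7-1](#fig_sharding_replicas). Each shard's leader is assigned to one node, and its followers are assigned to other nodes. Each node may be the leader for some shards and a follower for others, but each shard still has only one leader.
{{< figure src="/fig/ddia_0701.png" id="fig_sharding_replicas" caption="Figure 7-1. Combining replication and sharding: each node acts as leader for some shards and follower for other shards." class="w-full my-4" >}}
Everything we discussed in [Chapter 6](/tw/ch6#ch_replication) about replication of databases applies equally to replication of shards. Since the choice of sharding scheme is mostly independent of the choice of replication scheme, we ignore replication in this chapter for the sake of simplicity.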
--------
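> [!TIP] Sharding and Partitioning
What we call a *shard* in this chapter has many different names depending on which software you're using: it's called a *partition* in Kafka, a *range* in CockroachDB, a *region* in HBase and TiDB, a *tablet* in Bigtable and YugabyteDB, a *vnode* in Cassandra, ScyllaDB, and Riak, and a *vBucket* in Couchbase, to name just a few.
Some databases treat partitions and shards as two distinct concepts. For example, in PostgreSQL, partitioning is a way of splitting a large table into several files that are stored on the same machine (which has several advantages, such as making it very fast to delete an entire partition), whereas sharding splits a dataset across multiple machines [^1] [^2]. In many other systems, partitioning is just another word for sharding.
While *partitioning* is quite descriptive, the term *sharding* is perhaps surprising. According to one theory, the term arose from the online role-playing game *Ultima Online*, in which a magic crystal was shattered into pieces, and each of those shards refracted a copy of the game world [^3]. The term *shard* thus came to mean one of a set of parallel game servers, and was later carried over to databases. Another theory is that *shard* was originally an acronym for *System for Highly Available Replicated Data*, reportedly a 1980s database whose details are lost to history.
By the way, partitioning has nothing to do with *network partitions* (netsplits), a type of fault in the network between nodes. We discuss such faults in [Chapter 9](/tw/ch9#ch_distributed).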
--------
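## Pros and Cons of Sharding {#sec_sharding_reasons}
The primary reason for sharding a database is *scalability*: it is a solution when the volume of data or the write throughput has become too great for a single node to handle, because it lets you spread that data and those writes across multiple nodes. (If read throughput is the problem, you don't necessarily need sharding: you can use *read scaling* as discussed in [Chapter 6](/tw/ch6#ch_replication).)
In fact, sharding is one of the main tools we have for achieving *horizontal scaling* (a *scale-out* architecture), as discussed in ["Shared-Memory, Shared-Disk, and Shared-Nothing Architecture"](/tw/ch2#sec_introduction_shared_nothing): that is, allowing a system to grow its capacity not by moving to a bigger machine, but by adding more (smaller) machines. If you can divide the workload such that each shard handles a roughly equal share, you can assign those shards to different machines in order to process their data and queries in parallel.
While replication is useful at both small and large scale, because it enables fault tolerance and offline operation, sharding is a heavyweight solution that is mostly relevant at large scale. If your data volume and write throughput can be handled on a single machine (and a single machine can do a lot nowadays!), it is often better to avoid sharding and stick with a single-shard database.
The reason for this recommendation is that sharding often adds complexity: you typically have to decide which records to put in which shard by choosing a *partition key*; all records with the same partition key are placed in the same shard [^4]. This choice matters because accessing a record is fast if you know which shard it is in, but if you don't know the shard you have to do an inefficient search across all shards, and the sharding scheme is difficult to change.
Thus, sharding often works well for key-value data, where you can easily shard by key, but it is harder with relational data, where you may want to search by a secondary index or join records that may be distributed across different shards. We discuss this further in ["Sharding and Secondary Indexes"](#sec_sharding_secondary_indexes).
Another problem with sharding is that a write may need to update related records in several different shards. While transactions on a single node are quite common (see [Chapter 8](/tw/ch8#ch_transactions)), ensuring consistency across multiple shards requires a *distributed transaction*. As we shall see in [Chapter 8](/tw/ch8#ch_transactions), distributed transactions are available in some databases, but they are usually much slower than single-node transactions, may become a bottleneck for the system as a whole, and some systems don't support them at all.
Some systems use sharding even on a single machine, typically running one single-threaded process per CPU core to make use of the parallelism in the CPU, or to take advantage of a *nonuniform memory access* (NUMA) architecture in which some banks of memory are closer to one CPU than to others [^5]. For example, Redis, VoltDB, and FoundationDB use one process per core, and rely on sharding to spread load across CPU cores in the same machine [^6].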
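### Sharding for Multitenancy {#sec_sharding_multitenancy}
Software as a Service (SaaS) products and cloud services are often *multitenant*, where each tenant is a customer. Multiple users may have logins on the same tenant, but each tenant has a self-contained dataset that is separate from other tenants. For example, in an email marketing service, each business that signs up is typically a separate tenant, since one business's newsletter signups, delivery data, and so on are separate from those of other businesses.
Sometimes sharding is used to implement multitenant systems: either each tenant is given a separate shard, or multiple small tenants are grouped together into a larger shard. These shards might be physically separate databases (which we touched on in ["Embedded storage engines"](/tw/ch4#sidebar_embedded)), or separately manageable portions of a larger logical database [^7]. Using sharding for multitenancy has several advantages:
Resource isolation
: If one tenant performs a computationally expensive operation, other tenants' performance is less likely to be affected if they are running on different shards.
Permission isolation
: If there is a bug in your access control logic, you are less likely to accidentally expose one tenant's data to another tenant if the tenants' datasets are stored physically separately from each other.
Cell-based architecture
: You can apply sharding not only at the data storage level, but also to the services running your application code. In a *cell-based architecture*, the services and storage for a particular set of tenants are grouped into a self-contained *cell*, and different cells are set up so that they can run largely independently of each other. This approach provides *fault isolation*: that is, a fault in one cell remains limited to that cell, and tenants in other cells are not affected [^8].
Per-tenant backup and restore
: Backing up each tenant's shard separately makes it possible to restore a tenant's state from a backup without affecting other tenants, which is useful if the tenant accidentally deletes or overwrites important data [^9].
Regulatory compliance
: Data privacy regulation such as the GDPR gives individuals the right to access and delete all data stored about them. If each person's data is stored in a separate shard, this translates into simple data export and deletion operations on their shard [^10].
Data residency
: If a particular tenant's data needs to be stored in a particular jurisdiction in order to comply with data residency laws, a region-aware database can let you assign that tenant's shard to a particular region.
Gradual schema rollout
: Schema migrations (previously discussed in ["Schema flexibility in the document model"](/tw/ch3#sec_datamodels_schema_flexibility)) can be rolled out gradually, one tenant at a time. This reduces risk, as you can detect problems before they affect all tenants, but it can be difficult to do transactionally [^11].
The main challenges around using sharding for multitenancy are:
* It assumes that each individual tenant is small enough to fit on a single node. If that is not the case, and you have a single tenant that is too big for one machine, you need to additionally perform sharding within that tenant, which brings us back to the topic of sharding for scalability [^12].
* If you have many small tenants, creating a separate shard for each one may incur too much overhead. You could group several small tenants together into a bigger shard, but then you have the problem of how to move tenants from one shard to another as they grow.
* If you ever need to support features that connect data across multiple tenants, these become much harder to implement when they require joins across multiple shards.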
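## Sharding of Key-Value Data {#sec_sharding_key_value}
Say you have a large amount of data, and you want to shard it. How do you decide which records to store on which nodes?
Our goal with sharding is to spread the data and the query load evenly across nodes. If every node takes a fair share, then in theory 10 nodes should be able to handle 10 times as much data and 10 times the read and write throughput of a single node (ignoring replication). Moreover, if we add or remove a node, we want to be able to *rebalance* the load so that it is evenly distributed across the 11 nodes (when adding) or the remaining 9 nodes (when removing).
If the sharding is unfair, so that some shards have more data or queries than others, we call it *skewed*. Skew makes sharding much less effective. In an extreme case, all the load could end up on one shard, so 9 out of 10 nodes are idle and the bottleneck is the single busy node. A shard with disproportionately high load is called a *hot shard* or *hot spot*. If one key has a particularly high load (e.g., a celebrity in a social network), we call it a *hot key*.
Therefore we need an algorithm that takes the partition key of a record as input and tells us which shard that record is in. In a key-value store the partition key is usually the key, or the first part of the key. In a relational model the partition key might be some column of a table (not necessarily its primary key). That algorithm needs to be amenable to rebalancing in order to relieve hot spots.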
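### Sharding by Key Range {#sec_sharding_key_range}
One way of sharding is to assign a contiguous range of partition keys (from some minimum to some maximum) to each shard, like the volumes of a paper encyclopedia, as illustrated in [Figure 7-2](#fig_sharding_encyclopedia). In this example, an entry's partition key is its title. If you want to look up the entry for a particular title, you can easily determine which shard contains it by finding the volume whose key range contains the title you're looking for, and thus pick the correct book off the shelf.
{{< figure src="/fig/ddia_0702.png" id="fig_sharding_encyclopedia" caption="Figure 7-2. A print encyclopedia is sharded by key range." class="w-full my-4" >}}
The ranges of keys are not necessarily evenly spaced, because your data may not be evenly distributed. For example, in [Figure 7-2](#fig_sharding_encyclopedia), volume 1 contains words starting with A and B, but volume 12 contains words starting with T, U, V, W, X, Y, and Z. Simply having one volume per two letters of the alphabet would lead to some volumes being much bigger than others. In order to distribute the data evenly, the shard boundaries need to adapt to the data.
The shard boundaries might be chosen manually by an administrator, or the database can choose them automatically. Manual key-range sharding is used by Vitess (a sharding layer for MySQL), for example; the automatic variant is used by Bigtable, its open source equivalent HBase, the range-based sharding option in MongoDB, CockroachDB, RethinkDB, and FoundationDB [^6]. YugabyteDB offers both manual and automatic tablet splitting.
Within each shard, keys are stored in sorted order (e.g., in a B-tree or SSTables, as discussed in [Chapter 4](/tw/ch4#ch_storage)). This has the advantage that range scans are easy, and you can treat the key as a concatenated index in order to fetch several related records in one query (see ["Multidimensional and Full-Text Indexes"](/tw/ch4#sec_storage_multidimensional)). For example, consider an application that stores data from a network of sensors, where the key is the timestamp of the measurement. Range scans are very useful in this case, because they let you easily fetch, say, all the readings from a particular month.
A downside of key-range sharding is that you can easily get a hot shard if there are a lot of writes to nearby keys. For example, if the key is a timestamp, then the shards correspond to ranges of time, e.g., one shard per month. Unfortunately, if you write sensor data to the database as the measurements happen, all the writes end up going to the same shard (the one for this month), so that shard can be overloaded with writes while others sit idle [^13].
To avoid this problem in the sensor database, you need to use something other than the timestamp as the first element of the key. For example, you could prefix each timestamp with the sensor ID so that the key ordering is first by sensor ID and then by timestamp. Assuming you have many sensors active at the same time, the write load ends up more evenly spread across the shards. The downside is that when you want to fetch the values of multiple sensors within a time range, you now need to perform a separate range query for each sensor.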
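
The lookup described above can be sketched as a binary search over sorted shard boundaries. This is a minimal illustration, not how any particular database implements it: the boundary values, the `/` separator, and the names `LOWER_BOUNDS`, `shard_for_key`, and `sensor_key` are all hypothetical.

```python
import bisect

# Hypothetical shard boundaries: shard i holds keys in
# [LOWER_BOUNDS[i], LOWER_BOUNDS[i + 1]); the last shard is open-ended.
LOWER_BOUNDS = ["", "c", "f", "m", "t"]


def shard_for_key(key: str) -> int:
    """Return the index of the shard whose key range contains `key`."""
    return bisect.bisect_right(LOWER_BOUNDS, key) - 1


def sensor_key(sensor_id: str, timestamp: str) -> str:
    """Compound key: ordered first by sensor ID, then by timestamp."""
    return f"{sensor_id}/{timestamp}"


# Timestamp-only keys written at about the same time land in the same shard...
print(shard_for_key("2024-05-01T12:00:00"), shard_for_key("2024-05-01T12:00:01"))
# ...whereas prefixing the sensor ID spreads concurrent writes across shards.
print(shard_for_key(sensor_key("a17", "2024-05-01T12:00:00")),
      shard_for_key(sensor_key("s03", "2024-05-01T12:00:00")))
```

With the compound key, a range scan for one sensor still touches a single contiguous key range, but a time-range query across sensors needs one scan per sensor prefix.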
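#### Rebalancing Key-Range Sharded Data {#rebalancing-key-range-sharded-data}
When you first set up your database, there are no key ranges to split into shards. Some databases, such as HBase and MongoDB, allow you to configure an initial set of shards on an empty database, which is called *pre-splitting*. This requires that you already have some idea of what the key distribution is going to look like, so that you can choose appropriate key range boundaries [^14].
Later on, as your data volume and write throughput grow, a system with key-range sharding grows by splitting an existing shard into two or more smaller shards, each of which holds a contiguous sub-range of the original shard's key range. The resulting smaller shards can then be distributed across multiple nodes. If a lot of data is deleted, you may also need to merge several adjacent shards that have become small into one bigger shard. This process is similar to what happens at the top level of a B-tree (see ["B-Trees"](/tw/ch4#sec_storage_b_trees)).
With databases that manage shard boundaries automatically, a shard split is typically triggered by:
* the shard reaching a configured size (for example, on HBase, the default is 10 GB), or
* in some systems, the write throughput being persistently above some threshold. Thus, a hot shard may be split even if it is not storing a lot of data, so that its write load can be distributed more uniformly.
An advantage of key-range sharding is that the number of shards adapts to the data volume. If there is only a small amount of data, a small number of shards is sufficient, so overheads are small; if there is a huge amount of data, the size of each individual shard is limited to a configurable maximum [^15].
A downside of this approach is that splitting a shard is an expensive operation, since it requires all of its data to be rewritten into new files, similarly to a compaction in a log-structured storage engine. A shard that needs splitting is often also one that is under high load, and the cost of splitting can exacerbate that load, risking overload.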
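### Sharding by Hash of Key {#sec_sharding_hash}
Key-range sharding is useful if you want records with nearby (but different) partition keys to be grouped into the same shard; for example, this might be the case with timestamps. If you don't care whether partition keys are near each other (e.g., if they are tenant IDs in a multitenant application), a common approach is to first hash the partition key before mapping it to a shard.
A good hash function takes skewed data and makes it uniformly distributed. Say you have a 32-bit hash function that takes a string. Whenever you give it a new string, it returns a seemingly random number between 0 and 2³² − 1. Even if the input strings are very similar, their hashes are evenly distributed across that range of numbers (but the same input always produces the same output).
For sharding purposes, the hash function need not be cryptographically strong: for example, MongoDB uses MD5, whereas Cassandra and ScyllaDB use Murmur3. Many programming languages have simple hash functions built in (as they are used for hash tables), but they may not be suitable for sharding: for example, in Java's `Object.hashCode()` and Ruby's `Object#hash`, the same key may have a different hash value in different processes, making them unsuitable for sharding [^16].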
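
As a small illustration of the point about process-stable hashes, the sketch below derives a 32-bit value from MD5 using Python's standard `hashlib`. The helper name `stable_hash32` is hypothetical, and MD5 is used only because the text mentions it; any deterministic, well-distributed hash would serve. Python's built-in `hash()` on strings has the same problem as the Java and Ruby examples, because it is salted per process by default.

```python
import hashlib


def stable_hash32(key: str) -> int:
    """32-bit hash that is the same in every process and on every machine."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")  # first 4 bytes -> 0 .. 2**32 - 1


# Similar inputs produce unrelated-looking hash values,
# but the same input always gives the same output.
for key in ("2024-05-01T12:00:00", "2024-05-01T12:00:01"):
    print(key, stable_hash32(key))
```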
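#### Hash Modulo Number of Nodes {#hash-modulo-number-of-nodes}
Once you have hashed the key, how do you choose which shard to store it in? Maybe your first thought is to take the hash value *modulo* the number of nodes in the system (using the `%` operator in many programming languages). For example, *hash*(*key*) % 10 would return a number between 0 and 9 (if we write the hash as a decimal number, hash % 10 would be the last digit). If we have 10 nodes, numbered 0 to 9, that seems like an easy way of assigning each key to a node.
The problem with the *mod N* approach is that if the number of nodes *N* changes, most of the keys have to be moved from one node to another. [Figure 7-3](#fig_sharding_hash_mod_n) shows what happens when you have three nodes and add a fourth. Before the rebalancing, node 0 stored the keys whose hashes are 0, 3, 6, 9, and so on. After adding the fourth node, the key with hash 3 has moved to node 3, the key with hash 6 has moved to node 2, the key with hash 9 has moved to node 1, and so on.
{{< figure src="/fig/ddia_0703.png" id="fig_sharding_hash_mod_n" caption="Figure 7-3. Assigning keys to nodes by hashing the key and taking it modulo the number of nodes. Changing the number of nodes results in many keys moving from one node to another." class="w-full my-4" >}}
The *mod N* function is easy to compute, but it leads to very inefficient rebalancing because there is a lot of unnecessary movement of records from one node to another. We need an approach that doesn't move data around more than necessary.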
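
To put a number on "most of the keys", the following sketch counts how many hash values change node when going from three nodes to four. It assumes hash values are uniformly spread, modelled here simply as the integers 0 to 11,999; the function name `moved_fraction` is hypothetical.

```python
def moved_fraction(hashes, old_n: int, new_n: int) -> float:
    """Fraction of hash values whose node changes when N goes old_n -> new_n."""
    hashes = list(hashes)
    moved = sum(1 for h in hashes if h % old_n != h % new_n)
    return moved / len(hashes)


print(moved_fraction(range(12_000), 3, 4))  # 0.75
```

Three quarters of the keys move, whereas an ideal scheme would move only the quarter that ends up on the new node.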
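#### Fixed Number of Shards {#fixed-number-of-shards}
One simple but widely used solution is to create many more shards than there are nodes, and to assign several shards to each node. For example, a database running on a cluster of 10 nodes may be split into 1,000 shards from the outset, so that 100 shards are assigned to each node. A key is then stored in shard number *hash*(*key*) % 1,000, and the system separately keeps track of which shard is stored on which node.
Now, if a node is added to the cluster, the system can reassign some of the shards from existing nodes to the new node until they are fairly distributed once again. This process is illustrated in [Figure 7-4](#fig_sharding_rebalance_fixed). If a node is removed from the cluster, the same happens in reverse.
{{< figure src="/fig/ddia_0704.png" id="fig_sharding_rebalance_fixed" caption="Figure 7-4. Adding a new node to a database cluster with multiple shards per node." class="w-full my-4" >}}
In this model, only entire shards are moved between nodes, which is cheaper than splitting shards. The number of shards does not change, nor does the assignment of keys to shards. The only thing that changes is the assignment of shards to nodes. This change of assignment is not immediate (it takes some time to transfer a large amount of data over the network), so the old assignment of shards is used for any reads and writes that happen while the transfer is in progress.
It is common to choose the number of shards to be a number that is divisible by many factors, so that the dataset can be evenly split across various different numbers of nodes, without requiring the number of nodes to be a power of 2, for example [^4]. You can even account for mismatched hardware in your cluster: by assigning more shards to nodes that are more powerful, you can make those nodes take a greater share of the load.
This approach to sharding is used in Citus (a sharding layer for PostgreSQL), Riak, Elasticsearch, and Couchbase, among others. It works well as long as you have a good estimate of how many shards you will need when you first create the database. You can then add or remove nodes easily, subject to the limitation that you can't have more nodes than you have shards.
If you find the originally configured number of shards to be wrong (for example, if you have reached a scale where you need more nodes than you have shards), then an expensive resharding operation is required. It needs to split each shard and write it out to new files, using a lot of additional disk space in the process. Some systems don't allow resharding while concurrently writing to the database, which makes it difficult to change the number of shards without downtime.
Choosing the right number of shards is difficult if the total size of the dataset is highly variable (for example, if it starts small but may grow much larger over time). Since each shard contains a fixed fraction of the total data, the size of each shard grows proportionally to the total amount of data in the cluster. If shards are very large, rebalancing and recovery from node failures become expensive; but if shards are too small, they incur too much management overhead. The best performance is achieved when the size of shards is "just right," which can be hard to achieve if the number of shards is fixed but the dataset size varies.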
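
The rebalancing step of the fixed-shard scheme can be sketched as follows. This is only an illustration under simplifying assumptions: the greedy "take from the fullest node" policy, the names `shard_for_hash` and `add_node`, and the in-memory `assignment` dictionary are hypothetical, and real systems also have to copy the data and switch over safely.

```python
NUM_SHARDS = 1_000  # fixed when the database is created


def shard_for_hash(h: int) -> int:
    """Key-to-shard mapping; never changes when nodes come and go."""
    return h % NUM_SHARDS


def add_node(assignment: dict[int, int], new_node: int) -> list[int]:
    """Move whole shards to new_node until it holds a fair share.

    `assignment` maps shard number -> node number and is updated in place.
    Returns the list of shards that were moved.
    """
    nodes = set(assignment.values()) | {new_node}
    fair_share = len(assignment) // len(nodes)
    per_node: dict[int, list[int]] = {n: [] for n in nodes}
    for shard, node in assignment.items():
        per_node[node].append(shard)
    moved = []
    while len(per_node[new_node]) < fair_share:
        # Take one shard from whichever node currently holds the most.
        donor = max(per_node, key=lambda n: len(per_node[n]))
        shard = per_node[donor].pop()
        per_node[new_node].append(shard)
        assignment[shard] = new_node
        moved.append(shard)
    return moved


assignment = {s: s % 10 for s in range(NUM_SHARDS)}  # 10 nodes, 100 shards each
moved = add_node(assignment, new_node=10)
print(len(moved))  # 90 of the 1,000 shards move; the rest stay where they were
```

Only the shard-to-node table changes; `shard_for_hash` gives the same answer before and after, which is why no key needs to be rehashed.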
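#### Sharding by Hash Range {#sharding-by-hash-range}
If the required number of shards can't be predicted in advance, it is better to use a scheme in which the number of shards can adapt easily to the workload. The aforementioned key-range sharding scheme has this property, but it risks hot spots when there are a lot of writes to nearby keys. One solution is to combine key-range sharding with a hash function so that each shard contains a range of *hash values* rather than a range of *keys*.
[Figure 7-5](#fig_sharding_hash_range) shows an example using a 16-bit hash function that returns a number between 0 and 65,535 = 2¹⁶ − 1 (in reality, the hash is usually 32 bits or more). Even if the input keys are very similar (e.g., consecutive timestamps), their hashes are uniformly distributed across that range. We can then assign a range of hash values to each shard: for example, values between 0 and 16,383 to shard 0, values between 16,384 and 32,767 to shard 1, and so on.
{{< figure src="/fig/ddia_0705.png" id="fig_sharding_hash_range" caption="Figure 7-5. Assigning a contiguous range of hash values to each shard." class="w-full my-4" >}}
As with key-range sharding, a shard in hash-range sharding can be split when it becomes too big or too heavily loaded. This is still an expensive operation, but it can happen as needed, so the number of shards adapts to the volume of data rather than being fixed in advance.
The downside compared to key-range sharding is that range queries over the partition key are not efficient, as keys in the range are now scattered across all the shards. However, if keys consist of two or more columns, and the partition key is only the first of these columns, you can still perform efficient range queries over the second and later columns: as long as all records in the range query have the same partition key, they will be in the same shard.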
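
A minimal sketch of hash-range sharding with the 16-bit example above: shards are contiguous ranges of hash values, and a shard is split by inserting a new boundary. The class `HashRangeShards`, the MD5-derived `hash16`, and the midpoint split policy are assumptions made for illustration; real systems choose split points based on data size or load.

```python
import bisect
import hashlib


def hash16(key: str) -> int:
    """16-bit hash (0..65535) derived from MD5; illustrative only."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:2], "big")


class HashRangeShards:
    def __init__(self, lower_bounds: list[int]):
        # Shard i covers hashes in [lower_bounds[i], lower_bounds[i + 1]);
        # the first bound must be 0 and the last shard ends at 2**16.
        self.lower_bounds = sorted(lower_bounds)

    def shard_index(self, key: str) -> int:
        return bisect.bisect_right(self.lower_bounds, hash16(key)) - 1

    def split(self, index: int) -> None:
        """Split shard `index` at the midpoint of its hash range."""
        lo = self.lower_bounds[index]
        hi = (self.lower_bounds[index + 1]
              if index + 1 < len(self.lower_bounds) else 2**16)
        self.lower_bounds.insert(index + 1, (lo + hi) // 2)


shards = HashRangeShards([0, 16_384, 32_768, 49_152])
shards.split(0)             # shard 0 became too big or too hot
print(shards.lower_bounds)  # [0, 8192, 16384, 32768, 49152]
```

Keys outside the split shard keep the same hash range, so only the data in the split shard has to be rewritten.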
--------
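> [!TIP] Partitioning and Range Queries in Data Warehouses
Data warehouses such as BigQuery, Snowflake, and Delta Lake support a similar indexing approach, though the terminology differs. In BigQuery, for example, the partition key determines which partition a record resides in, while "cluster columns" determine how records are sorted within the partition. Snowflake assigns records to "micro-partitions" automatically, but allows users to define cluster keys for a table. Delta Lake supports both manual and automatic partition assignment, and supports cluster keys. Clustering data not only improves range scan performance, but can improve compression and filtering performance as well.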
--------
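Hash-range sharding is used in YugabyteDB and DynamoDB [^17], and is an option in MongoDB. Cassandra and ScyllaDB use a variant of this approach, illustrated in [Figure 7-6](#fig_sharding_cassandra): the space of hash values is split into a number of ranges proportional to the number of nodes (3 ranges per node in [Figure 7-6](#fig_sharding_cassandra), but the actual numbers are 8 per node in Cassandra by default, and 256 per node in ScyllaDB), with random boundaries between those ranges. This means some ranges are bigger than others, but by having multiple ranges per node, those imbalances tend to even out [^15] [^18].
{{< figure src="/fig/ddia_0706.png" id="fig_sharding_cassandra" caption="Figure 7-6. Cassandra and ScyllaDB split the range of possible hash values (here 0-1023) into contiguous ranges with random boundaries, and assign several ranges to each node." class="w-full my-4" >}}
When nodes are added or removed, range boundaries are added and removed, and shards are split or merged accordingly [^19]. In the example of [Figure 7-6](#fig_sharding_cassandra), when node 3 is added, node 1 transfers parts of two of its ranges to node 3, and node 2 transfers part of one of its ranges to node 3. This gives the new node an approximately fair share of the dataset, without transferring more data than necessary from one node to another.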
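#### Consistent Hashing {#sec_sharding_consistent_hashing}
A *consistent hashing* algorithm is a hash function that maps keys to a specified number of shards in a way that satisfies two properties:
1. the number of keys mapped to each shard is roughly equal, and
2. when the number of shards changes, as few keys as possible are moved from one shard to another.
Note that *consistent* here has nothing to do with replica consistency (see [Chapter 6](/tw/ch6#ch_replication)) or ACID consistency (see [Chapter 8](/tw/ch8#ch_transactions)), but rather describes the tendency of a key to stay in the same shard as much as possible.
The sharding algorithm used by Cassandra and ScyllaDB is similar to the original definition of consistent hashing [^20], but several other consistent hashing algorithms have also been proposed [^21], such as *highest random weight*, also known as *rendezvous hashing* [^22], and *jump consistent hash* [^23]. With Cassandra's algorithm, if one node is added, a small number of existing shards are split into sub-ranges; on the other hand, with rendezvous and jump consistent hashes, the new node is assigned individual keys that were previously scattered across all of the other nodes. Which one is preferable depends on the application.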
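
As an illustration of the two properties, here is a minimal sketch of rendezvous (highest random weight) hashing: each key goes to the node for which a hash of the (node, key) pair is largest. The choice of SHA-256, the node names, and the helper names `weight` and `rendezvous_node` are assumptions made for this example, not taken from any particular system.

```python
import hashlib


def weight(node: str, key: str) -> int:
    """Pseudorandom weight for a (node, key) pair, stable across processes."""
    data = f"{node}\x00{key}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def rendezvous_node(key: str, nodes: list[str]) -> str:
    """Pick the node with the highest weight for this key."""
    return max(nodes, key=lambda node: weight(node, key))


nodes = ["node0", "node1", "node2"]
keys = [f"user{i}" for i in range(10_000)]
before = {k: rendezvous_node(k, nodes) for k in keys}
after = {k: rendezvous_node(k, nodes + ["node3"]) for k in keys}
moved = [k for k in keys if before[k] != after[k]]

# Every key that moved went to the new node; none moved between old nodes.
print(all(after[k] == "node3" for k in moved))
# Roughly a quarter of the keys move, which is the new node's fair share.
print(len(moved) / len(keys))
```

A key changes node only if the new node's weight beats its previous maximum, which is why keys are taken individually from all existing nodes rather than from a few split ranges.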
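### Skewed Workloads and Relieving Hot Spots {#sec_sharding_skew}
Consistent hashing ensures that keys are distributed roughly uniformly across nodes, but that doesn't mean the actual load is uniformly distributed. If the workload is highly skewed (that is, the amount of data under some partition keys is much greater than under others, or the request rate for some keys is much higher than for others), you can still end up with some servers being overloaded while others sit almost idle.
For example, on a social media site, a celebrity user with millions of followers may cause a storm of activity when they do something [^24]. This event can result in a large volume of reads and writes to the same key (where the partition key is perhaps the user ID of the celebrity, or the ID of the action that people are commenting on).
In such situations, a more flexible sharding policy is required [^25] [^26]. A system that defines shards based on ranges of keys (or ranges of hashes) makes it possible to put an individual hot key in a shard of its own, and perhaps even to assign it a dedicated machine [^27].
It is also possible to compensate for skew at the application level. For example, if one key is known to be very hot, a simple technique is to add a random number to the beginning or end of the key. Just a two-digit decimal random number would split the writes to the key evenly across 100 different keys, allowing those keys to be distributed to different shards.
However, having split the writes across different keys, any reads now have to do additional work, as they have to read the data from all 100 keys and combine it. The volume of reads to each shard of the hot key is not reduced; only the write load is split. This technique also requires additional bookkeeping: it only makes sense to append the random number for the small number of hot keys; for the vast majority of keys with low write throughput this would be unnecessary overhead. Thus, you also need some way of keeping track of which keys are being split, and a process for converting a regular key into a specially managed hot key.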
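
The application-level technique just described might look roughly like this. It is a sketch under stated assumptions: the in-memory `store` stands in for the sharded database, `hot_keys` is a hypothetical registry of keys being split, and the `#NN` suffix format is made up for the example.

```python
import random
from collections import defaultdict

FANOUT = 100                     # two decimal digits -> 100 sub-keys
hot_keys = {"celebrity:42"}      # hypothetical registry of keys being split
store: dict[str, list[str]] = defaultdict(list)  # stand-in for the database


def write(key: str, value: str) -> None:
    if key in hot_keys:
        # Spread writes over sub-keys such as "celebrity:42#07".
        key = f"{key}#{random.randrange(FANOUT):02d}"
    store[key].append(value)


def read(key: str) -> list[str]:
    if key not in hot_keys:
        return list(store.get(key, []))
    # Hot key: gather from all 100 sub-keys and combine.
    values = []
    for i in range(FANOUT):
        values.extend(store.get(f"{key}#{i:02d}", []))
    return values


for n in range(1_000):
    write("celebrity:42", f"comment-{n}")
write("user:7", "hello")
print(len(read("celebrity:42")), read("user:7"))
```

Every read of the hot key issues 100 lookups, which is the extra read-side work the text warns about; ordinary keys are unaffected.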
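The problem is further compounded by load changing over time: for example, a particular social media post that has gone viral may experience high load for a couple of days, but thereafter it is likely to calm down again. Moreover, some keys may be hot for writes while others are hot for reads, necessitating different strategies for handling them.
Some systems (especially cloud services designed for large scale) have automated approaches for dealing with hot shards; for example, Amazon calls it *heat management* [^28] or *adaptive capacity* [^17]. The details of how these systems work go beyond the scope of this book.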
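### Operations: Automatic or Manual Rebalancing {#sec_sharding_operations}
There is one important question with regard to rebalancing that we have glossed over: does the splitting of shards and rebalancing happen automatically or manually?
Some systems automatically decide when to split shards and when to move them from one node to another, without any human interaction, while others leave sharding to be explicitly configured by an administrator. There is also a middle ground: for example, Couchbase and Riak generate a suggested shard assignment automatically, but require an administrator to commit it before it takes effect.
Fully automated rebalancing can be convenient, because there is less operational work to do for normal maintenance, and such systems can even auto-scale to adapt to changes in workload. Cloud databases such as DynamoDB are promoted as being able to automatically add and remove shards to adapt to big increases or decreases of load within a matter of minutes [^17] [^29].
However, automatic shard management can also be unpredictable. Rebalancing is an expensive operation, because it requires rerouting requests and moving a large amount of data from one node to another. If it is not done carefully, this process can overload the network or the nodes, and it might harm the performance of other requests. The system must continue processing writes while the rebalancing is in progress; if a system is near its maximum write throughput, the shard-splitting process might not even be able to keep up with the rate of incoming writes [^29].
Such automation can be dangerous in combination with automatic failure detection. For example, say one node is overloaded and is temporarily slow to respond to requests. The other nodes conclude that the overloaded node is dead, and automatically rebalance the cluster to move load away from it. This puts additional load on other nodes and the network, making the situation worse. There is a risk of causing a cascading failure in which other nodes become overloaded and are also falsely suspected of being down.
For that reason, it can be a good thing to have a human in the loop for rebalancing. It is slower than a fully automatic process, but it can help prevent operational surprises.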
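## Request Routing {#sec_sharding_routing}
We have discussed how to shard a dataset across multiple nodes, and how to rebalance those shards as nodes are added or removed. Now let's move on to the question: if you want to read or write a particular key, how do you know which node (that is, which IP address and port number) you need to connect to?
We call this problem *request routing*, and it is very similar to *service discovery*, which we previously discussed in ["Load balancers, service discovery, and service meshes"](/tw/ch5#sec_encoding_service_discovery). The biggest difference between the two is that with services running application code, each instance is usually stateless, and a load balancer can send a request to any of the instances. With sharded databases, a request for a key can only be handled by a node that is a replica for the shard containing that key.
This means that request routing has to be aware of the assignment from keys to shards, and from shards to nodes. On a high level, there are a few different approaches to this problem (illustrated in [Figure 7-7](#fig_sharding_routing)):
1. Allow clients to contact any node (e.g., via a round-robin load balancer). If that node happens to own the shard to which the request applies, it can handle the request directly; otherwise, it forwards the request to the appropriate node, receives the reply, and passes the reply along to the client.
2. Send all requests from clients to a routing tier first, which determines the node that should handle each request and forwards it accordingly. This routing tier does not itself handle any requests; it only acts as a shard-aware load balancer.
3. Require that clients be aware of the sharding and the assignment of shards to nodes. In this case, a client can connect directly to the appropriate node, without any intermediary.
{{< figure src="/fig/ddia_0707.png" id="fig_sharding_routing" caption="Figure 7-7. Three different ways of routing a request to the right node." class="w-full my-4" >}}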
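In all cases, there are some key problems:
* Who decides which shard should live on which node? It is simplest to have a single coordinator making that decision, but in that case how do you make it fault-tolerant in case the node running the coordinator goes down? And if the coordinator role can fail over to another node, how do you prevent a split-brain situation (see ["Handling Node Outages"](/tw/ch6#sec_replication_failover)) in which two different coordinators make contradictory shard assignments?
* How does the component performing the routing (which may be one of the nodes, the routing tier, or the client) learn about changes in the assignment of shards to nodes?
* While a shard is being moved from one node to another, there is a cutover period during which the new node has taken over, but requests to the old node may still be in flight. How do you handle those?
Many distributed data systems rely on a separate coordination service such as ZooKeeper or etcd to keep track of shard assignments, as illustrated in [Figure 7-8](#fig_sharding_zookeeper). They use consensus algorithms (see [Chapter 10](/tw/ch10#ch_consistency)) to provide fault tolerance and protection against split brain. Each node registers itself in ZooKeeper, and ZooKeeper maintains the authoritative mapping of shards to nodes. Other actors, such as the routing tier or the shard-aware client, can subscribe to this information in ZooKeeper. Whenever a shard changes ownership, or a node is added or removed, ZooKeeper notifies the routing tier so that it can keep its routing information up to date.
{{< figure src="/fig/ddia_0708.png" id="fig_sharding_zookeeper" caption="Figure 7-8. Using ZooKeeper to keep track of assignment of shards to nodes." class="w-full my-4" >}}
For example, HBase and SolrCloud use ZooKeeper to manage shard assignment, and Kubernetes uses etcd to keep track of which service instance is running where. MongoDB has a similar architecture, but it relies on its own *config server* implementation and *mongos* daemons as the routing tier. Kafka, YugabyteDB, and TiDB use built-in implementations of the Raft consensus protocol to perform this coordination function.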
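
The subscribe-and-notify pattern between a coordination service and a routing tier can be sketched with a purely in-memory stand-in. Nothing here is the ZooKeeper or etcd client API: `AssignmentRegistry`, `RoutingTier`, the addresses, and the `hash % num_shards` key-to-shard rule are all hypothetical, and the sketch ignores fault tolerance and the cutover period discussed above.

```python
from typing import Callable


class AssignmentRegistry:
    """In-memory stand-in for a coordination service (not a real client API)."""

    def __init__(self, shard_to_node: dict[int, str]):
        self._shard_to_node = dict(shard_to_node)
        self._subscribers: list[Callable[[dict[int, str]], None]] = []

    def subscribe(self, callback: Callable[[dict[int, str]], None]) -> None:
        self._subscribers.append(callback)
        callback(dict(self._shard_to_node))   # deliver the current state

    def reassign(self, shard: int, node: str) -> None:
        self._shard_to_node[shard] = node
        for callback in self._subscribers:    # notify on every change
            callback(dict(self._shard_to_node))


class RoutingTier:
    """Shard-aware load balancer: maps a key's hash to the owning node."""

    def __init__(self, registry: AssignmentRegistry, num_shards: int):
        self.num_shards = num_shards
        self.routes: dict[int, str] = {}
        registry.subscribe(self._on_change)

    def _on_change(self, shard_to_node: dict[int, str]) -> None:
        self.routes = shard_to_node

    def node_for(self, key_hash: int) -> str:
        return self.routes[key_hash % self.num_shards]


registry = AssignmentRegistry({0: "10.0.0.1:5432", 1: "10.0.0.2:5432"})
router = RoutingTier(registry, num_shards=2)
print(router.node_for(7))                 # 10.0.0.2:5432
registry.reassign(1, "10.0.0.3:5432")     # shard 1 moves to another node
print(router.node_for(7))                 # 10.0.0.3:5432
```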
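Cassandra, ScyllaDB, and Riak take a different approach: they use a *gossip protocol* among the nodes to disseminate any changes in cluster state. This provides much weaker consistency than a consensus protocol; it is possible to have split brain, in which different parts of the cluster have different node assignments for the same shard. Leaderless databases can tolerate this because they generally make weak consistency guarantees anyway (see ["Limitations of Quorum Consistency"](/tw/ch6#sec_replication_quorum_limitations)).
When using a routing tier or when sending requests to a random node, clients still need to find the IP addresses to connect to. These are not as fast-changing as the assignment of shards to nodes, so it is often sufficient to use DNS for this purpose.
This discussion of request routing has focused on finding the shard for an individual key, which is most relevant for sharded OLTP databases. Analytic databases often use sharding too, but they typically have a very different kind of query execution: rather than executing in a single shard, a query typically needs to aggregate and join data from many different shards in parallel. We discuss techniques for such parallel query execution in ["JOIN and GROUP BY"](/tw/ch11#sec_batch_join).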
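## Sharding and Secondary Indexes {#sec_sharding_secondary_indexes}
The sharding schemes we have discussed so far rely on the client knowing the partition key for any record it wants to access. This is most easily done in a key-value data model, where the partition key is the first part of the primary key (or the entire primary key), and so we can use the partition key to determine the shard, and thus route reads and writes to the node that is responsible for that key.
The situation becomes more complicated if secondary indexes are involved (see also ["Multi-Column and Secondary Indexes"](/tw/ch4#sec_storage_index_multicolumn)). A secondary index usually doesn't identify a record uniquely but rather is a way of searching for occurrences of a particular value: find all actions by user `123`, find all articles containing the word `hogwash`, find all cars whose color is `red`, and so on.
Key-value stores often don't have secondary indexes, but they are the bread and butter of relational databases, they are common in document databases too, and they are the *raison d'être* of full-text search engines such as Solr and Elasticsearch. The problem with secondary indexes is that they don't map neatly to shards. There are two main approaches to sharding a database with secondary indexes: local and global indexes.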
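### Local Secondary Indexes {#id166}
For example, imagine you are operating a website for selling used cars (illustrated in [Figure 7-9](#fig_sharding_local_secondary)). Each listing has a unique ID (call it the document ID), and you shard the database using that ID as the partition key (for example, IDs 0 to 499 in shard 0, IDs 500 to 999 in shard 1, etc.).
If you want to let users search for cars, allowing them to filter by color and by make, you need a secondary index on `color` and `make` (in a document database these would be fields; in a relational database they would be columns). If you have declared the index, the database can maintain it automatically. For example, whenever a red car is added to the database, the shard it lives in automatically adds its ID to the list of document IDs for the index entry `color:red`. As discussed in [Chapter 4](/tw/ch4#ch_storage), that list of IDs is also called a *postings list*.
{{< figure src="/fig/ddia_0709.png" id="fig_sharding_local_secondary" caption="Figure 7-9. Local secondary indexes: each shard indexes only the records within its own shard." class="w-full my-4" >}}
> [!WARNING] Warning
If your database only supports a key-value model, you might be tempted to implement a secondary index yourself by creating a mapping from values to document IDs in application code. If you go down this route, you need to take great care to ensure your indexes remain consistent with the underlying data. Race conditions and intermittent write failures (where some changes were saved but others weren't) can very easily cause the data to go out of sync; see ["The need for multi-object transactions"](/tw/ch8#sec_transactions_need).
--------
In this indexing approach, each shard is completely separate: each shard maintains its own secondary indexes, covering only the documents in that shard. It doesn't care what data is stored in other shards. Whenever you write to the database (to add, remove, or update a record), you only need to deal with the shard that contains the document ID that you are writing. For that reason, this type of secondary index is known as a *local index*. In an information retrieval context it is also known as a *document-partitioned index* [^30].
When reading from a local secondary index, if you already know the partition key of the record you're looking for, you can just perform the search on the appropriate shard. Moreover, if you only want *some* results, and you don't need all of them, you can send the request to any shard.
However, if you want all the results and don't know their partition keys in advance, you need to send the query to all shards and combine the results you get back, because the matching records might be scattered across all the shards. In [Figure 7-9](#fig_sharding_local_secondary), red cars appear in both shard 0 and shard 1.
This approach to querying a sharded database is sometimes known as *scatter/gather*, and it can make read queries on secondary indexes quite expensive. Even if you query the shards in parallel, scatter/gather is prone to tail latency amplification (see ["Use of Response Time Metrics"](/tw/ch2#sec_introduction_slo_sla)). It also limits the scalability of your application: adding more shards lets you store more data, but it doesn't increase your query throughput if every shard has to process every query anyway.
Nevertheless, local secondary indexes are widely used [^31]: for example, MongoDB, Riak, Cassandra [^32], Elasticsearch [^33], SolrCloud, and VoltDB [^34] all use local secondary indexes.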
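
The used-car example can be sketched as follows, with each shard keeping its own postings lists and a query fanning out to all shards. The `Shard` class, the helper names, and the sample listings are hypothetical, and the sketch only handles inserts (updates and deletes would also have to remove stale index entries).

```python
from collections import defaultdict


class Shard:
    def __init__(self):
        self.docs: dict[int, dict] = {}
        # Local postings lists, e.g. "color:red" -> [191, 306]
        self.index: dict[str, list[int]] = defaultdict(list)

    def put(self, doc_id: int, doc: dict) -> None:
        self.docs[doc_id] = doc
        for field in ("color", "make"):
            self.index[f"{field}:{doc[field]}"].append(doc_id)

    def search(self, term: str) -> list[int]:
        return list(self.index.get(term, []))


shards = [Shard(), Shard()]


def shard_for(doc_id: int) -> Shard:
    return shards[doc_id // 500]        # IDs 0-499 -> shard 0, 500-999 -> shard 1


def put(doc_id: int, doc: dict) -> None:
    shard_for(doc_id).put(doc_id, doc)  # a write touches exactly one shard


def search_all(term: str) -> list[int]:
    # Scatter the query to every shard, then gather and merge the results.
    return sorted(i for shard in shards for i in shard.search(term))


put(191, {"color": "red", "make": "Honda"})
put(306, {"color": "red", "make": "Ford"})
put(768, {"color": "red", "make": "Volvo"})
put(893, {"color": "silver", "make": "Audi"})
print(search_all("color:red"))   # [191, 306, 768]
```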
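### Global Secondary Indexes {#id167}
Rather than each shard having its own local secondary index, we can construct a *global index* that covers data in all shards. However, we can't just store that index on one node, since it would likely become a bottleneck and defeat the purpose of sharding. A global index must also be sharded, but it can be sharded differently from the primary key index.
[Figure 7-10](#fig_sharding_global_secondary) illustrates what this could look like: the IDs of red cars from all shards appear under `color:red` in the index, but the index is sharded so that colors starting with the letters *a* to *r* appear in shard 0 and colors starting with *s* to *z* appear in shard 1. The index on the make of car is partitioned similarly (with the shard boundary being between *f* and *h*).
{{< figure src="/fig/ddia_0710.png" id="fig_sharding_global_secondary" caption="Figure 7-10. A global secondary index reflects data from all shards, and is itself sharded by the indexed value." class="w-full my-4" >}}
This kind of index is also called *term-partitioned* [^30]: recall from ["Full-Text Search"](/tw/ch4#sec_storage_full_text) that in full-text search, a *term* is a keyword in a text that you can search for. Here we generalize it to mean any value that you can search for in the secondary index.
The global index uses the term as partition key, so that when you're looking for a particular term or value, you can figure out which shard you need to query. As before, a shard can contain a contiguous range of terms (as in [Figure 7-10](#fig_sharding_global_secondary)), or you can assign terms to shards based on a hash of the term.
Global indexes have the advantage that a query with a single condition (such as *color = red*) only needs to read from a single shard to fetch the postings list. However, if you want to fetch the records and not just their IDs, you still have to read from all the shards that are responsible for those IDs.
If you have multiple search conditions or terms (e.g., searching for cars of a certain color and a certain make, or searching for multiple words occurring in the same text), it is likely that those terms will be assigned to different shards. To compute the logical AND of the two conditions, the system needs to find all the IDs that occur in both of the postings lists. That is no problem if the postings lists are short, but if they are long, it can be slow to send them over the network to compute their intersection [^30].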
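
A term-partitioned index along these lines might be sketched as below. The names `index_shard_for`, `index_document`, `lookup`, and `lookup_and` are hypothetical, the sample listings are made up, and for brevity a single boundary (a to r versus s to z) is applied to every field, unlike the figure, where each field has its own boundary.

```python
from collections import defaultdict

# Two index shards, each holding postings lists for a range of terms.
index_shards: list[dict[str, set[int]]] = [defaultdict(set), defaultdict(set)]


def index_shard_for(term: str) -> int:
    """Term-partitioned: values starting with a-r -> shard 0, s-z -> shard 1."""
    value = term.split(":", 1)[1]
    return 0 if value[0].lower() <= "r" else 1


def index_document(doc_id: int, doc: dict[str, str]) -> set[int]:
    """Add a document to the global index; returns the index shards touched."""
    touched = set()
    for field, value in doc.items():
        term = f"{field}:{value}"
        shard = index_shard_for(term)
        index_shards[shard][term].add(doc_id)
        touched.add(shard)
    return touched


def lookup(term: str) -> set[int]:
    """A single-term query reads one index shard only."""
    return set(index_shards[index_shard_for(term)].get(term, set()))


def lookup_and(term_a: str, term_b: str) -> set[int]:
    """Logical AND: fetch both postings lists (possibly from different
    index shards) and intersect them."""
    return lookup(term_a) & lookup(term_b)


index_document(191, {"color": "red", "make": "Honda"})
index_document(768, {"color": "red", "make": "Volvo"})
index_document(893, {"color": "silver", "make": "Audi"})
print(lookup("color:red"))                    # {191, 768}, from index shard 0
print(lookup_and("color:red", "make:Volvo"))  # {768}, needs both index shards
```

Indexing a single red Volvo touches both index shards, which illustrates why writes are more involved with a global index than with a local one.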
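Another challenge with global secondary indexes is that writes are more complicated than with local indexes, because writing a single record might affect multiple shards of the index (every term in the document might be on a different shard or a different node). This makes it harder to keep the secondary index in sync with the underlying data. One option is to use a distributed transaction to atomically update the shards storing the primary record and its secondary indexes (see [Chapter 8](/tw/ch8#ch_transactions)).
Global secondary indexes are used by CockroachDB, TiDB, and YugabyteDB; DynamoDB supports both local and global secondary indexes. In DynamoDB, writes are reflected in global indexes asynchronously, so reads from a global index may be stale (similarly to replication lag, see ["Problems with Replication Lag"](/tw/ch6#sec_replication_lag)). Nevertheless, global indexes are useful if read throughput is higher than write throughput, and if the postings lists are not too long.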
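## Summary {#summary}
In this chapter we explored different ways of sharding a large dataset into smaller subsets. Sharding is necessary when you have so much data that storing and processing it on a single machine is no longer feasible.
The goal of sharding is to spread the data and query load evenly across multiple machines, avoiding hot spots (nodes with disproportionately high load). This requires choosing a sharding scheme that is appropriate to your data, and rebalancing the shards when nodes are added to or removed from the cluster.
We discussed two main approaches to sharding:
**Key range sharding**
: Keys are sorted, and a shard owns all the keys from some minimum up to some maximum. Sorting has the advantage that efficient range queries are possible, but there is a risk of hot spots if the application often accesses keys that are close together in the sorted order.
In this approach, shards are typically rebalanced dynamically by splitting the range into two subranges when a shard gets too big.
**Hash sharding**
: A hash function is applied to each key, and a shard owns a range of hash values (or another consistent hashing algorithm may be used to map hashes to shards). This method destroys the ordering of keys, making range queries inefficient, but it may distribute load more evenly.
When sharding by hash, it is common to create a fixed number of shards in advance, to assign several shards to each node, and to move entire shards from one node to another when nodes are added or removed. Splitting shards, as with key ranges, is also possible.
It is common to use the first part of the key as the partition key (i.e., to identify the shard), and to sort records within that shard by the rest of the key. That way you can still have efficient range queries among the records with the same partition key.
We also discussed the interaction between sharding and secondary indexes. A secondary index also needs to be sharded, and there are two methods:
**Local secondary indexes**
: The secondary indexes are stored in the same shard as the primary key and value. This means that only a single shard needs to be updated on write, but a lookup in the secondary index requires reading from all shards.
**Global secondary indexes**
: These are sharded separately based on the indexed values. An entry in the secondary index may refer to records from all shards of the primary key. When a record is written, several secondary index shards may need to be updated; however, a read of the postings list can be served from a single shard (fetching the actual records still requires reading from multiple shards).
Finally, we discussed techniques for routing queries to the appropriate shard, and how a coordination service is often used to keep track of the assignment of shards to nodes.
By design, every shard operates mostly independently, which is what allows a sharded database to scale to multiple machines. However, operations that need to write to several shards can be problematic: for example, what happens if the write to one shard succeeds, but the write to another fails? We address that question in the following chapters.
### References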
[^1]: Claire Giordano. [Understanding partitioning and sharding in Postgres and Citus](https://www.citusdata.com/blog/2023/08/04/understanding-partitioning-and-sharding-in-postgres-and-citus/). *citusdata.com*, August 2023. Archived at [perma.cc/8BTK-8959](https://perma.cc/8BTK-8959)
[^2]: Brandur Leach. [Partitioning in Postgres, 2022 edition](https://brandur.org/fragments/postgres-partitioning-2022). *brandur.org*, October 2022. Archived at [perma.cc/Z5LE-6AKX](https://perma.cc/Z5LE-6AKX)
[^3]: Raph Koster. [Database “sharding” came from UO?](https://www.raphkoster.com/2009/01/08/database-sharding-came-from-uo/) *raphkoster.com*, January 2009. Archived at [perma.cc/4N9U-5KYF](https://perma.cc/4N9U-5KYF)
[^4]: Garrett Fidalgo. [Herding elephants: Lessons learned from sharding Postgres at Notion](https://www.notion.com/blog/sharding-postgres-at-notion). *notion.com*, October 2021. Archived at [perma.cc/5J5V-W2VX](https://perma.cc/5J5V-W2VX)
[^5]: Ulrich Drepper. [What Every Programmer Should Know About Memory](https://www.akkadia.org/drepper/cpumemory.pdf). *akkadia.org*, November 2007. Archived at [perma.cc/NU6Q-DRXZ](https://perma.cc/NU6Q-DRXZ)
[^6]: Jingyu Zhou, Meng Xu, Alexander Shraer, Bala Namasivayam, Alex Miller, Evan Tschannen, Steve Atherton, Andrew J. Beamon, Rusty Sears, John Leach, Dave Rosenthal, Xin Dong, Will Wilson, Ben Collins, David Scherer, Alec Grieser, Young Liu, Alvin Moore, Bhaskar Muppana, Xiaoge Su, and Vishesh Yadav. [FoundationDB: A Distributed Unbundled Transactional Key Value Store](https://www.foundationdb.org/files/fdb-paper.pdf). At *ACM International Conference on Management of Data* (SIGMOD), June 2021. [doi:10.1145/3448016.3457559](https://doi.org/10.1145/3448016.3457559)
[^7]: Marco Slot. [Citus 12: Schema-based sharding for PostgreSQL](https://www.citusdata.com/blog/2023/07/18/citus-12-schema-based-sharding-for-postgres/). *citusdata.com*, July 2023. Archived at [perma.cc/R874-EC9W](https://perma.cc/R874-EC9W)
[^8]: Robisson Oliveira. [Reducing the Scope of Impact with Cell-Based Architecture](https://docs.aws.amazon.com/pdfs/wellarchitected/latest/reducing-scope-of-impact-with-cell-based-architecture/reducing-scope-of-impact-with-cell-based-architecture.pdf). AWS Well-Architected white paper, Amazon Web Services, September 2023. Archived at [perma.cc/4KWW-47NR](https://perma.cc/4KWW-47NR)
[^9]: Gwen Shapira. [Things DBs Don’t Do - But Should](https://www.thenile.dev/blog/things-dbs-dont-do). *thenile.dev*, February 2023. Archived at [perma.cc/C3J4-JSFW](https://perma.cc/C3J4-JSFW)
[^10]: Malte Schwarzkopf, Eddie Kohler, M. Frans Kaashoek, and Robert Morris. [Position: GDPR Compliance by Construction](https://cs.brown.edu/people/malte/pub/papers/2019-poly-gdpr.pdf). At *Towards Polystores that manage multiple Databases, Privacy, Security and/or Policy Issues for Heterogenous Data* (Poly), August 2019. [doi:10.1007/978-3-030-33752-0\_3](https://doi.org/10.1007/978-3-030-33752-0_3)
[^11]: Gwen Shapira. [Introducing pg\_karnak: Transactional schema migration across tenant databases](https://www.thenile.dev/blog/distributed-ddl). *thenile.dev*, November 2024. Archived at [perma.cc/R5RD-8HR9](https://perma.cc/R5RD-8HR9)
[^12]: Arka Ganguli, Guido Iaquinti, Maggie Zhou, and Rafael Chacón. [Scaling Datastores at Slack with Vitess](https://slack.engineering/scaling-datastores-at-slack-with-vitess/). *slack.engineering*, December 2020. Archived at [perma.cc/UW8F-ALJK](https://perma.cc/UW8F-ALJK)
[^13]: Ikai Lan. [App Engine Datastore Tip: Monotonically Increasing Values Are Bad](https://ikaisays.com/2011/01/25/app-engine-datastore-tip-monotonically-increasing-values-are-bad/). *ikaisays.com*, January 2011. Archived at [perma.cc/BPX8-RPJB](https://perma.cc/BPX8-RPJB)
[^14]: Enis Soztutar. [Apache HBase Region Splitting and Merging](https://www.cloudera.com/blog/technical/apache-hbase-region-splitting-and-merging.html). *cloudera.com*, February 2013. Archived at [perma.cc/S9HS-2X2C](https://perma.cc/S9HS-2X2C)
[^15]: Eric Evans. [Rethinking Topology in Cassandra](https://www.youtube.com/watch?v=Qz6ElTdYjjU). At *Cassandra Summit*, June 2013. Archived at [perma.cc/2DKM-F438](https://perma.cc/2DKM-F438)
[^16]: Martin Kleppmann. [Java’s hashCode Is Not Safe for Distributed Systems](https://martin.kleppmann.com/2012/06/18/java-hashcode-unsafe-for-distributed-systems.html). *martin.kleppmann.com*, June 2012. Archived at [perma.cc/LK5U-VZSN](https://perma.cc/LK5U-VZSN)
[^17]: Mostafa Elhemali, Niall Gallagher, Nicholas Gordon, Joseph Idziorek, Richard Krog, Colin Lazier, Erben Mo, Akhilesh Mritunjai, Somu Perianayagam, Tim Rath, Swami Sivasubramanian, James Christopher Sorenson III, Sroaj Sosothikul, Doug Terry, and Akshat Vig. [Amazon DynamoDB: A Scalable, Predictably Performant, and Fully Managed NoSQL Database Service](https://www.usenix.org/conference/atc22/presentation/elhemali). At *USENIX Annual Technical Conference* (ATC), July 2022.
[^18]: Brandon Williams. [Virtual Nodes in Cassandra 1.2](https://www.datastax.com/blog/virtual-nodes-cassandra-12). *datastax.com*, December 2012. Archived at [perma.cc/N385-EQXV](https://perma.cc/N385-EQXV)
[^19]: Branimir Lambov. [New Token Allocation Algorithm in Cassandra 3.0](https://www.datastax.com/blog/new-token-allocation-algorithm-cassandra-30). *datastax.com*, January 2016. Archived at [perma.cc/2BG7-LDWY](https://perma.cc/2BG7-LDWY)
[^20]: David Karger, Eric Lehman, Tom Leighton, Rina Panigrahy, Matthew Levine, and Daniel Lewin. [Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web](https://people.csail.mit.edu/karger/Papers/web.pdf). At *29th Annual ACM Symposium on Theory of Computing* (STOC), May 1997. [doi:10.1145/258533.258660](https://doi.org/10.1145/258533.258660)
[^21]: Damian Gryski. [Consistent Hashing: Algorithmic Tradeoffs](https://dgryski.medium.com/consistent-hashing-algorithmic-tradeoffs-ef6b8e2fcae8). *dgryski.medium.com*, April 2018. Archived at [perma.cc/B2WF-TYQ8](https://perma.cc/B2WF-TYQ8)
[^22]: David G. Thaler and Chinya V. Ravishankar. [Using name-based mappings to increase hit rates](https://www.cs.kent.edu/~javed/DL/web/p1-thaler.pdf). *IEEE/ACM Transactions on Networking*, volume 6, issue 1, pages 1–14, February 1998. [doi:10.1109/90.663936](https://doi.org/10.1109/90.663936)
[^23]: John Lamping and Eric Veach. [A Fast, Minimal Memory, Consistent Hash Algorithm](https://arxiv.org/abs/1406.2294). *arxiv.org*, June 2014.
[^24]: Samuel Axon. [3% of Twitter’s Servers Dedicated to Justin Bieber](https://mashable.com/archive/justin-bieber-twitter). *mashable.com*, September 2010. Archived at [perma.cc/F35N-CGVX](https://perma.cc/F35N-CGVX)
[^25]: Gerald Guo and Thawan Kooburat. [Scaling services with Shard Manager](https://engineering.fb.com/2020/08/24/production-engineering/scaling-services-with-shard-manager/). *engineering.fb.com*, August 2020. Archived at [perma.cc/EFS3-XQYT](https://perma.cc/EFS3-XQYT)
[^26]: Sangmin Lee, Zhenhua Guo, Omer Sunercan, Jun Ying, Thawan Kooburat, Suryadeep Biswal, Jun Chen, Kun Huang, Yatpang Cheung, Yiding Zhou, Kaushik Veeraraghavan, Biren Damani, Pol Mauri Ruiz, Vikas Mehta, and Chunqiang Tang. [Shard Manager: A Generic Shard Management Framework for Geo-distributed Applications](https://dl.acm.org/doi/pdf/10.1145/3477132.3483546). *28th ACM SIGOPS Symposium on Operating Systems Principles* (SOSP), pages 553–569, October 2021. [doi:10.1145/3477132.3483546](https://doi.org/10.1145/3477132.3483546)
[^27]: Scott Lystig Fritchie. [A Critique of Resizable Hash Tables: Riak Core & Random Slicing](https://www.infoq.com/articles/dynamo-riak-random-slicing/). *infoq.com*, August 2018. Archived at [perma.cc/RPX7-7BLN](https://perma.cc/RPX7-7BLN)
[^28]: Andy Warfield. [Building and operating a pretty big storage system called S3](https://www.allthingsdistributed.com/2023/07/building-and-operating-a-pretty-big-storage-system.html). *allthingsdistributed.com*, July 2023. Archived at [perma.cc/6S7P-GLM4](https://perma.cc/6S7P-GLM4)
[^29]: Rich Houlihan. [DynamoDB adaptive capacity: smooth performance for chaotic workloads (DAT327)](https://www.youtube.com/watch?v=kMY0_m29YzU). At *AWS re:Invent*, November 2017.
[^30]: Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. [*Introduction to Information Retrieval*](https://nlp.stanford.edu/IR-book/). Cambridge University Press, 2008. ISBN: 978-0-521-86571-5, available online at [nlp.stanford.edu/IR-book](https://nlp.stanford.edu/IR-book/)
[^31]: Michael Busch, Krishna Gade, Brian Larson, Patrick Lok, Samuel Luckenbill, and Jimmy Lin. [Earlybird: Real-Time Search at Twitter](https://cs.uwaterloo.ca/~jimmylin/publications/Busch_etal_ICDE2012.pdf). At *28th IEEE International Conference on Data Engineering* (ICDE), April 2012. [doi:10.1109/ICDE.2012.149](https://doi.org/10.1109/ICDE.2012.149)
[^32]: Nadav Har’El. [Indexing in Cassandra 3](https://github.com/scylladb/scylladb/wiki/Indexing-in-Cassandra-3). *github.com*, April 2017. Archived at [perma.cc/3ENV-8T9P](https://perma.cc/3ENV-8T9P)
[^33]: Zachary Tong. [Customizing Your Document Routing](https://www.elastic.co/blog/customizing-your-document-routing/). *elastic.co*, June 2013. Archived at [perma.cc/97VM-MREN](https://perma.cc/97VM-MREN)
[^34]: Andrew Pavlo. [H-Store Frequently Asked Questions](https://hstore.cs.brown.edu/documentation/faq/). *hstore.cs.brown.edu*, October 2013. Archived at [perma.cc/X3ZA-DW6Z](https://perma.cc/X3ZA-DW6Z)
================================================
FILE: content/tw/ch8.md
================================================
---
title: "8. 事務"
weight: 208
math: true
breadcrumbs: false
---

> *有些作者聲稱,支援通用的兩階段提交代價太大,會帶來效能與可用性的問題。我們認為,讓程式設計師來處理過度使用事務導致的效能問題,總比缺少事務程式設計好得多。*
>
> James Corbett 等人,*Spanner:Google 的全球分散式資料庫*(2012)
在資料系統的殘酷現實中,很多事情都可能出錯:
* 資料庫軟體或硬體可能在任意時刻發生故障(包括寫操作進行到一半時)。
* 應用程式可能在任意時刻崩潰(包括一系列操作的中間)。
* 網路中斷可能會意外切斷應用程式與資料庫的連線,或資料庫節點之間的連線。
* 多個客戶端可能會同時寫入資料庫,覆蓋彼此的更改。
* 客戶端可能讀取到無意義的資料,因為資料只更新了一部分。
* 客戶端之間的競態條件可能導致令人驚訝的錯誤。
為了實現可靠性,系統必須處理這些故障,確保它們不會導致整個系統的災難性故障。然而,實現容錯機制需要大量工作。它需要仔細考慮所有可能出錯的事情,並進行大量測試,以確保解決方案真正有效。
數十年來,*事務*一直是簡化這些問題的首選機制。事務是應用程式將多個讀寫操作組合成一個邏輯單元的一種方式。從概念上講,事務中的所有讀寫操作被視作單個操作來執行:整個事務要麼成功(*提交*),要麼失敗(*中止*、*回滾*)。如果失敗,應用程式可以安全地重試。對於事務來說,應用程式的錯誤處理變得簡單多了,因為它不用再擔心部分失敗——即某些操作成功,某些失敗(無論出於何種原因)。
如果你與事務打交道多年,它們可能看起來顯而易見,但我們不應該將其視為理所當然。事務不是自然法則;它們是有目的地建立的,即為了*簡化應用程式的程式設計模型*。透過使用事務,應用程式可以自由地忽略某些潛在的錯誤場景和併發問題,因為資料庫會替應用處理好這些(我們稱之為*安全保證*)。
並非所有應用程式都需要事務,有時弱化事務保證或完全放棄事務也有好處(例如,為了獲得更高的效能或更高的可用性)。某些安全屬性可以在沒有事務的情況下實現。另一方面,事務可以防止很多麻煩:例如,郵局 Horizon 醜聞(參見["可靠性有多重要?"](/tw/ch2#sidebar_reliability_importance))背後的技術原因可能是底層會計系統缺乏 ACID 事務[^1]。
你如何確定是否需要事務?為了回答這個問題,我們首先需要準確理解事務可以提供哪些安全保證,以及相關的成本。儘管事務乍看起來很簡單,但實際上有許多細微但重要的細節在起作用。
在本章中,我們將研究許多可能出錯的案例,並探索資料庫用於防範這些問題的演算法。我們將特別深入併發控制領域,討論可能發生的各種競態條件,以及資料庫如何實現*讀已提交*、*快照隔離*和*可序列化*等隔離級別。
併發控制對單節點和分散式資料庫都很重要。在本章後面的["分散式事務"](#sec_transactions_distributed)部分,我們將研究*兩階段提交*協議和在分散式事務中實現原子性的挑戰。
## 事務到底是什麼? {#sec_transactions_overview}
今天,幾乎所有的關係型資料庫和一些非關係資料庫都支援事務。它們大多遵循 1975 年由 IBM System R(第一個 SQL 資料庫)引入的風格[^2] [^3] [^4]。儘管一些實現細節發生了變化,但總體思路在 50 年裡幾乎保持不變:MySQL、PostgreSQL、Oracle、SQL Server 等的事務支援與 System R 驚人地相似。
在 2000 年代後期,非關係(NoSQL)資料庫開始流行起來。它們旨在透過提供新的資料模型選擇(參見[第 3 章](/tw/ch3#ch_datamodels)),以及預設包含複製([第 6 章](/tw/ch6#ch_replication))和分片([第 7 章](/tw/ch7#ch_sharding))來改進關係型資料庫的現狀。事務是這一運動的主要犧牲品:許多這一代資料庫完全放棄了事務,或者重新定義了這個詞,用來描述比以前理解的更弱的保證集。
圍繞 NoSQL 分散式資料庫的炒作導致了一種流行的信念,即事務從根本上不可伸縮,任何大規模系統都必須放棄事務以保持良好的效能和高可用性。最近,這種信念被證明是錯誤的。所謂 "NewSQL" 資料庫,如 CockroachDB[^5]、TiDB[^6]、Spanner[^7]、FoundationDB[^8] 和 YugabyteDB 已經證明,事務系統同樣可以具備很強的可伸縮性,並支援大資料量與高吞吐量。這些系統將分片與共識協議([第 10 章](/tw/ch10#ch_consistency))結合,在大規模下提供強 ACID 保證。
然而,這並不意味著每個系統都必須是事務型的:與任何其他技術設計選擇一樣,事務有優點也有侷限性。為了理解這些權衡,讓我們深入瞭解事務可以提供的保證的細節——無論是在正常操作中還是在各種極端(但現實)的情況下。
### ACID 的含義 {#sec_transactions_acid}
事務提供的安全保證通常由眾所周知的首字母縮略詞 *ACID* 來描述,它代表*原子性*(Atomicity)、*一致性*(Consistency)、*隔離性*(Isolation)和*永續性*(Durability)。它由 Theo Härder 和 Andreas Reuter 於 1983 年提出[^9],旨在為資料庫中的容錯機制建立精確的術語。
然而,在實踐中,一個數據庫的 ACID 實現並不等同於另一個數據庫的實現。例如,正如我們將看到的,*隔離性*的含義有很多歧義[^10]。高層次的想法是合理的,但魔鬼在細節中。今天,當一個系統聲稱自己"符合 ACID"時,實際上你能期待什麼保證並不清楚。不幸的是,ACID 基本上已經成為了一個營銷術語。
(不符合 ACID 標準的系統有時被稱為 *BASE*,它代表*基本可用*(Basically Available)、*軟狀態*(Soft state)和*最終一致性*(Eventual consistency)[^11]。這比 ACID 的定義更加模糊。似乎 BASE 唯一合理的定義是"非 ACID";即,它幾乎可以代表任何你想要的東西。)
讓我們深入瞭解原子性、一致性、隔離性和永續性的定義,這將讓我們提煉出事務的思想。
#### 原子性 {#sec_transactions_acid_atomicity}
一般來說,*原子*是指不能分解成更小部分的東西。這個詞在計算機的不同分支中意味著相似但又微妙不同的東西。例如,在多執行緒程式設計中,如果一個執行緒執行原子操作,這意味著另一個執行緒無法看到該操作的半完成結果。系統只能處於操作之前或操作之後的狀態,而不是介於兩者之間。
相比之下,在 ACID 的上下文中,原子性*不是*關於併發的。它不描述如果幾個程序試圖同時訪問相同的資料會發生什麼,因為這包含在字母 *I*(*隔離性*)中(參見["隔離性"](#sec_transactions_acid_isolation))。
相反,ACID 原子性描述了當客戶端想要進行多次寫入,但在某些寫入被處理後發生故障時會發生什麼——例如,程序崩潰、網路連線中斷、磁碟變滿或違反了某些完整性約束。如果這些寫入被分組到一個原子事務中,並且由於故障無法完成(*提交*)事務,則事務被*中止*,資料庫必須丟棄或撤消該事務中迄今為止所做的任何寫入。
如果沒有原子性,如果在進行多處更改的中途發生錯誤,很難知道哪些更改已經生效,哪些沒有。應用程式可以重試,但這有進行兩次相同更改的風險,導致資料重複或錯誤。原子性簡化了這個問題:如果事務被中止,應用程式可以確定它沒有改變任何東西,因此可以安全地重試。
在錯誤時中止事務並丟棄該事務的所有寫入的能力是 ACID 原子性的定義特徵。也許*可中止性*比*原子性*更好,但我們將堅持使用*原子性*,因為這是常用詞。
#### 一致性 {#sec_transactions_acid_consistency}
*一致性*這個詞被嚴重濫用:
* 在[第 6 章](/tw/ch6#ch_replication)中,我們討論了*副本一致性*和非同步複製系統中出現的*最終一致性*問題(參見["複製延遲的問題"](/tw/ch6#sec_replication_lag))。
* 資料庫的*一致快照*(例如,用於備份)是整個資料庫在某一時刻存在的快照。更準確地說,它與先發生關係(happens-before relation)一致(參見["“先發生”關係和併發"](/tw/ch6#sec_replication_happens_before)):也就是說,如果快照包含在特定時間寫入的值,那麼它也反映了在該值寫入之前發生的所有寫入。
* *一致性雜湊*是某些系統用於再平衡的分片方法(參見["一致性雜湊"](/tw/ch7#sec_sharding_consistent_hashing))。
* 在 CAP定理中(參見[第 10 章](/tw/ch10#ch_consistency)),*一致性*一詞用於表示*線性一致性*(參見["線性一致性"](/tw/ch10#sec_consistency_linearizability))。
* 在 ACID 的上下文中,*一致性*是指應用程式特定的資料庫處於"良好狀態"的概念。
不幸的是,同一個詞至少有五種不同的含義。
ACID 一致性的思想是,你對資料有某些陳述(*不變式*)必須始終為真——例如,在會計系統中,所有賬戶的貸方和借方必須始終平衡。如果事務從滿足這些不變式的有效資料庫開始,並且事務期間的任何寫入都保持有效性,那麼你可以確定不變式始終得到滿足。(不變式可能在事務執行期間暫時違反,但在事務提交時應該再次滿足。)
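If you want the database to enforce your invariants, you need to declare them as *constraints* that are part of the schema. For example, foreign key constraints, uniqueness constraints, and check constraints (which restrict the values that can appear in an individual row) are often used to model specific types of invariants. More complex consistency requirements can sometimes be modeled using triggers or materialized views[^12].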
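As a small illustration of a declared invariant, the following sketch uses Python's built-in `sqlite3` module. The `accounts` table, its columns, and the `withdraw` helper are hypothetical, and the invariant (a balance may never go negative) is chosen only for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE accounts (
        id      INTEGER PRIMARY KEY,
        owner   TEXT NOT NULL UNIQUE,                     -- uniqueness constraint
        balance INTEGER NOT NULL CHECK (balance >= 0)     -- check constraint
    )
    """
)
conn.execute("INSERT INTO accounts (owner, balance) VALUES ('aaliyah', 500)")
conn.commit()


def withdraw(conn, owner, amount):
    """Return True if the withdrawal committed, False if a constraint rejected it."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE owner = ?",
                (amount, owner),
            )
        return True
    except sqlite3.IntegrityError:
        # The database refused a write that would violate the declared invariant.
        return False
```

The point of the sketch is that the check lives in the schema: any code path that writes to `accounts` is covered, not just the ones that remember to validate.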
然而,複雜的不變式可能很難或不可能使用資料庫通常提供的約束來建模。在這種情況下,應用程式有責任正確定義其事務,以便它們保持一致性。如果你寫入違反不變式的錯誤資料,但你沒有宣告這些不變式,資料庫無法阻止你。因此,ACID 中的 C 通常取決於應用程式如何使用資料庫,而不僅僅是資料庫的屬性。
#### 隔離性 {#sec_transactions_acid_isolation}
大多數資料庫都會同時被多個客戶端訪問。如果它們讀寫資料庫的不同部分,這沒有問題,但如果它們訪問相同的資料庫記錄,你可能會遇到併發問題(競態條件)。
[圖 8-1](#fig_transactions_increment) 是這種問題的一個簡單例子。假設你有兩個客戶端同時遞增儲存在資料庫中的計數器。每個客戶端需要讀取當前值,加 1,然後寫回新值(假設資料庫中沒有內建的遞增操作)。在[圖 8-1](#fig_transactions_increment) 中,計數器應該從 42 增加到 44,因為發生了兩次遞增,但實際上由於競態條件只增加到 43。
{{< figure src="/fig/ddia_0801.png" id="fig_transactions_increment" caption="圖 8-1. 兩個客戶端併發遞增計數器之間的競態條件。" class="w-full my-4" >}}
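The interleaving in the figure can be reproduced deterministically without any threads. The sketch below is a plain-Python simulation under the stated assumption that the database has no built-in increment; the function names and the dictionary standing in for the database are hypothetical.

```python
def increment_interleaved(initial):
    """Both clients read before either writes, so one increment is lost."""
    db = {"counter": initial}
    read_by_client_1 = db["counter"]      # client 1: get counter
    read_by_client_2 = db["counter"]      # client 2: get counter (same value)
    db["counter"] = read_by_client_1 + 1  # client 1: set counter
    db["counter"] = read_by_client_2 + 1  # client 2: overwrites with the same result
    return db["counter"]


def increment_serial(initial):
    """Each client finishes its read-modify-write before the next one starts."""
    db = {"counter": initial}
    for _client in (1, 2):
        value = db["counter"]
        db["counter"] = value + 1
    return db["counter"]
```

Starting from 42, the interleaved version ends at 43 and the serial version at 44, which is exactly the difference isolation is meant to hide from the application.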
ACID 意義上的*隔離性*意味著同時執行的事務彼此隔離:它們不能相互干擾。經典的資料庫教科書將隔離性形式化為*可序列化*,這意味著每個事務可以假裝它是唯一在整個資料庫上執行的事務。資料庫確保當事務已經提交時,結果與它們*序列*執行(一個接一個)相同,即使實際上它們可能是併發執行的[^13]。
然而,可序列化有效能成本。在實踐中,許多資料庫使用比可序列化更弱的隔離形式:也就是說,它們允許併發事務以有限的方式相互干擾。一些流行的資料庫,如 Oracle,甚至沒有實現它(Oracle 有一個稱為"可序列化"的隔離級別,但它實際上實現了*快照隔離*,這是比可序列化更弱的保證[^10] [^14])。這意味著某些型別的競態條件仍然可能發生。我們將在["弱隔離級別"](#sec_transactions_isolation_levels)中探討快照隔離和其他形式的隔離。
#### 永續性 {#durability}
資料庫系統的目的是提供一個安全的地方來儲存資料,而不用擔心丟失它。*永續性*是一個承諾,即一旦事務成功提交,它寫入的任何資料都不會被遺忘,即使發生硬體故障或資料庫崩潰。
在單節點資料庫中,永續性通常意味著資料已經寫入非易失性儲存,如硬碟或 SSD。定期檔案寫入通常在傳送到磁碟之前在記憶體中緩衝,這意味著如果突然斷電它們將丟失;因此,許多資料庫使用 `fsync()` 系統呼叫來確保資料真正寫入磁碟。資料庫通常還有預寫日誌或類似的(參見["使 B 樹可靠"](/tw/ch4#sec_storage_btree_wal)),這允許它們在寫入過程中發生崩潰時恢復。
在複製資料庫中,永續性可能意味著資料已成功複製到某些節點。為了提供永續性保證,資料庫必須等到這些寫入或複製完成,然後才報告事務成功提交。然而,如["可靠性和容錯"](/tw/ch2#sec_introduction_reliability)中所討論的,完美的永續性不存在:如果所有硬碟和所有備份同時被銷燬,顯然你的資料庫無法挽救你。
--------
> [!TIP] 複製與永續性
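Historically, durability meant writing to an archive tape. Then it was understood as writing to a disk or SSD. More recently, it has been adapted to mean replication. Which implementation is better?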
事實是,沒有什麼是完美的:
* 如果你寫入磁碟而機器宕機,即使你的資料沒有丟失,在你修復機器或將磁碟轉移到另一臺機器之前,它也是不可訪問的。複製系統可以保持可用。
* 相關故障——停電或導致每個節點在特定輸入上崩潰的錯誤——可以一次性摧毀所有副本(參見["可靠性和容錯"](/tw/ch2#sec_introduction_reliability)),失去任何僅在記憶體中的資料。因此,寫入磁碟對於複製資料庫仍然相關。
* 在非同步複製系統中,當領導者變得不可用時,最近的寫入可能會丟失(參見["處理節點故障"](/tw/ch6#sec_replication_failover))。
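* SSDs in particular have been shown to sometimes violate the guarantees they are supposed to provide when power is suddenly cut: even `fsync` isn't guaranteed to work correctly[^15]. Disk firmware can have bugs, just like any other kind of software[^16] [^17], for example causing drives to fail after exactly 32,768 hours of operation[^18]. And `fsync` is hard to use; even PostgreSQL used it incorrectly for over 20 years[^19] [^20] [^21].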
* 儲存引擎和檔案系統實現之間的微妙互動可能導致難以追蹤的錯誤,並可能導致磁碟上的檔案在崩潰後損壞[^22] [^23]。一個副本上的檔案系統錯誤有時也會傳播到其他副本[^24]。
* 磁碟上的資料可能在未被檢測到的情況下逐漸損壞[^25]。如果資料已經損壞了一段時間,副本和最近的備份也可能損壞。在這種情況下,你需要嘗試從歷史備份中恢復資料。
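* One study of SSDs found that between 30% and 80% of drives develop at least one bad block during the first four years of operation, and only some of these can be corrected by the firmware[^26]. Magnetic hard drives have a lower rate of bad sectors, but a higher rate of complete failure than SSDs.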
* 當磨損的 SSD(經歷了許多寫/擦除週期)斷電時,它可能在幾周到幾個月的時間尺度上開始丟失資料,具體取決於溫度[^27]。對於磨損水平較低的驅動器,這不是問題[^28]。
在實踐中,沒有一種技術可以提供絕對保證。只有各種降低風險的技術,包括寫入磁碟、複製到遠端機器和備份——它們可以而且應該一起使用。一如既往,明智的做法是對任何理論上的"保證"持健康的懷疑態度。
--------
### 單物件與多物件操作 {#sec_transactions_multi_object}
回顧一下,在 ACID 中,原子性和隔離性描述了如果客戶端在同一事務中進行多次寫入,資料庫應該做什麼:
原子性
: 如果在寫入序列的中途發生錯誤,事務應該被中止,並且到該點為止所做的寫入應該被丟棄。換句話說,資料庫讓你免於擔心部分失敗,透過提供全有或全無的保證。
隔離性
: 併發執行的事務不應該相互干擾。例如,如果一個事務進行多次寫入,那麼另一個事務應該看到所有或不看到這些寫入,但不是某些子集。
這些定義假設你想要同時修改多個物件(行、文件、記錄)。這種*多物件事務*通常需要保持多塊資料同步。[圖 8-2](#fig_transactions_read_uncommitted) 顯示了一個來自電子郵件應用程式的示例。要顯示使用者的未讀訊息數,你可以查詢類似這樣的內容:
```sql
SELECT COUNT(*) FROM emails WHERE recipient_id = 2 AND unread_flag = true;
```
{{< figure src="/fig/ddia_0802.png" id="fig_transactions_read_uncommitted" caption="圖 8-2. 違反隔離性:一個事務讀取另一個事務的未提交寫入(“髒讀”)。" class="w-full my-4" >}}
然而,如果有很多電子郵件,你可能會發現這個查詢太慢,並決定將未讀訊息的數量儲存在一個單獨的欄位中(一種反正規化,我們在["正規化、反正規化和連線"](/tw/ch3#sec_datamodels_normalization)中討論)。現在,每當有新訊息進來時,你必須增加未讀計數器,每當訊息被標記為已讀時,你也必須減少未讀計數器。
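In [Figure 8-2](#fig_transactions_read_uncommitted), user 2 experiences an anomaly: the mailbox listing shows an unread message, but the counter shows zero unread messages because the counter increment has not yet happened. (If an incorrect counter in an email application seems too insignificant, think of a customer account balance instead of an unread counter, and a payment transaction instead of an email.) Isolation would have prevented this issue by ensuring that user 2 sees either both the inserted email and the updated counter, or neither, but never an inconsistent halfway point.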
[圖 8-3](#fig_transactions_atomicity) 說明了對原子性的需求:如果在事務過程中某處發生錯誤,郵箱的內容和未讀計數器可能會失去同步。在原子事務中,如果對計數器的更新失敗,事務將被中止,插入的電子郵件將被回滾。
{{< figure src="/fig/ddia_0803.png" id="fig_transactions_atomicity" caption="圖 8-3. 原子性確保如果發生錯誤,該事務的任何先前寫入都會被撤消,以避免不一致的狀態。" class="w-full my-4" >}}
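The email example can be sketched with Python's built-in `sqlite3` module to show the all-or-nothing behavior. The schema, the `mailboxes` table holding the denormalized counter, and the `fail_before_counter` switch that simulates an error between the two writes are all assumptions made for illustration.

```python
import sqlite3


def setup():
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE emails (
            id INTEGER PRIMARY KEY,
            recipient_id INTEGER NOT NULL,
            body TEXT NOT NULL,
            unread_flag INTEGER NOT NULL
        );
        CREATE TABLE mailboxes (
            recipient_id INTEGER PRIMARY KEY,
            unread INTEGER NOT NULL
        );
        INSERT INTO mailboxes VALUES (2, 0);
        """
    )
    return conn


def deliver(conn, recipient_id, body, fail_before_counter=False):
    """Insert an email and bump the unread counter as one atomic transaction."""
    try:
        with conn:  # commit if the block succeeds, roll back if it raises
            conn.execute(
                "INSERT INTO emails (recipient_id, body, unread_flag) VALUES (?, ?, 1)",
                (recipient_id, body),
            )
            if fail_before_counter:
                raise RuntimeError("simulated failure between the two writes")
            conn.execute(
                "UPDATE mailboxes SET unread = unread + 1 WHERE recipient_id = ?",
                (recipient_id,),
            )
        return True
    except RuntimeError:
        return False  # the inserted email was rolled back along with the transaction
```

After a failed `deliver`, both tables are unchanged, so the mailbox contents and the counter cannot drift apart and the caller can retry safely.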
多物件事務需要某種方式來確定哪些讀寫操作屬於同一事務。在關係資料庫中,這通常基於客戶端與資料庫伺服器的 TCP 連線:在任何特定連線上,`BEGIN TRANSACTION` 和 `COMMIT` 語句之間的所有內容都被認為是同一事務的一部分。如果 TCP 連線中斷,事務必須被中止。
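On the other hand, many nonrelational databases don't have such a way of grouping operations together. Even if there is a multi-object API (for example, a key-value store may have a *multi-put* operation that updates several keys in one operation), that doesn't necessarily mean it has transaction semantics: the command may succeed for some keys and fail for others, leaving the database in a partially updated state.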
#### 單物件寫入 {#sec_transactions_single_object}
當單個物件被更改時,原子性和隔離性也適用。例如,假設你正在向資料庫寫入 20 KB 的 JSON 文件:
* 如果在傳送了前 10 KB 後網路連線中斷,資料庫是否儲存了無法解析的 10 KB JSON 片段?
* 如果資料庫正在覆蓋磁碟上的先前值的過程中電源失效,你是否最終會將新舊值拼接在一起?
* 如果另一個客戶端在寫入過程中讀取該文件,它會看到部分更新的值嗎?
這些問題會令人非常困惑,因此儲存引擎幾乎普遍的目標是在一個節點上的單個物件(如鍵值對)上提供原子性和隔離性。原子性可以使用日誌實現崩潰恢復(參見["使 B 樹可靠"](/tw/ch4#sec_storage_btree_wal)),隔離性可以使用每個物件上的鎖來實現(一次只允許一個執行緒訪問物件)。
某些資料庫還提供更複雜的原子操作,例如遞增操作,它消除了像[圖 8-1](#fig_transactions_increment) 中那樣的讀-修改-寫迴圈的需求。類似流行的是*條件寫入*操作,它允許僅在值未被其他人併發更改時才進行寫入(參見["條件寫入(比較並設定)"](#sec_transactions_compare_and_set)),類似於共享記憶體併發中的比較並設定或比較並交換(CAS)操作。
--------
> [!NOTE]
> 嚴格來說,術語*原子遞增*在多執行緒程式設計的意義上使用了*原子*這個詞。在 ACID 的上下文中,它實際上應該被稱為*隔離*或*可序列化*遞增,但這不是通常的術語。
--------
這些單物件操作很有用,因為它們可以防止多個客戶端嘗試同時寫入同一物件時的丟失更新(參見["防止丟失更新"](#sec_transactions_lost_update))。然而,它們不是通常意義上的事務。例如,Cassandra 和 ScyllaDB 的"輕量級事務"功能以及 Aerospike 的"強一致性"模式在單個物件上提供線性一致(參見["線性一致性"](/tw/ch10#sec_consistency_linearizability))讀取和條件寫入,但不保證跨多個物件。
#### 多物件事務的需求 {#sec_transactions_need}
我們是否需要多物件事務?是否可能僅使用鍵值資料模型和單物件操作來實現任何應用程式?
在某些用例中,單物件插入、更新和刪除就足夠了。然而,在許多其他情況下,需要協調對多個不同物件的寫入:
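* In a relational data model, a row in one table often has a foreign key reference to a row in another table. Similarly, in a graph-like data model, a vertex has edges to other vertices. Multi-object transactions allow you to ensure that these references remain valid: when inserting several records that refer to one another, the foreign keys have to be correct and up to date, or the data becomes nonsensical.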
* 在文件資料模型中,需要一起更新的欄位通常在同一文件內,它被視為單個物件——更新單個文件時不需要多物件事務。然而,缺乏連線功能的文件資料庫也鼓勵反正規化(參見["何時使用哪種模型"](/tw/ch3#sec_datamodels_document_summary))。當需要更新反正規化資訊時,如[圖 8-2](#fig_transactions_read_uncommitted) 的示例,你需要一次更新多個文件。事務在這種情況下非常有用,可以防止反正規化資料失去同步。
* 在具有二級索引的資料庫中(幾乎除了純鍵值儲存之外的所有資料庫),每次更改值時都需要更新索引。從事務的角度來看,這些索引是不同的資料庫物件:例如,如果沒有事務隔離,記錄可能出現在一個索引中但不在另一個索引中,因為對第二個索引的更新尚未發生(參見["分片和二級索引"](/tw/ch7#sec_sharding_secondary_indexes))。
這些應用程式仍然可以在沒有事務的情況下實現。然而,沒有原子性的錯誤處理變得更加複雜,缺乏隔離性可能導致併發問題。我們將在["弱隔離級別"](#sec_transactions_isolation_levels)中討論這些問題,並在["派生資料與分散式事務"](/tw/ch13#sec_future_derived_vs_transactions)中探索替代方法。
#### 處理錯誤和中止 {#handling-errors-and-aborts}
事務的一個關鍵特性是,如果發生錯誤,它可以被中止並安全地重試。ACID 資料庫基於這樣的哲學:如果資料庫有違反其原子性、隔離性或永續性保證的危險,它寧願完全放棄事務,也不允許它保持半完成狀態。
然而,並非所有系統都遵循這種哲學。特別是,具有無主(無領導者)複製的資料儲存(參見["無主(無領導者)複製"](/tw/ch6#sec_replication_leaderless))更多地基於"盡力而為"的基礎工作,可以總結為"資料庫將盡其所能,如果遇到錯誤,它不會撤消已經完成的操作"——因此,從錯誤中恢復是應用程式的責任。
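Errors will inevitably happen, but many software developers prefer to think only about the happy path rather than the intricacies of error handling. For example, popular object-relational mapping (ORM) frameworks such as Rails's ActiveRecord and Django don't retry aborted transactions: the error usually results in an exception bubbling up the stack, so any user input is thrown away and the user gets an error message. This is a shame, because the whole point of aborts is to enable safe retries.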
儘管重試中止的事務是一種簡單有效的錯誤處理機制,但它並不完美:
* 如果事務實際上成功了,但在伺服器嘗試向客戶端確認成功提交時網路中斷(因此從客戶端的角度來看超時),那麼重試事務會導致它被執行兩次——除非你有額外的應用程式級去重機制。
* 如果錯誤是由於過載或併發事務之間的高爭用,重試事務會使問題變得更糟,而不是更好。為了避免這種反饋迴圈,你可以限制重試次數,使用指數退避,並以不同的方式處理與過載相關的錯誤與其他錯誤(參見["當過載系統無法恢復時"](/tw/ch2#sidebar_metastable))。
* 僅在瞬態錯誤後重試才值得(例如,由於死鎖、隔離違規、臨時網路中斷和故障轉移);在永久錯誤後(例如,約束違規)重試將毫無意義。
* 如果事務在資料庫之外也有副作用,即使事務被中止,這些副作用也可能發生。例如,如果你正在傳送電子郵件,你不會希望每次重試事務時都再次傳送電子郵件。如果你想確保幾個不同的系統一起提交或中止,兩階段提交可以提供幫助(我們將在["兩階段提交(2PC)"](#sec_transactions_2pc)中討論這個問題)。
* 如果客戶端程序在重試時崩潰,它試圖寫入資料庫的任何資料都會丟失。
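Several of the caveats above (bounded retries, exponential backoff, retrying only after transient errors) can be combined in a small wrapper. This is a sketch under assumptions: `TransientError` and `PermanentError` are hypothetical exception classes that the application would have to map its database driver's errors onto, and deduplication of transactions that actually committed is not handled here.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical: deadlock, isolation violation, brief network interruption, failover."""


class PermanentError(Exception):
    """Hypothetical: constraint violation or another error that a retry cannot fix."""


def run_with_retries(txn, max_attempts=5, base_delay=0.05, sleep=time.sleep):
    """Call txn() and retry only after transient errors, with jittered exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return txn()
        except TransientError:
            if attempt == max_attempts:
                raise  # give up instead of adding load to an overloaded system
            delay = base_delay * (2 ** (attempt - 1))
            sleep(delay * random.uniform(0.5, 1.5))  # jitter spreads out retry storms
        # PermanentError (and anything else) propagates immediately without a retry.
```

The `sleep` parameter exists only so that callers and tests can substitute a no-op; a production wrapper would probably also treat overload errors differently from other transient errors, as the list above suggests.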
## 弱隔離級別 {#sec_transactions_isolation_levels}
如果兩個事務不訪問相同的資料,或者都是隻讀的,它們可以安全地並行執行,因為它們互不依賴。僅當一個事務讀取另一個事務併發修改的資料時,或者當兩個事務嘗試同時修改相同的資料時,才會出現併發問題(競態條件)。
併發錯誤很難透過測試發現,因為這些錯誤只有在時機不巧時才會觸發。這種時機問題可能非常罕見,通常難以重現。併發也很難推理,特別是在大型應用程式中,你不一定知道程式碼的其他部分正在訪問資料庫。如果只有一個使用者,應用程式開發就已經夠困難了;有許多併發使用者會讓情況變得更加困難,因為任何資料都可能在任何時候意外地發生變化。
出於這個原因,資料庫長期以來一直試圖透過提供*事務隔離*來嚮應用程式開發人員隱藏併發問題。理論上,隔離應該讓你的生活更輕鬆,讓你假裝沒有併發發生:*可序列化*隔離意味著資料庫保證事務具有與*序列*執行(即一次一個,沒有任何併發)相同的效果。
在實踐中,隔離不幸並不那麼簡單。可序列化隔離有效能成本,許多資料庫不願意支付這個代價[^10]。因此,系統通常使用較弱的隔離級別,這些級別可以防止*某些*併發問題,但不是全部。這些隔離級別更難理解,它們可能導致微妙的錯誤,但它們在實踐中仍然被使用[^29]。
由弱事務隔離引起的併發錯誤不僅僅是理論問題。它們已經導致了鉅額資金損失[^30] [^31] [^32],引發了金融審計師的調查[^33],並導致客戶資料損壞[^34]。對此類問題披露的一個流行評論是"如果你正在處理金融資料,請使用 ACID 資料庫!"——但這沒有抓住重點。即使許多流行的關係資料庫系統(通常被認為是"ACID")使用弱隔離,因此它們不一定能防止這些錯誤發生。
--------
> [!NOTE]
> 順便說一句,銀行系統的大部分依賴於透過安全 FTP 交換的文字檔案[^35]。在這種情況下,擁有審計跟蹤和一些人為級別的欺詐預防措施實際上比 ACID 屬性更重要。
--------
這些例子還強調了一個重要觀點:即使併發問題在正常操作中很少見,你也必須考慮攻擊者故意向你的 API 傳送大量高度併發請求以故意利用併發錯誤的可能性[^30]。因此,為了構建可靠和安全的應用程式,你必須確保系統地防止此類錯誤。
在本節中,我們將研究實踐中使用的幾種弱(非可序列化)隔離級別,並詳細討論哪些競態條件可以發生和不能發生,以便你可以決定哪個級別適合你的應用程式。完成後,我們將詳細討論可序列化(參見["可序列化"](#sec_transactions_serializability))。我們對隔離級別的討論將是非正式的,使用示例。如果你想要嚴格的定義和對其屬性的分析,你可以在學術文獻中找到它們[^36] [^37] [^38] [^39]。
### 讀已提交 {#sec_transactions_read_committed}
最基本的事務隔離級別是*讀已提交*。它提供兩個保證:
1. 從資料庫讀取時,你只會看到已經提交的資料(沒有*髒讀*)。
2. 寫入資料庫時,你只會覆蓋已經提交的資料(沒有*髒寫*)。
某些資料庫支援更弱的隔離級別,稱為*讀未提交*。它防止髒寫,但不防止髒讀。讓我們更詳細地討論這兩個保證。
#### 沒有髒讀 {#no-dirty-reads}
想象一個事務已經向資料庫寫入了一些資料,但事務尚未提交或中止。另一個事務能看到那個未提交的資料嗎?如果能,這稱為*髒讀*[^3]。
在讀已提交隔離級別下執行的事務必須防止髒讀。這意味著事務的任何寫入只有在該事務提交時才對其他人可見(然後它的所有寫入立即變得可見)。這在[圖 8-4](#fig_transactions_read_committed) 中說明,其中使用者 1 已設定 *x* = 3,但使用者 2 的 *get x* 仍返回舊值 2,因為使用者 1 尚未提交。
{{< figure src="/fig/ddia_0804.png" id="fig_transactions_read_committed" caption="圖 8-4. 沒有髒讀:使用者 2 只有在使用者 1 的事務提交後才能看到 x 的新值。" class="w-full my-4" >}}
有幾個原因說明為什麼防止髒讀是有用的:
* 如果事務需要更新多行,髒讀意味著另一個事務可能看到某些更新但不是其他更新。例如,在[圖 8-2](#fig_transactions_read_uncommitted) 中,使用者看到新的未讀電子郵件但沒有看到更新的計數器。這是電子郵件的髒讀。看到資料庫處於部分更新狀態會讓使用者感到困惑,並可能導致其他事務做出錯誤的決定。
* 如果事務中止,它所做的任何寫入都需要回滾(如[圖 8-3](#fig_transactions_atomicity))。如果資料庫允許髒讀,這意味著事務可能看到後來被回滾的資料——即從未實際提交到資料庫的資料。任何讀取未提交資料的事務也需要被中止,導致稱為*級聯中止*的問題。
#### 沒有髒寫 {#sec_transactions_dirty_write}
如果兩個事務併發嘗試更新資料庫中的同一行會發生什麼?我們不知道寫入將以什麼順序發生,但我們通常假設後面的寫入會覆蓋前面的寫入。
然而,如果前面的寫入是尚未提交的事務的一部分,因此後面的寫入覆蓋了一個未提交的值,會發生什麼?這稱為*髒寫*[^36]。在讀已提交隔離級別下執行的事務必須防止髒寫,通常透過延遲第二個寫入直到第一個寫入的事務已提交或中止。
透過防止髒寫,這個隔離級別避免了某些型別的併發問題:
* 如果事務更新多行,髒寫可能導致糟糕的結果。例如,考慮[圖 8-5](#fig_transactions_dirty_writes),它說明了一個二手車銷售網站,兩個人 Aaliyah 和 Bryce 同時嘗試購買同一輛車。購買汽車需要兩次資料庫寫入:網站上的列表需要更新以反映買家,銷售發票需要傳送給買家。在[圖 8-5](#fig_transactions_dirty_writes) 的情況下,銷售被授予 Bryce(因為他對 `listings` 表執行了獲勝的更新),但發票被傳送給 Aaliyah(因為她對 `invoices` 表執行了獲勝的更新)。讀已提交防止了這種事故。
* 然而,讀已提交*不*防止[圖 8-1](#fig_transactions_increment) 中兩個計數器遞增之間的競態條件。在這種情況下,第二個寫入發生在第一個事務提交之後,所以它不是髒寫。它仍然是不正確的,但原因不同——在["防止丟失更新"](#sec_transactions_lost_update)中,我們將討論如何使此類計數器遞增安全。
{{< figure src="/fig/ddia_0805.png" id="fig_transactions_dirty_writes" caption="圖 8-5. 有了髒寫,來自不同事務的衝突寫入可能會混在一起。" class="w-full my-4" >}}
#### 實現讀已提交 {#sec_transactions_read_committed_impl}
讀已提交是一個非常流行的隔離級別。它是 Oracle Database、PostgreSQL、SQL Server 和許多其他資料庫中的預設設定[^10]。
最常見的是,資料庫透過使用行級鎖來防止髒寫:當事務想要修改特定行(或文件或其他物件)時,它必須首先獲取該行的鎖。然後它必須持有該鎖直到事務提交或中止。任何給定行只能有一個事務持有鎖;如果另一個事務想要寫入同一行,它必須等到第一個事務提交或中止後才能獲取鎖並繼續。這種鎖定由資料庫在讀已提交模式(或更強的隔離級別)下自動完成。
我們如何防止髒讀?一種選擇是使用相同的鎖,並要求任何想要讀取行的事務短暫地獲取鎖,然後在讀取後立即再次釋放它。這將確保在行具有髒的、未提交的值時無法進行讀取(因為在此期間鎖將由進行寫入的事務持有)。
然而,要求讀鎖的方法在實踐中效果不佳,因為一個長時間執行的寫事務可以強制許多其他事務等待,直到長時間執行的事務完成,即使其他事務只讀取並且不向資料庫寫入任何內容。這會損害只讀事務的響應時間,並且對可操作性不利:應用程式一個部分的減速可能會由於等待鎖而在應用程式的完全不同部分產生連鎖效應。
儘管如此,在某些資料庫中使用鎖來防止髒讀,例如 IBM Db2 和 Microsoft SQL Server 在 `read_committed_snapshot=off` 設定中[^29]。
防止髒讀的更常用方法是[圖 8-4](#fig_transactions_read_committed) 中說明的方法:對於每個被寫入的行,資料庫記住舊的已提交值和當前持有寫鎖的事務設定的新值。當事務正在進行時,任何其他讀取該行的事務都只是被給予舊值。只有當新值被提交時,事務才會切換到讀取新值(有關更多詳細資訊,請參見["多版本併發控制(MVCC)"](#sec_transactions_snapshot_impl))。
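The two-value approach can be sketched as a tiny in-memory row object. This is an illustration only: the `Row` class and its methods are hypothetical, a real database would make a conflicting writer wait rather than raise an error, and durability is ignored entirely.

```python
class Row:
    """Keeps the last committed value plus at most one uncommitted value."""

    def __init__(self, value):
        self.committed = value
        self.pending = None
        self.writer = None  # ID of the transaction holding the write lock, if any

    def write(self, txid, value):
        if self.writer not in (None, txid):
            # Assumption for the sketch: fail fast instead of blocking on the row lock.
            raise RuntimeError("row is locked by another transaction")
        self.writer = txid
        self.pending = value

    def read(self, txid):
        # The writer sees its own uncommitted value; everyone else sees the
        # old committed value, so there are no dirty reads and no read locks.
        if self.writer == txid:
            return self.pending
        return self.committed

    def commit(self, txid):
        if self.writer == txid:
            self.committed = self.pending
            self.pending, self.writer = None, None

    def abort(self, txid):
        if self.writer == txid:
            self.pending, self.writer = None, None
```

Readers never touch the lock, which is why a long-running write transaction does not delay read-only transactions under this scheme.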
### 快照隔離與可重複讀 {#sec_transactions_snapshot_isolation}
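If you look superficially at read committed isolation, you could be forgiven for thinking that it does everything that a transaction needs to do: it allows aborts (required for atomicity), it prevents reading the incomplete results of transactions, and it prevents concurrent writes from getting intermingled. Indeed, those are useful features, and much stronger guarantees than you can get from a system that has no transactions.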
然而,使用這個隔離級別時,仍然有很多方式可能出現併發錯誤。例如,[圖 8-6](#fig_transactions_item_many_preceders) 說明了讀已提交可能發生的問題。
{{< figure src="/fig/ddia_0806.png" id="fig_transactions_item_many_preceders" caption="圖 8-6. 讀取偏差:Aaliyah 觀察到資料庫處於不一致狀態。" class="w-full my-4" >}}
假設 Aaliyah 在銀行有 1,000 美元的儲蓄,分成兩個賬戶,每個 500 美元。現在一筆事務從她的一個賬戶轉賬 100 美元到另一個賬戶。如果她不幸在該事務處理的同時檢視她的賬戶餘額列表,她可能會看到一個賬戶餘額在收款到達之前(餘額為 500 美元),另一個賬戶在轉出之後(新余額為 400 美元)。對 Aaliyah 來說,現在她的賬戶總共只有 900 美元——似乎 100 美元憑空消失了。
這種異常稱為*讀取偏差*,它是*不可重複讀*的一個例子:如果 Aaliyah 在事務結束時再次讀取賬戶 1 的餘額,她會看到與之前查詢中看到的不同的值(600 美元)。讀取偏差在讀已提交隔離下被認為是可接受的:Aaliyah 看到的賬戶餘額確實是在她讀取它們時已提交的。
--------
> [!NOTE]
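> The term *skew* is unfortunately overloaded: we previously used it in the sense of *an unbalanced workload with hot spots* (see ["Skewed Workloads and Relieving Hot Spots"](/tw/ch7#sec_sharding_skew)), whereas here it means *timing anomaly*.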
--------
在 Aaliyah 的情況下,這不是一個持久的問題,因為如果她幾秒鐘後重新載入線上銀行網站,她很可能會看到一致的賬戶餘額。然而,某些情況不能容忍這種臨時的不一致性:
備份
: 進行備份需要複製整個資料庫,對於大型資料庫可能需要幾個小時。在備份過程執行期間,寫入將繼續對資料庫進行。因此,你最終可能會得到備份的某些部分包含較舊版本的資料,而其他部分包含較新版本。如果你需要從這樣的備份恢復,不一致性(如消失的錢)將變成永久性的。
分析查詢和完整性檢查
: 有時,你可能想要執行掃描資料庫大部分的查詢。此類查詢在分析中很常見(參見["分析與運營系統"](/tw/ch1#sec_introduction_analytics)),或者可能是定期完整性檢查的一部分,以確保一切正常(監控資料損壞)。如果這些查詢在不同時間點觀察資料庫的不同部分,它們很可能返回無意義的結果。
*快照隔離*[^36] 是解決這個問題的最常見方法。其思想是每個事務從資料庫的*一致快照*讀取——也就是說,事務看到事務開始時資料庫中已提交的所有資料。即使資料隨後被另一個事務更改,每個事務也只能看到該特定時間點的舊資料。
快照隔離對於長時間執行的只讀查詢(如備份和分析)來說是一個福音。如果查詢操作的資料在查詢執行的同時發生變化,很難推理查詢的含義。當事務可以看到資料庫的一致快照(凍結在特定時間點)時,理解起來就容易得多。
快照隔離是一個流行的功能:它的變體受到 PostgreSQL、使用 InnoDB 儲存引擎的 MySQL、Oracle、SQL Server 等的支援,儘管詳細行為因系統而異[^29] [^40] [^41]。某些資料庫,如 Oracle、TiDB 和 Aurora DSQL,甚至選擇快照隔離作為它們的最高隔離級別。
#### 多版本併發控制(MVCC) {#sec_transactions_snapshot_impl}
與讀已提交隔離一樣,快照隔離的實現通常使用寫鎖來防止髒寫(參見["實現讀已提交"](#sec_transactions_read_committed_impl)),這意味著進行寫入的事務可以阻止寫入同一行的另一個事務的進度。但是,讀取不需要任何鎖。從效能的角度來看,快照隔離的一個關鍵原則是*讀者永遠不會阻塞寫者,寫者永遠不會阻塞讀者*。這允許資料庫在一致快照上處理長時間執行的讀查詢,同時正常處理寫入,兩者之間沒有任何鎖爭用。
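To implement snapshot isolation, databases use a generalization of the mechanism we saw for preventing dirty reads in [Figure 8-4](#fig_transactions_read_committed). Instead of two versions of each row (the committed version and the overwritten-but-not-yet-committed version), the database must potentially keep several different committed versions of a row, because various in-progress transactions may need to see the state of the database at different points in time. Because it maintains several versions of a row side by side, this technique is known as *multi-version concurrency control* (MVCC).
[Figure 8-7](#fig_transactions_mvcc) illustrates how MVCC-based snapshot isolation is implemented in PostgreSQL[^40] [^42] [^43] (other implementations are similar). When a transaction is started, it is given a unique, always-increasing transaction ID (`txid`). Whenever a transaction writes anything to the database, the data it writes is tagged with the transaction ID of the writer. (To be precise, transaction IDs in PostgreSQL are 32-bit integers, so they overflow after approximately 4 billion transactions. The vacuum process performs cleanup to ensure that overflow does not affect the data.)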
{{< figure src="/fig/ddia_0807.png" id="fig_transactions_mvcc" caption="圖 8-7. 使用多版本併發控制實現快照隔離。" class="w-full my-4" >}}
表中的每一行都有一個 `inserted_by` 欄位,包含將此行插入表中的事務的 ID。此外,每行都有一個 `deleted_by` 欄位,最初為空。如果事務刪除一行,該行實際上不會從資料庫中刪除,而是透過將 `deleted_by` 欄位設定為請求刪除的事務的 ID 來標記為刪除。在稍後的某個時間,當確定沒有事務可以再訪問已刪除的資料時,資料庫中的垃圾收集過程會刪除任何標記為刪除的行並釋放它們的空間。
更新在內部被轉換為刪除和插入[^44]。例如,在[圖 8-7](#fig_transactions_mvcc) 中,事務 13 從賬戶 2 中扣除 100 美元,將餘額從 500 美元更改為 400 美元。`accounts` 表現在實際上包含賬戶 2 的兩行:餘額為 500 美元的行被事務 13 標記為已刪除,餘額為 400 美元的行由事務 13 插入。
行的所有版本都儲存在同一個資料庫堆中(參見["在索引中儲存值"](/tw/ch4#sec_storage_index_heap)),無論寫入它們的事務是否已提交。同一行的版本形成一個連結串列,從最新版本到最舊版本或相反,以便查詢可以在內部迭代行的所有版本[^45] [^46]。
#### 觀察一致快照的可見性規則 {#sec_transactions_mvcc_visibility}
當事務從資料庫讀取時,事務 ID 用於決定它可以看到哪些行版本以及哪些是不可見的。透過仔細定義可見性規則,資料庫可以嚮應用程式呈現資料庫的一致快照。這大致如下工作[^43]:
1. 在每個事務開始時,資料庫列出當時正在進行(尚未提交或中止)的所有其他事務。這些事務所做的任何寫入都被忽略,即使事務隨後提交。這確保我們看到一個不受另一個事務提交影響的一致快照。
2. 具有較晚事務 ID(即在當前事務開始後開始,因此不包括在正在進行的事務列表中)的事務所做的任何寫入都被忽略,無論這些事務是否已提交。
3. 中止事務所做的任何寫入都被忽略,無論該中止何時發生。這樣做的好處是,當事務中止時,我們不需要立即從儲存中刪除它寫入的行,因為可見性規則會將它們過濾掉。垃圾收集過程可以稍後刪除它們。
4. 所有其他寫入對應用程式的查詢可見。
這些規則適用於行的插入和刪除。在[圖 8-7](#fig_transactions_mvcc) 中,當事務 12 從賬戶 2 讀取時,它看到 500 美元的餘額,因為 500 美元餘額的刪除是由事務 13 進行的(根據規則 2,事務 12 無法看到事務 13 進行的刪除),而 400 美元餘額的插入尚不可見(根據相同的規則)。
換句話說,如果以下兩個條件都為真,則行是可見的:
* 在讀者事務開始時,插入該行的事務已經提交。
* 該行未標記為刪除,或者如果是,請求刪除的事務在讀者事務開始時尚未提交。
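The visibility rules can be written down as a pair of small functions. This is a simplified sketch rather than PostgreSQL's actual logic: `RowVersion`, `Snapshot`, and both function names are hypothetical, the set of aborted transactions is passed in explicitly, and the rule that a transaction sees its own writes is an added assumption not spelled out in the list above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RowVersion:
    value: int
    inserted_by: int                  # txid that created this version
    deleted_by: Optional[int] = None  # txid that marked it deleted, if any


@dataclass(frozen=True)
class Snapshot:
    txid: int                             # ID of the reading transaction
    in_progress: frozenset = frozenset()  # txids still running when it started


def writer_is_visible(writer_txid, snap, aborted):
    """Apply rules 1 to 4 to a single write made by writer_txid."""
    if writer_txid == snap.txid:
        return True   # assumption: a transaction sees its own writes
    if writer_txid in snap.in_progress:
        return False  # rule 1: was in progress when the snapshot was taken
    if writer_txid > snap.txid:
        return False  # rule 2: started after the reader
    if writer_txid in aborted:
        return False  # rule 3: aborted, whenever that happened
    return True       # rule 4: everything else is visible


def row_is_visible(row, snap, aborted=frozenset()):
    """A version is visible if its insert is visible and its delete (if any) is not."""
    if not writer_is_visible(row.inserted_by, snap, aborted):
        return False
    if row.deleted_by is None:
        return True
    return not writer_is_visible(row.deleted_by, snap, aborted)
```

For the account 2 example, a reader with transaction ID 12 sees the version deleted by transaction 13 (the 500 balance) and does not see the version inserted by transaction 13 (the 400 balance); a later reader that starts after transaction 13 has committed sees the opposite.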
長時間執行的事務可能會長時間繼續使用快照,繼續讀取(從其他事務的角度來看)早已被覆蓋或刪除的值。透過永遠不更新原地的值,而是在每次更改值時插入新版本,資料庫可以提供一致的快照,同時只產生很小的開銷。
#### 索引與快照隔離 {#indexes-and-snapshot-isolation}
索引如何在多版本資料庫中工作?最常見的方法是每個索引條目指向與該條目匹配的行的一個版本(最舊或最新版本)。每個行版本可能包含對下一個最舊或下一個最新版本的引用。使用索引的查詢必須迭代行以找到可見的行,並且值與查詢要查詢的內容匹配。當垃圾收集刪除不再對任何事務可見的舊行版本時,相應的索引條目也可以被刪除。
許多實現細節影響多版本併發控制的效能[^45] [^46]。例如,如果同一行的不同版本可以適合同一頁面,PostgreSQL 有避免索引更新的最佳化[^40]。其他一些資料庫避免儲存修改行的完整副本,而只儲存版本之間的差異以節省空間。
CouchDB、Datomic 和 LMDB 使用另一種方法。儘管它們也使用 B 樹(參見["B 樹"](/tw/ch4#sec_storage_b_trees)),但它們使用*不可變*(寫時複製)變體,在更新時不會覆蓋樹的頁面,而是建立每個修改頁面的新副本。父頁面,直到樹的根,被複制並更新以指向其子頁面的新版本。任何不受寫入影響的頁面都不需要複製,並且可以與新樹共享[^47]。
使用不可變 B 樹,每個寫事務(或事務批次)都會建立一個新的 B 樹根,特定的根是建立時資料庫的一致快照。不需要基於事務 ID 過濾行,因為後續寫入無法修改現有的 B 樹;它們只能建立新的樹根。這種方法還需要後臺程序進行壓縮和垃圾收集。
#### 快照隔離、可重複讀和命名混淆 {#snapshot-isolation-repeatable-read-and-naming-confusion}
MVCC 是資料庫常用的實現技術,通常用於實現快照隔離。然而,不同的資料庫有時使用不同的術語來指代同一件事:例如,快照隔離在 PostgreSQL 中稱為"可重複讀",在 Oracle 中稱為"可序列化"[^29]。有時不同的系統使用相同的術語來表示不同的東西:例如,雖然在 PostgreSQL 中"可重複讀"意味著快照隔離,但在 MySQL 中它意味著比快照隔離更弱一致性的 MVCC 實現[^41]。
這種命名混淆的原因是 SQL 標準沒有快照隔離的概念,因為該標準基於 System R 1975 年的隔離級別定義[^3],而快照隔離當時還沒有被髮明。相反,它定義了可重複讀,表面上看起來類似於快照隔離。PostgreSQL 將其快照隔離級別稱為"可重複讀",因為它符合標準的要求,因此他們可以聲稱符合標準。
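Unfortunately, the SQL standard's definition of isolation levels is flawed: it is ambiguous, imprecise, and not as implementation-independent as a standard should be[^36]. Even though several databases implement repeatable read, there are big differences in the guarantees they actually provide, despite being ostensibly standardized[^29]. There has been a formal definition of repeatable read in the research literature[^37] [^38], but most implementations don't satisfy that formal definition. And to top it off, IBM Db2 uses "repeatable read" to refer to serializability[^10].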
因此,沒有人真正知道可重複讀意味著什麼。
### 防止丟失更新 {#sec_transactions_lost_update}
到目前為止,我們討論的讀已提交和快照隔離級別主要是關於只讀事務在併發寫入存在的情況下可以看到什麼的保證。我們大多忽略了兩個事務併發寫入的問題——我們只討論了髒寫(參見["沒有髒寫"](#sec_transactions_dirty_write)),這是可能發生的一種特定型別的寫-寫衝突。
併發寫入事務之間還可能發生其他幾種有趣的衝突。其中最著名的是*丟失更新*問題,在[圖 8-1](#fig_transactions_increment) 中以兩個併發計數器遞增的例子說明。
如果應用程式從資料庫讀取某個值,修改它,然後寫回修改後的值(*讀-修改-寫迴圈*),就會出現丟失更新問題。如果兩個事務併發執行此操作,其中一個修改可能會丟失,因為第二個寫入不包括第一個修改。(我們有時說後面的寫入*覆蓋*了前面的寫入。)這種模式出現在各種不同的場景中:
* 遞增計數器或更新賬戶餘額(需要讀取當前值,計算新值,並寫回更新的值)
* 對複雜值進行本地更改,例如,向 JSON 文件中的列表新增元素(需要解析文件,進行更改,並寫回修改後的文件)
* 兩個使用者同時編輯 wiki 頁面,每個使用者透過將整個頁面內容傳送到伺服器來儲存他們的更改,覆蓋資料庫中當前的任何內容
因為這是一個如此常見的問題,已經開發了各種解決方案[^48]。
#### 原子寫操作 {#atomic-write-operations}
許多資料庫提供原子更新操作,消除了在應用程式程式碼中實現讀-修改-寫迴圈的需要。如果你的程式碼可以用這些操作來表達,它們通常是最好的解決方案。例如,以下指令在大多數關係資料庫中是併發安全的:
```sql
UPDATE counters SET value = value + 1 WHERE key = 'foo';
```
類似地,文件資料庫(如 MongoDB)提供原子操作來對 JSON 文件的一部分進行本地修改,Redis 提供原子操作來修改資料結構(如優先順序佇列)。並非所有寫入都可以輕鬆地用原子操作來表達——例如,對 wiki 頁面的更新涉及任意文字編輯,可以使用["CRDT 和操作轉換"](/tw/ch6#sec_replication_crdts)中討論的演算法來處理——但在可以使用原子操作的情況下,它們通常是最佳選擇。
原子操作通常透過在讀取物件時對其進行獨佔鎖來實現,以便在應用更新之前沒有其他事務可以讀取它。另一種選擇是簡單地強制所有原子操作在單個執行緒上執行。
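The `UPDATE` statement above can be exercised from application code without ever reading the counter into the application. The sketch below uses Python's built-in `sqlite3` module; the `counters` table layout and the two helper functions are assumptions for illustration, and the column name `key` is quoted only to avoid any clash with SQL keywords.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE counters ("key" TEXT PRIMARY KEY, value INTEGER NOT NULL)')
conn.execute("""INSERT INTO counters ("key", value) VALUES ('foo', 42)""")
conn.commit()


def atomic_increment(conn, key):
    """Let the database perform the read-modify-write as one statement."""
    with conn:
        conn.execute('UPDATE counters SET value = value + 1 WHERE "key" = ?', (key,))


def read_counter(conn, key):
    row = conn.execute('SELECT value FROM counters WHERE "key" = ?', (key,)).fetchone()
    return row[0]
```

Because the application never holds a stale copy of the value, there is no window in which another client's increment can be overwritten.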
不幸的是,物件關係對映(ORM)框架很容易意外地編寫執行不安全的讀-修改-寫迴圈的程式碼,而不是使用資料庫提供的原子操作[^49] [^50] [^51]。這可能是難以透過測試發現的微妙錯誤的來源。
#### 顯式鎖定 {#explicit-locking}
如果資料庫的內建原子操作不提供必要的功能,另一個防止丟失更新的選項是應用程式顯式鎖定要更新的物件。然後應用程式可以執行讀-修改-寫迴圈,如果任何其他事務嘗試併發更新或鎖定同一物件,它將被迫等到第一個讀-修改-寫迴圈完成。
例如,考慮一個多人遊戲,其中幾個玩家可以同時移動同一個棋子。在這種情況下,原子操作可能不夠,因為應用程式還需要確保玩家的移動遵守遊戲規則,這涉及一些你無法合理地作為資料庫查詢實現的邏輯。相反,你可以使用鎖來防止兩個玩家同時移動同一個棋子,如[例 8-1](#fig_transactions_select_for_update) 所示。
{{< figure id="fig_transactions_select_for_update" title="例 8-1. 顯式鎖定行以防止丟失更新" class="w-full my-4" >}}
```sql
BEGIN TRANSACTION;
SELECT * FROM figures
WHERE name = 'robot' AND game_id = 222
FOR UPDATE; ❶
-- 檢查移動是否有效,然後更新
-- 前一個 SELECT 返回的棋子的位置。
UPDATE figures SET position = 'c4' WHERE id = 1234;
COMMIT;
```
❶:`FOR UPDATE` 子句表示資料庫應該對此查詢返回的所有行進行鎖定。
這是有效的,但要正確執行,你需要仔細考慮你的應用程式邏輯。很容易忘記在程式碼中的某個地方新增必要的鎖,從而引入競態條件。
此外,如果你鎖定多個物件,則存在死鎖的風險,其中兩個或多個事務正在等待彼此釋放鎖。許多資料庫會自動檢測死鎖,並中止涉及的事務之一,以便系統可以取得進展。你可以在應用程式級別透過重試中止的事務來處理這種情況。
#### 自動檢測丟失的更新 {#automatically-detecting-lost-updates}
原子操作和鎖是透過強制讀-修改-寫迴圈按順序發生來防止丟失更新的方法。另一種選擇是允許它們並行執行,如果事務管理器檢測到丟失的更新,則中止事務並強制它重試其讀-修改-寫迴圈。
這種方法的一個優點是資料庫可以與快照隔離一起有效地執行此檢查。實際上,PostgreSQL 的可重複讀、Oracle 的可序列化和 SQL Server 的快照隔離級別會自動檢測何時發生丟失的更新並中止有問題的事務。然而,MySQL/InnoDB 的可重複讀不檢測丟失的更新[^29] [^41]。一些作者[^36] [^38] 認為資料庫必須防止丟失的更新才能提供快照隔離,因此根據這個定義,MySQL 不提供快照隔離。
丟失更新檢測是一個很好的功能,因為它不需要應用程式程式碼使用任何特殊的資料庫功能——你可能忘記使用鎖或原子操作從而引入錯誤,但丟失更新檢測會自動發生,因此不太容易出錯。但是,你還必須在應用程式級別重試中止的事務。
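#### Conditional writes (compare-and-set) {#sec_transactions_compare_and_set}
In databases that do not provide transactions, you sometimes find a *conditional write* operation that can prevent lost updates by allowing an update to happen only if the value has not changed since you last read it (previously mentioned in ["Single-object writes"](#sec_transactions_single_object)). If the current value does not match what you previously read, the update has no effect, and the read-modify-write cycle must be retried. It is the database equivalent of the atomic *compare-and-set* or *compare-and-swap* (CAS) instruction supported by many CPUs.
For example, to prevent two users from concurrently updating the same wiki page, you might try something like this, expecting the update to occur only if the content of the page has not changed since the user started editing it: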
```sql
-- This may or may not be safe, depending on the database implementation
UPDATE wiki_pages SET content = 'new content'
WHERE id = 1234 AND content = 'old content';
```
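If the content has changed and no longer matches `'old content'`, this update has no effect, so you need to check whether the update took effect and retry if necessary. Instead of comparing the full content, you could also use a version number column that is incremented on every update, and apply the update only if the current version number has not changed. This approach is sometimes called *optimistic locking*[^52].
Note that if another transaction has concurrently modified `content`, the new content may not be visible under the MVCC visibility rules (see ["Visibility rules for observing a consistent snapshot"](#sec_transactions_mvcc_visibility)). Many MVCC implementations have an exception to the visibility rules for this scenario, in which values written by other transactions are visible to the evaluation of the `WHERE` clause of `UPDATE` and `DELETE` queries, even though those writes are not otherwise visible in the snapshot.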
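The version-number variant can be sketched with `sqlite3` as follows. The `version` column and the helper `save_page` are assumptions made for this example, not part of the schema shown above; the point is that the application inspects the affected-row count to find out whether its conditional write took effect.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE wiki_pages ("
    "id INTEGER PRIMARY KEY, content TEXT, version INTEGER NOT NULL)"
)
conn.execute("INSERT INTO wiki_pages VALUES (1234, 'old content', 1)")
conn.commit()

def save_page(conn, page_id, new_content, version_read):
    """Apply the edit only if nobody else updated the page since we read it."""
    cur = conn.execute(
        "UPDATE wiki_pages SET content = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (new_content, page_id, version_read),
    )
    conn.commit()
    # rowcount is 0 if the version no longer matches: the caller must
    # re-read the page and retry its read-modify-write cycle.
    return cur.rowcount == 1

# Both users read version 1, then try to save their edits.
first = save_page(conn, 1234, "edit by user A", version_read=1)
second = save_page(conn, 1234, "edit by user B", version_read=1)
stored = conn.execute("SELECT content FROM wiki_pages WHERE id = 1234").fetchone()[0]
```

Here the first save succeeds and the second is rejected, so user B's edit is not silently applied on top of a page it never saw.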
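#### Conflict resolution and replication {#conflict-resolution-and-replication}
In replicated databases (see [Chapter 6](/tw/ch6#ch_replication)), preventing lost updates takes on another dimension: since they have copies of the data on multiple nodes, and the data can potentially be modified concurrently on different nodes, some additional steps are needed to prevent lost updates.
Locks and conditional write operations assume that there is a single up-to-date copy of the data. However, databases with multi-leader or leaderless replication usually allow several writes to happen concurrently and replicate them asynchronously, so they cannot guarantee that there is a single up-to-date copy of the data. Techniques based on locks or conditional writes therefore do not apply in this context. (We will revisit this issue in more detail in ["Linearizability"](/tw/ch10#sec_consistency_linearizability).)
Instead, as discussed in ["Dealing with Conflicting Writes"](/tw/ch6#sec_replication_write_conflicts), a common approach in such replicated databases is to allow concurrent writes to create several conflicting versions of a value (also known as *siblings*), and to use application code or special data structures to resolve and merge these versions after the fact.
Merging conflicting values can prevent lost updates if the updates are commutative (that is, you can apply them in a different order on different replicas and still get the same result). For example, incrementing a counter or adding an element to a set are commutative operations. That is the idea behind CRDTs, which we encountered in ["CRDTs and Operational Transformation"](/tw/ch6#sec_replication_crdts). However, some operations, such as conditional writes, cannot be made commutative.
On the other hand, the *last write wins* (LWW) conflict resolution method is prone to lost updates, as discussed in ["Last write wins (discarding concurrent writes)"](/tw/ch6#sec_replication_lww). Unfortunately, LWW is the default in many replicated databases.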
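The contrast between a commutative merge and LWW can be shown with a toy example. The sketch below assumes a grow-only counter in which each replica tracks its own increments; the replica names, timestamps, and function names are invented for illustration.

```python
def merge_gcounter(a, b):
    """Merge two grow-only counters by taking the per-replica maximum."""
    return {node: max(a.get(node, 0), b.get(node, 0)) for node in a.keys() | b.keys()}

def gcounter_value(counter):
    return sum(counter.values())

def merge_lww(a, b):
    """Last write wins: keep the (timestamp, value) pair with the larger timestamp."""
    return max(a, b)

# Replica "a" applies one increment, replica "b" concurrently applies two.
crdt_a = {"a": 1, "b": 0}
crdt_b = {"a": 0, "b": 2}
merged = merge_gcounter(crdt_a, crdt_b)
crdt_total = gcounter_value(merged)

# With LWW each replica stores a plain value plus a timestamp.
lww_a = (10, 1)   # (timestamp, counter value) written on replica "a"
lww_b = (11, 2)   # concurrent write on replica "b"
lww_total = merge_lww(lww_a, lww_b)[1]
```

The merged counter reflects all three increments regardless of merge order, whereas LWW keeps only the value 2 and drops the increment made on replica "a".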
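### Write Skew and Phantoms {#sec_transactions_write_skew}
In the previous sections we saw *dirty writes* and *lost updates*, two kinds of race conditions that can occur when different transactions concurrently try to write to the same objects. To avoid data corruption, those race conditions need to be prevented, either automatically by the database or by manual safeguards such as locks or atomic write operations.
However, that is not the end of the list of potential race conditions that can occur between concurrent writes. In this section we will see some subtler examples of conflicts.
To begin, imagine this example: you are writing an application for doctors to manage their on-call shifts at a hospital. The hospital usually tries to have several doctors on call at any one time, but it absolutely must have at least one doctor on call. Doctors can give up their shifts (for example, if they are sick themselves), provided that at least one colleague remains on call in that shift[^53] [^54].
Now imagine that Aaliyah and Bryce are the two on-call doctors for a particular shift. Both are feeling unwell, so they both decide to request leave. Unfortunately, they happen to click the button to go off call at approximately the same time. What happens next is illustrated in [Figure 8-8](#fig_transactions_write_skew).
{{< figure src="/fig/ddia_0808.png" id="fig_transactions_write_skew" caption="Figure 8-8. Example of write skew causing an application bug." class="w-full my-4" >}}
In each transaction, your application first checks that two or more doctors are currently on call; if so, it assumes it is safe for one doctor to go off call. Since the database is using snapshot isolation, both checks return `2`, so both transactions proceed to the next stage. Aaliyah updates her own record to take herself off call, and Bryce updates his own record likewise. Both transactions commit, and now no doctor is on call. Your requirement of having at least one doctor on call has been violated.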
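The interleaving described above can be replayed with a toy in-memory model of snapshot isolation. Everything here is a simplification for illustration: the class `SnapshotTxn` is hypothetical, it copies the whole database as its snapshot, and it skips write-write conflict checks because the two transactions write different rows.

```python
import copy

class SnapshotTxn:
    """Toy transaction: reads come from a snapshot taken at start time."""

    def __init__(self, db):
        self.db = db
        self.snapshot = copy.deepcopy(db)
        self.writes = {}

    def count_on_call(self):
        return sum(1 for doctor in self.snapshot.values() if doctor["on_call"])

    def go_off_call(self, name):
        # The application's check: only leave if a colleague stays on call.
        if self.count_on_call() >= 2:
            self.writes[name] = {"on_call": False}

    def commit(self):
        # Different rows are written, so plain snapshot isolation sees no conflict.
        self.db.update(self.writes)

db = {"Aaliyah": {"on_call": True}, "Bryce": {"on_call": True}}

t1 = SnapshotTxn(db)          # Aaliyah's request
t2 = SnapshotTxn(db)          # Bryce's request, started concurrently
t1.go_off_call("Aaliyah")     # sees 2 doctors on call
t2.go_off_call("Bryce")       # also sees 2 doctors on call
t1.commit()
t2.commit()

on_call_after = sum(1 for doctor in db.values() if doctor["on_call"])
```

Both checks pass against their own snapshot, both writes are applied, and the invariant is broken even though each transaction is correct when run alone.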
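#### Characterizing write skew {#characterizing-write-skew}
This anomaly is called *write skew*[^36]. It is neither a dirty write nor a lost update, because the two transactions are updating two different objects (Aaliyah's and Bryce's on-call records, respectively). It is less obvious that a conflict occurred here, but it is definitely a race condition: if the two transactions had run one after another, the second doctor would have been prevented from going off call. The anomalous behavior was possible only because the transactions ran concurrently.
You can think of write skew as a generalization of the lost update problem. Write skew can occur if two transactions read the same objects and then update some of those objects (different transactions may update different objects). In the special case where different transactions update the same object, you get a dirty write or lost update anomaly (depending on the timing).
We saw that there are various ways of preventing lost updates. With write skew, our options are more restricted:
* Atomic single-object operations do not help, because multiple objects are involved.
* The automatic detection of lost updates that you find in some implementations of snapshot isolation unfortunately does not help either: write skew is not automatically detected in PostgreSQL's repeatable read, MySQL/InnoDB's repeatable read, Oracle's serializable, or SQL Server's snapshot isolation level[^29]. Automatically preventing write skew requires true serializable isolation (see ["Serializability"](#sec_transactions_serializability)).
* Some databases allow you to configure constraints, which are then enforced by the database (for example, uniqueness, foreign key constraints, or restrictions on a particular value). However, in order to specify that at least one doctor must be on call, you would need a constraint that involves multiple objects. Most databases do not have built-in support for such constraints, but you may be able to implement them with triggers or materialized views, as discussed in ["Consistency"](#sec_transactions_acid_consistency)[^12].
* If you cannot use a serializable isolation level, the second-best option in this case is probably to explicitly lock the rows that the transaction depends on. In the doctors example, you could write something like the following: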
```sql
BEGIN TRANSACTION;
SELECT * FROM doctors
WHERE on_call = true
AND shift_id = 1234 FOR UPDATE; ❶
UPDATE doctors
SET on_call = false
WHERE name = 'Aaliyah'
AND shift_id = 1234;
COMMIT;
```
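❶: As before, `FOR UPDATE` tells the database to lock all rows returned by this query.
#### More examples of write skew {#more-examples-of-write-skew}
Write skew may seem like an esoteric issue at first, but once you are aware of it, you may notice more situations in which it can occur. Here are some more examples:
Meeting room booking system
: Say you want to enforce that there cannot be two bookings for the same meeting room at the same time[^55]. When someone wants to make a booking, you first check for any conflicting bookings (that is, bookings for the same room with an overlapping time range), and if none are found, you create the meeting (see [Example 8-2](#fig_transactions_meeting_rooms)).
{{< figure id="fig_transactions_meeting_rooms" title="Example 8-2. A meeting room booking system tries to avoid double-booking (not safe under snapshot isolation)" class="w-full my-4" >}}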
```sql
BEGIN TRANSACTION;
-- Check for any existing bookings that overlap with the period of noon to 1pm
SELECT COUNT(*) FROM bookings
WHERE room_id = 123 AND
end_time > '2025-01-01 12:00' AND start_time < '2025-01-01 13:00';
-- If the previous query returned zero:
INSERT INTO bookings (room_id, start_time, end_time, user_id)
VALUES (123, '2025-01-01 12:00', '2025-01-01 13:00', 666);
COMMIT;
```
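Unfortunately, snapshot isolation does not prevent another user from concurrently inserting a conflicting meeting. To guarantee that you will not get scheduling conflicts, you once again need serializable isolation.
Multiplayer game
: In [Example 8-1](#fig_transactions_select_for_update), we used a lock to prevent lost updates (that is, to make sure that two players cannot move the same figure at the same time). However, the lock does not prevent players from moving two different figures to the same position on the board, or potentially making some other move that violates the rules of the game. Depending on the kind of rule you are enforcing, you might be able to use a unique constraint, but otherwise you are vulnerable to write skew.
Claiming a username
: On a website where each user has a unique username, two users may try to create accounts with the same username at the same time. You could use a transaction to check whether a name is taken and, if not, create an account with that name. However, as in the previous examples, that is not safe under snapshot isolation. Fortunately, a unique constraint is a simple solution here (the second transaction that tries to register the username will be aborted due to violating the constraint).
Preventing double-spending
: A service that allows users to spend money or points needs to check that a user does not spend more than they have. You might implement this by inserting a tentative spending item into a user's account, listing all the items in the account, and checking that the sum is positive. With write skew, it could happen that two spending items are inserted concurrently that together cause the balance to go negative, but neither transaction notices the other.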
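For the username case, the unique constraint does the work that the application-level check cannot do safely. A minimal `sqlite3` sketch follows; the `users` table and the helper `claim_username` are assumptions for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The PRIMARY KEY acts as the unique constraint on usernames.
conn.execute("CREATE TABLE users (username TEXT PRIMARY KEY)")
conn.commit()

def claim_username(conn, username):
    """Try to register a username; rely on the constraint instead of a prior SELECT."""
    try:
        conn.execute("INSERT INTO users (username) VALUES (?)", (username,))
        conn.commit()
        return True
    except sqlite3.IntegrityError:
        # Another transaction already holds this name: the insert is rejected.
        conn.rollback()
        return False

first_claim = claim_username(conn, "alice")
second_claim = claim_username(conn, "alice")
```

There is deliberately no "check whether the name is free" query: the insert itself is the check, so there is no window between checking and writing.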
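#### Phantoms causing write skew {#sec_transactions_phantom}
All of these examples follow a similar pattern:
1. A `SELECT` query checks whether some requirement is satisfied by searching for rows that match some search condition (there are at least two doctors on call, there are no existing bookings for that room at that time, the position on the board does not already have another figure on it, the username is not already taken, there is still money in the account).
2. Depending on the result of the first query, the application code decides how to continue (perhaps to go ahead with the operation, or perhaps to report an error to the user and abort).
3. If the application decides to go ahead, it makes a write (`INSERT`, `UPDATE`, or `DELETE`) to the database and commits the transaction.
The effect of this write changes the precondition of the decision in step 2. In other words, if you were to repeat the `SELECT` query from step 1 after committing the write, you would get a different result, because the write changed the set of rows matching the search condition (there is now one fewer doctor on call, the meeting room is now booked for that time, the position on the board is now taken by the figure that was moved, the username is now taken, there is now less money in the account).
The steps may occur in a different order. For example, you could first make the write, then run the `SELECT` query, and finally decide whether to abort or commit based on the result of the query.
In the case of the doctor on-call example, the row being modified in step 3 was one of the rows returned in step 1, so we could make the transaction safe and avoid write skew by locking the rows in step 1 (`SELECT FOR UPDATE`). However, the other four examples are different: they check for the *absence* of rows matching some search condition, and the write *adds* a row matching the same condition. If the query in step 1 does not return any rows, `SELECT FOR UPDATE` cannot attach locks to anything[^56].
This effect, where a write in one transaction changes the result of a search query in another transaction, is called a *phantom*[^4]. Snapshot isolation avoids phantoms in read-only queries, but in read-write transactions like the examples we discussed, phantoms can lead to particularly tricky cases of write skew. The SQL generated by ORMs is also prone to write skew[^50] [^51].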
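#### Materializing conflicts {#materializing-conflicts}
If the problem with phantoms is that there is no object to which we can attach the locks, perhaps we can artificially introduce a lock object into the database?
For example, in the meeting room booking case you could imagine creating a table of time slots and rooms. Each row in this table corresponds to a particular room for a particular time period (say, 15 minutes). You create rows for all possible combinations of rooms and time periods ahead of time, for example for the next six months.
Now a transaction that wants to create a booking can lock (`SELECT FOR UPDATE`) the rows in the table that correspond to the desired room and time period. After it has acquired the locks, it can check for overlapping bookings and insert a new booking as before. Note that the additional table is not used to store information about the booking: it is purely a collection of locks that is used to prevent bookings for the same room and time range from being modified concurrently.
This approach is called *materializing conflicts*, because it takes a phantom and turns it into a lock conflict on a concrete set of rows that exist in the database[^14]. Unfortunately, it can be hard and error-prone to figure out how to materialize conflicts, and it is ugly to let a concurrency control mechanism leak into the application data model. For those reasons, materializing conflicts should be considered a last resort if no alternative is possible. In most cases, a serializable isolation level is much preferable.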
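To make the idea concrete, the sketch below computes which slot rows a booking would have to lock, assuming 15-minute slots aligned to the hour. The function `slots_to_lock` and the `(room_id, slot_start)` key layout are hypothetical; the text does not prescribe a schema for the slot table.

```python
from datetime import datetime, timedelta

SLOT = timedelta(minutes=15)

def slots_to_lock(room_id, start, end):
    """Return the (room_id, slot_start) keys of the slot rows covering [start, end)."""
    midnight = start.replace(hour=0, minute=0, second=0, microsecond=0)
    # Round the start time down to the beginning of its slot.
    slot = midnight + ((start - midnight) // SLOT) * SLOT
    keys = []
    while slot < end:
        keys.append((room_id, slot))
        slot += SLOT
    return keys

noon_booking = slots_to_lock(
    123, datetime(2025, 1, 1, 12, 0), datetime(2025, 1, 1, 13, 0)
)
late_booking = slots_to_lock(
    123, datetime(2025, 1, 1, 12, 50), datetime(2025, 1, 1, 13, 10)
)
shared_slots = set(noon_booking) & set(late_booking)
```

Two overlapping bookings always share at least one slot key, so the second transaction blocks on the row lock that the first one holds, and the phantom becomes an ordinary lock conflict.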
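## Serializability {#sec_transactions_serializability}
In this chapter we have seen several examples of transactions that are prone to race conditions. Some race conditions are prevented by the read committed and snapshot isolation levels, but others are not. We encountered some particularly tricky examples with write skew and phantoms. It is a sad situation:
* Isolation levels are hard to understand, and inconsistently implemented in different databases (for example, the meaning of "repeatable read" varies significantly).
* If you look at your application code, it is difficult to tell whether it is safe to run at a particular isolation level, especially in a large application, where you might not be aware of all the things that may be happening concurrently.
* There are no good tools to help us detect race conditions. In principle, static analysis may help[^33], but research techniques have not yet found their way into practical use. Testing for concurrency issues is hard, because they are usually nondeterministic: problems occur only if you get unlucky with the timing.
This is not a new problem: it has been like this since the 1970s, when weak isolation levels were first introduced[^3]. All along, the answer from researchers has been simple: use *serializable* isolation!
Serializable isolation is the strongest isolation level. It guarantees that even though transactions may execute in parallel, the end result is the same as if they had executed *serially* (one at a time, without any concurrency). Thus, the database guarantees that if the transactions behave correctly when run individually, they continue to be correct when run concurrently. In other words, the database prevents *all* possible race conditions.
But if serializable isolation is so much better than the mess of weak isolation levels, why isn't everyone using it? To answer this question, we need to look at the options for implementing serializability, and how they perform. Most databases that provide serializability today use one of the following three techniques, which we will explore in the rest of this chapter:
* Literally executing transactions in a serial order (see ["Actual Serial Execution"](#sec_transactions_serial))
* Two-phase locking (see ["Two-Phase Locking (2PL)"](#sec_transactions_2pl)), which for several decades was the only viable option
* Optimistic concurrency control techniques such as serializable snapshot isolation (see ["Serializable Snapshot Isolation (SSI)"](#sec_transactions_ssi))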
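### Actual Serial Execution {#sec_transactions_serial}
The simplest way of avoiding concurrency problems is to remove the concurrency entirely: to execute only one transaction at a time, in serial order, on a single thread. By doing so, we completely sidestep the problem of detecting and preventing conflicts between transactions: the resulting isolation is by definition serializable.
Even though this seems like an obvious idea, it was only in the 2000s that database designers decided that a single-threaded loop for executing transactions was feasible[^57]. If multi-threaded concurrency was considered essential for getting good performance during the previous 30 years, what changed to make single-threaded execution possible?
Two developments caused this rethink:
* RAM became cheap enough that for many use cases it is now feasible to keep the entire active dataset in memory (see ["Keeping everything in memory"](/tw/ch4#sec_storage_inmemory)). When all the data that a transaction needs to access is in memory, transactions can execute much faster than if they have to wait for data to be loaded from disk.
* Database designers realized that OLTP transactions are usually short and make only a small number of reads and writes (see ["Analytical versus Operational Systems"](/tw/ch1#sec_introduction_analytics)). By contrast, long-running analytic queries are typically read-only, so they can be run on a consistent snapshot (using snapshot isolation) outside of the serial execution loop.
The approach of executing transactions serially is implemented in VoltDB/H-Store, Redis, and Datomic, among others[^58] [^59] [^60]. A system designed for single-threaded execution can sometimes perform better than a system that supports concurrency, because it can avoid the coordination overhead of locking. However, its throughput is limited to that of a single CPU core. In order to make the most of that single thread, transactions need to be structured differently from their traditional form.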
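#### Encapsulating transactions in stored procedures {#encapsulating-transactions-in-stored-procedures}
In the early days of databases, the intention was that a database transaction could encompass an entire flow of user activity. For example, booking an airline ticket is a multi-stage process (searching for routes, fares, and available seats; deciding on an itinerary; booking seats on each of the flights of the itinerary; entering passenger details; making payment). Database designers thought that it would be neat if that entire process was one transaction so that it could be committed atomically.
Unfortunately, humans are very slow to make up their minds and respond. If a database transaction needs to wait for input from a user, the database needs to support a potentially huge number of concurrent transactions, most of them idle. Most databases cannot do that efficiently, and so almost all OLTP applications keep transactions short by avoiding interactively waiting for a user within a transaction. On the web, this means that a transaction is committed within the same HTTP request: a transaction does not span multiple requests. A new HTTP request starts a new transaction.
Even though the human has been taken out of the critical path, transactions have continued to be executed in an interactive client/server style, one statement at a time. An application makes a query, reads the result, perhaps makes another query depending on the result of the first query, and so on. The queries and results are sent back and forth between the application code (running on one machine) and the database server (on another machine).
In this interactive style of transaction, a lot of time is spent in network communication between the application and the database. If you were to disallow concurrency in the database and only process one transaction at a time, the throughput would be dreadful because the database would spend most of its time waiting for the application to issue the next query for the current transaction. In this kind of database, it is necessary to process multiple transactions concurrently in order to get reasonable performance.
For this reason, systems with single-threaded serial transaction processing do not allow interactive multi-statement transactions. Instead, the application must either limit itself to transactions containing a single statement, or submit the entire transaction code to the database ahead of time, as a *stored procedure*[^61].
The differences between interactive transactions and stored procedures are illustrated in [Figure 8-9](#fig_transactions_stored_proc). Provided that all data required by a transaction is in memory, the stored procedure can execute very fast, without waiting for any network or disk I/O.
{{< figure src="/fig/ddia_0809.png" id="fig_transactions_stored_proc" caption="Figure 8-9. The difference between an interactive transaction and a stored procedure (using the example transaction of [Figure 8-8](#fig_transactions_write_skew))." class="w-full my-4" >}}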
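A single-threaded execution loop with stored procedures can be sketched in a few lines of Python. This is a toy model under assumptions: the class `SerialExecutor`, its `call` method, and the procedure `go_off_call` are invented names, the "database" is an in-memory dictionary, and durability and replication are ignored.

```python
import queue
import threading

class SerialExecutor:
    """Runs submitted procedures one at a time on a single worker thread."""

    def __init__(self, state):
        self.state = state
        self.inbox = queue.Queue()
        self.worker = threading.Thread(target=self._run, daemon=True)
        self.worker.start()

    def _run(self):
        while True:
            proc, args, done = self.inbox.get()
            if proc is None:          # shutdown signal
                break
            try:
                done["result"] = proc(self.state, *args)
            except Exception as exc:  # report failures to the caller
                done["error"] = exc
            done["event"].set()

    def call(self, proc, *args):
        """Submit a whole transaction as one procedure and wait for its result."""
        done = {"event": threading.Event()}
        self.inbox.put((proc, args, done))
        done["event"].wait()
        if "error" in done:
            raise done["error"]
        return done["result"]

    def stop(self):
        self.inbox.put((None, (), None))
        self.worker.join()

def go_off_call(state, name):
    # The check and the write run back to back, with no other transaction in between.
    on_call = [doctor for doctor, active in state.items() if active]
    if len(on_call) >= 2 and state.get(name):
        state[name] = False
        return True
    return False

executor = SerialExecutor({"Aaliyah": True, "Bryce": True})
results = []
clients = [
    threading.Thread(target=lambda n=n: results.append(executor.call(go_off_call, n)))
    for n in ("Aaliyah", "Bryce")
]
for client in clients:
    client.start()
for client in clients:
    client.join()
executor.stop()
still_on_call = sum(executor.state.values())
```

Whichever request reaches the queue first succeeds and the other is refused, so exactly one doctor remains on call: the write skew from Figure 8-8 cannot occur.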
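#### Pros and cons of stored procedures {#sec_transactions_stored_proc_tradeoffs}
Stored procedures have existed for some time in relational databases, and they have been part of the SQL standard (SQL/PSM) since 1999. They have gained a somewhat bad reputation, for various reasons:
* Traditionally, each database vendor had its own language for stored procedures (Oracle has PL/SQL, SQL Server has T-SQL, PostgreSQL has PL/pgSQL, and so on). These languages have not kept up with developments in general-purpose programming languages, so they look quite ugly and archaic from today's point of view, and they lack the ecosystem of libraries that you find with most programming languages.
* Code running in a database is difficult to manage: compared to an application server, it is harder to debug, more awkward to keep in version control and deploy, trickier to test, and difficult to integrate with a metrics collection system for monitoring.
* A database is often much more performance-sensitive than an application server, because a single database instance is often shared by many application servers. A badly written stored procedure (for example, one using a lot of memory or CPU time) in a database can cause much more trouble than equivalent badly written code in an application server.
* In a multitenant system that allows tenants to write their own stored procedures, it is a security risk to execute untrusted code in the same process as the database kernel[^62].
However, those issues can be overcome. Modern implementations of stored procedures have abandoned PL/SQL and use existing general-purpose programming languages instead: VoltDB uses Java or Groovy, Datomic uses Java or Clojure, Redis uses Lua, and MongoDB uses JavaScript.
Stored procedures are also useful in cases where application logic cannot easily be embedded elsewhere. Applications that use GraphQL, for example, might directly expose their database through a GraphQL proxy. If the proxy does not support complex validation logic, you can embed such logic directly in the database using a stored procedure. If the database does not support stored procedures, you would have to deploy a validation service between the proxy and the database to do the validation.
With stored procedures and in-memory data, executing all transactions on a single thread becomes feasible. When stored procedures do not need to wait for I/O and avoid the overhead of other concurrency control mechanisms, they can achieve quite good throughput on a single thread.
VoltDB also uses stored procedures for replication: instead of copying a transaction's writes from one node to another, it executes the same stored procedure on each replica. VoltDB therefore requires that stored procedures are *deterministic* (when run on different nodes, they must produce the same result). If a transaction needs to use the current date and time, for example, it must do so through special deterministic APIs (see ["Durable Execution and Workflows"](/tw/ch5#sec_encoding_dataflow_workflows) for more details on deterministic operations). This approach is called *state machine replication*, and we will return to it in [Chapter 10](/tw/ch10#ch_consistency).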
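#### Sharding {#sharding}
Executing all transactions serially makes concurrency control much simpler, but it limits the transaction throughput of the database to the speed of a single CPU core on a single machine. Read-only transactions may execute elsewhere, using snapshot isolation, but for applications with high write throughput, the single-threaded transaction processor can become a serious bottleneck.
In order to scale to multiple CPU cores and multiple nodes, you can shard your data (see [Chapter 7](/tw/ch7#ch_sharding)), which is supported in VoltDB. If you can find a way of sharding your dataset so that each transaction only needs to read and write data within a single shard, then each shard can have its own transaction processing thread running independently from the others. In this case, you can give each CPU core its own shard, which allows your transaction throughput to scale linearly with the number of CPU cores[^59].
However, for any transaction that needs to access multiple shards, the database must coordinate the transaction across all the shards that it touches. The stored procedure needs to be performed in lock-step across all shards to ensure serializability across the whole system.
Since cross-shard transactions have additional coordination overhead, they are vastly slower than single-shard transactions. VoltDB reports a throughput of about 1,000 cross-shard writes per second, which is orders of magnitude below its single-shard throughput and cannot be increased by adding more machines[^61]. More recent research has explored ways of making multi-shard transactions more scalable[^63].
Whether transactions can be single-shard depends very much on the structure of the data used by the application. Simple key-value data can often be sharded very easily, but data with multiple secondary indexes is likely to require a lot of cross-shard coordination (see ["Sharding and Secondary Indexes"](/tw/ch7#sec_sharding_secondary_indexes)).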
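Whether a given transaction is single-shard can be decided from the keys it touches. The sketch below assumes simple hash-based sharding with a fixed `NUM_SHARDS`; the function names are hypothetical and do not reflect how VoltDB or any other specific system routes transactions.

```python
import zlib

NUM_SHARDS = 4   # for example, one shard per CPU core

def shard_for(key):
    """Map a key to a shard using a hash that is stable across processes."""
    return zlib.crc32(key.encode("utf-8")) % NUM_SHARDS

def shards_touched(keys):
    return {shard_for(key) for key in keys}

def is_single_shard(keys):
    """True if one shard's serial execution thread can run the whole transaction."""
    return len(shards_touched(keys)) == 1

# A transaction that reads and writes only one user's record stays on one shard.
single = is_single_shard(["user:42", "user:42"])
# A transfer between arbitrary accounts may or may not need cross-shard coordination.
transfer_shards = shards_touched(["account:alice", "account:bob"])
```

A router built along these lines would send single-shard transactions straight to the owning shard's queue and hand everything else to the slower cross-shard coordination path.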
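#### Summary of serial execution {#summary-of-serial-execution}
Serial execution of transactions has become a viable way of achieving serializable isolation within certain constraints:
* Every transaction must be small and fast, because it takes only one slow transaction to stall all transaction processing.
* It works best when the active dataset fits in memory. Rarely accessed data could be moved to disk, but if it needed to be accessed in a single-threaded transaction, the system would get very slow.
* Write throughput must be low enough to be handled on a single CPU core, or else the data needs to be sharded in such a way that transactions do not require cross-shard coordination.
* Cross-shard transactions are possible, but their throughput is hard to scale.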
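### Two-Phase Locking (2PL) {#sec_transactions_2pl}
For around 30 years, there was only one widely used algorithm for serializability in databases: *two-phase locking* (2PL), sometimes called *strong strict two-phase locking* (SS2PL) to distinguish it from other variants of 2PL.
--------
> [!TIP] 2PL is not 2PC
Two-phase *locking* (2PL) and two-phase *commit* (2PC) are two very different things. 2PL provides serializable isolation, whereas 2PC provides atomic commit in a distributed database (see ["Two-Phase Commit (2PC)"](#sec_transactions_2pc)). To avoid confusion, it is best to think of them as entirely separate concepts and to ignore the unfortunate similarity in the names.
--------
We saw previously that locks are often used to prevent dirty writes (see ["No dirty writes"](#sec_transactions_dirty_write)): if two transactions concurrently try to write to the same object, the lock ensures that the second writer must wait until the first one has finished its transaction (aborted or committed) before it may continue.
Two-phase locking is similar, but makes the lock requirements much stronger. Several transactions are allowed to concurrently read the same object as long as nobody is writing to it. But as soon as anyone wants to write (modify or delete) an object, exclusive access is required:
* If transaction A has read an object and transaction B wants to write to that object, B must wait until A commits or aborts before it can continue. (This ensures that B cannot change the object unexpectedly behind A's back.)
* If transaction A has written an object and transaction B wants to read that object, B must wait until A commits or aborts before it can continue. (Reading an old version of the object, as in [Figure 8-4](#fig_transactions_read_committed), is not acceptable under 2PL.)
In 2PL, writers do not just block other writers; they also block readers, and vice versa. Snapshot isolation has the mantra *readers never block writers, and writers never block readers* (see ["Multi-version concurrency control (MVCC)"](#sec_transactions_snapshot_impl)), which captures this key difference between snapshot isolation and two-phase locking. On the other hand, because 2PL provides serializability, it protects against all the race conditions discussed earlier, including lost updates and write skew.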
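#### Implementation of two-phase locking {#implementation-of-two-phase-locking}
2PL is used by the serializable isolation level in MySQL (InnoDB) and SQL Server, and by the repeatable read isolation level in Db2[^29].
The blocking of readers and writers is implemented by having a lock on each object in the database. The lock can be in either *shared mode* or *exclusive mode* (also known as a *multi-reader single-writer* lock). The lock is used as follows:
* If a transaction wants to read an object, it must first acquire the lock in shared mode. Several transactions are allowed to hold the lock in shared mode simultaneously, but if another transaction already has an exclusive lock on the object, these transactions must wait.
* If a transaction wants to write to an object, it must first acquire the lock in exclusive mode. No other transaction may hold the lock at the same time (either in shared or in exclusive mode), so if there is any existing lock on the object, the transaction must wait.
* If a transaction first reads and then writes an object, it may upgrade its shared lock to an exclusive lock. The upgrade works the same as getting an exclusive lock directly.
* After a transaction has acquired the lock, it must continue to hold the lock until the end of the transaction (commit or abort). This is where the name "two-phase" comes from: the first phase (while the transaction is executing) is when the locks are acquired, and the second phase (at the end of the transaction) is when all the locks are released.
Since so many locks are in use, it can happen quite easily that transaction A is stuck waiting for transaction B to release its lock, and vice versa. This situation is called *deadlock*. The database automatically detects deadlocks between transactions and aborts one of them so that the others can make progress. The aborted transaction needs to be retried by the application.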
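The shared/exclusive rules above can be captured in a small lock table. This is a simplified sketch: the class `LockTable` is hypothetical, and instead of blocking it returns `False` when a transaction would have to wait, which leaves out wait queues and deadlock detection entirely.

```python
from collections import defaultdict

class LockTable:
    """Toy 2PL lock table; try_* returns False where a real system would block."""

    def __init__(self):
        self.shared = defaultdict(set)   # object -> transactions holding a shared lock
        self.exclusive = {}              # object -> transaction holding the exclusive lock

    def try_shared(self, txn, obj):
        owner = self.exclusive.get(obj)
        if owner is not None and owner != txn:
            return False                 # a writer holds the object
        self.shared[obj].add(txn)
        return True

    def try_exclusive(self, txn, obj):
        owner = self.exclusive.get(obj)
        if owner is not None and owner != txn:
            return False                 # another writer holds the object
        if self.shared[obj] - {txn}:
            return False                 # other readers hold the object
        self.exclusive[obj] = txn        # also covers upgrading our own shared lock
        return True

    def release_all(self, txn):
        # Second phase: all locks are released together at commit or abort.
        for holders in self.shared.values():
            holders.discard(txn)
        for obj in [o for o, t in self.exclusive.items() if t == txn]:
            del self.exclusive[obj]

locks = LockTable()
a_reads = locks.try_shared("A", "x")
b_reads = locks.try_shared("B", "x")
b_writes_early = locks.try_exclusive("B", "x")   # A still holds a shared lock
locks.release_all("A")                           # A commits
b_writes_late = locks.try_exclusive("B", "x")    # upgrade now succeeds
a_reads_again = locks.try_shared("A", "x")       # blocked by B's exclusive lock
```

The example shows both directions of blocking: a reader delays a writer until it commits, and the writer then delays any new reader.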
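#### Performance of two-phase locking {#performance-of-two-phase-locking}
The big downside of two-phase locking, and the reason why it has not been used by everybody since the 1970s, is performance: transaction throughput and response times of queries are significantly worse under two-phase locking than under weak isolation.
This is partly due to the overhead of acquiring and releasing all those locks, but more importantly due to reduced concurrency. By design, if two concurrent transactions try to do anything that may in any way result in a race condition, one has to wait for the other to complete.
For example, if you have a transaction that needs to read an entire table (such as a backup, an analytics query, or an integrity check, as discussed in ["Snapshot Isolation and Repeatable Read"](#sec_transactions_snapshot_isolation)), that transaction has to take a shared lock on the entire table. Therefore, the reading transaction first has to wait until all in-progress transactions writing to that table have completed; then, while the whole table is being read (which may take a long time for a large table), all other transactions that want to write to that table are blocked until the big read-only transaction commits. In effect, the database becomes unavailable for writes for an extended time.
For this reason, databases running 2PL can have quite unstable latencies, and they can be very slow at high percentiles (see ["Describing Performance"](/tw/ch2#sec_introduction_percentiles)) if there is contention in the workload. It may take just one slow transaction, or one transaction that accesses a lot of data and acquires many locks, to cause the rest of the system to grind to a halt.
Although deadlocks can happen under the lock-based read committed isolation level, they occur much more frequently under 2PL serializable isolation (depending on the access patterns of your transactions). This can be an additional performance problem: when a transaction is aborted due to deadlock and is retried, it needs to do its work all over again. If deadlocks are frequent, this can mean significant wasted effort.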
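#### Predicate locks {#predicate-locks}
In the preceding description of locks, we glossed over a subtle but important detail. In ["Phantoms causing write skew"](#sec_transactions_phantom) we discussed the problem of *phantoms*, that is, one transaction changing the results of another transaction's search query. A database with serializable isolation must prevent phantoms.
In the meeting room booking example, this means that if one transaction has searched for existing bookings for a room within a certain time window (see [Example 8-2](#fig_transactions_meeting_rooms)), another transaction is not allowed to concurrently insert or update another booking for the same room and time range. (It is fine to concurrently insert bookings for other rooms, or for the same room at a different time that does not affect the proposed booking.)
How do we implement this? Conceptually, we need a *predicate lock*[^4]. It works similarly to the shared/exclusive lock described earlier, but rather than belonging to a particular object (for example, one row in a table), it belongs to all objects that match some search condition, such as: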
```sql
SELECT * FROM bookings
WHERE room_id = 123 AND
end_time > '2025-01-01 12:00' AND
start_time < '2025-01-01 13:00';
```
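A predicate lock restricts access as follows:
* If transaction A wants to read objects matching some condition, like in that `SELECT` query, it must acquire a shared-mode predicate lock on the conditions of the query. If another transaction B currently has an exclusive lock on any object matching those conditions, A must wait until B releases its lock before it is allowed to make its query.
* If transaction A wants to insert, update, or delete any object, it must first check whether either the old or the new value matches any existing predicate lock. If there is a matching predicate lock held by transaction B, then A must wait until B has committed or aborted before it can continue.
The key idea here is that a predicate lock applies even to objects that do not yet exist in the database, but which might be added in the future (phantoms). If two-phase locking includes predicate locks, the database prevents all forms of write skew and other race conditions, and so its isolation becomes serializable.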
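The second rule, checking both the old and the new value of a write against a predicate, can be written down directly. The sketch below hard-codes the booking predicate from the query above; the dictionary layout and the function names `matches` and `write_conflicts` are assumptions for illustration.

```python
from datetime import datetime

def matches(pred, row):
    """Does a booking row satisfy the predicate of the SELECT query?"""
    return (
        row["room_id"] == pred["room_id"]
        and row["end_time"] > pred["start"]
        and row["start_time"] < pred["end"]
    )

def write_conflicts(pred, old_row, new_row):
    """A write conflicts if the old or the new value matches the predicate lock."""
    return any(row is not None and matches(pred, row) for row in (old_row, new_row))

# Predicate lock held by a transaction that searched room 123 from noon to 1pm.
pred = {
    "room_id": 123,
    "start": datetime(2025, 1, 1, 12, 0),
    "end": datetime(2025, 1, 1, 13, 0),
}

def booking(room_id, start_hour, end_hour):
    return {
        "room_id": room_id,
        "start_time": datetime(2025, 1, 1, start_hour, 0),
        "end_time": datetime(2025, 1, 1, end_hour, 0),
    }

overlapping_insert = write_conflicts(pred, None, booking(123, 12, 14))   # old_row is None
other_room_insert = write_conflicts(pred, None, booking(456, 12, 13))
later_insert = write_conflicts(pred, None, booking(123, 13, 14))
moved_away = write_conflicts(pred, booking(123, 12, 13), booking(123, 15, 16))
```

An insert has no old value, which is why the check still works for rows that did not exist when the predicate lock was taken; moving an existing booking out of the window also conflicts, because its old value matches.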
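#### Index-range locks {#sec_transactions_2pl_range}
Unfortunately, predicate locks do not perform well: if there are many locks held by active transactions, checking for matching locks becomes time-consuming. For that reason, most databases with 2PL actually implement *index-range locking* (also known as *next-key locking* or *gap locking*), which is a simplified approximation of predicate locking[^54] [^64].
It is safe to simplify a predicate by making it match a greater set of objects. For example, if you have a predicate lock for bookings of room 123 between noon and 1 p.m., you can approximate it by locking bookings for room 123 at any time, or you can approximate it by locking all rooms (not just room 123) between noon and 1 p.m. This is safe because any write that matches the original predicate will definitely also match the approximations.
In the room bookings database you would probably have an index on the `room_id` column, and/or indexes on `start_time` and `end_time` (otherwise the preceding query would be very slow on a large database):
* Say your index is on `room_id`, and the database uses this index to find existing bookings for room 123. Now the database can simply attach a shared lock to this index entry, indicating that a transaction has searched for bookings of room 123.
* Alternatively, if the database uses a time-based index to find existing bookings, it can attach a shared lock to a range of values in that index, indicating that a transaction has searched for bookings that overlap with the time period of noon to 1 p.m. on January 1, 2025.
Either way, an approximation of the search condition is attached to one of the indexes. Now, if another transaction wants to insert, update, or delete a booking for the same room and/or an overlapping time period, it will have to update the same part of the index. In the process of doing so, it will encounter the shared lock, and it will be forced to wait until the lock is released.
This provides effective protection against phantoms and write skew. Index-range locks are not as precise as predicate locks would be (they may lock a bigger range of objects than is strictly necessary to maintain serializability), but since they have much lower overheads, they are a good compromise.
If there is no suitable index where a range lock can be attached, the database can fall back to a shared lock on the entire table. This is not good for performance, since it will stop all other transactions writing to the table, but it is a safe fallback position.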
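### Serializable Snapshot Isolation (SSI) {#sec_transactions_ssi}
This chapter has painted a bleak picture of concurrency control in databases. On the one hand, we have implementations of serializability that do not perform well (two-phase locking) or do not scale well (serial execution). On the other hand, we have weak isolation levels that have good performance, but are prone to various race conditions (lost updates, write skew, phantoms, and so on). Are serializable isolation and good performance fundamentally at odds with each other?
It seems not: an algorithm called *serializable snapshot isolation* (SSI) provides full serializability with only a small performance penalty compared to snapshot isolation. SSI is comparatively new: it was first described in 2008[^53] [^65].
Today SSI and similar algorithms are used in single-node databases (the serializable isolation level in PostgreSQL[^54], SQL Server's In-Memory OLTP/Hekaton[^66], and HyPer[^67]), distributed databases (CockroachDB[^5] and FoundationDB[^8]), and embedded storage engines such as BadgerDB.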
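#### Pessimistic versus optimistic concurrency control {#pessimistic-versus-optimistic-concurrency-control}
Two-phase locking is a so-called *pessimistic* concurrency control mechanism: it is based on the principle that if anything might possibly go wrong (as indicated by a lock held by another transaction), it is better to wait until the situation is safe again before doing anything. It is like *mutual exclusion*, which is used to protect data structures in multi-threaded programming.
Serial execution is, in a sense, pessimistic to the extreme: it is essentially equivalent to each transaction having an exclusive lock on the entire database (or one shard of the database) for the duration of the transaction. We compensate for the pessimism by making each transaction very fast to execute, so it only needs to hold the "lock" for a short time.
By contrast, serializable snapshot isolation is an *optimistic* concurrency control technique. Optimistic in this context means that instead of blocking if something potentially dangerous happens, transactions continue anyway, in the hope that everything will turn out all right. When a transaction wants to commit, the database checks whether anything bad happened (that is, whether isolation was violated); if so, the transaction is aborted and has to be retried. Only transactions that executed serializably are allowed to commit.
Optimistic concurrency control is an old idea[^68], and its advantages and disadvantages have been debated for a long time[^69]. It performs badly if there is high contention (many transactions trying to access the same objects), as this leads to a high proportion of transactions needing to abort. If the system is already close to its maximum throughput, the additional transaction load from retried transactions can make performance worse.
However, if there is enough spare capacity, and if contention between transactions is not too high, optimistic concurrency control techniques tend to perform better than pessimistic ones. Contention can be reduced with commutative atomic operations: for example, if several transactions concurrently want to increment a counter, it does not matter in which order the increments are applied (as long as the counter is not read in the same transaction), so the concurrent increments can all be applied without conflicting.
As the name suggests, SSI is based on snapshot isolation: all reads within a transaction are made from a consistent snapshot of the database (see ["Snapshot Isolation and Repeatable Read"](#sec_transactions_snapshot_isolation)). On top of snapshot isolation, SSI adds an algorithm for detecting serialization conflicts among reads and writes, and determining which transactions to abort.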
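#### Decisions based on an outdated premise {#decisions-based-on-an-outdated-premise}
When we previously discussed write skew in snapshot isolation (see ["Write Skew and Phantoms"](#sec_transactions_write_skew)), we observed a recurring pattern: a transaction reads some data from the database, examines the result of the query, and decides to take some action (write to the database) based on the result that it saw. However, under snapshot isolation, the result from the original query may no longer be up-to-date by the time the transaction commits, because the data may have been modified in the meantime.
Put another way, the transaction is taking an action based on a *premise* (a fact that was true at the beginning of the transaction, such as "there are currently two doctors on call"). Later, when the transaction wants to commit, the original data may have changed, and the premise may no longer be true.
When the application makes a query (for example, "how many doctors are currently on call?"), the database does not know how the application logic uses the result of that query. To be safe, the database needs to assume that any change in the query result (the premise) means that writes in that transaction may be invalid. In other words, there may be a causal dependency between the queries and the writes in the transaction. In order to provide serializable isolation, the database must detect situations in which a transaction may have acted on an outdated premise, and abort the transaction in that case.
How does the database know whether a query result might have changed? There are two cases to consider:
* Detecting reads of a stale MVCC object version (an uncommitted write occurred before the read)
* Detecting writes that affect prior reads (the write occurs after the read)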
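#### Detecting stale MVCC reads {#detecting-stale-mvcc-reads}
Recall that snapshot isolation is usually implemented by multi-version concurrency control (MVCC; see ["Multi-version concurrency control (MVCC)"](#sec_transactions_snapshot_impl)). When a transaction reads from a consistent snapshot in an MVCC database, it ignores writes that were made by any other transactions that had not yet committed at the time when the snapshot was taken.
In [Figure 8-10](#fig_transactions_detect_mvcc), transaction 43 sees Aaliyah as having `on_call = true`, because transaction 42 (which modified Aaliyah's on-call status) is uncommitted. However, by the time transaction 43 wants to commit, transaction 42 has already committed. This means that the write that was ignored when reading from the consistent snapshot has now taken effect, and transaction 43's premise is no longer true. Things get even more complicated when a writer inserts data that did not exist before (see ["Phantoms causing write skew"](#sec_transactions_phantom)). We will discuss detecting phantom writes for SSI in ["Detecting writes that affect prior reads"](#sec_detecting_writes_affect_reads).
{{< figure src="/fig/ddia_0810.png" id="fig_transactions_detect_mvcc" caption="Figure 8-10. Detecting when a transaction reads outdated values from an MVCC snapshot." class="w-full my-4" >}}
In order to prevent this anomaly, the database needs to track when a transaction ignores another transaction's writes due to MVCC visibility rules. When the transaction wants to commit, the database checks whether any of the ignored writes have now been committed. If so, the transaction must be aborted.
Why wait until committing? Why not abort transaction 43 immediately when the stale read is detected? Well, if transaction 43 was a read-only transaction, it would not need to be aborted, because there is no risk of write skew. At the time when transaction 43 makes its read, the database does not yet know whether that transaction is going to later perform a write. Moreover, transaction 42 may yet abort or may still be uncommitted at the time when transaction 43 is committed, and so the read may turn out not to have been stale after all. By avoiding unnecessary aborts, SSI preserves snapshot isolation's support for long-running reads from a consistent snapshot.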
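#### Detecting writes that affect prior reads {#sec_detecting_writes_affect_reads}
The second case to consider is when another transaction modifies data after it has been read. This case is illustrated in [Figure 8-11](#fig_transactions_detect_index_range).
{{< figure src="/fig/ddia_0811.png" id="fig_transactions_detect_index_range" caption="Figure 8-11. In serializable snapshot isolation, detecting when one transaction modifies another transaction's reads." class="w-full my-4" >}}
In the context of two-phase locking we discussed index-range locks (see ["Index-range locks"](#sec_transactions_2pl_range)), which allow the database to lock access to all rows matching some search query, such as `WHERE shift_id = 1234`. We can use a similar technique here, except that SSI locks do not block other transactions.
In [Figure 8-11](#fig_transactions_detect_index_range), transactions 42 and 43 both search for on-call doctors during shift `1234`. If there is an index on `shift_id`, the database can use the index entry 1234 to record the fact that transactions 42 and 43 read this data. (If there is no index, this information can be tracked at the table level.) This information only needs to be kept for a while: after a transaction has finished (committed or aborted) and all concurrent transactions have finished, the database can forget what data it read.
When a transaction writes to the database, it must look in the indexes for any other transactions that have recently read the affected data. This process is similar to acquiring a write lock on the affected key range, but rather than blocking until the readers have committed, the lock acts as a tripwire: it simply notifies the transactions that the data they read may no longer be up to date.
In [Figure 8-11](#fig_transactions_detect_index_range), transaction 43 notifies transaction 42 that its prior read is outdated, and vice versa. Transaction 42 is first to commit, and it is successful: although transaction 43's write affected 42, 43 has not yet committed, so the write has not yet taken effect. However, when transaction 43 wants to commit, the conflicting write from 42 has already been committed, so 43 must abort.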
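The tripwire mechanism can be replayed with a toy tracker. This is a heavily simplified sketch under assumptions: the class `SSITracker` is hypothetical, it tracks reads by a single index key, and it aborts any writing transaction whose reads were affected by an already committed writer. Real implementations apply more refined rules and abort fewer transactions.

```python
class SSITracker:
    """Toy model of detecting writes that affect prior reads."""

    def __init__(self):
        self.readers = {}      # index key -> transactions that read it
        self.affected_by = {}  # transaction -> writers that touched its reads
        self.wrote = set()     # transactions that performed at least one write
        self.committed = set()

    def begin(self, txn):
        self.affected_by[txn] = set()

    def read(self, txn, key):
        self.readers.setdefault(key, set()).add(txn)

    def write(self, txn, key):
        self.wrote.add(txn)
        # Tripwire: do not block, just notify other readers of this key.
        for reader in self.readers.get(key, set()) - {txn}:
            self.affected_by[reader].add(txn)

    def commit(self, txn):
        # A read-only transaction has no write skew risk and may always commit.
        if txn in self.wrote and self.affected_by[txn] & self.committed:
            return False       # acted on an outdated premise: abort
        self.committed.add(txn)
        return True

ssi = SSITracker()
for txn in (42, 43):
    ssi.begin(txn)
    ssi.read(txn, "shift_id=1234")   # both count the doctors on call

ssi.write(42, "shift_id=1234")       # Aaliyah goes off call
ssi.write(43, "shift_id=1234")       # Bryce goes off call

commit_42 = ssi.commit(42)   # 43's conflicting write is not committed yet
commit_43 = ssi.commit(43)   # 42's conflicting write has already committed
```

As in Figure 8-11, transaction 42 commits and transaction 43 is aborted, after which the application would retry 43 and find only one doctor on call.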
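#### Performance of serializable snapshot isolation {#performance-of-serializable-snapshot-isolation}
As always, many engineering details affect how well an algorithm works in practice. For example, one trade-off is the granularity at which transactions' reads and writes are tracked. If the database keeps track of each transaction's activity in great detail, it can be precise about which transactions need to abort, but the bookkeeping overhead can become significant. Less detailed tracking is faster, but may lead to more transactions being aborted than strictly necessary.
In some cases, it is okay for a transaction to read information that was overwritten by another transaction: depending on what else happened, it is sometimes possible to prove that the result of the execution is nevertheless serializable. PostgreSQL uses this theory to reduce the number of unnecessary aborts[^14] [^54].
Compared to two-phase locking, the big advantage of serializable snapshot isolation is that one transaction does not need to block waiting for locks held by another transaction. As under snapshot isolation, writers do not block readers, and vice versa. This design principle makes query latency much more predictable and less variable. In particular, read-only queries can run on a consistent snapshot without requiring any locks, which is very appealing for read-heavy workloads.
Compared to serial execution, serializable snapshot isolation is not limited to the throughput of a single CPU core: for example, FoundationDB distributes the detection of serialization conflicts across multiple machines, allowing it to scale to very high throughput. Even though data may be sharded across multiple machines, transactions can read and write data in multiple shards while ensuring serializable isolation.
Compared to non-serializable snapshot isolation, the need to check for serializability violations introduces some performance overhead. How significant this overhead is remains a matter of debate: some believe that serializability checking is not worth it[^70], while others believe that the performance of serializability is now so good that there is no need to use the weaker snapshot isolation any more[^67].
The rate of aborts significantly affects the overall performance of SSI. For example, a transaction that reads and writes data over a long period of time is likely to run into conflicts and abort, so SSI requires that read-write transactions be fairly short (long-running read-only transactions are okay). However, SSI is less sensitive to slow transactions than two-phase locking or serial execution.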
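## Distributed Transactions {#sec_transactions_distributed}
The last few sections have focused on concurrency control for isolation, the I in ACID. The algorithms we have seen apply to both single-node and distributed databases: although there are challenges in making concurrency control algorithms scalable (for example, performing distributed serializability checking for SSI), the high-level ideas for distributed concurrency control are similar to single-node concurrency control[^8].
Consistency and durability also do not change much when we move to distributed transactions. However, atomicity requires more care.
For transactions that execute at a single database node, atomicity is commonly implemented by the storage engine. When the client asks the database node to commit the transaction, the database makes the transaction's writes durable (typically in a write-ahead log; see ["Making B-trees reliable"](/tw/ch4#sec_storage_btree_wal)) and then appends a commit record to the log on disk. If the database crashes in the middle of this process, the transaction is recovered from the log when the node restarts: if the commit record was successfully written to disk before the crash, the transaction is considered committed; if not, any writes from that transaction are rolled back.
Thus, on a single node, transaction commitment crucially depends on the *order* in which data is durably written to disk: first the data, then the commit record[^22]. The key deciding moment for whether the transaction commits or aborts is the moment at which the disk finishes writing the commit record: before that moment, it is still possible to abort (due to a crash), but after that moment, the transaction is committed (even if the database crashes). Thus, it is a single device (the controller of one particular disk drive, attached to one particular node) that makes the commit atomic.
However, what if multiple nodes are involved in a transaction? For example, perhaps you have a multi-object transaction in a sharded database, or a global secondary index (in which the index entry may be on a different node from the primary data; see ["Sharding and Secondary Indexes"](/tw/ch7#sec_sharding_secondary_indexes)). Most "NoSQL" distributed datastores do not support such distributed transactions, but various distributed relational databases do.
In these cases, it is not sufficient to simply send a commit request to all of the nodes and independently commit the transaction on each one. As shown in [Figure 8-12](#fig_transactions_non_atomic), the commit could succeed on some nodes and fail on other nodes:
* Some nodes may detect a constraint violation or conflict, making an abort necessary, while other nodes are successfully able to commit.
* Some of the commit requests might be lost in the network, eventually aborting due to a timeout, while other commit requests get through.
* Some nodes may crash before the commit record is fully written and roll back on recovery, while others successfully commit.
{{< figure src="/fig/ddia_0812.png" id="fig_transactions_non_atomic" caption="Figure 8-12. When a transaction involves multiple database nodes, it may commit on some and fail on others." class="w-full my-4" >}}
If some nodes commit the transaction but others abort it, the nodes become inconsistent with each other. And once a transaction has been committed on one node, it cannot be retracted again if it later turns out that it was aborted on another node. This is because once data has been committed, it becomes visible to other transactions under *read committed* or stronger isolation. For example, in [Figure 8-12](#fig_transactions_non_atomic), by the time user 1 notices that its commit failed on database 1, user 2 has already read the data from the same transaction on database 2. If user 1's transaction was later aborted, user 2's transaction would have to be reverted as well, since it was based on data that was retroactively declared not to have existed.
A better approach is to ensure that the nodes involved in a transaction either all commit or all abort, and to prevent a mixture of the two. Ensuring this is known as the *atomic commitment* problem.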
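### Two-Phase Commit (2PC) {#sec_transactions_2pc}
Two-phase commit is an algorithm for achieving atomic transaction commit across multiple nodes. It is a classic algorithm in distributed databases[^13] [^71] [^72]. 2PC is used internally in some databases and is also made available to applications in the form of *XA transactions*[^73] (which are supported by the Java Transaction API, for example) or via WS-AtomicTransaction for SOAP web services[^74] [^75].
The basic flow of 2PC is illustrated in [Figure 8-13](#fig_transactions_two_phase_commit). Instead of a single commit request, as with a single-node transaction, the commit/abort process in 2PC is split into two phases (hence the name).
{{< figure src="/fig/ddia_0813.png" id="fig_transactions_two_phase_commit" title="Figure 8-13. A successful execution of two-phase commit (2PC)." class="w-full my-4" >}}
2PC uses a new component that does not normally appear in single-node transactions: a *coordinator* (also known as *transaction manager*). The coordinator is often implemented as a library within the same application process that is requesting the transaction (for example, embedded in a Java EE container), but it can also be a separate process or service. Examples of such coordinators include Narayana, JOTM, BTM, and MSDTC.
When 2PC is used, a distributed transaction begins with the application reading and writing data on multiple database nodes, as normal. We call these database nodes *participants* in the transaction. When the application is ready to commit, the coordinator begins phase 1: it sends a *prepare* request to each of the nodes, asking them whether they are able to commit. The coordinator then tracks the responses from the participants:
* If all participants reply "yes," indicating they are ready to commit, then the coordinator sends out a *commit* request in phase 2, and the commit actually takes place.
* If any of the participants replies "no," the coordinator sends an *abort* request to all nodes in phase 2.
This process is somewhat like the traditional marriage ceremony in Western cultures: the minister asks the bride and groom individually whether each wants to marry the other, and typically receives the answer "I do" from both. After receiving both acknowledgments, the minister pronounces the couple husband and wife: the transaction is committed, and the happy fact is broadcast to all attendees. If either bride or groom does not say "yes," the ceremony is aborted[^76].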
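#### A system of promises {#a-system-of-promises}
From this short description it might not be clear why two-phase commit ensures atomicity, while one-phase commit across several nodes does not. Surely the prepare and commit requests can just as easily be lost in the two-phase case. What makes 2PC different?
To understand why it works, we have to break down the process in a bit more detail:
1. When the application wants to begin a distributed transaction, it requests a transaction ID from the coordinator. This transaction ID is globally unique.
2. The application begins a single-node transaction on each of the participants, and attaches the globally unique transaction ID to the single-node transaction. All reads and writes are done in one of these single-node transactions. If anything goes wrong at this stage (for example, a node crashes or a request times out), the coordinator or any of the participants can abort.
3. When the application is ready to commit, the coordinator sends a prepare request to all participants, tagged with the global transaction ID. If any of these requests fails or times out, the coordinator sends an abort request for that transaction ID to all participants.
4. When a participant receives the prepare request, it makes sure that it can definitely commit the transaction under all circumstances.
This includes writing all transaction data to disk (a crash, a power failure, or running out of disk space is not an acceptable excuse for refusing to commit later), and checking for any conflicts or constraint violations. By replying "yes" to the coordinator, the node promises to commit the transaction without error if requested. In other words, the participant surrenders the right to abort the transaction, but without actually committing it.
5. When the coordinator has received responses to all prepare requests, it makes a definitive decision on whether to commit or abort the transaction (committing only if all participants voted "yes"). The coordinator must write that decision to its transaction log on disk so that it knows which way it decided in case it subsequently crashes. This is called the *commit point*.
6. Once the coordinator's decision has been written to disk, the commit or abort request is sent to all participants. If this request fails or times out, the coordinator must retry forever until it succeeds. There is no more going back: if the decision was to commit, that decision must be enforced, no matter how many retries it takes. If a participant has crashed in the meantime, the transaction will be committed when it recovers: since the participant voted "yes," it cannot refuse to commit when it recovers.
Thus, the protocol contains two crucial "points of no return": when a participant votes "yes," it promises that it will definitely be able to commit later (although the coordinator may still choose to abort); and once the coordinator decides, that decision is irrevocable. Those promises ensure the atomicity of 2PC. (Single-node atomic commit lumps these two events into one: writing the commit record to the transaction log.)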
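The steps above can be sketched as a small in-process simulation. It is a toy under assumptions: the classes `Participant` and `Coordinator` are hypothetical, the coordinator's log is an in-memory list standing in for a durable on-disk log, and timeouts, retries, and crash recovery are left out.

```python
class Participant:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "active"

    def prepare(self, txid):
        # Voting "yes" is a promise: from now on this node may not abort on its own.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self, txid):
        self.state = "committed"

    def abort(self, txid):
        self.state = "aborted"

class Coordinator:
    def __init__(self):
        self.log = []   # stands in for the durable transaction log

    def run(self, txid, participants):
        # Phase 1: ask every participant whether it is able to commit.
        votes = [p.prepare(txid) for p in participants]
        decision = "commit" if all(votes) else "abort"
        # Commit point: the decision is recorded before anyone is told about it.
        self.log.append((txid, decision))
        # Phase 2: deliver the decision (a real coordinator retries forever).
        for p in participants:
            if decision == "commit":
                p.commit(txid)
            else:
                p.abort(txid)
        return decision

coordinator = Coordinator()

healthy = [Participant("db1"), Participant("db2")]
outcome_ok = coordinator.run("tx-1", healthy)

mixed = [Participant("db1"), Participant("db2", can_commit=False)]
outcome_failed = coordinator.run("tx-2", mixed)
```

In the second run a single "no" vote is enough to abort everywhere, including on the participant that had already promised to commit.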
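Returning to the marriage analogy, before saying "I do," you and your bride/groom have the freedom to abort the transaction by saying "No way!" (or something to that effect). However, after saying "I do," you cannot retract that statement. If you faint after saying "I do" and you do not hear the minister speak the words "You are now husband and wife," that does not change the fact that the transaction was committed. When you recover consciousness later, you can find out whether you are married or not by querying the minister for the status of your global transaction ID, or you can wait for the minister's next retry of the commit request (since the retries will have continued throughout your period of unconsciousness).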
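#### Coordinator failure {#coordinator-failure}
We have discussed what happens if one of the participants or the network fails during 2PC: if any of the prepare requests fails or times out, the coordinator aborts the transaction; if any of the commit or abort requests fails, the coordinator retries them indefinitely. However, it is less clear what happens if the coordinator crashes.
If the coordinator fails before sending the prepare requests, a participant can safely abort the transaction. But once the participant has received a prepare request and voted "yes," it can no longer abort unilaterally: it must wait to hear back from the coordinator whether the transaction was committed or aborted. If the coordinator crashes or the network fails at this point, the participant can do nothing but wait. A participant's transaction in this state is called *in doubt* or *uncertain*.
The situation is illustrated in [Figure 8-14](#fig_transactions_2pc_crash). In this particular example, the coordinator actually decided to commit, and database 2 received the commit request. However, the coordinator crashed before it could send the commit request to database 1, and so database 1 does not know whether to commit or abort. Even a timeout does not help here: if database 1 unilaterally aborts after a timeout, it will end up inconsistent with database 2, which has committed. Similarly, it is not safe to unilaterally commit, because another participant may have aborted.
{{< figure src="/fig/ddia_0814.png" id="fig_transactions_2pc_crash" title="Figure 8-14. The coordinator crashes after participants vote “yes.” Database 1 does not know whether to commit or abort." class="w-full my-4" >}}
Without hearing from the coordinator, the participant has no way of knowing whether to commit or abort. In principle, the participants could communicate among themselves to find out how each participant voted and come to some agreement, but that is not part of the 2PC protocol.
The only way 2PC can complete is by waiting for the coordinator to recover. This is why the coordinator must write its commit or abort decision to a transaction log on disk before sending commit or abort requests to participants: when the coordinator recovers, it determines the status of all in-doubt transactions by reading its transaction log. Any transactions that do not have a commit record in the coordinator's log are aborted. Thus, the commit point of 2PC comes down to a regular single-node atomic commit on the coordinator.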
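#### Three-phase commit {#three-phase-commit}
Two-phase commit is called a *blocking* atomic commit protocol due to the fact that 2PC can become stuck waiting for the coordinator to recover. It is possible to make an atomic commit protocol *nonblocking*, so that it does not get stuck if a node fails. However, making this work in practice is not so straightforward.
As an alternative to 2PC, an algorithm called *three-phase commit* (3PC) has been proposed[^13] [^77]. However, 3PC assumes a network with bounded delay and nodes with bounded response times; in most practical systems with unbounded network delay and process pauses (see [Chapter 9](/tw/ch9#ch_distributed)), it cannot guarantee atomicity.
A better solution in practice is to replace the single-node coordinator with a fault-tolerant consensus protocol. We will see how to do this in [Chapter 10](/tw/ch10#ch_consistency).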
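### Distributed Transactions Across Different Systems {#sec_transactions_xa}
Distributed transactions and two-phase commit have a mixed reputation. On the one hand, they are seen as providing an important safety guarantee that would be hard to achieve otherwise; on the other hand, they are criticized for causing operational problems, killing performance, and promising more than they can deliver[^78] [^79] [^80] [^81]. Many cloud services choose not to implement distributed transactions due to the operational problems they engender[^82].
Some implementations of distributed transactions carry a heavy performance penalty. Much of the performance cost inherent in two-phase commit is due to the additional disk forcing (`fsync`) that is required for crash recovery, and the additional network round-trips.
However, rather than dismissing distributed transactions outright, we should examine them in some more detail, because there are important lessons to be learned from them. To begin, we should be precise about what we mean by "distributed transactions." Two quite different types of distributed transactions are often conflated:
Database-internal distributed transactions
: Some distributed databases (that is, databases that use replication and sharding in their standard configuration) support internal transactions among the nodes of that database. For example, YugabyteDB, TiDB, FoundationDB, Spanner, VoltDB, and MySQL Cluster's NDB storage engine have such internal transaction support. In this case, all the nodes participating in the transaction are running the same database software.
Heterogeneous distributed transactions
: In a *heterogeneous* transaction, the participants are two or more different technologies: for example, two databases from different vendors, or even non-database systems such as message brokers. A distributed transaction across these systems must ensure atomic commit, even though the systems may be entirely different under the hood.
Database-internal transactions do not have to be compatible with any other system, so they can use any protocol and apply optimizations specific to that particular technology. For that reason, database-internal distributed transactions can often work quite well. On the other hand, transactions spanning heterogeneous technologies are a lot more challenging.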
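#### Exactly-once message processing {#sec_transactions_exactly_once}
Heterogeneous distributed transactions allow diverse systems to be integrated in powerful ways. For example, a message from a message queue can be acknowledged as processed if and only if the database transaction for processing the message was successfully committed. This is implemented by atomically committing the message acknowledgment and the database writes in a single transaction. With distributed transaction support, this is possible even if the message broker and the database are two unrelated technologies running on different machines.
If either the message delivery or the database transaction fails, both are aborted, and so the message broker may safely redeliver the message later. Thus, by atomically committing the message and the side effects of its processing, we can ensure that the message is *effectively* processed exactly once, even if it required a few retries before it succeeded. The abort discards any side effects of the partially completed transaction. This is known as *exactly-once semantics*.
Such a distributed transaction is only possible if all systems affected by the transaction are able to use the same atomic commit protocol, however. For example, say a side effect of processing a message is to send an email, and the email server does not support two-phase commit: it could happen that the email is sent two or more times if message processing fails and is retried. But if all side effects of processing a message are rolled back on transaction abort, then the processing step can safely be retried as if nothing had happened.
We will return to the topic of exactly-once semantics later in this chapter. Let us look first at the atomic commit protocol that allows such heterogeneous distributed transactions.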
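#### XA transactions {#xa-transactions}
*X/Open XA* (short for *eXtended Architecture*) is a standard for implementing two-phase commit across heterogeneous technologies[^73]. It was introduced in 1991 and has been widely implemented: XA is supported by many traditional relational databases (including PostgreSQL, MySQL, Db2, SQL Server, and Oracle) and message brokers (including ActiveMQ, HornetQ, MSMQ, and IBM MQ).
XA is not a network protocol: it is merely a C API for interfacing with a transaction coordinator. Bindings for this API exist in other languages; for example, in the world of Java EE applications, XA transactions are implemented using the Java Transaction API (JTA), which in turn is supported by many drivers for databases using Java Database Connectivity (JDBC) and drivers for message brokers using the Java Message Service (JMS) APIs.
XA assumes that your application uses a network driver or client library to communicate with the participant databases or messaging services. If the driver supports XA, that means it calls the XA API to find out whether an operation should be part of a distributed transaction, and if so, it sends the necessary information to the database server. The driver also exposes callbacks through which the coordinator can ask the participant to prepare, commit, or abort.
The transaction coordinator implements the XA API. The standard does not specify how it should be implemented, but in practice the coordinator is often simply a library that is loaded into the same process as the application issuing the transaction (not a separate service). It keeps track of the participants in a transaction, collects participants' responses after asking them to prepare (via a callback into the driver), and uses a log on the local disk to keep track of the commit/abort decision for each transaction.
If the application process crashes, or the machine on which the application is running dies, the coordinator goes with it. Any participants with prepared but uncommitted transactions are then stuck in doubt. Since the coordinator's log is on the application server's local disk, that server must be restarted, and the coordinator library must read the log to recover the commit/abort outcome of each transaction. Only then can the coordinator use the database driver's XA callbacks to ask participants to commit or abort, as appropriate. The database server cannot contact the coordinator directly, since all communication must go via its client library.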
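#### Holding locks while in doubt {#holding-locks-while-in-doubt}
Why do we care so much about a transaction being stuck in doubt? Can't the rest of the system just get on with its work, and ignore the in-doubt transaction that will be cleaned up eventually?
The problem is with *locking*. As discussed in ["Read Committed"](#sec_transactions_read_committed), database transactions usually take a row-level exclusive lock on any rows they modify, to prevent dirty writes. In addition, if you want serializable isolation, a database using two-phase locking would also have to take a shared lock on any rows *read* by the transaction.
The database cannot release those locks until the transaction commits or aborts (illustrated as a shaded area in [Figure 8-13](#fig_transactions_two_phase_commit)). Therefore, when using two-phase commit, a transaction must hold onto the locks throughout the time it is in doubt. If the coordinator has crashed and takes 20 minutes to start up again, those locks will be held for 20 minutes. If the coordinator's log is entirely lost for some reason, those locks will be held forever, or at least until the situation is manually resolved by an administrator.
While those locks are held, no other transaction can modify those rows. Depending on the isolation level, other transactions may even be blocked from reading those rows. Thus, other transactions cannot simply continue with their business: if they want to access that same data, they will be blocked. This can cause large parts of your application to become unavailable until the in-doubt transaction is resolved.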
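#### Recovering from coordinator failure {#recovering-from-coordinator-failure}
In theory, if the coordinator crashes and is restarted, it should cleanly recover its state from the log and resolve any in-doubt transactions. However, in practice, *orphaned* in-doubt transactions do occur[^83] [^84]: that is, transactions for which the coordinator cannot decide the outcome for whatever reason (for example, because the transaction log has been lost or corrupted due to a software bug). These transactions cannot be resolved automatically, so they sit forever in the database, holding locks and blocking other transactions.
Even rebooting your database servers will not fix this problem, since a correct implementation of 2PC must preserve the locks of an in-doubt transaction even across restarts (otherwise it would risk violating the atomicity guarantee). It is a sticky situation.
The only way out is for an administrator to manually decide whether to commit or roll back the transactions. The administrator must examine the participants of each in-doubt transaction, determine whether any participant has committed or aborted already, and then apply the same outcome to the other participants. Resolving the problem potentially requires a lot of manual effort, and most likely needs to be done under high stress and time pressure during a serious production outage (otherwise, why would the coordinator be in such a bad state?).
Many XA implementations have an emergency escape hatch called *heuristic decisions*: allowing a participant to unilaterally decide to abort or commit an in-doubt transaction without a definitive decision from the coordinator[^73]. To be clear, *heuristic* here is a euphemism for *probably breaking atomicity*, since the heuristic decision violates the system of promises in two-phase commit. Thus, heuristic decisions are intended only for getting out of catastrophic situations, and not for regular use.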
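#### Problems with XA transactions {#problems-with-xa-transactions}
A single-node coordinator is a single point of failure for the entire system, and making it part of the application server is also problematic because the coordinator's log on its local disk becomes a crucial part of the durable system state, as important as the databases themselves.
In principle, the coordinator of an XA transaction could be highly available and replicated, just like we would expect of any other important database. Unfortunately, this still does not solve a fundamental problem with XA, which is that it provides no way for the coordinator and the participants of a transaction to communicate with each other directly. They can only communicate via the application code that invoked the transaction, and the database drivers through which it calls the participants.
Even if the coordinator were replicated, the application code would therefore be a single point of failure. Solving this problem would require totally redesigning how the application code is run to make it replicated or restartable, which could perhaps look similar to durable execution (see ["Durable Execution and Workflows"](/tw/ch5#sec_encoding_dataflow_workflows)). However, there do not seem to be any tools that actually take this approach in practice.
Another problem is that since XA needs to be compatible with a wide range of data systems, it is necessarily a lowest common denominator. For example, it cannot detect deadlocks across different systems (since that would require a standardized protocol for systems to exchange information on the locks that each transaction is waiting for), and it does not work with SSI (see ["Serializable Snapshot Isolation (SSI)"](#sec_transactions_ssi)), since that would require a protocol for identifying conflicts across different systems.
These problems are somewhat inherent in performing transactions across heterogeneous technologies. However, keeping several heterogeneous data systems consistent with each other is still a real and important problem, so we need to find a different solution to it. This can be done, as we will see in the next section and in ["Derived data versus distributed transactions"](/tw/ch13#sec_future_derived_vs_transactions).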
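### Database-internal Distributed Transactions {#sec_transactions_internal}
As explained previously, there is a big difference between distributed transactions that span multiple heterogeneous storage technologies, and those that are internal to a system, that is, where all the participating nodes are shards of the same database running the same software. Such internal distributed transactions are a defining feature of "NewSQL" databases such as CockroachDB[^5], TiDB[^6], Spanner[^7], FoundationDB[^8], and YugabyteDB. Some message brokers such as Kafka also support internal distributed transactions[^85].
Many of these systems use two-phase commit to ensure atomicity of transactions that write to multiple shards, and yet they do not suffer the same problems as XA transactions. The reason is that because their distributed transactions do not need to interface with any other technologies, they avoid the lowest-common-denominator trap: the designers of these systems are free to use better protocols that are more reliable and faster.
The biggest problems with XA can be fixed by:
* Replicating the coordinator, with automatic failover to another coordinator node if the primary one crashes;
* Allowing the coordinator and data shards to communicate directly without going via application code;
* Replicating the participating shards, to reduce the risk of having to abort a transaction because of a fault in one of the shards; and
* Coupling the atomic commitment protocol with a distributed concurrency control protocol that supports deadlock detection and consistent reads across shards.
Consensus algorithms are commonly used to replicate the coordinator and the database shards. We will see in [Chapter 10](/tw/ch10#ch_consistency) how atomic commitment for distributed transactions can be implemented using a consensus algorithm. These algorithms tolerate faults by automatically failing over from one node to another without any human intervention, while continuing to guarantee strong consistency properties.
The isolation levels offered for distributed transactions depend on the system, but snapshot isolation and serializable snapshot isolation are both possible across shards. The details of how this works can be found in the papers referenced at the end of this chapter.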
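#### Exactly-once message processing revisited {#exactly-once-message-processing-revisited}
We saw in ["Exactly-once message processing"](#sec_transactions_exactly_once) that an important use case for distributed transactions is to ensure that some operation takes effect exactly once, even if a crash occurs while it is being processed and the processing needs to be retried. If you can atomically commit a transaction across a message broker and a database, you can acknowledge the message to the broker if and only if it was successfully processed and the database writes resulting from the processing were committed.
However, you do not actually need such distributed transactions to achieve exactly-once semantics. An alternative approach is as follows, which only requires transactions within the database:
1. Assume every message has a unique ID, and in the database you have a table of message IDs that have been processed. When you start processing a message from the broker, you begin a new transaction on the database, and check the message ID. If the same message ID is already present in the database, you know that it has already been processed, so you can acknowledge the message to the broker and drop it.
2. If the message ID is not already in the database, you add it to the table. You then process the message, which may result in additional writes to the database within the same transaction. When you finish processing the message, you commit the transaction on the database.
3. Once the database transaction is successfully committed, you can acknowledge the message to the broker.
4. Once the message has successfully been acknowledged to the broker, you know that it will not try processing the same message again, so you can delete the message ID from the database (in a separate transaction).
If the message processor crashes before committing the database transaction, the transaction is aborted and the message broker will retry processing. If it crashes after committing but before acknowledging the message to the broker, it will also retry processing, but the retry will see the message ID in the database and drop it. If it crashes after acknowledging the message but before deleting the message ID from the database, you will have an old message ID lying around, which does no harm besides taking a little bit of storage space. If a retry happens before the database transaction is aborted (which could happen if communication between the message processor and the database is interrupted), a uniqueness constraint on the table of message IDs should prevent two concurrent transactions from inserting the same message ID.
Thus, achieving exactly-once processing only requires transactions within the database: atomicity across database and message broker is not necessary for this use case. Recording the message ID in the database makes the message processing *idempotent*, so that message processing can be safely retried without duplicating its side effects. A similar approach is used in stream processing frameworks such as Kafka Streams to achieve exactly-once semantics, as we shall see in ["Fault Tolerance"](/tw/ch12#sec_stream_fault_tolerance).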
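The numbered steps can be sketched with `sqlite3` standing in for the database. The table names, the `accounts` example, and the function `handle` are assumptions for illustration, and the broker is left out: the caller would acknowledge the message after `handle` returns, whatever its return value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The PRIMARY KEY provides the uniqueness constraint on message IDs.
conn.execute("CREATE TABLE processed_messages (message_id TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER NOT NULL)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100)")
conn.commit()

def handle(conn, message_id, amount):
    """Process a credit message; return False if it was already processed."""
    try:
        with conn:  # one database transaction: commit on success, roll back on error
            conn.execute(
                "INSERT INTO processed_messages (message_id) VALUES (?)",
                (message_id,),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = 'alice'",
                (amount,),
            )
        return True
    except sqlite3.IntegrityError:
        # Duplicate message ID: the effects were committed by an earlier attempt.
        return False

first_delivery = handle(conn, "msg-1", 25)
redelivery = handle(conn, "msg-1", 25)   # broker retries after a lost acknowledgment
balance = conn.execute("SELECT balance FROM accounts WHERE name = 'alice'").fetchone()[0]
```

Recording the ID and applying the side effect in the same transaction is what makes the retry harmless; in this variant the insert doubles as the duplicate check, relying on the uniqueness constraint mentioned above.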
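However, internal distributed transactions within the database are still useful for the scalability of patterns such as these: for example, they would allow the message IDs to be stored on one shard and the main data updated by the message processing to be stored on other shards, and to ensure atomicity of the transaction commit across those shards.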
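## Summary {#summary}
Transactions are an abstraction layer that allows an application to pretend that certain concurrency problems and certain kinds of hardware and software faults do not exist. A large class of errors is reduced down to a simple *transaction abort*, and the application just needs to try again.
In this chapter we saw many examples of problems that transactions help prevent. Not all applications are susceptible to all those problems: an application with very simple access patterns, such as reading and writing only a single record, can probably manage without transactions. However, for more complex access patterns, transactions can hugely reduce the number of potential error cases you need to think about.
Without transactions, various error scenarios (processes crashing, network interruptions, power outages, disk full, unexpected concurrency, and so on) mean that data can become inconsistent in various ways. For example, denormalized data can easily go out of sync with the source data. Without transactions, it becomes very difficult to reason about the effects that complex interacting accesses can have on the database.
In this chapter, we went particularly deep into the topic of concurrency control. We discussed several widely used isolation levels, in particular *read committed*, *snapshot isolation* (sometimes called *repeatable read*), and *serializable*. We characterized those isolation levels by discussing various examples of race conditions, summarized in [Table 8-1](#tab_transactions_isolation_levels):
{{< figure id="tab_transactions_isolation_levels" title="Table 8-1. Summary of anomalies that can occur at various isolation levels" class="w-full my-4" >}}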
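| Isolation level | Dirty reads | Read skew | Phantom reads | Lost updates | Write skew |
|------|------|------|------|-------|------|
| Read uncommitted | ✗ Possible | ✗ Possible | ✗ Possible | ✗ Possible | ✗ Possible |
| Read committed | ✓ Prevented | ✗ Possible | ✗ Possible | ✗ Possible | ✗ Possible |
| Snapshot isolation | ✓ Prevented | ✓ Prevented | ✓ Prevented | ? Depends | ✗ Possible |
| Serializable | ✓ Prevented | ✓ Prevented | ✓ Prevented | ✓ Prevented | ✓ Prevented |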
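Dirty reads
: One client reads another client's writes before they have been committed. The read committed isolation level and stronger levels prevent dirty reads.
Dirty writes
: One client overwrites data that another client has written, but not yet committed. Almost all transaction implementations prevent dirty writes.
Read skew
: A client sees different parts of the database at different points in time. Some cases of read skew are also known as *nonrepeatable reads*. This issue is most commonly prevented with snapshot isolation, which allows a transaction to read from a consistent snapshot corresponding to one particular point in time. It is usually implemented with *multi-version concurrency control* (MVCC).
Lost updates
: Two clients concurrently perform a read-modify-write cycle. One overwrites the other's write without incorporating its changes, so data is lost. Some implementations of snapshot isolation prevent this anomaly automatically, while others require a manual lock (`SELECT FOR UPDATE`).
Write skew
: A transaction reads something, makes a decision based on the value it saw, and writes the decision to the database. However, by the time the write is made, the premise of the decision is no longer true. Only serializable isolation prevents this anomaly.
Phantom reads
: A transaction reads objects that match some search condition. Another client makes a write that affects the results of that search. Snapshot isolation prevents straightforward phantom reads, but phantoms in the context of write skew require special treatment, such as index-range locks.
Weak isolation levels protect against some of those anomalies but leave you, the application developer, to handle others manually (for example, using explicit locking). Only serializable isolation protects against all of these issues. We discussed three different approaches to implementing serializable transactions:
Literally executing transactions in a serial order
: If you can make each transaction very fast to execute (typically by using stored procedures), and the transaction throughput is low enough to process on a single CPU core or can be sharded, this is a simple and effective option.
Two-phase locking
: For decades this has been the standard way of implementing serializability, but many applications avoid using it because of its poor performance.
Serializable snapshot isolation (SSI)
: A comparatively new algorithm that avoids most of the downsides of the previous approaches. It uses an optimistic approach, allowing transactions to proceed without blocking. When a transaction wants to commit, it is checked, and it is aborted if the execution was not serializable.
Finally, we examined how to achieve atomicity when a transaction is distributed across multiple nodes, using two-phase commit. If those nodes are all running the same database software, distributed transactions can work quite well, but across different storage technologies (using XA transactions), 2PC is problematic: it is very sensitive to faults in the coordinator and the application code driving the transaction, and it interacts poorly with concurrency control mechanisms. Fortunately, idempotence can ensure exactly-once semantics without requiring atomic commit across different storage technologies, and we will see more on this in later chapters.
The examples in this chapter used a relational data model. However, as discussed in ["The need for multi-object transactions"](#sec_transactions_need), transactions are a valuable database feature, no matter which data model is used.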
### References
[^1]: Steven J. Murdoch. [What went wrong with Horizon: learning from the Post Office Trial](https://www.benthamsgaze.org/2021/07/15/what-went-wrong-with-horizon-learning-from-the-post-office-trial/). *benthamsgaze.org*, July 2021. Archived at [perma.cc/CNM4-553F](https://perma.cc/CNM4-553F)
[^2]: Donald D. Chamberlin, Morton M. Astrahan, Michael W. Blasgen, James N. Gray, W. Frank King, Bruce G. Lindsay, Raymond Lorie, James W. Mehl, Thomas G. Price, Franco Putzolu, Patricia Griffiths Selinger, Mario Schkolnick, Donald R. Slutz, Irving L. Traiger, Bradford W. Wade, and Robert A. Yost. [A History and Evaluation of System R](https://dsf.berkeley.edu/cs262/2005/SystemR.pdf). *Communications of the ACM*, volume 24, issue 10, pages 632–646, October 1981. [doi:10.1145/358769.358784](https://doi.org/10.1145/358769.358784)
[^3]: Jim N. Gray, Raymond A. Lorie, Gianfranco R. Putzolu, and Irving L. Traiger. [Granularity of Locks and Degrees of Consistency in a Shared Data Base](https://citeseerx.ist.psu.edu/pdf/e127f0a6a912bb9150ecfe03c0ebf7fbc289a023). in *Modelling in Data Base Management Systems: Proceedings of the IFIP Working Conference on Modelling in Data Base Management Systems*, edited by G. M. Nijssen, pages 364–394, Elsevier/North Holland Publishing, 1976. Also in *Readings in Database Systems*, 4th edition, edited by Joseph M. Hellerstein and Michael Stonebraker, MIT Press, 2005. ISBN: 978-0-262-69314-1
[^4]: Kapali P. Eswaran, Jim N. Gray, Raymond A. Lorie, and Irving L. Traiger. [The Notions of Consistency and Predicate Locks in a Database System](https://jimgray.azurewebsites.net/papers/On%20the%20Notions%20of%20Consistency%20and%20Predicate%20Locks%20in%20a%20Database%20System%20CACM.pdf?from=https://research.microsoft.com/en-us/um/people/gray/papers/On%20the%20Notions%20of%20Consistency%20and%20Predicate%20Locks%20in%20a%20Database%20System%20CACM.pdf). *Communications of the ACM*, volume 19, issue 11, pages 624–633, November 1976. [doi:10.1145/360363.360369](https://doi.org/10.1145/360363.360369)
[^5]: Rebecca Taft, Irfan Sharif, Andrei Matei, Nathan VanBenschoten, Jordan Lewis, Tobias Grieger, Kai Niemi, Andy Woods, Anne Birzin, Raphael Poss, Paul Bardea, Amruta Ranade, Ben Darnell, Bram Gruneir, Justin Jaffray, Lucy Zhang, and Peter Mattis. [CockroachDB: The Resilient Geo-Distributed SQL Database](https://dl.acm.org/doi/pdf/10.1145/3318464.3386134). At *ACM SIGMOD International Conference on Management of Data* (SIGMOD), pages 1493–1509, June 2020. [doi:10.1145/3318464.3386134](https://doi.org/10.1145/3318464.3386134)
[^6]: Dongxu Huang, Qi Liu, Qiu Cui, Zhuhe Fang, Xiaoyu Ma, Fei Xu, Li Shen, Liu Tang, Yuxing Zhou, Menglong Huang, Wan Wei, Cong Liu, Jian Zhang, Jianjun Li, Xuelian Wu, Lingyu Song, Ruoxi Sun, Shuaipeng Yu, Lei Zhao, Nicholas Cameron, Liquan Pei, and Xin Tang. [TiDB: a Raft-based HTAP database](https://www.vldb.org/pvldb/vol13/p3072-huang.pdf). *Proceedings of the VLDB Endowment*, volume 13, issue 12, pages 3072–3084. [doi:10.14778/3415478.3415535](https://doi.org/10.14778/3415478.3415535)
[^7]: James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, JJ Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Dale Woodford, Yasushi Saito, Christopher Taylor, Michal Szymaniak, and Ruth Wang. [Spanner: Google’s Globally-Distributed Database](https://research.google/pubs/pub39966/). At *10th USENIX Symposium on Operating System Design and Implementation* (OSDI), October 2012.
[^8]: Jingyu Zhou, Meng Xu, Alexander Shraer, Bala Namasivayam, Alex Miller, Evan Tschannen, Steve Atherton, Andrew J. Beamon, Rusty Sears, John Leach, Dave Rosenthal, Xin Dong, Will Wilson, Ben Collins, David Scherer, Alec Grieser, Young Liu, Alvin Moore, Bhaskar Muppana, Xiaoge Su, and Vishesh Yadav. [FoundationDB: A Distributed Unbundled Transactional Key Value Store](https://www.foundationdb.org/files/fdb-paper.pdf). At *ACM International Conference on Management of Data* (SIGMOD), June 2021. [doi:10.1145/3448016.3457559](https://doi.org/10.1145/3448016.3457559)
[^9]: Theo Härder and Andreas Reuter. [Principles of Transaction-Oriented Database Recovery](https://citeseerx.ist.psu.edu/pdf/11ef7c142295aeb1a28a0e714c91fc8d610c3047). *ACM Computing Surveys*, volume 15, issue 4, pages 287–317, December 1983. [doi:10.1145/289.291](https://doi.org/10.1145/289.291)
[^10]: Peter Bailis, Alan Fekete, Ali Ghodsi, Joseph M. Hellerstein, and Ion Stoica. [HAT, not CAP: Towards Highly Available Transactions](https://www.usenix.org/system/files/conference/hotos13/hotos13-final80.pdf). At *14th USENIX Workshop on Hot Topics in Operating Systems* (HotOS), May 2013.
[^11]: Armando Fox, Steven D. Gribble, Yatin Chawathe, Eric A. Brewer, and Paul Gauthier. [Cluster-Based Scalable Network Services](https://people.eecs.berkeley.edu/~brewer/cs262b/TACC.pdf). At *16th ACM Symposium on Operating Systems Principles* (SOSP), October 1997. [doi:10.1145/268998.266662](https://doi.org/10.1145/268998.266662)
[^12]: Tony Andrews. [Enforcing Complex Constraints in Oracle](https://tonyandrews.blogspot.com/2004/10/enforcing-complex-constraints-in.html). *tonyandrews.blogspot.co.uk*, October 2004. Archived at [archive.org](https://web.archive.org/web/20220201190625/https%3A//tonyandrews.blogspot.com/2004/10/enforcing-complex-constraints-in.html)
[^13]: Philip A. Bernstein, Vassos Hadzilacos, and Nathan Goodman. [*Concurrency Control and Recovery in Database Systems*](https://www.microsoft.com/en-us/research/people/philbe/book/). Addison-Wesley, 1987. ISBN: 978-0-201-10715-9, available online at [*microsoft.com*](https://www.microsoft.com/en-us/research/people/philbe/book/).
[^14]: Alan Fekete, Dimitrios Liarokapis, Elizabeth O’Neil, Patrick O’Neil, and Dennis Shasha. [Making Snapshot Isolation Serializable](https://www.cse.iitb.ac.in/infolab/Data/Courses/CS632/2009/Papers/p492-fekete.pdf). *ACM Transactions on Database Systems*, volume 30, issue 2, pages 492–528, June 2005. [doi:10.1145/1071610.1071615](https://doi.org/10.1145/1071610.1071615)
[^15]: Mai Zheng, Joseph Tucek, Feng Qin, and Mark Lillibridge. [Understanding the Robustness of SSDs Under Power Fault](https://www.usenix.org/system/files/conference/fast13/fast13-final80.pdf). At *11th USENIX Conference on File and Storage Technologies* (FAST), February 2013.
[^16]: Laurie Denness. [SSDs: A Gift and a Curse](https://laur.ie/blog/2015/06/ssds-a-gift-and-a-curse/). *laur.ie*, June 2015. Archived at [perma.cc/6GLP-BX3T](https://perma.cc/6GLP-BX3T)
[^17]: Adam Surak. [When Solid State Drives Are Not That Solid](https://www.algolia.com/blog/engineering/when-solid-state-drives-are-not-that-solid). *blog.algolia.com*, June 2015. Archived at [perma.cc/CBR9-QZEE](https://perma.cc/CBR9-QZEE)
[^18]: Hewlett Packard Enterprise. [Bulletin: (Revision) HPE SAS Solid State Drives - Critical Firmware Upgrade Required for Certain HPE SAS Solid State Drive Models to Prevent Drive Failure at 32,768 Hours of Operation](https://support.hpe.com/hpesc/public/docDisplay?docId=emr_na-a00092491en_us). *support.hpe.com*, November 2019. Archived at [perma.cc/CZR4-AQBS](https://perma.cc/CZR4-AQBS)
[^19]: Craig Ringer et al. [PostgreSQL’s handling of fsync() errors is unsafe and risks data loss at least on XFS](https://www.postgresql.org/message-id/flat/CAMsr%2BYHh%2B5Oq4xziwwoEfhoTZgr07vdGG%2Bhu%3D1adXx59aTeaoQ%40mail.gmail.com). Email thread on pgsql-hackers mailing list, *postgresql.org*, March 2018. Archived at [perma.cc/5RKU-57FL](https://perma.cc/5RKU-57FL)
[^20]: Anthony Rebello, Yuvraj Patel, Ramnatthan Alagappan, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. [Can Applications Recover from fsync Failures?](https://www.usenix.org/conference/atc20/presentation/rebello) At *USENIX Annual Technical Conference* (ATC), July 2020.
[^21]: Thanumalayan Sankaranarayana Pillai, Vijay Chidambaram, Ramnatthan Alagappan, Samer Al-Kiswany, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. [Crash Consistency: Rethinking the Fundamental Abstractions of the File System](https://dl.acm.org/doi/pdf/10.1145/2800695.2801719). *ACM Queue*, volume 13, issue 7, pages 20–28, July 2015. [doi:10.1145/2800695.2801719](https://doi.org/10.1145/2800695.2801719)
[^22]: Thanumalayan Sankaranarayana Pillai, Vijay Chidambaram, Ramnatthan Alagappan, Samer Al-Kiswany, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. [All File Systems Are Not Created Equal: On the Complexity of Crafting Crash-Consistent Applications](https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-pillai.pdf). At *11th USENIX Symposium on Operating Systems Design and Implementation* (OSDI), October 2014.
[^23]: Chris Siebenmann. [Unix’s File Durability Problem](https://utcc.utoronto.ca/~cks/space/blog/unix/FileSyncProblem). *utcc.utoronto.ca*, April 2016. Archived at [perma.cc/VSS8-5MC4](https://perma.cc/VSS8-5MC4)
[^24]: Aishwarya Ganesan, Ramnatthan Alagappan, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. [Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://www.usenix.org/conference/fast17/technical-sessions/presentation/ganesan). At *15th USENIX Conference on File and Storage Technologies* (FAST), February 2017.
[^25]: Lakshmi N. Bairavasundaram, Garth R. Goodson, Bianca Schroeder, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. [An Analysis of Data Corruption in the Storage Stack](https://www.usenix.org/legacy/event/fast08/tech/full_papers/bairavasundaram/bairavasundaram.pdf). At *6th USENIX Conference on File and Storage Technologies* (FAST), February 2008.
[^26]: Bianca Schroeder, Raghav Lagisetty, and Arif Merchant. [Flash Reliability in Production: The Expected and the Unexpected](https://www.usenix.org/conference/fast16/technical-sessions/presentation/schroeder). At *14th USENIX Conference on File and Storage Technologies* (FAST), February 2016.
[^27]: Don Allison. [SSD Storage – Ignorance of Technology Is No Excuse](https://blog.korelogic.com/blog/2015/03/24). *blog.korelogic.com*, March 2015. Archived at [perma.cc/9QN4-9SNJ](https://perma.cc/9QN4-9SNJ)
[^28]: Gordon Mah Ung. [Debunked: Your SSD won’t lose data if left unplugged after all](https://www.pcworld.com/article/427602/debunked-your-ssd-wont-lose-data-if-left-unplugged-after-all.html). *pcworld.com*, May 2015. Archived at [perma.cc/S46H-JUDU](https://perma.cc/S46H-JUDU)
[^29]: Martin Kleppmann. [Hermitage: Testing the ‘I’ in ACID](https://martin.kleppmann.com/2014/11/25/hermitage-testing-the-i-in-acid.html). *martin.kleppmann.com*, November 2014. Archived at [perma.cc/KP2Y-AQGK](https://perma.cc/KP2Y-AQGK)
[^30]: Todd Warszawski and Peter Bailis. [ACIDRain: Concurrency-Related Attacks on Database-Backed Web Applications](http://www.bailis.org/papers/acidrain-sigmod2017.pdf). At *ACM International Conference on Management of Data* (SIGMOD), May 2017. [doi:10.1145/3035918.3064037](https://doi.org/10.1145/3035918.3064037)
[^31]: Tristan D’Agosta. [BTC Stolen from Poloniex](https://bitcointalk.org/index.php?topic=499580). *bitcointalk.org*, March 2014. Archived at [perma.cc/YHA6-4C5D](https://perma.cc/YHA6-4C5D)
[^32]: bitcointhief2. [How I Stole Roughly 100 BTC from an Exchange and How I Could Have Stolen More!](https://www.reddit.com/r/Bitcoin/comments/1wtbiu/how_i_stole_roughly_100_btc_from_an_exchange_and/) *reddit.com*, February 2014. Archived at [archive.org](https://web.archive.org/web/20250118042610/https%3A//www.reddit.com/r/Bitcoin/comments/1wtbiu/how_i_stole_roughly_100_btc_from_an_exchange_and/)
[^33]: Sudhir Jorwekar, Alan Fekete, Krithi Ramamritham, and S. Sudarshan. [Automating the Detection of Snapshot Isolation Anomalies](https://www.vldb.org/conf/2007/papers/industrial/p1263-jorwekar.pdf). At *33rd International Conference on Very Large Data Bases* (VLDB), September 2007.
[^34]: Michael Melanson. [Transactions: The Limits of Isolation](https://www.michaelmelanson.net/posts/transactions-the-limits-of-isolation/). *michaelmelanson.net*, November 2014. Archived at [perma.cc/RG5R-KMYZ](https://perma.cc/RG5R-KMYZ)
[^35]: Edward Kim. [How ACH works: A developer perspective — Part 1](https://engineering.gusto.com/how-ach-works-a-developer-perspective-part-1-339d3e7bea1). *engineering.gusto.com*, April 2014. Archived at [perma.cc/7B2H-PU94](https://perma.cc/7B2H-PU94)
[^36]: Hal Berenson, Philip A. Bernstein, Jim N. Gray, Jim Melton, Elizabeth O’Neil, and Patrick O’Neil. [A Critique of ANSI SQL Isolation Levels](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tr-95-51.pdf). At *ACM International Conference on Management of Data* (SIGMOD), May 1995. [doi:10.1145/568271.223785](https://doi.org/10.1145/568271.223785)
[^37]: Atul Adya. [Weak Consistency: A Generalized Theory and Optimistic Implementations for Distributed Transactions](https://pmg.csail.mit.edu/papers/adya-phd.pdf). PhD Thesis, Massachusetts Institute of Technology, March 1999. Archived at [perma.cc/E97M-HW5Q](https://perma.cc/E97M-HW5Q)
[^38]: Peter Bailis, Aaron Davidson, Alan Fekete, Ali Ghodsi, Joseph M. Hellerstein, and Ion Stoica. [Highly Available Transactions: Virtues and Limitations](https://www.vldb.org/pvldb/vol7/p181-bailis.pdf). At *40th International Conference on Very Large Data Bases* (VLDB), September 2014.
[^39]: Natacha Crooks, Youer Pu, Lorenzo Alvisi, and Allen Clement. [Seeing is Believing: A Client-Centric Specification of Database Isolation](https://www.cs.cornell.edu/lorenzo/papers/Crooks17Seeing.pdf). At *ACM Symposium on Principles of Distributed Computing* (PODC), pages 73–82, July 2017. [doi:10.1145/3087801.3087802](https://doi.org/10.1145/3087801.3087802)
[^40]: Bruce Momjian. [MVCC Unmasked](https://momjian.us/main/writings/pgsql/mvcc.pdf). *momjian.us*, July 2014. Archived at [perma.cc/KQ47-9GYB](https://perma.cc/KQ47-9GYB)
[^41]: Peter Alvaro and Kyle Kingsbury. [MySQL 8.0.34](https://jepsen.io/analyses/mysql-8.0.34). *jepsen.io*, December 2023. Archived at [perma.cc/HGE2-Z878](https://perma.cc/HGE2-Z878)
[^42]: Egor Rogov. [PostgreSQL 14 Internals](https://postgrespro.com/community/books/internals). *postgrespro.com*, April 2023. Archived at [perma.cc/FRK2-D7WB](https://perma.cc/FRK2-D7WB)
[^43]: Hironobu Suzuki. [The Internals of PostgreSQL](https://www.interdb.jp/pg/). *interdb.jp*, 2017.
[^44]: Rohan Reddy Alleti. [Internals of MVCC in Postgres: Hidden costs of Updates vs Inserts](https://medium.com/%40rohanjnr44/internals-of-mvcc-in-postgres-hidden-costs-of-updates-vs-inserts-381eadd35844). *medium.com*, March 2025. Archived at [perma.cc/3ACX-DFXT](https://perma.cc/3ACX-DFXT)
[^45]: Andy Pavlo and Bohan Zhang. [The Part of PostgreSQL We Hate the Most](https://www.cs.cmu.edu/~pavlo/blog/2023/04/the-part-of-postgresql-we-hate-the-most.html). *cs.cmu.edu*, April 2023. Archived at [perma.cc/XSP6-3JBN](https://perma.cc/XSP6-3JBN)
[^46]: Yingjun Wu, Joy Arulraj, Jiexi Lin, Ran Xian, and Andrew Pavlo. [An empirical evaluation of in-memory multi-version concurrency control](https://vldb.org/pvldb/vol10/p781-Wu.pdf). *Proceedings of the VLDB Endowment*, volume 10, issue 7, pages 781–792, March 2017. [doi:10.14778/3067421.3067427](https://doi.org/10.14778/3067421.3067427)
[^47]: Nikita Prokopov. [Unofficial Guide to Datomic Internals](https://tonsky.me/blog/unofficial-guide-to-datomic-internals/). *tonsky.me*, May 2014.
[^48]: Daniil Svetlov. [A Practical Guide to Taming Postgres Isolation Anomalies](https://dansvetlov.me/postgres-anomalies/). *dansvetlov.me*, March 2025. Archived at [perma.cc/L7LE-TDLS](https://perma.cc/L7LE-TDLS)
[^49]: Nate Wiger. [An Atomic Rant](https://nateware.com/2010/02/18/an-atomic-rant/). *nateware.com*, February 2010. Archived at [perma.cc/5ZYB-PE44](https://perma.cc/5ZYB-PE44)
[^50]: James Coglan. [Reading and writing, part 3: web applications](https://blog.jcoglan.com/2020/10/12/reading-and-writing-part-3/). *blog.jcoglan.com*, October 2020. Archived at [perma.cc/A7EK-PJVS](https://perma.cc/A7EK-PJVS)
[^51]: Peter Bailis, Alan Fekete, Michael J. Franklin, Ali Ghodsi, Joseph M. Hellerstein, and Ion Stoica. [Feral Concurrency Control: An Empirical Investigation of Modern Application Integrity](http://www.bailis.org/papers/feral-sigmod2015.pdf). At *ACM International Conference on Management of Data* (SIGMOD), June 2015. [doi:10.1145/2723372.2737784](https://doi.org/10.1145/2723372.2737784)
[^52]: Jaana Dogan. [Things I Wished More Developers Knew About Databases](https://rakyll.medium.com/things-i-wished-more-developers-knew-about-databases-2d0178464f78). *rakyll.medium.com*, April 2020. Archived at [perma.cc/6EFK-P2TD](https://perma.cc/6EFK-P2TD)
[^53]: Michael J. Cahill, Uwe Röhm, and Alan Fekete. [Serializable Isolation for Snapshot Databases](https://www.cs.cornell.edu/~sowell/dbpapers/serializable_isolation.pdf). At *ACM International Conference on Management of Data* (SIGMOD), June 2008. [doi:10.1145/1376616.1376690](https://doi.org/10.1145/1376616.1376690)
[^54]: Dan R. K. Ports and Kevin Grittner. [Serializable Snapshot Isolation in PostgreSQL](https://drkp.net/papers/ssi-vldb12.pdf). At *38th International Conference on Very Large Databases* (VLDB), August 2012.
[^55]: Douglas B. Terry, Marvin M. Theimer, Karin Petersen, Alan J. Demers, Mike J. Spreitzer and Carl H. Hauser. [Managing Update Conflicts in Bayou, a Weakly Connected Replicated Storage System](https://pdos.csail.mit.edu/6.824/papers/bayou-conflicts.pdf). At *15th ACM Symposium on Operating Systems Principles* (SOSP), December 1995. [doi:10.1145/224056.224070](https://doi.org/10.1145/224056.224070)
[^56]: Hans-Jürgen Schönig. [Constraints over multiple rows in PostgreSQL](https://www.cybertec-postgresql.com/en/postgresql-constraints-over-multiple-rows/). *cybertec-postgresql.com*, June 2021. Archived at [perma.cc/2TGH-XUPZ](https://perma.cc/2TGH-XUPZ)
[^57]: Michael Stonebraker, Samuel Madden, Daniel J. Abadi, Stavros Harizopoulos, Nabil Hachem, and Pat Helland. [The End of an Architectural Era (It’s Time for a Complete Rewrite)](https://vldb.org/conf/2007/papers/industrial/p1150-stonebraker.pdf). At *33rd International Conference on Very Large Data Bases* (VLDB), September 2007.
[^58]: John Hugg. [H-Store/VoltDB Architecture vs. CEP Systems and Newer Streaming Architectures](https://www.youtube.com/watch?v=hD5M4a1UVz8). At *Data @Scale Boston*, November 2014.
[^59]: Robert Kallman, Hideaki Kimura, Jonathan Natkins, Andrew Pavlo, Alexander Rasin, Stanley Zdonik, Evan P. C. Jones, Samuel Madden, Michael Stonebraker, Yang Zhang, John Hugg, and Daniel J. Abadi. [H-Store: A High-Performance, Distributed Main Memory Transaction Processing System](https://www.vldb.org/pvldb/vol1/1454211.pdf). *Proceedings of the VLDB Endowment*, volume 1, issue 2, pages 1496–1499, August 2008.
[^60]: Rich Hickey. [The Architecture of Datomic](https://www.infoq.com/articles/Architecture-Datomic/). *infoq.com*, November 2012. Archived at [perma.cc/5YWU-8XJK](https://perma.cc/5YWU-8XJK)
[^61]: John Hugg. [Debunking Myths About the VoltDB In-Memory Database](https://dzone.com/articles/debunking-myths-about-voltdb). *dzone.com*, May 2014. Archived at [perma.cc/2Z9N-HPKF](https://perma.cc/2Z9N-HPKF)
[^62]: Xinjing Zhou, Viktor Leis, Xiangyao Yu, and Michael Stonebraker. [OLTP Through the Looking Glass 16 Years Later: Communication is the New Bottleneck](https://www.vldb.org/cidrdb/papers/2025/p17-zhou.pdf). At *15th Annual Conference on Innovative Data Systems Research* (CIDR), January 2025.
[^63]: Xinjing Zhou, Xiangyao Yu, Goetz Graefe, and Michael Stonebraker. [Lotus: scalable multi-partition transactions on single-threaded partitioned databases](https://www.vldb.org/pvldb/vol15/p2939-zhou.pdf). *Proceedings of the VLDB Endowment* (PVLDB), volume 15, issue 11, pages 2939–2952, July 2022. [doi:10.14778/3551793.3551843](https://doi.org/10.14778/3551793.3551843)
[^64]: Joseph M. Hellerstein, Michael Stonebraker, and James Hamilton. [Architecture of a Database System](https://dsf.berkeley.edu/papers/fntdb07-architecture.pdf). *Foundations and Trends in Databases*, volume 1, issue 2, pages 141–259, November 2007. [doi:10.1561/1900000002](https://doi.org/10.1561/1900000002)
[^65]: Michael J. Cahill. [Serializable Isolation for Snapshot Databases](https://ses.library.usyd.edu.au/bitstream/handle/2123/5353/michael-cahill-2009-thesis.pdf). PhD Thesis, University of Sydney, July 2009. Archived at [perma.cc/727J-NTMP](https://perma.cc/727J-NTMP)
[^66]: Cristian Diaconu, Craig Freedman, Erik Ismert, Per-Åke Larson, Pravin Mittal, Ryan Stonecipher, Nitin Verma, and Mike Zwilling. [Hekaton: SQL Server’s Memory-Optimized OLTP Engine](https://www.microsoft.com/en-us/research/wp-content/uploads/2013/06/Hekaton-Sigmod2013-final.pdf). At *ACM SIGMOD International Conference on Management of Data* (SIGMOD), pages 1243–1254, June 2013. [doi:10.1145/2463676.2463710](https://doi.org/10.1145/2463676.2463710)
[^67]: Thomas Neumann, Tobias Mühlbauer, and Alfons Kemper. [Fast Serializable Multi-Version Concurrency Control for Main-Memory Database Systems](https://db.in.tum.de/~muehlbau/papers/mvcc.pdf). At *ACM SIGMOD International Conference on Management of Data* (SIGMOD), pages 677–689, May 2015. [doi:10.1145/2723372.2749436](https://doi.org/10.1145/2723372.2749436)
[^68]: D. Z. Badal. [Correctness of Concurrency Control and Implications in Distributed Databases](https://ieeexplore.ieee.org/abstract/document/762563). At *3rd International IEEE Computer Software and Applications Conference* (COMPSAC), November 1979. [doi:10.1109/CMPSAC.1979.762563](https://doi.org/10.1109/CMPSAC.1979.762563)
[^69]: Rakesh Agrawal, Michael J. Carey, and Miron Livny. [Concurrency Control Performance Modeling: Alternatives and Implications](https://people.eecs.berkeley.edu/~brewer/cs262/ConcControl.pdf). *ACM Transactions on Database Systems* (TODS), volume 12, issue 4, pages 609–654, December 1987. [doi:10.1145/32204.32220](https://doi.org/10.1145/32204.32220)
[^70]: Marc Brooker. [Snapshot Isolation vs Serializability](https://brooker.co.za/blog/2024/12/17/occ-and-isolation.html). *brooker.co.za*, December 2024. Archived at [perma.cc/5TRC-CR5G](https://perma.cc/5TRC-CR5G)
[^71]: B. G. Lindsay, P. G. Selinger, C. Galtieri, J. N. Gray, R. A. Lorie, T. G. Price, F. Putzolu, I. L. Traiger, and B. W. Wade. [Notes on Distributed Databases](https://dominoweb.draco.res.ibm.com/reports/RJ2571.pdf). IBM Research, Research Report RJ2571(33471), July 1979. Archived at [perma.cc/EPZ3-MHDD](https://perma.cc/EPZ3-MHDD)
[^72]: C. Mohan, Bruce G. Lindsay, and Ron Obermarck. [Transaction Management in the R\* Distributed Database Management System](https://cs.brown.edu/courses/csci2270/archives/2012/papers/dtxn/p378-mohan.pdf). *ACM Transactions on Database Systems*, volume 11, issue 4, pages 378–396, December 1986. [doi:10.1145/7239.7266](https://doi.org/10.1145/7239.7266)
[^73]: X/Open Company Ltd. [Distributed Transaction Processing: The XA Specification](https://pubs.opengroup.org/onlinepubs/009680699/toc.pdf). Technical Standard XO/CAE/91/300, December 1991. ISBN: 978-1-872-63024-3, archived at [perma.cc/Z96H-29JB](https://perma.cc/Z96H-29JB)
[^74]: Ivan Silva Neto and Francisco Reverbel. [Lessons Learned from Implementing WS-Coordination and WS-AtomicTransaction](https://www.ime.usp.br/~reverbel/papers/icis2008.pdf). At *7th IEEE/ACIS International Conference on Computer and Information Science* (ICIS), May 2008. [doi:10.1109/ICIS.2008.75](https://doi.org/10.1109/ICIS.2008.75)
[^75]: James E. Johnson, David E. Langworthy, Leslie Lamport, and Friedrich H. Vogt. [Formal Specification of a Web Services Protocol](https://www.microsoft.com/en-us/research/publication/formal-specification-of-a-web-services-protocol/). At *1st International Workshop on Web Services and Formal Methods* (WS-FM), February 2004. [doi:10.1016/j.entcs.2004.02.022](https://doi.org/10.1016/j.entcs.2004.02.022)
[^76]: Jim Gray. [The Transaction Concept: Virtues and Limitations](https://jimgray.azurewebsites.net/papers/thetransactionconcept.pdf). At *7th International Conference on Very Large Data Bases* (VLDB), September 1981.
[^77]: Dale Skeen. [Nonblocking Commit Protocols](https://www.cs.utexas.edu/~lorenzo/corsi/cs380d/papers/Ske81.pdf). At *ACM International Conference on Management of Data* (SIGMOD), April 1981. [doi:10.1145/582318.582339](https://doi.org/10.1145/582318.582339)
[^78]: Gregor Hohpe. [Your Coffee Shop Doesn’t Use Two-Phase Commit](https://www.martinfowler.com/ieeeSoftware/coffeeShop.pdf). *IEEE Software*, volume 22, issue 2, pages 64–66, March 2005. [doi:10.1109/MS.2005.52](https://doi.org/10.1109/MS.2005.52)
[^79]: Pat Helland. [Life Beyond Distributed Transactions: An Apostate’s Opinion](https://www.cidrdb.org/cidr2007/papers/cidr07p15.pdf). At *3rd Biennial Conference on Innovative Data Systems Research* (CIDR), January 2007.
[^80]: Jonathan Oliver. [My Beef with MSDTC and Two-Phase Commits](https://blog.jonathanoliver.com/my-beef-with-msdtc-and-two-phase-commits/). *blog.jonathanoliver.com*, April 2011. Archived at [perma.cc/K8HF-Z4EN](https://perma.cc/K8HF-Z4EN)
[^81]: Oren Eini (Ayende Rahien). [The Fallacy of Distributed Transactions](https://ayende.com/blog/167362/the-fallacy-of-distributed-transactions). *ayende.com*, July 2014. Archived at [perma.cc/VB87-2JEF](https://perma.cc/VB87-2JEF)
[^82]: Clemens Vasters. [Transactions in Windows Azure (with Service Bus) – An Email Discussion](https://learn.microsoft.com/en-gb/archive/blogs/clemensv/transactions-in-windows-azure-with-service-bus-an-email-discussion). *learn.microsoft.com*, July 2012. Archived at [perma.cc/4EZ9-5SKW](https://perma.cc/4EZ9-5SKW)
[^83]: Ajmer Dhariwal. [Orphaned MSDTC Transactions (-2 spids)](https://www.eraofdata.com/posts/2008/orphaned-msdtc-transactions-2-spids/). *eraofdata.com*, December 2008. Archived at [perma.cc/YG6F-U34C](https://perma.cc/YG6F-U34C)
[^84]: Paul Randal. [Real World Story of DBCC PAGE Saving the Day](https://www.sqlskills.com/blogs/paul/real-world-story-of-dbcc-page-saving-the-day/). *sqlskills.com*, June 2013. Archived at [perma.cc/2MJN-A5QH](https://perma.cc/2MJN-A5QH)
[^85]: Guozhang Wang, Lei Chen, Ayusman Dikshit, Jason Gustafson, Boyang Chen, Matthias J. Sax, John Roesler, Sophie Blee-Goldman, Bruno Cadonna, Apurva Mehta, Varun Madan, and Jun Rao. [Consistency and Completeness: Rethinking Distributed Stream Processing in Apache Kafka](https://dl.acm.org/doi/pdf/10.1145/3448016.3457556). At *ACM International Conference on Management of Data* (SIGMOD), June 2021. [doi:10.1145/3448016.3457556](https://doi.org/10.1145/3448016.3457556)
================================================
FILE: content/tw/ch9.md
================================================
---
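title: "9. The Trouble with Distributed Systems"
weight: 209
breadcrumbs: false
---

> *They're funny things, Accidents. You never have them till you're having them.*
>
> A.A. Milne, *The House at Pooh Corner* (1928)
As discussed in ["Reliability and Fault Tolerance"](/tw/ch2#sec_introduction_reliability), making a system reliable means ensuring that the system as a whole continues working even when things go wrong (that is, when there is a fault). However, anticipating all the possible faults and handling them is not that easy. As developers, it is very tempting to focus mostly on the happy path (after all, most of the time things work fine!) and to neglect faults, since faults introduce a lot of edge cases.
If you want your system to be reliable in the presence of faults, you have to radically change your mindset and focus on the things that could go wrong, even though they may be unlikely. It doesn't matter whether there is only a one-in-a-million chance of something going wrong: in a large enough system, one-in-a-million events happen every day. Experienced systems operators will tell you that anything that *can* go wrong *will* go wrong.
Moreover, working with distributed systems is fundamentally different from writing software on a single computer, and the main difference is that there are lots of new and exciting ways for things to go wrong [^1] [^2]. In this chapter, you will get a taste of the problems that arise in practice, and an understanding of the things you can and cannot rely on.
To understand the challenges we are up against, we will now turn our pessimism to the maximum and explore the things that may go wrong in a distributed system. We will look into problems with networks (["Unreliable Networks"](#sec_distributed_networks)) as well as clocks and timing issues (["Unreliable Clocks"](#sec_distributed_clocks)). The consequences of all these issues are disorienting, so we will explore how to think about the state of a distributed system and how to reason about things that have happened (["Knowledge, Truth, and Lies"](#sec_distributed_truth)). Later, in [Chapter 10](/tw/ch10#ch_consistency), we will look at some examples of how to achieve fault tolerance in the face of those faults.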
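## Faults and Partial Failures {#sec_distributed_partial_failure}
When you are writing a program on a single computer, it normally behaves in a fairly predictable way: either it works or it doesn't. Buggy software may give the appearance that the computer is sometimes "having a bad day" (a problem that is often fixed by a reboot), but that is mostly just a consequence of badly written software.
There is no fundamental reason why software on a single computer should be flaky: when the hardware is working correctly, the same operation always produces the same result (it is *deterministic*). If there is a hardware problem (for example, memory corruption or a loose connector), the consequence is usually a total system failure (for example, a kernel panic, a "blue screen of death," or a failure to start up). An individual computer with good software is usually either fully functional or entirely broken, but not something in between.
This is a deliberate choice in the design of computers: if an internal fault occurs, we prefer a computer to crash completely rather than return a wrong result, because wrong results are difficult and confusing to deal with. Thus, computers hide the fuzzy physical reality on which they are implemented and present an idealized system model that operates with mathematical perfection. A CPU instruction always does the same thing; if you write some data to memory or disk, that data remains intact and doesn't get randomly corrupted. As discussed in ["Hardware and Software Faults"](/tw/ch2#sec_introduction_hardware_faults), this is not actually true: in reality, data does get silently corrupted, and CPUs do sometimes silently return the wrong result. However, it happens rarely enough that we can get away with ignoring it.
When you are writing software that runs on several computers, connected by a network, the situation is fundamentally different. In distributed systems, faults occur much more frequently, so we can no longer ignore them: we have no choice but to confront the messy reality of the physical world. And in the physical world, a remarkably wide range of things can go wrong, as illustrated by this anecdote [^3]:
> In my limited experience I've dealt with long-lived network partitions in a single data center (DC), PDU [power distribution unit] failures, switch failures, accidental power cycles of whole racks, whole-DC backbone failures, whole-DC power failures, and a hypoglycemic driver smashing his Ford pickup truck into a DC's HVAC [heating, ventilation, and air conditioning] system. And I'm not even an ops guy.
>
> Coda Hale
In a distributed system, some parts of the system may be broken in some unpredictable way, even though other parts of the system are working fine. This is known as a *partial failure*. The difficulty is that partial failures are *nondeterministic*: if you try to do anything involving multiple nodes and the network, it may sometimes work and sometimes unpredictably fail. As we shall see, you may not even *know* whether something succeeded!
This nondeterminism and the possibility of partial failures is what makes distributed systems hard to work with [^4]. On the other hand, if a distributed system can tolerate partial failures, that opens up powerful possibilities: for example, it allows you to perform a rolling upgrade, rebooting one node at a time to install software updates while the system as a whole continues working without interruption. Fault tolerance therefore allows us to build, from unreliable components, distributed systems that are more reliable than single-node systems.
But before we can implement fault tolerance, we need to know more about the faults that we are supposed to tolerate. It is important to consider a wide range of possible faults, even fairly unlikely ones, and to artificially create such situations in your testing environment to see what happens. In distributed systems, suspicion, pessimism, and paranoia pay off.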
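## Unreliable Networks {#sec_distributed_networks}
As discussed in ["Shared-Memory, Shared-Disk, and Shared-Nothing Architecture"](/tw/ch2#sec_introduction_shared_nothing), the distributed systems we focus on in this book are mostly *shared-nothing systems*: that is, a set of machines connected by a network. The network is the only way those machines can communicate: we assume that each machine has its own memory and disk, and that one machine cannot access another machine's memory or disk (except by making requests to a service over the network). Even when storage is shared, as with Amazon's S3, machines communicate with the shared storage service over the network.
The internet and most internal networks in datacenters (often Ethernet) are *asynchronous packet networks*. In this kind of network, one node can send a message (a packet) to another node, but the network gives no guarantees as to when it will arrive, or whether it will arrive at all. If you send a request and expect a response, many things could go wrong (some of which are illustrated in [Figure 9-1](#fig_distributed_network)):
1. Your request may have been lost (perhaps someone unplugged a network cable).
2. Your request may be waiting in a queue and will be delivered later (perhaps the network or the recipient is overloaded).
3. The remote node may have failed (perhaps it crashed or was powered down).
4. The remote node may have temporarily stopped responding (perhaps it is experiencing a long garbage collection pause; see ["Process Pauses"](#sec_distributed_clocks_pauses)), but it will start responding again later.
5. The remote node may have processed your request, but the response has been lost on the network (perhaps a network switch has been misconfigured).
6. The remote node may have processed your request, but the response has been delayed and will be delivered later (perhaps the network or your own machine is overloaded).
{{< figure src="/fig/ddia_0901.png" id="fig_distributed_network" caption="Figure 9-1. If you send a request and don't get a response, it's not possible to distinguish whether (a) the request was lost, (b) the remote node is down, or (c) the response was lost." class="w-full my-4" >}}
The sender can't even tell whether the packet was delivered: the only option is for the recipient to send a response message, which may in turn be lost or delayed. These issues are indistinguishable in an asynchronous network: the only information you have is that you haven't received a response yet. If you send a request to another node and don't receive a response, it is *impossible* to tell why.
The usual way of handling this issue is a *timeout*: after some time you give up waiting and assume that the response is not going to arrive. However, when a timeout occurs, you still don't know whether the remote node got your request or not (and if the request is still queued somewhere, it may still be delivered to the recipient, even if the sender has given up on it).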
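
The six failure cases listed above can be modeled in a small sketch to show why a timeout carries so little information. Everything here is an illustrative assumption: the `Scenario` record, the `client_view` function, and the arrival times are made up, and no real network is involved.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Scenario:
    description: str
    request_processed: bool            # did the remote node ever act on it?
    response_arrival: Optional[float]  # seconds after sending; None = never


def client_view(scenario: Scenario, timeout: float) -> str:
    """What the sender can observe: a response, or nothing before the timeout."""
    arrival = scenario.response_arrival
    if arrival is not None and arrival <= timeout:
        return "response"
    return "timeout"


SCENARIOS = [
    Scenario("request lost", False, None),
    Scenario("request queued, delivered later", True, 45.0),
    Scenario("remote node crashed", False, None),
    Scenario("remote node paused, resumes later", True, 60.0),
    Scenario("request processed, response lost", True, None),
    Scenario("request processed, response delayed", True, 40.0),
]

# With a 30-second timeout, every scenario looks identical to the sender...
views = {client_view(s, timeout=30.0) for s in SCENARIOS}
# ...even though the request was processed in some cases and not in others.
outcomes = {s.request_processed for s in SCENARIOS}
```

The sender sees the same `"timeout"` in all six cases, while `outcomes` contains both `True` and `False`: the observation does not determine whether the request took effect.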
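### The Limitations of TCP {#sec_distributed_tcp}
Network packets have a maximum size (generally a few kilobytes), but many applications need to send messages (requests, responses) that are too big to fit in one packet. These applications most often use TCP, the Transmission Control Protocol, to establish a *connection* that breaks up large data streams into individual packets and puts them back together again on the receiving side.
--------
> [!NOTE]
> Most of what we say about TCP also applies to its more recent alternative QUIC, as well as the Stream Control Transmission Protocol (SCTP) used in WebRTC, the BitTorrent uTP protocol, and other transport protocols. For a comparison to UDP, see ["TCP Versus UDP"](#sidebar_distributed_tcp_udp).
--------
TCP is often described as providing "reliable" delivery, in the sense that it detects and retransmits dropped packets, it detects reordered packets and puts them back in the correct order, and it detects packet corruption using a simple checksum. It also figures out how fast it can send data so that it is transferred as quickly as possible without overloading the network or the receiving node; this is known as *congestion control*, *flow control*, or *backpressure* [^5].
When you "send" some data by writing it to a socket, it actually doesn't get sent immediately, but is only placed in a buffer managed by your operating system. When the congestion control algorithm decides that it has capacity to send a packet, it takes the next packet's worth of data from that buffer and passes it to the network interface. The packet passes through several switches and routers, and eventually the receiving node's operating system places the packet's data in a receive buffer and sends an acknowledgment packet back to the sender. Only then does the receiving operating system notify the application that more data has arrived [^6].
So, if TCP provides "reliability," does that mean we no longer need to worry about networks being unreliable? Unfortunately not. If no acknowledgment is received within some timeout, TCP assumes that the packet must have been lost, but TCP can't tell whether it was the outbound packet or the acknowledgment that was lost. Although TCP can resend the packet, it can't guarantee that the new packet will get through either. If the network cable is unplugged, TCP can't plug it back in for you. Eventually, after a configurable timeout, TCP gives up and signals an error to the application.
If a TCP connection is closed with an error (perhaps because the remote node crashed, or perhaps because the network was interrupted), you unfortunately have no way of knowing how much data was actually processed by the remote node [^6]. Even if TCP acknowledged that a packet was delivered, this only means that the operating system kernel on the remote node received it; the application may have crashed before it handled that data. If you want to be sure that a request was successful, you need a positive response from the application itself [^7].
Nevertheless, TCP is very useful, because it provides a convenient way of sending and receiving messages that are too big to fit in one packet. Once a TCP connection is established, you can also use it to send multiple requests and responses. This is usually done by first sending a header that indicates the length of the following message in bytes, followed by the actual message. HTTP and many RPC protocols (see ["Dataflow Through Services: REST and RPC"](/tw/ch5#sec_encoding_dataflow_rpc)) work like this.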
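
Length-prefixed framing of this kind might look as follows. This is a sketch under assumptions: the 4-byte big-endian length header and the function names `encode_frame`, `read_exactly`, and `read_frame` are choices made for illustration, not the wire format of HTTP or of any specific RPC protocol. An in-memory stream stands in for a TCP connection.

```python
import io
import struct

HEADER = struct.Struct("!I")  # assumed header: 4-byte unsigned length, big-endian


def encode_frame(payload: bytes) -> bytes:
    """Prefix the message with its length in bytes."""
    return HEADER.pack(len(payload)) + payload


def read_exactly(stream, n: int) -> bytes:
    """Read n bytes; a byte stream may return fewer bytes per read call."""
    buf = bytearray()
    while len(buf) < n:
        chunk = stream.read(n - len(buf))
        if not chunk:
            raise EOFError("stream closed in the middle of a frame")
        buf.extend(chunk)
    return bytes(buf)


def read_frame(stream) -> bytes:
    """Read one length-prefixed message from the stream."""
    (length,) = HEADER.unpack(read_exactly(stream, HEADER.size))
    return read_exactly(stream, length)


# Two messages sent back to back over the same "connection".
wire = io.BytesIO(encode_frame(b"first request") + encode_frame(b"second"))
messages = [read_frame(wire), read_frame(wire)]
```

The receiver recovers the message boundaries from the length header, which is needed because TCP itself delivers an undifferentiated byte stream.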
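### Network Faults in Practice {#sec_distributed_network_faults}
We have been building computer networks for decades, and one might hope that by now we would have figured out how to make them reliable. Unfortunately, we have not yet succeeded. There are some systematic studies, and plenty of anecdotal evidence, showing that network problems can be surprisingly common, even in controlled environments like a datacenter operated by one company [^8]:
* One study in a medium-sized datacenter found about 12 network faults per month, of which half disconnected a single machine, and half disconnected an entire rack [^9].
* Another study measured the failure rates of components like top-of-rack switches, aggregation switches, and load balancers [^10]. It found that adding redundant networking gear doesn't reduce faults as much as you might hope, since it doesn't guard against human error (e.g., misconfigured switches), which is a major cause of outages.
* Interruptions of wide-area fiber links have been blamed on cows [^11], beavers [^12], and sharks [^13] (though shark bites have become rarer due to better shielding of submarine cables [^14]). Humans are also at fault, be it due to accidental misconfiguration [^15], scavenging [^16], or sabotage [^17].
* Across different cloud regions, round-trip times of up to several *minutes* have been observed at high percentiles [^18]. Even within a single datacenter, packet delays of more than a minute can occur during a network topology reconfiguration (triggered by a problem during a software upgrade for a switch) [^19]. Thus, we have to assume that messages might be delayed arbitrarily.
* Sometimes communication is partially interrupted, depending on whom you're talking to: for example, A and B can communicate, B and C can communicate, but A and C cannot [^20] [^21]. Other surprising faults include a network interface that sometimes drops all inbound packets but sends outbound packets successfully [^22]: just because a network link works in one direction doesn't guarantee it's also working in the opposite direction.
* Even a brief network interruption can have repercussions that last for much longer than the original issue [^8] [^20] [^23].
--------
> [!TIP] Network partitions
>
> When one part of the network is cut off from the rest due to a network fault, that is sometimes called a *network partition* or *netsplit*, but it is not fundamentally different from other kinds of network interruption. Network partitions are not related to sharding of a storage system, which is sometimes also called *partitioning* (see [Chapter 7](/tw/ch7#ch_sharding)).
--------
Even if network faults are rare in your environment, the fact that faults *can* occur means that your software needs to be able to handle them. Whenever any communication happens over a network, it may fail; there is no way around it.
If the error handling of network faults is not defined and tested, arbitrarily bad things could happen: for example, the cluster could become deadlocked and permanently unable to serve requests, even when the network recovers [^24], or it could even delete all of your data [^25]. If software is put in an unanticipated situation, it may do arbitrary unexpected things.
Handling network faults doesn't necessarily mean *tolerating* them: if your network is normally fairly reliable, a valid approach may be to simply show an error message to users while your network is experiencing problems. However, you do need to know how your software reacts to network problems and ensure that the system can recover from them. It may make sense to deliberately trigger network problems and test the system's response (this is known as *fault injection*; see ["Fault injection"](#sec_fault_injection)).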
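
As a rough sketch of what deliberately triggering network problems in a test could look like, the wrapper below drops requests or responses with a configurable probability. The `FaultyNetwork` class and its parameters are hypothetical and exist only for this illustration; real fault-injection tools operate on actual network traffic rather than on function calls.

```python
import random


class FaultyNetwork:
    """Hypothetical test double that sits between a client and a handler."""

    def __init__(self, drop_probability: float, seed: int = 0):
        self.drop_probability = drop_probability
        self._rng = random.Random(seed)  # seeded so test runs are repeatable

    def call(self, handler, request):
        # Fault 1: the request is lost before the handler ever sees it.
        if self._rng.random() < self.drop_probability:
            raise TimeoutError("injected fault: request dropped")
        response = handler(request)
        # Fault 2: the handler ran, but its response is lost on the way back.
        if self._rng.random() < self.drop_probability:
            raise TimeoutError("injected fault: response dropped")
        return response


handled = []

def handler(request):
    handled.append(request)
    return request.upper()

reliable = FaultyNetwork(drop_probability=0.0)
broken = FaultyNetwork(drop_probability=1.0)
ok = reliable.call(handler, "ping")
```

In both injected faults the caller sees the same `TimeoutError`, yet in the second case the handler has already run, which is exactly the ambiguity the code under test has to cope with.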
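### Detecting Faults {#id307}
Many systems need to automatically detect faulty nodes. For example:
* A load balancer needs to stop sending requests to a node that is dead (i.e., take it *out of rotation*).
* In a distributed database with single-leader replication, if the leader fails, one of the followers needs to be promoted to be the new leader (see ["Handling Node Outages"](/tw/ch6#sec_replication_failover)).
Unfortunately, the uncertainty about the network makes it difficult to tell whether a node is working or not. In some specific circumstances you might get some feedback to explicitly tell you that something is not working:
* If you can reach the machine on which the node should be running, but no process is listening on the destination port (e.g., because the process crashed), the operating system will helpfully close or refuse TCP connections by sending a `RST` or `FIN` packet in reply.
* If a node process crashed (or was killed by an administrator) but the node's operating system is still running, a script can notify other nodes about the crash so that another node can take over quickly without having to wait for a timeout to expire. For example, HBase does this [^26].
* If you have access to the management interface of the network switches in your datacenter, you can query them to detect link failures at a hardware level (e.g., if the remote machine is powered down). This option is ruled out if you're connecting via the internet, or if you're in a shared datacenter with no access to the switches themselves, or if you can't reach the management interface due to a network problem.
* If a router is sure that the IP address you're trying to connect to is unreachable, it may reply to you with an ICMP Destination Unreachable packet. However, the router doesn't have a magic failure detection capability either: it is subject to the same limitations as other participants of the network.
Rapid feedback about a remote node being down is useful, but you can't count on it. If something has gone wrong, you may get an error response at some level of the stack, but in general you have to assume that you will get no response at all. You can retry a few times, wait for a timeout to elapse, and eventually declare the node dead if you don't hear back within the timeout.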
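
The retry-then-declare-dead approach could be sketched like this. The function names, the result labels, and the default of three attempts are assumptions made for illustration. Treating an explicit error as immediate evidence of failure is also a simplification, since a refused connection may only mean that the process is restarting.

```python
import socket
from typing import Callable


def tcp_probe(host: str, port: int, timeout: float) -> str:
    """Try to open a TCP connection and classify what came back."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "up"
    except ConnectionRefusedError:
        return "refused"      # explicit feedback: nothing is listening
    except socket.timeout:
        return "no-response"  # silence: could be the network or the node
    except OSError:
        return "error"        # other explicit errors, e.g. host unreachable


def check_node(probe: Callable[[], str], attempts: int = 3) -> str:
    """Retry a few times; give up only after every attempt stays silent."""
    for _ in range(attempts):
        result = probe()
        if result == "up":
            return "alive"
        if result in ("refused", "error"):
            return "dead (explicit error)"
    return "dead (timed out)"
```

A caller might run `check_node(lambda: tcp_probe("10.0.0.5", 5432, timeout=1.0))`, where the address and port are placeholders. A verdict of `"dead (timed out)"` is only a suspicion: the node may simply be slow.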
### 超時和無界延遲 {#sec_distributed_queueing}
如果超時是檢測故障的唯一可靠方法,那麼超時應該多長?不幸的是,沒有簡單的答案。
長超時意味著在節點被宣佈死亡之前需要長時間等待(在此期間,使用者可能不得不等待或看到錯誤訊息)。短超時可以更快地檢測故障,但當節點實際上只是遭受暫時的減速(例如,由於節點或網路上的負載峰值)時,錯誤地宣佈節點死亡的風險更高。
過早地宣佈節點死亡是有問題的:如果節點實際上是活著的並且正在執行某些操作(例如,傳送電子郵件),而另一個節點接管,該操作可能最終被執行兩次。我們將在 ["知識、真相與謊言"](#sec_distributed_truth) 以及第 10 章和後續章節中更詳細地討論這個問題。
當節點被宣佈死亡時,其職責需要轉移到其他節點,這會給其他節點和網路帶來額外的負載。如果系統已經在高負載下掙扎,過早地宣佈節點死亡可能會使問題變得更糟。特別是,可能發生的情況是,節點實際上並沒有死亡,只是由於過載而響應緩慢;將其負載轉移到其他節點可能會導致級聯故障(在極端情況下,所有節點互相宣佈對方死亡,一切都停止工作 —— 見 ["當過載系統無法恢復時"](/tw/ch2#sidebar_metastable))。
想象一個虛構的系統,其網路保證資料包的最大延遲 —— 每個資料包要麼在某個時間 *d* 內交付,要麼丟失,但交付從不會超過 *d*。此外,假設你可以保證未失效的節點總是在某個時間 *r* 內處理請求。在這種情況下,你可以保證每個成功的請求在時間 2*d* + *r* 內收到響應 —— 如果你在該時間內沒有收到響應,你就知道網路或遠端節點不工作。如果這是真的,2*d* + *r* 將是一個合理的超時時間。
不幸的是,我們使用的大多數系統都沒有這些保證:非同步網路具有 *無界延遲*(即,它們嘗試儘快交付資料包,但資料包到達所需的時間沒有上限),大多數伺服器實現無法保證它們可以在某個最大時間內處理請求(見 ["響應時間保證"](#sec_distributed_clocks_realtime))。對於故障檢測,系統大部分時間快速執行是不夠的:如果你的超時很低,往返時間的瞬時峰值就足以使系統失去平衡。
#### 網路擁塞和排隊 {#network-congestion-and-queueing}
開車時,道路網路上的行駛時間通常因交通擁堵而變化最大。同樣,計算機網路上資料包延遲的可變性最常是由於排隊 [^27]:
* 如果幾個不同的節點同時嘗試向同一目的地傳送資料包,網路交換機必須將它們排隊並逐個送入目標網路鏈路(如 [圖 9-2](#fig_distributed_switch_queueing) 所示)。在繁忙的網路鏈路上,資料包可能需要等待一段時間才能獲得一個插槽(這稱為 *網路擁塞*)。如果有太多的傳入資料以至於交換機佇列滿了,資料包將被丟棄,因此需要重新發送 —— 即使網路執行正常。
* 當資料包到達目標機器時,如果所有 CPU 核心當前都很忙,來自網路的傳入請求會被作業系統排隊,直到應用程式準備處理它。根據機器上的負載,這可能需要任意長的時間 [^28]。
* 在虛擬化環境中,正在執行的作業系統經常會暫停幾十毫秒,而另一個虛擬機器使用 CPU 核心。在此期間,VM 無法消耗來自網路的任何資料,因此傳入資料由虛擬機器監視器排隊(緩衝)[^29],進一步增加了網路延遲的可變性。
* 如前所述,為了避免網路過載,TCP 限制傳送資料的速率。這意味著在資料甚至進入網路之前,傳送方就有額外的排隊。
{{< figure src="/fig/ddia_0902.png" id="fig_distributed_switch_queueing" caption="圖 9-2. 如果幾臺機器向同一目的地傳送網路流量,其交換機佇列可能會滿。這裡,埠 1、2 和 4 都試圖向埠 3 傳送資料包。" class="w-full my-4" >}}
此外,當 TCP 檢測到並自動重傳丟失的資料包時,儘管應用程式不會直接看到資料包丟失,但它確實會看到由此產生的延遲(等待超時到期,然後等待重傳的資料包被確認)。
--------
> [!TIP] TCP 與 UDP
>
> 一些對延遲敏感的應用程式,如視訊會議和 IP 語音(VoIP),使用 UDP 而不是 TCP。這是可靠性和延遲可變性之間的權衡:由於 UDP 不執行流量控制並且不重傳丟失的資料包,它避免了網路延遲可變的一些原因(儘管它仍然容易受到交換機佇列和排程延遲的影響)。
>
> UDP 是延遲資料無價值的情況下的好選擇。例如,在 VoIP 電話通話中,在資料應該透過揚聲器播放之前,可能沒有足夠的時間重傳丟失的資料包。在這種情況下,重傳資料包沒有意義 —— 應用程式必須用靜音填充缺失資料包的時間槽(導致聲音短暫中斷)並繼續流。重試發生在人類層面。("你能重複一下嗎?聲音剛剛中斷了一會兒。")
--------
所有這些因素都導致了網路延遲的可變性。當系統接近其最大容量時,排隊延遲的範圍特別大:具有充足備用容量的系統可以輕鬆排空佇列,而在高度利用的系統中,長佇列可以很快建立起來。
在公共雲和多租戶資料中心中,資源在許多客戶之間共享:網路鏈路和交換機,甚至每臺機器的網路介面和 CPU(在虛擬機器上執行時)都是共享的。處理大量資料可以使用網路鏈路的全部容量(*飽和* 它們)。由於你無法控制或瞭解其他客戶對共享資源的使用情況,如果你附近的某人(*吵鬧的鄰居*)正在使用大量資源,網路延遲可能會高度可變 [^30] [^31]。
在這種環境中,你只能透過實驗選擇超時:在較長時間內和許多機器上測量網路往返時間的分佈,以確定延遲的預期可變性。然後,考慮到你的應用程式的特徵,你可以在故障檢測延遲和過早超時風險之間確定適當的權衡。
更好的是,系統可以持續測量響應時間及其可變性(*抖動*),並根據觀察到的響應時間分佈自動調整超時,而不是使用配置的常量超時。Phi 累積故障檢測器 [^32](例如在 Akka 和 Cassandra 中使用 [^33])就是這樣做的一種方法。TCP 重傳超時也以類似的方式工作 [^5]。
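作為參考,下面是一個根據觀察到的往返時間自動調整超時的簡化草圖。它採用 RFC 6298 中 TCP 重傳超時的平滑公式(SRTT 與 RTTVAR),而不是 Phi 累積故障檢測器;類名 `AdaptiveTimeout` 和引數 `min_timeout` 是為說明而假設的,真實的 TCP 實現還會考慮時鐘粒度和超時下限等細節。

```python
class AdaptiveTimeout:
    """依 RFC 6298 的平滑公式估計超時(單位:秒)。"""

    ALPHA = 1 / 8  # SRTT(平滑後的往返時間)的平滑係數
    BETA = 1 / 4   # RTTVAR(往返時間變化量)的平滑係數
    K = 4          # 變化量的放大倍數

    def __init__(self, min_timeout: float = 0.0):
        self.srtt = None
        self.rttvar = None
        self.min_timeout = min_timeout

    def observe(self, rtt: float) -> None:
        """記錄一次新的往返時間測量值。"""
        if self.srtt is None:
            # 第一個樣本:直接初始化
            self.srtt = rtt
            self.rttvar = rtt / 2
        else:
            # 先用舊的 SRTT 更新變化量,再更新 SRTT
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - rtt)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * rtt

    def timeout(self) -> float:
        if self.srtt is None:
            raise ValueError("尚無任何 RTT 樣本")
        return max(self.min_timeout, self.srtt + self.K * self.rttvar)
```

當往返時間穩定時,超時會逐漸收緊;出現延遲峰值時,變化量增大,超時也隨之放寬,從而降低過早宣佈節點死亡的風險。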
### 同步與非同步網路 {#sec_distributed_sync_networks}
如果我們可以依靠網路以某個固定的最大延遲交付資料包,並且不丟棄資料包,分散式系統將會簡單得多。為什麼我們不能在硬體級別解決這個問題,使網路可靠,這樣軟體就不需要擔心它了?
要回答這個問題,比較資料中心網路與傳統的固定電話網路(非蜂窩、非 VoIP)很有趣,後者極其可靠:延遲的音訊幀和掉線非常罕見。電話通話需要持續的低端到端延遲和足夠的頻寬來傳輸你聲音的音訊樣本。在計算機網路中擁有類似的可靠性和可預測性不是很好嗎?
當你透過電話網路撥打電話時,它會建立一個 *電路*:在兩個呼叫者之間的整個路線上分配固定、有保證的頻寬量。該電路一直保持到通話結束 [^34]。例如,ISDN 網路以每秒 4,000 幀的固定速率執行。建立呼叫時,它在每幀內(在每個方向上)分配 16 位空間。因此,在通話期間,每一方都保證能夠每 250 微秒準確傳送 16 位音訊資料 [^35]。
這種網路是 *同步的*:即使資料通過幾個路由器,它也不會遭受排隊,因為呼叫的 16 位空間已經在網路的下一跳中預留了。由於沒有排隊,網路的最大端到端延遲是固定的。我們稱之為 *有界延遲*。
#### 我們不能簡單地使網路延遲可預測嗎? {#can-we-not-simply-make-network-delays-predictable}
請注意,電話網路中的電路與 TCP 連線非常不同:電路是固定數量的預留頻寬,在電路建立期間其他人無法使用,而 TCP 連線的資料包則機會主義地使用任何可用的網路頻寬。你可以給 TCP 一個可變大小的資料塊(例如,電子郵件或網頁),它會嘗試在儘可能短的時間內傳輸它。當 TCP 連線空閒時,它不使用任何頻寬(除了偶爾的保活資料包)。
如果資料中心網路和網際網路是電路交換網路,那麼在建立電路時就可以建立有保證的最大往返時間。然而,它們不是:乙太網和 IP 是分組交換協議,會遭受排隊,因此在網路中有無界延遲。這些協議沒有電路的概念。
為什麼資料中心網路和網際網路使用分組交換?答案是它們針對 *突發流量* 進行了最佳化。電路適合音訊或視訊通話,需要在通話期間傳輸相當恆定的每秒位數。另一方面,請求網頁、傳送電子郵件或傳輸檔案沒有任何特定的頻寬要求 —— 我們只希望它儘快完成。
如果你想透過電路傳輸檔案,你必須猜測頻寬分配。如果你猜得太低,傳輸會不必要地慢,使網路容量未被使用。如果你猜得太高,電路無法建立(因為如果無法保證其頻寬分配,網路無法允許建立電路)。因此,使用電路進行突發資料傳輸會浪費網路容量並使傳輸不必要地緩慢。相比之下,TCP 動態調整資料傳輸速率以適應可用的網路容量。
曾經有人嘗試構建既支援電路交換又支援分組交換的混合網路。*非同步傳輸模式*(ATM)在 1980 年代是乙太網的競爭對手,但除了電話網路核心交換機外,它沒有獲得廣泛採用。InfiniBand 有一些相似之處 [^36]:它在鏈路層實現端到端流量控制,減少了網路中排隊的需要,儘管它仍然可能因鏈路擁塞而遭受延遲 [^37]。透過仔細使用 *服務質量*(QoS,資料包的優先順序和排程)和 *准入控制*(對傳送者的速率限制),可以在分組網路上模擬電路交換,或提供統計上有界的延遲 [^27] [^34]。新的網路演算法,如低延遲、低損耗和可擴充套件吞吐量(L4S),試圖在客戶端和路由器級別緩解一些排隊和擁塞控制問題。Linux 的流量控制器(TC)也允許應用程式為 QoS 目的重新排定資料包的優先順序。
--------
> [!TIP] 延遲和資源利用率
>
> 更一般地說,你可以將可變延遲視為動態資源分割槽的結果。
>
> 假設你在兩個電話交換機之間有一條可以承載多達 10,000 個同時呼叫的線路。透過此線路交換的每個電路都佔用其中一個呼叫插槽。因此,你可以將該線路視為最多可由 10,000 個同時使用者共享的資源。資源以 *靜態* 方式劃分:即使你現在是線路上唯一的呼叫,並且所有其他 9,999 個插槽都未使用,你的電路仍然分配與線路完全利用時相同的固定頻寬量。
>
> 相比之下,網際網路 *動態* 共享網路頻寬。傳送者互相推擠,儘可能快地透過線路傳送資料包,網路交換機決定在每個時刻傳送哪個資料包(即頻寬分配)。這種方法的缺點是排隊,但優點是它最大化了線路的利用率。線路有固定成本,所以如果你更好地利用它,你透過線路傳送的每個位元組都更便宜。
>
> CPU 也會出現類似的情況:如果你在幾個執行緒之間動態共享每個 CPU 核心,一個執行緒有時必須在作業系統的執行佇列中等待,而另一個執行緒正在執行,因此執行緒可能會暫停不同的時間長度 [^38]。然而,這比為每個執行緒分配靜態數量的 CPU 週期更好地利用硬體(見 ["響應時間保證"](#sec_distributed_clocks_realtime))。更好的硬體利用率也是雲平臺在同一物理機器上執行來自不同客戶的多個虛擬機器的原因。
>
> 如果資源是靜態分割槽的(例如,專用硬體和獨佔頻寬分配),則在某些環境中可以實現延遲保證。然而,這是以降低利用率為代價的 —— 換句話說,它更昂貴。另一方面,具有動態資源分割槽的多租戶提供了更好的利用率,因此更便宜,但它有可變延遲的缺點。
>
> 網路中的可變延遲不是自然法則,而只是成本/收益權衡的結果。
--------
然而,這種服務質量目前在多租戶資料中心和公共雲中未啟用,或者在透過網際網路通訊時未啟用。當前部署的技術不允許我們對網路的延遲或可靠性做出任何保證:我們必須假設網路擁塞、排隊和無界延遲會發生。因此,超時沒有 "正確" 的值 —— 它們需要透過實驗確定。
網際網路服務提供商之間的對等協議和透過邊界閘道器協議(BGP)建立路由,比 IP 本身更接近電路交換。在這個級別,可以購買專用頻寬。然而,網際網路路由在網路級別而不是主機之間的單個連線上執行,並且時間尺度要長得多。
## 不可靠的時鐘 {#sec_distributed_clocks}
時鐘和時間很重要。應用程式以各種方式依賴時鐘來回答如下問題:
1. 這個請求超時了嗎?
2. 這項服務的第 99 百分位響應時間是多少?
3. 這項服務在過去五分鐘內平均每秒處理了多少查詢?
4. 使用者在我們的網站上花了多長時間?
5. 這篇文章是什麼時候發表的?
6. 提醒郵件應該在什麼日期和時間傳送?
7. 這個快取條目何時過期?
8. 日誌檔案中此錯誤訊息的時間戳是什麼?
示例 1-4 測量 *持續時間*(例如,傳送請求和接收響應之間的時間間隔),而示例 5-8 描述 *時間點*(在特定日期、特定時間發生的事件)。
在分散式系統中,時間是一件棘手的事情,因為通訊不是瞬時的:訊息從一臺機器透過網路傳輸到另一臺機器需要時間。接收訊息的時間總是晚於傳送訊息的時間,但由於網路中的可變延遲,我們不知道晚了多少。當涉及多臺機器時,這個事實有時會使確定事情發生的順序變得困難。
此外,網路上的每臺機器都有自己的時鐘,這是一個實際的硬體裝置:通常是石英晶體振盪器。這些裝置並不完全準確,因此每臺機器都有自己的時間概念,可能比其他機器稍快或稍慢。可以在某種程度上同步時鐘:最常用的機制是網路時間協議(NTP),它允許根據一組伺服器報告的時間調整計算機時鐘 [^39]。伺服器反過來從更準確的時間源(如 GPS 接收器)獲取時間。
### 單調時鐘與日曆時鐘 {#sec_distributed_monotonic_timeofday}
現代計算機至少有兩種不同型別的時鐘:*日曆時鐘* 和 *單調時鐘*。儘管它們都測量時間,但區分兩者很重要,因為它們服務於不同的目的。
#### 日曆時鐘 {#time-of-day-clocks}
日曆時鐘做你直觀期望時鐘做的事情:它根據某個日曆返回當前日期和時間(也稱為 *牆上時鐘時間*)。例如,Linux 上的 `clock_gettime(CLOCK_REALTIME)` 和 Java 中的 `System.currentTimeMillis()` 返回自 *紀元* 以來的秒數(或毫秒數):根據格里高利曆,1970 年 1 月 1 日午夜 UTC,不計算閏秒。一些系統使用其他日期作為參考點。(儘管 Linux 時鐘被稱為 *即時*,但它與即時作業系統無關,如 ["響應時間保證"](#sec_distributed_clocks_realtime) 中所討論的。)
日曆時鐘通常與 NTP 同步,這意味著來自一臺機器的時間戳(理想情況下)與另一臺機器上的時間戳意思相同。然而,日曆時鐘也有各種奇怪之處,如下一節所述。特別是,如果本地時鐘遠遠超前於 NTP 伺服器,它可能會被強制重置,看起來就像跳回到先前的某個時間點。這些跳躍,以及閏秒引起的類似跳躍,使日曆時鐘不適合測量經過的時間 [^40]。
日曆時鐘可能會因夏令時(DST)的開始和結束而經歷跳躍;這些可以透過始終使用 UTC 作為時區來避免,UTC 沒有 DST。日曆時鐘在歷史上也具有相當粗粒度的解析度,例如,在較舊的 Windows 系統上以 10 毫秒的步長前進 [^41]。在最近的系統上,這不再是一個問題。
#### 單調時鐘 {#monotonic-clocks}
單調時鐘適用於測量持續時間(時間間隔),例如超時或服務的響應時間:例如,Linux 上的 `clock_gettime(CLOCK_MONOTONIC)` 或 `clock_gettime(CLOCK_BOOTTIME)` [^42] 和 Java 中的 `System.nanoTime()` 是單調時鐘。這個名字來源於它們保證始終向前移動的事實(而日曆時鐘可能會在時間上向後跳躍)。
你可以在某個時間點檢查單調時鐘的值,做一些事情,然後在稍後的時間再次檢查時鐘。兩個值之間的 *差值* 告訴你兩次檢查之間經過了多少時間 —— 更像秒錶而不是掛鐘。然而,時鐘的 *絕對* 值是沒有意義的:它可能是自計算機啟動以來的納秒數,或類似的任意值。特別是,比較來自兩臺不同計算機的單調時鐘值是沒有意義的,因為它們不代表同樣的東西。
在具有多個 CPU 插槽的伺服器上,每個 CPU 可能有一個單獨的計時器,它不一定與其他 CPU 同步 [^43]。作業系統會補償任何差異,並嘗試嚮應用程式執行緒呈現時鐘的單調檢視,即使它們被排程到不同的 CPU 上。然而,明智的做法是對這種單調性保證持保留態度 [^44]。
如果 NTP 檢測到計算機的本地石英晶體比 NTP 伺服器執行得更快或更慢,它可能會調整單調時鐘前進的頻率(這被稱為對時鐘進行 *微調*,即 slewing)。預設情況下,NTP 允許時鐘速率加速或減速高達 0.05%,但 NTP 不能導致單調時鐘向前或向後跳躍。單調時鐘的解析度通常相當好:在大多數系統上,它們可以測量微秒或更短的時間間隔。
在分散式系統中,使用單調時鐘測量經過的時間(例如,超時)通常是可以的,因為它不假設不同節點的時鐘之間有任何同步,並且對測量的輕微不準確不敏感。
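以 Python 為例,`time.monotonic()` 是單調時鐘,`time.time()` 是日曆時鐘。下面的草圖用單調時鐘測量一段程式碼的耗時;輔助函式 `timed` 是為說明而假設的名稱。

```python
import time


def timed(fn, *args):
    """用單調時鐘測量 fn 的執行時間,回傳 (結果, 經過的秒數)。"""
    start = time.monotonic()           # 只有兩次讀數的差值才有意義
    result = fn(*args)
    elapsed = time.monotonic() - start  # 不受日曆時鐘被重置的影響
    return result, elapsed
```

如果把這裡的 `time.monotonic()` 換成 `time.time()`,一旦日曆時鐘在測量期間被 NTP 重置,算出的耗時就可能是負數或異常大的值。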
### 時鐘同步和準確性 {#sec_distributed_clock_accuracy}
單調時鐘不需要同步,但日曆時鐘需要根據 NTP 伺服器或其他外部時間源設定才能有用。不幸的是,我們讓時鐘顯示正確時間的方法遠不如你希望的那樣可靠或準確 —— 硬體時鐘和 NTP 可能是反覆無常的野獸。僅舉幾個例子:
* 計算機中的石英時鐘不是很準確:它會 *漂移*(比應該的執行得更快或更慢)。時鐘漂移因機器的溫度而異。Google 假設其伺服器的時鐘漂移高達 200 ppm(百萬分之一)[^45],這相當於每 30 秒與伺服器重新同步的時鐘有 6 毫秒漂移,或每天重新同步一次的時鐘有 17 秒漂移。即使一切正常工作,這種漂移也限制了你可以達到的最佳精度。
* 如果計算機的時鐘與 NTP 伺服器相差太多,它可能會拒絕同步,或者本地時鐘將被強制重置 [^39]。任何在重置前後觀察時間的應用程式都可能看到時間倒退或突然向前跳躍。
* 如果節點意外地被防火牆與 NTP 伺服器隔離,配置錯誤可能會在一段時間內未被注意到,在此期間漂移可能會累積成不同節點時鐘之間的巨大差異。軼事證據表明,這在實踐中確實會發生。
* NTP 同步只能與網路延遲一樣好,因此當你在具有可變資料包延遲的擁塞網路上時,其準確性有限。一項實驗表明,透過網際網路同步時可以達到 35 毫秒的最小誤差 [^46],儘管網路延遲的偶爾峰值會導致大約一秒的誤差。根據配置,大的網路延遲可能導致 NTP 客戶端完全放棄。
* 一些 NTP 伺服器是錯誤的或配置錯誤的,報告的時間相差數小時 [^47] [^48]。NTP 客戶端透過查詢多個伺服器並忽略異常值來減輕此類錯誤。儘管如此,將系統的正確性押注在網際網路上陌生人告訴你的時間上還是有些令人擔憂的。
* 閏秒導致一分鐘有 59 秒或 61 秒長,這會搞亂在設計時沒有考慮閏秒的系統中的時序假設 [^49]。閏秒已經導致許多大型系統崩潰的事實 [^40] [^50] 表明,關於時鐘的錯誤假設是多麼容易潛入系統。處理閏秒的最佳方法可能是讓 NTP 伺服器 "撒謊",透過在一天的過程中逐漸執行閏秒調整(這被稱為 *平滑*)[^51] [^52],儘管實際的 NTP 伺服器行為在實踐中有所不同 [^53]。從 2035 年起將不再使用閏秒,所以這個問題幸運地將會消失。
* 在虛擬機器中,硬體時鐘是虛擬化的,這為需要準確計時的應用程式帶來了額外的挑戰 [^54]。當 CPU 核心在虛擬機器之間共享時,每個 VM 在另一個 VM 執行時會暫停數十毫秒。從應用程式的角度來看,這種暫停表現為時鐘突然向前跳躍 [^29]。如果 VM 暫停幾秒鐘,時鐘可能會比實際時間落後幾秒鐘,但 NTP 可能會繼續報告時鐘幾乎完全同步 [^55]。
* 如果你在不完全控制的裝置上執行軟體(例如,移動或嵌入式裝置),你可能根本無法信任裝置的硬體時鐘。一些使用者故意將他們的硬體時鐘設定為不正確的日期和時間,例如在遊戲中作弊 [^56]。因此,時鐘可能被設定為遙遠的過去或未來的時間。
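上面列表第一條中的漂移數字可以用簡單的算術驗證:最大漂移等於漂移率乘以距離上次同步的時間。下面的函式 `max_drift_seconds` 只是為了演示這個計算而假設的名稱。

```python
def max_drift_seconds(drift_ppm: float, seconds_since_sync: float) -> float:
    """給定漂移率(ppm,百萬分之一)和距上次同步的秒數,回傳最大漂移(秒)。"""
    return drift_ppm * 1e-6 * seconds_since_sync


# 200 ppm、每 30 秒同步一次:約 0.006 秒,即 6 毫秒
drift_30s = max_drift_seconds(200, 30)

# 200 ppm、每天同步一次(86,400 秒):約 17.28 秒
drift_1day = max_drift_seconds(200, 24 * 60 * 60)
```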
如果你足夠關心時鐘精度並願意投入大量資源,就可以實現非常好的時鐘精度。例如,歐洲金融機構的 MiFID II 法規要求所有高頻交易基金將其時鐘同步到 UTC 的 100 微秒以內,以幫助除錯市場異常(如 "閃崩")並幫助檢測市場操縱 [^57]。
這種精度可以透過一些特殊硬體(GPS 接收器和/或原子鐘)、精確時間協議(PTP)以及仔細的部署和監控來實現 [^58] [^59]。僅依賴 GPS 可能有風險,因為 GPS 訊號很容易被幹擾。在某些地方,這種情況經常發生,例如靠近軍事設施 [^60]。一些雲提供商已經開始為其虛擬機器提供高精度時鐘同步 [^61]。然而,時鐘同步仍然需要很多注意。如果你的 NTP 守護程序配置錯誤,或者防火牆阻止了 NTP 流量,由於漂移導致的時鐘誤差可能會迅速變大。
### 對同步時鐘的依賴 {#sec_distributed_clocks_relying}
時鐘的問題在於,雖然它們看起來簡單易用,但它們有驚人數量的陷阱:一天可能沒有正好 86,400 秒,日曆時鐘可能會在時間上向後移動,根據一個節點的時鐘的時間可能與另一個節點的時鐘相差很大。
本章前面我們討論了網路丟棄和任意延遲資料包。即使網路大部分時間表現良好,軟體也必須設計成假設網路偶爾會出現故障,軟體必須優雅地處理此類故障。時鐘也是如此:儘管它們大部分時間工作得很好,但強健的軟體需要準備好處理不正確的時鐘。
問題的一部分是不正確的時鐘很容易被忽視。如果機器的 CPU 有缺陷或其網路配置錯誤,它很可能根本無法工作,因此會很快被注意到並修復。另一方面,如果它的石英時鐘有缺陷或其 NTP 客戶端配置錯誤,大多數事情看起來會正常工作,即使它的時鐘逐漸偏離現實越來越遠。如果某些軟體依賴於準確同步的時鐘,結果更可能是靜默和微妙的資料丟失,而不是戲劇性的崩潰 [^62] [^63]。
因此,如果你使用需要同步時鐘的軟體,你還必須仔細監控所有機器之間的時鐘偏移。任何時鐘偏離其他節點太遠的節點都應該被宣佈死亡並從叢集中移除。這種監控確保你在損壞的時鐘造成太多損害之前注意到它們。
#### 用於事件排序的時間戳 {#sec_distributed_lww}
讓我們考慮一個特定的情況,其中依賴時鐘是誘人但危險的:跨多個節點的事件排序 [^64]。例如,如果兩個客戶端寫入分散式資料庫,誰先到達?哪個寫入是更新的?
[圖 9-3](#fig_distributed_timestamps) 說明了在具有多主複製的資料庫中日曆時鐘的危險使用(該示例類似於 [圖 6-8](/tw/ch6#fig_replication_causality))。客戶端 A 在節點 1 上寫入 *x* = 1;寫入被複制到節點 3;客戶端 B 在節點 3 上遞增 *x*(我們現在有 *x* = 2);最後,兩個寫入都被複制到節點 2。
{{< figure src="/fig/ddia_0903.png" id="fig_distributed_timestamps" caption="圖 9-3. 客戶端 B 的寫入在因果關係上晚於客戶端 A 的寫入,但 B 的寫入具有更早的時間戳。" class="w-full my-4" >}}
在 [圖 9-3](#fig_distributed_timestamps) 中,當寫入被複制到其他節點時,它會根據寫入起源節點上的日曆時鐘標記時間戳。此示例中的時鐘同步非常好:節點 1 和節點 3 之間的偏差小於 3 毫秒,這可能比你在實踐中可以期望的要好。
由於遞增建立在 *x* = 1 的早期寫入之上,我們可能期望 *x* = 2 的寫入應該具有兩者中更大的時間戳。不幸的是,[圖 9-3](#fig_distributed_timestamps) 中發生的並非如此:寫入 *x* = 1 的時間戳為 42.004 秒,但寫入 *x* = 2 的時間戳為 42.003 秒。
如 ["最後寫入勝利(丟棄併發寫入)"](/tw/ch6#sec_replication_lww) 中所討論的,解決不同節點上併發寫入值之間衝突的一種方法是 *最後寫入勝利*(LWW),這意味著保留給定鍵的具有最大時間戳的寫入,並丟棄所有具有較舊時間戳的寫入。在 [圖 9-3](#fig_distributed_timestamps) 的示例中,當節點 2 接收這兩個事件時,它將錯誤地得出結論,認為 *x* = 1 是更新的值並丟棄寫入 *x* = 2,因此遞增丟失了。
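圖 9-3 中遞增丟失的過程可以用幾行程式碼重現。以下是一個最小化的示意,`Write` 和 `lww_merge` 是為說明而假設的名稱,並不對應任何特定資料庫的實現;時間戳取自圖 9-3。

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    key: str
    value: int
    timestamp: float  # 寫入起源節點上的日曆時鐘讀數(秒)


def lww_merge(writes):
    """最後寫入勝利:對每個鍵保留時間戳最大的寫入,其餘靜默丟棄。"""
    winners = {}
    for w in writes:
        current = winners.get(w.key)
        if current is None or w.timestamp > current.timestamp:
            winners[w.key] = w
    return {key: w.value for key, w in winners.items()}


write_a = Write("x", 1, 42.004)  # 客戶端 A 在節點 1 上寫入 x = 1
write_b = Write("x", 2, 42.003)  # 客戶端 B 在節點 3 上遞增,因果上更晚,時間戳卻更小

merged = lww_merge([write_a, write_b])  # 節點 2 合併後的結果
```

無論兩個寫入以何種順序到達節點 2,合併結果都是 `x = 1`,客戶端 B 的遞增被丟棄,而且不會有任何錯誤報告。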
可以透過確保當值被覆蓋時,新值總是具有比被覆蓋值更高的時間戳來防止這個問題,即使該時間戳超前於寫入者的本地時鐘。然而,這會產生額外的讀取成本來查詢最大的現有時間戳。一些系統,包括 Cassandra 和 ScyllaDB,希望在單次往返中寫入所有副本,因此它們只是使用客戶端時鐘的時間戳以及最後寫入勝利策略 [^62]。這種方法有一些嚴重的問題:
* 資料庫寫入可能會神秘地消失:具有滯後時鐘的節點無法覆蓋先前由具有快速時鐘的節點寫入的值,直到節點之間的時鐘偏差時間過去 [^63] [^65]。這種情況可能導致任意數量的資料被靜默丟棄,而不會嚮應用程式報告任何錯誤。
* LWW 無法區分快速連續發生的順序寫入(在 [圖 9-3](#fig_distributed_timestamps) 中,客戶端 B 的遞增肯定發生在客戶端 A 的寫入 *之後*)和真正併發的寫入(兩個寫入者都不知道對方)。需要額外的因果關係跟蹤機制,如版本向量,以防止違反因果關係(見 ["檢測併發寫入"](/tw/ch6#sec_replication_concurrent))。
* 兩個節點可能獨立生成具有相同時間戳的寫入,特別是當時鐘只有毫秒解析度時。需要額外的決勝值(可以簡單地是一個大的隨機數)來解決此類衝突,但這種方法也可能導致違反因果關係 [^62]。
因此,即使透過保留最 "新" 的值並丟棄其他值來解決衝突很誘人,但重要的是要意識到 "新" 的定義取決於本地日曆時鐘,它很可能是不正確的。即使使用緊密 NTP 同步的時鐘,你也可能在時間戳 100 毫秒(根據傳送者的時鐘)傳送資料包,並讓它在時間戳 99 毫秒(根據接收者的時鐘)到達 —— 因此看起來資料包在傳送之前就到達了,這是不可能的。
NTP 同步能否足夠準確以至於不會發生此類錯誤排序?可能不行,因為除了石英漂移等其他誤差源之外,NTP 的同步精度本身受到網路往返時間的限制。要保證正確的排序,你需要時鐘誤差顯著低於網路延遲,這是不可能的。
所謂的 *邏輯時鐘* [^66],基於遞增計數器而不是振盪石英晶體,是排序事件的更安全替代方案(見 ["檢測併發寫入"](/tw/ch6#sec_replication_concurrent))。邏輯時鐘不測量一天中的時間或經過的秒數,只測量事件的相對順序(一個事件是在另一個事件之前還是之後發生)。相比之下,日曆時鐘和單調時鐘測量實際經過的時間,也稱為 *物理時鐘*。我們將在 ["ID 生成器和邏輯時鐘"](/tw/ch10#sec_consistency_logical) 中更詳細地研究邏輯時鐘。
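為了說明 "基於遞增計數器" 的含義,下面給出邏輯時鐘的一種常見形式(Lamport 時間戳)的簡化草圖。類名和方法名是為說明而假設的;真實系統通常還會附帶節點 ID 來打破平局。

```python
class LamportClock:
    """Lamport 風格的邏輯時鐘:只記錄事件的相對順序,不測量實際時間。"""

    def __init__(self):
        self.counter = 0

    def tick(self) -> int:
        """發生本地事件(例如寫入)或傳送訊息之前呼叫。"""
        self.counter += 1
        return self.counter

    def receive(self, remote_counter: int) -> int:
        """收到訊息時呼叫:跳到雙方計數器的最大值之後。"""
        self.counter = max(self.counter, remote_counter) + 1
        return self.counter


node1, node3 = LamportClock(), LamportClock()
t_a = node1.tick()         # 客戶端 A 在節點 1 上的寫入
t_b = node3.receive(t_a)   # 節點 3 收到複製過來的寫入後,客戶端 B 的遞增
```

在這個草圖中,因果上較晚的寫入必然得到較大的計數值(`t_b > t_a`),不論兩個節點的石英時鐘相差多少。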
#### 帶置信區間的時鐘讀數 {#clock-readings-with-a-confidence-interval}
你可能能夠以微秒甚至納秒解析度讀取機器的日曆時鐘。但即使你能獲得如此細粒度的測量,也不意味著該值實際上精確到如此精度。事實上,它很可能不是 —— 如前所述,即使你每分鐘與本地網路上的 NTP 伺服器同步,不精確的石英時鐘的漂移也很容易達到幾毫秒。使用公共網際網路上的 NTP 伺服器,最佳可能精度可能是幾十毫秒,當存在網路擁塞時,誤差很容易超過 100 毫秒。
因此,將時鐘讀數視為一個時間點是沒有意義的;它更像是帶有置信區間的一段時間範圍:例如,系統可能有 95% 的信心認為現在的時間在當前這一分鐘的第 10.3 到 10.5 秒之間,但無法知道得更精確 [^67]。如果我們只知道時間的誤差在 +/- 100 毫秒以內,時間戳中的微秒位數基本上是沒有意義的。
不確定性邊界可以根據你的時間源計算。如果你有直接連線到計算機的 GPS 接收器或原子鐘,預期誤差範圍由裝置決定,對於 GPS,由來自衛星的訊號質量決定。如果你從伺服器獲取時間,不確定性基於自上次與伺服器同步以來的預期石英漂移,加上 NTP 伺服器的不確定性,加上到伺服器的網路往返時間(作為第一近似,並假設你信任伺服器)。
不幸的是,大多數系統不暴露這種不確定性:例如,當你呼叫 `clock_gettime()` 時,返回值不會告訴你時間戳的預期誤差,所以你不知道它的置信區間是五毫秒還是五年。
有例外:Google Spanner 中的 *TrueTime* API [^45] 和亞馬遜的 ClockBound 明確報告本地時鐘的置信區間。當你詢問當前時間時,你會得到兩個值:`[earliest, latest]`,它們是 *最早可能* 和 *最晚可能* 的時間戳。基於其不確定性計算,時鐘知道實際當前時間在該區間內的某處。區間的寬度取決於多種因素,包括本地石英時鐘上次與更準確的時鐘源同步以來已經過去了多長時間。
#### 用於全域性快照的同步時鐘 {#sec_distributed_spanner}
在 ["快照隔離和可重複讀"](/tw/ch8#sec_transactions_snapshot_isolation) 中,我們討論了 *多版本併發控制*(MVCC)。對於既要支援小型、快速的讀寫事務,又要支援大型、長時間執行的只讀事務(例如,用於備份或分析)的資料庫來說,這是非常有用的功能。它允許只讀事務看到資料庫的 *快照*,即特定時間點的一致狀態,而不會鎖定和干擾讀寫事務。
通常,MVCC 需要單調遞增的事務 ID。如果寫入發生在快照之後(即,寫入的事務 ID 大於快照),則該寫入對快照事務不可見。在單節點資料庫上,簡單的計數器就足以生成事務 ID。
然而,當資料庫分佈在許多機器上,可能在多個數據中心時,全域性單調遞增的事務 ID(跨所有分片)很難生成,因為它需要協調。事務 ID 必須反映因果關係:如果事務 B 讀取或覆蓋先前由事務 A 寫入的值,則 B 必須具有比 A 更高的事務 ID —— 否則,快照將不一致。對於大量小型、快速的事務,在分散式系統中建立事務 ID 成為難以承受的瓶頸。(我們將在 ["ID 生成器和邏輯時鐘"](/tw/ch10#sec_consistency_logical) 中討論此類 ID 生成器。)
我們能否使用同步日曆時鐘的時間戳作為事務 ID?如果我們能夠獲得足夠好的同步,它們將具有正確的屬性:較晚的事務具有更高的時間戳。當然,問題是時鐘精度的不確定性。
Spanner 以這種方式跨資料中心實現快照隔離 [^68] [^69]。它使用 TrueTime API 報告的時鐘置信區間,並基於以下觀察:如果你有兩個置信區間,每個都由最早和最晚可能的時間戳組成(`A = [A_earliest, A_latest]` 和 `B = [B_earliest, B_latest]`),並且這兩個區間不重疊(即,`A_earliest < A_latest < B_earliest < B_latest`),那麼 B 肯定發生在 A 之後,毫無疑問。只有當區間重疊時,我們才不確定 A 和 B 發生的順序。
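這個觀察可以寫成一個簡單的區間比較。以下草圖中的 `TimeInterval` 和 `order` 是為說明而假設的名稱,並不是 TrueTime 或 ClockBound 的實際 API。

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeInterval:
    earliest: float  # 最早可能的時間戳
    latest: float    # 最晚可能的時間戳

    def definitely_before(self, other: "TimeInterval") -> bool:
        """只有當兩個區間完全不重疊時,才能確定先後順序。"""
        return self.latest < other.earliest


def order(a: TimeInterval, b: TimeInterval) -> str:
    if a.definitely_before(b):
        return "a 先於 b"
    if b.definitely_before(a):
        return "b 先於 a"
    return "無法確定"  # 區間重疊
```

例如,`order(TimeInterval(10.0, 10.2), TimeInterval(10.3, 10.5))` 可以確定先後,而 `TimeInterval(10.0, 10.4)` 與 `TimeInterval(10.3, 10.5)` 重疊,因此無法確定順序。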
為了確保事務時間戳反映因果關係,Spanner 在提交讀寫事務之前故意等待置信區間的長度。透過這樣做,它確保任何可能讀取資料的事務都在足夠晚的時間,因此它們的置信區間不會重疊。為了使等待時間儘可能短,Spanner 需要使時鐘不確定性儘可能小;為此,Google 在每個資料中心部署 GPS 接收器或原子鐘,使時鐘能夠同步到大約 7 毫秒以內 [^45]。
原子鐘和 GPS 接收器在 Spanner 中並不是嚴格必要的:重要的是要有一個置信區間,準確的時鐘源只是幫助保持該區間較小。其他系統開始採用類似的方法:例如,YugabyteDB 在 AWS 上執行時可以利用 ClockBound [^70],其他幾個系統現在也在不同程度上依賴時鐘同步 [^71] [^72]。
### 程序暫停 {#sec_distributed_clocks_pauses}
讓我們考慮分散式系統中危險使用時鐘的另一個例子。假設你有一個每個分片都有單個主節點的資料庫。只有主節點被允許接受寫入。節點如何知道它仍然是主節點(它沒有被其他節點宣佈死亡),並且它可以安全地接受寫入?
一種選擇是讓主節點從其他節點獲取 *租約*,這類似於帶有超時的鎖 [^73]。任何時候只有一個節點可以持有租約 —— 因此,當節點獲得租約時,它知道在租約到期之前的一段時間內它是主節點。為了保持主節點身份,節點必須在租約到期之前定期續訂租約。如果節點失效,它會停止續訂租約,因此另一個節點可以在租約到期時接管。
你可以想象請求處理迴圈看起來像這樣:
```java
while (true) {
request = getIncomingRequest();
// 確保租約始終至少有 10 秒的剩餘時間
if (lease.expiryTimeMillis - System.currentTimeMillis() < 10000) {
lease = lease.renew();
}
if (lease.isValid()) {
process(request);
}
}
```
這段程式碼有什麼問題?首先,它依賴於同步時鐘:租約的到期時間由不同的機器設定(到期時間可能計算為當前時間加 30 秒,例如),並且它與本地系統時鐘進行比較。如果時鐘相差超過幾秒鐘,這段程式碼將開始做奇怪的事情。
其次,即使我們更改協議以僅使用本地單調時鐘,還有另一個問題:程式碼假設在檢查時間(`System.currentTimeMillis()`)和處理請求(`process(request)`)之間經過的時間非常少。通常這段程式碼執行得非常快,所以 10 秒的緩衝時間足以確保租約不會在處理請求的過程中到期。
然而,如果程式執行中出現意外暫停會怎樣?例如,想象執行緒在 `lease.isValid()` 行周圍停止了 15 秒,然後才最終繼續。在這種情況下,處理請求時租約很可能已經到期,另一個節點已經接管了主節點身份。然而,沒有任何東西告訴這個執行緒它暫停了這麼長時間,所以這段程式碼不會注意到租約已經到期,直到迴圈的下一次迭代 —— 到那時它可能已經透過處理請求做了一些不安全的事情。
假設執行緒可能暫停這麼長時間是合理的嗎?不幸的是,是的。有各種原因可能導致這種情況發生:
* 執行緒訪問共享資源(如鎖或佇列)時的爭用可能導致執行緒花費大量時間等待。轉移到具有更多 CPU 核心的機器可能會使此類問題變得更糟,並且爭用問題可能難以診斷 [^74]。
* 許多程式語言執行時(如 Java 虛擬機器)有 *垃圾回收器*(GC),偶爾需要停止所有正在執行的執行緒。過去,這種 *"全域性暫停" GC 暫停* 有時會持續幾分鐘 [^75]!使用現代 GC 演算法,這不再是一個大問題,但 GC 暫停仍然可能很明顯(見 ["限制垃圾回收的影響"](#sec_distributed_gc_impact))。
* 在虛擬化環境中,虛擬機器可以被 *掛起*(暫停所有程序的執行並將記憶體內容儲存到磁碟)和 *恢復*(恢復記憶體內容並繼續執行)。這種暫停可能發生在程序執行的任何時間,並且可能持續任意長的時間。這個功能有時用於虛擬機器從一臺主機到另一臺主機的 *即時遷移*,無需重啟,在這種情況下,暫停的長度取決於程序寫入記憶體的速率 [^76]。
* 在筆記型電腦和手機等終端使用者裝置上,執行也可能被任意掛起和恢復,例如,當用戶合上筆記型電腦蓋時。
* 當作業系統上下文切換到另一個執行緒時,或者當虛擬機器管理程式切換到不同的虛擬機器時(在虛擬機器中執行時),當前執行的執行緒可能在程式碼的任何任意點暫停。在虛擬機器的情況下,在其他虛擬機器中花費的 CPU 時間稱為 *竊取時間*。如果機器負載很重 —— 即,如果有長佇列的執行緒等待執行 —— 暫停的執行緒可能需要一些時間才能再次執行。
* 如果應用程式執行同步磁碟訪問,執行緒可能會暫停等待緩慢的磁碟 I/O 操作完成 [^77]。在許多語言中,磁碟訪問可能會令人驚訝地發生,即使程式碼沒有明確提到檔案訪問 —— 例如,Java 類載入器在首次使用時會延遲載入類檔案,這可能發生在程式執行的任何時間。I/O 暫停和 GC 暫停甚至可能共謀結合它們的延遲 [^78]。如果磁碟實際上是網路檔案系統或網路塊裝置(如亞馬遜的 EBS),I/O 延遲還會受到網路延遲可變性的影響 [^31]。
* 如果作業系統配置為允許 *交換到磁碟*(*分頁*),簡單的記憶體訪問可能會導致頁面錯誤,需要從磁碟載入頁面到記憶體。執行緒在此緩慢的 I/O 操作進行時暫停。如果記憶體壓力很高,這可能反過來需要將不同的頁面交換到磁碟。在極端情況下,作業系統可能會花費大部分時間在記憶體中交換頁面進出,而實際完成的工作很少(這被稱為 *顛簸*,即 thrashing)。為了避免這個問題,伺服器機器上通常停用分頁(如果你寧願殺死程序以釋放記憶體而不是冒顛簸的風險)。
* Unix 程序可以透過向其傳送 `SIGSTOP` 訊號來暫停,例如透過在 shell 中按 Ctrl-Z。此訊號立即停止程序獲取更多 CPU 週期,直到使用 `SIGCONT` 恢復它,此時它從停止的地方繼續執行。即使你的環境通常不使用 `SIGSTOP`,它也可能被運維工程師意外發送。
所有這些情況都可以在任何時候 *搶佔* 正在執行的執行緒,並在稍後的某個時間恢復它,而執行緒甚至沒有注意到。這個問題類似於在單臺機器上使多執行緒程式碼執行緒安全:你不能對時序做任何假設,因為可能會發生任意的上下文切換和並行性。
在單臺機器上編寫多執行緒程式碼時,我們有相當好的工具來使其執行緒安全:互斥鎖、訊號量、原子計數器、無鎖資料結構、阻塞佇列等。不幸的是,這些工具不能直接轉換到分散式系統,因為分散式系統沒有共享記憶體 —— 只有透過不可靠網路傳送的訊息。
分散式系統中的節點必須假設其執行可以在任何時候暫停相當長的時間,即使在函式的中間。在暫停期間,世界的其餘部分繼續執行,甚至可能因為暫停的節點沒有響應而宣佈它死亡。最終,暫停的節點可能會繼續執行,甚至沒有注意到它在睡覺,直到它稍後某個時候檢查其時鐘。
#### 響應時間保證 {#sec_distributed_clocks_realtime}
在許多程式語言和作業系統中,如所討論的,執行緒和程序可能會暫停無限長的時間。如果你足夠努力,這些暫停的原因 *可以* 被消除。
某些軟體在環境中執行,如果未能在指定時間內響應可能會造成嚴重損害:控制飛機、火箭、機器人、汽車和其他物理物件的計算機必須快速且可預測地響應其感測器輸入。在這些系統中,有一個指定的 *截止時間*,軟體必須在此之前響應;如果它沒有達到截止時間,可能會導致整個系統的故障。這些被稱為 *硬即時* 系統。
--------
> [!NOTE]
> 在嵌入式系統中,*即時* 意味著系統經過精心設計和測試,以在所有情況下滿足指定的時序保證。這個含義與網路上更模糊的 *即時* 術語使用形成對比,後者描述伺服器向客戶端推送資料和流處理,沒有硬響應時間約束(見後續章節)。
--------
例如,如果你的汽車的車載感測器檢測到你當前正在經歷碰撞,你不希望安全氣囊的釋放因為安全氣囊釋放系統中不合時宜的 GC 暫停而延遲。
在系統中提供即時保證需要軟體棧所有級別的支援:需要 *即時作業系統*(RTOS),它允許程序在指定的時間間隔內以有保證的 CPU 時間分配進行排程;庫函式必須記錄其最壞情況執行時間;動態記憶體分配可能受到限制或完全禁止(即時垃圾回收器存在,但應用程式仍必須確保它不會給 GC 太多工作);必須進行大量的測試和測量以確保滿足保證。
所有這些都需要大量的額外工作,並嚴重限制了可以使用的程式語言、庫和工具的範圍(因為大多數語言和工具不提供即時保證)。由於這些原因,開發即時系統非常昂貴,它們最常用於安全關鍵的嵌入式裝置。此外,"即時" 不同於 "高效能" —— 事實上,即時系統可能具有較低的吞吐量,因為它們必須優先考慮及時響應高於一切(另見 ["延遲和資源利用率"](#sidebar_distributed_latency_utilization))。
對於大多數伺服器端資料處理系統,即時保證根本不經濟或不合適。因此,這些系統必須承受在非即時環境中執行帶來的暫停和時鐘不穩定性。
#### 限制垃圾回收的影響 {#sec_distributed_gc_impact}
垃圾回收曾經是程序暫停的最大原因之一 [^79],但幸運的是 GC 演算法已經改進了很多:經過適當調整的回收器現在通常只會暫停幾毫秒。Java 執行時提供了併發標記清除(CMS)、G1、Z 垃圾回收器(ZGC)、Epsilon 和 Shenandoah 等回收器。每個都針對不同的記憶體配置檔案進行了最佳化,如高頻物件建立、大堆等。相比之下,Go 提供了一個更簡單的併發標記清除垃圾回收器,試圖自我最佳化。
如果你需要完全避免 GC 暫停,一個選擇是使用根本沒有垃圾回收器的語言。例如,Swift 使用自動引用計數來確定何時可以釋放記憶體;Rust 和 Mojo 使用型別系統跟蹤物件的生命週期,以便編譯器可以確定必須分配記憶體多長時間。
也可以使用垃圾回收語言,同時減輕暫停的影響。一種方法是將 GC 暫停視為節點的短暫計劃中斷,並讓其他節點在一個節點收集垃圾時處理來自客戶端的請求。如果執行時可以警告應用程式節點很快需要 GC 暫停,應用程式可以停止向該節點發送新請求,等待它完成處理未完成的請求,然後在沒有請求進行時執行 GC。這個技巧從客戶端隱藏了 GC 暫停,並減少了響應時間的高百分位數 [^80] [^81]。
這個想法的一個變體是僅對短期物件使用垃圾回收器(快速收集),並定期重啟程序,在它們積累足夠的長期物件需要長期物件的完整 GC 之前 [^79] [^82]。可以一次重啟一個節點,並且可以在計劃重啟之前將流量從節點轉移,就像滾動升級一樣(見 [第 5 章](/tw/ch5#ch_encoding))。
這些措施不能完全防止垃圾回收暫停,但它們可以有效地減少對應用程式的影響。
## 知識、真相與謊言 {#sec_distributed_truth}
到目前為止,在本章中,我們已經探討了分散式系統與在單臺計算機上執行的程式的不同之處:沒有共享記憶體,只有透過不可靠的網路進行訊息傳遞,具有可變延遲,系統可能會遭受部分失效、不可靠的時鐘和處理暫停。
如果你不習慣分散式系統,這些問題的後果會令人深感迷惑。網路中的節點不能 *確切地知道* 關於其他節點的任何事情 —— 它只能根據它接收(或未接收)的訊息進行猜測。節點只能透過與另一個節點交換訊息來了解它處於什麼狀態(它儲存了什麼資料,它是否正常執行等)。如果遠端節點沒有響應,就無法知道它處於什麼狀態,因為網路中的問題無法與節點的問題可靠地區分開來。
這些系統的討論接近哲學:在我們的系統中,我們知道什麼是真或假?如果感知和測量的機制不可靠,我們對這些知識有多確定 [^83]?軟體系統是否應該遵守我們對物理世界的期望法則,如因果關係?
幸運的是,我們不需要走到弄清生命意義的程度。在分散式系統中,我們可以陳述我們對行為(*系統模型*)的假設,並以這樣的方式設計實際系統,使其滿足這些假設。演算法可以被證明在某個系統模型內正確執行。這意味著即使底層系統模型提供的保證很少,也可以實現可靠的行為。
然而,儘管可以在不可靠的系統模型中使軟體表現良好,但這樣做並不簡單。在本章的其餘部分,我們將進一步探討分散式系統中知識和真相的概念,這將幫助我們思考我們可以做出的假設型別和我們可能希望提供的保證。在 [第 10 章](/tw/ch10#ch_consistency) 中,我們將繼續檢視在特定假設下提供特定保證的分散式演算法的一些示例。
### 多數派原則 {#sec_distributed_majority}
想象一個具有不對稱故障的網路:一個節點能夠接收發送給它的所有訊息,但該節點的任何傳出訊息都被丟棄或延遲 [^22]。即使該節點執行得非常好,並且正在接收來自其他節點的請求,其他節點也無法聽到它的響應。在一些超時之後,其他節點宣佈它死亡,因為它們沒有收到該節點的訊息。情況展開就像一場噩夢:半斷開的節點被拖到墓地,踢腿尖叫著 "我沒死!" —— 但由於沒人能聽到它的尖叫,葬禮隊伍以堅忍的決心繼續前進。
在稍微不那麼可怕的情況下,半斷開的節點可能會注意到它傳送的訊息沒有被其他節點確認,因此意識到網路中一定有故障。儘管如此,該節點被其他節點錯誤地宣佈死亡,半斷開的節點對此無能為力。
作為第三種情況,想象一個節點暫停執行一分鐘。在此期間,沒有請求被處理,也沒有響應被傳送。其他節點等待、重試、變得不耐煩,最終宣佈該節點死亡並將其裝上靈車。最後,暫停結束,節點的執行緒繼續執行,就好像什麼都沒發生過。其他節點驚訝地看到據稱已死的節點突然從棺材裡抬起頭來,健康狀況良好,開始愉快地與旁觀者聊天。起初,暫停的節點甚至沒有意識到整整一分鐘已經過去,它被宣佈死亡 —— 從它的角度來看,自從它上次與其他節點交談以來,幾乎沒有時間過去。
這些故事的寓意是,節點不一定能信任自己對情況的判斷。分散式系統不能完全依賴單個節點,因為節點可能隨時失效,可能使系統陷入困境並無法恢復。相反,許多分散式演算法依賴於 *仲裁*,即節點之間的投票(見 ["讀寫仲裁"](/tw/ch6#sec_replication_quorum_condition)):決策需要來自幾個節點的最少票數,以減少對任何一個特定節點的依賴。
這包括關於宣佈節點死亡的決定。如果節點的仲裁宣佈另一個節點死亡,那麼它必須被認為是死亡的,即使該節點仍然感覺自己非常活著。個別節點必須遵守仲裁決定並退出。
最常見的是,仲裁是超過半數節點的絕對多數(儘管其他型別的仲裁也是可能的)。多數仲裁允許系統在少數節點故障時繼續工作(三個節點可以容忍一個故障節點;五個節點可以容忍兩個故障節點)。然而,它仍然是安全的,因為系統中只能有一個多數 —— 不能同時有兩個具有衝突決策的多數。當我們在 [第 10 章](/tw/ch10#ch_consistency) 討論 *共識演算法* 時,我們將更詳細地討論仲裁的使用。
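多數仲裁的算術很簡單,可以用下面的草圖驗證(函式名是為說明而假設的):

```python
def majority(n: int) -> int:
    """n 個節點中構成多數所需的最少節點數。"""
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """仍能湊出多數時,最多可以容忍的故障節點數。"""
    return n - majority(n)


def majorities_always_overlap(n: int) -> bool:
    """兩個多數的節點數之和超過 n,因此必然至少共享一個節點。"""
    return 2 * majority(n) > n
```

三個節點需要 2 票、可容忍 1 個故障;五個節點需要 3 票、可容忍 2 個故障。四個節點需要 3 票,仍然只能容忍 1 個故障。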
### 分散式鎖和租約 {#sec_distributed_lock_fencing}
分散式應用程式中的鎖和租約容易被誤用,並且是錯誤的常見來源 [^84]。讓我們看看它們如何出錯的一個特定案例。
在 ["程序暫停"](#sec_distributed_clocks_pauses) 中,我們看到租約是一種超時的鎖,如果舊所有者停止響應(可能是因為它崩潰了、暫停太久或與網路斷開連線),可以分配給新所有者。你可以在系統需要只有一個某種東西的情況下使用租約。例如:
* 只允許一個節點成為資料庫分片的主節點,以避免腦裂(見 ["處理節點中斷"](/tw/ch6#sec_replication_failover))。
* 只允許一個事務或客戶端更新特定資源或物件,以防止併發寫入損壞它。
* 只有一個節點應該處理大型處理作業的給定輸入檔案,以避免由於多個節點冗餘地執行相同工作而浪費精力。
值得仔細思考如果幾個節點同時認為它們持有租約會發生什麼,可能是由於程序暫停。在第三個例子中,後果只是一些浪費的計算資源,這不是什麼大問題。但在前兩種情況下,後果可能是資料丟失或損壞,這要嚴重得多。
例如,[圖 9-4](#fig_distributed_lease_pause) 顯示了由於鎖的錯誤實現導致的資料損壞錯誤。(該錯誤不是理論上的:HBase 曾經有這個問題 [^85] [^86]。)假設你想確保儲存服務中的檔案一次只能由一個客戶端訪問,因為如果多個客戶端試圖寫入它,檔案將被損壞。你嘗試透過要求客戶端在訪問檔案之前從鎖服務獲取租約來實現這一點。這種鎖服務通常使用共識演算法實現;我們將在 [第 10 章](/tw/ch10#ch_consistency) 中進一步討論這一點。
{{< figure src="/fig/ddia_0904.png" id="fig_distributed_lease_pause" caption="圖 9-4. 分散式鎖的錯誤實現:客戶端 1 認為它仍然有有效的租約,即使它已經過期,因此損壞了儲存中的檔案。" class="w-full my-4" >}}
問題是我們在 ["程序暫停"](#sec_distributed_clocks_pauses) 中討論的一個例子:如果持有租約的客戶端暫停太久,其租約就會過期。另一個客戶端可以獲得同一檔案的租約,並開始寫入檔案。當暫停的客戶端回來時,它(錯誤地)認為它仍然有有效的租約,並繼續寫入檔案。我們現在有了腦裂情況:客戶端的寫入衝突並損壞了檔案。
[圖 9-5](#fig_distributed_lease_delay) 顯示了具有類似後果的另一個問題。在這個例子中沒有程序暫停,只有客戶端 1 的崩潰。就在客戶端 1 崩潰之前,它向儲存服務傳送了一個寫請求,但這個請求在網路中被延遲了很長時間。(請記住 ["實踐中的網路故障"](#sec_distributed_network_faults),資料包有時可能會延遲一分鐘或更長時間。)當寫請求到達儲存服務時,租約已經超時,允許客戶端 2 獲取它併發出自己的寫入。結果是類似於 [圖 9-4](#fig_distributed_lease_pause) 的損壞。
{{< figure src="/fig/ddia_0905.png" id="fig_distributed_lease_delay" caption="圖 9-5. 來自前租約持有者的訊息可能會延遲很長時間,並在另一個節點接管租約後到達。" class="w-full my-4" >}}
#### 隔離殭屍程序和延遲請求 {#sec_distributed_fencing_tokens}
術語 *殭屍* 有時用於描述尚未發現失去租約的前租約持有者,並且仍在充當當前租約持有者。由於我們不能完全排除殭屍,我們必須確保它們不能以腦裂的形式造成任何損害。這被稱為 *隔離* 殭屍。
一些系統試圖透過關閉殭屍來隔離它們,例如透過斷開它們與網路的連線 [^9]、透過雲提供商的管理介面關閉 VM,甚至物理關閉機器 [^87]。這種方法被稱為 *對端節點爆頭*(STONITH)。不幸的是,它存在一些問題:它不能防範像 [圖 9-5](#fig_distributed_lease_delay) 中那樣的大網路延遲;可能會發生所有節點相互關閉的情況 [^19];到檢測到殭屍並關閉它時,可能已經太晚了,資料可能已經被損壞。
一個更強大的隔離解決方案,可以防範殭屍和延遲請求,如 [圖 9-6](#fig_distributed_fencing) 所示。
{{< figure src="/fig/ddia_0906.png" id="fig_distributed_fencing" caption="圖 9-6. 透過只允許按遞增隔離令牌順序寫入來使儲存訪問安全。" class="w-full my-4" >}}
假設每次鎖服務授予鎖或租約時,它還返回一個 *隔離令牌*,這是一個每次授予鎖時都會增加的數字(例如,由鎖服務遞增)。然後我們可以要求客戶端每次向儲存服務傳送寫請求時,都必須包含其當前的隔離令牌。
--------
> [!NOTE]
> 隔離令牌有幾個替代名稱。在 Google 的鎖服務 Chubby 中,它們被稱為 *序列器* [^88],在 Kafka 中它們被稱為 *紀元編號*。在共識演算法中,我們將在 [第 10 章](/tw/ch10#ch_consistency) 中討論,*投票編號*(Paxos)或 *任期編號*(Raft)起著類似的作用。
--------
在 [圖 9-6](#fig_distributed_fencing) 中,客戶端 1 獲得帶有令牌 33 的租約,但隨後進入長時間暫停,租約過期。客戶端 2 獲得帶有令牌 34 的租約(數字總是增加),然後將其寫請求傳送到儲存服務,包括令牌 34。稍後,客戶端 1 恢復執行並將其寫入傳送到儲存服務,包括其令牌值 33。然而,儲存服務記得它已經處理了具有更高令牌編號(34)的寫入,因此它拒絕帶有令牌 33 的請求。剛剛獲得租約的客戶端必須立即向儲存服務進行寫入,一旦該寫入完成,任何殭屍都被隔離了。
如果 ZooKeeper 是你的鎖服務,你可以使用事務 ID `zxid` 或節點版本 `cversion` 作為隔離令牌 [^85]。使用 etcd,修訂號與租約 ID 一起起著類似的作用 [^89]。Hazelcast 中的 FencedLock API 明確生成隔離令牌 [^90]。
這種機制要求儲存服務有某種方法來檢查寫入是否基於過時的令牌。或者,服務支援僅在物件自當前客戶端上次讀取以來未被另一個客戶端寫入時才成功的寫入就足夠了,類似於原子比較並設定(CAS)操作。例如,物件儲存服務支援這種檢查:Amazon S3 稱之為 *條件寫入*,Azure Blob Storage 稱之為 *條件標頭*,Google Cloud Storage 稱之為 *請求前提條件*。
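圖 9-6 中儲存服務一側的檢查邏輯可以粗略表示如下。這是一個單程序的示意草圖,`FencedStorage` 和 `StaleTokenError` 是為說明而假設的名稱;真實的儲存服務還必須保證 "比較令牌並寫入" 這一步是原子的,並將見過的最大令牌持久化。

```python
class StaleTokenError(Exception):
    """寫入攜帶的隔離令牌比儲存服務已見過的令牌舊。"""


class FencedStorage:
    def __init__(self):
        self.max_token = -1  # 目前處理過的最大隔離令牌
        self.data = {}

    def write(self, token: int, key: str, value: str) -> None:
        if token < self.max_token:
            # 來自殭屍或被延遲的請求:拒絕
            raise StaleTokenError(f"token {token} < {self.max_token}")
        self.max_token = token
        self.data[key] = value


storage = FencedStorage()
storage.write(34, "file", "client 2 的資料")      # 客戶端 2 持有令牌 34
try:
    storage.write(33, "file", "client 1 的資料")  # 客戶端 1 從暫停中恢復,令牌 33
    zombie_rejected = False
except StaleTokenError:
    zombie_rejected = True
```

這裡允許同一令牌重複寫入(同一租約持有者可能發出多次寫入),只拒絕嚴格更小的令牌。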
#### 多副本隔離 {#fencing-with-multiple-replicas}
如果你的客戶端只需要寫入一個支援此類條件寫入的儲存服務,鎖服務在某種程度上是多餘的 [^91] [^92],因為租約分配本可以直接基於該儲存服務實現 [^93]。然而,一旦你有了隔離令牌,你也可以將其用於多個服務或副本,並確保舊的租約持有者在所有這些服務上都被隔離。
例如,想象儲存服務是一個具有最後寫入勝利衝突解決的無主複製鍵值儲存(見 ["無主複製"](/tw/ch6#sec_replication_leaderless))。在這樣的系統中,客戶端直接向每個副本傳送寫入,每個副本根據客戶端分配的時間戳獨立決定是否接受寫入。
如 [圖 9-7](#fig_distributed_fencing_leaderless) 所示,你可以將寫入者的隔離令牌放在時間戳的最高有效位或數字中。然後你可以確保新租約持有者生成的任何時間戳都將大於舊租約持有者的任何時間戳,即使舊租約持有者的寫入發生得更晚。
{{< figure src="/fig/ddia_0907.png" id="fig_distributed_fencing_leaderless" caption="圖 9-7. 使用隔離令牌保護對無主複製資料庫的寫入。" class="w-full my-4" >}}
在 [圖 9-7](#fig_distributed_fencing_leaderless) 中,客戶端 2 有隔離令牌 34,因此它所有以 34… 開頭的時間戳都大於客戶端 1 生成的任何以 33… 開頭的時間戳。客戶端 2 寫入副本的仲裁,但它無法到達副本 3。這意味著當殭屍客戶端 1 稍後嘗試寫入時,它的寫入可能在副本 3 上成功,即使它被副本 1 和 2 忽略。這不是問題,因為後續的仲裁讀取將更喜歡具有更大時間戳的客戶端 2 的寫入,讀修復或反熵最終將覆蓋客戶端 1 寫入的值。
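把隔離令牌放在時間戳高位的做法可以這樣示意。位寬 `CLOCK_BITS = 48` 和函式名 `fenced_timestamp` 都是為說明而假設的,實際的編碼方式取決於具體資料庫的時間戳格式。

```python
CLOCK_BITS = 48  # 假設:低 48 位存放毫秒級時鐘讀數


def fenced_timestamp(token: int, clock_ms: int) -> int:
    """把隔離令牌放在最高有效位,時鐘讀數放在低位。"""
    if not 0 <= clock_ms < (1 << CLOCK_BITS):
        raise ValueError("clock_ms 超出低位可表示的範圍")
    return (token << CLOCK_BITS) | clock_ms


# 殭屍客戶端 1(令牌 33)的寫入發生得更晚,本地時鐘讀數也更大
old_holder = fenced_timestamp(33, 2_000_000_000_000)
# 新租約持有者客戶端 2(令牌 34)
new_holder = fenced_timestamp(34, 1_000_000_000_000)
```

由於比較時令牌位優先,`new_holder` 總是大於 `old_holder`,因此最後寫入勝利的衝突解決會保留新租約持有者的寫入。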
從這些例子可以看出,假設任何時候只有一個節點持有租約是不安全的。幸運的是,透過一點小心,你可以使用隔離令牌來防止殭屍和延遲請求造成任何損害。
### 拜占庭故障 {#sec_distributed_byzantine}
隔離令牌可以檢測並阻止 *無意中* 出錯的節點(例如,因為它尚未發現其租約已過期)。然而,如果節點故意想要破壞系統的保證,它可以透過傳送帶有虛假隔離令牌的訊息輕鬆做到。
在本書中,我們假設節點是不可靠但誠實的:它們可能很慢或從不響應(由於故障),它們的狀態可能已過時(由於 GC 暫停或網路延遲),但我們假設如果節點 *確實* 響應,它就是在說 "真話":據它所知,它正在按協議規則行事。
如果節點可能 "撒謊"(傳送任意錯誤或損壞的響應)的風險存在,分散式系統問題會變得更加困難 —— 例如,它可能在同一次選舉中投出多個相互矛盾的票。這種行為被稱為 *拜占庭故障*,在這種不信任環境中達成共識的問題被稱為 *拜占庭將軍問題* [^94]。
--------
> [!TIP] 拜占庭將軍問題
>
> 拜占庭將軍問題是所謂 *兩將軍問題* [^95] 的推廣,它想象了兩個軍隊將軍需要就戰鬥計劃達成一致的情況。由於他們在兩個不同的地點扎營,他們只能透過信使進行通訊,信使有時會延遲或丟失(就像網路中的資料包)。我們將在 [第 10 章](/tw/ch10#ch_consistency) 中討論這個 *共識* 問題。
>
> 在問題的拜占庭版本中,有 *n* 個需要達成一致的將軍,他們的努力受到他們中間有一些叛徒的阻礙。大多數將軍是忠誠的,因此傳送真實的訊息,但叛徒可能試圖透過傳送虛假或不真實的訊息來欺騙和混淆其他人。事先不知道誰是叛徒。
>
> 拜占庭是一個古希臘城市,後來成為君士坦丁堡,位於現在土耳其的伊斯坦堡。沒有任何歷史證據表明拜占庭的將軍比其他地方的將軍更容易搞陰謀和密謀。相反,這個名字源自 *拜占庭* 一詞在 *過於複雜、官僚、狡猾* 的意義上的使用,這個詞在計算機出現之前很久就在政治中使用了 [^96]。Lamport 想選擇一個不會冒犯任何讀者的國籍,而有人建議他,稱之為 *阿爾巴尼亞將軍問題* 不是個好主意 [^97]。
--------
如果即使某些節點發生故障並且不遵守協議,或者惡意攻擊者干擾網路,系統仍能繼續正確執行,則該系統是 *拜占庭容錯* 的。這種擔憂在某些特定情況下是相關的。例如:
* 在航空航天環境中,計算機記憶體或 CPU 暫存器中的資料可能因輻射而損壞,導致它以任意不可預測的方式響應其他節點。由於系統故障的成本非常高昂(例如,飛機墜毀並導致機上所有人喪生,或火箭與國際空間站相撞),飛行控制系統必須容忍拜占庭故障 [^98] [^99]。
* 在有多個參與方的系統中,一些參與者可能試圖欺騙或欺詐其他人。在這種情況下,節點簡單地信任另一個節點的訊息是不安全的,因為它們可能是惡意傳送的。例如,比特幣等加密貨幣和其他區塊鏈可以被認為是讓相互不信任的各方就交易是否發生達成一致的一種方式,而無需依賴中央權威 [^100]。
然而,在我們在本書中討論的系統型別中,我們通常可以安全地假設沒有拜占庭故障。在資料中心中,所有節點都由你的組織控制(因此它們有望被信任),輻射水平足夠低,記憶體損壞不是主要問題(儘管正在考慮軌道資料中心 [^101])。多租戶系統有相互不信任的租戶,但它們使用防火牆、虛擬化和訪問控制策略相互隔離,而不是使用拜占庭容錯。使系統拜占庭容錯的協議相當昂貴 [^102],容錯嵌入式系統依賴於硬體級別的支援 [^98]。在大多數伺服器端資料系統中,部署拜占庭容錯解決方案的成本使它們不切實際。
Web 應用程式確實需要預期客戶端在終端使用者控制下的任意和惡意行為,例如 Web 瀏覽器。這就是輸入驗證、清理和輸出轉義如此重要的原因:例如,防止 SQL 注入和跨站指令碼攻擊。然而,我們通常不在這裡使用拜占庭容錯協議,而只是讓伺服器成為決定什麼客戶端行為被允許和不被允許的權威。在沒有這種中央權威的點對點網路中,拜占庭容錯更相關 [^103] [^104]。
軟體中的錯誤可以被視為拜占庭故障,但如果你將相同的軟體部署到所有節點,那麼拜占庭容錯演算法無法拯救你。大多數拜占庭容錯演算法需要超過三分之二的節點的絕對多數才能正常執行(例如,如果你有四個節點,最多一個可能發生故障)。要使用這種方法對付錯誤,你必須有四個相同軟體的獨立實現,並希望錯誤只出現在四個實現中的一個。
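"超過三分之二的節點必須正常" 這一條件,在許多經典的拜占庭容錯演算法中通常寫成 n ≥ 3f + 1(n 為節點總數,f 為可容忍的拜占庭故障節點數)。下面的草圖按這個常見界限計算 f;具體演算法的要求可能有所不同,函式名也是假設的。

```python
def max_byzantine_faults(n: int) -> int:
    """在 n >= 3f + 1 的假設下,n 個節點最多可容忍的拜占庭故障節點數 f。"""
    if n < 1:
        raise ValueError("節點數必須為正")
    return (n - 1) // 3
```

四個節點最多容忍一個故障節點(此時正常節點佔 3/4,超過三分之二);三個節點則一個也容忍不了。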
同樣,如果協議可以保護我們免受漏洞、安全入侵和惡意攻擊,那將是很有吸引力的。不幸的是,這也不現實:在大多數系統中,如果攻擊者可以攻破一個節點,他們很可能可以攻破所有節點,因為它們可能執行相同的軟體。因此,傳統機制(身份驗證、訪問控制、加密、防火牆等)仍然是防範攻擊者的主要防線。
#### 弱形式的謊言 {#weak-forms-of-lying}
儘管我們假設節點通常是誠實的,但向軟體新增防範弱形式 "謊言" 的機制可能是值得的 —— 例如,由於硬體問題、軟體錯誤和配置錯誤導致的無效訊息。這種保護機制不是完全的拜占庭容錯,因為它們無法抵禦堅定的對手,但它們仍然是朝著更好可靠性邁出的簡單而務實的步驟。例如:
* 由於硬體問題或作業系統、驅動程式、路由器等中的錯誤,網路資料包有時確實會損壞。通常,損壞的資料包會被內置於 TCP 和 UDP 中的校驗和捕獲,但有時它們會逃避檢測 [^105] [^106] [^107]。簡單的措施通常足以防範此類損壞,例如應用程式級協議中的校驗和。TLS 加密連線也提供防損壞保護。
* 公開可訪問的應用程式必須仔細清理來自使用者的任何輸入,例如檢查值是否在合理範圍內,並限制字串的大小以防止透過大記憶體分配進行拒絕服務。防火牆後面的內部服務可能能夠在輸入上進行較少嚴格的檢查,但協議解析器中的基本檢查仍然是個好主意 [^105]。
* NTP 客戶端可以配置多個伺服器地址。同步時,客戶端聯絡所有伺服器,估計它們的錯誤,並檢查大多數伺服器是否在某個時間範圍內達成一致。只要大多數伺服器都正常,報告不正確時間的配置錯誤的 NTP 伺服器就會被檢測為異常值並從同步中排除 [^39]。使用多個伺服器使 NTP 比僅使用單個伺服器更強大。
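以上面列表第一條提到的應用程式級校驗和為例,下面是一個使用 CRC-32 的最小草圖。幀格式(4 位元組大端校驗和加上負載)以及函式名 `encode_message`、`decode_message` 都是為說明而假設的。CRC 只能防範意外損壞,無法抵禦蓄意篡改。

```python
import struct
import zlib


def encode_message(payload: bytes) -> bytes:
    """在負載前附加 4 位元組的 CRC-32 校驗和(大端序)。"""
    return struct.pack(">I", zlib.crc32(payload)) + payload


def decode_message(frame: bytes) -> bytes:
    """驗證校驗和並回傳負載;不匹配時丟擲 ValueError。"""
    if len(frame) < 4:
        raise ValueError("frame too short")
    (expected,) = struct.unpack(">I", frame[:4])
    payload = frame[4:]
    if zlib.crc32(payload) != expected:
        raise ValueError("checksum mismatch: 訊息在傳輸中被損壞")
    return payload
```

接收方在解析負載之前先呼叫 `decode_message`,這樣損壞的訊息會被明確拒絕,而不是被當作有效資料處理。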
### 系統模型與現實 {#sec_distributed_system_model}
許多演算法被設計來解決分散式系統問題 —— 例如,我們將在 [第 10 章](/tw/ch10#ch_consistency) 中研究共識問題的解決方案。為了有用,這些演算法需要容忍我們在本章中討論的分散式系統的各種故障。
演算法需要以不過度依賴於它們執行的硬體和軟體配置細節的方式編寫。這反過來又要求我們以某種方式形式化我們期望在系統中發生的故障型別。我們透過定義 *系統模型* 來做到這一點,這是一個描述演算法可能假設什麼事情的抽象。
關於時序假設,三種系統模型常用:
同步模型
: 同步模型假設有界的網路延遲、有界的程序暫停和有界的時鐘誤差。這並不意味著精確同步的時鐘或零網路延遲;它只是意味著你知道網路延遲、暫停和時鐘漂移永遠不會超過某個固定的上限 [^108]。同步模型不是大多數實際系統的現實模型,因為(如本章所討論的)無界延遲和暫停確實會發生。
部分同步模型
: 部分同步意味著系統 *大部分時間* 表現得像同步系統,但有時會超過網路延遲、程序暫停和時鐘漂移的界限 [^108]。這是許多系統的現實模型:大部分時間,網路和程序表現相當良好 —— 否則我們永遠無法完成任何事情 —— 但我們必須考慮到任何時序假設偶爾可能會被打破的事實。發生這種情況時,網路延遲、暫停和時鐘誤差可能會變得任意大。
非同步模型
: 在這個模型中,演算法不允許做出任何時序假設 —— 事實上,它甚至沒有時鐘(因此它不能使用超時)。一些演算法可以為非同步模型設計,但它非常有限。
此外,除了時序問題,我們還必須考慮節點故障。節點的一些常見系統模型是:
崩潰停止故障
: 在 *崩潰停止*(或 *故障停止*)模型中,演算法可以假設節點只能以一種方式失效,即崩潰 [^109]。這意味著節點可能在任何時刻突然停止響應,此後該節點永遠消失 —— 它永遠不會回來。
崩潰恢復故障
: 我們假設節點可能在任何時刻崩潰,並且可能在某個未知時間後再次開始響應。在崩潰恢復模型中,假設節點具有跨崩潰保留的穩定儲存(即非易失性磁碟儲存),而記憶體中的狀態假設丟失。
效能下降和部分功能
: 除了崩潰和重啟之外,節點可能變慢:它們可能仍然能夠響應健康檢查請求,但速度太慢而無法完成任何實際工作。例如,千兆網路介面可能由於驅動程式錯誤突然降至 1 Kb/s 吞吐量 [^110];處於記憶體壓力下的程序可能會花費大部分時間執行垃圾回收 [^111];磨損的 SSD 可能具有不穩定的效能;硬體可能受到高溫、鬆動的聯結器、機械振動、電源問題、韌體錯誤等的影響 [^112]。這種情況被稱為 *跛行節點*、*灰色故障* 或 *慢速故障* [^113],它可能比乾淨失效的節點更難處理。一個相關的問題是:程序停止執行它本應做的某些事情,而其他方面繼續工作,例如因為後臺執行緒崩潰或死鎖 [^114]。
拜占庭(任意)故障
: 節點可能做任何事情,包括試圖矇騙和欺騙其他節點,如上一節所述。
對於建模真實系統,具有崩潰恢復故障的部分同步模型通常是最有用的模型。它允許無界的網路延遲、程序暫停和慢節點。但是分散式演算法如何應對該模型?
#### 定義演算法的正確性 {#defining-the-correctness-of-an-algorithm}
為了定義演算法 *正確* 的含義,我們可以描述它的 *屬性*。例如,排序演算法的輸出具有這樣的屬性:對於輸出列表的任何兩個不同元素,左邊的元素小於右邊的元素。這只是定義列表排序含義的正式方式。
同樣,我們可以寫下我們希望分散式演算法具有的屬性,以定義正確的含義。例如,如果我們為鎖生成隔離令牌(見 ["隔離殭屍程序和延遲請求"](#sec_distributed_fencing_tokens)),我們可能要求演算法具有以下屬性:
唯一性
: 沒有兩個隔離令牌請求返回相同的值。
單調序列
: 如果請求 *x* 返回令牌 `t_x`,請求 *y* 返回令牌 `t_y`,並且 *x* 在 *y* 開始之前完成,則 `t_x < t_y`。
可用性
: 請求隔離令牌且不崩潰的節點最終會收到響應。
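前兩個屬性可以寫成對一段執行歷史的檢查。以下草圖中的 `TokenRequest` 及兩個檢查函式是為說明而假設的名稱。可用性則不同:有限的歷史紀錄無法否定它(尚未收到響應的請求之後仍可能收到),因此這裡不檢查。

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenRequest:
    start: float  # 請求開始的時刻
    end: float    # 請求完成(收到令牌)的時刻
    token: int


def check_uniqueness(history) -> bool:
    """唯一性:沒有兩個請求返回相同的令牌。"""
    tokens = [r.token for r in history]
    return len(tokens) == len(set(tokens))


def check_monotonic(history) -> bool:
    """單調序列:若 x 在 y 開始之前完成,則 x 的令牌小於 y 的令牌。"""
    return all(x.token < y.token
               for x in history for y in history
               if x.end < y.start)


good = [TokenRequest(0.0, 1.0, 33), TokenRequest(2.0, 3.0, 34)]
bad = [TokenRequest(0.0, 1.0, 34), TokenRequest(2.0, 3.0, 33)]
```

請注意,時間上重疊的兩個請求之間沒有順序約束,只要令牌不同即可。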
如果演算法在我們假設該系統模型中可能發生的所有情況下始終滿足其屬性,則該演算法在某個系統模型中是正確的。然而,如果所有節點崩潰,或者所有網路延遲突然變得無限長,那麼沒有演算法能夠完成任何事情。即使在允許完全失效的系統模型中,我們如何仍然做出有用的保證?
#### 安全性與活性 {#sec_distributed_safety_liveness}
為了澄清情況,值得區分兩種不同型別的屬性:*安全性* 和 *活性* 屬性。在剛才給出的例子中,*唯一性* 和 *單調序列* 是安全屬性,但 *可用性* 是活性屬性。
什麼區分這兩種屬性?一個跡象是活性屬性通常在其定義中包含 "最終" 一詞。(是的,你猜對了 —— *最終一致性* 是一個活性屬性 [^115]。)
安全性通常被非正式地定義為 *沒有壞事發生*,活性被定義為 *好事最終會發生*。然而,最好不要過多地解讀這些非正式定義,因為 "好" 和 "壞" 是價值判斷,不能很好地應用於演算法。安全性和活性的實際定義更精確 [^116]:
* 如果違反了安全屬性,我們可以指出它被破壞的特定時間點(例如,如果違反了唯一性屬性,我們可以識別返回重複隔離令牌的特定操作)。在違反安全屬性之後,違規無法撤消 —— 損害已經造成。
* 活性屬性以相反的方式工作:它可能在某個時間點不成立(例如,節點可能已傳送請求但尚未收到響應),但總有希望它將來可能得到滿足(即透過接收響應)。
區分安全性和活性屬性的一個優點是它有助於我們處理困難的系統模型。對於分散式演算法,通常要求安全屬性在系統模型的所有可能情況下 *始終* 成立 [^108]。也就是說,即使所有節點崩潰,或整個網路失效,演算法也必須確保它不會返回錯誤的結果(即,安全屬性保持滿足)。
然而,對於活性屬性,我們可以附加限定條件:例如,我們可以說請求只有在大多數節點沒有崩潰時才需要收到響應,並且只有在網路最終從中斷中恢復時才需要響應。部分同步模型的定義要求系統最終返回到同步狀態,也就是說,任何網路中斷期只持續有限的時間,然後被修復。
#### 將系統模型對映到現實世界 {#mapping-system-models-to-the-real-world}
安全性和活性屬性以及系統模型對於推理分散式演算法的正確性非常有用。然而,在實踐中實現演算法時,現實的混亂事實又會回來咬你一口,很明顯系統模型是現實的簡化抽象。
例如,崩潰恢復模型中的演算法通常假設穩定儲存中的資料在崩潰後倖存。然而,如果磁碟上的資料損壞了,或者由於硬體錯誤或配置錯誤而擦除了資料,會發生什麼 [^117]?如果伺服器有韌體錯誤並且在重啟時無法識別其硬碟驅動器,即使驅動器正確連線到伺服器,會發生什麼 [^118]?
仲裁演算法(見 ["讀寫仲裁"](/tw/ch6#sec_replication_quorum_condition))依賴於節點記住它聲稱已儲存的資料。如果節點可能患有健忘症並忘記先前儲存的資料,那會破壞仲裁條件,從而破壞演算法的正確性。也許需要一個新的系統模型,其中我們假設穩定儲存大多在崩潰後倖存,但有時可能會丟失。但該模型隨後變得更難推理。
演算法的理論描述可以宣告某些事情被簡單地假設不會發生 —— 在非拜占庭系統中,我們確實必須對可能和不可能發生的故障做出一些假設。然而,真正的實現可能仍然必須包含程式碼來處理被假設為不可能的事情發生的情況,即使該處理歸結為 `printf("Sucks to be you")` 和 `exit(666)` —— 即,讓人類操作員清理爛攤子 [^119]。(這是計算機科學和軟體工程之間的一個區別。)
這並不是說理論上的、抽象的系統模型是無用的 —— 恰恰相反。它們非常有助於將真實系統的複雜性提煉為我們可以推理的可管理的故障集,以便我們可以理解問題並嘗試系統地解決它。
### 形式化方法和隨機測試 {#sec_distributed_formal}
我們如何知道演算法滿足所需的屬性?由於併發性、部分失效和網路延遲,存在大量潛在狀態。我們需要保證屬性在每個可能的狀態下都成立,並確保我們沒有忘記任何邊界情況。
一種方法是透過數學描述演算法來形式驗證它,並使用證明技術來表明它在系統模型允許的所有情況下都滿足所需的屬性。證明演算法正確並不意味著它在真實系統上的 *實現* 必然總是正確執行。但這是一個非常好的第一步,因為理論分析可以發現演算法中的問題,這些問題可能在真實系統中長時間隱藏,並且只有當你的假設(例如,關於時序)由於不尋常的情況而失敗時才會咬你一口。
將理論分析與經驗測試相結合以驗證實現按預期執行是明智的。基於屬性的測試、模糊測試和確定性模擬測試(DST)等技術使用隨機化來在各種情況下測試系統。亞馬遜網路服務等公司已成功地在其許多產品上使用了這些技術的組合 [^120] [^121]。
#### 模型檢查與規範語言 {#model-checking-and-specification-languages}
*模型檢查器* 是幫助驗證演算法或系統按預期執行的工具。演算法規範是用專門構建的語言編寫的,如 TLA+、Gallina 或 FizzBee。這些語言使得更容易專注於演算法的行為,而不必擔心程式碼實現細節。然後,模型檢查器使用這些模型透過系統地嘗試所有可能發生的事情來驗證不變數在演算法的所有狀態中都成立。
模型檢查實際上不能證明演算法的不變數對每個可能的狀態都成立,因為大多數現實世界的演算法都有無限的狀態空間。對所有狀態的真正驗證需要形式證明,這是可以做到的,但通常比執行模型檢查器更困難。相反,模型檢查器鼓勵你將演算法的模型減少到可以完全驗證的近似值,或者將執行限制到某個上限(例如,透過設定可以傳送的最大訊息數)。任何只在更長執行時發生的錯誤將不會被發現。
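為了直觀感受 "系統地嘗試所有可能發生的事情",下面用 Python 寫了一個極簡的顯式狀態探索草圖。它不是 TLA+ 或任何真實的模型檢查器,被檢查的模型(兩個程序各自對共享計數器執行非原子的 "讀取、再寫回加一")也是為說明而假設的。

```python
from collections import deque

N = 2  # 程序數量


def next_states(state):
    """列舉從某個狀態出發,任一程序執行一步後的所有後繼狀態。"""
    counter, pcs, local_vals = state
    for i in range(N):
        if pcs[i] == 0:    # 步驟 0:把共享計數器讀到本地變數
            new_local = local_vals[:i] + (counter,) + local_vals[i + 1:]
            yield (counter, pcs[:i] + (1,) + pcs[i + 1:], new_local)
        elif pcs[i] == 1:  # 步驟 1:寫回 "讀到的值 + 1"
            yield (local_vals[i] + 1, pcs[:i] + (2,) + pcs[i + 1:], local_vals)


def invariant(state) -> bool:
    """不變數:所有程序都結束時,計數器必須等於 N。"""
    counter, pcs, _ = state
    return not all(pc == 2 for pc in pcs) or counter == N


def check():
    """廣度優先遍歷所有可達狀態,回傳 (已訪問狀態, 違反不變數的狀態)。"""
    initial = (0, (0,) * N, (0,) * N)
    seen, queue, violations = {initial}, deque([initial]), []
    while queue:
        state = queue.popleft()
        if not invariant(state):
            violations.append(state)
        for nxt in next_states(state):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen, violations


seen, violations = check()
```

在這個模型中,探索會找到兩個程序都先讀到 0、再各自寫回 1 的交錯執行,最終計數器為 1,違反了不變數。這正是人工測試容易漏掉、而窮舉式探索能夠發現的那類錯誤。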
儘管如此,模型檢查器在易用性和查詢非顯而易見錯誤的能力之間取得了很好的平衡。CockroachDB、TiDB、Kafka 和許多其他分散式系統使用模型規範來查詢和修復錯誤 [^122] [^123] [^124]。例如,使用 TLA+,研究人員能夠證明由演算法的散文描述中的歧義引起的檢視戳複製(VR)中資料丟失的可能性 [^125]。
按設計,模型檢查器不執行你的實際程式碼,而是執行一個簡化的模型,該模型僅指定你的協議的核心思想。這使得系統地探索狀態空間更易處理,但風險在於你的規範和實現可能彼此脫節 [^126]。可以檢查模型和真實實現是否具有等效行為,但這需要對真實實現進行插樁(instrumentation)[^127]。
#### 故障注入 {#sec_fault_injection}
許多錯誤是在機器和網路故障發生時觸發的。故障注入是一種有效(有時令人恐懼)的技術,用於驗證系統的實現在出錯時是否按預期工作。這個想法很簡單:將故障注入到正在執行的系統環境中,看看它如何表現。故障可以是網路故障、機器崩潰、磁碟損壞、暫停的程序 —— 你能想象到的計算機出錯的任何事情。
故障注入測試通常在與系統將執行的生產環境非常相似的環境中執行。有些甚至直接將故障注入到他們的生產環境中。Netflix 透過他們的 Chaos Monkey 工具推廣了這種方法 [^128]。生產故障注入通常被稱為 *混沌工程*,我們在 ["可靠性與容錯"](/tw/ch2#sec_introduction_reliability) 中討論過。
要執行故障注入測試,首先部署被測系統以及故障注入協調器和指令碼。協調器負責決定執行什麼故障以及何時執行它們。本地或遠端指令碼負責將故障注入到單個節點或程序中。注入指令碼使用許多不同的工具來觸發故障。可以使用 Linux 的 `kill` 命令暫停或殺死 Linux 程序,可以使用 `umount` 解除安裝磁碟,可以透過防火牆設定中斷網路連線。你可以在注入故障期間和之後檢查系統行為,以確保事情按預期工作。
觸發故障所需的無數工具使故障注入測試編寫起來很麻煩。採用像 Jepsen 這樣的故障注入框架來執行故障注入測試以簡化過程是常見的。這些框架帶有各種作業系統的整合和許多預構建的故障注入器 [^129]。Jepsen 在許多廣泛使用的系統中發現關鍵錯誤方面非常有效 [^130] [^131]。
#### 確定性模擬測試 {#deterministic-simulation-testing}
確定性模擬測試(DST)也已成為模型檢查和故障注入的流行補充。它使用與模型檢查器類似的狀態空間探索過程,但它測試你的實際程式碼,而不是模型。
在 DST 中,模擬自動執行系統的大量隨機執行。模擬期間的網路通訊、I/O 和時鐘時序都被模擬替換,允許模擬器控制事情發生的確切順序,包括各種時序和故障場景。這允許模擬器探索比手寫測試或故障注入更多的情況。如果測試失敗,它可以重新執行,因為模擬器知道觸發故障的確切操作順序 —— 與故障注入相比,後者對系統沒有如此細粒度的控制。
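可重放性的核心想法可以用一個很小的草圖說明:只要所有隨機性都來自同一個種子,而且時鐘是模擬的,同一個種子就會產生完全相同的執行軌跡。下面的 `simulate` 函式及其引數都是為說明而假設的,遠比真實的 DST 框架簡單。

```python
import random


def simulate(seed: int, n_messages: int = 5, drop_rate: float = 0.2):
    """模擬一批訊息在不可靠網路上的傳遞,回傳 (送達時刻, 訊息 ID) 的軌跡。"""
    rng = random.Random(seed)  # 所有非確定性都來自這個種子
    clock = 0.0                # 模擬時鐘:從不讀取真實時間
    in_flight = []
    for msg_id in range(n_messages):
        if rng.random() < drop_rate:
            continue           # 模擬資料包丟失
        delay = rng.uniform(0.001, 1.0)  # 模擬可變的網路延遲
        in_flight.append((clock + delay, msg_id))

    trace = []
    for deliver_at, msg_id in sorted(in_flight):
        clock = deliver_at     # 直接把模擬時鐘撥到送達時刻,無需真正等待
        trace.append((round(clock, 6), msg_id))
    return trace
```

如果某個種子觸發了錯誤,用同一個種子再次呼叫 `simulate` 就能重現完全相同的訊息順序和延遲。由於時鐘是模擬的,測試也不需要花費真實的等待時間。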
DST 要求模擬器能夠控制所有非確定性來源,例如網路延遲。通常採用三種策略之一來使程式碼確定性:
應用程式級
: 一些系統從頭開始構建,以便於確定性地執行程式碼。例如,DST 領域的先驅之一 FoundationDB 是使用稱為 Flow 的非同步通訊庫構建的。Flow 為開發人員提供了將確定性網路模擬注入系統的點 [^132]。類似地,TigerBeetle 是一個具有一流 DST 支援的線上事務處理(OLTP)資料庫。系統的狀態被建模為狀態機,所有突變都發生在單個事件迴圈中。當與模擬確定性原語(如時鐘)結合時,這種架構能夠確定性地執行 [^133]。
執行時級
: 具有非同步執行時和常用庫的語言提供了引入確定性的插入點。使用單執行緒執行時強制所有非同步程式碼按順序執行。例如,FrostDB 修補 Go 的執行時以按順序執行 goroutine [^134]。Rust 的 madsim 庫以類似的方式工作。Madsim 提供了 Tokio 的非同步執行時 API、AWS 的 S3 庫、Kafka 的 Rust 庫等的確定性實現。應用程式可以交換確定性庫和執行時以獲得確定性測試執行,而無需更改其程式碼。
機器級
: 與其在執行時修補程式碼,不如使整個機器確定性。這是一個微妙的過程,需要機器對所有通常非確定性的呼叫響應確定性響應。Antithesis 等工具透過構建自定義虛擬機器管理程式來做到這一點,該虛擬機器管理程式用確定性操作替換通常的非確定性操作。從時鐘到網路和儲存的一切都需要考慮。不過,一旦完成,開發人員可以在虛擬機器管理程式內的容器集合中執行其整個分散式系統,並獲得完全確定性的分散式系統。
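DST provides several advantages beyond replayability. Tools such as Antithesis try to explore many different code paths in application code by branching a test execution into multiple sub-executions whenever they discover less common behavior. And because deterministic tests often use mock clocks and network calls, they can run faster than wall-clock time. For example, TigerBeetle's time abstraction allows simulations to model network latency and timeouts without actually waiting for the full length of time needed to trigger a timeout. These techniques allow the simulator to explore more code paths more quickly.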
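The idea of running faster than wall-clock time can be illustrated with a simulated clock that jumps directly to the next scheduled deadline instead of sleeping. The following Python sketch is a generic illustration, not TigerBeetle's actual time abstraction. The class `SimClock`, its methods, and the event names are all hypothetical.

```python
import heapq


class SimClock:
    """Simulated clock: time only moves when the simulator advances it."""

    def __init__(self):
        self.now_ms = 0
        self._timers = []  # min-heap of (deadline_ms, sequence, callback)
        self._seq = 0      # tie-breaker so equal deadlines fire in a fixed order

    def call_later(self, delay_ms, callback):
        heapq.heappush(self._timers, (self.now_ms + delay_ms, self._seq, callback))
        self._seq += 1

    def run_until_idle(self):
        while self._timers:
            deadline, _, callback = heapq.heappop(self._timers)
            self.now_ms = deadline  # jump straight to the deadline, no sleeping
            callback()


def demo():
    clock = SimClock()
    events = []
    # A 30-second timeout and a 50 ms reply, both scheduled in simulated time.
    clock.call_later(30_000, lambda: events.append((clock.now_ms, "request timed out")))
    clock.call_later(50, lambda: events.append((clock.now_ms, "reply arrived")))
    clock.run_until_idle()
    return events
```

In this sketch the 30-second timeout fires as soon as the earlier event has been processed, so a test exercising the timeout path takes microseconds of real time rather than half a minute.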
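#### The Power of Determinism {#sidebar_distributed_determinism}
Nondeterminism is at the core of all the distributed systems challenges we have discussed in this chapter: concurrency, network delays, process pauses, clock jumps, and crashes all happen in unpredictable ways that vary from one run of a system to the next. Conversely, if you can make a system deterministic, that can hugely simplify things.
In fact, making things deterministic is a simple but powerful idea that comes up again and again in distributed system design. Besides deterministic simulation testing, we have seen several uses of determinism in earlier chapters:
* A key advantage of event sourcing (see ["Event Sourcing and CQRS"](/tw/ch3#sec_datamodels_events)) is that you can deterministically replay a log of events to rebuild derived materialized views.
* Workflow engines (see ["Durable Execution and Workflows"](/tw/ch5#sec_encoding_dataflow_workflows)) rely on workflow definitions being deterministic in order to provide durable execution semantics.
* *State machine replication*, which we will discuss in ["Using Shared Logs"](/tw/ch10#sec_consistency_smr), replicates data by independently executing the same sequence of deterministic transactions on each replica. We have already seen two variants of this idea: statement-based replication (see ["Implementation of Replication Logs"](/tw/ch6#sec_replication_implementation)) and serial transaction execution using stored procedures (see ["Pros and Cons of Stored Procedures"](/tw/ch8#sec_transactions_stored_proc_tradeoffs)).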
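What these examples share is a transition function whose result depends only on the current state and the next log entry. The Python sketch below illustrates that property with a hypothetical key-value command format (`set`, `incr`, `delete`). The command format and the names `apply` and `replay` are assumptions made for this example, not an interface taken from any of the systems mentioned above.

```python
def apply(state, command):
    """Deterministic transition: depends only on `state` and `command`."""
    op, key, value = command
    new_state = dict(state)
    if op == "set":
        new_state[key] = value
    elif op == "incr":
        new_state[key] = new_state.get(key, 0) + value
    elif op == "delete":
        new_state.pop(key, None)
    else:
        raise ValueError(f"unknown op: {op}")
    return new_state


def replay(log):
    """Rebuild state from scratch by applying every log entry in order."""
    state = {}
    for command in log:
        state = apply(state, command)
    return state
```

Any two replicas that replay the same log in the same order end up with the same state. If `apply` consulted the wall clock or a random number generator, that guarantee would be lost.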
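However, making code fully deterministic requires care. Even after you have removed all concurrency and replaced I/O, network communication, clocks, and random number generators with deterministic simulations, some nondeterminism may remain. For example, in some programming languages the order in which you iterate over the elements of a hash table may be nondeterministic. Whether you hit a resource limit (memory allocation failure, stack overflow) is also nondeterministic.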
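Python is one language where this pitfall shows up: string hashing is randomized per process by default, so the iteration order of a `set` of strings can differ from one run to the next even when a seeded random number generator is used. The sketch below shows one common remedy, imposing an explicit order before making a "random" choice. The function `pick_peer` and the node names are hypothetical.

```python
import random


def pick_peer(peers, seed):
    """Choose a peer reproducibly from an unordered collection of names."""
    rng = random.Random(seed)
    # list(peers) would depend on hash-based iteration order, which can vary
    # between processes; sorting first makes the choice depend only on `seed`.
    return rng.choice(sorted(peers))
```

With the sort in place, the same seed selects the same peer regardless of how the set happens to be laid out in memory.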
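## Summary {#summary}
In this chapter we have discussed a wide range of problems that can occur in distributed systems, including:
* Whenever you try to send a packet over the network, it may be lost or arbitrarily delayed. Likewise, the reply may be lost or delayed, so if you don't get a reply, you have no idea whether the message got through.
* A node's clock may be significantly out of sync with other nodes (despite your best efforts to set up NTP), it may suddenly jump forward or backward, and relying on it is dangerous because you most likely don't have a good measure of your clock's confidence interval.
* A process may pause for a substantial amount of time at any point in its execution, be declared dead by other nodes, and then come back to life without realizing that it was paused.
The fact that such *partial failures* can occur is the defining characteristic of distributed systems. Whenever software tries to do anything involving other nodes, there is a possibility that it may occasionally fail, randomly go slow, or not respond at all (and eventually time out). In distributed systems, we try to build tolerance of partial failures into software, so that the system as a whole can continue functioning even when some of its constituent parts are broken.
To tolerate faults, the first step is to *detect* them, but even that is hard. Most systems have no accurate mechanism for detecting whether a node has failed, so most distributed algorithms rely on timeouts to determine whether a remote node is still available. However, timeouts cannot distinguish between network failures and node failures, and variable network delay sometimes causes a node to be falsely suspected of having crashed. Handling limping nodes, which are responding but are too slow to do anything useful, is even harder.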
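A timeout-based failure detector can be sketched as follows. This is a deliberately simple illustration with a fixed timeout. The class name, the injected `clock` callable, and the node names are assumptions for the example. In production code the clock would typically be a monotonic clock such as `time.monotonic`.

```python
class TimeoutFailureDetector:
    """Suspect a node if no heartbeat has arrived within `timeout` seconds."""

    def __init__(self, timeout, clock):
        self.timeout = timeout
        self.clock = clock      # callable returning the current time in seconds
        self.last_heard = {}    # node name -> time of the last heartbeat

    def heartbeat(self, node):
        self.last_heard[node] = self.clock()

    def suspected(self, node):
        last = self.last_heard.get(node)
        return last is None or self.clock() - last > self.timeout


def demo():
    now = [0.0]  # fake clock so the example does not need to sleep
    detector = TimeoutFailureDetector(timeout=5.0, clock=lambda: now[0])
    detector.heartbeat("node-b")
    now[0] = 6.0                              # heartbeat delayed in the network
    wrongly_suspected = detector.suspected("node-b")
    detector.heartbeat("node-b")              # the delayed heartbeat finally arrives
    still_suspected = detector.suspected("node-b")
    return wrongly_suspected, still_suspected
```

The demo shows the weakness described above: the detector cannot tell a crashed node from a slow network, so a live node is suspected until its delayed heartbeat arrives.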
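Once a fault is detected, making a system tolerate it is not easy either: there is no global variable, no shared memory, no common knowledge, nor any other kind of shared state between the machines [^83]. Nodes can't even agree on what time it is, let alone on anything more profound. The only way information can flow from one node to another is by sending it over the unreliable network. A single node cannot safely make major decisions, so we need protocols that enlist help from other nodes and try to get a quorum to agree.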
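As a small illustration of quorum agreement, the following Python sketch assumes a simple majority quorum, which is only one possible quorum definition. The function names and vote format are hypothetical. Nodes that did not reply (for example, because they are unreachable) simply do not appear in `votes` and therefore do not count toward the quorum.

```python
def majority(cluster_size):
    """Smallest number of nodes that forms a strict majority."""
    return cluster_size // 2 + 1


def quorum_agrees(votes, cluster_size):
    """Return True if a majority of the whole cluster voted yes.

    `votes` maps node name -> bool and contains only the nodes that replied.
    """
    yes_votes = sum(1 for vote in votes.values() if vote)
    return yes_votes >= majority(cluster_size)
```

Because the threshold is computed from the full cluster size rather than from the number of replies, two disjoint groups of nodes can never both reach a majority at the same time.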
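If you're used to writing software in the idealized mathematical perfection of a single computer, where the same operation always deterministically returns the same result, then moving to the messy physical reality of distributed systems can be a bit of a shock. Conversely, distributed systems engineers will often regard a problem as trivial if it can be solved on a single computer [^4], and indeed a single computer can do a lot nowadays. If you can avoid opening Pandora's box and simply keep things on a single machine, for example by using an embedded storage engine (see ["Embedded Storage Engines"](/tw/ch4#sidebar_embedded)), it is generally worth doing so.
However, as discussed in ["Distributed versus Single-Node Systems"](/tw/ch1#sec_introduction_distributed), scalability is not the only reason for using a distributed system. Fault tolerance and low latency (by placing data geographically close to users) are equally important goals, and those things cannot be achieved with a single node. The power of distributed systems is that, in principle, they can run forever without being interrupted at the service level, because all faults and maintenance can be handled at the node level. (In practice, a bad configuration change that is rolled out to all nodes can still bring a distributed system to its knees.)
In this chapter we also explored whether the unreliability of networks, clocks, and processes is an inevitable law of nature. We saw that it isn't: it is possible to provide hard real-time response guarantees and bounded delays in networks, but doing so is very expensive and results in lower utilization of hardware resources. Most non-safety-critical systems choose cheap and unreliable over expensive and reliable.
This chapter has been all about problems, and it has given us a bleak outlook. In the next chapter we will move on to solutions and discuss some algorithms that have been designed to cope with the problems of distributed systems.
### References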
[^1]: Mark Cavage. [There’s Just No Getting Around It: You’re Building a Distributed System](https://queue.acm.org/detail.cfm?id=2482856). *ACM Queue*, volume 11, issue 4, pages 80–89, April 2013. [doi:10.1145/2466486.2482856](https://doi.org/10.1145/2466486.2482856)
[^2]: Jay Kreps. [Getting Real About Distributed System Reliability](https://blog.empathybox.com/post/19574936361/getting-real-about-distributed-system-reliability). *blog.empathybox.com*, March 2012. Archived at [perma.cc/9B5Q-AEBW](https://perma.cc/9B5Q-AEBW)
[^3]: Coda Hale. [You Can’t Sacrifice Partition Tolerance](https://codahale.com/you-cant-sacrifice-partition-tolerance/). *codahale.com*, October 2010.
[^4]: Jeff Hodges. [Notes on Distributed Systems for Young Bloods](https://www.somethingsimilar.com/2013/01/14/notes-on-distributed-systems-for-young-bloods/). *somethingsimilar.com*, January 2013. Archived at [perma.cc/B636-62CE](https://perma.cc/B636-62CE)
[^5]: Van Jacobson. [Congestion Avoidance and Control](https://www.cs.usask.ca/ftp/pub/discus/seminars2002-2003/p314-jacobson.pdf). At *ACM Symposium on Communications Architectures and Protocols* (SIGCOMM), August 1988. [doi:10.1145/52324.52356](https://doi.org/10.1145/52324.52356)
[^6]: Bert Hubert. [The Ultimate SO\_LINGER Page, or: Why Is My TCP Not Reliable](https://blog.netherlabs.nl/articles/2009/01/18/the-ultimate-so_linger-page-or-why-is-my-tcp-not-reliable). *blog.netherlabs.nl*, January 2009. Archived at [perma.cc/6HDX-L2RR](https://perma.cc/6HDX-L2RR)
[^7]: Jerome H. Saltzer, David P. Reed, and David D. Clark. [End-To-End Arguments in System Design](https://groups.csail.mit.edu/ana/Publications/PubPDFs/End-to-End%20Arguments%20in%20System%20Design.pdf). *ACM Transactions on Computer Systems*, volume 2, issue 4, pages 277–288, November 1984. [doi:10.1145/357401.357402](https://doi.org/10.1145/357401.357402)
[^8]: Peter Bailis and Kyle Kingsbury. [The Network Is Reliable](https://queue.acm.org/detail.cfm?id=2655736). *ACM Queue*, volume 12, issue 7, pages 48–55, July 2014. [doi:10.1145/2639988.2639988](https://doi.org/10.1145/2639988.2639988)
[^9]: Joshua B. Leners, Trinabh Gupta, Marcos K. Aguilera, and Michael Walfish. [Taming Uncertainty in Distributed Systems with Help from the Network](https://cs.nyu.edu/~mwalfish/papers/albatross-eurosys15.pdf). At *10th European Conference on Computer Systems* (EuroSys), April 2015. [doi:10.1145/2741948.2741976](https://doi.org/10.1145/2741948.2741976)
[^10]: Phillipa Gill, Navendu Jain, and Nachiappan Nagappan. [Understanding Network Failures in Data Centers: Measurement, Analysis, and Implications](https://conferences.sigcomm.org/sigcomm/2011/papers/sigcomm/p350.pdf). At *ACM SIGCOMM Conference*, August 2011. [doi:10.1145/2018436.2018477](https://doi.org/10.1145/2018436.2018477)
[^11]: Urs Hölzle. [But recently a farmer had started grazing a herd of cows nearby. And whenever they stepped on the fiber link, they bent it enough to cause a blip](https://x.com/uhoelzle/status/1263333283107991558). *x.com*, May 2020. Archived at [perma.cc/WX8X-ZZA5](https://perma.cc/WX8X-ZZA5)
[^12]: CBC News. [Hundreds lose internet service in northern B.C. after beaver chews through cable](https://www.cbc.ca/news/canada/british-columbia/beaver-internet-down-tumbler-ridge-1.6001594). *cbc.ca*, April 2021. Archived at [perma.cc/UW8C-H2MY](https://perma.cc/UW8C-H2MY)
[^13]: Will Oremus. [The Global Internet Is Being Attacked by Sharks, Google Confirms](https://slate.com/technology/2014/08/shark-attacks-threaten-google-s-undersea-internet-cables-video.html). *slate.com*, August 2014. Archived at [perma.cc/P6F3-C6YG](https://perma.cc/P6F3-C6YG)
[^14]: Jess Auerbach Jahajeeah. [Down to the wire: The ship fixing our internet](https://continent.substack.com/p/down-to-the-wire-the-ship-fixing). *continent.substack.com*, November 2023. Archived at [perma.cc/DP7B-EQ7S](https://perma.cc/DP7B-EQ7S)
[^15]: Santosh Janardhan. [More details about the October 4 outage](https://engineering.fb.com/2021/10/05/networking-traffic/outage-details/). *engineering.fb.com*, October 2021. Archived at [perma.cc/WW89-VSXH](https://perma.cc/WW89-VSXH)
[^16]: Tom Parfitt. [Georgian woman cuts off web access to whole of Armenia](https://www.theguardian.com/world/2011/apr/06/georgian-woman-cuts-web-access). *theguardian.com*, April 2011. Archived at [perma.cc/KMC3-N3NZ](https://perma.cc/KMC3-N3NZ)
[^17]: Antonio Voce, Tural Ahmedzade and Ashley Kirk. [‘Shadow fleets’ and subaquatic sabotage: are Europe’s undersea internet cables under attack?](https://www.theguardian.com/world/ng-interactive/2025/mar/05/shadow-fleets-subaquatic-sabotage-europe-undersea-internet-cables-under-attack) *theguardian.com*, March 2025. Archived at [perma.cc/HA7S-ZDBV](https://perma.cc/HA7S-ZDBV)
[^18]: Shengyun Liu, Paolo Viotti, Christian Cachin, Vivien Quéma, and Marko Vukolić. [XFT: Practical Fault Tolerance beyond Crashes](https://www.usenix.org/system/files/conference/osdi16/osdi16-liu.pdf). At *12th USENIX Symposium on Operating Systems Design and Implementation* (OSDI), November 2016.
[^19]: Mark Imbriaco. [Downtime last Saturday](https://github.blog/news-insights/the-library/downtime-last-saturday/). *github.blog*, December 2012. Archived at [perma.cc/M7X5-E8SQ](https://perma.cc/M7X5-E8SQ)
[^20]: Tom Lianza and Chris Snook. [A Byzantine failure in the real world](https://blog.cloudflare.com/a-byzantine-failure-in-the-real-world/). *blog.cloudflare.com*, November 2020. Archived at [perma.cc/83EZ-ALCY](https://perma.cc/83EZ-ALCY)
[^21]: Mohammed Alfatafta, Basil Alkhatib, Ahmed Alquraan, and Samer Al-Kiswany. [Toward a Generic Fault Tolerance Technique for Partial Network Partitioning](https://www.usenix.org/conference/osdi20/presentation/alfatafta). At *14th USENIX Symposium on Operating Systems Design and Implementation* (OSDI), November 2020.
[^22]: Marc A. Donges. [Re: bnx2 cards Intermittantly Going Offline](https://www.spinics.net/lists/netdev/msg210485.html). Message to Linux *netdev* mailing list, *spinics.net*, September 2012. Archived at [perma.cc/TXP6-H8R3](https://perma.cc/TXP6-H8R3)
[^23]: Troy Toman. [Inside a CODE RED: Network Edition](https://signalvnoise.com/svn3/inside-a-code-red-network-edition/). *signalvnoise.com*, September 2020. Archived at [perma.cc/BET6-FY25](https://perma.cc/BET6-FY25)
[^24]: Kyle Kingsbury. [Call Me Maybe: Elasticsearch](https://aphyr.com/posts/317-call-me-maybe-elasticsearch). *aphyr.com*, June 2014. Archived at [perma.cc/JK47-S89J](https://perma.cc/JK47-S89J)
[^25]: Salvatore Sanfilippo. [A Few Arguments About Redis Sentinel Properties and Fail Scenarios](https://antirez.com/news/80). *antirez.com*, October 2014. Archived at [perma.cc/8XEU-CLM8](https://perma.cc/8XEU-CLM8)
[^26]: Nicolas Liochon. [CAP: If All You Have Is a Timeout, Everything Looks Like a Partition](http://blog.thislongrun.com/2015/05/CAP-theorem-partition-timeout-zookeeper.html). *blog.thislongrun.com*, May 2015. Archived at [perma.cc/FS57-V2PZ](https://perma.cc/FS57-V2PZ)
[^27]: Matthew P. Grosvenor, Malte Schwarzkopf, Ionel Gog, Robert N. M. Watson, Andrew W. Moore, Steven Hand, and Jon Crowcroft. [Queues Don’t Matter When You Can JUMP Them!](https://www.usenix.org/system/files/conference/nsdi15/nsdi15-paper-grosvenor_update.pdf) At *12th USENIX Symposium on Networked Systems Design and Implementation* (NSDI), May 2015.
[^28]: Theo Julienne. [Debugging network stalls on Kubernetes](https://github.blog/engineering/debugging-network-stalls-on-kubernetes/). *github.blog*, November 2019. Archived at [perma.cc/K9M8-XVGL](https://perma.cc/K9M8-XVGL)
[^29]: Guohui Wang and T. S. Eugene Ng. [The Impact of Virtualization on Network Performance of Amazon EC2 Data Center](https://www.cs.rice.edu/~eugeneng/papers/INFOCOM10-ec2.pdf). At *29th IEEE International Conference on Computer Communications* (INFOCOM), March 2010. [doi:10.1109/INFCOM.2010.5461931](https://doi.org/10.1109/INFCOM.2010.5461931)
[^30]: Brandon Philips. [etcd: Distributed Locking and Service Discovery](https://www.youtube.com/watch?v=HJIjTTHWYnE). At *Strange Loop*, September 2014.
[^31]: Steve Newman. [A Systematic Look at EC2 I/O](https://www.sentinelone.com/blog/a-systematic-look-at-ec2-i-o/). *blog.scalyr.com*, October 2012. Archived at [perma.cc/FL4R-H2VE](https://perma.cc/FL4R-H2VE)
[^32]: Naohiro Hayashibara, Xavier Défago, Rami Yared, and Takuya Katayama. [The ϕ Accrual Failure Detector](https://hdl.handle.net/10119/4784). Japan Advanced Institute of Science and Technology, School of Information Science, Technical Report IS-RR-2004-010, May 2004. Archived at [perma.cc/NSM2-TRYA](https://perma.cc/NSM2-TRYA)
[^33]: Jeffrey Wang. [Phi Accrual Failure Detector](https://ternarysearch.blogspot.com/2013/08/phi-accrual-failure-detector.html). *ternarysearch.blogspot.co.uk*, August 2013. Archived at [perma.cc/L452-AMLV](https://perma.cc/L452-AMLV)
[^34]: Srinivasan Keshav. *An Engineering Approach to Computer Networking: ATM Networks, the Internet, and the Telephone Network*. Addison-Wesley Professional, May 1997. ISBN: 978-0-201-63442-6
[^35]: Othmar Kyas. *ATM Networks*. International Thomson Publishing, 1995. ISBN: 978-1-850-32128-6
[^36]: Mellanox Technologies. [InfiniBand FAQ, Rev 1.3](https://network.nvidia.com/related-docs/whitepapers/InfiniBandFAQ_FQ_100.pdf). *network.nvidia.com*, December 2014. Archived at [perma.cc/LQJ4-QZVK](https://perma.cc/LQJ4-QZVK)
[^37]: Jose Renato Santos, Yoshio Turner, and G. (John) Janakiraman. [End-to-End Congestion Control for InfiniBand](https://infocom2003.ieee-infocom.org/papers/28_01.PDF). At *22nd Annual Joint Conference of the IEEE Computer and Communications Societies* (INFOCOM), April 2003. Also published by HP Laboratories Palo Alto, Tech Report HPL-2002-359. [doi:10.1109/INFCOM.2003.1208949](https://doi.org/10.1109/INFCOM.2003.1208949)
[^38]: Jialin Li, Naveen Kr. Sharma, Dan R. K. Ports, and Steven D. Gribble. [Tales of the Tail: Hardware, OS, and Application-level Sources of Tail Latency](https://syslab.cs.washington.edu/papers/latency-socc14.pdf). At *ACM Symposium on Cloud Computing* (SoCC), November 2014. [doi:10.1145/2670979.2670988](https://doi.org/10.1145/2670979.2670988)
[^39]: Ulrich Windl, David Dalton, Marc Martinec, and Dale R. Worley. [The NTP FAQ and HOWTO](https://www.ntp.org/ntpfaq/). *ntp.org*, November 2006.
[^40]: John Graham-Cumming. [How and why the leap second affected Cloudflare DNS](https://blog.cloudflare.com/how-and-why-the-leap-second-affected-cloudflare-dns/). *blog.cloudflare.com*, January 2017. Archived at [archive.org](https://web.archive.org/web/20250202041444/https%3A//blog.cloudflare.com/how-and-why-the-leap-second-affected-cloudflare-dns/)
[^41]: David Holmes. [Inside the Hotspot VM: Clocks, Timers and Scheduling Events – Part I – Windows](https://web.archive.org/web/20160308031939/https%3A//blogs.oracle.com/dholmes/entry/inside_the_hotspot_vm_clocks). *blogs.oracle.com*, October 2006. Archived at [archive.org](https://web.archive.org/web/20160308031939/https%3A//blogs.oracle.com/dholmes/entry/inside_the_hotspot_vm_clocks)
[^42]: Joran Dirk Greef. [Three Clocks are Better than One](https://tigerbeetle.com/blog/2021-08-30-three-clocks-are-better-than-one/). *tigerbeetle.com*, August 2021. Archived at [perma.cc/5RXG-EU6B](https://perma.cc/5RXG-EU6B)
[^43]: Oliver Yang. [Pitfalls of TSC usage](https://oliveryang.net/2015/09/pitfalls-of-TSC-usage/). *oliveryang.net*, September 2015. Archived at [perma.cc/Z2QY-5FRA](https://perma.cc/Z2QY-5FRA)
[^44]: Steve Loughran. [Time on Multi-Core, Multi-Socket Servers](https://steveloughran.blogspot.com/2015/09/time-on-multi-core-multi-socket-servers.html). *steveloughran.blogspot.co.uk*, September 2015. Archived at [perma.cc/7M4S-D4U6](https://perma.cc/7M4S-D4U6)
[^45]: James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, JJ Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Dale Woodford, Yasushi Saito, Christopher Taylor, Michal Szymaniak, and Ruth Wang. [Spanner: Google’s Globally-Distributed Database](https://research.google/pubs/pub39966/). At *10th USENIX Symposium on Operating System Design and Implementation* (OSDI), October 2012.
[^46]: M. Caporaloni and R. Ambrosini. [How Closely Can a Personal Computer Clock Track the UTC Timescale Via the Internet?](https://iopscience.iop.org/0143-0807/23/4/103/) *European Journal of Physics*, volume 23, issue 4, pages L17–L21, June 2002. [doi:10.1088/0143-0807/23/4/103](https://doi.org/10.1088/0143-0807/23/4/103)
[^47]: Nelson Minar. [A Survey of the NTP Network](https://alumni.media.mit.edu/~nelson/research/ntp-survey99/). *alumni.media.mit.edu*, December 1999. Archived at [perma.cc/EV76-7ZV3](https://perma.cc/EV76-7ZV3)
[^48]: Viliam Holub. [Synchronizing Clocks in a Cassandra Cluster Pt. 1 – The Problem](https://blog.rapid7.com/2014/03/14/synchronizing-clocks-in-a-cassandra-cluster-pt-1-the-problem/). *blog.rapid7.com*, March 2014. Archived at [perma.cc/N3RV-5LNL](https://perma.cc/N3RV-5LNL)
[^49]: Poul-Henning Kamp. [The One-Second War (What Time Will You Die?)](https://queue.acm.org/detail.cfm?id=1967009) *ACM Queue*, volume 9, issue 4, pages 44–48, April 2011. [doi:10.1145/1966989.1967009](https://doi.org/10.1145/1966989.1967009)
[^50]: Nelson Minar. [Leap Second Crashes Half the Internet](https://www.somebits.com/weblog/tech/bad/leap-second-2012.html). *somebits.com*, July 2012. Archived at [perma.cc/2WB8-D6EU](https://perma.cc/2WB8-D6EU)
[^51]: Christopher Pascoe. [Time, Technology and Leaping Seconds](https://googleblog.blogspot.com/2011/09/time-technology-and-leaping-seconds.html). *googleblog.blogspot.co.uk*, September 2011. Archived at [perma.cc/U2JL-7E74](https://perma.cc/U2JL-7E74)
[^52]: Mingxue Zhao and Jeff Barr. [Look Before You Leap – The Coming Leap Second and AWS](https://aws.amazon.com/blogs/aws/look-before-you-leap-the-coming-leap-second-and-aws/). *aws.amazon.com*, May 2015. Archived at [perma.cc/KPE9-XMFM](https://perma.cc/KPE9-XMFM)
[^53]: Darryl Veitch and Kanthaiah Vijayalayan. [Network Timing and the 2015 Leap Second](https://opus.lib.uts.edu.au/bitstream/10453/43923/1/LeapSecond_camera.pdf). At *17th International Conference on Passive and Active Measurement* (PAM), April 2016. [doi:10.1007/978-3-319-30505-9\_29](https://doi.org/10.1007/978-3-319-30505-9_29)
[^54]: VMware, Inc. [Timekeeping in VMware Virtual Machines](https://www.vmware.com/docs/vmware_timekeeping). *vmware.com*, October 2008. Archived at [perma.cc/HM5R-T5NF](https://perma.cc/HM5R-T5NF)
[^55]: Victor Yodaiken. [Clock Synchronization in Finance and Beyond](https://www.yodaiken.com/wp-content/uploads/2018/05/financeandbeyond.pdf). *yodaiken.com*, November 2017. Archived at [perma.cc/9XZD-8ZZN](https://perma.cc/9XZD-8ZZN)
[^56]: Mustafa Emre Acer, Emily Stark, Adrienne Porter Felt, Sascha Fahl, Radhika Bhargava, Bhanu Dev, Matt Braithwaite, Ryan Sleevi, and Parisa Tabriz. [Where the Wild Warnings Are: Root Causes of Chrome HTTPS Certificate Errors](https://acmccs.github.io/papers/p1407-acerA.pdf). At *ACM SIGSAC Conference on Computer and Communications Security* (CCS), pages 1407–1420, October 2017. [doi:10.1145/3133956.3134007](https://doi.org/10.1145/3133956.3134007)
[^57]: European Securities and Markets Authority. [MiFID II / MiFIR: Regulatory Technical and Implementing Standards – Annex I](https://www.esma.europa.eu/sites/default/files/library/2015/11/2015-esma-1464_annex_i_-_draft_rts_and_its_on_mifid_ii_and_mifir.pdf). *esma.europa.eu*, Report ESMA/2015/1464, September 2015. Archived at [perma.cc/ZLX9-FGQ3](https://perma.cc/ZLX9-FGQ3)
[^58]: Luke Bigum. [Solving MiFID II Clock Synchronisation With Minimum Spend (Part 1)](https://catach.blogspot.com/2015/11/solving-mifid-ii-clock-synchronisation.html). *catach.blogspot.com*, November 2015. Archived at [perma.cc/4J5W-FNM4](https://perma.cc/4J5W-FNM4)
[^59]: Oleg Obleukhov and Ahmad Byagowi. [How Precision Time Protocol is being deployed at Meta](https://engineering.fb.com/2022/11/21/production-engineering/precision-time-protocol-at-meta/). *engineering.fb.com*, November 2022. Archived at [perma.cc/29G6-UJNW](https://perma.cc/29G6-UJNW)
[^60]: John Wiseman. [gpsjam.org](https://gpsjam.org/), July 2022.
[^61]: Josh Levinson, Julien Ridoux, and Chris Munns. [It’s About Time: Microsecond-Accurate Clocks on Amazon EC2 Instances](https://aws.amazon.com/blogs/compute/its-about-time-microsecond-accurate-clocks-on-amazon-ec2-instances/). *aws.amazon.com*, November 2023. Archived at [perma.cc/56M6-5VMZ](https://perma.cc/56M6-5VMZ)
[^62]: Kyle Kingsbury. [Call Me Maybe: Cassandra](https://aphyr.com/posts/294-call-me-maybe-cassandra/). *aphyr.com*, September 2013. Archived at [perma.cc/4MBR-J96V](https://perma.cc/4MBR-J96V)
[^63]: John Daily. [Clocks Are Bad, or, Welcome to the Wonderful World of Distributed Systems](https://riak.com/clocks-are-bad-or-welcome-to-distributed-systems/). *riak.com*, November 2013. Archived at [perma.cc/4XB5-UCXY](https://perma.cc/4XB5-UCXY)
[^64]: Marc Brooker. [It’s About Time!](https://brooker.co.za/blog/2023/11/27/about-time.html) *brooker.co.za*, November 2023. Archived at [perma.cc/N6YK-DRPA](https://perma.cc/N6YK-DRPA)
[^65]: Kyle Kingsbury. [The Trouble with Timestamps](https://aphyr.com/posts/299-the-trouble-with-timestamps). *aphyr.com*, October 2013. Archived at [perma.cc/W3AM-5VAV](https://perma.cc/W3AM-5VAV)
[^66]: Leslie Lamport. [Time, Clocks, and the Ordering of Events in a Distributed System](https://www.microsoft.com/en-us/research/publication/time-clocks-ordering-events-distributed-system/). *Communications of the ACM*, volume 21, issue 7, pages 558–565, July 1978. [doi:10.1145/359545.359563](https://doi.org/10.1145/359545.359563)
[^67]: Justin Sheehy. [There Is No Now: Problems With Simultaneity in Distributed Systems](https://queue.acm.org/detail.cfm?id=2745385). *ACM Queue*, volume 13, issue 3, pages 36–41, March 2015. [doi:10.1145/2733108](https://doi.org/10.1145/2733108)
[^68]: Murat Demirbas. [Spanner: Google’s Globally-Distributed Database](https://muratbuffalo.blogspot.com/2013/07/spanner-googles-globally-distributed_4.html). *muratbuffalo.blogspot.co.uk*, July 2013. Archived at [perma.cc/6VWR-C9WB](https://perma.cc/6VWR-C9WB)
[^69]: Dahlia Malkhi and Jean-Philippe Martin. [Spanner’s Concurrency Control](https://www.cs.cornell.edu/~ie53/publications/DC-col51-Sep13.pdf). *ACM SIGACT News*, volume 44, issue 3, pages 73–77, September 2013. [doi:10.1145/2527748.2527767](https://doi.org/10.1145/2527748.2527767)
[^70]: Franck Pachot. [Achieving Precise Clock Synchronization on AWS](https://www.yugabyte.com/blog/aws-clock-synchronization/). *yugabyte.com*, December 2024. Archived at [perma.cc/UYM6-RNBS](https://perma.cc/UYM6-RNBS)
[^71]: Spencer Kimball. [Living Without Atomic Clocks: Where CockroachDB and Spanner diverge](https://www.cockroachlabs.com/blog/living-without-atomic-clocks/). *cockroachlabs.com*, January 2022. Archived at [perma.cc/AWZ7-RXFT](https://perma.cc/AWZ7-RXFT)
[^72]: Murat Demirbas. [Use of Time in Distributed Databases (part 4): Synchronized clocks in production databases](https://muratbuffalo.blogspot.com/2025/01/use-of-time-in-distributed-databases.html). *muratbuffalo.blogspot.com*, January 2025. Archived at [perma.cc/9WNX-Q9U3](https://perma.cc/9WNX-Q9U3)
[^73]: Cary G. Gray and David R. Cheriton. [Leases: An Efficient Fault-Tolerant Mechanism for Distributed File Cache Consistency](https://courses.cs.duke.edu/spring11/cps210/papers/p202-gray.pdf). At *12th ACM Symposium on Operating Systems Principles* (SOSP), December 1989. [doi:10.1145/74850.74870](https://doi.org/10.1145/74850.74870)
[^74]: Daniel Sturman, Scott Delap, Max Ross, et al. [Roblox Return to Service](https://corp.roblox.com/newsroom/2022/01/roblox-return-to-service-10-28-10-31-2021). *corp.roblox.com*, January 2022. Archived at [perma.cc/8ALT-WAS4](https://perma.cc/8ALT-WAS4)
[^75]: Todd Lipcon. [Avoiding Full GCs with MemStore-Local Allocation Buffers](https://www.slideshare.net/slideshow/hbase-hug-presentation/7038178). *slideshare.net*, February 2011. Archived at
[^76]: Christopher Clark, Keir Fraser, Steven Hand, Jacob Gorm Hansen, Eric Jul, Christian Limpach, Ian Pratt, and Andrew Warfield. [Live Migration of Virtual Machines](https://www.usenix.org/legacy/publications/library/proceedings/nsdi05/tech/full_papers/clark/clark.pdf). At *2nd USENIX Symposium on Networked Systems Design & Implementation* (NSDI), May 2005.
[^77]: Mike Shaver. [fsyncers and Curveballs](https://web.archive.org/web/20220107141023/http%3A//shaver.off.net/diary/2008/05/25/fsyncers-and-curveballs/). *shaver.off.net*, May 2008. Archived at [archive.org](https://web.archive.org/web/20220107141023/http%3A//shaver.off.net/diary/2008/05/25/fsyncers-and-curveballs/)
[^78]: Zhenyun Zhuang and Cuong Tran. [Eliminating Large JVM GC Pauses Caused by Background IO Traffic](https://engineering.linkedin.com/blog/2016/02/eliminating-large-jvm-gc-pauses-caused-by-background-io-traffic). *engineering.linkedin.com*, February 2016. Archived at [perma.cc/ML2M-X9XT](https://perma.cc/ML2M-X9XT)
[^79]: Martin Thompson. [Java Garbage Collection Distilled](https://mechanical-sympathy.blogspot.com/2013/07/java-garbage-collection-distilled.html). *mechanical-sympathy.blogspot.co.uk*, July 2013. Archived at [perma.cc/DJT3-NQLQ](https://perma.cc/DJT3-NQLQ)
[^80]: David Terei and Amit Levy. [Blade: A Data Center Garbage Collector](https://arxiv.org/pdf/1504.02578). arXiv:1504.02578, April 2015.
[^81]: Martin Maas, Tim Harris, Krste Asanović, and John Kubiatowicz. [Trash Day: Coordinating Garbage Collection in Distributed Systems](https://timharris.uk/papers/2015-hotos.pdf). At *15th USENIX Workshop on Hot Topics in Operating Systems* (HotOS), May 2015.
[^82]: Martin Fowler. [The LMAX Architecture](https://martinfowler.com/articles/lmax.html). *martinfowler.com*, July 2011. Archived at [perma.cc/5AV4-N6RJ](https://perma.cc/5AV4-N6RJ)
[^83]: Joseph Y. Halpern and Yoram Moses. [Knowledge and common knowledge in a distributed environment](https://groups.csail.mit.edu/tds/papers/Halpern/JACM90.pdf). *Journal of the ACM* (JACM), volume 37, issue 3, pages 549–587, July 1990. [doi:10.1145/79147.79161](https://doi.org/10.1145/79147.79161)
[^84]: Chuzhe Tang, Zhaoguo Wang, Xiaodong Zhang, Qianmian Yu, Binyu Zang, Haibing Guan, and Haibo Chen. [Ad Hoc Transactions in Web Applications: The Good, the Bad, and the Ugly](https://ipads.se.sjtu.edu.cn/_media/publications/concerto-sigmod22.pdf). At *ACM International Conference on Management of Data* (SIGMOD), June 2022. [doi:10.1145/3514221.3526120](https://doi.org/10.1145/3514221.3526120)
[^85]: Flavio P. Junqueira and Benjamin Reed. [*ZooKeeper: Distributed Process Coordination*](https://www.oreilly.com/library/view/zookeeper/9781449361297/). O’Reilly Media, 2013. ISBN: 978-1-449-36130-3
[^86]: Enis Söztutar. [HBase and HDFS: Understanding Filesystem Usage in HBase](https://www.slideshare.net/slideshow/hbase-and-hdfs-understanding-filesystem-usage/22990858). At *HBaseCon*, June 2013. Archived at [perma.cc/4DXR-9P88](https://perma.cc/4DXR-9P88)
[^87]: SUSE LLC. [SUSE Linux Enterprise High Availability 15 SP6 Administration Guide, Section 12: Fencing and STONITH](https://documentation.suse.com/sle-ha/15-SP6/html/SLE-HA-all/cha-ha-fencing.html). *documentation.suse.com*, March 2025. Archived at [perma.cc/8LAR-EL9D](https://perma.cc/8LAR-EL9D)
[^88]: Mike Burrows. [The Chubby Lock Service for Loosely-Coupled Distributed Systems](https://research.google/pubs/pub27897/). At *7th USENIX Symposium on Operating System Design and Implementation* (OSDI), November 2006.
[^89]: Kyle Kingsbury. [etcd 3.4.3](https://jepsen.io/analyses/etcd-3.4.3). *jepsen.io*, January 2020. Archived at [perma.cc/2P3Y-MPWU](https://perma.cc/2P3Y-MPWU)
[^90]: Ensar Basri Kahveci. [Distributed Locks are Dead; Long Live Distributed Locks!](https://hazelcast.com/blog/long-live-distributed-locks/) *hazelcast.com*, April 2019. Archived at [perma.cc/7FS5-LDXE](https://perma.cc/7FS5-LDXE)
[^91]: Martin Kleppmann. [How to do distributed locking](https://martin.kleppmann.com/2016/02/08/how-to-do-distributed-locking.html). *martin.kleppmann.com*, February 2016. Archived at [perma.cc/Y24W-YQ5L](https://perma.cc/Y24W-YQ5L)
[^92]: Salvatore Sanfilippo. [Is Redlock safe?](https://antirez.com/news/101) *antirez.com*, February 2016. Archived at [perma.cc/B6GA-9Q6A](https://perma.cc/B6GA-9Q6A)
[^93]: Gunnar Morling. [Leader Election With S3 Conditional Writes](https://www.morling.dev/blog/leader-election-with-s3-conditional-writes/). *www.morling.dev*, August 2024. Archived at [perma.cc/7V2N-J78Y](https://perma.cc/7V2N-J78Y)
[^94]: Leslie Lamport, Robert Shostak, and Marshall Pease. [The Byzantine Generals Problem](https://www.microsoft.com/en-us/research/publication/byzantine-generals-problem/). *ACM Transactions on Programming Languages and Systems* (TOPLAS), volume 4, issue 3, pages 382–401, July 1982. [doi:10.1145/357172.357176](https://doi.org/10.1145/357172.357176)
[^95]: Jim N. Gray. [Notes on Data Base Operating Systems](https://jimgray.azurewebsites.net/papers/dbos.pdf). in *Operating Systems: An Advanced Course*, Lecture Notes in Computer Science, volume 60, edited by R. Bayer, R. M. Graham, and G. Seegmüller, pages 393–481, Springer-Verlag, 1978. ISBN: 978-3-540-08755-7. Archived at [perma.cc/7S9M-2LZU](https://perma.cc/7S9M-2LZU)
[^96]: Brian Palmer. [How Complicated Was the Byzantine Empire?](https://slate.com/news-and-politics/2011/10/the-byzantine-tax-code-how-complicated-was-byzantium-anyway.html) *slate.com*, October 2011. Archived at [perma.cc/AN7X-FL3N](https://perma.cc/AN7X-FL3N)
[^97]: Leslie Lamport. [My Writings](https://lamport.azurewebsites.net/pubs/pubs.html). *lamport.azurewebsites.net*, December 2014. Archived at [perma.cc/5NNM-SQGR](https://perma.cc/5NNM-SQGR)
[^98]: John Rushby. [Bus Architectures for Safety-Critical Embedded Systems](https://www.csl.sri.com/papers/emsoft01/emsoft01.pdf). At *1st International Workshop on Embedded Software* (EMSOFT), October 2001. [doi:10.1007/3-540-45449-7\_22](https://doi.org/10.1007/3-540-45449-7_22)
[^99]: Jake Edge. [ELC: SpaceX Lessons Learned](https://lwn.net/Articles/540368/). *lwn.net*, March 2013. Archived at [perma.cc/AYX8-QP5X](https://perma.cc/AYX8-QP5X)
[^100]: Shehar Bano, Alberto Sonnino, Mustafa Al-Bassam, Sarah Azouvi, Patrick McCorry, Sarah Meiklejohn, and George Danezis. [SoK: Consensus in the Age of Blockchains](https://smeiklej.com/files/aft19a.pdf). At *1st ACM Conference on Advances in Financial Technologies* (AFT), October 2019. [doi:10.1145/3318041.3355458](https://doi.org/10.1145/3318041.3355458)
[^101]: Ezra Feilden, Adi Oltean, and Philip Johnston. [Why we should train AI in space](https://www.starcloud.com/wp). White Paper, *starcloud.com*, September 2024. Archived at [perma.cc/7Y3S-8UB6](https://perma.cc/7Y3S-8UB6)
[^102]: James Mickens. [The Saddest Moment](https://www.usenix.org/system/files/login-logout_1305_mickens.pdf). *USENIX ;login*, May 2013. Archived at [perma.cc/T7BZ-XCFR](https://perma.cc/T7BZ-XCFR)
[^103]: Martin Kleppmann and Heidi Howard. [Byzantine Eventual Consistency and the Fundamental Limits of Peer-to-Peer Databases](https://arxiv.org/abs/2012.00472). *arxiv.org*, December 2020. [doi:10.48550/arXiv.2012.00472](https://doi.org/10.48550/arXiv.2012.00472)
[^104]: Martin Kleppmann. [Making CRDTs Byzantine Fault Tolerant](https://martin.kleppmann.com/papers/bft-crdt-papoc22.pdf). At *9th Workshop on Principles and Practice of Consistency for Distributed Data* (PaPoC), April 2022. [doi:10.1145/3517209.3524042](https://doi.org/10.1145/3517209.3524042)
[^105]: Evan Gilman. [The Discovery of Apache ZooKeeper’s Poison Packet](https://www.pagerduty.com/blog/the-discovery-of-apache-zookeepers-poison-packet/). *pagerduty.com*, May 2015. Archived at [perma.cc/RV6L-Y5CQ](https://perma.cc/RV6L-Y5CQ)
[^106]: Jonathan Stone and Craig Partridge. [When the CRC and TCP Checksum Disagree](https://conferences2.sigcomm.org/sigcomm/2000/conf/paper/sigcomm2000-9-1.pdf). At *ACM Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication* (SIGCOMM), August 2000. [doi:10.1145/347059.347561](https://doi.org/10.1145/347059.347561)
[^107]: Evan Jones. [How Both TCP and Ethernet Checksums Fail](https://www.evanjones.ca/tcp-and-ethernet-checksums-fail.html). *evanjones.ca*, October 2015. Archived at [perma.cc/9T5V-B8X5](https://perma.cc/9T5V-B8X5)
[^108]: Cynthia Dwork, Nancy Lynch, and Larry Stockmeyer. [Consensus in the Presence of Partial Synchrony](https://groups.csail.mit.edu/tds/papers/Lynch/jacm88.pdf). *Journal of the ACM*, volume 35, issue 2, pages 288–323, April 1988. [doi:10.1145/42282.42283](https://doi.org/10.1145/42282.42283)
[^109]: Richard D. Schlichting and Fred B. Schneider. [Fail-stop processors: an approach to designing fault-tolerant computing systems](https://www.cs.cornell.edu/fbs/publications/Fail_Stop.pdf). *ACM Transactions on Computer Systems* (TOCS), volume 1, issue 3, pages 222–238, August 1983. [doi:10.1145/357369.357371](https://doi.org/10.1145/357369.357371)
[^110]: Thanh Do, Mingzhe Hao, Tanakorn Leesatapornwongsa, Tiratat Patana-anake, and Haryadi S. Gunawi. [Limplock: Understanding the Impact of Limpware on Scale-out Cloud Systems](https://ucare.cs.uchicago.edu/pdf/socc13-limplock.pdf). At *4th ACM Symposium on Cloud Computing* (SoCC), October 2013. [doi:10.1145/2523616.2523627](https://doi.org/10.1145/2523616.2523627)
[^111]: Josh Snyder and Joseph Lynch. [Garbage collecting unhealthy JVMs, a proactive approach](https://netflixtechblog.medium.com/introducing-jvmquake-ec944c60ba70). Netflix Technology Blog, *netflixtechblog.medium.com*, November 2019. Archived at [perma.cc/8BTA-N3YB](https://perma.cc/8BTA-N3YB)
[^112]: Haryadi S. Gunawi, Riza O. Suminto, Russell Sears, Casey Golliher, Swaminathan Sundararaman, Xing Lin, Tim Emami, Weiguang Sheng, Nematollah Bidokhti, Caitie McCaffrey, Gary Grider, Parks M. Fields, Kevin Harms, Robert B. Ross, Andree Jacobson, Robert Ricci, Kirk Webb, Peter Alvaro, H. Birali Runesha, Mingzhe Hao, and Huaicheng Li. [Fail-Slow at Scale: Evidence of Hardware Performance Faults in Large Production Systems](https://www.usenix.org/system/files/conference/fast18/fast18-gunawi.pdf). At *16th USENIX Conference on File and Storage Technologies*, February 2018.
[^113]: Peng Huang, Chuanxiong Guo, Lidong Zhou, Jacob R. Lorch, Yingnong Dang, Murali Chintalapati, and Randolph Yao. [Gray Failure: The Achilles’ Heel of Cloud-Scale Systems](https://www.microsoft.com/en-us/research/wp-content/uploads/2017/06/paper-1.pdf). At *16th Workshop on Hot Topics in Operating Systems* (HotOS), May 2017. [doi:10.1145/3102980.3103005](https://doi.org/10.1145/3102980.3103005)
[^114]: Chang Lou, Peng Huang, and Scott Smith. [Understanding, Detecting and Localizing Partial Failures in Large System Software](https://www.usenix.org/conference/nsdi20/presentation/lou). At *17th USENIX Symposium on Networked Systems Design and Implementation* (NSDI), February 2020.
[^115]: Peter Bailis and Ali Ghodsi. [Eventual Consistency Today: Limitations, Extensions, and Beyond](https://queue.acm.org/detail.cfm?id=2462076). *ACM Queue*, volume 11, issue 3, pages 55–63, March 2013. [doi:10.1145/2460276.2462076](https://doi.org/10.1145/2460276.2462076)
[^116]: Bowen Alpern and Fred B. Schneider. [Defining Liveness](https://www.cs.cornell.edu/fbs/publications/DefLiveness.pdf). *Information Processing Letters*, volume 21, issue 4, pages 181–185, October 1985. [doi:10.1016/0020-0190(85)90056-0](https://doi.org/10.1016/0020-0190%2885%2990056-0)
[^117]: Flavio P. Junqueira. [Dude, Where’s My Metadata?](https://fpj.me/2015/05/28/dude-wheres-my-metadata/) *fpj.me*, May 2015. Archived at [perma.cc/D2EU-Y9S5](https://perma.cc/D2EU-Y9S5)
[^118]: Scott Sanders. [January 28th Incident Report](https://github.com/blog/2106-january-28th-incident-report). *github.com*, February 2016. Archived at [perma.cc/5GZR-88TV](https://perma.cc/5GZR-88TV)
[^119]: Jay Kreps. [A Few Notes on Kafka and Jepsen](https://blog.empathybox.com/post/62279088548/a-few-notes-on-kafka-and-jepsen). *blog.empathybox.com*, September 2013. Archived at [perma.cc/XJ5C-F583](https://perma.cc/XJ5C-F583)
[^120]: Marc Brooker and Ankush Desai. [Systems Correctness Practices at AWS](https://dl.acm.org/doi/pdf/10.1145/3712057). *ACM Queue*, volume 22, issue 6, November/December 2024. [doi:10.1145/3712057](https://doi.org/10.1145/3712057)
[^121]: Andrey Satarin. [Testing Distributed Systems: Curated list of resources on testing distributed systems](https://asatarin.github.io/testing-distributed-systems/). *asatarin.github.io*. Archived at [perma.cc/U5V8-XP24](https://perma.cc/U5V8-XP24)
[^122]: Jack Vanlightly. [Verifying Kafka transactions - Diary entry 2 - Writing an initial TLA+ spec](https://jack-vanlightly.com/analyses/2024/12/3/verifying-kafka-transactions-diary-entry-2-writing-an-initial-tla-spec). *jack-vanlightly.com*, December 2024. Archived at [perma.cc/NSQ8-MQ5N](https://perma.cc/NSQ8-MQ5N)
[^123]: Siddon Tang. [From Chaos to Order — Tools and Techniques for Testing TiDB, A Distributed NewSQL Database](https://www.pingcap.com/blog/chaos-practice-in-tidb/). *pingcap.com*, April 2018. Archived at [perma.cc/5EJB-R29F](https://perma.cc/5EJB-R29F)
[^124]: Nathan VanBenschoten. [Parallel Commits: An atomic commit protocol for globally distributed transactions](https://www.cockroachlabs.com/blog/parallel-commits/). *cockroachlabs.com*, November 2019. Archived at [perma.cc/5FZ7-QK6J](https://perma.cc/5FZ7-QK6J)
[^125]: Jack Vanlightly. [Paper: VR Revisited - State Transfer (part 3)](https://jack-vanlightly.com/analyses/2022/12/28/paper-vr-revisited-state-transfer-part-3). *jack-vanlightly.com*, December 2022. Archived at [perma.cc/KNK3-K6WS](https://perma.cc/KNK3-K6WS)
[^126]: Hillel Wayne. [What if the spec doesn’t match the code?](https://buttondown.com/hillelwayne/archive/what-if-the-spec-doesnt-match-the-code/) *buttondown.com*, March 2024. Archived at [perma.cc/8HEZ-KHER](https://perma.cc/8HEZ-KHER)
[^127]: Lingzhi Ouyang, Xudong Sun, Ruize Tang, Yu Huang, Madhav Jivrajani, Xiaoxing Ma, Tianyin Xu. [Multi-Grained Specifications for Distributed System Model Checking and Verification](https://arxiv.org/abs/2409.14301). At *20th European Conference on Computer Systems* (EuroSys), March 2025. [doi:10.1145/3689031.3696069](https://doi.org/10.1145/3689031.3696069)
[^128]: Yury Izrailevsky and Ariel Tseitlin. [The Netflix Simian Army](https://netflixtechblog.com/the-netflix-simian-army-16e57fbab116). *netflixtechblog.com*, July 2011. Archived at [perma.cc/M3NY-FJW6](https://perma.cc/M3NY-FJW6)
[^129]: Kyle Kingsbury. [Jepsen: On the perils of network partitions](https://aphyr.com/posts/281-jepsen-on-the-perils-of-network-partitions). *aphyr.com*, May 2013. Archived at [perma.cc/W98G-6HQP](https://perma.cc/W98G-6HQP)
[^130]: Kyle Kingsbury. [Jepsen Analyses](https://jepsen.io/analyses). *jepsen.io*, 2024. Archived at [perma.cc/8LDN-D2T8](https://perma.cc/8LDN-D2T8)
[^131]: Rupak Majumdar and Filip Niksic. [Why is random testing effective for partition tolerance bugs?](https://dl.acm.org/doi/pdf/10.1145/3158134) *Proceedings of the ACM on Programming Languages* (PACMPL), volume 2, issue POPL, article no. 46, December 2017. [doi:10.1145/3158134](https://doi.org/10.1145/3158134)
[^132]: FoundationDB project authors. [Simulation and Testing](https://apple.github.io/foundationdb/testing.html). *apple.github.io*. Archived at [perma.cc/NQ3L-PM4C](https://perma.cc/NQ3L-PM4C)
[^133]: Alex Kladov. [Simulation Testing For Liveness](https://tigerbeetle.com/blog/2023-07-06-simulation-testing-for-liveness/). *tigerbeetle.com*, July 2023. Archived at [perma.cc/RKD4-HGCR](https://perma.cc/RKD4-HGCR)
[^134]: Alfonso Subiotto Marqués. [(Mostly) Deterministic Simulation Testing in Go](https://www.polarsignals.com/blog/posts/2024/05/28/mostly-dst-in-go). *polarsignals.com*, May 2024. Archived at [perma.cc/ULD6-TSA4](https://perma.cc/ULD6-TSA4)
================================================
FILE: content/tw/colophon.md
================================================
---
title: Colophon
weight: 600
breadcrumbs: false
---
{{< callout type="warning" >}}
This page is from the first edition of the book; the second-edition version is not yet available.
{{< /callout >}}
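## About the Authors
**Martin Kleppmann** is an Associate Professor at the University of Cambridge, UK, where he teaches distributed systems and cryptographic protocols. The first edition of *Designing Data-Intensive Applications*, published in 2017, established him as an authority on data systems, and his research on distributed systems has helped drive the local-first software movement. Previously he was a software engineer and entrepreneur at internet companies including LinkedIn and Rapportive, where he worked on large-scale data infrastructure.
**Chris Riccomini** is a software engineer, startup investor, and author with more than 15 years of experience at PayPal, LinkedIn, and WePay. He runs Materialized View Capital, which focuses on investing in infrastructure startups. He is also a co-creator of Apache Samza and SlateDB, and coauthor of *The Missing README: A Guide for the New Software Engineer*.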

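## About the Translator
[**Feng Ruohang (馮若航)**](https://vonng.com), known online as [@Vonng](https://github.com/Vonng).
PostgreSQL expert, database veteran, and cloud computing maverick.
Author and founder of the PostgreSQL distribution [**Pigsty**](https://pgsty.com).
Architect, DBA, and full-stack engineer at TanTan, Alibaba, and Apple.
Independent open source contributor, [GitStar Ranking 585](https://gitstar-ranking.com/Vonng), [Top 20 most active in China](https://committers.top/china).
Chinese translator of [DDIA](https://ddia.pigsty.io) / [PG Internal](https://pgint.vonng.com); database and cloud computing KOL.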
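## Colophon
The animal on the cover of *Designing Data-Intensive Applications* is an **Indian wild boar (Sus scrofa cristatus)**, a subspecies of wild boar found in India, Myanmar, Nepal, Sri Lanka, and Thailand. Unlike European boars, they have higher back bristles, no woolly undercoat, and a larger, straighter skull.
The Indian wild boar has a coat of gray or black hair, with short, stiff bristles along the spine. Males have protruding canine teeth (called tushes) that they use to fight rivals or fend off predators. Males are larger than females; the species averages 33-35 inches tall at the shoulder and 200-300 pounds in weight. Their natural predators include bears, tigers, and various big cats.
These animals are nocturnal and omnivorous: they eat a wide variety of things, including roots, insects, carrion, nuts, berries, and small animals. Wild boars are well known for rooting up crops, which causes a great deal of destruction and earns them the enmity of farmers. They need to take in 4,000 to 4,500 calories a day. Boars have a well-developed sense of smell, which helps them find underground plants and burrowing animals. Their eyesight, however, is poor.
Wild boars have long held significance in human culture. In Hindu lore, the boar is an avatar of the god Vishnu. On ancient Greek funerary monuments, it was a symbol of a gallant loser (in contrast to the victorious lion). Because of its aggression, it was depicted on the armor and weapons of Scandinavian, Germanic, and Anglo-Saxon warriors. In the Chinese zodiac, it symbolizes determination and impetuosity.
Many of the animals on O'Reilly covers are endangered; all of them are important to the world. To learn more about how you can help, go to animals.oreilly.com.
The cover image is from Shaw's Zoology. The cover fonts are URW Typewriter and Guardian Sans. The text font is Adobe Minion Pro; the font in diagrams is Adobe Myriad Pro; the heading font is Adobe Myriad Condensed; and the code font is Dalton Maag's Ubuntu Mono.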
================================================
FILE: content/tw/contrib.md
================================================
---
title: Contributors
weight: 800
breadcrumbs: false
---
## Translator
[**Feng Ruohang (馮若航)**](https://vonng.com), known online as [@Vonng](https://github.com/Vonng).
PostgreSQL expert, database veteran, and cloud computing maverick.
Author and founder of [**Pigsty**](https://pgsty.com).
Architect, DBA, and full-stack engineer at TanTan, Alibaba, and Apple.
Independent open source contributor, [GitStar Ranking 585](https://gitstar-ranking.com/Vonng), [Top 20 most active in China](https://committers.top/china).
Chinese translator of [DDIA](https://ddia.pigsty.io) / [PG Internal](https://pgint.vonng.com); WeChat official account: 老馮雲數; database KOL.
## Proofreading and Maintenance
YinGang [@yingang](https://github.com/yingang) proofread the entire book and continues to maintain it.
## Traditional Chinese Edition
[Traditional Chinese](/tw) edition **maintained** by [@afunTW](https://github.com/afunTW)
## Contribution List
[GitHub contributors list](https://github.com/Vonng/ddia/graphs/contributors)
0. Full-text proofreading by [@yingang](https://github.com/Vonng/ddia/commits?author=yingang)
1. [Corrections to the initial translation of the preface](https://github.com/Vonng/ddia/commit/afb5edab55c62ed23474149f229677e3b42dfc2c) by [@seagullbird](https://github.com/Vonng/ddia/commits?author=seagullbird)
2. [Grammar and punctuation corrections for Chapter 1](https://github.com/Vonng/ddia/commit/973b12cd8f8fcdf4852f1eb1649ddd9d187e3644) by [@nevertiree](https://github.com/Vonng/ddia/commits?author=nevertiree)
3. [Partial corrections to Chapter 6](https://github.com/Vonng/ddia/commit/d4eb0852c0ec1e93c8aacc496c80b915bb1e6d48) and the [initial translation of Chapter 10](https://github.com/Vonng/ddia/commit/9de8dbd1bfe6fbb03b3bf6c1a1aa2291aed2490e) by [@MuAlex](https://github.com/Vonng/ddia/commits?author=MuAlex)
4. Preface to [Part I](/tw/part-i) and corrections to [ch2](/tw/ch2) by [@jiajiadebug](https://github.com/Vonng/ddia/commits?author=jiajiadebug)
5. [Glossary](/tw/glossary) and the wild boar section of the [colophon](/tw/colophon) by [@Chowss](https://github.com/Vonng/ddia/commits?author=Chowss)
6. [Traditional Chinese](https://github.com/Vonng/ddia/pulls) edition and conversion script by [@afunTW](https://github.com/afunTW)
7. Translation fixes in many places by [@songzhibin97](https://github.com/Vonng/ddia/commits?author=songzhibin97) [@MamaShip](https://github.com/Vonng/ddia/commits?author=MamaShip) [@FangYuan33](https://github.com/Vonng/ddia/commits?author=FangYuan33)
Thanks to everyone who offered suggestions and made contributions. You can find all contributions in the [issue list](https://github.com/Vonng/ddia/issues) and the [PR list](https://github.com/Vonng/ddia/pulls):
| ISSUE & Pull Requests | USER | Title |
|-------------------------------------------------|------------------------------------------------------------|----------------------------------------------------------------|
| [359](https://github.com/Vonng/ddia/pull/359) | [@c25423](https://github.com/c25423) | ch10: fix a typo |
| [358](https://github.com/Vonng/ddia/pull/358) | [@lewiszlw](https://github.com/lewiszlw) | ch4: fix a typo |
| [356](https://github.com/Vonng/ddia/pull/356) | [@lewiszlw](https://github.com/lewiszlw) | ch2: fix a punctuation error |
| [355](https://github.com/Vonng/ddia/pull/355) | [@DuroyGeorge](https://github.com/DuroyGeorge) | ch12: fix a formatting error |
| [354](https://github.com/Vonng/ddia/pull/354) | [@justlorain](https://github.com/justlorain) | ch7: fix a reference link |
| [353](https://github.com/Vonng/ddia/pull/353) | [@fantasyczl](https://github.com/fantasyczl) | ch3&9: fix two citation errors |
| [352](https://github.com/Vonng/ddia/pull/352) | [@fantasyczl](https://github.com/fantasyczl) | Add support for EPUB output |
| [349](https://github.com/Vonng/ddia/pull/349) | [@xiyihan0](https://github.com/xiyihan0) | ch1: fix a formatting error |
| [348](https://github.com/Vonng/ddia/pull/348) | [@omegaatt36](https://github.com/omegaatt36) | ch3: fix an image link |
| [346](https://github.com/Vonng/ddia/issues/346) | [@Vermouth1995](https://github.com/Vermouth1995) | ch1: improve a translation |
| [343](https://github.com/Vonng/ddia/pull/343) | [@kehao-chen](https://github.com/kehao-chen) | ch10: improve a translation |
| [341](https://github.com/Vonng/ddia/pull/341) | [@YKIsTheBest](https://github.com/YKIsTheBest) | ch3: improve two translations |
| [340](https://github.com/Vonng/ddia/pull/340) | [@YKIsTheBest](https://github.com/YKIsTheBest) | ch2: improve several translations |
| [338](https://github.com/Vonng/ddia/pull/338) | [@YKIsTheBest](https://github.com/YKIsTheBest) | ch1: improve a translation |
| [335](https://github.com/Vonng/ddia/pull/335) | [@kimi0230](https://github.com/kimi0230) | Fix a Traditional Chinese error |
| [334](https://github.com/Vonng/ddia/pull/334) | [@soulrrrrr](https://github.com/soulrrrrr) | ch2: fix a Traditional Chinese error |
| [332](https://github.com/Vonng/ddia/pull/332) | [@justlorain](https://github.com/justlorain) | ch5: fix a translation error |
| [331](https://github.com/Vonng/ddia/pull/331) | [@Lyianu](https://github.com/Lyianu) | ch9: correct several typos |
| [330](https://github.com/Vonng/ddia/pull/330) | [@Lyianu](https://github.com/Lyianu) | ch7: improve a translation |
| [329](https://github.com/Vonng/ddia/issues/329) | [@Lyianu](https://github.com/Lyianu) | ch6: report a translation error |
| [328](https://github.com/Vonng/ddia/pull/328) | [@justlorain](https://github.com/justlorain) | ch4: fill in an omitted translation |
| [326](https://github.com/Vonng/ddia/pull/326) | [@liangGTY](https://github.com/liangGTY) | ch1: improve a translation |
| [323](https://github.com/Vonng/ddia/pull/323) | [@marvin263](https://github.com/marvin263) | ch5: improve a translation |
| [322](https://github.com/Vonng/ddia/pull/322) | [@marvin263](https://github.com/marvin263) | ch8: improve a translation |
| [304](https://github.com/Vonng/ddia/pull/304) | [@spike014](https://github.com/spike014) | ch11: improve a translation |
| [298](https://github.com/Vonng/ddia/pull/298) | [@Makonike](https://github.com/Makonike) | ch11&12: fix two errors |
| [284](https://github.com/Vonng/ddia/pull/284) | [@WAangzE](https://github.com/WAangzE) | ch4: correct a list error |
| [283](https://github.com/Vonng/ddia/pull/283) | [@WAangzE](https://github.com/WAangzE) | ch3: correct a wrong character |
| [282](https://github.com/Vonng/ddia/pull/282) | [@WAangzE](https://github.com/WAangzE) | ch2: correct a formula issue |
| [281](https://github.com/Vonng/ddia/pull/281) | [@lyuxi99](https://github.com/lyuxi99) | Correct several broken internal links |
| [280](https://github.com/Vonng/ddia/pull/280) | [@lyuxi99](https://github.com/lyuxi99) | ch9: correct broken internal links |
| [279](https://github.com/Vonng/ddia/issues/279) | [@codexvn](https://github.com/codexvn) | ch9: report a formula rendering problem on GitHub Pages |
| [278](https://github.com/Vonng/ddia/pull/278) | [@LJlkdskdjflsa](https://github.com/LJlkdskdjflsa) | Found a mistranslation in the Traditional Chinese version |
| [275](https://github.com/Vonng/ddia/pull/275) | [@117503445](https://github.com/117503445) | Correct the LICENSE link |
| [274](https://github.com/Vonng/ddia/pull/274) | [@uncle-lv](https://github.com/uncle-lv) | ch7: fix a wrong character |
| [273](https://github.com/Vonng/ddia/pull/273) | [@Sdot-Python](https://github.com/Sdot-Python) | ch7: unify the translation of write skew |
| [271](https://github.com/Vonng/ddia/pull/271) | [@Makonike](https://github.com/Makonike) | ch6: unify the translation of rebalancing |
| [270](https://github.com/Vonng/ddia/pull/270) | [@Ynjxsjmh](https://github.com/Ynjxsjmh) | ch7: fix inconsistent translations |
| [263](https://github.com/Vonng/ddia/pull/263) | [@zydmayday](https://github.com/zydmayday) | ch5: fix a duplicated word in the translation |
| [260](https://github.com/Vonng/ddia/pull/260) | [@haifeiWu](https://github.com/haifeiWu) | ch4: fix some inaccurate translations |
| [258](https://github.com/Vonng/ddia/pull/258) | [@bestgrc](https://github.com/bestgrc) | ch3: fix a translation error |
| [257](https://github.com/Vonng/ddia/pull/257) | [@UnderSam](https://github.com/UnderSam) | ch8: fix a typo |
| [256](https://github.com/Vonng/ddia/pull/256) | [@AlphaWang](https://github.com/AlphaWang) | ch7: fix several poor translations related to “serializable” |
| [255](https://github.com/Vonng/ddia/pull/255) | [@AlphaWang](https://github.com/AlphaWang) | ch7: fix several poor translations related to “repeatable read” |
| [253](https://github.com/Vonng/ddia/pull/253) | [@AlphaWang](https://github.com/AlphaWang) | ch7: fix several poor translations related to “read committed” |
| [246](https://github.com/Vonng/ddia/pull/246) | [@derekwu0101](https://github.com/derekwu0101) | ch3: fix a Traditional Chinese conversion error |
| [245](https://github.com/Vonng/ddia/pull/245) | [@skyran1278](https://github.com/skyran1278) | ch12: fix a Traditional Chinese conversion error |
| [244](https://github.com/Vonng/ddia/pull/244) | [@Axlgrep](https://github.com/Axlgrep) | ch9: fix an awkward translation |
| [242](https://github.com/Vonng/ddia/pull/242) | [@lynkeib](https://github.com/lynkeib) | ch9: fix an awkward translation |
| [241](https://github.com/Vonng/ddia/pull/241) | [@lynkeib](https://github.com/lynkeib) | ch8: fix incorrect formula formatting |
| [240](https://github.com/Vonng/ddia/pull/240) | [@8da2k](https://github.com/8da2k) | ch9: fix an awkward translation |
| [239](https://github.com/Vonng/ddia/pull/239) | [@BeBraveBeCurious](https://github.com/BeBraveBeCurious) | ch7: fix inconsistent translations |
| [237](https://github.com/Vonng/ddia/pull/237) | [@zhangnew](https://github.com/zhangnew) | ch3: fix a broken image link |
| [229](https://github.com/Vonng/ddia/pull/229) | [@lis186](https://github.com/lis186) | Report a Traditional Chinese conversion error: 複雜 |
| [226](https://github.com/Vonng/ddia/pull/226) | [@chroming](https://github.com/chroming) | ch1: fix the chapter name in the navigation bar |
| [220](https://github.com/Vonng/ddia/pull/220) | [@skyran1278](https://github.com/skyran1278) | ch9: fix the Traditional Chinese translation of “linearizable” |
| [194](https://github.com/Vonng/ddia/pull/194) | [@BeBraveBeCurious](https://github.com/BeBraveBeCurious) | ch4: fix a mistranslation |
| [193](https://github.com/Vonng/ddia/pull/193) | [@BeBraveBeCurious](https://github.com/BeBraveBeCurious) | ch4: improve the translation |
| [192](https://github.com/Vonng/ddia/pull/192) | [@BeBraveBeCurious](https://github.com/BeBraveBeCurious) | ch4: fix inconsistent and awkward translations |
| [190](https://github.com/Vonng/ddia/pull/190) | [@Pcrab](https://github.com/Pcrab) | ch1: fix an inaccurate translation |
| [187](https://github.com/Vonng/ddia/pull/187) | [@narojay](https://github.com/narojay) | ch9: fix a stiff translation |
| [186](https://github.com/Vonng/ddia/pull/186) | [@narojay](https://github.com/narojay) | ch8: fix a wrong character |
| [185](https://github.com/Vonng/ddia/issues/185) | [@8da2k](https://github.com/8da2k) | Report a problem with subheading anchor links |
| [184](https://github.com/Vonng/ddia/pull/184) | [@DavidZhiXing](https://github.com/DavidZhiXing) | ch10: fix a dead URL |
| [183](https://github.com/Vonng/ddia/pull/183) | [@OneSizeFitsQuorum](https://github.com/OneSizeFitsQuorum) | ch8: fix a wrong character |
| [182](https://github.com/Vonng/ddia/issues/182) | [@lroolle](https://github.com/lroolle) | Suggest a docsify theme style |
| [181](https://github.com/Vonng/ddia/pull/181) | [@YunfengGao](https://github.com/YunfengGao) | ch2: fix a translation error |
| [180](https://github.com/Vonng/ddia/pull/180) | [@skyran1278](https://github.com/skyran1278) | ch3: report a Traditional Chinese conversion error |
| [177](https://github.com/Vonng/ddia/pull/177) | [@exzhawk](https://github.com/exzhawk) | Support formula rendering on GitHub Pages |
| [176](https://github.com/Vonng/ddia/pull/176) | [@haifeiWu](https://github.com/haifeiWu) | ch2: correct translations related to the Semantic Web |
| [175](https://github.com/Vonng/ddia/pull/175) | [@cwr31](https://github.com/cwr31) | ch7: correct translations related to invariants |
| [174](https://github.com/Vonng/ddia/pull/174) | [@BeBraveBeCurious](https://github.com/BeBraveBeCurious) | README & preface: correct improper Chinese wording and punctuation |
| [173](https://github.com/Vonng/ddia/pull/173) | [@ZvanYang](https://github.com/ZvanYang) | ch12: fix an incomplete translation |
| [171](https://github.com/Vonng/ddia/pull/171) | [@ZvanYang](https://github.com/ZvanYang) | ch12: fix duplicated translated text |
| [169](https://github.com/Vonng/ddia/pull/169) | [@ZvanYang](https://github.com/ZvanYang) | ch12: correct a somewhat awkward translation |
| [166](https://github.com/Vonng/ddia/pull/166) | [@bp4m4h94](https://github.com/bp4m4h94) | ch1: found an incorrect reference index |
| [164](https://github.com/Vonng/ddia/pull/164) | [@DragonDriver](https://github.com/DragonDriver) | preface: correct wrong punctuation |
| [163](https://github.com/Vonng/ddia/pull/163) | [@llmmddCoder](https://github.com/llmmddCoder) | ch1: correct a wrong character |
| [160](https://github.com/Vonng/ddia/pull/160) | [@Zhayhp](https://github.com/Zhayhp) | ch2: suggest translating network model as 網狀模型 |
| [159](https://github.com/Vonng/ddia/pull/159) | [@1ess](https://github.com/1ess) | ch4: correct a wrong character |
| [157](https://github.com/Vonng/ddia/pull/157) | [@ZvanYang](https://github.com/ZvanYang) | ch7: correct a somewhat awkward translation |
| [155](https://github.com/Vonng/ddia/pull/155) | [@ZvanYang](https://github.com/ZvanYang) | ch7: correct a somewhat awkward translation |
| [153](https://github.com/Vonng/ddia/pull/153) | [@DavidZhiXing](https://github.com/DavidZhiXing) | ch9: fix a wrong character in a thumbnail |
| [152](https://github.com/Vonng/ddia/pull/152) | [@ZvanYang](https://github.com/ZvanYang) | ch7: 除重 -> 去重 (“deduplication”) |
| [151](https://github.com/Vonng/ddia/pull/151) | [@ZvanYang](https://github.com/ZvanYang) | ch5: revise translations related to sibling |
| [147](https://github.com/Vonng/ddia/pull/147) | [@ZvanYang](https://github.com/ZvanYang) | ch5: correct an inaccurate translation |
| [145](https://github.com/Vonng/ddia/pull/145) | [@Hookey](https://github.com/Hookey) | Identified mishandled cases in the current Simplified-to-Traditional conversion; worked around for now in the conversion script |
| [144](https://github.com/Vonng/ddia/issues/144) | [@secret4233](https://github.com/secret4233) | ch7: leave `next-key locking` untranslated |
| [143](https://github.com/Vonng/ddia/issues/143) | [@imcheney](https://github.com/imcheney) | ch3: update leftover machine-translated paragraphs |
| [142](https://github.com/Vonng/ddia/issues/142) | [@XIJINIAN](https://github.com/XIJINIAN) | Suggest removing tab characters at the start of paragraphs |
| [141](https://github.com/Vonng/ddia/issues/141) | [@Flyraty](https://github.com/Flyraty) | ch5: found a misformatted chapter reference |
| [140](https://github.com/Vonng/ddia/pull/140) | [@Bowser1704](https://github.com/Bowser1704) | ch5: fix several awkward translations in the chapter Summary |
| [139](https://github.com/Vonng/ddia/pull/139) | [@Bowser1704](https://github.com/Bowser1704) | ch2&ch3: fix several awkward or incorrect translations |
| [137](https://github.com/Vonng/ddia/pull/137) | [@fuxuemingzhu](https://github.com/fuxuemingzhu) | ch5&ch6: improve several awkward or incorrect translations |
| [134](https://github.com/Vonng/ddia/pull/134) | [@fuxuemingzhu](https://github.com/fuxuemingzhu) | ch4: improve several awkward or incorrect translations |
| [133](https://github.com/Vonng/ddia/pull/133) | [@fuxuemingzhu](https://github.com/fuxuemingzhu) | ch3: improve several incorrect or awkward translations |
| [132](https://github.com/Vonng/ddia/pull/132) | [@fuxuemingzhu](https://github.com/fuxuemingzhu) | ch3: improve a translation prone to ambiguity |
| [131](https://github.com/Vonng/ddia/pull/131) | [@rwwg4](https://github.com/rwwg4) | ch6: fix two mistranslations |
| [129](https://github.com/Vonng/ddia/pull/129) | [@anaer](https://github.com/anaer) | ch4: fix two emphasized passages and four code variable names |
| [128](https://github.com/Vonng/ddia/pull/128) | [@meilin96](https://github.com/meilin96) | ch5: fix an incorrect reference |
| [126](https://github.com/Vonng/ddia/pull/126) | [@cwr31](https://github.com/cwr31) | ch10: fix a mistranslation (功能 -> 函式, “function”) |
| [125](https://github.com/Vonng/ddia/pull/125) | [@dch1228](https://github.com/dch1228) | ch2: improve the translation of how best (如何以最佳方式) |
| [123](https://github.com/Vonng/ddia/pull/123) | [@yingang](https://github.com/yingang) | translation updates (chapter 9, TOC in readme, glossary, etc.) |
| [121](https://github.com/Vonng/ddia/pull/121) | [@yingang](https://github.com/yingang) | translation updates (chapter 5 to chapter 8) |
| [120](https://github.com/Vonng/ddia/pull/120) | [@jiong-han](https://github.com/jiong-han) | Typo fix: 呲之以鼻 -> 嗤之以鼻 |
| [119](https://github.com/Vonng/ddia/pull/119) | [@cclauss](https://github.com/cclauss) | Streamline file operations in convert() |
| [118](https://github.com/Vonng/ddia/pull/118) | [@yingang](https://github.com/yingang) | translation updates (chapter 2 to chapter 4) |
| [117](https://github.com/Vonng/ddia/pull/117) | [@feeeei](https://github.com/feeeei) | Unify the heading format of every chapter |
| [115](https://github.com/Vonng/ddia/pull/115) | [@NageNalock](https://github.com/NageNalock) | Chapter 7 sentence fix: repeated words |
| [114](https://github.com/Vonng/ddia/pull/114) | [@Sunt-ing](https://github.com/Sunt-ing) | Update README.md: correct the book name |
| [113](https://github.com/Vonng/ddia/pull/113) | [@lpxxn](https://github.com/lpxxn) | Revise wording |
| [112](https://github.com/Vonng/ddia/pull/112) | [@ibyte2011](https://github.com/ibyte2011) | Update ch9.md |
| [110](https://github.com/Vonng/ddia/pull/110) | [@lpxxn](https://github.com/lpxxn) | Reading data that has already been written |
| [107](https://github.com/Vonng/ddia/pull/107) | [@abbychau](https://github.com/abbychau) | Monotonic clocks, and “a good death or a wretched life” |
| [106](https://github.com/Vonng/ddia/pull/106) | [@enochii](https://github.com/enochii) | typo in ch2: fix braces typo |
| [105](https://github.com/Vonng/ddia/pull/105) | [@LiminCode](https://github.com/LiminCode) | Chronicle translation error |
| [104](https://github.com/Vonng/ddia/pull/104) | [@Sunt-ing](https://github.com/Sunt-ing) | several advice for better translation |
| [103](https://github.com/Vonng/ddia/pull/103) | [@Sunt-ing](https://github.com/Sunt-ing) | typo in ch4: should be 完成 rather than 完全 |
| [102](https://github.com/Vonng/ddia/pull/102) | [@Sunt-ing](https://github.com/Sunt-ing) | ch4: better-translation: 扼殺 → 破壞 |
| [101](https://github.com/Vonng/ddia/pull/101) | [@Sunt-ing](https://github.com/Sunt-ing) | typo in Ch4: should be "改變" rathr than "蓋面" |
| [100](https://github.com/Vonng/ddia/pull/100) | [@LiminCode](https://github.com/LiminCode) | fix missing translation |
| [99 ](https://github.com/Vonng/ddia/pull/99) | [@mrdrivingduck](https://github.com/mrdrivingduck) | ch6: fix the word rebalancing |
| [98 ](https://github.com/Vonng/ddia/pull/98) | [@jacklightChen](https://github.com/jacklightChen) | fix ch7.md: fix wrong references |
| [97 ](https://github.com/Vonng/ddia/pull/97) | [@jenac](https://github.com/jenac) | 96 |
| [96 ](https://github.com/Vonng/ddia/pull/96) | [@PragmaTwice](https://github.com/PragmaTwice) | ch2: fix typo about 'may or may not be' |
| [95 ](https://github.com/Vonng/ddia/pull/95) | [@EvanMu96](https://github.com/EvanMu96) | fix translation of "the battle cry" in ch5 |
| [94 ](https://github.com/Vonng/ddia/pull/94) | [@kemingy](https://github.com/kemingy) | ch6: fix markdown and punctuations |
| [93 ](https://github.com/Vonng/ddia/pull/93) | [@kemingy](https://github.com/kemingy) | ch5: fix markdown and some typos |
| [92 ](https://github.com/Vonng/ddia/pull/92) | [@Gilbert1024](https://github.com/Gilbert1024) | Merge pull request #1 from Vonng/master |
| [88 ](https://github.com/Vonng/ddia/pull/88) | [@kemingy](https://github.com/kemingy) | fix typo for ch1, ch2, ch3, ch4 |
| [87 ](https://github.com/Vonng/ddia/pull/87) | [@wynn5a](https://github.com/wynn5a) | Update ch3.md |
| [86 ](https://github.com/Vonng/ddia/pull/86) | [@northmorn](https://github.com/northmorn) | Update ch1.md |
| [85 ](https://github.com/Vonng/ddia/pull/85) | [@sunbuhui](https://github.com/sunbuhui) | fix ch2.md: fix ch2 ambiguous translation |
| [84 ](https://github.com/Vonng/ddia/pull/84) | [@ganler](https://github.com/ganler) | Fix translation: use up |
| [83 ](https://github.com/Vonng/ddia/pull/83) | [@afunTW](https://github.com/afunTW) | Using OpenCC to convert from zh-cn to zh-tw |
| [82 ](https://github.com/Vonng/ddia/pull/82) | [@kangni](https://github.com/kangni) | fix gitbook url |
| [78 ](https://github.com/Vonng/ddia/pull/78) | [@hanyu2](https://github.com/hanyu2) | Fix unappropriated translation |
| [77 ](https://github.com/Vonng/ddia/pull/77) | [@Ozarklake](https://github.com/Ozarklake) | fix typo |
| [75 ](https://github.com/Vonng/ddia/pull/75) | [@2997ms](https://github.com/2997ms) | Fix typo |
| [74 ](https://github.com/Vonng/ddia/pull/74) | [@2997ms](https://github.com/2997ms) | Update ch9.md |
| [70 ](https://github.com/Vonng/ddia/pull/70) | [@2997ms](https://github.com/2997ms) | Update ch7.md |
| [67 ](https://github.com/Vonng/ddia/pull/67) | [@jiajiadebug](https://github.com/jiajiadebug) | fix issues in ch2 - ch9 and glossary |
| [66 ](https://github.com/Vonng/ddia/pull/66) | [@blindpirate](https://github.com/blindpirate) | Fix typo |
| [63 ](https://github.com/Vonng/ddia/pull/63) | [@haifeiWu](https://github.com/haifeiWu) | Update ch10.md |
| [62 ](https://github.com/Vonng/ddia/pull/62) | [@ych](https://github.com/ych) | fix ch1.md typesetting problem |
| [61 ](https://github.com/Vonng/ddia/pull/61) | [@xianlaioy](https://github.com/xianlaioy) | docs: 鍾 --> 種, remove “ou” |
| [60 ](https://github.com/Vonng/ddia/pull/60) | [@Zombo1296](https://github.com/Zombo1296) | 否則 (“otherwise”) -> 或者 (“or”) |
| [59 ](https://github.com/Vonng/ddia/pull/59) | [@AlexanderMisel](https://github.com/AlexanderMisel) | 呼叫->呼叫,顯著->顯著 |
| [58 ](https://github.com/Vonng/ddia/pull/58) | [@ibyte2011](https://github.com/ibyte2011) | Update ch8.md |
| [55 ](https://github.com/Vonng/ddia/pull/55) | [@saintube](https://github.com/saintube) | ch8: fix broken links |
| [54 ](https://github.com/Vonng/ddia/pull/54) | [@Panmax](https://github.com/Panmax) | Update ch2.md |
| [53 ](https://github.com/Vonng/ddia/pull/53) | [@ibyte2011](https://github.com/ibyte2011) | Update ch9.md |
| [52 ](https://github.com/Vonng/ddia/pull/52) | [@hecenjie](https://github.com/hecenjie) | Update ch1.md |
| [51 ](https://github.com/Vonng/ddia/pull/51) | [@latavin243](https://github.com/latavin243) | Fix several translations in ch3 and ch4 |
| [50 ](https://github.com/Vonng/ddia/pull/50) | [@AlexZFX](https://github.com/AlexZFX) | Several omissions and formatting errors |
| [49 ](https://github.com/Vonng/ddia/pull/49) | [@haifeiWu](https://github.com/haifeiWu) | Update ch1.md |
| [48 ](https://github.com/Vonng/ddia/pull/48) | [@scaugrated](https://github.com/scaugrated) | fix typo |
| [47 ](https://github.com/Vonng/ddia/pull/47) | [@lzwill](https://github.com/lzwill) | Fixed typos in ch2 |
| [45 ](https://github.com/Vonng/ddia/pull/45) | [@zenuo](https://github.com/zenuo) | Remove a redundant closing parenthesis |
| [44 ](https://github.com/Vonng/ddia/pull/44) | [@akxxsb](https://github.com/akxxsb) | Fix broken links at the bottom of Chapter 7 |
| [43 ](https://github.com/Vonng/ddia/pull/43) | [@baijinping](https://github.com/baijinping) | Typo fix: "更假簡單" -> "更加簡單" |
| [42 ](https://github.com/Vonng/ddia/pull/42) | [@tisonkun](https://github.com/tisonkun) | Fix unordered list formatting in ch1 |
| [38 ](https://github.com/Vonng/ddia/pull/38) | [@renjie-c](https://github.com/renjie-c) | Correct minor translation errors in several places |
| [37 ](https://github.com/Vonng/ddia/pull/37) | [@tankilo](https://github.com/tankilo) | fix translation mistakes in ch4.md |
| [36 ](https://github.com/Vonng/ddia/pull/36) | [@wwek](https://github.com/wwek) | 1. Fix several broken links 2. Refine terminology 3. Correct errors |
| [35 ](https://github.com/Vonng/ddia/pull/35) | [@wwek](https://github.com/wwek) | fix ch7.md to ch8.md link error |
| [34 ](https://github.com/Vonng/ddia/pull/34) | [@wwek](https://github.com/wwek) | Merge pull request #1 from Vonng/master |
| [33 ](https://github.com/Vonng/ddia/pull/33) | [@wwek](https://github.com/wwek) | fix part-ii.md link error |
| [32 ](https://github.com/Vonng/ddia/pull/32) | [@JCYoky](https://github.com/JCYoky) | Update ch2.md |
| [31 ](https://github.com/Vonng/ddia/pull/31) | [@elsonLee](https://github.com/elsonLee) | Update ch7.md |
| [26 ](https://github.com/Vonng/ddia/pull/26) | [@yjhmelody](https://github.com/yjhmelody) | Fix some obvious errors |
| [25 ](https://github.com/Vonng/ddia/pull/25) | [@lqbilbo](https://github.com/lqbilbo) | Fix broken links |
| [24 ](https://github.com/Vonng/ddia/pull/24) | [@artiship](https://github.com/artiship) | Change word order |
| [23 ](https://github.com/Vonng/ddia/pull/23) | [@artiship](https://github.com/artiship) | Fix wrong characters |
| [22 ](https://github.com/Vonng/ddia/pull/22) | [@artiship](https://github.com/artiship) | Correct translation errors |
| [21 ](https://github.com/Vonng/ddia/pull/21) | [@zhtisi](https://github.com/zhtisi) | Fix a mismatch between the table of contents and the chapter title |
| [20 ](https://github.com/Vonng/ddia/pull/20) | [@rentiansheng](https://github.com/rentiansheng) | Update ch7.md |
| [19 ](https://github.com/Vonng/ddia/pull/19) | [@LHRchina](https://github.com/LHRchina) | Fix a small wording bug |
| [16 ](https://github.com/Vonng/ddia/pull/16) | [@MuAlex](https://github.com/MuAlex) | Master |
| [15 ](https://github.com/Vonng/ddia/pull/15) | [@cg-zhou](https://github.com/cg-zhou) | Update translation progress |
| [14 ](https://github.com/Vonng/ddia/pull/14) | [@cg-zhou](https://github.com/cg-zhou) | Translate glossary |
| [13 ](https://github.com/Vonng/ddia/pull/13) | [@cg-zhou](https://github.com/cg-zhou) | Thoroughly revise the colophon's description of the Indian wild boar |
| [12 ](https://github.com/Vonng/ddia/pull/12) | [@ibyte2011](https://github.com/ibyte2011) | Revise some translations |
| [11 ](https://github.com/Vonng/ddia/pull/11) | [@jiajiadebug](https://github.com/jiajiadebug) | ch2 100% |
| [10 ](https://github.com/Vonng/ddia/pull/10) | [@jiajiadebug](https://github.com/jiajiadebug) | ch2 20% |
| [9 ](https://github.com/Vonng/ddia/pull/9) | [@jiajiadebug](https://github.com/jiajiadebug) | Preface, ch1, part-i translation minor fixes |
| [7 ](https://github.com/Vonng/ddia/pull/7) | [@MuAlex](https://github.com/MuAlex) | Ch6 translation pull request |
| [6 ](https://github.com/Vonng/ddia/pull/6) | [@MuAlex](https://github.com/MuAlex) | Ch6 change version1 |
| [5 ](https://github.com/Vonng/ddia/pull/5) | [@nevertiree](https://github.com/nevertiree) | Chapter 01語法微調 |
| [2 ](https://github.com/Vonng/ddia/pull/2) | [@seagullbird](https://github.com/seagullbird) | 序言初翻 |
================================================
FILE: content/tw/glossary.md
================================================
---
title: Glossary
weight: 500
breadcrumbs: false
---
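> Note: the definitions in this glossary are deliberately kept short. They are intended to convey the core idea of a term, not to cover all of its subtleties. For more detail, see the corresponding sections of the main text.
### Asynchronous
Not waiting for something to complete (e.g., sending data over the network to another node), and not making any assumptions about how long it will take. See “[Synchronous Versus Asynchronous Replication](/tw/ch6#sec_replication_sync_async)”, “[Synchronous Versus Asynchronous Networks](/tw/ch9#sec_distributed_sync_networks)”, and “[System Model and Reality](/tw/ch9#sec_distributed_system_model)”.
### Atomic
1. In the context of concurrency: describing an operation that appears to take effect at a single point in time, so that other concurrent processes never see it in a "half-finished" state. See also *isolation*.
2. In the context of transactions: describing a set of writes that are either all committed or all rolled back, even if faults occur. See “[Atomicity](/tw/ch8#sec_transactions_acid_atomicity)” and “[Two-Phase Commit (2PC)](/tw/ch8#sec_transactions_2pc)”.
### Backpressure
Forcing the sender of some data to slow down because the recipient cannot keep up. Also known as *flow control*. See “[When an overloaded system won't recover](/tw/ch2#sidebar_metastable)”.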
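
As a rough illustration of backpressure (a sketch, not taken from the book), a bounded in-memory queue can stand in for the channel between sender and receiver. The names `channel` and `try_send` and the capacity of 2 are assumptions chosen for this example. When the queue is full, the sender is told to slow down instead of piling up more data.

```python
import queue

# Bounded buffer between a sender and a receiver (capacity 2 is an arbitrary assumption).
channel = queue.Queue(maxsize=2)


def try_send(item):
    """Return True if the item was accepted, or False if the receiver has
    fallen behind and the sender should slow down and retry later."""
    try:
        channel.put_nowait(item)
        return True
    except queue.Full:
        return False


# The receiver has not consumed anything yet, so the third send is refused.
results = [try_send(n) for n in range(3)]

# Once the receiver takes an item off the queue, the sender may proceed again.
channel.get_nowait()
accepted_after_drain = try_send(3)
```

A real system would typically block or delay the sender rather than return a flag, but the principle is the same: the sender's rate is limited by the receiver's rate.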
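### Batch process
A computation that takes a fixed (and usually large) set of data as input and produces some other data as output, without modifying the input. See [Chapter 11](/tw/ch11#ch_batch).
### Bounded
Having a known upper limit or size. Used, for example, to describe network delay (see “[Timeouts and Unbounded Delays](/tw/ch9#sec_distributed_queueing)”) and datasets (see the introduction to [Chapter 12](/tw/ch12#ch_stream)).
### Byzantine fault
A node that behaves incorrectly in some arbitrary way, for example by sending contradictory or malicious messages to other nodes. See “[Byzantine Faults](/tw/ch9#sec_distributed_byzantine)”.
### Cache
A component that remembers recently accessed data in order to speed up later reads of the same data. A cache is usually incomplete: on a miss, the data has to be fetched from a slower underlying data store that holds a complete copy.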
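
The cache definition above can be sketched as a small read-through cache. This is only an illustration under stated assumptions: the class name `ReadThroughCache`, the `load_from_store` callback, and the dictionary used as the "slow" store are all hypothetical, and eviction is omitted for brevity.

```python
from typing import Callable, Dict


class ReadThroughCache:
    """Remembers values that were read recently; on a miss it falls back
    to a slower store that holds the complete data."""

    def __init__(self, load_from_store: Callable[[str], str]) -> None:
        self._load_from_store = load_from_store  # slower but complete source
        self._entries: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def get(self, key: str) -> str:
        if key in self._entries:
            self.hits += 1
            return self._entries[key]
        # Cache miss: fetch from the underlying store and remember the result.
        self.misses += 1
        value = self._load_from_store(key)
        self._entries[key] = value
        return value


# A plain dict stands in for the slow, complete data store.
backing_store = {"user:1": "Alice", "user:2": "Bob"}
cache = ReadThroughCache(backing_store.__getitem__)

first_read = cache.get("user:1")   # miss, loaded from the backing store
second_read = cache.get("user:1")  # hit, served from memory
```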
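### CAP theorem
A theoretical result that is frequently misunderstood in practice and offers little direct guidance. See “[The CAP theorem](/tw/ch10#the-cap-theorem)”.
### Causality
The dependency between events that arises when one thing "happens before" another. For example, a later event may be a response to an earlier event, build upon it, or need to be understood in the light of it. See “[The Happens-Before Relation and Concurrency](/tw/ch6#sec_replication_happens_before)”.
### Consensus
A fundamental problem in distributed computing: getting several nodes to agree on something (for example, which node is the leader). It is much harder than it seems at first glance. See “[Consensus](/tw/ch10#sec_consistency_consensus)”.
### Data warehouse
A database in which data from several OLTP systems has been combined and prepared for use in analytics. See “[Data Warehousing](/tw/ch1#sec_introduction_dwh)”.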
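### Declarative
Describing the properties that something should have, rather than the exact steps for achieving it. In database queries, a query optimizer takes a declarative query and decides how best to execute it. See “[Terminology: Declarative Query Languages](/tw/ch3)”.
### Denormalize
To introduce some redundancy into a normalized dataset, typically in the form of a cache or index, in order to speed up reads. A denormalized value can be seen as a precomputed result, similar to a materialized view. See “[Normalization, Denormalization, and Joins](/tw/ch3#sec_datamodels_normalization)”.
### Derived data
A dataset that is created from other data through a repeatable process and that can be recomputed if necessary. Derived data is usually needed to speed up a particular kind of read. Indexes, caches, and materialized views are all examples of derived data. See “[Systems of Record and Derived Data](/tw/ch1#sec_introduction_derived)”.
### Deterministic
Describing a function that always produces the same output for the same input, and that does not depend on unpredictable factors such as random numbers, the current time, or network communication. See “[The Power of Determinism](/tw/ch9#sidebar_distributed_determinism)”.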
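### Distributed
Running on several nodes connected by a network. Distributed systems are characterized by *partial failures*: one part of the system may be broken while other parts are still working, and it is often hard for the software to know exactly what is broken. See “[Faults and Partial Failures](/tw/ch9#sec_distributed_partial_failure)”.
### Durable
Storing data in a way that you believe it will not be lost, even if various faults occur. See “[Durability](/tw/ch8#durability)”.
### ETL
Extract-Transform-Load: the process of extracting data from a source database, transforming it into a form better suited to analytic queries, and loading it into a data warehouse or batch processing system. See “[Data Warehousing](/tw/ch1#sec_introduction_dwh)”.
### Failover
In systems with a single leader, the process of moving the leadership role from one node to another. See “[Handling Node Outages](/tw/ch6#sec_replication_failover)”.
### Fault-tolerant
Able to recover automatically when something goes wrong (e.g., a machine crashes or a network link fails). See “[Reliability and Fault Tolerance](/tw/ch2#sec_introduction_reliability)”.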
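### Flow control
See *backpressure*.
### Follower
A replica that does not accept writes directly from clients, and only applies data changes that it receives from a leader. Also known as a *secondary*, *read replica*, or *hot standby*. See “[Single-Leader Replication](/tw/ch6#sec_replication_leader)”.
### Full-text search
Searching text by arbitrary keywords, often with additional features such as matching similarly spelled words or synonyms. A full-text index is a kind of *secondary index* that supports such queries. See “[Full-Text Search](/tw/ch4#sec_storage_full_text)”.
### Graph
A data structure consisting of *vertices* (things that you can refer to, also known as *nodes* or *entities*) and *edges* (connections between vertices, also known as *relationships* or *arcs*). See “[Graph-Like Data Models](/tw/ch3#sec_datamodels_graph)”.
### Hash
A function that turns an input into a random-looking number. The same input always produces the same output. Different inputs usually produce different outputs, although two different inputs may produce the same output (a *collision*). See “[Sharding by Hash of Key](/tw/ch7#sec_sharding_hash)”.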
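
A minimal sketch of how a hash function might be used to assign keys to shards. The function name `shard_for_key`, the choice of SHA-256, and the shard count of 4 are assumptions made for this example only; the linked chapter discusses the actual trade-offs.

```python
import hashlib


def shard_for_key(key: str, num_shards: int) -> int:
    """Map a key to a shard number in the range [0, num_shards)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Interpret the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % num_shards


keys = ["alice", "bob", "carol", "dave", "erin"]
assignments = {key: shard_for_key(key, 4) for key in keys}

# Five keys but only four shards: at least two keys must collide on a shard number.
distinct_shards = set(assignments.values())
```

The same key always maps to the same shard, which is what makes the hash usable for routing requests. Because the output range is reduced to four values here, collisions on the shard number are unavoidable.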
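### Idempotent
Describing an operation that can safely be retried: executing it more than once has the same effect as executing it once. See “[Idempotence](/tw/ch12#sec_stream_idempotence)”.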
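
One common way to make an operation idempotent is to attach a unique request ID and ignore repeats. The sketch below assumes this technique; the `Account` class and the `request_id` parameter are hypothetical names introduced only for illustration.

```python
class Account:
    """Toy account whose deposits can be retried safely."""

    def __init__(self, balance: int) -> None:
        self.balance = balance
        self._applied_request_ids = set()

    def deposit(self, request_id: str, amount: int) -> int:
        # A retry carrying an already-seen request ID must not be applied again.
        if request_id in self._applied_request_ids:
            return self.balance
        self.balance += amount
        self._applied_request_ids.add(request_id)
        return self.balance


account = Account(balance=100)
account.deposit("req-1", 50)
account.deposit("req-1", 50)  # retry of the same request: no further effect
```

Without the request ID check, the retry would add the amount twice, so a client that resends after a lost response could not do so safely.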
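### Index
A data structure that lets you efficiently find all records that have a particular value in a particular field. See “[Storage and Indexing for OLTP](/tw/ch4#sec_storage_oltp)”.
### Isolation
In the context of transactions, the degree to which concurrently executing transactions can interfere with each other. *Serializable* isolation is the strongest, but weaker isolation levels are also commonly used. See “[Isolation](/tw/ch8#sec_transactions_acid_isolation)”.
### Join
To bring together records that are related to each other. Most commonly used when one record references another (a foreign key, a document reference, or an edge in a graph) and a query needs to fetch the referenced record. See “[Normalization, Denormalization, and Joins](/tw/ch3#sec_datamodels_normalization)” and “[JOIN and GROUP BY](/tw/ch11#sec_batch_join)”.
### Leader
When data or a service is replicated across several nodes, the replica that is designated to accept writes. A leader may be elected through a protocol or chosen by an administrator. Also known as the *primary* or *source*. See “[Single-Leader Replication](/tw/ch6#sec_replication_leader)”.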
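### Linearizable
Behaving as if there were only a single copy of the data in the system, updated by atomic operations. See “[Linearizability](/tw/ch10#sec_consistency_linearizability)”.
### Locality
A performance optimization: storing pieces of data together if they are frequently accessed together. See “[Data locality for reads and writes](/tw/ch3#sec_datamodels_document_locality)”.
### Lock
A mechanism that ensures only one thread, node, or transaction can access a resource at a time; anyone else who wants access must wait until the lock is released. See “[Two-Phase Locking (2PL)](/tw/ch8#sec_transactions_2pl)” and “[Distributed Locks and Leases](/tw/ch9#sec_distributed_lock_fencing)”.
### Log
An append-only file for storing data. A *write-ahead log* (WAL) is used for crash recovery (see “[Making B-trees reliable](/tw/ch4#sec_storage_btree_wal)”); *log-structured* storage uses logs as its primary storage format (see “[Log-Structured Storage](/tw/ch4#sec_storage_log_structured)”); a *replication log* is used for leader-follower replication (see “[Single-Leader Replication](/tw/ch6#sec_replication_leader)”); and an *event log* can represent a data stream (see “[Log-based Message Brokers](/tw/ch12#sec_stream_log)”).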
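### Materialize
To compute a result ahead of time and write it out, rather than calculating it on demand. See “[Event Sourcing and CQRS](/tw/ch3#sec_datamodels_events)”.
### Node
An instance of some software running on a computer, which cooperates with other nodes over a network to accomplish a task.
### Normalized
Structured so as to avoid redundancy and duplication as far as possible. In a normalized database, when a piece of data changes you usually only need to change it in one place, not keep many copies in sync. See “[Normalization, Denormalization, and Joins](/tw/ch3#sec_datamodels_normalization)”.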
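### OLAP
Online Analytic Processing: an access pattern characterized by aggregating (e.g., count, sum, average) over a large number of records. See “[Operational Versus Analytical Systems](/tw/ch1#sec_introduction_analytics)”.
### OLTP
Online Transaction Processing: an access pattern characterized by fast reads and writes of a small number of records, usually looked up by key. See “[Operational Versus Analytical Systems](/tw/ch1#sec_introduction_analytics)”.
### Sharding
Splitting a large dataset or computation that is too big for a single machine into smaller parts and spreading them across several machines. Also known as *partitioning*. See [Chapter 7](/tw/ch7#ch_sharding).
### Percentile
A way of describing a distribution by counting how many values are above or below some threshold. For example, if the 95th percentile response time over some period is *t*, then 95% of requests in that period took less than *t*, and 5% took longer. See “[Describing Performance](/tw/ch2#sec_introduction_percentiles)”.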
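
To make the percentile definition concrete, here is a small sketch using the nearest-rank method. This is one of several conventions for computing percentiles and is an assumption of this example, as are the function name `percentile` and the sample response times.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest value such that at least
    p percent of the values are less than or equal to it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank into the sorted list
    return ordered[rank - 1]


# Hypothetical response times in milliseconds; one request is a slow outlier.
response_times_ms = [12, 15, 11, 240, 14, 13, 18, 16, 17, 19]
median = percentile(response_times_ms, 50)
p95 = percentile(response_times_ms, 95)
```

With these ten samples the median is 15 ms while the 95th percentile is 240 ms, which shows how high percentiles expose outliers that the median hides.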
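### Primary key
A value (typically a number or a string) that uniquely identifies a record. In many applications, primary keys are generated by the system when a record is created (sequentially or randomly) rather than being set by users. See also *secondary index*.
### Quorum
The minimum number of nodes that need to vote on an operation before it can be considered successful. See “[Quorums for reading and writing](/tw/ch6#sec_replication_quorum_condition)”.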
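
For read and write quorums, the usual condition is that with *n* replicas, *w* write acknowledgements and *r* read responses, `w + r > n` guarantees that every read overlaps with the latest successful write. The sketch below checks that condition and, as a sanity check, compares it against brute-force enumeration. The function names are assumptions made for this example.

```python
from itertools import combinations


def quorums_overlap(n: int, w: int, r: int) -> bool:
    """Condition w + r > n: any write quorum and any read quorum share a node."""
    return w + r > n


def overlap_by_enumeration(n: int, w: int, r: int) -> bool:
    """Brute-force check: every set of w nodes intersects every set of r nodes."""
    nodes = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(nodes, w)
        for read_set in combinations(nodes, r)
    )


# A typical configuration: 3 replicas, writes and reads each wait for 2 nodes.
typical = quorums_overlap(n=3, w=2, r=2)

# Waiting for only one node on each side allows a read to miss the latest write.
too_weak = quorums_overlap(n=3, w=1, r=1)
```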
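### Rebalance
To move data or services from one node to another in order to spread the load evenly. See “[Sharding of Key-Value Data](/tw/ch7#sec_sharding_key_value)”.
### Replication
Keeping a copy of the same data on several nodes (*replicas*) so that it remains accessible when some nodes are unreachable. See [Chapter 6](/tw/ch6#ch_replication).
### Schema
A description of the structure of some data, including its fields and datatypes. Whether data conforms to a schema can be checked at various points in its lifetime (see “[Schema flexibility in the document model](/tw/ch3#sec_datamodels_schema_flexibility)”), and a schema can evolve over time (see [Chapter 5](/tw/ch5#ch_encoding)).
### Secondary index
An additional data structure maintained alongside the primary data storage, which lets you efficiently find records that match a certain kind of condition. See “[Multi-Column and Secondary Indexes](/tw/ch4#sec_storage_index_multicolumn)” and “[Sharding and Secondary Indexes](/tw/ch7#sec_sharding_secondary_indexes)”.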
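### Serializable
A guarantee of *isolation*: when several transactions execute concurrently, they behave as if they had executed one at a time, in some serial order. See “[Serializability](/tw/ch8#sec_transactions_serializability)”.
### Shared-nothing
An architecture in which independent nodes (each with its own CPUs, memory, and disks) are connected via a conventional network, in contrast to shared-memory or shared-disk architectures. See “[Shared-Memory, Shared-Disk, and Shared-Nothing Architecture](/tw/ch2#sec_introduction_shared_nothing)”.
### Skew
1. Imbalanced load across shards, such that some shards have many requests or much data and others have little. Also known as *hot spots*. See “[Skewed Workloads and Relieving Hot Spots](/tw/ch7#sec_sharding_skew)”.
2. A timing anomaly that causes events to appear in an unexpected, nonsequential order. See the discussions of read skew in “[Snapshot Isolation and Repeatable Read](/tw/ch8#sec_transactions_snapshot_isolation)”, write skew in “[Write Skew and Phantoms](/tw/ch8#sec_transactions_write_skew)”, and clock skew in “[Timestamps for ordering events](/tw/ch9#sec_distributed_lww)”.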
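### Split brain
A scenario in which two nodes simultaneously believe themselves to be the leader, which may cause system guarantees to be violated. See “[Handling Node Outages](/tw/ch6#sec_replication_failover)” and “[The Majority Rules](/tw/ch9#sec_distributed_majority)”.
### Stored procedure
A way of encoding the logic of a transaction so that it executes entirely on the database server, without communicating back and forth with a client during the transaction. See “[Actual Serial Execution](/tw/ch8#sec_transactions_serial)”.
### Stream process
A continually running computation that consumes a never-ending stream of events as input and derives output from it. See [Chapter 12](/tw/ch12#ch_stream).
### Synchronous
The opposite of *asynchronous*.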
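### System of record
A system that holds the primary, authoritative version of some data, also known as the *source of truth*. Changes are first written here, and other datasets may be derived from it. See “[Systems of Record and Derived Data](/tw/ch1#sec_introduction_derived)”.
### Timeout
One of the simplest ways of detecting a fault: a timeout is declared if no response is received within a certain amount of time. However, it is impossible to tell whether the timeout is due to a failure of the remote node or a problem in the network. See “[Timeouts and Unbounded Delays](/tw/ch9#sec_distributed_queueing)”.
### Total order
A way of comparing things (e.g., timestamps) in which you can always say which of any two is greater and which is lesser. An ordering in which some elements are incomparable is called a *partial order*.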
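
The difference between a total order and a partial order can be illustrated with two built-in Python comparisons. Integers (used here as stand-ins for timestamps) are totally ordered, whereas sets under the subset relation are only partially ordered. The helper `comparable_by_subset` and the sample values are assumptions made for this sketch.

```python
def comparable_by_subset(a: frozenset, b: frozenset) -> bool:
    """Under the subset relation, two sets are comparable only if one contains the other."""
    return a <= b or b <= a


# Integers are totally ordered, so sorting them is always well defined.
timestamps = [1700000003, 1700000001, 1700000002]
ordered_timestamps = sorted(timestamps)

x_only = frozenset({"x"})
y_only = frozenset({"y"})
x_and_y = frozenset({"x", "y"})

# {"x"} is a subset of {"x", "y"}, so these two are comparable.
nested_pair_comparable = comparable_by_subset(x_only, x_and_y)

# Neither {"x"} nor {"y"} contains the other: they are incomparable.
disjoint_pair_comparable = comparable_by_subset(x_only, y_only)
```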
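### Transaction
Grouping several reads and writes into one logical unit, in order to simplify error handling and concurrency issues. See [Chapter 8](/tw/ch8#ch_transactions).
### Two-phase commit (2PC)
An algorithm that ensures several database nodes either all *atomically* commit or all abort a transaction. See “[Two-Phase Commit (2PC)](/tw/ch8#sec_transactions_2pc)”.
### Two-phase locking (2PL)
An algorithm for achieving *serializable isolation*: a transaction acquires a lock on all data it reads or writes and holds the locks until the end of the transaction. See “[Two-Phase Locking (2PL)](/tw/ch8#sec_transactions_2pl)”.
### Unbounded
Not having any known upper limit or size. The opposite of *bounded*.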
================================================
FILE: content/tw/indexes.md
================================================
---
title: Index
weight: 550
breadcrumbs: false
---
### Symbols
- 3FS(分散式檔案系統), [分散式檔案系統](/tw/ch11#sec_batch_dfs)
### A
- 中止(事務), [事務](/tw/ch8#ch_transactions), [原子性](/tw/ch8#sec_transactions_acid_atomicity)
- 級聯, [沒有髒讀](/tw/ch8#no-dirty-reads)
- 在兩階段提交中, [兩階段提交(2PC)](/tw/ch8#sec_transactions_2pc)
- 樂觀併發控制的效能, [可序列化快照隔離的效能](/tw/ch8#performance-of-serializable-snapshot-isolation)
- 重試已中止的事務, [處理錯誤和中止](/tw/ch8#handling-errors-and-aborts)
- 抽象, [雲服務的分層](/tw/ch1#layering-of-cloud-services), [簡單性:管理複雜度](/tw/ch2#id38), [資料模型與查詢語言](/tw/ch3#ch_datamodels), [事務](/tw/ch8#ch_transactions), [總結](/tw/ch8#summary)
- 意外複雜性, [簡單性:管理複雜度](/tw/ch2#id38)
- 問責制, [責任與問責](/ch14#id371)
- 會計(財務資料), [總結](/tw/ch3#summary), [不可變事件的優點](/tw/ch12#sec_stream_immutability_pros)
- Accumulo(資料庫)
- wide-column data model, [Data locality for reads and writes](/tw/ch3#sec_datamodels_document_locality), [Column Compression](/tw/ch4#sec_storage_column_compression)
- ACID 屬性(事務), [ACID 的含義](/tw/ch8#sec_transactions_acid)
- 原子性, [原子性](/tw/ch8#sec_transactions_acid_atomicity), [單物件與多物件操作](/tw/ch8#sec_transactions_multi_object)
- 一致性, [一致性](/tw/ch8#sec_transactions_acid_consistency), [維護完整性,儘管軟體有Bug](/tw/ch13#id455)
- 永續性, [使 B 樹可靠](/tw/ch4#sec_storage_btree_wal), [永續性](/tw/ch8#durability)
- 隔離性, [隔離性](/tw/ch8#sec_transactions_acid_isolation), [單物件與多物件操作](/tw/ch8#sec_transactions_multi_object)
- 確認(訊息), [確認與重新傳遞](/tw/ch12#sec_stream_reordering)
- active/active replication(見 multi-leader replication)
- active/passive replication(見 基於領導者的複製)
- ActiveMQ(訊息系統), [訊息代理](/tw/ch5#message-brokers), [訊息代理與資料庫的對比](/tw/ch12#id297)
- 分散式事務支援, [XA 事務](/tw/ch8#xa-transactions)
- ActiveRecord(物件關係對映器), [物件關係對映(ORM)](/tw/ch3#object-relational-mapping-orm), [處理錯誤和中止](/tw/ch8#handling-errors-and-aborts)
- activity (workflows)(見 workflow engines)
- Actor 模型, [分散式 actor 框架](/tw/ch5#distributed-actor-frameworks)
- (另見 event-driven architecture)
- 與流處理的比較, [事件驅動架構與 RPC](/tw/ch12#sec_stream_actors_drpc)
- 自適應容量, [偏斜的工作負載與緩解熱點](/tw/ch7#sec_sharding_skew)
- Advanced Message Queuing Protocol(見 AMQP)
- 航空航天系統, [拜占庭故障](/tw/ch9#sec_distributed_byzantine)
- Aerospike(資料庫)
- 強一致性模式, [單物件寫入](/tw/ch8#sec_transactions_single_object)
- AGE(圖資料庫), [Cypher 查詢語言](/tw/ch3#id57)
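- aggregation
- data cubes and materialized views, [Materialized Views and Data Cubes](/tw/ch4#sec_storage_materialized_views)
- in batch processes, [Sorting Versus In-Memory Aggregation](/tw/ch11#id275)
- in stream processes, [Stream analytics](/tw/ch12#id318)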
- 聚合管道(MongoDB), [正規化、反正規化與連線](/tw/ch3#sec_datamodels_normalization), [文件的查詢語言](/tw/ch3#query-languages-for-documents)
- 敏捷, [可演化性:讓變化更容易](/tw/ch2#sec_introduction_evolvability)
- 最小化不可逆性, [批處理](/tw/ch11#ch_batch), [應用演化後重新處理資料](/tw/ch13#sec_future_reprocessing)
- 充滿自信地快速前進, [端到端原則重現](/tw/ch13#id456)
- agreement, [Single-value consensus](/tw/ch10#single-value-consensus), [Atomic commitment as consensus](/tw/ch10#atomic-commitment-as-consensus)
- (see also consensus)
- AI (artificial intelligence)(見 machine learning)
- AI Act (European Union), [資料系統、法律與社會](/tw/ch1#sec_introduction_compliance)
- Airbyte, [資料倉庫](/tw/ch1#sec_introduction_dwh)
- Airflow(工作流排程器), [持久化執行與工作流](/tw/ch5#sec_encoding_dataflow_workflows), [批處理](/tw/ch11#ch_batch), [工作流排程](/tw/ch11#sec_batch_workflows)
- 雲資料倉整合, [查詢語言](/tw/ch11#sec_batch_query_lanauges)
- 用於 ETL, [提取-轉換-載入(ETL)](/tw/ch11#sec_batch_etl_usage)
- 阿卡邁
- 響應時間研究, [平均值、中位數與百分位點](/tw/ch2#id24)
- 演算法
- 演算法正確性, [定義演算法的正確性](/tw/ch9#defining-the-correctness-of-an-algorithm)
- B樹, [B 樹](/tw/ch4#sec_storage_b_trees)-[B 樹變體](/tw/ch4#b-tree-variants)
- 分散式系統, [系統模型與現實](/tw/ch9#sec_distributed_system_model)
- 歸併排序, [構建和合並 SSTable](/tw/ch4#constructing-and-merging-sstables), [混洗資料](/tw/ch11#sec_shuffle)
- 排程, [資源分配](/tw/ch11#id279)
- SSTable 與 LSM 樹, [SSTable 檔案格式](/tw/ch4#the-sstable-file-format)-[壓實策略](/tw/ch4#sec_storage_lsm_compaction)
- 全互聯複製拓撲, [多主複製拓撲](/tw/ch6#sec_replication_topologies)
- AllegroGraph(資料庫), [圖資料模型](/tw/ch3#sec_datamodels_graph)
- SPARQL 查詢語言, [SPARQL 查詢語言](/tw/ch3#the-sparql-query-language)
- ALTER TABLE 語句(SQL), [文件模型中的模式靈活性](/tw/ch3#sec_datamodels_schema_flexibility), [編碼與演化](/tw/ch5#ch_encoding)
- 亞馬遜
- Dynamo(見 Dynamo(資料庫))
- 響應時間研究, [平均值、中位數與百分位點](/tw/ch2#id24)
- Amazon Web Services (AWS)
- Aurora(見 Aurora(雲資料庫))
- ClockBound(見 ClockBound(時間同步))
- 正確性測試, [形式化方法和隨機測試](/tw/ch9#sec_distributed_formal)
- DynamoDB(見 DynamoDB(資料庫))
- EBS(見 EBS(虛擬塊裝置))
- Kinesis(見 Kinesis(訊息系統))
- Neptune(見 Neptune(圖資料庫))
- 網路可靠性, [實踐中的網路故障](/tw/ch9#sec_distributed_network_faults)
- S3(見 S3(物件儲存))
- 放大
- 偏見, [偏見與歧視](/ch14#id370)
- 故障, [維護派生狀態](/tw/ch13#id446)
- 尾延遲, [響應時間指標的應用](/tw/ch2#sec_introduction_slo_sla), [本地二級索引](/tw/ch7#id166)
- 寫入放大, [寫放大](/tw/ch4#write-amplification)
- AMQP(高階訊息佇列協議), [訊息代理與資料庫的對比](/tw/ch12#id297)
- (另見 messaging systems)
- comparison to log-based messaging, [Logs compared to traditional messaging](/tw/ch12#sec_stream_logs_vs_messaging), [Replaying old messages](/tw/ch12#sec_stream_replay)
- 訊息順序, [確認與重新傳遞](/tw/ch12#sec_stream_reordering)
- 分析系統, [分析型與事務型系統](/tw/ch1#sec_introduction_analytics)
- 作為衍生資料系統, [記錄系統與派生資料](/tw/ch1#sec_introduction_derived)
- 來自運營系統的 ETL, [資料倉庫](/tw/ch1#sec_introduction_dwh)
- 治理, [超越資料湖](/tw/ch1#beyond-the-data-lake)
- 分析, [分析型與事務型系統](/tw/ch1#sec_introduction_analytics)-[記錄系統與派生資料](/tw/ch1#sec_introduction_derived)
- 與事務處理的比較, [事務處理與分析的特徵](/tw/ch1#sec_introduction_oltp)
- data normalization in, [Trade-offs of normalization](/tw/ch3#trade-offs-of-normalization)
- data warehousing(見 data warehousing)
- predictive(見 predictive analytics)
- 與批次處理的關係, [分析(Analytics)](/tw/ch11#sec_batch_olap)-[分析(Analytics)](/tw/ch11#sec_batch_olap)
- schemas for, [Stars and Snowflakes: Schemas for Analytics](/tw/ch3#sec_datamodels_analytics)
- snapshot isolation for queries, [Snapshot Isolation and Repeatable Read](/tw/ch8#sec_transactions_snapshot_isolation)
- 流式分析, [流分析](/tw/ch12#id318)
- 分析工程, [分析型與事務型系統](/tw/ch1#sec_introduction_analytics)
- 反熵, [追趕錯過的寫入](/tw/ch6#sec_replication_read_repair)
- Antithesis(確定性模擬測試), [確定性模擬測試](/tw/ch9#deterministic-simulation-testing)
- Apache Accumulo(見 Accumulo)
- Apache ActiveMQ(見 ActiveMQ)
- Apache AGE(見 AGE)
- Apache Arrow(見 Arrow(資料格式))
- Apache Avro(見 Avro)
- Apache Beam(見 Beam)
- Apache BookKeeper(見 BookKeeper)
- Apache Cassandra(見 Cassandra)
- Apache Curator(見 Curator)
- Apache DataFusion(見 DataFusion(查詢引擎))
- Apache Druid(見 Druid(資料庫))
- Apache Flink(見 Flink(處理框架))
- Apache HBase(見 HBase)
- Apache Iceberg(見 Iceberg(表格式))
- Apache Jena(見 Jena)
- Apache Kafka(見 Kafka)
- Apache Lucene(見 Lucene)
- Apache Oozie(見 Oozie(工作流排程器))
- Apache ORC(見 ORC(資料格式))
- Apache Parquet(見 Parquet(資料格式))
- Apache Pig(查詢語言), [查詢語言](/tw/ch11#sec_batch_query_lanauges)
- Apache Pinot(見 Pinot(資料庫))
- Apache Pulsar(見 Pulsar)
- Apache Qpid(見 Qpid)
- Apache Samza(見 Samza)
- Apache Solr(見 Solr)
- Apache Spark(見 Spark;見 Spark(處理框架))
- Apache Storm(見 Storm)
- Apache Superset(見 Superset(資料視覺化軟體))
- Apache Thrift(見 Thrift)
- Apache ZooKeeper(見 ZooKeeper)
- Apama (流式分析), [複合事件處理](/tw/ch12#id317)
- append-only files(見 logs)
- Application Programming Interfaces (APIs), [資料模型與查詢語言](/tw/ch3#ch_datamodels)
- 用於改變流, [變更流的 API 支援](/tw/ch12#sec_stream_change_api)
- 分散式事務, [XA 事務](/tw/ch8#xa-transactions)
- for services, [Dataflow Through Services: REST and RPC](/tw/ch5#sec_encoding_dataflow_rpc)-[Data encoding and evolution for RPC](/tw/ch5#data-encoding-and-evolution-for-rpc)
- (另見 services)
- 可演化性, [RPC 的資料編碼與演化](/tw/ch5#data-encoding-and-evolution-for-rpc)
- RESTful, [Web 服務](/tw/ch5#sec_web_services)
- application state (see state)
- approximate search(見 similarity search)
- archival storage, data from databases, [Archival storage](/tw/ch5#archival-storage)
- arcs(見 edges)
- ArcticDB(資料庫), [資料框、矩陣與陣列](/tw/ch3#sec_datamodels_dataframes)
- 算術平均值, [平均值、中位數與百分位點](/tw/ch2#id24)
- 陣列
- 陣列資料庫, [資料框、矩陣與陣列](/tw/ch3#sec_datamodels_dataframes)
- 多層面, [資料框、矩陣與陣列](/tw/ch3#sec_datamodels_dataframes)
- Arrow(資料格式), [列式儲存](/tw/ch4#sec_storage_column), [DataFrames](/tw/ch11#id287)
- artificial intelligence(見 machine learning)
- ASCII text, [Protocol Buffers](/tw/ch5#sec_encoding_protobuf)
- ASN.1 (schema language), [模式的優點](/tw/ch5#sec_encoding_schemas)
- 關聯表格, [多對一與多對多關係](/tw/ch3#sec_datamodels_many_to_many), [屬性圖](/tw/ch3#id56)
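- asynchronous networks, [Unreliable Networks](/tw/ch9#sec_distributed_networks), [Glossary](/tw/glossary)
- comparison to synchronous networks, [Synchronous Versus Asynchronous Networks](/tw/ch9#sec_distributed_sync_networks)
- system models, [System Model and Reality](/tw/ch9#sec_distributed_system_model)
- asynchronous replication, [Synchronous Versus Asynchronous Replication](/tw/ch6#sec_replication_sync_async), [Glossary](/tw/glossary)
- data loss on failover, [Leader failure: Failover](/tw/ch6#leader-failure-failover)
- reads from asynchronous followers, [Problems with Replication Lag](/tw/ch6#sec_replication_lag)
- with multiple leaders, [Multi-Leader Replication](/tw/ch6#sec_replication_multi_leader)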
- 非同步傳輸模式, [我們不能簡單地使網路延遲可預測嗎?](/tw/ch9#can-we-not-simply-make-network-delays-predictable)
- 原子廣播, [共享日誌作為共識](/tw/ch10#sec_consistency_shared_logs)
- 原子鐘, [帶置信區間的時鐘讀數](/tw/ch9#clock-readings-with-a-confidence-interval), [用於全域性快照的同步時鐘](/tw/ch9#sec_distributed_spanner)
- (另見 clocks)
- 原子性, [術語表](/tw/glossary)
- 原子自增, [單物件寫入](/tw/ch8#sec_transactions_single_object)
- 比較和設定, [條件寫入(比較並設定)](/tw/ch8#sec_transactions_compare_and_set), [什麼使系統具有線性一致性?](/tw/ch10#sec_consistency_lin_definition)
- (另見 比較和設定)
- 異常資料, [正規化的權衡](/tw/ch3#trade-offs-of-normalization)
- 獲取和新增/遞增, [ID 生成器和邏輯時鐘](/tw/ch10#sec_consistency_logical), [共識](/tw/ch10#sec_consistency_consensus), [獲取並增加作為共識](/tw/ch10#fetch-and-add-as-consensus)
- 寫入操作, [原子寫操作](/tw/ch8#atomic-write-operations)
- 原子性, [原子性](/tw/ch8#sec_transactions_acid_atomicity), [單物件與多物件操作](/tw/ch8#sec_transactions_multi_object), [術語表](/tw/glossary)
- 原子提交
- 避開, [多分割槽請求處理](/tw/ch13#id360), [無協調資料系統](/tw/ch13#id454)
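- blocking and nonblocking, [Three-phase commit](/tw/ch8#three-phase-commit)
- in stream processing, [Exactly-once message processing](/tw/ch8#sec_transactions_exactly_once), [Exactly-once message processing revisited](/tw/ch8#exactly-once-message-processing-revisited), [Atomic commit revisited](/tw/ch12#sec_stream_atomic_commit)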
- 維護衍生資料, [保持系統同步](/tw/ch12#sec_stream_sync)
- 分散式事務, [分散式事務](/tw/ch8#sec_transactions_distributed)-[再談恰好一次訊息處理](/tw/ch8#exactly-once-message-processing-revisited)
- 用於多物件事務, [單物件與多物件操作](/tw/ch8#sec_transactions_multi_object)
- 用於單物件寫入, [單物件寫入](/tw/ch8#sec_transactions_single_object)
- relation to consensus, [Atomic commitment as consensus](/tw/ch10#atomic-commitment-as-consensus)
- 可審計性, [信任但驗證](/tw/ch13#sec_future_verification)-[用於可審計資料系統的工具](/tw/ch13#id366)
- 設計, [為可審計性而設計](/tw/ch13#id365)
- 自動審計系統, [不要盲目信任承諾](/tw/ch13#id364)
- 透過不可改變性, [不可變事件的優點](/tw/ch12#sec_stream_immutability_pros)
- 可審計資料系統工具, [用於可審計資料系統的工具](/tw/ch13#id366)
- Aurora(雲資料庫), [雲原生系統架構](/tw/ch1#sec_introduction_cloud_native)
- Aurora DSQL(資料庫)
- snapshot isolation support, [Snapshot Isolation and Repeatable Read](/tw/ch8#sec_transactions_snapshot_isolation)
- 自動縮放, [運維:自動/手動再平衡](/tw/ch7#sec_sharding_operations)
- Automerge (CRDT library), [同步引擎的利弊](/tw/ch6#pros-and-cons-of-sync-engines)
- 可用性, [可靠性與容錯](/tw/ch2#sec_introduction_reliability)
- (另見 fault tolerance)
- 在 CAP 定理中, [CAP 定理](/tw/ch10#the-cap-theorem)
- 領袖選舉, [共識的微妙之處](/tw/ch10#subtleties-of-consensus)
- 在服務級別協議(SLA)中, [響應時間指標的應用](/tw/ch2#sec_introduction_slo_sla)
- 可用區, [透過冗餘容忍硬體故障](/tw/ch2#tolerating-hardware-faults-through-redundancy), [讀己之寫](/tw/ch6#sec_replication_ryw)
- Avro(資料格式), [Avro](/tw/ch5#sec_encoding_avro)-[動態生成的模式](/tw/ch5#dynamically-generated-schemas)
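- dynamically generated schemas, [Dynamically generated schemas](/tw/ch5#dynamically-generated-schemas)
- object container files, [But what is the writer's schema?](/tw/ch5#but-what-is-the-writers-schema), [Archival storage](/tw/ch5#archival-storage)
- reader determining writer's schema, [But what is the writer's schema?](/tw/ch5#but-what-is-the-writers-schema)
- schema evolution, [The writer's schema and the reader's schema](/tw/ch5#the-writers-schema-and-the-readers-schema)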
- 批次處理中的用途, [MapReduce](/tw/ch11#sec_batch_mapreduce)
- awk (Unix tool), [Simple Log Analysis](/tw/ch11#sec_batch_log_analysis), [Distributed Job Orchestration](/tw/ch11#id278)
- Axon Framework, [事件溯源與 CQRS](/tw/ch3#sec_datamodels_events)
- Azkaban(工作流排程器), [批處理](/tw/ch11#ch_batch)
- Azure Blob Storage(物件儲存), [雲服務的分層](/tw/ch1#layering-of-cloud-services), [設定新的副本](/tw/ch6#sec_replication_new_replica)
- 有條件的標題, [隔離殭屍程序和延遲請求](/tw/ch9#sec_distributed_fencing_tokens)
- Azure managed disks, [儲存與計算的分離](/tw/ch1#sec_introduction_storage_compute)
- Azure SQL DB(資料庫), [雲原生系統架構](/tw/ch1#sec_introduction_cloud_native)
- Azure Storage, [物件儲存](/tw/ch11#id277)
- Azure Synapse Analytics(資料庫), [雲原生系統架構](/tw/ch1#sec_introduction_cloud_native)
- Azure Virtual Machines
- spot virtual machines, [Handling Faults](/tw/ch11#id281)
### B
- B-trees (indexes), [B-Trees](/tw/ch4#sec_storage_b_trees)-[B-tree variants](/tw/ch4#b-tree-variants)
- B+ trees, [B 樹變體](/tw/ch4#b-tree-variants)
- 分支因子, [B 樹](/tw/ch4#sec_storage_b_trees)
- comparison to LSM-trees, [比較 B 樹與 LSM 樹](/tw/ch4#sec_storage_btree_lsm_comparison)-[磁碟空間使用](/tw/ch4#disk-space-usage)
- 崩潰恢復, [使 B 樹可靠](/tw/ch4#sec_storage_btree_wal)
- 透過分割頁面增長, [B 樹](/tw/ch4#sec_storage_b_trees)
- 不可變變種, [B 樹變體](/tw/ch4#b-tree-variants), [索引與快照隔離](/tw/ch8#indexes-and-snapshot-isolation)
- similarity to shard splitting, [Rebalancing key-range sharded data](/tw/ch7#rebalancing-key-range-sharded-data)
- 變體, [B 樹變體](/tw/ch4#b-tree-variants)
- B2(物件儲存), [分散式檔案系統](/tw/ch11#sec_batch_dfs)
- Backblaze B2(見 B2(物件儲存))
- 後端, [資料系統架構中的權衡](/tw/ch1#ch_tradeoffs)
- backoff, exponential, [Describing Performance](/tw/ch2#sec_introduction_percentiles), [Handling errors and aborts](/tw/ch8#handling-errors-and-aborts)
- 背壓, [描述效能](/tw/ch2#sec_introduction_percentiles), [讀取效能](/tw/ch4#read-performance), [訊息傳遞系統](/tw/ch12#sec_stream_messaging), [術語表](/tw/glossary)
- 分批處理, [工作流排程](/tw/ch11#sec_batch_workflows)
- in TCP, [TCP 的侷限性](/tw/ch9#sec_distributed_tcp)
- 備份
- 用於複製的資料庫快照, [設定新的副本](/tw/ch6#sec_replication_new_replica)
- 在多使用者系統中, [面向多租戶的分片](/tw/ch7#sec_sharding_multitenancy)
- 完整性, [不要盲目信任承諾](/tw/ch13#id364)
- snapshot isolation, [Snapshot Isolation and Repeatable Read](/tw/ch8#sec_transactions_snapshot_isolation)
- 使用物件儲存, [設定新的副本](/tw/ch6#sec_replication_new_replica)
- 相對複製, [複製](/tw/ch6#ch_replication)
- 向後相容, [編碼與演化](/tw/ch5#ch_encoding)
- BadgerDB(資料庫)
- 可序列事務, [可序列化快照隔離(SSI)](/tw/ch8#sec_transactions_ssi)
- BASE, contrast to ACID, [ACID 的含義](/tw/ch8#sec_transactions_acid)
- Bash shell (Unix), [Storage and Indexing for OLTP](/tw/ch4#sec_storage_oltp)
- 批處理, [批處理](/tw/ch11#ch_batch)-[本章小結](/tw/ch11#id292), [術語表](/tw/glossary)
- 方案規劃和職能規劃, [MapReduce](/tw/ch11#sec_batch_mapreduce)
- 惠益, [批處理](/tw/ch11#ch_batch)
- 結合流處理, [統一批處理和流處理](/tw/ch13#id338)
- 與流處理的比較, [流處理](/tw/ch12#sec_stream_processing)
- 資料流引擎, [資料流引擎](/tw/ch11#sec_batch_dataflow)-[資料流引擎](/tw/ch11#sec_batch_dataflow)
- fault tolerance, [Handling Faults](/tw/ch11#id281), [Messaging Systems](/tw/ch12#sec_stream_messaging)
- 資料整合, [批處理與流處理](/tw/ch13#sec_future_batch_streaming)-[統一批處理和流處理](/tw/ch13#id338)
- 圖表和迭代處理, [機器學習](/tw/ch11#id290)
- high-level APIs and languages, [查詢語言](/tw/ch11#sec_batch_query_lanauges)-[查詢語言](/tw/ch11#sec_batch_query_lanauges)
- 雲資料倉庫中, [查詢語言](/tw/ch11#sec_batch_query_lanauges)
- 在分散式系統中, [分散式系統中的批處理](/tw/ch11#sec_batch_distributed)
- 加入和分組, [JOIN 與 GROUP BY](/tw/ch11#sec_batch_join)-[JOIN 與 GROUP BY](/tw/ch11#sec_batch_join)
- 限制, [批處理](/tw/ch11#ch_batch)
- 基於日誌的資訊和, [重播舊訊息](/tw/ch12#sec_stream_replay)
- 保持衍生狀態, [維護派生狀態](/tw/ch13#id446)
- 衡量業績, [批處理](/tw/ch11#ch_batch)
- 模式, [批處理模型](/tw/ch11#id431)
- 資源分配, [資源分配](/tw/ch11#id279)-[資源分配](/tw/ch11#id279)
- 資源管理員, [分散式作業編排](/tw/ch11#id278)
- 排程器, [分散式作業編排](/tw/ch11#id278)
- 服務衍生資料, [對外提供派生資料](/tw/ch11#sec_batch_serving_derived)-[對外提供派生資料](/tw/ch11#sec_batch_serving_derived)
- shuffling data, [Shuffling Data](/tw/ch11#sec_shuffle)
- 任務執行, [分散式作業編排](/tw/ch11#id278)
- use cases, [Batch Use Cases](/tw/ch11#sec_batch_output)-[Serving Derived Data](/tw/ch11#sec_batch_serving_derived)
- 使用 Unix 工具(例如), [使用 Unix 工具的批處理](/tw/ch11#sec_batch_unix)-[排序與記憶體聚合](/tw/ch11#id275)
- 批處理框架
- 與作業系統的比較, [分散式系統中的批處理](/tw/ch11#sec_batch_distributed)
- Beam (資料流庫), [統一批處理和流處理](/tw/ch13#id338)
- BERT (language model), [向量嵌入](/tw/ch4#id92)
- 偏向, [偏見與歧視](/ch14#id370)
- bidirectional replication(見 multi-leader replication)
- big ball of mud, [Simplicity: Managing Complexity](/tw/ch2#id38)
- 大資料
- 對資料最小化, [資料系統、法律與社會](/tw/ch1#sec_introduction_compliance), [立法與自律](/ch14#sec_future_legislation)
- BigQuery(資料庫), [雲原生系統架構](/tw/ch1#sec_introduction_cloud_native), [雲資料倉庫](/tw/ch4#sec_cloud_data_warehouses), [批處理](/tw/ch11#ch_batch)
- DataFrames, [查詢語言](/tw/ch11#sec_batch_query_lanauges)
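- sharding and clustering, [Sharding by hash range](/tw/ch7#sharding-by-hash-range)
- shuffling data, [Shuffling Data](/tw/ch11#sec_shuffle)
- snapshot isolation support, [Snapshot Isolation and Repeatable Read](/tw/ch8#sec_transactions_snapshot_isolation)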
- Bigtable(資料庫)
- sharding scheme, [Sharding by Key Range](/tw/ch7#sec_sharding_key_range)
- storage layout, [Constructing and merging SSTables](/tw/ch4#constructing-and-merging-sstables)
- tablets (sharding), [Sharding](/tw/ch7#ch_sharding)
- wide-column data model, [Data locality for reads and writes](/tw/ch3#sec_datamodels_document_locality), [Column Compression](/tw/ch4#sec_storage_column_compression)
- 二進位制資料編碼, [二進位制編碼](/tw/ch5#binary-encoding)-[模式的優點](/tw/ch5#sec_encoding_schemas)
- Avro, [Avro](/tw/ch5#sec_encoding_avro)-[動態生成的模式](/tw/ch5#dynamically-generated-schemas)
- MessagePack, [二進位制編碼](/tw/ch5#binary-encoding)-[二進位制編碼](/tw/ch5#binary-encoding)
- Protocol Buffers, [Protocol Buffers](/tw/ch5#sec_encoding_protobuf)-[欄位標籤與模式演化](/tw/ch5#field-tags-and-schema-evolution)
- 二進位制編碼
- based on schemas, [The Merits of Schemas](/tw/ch5#sec_encoding_schemas)
- by network drivers, [The Merits of Schemas](/tw/ch5#sec_encoding_schemas)
- binary strings, lack of support in JSON and XML, [JSON、XML 及其二進位制變體](/tw/ch5#sec_encoding_json)
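- Bitcoin (cryptocurrency), [Tools for auditable data systems](/tw/ch13#id366)
- Byzantine fault tolerance, [Byzantine Faults](/tw/ch9#sec_distributed_byzantine)
- concurrency bugs in exchanges, [Weak Isolation Levels](/tw/ch8#sec_transactions_isolation_levels)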
- 點陣圖索引, [列壓縮](/tw/ch4#sec_storage_column_compression)
- BitTorrent uTP protocol, [TCP 的侷限性](/tw/ch9#sec_distributed_tcp)
- Bkd-trees (indexes), [Multidimensional and Full-Text Indexes](/tw/ch4#sec_storage_multidimensional)
- blameless postmortems, [Humans and Reliability](/tw/ch2#id31)
- Blazegraph(資料庫), [圖資料模型](/tw/ch3#sec_datamodels_graph)
- SPARQL 查詢語言, [SPARQL 查詢語言](/tw/ch3#the-sparql-query-language)
- blob storage(見 object storage)
- 塊, [分散式檔案系統](/tw/ch11#sec_batch_dfs)
- 塊裝置(磁碟), [儲存與計算的分離](/tw/ch1#sec_introduction_storage_compute)
- blockchains, [Summary](/tw/ch3#summary)
- Byzantine fault tolerance, [Byzantine Faults](/tw/ch9#sec_distributed_byzantine), [Consensus](/tw/ch10#sec_consistency_consensus), [Tools for auditable data systems](/tw/ch13#id366)
- blocking atomic commit, [Three-phase commit](/tw/ch8#three-phase-commit)
- Bloom 過濾器(演算法), [布隆過濾器](/tw/ch4#bloom-filters), [讀取效能](/tw/ch4#read-performance), [流分析](/tw/ch12#id318)
- BookKeeper (replicated log), [將工作分配給節點](/tw/ch10#allocating-work-to-nodes)
- bounded datasets, [Stream Processing](/tw/ch12#ch_stream), [Glossary](/tw/glossary)
- (另見 batch processing)
- 受限延遲, [術語表](/tw/glossary)
- 在網路中, [同步與非同步網路](/tw/ch9#sec_distributed_sync_networks)
- 程序暫停, [響應時間保證](/tw/ch9#sec_distributed_clocks_realtime)
- 廣播
- 全序廣播(見 shared logs)
- 無中介訊息, [直接從生產者傳遞給消費者](/tw/ch12#id296)
- Brubeck (metrics aggregator), [Direct messaging from producers to consumers](/tw/ch12#id296)
- BTM (transaction coordinator), [兩階段提交(2PC)](/tw/ch8#sec_transactions_2pc)
- 緩衝
- Bufstream(訊息系統), [設定新的副本](/tw/ch6#sec_replication_new_replica)
- Bufstream(訊息系統), [磁碟空間使用](/tw/ch12#sec_stream_disk_usage)
- 新建或購買, [雲服務與自託管](/tw/ch1#sec_introduction_cloud)
- bursty network traffic patterns, [Can we not simply make network delays predictable?](/tw/ch9#can-we-not-simply-make-network-delays-predictable)
- 商業分析員, [分析型與事務型系統](/tw/ch1#sec_introduction_analytics), [從資料倉庫到資料湖](/tw/ch1#from-data-warehouse-to-data-lake)
- 商業資料處理, [事務處理與分析的特徵](/tw/ch1#sec_introduction_oltp)
- 商業情報, [分析型與事務型系統](/tw/ch1#sec_introduction_analytics)-[資料倉庫](/tw/ch1#sec_introduction_dwh)
- Business Process Execution Language (BPEL), [持久化執行與工作流](/tw/ch5#sec_encoding_dataflow_workflows)
- Business Process Model and Notation (BPMN), [持久化執行與工作流](/tw/ch5#sec_encoding_dataflow_workflows)
- 例項, [持久化執行與工作流](/tw/ch5#sec_encoding_dataflow_workflows)
- 位元組序列,編碼資料, [編碼資料的格式](/tw/ch5#sec_encoding_formats)
- Byzantine faults, [Byzantine Faults](/tw/ch9#sec_distributed_byzantine)-[Weak forms of lying](/tw/ch9#weak-forms-of-lying), [System Model and Reality](/tw/ch9#sec_distributed_system_model), [Glossary](/tw/glossary)
- Byzantine fault-tolerant systems, [Byzantine Faults](/tw/ch9#sec_distributed_byzantine)
- Byzantine Generals Problem, [Byzantine Faults](/tw/ch9#sec_distributed_byzantine)
- consensus algorithms and, [Consensus](/tw/ch10#sec_consistency_consensus), [Tools for auditable data systems](/tw/ch13#id366)
### C
- 快取, [全記憶體儲存](/tw/ch4#sec_storage_inmemory), [術語表](/tw/glossary)
- and materialized views, [Materialized Views and Data Cubes](/tw/ch4#sec_storage_materialized_views)
- 作為衍生資料, [記錄系統與派生資料](/tw/ch1#sec_introduction_derived), [組合使用資料儲存技術](/tw/ch13#id447)-[分拆系統與整合系統](/tw/ch13#id448)
- in CPUs, [查詢執行:編譯與向量化](/tw/ch4#sec_storage_vectorized), [線性一致性與網路延遲](/tw/ch10#linearizability-and-network-delays)
- invalidation and maintenance, [Keeping Systems in Sync](/tw/ch12#sec_stream_sync), [Maintaining materialized views](/tw/ch12#sec_stream_mat_view)
- 線性一致性, [線性一致性](/tw/ch10#sec_consistency_linearizability)
- 雲中的本地磁碟, [儲存與計算的分離](/tw/ch1#sec_introduction_storage_compute)
- 日曆同步, [同步引擎與本地優先軟體](/tw/ch6#sec_replication_offline_clients), [同步引擎的利弊](/tw/ch6#pros-and-cons-of-sync-engines)
- California Consumer Privacy Act (CCPA), [資料系統、法律與社會](/tw/ch1#sec_introduction_compliance)
- Camunda(工作流程引擎), [持久化執行與工作流](/tw/ch5#sec_encoding_dataflow_workflows)
- (資料), [記錄系統與派生資料](/tw/ch1#sec_introduction_derived)
- CAP定理, [CAP 定理](/tw/ch10#the-cap-theorem)-[CAP 定理](/tw/ch10#the-cap-theorem), [術語表](/tw/glossary)
- 能力規劃, [雲時代的運維](/tw/ch1#sec_introduction_operations)
- Cap'n Proto(資料格式), [編碼資料的格式](/tw/ch5#sec_encoding_formats)
- 碳排放, [分散式與單節點系統](/tw/ch1#sec_introduction_distributed)
- 級聯中止, [沒有髒讀](/tw/ch8#no-dirty-reads)
- 連鎖失敗, [軟體故障](/tw/ch2#software-faults), [運維:自動/手動再平衡](/tw/ch7#sec_sharding_operations), [超時和無界延遲](/tw/ch9#sec_distributed_queueing)
- Cassandra(資料庫)
- 資料變更捕獲, [資料變更捕獲的實現](/tw/ch12#id307), [變更流的 API 支援](/tw/ch12#sec_stream_change_api)
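- compaction strategy, [Compaction strategies](/tw/ch4#sec_storage_lsm_compaction)
- consistency level ANY, [Single-leader versus leaderless replication performance](/tw/ch6#sec_replication_leaderless_perf)
- hash sharding, [Sharding by Hash of Key](/tw/ch7#sec_sharding_hash), [Sharding by hash range](/tw/ch7#sharding-by-hash-range)
- last-write-wins conflict resolution, [Detecting Concurrent Writes](/tw/ch6#sec_replication_concurrent)
- leaderless replication, [Leaderless Replication](/tw/ch6#sec_replication_leaderless)
- lightweight transactions, [Single-object writes](/tw/ch8#sec_transactions_single_object)
- linearizability, lack of, [Implementing Linearizable Systems](/tw/ch10#sec_consistency_implementing_linearizable)
- log-structured storage, [Constructing and merging SSTables](/tw/ch4#constructing-and-merging-sstables)
- multi-region support, [Multi-region operation](/tw/ch6#multi-region-operation)
- secondary indexes, [Local Secondary Indexes](/tw/ch7#id166)
- use of clocks, [Limitations of Quorum Consistency](/tw/ch6#sec_replication_quorum_limitations), [Timestamps for ordering events](/tw/ch9#sec_distributed_lww)
- vnodes (sharding), [Sharding](/tw/ch7#ch_sharding)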
- cat (Unix tool), [Simple Log Analysis](/tw/ch11#sec_batch_log_analysis)
- 目錄, [雲資料倉庫](/tw/ch4#sec_cloud_data_warehouses)
- 因果關係, [版本向量](/tw/ch6#version-vectors)
- (另見 causal dependencies)
- 因果關係, ["先發生"關係與併發](/tw/ch6#sec_replication_happens_before)-[版本向量](/tw/ch6#version-vectors)
- 捕獲, [版本向量](/tw/ch6#version-vectors), [排序事件以捕獲因果關係](/tw/ch13#sec_future_capture_causality), [讀也是事件](/tw/ch13#sec_future_read_events)
- by total order, [The limits of total ordering](/tw/ch13#id335)
- 事務中, [基於過時前提的決策](/tw/ch8#decisions-based-on-an-outdated-premise)
- 向朋友傳送訊息(例如), [排序事件以捕獲因果關係](/tw/ch13#sec_future_capture_causality)
- 因果關係, [術語表](/tw/glossary)
- 因果順序
- 與, [邏輯時鐘](/tw/ch10#sec_consistency_timestamps)
- 與, [邏輯時鐘](/tw/ch10#sec_consistency_timestamps)-[使用邏輯時鐘強制約束](/tw/ch10#enforcing-constraints-using-logical-clocks)
- happens-before relation, [The Happens-Before Relation and Concurrency](/tw/ch6#sec_replication_happens_before)
- 在可序列事務中, [基於過時前提的決策](/tw/ch8#decisions-based-on-an-outdated-premise)-[檢測影響先前讀取的寫入](/tw/ch8#sec_detecting_writes_affect_reads)
- 與時鐘不符, [用於事件排序的時間戳](/tw/ch9#sec_distributed_lww)
- 命令要抓取的事件, [排序事件以捕獲因果關係](/tw/ch13#sec_future_capture_causality)
- violations of, [Consistent Prefix Reads](/tw/ch6#sec_replication_consistent_prefix), [Problems with different topologies](/tw/ch6#problems-with-different-topologies), [Timestamps for ordering events](/tw/ch9#sec_distributed_lww)
- 帶有同步時鐘, [用於全域性快照的同步時鐘](/tw/ch9#sec_distributed_spanner)
- 基於單元格的架構, [面向多租戶的分片](/tw/ch7#sec_sharding_multitenancy)
- CEP (see complex event processing)
- CephFS(分散式檔案系統), [批處理](/tw/ch11#ch_batch), [物件儲存](/tw/ch11#id277)
- 證書透明性, [用於可審計資料系統的工具](/tw/ch13#id366)
- cgroups, [Distributed Job Orchestration](/tw/ch11#id278)
- 資料變更捕獲, [邏輯(基於行)日誌複製](/tw/ch6#logical-row-based-log-replication), [資料變更捕獲](/tw/ch12#sec_stream_cdc)
- 變更流的 API 支援, [變更流的 API 支援](/tw/ch12#sec_stream_change_api)
- comparison to event sourcing, [Change data capture versus event sourcing](/tw/ch12#sec_stream_event_sourcing)
- implementing, [Implementing change data capture](/tw/ch12#id307)
- 初始快照, [初始快照](/tw/ch12#sec_stream_cdc_snapshot)
- 日誌壓縮, [日誌壓縮](/tw/ch12#sec_stream_log_compaction)
- 更改日誌, [狀態、流和不變性](/tw/ch12#sec_stream_immutability)
- 資料變更捕獲, [資料變更捕獲](/tw/ch12#sec_stream_cdc)
- 操作狀態, [失敗後重建狀態](/tw/ch12#sec_stream_state_fault_tolerance)
- in stream joins, [Stream-table join (stream enrichment)](/tw/ch12#sec_stream_table_joins)
- 日誌壓縮, [日誌壓縮](/tw/ch12#sec_stream_log_compaction)
- 保持衍生狀態, [資料庫與流](/tw/ch12#sec_stream_databases)
- chaos engineering, [Fault Tolerance](/tw/ch2#id27), [Fault injection](/tw/ch9#sec_fault_injection)
- checkpointing
- 在高效能計算中, [雲計算與超級計算](/tw/ch1#id17)
- 在流處理器中, [微批次與存檔點](/tw/ch12#id329)
- 斷路器(限制重試), [描述效能](/tw/ch2#sec_introduction_percentiles)
- 電路交換網路, [同步與非同步網路](/tw/ch9#sec_distributed_sync_networks)
- 迴圈緩衝器, [磁碟空間使用](/tw/ch12#sec_stream_disk_usage)
- circular replication topologies, [Multi-leader replication topologies](/tw/ch6#sec_replication_topologies)
- Citus(資料庫)
- hash sharding, [Fixed number of shards](/tw/ch7#fixed-number-of-shards)
- ClickHouse(資料庫), [事務處理與分析的特徵](/tw/ch1#sec_introduction_oltp), [雲原生系統架構](/tw/ch1#sec_introduction_cloud_native)
- 增量檢視維護, [維護物化檢視](/tw/ch12#sec_stream_mat_view)
- 點選流資料,分析, [JOIN 與 GROUP BY](/tw/ch11#sec_batch_join)
- clients
- calling services, [流經服務的資料流:REST 與 RPC](/tw/ch5#sec_encoding_dataflow_rpc)
- offline, [同步引擎與本地優先軟體](/tw/ch6#sec_replication_offline_clients), [有狀態、可離線的客戶端](/tw/ch13#id347)
- pushing state changes to, [將狀態變更推送給客戶端](/tw/ch13#id348)
- 請求路由, [請求路由](/tw/ch7#sec_sharding_routing)
- ClockBound(時間同步), [帶置信區間的時鐘讀數](/tw/ch9#clock-readings-with-a-confidence-interval)
- use in YugabyteDB, [用於全域性快照的同步時鐘](/tw/ch9#sec_distributed_spanner)
- 時鐘, [不可靠的時鐘](/tw/ch9#sec_distributed_clocks)-[限制垃圾回收的影響](/tw/ch9#sec_distributed_gc_impact)
- 原子鐘, [帶置信區間的時鐘讀數](/tw/ch9#clock-readings-with-a-confidence-interval), [用於全域性快照的同步時鐘](/tw/ch9#sec_distributed_spanner)
- confidence intervals, [帶置信區間的時鐘讀數](/tw/ch9#clock-readings-with-a-confidence-interval)-[用於全域性快照的同步時鐘](/tw/ch9#sec_distributed_spanner)
- for global snapshots, [用於全域性快照的同步時鐘](/tw/ch9#sec_distributed_spanner)
- 混合邏輯時鐘, [混合邏輯時鐘](/tw/ch10#hybrid-logical-clocks)
- logical(見 logical clocks)
- 偏斜, [最後寫入勝利(丟棄併發寫入)](/tw/ch6#sec_replication_lww), [仲裁一致性的侷限](/tw/ch6#sec_replication_quorum_limitations), [對同步時鐘的依賴](/tw/ch9#sec_distributed_clocks_relying)-[帶置信區間的時鐘讀數](/tw/ch9#clock-readings-with-a-confidence-interval), [實現線性一致性系統](/tw/ch10#sec_consistency_implementing_linearizable)
- slewing, [單調時鐘](/tw/ch9#monotonic-clocks)
- 同步和準確性, [時鐘同步和準確性](/tw/ch9#sec_distributed_clock_accuracy)-[時鐘同步和準確性](/tw/ch9#sec_distributed_clock_accuracy)
- synchronization using GPS, [不可靠的時鐘](/tw/ch9#sec_distributed_clocks), [時鐘同步和準確性](/tw/ch9#sec_distributed_clock_accuracy), [帶置信區間的時鐘讀數](/tw/ch9#clock-readings-with-a-confidence-interval), [用於全域性快照的同步時鐘](/tw/ch9#sec_distributed_spanner)
- time-of-day versus monotonic clocks, [單調時鐘與日曆時鐘](/tw/ch9#sec_distributed_monotonic_timeofday)
- timestamping events, [你用的是誰的時鐘?](/tw/ch12#id438)
- 雲服務, [雲服務與自託管](/tw/ch1#sec_introduction_cloud)-[雲計算與超級計算](/tw/ch1#id17)
- 可用區, [透過冗餘容忍硬體故障](/tw/ch2#tolerating-hardware-faults-through-redundancy), [讀己之寫](/tw/ch6#sec_replication_ryw)
- 資料倉庫, [雲資料倉庫](/tw/ch4#sec_cloud_data_warehouses)
- 需要發現服務, [服務發現](/tw/ch10#service-discovery)
- 網路故障, [實踐中的網路故障](/tw/ch9#sec_distributed_network_faults)
- 利弊關係, [雲服務的利弊](/tw/ch1#sec_introduction_cloud_tradeoffs)-[雲服務的利弊](/tw/ch1#sec_introduction_cloud_tradeoffs)
- 配額, [雲時代的運維](/tw/ch1#sec_introduction_operations)
- regions(見 regions (geographic distribution))
- 無伺服器, [微服務與無伺服器](/tw/ch1#sec_introduction_microservices)
- 共享資源, [網路擁塞和排隊](/tw/ch9#network-congestion-and-queueing)
- versus supercomputing, [雲計算與超級計算](/tw/ch1#id17)
- cloud-native, [雲原生系統架構](/tw/ch1#sec_introduction_cloud_native)-[雲時代的運維](/tw/ch1#sec_introduction_operations)
- Cloudflare
- R2(見 R2(物件儲存))
- clustered index, [在索引中儲存值](/tw/ch4#sec_storage_index_heap)
- clustering (record ordering), [按雜湊範圍分片](/tw/ch7#sharding-by-hash-range)
- CockroachDB(資料庫)
- 基於共識的複製, [單主複製](/tw/ch6#sec_replication_leader)
- 一致性模式, [什麼使系統具有線性一致性?](/tw/ch10#sec_consistency_lin_definition)
- key-range sharding, [分片](/tw/ch7#ch_sharding), [按鍵的範圍分片](/tw/ch7#sec_sharding_key_range)
- serializable transactions, [可序列化快照隔離(SSI)](/tw/ch8#sec_transactions_ssi)
- sharded secondary indexes, [全域性二級索引](/tw/ch7#id167)
- 事務, [事務到底是什麼?](/tw/ch8#sec_transactions_overview), [資料庫內部的分散式事務](/tw/ch8#sec_transactions_internal)
- 使用模型檢查, [模型檢查與規範語言](/tw/ch9#model-checking-and-specification-languages)
- 程式碼生成
- 用於查詢執行, [查詢執行:編譯與向量化](/tw/ch4#sec_storage_vectorized)
- 帶有協議緩衝, [Protocol Buffers](/tw/ch5#sec_encoding_protobuf)
- 協作編輯, [即時協作、離線優先和本地優先應用](/tw/ch6#real-time-collaboration-offline-first-and-local-first-apps)
- column families (Bigtable), [讀寫的資料區域性](/tw/ch3#sec_datamodels_document_locality), [列壓縮](/tw/ch4#sec_storage_column_compression)
- 面向列的儲存, [列式儲存](/tw/ch4#sec_storage_column)-[查詢執行:編譯與向量化](/tw/ch4#sec_storage_vectorized)
- 列壓縮, [列壓縮](/tw/ch4#sec_storage_column_compression)
- Parquet, [列式儲存](/tw/ch4#sec_storage_column), [歸檔儲存](/tw/ch5#archival-storage)
- sort order in, [列儲存中的排序順序](/tw/ch4#sort-order-in-column-storage)-[列儲存中的排序順序](/tw/ch4#sort-order-in-column-storage)
- vectorized processing, [查詢執行:編譯與向量化](/tw/ch4#sec_storage_vectorized)
- versus wide-column model, [列壓縮](/tw/ch4#sec_storage_column_compression)
- 寫入, [寫入列式儲存](/tw/ch4#writing-to-column-oriented-storage)
- comma-separated values(見 CSV)
- 命令查詢責任分離, [事件溯源與 CQRS](/tw/ch3#sec_datamodels_events)-[事件溯源與 CQRS](/tw/ch3#sec_datamodels_events), [從同一事件日誌中派生多個檢視](/tw/ch12#sec_stream_deriving_views)
- commands (event sourcing), [事件溯源與 CQRS](/tw/ch3#sec_datamodels_events)
- commits (transactional), [事務](/tw/ch8#ch_transactions)
- 原子提交, [分散式事務](/tw/ch8#sec_transactions_distributed)-[再談恰好一次訊息處理](/tw/ch8#exactly-once-message-processing-revisited)
- (另見 原子性)
- read committed isolation, [讀已提交](/tw/ch8#sec_transactions_read_committed)
- three-phase commit (3PC), [三階段提交](/tw/ch8#three-phase-commit)
- 兩階段提交, [兩階段提交(2PC)](/tw/ch8#sec_transactions_2pc)-[協調器故障](/tw/ch8#coordinator-failure)
- commutative operations, [衝突解決與複製](/tw/ch8#conflict-resolution-and-replication)
- 壓實(Compaction)
- 更改日誌, [日誌壓縮](/tw/ch12#sec_stream_log_compaction)
- (另見 日誌壓縮)
- 流運算子狀態, [失敗後重建狀態](/tw/ch12#sec_stream_state_fault_tolerance)
- 日誌結構儲存, [構建和合並 SSTable](/tw/ch4#constructing-and-merging-sstables)
- 問題, [讀取效能](/tw/ch4#read-performance)
- size-tiered and leveled approaches, [壓實策略](/tw/ch4#sec_storage_lsm_compaction), [磁碟空間使用](/tw/ch4#disk-space-usage)
- 比較和設定, [條件寫入(比較並設定)](/tw/ch8#sec_transactions_compare_and_set), [什麼使系統具有線性一致性?](/tw/ch10#sec_consistency_lin_definition)
- implementing locks, [協調服務](/tw/ch10#sec_consistency_coordination)
- implementing uniqueness constraints, [約束與唯一性保證](/tw/ch10#sec_consistency_uniqueness)
- in object storage, [設定新的副本](/tw/ch6#sec_replication_new_replica)
- relation to consensus, [線性一致性與仲裁](/tw/ch10#sec_consistency_quorum_linearizable), [共識](/tw/ch10#sec_consistency_consensus), [比較並設定作為共識](/tw/ch10#compare-and-set-as-consensus)
- relation to fencing tokens, [隔離殭屍程序和延遲請求](/tw/ch9#sec_distributed_fencing_tokens)
- relation to transactions, [單物件寫入](/tw/ch8#sec_transactions_single_object)
- 相容性, [編碼與演化](/tw/ch5#ch_encoding), [資料流的模式](/tw/ch5#sec_encoding_dataflow)
- calling services, [RPC 的資料編碼與演化](/tw/ch5#data-encoding-and-evolution-for-rpc)
- 編碼格式的屬性, [總結](/tw/ch5#summary)
- 使用資料庫, [流經資料庫的資料流](/tw/ch5#sec_encoding_dataflow_db)-[歸檔儲存](/tw/ch5#archival-storage)
- 補償事務, [不可變事件的優點](/tw/ch12#sec_stream_immutability_pros), [寬鬆地解釋約束](/tw/ch13#id362)
- compilation, [查詢執行:編譯與向量化](/tw/ch4#sec_storage_vectorized)
- 複合事件處理, [複合事件處理](/tw/ch12#id317)
- 複雜度
- distilling in theoretical models, [將系統模型對映到現實世界](/tw/ch9#mapping-system-models-to-the-real-world)
- essential and accidental, [簡單性:管理複雜度](/tw/ch2#id38)
- 使用抽象來隱藏, [資料模型與查詢語言](/tw/ch3#ch_datamodels)
- 管理, [簡單性:管理複雜度](/tw/ch2#id38)
- composing data systems(見 unbundling databases)
- 壓縮
- in SSTables, [SSTable 檔案格式](/tw/ch4#the-sstable-file-format)
- 計算密集型應用程式, [資料系統架構中的權衡](/tw/ch1#ch_tradeoffs)
- 電腦遊戲, [同步引擎的利弊](/tw/ch6#pros-and-cons-of-sync-engines)
- concatenated indexes, [多維索引與全文索引](/tw/ch4#sec_storage_multidimensional)
- in hash-sharded systems, [按雜湊範圍分片](/tw/ch7#sharding-by-hash-range)
- 併發
- actor programming model, [分散式 actor 框架](/tw/ch5#distributed-actor-frameworks), [事件驅動架構與 RPC](/tw/ch12#sec_stream_actors_drpc)
- (另見 event-driven architecture)
- 事務隔離薄弱時出現的錯誤, [弱隔離級別](/tw/ch8#sec_transactions_isolation_levels)
- 解決衝突, [處理寫入衝突](/tw/ch6#sec_replication_write_conflicts)-[處理寫入衝突](/tw/ch6#sec_replication_write_conflicts)
- 定義, [處理寫入衝突](/tw/ch6#sec_replication_write_conflicts)
- detecting concurrent writes, [檢測併發寫入](/tw/ch6#sec_replication_concurrent)-[版本向量](/tw/ch6#version-vectors)
- dual writes, problems with, [保持系統同步](/tw/ch12#sec_stream_sync)
- happens-before relationship, ["先發生"關係與併發](/tw/ch6#sec_replication_happens_before)
- 在複製系統中, [複製延遲的問題](/tw/ch6#sec_replication_lag)-[版本向量](/tw/ch6#version-vectors), [線性一致性](/tw/ch10#sec_consistency_linearizability)-[線性一致性與網路延遲](/tw/ch10#linearizability-and-network-delays)
- 丟失更新, [防止丟失更新](/tw/ch8#sec_transactions_lost_update)
- 多版本併發控制, [多版本併發控制(MVCC)](/tw/ch8#sec_transactions_snapshot_impl), [用於全域性快照的同步時鐘](/tw/ch9#sec_distributed_spanner)
- 樂觀併發控制, [悲觀併發控制與樂觀併發控制](/tw/ch8#pessimistic-versus-optimistic-concurrency-control)
- ordering of operations, [什麼使系統具有線性一致性?](/tw/ch10#sec_consistency_lin_definition)
- 透過事件日誌減少, [併發控制](/tw/ch12#sec_stream_concurrency), [資料流:應用程式碼與狀態變化的互動](/tw/ch13#id450)
- 時間和相對性, ["先發生"關係與併發](/tw/ch6#sec_replication_happens_before)
- 事務隔離, [隔離性](/tw/ch8#sec_transactions_acid_isolation)
- 寫偏差, [寫偏差與幻讀](/tw/ch8#sec_transactions_write_skew)-[物化衝突](/tw/ch8#materializing-conflicts)
- 有條件寫入, [條件寫入(比較並設定)](/tw/ch8#sec_transactions_compare_and_set)
- 事務中, [單物件寫入](/tw/ch8#sec_transactions_single_object)
- 在物件儲存中, [設定新的副本](/tw/ch6#sec_replication_new_replica)
- 會議管理系統(例如), [事件溯源與 CQRS](/tw/ch3#sec_datamodels_events)
- conflict-free replicated datatypes (CRDTs), [CRDT 與操作變換](/tw/ch6#sec_replication_crdts)
- for leaderless replication, [捕獲先發生關係](/tw/ch6#capturing-the-happens-before-relationship)
- 防止丟失更新, [衝突解決與複製](/tw/ch8#conflict-resolution-and-replication)
- 衝突
- avoiding, [衝突避免](/tw/ch6#conflict-avoidance)
- 因果關係, ["先發生"關係與併發](/tw/ch6#sec_replication_happens_before)
- 衝突檢測
- 分散式事務, [XA 事務的問題](/tw/ch8#problems-with-xa-transactions)
- 在基於日誌的系統中, [唯一性約束需要達成共識](/tw/ch13#id452)
- in serializable snapshot isolation (SSI), [檢測影響先前讀取的寫入](/tw/ch8#sec_detecting_writes_affect_reads)
- 在兩階段提交中, [系統性的承諾](/tw/ch8#a-system-of-promises)
- 解決衝突
- 透過中止事務, [悲觀併發控制與樂觀併發控制](/tw/ch8#pessimistic-versus-optimistic-concurrency-control)
- 透過道歉, [寬鬆地解釋約束](/tw/ch13#id362)
- 最後寫入勝利, [用於事件排序的時間戳](/tw/ch9#sec_distributed_lww)
- 使用原子操作, [衝突解決與複製](/tw/ch8#conflict-resolution-and-replication)
- 確定什麼是衝突, [處理寫入衝突](/tw/ch6#sec_replication_write_conflicts), [基於日誌訊息傳遞中的唯一性](/tw/ch13#sec_future_uniqueness_log)
- 無領導複製, [檢測併發寫入](/tw/ch6#sec_replication_concurrent)
- 丟失更新, [防止丟失更新](/tw/ch8#sec_transactions_lost_update)-[衝突解決與複製](/tw/ch8#conflict-resolution-and-replication)
- materializing, [物化衝突](/tw/ch8#materializing-conflicts)
- resolution, [處理寫入衝突](/tw/ch6#sec_replication_write_conflicts)-[處理寫入衝突](/tw/ch6#sec_replication_write_conflicts)
- automatic, [自動衝突解決](/tw/ch6#automatic-conflict-resolution)
- in leaderless systems, [檢測併發寫入](/tw/ch6#sec_replication_concurrent)
- 最後寫入勝利, [最後寫入勝利(丟棄併發寫入)](/tw/ch6#sec_replication_lww)
- 使用自定義邏輯, [手動衝突解決](/tw/ch6#manual-conflict-resolution), [捕獲先發生關係](/tw/ch6#capturing-the-happens-before-relationship)
- siblings, [手動衝突解決](/tw/ch6#manual-conflict-resolution), [捕獲先發生關係](/tw/ch6#capturing-the-happens-before-relationship)
- 合併, [捕獲先發生關係](/tw/ch6#capturing-the-happens-before-relationship)
- 寫偏差, [寫偏差與幻讀](/tw/ch8#sec_transactions_write_skew)-[物化衝突](/tw/ch8#materializing-conflicts)
- Confluent
- Freight (messaging system), [設定新的副本](/tw/ch6#sec_replication_new_replica), [磁碟空間使用](/tw/ch12#sec_stream_disk_usage)
- schema registry, [JSON 模式](/tw/ch5#json-schema), [但什麼是寫入者模式?](/tw/ch5#but-what-is-the-writers-schema)
- 擁堵(網路)
- avoidance, [TCP 的侷限性](/tw/ch9#sec_distributed_tcp)
- 限制時鐘的準確性, [帶置信區間的時鐘讀數](/tw/ch9#clock-readings-with-a-confidence-interval)
- 排隊延遲, [網路擁塞和排隊](/tw/ch9#network-congestion-and-queueing)
- 共識, [共識](/tw/ch10#sec_consistency_consensus)-[總結](/tw/ch10#summary), [術語表](/tw/glossary)
- 演算法, [共識](/tw/ch10#sec_consistency_consensus), [共識的實踐](/tw/ch10#sec_consistency_total_order)
- consensus number, [獲取並增加作為共識](/tw/ch10#fetch-and-add-as-consensus)
- coordination services, [協調服務](/tw/ch10#sec_consistency_coordination)-[服務發現](/tw/ch10#service-discovery)
- cost of, [共識的利弊](/tw/ch10#pros-and-cons-of-consensus)
- impossibility of, [共識](/tw/ch10#sec_consistency_consensus)
- preventing split brain, [從單主複製到共識](/tw/ch10#from-single-leader-replication-to-consensus)
- reconfiguration, [共識的微妙之處](/tw/ch10#subtleties-of-consensus)
- 與原子承諾的關係, [原子提交作為共識](/tw/ch10#atomic-commitment-as-consensus)
- relation to compare-and-set (CAS), [線性一致性與仲裁](/tw/ch10#sec_consistency_quorum_linearizable), [比較並設定作為共識](/tw/ch10#compare-and-set-as-consensus)
- 與獲取和新增的關係, [獲取並增加作為共識](/tw/ch10#fetch-and-add-as-consensus)
- 與複製有關, [使用共享日誌](/tw/ch10#sec_consistency_smr)
- 與共享日誌的關係, [共享日誌作為共識](/tw/ch10#sec_consistency_shared_logs)
- relation to uniqueness constraints, [唯一性約束需要達成共識](/tw/ch13#id452)
- safety and liveness properties, [單值共識](/tw/ch10#single-value-consensus)
- single-value consensus, [單值共識](/tw/ch10#single-value-consensus)
- consent (GDPR), [同意與選擇自由](/ch14#id375)
- 一致性, [一致性](/tw/ch8#sec_transactions_acid_consistency), [及時性與完整性](/tw/ch13#sec_future_integrity)
- 跨越不同資料庫, [領導者故障:故障轉移](/tw/ch6#leader-failure-failover), [保持系統同步](/tw/ch12#sec_stream_sync), [從同一事件日誌中派生多個檢視](/tw/ch12#sec_stream_deriving_views), [派生資料與分散式事務](/tw/ch13#sec_future_derived_vs_transactions)
- 因果關係, [一致字首讀](/tw/ch6#sec_replication_consistent_prefix), [不同拓撲的問題](/tw/ch6#problems-with-different-topologies), [排序事件以捕獲因果關係](/tw/ch13#sec_future_capture_causality)
- 一致字首讀, [一致字首讀](/tw/ch6#sec_replication_consistent_prefix)-[一致字首讀](/tw/ch6#sec_replication_consistent_prefix)
- 一致的快照, [設定新的副本](/tw/ch6#sec_replication_new_replica), [快照隔離與可重複讀](/tw/ch8#sec_transactions_snapshot_isolation)-[快照隔離、可重複讀和命名混淆](/tw/ch8#snapshot-isolation-repeatable-read-and-naming-confusion), [用於全域性快照的同步時鐘](/tw/ch9#sec_distributed_spanner), [初始快照](/tw/ch12#sec_stream_cdc_snapshot), [建立索引](/tw/ch13#id340)
- (另見 snapshots)
- 崩潰恢復, [使 B 樹可靠](/tw/ch4#sec_storage_btree_wal)
- enforcing constraints(見 constraints)
- 最終, [複製延遲的問題](/tw/ch6#sec_replication_lag)
- (另見 最終一致性)
- in ACID transactions, [一致性](/tw/ch8#sec_transactions_acid_consistency), [維護完整性,儘管軟體有Bug](/tw/ch13#id455)
- 在 CAP 定理中, [CAP 定理](/tw/ch10#the-cap-theorem)
- 領袖選舉, [共識的微妙之處](/tw/ch10#subtleties-of-consensus)
- 微服務, [分散式系統的問題](/tw/ch1#sec_introduction_dist_sys_problems)
- 線性一致性, [複製延遲的解決方案](/tw/ch6#id131), [線性一致性](/tw/ch10#sec_consistency_linearizability)-[線性一致性與網路延遲](/tw/ch10#linearizability-and-network-delays)
- 含義, [一致性](/tw/ch8#sec_transactions_acid_consistency)
- 單調讀, [單調讀](/tw/ch6#sec_replication_monotonic_reads)-[單調讀](/tw/ch6#sec_replication_monotonic_reads)
- secondary indexes, [多物件事務的需求](/tw/ch8#sec_transactions_need), [索引與快照隔離](/tw/ch8#indexes-and-snapshot-isolation), [理解資料流](/tw/ch13#id443), [建立索引](/tw/ch13#id340)
- 讀後寫, [讀己之寫](/tw/ch6#sec_replication_ryw)-[讀己之寫](/tw/ch6#sec_replication_ryw)
- 在衍生資料系統中, [派生資料與分散式事務](/tw/ch13#sec_future_derived_vs_transactions)
- strong(見 線性一致性)
- 及時性和完整性, [及時性與完整性](/tw/ch13#sec_future_integrity)
- 使用法定人數, [仲裁一致性的侷限](/tw/ch6#sec_replication_quorum_limitations), [線性一致性與仲裁](/tw/ch10#sec_consistency_quorum_linearizable)
- consistent hashing, [一致性雜湊](/tw/ch7#sec_sharding_consistent_hashing)
- 一致字首讀, [一致字首讀](/tw/ch6#sec_replication_consistent_prefix)
- constraints (databases), [一致性](/tw/ch8#sec_transactions_acid_consistency), [寫偏差的特徵](/tw/ch8#characterizing-write-skew)
- asynchronously checked, [寬鬆地解釋約束](/tw/ch13#id362)
- coordination avoidance, [無協調資料系統](/tw/ch13#id454)
- ensuring idempotence, [操作識別符號](/tw/ch13#id355)
- in log-based systems, [強制約束](/tw/ch13#sec_future_constraints)-[多分割槽請求處理](/tw/ch13#id360)
- across multiple shards, [多分割槽請求處理](/tw/ch13#id360)
- 在兩階段提交中, [分散式事務](/tw/ch8#sec_transactions_distributed), [系統性的承諾](/tw/ch8#a-system-of-promises)
- 與協商一致的關係, [唯一性約束需要達成共識](/tw/ch13#id452)
- 需要線性, [約束與唯一性保證](/tw/ch10#sec_consistency_uniqueness)
- Consul (coordination service), [協調服務](/tw/ch10#sec_consistency_coordination)
- 用於服務發現, [服務發現](/tw/ch10#service-discovery)
- 消費者(資訊流), [訊息代理](/tw/ch5#message-brokers), [傳遞事件流](/tw/ch12#sec_stream_transmit)
- 背壓, [訊息傳遞系統](/tw/ch12#sec_stream_messaging)
- 消費者群體, [多個消費者](/tw/ch12#id298)
- consumer offsets in logs, [消費者偏移量](/tw/ch12#sec_stream_log_offsets)
- 失敗, [確認與重新傳遞](/tw/ch12#sec_stream_reordering), [消費者偏移量](/tw/ch12#sec_stream_log_offsets)
- 扇出, [時間線的物化與更新](/tw/ch2#sec_introduction_materializing), [多個消費者](/tw/ch12#id298), [日誌與傳統的訊息傳遞相比](/tw/ch12#sec_stream_logs_vs_messaging)
- 負載平衡, [多個消費者](/tw/ch12#id298), [日誌與傳統的訊息傳遞相比](/tw/ch12#sec_stream_logs_vs_messaging)
- 未與生產者保持同步, [訊息傳遞系統](/tw/ch12#sec_stream_messaging), [磁碟空間使用](/tw/ch12#sec_stream_disk_usage), [開展分拆工作](/tw/ch13#sec_future_unbundling_favor)
- content models (JSON Schema), [JSON 模式](/tw/ch5#json-schema)
- contention
- between transactions, [處理錯誤和中止](/tw/ch8#handling-errors-and-aborts)
- blocking threads, [程序暫停](/tw/ch9#sec_distributed_clocks_pauses)
- performance of optimistic concurrency control, [悲觀併發控制與樂觀併發控制](/tw/ch8#pessimistic-versus-optimistic-concurrency-control)
- under two-phase locking, [兩階段鎖定的效能](/tw/ch8#performance-of-two-phase-locking)
- 上下文開關, [延遲與響應時間](/tw/ch2#id23), [程序暫停](/tw/ch9#sec_distributed_clocks_pauses)
- 收斂, [自動衝突解決](/tw/ch6#automatic-conflict-resolution)-[CRDT 與操作變換](/tw/ch6#sec_replication_crdts)
- 協調
- avoiding, [無協調資料系統](/tw/ch13#id454)
- cross-datacenter, [全序的限制](/tw/ch13#id335)
- cross-region, [跨地域執行](/tw/ch6#sec_replication_multi_dc)
- cross-shard ordering, [分片](/tw/ch8#sharding), [用於全域性快照的同步時鐘](/tw/ch9#sec_distributed_spanner), [使用共享日誌](/tw/ch10#sec_consistency_smr), [多分割槽請求處理](/tw/ch13#id360)
- routing requests to shards, [請求路由](/tw/ch7#sec_sharding_routing)
- 服務, [鎖定與領導者選舉](/tw/ch10#locking-and-leader-election), [協調服務](/tw/ch10#sec_consistency_coordination)-[服務發現](/tw/ch10#service-discovery)
- 協調者, [兩階段提交(2PC)](/tw/ch8#sec_transactions_2pc)
- 失效, [協調器故障](/tw/ch8#coordinator-failure)
- in XA transactions, [XA 事務](/tw/ch8#xa-transactions)-[XA 事務的問題](/tw/ch8#problems-with-xa-transactions)
- 恢復, [從協調器故障中恢復](/tw/ch8#recovering-from-coordinator-failure)
- copy-on-write (B-trees), [B 樹變體](/tw/ch4#b-tree-variants), [索引與快照隔離](/tw/ch8#indexes-and-snapshot-isolation)
- 公共物件請求代理體系結構, [遠端過程呼叫(RPC)的問題](/tw/ch5#sec_problems_with_rpc)
- coronal mass ejection(見 solar storm)
- 正確性
- 可審計性, [信任但驗證](/tw/ch13#sec_future_verification)-[用於可審計資料系統的工具](/tw/ch13#id366)
- 拜占庭斷層承受力, [拜占庭故障](/tw/ch9#sec_distributed_byzantine)
- 處理部分失敗, [故障與部分失效](/tw/ch9#sec_distributed_partial_failure)
- 在基於日誌的系統中, [強制約束](/tw/ch13#sec_future_constraints)-[多分割槽請求處理](/tw/ch13#id360)
- 系統模型中的演算法, [定義演算法的正確性](/tw/ch9#defining-the-correctness-of-an-algorithm)
- 生成資料, [為可審計性而設計](/tw/ch13#id365)
- 不可變資料, [不可變事件的優點](/tw/ch12#sec_stream_immutability_pros)
- 個人資料, [責任與問責](/ch14#id371), [隱私與資料使用](/ch14#id457)
- 時間, [不同拓撲的問題](/tw/ch6#problems-with-different-topologies), [時鐘同步和準確性](/tw/ch9#sec_distributed_clock_accuracy)-[用於全域性快照的同步時鐘](/tw/ch9#sec_distributed_spanner)
- 事務次數, [一致性](/tw/ch8#sec_transactions_acid_consistency), [追求正確性](/tw/ch13#sec_future_correctness), [維護完整性,儘管軟體有Bug](/tw/ch13#id455)
- 及時性和完整性, [及時性與完整性](/tw/ch13#sec_future_integrity)-[無協調資料系統](/tw/ch13#id454)
- 資料腐敗
- 檢測, [端到端原則](/tw/ch13#sec_future_e2e_argument), [不要盲目信任承諾](/tw/ch13#id364)-[用於可審計資料系統的工具](/tw/ch13#id366)
- 由於病態記憶體訪問, [硬體與軟體故障](/tw/ch2#sec_introduction_hardware_faults)
- 輻射所致, [拜占庭故障](/tw/ch9#sec_distributed_byzantine)
- due to split brain, [領導者故障:故障轉移](/tw/ch6#leader-failure-failover), [分散式鎖和租約](/tw/ch9#sec_distributed_lock_fencing)
- 由於事務隔離薄弱, [弱隔離級別](/tw/ch8#sec_transactions_isolation_levels)
- integrity as absence of, [及時性與完整性](/tw/ch13#sec_future_integrity)
- 網路包, [弱形式的謊言](/tw/ch9#weak-forms-of-lying)
- 磁碟, [永續性](/tw/ch8#durability)
- preventing using write-ahead logs, [使 B 樹可靠](/tw/ch4#sec_storage_btree_wal)
- recovering from, [批處理](/tw/ch11#ch_batch), [不可變事件的優點](/tw/ch12#sec_stream_immutability_pros)
- 餘弦相似性(語義搜尋), [向量嵌入](/tw/ch4#id92)
- Couchbase(資料庫)
- 文件資料模型, [關係模型與文件模型](/tw/ch3#sec_datamodels_history)
- 永續性, [全記憶體儲存](/tw/ch4#sec_storage_inmemory)
- hash sharding, [固定數量的分片](/tw/ch7#fixed-number-of-shards)
- join support, [文件和關係資料庫的融合](/tw/ch3#convergence-of-document-and-relational-databases)
- rebalancing, [運維:自動/手動再平衡](/tw/ch7#sec_sharding_operations)
- vBuckets (sharding), [分片](/tw/ch7#ch_sharding)
- CouchDB(資料庫)
- 作為同步引擎, [同步引擎的利弊](/tw/ch6#pros-and-cons-of-sync-engines)
- B-tree storage, [索引與快照隔離](/tw/ch8#indexes-and-snapshot-isolation)
- 解決衝突, [手動衝突解決](/tw/ch6#manual-conflict-resolution)
- 耦合(鬆緊), [可演化性:讓變化更容易](/tw/ch2#sec_introduction_evolvability)
- 覆蓋索引, [在索引中儲存值](/tw/ch4#sec_storage_index_heap)
- CozoDB(資料庫), [Datalog:遞迴關係查詢](/tw/ch3#id62)
- CPUs
- 快取一致性和記憶體障礙, [線性一致性與網路延遲](/tw/ch10#linearizability-and-network-delays)
- caching and pipelining, [查詢執行:編譯與向量化](/tw/ch4#sec_storage_vectorized)
- 計算錯誤的結果, [硬體與軟體故障](/tw/ch2#sec_introduction_hardware_faults)
- SIMD instructions, [查詢執行:編譯與向量化](/tw/ch4#sec_storage_vectorized)
- crash-stop and crash-recovery faults, [系統模型與現實](/tw/ch9#sec_distributed_system_model)
- CRDTs(見 conflict-free replicated datatypes)
- CREATE INDEX statement (SQL), [多列索引與二級索引](/tw/ch4#sec_storage_index_multicolumn), [建立索引](/tw/ch13#id340)
- 信用評級機構, [責任與問責](/ch14#id371)
- crypto-shredding, [事件溯源與 CQRS](/tw/ch3#sec_datamodels_events), [不變性的侷限性](/tw/ch12#sec_stream_immutability_limitations)
- cryptocurrencies, [總結](/tw/ch3#summary)
- 密碼學
- 防禦攻擊者, [拜占庭故障](/tw/ch9#sec_distributed_byzantine)
- 端到端加密和認證, [端到端原則](/tw/ch13#sec_future_e2e_argument)
- CSV (comma-separated values), [OLTP 系統的儲存與索引](/tw/ch4#sec_storage_oltp), [JSON、XML 及其二進位制變體](/tw/ch5#sec_encoding_json)
- Curator (ZooKeeper recipes), [鎖定與領導者選舉](/tw/ch10#locking-and-leader-election), [將工作分配給節點](/tw/ch10#allocating-work-to-nodes)
- Cypher(查詢語言), [Cypher 查詢語言](/tw/ch3#id57)
- comparison to SPARQL, [SPARQL 查詢語言](/tw/ch3#the-sparql-query-language)
### D
- Daft(處理框架)
- DataFrames, [DataFrames](/tw/ch11#id287)
- 移動資料, [混洗資料](/tw/ch11#sec_shuffle)
- Dagster(工作流排程器), [持久化執行與工作流](/tw/ch5#sec_encoding_dataflow_workflows), [批處理](/tw/ch11#ch_batch), [工作流排程](/tw/ch11#sec_batch_workflows)
- 雲資料倉整合, [查詢語言](/tw/ch11#sec_batch_query_lanauges)
- 儀表板(業務情報), [事務處理與分析的特徵](/tw/ch1#sec_introduction_oltp)
- Dask(處理框架), [資料框、矩陣與陣列](/tw/ch3#sec_datamodels_dataframes)
- 資料目錄, [雲資料倉庫](/tw/ch4#sec_cloud_data_warehouses)
- 資料聯結器, [資料倉庫](/tw/ch1#sec_introduction_dwh)
- 資料合同, [提取-轉換-載入(ETL)](/tw/ch11#sec_batch_etl_usage)
- 資料變更捕獲, [資料變更捕獲與事件溯源](/tw/ch12#sec_stream_event_sourcing)
- data corruption(見 corruption of data)
- 資料方塊, [物化檢視與資料立方體](/tw/ch4#sec_storage_materialized_views)
- 資料工程, [分析型與事務型系統](/tw/ch1#sec_introduction_analytics)
- data fabric, [提取-轉換-載入(ETL)](/tw/ch11#sec_batch_etl_usage)
- data formats(見 編碼)
- 資料基礎設施, [資料系統架構中的權衡](/tw/ch1#ch_tradeoffs)
- 資料整合, [資料整合](/tw/ch13#sec_future_integration)-[統一批處理和流處理](/tw/ch13#id338), [本章小結](/tw/ch13#id367)
- 批次和流處理, [批處理與流處理](/tw/ch13#sec_future_batch_streaming)-[統一批處理和流處理](/tw/ch13#id338)
- 保持衍生狀態, [維護派生狀態](/tw/ch13#id446)
- 後處理資料, [應用演化後重新處理資料](/tw/ch13#sec_future_reprocessing)
- 統一, [統一批處理和流處理](/tw/ch13#id338)
- 透過解開資料庫, [分拆資料庫](/tw/ch13#sec_future_unbundling)-[多分割槽資料處理](/tw/ch13#sec_future_unbundled_multi_shard)
- 與聯邦資料庫的比較, [一切的元資料庫](/tw/ch13#id341)
- 透過生成資料合併工具, [組合使用派生資料的工具](/tw/ch13#id442)-[排序事件以捕獲因果關係](/tw/ch13#sec_future_capture_causality)
- 衍生資料與分散式事務, [派生資料與分散式事務](/tw/ch13#sec_future_derived_vs_transactions)
- limits of total ordering, [全序的限制](/tw/ch13#id335)
- ordering events to capture causality, [排序事件以捕獲因果關係](/tw/ch13#sec_future_capture_causality)
- 關於資料流的推理, [理解資料流](/tw/ch13#id443)
- 需求, [記錄系統與派生資料](/tw/ch1#sec_introduction_derived)
- 使用批次處理, [批處理](/tw/ch11#ch_batch), [提取-轉換-載入(ETL)](/tw/ch11#sec_batch_etl_usage)
- 資料湖, [從資料倉庫到資料湖](/tw/ch1#from-data-warehouse-to-data-lake)
- data lakehouse, [雲資料倉庫](/tw/ch4#sec_cloud_data_warehouses), [分析(Analytics)](/tw/ch11#sec_batch_olap)
- data locality(見 區域性)
- 資料網格, [提取-轉換-載入(ETL)](/tw/ch11#sec_batch_etl_usage)
- 資料最小化, [資料系統、法律與社會](/tw/ch1#sec_introduction_compliance), [立法與自律](/ch14#sec_future_legislation)
- 資料模型, [資料模型與查詢語言](/tw/ch3#ch_datamodels)-[總結](/tw/ch3#summary)
- DataFrames and arrays, [資料框、矩陣與陣列](/tw/ch3#sec_datamodels_dataframes)
- 類似圖表的模型, [圖資料模型](/tw/ch3#sec_datamodels_graph)-[GraphQL](/tw/ch3#id63)
- Datalog language, [Datalog:遞迴關係查詢](/tw/ch3#id62)-[Datalog:遞迴關係查詢](/tw/ch3#id62)
- 屬性圖, [屬性圖](/tw/ch3#id56)
- RDF and triple-stores, [三元組儲存與 SPARQL](/tw/ch3#id59)-[SPARQL 查詢語言](/tw/ch3#the-sparql-query-language)
- 關係模型對文件模型, [關係模型與文件模型](/tw/ch3#sec_datamodels_history)-[文件和關係資料庫的融合](/tw/ch3#convergence-of-document-and-relational-databases)
- 支援多個, [事件溯源與 CQRS](/tw/ch3#sec_datamodels_events)
- 資料管道, [從資料倉庫到資料湖](/tw/ch1#from-data-warehouse-to-data-lake), [記錄系統與派生資料](/tw/ch1#sec_introduction_derived), [提取-轉換-載入(ETL)](/tw/ch11#sec_batch_etl_usage)
- 資料產品, [超越資料湖](/tw/ch1#beyond-the-data-lake)
- data protection regulations(見 GDPR)
- 資料居住法, [分散式與單節點系統](/tw/ch1#sec_introduction_distributed), [面向多租戶的分片](/tw/ch7#sec_sharding_multitenancy)
- 資料科學, [分析型與事務型系統](/tw/ch1#sec_introduction_analytics), [從資料倉庫到資料湖](/tw/ch1#from-data-warehouse-to-data-lake)
- 資料倉, [資料倉庫](/tw/ch1#sec_introduction_dwh)
- 資料系統
- 正確性、制約因素和完整性, [追求正確性](/tw/ch13#sec_future_correctness)-[用於可審計資料系統的工具](/tw/ch13#id366)
- 資料整合, [資料整合](/tw/ch13#sec_future_integration)-[統一批處理和流處理](/tw/ch13#id338)
- 使用目標, [資料系統架構中的權衡](/tw/ch1#ch_tradeoffs)
- 多樣性, 保持同步, [保持系統同步](/tw/ch12#sec_stream_sync)
- 可維護性, [可運維性](/tw/ch2#sec_introduction_maintainability)-[可演化性:讓變化更容易](/tw/ch2#sec_introduction_evolvability)
- 可能的錯誤, [事務](/tw/ch8#ch_transactions)
- 可靠性, [可靠性與容錯](/tw/ch2#sec_introduction_reliability)-[人類與可靠性](/tw/ch2#id31)
- 硬體故障, [硬體與軟體故障](/tw/ch2#sec_introduction_hardware_faults)
- 人類錯誤, [人類與可靠性](/tw/ch2#id31)
- 重要性, [人類與可靠性](/tw/ch2#id31)
- 軟體故障, [軟體故障](/tw/ch2#software-faults)
- 可伸縮性, [可伸縮性](/tw/ch2#sec_introduction_scalability)-[可伸縮性原則](/tw/ch2#id35)
- unbundling databases, [分拆資料庫](/tw/ch13#sec_future_unbundling)-[多分割槽資料處理](/tw/ch13#sec_future_unbundled_multi_shard)
- 不可靠的時鐘, [不可靠的時鐘](/tw/ch9#sec_distributed_clocks)-[限制垃圾回收的影響](/tw/ch9#sec_distributed_gc_impact)
- data warehousing, [資料倉庫](/tw/ch1#sec_introduction_dwh), [術語表](/tw/glossary)
- 基於雲的解決辦法, [雲資料倉庫](/tw/ch4#sec_cloud_data_warehouses)
- ETL, [資料倉庫](/tw/ch1#sec_introduction_dwh), [保持系統同步](/tw/ch12#sec_stream_sync)
- 用於批處理, [批處理](/tw/ch11#ch_batch)
- 保持資料系統的同步, [保持系統同步](/tw/ch12#sec_stream_sync)
- 設計, [星型與雪花型:分析模式](/tw/ch3#sec_datamodels_analytics)
- sharding and clustering, [按雜湊範圍分片](/tw/ch7#sharding-by-hash-range)
- 緩慢變化的維度, [連線的時間依賴性](/tw/ch12#sec_stream_join_time)
- 資料密集型應用, [資料系統架構中的權衡](/tw/ch1#ch_tradeoffs)
- 資料庫管理員, [雲時代的運維](/tw/ch1#sec_introduction_operations)
- 內部分散式事務, [跨不同系統的分散式事務](/tw/ch8#sec_transactions_xa), [資料庫內部的分散式事務](/tw/ch8#sec_transactions_internal), [原子提交再現](/tw/ch12#sec_stream_atomic_commit)
- 資料庫
- 歸檔儲存, [歸檔儲存](/tw/ch5#archival-storage)
- comparison of message brokers to, [訊息代理與資料庫的對比](/tw/ch12#id297)
- 資料流, [流經資料庫的資料流](/tw/ch5#sec_encoding_dataflow_db)
- end-to-end argument, [端到端原則](/tw/ch13#sec_future_e2e_argument)-[在資料系統中應用端到端思考](/tw/ch13#id357)
- 檢查完整性, [端到端原則重現](/tw/ch13#id456)
- 與事件流的關係, [資料庫與流](/tw/ch12#sec_stream_databases)-[不變性的侷限性](/tw/ch12#sec_stream_immutability_limitations)
- (另見 changelogs)
- 變更流的 API 支援, [變更流的 API 支援](/tw/ch12#sec_stream_change_api), [應用程式碼和狀態的分離](/tw/ch13#id344)
- 資料變更捕獲, [資料變更捕獲](/tw/ch12#sec_stream_cdc)-[變更流的 API 支援](/tw/ch12#sec_stream_change_api)
- 事件溯源, [資料變更捕獲與事件溯源](/tw/ch12#sec_stream_event_sourcing)
- 保持系統同步, [保持系統同步](/tw/ch12#sec_stream_sync)-[保持系統同步](/tw/ch12#sec_stream_sync)
- 不可改變事件哲學, [狀態、流和不變性](/tw/ch12#sec_stream_immutability)-[不變性的侷限性](/tw/ch12#sec_stream_immutability_limitations)
- 分拆, [分拆資料庫](/tw/ch13#sec_future_unbundling)-[多分割槽資料處理](/tw/ch13#sec_future_unbundled_multi_shard)
- 構建資料儲存技術, [組合使用資料儲存技術](/tw/ch13#id447)-[分拆系統與整合系統](/tw/ch13#id448)
- 圍繞資料流設計應用程式, [圍繞資料流設計應用](/tw/ch13#sec_future_dataflow)-[流處理器和服務](/tw/ch13#id345)
- 觀察匯出狀態, [觀察派生資料狀態](/tw/ch13#sec_future_observing)-[多分割槽資料處理](/tw/ch13#sec_future_unbundled_multi_shard)
- 資料中心
- 失敗, [硬體與軟體故障](/tw/ch2#sec_introduction_hardware_faults)
- geographically distributed(見 regions (geographic distribution))
- 多種使用和共享資源, [網路擁塞和排隊](/tw/ch9#network-congestion-and-queueing)
- 網路架構, [雲計算與超級計算](/tw/ch1#id17)
- 網路斷層, [實踐中的網路故障](/tw/ch9#sec_distributed_network_faults)
- 資料流動, [資料流的模式](/tw/ch5#sec_encoding_dataflow)-[分散式 actor 框架](/tw/ch5#distributed-actor-frameworks), [圍繞資料流設計應用](/tw/ch13#sec_future_dataflow)-[流處理器和服務](/tw/ch13#id345)
- 資料流系統的正確性, [資料流系統的正確性](/tw/ch13#id453)
- 資料流引擎, [資料流引擎](/tw/ch11#sec_batch_dataflow)
- 與流處理的比較, [流處理](/tw/ch12#sec_stream_processing)
- DataFrames, [DataFrames](/tw/ch11#id287)
- 批次處理框架中的支援, [批處理](/tw/ch11#ch_batch)
- 事件驅動, [事件驅動的架構](/tw/ch5#sec_encoding_dataflow_msg)-[分散式 actor 框架](/tw/ch5#distributed-actor-frameworks)
- reasoning about, [理解資料流](/tw/ch13#id443)
- 透過資料庫, [流經資料庫的資料流](/tw/ch5#sec_encoding_dataflow_db)
- 透過服務, [流經服務的資料流:REST 與 RPC](/tw/ch5#sec_encoding_dataflow_rpc)-[RPC 的資料編碼與演化](/tw/ch5#data-encoding-and-evolution-for-rpc)
- workflow engines(見 workflow engines)
- DataFrames, [資料框、矩陣與陣列](/tw/ch3#sec_datamodels_dataframes)
- 執行, [DataFrames](/tw/ch11#id287)
- 分批處理, [DataFrames](/tw/ch11#id287)
- 在筆記本中, [機器學習](/tw/ch11#id290)
- 批次處理框架中的支援, [批處理](/tw/ch11#ch_batch)
- DataFusion(查詢引擎), [雲資料倉庫](/tw/ch4#sec_cloud_data_warehouses)
- Datalog(查詢語言), [Datalog:遞迴關係查詢](/tw/ch3#id62)-[Datalog:遞迴關係查詢](/tw/ch3#id62)
- Datastream (change data capture), [變更流的 API 支援](/tw/ch12#sec_stream_change_api)
- 資料型別
- binary strings in XML and JSON, [JSON、XML 及其二進位制變體](/tw/ch5#sec_encoding_json)
- 無衝突, [CRDT 與操作變換](/tw/ch6#sec_replication_crdts)
- 在 Avro 編碼中, [Avro](/tw/ch5#sec_encoding_avro)
- 在協議緩衝中, [欄位標籤與模式演化](/tw/ch5#field-tags-and-schema-evolution)
- numbers in XML and JSON, [JSON、XML 及其二進位制變體](/tw/ch5#sec_encoding_json)
- 日期和日期, [資料系統、法律與社會](/tw/ch1#sec_introduction_compliance)
- Datomic(資料庫)
- B-tree storage, [索引與快照隔離](/tw/ch8#indexes-and-snapshot-isolation)
- 資料模型, [圖資料模型](/tw/ch3#sec_datamodels_graph), [三元組儲存與 SPARQL](/tw/ch3#id59)
- Datalog query language, [Datalog:遞迴關係查詢](/tw/ch3#id62)
- 切除, [不變性的侷限性](/tw/ch12#sec_stream_immutability_limitations)
- 事務語言, [儲存過程的利弊](/tw/ch8#sec_transactions_stored_proc_tradeoffs)
- 事務的序列執行, [實際序列執行](/tw/ch8#sec_transactions_serial)
- Daylight Saving Time (DST), [日曆時鐘](/tw/ch9#time-of-day-clocks)
- Db2(資料庫)
- 資料變更捕獲, [資料變更捕獲的實現](/tw/ch12#id307)
- DBA (database administrator), [雲時代的運維](/tw/ch1#sec_introduction_operations)
- deadlocks, [顯式鎖定](/tw/ch8#explicit-locking)
- detection, in distributed transactions, [XA 事務的問題](/tw/ch8#problems-with-xa-transactions)
- in two-phase locking (2PL), [兩階段鎖定的實現](/tw/ch8#implementation-of-two-phase-locking)
- Debezium(變化資料捕獲), [資料變更捕獲的實現](/tw/ch12#id307)
- for Cassandra, [變更流的 API 支援](/tw/ch12#sec_stream_change_api)
- 資料整合, [分拆系統與整合系統](/tw/ch13#id448)
- 宣告語言, [資料模型與查詢語言](/tw/ch3#ch_datamodels), [術語表](/tw/glossary)
- 並同步引擎, [同步引擎的利弊](/tw/ch6#pros-and-cons-of-sync-engines)
- Datalog, [Datalog:遞迴關係查詢](/tw/ch3#id62)
- 文件資料庫中, [文件和關係資料庫的融合](/tw/ch3#convergence-of-document-and-relational-databases)
- recursive SQL queries, [SQL 中的圖查詢](/tw/ch3#id58)
- SPARQL, [SPARQL 查詢語言](/tw/ch3#the-sparql-query-language)
- DeepSeek
- 3FS(見 3FS)
- 延遲
- 限制網路延遲, [同步與非同步網路](/tw/ch9#sec_distributed_sync_networks)
- bounded process pauses, [響應時間保證](/tw/ch9#sec_distributed_clocks_realtime)
- unbounded network delays, [超時和無界延遲](/tw/ch9#sec_distributed_queueing)
- unbounded process pauses, [程序暫停](/tw/ch9#sec_distributed_clocks_pauses)
- 刪除資料, [不變性的侷限性](/tw/ch12#sec_stream_immutability_limitations)
- in LSM storage, [磁碟空間使用](/tw/ch4#disk-space-usage)
- 法律依據, [資料系統、法律與社會](/tw/ch1#sec_introduction_compliance)
- Delta Lake(表格式), [構建和合並 SSTable](/tw/ch4#constructing-and-merging-sstables), [雲資料倉庫](/tw/ch4#sec_cloud_data_warehouses)
- sharding and clustering, [按雜湊範圍分片](/tw/ch7#sharding-by-hash-range)
- 非軍事區(聯網), [對外提供派生資料](/tw/ch11#sec_batch_serving_derived)
- denormalization (data representation), [正規化、反正規化與連線](/tw/ch3#sec_datamodels_normalization)-[多對一與多對多關係](/tw/ch3#sec_datamodels_many_to_many), [術語表](/tw/glossary)
- 在衍生資料系統中, [記錄系統與派生資料](/tw/ch1#sec_introduction_derived)
- in event sourcing/CQRS, [事件溯源與 CQRS](/tw/ch3#sec_datamodels_events)
- 社會網路案例研究, [社交網路案例研究中的反正規化](/tw/ch3#denormalization-in-the-social-networking-case-study)
- materialized views, [物化檢視與資料立方體](/tw/ch4#sec_storage_materialized_views)
- 更新衍生資料, [單物件與多物件操作](/tw/ch8#sec_transactions_multi_object), [多物件事務的需求](/tw/ch8#sec_transactions_need), [組合使用派生資料的工具](/tw/ch13#id442)
- versus normalization, [從同一事件日誌中派生多個檢視](/tw/ch12#sec_stream_deriving_views)
- 衍生資料, [記錄系統與派生資料](/tw/ch1#sec_introduction_derived), [流處理](/tw/ch12#ch_stream), [術語表](/tw/glossary)
- 批處理, [批處理](/tw/ch11#ch_batch)
- 事件溯源與 CQRS, [事件溯源與 CQRS](/tw/ch3#sec_datamodels_events)
- 從變化資料抓取, [資料變更捕獲的實現](/tw/ch12#id307)
- 透過日誌維護匯出狀態, [資料庫與流](/tw/ch12#sec_stream_databases)-[變更流的 API 支援](/tw/ch12#sec_stream_change_api), [狀態、流和不變性](/tw/ch12#sec_stream_immutability)-[併發控制](/tw/ch12#sec_stream_concurrency)
- 透過對流的訂閱來觀察, [端到端的事件流](/tw/ch13#id349)
- 批次和流處理的產出, [批處理與流處理](/tw/ch13#sec_future_batch_streaming)
- 透過應用程式程式碼, [應用程式碼作為派生函式](/tw/ch13#sec_future_dataflow_derivation)
- 相對於已分配事務, [派生資料與分散式事務](/tw/ch13#sec_future_derived_vs_transactions)
- 設計模式, [簡單性:管理複雜度](/tw/ch2#id38)
- 決定性行動, [儲存過程的利弊](/tw/ch8#sec_transactions_stored_proc_tradeoffs), [故障與部分失效](/tw/ch9#sec_distributed_partial_failure), [術語表](/tw/glossary)
- and idempotence, [冪等性](/tw/ch12#sec_stream_idempotence), [理解資料流](/tw/ch13#id443)
- 計算衍生資料, [維護派生狀態](/tw/ch13#id446), [資料流系統的正確性](/tw/ch13#id453), [為可審計性而設計](/tw/ch13#id365)
- in event sourcing, [事件溯源與 CQRS](/tw/ch3#sec_datamodels_events)
- 狀態機器複製, [使用共享日誌](/tw/ch10#sec_consistency_smr), [資料庫與流](/tw/ch12#sec_stream_databases)
- 基於語句的複製, [基於語句的複製](/tw/ch6#statement-based-replication)
- 測試中, [確定性模擬測試](/tw/ch9#deterministic-simulation-testing)
- 加入, [連線的時間依賴性](/tw/ch12#sec_stream_join_time)
- 使程式碼確定性, [確定性模擬測試](/tw/ch9#deterministic-simulation-testing)
- 概覽, [確定性模擬測試](/tw/ch9#deterministic-simulation-testing)
- 確定性模擬測試(DST), [確定性模擬測試](/tw/ch9#deterministic-simulation-testing)
- DevOps, [雲時代的運維](/tw/ch1#sec_introduction_operations)
- 維度表, [星型與雪花型:分析模式](/tw/ch3#sec_datamodels_analytics)
- dimensional modeling(見 star schemas)
- directed acyclic graphs (DAG)
- 工作流程, [工作流排程](/tw/ch11#sec_batch_workflows)
- (另見 workflow engines)
- 髒讀, [沒有髒讀](/tw/ch8#no-dirty-reads)
- 髒字(事務隔離), [沒有髒寫](/tw/ch8#sec_transactions_dirty_write)
- 分類
- 儲存和計算, [儲存與計算的分離](/tw/ch1#sec_introduction_storage_compute)
- Discord (group chat)
- GraphQL example, [GraphQL](/tw/ch3#id63)
- discrimination, [Bias and Discrimination](/ch14#id370)
- disks (see hard disks)
- distributed actor frameworks, [Distributed Actor Frameworks](/tw/ch5#distributed-actor-frameworks)
- distributed filesystems, [Distributed Filesystems](/tw/ch11#sec_batch_dfs)
- comparison to object stores, [Object Stores](/tw/ch11#id277)
- use by Flink, [Rebuilding State After a Failure](/tw/ch12#sec_stream_state_fault_tolerance)
- distributed ledgers, [Summary](/tw/ch3#summary)
- distributed systems, [The Trouble with Distributed Systems](/tw/ch9#ch_distributed)-[Summary](/tw/ch9#summary), [Glossary](/tw/glossary)
- Byzantine faults, [Byzantine Faults](/tw/ch9#sec_distributed_byzantine)-[Weak Forms of Lying](/tw/ch9#weak-forms-of-lying)
- detecting network faults, [Detecting Faults](/tw/ch9#id307)
- faults and partial failures, [Faults and Partial Failures](/tw/ch9#sec_distributed_partial_failure)
- formalization of consensus, [Single-Value Consensus](/tw/ch10#single-value-consensus)
- impossibility results, [The CAP Theorem](/tw/ch10#the-cap-theorem), [Consensus](/tw/ch10#sec_consistency_consensus)
- issues with failover, [Leader Failure: Failover](/tw/ch6#leader-failure-failover)
- multi-region (see regions (geographic distribution))
- network problems, [Unreliable Networks](/tw/ch9#sec_distributed_networks)-[Can We Not Simply Make Network Delays Predictable?](/tw/ch9#can-we-not-simply-make-network-delays-predictable)
- problems with, [Problems with Distributed Systems](/tw/ch1#sec_introduction_dist_sys_problems)
- quorums, relying on, [The Majority Rules](/tw/ch9#sec_distributed_majority)
- reasons for using, [Distributed Versus Single-Node Systems](/tw/ch1#sec_introduction_distributed), [Replication](/tw/ch6#ch_replication)
- synchronized clocks, relying on, [Relying on Synchronized Clocks](/tw/ch9#sec_distributed_clocks_relying)-[Synchronized Clocks for Global Snapshots](/tw/ch9#sec_distributed_spanner)
- system models, [System Model and Reality](/tw/ch9#sec_distributed_system_model)-[Deterministic Simulation Testing](/tw/ch9#deterministic-simulation-testing)
- use of clocks and time, [Unreliable Clocks](/tw/ch9#sec_distributed_clocks)
- distributed transactions (see transactions)
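- Django (web framework), [Handling Errors and Aborts](/tw/ch8#handling-errors-and-aborts)
- DMZ (demilitarized zone), [Serving Derived Data](/tw/ch11#sec_batch_serving_derived)
- DNS (Domain Name System), [Request Routing](/tw/ch7#sec_sharding_routing), [Service Discovery](/tw/ch10#service-discovery)
- for load balancing, [Load Balancers, Service Discovery, and Service Meshes](/tw/ch5#sec_encoding_service_discovery)
- Docker (container manager), [Separation of Application Code and State](/tw/ch13#id344)
- document data model, [Relational Model Versus Document Model](/tw/ch3#sec_datamodels_history)-[Convergence of Document and Relational Databases](/tw/ch3#convergence-of-document-and-relational-databases)
- comparison to relational model, [When to Use Which Model](/tw/ch3#sec_datamodels_document_summary)-[Convergence of Document and Relational Databases](/tw/ch3#convergence-of-document-and-relational-databases)
- multi-object transactions, need for, [The Need for Multi-Object Transactions](/tw/ch8#sec_transactions_need)
- sharded secondary indexes, [Sharding and Secondary Indexes](/tw/ch7#sec_sharding_secondary_indexes)
- versus relational model
- convergence of models, [Convergence of Document and Relational Databases](/tw/ch3#convergence-of-document-and-relational-databases)
- data locality, [Data Locality for Reads and Writes](/tw/ch3#sec_datamodels_document_locality)
- document-partitioned indexes (see local secondary indexes)
- domain-driven design, [Simplicity: Managing Complexity](/tw/ch2#id38), [Event Sourcing and CQRS](/tw/ch3#sec_datamodels_events)
- dotted version vectors, [Version Vectors](/tw/ch6#version-vectors)
- double-entry bookkeeping, [Summary](/tw/ch3#summary)
- DRBD (Distributed Replicated Block Device), [Single-Leader Replication](/tw/ch6#sec_replication_leader)
- drift (clocks), [Clock Synchronization and Accuracy](/tw/ch9#sec_distributed_clock_accuracy)
- Druid (database), [Characterizing Transaction Processing and Analytics](/tw/ch1#sec_introduction_oltp), [Column-Oriented Storage](/tw/ch4#sec_storage_column), [Deriving Several Views from the Same Event Log](/tw/ch12#sec_stream_deriving_views)
- handling writes, [Writing to Column-Oriented Storage](/tw/ch4#writing-to-column-oriented-storage)
- pre-aggregation, [Analytics](/tw/ch11#sec_batch_olap)
- serving derived data, [Serving Derived Data](/tw/ch11#sec_batch_serving_derived)
- Dryad (dataflow engine), [Dataflow Engines](/tw/ch11#sec_batch_dataflow)
- dual writes, problems with, [Keeping Systems in Sync](/tw/ch12#sec_stream_sync)
- DuckDB (database), [Problems with Distributed Systems](/tw/ch1#sec_introduction_dist_sys_problems), [Compaction Strategies](/tw/ch4#sec_storage_lsm_compaction)
- column-oriented storage, [Column-Oriented Storage](/tw/ch4#sec_storage_column)
- use for ETL, [Extract-Transform-Load (ETL)](/tw/ch11#sec_batch_etl_usage)
- duplicates, suppression of, [Duplicate Suppression](/tw/ch13#id354)
- (see also idempotence)
- using a unique ID, [Operation Identifiers](/tw/ch13#id355), [Multi-Partition Request Processing](/tw/ch13#id360)
- durability, [Making B-Trees Reliable](/tw/ch4#sec_storage_btree_wal), [Durability](/tw/ch8#durability), [Glossary](/tw/glossary)
- durable execution, [Durable Execution and Workflows](/tw/ch5#sec_encoding_dataflow_workflows)
- reliance on determinism, [Deterministic Simulation Testing](/tw/ch9#deterministic-simulation-testing)
- Restate (see Restate (workflow engine))
- Temporal (see Temporal (workflow engine))
- durable functions (see workflow engines)
- duration (time), [Unreliable Clocks](/tw/ch9#sec_distributed_clocks)
- measurement with monotonic clocks, [Monotonic Clocks](/tw/ch9#monotonic-clocks)
- dynamically typed languages
- analogy to schema-on-read, [Schema Flexibility in the Document Model](/tw/ch3#sec_datamodels_schema_flexibility)
- Dynamo (database), [Leaderless Replication](/tw/ch6#sec_replication_leaderless)
- Dynamo-style databases (see leaderless replication)
- DynamoDB (database)
- auto-scaling, [Operations: Automatic or Manual Rebalancing](/tw/ch7#sec_sharding_operations)
- hash sharding, [Sharding by Hash Range](/tw/ch7#sharding-by-hash-range)
- leader-based replication, [Single-Leader Replication](/tw/ch6#sec_replication_leader)
- sharded secondary indexes, [Global Secondary Indexes](/tw/ch7#id167)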
### E
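- EBS (virtual block device), [Separation of Storage and Compute](/tw/ch1#sec_introduction_storage_compute)
- comparison to object stores, [Setting Up New Replicas](/tw/ch6#sec_replication_new_replica)
- ECC (see error-correcting codes)
- EDB Postgres Distributed (database), [Geographically Distributed Operation](/tw/ch6#sec_replication_multi_dc)
- edges (in graphs), [Graph-Like Data Models](/tw/ch3#sec_datamodels_graph)
- property graph model, [Property Graphs](/tw/ch3#id56)
- edit distance (full-text search), [Full-Text Search](/tw/ch4#sec_storage_full_text)
- effectively-once semantics, [Fault Tolerance](/tw/ch12#sec_stream_fault_tolerance), [Exactly-Once Execution of an Operation](/tw/ch13#id353)
- (see also exactly-once semantics)
- preservation of integrity, [Correctness of Dataflow Systems](/tw/ch13#id453)
- Elastic Compute Cloud (EC2)
- spot instances, [Handling Faults](/tw/ch11#id281)
- elasticity, [Distributed Versus Single-Node Systems](/tw/ch1#sec_introduction_distributed)
- cloud data warehouses, [Cloud Data Warehouses](/tw/ch4#sec_cloud_data_warehouses), [Query Languages](/tw/ch11#sec_batch_query_lanauges)
- Elasticsearch (search server)
- local secondary indexes, [Local Secondary Indexes](/tw/ch7#id166)
- percolator (stream search), [Search on Streams](/tw/ch12#id320)
- serving derived data, [Serving Derived Data](/tw/ch11#sec_batch_serving_derived)
- shard rebalancing, [Fixed Number of Shards](/tw/ch7#fixed-number-of-shards)
- use of Lucene, [Full-Text Search](/tw/ch4#sec_storage_full_text)
- Elm (programming language), [End-to-End Event Streams](/tw/ch13#id349)
- ELT (extract-load-transform), [Data Warehousing](/tw/ch1#sec_introduction_dwh)
- relation to batch processing, [Extract-Transform-Load (ETL)](/tw/ch11#sec_batch_etl_usage)
- embarrassingly parallel (algorithm)
- extract-transform-load (ETL) (see ETL)
- MapReduce, [MapReduce](/tw/ch11#sec_batch_mapreduce)
- (see also MapReduce)
- embedded storage engines, [Compaction Strategies](/tw/ch4#sec_storage_lsm_compaction)
- embeddings (vectors), [Vector Embeddings](/tw/ch4#id92)
- encodings (data formats), [Encoding and Evolution](/tw/ch5#ch_encoding)-[The Merits of Schemas](/tw/ch5#sec_encoding_schemas)
- Avro, [Avro](/tw/ch5#sec_encoding_avro)-[Dynamically Generated Schemas](/tw/ch5#dynamically-generated-schemas)
- binary variants of JSON and XML, [Binary Encoding](/tw/ch5#binary-encoding)
- compatibility, [Encoding and Evolution](/tw/ch5#ch_encoding)
- calling services, [Data Encoding and Evolution for RPC](/tw/ch5#data-encoding-and-evolution-for-rpc)
- using databases, [Dataflow Through Databases](/tw/ch5#sec_encoding_dataflow_db)-[Archival Storage](/tw/ch5#archival-storage)
- defined, [Formats for Encoding Data](/tw/ch5#sec_encoding_formats)
- JSON, XML, and CSV, [JSON, XML, and Binary Variants](/tw/ch5#sec_encoding_json)
- language-specific formats, [Language-Specific Formats](/tw/ch5#id96)
- merits of schemas, [The Merits of Schemas](/tw/ch5#sec_encoding_schemas)
- Protocol Buffers, [Protocol Buffers](/tw/ch5#sec_encoding_protobuf)-[Field Tags and Schema Evolution](/tw/ch5#field-tags-and-schema-evolution)
- representations of data, [Formats for Encoding Data](/tw/ch5#sec_encoding_formats)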
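- end-to-end argument, [The End-to-End Argument](/tw/ch13#sec_future_e2e_argument)-[Applying End-to-End Thinking in Data Systems](/tw/ch13#id357)
- checking integrity, [The End-to-End Argument Again](/tw/ch13#id456)
- publish/subscribe streams, [End-to-End Event Streams](/tw/ch13#id349)
- enrichment (stream), [Stream-Table Join (Stream Enrichment)](/tw/ch12#sec_stream_table_joins)
- Enterprise JavaBeans (EJB), [The Problems with Remote Procedure Calls (RPCs)](/tw/ch5#sec_problems_with_rpc)
- enterprise software, [Trade-Offs in Data Systems Architecture](/tw/ch1#ch_tradeoffs)
- entities (see vertices)
- ephemeral storage, [Separation of Storage and Compute](/tw/ch1#sec_introduction_storage_compute)
- epoch (consensus algorithms), [From Single-Leader Replication to Consensus](/tw/ch10#from-single-leader-replication-to-consensus)
- epoch (Unix timestamps), [Time-of-Day Clocks](/tw/ch9#time-of-day-clocks)
- erasure coding (error correction), [Distributed Filesystems](/tw/ch11#sec_batch_dfs)
- error handling
- for network faults, [Network Faults in Practice](/tw/ch9#sec_distributed_network_faults)
- in transactions, [Handling Errors and Aborts](/tw/ch8#handling-errors-and-aborts)
- error-correcting codes, [Hardware and Software Faults](/tw/ch2#sec_introduction_hardware_faults), [Distributed Filesystems](/tw/ch11#sec_batch_dfs)
- Esper (CEP engine), [Complex Event Processing](/tw/ch12#id317)
- essential complexity, [Simplicity: Managing Complexity](/tw/ch2#id38)
- etcd (coordination service), [Coordination Services](/tw/ch10#sec_consistency_coordination)-[Service Discovery](/tw/ch10#service-discovery)
- generating fencing tokens, [Fencing Off Zombies and Delayed Requests](/tw/ch9#sec_distributed_fencing_tokens), [Coordination Services](/tw/ch10#sec_consistency_coordination)
- linearizable operations, [Implementing Linearizable Systems](/tw/ch10#sec_consistency_implementing_linearizable), [Subtleties of Consensus](/tw/ch10#subtleties-of-consensus)
- locks and leader election, [Locking and Leader Election](/tw/ch10#locking-and-leader-election)
- use for service discovery, [Load Balancers, Service Discovery, and Service Meshes](/tw/ch5#sec_encoding_service_discovery), [Service Discovery](/tw/ch10#service-discovery)
- use for shard assignment, [Request Routing](/tw/ch7#sec_sharding_routing)
- use of Raft algorithm, [Single-Leader Replication](/tw/ch6#sec_replication_leader)
- Ethereum (blockchain), [Tools for Auditable Data Systems](/tw/ch13#id366)
- Ethernet (networks), [Cloud Computing Versus Supercomputing](/tw/ch1#id17), [Unreliable Networks](/tw/ch9#sec_distributed_networks), [Can We Not Simply Make Network Delays Predictable?](/tw/ch9#can-we-not-simply-make-network-delays-predictable)
- packet checksums, [Weak Forms of Lying](/tw/ch9#weak-forms-of-lying), [The End-to-End Argument](/tw/ch13#sec_future_e2e_argument)
- ethics, [Doing the Right Thing](/ch14)-[Legislation and Self-Regulation](/ch14#sec_future_legislation)
- code of ethics and professional practice, [Doing the Right Thing](/ch14)
- legislation and self-regulation, [Legislation and Self-Regulation](/ch14#sec_future_legislation)
- predictive analytics, [Predictive Analytics](/ch14#id369)-[Feedback Loops](/ch14#id372)
- amplifying bias, [Bias and Discrimination](/ch14#id370)
- feedback loops, [Feedback Loops](/ch14#id372)
- privacy and tracking, [Privacy and Tracking](/ch14#id373)-[Legislation and Self-Regulation](/ch14#sec_future_legislation)
- consent and freedom of choice, [Consent and Freedom of Choice](/ch14#id375)
- data as assets and power, [Data as Assets and Power](/ch14#id376)
- meaning of privacy, [Privacy and Use of Data](/ch14#id457)
- surveillance, [Surveillance](/ch14#id374)
- respect, dignity, and agency, [Legislation and Self-Regulation](/ch14#sec_future_legislation)
- unintended consequences, [Doing the Right Thing](/ch14), [Feedback Loops](/ch14#id372)
- ETL, [Data Warehousing](/tw/ch1#sec_introduction_dwh), [Keeping Systems in Sync](/tw/ch12#sec_stream_sync), [Glossary](/tw/glossary)
- relation to batch processing, [Extract-Transform-Load (ETL)](/tw/ch11#sec_batch_etl_usage)
- using batch processing, [Batch Processing](/tw/ch11#ch_batch)
- Euclidean distance (semantic search), [Vector Embeddings](/tw/ch4#id92)
- European Union
- AI Act (see AI Act)
- GDPR (see GDPR)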
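- event sourcing, [Event Sourcing and CQRS](/tw/ch3#sec_datamodels_events)
- and change data capture, [Change Data Capture Versus Event Sourcing](/tw/ch12#sec_stream_event_sourcing)
- comparison to change data capture, [Change Data Capture Versus Event Sourcing](/tw/ch12#sec_stream_event_sourcing)
- immutability and auditability, [State, Streams, and Immutability](/tw/ch12#sec_stream_immutability), [Designing for Auditability](/tw/ch13#id365)
- large, reliable data systems, [Operation Identifiers](/tw/ch13#id355), [Correctness of Dataflow Systems](/tw/ch13#id453)
- reliance on determinism, [Deterministic Simulation Testing](/tw/ch9#deterministic-simulation-testing)
- event streams (see streams)
- event-driven architecture, [Event-Driven Architectures](/tw/ch5#sec_encoding_dataflow_msg)-[Distributed Actor Frameworks](/tw/ch5#distributed-actor-frameworks)
- distributed actor frameworks, [Distributed Actor Frameworks](/tw/ch5#distributed-actor-frameworks)
- events, [Transmitting Event Streams](/tw/ch12#sec_stream_transmit)
- deciding on total order of, [The Limits of Total Ordering](/tw/ch13#id335)
- deriving views from event log, [Deriving Several Views from the Same Event Log](/tw/ch12#sec_stream_deriving_views)
- event time versus processing time, [Event Time Versus Processing Time](/tw/ch12#id322), [Microbatching and Checkpointing](/tw/ch12#id329), [Unifying Batch and Stream Processing](/tw/ch13#id338)
- immutable, advantages of, [Advantages of Immutable Events](/tw/ch12#sec_stream_immutability_pros), [Designing for Auditability](/tw/ch13#id365)
- ordering to capture causality, [Ordering Events to Capture Causality](/tw/ch13#sec_future_capture_causality)
- reads as, [Reads Are Events Too](/tw/ch13#sec_future_read_events)
- stragglers, [Handling Straggler Events](/tw/ch12#id323)
- timestamp of, in stream processing, [Whose Clock Are You Using, Anyway?](/tw/ch12#id438)
- EventSource (browser API), [Pushing State Changes to Clients](/tw/ch13#id348)
- EventStoreDB (database), [Event Sourcing and CQRS](/tw/ch3#sec_datamodels_events)
- eventual consistency, [Replication](/tw/ch6#ch_replication), [Problems with Replication Lag](/tw/ch6#sec_replication_lag), [Safety and Liveness](/tw/ch9#sec_distributed_safety_liveness)
- (see also conflicts)
- and perpetual inconsistency, [Timeliness and Integrity](/tw/ch13#sec_future_integrity)
- strong eventual consistency, [Automatic Conflict Resolution](/tw/ch6#automatic-conflict-resolution)
- evidence
- data used as, [Humans and Reliability](/tw/ch2#id31)
- evolvability, [Evolvability: Making Change Easy](/tw/ch2#sec_introduction_evolvability), [Encoding and Evolution](/tw/ch5#ch_encoding)
- calling services, [Data Encoding and Evolution for RPC](/tw/ch5#data-encoding-and-evolution-for-rpc)
- event sourcing, [Event Sourcing and CQRS](/tw/ch3#sec_datamodels_events)
- graph-structured data, [Property Graphs](/tw/ch3#id56)
- of databases, [Schema Flexibility in the Document Model](/tw/ch3#sec_datamodels_schema_flexibility), [Dataflow Through Databases](/tw/ch5#sec_encoding_dataflow_db)-[Archival Storage](/tw/ch5#archival-storage), [Deriving Several Views from the Same Event Log](/tw/ch12#sec_stream_deriving_views), [Reprocessing Data for Application Evolution](/tw/ch13#sec_future_reprocessing)
- reprocessing data, [Reprocessing Data for Application Evolution](/tw/ch13#sec_future_reprocessing), [Unifying Batch and Stream Processing](/tw/ch13#id338)
- schema evolution in Avro, [The Writer's Schema and the Reader's Schema](/tw/ch5#the-writers-schema-and-the-readers-schema)
- schema evolution in Protocol Buffers, [Field Tags and Schema Evolution](/tw/ch5#field-tags-and-schema-evolution)
- schema-on-read, [Schema Flexibility in the Document Model](/tw/ch3#sec_datamodels_schema_flexibility), [Encoding and Evolution](/tw/ch5#ch_encoding), [The Merits of Schemas](/tw/ch5#sec_encoding_schemas)
- exactly-once semantics, [Exactly-Once Message Processing](/tw/ch8#sec_transactions_exactly_once), [Exactly-Once Message Processing Revisited](/tw/ch8#exactly-once-message-processing-revisited), [Fault Tolerance](/tw/ch12#sec_stream_fault_tolerance), [Exactly-Once Execution of an Operation](/tw/ch13#id353)
- parity with batch processors, [Unifying Batch and Stream Processing](/tw/ch13#id338)
- preservation of integrity, [Correctness of Dataflow Systems](/tw/ch13#id453)
- using durable execution, [Durable Execution](/tw/ch5#durable-execution)
- exclusive mode (locks), [Implementation of Two-Phase Locking](/tw/ch8#implementation-of-two-phase-locking)
- exponential backoff, [Describing Performance](/tw/ch2#sec_introduction_percentiles), [Handling Errors and Aborts](/tw/ch8#handling-errors-and-aborts)
- ext4 (file system), [Distributed Filesystems](/tw/ch11#sec_batch_dfs)
- eXtended Architecture transactions (see XA transactions)
- extract-transform-load (see ETL)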
### F
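- Facebook
- Faiss (vector index), [Vector Embeddings](/tw/ch4#id92)
- React (user interface library), [End-to-End Event Streams](/tw/ch13#id349)
- social graphs, [Graph-Like Data Models](/tw/ch3#sec_datamodels_graph)
- fact tables (star schemas), [Stars and Snowflakes: Schemas for Analytics](/tw/ch3#sec_datamodels_analytics)
- facts
- in Datalog, [Datalog: Recursive Relational Queries](/tw/ch3#id62)
- in event sourcing, [Event Sourcing and CQRS](/tw/ch3#sec_datamodels_events)
- fail-slow faults, [System Model and Reality](/tw/ch9#sec_distributed_system_model)
- fail-stop model, [System Model and Reality](/tw/ch9#sec_distributed_system_model)
- failover, [Leader Failure: Failover](/tw/ch6#leader-failure-failover), [Glossary](/tw/glossary)
- (see also leader-based replication)
- in leaderless replication, absence of, [Writing to the Database When a Node Is Down](/tw/ch6#id287)
- leader election, [Distributed Locks and Leases](/tw/ch9#sec_distributed_lock_fencing), [Consensus](/tw/ch10#sec_consistency_consensus), [From Single-Leader Replication to Consensus](/tw/ch10#from-single-leader-replication-to-consensus)
- potential problems, [Leader Failure: Failover](/tw/ch6#leader-failure-failover)
- failures
- amplification by distributed transactions, [Maintaining Derived State](/tw/ch13#id446)
- failure detection, [Detecting Faults](/tw/ch9#id307)
- automatic rebalancing causing cascading failures, [Operations: Automatic or Manual Rebalancing](/tw/ch7#sec_sharding_operations)
- timeouts and unbounded delays, [Timeouts and Unbounded Delays](/tw/ch9#sec_distributed_queueing), [Network Congestion and Queueing](/tw/ch9#network-congestion-and-queueing)
- using coordination services, [Coordination Services](/tw/ch10#sec_consistency_coordination)
- faults versus, [Reliability and Fault Tolerance](/tw/ch2#sec_introduction_reliability)
- partial failures, [Faults and Partial Failures](/tw/ch9#sec_distributed_partial_failure), [Summary](/tw/ch9#summary)
- Faiss (vector index), [Vector Embeddings](/tw/ch4#id92)
- false positive (Bloom filters), [Bloom Filters](/tw/ch4#bloom-filters)
- fan-out, [Materializing and Updating Timelines](/tw/ch2#sec_introduction_materializing), [Multiple Consumers](/tw/ch12#id298)
- fault injection, [Fault Tolerance](/tw/ch2#id27), [Network Faults in Practice](/tw/ch9#sec_distributed_network_faults), [Fault Injection](/tw/ch9#sec_fault_injection)
- fault isolation, [Sharding for Multitenancy](/tw/ch7#sec_sharding_multitenancy)
- fault tolerance, [Reliability and Fault Tolerance](/tw/ch2#sec_introduction_reliability)-[Humans and Reliability](/tw/ch2#id31), [Glossary](/tw/glossary)
- formalization in consensus, [Single-Value Consensus](/tw/ch10#single-value-consensus)
- human fault tolerance, [Batch Processing](/tw/ch11#ch_batch)
- in batch processing, [Handling Faults](/tw/ch11#id281)
- in log-based systems, [Applying End-to-End Thinking in Data Systems](/tw/ch13#id357), [Timeliness and Integrity](/tw/ch13#sec_future_integrity)-[Correctness of Dataflow Systems](/tw/ch13#id453)
- in stream processing, [Fault Tolerance](/tw/ch12#sec_stream_fault_tolerance)-[Rebuilding State After a Failure](/tw/ch12#sec_stream_state_fault_tolerance)
- atomic commit, [Atomic Commit Revisited](/tw/ch12#sec_stream_atomic_commit)
- idempotence, [Idempotence](/tw/ch12#sec_stream_idempotence)
- maintaining derived state, [Maintaining Derived State](/tw/ch13#id446)
- microbatching and checkpointing, [Microbatching and Checkpointing](/tw/ch12#id329)
- rebuilding state after a failure, [Rebuilding State After a Failure](/tw/ch12#sec_stream_state_fault_tolerance)
- of distributed transactions, [XA Transactions](/tw/ch8#xa-transactions)-[Exactly-Once Message Processing Revisited](/tw/ch8#exactly-once-message-processing-revisited)
- of leader-based and leaderless replication, [Single-Leader Versus Leaderless Replication Performance](/tw/ch6#sec_replication_leaderless_perf)
- transaction atomicity, [Atomicity](/tw/ch8#sec_transactions_acid_atomicity), [Distributed Transactions](/tw/ch8#sec_transactions_distributed)-[Exactly-Once Message Processing](/tw/ch8#sec_transactions_exactly_once)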
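- faults
- Byzantine faults, [Byzantine Faults](/tw/ch9#sec_distributed_byzantine)-[Weak Forms of Lying](/tw/ch9#weak-forms-of-lying)
- failures versus, [Reliability and Fault Tolerance](/tw/ch2#sec_introduction_reliability)
- handled by transactions, [Transactions](/tw/ch8#ch_transactions)
- handling in supercomputers and cloud computing, [Cloud Computing Versus Supercomputing](/tw/ch1#id17)
- hardware, [Hardware and Software Faults](/tw/ch2#sec_introduction_hardware_faults)
- in distributed systems, [Faults and Partial Failures](/tw/ch9#sec_distributed_partial_failure)
- introducing deliberately (see fault injection)
- network faults, [Network Faults in Practice](/tw/ch9#sec_distributed_network_faults)-[Detecting Faults](/tw/ch9#id307)
- asymmetric faults, [The Majority Rules](/tw/ch9#sec_distributed_majority)
- detecting, [Detecting Faults](/tw/ch9#id307)
- tolerance of, in multi-leader replication, [Geographically Distributed Operation](/tw/ch6#sec_replication_multi_dc)
- software faults, [Software Faults](/tw/ch2#software-faults)
- tolerating (see fault tolerance)
- feature engineering (machine learning), [From Data Warehouse to Data Lake](/tw/ch1#from-data-warehouse-to-data-lake)
- federated databases, [The Meta-Database of Everything](/tw/ch13#id341)
- Feldera (database)
- incremental view maintenance, [Maintaining Materialized Views](/tw/ch12#sec_stream_mat_view)
- fence, [Linearizability and Network Delays](/tw/ch10#linearizability-and-network-delays)
- fencing, [Leader Failure: Failover](/tw/ch6#leader-failure-failover), [Fencing Off Zombies and Delayed Requests](/tw/ch9#sec_distributed_fencing_tokens)-[Fencing with Multiple Replicas](/tw/ch9#fencing-with-multiple-replicas)
- generating fencing tokens, [Using Shared Logs](/tw/ch10#sec_consistency_smr), [Coordination Services](/tw/ch10#sec_consistency_coordination)
- properties of fencing tokens, [Defining the Correctness of an Algorithm](/tw/ch9#defining-the-correctness-of-an-algorithm)
- stream processors writing to databases, [Idempotence](/tw/ch12#sec_stream_idempotence), [Exactly-Once Execution of an Operation](/tw/ch13#id353)
- fetch-and-add
- relation to consensus, [Fetch-and-Add as Consensus](/tw/ch10#fetch-and-add-as-consensus)
- Fibre Channel (networks), [Distributed Filesystems](/tw/ch11#sec_batch_dfs)
- field tags (Protocol Buffers), [Protocol Buffers](/tw/ch5#sec_encoding_protobuf)-[Field Tags and Schema Evolution](/tw/ch5#field-tags-and-schema-evolution)
- Figma (graphics software), [Real-Time Collaboration, Offline-First, and Local-First Apps](/tw/ch6#real-time-collaboration-offline-first-and-local-first-apps)
- filesystem in userspace (FUSE), [Setting Up New Replicas](/tw/ch6#sec_replication_new_replica), [Distributed Filesystems](/tw/ch11#sec_batch_dfs)
- on object storage, [Object Stores](/tw/ch11#id277)
- financial data
- accounting ledgers, [Summary](/tw/ch3#summary)
- immutability, [Advantages of Immutable Events](/tw/ch12#sec_stream_immutability_pros)
- time-series data, [DataFrames, Matrices, and Arrays](/tw/ch3#sec_datamodels_dataframes)
- Fivetran, [Data Warehousing](/tw/ch1#sec_introduction_dwh)
- FizzBee (specification language), [Model Checking and Specification Languages](/tw/ch9#model-checking-and-specification-languages)
- flat index (vector index), [Vector Embeddings](/tw/ch4#id92)
- FlatBuffers (data format), [Formats for Encoding Data](/tw/ch5#sec_encoding_formats)
- Flink (processing framework), [Batch Processing](/tw/ch11#ch_batch), [Dataflow Engines](/tw/ch11#sec_batch_dataflow)
- cost efficiency, [Query Languages](/tw/ch11#sec_batch_query_lanauges)
- DataFrames, [DataFrames, Matrices, and Arrays](/tw/ch3#sec_datamodels_dataframes), [DataFrames](/tw/ch11#id287)
- fault tolerance, [Handling Faults](/tw/ch11#id281), [Microbatching and Checkpointing](/tw/ch12#id329), [Rebuilding State After a Failure](/tw/ch12#sec_stream_state_fault_tolerance)
- FlinkML, [Machine Learning](/tw/ch11#id290)
- for data warehouses, [Cloud Data Warehouses](/tw/ch4#sec_cloud_data_warehouses)
- high availability using ZooKeeper, [Coordination Services](/tw/ch10#sec_consistency_coordination)
- integration of batch and stream processing, [Unifying Batch and Stream Processing](/tw/ch13#id338)
- query optimizer, [Query Languages](/tw/ch11#sec_batch_query_lanauges)
- shuffling data, [Shuffling Data](/tw/ch11#sec_shuffle)
- stream processing, [Stream Analytics](/tw/ch12#id318)
- streaming SQL support, [Complex Event Processing](/tw/ch12#id317)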
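- flow control, [The Limitations of TCP](/tw/ch9#sec_distributed_tcp), [Messaging Systems](/tw/ch12#sec_stream_messaging), [Glossary](/tw/glossary)
- FLP result (on consensus), [Consensus](/tw/ch10#sec_consistency_consensus)
- Flyte (workflow scheduler), [Machine Learning](/tw/ch11#id290)
- followers, [Single-Leader Replication](/tw/ch6#sec_replication_leader), [Glossary](/tw/glossary)
- (see also leader-based replication)
- formal methods, [Formal Methods and Randomized Testing](/tw/ch9#sec_distributed_formal)-[Deterministic Simulation Testing](/tw/ch9#deterministic-simulation-testing)
- forward compatibility, [Encoding and Evolution](/tw/ch5#ch_encoding)
- forward decay (algorithm), [Use of Response Time Metrics](/tw/ch2#sec_introduction_slo_sla)
- Fossil (version control system), [Concurrency Control](/tw/ch12#sec_stream_concurrency)
- shunning, [Limitations of Immutability](/tw/ch12#sec_stream_immutability_limitations)
- FoundationDB (database)
- consistency model, [What Makes a System Linearizable?](/tw/ch10#sec_consistency_lin_definition)
- deterministic simulation testing, [Deterministic Simulation Testing](/tw/ch9#deterministic-simulation-testing)
- key-range sharding, [Sharding by Key Range](/tw/ch7#sec_sharding_key_range)
- process-per-core model, [Pros and Cons of Sharding](/tw/ch7#sec_sharding_reasons)
- serializable transactions, [Serializable Snapshot Isolation (SSI)](/tw/ch8#sec_transactions_ssi), [Performance of Serializable Snapshot Isolation](/tw/ch8#performance-of-serializable-snapshot-isolation)
- transactions, [What Exactly Is a Transaction?](/tw/ch8#sec_transactions_overview), [Database-Internal Distributed Transactions](/tw/ch8#sec_transactions_internal)
- fractional indexing, [When to Use Which Model](/tw/ch3#sec_datamodels_document_summary)
- fragmentation (B-trees), [Disk Space Usage](/tw/ch4#disk-space-usage)
- frames (computer graphics), [Pros and Cons of Sync Engines](/tw/ch6#pros-and-cons-of-sync-engines)
- frontend (web development), [Trade-Offs in Data Systems Architecture](/tw/ch1#ch_tradeoffs)
- FrostDB (database)
- deterministic simulation testing (DST), [Deterministic Simulation Testing](/tw/ch9#deterministic-simulation-testing)
- fsync (system call), [Making B-Trees Reliable](/tw/ch4#sec_storage_btree_wal), [Durability](/tw/ch8#durability)
- full-text search, [Full-Text Search](/tw/ch4#sec_storage_full_text), [Glossary](/tw/glossary)
- and fuzzy indexes, [Full-Text Search](/tw/ch4#sec_storage_full_text)
- Lucene storage engine, [Full-Text Search](/tw/ch4#sec_storage_full_text)
- sharded indexes, [Sharding and Secondary Indexes](/tw/ch7#sec_sharding_secondary_indexes)
- Function as a Service (FaaS), [Microservices and Serverless](/tw/ch1#sec_introduction_microservices)
- functional programming
- inspiration for MapReduce, [MapReduce](/tw/ch11#sec_batch_mapreduce)
- functional requirements, [Defining Nonfunctional Requirements](/tw/ch2#ch_nonfunctional)
- FUSE (see filesystem in userspace (FUSE))
- fuzzing, [Formal Methods and Randomized Testing](/tw/ch9#sec_distributed_formal)
- fuzzy search (see similarity search)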
### G
- Gallina (specification language), [Model Checking and Specification Languages](/tw/ch9#model-checking-and-specification-languages)
- game development, [Pros and Cons of Sync Engines](/tw/ch6#pros-and-cons-of-sync-engines)
- garbage collection
- immutability and, [Limitations of Immutability](/tw/ch12#sec_stream_immutability_limitations)
- process pauses for, [Latency and Response Time](/tw/ch2#id23), [Process Pauses](/tw/ch9#sec_distributed_clocks_pauses)-[Limiting the Impact of Garbage Collection](/tw/ch9#sec_distributed_gc_impact), [The Majority Rules](/tw/ch9#sec_distributed_majority)
- (see also process pauses)
- gas stations, algorithmic pricing, [Feedback Loops](/ch14#id372)
- GDPR (regulation), [Data Systems, Law, and Society](/tw/ch1#sec_introduction_compliance), [Limitations of Immutability](/tw/ch12#sec_stream_immutability_limitations)
- consent, [Consent and Freedom of Choice](/ch14#id375)
- data minimization, [Legislation and Self-Regulation](/ch14#sec_future_legislation)
- legitimate interest, [Consent and Freedom of Choice](/ch14#id375)
- right of access, [Sharding for Multitenancy](/tw/ch7#sec_sharding_multitenancy)
- right to erasure, [Data Systems, Law, and Society](/tw/ch1#sec_introduction_compliance), [Disk Space Usage](/tw/ch4#disk-space-usage), [Sharding for Multitenancy](/tw/ch7#sec_sharding_multitenancy)
- GenBank (genome database), [Summary](/tw/ch3#summary)
- General Data Protection Regulation (see GDPR (regulation))
- genome analysis, [Summary](/tw/ch3#summary)
- geographic distribution (see regions (geographic distribution))
- geospatial indexes, [Multidimensional and Full-Text Indexes](/tw/ch4#sec_storage_multidimensional)
- Git (version control system), [Concurrency Control](/tw/ch12#sec_stream_concurrency)
- local-first software, [Real-Time Collaboration, Offline-First, and Local-First Apps](/tw/ch6#real-time-collaboration-offline-first-and-local-first-apps)
- merge conflicts, [Manual Conflict Resolution](/tw/ch6#manual-conflict-resolution)
- GitHub, postmortems, [Leader Failure: Failover](/tw/ch6#leader-failure-failover), [Mapping System Models to the Real World](/tw/ch9#mapping-system-models-to-the-real-world)
- global secondary indexes, [Global Secondary Indexes](/tw/ch7#id167), [Summary](/tw/ch7#summary)
- globally unique identifiers (see UUIDs)
- GlusterFS (distributed filesystem), [Batch Processing](/tw/ch11#ch_batch), [Distributed Filesystems](/tw/ch11#sec_batch_dfs), [Object Stores](/tw/ch11#id277)
- GNU Coreutils (Linux), [Sorting Versus In-Memory Aggregation](/tw/ch11#id275)
- Go (programming language)
- garbage collection, [Limiting the Impact of Garbage Collection](/tw/ch9#sec_distributed_gc_impact)
- GoldenGate (change data capture), [Implementing Change Data Capture](/tw/ch12#id307)
- (see also Oracle)
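- Google
- BigQuery (see BigQuery (database))
- Bigtable (see Bigtable (database))
- Chubby (lock service), [Coordination Services](/tw/ch10#sec_consistency_coordination)
- Cloud Storage (object storage), [Setting Up New Replicas](/tw/ch6#sec_replication_new_replica), [Object Stores](/tw/ch11#id277)
- request preconditions, [Fencing Off Zombies and Delayed Requests](/tw/ch9#sec_distributed_fencing_tokens)
- Compute Engine
- preemptible instances, [Handling Faults](/tw/ch11#id281)
- Dataflow (stream processing)
- data warehouse integration, [Cloud Data Warehouses](/tw/ch4#sec_cloud_data_warehouses)
- shuffling data, [Shuffling Data](/tw/ch11#sec_shuffle)
- Dataflow (stream processor), [Stream Analytics](/tw/ch12#id318), [Atomic Commit Revisited](/tw/ch12#sec_stream_atomic_commit), [Unifying Batch and Stream Processing](/tw/ch13#id338)
- (see also Beam)
- Datastream (change data capture), [API Support for Change Streams](/tw/ch12#sec_stream_change_api)
- Docs (collaborative editor), [Real-Time Collaboration, Offline-First, and Local-First Apps](/tw/ch6#real-time-collaboration-offline-first-and-local-first-apps), [CRDTs and Operational Transformation](/tw/ch6#sec_replication_crdts)
- operational transformation, [CRDTs and Operational Transformation](/tw/ch6#sec_replication_crdts)
- Dremel (query engine), [Column-Oriented Storage](/tw/ch4#sec_storage_column)
- Firestore (database), [Pros and Cons of Sync Engines](/tw/ch6#pros-and-cons-of-sync-engines)
- MapReduce (batch processing), [Batch Processing](/tw/ch11#ch_batch)
- (see also MapReduce)
- Percolator (transaction system), [Implementing a Linearizable ID Generator](/tw/ch10#implementing-a-linearizable-id-generator)
- Persistent Disk (cloud service), [Separation of Storage and Compute](/tw/ch1#sec_introduction_storage_compute)
- Pub/Sub (messaging system), [Message Brokers](/tw/ch5#message-brokers), [Message Brokers Compared to Databases](/tw/ch12#id297), [Using Logs for Message Storage](/tw/ch12#id300)
- response time study, [Average, Median, and Percentiles](/tw/ch2#id24)
- Sheets (collaborative spreadsheet), [Real-Time Collaboration, Offline-First, and Local-First Apps](/tw/ch6#real-time-collaboration-offline-first-and-local-first-apps), [CRDTs and Operational Transformation](/tw/ch6#sec_replication_crdts)
- Spanner (see Spanner (database))
- TrueTime (clock API), [Clock Readings with a Confidence Interval](/tw/ch9#clock-readings-with-a-confidence-interval)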
- gossip protocol, [Request Routing](/tw/ch7#sec_sharding_routing)
- governance, [Beyond the Data Lake](/tw/ch1#beyond-the-data-lake)
- government use of data, [Data as Assets and Power](/ch14#id376)
- GPS (Global Positioning System)
- use for clock synchronization, [Unreliable Clocks](/tw/ch9#sec_distributed_clocks), [Clock Synchronization and Accuracy](/tw/ch9#sec_distributed_clock_accuracy), [Clock Readings with a Confidence Interval](/tw/ch9#clock-readings-with-a-confidence-interval), [Synchronized Clocks for Global Snapshots](/tw/ch9#sec_distributed_spanner)
- GPT (language model), [Vector Embeddings](/tw/ch4#id92)
- GPU (graphics processing unit), [Layering of Cloud Services](/tw/ch1#layering-of-cloud-services), [Distributed Versus Single-Node Systems](/tw/ch1#sec_introduction_distributed)
- gradual rollout (see rolling upgrades)
- GraphQL (query language), [GraphQL](/tw/ch3#id63)
- validation, [Pros and Cons of Stored Procedures](/tw/ch8#sec_transactions_stored_proc_tradeoffs)
- graphs, [Glossary](/tw/glossary)
- as data models, [Graph-Like Data Models](/tw/ch3#sec_datamodels_graph)-[GraphQL](/tw/ch3#id63)
- property graphs, [Property Graphs](/tw/ch3#id56)
- RDF and triple-stores, [Triple-Stores and SPARQL](/tw/ch3#id59)-[The SPARQL Query Language](/tw/ch3#the-sparql-query-language)
- DAGs (see directed acyclic graphs)
- processing and analysis, [Machine Learning](/tw/ch11#id290)
- query languages
- Cypher, [The Cypher Query Language](/tw/ch3#id57)
- Datalog, [Datalog: Recursive Relational Queries](/tw/ch3#id62)
- GraphQL, [GraphQL](/tw/ch3#id63)
- Gremlin, [Graph-Like Data Models](/tw/ch3#sec_datamodels_graph)
- recursive SQL queries, [Graph Queries in SQL](/tw/ch3#id58)
- SPARQL, [The SPARQL Query Language](/tw/ch3#the-sparql-query-language)
- traversal, [Property Graphs](/tw/ch3#id56)
- gray failures, [System Model and Reality](/tw/ch9#sec_distributed_system_model)
- leaderless replication, [Single-Leader Versus Leaderless Replication Performance](/tw/ch6#sec_replication_leaderless_perf)
- Gremlin (graph query language), [Graph-Like Data Models](/tw/ch3#sec_datamodels_graph)
- grep (Unix tool), [Simple Log Analysis](/tw/ch11#sec_batch_log_analysis)
- gRPC (service calls), [Microservices and Serverless](/tw/ch1#sec_introduction_microservices), [Web Services](/tw/ch5#sec_web_services)
- forward and backward compatibility, [Data Encoding and Evolution for RPC](/tw/ch5#data-encoding-and-evolution-for-rpc)
- GUIDs (see UUIDs)
### H
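- Hadoop (data infrastructure)
- comparison to distributed databases, [Batch Processing](/tw/ch11#ch_batch)
- MapReduce (see MapReduce)
- NodeManager, [Distributed Job Orchestration](/tw/ch11#id278)
- YARN (see YARN (job scheduler))
- HANA (see SAP HANA (database))
- happens-before relation, [The "Happens-Before" Relation and Concurrency](/tw/ch6#sec_replication_happens_before)
- hard disks
- access patterns, [Sequential Versus Random Writes](/tw/ch4#sidebar_sequential)
- detecting corruption, [The End-to-End Argument](/tw/ch13#sec_future_e2e_argument), [Don't Just Blindly Trust What They Promise](/tw/ch13#id364)
- faults in, [Hardware and Software Faults](/tw/ch2#sec_introduction_hardware_faults), [Durability](/tw/ch8#durability)
- sequential versus random writes, [Sequential Versus Random Writes](/tw/ch4#sidebar_sequential)
- sequential write throughput, [Disk Space Usage](/tw/ch12#sec_stream_disk_usage)
- hardware faults, [Hardware and Software Faults](/tw/ch2#sec_introduction_hardware_faults)
- hash functions
- in Bloom filters, [Bloom Filters](/tw/ch4#bloom-filters)
- hash joins
- in stream processing, [Stream-Table Join (Stream Enrichment)](/tw/ch12#sec_stream_table_joins)
- hash sharding, [Sharding by Hash of Key](/tw/ch7#sec_sharding_hash)-[Consistent Hashing](/tw/ch7#sec_sharding_consistent_hashing), [Summary](/tw/ch7#summary)
- consistent hashing, [Consistent Hashing](/tw/ch7#sec_sharding_consistent_hashing)
- problems with hash mod N, [Hash Modulo Number of Nodes](/tw/ch7#hash-modulo-number-of-nodes)
- range queries, [Sharding by Hash Range](/tw/ch7#sharding-by-hash-range)
- suitable hash functions, [Sharding by Hash of Key](/tw/ch7#sec_sharding_hash)
- with fixed number of shards, [Fixed Number of Shards](/tw/ch7#fixed-number-of-shards)
- hash tables, [Log-Structured Storage](/tw/ch4#sec_storage_log_structured)
- Hazelcast (in-memory data grid)
- FencedLock, [Fencing Off Zombies and Delayed Requests](/tw/ch9#sec_distributed_fencing_tokens)
- Flake ID Generator, [ID Generators and Logical Clocks](/tw/ch10#sec_consistency_logical)
- HBase (database)
- bug due to lack of fencing, [Distributed Locks and Leases](/tw/ch9#sec_distributed_lock_fencing)
- key-range sharding, [Sharding by Key Range](/tw/ch7#sec_sharding_key_range)
- log-structured storage, [Constructing and Merging SSTables](/tw/ch4#constructing-and-merging-sstables)
- regions (sharding), [Sharding](/tw/ch7#ch_sharding)
- request routing, [Request Routing](/tw/ch7#sec_sharding_routing)
- size-tiered compaction, [Compaction Strategies](/tw/ch4#sec_storage_lsm_compaction)
- wide-column data model, [Data Locality for Reads and Writes](/tw/ch3#sec_datamodels_document_locality), [Column Compression](/tw/ch4#sec_storage_column_compression)
- HDFS (Hadoop Distributed File System), [Batch Processing](/tw/ch11#ch_batch), [Distributed Filesystems](/tw/ch11#sec_batch_dfs)
- (see also distributed filesystems)
- checking data integrity, [Don't Just Blindly Trust What They Promise](/tw/ch13#id364)
- DataNode, [Distributed Filesystems](/tw/ch11#sec_batch_dfs)
- NameNode, [Distributed Filesystems](/tw/ch11#sec_batch_dfs)
- use in MapReduce, [MapReduce](/tw/ch11#sec_batch_mapreduce)
- workflow example, [Scheduling Workflows](/tw/ch11#sec_batch_workflows)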
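- HdrHistogram (numerical library), [Use of Response Time Metrics](/tw/ch2#sec_introduction_slo_sla)
- head (Unix tool), [Simple Log Analysis](/tw/ch11#sec_batch_log_analysis), [Distributed Job Orchestration](/tw/ch11#id278)
- head vertex (property graphs), [Property Graphs](/tw/ch3#id56)
- head-of-line blocking, [Latency and Response Time](/tw/ch2#id23)
- heap files (databases), [Storing Values Within the Index](/tw/ch4#sec_storage_index_heap)
- in multi-version concurrency control, [Multi-Version Concurrency Control (MVCC)](/tw/ch8#sec_transactions_snapshot_impl)
- heat management, [Skewed Workloads and Relieving Hot Spots](/tw/ch7#sec_sharding_skew)
- hedged requests, [Single-Leader Versus Leaderless Replication Performance](/tw/ch6#sec_replication_leaderless_perf)
- heterogeneous distributed transactions, [Distributed Transactions Across Different Systems](/tw/ch8#sec_transactions_xa), [Problems with XA Transactions](/tw/ch8#problems-with-xa-transactions)
- heuristic decisions, [Recovering from Coordinator Failure](/tw/ch8#recovering-from-coordinator-failure)
- Hex (notebook), [Machine Learning](/tw/ch11#id290)
- hexagons
- for geospatial indexing, [Multidimensional and Full-Text Indexes](/tw/ch4#sec_storage_multidimensional)
- Hibernate (object-relational mapper), [Object-Relational Mapping (ORM)](/tw/ch3#object-relational-mapping-orm)
- hierarchical model, [Relational Model Versus Document Model](/tw/ch3#sec_datamodels_history)
- hierarchical navigable small world (vector index), [Vector Embeddings](/tw/ch4#id92)
- hierarchical queries (see recursive common table expressions)
- high availability (see fault tolerance)
- high-frequency trading, [Clock Synchronization and Accuracy](/tw/ch9#sec_distributed_clock_accuracy)
- high-performance computing (HPC), [Cloud Computing Versus Supercomputing](/tw/ch1#id17)
- hinted handoff, [Catching Up on Missed Writes](/tw/ch6#sec_replication_read_repair)
- histograms, [Use of Response Time Metrics](/tw/ch2#sec_introduction_slo_sla)
- Hive (data warehouse), [Cloud Data Warehouses](/tw/ch4#sec_cloud_data_warehouses)
- query optimizer, [Query Languages](/tw/ch11#sec_batch_query_lanauges)
- HNSW (vector index), [Vector Embeddings](/tw/ch4#id92)
- hopping windows (stream processing), [Types of Windows](/tw/ch12#id324)
- (see also windows)
- Hoptimator (query engine), [The Meta-Database of Everything](/tw/ch13#id341)
- Horizon scandal, [Humans and Reliability](/tw/ch2#id31)
- lack of transactions, [Transactions](/tw/ch8#ch_transactions)
- horizontal scaling (see scaling out)
- by sharding, [Pros and Cons of Sharding](/tw/ch7#sec_sharding_reasons)
- HornetQ (messaging system), [Message Brokers](/tw/ch5#message-brokers), [Message Brokers Compared to Databases](/tw/ch12#id297)
- distributed transaction support, [XA Transactions](/tw/ch8#xa-transactions)
- hot keys, [Sharding of Key-Value Data](/tw/ch7#sec_sharding_key_value)
- hot spots, [Sharding of Key-Value Data](/tw/ch7#sec_sharding_key_value)
- due to celebrities, [Skewed Workloads and Relieving Hot Spots](/tw/ch7#sec_sharding_skew)
- for time-series data, [Sharding by Key Range](/tw/ch7#sec_sharding_key_range)
- relieving, [Skewed Workloads and Relieving Hot Spots](/tw/ch7#sec_sharding_skew)
- hot standbys (see leader-based replication)
- HTAP (see hybrid transactional/analytic processing)
- HTTP, use in APIs (see services)
- human errors, [Humans and Reliability](/tw/ch2#id31), [Network Faults in Practice](/tw/ch9#sec_distributed_network_faults), [Batch Processing](/tw/ch11#ch_batch)
- hybrid logical clocks, [Hybrid Logical Clocks](/tw/ch10#hybrid-logical-clocks)
- hybrid transactional/analytic processing, [Data Warehousing](/tw/ch1#sec_introduction_dwh), [Data Storage for Analytics](/tw/ch4#sec_storage_analytics)
- hydrating IDs (join), [Denormalization in the Social Networking Case Study](/tw/ch3#denormalization-in-the-social-networking-case-study)
- hypergraphs, [Property Graphs](/tw/ch3#id56)
- HyperLogLog (algorithm), [Stream Analytics](/tw/ch12#id318)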
### I
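- I/O operations, waiting for, [Process Pauses](/tw/ch9#sec_distributed_clocks_pauses)
- IaaS (see infrastructure as a service (IaaS))
- IBM
- Db2 (database)
- distributed transaction support, [XA Transactions](/tw/ch8#xa-transactions)
- serializable isolation, [Snapshot Isolation, Repeatable Read, and Naming Confusion](/tw/ch8#snapshot-isolation-repeatable-read-and-naming-confusion), [Implementation of Two-Phase Locking](/tw/ch8#implementation-of-two-phase-locking)
- MQ (messaging system), [Message Brokers Compared to Databases](/tw/ch12#id297)
- distributed transaction support, [XA Transactions](/tw/ch8#xa-transactions)
- System R (database), [What Exactly Is a Transaction?](/tw/ch8#sec_transactions_overview)
- WebSphere (messaging system), [Message Brokers](/tw/ch5#message-brokers)
- Iceberg (table format), [Cloud Data Warehouses](/tw/ch4#sec_cloud_data_warehouses)
- databases on object storage, [Setting Up New Replicas](/tw/ch6#sec_replication_new_replica)
- log-based message broker storage, [Disk Space Usage](/tw/ch12#sec_stream_disk_usage)
- idempotence, [The Problems with Remote Procedure Calls (RPCs)](/tw/ch5#sec_problems_with_rpc), [Idempotence](/tw/ch12#sec_stream_idempotence), [Glossary](/tw/glossary)
- by giving operations unique IDs, [Multi-Partition Request Processing](/tw/ch13#id360)
- by giving requests unique IDs, [Operation Identifiers](/tw/ch13#id355)
- for exactly-once semantics, [Exactly-Once Message Processing Revisited](/tw/ch8#exactly-once-message-processing-revisited)
- idempotent operations, [Exactly-Once Execution of an Operation](/tw/ch13#id353)
- in workflow engines, [Durable Execution](/tw/ch5#durable-execution)
- immutability
- advantages of, [Advantages of Immutable Events](/tw/ch12#sec_stream_immutability_pros), [Designing for Auditability](/tw/ch13#id365)
- and right to erasure, [Data Systems, Law, and Society](/tw/ch1#sec_introduction_compliance), [Disk Space Usage](/tw/ch4#disk-space-usage)
- crypto-shredding, [Event Sourcing and CQRS](/tw/ch3#sec_datamodels_events), [Limitations of Immutability](/tw/ch12#sec_stream_immutability_limitations)
- deriving state from event log, [State, Streams, and Immutability](/tw/ch12#sec_stream_immutability)-[Limitations of Immutability](/tw/ch12#sec_stream_immutability_limitations)
- for crash recovery, [Constructing and Merging SSTables](/tw/ch4#constructing-and-merging-sstables)
- in B-trees, [B-Tree Variants](/tw/ch4#b-tree-variants), [Indexes and Snapshot Isolation](/tw/ch8#indexes-and-snapshot-isolation)
- in event sourcing, [Event Sourcing and CQRS](/tw/ch3#sec_datamodels_events), [Change Data Capture Versus Event Sourcing](/tw/ch12#sec_stream_event_sourcing)
- limitations of, [Concurrency Control](/tw/ch12#sec_stream_concurrency)
- impedance mismatch, [The Object-Relational Mismatch](/tw/ch3#sec_datamodels_document)
- in doubt (transaction status), [Coordinator Failure](/tw/ch8#coordinator-failure)
- holding locks, [Holding Locks While in Doubt](/tw/ch8#holding-locks-while-in-doubt)
- orphaned transactions, [Recovering from Coordinator Failure](/tw/ch8#recovering-from-coordinator-failure)
- in-memory databases, [Keeping Everything in Memory](/tw/ch4#sec_storage_inmemory)
- durability, [Durability](/tw/ch8#durability)
- serial transaction execution, [Actual Serial Execution](/tw/ch8#sec_transactions_serial)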
- 事件
- 導致錯誤定罪的會計軟體錯誤, [人類與可靠性](/tw/ch2#id31)
- 無咎死後, [人類與可靠性](/tw/ch2#id31)
- 跳躍秒墜機, [時鐘同步和準確性](/tw/ch9#sec_distributed_clock_accuracy)
- data corruption and financial losses due to concurrency bugs, [弱隔離級別](/tw/ch8#sec_transactions_isolation_levels)
- data corruption on hard disks, [永續性](/tw/ch8#durability)
- data loss due to last-write-wins, [用於事件排序的時間戳](/tw/ch9#sec_distributed_lww)
- data on disks unreadable, [將系統模型對映到現實世界](/tw/ch9#mapping-system-models-to-the-real-world)
- disclosure of sensitive data due to primary key reuse, [領導者故障:故障轉移](/tw/ch6#leader-failure-failover)
- 事務序列性中的錯誤, [維護完整性,儘管軟體有Bug](/tw/ch13#id455)
- gigabit network interface with 1 Kb/s throughput, [系統模型與現實](/tw/ch9#sec_distributed_system_model)
- leap second crash, [軟體故障](/tw/ch2#software-faults)
- network faults, [實踐中的網路故障](/tw/ch9#sec_distributed_network_faults)
- network interface dropping only inbound packets, [實踐中的網路故障](/tw/ch9#sec_distributed_network_faults)
- 網路分割槽和全資料中心故障, [故障與部分失效](/tw/ch9#sec_distributed_partial_failure)
- 網路故障處理不當, [實踐中的網路故障](/tw/ch9#sec_distributed_network_faults)
- sending message to ex-partner, [排序事件以捕獲因果關係](/tw/ch13#sec_future_capture_causality)
- 咬海底電纜的鯊魚, [實踐中的網路故障](/tw/ch9#sec_distributed_network_faults)
- split brain due to 1-minute packet delay, [領導者故障:故障轉移](/tw/ch6#leader-failure-failover), [實踐中的網路故障](/tw/ch9#sec_distributed_network_faults)
- SSD failure after 32,768 hours, [軟體故障](/tw/ch2#software-faults)
- thread contention bringing down a service, [程序暫停](/tw/ch9#sec_distributed_clocks_pauses)
- 伺服器架中的振動, [延遲與響應時間](/tw/ch2#id23)
- violation of uniqueness constraint, [維護完整性,儘管軟體有Bug](/tw/ch13#id455)
- incremental view maintenance (IVM), [維護物化檢視](/tw/ch12#sec_stream_mat_view)
- 資料整合, [分拆系統與整合系統](/tw/ch13#id448)
- 索引, [OLTP 系統的儲存與索引](/tw/ch4#sec_storage_oltp), [術語表](/tw/glossary)
- and snapshot isolation, [索引與快照隔離](/tw/ch8#indexes-and-snapshot-isolation)
- 作為衍生資料, [記錄系統與派生資料](/tw/ch1#sec_introduction_derived), [組合使用資料儲存技術](/tw/ch13#id447)-[分拆系統與整合系統](/tw/ch13#id448)
- B樹, [B 樹](/tw/ch4#sec_storage_b_trees)-[B 樹變體](/tw/ch4#b-tree-variants)
- clustered, [在索引中儲存值](/tw/ch4#sec_storage_index_heap)
- comparison of B-trees and LSM-trees, [比較 B 樹與 LSM 樹](/tw/ch4#sec_storage_btree_lsm_comparison)-[磁碟空間使用](/tw/ch4#disk-space-usage)
- 覆蓋(包括各欄), [在索引中儲存值](/tw/ch4#sec_storage_index_heap)
- 建立, [建立索引](/tw/ch13#id340)
- 全文檢索, [全文檢索](/tw/ch4#sec_storage_full_text)
- 地理空間, [多維索引與全文索引](/tw/ch4#sec_storage_multidimensional)
- 索引範圍鎖定, [索引範圍鎖](/tw/ch8#sec_transactions_2pl_range)
- multi-column (concatenated), [多維索引與全文索引](/tw/ch4#sec_storage_multidimensional)
- secondary, [多列索引與二級索引](/tw/ch4#sec_storage_index_multicolumn)
- (另見 secondary indexes)
- 雙寫問題, [保持系統同步](/tw/ch12#sec_stream_sync), [理解資料流](/tw/ch13#id443)
- sharding and secondary indexes, [分片與二級索引](/tw/ch7#sec_sharding_secondary_indexes)-[全域性二級索引](/tw/ch7#id167), [總結](/tw/ch7#summary)
- sparse, [SSTable 檔案格式](/tw/ch4#the-sstable-file-format)
- SSTable 與 LSM 樹, [SSTable 檔案格式](/tw/ch4#the-sstable-file-format)-[壓實策略](/tw/ch4#sec_storage_lsm_compaction)
- 資料變化時更新, [保持系統同步](/tw/ch12#sec_stream_sync), [維護物化檢視](/tw/ch12#sec_stream_mat_view)
- Industrial Revolution, [回顧工業革命](/ch14#id377)
- InfiniBand (networks), [我們不能簡單地使網路延遲可預測嗎?](/tw/ch9#can-we-not-simply-make-network-delays-predictable)
- InfluxDB IOx (storage engine), [列式儲存](/tw/ch4#sec_storage_column)
- information retrieval(見 全文檢索)
- infrastructure as a service (IaaS), [雲服務與自託管](/tw/ch1#sec_introduction_cloud), [雲服務的分層](/tw/ch1#layering-of-cloud-services)
- InnoDB (storage engine)
- clustered index on primary key, [在索引中儲存值](/tw/ch4#sec_storage_index_heap)
- not preventing lost updates, [自動檢測丟失的更新](/tw/ch8#automatically-detecting-lost-updates)
- preventing write skew, [寫偏差的特徵](/tw/ch8#characterizing-write-skew), [兩階段鎖定的實現](/tw/ch8#implementation-of-two-phase-locking)
- serializable isolation, [兩階段鎖定的實現](/tw/ch8#implementation-of-two-phase-locking)
- snapshot isolation support, [快照隔離與可重複讀](/tw/ch8#sec_transactions_snapshot_isolation)
- 例項(雲計算), [雲服務的分層](/tw/ch1#layering-of-cloud-services)
- integrating different data systems(見 資料整合)
- integrity, [及時性與完整性](/tw/ch13#sec_future_integrity)
- coordination-avoiding data systems, [無協調資料系統](/tw/ch13#id454)
- 資料流系統的正確性, [資料流系統的正確性](/tw/ch13#id453)
- in consensus formalization, [單值共識](/tw/ch10#single-value-consensus), [原子提交作為共識](/tw/ch10#atomic-commitment-as-consensus)
- 完整性檢查, [不要盲目信任承諾](/tw/ch13#id364)
- (另見 審計)
- 端到端, [端到端原則](/tw/ch13#sec_future_e2e_argument), [端到端原則重現](/tw/ch13#id456)
- 使用快照隔離, [快照隔離與可重複讀](/tw/ch8#sec_transactions_snapshot_isolation)
- 儘管軟體錯誤仍然維護, [維護完整性,儘管軟體有Bug](/tw/ch13#id455)
- Interface Definition Language (IDL), [Protocol Buffers](/tw/ch5#sec_encoding_protobuf), [Avro](/tw/ch5#sec_encoding_avro), [Web 服務](/tw/ch5#sec_web_services)
- 不變式, [一致性](/tw/ch8#sec_transactions_acid_consistency)
- (另見 constraints)
- inverted file index (vector index), [向量嵌入](/tw/ch4#id92)
- inverted index, [全文檢索](/tw/ch4#sec_storage_full_text)
- irreversibility, minimizing, [可演化性:讓變化更容易](/tw/ch2#sec_introduction_evolvability), [事件溯源與 CQRS](/tw/ch3#sec_datamodels_events), [批處理](/tw/ch11#ch_batch)
- ISDN (Integrated Services Digital Network), [同步與非同步網路](/tw/ch9#sec_distributed_sync_networks)
- 隔離性
- cgroups(見 cgroups)
- 隔離性, [隔離性](/tw/ch8#sec_transactions_acid_isolation), [單物件與多物件操作](/tw/ch8#sec_transactions_multi_object), [術語表](/tw/glossary)
- 正確性和, [追求正確性](/tw/ch13#sec_future_correctness)
- 用於單物件寫入, [單物件寫入](/tw/ch8#sec_transactions_single_object)
- 可序列化, [可序列化](/tw/ch8#sec_transactions_serializability)-[可序列化快照隔離的效能](/tw/ch8#performance-of-serializable-snapshot-isolation)
- 實際執行, [實際序列執行](/tw/ch8#sec_transactions_serial)-[序列執行總結](/tw/ch8#summary-of-serial-execution)
- 可序列化快照隔離, [可序列化快照隔離(SSI)](/tw/ch8#sec_transactions_ssi)-[可序列化快照隔離的效能](/tw/ch8#performance-of-serializable-snapshot-isolation)
- 兩階段鎖定, [兩階段鎖定(2PL)](/tw/ch8#sec_transactions_2pl)-[索引範圍鎖](/tw/ch8#sec_transactions_2pl_range)
- 違反, [單物件與多物件操作](/tw/ch8#sec_transactions_multi_object)
- weak isolation levels, [弱隔離級別](/tw/ch8#sec_transactions_isolation_levels)-[物化衝突](/tw/ch8#materializing-conflicts)
- 防止丟失更新, [防止丟失更新](/tw/ch8#sec_transactions_lost_update)-[衝突解決與複製](/tw/ch8#conflict-resolution-and-replication)
- 讀已提交, [讀已提交](/tw/ch8#sec_transactions_read_committed)-[實現讀已提交](/tw/ch8#sec_transactions_read_committed_impl)
- 快照隔離, [快照隔離與可重複讀](/tw/ch8#sec_transactions_snapshot_isolation)-[快照隔離、可重複讀和命名混淆](/tw/ch8#snapshot-isolation-repeatable-read-and-naming-confusion)
- IVF (vector index), [向量嵌入](/tw/ch4#id92)
### J
- Java Database Connectivity (JDBC)
- 分散式事務支援, [XA 事務](/tw/ch8#xa-transactions)
- 網路驅動程式, [模式的優點](/tw/ch5#sec_encoding_schemas)
- Java Enterprise Edition (EE), [遠端過程呼叫(RPC)的問題](/tw/ch5#sec_problems_with_rpc), [兩階段提交(2PC)](/tw/ch8#sec_transactions_2pc), [XA 事務](/tw/ch8#xa-transactions)
- Java Message Service (JMS), [訊息代理與資料庫的對比](/tw/ch12#id297)
- (另見 messaging systems)
- comparison to log-based messaging, [日誌與傳統的訊息傳遞相比](/tw/ch12#sec_stream_logs_vs_messaging), [重播舊訊息](/tw/ch12#sec_stream_replay)
- 分散式事務支援, [XA 事務](/tw/ch8#xa-transactions)
- 訊息順序, [確認與重新傳遞](/tw/ch12#sec_stream_reordering)
- Java Transaction API (JTA), [兩階段提交(2PC)](/tw/ch8#sec_transactions_2pc), [XA 事務](/tw/ch8#xa-transactions)
- Java Virtual Machine (JVM)
- 垃圾收集, [程序暫停](/tw/ch9#sec_distributed_clocks_pauses), [限制垃圾回收的影響](/tw/ch9#sec_distributed_gc_impact)
- JIT compilation, [查詢執行:編譯與向量化](/tw/ch4#sec_storage_vectorized)
- process reuse in batch processors, [資料流引擎](/tw/ch11#sec_batch_dataflow)
- Jena (RDF framework), [RDF 資料模型](/tw/ch3#the-rdf-data-model)
- SPARQL 查詢語言, [SPARQL 查詢語言](/tw/ch3#the-sparql-query-language)
- Jepsen (fault tolerance testing), [故障注入](/tw/ch9#sec_fault_injection), [追求正確性](/tw/ch13#sec_future_correctness)
- jitter (網路延遲), [平均值、中位數與百分位點](/tw/ch2#id24), [網路擁塞和排隊](/tw/ch9#network-congestion-and-queueing)
- JMESPath(查詢語言), [查詢語言](/tw/ch11#sec_batch_query_lanauges)
- join tables, [多對一與多對多關係](/tw/ch3#sec_datamodels_many_to_many), [屬性圖](/tw/ch3#id56)
- joins, [術語表](/tw/glossary)
- 作為關係運算符表示, [查詢語言](/tw/ch11#sec_batch_query_lanauges)
- handling GraphQL query, [GraphQL](/tw/ch3#id63)
- 應用程式程式碼, [正規化、反正規化與連線](/tw/ch3#sec_datamodels_normalization), [社交網路案例研究中的反正規化](/tw/ch3#denormalization-in-the-social-networking-case-study)
- in DataFrames, [資料框、矩陣與陣列](/tw/ch3#sec_datamodels_dataframes)
- 關係資料庫和文件資料庫, [正規化、反正規化與連線](/tw/ch3#sec_datamodels_normalization)
- secondary indexes and, [多列索引與二級索引](/tw/ch4#sec_storage_index_multicolumn)
- 排序合併, [JOIN 與 GROUP BY](/tw/ch11#sec_batch_join)
- 串流連線, [流連線](/tw/ch12#sec_stream_joins)-[連線的時間依賴性](/tw/ch12#sec_stream_join_time)
- 串流流連線, [流流連線(視窗連線)](/tw/ch12#id440)
- stream-table join, [流表連線(流擴充)](/tw/ch12#sec_stream_table_joins)
- table-table join, [表表連線(維護物化檢視)](/tw/ch12#id326)
- 時間的依賴性, [連線的時間依賴性](/tw/ch12#sec_stream_join_time)
- 文件資料庫中的支援, [文件和關係資料庫的融合](/tw/ch3#convergence-of-document-and-relational-databases)
- JOTM (transaction coordinator), [兩階段提交(2PC)](/tw/ch8#sec_transactions_2pc)
- journaling (filesystems), [使 B 樹可靠](/tw/ch4#sec_storage_btree_wal)
- JSON
- aggregation pipeline (query language), [文件的查詢語言](/tw/ch3#query-languages-for-documents)
- Avro schema representation, [Avro](/tw/ch5#sec_encoding_avro)
- 二進位制變體, [二進位制編碼](/tw/ch5#binary-encoding)
- 資料位置, [讀寫的資料區域性](/tw/ch3#sec_datamodels_document_locality)
- 文件資料模型, [關係模型與文件模型](/tw/ch3#sec_datamodels_history)
- 應用資料的問題, [JSON、XML 及其二進位制變體](/tw/ch5#sec_encoding_json)
- GraphQL response, [GraphQL](/tw/ch3#id63)
- 關係資料庫, [文件模型中的模式靈活性](/tw/ch3#sec_datamodels_schema_flexibility)
- 代表簡歷(例), [用於一對多關係的文件資料模型](/tw/ch3#the-document-data-model-for-one-to-many-relationships)
- 模式, [JSON 模式](/tw/ch5#json-schema)
- JSON-LD, [三元組儲存與 SPARQL](/tw/ch3#id59)
- JsonPath(查詢語言), [查詢語言](/tw/ch11#sec_batch_query_lanauges)
- JuiceFS(分散式檔案系統), [分散式檔案系統](/tw/ch11#sec_batch_dfs), [物件儲存](/tw/ch11#id277)
- Jupyter (notebook), [機器學習](/tw/ch11#id290)
- just-in-time (JIT) compilation, [查詢執行:編譯與向量化](/tw/ch4#sec_storage_vectorized)
### K
- Kafka(訊息系統), [訊息代理](/tw/ch5#message-brokers), [使用日誌進行訊息儲存](/tw/ch12#id300)
- 消費者群體, [多個消費者](/tw/ch12#id298)
- 資料整合, [分拆系統與整合系統](/tw/ch13#id448)
- for event sourcing, [事件溯源與 CQRS](/tw/ch3#sec_datamodels_events)
- Kafka Connect (database integration), [資料變更捕獲的實現](/tw/ch12#id307), [變更流的 API 支援](/tw/ch12#sec_stream_change_api), [從同一事件日誌中派生多個檢視](/tw/ch12#sec_stream_deriving_views)
- Kafka Streams (stream processor), [流分析](/tw/ch12#id318), [維護物化檢視](/tw/ch12#sec_stream_mat_view)
- 恰好一次語義, [再談恰好一次訊息處理](/tw/ch8#exactly-once-message-processing-revisited)
- fault tolerance, [失敗後重建狀態](/tw/ch12#sec_stream_state_fault_tolerance)
- ksqlDB (stream database), [維護物化檢視](/tw/ch12#sec_stream_mat_view)
- 基於領導者的複製, [單主複製](/tw/ch6#sec_replication_leader)
- 日誌壓縮, [日誌壓縮](/tw/ch12#sec_stream_log_compaction), [維護物化檢視](/tw/ch12#sec_stream_mat_view)
- offsets, [使用日誌進行訊息儲存](/tw/ch12#id300), [冪等性](/tw/ch12#sec_stream_idempotence)
- 分割槽, [分片](/tw/ch7#ch_sharding)
- 請求路由, [請求路由](/tw/ch7#sec_sharding_routing)
- schema registry, [但什麼是寫入者模式?](/tw/ch5#but-what-is-the-writers-schema)
- 服務衍生資料, [對外提供派生資料](/tw/ch11#sec_batch_serving_derived)
- 分層儲存, [磁碟空間使用](/tw/ch12#sec_stream_disk_usage)
- 事務, [資料庫內部的分散式事務](/tw/ch8#sec_transactions_internal), [原子提交再現](/tw/ch12#sec_stream_atomic_commit)
- unclean leader election, [共識的微妙之處](/tw/ch10#subtleties-of-consensus)
- 使用模型檢查, [模型檢查與規範語言](/tw/ch9#model-checking-and-specification-languages)
- kappa 架構, [統一批處理和流處理](/tw/ch13#id338)
- key-value stores, [OLTP 系統的儲存與索引](/tw/ch4#sec_storage_oltp)
- 比較物件儲存, [物件儲存](/tw/ch11#id277)
- in-memory, [全記憶體儲存](/tw/ch4#sec_storage_inmemory)
- LSM storage, [日誌結構儲存](/tw/ch4#sec_storage_log_structured)-[磁碟空間使用](/tw/ch4#disk-space-usage)
- 分片, [鍵值資料的分片](/tw/ch7#sec_sharding_key_value)-[偏斜的工作負載與緩解熱點](/tw/ch7#sec_sharding_skew)
- 鍵的雜湊, [按鍵的雜湊分片](/tw/ch7#sec_sharding_hash), [總結](/tw/ch7#summary)
- by key range, [按鍵的範圍分片](/tw/ch7#sec_sharding_key_range), [總結](/tw/ch7#summary)
- skew and hot spots, [偏斜的工作負載與緩解熱點](/tw/ch7#sec_sharding_skew)
- Kinesis(訊息系統), [訊息代理](/tw/ch5#message-brokers), [使用日誌進行訊息儲存](/tw/ch12#id300)
- 資料倉整合, [雲資料倉庫](/tw/ch4#sec_cloud_data_warehouses)
- Kryo (Java), [特定語言的格式](/tw/ch5#id96)
- ksqlDB (stream database), [維護物化檢視](/tw/ch12#sec_stream_mat_view)
- Kubernetes (cluster manager), [雲服務與自託管](/tw/ch1#sec_introduction_cloud), [微服務與無伺服器](/tw/ch1#sec_introduction_microservices), [分散式作業編排](/tw/ch11#id278), [應用程式碼和狀態的分離](/tw/ch13#id344)
- Kubeflow, [機器學習](/tw/ch11#id290)
- kubelet, [分散式作業編排](/tw/ch11#id278)
- operators, [分散式作業編排](/tw/ch11#id278)
- use of etcd, [請求路由](/tw/ch7#sec_sharding_routing), [協調服務](/tw/ch10#sec_consistency_coordination)
- KùzuDB (database), [分散式系統的問題](/tw/ch1#sec_introduction_dist_sys_problems), [圖資料模型](/tw/ch3#sec_datamodels_graph)
- 作為嵌入式儲存引擎, [壓實策略](/tw/ch4#sec_storage_lsm_compaction)
- Cypher 查詢語言, [Cypher 查詢語言](/tw/ch3#id57)
### L
- labeled property graphs(見 property graphs)
- lambda architecture, [統一批處理和流處理](/tw/ch13#id338)
- Lamport 時間戳, [Lamport 時間戳](/tw/ch10#lamport-timestamps)
- Lance(資料格式), [雲資料倉庫](/tw/ch4#sec_cloud_data_warehouses), [列式儲存](/tw/ch4#sec_storage_column)
- (另見 column-oriented storage)
- large language models (LLMs)
- 預處理培訓資料, [機器學習](/tw/ch11#id290)
- 最後寫入勝利, [最後寫入勝利(丟棄併發寫入)](/tw/ch6#sec_replication_lww), [檢測併發寫入](/tw/ch6#sec_replication_concurrent), [實現線性一致性系統](/tw/ch10#sec_consistency_implementing_linearizable)
- 問題, [用於事件排序的時間戳](/tw/ch9#sec_distributed_lww)
- 容易丟失更新, [衝突解決與複製](/tw/ch8#conflict-resolution-and-replication)
- 延遲, [延遲與響應時間](/tw/ch2#id23)
- (另見 響應時間)
- 跨區域, [分散式與單節點系統](/tw/ch1#sec_introduction_distributed)
- 在兩階段鎖定下的不穩定, [兩階段鎖定的效能](/tw/ch8#performance-of-two-phase-locking)
- 網路延遲和資源利用, [我們不能簡單地使網路延遲可預測嗎?](/tw/ch9#can-we-not-simply-make-network-delays-predictable)
- reducing by request hedging, [單主與無主複製的效能](/tw/ch6#sec_replication_leaderless_perf)
- 響應時間對比, [延遲與響應時間](/tw/ch2#id23)
- 尾延遲, [平均值、中位數與百分位點](/tw/ch2#id24), [響應時間指標的應用](/tw/ch2#sec_introduction_slo_sla), [本地二級索引](/tw/ch7#id166)
- law(見 legal matters)
- layering (cloud services), [雲服務的分層](/tw/ch1#layering-of-cloud-services)
- 基於領導者的複製, [單主複製](/tw/ch6#sec_replication_leader)-[邏輯(基於行)日誌複製](/tw/ch6#logical-row-based-log-replication)
- (另見 複製)
- 故障切換, [領導者故障:故障轉移](/tw/ch6#leader-failure-failover), [分散式鎖和租約](/tw/ch9#sec_distributed_lock_fencing)
- 處理節點斷電, [處理節點故障](/tw/ch6#sec_replication_failover)
- 實施複製日誌
- 資料變更捕獲, [資料變更捕獲](/tw/ch12#sec_stream_cdc)-[變更流的 API 支援](/tw/ch12#sec_stream_change_api)
- (另見 changelogs)
- 基於語句的, [基於語句的複製](/tw/ch6#statement-based-replication)
- 預寫日誌(WAL)傳輸, [預寫日誌(WAL)傳輸](/tw/ch6#write-ahead-log-wal-shipping)
- linearizability of operations, [實現線性一致性系統](/tw/ch10#sec_consistency_implementing_linearizable)
- 鎖定和領導者選舉, [鎖定與領導者選舉](/tw/ch10#locking-and-leader-election)
- 日誌序列號, [設定新的副本](/tw/ch6#sec_replication_new_replica), [消費者偏移量](/tw/ch12#sec_stream_log_offsets)
- 讀縮放架構, [複製延遲的問題](/tw/ch6#sec_replication_lag), [單主與無主複製的效能](/tw/ch6#sec_replication_leaderless_perf)
- 與協商一致的關係, [共識](/tw/ch10#sec_consistency_consensus), [從單主複製到共識](/tw/ch10#from-single-leader-replication-to-consensus), [共識的利弊](/tw/ch10#pros-and-cons-of-consensus)
- 設立新的追隨者, [設定新的副本](/tw/ch6#sec_replication_new_replica)
- synchronous versus asynchronous, [同步複製與非同步複製](/tw/ch6#sec_replication_sync_async)-[同步複製與非同步複製](/tw/ch6#sec_replication_sync_async)
- 無領導複製, [無主複製](/tw/ch6#sec_replication_leaderless)-[版本向量](/tw/ch6#version-vectors)
- (另見 複製)
- 追趕丟失的寫入, [追趕錯過的寫入](/tw/ch6#sec_replication_read_repair)
- detecting concurrent writes, [檢測併發寫入](/tw/ch6#sec_replication_concurrent)-[版本向量](/tw/ch6#version-vectors)
- 版本向量, [版本向量](/tw/ch6#version-vectors)
- 多區域, [多地區操作](/tw/ch6#multi-region-operation)
- 法定人數, [讀寫仲裁](/tw/ch6#sec_replication_quorum_condition)-[多地區操作](/tw/ch6#multi-region-operation)
- 一致性限制, [仲裁一致性的侷限](/tw/ch6#sec_replication_quorum_limitations)-[監控陳舊性](/tw/ch6#monitoring-staleness), [線性一致性與仲裁](/tw/ch10#sec_consistency_quorum_linearizable)
- leap seconds, [軟體故障](/tw/ch2#software-faults), [時鐘同步和準確性](/tw/ch9#sec_distributed_clock_accuracy)
- in time-of-day clocks, [日曆時鐘](/tw/ch9#time-of-day-clocks)
- leases, [程序暫停](/tw/ch9#sec_distributed_clocks_pauses)
- implementation with coordination service, [協調服務](/tw/ch10#sec_consistency_coordination)
- need for fencing, [分散式鎖和租約](/tw/ch9#sec_distributed_lock_fencing)
- relation to consensus, [單值共識](/tw/ch10#single-value-consensus)
- 分類賬(會計), [總結](/tw/ch3#summary)
- 不可改變性, [不可變事件的優點](/tw/ch12#sec_stream_immutability_pros)
- 遺留系統,維護, [可運維性](/tw/ch2#sec_introduction_maintainability)
- 法律事項, [資料系統、法律與社會](/tw/ch1#sec_introduction_compliance)-[資料系統、法律與社會](/tw/ch1#sec_introduction_compliance)
- 資料刪除, [資料系統、法律與社會](/tw/ch1#sec_introduction_compliance), [磁碟空間使用](/tw/ch4#disk-space-usage)
- 資料儲存, [分散式與單節點系統](/tw/ch1#sec_introduction_distributed), [面向多租戶的分片](/tw/ch7#sec_sharding_multitenancy)
- 隱私監管, [資料系統、法律與社會](/tw/ch1#sec_introduction_compliance), [立法與自律](/ch14#sec_future_legislation)
- legitimate interest (GDPR), [同意與選擇自由](/ch14#id375)
- leveled compaction, [壓實策略](/tw/ch4#sec_storage_lsm_compaction), [磁碟空間使用](/tw/ch4#disk-space-usage)
- Levenshtein automata, [全文檢索](/tw/ch4#sec_storage_full_text)
- limping (partial failure), [系統模型與現實](/tw/ch9#sec_distributed_system_model)
- Linear (project management software), [即時協作、離線優先和本地優先應用](/tw/ch6#real-time-collaboration-offline-first-and-local-first-apps)
- 線性代數, [資料框、矩陣與陣列](/tw/ch3#sec_datamodels_dataframes)
- 線性可縮放性, [描述負載](/tw/ch2#id33)
- 線性一致性, [複製延遲的解決方案](/tw/ch6#id131), [線性一致性](/tw/ch10#sec_consistency_linearizability)-[線性一致性與網路延遲](/tw/ch10#linearizability-and-network-delays), [術語表](/tw/glossary)
- 和共識, [共識](/tw/ch10#sec_consistency_consensus)
- 費用, [線性一致性的代價](/tw/ch10#sec_linearizability_cost)-[線性一致性與網路延遲](/tw/ch10#linearizability-and-network-delays)
- CAP定理, [CAP 定理](/tw/ch10#the-cap-theorem)
- memory on multi-core CPUs, [線性一致性與網路延遲](/tw/ch10#linearizability-and-network-delays)
- 定義, [什麼使系統具有線性一致性?](/tw/ch10#sec_consistency_lin_definition)-[什麼使系統具有線性一致性?](/tw/ch10#sec_consistency_lin_definition)
- ID generation, [線性一致的 ID 生成器](/tw/ch10#sec_consistency_linearizable_id)
- in coordination services, [協調服務](/tw/ch10#sec_consistency_coordination)
- 資料系統
- 避免協調, [無協調資料系統](/tw/ch13#id454)
- 不同複製方法, [實現線性一致性系統](/tw/ch10#sec_consistency_implementing_linearizable)-[線性一致性與仲裁](/tw/ch10#sec_consistency_quorum_linearizable)
- 使用法定人數, [線性一致性與仲裁](/tw/ch10#sec_consistency_quorum_linearizable)
- reads in consensus systems, [共識的微妙之處](/tw/ch10#subtleties-of-consensus)
- 依賴, [依賴線性一致性](/tw/ch10#sec_consistency_linearizability_usage)-[跨通道時序依賴](/tw/ch10#cross-channel-timing-dependencies)
- constraints and uniqueness, [約束與唯一性保證](/tw/ch10#sec_consistency_uniqueness)
- 跨渠道時間依賴性, [跨通道時序依賴](/tw/ch10#cross-channel-timing-dependencies)
- 鎖定和領導者選舉, [鎖定與領導者選舉](/tw/ch10#locking-and-leader-election)
- 可序列性, [什麼使系統具有線性一致性?](/tw/ch10#sec_consistency_lin_definition)
- 連結資料, [三元組儲存與 SPARQL](/tw/ch3#id59)
- LinkedIn
- Espresso(資料庫), [但什麼是寫入者模式?](/tw/ch5#but-what-is-the-writers-schema)
- LIquid(資料庫), [Datalog:遞迴關係查詢](/tw/ch3#id62)
- profile (example), [用於一對多關係的文件資料模型](/tw/ch3#the-document-data-model-for-one-to-many-relationships)
- Linux leap second bug, [軟體故障](/tw/ch2#software-faults), [時鐘同步和準確性](/tw/ch9#sec_distributed_clock_accuracy)
- Litestream (備份工具), [設定新的副本](/tw/ch6#sec_replication_new_replica)
- liveness properties, [安全性與活性](/tw/ch9#sec_distributed_safety_liveness)
- LLVM (compiler), [查詢執行:編譯與向量化](/tw/ch4#sec_storage_vectorized)
- LMDB (storage engine), [壓實策略](/tw/ch4#sec_storage_lsm_compaction), [B 樹變體](/tw/ch4#b-tree-variants), [索引與快照隔離](/tw/ch8#indexes-and-snapshot-isolation)
- 負載
- 應付, [可伸縮性原則](/tw/ch2#id35)
- 描述, [描述負載](/tw/ch2#id33)
- 負載平衡, [描述效能](/tw/ch2#sec_introduction_percentiles), [負載均衡器、服務發現和服務網格](/tw/ch5#sec_encoding_service_discovery)
- 硬體, [負載均衡器、服務發現和服務網格](/tw/ch5#sec_encoding_service_discovery)
- 軟體, [負載均衡器、服務發現和服務網格](/tw/ch5#sec_encoding_service_discovery)
- using message brokers, [多個消費者](/tw/ch12#id298)
- load shedding, [描述效能](/tw/ch2#sec_introduction_percentiles)
- local secondary indexes, [本地二級索引](/tw/ch7#id166), [總結](/tw/ch7#summary)
- local-first software, [即時協作、離線優先和本地優先應用](/tw/ch6#real-time-collaboration-offline-first-and-local-first-apps)
- 區域性, [用於一對多關係的文件資料模型](/tw/ch3#the-document-data-model-for-one-to-many-relationships), [讀寫的資料區域性](/tw/ch3#sec_datamodels_document_locality), [術語表](/tw/glossary)
- 分批處理, [資料流引擎](/tw/ch11#sec_batch_dataflow)
- in stateful clients, [同步引擎與本地優先軟體](/tw/ch6#sec_replication_offline_clients), [有狀態、可離線的客戶端](/tw/ch13#id347)
- in stream processing, [流表連線(流擴充)](/tw/ch12#sec_stream_table_joins), [失敗後重建狀態](/tw/ch12#sec_stream_state_fault_tolerance), [流處理器和服務](/tw/ch13#id345), [基於日誌訊息傳遞中的唯一性](/tw/ch13#sec_future_uniqueness_log)
- location transparency, [遠端過程呼叫(RPC)的問題](/tw/ch5#sec_problems_with_rpc)
- in the actor model, [分散式 actor 框架](/tw/ch5#distributed-actor-frameworks)
- lock-in, [雲服務的利弊](/tw/ch1#sec_introduction_cloud_tradeoffs)
- 鎖, [術語表](/tw/glossary)
- 死鎖, [顯式鎖定](/tw/ch8#explicit-locking), [兩階段鎖定的實現](/tw/ch8#implementation-of-two-phase-locking)
- 分散式鎖定, [分散式鎖和租約](/tw/ch9#sec_distributed_lock_fencing)-[多副本隔離](/tw/ch9#fencing-with-multiple-replicas), [鎖定與領導者選舉](/tw/ch10#locking-and-leader-election)
- fencing tokens, [隔離殭屍程序和延遲請求](/tw/ch9#sec_distributed_fencing_tokens)
- implementation with coordination service, [協調服務](/tw/ch10#sec_consistency_coordination)
- relation to consensus, [單值共識](/tw/ch10#single-value-consensus)
- 用於事務隔離
- 在快照隔離中, [多版本併發控制(MVCC)](/tw/ch8#sec_transactions_snapshot_impl)
- in two-phase locking (2PL), [兩階段鎖定(2PL)](/tw/ch8#sec_transactions_2pl)-[索引範圍鎖](/tw/ch8#sec_transactions_2pl_range)
- 使操作原子化, [原子寫操作](/tw/ch8#atomic-write-operations)
- 效能, [兩階段鎖定的效能](/tw/ch8#performance-of-two-phase-locking)
- preventing dirty writes, [實現讀已提交](/tw/ch8#sec_transactions_read_committed_impl)
- preventing phantoms with index-range locks, [索引範圍鎖](/tw/ch8#sec_transactions_2pl_range), [檢測影響先前讀取的寫入](/tw/ch8#sec_detecting_writes_affect_reads)
- 讀取鎖(共享模式), [實現讀已提交](/tw/ch8#sec_transactions_read_committed_impl), [兩階段鎖定的實現](/tw/ch8#implementation-of-two-phase-locking)
- 共享模式和專屬模式, [兩階段鎖定的實現](/tw/ch8#implementation-of-two-phase-locking)
- 分散式事務
- deadlock detection, [XA 事務的問題](/tw/ch8#problems-with-xa-transactions)
- 持有鎖的可疑事務, [存疑時持有鎖](/tw/ch8#holding-locks-while-in-doubt)
- materializing conflicts, [物化衝突](/tw/ch8#materializing-conflicts)
- 透過明確鎖定防止丟失更新, [顯式鎖定](/tw/ch8#explicit-locking)
- 日誌序列號, [設定新的副本](/tw/ch6#sec_replication_new_replica), [消費者偏移量](/tw/ch12#sec_stream_log_offsets)
- 邏輯時鐘, [用於事件排序的時間戳](/tw/ch9#sec_distributed_lww), [ID 生成器和邏輯時鐘](/tw/ch10#sec_consistency_logical)-[使用邏輯時鐘強制約束](/tw/ch10#enforcing-constraints-using-logical-clocks), [排序事件以捕獲因果關係](/tw/ch13#sec_future_capture_causality)
- for last-write-wins, [最後寫入勝利(丟棄併發寫入)](/tw/ch6#sec_replication_lww)
- 讀後寫入一致性, [讀己之寫](/tw/ch6#sec_replication_ryw)
- 混合邏輯時鐘, [混合邏輯時鐘](/tw/ch10#hybrid-logical-clocks)
- insufficiency for enforcing constraints, [使用邏輯時鐘強制約束](/tw/ch10#enforcing-constraints-using-logical-clocks)
- Lamport 時間戳, [Lamport 時間戳](/tw/ch10#lamport-timestamps)
- 邏輯複製, [邏輯(基於行)日誌複製](/tw/ch6#logical-row-based-log-replication)
- for change data capture, [資料變更捕獲的實現](/tw/ch12#id307)
- LogicBlox(資料庫), [Datalog:遞迴關係查詢](/tw/ch3#id62)
- 日誌(資料結構), [OLTP 系統的儲存與索引](/tw/ch4#sec_storage_oltp), [共享日誌作為共識](/tw/ch10#sec_consistency_shared_logs), [術語表](/tw/glossary)
- (另見 shared logs)
- 不可改變性的好處, [不可變事件的優點](/tw/ch12#sec_stream_immutability_pros)
- 和清除的權利, [資料系統、法律與社會](/tw/ch1#sec_introduction_compliance), [磁碟空間使用](/tw/ch4#disk-space-usage)
- 壓實(Compaction), [構建和合並 SSTable](/tw/ch4#constructing-and-merging-sstables), [壓實策略](/tw/ch4#sec_storage_lsm_compaction), [日誌壓縮](/tw/ch12#sec_stream_log_compaction), [狀態、流和不變性](/tw/ch12#sec_stream_immutability)
- 流運算子狀態, [失敗後重建狀態](/tw/ch12#sec_stream_state_fault_tolerance)
- enforcing uniqueness constraints, [基於日誌訊息傳遞中的唯一性](/tw/ch13#sec_future_uniqueness_log)
- log-based messaging, [基於日誌的訊息代理](/tw/ch12#sec_stream_log)-[重播舊訊息](/tw/ch12#sec_stream_replay)
- 比較傳統訊息, [日誌與傳統的訊息傳遞相比](/tw/ch12#sec_stream_logs_vs_messaging), [重播舊訊息](/tw/ch12#sec_stream_replay)
- consumer offsets, [消費者偏移量](/tw/ch12#sec_stream_log_offsets)
- 磁碟空間使用情況, [磁碟空間使用](/tw/ch12#sec_stream_disk_usage)
- replaying old messages, [重播舊訊息](/tw/ch12#sec_stream_replay), [應用演化後重新處理資料](/tw/ch13#sec_future_reprocessing), [統一批處理和流處理](/tw/ch13#id338)
- 緩慢的消費者, [當消費者跟不上生產者時](/tw/ch12#id459)
- using logs for message storage, [使用日誌進行訊息儲存](/tw/ch12#id300)
- 日誌結構儲存, [OLTP 系統的儲存與索引](/tw/ch4#sec_storage_oltp)-[壓實策略](/tw/ch4#sec_storage_lsm_compaction)
- log-structured merge tree(見 LSM-trees)
- 與協商一致的關係, [共享日誌作為共識](/tw/ch10#sec_consistency_shared_logs)
- 複製, [單主複製](/tw/ch6#sec_replication_leader), [複製日誌的實現](/tw/ch6#sec_replication_implementation)-[邏輯(基於行)日誌複製](/tw/ch6#logical-row-based-log-replication)
- 資料變更捕獲, [資料變更捕獲](/tw/ch12#sec_stream_cdc)-[變更流的 API 支援](/tw/ch12#sec_stream_change_api)
- (另見 changelogs)
- 與快照協調, [設定新的副本](/tw/ch6#sec_replication_new_replica)
- 邏輯(基於row) 複製, [邏輯(基於行)日誌複製](/tw/ch6#logical-row-based-log-replication)
- 基於語句的複製, [基於語句的複製](/tw/ch6#statement-based-replication)
- 預寫日誌(WAL)傳輸, [預寫日誌(WAL)傳輸](/tw/ch6#write-ahead-log-wal-shipping)
- 伸縮性限制, [全序的限制](/tw/ch13#id335)
- Looker (business intelligence software), [事務處理與分析的特徵](/tw/ch1#sec_introduction_oltp), [分析(Analytics)](/tw/ch11#sec_batch_olap)
- 松耦合, [開展分拆工作](/tw/ch13#sec_future_unbundling_favor)
- lost updates(見 updates)
- Lotus Notes (sync engine), [同步引擎的利弊](/tw/ch6#pros-and-cons-of-sync-engines)
- LSM-trees (indexes), [SSTable 檔案格式](/tw/ch4#the-sstable-file-format)-[壓實策略](/tw/ch4#sec_storage_lsm_compaction)
- 與B樹的比較, [比較 B 樹與 LSM 樹](/tw/ch4#sec_storage_btree_lsm_comparison)-[磁碟空間使用](/tw/ch4#disk-space-usage)
- Lucene(儲存引擎), [全文檢索](/tw/ch4#sec_storage_full_text)
- 相似性搜尋, [全文檢索](/tw/ch4#sec_storage_full_text)
- LWW(見 最後寫入勝利)
### M
- 機器學習
- 批次推論, [機器學習](/tw/ch11#id290)
- data preparation with DataFrames, [資料框、矩陣與陣列](/tw/ch3#sec_datamodels_dataframes)
- 刪除培訓資料, [資料系統、法律與社會](/tw/ch1#sec_introduction_compliance)
- 部署資料產品, [超越資料湖](/tw/ch1#beyond-the-data-lake)
- 道德考慮, [預測分析](/ch14#id369)
- (另見 ethics)
- feature engineering, [從資料倉庫到資料湖](/tw/ch1#from-data-warehouse-to-data-lake), [機器學習](/tw/ch11#id290)
- 分析系統, [分析型與事務型系統](/tw/ch1#sec_introduction_analytics)
- 迭代處理, [機器學習](/tw/ch11#id290)
- LLMs(見 large language models (LLMs))
- 培訓資料產生的模型, [應用程式碼作為派生函式](/tw/ch13#sec_future_dataflow_derivation)
- 與批次處理的關係, [機器學習](/tw/ch11#id290)-[機器學習](/tw/ch11#id290)
- 使用資料湖, [從資料倉庫到資料湖](/tw/ch1#from-data-warehouse-to-data-lake)
- using GPUs, [雲服務的分層](/tw/ch1#layering-of-cloud-services), [分散式與單節點系統](/tw/ch1#sec_introduction_distributed)
- 使用矩陣, [資料框、矩陣與陣列](/tw/ch3#sec_datamodels_dataframes)
- MadSim (deterministic simulation testing), [確定性模擬測試](/tw/ch9#deterministic-simulation-testing)
- magic scaling sauce, [可伸縮性原則](/tw/ch2#id35)
- 可維護性, [可運維性](/tw/ch2#sec_introduction_maintainability)-[可演化性:讓變化更容易](/tw/ch2#sec_introduction_evolvability), [流式系統的哲學](/tw/ch13#ch_philosophy)
- 可演化性(見 可演化性)
- 可操作性, [可運維性:讓運維更輕鬆](/tw/ch2#id37)
- 簡化和管理複雜性, [簡單性:管理複雜度](/tw/ch2#id38)
- many-to-many relationships, [多對一與多對多關係](/tw/ch3#sec_datamodels_many_to_many)
- modeling as graphs, [圖資料模型](/tw/ch3#sec_datamodels_graph)
- 多對一關係, [多對一與多對多關係](/tw/ch3#sec_datamodels_many_to_many)
- in star schemas, [星型與雪花型:分析模式](/tw/ch3#sec_datamodels_analytics)
- MapReduce (batch processing), [批處理](/tw/ch11#ch_batch), [MapReduce](/tw/ch11#sec_batch_mapreduce)-[MapReduce](/tw/ch11#sec_batch_mapreduce)
- analysis of user activity events (example), [JOIN 與 GROUP BY](/tw/ch11#sec_batch_join)
- 與流處理的比較, [流處理](/tw/ch12#sec_stream_processing)
- 不利條件和限制, [MapReduce](/tw/ch11#sec_batch_mapreduce)
- fault tolerance, [故障處理](/tw/ch11#id281)
- 高階工具, [查詢語言](/tw/ch11#sec_batch_query_lanauges)
- map and reduce functions, [MapReduce](/tw/ch11#sec_batch_mapreduce)
- shuffling data, [混洗資料](/tw/ch11#sec_shuffle)
- 排序合併, [JOIN 與 GROUP BY](/tw/ch11#sec_batch_join)
- 工作流程, [工作流排程](/tw/ch11#sec_batch_workflows)
- (另見 workflow engines)
- 編組(見 編碼)
- MartenDB(資料庫), [事件溯源與 CQRS](/tw/ch3#sec_datamodels_events)
- 主奴隸複製(過時術語), [單主複製](/tw/ch6#sec_replication_leader)
- 物化, [術語表](/tw/glossary)
- aggregate values, [物化檢視與資料立方體](/tw/ch4#sec_storage_materialized_views)
- 衝突, [物化衝突](/tw/ch8#materializing-conflicts)
- materialized views, [物化檢視與資料立方體](/tw/ch4#sec_storage_materialized_views)
- 作為衍生資料, [記錄系統與派生資料](/tw/ch1#sec_introduction_derived), [組合使用資料儲存技術](/tw/ch13#id447)-[分拆系統與整合系統](/tw/ch13#id448)
- in event sourcing, [事件溯源與 CQRS](/tw/ch3#sec_datamodels_events)
- 增量檢視維護, [維護物化檢視](/tw/ch12#sec_stream_mat_view)
- (另見 incremental view maintenance (IVM))
- 維護,使用流處理, [維護物化檢視](/tw/ch12#sec_stream_mat_view), [表表連線(維護物化檢視)](/tw/ch12#id326)
- social network timeline example, [時間線的物化與更新](/tw/ch2#sec_introduction_materializing)
- 物化, [物化檢視與資料立方體](/tw/ch4#sec_storage_materialized_views)
- 增量檢視維護, [維護物化檢視](/tw/ch12#sec_stream_mat_view)
- 矩陣, [資料框、矩陣與陣列](/tw/ch3#sec_datamodels_dataframes)
- sparse, [資料框、矩陣與陣列](/tw/ch3#sec_datamodels_dataframes)
- Maxwell(變化資料捕獲), [資料變更捕獲的實現](/tw/ch12#id307)
- mean, [平均值、中位數與百分位點](/tw/ch2#id24)
- 媒體監測, [在流上搜索](/tw/ch12#id320)
- 中位數, [平均值、中位數與百分位點](/tw/ch2#id24)
- 會議室預訂(例), [寫偏差的更多例子](/tw/ch8#more-examples-of-write-skew), [謂詞鎖](/tw/ch8#predicate-locks), [強制約束](/tw/ch13#sec_future_constraints)
- Memcached (caching server), [全記憶體儲存](/tw/ch4#sec_storage_inmemory)
- Memgraph(資料庫), [圖資料模型](/tw/ch3#sec_datamodels_graph)
- Cypher 查詢語言, [Cypher 查詢語言](/tw/ch3#id57)
- 記憶體
- 壁障, [線性一致性與網路延遲](/tw/ch10#linearizability-and-network-delays)
- corruption, [硬體與軟體故障](/tw/ch2#sec_introduction_hardware_faults)
- in-memory databases, [全記憶體儲存](/tw/ch4#sec_storage_inmemory)
- 永續性, [永續性](/tw/ch8#durability)
- 序列事務執行, [實際序列執行](/tw/ch8#sec_transactions_serial)
- in-memory representation of data, [編碼資料的格式](/tw/ch5#sec_encoding_formats)
- 記憶體表, [構建和合並 SSTable](/tw/ch4#constructing-and-merging-sstables)
- random bit-flips in, [信任但驗證](/tw/ch13#sec_future_verification)
- 索引的使用, [日誌結構儲存](/tw/ch4#sec_storage_log_structured)
- 記憶體表, [構建和合並 SSTable](/tw/ch4#constructing-and-merging-sstables)
- Mercurial (version control system), [併發控制](/tw/ch12#sec_stream_concurrency)
- 合併, [資料框、矩陣與陣列](/tw/ch3#sec_datamodels_dataframes)
- 合併排序的檔案, [構建和合並 SSTable](/tw/ch4#constructing-and-merging-sstables), [混洗資料](/tw/ch11#sec_shuffle)
- 默克爾樹, [用於可審計資料系統的工具](/tw/ch13#id366)
- Mesos (cluster manager), [應用程式碼和狀態的分離](/tw/ch13#id344)
- message brokers(見 messaging systems)
- message-passing(見 event-driven architecture)
- MessagePack (encoding format), [二進位制編碼](/tw/ch5#binary-encoding)
- messaging systems, [流處理](/tw/ch12#ch_stream)-[重播舊訊息](/tw/ch12#sec_stream_replay)
- (另見 streams)
- backpressure, buffering, or dropping messages, [訊息傳遞系統](/tw/ch12#sec_stream_messaging)
- 無中介訊息, [直接從生產者傳遞給消費者](/tw/ch12#id296)
- 事件日誌, [基於日誌的訊息代理](/tw/ch12#sec_stream_log)-[重播舊訊息](/tw/ch12#sec_stream_replay)
- 作為資料模型, [事件溯源與 CQRS](/tw/ch3#sec_datamodels_events)
- 比較傳統訊息, [日誌與傳統的訊息傳遞相比](/tw/ch12#sec_stream_logs_vs_messaging), [重播舊訊息](/tw/ch12#sec_stream_replay)
- consumer offsets, [消費者偏移量](/tw/ch12#sec_stream_log_offsets)
- replaying old messages, [重播舊訊息](/tw/ch12#sec_stream_replay), [應用演化後重新處理資料](/tw/ch13#sec_future_reprocessing), [統一批處理和流處理](/tw/ch13#id338)
- 緩慢的消費者, [當消費者跟不上生產者時](/tw/ch12#id459)
- 恰好一次語義, [恰好一次訊息處理](/tw/ch8#sec_transactions_exactly_once), [再談恰好一次訊息處理](/tw/ch8#exactly-once-message-processing-revisited), [容錯](/tw/ch12#sec_stream_fault_tolerance)
- message brokers, [訊息代理](/tw/ch12#id433)-[確認與重新傳遞](/tw/ch12#sec_stream_reordering)
- acknowledgements and redelivery, [確認與重新傳遞](/tw/ch12#sec_stream_reordering)
- 比較事件日誌, [日誌與傳統的訊息傳遞相比](/tw/ch12#sec_stream_logs_vs_messaging), [重播舊訊息](/tw/ch12#sec_stream_replay)
- 同一主題的多個消費者, [多個消費者](/tw/ch12#id298)
- versus RPC, [事件驅動的架構](/tw/ch5#sec_encoding_dataflow_msg)
- 訊息丟失, [訊息傳遞系統](/tw/ch12#sec_stream_messaging)
- 可靠性, [訊息傳遞系統](/tw/ch12#sec_stream_messaging)
- uniqueness in log-based messaging, [基於日誌訊息傳遞中的唯一性](/tw/ch13#sec_future_uniqueness_log)
- metastable failure, [描述效能](/tw/ch2#sec_introduction_percentiles)
- metered billing
- 無伺服器, [微服務與無伺服器](/tw/ch1#sec_introduction_microservices)
- 儲存, [雲時代的運維](/tw/ch1#sec_introduction_operations)
- 微批次, [微批次與存檔點](/tw/ch12#id329)
- 微服務, [微服務與無伺服器](/tw/ch1#sec_introduction_microservices)
- (另見 services)
- 各種服務的因果關係, [全序的限制](/tw/ch13#id335)
- 松耦合, [開展分拆工作](/tw/ch13#sec_future_unbundling_favor)
- 與批次/流程處理器的關係, [批處理](/tw/ch11#ch_batch), [流處理器和服務](/tw/ch13#id345)
- 微軟
- Azure Blob Storage(見 Azure Blob Storage)
- Azure managed disks, [儲存與計算的分離](/tw/ch1#sec_introduction_storage_compute)
- Azure Service Bus(訊息系統), [訊息代理](/tw/ch5#message-brokers), [訊息代理與資料庫的對比](/tw/ch12#id297)
- Azure SQL DB(資料庫), [雲原生系統架構](/tw/ch1#sec_introduction_cloud_native)
- Azure Storage, [物件儲存](/tw/ch11#id277)
- Azure Stream Analytics, [流分析](/tw/ch12#id318)
- Azure Synapse Analytics(資料庫), [雲原生系統架構](/tw/ch1#sec_introduction_cloud_native)
- 分散式元件物件模型, [遠端過程呼叫(RPC)的問題](/tw/ch5#sec_problems_with_rpc)
- MSDTC (transaction coordinator), [兩階段提交(2PC)](/tw/ch8#sec_transactions_2pc)
- SQL Server(見 SQL Server)
- Microsoft Power BI(見 Power BI (business intelligence software))
- 遷移(重寫)資料, [文件模型中的模式靈活性](/tw/ch3#sec_datamodels_schema_flexibility), [不同時間寫入的不同值](/tw/ch5#different-values-written-at-different-times), [從同一事件日誌中派生多個檢視](/tw/ch12#sec_stream_deriving_views), [應用演化後重新處理資料](/tw/ch13#sec_future_reprocessing)
- MinIO(物件儲存), [分散式檔案系統](/tw/ch11#sec_batch_dfs)
- 移動應用程式, [資料系統架構中的權衡](/tw/ch1#ch_tradeoffs)
- 嵌入式資料庫, [壓實策略](/tw/ch4#sec_storage_lsm_compaction)
- model checking, [模型檢查與規範語言](/tw/ch9#model-checking-and-specification-languages)
- modulus operator (%), [雜湊取模節點數](/tw/ch7#hash-modulo-number-of-nodes)
- Mojo(程式語言)
- 記憶體管理, [限制垃圾回收的影響](/tw/ch9#sec_distributed_gc_impact)
- MongoDB(資料庫)
- aggregation pipeline, [文件的查詢語言](/tw/ch3#query-languages-for-documents)
- 原子操作, [原子寫操作](/tw/ch8#atomic-write-operations)
- BSON, [讀寫的資料區域性](/tw/ch3#sec_datamodels_document_locality)
- 文件資料模型, [關係模型與文件模型](/tw/ch3#sec_datamodels_history)
- hash sharding, [按鍵的雜湊分片](/tw/ch7#sec_sharding_hash), [按雜湊範圍分片](/tw/ch7#sharding-by-hash-range)
- in the cloud, [雲原生系統架構](/tw/ch1#sec_introduction_cloud_native)
- join support, [文件和關係資料庫的融合](/tw/ch3#convergence-of-document-and-relational-databases)
- joins (\$lookup operator), [正規化、反正規化與連線](/tw/ch3#sec_datamodels_normalization)
- JSON Schema validation, [JSON 模式](/tw/ch5#json-schema)
- 基於領導者的複製, [單主複製](/tw/ch6#sec_replication_leader)
- ObjectIds, [ID 生成器和邏輯時鐘](/tw/ch10#sec_consistency_logical)
- range-based sharding, [按鍵的範圍分片](/tw/ch7#sec_sharding_key_range)
- 請求路由, [請求路由](/tw/ch7#sec_sharding_routing)
- secondary indexes, [本地二級索引](/tw/ch7#id166)
- shard splitting, [重新平衡鍵範圍分片資料](/tw/ch7#rebalancing-key-range-sharded-data)
- 儲存程式, [儲存過程的利弊](/tw/ch8#sec_transactions_stored_proc_tradeoffs)
- 監測, [雲時代的運維](/tw/ch1#sec_introduction_operations), [人類與可靠性](/tw/ch2#id31), [可運維性:讓運維更輕鬆](/tw/ch2#id37)
- monotonic clocks, [單調時鐘](/tw/ch9#monotonic-clocks)
- 單調讀, [單調讀](/tw/ch6#sec_replication_monotonic_reads)
- Morel(查詢語言), [查詢語言](/tw/ch11#sec_batch_query_lanauges)
- MSMQ(訊息系統), [XA 事務](/tw/ch8#xa-transactions)
- 多列索引, [多維索引與全文索引](/tw/ch4#sec_storage_multidimensional)
- 多領導複製, [多主複製](/tw/ch6#sec_replication_multi_leader)-[處理寫入衝突](/tw/ch6#sec_replication_write_conflicts)
- (另見 複製)
- 協作編輯, [即時協作、離線優先和本地優先應用](/tw/ch6#real-time-collaboration-offline-first-and-local-first-apps)
- 衝突檢測, [處理寫入衝突](/tw/ch6#sec_replication_write_conflicts)
- 解決衝突, [處理寫入衝突](/tw/ch6#sec_replication_write_conflicts)
- for multi-region replication, [跨地域執行](/tw/ch6#sec_replication_multi_dc), [線性一致性的代價](/tw/ch10#sec_linearizability_cost)
- linearizability, lack of, [實現線性一致性系統](/tw/ch10#sec_consistency_implementing_linearizable)
- offline-capable clients, [同步引擎與本地優先軟體](/tw/ch6#sec_replication_offline_clients)
- replication topologies, [多主複製拓撲](/tw/ch6#sec_replication_topologies)-[不同拓撲的問題](/tw/ch6#problems-with-different-topologies)
- 多物件事務, [單物件與多物件操作](/tw/ch8#sec_transactions_multi_object)
- 需求, [多物件事務的需求](/tw/ch8#sec_transactions_need)
- Multi-Paxos (consensus algorithm), [共識的實踐](/tw/ch10#sec_consistency_total_order)
- 多讀單寫鎖定, [兩階段鎖定的實現](/tw/ch8#implementation-of-two-phase-locking)
- 多表索引叢集表, [讀寫的資料區域性](/tw/ch3#sec_datamodels_document_locality)
- 多版本併發控制, [多版本併發控制(MVCC)](/tw/ch8#sec_transactions_snapshot_impl), [總結](/tw/ch8#summary)
- detecting stale MVCC reads, [檢測陳舊的 MVCC 讀取](/tw/ch8#detecting-stale-mvcc-reads)
- 索引和快照隔離, [索引與快照隔離](/tw/ch8#indexes-and-snapshot-isolation)
- 使用同步時鐘, [用於全域性快照的同步時鐘](/tw/ch9#sec_distributed_spanner)
- multidimensional arrays, [資料框、矩陣與陣列](/tw/ch3#sec_datamodels_dataframes)
- multitenancy, [儲存與計算的分離](/tw/ch1#sec_introduction_storage_compute), [網路擁塞和排隊](/tw/ch9#network-congestion-and-queueing)
- by sharding, [面向多租戶的分片](/tw/ch7#sec_sharding_multitenancy)
- using embedded databases, [壓實策略](/tw/ch4#sec_storage_lsm_compaction)
- versus Byzantine fault tolerance, [拜占庭故障](/tw/ch9#sec_distributed_byzantine)
- mutual exclusion, [悲觀併發控制與樂觀併發控制](/tw/ch8#pessimistic-versus-optimistic-concurrency-control)
- (另見 locks)
- MySQL(資料庫)
- archiving WAL to object stores, [設定新的副本](/tw/ch6#sec_replication_new_replica)
- 二進位制日誌座標, [設定新的副本](/tw/ch6#sec_replication_new_replica)
- 資料變更捕獲, [資料變更捕獲的實現](/tw/ch12#id307), [變更流的 API 支援](/tw/ch12#sec_stream_change_api)
- circular replication topology, [多主複製拓撲](/tw/ch6#sec_replication_topologies)
- 一致的快照, [設定新的副本](/tw/ch6#sec_replication_new_replica)
- 分散式事務支援, [XA 事務](/tw/ch8#xa-transactions)
- global transaction identifiers (GTIDs), [設定新的副本](/tw/ch6#sec_replication_new_replica)
- 在雲層中, [雲原生系統架構](/tw/ch1#sec_introduction_cloud_native)
- InnoDB storage engine(見 InnoDB)
- 基於領導者的複製, [單主複製](/tw/ch6#sec_replication_leader)
- 多領導複製, [跨地域執行](/tw/ch6#sec_replication_multi_dc)
- 基於行的複製, [邏輯(基於行)日誌複製](/tw/ch6#logical-row-based-log-replication)
- 分片(見 Vitess(資料庫))
- snapshot isolation support, [快照隔離、可重複讀和命名混淆](/tw/ch8#snapshot-isolation-repeatable-read-and-naming-confusion)
- (另見 InnoDB)
- 基於語句的複製, [基於語句的複製](/tw/ch6#statement-based-replication)
### N
- N+1 query problem, [物件關係對映(ORM)](/tw/ch3#object-relational-mapping-orm)
- nanomsg (messaging library), [直接從生產者傳遞給消費者](/tw/ch12#id296)
- Narayana(事務協調員), [兩階段提交(2PC)](/tw/ch8#sec_transactions_2pc)
- NATS(訊息系統), [訊息代理](/tw/ch5#message-brokers)
- 自然語言處理, [從資料倉庫到資料湖](/tw/ch1#from-data-warehouse-to-data-lake)
- Neo4j(資料庫)
- Cypher 查詢語言, [Cypher 查詢語言](/tw/ch3#id57)
- 圖表資料模型, [圖資料模型](/tw/ch3#sec_datamodels_graph)
- Neon(資料庫), [設定新的副本](/tw/ch6#sec_replication_new_replica)
- Nephele (dataflow engine), [資料流引擎](/tw/ch11#sec_batch_dataflow)
- Neptune(圖資料庫), [圖資料模型](/tw/ch3#sec_datamodels_graph)
- Cypher 查詢語言, [Cypher 查詢語言](/tw/ch3#id57)
- SPARQL 查詢語言, [SPARQL 查詢語言](/tw/ch3#the-sparql-query-language)
- netcode (game development), [同步引擎的利弊](/tw/ch6#pros-and-cons-of-sync-engines)
- Network Attached Storage (NAS), [共享記憶體、共享磁碟與無共享架構](/tw/ch2#sec_introduction_shared_nothing), [分散式檔案系統](/tw/ch11#sec_batch_dfs)
- 網路模型(資料表示), [關係模型與文件模型](/tw/ch3#sec_datamodels_history)
- Network Time Protocol(見 網路時間協議)
- 網路
- 擁堵和排隊, [網路擁塞和排隊](/tw/ch9#network-congestion-and-queueing)
- datacenter network topologies, [雲計算與超級計算](/tw/ch1#id17)
- faults(見 faults)
- linearizability and network delays, [線性一致性與網路延遲](/tw/ch10#linearizability-and-network-delays)
- 網路分割槽, [實踐中的網路故障](/tw/ch9#sec_distributed_network_faults)
- 在 CAP 定理中, [線性一致性的代價](/tw/ch10#sec_linearizability_cost)
- 超時和無限制延誤, [超時和無界延遲](/tw/ch9#sec_distributed_queueing)
- NewSQL, [關係模型與文件模型](/tw/ch3#sec_datamodels_history), [複製延遲的解決方案](/tw/ch6#id131)
- 事務和, [事務到底是什麼?](/tw/ch8#sec_transactions_overview), [資料庫內部的分散式事務](/tw/ch8#sec_transactions_internal)
- next-key locking, [索引範圍鎖](/tw/ch8#sec_transactions_2pl_range)
- NFS (network file system), [分散式檔案系統](/tw/ch11#sec_batch_dfs)
- 在物件儲存中, [物件儲存](/tw/ch11#id277)
- Nimble(資料格式), [雲資料倉庫](/tw/ch4#sec_cloud_data_warehouses), [列式儲存](/tw/ch4#sec_storage_column)
- (另見 column-oriented storage)
- node (in graphs)(見 vertices)
- 節點(程序), [分散式與單節點系統](/tw/ch1#sec_introduction_distributed), [術語表](/tw/glossary)
- handling outages in leader-based replication, [處理節點故障](/tw/ch6#sec_replication_failover)
- system models for failure, [系統模型與現實](/tw/ch9#sec_distributed_system_model)
- noisy neighbors, [網路擁塞和排隊](/tw/ch9#network-congestion-and-queueing)
- nonblocking atomic commit, [三階段提交](/tw/ch8#three-phase-commit)
- 非決定性操作, [基於語句的複製](/tw/ch6#statement-based-replication)
- (另見 deterministic operations)
- 在分散式系統中, [確定性模擬測試](/tw/ch9#deterministic-simulation-testing)
- 工作流程引擎中, [持久化執行](/tw/ch5#durable-execution)
- 部分失敗, [故障與部分失效](/tw/ch9#sec_distributed_partial_failure)
- 非決定因素, [確定性模擬測試](/tw/ch9#deterministic-simulation-testing)
- nonfunctional requirements, [定義非功能性需求](/tw/ch2#ch_nonfunctional), [總結](/tw/ch2#summary)
- nonrepeatable reads, [快照隔離與可重複讀](/tw/ch8#sec_transactions_snapshot_isolation)
- (另見 讀取偏差)
- 正規化, [正規化、反正規化與連線](/tw/ch3#sec_datamodels_normalization)-[多對一與多對多關係](/tw/ch3#sec_datamodels_many_to_many), [術語表](/tw/glossary)
- foreign key references, [多物件事務的需求](/tw/ch8#sec_transactions_need)
- 社會網路案例研究, [社交網路案例研究中的反正規化](/tw/ch3#denormalization-in-the-social-networking-case-study)
- 在記錄系統中, [記錄系統與派生資料](/tw/ch1#sec_introduction_derived)
- versus denormalization, [從同一事件日誌中派生多個檢視](/tw/ch12#sec_stream_deriving_views)
- NoSQL, [關係模型與文件模型](/tw/ch3#sec_datamodels_history), [複製延遲的解決方案](/tw/ch6#id131), [分拆資料庫](/tw/ch13#sec_future_unbundling)
- 事務和, [事務到底是什麼?](/tw/ch8#sec_transactions_overview)
- Notation3 (N3), [三元組儲存與 SPARQL](/tw/ch3#id59)
- 網路時間協議, [不可靠的時鐘](/tw/ch9#sec_distributed_clocks)
- 準確性, [時鐘同步和準確性](/tw/ch9#sec_distributed_clock_accuracy), [用於事件排序的時間戳](/tw/ch9#sec_distributed_lww)
- adjustments to monotonic clocks, [單調時鐘](/tw/ch9#monotonic-clocks)
- 多個伺服器地址, [弱形式的謊言](/tw/ch9#weak-forms-of-lying)
- XML 與 JSON 編碼中的數字, [JSON、XML 及其二進位制變體](/tw/ch5#sec_encoding_json)
- NumPy (Python library), [資料框、矩陣與陣列](/tw/ch3#sec_datamodels_dataframes), [列式儲存](/tw/ch4#sec_storage_column)
- NVMe (Non-Volatile Memory Express)(見 solid state drives (SSDs))
### O
- 物件資料庫, [關係模型與文件模型](/tw/ch3#sec_datamodels_history)
- 物件儲存, [雲服務的分層](/tw/ch1#layering-of-cloud-services), [物件儲存](/tw/ch11#id277)
- Azure Blob Storage(見 Azure Blob Storage)
- 比較分散式檔案系統, [物件儲存](/tw/ch11#id277)
- comparison to key-value stores, [物件儲存](/tw/ch11#id277)
- databases backed by, [設定新的副本](/tw/ch6#sec_replication_new_replica)
- 備份, [複製](/tw/ch6#ch_replication)
- 用於雲資料倉庫, [雲資料倉庫](/tw/ch4#sec_cloud_data_warehouses), [寫入列式儲存](/tw/ch4#writing-to-column-oriented-storage)
- 資料庫複製, [設定新的副本](/tw/ch6#sec_replication_new_replica)
- Google Cloud Storage(見 Google Cloud Storage)
- 物件大小, [儲存與計算的分離](/tw/ch1#sec_introduction_storage_compute)
- S3(見 S3(物件儲存))
- storing LSM segment files, [構建和合並 SSTable](/tw/ch4#constructing-and-merging-sstables)
- support for fencing, [隔離殭屍程序和延遲請求](/tw/ch9#sec_distributed_fencing_tokens)
- 資料湖中的使用, [從資料倉庫到資料湖](/tw/ch1#from-data-warehouse-to-data-lake)
- 物件關係對映(ORM)框架, [物件關係對映(ORM)](/tw/ch3#object-relational-mapping-orm)
- 處理錯誤和中止事務, [處理錯誤和中止](/tw/ch8#handling-errors-and-aborts)
- unsafe read-modify-write cycle code, [原子寫操作](/tw/ch8#atomic-write-operations)
- 物件關係不匹配, [物件關係不匹配](/tw/ch3#sec_datamodels_document)
- 可觀察性, [分散式系統的問題](/tw/ch1#sec_introduction_dist_sys_problems), [人類與可靠性](/tw/ch2#id31), [可運維性:讓運維更輕鬆](/tw/ch2#id37)
- 觀察員模式, [應用程式碼和狀態的分離](/tw/ch13#id344)
- OBT (one big table), [星型與雪花型:分析模式](/tw/ch3#sec_datamodels_analytics)
- 離線系統, [批處理](/tw/ch11#ch_batch)
- (另見 batch processing)
- 離線第一應用程式, [即時協作、離線優先和本地優先應用](/tw/ch6#real-time-collaboration-offline-first-and-local-first-apps), [有狀態、可離線的客戶端](/tw/ch13#id347)
- offsets
- consumer offsets in sharded logs, [消費者偏移量](/tw/ch12#sec_stream_log_offsets)
- messages in sharded logs, [使用日誌進行訊息儲存](/tw/ch12#id300)
- OLAP, [事務處理與分析的特徵](/tw/ch1#sec_introduction_oltp), [術語表](/tw/glossary)
- data cubes, [物化檢視與資料立方體](/tw/ch4#sec_storage_materialized_views)
- OLTP, [事務處理與分析的特徵](/tw/ch1#sec_introduction_oltp), [術語表](/tw/glossary)
- 分析查詢與, [分析(Analytics)](/tw/ch11#sec_batch_olap)
- data normalization, [正規化的權衡](/tw/ch3#trade-offs-of-normalization)
- 工作量特點, [實際序列執行](/tw/ch8#sec_transactions_serial)
- 現場部署, [雲服務與自託管](/tw/ch1#sec_introduction_cloud)
- 資料倉庫, [雲資料倉庫](/tw/ch4#sec_cloud_data_warehouses)
- one big table (data warehouse schema), [星型與雪花型:分析模式](/tw/ch3#sec_datamodels_analytics)
- one-hot encoding, [資料框、矩陣與陣列](/tw/ch3#sec_datamodels_dataframes)
- one-to-few relationships, [用於一對多關係的文件資料模型](/tw/ch3#the-document-data-model-for-one-to-many-relationships)
- one-to-many relationships, [用於一對多關係的文件資料模型](/tw/ch3#the-document-data-model-for-one-to-many-relationships)
- JSON representation, [用於一對多關係的文件資料模型](/tw/ch3#the-document-data-model-for-one-to-many-relationships)
- 線上系統, [批處理](/tw/ch11#ch_batch)
- (另見 services)
- 相對於科學計算, [雲計算與超級計算](/tw/ch1#id17)
- ontologies, [三元組儲存與 SPARQL](/tw/ch3#id59)
- Oozie(工作流排程器), [批處理](/tw/ch11#ch_batch)
- OpenAPI (service definition format), [微服務與無伺服器](/tw/ch1#sec_introduction_microservices), [Web 服務](/tw/ch5#sec_web_services)
- use of JSON Schema, [JSON 模式](/tw/ch5#json-schema)
- openCypher(見 Cypher(查詢語言))
- OpenLink Virtuoso(見 Virtuoso(資料庫))
- OpenStack
- Swift(物件儲存), [物件儲存](/tw/ch11#id277)
- 可操作性, [可運維性:讓運維更輕鬆](/tw/ch2#id37)
- 作業系統與資料庫, [分拆資料庫](/tw/ch13#sec_future_unbundling)
- operational systems, [分析型與事務型系統](/tw/ch1#sec_introduction_analytics)
- (另見 線上事務處理)
- 作為記錄系統, [記錄系統與派生資料](/tw/ch1#sec_introduction_derived)
- ETL into analytical systems, [資料倉庫](/tw/ch1#sec_introduction_dwh)
- operational transformation, [CRDT 與操作變換](/tw/ch6#sec_replication_crdts)
- operations teams, [雲時代的運維](/tw/ch1#sec_introduction_operations)
- 運算元, [查詢執行:編譯與向量化](/tw/ch4#sec_storage_vectorized)
- in stream processing, [流處理](/tw/ch12#sec_stream_processing)
- 樂觀併發控制, [悲觀併發控制與樂觀併發控制](/tw/ch8#pessimistic-versus-optimistic-concurrency-control)
- 樂觀鎖定, [條件寫入(比較並設定)](/tw/ch8#sec_transactions_compare_and_set)
- Oracle(資料庫)
- 分散式事務支援, [XA 事務](/tw/ch8#xa-transactions)
- GoldenGate (change data capture), [資料變更捕獲的實現](/tw/ch12#id307)
- hierarchical queries, [SQL 中的圖查詢](/tw/ch3#id58)
- lack of serializability, [隔離性](/tw/ch8#sec_transactions_acid_isolation)
- 基於領導者的複製, [單主複製](/tw/ch6#sec_replication_leader)
- 多領導複製, [跨地域執行](/tw/ch6#sec_replication_multi_dc)
- 多表索引叢集表, [讀寫的資料區域性](/tw/ch3#sec_datamodels_document_locality)
- not preventing write skew, [寫偏差的特徵](/tw/ch8#characterizing-write-skew)
- PL/SQL language, [儲存過程的利弊](/tw/ch8#sec_transactions_stored_proc_tradeoffs)
- 防止丟失更新, [自動檢測丟失的更新](/tw/ch8#automatically-detecting-lost-updates)
- read committed isolation, [實現讀已提交](/tw/ch8#sec_transactions_read_committed_impl)
- Real Application Clusters (RAC), [鎖定與領導者選舉](/tw/ch10#locking-and-leader-election)
- snapshot isolation support, [快照隔離與可重複讀](/tw/ch8#sec_transactions_snapshot_isolation), [快照隔離、可重複讀和命名混淆](/tw/ch8#snapshot-isolation-repeatable-read-and-naming-confusion)
- TimesTen (in-memory database), [全記憶體儲存](/tw/ch4#sec_storage_inmemory)
- WAL-based replication, [預寫日誌(WAL)傳輸](/tw/ch6#write-ahead-log-wal-shipping)
- ORC(資料格式), [雲資料倉庫](/tw/ch4#sec_cloud_data_warehouses), [列式儲存](/tw/ch4#sec_storage_column)
- (另見 column-oriented storage)
- orchestration (service deployment), [雲服務與自託管](/tw/ch1#sec_introduction_cloud), [微服務與無伺服器](/tw/ch1#sec_introduction_microservices)
- batch job execution, [分散式作業編排](/tw/ch11#id278)
- 工作流程引擎, [批處理](/tw/ch11#ch_batch)
- ordering
- of event logs, [事件溯源與 CQRS](/tw/ch3#sec_datamodels_events)
- limits of total ordering, [全序的限制](/tw/ch13#id335)
- 邏輯時間戳, [邏輯時鐘](/tw/ch10#sec_consistency_timestamps)
- of auto-incrementing IDs, [ID 生成器和邏輯時鐘](/tw/ch10#sec_consistency_logical)
- 共享日誌, [共識的實踐](/tw/ch10#sec_consistency_total_order)-[共識的利弊](/tw/ch10#pros-and-cons-of-consensus)
- Orkes(工作流程引擎), [持久化執行與工作流](/tw/ch5#sec_encoding_dataflow_workflows)
- orphan pages (B-trees), [使 B 樹可靠](/tw/ch4#sec_storage_btree_wal)
- outbox pattern, [資料變更捕獲與事件溯源](/tw/ch12#sec_stream_event_sourcing)
- 異常值(響應時間), [平均值、中位數與百分位點](/tw/ch2#id24)
- 外包, [雲服務與自託管](/tw/ch1#sec_introduction_cloud)
- 超載, [描述效能](/tw/ch2#sec_introduction_percentiles), [處理錯誤和中止](/tw/ch8#handling-errors-and-aborts)
### P
- PACELC principle, [CAP 定理](/tw/ch10#the-cap-theorem)
- 軟體包管理器, [應用程式碼和狀態的分離](/tw/ch13#id344)
- packet switching, [我們不能簡單地使網路延遲可預測嗎?](/tw/ch9#can-we-not-simply-make-network-delays-predictable)
- packets
- corruption of, [弱形式的謊言](/tw/ch9#weak-forms-of-lying)
- sending via UDP, [直接從生產者傳遞給消費者](/tw/ch12#id296)
- PageRank (algorithm), [圖資料模型](/tw/ch3#sec_datamodels_graph), [查詢語言](/tw/ch11#sec_batch_query_lanauges), [機器學習](/tw/ch11#id290)
- paging(見 virtual memory)
- pandas (Python library), [從資料倉庫到資料湖](/tw/ch1#from-data-warehouse-to-data-lake), [資料框、矩陣與陣列](/tw/ch3#sec_datamodels_dataframes), [列式儲存](/tw/ch4#sec_storage_column), [DataFrames](/tw/ch11#id287)
- Parquet(資料格式), [雲資料倉庫](/tw/ch4#sec_cloud_data_warehouses), [列式儲存](/tw/ch4#sec_storage_column), [歸檔儲存](/tw/ch5#archival-storage), [查詢語言](/tw/ch11#sec_batch_query_lanauges)
- (另見 column-oriented storage)
- databases on object storage, [設定新的副本](/tw/ch6#sec_replication_new_replica)
- 文件資料模型, [列式儲存](/tw/ch4#sec_storage_column)
- 批次處理中的用途, [MapReduce](/tw/ch11#sec_batch_mapreduce)
- 部分失敗, [故障與部分失效](/tw/ch9#sec_distributed_partial_failure), [總結](/tw/ch9#summary)
- limping, [系統模型與現實](/tw/ch9#sec_distributed_system_model)
- 部分同步(系統模型), [系統模型與現實](/tw/ch9#sec_distributed_system_model)
- 分割槽鍵, [分片的利與弊](/tw/ch7#sec_sharding_reasons), [鍵值資料的分片](/tw/ch7#sec_sharding_key_value)
- 分割槽(見 分片)
- Paxos (consensus algorithm), [共識](/tw/ch10#sec_consistency_consensus), [共識的實踐](/tw/ch10#sec_consistency_total_order)
- ballot number, [從單主複製到共識](/tw/ch10#from-single-leader-replication-to-consensus)
- Multi-Paxos, [共識的實踐](/tw/ch10#sec_consistency_total_order)
- payment card industry (PCI), [資料系統、法律與社會](/tw/ch1#sec_introduction_compliance)
- PCI (payment card industry) compliance, [資料系統、法律與社會](/tw/ch1#sec_introduction_compliance)
- 百分位點, [平均值、中位數與百分位點](/tw/ch2#id24), [術語表](/tw/glossary)
- calculating efficiently, [響應時間指標的應用](/tw/ch2#sec_introduction_slo_sla)
- importance of high percentiles, [響應時間指標的應用](/tw/ch2#sec_introduction_slo_sla)
- use in service level agreements (SLAs), [響應時間指標的應用](/tw/ch2#sec_introduction_slo_sla)
- Percolator (Google), [實現線性一致的 ID 生成器](/tw/ch10#implementing-a-linearizable-id-generator)
- Percona XtraBackup (MySQL tool), [設定新的副本](/tw/ch6#sec_replication_new_replica)
- 效能
- degradation as a fault, [系統模型與現實](/tw/ch9#sec_distributed_system_model)
- 描述, [描述效能](/tw/ch2#sec_introduction_percentiles)
- 分散式事務, [跨不同系統的分散式事務](/tw/ch8#sec_transactions_xa)
- 記憶體資料庫, [全記憶體儲存](/tw/ch4#sec_storage_inmemory)
- of linearizability, [線性一致性與網路延遲](/tw/ch10#linearizability-and-network-delays)
- 多領導者複製, [跨地域執行](/tw/ch6#sec_replication_multi_dc)
- 許可權隔離, [面向多租戶的分片](/tw/ch7#sec_sharding_multitenancy)
- 永久不一致, [及時性與完整性](/tw/ch13#sec_future_integrity)
- 悲觀併發控制, [悲觀併發控制與樂觀併發控制](/tw/ch8#pessimistic-versus-optimistic-concurrency-control)
- pglogical (PostgreSQL extension), [跨地域執行](/tw/ch6#sec_replication_multi_dc)
- pgvector (vector index), [向量嵌入](/tw/ch4#id92)
- 幻讀, [導致寫偏差的幻讀](/tw/ch8#sec_transactions_phantom)
- 物化衝突, [物化衝突](/tw/ch8#materializing-conflicts)
- preventing, in serializability, [謂詞鎖](/tw/ch8#predicate-locks)
- physical clocks(見 clocks)
- pickle (Python), [特定語言的格式](/tw/ch5#id96)
- Pinot(資料庫), [事務處理與分析的特徵](/tw/ch1#sec_introduction_oltp), [列式儲存](/tw/ch4#sec_storage_column)
- 處理寫入, [寫入列式儲存](/tw/ch4#writing-to-column-oriented-storage)
- 預彙總, [分析(Analytics)](/tw/ch11#sec_batch_olap)
- serving derived data, [對外提供派生資料](/tw/ch11#sec_batch_serving_derived)
- pipelined execution
- of data warehouse queries, [查詢執行:編譯與向量化](/tw/ch4#sec_storage_vectorized)
- pivot tables, [資料框、矩陣與陣列](/tw/ch3#sec_datamodels_dataframes)
- 時間點, [不可靠的時鐘](/tw/ch9#sec_distributed_clocks)
- 點查詢, [事務處理與分析的特徵](/tw/ch1#sec_introduction_oltp)
- Polaris (data catalog), [雲資料倉庫](/tw/ch4#sec_cloud_data_warehouses)
- polling, [表示使用者、帖子與關注關係](/tw/ch2#id20)
- polystores, [一切的元資料庫](/tw/ch13#id341)
- POSIX (portable operating system interface)
- compliant filesystems, [設定新的副本](/tw/ch6#sec_replication_new_replica), [分散式檔案系統](/tw/ch11#sec_batch_dfs), [物件儲存](/tw/ch11#id277)
- Post Office Horizon scandal, [人類與可靠性](/tw/ch2#id31)
- lack of transactions, [事務](/tw/ch8#ch_transactions)
- PostgreSQL(資料庫)
- archiving WAL to object stores, [設定新的副本](/tw/ch6#sec_replication_new_replica)
- 資料變更捕獲, [資料變更捕獲的實現](/tw/ch12#id307), [變更流的 API 支援](/tw/ch12#sec_stream_change_api)
- 分散式事務支援, [XA 事務](/tw/ch8#xa-transactions)
- foreign data wrappers, [一切的元資料庫](/tw/ch13#id341)
- 全文搜尋支援, [組合使用派生資料的工具](/tw/ch13#id442)
- 在雲層中, [雲原生系統架構](/tw/ch1#sec_introduction_cloud_native)
- JSON Schema validation, [JSON 模式](/tw/ch5#json-schema)
- 基於領導者的複製, [單主複製](/tw/ch6#sec_replication_leader)
- 日誌序列號, [設定新的副本](/tw/ch6#sec_replication_new_replica)
- 邏輯解碼, [邏輯(基於行)日誌複製](/tw/ch6#logical-row-based-log-replication)
- materialized view maintenance, [維護物化檢視](/tw/ch12#sec_stream_mat_view)
- 多領導複製, [跨地域執行](/tw/ch6#sec_replication_multi_dc)
- MVCC implementation, [多版本併發控制(MVCC)](/tw/ch8#sec_transactions_snapshot_impl), [索引與快照隔離](/tw/ch8#indexes-and-snapshot-isolation)
- partitioning versus sharding, [分片](/tw/ch7#ch_sharding)
- pgvector (vector index), [向量嵌入](/tw/ch4#id92)
- PL/pgSQL language, [儲存過程的利弊](/tw/ch8#sec_transactions_stored_proc_tradeoffs)
- PostGIS geospatial indexes, [多維索引與全文索引](/tw/ch4#sec_storage_multidimensional)
- 防止丟失更新, [自動檢測丟失的更新](/tw/ch8#automatically-detecting-lost-updates)
- preventing write skew, [寫偏差的特徵](/tw/ch8#characterizing-write-skew), [可序列化快照隔離(SSI)](/tw/ch8#sec_transactions_ssi)
- read committed isolation, [實現讀已提交](/tw/ch8#sec_transactions_read_committed_impl)
- representing graphs, [屬性圖](/tw/ch3#id56)
- 可序列化快照隔離, [可序列化快照隔離(SSI)](/tw/ch8#sec_transactions_ssi)
- 分片(見 Citus(資料庫))
- snapshot isolation support, [快照隔離與可重複讀](/tw/ch8#sec_transactions_snapshot_isolation), [快照隔離、可重複讀和命名混淆](/tw/ch8#snapshot-isolation-repeatable-read-and-naming-confusion)
- WAL-based replication, [預寫日誌(WAL)傳輸](/tw/ch6#write-ahead-log-wal-shipping)
- 倒排列表, [全文檢索](/tw/ch4#sec_storage_full_text)
- in sharded indexes, [本地二級索引](/tw/ch7#id166)
- postmortems, blameless, [人類與可靠性](/tw/ch2#id31)
- PouchDB(資料庫), [同步引擎的利弊](/tw/ch6#pros-and-cons-of-sync-engines)
- Power BI (business intelligence software), [事務處理與分析的特徵](/tw/ch1#sec_introduction_oltp), [分析(Analytics)](/tw/ch11#sec_batch_olap)
- 預彙總, [分析(Analytics)](/tw/ch11#sec_batch_olap)
- 服務衍生資料, [對外提供派生資料](/tw/ch11#sec_batch_serving_derived)
- pre-splitting, [重新平衡鍵範圍分片資料](/tw/ch7#rebalancing-key-range-sharded-data)
- Precision Time Protocol (PTP), [時鐘同步和準確性](/tw/ch9#sec_distributed_clock_accuracy)
- predicate locking, [謂詞鎖](/tw/ch8#predicate-locks)
- 預測分析, [分析型與事務型系統](/tw/ch1#sec_introduction_analytics), [預測分析](/ch14#id369)-[反饋迴路](/ch14#id372)
- amplifying bias, [偏見與歧視](/ch14#id370)
- ethics of(見 ethics)
- 反饋迴圈, [反饋迴路](/ch14#id372)
- preemption, [資源分配](/tw/ch11#id279)
- in distributed schedulers, [故障處理](/tw/ch11#id281)
- of threads, [程序暫停](/tw/ch9#sec_distributed_clocks_pauses)
- Prefect(工作流排程器), [持久化執行與工作流](/tw/ch5#sec_encoding_dataflow_workflows), [批處理](/tw/ch11#ch_batch), [工作流排程](/tw/ch11#sec_batch_workflows)
- 雲資料倉整合, [查詢語言](/tw/ch11#sec_batch_query_lanauges)
- Presto(查詢引擎), [雲資料倉庫](/tw/ch4#sec_cloud_data_warehouses)
- primary keys, [多列索引與二級索引](/tw/ch4#sec_storage_index_multicolumn), [術語表](/tw/glossary)
- auto-incrementing, [ID 生成器和邏輯時鐘](/tw/ch10#sec_consistency_logical)
- versus partition key, [按雜湊範圍分片](/tw/ch7#sharding-by-hash-range)
- primary-backup replication(見 基於領導者的複製)
- 隱私, [隱私與追蹤](/ch14#id373)-[立法與自律](/ch14#sec_future_legislation)
- 同意和選擇自由, [同意與選擇自由](/ch14#id375)
- 資料作為資產和權力, [資料作為資產與權力](/ch14#id376)
- 刪除資料, [不變性的侷限性](/tw/ch12#sec_stream_immutability_limitations)
- ethical considerations(見 ethics)
- 立法和自律, [立法與自律](/ch14#sec_future_legislation)
- meaning of, [隱私與資料使用](/ch14#id457)
- regulation, [資料系統、法律與社會](/tw/ch1#sec_introduction_compliance)
- 監視, [監視](/ch14#id374)
- 跟蹤行為資料, [隱私與追蹤](/ch14#id373)
- 機率演算法, [響應時間指標的應用](/tw/ch2#sec_introduction_slo_sla), [流分析](/tw/ch12#id318)
- 程序暫停, [程序暫停](/tw/ch9#sec_distributed_clocks_pauses)-[限制垃圾回收的影響](/tw/ch9#sec_distributed_gc_impact)
- 處理時間(事件), [時間推理](/tw/ch12#sec_stream_time)
- producers (message streams), [傳遞事件流](/tw/ch12#sec_stream_transmit)
- 產品分析, [事務處理與分析的特徵](/tw/ch1#sec_introduction_oltp)
- 面向列的儲存, [列式儲存](/tw/ch4#sec_storage_column)
- 程式語言
- 用於儲存程式, [儲存過程的利弊](/tw/ch8#sec_transactions_stored_proc_tradeoffs)
- projections (event sourcing), [事件溯源與 CQRS](/tw/ch3#sec_datamodels_events)
- Prolog(語言), [Datalog:遞迴關係查詢](/tw/ch3#id62)
- (另見 Datalog)
- 屬性圖, [屬性圖](/tw/ch3#id56)
- Cypher 查詢語言, [Cypher 查詢語言](/tw/ch3#id57)
- Property Graph Query Language (PGQL), [SQL 中的圖查詢](/tw/ch3#id58)
- 基於屬性的測試, [人類與可靠性](/tw/ch2#id31), [形式化方法和隨機測試](/tw/ch9#sec_distributed_formal)
- Protocol Buffers(資料格式), [Protocol Buffers](/tw/ch5#sec_encoding_protobuf)-[欄位標籤與模式演化](/tw/ch5#field-tags-and-schema-evolution), [Protocol Buffers](/tw/ch5#sec_encoding_protobuf)
- field tags and schema evolution, [欄位標籤與模式演化](/tw/ch5#field-tags-and-schema-evolution)
- provenance of data, [為可審計性而設計](/tw/ch13#id365)
- 釋出/訂閱模式, [訊息傳遞系統](/tw/ch12#sec_stream_messaging)
- publishers (message streams), [傳遞事件流](/tw/ch12#sec_stream_transmit)
- Pulsar (streaming platform), [確認與重新傳遞](/tw/ch12#sec_stream_reordering)
- PyTorch (machine learning library), [機器學習](/tw/ch11#id290)
### Q
- Qpid(訊息系統), [訊息代理與資料庫的對比](/tw/ch12#id297)
- quality of service (QoS), [我們不能簡單地使網路延遲可預測嗎?](/tw/ch9#can-we-not-simply-make-network-delays-predictable)
- Quantcast File System(分散式檔案系統), [物件儲存](/tw/ch11#id277)
- 查詢引擎
- compilation and vectorization, [查詢執行:編譯與向量化](/tw/ch4#sec_storage_vectorized)
- 在雲資料倉庫中, [雲資料倉庫](/tw/ch4#sec_cloud_data_warehouses)
- 運算元, [查詢執行:編譯與向量化](/tw/ch4#sec_storage_vectorized)
- optimizing declarative queries, [資料模型與查詢語言](/tw/ch3#ch_datamodels)
- 查詢語言
- Cypher, [Cypher 查詢語言](/tw/ch3#id57)
- Datalog, [Datalog:遞迴關係查詢](/tw/ch3#id62)
- GraphQL, [GraphQL](/tw/ch3#id63)
- MongoDB aggregation pipeline, [正規化、反正規化與連線](/tw/ch3#sec_datamodels_normalization), [文件的查詢語言](/tw/ch3#query-languages-for-documents)
- recursive SQL queries, [SQL 中的圖查詢](/tw/ch3#id58)
- SPARQL, [SPARQL 查詢語言](/tw/ch3#the-sparql-query-language)
- SQL, [正規化、反正規化與連線](/tw/ch3#sec_datamodels_normalization)
- 查詢最佳化器, [查詢語言](/tw/ch11#sec_batch_query_lanauges)
- 查詢計劃, [查詢執行:編譯與向量化](/tw/ch4#sec_storage_vectorized)
- 排隊延遲, [網路擁塞和排隊](/tw/ch9#network-congestion-and-queueing)
- 頭部阻塞, [延遲與響應時間](/tw/ch2#id23)
- 延遲和反應時間, [延遲與響應時間](/tw/ch2#id23)
- 佇列(訊息), [訊息代理](/tw/ch5#message-brokers)
- QUIC (protocol), [TCP 的侷限性](/tw/ch9#sec_distributed_tcp)
- 法定人數, [讀寫仲裁](/tw/ch6#sec_replication_quorum_condition)-[多地區操作](/tw/ch6#multi-region-operation), [術語表](/tw/glossary)
- for leaderless replication, [讀寫仲裁](/tw/ch6#sec_replication_quorum_condition)
- 在共識演算法中, [從單主複製到共識](/tw/ch10#from-single-leader-replication-to-consensus)
- 一致性的限制, [仲裁一致性的侷限](/tw/ch6#sec_replication_quorum_limitations)-[監控陳舊性](/tw/ch6#monitoring-staleness), [線性一致性與仲裁](/tw/ch10#sec_consistency_quorum_linearizable)
- 在分散式系統中作出決定, [多數派原則](/tw/ch9#sec_distributed_majority)
- monitoring staleness, [監控陳舊性](/tw/ch6#monitoring-staleness)
- 多區域複製, [多地區操作](/tw/ch6#multi-region-operation)
- reliance on durability, [將系統模型對映到現實世界](/tw/ch9#mapping-system-models-to-the-real-world)
- 配額, [雲時代的運維](/tw/ch1#sec_introduction_operations)
### R
- R(語言), [從資料倉庫到資料湖](/tw/ch1#from-data-warehouse-to-data-lake), [資料框、矩陣與陣列](/tw/ch3#sec_datamodels_dataframes), [DataFrames](/tw/ch11#id287)
- R-trees (indexes), [多維索引與全文索引](/tw/ch4#sec_storage_multidimensional)
- R2(物件儲存), [雲服務的分層](/tw/ch1#layering-of-cloud-services), [分散式檔案系統](/tw/ch11#sec_batch_dfs)
- RabbitMQ(訊息系統), [訊息代理](/tw/ch5#message-brokers), [訊息代理與資料庫的對比](/tw/ch12#id297)
- 法定人數佇列(複製), [單主複製](/tw/ch6#sec_replication_leader)
- race conditions, [隔離性](/tw/ch8#sec_transactions_acid_isolation)
- (另見 併發)
- avoiding with linearizability, [跨通道時序依賴](/tw/ch10#cross-channel-timing-dependencies)
- 由雙寫引起, [保持系統同步](/tw/ch12#sec_stream_sync)
- 造成資金損失, [弱隔離級別](/tw/ch8#sec_transactions_isolation_levels)
- dirty writes, [沒有髒寫](/tw/ch8#sec_transactions_dirty_write)
- in counter increments, [沒有髒寫](/tw/ch8#sec_transactions_dirty_write)
- 丟失更新, [防止丟失更新](/tw/ch8#sec_transactions_lost_update)-[衝突解決與複製](/tw/ch8#conflict-resolution-and-replication)
- 以事件日誌防止, [併發控制](/tw/ch12#sec_stream_concurrency), [資料流:應用程式碼與狀態變化的互動](/tw/ch13#id450)
- 以可序列隔離的方式防止, [可序列化](/tw/ch8#sec_transactions_serializability)
- weak transaction isolation, [弱隔離級別](/tw/ch8#sec_transactions_isolation_levels)
- 寫偏差, [寫偏差與幻讀](/tw/ch8#sec_transactions_write_skew)-[物化衝突](/tw/ch8#materializing-conflicts)
- Raft (consensus algorithm), [共識](/tw/ch10#sec_consistency_consensus), [共識的實踐](/tw/ch10#sec_consistency_total_order)
- 基於領導者的複製, [單主複製](/tw/ch6#sec_replication_leader)
- 對網路問題的敏感性, [共識的利弊](/tw/ch10#pros-and-cons-of-consensus)
- 任期, [從單主複製到共識](/tw/ch10#from-single-leader-replication-to-consensus)
- use in etcd, [實現線性一致性系統](/tw/ch10#sec_consistency_implementing_linearizable)
- RAID (Redundant Array of Independent Disks), [儲存與計算的分離](/tw/ch1#sec_introduction_storage_compute), [透過冗餘容忍硬體故障](/tw/ch2#tolerating-hardware-faults-through-redundancy), [分散式檔案系統](/tw/ch11#sec_batch_dfs)
- railways, schema migration on, [應用演化後重新處理資料](/tw/ch13#sec_future_reprocessing)
- RAM(見 memory)
- RAMCloud (in-memory storage), [全記憶體儲存](/tw/ch4#sec_storage_inmemory)
- 隨機寫入(訪問模式), [順序與隨機寫入](/tw/ch4#sidebar_sequential)
- range queries
- in B-trees, [B 樹](/tw/ch4#sec_storage_b_trees), [讀取效能](/tw/ch4#read-performance)
- in LSM-trees, [讀取效能](/tw/ch4#read-performance)
- not efficient in hash maps, [日誌結構儲存](/tw/ch4#sec_storage_log_structured)
- with hash sharding, [按雜湊範圍分片](/tw/ch7#sharding-by-hash-range)
- 排名演算法, [機器學習](/tw/ch11#id290)
- Ray(工作流排程器), [機器學習](/tw/ch11#id290)
- RDF (Resource Description Framework), [RDF 資料模型](/tw/ch3#the-rdf-data-model)
- querying with SPARQL, [SPARQL 查詢語言](/tw/ch3#the-sparql-query-language)
- 遠端直接記憶體訪問, [雲服務的分層](/tw/ch1#layering-of-cloud-services), [雲計算與超級計算](/tw/ch1#id17)
- React (user interface library), [端到端的事件流](/tw/ch13#id349)
- reactive programming, [同步引擎的利弊](/tw/ch6#pros-and-cons-of-sync-engines)
- read committed isolation level, [讀已提交](/tw/ch8#sec_transactions_read_committed)-[實現讀已提交](/tw/ch8#sec_transactions_read_committed_impl)
- implementing, [實現讀已提交](/tw/ch8#sec_transactions_read_committed_impl)
- 多版本併發控制, [多版本併發控制(MVCC)](/tw/ch8#sec_transactions_snapshot_impl)
- 沒有髒讀, [沒有髒讀](/tw/ch8#no-dirty-reads)
- no dirty writes, [沒有髒寫](/tw/ch8#sec_transactions_dirty_write)
- read models (event sourcing), [事件溯源與 CQRS](/tw/ch3#sec_datamodels_events)
- 讀路徑, [觀察派生資料狀態](/tw/ch13#sec_future_observing)
- read repair (leaderless replication), [追趕錯過的寫入](/tw/ch6#sec_replication_read_repair)
- for linearizability, [線性一致性與仲裁](/tw/ch10#sec_consistency_quorum_linearizable)
- 只讀副本(見 基於領導者的複製)
- 讀取偏差, [快照隔離與可重複讀](/tw/ch8#sec_transactions_snapshot_isolation), [總結](/tw/ch8#summary)
- read uncommitted isolation level, [實現讀已提交](/tw/ch8#sec_transactions_read_committed_impl)
- 寫後讀一致性, [讀己之寫](/tw/ch6#sec_replication_ryw), [及時性與完整性](/tw/ch13#sec_future_integrity)
- cross-device, [讀己之寫](/tw/ch6#sec_replication_ryw)
- 在衍生資料系統中, [派生資料與分散式事務](/tw/ch13#sec_future_derived_vs_transactions)
- 讀 - 修改 - 寫入週期, [防止丟失更新](/tw/ch8#sec_transactions_lost_update)
- read-scaling architecture, [複製延遲的問題](/tw/ch6#sec_replication_lag), [單主與無主複製的效能](/tw/ch6#sec_replication_leaderless_perf)
- versus sharding, [分片的利與弊](/tw/ch7#sec_sharding_reasons)
- reads as events, [讀也是事件](/tw/ch13#sec_future_read_events)
- real-time
- analytics(見 product analytics)
- 協作編輯, [即時協作、離線優先和本地優先應用](/tw/ch6#real-time-collaboration-offline-first-and-local-first-apps)
- 釋出/訂閱資料流, [端到端的事件流](/tw/ch13#id349)
- 響應時間保障, [響應時間保證](/tw/ch9#sec_distributed_clocks_realtime)
- time-of-day clocks, [日曆時鐘](/tw/ch9#time-of-day-clocks)
- Realm(資料庫), [同步引擎的利弊](/tw/ch6#pros-and-cons-of-sync-engines)
- rebalancing shards, [重新平衡鍵範圍分片資料](/tw/ch7#rebalancing-key-range-sharded-data)-[運維:自動/手動再平衡](/tw/ch7#sec_sharding_operations), [術語表](/tw/glossary)
- (另見 分片)
- 自動或人工重新平衡, [運維:自動/手動再平衡](/tw/ch7#sec_sharding_operations)
- fixed number of shards, [固定數量的分片](/tw/ch7#fixed-number-of-shards)
- fixed number of shards per node, [按雜湊範圍分片](/tw/ch7#sharding-by-hash-range)
- Hash mod N的問題, [雜湊取模節點數](/tw/ch7#hash-modulo-number-of-nodes)
- 新鮮度保證, [線性一致性](/tw/ch10#sec_consistency_linearizability)
- recommendation engines, [分析型與事務型系統](/tw/ch1#sec_introduction_analytics)
- building using DataFrames, [資料框、矩陣與陣列](/tw/ch3#sec_datamodels_dataframes)
- 迭代處理, [機器學習](/tw/ch11#id290)
- reconfiguration (consensus), [共識的微妙之處](/tw/ch10#subtleties-of-consensus)
- 記錄, [MapReduce](/tw/ch11#sec_batch_mapreduce)
- 流處理中的事件, [傳遞事件流](/tw/ch12#sec_stream_transmit)
- 遞迴查詢
- in Cypher, [Cypher 查詢語言](/tw/ch3#id57)
- in Datalog, [Datalog:遞迴關係查詢](/tw/ch3#id62)
- in SPARQL, [SPARQL 查詢語言](/tw/ch3#the-sparql-query-language)
- lack of, in GraphQL, [GraphQL](/tw/ch3#id63)
- SQL common table expressions, [SQL 中的圖查詢](/tw/ch3#id58)
- Red Hat
- Apicurio Registry, [JSON 模式](/tw/ch5#json-schema)
- 紅黑樹, [構建和合並 SSTable](/tw/ch4#constructing-and-merging-sstables)
- redelivery (messaging), [確認與重新傳遞](/tw/ch12#sec_stream_reordering)
- Redis(資料庫)
- 原子操作, [原子寫操作](/tw/ch8#atomic-write-operations)
- CRDT support, [CRDT 與操作變換](/tw/ch6#sec_replication_crdts)
- 永續性, [全記憶體儲存](/tw/ch4#sec_storage_inmemory)
- Lua 指令碼, [儲存過程的利弊](/tw/ch8#sec_transactions_stored_proc_tradeoffs)
- 多領導複製, [跨地域執行](/tw/ch6#sec_replication_multi_dc)
- process-per-core model, [分片的利與弊](/tw/ch7#sec_sharding_reasons)
- single-threaded execution, [實際序列執行](/tw/ch8#sec_transactions_serial)
- redo log(見 write-ahead log)
- Redpanda(訊息系統), [訊息代理](/tw/ch5#message-brokers), [設定新的副本](/tw/ch6#sec_replication_new_replica)
- 分層儲存, [磁碟空間使用](/tw/ch12#sec_stream_disk_usage)
- Redshift(資料庫), [雲資料倉庫](/tw/ch4#sec_cloud_data_warehouses)
- 冗餘
- 硬體元件, [透過冗餘容忍硬體故障](/tw/ch2#tolerating-hardware-faults-through-redundancy)
- of derived data, [記錄系統與派生資料](/tw/ch1#sec_introduction_derived)
- (另見 衍生資料)
- Reed--Solomon codes (error correction), [分散式檔案系統](/tw/ch11#sec_batch_dfs)
- 重構, [可演化性:讓變化更容易](/tw/ch2#sec_introduction_evolvability)
- (另見 可演化性)
- regions (geographic distribution), [讀己之寫](/tw/ch6#sec_replication_ryw)
- (另見 datacenters)
- consensus across, [共識的利弊](/tw/ch10#pros-and-cons-of-consensus)
- 定義, [讀己之寫](/tw/ch6#sec_replication_ryw)
- 延遲, [分散式與單節點系統](/tw/ch1#sec_introduction_distributed)
- linearizable ID generation, [實現線性一致的 ID 生成器](/tw/ch10#implementing-a-linearizable-id-generator)
- 在整個區域複製, [跨地域執行](/tw/ch6#sec_replication_multi_dc)-[不同拓撲的問題](/tw/ch6#problems-with-different-topologies), [線性一致性的代價](/tw/ch10#sec_linearizability_cost), [全序的限制](/tw/ch13#id335)
- 無主(無領導者), [多地區操作](/tw/ch6#multi-region-operation)
- 多領導者, [跨地域執行](/tw/ch6#sec_replication_multi_dc)
- regions (sharding), [分片](/tw/ch7#ch_sharding)
- 暫存器, [什麼使系統具有線性一致性?](/tw/ch10#sec_consistency_lin_definition)
- regulation(見 legal matters)
- 關係資料模型, [從資料倉庫到資料湖](/tw/ch1#from-data-warehouse-to-data-lake), [關係模型與文件模型](/tw/ch3#sec_datamodels_history)-[文件和關係資料庫的融合](/tw/ch3#convergence-of-document-and-relational-databases)
- comparison to document model, [何時使用哪種模型](/tw/ch3#sec_datamodels_document_summary)-[文件和關係資料庫的融合](/tw/ch3#convergence-of-document-and-relational-databases)
- graph queries in SQL, [SQL 中的圖查詢](/tw/ch3#id58)
- in-memory databases with, [全記憶體儲存](/tw/ch4#sec_storage_inmemory)
- many-to-one and many-to-many relationships, [多對一與多對多關係](/tw/ch3#sec_datamodels_many_to_many)
- 多物件事務, 需要, [多物件事務的需求](/tw/ch8#sec_transactions_need)
- 物件關係不匹配, [物件關係不匹配](/tw/ch3#sec_datamodels_document)
- representing reorderable lists, [何時使用哪種模型](/tw/ch3#sec_datamodels_document_summary)
- versus document model
- convergence of models, [文件和關係資料庫的融合](/tw/ch3#convergence-of-document-and-relational-databases)
- data locality, [讀寫的資料區域性](/tw/ch3#sec_datamodels_document_locality)
- 關係資料庫
- 最終一致性, [複製延遲的問題](/tw/ch6#sec_replication_lag)
- 歷史, [關係模型與文件模型](/tw/ch3#sec_datamodels_history)
- 基於領導者的複製, [單主複製](/tw/ch6#sec_replication_leader)
- 邏輯日誌, [邏輯(基於行)日誌複製](/tw/ch6#logical-row-based-log-replication)
- philosophy compared to Unix, [分拆資料庫](/tw/ch13#sec_future_unbundling), [一切的元資料庫](/tw/ch13#id341)
- schema changes, [文件模型中的模式靈活性](/tw/ch3#sec_datamodels_schema_flexibility), [編碼與演化](/tw/ch5#ch_encoding), [不同時間寫入的不同值](/tw/ch5#different-values-written-at-different-times)
- sharded secondary indexes, [分片與二級索引](/tw/ch7#sec_sharding_secondary_indexes)
- 基於語句的複製, [基於語句的複製](/tw/ch6#statement-based-replication)
- use of B-tree indexes, [B 樹](/tw/ch4#sec_storage_b_trees)
- relationships(見 edges)
- 可靠性, [可靠性與容錯](/tw/ch2#sec_introduction_reliability)-[人類與可靠性](/tw/ch2#id31), [流式系統的哲學](/tw/ch13#ch_philosophy)
- 從不可靠的元件建立可靠的系統, [故障與部分失效](/tw/ch9#sec_distributed_partial_failure)
- 硬體故障, [硬體與軟體故障](/tw/ch2#sec_introduction_hardware_faults)
- 人類錯誤, [人類與可靠性](/tw/ch2#id31)
- 重要性, [人類與可靠性](/tw/ch2#id31)
- of messaging systems, [訊息傳遞系統](/tw/ch12#sec_stream_messaging)
- 軟體故障, [軟體故障](/tw/ch2#software-faults)
- Remote Method Invocation (Java RMI), [遠端過程呼叫(RPC)的問題](/tw/ch5#sec_problems_with_rpc)
- remote procedure calls (RPCs), [遠端過程呼叫(RPC)的問題](/tw/ch5#sec_problems_with_rpc)-[RPC 的資料編碼與演化](/tw/ch5#data-encoding-and-evolution-for-rpc)
- (另見 services)
- 資料編碼和演化, [RPC 的資料編碼與演化](/tw/ch5#data-encoding-and-evolution-for-rpc)
- 問題, [遠端過程呼叫(RPC)的問題](/tw/ch5#sec_problems_with_rpc)
- 使用 Avro, [但什麼是寫入者模式?](/tw/ch5#but-what-is-the-writers-schema)
- 對信件經紀人, [事件驅動的架構](/tw/ch5#sec_encoding_dataflow_msg)
- 可再生能源, [分散式與單節點系統](/tw/ch1#sec_introduction_distributed)
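- repeatable read (transaction isolation), [Snapshot isolation, repeatable read, and naming confusion](/tw/ch8#snapshot-isolation-repeatable-read-and-naming-confusion)
- replicas, [Single-Leader Replication](/tw/ch6#sec_replication_leader)
- replication, [Replication](/tw/ch6#ch_replication)-[Summary](/tw/ch6#summary), [Glossary](/tw/glossary)
- and durability, [Durability](/tw/ch8#durability)
- conflict resolution and, [Conflict resolution and replication](/tw/ch8#conflict-resolution-and-replication)
- consistency properties, [Problems with Replication Lag](/tw/ch6#sec_replication_lag)-[Solutions for Replication Lag](/tw/ch6#id131)
- consistent prefix reads, [Consistent Prefix Reads](/tw/ch6#sec_replication_consistent_prefix)
- monotonic reads, [Monotonic Reads](/tw/ch6#sec_replication_monotonic_reads)
- reading your own writes, [Reading Your Own Writes](/tw/ch6#sec_replication_ryw)
- in distributed filesystems, [Distributed Filesystems](/tw/ch11#sec_batch_dfs)
- leaderless, [Leaderless Replication](/tw/ch6#sec_replication_leaderless)-[Version vectors](/tw/ch6#version-vectors)
- detecting concurrent writes, [Detecting Concurrent Writes](/tw/ch6#sec_replication_concurrent)-[Version vectors](/tw/ch6#version-vectors)
- limitations of quorum consistency, [Limitations of Quorum Consistency](/tw/ch6#sec_replication_quorum_limitations)-[Monitoring staleness](/tw/ch6#monitoring-staleness), [Linearizability and quorums](/tw/ch10#sec_consistency_quorum_linearizable)
- monitoring staleness, [Monitoring staleness](/tw/ch6#monitoring-staleness)
- multi-leader, [Multi-Leader Replication](/tw/ch6#sec_replication_multi_leader)-[Dealing with Conflicting Writes](/tw/ch6#sec_replication_write_conflicts)
- across multiple regions, [Geographically Distributed Operation](/tw/ch6#sec_replication_multi_dc), [The Cost of Linearizability](/tw/ch10#sec_linearizability_cost)
- conflict resolution, [Dealing with Conflicting Writes](/tw/ch6#sec_replication_write_conflicts)
- replication topologies, [Multi-leader replication topologies](/tw/ch6#sec_replication_topologies)-[Problems with different topologies](/tw/ch6#problems-with-different-topologies)
- reasons for using, [Distributed versus Single-Node Systems](/tw/ch1#sec_introduction_distributed), [Replication](/tw/ch6#ch_replication)
- sharding and, [Sharding](/tw/ch7#ch_sharding)
- single-leader, [Single-Leader Replication](/tw/ch6#sec_replication_leader)-[Logical (row-based) log replication](/tw/ch6#logical-row-based-log-replication)
- failover, [Leader failure: Failover](/tw/ch6#leader-failure-failover)
- implementation of replication logs, [Implementation of Replication Logs](/tw/ch6#sec_replication_implementation)-[Logical (row-based) log replication](/tw/ch6#logical-row-based-log-replication)
- relation to consensus, [From single-leader replication to consensus](/tw/ch10#from-single-leader-replication-to-consensus), [Pros and cons of consensus](/tw/ch10#pros-and-cons-of-consensus)
- setting up new followers, [Setting Up New Followers](/tw/ch6#sec_replication_new_replica)
- synchronous versus asynchronous, [Synchronous versus Asynchronous Replication](/tw/ch6#sec_replication_sync_async)
- state machine replication, [Statement-based replication](/tw/ch6#statement-based-replication), [Pros and cons of stored procedures](/tw/ch8#sec_transactions_stored_proc_tradeoffs), [Using shared logs](/tw/ch10#sec_consistency_smr), [Databases and Streams](/tw/ch12#sec_stream_databases)
- event sourcing, [Event Sourcing and CQRS](/tw/ch3#sec_datamodels_events)
- reliance on determinism, [Deterministic simulation testing](/tw/ch9#deterministic-simulation-testing)
- using consensus, [Pros and cons of consensus](/tw/ch10#pros-and-cons-of-consensus)
- using erasure coding, [Distributed Filesystems](/tw/ch11#sec_batch_dfs)
- using object storage, [Setting Up New Followers](/tw/ch6#sec_replication_new_replica)
- versus backups, [Replication](/tw/ch6#ch_replication)
- with heterogeneous data systems, [Keeping Systems in Sync](/tw/ch12#sec_stream_sync)
- replication logs (see logs)
- representations of data (see data models)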
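- reprocessing data, [Reprocessing data for application evolution](/tw/ch13#sec_future_reprocessing), [Unifying batch and stream processing](/tw/ch13#id338)
- (see also evolvability)
- from log-based messaging, [Replaying old messages](/tw/ch12#sec_stream_replay)
- request hedging, [Single-Leader versus Leaderless Replication Performance](/tw/ch6#sec_replication_leaderless_perf)
- request identifiers, [Operation identifiers](/tw/ch13#id355), [Multi-shard request processing](/tw/ch13#id360)
- request routing, [Request Routing](/tw/ch7#sec_sharding_routing)
- approaches to, [Request Routing](/tw/ch7#sec_sharding_routing)
- data residency laws, [Distributed versus Single-Node Systems](/tw/ch1#sec_introduction_distributed), [Sharding for Multitenancy](/tw/ch7#sec_sharding_multitenancy)
- resilient systems, [Reliability and Fault Tolerance](/tw/ch2#sec_introduction_reliability)
- (see also fault tolerance)
- resource isolation, [Cloud Computing versus Supercomputing](/tw/ch1#id17), [Sharding for Multitenancy](/tw/ch7#sec_sharding_multitenancy)
- resource limits, [Operations in the Cloud Era](/tw/ch1#sec_introduction_operations)
- response time
- as performance metric, [Describing Performance](/tw/ch2#sec_introduction_percentiles), [Batch Processing](/tw/ch11#ch_batch)
- guarantees on, [Response time guarantees](/tw/ch9#sec_distributed_clocks_realtime)
- impact on users, [Average, Median, and Percentiles](/tw/ch2#id24)
- in replicated systems, [Single-Leader versus Leaderless Replication Performance](/tw/ch6#sec_replication_leaderless_perf)
- latency versus, [Latency and Response Time](/tw/ch2#id23)
- mean and percentiles, [Average, Median, and Percentiles](/tw/ch2#id24)
- user experience, [Average, Median, and Percentiles](/tw/ch2#id24)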
- responsibility and accountability, [Responsibility and accountability](/tw/ch14#id371)
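- REST (Representational State Transfer), [Web services](/tw/ch5#sec_web_services)
- (see also services)
- Restate (workflow engine), [Durable Execution and Workflows](/tw/ch5#sec_encoding_dataflow_workflows)
- RethinkDB (database)
- join support, [Convergence of document and relational databases](/tw/ch3#convergence-of-document-and-relational-databases)
- key-range sharding, [Sharding by Key Range](/tw/ch7#sec_sharding_key_range)
- retry storms, [Describing Performance](/tw/ch2#sec_introduction_percentiles), [Software faults](/tw/ch2#software-faults)
- reverse ETL, [Beyond the data lake](/tw/ch1#beyond-the-data-lake)
- Riak (database)
- CRDT support, [CRDTs and Operational Transformation](/tw/ch6#sec_replication_crdts), [Detecting Concurrent Writes](/tw/ch6#sec_replication_concurrent)
- dotted version vectors, [Version vectors](/tw/ch6#version-vectors)
- gossip protocol, [Request Routing](/tw/ch7#sec_sharding_routing)
- hash sharding, [Fixed number of shards](/tw/ch7#fixed-number-of-shards)
- leaderless replication, [Leaderless Replication](/tw/ch6#sec_replication_leaderless)
- linearizability, lack of, [Linearizability and quorums](/tw/ch10#sec_consistency_quorum_linearizable)
- multi-region support, [Multi-region operation](/tw/ch6#multi-region-operation)
- rebalancing, [Operations: Automatic or Manual Rebalancing](/tw/ch7#sec_sharding_operations)
- secondary indexes, [Local Secondary Indexes](/tw/ch7#id166)
- sloppy quorums, [Single-Leader versus Leaderless Replication Performance](/tw/ch6#sec_replication_leaderless_perf)
- vnodes (sharding), [Sharding](/tw/ch7#ch_sharding)
- ring buffers, [Disk space usage](/tw/ch12#sec_stream_disk_usage)
- RisingWave (database)
- incremental view maintenance, [Maintaining materialized views](/tw/ch12#sec_stream_mat_view)
- rockets, [Byzantine Faults](/tw/ch9#sec_distributed_byzantine)
- RocksDB (storage engine), [Constructing and merging SSTables](/tw/ch4#constructing-and-merging-sstables)
- as embedded storage engine, [Compaction strategies](/tw/ch4#sec_storage_lsm_compaction)
- leveled compaction, [Compaction strategies](/tw/ch4#sec_storage_lsm_compaction)
- serving derived data, [Serving Derived Data](/tw/ch11#sec_batch_serving_derived)
- rollbacks (transactions), [Transactions](/tw/ch8#ch_transactions)
- rolling upgrades, [Tolerating hardware faults through redundancy](/tw/ch2#tolerating-hardware-faults-through-redundancy), [Encoding and Evolution](/tw/ch5#ch_encoding), [Faults and Partial Failures](/tw/ch9#sec_distributed_partial_failure)
- in multitenant systems, [Sharding for Multitenancy](/tw/ch7#sec_sharding_multitenancy)
- routing (see request routing)
- row-based replication, [Logical (row-based) log replication](/tw/ch6#logical-row-based-log-replication)
- row-oriented storage, [Column-Oriented Storage](/tw/ch4#sec_storage_column)
- rowhammer (memory corruption), [Hardware and Software Faults](/tw/ch2#sec_introduction_hardware_faults)
- RPCs (see remote procedure calls)
- rules (Datalog), [Datalog: Recursive Relational Queries](/tw/ch3#id62)
- Rust (programming language)
- memory management, [Limiting the impact of garbage collection](/tw/ch9#sec_distributed_gc_impact)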
### S
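- S3 (object storage), [Layering of cloud services](/tw/ch1#layering-of-cloud-services), [Setting Up New Followers](/tw/ch6#sec_replication_new_replica), [Batch Processing](/tw/ch11#ch_batch), [Distributed Filesystems](/tw/ch11#sec_batch_dfs), [Object Stores](/tw/ch11#id277)
- checking data integrity, [Don't just blindly trust what they promise](/tw/ch13#id364)
- conditional writes, [Fencing off zombies and delayed requests](/tw/ch9#sec_distributed_fencing_tokens)
- object size, [Separation of storage and compute](/tw/ch1#sec_introduction_storage_compute)
- S3 Express One Zone, [Object Stores](/tw/ch11#id277)
- use in MapReduce, [MapReduce](/tw/ch11#sec_batch_mapreduce)
- workflow example, [Scheduling Workflows](/tw/ch11#sec_batch_workflows)
- SaaS (see software as a service (SaaS))
- safety and liveness properties, [Safety and liveness](/tw/ch9#sec_distributed_safety_liveness)
- in consensus algorithms, [Single-value consensus](/tw/ch10#single-value-consensus)
- in transactions, [Transactions](/tw/ch8#ch_transactions)
- sagas (see compensating transactions)
- Samza (stream processor), [Stream analytics](/tw/ch12#id318)
- SAP HANA (database), [Data Storage for Analytics](/tw/ch4#sec_storage_analytics)
- scalability, [Scalability](/tw/ch2#sec_introduction_scalability)-[Principles for Scalability](/tw/ch2#id35), [A Philosophy of Streaming Systems](/tw/ch13#ch_philosophy)
- autoscaling, [Operations: Automatic or Manual Rebalancing](/tw/ch7#sec_sharding_operations)
- by sharding, [Pros and Cons of Sharding](/tw/ch7#sec_sharding_reasons)
- describing load, [Describing Load](/tw/ch2#id33)
- describing performance, [Describing Performance](/tw/ch2#sec_introduction_percentiles)
- linear, [Describing Load](/tw/ch2#id33)
- principles, [Principles for Scalability](/tw/ch2#id35)
- replication and, [Problems with Replication Lag](/tw/ch6#sec_replication_lag)
- scaling up versus scaling out, [Shared-Memory, Shared-Disk, and Shared-Nothing Architecture](/tw/ch2#sec_introduction_shared_nothing)
- scaling out, [Shared-Memory, Shared-Disk, and Shared-Nothing Architecture](/tw/ch2#sec_introduction_shared_nothing)
- (see also shared-nothing architecture)
- by sharding, [Pros and Cons of Sharding](/tw/ch7#sec_sharding_reasons)
- scaling up, [Shared-Memory, Shared-Disk, and Shared-Nothing Architecture](/tw/ch2#sec_introduction_shared_nothing)
- slowly changing dimensions, [Time-dependence of joins](/tw/ch12#sec_stream_join_time)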
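- scheduling
- algorithms, [Resource Allocation](/tw/ch11#id279)
- batch jobs, [Distributed Job Orchestration](/tw/ch11#id278)-[Scheduling Workflows](/tw/ch11#sec_batch_workflows)
- gang scheduling, [Resource Allocation](/tw/ch11#id279)
- schema-on-read, [Schema flexibility in the document model](/tw/ch3#sec_datamodels_schema_flexibility)
- comparison to evolvable schema, [The Merits of Schemas](/tw/ch5#sec_encoding_schemas)
- schema-on-write, [Schema flexibility in the document model](/tw/ch3#sec_datamodels_schema_flexibility)
- schemaless databases (see schema-on-read)
- schemas, [Glossary](/tw/glossary)
- Avro, [Avro](/tw/ch5#sec_encoding_avro)-[Dynamically generated schemas](/tw/ch5#dynamically-generated-schemas)
- reader determining writer's schema, [But what is the writer's schema?](/tw/ch5#but-what-is-the-writers-schema)
- schema evolution, [The writer's schema and the reader's schema](/tw/ch5#the-writers-schema-and-the-readers-schema)
- dynamically generated, [Dynamically generated schemas](/tw/ch5#dynamically-generated-schemas)
- changes to, [Reprocessing data for application evolution](/tw/ch13#sec_future_reprocessing)
- affecting application code, [Encoding and Evolution](/tw/ch5#ch_encoding)
- compatibility checking, [But what is the writer's schema?](/tw/ch5#but-what-is-the-writers-schema)
- in databases, [Dataflow Through Databases](/tw/ch5#sec_encoding_dataflow_db)-[Archival storage](/tw/ch5#archival-storage)
- in service calls, [Data encoding and evolution for RPC](/tw/ch5#data-encoding-and-evolution-for-rpc)
- flexibility in document model, [Schema flexibility in the document model](/tw/ch3#sec_datamodels_schema_flexibility)
- for analytics, [Stars and Snowflakes: Schemas for Analytics](/tw/ch3#sec_datamodels_analytics)
- for JSON and XML, [JSON, XML, and Binary Variants](/tw/ch5#sec_encoding_json), [JSON Schema](/tw/ch5#json-schema)
- generation and migration using ORMs, [Object-relational mapping (ORM)](/tw/ch3#object-relational-mapping-orm)
- merits of, [The Merits of Schemas](/tw/ch5#sec_encoding_schemas)
- migration, [Schema flexibility in the document model](/tw/ch3#sec_datamodels_schema_flexibility)
- Protocol Buffers, [Protocol Buffers](/tw/ch5#sec_encoding_protobuf)-[Field tags and schema evolution](/tw/ch5#field-tags-and-schema-evolution)
- schema evolution, [Field tags and schema evolution](/tw/ch5#field-tags-and-schema-evolution)
- schema migration on railways, [Reprocessing data for application evolution](/tw/ch13#sec_future_reprocessing)
- traditional approach to design, fallacy in, [Deriving several views from the same event log](/tw/ch12#sec_stream_deriving_views)
- scientific computing, [Cloud Computing versus Supercomputing](/tw/ch1#id17)
- scikit-learn (Python library), [From data warehouse to data lake](/tw/ch1#from-data-warehouse-to-data-lake)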
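- ScyllaDB (database)
- cluster metadata, [Request Routing](/tw/ch7#sec_sharding_routing)
- consistency level ANY, [Single-Leader versus Leaderless Replication Performance](/tw/ch6#sec_replication_leaderless_perf)
- hash sharding, [Sharding by Hash of Key](/tw/ch7#sec_sharding_hash), [Sharding by hash range](/tw/ch7#sharding-by-hash-range)
- last-write-wins conflict resolution, [Detecting Concurrent Writes](/tw/ch6#sec_replication_concurrent)
- leaderless replication, [Leaderless Replication](/tw/ch6#sec_replication_leaderless)
- lightweight transactions, [Single-object writes](/tw/ch8#sec_transactions_single_object)
- linearizability, lack of, [Implementing Linearizable Systems](/tw/ch10#sec_consistency_implementing_linearizable)
- log-structured storage, [Constructing and merging SSTables](/tw/ch4#constructing-and-merging-sstables)
- multi-region support, [Multi-region operation](/tw/ch6#multi-region-operation)
- use of clocks, [Limitations of Quorum Consistency](/tw/ch6#sec_replication_quorum_limitations), [Timestamps for ordering events](/tw/ch9#sec_distributed_lww)
- vnodes (sharding), [Sharding](/tw/ch7#ch_sharding)
- search engines (see full-text search)
- searches on streams, [Search on streams](/tw/ch12#id320)
- secondaries (see leader-based replication)
- secondary indexes, [Multi-column and secondary indexes](/tw/ch4#sec_storage_index_multicolumn), [Glossary](/tw/glossary)
- for many-to-many relationships, [Many-to-One and Many-to-Many Relationships](/tw/ch3#sec_datamodels_many_to_many)
- problems with dual writes, [Keeping Systems in Sync](/tw/ch12#sec_stream_sync), [Reasoning about dataflows](/tw/ch13#id443)
- sharding, [Sharding and Secondary Indexes](/tw/ch7#sec_sharding_secondary_indexes)-[Global Secondary Indexes](/tw/ch7#id167), [Summary](/tw/ch7#summary)
- global, [Global Secondary Indexes](/tw/ch7#id167)
- index maintenance, [Maintaining derived state](/tw/ch13#id446)
- local, [Local Secondary Indexes](/tw/ch7#id166)
- updating, transaction isolation and, [The need for multi-object transactions](/tw/ch8#sec_transactions_need)
- secondary sorts, [JOIN and GROUP BY](/tw/ch11#sec_batch_join)
- sed (Unix tool), [Simple Log Analysis](/tw/ch11#sec_batch_log_analysis)
- self-hosting, [Cloud versus Self-Hosting](/tw/ch1#sec_introduction_cloud)
- data warehouses, [Cloud Data Warehouses](/tw/ch4#sec_cloud_data_warehouses)
- self-joins, [Summary](/tw/ch12#id332)
- self-validating systems, [Don't just blindly trust what they promise](/tw/ch13#id364)
- semantic search, [Vector Embeddings](/tw/ch4#id92)
- semantic similarity, [Vector Embeddings](/tw/ch4#id92)
- semantic web, [Triple-Stores and SPARQL](/tw/ch3#id59)
- semi-synchronous replication, [Synchronous versus Asynchronous Replication](/tw/ch6#sec_replication_sync_async)
- sequential writes (access pattern), [Sequential versus random writes](/tw/ch4#sidebar_sequential)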
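- serializability, [Isolation](/tw/ch8#sec_transactions_acid_isolation), [Weak Isolation Levels](/tw/ch8#sec_transactions_isolation_levels), [Serializability](/tw/ch8#sec_transactions_serializability)-[Performance of serializable snapshot isolation](/tw/ch8#performance-of-serializable-snapshot-isolation), [Glossary](/tw/glossary)
- linearizability versus, [What Makes a System Linearizable?](/tw/ch10#sec_consistency_lin_definition)
- pessimistic versus optimistic concurrency control, [Pessimistic versus optimistic concurrency control](/tw/ch8#pessimistic-versus-optimistic-concurrency-control)
- serial execution, [Actual Serial Execution](/tw/ch8#sec_transactions_serial)-[Summary of serial execution](/tw/ch8#summary-of-serial-execution)
- sharding, [Sharding](/tw/ch8#sharding)
- using stored procedures, [Encapsulating transactions in stored procedures](/tw/ch8#encapsulating-transactions-in-stored-procedures), [Using shared logs](/tw/ch10#sec_consistency_smr)
- serializable snapshot isolation, [Serializable Snapshot Isolation (SSI)](/tw/ch8#sec_transactions_ssi)-[Performance of serializable snapshot isolation](/tw/ch8#performance-of-serializable-snapshot-isolation)
- detecting stale MVCC reads, [Detecting stale MVCC reads](/tw/ch8#detecting-stale-mvcc-reads)
- detecting writes that affect prior reads, [Detecting writes that affect prior reads](/tw/ch8#sec_detecting_writes_affect_reads)
- distributed execution, [Performance of serializable snapshot isolation](/tw/ch8#performance-of-serializable-snapshot-isolation), [Database-internal Distributed Transactions](/tw/ch8#sec_transactions_internal)
- performance of SSI, [Performance of serializable snapshot isolation](/tw/ch8#performance-of-serializable-snapshot-isolation)
- preventing write skew, [Decisions based on an outdated premise](/tw/ch8#decisions-based-on-an-outdated-premise)-[Detecting writes that affect prior reads](/tw/ch8#sec_detecting_writes_affect_reads)
- strict serializability, [What Makes a System Linearizable?](/tw/ch10#sec_consistency_lin_definition)
- timeliness versus integrity, [Timeliness and Integrity](/tw/ch13#sec_future_integrity)
- two-phase locking, [Two-Phase Locking (2PL)](/tw/ch8#sec_transactions_2pl)-[Index-range locks](/tw/ch8#sec_transactions_2pl_range)
- index-range locking, [Index-range locks](/tw/ch8#sec_transactions_2pl_range)
- performance, [Performance of two-phase locking](/tw/ch8#performance-of-two-phase-locking)
- Serializable, [Language-Specific Formats](/tw/ch5#id96)
- serialization, [Formats for Encoding Data](/tw/ch5#sec_encoding_formats)
- (see also encoding)
- serverless, [Microservices and Serverless](/tw/ch1#sec_introduction_microservices)
- service discovery, [Load balancers, service discovery, and service meshes](/tw/ch5#sec_encoding_service_discovery), [Request Routing](/tw/ch7#sec_sharding_routing), [Service discovery](/tw/ch10#service-discovery)
- registration, [Load balancers, service discovery, and service meshes](/tw/ch5#sec_encoding_service_discovery)
- using DNS, [Load balancers, service discovery, and service meshes](/tw/ch5#sec_encoding_service_discovery), [Request Routing](/tw/ch7#sec_sharding_routing), [Service discovery](/tw/ch10#service-discovery)
- service level agreements (SLAs), [Use of Response Time Metrics](/tw/ch2#sec_introduction_slo_sla), [Describing Load](/tw/ch2#id33)
- service mesh, [Load balancers, service discovery, and service meshes](/tw/ch5#sec_encoding_service_discovery)
- Service Organization Control (SOC), [Data Systems, Law, and Society](/tw/ch1#sec_introduction_compliance)
- service time, [Latency and Response Time](/tw/ch2#id23)
- service-oriented architecture, [Microservices and Serverless](/tw/ch1#sec_introduction_microservices)
- (see also services)
- services, [Dataflow Through Services: REST and RPC](/tw/ch5#sec_encoding_dataflow_rpc)-[Data encoding and evolution for RPC](/tw/ch5#data-encoding-and-evolution-for-rpc)
- microservices, [Microservices and Serverless](/tw/ch1#sec_introduction_microservices)
- causal dependencies across services, [The limits of total ordering](/tw/ch13#id335)
- loose coupling, [Making unbundling work](/tw/ch13#sec_future_unbundling_favor)
- relation to batch/stream processors, [Batch Processing](/tw/ch11#ch_batch), [Stream processors and services](/tw/ch13#id345)
- remote procedure calls (RPCs), [The problems with remote procedure calls (RPCs)](/tw/ch5#sec_problems_with_rpc)-[Data encoding and evolution for RPC](/tw/ch5#data-encoding-and-evolution-for-rpc)
- problems with, [The problems with remote procedure calls (RPCs)](/tw/ch5#sec_problems_with_rpc)
- similarity to databases, [Dataflow Through Services: REST and RPC](/tw/ch5#sec_encoding_dataflow_rpc)
- web services, [Web services](/tw/ch5#sec_web_services)
- session windows (stream processing), [Types of windows](/tw/ch12#id324)
- (see also windows)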
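- sharding, [Sharding](/tw/ch7#ch_sharding)-[Summary](/tw/ch7#summary), [Glossary](/tw/glossary)
- and consensus, [Using shared logs](/tw/ch10#sec_consistency_smr)
- and replication, [Sharding](/tw/ch7#ch_sharding)
- distributed transactions, [Distributed Transactions](/tw/ch8#sec_transactions_distributed)
- hot shards, [Sharding of Key-Value Data](/tw/ch7#sec_sharding_key_value)
- in batch processing, [Batch Processing](/tw/ch11#ch_batch)
- key-range splitting, [Rebalancing key-range sharded data](/tw/ch7#rebalancing-key-range-sharded-data)
- multi-shard operations, [Multi-shard data processing](/tw/ch13#sec_future_unbundled_multi_shard)
- enforcing constraints, [Multi-shard request processing](/tw/ch13#id360)
- secondary index maintenance, [Maintaining derived state](/tw/ch13#id446)
- of key-value data, [Sharding of Key-Value Data](/tw/ch7#sec_sharding_key_value)-[Skewed Workloads and Relieving Hot Spots](/tw/ch7#sec_sharding_skew)
- by key range, [Sharding by Key Range](/tw/ch7#sec_sharding_key_range)
- skew and hot spots, [Skewed Workloads and Relieving Hot Spots](/tw/ch7#sec_sharding_skew)
- etymology, [Sharding](/tw/ch7#ch_sharding)
- partition key, [Pros and Cons of Sharding](/tw/ch7#sec_sharding_reasons), [Sharding of Key-Value Data](/tw/ch7#sec_sharding_key_value)
- rebalancing
- key-range sharded data, [Rebalancing key-range sharded data](/tw/ch7#rebalancing-key-range-sharded-data)
- problems with rebalancing, [Rebalancing key-range sharded data](/tw/ch7#rebalancing-key-range-sharded-data)-[Operations: Automatic or Manual Rebalancing](/tw/ch7#sec_sharding_operations)
- automatic or manual rebalancing, [Operations: Automatic or Manual Rebalancing](/tw/ch7#sec_sharding_operations)
- problems with hash mod N, [Hash modulo number of nodes](/tw/ch7#hash-modulo-number-of-nodes)
- using fixed number of shards, [Fixed number of shards](/tw/ch7#fixed-number-of-shards)
- using N nodes, [Sharding by hash range](/tw/ch7#sharding-by-hash-range)
- request routing, [Request Routing](/tw/ch7#sec_sharding_routing)
- secondary indexes, [Sharding and Secondary Indexes](/tw/ch7#sec_sharding_secondary_indexes)-[Global Secondary Indexes](/tw/ch7#id167)
- global, [Global Secondary Indexes](/tw/ch7#id167)
- local, [Local Secondary Indexes](/tw/ch7#id166)
- serial execution of transactions and, [Sharding](/tw/ch8#sharding)
- sorting sharded data, [Shuffling Data](/tw/ch11#sec_shuffle)
- shared logs, [Consensus in Practice](/tw/ch10#sec_consistency_total_order)-[Pros and cons of consensus](/tw/ch10#pros-and-cons-of-consensus), [The limits of total ordering](/tw/ch13#id335), [Uniqueness in log-based messaging](/tw/ch13#sec_future_uniqueness_log)
- algorithms, [Consensus in Practice](/tw/ch10#sec_consistency_total_order)
- for event sourcing, [Event Sourcing and CQRS](/tw/ch3#sec_datamodels_events)
- for messaging, [Log-based Message Brokers](/tw/ch12#sec_stream_log)-[Replaying old messages](/tw/ch12#sec_stream_replay)
- relation to consensus, [Shared logs as consensus](/tw/ch10#sec_consistency_shared_logs)
- using, [Using shared logs](/tw/ch10#sec_consistency_smr)
- shared mode, [Implementation of two-phase locking](/tw/ch8#implementation-of-two-phase-locking)
- shared-disk architecture, [Shared-Memory, Shared-Disk, and Shared-Nothing Architecture](/tw/ch2#sec_introduction_shared_nothing), [Distributed Filesystems](/tw/ch11#sec_batch_dfs)
- shared-memory architecture, [Shared-Memory, Shared-Disk, and Shared-Nothing Architecture](/tw/ch2#sec_introduction_shared_nothing)
- shared-nothing architecture, [Shared-Memory, Shared-Disk, and Shared-Nothing Architecture](/tw/ch2#sec_introduction_shared_nothing), [Glossary](/tw/glossary)
- distributed filesystems, [Distributed Filesystems](/tw/ch11#sec_batch_dfs)
- (see also distributed filesystems)
- use of network, [Unreliable Networks](/tw/ch9#sec_distributed_networks)
- sharks
- biting undersea cables, [Network Faults in Practice](/tw/ch9#sec_distributed_network_faults)
- counting (example), [Query languages for documents](/tw/ch3#query-languages-for-documents)
- shredding (deletion) (see crypto-shredding)
- shredding (column encoding), [Column-Oriented Storage](/tw/ch4#sec_storage_column)
- shredding (relational model), [When to Use Which Model](/tw/ch3#sec_datamodels_document_summary)
- shuffle, [Shuffling Data](/tw/ch11#sec_shuffle)
- siblings, [Manual conflict resolution](/tw/ch6#manual-conflict-resolution), [Capturing the happens-before relationship](/tw/ch6#capturing-the-happens-before-relationship), [Conflict resolution and replication](/tw/ch8#conflict-resolution-and-replication)
- (see also conflicts)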
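- silos, [Data Warehousing](/tw/ch1#sec_introduction_dwh)
- similarity search
- edit distance, [Full-Text Search](/tw/ch4#sec_storage_full_text)
- genome data, [Summary](/tw/ch3#summary)
- simplicity, [Simplicity: Managing Complexity](/tw/ch2#id38)
- Singer, [Data Warehousing](/tw/ch1#sec_introduction_dwh)
- single-instruction-multi-data (SIMD) instructions, [Query Execution: Compilation and Vectorization](/tw/ch4#sec_storage_vectorized)
- single-leader replication (see leader-based replication)
- single-threaded execution, [Atomic write operations](/tw/ch8#atomic-write-operations), [Actual Serial Execution](/tw/ch8#sec_transactions_serial)
- in stream processing, [Logs compared to traditional messaging](/tw/ch12#sec_stream_logs_vs_messaging), [Concurrency control](/tw/ch12#sec_stream_concurrency), [Uniqueness in log-based messaging](/tw/ch13#sec_future_uniqueness_log)
- SingleStore (database)
- in-memory storage, [Keeping everything in memory](/tw/ch4#sec_storage_inmemory)
- site reliability engineer, [Operations in the Cloud Era](/tw/ch1#sec_introduction_operations)
- size-tiered compaction, [Compaction strategies](/tw/ch4#sec_storage_lsm_compaction), [Disk space usage](/tw/ch4#disk-space-usage)
- skew, [Glossary](/tw/glossary)
- clock skew, [Relying on Synchronized Clocks](/tw/ch9#sec_distributed_clocks_relying)-[Clock readings with a confidence interval](/tw/ch9#clock-readings-with-a-confidence-interval), [Implementing Linearizable Systems](/tw/ch10#sec_consistency_implementing_linearizable)
- in transaction isolation
- read skew, [Snapshot Isolation and Repeatable Read](/tw/ch8#sec_transactions_snapshot_isolation), [Summary](/tw/ch8#summary)
- write skew, [Write Skew and Phantoms](/tw/ch8#sec_transactions_write_skew)-[Materializing conflicts](/tw/ch8#materializing-conflicts), [Decisions based on an outdated premise](/tw/ch8#decisions-based-on-an-outdated-premise)-[Detecting writes that affect prior reads](/tw/ch8#sec_detecting_writes_affect_reads)
- (see also write skew)
- meaning of, [Snapshot Isolation and Repeatable Read](/tw/ch8#sec_transactions_snapshot_isolation)
- unbalanced workload, [Sharding of Key-Value Data](/tw/ch7#sec_sharding_key_value)
- compensating for, [Skewed Workloads and Relieving Hot Spots](/tw/ch7#sec_sharding_skew)
- due to celebrities, [Skewed Workloads and Relieving Hot Spots](/tw/ch7#sec_sharding_skew)
- time-series data, [Sharding by Key Range](/tw/ch7#sec_sharding_key_range)
- skip lists, [Constructing and merging SSTables](/tw/ch4#constructing-and-merging-sstables)
- SLA (see service level agreements)
- Slack (group chat)
- GraphQL example, [GraphQL](/tw/ch3#id63)
- SlateDB (database), [Constructing and merging SSTables](/tw/ch4#constructing-and-merging-sstables), [Setting Up New Followers](/tw/ch6#sec_replication_new_replica)
- sliding windows (stream processing), [Types of windows](/tw/ch12#id324)
- (see also windows)
- sloppy quorums, [Single-Leader versus Leaderless Replication Performance](/tw/ch6#sec_replication_leaderless_perf)
- slowly changing dimensions, [Time-dependence of joins](/tw/ch12#sec_stream_join_time)
- smearing (leap second adjustments), [Clock Synchronization and Accuracy](/tw/ch9#sec_distributed_clock_accuracy)
- snapshots (databases)
- as backups, [Replication](/tw/ch6#ch_replication)
- computing derived data, [Creating an index](/tw/ch13#id340)
- in change data capture, [Initial snapshot](/tw/ch12#sec_stream_cdc_snapshot)
- serializable snapshot isolation, [Serializable Snapshot Isolation (SSI)](/tw/ch8#sec_transactions_ssi)-[Performance of serializable snapshot isolation](/tw/ch8#performance-of-serializable-snapshot-isolation)
- setting up a new replica, [Setting Up New Followers](/tw/ch6#sec_replication_new_replica)
- snapshot isolation and repeatable read, [Snapshot Isolation and Repeatable Read](/tw/ch8#sec_transactions_snapshot_isolation)-[Snapshot isolation, repeatable read, and naming confusion](/tw/ch8#snapshot-isolation-repeatable-read-and-naming-confusion)
- implementing with MVCC, [Multi-version concurrency control (MVCC)](/tw/ch8#sec_transactions_snapshot_impl)
- indexes and MVCC, [Indexes and snapshot isolation](/tw/ch8#indexes-and-snapshot-isolation)
- visibility rules, [Visibility rules for observing a consistent snapshot](/tw/ch8#sec_transactions_mvcc_visibility)
- synchronized clocks for global snapshots, [Synchronized clocks for global snapshots](/tw/ch9#sec_distributed_spanner)
- Snowflake (database), [Cloud-Native System Architecture](/tw/ch1#sec_introduction_cloud_native), [Layering of cloud services](/tw/ch1#layering-of-cloud-services), [Cloud Data Warehouses](/tw/ch4#sec_cloud_data_warehouses), [Batch Processing](/tw/ch11#ch_batch)
- column-oriented storage, [Column-Oriented Storage](/tw/ch4#sec_storage_column)
- handling writes, [Writing to column-oriented storage](/tw/ch4#writing-to-column-oriented-storage)
- sharding and clustering, [Sharding by hash range](/tw/ch7#sharding-by-hash-range)
- Snowpark, [Query languages](/tw/ch11#sec_batch_query_lanauges)
- Snowflake (ID generator), [ID Generators and Logical Clocks](/tw/ch10#sec_consistency_logical)
- snowflake schemas, [Stars and Snowflakes: Schemas for Analytics](/tw/ch3#sec_datamodels_analytics)
- SOAP (web services), [The problems with remote procedure calls (RPCs)](/tw/ch5#sec_problems_with_rpc)
- SOC2 (see Service Organization Control (SOC))
- social graphs, [Graph-Like Data Models](/tw/ch3#sec_datamodels_graph)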
- society
- responsibility towards, [Data Systems, Law, and Society](/tw/ch1#sec_introduction_compliance), [Legislation and Self-Regulation](/tw/ch14#sec_future_legislation)
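- sociotechnical systems, [Humans and Reliability](/tw/ch2#id31)
- software as a service (SaaS), [Trade-offs in Data Systems Architecture](/tw/ch1#ch_tradeoffs), [Cloud versus Self-Hosting](/tw/ch1#sec_introduction_cloud)
- ETL from, [Data Warehousing](/tw/ch1#sec_introduction_dwh)
- multitenancy, [Sharding for Multitenancy](/tw/ch7#sec_sharding_multitenancy)
- software bugs, [Software faults](/tw/ch2#software-faults)
- maintaining integrity, [Maintaining integrity in the face of software bugs](/tw/ch13#id455)
- solar storms, [Hardware and Software Faults](/tw/ch2#sec_introduction_hardware_faults)
- solid state drives (SSDs)
- access patterns, [Sequential versus random writes](/tw/ch4#sidebar_sequential)
- compared to object storage, [Setting Up New Followers](/tw/ch6#sec_replication_new_replica)
- detecting corruption, [The End-to-End Argument](/tw/ch13#sec_future_e2e_argument), [Don't just blindly trust what they promise](/tw/ch13#id364)
- failure rate, [Hardware and Software Faults](/tw/ch2#sec_introduction_hardware_faults)
- faults in, [Durability](/tw/ch8#durability)
- firmware bugs, [Software faults](/tw/ch2#software-faults)
- read throughput, [Read performance](/tw/ch4#read-performance)
- sequential versus random writes, [Sequential versus random writes](/tw/ch4#sidebar_sequential)
- Solr (search server)
- local secondary indexes, [Local Secondary Indexes](/tw/ch7#id166)
- request routing, [Request Routing](/tw/ch7#sec_sharding_routing)
- use of Lucene, [Full-Text Search](/tw/ch4#sec_storage_full_text)
- sort (Unix tool), [Simple Log Analysis](/tw/ch11#sec_batch_log_analysis), [Sorting versus in-memory aggregation](/tw/ch11#id275), [Distributed Job Orchestration](/tw/ch11#id278)
- sort-merge joins (MapReduce), [JOIN and GROUP BY](/tw/ch11#sec_batch_join)
- Sorted String Tables (see SSTables)
- sorting
- sort order in column storage, [Sort order in column storage](/tw/ch4#sort-order-in-column-storage)
- source of truth (see systems of record)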
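- Spanner (database)
- consistency model, [What Makes a System Linearizable?](/tw/ch10#sec_consistency_lin_definition)
- data locality, [Data locality for reads and writes](/tw/ch3#sec_datamodels_document_locality)
- in the cloud, [Cloud-Native System Architecture](/tw/ch1#sec_introduction_cloud_native)
- snapshot isolation using clocks, [Synchronized clocks for global snapshots](/tw/ch9#sec_distributed_spanner)
- transactions, [What Exactly Is a Transaction?](/tw/ch8#sec_transactions_overview), [Database-internal Distributed Transactions](/tw/ch8#sec_transactions_internal)
- TrueTime API, [Clock readings with a confidence interval](/tw/ch9#clock-readings-with-a-confidence-interval)
- Spark (processing framework), [From data warehouse to data lake](/tw/ch1#from-data-warehouse-to-data-lake), [Cloud-Native System Architecture](/tw/ch1#sec_introduction_cloud_native), [Batch Processing](/tw/ch11#ch_batch), [Dataflow Engines](/tw/ch11#sec_batch_dataflow)
- cost efficiency, [Query languages](/tw/ch11#sec_batch_query_lanauges)
- DataFrames, [DataFrames, Matrices, and Arrays](/tw/ch3#sec_datamodels_dataframes), [DataFrames](/tw/ch11#id287)
- fault tolerance, [Handling Faults](/tw/ch11#id281)
- for data warehouses, [Cloud Data Warehouses](/tw/ch4#sec_cloud_data_warehouses)
- high availability using ZooKeeper, [Coordination Services](/tw/ch10#sec_consistency_coordination)
- MLlib, [Machine Learning](/tw/ch11#id290)
- query optimizer, [Query languages](/tw/ch11#sec_batch_query_lanauges)
- shuffling data, [Shuffling Data](/tw/ch11#sec_shuffle)
- Spark Streaming, [Stream analytics](/tw/ch12#id318)
- microbatching, [Microbatching and checkpointing](/tw/ch12#id329)
- streaming SQL support, [Complex event processing](/tw/ch12#id317)
- use for ETL, [Extract-Transform-Load (ETL)](/tw/ch11#sec_batch_etl_usage)
- SPARQL (query language), [The SPARQL query language](/tw/ch3#the-sparql-query-language)
- sparse index, [The SSTable file format](/tw/ch4#the-sstable-file-format)
- sparse matrices, [DataFrames, Matrices, and Arrays](/tw/ch3#sec_datamodels_dataframes)
- split brain, [Leader failure: Failover](/tw/ch6#leader-failure-failover), [Request Routing](/tw/ch7#sec_sharding_routing), [Glossary](/tw/glossary)
- enforcing constraints, [Uniqueness constraints require consensus](/tw/ch13#id452)
- in consensus algorithms, [Consensus](/tw/ch10#sec_consistency_consensus), [From single-leader replication to consensus](/tw/ch10#from-single-leader-replication-to-consensus)
- preventing, [Implementing Linearizable Systems](/tw/ch10#sec_consistency_implementing_linearizable)
- using fencing tokens to avoid, [Fencing off zombies and delayed requests](/tw/ch9#sec_distributed_fencing_tokens)-[Fencing with multiple replicas](/tw/ch9#fencing-with-multiple-replicas)
- spot instances, [Handling Faults](/tw/ch11#id281)
- spreadsheets, [Trade-offs in Data Systems Architecture](/tw/ch1#ch_tradeoffs), [DataFrames, Matrices, and Arrays](/tw/ch3#sec_datamodels_dataframes)
- dataflow programming, [Designing Applications Around Dataflow](/tw/ch13#sec_future_dataflow)
- pivot tables, [DataFrames, Matrices, and Arrays](/tw/ch3#sec_datamodels_dataframes)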
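- SQL (Structured Query Language), [Simplicity: Managing Complexity](/tw/ch2#id38), [Relational Model versus Document Model](/tw/ch3#sec_datamodels_history), [Cloud Data Warehouses](/tw/ch4#sec_cloud_data_warehouses)
- for analytics, [Data Warehousing](/tw/ch1#sec_introduction_dwh), [Column-Oriented Storage](/tw/ch4#sec_storage_column)
- graph queries in, [Graph Queries in SQL](/tw/ch3#id58)
- isolation levels standard, issues with, [Snapshot isolation, repeatable read, and naming confusion](/tw/ch8#snapshot-isolation-repeatable-read-and-naming-confusion)
- joins, [Normalization, Denormalization, and Joins](/tw/ch3#sec_datamodels_normalization)
- résumé (example), [The document data model for one-to-many relationships](/tw/ch3#the-document-data-model-for-one-to-many-relationships)
- social network home timelines (example), [Representing Users, Posts, and Follows](/tw/ch2#id20)
- SQL injection vulnerability, [Byzantine Faults](/tw/ch9#sec_distributed_byzantine)
- statement-based replication, [Statement-based replication](/tw/ch6#statement-based-replication)
- stored procedures, [Pros and cons of stored procedures](/tw/ch8#sec_transactions_stored_proc_tradeoffs)
- support in batch processing frameworks, [Batch Processing](/tw/ch11#ch_batch)
- views, [Datalog: Recursive Relational Queries](/tw/ch3#id62)
- SQL Server (database)
- archiving WAL to object stores, [Setting Up New Followers](/tw/ch6#sec_replication_new_replica)
- change data capture, [Implementing change data capture](/tw/ch12#id307)
- data warehousing support, [Data Storage for Analytics](/tw/ch4#sec_storage_analytics)
- distributed transaction support, [XA transactions](/tw/ch8#xa-transactions)
- leader-based replication, [Single-Leader Replication](/tw/ch6#sec_replication_leader)
- multi-leader replication, [Geographically Distributed Operation](/tw/ch6#sec_replication_multi_dc)
- preventing lost updates, [Automatically detecting lost updates](/tw/ch8#automatically-detecting-lost-updates)
- preventing write skew, [Characterizing write skew](/tw/ch8#characterizing-write-skew), [Implementation of two-phase locking](/tw/ch8#implementation-of-two-phase-locking)
- read committed isolation, [Implementing read committed](/tw/ch8#sec_transactions_read_committed_impl)
- serializable isolation, [Implementation of two-phase locking](/tw/ch8#implementation-of-two-phase-locking)
- snapshot isolation support, [Snapshot Isolation and Repeatable Read](/tw/ch8#sec_transactions_snapshot_isolation)
- T-SQL language, [Pros and cons of stored procedures](/tw/ch8#sec_transactions_stored_proc_tradeoffs)
- SQLite (database), [Problems with Distributed Systems](/tw/ch1#sec_introduction_dist_sys_problems), [Compaction strategies](/tw/ch4#sec_storage_lsm_compaction)
- archiving WAL to object stores, [Setting Up New Followers](/tw/ch6#sec_replication_new_replica)
- SRE (site reliability engineer), [Operations in the Cloud Era](/tw/ch1#sec_introduction_operations)
- SSDs (see solid state drives)
- SSTables (storage format), [The SSTable file format](/tw/ch4#the-sstable-file-format)-[Compaction strategies](/tw/ch4#sec_storage_lsm_compaction)
- constructing and maintaining, [Constructing and merging SSTables](/tw/ch4#constructing-and-merging-sstables)
- making LSM-Tree from, [Constructing and merging SSTables](/tw/ch4#constructing-and-merging-sstables)
- staged rollouts (see rolling upgrades)
- staleness (old data), [Reading Your Own Writes](/tw/ch6#sec_replication_ryw)
- cross-channel timing dependencies, [Cross-channel timing dependencies](/tw/ch10#cross-channel-timing-dependencies)
- in leaderless databases, [Writing to the Database When a Node Is Down](/tw/ch6#id287)
- in multi-version concurrency control, [Detecting stale MVCC reads](/tw/ch8#detecting-stale-mvcc-reads)
- monitoring for, [Monitoring staleness](/tw/ch6#monitoring-staleness)
- of client state, [Pushing state changes to clients](/tw/ch13#id348)
- versus linearizability, [Linearizability](/tw/ch10#sec_consistency_linearizability)
- versus timeliness, [Timeliness and Integrity](/tw/ch13#sec_future_integrity)
- standbys (see leader-based replication)
- star replication topologies, [Multi-leader replication topologies](/tw/ch6#sec_replication_topologies)
- star schemas, [Stars and Snowflakes: Schemas for Analytics](/tw/ch3#sec_datamodels_analytics)
- Star Wars analogy (event time versus processing time), [Event time versus processing time](/tw/ch12#id322)
- starvation (scheduling), [Resource Allocation](/tw/ch11#id279)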
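- state
- derived from log of immutable events, [State, Streams, and Immutability](/tw/ch12#sec_stream_immutability)
- interplay between state changes and application code, [Dataflow: Interplay between state changes and application code](/tw/ch13#id450)
- maintaining derived state, [Maintaining derived state](/tw/ch13#id446)
- maintenance by stream processor in stream-stream joins, [Stream-stream join (window join)](/tw/ch12#id440)
- observing derived state, [Observing Derived State](/tw/ch13#sec_future_observing)-[Multi-shard data processing](/tw/ch13#sec_future_unbundled_multi_shard)
- rebuilding after stream processor failure, [Rebuilding state after a failure](/tw/ch12#sec_stream_state_fault_tolerance)
- separation of application code and, [Separation of application code and state](/tw/ch13#id344)
- state machine replication, [Statement-based replication](/tw/ch6#statement-based-replication), [Pros and cons of stored procedures](/tw/ch8#sec_transactions_stored_proc_tradeoffs), [Using shared logs](/tw/ch10#sec_consistency_smr), [Databases and Streams](/tw/ch12#sec_stream_databases)
- event sourcing, [Event Sourcing and CQRS](/tw/ch3#sec_datamodels_events)
- reliance on determinism, [Deterministic simulation testing](/tw/ch9#deterministic-simulation-testing)
- stateless systems, [Trade-offs in Data Systems Architecture](/tw/ch1#ch_tradeoffs)
- statement-based replication, [Statement-based replication](/tw/ch6#statement-based-replication)
- reliance on determinism, [Deterministic simulation testing](/tw/ch9#deterministic-simulation-testing)
- statically typed languages
- analogy to schemas, [Schema flexibility in the document model](/tw/ch3#sec_datamodels_schema_flexibility)
- statistical and numerical algorithms, [DataFrames, Matrices, and Arrays](/tw/ch3#sec_datamodels_dataframes)
- StatsD (metrics aggregator), [Direct messaging from producers to consumers](/tw/ch12#id296)
- stock market feeds, [Direct messaging from producers to consumers](/tw/ch12#id296)
- STONITH (Shoot The Other Node In The Head), [Leader failure: Failover](/tw/ch6#leader-failure-failover)
- problems with, [Fencing off zombies and delayed requests](/tw/ch9#sec_distributed_fencing_tokens)
- stop-the-world (see garbage collection)
- storage
- composing data storage technologies, [Composing Data Storage Technologies](/tw/ch13#id447)-[Unbundled versus integrated systems](/tw/ch13#id448)
- Storage Area Network (SAN), [Shared-Memory, Shared-Disk, and Shared-Nothing Architecture](/tw/ch2#sec_introduction_shared_nothing), [Distributed Filesystems](/tw/ch11#sec_batch_dfs)
- storage engines, [Storage and Retrieval](/tw/ch4#ch_storage)-[Summary](/tw/ch4#summary)
- column-oriented, [Column-Oriented Storage](/tw/ch4#sec_storage_column)-[Query Execution: Compilation and Vectorization](/tw/ch4#sec_storage_vectorized)
- column compression, [Column compression](/tw/ch4#sec_storage_column_compression)
- defined, [Column-Oriented Storage](/tw/ch4#sec_storage_column)
- Parquet, [Cloud Data Warehouses](/tw/ch4#sec_cloud_data_warehouses), [Column-Oriented Storage](/tw/ch4#sec_storage_column), [Archival storage](/tw/ch5#archival-storage)
- sort order in, [Sort order in column storage](/tw/ch4#sort-order-in-column-storage)
- wide-column model, [Column compression](/tw/ch4#sec_storage_column_compression)
- writing to, [Writing to column-oriented storage](/tw/ch4#writing-to-column-oriented-storage)
- in-memory storage, [Keeping everything in memory](/tw/ch4#sec_storage_inmemory)
- durability, [Durability](/tw/ch8#durability)
- row-oriented, [Storage and Indexing for OLTP](/tw/ch4#sec_storage_oltp)-[Keeping everything in memory](/tw/ch4#sec_storage_inmemory)
- B-trees, [B-Trees](/tw/ch4#sec_storage_b_trees)-[B-tree variants](/tw/ch4#b-tree-variants)
- comparing B-trees and LSM-trees, [Comparing B-Trees and LSM-Trees](/tw/ch4#sec_storage_btree_lsm_comparison)-[Disk space usage](/tw/ch4#disk-space-usage)
- defined, [Column-Oriented Storage](/tw/ch4#sec_storage_column)
- log-structured, [Log-Structured Storage](/tw/ch4#sec_storage_log_structured)-[Compaction strategies](/tw/ch4#sec_storage_lsm_compaction)
- stored procedures, [Encapsulating transactions in stored procedures](/tw/ch8#encapsulating-transactions-in-stored-procedures)-[Pros and cons of stored procedures](/tw/ch8#sec_transactions_stored_proc_tradeoffs), [Glossary](/tw/glossary)
- and shared logs, [Using shared logs](/tw/ch10#sec_consistency_smr)
- pros and cons of, [Pros and cons of stored procedures](/tw/ch8#sec_transactions_stored_proc_tradeoffs)
- similarity to stream processors, [Application code as a derivation function](/tw/ch13#sec_future_dataflow_derivation)
- Storm (stream processor), [Stream analytics](/tw/ch12#id318)
- distributed RPC, [Event-Driven Architectures and RPC](/tw/ch12#sec_stream_actors_drpc), [Multi-shard data processing](/tw/ch13#sec_future_unbundled_multi_shard)
- Trident state handling, [Idempotence](/tw/ch12#sec_stream_idempotence)
- straggler events, [Handling straggler events](/tw/ch12#id323)
- Stream Control Transmission Protocol (SCTP), [The Limitations of TCP](/tw/ch9#sec_distributed_tcp)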
- 流處理, [流處理](/tw/ch12#sec_stream_processing)-[本章小結](/tw/ch12#id332), [術語表](/tw/glossary)
- 在工作範圍內獲得外部服務, [流表連線(流擴充)](/tw/ch12#sec_stream_table_joins), [微批次與存檔點](/tw/ch12#id329), [冪等性](/tw/ch12#sec_stream_idempotence), [恰好執行一次操作](/tw/ch13#id353)
- 與批次處理相結合, [統一批處理和流處理](/tw/ch13#id338)
- 與批次處理的比較, [流處理](/tw/ch12#sec_stream_processing)
- 複合事件處理, [複合事件處理](/tw/ch12#id317)
- 過失容忍, [容錯](/tw/ch12#sec_stream_fault_tolerance)-[失敗後重建狀態](/tw/ch12#sec_stream_state_fault_tolerance)
- 原子提交, [原子提交再現](/tw/ch12#sec_stream_atomic_commit)
- 冪等性, [冪等性](/tw/ch12#sec_stream_idempotence)
- 微打鬥和檢查站, [微批次與存檔點](/tw/ch12#id329)
- 失敗後重建狀態, [失敗後重建狀態](/tw/ch12#sec_stream_state_fault_tolerance)
- 資料整合, [批處理與流處理](/tw/ch13#sec_future_batch_streaming)-[統一批處理和流處理](/tw/ch13#id338)
- 用於事件原始碼, [事件溯源與 CQRS](/tw/ch3#sec_datamodels_events)
- 保持衍生狀態, [維護派生狀態](/tw/ch13#id446)
- 維持實際意見, [維護物化檢視](/tw/ch12#sec_stream_mat_view)
- messaging systems(見 messaging systems)
- 關於時間的推理, [時間推理](/tw/ch12#sec_stream_time)-[視窗的型別](/tw/ch12#id324)
- 事件時間與處理時間, [事件時間與處理時間](/tw/ch12#id322), [微批次與存檔點](/tw/ch12#id329), [統一批處理和流處理](/tw/ch13#id338)
- 知道視窗何時準備好, [處理滯留事件](/tw/ch12#id323)
- 視窗型別, [視窗的型別](/tw/ch12#id324)
- relation to databases(見 streams)
- 與服務的關係, [流處理器和服務](/tw/ch13#id345)
- 與批次處理的關係, [批處理](/tw/ch11#ch_batch)
- 在流中搜索, [在流上搜索](/tw/ch12#id320)
- 單條執行, [日誌與傳統的訊息傳遞相比](/tw/ch12#sec_stream_logs_vs_messaging), [併發控制](/tw/ch12#sec_stream_concurrency)
- 流式分析, [流分析](/tw/ch12#id318)
- stream joins, [流連線](/tw/ch12#sec_stream_joins)-[連線的時間依賴性](/tw/ch12#sec_stream_join_time)
- stream-stream joins, [流流連線(視窗連線)](/tw/ch12#id440)
- stream-table joins, [流表連線(流擴充)](/tw/ch12#sec_stream_table_joins)
- table-table joins, [表表連線(維護物化檢視)](/tw/ch12#id326)
- time-dependence of, [連線的時間依賴性](/tw/ch12#sec_stream_join_time)
- streams, [流處理](/tw/ch12#ch_stream)-[重播舊訊息](/tw/ch12#sec_stream_replay)
- 端對端,向客戶推進事件, [端到端的事件流](/tw/ch13#id349)
- messaging systems(見 messaging systems)
- processing(見 流處理)
- 與資料庫的關係, [資料庫與流](/tw/ch12#sec_stream_databases)-[不變性的侷限性](/tw/ch12#sec_stream_immutability_limitations)
- (另見 changelogs)
- 變更流的 API 支援, [變更流的 API 支援](/tw/ch12#sec_stream_change_api)
- 資料變更捕獲, [資料變更捕獲](/tw/ch12#sec_stream_cdc)-[變更流的 API 支援](/tw/ch12#sec_stream_change_api)
- derivative of state by time, [狀態、流和不變性](/tw/ch12#sec_stream_immutability)
- 事件溯源, [資料變更捕獲與事件溯源](/tw/ch12#sec_stream_event_sourcing)
- 保持系統同步, [保持系統同步](/tw/ch12#sec_stream_sync)-[保持系統同步](/tw/ch12#sec_stream_sync)
- 不可改變事件哲學, [狀態、流和不變性](/tw/ch12#sec_stream_immutability)-[不變性的侷限性](/tw/ch12#sec_stream_immutability_limitations)
- topics, [傳遞事件流](/tw/ch12#sec_stream_transmit)
- strict serializability, [什麼使系統具有線性一致性?](/tw/ch10#sec_consistency_lin_definition)
- 及時性與完整性, [及時性與完整性](/tw/ch13#sec_future_integrity)
- 條紋(列編碼), [列式儲存](/tw/ch4#sec_storage_column)
- 強一致性(見 線性一致性)
- strong eventual consistency, [自動衝突解決](/tw/ch6#automatic-conflict-resolution)
- strong one-copy serializability, [什麼使系統具有線性一致性?](/tw/ch10#sec_consistency_lin_definition)
- subject, predicate, and object (triples), [三元組儲存與 SPARQL](/tw/ch3#id59)
- 訂閱者, [傳遞事件流](/tw/ch12#sec_stream_transmit)
- (另見 consumers)
- 超級計算機, [雲計算與超級計算](/tw/ch1#id17)
- Superset(資料視覺化軟體), [分析(Analytics)](/tw/ch11#sec_batch_olap)
- surveillance, [監視](/ch14#id374)
- (另見 隱私)
- 壽司原則, [從資料倉庫到資料湖](/tw/ch1#from-data-warehouse-to-data-lake)
- 可持續性, [分散式與單節點系統](/tw/ch1#sec_introduction_distributed)
- Swagger(服務定義格式), [Web 服務](/tw/ch5#sec_web_services)
- swapping to disk(見 virtual memory)
- Swift(程式語言)
- 記憶體管理, [限制垃圾回收的影響](/tw/ch9#sec_distributed_gc_impact)
- 同步引擎, [同步引擎與本地優先軟體](/tw/ch6#sec_replication_offline_clients)-[同步引擎的利弊](/tw/ch6#pros-and-cons-of-sync-engines)
- 例項, [同步引擎的利弊](/tw/ch6#pros-and-cons-of-sync-engines)
- for local-first software, [即時協作、離線優先和本地優先應用](/tw/ch6#real-time-collaboration-offline-first-and-local-first-apps)
- 同步網路, [同步與非同步網路](/tw/ch9#sec_distributed_sync_networks), [術語表](/tw/glossary)
- comparison to asynchronous networks, [同步與非同步網路](/tw/ch9#sec_distributed_sync_networks)
- 系統模型, [系統模型與現實](/tw/ch9#sec_distributed_system_model)
- 同步複製, [同步複製與非同步複製](/tw/ch6#sec_replication_sync_async), [術語表](/tw/glossary)
- 有多個領導, [多主複製](/tw/ch6#sec_replication_multi_leader)
- 系統管理員, [雲時代的運維](/tw/ch1#sec_introduction_operations)
- 系統模型, [知識、真相和謊言](/tw/ch9#sec_distributed_truth), [系統模型與現實](/tw/ch9#sec_distributed_system_model)-[確定性模擬測試](/tw/ch9#deterministic-simulation-testing)
- 假設, [信任但驗證](/tw/ch13#sec_future_verification)
- 演算法的正確性, [定義演算法的正確性](/tw/ch9#defining-the-correctness-of-an-algorithm)
- mapping to the real world, [將系統模型對映到現實世界](/tw/ch9#mapping-system-models-to-the-real-world)
- safety and liveness, [安全性與活性](/tw/ch9#sec_distributed_safety_liveness)
- 記錄系統, [記錄系統與派生資料](/tw/ch1#sec_introduction_derived), [術語表](/tw/glossary)
- 資料變更捕獲, [資料變更捕獲的實現](/tw/ch12#id307), [理解資料流](/tw/ch13#id443)
- 事件日誌, [事件溯源與 CQRS](/tw/ch3#sec_datamodels_events)
- 事件日誌處理為, [狀態、流和不變性](/tw/ch12#sec_stream_immutability)
- 系統思維, [反饋迴路](/ch14#id372)
### T
- t-digest (algorithm), [響應時間指標的應用](/tw/ch2#sec_introduction_slo_sla)
- table-table joins, [表表連線(維護物化檢視)](/tw/ch12#id326)
- Tableau (data visualization software), [事務處理與分析的特徵](/tw/ch1#sec_introduction_oltp), [分析(Analytics)](/tw/ch11#sec_batch_olap)
- tail (Unix tool), [使用日誌進行訊息儲存](/tw/ch12#id300)
- tail latency (see latency)
- tail vertex (property graphs), [屬性圖](/tw/ch3#id56)
- task (workflows)(見 workflow engines)
- TCP (Transmission Control Protocol), [TCP 的侷限性](/tw/ch9#sec_distributed_tcp)
- 比較電路切換, [我們不能簡單地使網路延遲可預測嗎?](/tw/ch9#can-we-not-simply-make-network-delays-predictable)
- comparison to UDP, [網路擁塞和排隊](/tw/ch9#network-congestion-and-queueing)
- 連線失敗, [檢測故障](/tw/ch9#id307)
- 流量控制, [網路擁塞和排隊](/tw/ch9#network-congestion-and-queueing), [訊息傳遞系統](/tw/ch12#sec_stream_messaging)
- packet checksums, [弱形式的謊言](/tw/ch9#weak-forms-of-lying), [端到端原則](/tw/ch13#sec_future_e2e_argument), [信任但驗證](/tw/ch13#sec_future_verification)
- reliability and duplicate suppression, [抑制重複](/tw/ch13#id354)
- retransmission timeouts, [網路擁塞和排隊](/tw/ch9#network-congestion-and-queueing)
- use for transaction sessions, [單物件與多物件操作](/tw/ch8#sec_transactions_multi_object)
- Temporal (workflow engine), [持久化執行與工作流](/tw/ch5#sec_encoding_dataflow_workflows)
- TensorFlow (machine learning library), [機器學習](/tw/ch11#id290)
- Teradata(資料庫), [雲原生系統架構](/tw/ch1#sec_introduction_cloud_native), [雲資料倉庫](/tw/ch4#sec_cloud_data_warehouses)
- term-partitioned indexes(見 global secondary indexes)
- termination (consensus), [單值共識](/tw/ch10#single-value-consensus), [原子提交作為共識](/tw/ch10#atomic-commitment-as-consensus)
- testing, [人類與可靠性](/tw/ch2#id31)
- thrashing (out of memory), [程序暫停](/tw/ch9#sec_distributed_clocks_pauses)
- 執行緒(併發)
- Actor 模型, [分散式 actor 框架](/tw/ch5#distributed-actor-frameworks), [事件驅動架構與 RPC](/tw/ch12#sec_stream_actors_drpc)
- (另見 event-driven architecture)
- 原子操作, [原子性](/tw/ch8#sec_transactions_acid_atomicity)
- 背景執行緒, [構建和合並 SSTable](/tw/ch4#constructing-and-merging-sstables)
- 執行暫停, [我們不能簡單地使網路延遲可預測嗎?](/tw/ch9#can-we-not-simply-make-network-delays-predictable), [程序暫停](/tw/ch9#sec_distributed_clocks_pauses)-[程序暫停](/tw/ch9#sec_distributed_clocks_pauses)
- memory barriers, [線性一致性與網路延遲](/tw/ch10#linearizability-and-network-delays)
- preemption, [程序暫停](/tw/ch9#sec_distributed_clocks_pauses)
- single (see single-threaded execution)
- three-phase commit, [三階段提交](/tw/ch8#three-phase-commit)
- three-way relationships, [屬性圖](/tw/ch3#id56)
- Thrift(資料格式), [Protocol Buffers](/tw/ch5#sec_encoding_protobuf)
- 吞吐量, [描述效能](/tw/ch2#sec_introduction_percentiles), [描述負載](/tw/ch2#id33), [批處理](/tw/ch11#ch_batch)
- TIBCO, [訊息代理](/tw/ch5#message-brokers)
- Enterprise Message Service, [訊息代理與資料庫的對比](/tw/ch12#id297)
- StreamBase (stream analytics), [複合事件處理](/tw/ch12#id317)
- TiDB(資料庫)
- 基於共識的複製, [單主複製](/tw/ch6#sec_replication_leader)
- regions (sharding), [分片](/tw/ch7#ch_sharding)
- request routing, [請求路由](/tw/ch7#sec_sharding_routing)
- serving derived data, [對外提供派生資料](/tw/ch11#sec_batch_serving_derived)
- sharded secondary indexes, [全域性二級索引](/tw/ch7#id167)
- snapshot isolation support, [快照隔離與可重複讀](/tw/ch8#sec_transactions_snapshot_isolation)
- timestamp oracle, [實現線性一致的 ID 生成器](/tw/ch10#implementing-a-linearizable-id-generator)
- 事務, [事務到底是什麼?](/tw/ch8#sec_transactions_overview), [資料庫內部的分散式事務](/tw/ch8#sec_transactions_internal)
- 使用模型檢查, [模型檢查與規範語言](/tw/ch9#model-checking-and-specification-languages)
- 分層儲存, [設定新的副本](/tw/ch6#sec_replication_new_replica), [磁碟空間使用](/tw/ch12#sec_stream_disk_usage)
- TigerBeetle(資料庫), [總結](/tw/ch3#summary)
- 確定性模擬測試, [確定性模擬測試](/tw/ch9#deterministic-simulation-testing)
- TigerGraph(資料庫)
- GSQL language, [SQL 中的圖查詢](/tw/ch3#id58)
- Tigris(物件儲存), [分散式檔案系統](/tw/ch11#sec_batch_dfs)
- TileDB(資料庫), [資料框、矩陣與陣列](/tw/ch3#sec_datamodels_dataframes)
- 時間
- 併發與, ["先發生"關係與併發](/tw/ch6#sec_replication_happens_before)
- 跨渠道時間依賴性, [跨通道時序依賴](/tw/ch10#cross-channel-timing-dependencies)
- 在分散式系統中, [不可靠的時鐘](/tw/ch9#sec_distributed_clocks)-[限制垃圾回收的影響](/tw/ch9#sec_distributed_gc_impact)
- (另見 clocks)
- 時鐘同步和準確性, [時鐘同步和準確性](/tw/ch9#sec_distributed_clock_accuracy)
- 依賴同步時鐘, [對同步時鐘的依賴](/tw/ch9#sec_distributed_clocks_relying)-[用於全域性快照的同步時鐘](/tw/ch9#sec_distributed_spanner)
- 程序暫停, [程序暫停](/tw/ch9#sec_distributed_clocks_pauses)-[限制垃圾回收的影響](/tw/ch9#sec_distributed_gc_impact)
- 流程處理器中的推理, [時間推理](/tw/ch12#sec_stream_time)-[視窗的型別](/tw/ch12#id324)
- 事件時間與處理時間, [事件時間與處理時間](/tw/ch12#id322), [微批次與存檔點](/tw/ch12#id329), [統一批處理和流處理](/tw/ch13#id338)
- 知道視窗何時準備好, [處理滯留事件](/tw/ch12#id323)
- 事件的時間戳, [你用的是誰的時鐘?](/tw/ch12#id438)
- 視窗型別, [視窗的型別](/tw/ch12#id324)
- 分散式系統的系統模型, [系統模型與現實](/tw/ch9#sec_distributed_system_model)
- 串流中的時間依賴, [連線的時間依賴性](/tw/ch12#sec_stream_join_time)
- 時間序列資料
- as DataFrames, [資料框、矩陣與陣列](/tw/ch3#sec_datamodels_dataframes)
- 面向列的儲存, [列式儲存](/tw/ch4#sec_storage_column)
- time-of-day clocks, [日曆時鐘](/tw/ch9#time-of-day-clocks)
- 混合邏輯時鐘, [混合邏輯時鐘](/tw/ch10#hybrid-logical-clocks)
- 及時性, [及時性與完整性](/tw/ch13#sec_future_integrity)
- coordination-avoiding data systems, [無協調資料系統](/tw/ch13#id454)
- 資料流系統的正確性, [資料流系統的正確性](/tw/ch13#id453)
- 超時, [不可靠的網路](/tw/ch9#sec_distributed_networks), [術語表](/tw/glossary)
- 動態配置, [網路擁塞和排隊](/tw/ch9#network-congestion-and-queueing)
- 失敗, [領導者故障:故障轉移](/tw/ch6#leader-failure-failover)
- 長度, [超時和無界延遲](/tw/ch9#sec_distributed_queueing)
- TimescaleDB(資料庫), [列式儲存](/tw/ch4#sec_storage_column)
- 時間戳, [邏輯時鐘](/tw/ch10#sec_consistency_timestamps)
- assigning to events in stream processing, [你用的是誰的時鐘?](/tw/ch12#id438)
- for read-after-write consistency, [讀己之寫](/tw/ch6#sec_replication_ryw)
- for transaction ordering, [用於全域性快照的同步時鐘](/tw/ch9#sec_distributed_spanner)
- insufficient for enforcing constraints, [使用邏輯時鐘強制約束](/tw/ch10#enforcing-constraints-using-logical-clocks)
- key-range sharding by, [按鍵的範圍分片](/tw/ch7#sec_sharding_key_range)
- Lamport, [Lamport 時間戳](/tw/ch10#lamport-timestamps)
- logical, [排序事件以捕獲因果關係](/tw/ch13#sec_future_capture_causality)
- ordering events, [用於事件排序的時間戳](/tw/ch9#sec_distributed_lww)
- timestamp oracle, [實現線性一致的 ID 生成器](/tw/ch10#implementing-a-linearizable-id-generator)
- TLA+ (specification language), [模型檢查與規範語言](/tw/ch9#model-checking-and-specification-languages)
- token bucket (limiting retries), [描述效能](/tw/ch2#sec_introduction_percentiles)
- 墓碑, [構建和合並 SSTable](/tw/ch4#constructing-and-merging-sstables), [磁碟空間使用](/tw/ch4#disk-space-usage), [日誌壓縮](/tw/ch12#sec_stream_log_compaction)
- topics (messaging), [訊息代理](/tw/ch5#message-brokers), [傳遞事件流](/tw/ch12#sec_stream_transmit)
- torn pages (B-trees), [使 B 樹可靠](/tw/ch4#sec_storage_btree_wal)
- 全序, [術語表](/tw/glossary)
- broadcast(見 shared logs)
- 限制, [全序的限制](/tw/ch13#id335)
- 在邏輯時間戳上, [邏輯時鐘](/tw/ch10#sec_consistency_timestamps)
- 追蹤, [分散式系統的問題](/tw/ch1#sec_introduction_dist_sys_problems)
- 跟蹤行為資料, [隱私與追蹤](/ch14#id373)
- (另見 隱私)
- 權衡, [資料系統架構中的權衡](/tw/ch1#ch_tradeoffs)-[資料系統、法律與社會](/tw/ch1#sec_introduction_compliance)
- transaction coordinator(見 協調者)
- transaction manager(見 協調者)
- 事務處理, [事務處理與分析的特徵](/tw/ch1#sec_introduction_oltp)-[事務處理與分析的特徵](/tw/ch1#sec_introduction_oltp)
- 與分析的比較, [事務處理與分析的特徵](/tw/ch1#sec_introduction_oltp)
- 與資料儲存的比較, [分析型資料儲存](/tw/ch4#sec_storage_analytics)
- 事務, [事務](/tw/ch8#ch_transactions)-[總結](/tw/ch8#summary), [術語表](/tw/glossary)
- ACID properties of, [ACID 的含義](/tw/ch8#sec_transactions_acid)
- 原子性, [原子性](/tw/ch8#sec_transactions_acid_atomicity)
- 一致性, [一致性](/tw/ch8#sec_transactions_acid_consistency)
- 永續性, [使 B 樹可靠](/tw/ch4#sec_storage_btree_wal), [永續性](/tw/ch8#durability)
- 隔離性, [隔離性](/tw/ch8#sec_transactions_acid_isolation)
- 資料完整性, [及時性與完整性](/tw/ch13#sec_future_integrity)
- 複製, [複製延遲的解決方案](/tw/ch6#id131)
- compensating(見 compensating transactions)
- 概念, [事務到底是什麼?](/tw/ch8#sec_transactions_overview)
- 分散式事務, [分散式事務](/tw/ch8#sec_transactions_distributed)-[再談恰好一次訊息處理](/tw/ch8#exactly-once-message-processing-revisited)
- 避開, [派生資料與分散式事務](/tw/ch13#sec_future_derived_vs_transactions), [開展分拆工作](/tw/ch13#sec_future_unbundling_favor), [強制約束](/tw/ch13#sec_future_constraints)-[無協調資料系統](/tw/ch13#id454)
- 失敗放大, [維護派生狀態](/tw/ch13#id446)
- for sharded systems, [分片的利與弊](/tw/ch7#sec_sharding_reasons)
- in doubt/uncertain status, [協調器故障](/tw/ch8#coordinator-failure), [存疑時持有鎖](/tw/ch8#holding-locks-while-in-doubt)
- 兩階段提交, [兩階段提交(2PC)](/tw/ch8#sec_transactions_2pc)-[三階段提交](/tw/ch8#three-phase-commit)
- 使用, [跨不同系統的分散式事務](/tw/ch8#sec_transactions_xa)-[恰好一次訊息處理](/tw/ch8#sec_transactions_exactly_once)
- XA 事務, [XA 事務](/tw/ch8#xa-transactions)-[XA 事務的問題](/tw/ch8#problems-with-xa-transactions)
- OLTP versus analytics queries, [分析(Analytics)](/tw/ch11#sec_batch_olap)
- 目標, [事務](/tw/ch8#ch_transactions)
- 可序列化, [可序列化](/tw/ch8#sec_transactions_serializability)-[可序列化快照隔離的效能](/tw/ch8#performance-of-serializable-snapshot-isolation)
- 實際執行, [實際序列執行](/tw/ch8#sec_transactions_serial)-[序列執行總結](/tw/ch8#summary-of-serial-execution)
- 悲觀與樂觀的併發控制, [悲觀併發控制與樂觀併發控制](/tw/ch8#pessimistic-versus-optimistic-concurrency-control)
- 可序列化快照隔離, [可序列化快照隔離(SSI)](/tw/ch8#sec_transactions_ssi)-[可序列化快照隔離的效能](/tw/ch8#performance-of-serializable-snapshot-isolation)
- 兩階段鎖定, [兩階段鎖定(2PL)](/tw/ch8#sec_transactions_2pl)-[索引範圍鎖](/tw/ch8#sec_transactions_2pl_range)
- 單物件和多物件, [單物件與多物件操作](/tw/ch8#sec_transactions_multi_object)-[處理錯誤和中止](/tw/ch8#handling-errors-and-aborts)
- 處理錯誤和中止, [處理錯誤和中止](/tw/ch8#handling-errors-and-aborts)
- 多物件事務的需要, [多物件事務的需求](/tw/ch8#sec_transactions_need)
- 單物件寫入, [單物件寫入](/tw/ch8#sec_transactions_single_object)
- 快照隔離(見 snapshots)
- 嚴格的序列性, [什麼使系統具有線性一致性?](/tw/ch10#sec_consistency_lin_definition)
- weak isolation levels, [弱隔離級別](/tw/ch8#sec_transactions_isolation_levels)-[物化衝突](/tw/ch8#materializing-conflicts)
- 防止丟失更新, [防止丟失更新](/tw/ch8#sec_transactions_lost_update)-[衝突解決與複製](/tw/ch8#conflict-resolution-and-replication)
- 讀已提交, [讀已提交](/tw/ch8#sec_transactions_read_committed)-[快照隔離與可重複讀](/tw/ch8#sec_transactions_snapshot_isolation)
- 曲線(圖), [屬性圖](/tw/ch3#id56)
- tries (data structure), [構建和合並 SSTable](/tw/ch4#constructing-and-merging-sstables), [全文檢索](/tw/ch4#sec_storage_full_text)
- as SSTable index, [SSTable 檔案格式](/tw/ch4#the-sstable-file-format)
- 觸發器(資料庫), [傳遞事件流](/tw/ch12#sec_stream_transmit)
- Trino(資料倉庫), [雲資料倉庫](/tw/ch4#sec_cloud_data_warehouses)
- 聯邦資料庫, [一切的元資料庫](/tw/ch13#id341)
- 查詢最佳化器, [查詢語言](/tw/ch11#sec_batch_query_lanauges)
- 用於 ETL, [提取-轉換-載入(ETL)](/tw/ch11#sec_batch_etl_usage)
- 工作流程示例, [工作流排程](/tw/ch11#sec_batch_workflows)
- triple-stores, [三元組儲存與 SPARQL](/tw/ch3#id59)-[SPARQL 查詢語言](/tw/ch3#the-sparql-query-language)
- SPARQL 查詢語言, [SPARQL 查詢語言](/tw/ch3#the-sparql-query-language)
- tumbling windows (stream processing), [視窗的型別](/tw/ch12#id324)
- (see also windows)
- in microbatching, [微批次與存檔點](/tw/ch12#id329)
- Turbopuffer (vector search), [設定新的副本](/tw/ch6#sec_replication_new_replica)
- Turtle (RDF data format), [三元組儲存與 SPARQL](/tw/ch3#id59)
- Twitter(見 X (social network))
- 兩階段提交, [兩階段提交(2PC)](/tw/ch8#sec_transactions_2pc)-[協調器故障](/tw/ch8#coordinator-failure), [術語表](/tw/glossary)
- confusion with two-phase locking, [兩階段鎖定(2PL)](/tw/ch8#sec_transactions_2pl)
- coordinator failure, [協調器故障](/tw/ch8#coordinator-failure)
- coordinator recovery, [從協調器故障中恢復](/tw/ch8#recovering-from-coordinator-failure)
- how it works, [系統性的承諾](/tw/ch8#a-system-of-promises)
- performance cost, [跨不同系統的分散式事務](/tw/ch8#sec_transactions_xa)
- problems with XA transactions, [XA 事務的問題](/tw/ch8#problems-with-xa-transactions)
- transactions holding locks, [存疑時持有鎖](/tw/ch8#holding-locks-while-in-doubt)
- 兩階段鎖定, [兩階段鎖定(2PL)](/tw/ch8#sec_transactions_2pl)-[索引範圍鎖](/tw/ch8#sec_transactions_2pl_range), [什麼使系統具有線性一致性?](/tw/ch10#sec_consistency_lin_definition), [術語表](/tw/glossary)
- 與兩階段提交混淆, [兩階段鎖定(2PL)](/tw/ch8#sec_transactions_2pl)
- 增長和縮小階段, [兩階段鎖定的實現](/tw/ch8#implementation-of-two-phase-locking)
- 索引範圍鎖定, [索引範圍鎖](/tw/ch8#sec_transactions_2pl_range)
- performance of, [兩階段鎖定的效能](/tw/ch8#performance-of-two-phase-locking)
- type checking, dynamic versus static, [文件模型中的模式靈活性](/tw/ch3#sec_datamodels_schema_flexibility)
### U
- UDP (User Datagram Protocol)
- comparison to TCP, [網路擁塞和排隊](/tw/ch9#network-congestion-and-queueing)
- multicast, [直接從生產者傳遞給消費者](/tw/ch12#id296)
- Ultima Online (game), [分片](/tw/ch7#ch_sharding)
- unbounded datasets, [流處理](/tw/ch12#ch_stream), [術語表](/tw/glossary)
- (see also streams)
- unbounded delays, [術語表](/tw/glossary)
- 在網路中, [超時和無界延遲](/tw/ch9#sec_distributed_queueing)
- 程序暫停, [程序暫停](/tw/ch9#sec_distributed_clocks_pauses)
- unbundling databases, [分拆資料庫](/tw/ch13#sec_future_unbundling)-[多分割槽資料處理](/tw/ch13#sec_future_unbundled_multi_shard)
- composing data storage technologies, [組合使用資料儲存技術](/tw/ch13#id447)-[分拆系統與整合系統](/tw/ch13#id448)
- federation versus unbundling, [一切的元資料庫](/tw/ch13#id341)
- designing applications around dataflow, [圍繞資料流設計應用](/tw/ch13#sec_future_dataflow)-[流處理器和服務](/tw/ch13#id345)
- observing derived state, [觀察派生資料狀態](/tw/ch13#sec_future_observing)-[多分割槽資料處理](/tw/ch13#sec_future_unbundled_multi_shard)
- materialized views and caching, [物化檢視和快取](/tw/ch13#id451)
- multi-shard data processing, [多分割槽資料處理](/tw/ch13#sec_future_unbundled_multi_shard)
- pushing state changes to clients, [將狀態變更推送給客戶端](/tw/ch13#id348)
- uncertain (transaction status)(見 存疑)
- union types (in Avro), [模式演化規則](/tw/ch5#schema-evolution-rules)
- uniq (Unix tool), [簡單日誌分析](/tw/ch11#sec_batch_log_analysis), [分散式作業編排](/tw/ch11#id278)
- uniqueness constraints
- asynchronously checked, [寬鬆地解釋約束](/tw/ch13#id362)
- requiring consensus, [唯一性約束需要達成共識](/tw/ch13#id452)
- requiring linearizability, [約束與唯一性保證](/tw/ch10#sec_consistency_uniqueness)
- uniqueness in log-based messaging, [基於日誌訊息傳遞中的唯一性](/tw/ch13#sec_future_uniqueness_log)
- Unity (data catalog), [雲資料倉庫](/tw/ch4#sec_cloud_data_warehouses)
- universally unique identifiers (see UUIDs)
- Unix philosophy
- 比較關係資料庫, [分拆資料庫](/tw/ch13#sec_future_unbundling), [一切的元資料庫](/tw/ch13#id341)
- 與流處理的比較, [流處理](/tw/ch12#sec_stream_processing)
- unix 管道, [簡單日誌分析](/tw/ch11#sec_batch_log_analysis)
- 與分散式批次處理相比, [工作流排程](/tw/ch11#sec_batch_workflows)
- UPDATE statement (SQL), [文件模型中的模式靈活性](/tw/ch3#sec_datamodels_schema_flexibility)
- 更新
- 防止丟失更新, [防止丟失更新](/tw/ch8#sec_transactions_lost_update)-[衝突解決與複製](/tw/ch8#conflict-resolution-and-replication)
- 原子寫入操作, [原子寫操作](/tw/ch8#atomic-write-operations)
- 自動檢測丟失的更新, [自動檢測丟失的更新](/tw/ch8#automatically-detecting-lost-updates)
- 比較和設定, [條件寫入(比較並設定)](/tw/ch8#sec_transactions_compare_and_set)
- 衝突解決和推廣, [衝突解決與複製](/tw/ch8#conflict-resolution-and-replication)
- 使用明確的鎖定, [顯式鎖定](/tw/ch8#explicit-locking)
- preventing write skew, [寫偏差與幻讀](/tw/ch8#sec_transactions_write_skew)-[物化衝突](/tw/ch8#materializing-conflicts)
- utilization
- batch process scheduling, [資源分配](/tw/ch11#id279)
- increasing through preemption, [故障處理](/tw/ch11#id281)
- trade-off with latency, [我們不能簡單地使網路延遲可預測嗎?](/tw/ch9#can-we-not-simply-make-network-delays-predictable)
- uTP protocol (BitTorrent), [TCP 的侷限性](/tw/ch9#sec_distributed_tcp)
- UUIDs, [ID 生成器和邏輯時鐘](/tw/ch10#sec_consistency_logical)
### V
- validity (consensus), [單值共識](/tw/ch10#single-value-consensus), [原子提交作為共識](/tw/ch10#atomic-commitment-as-consensus)
- vBuckets (sharding), [分片](/tw/ch7#ch_sharding)
- 向量時鐘, [版本向量](/tw/ch6#version-vectors)
- (另見 版本向量)
- 和 Lamport/hybrid 邏輯鍾, [Lamport/混合邏輯時鐘 vs. 向量時鐘](/tw/ch10#lamporthybrid-logical-clocks-vs-vector-clocks)
- 和版本向量, [版本向量](/tw/ch6#version-vectors)
- 向量嵌入, [向量嵌入](/tw/ch4#id92)
- 向量處理, [查詢執行:編譯與向量化](/tw/ch4#sec_storage_vectorized)
- 供應商鎖定, [雲服務的利弊](/tw/ch1#sec_introduction_cloud_tradeoffs)
- Venice(資料庫), [對外提供派生資料](/tw/ch11#sec_batch_serving_derived)
- 核查, [信任但驗證](/tw/ch13#sec_future_verification)-[用於可審計資料系統的工具](/tw/ch13#id366)
- 避免盲目信任, [不要盲目信任承諾](/tw/ch13#id364)
- 設計可審計性, [為可審計性而設計](/tw/ch13#id365)
- 端對端完整性檢查, [端到端原則重現](/tw/ch13#id456)
- 可審計資料系統工具, [用於可審計資料系統的工具](/tw/ch13#id366)
- 版本控制系統
- 合併衝突, [手動衝突解決](/tw/ch6#manual-conflict-resolution)
- 依賴不可改變的資料, [併發控制](/tw/ch12#sec_stream_concurrency)
- 版本向量, [不同拓撲的問題](/tw/ch6#problems-with-different-topologies), [版本向量](/tw/ch6#version-vectors)
- dotted, [版本向量](/tw/ch6#version-vectors)
- versus vector clocks, [版本向量](/tw/ch6#version-vectors)
- Vertica(資料庫), [雲資料倉庫](/tw/ch4#sec_cloud_data_warehouses)
- 處理寫入, [寫入列式儲存](/tw/ch4#writing-to-column-oriented-storage)
- vertical scaling(見 scaling up)
- 頂點(圖), [圖資料模型](/tw/ch3#sec_datamodels_graph)
- 屬性圖模型, [屬性圖](/tw/ch3#id56)
- video games, [同步引擎的利弊](/tw/ch6#pros-and-cons-of-sync-engines)
- video transcoding (example), [跨通道時序依賴](/tw/ch10#cross-channel-timing-dependencies)
- views (SQL queries), [Datalog:遞迴關係查詢](/tw/ch3#id62)
- materialized views(見 物化)
- Viewstamped Replication, [共識](/tw/ch10#sec_consistency_consensus), [共識的實踐](/tw/ch10#sec_consistency_total_order)
- use of model checking, [模型檢查與規範語言](/tw/ch9#model-checking-and-specification-languages)
- view number, [從單主複製到共識](/tw/ch10#from-single-leader-replication-to-consensus)
- 虛擬塊裝置, [儲存與計算的分離](/tw/ch1#sec_introduction_storage_compute)
- 虛擬檔案系統, [分散式檔案系統](/tw/ch11#sec_batch_dfs)
- 比較分散式檔案系統, [分散式檔案系統](/tw/ch11#sec_batch_dfs)
- 虛擬機器, [雲服務的分層](/tw/ch1#layering-of-cloud-services)
- context switches, [程序暫停](/tw/ch9#sec_distributed_clocks_pauses)
- network performance, [網路擁塞和排隊](/tw/ch9#network-congestion-and-queueing)
- noisy neighbors, [網路擁塞和排隊](/tw/ch9#network-congestion-and-queueing)
- virtualized clocks in, [時鐘同步和準確性](/tw/ch9#sec_distributed_clock_accuracy)
- 虛擬記憶體
- 因頁面錯誤造成的程序暫停, [延遲與響應時間](/tw/ch2#id23), [程序暫停](/tw/ch9#sec_distributed_clocks_pauses)
- Virtuoso(資料庫), [SPARQL 查詢語言](/tw/ch3#the-sparql-query-language)
- VisiCalc (spreadsheets), [圍繞資料流設計應用](/tw/ch13#sec_future_dataflow)
- Vitess(資料庫)
- 鍵程硬化, [按鍵的範圍分片](/tw/ch7#sec_sharding_key_range)
- 節點(硬化), [分片](/tw/ch7#ch_sharding)
- 詞彙, [三元組儲存與 SPARQL](/tw/ch3#id59)
- Voice over IP (VoIP), [網路擁塞和排隊](/tw/ch9#network-congestion-and-queueing)
- VoltDB(資料庫)
- cross-shard serializability, [分片](/tw/ch8#sharding)
- deterministic stored procedures, [儲存過程的利弊](/tw/ch8#sec_transactions_stored_proc_tradeoffs)
- in-memory storage, [全記憶體儲存](/tw/ch4#sec_storage_inmemory)
- process-per-core model, [分片的利與弊](/tw/ch7#sec_sharding_reasons)
- secondary indexes, [本地二級索引](/tw/ch7#id166)
- serial execution of transactions, [實際序列執行](/tw/ch8#sec_transactions_serial)
- statement-based replication, [基於語句的複製](/tw/ch6#statement-based-replication), [失敗後重建狀態](/tw/ch12#sec_stream_state_fault_tolerance)
- transactions in stream processing, [原子提交再現](/tw/ch12#sec_stream_atomic_commit)
### W
- 預寫式日誌, [使 B 樹可靠](/tw/ch4#sec_storage_btree_wal)
- WAL-G (backup tool), [設定新的副本](/tw/ch6#sec_replication_new_replica)
- WarpStream(訊息系統), [磁碟空間使用](/tw/ch12#sec_stream_disk_usage)
- web services(見 services)
- webhooks, [直接從生產者傳遞給消費者](/tw/ch12#id296)
- webMethods (messaging), [訊息代理](/tw/ch5#message-brokers)
- WebSocket (protocol), [將狀態變更推送給客戶端](/tw/ch13#id348)
- wide-column data model, [讀寫的資料區域性](/tw/ch3#sec_datamodels_document_locality)
- versus column-oriented storage, [列壓縮](/tw/ch4#sec_storage_column_compression)
- windows (stream processing), [流分析](/tw/ch12#id318), [時間推理](/tw/ch12#sec_stream_time)-[視窗的型別](/tw/ch12#id324)
- 更改日誌的無限視窗, [維護物化檢視](/tw/ch12#sec_stream_mat_view), [流表連線(流擴充)](/tw/ch12#sec_stream_table_joins)
- 知道所有事件何時到來, [處理滯留事件](/tw/ch12#id323)
- 串流在視窗內連線, [流流連線(視窗連線)](/tw/ch12#id440)
- 視窗型別, [視窗的型別](/tw/ch12#id324)
- WITH RECURSIVE syntax (SQL), [SQL 中的圖查詢](/tw/ch3#id58)
- Word2Vec (language model), [向量嵌入](/tw/ch4#id92)
- 工作流程引擎, [持久化執行與工作流](/tw/ch5#sec_encoding_dataflow_workflows)
- Airflow(見 Airflow(工作流排程器))
- 批處理, [工作流排程](/tw/ch11#sec_batch_workflows)
- Camunda(見 Camunda (workflow engine))
- Dagster(見 Dagster(工作流排程器))
- 持久執行, [持久化執行與工作流](/tw/ch5#sec_encoding_dataflow_workflows)
- 提取-轉換-載入(ETL)(見 ETL)
- 執行器, [持久化執行與工作流](/tw/ch5#sec_encoding_dataflow_workflows)
- orchestrators, [持久化執行與工作流](/tw/ch5#sec_encoding_dataflow_workflows), [批處理](/tw/ch11#ch_batch)
- Orkes(見 Orkes (workflow engine))
- Prefect(見 Prefect(工作流排程器))
- relying on determinism, [確定性模擬測試](/tw/ch9#deterministic-simulation-testing)
- Restate(見 Restate (workflow engine))
- Temporal(見 Temporal (workflow engine))
- working set, [排序與記憶體聚合](/tw/ch11#id275)
- 寫入放大, [寫放大](/tw/ch4#write-amplification)
- 寫路徑, [觀察派生資料狀態](/tw/ch13#sec_future_observing)
- 寫偏差, [寫偏差與幻讀](/tw/ch8#sec_transactions_write_skew)-[物化衝突](/tw/ch8#materializing-conflicts)
- 特性, [寫偏差與幻讀](/tw/ch8#sec_transactions_write_skew)-[導致寫偏差的幻讀](/tw/ch8#sec_transactions_phantom), [基於過時前提的決策](/tw/ch8#decisions-based-on-an-outdated-premise)
- 例項, [寫偏差與幻讀](/tw/ch8#sec_transactions_write_skew), [寫偏差的更多例子](/tw/ch8#more-examples-of-write-skew)
- 物化衝突, [物化衝突](/tw/ch8#materializing-conflicts)
- 實際發生情況, [維護完整性,儘管軟體有Bug](/tw/ch13#id455)
- 幻讀, [導致寫偏差的幻讀](/tw/ch8#sec_transactions_phantom)
- 預防
- 在快照隔離中, [基於過時前提的決策](/tw/ch8#decisions-based-on-an-outdated-premise)-[檢測影響先前讀取的寫入](/tw/ch8#sec_detecting_writes_affect_reads)
- in two-phase locking, [謂詞鎖](/tw/ch8#predicate-locks)-[索引範圍鎖](/tw/ch8#sec_transactions_2pl_range)
- 選項, [寫偏差的特徵](/tw/ch8#characterizing-write-skew)
- 預寫式日誌, [使 B 樹可靠](/tw/ch4#sec_storage_btree_wal), [預寫日誌(WAL)傳輸](/tw/ch6#write-ahead-log-wal-shipping)
- 持久執行, [持久化執行](/tw/ch5#durable-execution)
- 寫入(資料庫)
- 原子寫入操作, [原子寫操作](/tw/ch8#atomic-write-operations)
- 檢測影響前讀的寫入, [檢測影響先前讀取的寫入](/tw/ch8#sec_detecting_writes_affect_reads)
- preventing dirty writes, [沒有髒寫](/tw/ch8#sec_transactions_dirty_write)
- WS-\* framework, [遠端過程呼叫(RPC)的問題](/tw/ch5#sec_problems_with_rpc)
- WS-AtomicTransaction (2PC), [兩階段提交(2PC)](/tw/ch8#sec_transactions_2pc)
### X
- X (social network)
- constructing home timelines (example), [案例研究:社交網路首頁時間線](/tw/ch2#sec_introduction_twitter), [從同一事件日誌中派生多個檢視](/tw/ch12#sec_stream_deriving_views), [表表連線(維護物化檢視)](/tw/ch12#id326), [物化檢視和快取](/tw/ch13#id451)
- cost of joins, [社交網路案例研究中的反正規化](/tw/ch3#denormalization-in-the-social-networking-case-study)
- describing load, [描述負載](/tw/ch2#id33)
- fault tolerance, [容錯](/tw/ch2#id27)
- performance metrics, [描述效能](/tw/ch2#sec_introduction_percentiles)
- DistributedLog (event log), [使用日誌進行訊息儲存](/tw/ch12#id300)
- Snowflake (ID generator), [ID 生成器和邏輯時鐘](/tw/ch10#sec_consistency_logical)
- XA 事務, [兩階段提交(2PC)](/tw/ch8#sec_transactions_2pc), [XA 事務](/tw/ch8#xa-transactions)-[XA 事務的問題](/tw/ch8#problems-with-xa-transactions)
- 啟發式決策, [從協調器故障中恢復](/tw/ch8#recovering-from-coordinator-failure)
- 問題, [XA 事務的問題](/tw/ch8#problems-with-xa-transactions)
- xargs (Unix tool), [簡單日誌分析](/tw/ch11#sec_batch_log_analysis)
- XFS (file system), [分散式檔案系統](/tw/ch11#sec_batch_dfs)
- XGBoost (machine learning library), [機器學習](/tw/ch11#id290)
- XML
- 二進位制變體, [二進位制編碼](/tw/ch5#binary-encoding)
- 資料位置, [讀寫的資料區域性](/tw/ch3#sec_datamodels_document_locality)
- encoding RDF data, [RDF 資料模型](/tw/ch3#the-rdf-data-model)
- 應用資料的問題, [JSON、XML 及其二進位制變體](/tw/ch5#sec_encoding_json)
- 關係資料庫, [文件模型中的模式靈活性](/tw/ch3#sec_datamodels_schema_flexibility)
- XML databases, [關係模型與文件模型](/tw/ch3#sec_datamodels_history), [文件的查詢語言](/tw/ch3#query-languages-for-documents)
- Xorq(查詢引擎), [一切的元資料庫](/tw/ch13#id341)
- XPath, [文件的查詢語言](/tw/ch3#query-languages-for-documents)
- XQuery, [文件的查詢語言](/tw/ch3#query-languages-for-documents)
### Y
- Yahoo
- 響應時間研究, [平均值、中位數與百分位點](/tw/ch2#id24)
- YARN (job scheduler), [分散式作業編排](/tw/ch11#id278), [應用程式碼和狀態的分離](/tw/ch13#id344)
- ApplicationMaster, [分散式作業編排](/tw/ch11#id278)
- Yjs (CRDT library), [同步引擎的利弊](/tw/ch6#pros-and-cons-of-sync-engines)
- YugabyteDB(資料庫)
- hash-range sharding, [按雜湊範圍分片](/tw/ch7#sharding-by-hash-range)
- key-range sharding, [按鍵的範圍分片](/tw/ch7#sec_sharding_key_range)
- multi-leader replication, [跨地域執行](/tw/ch6#sec_replication_multi_dc)
- request routing, [請求路由](/tw/ch7#sec_sharding_routing)
- sharded secondary indexes, [全域性二級索引](/tw/ch7#id167)
- tablets (sharding), [分片](/tw/ch7#ch_sharding)
- 事務, [事務到底是什麼?](/tw/ch8#sec_transactions_overview), [資料庫內部的分散式事務](/tw/ch8#sec_transactions_internal)
- 使用時鐘同步, [用於全域性快照的同步時鐘](/tw/ch9#sec_distributed_spanner)
### Z
- Zab (consensus algorithm), [共識](/tw/ch10#sec_consistency_consensus), [共識的實踐](/tw/ch10#sec_consistency_total_order)
- use in ZooKeeper, [實現線性一致性系統](/tw/ch10#sec_consistency_implementing_linearizable)
- zero-copy, [編碼資料的格式](/tw/ch5#sec_encoding_formats)
- zero-disk architecture (ZDA), [設定新的副本](/tw/ch6#sec_replication_new_replica)
- ZeroMQ (messaging library), [直接從生產者傳遞給消費者](/tw/ch12#id296)
- zombies (split brain), [隔離殭屍程序和延遲請求](/tw/ch9#sec_distributed_fencing_tokens)
- zones (cloud computing)(見 availability zones)
- ZooKeeper (coordination service), [協調服務](/tw/ch10#sec_consistency_coordination)-[服務發現](/tw/ch10#service-discovery)
- generating fencing tokens, [隔離殭屍程序和延遲請求](/tw/ch9#sec_distributed_fencing_tokens), [使用共享日誌](/tw/ch10#sec_consistency_smr), [協調服務](/tw/ch10#sec_consistency_coordination)
- linearizable operations, [實現線性一致性系統](/tw/ch10#sec_consistency_implementing_linearizable)
- locks and leader election, [鎖定與領導者選舉](/tw/ch10#locking-and-leader-election)
- observers, [服務發現](/tw/ch10#service-discovery)
- use for service discovery, [負載均衡器、服務發現和服務網格](/tw/ch5#sec_encoding_service_discovery), [服務發現](/tw/ch10#service-discovery)
- use for shard assignment, [請求路由](/tw/ch7#sec_sharding_routing)
- 使用 Zab 演算法, [共識](/tw/ch10#sec_consistency_consensus)
================================================
FILE: content/tw/part-i.md
================================================
---
title: "Part I: Foundations of Data Systems"
weight: 100
breadcrumbs: false
---
{{< callout type="warning" >}}
This page is from the first edition of the book; the second-edition version is not yet available.
{{< /callout >}}
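The first five chapters of this book introduce the fundamental concepts underlying data systems, which apply whether a system runs on a single machine or is distributed across several.
1. [Chapter 1](/tw/ch1) introduces the **trade-offs in data systems architecture**. We discuss different kinds of data systems (for example, analytical versus transactional) and how they run in cloud environments.
2. [Chapter 2](/tw/ch2) covers how to define nonfunctional requirements. What do **reliability, scalability, and maintainability** actually mean, and how can they be achieved?
3. [Chapter 3](/tw/ch3) compares several **data models and query languages**. From a developer's point of view, this is the most visible difference between databases. Different data models suit different applications.
4. [Chapter 4](/tw/ch4) goes inside **storage engines** and examines how databases lay out data on disk. Different storage engines are optimized for different workloads, and choosing the right one can have a large effect on performance.
5. [Chapter 5](/tw/ch5) compares several **data encoding** formats, looking in particular at how they fare when application requirements change frequently and schemas need to evolve over time.
[Part II](/tw/part-ii) then turns to the problems that are specific to **distributed data systems**.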
## [1. Trade-offs in Data Systems Architecture](/tw/ch1)
- [Analytical versus Transactional Systems](/tw/ch1#sec_introduction_analytics)
- [Cloud versus Self-Hosting](/tw/ch1#sec_introduction_cloud)
- [Distributed versus Single-Node Systems](/tw/ch1#sec_introduction_distributed)
- [Data Systems, Law, and Society](/tw/ch1#sec_introduction_compliance)
- [Summary](/tw/ch1#summary)
## [2. Defining Nonfunctional Requirements](/tw/ch2)
- [Case Study: Social Network Home Timelines](/tw/ch2#sec_introduction_twitter)
- [Describing Performance](/tw/ch2#sec_introduction_percentiles)
- [Reliability and Fault Tolerance](/tw/ch2#sec_introduction_reliability)
- [Scalability](/tw/ch2#sec_introduction_scalability)
- [Maintainability](/tw/ch2#sec_introduction_maintainability)
- [Summary](/tw/ch2#summary)
## [3. Data Models and Query Languages](/tw/ch3)
- [Relational Model versus Document Model](/tw/ch3#sec_datamodels_history)
- [Graph Data Models](/tw/ch3#sec_datamodels_graph)
- [Event Sourcing and CQRS](/tw/ch3#sec_datamodels_events)
- [DataFrames, Matrices, and Arrays](/tw/ch3#sec_datamodels_dataframes)
- [Summary](/tw/ch3#summary)
## [4. Storage and Retrieval](/tw/ch4)
- [Storage and Indexing for OLTP](/tw/ch4#sec_storage_oltp)
- [Data Storage for Analytics](/tw/ch4#sec_storage_analytics)
- [Multidimensional and Full-Text Indexes](/tw/ch4#sec_storage_multidimensional)
- [Summary](/tw/ch4#summary)
## [5. Encoding and Evolution](/tw/ch5)
- [Formats for Encoding Data](/tw/ch5#sec_encoding_formats)
- [Modes of Dataflow](/tw/ch5#sec_encoding_dataflow)
- [Summary](/tw/ch5#summary)
================================================
FILE: content/tw/part-ii.md
================================================
---
title: "Part II: Distributed Data"
weight: 200
breadcrumbs: false
---
{{< callout type="warning" >}}
This page is from the first edition of the book; the second-edition version is not yet available.
{{< /callout >}}
> For a successful technology, reality must take precedence over public relations, for nature cannot be fooled.
>
> Rogers Commission Report (1986)
>
-------
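In [Part I](/tw/part-i) of this book, we discussed aspects of data systems that apply when data is stored on a single machine.
Now, in [Part II](/tw/part-ii), we move up a level and ask: what happens if **multiple machines** are involved in the storage and retrieval of data?
There are various reasons why you might want to distribute a database across multiple machines:
Scalability
: If your data volume, read load, or write load grows bigger than a single machine can handle, you can spread the load across multiple machines.
Fault tolerance / high availability
: If your application needs to keep working even when one machine (or several machines, the network, or an entire datacenter) goes down, you can use multiple machines to provide redundancy. When one fails, another can take over.
Latency
: If you have users around the world, you might want servers in several locations worldwide so that each user can be served from a datacenter that is geographically close to them. That saves users from waiting for network packets to travel halfway around the world.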
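## Scaling to Higher Load
If all you need is to scale to higher **load**, the simplest approach is to buy a more powerful machine (sometimes called **vertical scaling** or **scaling up**). Many CPUs, memory chips, and disks can be joined together under one operating system, and a fast interconnect allows any CPU to access any part of the memory or disk. In this kind of **shared-memory architecture**, all the components can be treated as a single machine.
> [!NOTE]
> In a large machine, although any CPU can access any part of memory, some banks of memory are closer to one CPU than to others (this is called **nonuniform memory access, NUMA** [^1]). To make efficient use of this architecture, the processing needs to be broken down so that each CPU mostly accesses memory that is nearby, which means that **partitioning** is still required, even when ostensibly running on one machine.
The problem with a shared-memory approach is that the cost grows faster than linearly: a machine with twice as many CPUs, twice as much RAM, and twice as much disk capacity typically costs significantly more than twice as much. And because of bottlenecks, a machine twice the size cannot necessarily handle twice the load.
A shared-memory architecture may offer limited fault tolerance, since high-end machines have hot-swappable components (you can replace disks, memory modules, and even CPUs without shutting the machine down), but it is necessarily limited to a single geographic location.
Another approach is the **shared-disk architecture**, which uses several machines with independent CPUs and RAM, but stores data on an array of disks that is shared between the machines and connected via a fast network. This architecture is used for some data warehousing workloads, but contention and the overhead of locking limit the scalability of the shared-disk approach [^2].
> [!NOTE]
> The shared disk array is typically Network Attached Storage (NAS) or a **Storage Area Network (SAN)**.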
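### Shared-Nothing Architectures
By contrast, **shared-nothing architectures** [^3] (sometimes called **horizontal scaling** or **scaling out**) have become very popular.
In this approach, each machine or virtual machine running the database software is called a **node**. Each node uses its own CPUs, RAM, and disks independently. Any coordination between nodes is done at the software level, using a conventional network.
A shared-nothing system requires no special hardware, so you can use whatever machines have the best price/performance ratio. You can potentially distribute data across multiple geographic regions, which reduces latency for users and may let you survive the loss of an entire datacenter.
With cloud deployments of virtual machines, you don't need to operate at Google scale: even for small companies, a multi-region distributed architecture is now feasible.
In this part of the book, we focus on shared-nothing architectures, not because they are necessarily the best choice for every use case, but because they require the most caution from you.
If your data is distributed across multiple nodes, you need to be aware of the constraints and trade-offs that occur in such a distributed system; the database cannot magically hide these from you.
While a distributed shared-nothing architecture has many advantages, it usually also incurs additional complexity for applications and sometimes limits the expressiveness of the data models you can use.
In some cases, a simple single-threaded program can perform better than a cluster with over 100 CPU cores [^4]. On the other hand, shared-nothing systems can be very powerful. The next few chapters go into detail on the issues that arise when data is distributed.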
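### Replication versus Partitioning
There are two common ways in which data is distributed across multiple nodes:
Replication
: Keeping a copy of the same data on several different nodes, potentially in different locations. Replication provides redundancy: if some nodes are unavailable, the data can still be served from the remaining nodes. Replication can also help improve performance. We discuss replication in [Chapter 6](/tw/ch6).
Partitioning
: Splitting a big database into smaller subsets called **partitions**, so that different partitions can be assigned to different **nodes** (also known as **sharding**). We discuss partitioning in [Chapter 7](/tw/ch7).
These are separate mechanisms, but they often go hand in hand, as illustrated in [Figure II-1](#fig_replication_partitioning).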
{{< figure src="/v1/ddia_part-ii_01.png" id="fig_replication_partitioning" caption="Figure II-1. A database split into two partitions, with two replicas per partition" class="w-full my-4" >}}
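
As a rough illustration of how the two mechanisms combine, the following Python sketch assigns each key to a partition by hashing and then places every partition on two nodes. The hash-modulo scheme, the round-robin replica placement, the function names, and the node names are all assumptions made for this example; the text does not prescribe any particular scheme, and real databases use more careful strategies.

```python
import hashlib
from typing import List


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition number by hashing it (one possible scheme)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def replicas_for(partition: int, nodes: List[str], replication_factor: int) -> List[str]:
    """Pick `replication_factor` distinct nodes for a partition, round-robin."""
    if replication_factor > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return [nodes[(partition + i) % len(nodes)] for i in range(replication_factor)]


def nodes_for_key(key: str, nodes: List[str],
                  num_partitions: int = 2, replication_factor: int = 2) -> List[str]:
    """Return every node that holds a copy of the partition containing `key`."""
    partition = partition_for(key, num_partitions)
    return replicas_for(partition, nodes, replication_factor)


if __name__ == "__main__":
    cluster = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical node names
    for key in ("user:1", "user:2", "user:3"):
        print(key, "->", nodes_for_key(key, cluster))
```

Partitioning decides which subset a record belongs to, and replication decides how many nodes hold that subset; the two choices are independent, which is why the defaults above (two partitions, two replicas each) can be varied separately.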
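With an understanding of those concepts, we can discuss the difficult trade-offs that you need to make in a distributed system. [Chapter 8](/tw/ch8) discusses **transactions**, which will help you understand the many things that can go wrong in a data system and what you can do about them.
[Chapter 9](/tw/ch9) and [Chapter 10](/tw/ch10) discuss the fundamental limitations of distributed systems.
Later, in [Part III](/tw/part-iii) of this book, we discuss how you can take several (potentially distributed) datastores and integrate them into a larger system that satisfies the needs of a complex application. But first, let's talk about distributed data.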
## [6. Replication](/tw/ch6)
- [Single-Leader Replication](/tw/ch6#sec_replication_leader)
- [Problems with Replication Lag](/tw/ch6#sec_replication_lag)
- [Multi-Leader Replication](/tw/ch6#sec_replication_multi_leader)
- [Leaderless Replication](/tw/ch6#sec_replication_leaderless)
- [Summary](/tw/ch6#summary)
## [7. Sharding](/tw/ch7)
- [Pros and Cons of Sharding](/tw/ch7#sec_sharding_reasons)
- [Sharding of Key-Value Data](/tw/ch7#sec_sharding_key_value)
- [Request Routing](/tw/ch7#sec_sharding_routing)
- [Sharding and Secondary Indexes](/tw/ch7#sec_sharding_secondary_indexes)
- [Summary](/tw/ch7#summary)
## [8. Transactions](/tw/ch8)
- [What Exactly Is a Transaction?](/tw/ch8#sec_transactions_overview)
- [Weak Isolation Levels](/tw/ch8#sec_transactions_isolation_levels)
- [Serializability](/tw/ch8#sec_transactions_serializability)
- [Distributed Transactions](/tw/ch8#sec_transactions_distributed)
- [Summary](/tw/ch8#summary)
- [References](/tw/ch8#參考)
## [9. The Trouble with Distributed Systems](/tw/ch9)
- [Faults and Partial Failures](/tw/ch9#sec_distributed_partial_failure)
- [Unreliable Networks](/tw/ch9#sec_distributed_networks)
- [Unreliable Clocks](/tw/ch9#sec_distributed_clocks)
- [Knowledge, Truth, and Lies](/tw/ch9#sec_distributed_truth)
- [Summary](/tw/ch9#summary)
## [10. Consistency and Consensus](/tw/ch10)
- [Linearizability](/tw/ch10#sec_consistency_linearizability)
- [ID Generators and Logical Clocks](/tw/ch10#sec_consistency_logical)
- [Consensus](/tw/ch10#sec_consistency_consensus)
- [Summary](/tw/ch10#summary)
### References
[^1]: Ulrich Drepper: “[What Every Programmer Should Know About Memory](https://people.freebsd.org/~lstewart/articles/cpumemory.pdf),” akkadia.org, November 21, 2007.
[^2]: Ben Stopford: “[Shared Nothing vs. Shared Disk Architectures: An Independent View](http://www.benstopford.com/2009/11/24/understanding-the-shared-nothing-architecture/),” benstopford.com, November 24, 2009.
[^3]: Michael Stonebraker: “[The Case for Shared Nothing](http://db.cs.berkeley.edu/papers/hpts85-nothing.pdf),” IEEE Database Engineering Bulletin, volume 9, number 1, pages 4–9, March 1986.
[^4]: Frank McSherry, Michael Isard, and Derek G. Murray: “[Scalability! But at What COST?](http://www.frankmcsherry.org/assets/COST.pdf),” at 15th USENIX Workshop on Hot Topics in Operating Systems (HotOS), May 2015.
================================================
FILE: content/tw/part-iii.md
================================================
---
title: "Part III: Derived Data"
weight: 300
breadcrumbs: false
---
{{< callout type="warning" >}}
This page is from the first edition of the book; the second edition is not yet available.
{{< /callout >}}
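In [Part I](/tw/part-i) and [Part II](/tw/part-ii) of this book, we worked from the bottom up through all the major considerations that go into a distributed database, from the layout of data on disk all the way to the limits of distributed consistency in the presence of faults. However, that entire discussion assumed that the application uses only one database.
In reality, data systems are often more complex. Large applications frequently need to access and process data in many different ways, and no single database can satisfy all of those needs at once. Applications therefore commonly combine several components (datastores, indexes, caches, analytics systems, and so on) and implement mechanisms for moving data between them.
This final part of the book examines the issues that arise when integrating multiple data systems, potentially with different data models and optimized for different access patterns, into one coherent application architecture. Vendors often overlook this aspect of system building and claim that their product can satisfy all your needs. In practice, integrating disparate systems is one of the most important things a real-world application has to do.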
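## Systems of Record and Derived Data Systems
At a high level, systems that store and process data can be grouped into two broad categories:
System of record
: A **system of record**, also known as a **source of truth**, holds the authoritative version of your data. When new data comes in (for example, as user input), it is first written here.
Each fact is represented exactly once (the representation is typically **normalized**). If there is any discrepancy between another system and the **system of record**, then the value in the system of record is, by definition, the correct one.
Derived data systems
: Data in a **derived system** is usually the result of taking existing data from another system and transforming or processing it in some way. If you lose derived data, you can recreate it from the original source.
A classic example is a **cache**: data can be served from the cache if it is present, but if the cache does not contain what you need, you fall back to the underlying database. Denormalized values, indexes, and materialized views also fall into this category. In recommendation systems, predictive summary data is often derived from usage logs.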
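Technically speaking, derived data is **redundant**, in the sense that it duplicates existing information. However, it is often essential for good performance on read queries. It is commonly denormalized. You can derive several different datasets from a single source, which lets you look at the data from different "points of view."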
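The cache example can be made concrete with a small sketch. The class names `UserStore` and `CachedReader` are hypothetical and exist only for illustration; the sketch also deliberately ignores cache invalidation on writes, which a real system would have to handle.

```python
class UserStore:
    """System of record: holds the authoritative value for each key."""

    def __init__(self):
        self._rows = {}

    def write(self, key, value):
        self._rows[key] = value

    def read(self, key):
        return self._rows.get(key)


class CachedReader:
    """Derived data: a cache that can always be rebuilt from the store."""

    def __init__(self, store):
        self._store = store
        self._cache = {}
        self.misses = 0  # counts how often we fell back to the store

    def read(self, key):
        if key in self._cache:
            return self._cache[key]       # served from derived data
        self.misses += 1
        value = self._store.read(key)     # fall back to the system of record
        if value is not None:
            self._cache[key] = value      # re-derive the cached copy
        return value

    def drop_cache(self):
        """Losing derived data is safe: it is recreated on demand."""
        self._cache.clear()
```

Dropping the cache changes performance (the next read is a miss) but not correctness, because every cached value is a copy of something the system of record still holds.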
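Not all systems make a clear distinction between **systems of record** and **derived data systems** in their architecture, but the distinction is a helpful one because it clarifies the dataflow through your system: it makes explicit which parts of the system have which inputs and which outputs, and how they depend on each other.
Most databases, storage engines, and query languages are not inherently either a system of record or a derived system. A database is just a tool: how you use it is up to you. **The distinction between a system of record and a derived data system depends not on the tool, but on how you use it in your application.**
By being clear about which data is derived from which other data, you can bring clarity to an otherwise confusing system architecture. This point is a running theme throughout this part of the book.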
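## Overview of Chapters
We start in [Chapter 11](/tw/ch11) by examining **batch-oriented** dataflow systems such as MapReduce, and we will see that they provide good tools and principles for building large-scale data systems.
[Chapter 12](/tw/ch12) takes those ideas and applies them to **data streams**, which lets us do the same kinds of things with lower latency. [Chapter 13](/tw/ch13) explores how to use these tools to build reliable, scalable, and maintainable applications. [Chapter 14](/tw/ch14) closes the book with a discussion of ethics, privacy, and societal impact.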
## Index
## [11. Batch Processing](/tw/ch11)
- [Batch Processing with Unix Tools](/tw/ch11#sec_batch_unix)
- [Batch Processing in Distributed Systems](/tw/ch11#sec_batch_distributed)
- [Batch Processing Models](/tw/ch11#id431)
- [Batch Use Cases](/tw/ch11#sec_batch_output)
- [Summary](/tw/ch11#id292)
- [References](/tw/ch11#references)
## [12. Stream Processing](/tw/ch12)
- [Transmitting Event Streams](/tw/ch12#sec_stream_transmit)
- [Databases and Streams](/tw/ch12#sec_stream_databases)
- [Processing Streams](/tw/ch12#sec_stream_processing)
- [Summary](/tw/ch12#id332)
- [References](/tw/ch12#references)
## [13. A Philosophy of Streaming Systems](/tw/ch13)
- [Data Integration](/tw/ch13#sec_future_integration)
- [Unbundling Databases](/tw/ch13#sec_future_unbundling)
- [Aiming for Correctness](/tw/ch13#sec_future_correctness)
- [Summary](/tw/ch13#id367)
- [References](/tw/ch13#references)
## [14. Doing the Right Thing](/tw/ch14)
- [Predictive Analytics](/tw/ch14#id369)
- [Privacy and Tracking](/tw/ch14#id373)
- [Summary](/tw/ch14#id594)
- [References](/tw/ch14#references)
================================================
FILE: content/tw/preface.md
================================================
---
title: Preface
weight: 50
breadcrumbs: false
---
{{< callout type="warning" >}}
This page is from the first edition of the book; the second edition is not yet available.
{{< /callout >}}
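If you have worked in software engineering in recent years, especially on server-side and backend systems, you have probably been bombarded with a plethora of buzzwords relating to the storage and processing of data: NoSQL! Big Data! Web-scale! Sharding! Eventual consistency! ACID! CAP theorem! Cloud services! MapReduce! Real-time!
In the last decade we have seen many interesting developments in databases, in distributed systems, and in the ways we build applications on top of them. There are various driving forces behind these developments:
* Internet companies such as Google, Yahoo!, Amazon, Facebook, LinkedIn, Microsoft, and Twitter are handling huge volumes of data and traffic, which forces them to create new tools that can efficiently handle such scale.
* Businesses need to be agile, test hypotheses cheaply, and respond quickly to new market insights by keeping development cycles short and data models flexible.
* Free and open source software has become very successful and is now preferred to commercial or bespoke software in many environments.
* CPU clock speeds are barely increasing, but multi-core processors are standard and networks are getting faster. This means parallelism is only going to increase.
* Even if you work on a small team, you can now build systems that are distributed across many machines and even multiple geographic regions, thanks to infrastructure as a service (IaaS) offerings such as Amazon Web Services (AWS).
* Many services are now expected to be highly available; downtime due to outages or maintenance is becoming increasingly unacceptable.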
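**Data-intensive applications** are pushing the boundaries of what is possible by making use of these technological developments. We call an application **data-intensive** if **data is its primary challenge** (the quantity of data, the complexity of data, or the speed at which it is changing), as opposed to **compute-intensive**, where processor speed is the bottleneck.
The tools and technologies that help data-intensive applications store and process data have been rapidly adapting to these changes. New types of database systems ("NoSQL") have been getting lots of attention, but message queues, caches, search indexes, frameworks for batch and stream processing, and related technologies are very important too. Many applications use some combination of these.
The buzzwords that fill this space are a sign of enthusiasm for the new possibilities, which is a great thing. However, as software engineers and architects, we also need a technically accurate and precise understanding of the various technologies and their trade-offs if we want to build good applications. For that understanding, we have to dig deeper than buzzwords.
Fortunately, behind the rapid changes in technology there are enduring principles that remain true no matter which version of a particular tool you are using. If you understand those principles, you can see where each tool fits in, how to make good use of it, and how to avoid its pitfalls. That is where this book comes in.
The goal of this book is to help you navigate the diverse and fast-changing landscape of technologies for processing and storing data. This book is not a tutorial for one particular tool, nor is it a textbook full of dry theory. Instead, we will look at examples of successful data systems: technologies that form the foundation of many popular applications and that have to meet scalability, performance, and reliability requirements in production every day.
We will dig into the internals of those systems, tease apart their key algorithms, and discuss their principles and the trade-offs they have to make. Along the way, we will try to find useful ways of **thinking about** data systems: not just **how** they work, but also **why** they work that way, and what questions we need to ask.
After reading this book, you will be in a good position to decide which kind of technology is appropriate for which purpose, and to understand how tools can be combined to form the foundation of a good application architecture. This book will not prepare you to build your own database storage engine from scratch, but fortunately that is rarely necessary. You will, however, develop a good intuition for what your systems are doing under the hood, so that you can reason about their behavior, make good design decisions, and track down any problems that may arise.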
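## Who Should Read This Book? {#本書的目標讀者}
If you develop applications that have some kind of server or backend for storing or processing data, and your applications use the network (for example, web applications, mobile apps, or internet-connected sensors), then this book is for you.
This book is for software engineers, software architects, and technical managers who love to code. It is especially relevant if you need to make decisions about the architecture of the systems you work on, for example if you need to choose tools for solving a given problem and figure out how best to apply them. But even if you have no choice over your tools, this book will help you better understand their strengths and weaknesses.
You should have some experience building web applications or network services, and you should be familiar with relational databases and SQL. Any non-relational databases and other data-related tools you know are a bonus, but not required. A general understanding of common network protocols such as TCP and HTTP is helpful. Your choice of programming language or framework makes no difference for this book.
If any of the following are true for you, you will find this book valuable:
* You want to learn how to make data systems scalable, for example, to support web or mobile apps with millions of users.
* You need to make applications highly available (minimizing downtime) and operationally robust.
* You are looking for ways of making systems easier to maintain in the long run, even as they grow and as requirements and technologies change.
* You have a natural curiosity for the way things work and want to know what goes on inside major websites and online services. This book breaks down the internals of various databases and data processing systems, and it is great fun to explore the thinking that went into their design.
Sometimes, when discussing scalable data systems, people make comments along the lines of, "You're not Google or Amazon. Stop worrying about scale and just use a relational database." There is truth in that statement: building for scale that you don't need is wasted effort and may lock you into an inflexible design. In effect, it is a form of premature optimization. However, it is also important to choose the right tool for the job, and different technologies each have their own strengths and weaknesses. As we shall see, relational databases are important but not the final word on dealing with data.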
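## Scope of This Book {#本書涉及的領域}
This book does not attempt to give detailed instructions on how to install or use specific software packages or APIs, since there is already plenty of documentation for those things. Instead, we discuss the various principles and trade-offs that are fundamental to data systems, and we explore the different design decisions taken by different products.
The ebook editions include links to the full text of online resources. All links were verified at the time of publication, but unfortunately links tend to break frequently due to the nature of the web. If you come across a broken link, or if you are reading a print copy of this book, you can look up references using a search engine. For academic papers, you can search for the title in Google Scholar to find open-access PDF files. Alternatively, you can find all of the references at https://github.com/ept/ddia-references, where we maintain up-to-date links.
We look primarily at the **architecture** of data systems and the ways they are integrated into data-intensive applications. This book does not have space to cover deployment, operations, security, management, and other areas. Those are complex and important topics, and we would not do them justice by reducing them to superficial side notes in this book. Each deserves a book of its own.
Many of the technologies described in this book fall within the realm of the **Big Data** buzzword. However, the term "Big Data" is so overused and underdefined that it is not useful in a serious engineering discussion. This book uses less ambiguous terms, such as single-node versus distributed systems, or online/interactive versus offline/batch processing systems.
This book has a bias toward **free and open source software (FOSS)**, because reading, modifying, and executing source code is a great way to understand how something works in detail. Open platforms also reduce the risk of vendor lock-in. However, where appropriate, we also discuss proprietary software (closed-source software, software as a service, or companies' in-house software that is described in the literature but not released publicly).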
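## Outline of This Book {#本書綱要}
This book is arranged into three parts:
1. In [Part I](/tw/part-i), we discuss the fundamental ideas that underpin the design of data-intensive applications. We start in [Chapter 1](/tw/ch1) by discussing what we are actually trying to achieve: reliability, scalability, and maintainability; how we need to think about them; and how we can achieve them. In [Chapter 2](/tw/ch2) we compare several different data models and query languages, and see how they are appropriate to different situations. In [Chapter 3](/tw/ch3) we talk about storage engines: how databases arrange data on disk so that we can find it again efficiently. [Chapter 4](/tw/ch4) turns to formats for data encoding (serialization) and the evolution of schemas over time.
2. In [Part II](/tw/part-ii), we move from data stored on one machine to data that is distributed across multiple machines. This is often necessary for scalability, but it brings with it a variety of unique challenges. We first discuss replication ([Chapter 5](/tw/ch5)), partitioning/sharding ([Chapter 6](/tw/ch6)), and transactions ([Chapter 7](/tw/ch7)). We then go into more detail on the problems with distributed systems ([Chapter 8](/tw/ch8)) and what it means to achieve consistency and consensus in a distributed system ([Chapter 9](/tw/ch9)).
3. In [Part III](/tw/part-iii), we discuss systems that derive some datasets from other datasets. Derived data often occurs in heterogeneous systems: when there is no single database that can do everything well, applications need to integrate several different databases, caches, indexes, and so on. In [Chapter 10](/tw/ch10) we start with a batch processing approach to derived data, and we build upon it with stream processing in [Chapter 11](/tw/ch11). Finally, in [Chapter 12](/tw/ch12) we put everything together and discuss approaches for building reliable, scalable, and maintainable applications in the future.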
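## References and Further Reading {#參考文獻與延伸閱讀}
Most of what we discuss in this book has already been said elsewhere in some form or another: in conference presentations, research papers, blog posts, code, bug trackers, mailing lists, and engineering folklore. This book summarizes the most important ideas from many different sources, and it includes pointers to the original literature throughout the text. The references at the end of each chapter are a great resource if you want to explore an area in more depth, and most of them are freely available online.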
## O’Reilly Safari
[Safari](http://oreilly.com/safari) (formerly Safari Books Online) is a membership-based training and reference platform for enterprise, government, educators, and individuals.
Members have access to thousands of books, training videos, Learning Paths, interactive tutorials, and curated playlists from over 250 publishers, including O’Reilly Media, Harvard Business Review, Prentice Hall Professional, Addison-Wesley Professional, Microsoft Press, Sams, Que, Peachpit Press, Adobe, Focal Press, Cisco Press, John Wiley & Sons, Syngress, Morgan Kaufmann, IBM Redbooks, Packt, Adobe Press, FT Press, Apress, Manning, New Riders, McGraw-Hill, Jones & Bartlett, and Course Technology, among others.
For more information, please visit http://oreilly.com/safari.
## How to Contact Us
Please address comments and questions concerning this book to the publisher:
O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
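800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707-829-0104 (fax)
We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at *http://bit.ly/designing-data-intensive-apps*.
To comment or ask technical questions about this book, send email to *bookquestions@oreilly.com*.
For more information about O’Reilly books, courses, conferences, and news, see *http://www.oreilly.com*.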
* Facebook: [http://facebook.com/oreilly](http://facebook.com/oreilly)
* Twitter: [http://twitter.com/oreillymedia](http://twitter.com/oreillymedia)
* YouTube: [http://www.youtube.com/oreillymedia](http://www.youtube.com/oreillymedia)
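## Acknowledgments {#致謝}
This book is an amalgamation and systematization of a large number of other people's ideas and knowledge, combining experience from both academic research and industrial practice. In computing we tend to be attracted to things that are new and shiny, but I think we have a huge amount to learn from things that have been done before. This book has over 800 references to articles, blog posts, talks, documentation, and more, and they have been an invaluable learning resource for me. I am very grateful to the authors of this material for sharing their knowledge.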
I have also learned a lot from personal conversations, thanks to a large number of people who have taken the time to discuss ideas or patiently explain things to me. In particular, I would like to thank Joe Adler, Ross Anderson, Peter Bailis, Márton Balassi, Alastair Beresford, Mark Callaghan, Mat Clayton, Patrick Collison, Sean Cribbs, Shirshanka Das, Niklas Ekström, Stephan Ewen, Alan Fekete, Gyula Fóra, Camille Fournier, Andres Freund, John Garbutt, Seth Gilbert, Tom Haggett, Pat Helland, Joe Hellerstein, Jakob Homan, Heidi Howard, John Hugg, Julian Hyde, Conrad Irwin, Evan Jones, Flavio Junqueira, Jessica Kerr, Kyle Kingsbury, Jay Kreps, Carl Lerche, Nicolas Liochon, Steve Loughran, Lee Mallabone, Nathan Marz, Caitie McCaffrey, Josie McLellan, Christopher Meiklejohn, Ian Meyers, Neha Narkhede, Neha Narula, Cathy O’Neil, Onora O’Neill, Ludovic Orban, Zoran Perkov, Julia Powles, Chris Riccomini, Henry Robinson, David Rosenthal, Jennifer Rullmann, Matthew Sackman, Martin Scholl, Amit Sela, Gwen Shapira, Greg Spurrier, Sam Stokes, Ben Stopford, Tom Stuart, Diana Vasile, Rahul Vohra, Pete Warden, and Brett Wooldridge.
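Several more people have been invaluable to the writing of this book by reviewing drafts and providing feedback. For these contributions I am particularly indebted to Raul Agepati, Tyler Akidau, Mattias Andersson, Sasha Baranov, Veena Basavaraj, David Beyer, Jim Brikman, Paul Carey, Raul Castro Fernandez, Joseph Chow, Derek Elkins, Sam Elliott, Alexander Gallego, Mark Grover, Stu Halloway, Heidi Howard, Nicola Kleppmann, Stefan Kruppa, Bjorn Madsen, Sander Mak, Stefan Podkowinski, Phil Potter, Hamid Ramazani, Sam Stokes, and Ben Summers. Of course, I take all responsibility for any remaining errors or unpalatable opinions in this book.
For helping this book become real, and for their patience with my slow writing and unusual requests, I am grateful to my editors Marie Beaugureau, Mike Loukides, Ann Spencer, and all the team at O'Reilly. I thank Rachel Head for helping me find the right words. I thank Alastair Beresford, Susan Goodhue, Neha Narkhede, and Kevin Scott for giving me the time and freedom to write in spite of other work commitments.
Very special thanks are due to Shabbir Diwan and Edie Freedman, who illustrated with great care the maps that accompany the chapters. It is wonderful that they took on the unconventional idea of creating maps and made them so beautiful and compelling.
Finally, my love goes to my family and friends, without whom I would not have been able to get through this writing process that has taken almost four years. You're the best.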
================================================
FILE: content/tw/toc.md
================================================
---
title: "Table of Contents"
linkTitle: "Table of Contents"
weight: 10
breadcrumbs: false
---

## [Preface](/tw/preface)
- [Who Should Read This Book?](/tw/preface#本書的目標讀者)
- [Scope of This Book](/tw/preface#本書涉及的領域)
- [Outline of This Book](/tw/preface#本書綱要)
- [References and Further Reading](/tw/preface#參考文獻與延伸閱讀)
- [O’Reilly Safari](/tw/preface#oreilly-safari)
- [Acknowledgments](/tw/preface#致謝)
## [1. Trade-offs in Data Systems Architecture](/tw/ch1)
- [Analytical versus Transactional Systems](/tw/ch1#sec_introduction_analytics)
- [Cloud versus Self-Hosting](/tw/ch1#sec_introduction_cloud)
- [Distributed versus Single-Node Systems](/tw/ch1#sec_introduction_distributed)
- [Data Systems, Law, and Society](/tw/ch1#sec_introduction_compliance)
- [Summary](/tw/ch1#summary)
## [2. Defining Nonfunctional Requirements](/tw/ch2)
- [Case Study: Social Network Home Timelines](/tw/ch2#sec_introduction_twitter)
- [Describing Performance](/tw/ch2#sec_introduction_percentiles)
- [Reliability and Fault Tolerance](/tw/ch2#sec_introduction_reliability)
- [Scalability](/tw/ch2#sec_introduction_scalability)
- [Operability](/tw/ch2#sec_introduction_maintainability)
- [Summary](/tw/ch2#summary)
## [3. Data Models and Query Languages](/tw/ch3)
- [Relational Model versus Document Model](/tw/ch3#sec_datamodels_history)
- [Graph Data Models](/tw/ch3#sec_datamodels_graph)
- [Event Sourcing and CQRS](/tw/ch3#sec_datamodels_events)
- [DataFrames, Matrices, and Arrays](/tw/ch3#sec_datamodels_dataframes)
- [Summary](/tw/ch3#summary)
## [4. Storage and Retrieval](/tw/ch4)
- [Storage and Indexing for OLTP Systems](/tw/ch4#sec_storage_oltp)
- [Data Storage for Analytics](/tw/ch4#sec_storage_analytics)
- [Multidimensional and Full-Text Indexes](/tw/ch4#sec_storage_multidimensional)
- [Summary](/tw/ch4#summary)
## [5. Encoding and Evolution](/tw/ch5)
- [Formats for Encoding Data](/tw/ch5#sec_encoding_formats)
- [Modes of Dataflow](/tw/ch5#sec_encoding_dataflow)
- [Summary](/tw/ch5#summary)
## [6. Replication](/tw/ch6)
- [Single-Leader Replication](/tw/ch6#sec_replication_leader)
- [Problems with Replication Lag](/tw/ch6#sec_replication_lag)
- [Multi-Leader Replication](/tw/ch6#sec_replication_multi_leader)
- [Leaderless Replication](/tw/ch6#sec_replication_leaderless)
- [Summary](/tw/ch6#summary)
## [7. Sharding](/tw/ch7)
- [Pros and Cons of Sharding](/tw/ch7#sec_sharding_reasons)
- [Sharding of Key-Value Data](/tw/ch7#sec_sharding_key_value)
- [Request Routing](/tw/ch7#sec_sharding_routing)
- [Sharding and Secondary Indexes](/tw/ch7#sec_sharding_secondary_indexes)
- [Summary](/tw/ch7#summary)
## [8. Transactions](/tw/ch8)
- [What Exactly Is a Transaction?](/tw/ch8#sec_transactions_overview)
- [Weak Isolation Levels](/tw/ch8#sec_transactions_isolation_levels)
- [Serializability](/tw/ch8#sec_transactions_serializability)
- [Distributed Transactions](/tw/ch8#sec_transactions_distributed)
- [Summary](/tw/ch8#summary)
- [References](/tw/ch8#參考)
## [9. The Trouble with Distributed Systems](/tw/ch9)
- [Faults and Partial Failures](/tw/ch9#sec_distributed_partial_failure)
- [Unreliable Networks](/tw/ch9#sec_distributed_networks)
- [Unreliable Clocks](/tw/ch9#sec_distributed_clocks)
- [Knowledge, Truth, and Lies](/tw/ch9#sec_distributed_truth)
- [Summary](/tw/ch9#summary)
## [10. Consistency and Consensus](/tw/ch10)
- [Linearizability](/tw/ch10#sec_consistency_linearizability)
- [ID Generators and Logical Clocks](/tw/ch10#sec_consistency_logical)
- [Consensus](/tw/ch10#sec_consistency_consensus)
- [Summary](/tw/ch10#summary)
## [11. Batch Processing](/tw/ch11)
- [Batch Processing with Unix Tools](/tw/ch11#sec_batch_unix)
- [Batch Processing in Distributed Systems](/tw/ch11#sec_batch_distributed)
- [Batch Processing Models](/tw/ch11#id431)
- [Batch Use Cases](/tw/ch11#sec_batch_output)
- [Summary](/tw/ch11#id292)
- [References](/tw/ch11#references)
## [12. Stream Processing](/tw/ch12)
- [Transmitting Event Streams](/tw/ch12#sec_stream_transmit)
- [Databases and Streams](/tw/ch12#sec_stream_databases)
- [Processing Streams](/tw/ch12#sec_stream_processing)
- [Summary](/tw/ch12#id332)
- [References](/tw/ch12#references)
## [13. A Philosophy of Streaming Systems](/tw/ch13)
- [Data Integration](/tw/ch13#sec_future_integration)
- [Unbundling Databases](/tw/ch13#sec_future_unbundling)
- [Aiming for Correctness](/tw/ch13#sec_future_correctness)
- [Summary](/tw/ch13#id367)
- [References](/tw/ch13#references)
## [14. Doing the Right Thing](/tw/ch14)
- [Predictive Analytics](/tw/ch14#id369)
- [Privacy and Tracking](/tw/ch14#id373)
- [Summary](/tw/ch14#id594)
- [References](/tw/ch14#references)
## [Glossary](/tw/glossary)
## [Colophon](/tw/colophon)
- [About the Author](/tw/colophon#關於作者)
- [About the Translator](/tw/colophon#關於譯者)
- [Colophon](/tw/colophon#後記)
================================================
FILE: content/v1/_index.md
================================================
---
title: Designing Data-Intensive Applications (First Edition)
linkTitle: DDIA
cascade:
type: docs
breadcrumbs: false
---
**Author**: [Martin Kleppmann](https://martin.kleppmann.com), [《Designing Data-Intensive Applications 2nd Edition》](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html): distributed systems researcher at the University of Cambridge, UK; speaker, blogger, and open source contributor; software engineer and entrepreneur who previously worked on data infrastructure at LinkedIn and Rapportive.
**Translator**: [**冯若航**](https://vonng.com), online handle [@Vonng](https://github.com/Vonng).
PostgreSQL expert, database veteran, and cloud computing maverick.
Author and founder of [**Pigsty**](https://pgsty.com).
Architect, DBA, and full-stack engineer @ TanTan, Alibaba, Apple.
Independent open source contributor, [GitStar Ranking 585](https://gitstar-ranking.com/Vonng), [Top 20 most active in China](https://committers.top/china).
Chinese translator of [DDIA](https://ddia.pigsty.io) / [PG Internal](https://pgint.vonng.com); WeChat official account: 《老冯云数》; database KOL.
**Proofreading**: [@yingang](https://github.com/yingang) | [Traditional Chinese](/tw) **edition maintained** by [@afunTW](https://github.com/afunTW) | [Full list of contributors](/contrib)
> [!NOTE]
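> The [**second edition**](/zh) of DDIA is currently being translated (in the [`content/v2`](https://github.com/Vonng/ddia/tree/main) directory). You are welcome to join in and share your feedback.
## Translator's Preface
> A full-stack engineer who does not understand databases is not a good architect. (冯若航 / Vonng)
Today, especially in the internet industry, most applications are data-intensive. From low-level data structures up to high-level architecture design, this book lays out the essentials of data system design in an engaging way. Its hard-won lessons will be useful to architects, DBAs, backend engineers, and even product managers.
This is a book that combines theory with practice. The translator has run into many of the problems it describes in real-world work, and reading about them prompts both applause and regret. How many detours could have been avoided by reading this book sooner!
It is also a book that makes deep material accessible: it explains where concepts come from rather than showing off definitions, and it traces how things evolved rather than piling up facts. Complex concepts are explained plainly, yet the explanations get to the heart of the matter without losing depth. The references at the end of each chapter are of very high quality and make an excellent index for studying each topic further.
This book provides a sound conceptual framework for designing, implementing, and evaluating data systems. After reading and understanding it, readers can easily see through most technical hype and hold their own in arguments with self-styled experts.
This is the best technical book the translator read in 2017, and it would be a real pity for such a good book to have no Chinese translation. Though my abilities are modest, I would like to contribute to the spread of advanced technical culture. Translating it lets me study interesting technical topics in depth while also exercising my Chinese and English writing skills, so why not?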
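## Foreword
> Technology is a powerful force in our society. Data, software, and communication can be used for bad: to entrench unfair power structures, to undermine civil rights, and to protect vested interests. But they can also be used for good: to let underrepresented people make their voices heard, to create opportunities for everyone, and to avert disasters. This book is dedicated to everyone who puts technology to good use.
> Computing is pop culture, and pop culture holds a disdain for history. Pop culture is all about individual identity and a feeling of participation; it has nothing to do with cooperation. Pop culture lives in the present and has nothing to do with the past or the future either. I think the same is true of most people who write code (for money): they have no idea where their culture came from.
>
> Alan Kay, in an interview with *Dr. Dobb's Journal* (2012)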
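## Table of Contents
### [Preface](/v1/preface)
### [Part I: Foundations of Data Systems](/v1/part-i)
* [Chapter 1: Reliability, Scalability, and Maintainability](/v1/ch1)
* [Chapter 2: Data Models and Query Languages](/v1/ch2)
* [Chapter 3: Storage and Retrieval](/v1/ch3)
* [Chapter 4: Encoding and Evolution](/v1/ch4)
### [Part II: Distributed Data](/v1/part-ii)
* [Chapter 5: Replication](/v1/ch5)
* [Chapter 6: Partitioning](/v1/ch6)
* [Chapter 7: Transactions](/v1/ch7)
* [Chapter 8: The Trouble with Distributed Systems](/v1/ch8)
* [Chapter 9: Consistency and Consensus](/v1/ch9)
### [Part III: Derived Data](/v1/part-iii)
* [Chapter 10: Batch Processing](/v1/ch10)
* [Chapter 11: Stream Processing](/v1/ch11)
* [Chapter 12: The Future of Data Systems](/v1/ch12)
### [Glossary](/v1/glossary)
### [Colophon](/v1/colophon)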
---------
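## Legal Notice
According to the original author, a Simplified Chinese translation was already planned and was due to be completed by the end of 2018. [Where to buy](https://search.jd.com/Search?keyword=设计数据密集型应用)
The translator translated this book purely for **learning purposes** and out of **personal interest**, and does not seek any financial gain.
The translator retains the right of attribution for this translation; all other rights are subject to the claims of the original author and the publisher.
This translation is intended only for study and research. It must not be publicly distributed or used for commercial purposes. Readers who are able to read English books are encouraged to buy a legitimate copy.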
---------
## Contributions
0. Full-text proofreading by [@yingang](https://github.com/Vonng/ddia/commits?author=yingang)
1. [Corrections to the first draft of the preface](https://github.com/Vonng/ddia/commit/afb5edab55c62ed23474149f229677e3b42dfc2c) by [@seagullbird](https://github.com/Vonng/ddia/commits?author=seagullbird)
2. [Grammar and punctuation corrections for Chapter 1](https://github.com/Vonng/ddia/commit/973b12cd8f8fcdf4852f1eb1649ddd9d187e3644) by [@nevertiree](https://github.com/Vonng/ddia/commits?author=nevertiree)
3. [Partial corrections to Chapter 6](https://github.com/Vonng/ddia/commit/d4eb0852c0ec1e93c8aacc496c80b915bb1e6d48) and [first draft of Chapter 10](https://github.com/Vonng/ddia/commit/9de8dbd1bfe6fbb03b3bf6c1a1aa2291aed2490e) by [@MuAlex](https://github.com/Vonng/ddia/commits?author=MuAlex)
4. Part I introduction and ch2 corrections by [@jiajiadebug](https://github.com/Vonng/ddia/commits?author=jiajiadebug)
5. Glossary and the section of the colophon about the wild boar by [@Chowss](https://github.com/Vonng/ddia/commits?author=Chowss)
6. Traditional Chinese edition and conversion script by [@afunTW](https://github.com/afunTW)
7. Numerous translation fixes by [@songzhibin97](https://github.com/Vonng/ddia/commits?author=songzhibin97) [@MamaShip](https://github.com/Vonng/ddia/commits?author=MamaShip) [@FangYuan33](https://github.com/Vonng/ddia/commits?author=FangYuan33)
8. [Thanks to everyone who contributed and offered suggestions](/contrib):
Pull Requests & Issues
| ISSUE & Pull Requests | USER | Title |
|-------------------------------------------------|------------------------------------------------------------|----------------------------------------------------------------|
| [386](https://github.com/Vonng/ddia/pull/386) | [@uncle-lv](https://github.com/uncle-lv) | ch2: 优化一处翻译 |
| [384](https://github.com/Vonng/ddia/pull/384) | [@PanggNOTlovebean](https://github.com/PanggNOTlovebean) | docs: 优化中文文档的措辞和表达 |
| [383](https://github.com/Vonng/ddia/pull/383) | [@PanggNOTlovebean](https://github.com/PanggNOTlovebean) | docs: 修正 ch4 中的术语和表达错误 |
| [382](https://github.com/Vonng/ddia/pull/382) | [@uncle-lv](https://github.com/uncle-lv) | ch1: 优化一处翻译 |
| [381](https://github.com/Vonng/ddia/pull/381) | [@Max-Tortoise](https://github.com/Max-Tortoise) | ch4: 修正一处术语不完整问题 |
| [377](https://github.com/Vonng/ddia/pull/377) | [@huang06](https://github.com/huang06) | 优化翻译术语 |
| [375](https://github.com/Vonng/ddia/issues/375) | [@z-soulx](https://github.com/z-soulx) | 对于是否100%全中文翻译的必要性讨论?个人-没必要100%,特别是“名词”,有原单词更加适合it人员 |
| [371](https://github.com/Vonng/ddia/pull/371) | [@lewiszlw](https://github.com/lewiszlw) | CPU core -> CPU 核心 |
| [369](https://github.com/Vonng/ddia/pull/369) | [@bbwang-gl](https://github.com/bbwang-gl) | ch7: 可串行化快照隔离检测一个事务何时修改另一个事务的读取 |
| [368](https://github.com/Vonng/ddia/pull/368) | [@yhao3](https://github.com/yhao3) | 更新 zh-tw.py 与 zh-tw 内容 |
| [367](https://github.com/Vonng/ddia/pull/367) | [@yhao3](https://github.com/yhao3) | 修正拼写、格式和标点问题 |
| [366](https://github.com/Vonng/ddia/pull/366) | [@yangshangde](https://github.com/yangshangde) | ch8: 将“电源失败”改为“电源失效” |
| [365](https://github.com/Vonng/ddia/pull/365) | [@xyohn](https://github.com/xyohn) | ch1: 优化“存储与计算分离”相关翻译 |
| [364](https://github.com/Vonng/ddia/issues/364) | [@xyohn](https://github.com/xyohn) | ch1: 优化“存储与计算分离”相关翻译 |
| [363](https://github.com/Vonng/ddia/pull/363) | [@xyohn](https://github.com/xyohn) | #362: 优化一处翻译 |
| [362](https://github.com/Vonng/ddia/issues/362) | [@xyohn](https://github.com/xyohn) | ch1: 优化一处翻译 |
| [359](https://github.com/Vonng/ddia/pull/359) | [@c25423](https://github.com/c25423) | ch10: 修正一处拼写错误 |
| [358](https://github.com/Vonng/ddia/pull/358) | [@lewiszlw](https://github.com/lewiszlw) | ch4: 修正一处拼写错误 |
| [356](https://github.com/Vonng/ddia/pull/356) | [@lewiszlw](https://github.com/lewiszlw) | ch2: 修正一处标点错误 |
| [355](https://github.com/Vonng/ddia/pull/355) | [@DuroyGeorge](https://github.com/DuroyGeorge) | ch12: 修正一处格式错误 |
| [354](https://github.com/Vonng/ddia/pull/354) | [@justlorain](https://github.com/justlorain) | ch7: 修正一处参考链接 |
| [353](https://github.com/Vonng/ddia/pull/353) | [@fantasyczl](https://github.com/fantasyczl) | ch3&9: 修正两处引用错误 |
| [352](https://github.com/Vonng/ddia/pull/352) | [@fantasyczl](https://github.com/fantasyczl) | 支持输出为 EPUB 格式 |
| [349](https://github.com/Vonng/ddia/pull/349) | [@xiyihan0](https://github.com/xiyihan0) | ch1: 修正一处格式错误 |
| [348](https://github.com/Vonng/ddia/pull/348) | [@omegaatt36](https://github.com/omegaatt36) | ch3: 修正一处图像链接 |
| [346](https://github.com/Vonng/ddia/issues/346) | [@Vermouth1995](https://github.com/Vermouth1995) | ch1: 优化一处翻译 |
| [343](https://github.com/Vonng/ddia/pull/343) | [@kehao-chen](https://github.com/kehao-chen) | ch10: 优化一处翻译 |
| [341](https://github.com/Vonng/ddia/pull/341) | [@YKIsTheBest](https://github.com/YKIsTheBest) | ch3: 优化两处翻译 |
| [340](https://github.com/Vonng/ddia/pull/340) | [@YKIsTheBest](https://github.com/YKIsTheBest) | ch2: 优化多处翻译 |
| [338](https://github.com/Vonng/ddia/pull/338) | [@YKIsTheBest](https://github.com/YKIsTheBest) | ch1: 优化一处翻译 |
| [335](https://github.com/Vonng/ddia/pull/335) | [@kimi0230](https://github.com/kimi0230) | 修正一处繁体中文错误 |
| [334](https://github.com/Vonng/ddia/pull/334) | [@soulrrrrr](https://github.com/soulrrrrr) | ch2: 修正一处繁体中文错误 |
| [332](https://github.com/Vonng/ddia/pull/332) | [@justlorain](https://github.com/justlorain) | ch5: 修正一处翻译错误 |
| [331](https://github.com/Vonng/ddia/pull/331) | [@Lyianu](https://github.com/Lyianu) | ch9: 更正几处拼写错误 |
| [330](https://github.com/Vonng/ddia/pull/330) | [@Lyianu](https://github.com/Lyianu) | ch7: 优化一处翻译 |
| [329](https://github.com/Vonng/ddia/issues/329) | [@Lyianu](https://github.com/Lyianu) | ch6: 指出一处翻译错误 |
| [328](https://github.com/Vonng/ddia/pull/328) | [@justlorain](https://github.com/justlorain) | ch4: 更正一处翻译遗漏 |
| [326](https://github.com/Vonng/ddia/pull/326) | [@liangGTY](https://github.com/liangGTY) | ch1: 优化一处翻译 |
| [323](https://github.com/Vonng/ddia/pull/323) | [@marvin263](https://github.com/marvin263) | ch5: 优化一处翻译 |
| [322](https://github.com/Vonng/ddia/pull/322) | [@marvin263](https://github.com/marvin263) | ch8: 优化一处翻译 |
| [304](https://github.com/Vonng/ddia/pull/304) | [@spike014](https://github.com/spike014) | ch11: 优化一处翻译 |
| [298](https://github.com/Vonng/ddia/pull/298) | [@Makonike](https://github.com/Makonike) | ch11&12: 修正两处错误 |
| [284](https://github.com/Vonng/ddia/pull/284) | [@WAangzE](https://github.com/WAangzE) | ch4: 更正一处列表错误 |
| [283](https://github.com/Vonng/ddia/pull/283) | [@WAangzE](https://github.com/WAangzE) | ch3: 更正一处错别字 |
| [282](https://github.com/Vonng/ddia/pull/282) | [@WAangzE](https://github.com/WAangzE) | ch2: 更正一处公式问题 |
| [281](https://github.com/Vonng/ddia/pull/281) | [@lyuxi99](https://github.com/lyuxi99) | 更正多处内部链接错误 |
| [280](https://github.com/Vonng/ddia/pull/280) | [@lyuxi99](https://github.com/lyuxi99) | ch9: 更正内部链接错误 |
| [279](https://github.com/Vonng/ddia/issues/279) | [@codexvn](https://github.com/codexvn) | ch9: 指出公式在 GitHub Pages 显示的问题 |
| [278](https://github.com/Vonng/ddia/pull/278) | [@LJlkdskdjflsa](https://github.com/LJlkdskdjflsa) | 发现了繁体中文版本中的错误翻译 |
| [275](https://github.com/Vonng/ddia/pull/275) | [@117503445](https://github.com/117503445) | 更正 LICENSE 链接 |
| [274](https://github.com/Vonng/ddia/pull/274) | [@uncle-lv](https://github.com/uncle-lv) | ch7: 修正错别字 |
| [273](https://github.com/Vonng/ddia/pull/273) | [@Sdot-Python](https://github.com/Sdot-Python) | ch7: 统一了 write skew 的翻译 |
| [271](https://github.com/Vonng/ddia/pull/271) | [@Makonike](https://github.com/Makonike) | ch6: 统一了 rebalancing 的翻译 |
| [270](https://github.com/Vonng/ddia/pull/270) | [@Ynjxsjmh](https://github.com/Ynjxsjmh) | ch7: 修正不一致的翻译 |
| [263](https://github.com/Vonng/ddia/pull/263) | [@zydmayday](https://github.com/zydmayday) | ch5: 修正译文中的重复单词 |
| [260](https://github.com/Vonng/ddia/pull/260) | [@haifeiWu](https://github.com/haifeiWu) | ch4: 修正部分不准确的翻译 |
| [258](https://github.com/Vonng/ddia/pull/258) | [@bestgrc](https://github.com/bestgrc) | ch3: 修正一处翻译错误 |
| [257](https://github.com/Vonng/ddia/pull/257) | [@UnderSam](https://github.com/UnderSam) | ch8: 修正一处拼写错误 |
| [256](https://github.com/Vonng/ddia/pull/256) | [@AlphaWang](https://github.com/AlphaWang) | ch7: 修正“可串行化”相关内容的多处翻译不当 |
| [255](https://github.com/Vonng/ddia/pull/255) | [@AlphaWang](https://github.com/AlphaWang) | ch7: 修正“可重复读”相关内容的多处翻译不当 |
| [253](https://github.com/Vonng/ddia/pull/253) | [@AlphaWang](https://github.com/AlphaWang) | ch7: 修正“读已提交”相关内容的多处翻译不当 |
| [246](https://github.com/Vonng/ddia/pull/246) | [@derekwu0101](https://github.com/derekwu0101) | ch3: 修正繁体中文的转译错误 |
| [245](https://github.com/Vonng/ddia/pull/245) | [@skyran1278](https://github.com/skyran1278) | ch12: 修正繁体中文的转译错误 |
| [244](https://github.com/Vonng/ddia/pull/244) | [@Axlgrep](https://github.com/Axlgrep) | ch9: 修正不通顺的翻译 |
| [242](https://github.com/Vonng/ddia/pull/242) | [@lynkeib](https://github.com/lynkeib) | ch9: 修正不通顺的翻译 |
| [241](https://github.com/Vonng/ddia/pull/241) | [@lynkeib](https://github.com/lynkeib) | ch8: 修正不正确的公式格式 |
| [240](https://github.com/Vonng/ddia/pull/240) | [@8da2k](https://github.com/8da2k) | ch9: 修正不通顺的翻译 |
| [239](https://github.com/Vonng/ddia/pull/239) | [@BeBraveBeCurious](https://github.com/BeBraveBeCurious) | ch7: 修正不一致的翻译 |
| [237](https://github.com/Vonng/ddia/pull/237) | [@zhangnew](https://github.com/zhangnew) | ch3: 修正错误的图片链接 |
| [229](https://github.com/Vonng/ddia/pull/229) | [@lis186](https://github.com/lis186) | 指出繁体中文的转译错误:复杂 |
| [226](https://github.com/Vonng/ddia/pull/226) | [@chroming](https://github.com/chroming) | ch1: 修正导航栏中的章节名称 |
| [220](https://github.com/Vonng/ddia/pull/220) | [@skyran1278](https://github.com/skyran1278) | ch9: 修正线性一致的繁体中文翻译 |
| [194](https://github.com/Vonng/ddia/pull/194) | [@BeBraveBeCurious](https://github.com/BeBraveBeCurious) | ch4: 修正错误的翻译 |
| [193](https://github.com/Vonng/ddia/pull/193) | [@BeBraveBeCurious](https://github.com/BeBraveBeCurious) | ch4: 优化译文 |
| [192](https://github.com/Vonng/ddia/pull/192) | [@BeBraveBeCurious](https://github.com/BeBraveBeCurious) | ch4: 修正不一致和不通顺的翻译 |
| [190](https://github.com/Vonng/ddia/pull/190) | [@Pcrab](https://github.com/Pcrab) | ch1: 修正不准确的翻译 |
| [187](https://github.com/Vonng/ddia/pull/187) | [@narojay](https://github.com/narojay) | ch9: 修正生硬的翻译 |
| [186](https://github.com/Vonng/ddia/pull/186) | [@narojay](https://github.com/narojay) | ch8: 修正错别字 |
| [185](https://github.com/Vonng/ddia/issues/185) | [@8da2k](https://github.com/8da2k) | 指出小标题跳转的问题 |
| [184](https://github.com/Vonng/ddia/pull/184) | [@DavidZhiXing](https://github.com/DavidZhiXing) | ch10: 修正失效的网址 |
| [183](https://github.com/Vonng/ddia/pull/183) | [@OneSizeFitsQuorum](https://github.com/OneSizeFitsQuorum) | ch8: 修正错别字 |
| [182](https://github.com/Vonng/ddia/issues/182) | [@lroolle](https://github.com/lroolle) | 建议docsify的主题风格 |
| [181](https://github.com/Vonng/ddia/pull/181) | [@YunfengGao](https://github.com/YunfengGao) | ch2: 修正翻译错误 |
| [180](https://github.com/Vonng/ddia/pull/180) | [@skyran1278](https://github.com/skyran1278) | ch3: 指出繁体中文的转译错误 |
| [177](https://github.com/Vonng/ddia/pull/177) | [@exzhawk](https://github.com/exzhawk) | 支持 Github Pages 里的公式显示 |
| [176](https://github.com/Vonng/ddia/pull/176) | [@haifeiWu](https://github.com/haifeiWu) | ch2: 语义网相关翻译更正 |
| [175](https://github.com/Vonng/ddia/pull/175) | [@cwr31](https://github.com/cwr31) | ch7: 不变式相关翻译更正 |
| [174](https://github.com/Vonng/ddia/pull/174) | [@BeBraveBeCurious](https://github.com/BeBraveBeCurious) | README & preface: 更正不正确的中文用词和标点符号 |
| [173](https://github.com/Vonng/ddia/pull/173) | [@ZvanYang](https://github.com/ZvanYang) | ch12: 修正不完整的翻译 |
| [171](https://github.com/Vonng/ddia/pull/171) | [@ZvanYang](https://github.com/ZvanYang) | ch12: 修正重复的译文 |
| [169](https://github.com/Vonng/ddia/pull/169) | [@ZvanYang](https://github.com/ZvanYang) | ch12: 更正不太通顺的翻译 |
| [166](https://github.com/Vonng/ddia/pull/166) | [@bp4m4h94](https://github.com/bp4m4h94) | ch1: 发现错误的文献索引 |
| [164](https://github.com/Vonng/ddia/pull/164) | [@DragonDriver](https://github.com/DragonDriver) | preface: 更正错误的标点符号 |
| [163](https://github.com/Vonng/ddia/pull/163) | [@llmmddCoder](https://github.com/llmmddCoder) | ch1: 更正错误字 |
| [160](https://github.com/Vonng/ddia/pull/160) | [@Zhayhp](https://github.com/Zhayhp) | ch2: 建议将 network model 翻译为网状模型 |
| [159](https://github.com/Vonng/ddia/pull/159) | [@1ess](https://github.com/1ess) | ch4: 更正错误字 |
| [157](https://github.com/Vonng/ddia/pull/157) | [@ZvanYang](https://github.com/ZvanYang) | ch7: 更正不太通顺的翻译 |
| [155](https://github.com/Vonng/ddia/pull/155) | [@ZvanYang](https://github.com/ZvanYang) | ch7: 更正不太通顺的翻译 |
| [153](https://github.com/Vonng/ddia/pull/153) | [@DavidZhiXing](https://github.com/DavidZhiXing) | ch9: 修正缩略图的错别字 |
| [152](https://github.com/Vonng/ddia/pull/152) | [@ZvanYang](https://github.com/ZvanYang) | ch7: 除重->去重 |
| [151](https://github.com/Vonng/ddia/pull/151) | [@ZvanYang](https://github.com/ZvanYang) | ch5: 修订sibling相关的翻译 |
| [147](https://github.com/Vonng/ddia/pull/147) | [@ZvanYang](https://github.com/ZvanYang) | ch5: 更正一处不准确的翻译 |
| [145](https://github.com/Vonng/ddia/pull/145) | [@Hookey](https://github.com/Hookey) | 识别了当前简繁转译过程中处理不当的地方,暂通过转换脚本规避 |
| [144](https://github.com/Vonng/ddia/issues/144) | [@secret4233](https://github.com/secret4233) | ch7: 不翻译`next-key locking` |
| [143](https://github.com/Vonng/ddia/issues/143) | [@imcheney](https://github.com/imcheney) | ch3: 更新残留的机翻段落 |
| [142](https://github.com/Vonng/ddia/issues/142) | [@XIJINIAN](https://github.com/XIJINIAN) | 建议去除段首的制表符 |
| [141](https://github.com/Vonng/ddia/issues/141) | [@Flyraty](https://github.com/Flyraty) | ch5: 发现一处错误格式的章节引用 |
| [140](https://github.com/Vonng/ddia/pull/140) | [@Bowser1704](https://github.com/Bowser1704) | ch5: 修正章节Summary中多处不通顺的翻译 |
| [139](https://github.com/Vonng/ddia/pull/139) | [@Bowser1704](https://github.com/Bowser1704) | ch2&ch3: 修正多处不通顺的或错误的翻译 |
| [137](https://github.com/Vonng/ddia/pull/137) | [@fuxuemingzhu](https://github.com/fuxuemingzhu) | ch5&ch6: 优化多处不通顺的或错误的翻译 |
| [134](https://github.com/Vonng/ddia/pull/134) | [@fuxuemingzhu](https://github.com/fuxuemingzhu) | ch4: 优化多处不通顺的或错误的翻译 |
| [133](https://github.com/Vonng/ddia/pull/133) | [@fuxuemingzhu](https://github.com/fuxuemingzhu) | ch3: 优化多处错误的或不通顺的翻译 |
| [132](https://github.com/Vonng/ddia/pull/132) | [@fuxuemingzhu](https://github.com/fuxuemingzhu) | ch3: 优化一处容易产生歧义的翻译 |
| [131](https://github.com/Vonng/ddia/pull/131) | [@rwwg4](https://github.com/rwwg4) | ch6: 修正两处错误的翻译 |
| [129](https://github.com/Vonng/ddia/pull/129) | [@anaer](https://github.com/anaer) | ch4: 修正两处强调文本和四处代码变量名称 |
| [128](https://github.com/Vonng/ddia/pull/128) | [@meilin96](https://github.com/meilin96) | ch5: 修正一处错误的引用 |
| [126](https://github.com/Vonng/ddia/pull/126) | [@cwr31](https://github.com/cwr31) | ch10: 修正一处错误的翻译(功能 -> 函数) |
| [125](https://github.com/Vonng/ddia/pull/125) | [@dch1228](https://github.com/dch1228) | ch2: 优化 how best 的翻译(如何以最佳方式) |
| [123](https://github.com/Vonng/ddia/pull/123) | [@yingang](https://github.com/yingang) | translation updates (chapter 9, TOC in readme, glossary, etc.) |
| [121](https://github.com/Vonng/ddia/pull/121) | [@yingang](https://github.com/yingang) | translation updates (chapter 5 to chapter 8) |
| [120](https://github.com/Vonng/ddia/pull/120) | [@jiong-han](https://github.com/jiong-han) | Typo fix: 呲之以鼻 -> 嗤之以鼻 |
| [119](https://github.com/Vonng/ddia/pull/119) | [@cclauss](https://github.com/cclauss) | Streamline file operations in convert() |
| [118](https://github.com/Vonng/ddia/pull/118) | [@yingang](https://github.com/yingang) | translation updates (chapter 2 to chapter 4) |
| [117](https://github.com/Vonng/ddia/pull/117) | [@feeeei](https://github.com/feeeei) | 统一每章的标题格式 |
| [115](https://github.com/Vonng/ddia/pull/115) | [@NageNalock](https://github.com/NageNalock) | 第七章病句修改: 重复词语 |
| [114](https://github.com/Vonng/ddia/pull/114) | [@Sunt-ing](https://github.com/Sunt-ing) | Update README.md: correct the book name |
| [113](https://github.com/Vonng/ddia/pull/113) | [@lpxxn](https://github.com/lpxxn) | 修改语句 |
| [112](https://github.com/Vonng/ddia/pull/112) | [@ibyte2011](https://github.com/ibyte2011) | Update ch9.md |
| [110](https://github.com/Vonng/ddia/pull/110) | [@lpxxn](https://github.com/lpxxn) | 读已写入数据 |
| [107](https://github.com/Vonng/ddia/pull/107) | [@abbychau](https://github.com/abbychau) | 單調鐘和好死还是赖活着 |
| [106](https://github.com/Vonng/ddia/pull/106) | [@enochii](https://github.com/enochii) | typo in ch2: fix braces typo |
| [105](https://github.com/Vonng/ddia/pull/105) | [@LiminCode](https://github.com/LiminCode) | Chronicle translation error |
| [104](https://github.com/Vonng/ddia/pull/104) | [@Sunt-ing](https://github.com/Sunt-ing) | several advice for better translation |
| [103](https://github.com/Vonng/ddia/pull/103) | [@Sunt-ing](https://github.com/Sunt-ing) | typo in ch4: should be 完成 rather than 完全 |
| [102](https://github.com/Vonng/ddia/pull/102) | [@Sunt-ing](https://github.com/Sunt-ing) | ch4: better-translation: 扼杀 → 破坏 |
| [101](https://github.com/Vonng/ddia/pull/101) | [@Sunt-ing](https://github.com/Sunt-ing) | typo in Ch4: should be "改变" rathr than "盖面" |
| [100](https://github.com/Vonng/ddia/pull/100) | [@LiminCode](https://github.com/LiminCode) | fix missing translation |
| [99 ](https://github.com/Vonng/ddia/pull/99) | [@mrdrivingduck](https://github.com/mrdrivingduck) | ch6: fix the word rebalancing |
| [98 ](https://github.com/Vonng/ddia/pull/98) | [@jacklightChen](https://github.com/jacklightChen) | fix ch7.md: fix wrong references |
| [97 ](https://github.com/Vonng/ddia/pull/97) | [@jenac](https://github.com/jenac) | 96 |
| [96 ](https://github.com/Vonng/ddia/pull/96) | [@PragmaTwice](https://github.com/PragmaTwice) | ch2: fix typo about 'may or may not be' |
| [95 ](https://github.com/Vonng/ddia/pull/95) | [@EvanMu96](https://github.com/EvanMu96) | fix translation of "the battle cry" in ch5 |
| [94 ](https://github.com/Vonng/ddia/pull/94) | [@kemingy](https://github.com/kemingy) | ch6: fix markdown and punctuations |
| [93 ](https://github.com/Vonng/ddia/pull/93) | [@kemingy](https://github.com/kemingy) | ch5: fix markdown and some typos |
| [92 ](https://github.com/Vonng/ddia/pull/92) | [@Gilbert1024](https://github.com/Gilbert1024) | Merge pull request #1 from Vonng/master |
| [88 ](https://github.com/Vonng/ddia/pull/88) | [@kemingy](https://github.com/kemingy) | fix typo for ch1, ch2, ch3, ch4 |
| [87 ](https://github.com/Vonng/ddia/pull/87) | [@wynn5a](https://github.com/wynn5a) | Update ch3.md |
| [86 ](https://github.com/Vonng/ddia/pull/86) | [@northmorn](https://github.com/northmorn) | Update ch1.md |
| [85 ](https://github.com/Vonng/ddia/pull/85) | [@sunbuhui](https://github.com/sunbuhui) | fix ch2.md: fix ch2 ambiguous translation |
| [84 ](https://github.com/Vonng/ddia/pull/84) | [@ganler](https://github.com/ganler) | Fix translation: use up |
| [83 ](https://github.com/Vonng/ddia/pull/83) | [@afunTW](https://github.com/afunTW) | Using OpenCC to convert from zh-cn to zh-tw |
| [82 ](https://github.com/Vonng/ddia/pull/82) | [@kangni](https://github.com/kangni) | fix gitbook url |
| [78 ](https://github.com/Vonng/ddia/pull/78) | [@hanyu2](https://github.com/hanyu2) | Fix unappropriated translation |
| [77 ](https://github.com/Vonng/ddia/pull/77) | [@Ozarklake](https://github.com/Ozarklake) | fix typo |
| [75 ](https://github.com/Vonng/ddia/pull/75) | [@2997ms](https://github.com/2997ms) | Fix typo |
| [74 ](https://github.com/Vonng/ddia/pull/74) | [@2997ms](https://github.com/2997ms) | Update ch9.md |
| [70 ](https://github.com/Vonng/ddia/pull/70) | [@2997ms](https://github.com/2997ms) | Update ch7.md |
| [67 ](https://github.com/Vonng/ddia/pull/67) | [@jiajiadebug](https://github.com/jiajiadebug) | fix issues in ch2 - ch9 and glossary |
| [66 ](https://github.com/Vonng/ddia/pull/66) | [@blindpirate](https://github.com/blindpirate) | Fix typo |
| [63 ](https://github.com/Vonng/ddia/pull/63) | [@haifeiWu](https://github.com/haifeiWu) | Update ch10.md |
| [62 ](https://github.com/Vonng/ddia/pull/62) | [@ych](https://github.com/ych) | fix ch1.md typesetting problem |
| [61 ](https://github.com/Vonng/ddia/pull/61) | [@xianlaioy](https://github.com/xianlaioy) | docs:钟-->种,去掉ou |
| [60 ](https://github.com/Vonng/ddia/pull/60) | [@Zombo1296](https://github.com/Zombo1296) | 否则 -> 或者 |
| [59 ](https://github.com/Vonng/ddia/pull/59) | [@AlexanderMisel](https://github.com/AlexanderMisel) | 呼叫->调用,显着->显著 |
| [58 ](https://github.com/Vonng/ddia/pull/58) | [@ibyte2011](https://github.com/ibyte2011) | Update ch8.md |
| [55 ](https://github.com/Vonng/ddia/pull/55) | [@saintube](https://github.com/saintube) | ch8: 修改链接错误 |
| [54 ](https://github.com/Vonng/ddia/pull/54) | [@Panmax](https://github.com/Panmax) | Update ch2.md |
| [53 ](https://github.com/Vonng/ddia/pull/53) | [@ibyte2011](https://github.com/ibyte2011) | Update ch9.md |
| [52 ](https://github.com/Vonng/ddia/pull/52) | [@hecenjie](https://github.com/hecenjie) | Update ch1.md |
| [51 ](https://github.com/Vonng/ddia/pull/51) | [@latavin243](https://github.com/latavin243) | fix 修正ch3 ch4几处翻译 |
| [50 ](https://github.com/Vonng/ddia/pull/50) | [@AlexZFX](https://github.com/AlexZFX) | 几个疏漏和格式错误 |
| [49 ](https://github.com/Vonng/ddia/pull/49) | [@haifeiWu](https://github.com/haifeiWu) | Update ch1.md |
| [48 ](https://github.com/Vonng/ddia/pull/48) | [@scaugrated](https://github.com/scaugrated) | fix typo |
| [47 ](https://github.com/Vonng/ddia/pull/47) | [@lzwill](https://github.com/lzwill) | Fixed typos in ch2 |
| [45 ](https://github.com/Vonng/ddia/pull/45) | [@zenuo](https://github.com/zenuo) | 删除一个多余的右括号 |
| [44 ](https://github.com/Vonng/ddia/pull/44) | [@akxxsb](https://github.com/akxxsb) | 修正第七章底部链接错误 |
| [43 ](https://github.com/Vonng/ddia/pull/43) | [@baijinping](https://github.com/baijinping) | "更假简单"->"更加简单" |
| [42 ](https://github.com/Vonng/ddia/pull/42) | [@tisonkun](https://github.com/tisonkun) | 修复 ch1 中的无序列表格式 |
| [38 ](https://github.com/Vonng/ddia/pull/38) | [@renjie-c](https://github.com/renjie-c) | 纠正多处的翻译小错误 |
| [37 ](https://github.com/Vonng/ddia/pull/37) | [@tankilo](https://github.com/tankilo) | fix translation mistakes in ch4.md |
| [36 ](https://github.com/Vonng/ddia/pull/36) | [@wwek](https://github.com/wwek) | 1.修复多个链接错误 2.名词优化修订 3.错误修订 |
| [35 ](https://github.com/Vonng/ddia/pull/35) | [@wwek](https://github.com/wwek) | fix ch7.md to ch8.md link error |
| [34 ](https://github.com/Vonng/ddia/pull/34) | [@wwek](https://github.com/wwek) | Merge pull request #1 from Vonng/master |
| [33 ](https://github.com/Vonng/ddia/pull/33) | [@wwek](https://github.com/wwek) | fix part-ii.md link error |
| [32 ](https://github.com/Vonng/ddia/pull/32) | [@JCYoky](https://github.com/JCYoky) | Update ch2.md |
| [31 ](https://github.com/Vonng/ddia/pull/31) | [@elsonLee](https://github.com/elsonLee) | Update ch7.md |
| [26 ](https://github.com/Vonng/ddia/pull/26) | [@yjhmelody](https://github.com/yjhmelody) | 修复一些明显错误 |
| [25 ](https://github.com/Vonng/ddia/pull/25) | [@lqbilbo](https://github.com/lqbilbo) | 修复链接错误 |
| [24 ](https://github.com/Vonng/ddia/pull/24) | [@artiship](https://github.com/artiship) | 修改词语顺序 |
| [23 ](https://github.com/Vonng/ddia/pull/23) | [@artiship](https://github.com/artiship) | 修正错别字 |
| [22 ](https://github.com/Vonng/ddia/pull/22) | [@artiship](https://github.com/artiship) | 纠正翻译错误 |
| [21 ](https://github.com/Vonng/ddia/pull/21) | [@zhtisi](https://github.com/zhtisi) | 修正目录和本章标题不符的情况 |
| [20 ](https://github.com/Vonng/ddia/pull/20) | [@rentiansheng](https://github.com/rentiansheng) | Update ch7.md |
| [19 ](https://github.com/Vonng/ddia/pull/19) | [@LHRchina](https://github.com/LHRchina) | 修复语句小bug |
| [16 ](https://github.com/Vonng/ddia/pull/16) | [@MuAlex](https://github.com/MuAlex) | Master |
| [15 ](https://github.com/Vonng/ddia/pull/15) | [@cg-zhou](https://github.com/cg-zhou) | Update translation progress |
| [14 ](https://github.com/Vonng/ddia/pull/14) | [@cg-zhou](https://github.com/cg-zhou) | Translate glossary |
| [13 ](https://github.com/Vonng/ddia/pull/13) | [@cg-zhou](https://github.com/cg-zhou) | 详细修改了后记中和印度野猪相关的描述 |
| [12 ](https://github.com/Vonng/ddia/pull/12) | [@ibyte2011](https://github.com/ibyte2011) | 修改了部分翻译 |
| [11 ](https://github.com/Vonng/ddia/pull/11) | [@jiajiadebug](https://github.com/jiajiadebug) | ch2 100% |
| [10 ](https://github.com/Vonng/ddia/pull/10) | [@jiajiadebug](https://github.com/jiajiadebug) | ch2 20% |
| [9 ](https://github.com/Vonng/ddia/pull/9) | [@jiajiadebug](https://github.com/jiajiadebug) | Preface, ch1, part-i translation minor fixes |
| [7 ](https://github.com/Vonng/ddia/pull/7) | [@MuAlex](https://github.com/MuAlex) | Ch6 translation pull request |
| [6 ](https://github.com/Vonng/ddia/pull/6) | [@MuAlex](https://github.com/MuAlex) | Ch6 change version1 |
| [5 ](https://github.com/Vonng/ddia/pull/5) | [@nevertiree](https://github.com/nevertiree) | Chapter 01语法微调 |
| [2 ](https://github.com/Vonng/ddia/pull/2) | [@seagullbird](https://github.com/seagullbird) | 序言初翻 |
---------
## 许可证
This project is licensed under [CC-BY 4.0](https://github.com/Vonng/ddia/blob/master/LICENSE). The full terms are available here:
- [Attribution 4.0 International CC BY 4.0 Deed (Simplified Chinese)](https://creativecommons.org/licenses/by/4.0/deed.zh-hans)
- [Attribution 4.0 International CC BY 4.0](https://creativecommons.org/licenses/by/4.0/deed.en)
================================================
FILE: content/v1/ch1.md
================================================
---
title: "第一章:可靠性、可伸缩性和可维护性"
linkTitle: "1. 可靠性、可伸缩性和可维护性"
weight: 101
breadcrumbs: false
---

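> The Internet was done so well that most people think of it as a natural resource like the Pacific Ocean, rather than something that was man-made. When was the last time a technology with a scale like that was so error-free?
>
> [Alan Kay](http://www.drdobbs.com/architecture-and-design/interview-with-alan-kay/240003442), in an interview with *Dr Dobb's Journal* (2012)

Many of today's applications are **data-intensive** rather than **compute-intensive**. For these applications the CPU is seldom the bottleneck; the harder problems tend to be the volume of data, its complexity, and how quickly it changes.
Data-intensive applications are usually assembled from standard components that supply commonly needed functionality. For instance, many applications need to:
- Store data so that the application itself, or another one, can find it again later (*databases*)
- Remember the results of expensive operations so that reads are faster (*caches*)
- Let users search data by keyword or filter it in various ways (*search indexes*)
- Send messages to other processes for asynchronous handling (*stream processing*)
- Periodically process large batches of accumulated data (*batch processing*)
If these functions sound unremarkable, that is because such **data systems** are a very successful abstraction: we use them constantly without giving them a second thought. Few engineers would consider writing a storage engine from scratch, because a database is already a perfectly adequate tool for building applications.
Reality is less simple, though. Applications differ in their requirements, so database systems come in many varieties with many different characteristics. There are many ways to implement a cache, several ways to build a search index, and so on. Before building an application we therefore still have to work out which tools and approaches fit the job best. And when no single tool solves the problem, combining several of them can be difficult.
This book is a tour of the principles, practice, and application of data systems, and of how to design data-intensive applications. We will look at what different tools have in common, what sets them apart, and how each achieves its characteristics.
This chapter begins with the basic goals we are aiming for: data systems that are reliable, scalable, and maintainable. We will clarify what these words mean, outline ways of reasoning about them, and review some background needed for later chapters. The following chapters then work through, layer by layer, the design decisions that arise when building a data-intensive application.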
## 关于数据系统的思考
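We usually think of databases, message queues, caches, and similar tools as belonging to clearly distinct categories. A database and a message queue look superficially alike, since both hold data for some period of time, but their access patterns are very different, which implies very different performance characteristics and implementations.
So why lump them all together under the umbrella term **data systems**?
First, many new tools for storing and processing data have appeared in recent years. They are optimized for a variety of use cases and no longer fit neatly into the traditional categories【1】. The boundaries between categories are blurring: a datastore can be used as a message queue (Redis), and a message queue can offer database-like durability guarantees (Apache Kafka).
Second, more and more applications have requirements so strict or so broad that no single tool can meet all of their data processing and storage needs. Instead, the overall work is split into tasks that a single tool can each perform efficiently, and application code stitches the tools together.
For example, if caching (an application-managed cache layer such as Memcached) and full-text search (a search server such as Elasticsearch or Solr) are split off from the main database, then keeping the cache and the index in sync with the main database is usually the responsibility of application code. [Figure 1-1](/v1/ddia_0101.png) shows what such an architecture might look like (later chapters cover the details).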

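**Figure 1-1. One possible architecture for a data system that combines several components**
When you combine several tools to provide a service, the service's interface, or **application programming interface (API)**, usually hides these implementation details from clients. In effect you have built a new, special-purpose data system out of smaller, general-purpose components. This composite system may offer specific guarantees, for example that the cache is invalidated or updated on writes so that external clients see consistent results. You are now a data system designer as well as an application developer.
Designing a data system or service raises many hard questions. How do you keep data correct and complete when something goes wrong? How do you give clients consistently good performance when parts of the system are degraded? How do you scale to handle growing load? What does a good API look like?
Many factors influence the design of a data system, including the skills and experience of the people involved, legacy issues and path dependencies, delivery deadlines, the organization's tolerance for risk, and regulatory constraints. All of them have to be weighed case by case.
This book concentrates on three concerns that matter in most software systems:
Reliability
: The system keeps working correctly (performing the right function at the expected level of performance) in the face of **adversity** such as hardware faults, software faults, and human error. See “[Reliability](#可靠性)”.
Scalability
: There are reasonable ways of dealing with the system's growth (in data volume, traffic, or complexity). See “[Scalability](#可伸缩性)”.
Maintainability
: Over the system's lifetime, many different people (engineers and operators) can work on it productively, both keeping its current behavior and adapting it to new use cases. See “[Maintainability](#可维护性)”.
People often invoke these words without a clear idea of what they mean. For the sake of careful engineering, the rest of this chapter examines what reliability, scalability, and maintainability mean. Later chapters study the techniques, architectures, and algorithms used to achieve them.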
## 可靠性
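Everyone has an intuitive sense of whether something is reliable. For software, typical expectations include:
* The application performs the function the user expects.
* It tolerates user mistakes and unexpected ways of using the software.
* Its performance is good enough under the expected load and data volume.
* The system prevents unauthorized access and abuse.
If all of this together means “working correctly”, then reliability can be understood roughly as “continuing to work correctly even when things go wrong”.
The things that can go wrong are called **faults**, and a system that anticipates and copes with faults is called **fault-tolerant** or **resilient**. The term fault-tolerant can be misleading, because it suggests that a system can tolerate every possible fault, which is not feasible in practice. If the whole planet (and every server on it) were swallowed by a black hole, tolerating that fault would require hosting in space, and good luck getting that budget approved. It therefore only makes sense to talk about tolerating specific types of faults.
Note that a **fault** is not the same as a **failure**【2】. A fault is usually defined as one part of the system deviating from its specification, whereas a failure is the system as a whole ceasing to provide service to the user. The probability of a fault cannot be reduced to zero, so it is usually best to design fault-tolerance mechanisms that keep faults from turning into failures. This book presents several techniques for building reliable systems from unreliable parts.
Counterintuitively, in fault-tolerant systems it can make sense to **increase** the fault rate by triggering faults deliberately, for example by killing individual processes at random and without warning. Many critical bugs are in fact caused by poor error handling【3】, so inducing faults on purpose keeps the fault-tolerance machinery continually exercised and tested, which raises confidence that naturally occurring faults will be handled correctly. Netflix's *Chaos Monkey*【4】 is one example of this approach.
We generally prefer tolerating faults to preventing them, but there are cases where prevention is better than cure, for instance when no cure exists. Security is one such case: if an attacker compromises a system and obtains sensitive data, that cannot be undone. This book, however, mainly deals with the kinds of faults that can be recovered from, as described in the following sections.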
### 硬件故障
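When we think about causes of system failure, **hardware faults** come to mind first. Hard disks crash, RAM goes bad, the power fails, someone unplugs the wrong network cable. Anyone who has worked with large datacenters will tell you that once you have a lot of machines, these things happen **all the time**.
Hard disks are reported to have a **mean time to failure (MTTF)** of roughly 10 to 50 years【5】【6】. In expectation, then, a storage cluster with 10,000 disks should see on the order of one disk failure per day.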
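The estimate above follows from simple arithmetic. The sketch below assumes that disks fail independently at a constant rate, which is a simplification; the function name is chosen here for illustration only.

```python
def expected_failures_per_day(num_disks: int, mttf_years: float) -> float:
    """Expected number of disk failures per day.

    Assumes independent failures at a constant rate of 1 / MTTF per disk,
    and 365 days per year.
    """
    return num_disks / (mttf_years * 365)


if __name__ == "__main__":
    # Evaluate both ends of the reported MTTF range for a 10,000-disk cluster.
    for mttf in (10, 50):
        rate = expected_failures_per_day(10_000, mttf)
        print(f"MTTF {mttf} years: about {rate:.2f} failures per day")
```

Across the 10 to 50 year range the result stays between roughly half a failure and three failures per day, so "about one per day" is an order-of-magnitude figure rather than an exact one.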
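The usual first response is to add redundancy to individual hardware components in order to lower the system's failure rate. Disks can be arranged in RAID, servers can have dual power supplies and hot-swappable CPUs, and datacenters can have batteries and diesel generators for backup power. When one component dies, a redundant one takes over immediately. This cannot completely prevent hardware problems from causing failures, but it is well understood and is often enough to keep a machine running without interruption for years.
Until recently, hardware redundancy was sufficient for most applications, because it makes the total failure of a single machine fairly rare. As long as a backup can be restored onto a new machine quickly, the downtime is not catastrophic for most applications. Only the small number of applications for which high availability is essential needed redundancy across multiple machines.
However, as data volumes and computing demands grow, more applications use large numbers of machines, which raises the rate of hardware faults proportionally. In addition, on some cloud platforms such as Amazon Web Services (AWS), it is fairly common for virtual machine instances to become unavailable without warning【7】, because these platforms are designed to favor **flexibility** and **elasticity**[^i] over single-machine reliability.
Adding software fault-tolerance techniques on top of hardware redundancy takes a system a step further, toward tolerating the loss of an entire machine. Such systems also have operational advantages. A single-server system needs planned downtime whenever the machine has to be rebooted (for example, to apply operating system security patches), whereas a system that tolerates machine failure can be patched one node at a time without taking the whole system down.
[^i]: Defined in the section “[Approaches for coping with load](#应对负载的方法)”.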
### 软件错误
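We usually regard hardware faults as random and independent of one another: a disk failing on one machine does not mean a disk on another machine will fail. There may be weak correlations (for example, through a shared cause such as the temperature in a server rack), but large numbers of hardware components failing at the same moment is very rare.
Another class of fault is the **systematic error** within the system【8】. These faults are hard to anticipate, and because they are correlated across nodes, they tend to cause far more system failures than uncorrelated hardware faults【5】. Examples include:
* A bug that crashes every instance of an application server when given a particular bad input. During the leap second on June 30, 2012, for instance, a bug in the Linux kernel caused many applications to hang at the same time【9】.
* A runaway process that exhausts a shared resource such as CPU time, memory, disk space, or network bandwidth.
* A service the system depends on that slows down, stops responding, or starts returning corrupted responses.
* Cascading failures, in which a small fault in one component triggers a fault in another, which in turn triggers more【10】.
The bugs behind these software faults often lie dormant for a long time until an unusual set of circumstances triggers them. In those circumstances it turns out that the software was making some assumption about its environment; the assumption usually holds, but for some reason it eventually stops holding【11】.
There is no quick fix for systematic faults in software, but many small measures help: thinking carefully about assumptions and interactions in the system; thorough testing; process isolation; allowing processes to crash and restart; and measuring, monitoring, and analyzing system behavior in production. If a system is expected to provide some guarantee (for example, that the number of messages entering a message queue equals the number leaving it), it can check itself continuously while running and raise an alert when it finds a **discrepancy**【12】.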
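As a minimal sketch of the self-checking idea, the toy in-memory queue below counts messages in and out and verifies the invariant on demand. The class and method names are hypothetical, and the invariant is stated for a running queue as "enqueued equals dequeued plus pending", which is an interpretation made here; a real system would export these counters to its monitoring and alert on a mismatch.

```python
from collections import deque


class CountingQueue:
    """Toy FIFO queue that tracks how many messages entered and left it."""

    def __init__(self):
        self._items = deque()
        self.enqueued = 0  # total messages ever put on the queue
        self.dequeued = 0  # total messages ever taken off the queue

    def put(self, message):
        self._items.append(message)
        self.enqueued += 1

    def get(self):
        message = self._items.popleft()
        self.dequeued += 1
        return message

    def invariant_holds(self) -> bool:
        """Every message that came in has either gone out or is still pending."""
        return self.enqueued == self.dequeued + len(self._items)


if __name__ == "__main__":
    queue = CountingQueue()
    queue.put("a")
    queue.put("b")
    queue.get()
    if not queue.invariant_holds():
        print("ALERT: message count discrepancy")
```

The check is cheap enough to run periodically in production, which is the point: the discrepancy is detected close to when it happens rather than much later.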
### 人为错误
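The engineers who design and build software systems are human, and so are the operators who keep them running. Even with the best intentions, humans are unreliable. One study of large internet services found that configuration errors by operators were the leading cause of outages, while hardware faults (servers or network) played a part in only 10-25% of outages【13】.
How can systems be made reliable when humans are not? The best systems combine several approaches:
* Design systems so that opportunities for error are minimized. Well-designed abstractions, APIs, and admin interfaces make it easy to do the right thing and hard to do the wrong thing. If the interfaces are too restrictive, however, people will work around them and lose the benefit, so this balance is hard to get right.
* **Decouple** the places where people make the most mistakes from the places where mistakes can cause failures. In particular, provide a fully featured non-production **sandbox** where people can explore and experiment safely with real data, without affecting real users.
* Test thoroughly at every level【3】, from unit tests through whole-system integration tests to manual tests. Automated testing is well understood and widely used, and it is especially valuable for covering **corner cases** that rarely occur in normal operation.
* Make recovery from human error quick and easy, to limit the impact of a failure. For example, make configuration changes fast to roll back, roll out new code gradually (so that an unexpected bug affects only a small share of users), and provide tools for recomputing data (in case an earlier computation turns out to be wrong).
* Set up detailed and clear monitoring, such as performance metrics and error rates. Other engineering disciplines call this **telemetry** (once a rocket has left the ground, telemetry is essential for tracking what is happening and for understanding failures). Monitoring gives early warning signals and lets us check whether any assumptions or constraints are being violated. When a problem does occur, metrics are invaluable for diagnosis.
* Apply good management practices and provide adequate training. This is a complex and important topic, but it is beyond the scope of this book.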
### 可靠性有多重要?
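Reliability is not only for nuclear power stations and air traffic control software; more mundane applications are expected to work reliably too. Bugs in business applications cost productivity (and carry legal risk if figures are reported incompletely), and outages of ecommerce sites can cause large losses of revenue and reputation.
Even in “noncritical” applications we have a responsibility to our users. Imagine a parent who keeps all their photos and videos of their children in your photo application【15】. How would they feel if the database were suddenly corrupted? Would they know how to restore it from a backup?
In some situations we may choose to trade reliability for lower development cost (for example, when prototyping a product for an unproven market) or lower operating cost (for example, for a service with a very thin profit margin). When we cut corners, though, we should be fully aware that we are doing so.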
## 可伸缩性
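A system that works reliably today will not necessarily work reliably in the future. A common cause of **degradation** is increased load: perhaps the system has grown from 10,000 concurrent users to 100,000, or from 1 million to 10 million. Perhaps it now processes far more data than it used to.
**Scalability** is the term for a system's ability to cope with increased load. Note that it is not a one-dimensional label that can be attached to a system: saying “X is scalable” or “Y does not scale” is meaningless. Discussing scalability instead means asking questions such as “If the system grows in a particular way, what are our options for coping with that growth?” and “How can we add computing resources to handle the additional load?”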
### 描述负载
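Before discussing growth (what happens if the load doubles?), we need a concise description of the system's current load. Load can be described by a few numbers called **load parameters**. The best choice of parameters depends on the system's architecture: requests per second to a web server, the ratio of reads to writes in a database, the number of simultaneously active users in a chat room, the cache hit rate, or something else. The average case may be what matters to you, or your bottleneck may be dominated by a few extreme cases.
To make this concrete, consider Twitter, using figures published in November 2012【16】. Two of Twitter's main operations are:
Post tweet
: A user publishes a new message to their followers (4.6k requests/sec on average, over 12k requests/sec at peak).
Home timeline
: A user views the tweets posted by the people they follow (300k requests/sec).
Handling 12,000 writes per second (the peak rate of posted tweets) would be fairly easy. Twitter's scaling challenge, however, comes mainly from **fan-out**[^ii] rather than from tweet volume: each user follows many people, and each user is followed by many people.
[^ii]: Fan-out: a term borrowed from electronic engineering, where it denotes the number of logic gate inputs connected to another gate's output. The output has to supply enough current to drive all the connected inputs. In transaction processing systems, we use it for the number of requests to other services that are needed to serve one incoming request.
Broadly, there are two ways to implement this pair of operations.
1. Posting a tweet simply inserts the new tweet into a global collection of tweets. When a user requests their home timeline, look up everyone they follow, find the tweets posted by each of those users, and merge them in time order. In a relational database such as the one in [Figure 1-2](/v1/ddia_0102.png), the query could be written like this: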
```sql
SELECT tweets.*, users.*
FROM tweets
JOIN users ON tweets.sender_id = users.id
JOIN follows ON follows.followee_id = users.id
WHERE follows.follower_id = current_user
```

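**Figure 1-2. A simple relational schema for implementing a Twitter home timeline**
2. Maintain a cache for each user's home timeline, like a mailbox of tweets for each recipient ([Figure 1-3](/v1/ddia_0103.png)). When a user posts a tweet, look up everyone who follows that user and insert the new tweet into each of their home timeline caches. Reading the home timeline is then cheap, because the result has been computed ahead of time.

**Figure 1-3. Twitter's data pipeline for delivering tweets to followers, with load parameters as of November 2012【16】**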
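The difference between the two approaches can be sketched with in-memory data structures. This is an illustrative toy under stated assumptions, not Twitter's implementation: the class and method names are invented here, a counter stands in for timestamps, and a real system would bound the timeline length and store everything durably.

```python
from collections import defaultdict


class TimelineService:
    """Toy model of both approaches to building a home timeline."""

    def __init__(self):
        self.followers = defaultdict(set)    # followee -> set of followers
        self.following = defaultdict(set)    # follower -> set of followees
        self.tweets = []                     # global collection: (ts, sender, text)
        self.home_cache = defaultdict(list)  # user -> precomputed home timeline
        self._clock = 0                      # logical timestamp

    def follow(self, follower, followee):
        self.followers[followee].add(follower)
        self.following[follower].add(followee)

    def post(self, sender, text):
        self._clock += 1
        tweet = (self._clock, sender, text)
        # Approach 1 write path: a single insert into the global collection.
        self.tweets.append(tweet)
        # Approach 2 write path: fan out to every follower's cached timeline.
        for follower in self.followers[sender]:
            self.home_cache[follower].append(tweet)
        return tweet

    def timeline_on_read(self, user):
        """Approach 1: find and merge followees' tweets at read time."""
        followees = self.following[user]
        return sorted((t for t in self.tweets if t[1] in followees), reverse=True)

    def timeline_from_cache(self, user):
        """Approach 2: the work was done at write time; just read the cache."""
        return sorted(self.home_cache[user], reverse=True)


if __name__ == "__main__":
    svc = TimelineService()
    svc.follow("alice", "bob")
    svc.follow("alice", "carol")
    svc.post("bob", "hello")
    svc.post("carol", "hi")
    print(svc.timeline_from_cache("alice"))
```

In this sketch the two read paths agree as long as follow relationships are created before the tweets are posted. Approach 1 scans the global collection on every read, while approach 2 pays one cache write per follower on every post.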
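The first version of Twitter used approach 1, but the system struggled to keep up with the load of home timeline queries, so the company moved to approach 2. Approach 2 works better because tweets are posted almost two orders of magnitude less often than home timelines are read, so in this case it pays to do more work at write time and less at read time.
The downside of approach 2 is that posting a tweet now takes a lot of extra work. On average a tweet is delivered to about 75 followers, so 4.6k tweets per second turn into 345k writes per second to the home timeline caches. That average hides the fact that follower counts vary enormously: some users have more than 30 million followers, so a single tweet can cause 30 million writes to home timeline caches. Doing this promptly is a major challenge, since Twitter tries to deliver tweets to followers within five seconds.
In the Twitter example, the distribution of followers per user (perhaps weighted by how often those users tweet) is a key load parameter for discussing scalability, because it determines the fan-out load. Your application may have very different characteristics, but you can reason about its load using similar principles.
The final twist in the Twitter story: now that approach 2 is robustly implemented, Twitter is moving to a hybrid of the two approaches. Most users' tweets are still fanned out to their followers' home timeline caches when they are posted, but a small number of users with very large followings (celebrities) are excluded from this fan-out. When a user reads their home timeline, the tweets of each celebrity they follow are fetched separately and merged with the cached home timeline, as in approach 1. This hybrid delivers consistently good performance. We will return to this example in [Chapter 12](/v1/ch12), after covering more technical ground.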
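The hybrid can be sketched in the same toy style. Everything here is an assumption made for illustration: the names, the follower-count threshold that defines a "celebrity", and the use of a logical counter as a timestamp. The source does not say how Twitter decides who is excluded from fan-out.

```python
from collections import defaultdict

CELEBRITY_THRESHOLD = 3  # hypothetical cutoff; real systems would use a far larger value


class HybridTimeline:
    """Fan out on write for ordinary users, merge on read for celebrities."""

    def __init__(self, threshold=CELEBRITY_THRESHOLD):
        self.threshold = threshold
        self.followers = defaultdict(set)    # followee -> followers
        self.following = defaultdict(set)    # follower -> followees
        self.own_tweets = defaultdict(list)  # sender -> that sender's tweets
        self.home_cache = defaultdict(list)  # user -> fanned-out tweets
        self._clock = 0                      # logical timestamp

    def follow(self, follower, followee):
        self.followers[followee].add(follower)
        self.following[follower].add(followee)

    def is_celebrity(self, user):
        return len(self.followers[user]) >= self.threshold

    def post(self, sender, text):
        self._clock += 1
        tweet = (self._clock, sender, text)
        self.own_tweets[sender].append(tweet)
        if not self.is_celebrity(sender):
            # Ordinary user: precompute by writing to each follower's cache.
            for follower in self.followers[sender]:
                self.home_cache[follower].append(tweet)
        return tweet

    def home_timeline(self, user):
        # A set removes duplicates when a sender crossed the threshold
        # after some of their tweets had already been fanned out.
        merged = set(self.home_cache[user])
        for followee in self.following[user]:
            if self.is_celebrity(followee):
                # Celebrity: fetched at read time, as in approach 1.
                merged.update(self.own_tweets[followee])
        return sorted(merged, reverse=True)
```

The design choice is visible in `post`: the expensive per-follower loop is skipped exactly for the senders for whom it would be most expensive, at the cost of a small amount of extra merging on each read.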
### 描述性能
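Once the load on a system has been described, we can investigate what happens when it increases. There are two ways to look at this:
* If a load parameter increases while system resources (CPU, memory, network bandwidth, and so on) stay the same, how is performance affected?
* If a load parameter increases and performance is to stay the same, how much do the resources need to grow?
Both questions require performance numbers, so let us look briefly at how to describe a system's performance.
For a batch processing system such as Hadoop, we usually care about **throughput**: the number of records processed per second, or the total time needed to run a job on a dataset of a given size [^iii]. For online systems, the **response time** of the service is usually more important, that is, the time between a client sending a request and receiving the response.
[^iii]: Ideally, the running time of a batch job is the size of the dataset divided by the throughput. In practice it is often longer, because of skew (data not being spread evenly across worker processes) and the need to wait for the slowest task to finish.
> #### Latency and response time
>
> **Latency** and **response time** are often used as synonyms, but they are not the same. Response time is what the client sees: in addition to the time spent actually processing the request (the **service time**), it includes network delays and queueing delays. Latency is the length of time a request spends waiting to be handled, during which it is **latent**, awaiting service【17】.

Even if you send the same request over and over, the response time will differ slightly each time. Real systems handle a wide variety of requests, so response times can vary a great deal. We therefore need to treat response time not as a single number but as a **distribution** of values that can be measured.
In [Figure 1-4](/v1/ddia_0104.png), each gray bar represents one request to a service, and its height shows how long the request took. Most requests are fairly fast, but occasional outliers take much longer. The slow requests may be intrinsically more expensive, for example because they process more data. But even when you would expect every request to take the same time, random additional delays cause variation: a context switch to a background process, a lost network packet and TCP retransmission, a garbage collection pause, a page fault that forces a read from disk, mechanical vibration in the server rack【18】, and many other causes.

**Figure 1-4. Mean and percentiles of the response times for a sample of 100 requests to a service**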
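Reports commonly show the average response time of a service. (Strictly speaking, “average” does not refer to any particular formula, but in practice it is usually understood as the **arithmetic mean**: given n values, add them up and divide by n.) If you want to know the “**typical**” response time, however, the mean is not a very good metric, because it does not tell you how many users actually experienced that delay.
**Percentiles** are usually a better choice. If you sort the list of response times from fastest to slowest, the **median** is the halfway point. For example, a median response time of 200 ms means that half of the requests return in less than 200 ms and the other half take longer.
The median is therefore a good metric for how long users typically have to wait: half of user requests have a response time below the median, and the other half have a response time above it. The median is also known as the 50th percentile, sometimes abbreviated p50. Note that the median refers to a single request; if a user makes several requests (over the course of a session, or because one page includes several resources), the probability that at least one of them is slower than the median is much greater than 50%.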
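The last point can be checked with a one-line formula. The sketch below assumes that the requests are independent and that each has exactly a 50% chance of being slower than the median; both are idealizations, and the function name is made up for this example.

```python
def prob_at_least_one_slower_than_median(num_requests: int) -> float:
    """Probability that at least one of several independent requests
    is slower than the median.

    Each request is faster than the median with probability 0.5, so all
    of them are fast with probability 0.5 ** n; the complement is the answer.
    """
    return 1.0 - 0.5 ** num_requests


if __name__ == "__main__":
    for n in (1, 2, 5, 10):
        print(n, prob_at_least_one_slower_than_median(n))
```

With five requests the probability is already about 97%, which is why a page that loads several resources will usually feel slower than the per-request median suggests.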
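To find out how bad the outliers are, look at higher percentiles, such as the 95th, 99th, and 99.9th (abbreviated p95, p99, and p999). They are the thresholds below which 95%, 99%, or 99.9% of response times fall. For example, if the 95th percentile response time is 1.5 seconds, then 95 out of 100 requests take less than 1.5 seconds and 5 out of 100 take 1.5 seconds or more. [Figure 1-4](/v1/ddia_0104.png) illustrates this.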
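As an illustration, percentiles can be computed from a list of measurements by sorting it. The sketch below uses the nearest-rank definition, which is one of several common conventions (others interpolate between neighboring values); the function name and the sample data are invented for this example.

```python
import math
from statistics import mean


def percentile(values, p):
    """Nearest-rank percentile: the smallest measured value such that
    at least p percent of the values are less than or equal to it."""
    if not values:
        raise ValueError("need at least one value")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based position in the sorted list
    return ordered[rank - 1]


if __name__ == "__main__":
    # Hypothetical response times in milliseconds, with a single outlier.
    response_ms = [30, 32, 35, 38, 40, 41, 45, 50, 60, 1500]
    print("mean:", mean(response_ms))
    print("p50: ", percentile(response_ms, 50))
    print("p95: ", percentile(response_ms, 95))
```

In this sample the single outlier pulls the mean to 187.1 ms even though nine of the ten requests finish within 60 ms, while the median of 40 ms reflects what a typical request experiences and p95 exposes the outlier.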
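High percentiles of response time, also called **tail latencies**, matter because they directly affect users' experience of the service. Amazon, for example, states response time requirements for internal services in terms of the 99.9th percentile, even though it affects only 1 request in 1,000. The reason is that the customers with the slowest requests are often those with the most data, and arguably the most valuable customers, because they have spent money【19】. Keeping the website fast is important for keeping those customers satisfied: Amazon has observed that a 100 ms increase in response time reduces sales by 1%【20】, and others report that a 1-second slowdown reduces a customer satisfaction metric by 16%【21,22】.
On the other hand, optimizing the 99.99th percentile (the slowest 1 in 10,000 requests) was judged too expensive to be worth it for Amazon's purposes. Reducing response times at very high percentiles is difficult, because they are easily affected by random events outside your control, and the benefits diminish.
Percentiles are often used in **service level objectives (SLOs)** and **service level agreements (SLAs)**, contracts that define the expected performance and availability of a service. An SLA might state that the service is considered to be working if its median response time is below 200 ms and its 99.9th percentile is below 1 second (if response times are longer, the service is considered not to meet the agreement). These metrics set expectations for clients and allow customers to demand a refund if the SLA is not met.
**Queueing delay** often accounts for a large share of the response time at high percentiles. A server can process only a small number of things in parallel (limited, for example, by its number of CPU cores), so a few slow requests are enough to hold up the requests behind them, an effect sometimes called **head-of-line blocking**. Even if the later requests are processed very quickly on the server, the client sees a slow overall response time because of the wait for the earlier requests to finish. Because of this effect, it is important to measure response times on the client side.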
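Head-of-line blocking is easy to reproduce in a small simulation. The sketch below models a single worker that serves requests strictly in arrival order; the function name, the single-worker assumption, and the numbers are all illustrative and not taken from the source.

```python
def fifo_response_times(arrivals, service_times):
    """Client-visible response time of each request on one FIFO worker.

    arrivals:      arrival time of each request, in ascending order
    service_times: time the server needs to process each request
    """
    worker_free_at = 0.0
    response_times = []
    for arrived, service in zip(arrivals, service_times):
        started = max(arrived, worker_free_at)  # wait while the worker is busy
        worker_free_at = started + service
        response_times.append(worker_free_at - arrived)
    return response_times


if __name__ == "__main__":
    arrivals = [0, 1, 2, 3]  # milliseconds
    print(fifo_response_times(arrivals, [1, 1, 1, 1]))    # no slow request
    print(fifo_response_times(arrivals, [100, 1, 1, 1]))  # first request is slow
```

In the second run the three fast requests each need only 1 ms of service time, yet every one of them sees a response time of about 100 ms because it queued behind the slow request. Server-side service time alone would not reveal this, which is why the measurement belongs on the client side.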
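When generating load artificially to test a system's scalability, the load-generating client must keep sending requests independently of the response time. If the client waits for the previous request to complete before sending the next one, the queues in the test are artificially kept shorter than they would be in reality, which skews the measurements【23】.
> #### Percentiles in practice
>
> High percentiles become especially important in backend services that are called several times to serve a single end-user request. Even if the calls are made in parallel, the end-user request still has to wait for the slowest of them to finish. As [Figure 1-5](/v1/ddia_0105.png) shows, one slow call is enough to make the whole end-user request slow. Even if only a small percentage of backend calls are slow, the chance of hitting a slow call rises when an end-user request needs several backend calls, so a larger proportion of end-user requests end up slow (an effect known as tail latency amplification【24】).
>
> If you want to add response time percentiles to your service's monitoring dashboard, you need to compute them efficiently on an ongoing basis. For example, you might keep a rolling window of the response times of requests in the last 10 minutes. Every minute, you compute the median and various percentiles over the values in that window and plot them on a graph.
>
> The simple implementation is to keep a list of the response times of all requests in the time window and sort it every minute. If that is too inefficient, there are algorithms that compute a good approximation of percentiles at minimal CPU and memory cost, such as forward decay【25】, t-digest【26】, or HdrHistogram【27】. Be aware that averaging percentiles (for example, to reduce the time resolution or to combine data from several machines) is mathematically meaningless; the right way to aggregate response time data is to add the histograms【28】.

**Figure 1-5. When serving a request requires several backend calls, a single slow backend call slows down the entire end-user request**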
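The warning in the box about averaging percentiles can be demonstrated with a small fixed-width histogram. This is a deliberately simple sketch and not how t-digest or HdrHistogram work internally; the bucket width, function names, and sample data are assumptions made for the example.

```python
import math
from collections import Counter

BUCKET_MS = 10  # hypothetical bucket width in milliseconds


def to_histogram(response_times_ms):
    """Count response times per bucket; bucket k covers [k*10, (k+1)*10) ms."""
    return Counter(int(t // BUCKET_MS) for t in response_times_ms)


def histogram_percentile(hist, p):
    """Upper bound (in ms) of the bucket that contains the p-th percentile."""
    total = sum(hist.values())
    if total == 0:
        raise ValueError("empty histogram")
    rank = math.ceil(p * total / 100)
    seen = 0
    for bucket in sorted(hist):
        seen += hist[bucket]
        if seen >= rank:
            return (bucket + 1) * BUCKET_MS


if __name__ == "__main__":
    machine_a = to_histogram([15] * 1000)  # busy machine, all requests fast
    machine_b = to_histogram([995] * 10)   # nearly idle machine, all requests slow

    p99_a = histogram_percentile(machine_a, 99)
    p99_b = histogram_percentile(machine_b, 99)
    print("average of the two p99 values:", (p99_a + p99_b) / 2)

    merged = machine_a + machine_b  # adding Counters adds the bucket counts
    print("p99 of the merged histogram:  ", histogram_percentile(merged, 99))
```

Averaging the per-machine p99 values gives 510 ms, while the p99 over all 1,010 requests is in the 10 to 20 ms bucket, because the slow machine served fewer than 1% of the requests. Adding histograms keeps the request counts that averaging throws away.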
### 应对负载的方法
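Having discussed the parameters for describing load and the metrics for measuring performance, we can now talk about scalability in earnest: how do we maintain good performance when the load parameters increase?
An architecture that suits one level of load is unlikely to cope with ten times that load. If you are working on a fast-growing service, you will probably need to rethink the architecture at every order-of-magnitude increase in load, or even more often.
People often talk about a dichotomy between **scaling up** (vertical scaling, moving to a more powerful machine) and **scaling out** (horizontal scaling, spreading the load across several smaller machines). Spreading load across multiple machines is also called a **shared-nothing** architecture. A system that runs on a single machine is usually simpler, but high-end machines can be very expensive, so very intensive workloads often cannot avoid scaling out. Good real-world architectures combine the two approaches pragmatically, because a few fairly powerful machines can be simpler and cheaper than a large number of small virtual machines.
Some systems are **elastic**, meaning that they add computing resources automatically when they detect a load increase, while others are scaled manually (a person analyzes the capacity and decides to add machines). An elastic system can be useful when load is **highly unpredictable**, but manually scaled systems are simpler and may cause fewer operational surprises (see “[Rebalancing Partitions](/v1/ch6#分区再平衡)”).
Deploying **stateless services** across multiple machines is fairly straightforward, but moving a stateful data system from a single node to a distributed setup can add a great deal of complexity. For this reason, the conventional wisdom has been to keep the database on a single node (scaling up) until scaling cost or availability requirements force it to become distributed.
As the tools and abstractions for distributed systems improve, this conventional wisdom may change, at least for some kinds of applications. It is conceivable that distributed data systems will become the default in the future, even for use cases that do not involve large volumes of data or traffic. The rest of this book covers many kinds of distributed data systems and discusses not only how they scale but also how easy they are to use and maintain.
The architecture of a large-scale system is usually specific to the application: there is no generic, one-size-fits-all scalable architecture (informally known as **magic scaling sauce**). The problem may be the volume of reads, the volume of writes, the amount of data to store, the complexity of the data, the response time requirements, the access patterns, or a mixture of all of these.
For example, a system designed to handle 100,000 requests per second, each 1 kB in size, looks very different from one designed for 3 requests per minute, each 2 GB in size, even though both have the same data throughput.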
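The claim that the two workloads have the same throughput can be verified with a short calculation. The sketch assumes decimal units (1 kB = 1,000 bytes, 1 GB = 1,000,000,000 bytes), which the source does not specify; the function name is chosen here for illustration.

```python
KB = 1_000          # bytes, decimal kilobyte (an assumption)
GB = 1_000_000_000  # bytes, decimal gigabyte (an assumption)


def throughput_bytes_per_sec(num_requests, period_seconds, request_size_bytes):
    """Average data throughput for num_requests arriving every period_seconds."""
    return num_requests * request_size_bytes / period_seconds


if __name__ == "__main__":
    many_small = throughput_bytes_per_sec(100_000, 1, 1 * KB)  # 100,000 req/s of 1 kB
    few_large = throughput_bytes_per_sec(3, 60, 2 * GB)        # 3 req/min of 2 GB
    print(many_small / 1e6, "MB/s")
    print(few_large / 1e6, "MB/s")
```

Both workloads come to 100 MB/s, yet the first is dominated by per-request overhead and concurrency while the second is dominated by moving a few very large payloads, which is why the same throughput figure leads to very different designs.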
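An architecture that scales well for a particular application is built around **assumptions** about which operations are common and which are rare, in other words the load parameters. If those assumptions turn out to be wrong, the engineering effort spent on scaling is wasted at best and counterproductive at worst. In an early-stage startup or an unproven product, being able to iterate quickly on the product usually matters much more than scaling to some hypothetical future load.
Although these architectures are application-specific, scalable architectures are usually built from general-purpose building blocks arranged in familiar patterns. This book discusses those building blocks and patterns.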
## 可维护性
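It is well known that most of the cost of software lies not in its initial development but in its ongoing maintenance: fixing bugs, keeping systems running, investigating failures, adapting to new platforms, modifying the software for new use cases, repaying technical debt, and adding new features.
Unfortunately, many people who work on software systems dislike maintaining so-called **legacy** systems, perhaps because it means fixing other people's mistakes, dealing with outdated platforms, or working with systems that were pushed into doing things they were never meant for. Every legacy system is unpleasant in its own way, so it is hard to give general advice for dealing with them.
We can and should, however, design software so that maintenance is as painless as possible, and so avoid creating legacy systems ourselves. To that end, we will pay particular attention to three design principles for software systems:
Operability
: Make it easy for operations teams to keep the system running smoothly.
Simplicity
: Remove as much **complexity** as possible from the system so that new engineers can understand it easily. (Note that this is not the same as simplicity of the user interface.)
Evolvability
: Make it easy for engineers to change the system in the future and adapt it to new use cases as requirements change. Also known as **extensibility**, **modifiability**, or **plasticity**.
As with reliability and scalability, there are no easy solutions for achieving these goals. Instead, we will try to picture what systems with operability, simplicity, and evolvability look like.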
### 可操作性:人生苦短,关爱运维
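It has been said that “good operations can often work around the limitations of bad (or incomplete) software, but good software cannot run reliably with bad operations”. Some aspects of operations can and should be automated, but it is still up to humans to set that automation up in the first place and to make sure it works correctly.
Operations teams are vital for keeping a software system running smoothly. A good operations team is typically responsible for the following, and more【29】:
* Monitoring the health of the system and quickly restoring service when it goes into a bad state
* Tracking down the causes of problems, such as system failures or degraded performance
* Keeping software and platforms up to date, including security patches
* Understanding how systems affect each other, so that a problematic change can be avoided before it causes damage
* Anticipating future problems and solving them before they occur (for example, capacity planning)
* Establishing good practices and tools for deployment, configuration, and management
* Performing complex maintenance tasks, such as migrating an application from one platform to another
* Maintaining the security of the system as its configuration changes
* Defining processes that make operations predictable and keep the production environment stable
* Preserving the organization's knowledge about the system even as individual people come and go
Good operability means making routine tasks easy, so that the operations team can concentrate on high-value work. Data systems can make routine tasks easier in several ways:
* Providing **visibility** into the system's internals and runtime behavior through good monitoring
* Supporting automation well and integrating with standard tools
* Avoiding dependence on individual machines (allowing machines to be taken down for maintenance while the system as a whole keeps running)
* Providing good documentation and an easy-to-understand operational model (“If I do X, Y will happen”)
* Providing good default behavior while giving administrators the freedom to override defaults when needed
* Self-healing where appropriate while giving administrators manual control over the system state when needed
* Behaving predictably and minimizing surprises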
### 简单性:管理复杂度
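Small software projects can have pleasantly simple and expressive code, but as projects grow, the code often becomes very complex and hard to understand. This complexity slows down everyone who works on the system and further increases the cost of maintenance. A software project mired in complexity is sometimes described as a **big ball of mud**【30】.
**Complexity** has many possible symptoms: explosion of the state space, tight coupling between modules, tangled dependencies, inconsistent naming and terminology, hacks aimed at fixing performance problems, special cases that work around issues elsewhere, and more. Much has already been written on the subject【31,32,33】.
When complexity makes maintenance hard, budgets and schedules are often overrun. Changes to complex software also carry a greater risk of introducing bugs: when developers find the system hard to understand, hidden assumptions, unintended consequences, and unexpected interactions are more easily overlooked. Conversely, reducing complexity greatly improves maintainability, so simplicity should be a key goal for the systems we build.
Making a system simpler does not necessarily mean reducing its functionality; it can also mean removing **accidental** complexity. Moseley and Marks【32】 define complexity as accidental if it arises only from the implementation and is not inherent in the problem the software solves (as seen by its users).
One of the best tools for removing accidental complexity is **abstraction**. A good abstraction hides a large amount of implementation detail behind a clean, easy-to-understand facade. A good abstraction can also be used across a wide range of applications. Reusing it is more efficient than reimplementing something similar many times, and it leads to higher-quality software, because improvements to the abstracted component benefit every application that uses it.
High-level programming languages, for example, are abstractions that hide machine code, CPU registers, and system calls. SQL is an abstraction that hides complex on-disk and in-memory data structures, concurrent requests from other clients, and inconsistencies after crashes. When programming in a high-level language we are of course still using machine code; we are just not using it **directly**, because the language abstraction spares us from having to think about it.
Abstraction helps keep the complexity of a system at a manageable level, but finding good abstractions is very hard. The field of distributed systems has many good algorithms, yet it is much less clear how they should be packaged into abstractions.
Throughout this book, we will keep an eye out for good abstractions that let us extract parts of a large system into well-defined, reusable components.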
### 可演化性:拥抱变化
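It is extremely unlikely that a system's requirements will stay the same forever. Much more probably they will be in constant flux: you learn new facts, unanticipated use cases emerge, business priorities shift, users ask for new features, new platforms replace old ones, legal or regulatory requirements change, growth forces architectural changes, and so on.
On the organizational side, **Agile** working patterns provide a framework for adapting to change. The Agile community has also developed technical tools and patterns that help when developing software in a frequently changing environment, such as **test-driven development (TDD)** and **refactoring**.
Most discussions of these Agile techniques focus on a fairly small scale (a few source files within one application). This book looks for ways to increase agility at the level of a larger data system, which may consist of several different applications or services. For example, how would you “refactor” Twitter's architecture for assembling home timelines from approach 1 to approach 2?
How easily a data system can be modified and adapted to changing requirements is closely tied to its **simplicity** and its **abstractions**: simple, easy-to-understand systems are usually easier to modify than complex ones. Because this idea is so important, we will use a separate word for agility at the data system level: **evolvability**【34】.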
## 本章小结
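This chapter explored some fundamental ways of thinking about data-intensive applications. These principles will guide us through the rest of the book, which goes into technical detail.
An application has to meet various requirements to be useful. There are **functional requirements** (what it should do, such as allowing data to be stored, retrieved, searched, and processed in various ways) and **nonfunctional requirements** (general properties such as security, reliability, compliance, scalability, compatibility, and maintainability). This chapter discussed reliability, scalability, and maintainability in detail.
**Reliability** means that the system works correctly even when faults occur. Faults can arise in hardware (usually random and uncorrelated), in software (usually systematic bugs that are hard to deal with), and from humans (who inevitably make mistakes from time to time). **Fault-tolerance techniques** can hide certain types of faults from the end user.
**Scalability** means having strategies for keeping performance good even when load increases. To discuss scalability, we first need ways of describing load and performance quantitatively. We looked briefly at Twitter's home timelines as an example of describing load, and at response time percentiles as a way of measuring performance. In a scalable system, **processing capacity** can be added so that the system stays reliable under high load.
**Maintainability** has many facets, but in essence it is about the quality of life of the engineering and operations teams who work with the system. Good abstractions help reduce complexity and make the system easier to modify and adapt to new use cases. Good operability means having good visibility into the system's health and effective ways of managing it.
Unfortunately, there is no easy way to make applications reliable, scalable, or maintainable. Certain patterns and techniques do, however, keep reappearing in different kinds of applications. In the next few chapters we will look at some examples of data systems and analyze how they work toward these goals.
Later in the book, in [Part III](/v1/part-iii), we will look at patterns for systems in which several components work together, such as the one in [Figure 1-1](/v1/ddia_0101.png).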
## 参考文献
1. Michael Stonebraker and Uğur Çetintemel: “['One Size Fits All': An Idea Whose Time Has Come and Gone](https://cs.brown.edu/~ugur/fits_all.pdf),” at *21st International Conference on Data Engineering* (ICDE), April 2005.
1. Walter L. Heimerdinger and Charles B. Weinstock: “[A Conceptual Framework for System Fault Tolerance](https://resources.sei.cmu.edu/asset_files/TechnicalReport/1992_005_001_16112.pdf),” Technical Report CMU/SEI-92-TR-033, Software Engineering Institute, Carnegie Mellon University, October 1992.
1. Ding Yuan, Yu Luo, Xin Zhuang, et al.: “[Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems](https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-yuan.pdf),” at *11th USENIX Symposium on Operating Systems Design and Implementation* (OSDI), October 2014.
1. Yury Izrailevsky and Ariel Tseitlin: “[The Netflix Simian Army](https://netflixtechblog.com/the-netflix-simian-army-16e57fbab116),” *netflixtechblog.com*, July 19, 2011.
1. Daniel Ford, François Labelle, Florentina I. Popovici, et al.: “[Availability in Globally Distributed Storage Systems](http://research.google.com/pubs/archive/36737.pdf),” at *9th USENIX Symposium on Operating Systems Design and Implementation* (OSDI), October 2010.
1. Brian Beach: “[Hard Drive Reliability Update – Sep 2014](https://www.backblaze.com/blog/hard-drive-reliability-update-september-2014/),” *backblaze.com*, September 23, 2014.
1. Laurie Voss: “[AWS: The Good, the Bad and the Ugly](https://web.archive.org/web/20160429075023/http://blog.awe.sm/2012/12/18/aws-the-good-the-bad-and-the-ugly/),” *blog.awe.sm*, December 18, 2012.
1. Haryadi S. Gunawi, Mingzhe Hao, Tanakorn Leesatapornwongsa, et al.: “[What Bugs Live in the Cloud?](http://ucare.cs.uchicago.edu/pdf/socc14-cbs.pdf),” at *5th ACM Symposium on Cloud Computing* (SoCC), November 2014. [doi:10.1145/2670979.2670986](http://dx.doi.org/10.1145/2670979.2670986)
1. Nelson Minar: “[Leap Second Crashes Half the Internet](http://www.somebits.com/weblog/tech/bad/leap-second-2012.html),” *somebits.com*, July 3, 2012.
1. Amazon Web Services: “[Summary of the Amazon EC2 and Amazon RDS Service Disruption in the US East Region](http://aws.amazon.com/message/65648/),” *aws.amazon.com*, April 29, 2011.
1. Richard I. Cook: “[How Complex Systems Fail](https://www.adaptivecapacitylabs.com/HowComplexSystemsFail.pdf),” Cognitive Technologies Laboratory, April 2000.
1. Jay Kreps: “[Getting Real About Distributed System Reliability](http://blog.empathybox.com/post/19574936361/getting-real-about-distributed-system-reliability),” *blog.empathybox.com*, March 19, 2012.
1. David Oppenheimer, Archana Ganapathi, and David A. Patterson: “[Why Do Internet Services Fail, and What Can Be Done About It?](http://static.usenix.org/legacy/events/usits03/tech/full_papers/oppenheimer/oppenheimer.pdf),” at *4th USENIX Symposium on Internet Technologies and Systems* (USITS), March 2003.
1. Nathan Marz: “[Principles of Software Engineering, Part 1](http://nathanmarz.com/blog/principles-of-software-engineering-part-1.html),” *nathanmarz.com*, April 2, 2013.
1. Michael Jurewitz: “[The Human Impact of Bugs](http://jury.me/blog/2013/3/14/the-human-impact-of-bugs),” *jury.me*, March 15, 2013.
1. Raffi Krikorian: “[Timelines at Scale](http://www.infoq.com/presentations/Twitter-Timeline-Scalability),” at *QCon San Francisco*, November 2012.
1. Martin Fowler: *Patterns of Enterprise Application Architecture*. Addison Wesley, 2002. ISBN: 978-0-321-12742-6
1. Kelly Sommers: “[After all that run around, what caused 500ms disk latency even when we replaced physical server?](https://twitter.com/kellabyte/status/532930540777635840)” *twitter.com*, November 13, 2014.
1. Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, et al.: “[Dynamo: Amazon's Highly Available Key-Value Store](http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf),” at *21st ACM Symposium on Operating Systems Principles* (SOSP), October 2007.
1. Greg Linden: “[Make Data Useful](http://glinden.blogspot.co.uk/2006/12/slides-from-my-talk-at-stanford.html),” slides from presentation at Stanford University Data Mining class (CS345), December 2006.
1. Tammy Everts: “[The Real Cost of Slow Time vs Downtime](https://www.slideshare.net/Radware/radware-cmg2014-tammyevertsslowtimevsdowntime),” *slideshare.net*, November 5, 2014.
1. Jake Brutlag: “[Speed Matters](https://ai.googleblog.com/2009/06/speed-matters.html),” *ai.googleblog.com*, June 23, 2009.
1. Tyler Treat: “[Everything You Know About Latency Is Wrong](http://bravenewgeek.com/everything-you-know-about-latency-is-wrong/),” *bravenewgeek.com*, December 12, 2015.
1. Jeffrey Dean and Luiz André Barroso: “[The Tail at Scale](http://cacm.acm.org/magazines/2013/2/160173-the-tail-at-scale/fulltext),” *Communications of the ACM*, volume 56, number 2, pages 74–80, February 2013. [doi:10.1145/2408776.2408794](http://dx.doi.org/10.1145/2408776.2408794)
1. Graham Cormode, Vladislav Shkapenyuk, Divesh Srivastava, and Bojian Xu: “[Forward Decay: A Practical Time Decay Model for Streaming Systems](http://dimacs.rutgers.edu/~graham/pubs/papers/fwddecay.pdf),” at *25th IEEE International Conference on Data Engineering* (ICDE), March 2009.
1. Ted Dunning and Otmar Ertl: “[Computing Extremely Accurate Quantiles Using t-Digests](https://github.com/tdunning/t-digest),” *github.com*, March 2014.
1. Gil Tene: “[HdrHistogram](http://www.hdrhistogram.org/),” *hdrhistogram.org*.
1. Baron Schwartz: “[Why Percentiles Don’t Work the Way You Think](https://orangematter.solarwinds.com/2016/11/18/why-percentiles-dont-work-the-way-you-think/),” *solarwinds.com*, November 18, 2016.
1. James Hamilton: “[On Designing and Deploying Internet-Scale Services](https://www.usenix.org/legacy/events/lisa07/tech/full_papers/hamilton/hamilton.pdf),” at *21st Large Installation System Administration Conference* (LISA), November 2007.
1. Brian Foote and Joseph Yoder: “[Big Ball of Mud](http://www.laputan.org/pub/foote/mud.pdf),” at *4th Conference on Pattern Languages of Programs* (PLoP), September 1997.
1. Frederick P Brooks: “No Silver Bullet – Essence and Accident in Software Engineering,” in *The Mythical Man-Month*, Anniversary edition, Addison-Wesley, 1995. ISBN: 978-0-201-83595-3
1. Ben Moseley and Peter Marks: “[Out of the Tar Pit](https://curtclifton.net/papers/MoseleyMarks06a.pdf),” at *BCS Software Practice Advancement* (SPA), 2006.
1. Rich Hickey: “[Simple Made Easy](http://www.infoq.com/presentations/Simple-Made-Easy),” at *Strange Loop*, September 2011.
1. Hongyu Pei Breivold, Ivica Crnkovic, and Peter J. Eriksson: “[Analyzing Software Evolvability](http://www.es.mdh.se/pdf_publications/1251.pdf),” at *32nd Annual IEEE International Computer Software and Applications Conference* (COMPSAC), July 2008. [doi:10.1109/COMPSAC.2008.50](http://dx.doi.org/10.1109/COMPSAC.2008.50)
================================================
FILE: content/v1/ch10.md
================================================
---
title: "第十章:批处理"
linkTitle: "10. 批处理"
weight: 310
breadcrumbs: false
---

> 带有太强个人色彩的系统无法成功。当最初的设计完成并且相对稳定时,不同的人们以自己的方式进行测试,真正的考验才开始。
>
> —— 高德纳
在本书的前两部分中,我们讨论了很多关于 **请求** 和 **查询** 以及相应的 **响应** 或 **结果**。许多现有数据系统中都采用这种数据处理方式:你发送请求指令,一段时间后(我们期望)系统会给出一个结果。数据库、缓存、搜索索引、Web 服务器以及其他一些系统都以这种方式工作。
像这样的 **在线(online)** 系统,无论是浏览器请求页面还是调用远程 API 的服务,我们通常认为请求是由人类用户触发的,并且正在等待响应。他们不应该等太久,所以我们非常关注系统的响应时间(请参阅 “[描述性能](/v1/ch1#描述性能)”)。
Web 和越来越多的基于 HTTP/REST 的 API 使交互的请求 / 响应风格变得如此普遍,以至于很容易将其视为理所当然。但我们应该记住,这不是构建系统的唯一方式,其他方法也有其优点。我们来看看三种不同类型的系统:
服务(在线系统)
: 服务等待客户的请求或指令到达。每收到一个,服务会试图尽快处理它,并发回一个响应。响应时间通常是服务性能的主要衡量指标,可用性通常非常重要(如果客户端无法访问服务,用户可能会收到错误消息)。
批处理系统(离线系统)
: 一个批处理系统有大量的输入数据,跑一个 **作业(job)** 来处理它,并生成一些输出数据,这往往需要一段时间(从几分钟到几天),所以通常不会有用户等待作业完成。相反,批量作业通常会定期运行(例如,每天一次)。批处理作业的主要性能衡量标准通常是吞吐量(处理特定大小的输入所需的时间)。本章中讨论的就是批处理。
流处理系统(准实时系统)
: 流处理介于在线和离线(批处理)之间,所以有时候被称为 **准实时(near-real-time)** 或 **准在线(nearline)** 处理。像批处理系统一样,流处理消费输入并产生输出(并不需要响应请求)。但是,流式作业在事件发生后不久就会对事件进行操作,而批处理作业则需等待固定的一组输入数据。这种差异使流处理系统比起批处理系统具有更低的延迟。由于流处理基于批处理,我们将在 [第十一章](/v1/ch11) 讨论它。
正如我们将在本章中看到的那样,批处理是构建可靠、可伸缩和可维护应用程序的重要组成部分。例如,2004 年发布的批处理算法 Map-Reduce(可能被过分热情地)被称为 “造就 Google 大规模可伸缩性的算法”【2】。随后在各种开源数据系统中得到应用,包括 Hadoop、CouchDB 和 MongoDB。
与多年前为数据仓库开发的并行处理系统【3,4】相比,MapReduce 是一个相当低级别的编程模型,但它使得在商用硬件上能进行的处理规模迈上一个新的台阶。虽然 MapReduce 的重要性正在下降【5】,但它仍然值得去理解,因为它描绘了一幅关于批处理为什么有用,以及如何做到有用的清晰图景。
实际上,批处理是一种非常古老的计算方式。早在可编程数字计算机诞生之前,打孔卡制表机(例如 1890 年美国人口普查【6】中使用的霍尔里斯机)实现了半机械化的批处理形式,从大量输入中汇总计算。Map-Reduce 与 1940 年代和 1950 年代广泛用于商业数据处理的机电 IBM 卡片分类机器有着惊人的相似之处【7】。正如我们所说,历史总是在不断重复自己。
在本章中,我们将了解 MapReduce 和其他一些批处理算法和框架,并探索它们在现代数据系统中的作用。但首先我们将看看使用标准 Unix 工具的数据处理。即使你已经熟悉了它们,Unix 的哲学也值得一读,Unix 的思想和经验教训可以迁移到大规模、异构的分布式数据系统中。
## 使用Unix工具的批处理
我们从一个简单的例子开始。假设你有一台 Web 服务器,每次处理请求时都会在日志文件中附加一行。例如,使用 nginx 默认的访问日志格式,日志的一行可能如下所示:
```bash
216.58.210.78 - - [27/Feb/2015:17:55:11 +0000] "GET /css/typography.css HTTP/1.1"
200 3377 "http://martin.kleppmann.com/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5)
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.115 Safari/537.36"
```
(实际上这只是一行,分成多行只是为了便于阅读。)这一行中有很多信息。为了解释它,你需要了解日志格式的定义,如下所示:
```bash
$remote_addr - $remote_user [$time_local] "$request"
$status $body_bytes_sent "$http_referer" "$http_user_agent"
```
日志的这一行表明在 UTC 时间的 2015 年 2 月 27 日 17 点 55 分 11 秒,服务器从客户端 IP 地址 `216.58.210.78` 接收到对文件 `/css/typography.css` 的请求。用户没有认证,所以 `$remote_user` 被设置为连字符(`-`)。响应状态是 200(即请求成功),响应的大小是 3377 字节。网页浏览器是 Chrome 40,它加载了这个文件是因为该文件在网址为 `http://martin.kleppmann.com/` 的页面中被引用到了。
### 简单日志分析
很多工具可以从这些日志文件生成关于网站流量的漂亮的报告,但为了练手,让我们使用基本的 Unix 功能创建自己的工具。例如,假设你想在你的网站上找到五个最受欢迎的网页。则可以在 Unix shell 中这样做:[^i]
[^i]: 有些人认为 `cat` 这里并没有必要,因为输入文件可以直接作为 awk 的参数。但这种写法让线性管道更为显眼。
```bash
cat /var/log/nginx/access.log | #1
awk '{print $7}' | #2
sort | #3
uniq -c | #4
sort -r -n | #5
head -n 5 #6
```
1. 读取日志文件
2. 将每一行按空格分割成不同的字段,每行只输出第七个字段,恰好是请求的 URL。在我们的例子中是 `/css/typography.css`。
3. 按字母顺序排列请求的 URL 列表。如果某个 URL 被请求过 n 次,那么排序后,文件将包含连续重复出现 n 次的该 URL。
4. `uniq` 命令通过检查两个相邻的行是否相同来过滤掉输入中的重复行。`-c` 则表示还要输出一个计数器:对于每个不同的 URL,它会报告输入中出现该 URL 的次数。
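5. The second `sort` orders the lines by the number at the start of each line (`-n`), which is the number of times the URL was requested. It then returns the results in reverse order (`-r`), with the largest number first.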
6. 最后,只输出前五行(`-n 5`),并丢弃其余的。该系列命令的输出如下所示:
```bash
4189 /favicon.ico
3631 /2013/05/24/improving-security-of-ssh-private-keys.html
2124 /2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html
1369 /
915 /css/typography.css
```
如果你不熟悉 Unix 工具,上面的命令行可能看起来有点吃力,但是它非常强大。它能在几秒钟内处理几 GB 的日志文件,并且你可以根据需要轻松修改命令。例如,如果要从报告中省略 CSS 文件,可以将 awk 参数更改为 `'$7 !~ /\.css$/ {print $7}'`, 如果想统计最多的客户端 IP 地址,可以把 awk 参数改为 `'{print $1}'`,等等。
我们不会在这里详细探索 Unix 工具,但是它非常值得学习。令人惊讶的是,使用 awk、sed、grep、sort、uniq 和 xargs 的组合,可以在几分钟内完成许多数据分析,并且它们的性能相当的好【8】。
#### 命令链与自定义程序
除了 Unix 命令链,你还可以写一个简单的程序来做同样的事情。例如在 Ruby 中,它可能看起来像这样:
```ruby
counts = Hash.new(0) # 1
File.open('/var/log/nginx/access.log') do |file|
  file.each do |line|
    url = line.split[6] # 2
    counts[url] += 1 # 3
  end
end
top5 = counts.map{|url, count| [count, url] }.sort.reverse[0...5] # 4
top5.each{|count, url| puts "#{count} #{url}" } # 5
```
1. `counts` 是一个存储计数器的哈希表,保存了每个 URL 被浏览的次数,默认为 0。
2. 逐行读取日志,抽取每行第七个被空格分隔的字段为 URL(这里的数组索引是 6,因为 Ruby 的数组索引从 0 开始计数)
3. 将日志当前行中 URL 对应的计数器值加一。
4. 按计数器值(降序)对哈希表内容进行排序,并取前五位。
5. 打印出前五个条目。
这个程序并不像 Unix 管道那样简洁,但是它的可读性很强,喜欢哪一种属于口味的问题。但两者除了表面上的差异之外,执行流程也有很大差异,如果你在大文件上运行此分析,则会变得明显。
#### 排序 VS 内存中的聚合
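The Ruby script keeps an in-memory hash table of URLs, mapping each URL to the number of times it has been seen. The Unix pipeline has no such hash table; it relies instead on sorting a list of URLs in which multiple occurrences of the same URL are simply repeated.
Which approach is better? It depends on how many distinct URLs you have. For most small to mid-sized websites, all distinct URLs, together with a counter for each, probably fit in (say) 1 GB of memory. In this example, the job's **working set** (the amount of memory to which the job needs random access) depends only on the number of distinct URLs: if there are a million log entries for a single URL, the space required in the hash table is still just one URL plus the size of the counter. If the working set is small enough, an in-memory hash table works well, even on a modestly powered laptop.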
另一方面,如果作业的工作集大于可用内存,则排序方法的优点是可以高效地使用磁盘。这与我们在 “[SSTables 和 LSM 树](/v1/ch3#SSTables和LSM树)” 中讨论过的原理是一样的:数据块可以在内存中排序并作为段文件写入磁盘,然后多个排序好的段可以合并为一个更大的排序文件。归并排序具有在磁盘上运行良好的顺序访问模式。(请记住,针对顺序 I/O 进行优化是 [第三章](/v1/ch3) 中反复出现的主题,相同的模式在此重现)
GNU Coreutils(Linux)中的 `sort` 程序通过溢出至磁盘的方式来自动应对大于内存的数据集,并能同时使用多个 CPU 核进行并行排序【9】。这意味着我们之前看到的简单的 Unix 命令链很容易伸缩至大数据集,且不会耗尽内存。瓶颈可能是从磁盘读取输入文件的速度。
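The spill-to-disk idea can be sketched in a few lines of Ruby. This is only an illustration of the principle (sort chunks in memory, write each sorted run to a file, then merge the runs), not a description of how GNU `sort` is actually implemented. The method name `external_sort` and the `chunk_size` parameter are assumptions made for this example, and input lines are assumed not to contain embedded newlines.

```ruby
require 'tempfile'

# Sort more lines than fit in memory (simulated here by a small chunk_size).
# Yields the lines to the caller's block in sorted order.
def external_sort(lines, chunk_size: 1000)
  # Phase 1: sort each chunk in memory and spill it to its own temporary file.
  runs = lines.each_slice(chunk_size).map do |chunk|
    run = Tempfile.new('sorted-run')
    chunk.sort.each { |line| run.puts(line) }
    run.rewind
    run
  end

  # Phase 2: k-way merge. Only one line per run is held in memory at a time.
  heads = runs.map { |run| run.gets&.chomp }
  loop do
    live = heads.each_index.select { |i| heads[i] }
    break if live.empty?
    smallest = live.min_by { |i| heads[i] }
    yield heads[smallest]
    heads[smallest] = runs[smallest].gets&.chomp
  end
ensure
  # Close and delete the temporary run files.
  runs&.each(&:close!)
end
```

Both phases read and write the run files sequentially, which is the access pattern the paragraph above describes as working well on disk.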
### Unix哲学
我们可以非常容易地使用前一个例子中的一系列命令来分析日志文件,这并非巧合:事实上,这实际上是 Unix 的关键设计思想之一,而且它直至今天也仍然令人讶异地重要。让我们更深入地研究一下,以便从 Unix 中借鉴一些想法【10】。
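Doug McIlroy, the inventor of Unix pipes, first described the idea in 1964【11】: “We should have some ways of connecting programs like garden hose; screw in another segment when it becomes necessary to massage data in another way. This is the way of I/O also.” The plumbing analogy stuck, and the idea of connecting programs with pipes became part of what is now known as the **Unix philosophy**, a set of design principles that became popular among the developers and users of Unix. The philosophy was described in 1978 as follows【12,13】: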
1. 让每个程序都做好一件事。要做一件新的工作,写一个新程序,而不是通过添加 “功能” 让老程序复杂化。
2. 期待每个程序的输出成为另一个程序的输入。不要将无关信息混入输出。避免使用严格的列数据或二进制输入格式。不要坚持交互式输入。
3. 设计和构建软件时,即使是操作系统,也让它们能够尽早地被试用,最好在几周内完成。不要犹豫,扔掉笨拙的部分,重建它们。
4. 优先使用工具来减轻编程任务,即使必须绕道去编写工具,且在用完后很可能要扔掉大部分。
这种方法 —— 自动化,快速原型设计,增量式迭代,对实验友好,将大型项目分解成可管理的块 —— 听起来非常像今天的敏捷开发和 DevOps 运动。奇怪的是,四十年来变化不大。
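The `sort` tool is a great example. It is arguably a better sorting implementation than those in most programming languages' standard libraries (which do not spill to disk and do not use multiple threads, even when doing so would help). And yet `sort` is barely useful in isolation: it only becomes powerful in combination with other Unix tools, such as `uniq`.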
像 `bash` 这样的 Unix shell 可以让我们轻松地将这些小程序组合成令人讶异的强大数据处理任务。尽管这些程序中有很多是由不同人群编写的,但它们可以灵活地结合在一起。Unix 如何实现这种可组合性?
#### 统一的接口
如果你希望一个程序的输出成为另一个程序的输入,那意味着这些程序必须使用相同的数据格式 —— 换句话说,一个兼容的接口。如果你希望能够将任何程序的输出连接到任何程序的输入,那意味着所有程序必须使用相同的 I/O 接口。
在 Unix 中,这种接口是一个 **文件**(file,更准确地说,是一个文件描述符)。一个文件只是一串有序的字节序列。因为这是一个非常简单的接口,所以可以使用相同的接口来表示许多不同的东西:文件系统上的真实文件,到另一个进程(Unix 套接字,stdin,stdout)的通信通道,设备驱动程序(比如 `/dev/audio` 或 `/dev/lp0`),表示 TCP 连接的套接字,等等。很容易将这些设计视为理所当然的,但实际上能让这些差异巨大的东西共享一个统一的接口是非常厉害的,这使得它们可以很容易地连接在一起 [^ii]。
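[^ii]: Another example of a uniform interface is URLs and HTTP, the foundations of the Web. A URL identifies a particular thing (resource) on a website, and you can link to any URL from any other website. A user with a web browser can therefore jump seamlessly between websites by following links, even though the servers may be operated by entirely unrelated organizations. This principle seems obvious today, but it was key to making the Web the success that it is. Earlier systems were not so uniform: in the era of bulletin board systems (BBSs), for example, each system had its own phone number and baud rate configuration. A reference from one BBS to another had to take the form of a phone number and modem settings; the user had to hang up, dial the other BBS, and then manually find the information they were looking for. Linking directly to a piece of content inside another BBS was not possible.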
按照惯例,许多(但不是全部)Unix 程序将这个字节序列视为 ASCII 文本。我们的日志分析示例使用了这个事实:`awk`、`sort`、`uniq` 和 `head` 都将它们的输入文件视为由 `\n`(换行符,ASCII `0x0A`)字符分隔的记录列表。`\n` 的选择是任意的 —— 可以说,ASCII 记录分隔符 `0x1E` 本来就是一个更好的选择,因为它是为了这个目的而设计的【14】,但是无论如何,所有这些程序都使用相同的记录分隔符允许它们互操作。
每条记录(即一行输入)的解析则更加模糊。Unix 工具通常通过空白或制表符将行分割成字段,但也使用 CSV(逗号分隔),管道分隔和其他编码。即使像 `xargs` 这样一个相当简单的工具也有六个命令行选项,用于指定如何解析输入。
ASCII 文本的统一接口大多数时候都能工作,但它不是很优雅:我们的日志分析示例使用 `{print $7}` 来提取网址,这样可读性不是很好。在理想的世界中可能是 `{print $request_url}` 或类似的东西。我们稍后会回顾这个想法。
尽管几十年后还不够完美,但统一的 Unix 接口仍然是非常出色的设计。没有多少软件能像 Unix 工具一样交互组合的这么好:你不能通过自定义分析工具轻松地将电子邮件帐户的内容和在线购物历史记录以管道传送至电子表格中,并将结果发布到社交网络或维基。今天,像 Unix 工具一样流畅地运行程序是一种例外,而不是规范。
即使是具有 **相同数据模型** 的数据库,将数据从一种数据库导出再导入到另一种数据库也并不容易。缺乏整合导致了数据的 **巴尔干化**[^译注i]。
[^译注i]: **巴尔干化(Balkanization)** 是一个常带有贬义的地缘政治学术语,其定义为:一个国家或政区分裂成多个互相敌对的国家或政区的过程。
#### 逻辑与布线相分离
Unix 工具的另一个特点是使用标准输入(`stdin`)和标准输出(`stdout`)。如果你运行一个程序,而不指定任何其他的东西,标准输入来自键盘,标准输出指向屏幕。但是,你也可以从文件输入和 / 或将输出重定向到文件。管道允许你将一个进程的标准输出附加到另一个进程的标准输入(有个小内存缓冲区,而不需要将整个中间数据流写入磁盘)。
如果需要,程序仍然可以直接读取和写入文件,但 Unix 方法在程序不关心特定的文件路径、只使用标准输入和标准输出时效果最好。这允许 shell 用户以任何他们想要的方式连接输入和输出;该程序不知道或不关心输入来自哪里以及输出到哪里。(人们可以说这是一种 **松耦合(loose coupling)**,**晚期绑定(late binding)**【15】或 **控制反转(inversion of control)**【16】)。将输入 / 输出布线与程序逻辑分开,可以将小工具组合成更大的系统。
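You can even write your own programs and combine them with the tools provided by the operating system. Your program only needs to read its input from standard input and write its output to standard output, and it can then take part in data processing pipelines. In the log analysis example, you could write a tool that translates User-Agent strings into more sensible browser identifiers, or a tool that translates IP addresses into country codes, and simply plug it into the pipeline. The `sort` program does not care whether it is communicating with another part of the operating system or with a program written by you.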
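As a rough sketch of such a tool, the Ruby code below turns the User-Agent field of each log line into a short browser name. The `BROWSERS` table, `browser_id`, and `run_filter` are hypothetical names invented for this example, and the patterns are deliberately crude; a real User-Agent parser would need far more rules.

```ruby
# Illustrative patterns only. Chrome is listed before Safari because
# Chrome's User-Agent string also contains "Safari/".
BROWSERS = [
  [/Firefox\//, 'firefox'],
  [/Chrome\//,  'chrome'],
  [/Safari\//,  'safari']
].freeze

def browser_id(user_agent)
  match = BROWSERS.find { |pattern, _name| user_agent =~ pattern }
  match ? match.last : 'other'
end

# Reads log lines from `input` and writes one browser name per line to
# `output`. The User-Agent is assumed to be the last double-quoted field.
def run_filter(input = $stdin, output = $stdout)
  input.each_line do |line|
    user_agent = (line.scan(/"([^"]*)"/).last || ['']).first
    output.puts browser_id(user_agent)
  end
end
```

With a final line that calls `run_filter`, a script like this could sit in the middle of a pipeline (for example between `cat` and `sort | uniq -c`), because it touches nothing but its standard input and standard output.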
但是,使用 `stdin` 和 `stdout` 能做的事情是有限的。需要多个输入或输出的程序虽然可能,却非常棘手。你没法将程序的输出管道连接至网络连接中【17,18】[^iii] 。如果程序直接打开文件进行读取和写入,或者将另一个程序作为子进程启动,或者打开网络连接,那么 I/O 的布线就取决于程序本身了。它仍然可以被配置(例如通过命令行选项),但在 Shell 中对输入和输出进行布线的灵活性就少了。
[^iii]: 除了使用一个单独的工具,如 `netcat` 或 `curl`。Unix 起初试图将所有东西都表示为文件,但是 BSD 套接字 API 偏离了这个惯例【17】。研究用操作系统 Plan 9 和 Inferno 在使用文件方面更加一致:它们将 TCP 连接表示为 `/net/tcp` 中的文件【18】。
#### 透明度和实验
使 Unix 工具如此成功的部分原因是,它们使查看正在发生的事情变得非常容易:
- Unix 命令的输入文件通常被视为不可变的。这意味着你可以随意运行命令,尝试各种命令行选项,而不会损坏输入文件。
- 你可以在任何时候结束管道,将管道输出到 `less`,然后查看它是否具有预期的形式。这种检查能力对调试非常有用。
- 你可以将一个流水线阶段的输出写入文件,并将该文件用作下一阶段的输入。这使你可以重新启动后面的阶段,而无需重新运行整个管道。
因此,与关系数据库的查询优化器相比,即使 Unix 工具非常简单,但仍然非常有用,特别是对于实验而言。
然而,Unix 工具的最大局限在于它们只能在一台机器上运行 —— 而 Hadoop 这样的工具即应运而生。
## MapReduce和分布式文件系统
MapReduce 有点像 Unix 工具,但分布在数千台机器上。像 Unix 工具一样,它相当简单粗暴,但令人惊异地管用。一个 MapReduce 作业可以和一个 Unix 进程相类比:它接受一个或多个输入,并产生一个或多个输出。
和大多数 Unix 工具一样,运行 MapReduce 作业通常不会修改输入,除了生成输出外没有任何副作用。输出文件以连续的方式一次性写入(一旦写入文件,不会修改任何现有的文件部分)。
虽然 Unix 工具使用 `stdin` 和 `stdout` 作为输入和输出,但 MapReduce 作业在分布式文件系统上读写文件。在 Hadoop 的 MapReduce 实现中,该文件系统被称为 **HDFS(Hadoop 分布式文件系统)**,一个 Google 文件系统(GFS)的开源实现【19】。
除 HDFS 外,还有各种其他分布式文件系统,如 GlusterFS 和 Quantcast File System(QFS)【20】。诸如 Amazon S3、Azure Blob 存储和 OpenStack Swift【21】等对象存储服务在很多方面都是相似的 [^iv]。在本章中,我们将主要使用 HDFS 作为示例,但是这些原则适用于任何分布式文件系统。
[^iv]: 一个不同之处在于,对于 HDFS,可以将计算任务安排在存储特定文件副本的计算机上运行,而对象存储通常将存储和计算分开。如果网络带宽是一个瓶颈,从本地磁盘读取有性能优势。但是请注意,如果使用纠删码(Erasure Coding),则会丢失局部性,因为来自多台机器的数据必须进行合并以重建原始文件【20】。
与网络连接存储(NAS)和存储区域网络(SAN)架构的共享磁盘方法相比,HDFS 基于 **无共享** 原则(请参阅 [第二部分](/v1/part-ii) 的介绍)。共享磁盘存储由集中式存储设备实现,通常使用定制硬件和专用网络基础设施(如光纤通道)。而另一方面,无共享方法不需要特殊的硬件,只需要通过传统数据中心网络连接的计算机。
HDFS 在每台机器上运行了一个守护进程,它对外暴露网络服务,允许其他节点访问存储在该机器上的文件(假设数据中心中的每台通用计算机都挂载着一些磁盘)。名为 **NameNode** 的中央服务器会跟踪哪个文件块存储在哪台机器上。因此,HDFS 在概念上创建了一个大型文件系统,可以使用所有运行有守护进程的机器的磁盘。
为了容忍机器和磁盘故障,文件块被复制到多台机器上。复制可能意味着多个机器上的相同数据的多个副本,如 [第五章](/v1/ch5) 中所述,或者诸如 Reed-Solomon 码这样的纠删码方案,它能以比完全复制更低的存储开销来支持恢复丢失的数据【20,22】。这些技术与 RAID 相似,后者可以在连接到同一台机器的多个磁盘上提供冗余;区别在于在分布式文件系统中,文件访问和复制是在传统的数据中心网络上完成的,没有特殊的硬件。
HDFS 的可伸缩性已经很不错了:在撰写本书时,最大的 HDFS 部署运行在上万台机器上,总存储容量达数百 PB【23】。如此大的规模已经变得可行,因为使用商品硬件和开源软件的 HDFS 上的数据存储和访问成本远低于在专用存储设备上支持同等容量的成本【24】。
### MapReduce作业执行
MapReduce 是一个编程框架,你可以使用它编写代码来处理 HDFS 等分布式文件系统中的大型数据集。理解它的最简单方法是参考 “[简单日志分析](#简单日志分析)” 中的 Web 服务器日志分析示例。MapReduce 中的数据处理模式与此示例非常相似:
1. 读取一组输入文件,并将其分解成 **记录(records)**。在 Web 服务器日志示例中,每条记录都是日志中的一行(即 `\n` 是记录分隔符)。
2. 调用 Mapper 函数,从每条输入记录中提取一对键值。在前面的例子中,Mapper 函数是 `awk '{print $7}'`:它提取 URL(`$7`)作为键,并将值留空。
3. 按键排序所有的键值对。在日志的例子中,这由第一个 `sort` 命令完成。
4. 调用 Reducer 函数遍历排序后的键值对。如果同一个键出现多次,排序使它们在列表中相邻,所以很容易组合这些值而不必在内存中保留很多状态。在前面的例子中,Reducer 是由 `uniq -c` 命令实现的,该命令使用相同的键来统计相邻记录的数量。
这四个步骤可以作为一个 MapReduce 作业执行。步骤 2(Map)和 4(Reduce)是你编写自定义数据处理代码的地方。步骤 1(将文件分解成记录)由输入格式解析器处理。步骤 3 中的排序步骤隐含在 MapReduce 中 —— 你不必编写它,因为 Mapper 的输出始终在送往 Reducer 之前进行排序。
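The four steps can be mimicked on a single machine in Ruby. This is a toy model of the processing pattern, not the Hadoop API: `map_record`, `reduce_group`, and `run_mapreduce` are names made up for this sketch, and everything runs in one process and in memory.

```ruby
# Step 2: the mapper extracts a key (the URL) from one record; the value is empty.
def map_record(line)
  url = line.split[6]
  url ? [[url, nil]] : []
end

# Step 4: the reducer is handed one key together with all of its values.
def reduce_group(url, values)
  [[url, values.size]]
end

def run_mapreduce(input_text)
  records = input_text.each_line.map(&:chomp)            # step 1: split into records
  pairs   = records.flat_map { |r| map_record(r) }       # step 2: map
  sorted  = pairs.sort_by { |key, _value| key }          # step 3: sort by key
  sorted.chunk_while { |a, b| a[0] == b[0] }             # step 4: reduce adjacent keys
        .flat_map { |group| reduce_group(group[0][0], group.map(&:last)) }
end
```

Only `map_record` and `reduce_group` contain application logic; the splitting, sorting, and grouping in `run_mapreduce` stand in for what the framework provides.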
要创建 MapReduce 作业,你需要实现两个回调函数,Mapper 和 Reducer,其行为如下(请参阅 “[MapReduce 查询](/v1/ch2#MapReduce查询)”):
Mapper
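: The mapper is called once for every input record, and its job is to extract the key and value from that record. For each input, it may generate any number of key-value pairs (including none). It keeps no state from one input record to the next, so each record is handled independently.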
Reducer
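: The MapReduce framework takes the key-value pairs produced by the mappers, collects all the values belonging to the same key, and calls the reducer with an iterator over that collection of values. The reducer can produce output records (such as the number of occurrences of the same URL).
In the web server log example, we had a second `sort` command in step 5, which ranked URLs by number of requests. In MapReduce, if you need a second sorting stage, you can implement it by writing a second MapReduce job and using the output of the first job as the input to the second. Viewed this way, the role of the mapper is to put the data into a form that is suitable for sorting, and the role of the reducer is to process the data that has been sorted.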
#### 分布式执行MapReduce
MapReduce 与 Unix 命令管道的主要区别在于,MapReduce 可以在多台机器上并行执行计算,而无需编写代码来显式处理并行问题。Mapper 和 Reducer 一次只能处理一条记录;它们不需要知道它们的输入来自哪里,或者输出去往什么地方,所以框架可以处理在机器之间移动数据的复杂性。
在分布式计算中可以使用标准的 Unix 工具作为 Mapper 和 Reducer【25】,但更常见的是,它们被实现为传统编程语言的函数。在 Hadoop MapReduce 中,Mapper 和 Reducer 都是实现特定接口的 Java 类。在 MongoDB 和 CouchDB 中,Mapper 和 Reducer 都是 JavaScript 函数(请参阅 “[MapReduce 查询](/v1/ch2#MapReduce查询)”)。
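[Figure 10-1](/v1/ddia_1001.png) shows the dataflow in a Hadoop MapReduce job. Its parallelization is based on partitioning (see [Chapter 6](/v1/ch6)): the input to a job is typically a directory in HDFS, and each file or file block within the input directory is considered a separate partition that can be processed by a separate map task (marked m1, m2, and m3 in [Figure 10-1](/v1/ddia_1001.png)).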
每个输入文件的大小通常是数百兆字节。MapReduce 调度器(图中未显示)试图在其中一台存储输入文件副本的机器上运行每个 Mapper,只要该机器有足够的备用 RAM 和 CPU 资源来运行 Mapper 任务【26】。这个原则被称为 **将计算放在数据附近**【27】:它节省了通过网络复制输入文件的开销,减少网络负载并增加局部性。

**图 10-1 具有三个 Mapper 和三个 Reducer 的 MapReduce 任务**
在大多数情况下,应该在 Mapper 任务中运行的应用代码在将要运行它的机器上还不存在,所以 MapReduce 框架首先将代码(例如 Java 程序中的 JAR 文件)复制到适当的机器。然后启动 Map 任务并开始读取输入文件,一次将一条记录传入 Mapper 回调函数。Mapper 的输出由键值对组成。
计算的 Reduce 端也被分区。虽然 Map 任务的数量由输入文件块的数量决定,但 Reducer 的任务的数量是由作业作者配置的(它可以不同于 Map 任务的数量)。为了确保具有相同键的所有键值对最终落在相同的 Reducer 处,框架使用键的散列值来确定哪个 Reduce 任务应该接收到特定的键值对(请参阅 “[根据键的散列分区](/v1/ch6#根据键的散列分区)”)。
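A minimal sketch of this routing rule in Ruby follows. CRC32 is used here only because it is a deterministic hash available in the standard library; it is an assumption of this example, not the hash function Hadoop uses, and `reducer_for` and `partition` are hypothetical names.

```ruby
require 'zlib'

# Choose the reduce task for a key: same key, same reducer, every time.
def reducer_for(key, num_reducers)
  Zlib.crc32(key) % num_reducers
end

# Split a mapper's output into one bucket per reduce task.
def partition(pairs, num_reducers)
  pairs.group_by { |key, _value| reducer_for(key, num_reducers) }
end
```

Because the reducer index depends only on the key and the configured number of reduce tasks, mappers running on different machines agree on the destination of every key without coordinating with each other.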
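The key-value pairs must be sorted, but the dataset is likely too large to be sorted with a conventional sorting algorithm on a single machine. Instead, the sorting is performed in stages. First, each map task partitions its output by reducer. Each of these partitions is written to a sorted file on the mapper's local disk, using a technique similar to the one discussed in “[SSTables and LSM-Trees](/v1/ch3#SSTables和LSM树)”.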
只要当 Mapper 读取完输入文件,并写完排序后的输出文件,MapReduce 调度器就会通知 Reducer 可以从该 Mapper 开始获取输出文件。Reducer 连接到每个 Mapper,并下载自己相应分区的有序键值对文件。按 Reducer 分区,排序,从 Mapper 向 Reducer 复制分区数据,这一整个过程被称为 **混洗(shuffle)**【26】(一个容易混淆的术语 —— 不像洗牌,在 MapReduce 中的混洗没有随机性)。
Reduce 任务从 Mapper 获取文件,并将它们合并在一起,并保留有序特性。因此,如果不同的 Mapper 生成了键相同的记录,则在 Reducer 的输入中,这些记录将会相邻。
Reducer 调用时会收到一个键,和一个迭代器作为参数,迭代器会顺序地扫过所有具有该键的记录(因为在某些情况可能无法完全放入内存中)。Reducer 可以使用任意逻辑来处理这些记录,并且可以生成任意数量的输出记录。这些输出记录会写入分布式文件系统上的文件中(通常是在跑 Reducer 的机器本地磁盘上留一份,并在其他机器上留几份副本)。
#### MapReduce工作流
单个 MapReduce 作业可以解决的问题范围很有限。以日志分析为例,单个 MapReduce 作业可以确定每个 URL 的页面浏览次数,但无法确定最常见的 URL,因为这需要第二轮排序。
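It is therefore very common for MapReduce jobs to be chained together into **workflows**, such that the output of one job becomes the input to the next job. The Hadoop MapReduce framework has no particular support for workflows, so this chaining is done implicitly by directory name: the first job must be configured to write its output to a designated directory in HDFS, and the second job must be configured to read that same directory as its input. From the MapReduce framework's point of view, they are two independent jobs.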
因此,被链接的 MapReduce 作业并没有那么像 Unix 命令管道(它直接将一个进程的输出作为另一个进程的输入,仅用一个很小的内存缓冲区)。它更像是一系列命令,其中每个命令的输出写入临时文件,下一个命令从临时文件中读取。这种设计有利也有弊,我们将在 “[物化中间状态](#物化中间状态)” 中讨论。
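The directory-based chaining can be imitated on a local filesystem. In the sketch below, two toy jobs are connected only by the name of an intermediate directory; `run_file_job`, `top_urls_workflow`, the `part-00000` file name, and the tab-separated record format are all assumptions of this example rather than Hadoop APIs. For brevity the input files contain one URL per line.

```ruby
require 'tmpdir'

# A toy "job": read every file in input_dir, map, sort by key, reduce,
# and write the result to a single file in output_dir.
# Dir.mkdir raises if output_dir already exists, so a job never overwrites output.
def run_file_job(input_dir, output_dir, mapper:, reducer:)
  lines = Dir.glob(File.join(input_dir, '*')).sort
             .flat_map { |path| File.readlines(path, chomp: true) }
  pairs = lines.flat_map { |line| mapper.call(line) }.sort_by(&:first)
  output = pairs.chunk_while { |a, b| a[0] == b[0] }
                .flat_map { |group| reducer.call(group[0][0], group.map(&:last)) }
  Dir.mkdir(output_dir)
  File.write(File.join(output_dir, 'part-00000'), output.map { |l| "#{l}\n" }.join)
end

def top_urls_workflow(log_dir, work_dir)
  counts_dir = File.join(work_dir, 'url-counts')   # output of job 1, input of job 2
  ranked_dir = File.join(work_dir, 'ranked-urls')  # final output

  # Job 1: count the occurrences of each URL.
  run_file_job(log_dir, counts_dir,
               mapper:  ->(url) { [[url, 1]] },
               reducer: ->(url, ones) { ["#{url}\t#{ones.sum}"] })

  # Job 2: sort by descending count, using the negated count as the key.
  run_file_job(counts_dir, ranked_dir,
               mapper:  ->(line) { url, n = line.split("\t"); [[-n.to_i, url]] },
               reducer: ->(neg_count, urls) { urls.map { |u| "#{-neg_count}\t#{u}" } })
  ranked_dir
end
```

Neither job knows about the other; the only thing tying them together is that `counts_dir` appears once as an output and once as an input, and that the second job is started after the first has finished.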
只有当作业成功完成后,批处理作业的输出才会被视为有效的(MapReduce 会丢弃失败作业的部分输出)。因此,工作流中的一项作业只有在先前的作业 —— 即生产其输入的作业 —— 成功完成后才能开始。为了处理这些作业之间的依赖,有很多针对 Hadoop 的工作流调度器被开发出来,包括 Oozie、Azkaban、Luigi、Airflow 和 Pinball 【28】。
这些调度程序还具有管理功能,在维护大量批处理作业时非常有用。在构建推荐系统时,由 50 到 100 个 MapReduce 作业组成的工作流是常见的【29】。而在大型组织中,许多不同的团队可能运行不同的作业来读取彼此的输出。工具支持对于管理这样复杂的数据流而言非常重要。
Hadoop 的各种高级工具(如 Pig 【30】、Hive 【31】、Cascading 【32】、Crunch 【33】和 FlumeJava 【34】)也能自动布线组装多个 MapReduce 阶段,生成合适的工作流。
### Reduce侧连接与分组
我们在 [第二章](/v1/ch2) 中讨论了数据模型和查询语言的连接,但是我们还没有深入探讨连接是如何实现的。现在是我们再次捡起这条线索的时候了。
在许多数据集中,一条记录与另一条记录存在关联是很常见的:关系模型中的 **外键**,文档模型中的 **文档引用** 或图模型中的 **边**。当你需要同时访问这一关联的两侧(持有引用的记录与被引用的记录)时,连接就是必须的。正如 [第二章](/v1/ch2) 所讨论的,非规范化可以减少对连接的需求,但通常无法将其完全移除 [^v]。
[^v]: 我们在本书中讨论的连接通常是等值连接,即最常见的连接类型,其中记录通过与其他记录在特定字段(例如 ID)中具有 **相同值** 相关联。有些数据库支持更通用的连接类型,例如使用小于运算符而不是等号运算符,但是我们没有地方来讲这些东西。
在数据库中,如果执行只涉及少量记录的查询,数据库通常会使用 **索引** 来快速定位感兴趣的记录(请参阅 [第三章](/v1/ch3))。如果查询涉及到连接,则可能涉及到查找多个索引。然而 MapReduce 没有索引的概念 —— 至少在通常意义上没有。
当 MapReduce 作业被赋予一组文件作为输入时,它读取所有这些文件的全部内容;数据库会将这种操作称为 **全表扫描**。如果你只想读取少量的记录,则全表扫描与索引查询相比,代价非常高昂。但是在分析查询中(请参阅 “[事务处理还是分析?](/v1/ch3#事务处理还是分析?)”),通常需要计算大量记录的聚合。在这种情况下,特别是如果能在多台机器上并行处理时,扫描整个输入可能是相当合理的事情。
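When we talk about joins in the context of batch processing, we mean resolving all occurrences of some association within a dataset. For example, we assume that a job is processing the data for all users simultaneously, not merely looking up the data for one particular user (which would be done far more efficiently with an index).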
#### 示例:用户活动事件分析
[图 10-2](/v1/ddia_1002.png) 给出了一个批处理作业中连接的典型例子。左侧是事件日志,描述登录用户在网站上做的事情(称为 **活动事件**,即 activity events,或 **点击流数据**,即 clickstream data),右侧是用户数据库。你可以将此示例看作是星型模式的一部分(请参阅 “[星型和雪花型:分析的模式](/v1/ch3#星型和雪花型:分析的模式)”):事件日志是事实表,用户数据库是其中的一个维度。

**图 10-2 用户行为日志与用户档案的连接**
分析任务可能需要将用户活动与用户档案信息相关联:例如,如果档案包含用户的年龄或出生日期,系统就可以确定哪些页面更受哪些年龄段的用户欢迎。然而活动事件仅包含用户 ID,而没有包含完整的用户档案信息。在每个活动事件中嵌入这些档案信息很可能会非常浪费。因此,活动事件需要与用户档案数据库相连接。
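The simplest implementation of this join would go over the activity events one by one and query the user database (on a remote server) for every user ID it encounters. This is possible, but it would most likely suffer from very poor performance: the processing throughput would be limited by the round-trip time to the database server, the effectiveness of a local cache would depend very much on the distribution of data, and running a large number of queries in parallel could easily overwhelm the database【35】.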
为了在批处理过程中实现良好的吞吐量,计算必须(尽可能)限于单台机器上进行。为待处理的每条记录发起随机访问的网络请求实在是太慢了。而且,查询远程数据库意味着批处理作业变为 **非确定的(nondeterministic)**,因为远程数据库中的数据可能会改变。
因此,更好的方法是获取用户数据库的副本(例如,使用 ETL 进程从数据库备份中提取数据,请参阅 “[数据仓库](/v1/ch3#数据仓库)”),并将它和用户行为日志放入同一个分布式文件系统中。然后你可以将用户数据库存储在 HDFS 中的一组文件中,而用户活动记录存储在另一组文件中,并能用 MapReduce 将所有相关记录集中到同一个地方进行高效处理。
#### 排序合并连接
回想一下,Mapper 的目的是从每个输入记录中提取一对键值。在 [图 10-2](/v1/ddia_1002.png) 的情况下,这个键就是用户 ID:一组 Mapper 会扫过活动事件(提取用户 ID 作为键,活动事件作为值),而另一组 Mapper 将会扫过用户数据库(提取用户 ID 作为键,用户的出生日期作为值)。这个过程如 [图 10-3](/v1/ddia_1003.png) 所示。

**图 10-3 在用户 ID 上进行的 Reduce 端连接。如果输入数据集分区为多个文件,则每个分区都会被多个 Mapper 并行处理**
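When the MapReduce framework partitions the mapper output by key and then sorts the key-value pairs, the effect is that all the activity events and the user record with the same user ID become adjacent to each other in the reducer input. The MapReduce job can even arrange the records to be sorted such that the reducer always sees the record from the user database first, followed by the activity events in timestamp order. This technique is known as a **secondary sort**【26】.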
然后 Reducer 可以容易地执行实际的连接逻辑:每个用户 ID 都会被调用一次 Reducer 函数,且因为二次排序,第一个值应该是来自用户数据库的出生日期记录。Reducer 将出生日期存储在局部变量中,然后使用相同的用户 ID 遍历活动事件,输出 **已观看网址** 和 **观看者年龄** 的结果对。随后的 Map-Reduce 作业可以计算每个 URL 的查看者年龄分布,并按年龄段进行聚集。
由于 Reducer 一次处理一个特定用户 ID 的所有记录,因此一次只需要将一条用户记录保存在内存中,而不需要通过网络发出任何请求。这个算法被称为 **排序合并连接(sort-merge join)**,因为 Mapper 的输出是按键排序的,然后 Reducer 将来自连接两侧的有序记录列表合并在一起。
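The following Ruby sketch imitates a reduce-side sort-merge join with a secondary sort in a single process. The record layouts (`:id`, `:birth_year`, `:user_id`, `:url`) and the method names are assumptions made for illustration. The composite key `[user_id, tag]` makes the user record (tag 0) sort ahead of that user's events (tag 1); ordering the events by timestamp is omitted, and the birth year stands in for the viewer's age.

```ruby
# Mapper for the user database: tag 0 sorts the user record first.
def map_user(user)
  [[[user[:id], 0], user[:birth_year]]]
end

# Mapper for activity events: tag 1 sorts events after the user record.
def map_event(event)
  [[[event[:user_id], 1], event[:url]]]
end

def reduce_side_join(users, events)
  pairs  = users.flat_map { |u| map_user(u) } + events.flat_map { |e| map_event(e) }
  sorted = pairs.sort_by(&:first)   # primary sort: user ID, secondary sort: tag

  # One "reducer call" per user ID; only one user's birth year is held at a time.
  sorted.chunk_while { |a, b| a[0][0] == b[0][0] }.flat_map do |group|
    birth_year = nil
    group.each_with_object([]) do |((_user_id, tag), value), out|
      if tag.zero?
        birth_year = value                      # record from the user database
      elsif birth_year
        out << [value, birth_year]              # [viewed URL, viewer's birth year]
      end
    end
  end
end
```

Events whose user ID has no matching user record produce no output here, which corresponds to an inner join; that choice is part of the sketch, not something the text prescribes.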
#### 把相关数据放在一起
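In a sort-merge join, the mappers and the sorting process make sure that all the data needed to perform the join for a particular user ID is brought together in the same place: a single call to the reducer. Having lined up all the required data in advance, the reducer can be a fairly simple, single-threaded piece of code that churns through records with high throughput and low memory overhead.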
这种架构可以看做,Mapper 将 “消息” 发送给 Reducer。当一个 Mapper 发出一个键值对时,这个键的作用就像值应该传递到的目标地址。即使键只是一个任意的字符串(不是像 IP 地址和端口号那样的实际的网络地址),它表现的就像一个地址:所有具有相同键的键值对将被传递到相同的目标(一次 Reducer 的调用)。
使用 MapReduce 编程模型,能将计算的物理网络通信层面(从正确的机器获取数据)从应用逻辑中剥离出来(获取数据后执行处理)。这种分离与数据库的典型用法形成了鲜明对比,从数据库中获取数据的请求经常出现在应用代码内部【36】。由于 MapReduce 处理了所有的网络通信,因此它也避免了让应用代码去担心部分故障,例如另一个节点的崩溃:MapReduce 在不影响应用逻辑的情况下能透明地重试失败的任务。
#### 分组
除了连接之外,“把相关数据放在一起” 的另一种常见模式是,按某个键对记录分组(如 SQL 中的 GROUP BY 子句)。所有带有相同键的记录构成一个组,而下一步往往是在每个组内进行某种聚合操作,例如:
- 统计每个组中记录的数量(例如在统计 PV 的例子中,在 SQL 中表示为 `COUNT(*)` 聚合)
- 对某个特定字段求和(SQL 中的 `SUM(fieldname)`)
- 按某种分级函数取出排名前 k 条记录。
使用 MapReduce 实现这种分组操作的最简单方法是设置 Mapper,以便它们生成的键值对使用所需的分组键。然后分区和排序过程将所有具有相同分区键的记录导向同一个 Reducer。因此在 MapReduce 之上实现分组和连接看上去非常相似。
分组的另一个常见用途是整理特定用户会话的所有活动事件,以找出用户进行的一系列操作(称为 **会话化(sessionization)**【37】)。例如,可以使用这种分析来确定显示新版网站的用户是否比那些显示旧版本的用户更有购买欲(A/B 测试),或者计算某个营销活动是否值得。
如果你有多个 Web 服务器处理用户请求,则特定用户的活动事件很可能分散在各个不同的服务器的日志文件中。你可以通过使用会话 cookie,用户 ID 或类似的标识符作为分组键,以将特定用户的所有活动事件放在一起来实现会话化,与此同时,不同用户的事件仍然散布在不同的分区中。
#### 处理偏斜
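The pattern of “bringing all records with the same key to the same place” breaks down if there is a very large amount of data related to a single key. For example, in a social network, most users might be connected to a few hundred people, but a small number of celebrities may have many millions of followers. Such disproportionately active database records are known as **linchpin objects**【38】 or **hot keys**.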
在单个 Reducer 中收集与某个名人相关的所有活动(例如他们发布内容的回复)可能导致严重的 **偏斜**(也称为 **热点**,即 hot spot)—— 也就是说,一个 Reducer 必须比其他 Reducer 处理更多的记录(请参阅 “[负载偏斜与热点消除](/v1/ch6#负载偏斜与热点消除)”)。由于 MapReduce 作业只有在所有 Mapper 和 Reducer 都完成时才完成,所有后续作业必须等待最慢的 Reducer 才能启动。
如果连接的输入存在热键,可以使用一些算法进行补偿。例如,Pig 中的 **偏斜连接(skewed join)** 方法首先运行一个抽样作业(Sampling Job)来确定哪些键是热键【39】。连接实际执行时,Mapper 会将热键的关联记录 **随机**(相对于传统 MapReduce 基于键散列的确定性方法)发送到几个 Reducer 之一。对于另外一侧的连接输入,与热键相关的记录需要被复制到 **所有** 处理该键的 Reducer 上【40】。
这种技术将处理热键的工作分散到多个 Reducer 上,这样可以使其更好地并行化,代价是需要将连接另一侧的输入记录复制到多个 Reducer 上。Crunch 中的 **分片连接(sharded join)** 方法与之类似,但需要显式指定热键而不是使用抽样作业。这种技术也非常类似于我们在 “[负载偏斜与热点消除](/v1/ch6#负载偏斜与热点消除)” 中讨论的技术,使用随机化来缓解分区数据库中的热点。
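The routing rule behind a skewed join can be sketched as follows. This is a simplified model, not Pig's or Crunch's implementation: the list of hot keys is passed in directly (standing in for the sampling job or for explicit configuration), hot-key records are replicated to every reducer, CRC32 is an arbitrary choice of hash, and each reducer joins with a small in-memory hash for brevity. All method names are hypothetical.

```ruby
require 'zlib'

# Deterministic routing for ordinary (non-hot) keys.
def home_reducer(key, num_reducers)
  Zlib.crc32(key.to_s) % num_reducers
end

# events and users are arrays of [user_id, payload] pairs.
def route_skewed_join(events, users, hot_keys, num_reducers, rng: Random.new(42))
  inputs = Array.new(num_reducers) { { events: [], users: [] } }

  # Large side: events for a hot key go to a randomly chosen reducer.
  events.each do |user_id, url|
    target = hot_keys.include?(user_id) ? rng.rand(num_reducers) : home_reducer(user_id, num_reducers)
    inputs[target][:events] << [user_id, url]
  end

  # Other side: the record for a hot key is replicated to every reducer.
  users.each do |user_id, name|
    targets = hot_keys.include?(user_id) ? (0...num_reducers).to_a : [home_reducer(user_id, num_reducers)]
    targets.each { |t| inputs[t][:users] << [user_id, name] }
  end
  inputs
end

# The join performed independently on each reducer's share of the input.
def join_on_reducer(input)
  names = input[:users].to_h
  input[:events].select { |user_id, _url| names.key?(user_id) }
                .map { |user_id, url| [url, names[user_id]] }
end
```

The join output is the same as with purely hash-based routing, but no single reducer has to process all of the hot key's events; the price is the extra copies of the hot key's record on the other side of the join.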
Hive 的偏斜连接优化采取了另一种方法。它需要在表格元数据中显式指定热键,并将与这些键相关的记录单独存放,与其它文件分开。当在该表上执行连接时,对于热键,它会使用 Map 端连接(请参阅下一节)。
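When grouping records by a hot key and aggregating them, you can perform the grouping in two stages. The first MapReduce stage sends records to a random reducer, so that each reducer performs the grouping on a subset of the records for the hot key and outputs a more compact aggregated value per key. The second MapReduce job then combines the values from all of the first-stage reducers into a single value per key.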
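For a simple count, the two stages might look like the sketch below. It runs in one process and uses a seeded random number generator so that runs are repeatable; `two_stage_count` and its parameters are names invented for this example.

```ruby
def two_stage_count(keys, num_reducers, rng: Random.new(1))
  # Stage 1: records are routed to a random reducer, and each reducer
  # pre-aggregates whatever subset of records it happens to receive.
  partials = Array.new(num_reducers) { Hash.new(0) }
  keys.each { |key| partials[rng.rand(num_reducers)][key] += 1 }

  # Stage 2: the compact partial counts are combined into one value per key.
  partials.each_with_object(Hash.new(0)) do |partial, totals|
    partial.each { |key, count| totals[key] += count }
  end
end
```

This works because counting is associative: summing partial counts gives the same answer as counting everything in one place. An aggregation without that property would need a different intermediate representation.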
### Map侧连接
上一节描述的连接算法在 Reducer 中执行实际的连接逻辑,因此被称为 Reduce 侧连接。Mapper 扮演着预处理输入数据的角色:从每个输入记录中提取键值,将键值对分配给 Reducer 分区,并按键排序。
Reduce 侧方法的优点是不需要对输入数据做任何假设:无论其属性和结构如何,Mapper 都可以对其预处理以备连接。然而不利的一面是,排序,复制至 Reducer,以及合并 Reducer 输入,所有这些操作可能开销巨大。当数据通过 MapReduce 阶段时,数据可能需要落盘好几次,取决于可用的内存缓冲区【37】。
另一方面,如果你 **能** 对输入数据作出某些假设,则通过使用所谓的 Map 侧连接来加快连接速度是可行的。这种方法使用了一个裁减掉 Reducer 与排序的 MapReduce 作业,每个 Mapper 只是简单地从分布式文件系统中读取一个输入文件块,然后将输出文件写入文件系统,仅此而已。
#### 广播散列连接
适用于执行 Map 端连接的最简单场景是大数据集与小数据集连接的情况。要点在于小数据集需要足够小,以便可以将其全部加载到每个 Mapper 的内存中。
例如,假设在 [图 10-2](/v1/ddia_1002.png) 的情况下,用户数据库小到足以放进内存中。在这种情况下,当 Mapper 启动时,它可以首先将用户数据库从分布式文件系统读取到内存中的散列表中。完成此操作后,Mapper 可以扫描用户活动事件,并简单地在散列表中查找每个事件的用户 ID [^vi]。
[^vi]: 这个例子假定散列表中的每个键只有一个条目,这对用户数据库(用户 ID 唯一标识一个用户)可能是正确的。通常,哈希表可能需要包含具有相同键的多个条目,而连接运算符将对每个键输出所有的匹配。
参与连接的较大输入的每个文件块各有一个 Mapper(在 [图 10-2](/v1/ddia_1002.png) 的例子中活动事件是较大的输入)。每个 Mapper 都会将较小输入整个加载到内存中。
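This simple but effective algorithm is called a **broadcast hash join**: the word **broadcast** reflects the fact that each mapper for a partition of the large input reads the entirety of the small input (so the small input is effectively “broadcast” to all partitions of the large input), and the word **hash** reflects its use of a hash table. This join method is supported by Pig (under the name “**replicated join**”), Hive (“**MapJoin**”), Cascading, and Crunch. It is also used in data warehouse query engines such as Impala【41】.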
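A broadcast hash join can be sketched in Ruby as below. Each element of `event_blocks` plays the role of one file block of the large input, handled by its own mapper; the record fields and the method name are assumptions of this example.

```ruby
def broadcast_hash_join(event_blocks, users)
  event_blocks.flat_map do |block|                     # one mapper per block of the large input
    # Every mapper loads the entire small input into its own hash table.
    table = users.map { |u| [u[:id], u] }.to_h
    # Then it scans its block and looks up each event's user ID.
    block.select { |e| table.key?(e[:user_id]) }
         .map { |e| [e[:url], table[e[:user_id]][:birth_year]] }
  end
end
```

There is no sorting and no reducer: each mapper produces its part of the join output on its own, which is why the small input has to fit in every mapper's memory.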
除了将较小的连接输入加载到内存散列表中,另一种方法是将较小输入存储在本地磁盘上的只读索引中【42】。索引中经常使用的部分将保留在操作系统的页面缓存中,因而这种方法可以提供与内存散列表几乎一样快的随机查找性能,但实际上并不需要数据集能放入内存中。
#### 分区散列连接
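If the inputs to the map-side join are partitioned in the same way, then the hash join approach can be applied to each partition independently. In the case of [Figure 10-2](/v1/ddia_1002.png), you might arrange for the activity events and the user database to each be partitioned based on the last decimal digit of the user ID (so there are 10 partitions on either side). For example, mapper 3 first loads all users with an ID ending in 3 into a hash table, and then scans over all the activity events for each user whose ID ends in 3.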
如果分区正确无误,可以确定的是,所有你可能需要连接的记录都落在同一个编号的分区中。因此每个 Mapper 只需要从输入两端各读取一个分区就足够了。好处是每个 Mapper 都可以在内存散列表中少放点数据。
这种方法只有当连接两端输入有相同的分区数,且两侧的记录都是使用相同的键与相同的哈希函数做分区时才适用。如果输入是由之前执行过这种分组的 MapReduce 作业生成的,那么这可能是一个合理的假设。
Partitioned hash joins are known as **bucketed map joins** in Hive【37】.
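The last-digit example can be written out as a short Ruby sketch. In a real workflow the two inputs would already be stored in matching partitions; here the partitioning is done inside the example so that it is self-contained. The method names and record fields are hypothetical.

```ruby
# Partition records by the last decimal digit of the given ID field.
def partition_by_last_digit(records, id_field)
  records.group_by { |record| record[id_field] % 10 }
end

def partitioned_hash_join(events, users)
  event_parts = partition_by_last_digit(events, :user_id)
  user_parts  = partition_by_last_digit(users, :id)

  (0...10).flat_map do |p|                             # mapper p reads partition p of both inputs
    # The hash table holds only the users of this partition, not the whole database.
    table = (user_parts[p] || []).map { |u| [u[:id], u] }.to_h
    (event_parts[p] || []).select { |e| table.key?(e[:user_id]) }
                          .map { |e| [e[:url], table[e[:user_id]][:birth_year]] }
  end
end
```

Because both sides use the same partitioning function on the same key, every event is guaranteed to find its user (if one exists) within its own partition.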
#### Map侧合并连接
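Another variant of a map-side join applies if the input datasets are not only partitioned in the same way, but also **sorted** based on the same key. In this case, it does not matter whether the inputs are small enough to fit in memory, because a mapper can perform the same merging operation that would normally be done by a reducer: reading both input files incrementally, in order of ascending key, and matching records with the same key.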
如果能进行 Map 侧合并连接,这通常意味着前一个 MapReduce 作业可能一开始就已经把输入数据做了分区并进行了排序。原则上这个连接就可以在前一个作业的 Reduce 阶段进行。但使用独立的仅 Map 作业有时也是合适的,例如,分好区且排好序的中间数据集可能还会用于其他目的。
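The merge step itself is small. The sketch below assumes that both inputs are already sorted by ascending user ID and that user IDs are unique on the user side; the names `merge_join` and `next_or_nil` and the record fields are made up for this example.

```ruby
# Advance an external enumerator, returning nil once it is exhausted.
def next_or_nil(enum)
  enum.next
rescue StopIteration
  nil
end

# Both inputs must be sorted by ascending user ID.
def merge_join(sorted_users, sorted_events)
  users = sorted_users.each
  user = next_or_nil(users)
  joined = []
  sorted_events.each do |event|
    # Skip users whose ID is smaller than the current event's user ID.
    user = next_or_nil(users) while user && user[:id] < event[:user_id]
    joined << [event[:url], user[:birth_year]] if user && user[:id] == event[:user_id]
  end
  joined
end
```

Each input is read once, front to back, and only the current user record is kept in memory, so the size of either input does not matter.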
#### MapReduce工作流与Map侧连接
当下游作业使用 MapReduce 连接的输出时,选择 Map 侧连接或 Reduce 侧连接会影响输出的结构。Reduce 侧连接的输出是按照 **连接键** 进行分区和排序的,而 Map 端连接的输出则按照与较大输入相同的方式进行分区和排序(因为无论是使用分区连接还是广播连接,连接较大输入端的每个文件块都会启动一个 Map 任务)。
如前所述,Map 侧连接也对输入数据集的大小,有序性和分区方式做出了更多假设。在优化连接策略时,了解分布式文件系统中数据集的物理布局变得非常重要:仅仅知道编码格式和数据存储目录的名称是不够的;你还必须知道数据是按哪些键做的分区和排序,以及分区的数量。
在 Hadoop 生态系统中,这种关于数据集分区的元数据通常在 HCatalog 和 Hive Metastore 中维护【37】。
### 批处理工作流的输出
我们已经说了很多用于实现 MapReduce 工作流的算法,但却忽略了一个重要的问题:这些处理完成之后的最终结果是什么?我们最开始为什么要跑这些作业?
在数据库查询的场景中,我们将事务处理(OLTP)与分析两种目的区分开来(请参阅 “[事务处理还是分析?](/v1/ch3#事务处理还是分析?)”)。我们看到,OLTP 查询通常根据键查找少量记录,使用索引,并将其呈现给用户(比如在网页上)。另一方面,分析查询通常会扫描大量记录,执行分组与聚合,输出通常有着报告的形式:显示某个指标随时间变化的图表,或按照某种排位取前 10 项,或将一些数字细化为子类。这种报告的消费者通常是需要做出商业决策的分析师或经理。
批处理放哪里合适?它不属于事务处理,也不是分析。它和分析比较接近,因为批处理通常会扫过输入数据集的绝大部分。然而 MapReduce 作业工作流与用于分析目的的 SQL 查询是不同的(请参阅 “[Hadoop 与分布式数据库的对比](#Hadoop与分布式数据库的对比)”)。批处理过程的输出通常不是报表,而是一些其他类型的结构。
#### 建立搜索索引
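Google's original use of MapReduce was to build indexes for its search engine, which was implemented as a workflow of 5 to 10 MapReduce jobs【1】. Although Google later moved away from using MapReduce for this purpose【43】, it helps to understand MapReduce if you look at it through the lens of building a search index. (Even today, Hadoop MapReduce remains a good way of building indexes for Lucene/Solr【44】.)
We saw briefly in “[Full-text search and fuzzy indexes](/v1/ch3#全文搜索和模糊索引)” how a full-text search index such as Lucene works: it is a file (the term dictionary) in which you can efficiently look up a particular keyword and find the list of all the document IDs containing that keyword (the postings list). This is a very simplified view of a search index; in reality it requires various additional data in order to rank search results by relevance, correct misspellings, resolve synonyms, and so on. The principle nevertheless holds.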
如果需要对一组固定文档执行全文搜索,则批处理是一种构建索引的高效方法:Mapper 根据需要对文档集合进行分区,每个 Reducer 构建该分区的索引,并将索引文件写入分布式文件系统。构建这样的文档分区索引(请参阅 “[分区与次级索引](/v1/ch6#分区与次级索引)”)并行处理效果拔群。
由于按关键字查询搜索索引是只读操作,因而这些索引文件一旦创建就是不可变的。
如果索引的文档集合发生更改,一种选择是定期重跑整个索引工作流,并在完成后用新的索引文件批量替换以前的索引文件。如果只有少量的文档发生了变化,这种方法的计算成本可能会很高。但它的优点是索引过程很容易理解:文档进,索引出。
另一个选择是,可以增量建立索引。如 [第三章](/v1/ch3) 中讨论的,如果要在索引中添加,删除或更新文档,Lucene 会写新的段文件,并在后台异步合并压缩段文件。我们将在 [第十一章](/v1/ch11) 中看到更多这种增量处理。
#### 键值存储作为批处理输出
搜索索引只是批处理工作流可能输出的一个例子。批处理的另一个常见用途是构建机器学习系统,例如分类器(比如垃圾邮件过滤器,异常检测,图像识别)与推荐系统(例如,你可能认识的人,你可能感兴趣的产品或相关的搜索【29】)。
这些批处理作业的输出通常是某种数据库:例如,可以通过给定用户 ID 查询该用户推荐好友的数据库,或者可以通过产品 ID 查询相关产品的数据库【45】。
这些数据库需要被处理用户请求的 Web 应用所查询,而它们通常是独立于 Hadoop 基础设施的。那么批处理过程的输出如何回到 Web 应用可以查询的数据库中呢?
最直接的选择可能是,直接在 Mapper 或 Reducer 中使用你最爱的数据库的客户端库,并从批处理作业直接写入数据库服务器,一次写入一条记录。它能工作(假设你的防火墙规则允许从你的 Hadoop 环境直接访问你的生产数据库),但这并不是一个好主意,出于以下几个原因:
- 正如前面在连接的上下文中讨论的那样,为每条记录发起一个网络请求,要比批处理任务的正常吞吐量慢几个数量级。即使客户端库支持批处理,性能也可能很差。
- MapReduce 作业经常并行运行许多任务。如果所有 Mapper 或 Reducer 都同时写入相同的输出数据库,并以批处理的预期速率工作,那么该数据库很可能被轻易压垮,其查询性能可能变差。这可能会导致系统其他部分的运行问题【35】。
- 通常情况下,MapReduce 为作业输出提供了一个干净利落的 “全有或全无” 保证:如果作业成功,则结果就是每个任务恰好执行一次所产生的输出,即使某些任务失败且必须一路重试。如果整个作业失败,则不会生成输出。然而从作业内部写入外部系统,会产生外部可见的副作用,这种副作用是不能以这种方式被隐藏的。因此,你不得不去操心对其他系统可见的部分完成的作业结果,并需要理解 Hadoop 任务尝试与预测执行的复杂性。
更好的解决方案是在批处理作业 **内** 创建一个全新的数据库,并将其作为文件写入分布式文件系统中作业的输出目录,就像上节中的搜索索引一样。这些数据文件一旦写入就是不可变的,可以批量加载到处理只读查询的服务器中。不少键值存储都支持在 MapReduce 作业中构建数据库文件,包括 Voldemort 【46】、Terrapin 【47】、ElephantDB 【48】和 HBase 批量加载【49】。
构建这些数据库文件是 MapReduce 的一种好用法:使用 Mapper 提取出键并按该键排序,已经完成了构建索引所必需的大量工作。由于这些键值存储大多都是只读的(文件只能由批处理作业一次性写入,然后就不可变),所以数据结构非常简单。比如它们就不需要预写式日志(WAL,请参阅 “[让 B 树更可靠](/v1/ch3#让B树更可靠)”)。
将数据加载到 Voldemort 时,服务器将继续用旧数据文件服务请求,同时将新数据文件从分布式文件系统复制到服务器的本地磁盘。一旦复制完成,服务器会自动将查询切换到新文件。如果在这个过程中出现任何问题,它可以轻易回滚至旧文件,因为它们仍然存在而且不可变【46】。
#### 批处理输出的哲学
本章前面讨论过的 Unix 哲学(“[Unix 哲学](#Unix哲学)”)鼓励以显式指明数据流的方式进行实验:程序读取输入并写入输出。在这一过程中,输入保持不变,任何先前的输出都被新输出完全替换,且没有其他副作用。这意味着你可以随心所欲地重新运行一个命令,略做改动或进行调试,而不会搅乱系统的状态。
MapReduce 作业的输出处理遵循同样的原理。通过将输入视为不可变且避免副作用(如写入外部数据库),批处理作业不仅实现了良好的性能,而且更容易维护:
- 如果在代码中引入了一个错误,而输出错误或损坏了,则可以简单地回滚到代码的先前版本,然后重新运行该作业,输出将重新被纠正。或者,甚至更简单,你可以将旧的输出保存在不同的目录中,然后切换回原来的目录。具有读写事务的数据库没有这个属性:如果你部署了错误的代码,将错误的数据写入数据库,那么回滚代码将无法修复数据库中的数据。(能够从错误代码中恢复的概念被称为 **人类容错(human fault tolerance)**【50】)
- 由于回滚很容易,比起在错误意味着不可挽回的伤害的环境,功能开发进展能快很多。这种 **最小化不可逆性(minimizing irreversibility)** 的原则有利于敏捷软件开发【51】。
- 如果 Map 或 Reduce 任务失败,MapReduce 框架将自动重新调度,并在同样的输入上再次运行它。如果失败是由代码中的错误造成的,那么它会不断崩溃,并最终导致作业在几次尝试之后失败。但是如果故障是由于临时问题导致的,那么故障就会被容忍。因为输入不可变,这种自动重试是安全的,而失败任务的输出会被 MapReduce 框架丢弃。
- 同一组文件可用作各种不同作业的输入,包括计算指标的监控作业并且评估作业的输出是否具有预期的性质(例如,将其与前一次运行的输出进行比较并测量差异) 。
- 与 Unix 工具类似,MapReduce 作业将逻辑与布线(配置输入和输出目录)分离,这使得关注点分离,可以重用代码:一个团队可以专注实现一个做好一件事的作业;而其他团队可以决定何时何地运行这项作业。
在这些领域,在 Unix 上表现良好的设计原则似乎也适用于 Hadoop,但 Unix 和 Hadoop 在某些方面也有所不同。例如,因为大多数 Unix 工具都假设输入输出是无类型文本文件,所以它们必须做大量的输入解析工作(本章开头的日志分析示例使用 `{print $7}` 来提取 URL)。在 Hadoop 上可以通过使用更结构化的文件格式消除一些低价值的语法转换:比如 Avro(请参阅 “[Avro](/v1/ch4#Avro)”)和 Parquet(请参阅 “[列式存储](/v1/ch3#列式存储)”)经常使用,因为它们提供了基于模式的高效编码,并允许模式随时间推移而演进(见 [第四章](/v1/ch4))。
### Hadoop与分布式数据库的对比
正如我们所看到的,Hadoop 有点像 Unix 的分布式版本,其中 HDFS 是文件系统,而 MapReduce 是 Unix 进程的怪异实现(总是在 Map 阶段和 Reduce 阶段运行 `sort` 工具)。我们了解了如何在这些原语的基础上实现各种连接和分组操作。
当 MapReduce 论文发表时【1】,它从某种意义上来说 —— 并不新鲜。我们在前几节中讨论的所有处理和并行连接算法已经在十多年前所谓的 **大规模并行处理(MPP,massively parallel processing)** 数据库中实现了【3,40】。比如 Gamma database machine、Teradata 和 Tandem NonStop SQL 就是这方面的先驱【52】。
最大的区别是,MPP 数据库专注于在一组机器上并行执行分析 SQL 查询,而 MapReduce 和分布式文件系统【19】的组合则更像是一个可以运行任意程序的通用操作系统。
#### 存储多样性
数据库要求你根据特定的模型(例如关系或文档)来构造数据,而分布式文件系统中的文件只是字节序列,可以使用任何数据模型和编码来编写。它们可能是数据库记录的集合,但同样可以是文本、图像、视频、传感器读数、稀疏矩阵、特征向量、基因组序列或任何其他类型的数据。
说白了,Hadoop 开放了将数据不加区分地转储到 HDFS 的可能性,允许后续再研究如何进一步处理【53】。相比之下,在将数据导入数据库专有存储格式之前,MPP 数据库通常需要对数据和查询模式进行仔细的前期建模。
在纯粹主义者看来,这种仔细的建模和导入似乎是可取的,因为这意味着数据库的用户有更高质量的数据来处理。然而实践经验表明,简单地使数据快速可用 —— 即使它很古怪,难以使用,使用原始格式 —— 也通常要比事先决定理想数据模型要更有价值【54】。
这个想法与数据仓库类似(请参阅 “[数据仓库](/v1/ch3#数据仓库)”):将大型组织的各个部分的数据集中在一起是很有价值的,因为它可以跨越以前相互分离的数据集进行连接。MPP 数据库所要求的谨慎模式设计拖慢了集中式数据收集速度;以原始形式收集数据,稍后再操心模式的设计,能使数据收集速度加快(有时被称为 “**数据湖(data lake)**” 或 “**企业数据中心(enterprise data hub)**”【55】)。
不加区分的数据转储转移了解释数据的负担:数据集的生产者不再需要强制将其转化为标准格式,数据的解释成为消费者的问题(**读时模式** 方法【56】;请参阅 “[文档模型中的模式灵活性](/v1/ch2#文档模型中的模式灵活性)”)。如果生产者和消费者是不同优先级的不同团队,这可能是一种优势。甚至可能不存在一个理想的数据模型,对于不同目的有不同的合适视角。以原始形式简单地转储数据,可以允许多种这样的转换。这种方法被称为 **寿司原则(sushi principle)**:“原始数据更好”【57】。
因此,Hadoop 经常被用于实现 ETL 过程(请参阅 “[数据仓库](/v1/ch3#数据仓库)”):事务处理系统中的数据以某种原始形式转储到分布式文件系统中,然后编写 MapReduce 作业来清理数据,将其转换为关系形式,并将其导入 MPP 数据仓库以进行分析。数据建模仍然在进行,但它在一个单独的步骤中进行,与数据收集相解耦。这种解耦是可行的,因为分布式文件系统支持以任何格式编码的数据。
#### 处理模型的多样性
MPP 数据库是单体的,紧密集成的软件,负责磁盘上的存储布局,查询计划,调度和执行。由于这些组件都可以针对数据库的特定需求进行调整和优化,因此整个系统可以在其设计针对的查询类型上取得非常好的性能。而且,SQL 查询语言允许以优雅的语法表达查询,而无需编写代码,可以在业务分析师使用的可视化工具(例如 Tableau)中访问到。
另一方面,并非所有类型的处理都可以合理地表达为 SQL 查询。例如,如果要构建机器学习和推荐系统,或者使用相关性排名模型的全文搜索索引,或者执行图像分析,则很可能需要更一般的数据处理模型。这些类型的处理通常是特别针对特定应用的(例如机器学习的特征工程,机器翻译的自然语言模型,欺诈预测的风险评估函数),因此它们不可避免地需要编写代码,而不仅仅是查询。
MapReduce 使工程师能够轻松地在大型数据集上运行自己的代码。如果你有 HDFS 和 MapReduce,那么你 **可以** 在它之上建立一个 SQL 查询执行引擎,事实上这正是 Hive 项目所做的【31】。但是,你也可以编写许多其他形式的批处理,这些批处理不必非要用 SQL 查询表示。
随后,人们发现 MapReduce 对于某些类型的处理而言局限性很大,表现很差,因此在 Hadoop 之上其他各种处理模型也被开发出来(我们将在 “[MapReduce 之后](#MapReduce之后)” 中看到其中一些)。只有两种处理模型,SQL 和 MapReduce,还不够,需要更多不同的模型!而且由于 Hadoop 平台的开放性,实施一整套方法是可行的,而这在单体 MPP 数据库的范畴内是不可能的【58】。
至关重要的是,这些不同的处理模型都可以在共享的单个机器集群上运行,所有这些机器都可以访问分布式文件系统上的相同文件。在 Hadoop 方式中,不需要将数据导入到几个不同的专用系统中进行不同类型的处理:系统足够灵活,可以支持同一个集群内不同的工作负载。不需要移动数据,使得从数据中挖掘价值变得容易得多,也使采用新的处理模型容易的多。
Hadoop 生态系统包括随机访问的 OLTP 数据库,如 HBase(请参阅 “[SSTables 和 LSM 树](/v1/ch3#SSTables和LSM树)”)和 MPP 风格的分析型数据库,如 Impala 【41】。HBase 与 Impala 都不使用 MapReduce,但都使用 HDFS 进行存储。它们是迥异的数据访问与处理方法,但是它们可以共存,并被集成到同一个系统中。
#### 针对频繁故障设计
当比较 MapReduce 和 MPP 数据库时,两种不同的设计思路出现了:处理故障和使用内存与磁盘的方式。与在线系统相比,批处理对故障不太敏感,因为就算失败也不会立即影响到用户,而且它们总是能再次运行。
如果一个节点在执行查询时崩溃,大多数 MPP 数据库会中止整个查询,并让用户重新提交查询或自动重新运行它【3】。由于查询通常最多运行几秒钟或几分钟,所以这种错误处理的方法是可以接受的,因为重试的代价不是太大。MPP 数据库还倾向于在内存中保留尽可能多的数据(例如,使用散列连接)以避免从磁盘读取的开销。
另一方面,MapReduce 可以容忍单个 Map 或 Reduce 任务的失败,而不会影响作业的整体,通过以单个任务的粒度重试工作。它也会非常急切地将数据写入磁盘,一方面是为了容错,另一部分是因为假设数据集太大而不能适应内存。
MapReduce 方式更适用于较大的作业:要处理如此之多的数据并运行很长时间的作业,以至于在此过程中很可能至少遇到一个任务故障。在这种情况下,由于单个任务失败而重新运行整个作业将是非常浪费的。即使以单个任务的粒度进行恢复引入了使得无故障处理更慢的开销,但如果任务失败率足够高,这仍然是一种合理的权衡。
但是这些假设有多么现实呢?在大多数集群中,机器故障确实会发生,但是它们不是很频繁 —— 可能少到绝大多数作业都不会经历机器故障。为了容错,真的值得带来这么大的额外开销吗?
要了解 MapReduce 节约使用内存和在任务的层次进行恢复的原因,了解最初设计 MapReduce 的环境是很有帮助的。Google 有着混用的数据中心,在线生产服务和离线批处理作业在同样机器上运行。每个任务都有一个通过容器强制执行的资源配给(CPU 核心、RAM、磁盘空间等)。每个任务也具有优先级,如果优先级较高的任务需要更多的资源,则可以终止(抢占)同一台机器上较低优先级的任务以释放资源。优先级还决定了计算资源的定价:团队必须为他们使用的资源付费,而优先级更高的进程花费更多【59】。
这种架构允许非生产(低优先级)计算资源被 **过量使用(overcommitted)**,因为系统知道必要时它可以回收资源。与分离生产和非生产任务的系统相比,过量使用资源可以更好地利用机器并提高效率。但由于 MapReduce 作业以低优先级运行,它们随时都有被抢占的风险,因为优先级较高的进程可能需要其资源。在高优先级进程拿走所需资源后,批量作业能有效地 “捡面包屑”,利用剩下的任何计算资源。
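At Google, a MapReduce task that runs for an hour has an approximately 5% risk of being terminated to make space for a higher-priority process. This rate is more than an order of magnitude higher than the rate of failures due to hardware issues, machine reboots, or other reasons【59】. At this rate of preemptions, if a job has 100 tasks that each run for 10 minutes, there is a risk greater than 50% that at least one task will be terminated before it is finished.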
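The "greater than 50%" figure can be checked with a short calculation. The sketch assumes that preemptions strike tasks independently and at a constant rate over time, which is a modeling assumption of this example and not something the cited source states.

```ruby
hourly_preemption_risk = 0.05   # risk that a task running for one hour is preempted
task_minutes = 10
tasks = 100

# Probability that a single 10-minute task survives (one sixth of an hour).
survive_one_task = (1 - hourly_preemption_risk)**(task_minutes / 60.0)

# Probability that at least one of the 100 tasks is preempted.
at_least_one_preempted = 1 - survive_one_task**tasks

puts format('%.1f%%', at_least_one_preempted * 100)
```

Under these assumptions each task survives with probability of about 99.1%, and the chance that at least one of the 100 tasks is preempted comes out at roughly 57%.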
这就是 MapReduce 被设计为容忍频繁意外任务终止的原因:不是因为硬件很不可靠,而是因为任意终止进程的自由有利于提高计算集群中的资源利用率。
在开源的集群调度器中,抢占的使用较少。YARN 的 CapacityScheduler 支持抢占,以平衡不同队列的资源分配【58】,但在编写本文时,YARN,Mesos 或 Kubernetes 不支持通用的优先级抢占【60】。在任务不经常被终止的环境中,MapReduce 的这一设计决策就没有多少意义了。在下一节中,我们将研究一些与 MapReduce 设计决策相异的替代方案。
## MapReduce之后
虽然 MapReduce 在 2000 年代后期变得非常流行,并受到大量的炒作,但它只是分布式系统的许多可能的编程模型之一。对于不同的数据量,数据结构和处理类型,其他工具可能更适合表示计算。
不管如何,我们在这一章花了大把时间来讨论 MapReduce,因为它是一种有用的学习工具,它是分布式文件系统的一种相当简单明晰的抽象。在这里,**简单** 意味着我们能理解它在做什么,而不是意味着使用它很简单。恰恰相反:使用原始的 MapReduce API 来实现复杂的处理工作实际上是非常困难和费力的 —— 例如,任意一种连接算法都需要你从头开始实现【37】。
针对直接使用 MapReduce 的困难,在 MapReduce 上有很多高级编程模型(Pig、Hive、Cascading、Crunch)被创造出来,作为建立在 MapReduce 之上的抽象。如果你了解 MapReduce 的原理,那么它们学起来相当简单。而且它们的高级结构能显著简化许多常见批处理任务的实现。
但是,MapReduce 执行模型本身也存在一些问题,这些问题并没有通过增加另一个抽象层次而解决,而对于某些类型的处理,它表现得非常差劲。一方面,MapReduce 非常稳健:你可以使用它在任务会频繁终止的多租户系统上处理几乎任意大量级的数据,并且仍然可以完成工作(虽然速度很慢)。另一方面,对于某些类型的处理而言,其他工具有时会快上几个数量级。
在本章的其余部分中,我们将介绍一些批处理方法。在 [第十一章](/v1/ch11) 我们将转向流处理,它可以看作是加速批处理的另一种方法。
### 物化中间状态
如前所述,每个 MapReduce 作业都独立于其他任何作业。作业与世界其他地方的主要连接点是分布式文件系统上的输入和输出目录。如果希望一个作业的输出成为第二个作业的输入,则需要将第二个作业的输入目录配置为第一个作业输出目录,且外部工作流调度程序必须在第一个作业完成后再启动第二个。
如果第一个作业的输出是要在组织内广泛发布的数据集,则这种配置是合理的。在这种情况下,你需要通过名称引用它,并将其重用为多个不同作业的输入(包括由其他团队开发的作业)。将数据发布到分布式文件系统中众所周知的位置能够带来 **松耦合**,这样作业就不需要知道是谁在提供输入或谁在消费输出(请参阅 “[逻辑与布线相分离](#逻辑与布线相分离)”)。
但在很多情况下,你知道一个作业的输出只能用作另一个作业的输入,这些作业由同一个团队维护。在这种情况下,分布式文件系统上的文件只是简单的 **中间状态(intermediate state)**:一种将数据从一个作业传递到下一个作业的方式。在一个用于构建推荐系统的,由 50 或 100 个 MapReduce 作业组成的复杂工作流中,存在着很多这样的中间状态【29】。
将这个中间状态写入文件的过程称为 **物化(materialization)**。(在 “[聚合:数据立方体和物化视图](/v1/ch3#聚合:数据立方体和物化视图)” 中已经在物化视图的背景中遇到过这个术语。它意味着对某个操作的结果立即求值并写出来,而不是在请求时按需计算)
作为对照,本章开头的日志分析示例使用 Unix 管道将一个命令的输出与另一个命令的输入连接起来。管道并没有完全物化中间状态,而是只使用一个小的内存缓冲区,将输出增量地 **流(stream)** 向输入。
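The difference between materializing and streaming can be seen within a single Ruby process by comparing eager arrays with lazy enumerators. This is only an analogy for pipes versus intermediate files, not a model of either system; the `consumed` array is a hypothetical probe that records how many input records the first stage actually touched.

```ruby
# Materialized: each stage processes its whole input and builds a complete
# intermediate array before the next stage starts.
def materialized_top(urls, consumed)
  stage1 = urls.map { |u| consumed << u; u.downcase }
  stage2 = stage1.reject { |u| u.end_with?('.css') }
  stage2.first(2)
end

# Streamed: records flow through all stages one at a time, and the pipeline
# stops pulling input as soon as the final stage has what it needs.
def streamed_top(urls, consumed)
  urls.lazy
      .map { |u| consumed << u; u.downcase }
      .reject { |u| u.end_with?('.css') }
      .first(2)
end
```

Both methods return the same result, but the streamed version never holds a full intermediate dataset and reads only as much input as the last stage demands.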
与 Unix 管道相比,MapReduce 完全物化中间状态的方法存在不足之处:
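- A MapReduce job can start only when all tasks in the preceding jobs (the ones that generate its inputs) have completed, whereas processes connected by a Unix pipe start at the same time and consume output as soon as it is produced. Skew or uneven load across machines means that a job often has a few straggler tasks that take much longer to finish than the others. Having to wait until every task of the preceding job has completed slows down the execution of the workflow as a whole.
- Mappers are often redundant: they merely read back the file that a reducer has just written and prepare it for the next stage of partitioning and sorting. In many cases the mapper code could be part of the preceding reducer: if the reducer output were partitioned and sorted in the same way as the mapper output, reducers could be chained together directly, without being interleaved with mapper stages.
- Storing intermediate state in a distributed filesystem means that those files are replicated across several nodes, which is often overkill for such temporary data.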
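The difference between fully materializing intermediate state and streaming it can be illustrated on a single machine. The following is a minimal sketch, not a model of any real engine: the two stages (`double`, `add_one`) and the `trace` list are hypothetical names used only to record the order in which work happens.

```python
from typing import Iterable, Iterator, List


def double(records: Iterable[int], trace: List[str]) -> Iterator[int]:
    """Stage 1: emit each record doubled, logging when the work happens."""
    for r in records:
        trace.append(f"double {r}")
        yield r * 2


def add_one(records: Iterable[int], trace: List[str]) -> Iterator[int]:
    """Stage 2: emit each record plus one, logging when the work happens."""
    for r in records:
        trace.append(f"add_one {r}")
        yield r + 1


def run_materialized(source: List[int], trace: List[str]) -> List[int]:
    # Stage 1 runs to completion and its whole output is stored
    # (a stand-in for files on a distributed filesystem) before stage 2 starts.
    intermediate = list(double(source, trace))
    return list(add_one(intermediate, trace))


def run_pipelined(source: List[int], trace: List[str]) -> List[int]:
    # Stage 2 pulls records from stage 1 one at a time, like a Unix pipe.
    return list(add_one(double(source, trace), trace))


materialized_trace: List[str] = []
pipelined_trace: List[str] = []
materialized_result = run_materialized([1, 2], materialized_trace)
pipelined_result = run_pipelined([1, 2], pipelined_trace)
print(materialized_trace)
print(pipelined_trace)
```

Both variants produce the same result, but in the pipelined trace the second stage starts working on the first record before the first stage has seen the second one.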
#### 数据流引擎
为了解决 MapReduce 的这些问题,几种用于分布式批处理的新执行引擎被开发出来,其中最著名的是 Spark 【61,62】,Tez 【63,64】和 Flink 【65,66】。它们的设计方式有很多区别,但有一个共同点:把整个工作流作为单个作业来处理,而不是把它分解为独立的子作业。
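Because they explicitly model the flow of data through several processing stages, these systems are known as **dataflow engines**. Like MapReduce, they work by repeatedly calling a user-defined function on a single thread to process one record at a time. They parallelize work by partitioning inputs, and they copy the output of one function over the network to become the input of another function.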
与 MapReduce 不同,这些函数不需要严格扮演交织的 Map 与 Reduce 的角色,而是可以以更灵活的方式进行组合。我们称这些函数为 **算子(operators)**,数据流引擎提供了几种不同的选项来将一个算子的输出连接到另一个算子的输入:
- 一种选项是对记录按键重新分区并排序,就像在 MapReduce 的混洗阶段一样(请参阅 “[分布式执行 MapReduce](#分布式执行MapReduce)”)。这种功能可以用于实现排序合并连接和分组,就像在 MapReduce 中一样。
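- Another option is to take several inputs and partition them in the same way, but skip the sorting. This saves effort on partitioned hash joins, where the partitioning of records matters but their order does not, because building the hash table randomizes the order anyway.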
- 对于广播散列连接,可以将一个算子的输出,发送到连接算子的所有分区。
这种类型的处理引擎是基于像 Dryad【67】和 Nephele【68】这样的研究系统,与 MapReduce 模型相比,它有几个优点:
- 排序等昂贵的工作只需要在实际需要的地方执行,而不是默认地在每个 Map 和 Reduce 阶段之间出现。
- 没有不必要的 Map 任务,因为 Mapper 所做的工作通常可以合并到前面的 Reduce 算子中(因为 Mapper 不会更改数据集的分区)。
- 由于工作流中的所有连接和数据依赖都是显式声明的,因此调度程序能够总览全局,知道哪里需要哪些数据,因而能够利用局部性进行优化。例如,它可以尝试将消费某些数据的任务放在与生成这些数据的任务相同的机器上,从而数据可以通过共享内存缓冲区传输,而不必通过网络复制。
- 通常,算子间的中间状态足以保存在内存中或写入本地磁盘,这比写入 HDFS 需要更少的 I/O(必须将其复制到多台机器,并将每个副本写入磁盘)。MapReduce 已经对 Mapper 的输出做了这种优化,但数据流引擎将这种思想推广至所有的中间状态。
- 算子可以在输入就绪后立即开始执行;后续阶段无需等待前驱阶段整个完成后再开始。
- 与 MapReduce(为每个任务启动一个新的 JVM)相比,现有 Java 虚拟机(JVM)进程可以重用来运行新算子,从而减少启动开销。
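You can use dataflow engines to implement the same computations as MapReduce workflows, and they usually execute significantly faster because of the optimizations described here. Since operators are a generalization of map and reduce, the same processing code can run on either execution engine: workflows implemented in Pig, Hive, or Cascading can be switched from MapReduce to Tez or Spark with a simple configuration change, without modifying code 【64】.
Tez is a fairly thin library that relies on the YARN shuffle service for the actual copying of data between nodes 【58】, whereas Spark and Flink are large frameworks that include their own network communication layer, scheduler, and user-facing APIs. We will discuss those high-level APIs shortly.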
#### 容错
完全物化中间状态至分布式文件系统的一个优点是,它具有持久性,这使得 MapReduce 中的容错相当容易:如果一个任务失败,它可以在另一台机器上重新启动,并从文件系统重新读取相同的输入。
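Spark, Flink, and Tez avoid writing intermediate state to HDFS, so they take a different approach to tolerating faults: if a machine fails and the intermediate state on that machine is lost, it is recomputed from other data that is still available (an earlier intermediate stage if possible, or otherwise the original input data, which is normally on HDFS).
To enable this recomputation, the framework must keep track of how a given piece of data was computed: which input partitions it used, and which operators were applied to it. Spark uses the **resilient distributed dataset (RDD)** abstraction to track the ancestry of data 【61】, while Flink checkpoints operator state, which allows it to resume an operator that ran into a fault during execution 【66】.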
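The idea of recomputing lost data from its ancestry can be sketched in a few lines. The `Dataset` class below is a toy written for this illustration; it is not Spark's RDD API, and its method names (`map`, `collect`, `lose`) are assumptions chosen for readability. Each dataset remembers its parent and the function that derived it, so a lost in-memory result can be rebuilt.

```python
from typing import Callable, List, Optional


class Dataset:
    """Toy dataset that records how it was derived (hypothetical, not a real API)."""

    def __init__(self,
                 source: Optional[List[int]] = None,
                 parent: Optional["Dataset"] = None,
                 fn: Optional[Callable[[int], int]] = None) -> None:
        self._source = source    # stands in for durable input, e.g. files on HDFS
        self._parent = parent    # lineage: which dataset this one was derived from
        self._fn = fn            # lineage: which operator was applied
        self._cached: Optional[List[int]] = None  # in-memory intermediate state

    def map(self, fn: Callable[[int], int]) -> "Dataset":
        return Dataset(parent=self, fn=fn)

    def collect(self) -> List[int]:
        if self._cached is None:
            if self._parent is None:
                # Root of the lineage: re-read the durable input.
                self._cached = list(self._source or [])
            else:
                # Recompute from the parent, which may itself need recomputing.
                self._cached = [self._fn(x) for x in self._parent.collect()]
        return self._cached

    def lose(self) -> None:
        """Simulate a machine failure that wipes the in-memory state."""
        self._cached = None


base = Dataset(source=[1, 2, 3])
doubled = base.map(lambda x: x * 2)
plus_one = doubled.map(lambda x: x + 1)

first_run = plus_one.collect()
doubled.lose()
plus_one.lose()
recomputed = plus_one.collect()
print(first_run, recomputed)
```

The recomputed result matches the first run only because both operators are deterministic.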
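When recomputing data, it is important to know whether the computation is **deterministic**: that is, given the same input data, do the operators always produce the same output? This question matters if some of the lost data has already been sent to downstream operators. If the operator is restarted and the recomputed data differs from the original lost data, it is very hard for downstream operators to resolve the contradiction between the old and new data. For nondeterministic operators, the solution is normally to kill the downstream operators as well and run them again on the new data.
To avoid such cascading faults, it is better to make operators deterministic. Note, however, that nondeterministic behavior creeps in easily: for example, many programming languages make no guarantee about the order in which the elements of a hash table are iterated, many probabilistic and statistical algorithms explicitly rely on random numbers, and any use of the system clock or of external data sources is nondeterministic. Such causes of nondeterminism need to be removed in order to recover reliably from faults, for example by generating pseudorandom numbers from a fixed seed.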
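As a small illustration of removing two of these sources of nondeterminism, the sketch below samples records with a fixed seed and a fixed iteration order. The function name and the default seed are arbitrary choices made for this example.

```python
import random
from typing import Iterable, List


def sample_deterministic(records: Iterable[str], k: int, seed: int = 42) -> List[str]:
    """Pick k records so that a rerun over the same input makes the same choice."""
    # A private generator with a fixed seed avoids depending on global RNG state.
    rng = random.Random(seed)
    # Sorting fixes the iteration order, which a set or hash table does not guarantee.
    ordered = sorted(records)
    return rng.sample(ordered, k)


first_attempt = sample_deterministic({"a", "b", "c", "d"}, 2)
retry_after_failure = sample_deterministic(["d", "c", "b", "a"], 2)
print(first_attempt == retry_after_failure)
```

The same records arriving in a different order still yield the same sample, so a recomputed operator would agree with the output that was lost.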
通过重算数据来从故障中恢复并不总是正确的答案:如果中间状态数据要比源数据小得多,或者如果计算量非常大,那么将中间数据物化为文件可能要比重新计算廉价的多。
#### 关于物化的讨论
回到 Unix 的类比,我们看到,MapReduce 就像是将每个命令的输出写入临时文件,而数据流引擎看起来更像是 Unix 管道。尤其是 Flink 是基于管道执行的思想而建立的:也就是说,将算子的输出增量地传递给其他算子,不待输入完成便开始处理。
排序算子不可避免地需要消费全部的输入后才能生成任何输出,因为输入中最后一条输入记录可能具有最小的键,因此需要作为第一条记录输出。因此,任何需要排序的算子都需要至少暂时地累积状态。但是工作流的许多其他部分可以以流水线方式执行。
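When a job completes, its output needs to go somewhere durable so that users can find it and use it; most likely it is written to the distributed filesystem again. Thus, when using a dataflow engine, materialized datasets on HDFS are still usually the inputs and the final outputs of a job. As with MapReduce, the inputs are immutable and the output is completely replaced. The improvement over MapReduce is that you no longer have to write all the intermediate state to the filesystem as well.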
### 图与迭代处理
在 “[图数据模型](/v1/ch2#图数据模型)” 中,我们讨论了使用图来建模数据,并使用图查询语言来遍历图中的边与点。[第二章](/v1/ch2) 的讨论集中在 OLTP 风格的应用场景:快速执行查询来查找少量符合特定条件的顶点。
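Graphs are also interesting in a batch processing context, where the goal is to perform some kind of offline processing or analysis on an entire graph. This need often arises in machine learning applications such as recommendation engines, or in ranking systems. For example, one of the most famous graph analysis algorithms is PageRank 【69】, which tries to estimate the popularity of a web page based on which other web pages link to it. It is used as part of the formula that determines the order in which web search engines present their results.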
> 像 Spark、Flink 和 Tez 这样的数据流引擎(请参阅 “[物化中间状态](#物化中间状态)”)通常将算子作为 **有向无环图(DAG)** 的一部分安排在作业中。这与图处理不一样:在数据流引擎中,**从一个算子到另一个算子的数据流** 被构造成一个图,而数据本身通常由关系型元组构成。在图处理中,数据本身具有图的形式。又一个不幸的命名混乱!
许多图算法是通过一次遍历一条边来表示的,将一个顶点与近邻的顶点连接起来,以传播一些信息,并不断重复,直到满足一些条件为止 —— 例如,直到没有更多的边要跟进,或直到一些指标收敛。我们在 [图 2-6](/v1/ddia_0206.png) 中看到一个例子,它通过重复跟进标明地点归属关系的边,生成了数据库中北美包含的所有地点列表(这种算法被称为 **传递闭包**,即 transitive closure)。
可以在分布式文件系统中存储图(包含顶点和边的列表的文件),但是这种 “重复至完成” 的想法不能用普通的 MapReduce 来表示,因为它只扫过一趟数据。这种算法因此经常以 **迭代** 的风格实现:
1. 外部调度程序运行批处理来计算算法的一个步骤。
2. 当批处理过程完成时,调度器检查它是否完成(基于完成条件 —— 例如,没有更多的边要跟进,或者与上次迭代相比的变化低于某个阈值)。
3. 如果尚未完成,则调度程序返回到步骤 1 并运行另一轮批处理。
这种方法是有效的,但是用 MapReduce 实现它往往非常低效,因为 MapReduce 没有考虑算法的迭代性质:它总是读取整个输入数据集并产生一个全新的输出数据集,即使与上次迭代相比,改变的仅仅是图中的一小部分。
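The iterative style described above can be sketched as a loop around a "batch step". The example below computes a transitive closure over containment edges; the edge data is made up for illustration, and the driver loop plays the role of the external scheduler.

```python
from typing import List, Set, Tuple

# Hypothetical (container, contained) edges, in the spirit of the locations example.
EDGES: List[Tuple[str, str]] = [
    ("North America", "United States"),
    ("United States", "Idaho"),
    ("Idaho", "Boise"),
    ("Europe", "France"),
]


def one_step(edges: List[Tuple[str, str]], reached: Set[str]) -> Set[str]:
    """One 'batch job': scan every edge and follow those leaving the reached set."""
    return reached | {dst for (src, dst) in edges if src in reached}


def transitive_closure(edges: List[Tuple[str, str]], start: str) -> Tuple[Set[str], int]:
    reached = {start}
    rounds = 0
    while True:
        new_reached = one_step(edges, reached)  # reads the whole edge list every round
        rounds += 1
        if new_reached == reached:              # completion condition: nothing changed
            return reached, rounds
        reached = new_reached


locations, rounds = transitive_closure(EDGES, "North America")
print(sorted(locations), rounds)
```

Every round rescans all edges even though only one new location is discovered each time, which is the inefficiency the paragraph above describes.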
#### Pregel处理模型
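As an optimization for batch processing graphs, the **bulk synchronous parallel (BSP)** model of computation 【70】 has become popular. Among others, it is implemented by Apache Giraph 【37】, Spark's GraphX API, and Flink's Gelly API 【71】. It is also known as the **Pregel** model, because Google's Pregel paper popularized this approach to processing graphs 【72】.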
回想一下在 MapReduce 中,Mapper 在概念上向 Reducer 的特定调用 “发送消息”,因为框架将所有具有相同键的 Mapper 输出集中在一起。Pregel 背后有一个类似的想法:一个顶点可以向另一个顶点 “发送消息”,通常这些消息是沿着图的边发送的。
在每次迭代中,为每个顶点调用一个函数,将所有发送给它的消息传递给它 —— 就像调用 Reducer 一样。与 MapReduce 的不同之处在于,在 Pregel 模型中,顶点在一次迭代到下一次迭代的过程中会记住它的状态,所以这个函数只需要处理新的传入消息。如果图的某个部分没有被发送消息,那里就不需要做任何工作。
这与 Actor 模型有些相似(请参阅 “[分布式的 Actor 框架](/v1/ch4#分布式的Actor框架)”),除了顶点状态和顶点之间的消息具有容错性和持久性,且通信以固定的回合进行:在每次迭代中,框架递送上次迭代中发送的所有消息。Actor 通常没有这样的时序保证。
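A single-process sketch can make the vertex-centric model more concrete. The function below propagates the maximum value through a graph in supersteps; it is a simplified illustration with a hypothetical name (`pregel_max`) and says nothing about how Giraph, GraphX, or Gelly are implemented. Fault tolerance and partitioning are left out.

```python
from collections import defaultdict
from typing import Dict, List


def pregel_max(values: Dict[str, int], edges: Dict[str, List[str]]) -> Dict[str, int]:
    """Propagate the largest vertex value along edges until no messages remain."""
    state = dict(values)  # vertex state is remembered between iterations

    # First superstep: every vertex sends its value along its outgoing edges.
    inbox: Dict[str, List[int]] = defaultdict(list)
    for vertex, value in state.items():
        for neighbour in edges.get(vertex, []):
            inbox[neighbour].append(value)

    # Later supersteps: messages sent in one round are delivered in the next.
    while inbox:
        outbox: Dict[str, List[int]] = defaultdict(list)
        for vertex, messages in inbox.items():  # vertices without mail do no work
            best = max(messages)
            if best > state[vertex]:
                state[vertex] = best
                for neighbour in edges.get(vertex, []):
                    outbox[neighbour].append(best)
        inbox = outbox
    return state


graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
result = pregel_max({"a": 3, "b": 6, "c": 2, "d": 1}, graph)
print(result)
```

The computation stops on its own once a superstep produces no messages, which corresponds to the convergence condition mentioned earlier.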
#### 容错
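The fact that vertices can communicate only by message passing (not by querying each other directly) helps improve the performance of Pregel jobs, because messages can be batched and there is less waiting for communication. The only waiting is between iterations: since the Pregel model guarantees that all messages sent in one iteration are delivered in the next, the prior iteration must finish completely, and all of its messages must be copied over the network, before the next one can start.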
即使底层网络可能丢失、重复或任意延迟消息(请参阅 “[不可靠的网络](/v1/ch8#不可靠的网络)”),Pregel 的实现能保证在后续迭代中消息在其目标顶点恰好处理一次。像 MapReduce 一样,框架能从故障中透明地恢复,以简化在 Pregel 上实现算法的编程模型。
这种容错是通过在迭代结束时,定期存档所有顶点的状态来实现的,即将其全部状态写入持久化存储。如果某个节点发生故障并且其内存中的状态丢失,则最简单的解决方法是将整个图计算回滚到上一个存档点,然后重启计算。如果算法是确定性的,且消息记录在日志中,那么也可以选择性地只恢复丢失的分区(就像之前讨论过的数据流引擎)【72】。
#### 并行执行
顶点不需要知道它在哪台物理机器上执行;当它向其他顶点发送消息时,它只是简单地将消息发往某个顶点 ID。图的分区取决于框架 —— 即,确定哪个顶点运行在哪台机器上,以及如何通过网络路由消息,以便它们到达正确的地方。
由于编程模型一次仅处理一个顶点(有时称为 “像顶点一样思考”),所以框架可以以任意方式对图分区。理想情况下如果顶点需要进行大量的通信,那么它们最好能被分区到同一台机器上。然而找到这样一种优化的分区方法是很困难的 —— 在实践中,图经常按照任意分配的顶点 ID 分区,而不会尝试将相关的顶点分组在一起。
因此,图算法通常会有很多跨机器通信的额外开销,而中间状态(节点之间发送的消息)往往比原始图大。通过网络发送消息的开销会显著拖慢分布式图算法的速度。
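For this reason, if your graph fits in memory on a single computer, a single-machine (maybe even single-threaded) algorithm will quite likely outperform a distributed batch process 【73,74】. Even if the graph is bigger than memory, as long as it fits on the disks of a single computer, single-machine processing with a framework such as GraphChi is a viable option 【75】. If the graph is too big to fit on a single machine, a distributed approach such as Pregel is unavoidable. Efficient parallel graph algorithms are an area of ongoing research 【76】.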
### 高级API和语言
自 MapReduce 开始流行的这几年以来,分布式批处理的执行引擎已经很成熟了。到目前为止,基础设施已经足够强大,能够存储和处理超过 10,000 台机器集群上的数 PB 的数据。由于在这种规模下物理执行批处理的问题已经被认为或多或少解决了,所以关注点已经转向其他领域:改进编程模型,提高处理效率,扩大这些技术可以解决的问题集。
如前所述,Hive、Pig、Cascading 和 Crunch 等高级语言和 API 变得越来越流行,因为手写 MapReduce 作业实在是个苦力活。随着 Tez 的出现,这些高级语言还有一个额外好处,可以迁移到新的数据流执行引擎,而无需重写作业代码。Spark 和 Flink 也有它们自己的高级数据流 API,通常是从 FlumeJava 中获取的灵感【34】。
这些数据流 API 通常使用关系型构建块来表达一个计算:按某个字段连接数据集;按键对元组做分组;按某些条件过滤;并通过计数求和或其他函数来聚合元组。在内部,这些操作是使用本章前面讨论过的各种连接和分组算法来实现的。
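To show how these relational building blocks compose, here is a plain-Python sketch that joins, filters, groups, and aggregates two tiny in-memory datasets. The data and helper names are invented for this example; real dataflow APIs express the same steps with their own operators and run them in parallel.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List

users = [
    {"user_id": 1, "country": "DE"},
    {"user_id": 2, "country": "FR"},
    {"user_id": 3, "country": "DE"},
]
views = [
    {"user_id": 1, "url": "/a"},
    {"user_id": 1, "url": "/b"},
    {"user_id": 2, "url": "/a"},
    {"user_id": 3, "url": "/a"},
]


def join(left: Iterable[dict], right: Iterable[dict], key: str) -> Iterator[dict]:
    """Join two datasets on a field by indexing the right-hand side."""
    index: Dict[object, List[dict]] = defaultdict(list)
    for row in right:
        index[row[key]].append(row)
    for row in left:
        for match in index.get(row[key], []):
            yield {**match, **row}


def count_by(rows: Iterable[dict], key: str) -> Dict[object, int]:
    """Group tuples by a key and aggregate each group by counting."""
    counts: Dict[object, int] = defaultdict(int)
    for row in rows:
        counts[row[key]] += 1
    return dict(counts)


joined = join(views, users, "user_id")                   # join on a field
only_a = (row for row in joined if row["url"] == "/a")   # filter by a condition
views_per_country = count_by(only_a, "country")          # group and aggregate
print(views_per_country)
```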
除了少写代码的明显优势之外,这些高级接口还支持交互式用法,在这种交互式使用中,你可以在 Shell 中增量式编写分析代码,频繁运行来观察它做了什么。这种开发风格在探索数据集和试验处理方法时非常有用。这也让人联想到 Unix 哲学,我们在 “[Unix 哲学](#Unix哲学)” 中讨论过这个问题。
此外,这些高级接口不仅提高了人类的工作效率,也提高了机器层面的作业执行效率。
#### 向声明式查询语言的转变
与硬写执行连接的代码相比,指定连接关系算子的优点是,框架可以分析连接输入的属性,并自动决定哪种上述连接算法最适合当前任务。Hive、Spark 和 Flink 都有基于代价的查询优化器可以做到这一点,甚至可以改变连接顺序,最小化中间状态的数量【66,77,78,79】。
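The choice of join algorithm can make a big difference to the performance of a batch job, and it is convenient not to have to understand and remember all the join algorithms discussed in this chapter. This is possible if joins are specified in a **declarative** way: the application simply states which joins are required, and the query optimizer decides how they can best be executed. We previously encountered this idea in “[Query Languages for Data](/v1/ch2#数据查询语言)”.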
但 MapReduce 及其数据流后继者在其他方面,与 SQL 的完全声明式查询模型有很大区别。MapReduce 是围绕着回调函数的概念建立的:对于每条记录或者一组记录,调用一个用户定义的函数(Mapper 或 Reducer),并且该函数可以自由地调用任意代码来决定输出什么。这种方法的优点是可以基于大量已有库的生态系统创作:解析、自然语言分析、图像分析以及运行数值或统计算法等。
自由运行任意代码,长期以来都是传统 MapReduce 批处理系统与 MPP 数据库的区别所在(请参阅 “[Hadoop 与分布式数据库的对比](#Hadoop与分布式数据库的对比)” 一节)。虽然数据库具有编写用户定义函数的功能,但是它们通常使用起来很麻烦,而且与大多数编程语言中广泛使用的程序包管理器和依赖管理系统兼容不佳(例如 Java 的 Maven、Javascript 的 npm 以及 Ruby 的 gems)。
然而数据流引擎已经发现,支持除连接之外的更多 **声明式特性** 还有其他的优势。例如,如果一个回调函数只包含一个简单的过滤条件,或者只是从一条记录中选择了一些字段,那么在为每条记录调用函数时会有相当大的额外 CPU 开销。如果以声明方式表示这些简单的过滤和映射操作,那么查询优化器可以利用列式存储布局(请参阅 “[列式存储](/v1/ch3#列式存储)”),只从磁盘读取所需的列。Hive、Spark DataFrames 和 Impala 还使用了向量化执行(请参阅 “[内存带宽和矢量化处理](/v1/ch3#内存带宽和矢量化处理)”):在对 CPU 缓存友好的内部循环中迭代数据,避免函数调用。Spark 生成 JVM 字节码【79】,Impala 使用 LLVM 为这些内部循环生成本机代码【41】。
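The benefit of a declarative filter over an opaque callback can be sketched with a toy column store that records which columns are read. Everything here (`ColumnStore`, `filter_with_callback`, `filter_declarative`) is hypothetical and only meant to illustrate the reasoning; real engines add vectorization and code generation on top.

```python
import operator
from typing import Callable, Dict, List, Set, Tuple

OPS = {"==": operator.eq, ">": operator.gt, "<": operator.lt}


class ColumnStore:
    """Toy column-oriented table that tracks which columns were read."""

    def __init__(self, columns: Dict[str, list]) -> None:
        self._columns = columns
        self.columns_read: Set[str] = set()

    @property
    def names(self) -> List[str]:
        return list(self._columns)

    def column(self, name: str) -> list:
        self.columns_read.add(name)
        return self._columns[name]


def filter_with_callback(store: ColumnStore, select: List[str],
                         predicate: Callable[[dict], bool]) -> List[Tuple]:
    # The callback is opaque, so whole rows must be assembled from every column.
    names = store.names
    columns = [store.column(n) for n in names]
    rows = (dict(zip(names, values)) for values in zip(*columns))
    return [tuple(row[c] for c in select) for row in rows if predicate(row)]


def filter_declarative(store: ColumnStore, select: List[str],
                       column: str, op: str, value: object) -> List[Tuple]:
    # The condition is data, so only the columns it mentions need to be read.
    test = OPS[op]
    mask = [test(v, value) for v in store.column(column)]
    selected = [store.column(c) for c in select]
    return [vals for keep, vals in zip(mask, zip(*selected)) if keep]


data = {"url": ["/a", "/b", "/c"], "status": [200, 404, 200], "bytes": [10, 20, 30]}
callback_store, declarative_store = ColumnStore(data), ColumnStore(data)
via_callback = filter_with_callback(callback_store, ["url"], lambda r: r["status"] == 200)
via_declarative = filter_declarative(declarative_store, ["url"], "status", "==", 200)
print(sorted(callback_store.columns_read), sorted(declarative_store.columns_read))
```

Both calls return the same rows, but the declarative version never touches the `bytes` column.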
通过在高级 API 中引入声明式的部分,并使查询优化器可以在执行期间利用这些来做优化,批处理框架看起来越来越像 MPP 数据库了(并且能实现可与之媲美的性能)。同时,通过拥有运行任意代码和以任意格式读取数据的可扩展性,它们保持了灵活性的优势。
#### 专业化的不同领域
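Although the extensibility of being able to run arbitrary code is useful, there are also many common cases in which standard processing patterns keep recurring, so it is worth having reusable implementations of the common building blocks. Traditionally, MPP databases have served the needs of business intelligence analysis and business reporting, but that is just one of many domains in which batch processing is used.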
另一个越来越重要的领域是统计和数值算法,它们是机器学习应用所需要的(例如分类器和推荐系统)。可重用的实现正在出现:例如,Mahout 在 MapReduce、Spark 和 Flink 之上实现了用于机器学习的各种算法,而 MADlib 在关系型 MPP 数据库(Apache HAWQ)中实现了类似的功能【54】。
空间算法也是有用的,例如 **k 近邻搜索(k-nearest neighbors, kNN)**【80】,它在一些多维空间中搜索与给定项最近的项目 —— 这是一种相似性搜索。近似搜索对于基因组分析算法也很重要,它们需要找到相似但不相同的字符串【81】。
批处理引擎正被用于分布式执行日益广泛的各领域算法。随着批处理系统获得各种内置功能以及高级声明式算子,且随着 MPP 数据库变得更加灵活和易于编程,两者开始看起来相似了:最终,它们都只是存储和处理数据的系统。
## 本章小结
在本章中,我们探索了批处理的主题。我们首先看到了诸如 awk、grep 和 sort 之类的 Unix 工具,然后我们看到了这些工具的设计理念是如何应用到 MapReduce 和更近的数据流引擎中的。一些设计原则包括:输入是不可变的,输出是为了作为另一个(仍未知的)程序的输入,而复杂的问题是通过编写 “做好一件事” 的小工具来解决的。
在 Unix 世界中,允许程序与程序组合的统一接口是文件与管道;在 MapReduce 中,该接口是一个分布式文件系统。我们看到数据流引擎添加了自己的管道式数据传输机制,以避免将中间状态物化至分布式文件系统,但作业的初始输入和最终输出通常仍是 HDFS。
分布式批处理框架需要解决的两个主要问题是:
分区
: 在 MapReduce 中,Mapper 根据输入文件块进行分区。Mapper 的输出被重新分区、排序并合并到可配置数量的 Reducer 分区中。这一过程的目的是把所有的 **相关** 数据(例如带有相同键的所有记录)都放在同一个地方。
后 MapReduce 时代的数据流引擎若非必要会尽量避免排序,但它们也采取了大致类似的分区方法。
容错
: MapReduce 经常写入磁盘,这使得从单个失败的任务恢复很轻松,无需重新启动整个作业,但在无故障的情况下减慢了执行速度。数据流引擎更多地将中间状态保存在内存中,更少地物化中间状态,这意味着如果节点发生故障,则需要重算更多的数据。确定性算子减少了需要重算的数据量。
我们讨论了几种 MapReduce 的连接算法,其中大多数也在 MPP 数据库和数据流引擎内部使用。它们也很好地演示了分区算法是如何工作的:
排序合并连接
: 每个参与连接的输入都通过一个提取连接键的 Mapper。通过分区、排序和合并,具有相同键的所有记录最终都会进入相同的 Reducer 调用。这个函数能输出连接好的记录。
广播散列连接
: 两个连接输入之一很小,所以它并没有分区,而且能被完全加载进一个哈希表中。因此,你可以为连接输入大端的每个分区启动一个 Mapper,将输入小端的散列表加载到每个 Mapper 中,然后扫描大端,一次一条记录,并为每条记录查询散列表。
分区散列连接
: 如果两个连接输入以相同的方式分区(使用相同的键,相同的散列函数和相同数量的分区),则可以独立地对每个分区应用散列表方法。
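The last two join strategies can be compared in a small single-process sketch. It assumes a CRC32-based partitioner and made-up data; the helper names are illustrative and do not correspond to any framework's API. The point is that joining partition by partition gives the same result as one hash join over the whole input, provided both sides are partitioned identically.

```python
import zlib
from collections import defaultdict
from typing import Dict, List


def partition_of(key: object, num_partitions: int) -> int:
    """Same key, same hash function, same partition count on both inputs."""
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


def partition(rows: List[dict], key: str, num_partitions: int) -> List[List[dict]]:
    parts: List[List[dict]] = [[] for _ in range(num_partitions)]
    for row in rows:
        parts[partition_of(row[key], num_partitions)].append(row)
    return parts


def hash_join(small: List[dict], large: List[dict], key: str) -> List[dict]:
    """Load the small side into a hash table, then scan the large side."""
    table: Dict[object, List[dict]] = defaultdict(list)
    for row in small:
        table[row[key]].append(row)
    return [{**s, **l} for l in large for s in table.get(l[key], [])]


def partitioned_hash_join(left: List[dict], right: List[dict], key: str,
                          num_partitions: int = 4) -> List[dict]:
    out: List[dict] = []
    pairs = zip(partition(left, key, num_partitions),
                partition(right, key, num_partitions))
    for left_part, right_part in pairs:
        # Each partition can be joined independently (and, in a cluster, in parallel).
        out.extend(hash_join(left_part, right_part, key))
    return out


users = [{"user_id": i, "name": f"user{i}"} for i in range(1, 6)]
events = [{"user_id": i % 5 + 1, "url": f"/page{i}"} for i in range(12)]

broadcast_result = hash_join(users, events, "user_id")
partitioned_result = partitioned_hash_join(users, events, "user_id")


def canonical(rows: List[dict]) -> List[list]:
    return sorted(sorted(row.items()) for row in rows)


print(canonical(broadcast_result) == canonical(partitioned_result))
```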
分布式批处理引擎有一个刻意限制的编程模型:回调函数(比如 Mapper 和 Reducer)被假定是无状态的,而且除了指定的输出外,必须没有任何外部可见的副作用。这一限制允许框架在其抽象下隐藏一些困难的分布式系统问题:当遇到崩溃和网络问题时,任务可以安全地重试,任何失败任务的输出都被丢弃。如果某个分区的多个任务成功,则其中只有一个能使其输出实际可见。
得益于这个框架,你在批处理作业中的代码无需操心实现容错机制:框架可以保证作业的最终输出与没有发生错误的情况相同,虽然实际上也许不得不重试各种任务。比起在线服务一边处理用户请求一边将写入数据库作为处理请求的副作用,批处理提供的这种可靠性语义要强得多。
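The distinguishing feature of a batch processing job is that it reads some input data and produces some output data without modifying the input; in other words, the output is derived from the input. Crucially, the input data is **bounded**: it has a known, fixed size (for example, it consists of a set of log files at some point in time, or a snapshot of a database's contents). Because it is bounded, a job knows when it has finished reading the entire input, and so the job eventually completes once it has processed all of it.
In the next chapter, we will turn to stream processing, in which the input is **unbounded**: you still have a job, but its inputs are never-ending streams of data. In this case a job is never complete, because more work may arrive at any time. We will see that stream and batch processing are similar in some respects, but the assumption of unbounded streams also changes a lot about how we build systems.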
## 参考文献
1. Jeffrey Dean and Sanjay Ghemawat: “[MapReduce: Simplified Data Processing on Large Clusters](https://research.google/pubs/pub62/),” at *6th USENIX Symposium on Operating System Design and Implementation* (OSDI), December 2004.
1. Joel Spolsky: “[The Perils of JavaSchools](https://www.joelonsoftware.com/2005/12/29/the-perils-of-javaschools-2/),” *joelonsoftware.com*, December 29, 2005.
1. Shivnath Babu and Herodotos Herodotou: “[Massively Parallel Databases and MapReduce Systems](https://www.microsoft.com/en-us/research/wp-content/uploads/2013/11/db-mr-survey-final.pdf),” *Foundations and Trends in Databases*, volume 5, number 1, pages 1–104, November 2013. [doi:10.1561/1900000036](http://dx.doi.org/10.1561/1900000036)
1. David J. DeWitt and Michael Stonebraker: “[MapReduce: A Major Step Backwards](https://homes.cs.washington.edu/~billhowe/mapreduce_a_major_step_backwards.html),” originally published at *databasecolumn.vertica.com*, January 17, 2008.
1. Henry Robinson: “[The Elephant Was a Trojan Horse: On the Death of Map-Reduce at Google](https://www.the-paper-trail.org/post/2014-06-25-the-elephant-was-a-trojan-horse-on-the-death-of-map-reduce-at-google/),” *the-paper-trail.org*, June 25, 2014.
1. “[The Hollerith Machine](https://www.census.gov/history/www/innovations/technology/the_hollerith_tabulator.html),” United States Census Bureau, *census.gov*.
1. “[IBM 82, 83, and 84 Sorters Reference Manual](https://bitsavers.org/pdf/ibm/punchedCard/Sorter/A24-1034-1_82-83-84_sorters.pdf),” Edition A24-1034-1, International Business Machines Corporation, July 1962.
1. Adam Drake: “[Command-Line Tools Can Be 235x Faster than Your Hadoop Cluster](https://adamdrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html),” *aadrake.com*, January 25, 2014.
1. “[GNU Coreutils 8.23 Documentation](http://www.gnu.org/software/coreutils/manual/html_node/index.html),” Free Software Foundation, Inc., 2014.
1. Martin Kleppmann: “[Kafka, Samza, and the Unix Philosophy of Distributed Data](http://martin.kleppmann.com/2015/08/05/kafka-samza-unix-philosophy-distributed-data.html),” *martin.kleppmann.com*, August 5, 2015.
1. Doug McIlroy: [Internal Bell Labs memo](https://swtch.com/~rsc/thread/mdmpipe.pdf), October 1964. Cited in: Dennis M. Ritchie: “[Advice from Doug McIlroy](https://www.bell-labs.com/usr/dmr/www/mdmpipe.html),” *bell-labs.com*.
1. M. D. McIlroy, E. N. Pinson, and B. A. Tague: “[UNIX Time-Sharing System: Foreword](https://archive.org/details/bstj57-6-1899),” *The Bell System Technical Journal*, volume 57, number 6, pages 1899–1904, July 1978.
1. Eric S. Raymond: [*The Art of UNIX Programming*](http://www.catb.org/~esr/writings/taoup/html/). Addison-Wesley, 2003. ISBN: 978-0-13-142901-7
1. Ronald Duncan: “[Text File Formats – ASCII Delimited Text – Not CSV or TAB Delimited Text](https://ronaldduncan.wordpress.com/2009/10/31/text-file-formats-ascii-delimited-text-not-csv-or-tab-delimited-text/),” *ronaldduncan.wordpress.com*, October 31, 2009.
1. Alan Kay: “[Is 'Software Engineering' an Oxymoron?](http://tinlizzie.org/~takashi/IsSoftwareEngineeringAnOxymoron.pdf),” *tinlizzie.org*.
1. Martin Fowler: “[InversionOfControl](http://martinfowler.com/bliki/InversionOfControl.html),” *martinfowler.com*, June 26, 2005.
1. Daniel J. Bernstein: “[Two File Descriptors for Sockets](http://cr.yp.to/tcpip/twofd.html),” *cr.yp.to*.
1. Rob Pike and Dennis M. Ritchie: “[The Styx Architecture for Distributed Systems](http://doc.cat-v.org/inferno/4th_edition/styx),” *Bell Labs Technical Journal*, volume 4, number 2, pages 146–152, April 1999.
1. Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung: “[The Google File System](http://research.google.com/archive/gfs-sosp2003.pdf),” at *19th ACM Symposium on Operating Systems Principles* (SOSP), October 2003. [doi:10.1145/945445.945450](http://dx.doi.org/10.1145/945445.945450)
1. Michael Ovsiannikov, Silvius Rus, Damian Reeves, et al.: “[The Quantcast File System](http://db.disi.unitn.eu/pages/VLDBProgram/pdf/industry/p808-ovsiannikov.pdf),” *Proceedings of the VLDB Endowment*, volume 6, number 11, pages 1092–1101, August 2013. [doi:10.14778/2536222.2536234](http://dx.doi.org/10.14778/2536222.2536234)
1. “[OpenStack Swift 2.6.1 Developer Documentation](http://docs.openstack.org/developer/swift/),” OpenStack Foundation, *docs.openstack.org*, March 2016.
1. Zhe Zhang, Andrew Wang, Kai Zheng, et al.: “[Introduction to HDFS Erasure Coding in Apache Hadoop](https://blog.cloudera.com/introduction-to-hdfs-erasure-coding-in-apache-hadoop/),” *blog.cloudera.com*, September 23, 2015.
1. Peter Cnudde: “[Hadoop Turns 10](https://web.archive.org/web/20190119112713/https://yahoohadoop.tumblr.com/post/138739227316/hadoop-turns-10),” *yahoohadoop.tumblr.com*, February 5, 2016.
1. Eric Baldeschwieler: “[Thinking About the HDFS vs. Other Storage Technologies](https://web.archive.org/web/20190529215115/http://hortonworks.com/blog/thinking-about-the-hdfs-vs-other-storage-technologies/),” *hortonworks.com*, July 25, 2012.
1. Brendan Gregg: “[Manta: Unix Meets Map Reduce](https://web.archive.org/web/20220125052545/http://dtrace.org/blogs/brendan/2013/06/25/manta-unix-meets-map-reduce/),” *dtrace.org*, June 25, 2013.
1. Tom White: *Hadoop: The Definitive Guide*, 4th edition. O'Reilly Media, 2015. ISBN: 978-1-491-90163-2
1. Jim N. Gray: “[Distributed Computing Economics](http://arxiv.org/pdf/cs/0403019.pdf),” Microsoft Research Tech Report MSR-TR-2003-24, March 2003.
1. Márton Trencséni: “[Luigi vs Airflow vs Pinball](http://bytepawn.com/luigi-airflow-pinball.html),” *bytepawn.com*, February 6, 2016.
1. Roshan Sumbaly, Jay Kreps, and Sam Shah: “[The 'Big Data' Ecosystem at LinkedIn](http://www.slideshare.net/s_shah/the-big-data-ecosystem-at-linkedin-23512853),” at *ACM International Conference on Management of Data* (SIGMOD), July 2013. [doi:10.1145/2463676.2463707](http://dx.doi.org/10.1145/2463676.2463707)
1. Alan F. Gates, Olga Natkovich, Shubham Chopra, et al.: “[Building a High-Level Dataflow System on Top of Map-Reduce: The Pig Experience](http://www.vldb.org/pvldb/vol2/vldb09-1074.pdf),” at *35th International Conference on Very Large Data Bases* (VLDB), August 2009.
1. Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, et al.: “[Hive – A Petabyte Scale Data Warehouse Using Hadoop](http://i.stanford.edu/~ragho/hive-icde2010.pdf),” at *26th IEEE International Conference on Data Engineering* (ICDE), March 2010. [doi:10.1109/ICDE.2010.5447738](http://dx.doi.org/10.1109/ICDE.2010.5447738)
1. “[Cascading 3.0 User Guide](https://web.archive.org/web/20231206195311/http://docs.cascading.org/cascading/3.0/userguide/),” Concurrent, Inc., *docs.cascading.org*, January 2016.
1. “[Apache Crunch User Guide](https://crunch.apache.org/user-guide.html),” Apache Software Foundation, *crunch.apache.org*.
1. Craig Chambers, Ashish Raniwala, Frances Perry, et al.: “[FlumeJava: Easy, Efficient Data-Parallel Pipelines](https://research.google.com/pubs/archive/35650.pdf),” at *31st ACM SIGPLAN Conference on Programming Language Design and Implementation* (PLDI), June 2010. [doi:10.1145/1806596.1806638](http://dx.doi.org/10.1145/1806596.1806638)
1. Jay Kreps: “[Why Local State is a Fundamental Primitive in Stream Processing](https://www.oreilly.com/ideas/why-local-state-is-a-fundamental-primitive-in-stream-processing),” *oreilly.com*, July 31, 2014.
1. Martin Kleppmann: “[Rethinking Caching in Web Apps](http://martin.kleppmann.com/2012/10/01/rethinking-caching-in-web-apps.html),” *martin.kleppmann.com*, October 1, 2012.
1. Mark Grover, Ted Malaska, Jonathan Seidman, and Gwen Shapira: *[Hadoop Application Architectures](http://shop.oreilly.com/product/0636920033196.do)*. O'Reilly Media, 2015. ISBN: 978-1-491-90004-8
1. Philippe Ajoux, Nathan Bronson, Sanjeev Kumar, et al.: “[Challenges to Adopting Stronger Consistency at Scale](https://www.usenix.org/system/files/conference/hotos15/hotos15-paper-ajoux.pdf),” at *15th USENIX Workshop on Hot Topics in Operating Systems* (HotOS), May 2015.
1. Sriranjan Manjunath: “[Skewed Join](https://web.archive.org/web/20151228114742/https://wiki.apache.org/pig/PigSkewedJoinSpec),” *wiki.apache.org*, 2009.
1. David J. DeWitt, Jeffrey F. Naughton, Donovan A. Schneider, and S. Seshadri: “[Practical Skew Handling in Parallel Joins](http://www.vldb.org/conf/1992/P027.PDF),” at *18th International Conference on Very Large Data Bases* (VLDB), August 1992.
1. Marcel Kornacker, Alexander Behm, Victor Bittorf, et al.: “[Impala: A Modern, Open-Source SQL Engine for Hadoop](http://pandis.net/resources/cidr15impala.pdf),” at *7th Biennial Conference on Innovative Data Systems Research* (CIDR), January 2015.
1. Matthieu Monsch: “[Open-Sourcing PalDB, a Lightweight Companion for Storing Side Data](https://engineering.linkedin.com/blog/2015/10/open-sourcing-paldb--a-lightweight-companion-for-storing-side-da),” *engineering.linkedin.com*, October 26, 2015.
1. Daniel Peng and Frank Dabek: “[Large-Scale Incremental Processing Using Distributed Transactions and Notifications](https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Peng.pdf),” at *9th USENIX conference on Operating Systems Design and Implementation* (OSDI), October 2010.
1. “[Cloudera Search User Guide](http://www.cloudera.com/documentation/cdh/5-1-x/Search/Cloudera-Search-User-Guide/Cloudera-Search-User-Guide.html),” Cloudera, Inc., September 2015.
1. Lili Wu, Sam Shah, Sean Choi, et al.: “[The Browsemaps: Collaborative Filtering at LinkedIn](http://ceur-ws.org/Vol-1271/Paper3.pdf),” at *6th Workshop on Recommender Systems and the Social Web* (RSWeb), October 2014.
1. Roshan Sumbaly, Jay Kreps, Lei Gao, et al.: “[Serving Large-Scale Batch Computed Data with Project Voldemort](http://static.usenix.org/events/fast12/tech/full_papers/Sumbaly.pdf),” at *10th USENIX Conference on File and Storage Technologies* (FAST), February 2012.
1. Varun Sharma: “[Open-Sourcing Terrapin: A Serving System for Batch Generated Data](https://web.archive.org/web/20170215032514/https://engineering.pinterest.com/blog/open-sourcing-terrapin-serving-system-batch-generated-data-0),” *engineering.pinterest.com*, September 14, 2015.
1. Nathan Marz: “[ElephantDB](http://www.slideshare.net/nathanmarz/elephantdb),” *slideshare.net*, May 30, 2011.
1. Jean-Daniel (JD) Cryans: “[How-to: Use HBase Bulk Loading, and Why](https://blog.cloudera.com/how-to-use-hbase-bulk-loading-and-why/),” *blog.cloudera.com*, September 27, 2013.
1. Nathan Marz: “[How to Beat the CAP Theorem](http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html),” *nathanmarz.com*, October 13, 2011.
1. Molly Bartlett Dishman and Martin Fowler: “[Agile Architecture](https://web.archive.org/web/20161130034721/http://conferences.oreilly.com/software-architecture/sa2015/public/schedule/detail/40388),” at *O'Reilly Software Architecture Conference*, March 2015.
1. David J. DeWitt and Jim N. Gray: “[Parallel Database Systems: The Future of High Performance Database Systems](http://www.cs.cmu.edu/~pavlo/courses/fall2013/static/papers/dewittgray92.pdf),” *Communications of the ACM*, volume 35, number 6, pages 85–98, June 1992. [doi:10.1145/129888.129894](http://dx.doi.org/10.1145/129888.129894)
1. Jay Kreps: “[But the multi-tenancy thing is actually really really hard](https://twitter.com/jaykreps/status/528235702480142336),” tweetstorm, *twitter.com*, October 31, 2014.
1. Jeffrey Cohen, Brian Dolan, Mark Dunlap, et al.: “[MAD Skills: New Analysis Practices for Big Data](http://www.vldb.org/pvldb/vol2/vldb09-219.pdf),” *Proceedings of the VLDB Endowment*, volume 2, number 2, pages 1481–1492, August 2009. [doi:10.14778/1687553.1687576](http://dx.doi.org/10.14778/1687553.1687576)
1. Ignacio Terrizzano, Peter Schwarz, Mary Roth, and John E. Colino: “[Data Wrangling: The Challenging Journey from the Wild to the Lake](http://cidrdb.org/cidr2015/Papers/CIDR15_Paper2.pdf),” at *7th Biennial Conference on Innovative Data Systems Research* (CIDR), January 2015.
1. Paige Roberts: “[To Schema on Read or to Schema on Write, That Is the Hadoop Data Lake Question](https://web.archive.org/web/20171105001306/http://adaptivesystemsinc.com/blog/to-schema-on-read-or-to-schema-on-write-that-is-the-hadoop-data-lake-question/),” *adaptivesystemsinc.com*, July 2, 2015.
1. Bobby Johnson and Joseph Adler: “[The Sushi Principle: Raw Data Is Better](https://web.archive.org/web/20161126104941/https://conferences.oreilly.com/strata/big-data-conference-ca-2015/public/schedule/detail/38737),” at *Strata+Hadoop World*, February 2015.
1. Vinod Kumar Vavilapalli, Arun C. Murthy, Chris Douglas, et al.: “[Apache Hadoop YARN: Yet Another Resource Negotiator](https://www.cs.cmu.edu/~garth/15719/papers/yarn.pdf),” at *4th ACM Symposium on Cloud Computing* (SoCC), October 2013. [doi:10.1145/2523616.2523633](http://dx.doi.org/10.1145/2523616.2523633)
1. Abhishek Verma, Luis Pedrosa, Madhukar Korupolu, et al.: “[Large-Scale Cluster Management at Google with Borg](http://research.google.com/pubs/pub43438.html),” at *10th European Conference on Computer Systems* (EuroSys), April 2015. [doi:10.1145/2741948.2741964](http://dx.doi.org/10.1145/2741948.2741964)
1. Malte Schwarzkopf: “[The Evolution of Cluster Scheduler Architectures](https://web.archive.org/web/20201109052657/http://www.firmament.io/blog/scheduler-architectures.html),” *firmament.io*, March 9, 2016.
1. Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, et al.: “[Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing](https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf),” at *9th USENIX Symposium on Networked Systems Design and Implementation* (NSDI), April 2012.
1. Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia: *Learning Spark*. O'Reilly Media, 2015. ISBN: 978-1-449-35904-1
1. Bikas Saha and Hitesh Shah: “[Apache Tez: Accelerating Hadoop Query Processing](http://www.slideshare.net/Hadoop_Summit/w-1205phall1saha),” at *Hadoop Summit*, June 2014.
1. Bikas Saha, Hitesh Shah, Siddharth Seth, et al.: “[Apache Tez: A Unifying Framework for Modeling and Building Data Processing Applications](http://home.cse.ust.hk/~weiwa/teaching/Fall15-COMP6611B/reading_list/Tez.pdf),” at *ACM International Conference on Management of Data* (SIGMOD), June 2015. [doi:10.1145/2723372.2742790](http://dx.doi.org/10.1145/2723372.2742790)
1. Kostas Tzoumas: “[Apache Flink: API, Runtime, and Project Roadmap](http://www.slideshare.net/KostasTzoumas/apache-flink-api-runtime-and-project-roadmap),” *slideshare.net*, January 14, 2015.
1. Alexander Alexandrov, Rico Bergmann, Stephan Ewen, et al.: “[The Stratosphere Platform for Big Data Analytics](https://ssc.io/pdf/2014-VLDBJ_Stratosphere_Overview.pdf),” *The VLDB Journal*, volume 23, number 6, pages 939–964, May 2014. [doi:10.1007/s00778-014-0357-y](http://dx.doi.org/10.1007/s00778-014-0357-y)
1. Michael Isard, Mihai Budiu, Yuan Yu, et al.: “[Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks](https://www.microsoft.com/en-us/research/publication/dryad-distributed-data-parallel-programs-from-sequential-building-blocks/),” at *European Conference on Computer Systems* (EuroSys), March 2007. [doi:10.1145/1272996.1273005](http://dx.doi.org/10.1145/1272996.1273005)
1. Daniel Warneke and Odej Kao: “[Nephele: Efficient Parallel Data Processing in the Cloud](https://stratosphere2.dima.tu-berlin.de/assets/papers/Nephele_09.pdf),” at *2nd Workshop on Many-Task Computing on Grids and Supercomputers* (MTAGS), November 2009. [doi:10.1145/1646468.1646476](http://dx.doi.org/10.1145/1646468.1646476)
1. Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd: “[The PageRank Citation Ranking: Bringing Order to the Web](https://web.archive.org/web/20230219170930/http://ilpubs.stanford.edu:8090/422/),” Stanford InfoLab Technical Report 422, 1999.
1. Leslie G. Valiant: “[A Bridging Model for Parallel Computation](http://dl.acm.org/citation.cfm?id=79181),” *Communications of the ACM*, volume 33, number 8, pages 103–111, August 1990. [doi:10.1145/79173.79181](http://dx.doi.org/10.1145/79173.79181)
1. Stephan Ewen, Kostas Tzoumas, Moritz Kaufmann, and Volker Markl: “[Spinning Fast Iterative Data Flows](http://vldb.org/pvldb/vol5/p1268_stephanewen_vldb2012.pdf),” *Proceedings of the VLDB Endowment*, volume 5, number 11, pages 1268–1279, July 2012. [doi:10.14778/2350229.2350245](http://dx.doi.org/10.14778/2350229.2350245)
1. Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, et al.: “[Pregel: A System for Large-Scale Graph Processing](https://kowshik.github.io/JPregel/pregel_paper.pdf),” at *ACM International Conference on Management of Data* (SIGMOD), June 2010. [doi:10.1145/1807167.1807184](http://dx.doi.org/10.1145/1807167.1807184)
1. Frank McSherry, Michael Isard, and Derek G. Murray: “[Scalability! But at What COST?](http://www.frankmcsherry.org/assets/COST.pdf),” at *15th USENIX Workshop on Hot Topics in Operating Systems* (HotOS), May 2015.
1. Ionel Gog, Malte Schwarzkopf, Natacha Crooks, et al.: “[Musketeer: All for One, One for All in Data Processing Systems](http://www.cl.cam.ac.uk/research/srg/netos/camsas/pubs/eurosys15-musketeer.pdf),” at *10th European Conference on Computer Systems* (EuroSys), April 2015. [doi:10.1145/2741948.2741968](http://dx.doi.org/10.1145/2741948.2741968)
1. Aapo Kyrola, Guy Blelloch, and Carlos Guestrin: “[GraphChi: Large-Scale Graph Computation on Just a PC](https://www.usenix.org/system/files/conference/osdi12/osdi12-final-126.pdf),” at *10th USENIX Symposium on Operating Systems Design and Implementation* (OSDI), October 2012.
1. Andrew Lenharth, Donald Nguyen, and Keshav Pingali: “[Parallel Graph Analytics](http://cacm.acm.org/magazines/2016/5/201591-parallel-graph-analytics/fulltext),” *Communications of the ACM*, volume 59, number 5, pages 78–87, May 2016. [doi:10.1145/2901919](http://dx.doi.org/10.1145/2901919)
1. Fabian Hüske: “[Peeking into Apache Flink's Engine Room](http://flink.apache.org/news/2015/03/13/peeking-into-Apache-Flinks-Engine-Room.html),” *flink.apache.org*, March 13, 2015.
1. Mostafa Mokhtar: “[Hive 0.14 Cost Based Optimizer (CBO) Technical Overview](https://web.archive.org/web/20170607112708/http://hortonworks.com/blog/hive-0-14-cost-based-optimizer-cbo-technical-overview/),” *hortonworks.com*, March 2, 2015.
1. Michael Armbrust, Reynold S Xin, Cheng Lian, et al.: “[Spark SQL: Relational Data Processing in Spark](http://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf),” at *ACM International Conference on Management of Data* (SIGMOD), June 2015. [doi:10.1145/2723372.2742797](http://dx.doi.org/10.1145/2723372.2742797)
1. Daniel Blazevski: “[Planting Quadtrees for Apache Flink](https://blog.insightdatascience.com/planting-quadtrees-for-apache-flink-b396ebc80d35),” *insightdataengineering.com*, March 25, 2016.
1. Tom White: “[Genome Analysis Toolkit: Now Using Apache Spark for Data Processing](https://web.archive.org/web/20190215132904/http://blog.cloudera.com/blog/2016/04/genome-analysis-toolkit-now-using-apache-spark-for-data-processing/),” *blog.cloudera.com*, April 6, 2016.
================================================
FILE: content/v1/ch11.md
================================================
---
title: "第十一章:流处理"
linkTitle: "11. 流处理"
weight: 311
math: true
breadcrumbs: false
---

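> A complex system that works has always evolved from a simple system that worked. The converse also holds: a complex system designed from scratch never works.
>
> John Gall, *Systemantics* (1975)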
在 [第十章](/v1/ch10) 中,我们讨论了批处理技术,它读取一组文件作为输入,并生成一组新的文件作为输出。输出是 **衍生数据(derived data)** 的一种形式;也就是说,如果需要,可以通过再次运行批处理过程来重新创建数据集。我们看到了如何使用这个简单而强大的想法来建立搜索索引、推荐系统、做分析等等。
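However, one big assumption remained throughout [Chapter 10](/v1/ch10): the input is bounded, that is, of a known and finite size, so the batch process knows when it has finished reading its input. For example, the sorting operation that is central to MapReduce must read its entire input before it can start producing output: the very last input record might be the one with the lowest key, and thus needs to be the very first output record, so starting the output early is not an option.
In reality, a lot of data is **unbounded** because it arrives gradually over time: your users produced data yesterday and today, and they will continue to produce more data tomorrow. Unless you go out of business, this process never ends, so the dataset is never “complete” in any meaningful way 【1】. Batch processors therefore have to divide the data artificially into chunks of fixed duration, for example by processing a day's worth of data at the end of every day, or an hour's worth of data at the end of every hour.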
日常批处理中的问题是,输入的变更只会在一天之后的输出中反映出来,这对于许多急躁的用户来说太慢了。为了减少延迟,我们可以更频繁地运行处理 —— 比如说,在每秒钟的末尾 —— 或者甚至更连续一些,完全抛开固定的时间切片,当事件发生时就立即进行处理,这就是 **流处理(stream processing)** 背后的想法。
一般来说,“流” 是指随着时间的推移逐渐可用的数据。这个概念出现在很多地方:Unix 的 stdin 和 stdout、编程语言(惰性列表)【2】、文件系统 API(如 Java 的 `FileInputStream`)、TCP 连接、通过互联网传送音频和视频等等。
在本章中,我们将把 **事件流(event stream)** 视为一种数据管理机制:无界限,增量处理,与上一章中的批量数据相对应。我们将首先讨论怎样表示、存储、通过网络传输流。在 “[数据库与流](#数据库与流)” 中,我们将研究流和数据库之间的关系。最后在 “[流处理](#流处理)” 中,我们将研究连续处理这些流的方法和工具,以及它们用于应用构建的方式。
## 传递事件流
在批处理领域,作业的输入和输出是文件(也许在分布式文件系统上)。流处理领域中的等价物看上去是什么样子的?
当输入是一个文件(一个字节序列),第一个处理步骤通常是将其解析为一系列记录。在流处理的上下文中,记录通常被叫做 **事件(event)** ,但它本质上是一样的:一个小的、自包含的、不可变的对象,包含某个时间点发生的某件事情的细节。一个事件通常包含一个来自日历时钟的时间戳,以指明事件发生的时间(请参阅 “[单调钟与日历时钟](/v1/ch8#单调钟与日历时钟)”)。
例如,发生的事件可能是用户采取的行动,例如查看页面或进行购买。它也可能来源于机器,例如对温度传感器或 CPU 利用率的周期性测量。在 “[使用 Unix 工具的批处理](/v1/ch10#使用Unix工具的批处理)” 的示例中,Web 服务器日志的每一行都是一个事件。
事件可能被编码为文本字符串或 JSON,或者某种二进制编码,如 [第四章](/v1/ch4) 所述。这种编码允许你存储一个事件,例如将其追加到一个文件,将其插入关系表,或将其写入文档数据库。它还允许你通过网络将事件发送到另一个节点以进行处理。
在批处理中,文件被写入一次,然后可能被多个作业读取。类似地,在流处理术语中,一个事件由 **生产者(producer)** (也称为 **发布者(publisher)** 或 **发送者(sender)** )生成一次,然后可能由多个 **消费者(consumer)** ( **订阅者(subscribers)** 或 **接收者(recipients)** )进行处理【3】。在文件系统中,文件名标识一组相关记录;在流式系统中,相关的事件通常被聚合为一个 **主题(topic)** 或 **流(stream)** 。
原则上讲,文件或数据库就足以连接生产者和消费者:生产者将其生成的每个事件写入数据存储,且每个消费者定期轮询数据存储,检查自上次运行以来新出现的事件。这实际上正是批处理在每天结束时处理当天数据时所做的事情。
但当我们想要进行低延迟的连续处理时,如果数据存储不是为这种用途专门设计的,那么轮询开销就会很大。轮询的越频繁,能返回新事件的请求比例就越低,而额外开销也就越高。相比之下,最好能在新事件出现时直接通知消费者。
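Databases have traditionally not supported this kind of notification mechanism very well. Relational databases commonly have **triggers**, which can react to a change (for example, a row being inserted into a table), but they are very limited in what they can do and have been somewhat of an afterthought in database design 【4,5】. Instead, specialized tools have been developed for the purpose of delivering event notifications.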
### 消息传递系统
向消费者通知新事件的常用方式是使用 **消息传递系统(messaging system)**:生产者发送包含事件的消息,然后将消息推送给消费者。我们之前在 “[消息传递中的数据流](/v1/ch4#消息传递中的数据流)” 中谈到了这些系统,但现在我们将详细介绍这些系统。
像生产者和消费者之间的 Unix 管道或 TCP 连接这样的直接信道,是实现消息传递系统的简单方法。但是,大多数消息传递系统都在这一基本模型上进行了扩展。特别的是,Unix 管道和 TCP 将恰好一个发送者与恰好一个接收者连接,而一个消息传递系统允许多个生产者节点将消息发送到同一个主题,并允许多个消费者节点接收主题中的消息。
在这个 **发布 / 订阅** 模式中,不同的系统采取各种各样的方法,并没有针对所有目的的通用答案。为了区分这些系统,问一下这两个问题会特别有帮助:
1. **如果生产者发送消息的速度比消费者能够处理的速度快会发生什么?** 一般来说,有三种选择:系统可以丢掉消息,将消息放入缓冲队列,或使用 **背压**(backpressure,也称为 **流量控制**,即 flow control:阻塞生产者,以免其发送更多的消息)。例如 Unix 管道和 TCP 就使用了背压:它们有一个固定大小的小缓冲区,如果填满,发送者会被阻塞,直到接收者从缓冲区中取出数据(请参阅 “[网络拥塞和排队](/v1/ch8#网络拥塞和排队)”)。
如果消息被缓存在队列中,那么理解队列增长会发生什么是很重要的。当队列装不进内存时系统会崩溃吗?还是将消息写入磁盘?如果是这样,磁盘访问又会如何影响消息传递系统的性能【6】?
2. **如果节点崩溃或暂时脱机,会发生什么情况? —— 是否会有消息丢失?** 与数据库一样,持久性可能需要写入磁盘和 / 或复制的某种组合(请参阅 “[复制与持久性](/v1/ch7#复制与持久性)”),这是有代价的。如果你能接受有时消息会丢失,则可能在同一硬件上获得更高的吞吐量和更低的延迟。
是否可以接受消息丢失取决于应用。例如,对于周期传输的传感器读数和指标,偶尔丢失的数据点可能并不重要,因为更新的值会在短时间内发出。但要注意,如果大量的消息被丢弃,可能无法立刻意识到指标已经不正确了【7】。如果你正在对事件计数,那么它们能够可靠送达是更重要的,因为每个丢失的消息都意味着使计数器的错误扩大。
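The three overload policies from question 1 can be sketched with Python's standard `queue` module. This is only an illustration under stated assumptions: the function name `send` and the policy strings are hypothetical, and a real messaging system would apply these policies across a network rather than inside one process. Plain buffering corresponds to passing an unbounded `queue.Queue()`, which never reports that it is full.

```python
import queue

def send(buf, msg, policy):
    """Hand msg to a buffer, using the given policy when the buffer is full."""
    if policy == "drop":
        try:
            buf.put_nowait(msg)
            return True
        except queue.Full:
            return False          # the message is lost
    if policy == "backpressure":
        buf.put(msg, block=True)  # blocks the producer until a slot is free
        return True
    raise ValueError("unknown policy: " + policy)

bounded = queue.Queue(maxsize=2)   # small fixed-size buffer, like a pipe
results = [send(bounded, i, "drop") for i in range(3)]
# The third message does not fit into the two-slot buffer and is dropped.
```

With the `"backpressure"` policy the same third call would block until a consumer thread called `bounded.get()`, which is the behavior the text attributes to Unix pipes and TCP.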
我们在 [第十章](/v1/ch10) 中探讨的批处理系统的一个很好的特性是,它们提供了强大的可靠性保证:失败的任务会自动重试,失败任务的部分输出会自动丢弃。这意味着输出与没有发生故障一样,这有助于简化编程模型。在本章的后面,我们将研究如何在流处理的上下文中提供类似的保证。
#### 直接从生产者传递给消费者
许多消息传递系统使用生产者和消费者之间的直接网络通信,而不通过中间节点:
* UDP 组播广泛应用于金融行业,例如股票市场,其中低时延非常重要【8】。虽然 UDP 本身是不可靠的,但应用层的协议可以恢复丢失的数据包(生产者必须记住它发送的数据包,以便能按需重新发送数据包)。
* 无代理的消息库,如 ZeroMQ 【9】和 nanomsg 采取类似的方法,通过 TCP 或 IP 多播实现发布 / 订阅消息传递。
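* StatsD 【10】 and Brubeck 【7】 use unreliable UDP messaging to collect metrics from all machines on the network and monitor them. (In the StatsD protocol, counter metrics are only correct if all messages are received; using UDP makes the metrics at best approximate 【11】. See also “[TCP 与 UDP](/v1/ch8#TCP与UDP)”.)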
* 如果消费者在网络上公开了服务,生产者可以直接发送 HTTP 或 RPC 请求(请参阅 “[服务中的数据流:REST 与 RPC](/v1/ch4#服务中的数据流:REST与RPC)”)将消息推送给使用者。这就是 webhooks 背后的想法【12】,一种服务的回调 URL 被注册到另一个服务中,并且每当事件发生时都会向该 URL 发出请求。
尽管这些直接消息传递系统在设计它们的环境中运行良好,但是它们通常要求应用代码意识到消息丢失的可能性。它们的容错程度极为有限:即使协议检测到并重传在网络中丢失的数据包,它们通常也只是假设生产者和消费者始终在线。
如果消费者处于脱机状态,则可能会丢失其不可达时发送的消息。一些协议允许生产者重试失败的消息传递,但当生产者崩溃时,它可能会丢失消息缓冲区及其本应发送的消息,这种方法可能就没用了。
#### 消息代理
一种广泛使用的替代方法是通过 **消息代理**(message broker,也称为 **消息队列**,即 message queue)发送消息,消息代理实质上是一种针对处理消息流而优化的数据库。它作为服务器运行,生产者和消费者作为客户端连接到服务器。生产者将消息写入代理,消费者通过从代理那里读取来接收消息。
通过将数据集中在代理上,这些系统可以更容易地容忍来来去去的客户端(连接,断开连接和崩溃),而持久性问题则转移到代理的身上。一些消息代理只将消息保存在内存中,而另一些消息代理(取决于配置)将其写入磁盘,以便在代理崩溃的情况下不会丢失。针对缓慢的消费者,它们通常会允许无上限的排队(而不是丢弃消息或背压),尽管这种选择也可能取决于配置。
排队的结果是,消费者通常是 **异步(asynchronous)** 的:当生产者发送消息时,通常只会等待代理确认消息已经被缓存,而不等待消息被消费者处理。向消费者递送消息将发生在未来某个未定的时间点 —— 通常在几分之一秒之内,但有时当消息堆积时会显著延迟。
#### 消息代理与数据库的对比
有些消息代理甚至可以使用 XA 或 JTA 参与两阶段提交协议(请参阅 “[实践中的分布式事务](/v1/ch9#实践中的分布式事务)”)。这个功能与数据库在本质上非常相似,尽管消息代理和数据库之间仍存在实践上很重要的差异:
* 数据库通常保留数据直至显式删除,而大多数消息代理在消息成功递送给消费者时会自动删除消息。这样的消息代理不适合长期的数据存储。
* 由于它们很快就能删除消息,大多数消息代理都认为它们的工作集相当小 —— 即队列很短。如果代理需要缓冲很多消息,比如因为消费者速度较慢(如果内存装不下消息,可能会溢出到磁盘),每个消息需要更长的处理时间,整体吞吐量可能会恶化【6】。
* 数据库通常支持次级索引和各种搜索数据的方式,而消息代理通常支持按照某种模式匹配主题,订阅其子集。虽然机制并不一样,但对于客户端选择想要了解的数据的一部分,都是基本的方式。
* 查询数据库时,结果通常基于某个时间点的数据快照;如果另一个客户端随后向数据库写入一些改变了查询结果的内容,则第一个客户端不会发现其先前结果现已过期(除非它重复查询或轮询变更)。相比之下,消息代理不支持任意查询,但是当数据发生变化时(即新消息可用时),它们会通知客户端。
这是关于消息代理的传统观点,它被封装在诸如 JMS 【14】和 AMQP 【15】的标准中,并且被诸如 RabbitMQ、ActiveMQ、HornetQ、Qpid、TIBCO 企业消息服务、IBM MQ、Azure Service Bus 和 Google Cloud Pub/Sub 所实现 【16】。
#### 多个消费者
当多个消费者从同一主题中读取消息时,有两种主要的消息传递模式,如 [图 11-1](/v1/ddia_1101.png) 所示:
负载均衡(load balancing)
: 每条消息都被传递给消费者 **之一**,所以处理该主题下消息的工作能被多个消费者共享。代理可以为消费者任意分配消息。当处理消息的代价高昂,希望能并行处理消息时,此模式非常有用(在 AMQP 中,可以通过让多个客户端从同一个队列中消费来实现负载均衡,而在 JMS 中则称之为 **共享订阅**,即 shared subscription)。
扇出(fan-out)
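: Each message is delivered to **all** of the consumers. Fan-out allows several independent consumers to each “tune in” to the same broadcast of messages without affecting each other; it is the streaming equivalent of several different batch jobs reading the same input file. (This feature is provided by topic subscriptions in JMS and by exchange bindings in AMQP.)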

**图 11-1 (a)负载平衡:在消费者间共享消费主题;(b)扇出:将每条消息传递给多个消费者。**
两种模式可以组合使用:例如,两个独立的消费者组可以每组各订阅同一个主题,每一组都共同收到所有消息,但在每一组内部,每条消息仅由单个节点处理。
#### 确认与重新传递
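Consumers may crash at any time, so it can happen that the broker delivers a message to a consumer, but the consumer never processes it, or only partially processes it before crashing. To ensure that the message is not lost, message brokers use **acknowledgments**: a client must explicitly tell the broker when it has finished processing a message, so that the broker can remove it from the queue.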
如果与客户端的连接关闭,或者代理超出一段时间未收到确认,代理则认为消息没有被处理,因此它将消息再递送给另一个消费者。(请注意可能发生这样的情况,消息 **实际上是** 处理完毕的,但 **确认** 在网络中丢失了。需要一种原子提交协议才能处理这种情况,正如在 “[实践中的分布式事务](/v1/ch9#实践中的分布式事务)” 中所讨论的那样)
当与负载均衡相结合时,这种重传行为对消息的顺序有种有趣的影响。在 [图 11-2](/v1/ddia_1102.png) 中,消费者通常按照生产者发送的顺序处理消息。然而消费者 2 在处理消息 m3 时崩溃,与此同时消费者 1 正在处理消息 m4。未确认的消息 m3 随后被重新发送给消费者 1,结果消费者 1 按照 m4,m3,m5 的顺序处理消息。因此 m3 和 m4 的交付顺序与生产者 1 的发送顺序不同。

**图 11-2 在处理 m3 时消费者 2 崩溃,因此稍后重传至消费者 1**
即使消息代理试图保留消息的顺序(如 JMS 和 AMQP 标准所要求的),负载均衡与重传的组合也不可避免地导致消息被重新排序。为避免此问题,你可以让每个消费者使用单独的队列(即不使用负载均衡功能)。如果消息是完全独立的,则消息顺序重排并不是一个问题。但正如我们将在本章后续部分所述,如果消息之间存在因果依赖关系,这就是一个很重要的问题。
### 分区日志
通过网络发送数据包或向网络服务发送请求通常是短暂的操作,不会留下永久的痕迹。尽管可以永久记录(通过抓包与日志),但我们通常不这么做。即使是将消息持久地写入磁盘的消息代理,在送达给消费者之后也会很快删除消息,因为它们建立在短暂消息传递的思维方式上。
数据库和文件系统采用截然相反的方法论:至少在某人显式删除前,通常写入数据库或文件的所有内容都要被永久记录下来。
这种思维方式上的差异对创建衍生数据的方式有巨大影响。如 [第十章](/v1/ch10) 所述,批处理过程的一个关键特性是,你可以反复运行它们,试验处理步骤,不用担心损坏输入(因为输入是只读的)。而 AMQP/JMS 风格的消息传递并非如此:收到消息是具有破坏性的,因为确认可能导致消息从代理中被删除,因此你不能期望再次运行同一个消费者能得到相同的结果。
如果你将新的消费者添加到消息传递系统,通常只能接收到消费者注册之后开始发送的消息。先前的任何消息都随风而逝,一去不复返。作为对比,你可以随时为文件和数据库添加新的客户端,且能读取任意久远的数据(只要应用没有显式覆盖或删除这些数据)。
为什么我们不能把它俩杂交一下,既有数据库的持久存储方式,又有消息传递的低延迟通知?这就是 **基于日志的消息代理(log-based message brokers)** 背后的想法。
#### 使用日志进行消息存储
日志只是磁盘上简单的仅追加记录序列。我们先前在 [第三章](/v1/ch3) 中日志结构存储引擎和预写式日志的上下文中讨论了日志,在 [第五章](/v1/ch5) 复制的上下文里也讨论了它。
同样的结构可以用于实现消息代理:生产者通过将消息追加到日志末尾来发送消息,而消费者通过依次读取日志来接收消息。如果消费者读到日志末尾,则会等待新消息追加的通知。Unix 工具 `tail -f` 能监视文件被追加写入的数据,基本上就是这样工作的。
为了伸缩超出单个磁盘所能提供的更高吞吐量,可以对日志进行 **分区**(按 [第六章](/v1/ch6) 的定义)。不同的分区可以托管在不同的机器上,使得每个分区都有一份能独立于其他分区进行读写的日志。一个主题可以定义为一组携带相同类型消息的分区。这种方法如 [图 11-3](/v1/ddia_1103.png) 所示。
在每个分区内,代理为每个消息分配一个单调递增的序列号或 **偏移量**(offset,在 [图 11-3](/v1/ddia_1103.png) 中,框中的数字是消息偏移量)。这种序列号是有意义的,因为分区是仅追加写入的,所以分区内的消息是完全有序的。没有跨不同分区的顺序保证。

**图 11-3 生产者通过将消息追加写入主题分区文件来发送消息,消费者依次读取这些文件**
Apache Kafka 【17,18】、Amazon Kinesis Streams 【19】和 Twitter 的 DistributedLog 【20,21】都是基于日志的消息代理。Google Cloud Pub/Sub 在架构上类似,但对外暴露的是 JMS 风格的 API,而不是日志抽象【16】。尽管这些消息代理将所有消息写入磁盘,但通过跨多台机器分区,每秒能够实现数百万条消息的吞吐量,并通过复制消息来实现容错性【22,23】。
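A minimal in-memory sketch of the partitioned log described above. The class name `PartitionedLog` and the choice to pick a partition by hashing a message key are assumptions made for illustration; the text only says that a topic is a group of partitions and that offsets increase monotonically within each partition.

```python
import zlib

class PartitionedLog:
    """A topic made of append-only partitions, each with its own offsets."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, value):
        # Assumption: messages with the same key go to the same partition.
        p = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[p].append(value)
        offset = len(self.partitions[p]) - 1   # monotonically increasing
        return p, offset

    def read(self, partition, offset):
        """Return all messages in one partition from the given offset onward."""
        return self.partitions[partition][offset:]

log = PartitionedLog(num_partitions=4)
p1, o1 = log.append("user-42", "page_view")
p2, o2 = log.append("user-42", "purchase")
```

Because both messages share a key, they land in the same partition and are totally ordered there; messages in different partitions have no ordering relative to each other.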
#### 日志与传统的消息传递相比
基于日志的方法天然支持扇出式消息传递,因为多个消费者可以独立读取日志,而不会相互影响 —— 读取消息不会将其从日志中删除。为了在一组消费者之间实现负载平衡,代理可以将整个分区分配给消费者组中的节点,而不是将单条消息分配给消费者客户端。
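Each client then consumes **all** the messages in the partitions it has been assigned. Typically, when a consumer has been assigned a log partition, it reads the messages in that partition sequentially, in a straightforward single-threaded manner. This coarse-grained approach to load balancing has some downsides: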
* 共享消费主题工作的节点数,最多为该主题中的日志分区数,因为同一个分区内的所有消息被递送到同一个节点 [^i]。
* 如果某条消息处理缓慢,则它会阻塞该分区中后续消息的处理(一种行首阻塞的形式;请参阅 “[描述性能](/v1/ch1#描述性能)”)。
因此在消息处理代价高昂,希望逐条并行处理,以及消息的顺序并没有那么重要的情况下,JMS/AMQP 风格的消息代理是可取的。另一方面,在消息吞吐量很高,处理迅速,顺序很重要的情况下,基于日志的方法表现得非常好。
[^i]: 要设计一种负载均衡方案也是有可能的,在这种方案中,两个消费者通过读取全部消息来共享分区处理的工作,但是其中一个只考虑具有偶数偏移量的消息,而另一个消费者只处理奇数编号的偏移量。或者你可以将消息摊到一个线程池中来处理,但这种方法会使消费者偏移量管理变得复杂。一般来说,单线程处理单分区是合适的,可以通过增加更多分区来提高并行度。
#### 消费者偏移量
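Consuming a partition sequentially makes it easy to tell which messages have been processed: all messages with an offset less than a consumer's current offset have already been processed, and all messages with a greater offset have not yet been seen. The broker therefore does not need to track acknowledgments for every single message; it only needs to periodically record the consumer offsets. The reduced bookkeeping overhead, together with the opportunities for batching and pipelining that this approach offers, helps increase the throughput of log-based systems.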
实际上,这种偏移量与单领导者数据库复制中常见的日志序列号非常相似,我们在 “[设置新从库](/v1/ch5#设置新从库)” 中讨论了这种情况。在数据库复制中,日志序列号允许跟随者断开连接后,重新连接到领导者,并在不跳过任何写入的情况下恢复复制。这里原理完全相同:消息代理表现得像一个主库,而消费者就像一个从库。
如果消费者节点失效,则失效消费者的分区将指派给其他节点,并从最后记录的偏移量开始消费消息。如果消费者已经处理了后续的消息,但还没有记录它们的偏移量,那么重启后这些消息将被处理两次。我们将在本章后面讨论这个问题的处理方法。
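The duplicate processing described above can be reproduced with a small simulation. The function `consume` and its parameters are hypothetical; it assumes the offset is recorded only every `commit_every` messages and that a crash loses any progress made since the last recorded offset.

```python
def consume(log, committed, process, commit_every=3, crash_after=None):
    """Process log[committed:] and return the last recorded offset."""
    offset = committed
    for n, msg in enumerate(log[committed:], start=1):
        process(msg)
        offset += 1
        if offset % commit_every == 0:
            committed = offset        # periodic offset checkpoint
        if crash_after is not None and n == crash_after:
            return committed          # simulated crash: later progress is lost
    return offset                     # clean shutdown records the final offset

seen = []
log = list("abcdefg")
recorded = consume(log, 0, seen.append, crash_after=4)   # crashes after "d"
recorded = consume(log, recorded, seen.append)           # another node resumes
```

The first run records offset 3 before crashing, so the second run starts at "d" again and that message is processed twice, which is the at-least-once behavior the paragraph describes.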
#### 磁盘空间使用
如果只追加写入日志,则磁盘空间终究会耗尽。为了回收磁盘空间,日志实际上被分割成段,并不时地将旧段删除或移动到归档存储。(我们将在后面讨论一种更为复杂的磁盘空间释放方式)
这就意味着如果一个慢消费者跟不上消息产生的速率而落后得太多,它的消费偏移量指向了删除的段,那么它就会错过一些消息。实际上,日志实现了一个有限大小的缓冲区,当缓冲区填满时会丢弃旧消息,它也被称为 **循环缓冲区(circular buffer)** 或 **环形缓冲区(ring buffer)**。不过由于缓冲区在磁盘上,因此缓冲区可能相当的大。
让我们做个简单计算。在撰写本文时,典型的大型硬盘容量为 6TB,顺序写入吞吐量为 150MB/s。如果以最快的速度写消息,则需要大约 11 个小时才能填满磁盘。因而磁盘可以缓冲 11 个小时的消息,之后它将开始覆盖旧的消息。即使使用多个磁盘和机器,这个比率也是一样的。实践中的部署很少能用满磁盘的写入带宽,所以通常可以保存一个几天甚至几周的日志缓冲区。
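As a worked check of the figures above (treating 6 TB as $6 \times 10^{12}$ bytes and 150 MB/s as $150 \times 10^{6}$ bytes per second, which is an assumption about the units intended):

$$
\frac{6 \times 10^{12}\ \text{B}}{150 \times 10^{6}\ \text{B/s}} = 4 \times 10^{4}\ \text{s} \approx 11.1\ \text{h}
$$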
不管保留多长时间的消息,日志的吞吐量或多或少保持不变,因为无论如何,每个消息都会被写入磁盘【18】。这种行为与默认将消息保存在内存中,仅当队列太长时才写入磁盘的消息传递系统形成鲜明对比。当队列很短时,这些系统非常快;而当这些系统开始写入磁盘时,就要慢的多,所以吞吐量取决于保留的历史数量。
#### 当消费者跟不上生产者时
在 “[消息传递系统](#消息传递系统)” 中,如果消费者无法跟上生产者发送信息的速度时,我们讨论了三种选择:丢弃信息,进行缓冲或施加背压。在这种分类法里,基于日志的方法是缓冲的一种形式,具有很大但大小固定的缓冲区(受可用磁盘空间的限制)。
如果消费者远远落后,而所要求的信息比保留在磁盘上的信息还要旧,那么它将不能读取这些信息,所以代理实际上丢弃了比缓冲区容量更大的旧信息。你可以监控消费者落后日志头部的距离,如果落后太多就发出报警。由于缓冲区很大,因而有足够的时间让运维人员来修复慢消费者,并在消息开始丢失之前让其赶上。
即使消费者真的落后太多开始丢失消息,也只有那个消费者受到影响;它不会中断其他消费者的服务。这是一个巨大的运维优势:你可以实验性地消费生产日志,以进行开发,测试或调试,而不必担心会中断生产服务。当消费者关闭或崩溃时,会停止消耗资源,唯一剩下的只有消费者偏移量。
这种行为也与传统的消息代理形成了鲜明对比,在那种情况下,你需要小心地删除那些消费者已经关闭的队列 —— 否则那些队列就会累积不必要的消息,从其他仍活跃的消费者那里占走内存。
#### 重播旧消息
我们之前提到,使用 AMQP 和 JMS 风格的消息代理,处理和确认消息是一个破坏性的操作,因为它会导致消息在代理上被删除。另一方面,在基于日志的消息代理中,使用消息更像是从文件中读取数据:这是只读操作,不会更改日志。
除了消费者的任何输出之外,处理的唯一副作用是消费者偏移量的前进。但偏移量是在消费者的控制之下的,所以如果需要的话可以很容易地操纵:例如你可以用昨天的偏移量跑一个消费者副本,并将输出写到不同的位置,以便重新处理最近一天的消息。你可以使用各种不同的处理代码重复任意次。
这一方面使得基于日志的消息传递更像上一章的批处理,其中衍生数据通过可重复的转换过程与输入数据显式分离。它允许进行更多的实验,更容易从错误和漏洞中恢复,使其成为在组织内集成数据流的良好工具【24】。
## 数据库与流
我们已经在消息代理和数据库之间进行了一些比较。尽管传统上它们被视为单独的工具类别,但是我们看到基于日志的消息代理已经成功地从数据库中获取灵感并将其应用于消息传递。我们也可以反过来:从消息传递和流中获取灵感,并将它们应用于数据库。
我们之前曾经说过,事件是某个时刻发生的事情的记录。发生的事情可能是用户操作(例如键入搜索查询)或读取传感器,但也可能是 **写入数据库**。某些东西被写入数据库的事实是可以被捕获、存储和处理的事件。这一观察结果表明,数据库和数据流之间的联系不仅仅是磁盘日志的物理存储 —— 而是更深层的联系。
事实上,复制日志(请参阅 “[复制日志的实现](/v1/ch5#复制日志的实现)”)是一个由数据库写入事件组成的流,由主库在处理事务时生成。从库将写入流应用到它们自己的数据库副本,从而最终得到相同数据的精确副本。复制日志中的事件描述发生的数据更改。
我们还在 “[全序广播](/v1/ch9#全序广播)” 中遇到了状态机复制原理,其中指出:如果每个事件代表对数据库的写入,并且每个副本按相同的顺序处理相同的事件,则副本将达到相同的最终状态 (假设事件处理是一个确定性的操作)。这是事件流的又一种场景!
在本节中,我们将首先看看异构数据系统中出现的一个问题,然后探讨如何通过将事件流的想法带入数据库来解决这个问题。
### 保持系统同步
正如我们在本书中所看到的,没有一个系统能够满足所有的数据存储、查询和处理需求。在实践中,大多数重要应用都需要组合使用几种不同的技术来满足所有的需求:例如,使用 OLTP 数据库来为用户请求提供服务,使用缓存来加速常见请求,使用全文索引来处理搜索查询,使用数据仓库用于分析。每一种技术都有自己的数据副本,并根据自己的目的进行存储方式的优化。
由于相同或相关的数据出现在了不同的地方,因此相互间需要保持同步:如果某个项目在数据库中被更新,它也应当在缓存、搜索索引和数据仓库中被更新。对于数据仓库,这种同步通常由 ETL 进程执行(请参阅 “[数据仓库](/v1/ch3#数据仓库)”),通常是先取得数据库的完整副本,然后执行转换,并批量加载到数据仓库中 —— 换句话说,批处理。我们在 “[批处理工作流的输出](/v1/ch10#批处理工作流的输出)” 中同样看到了如何使用批处理创建搜索索引、推荐系统和其他衍生数据系统。
如果周期性的完整数据库转储过于缓慢,有时会使用的替代方法是 **双写(dual write)**,其中应用代码在数据变更时明确写入每个系统:例如,首先写入数据库,然后更新搜索索引,然后使缓存项失效(甚至同时执行这些写入)。
但是,双写有一些严重的问题,其中一个是竞争条件,如 [图 11-4](/v1/ddia_1104.png) 所示。在这个例子中,两个客户端同时想要更新一个项目 X:客户端 1 想要将值设置为 A,客户端 2 想要将其设置为 B。两个客户端首先将新值写入数据库,然后将其写入到搜索索引。因为运气不好,这些请求的时序是交错的:数据库首先看到来自客户端 1 的写入将值设置为 A,然后来自客户端 2 的写入将值设置为 B,因此数据库中的最终值为 B。搜索索引首先看到来自客户端 2 的写入,然后是客户端 1 的写入,所以搜索索引中的最终值是 A。即使没发生错误,这两个系统现在也永久地不一致了。

**图 11-4 在数据库中 X 首先被设置为 A,然后被设置为 B,而在搜索索引处,写入以相反的顺序到达**
除非有一些额外的并发检测机制,例如我们在 “[检测并发写入](/v1/ch5#检测并发写入)” 中讨论的版本向量,否则你甚至不会意识到发生了并发写入 —— 一个值将简单地以无提示方式覆盖另一个值。
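The race in Figure 11-4 can be reproduced in a few lines. This is a sketch under the assumption that both systems simply keep the last value they receive; the function `apply_writes` and the client labels are hypothetical.

```python
def apply_writes(writes):
    """Last write wins: return the value left after applying writes in order."""
    value = None
    for _client, new_value in writes:
        value = new_value
    return value

# The database sees client 1 first; the search index sees client 2 first.
database_order = [("client1", "A"), ("client2", "B")]
index_order = [("client2", "B"), ("client1", "A")]

database_value = apply_writes(database_order)
index_value = apply_writes(index_order)
```

Neither write failed, yet the two systems end up with different values for X, and nothing in either system signals that a conflict occurred.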
双重写入的另一个问题是,其中一个写入可能会失败,而另一个成功。这是一个容错问题,而不是一个并发问题,但也会造成两个系统互相不一致的结果。确保它们要么都成功要么都失败,是原子提交问题的一个例子,解决这个问题的代价是昂贵的(请参阅 “[原子提交与两阶段提交](/v1/ch9#原子提交与两阶段提交)”)。
如果你只有一个单领导者复制的数据库,那么这个领导者决定了写入顺序,而状态机复制方法可以在数据库副本上工作。然而,在 [图 11-4](/v1/ddia_1104.png) 中,没有单个主库:数据库可能有一个领导者,搜索索引也可能有一个领导者,但是两者都不追随对方,所以可能会发生冲突(请参阅 “[多主复制](/v1/ch5#多主复制)”)。
如果实际上只有一个领导者 —— 例如,数据库 —— 而且我们能让搜索索引成为数据库的追随者,情况要好得多。但这在实践中可能吗?
### 变更数据捕获
大多数数据库的复制日志的问题在于,它们一直被当做数据库的内部实现细节,而不是公开的 API。客户端应该通过其数据模型和查询语言来查询数据库,而不是解析复制日志并尝试从中提取数据。
数十年来,许多数据库根本没有记录在档的获取变更日志的方式。由于这个原因,捕获数据库中所有的变更,然后将其复制到其他存储技术(搜索索引、缓存或数据仓库)中是相当困难的。
最近,人们对 **变更数据捕获(change data capture, CDC)** 越来越感兴趣,这是一种观察写入数据库的所有数据变更,并将其提取并转换为可以复制到其他系统中的形式的过程。CDC 是非常有意思的,尤其是当变更能在被写入后立刻用于流时。
例如,你可以捕获数据库中的变更,并不断将相同的变更应用至搜索索引。如果变更日志以相同的顺序应用,则可以预期搜索索引中的数据与数据库中的数据是匹配的。搜索索引和任何其他衍生数据系统只是变更流的消费者,如 [图 11-5](/v1/ddia_1105.png) 所示。

**图 11-5 将数据按顺序写入一个数据库,然后按照相同的顺序将这些更改应用到其他系统**
#### 变更数据捕获的实现
我们可以将日志消费者叫做 **衍生数据系统**,正如在 [第三部分](/v1/part-iii) 的介绍中所讨论的:存储在搜索索引和数据仓库中的数据,只是 **记录系统** 数据的额外视图。变更数据捕获是一种机制,可确保对记录系统所做的所有更改都反映在衍生数据系统中,以便衍生系统具有数据的准确副本。
从本质上说,变更数据捕获使得一个数据库成为领导者(被捕获变化的数据库),并将其他组件变为追随者。基于日志的消息代理非常适合从源数据库传输变更事件,因为它保留了消息的顺序(避免了 [图 11-2](/v1/ddia_1102.png) 的重新排序问题)。
数据库触发器可用来实现变更数据捕获(请参阅 “[基于触发器的复制](/v1/ch5#基于触发器的复制)”),通过注册观察所有变更的触发器,并将相应的变更项写入变更日志表中。但是它们往往是脆弱的,而且有显著的性能开销。解析复制日志可能是一种更稳健的方法,但它也很有挑战,例如如何应对模式变更。
LinkedIn 的 Databus【25】,Facebook 的 Wormhole【26】和 Yahoo! 的 Sherpa【27】大规模地应用这个思路。Bottled Water 使用解码 WAL 的 API 实现了 PostgreSQL 的 CDC【28】,Maxwell 和 Debezium 通过解析 binlog 对 MySQL 做了类似的事情【29,30,31】,Mongoriver 读取 MongoDB oplog【32,33】,而 GoldenGate 为 Oracle 提供类似的功能【34,35】。
类似于消息代理,变更数据捕获通常是异步的:记录数据库系统在提交变更之前不会等待消费者应用变更。这种设计具有的运维优势是,添加缓慢的消费者不会过度影响记录系统。不过,所有复制延迟可能有的问题在这里都可能出现(请参阅 “[复制延迟问题](/v1/ch5#复制延迟问题)”)。
#### 初始快照
如果你拥有 **所有** 对数据库进行变更的日志,则可以通过重播该日志,来重建数据库的完整状态。但是在许多情况下,永远保留所有更改会耗费太多磁盘空间,且重播过于费时,因此日志需要被截断。
例如,构建新的全文索引需要整个数据库的完整副本 —— 仅仅应用最近变更的日志是不够的,因为这样会丢失最近未曾更新的项目。因此,如果你没有完整的历史日志,则需要从一个一致的快照开始,如先前的 “[设置新从库](/v1/ch5#设置新从库)” 中所述。
数据库的快照必须与变更日志中的已知位置或偏移量相对应,以便在处理完快照后知道从哪里开始应用变更。一些 CDC 工具集成了这种快照功能,而其他工具则把它留给你手动执行。
#### 日志压缩
如果你只能保留有限的历史日志,则每次要添加新的衍生数据系统时,都需要做一次快照。但 **日志压缩(log compaction)** 提供了一个很好的备选方案。
我们之前在 “[散列索引](/v1/ch3#散列索引)” 中关于日志结构存储引擎的上下文中讨论了日志压缩(请参阅 [图 3-2](/v1/ddia_0302.png) 的示例)。原理很简单:存储引擎定期在日志中查找具有相同键的记录,丢掉所有重复的内容,并只保留每个键的最新更新。这个压缩与合并过程在后台运行。
在日志结构存储引擎中,具有特殊值 NULL(**墓碑**,即 tombstone)的更新表示该键被删除,并会在日志压缩过程中被移除。但只要键不被覆盖或删除,它就会永远留在日志中。这种压缩日志所需的磁盘空间仅取决于数据库的当前内容,而不取决于数据库中曾经发生的写入次数。如果相同的键经常被覆盖写入,则先前的值将最终将被垃圾回收,只有最新的值会保留下来。
在基于日志的消息代理与变更数据捕获的上下文中也适用相同的想法。如果 CDC 系统被配置为,每个变更都包含一个主键,且每个键的更新都替换了该键以前的值,那么只需要保留对键的最新写入就足够了。
现在,无论何时需要重建衍生数据系统(如搜索索引),你可以从压缩日志主题的零偏移量处启动新的消费者,然后依次扫描日志中的所有消息。日志能保证包含数据库中每个键的最新值(也可能是一些较旧的值)—— 换句话说,你可以使用它来获取数据库内容的完整副本,而无需从 CDC 源数据库取一个快照。
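A minimal sketch of the compaction rule described in this section, assuming the log is a list of `(key, value)` pairs and that `None` stands for a tombstone. The function name `compact` is hypothetical, and real brokers compact segment by segment in the background rather than in one pass.

```python
def compact(log):
    """Keep only the most recent record per key and drop tombstoned keys."""
    latest = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)          # later records replace earlier ones
    survivors = sorted(latest.items(), key=lambda item: item[1][0])
    # Preserve log order among survivors; a value of None means "deleted".
    return [(key, value) for key, (_offset, value) in survivors if value is not None]

log = [("a", 1), ("b", 2), ("a", 3), ("b", None), ("c", 4)]
compacted = compact(log)
```

The compacted log holds one record for each live key, so its size depends on the current contents of the database rather than on the number of writes ever made.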
Apache Kafka 支持这种日志压缩功能。正如我们将在本章后面看到的,它允许消息代理被当成持久性存储使用,而不仅仅是用于临时消息。
#### 变更流的API支持
越来越多的数据库开始将变更流作为第一等的接口,而不像传统上要去做加装改造,或者费工夫逆向工程一个 CDC。例如,RethinkDB 允许查询订阅通知,当查询结果变更时获得通知【36】,Firebase 【37】和 CouchDB 【38】基于变更流进行同步,该变更流同样可用于应用。而 Meteor 使用 MongoDB oplog 订阅数据变更,并改变了用户接口【39】。
VoltDB 允许事务以流的形式连续地从数据库中导出数据【40】。数据库将关系数据模型中的输出流表示为一个表,事务可以向其中插入元组,但不能查询。已提交事务按照提交顺序写入这个特殊表,而流则由该表中的元组日志构成。外部消费者可以异步消费该日志,并使用它来更新衍生数据系统。
Kafka Connect【41】致力于将广泛的数据库系统的变更数据捕获工具与 Kafka 集成。一旦变更事件进入 Kafka 中,它就可以用于更新衍生数据系统,比如搜索索引,也可以用于本章稍后讨论的流处理系统。
### 事件溯源
我们在这里讨论的想法和 **事件溯源(Event Sourcing)** 之间有一些相似之处,这是一个在 **领域驱动设计(domain-driven design, DDD)** 社区中折腾出来的技术。我们将简要讨论事件溯源,因为它包含了一些关于流处理系统的有用想法。
与变更数据捕获类似,事件溯源涉及到 **将所有对应用状态的变更** 存储为变更事件日志。最大的区别是事件溯源将这一想法应用到了一个不同的抽象层次上:
* 在变更数据捕获中,应用以 **可变方式(mutable way)** 使用数据库,可以任意更新和删除记录。变更日志是从数据库的底层提取的(例如,通过解析复制日志),从而确保从数据库中提取的写入顺序与实际写入的顺序相匹配,从而避免 [图 11-4](/v1/ddia_1104.png) 中的竞态条件。写入数据库的应用不需要知道 CDC 的存在。
* 在事件溯源中,应用逻辑显式构建在写入事件日志的不可变事件之上。在这种情况下,事件存储是仅追加写入的,更新与删除是不鼓励的或禁止的。事件被设计为旨在反映应用层面发生的事情,而不是底层的状态变更。
事件溯源是一种强大的数据建模技术:从应用的角度来看,将用户的行为记录为不可变的事件更有意义,而不是在可变数据库中记录这些行为的影响。事件溯源使得应用随时间演化更为容易,通过更容易理解事情发生的原因来帮助调试的进行,并有利于防止应用 Bug(请参阅 “[不可变事件的优点](#不可变事件的优点)”)。
例如,存储 “学生取消选课” 事件以中性的方式清楚地表达了单个行为的意图,而其副作用 “从登记表中删除了一个条目,而一条取消原因的记录被添加到学生反馈表” 则嵌入了很多有关稍后对数据的使用方式的假设。如果引入一个新的应用功能,例如 “将位置留给等待列表中的下一个人” —— 事件溯源方法允许将新的副作用轻松地从现有事件中脱开。
事件溯源类似于 **编年史(chronicle)** 数据模型【45】,事件日志与星型模式中的事实表之间也存在相似之处(请参阅 “[星型和雪花型:分析的模式](/v1/ch3#星型和雪花型:分析的模式)”) 。
诸如 Event Store【46】这样的专业数据库已经被开发出来,供使用事件溯源的应用使用,但总的来说,这种方法独立于任何特定的工具。传统的数据库或基于日志的消息代理也可以用来构建这种风格的应用。
#### 从事件日志中派生出当前状态
事件日志本身并不是很有用,因为用户通常期望看到的是系统的当前状态,而不是变更历史。例如,在购物网站上,用户期望能看到他们购物车里的当前内容,而不是他们购物车所有变更的一个仅追加列表。
因此,使用事件溯源的应用需要拉取事件日志(表示 **写入** 系统的数据),并将其转换为适合向用户显示的应用状态(从系统 **读取** 数据的方式【47】)。这种转换可以使用任意逻辑,但它应当是确定性的,以便能再次运行,并从事件日志中衍生出相同的应用状态。
与变更数据捕获一样,重播事件日志允许让你重新构建系统的当前状态。不过,日志压缩需要采用不同的方式处理:
* 用于记录更新的 CDC 事件通常包含记录的 **完整新版本**,因此主键的当前值完全由该主键的最近事件确定,而日志压缩可以丢弃相同主键的先前事件。
* 另一方面,事件溯源在更高层次进行建模:事件通常表示用户操作的意图,而不是因为操作而发生的状态更新机制。在这种情况下,后面的事件通常不会覆盖先前的事件,所以你需要完整的历史事件来重新构建最终状态。这里进行同样的日志压缩是不可能的。
使用事件溯源的应用通常有一些机制,用于存储从事件日志中导出的当前状态快照,因此它们不需要重复处理完整的日志。然而这只是一种性能优化,用来加速读取,提高从崩溃中恢复的速度;真正的目的是系统能够永久存储所有原始事件,并在需要时重新处理完整的事件日志。我们将在 “[不变性的局限性](#不变性的局限性)” 中讨论这个假设。
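Deriving current state from an event log amounts to a deterministic fold over the events. The sketch below uses the shopping cart example from this section; the event shape `(kind, item, quantity)` and the function `cart_state` are assumptions for illustration.

```python
from collections import Counter

def cart_state(events):
    """Replay cart events in order and return the current cart contents."""
    cart = Counter()
    for kind, item, quantity in events:
        if kind == "added":
            cart[item] += quantity
        elif kind == "removed":
            cart[item] -= quantity
            if cart[item] <= 0:
                del cart[item]        # item no longer in the cart
    return dict(cart)

events = [("added", "book", 1), ("added", "pen", 2), ("removed", "book", 1)]
state = cart_state(events)
```

Because the function depends only on its input, replaying the same log always yields the same state, and a stored snapshot is just a cached result of this fold up to some offset.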
#### 命令和事件
事件溯源的哲学是仔细区分 **事件(event)** 和 **命令(command)**【48】。当来自用户的请求刚到达时,它一开始是一个命令:在这个时间点上它仍然可能失败,比如,因为违反了一些完整性条件。应用必须首先验证它是否可以执行该命令。如果验证成功并且命令被接受,则它变为一个持久化且不可变的事件。
例如,如果用户试图注册特定用户名,或预定飞机或剧院的座位,则应用需要检查用户名或座位是否已被占用。(先前在 “[容错共识](/v1/ch9#容错共识)” 中讨论过这个例子)当检查成功时,应用可以生成一个事件,指示特定的用户名是由特定的用户 ID 注册的,或者座位已经预留给特定的顾客。
在事件生成的时刻,它就成为了 **事实(fact)**。即使客户稍后决定更改或取消预订,他们之前曾预定了某个特定座位的事实仍然成立,而更改或取消是之后添加的单独的事件。
事件流的消费者不允许拒绝事件:当消费者看到事件时,它已经成为日志中不可变的一部分,并且可能已经被其他消费者看到了。因此任何对命令的验证,都需要在它成为事件之前同步完成。例如,通过使用一个可以原子性地自动验证命令并发布事件的可串行事务。
或者,预订座位的用户请求可以拆分为两个事件:第一个是暂时预约,第二个是验证预约后的独立的确认事件(如 “[使用全序广播实现线性一致的存储](/v1/ch9#使用全序广播实现线性一致的存储)” 中所述) 。这种分割方式允许验证发生在一个异步的过程中。
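A sketch of the command and event distinction for the seat reservation example. The class `SeatBooking` is hypothetical, and running `handle` in a single thread stands in for the serializable transaction mentioned above; it is not a substitute for one in a distributed setting.

```python
class SeatBooking:
    def __init__(self):
        self.events = []   # append-only log of accepted facts
        self.taken = {}    # derived state: seat -> customer

    def handle(self, command):
        """Validate a ("reserve", seat, customer) command before logging it."""
        _kind, seat, customer = command
        if seat in self.taken:
            return False   # command rejected: nothing is written to the log
        self.events.append(("seat_reserved", seat, customer))
        self.taken[seat] = customer
        return True

booking = SeatBooking()
first = booking.handle(("reserve", "12A", "alice"))
second = booking.handle(("reserve", "12A", "bob"))
```

Only the accepted command becomes an event, so consumers of `booking.events` never see a reservation they would have to reject.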
### 状态、流和不变性
我们在 [第十章](/v1/ch10) 中看到,批处理因其输入文件不变性而受益良多,你可以在现有输入文件上运行实验性处理作业,而不用担心损坏它们。这种不变性原则也是使得事件溯源与变更数据捕获如此强大的原因。
我们通常将数据库视为应用程序当前状态的存储 —— 这种表示针对读取进行了优化,而且通常对于服务查询而言是最为方便的表示。状态的本质是,它会变化,所以数据库才会支持数据的增删改。这又该如何匹配不变性呢?
只要你的状态发生了变化,那么这个状态就是这段时间中事件修改的结果。例如,当前可用的座位列表是你已处理的预订所产生的结果,当前帐户余额是帐户中的借与贷的结果,而 Web 服务器的响应时间图,是所有已发生 Web 请求的独立响应时间的聚合结果。
无论状态如何变化,总是有一系列事件导致了这些变化。即使事情已经执行与回滚,这些事件出现是始终成立的。关键的想法是:可变的状态与不可变事件的仅追加日志相互之间并不矛盾:它们是一体两面,互为阴阳的。所有变化的日志 —— **变化日志(changelog)**,表示了随时间演变的状态。
如果你倾向于数学表示,那么你可能会说,应用状态是事件流对时间求积分得到的结果,而变更流是状态对时间求微分的结果,如 [图 11-6](/v1/ddia_1106.png) 所示【49,50,51】。这个比喻有一些局限性(例如,状态的二阶导似乎没有意义),但这是考虑数据的一个实用出发点。
$$
\begin{aligned}
state(now) &= \int_{t=0}^{now}{stream(t) \ dt} \\
stream(t) &= \frac{d\ state(t)}{dt}
\end{aligned}
$$

**图 11-6 应用当前状态与事件流之间的关系**
如果你持久存储了变更日志,那么重现状态就非常简单。如果你认为事件日志是你的记录系统,而所有的衍生状态都从它派生而来,那么系统中的数据流动就容易理解的多。正如帕特・赫兰(Pat Helland)所说的【52】:
> 事务日志记录了数据库的所有变更。高速追加是更改日志的唯一方法。从这个角度来看,数据库的内容其实是日志中记录最新值的缓存。日志才是真相,数据库是日志子集的缓存,这一缓存子集恰好来自日志中每条记录与索引值的最新值。
日志压缩(如 “[日志压缩](#日志压缩)” 中所述)是连接日志与数据库状态之间的桥梁:它只保留每条记录的最新版本,并丢弃被覆盖的版本。
#### 不可变事件的优点
数据库中的不变性是一个古老的概念。例如,会计在几个世纪以来一直在财务记账中应用不变性。一笔交易发生时,它被记录在一个仅追加写入的分类帐中,实质上是描述货币、商品或服务转手的事件日志。账目,比如利润、亏损、资产负债表,是从分类账中的交易求和衍生而来【53】。
如果发生错误,会计师不会删除或更改分类帐中的错误交易 —— 而是添加另一笔交易以补偿错误,例如退还一笔不正确的费用。不正确的交易将永远保留在分类帐中,对于审计而言可能非常重要。如果从不正确的分类账衍生出的错误数字已经公布,那么下一个会计周期的数字就会包括一个更正。这个过程在会计事务中是很常见的【54】。
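Although such auditability is particularly important in financial systems, it is also beneficial for many other systems that are not subject to such strict regulation. As discussed in “[批处理输出的哲学](/v1/ch10#批处理输出的哲学)”, if you accidentally deploy buggy code that writes bad data to a database, recovery is much harder if the code is able to destructively overwrite data. With an append-only log of immutable events, it is much easier to diagnose what happened and recover from the problem.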
不可变的事件也包含了比当前状态更多的信息。例如在购物网站上,顾客可以将物品添加到他们的购物车,然后再将其移除。虽然从履行订单的角度,第二个事件取消了第一个事件,但对分析目的而言,知道客户考虑过某个特定项而之后又反悔,可能是很有用的。也许他们会选择在未来购买,或者他们已经找到了替代品。这个信息被记录在事件日志中,但对于移出购物车就删除记录的数据库而言,这个信息在移出购物车时可能就丢失了【42】。
#### 从同一事件日志中派生多个视图
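Moreover, by separating mutable state from the immutable event log, you can derive several different read-oriented representations from the same log of events. This works just like having multiple consumers of a stream ([图 11-5](/v1/ddia_1105.png)): for example, the analytic database Druid ingests data directly from Kafka using this approach 【55】, Pistachio is a distributed key-value store that uses Kafka as a commit log 【56】, and Kafka Connect can export data from Kafka to various different databases and indexes 【41】. It would make sense for many other storage and indexing systems, such as search servers, to similarly take their input from a distributed log (see “[保持系统同步](#保持系统同步)”).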
添加从事件日志到数据库的显式转换,能够使应用更容易地随时间演进:如果你想要引入一个新功能,以新的方式表示现有数据,则可以使用事件日志来构建一个单独的、针对新功能的读取优化视图,无需修改现有系统而与之共存。并行运行新旧系统通常比在现有系统中执行复杂的模式迁移更容易。一旦不再需要旧的系统,你可以简单地关闭它并回收其资源【47,57】。
如果你不需要担心如何查询与访问数据,那么存储数据通常是非常简单的。模式设计、索引和存储引擎的许多复杂性,都是希望支持某些特定查询和访问模式的结果(请参阅 [第三章](/v1/ch3))。出于这个原因,通过将数据写入的形式与读取形式相分离,并允许几个不同的读取视图,你能获得很大的灵活性。这个想法有时被称为 **命令查询责任分离(command query responsibility segregation, CQRS)**【42,58,59】。
数据库和模式设计的传统方法是基于这样一种谬论,数据必须以与查询相同的形式写入。如果可以将数据从针对写入优化的事件日志转换为针对读取优化的应用状态,那么有关规范化和非规范化的争论就变得无关紧要了(请参阅 “[多对一和多对多的关系](/v1/ch2#多对一和多对多的关系)”):在针对读取优化的视图中对数据进行非规范化是完全合理的,因为翻译过程提供了使其与事件日志保持一致的机制。
在 “[描述负载](/v1/ch1#描述负载)” 中,我们讨论了推特主页时间线,它是特定用户关注的人群所发推特的缓存(类似邮箱)。这是 **针对读取优化的状态** 的又一个例子:主页时间线是高度非规范化的,因为你的推文与你所有粉丝的时间线都构成了重复。然而,扇出服务保持了这种重复状态与新推特以及新关注关系的同步,从而保证了重复的可管理性。
#### 并发控制
事件溯源和变更数据捕获的最大缺点是,事件日志的消费者通常是异步的,所以可能会出现这样的情况:用户会写入日志,然后从日志衍生视图中读取,结果发现他的写入还没有反映在读取视图中。我们之前在 “[读己之写](/v1/ch5#读己之写)” 中讨论了这个问题以及可能的解决方案。
一种解决方案是将事件追加到日志时同步执行读取视图的更新。而将这些写入操作合并为一个原子单元需要 **事务**,所以要么将事件日志和读取视图保存在同一个存储系统中,要么就需要跨不同系统进行分布式事务。或者,你也可以使用在 “[使用全序广播实现线性一致的存储](/v1/ch9#使用全序广播实现线性一致的存储)” 中讨论的方法。
另一方面,从事件日志导出当前状态也简化了并发控制的某些部分。许多对于多对象事务的需求(请参阅 “[单对象和多对象操作](/v1/ch7#单对象和多对象操作)”)源于单个用户操作需要在多个不同的位置更改数据。通过事件溯源,你可以设计一个自包含的事件以表示一个用户操作。然后用户操作就只需要在一个地方进行单次写入操作 —— 即将事件附加到日志中 —— 这个还是很容易使原子化的。
如果事件日志与应用状态以相同的方式分区(例如,处理分区 3 中的客户事件只需要更新分区 3 中的应用状态),那么直接使用单线程日志消费者就不需要写入并发控制了。它从设计上一次只处理一个事件(请参阅 “[真的串行执行](/v1/ch7#真的串行执行)”)。日志通过在分区中定义事件的序列顺序,消除了并发性的不确定性【24】。如果一个事件触及多个状态分区,那么需要做更多的工作,我们将在 [第十二章](/v1/ch12) 讨论。
#### 不变性的局限性
许多不使用事件溯源模型的系统也还是依赖不可变性:各种数据库在内部使用不可变的数据结构或多版本数据来支持时间点快照(请参阅 “[索引和快照隔离](/v1/ch7#索引和快照隔离)” )。Git、Mercurial 和 Fossil 等版本控制系统也依靠不可变的数据来保存文件的版本历史记录。
永远保持所有变更的不变历史,在多大程度上是可行的?答案取决于数据集的流失率。一些工作负载主要是添加数据,很少更新或删除;它们很容易保持不变。其他工作负载在相对较小的数据集上有较高的更新 / 删除率;在这些情况下,不可变的历史可能增至难以接受的巨大,碎片化可能成为一个问题,压缩与垃圾收集的表现对于运维的稳健性变得至关重要【60,61】。
除了性能方面的原因外,也可能有出于管理方面的原因需要删除数据的情况,尽管这些数据都是不可变的。例如,隐私条例可能要求在用户关闭帐户后删除他们的个人信息,数据保护立法可能要求删除错误的信息,或者可能需要阻止敏感信息的意外泄露。
在这种情况下,仅仅在日志中添加另一个事件来指明先前的数据应该被视为删除是不够的 —— 你实际上是想改写历史,并假装数据从一开始就没有写入。例如,Datomic 管这个特性叫 **切除(excision)** 【62】,而 Fossil 版本控制系统有一个类似的概念叫 **避免(shunning)** 【63】。
真正删除数据是非常非常困难的【64】,因为副本可能存在于很多地方:例如,存储引擎,文件系统和 SSD 通常会向一个新位置写入,而不是原地覆盖旧数据【52】,而备份通常是特意做成不可变的,防止意外删除或损坏。删除操作更多的是指 “使取回数据更困难”,而不是指 “使取回数据不可能”。无论如何,有时你必须得尝试,正如我们在 “[立法与自律](/v1/ch12#立法与自律)” 中所看到的。
## 流处理
到目前为止,本章中我们已经讨论了流的来源(用户活动事件,传感器和写入数据库),我们讨论了流如何传输(直接通过消息传送,通过消息代理,通过事件日志)。
剩下的就是讨论一下你可以用流做什么 —— 也就是说,你可以处理它。一般来说,有三种选项:
1. 你可以将事件中的数据写入数据库、缓存、搜索索引或类似的存储系统,然后能被其他客户端查询。如 [图 11-5](/v1/ddia_1105.png) 所示,这是数据库与系统其他部分所发生的变更保持同步的好方法 —— 特别是当流消费者是写入数据库的唯一客户端时。如 “[批处理工作流的输出](/v1/ch10#批处理工作流的输出)” 中所讨论的,它是写入存储系统的流等价物。
2. 你能以某种方式将事件推送给用户,例如发送报警邮件或推送通知,或将事件流式传输到可实时显示的仪表板上。在这种情况下,人是流的最终消费者。
3. 你可以处理一个或多个输入流,并产生一个或多个输出流。流可能会经过由几个这样的处理阶段组成的流水线,最后再输出(选项 1 或 2)。
在本章的剩余部分中,我们将讨论选项 3:处理流以产生其他衍生流。处理这样的流的代码片段,被称为 **算子(operator)** 或 **作业(job)**。它与我们在 [第十章](/v1/ch10) 中讨论过的 Unix 进程和 MapReduce 作业密切相关,数据流的模式是相似的:一个流处理器以只读的方式使用输入流,并将其输出以仅追加的方式写入一个不同的位置。
流处理中的分区和并行化模式也非常类似于 [第十章](/v1/ch10) 中介绍的 MapReduce 和数据流引擎,因此我们不再重复这些主题。基本的 Map 操作(如转换和过滤记录)也是一样的。
与批量作业相比的一个关键区别是,流不会结束。这种差异会带来很多隐含的结果。正如本章开始部分所讨论的,排序对无界数据集没有意义,因此无法使用 **排序合并连接**(请参阅 “[Reduce 侧连接与分组](/v1/ch10#Reduce侧连接与分组)”)。容错机制也必须改变:对于已经运行了几分钟的批处理作业,可以简单地从头开始重启失败任务,但是对于已经运行数年的流作业,重启后从头开始跑可能并不是一个可行的选项。
### 流处理的应用
长期以来,流处理一直用于监控目的,如果某个事件发生,组织希望能得到警报。例如:
* 欺诈检测系统需要确定信用卡的使用模式是否有意外地变化,如果信用卡可能已被盗刷,则锁卡。
* 交易系统需要检查金融市场的价格变化,并根据指定的规则进行交易。
* 制造系统需要监控工厂中机器的状态,如果出现故障,可以快速定位问题。
* 军事和情报系统需要跟踪潜在侵略者的活动,并在出现袭击征兆时发出警报。
这些类型的应用需要非常精密复杂的模式匹配与相关检测。然而随着时代的进步,流处理的其他用途也开始出现。在本节中,我们将简要比较一下这些应用。
#### 复合事件处理
**复合事件处理(complex event processing, CEP)** 是 20 世纪 90 年代为分析事件流而开发出的一种方法,尤其适用于需要搜索某些事件模式的应用【65,66】。与正则表达式允许你在字符串中搜索特定字符模式的方式类似,CEP 允许你指定规则以在流中搜索某些事件模式。
CEP 系统通常使用高层次的声明式查询语言,比如 SQL,或者图形用户界面,来描述应该检测到的事件模式。这些查询被提交给处理引擎,该引擎消费输入流,并在内部维护一个执行所需匹配的状态机。当发现匹配时,引擎发出一个 **复合事件**(即 complex event,CEP 因此得名),并附有检测到的事件模式详情【67】。
在这些系统中,查询和数据之间的关系与普通数据库相比是颠倒的。通常情况下,数据库会持久存储数据,并将查询视为临时的:当查询进入时,数据库搜索与查询匹配的数据,然后在查询完成时丢掉查询。CEP 引擎反转了角色:查询是长期存储的,来自输入流的事件不断流过它们,搜索匹配事件模式的查询【68】。
CEP 的实现包括 Esper【69】、IBM InfoSphere Streams【70】、Apama、TIBCO StreamBase 和 SQLstream。像 Samza 这样的分布式流处理组件,支持使用 SQL 在流上进行声明式查询【71】。
#### 流分析
使用流处理的另一个领域是对流进行分析。CEP 与流分析之间的边界是模糊的,但一般来说,分析往往对找出特定事件序列并不关心,而更关注大量事件上的聚合与统计指标 —— 例如:
* 测量某种类型事件的速率(每个时间间隔内发生的频率)
* 滚动计算一段时间窗口内某个值的平均值
* 将当前的统计值与先前的时间区间的值对比(例如,检测趋势,当指标与上周同比异常偏高或偏低时报警)
这些统计值通常是在固定时间区间内进行计算的,例如,你可能想知道在过去 5 分钟内服务每秒查询次数的均值,以及此时间段内响应时间的第 99 百分位点。在几分钟内取平均,能抹平秒和秒之间的无关波动,且仍然能向你展示流量模式的时间图景。聚合的时间间隔称为 **窗口(window)**,我们将在 “[时间推理](#时间推理)” 中更详细地讨论窗口。
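Stream analytics systems sometimes use probabilistic algorithms, such as Bloom filters (which we encountered in “[性能优化](/v1/ch3#性能优化)”) for set membership, HyperLogLog 【72】 for cardinality estimation, and various percentile estimation algorithms (see “[实践中的百分位点](/v1/ch1#实践中的百分位点)”). Probabilistic algorithms produce approximate results, but have the advantage of requiring significantly less memory than exact algorithms. This use of approximation algorithms sometimes leads people to believe that stream processing systems are always lossy and inexact, but that is wrong: there is nothing inherently approximate about stream processing, and probabilistic algorithms are merely an optimization 【73】.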
许多开源分布式流处理框架的设计都是针对分析设计的:例如 Apache Storm、Spark Streaming、Flink、Concord、Samza 和 Kafka Streams 【74】。托管服务包括 Google Cloud Dataflow 和 Azure Stream Analytics。
#### 维护物化视图
我们在 “[数据库与流](#数据库与流)” 中看到,数据库的变更流可以用于维护衍生数据系统(如缓存、搜索索引和数据仓库),并使其与源数据库保持最新。我们可以将这些示例视作维护 **物化视图(materialized view)** 的一种具体场景(请参阅 “[聚合:数据立方体和物化视图](/v1/ch3#聚合:数据立方体和物化视图)”):在某个数据集上衍生出一个替代视图以便高效查询,并在底层数据变更时更新视图【50】。
同样,在事件溯源中,应用程序的状态是通过应用事件日志来维护的;这里的应用程序状态也是一种物化视图。与流分析场景不同的是,仅考虑某个时间窗口内的事件通常是不够的:构建物化视图可能需要任意时间段内的 **所有** 事件,除了那些可能由日志压缩丢弃的过时事件(请参阅 “[日志压缩](#日志压缩)”)。实际上,你需要一个可以一直延伸到时间开端的窗口。
原则上讲,任何流处理组件都可以用于维护物化视图,尽管 “永远运行” 与一些面向分析的框架假设的 “主要在有限时间段窗口上运行” 背道而驰,Samza 和 Kafka Streams 支持这种用法,建立在 Kafka 对日志压缩的支持上【75】。
#### 在流上搜索
除了允许搜索由多个事件构成模式的 CEP 外,有时也存在基于复杂标准(例如全文搜索查询)来搜索单个事件的需求。
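For example, media monitoring services subscribe to feeds of news articles and broadcasts from media outlets, and search for any news mentioning companies, products, or topics of interest. This is done by formulating a search query in advance and then continually matching the stream of news items against this query. Similar features exist on some websites: for example, users of real estate websites can ask to be notified when a new property matching their search criteria appears on the market. The percolator feature of Elasticsearch 【76】 is one option for implementing this kind of stream search.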
传统的搜索引擎首先索引文件,然后在索引上跑查询。相比之下,搜索一个数据流则反了过来:查询被存储下来,文档从查询中流过,就像在 CEP 中一样。最简单的情况就是,你可以为每个文档测试每个查询。但是如果你有大量查询,这可能会变慢。为了优化这个过程,可以像对文档一样,为查询建立索引。因而收窄可能匹配的查询集合【77】。
#### 消息传递和RPC
在 “[消息传递中的数据流](/v1/ch4#消息传递中的数据流)” 中我们讨论过,消息传递系统可以作为 RPC 的替代方案,即作为一种服务间通信的机制,比如在 Actor 模型中所使用的那样。尽管这些系统也是基于消息和事件,但我们通常不会将其视作流处理组件:
* Actor 框架主要是管理模块通信的并发和分布式执行的一种机制,而流处理主要是一种数据管理技术。
* Actor 之间的交流往往是短暂的、一对一的;而事件日志则是持久的、多订阅者的。
* Actor 可以以任意方式进行通信(包括循环的请求 / 响应模式),但流处理通常配置在无环流水线中,其中每个流都是一个特定作业的输出,由良好定义的输入流中派生而来。
也就是说,RPC 类系统与流处理之间有一些交叉领域。例如,Apache Storm 有一个称为 **分布式 RPC** 的功能,它允许将用户查询分散到一系列也处理事件流的节点上;然后这些查询与来自输入流的事件交织,而结果可以被汇总并发回给用户【78】(另请参阅 “[多分区数据处理](/v1/ch12#多分区数据处理)”)。
也可以使用 Actor 框架来处理流。但是,很多这样的框架在崩溃时不能保证消息的传递,除非你实现了额外的重试逻辑,否则这种处理不是容错的。
### 时间推理
流处理通常需要与时间打交道,尤其是用于分析目的时候,会频繁使用时间窗口,例如 “过去五分钟的平均值”。“过去五分钟” 的含义看上去似乎是清晰而无歧义的,但不幸的是,这个概念非常棘手。
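In a batch process, the processing tasks rapidly crunch through a large collection of historical events. If some kind of breakdown by time needs to happen, the batch process needs to look at the timestamp embedded in each event. There is no point in looking at the system clock of the machine running the batch process, because the time at which the process is run has nothing to do with the time at which the events actually occurred.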
批处理可以在几分钟内读取一年的历史事件;在大多数情况下,感兴趣的时间线是历史中的一年,而不是处理中的几分钟。而且使用事件中的时间戳,使得处理是 **确定性** 的:在相同的输入上再次运行相同的处理过程会得到相同的结果(请参阅 “[容错](/v1/ch10#容错)”)。
另一方面,许多流处理框架使用处理机器上的本地系统时钟(**处理时间**,即 processing time)来确定 **窗口(windowing)**【79】。这种方法的优点是简单,如果事件创建与事件处理之间的延迟可以忽略不计,那也是合理的。然而,如果存在任何显著的处理延迟 —— 即,事件处理显著地晚于事件实际发生的时间,这种处理方式就失效了。
#### 事件时间与处理时间
很多原因都可能导致处理延迟:排队,网络故障(请参阅 “[不可靠的网络](/v1/ch8#不可靠的网络)”),性能问题导致消息代理 / 消息处理器出现争用,流消费者重启,从故障中恢复时重新处理过去的事件(请参阅 “[重播旧消息](#重播旧消息)”),或者在修复代码 BUG 之后。
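Moreover, message delays can also lead to unpredictable ordering of messages. For example, say a user first makes one web request (which is handled by web server A), and then a second request (which is handled by server B). A and B emit events describing the requests they handled, but B's event reaches the message broker before A's event does. Now stream processors will first see the B event and then the A event, even though they actually occurred in the opposite order.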
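If it helps to have an analogy, consider the *Star Wars* movies: Episode IV was released in 1977, Episode V in 1980, and Episode VI in 1983, followed by Episodes I, II, and III in 1999, 2002, and 2005, respectively, and Episode VII in 2015 【80】[^ii]. If you watched the movies in the order they came out, the order in which you processed the movies is inconsistent with the order of their narrative. (The episode number is like the event timestamp, and the date when you watched the movie is the processing time.) As humans, we are able to cope with such discontinuities, but stream processing algorithms need to be specifically written to accommodate such timing and ordering issues.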
[^ii]: 感谢 Flink 社区的 Kostas Kloudas 提出这个比喻。
将事件时间和处理时间搞混会导致错误的数据。例如,假设你有一个流处理器用于测量请求速率(计算每秒请求数)。如果你重新部署流处理器,它可能会停止一分钟,并在恢复之后处理积压的事件。如果你按处理时间来衡量速率,那么在处理积压日志时,请求速率看上去就像有一个异常的突发尖峰,而实际上请求速率是稳定的([图 11-7](/v1/ddia_1107.png))。

**Figure 11-7. Windowing by processing time introduces artifacts due to variations in processing rate.**
#### 知道什么时候准备好了
用事件时间来定义窗口的一个棘手的问题是,你永远也无法确定是不是已经收到了特定窗口的所有事件,还是说还有一些事件正在来的路上。
例如,假设你将事件分组为一分钟的窗口,以便统计每分钟的请求数。你已经计数了一些带有本小时内第 37 分钟时间戳的事件,时间流逝,现在进入的主要都是本小时内第 38 和第 39 分钟的事件。什么时候才能宣布你已经完成了第 37 分钟的窗口计数,并输出其计数器值?
在一段时间没有看到任何新的事件之后,你可以超时并宣布一个窗口已经就绪,但仍然可能发生这种情况:某些事件被缓冲在另一台机器上,由于网络中断而延迟。你需要能够处理这种在窗口宣告完成之后到达的 **滞留(straggler)** 事件。大体上,你有两种选择【1】:
1. 忽略这些滞留事件,因为在正常情况下它们可能只是事件中的一小部分。你可以将丢弃事件的数量作为一个监控指标,并在出现大量丢消息的情况时报警。
2. 发布一个 **更正(correction)**,一个包括滞留事件的更新窗口值。你可能还需要收回以前的输出。
在某些情况下,可以使用特殊的消息来指示 “从现在开始,不会有比 t 更早时间戳的消息了”,消费者可以使用它来触发窗口【81】。但是,如果不同机器上的多个生产者都在生成事件,每个生产者都有自己的最小时间戳阈值,则消费者需要分别跟踪每个生产者。在这种情况下,添加和删除生产者都是比较棘手的。
#### 你用的是谁的时钟?
当事件可能在系统内多个地方进行缓冲时,为事件分配时间戳更加困难了。例如,考虑一个移动应用向服务器上报关于用量的事件。该应用可能会在设备处于脱机状态时被使用,在这种情况下,它将在设备本地缓冲事件,并在下一次互联网连接可用时向服务器上报这些事件(可能是几小时甚至几天)。对于这个流的任意消费者而言,它们就如延迟极大的滞留事件一样。
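In this context, the timestamp on the events should really be the time at which the user interaction occurred, according to the mobile device's local clock. However, the clock on a user-controlled device often cannot be trusted, as it may be accidentally or deliberately set to the wrong time (see “[时钟同步与准确性](/v1/ch8#时钟同步与准确性)”). The time at which the event was received by the server (according to the server's clock) is more likely to be accurate, since the server is under your control, but it is less meaningful in terms of describing the user interaction.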
要校正不正确的设备时钟,一种方法是记录三个时间戳【82】:
* 事件发生的时间,取决于设备时钟
* 事件发送往服务器的时间,取决于设备时钟
* 事件被服务器接收的时间,取决于服务器时钟
通过从第三个时间戳中减去第二个时间戳,可以估算设备时钟和服务器时钟之间的偏移(假设网络延迟与所需的时间戳精度相比可忽略不计)。然后可以将该偏移应用于事件时间戳,从而估计事件实际发生的真实时间(假设设备时钟偏移在事件发生时与送往服务器之间没有变化)。
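The three-timestamp correction can be written as a one-line adjustment. The function name and the use of plain seconds are assumptions; the sketch also inherits both caveats stated above (negligible network delay, and a device clock offset that does not change between the event and the upload).

```python
def corrected_event_time(event_ts, sent_ts, received_ts):
    """Estimate when an event really happened, in server-clock time.

    event_ts and sent_ts come from the device clock; received_ts comes
    from the server clock. All values are seconds.
    """
    clock_offset = received_ts - sent_ts   # server clock minus device clock
    return event_ts + clock_offset

# Device clock runs 100 s behind the server; the event was buffered for 600 s.
estimate = corrected_event_time(event_ts=1000, sent_ts=1600, received_ts=1700)
```

Here the event is placed at 1100 on the server's timeline: the 600-second buffering delay is preserved and only the 100-second clock error is removed.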
这并不是流处理独有的问题,批处理有着完全一样的时间推理问题。只是在流处理的上下文中,我们更容易意识到时间的流逝。
#### 窗口的类型
当你知道如何确定一个事件的时间戳后,下一步就是如何定义时间段的窗口。然后窗口就可以用于聚合,例如事件计数,或计算窗口内值的平均值。有几种窗口很常用【79,83】:
滚动窗口(Tumbling Window)
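: A tumbling window has a fixed length, and every event belongs to exactly one window. For example, if you have a 1-minute tumbling window, all the events with timestamps between `10:03:00` and `10:03:59` are grouped into one window, events between `10:04:00` and `10:04:59` into the next window, and so on. You could implement a 1-minute tumbling window by taking each event timestamp and rounding it down to the nearest minute to determine the window that it belongs to.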
跳动窗口(Hopping Window)
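: A hopping window also has a fixed length, but allows windows to overlap in order to provide some smoothing. For example, a 5-minute window with a hop size of 1 minute would contain the events between `10:03:00` and `10:07:59`, then the next window would cover events between `10:04:00` and `10:08:59`, and so on. You can implement this hopping window by first calculating 1-minute tumbling windows, and then aggregating over several adjacent windows.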
滑动窗口(Sliding Window)
: 滑动窗口包含了彼此间距在特定时长内的所有事件。例如,一个 5 分钟的滑动窗口应当覆盖 `10:03:39` 和 `10:08:12` 的事件,因为它们相距不超过 5 分钟(注意滚动窗口与步长 5 分钟的跳动窗口可能不会把这两个事件分组到同一个窗口中,因为它们使用固定的边界)。通过维护一个按时间排序的事件缓冲区,并不断从窗口中移除过期的旧事件,可以实现滑动窗口。
会话窗口(Session window)
: 与其他窗口类型不同,会话窗口没有固定的持续时间,而定义为:将同一用户出现时间相近的所有事件分组在一起,而当用户一段时间没有活动时(例如,如果 30 分钟内没有事件)窗口结束。会话切分是网站分析的常见需求(请参阅 “[分组](/v1/ch10#分组)”)。
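Three of the window types above can be sketched as small functions over integer timestamps in seconds. The function names are hypothetical, `hopping_windows` assumes the window length is a multiple of the hop size, and sliding windows are left out because they need a buffer of events rather than a pure function of one timestamp.

```python
def tumbling_window(ts, size):
    """Start of the tumbling window containing ts (timestamp rounded down)."""
    return ts - ts % size

def hopping_windows(ts, size, hop):
    """Starts of every hopping window of length `size` that contains ts."""
    last_start = ts - ts % hop
    return list(range(last_start - size + hop, last_start + 1, hop))

def session_windows(timestamps, gap):
    """Group one user's timestamps; a pause longer than `gap` ends a session."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions

ts = 10 * 3600 + 7 * 60 + 30                     # 10:07:30 as seconds since midnight
minute = tumbling_window(ts, 60)                 # start of the 10:07 window
five_minute = hopping_windows(ts, 300, 60)       # windows starting 10:03 .. 10:07
sessions = session_windows([0, 10, 25, 100, 105], gap=30)
```

An event at `10:07:30` falls into exactly one 1-minute tumbling window but into five overlapping 5-minute hopping windows, the earliest of which is the `10:03:00` to `10:07:59` window from the text.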
### 流连接
在 [第十章](/v1/ch10) 中,我们讨论了批处理作业如何通过键来连接数据集,以及这种连接是如何成为数据管道的重要组成部分的。由于流处理将数据管道泛化为对无限数据集进行增量处理,因此对流进行连接的需求也是完全相同的。
然而,新事件随时可能出现在一个流中,这使得流连接要比批处理连接更具挑战性。为了更好地理解情况,让我们先来区分三种不同类型的连接:**流 - 流** 连接,**流 - 表** 连接,与 **表 - 表** 连接【84】。我们将在下面的章节中通过例子来说明。
#### 流流连接(窗口连接)
假设你的网站上有搜索功能,而你想要找出搜索 URL 的近期趋势。每当有人键入搜索查询时,都会记录下一个包含查询与其返回结果的事件。每当有人点击其中一个搜索结果时,就会记录另一个记录点击事件。为了计算搜索结果中每个 URL 的点击率,你需要将搜索动作与点击动作的事件连在一起,这些事件通过相同的会话 ID 进行连接。广告系统中需要类似的分析【85】。
如果用户丢弃了搜索结果,点击可能永远不会发生,即使它出现了,搜索与点击之间的时间可能是高度可变的:在很多情况下,它可能是几秒钟,但也可能长达几天或几周(如果用户执行搜索,忘掉了这个浏览器页面,过了一段时间后重新回到这个浏览器页面上,并点击了一个结果)。由于可变的网络延迟,点击事件甚至可能先于搜索事件到达。你可以选择合适的连接窗口 —— 例如,如果点击与搜索之间的时间间隔在一小时内,你可能会选择连接两者。
请注意,在点击事件中嵌入搜索详情与事件连接并不一样:这样做的话,只有当用户点击了一个搜索结果时你才能知道,而那些没有点击的搜索就无能为力了。为了衡量搜索质量,你需要准确的点击率,为此搜索事件和点击事件两者都是必要的。
为了实现这种类型的连接,流处理器需要维护 **状态**:例如,按会话 ID 索引最近一小时内发生的所有事件。无论何时发生搜索事件或点击事件,都会被添加到合适的索引中,而流处理器也会检查另一个索引是否有具有相同会话 ID 的事件到达。如果有匹配事件就会发出一个表示搜索结果被点击的事件;如果搜索事件直到过期都没看见有匹配的点击事件,就会发出一个表示搜索结果未被点击的事件。
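A sketch of the state a stream-stream join keeps, assuming events arrive as `(timestamp, kind, session_id, payload)` tuples with `kind` being `"search"` or `"click"`. The function name is hypothetical, and two parts of the description are deliberately omitted: expiring old state, and emitting a "not clicked" event when a search times out.

```python
def join_searches_and_clicks(events, window):
    """Pair each search with clicks from the same session within `window` seconds."""
    searches, clicks = {}, {}        # session_id -> list of (timestamp, payload)
    matches = []
    for ts, kind, session_id, payload in events:
        mine, other = (searches, clicks) if kind == "search" else (clicks, searches)
        mine.setdefault(session_id, []).append((ts, payload))
        # Look up the other side of the join for the same session.
        for other_ts, other_payload in other.get(session_id, []):
            if abs(ts - other_ts) <= window:
                if kind == "search":
                    matches.append((session_id, payload, other_payload))
                else:
                    matches.append((session_id, other_payload, payload))
    return matches

events = [
    (0, "search", "s1", "q=cats"),
    (5, "click", "s1", "/cats"),
    (9000, "click", "s1", "/old"),     # outside the one-hour window
    (10, "click", "s2", "/x"),         # no matching search
]
matches = join_searches_and_clicks(events, window=3600)
```

Both indexes are consulted on every event, so a click that arrives before its search (as the text notes can happen) is still matched once the search shows up.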
#### 流表连接(流扩充)
在 “[示例:用户活动事件分析](/v1/ch10#示例:用户活动事件分析)”([图 10-2](/v1/ddia_1002.png))中,我们看到了连接两个数据集的批处理作业示例:一组用户活动事件和一个用户档案数据库。将用户活动事件视为流,并在流处理器中连续执行相同的连接是很自然的想法:输入是包含用户 ID 的活动事件流,而输出还是活动事件流,但其中用户 ID 已经被扩展为用户的档案信息。这个过程有时被称为使用数据库的信息来 **扩充(enriching)** 活动事件。
要执行此连接,流处理器需要一次处理一个活动事件,在数据库中查找事件的用户 ID,并将档案信息添加到活动事件中。数据库查询可以通过查询远程数据库来实现。但正如在 “[示例:用户活动事件分析](/v1/ch10#示例:用户活动事件分析)” 一节中讨论的,此类远程查询可能会很慢,并且有可能导致数据库过载【75】。
另一种方法是将数据库副本加载到流处理器中,以便在本地进行查询而无需网络往返。这种技术与我们在 “[Map 侧连接](/v1/ch10#Map侧连接)” 中讨论的散列连接非常相似:如果数据库的本地副本足够小,则可以是内存中的散列表,比较大的话也可以是本地磁盘上的索引。
与批处理作业的区别在于,批处理作业使用数据库的时间点快照作为输入,而流处理器是长时间运行的,且数据库的内容可能随时间而改变,所以流处理器数据库的本地副本需要保持更新。这个问题可以通过变更数据捕获来解决:流处理器可以订阅用户档案数据库的更新日志,如同活动事件流一样。当增添或修改档案时,流处理器会更新其本地副本。因此,我们有了两个流之间的连接:活动事件和档案更新。
流表连接实际上非常类似于流流连接;最大的区别在于对于表的变更日志流,连接使用了一个可以回溯到 “时间起点” 的窗口(概念上是无限的窗口),新版本的记录会覆盖更早的版本。对于输入的流,连接可能压根儿就没有维护任何窗口。
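A sketch of stream enrichment with a local table copy kept current by a changelog. It assumes, purely for illustration, that profile changes and activity events arrive interleaved in a single sequence of `(kind, user_id, data)` tuples; the function `enrich` is hypothetical.

```python
def enrich(stream):
    """Join activity events against a local copy of the profile table."""
    profiles = {}                                  # local table replica
    for kind, user_id, data in stream:
        if kind == "profile":
            profiles[user_id] = data               # newer version overwrites older
        else:
            yield {"user_id": user_id, "action": data,
                   "profile": profiles.get(user_id)}

stream = [
    ("profile", "u1", {"name": "Ann"}),
    ("activity", "u1", "view"),
    ("profile", "u1", {"name": "Anna"}),
    ("activity", "u1", "buy"),
    ("activity", "u2", "view"),                    # no profile seen yet
]
enriched = list(enrich(stream))
```

The two `u1` activity events are joined with different profile versions, which shows that the result depends on how the two inputs are interleaved.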
#### 表表连接(维护物化视图)
我们在 “[描述负载](/v1/ch1#描述负载)” 中讨论的推特时间线例子时说过,当用户想要查看他们的主页时间线时,迭代用户所关注人群的推文并合并它们是一个开销巨大的操作。
相反,我们需要一个时间线缓存:一种每个用户的 “收件箱”,在发送推文的时候写入这些信息,因而读取时间线时只需要简单地查询即可。物化与维护这个缓存需要处理以下事件:
* 当用户 u 发送新的推文时,它将被添加到每个关注用户 u 的时间线上。
* When a user deletes a tweet, it is removed from all users' timelines.
* 当用户 $u_1$ 开始关注用户 $u_2$ 时,$u_2$ 最近的推文将被添加到 $u_1$ 的时间线上。
* 当用户 $u_1$ 取消关注用户 $u_2$ 时,$u_2$ 的推文将从 $u_1$ 的时间线中移除。
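To implement this cache maintenance in a stream processor, you need streams of events for tweets (sending and deleting) and for follow relationships (following and unfollowing). The stream processor needs to maintain a database containing the set of followers for each user, so that it knows which timelines need to be updated when a new tweet arrives 【86】.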
观察这个流处理过程的另一种视角是:它维护了一个连接了两个表(推文与关注)的物化视图,如下所示:
```sql
SELECT follows.follower_id AS timeline_id,
array_agg(tweets.* ORDER BY tweets.timestamp DESC)
FROM tweets
JOIN follows ON follows.followee_id = tweets.sender_id
GROUP BY follows.follower_id
```
流连接直接对应于这个查询中的表连接。时间线实际上是这个查询结果的缓存,每当底层的表发生变化时都会更新 [^iii]。
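[^iii]: If you regard a stream as the derivative of a table, as in [图 11-6](/v1/ddia_1106.png), and regard a join as a product of two tables u·v, something interesting happens: the stream of changes to the materialized join follows the product rule (u·v)′ = u′v + uv′. In words: any change of tweets is joined with the current followers, and any change of followers is joined with the current tweets 【49,50】.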
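The four timeline maintenance rules listed in this section can be sketched as event handlers over two in-memory tables. The class `TimelineView` is hypothetical; it ignores timestamp ordering and the "recent tweets" limit, and assumes a user is not followed twice by the same follower.

```python
from collections import defaultdict

class TimelineView:
    """Materialized join of tweets and follows, updated one event at a time."""

    def __init__(self):
        self.tweets = defaultdict(list)      # sender -> tweet ids
        self.followers = defaultdict(set)    # followee -> follower ids
        self.timelines = defaultdict(list)   # follower -> tweet ids

    def on_tweet(self, sender, tweet_id):
        self.tweets[sender].append(tweet_id)
        for follower in self.followers[sender]:      # change in tweets x current follows
            self.timelines[follower].append(tweet_id)

    def on_delete(self, sender, tweet_id):
        self.tweets[sender].remove(tweet_id)
        for follower in self.followers[sender]:
            self.timelines[follower].remove(tweet_id)

    def on_follow(self, follower, followee):
        self.followers[followee].add(follower)       # change in follows x current tweets
        self.timelines[follower].extend(self.tweets[followee])

    def on_unfollow(self, follower, followee):
        self.followers[followee].discard(follower)
        gone = set(self.tweets[followee])
        self.timelines[follower] = [t for t in self.timelines[follower] if t not in gone]

view = TimelineView()
view.on_tweet("u2", "t1")
view.on_follow("u1", "u2")
view.on_tweet("u2", "t2")
```

Each handler joins a change on one side with the current state of the other side, which is the same pattern the SQL query expresses declaratively.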
#### 连接的时间依赖性
这里描述的三种连接(流流,流表,表表)有很多共通之处:它们都需要流处理器维护连接一侧的一些状态(搜索与点击事件,用户档案,关注列表),然后当连接另一侧的消息到达时查询该状态。
用于维护状态的事件顺序是很重要的(先关注然后取消关注,或者其他类似操作)。在分区日志中,单个分区内的事件顺序是保留下来的。但典型情况下是没有跨流或跨分区的顺序保证的。
这就产生了一个问题:如果不同流中的事件发生在近似的时间范围内,则应该按照什么样的顺序进行处理?在流表连接的例子中,如果用户更新了它们的档案,哪些活动事件与旧档案连接(在档案更新前处理),哪些又与新档案连接(在档案更新之后处理)?换句话说:你需要对一些状态做连接,如果状态会随着时间推移而变化,那应当使用什么时间点来连接呢【45】?
这种时序依赖可能出现在很多地方。例如销售东西需要对发票应用适当的税率,这取决于所处的国家 / 州,产品类型,销售日期(因为税率时不时会变化)。当连接销售额与税率表时,你可能期望的是使用销售时的税率参与连接。如果你正在重新处理历史数据,销售时的税率可能和现在的税率有所不同。
如果跨越流的事件顺序是未定的,则连接会变为不确定性的【87】,这意味着你在同样输入上重跑相同的作业未必会得到相同的结果:当你重跑任务时,输入流上的事件可能会以不同的方式交织。
在数据仓库中,这个问题被称为 **缓慢变化的维度(slowly changing dimension, SCD)**,通常通过对特定版本的记录使用唯一的标识符来解决:例如,每当税率改变时都会获得一个新的标识符,而发票在销售时会带有税率的标识符【88,89】。这种变化使连接变为确定性的,但也会导致日志压缩无法进行:表中所有的记录版本都需要保留。
### 容错
在本章的最后一节中,让我们看一看流处理是如何容错的。我们在 [第十章](/v1/ch10) 中看到,批处理框架可以很容易地容错:如果 MapReduce 作业中的任务失败,可以简单地在另一台机器上再次启动,并且丢弃失败任务的输出。这种透明的重试是可能的,因为输入文件是不可变的,每个任务都将其输出写入到 HDFS 上的独立文件中,而输出仅当任务成功完成后可见。
特别是,批处理容错方法可确保批处理作业的输出与没有出错的情况相同,即使实际上某些任务失败了。看起来好像每条输入记录都被处理了恰好一次 —— 没有记录被跳过,而且没有记录被处理两次。尽管重启任务意味着实际上可能会多次处理记录,但输出中的可见效果看上去就像只处理过一次。这个原则被称为 **恰好一次语义(exactly-once semantics)**,尽管 **等效一次(effectively-once)** 可能会是一个更写实的术语【90】。
在流处理中也出现了同样的容错问题,但是处理起来没有那么直观:等待某个任务完成之后再使其输出可见并不是一个可行选项,因为你永远无法处理完一个无限的流。
#### 微批量与存档点
一个解决方案是将流分解成小块,并像微型批处理一样处理每个块。这种方法被称为 **微批次(microbatching)**,它被用于 Spark Streaming 【91】。批次的大小通常约为 1 秒,这是对性能妥协的结果:较小的批次会导致更大的调度与协调开销,而较大的批次意味着流处理器结果可见之前的延迟要更长。
微批次也隐式提供了一个与批次大小相等的滚动窗口(按处理时间而不是事件时间戳分窗)。任何需要更大窗口的作业都需要显式地将状态从一个微批次转移到下一个微批次。
Apache Flink 则使用不同的方法,它会定期生成状态的滚动存档点并将其写入持久存储【92,93】。如果流算子崩溃,它可以从最近的存档点重启,并丢弃从最近检查点到崩溃之间的所有输出。存档点会由消息流中的 **壁障(barrier)** 触发,类似于微批次之间的边界,但不会强制一个特定的窗口大小。
在流处理框架的范围内,微批次与存档点方法提供了与批处理一样的 **恰好一次语义**。但是,只要输出离开流处理器(例如,写入数据库,向外部消息代理发送消息,或发送电子邮件),框架就无法抛弃失败批次的输出了。在这种情况下,重启失败任务会导致外部副作用发生两次,只有微批次或存档点不足以阻止这一问题。
#### 原子提交再现
为了在出现故障时表现出恰好处理一次的样子,我们需要确保事件处理的所有输出和副作用 **当且仅当** 处理成功时才会生效。这些影响包括发送给下游算子或外部消息传递系统(包括电子邮件或推送通知)的任何消息,任何数据库写入,对算子状态的任何变更,以及对输入消息的任何确认(包括在基于日志的消息代理中将消费者偏移量前移)。
这些事情要么都原子地发生,要么都不发生,但是它们不应当失去同步。如果这种方法听起来很熟悉,那是因为我们在分布式事务和两阶段提交的上下文中讨论过它(请参阅 “[恰好一次的消息处理](/v1/ch9#恰好一次的消息处理)”)。
在 [第九章](/v1/ch9) 中,我们讨论了分布式事务传统实现中的问题(如 XA)。然而在限制更为严苛的环境中,也是有可能高效实现这种原子提交机制的。Google Cloud Dataflow【81,92】和 VoltDB 【94】中使用了这种方法,Apache Kafka 有计划加入类似的功能【95,96】。与 XA 不同,这些实现不会尝试跨异构技术提供事务,而是通过在流处理框架中同时管理状态变更与消息传递来内化事务。事务协议的开销可以通过在单个事务中处理多个输入消息来分摊。
#### 幂等性
我们的目标是丢弃任何失败任务的部分输出,以便能安全地重试,而不会生效两次。分布式事务是实现这个目标的一种方式,而另一种方式是依赖 **幂等性(idempotence)**【97】。
幂等操作是多次重复执行与单次执行效果相同的操作。例如,将键值存储中的某个键设置为某个特定值是幂等的(再次写入该值,只是用同样的值替代),而递增一个计数器不是幂等的(再次执行递增意味着该值递增两次)。
即使一个操作不是天生幂等的,往往可以通过一些额外的元数据做成幂等的。例如,在使用来自 Kafka 的消息时,每条消息都有一个持久的、单调递增的偏移量。将值写入外部数据库时可以将这个偏移量带上,这样你就可以判断一条更新是不是已经执行过了,因而避免重复执行。
Storm 的 Trident 基于类似的想法来处理状态【78】。依赖幂等性意味着隐含了一些假设:重启一个失败的任务必须以相同的顺序重播相同的消息(基于日志的消息代理能做这些事),处理必须是确定性的,没有其他节点能同时更新相同的值【98,99】。
当从一个处理节点故障切换到另一个节点时,可能需要进行 **防护**(fencing,请参阅 “[领导者和锁](/v1/ch8#领导者和锁)”),以防止被假死节点干扰。尽管有这么多注意事项,幂等操作是一种实现 **恰好一次语义** 的有效方式,仅需很小的额外开销。
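Making a non-idempotent increment idempotent by storing the message offset next to the value can be sketched as follows. The class `CounterStore` is hypothetical and stands in for an external database; it assumes a single writer, messages replayed in offset order, and that the value and the offset are updated atomically together.

```python
class CounterStore:
    def __init__(self):
        self.value = 0
        self.last_offset = -1     # offset of the last message applied

    def apply_increment(self, offset, amount):
        """Apply an increment unless this offset has already been applied."""
        if offset <= self.last_offset:
            return False          # duplicate delivery after a retry: skip it
        self.value += amount
        self.last_offset = offset # must be persisted together with the value
        return True

store = CounterStore()
store.apply_increment(0, 5)
store.apply_increment(1, 2)
store.apply_increment(1, 2)       # redelivered after a simulated restart
```

The redelivered message leaves the counter unchanged, so retrying a failed task has the same visible effect as processing each message once.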
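#### Rebuilding state after a failure
Any stream process that requires state (for example, windowed aggregations such as counters, averages, and histograms, and any tables and indexes used for joins) must ensure that this state can be recovered after a failure.
One option is to keep the state in a remote datastore and replicate it, although, as discussed in “[Stream-table join (stream enrichment)](#流表连接(流扩充))”, querying a remote database for every message can be slow. An alternative is to keep state local to the stream processor and replicate it periodically. Then, when the stream processor recovers from a failure, the new task can read the replicated state and resume processing without data loss.
For example, Flink periodically captures snapshots of operator state and writes them to durable storage such as HDFS【92,93】. Samza and Kafka Streams replicate state changes by sending them to a dedicated Kafka topic with log compaction, similar to change data capture【84,100】. VoltDB replicates state by redundantly processing each input message on several nodes (see “[Actual serial execution](/v1/ch7#真的串行执行)”).
In some cases it may not even be necessary to replicate the state, because it can be rebuilt from the input streams. For example, if the state consists of aggregations over a fairly short window, it may be fast enough simply to replay the input events for that window. If the state is a local replica of a database maintained by change data capture, the database can also be rebuilt from the log-compacted change stream (see “[Log compaction](#日志压缩)”).
However, all of these trade-offs depend on the performance characteristics of the underlying infrastructure: in some systems, network latency may be lower than disk access latency, and network bandwidth may be comparable to disk bandwidth. There is no universally ideal trade-off, and the relative merits of local and remote state may shift as storage and networking technologies evolve.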
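
Rebuilding local state from a changelog can be sketched in a few lines. This is only an illustration under simplifying assumptions: the changelog is a Python list of `(key, value)` pairs, a value of `None` is treated as a deletion marker (tombstone), and `compact` and `restore` are hypothetical helper names rather than the API of any particular framework.

```python
def compact(changelog):
    """Keep only the latest record per key, as log compaction would."""
    latest = {}
    for key, value in changelog:
        latest[key] = value
    return list(latest.items())


def restore(changelog):
    """Rebuild a local key-value state by replaying a changelog in order."""
    state = {}
    for key, value in changelog:
        if value is None:
            state.pop(key, None)  # tombstone: the key was deleted
        else:
            state[key] = value
    return state
```

Replaying the compacted log yields the same state as replaying the full log, which is why a compacted changelog topic is sufficient for recovery while being much smaller than the full history.
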
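## Summary
In this chapter we discussed event streams, the purposes they serve, and how to process them. In some ways, stream processing is very much like the batch processing discussed in [Chapter 10](/v1/ch10), but done continuously on unbounded (never-ending) streams rather than on a fixed-size input. From this perspective, message brokers and event logs serve as the streaming equivalent of a filesystem.
We spent some time comparing two types of message brokers:
AMQP/JMS-style message broker
: The broker assigns individual messages to consumers, and consumers acknowledge each message once it has been successfully processed. Messages are deleted from the broker after they have been acknowledged. This approach is appropriate as an asynchronous form of RPC (see also “[Message-passing dataflow](/v1/ch4#消息传递中的数据流)”), for example in a task queue, where the exact order of message processing is not important and there is no need to go back and reread old messages after they have been processed.
Log-based message broker
: The broker assigns all messages in a partition to the same consumer node and always delivers messages in the same order. Parallelism is achieved through partitioning, and consumers track their progress by checkpointing the offset of the last message they have processed. The broker retains messages on disk, so it is possible to jump back and reread old messages if necessary.
The log-based approach has similarities to the replication logs found in databases (see [Chapter 5](/v1/ch5)) and to log-structured storage engines (see [Chapter 3](/v1/ch3)). We saw that this approach is especially appropriate for systems that consume input streams and generate derived state or derived output streams.
In terms of where streams come from, we discussed several possibilities: user activity events, sensors providing periodic readings, and data feeds (for example, market data in finance) are naturally represented as streams. We saw that it can also be useful to think of the writes to a database as a stream: we can capture the changelog (the history of all changes made to a database) either implicitly through change data capture or explicitly through event sourcing. Log compaction allows the stream to retain a full copy of the contents of a database.
Representing databases as streams opens up powerful opportunities for integrating systems. You can keep derived data systems such as search indexes, caches, and analytics systems continually up to date by consuming the log of changes and applying them to the derived system. You can even build fresh views onto existing data by starting from scratch and consuming the log of changes from the beginning all the way to the present.
The facilities for maintaining state as streams and replaying messages are also the basis for the techniques that enable stream joins and fault tolerance in various stream processing frameworks. We discussed several purposes of stream processing, including searching for event patterns (complex event processing), computing windowed aggregations (stream analytics), and keeping derived data systems up to date (materialized views).
We then discussed the difficulties of reasoning about time in a stream processor, including the distinction between processing time and event timestamps, and the problem of dealing with straggler events that arrive after you thought your window was complete.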
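We distinguished three types of joins that may appear in stream processes:
Stream-stream joins
: Both input streams consist of activity events, and the join operator searches for related events that occur within some window of time. For example, it may match two actions taken by the same user within 30 minutes of each other. The two join inputs may in fact be the same stream (a **self-join**) if you want to find related events within that one stream.
Stream-table joins
: One input stream consists of activity events, while the other is a database changelog. The changelog keeps a local copy of the database up to date. For each activity event, the join operator queries the database and outputs an enriched activity event.
Table-table joins
: Both input streams are database changelogs. In this case, every change on one side is joined with the latest state of the other side. The result is a stream of changes to the materialized view of the join between the two tables.
Finally, we discussed techniques for achieving fault tolerance and exactly-once semantics in a stream processor. As with batch processing, we need to discard the partial output of any failed tasks. However, since a stream process is long-running and produces output continuously, we cannot simply discard all output. Instead, a finer-grained recovery mechanism can be used, based on microbatching, checkpointing, transactions, or idempotent writes.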
## References
1. Tyler Akidau, Robert Bradshaw, Craig Chambers, et al.: “[The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing](http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf),” *Proceedings of the VLDB Endowment*, volume 8, number 12, pages 1792–1803, August 2015. [doi:10.14778/2824032.2824076](http://dx.doi.org/10.14778/2824032.2824076)
1. Harold Abelson, Gerald Jay Sussman, and Julie Sussman: [*Structure and Interpretation of Computer Programs*](https://web.archive.org/web/20220807043536/https://mitpress.mit.edu/sites/default/files/sicp/index.html), 2nd edition. MIT Press, 1996. ISBN: 978-0-262-51087-5, available online at *mitpress.mit.edu*
1. Patrick Th. Eugster, Pascal A. Felber, Rachid Guerraoui, and Anne-Marie Kermarrec: “[The Many Faces of Publish/Subscribe](http://www.cs.ru.nl/~pieter/oss/manyfaces.pdf),” *ACM Computing Surveys*, volume 35, number 2, pages 114–131, June 2003. [doi:10.1145/857076.857078](http://dx.doi.org/10.1145/857076.857078)
1. Joseph M. Hellerstein and Michael Stonebraker: [*Readings in Database Systems*](http://redbook.cs.berkeley.edu/), 4th edition. MIT Press, 2005. ISBN: 978-0-262-69314-1, available online at *redbook.cs.berkeley.edu*
1. Don Carney, Uğur Çetintemel, Mitch Cherniack, et al.: “[Monitoring Streams – A New Class of Data Management Applications](http://www.vldb.org/conf/2002/S07P02.pdf),” at *28th International Conference on Very Large Data Bases* (VLDB), August 2002.
1. Matthew Sackman: “[Pushing Back](https://wellquite.org/posts/lshift/pushing_back/),” *lshift.net*, May 5, 2016.
1. Vicent Martí: “[Brubeck, a statsd-Compatible Metrics Aggregator](http://githubengineering.com/brubeck/),” *githubengineering.com*, June 15, 2015.
1. Seth Lowenberger: “[MoldUDP64 Protocol Specification V 1.00](http://www.nasdaqtrader.com/content/technicalsupport/specifications/dataproducts/moldudp64.pdf),” *nasdaqtrader.com*, July 2009.
1. Pieter Hintjens: [*ZeroMQ – The Guide*](http://zguide.zeromq.org/page:all). O'Reilly Media, 2013. ISBN: 978-1-449-33404-8
1. Ian Malpass: “[Measure Anything, Measure Everything](https://codeascraft.com/2011/02/15/measure-anything-measure-everything/),” *codeascraft.com*, February 15, 2011.
1. Dieter Plaetinck: “[25 Graphite, Grafana and statsd Gotchas](https://grafana.com/blog/2016/03/03/25-graphite-grafana-and-statsd-gotchas/),” *grafana.com*, March 3, 2016.
1. Jeff Lindsay: “[Web Hooks to Revolutionize the Web](https://web.archive.org/web/20180928201955/http://progrium.com/blog/2007/05/03/web-hooks-to-revolutionize-the-web/),” *progrium.com*, May 3, 2007.
1. Jim N. Gray: “[Queues Are Databases](https://arxiv.org/pdf/cs/0701158.pdf),” Microsoft Research Technical Report MSR-TR-95-56, December 1995.
1. Mark Hapner, Rich Burridge, Rahul Sharma, et al.: “[JSR-343 Java Message Service (JMS) 2.0 Specification](https://jcp.org/en/jsr/detail?id=343),” *jms-spec.java.net*, March 2013.
1. Sanjay Aiyagari, Matthew Arrott, Mark Atwell, et al.: “[AMQP: Advanced Message Queuing Protocol Specification](http://www.rabbitmq.com/resources/specs/amqp0-9-1.pdf),” Version 0-9-1, November 2008.
1. “[Google Cloud Pub/Sub: A Google-Scale Messaging Service](https://cloud.google.com/pubsub/architecture),” *cloud.google.com*, 2016.
1. “[Apache Kafka 0.9 Documentation](http://kafka.apache.org/documentation.html),” *kafka.apache.org*, November 2015.
1. Jay Kreps, Neha Narkhede, and Jun Rao: “[Kafka: A Distributed Messaging System for Log Processing](https://www.microsoft.com/en-us/research/wp-content/uploads/2017/09/Kafka.pdf),” at *6th International Workshop on Networking Meets Databases* (NetDB), June 2011.
1. “[Amazon Kinesis Streams Developer Guide](http://docs.aws.amazon.com/streams/latest/dev/introduction.html),” *docs.aws.amazon.com*, April 2016.
1. Leigh Stewart and Sijie Guo: “[Building DistributedLog: Twitter’s High-Performance Replicated Log Service](https://blog.twitter.com/2015/building-distributedlog-twitter-s-high-performance-replicated-log-service),” *blog.twitter.com*, September 16, 2015.
1. “[DistributedLog Documentation](https://web.archive.org/web/20210517201308/https://bookkeeper.apache.org/distributedlog/docs/latest/),” Apache Software Foundation, *distributedlog.io*.
1. Jay Kreps: “[Benchmarking Apache Kafka: 2 Million Writes Per Second (On Three Cheap Machines)](https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines),” *engineering.linkedin.com*, April 27, 2014.
1. Kartik Paramasivam: “[How We’re Improving and Advancing Kafka at LinkedIn](https://engineering.linkedin.com/apache-kafka/how-we_re-improving-and-advancing-kafka-linkedin),” *engineering.linkedin.com*, September 2, 2015.
1. Jay Kreps: “[The Log: What Every Software Engineer Should Know About Real-Time Data's Unifying Abstraction](http://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying),” *engineering.linkedin.com*, December 16, 2013.
1. Shirshanka Das, Chavdar Botev, Kapil Surlaker, et al.: “[All Aboard the Databus!](http://www.socc2012.org/s18-das.pdf),” at *3rd ACM Symposium on Cloud Computing* (SoCC), October 2012.
1. Yogeshwer Sharma, Philippe Ajoux, Petchean Ang, et al.: “[Wormhole: Reliable Pub-Sub to Support Geo-Replicated Internet Services](https://www.usenix.org/system/files/conference/nsdi15/nsdi15-paper-sharma.pdf),” at *12th USENIX Symposium on Networked Systems Design and Implementation* (NSDI), May 2015.
1. P. P. S. Narayan: “[Sherpa Update](http://web.archive.org/web/20160801221400/https://developer.yahoo.com/blogs/ydn/sherpa-7992.html),” *developer.yahoo.com*, June 8, .
1. Martin Kleppmann: “[Bottled Water: Real-Time Integration of PostgreSQL and Kafka](http://martin.kleppmann.com/2015/04/23/bottled-water-real-time-postgresql-kafka.html),” *martin.kleppmann.com*, April 23, 2015.
1. Ben Osheroff: “[Introducing Maxwell, a mysql-to-kafka Binlog Processor](https://web.archive.org/web/20170208100334/https://developer.zendesk.com/blog/introducing-maxwell-a-mysql-to-kafka-binlog-processor),” *developer.zendesk.com*, August 20, 2015.
1. Randall Hauch: “[Debezium 0.2.1 Released](https://debezium.io/blog/2016/06/10/Debezium-0.2.1-Released/),” *debezium.io*, June 10, 2016.
1. Prem Santosh Udaya Shankar: “[Streaming MySQL Tables in Real-Time to Kafka](https://engineeringblog.yelp.com/2016/08/streaming-mysql-tables-in-real-time-to-kafka.html),” *engineeringblog.yelp.com*, August 1, 2016.
1. “[Mongoriver](https://github.com/stripe/mongoriver),” Stripe, Inc., *github.com*, September 2014.
1. Dan Harvey: “[Change Data Capture with Mongo + Kafka](http://www.slideshare.net/danharvey/change-data-capture-with-mongodb-and-kafka),” at *Hadoop Users Group UK*, August 2015.
1. “[Oracle GoldenGate 12c: Real-Time Access to Real-Time Information](https://web.archive.org/web/20160923105841/http://www.oracle.com/us/products/middleware/data-integration/oracle-goldengate-realtime-access-2031152.pdf),” Oracle White Paper, March 2015.
1. “[Oracle GoldenGate Fundamentals: How Oracle GoldenGate Works](https://www.youtube.com/watch?v=6H9NibIiPQE),” Oracle Corporation, *youtube.com*, November 2012.
1. Slava Akhmechet: “[Advancing the Realtime Web](http://rethinkdb.com/blog/realtime-web/),” *rethinkdb.com*, January 27, 2015.
1. “[Firebase Realtime Database Documentation](https://firebase.google.com/docs/database/),” Google, Inc., *firebase.google.com*, May 2016.
1. “[Apache CouchDB 1.6 Documentation](http://docs.couchdb.org/en/latest/),” *docs.couchdb.org*, 2014.
1. Matt DeBergalis: “[Meteor 0.7.0: Scalable Database Queries Using MongoDB Oplog Instead of Poll-and-Diff](https://web.archive.org/web/20160324055429/http://info.meteor.com/blog/meteor-070-scalable-database-queries-using-mongodb-oplog-instead-of-poll-and-diff),” *info.meteor.com*, December 17, 2013.
1. “[Chapter 15. Importing and Exporting Live Data](https://docs.voltdb.com/UsingVoltDB/ChapExport.php),” VoltDB 6.4 User Manual, *docs.voltdb.com*, June 2016.
1. Neha Narkhede: “[Announcing Kafka Connect: Building Large-Scale Low-Latency Data Pipelines](http://www.confluent.io/blog/announcing-kafka-connect-building-large-scale-low-latency-data-pipelines),” *confluent.io*, February 18, 2016.
1. Greg Young: “[CQRS and Event Sourcing](https://www.youtube.com/watch?v=JHGkaShoyNs),” at *Code on the Beach*, August 2014.
1. Martin Fowler: “[Event Sourcing](http://martinfowler.com/eaaDev/EventSourcing.html),” *martinfowler.com*, December 12, 2005.
1. Vaughn Vernon: [*Implementing Domain-Driven Design*](https://www.informit.com/store/implementing-domain-driven-design-9780321834577). Addison-Wesley Professional, 2013. ISBN: 978-0-321-83457-7
1. H. V. Jagadish, Inderpal Singh Mumick, and Abraham Silberschatz: “[View Maintenance Issues for the Chronicle Data Model](https://dl.acm.org/doi/10.1145/212433.220201),” at *14th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems* (PODS), May 1995. [doi:10.1145/212433.220201](http://dx.doi.org/10.1145/212433.220201)
1. “[Event Store 3.5.0 Documentation](http://docs.geteventstore.com/),” Event Store LLP, *docs.geteventstore.com*, February 2016.
1. Martin Kleppmann: [*Making Sense of Stream Processing*](http://www.oreilly.com/data/free/stream-processing.csp). Report, O'Reilly Media, May 2016.
1. Sander Mak: “[Event-Sourced Architectures with Akka](http://www.slideshare.net/SanderMak/eventsourced-architectures-with-akka),” at *JavaOne*, September 2014.
1. Julian Hyde: [personal communication](https://twitter.com/julianhyde/status/743374145006641153), June 2016.
1. Ashish Gupta and Inderpal Singh Mumick: *Materialized Views: Techniques, Implementations, and Applications*. MIT Press, 1999. ISBN: 978-0-262-57122-7
1. Timothy Griffin and Leonid Libkin: “[Incremental Maintenance of Views with Duplicates](http://homepages.inf.ed.ac.uk/libkin/papers/sigmod95.pdf),” at *ACM International Conference on Management of Data* (SIGMOD), May 1995. [doi:10.1145/223784.223849](http://dx.doi.org/10.1145/223784.223849)
1. Pat Helland: “[Immutability Changes Everything](http://cidrdb.org/cidr2015/Papers/CIDR15_Paper16.pdf),” at *7th Biennial Conference on Innovative Data Systems Research* (CIDR), January 2015.
1. Martin Kleppmann: “[Accounting for Computer Scientists](http://martin.kleppmann.com/2011/03/07/accounting-for-computer-scientists.html),” *martin.kleppmann.com*, March 7, 2011.
1. Pat Helland: “[Accountants Don't Use Erasers](https://web.archive.org/web/20200220161036/https://blogs.msdn.microsoft.com/pathelland/2007/06/14/accountants-dont-use-erasers/),” *blogs.msdn.com*, June 14, 2007.
1. Fangjin Yang: “[Dogfooding with Druid, Samza, and Kafka: Metametrics at Metamarkets](https://metamarkets.com/2015/dogfooding-with-druid-samza-and-kafka-metametrics-at-metamarkets/),” *metamarkets.com*, June 3, 2015.
1. Gavin Li, Jianqiu Lv, and Hang Qi: “[Pistachio: Co-Locate the Data and Compute for Fastest Cloud Compute](https://web.archive.org/web/20181214032620/https://yahoohadoop.tumblr.com/post/116365275781/pistachio-co-locate-the-data-and-compute-for),” *yahoohadoop.tumblr.com*, April 13, 2015.
1. Kartik Paramasivam: “[Stream Processing Hard Problems – Part 1: Killing Lambda](https://engineering.linkedin.com/blog/2016/06/stream-processing-hard-problems-part-1-killing-lambda),” *engineering.linkedin.com*, June 27, 2016.
1. Martin Fowler: “[CQRS](http://martinfowler.com/bliki/CQRS.html),” *martinfowler.com*, July 14, 2011.
1. Greg Young: “[CQRS Documents](https://cqrs.files.wordpress.com/2010/11/cqrs_documents.pdf),” *cqrs.files.wordpress.com*, November 2010.
1. Baron Schwartz: “[Immutability, MVCC, and Garbage Collection](https://web.archive.org/web/20161110094746/http://www.xaprb.com/blog/2013/12/28/immutability-mvcc-and-garbage-collection/),” *xaprb.com*, December 28, 2013.
1. Daniel Eloff, Slava Akhmechet, Jay Kreps, et al.: “[Re: Turning the Database Inside-out with Apache Samza](https://news.ycombinator.com/item?id=9145197),” Hacker News discussion, *news.ycombinator.com*, March 4, 2015.
1. “[Datomic Development Resources: Excision](http://docs.datomic.com/excision.html),” Cognitect, Inc., *docs.datomic.com*.
1. “[Fossil Documentation: Deleting Content from Fossil](http://fossil-scm.org/index.html/doc/trunk/www/shunning.wiki),” *fossil-scm.org*, 2016.
1. Jay Kreps: “[The irony of distributed systems is that data loss is really easy but deleting data is surprisingly hard,](https://twitter.com/jaykreps/status/582580836425330688)” *twitter.com*, March 30, 2015.
1. David C. Luckham: “[What’s the Difference Between ESP and CEP?](http://www.complexevents.com/2006/08/01/what%E2%80%99s-the-difference-between-esp-and-cep/),” *complexevents.com*, August 1, 2006.
1. Srinath Perera: “[How Is Stream Processing and Complex Event Processing (CEP) Different?](https://www.quora.com/How-is-stream-processing-and-complex-event-processing-CEP-different),” *quora.com*, December 3, 2015.
1. Arvind Arasu, Shivnath Babu, and Jennifer Widom: “[The CQL Continuous Query Language: Semantic Foundations and Query Execution](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/cql.pdf),” *The VLDB Journal*, volume 15, number 2, pages 121–142, June 2006. [doi:10.1007/s00778-004-0147-z](http://dx.doi.org/10.1007/s00778-004-0147-z)
1. Julian Hyde: “[Data in Flight: How Streaming SQL Technology Can Help Solve the Web 2.0 Data Crunch](http://queue.acm.org/detail.cfm?id=1667562),” *ACM Queue*, volume 7, number 11, December 2009. [doi:10.1145/1661785.1667562](http://dx.doi.org/10.1145/1661785.1667562)
1. “[Esper Reference, Version 5.4.0](http://esper.espertech.com/release-5.4.0/esper-reference/html_single/index.html),” EsperTech, Inc., *espertech.com*, April 2016.
1. Zubair Nabi, Eric Bouillet, Andrew Bainbridge, and Chris Thomas: “[Of Streams and Storms](https://web.archive.org/web/20170711081434/https://developer.ibm.com/streamsdev/wp-content/uploads/sites/15/2014/04/Streams-and-Storm-April-2014-Final.pdf),” IBM technical report, *developer.ibm.com*, April 2014.
1. Milinda Pathirage, Julian Hyde, Yi Pan, and Beth Plale: “[SamzaSQL: Scalable Fast Data Management with Streaming SQL](https://github.com/milinda/samzasql-hpbdc2016/blob/master/samzasql-hpbdc2016.pdf),” at *IEEE International Workshop on High-Performance Big Data Computing* (HPBDC), May 2016. [doi:10.1109/IPDPSW.2016.141](http://dx.doi.org/10.1109/IPDPSW.2016.141)
1. Philippe Flajolet, Éric Fusy, Olivier Gandouet, and Frédéric Meunier: “[HyperLogLog: The Analysis of a Near-Optimal Cardinality Estimation Algorithm](http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf),” at *Conference on Analysis of Algorithms* (AofA), June 2007.
1. Jay Kreps: “[Questioning the Lambda Architecture](https://www.oreilly.com/ideas/questioning-the-lambda-architecture),” *oreilly.com*, July 2, 2014.
1. Ian Hellström: “[An Overview of Apache Streaming Technologies](https://databaseline.bitbucket.io/an-overview-of-apache-streaming-technologies/),” *databaseline.bitbucket.io*, March 12, 2016.
1. Jay Kreps: “[Why Local State Is a Fundamental Primitive in Stream Processing](https://www.oreilly.com/ideas/why-local-state-is-a-fundamental-primitive-in-stream-processing),” *oreilly.com*, July 31, 2014.
1. Shay Banon: “[Percolator](https://www.elastic.co/blog/percolator),” *elastic.co*, February 8, 2011.
1. Alan Woodward and Martin Kleppmann: “[Real-Time Full-Text Search with Luwak and Samza](http://martin.kleppmann.com/2015/04/13/real-time-full-text-search-luwak-samza.html),” *martin.kleppmann.com*, April 13, 2015.
1. “[Apache Storm 2.1.0 Documentation](https://storm.apache.org/releases/2.1.0/index.html),” *storm.apache.org*, October 2019.
1. Tyler Akidau: “[The World Beyond Batch: Streaming 102](https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102),” *oreilly.com*, January 20, 2016.
1. Stephan Ewen: “[Streaming Analytics with Apache Flink](https://www.confluent.io/resources/kafka-summit-2016/advanced-streaming-analytics-apache-flink-apache-kafka/),” at *Kafka Summit*, April 2016.
1. Tyler Akidau, Alex Balikov, Kaya Bekiroğlu, et al.: “[MillWheel: Fault-Tolerant Stream Processing at Internet Scale](http://research.google.com/pubs/pub41378.html),” at *39th International Conference on Very Large Data Bases* (VLDB), August 2013.
1. Alex Dean: “[Improving Snowplow's Understanding of Time](https://snowplow.io/blog/improving-snowplows-understanding-of-time/),” *snowplowanalytics.com*, September 15, 2015.
1. “[Windowing (Azure Stream Analytics)](https://msdn.microsoft.com/en-us/library/azure/dn835019.aspx),” Microsoft Azure Reference, *msdn.microsoft.com*, April 2016.
1. “[State Management](http://samza.apache.org/learn/documentation/0.10/container/state-management.html),” Apache Samza 0.10 Documentation, *samza.apache.org*, December 2015.
1. Rajagopal Ananthanarayanan, Venkatesh Basker, Sumit Das, et al.: “[Photon: Fault-Tolerant and Scalable Joining of Continuous Data Streams](http://research.google.com/pubs/pub41318.html),” at *ACM International Conference on Management of Data* (SIGMOD), June 2013. [doi:10.1145/2463676.2465272](http://dx.doi.org/10.1145/2463676.2465272)
1. Martin Kleppmann: “[Samza Newsfeed Demo](https://github.com/ept/newsfeed),” *github.com*, September 2014.
1. Ben Kirwin: “[Doing the Impossible: Exactly-Once Messaging Patterns in Kafka](http://ben.kirw.in/2014/11/28/kafka-patterns/),” *ben.kirw.in*, November 28, 2014.
1. Pat Helland: “[Data on the Outside Versus Data on the Inside](http://cidrdb.org/cidr2005/papers/P12.pdf),” at *2nd Biennial Conference on Innovative Data Systems Research* (CIDR), January 2005.
1. Ralph Kimball and Margy Ross: *The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling*, 3rd edition. John Wiley & Sons, 2013. ISBN: 978-1-118-53080-1
1. Viktor Klang: “[I'm coining the phrase 'effectively-once' for message processing with at-least-once + idempotent operations](https://twitter.com/viktorklang/status/789036133434978304),” *twitter.com*, October 20, 2016.
1. Matei Zaharia, Tathagata Das, Haoyuan Li, et al.: “[Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters](https://www.usenix.org/system/files/conference/hotcloud12/hotcloud12-final28.pdf),” at *4th USENIX Conference in Hot Topics in Cloud Computing* (HotCloud), June 2012.
1. Kostas Tzoumas, Stephan Ewen, and Robert Metzger: “[High-Throughput, Low-Latency, and Exactly-Once Stream Processing with Apache Flink](https://www.ververica.com/blog/high-throughput-low-latency-and-exactly-once-stream-processing-with-apache-flink),” *ververica.com*, August 5, 2015.
1. Paris Carbone, Gyula Fóra, Stephan Ewen, et al.: “[Lightweight Asynchronous Snapshots for Distributed Dataflows](http://arxiv.org/abs/1506.08603),” arXiv:1506.08603 [cs.DC], June 29, 2015.
1. Ryan Betts and John Hugg: [*Fast Data: Smart and at Scale*](http://www.oreilly.com/data/free/fast-data-smart-and-at-scale.csp). Report, O'Reilly Media, October 2015.
1. Flavio Junqueira: “[Making Sense of Exactly-Once Semantics](https://web.archive.org/web/20160812172900/http://conferences.oreilly.com/strata/hadoop-big-data-eu/public/schedule/detail/49690),” at *Strata+Hadoop World London*, June 2016.
1. Jason Gustafson, Flavio Junqueira, Apurva Mehta, Sriram Subramanian, and Guozhang Wang: “[KIP-98 – Exactly Once Delivery and Transactional Messaging](https://cwiki.apache.org/confluence/display/KAFKA/KIP-98+-+Exactly+Once+Delivery+and+Transactional+Messaging),” *cwiki.apache.org*, November 2016.
1. Pat Helland: “[Idempotence Is Not a Medical Condition](https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=4b6dda7fe75b51e1c543a87ca7b3b322fbf55614),” *Communications of the ACM*, volume 55, number 5, page 56, May 2012. [doi:10.1145/2160718.2160734](http://dx.doi.org/10.1145/2160718.2160734)
1. Jay Kreps: “[Re: Trying to Achieve Deterministic Behavior on Recovery/Rewind](http://mail-archives.apache.org/mod_mbox/samza-dev/201409.mbox/%3CCAOeJiJg%2Bc7Ei%3DgzCuOz30DD3G5Hm9yFY%3DUJ6SafdNUFbvRgorg%40mail.gmail.com%3E),” email to *samza-dev* mailing list, September 9, 2014.
1. E. N. (Mootaz) Elnozahy, Lorenzo Alvisi, Yi-Min Wang, and David B. Johnson: “[A Survey of Rollback-Recovery Protocols in Message-Passing Systems](http://www.cs.utexas.edu/~lorenzo/papers/SurveyFinal.pdf),” *ACM Computing Surveys*, volume 34, number 3, pages 375–408, September 2002. [doi:10.1145/568522.568525](http://dx.doi.org/10.1145/568522.568525)
1. Adam Warski: “[Kafka Streams – How Does It Fit the Stream Processing Landscape?](https://softwaremill.com/kafka-streams-how-does-it-fit-stream-landscape/),” *softwaremill.com*, June 1, 2016.
================================================
FILE: content/v1/ch12.md
================================================
---
title: "Chapter 12: The Future of Data Systems"
linkTitle: "12. The Future of Data Systems"
weight: 312
breadcrumbs: false
---

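> If the highest aim of a captain were to preserve his ship, he would keep it in port forever.
>
> St. Thomas Aquinas, *Summa Theologica* (1265-1274)

So far, this book has mostly been about describing things as they **are** at present. In this final chapter, we will shift our perspective toward the **future** and discuss how things should be: I will propose some ideas and approaches that, I believe, may fundamentally improve the ways we design and build applications.
Opinions and speculation about the future are of course subjective, so I will use the first person in this chapter when writing about my personal opinions. You are welcome to disagree with them and form your own, but I hope that the ideas in this chapter will at least be a starting point for a productive discussion and bring some clarity to concepts that are often confused.
[Chapter 1](/v1/ch1) outlined the goal of this book: to explore how to create applications and systems that are **reliable**, **scalable**, and **maintainable**. These themes have run through all of the chapters: for example, we discussed many fault-tolerance algorithms that help improve reliability, partitioning to improve scalability, and mechanisms for evolution and abstraction that improve maintainability. In this chapter we will bring all of these ideas together and build on them to envisage the future. Our goal is to discover how to design applications that are better than the ones of today: robust, correct, evolvable, and ultimately beneficial to humanity.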
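## Data integration
A recurring theme in this book has been that for any given problem there are several solutions, all of which have different pros, cons, and trade-offs. For example, when discussing storage engines in [Chapter 3](/v1/ch3), we saw log-structured storage, B-trees, and column-oriented storage. When discussing replication in [Chapter 5](/v1/ch5), we saw single-leader, multi-leader, and leaderless approaches.
If you have a problem such as “I want to store some data and look it up again later,” there is no one right solution, but many different approaches that are each appropriate in different circumstances. A software implementation typically has to pick one particular approach. It is hard enough to make one code path robust and performant; trying to do everything in one piece of software almost guarantees that the implementation will be poor.
Thus, the most appropriate choice of software tool also depends on the circumstances. Every piece of software, even a so-called “general-purpose” database, is designed for a particular usage pattern.
Faced with this profusion of alternatives, the first challenge is to figure out the mapping between software products and the circumstances in which they are a good fit. Vendors are understandably reluctant to tell you about the kinds of workloads for which their software is poorly suited, but hopefully the previous chapters have equipped you with some questions to ask in order to read between the lines and better understand the trade-offs.
However, even if you perfectly understand the mapping between tools and the circumstances for their use, there is another challenge: in complex applications, data is often used in several different ways. There is unlikely to be one piece of software that is suitable for **all** the different circumstances in which the data is used, so you inevitably end up having to cobble together several different pieces of software to provide your application's functionality.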
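### Combining specialized tools by deriving data
For example, it is common to need to integrate an OLTP database with a full-text search index in order to handle queries for arbitrary keywords. Although some databases (such as PostgreSQL) include a full-text indexing feature, which can be sufficient for simple applications【1】, more sophisticated search facilities require specialist information retrieval tools. Conversely, search indexes are generally not very suitable as a durable system of record, so many applications need to combine two different tools in order to satisfy all of the requirements.
We touched on the issue of integrating data systems in “[Keeping systems in sync](/v1/ch11#保持系统同步)”. As the number of different representations of the data increases, the integration problem becomes harder. Besides the database and the search index, perhaps you need to keep copies of the data in analytics systems (data warehouses, or batch and stream processing systems); maintain caches or denormalized versions of objects derived from the original data; pass the data through machine learning, classification, ranking, or recommendation systems; or send notifications based on changes to the data.
Surprisingly often I see software engineers make statements like “In my experience, 99% of people only need X” or “...don't need X” (for various values of X). I think such statements say more about the experience of the speaker than about the actual usefulness of a technology. The range of different things you might want to do with data is dizzyingly wide. What one person considers an obscure and pointless feature may well be a central requirement for someone else. The need for data integration often only becomes apparent if you zoom out and consider the dataflows across an entire organization.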
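#### Reasoning about dataflows
When copies of the same data need to be maintained in several storage systems in order to satisfy different access patterns, you need to be very clear about the inputs and outputs: where is data written first, and which representations are derived from which sources? How do you get data into all the right places, in the right formats?
For example, you might arrange for data to be written first to a **system of record** database, capture the changes made to that database (see “[Change data capture](/v1/ch11#变更数据捕获)”), and then apply the changes to the search index in the same order. If change data capture (CDC) is the only way of updating the index, you can be confident that the index is entirely derived from the system of record, and therefore consistent with it (barring bugs in the software). Writing to the database is the only way of supplying new input into this system.
Allowing the application to write directly to both the search index and the database introduces the problem shown in [Figure 11-4](/v1/ddia_1104.png), in which two clients concurrently send conflicting writes and the two storage systems process them in a different order. In this case, neither the database nor the search index is “in charge” of determining the order of writes, so they may make contradictory decisions and become permanently inconsistent with each other.
If you can funnel all user input through a single system that decides on an ordering for all writes, it becomes much easier to derive other representations of the data by processing the writes in the same order. This is an application of the state machine replication approach that we saw in “[Total order broadcast](/v1/ch9#全序广播)”. Whether you use change data capture or an event sourcing log matters less than the principle of deciding on a total order.
Systems that update derived data based on an event log can often be made **deterministic** and **idempotent** (see “[Idempotence](/v1/ch11#幂等性)”), which makes it quite easy to recover from faults.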
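
The difference between dual writes and deriving the index from an ordered change log can be shown with a deliberately tiny sketch. Both "storage systems" are plain dictionaries here, and `apply_changes`, `write_a`, and `write_b` are hypothetical names chosen for illustration; the only point is that two stores agree if and only if they apply the same writes in the same order.

```python
def apply_changes(changes):
    """Apply (key, value) writes in the given order to a fresh store."""
    store = {}
    for key, value in changes:
        store[key] = value
    return store


# Two clients concurrently write conflicting values for the same key.
write_a = ("x", "A")
write_b = ("x", "B")

# Dual writes: each system sees the writes in its own arrival order.
database = apply_changes([write_a, write_b])
search_index_dual = apply_changes([write_b, write_a])

# Change data capture: the database's log fixes the order, and the
# index is derived by applying that log in the same order.
change_log = [write_a, write_b]
search_index_cdc = apply_changes(change_log)
```

With dual writes the database ends up with `"B"` and the index with `"A"`, and nothing will ever reconcile them. With the log-derived index, both converge on the value chosen by the log order.
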
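#### Derived data versus distributed transactions
The classic approach for keeping different data systems consistent with each other involves distributed transactions, as discussed in “[Atomic commit and two-phase commit (2PC)](/v1/ch9#原子提交与两阶段提交)”. How does the approach of using derived data systems compare with distributed transactions?
At an abstract level, they achieve a similar goal by different means. Distributed transactions decide on an ordering of writes by using **locks** for mutual exclusion (see “[Two-phase locking (2PL)](/v1/ch7#两阶段锁定)”), while CDC and event sourcing use a log for ordering. Distributed transactions use atomic commit to ensure that changes take effect exactly once, while log-based systems are often based on **deterministic retry** and **idempotence**.
The biggest difference is that transaction systems usually provide [linearizability](/v1/ch9#线性一致性), which implies useful guarantees such as [reading your own writes](/v1/ch5#读己之写). Derived data systems, on the other hand, are often updated asynchronously, so by default they do not offer the same timing guarantees.
Within the limited environments that are willing to pay the cost of distributed transactions, they have been used successfully. However, I think that XA has poor fault tolerance and performance characteristics (see “[Distributed transactions in practice](/v1/ch9#实践中的分布式事务)”), which severely limit its usefulness. I believe that it might be possible to create a better protocol for distributed transactions, but getting such a protocol widely adopted and integrated with existing tools would be challenging and is unlikely to happen soon.
In the absence of widespread support for a good distributed transaction protocol, I believe that log-based derived data is the most promising approach for integrating different data systems. However, guarantees such as reading your own writes are useful, and I do not think it is productive to tell everyone “eventual consistency is inevitable: suck it up and learn to deal with it” (at least not without good guidance on **how** to deal with it).
In “[Aiming for correctness](#将事情做正确)” we will discuss some approaches for implementing stronger guarantees on top of asynchronously derived systems, working toward a middle ground between distributed transactions and asynchronous log-based systems.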
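#### The limits of total ordering
With systems that are small enough, constructing a totally ordered event log is entirely feasible (as demonstrated by the popularity of databases with single-leader replication, which construct precisely such a log). However, as systems are scaled toward bigger and more complex workloads, limitations begin to emerge:
* In most cases, constructing a totally ordered log requires all events to pass through a **single leader node** that decides on the ordering. If the throughput of events is greater than a single machine can handle, you need to partition it across multiple machines (see “[Partitioned logs](/v1/ch11#分区日志)”). The order of events in two different partitions is then ambiguous.
* If the servers are spread across multiple **geographically distributed** datacenters, for example in order to tolerate an entire datacenter going offline, you typically have a separate leader in each datacenter, because network delays make synchronous cross-datacenter coordination inefficient (see “[Multi-leader replication](/v1/ch5#多主复制)”). This implies an undefined ordering of events that originate in two different datacenters.
* When applications are deployed as microservices (see “[Dataflow through services: REST and RPC](/v1/ch4#服务中的数据流:REST与RPC)”), a common design choice is to deploy each service and its durable state as an independent unit, with no durable state shared between services. When two events originate in different services, there is no defined order for those events.
* Some applications maintain client-side state that is updated immediately on user input (without waiting for confirmation from a server) and that can even continue to work offline (see “[Clients with offline operation](/v1/ch5#需要离线操作的客户端)”). With such applications, clients and servers are very likely to see events in different orders.
In formal terms, deciding on a total order of events is known as **total order broadcast**, which is equivalent to **consensus** (see “[Consensus algorithms and total order broadcast](/v1/ch9#共识算法和全序广播)”). Most consensus algorithms are designed for situations in which the throughput of a single node is sufficient to process the entire stream of events, and these algorithms do not provide a mechanism for multiple nodes to share the work of ordering the events. Designing consensus algorithms that can scale beyond the throughput of a single node and that work well in a geographically distributed setting is still an open research problem.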
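
The first limitation above can be made concrete with a small sketch of a partitioned log. It assumes events are `(object_id, payload)` pairs and uses a stable hash to choose a partition; `partition_for` and `route` are hypothetical helpers, not the API of any broker. Events for one object ID always land in the same partition, so their relative order is preserved, while nothing defines an order between events in different partitions.

```python
import hashlib


def partition_for(object_id, num_partitions):
    """Map an object ID to a partition using a stable hash."""
    digest = hashlib.sha256(object_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def route(events, num_partitions):
    """Append each (object_id, payload) event to its partition's log."""
    partitions = [[] for _ in range(num_partitions)]
    for object_id, payload in events:
        index = partition_for(object_id, num_partitions)
        partitions[index].append((object_id, payload))
    return partitions
```

A stable hash such as SHA-256 is used instead of Python's built-in `hash()`, whose output for strings varies between interpreter runs and would therefore send the same ID to different partitions after a restart.
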
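#### Ordering events to capture causality
In cases where there is no causal link between events, the lack of a total order is not a big problem, since concurrent events can be ordered arbitrarily. Some other cases are easy to handle: for example, when there are multiple updates of the same object, they can be totally ordered by routing all updates for a particular object ID to the same log partition. However, causal dependencies sometimes arise in more subtle ways (see “[Ordering and causality](/v1/ch9#顺序与因果关系)”).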
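For example, consider a social networking service and two users who were in a relationship but have just broken up. One of the users removes the other as a friend and then sends a message to their remaining friends complaining about their ex-partner. The user's intention is that the ex-partner should not see the rude message, since it was sent after the friend status was revoked.
However, in a system that stores friendship status in one place and messages in another, the ordering dependency between the **unfriend** event and the **message-send** event may be lost. If the causal dependency is not captured, a service that sends notifications about new messages may process the **message-send** event before the **unfriend** event, and thus incorrectly send a notification to the ex-partner.
In this example, the notifications are effectively a join between the messages and the friend list, which makes it related to the timing issues of joins that we discussed previously (see “[Time-dependence of joins](/v1/ch11#连接的时间依赖性)”). Unfortunately, there does not seem to be a simple answer to this problem【2,3】. Starting points include:
* Logical timestamps can provide total ordering without coordination (see “[Sequence number ordering](/v1/ch9#序列号顺序)”), so they may help in cases where total order broadcast is not feasible. However, they still require recipients to handle events that are delivered out of order, and they require additional metadata to be passed around.
* If you can log an event to record the state of the system that the user saw before making a decision, and give that event a unique identifier, then any later events can reference that event identifier in order to record the causal dependency【4】. We will return to this idea in “[Reads are events too](#读也是事件)”.
* Conflict resolution algorithms (see “[Automatic conflict resolution](/v1/ch5#自动冲突解决)”) help with processing events that are delivered in an unexpected order. They are useful for maintaining state, but they do not help if actions have external side effects (such as sending a notification to a user).
Perhaps, over time, patterns for application development will emerge that allow causal dependencies to be captured efficiently and derived state to be maintained correctly, without forcing all events to go through the bottleneck of total order broadcast.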
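
One way to picture the second starting point (referencing an earlier event by its identifier) is the following sketch of the unfriend example. It is an illustration under strong assumptions, not an established pattern from the text: events are dictionaries, the `depends_on` field and the `NotificationService` class are hypothetical, and a held-back event simply waits in memory until its cause has been processed.

```python
class NotificationService:
    """Hypothetical consumer that respects explicit causal dependencies."""

    def __init__(self, friendships):
        self.friends = set(friendships)  # (user, friend) pairs
        self.processed = set()           # IDs of events applied so far
        self.waiting = []                # events whose dependency is missing
        self.notifications = []          # (recipient, text) pairs sent

    def handle(self, event):
        dependency = event.get("depends_on")
        if dependency is not None and dependency not in self.processed:
            self.waiting.append(event)   # hold back until the cause arrives
            return
        self._apply(event)
        # Release held-back events whose dependency is now satisfied.
        ready = [e for e in self.waiting if e["depends_on"] in self.processed]
        self.waiting = [e for e in self.waiting
                        if e["depends_on"] not in self.processed]
        for held in ready:
            self.handle(held)

    def _apply(self, event):
        if event["type"] == "unfriend":
            self.friends.discard((event["user"], event["friend"]))
        elif event["type"] == "message":
            recipients = sorted(f for (u, f) in self.friends
                                if u == event["user"])
            for recipient in recipients:
                self.notifications.append((recipient, event["text"]))
        self.processed.add(event["id"])
```

If the message event arrives first but declares that it depends on the unfriend event, no notification is sent until the unfriend event has been applied, so the ex-partner is not notified. The cost is the extra metadata and the need to buffer events, which matches the caveats listed above.
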
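### Batch and stream processing
I would say that the goal of data integration is to make sure that data ends up in the right form in all the right places. Doing so requires consuming inputs, transforming, joining, filtering, aggregating, training models, evaluating, and eventually writing to the appropriate outputs. Batch and stream processors are the tools for achieving this goal.
The outputs of batch and stream processes are derived datasets such as search indexes, materialized views, recommendations to show to users, aggregate metrics, and so on (see “[The output of batch workflows](/v1/ch10#批处理工作流的输出)” and “[Uses of stream processing](/v1/ch11#流处理的应用)”).
As we saw in [Chapter 10](/v1/ch10) and [Chapter 11](/v1/ch11), batch and stream processing have a lot of principles in common, and the main fundamental difference is that stream processors operate on unbounded datasets whereas batch process inputs are of a known, finite size. There are also many detailed differences in the ways the processing engines are implemented, but these distinctions are beginning to blur.
Spark performs stream processing on top of a batch processing engine by breaking the stream into **microbatches**, whereas Apache Flink performs batch processing on top of a stream processing engine【5】. In principle, one type of processing can be emulated on top of the other, although the performance characteristics vary: for example, microbatching may perform poorly on hopping or sliding windows【6】.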
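#### Maintaining derived state
Batch processing has a quite strong functional flavor (even if the code is not written in a functional programming language): it encourages deterministic, pure functions whose output depends only on the input and which have no side effects other than the explicit outputs, treating inputs as immutable and outputs as append-only. Stream processing is similar, but it extends operators to allow managed, fault-tolerant state (see “[Rebuilding state after a failure](/v1/ch11#失败后重建状态)”).
The principle of deterministic functions with well-defined inputs and outputs is not only good for fault tolerance (see “[Idempotence](/v1/ch11#幂等性)”), but also simplifies reasoning about the dataflows in an organization【7】. No matter whether the derived data is a search index, a statistical model, or a cache, it is helpful to think in terms of data pipelines that derive one thing from another, pushing state changes in one system through functional application code and applying the effects to derived systems.
In principle, derived data systems could be maintained synchronously, just as a relational database updates secondary indexes synchronously within the same transaction as writes to the table being indexed. However, asynchrony is what makes systems based on event logs robust: it allows a fault in one part of the system to be contained locally, whereas distributed transactions abort if any one participant fails, so they tend to amplify failures by spreading them to the rest of the system (see “[Limitations of distributed transactions](/v1/ch9#分布式事务的限制)”).
We saw in “[Partitioning and secondary indexes](/v1/ch6#分区与次级索引)” that secondary indexes often cross partition boundaries. A partitioned system with secondary indexes either needs to send writes to multiple partitions (if the index is term-partitioned) or send reads to all partitions (if the index is document-partitioned). Such cross-partition communication is also most reliable and scalable if the index is maintained asynchronously【8】 (see also “[Multi-partition data processing](#多分区数据处理)”).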
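#### Reprocessing data for application evolution
When maintaining derived data, batch and stream processing are both useful. Stream processing allows changes in the input to be reflected in derived views with low delay, whereas batch processing allows large amounts of accumulated historical data to be reprocessed in order to derive new views onto an existing dataset.
In particular, reprocessing existing data provides a good mechanism for maintaining a system, evolving it to support new features and changed requirements (see [Chapter 4](/v1/ch4)). Without reprocessing, schema evolution is limited to simple changes like adding a new optional field to a record or adding a new type of record. This is the case both in a schema-on-write and in a schema-on-read context (see “[Schema flexibility in the document model](/v1/ch2#文档模型中的模式灵活性)”). With reprocessing, on the other hand, it is possible to restructure a dataset into a completely different model in order to better serve new requirements.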
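> ### Schema migrations on railways
>
> Large-scale “schema migrations” occur in noncomputer systems as well. For example, in the early days of railway building in 19th-century England there were various competing standards for the gauge (the distance between the two rails). Trains built for one gauge could not run on tracks of another gauge, which restricted the possible interconnections in the train network【9】.
>
> After a single standard gauge was finally decided upon in 1846, tracks with other gauges had to be converted. But how do you do this without shutting down the train line for months or years? The solution is first to convert the track to **dual gauge** or **mixed gauge** by adding a third rail. This conversion can be done gradually, and when it is done, trains of both gauges can run on the line, using two of the three rails. Eventually, once all trains have been converted to the standard gauge, the rail providing the nonstandard gauge can be removed.
>
> “Reprocessing” the existing tracks in this way, and allowing the old and new versions to exist side by side, makes it possible to change the gauge gradually over the course of years. Nevertheless, it is an expensive undertaking, which is why nonstandard gauges still exist today. For example, the BART system in the San Francisco Bay Area uses a different gauge from the majority of the US.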
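Derived views allow **gradual evolution**. If you want to restructure a dataset, you do not need to perform the migration as a sudden switch. Instead, you can maintain the old schema and the new schema side by side as two independently derived views onto the same underlying data. You can then start shifting a small number of users to the new view in order to test its performance and find any bugs, while most users continue to be routed to the old view. Gradually, you can increase the proportion of users accessing the new view, and eventually you can drop the old view【10】.
The beauty of such a gradual migration is that every stage of the process is easily reversible if something goes wrong: you always have a working system to go back to. By reducing the risk of irreversible damage, you can be more confident about going ahead, and thus move faster to improve your system【11】.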
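
Shifting a growing share of users to the new view could be done in many ways; the following is one possible sketch, assuming that users are identified by a string ID and that a stable hash of that ID assigns each user to one of 100 buckets. The `bucket` and `choose_view` functions and the `new_view_percent` parameter are hypothetical names introduced here.

```python
import hashlib


def bucket(user_id):
    """Assign a user to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def choose_view(user_id, new_view_percent):
    """Route a user to the new view if their bucket is below the threshold."""
    return "new" if bucket(user_id) < new_view_percent else "old"
```

Because the bucket depends only on the user ID, a user who was moved to the new view at 10% stays there when the percentage is raised to 50%, and setting the percentage back to 0 reverses the migration for everyone.
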
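#### The lambda architecture
If batch processing is used to reprocess historical data, and stream processing is used to process recent updates, then how do you combine the two? The lambda architecture【12】 is a proposal in this area that has gained a lot of attention.
The core idea of the lambda architecture is that incoming data should be recorded by appending immutable events to an always-growing dataset, similar to event sourcing (see “[Event sourcing](/v1/ch11#事件溯源)”). To derive read-optimized views from these events, the lambda architecture proposes running two different systems in parallel: a batch processing system such as Hadoop MapReduce, and a separate stream processing system such as Storm.
In the lambda approach, the stream processor consumes the events and quickly produces an approximate update to the view; the batch processor later consumes the same set of events and produces a corrected version of the derived view. The reasoning behind this design is that batch processing is simpler and thus less prone to bugs, while stream processors are thought to be less reliable and harder to make fault-tolerant (see “[Fault tolerance](/v1/ch11#容错)”). Moreover, the stream process can use fast approximate algorithms while the batch process uses slower exact algorithms.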
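The lambda architecture was an influential idea that shaped the design of data systems for the better, particularly by popularizing the principle of deriving views onto streams of immutable events and reprocessing events when needed. However, I also think that it has a number of practical problems:
* Having to maintain the same logic in both a batch and a stream processing framework is significant additional effort. Although libraries such as Summingbird【13】 provide an abstraction for computations that can be run in either a batch or a streaming context, the operational complexity of debugging, tuning, and maintaining two different systems remains【14】.
* Since the stream pipeline and the batch pipeline produce separate outputs, they need to be merged in order to respond to user requests. This merge is fairly easy if the computation is a simple aggregation over a tumbling window, but it becomes significantly harder if the view is derived using more complex operations such as joins and sessionization, or if the output is not a time series.
* Although it is great to have the ability to reprocess the entire historical dataset, doing so on large datasets is often very expensive. Thus, the batch pipeline often needs to be set up to process incremental batches (for example, an hour's worth of data at the end of every hour) rather than reprocessing everything. This raises the problems discussed in “[Reasoning about time](/v1/ch11#时间推理)”, such as handling stragglers and handling windows that cross batch boundaries. Incrementalizing a batch computation adds complexity, making it more akin to the streaming layer, which runs counter to the goal of keeping the batch layer as simple as possible.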
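
The merge step from the second point is easy in the simplest case, a count per key, which the following sketch illustrates. It assumes events are `(timestamp, page)` pairs and that the batch layer has processed everything before some cutoff time; `batch_view`, `speed_view`, and `query` are hypothetical names, and both layers compute exact counts here for simplicity, whereas a real speed layer might use an approximate algorithm.

```python
def batch_view(events, up_to):
    """Page-view counts from events with timestamp < up_to (batch layer)."""
    counts = {}
    for timestamp, page in events:
        if timestamp < up_to:
            counts[page] = counts.get(page, 0) + 1
    return counts


def speed_view(events, since):
    """Page-view counts from events with timestamp >= since (speed layer)."""
    counts = {}
    for timestamp, page in events:
        if timestamp >= since:
            counts[page] = counts.get(page, 0) + 1
    return counts


def query(page, batch, speed):
    """Answer a request by merging the two separately produced views."""
    return batch.get(page, 0) + speed.get(page, 0)
```

The merge is a plain addition only because counts over disjoint time ranges combine trivially. For joins or sessions that span the cutoff, no such simple rule exists, which is the difficulty the list item describes.
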
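#### Unifying batch and stream processing
More recent work has enabled the benefits of the lambda architecture to be enjoyed without its downsides, by allowing both batch computations (reprocessing historical data) and stream computations (processing events as they arrive) to be implemented in the same system【15】.
Unifying batch and stream processing in one system requires the following features, which are becoming increasingly widely available:
* The ability to replay historical events through the same processing engine that handles the stream of recent events. For example, log-based message brokers can replay messages (see “[Replaying old messages](/v1/ch11#重播旧消息)”), and some stream processors can read input from a distributed filesystem such as HDFS.
* Exactly-once semantics for stream processors, that is, ensuring that the output is the same as if no faults had occurred, even if faults did in fact occur (see “[Fault tolerance](/v1/ch11#容错)”). As with batch processing, this requires discarding the partial output of any failed tasks.
* Tools for windowing by event time, not by processing time, since processing time is meaningless when reprocessing historical events (see “[Reasoning about time](/v1/ch11#时间推理)”). For example, Apache Beam provides an API for expressing such computations, which can then be run using Apache Flink or Google Cloud Dataflow.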
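
To see why windowing by event time makes replay meaningful, consider this minimal sketch of a tumbling-window count. It does not use the Apache Beam API; it assumes events are `(event_time, payload)` pairs with non-negative integer timestamps in seconds, and `tumbling_window_counts` is a hypothetical helper.

```python
from collections import Counter


def tumbling_window_counts(events, window_seconds):
    """Count events per tumbling window, keyed by the window's start time."""
    counts = Counter()
    for event_time, _payload in events:
        # The window is determined by the timestamp inside the event,
        # not by the time at which the event happens to be processed.
        window_start = event_time - (event_time % window_seconds)
        counts[window_start] += 1
    return dict(counts)
```

Since each event is assigned to a window by its own timestamp, the result is the same whether the events are processed live, out of order, or replayed years later from a log.
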
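## Unbundling databases
At a most abstract level, databases, Hadoop, and operating systems all perform the same functions: they store some data, and they allow you to process and query that data【16】. A database stores data in records of some data model (rows in tables, documents, vertices in a graph, and so on), while an operating system's filesystem stores data in files; but at their core, both are “information management” systems【17】. As we saw in [Chapter 10](/v1/ch10), the Hadoop ecosystem is somewhat like a distributed version of Unix.
Of course, there are many practical differences. For example, many filesystems do not cope very well with a directory containing 10 million small files, whereas a database containing 10 million small records is completely normal and unremarkable. Nevertheless, the similarities and differences between operating systems and databases are worth exploring.
Unix and relational databases have approached the information management problem with very different philosophies. Unix viewed its purpose as presenting programmers with a logical but fairly low-level hardware abstraction, whereas relational databases wanted to give application programmers a high-level abstraction that would hide the complexities of data structures on disk, concurrency, crash recovery, and so on. Unix developed pipes and files that are just sequences of bytes, whereas databases developed SQL and transactions.
Which approach is better? Of course, it depends on what you want. Unix is “simpler” in the sense that it is a fairly thin wrapper around hardware resources; relational databases are “simpler” in the sense that a short declarative query can draw on a lot of powerful infrastructure (query optimization, indexes, join methods, concurrency control, replication, and so on) without the author of the query needing to understand the implementation details.
The tension between these philosophies has lasted for decades (both Unix and the relational model emerged in the early 1970s) and still is not resolved. For example, I would interpret the NoSQL movement as wanting to apply a Unix-esque approach of low-level abstractions to the domain of distributed OLTP data storage.
In this section I will attempt to reconcile the two philosophies, in the hope that we can combine the best of both worlds.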
### 组合使用数据存储技术
在本书的过程中,我们讨论了数据库提供的各种功能及其工作原理,其中包括:
* 次级索引,使你可以根据字段的值有效地搜索记录(请参阅 “[其他索引结构](/v1/ch3#其他索引结构)”)
* 物化视图,这是一种预计算的查询结果缓存(请参阅 “[聚合:数据立方体和物化视图](/v1/ch3#聚合:数据立方体和物化视图)”)
* 复制日志,保持其他节点上数据的副本最新(请参阅 “[复制日志的实现](/v1/ch5#复制日志的实现)”)
* 全文搜索索引,允许在文本中进行关键字搜索(请参阅 “[全文搜索和模糊索引](/v1/ch3#全文搜索和模糊索引)”),也内置于某些关系数据库【1】
在 [第十章](/v1/ch10) 和 [第十一章](/v1/ch11) 中,出现了类似的主题。我们讨论了如何构建全文搜索索引(请参阅 “[批处理工作流的输出](/v1/ch10#批处理工作流的输出)”),了解了如何维护物化视图(请参阅 “[维护物化视图](/v1/ch11#维护物化视图)”)以及如何将变更从数据库复制到衍生数据系统(请参阅 “[变更数据捕获](/v1/ch11#变更数据捕获)”)。
数据库中内置的功能与人们用批处理和流处理器构建的衍生数据系统似乎有相似之处。
#### 创建索引
想想当你运行 `CREATE INDEX` 在关系数据库中创建一个新的索引时会发生什么。数据库必须扫描表的一致性快照,挑选出所有被索引的字段值,对它们进行排序,然后写出索引。然后它必须处理自一致快照以来所做的写入操作(假设表在创建索引时未被锁定,所以写操作可能会继续)。一旦完成,只要事务写入表中,数据库就必须继续保持索引最新。
此过程非常类似于设置新的从库副本(请参阅 “[设置新从库](/v1/ch5#设置新从库)”),也非常类似于流处理系统中的 **引导(bootstrap)** 变更数据捕获(请参阅 “[初始快照](/v1/ch11#初始快照)”)。
无论何时运行 `CREATE INDEX`,数据库都会重新处理现有数据集(如 “[应用演化后重新处理数据](#应用演化后重新处理数据)” 中所述),并将该索引作为新视图导出到现有数据上。现有数据可能是状态的快照,而不是所有发生变化的日志,但两者密切相关(请参阅 “[状态、流和不变性](/v1/ch11#状态、流和不变性)”)。
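The two phases of index creation described above, scanning a snapshot and then catching up on later writes, can be sketched as follows. This is a simplified illustration, not how any particular database implements `CREATE INDEX`; the row layout and the `(row_id, old_row, new_row)` change format are assumptions.

```python
from collections import defaultdict

def build_index(snapshot, field):
    """Phase 1: scan a consistent snapshot {row_id: row} and index one field."""
    index = defaultdict(set)
    for row_id, row in snapshot.items():
        index[row[field]].add(row_id)
    return index

def apply_change(index, field, row_id, old_row, new_row):
    """Phase 2: keep the index up to date with one change from the log."""
    if old_row is not None:
        index[old_row[field]].discard(row_id)
    if new_row is not None:
        index[new_row[field]].add(row_id)

snapshot = {1: {"city": "Oslo"}, 2: {"city": "Lima"}}
# Writes made after the snapshot was taken, in log order.
changes = [
    (3, None, {"city": "Oslo"}),              # insert
    (2, {"city": "Lima"}, {"city": "Oslo"}),  # update
    (1, {"city": "Oslo"}, None),              # delete
]
index = build_index(snapshot, "city")
for row_id, old_row, new_row in changes:
    apply_change(index, "city", row_id, old_row, new_row)
```

The same snapshot-then-stream shape appears when bootstrapping a follower replica or a change data capture consumer.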
#### 一切的元数据库
有鉴于此,我认为整个组织的数据流开始像一个巨大的数据库【7】。每当批处理、流或 ETL 过程将数据从一个地方传输到另一个地方并组装时,它表现地就像数据库子系统一样,使索引或物化视图保持最新。
从这种角度来看,批处理和流处理器就像精心实现的触发器、存储过程和物化视图维护例程。它们维护的衍生数据系统就像不同的索引类型。例如,关系数据库可能支持 B 树索引、散列索引、空间索引(请参阅 “[多列索引](/v1/ch3#多列索引)”)以及其他类型的索引。在新兴的衍生数据系统架构中,不是将这些设施作为单个集成数据库产品的功能实现,而是由各种不同的软件提供,运行在不同的机器上,由不同的团队管理。
这些发展在未来将会把我们带到哪里?如果我们从没有适合所有访问模式的单一数据模型或存储格式的前提出发,我推测有两种途径可以将不同的存储和处理工具组合成一个有凝聚力的系统:
**联合数据库:统一读取**
可以为各种各样的底层存储引擎和处理方法提供一个统一的查询接口 —— 一种称为 **联合数据库(federated database)** 或 **多态存储(polystore)** 的方法【18,19】。例如,PostgreSQL 的 **外部数据包装器(foreign data wrapper)** 功能符合这种模式【20】。需要专用数据模型或查询接口的应用程序仍然可以直接访问底层存储引擎,而想要组合来自不同位置的数据的用户可以通过联合接口轻松完成操作。
联合查询接口遵循着单一集成系统的关系型传统,带有高级查询语言和优雅的语义,但实现起来非常复杂。
**分拆数据库:统一写入**
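While federation addresses read-only querying across several different systems, it does not have a good answer to **synchronizing** writes across those systems. We said that within a single database, creating a consistent index is a built-in feature. When we compose several storage systems, we similarly need to ensure that all data changes end up in all the right places, even in the face of faults. Making it easier to reliably plug storage systems together (for example, through change data capture and event logs) is like unbundling a database's index-maintenance features in a way that can synchronize writes across disparate technologies【7,21】.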
分拆方法遵循 Unix 传统的小型工具,它可以很好地完成一件事【22】,通过统一的低层级 API(管道)进行通信,并且可以使用更高层级的语言进行组合(shell)【16】 。
#### 开展分拆工作
联合和分拆是一个硬币的两面:用不同的组件构成可靠、 可伸缩和可维护的系统。联合只读查询需要将一个数据模型映射到另一个数据模型,这需要一些思考,但最终还是一个可解决的问题。而我认为同步写入到几个存储系统是更困难的工程问题,所以我将重点关注它。
传统的同步写入方法需要跨异构存储系统的分布式事务【18】,我认为这是错误的解决方案(请参阅 “[衍生数据与分布式事务](#衍生数据与分布式事务)”)。单个存储或流处理系统内的事务是可行的,但是当数据跨越不同技术之间的边界时,我认为具有幂等写入的异步事件日志是一种更加健壮和实用的方法。
例如,分布式事务在某些流处理组件内部使用,以匹配 **恰好一次(exactly-once)** 语义(请参阅 “[原子提交再现](/v1/ch11#原子提交再现)”),这可以很好地工作。然而,当事务需要涉及由不同人群编写的系统时(例如,当数据从流处理组件写入分布式键值存储或搜索索引时),缺乏标准化的事务协议会使集成更难。有幂等消费者的有序事件日志(请参阅 “[幂等性](/v1/ch11#幂等性)”)是一种更简单的抽象,因此在异构系统中实现更加可行【7】。
基于日志的集成的一大优势是各个组件之间的 **松散耦合(loose coupling)**,这体现在两个方面:
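1. At a system level, asynchronous event streams make the system as a whole more robust to outages or performance degradation of individual components. If a consumer runs slow or fails, the event log can buffer messages (see "[Disk space usage](/v1/ch11#磁盘空间使用)"), allowing the producer and any other consumers to continue running unaffected. The faulty consumer can catch up once it is fixed, so it does not miss any data, and the fault is contained. By contrast, the synchronous interaction of distributed transactions tends to escalate local faults into large-scale failures (see "[Limitations of distributed transactions](/v1/ch9#分布式事务的限制)").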
2. 在人力方面,分拆数据系统允许不同的团队独立开发,改进和维护不同的软件组件和服务。专业化使得每个团队都可以专注于做好一件事,并与其他团队的系统以明确的接口交互。事件日志提供了一个足够强大的接口,以捕获相当强的一致性属性(由于持久性和事件的顺序),但也足够普适于几乎任何类型的数据。
#### 分拆系统vs集成系统
如果分拆确实成为未来的方式,它也不会取代目前形式的数据库 —— 它们仍然会像以往一样被需要。为了维护流处理组件中的状态,数据库仍然是需要的,并且为批处理和流处理器的输出提供查询服务(请参阅 “[批处理工作流的输出](/v1/ch10#批处理工作流的输出)” 与 “[流处理](/v1/ch11#流处理)”)。专用查询引擎对于特定的工作负载仍然非常重要:例如,MPP 数据仓库中的查询引擎针对探索性分析查询进行了优化,并且能够很好地处理这种类型的工作负载(请参阅 “[Hadoop 与分布式数据库的对比](/v1/ch10#Hadoop与分布式数据库的对比)”)。
运行几种不同基础设施的复杂性可能是一个问题:每种软件都有一个学习曲线,配置问题和操作怪癖,因此部署尽可能少的移动部件是很有必要的。比起使用应用代码拼接多个工具而成的系统,单一集成软件产品也可以在其设计应对的工作负载类型上实现更好、更可预测的性能【23】。正如在前言中所说的那样,为了不需要的规模而构建系统是白费精力,而且可能会将你锁死在一个不灵活的设计中。实际上,这是一种过早优化的形式。
分拆的目标不是要针对个别数据库与特定工作负载的性能进行竞争;我们的目标是允许你结合多个不同的数据库,以便在比单个软件可能实现的更广泛的工作负载范围内实现更好的性能。这是关于广度,而不是深度 —— 与我们在 “[Hadoop 与分布式数据库的对比](/v1/ch10#Hadoop与分布式数据库的对比)” 中讨论的存储和处理模型的多样性一样。
因此,如果有一项技术可以满足你的所有需求,那么最好使用该产品,而不是试图用更低层级的组件重新实现它。只有当没有单一软件满足你的所有需求时,才会出现拆分和联合的优势。
#### 少了什么?
用于组成数据系统的工具正在变得越来越好,但我认为还缺少一个主要的东西:我们还没有与 Unix shell 类似的分拆数据库等价物(即,一种声明式的、简单的、用于组装存储和处理系统的高级语言)。
例如,如果我们可以简单地声明 `mysql | elasticsearch`,类似于 Unix 管道【22】,成为 `CREATE INDEX` 的分拆等价物:它将读取 MySQL 数据库中的所有文档并将其索引到 Elasticsearch 集群中。然后它会不断捕获对数据库所做的所有变更,并自动将它们应用于搜索索引,而无需编写自定义应用代码。这种集成应当支持几乎任何类型的存储或索引系统。
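Similarly, it would be great to be able to precompute and update caches more easily. Recall that a materialized view is essentially a precomputed cache, so you could imagine creating a cache by declaratively specifying materialized views for complex queries, including recursive queries on graphs (see "[Graph-like data models](/v1/ch2#图数据模型)") and application logic. There is interesting early-stage research in this area, such as **differential dataflow**【24,25】, and I hope that these ideas will find their way into production systems.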
### 围绕数据流设计应用
使用应用代码组合专用存储与处理系统来分拆数据库的方法,也被称为 “**数据库由内而外(database inside-out)**” 方法【26】,该名称来源于我在 2014 年的一次会议演讲标题【27】。然而称它为 “新架构” 过于夸大,我仅将其看作是一种设计模式,一个讨论的起点,我们只是简单地给它起一个名字,以便我们能更好地讨论它。
这些想法不是我的;它们是很多人的思想的融合,这些思想非常值得我们学习。尤其是,以 Oz【28】和 Juttle【29】为代表的数据流语言,以 Elm【30,31】为代表的 **函数式响应式编程(functional reactive programming, FRP)**,以 Bloom【32】为代表的逻辑编程语言。在这一语境中的术语 **分拆(unbundling)** 是由 Jay Kreps 提出的【7】。
即使是 **电子表格** 也在数据流编程能力上甩开大多数主流编程语言几条街【33】。在电子表格中,可以将公式放入一个单元格中(例如,对另一列中的单元格求和),并且只要公式的任何输入发生变更,公式的结果都会自动重新计算。这正是我们在数据系统层次所需要的:当数据库中的记录发生变更时,我们希望自动更新该记录的任何索引,并且自动刷新依赖于记录的任何缓存视图或聚合。你不必担心这种刷新如何发生的技术细节,但能够简单地相信它可以正常工作。
因此,我认为绝大多数数据系统仍然可以从 VisiCalc 在 1979 年已经具备的功能中学习【34】。与电子表格的不同之处在于,今天的数据系统需要具有容错性,可伸缩性以及持久存储数据。它们还需要能够整合不同人群编写的不同技术,并重用现有的库和服务:期望使用某一种特定的语言、框架或工具来开发所有软件是不切实际的。
在本节中,我将详细介绍这些想法,并探讨一些围绕分拆数据库和数据流的想法构建应用的方法。
#### 应用代码作为衍生函数
当一个数据集衍生自另一个数据集时,它会经历某种转换函数。例如:
* 次级索引是由一种直白的转换函数生成的衍生数据集:对于基础表中的每行或每个文档,它挑选被索引的列或字段中的值,并按这些值排序(假设使用 B 树或 SSTable 索引,按键排序,如 [第三章](/v1/ch3) 所述)。
* 全文搜索索引是通过应用各种自然语言处理函数而创建的,诸如语言检测、分词、词干或词汇化、拼写纠正和同义词识别,然后构建用于高效查找的数据结构(例如倒排索引)。
* 在机器学习系统中,我们可以将模型视作从训练数据通过应用各种特征提取、统计分析函数衍生的数据,当模型应用于新的输入数据时,模型的输出是从输入和模型(因此间接地从训练数据)中衍生的。
* 缓存通常包含将以用户界面(UI)显示的形式的数据聚合。因此填充缓存需要知道 UI 中引用的字段;UI 中的变更可能需要更新缓存填充方式的定义,并重建缓存。
用于次级索引的衍生函数是如此常用的需求,以致于它作为核心功能被内建至许多数据库中,你可以简单地通过 `CREATE INDEX` 来调用它。对于全文索引,常见语言的基本语言特征可能内置到数据库中,但更复杂的特征通常需要领域特定的调整。在机器学习中,特征工程是众所周知的特定于应用的特征,通常需要包含很多关于用户交互与应用部署的详细知识【35】。
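When the function that creates a derived dataset is not a standard cookie-cutter function like creating a secondary index, custom code is required to handle the application-specific aspects. This custom code is where many databases struggle: although relational databases commonly support triggers, stored procedures, and user-defined functions, which can be used to execute application code within the database, these features have been somewhat of an afterthought in database design (see "[Transmitting event streams](/v1/ch11#传递事件流)").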
#### 应用代码和状态的分离
理论上,数据库可以是任意应用代码的部署环境,就如同操作系统一样。然而实践中它们对这一目标适配的很差。它们不满足现代应用开发的要求,例如依赖和软件包管理、版本控制、滚动升级、可演化性、监控、指标、对网络服务的调用以及与外部系统的集成。
另一方面,Mesos、YARN、Docker、Kubernetes 等部署和集群管理工具专为运行应用代码而设计。通过专注于做好一件事情,他们能够做得比将数据库作为其众多功能之一执行用户定义的功能要好得多。
我认为让系统的某些部分专门用于持久数据存储并让其他部分专门运行应用程序代码是有意义的。这两者可以在保持独立的同时互动。
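Most web applications today are deployed as stateless services, in which any user request can be routed to any application server, and the server forgets everything about the request once it has sent the response. This style of deployment is convenient, as servers can be added or removed at will, but the state has to go somewhere: typically, a database. The trend has been to keep stateless application logic separate from state management (databases): not putting application logic in the database and not putting persistent state in the application【36】. As people in the functional programming community like to joke, "We believe in the separation of **Church** and **state**"【37】. [^i]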
[^i]: 解释笑话很少会让人感觉更好,但我不想让任何人感到被遗漏。在这里,Church 指代的是数学家的阿隆佐・邱奇,他创立了 lambda 演算,这是计算的早期形式,是大多数函数式编程语言的基础。lambda 演算不具有可变状态(即没有变量可以被覆盖),所以可以说可变状态与 Church 的工作是分离的。
在这个典型的 Web 应用模型中,数据库充当一种可以通过网络同步访问的可变共享变量。应用程序可以读取和更新变量,而数据库负责维持它的持久性,提供一些诸如并发控制和容错的功能。
但是,在大多数编程语言中,你无法订阅可变变量中的变更 —— 你只能定期读取它。与电子表格不同,如果变量的值发生变化,变量的读者不会收到通知(你可以在自己的代码中实现这样的通知 —— 这被称为 **观察者模式** —— 但大多数语言没有将这种模式作为内置功能)。
数据库继承了这种可变数据的被动方法:如果你想知道数据库的内容是否发生了变化,通常你唯一的选择就是轮询(即定期重复你的查询)。订阅变更只是刚刚开始出现的功能(请参阅 “[变更流的 API 支持](/v1/ch11#变更流的API支持)”)。
#### 数据流:应用代码与状态变化的交互
从数据流的角度思考应用程序,意味着重新协调应用代码和状态管理之间的关系。我们不再将数据库视作被应用操纵的被动变量,取而代之的是更多地考虑状态,状态变更和处理它们的代码之间的相互作用与协同关系。应用代码通过在另一个地方触发状态变更来响应状态变更。
我们在 “[数据库与流](/v1/ch11#数据库与流)” 中看到了这一思路,我们讨论了将数据库的变更日志视为一种我们可以订阅的事件流。诸如 Actor 的消息传递系统(请参阅 “[消息传递中的数据流](/v1/ch4#消息传递中的数据流)”)也具有响应事件的概念。早在 20 世纪 80 年代,**元组空间(tuple space)** 模型就已经探索了表达分布式计算的方式:观察状态变更并作出反应的过程【38,39】。
如前所述,当触发器由于数据变更而被触发时,或次级索引更新以反映索引表中的变更时,数据库内部也发生着类似的情况。分拆数据库意味着将这个想法应用于在主数据库之外,用于创建衍生数据集:缓存、全文搜索索引、机器学习或分析系统。我们可以为此使用流处理和消息传递系统。
需要记住的重要一点是,维护衍生数据不同于执行异步任务。传统的消息传递系统通常是为执行异步任务设计的(请参阅 “[日志与传统的消息传递相比](/v1/ch11#日志与传统的消息传递相比)”):
* 在维护衍生数据时,状态变更的顺序通常很重要(如果多个视图是从事件日志衍生的,则需要按照相同的顺序处理事件,以便它们之间保持一致)。如 “[确认与重新传递](/v1/ch11#确认与重新传递)” 中所述,许多消息代理在重传未确认消息时没有此属性,双写也被排除在外(请参阅 “[保持系统同步](/v1/ch11#保持系统同步)”)。
* 容错是衍生数据的关键:仅仅丢失单个消息就会导致衍生数据集永远与其数据源失去同步。消息传递和衍生状态更新都必须可靠。例如,许多 Actor 系统默认在内存中维护 Actor 的状态和消息,所以如果运行 Actor 的机器崩溃,状态和消息就会丢失。
稳定的消息排序和容错消息处理是相当严格的要求,但与分布式事务相比,它们开销更小,运行更稳定。现代流处理组件可以提供这些排序和可靠性保证,并允许应用代码以流算子的形式运行。
这些应用代码可以执行任意处理,包括数据库内置衍生函数通常不提供的功能。就像通过管道链接的 Unix 工具一样,流算子可以围绕着数据流构建大型系统。每个算子接受状态变更的流作为输入,并产生其他状态变化的流作为输出。
#### 流处理器和服务
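The currently fashionable style of application development involves breaking functionality down into a set of **services** that communicate via synchronous network requests such as REST APIs (see "[Dataflow through services: REST and RPC](/v1/ch4#服务中的数据流:REST与RPC)"). The main advantage of such a service-oriented architecture over a single monolithic application is organizational scalability through loose coupling: different teams can work on different services, which reduces the coordination effort between teams (since services can be deployed and updated independently).
Composing stream operators into dataflow systems has a lot in common with the microservices approach【40】. However, the underlying communication mechanism is very different: one-directional, asynchronous message streams rather than synchronous request/response interactions.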
除了在 “[消息传递中的数据流](/v1/ch4#消息传递中的数据流)” 中列出的优点(如更好的容错性),数据流系统还能实现更好的性能。例如,假设客户正在购买以一种货币定价,但以另一种货币支付的商品。为了执行货币换算,你需要知道当前的汇率。这个操作可以通过两种方式实现【40,41】:
1. 在微服务方法中,处理购买的代码可能会查询汇率服务或数据库,以获取特定货币的当前汇率。
2. 在数据流方法中,处理订单的代码会提前订阅汇率变更流,并在汇率发生变动时将当前汇率存储在本地数据库中。处理订单时只需查询本地数据库即可。
第二种方法能将对另一服务的同步网络请求替换为对本地数据库的查询(可能在同一台机器甚至同一个进程中)[^ii]。数据流方法不仅更快,而且当其他服务失效时也更稳健。最快且最可靠的网络请求就是压根没有网络请求!我们现在不再使用 RPC,而是在购买事件和汇率更新事件之间建立流联接(请参阅 “[流表连接(流扩充)](/v1/ch11#流表连接(流扩充))”)。
[^ii]: 在微服务方法中,你也可以通过在处理购买的服务中本地缓存汇率来避免同步网络请求。但是为了保证缓存的新鲜度,你需要定期轮询汇率以获取其更新,或订阅变更流 —— 这恰好是数据流方法中发生的事情。
连接是时间相关的:如果购买事件在稍后的时间点被重新处理,汇率可能已经改变。如果要重建原始输出,则需要获取原始购买时的历史汇率。无论是查询服务还是订阅汇率更新流,你都需要处理这种时间相关性(请参阅 “[连接的时间依赖性](/v1/ch11#连接的时间依赖性)”)。
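A minimal sketch of the dataflow variant, assuming a hypothetical `RateTable` that is fed by the exchange-rate update stream and kept locally by the purchase processor. To handle the time dependence just described, it keeps the history of rates rather than only the latest one, so a reprocessed purchase gets the rate that applied at its own timestamp.

```python
import bisect

class RateTable:
    """Local state fed by a stream of (timestamp, currency, rate) updates."""
    def __init__(self):
        self._times = {}  # currency -> sorted list of update timestamps
        self._rates = {}  # currency -> rates, parallel to _times

    def on_rate_update(self, ts, currency, rate):
        # Assumes updates for one currency arrive in timestamp order.
        self._times.setdefault(currency, []).append(ts)
        self._rates.setdefault(currency, []).append(rate)

    def rate_at(self, currency, ts):
        """Rate in effect at time ts: the latest update with timestamp <= ts."""
        i = bisect.bisect_right(self._times[currency], ts) - 1
        if i < 0:
            raise LookupError("no rate known yet for this time")
        return self._rates[currency][i]

def convert_purchase(purchase, rates):
    """Stream-table join: enrich a purchase event from local state, no RPC."""
    rate = rates.rate_at(purchase["currency"], purchase["ts"])
    return round(purchase["amount"] * rate, 2)

rates = RateTable()
rates.on_rate_update(10, "EUR", 1.10)
rates.on_rate_update(20, "EUR", 1.20)
purchase = {"ts": 15, "currency": "EUR", "amount": 100.0}
converted = convert_purchase(purchase, rates)
```

Looking up by the purchase timestamp makes the output reproducible on replay, at the cost of retaining old rates.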
订阅变更流,而不是在需要时查询当前状态,使我们更接近类似电子表格的计算模型:当某些数据发生变更时,依赖于此的所有衍生数据都可以快速更新。还有很多未解决的问题,例如关于时间相关连接等问题,但我认为围绕数据流构建应用的想法是一个非常有希望的方向。
### 观察衍生数据状态
在抽象层面,上一节讨论的数据流系统提供了创建衍生数据集(例如搜索索引、物化视图和预测模型)并使其保持更新的过程。我们将这个过程称为 **写路径(write path)**:只要某些信息被写入系统,它可能会经历批处理与流处理的多个阶段,而最终每个衍生数据集都会被更新,以适配写入的数据。[图 12-1](/v1/ddia_1201.png) 显示了一个更新搜索索引的例子。

**图 12-1 在搜索索引中,写(文档更新)遇上读(查询)**
但你为什么一开始就要创建衍生数据集?很可能是因为你想在以后再次查询它。这就是 **读路径(read path)**:当服务用户请求时,你需要从衍生数据集中读取,也许还要对结果进行一些额外处理,然后构建给用户的响应。
总而言之,写路径和读路径涵盖了数据的整个旅程,从收集数据开始,到使用数据结束(可能是由另一个人)。写路径是预计算过程的一部分 —— 即,一旦数据进入,即刻完成,无论是否有人需要看它。读路径是这个过程中只有当有人请求时才会发生的部分。如果你熟悉函数式编程语言,则可能会注意到写路径类似于立即求值,读路径类似于惰性求值。
如 [图 12-1](/v1/ddia_1201.png) 所示,衍生数据集是写路径和读路径相遇的地方。它代表了在写入时需要完成的工作量与在读取时需要完成的工作量之间的权衡。
#### 物化视图和缓存
全文搜索索引就是一个很好的例子:写路径更新索引,读路径在索引中搜索关键字。读写都需要做一些工作。写入需要更新文档中出现的所有关键词的索引条目。读取需要搜索查询中的每个单词,并应用布尔逻辑来查找包含查询中所有单词(AND 运算符)的文档,或者每个单词(OR 运算符)的任何同义词。
如果没有索引,搜索查询将不得不扫描所有文档(如 grep),如果有着大量文档,这样做的开销巨大。没有索引意味着写入路径上的工作量较少(没有要更新的索引),但是在读取路径上需要更多工作。
另一方面,可以想象为所有可能的查询预先计算搜索结果。在这种情况下,读路径上的工作量会减少:不需要布尔逻辑,只需查找查询结果并返回即可。但写路径会更加昂贵:可能的搜索查询集合是无限大的,因此预先计算所有可能的搜索结果将需要无限的时间和存储空间。那肯定没戏 [^iii]。
[^iii]: 假设一个有限的语料库,那么返回非空搜索结果的搜索查询集合是有限的。然而,它是与语料库中的术语数量呈指数关系,这仍是一个坏消息。
另一种选择是预先计算一组固定的最常见查询的搜索结果,以便可以快速提供它们而无需转到索引。不常见的查询仍然可以通过索引来提供服务。这通常被称为常见查询的 **缓存(cache)**,尽管我们也可以称之为 **物化视图(materialized view)**,因为当新文档出现,且需要被包含在这些常见查询的搜索结果之中时,这些索引就需要更新。
从这个例子中我们可以看到,索引不是写路径和读路径之间唯一可能的边界;缓存常见搜索结果也是可行的;而在少量文档上使用没有索引的类 grep 扫描也是可行的。由此来看,缓存,索引和物化视图的作用很简单:它们改变了读路径与写路径之间的边界。通过预先计算结果,从而允许我们在写路径上做更多的工作,以节省读路径上的工作量。
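The three positions of the boundary can be put side by side in a toy example. The `SearchIndex` class below is an illustrative assumption, not a real search engine: `grep` does all the work at read time, `search` uses an inverted index maintained on the write path, and a small set of common queries is kept as a materialized view that is updated whenever a matching document is written.

```python
from collections import defaultdict

class SearchIndex:
    def __init__(self, cached_queries):
        self.docs = {}
        self.postings = defaultdict(set)  # word -> ids of docs containing it
        # Precomputed results for a fixed set of common queries.
        self.cache = {q: set() for q in cached_queries}

    def add_document(self, doc_id, text):
        """Write path: update the index and any affected cached results."""
        words = set(text.lower().split())
        self.docs[doc_id] = text
        for w in words:
            self.postings[w].add(doc_id)
        for q in self.cache:
            if set(q.split()) <= words:
                self.cache[q].add(doc_id)

    def search(self, query):
        """Read path: AND query, served from the cache when possible."""
        query = query.lower()
        if query in self.cache:
            return self.cache[query]
        sets = [self.postings.get(w, set()) for w in query.split()]
        return set.intersection(*sets) if sets else set()

    def grep(self, query):
        """No-index alternative: scan every document at read time."""
        words = query.lower().split()
        return {d for d, t in self.docs.items()
                if all(w in t.lower().split() for w in words)}

idx = SearchIndex(cached_queries=["stream processing"])
idx.add_document(1, "Stream processing with logs")
idx.add_document(2, "Batch processing with files")
```

All three methods return the same answers; they differ only in how much of the work happens when a document is written versus when a query arrives.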
在写路径上完成的工作和读路径之间的界限,实际上是本书开始处在 “[描述负载](/v1/ch1#描述负载)” 中推特例子里谈到的主题。在该例中,我们还看到了与普通用户相比,名人的写路径和读路径可能有所不同。在 500 页之后,我们已经绕回了起点!
#### 有状态、可离线的客户端
我发现写路径和读路径之间的边界很有趣,因为我们可以试着改变这个边界,并探讨这种改变的实际意义。我们来看看不同上下文中的这一想法。
过去二十年来,Web 应用的火热让我们对应用开发作出了一些很容易视作理所当然的假设。具体来说就是,客户端 / 服务器模型 —— 客户端大多是无状态的,而服务器拥有数据的权威 —— 已经普遍到我们几乎忘掉了还有其他任何模型的存在。但是技术在不断地发展,我认为不时地质疑现状非常重要。
传统上,网络浏览器是无状态的客户端,只有当连接到互联网时才能做一些有用的事情(能离线执行的唯一事情基本上就是上下滚动之前在线时加载好的页面)。然而,最近的 “单页面” JavaScript Web 应用已经获得了很多有状态的功能,包括客户端用户界面交互,以及 Web 浏览器中的持久化本地存储。移动应用可以类似地在设备上存储大量状态,而且大多数用户交互都不需要与服务器往返交互。
这些不断变化的功能重新引发了对 **离线优先(offline-first)** 应用的兴趣,这些应用尽可能地在同一设备上使用本地数据库,无需连接互联网,并在后台网络连接可用时与远程服务器同步【42】。由于移动设备通常具有缓慢且不可靠的蜂窝网络连接,因此,如果用户的用户界面不必等待同步网络请求,且应用主要是离线工作的,则这是一个巨大优势(请参阅 “[需要离线操作的客户端](/v1/ch5#需要离线操作的客户端)”)。
当我们摆脱无状态客户端与中央数据库交互的假设,并转向在终端用户设备上维护状态时,这就开启了新世界的大门。特别是,我们可以将设备上的状态视为 **服务器状态的缓存**。屏幕上的像素是客户端应用中模型对象的物化视图;模型对象是远程数据中心的本地状态副本【27】。
#### 将状态变更推送给客户端
在典型的网页中,如果你在 Web 浏览器中加载页面,并且随后服务器上的数据发生变更,则浏览器在重新加载页面之前对此一无所知。浏览器只能在一个时间点读取数据,假设它是静态的 —— 它不会订阅来自服务器的更新。因此设备上的状态是陈旧的缓存,除非你显式轮询变更否则不会更新。(像 RSS 这样基于 HTTP 的 Feed 订阅协议实际上只是一种基本的轮询形式)
最近的协议已经超越了 HTTP 的基本请求 / 响应模式:服务端发送的事件(EventSource API)和 WebSockets 提供了通信信道,通过这些信道,Web 浏览器可以与服务器保持打开的 TCP 连接,只要浏览器仍然连接着,服务器就能主动向浏览器推送信息。这为服务器提供了主动通知终端用户客户端的机会,服务器能告知客户端其本地存储状态的任何变化,从而减少客户端状态的陈旧程度。
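In terms of our model of write path and read path, actively pushing state changes all the way to client devices means extending the write path all the way to the end user. When a client is first initialized, it still needs to use a read path to get its initial state, but thereafter it can rely on a stream of state changes sent by the server. The ideas we discussed for stream processing and messaging are not restricted to running only in a datacenter: we can take them further and extend them all the way to end-user devices【43】.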
这些设备有时会离线,并在此期间无法收到服务器状态变更的任何通知。但是我们已经解决了这个问题:在 “[消费者偏移量](/v1/ch11#消费者偏移量)” 中,我们讨论了基于日志的消息代理的消费者能在失败或断开连接后重连,并确保它不会错过掉线期间任何到达的消息。同样的技术适用于单个用户,每个设备都是一个小事件流的小小订阅者。
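A minimal sketch of that idea, with hypothetical `EventLog` and `Device` classes: the device remembers the offset of the next event it needs, so after being offline it simply resumes reading from that offset and misses nothing.

```python
class EventLog:
    """Append-only log; an event's offset is its position in the list."""
    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)
        return len(self._events) - 1

    def read_from(self, offset):
        return list(enumerate(self._events))[offset:]

class Device:
    def __init__(self):
        self.next_offset = 0  # a real app would persist this on the device
        self.state = {}

    def sync(self, log):
        """Apply every event since the last sync, in log order."""
        for offset, (key, value) in log.read_from(self.next_offset):
            self.state[key] = value
            self.next_offset = offset + 1

log = EventLog()
phone = Device()
log.append(("title", "v1"))
phone.sync(log)
# The phone goes offline while two more changes are written.
log.append(("title", "v2"))
log.append(("body", "hello"))
phone.sync(log)  # back online: catch up from the stored offset
```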
#### 端到端的事件流
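Recent tools for developing stateful clients and user interfaces, such as the Elm language【30】 and Facebook's toolchain of React, Flux, and Redux, already manage internal client-side state by subscribing to a stream of events representing user input or responses from a server, structured similarly to event sourcing (see "[Event sourcing](/v1/ch11#事件溯源)").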
将这种编程模型扩展为:允许服务器将状态变更事件推送到客户端的事件管道中,是非常自然的。因此,状态变化可以通过 **端到端(end-to-end)** 的写路径流动:从一个设备上的交互触发状态变更开始,经由事件日志,并穿过几个衍生数据系统与流处理器,一直到另一台设备上的用户界面,而有人正在观察用户界面上的状态变化。这些状态变化能以相当低的延迟传播 —— 比如说,在一秒内从一端到另一端。
一些应用(如即时消息传递与在线游戏)已经具有这种 “实时” 架构(在低延迟交互的意义上,不是在 “[响应时间保证](/v1/ch8#响应时间保证)” 中的意义上)。但我们为什么不用这种方式构建所有的应用?
挑战在于,关于无状态客户端和请求 / 响应交互的假设已经根深蒂固地植入在我们的数据库、库、框架以及协议之中。许多数据存储支持读取与写入操作,为请求返回一个响应,但只有极少数提供订阅变更的能力 —— 请求返回一个随时间推移的响应流(请参阅 “[变更流的 API 支持](/v1/ch11#变更流的API支持)” )。
为了将写路径延伸至终端用户,我们需要从根本上重新思考我们构建这些系统的方式:从请求 / 响应交互转向发布 / 订阅数据流【27】。更具响应性的用户界面与更好的离线支持,我认为这些优势值得我们付出努力。如果你正在设计数据系统,我希望你对订阅变更的选项留有印象,而不只是查询当前状态。
#### 读也是事件
我们讨论过,当流处理器将衍生数据写入存储(数据库,缓存或索引)时,以及当用户请求查询该存储时,存储将充当写路径和读路径之间的边界。该存储应当允许对数据进行随机访问的读取查询,否则这些查询将需要扫描整个事件日志。
在很多情况下,数据存储与流处理系统是分开的。但回想一下,流处理器还是需要维护状态以执行聚合和连接的(请参阅 “[流连接](/v1/ch11#流连接)”)。这种状态通常隐藏在流处理器内部,但一些框架也允许这些状态被外部客户端查询【45】,将流处理器本身变成一种简单的数据库。
我愿意进一步思考这个想法。正如到目前为止所讨论的那样,对存储的写入是通过事件日志进行的,而读取是临时的网络请求,直接流向存储着待查数据的节点。这是一个合理的设计,但不是唯一可行的设计。也可以将读取请求表示为事件流,并同时将读事件与写事件送往流处理器;流处理器通过将读取结果发送到输出流来响应读取事件【46】。
当写入和读取都被表示为事件,并且被路由到同一个流算子以便处理时,我们实际上是在读取查询流和数据库之间执行流表连接。读取事件需要被送往保存数据的数据库分区(请参阅 “[请求路由](/v1/ch6#请求路由)”),就像批处理和流处理器在连接时需要在同一个键上对输入分区一样(请参阅 “[Reduce 侧连接与分组](/v1/ch10#Reduce侧连接与分组)”)。
服务请求与执行连接之间的这种相似之处是非常关键的【47】。一次性读取请求只是将请求传过连接算子,然后请求马上就被忘掉了;而一个订阅请求,则是与连接另一侧过去与未来事件的持久化连接。
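The difference between a one-off read and a subscription can be sketched with a single stream operator that receives writes, reads, and subscriptions as events on one input. The event format and the `KeyValueOperator` class are assumptions for illustration; the `output` list stands in for an output stream.

```python
class KeyValueOperator:
    """One stream operator owning one partition of state."""
    def __init__(self):
        self.state = {}
        self.subscribers = {}  # key -> list of client ids
        self.output = []       # stands in for the output stream

    def process(self, event):
        kind, key = event["type"], event["key"]
        if kind == "write":
            self.state[key] = event["value"]
            for client in self.subscribers.get(key, []):
                self.output.append((client, key, event["value"]))
        elif kind == "read":
            # One-off read: answer from current state, then forget the request.
            self.output.append((event["client"], key, self.state.get(key)))
        elif kind == "subscribe":
            # Persistent join: current value now, plus every future write.
            self.subscribers.setdefault(key, []).append(event["client"])
            self.output.append((event["client"], key, self.state.get(key)))

op = KeyValueOperator()
for e in [
    {"type": "write", "key": "stock", "value": 5},
    {"type": "read", "key": "stock", "client": "alice"},
    {"type": "subscribe", "key": "stock", "client": "bob"},
    {"type": "write", "key": "stock", "value": 4},
]:
    op.process(e)
```

Alice's read joins only against past writes, while Bob's subscription also joins against the later write.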
记录读取事件的日志可能对于追踪整个系统中的因果关系与数据来源也有好处:它可以让你重现出当用户做出特定决策之前看见了什么。例如在网商中,向客户显示的预测送达日期与库存状态,可能会影响他们是否选择购买一件商品【4】。要分析这种联系,则需要记录用户查询运输与库存状态的结果。
将读取事件写入持久存储可以更好地跟踪因果关系(请参阅 “[排序事件以捕获因果关系](#排序事件以捕获因果关系)”),但会产生额外的存储与 I/O 成本。优化这些系统以减少开销仍然是一个开放的研究问题【2】。但如果你已经出于运维目的留下了读取请求日志,将其作为请求处理的副作用,那么将这份日志作为请求事件源并不是什么特别大的变更。
#### 多分区数据处理
对于只涉及单个分区的查询,通过流来发送查询与收集响应可能是杀鸡用牛刀了。然而,这个想法开启了分布式执行复杂查询的可能性,这需要合并来自多个分区的数据,利用了流处理器已经提供的消息路由、分区和连接的基础设施。
Storm 的分布式 RPC 功能支持这种使用模式(请参阅 “[消息传递和 RPC](/v1/ch11#消息传递和RPC)”)。例如,它已经被用来计算浏览过某个推特 URL 的人数 —— 即,发推包含该 URL 的所有人的粉丝集合的并集【48】。由于推特的用户是分区的,因此这种计算需要合并来自多个分区的结果。
这种模式的另一个例子是欺诈预防:为了评估特定购买事件是否具有欺诈风险,你可以检查该用户 IP 地址,电子邮件地址,帐单地址,送货地址的信用分。这些信用数据库中的每一个都是有分区的,因此为特定购买事件采集分数需要连接一系列不同的分区数据集【49】。
MPP 数据库的内部查询执行图有着类似的特征(请参阅 “[Hadoop 与分布式数据库的对比](/v1/ch10#Hadoop与分布式数据库的对比)”)。如果需要执行这种多分区连接,则直接使用提供此功能的数据库,可能要比使用流处理器实现它要更简单。然而将查询视为流提供了一种选项,可以用于实现超出传统现成解决方案的大规模应用。
## 将事情做正确
对于只读取数据的无状态服务,出问题也没什么大不了的:你可以修复该错误并重启服务,而一切都恢复正常。像数据库这样的有状态系统就没那么简单了:它们被设计为永远记住事物(或多或少),所以如果出现问题,这种(错误的)效果也将潜在地永远持续下去,这意味着它们需要更仔细的思考【50】。
我们希望构建可靠且 **正确** 的应用(即使面对各种故障,程序的语义也能被很好地定义与理解)。约四十年来,原子性、隔离性和持久性([第七章](/v1/ch7))等事务特性一直是构建正确应用的首选工具。然而这些地基没有看上去那么牢固:例如弱隔离级别带来的困惑可以佐证(请参阅 “[弱隔离级别](/v1/ch7#弱隔离级别)”)。
事务在某些领域被完全抛弃,并被提供更好性能与可伸缩性的模型取代,但后者有更复杂的语义(例如,请参阅 “[无主复制](/v1/ch5#无主复制)”)。**一致性(Consistency)** 经常被谈起,但其定义并不明确(请参阅 “[一致性](/v1/ch7#一致性)” 和 [第九章](/v1/ch9))。有些人断言我们应当为了高可用而 “拥抱弱一致性”,但却对这些概念实际上意味着什么缺乏清晰的认识。
对于如此重要的话题,我们的理解,以及我们的工程方法却是惊人地薄弱。例如,确定在特定事务隔离等级或复制配置下运行特定应用是否安全是非常困难的【51,52】。通常简单的解决方案似乎在低并发性的情况下工作正常,并且没有错误,但在要求更高的情况下却会出现许多微妙的错误。
例如,Kyle Kingsbury 的 Jepsen 实验【53】标出了一些产品声称的安全保证与其在网络问题与崩溃时的实际行为之间的明显差异。即使像数据库这样的基础设施产品没有问题,应用代码仍然需要正确使用它们提供的功能才行,如果配置很难理解,这是很容易出错的(在这种情况下指的是弱隔离级别,法定人数配置等)。
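If your application can tolerate occasional crashes, as well as corrupting or losing data in unpredictable ways, life is a lot simpler, and you might be able to get away with simply crossing your fingers and hoping for the best. On the other hand, if you need stronger assurances of correctness, then serializability and atomic commit are established approaches, but they come at a cost: they typically only work in a single datacenter (ruling out geographically distributed architectures), and they limit the scale and fault-tolerance properties you can achieve.
While the traditional transaction approach is not going away, I also believe it is not the last word in making applications correct and resilient to faults. In this section I will suggest some ways of thinking about correctness in the context of dataflow architectures.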
### 数据库的端到端原则
仅仅因为一个应用程序使用了具有相对较强安全属性的数据系统(例如可串行化的事务),并不意味着就可以保证没有数据丢失或损坏。例如,如果某个应用有个 Bug,导致它写入不正确的数据,或者从数据库中删除数据,那么可串行化的事务也救不了你。
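This example may seem frivolous, but it is worth taking seriously: application bugs occur, and people make mistakes. I used this example in "[State, streams, and immutability](/v1/ch11#状态、流和不变性)" to argue in favor of immutable and append-only data, because it is easier to recover from such mistakes if you remove the ability of faulty code to destroy good data.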
虽然不变性很有用,但它本身并非万灵药。让我们来看一个可能发生的、非常微妙的数据损坏案例。
#### 正好执行一次操作
在 “[容错](/v1/ch11#容错)” 中,我们见到了 **恰好一次**(或 **等效一次**)语义的概念。如果在处理消息时出现问题,你可以选择放弃(丢弃消息 —— 导致数据丢失)或重试。如果重试,就会有这种风险:第一次实际上成功了,只不过你没有发现。结果这个消息就被处理了两次。
处理两次是数据损坏的一种形式:为同样的服务向客户收费两次(收费太多)或增长计数器两次(夸大指标)都不是我们想要的。在这种情况下,恰好一次意味着安排计算,使得最终效果与没有发生错误的情况一样,即使操作实际上因为某种错误而重试。我们先前讨论过实现这一目标的几种方法。
最有效的方法之一是使操作 **幂等**(idempotent,请参阅 “[幂等性](/v1/ch11#幂等性)”):即确保它无论是执行一次还是执行多次都具有相同的效果。但是,将不是天生幂等的操作变为幂等的操作需要一些额外的努力与关注:你可能需要维护一些额外的元数据(例如更新了值的操作 ID 集合),并在从一个节点故障切换至另一个节点时做好防护(请参阅 “[领导者和锁](/v1/ch8#领导者和锁)”)。
#### 抑制重复
除了流处理之外,其他许多地方也需要抑制重复的模式。例如,TCP 使用了数据包上的序列号,以便接收方可以将它们正确排序,并确定网络上是否有数据包丢失或重复。在将数据交付应用前,TCP 协议栈会重新传输任何丢失的数据包,也会移除任何重复的数据包。
但是,这种重复抑制仅适用于单条 TCP 连接的场景中。假设 TCP 连接是一个客户端与数据库的连接,并且它正在执行 [例 12-1]() 中的事务。在许多数据库中,事务是绑定在客户端连接上的(如果客户端发送了多个查询,数据库就知道它们属于同一个事务,因为它们是在同一个 TCP 连接上发送的)。如果客户端在发送 `COMMIT` 之后并在从数据库服务器收到响应之前遇到网络中断与连接超时,客户端是不知道事务是否已经被提交的([图 8-1](/v1/ddia_0801.png))。
**例 12-1 资金从一个账户到另一个账户的非幂等转移**
```sql
BEGIN TRANSACTION;
UPDATE accounts SET balance = balance + 11.00 WHERE account_id = 1234;
UPDATE accounts SET balance = balance - 11.00 WHERE account_id = 4321;
COMMIT;
```
客户端可以重连到数据库并重试事务,但现在已经处于 TCP 重复抑制的范围之外了。因为 [例 12-1]() 中的事务不是幂等的,可能会发生转了 \$22 而不是期望的 \$11。因此,尽管 [例 12-1]() 是一个事务原子性的标准样例,但它实际上并不正确,而真正的银行并不会这样办事【3】。
两阶段提交(请参阅 “[原子提交与两阶段提交](/v1/ch9#原子提交与两阶段提交)”)协议会破坏 TCP 连接与事务之间的 1:1 映射,因为它们必须在故障后允许事务协调器重连到数据库,告诉数据库将存疑事务提交还是中止。这足以确保事务只被恰好执行一次吗?不幸的是,并不能。
即使我们可以抑制数据库客户端与服务器之间的重复事务,我们仍然需要担心终端用户设备与应用服务器之间的网络。例如,如果终端用户的客户端是 Web 浏览器,则它可能会使用 HTTP POST 请求向服务器提交指令。也许用户正处于一个信号微弱的蜂窝数据网络连接中,它们成功地发送了 POST,但却在能够从服务器接收响应之前没了信号。
在这种情况下,可能会向用户显示错误消息,而他们可能会手动重试。Web 浏览器警告说,“你确定要再次提交这个表单吗?” —— 用户选 “是”,因为他们希望操作发生(Post/Redirect/Get 模式【54】可以避免在正常操作中出现此警告消息,但 POST 请求超时就没办法了)。从 Web 服务器的角度来看,重试是一个独立的请求;从数据库的角度来看,这是一个独立的事务。通常的除重机制无济于事。
#### 操作标识符
要在通过几跳的网络通信上使操作具有幂等性,仅仅依赖数据库提供的事务机制是不够的 —— 你需要考虑 **端到端(end-to-end)** 的请求流。
例如,你可以为操作生成一个唯一的标识符(例如 UUID),并将其作为隐藏表单字段包含在客户端应用中,或通过计算所有表单相关字段的散列来生成操作 ID 【3】。如果 Web 浏览器提交了两次 POST 请求,这两个请求将具有相同的操作 ID。然后,你可以将该操作 ID 一路传递到数据库,并检查你是否曾经使用给定的 ID 执行过一个操作,如 [例 12-2]() 中所示。
**例 12-2 使用唯一 ID 来抑制重复请求**
```sql
ALTER TABLE requests ADD UNIQUE (request_id);
BEGIN TRANSACTION;
INSERT INTO requests
(request_id, from_account, to_account, amount)
VALUES('0286FDB8-D7E1-423F-B40B-792B3608036C', 4321, 1234, 11.00);
UPDATE accounts SET balance = balance + 11.00 WHERE account_id = 1234;
UPDATE accounts SET balance = balance - 11.00 WHERE account_id = 4321;
COMMIT;
```
[例 12-2]() 依赖于 `request_id` 列上的唯一约束。如果一个事务尝试插入一个已经存在的 ID,那么 `INSERT` 失败,事务被中止,使其无法生效两次。即使在较弱的隔离级别下,关系数据库也能正确地维护唯一性约束(而在 “[写入偏差与幻读](/v1/ch7#写入偏差与幻读)” 中讨论过,应用级别的 **检查 - 然后 - 插入** 可能会在不可串行化的隔离下失败)。
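The behavior of Example 12-2 under a retry can be tried out with Python's built-in `sqlite3` module. This is a sketch under assumptions: SQLite stands in for whatever database the application uses, amounts are stored as integers to avoid floating-point noise, and the `transfer` helper is hypothetical.

```python
import sqlite3

def transfer(conn, request_id, from_account, to_account, amount):
    """Apply a transfer at most once per request_id. Returns True if applied."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "INSERT INTO requests (request_id, from_account, to_account, amount) "
                "VALUES (?, ?, ?, ?)",
                (request_id, from_account, to_account, amount))
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE account_id = ?",
                (amount, to_account))
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE account_id = ?",
                (amount, from_account))
        return True
    except sqlite3.IntegrityError:
        return False  # request_id already present: this is a duplicate

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (account_id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("CREATE TABLE requests (request_id TEXT UNIQUE, from_account INTEGER, "
             "to_account INTEGER, amount INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1234, 0), (4321, 100)])
conn.commit()

req = "0286FDB8-D7E1-423F-B40B-792B3608036C"
first = transfer(conn, req, 4321, 1234, 11)
retry = transfer(conn, req, 4321, 1234, 11)  # e.g. the client timed out and retried
balances = dict(conn.execute("SELECT account_id, balance FROM accounts"))
```

The retry fails on the unique constraint before either `UPDATE` runs, so the balances change only once.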
除了抑制重复的请求之外,[例 12-2]() 中的请求表表现得就像一种事件日志,暗示着事件溯源的想法(请参阅 “[事件溯源](/v1/ch11#事件溯源)”)。更新账户余额事实上不必与插入事件发生在同一个事务中,因为它们是冗余的,而能由下游消费者从请求事件中衍生出来 —— 只要该事件被恰好处理一次,这又一次可以使用请求 ID 来强制执行。
#### 端到端原则
抑制重复事务的这种情况只是一个更普遍的原则的一个例子,这个原则被称为 **端到端原则(end-to-end argument)**,它在 1984 年由 Saltzer、Reed 和 Clark 阐述【55】:
> 只有在通信系统两端应用的知识与帮助下,所讨论的功能才能完全地正确地实现。因而将这种被质疑的功能作为通信系统本身的功能是不可能的(有时,通信系统可以提供这种功能的不完备版本,可能有助于提高性能)。
>
在我们的例子中 **所讨论的功能** 是重复抑制。我们看到 TCP 在 TCP 连接层次抑制了重复的数据包,一些流处理器在消息处理层次提供了所谓的恰好一次语义,但这些都无法阻止当一个请求超时时,用户亲自提交重复的请求。TCP,数据库事务,以及流处理器本身并不能完全排除这些重复。解决这个问题需要一个端到端的解决方案:从终端用户的客户端一路传递到数据库的事务标识符。
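The end-to-end argument also applies to checking the integrity of data: checksums built into Ethernet, TCP, and TLS can detect corruption of packets in the network, but they cannot detect corruption due to bugs in the software at the sending and receiving ends of the network connection, or corruption on the disks where the data is stored. If you want to catch all possible sources of data corruption, you also need end-to-end checksums.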
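One way to sketch an end-to-end checksum is for the application that produces a record to attach a digest, and for the application that finally consumes it to recompute that digest, regardless of which networks, services, and disks the record passed through in between. The record layout and function names are assumptions for illustration.

```python
import hashlib

def seal(payload: bytes) -> dict:
    """Producer side: attach a SHA-256 digest computed by the application."""
    return {"payload": payload, "sha256": hashlib.sha256(payload).hexdigest()}

def verify(record: dict) -> bool:
    """Consumer side: recompute the digest after all hops and storage layers."""
    return hashlib.sha256(record["payload"]).hexdigest() == record["sha256"]

record = seal(b"transfer 11.00 from 4321 to 1234")
# Simulate corruption introduced somewhere between the two ends.
corrupted = dict(record, payload=b"transfer 91.00 from 4321 to 1234")
```

A plain hash detects accidental corruption only; protecting against deliberate tampering would need a keyed MAC or a signature.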
类似的原则也适用于加密【55】:家庭 WiFi 网络上的密码可以防止人们窃听你的 WiFi 流量,但无法阻止互联网上其他地方攻击者的窥探;客户端与服务器之间的 TLS/SSL 可以阻挡网络攻击者,但无法阻止恶意服务器。只有端到端的加密和认证可以防止所有这些事情。
尽管低层级的功能(TCP 重复抑制、以太网校验和、WiFi 加密)无法单独提供所需的端到端功能,但它们仍然很有用,因为它们能降低较高层级出现问题的可能性。例如,如果我们没有 TCP 来将数据包排成正确的顺序,那么 HTTP 请求通常就会被搅烂。我们只需要记住,低级别的可靠性功能本身并不足以确保端到端的正确性。
#### 在数据系统中应用端到端思考
这将我带回最初的论点:仅仅因为应用使用了提供相对较强安全属性的数据系统,例如可串行化的事务,并不意味着应用的数据就不会丢失或损坏了。应用本身也需要采取端到端的措施,例如除重。
这实在是一个遗憾,因为容错机制很难弄好。低层级的可靠机制(比如 TCP 中的那些)运行的相当好,因而剩下的高层级错误基本很少出现。如果能将这些剩下的高层级容错机制打包成抽象,而应用不需要再去操心,那该多好呀 —— 但恐怕我们还没有找到这一正确的抽象。
长期以来,事务被认为是一个很好的抽象,我相信它们确实是很有用的。正如 [第七章](/v1/ch7) 导言中所讨论的,它们将各种可能的问题(并发写入、违背约束、崩溃、网络中断、磁盘故障)合并为两种可能结果:提交或中止。这是对编程模型而言是一种巨大的简化,但恐怕这还不够。
事务是代价高昂的,当涉及异构存储技术时尤为甚(请参阅 “[实践中的分布式事务](/v1/ch9#实践中的分布式事务)”)。我们拒绝使用分布式事务是因为它开销太大,结果我们最后不得不在应用代码中重新实现容错机制。正如本书中大量的例子所示,对并发性与部分失败的推理是困难且违反直觉的,所以我怀疑大多数应用级别的机制都不能正确工作,最终结果是数据丢失或损坏。
出于这些原因,我认为探索对容错的抽象是很有价值的。它使提供应用特定的端到端的正确性属性变得更简单,而且还能在大规模分布式环境中提供良好的性能与运维特性。
### 强制约束
让我们思考一下在 [分拆数据库](#分拆数据库) 上下文中的 **正确性(correctness)**。我们看到端到端的除重可以通过从客户端一路透传到数据库的请求 ID 实现。那么其他类型的约束呢?
我们先来特别关注一下 **唯一性约束** —— 例如我们在 [例 12-2]() 中所依赖的约束。在 “[约束和唯一性保证](/v1/ch9#约束和唯一性保证)” 中,我们看到了几个其他需要强制实施唯一性的应用功能例子:用户名或电子邮件地址必须唯一标识用户,文件存储服务不能包含多个重名文件,两个人不能在航班或剧院预订同一个座位。
其他类型的约束也非常类似:例如,确保帐户余额永远不会变为负数,确保不会超卖库存,或者会议室没有重复的预订。执行唯一性约束的技术通常也可以用于这些约束。
#### 唯一性约束需要达成共识
在 [第九章](/v1/ch9) 中我们看到,在分布式环境中,强制执行唯一性约束需要共识:如果存在多个具有相同值的并发请求,则系统需要决定冲突操作中的哪一个被接受,并拒绝其他违背约束的操作。
达成这一共识的最常见方式是使单个节点作为领导,并使其负责所有决策。只要你不介意所有请求都挤过单个节点(即使客户端位于世界的另一端),只要该节点没有失效,系统就能正常工作。如果你需要容忍领导者失效,那么就又回到了共识问题(请参阅 “[单主复制与共识](/v1/ch9#单主复制与共识)”)。
唯一性检查可以通过对唯一性字段分区做横向伸缩。例如,如果需要通过请求 ID 确保唯一性(如 [例 12-2]() 所示),你可以确保所有具有相同请求 ID 的请求都被路由到同一分区(请参阅 [第六章](/v1/ch6))。如果你需要让用户名是唯一的,则可以按用户名的散列值做分区。
但异步多主复制排除在外,因为可能会发生不同主库同时接受冲突写操作的情况,因而这些值不再是唯一的(请参阅 “[实现线性一致的系统](/v1/ch9#实现线性一致的系统)”)。如果你想立刻拒绝任何违背约束的写入,同步协调是无法避免的【56】。
#### 基于日志消息传递中的唯一性
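A log ensures that all consumers see messages in the same order, a guarantee that is formally known as **total order broadcast** and is equivalent to consensus (see "[Total order broadcast](/v1/ch9#全序广播)"). In the unbundled database approach with log-based messaging, we can use a very similar approach to enforce uniqueness constraints.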
流处理器在单个线程上依次消费单个日志分区中的所有消息(请参阅 “[日志与传统的消息传递相比](/v1/ch11#日志与传统的消息传递相比)”)。因此,如果日志是按需要确保唯一的值做的分区,则流处理器可以无歧义地、确定性地决定几个冲突操作中的哪一个先到达。例如,在多个用户尝试宣告相同用户名的情况下【57】:
1. 每个对用户名的请求都被编码为一条消息,并追加到按用户名散列值确定的分区。
2. 流处理器依序读取日志中的请求,并使用本地数据库来追踪哪些用户名已经被占用了。对于所有申请可用用户名的请求,它都会记录该用户名,并向输出流发送一条成功消息。对于所有申请已占用用户名的请求,它都会向输出流发送一条拒绝消息。
3. 请求用户名的客户端监视输出流,等待与其请求相对应的成功或拒绝消息。
该算法基本上与 “[使用全序广播实现线性一致的存储](/v1/ch9#使用全序广播实现线性一致的存储)” 中的算法相同。它可以简单地通过增加分区数伸缩至较大的请求吞吐量,因为每个分区都可以被独立处理。
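The three steps can be sketched in a few lines. The partition count, the use of `zlib.crc32` as the partitioning hash, and the `UsernameProcessor` class are assumptions for illustration; lists stand in for the log partitions and the output stream.

```python
import zlib

NUM_PARTITIONS = 4

def partition_for(username):
    # Stable hash, so every request for a given name lands in the same partition.
    return zlib.crc32(username.encode("utf-8")) % NUM_PARTITIONS

class UsernameProcessor:
    """Single-threaded consumer of one log partition."""
    def __init__(self):
        self.taken = {}   # username -> id of the request that won it
        self.output = []  # stands in for the output stream

    def process(self, request_id, username):
        if username in self.taken:
            # A redelivered copy of the winning request still reports success.
            ok = self.taken[username] == request_id
        else:
            self.taken[username] = request_id
            ok = True
        self.output.append((request_id, username, "success" if ok else "rejected"))

# Step 1: append each request to the partition chosen by the username hash.
log = [[] for _ in range(NUM_PARTITIONS)]
for request_id, name in [("r1", "alice"), ("r2", "bob"), ("r3", "alice")]:
    log[partition_for(name)].append((request_id, name))

# Step 2: one processor per partition reads its requests in log order.
processors = [UsernameProcessor() for _ in range(NUM_PARTITIONS)]
for partition, processor in zip(log, processors):
    for request_id, name in partition:
        processor.process(request_id, name)

# Step 3: clients look for the message that matches their request id.
results = {rid: status for p in processors for rid, _, status in p.output}
```

The outcome depends only on the order of requests within a partition, which is why adding partitions increases throughput without weakening the constraint.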
该方法不仅适用于唯一性约束,而且适用于许多其他类型的约束。其基本原理是,任何可能冲突的写入都会路由到相同的分区并按顺序处理。正如 “[什么是冲突?](/v1/ch5#什么是冲突?)” 与 “[写入偏差与幻读](/v1/ch7#写入偏差与幻读)” 中所述,冲突的定义可能取决于应用,但流处理器可以使用任意逻辑来验证请求。这个想法与 Bayou 在 90 年代开创的方法类似【58】。
#### 多分区请求处理
当涉及多个分区时,确保操作以原子方式执行且同时满足约束就变得很有趣了。在 [例 12-2]() 中,可能有三个分区:一个包含请求 ID,一个包含收款人账户,另一个包含付款人账户。没有理由把这三种东西放入同一个分区,因为它们都是相互独立的。
在数据库的传统方法中,执行此事务需要跨全部三个分区进行原子提交,就这些分区上的所有其他事务而言,这实质上是将该事务嵌入一个全序。而这样就要求跨分区协调,不同的分区无法再独立地进行处理,因此吞吐量很可能会受到影响。
但事实证明,使用分区日志可以达到等价的正确性而无需原子提交:
1. 从账户 A 向账户 B 转账的请求由客户端提供一个唯一的请求 ID,并按请求 ID 追加写入相应日志分区。
2. 流处理器读取请求日志。对于每个请求消息,它向输出流发出两条消息:付款人账户 A 的借记指令(按 A 分区),收款人 B 的贷记指令(按 B 分区)。被发出的消息中会带有原始的请求 ID。
3. 后续处理器消费借记 / 贷记指令流,按照请求 ID 除重,并将变更应用至账户余额。
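Steps 1 and 2 are necessary because if the client directly sent the credit and debit instructions, an atomic commit across those two partitions would be required to ensure that either both or neither happen. To avoid the need for a distributed transaction, we first durably log the request as a single message, and then derive the credit and debit instructions from that first message. Single-object writes are atomic in almost all data systems (see "[Single-object writes](/v1/ch7#单对象写入)"), so the request either appears in the log or it doesn't, without any need for a multi-partition atomic commit.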
如果流处理器在步骤 2 中崩溃,则它会从上一个存档点恢复处理。这样做时,它不会跳过任何请求消息,但可能会多次处理请求并产生重复的贷记与借记指令。但由于它是确定性的,因此它只是再次生成相同的指令,而步骤 3 中的处理器可以使用端到端请求 ID 轻松地对其除重。
如果你想确保付款人的帐户不会因此次转账而透支,则可以使用一个额外的流处理器来维护账户余额并校验事务(按付款人账户分区),只有有效的事务会被记录在步骤 1 中的请求日志中。
通过将多分区事务分解为两个不同分区方式的阶段,并使用端到端的请求 ID,我们实现了同样的正确性属性(每个请求对付款人与收款人都恰好生效一次),即使在出现故障,且没有使用原子提交协议的情况下依然如此。使用多个不同分区阶段的想法与我们在 “[多分区数据处理](#多分区数据处理)” 中讨论的想法类似(也请参阅 “[并发控制](/v1/ch11#并发控制)”)。
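A compact sketch of steps 2 and 3, including a simulated crash and retry of the step 2 processor. For brevity a single hypothetical `AccountProcessor` handles both accounts; in a partitioned deployment there would be one per account partition, and the dedup set would have to be updated atomically with the balance.

```python
def derive_instructions(request):
    """Step 2: deterministic function from one request to two instructions."""
    rid, payer, payee, amount = request
    return [(rid, payer, -amount), (rid, payee, +amount)]

class AccountProcessor:
    """Step 3: applies debit/credit instructions, deduplicating by request id."""
    def __init__(self):
        self.balances = {}
        self.applied = set()  # (request id, account) pairs already applied

    def apply(self, instruction):
        rid, account, delta = instruction
        if (rid, account) in self.applied:
            return  # duplicate produced by a retry upstream
        self.applied.add((rid, account))
        self.balances[account] = self.balances.get(account, 0) + delta

request_log = [("req-1", "A", "B", 11)]
accounts = AccountProcessor()
# Simulate the step 2 processor crashing and reprocessing the same request.
for request in request_log + request_log:
    for instruction in derive_instructions(request):
        accounts.apply(instruction)
```

Because `derive_instructions` is deterministic, the retry emits exactly the same instructions, and the request ID is enough to discard them.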
### 及时性与完整性
事务的一个便利属性是,它们通常是线性一致的(请参阅 “[线性一致性](/v1/ch9#线性一致性)”),也就是说,写入者会等到事务提交,而之后其写入立刻对所有读取者可见。
当我们把一个操作拆分为跨越多个阶段的流处理器时,却并非如此:日志的消费者在设计上就是异步的,因此发送者不会等其消息被消费者处理完。但是,客户端等待输出流中的特定消息是可能的。这正是我们在 “[基于日志消息传递中的唯一性](#基于日志消息传递中的唯一性)” 一节中检查唯一性约束时所做的事情。
在这个例子中,唯一性检查的正确性不取决于消息发送者是否等待结果。等待的目的仅仅是同步通知发送者唯一性检查是否成功。但该通知可以与消息处理的结果相解耦。
More generally, I think the term **consistency** conflates two different requirements that are worth considering separately:
* 及时性(Timeliness)
及时性意味着确保用户观察到系统的最新状态。我们之前看到,如果用户从陈旧的数据副本中读取数据,它们可能会观察到系统处于不一致的状态(请参阅 “[复制延迟问题](/v1/ch5#复制延迟问题)”)。但这种不一致是暂时的,而最终会通过等待与重试简单地得到解决。
CAP 定理(请参阅 “[线性一致性的代价](/v1/ch9#线性一致性的代价)”)使用 **线性一致性(linearizability)** 意义上的一致性,这是实现及时性的强有力方法。像 **写后读** 这样及时性更弱的一致性也很有用(请参阅 “[读己之写](/v1/ch5#读己之写)”)。
* 完整性(Integrity)
完整性意味着没有损坏;即没有数据丢失,并且没有矛盾或错误的数据。尤其是如果某些衍生数据集是作为底层数据之上的视图而维护的(请参阅 “[从事件日志中派生出当前状态](/v1/ch11#从事件日志中派生出当前状态)”),这种衍生必须是正确的。例如,数据库索引必须正确地反映数据库的内容 —— 缺失某些记录的索引并不是很有用。
如果完整性被违背,这种不一致是永久的:在大多数情况下,等待与重试并不能修复数据库损坏。相反的是,需要显式地检查与修复。在 ACID 事务的上下文中(请参阅 “[ACID 的含义](/v1/ch7#ACID的含义)”),一致性通常被理解为某种特定于应用的完整性概念。原子性和持久性是保持完整性的重要工具。
口号形式:违反及时性,“最终一致性”;违反完整性,“永无一致性”。
我断言在大多数应用中,完整性比及时性重要得多。违反及时性可能令人困惑与讨厌,但违反完整性的结果可能是灾难性的。
例如在你的信用卡对账单上,如果某一笔过去 24 小时内完成的交易尚未出现并不令人奇怪 —— 这些系统有一定的滞后是正常的。我们知道银行是异步核算与敲定交易的,这里的及时性并不是非常重要【3】。但如果当期对账单余额与上期对账单余额加交易总额对不上(求和错误),或者出现一笔向你收费但未向商家付款的交易(消失的钱),那就实在是太糟糕了,这样的问题就违背了系统的完整性。
#### 数据流系统的正确性
ACID 事务通常既提供及时性(例如线性一致性)也提供完整性保证(例如原子提交)。因此如果你从 ACID 事务的角度来看待应用的正确性,那么及时性与完整性的区别是无关紧要的。
另一方面,对于在本章中讨论的基于事件的数据流系统而言,它们的一个有趣特性就是将及时性与完整性分开。在异步处理事件流时不能保证及时性,除非你显式构建一个在返回之前明确等待特定消息到达的消费者。但完整性实际上才是流处理系统的核心。
**恰好一次** 或 **等效一次** 语义(请参阅 “[容错](/v1/ch11#容错)”)是一种保持完整性的机制。如果事件丢失或者生效两次,就有可能违背数据系统的完整性。因此在出现故障时,容错消息传递与重复抑制(例如,幂等操作)对于维护数据系统的完整性是很重要的。
正如我们在上一节看到的那样,可靠的流处理系统可以在无需分布式事务与原子提交协议的情况下保持完整性,这意味着它们有潜力达到与后者相当的正确性,同时还具备好得多的性能与运维稳健性。为了达成这种正确性,我们组合使用了多种机制:
* 将写入操作的内容表示为单条消息,从而可以轻松地被原子写入 —— 与事件溯源搭配效果拔群(请参阅 “[事件溯源](/v1/ch11#事件溯源)”)。
* 使用与存储过程类似的确定性衍生函数,从这一消息中衍生出所有其他的状态变更(请参阅 “[真的串行执行](/v1/ch7#真的串行执行)” 和 “[应用代码作为衍生函数](/v1/ch12#应用代码作为衍生函数)”)
* 将客户端生成的请求 ID 传递通过所有的处理层次,从而允许端到端的除重,带来幂等性。
* 使消息不可变,并允许衍生数据能随时被重新处理,这使从错误中恢复更加容易(请参阅 “[不可变事件的优点](/v1/ch11#不可变事件的优点)”)
这种机制组合在我看来,是未来构建容错应用的一个非常有前景的方向。
#### 宽松地解释约束
如前所述,执行唯一性约束需要共识,通常通过在单个节点中汇集特定分区中的所有事件来实现。如果我们想要传统的唯一性约束形式,这种限制是不可避免的,流处理也不例外。
However, another thing to realize is that many real applications can actually get away with much weaker notions of uniqueness:
* 如果两个人同时注册了相同的用户名或预订了相同的座位,你可以给其中一个人发消息道歉,并要求他们换一个不同的用户名或座位。这种纠正错误的变化被称为 **补偿性事务(compensating transaction)**【59,60】。
* 如果客户订购的物品多于仓库中的物品,你可以下单补仓,并为延误向客户道歉,向他们提供折扣。实际上,这么说吧,如果叉车在仓库中轧过了你的货物,剩下的货物比你想象的要少,那么你也是得这么做【61】。因此,既然道歉工作流无论如何已经成为你商业过程中的一部分了,那么对库存物品数目添加线性一致的约束可能就没必要了。
* 与之类似,许多航空公司都会超卖机票,打着一些旅客可能会错过航班的算盘;许多旅馆也会超卖客房,抱着部分客人可能会取消预订的期望。在这些情况下,出于商业原因而故意违反了 “一人一座” 的约束;当需求超过供给的情况出现时,就会进入补偿流程(退款、升级舱位 / 房型、提供隔壁酒店的免费的房间)。即使没有超卖,为了应对由恶劣天气或员工罢工导致的航班取消,你还是需要道歉与补偿流程 —— 从这些问题中恢复仅仅是商业活动的正常组成部分。
* 如果有人从账户超额取款,银行可以向他们收取透支费用,并要求他们偿还欠款。通过限制每天的提款总额,银行的风险是有限的。
在许多商业场景中,临时违背约束并稍后通过道歉来修复,实际上是可以接受的。道歉的成本各不相同,但通常很低(以金钱或名声来算):你无法撤回已发送的电子邮件,但可以发送一封后续电子邮件进行更正。如果你不小心向信用卡收取了两次费用,则可以将其中一项收费退款,而代价仅仅是手续费,也许还有客户的投诉。尽管一旦 ATM 吐了钱,你无法直接取回,但原则上如果账户透支而客户拒不支付,你可以派催收员收回欠款。
道歉的成本是否能接受是一个商业决策。如果可以接受的话,在写入数据之前检查所有约束的传统模型反而会带来不必要的限制,而线性一致性的约束也不是必须的。乐观写入,事后检查可能是一种合理的选择。你仍然可以在做一些挽回成本高昂的事情前确保有相关的验证,但这并不意味着写入数据之前必须先进行验证。
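These applications **do** require integrity: you would not want to lose a reservation, or have money disappear due to mismatched credits and debits. But they **do not** require timeliness on the enforcement of the constraint: if you have sold more items than you have in the warehouse, you can patch up the problem after the fact by apologizing. Doing so is similar to the conflict resolution approaches we discussed in "[Handling write conflicts](/v1/ch5#处理写入冲突)".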
#### 无协调数据系统
我们现在已经做了两个有趣的观察:
1. 数据流系统可以维持衍生数据的完整性保证,而无需原子提交、线性一致性或者同步的跨分区协调。
2. 虽然严格的唯一性约束要求及时性和协调,但许多应用实际上可以接受宽松的约束:只要整个过程保持完整性,这些约束可能会被临时违反并在稍后被修复。
总之这些观察意味着,数据流系统可以为许多应用提供无需协调的数据管理服务,且仍能给出很强的完整性保证。这种 **无协调(coordination-avoiding)** 的数据系统有着很大的吸引力:比起需要执行同步协调的系统,它们能达到更好的性能与更强的容错能力【56】。
例如,这种系统可以使用多领导者配置运维,跨越多个数据中心,在区域间异步复制。任何一个数据中心都可以持续独立运行,因为不需要同步的跨区域协调。这样的系统的及时性保证会很弱 —— 如果不引入协调它是不可能是线性一致的 —— 但它仍然可以提供有力的完整性保证。
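In this context, serializable transactions are still useful as part of maintaining derived state, but they can be run at a small scope where they work well【8】. Heterogeneous distributed transactions such as XA transactions (see "[Distributed transactions in practice](/v1/ch9#实践中的分布式事务)") are not required. Synchronous coordination can still be introduced in places where it is needed (for example, to enforce strict constraints before an operation from which recovery is not possible), but there is no need for everything to pay the cost of coordination if only a small part of an application needs it【43】.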
另一种审视协调与约束的角度是:它们减少了由于不一致而必须做出的道歉数量,但也可能会降低系统的性能和可用性,从而可能增加由于宕机中断而需要做出的道歉数量。你不可能将道歉数量减少到零,但可以根据自己的需求寻找最佳平衡点 —— 既不存在太多不一致性,又不存在太多可用性问题。
### 信任但验证
我们所有关于正确性,完整性和容错的讨论都基于一些假设,假设某些事情可能会出错,但其他事情不会。我们将这些假设称为我们的 **系统模型**(system model,请参阅 “[将系统模型映射到现实世界](/v1/ch8#将系统模型映射到现实世界)”):例如,我们应该假设进程可能会崩溃,机器可能突然断电,网络可能会任意延迟或丢弃消息。但是我们也可能假设写入磁盘的数据在执行 `fsync` 后不会丢失,内存中的数据没有损坏,而 CPU 的乘法指令总是能返回正确的结果。
这些假设是相当合理的,因为大多数时候它们都是成立的,如果我们不得不经常担心计算机出错,那么基本上寸步难行。在传统上,系统模型采用二元方法处理故障:我们假设有些事情可能会发生,而其他事情 **永远** 不会发生。实际上,这更像是一个概率问题:有些事情更有可能,其他事情不太可能。问题在于违反我们假设的情况是否经常发生,以至于我们可能在实践中遇到它们。
我们已经看到,数据可能会在尚未落盘时损坏(请参阅 “[复制与持久性](/v1/ch7#复制与持久性)”),而网络上的数据损坏有时可能规避了 TCP 校验和(请参阅 “[弱谎言形式](/v1/ch8#弱谎言形式)” )。也许我们应当更关注这些事情?
我过去所从事的一个应用收集了来自客户端的崩溃报告,我们收到的一些报告,只有在这些设备内存中出现了随机位翻转才解释的通。这看起来不太可能,但是如果有足够多的设备运行你的软件,那么即使再不可能发生的事也确实会发生。除了由于硬件故障或辐射导致的随机存储器损坏之外,一些病态的存储器访问模式甚至可以在没有故障的存储器中翻转位【62】 —— 一种可用于破坏操作系统安全机制的效应【63】(这种技术被称为 **Rowhammer**)。一旦你仔细观察,硬件并不是看上去那样完美的抽象。
要澄清的是,随机位翻转在现代硬件上仍是非常罕见的【64】。我只想指出,它们并没有超越可能性的范畴,所以值得一些关注。
#### 维护完整性,尽管软件有Bug
除了这些硬件问题之外,总是存在软件 Bug 的风险,这些错误不会被较低层次的网络、内存或文件系统校验和所捕获。即使广泛使用的数据库软件也有 Bug:即使像 MySQL 与 PostgreSQL 这样稳健、口碑良好、多年来被许多人充分测试过的软件,就我个人所见也有 Bug,比如 MySQL 未能正确维护唯一约束【65】,以及 PostgreSQL 的可串行化隔离等级存在特定的写入偏差异常【66】。对于不那么成熟的软件来说,情况可能要糟糕得多。
尽管在仔细设计,测试,以及审查上做出很多努力,但 Bug 仍然会在不知不觉中产生。尽管它们很少,而且最终会被发现并被修复,但总会有那么一段时间,这些 Bug 可能会损坏数据。
而对于应用代码,我们不得不假设会有更多的错误,因为绝大多数应用的代码经受的评审与测试远远无法与数据库的代码相比。许多应用甚至没有正确使用数据库提供的用于维持完整性的功能,例如外键或唯一性约束【36】。
ACID 意义下的一致性(请参阅 “[一致性](/v1/ch7#一致性)”)基于这样一种想法:数据库以一致的状态启动,而事务将其从一个一致状态转换至另一个一致的状态。因此,我们期望数据库始终处于一致状态。然而,只有当你假设事务没有 Bug 时,这种想法才有意义。如果应用以某种错误的方式使用数据库,例如,不安全地使用弱隔离等级,数据库的完整性就无法得到保证。
#### 不要盲目信任承诺
由于硬件和软件并不总是符合我们的理想,所以数据损坏似乎早晚不可避免。因此,我们至少应该有办法查明数据是否已经损坏,以便我们能够修复它,并尝试追查错误的来源。检查数据完整性称为 **审计(auditing)**。
如 “[不可变事件的优点](/v1/ch11#不可变事件的优点)” 一节中所述,审计不仅仅适用于财务应用程序。不过,可审计性在财务中是非常非常重要的,因为每个人都知道错误总会发生,我们也都认为能够检测和解决问题是合理的需求。
成熟的系统同样倾向于考虑不太可能的事情出错的可能性,并管理这种风险。例如,HDFS 和 Amazon S3 等大规模存储系统并不完全信任磁盘:它们运行后台进程持续回读文件,并将其与其他副本进行比较,并将文件从一个磁盘移动到另一个,以便降低静默损坏的风险【67】。
如果你想确保你的数据仍然存在,你必须真正读取它并进行检查。大多数时候它们仍然会在那里,但如果不是这样,你一定想尽早知道答案,而不是更晚。按照同样的原则,不时地尝试从备份中恢复是非常重要的 —— 否则当你发现备份损坏时,你可能已经遇到了数据丢失,那时候就真的太晚了。不要盲目地相信它们全都管用。
#### 验证的文化
像 HDFS 和 S3 这样的系统仍然需要假设磁盘大部分时间都能正常工作 —— 这是一个合理的假设,但与它们 **始终** 能正常工作的假设并不相同。然而目前还没有多少系统采用这种 “信任但是验证” 的方式来持续审计自己。许多人认为正确性保证是绝对的,并且没有为罕见的数据损坏的可能性做过准备。我希望未来能看到更多的 **自我验证(self-validating)** 或 **自我审计(self-auditing)** 系统,不断检查自己的完整性,而不是依赖盲目的信任【68】。
我担心 ACID 数据库的文化导致我们在盲目信任技术(如事务机制)的基础上开发应用,而忽视了这种过程中的任何可审计性。由于我们所信任的技术在大多数情况下工作得很好,通常会认为审计机制并不值得投资。
但随之而来的是,数据库的格局发生了变化:在 NoSQL 的旗帜下,更弱的一致性保证成为常态,更不成熟的存储技术越来越被广泛使用。但是由于审计机制还没有被开发出来,尽管这种方式越来越危险,我们仍不断在盲目信任的基础上构建应用。让我们想一想如何针对可审计性而设计吧。
#### 为可审计性而设计
如果一个事务在一个数据库中改变了多个对象,在这一事实发生后,很难说清这个事务到底意味着什么。即使你捕获了事务日志(请参阅 “[变更数据捕获](/v1/ch11#变更数据捕获)”),各种表中的插入、更新和删除操作并不一定能清楚地表明 **为什么** 要执行这些变更。决定这些变更的是应用逻辑中的调用,而这一应用逻辑稍纵即逝,无法重现。
相比之下,基于事件的系统可以提供更好的可审计性。在事件溯源方法中,系统的用户输入被表示为一个单一不可变事件,而任何其导致的状态变更都衍生自该事件。衍生可以实现为具有确定性与可重复性,因而相同的事件日志通过相同版本的衍生代码时,会导致相同的状态变更。
显式处理数据流(请参阅 “[批处理输出的哲学](/v1/ch10#批处理输出的哲学)”)可以使数据的 **来龙去脉(provenance)** 更加清晰,从而使完整性检查更具可行性。对于事件日志,我们可以使用散列来检查事件存储没有被破坏。对于任何衍生状态,我们可以重新运行从事件日志中衍生它的批处理器与流处理器,以检查是否获得相同的结果,或者,甚至并行运行冗余的衍生流程。
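Both checks can be sketched together: a hash chain over the event log detects modified or reordered events, and rerunning a deterministic derivation lets us compare against the stored view. The event format, the chaining scheme, and the function names are assumptions for illustration, not a description of any particular product.

```python
import hashlib
import json

def chain_hash(prev_hash, event):
    """Hash of this event combined with the hash of everything before it."""
    data = prev_hash + json.dumps(event, sort_keys=True)
    return hashlib.sha256(data.encode("utf-8")).hexdigest()

def append(log, event):
    prev = log[-1]["hash"] if log else ""
    log.append({"event": event, "hash": chain_hash(prev, event)})

def log_is_intact(log):
    """Recompute the chain; any changed or reordered event breaks it."""
    prev = ""
    for entry in log:
        if entry["hash"] != chain_hash(prev, entry["event"]):
            return False
        prev = entry["hash"]
    return True

def derive_balances(log):
    """Deterministic derivation: the same log always yields the same state."""
    balances = {}
    for entry in log:
        e = entry["event"]
        balances[e["account"]] = balances.get(e["account"], 0) + e["delta"]
    return balances

log = []
append(log, {"account": "A", "delta": 50})
append(log, {"account": "A", "delta": -20})
stored_view = {"A": 30}  # what the serving database currently claims

audit_ok = log_is_intact(log) and derive_balances(log) == stored_view
```

An audit job could run this comparison periodically and raise an alert when the recomputed state and the stored view disagree.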
具有确定性且定义良好的数据流,也使调试与跟踪系统的执行变得容易,以便确定它 **为什么** 做了某些事情【4,69】。如果出现意想之外的事情,那么重现导致意外事件的确切事故现场的诊断能力 —— 一种时间旅行调试功能是非常有价值的。
#### 端到端原则重现
如果我们不能完全相信系统的每个组件都不会损坏 —— 每一个硬件都没缺陷,每一个软件都没有 Bug —— 那我们至少必须定期检查数据的完整性。如果我们不检查,我们就不能发现损坏,直到无可挽回地导致对下游的破坏时,那时候再去追踪问题就要难得多,且代价也要高的多。
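Checking the integrity of data systems is best done in an end-to-end fashion (see "[The end-to-end argument for databases](#数据库的端到端原则)"): the more systems we can include in an integrity check, the fewer opportunities there are for corruption to go unnoticed at some stage of the process. If we can check that an entire derived data pipeline is correct end to end, then any disks, networks, services, and algorithms along the path are implicitly included in the check.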
持续的端到端完整性检查可以不断提高你对系统正确性的信心,从而使你能更快地进步【70】。与自动化测试一样,审计提高了快速发现错误的可能性,从而降低了系统变更或新存储技术可能导致损失的风险。如果你不害怕进行变更,就可以更好地充分演化一个应用,使其满足不断变化的需求。
#### 用于可审计数据系统的工具
目前,将可审计性作为顶层关注点的数据系统并不多。一些应用实现了自己的审计机制,例如将所有变更记录到单独的审计表中,但是确保审计日志与数据库状态的完整性仍然是很困难的。可以定期使用硬件安全模块对事务日志进行签名来防止篡改,但这无法保证正确的事务一开始就能进入到日志中。
使用密码学工具来证明系统的完整性是十分有趣的,这种方式对于宽泛的硬件与软件问题,甚至是潜在的恶意行为都很稳健有效。加密货币、区块链、以及诸如比特币、以太坊、Ripple、Stellar 的分布式账本技术已经迅速出现在这一领域【71,72,73】。
我没有资格评论这些技术用于货币,或者合同商定机制的价值。但从数据系统的角度来看,它们包含了一些有趣的想法。实质上,它们是分布式数据库,具有数据模型与事务机制,而不同副本可以由互不信任的组织托管。副本不断检查其他副本的完整性,并使用共识协议对应当执行的事务达成一致。
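I am a little skeptical about the Byzantine fault tolerance aspects of these technologies (see "[Byzantine faults](/v1/ch8#拜占庭故障)"), and I find the technique of **proof of work** (e.g., Bitcoin mining) extraordinarily wasteful. The transaction throughput of Bitcoin is rather low, albeit for political and economic reasons more than for technical ones. However, the integrity checking aspects are interesting.
Cryptographic auditing and integrity checking often relies on **Merkle trees**【74】, which are trees of hashes that can be used to efficiently prove that a record appears in some dataset (and a few other things). Outside of the hype of cryptocurrencies, **certificate transparency** is a security technology that relies on Merkle trees to check the validity of TLS/SSL certificates【75,76】.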
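To make the inclusion proof concrete, here is a deliberately simplified Merkle tree. It is a sketch under assumptions: an odd node is paired with a copy of itself, which is easy to follow but is not the construction used by certificate transparency and has known weaknesses, so it should not be copied into a real system.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_levels(records):
    """Return all tree levels, leaves first. An odd node is paired with itself."""
    level = [h(r) for r in records]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:
            level = level + [level[-1]]
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def prove(levels, index):
    """Sibling hashes from leaf to root, each flagged if the sibling is on the left."""
    proof = []
    for level in levels[:-1]:
        if len(level) % 2:
            level = level + [level[-1]]
        sibling = index ^ 1
        proof.append((level[sibling], sibling < index))
        index //= 2
    return proof

def verify(record, proof, root):
    """Recompute the root from one record and its proof."""
    node = h(record)
    for sibling, sibling_is_left in proof:
        node = h(sibling + node) if sibling_is_left else h(node + sibling)
    return node == root

records = [b"a", b"b", b"c", b"d", b"e"]
levels = build_levels(records)
root = levels[-1][0]
proof = prove(levels, 4)  # proof that b"e" is in the dataset
```

The verifier needs only the root and a proof whose length grows with the logarithm of the dataset size, not the dataset itself.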
我可以想象,那些在证书透明度与分布式账本中使用的完整性检查和审计算法,将会在通用数据系统中得到越来越广泛的应用。要使得这些算法对于没有密码学审计的系统同样可伸缩,并尽可能降低性能损失还需要一些工作。但我认为这是一个值得关注的有趣领域。
## 做正确的事情
在本书的最后部分,我想退后一步。在本书中,我们考察了各种不同的数据系统架构,评价了它们的优点与缺点,并探讨了构建可靠,可伸缩,可维护应用的技术。但是,我们忽略了讨论中一个重要而基础的部分,现在我想补充一下。
每个系统都服务于一个目的;我们采取的每个举措都会同时产生期望的后果与意外的后果。这个目的可能只是简单地赚钱,但其对世界的影响,可能会远远超出最初的目的。我们,建立这些系统的工程师,有责任去仔细考虑这些后果,并有意识地决定,我们希望生活在怎样的世界中。
我们将数据当成一种抽象的东西来讨论,但请记住,许多数据集都是关于人的:他们的行为,他们的兴趣,他们的身份。对待这些数据,我们必须怀着人性与尊重。用户也是人类,人类的尊严是至关重要的。
软件开发越来越多地涉及重要的道德抉择。有一些指导原则可以帮助软件工程师解决这些问题,例如 ACM 的软件工程道德规范与专业实践【77】,但实践中很少会讨论这些,更不用说应用与强制执行了。因此,工程师和产品经理有时会对隐私与产品潜在的负面后果抱有非常傲慢的态度【78,79,80】。
技术本身并无好坏之分 —— 关键在于它被如何使用,以及它如何影响人们。这对枪械这样的武器是成立的,而搜索引擎这样的软件系统与之类似。我认为,软件工程师仅仅专注于技术而忽视其后果是不够的:道德责任也是我们的责任。对道德推理很困难,但它太重要了,我们无法忽视。
### 预测性分析
举个例子,预测性分析是 “大数据” 炒作的主要内容之一。使用数据分析预测天气或疾病传播是一码事【81】;而预测一个罪犯是否可能再犯,一个贷款申请人是否有可能违约,或者一个保险客户是否可能进行昂贵的索赔,则是另外一码事。后者会直接影响到个人的生活。
当然,支付网络希望防止欺诈交易,银行希望避免不良贷款,航空公司希望避免劫机,公司希望避免雇佣效率低下或不值得信任的人。从它们的角度来看,失去商机的成本很低,而不良贷款或问题员工的成本则要高得多,因而组织希望保持谨慎也是自然而然的事情。所以如果存疑,它们通常会 Say No。
然而,随着算法决策变得越来越普遍,被某种算法(准确地或错误地)标记为有风险的某人可能会遭受大量这种 “No” 的决定。系统性地被排除在工作,航旅,保险,租赁,金融服务,以及其他社会关键领域之外。这是一种对个体自由的极大约束,因此被称为 “算法监狱”【82】。在尊重人权的国家,刑事司法系统会做无罪推定(默认清白,直到被证明有罪)。另一方面,自动化系统可以系统地,任意地将一个人排除在社会参与之外,不需要任何有罪的证明,而且几乎没有申诉的机会。
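#### Bias and discrimination
Decisions made by an algorithm are not necessarily better or worse than those made by a human. Every person is likely to have biases, even if they actively try to counteract them, and discriminatory practices can become culturally institutionalized. There is hope that basing decisions on data, rather than on people's subjective judgment and instinct, could be fairer and give a better chance to people who are often overlooked by the traditional system【83】.
When we develop predictive analytics systems, we are not merely automating a human's decision-making by encoding it as a series of if/else rules in software; we are leaving even the rules themselves to be inferred from data. However, the patterns these systems learn are opaque: even if there is some correlation in the data, we may have no idea why. If there is a systematic bias in the input to an algorithm, the system will most likely learn and amplify that bias in its output【84】.
In many countries, anti-discrimination laws prohibit treating people differently on the basis of protected traits such as ethnicity, age, gender, sexuality, disability, or beliefs. Other features of a person's data may be permitted in an analysis, but what happens if they are correlated with protected traits? In racially segregated neighborhoods, for example, a person's postal code or even their IP address is a strong predictor of race. Put like this, it seems absurd to believe that an algorithm could somehow take biased data as input and produce fair and impartial output from it【85】. Yet this belief often seems to lurk among proponents of data-driven decision-making, an attitude that has been satirized as “machine learning is like money laundering for bias”【86】.
Predictive analytics systems merely extrapolate from the past; if the past is discriminatory, they codify that discrimination. If we want the future to be better than the past, moral imagination is required, and that is something only humans can provide【87】. Data and models should be our tools, not our masters.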
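#### Responsibility and accountability
Automated decision-making raises questions of responsibility and accountability【87】. If a human makes a mistake, they can be held accountable, and the person affected by the decision can appeal. Algorithms make mistakes too, but who is accountable when they go wrong【88】? When a self-driving car causes an accident, who is responsible? If an automated credit-scoring algorithm systematically discriminates against people of a particular race or religion, do those people have any recourse? If a decision made by your machine learning system comes under judicial review, can you explain to the judge how the algorithm reached it?
Credit rating agencies are a classic example of collecting data about people in order to make decisions about them. A bad credit score makes life difficult, but at least a credit score is normally based on a person's **actual** borrowing history, and any errors in the record can be corrected (although the agencies usually do not make this easy). Scoring algorithms based on machine learning, however, typically use a much wider range of inputs and are much more opaque, which makes it harder to understand how a particular decision came about and whether someone is being treated unfairly or in a discriminatory way【89】.
A credit score summarizes “How did you behave in the past?” whereas predictive analytics usually works on the basis of “Who is similar to you, and how did people like you behave in the past?” Drawing parallels with other people's behavior implies stereotyping people, for example on the basis of where they live (a close proxy for race and socioeconomic class). What about people who get put in the wrong bucket? Moreover, if a decision is wrong because of erroneous data, recourse is almost impossible【87】.
Much data is statistical in nature, which means that even if the probability distribution as a whole is correct, individual cases may well be wrong. For example, if the average life expectancy in your country is 80 years, that does not mean you will drop dead on your 80th birthday. From the average and the probability distribution you cannot say much about the lifespan of one particular person. Similarly, the output of a prediction system is probabilistic and may be wrong in individual cases.
A blind belief in the supremacy of data for making decisions is not only delusional, it is positively dangerous. As data-driven decision-making becomes more widespread, we will need to figure out how to make algorithms accountable and transparent, how to avoid reinforcing existing biases, and how to fix them when they inevitably make mistakes.
We will also need to figure out how to prevent data from being used to harm people and how to realize its positive potential instead. For example, analytics can reveal financial and social characteristics of people's lives. On the one hand, this power could be used to focus aid and support on the people who need it most. On the other hand, it is sometimes used by predatory businesses to identify vulnerable people and sell them risky products such as high-cost loans and worthless college degrees【87,90】.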
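#### Feedback loops
Even predictive applications that have a less immediate effect on people, such as recommendation systems, raise difficult issues that we must confront. When services become good at predicting what content users want to see, they may end up showing people only opinions they already agree with, leading to **echo chambers** in which stereotypes, misinformation, and polarization can breed. We have already seen the impact of social media echo chambers on election campaigns【91】.
When predictive analytics affect people's lives, particularly pernicious problems arise from self-reinforcing feedback loops. Consider, for example, employers who use credit scores to evaluate potential hires. You may be a good worker with a good credit score, but then find yourself in financial difficulty because of a misfortune outside your control. As you miss payments on your bills, your credit score suffers, and it becomes harder to find work. Joblessness pushes you toward poverty, which further worsens your score and makes it even harder to find employment【87】. It is a downward spiral caused by poisonous assumptions, hidden behind a camouflage of data and mathematical rigor.
We cannot always predict when such feedback loops will occur. However, many consequences can be predicted by thinking about the entire system (not just the computerized parts, but also the people interacting with it), an approach known as **systems thinking**【92】. We can try to understand how a data analysis system responds to different behaviors, structures, or characteristics. Does the system reinforce and amplify existing differences between people (for example, making the rich richer and the poor poorer), or does it try to combat injustice? And even with the best intentions, we must beware of unintended consequences.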
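### Privacy and Tracking
Besides the problems of predictive analytics (that is, using data to make automated decisions about people), there are ethical problems with data collection itself. What is the relationship between the organizations that collect data and the people whose data is being collected?
When a system stores only data that a user has explicitly entered, because they want the system to store and process it in a certain way, **the system is performing a service for the user**: the user is the customer. But when a user's activity is tracked and logged as a side effect of other things they are doing, the relationship is less clear. The service no longer just does what the user tells it to do; it takes on interests of its own, which may conflict with the user's interests.
Tracking behavioral data has become increasingly important for many user-facing online services: tracking which search results are clicked helps improve the ranking of search results; recommending “people who liked X also liked Y” helps users discover interesting and useful things; A/B tests and user flow analysis help improve a user interface. These features require some amount of tracking of user behavior, and users benefit from them.
However, depending on a company's business model, tracking often does not stop there. If the service is funded through advertising, the advertisers are the actual customers, and the users' interests take second place. Tracking data becomes more detailed, analyses become further-reaching, and data is retained for a long time in order to build up a detailed profile of each person for marketing purposes.
Now the relationship between the company and the users whose data is being collected starts to look quite different. The user is given a free service and is coaxed into engaging with it as much as possible. The tracking of the user serves primarily not that individual but the advertisers who fund the service. I think this relationship can be appropriately described with a word that has more sinister connotations: **surveillance**.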
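#### Surveillance
As a thought experiment, try replacing the word **data** with **surveillance**, and see whether common phrases still sound so good【93】. How about this: “In our surveillance-driven organization we collect real-time surveillance streams and store them in our surveillance warehouse. Our surveillance scientists use advanced analytics and surveillance processing in order to derive new insights.”
This thought experiment is unusually polemical for this book, *Designing Surveillance-Intensive Applications*, but I think strong words are needed to emphasize the point. In our attempts to make software “eat the world”【94】, we have built the greatest mass surveillance infrastructure the world has ever seen. We are rushing toward an Internet of Things, and rapidly approaching a world in which every inhabited space contains at least one internet-connected microphone, in the form of smartphones, smart TVs, voice-controlled assistant devices, baby monitors, and even children's toys that use cloud-based speech recognition. Many of these devices have a terrible security record【95】.
Even the most totalitarian and repressive regimes could only dream of putting a microphone in every room and forcing every person to constantly carry a device capable of tracking their location and movements. Yet we apparently accept this world of total surveillance voluntarily, even enthusiastically. The difference is just that the data is being collected by corporations rather than by government agencies【96】.
Not all data collection necessarily qualifies as surveillance, but examining it as such helps us understand our relationship with the data collector. Why do we seem happy to accept surveillance by corporations? Perhaps you feel you have nothing to hide; in other words, you are entirely in line with existing power structures, you are not a marginalized minority, and you need not fear persecution【97】. Not everyone is so fortunate. Or perhaps it is because the purpose seems benign: it is neither overt coercion nor compulsion, but merely better recommendations and more personalized marketing. However, combined with the discussion of predictive analytics in the previous section, that distinction seems less clear.
We are already seeing car insurance premiums linked to tracking devices in cars, and health insurance coverage that depends on people wearing a fitness tracking device. When surveillance is used to determine things that hold sway over important aspects of life, such as insurance coverage or employment, it starts to appear less benign. Moreover, data analysis can reveal surprisingly intrusive things: for example, the movement sensor in a smartwatch or fitness tracker can be used to work out what you are typing (a password, for example) with fairly good accuracy【98】. And analysis algorithms will only become more accurate.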
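#### Consent and freedom of choice
We might assert that users voluntarily choose to use a service that tracks their activity, and that they have agreed to the terms of service and the privacy policy, so they consent to data collection. We might even claim that users receive a valuable service **in return** for the data they provide, and that the tracking is necessary in order to provide the service. Undoubtedly, social networks, search engines, and various other free online services are valuable to users, but there are problems with this argument.
Users have little knowledge of what data they are feeding into our databases, or of how it is retained and processed, and most privacy policies do more to obscure than to illuminate. Without understanding what happens to their data, users cannot give any meaningful consent. Often, data from one user also says things about other people who are not users of the service and who have not agreed to any terms. The derived datasets that we discussed in this part of the book, in which data from the entire user base may be combined with behavioral tracking and external data sources, are precisely the kinds of data of which users cannot have any meaningful understanding.
Moreover, data is extracted from users through a one-way process, not a relationship with true reciprocity, and not a fair exchange of value. Users have neither a say nor a choice in how much data they provide and what service they receive in return: the relationship between the service and the user is very asymmetric and one-sided. The terms are set by the service, not by the user【99】.
For a user who does not consent to surveillance, the only real alternative is simply not to use the service. But this choice is not free either: if a service is so popular that it is “regarded by most people as essential for basic social participation”【99】, then it is not reasonable to expect people to opt out of it; using it is **de facto** mandatory. For example, in most Western social communities it has become the norm to carry a smartphone, to use Facebook for socializing, and to use Google for finding information. Especially when a service has network effects, there is a social cost to people choosing **not** to use it.
Declining to use a service because of its tracking of users is an option only for the small number of people who are privileged enough to have the time and knowledge to understand its privacy policy, and who can afford to miss out on the social participation or professional opportunities that using the service might have brought. For people in a less privileged position, there is no meaningful freedom of choice: surveillance becomes inescapable.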
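#### Privacy and use of data
Sometimes people claim that “privacy is dead” on the grounds that some users are willing to post all sorts of things about their lives to social media, sometimes mundane and sometimes deeply personal. However, this claim is false, and it rests on a misunderstanding of the word **privacy**.
Having privacy does not mean keeping everything secret; it means having the freedom to choose which things to reveal to whom, what to make public, and what to keep secret. **The right to privacy is a decision right**: it enables each person to decide where they want to be on the spectrum between secrecy and transparency in each situation【99】. It is an important aspect of a person's freedom and autonomy.
When data is extracted from people through surveillance infrastructure, privacy rights are not necessarily eroded, but rather transferred to the data collector. Companies that acquire data are essentially saying “trust us to do the right thing with your data,” which means that the right to decide what to reveal and what to keep secret is transferred from the individual to the company.
The companies in turn choose to keep much of the outcome of this surveillance secret, because revealing it would be perceived as creepy and would harm their business model (which relies on knowing more about people than other companies do). Intimate information about users is revealed only indirectly, for example in the form of tools for targeting advertisements at specific groups of people (such as those suffering from a particular illness).
Even if particular users cannot be personally reidentified from the group of people targeted by a particular ad, they have lost their agency over the disclosure of some intimate information, such as whether they suffer from some illness. It is not the individual who decides what is revealed to whom according to their own preferences, but **the company** that exercises the privacy right with the goal of maximizing its profit.
Many companies have a goal of not being **perceived** as creepy, avoiding the question of how intrusive their data collection actually is and focusing instead on managing user perceptions. Even these perceptions are often managed poorly: for example, something may be factually correct, but if it triggers painful memories, the user may not want to be reminded of it【100】. With any kind of data we should expect the possibility that it is wrong, undesirable, or inappropriate in some way, and we need to build mechanisms for handling those failures. Whether something is “undesirable” or “inappropriate” is of course down to human judgment; algorithms are oblivious to such notions unless we explicitly program them to respect human needs. As engineers of these systems we must be humble, accepting and planning for such failings.
Privacy settings that allow a user of an online service to control which aspects of their data other users can see are a starting point for handing some control back to users. However, regardless of the setting, the service itself still has unfettered access to the data and is free to use it in any way permitted by the privacy policy. Even if the service promises not to sell the data to third parties, it usually grants itself unrestricted rights to process and analyze the data internally, often going much further than what is overtly visible to users.
This kind of large-scale transfer of privacy rights from individuals to corporations is historically unprecedented【99】. Surveillance has always existed, but it used to be expensive and manual, not scalable and automated. Trust relationships have always existed, for example between a patient and their doctor, or between a defendant and their attorney, but in those cases the use of data has been strictly governed by ethical, legal, and regulatory constraints. Internet services have made it much easier to amass huge amounts of sensitive information without meaningful consent, and without users understanding what is happening to their private data.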
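#### Data as assets and power
Because behavioral data is a byproduct of users interacting with a service, it is sometimes called “data exhaust,” suggesting that the data is worthless waste material. Viewed this way, behavioral and predictive analytics can be seen as a form of recycling that extracts value from data that would otherwise have been thrown away.
It would be more accurate to view it the other way round: from an economic point of view, if targeted advertising is what pays for a service, then behavioral data about people is the service's core asset. In that case, the application with which the user interacts is merely a means of luring users into feeding ever more personal information into the surveillance infrastructure【99】. The delightful human creativity and social relationships that often find expression in online services are cynically exploited by the data extraction machine.
The assertion that personal data is a valuable asset is supported by the existence of data brokers, a shady industry operating in secrecy that purchases, aggregates, analyzes, infers, and resells intrusive personal data about people, mostly for marketing purposes【90】. Startups are valued by their user numbers, by “eyeballs”; that is, by their surveillance capabilities.
Because the data is valuable, many people want it. Companies want it, of course; that is why they collect it in the first place. But governments want to obtain it too: by means of secret deals, coercion, legal compulsion, or simply stealing it【101】. When a company goes bankrupt, the personal data it has collected is one of the assets that gets sold. Moreover, data is difficult to secure, so breaches happen disconcertingly often【102】.
These observations have led critics to say that data is not just an asset but a “toxic asset”【101】, or at least “hazardous material”【103】. Even if we think that we are capable of preventing abuse of data, whenever we collect data we need to balance the benefits against the risk of it falling into the wrong hands: computer systems may be compromised by criminals or hostile foreign intelligence services, data may be leaked by insiders, the company may fall into the hands of unscrupulous management that does not share our values, or the country may be taken over by a regime that has no qualms about compelling us to hand over the data.
As the old adage goes, “knowledge is power.” Furthermore, “to scrutinize others while avoiding scrutiny oneself is one of the most important forms of power”【105】. This is why totalitarian governments want surveillance: it gives them the power to control the population. Although today's technology companies are not overtly seeking political power, the data and knowledge they have accumulated nevertheless gives them a great deal of power, much of it exercised surreptitiously and outside of public oversight【106】.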
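#### Remembering the Industrial Revolution
Data is the defining feature of the information age. The internet, data storage and processing, and software-driven automation are having a major impact on the global economy and on human society. Our daily lives and social organization have changed over the past decade, and will probably continue to change radically in the coming decade, so comparisons with the Industrial Revolution come to mind【87,96】.
The Industrial Revolution came about through major advances in technology and agriculture, and it brought sustained economic growth and significantly improved living standards in the long run. Yet it also came with major problems: air pollution (from smoke and chemical processes) and water pollution (from industrial and human waste) were dreadful. Factory owners lived in splendor, while urban workers often lived in very poor housing and worked long hours in harsh conditions. Child labor was common, including dangerous and poorly paid work in mines.
It took a long time before safeguards were established, such as environmental protection regulations, workplace safety rules, the outlawing of child labor, and health inspections for food. Undoubtedly the cost of doing business increased when factories could no longer dump their waste into rivers, sell tainted food, or exploit workers. But society as a whole benefited hugely, and few of us would want to return to a time before those regulations【87】.
Just as the Industrial Revolution had a dark side that needed to be managed, our transition to the information age has major problems that we need to confront and solve. I believe that the collection and use of data is one of those problems. In the words of Bruce Schneier【96】:
> Data is the pollution problem of the information age, and protecting privacy is the environmental challenge. Almost all computers produce information. It stays around, festering. How we deal with it, how we contain it and how we dispose of it, is central to the health of our information economy. Just as we look back today at the early decades of the industrial age and wonder how our ancestors could have ignored pollution in their rush to build an industrial world, our grandchildren will look back at us during these early decades of the information age and judge us on how we addressed the challenge of data collection and misuse.
>
> We should try to make them proud.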
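#### Legislation and self-regulation
Data protection laws might be able to help preserve individuals' rights. For example, the 1995 European Data Protection Directive states that personal data must be “collected for specified, explicit and legitimate purposes and not further processed in a way incompatible with those purposes,” and furthermore that data must be “adequate, relevant and not excessive in relation to the purposes for which they are collected”【107】.
However, it is doubtful whether this legislation is effective in today's internet context【108】. These rules run directly counter to the philosophy of big data, which is to maximize data collection, to combine it with other datasets, and to experiment and explore in order to generate new insights. Exploration means using data for unforeseen purposes, which is the opposite of the “specified and explicit” purposes for which the user gave their consent (if we can meaningfully speak of consent at all)【109】. Updated regulations are being developed【89】.
Companies that collect lots of data about people oppose regulation as a burden and a hindrance to innovation. To some extent that opposition is justified. For example, when sharing medical data there are clear risks to privacy, but there are also potential opportunities: how many deaths could be prevented if data analysis were able to help us achieve better diagnostics or find better treatments【110】? Over-regulation may prevent such breakthroughs. It is difficult to balance such potential opportunities against the risks【105】.
Fundamentally, I think we need a culture shift in the tech industry with regard to personal data. We should stop regarding users as metrics to be optimized, and remember that they are humans who deserve respect, dignity, and agency. We should self-regulate our data collection and processing practices in order to establish and maintain the trust of the people who depend on our software【111】. And we should take it upon ourselves to educate end users about how their data is used, rather than keeping them in the dark.
We should allow each individual to maintain their privacy, that is, their control over their own data, and not steal that control from them through surveillance. Our individual right to control our data is like the natural environment of a national park: if we do not explicitly protect and care for it, it will be destroyed. It will be the tragedy of the commons, and we will all be worse off for it. Ubiquitous surveillance is not inevitable; we are still able to stop it.
How exactly we might achieve this is an open question. To begin with, we should not retain data forever, but purge it as soon as it is no longer needed【111,112】. Purging data runs counter to the idea of immutability (see “[Limitations of immutability](/v1/ch11#不变性的局限性)”), but that problem can be solved. A promising approach I see is to enforce access control through cryptographic protocols, rather than merely by policy【113,114】. Overall, changes in culture and attitude will be necessary.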
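## Summary
In this chapter we discussed new approaches to designing data systems, and I included my personal opinions and speculations about the future. We started with the observation that no single tool can efficiently serve all possible use cases, so applications must compose several different pieces of software to accomplish their goals. We discussed how to solve this **data integration** problem by using batch processing and event streams to let data changes flow between different systems.
In this approach, certain systems are designated as systems of record, and other data is derived from them through transformations. In this way we can maintain indexes, materialized views, machine learning models, statistical summaries, and more. By making these derivations and transformations asynchronous and loosely coupled, a problem in one area is prevented from spreading to unrelated parts of the system, which increases the robustness and fault tolerance of the system as a whole.
Expressing dataflows as transformations from one dataset to another also helps evolve applications: if you want to change one of the processing steps, for example to change the structure of an index or cache, you can just rerun the new transformation code on the whole input dataset in order to rederive the output. Similarly, if something goes wrong, you can fix the code and reprocess the data in order to recover.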
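The idea of rederiving output by rerunning a transformation can be sketched in a few lines. In this toy example, the event log plays the role of the system of record and each derived view is a pure function of that log; the event fields and the function names `views_per_page` and `visitors_per_page` are hypothetical, chosen only for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical system of record: an append-only log of page-view events.
event_log = [
    {"user": "alice", "page": "/home"},
    {"user": "bob", "page": "/home"},
    {"user": "alice", "page": "/pricing"},
    {"user": "alice", "page": "/home"},
]


def views_per_page(log):
    """Version 1 of the derived view: total views for each page."""
    return dict(Counter(event["page"] for event in log))


def visitors_per_page(log):
    """Version 2: distinct visitors per page, derived from the same log."""
    visitors = defaultdict(set)
    for event in log:
        visitors[event["page"]].add(event["user"])
    return {page: len(users) for page, users in visitors.items()}


old_view = views_per_page(event_log)
# Changing the structure of the view means running the new code over the
# whole input again; the log itself is never modified.
new_view = visitors_per_page(event_log)

print(old_view)  # {'/home': 3, '/pricing': 1}
print(new_view)  # {'/home': 2, '/pricing': 1}
```

Because neither function mutates the log, the old and new views can coexist while the new one is being built and checked, and a buggy transformation can be fixed and rerun without losing any input.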
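These processes are quite similar to what databases already do internally, so we recast the idea of dataflow applications as **unbundling** the components of a database, and building an application by composing these loosely coupled components.
Derived state can be updated by observing changes in the underlying data. Moreover, the derived state itself can in turn be observed by downstream consumers. We can even take this dataflow all the way through to the end-user device that is displaying the data, and thus build user interfaces that dynamically update to reflect data changes and continue to work offline.
Next, we discussed how to ensure that all of this processing remains correct in the presence of faults. We saw that strong integrity guarantees can be implemented scalably with asynchronous event processing, by using end-to-end operation identifiers to make operations idempotent and by checking constraints asynchronously. Clients can either wait until the check has passed, or go ahead without waiting but risk having to apologize about a constraint violation. This approach is much more scalable and robust than the traditional approach of using distributed transactions, and it fits with how many business processes work in practice.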
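As a reminder of how an end-to-end operation identifier makes an operation idempotent, here is a minimal in-memory sketch. The `Ledger` class and its `apply_transfer` method are hypothetical; in a real system, recording the identifier and applying the change would have to happen atomically in durable storage, which this example does not attempt.

```python
import uuid


class Ledger:
    """Toy account store that ignores operations it has already applied."""

    def __init__(self):
        self.balances = {}
        self.applied_ops = set()  # identifiers of operations already processed

    def apply_transfer(self, op_id, src, dst, amount):
        """Apply a transfer once; return False if op_id was seen before."""
        if op_id in self.applied_ops:
            return False  # duplicate delivery, for example a client retry
        self.balances[src] = self.balances.get(src, 0) - amount
        self.balances[dst] = self.balances.get(dst, 0) + amount
        self.applied_ops.add(op_id)
        return True


# The client generates the identifier once and reuses it on every retry,
# so the identifier travels end to end with the request.
op_id = str(uuid.uuid4())

ledger = Ledger()
ledger.apply_transfer(op_id, "alice", "bob", 25)
ledger.apply_transfer(op_id, "alice", "bob", 25)  # retry after a timeout

print(ledger.balances)  # {'alice': -25, 'bob': 25}
```

The retry is harmless because the second call is recognized by its identifier rather than by its contents, so two genuinely separate transfers of the same amount would still both be applied.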
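By structuring applications around dataflow and checking constraints asynchronously, we can avoid most coordination and create systems that maintain integrity but still perform well, even in geographically distributed scenarios and in the presence of faults. We then talked a little about using audits to verify the integrity of data and detect corruption.
Finally, we took a step back and examined some ethical aspects of building data-intensive applications. We saw that although data can be used to do good, it can also do significant harm: making decisions that seriously affect people's lives and are difficult to appeal against, leading to discrimination and exploitation, normalizing surveillance, and exposing intimate information. We also run the risk of data breaches, and we may find that a well-intentioned use of data has unintended consequences.
As software and data are having such a large impact on the world, we engineers must remember that we carry a responsibility to work toward the kind of world that we want to live in: a world that treats people with humanity and respect. I hope that we can work together toward that goal.
## References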
1. Rachid Belaid: “[Postgres Full-Text Search is Good Enough!](http://rachbelaid.com/postgres-full-text-search-is-good-enough/),” *rachbelaid.com*, July 13, 2015.
1. Philippe Ajoux, Nathan Bronson, Sanjeev Kumar, et al.: “[Challenges to Adopting Stronger Consistency at Scale](https://www.usenix.org/system/files/conference/hotos15/hotos15-paper-ajoux.pdf),” at *15th USENIX Workshop on Hot Topics in Operating Systems* (HotOS), May 2015.
1. Pat Helland and Dave Campbell: “[Building on Quicksand](https://web.archive.org/web/20220606172817/https://database.cs.wisc.edu/cidr/cidr2009/Paper_133.pdf),” at *4th Biennial Conference on Innovative Data Systems Research* (CIDR), January 2009.
1. Jessica Kerr: “[Provenance and Causality in Distributed Systems](https://web.archive.org/web/20190425150540/http://blog.jessitron.com/2016/09/provenance-and-causality-in-distributed.html),” *blog.jessitron.com*, September 25, 2016.
1. Kostas Tzoumas: “[Batch Is a Special Case of Streaming](http://data-artisans.com/blog/batch-is-a-special-case-of-streaming/),” *data-artisans.com*, September 15, 2015.
1. Shinji Kim and Robert Blafford: “[Stream Windowing Performance Analysis: Concord and Spark Streaming](https://web.archive.org/web/20180125074821/http://concord.io/posts/windowing_performance_analysis_w_spark_streaming),” *concord.io*, July 6, 2016.
1. Jay Kreps: “[The Log: What Every Software Engineer Should Know About Real-Time Data's Unifying Abstraction](http://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying),” *engineering.linkedin.com*, December 16, 2013.
1. Pat Helland: “[Life Beyond Distributed Transactions: An Apostate’s Opinion](https://web.archive.org/web/20200730171311/http://www-db.cs.wisc.edu/cidr/cidr2007/papers/cidr07p15.pdf),” at *3rd Biennial Conference on Innovative Data Systems Research* (CIDR), January 2007.
1. “[Great Western Railway (1835–1948)](https://web.archive.org/web/20160122155425/https://www.networkrail.co.uk/VirtualArchive/great-western/),” Network Rail Virtual Archive, *networkrail.co.uk*.
1. Jacqueline Xu: “[Online Migrations at Scale](https://stripe.com/blog/online-migrations),” *stripe.com*, February 2, 2017.
1. Molly Bartlett Dishman and Martin Fowler: “[Agile Architecture](https://web.archive.org/web/20161130034721/http://conferences.oreilly.com/software-architecture/sa2015/public/schedule/detail/40388),” at *O'Reilly Software Architecture Conference*, March 2015.
1. Nathan Marz and James Warren: [*Big Data: Principles and Best Practices of Scalable Real-Time Data Systems*](https://www.manning.com/books/big-data). Manning, 2015. ISBN: 978-1-617-29034-3
1. Oscar Boykin, Sam Ritchie, Ian O'Connell, and Jimmy Lin: “[Summingbird: A Framework for Integrating Batch and Online MapReduce Computations](http://www.vldb.org/pvldb/vol7/p1441-boykin.pdf),” at *40th International Conference on Very Large Data Bases* (VLDB), September 2014.
1. Jay Kreps: “[Questioning the Lambda Architecture](https://www.oreilly.com/ideas/questioning-the-lambda-architecture),” *oreilly.com*, July 2, 2014.
1. Raul Castro Fernandez, Peter Pietzuch, Jay Kreps, et al.: “[Liquid: Unifying Nearline and Offline Big Data Integration](http://cidrdb.org/cidr2015/Papers/CIDR15_Paper25u.pdf),” at *7th Biennial Conference on Innovative Data Systems Research* (CIDR), January 2015.
1. Dennis M. Ritchie and Ken Thompson: “[The UNIX Time-Sharing System](http://web.eecs.utk.edu/~qcao1/cs560/papers/paper-unix.pdf),” *Communications of the ACM*, volume 17, number 7, pages 365–375, July 1974. [doi:10.1145/361011.361061](http://dx.doi.org/10.1145/361011.361061)
1. Eric A. Brewer and Joseph M. Hellerstein: “[CS262a: Advanced Topics in Computer Systems](http://people.eecs.berkeley.edu/~brewer/cs262/systemr.html),” lecture notes, University of California, Berkeley, *cs.berkeley.edu*, August 2011.
1. Michael Stonebraker: “[The Case for Polystores](http://wp.sigmod.org/?p=1629),” *wp.sigmod.org*, July 13, 2015.
1. Jennie Duggan, Aaron J. Elmore, Michael Stonebraker, et al.: “[The BigDAWG Polystore System](https://dspace.mit.edu/handle/1721.1/100936),” *ACM SIGMOD Record*, volume 44, number 2, pages 11–16, June 2015. [doi:10.1145/2814710.2814713](http://dx.doi.org/10.1145/2814710.2814713)
1. Patrycja Dybka: “[Foreign Data Wrappers for PostgreSQL](https://web.archive.org/web/20221003115732/https://www.vertabelo.com/blog/foreign-data-wrappers-for-postgresql/),” *vertabelo.com*, March 24, 2015.
1. David B. Lomet, Alan Fekete, Gerhard Weikum, and Mike Zwilling: “[Unbundling Transaction Services in the Cloud](https://www.microsoft.com/en-us/research/publication/unbundling-transaction-services-in-the-cloud/),” at *4th Biennial Conference on Innovative Data Systems Research* (CIDR), January 2009.
1. Martin Kleppmann and Jay Kreps: “[Kafka, Samza and the Unix Philosophy of Distributed Data](http://martin.kleppmann.com/papers/kafka-debull15.pdf),” *IEEE Data Engineering Bulletin*, volume 38, number 4, pages 4–14, December 2015.
1. John Hugg: “[Winning Now and in the Future: Where VoltDB Shines](https://voltdb.com/blog/winning-now-and-future-where-voltdb-shines),” *voltdb.com*, March 23, 2016.
1. Frank McSherry, Derek G. Murray, Rebecca Isaacs, and Michael Isard: “[Differential Dataflow](http://cidrdb.org/cidr2013/Papers/CIDR13_Paper111.pdf),” at *6th Biennial Conference on Innovative Data Systems Research* (CIDR), January 2013.
1. Derek G Murray, Frank McSherry, Rebecca Isaacs, et al.: “[Naiad: A Timely Dataflow System](http://sigops.org/s/conferences/sosp/2013/papers/p439-murray.pdf),” at *24th ACM Symposium on Operating Systems Principles* (SOSP), pages 439–455, November 2013. [doi:10.1145/2517349.2522738](http://dx.doi.org/10.1145/2517349.2522738)
1. Gwen Shapira: “[We have a bunch of customers who are implementing ‘database inside-out’ concept and they all ask ‘is anyone else doing it? are we crazy?’](https://twitter.com/gwenshap/status/758800071110430720)” *twitter.com*, July 28, 2016.
1. Martin Kleppmann: “[Turning the Database Inside-out with Apache Samza,](http://martin.kleppmann.com/2015/03/04/turning-the-database-inside-out.html)” at *Strange Loop*, September 2014.
1. Peter Van Roy and Seif Haridi: [*Concepts, Techniques, and Models of Computer Programming*](https://www.info.ucl.ac.be/~pvr/book.html). MIT Press, 2004. ISBN: 978-0-262-22069-9
1. “[Juttle Documentation](http://juttle.github.io/juttle/),” *juttle.github.io*, 2016.
1. Evan Czaplicki and Stephen Chong: “[Asynchronous Functional Reactive Programming for GUIs](http://people.seas.harvard.edu/~chong/pubs/pldi13-elm.pdf),” at *34th ACM SIGPLAN Conference on Programming Language Design and Implementation* (PLDI), June 2013. [doi:10.1145/2491956.2462161](http://dx.doi.org/10.1145/2491956.2462161)
1. Engineer Bainomugisha, Andoni Lombide Carreton, Tom van Cutsem, Stijn Mostinckx, and Wolfgang de Meuter: “[A Survey on Reactive Programming](http://soft.vub.ac.be/Publications/2012/vub-soft-tr-12-13.pdf),” *ACM Computing Surveys*, volume 45, number 4, pages 1–34, August 2013. [doi:10.1145/2501654.2501666](http://dx.doi.org/10.1145/2501654.2501666)
1. Peter Alvaro, Neil Conway, Joseph M. Hellerstein, and William R. Marczak: “[Consistency Analysis in Bloom: A CALM and Collected Approach](https://dsf.berkeley.edu/cs286/papers/calm-cidr2011.pdf),” at *5th Biennial Conference on Innovative Data Systems Research* (CIDR), January 2011.
1. Felienne Hermans: “[Spreadsheets Are Code](https://vimeo.com/145492419),” at *Code Mesh*, November 2015.
1. Dan Bricklin and Bob Frankston: “[VisiCalc: Information from Its Creators](http://danbricklin.com/visicalc.htm),” *danbricklin.com*.
1. D. Sculley, Gary Holt, Daniel Golovin, et al.: “[Machine Learning: The High-Interest Credit Card of Technical Debt](http://research.google.com/pubs/pub43146.html),” at *NIPS Workshop on Software Engineering for Machine Learning* (SE4ML), December 2014.
1. Peter Bailis, Alan Fekete, Michael J Franklin, et al.: “[Feral Concurrency Control: An Empirical Investigation of Modern Application Integrity](http://www.bailis.org/papers/feral-sigmod2015.pdf),” at *ACM International Conference on Management of Data* (SIGMOD), June 2015. [doi:10.1145/2723372.2737784](http://dx.doi.org/10.1145/2723372.2737784)
1. Guy Steele: “[Re: Need for Macros (Was Re: Icon)](https://people.csail.mit.edu/gregs/ll1-discuss-archive-html/msg01134.html),” email to *ll1-discuss* mailing list, *people.csail.mit.edu*, December 24, 2001.
1. David Gelernter: “[Generative Communication in Linda](http://cseweb.ucsd.edu/groups/csag/html/teaching/cse291s03/Readings/p80-gelernter.pdf),” *ACM Transactions on Programming Languages and Systems* (TOPLAS), volume 7, number 1, pages 80–112, January 1985. [doi:10.1145/2363.2433](http://dx.doi.org/10.1145/2363.2433)
1. Patrick Th. Eugster, Pascal A. Felber, Rachid Guerraoui, and Anne-Marie Kermarrec: “[The Many Faces of Publish/Subscribe](http://www.cs.ru.nl/~pieter/oss/manyfaces.pdf),” *ACM Computing Surveys*, volume 35, number 2, pages 114–131, June 2003. [doi:10.1145/857076.857078](http://dx.doi.org/10.1145/857076.857078)
1. Ben Stopford: “[Microservices in a Streaming World](https://www.infoq.com/presentations/microservices-streaming),” at *QCon London*, March 2016.
1. Christian Posta: “[Why Microservices Should Be Event Driven: Autonomy vs Authority](http://blog.christianposta.com/microservices/why-microservices-should-be-event-driven-autonomy-vs-authority/),” *blog.christianposta.com*, May 27, 2016.
1. Alex Feyerke: “[Say Hello to Offline First](https://web.archive.org/web/20210420014747/http://hood.ie/blog/say-hello-to-offline-first.html),” *hood.ie*, November 5, 2013.
1. Sebastian Burckhardt, Daan Leijen, Jonathan Protzenko, and Manuel Fähndrich: “[Global Sequence Protocol: A Robust Abstraction for Replicated Shared State](http://drops.dagstuhl.de/opus/volltexte/2015/5238/),” at *29th European Conference on Object-Oriented Programming* (ECOOP), July 2015. [doi:10.4230/LIPIcs.ECOOP.2015.568](http://dx.doi.org/10.4230/LIPIcs.ECOOP.2015.568)
1. Mark Soper: “[Clearing Up React Data Management Confusion with Flux, Redux, and Relay](https://medium.com/@marksoper/clearing-up-react-data-management-confusion-with-flux-redux-and-relay-aad504e63cae),” *medium.com*, December 3, 2015.
1. Eno Thereska, Damian Guy, Michael Noll, and Neha Narkhede: “[Unifying Stream Processing and Interactive Queries in Apache Kafka](http://www.confluent.io/blog/unifying-stream-processing-and-interactive-queries-in-apache-kafka/),” *confluent.io*, October 26, 2016.
1. Frank McSherry: “[Dataflow as Database](https://github.com/frankmcsherry/blog/blob/master/posts/2016-07-17.md),” *github.com*, July 17, 2016.
1. Peter Alvaro: “[I See What You Mean](https://www.youtube.com/watch?v=R2Aa4PivG0g),” at *Strange Loop*, September 2015.
1. Nathan Marz: “[Trident: A High-Level Abstraction for Realtime Computation](https://blog.twitter.com/2012/trident-a-high-level-abstraction-for-realtime-computation),” *blog.twitter.com*, August 2, 2012.
1. Edi Bice: “[Low Latency Web Scale Fraud Prevention with Apache Samza, Kafka and Friends](http://www.slideshare.net/edibice/extremely-low-latency-web-scale-fraud-prevention-with-apache-samza-kafka-and-friends),” at *Merchant Risk Council MRC Vegas Conference*, March 2016.
1. Charity Majors: “[The Accidental DBA](https://charity.wtf/2016/10/02/the-accidental-dba/),” *charity.wtf*, October 2, 2016.
1. Arthur J. Bernstein, Philip M. Lewis, and Shiyong Lu: “[Semantic Conditions for Correctness at Different Isolation Levels](http://db.cs.berkeley.edu/cs286/papers/isolation-icde2000.pdf),” at *16th International Conference on Data Engineering* (ICDE), February 2000. [doi:10.1109/ICDE.2000.839387](http://dx.doi.org/10.1109/ICDE.2000.839387)
1. Sudhir Jorwekar, Alan Fekete, Krithi Ramamritham, and S. Sudarshan: “[Automating the Detection of Snapshot Isolation Anomalies](http://www.vldb.org/conf/2007/papers/industrial/p1263-jorwekar.pdf),” at *33rd International Conference on Very Large Data Bases* (VLDB), September 2007.
1. Kyle Kingsbury: [Jepsen blog post series](https://aphyr.com/tags/jepsen), *aphyr.com*, 2013–2016.
1. Michael Jouravlev: “[Redirect After Post](http://www.theserverside.com/news/1365146/Redirect-After-Post),” *theserverside.com*, August 1, 2004.
1. Jerome H. Saltzer, David P. Reed, and David D. Clark: “[End-to-End Arguments in System Design](https://groups.csail.mit.edu/ana/Publications/PubPDFs/End-to-End%20Arguments%20in%20System%20Design.pdf),” *ACM Transactions on Computer Systems*, volume 2, number 4, pages 277–288, November 1984. [doi:10.1145/357401.357402](http://dx.doi.org/10.1145/357401.357402)
1. Peter Bailis, Alan Fekete, Michael J. Franklin, et al.: “[Coordination-Avoiding Database Systems](http://arxiv.org/pdf/1402.2237.pdf),” *Proceedings of the VLDB Endowment*, volume 8, number 3, pages 185–196, November 2014.
1. Alex Yarmula: “[Strong Consistency in Manhattan](https://blog.twitter.com/2016/strong-consistency-in-manhattan),” *blog.twitter.com*, March 17, 2016.
1. Douglas B Terry, Marvin M Theimer, Karin Petersen, et al.: “[Managing Update Conflicts in Bayou, a Weakly Connected Replicated Storage System](http://css.csail.mit.edu/6.824/2014/papers/bayou-conflicts.pdf),” at *15th ACM Symposium on Operating Systems Principles* (SOSP), pages 172–182, December 1995. [doi:10.1145/224056.224070](http://dx.doi.org/10.1145/224056.224070)
1. Jim Gray: “[The Transaction Concept: Virtues and Limitations](http://jimgray.azurewebsites.net/papers/thetransactionconcept.pdf),” at *7th International Conference on Very Large Data Bases* (VLDB), September 1981.
1. Hector Garcia-Molina and Kenneth Salem: “[Sagas](http://www.cs.cornell.edu/andru/cs711/2002fa/reading/sagas.pdf),” at *ACM International Conference on Management of Data* (SIGMOD), May 1987. [doi:10.1145/38713.38742](http://dx.doi.org/10.1145/38713.38742)
1. Pat Helland: “[Memories, Guesses, and Apologies](https://web.archive.org/web/20160304020907/http://blogs.msdn.com/b/pathelland/archive/2007/05/15/memories-guesses-and-apologies.aspx),” *blogs.msdn.com*, May 15, 2007.
1. Yoongu Kim, Ross Daly, Jeremie Kim, et al.: “[Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors](https://users.ece.cmu.edu/~yoonguk/papers/kim-isca14.pdf),” at *41st Annual International Symposium on Computer Architecture* (ISCA), June 2014. [doi:10.1145/2678373.2665726](http://dx.doi.org/10.1145/2678373.2665726)
1. Mark Seaborn and Thomas Dullien: “[Exploiting the DRAM Rowhammer Bug to Gain Kernel Privileges](https://googleprojectzero.blogspot.co.uk/2015/03/exploiting-dram-rowhammer-bug-to-gain.html),” *googleprojectzero.blogspot.co.uk*, March 9, 2015.
1. Jim N. Gray and Catharine van Ingen: “[Empirical Measurements of Disk Failure Rates and Error Rates](https://www.microsoft.com/en-us/research/publication/empirical-measurements-of-disk-failure-rates-and-error-rates/),” Microsoft Research, MSR-TR-2005-166, December 2005.
1. Annamalai Gurusami and Daniel Price: “[Bug #73170: Duplicates in Unique Secondary Index Because of Fix of Bug#68021](http://bugs.mysql.com/bug.php?id=73170),” *bugs.mysql.com*, July 2014.
1. Gary Fredericks: “[Postgres Serializability Bug](https://github.com/gfredericks/pg-serializability-bug),” *github.com*, September 2015.
1. Xiao Chen: “[HDFS DataNode Scanners and Disk Checker Explained](http://blog.cloudera.com/blog/2016/12/hdfs-datanode-scanners-and-disk-checker-explained/),” *blog.cloudera.com*, December 20, 2016.
1. Jay Kreps: “[Getting Real About Distributed System Reliability](http://blog.empathybox.com/post/19574936361/getting-real-about-distributed-system-reliability),” *blog.empathybox.com*, March 19, 2012.
1. Martin Fowler: “[The LMAX Architecture](http://martinfowler.com/articles/lmax.html),” *martinfowler.com*, July 12, 2011.
1. Sam Stokes: “[Move Fast with Confidence](http://blog.samstokes.co.uk/blog/2016/07/11/move-fast-with-confidence/),” *blog.samstokes.co.uk*, July 11, 2016.
1. “[Hyperledger Sawtooth documentation](https://web.archive.org/web/20220120211548/https://sawtooth.hyperledger.org/docs/core/releases/latest/introduction.html),” Intel Corporation, *sawtooth.hyperledger.org*, 2017.
1. Richard Gendal Brown: “[Introducing R3 Corda™: A Distributed Ledger Designed for Financial Services](https://gendal.me/2016/04/05/introducing-r3-corda-a-distributed-ledger-designed-for-financial-services/),” *gendal.me*, April 5, 2016.
1. Trent McConaghy, Rodolphe Marques, Andreas Müller, et al.: “[BigchainDB: A Scalable Blockchain Database](https://www.bigchaindb.com/whitepaper/bigchaindb-whitepaper.pdf),” *bigchaindb.com*, June 8, 2016.
1. Ralph C. Merkle: “[A Digital Signature Based on a Conventional Encryption Function](https://people.eecs.berkeley.edu/~raluca/cs261-f15/readings/merkle.pdf),” at *CRYPTO '87*, August 1987. [doi:10.1007/3-540-48184-2_32](http://dx.doi.org/10.1007/3-540-48184-2_32)
1. Ben Laurie: “[Certificate Transparency](http://queue.acm.org/detail.cfm?id=2668154),” *ACM Queue*, volume 12, number 8, pages 10–19, August 2014. [doi:10.1145/2668152.2668154](http://dx.doi.org/10.1145/2668152.2668154)
1. Mark D. Ryan: “[Enhanced Certificate Transparency and End-to-End Encrypted Mail](https://www.ndss-symposium.org/wp-content/uploads/2017/09/12_2_1.pdf),” at *Network and Distributed System Security Symposium* (NDSS), February 2014. [doi:10.14722/ndss.2014.23379](http://dx.doi.org/10.14722/ndss.2014.23379)
1. “[ACM Code of Ethics and Professional Conduct](https://www.acm.org/code-of-ethics),” Association for Computing Machinery, *acm.org*, 2018.
1. François Chollet: “[Software development is starting to involve important ethical choices](https://twitter.com/fchollet/status/792958695722201088),” *twitter.com*, October 30, 2016.
1. Igor Perisic: “[Making Hard Choices: The Quest for Ethics in Machine Learning](https://engineering.linkedin.com/blog/2016/11/making-hard-choices--the-quest-for-ethics-in-machine-learning),” *engineering.linkedin.com*, November 2016.
1. John Naughton: “[Algorithm Writers Need a Code of Conduct](https://www.theguardian.com/commentisfree/2015/dec/06/algorithm-writers-should-have-code-of-conduct),” *theguardian.com*, December 6, 2015.
1. Logan Kugler: “[What Happens When Big Data Blunders?](http://cacm.acm.org/magazines/2016/6/202655-what-happens-when-big-data-blunders/fulltext),” *Communications of the ACM*, volume 59, number 6, pages 15–16, June 2016. [doi:10.1145/2911975](http://dx.doi.org/10.1145/2911975)
1. Bill Davidow: “[Welcome to Algorithmic Prison](http://www.theatlantic.com/technology/archive/2014/02/welcome-to-algorithmic-prison/283985/),” *theatlantic.com*, February 20, 2014.
1. Don Peck: “[They're Watching You at Work](http://www.theatlantic.com/magazine/archive/2013/12/theyre-watching-you-at-work/354681/),” *theatlantic.com*, December 2013.
1. Leigh Alexander: “[Is an Algorithm Any Less Racist Than a Human?](https://www.theguardian.com/technology/2016/aug/03/algorithm-racist-human-employers-work)” *theguardian.com*, August 3, 2016.
1. Jesse Emspak: “[How a Machine Learns Prejudice](https://www.scientificamerican.com/article/how-a-machine-learns-prejudice/),” *scientificamerican.com*, December 29, 2016.
1. Maciej Cegłowski: “[The Moral Economy of Tech](http://idlewords.com/talks/sase_panel.htm),” *idlewords.com*, June 2016.
1. Cathy O'Neil: [*Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy*](https://web.archive.org/web/20210621234447/https://weaponsofmathdestructionbook.com/). Crown Publishing, 2016. ISBN: 978-0-553-41881-1
1. Julia Angwin: “[Make Algorithms Accountable](http://www.nytimes.com/2016/08/01/opinion/make-algorithms-accountable.html),” *nytimes.com*, August 1, 2016.
1. Bryce Goodman and Seth Flaxman: “[European Union Regulations on Algorithmic Decision-Making and a ‘Right to Explanation’](https://arxiv.org/abs/1606.08813),” *arXiv:1606.08813*, August 31, 2016.
1. “[A Review of the Data Broker Industry: Collection, Use, and Sale of Consumer Data for Marketing Purposes](https://web.archive.org/web/20240619042302/http://educationnewyork.com/files/rockefeller_databroker.pdf),” Staff Report, *United States Senate Committee on Commerce, Science, and Transportation*, *commerce.senate.gov*, December 2013.
1. Olivia Solon: “[Facebook’s Failure: Did Fake News and Polarized Politics Get Trump Elected?](https://www.theguardian.com/technology/2016/nov/10/facebook-fake-news-election-conspiracy-theories)” *theguardian.com*, November 10, 2016.
1. Donella H. Meadows and Diana Wright: *Thinking in Systems: A Primer*. Chelsea Green Publishing, 2008. ISBN: 978-1-603-58055-7
1. Daniel J. Bernstein: “[Listening to a ‘big data’/‘data science’ talk](https://twitter.com/hashbreaker/status/598076230437568512),” *twitter.com*, May 12, 2015.
1. Marc Andreessen: “[Why Software Is Eating the World](http://genius.com/Marc-andreessen-why-software-is-eating-the-world-annotated),” *The Wall Street Journal*, August 20, 2011.
1. J. M. Porup: “[‘Internet of Things’ Security Is Hilariously Broken and Getting Worse](http://arstechnica.com/security/2016/01/how-to-search-the-internet-of-things-for-photos-of-sleeping-babies/),” *arstechnica.com*, January 23, 2016.
1. Bruce Schneier: [*Data and Goliath: The Hidden Battles to Collect Your Data and Control Your World*](https://www.schneier.com/books/data_and_goliath/). W. W. Norton, 2015. ISBN: 978-0-393-35217-7
1. The Grugq: “[Nothing to Hide](https://grugq.tumblr.com/post/142799983558/nothing-to-hide),” *grugq.tumblr.com*, April 15, 2016.
1. Tony Beltramelli: “[Deep-Spying: Spying Using Smartwatch and Deep Learning](https://arxiv.org/abs/1512.05616),” Masters Thesis, IT University of Copenhagen, December 2015. Available at *arxiv.org/abs/1512.05616*
1. Shoshana Zuboff: “[Big Other: Surveillance Capitalism and the Prospects of an Information Civilization](http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2594754),” *Journal of Information Technology*, volume 30, number 1, pages 75–89, April 2015. [doi:10.1057/jit.2015.5](http://dx.doi.org/10.1057/jit.2015.5)
1. Carina C. Zona: “[Consequences of an Insightful Algorithm](https://www.youtube.com/watch?v=YRI40A4tyWU),” at *GOTO Berlin*, November 2016.
1. Bruce Schneier: “[Data Is a Toxic Asset, So Why Not Throw It Out?](https://www.schneier.com/essays/archives/2016/03/data_is_a_toxic_asse.html),” *schneier.com*, March 1, 2016.
1. John E. Dunn: “[The UK’s 15 Most Infamous Data Breaches](https://web.archive.org/web/20161120070058/http://www.techworld.com/security/uks-most-infamous-data-breaches-2016-3604586/),” *techworld.com*, November 18, 2016.
1. Cory Scott: “[Data is not toxic - which implies no benefit - but rather hazardous material, where we must balance need vs. want](https://twitter.com/cory_scott/status/706586399483437056),” *twitter.com*, March 6, 2016.
1. Bruce Schneier: “[Mission Creep: When Everything Is Terrorism](https://www.schneier.com/essays/archives/2013/07/mission_creep_when_e.html),” *schneier.com*, July 16, 2013.
1. Lena Ulbricht and Maximilian von Grafenstein: “[Big Data: Big Power Shifts?](http://policyreview.info/articles/analysis/big-data-big-power-shifts),” *Internet Policy Review*, volume 5, number 1, March 2016. [doi:10.14763/2016.1.406](http://dx.doi.org/10.14763/2016.1.406)
1. Ellen P. Goodman and Julia Powles: “[Facebook and Google: Most Powerful and Secretive Empires We've Ever Known](https://www.theguardian.com/technology/2016/sep/28/google-facebook-powerful-secretive-empire-transparency),” *theguardian.com*, September 28, 2016.
1. [Directive 95/46/EC on the protection of individuals with regard to the processing of personal data and on the free movement of such data](http://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:31995L0046), Official Journal of the European Communities No. L 281/31, *eur-lex.europa.eu*, November 1995.
1. Brendan Van Alsenoy: “[Regulating Data Protection: The Allocation of Responsibility and Risk Among Actors Involved in Personal Data Processing](https://lirias.kuleuven.be/handle/123456789/545027),” Thesis, KU Leuven Centre for IT and IP Law, August 2016.
1. Michiel Rhoen: “[Beyond Consent: Improving Data Protection Through Consumer Protection Law](http://policyreview.info/articles/analysis/beyond-consent-improving-data-protection-through-consumer-protection-law),” *Internet Policy Review*, volume 5, number 1, March 2016. [doi:10.14763/2016.1.404](http://dx.doi.org/10.14763/2016.1.404)
1. Jessica Leber: “[Your Data Footprint Is Affecting Your Life in Ways You Can’t Even Imagine](https://www.fastcoexist.com/3057514/your-data-footprint-is-affecting-your-life-in-ways-you-cant-even-imagine),” *fastcoexist.com*, March 15, 2016.
1. Maciej Cegłowski: “[Haunted by Data](http://idlewords.com/talks/haunted_by_data.htm),” *idlewords.com*, October 2015.
1. Sam Thielman: “[You Are Not What You Read: Librarians Purge User Data to Protect Privacy](https://www.theguardian.com/us-news/2016/jan/13/us-library-records-purged-data-privacy),” *theguardian.com*, January 13, 2016.
1. Conor Friedersdorf: “[Edward Snowden’s Other Motive for Leaking](http://www.theatlantic.com/politics/archive/2014/05/edward-snowdens-other-motive-for-leaking/370068/),” *theatlantic.com*, May 13, 2014.
1. Phillip Rogaway: “[The Moral Character of Cryptographic Work](http://web.cs.ucdavis.edu/~rogaway/papers/moral-fn.pdf),” Cryptology ePrint 2015/1162, December 2015.
================================================
FILE: content/v1/ch2.md
================================================
---
title: "Chapter 2: Data Models and Query Languages"
linkTitle: "2. Data Models and Query Languages"
weight: 102
math: true
breadcrumbs: false
---

> The limits of my language mean the limits of my world.
>
> Ludwig Wittgenstein, *Tractatus Logico-Philosophicus* (1922)
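Data models are perhaps the most important part of developing software, because their effect is so profound: they shape not only how the software is written, but also how we **think about the problem** we are solving.
Most applications are built by layering one data model on top of another. For each layer, the key question is: how is it **represented** in terms of the next-lower layer? For example: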
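1. As an application developer, you look at the real world (people, organizations, goods, actions, money flows, sensors, and so on) and model it in terms of objects or data structures, and APIs that manipulate those data structures. Those structures are often specific to your application.
2. When you want to store those data structures, you express them in terms of a general-purpose data model, such as JSON or XML documents, tables in a relational database, or a graph model.
3. The engineers who built the database software decided how to represent that JSON/XML/relational/graph data as bytes in memory, on disk, or on a network. The representation may allow the data to be queried, searched, manipulated, and processed in various ways.
4. At still lower levels, hardware engineers have worked out how to represent bytes in terms of electrical currents, pulses of light, magnetic fields, and more.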
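A complex application may have more intermediate levels, such as APIs built on top of APIs, but the basic idea is the same: each layer hides the complexity of the layers below it by providing a clean data model. These abstractions allow different groups of people (for example, the engineers at the database vendor and the application developers using their database) to work together effectively.
There are many kinds of data models, and every data model embodies assumptions about how it will be used. Some kinds of usage are easy and some are not supported; some operations are fast and some perform badly; some data transformations feel natural and some are awkward.
Mastering even one data model can take a lot of effort (think how many books there are on relational data modeling). Building software is hard enough even when working with just one data model and without worrying about its inner workings. But because the data model has such a profound effect on what the software above it can and cannot do, it is important to choose one that suits the application.
In this chapter we look at a range of general-purpose data models for data storage and querying (point 2 in the preceding list). In particular, we compare the relational model, the document model, and a few graph-based data models. We also look at various query languages and compare their use cases. [Chapter 3](/v1/ch3) discusses how storage engines work, that is, how these data models are actually implemented (point 3 in the list).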
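## Relational Model Versus Document Model
The best-known data model today is probably that of SQL, based on the relational model proposed by Edgar Codd in 1970【1】: data is organized into **relations** (called **tables** in SQL), where each relation is an unordered collection of **tuples** (**rows** in SQL).
The relational model began as a theoretical proposal, and many people at the time doubted whether it could be implemented efficiently. By the mid-1980s, however, relational database management systems (RDBMSes) and SQL had become the tools of choice for most people who needed to store and query data with some kind of regular structure. The dominance of relational databases has lasted around 25 to 30 years, an eternity in computing history.
The roots of relational databases lie in business data processing, which was performed on mainframe computers in the 1960s and '70s. The use cases look mundane from today's perspective: typically **transaction processing** (entering sales or banking transactions, airline reservations, stock-keeping in warehouses) and **batch processing** (customer invoicing, payroll, reporting).
Other databases at that time forced application developers to think a lot about the internal representation of the data in the database. The goal of the relational model was to hide that implementation detail behind a cleaner interface.
Over the years, there have been many competing approaches to data storage and querying. In the 1970s and early 1980s, the network model and the hierarchical model were the main alternatives, but the relational model came to dominate them. Object databases came and went in the late 1980s and early 1990s. XML databases appeared in the early 2000s but saw only niche adoption. Each competitor to the relational model generated a lot of hype in its time, but none of it lasted【2】.
As computers became vastly more powerful and networked, they started being used for increasingly diverse purposes. Relational databases generalized remarkably well beyond their original scope of business data processing to a broad variety of use cases. Much of what you see on the web today is still powered by relational databases, whether online publishing, discussion, social networking, e-commerce, games, or software-as-a-service productivity applications.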
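### The Birth of NoSQL
Now, in the 2010s, NoSQL is the latest attempt to overthrow the relational model's dominance. The name "NoSQL" is unfortunate, since it doesn't actually refer to any particular technology. It was originally intended simply as a catchy Twitter hashtag for a 2009 meetup on open source, distributed, nonrelational databases. Nevertheless, the term struck a nerve and quickly spread through the web startup community and beyond. A number of interesting database systems are now associated with the *#NoSQL* hashtag, and it has been retroactively reinterpreted as **Not Only SQL**【4】.
There are several driving forces behind the adoption of NoSQL databases, including:
* A need for greater scalability than relational databases can easily achieve, including very large datasets or very high write throughput
* A widespread preference for free and open source software over commercial database products
* Specialized query operations that are not well supported by the relational model
* Frustration with the restrictiveness of relational schemas, and a desire for a more dynamic and expressive data model【5】
Different applications have different requirements, and the best choice of technology for one use case may well differ from the best choice for another. It therefore seems likely that in the foreseeable future, relational databases will continue to be used alongside a broad variety of nonrelational datastores, an idea that is sometimes called **polyglot persistence**.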
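### The Object-Relational Mismatch
Most application development today is done in object-oriented programming languages, which leads to a common criticism of the SQL data model: if data is stored in relational tables, an awkward translation layer is required between the objects in the application code and the database model of tables, rows, and columns. The disconnect between the models is sometimes called an **impedance mismatch**[^i].
[^i]: A term borrowed from electronics. Every electric circuit has a certain impedance (resistance to alternating current) on its inputs and outputs. When you connect one circuit's output to another one's input, the power transfer across the connection is maximized if the output and input impedances of the two circuits match. An impedance mismatch can lead to signal reflections and other problems.
**Object-relational mapping (ORM)** frameworks such as ActiveRecord and Hibernate reduce the amount of boilerplate code required for this translation layer, but they can't completely hide the differences between the two models.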

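**Figure 2-1. Representing a LinkedIn profile using a relational schema**
For example, [Figure 2-1](/v1/ddia_0201.png) shows how a résumé (a LinkedIn profile) could be expressed in a relational schema. The profile as a whole can be identified by a unique identifier, `user_id`. Fields like `first_name` and `last_name` appear exactly once per user, so they can be modeled as columns on the User table. However, most people have had more than one job in their career, and people may have varying numbers of periods of education and any number of pieces of contact information. There is a one-to-many relationship from the user to these items, which can be represented in various ways:
* In the traditional SQL model (prior to SQL:1999), the most common normalized representation is to put positions, education, and contact information in separate tables, with a foreign key reference to the User table, as in [Figure 2-1](/v1/ddia_0201.png).
* Later versions of the SQL standard added support for structured datatypes and XML data; this allows multi-valued data to be stored within a single row, with support for querying and indexing inside those documents. These features are supported to varying degrees by Oracle, IBM DB2, MS SQL Server, and PostgreSQL【6,7】. A JSON datatype is also supported by several databases, including IBM DB2, MySQL, and PostgreSQL【8】.
* A third option is to encode positions, education, and contact information as a JSON or XML document, store it in a text column in the database, and let the application interpret its structure and content. In this setup, you typically cannot use the database to query for values inside that encoded column.
For a data structure like a résumé, which is mostly a self-contained document, a JSON representation can be quite appropriate: see Example 2-1. JSON has the appeal of being simpler than XML. Document-oriented databases such as MongoDB【9】, RethinkDB【10】, CouchDB【11】, and Espresso【12】 support this data model.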
**Example 2-1. Representing a LinkedIn profile as a JSON document**
```json
{
"user_id": 251,
"first_name": "Bill",
"last_name": "Gates",
"summary": "Co-chair of the Bill & Melinda Gates... Active blogger.",
"region_id": "us:91",
"industry_id": 131,
"photo_url": "/p/7/000/253/05b/308dd6e.jpg",
"positions": [
{
"job_title": "Co-chair",
"organization": "Bill & Melinda Gates Foundation"
},
{
"job_title": "Co-founder, Chairman",
"organization": "Microsoft"
}
],
"education": [
{
"school_name": "Harvard University",
"start": 1973,
"end": 1975
},
{
"school_name": "Lakeside School, Seattle",
"start": null,
"end": null
}
],
"contact_info": {
"blog": "http://thegatesnotes.com",
"twitter": "http://twitter.com/BillGates"
}
}
```
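Some developers feel that the JSON model reduces the impedance mismatch between the application code and the storage layer. However, as we shall see in [Chapter 4](/v1/ch4), JSON also has problems as a data encoding format. The lack of a schema is often cited as an advantage of the JSON model; we discuss this in "[Schema Flexibility in the Document Model](#文档模型中的模式灵活性)".
The JSON representation has better **locality** than the multi-table schema in [Figure 2-1](/v1/ddia_0201.png). To fetch a profile in the relational example, you need to either perform multiple queries (querying each table by `user_id`) or perform a messy multi-way join between the User table and its subordinate tables. In the JSON representation, all the relevant information is in one place, and one query is sufficient.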
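The difference in locality can be sketched in a few lines of JavaScript. This is only an illustration under stated assumptions: the arrays and the `Map` below are hypothetical in-memory stand-ins for the tables of Figure 2-1 and for a document store, and the function names are made up for this sketch.

```javascript
// Hypothetical in-memory stand-ins for the relational tables in Figure 2-1.
const users = [{ user_id: 251, first_name: "Bill", last_name: "Gates" }];
const positions = [
  { user_id: 251, job_title: "Co-chair", organization: "Bill & Melinda Gates Foundation" },
  { user_id: 251, job_title: "Co-founder, Chairman", organization: "Microsoft" },
];
const education = [
  { user_id: 251, school_name: "Harvard University", start: 1973, end: 1975 },
];

// Relational style: one lookup per table, then stitch the pieces together.
function loadProfileFromTables(userId) {
  const user = users.find((u) => u.user_id === userId);
  if (!user) return null;
  return {
    ...user,
    positions: positions.filter((p) => p.user_id === userId),
    education: education.filter((e) => e.user_id === userId),
  };
}

// Document style: the whole profile is stored under a single key.
const profileDocuments = new Map([
  [251, {
    user_id: 251,
    first_name: "Bill",
    last_name: "Gates",
    positions: [
      { job_title: "Co-chair", organization: "Bill & Melinda Gates Foundation" },
      { job_title: "Co-founder, Chairman", organization: "Microsoft" },
    ],
    education: [{ school_name: "Harvard University", start: 1973, end: 1975 }],
  }],
]);

// One lookup returns everything needed to render the profile.
function loadProfileFromDocument(userId) {
  return profileDocuments.get(userId) || null;
}
```

In the first function the number of lookups grows with the number of subordinate tables; in the second it stays at one, which is the locality advantage described above.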
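The one-to-many relationships from the user profile to the user's positions, educational history, and contact information imply a tree structure in the data, and the JSON representation makes this tree structure explicit (see [Figure 2-2](/v1/ddia_0202.png)).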

**Figure 2-2. One-to-many relationships forming a tree structure**
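### Many-to-One and Many-to-Many Relationships
In Example 2-1 in the preceding section, `region_id` and `industry_id` are given as IDs, not as plain-text strings such as "Greater Seattle Area" and "Philanthropy". Why?
If the user interface has free-text fields for entering the region and the industry, it makes sense to store them as plain-text strings. The alternative is to offer standardized lists of geographic regions and industries and let users choose from a drop-down list or autocompleter. This has several advantages:
* Consistent style and spelling across profiles
* Avoiding ambiguity (for example, if there are several cities with the same name)
* Ease of updating: the name is stored in only one place, so it is easy to update across the board if it ever needs to change (for example, when a city is renamed due to political events)
* Localization support: when the site is translated into other languages, the standardized lists can be localized, so the region and industry can be displayed in the viewer's language
* Better search: for example, a search for philanthropists in the state of Washington can match this profile, because the list of regions can encode the fact that Seattle is in Washington (which is not apparent from the string "Greater Seattle Area")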
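Whether you store an ID or a text string is a question of **duplication**. When you use an ID, the information that is meaningful to humans (such as the word *Philanthropy*) is stored in only one place, and everything that refers to it uses an ID (which has meaning only within the database). When you store the text directly, you duplicate the human-meaningful information in every record that uses it.
The advantage of using an ID is that, because it has no meaning to humans, it never needs to change: the ID can stay the same even if the information it identifies changes. Anything that is meaningful to humans may need to change at some point in the future, and if that information is duplicated, all the redundant copies need to be updated. That incurs write overheads and risks inconsistencies (where some copies are updated but others aren't). Removing such duplication is the key idea behind **normalization** in databases.[^ii]
[^ii]: The literature on the relational model distinguishes several different normal forms, but the distinctions are of little practical interest. As a rule of thumb, if you are duplicating values that could be stored in just one place, the schema is not **normalized**.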
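To make the write-overhead argument concrete, here is a minimal sketch. The data and the function names (`renameIndustry`, `renameIndustryCopies`, `industryOf`) are hypothetical and exist only for this illustration.

```javascript
// Normalized: the industry name is stored once and referenced by ID.
const industries = new Map([[131, "Philanthropy"]]);
const normalizedProfiles = [
  { user_id: 251, industry_id: 131 },
  { user_id: 252, industry_id: 131 },
];

// Denormalized: the human-readable name is copied into every profile.
const denormalizedProfiles = [
  { user_id: 251, industry: "Philanthropy" },
  { user_id: 252, industry: "Philanthropy" },
];

// Renaming in the normalized layout is a single write; the ID never changes.
function renameIndustry(industryId, newName) {
  industries.set(industryId, newName);
}

// Renaming in the denormalized layout must find and rewrite every copy.
// Returns the number of records written.
function renameIndustryCopies(oldName, newName) {
  let writes = 0;
  for (const profile of denormalizedProfiles) {
    if (profile.industry === oldName) {
      profile.industry = newName;
      writes++;
    }
  }
  return writes;
}

// Resolving the ID to a name at read time.
function industryOf(profile) {
  return industries.get(profile.industry_id);
}
```

If `renameIndustryCopies` were interrupted halfway through, some profiles would show the old name and others the new one, which is the inconsistency risk mentioned above.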
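> Database administrators and developers love to argue about normalization and denormalization, but we will suspend judgment for now. In [Part III](/v1/part-iii) of this book we return to this topic and explore systematic ways of dealing with caching, denormalization, and derived data.
Unfortunately, normalizing this data requires many-to-one relationships (many people live in one particular region, many people work in one particular industry), which don't fit nicely into the document model. In relational databases, it's normal to refer to rows in other tables by ID, because joins are easy. In document databases, joins are not needed for one-to-many tree structures, and support for joins is often weak[^iii].
[^iii]: At the time of writing, joins are supported in RethinkDB, not supported in MongoDB, and only supported in predeclared views in CouchDB.
If the database itself does not support joins, you have to emulate a join in application code by making multiple queries to the database. (In this case, the lists of regions and industries are probably small and slow-changing enough that the application can simply keep them in memory. Nevertheless, the work of making the join is shifted from the database to the application code.)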
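An application-side join of this kind might look like the following sketch. It assumes the reference lists fit in memory; `fetchProfile` is a hypothetical stand-in for a real document-store lookup, and all names here are illustrative.

```javascript
// Hypothetical reference data, small enough to keep in application memory.
const regionNames = new Map([["us:91", "Greater Seattle Area"]]);
const industryNames = new Map([[131, "Philanthropy"]]);

// Stand-in for a lookup against a document store that has no join support.
const storedProfiles = new Map([
  [251, { user_id: 251, first_name: "Bill", region_id: "us:91", industry_id: 131 }],
]);
function fetchProfile(userId) {
  return storedProfiles.get(userId) || null;
}

// The "join" now lives in application code: fetch the document,
// then resolve each reference against the in-memory lists.
function fetchProfileWithNames(userId) {
  const profile = fetchProfile(userId);
  if (profile === null) return null;
  return {
    ...profile,
    region: regionNames.get(profile.region_id) || null,
    industry: industryNames.get(profile.industry_id) || null,
  };
}
```

Every code path that displays a profile has to remember to go through a function like `fetchProfileWithNames`; the database no longer does this work on the application's behalf.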
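Moreover, even if the initial version of an application fits well in a join-free document model, data has a tendency to become more interconnected as features are added to applications. For example, consider some changes we could make to the résumé example:
Organizations and schools as entities
: In the previous description, `organization` (the company where the user worked) and `school_name` (where they studied) are just strings. Perhaps they should be references to entities instead? Then each organization, school, or university could have its own web page (with logo, news feed, and so on). Each résumé could link to the organizations and schools that it mentions, and include their logos and other information (see [Figure 2-3](/v1/ddia_0203.png) for an example from LinkedIn).
Recommendations
: Say you want to add a new feature: one user can write a recommendation for another user. The recommendation is shown on the résumé of the user who was recommended, together with the name and photo of the user making the recommendation. If the recommender updates their photo, any recommendations they have written need to reflect the new photo. Therefore, the recommendation should have a reference to the author's profile.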

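**Figure 2-3. The company name is not just a string, but a link to a company entity (screenshot of LinkedIn)**
[Figure 2-4](/v1/ddia_0204.png) illustrates how these new features require many-to-many relationships. The data within each dotted rectangle can be grouped into one document, but the references to organizations, schools, and other users need to be represented as references, and require joins when queried.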

**Figure 2-4. Extending résumés with many-to-many relationships**
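### Are Document Databases Repeating History?
While many-to-many relationships and joins are routinely used in relational databases, document databases and NoSQL reopened the debate on how best to represent such relationships in a database. This debate is much older than NoSQL; in fact, it goes back to the very earliest computerized database systems.
The most popular database for business data processing in the 1970s was IBM's Information Management System (IMS), originally developed for stock-keeping in the Apollo space program and first released commercially in 1968【13】. It is still in use and maintained today, running on OS/390 on IBM mainframes【14】.
The design of IMS used a fairly simple data model called the **hierarchical model**, which has some remarkable similarities to the JSON model used by document databases【2】. It represented all data as a tree of records nested within records, much like the JSON structure of [Figure 2-2](/v1/ddia_0202.png).
Like document databases, IMS worked well for one-to-many relationships, but it made many-to-many relationships difficult, and it didn't support joins. Developers had to decide whether to duplicate (denormalize) data or to manually resolve references from one record to another. These problems of the 1960s and '70s were very much like the problems that developers are running into with document databases today【15】.
Various solutions were proposed to solve the limitations of the hierarchical model. The two most prominent were the **relational model** (which became SQL, and took over the world) and the **network model** (which initially had a large following but eventually faded into obscurity). The "great debate" between these two camps lasted for much of the 1970s【2】.
Since the problem that the two models were solving is still relevant today, it's worth briefly revisiting this debate.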
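#### The Network Model
The network model was standardized by a committee called the Conference on Data Systems Languages (CODASYL) and implemented by several different database vendors; it is also known as the CODASYL model【16】.
The CODASYL model was a generalization of the hierarchical model. In the tree structure of the hierarchical model, every record has exactly one parent; in the network model, a record could have multiple parents. For example, there could be one record for the "Greater Seattle Area" region, and every user who lived in that region could be linked to it. This allowed many-to-one and many-to-many relationships to be modeled.
The links between records in the network model were not foreign keys, but more like pointers in a programming language (while still being stored on disk). The only way of accessing a record was to follow a path from a root record along these chains of links. This was called an **access path**.
In the simplest case, an access path could be like the traversal of a linked list: start at the head of the list, and look at one record at a time until you find the one you want. But in a world of many-to-many relationships, several different paths can lead to the same record, and a programmer working with the network model had to keep track of these different access paths.
A query in CODASYL was performed by moving a cursor through the database by iterating over lists of records and following access paths. If a record had multiple parents (that is, multiple incoming pointers from other records), the application code had to keep track of all the various relationships. Even CODASYL committee members admitted that this was like navigating around an n-dimensional data space【17】.
Although manual access path selection was able to make the most efficient use of the very limited hardware capabilities in the 1970s (such as tape drives, whose seeks are extremely slow), it made the code for querying and updating the database complicated and inflexible. With both the hierarchical and the network model, if you didn't have a path to the data you wanted, you were in a difficult situation. You could change the access paths, but then you had to go through a lot of handwritten database query code and rewrite it to handle the new access paths. It was difficult to make changes to an application's data model.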
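The flavor of this style of programming can be imitated with plain object references. The sketch below is not CODASYL code; the record layout (`firstChild`, `nextSibling`) and the function names are assumptions chosen only to show what hand-written access-path traversal looks like.

```javascript
// Records linked by direct pointers rather than by foreign keys.
function makeRecord(data) {
  return { data, firstChild: null, nextSibling: null };
}

const seattle = makeRecord({ type: "region", name: "Greater Seattle Area" });
const alice = makeRecord({ type: "user", name: "Alice" });
const bob = makeRecord({ type: "user", name: "Bob" });
seattle.firstChild = alice; // region -> first user in the region
alice.nextSibling = bob;    // user -> next user in the same region

// Access path: start at a root record and move a cursor along the chain,
// looking at one record at a time until the wanted one is found.
function findUserInRegion(regionRecord, userName) {
  let cursor = regionRecord.firstChild;
  let steps = 0;
  while (cursor !== null) {
    steps++;
    if (cursor.data.name === userName) return { record: cursor, steps };
    cursor = cursor.nextSibling;
  }
  return { record: null, steps };
}
```

The traversal logic is baked into `findUserInRegion`. If users were later reorganized to hang off, say, an employer record instead, every function written this way would have to be found and rewritten.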
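#### The Relational Model
By contrast, what the relational model did was to lay out all the data in the open: a **relation (table)** is simply a collection of **tuples (rows)**, and that's it. There are no labyrinthine nested structures and no complicated access paths to follow if you want to look at the data. You can read any or all of the rows in a table, selecting those that match an arbitrary condition. You can read a particular row by designating some columns as a key and matching on those. You can insert a new row into any table without worrying about foreign key relationships to and from other tables[^iv].
[^iv]: Foreign key constraints allow you to restrict modifications, but such constraints are not required by the relational model. Even with constraints, joins on foreign keys are performed at query time, whereas in CODASYL, the join was effectively done at insert time.
In a relational database, the query optimizer automatically decides which parts of the query to execute in which order, and which indexes to use. Those choices are effectively the "access path", but the big difference is that they are made automatically by the query optimizer, not by the application developer, so we rarely need to think about them.
If you want to query your data in new ways, you can just declare a new index, and queries will automatically use whichever indexes are most appropriate. You don't need to change your queries to take advantage of a new index (see also "[Query Languages for Data](#数据查询语言)"). The relational model thus made it much easier to add new features to applications.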
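A toy model of this behavior is sketched below. It is far simpler than a real query optimizer: the `Table` class, its `plan` field, and the fact that indexes are built once and not maintained on later inserts are all simplifying assumptions for illustration.

```javascript
// A toy table whose select() chooses its own access path.
class Table {
  constructor(rows) {
    this.rows = rows;
    this.indexes = new Map(); // column name -> Map(value -> rows)
  }

  // Declaring an index builds a lookup structure for one column.
  createIndex(column) {
    const index = new Map();
    for (const row of this.rows) {
      const key = row[column];
      if (!index.has(key)) index.set(key, []);
      index.get(key).push(row);
    }
    this.indexes.set(column, index);
  }

  // Callers state what they want (column = value), not how to find it.
  select(column, value) {
    const index = this.indexes.get(column);
    if (index) {
      return { plan: "index lookup", rows: index.get(value) || [] };
    }
    return { plan: "full scan", rows: this.rows.filter((r) => r[column] === value) };
  }
}

const animalsTable = new Table([
  { name: "Great white", family: "Sharks" },
  { name: "Blue whale", family: "Whales" },
  { name: "Hammerhead", family: "Sharks" },
]);
```

Calling `animalsTable.select("family", "Sharks")` before and after `animalsTable.createIndex("family")` returns the same rows; only the reported plan changes, and the calling code stays untouched.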
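Query optimizers for relational databases are complicated, and they have consumed many years of research and development effort【18】. But a key insight of the relational model was this: you only need to build a query optimizer once, and then all applications that use the database can benefit from it. If you don't have a query optimizer, it's easier to handcode the access paths for a particular query than to write a general-purpose optimizer, but the general-purpose solution wins in the long run.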
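#### Comparison to Document Databases
Document databases reverted to the hierarchical model in one respect: they store nested records (one-to-many relationships, like `positions`, `education`, and `contact_info` in [Figure 2-1](/v1/ddia_0201.png)) within their parent record rather than in a separate table.
However, when it comes to representing many-to-one and many-to-many relationships, relational and document databases are not fundamentally different: in both cases, the related item is referenced by a unique identifier, which is called a **foreign key** in the relational model and a **document reference** in the document model【9】. That identifier is resolved at read time by using a join or follow-up queries. To date, document databases have not followed the path of CODASYL.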
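### Relational Versus Document Databases Today
There are many differences to consider when comparing relational databases to document databases, including their fault-tolerance properties (see [Chapter 5](/v1/ch5)) and handling of concurrency (see [Chapter 7](/v1/ch7)). This chapter concentrates only on the differences in the data model.
The main arguments in favor of the document data model are schema flexibility, better performance due to locality, and, for some applications, a closer match to the data structures used by the application. The relational model counters by providing better support for joins, and for many-to-one and many-to-many relationships.
#### Which Data Model Leads to Simpler Application Code?
If the data in your application has a document-like structure (that is, a tree of one-to-many relationships, where typically the entire tree is loaded at once), then it's probably a good idea to use a document model. The relational technique of splitting a document-like structure into multiple tables (like `positions`, `education`, and `contact_info` in [Figure 2-1](/v1/ddia_0201.png)) can lead to cumbersome schemas and unnecessarily complicated application code.
The document model has limitations: for example, you cannot refer directly to a nested item within a document, but instead need to say something like "the second item in the list of positions for user 251" (much like an access path in the hierarchical model). However, as long as documents are not too deeply nested, that is not usually a problem.
The poor support for joins in document databases may or may not be a problem, depending on the application. For example, many-to-many relationships may never be needed in an analytics application that uses a document database to record which events occurred when and where【19】.
However, if your application does use many-to-many relationships, the document model becomes less appealing. It's possible to reduce the need for joins by denormalizing, but then the application code needs to do additional work to keep the denormalized data consistent. Joins can be emulated in application code by making multiple requests to the database, but that also moves complexity into the application and is usually slower than a join performed by specialized code inside the database. In such cases, using a document model can lead to more complex application code and worse performance【15】.
It's not possible to say in general which data model leads to simpler application code; it depends on the kinds of relationships that exist between data items. For highly interconnected data, the document model is awkward, the relational model is acceptable, and graph models (see "[Graph-Like Data Models](#图数据模型)") are the most natural.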
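#### Schema Flexibility in the Document Model {#文档模型中的模式灵活性}
Most document databases, and the JSON support in relational databases, do not enforce any schema on the data in documents. XML support in relational databases usually comes with optional schema validation. No schema means that arbitrary keys and values can be added to a document, and when reading, clients have no guarantees as to what fields the documents may contain.
Document databases are sometimes called **schemaless**, but that's misleading, as the code that reads the data usually assumes some kind of structure: there is an implicit schema, but it is not enforced by the database【20】. A more accurate term is **schema-on-read** (the structure of the data is implicit, and only interpreted when the data is read), in contrast with **schema-on-write** (the traditional approach of relational databases, where the schema is explicit and the database ensures all written data conforms to it)【21】.
Schema-on-read is similar to dynamic (runtime) type checking in programming languages, whereas schema-on-write is similar to static (compile-time) type checking. Just as the advocates of static and dynamic type checking have big debates about their relative merits【22】, enforcement of schemas in databases is a contentious topic, and in general there is no right or wrong answer.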
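The two approaches can be contrasted in a short sketch. The schema object and the function names below are hypothetical, and a real database would enforce far richer constraints than a `typeof` check.

```javascript
// Hypothetical schema: field name -> expected result of typeof.
const userSchema = { user_id: "number", first_name: "string" };

// Schema-on-write: nonconforming data is rejected before it is stored.
function writeWithSchema(store, doc) {
  for (const [field, type] of Object.entries(userSchema)) {
    if (typeof doc[field] !== type) {
      throw new TypeError(`field ${field} must be a ${type}`);
    }
  }
  store.push(doc);
}

// Schema-on-read: anything can be stored...
function writeWithoutSchema(store, doc) {
  store.push(doc);
}

// ...and the reader carries the implicit schema, coping with every
// shape of document that may have been written in the past.
function readFirstName(doc) {
  if (typeof doc.first_name === "string") return doc.first_name;
  if (typeof doc.name === "string") return doc.name.split(" ")[0];
  return null;
}
```

With schema-on-write the error surfaces at the writer, close to its cause; with schema-on-read the writer always succeeds and each reader decides how to interpret what it finds.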
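The difference between the approaches is particularly noticeable in situations where an application wants to change the format of its data. For example, say you are currently storing each user's full name in one field, and you instead want to store the first name and last name separately【23】. In a document database, you would just start writing new documents with the new fields and have code in the application that handles the case when old documents are read. For example: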
```js
if (user && user.name && !user.first_name) {
// Documents written before Dec 8, 2013 don't have first_name
user.first_name = user.name.split(" ")[0];
}
```
On the other hand, in a "statically typed" database schema, you would typically perform a **migration** along the lines of:
```sql
ALTER TABLE users ADD COLUMN first_name text;
UPDATE users SET first_name = split_part(name, ' ', 1); -- PostgreSQL
UPDATE users SET first_name = substring_index(name, ' ', 1); -- MySQL
```
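Schema changes have a bad reputation of being slow and requiring downtime. This reputation is not entirely deserved: most relational database systems execute the `ALTER TABLE` statement in a few milliseconds. MySQL is a notable exception: it copies the entire table on `ALTER TABLE`, which can mean minutes or even hours of downtime when altering a large table, although various tools exist to work around this limitation【24,25,26】.
Running the `UPDATE` statement on a large table is likely to be slow on any database, since every row needs to be rewritten. If that is not acceptable, the application can leave `first_name` set to its default of `NULL` and fill it in at read time, as it would with a document database.
The schema-on-read approach is advantageous if the items in the collection don't all have the same structure for some reason (that is, the data is heterogeneous). For example, because:
* There are many different types of objects, and it is not practical to put each type of object in its own table.
* The structure of the data is determined by external systems over which you have no control and which may change at any time.
In situations like these, a schema may hurt more than it helps, and schemaless documents can be a much more natural data model. But in cases where all records are expected to have the same structure, schemas are a useful mechanism for documenting and enforcing that structure. [Chapter 4](/v1/ch4) discusses schemas and schema evolution in more detail.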
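#### Data Locality for Queries
A document is usually stored as a single continuous string, encoded as JSON, XML, or a binary variant thereof (such as MongoDB's BSON). If your application often needs to access the entire document (for example, to render it on a web page), there is a performance advantage to this storage locality. If data is split across multiple tables, as in [Figure 2-1](/v1/ddia_0201.png), multiple index lookups are required to retrieve it all, which may require more disk seeks and take more time.
The locality advantage applies only if you need large parts of the document at the same time. The database typically needs to load the entire document even if you access only a small portion of it, which can be wasteful for large documents. On updates to a document, the entire document usually needs to be rewritten; only modifications that don't change the encoded size of a document can easily be performed in place. For these reasons, it is generally recommended that you keep documents fairly small and avoid writes that increase the size of a document【9】. These performance limitations significantly reduce the set of situations in which document databases are useful.
It's worth pointing out that the idea of grouping related data together for locality is not limited to the document model. For example, Google's Spanner database offers the same locality properties in a relational data model, by allowing the schema to declare that a table's rows should be interleaved (nested) within a parent table【27】. Oracle allows the same, using a feature called **multi-table index cluster tables**【28】. The **column-family** concept in the Bigtable data model (used in Cassandra and HBase) has a similar purpose of managing locality【29】.
We will also see more on locality in [Chapter 3](/v1/ch3).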
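#### Convergence of Document and Relational Databases
Most relational database systems (other than MySQL) have supported XML since the mid-2000s. This includes functions to make local modifications to XML documents and the ability to index and query inside XML documents, which allows applications to use data models very similar to what they would use with a document database.
PostgreSQL since version 9.3【8】, MySQL since version 5.7, and IBM DB2 since version 10.5【30】 also have a similar level of support for JSON documents. Given the popularity of JSON for web APIs, it is likely that other relational databases will follow in their footsteps and add JSON support.
On the document database side, RethinkDB supports relational-like joins in its query language, and some MongoDB drivers automatically resolve database references (effectively performing a client-side join, although this is likely to be slower than a join performed in the database since it requires additional network round-trips and is less optimized).
It seems that relational and document databases are becoming more similar over time, and that is a good thing: the data models complement each other[^v]. If a database is able to handle document-like data and also perform relational queries on it, applications can use the combination of features that best fits their needs.
A hybrid of the relational and document models is a good route for databases to take in the future.
[^v]: Codd's original description of the relational model【1】 actually allowed something quite similar to JSON documents within a relational schema. He called it **nonsimple domains**. The idea was that a value in a row doesn't have to be a primitive datatype like a number or a string; it could also be a nested relation (table), so you can have an arbitrarily nested tree structure as a value, much like the JSON or XML support that was added to SQL 30 years later.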
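## Query Languages for Data {#数据查询语言}
When the relational model was introduced, it included a new way of querying data: SQL is a **declarative** query language, whereas IMS and CODASYL queried the database using **imperative** code. What does that mean?
Many commonly used programming languages are imperative. For example, if you have a list of animal species, you might write something like this to return only the sharks in the list: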
```js
function getSharks() {
var sharks = [];
for (var i = 0; i < animals.length; i++) {
if (animals[i].family === "Sharks") {
sharks.push(animals[i]);
}
}
return sharks;
}
```
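In the relational algebra, you would instead write:
$$
sharks = \sigma_{family = "Sharks"}(animals)
$$
where $\sigma$ (the Greek letter sigma) is the selection operator, returning only those animals that match the condition `family = "Sharks"`.
When SQL was defined, it followed the structure of the relational algebra fairly closely: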
```sql
SELECT * FROM animals WHERE family = 'Sharks';
```
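An imperative language tells the computer to perform certain operations in a certain order. You can imagine stepping through the code line by line, evaluating conditions, updating variables, and deciding whether to go around the loop one more time.
In a declarative query language, like SQL or relational algebra, you just specify the pattern of the data you want (what conditions the results must meet, and how you want the data to be transformed, for example sorted, grouped, and aggregated) but not how to achieve that goal. It is up to the database system's query optimizer to decide which indexes and which join methods to use, and in which order to execute the various parts of the query.
A declarative query language is attractive because it is typically more concise and easier to work with than an imperative API. More importantly, it also hides implementation details of the database engine, which makes it possible for the database system to introduce performance improvements without requiring any changes to queries.
For example, in the imperative code shown at the beginning of this section, the list of animals appears in a particular order. If the database wants to reclaim unused disk space behind the scenes, it might need to move records around, changing the order in which the animals appear. Can the database do that safely, without breaking queries?
The SQL example doesn't guarantee any particular ordering, so it doesn't mind if the order changes. But if the query is written as imperative code, the database can never be sure whether the code is relying on the ordering or not. The fact that SQL is more limited in functionality gives the database much more room for automatic optimizations.
Finally, declarative languages often lend themselves to parallel execution. Today, CPUs are getting faster by adding more cores, not by running at significantly higher clock speeds than before【31】. Imperative code is very hard to parallelize across multiple cores and multiple machines, because it specifies instructions that must be performed in a particular order. Declarative languages have a better chance of getting faster in parallel execution because they specify only the pattern of the results, not the algorithm that is used to determine the results. The database is free to use a parallel implementation of the query language, if appropriate【32】.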
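The following sketch illustrates why a declarative condition leaves room for such freedom. The executor `selectPartitioned` is hypothetical and runs its partitions one after another; it only shows that, because the predicate says nothing about order, the rows could be split up and filtered independently (for example, on separate cores) without changing which rows are returned.

```javascript
// Declarative part: a predicate that describes which rows qualify.
const isShark = (animal) => animal.family === "Sharks";

// Hypothetical executor: split the rows into partitions, filter each
// partition independently, then merge the partial results.
function selectPartitioned(rows, predicate, partitionCount) {
  const partitions = Array.from({ length: partitionCount }, () => []);
  rows.forEach((row, i) => partitions[i % partitionCount].push(row));
  // Each partition could be handed to a different core or machine;
  // here they are simply processed in sequence.
  return partitions.flatMap((partition) => partition.filter(predicate));
}

const animals = [
  { name: "Great white", family: "Sharks" },
  { name: "Blue whale", family: "Whales" },
  { name: "Hammerhead", family: "Sharks" },
  { name: "Orca", family: "Dolphins" },
];

// The same set of rows comes back however the work is divided up.
const sharkNames = (rows) => rows.map((r) => r.name).sort();
```

Comparing `sharkNames(selectPartitioned(animals, isShark, 3))` with `sharkNames(animals.filter(isShark))` gives the same names. Only the order of the unsorted result may differ, which is acceptable precisely because the query never promised an order.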
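### Declarative Queries on the Web
The advantages of declarative query languages are not limited to databases. To illustrate the point, let's compare declarative and imperative approaches in a completely different environment: a web browser.
Say you have a website about animals in the ocean. The user is currently viewing the page on sharks, so you mark the navigation item "Sharks" as currently selected, like this: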
```html