Paper Reading - Chaos Engineering


Chaos Engineering 的含义

Many large tech organizations are using experimentation to verify the reliability of such systems. We use the term “Chaos Engineering” to refer to this approach, and discuss the underlying principles and how to use it to run experiments.


介绍了 Netflix 用了 Chaos Monkey 的技术,并且取得了成效。 取得成效后,他们开发了更多的 Chaos 手段:模拟一个时间中心宕机;往服务之间请求注入错误等。

一段时间的实践后,他们意识到,这不仅仅是简单的“break things in production”, Amazon,Google,Microsoft 等公司都有类似实践。他们把这类实践叫做 “Chaos Engineering”。

Chaos Engineering 的概念

Specifically, Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in its capability to withstand turbulent conditions in production

介绍 Netflix 的业务场景及大致的架构。

说他们的 Chaos 是有选择性的使用的,主要用在服务与服务之间。

Shifting to a system perspective


At the heart of Chaos Engineering is the premise that engineers should view a collection of services running in production as a single system, and that we can better understand the behavior of this system by injecting real­world inputs (e.g., transient network failures, surges in incoming requests, malformed data inputs) and observing what happens at the system boundary.


From this perspective, we ask questions like “does everything seem to be working properly?” and “are users complaining?” rather than “does the implementation match the specification?”

Principles of Chaos Engineering


Chaos Engineering revolves around running experiments. As in other experimental disciplines, to design an experiment requires specifying [7]:

  • hypotheses
  • independent variables
  • dependent variables
  • context

We have identified the four principles which we believe embody the Chaos Engineering approach to designing experiments:

  1. Build a hypothesis around steady state behavior
  2. Vary real­world events
  3. Run experiments in production
  4. Automate experiments to run continuously

Build a hypothesis around steady state behavior


关键要解决的问题:怎样衡量系统是否 “works properly”? 他们主要关注的是“可用性 availability”,在它们的场景种(我理解就是类似微服务), 一个服务挂了,用户端不能受到影响。

怎样衡量用户的感受?通过 一些指标 ,比如晚上 18:00,根据经验,有 1000 个人在线, 则认为正常。如果只有 100 个了,则认为是系统发生了问题。各种行为都可以和某些指标对应上。

When we formulate a hypothesis for a Chaos Engineering experiment, the hypothesis is about a very particular kind of metric.

不过在指标选择上,他们也有一些判断:用粗粒度的、更接近用户的指标(比如 SPS), 而非细粒度的、系统本身(比如 CPU 等)。仍然会在实验过程中观察细粒度的指标, 因为这类指标可以评判系统功能是否正常。

Vary real‐world events



  1. 开发者测试的时候,往往容易倾向考虑 “happy path”
  2. 现实世界有各种各样的 corner cases 和 error conditions

会破坏系统稳定状态的一些输入都是一个好的实验的候选,Netflix 使用了一些,比如

  • terminate virtual machine instances
  • inject latency into requests between services
  • fail requests between services
  • fail an internal service
  • make an entire Amazon region unavailable



In some cases, you may need to simulate the event instead of inject it.


Run experiments in production

At Netflix, it is simply not possible to fully reproduce the entire architecture and run an end­to­end test.

Automate experiments to run continuously


The fourth and final principle of Chaos Engineering is to leverage automation in order to maintain confidence in results over time.

Running a Chaos experiment

基于上述原则,定义一个 Chaos 实验

  1. 先定义“稳定状态 steady state”,来衡量系统什么时候是正常的
  2. 假设对照组和实验组都能保持稳定状态
  3. 引入变量来反应现实世界的事件:server 挂了等
  4. 观察对照组和实验组的在“稳定状态”表现上的不同

The Future of Chaos Engineering

Chaos Engineering 这个方向的最佳实践几乎还很少,以后可以做的事情

  • Case studies from other domains:借鉴其它组织的实践
  • Adaoption: 有哪些方法可以让一个组织接受这种实践
  • Tooling: 目前不知道需要哪些工具、工具是否可复用、如何复用等
  • Event injection models: 怎样设计事件?事件可以组合