Paper Reading - Chaos Engineering

April 21, 2020

https://arxiv.org/pdf/1702.05843.pdf

Abstract

Chaos Engineering 的含义

Many large tech organizations are using experimentation to verify the reliability of such systems. We use the term “Chaos Engineering” to refer to this approach, and discuss the underlying principles and how to use it to run experiments.

Introduction

介绍了 Netflix 用了 Chaos Monkey 的技术，并且取得了成效。取得成效后，他们开发了更多的 Chaos 手段：模拟一个时间中心宕机；往服务之间请求注入错误等。

一段时间的实践后，他们意识到，这不仅仅是简单的“break things in production”， Amazon，Google，Microsoft 等公司都有类似实践。他们把这类实践叫做 “Chaos Engineering”。

Chaos Engineering 的概念

Specifically, Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in its capability to withstand turbulent conditions in production

介绍 Netflix 的业务场景及大致的架构。

说他们的 Chaos 是有选择性的使用的，主要用在服务与服务之间。

Shifting to a system perspective

全局视角，把整个系统当成一个整体。

At the heart of Chaos Engineering is the premise that engineers should view a collection of services running in production as a single system, and that we can better understand the behavior of this system by injecting realworld inputs (e.g., transient network failures, surges in incoming requests, malformed data inputs) and observing what happens at the system boundary.

判断系统是否处于正常状态的标准

From this perspective, we ask questions like “does everything seem to be working properly?” and “are users complaining?” rather than “does the implementation match the specification?”

Principles of Chaos Engineering

原则：具体实践的一些原则

Chaos Engineering revolves around running experiments. As in other experimental disciplines, to design an experiment requires specifying [7]:

hypotheses

independent variables

dependent variables

context

We have identified the four principles which we believe embody the Chaos Engineering approach to designing experiments:

Build a hypothesis around steady state behavior

Vary realworld events

Run experiments in production

Automate experiments to run continuously

Build a hypothesis around steady state behavior

基于稳定状态的表现来构造假设

关键要解决的问题：怎样衡量系统是否 “works properly”？他们主要关注的是“可用性 availability”，在它们的场景种（我理解就是类似微服务），一个服务挂了，用户端不能受到影响。

怎样衡量用户的感受？通过 一些指标 ，比如晚上 18:00，根据经验，有 1000 个人在线，则认为正常。如果只有 100 个了，则认为是系统发生了问题。各种行为都可以和某些指标对应上。

When we formulate a hypothesis for a Chaos Engineering experiment, the hypothesis is about a very particular kind of metric.

不过在指标选择上，他们也有一些判断：用粗粒度的、更接近用户的指标（比如 SPS），而非细粒度的、系统本身（比如 CPU 等）。仍然会在实验过程中观察细粒度的指标，因为这类指标可以评判系统功能是否正常。

Vary real‐world events

多样化的现实世界的事件。

提到了几个问题：

开发者测试的时候，往往容易倾向考虑 “happy path”
现实世界有各种各样的 corner cases 和 error conditions

会破坏系统稳定状态的一些输入都是一个好的实验的候选，Netflix 使用了一些，比如

terminate virtual machine instances

inject latency into requests between services

fail requests between services

fail an internal service

make an entire Amazon region unavailable

提到一些实验策略

有时，需要去模拟一些场景，而不是注入。比如把一个数据中心下线。

In some cases, you may need to simulate the event instead of inject it.

有时，只对一部分用户下手。

Run experiments in production

At Netflix, it is simply not possible to fully reproduce the entire architecture and run an endtoend test.

Automate experiments to run continuously

自动化能增强信心。

The fourth and final principle of Chaos Engineering is to leverage automation in order to maintain confidence in results over time.

Running a Chaos experiment

基于上述原则，定义一个 Chaos 实验

先定义“稳定状态 steady state”，来衡量系统什么时候是正常的
假设对照组和实验组都能保持稳定状态
引入变量来反应现实世界的事件：server 挂了等
观察对照组和实验组的在“稳定状态”表现上的不同

The Future of Chaos Engineering

Chaos Engineering 这个方向的最佳实践几乎还很少，以后可以做的事情

Case studies from other domains：借鉴其它组织的实践
Adaoption: 有哪些方法可以让一个组织接受这种实践
Tooling: 目前不知道需要哪些工具、工具是否可复用、如何复用等
Event injection models: 怎样设计事件？事件可以组合