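Getting Started with Hystrix

Hystrix

Background

In a distributed system, services commonly depend on one another, and a single business call usually relies on several underlying services. As the figure below shows, with synchronous calls, when the inventory service becomes unavailable the request threads of the product service block. Under a large volume of requests to the inventory service, the product service can eventually exhaust its resources and stop serving requests altogether. The unavailability can then propagate up the call chain, a phenomenon known as the avalanche effect.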

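Common Causes of the Avalanche Effect

Hardware failure: a server crash, a data-center power outage, a severed fiber cable, and so on.
Traffic surge: abnormal traffic, or retries that amplify the load.
Cache penetration: typically occurs when an application restarts and its entire cache is cold, or when a large number of cache entries expire within a short period. The resulting flood of cache misses sends requests straight to the backend, overloading the service provider until it becomes unavailable.
Program bugs: for example, logic that leaks memory, or long Full GC pauses in the JVM.
Synchronous waiting: services call each other synchronously, and the blocked callers exhaust resources.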

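Strategies for Handling the Avalanche Effect

Different causes call for different countermeasures; no single strategy fits every scenario. For reference:

Hardware failure: multi-data-center disaster recovery, multi-region active-active deployment, and so on.
Traffic surge: automatic scaling and traffic control (rate limiting, disabling retries).
Cache penetration: cache preloading, asynchronous cache loading.
Program bugs: fix the bug and release resources promptly.
Synchronous waiting: resource isolation, decoupling through a message queue, and fast failure for calls to unavailable services. Resource isolation usually means using a separate thread pool for each downstream service; fast failure is generally implemented with the circuit breaker pattern combined with timeouts.

In short, an application that cannot isolate itself from failures in its dependencies is at risk of being dragged down with them. To build stable, reliable distributed systems, a service should therefore be able to protect itself: when a dependency becomes unavailable, the service switches into a self-protection mode and so avoids an avalanche. This article focuses on using Hystrix to address avalanches caused by synchronous waiting.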

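Introduction to Hystrix

Hystrix [hɪst'rɪks] means porcupine, an animal whose back is covered in quills that give it the ability to protect itself. The Hystrix discussed here is a fault-tolerance library open-sourced by Netflix, which likewise provides self-protection. Let's look at how Hystrix is designed and implemented to achieve fault tolerance and self-protection.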

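Hystrix design goals:

Protect against and control latency and failure from dependencies, which are typically accessed over the network
Stop cascading failures
Fail fast and recover rapidly
Fall back and degrade gracefully
Provide near-real-time monitoring and alerting

Design principles Hystrix follows:

Prevent any single dependency from exhausting resources (threads)
Shed load and fail fast instead of queueing
Provide fallbacks wherever feasible to shield users from failure
Use isolation techniques (such as bulkheads, swimlanes, and the circuit breaker pattern) to limit the impact of any one dependency
Ensure failures are discovered promptly through near-real-time metrics, monitoring, and alerting
Ensure failures are recovered from promptly by changing configuration properties dynamically
Protect against failures in the whole dependency client execution, not just in network traffic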

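How does Hystrix achieve these goals?

It uses the command pattern to wrap every call to an external service (or dependency) in a HystrixCommand or HystrixObservableCommand object, which typically executes in a separate thread.
Each dependency has its own thread pool (or semaphore); when the pool is exhausted, requests are rejected immediately rather than queued.
It records successes, failures, timeouts, and thread-pool rejections.
When the error percentage for a service exceeds a threshold, the circuit breaker trips automatically and all requests to that service are stopped for a period of time.
It runs fallback logic when a request fails, is rejected, times out, or is short-circuited.
It monitors metrics and configuration changes in near real time.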

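Getting Started with Hystrix

Thread Isolation

Let's look at how thread isolation works.

In a heavily service-oriented system, implementing one piece of business logic usually depends on several services. For example, the product detail service depends on the product service, the price service, and the product review service.

As shown in the figure:

The product service, price service, and product review service all share the product detail service's thread pool. If the product service becomes unavailable, every thread in the pool can end up blocked, which leads to a service avalanche. As shown in the figure below:

Hystrix isolates dependencies by assigning each one its own thread pool, which prevents the avalanche.

As the figure shows, when the product review service is unavailable, even if all 20 threads allocated to it are blocked in synchronous waits, calls to the other services are unaffected.

Thread isolation is built into Hystrix and works without extra configuration, so what we care about more is configuring timeouts and fallbacks.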
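
To make the idea of per-dependency thread pools concrete, here is a minimal sketch in plain Java. It is not how Hystrix is implemented internally; the class name BulkheadSketch, the helper names, and the pool size of 2 are assumptions chosen for illustration. Each dependency gets a small pool with no queue, so a saturated pool rejects new work at once and the caller returns a fallback, while other pools keep working.

```java
import java.util.concurrent.*;

/** Sketch of thread-pool isolation (bulkhead): one small, bounded pool per dependency. */
final class BulkheadSketch {

    // A fixed-size pool with no queue: when every thread is busy, submit() is rejected at once.
    static ExecutorService newIsolatedPool(int threads) {
        return new ThreadPoolExecutor(threads, threads, 0L, TimeUnit.MILLISECONDS,
                new SynchronousQueue<>(), new ThreadPoolExecutor.AbortPolicy());
    }

    // Runs the call on the dependency's own pool; rejection or failure yields the fallback.
    static String call(ExecutorService pool, Callable<String> remoteCall, String fallback) {
        try {
            return pool.submit(remoteCall).get();
        } catch (RejectedExecutionException | ExecutionException e) {
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return fallback;
        }
    }

    public static void main(String[] args) {
        ExecutorService reviewPool = newIsolatedPool(2); // hypothetical pool sizes
        ExecutorService pricePool = newIsolatedPool(2);
        CountDownLatch release = new CountDownLatch(1);

        // Two hung calls occupy every thread of the review pool.
        for (int i = 0; i < 2; i++) {
            reviewPool.submit(() -> { release.await(); return "review"; });
        }
        // The review pool is saturated, so this call falls back immediately...
        System.out.println(call(reviewPool, () -> "review", "review unavailable"));
        // ...while the price pool is unaffected.
        System.out.println(call(pricePool, () -> "price: 42", "price unavailable"));

        release.countDown();
        reviewPool.shutdown();
        pricePool.shutdown();
    }
}
```

The design choice that matters here is the missing queue: requests that cannot get a thread fail fast instead of waiting behind a dependency that is already stuck.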

Hystrix Fallback Example

1. Add the dependency to the consumer project

<dependency>
    <groupId>org.springframework.cloud</groupId>
    <artifactId>spring-cloud-starter-netflix-hystrix</artifactId>
</dependency>

2. Enable the circuit breaker annotation in the consumer project

package cn.rayfoo;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.cloud.client.circuitbreaker.EnableCircuitBreaker;
import org.springframework.cloud.client.discovery.EnableDiscoveryClient;

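/**
 * @Author: rayfoo@qq.com
 * @Date: 2020/7/2 2:27 PM
 * @Description: Consumer entry point with Hystrix circuit breaker support enabled
 */
@EnableCircuitBreaker
@EnableDiscoveryClient
@SpringBootApplication
public class CustomerRunner {

    public static void main(String[] args) {
        // Pass args through so command-line properties reach Spring Boot
        SpringApplication.run(CustomerRunner.class, args);
    }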

}

Note: a typical Spring Cloud project adds at least these three annotations, so Spring Cloud provides the @SpringCloudApplication annotation, which bundles them as follows.

//
// Source code recreated from a .class file by IntelliJ IDEA
// (powered by Fernflower decompiler)
//

package org.springframework.cloud.client;

import java.lang.annotation.Documented;
import java.lang.annotation.ElementType;
import java.lang.annotation.Inherited;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.cloud.client.circuitbreaker.EnableCircuitBreaker;
import org.springframework.cloud.client.discovery.EnableDiscoveryClient;

@Target({ElementType.TYPE})
@Retention(RetentionPolicy.RUNTIME)
@Documented
@Inherited
@SpringBootApplication
@EnableDiscoveryClient
@EnableCircuitBreaker
public @interface SpringCloudApplication {
}

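So using @SpringCloudApplication lets you omit the three annotations above.

3. Enable thread isolation

Add @HystrixCommand to each method that needs it.
Create a fallback method to run when the service call fails. Its name is up to you, but its return type must be compatible with that of the protected method and its parameter list must match (Hystrix Javanica also allows one extra trailing Throwable parameter).
Set fallbackMethod on @HystrixCommand to the name of the method created in the previous step.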

package cn.rayfoo.controller;

import com.netflix.hystrix.contrib.javanica.annotation.HystrixCommand;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.web.client.RestTemplate;


/**
 * @Author: rayfoo@qq.com
 * @Date: 2020/7/2 2:30 PM
 * @Description: Consumer controller that calls the provider through Hystrix
 */
@RestController
@RequestMapping("/employee")
public class EmployeeController {

    @Autowired
    private RestTemplate restTemplate;

    private static final String REST_URL_PREFIX = "http://emp-provider";

    @GetMapping("/{id}")
    @HystrixCommand(fallbackMethod = "getEmployeeByIdFallback")
    public String getEmployeeById(@PathVariable Integer id) {
        String url = REST_URL_PREFIX + "/employee/" + id;
        // Call the provider endpoint
        String employee = restTemplate.getForObject(url, String.class);
        // Return the result
        return employee;
    }

    // Fallback: same parameters and return type as getEmployeeById
    public String getEmployeeByIdFallback(Integer id) {
        return "The server is busy, please try again later!";
    }

}

4. Simulate a service timeout (in the provider)

package cn.rayfoo.controller;


import cn.rayfoo.bean.Employee;
import cn.rayfoo.service.EmployeeService;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

/**
 * @Author: rayfoo@qq.com
 * @Date: 2020/7/2 2:14 PM
 * @Description: Provider controller with an artificial delay
 */
@RestController
@RequestMapping("/employee")
public class EmployeeController {

    @Autowired
    private EmployeeService employeeService;

    @GetMapping("/{id}")
    public Employee getEmployeeById(@PathVariable Integer id) {
        try {
            // Simulate a slow provider: 2 s exceeds Hystrix's default 1 s timeout
            Thread.sleep(2000L);
        } catch (InterruptedException e) {
            // Restore the interrupt flag instead of swallowing the interruption
            Thread.currentThread().interrupt();
        }
        return employeeService.getEmployeeById(id);
    }

}

5. Start Eureka, the provider, and the consumer, then check whether the fallbackMethod runs

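The test succeeds, but a problem remains.

Problem 1: Setting the fallback attribute of @HystrixCommand on every method is tedious. Can it be applied to all methods of a controller at once?

Yes. Put the @DefaultProperties annotation on the controller to configure default properties for all of its @HystrixCommand methods.

One caveat: the shared fallback method must take no parameters, because the protected methods cannot be guaranteed to share the same parameter list.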

    // Shared default fallback: it must not declare any parameters
    public String defaultFallbackMethod() {
        return "The server is busy, please try again later!";
    }

Controller configuration

@RestController
@RequestMapping("/employee")
@DefaultProperties(defaultFallback = "defaultFallbackMethod")
public class EmployeeController

The method-level annotation then no longer needs a fallbackMethod attribute

    @GetMapping("/{id}")
    @HystrixCommand
    public String getEmployeeById(@PathVariable Integer id) {
        String url = REST_URL_PREFIX + "/employee/" + id;
        // Call the provider endpoint
        String employee = restTemplate.getForObject(url, String.class);
        // Return the result
        return employee;
    }

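Note: although the method no longer needs its own fallback, it still needs the @HystrixCommand annotation!

Problem 2: A call that gets no response within one second triggers the fallback, and that is too short for many heavy operations.

Setting the Timeout

Hystrix's default timeout is 1 second. If the called service does not respond successfully within 1 second, or the call throws an exception, Hystrix degrades the call and invokes the fallbackMethod.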
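
The "time out or fail, then fall back" behavior can be illustrated with a small plain-Java sketch. This is only an approximation of the idea under stated assumptions, not Hystrix's implementation: the class TimeoutFallbackSketch and the method callWithTimeout are hypothetical names, and a throwaway single-thread executor stands in for the command's thread pool.

```java
import java.util.concurrent.*;

/** Sketch of timeout-based degradation; not Hystrix's actual implementation. */
final class TimeoutFallbackSketch {

    // Runs the call on a worker thread and waits at most timeoutMillis for its result.
    static String callWithTimeout(Callable<String> remoteCall, long timeoutMillis, String fallback) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<String> future = pool.submit(remoteCall);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException | ExecutionException e) {
            // Timed out, or the call threw: interrupt the worker and degrade.
            future.cancel(true);
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return fallback;
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // A provider that takes 2 s to answer, like the simulated delay in step 4.
        Callable<String> slow = () -> { Thread.sleep(2000L); return "employee #1"; };
        // With a 1 s budget the caller gets the fallback.
        System.out.println(callWithTimeout(slow, 1000L, "server busy"));
        // With a 2.5 s budget the real result comes back.
        System.out.println(callWithTimeout(slow, 2500L, "server busy"));
    }
}
```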

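How do we change this default timeout?

In IntelliJ IDEA, press Shift twice, search for HystrixCommandProperties, and open its source. It contains many configuration defaults.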

    private static final Integer default_metricsRollingStatisticalWindowBuckets = 10;
    private static final Integer default_circuitBreakerRequestVolumeThreshold = 20;
    private static final Integer default_circuitBreakerSleepWindowInMilliseconds = 5000;
    private static final Integer default_circuitBreakerErrorThresholdPercentage = 50;
    private static final Boolean default_circuitBreakerForceOpen = false;
    static final Boolean default_circuitBreakerForceClosed = false;
    private static final Integer default_executionTimeoutInMilliseconds = 1000;
    private static final Boolean default_executionTimeoutEnabled = true;
    private static final HystrixCommandProperties.ExecutionIsolationStrategy default_executionIsolationStrategy;
    private static final Boolean default_executionIsolationThreadInterruptOnTimeout;
    private static final Boolean default_executionIsolationThreadInterruptOnFutureCancel;
    private static final Boolean default_metricsRollingPercentileEnabled;
    private static final Boolean default_requestCacheEnabled;
    private static final Integer default_fallbackIsolationSemaphoreMaxConcurrentRequests;
    private static final Boolean default_fallbackEnabled;
    private static final Integer default_executionIsolationSemaphoreMaxConcurrentRequests;
    private static final Boolean default_requestLogEnabled;
    private static final Boolean default_circuitBreakerEnabled;
    private static final Integer default_metricsRollingPercentileWindow;
    private static final Integer default_metricsRollingPercentileWindowBuckets;
    private static final Integer default_metricsRollingPercentileBucketSize;
    private static final Integer default_metricsHealthSnapshotIntervalInMilliseconds;

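Reading through them, default_executionTimeoutInMilliseconds is the default timeout. Searching the class for this field reveals its property key: execution.isolation.thread.timeoutInMilliseconds. To change it, set this key in the commandProperties attribute of @HystrixCommand. Other properties are configured the same way.

Configuring the timeout: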

    @GetMapping("/{id}")
    @HystrixCommand(commandProperties = {
            @HystrixProperty(name = "execution.isolation.thread.timeoutInMilliseconds",value = "2000")
    })
    public String getEmployeeById(@PathVariable Integer id) {
        String url = REST_URL_PREFIX + "/employee/" + id;
        // Call the provider endpoint
        String employee = restTemplate.getForObject(url, String.class);
        // Return the result
        return employee;
    }

The provider sleeps for 2 s, and the call itself adds overhead, so a 2 s timeout is clearly still not enough. Let's raise it to 2500 ms.

Now the request succeeds.

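Configuring Shared Properties

The previous step configured the timeout successfully, but doing this per method is still tedious. Can we set a global timeout and override it only for the methods that need a different value? For example, a global default of 3 s with 5 s for slow operations.

Step 1: comment out the property configured in the previous step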

    @GetMapping("/{id}")
//    @HystrixCommand(commandProperties = {
//            @HystrixProperty(name = "execution.isolation.thread.timeoutInMilliseconds",value = "2500")
//    })
    @HystrixCommand

Step 2: configure it in the yml file. Note that the IDE offers no autocompletion for these keys!

hystrix:
  command:
    default:
      execution:
        isolation:
          thread:
            timeoutInMilliseconds: 3000

Step 3: override the timeout on the individual methods that need a different value.

    @GetMapping("/{id}")
    @HystrixCommand(commandProperties = {
            @HystrixProperty(name = "execution.isolation.thread.timeoutInMilliseconds",value = "2500")
    })

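Step 3 can also be done in yml: add an entry at the same level as default, keyed by the Hystrix command key. For a method annotated with @HystrixCommand the command key defaults to the method name (getEmployeeById here) unless commandKey is set explicitly, so a service name such as emp-provider only works as a key if it is also configured as the commandKey.

hystrix:
  command:
    default:
      execution:
        isolation:
          thread:
            timeoutInMilliseconds: 3000
    getEmployeeById:
      execution:
        isolation:
          thread:
            timeoutInMilliseconds: 5000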
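
The lookup order this configuration relies on (a per-command entry wins over default, which in turn wins over the built-in 1000 ms) can be sketched as follows. This is an illustrative model only, not Hystrix's own property-resolution code; the class TimeoutPropertySketch and the command keys slowCommand and otherCommand are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of "per-command value overrides default" lookup; not Hystrix's own code. */
final class TimeoutPropertySketch {

    static final String SUFFIX = ".execution.isolation.thread.timeoutInMilliseconds";
    // Mirrors default_executionTimeoutInMilliseconds from HystrixCommandProperties.
    static final int BUILT_IN_DEFAULT_MS = 1000;

    static int resolveTimeout(Map<String, String> props, String commandKey) {
        // 1. An entry for this specific command key wins.
        String specific = props.get("hystrix.command." + commandKey + SUFFIX);
        if (specific != null) {
            return Integer.parseInt(specific);
        }
        // 2. Otherwise use the shared "default" entry.
        String shared = props.get("hystrix.command.default" + SUFFIX);
        if (shared != null) {
            return Integer.parseInt(shared);
        }
        // 3. Otherwise fall back to the built-in value.
        return BUILT_IN_DEFAULT_MS;
    }

    public static void main(String[] args) {
        Map<String, String> props = new HashMap<>();
        props.put("hystrix.command.default" + SUFFIX, "3000");
        props.put("hystrix.command.slowCommand" + SUFFIX, "5000"); // hypothetical command key
        System.out.println(resolveTimeout(props, "slowCommand"));   // 5000
        System.out.println(resolveTimeout(props, "otherCommand"));  // 3000
        System.out.println(resolveTimeout(new HashMap<>(), "any")); // 1000
    }
}
```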

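Hystrix Circuit Breaking

This mechanism is known as the circuit breaker (Circuit Breaker).

The idea is simple and resembles the fuse in a household circuit: on a short circuit the fuse blows at once and prevents a fire. Applied to a distributed system, the pattern lets the calling service detect on its own that a dependency is responding slowly or timing out heavily, and cut it off proactively so the whole system is not dragged down.

Unlike a fuse, which once blown cannot reconnect on its own, Hystrix is resilient: when conditions improve it restores the connection automatically.

In the fallback test, if the provider never responds, every call from the consumer waits 1 s or more (depending on the configured timeout). This ties up server threads and reduces the server's capacity for concurrent requests. The service is then like the most overloaded appliance on the circuit: leaving it connected risks a "fire", so it should be cut off to keep the rest of the system running smoothly.

How the circuit breaker works:

The circuit breaker is closed by default.
When the failure rate crosses the configured threshold (by default, at least 20 requests within the rolling statistics window, of which at least 50% failed or were degraded), the breaker opens. Subsequent requests fail fast with the fallback, and no attempt is made to call the service.
After the breaker has been open for 5 seconds (the sleep window), it moves to the half-open state, in which it lets a single request through. If that request succeeds, the breaker closes. If it fails, the breaker stays open for another sleep window, after which it becomes half-open again.

The breaker lets callers of a failing service get a result quickly instead of piling up in synchronous waits, and because it probes the service again after a while, it also makes it possible for calls to resume once the service recovers.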
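
The closed, open, and half-open transitions described above can be modeled with a small state machine. The sketch below is a simplification under stated assumptions, not Hystrix's implementation: it counts requests cumulatively instead of over a rolling window, and the class CircuitBreakerSketch, its method names, and the injectable clock are all hypothetical.

```java
import java.util.function.LongSupplier;

/** Minimal circuit-breaker sketch (closed, open, half-open); not Hystrix's implementation. */
final class CircuitBreakerSketch {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int requestVolumeThreshold;   // e.g. 20
    private final int errorThresholdPercentage; // e.g. 50
    private final long sleepWindowMillis;       // e.g. 5000
    private final LongSupplier clock;           // injectable so the demo needs no real waiting

    private State state = State.CLOSED;
    private int total;
    private int failures;
    private long openedAt;

    CircuitBreakerSketch(int requestVolumeThreshold, int errorThresholdPercentage,
                         long sleepWindowMillis, LongSupplier clock) {
        this.requestVolumeThreshold = requestVolumeThreshold;
        this.errorThresholdPercentage = errorThresholdPercentage;
        this.sleepWindowMillis = sleepWindowMillis;
        this.clock = clock;
    }

    /** Returns true if the caller may attempt the request. */
    synchronized boolean allowRequest() {
        if (state == State.OPEN && clock.getAsLong() - openedAt >= sleepWindowMillis) {
            state = State.HALF_OPEN; // sleep window elapsed: let exactly one trial through
            return true;
        }
        return state == State.CLOSED;
    }

    synchronized void recordSuccess() {
        if (state == State.HALF_OPEN) { close(); return; } // trial succeeded
        if (state == State.CLOSED) { total++; evaluate(); }
    }

    synchronized void recordFailure() {
        if (state == State.HALF_OPEN) { open(); return; }  // trial failed: new sleep window
        if (state == State.CLOSED) { total++; failures++; evaluate(); }
    }

    synchronized State state() { return state; }

    // Trip only when both the volume and the error-percentage thresholds are met.
    private void evaluate() {
        if (total >= requestVolumeThreshold
                && failures * 100 >= errorThresholdPercentage * total) {
            open();
        }
    }

    private void open() { state = State.OPEN; openedAt = clock.getAsLong(); }

    private void close() { state = State.CLOSED; total = 0; failures = 0; }

    public static void main(String[] args) {
        long[] now = {0L}; // fake clock in milliseconds
        CircuitBreakerSketch breaker = new CircuitBreakerSketch(20, 50, 5000L, () -> now[0]);
        for (int i = 0; i < 10; i++) breaker.recordSuccess();
        for (int i = 0; i < 10; i++) breaker.recordFailure();
        System.out.println(breaker.state());        // OPEN: 20 requests, 50% failed
        System.out.println(breaker.allowRequest()); // false: still inside the sleep window
        now[0] = 5000L;
        System.out.println(breaker.allowRequest()); // true: half-open trial request
        breaker.recordSuccess();
        System.out.println(breaker.state());        // CLOSED
    }
}
```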

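The thresholds and the sleep window are all adjustable, in the same way as the timeout; only the keys differ.

Again in com.netflix.hystrix.HystrixCommandProperties,

the three fields

default_circuitBreakerRequestVolumeThreshold: request volume threshold (default 20)

default_circuitBreakerSleepWindowInMilliseconds: sleep window (in milliseconds, default 5000)

default_circuitBreakerErrorThresholdPercentage: error threshold percentage (default 50, rarely needs changing)

lead to the corresponding keys:

circuitBreaker.requestVolumeThreshold
circuitBreaker.sleepWindowInMilliseconds
circuitBreaker.errorThresholdPercentage

Simulating a Circuit-Breaking Scenario

Change the consumer's request method so that odd ids fail and even ids succeed, and set the request volume threshold, sleep window, and error threshold percentage.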

    @GetMapping("/{id}")
    @HystrixCommand(commandProperties = {
            @HystrixProperty(name = "execution.isolation.thread.timeoutInMilliseconds",value = "2500"),
            @HystrixProperty(name = "circuitBreaker.requestVolumeThreshold",value = "10"),
            @HystrixProperty(name = "circuitBreaker.sleepWindowInMilliseconds",value = "10000"),
            @HystrixProperty(name = "circuitBreaker.errorThresholdPercentage",value = "40")
    })
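    public String getEmployeeById(@PathVariable Integer id) {
        // Odd ids always fail so that the breaker can be tripped on demand
        if (id % 2 != 0) {
            throw new RuntimeException("Simulated failure for odd id " + id);
        }
        String url = REST_URL_PREFIX + "/employee/" + id;
        // Call the provider endpoint
        String employee = restTemplate.getForObject(url, String.class);
        // Return the result
        return employee;
    }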

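Open two browser windows, one requesting an odd id and the other an even id, and watch what happens.

You will see that once repeated failing requests trip the breaker, even requests that would succeed are cut off temporarily and only recover after the sleep window has elapsed.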

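These properties can of course also be set in yml. They are usually left at their defaults during development, though, and tuned only for testing and for the release build.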

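Hystrix Internal Processing Flow

The figure below shows Hystrix's internal processing flow:

Construct the Hystrix command object and invoke its execution method.
Hystrix checks whether the circuit breaker for the service is open. If it is, the fallback method getFallback runs.
If the breaker is closed, Hystrix checks whether the service's thread pool can accept a new request. If the pool is full, getFallback runs.
If the pool accepts the request, Hystrix runs the actual service call in the run method.
If the call fails, getFallback runs and the result is reported to Metrics to update the service's health status.
If the call times out, getFallback runs and the result is reported to Metrics to update the service's health status.
If the call succeeds, the normal result is returned.
If getFallback succeeds, the fallback result is returned.
If getFallback fails, an exception is thrown.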
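
The decision flow in the steps above can be condensed into one method. This is a rough sketch for orientation only: the class CommandFlowSketch and its parameters are hypothetical, the breaker and pool checks are reduced to booleans, and metrics reporting is left out.

```java
import java.util.concurrent.Callable;
import java.util.function.Supplier;

/** Sketch of the decision flow listed above; names are illustrative, not Hystrix internals. */
final class CommandFlowSketch {

    static <T> T execute(boolean circuitOpen, boolean poolHasCapacity,
                         Callable<T> run, Supplier<T> getFallback) {
        if (circuitOpen) {
            return getFallback.get();   // breaker open: short-circuit to the fallback
        }
        if (!poolHasCapacity) {
            return getFallback.get();   // thread pool full: reject and fall back
        }
        try {
            return run.call();          // normal path: perform the actual call
        } catch (Exception e) {
            return getFallback.get();   // failure or timeout: fall back
        }
        // If getFallback itself throws, the exception propagates to the caller.
    }

    public static void main(String[] args) {
        Supplier<String> fallback = () -> "fallback";
        System.out.println(execute(false, true, () -> "ok", fallback));  // ok
        System.out.println(execute(true, true, () -> "ok", fallback));   // fallback
        System.out.println(execute(false, false, () -> "ok", fallback)); // fallback
        Callable<String> failing = () -> { throw new IllegalStateException("provider down"); };
        System.out.println(execute(false, true, failing, fallback));     // fallback
    }
}
```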

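Summary

With Hystrix you will usually want to change the timeout that triggers the fallback, because the default of one second is too short for services that do a lot of work.

1. Add the annotation to the controller class to declare a default fallback method.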

@DefaultProperties(defaultFallback = "defaultFallbackMethod")

2. Configure the timeout in yml

hystrix:
  command:
    default:
      execution:
        isolation:
          thread:
            timeoutInMilliseconds: 3000

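3. Add @HystrixCommand to each method that needs Hystrix. If a method needs a timeout different from the default, override the timeout and other properties on that method.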
