Notes on problems I ran into when integrating a Consul cluster with Spring Cloud.

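A couple of days ago I wanted to turn the production Consul into a cluster, but I only had two machines. Two servers cannot survive the loss of either one, because the quorum for two servers is two; you need at least three (https://www.consul.io/docs/internals/consensus.html#deployment-table). Even so, Consul started on both machines without reporting any errors, and server:8500/ui/ showed that the services had indeed joined the cluster. However, services dispatched through the gateway in production returned "微服务异常" ("microservice exception"), which was caused by a Zuul error (a failure occurs on a route). As soon as I shut down the services on one of the machines, everything went back to normal.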

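To track this down, I first tried to reproduce and fix it locally, using two machines: my own machine (10.0.42.94) and a development machine next to it (10.0.41.110). The Consul start commands for the two machines were: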

nohup /bin/bash -c '/opt/consul agent -server --retry-join=10.0.42.94 -ui -bootstrap-expect=2 -data-dir=/usr/local/consul -node=devslave -advertise=10.0.41.110 -bind=0.0.0.0 -client=0.0.0.0' > /data/logs/consul/consul.log &

consul.exe agent -server -ui -bootstrap-expect=2 -data-dir=D:\data-dir\consul -node=devmaster -advertise=10.0.42.94 -bind=0.0.0.0 -client=0.0.0.0

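At this point the muc service was running on both 94 and 110. For requests sent to the gateway, Zuul looks up the service named muc and load-balances the requests across the machines. But when I used Postman to send requests to the gateway on 110, http://10.0.41.110:7979/muc/auth/code/image, every request went to 110, whereas when I sent requests to my local gateway, some went to 94 and some to 110. In other words, the gateway on 110 was not load balancing. Yet the services listed looked exactly the same whether viewed from 10.0.42.94:8500/ui or from 10.0.41.110:8500/ui. The one difference was that the address of my local machine was shown as localhost, as in the screenshot: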

Clicking into it, as in the screenshot:

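The address turned out to be the private IP of the wireless network adapter, while the other machine showed the IP of its regular network card. So I suspected this was the cause: my machine had registered with Consul using the wireless adapter's IP, so requests arriving at the other machine could not reach my machine and were never dispatched to it, whereas requests sent to my machine could still reach the service on the other machine. But why was the wireless adapter's IP chosen? I did not get to the bottom of that, but after simply disabling the wireless adapter and restarting Consul and the services, load balancing at the gateway worked correctly.
The Consul UI then looked like this: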

As you can see, the address is no longer localhost, and the health check address has also become the normal IP.
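The question of why the wireless adapter's IP was picked is left open above. As a hedged sketch of an alternative to disabling the adapter (not something tried in this post): Spring Cloud Commons exposes spring.cloud.inetutils settings that influence which local address is detected, and Spring Cloud Consul has a discovery.ip-address override. The 10.0.42 prefix below is an assumption based on the wired address 10.0.42.94, and it only helps if the wireless adapter is on a different subnet.

```yaml
spring:
  cloud:
    inetutils:
      # Assumption: the wired NIC is on 10.0.42.x and the wireless one is not.
      preferred-networks:
        - 10.0.42
    consul:
      discovery:
        prefer-ip-address: true
        # Alternative: pin the registered address explicitly instead of relying on detection.
        # ip-address: 10.0.42.94
```

These properties would need to go in whichever file the Consul settings are actually read from in the project.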
With the local cluster working, I moved on to the test environment. The two machines were started with:

nohup /bin/bash -c '/opt/consul agent -server  -ui -bootstrap-expect=2 -data-dir=/usr/local/consul -node=testmaster -advertise=192.168.101.220 -bind=0.0.0.0 -client=0.0.0.0' > /data/logs/consul/consul.log &
nohup /bin/bash -c '/opt/consul agent -server --retry-join=192.168.101.220 -ui -bootstrap-expect=2 -data-dir=/usr/local/consul -node=testslave -advertise=192.168.101.221 -bind=0.0.0.0 -client=0.0.0.0' > /data/logs/consul/consul.log &

Unfortunately, the two machines could not elect a leader:

    2019/11/25 17:53:17 [WARN]  raft: not part of stable configuration, aborting election
    2019/11/25 17:53:18 [ERR] agent: failed to sync remote state: No cluster leader

After some googling I found https://learn.hashicorp.com/consul/day-2-operations/outage and https://support.hashicorp.com/hc/en-us/articles/115015603408-Consul-Errors-And-Warnings:

[WARN] raft: not part of stable configuration, aborting election

-> This means you don’t have a complete peers.json on all the servers (the server is not seeing itself in the peer configuration). You’ll need to stop all the servers and create an identical peers.json file on each, which includes all the server IP:port pairs. Once they all have the same peers.json file you can start them again.

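In the end I deleted everything under /usr/local/consul (the data directory) on both machines and restarted, and the problem went away. I did not have to create a peers.json as the documentation describes.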

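Next, I could see that a hostname was being used as the service address, but on 33, curl http://iZbp1guuix5grexo50gpgzZ:8000/actuator/health could not connect, so this was probably the problem.
So I added the following line to /etc/hosts, and that fixed it: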

192.168.101.221 iZbp1guuix5grexo50gpgzZ iZbp1guuix5grexo50gpgzZ

So why was the hostname used as the address? I noticed that only the config service was registered with an IP.

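I had added prefer-ip-address: true to the application-dev.yml of every project, but it only took effect in the spring cloud config project, because all the other projects fetch their configuration from spring cloud config. The settings in question in application-dev.yml were: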

    consul:
      host: rockysaas-consul
      port: 8500
      discovery:
        prefer-ip-address: true
        instance-id: instance-${spring.cloud.client.ip-address}-${spring.application.name}-${server.port}
        service-name: ${spring.application.name}

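None of these actually had any effect (you can put any host there and nothing changes). It seems that when Spring Cloud Consul is combined with Spring Cloud Config, the Consul settings only take effect if they are placed in bootstrap.yml. (The Spring Cloud Consul documentation, in the section "Distributed Configuration with Consul", mentions that in this case the configuration is loaded in the "bootstrap" phase.)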
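A minimal sketch of what moving those settings into bootstrap.yml might look like, assuming the consul: block shown above sits under spring.cloud (the indentation suggests so, but the full file is not shown). The values are copied from the snippet above; the application name is only an example taken from the muc service mentioned earlier.

```yaml
# bootstrap.yml (sketch, not the project's actual file)
spring:
  application:
    name: muc   # assumption: example service name
  cloud:
    consul:
      host: rockysaas-consul
      port: 8500
      discovery:
        prefer-ip-address: true
        instance-id: instance-${spring.cloud.client.ip-address}-${spring.application.name}-${server.port}
        service-name: ${spring.application.name}
```

On Spring Cloud 2020.0 and later the bootstrap phase is disabled by default and has to be enabled (for example by adding spring-cloud-starter-bootstrap), so this applies as written only to older release trains.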

喜欢艺术的码农