systemd in Practice: Automatically Restarting Services Depending on the Situation [Repost]

1. The simplest auto-restart example

[Unit]
Description=mytest

[Service]
Type=simple
ExecStart=/root/mytest.sh
Restart=always
RestartSec=5
StartLimitInterval=0

[Install]
WantedBy=multi-user.target

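Key parameters explained

Restart=always: restart the service whenever it stops, unless it was stopped with systemctl stop. The default is no.
RestartSec=5: the delay before a restart; here, after an abnormal exit systemd waits 5 seconds before starting the service again. The default is 0.1 seconds (100 ms).
StartLimitInterval=0: disables start rate limiting, so the service may be restarted any number of times. By default, a service that is started more than 5 times within 10 seconds is not restarted again; setting the interval to 0 removes that limit. (Newer systemd releases call this setting StartLimitIntervalSec= and document it under [Unit]; the older spelling used here is kept for compatibility.)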

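2. The requirement

Requirement: for one of our services, if the program exits because it was killed by the OOM killer, it should not be restarted automatically (someone needs to step in and investigate); in every other case it may be restarted automatically.

Analysis: the OOM killer terminates a process by sending it SIGKILL, the same signal as kill -9. So all we need is a way to tell systemd not to restart this service when it dies from SIGKILL.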

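3. The RestartPreventExitStatus parameter

man systemd.service shows that the [Service] section supports a parameter called RestartPreventExitStatus.

As the name suggests, it means: do not restart when the exit status matches one of the listed values.

Its value may contain exit codes and signal names; several can be given, separated by spaces, for example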

RestartPreventExitStatus=143 137 SIGTERM SIGKILL

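This means that the service is not restarted if its exit matches any one of the following four cases:

exit code is 143
exit code is 137
signal is TERM
signal is KILL

How these are used in practice is shown in the following sections.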
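
The two numeric values in the example are not arbitrary. By shell convention, a command that was killed by signal N is reported with exit status 128+N, so 143 corresponds to SIGTERM (15) and 137 to SIGKILL (9). A minimal shell sketch of that mapping, using the `kill -l` builtin; the helper name `status_to_signal` is made up for illustration:

```shell
#!/bin/bash
# Hypothetical helper: translate a shell-style exit status (128 + N)
# into the name of signal N. Statuses of 128 or below are not signal-derived.
status_to_signal() {
    local status=$1
    if [ "$status" -gt 128 ]; then
        # `kill -l NUMBER` prints the signal name without the SIG prefix.
        kill -l "$((status - 128))"
    else
        echo "none"
    fi
}

status_to_signal 143   # prints TERM
status_to_signal 137   # prints KILL
```

Related as they are, the numeric and the named entries stand for different situations, which is why the example lists both forms; the child-process test in the notes further down shows each of them occurring.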

4. Test method

/usr/lib/systemd/system/mytest.service

[Unit]
Description=mytest

[Service]
Type=simple
ExecStart=/root/mem
Restart=always
RestartSec=5
StartLimitInterval=0
RestartPreventExitStatus=SIGKILL

[Install]
WantedBy=multi-user.target

/root/mem.c (keeps consuming memory until the OOM killer fires)

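#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define CHUNK (1024 * 1024 * 100)  /* 100 MiB per iteration */

int main(void)
{
    char *p = NULL;
    int count = 1;

    /* Allocate and never free, so memory use grows until the OOM killer steps in. */
    while (1) {
        p = malloc(CHUNK);
        if (!p) {
            printf("malloc error!\n");
            return -1;
        }
        /* Touch every page so the memory is actually committed. */
        memset(p, 0, CHUNK);
        printf("malloc %dM memory\n", 100 * count++);
        usleep(500000);  /* 0.5 s between allocations */
    }
}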

Build and run

gcc -o /root/mem /root/mem.c
systemctl daemon-reload
systemctl start mytest

5. Test results

      [root@fzxiaomange ~]# systemctl status mytest
      ● mytest.service - mytest
         Loaded: loaded (/usr/lib/systemd/system/mytest.service; disabled; vendor preset: disabled)
         Active: failed (Result: signal) since Sat 2018-10-20 23:32:24 CST; 45s ago
        Process: 10555 ExecStart=/root/mem (code=killed, signal=KILL)
       Main PID: 10555 (code=killed, signal=KILL)

     
      Oct 20 23:31:55 fzxiaomange.com systemd[1]: Started mytest.
      Oct 20 23:31:55 fzxiaomange.com systemd[1]: Starting mytest...
      Oct 20 23:32:24 fzxiaomange.com systemd[1]: mytest.service: main process exited, code=killed, status=9/KILL
      Oct 20 23:32:24 fzxiaomange.com systemd[1]: Unit mytest.service entered failed state.
      Oct 20 23:32:24 fzxiaomange.com systemd[1]: mytest.service failed.

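Look at the Main PID line in the output above, Main PID: 10555 (code=killed, signal=KILL). It describes how the main process ended, and it usually takes one of two forms:

code=exited, status=143: systemd considers the main process to have exited on its own, with exit code 143

code=killed, signal=KILL: systemd considers the main process to have been killed; the signal it received was SIGKILL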
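
When repeating the test many times it can help to pull just these two fields out of the status output. The following is only a sketch: the function name `main_pid_result` is invented, and the pattern assumes the "Main PID:" line has the shape shown above (it also tolerates a suffix after the value, as in `status=1/FAILURE`).

```shell
#!/bin/bash
# Hypothetical helper: read `systemctl status` output on stdin and print
# "<code> <signal-or-status>" taken from the "Main PID:" line.
main_pid_result() {
    sed -nE 's/.*Main PID: [0-9]+ \(code=([a-z]+), (signal|status)=([A-Za-z0-9]+)[^)]*\).*/\1 \3/p'
}

# Fed here with the line from the output above; on a live system one would
# pipe `systemctl status mytest` into the function instead.
echo '   Main PID: 10555 (code=killed, signal=KILL)' | main_pid_result   # prints: killed KILL
```

The first field tells you which kind of RestartPreventExitStatus= entry could match: a signal name for killed, a number for exited.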

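After waiting 5 seconds, the service was not restarted, as expected.

Next, change RestartPreventExitStatus=SIGKILL to RestartPreventExitStatus=SIGTERM.

Run systemctl daemon-reload so the edited unit file is picked up, then systemctl restart mytest, and watch again: this time, 5 seconds after the process is killed, the service is restarted automatically, as expected.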

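6. Notes

6.1. How RestartPreventExitStatus relates to Restart

Setting RestartPreventExitStatus= does not make systemd ignore Restart= altogether. Restart= is overridden only when the way the process exited matches RestartPreventExitStatus=; when nothing matches, Restart= behaves exactly as it otherwise would (see the detailed test data later on).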
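
The rule can be written down as a small decision function. This is a toy model for Restart=always only, not systemd's actual logic: the name `restart_decision` and the way an exit is written as a single token ("SIGKILL" for a killed process, "143" for an exit code) are assumptions made for the sketch, and it ignores systemctl stop and start rate limiting.

```shell
#!/bin/bash
# Toy model, not systemd's code: what happens with Restart=always plus
# a RestartPreventExitStatus= list.
#   $1 = how the main process ended, e.g. "SIGKILL" or "143" (illustrative notation)
#   $2 = the space-separated RestartPreventExitStatus= value
restart_decision() {
    local result=$1 prevent=$2 entry
    for entry in $prevent; do
        if [ "$entry" = "$result" ]; then
            echo "no-restart"   # a listed status overrides Restart=
            return
        fi
    done
    echo "restart"              # nothing matched, Restart=always applies
}

restart_decision SIGKILL "SIGKILL"   # prints no-restart
restart_decision SIGKILL "SIGTERM"   # prints restart
```

The two calls mirror the two runs in the test results section: with SIGKILL listed the OOM-killed service stays down, with only SIGTERM listed it comes back.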

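6.2. What happens when a child process is killed

Suppose systemd does not start a single simple process but something that spawns children (for example a shell script that launches several programs). What happens if we open another terminal and send signals with kill -<signal>? Here is the test setup first.

Change ExecStart=/root/mem to ExecStart=/root/mytest.sh

The contents of /root/mytest.sh are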

#!/bin/bash
sleep 100000 &
sleep 200000

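Test results

kill <main PID> (kill with no options): main process status is code=killed, signal=TERM
kill -9 <main PID>: main process status is code=killed, signal=KILL
kill <PID of the last child> (kill with no options): systemd does not treat this as the service receiving a signal; it goes by the exit code the script inherits from its last command, and the main process status is code=exited, status=143
kill -9 <PID of the last child>: main process status is code=exited, status=137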
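
The last two results can be reproduced without systemd. In the sketch below a wrapper bash plays the role of mytest.sh and an inner sh plays the role of its last foreground command; the helper name `wrapper_status` is hypothetical, and the inner shell kills itself only so that the example is deterministic.

```shell
#!/bin/bash
# Hypothetical helper: start a wrapper shell whose last foreground command
# is killed by signal $1, then print the wrapper's own exit status.
wrapper_status() {
    local sig=$1
    # The inner sh kills itself; the outer bash survives, reads the child's
    # status as 128 + signal number, and exits normally with that value.
    bash -c "sh -c 'kill -$sig \$\$'; exit \$?" 2>/dev/null
    echo $?
}

wrapper_status TERM   # prints 143
wrapper_status KILL   # prints 137
```

The wrapper itself is never signalled; it exits normally and merely carries the child's status. That is what systemd observes for the main process, hence code=exited with status 143 or 137 rather than code=killed.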

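7. Detailed test data

The relationship between RestartPreventExitStatus and Restart was described above, but without data to back it up.

The difference between kill and kill -9 also deserves supporting data.

So I ran a detailed comparison; the full data is attached here.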

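[Figure: detailed test data, captioned "systemd in Practice: Automatically Restarting Services Depending on the Situation"; the image is not reproduced here]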

Reposted from

systemd in Practice: Automatically Restarting Services Depending on the Situation – 小慢哥的技术网站 https://fzxiaomange.com/2018/10/21/systemd-restartpreventexitstatus/

Tech | An Introduction to systemd Services https://linux.cn/article-3352-3.html