Problems when running nohup through pssh

  1. Given a shell script a.sh:

    #!/bin/bash
    #/home/test/a.sh
    i=0
    while [ $i -lt 2 ]
    do
    sleep 70
    echo 'good'
    let i++
    done

  2. pssh -H "host1 host2" "nohup /home/test/a.sh &"
    fails with:
    [1] 21:26:32 [FAILURE] host1 Timed out, Killed by signal 9
    [2] 21:26:32 [FAILURE] host2 Timed out, Killed by signal 9
    Logging in to host1 and host2 with ssh shows that
    a.sh is no longer running.

Solution:
pssh -H "host1 host2" "nohup /home/test/a.sh &>> /home/test/nohup.out &"

What follows is my analysis of the cause. My knowledge is limited, so it may well contain misunderstandings; please bear with me.

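3. Root cause analysis

3.1 The pssh side

The pssh source shows that pssh runs the ssh command through subprocess.Popen and kills it automatically once it runs longer than the default timeout (60 s). So the error "[1] 21:26:32 [FAILURE] host1 Timed out, Killed by signal 9" comes from pssh's own timeout setting.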

_DEFAULT_TIMEOUT = 60
def _kill(self):
    """Signals the process to terminate."""
    if self.proc:
        try:
            os.kill(-self.proc.pid, signal.SIGKILL)
        except OSError:
            # If the kill fails, then just assume the process is dead.
            pass
        self.killed = True

def timedout(self):
    """Kills the process and registers a timeout error."""
    if not self.killed:
        self._kill()
        self.failures.append('Timed out')
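
The timeout-and-kill behaviour can be reproduced locally with a minimal sketch. This is not pssh's actual implementation: run_with_timeout is a hypothetical helper, Popen.wait(timeout=...) stands in for pssh's own event loop, and sleep stands in for the ssh command.

```python
import os
import signal
import subprocess


def run_with_timeout(cmd, timeout):
    """Run cmd; SIGKILL its process group if it runs longer than timeout seconds."""
    # start_new_session=True makes the child the leader of a new process
    # group, so killing the group also reaches anything it spawned.
    proc = subprocess.Popen(cmd, start_new_session=True)
    failures = []
    try:
        proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        try:
            os.killpg(proc.pid, signal.SIGKILL)
        except OSError:
            # If the kill fails, assume the process is already dead.
            pass
        failures.append("Timed out")
        proc.wait()
    if proc.returncode < 0:
        # A negative return code means the child was terminated by a signal.
        failures.append("Killed by signal %d" % -proc.returncode)
    return proc.returncode, failures


# "sleep 5" plays the role of an ssh command that does not return in time.
result = run_with_timeout(["sleep", "5"], timeout=1)
print(result)
```

The two failure strings mirror the "Timed out, Killed by signal 9" message that pssh printed above.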

3.2 The ssh side

3.2.1 Why doesn't ssh return immediately after running "nohup /home/test/a.sh &"?

See the following: with the output redirected, ssh returns immediately.

ssh host "(command 1; command 2; ...) &>/dev/null &"
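
The same effect can be observed without ssh. The sketch below is only a local analogy, under the assumption that a parent reading a child's stdout through a pipe behaves like the remote end of the ssh session; timed_run is a hypothetical helper. The parent does not see EOF until every process holding the write end of the pipe has closed it, including a background job started by the shell.

```python
import subprocess
import time


def timed_run(shell_cmd):
    """Run shell_cmd via sh with stdout captured through a pipe; return elapsed seconds."""
    start = time.monotonic()
    # subprocess.run reads the pipe until EOF, i.e. until every process
    # that inherited the write end has closed it or exited.
    subprocess.run(["sh", "-c", shell_cmd], stdout=subprocess.PIPE)
    return time.monotonic() - start


# The background sleep inherits the pipe, so the reader waits for it.
inherited = timed_run("sleep 2 &")
# With output redirected, the pipe is closed as soon as sh itself exits.
redirected = timed_run("sleep 2 >/dev/null 2>&1 &")

print("inherited stdout:  %.1f s" % inherited)
print("redirected stdout: %.1f s" % redirected)
```

The first call should take about as long as the background sleep, while the second should return almost at once.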

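3.2.2 Why does redirecting the output let ssh return?

According to an answer on Stack Overflow, the cause is a race condition. The "Race Condition Details" it refers to are quoted below.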

Race Condition Details
As an example, let's take the simple case of:
ssh server cat foo.txt
This should result in the entire contents of the file foo.txt coming back to the client — but in fact, it may not. Consider the following sequence of events:
1. The SSH connection is set up; sshd starts the target account's shell as shell -c "cat foo.txt" in a child process, reading the shell's stdout and sending the data over the SSH connection. sshd is waiting for the shell to exit.
2. The shell, in turn, starts cat foo.txt in a child process, and waits for it to exit. The file data from foo.txt which cat writes to its stdout, however, does not pass through the shell process on its way to sshd. cat inherits its stdout file descriptor (fd) from its parent process, the shell; that fd is a direct reference to the pipe connecting the shell's stdout to sshd.
3. cat writes the last chunk of data from foo.txt, and exits; the data is passed to the kernel via the write system call, and is waiting in the pipe buffer to be read by sshd. The shell, which was waiting on the cat process, exits, and then sshd in turn exits, closing the SSH connection. However, there is a race condition here: through the vagaries of process scheduling, it is possible that sshd will receive and act on the SIGCHLD notifying it of the shell's exit, before it reads the last chunk of data from the pipe. If so, then it misses that data.
This sequence of events can, for example, cause file truncation when using scp.

4. My understanding

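When nohup is run through ssh without redirecting its output (stdout and stderr), the background command inherits the output pipes of the remote ssh session instead of getting a file of its own. The session, and with it the local ssh process started by pssh, therefore keeps waiting for as long as the command holds those pipes open. After 60 s pssh times out and kills ssh, the connection is closed, and the command's output no longer has a reader. As far as I can tell, nohup only protects against SIGHUP, so the next echo in a.sh most likely fails on the closed pipe (SIGPIPE) and the script exits abnormally. Redirecting the output removes this dependency: ssh no longer waits for the command to exit, and the command keeps running normally.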
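
The last step, a script dying once nobody reads its output, can be checked locally with a small sketch. It assumes a POSIX system with sh available and uses a pipe whose read end is closed early as a stand-in for the ssh session that was killed; it does not involve ssh or nohup itself.

```python
import signal
import subprocess

# The child sleeps first, like a.sh, and only then writes to stdout.
proc = subprocess.Popen(
    ["sh", "-c", "sleep 1; echo good; sleep 1; echo good"],
    stdout=subprocess.PIPE,
)

# Close the read end right away: the reader disappears while the child
# is still sleeping, as happens when the ssh connection is torn down.
proc.stdout.close()

# The first echo writes to a pipe with no reader, so the shell receives
# SIGPIPE and never reaches the second echo.
returncode = proc.wait()
print("return code:", returncode)
```

A negative return code equal to -signal.SIGPIPE indicates that the shell was terminated by SIGPIPE rather than finishing its loop.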

Original article: https://www.cnblogs.com/lyg-blog/p/12046138.html