TCP拥塞控制算法内核实现剖析（七）

内核版本：3.2.12

主要源文件：linux-3.2.12/ net/ ipv4/ tcp_scalable.c

本文主要分析scalable的实现，作者zhangskd @ csdn
====================================================================================

原理

The design objective of STCP is to make the recovery time from loss events be constant regardless of

the window size. This is why it is called "Scalable". Note that the recovery time of TCP-NewReno largely

depends on the current window size.

STCP suggests alternative algorithms for cwnd increase and decrease(in congestion avoidance).

主要算法：

cwnd = cwnd + 0.01 (per ACK received in an RTT)

Upon the first detection of congestion, the cwnd is decreased accordingly.

cwnd = cwnd - (0.125*cwnd)

Scalable算法：主要改进了AIMD。尤其是AI，由于AI为一个常数，故其快速恢复后，恢复到峰值所用的时间

只跟RTT有关。

数据结构

/* These factors derived from the recommended values in the aer:
 * 0.1 and 7/8. We use 50 instend of 100 to account for delayed ack.
 */
#define TCP_SCALABLE_AI_CNT 50U
#define TCP_SCALABLE_MD_SCALE 3

由于delayed ack的使用，通常发送2个数据包才返回一个ack。所以，把cnt = 100改为cnt = 50。如此一来，

相当于每发送100个数据包snd_cwnd就加1。

函数

static void tcp_scalable_cong_avoid(struct sock *sk, u32 ack, u32 in_flight)
{
    struct tcp_sock *tp = tcp_sk(sk);
    if (!tcp_is_cwnd_limited(sk, in_flight))
        return;

    if (tp->snd_cwnd <= tp->snd_ssthresh)
        tcp_slow_start(tp);
    else
        /* 当snd_cwnd < 50时，同Reno.*/
        tcp_cong_avoid_ai(tp, min(tp->snd_cwnd, TCP_SCALABLE_AI_CNT));
}

static u32 tcp_scalable_ssthresh(struct sock *sk)
{
    const struct tcp_sock *tp = tcp_sk(sk);
    /* 减小因子为 1/8 */
    return max(tp->snd_cwnd - (tp->snd_cwnd >> TCP_SCALABLE_MD_SCALE), 2U);
}

性能

可以看出Scalable的原理和实现都比较简单，但是，它有一个很有特色的地方，就是在丢包并快速恢复后，

从再次进入拥塞避免到达到上次丢包时的cwnd，Scalable所用的时间只跟RTT有关！

我们先来看以下Reno:

假设在发生丢包时的cwnd为C，那么再次进入拥塞避免时的cwnd=C/2.

由于每个RTT内cwnd加1，所以cwnd从C/2恢复到C需要的时间为：(C/2) * RTT

可见，Reno从再次进入拥塞避免，到恢复到丢包前的cwnd所需要的时间不仅和RTT有关，还和C成正比。

也就是说，当C越大，耗时就越多。在BDP网络中，这是不可以接受的。

对于Scalable:

cwnd(k+1) = cwnd(k) + cwnd(k) / 100

在发生丢包时的cwnd=C，那么再次进入拥塞避免时的cwnd=0.875C.

假设从0.875C恢复到C需要n个RTT的时间。

那么：0.875C(1+0.01)^n = C

所以 n = -log(0.875) / log(1+0.01) = 14

可以看到，不管C是多少，恢复的时间都是14*RTT！在较大的BDP中，Scalable节省的时间就很可观了。

不足：Scalable的Convergence（收敛性）不好，不能很好地处理多条连接共存的情况。