Dissecting the Linux Kernel Implementation of TCP Congestion Control Algorithms (Part 5)

Kernel version: 2.6.37

Main source files: linux-2.6.37/net/ipv4/tcp_vegas.h, linux-2.6.37/net/ipv4/tcp_vegas.c

This article analyzes the principles and implementation of the Vegas algorithm. Author: zhangskd @ csdn
====================================================================================

Differences Between the Implementation and the Paper

It should be noted that the actual implementation of Vegas differs considerably from the paper's description, because some of TCP's shortcomings from that era now have solutions better than the ones Vegas proposed. Today's Vegas implementation is simpler and more efficient than it used to be.

The specifics are as follows:
The main aspects that distinguish this implementation from the Arizona Vegas implementation are:

(1) We do not change the loss detection or recovery mechanisms of Linux in any way. Linux already
recovers from losses quite well, using fine-grained timers, NewReno, and FACK.
The Vegas paper detects losses early through more timely timeout detection. That approach assumed that
RTO measurement was imprecise, or that the retransmission timer could not detect losses promptly, neither
of which is a problem in current kernels. SACK achieves everything Vegas's fast-retransmit improvement did,
and does it better, because SACK can tell which packets were lost before any timeout fires. Vegas's
improvements were still meaningful, though: the RTT measurement method used in current kernels was
borrowed from it.

(2) To avoid the performance penalty imposed by increasing cwnd only every-other RTT during slow
start, we increase during every RTT during slow start, just like Reno.
The Vegas paper requires that during slow start the window be doubled only every other RTT and held
constant in between, so that throughput differences can be compared more precisely. However, this adds
CPU overhead and makes slow start slower than other TCP implementations, so it is not adopted.

(3) Largely to allow continuous cwnd growth during slow start, we use the rate at which ACKs come
back as the "actual" rate, rather than the rate at which data is sent.
We normally compute throughput not at the moment data is sent, but from the ACKs of the data that was
sent. CUBIC's HyStart computes the sending rate in the same way.

(4) To speed convergence to the right rate, we set the cwnd to achieve the right (actual) rate when
we exit slow start.

(5) To filter out the noise caused by delayed ACKs, we use the minimum RTT sample observed
during the last RTT to calculate the actual rate.
Several RTT samples can be collected within one RTT; the minimum of them is used to compute the
actual throughput, which filters out the interference of delayed ACKs.

(6) When the sender re-starts from idle, it waits until it has received ACKs for an entire flight of
new data before making a cwnd adjustment decision. The original Vegas implementation assumed
senders never went idle.
The data used in the calculation at the end of an RTT comes from the previous RTT. So in these
situations no calculation or window adjustment is done at the end of the first RTT; only once all data
sent in the first RTT has been ACKed, i.e. at the end of the second RTT, does the calculation resume.

Algorithm Overview

Unlike loss-based congestion control mechanisms (such as BIC and CUBIC), Vegas estimates the
available bandwidth at the network bottleneck by computing the difference between the expected
throughput and the actual throughput. Vegas can therefore detect congestion without relying on packet
loss, performing congestion avoidance before losses occur, which reduces packet drops and uses the
bandwidth more effectively.

The basic idea: when the expected throughput exceeds the actual throughput by more than a certain
amount, the network is considered seriously congested and the send window should be reduced; when the
difference is below a certain amount, the connection is considered to be under-utilizing the bandwidth
and the send window should be increased. Vegas's algorithm in the congestion avoidance phase is as follows:

(1) Compute the difference between the expected throughput (Expected) and the actual throughput (Actual):
Diff = Expected - Actual = cwnd / baseRTT - cwnd / minRTT
Here baseRTT is the RTT measured when none of this connection's packets are queued in router buffers.
minRTT is the measurement from the last RTT; since packets are by then queued in routers, minRTT > baseRTT.
Expected = cwnd / baseRTT is the expected throughput, i.e. the throughput in the ideal case.
Actual = cwnd / minRTT is the actual throughput.

(2) The number of this connection's packets buffered in routers is:
diff = Diff * baseRTT, i.e. the throughput difference multiplied by the propagation delay of the link.
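
A quick worked example with made-up numbers: suppose cwnd = 20 segments, baseRTT = 100 ms and
minRTT = 125 ms. Then Expected = 20 / 0.1 = 200 segments/s and Actual = 20 / 0.125 = 160 segments/s,
so Diff = 40 segments/s and diff = 40 * 0.1 = 4 segments sitting in router queues.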

(3) Window adjustment strategy
At the end of each RTT, Vegas adjusts the window according to diff:

cwnd(n+1) = cwnd(n) + 1    if d(n) < alpha
cwnd(n+1) = cwnd(n)        if alpha <= d(n) <= beta
cwnd(n+1) = cwnd(n) - 1    if d(n) > beta

where d(n) is the estimated number of this connection's packets in router buffers during the n-th RTT,
and cwnd(n) is the congestion window during the n-th RTT.
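
The rule is compact enough to sketch in a few lines of C. This is only an illustration of the
pseudo-code above, not the kernel code (the real implementation is analyzed below); alpha and beta
are the thresholds defined in the next section.

/* Illustrative sketch of the per-RTT Vegas update; not the kernel code. */
static const unsigned int alpha = 2, beta = 4;

/* d is the estimated number of this connection's packets queued in routers */
static unsigned int vegas_update(unsigned int cwnd, unsigned int d)
{
        if (d < alpha)  /* under-utilizing the path: grow by one segment */
                return cwnd + 1;
        if (d > beta)   /* queue building up: shrink by one segment */
                return cwnd - 1;
        return cwnd;    /* alpha <= d <= beta: hold the window steady */
}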

Parameters and Variables

static int alpha = 2; /* lower bound of packets in network */
static int beta = 4; /* upper bound of packets in network */
static int gamma = 1; /* limit on increase */

alpha and beta are used in the congestion avoidance phase. alpha means that when fewer than 2 of this
connection's packets are buffered in routers, the connection is not making full use of the bandwidth and the
send window should be increased; beta means that when more than 4 of this connection's packets are buffered
in routers, the routers may be hitting congestion and the send window should be decreased.
gamma is used in the slow start phase: when more than 1 of this connection's packets is buffered in routers,
window growth should be slowed down and the connection should move to congestion avoidance.
All three values are empirical and can be tuned.
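
In fact tcp_vegas.c exports all three as writable module parameters, so they can be tuned when the
module is loaded (quoted from the 2.6.37 source; the MODULE_PARM_DESC strings are from memory, so
check the file for the exact text):

module_param(alpha, int, 0644);
MODULE_PARM_DESC(alpha, "lower bound of packets in network");
module_param(beta, int, 0644);
MODULE_PARM_DESC(beta, "upper bound of packets in network");
module_param(gamma, int, 0644);
MODULE_PARM_DESC(gamma, "limit on increase");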

/* Vegas variables */
struct vegas {
        u32 beg_snd_nxt;     /* right edge during last RTT */
        u32 beg_snd_una;     /* left edge during last RTT; never actually used */
        u32 beg_snd_cwnd;    /* saves the size of the cwnd; unused, snd_cwnd is used directly */
        u8  doing_vegas_now; /* if true, do Vegas for this RTT; decides whether Vegas is active */
        u16 cntRTT;          /* # of RTTs measured within last RTT. Each ACK received yields one
                              * RTT sample, so the number of ACKs received within this RTT is the
                              * number of samples available for the last RTT. */
        u32 minRTT;          /* min of RTTs measured within last RTT (in usec). The smallest of
                              * the cntRTT samples is taken as the last RTT's measurement, which
                              * avoids interference from delayed ACKs. */
        u32 baseRTT;         /* the min of all Vegas RTT measurements seen (in usec). The
                              * connection's minimum RTT, updated continuously: the RTT when
                              * routers buffer none of this connection's packets, used to compute
                              * the expected throughput. */
};

Function Analysis

An important question arises here: when can Vegas be used, and when can it not?
Let's first look at how the throughput is computed:
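The original post shows a figure at this point; below is a rough textual reconstruction of its timing,
drawn only from the surrounding text (the image itself is not reproduced):

 time ------------------------------------------------------------->
 |<---------- last RTT ---------->|<---------- this RTT ---------->|
 send flight F                     send flight F+1
                                   ACKs for flight F arrive here;
                                   each ACK yields one RTT sample
                                   (cntRTT samples, minimum = minRTT)

 At the end of "this RTT", Vegas computes throughput from minRTT,
 i.e. from data that describes the previous RTT.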

As the sketch shows, the data Vegas uses in its calculation comes from the previous RTT, so Vegas can
only be used to compute and adjust the congestion window when the previous RTT's data is genuinely valid.
The following two functions control whether Vegas is used by setting vegas->doing_vegas_now.
(Aside: the comments in this module are so thorough it puts one to shame @_@)
vegas_enable()

/* There are several situations when we must "re-start" Vegas:
 * (1) when a connection is established: obviously the first RTT has no
 *     previous RTT, so Vegas cannot be used.
 * (2) after an RTO: in the first RTT after a timeout, the last normal RTT
 *     lies far in the past, so Vegas cannot be used.
 * (3) after fast recovery: likewise, for the first RTT after fast recovery
 *     the last normal RTT lies far in the past.
 * (4) when we send a packet and there is no outstanding unacknowledged data
 *     (restarting an idle connection): when sending after idle, the last RTT
 *     is long gone, so Vegas cannot be used.
 *
 * In these circumstances we cannot do a Vegas calculation at the end of the
 * first RTT, because any calculation we do is using stale info -- both the
 * saved cwnd and congestion feedback are stale. Instead we must wait until
 * the completion of an RTT during which we actually receive ACKs.
 */
static void vegas_enable(struct sock *sk)
{
        const struct tcp_sock *tp = tcp_sk(sk);
        struct vegas *vegas = inet_csk_ca(sk);

        /* Begin taking Vegas samples next time we send something. */
        vegas->doing_vegas_now = 1;

        /* Set the beginning of the next send window. */
        vegas->beg_snd_nxt = tp->snd_nxt;

        vegas->cntRTT = 0;          /* reset the RTT sample counter; this, not
                                     * doing_vegas_now, is what actually gates
                                     * the first Vegas calculation! */
        vegas->minRTT = 0x7fffffff; /* set to the maximum value */
}

vegas_disable() and tcp_vegas_init()

/* Stop taking Vegas samples for now. */
static inline void vegas_disable(struct sock *sk)
{
        struct vegas *vegas = inet_csk_ca(sk);

        vegas->doing_vegas_now = 0;
}

void tcp_vegas_init(struct sock *sk)
{
        struct vegas *vegas = inet_csk_ca(sk);

        vegas->baseRTT = 0x7fffffff; /* set to the maximum value */
        vegas_enable(sk);
}

tcp_vegas_state()

void tcp_vegas_state(struct sock *sk, u8 ca_state)
{
        if (ca_state == TCP_CA_Open)
                vegas_enable(sk); /* entering the Open state */
        else
                vegas_disable(sk);
}

tcp_vegas_cwnd_event()

/* 
 * If the connection is idle and we are restarting, then we don't want to do any
 * Vegas calculations until we get fresh RTT samples. So when we restart, we reset 
 * our Vegas state to a clean state. After we get acks for this flight of packets,
 * then we can make vegas calculations again.
 */
void tcp_vegas_cwnd_event(struct sock *sk, enum tcp_ca_event event)
{
        if (event == CA_EVENT_CWND_RESTART || event == CA_EVENT_TX_START)
            tcp_vegas_init(sk);
}
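
For completeness, these callbacks are wired into the stack through the congestion ops registration at
the bottom of tcp_vegas.c. The listing below is quoted from the 2.6.37 source (field order from memory,
so check the file for the exact text):

static struct tcp_congestion_ops tcp_vegas __read_mostly = {
        .flags      = TCP_CONG_RTT_STAMP, /* request microsecond RTT stamps */
        .init       = tcp_vegas_init,
        .ssthresh   = tcp_reno_ssthresh,  /* loss handling stays with Reno/Linux */
        .cong_avoid = tcp_vegas_cong_avoid,
        .min_cwnd   = tcp_reno_min_cwnd,
        .pkts_acked = tcp_vegas_pkts_acked,
        .set_state  = tcp_vegas_state,
        .cwnd_event = tcp_vegas_cwnd_event,
        .get_info   = tcp_vegas_get_info,
        .owner      = THIS_MODULE,
        .name       = "vegas",
};

Note that .ssthresh is tcp_reno_ssthresh: this is exactly point (1) from the differences section, loss
detection and recovery are left entirely to the standard Linux machinery.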

The functions above cover the four situations in which the Vegas state must be reset. All of them end up
calling vegas_enable(), which sets doing_vegas_now = 1, so how is the Vegas calculation and window
adjustment suppressed at the end of the first RTT? Through vegas->cntRTT = 0: once the counter is cleared,
cntRTT can only have reached 1 by the end of the first RTT, and as the code analysis below shows, Vegas is
not used while cntRTT <= 2. Vegas therefore only runs from the end of the second RTT onward.

Key Functions

This function is called every time an ACK is received:

/* Do RTT sampling needed for Vegas.
 * Basically we:
 * (1) min-filter RTT samples from within an RTT to get the current
 *     propagation delay + queuing delay (we are min-filtering to try to
 *     avoid the effects of delayed ACKs)
 *     From the ACKs of the data sent in the last window, the sender obtains a
 *     set of samples of the last RTT and takes the minimum as the last RTT's
 *     measurement, used to compute the actual throughput; this avoids
 *     interference from delayed ACKs. The RTT obtained here consists of two
 *     parts: propagation delay plus queuing delay.
 * (2) min-filter RTT samples from a much longer window (forever for now) to
 *     find the propagation delay (baseRTT)
 *     The RTT obtained here contains only the propagation delay, without any
 *     queuing delay.
 */
void tcp_vegas_pkts_acked(struct sock *sk, u32 cnt, s32 rtt_us)
{
        struct vegas *vegas = inet_csk_ca(sk);
        u32 vrtt;

        if (rtt_us < 0)
                return;

        /* Never allow zero rtt or baseRTT */
        vrtt = rtt_us + 1;

        /* Filter to find propagation delay: take the connection's minimum RTT
         * as baseRTT, updated continuously. */
        if (vrtt < vegas->baseRTT)
                vegas->baseRTT = vrtt;

        /* Find the min RTT during the last RTT to find the current
         * propagation delay + queuing delay. Update the last RTT's
         * measurement, keeping the minimum. */
        vegas->minRTT = min(vegas->minRTT, vrtt);

        vegas->cntRTT++; /* every ACK yields one RTT measurement */
}
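
To see why the min filter matters, take some made-up numbers: with the receiver delaying every other
ACK, one RTT might yield the samples 150 ms, 350 ms and 150 ms. An average (about 217 ms) would badly
overestimate the path RTT, but the minimum, 150 ms, comes from an ACK that happened not to be delayed,
which is exactly the sample Vegas wants.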

tcp_vegas_cong_avoid() is the key congestion control function:

static void tcp_vegas_cong_avoid(struct sock *sk, u32 ack, u32 in_flight)
{
        struct tcp_sock *tp = tcp_sk(sk);
        struct vegas *vegas = inet_csk_ca(sk);

        if (!vegas->doing_vegas_now) { /* Vegas is disabled */
                tcp_reno_cong_avoid(sk, ack, in_flight);
                return;
        }

        /* At the end of this RTT, compute the last RTT's throughput and
         * adjust the window accordingly. */
        if (after(ack, vegas->beg_snd_nxt)) { /* this RTT has ended */
                /* Do the Vegas once-per-RTT cwnd adjustment. */

                /* Save the extent of the current window so we can use this at
                 * the end of the next RTT.
                 */
                vegas->beg_snd_nxt = tp->snd_nxt; /* mark the end of the new RTT */

                /* We do the Vegas calculations only if we got enough RTT
                 * samples that we can be reasonably sure that we got at least
                 * one RTT sample that wasn't from a delayed ACK. If we only
                 * had 2 samples total, then that means we're getting only 1
                 * ACK per RTT, which means there're almost certainly delayed
                 * ACKs. If we have 3 samples, we should be OK.
                 */
                if (vegas->cntRTT <= 2) { /* too few samples to rule out delayed ACKs */
                        /* We don't have enough RTT samples to do the Vegas
                         * calculation, so we'll behave like Reno.
                         */
                        tcp_reno_cong_avoid(sk, ack, in_flight);
                } else {
                        u32 rtt, diff;
                        u64 target_cwnd;

                        /* We have enough RTT samples, so using the Vegas
                         * algorithm, we determine if we should increase or
                         * decrease cwnd, and by how much.
                         */

                        /* Pluck out the RTT we are using for the Vegas
                         * calculations. This is the min RTT seen during the
                         * last RTT. Taking the min filters out the effects of
                         * delayed ACKs, at the cost of noticing congestion a
                         * bit later.
                         */
                        rtt = vegas->minRTT; /* the last RTT's measurement */

                        /* Calculate the cwnd we should have, if we weren't
                         * going too fast.
                         * This is: (actual rate in segments) * baseRTT
                         * i.e. the ideal send window.
                         */
                        target_cwnd = tp->snd_cwnd * vegas->baseRTT / rtt;

                        /* Calculate the difference between the window we had,
                         * and the window we would like to have. This quantity
                         * is the "Diff" from the Arizona Vegas papers.
                         * diff = difference of rate * baseRTT, the number of
                         * this connection's packets queued in routers.
                         */
                        diff = tp->snd_cwnd * (rtt - vegas->baseRTT) / vegas->baseRTT;

                        if (diff > gamma && tp->snd_cwnd <= tp->snd_ssthresh) {
                                /* In slow start, and diff > gamma means we are
                                 * going too fast. Time to slow down and switch
                                 * to congestion avoidance.
                                 */

                                /* Set cwnd to match the actual rate exactly:
                                 *   cwnd = (actual rate) * baseRTT
                                 * Then we add 1 because the integer truncation
                                 * robs us of full link utilization.
                                 * Usually snd_cwnd > target_cwnd, so this
                                 * shrinks the window to the right size.
                                 */
                                tp->snd_cwnd = min(tp->snd_cwnd, (u32)target_cwnd + 1);
                                tp->snd_ssthresh = tcp_vegas_ssthresh(tp); /* no larger than the old threshold */

                        } else if (tp->snd_cwnd <= tp->snd_ssthresh) {
                                /* Slow start, but not growing too fast. */
                                tcp_slow_start(tp);

                        } else {
                                /* Congestion avoidance. */

                                /* Figure out where we would like cwnd to be. */
                                if (diff > beta) { /* growing too fast */
                                        /* The old window was too fast, so we
                                         * slow down.
                                         */
                                        tp->snd_cwnd--;
                                        tp->snd_ssthresh = tcp_vegas_ssthresh(tp);

                                } else if (diff < alpha) { /* growing too slowly */
                                        /* We don't have enough extra packets
                                         * in the network, so speed up.
                                         */
                                        tp->snd_cwnd++;

                                } else {
                                        /* Sending just as fast as we should
                                         * be: keep the window unchanged.
                                         */
                                }
                        }

                        /* Clamp snd_cwnd between its minimum and maximum. */
                        if (tp->snd_cwnd < 2)
                                tp->snd_cwnd = 2;
                        else if (tp->snd_cwnd > tp->snd_cwnd_clamp)
                                tp->snd_cwnd = tp->snd_cwnd_clamp;

                        /* When snd_cwnd has grown a lot, raise the threshold
                         * accordingly. */
                        tp->snd_ssthresh = tcp_current_ssthresh(sk);
                }

                /* Wipe the slate clean for the next RTT. */
                vegas->cntRTT = 0;          /* reset the sample counter for the next RTT */
                vegas->minRTT = 0x7fffffff; /* reset minRTT to the maximum */

        } else if (tp->snd_cwnd <= tp->snd_ssthresh) /* Use normal slow start */
                tcp_slow_start(tp); /* ACK within an RTT during slow start */
        /* ACKs within an RTT during congestion avoidance need no handling */
}

static inline u32 tcp_vegas_ssthresh(struct tcp_sock *tp)
{
        return min(tp->snd_ssthresh, tp->snd_cwnd - 1); /* no larger than the old threshold */
}

/* If cwnd > ssthresh, we may raise ssthresh to be half-way to cwnd.
 * The exception is the rate halving phase, when cwnd is decreasing towards
 * ssthresh.
 */
static inline __u32 tcp_current_ssthresh(const struct sock *sk)
{
        const struct tcp_sock *tp = tcp_sk(sk);

        if ((1 << inet_csk(sk)->icsk_ca_state) & (TCPF_CA_CWR | TCPF_CA_Recovery))
                return tp->snd_ssthresh; /* in these two states ssthresh is not raised */
        else /* if 3/4 of cwnd exceeds ssthresh, raise ssthresh to 3/4 of cwnd */
                return max(tp->snd_ssthresh,
                           ((tp->snd_cwnd >> 1) + (tp->snd_cwnd >> 2)));
}
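
To make the integer arithmetic concrete, here is a walk-through with made-up values (RTTs in
microseconds, as in the kernel): suppose snd_cwnd = 20, baseRTT = 100000 and rtt = minRTT = 125000.
Then target_cwnd = 20 * 100000 / 125000 = 16 and diff = 20 * (125000 - 100000) / 100000 = 5. In
congestion avoidance, diff = 5 > beta = 4, so snd_cwnd is decremented to 19 and ssthresh is capped by
tcp_vegas_ssthresh(). Had rtt been 110000, diff would have been 20 * 10000 / 100000 = 2, which falls
within [alpha, beta], so the window would be left unchanged.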

Performance Evaluation

As tcp_vegas_cong_avoid shows, cwnd grows by one packet in the next RTT only when diff < alpha; when
diff > beta, cwnd shrinks by one packet in the next RTT; and when alpha <= diff <= beta, cwnd stays
unchanged.
In other words, even if every calculation yields diff < alpha, Vegas only grows as fast as Reno.
Moreover, most TCP congestion control algorithms never voluntarily shrink cwnd; at most they slow its
growth.
Vegas's cwnd growth function is therefore very gentle, and compared with aggressive growth functions
(such as BIC/CUBIC) it competes poorly for bandwidth.

Vegas has several major weaknesses:

(1) "As the load increases and/or the number of router buffers decreases, Vegas's congestion
avoidance mechanisms are not as effective, and Vegas starts to behave more like Reno. Under
heavy congestion, Vegas behaves very similarly to Reno, since Vegas 'falls back' to Reno's
coarse-grained timeout mechanism."
Vegas performs congestion control mainly by controlling the number of this connection's packets buffered
in routers. When the routers experience severe congestion, that is, when their buffers run short, Vegas
no longer works: the routers start dropping packets, and timeouts follow.
Strictly speaking this is not a weakness unique to Vegas, since without SACK most TCP congestion control
algorithms cannot handle this situation effectively either.

(2) "Vegas is less aggressive in its use of router buffers than Reno. This is because Vegas
limits its use of router buffers as specified by the beta threshold, whereas Reno increases its
window size until there are losses -- which means all the router buffers are being used."
Vegas caps the amount of router buffer each connection may occupy at beta, and slows down once the cap
is exceeded. Reno has no such limit: it keeps growing cwnd until packets are dropped.
We know that the network capacity is Capacity = BDP + Buffer, where BDP is the product of Bandwidth and
RTT, and Buffer is the smallest router buffer on the path. Reno is certain to use up the Buffer; Vegas
may not.
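
An illustrative calculation: on a 10 Mbit/s link with RTT = 100 ms, BDP = 10^7 * 0.1 / 8 = 125 KB,
about 83 segments of 1500 bytes. With a 50-packet bottleneck buffer, Reno keeps growing until roughly
133 packets are in flight and a loss occurs, while Vegas aims to stop at BDP plus at most beta = 4
packets, leaving most of the buffer unused.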

(3) TCP sends packets in bursts, which causes temporary queuing even if cwnd is less than the available
BDP. In fact this is a common problem of many delay-based slow start schemes.
A burst of data can cause packets to queue in routers, increasing the RTT even though no congestion has
occurred. Vegas obviously cannot tell the difference, which leads to incomplete use of the bandwidth.

Summary

If the whole Internet used the Vegas algorithm, the results would indeed be excellent: higher throughput
and a lower loss rate. But that is obviously unrealistic. Vegas competes far too weakly for bandwidth; it
is even less aggressive than Reno, to say nothing of the highly aggressive BIC and CUBIC. Nobody wants to
slow down their own transfers by running such an excessively polite algorithm.
Still, as a completely new line of thinking Vegas is valuable: unlike the loss-based algorithms, it
drives congestion control mostly from delay, giving it a much keener nose for network congestion, and
that is quite instructive.

Original article: https://www.cnblogs.com/aiwz/p/6333390.html