TCP Congestion State Transitions

The Linux TCP sender is governed by a state machine that determines the sender's actions when
acknowledgements arrive.

The states are as follows:

enum tcp_ca_state {
        TCP_CA_Open = 0,
#define TCPF_CA_Open (1<<TCP_CA_Open)

        TCP_CA_Disorder = 1,
#define TCPF_CA_Disorder (1<<TCP_CA_Disorder)

        TCP_CA_CWR = 2,
#define TCPF_CA_CWR (1<<TCP_CA_CWR)

        TCP_CA_Recovery = 3,
#define TCPF_CA_Recovery (1<<TCP_CA_Recovery)

        TCP_CA_Loss = 4
#define TCPF_CA_Loss (1<<TCP_CA_Loss)
};
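
The TCPF_CA_* bit masks allow later code to test for several states at once; for example,
tcp_current_ssthresh() further below checks (1 << icsk_ca_state) & (TCPF_CA_CWR | TCPF_CA_Recovery).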

Open

This is the normal state in which the TCP sender follows the fast path of execution optimized for
the common case in processing incoming acknowledgements.
When an acknowledgement arrives, the sender increases the congestion window according to
either slow start or congestion avoidance, depending on whether the congestion window is
smaller or larger than the slow start threshold, respectively.

This is the initial state, and also the normal state.
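
In code terms, the choice between slow start and congestion avoidance is made by the congestion
control module's cong_avoid hook. The following is only a simplified, Reno-style sketch of that
decision (demo_cong_avoid is a hypothetical name, loosely following the shape of the kernel's
tcp_reno_cong_avoid), not the actual kernel function:

/* Hypothetical sketch: grow cwnd on an incoming ACK in the Open state.
 * The ack/in_flight parameters are kept only to match the cong_avoid hook shape.
 */
static void demo_cong_avoid(struct sock *sk, u32 ack, u32 in_flight)
{
        struct tcp_sock *tp = tcp_sk(sk);

        if (tp->snd_cwnd <= tp->snd_ssthresh) {
                /* slow start: roughly one extra segment per ACKed segment */
                tp->snd_cwnd = min(tp->snd_cwnd + 1, tp->snd_cwnd_clamp);
        } else {
                /* congestion avoidance: about one extra segment per round-trip time */
                if (tp->snd_cwnd_cnt >= tp->snd_cwnd) {
                        tp->snd_cwnd_cnt = 0;
                        if (tp->snd_cwnd < tp->snd_cwnd_clamp)
                                tp->snd_cwnd++;
                } else {
                        tp->snd_cwnd_cnt++;
                }
        }
}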

Disorder

When the sender detects duplicate ACKs or selective acknowledgements, it moves to the Disorder
state. In this state the congestion window is not adjusted, but each incoming packet triggers
transmission of a new segment. Therefore, the TCP sender follows the packet conservation
principle, which states that a new packet is not sent out until an old packet has left the network.

The congestion window stays constant, and the number of packets in the network is conserved.
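
The packet conservation behaviour in Disorder amounts to clamping the window to what is currently
in the network plus a small burst allowance. A minimal sketch of that clamp, modelled on the
kernel's tcp_moderate_cwnd() (the burst allowance of 3 is an assumption here), is:

/* Sketch: moderate cwnd so no more than a small burst of new data can leave. */
static void demo_moderate_cwnd(struct tcp_sock *tp)
{
        tp->snd_cwnd = min(tp->snd_cwnd,
                           tcp_packets_in_flight(tp) + 3U); /* in flight + small burst */
        tp->snd_cwnd_stamp = tcp_time_stamp;
}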

CWR

The TCP sender may receive a congestion notification either by Explicit Congestion Notification,
an ICMP source quench, or from a local device. When receiving a congestion notification, the Linux
sender does not reduce the congestion window at once, but by one segment for every second
incoming ACK until the window size is halved. When the sender is in the process of reducing the
congestion window size and has no outstanding retransmissions, it is in the CWR (Congestion
Window Reduced) state. The CWR state can be interrupted by the Recovery or Loss states.

The congestion window is being reduced, and there are no outstanding retransmissions.

struct tcp_sock {
        ...

        u32 bytes_acked; /* Appropriate Byte Counting */
        u32 prior_ssthresh; /* ssthresh saved at recovery start */
        u32 undo_marker; /* tracking retrans started here */
        u32 high_seq; /* snd_nxt at onset of congestion */
        u32 snd_cwnd_stamp; /* timestamp of the last cwnd adjustment */
        u8 ecn_flags; /* ECN status bits */

        ...
}

struct inet_connection_sock {
        ...

        __u8 icsk_ca_state;
        __u8 icsk_retransmits;
        const struct tcp_congestion_ops *icsk_ca_ops;

        ...
}

 /* Set slow start threshold and cwnd not falling to slow start */
void tcp_enter_cwr(struct sock *sk, const int set_ssthresh)
{
        struct tcp_sock *tp = tcp_sk(sk);
        const struct inet_connection_sock *icsk = inet_csk(sk);
        tp->prior_ssthresh = 0;
        tp->bytes_acked = 0;

        if (icsk->icsk_ca_state < TCP_CA_CWR) { /* only reachable from Open or Disorder */
                tp->undo_marker = 0;
                if (set_ssthresh)
                        tp->snd_ssthresh = icsk->icsk_ca_ops->ssthresh(sk); /* reset the slow start threshold */
                tp->snd_cwnd = min(tp->snd_cwnd, tcp_packets_in_flight(tp) + 1U);
                tp->snd_cwnd_cnt = 0;
                tp->high_seq = tp->snd_nxt;
                tp->snd_cwnd_stamp = tcp_time_stamp;

                TCP_ECN_queue_cwr(tp);
                tcp_set_ca_state(sk, TCP_CA_CWR); /* switch to the CWR state */
        }
}

#define TCP_ECN_OK 1
#define TCP_ECN_QUEUE_CWR 2
#define TCP_ECN_DEMAND_CWR 4

static inline void TCP_ECN_queue_cwr(struct tcp_sock *tp)
{
        if (tp->ecn_flags & TCP_ECN_OK)
            tp->ecn_flags |= TCP_ECN_QUEUE_CWR;
} 
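
TCP_ECN_queue_cwr() only queues the notification: the CWR bit itself goes out on the next new
data segment. The snippet below is a simplified sketch of that transmit-side step (the real
logic lives in TCP_ECN_send() in tcp_output.c; demo_ecn_send_cwr is a hypothetical name and
most of the checks are omitted):

/* Sketch: when sending new data, emit a queued CWR notification once. */
static inline void demo_ecn_send_cwr(struct tcp_sock *tp, struct tcphdr *th)
{
        if (tp->ecn_flags & TCP_ECN_QUEUE_CWR) {
                tp->ecn_flags &= ~TCP_ECN_QUEUE_CWR;
                th->cwr = 1; /* tell the receiver the window has been reduced */
        }
}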

static inline void tcp_set_ca_state(struct sock *sk, const u8 ca_state)
{
        struct inet_connection_sock *icsk = inet_csk(sk);
        if (icsk->icsk_ca_ops->set_state)
               icsk->icsk_ca_ops->set_state(sk, ca_state);
        icsk->icsk_ca_state = ca_state;
}
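
The "one segment for every second incoming ACK" reduction described above does not happen in
tcp_enter_cwr() itself; it is applied as later ACKs are processed. The following is a simplified
sketch of that rate-halving step (demo_cwnd_down is a hypothetical name, loosely modelled on the
kernel's tcp_cwnd_down() with the ACK-flag checks left out):

/* Sketch: shave one segment off cwnd for every two ACKs until it reaches ssthresh,
 * while keeping cwnd no larger than packets in flight + 1.
 */
static void demo_cwnd_down(struct sock *sk)
{
        struct tcp_sock *tp = tcp_sk(sk);

        if (tp->snd_cwnd > tp->snd_ssthresh && ++tp->snd_cwnd_cnt >= 2) {
                tp->snd_cwnd_cnt = 0;
                tp->snd_cwnd--;
        }
        tp->snd_cwnd = min(tp->snd_cwnd, tcp_packets_in_flight(tp) + 1U);
        tp->snd_cwnd_stamp = tcp_time_stamp;
}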

 
Recovery

After a sufficient number of successive duplicate ACKs arrive at the sender, it retransmits the first
unacknowledged segment and enters the Recovery state. By default, the threshold for entering
Recovery is three successive duplicate ACKs, a value recommended by the TCP congestion
control specification. During the Recovery state, the congestion window size is reduced by one
segment for every second incoming acknowledgement, similar to the CWR state. The window
reduction ends when the congestion window size is equal to ssthresh, i.e. half of the window
size when entering the Recovery state. The congestion window is not increased during the
Recovery state, and the sender either retransmits the segments marked lost, or makes forward
transmissions on new data according to the packet conservation principle. The sender stays in
the Recovery state until all of the segments outstanding when the Recovery state was entered
are successfully acknowledged. After this the sender goes back to the Open state. A
retransmission timeout can also interrupt the Recovery state.
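
The entry and exit conditions described above are scattered through tcp_fastretrans_alert() in
the kernel; the function below is only a hypothetical condensation of them (demo_recovery_check
is not a real kernel function), relying on the fact that tp->reordering defaults to 3:

/* Hypothetical condensation: enter Recovery once "enough" duplicate ACKs/SACKs
 * have arrived, leave it once snd_una passes the high_seq recorded at entry.
 */
static void demo_recovery_check(struct sock *sk)
{
        struct tcp_sock *tp = tcp_sk(sk);
        struct inet_connection_sock *icsk = inet_csk(sk);

        if (icsk->icsk_ca_state < TCP_CA_Recovery &&
            tcp_fackets_out(tp) > tp->reordering) {     /* roughly "three dupACKs" */
                tp->high_seq = tp->snd_nxt;             /* remember the recovery point */
                tp->prior_ssthresh = tcp_current_ssthresh(sk);
                tp->snd_ssthresh = icsk->icsk_ca_ops->ssthresh(sk);
                tcp_set_ca_state(sk, TCP_CA_Recovery);
        } else if (icsk->icsk_ca_state == TCP_CA_Recovery &&
                   !before(tp->snd_una, tp->high_seq)) {
                tcp_set_ca_state(sk, TCP_CA_Open);      /* everything outstanding was ACKed */
        }
}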

Loss

When an RTO expires, the sender enters the Loss state. All outstanding segments are marked
lost, and the congestion window is set to one segment, hence the sender starts increasing the
congestion window using the slow start algorithm. A major difference between the Loss and
Recovery states is that in the Loss state the congestion window is increased after the sender
has reset it to one segment, but in the Recovery state the congestion window size can only be
reduced. The Loss state cannot be interrupted by any other state, thus the sender exits to the
Open state only after all data outstanding when the Loss state began have successfully been
acknowledged. For example, fast retransmit cannot be triggered during the Loss state, which
is in conformance with the NewReno specification.
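
The exit condition can be summarised in one check: the sender leaves Loss only when snd_una has
passed high_seq, the highest sequence number outstanding when the state was entered. The helper
below is a hypothetical condensation of that check (the real test sits inside
tcp_fastretrans_alert()):

/* Hypothetical sketch: has all data outstanding at the RTO been acknowledged? */
static int demo_loss_done(const struct sock *sk)
{
        const struct tcp_sock *tp = tcp_sk(sk);

        return inet_csk(sk)->icsk_ca_state == TCP_CA_Loss &&
               !before(tp->snd_una, tp->high_seq);
}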

/* Enter Loss state. If "how" is not zero, forget all SACK information and
 * reset tags completely, otherwise preserve SACKs. If receiver dropped its
 * ofo queue, we will know this due to reneging detection.
 * Enter the Loss state; whether the SACK tags are cleared depends on "how":
 * non-zero means clear them.
 */
void tcp_enter_loss(struct sock *sk, int how)
{
        const struct inet_connection_sock *icsk = inet_csk(sk);
        struct tcp_sock *tp = tcp_sk(sk);
        struct sk_buff *skb;

        /* Reduce ssthresh if it has not yet been made inside this window.
         * The slow start threshold is reduced when first entering the Loss state.
         */
        if (icsk->icsk_ca_state <= TCP_CA_Disorder || tp->snd_una == tp->high_seq ||
            (icsk->icsk_ca_state == TCP_CA_Loss && !icsk->icsk_retransmits)) {
                /* save the current threshold so it can be restored if the cwnd reduction is undone */
                tp->prior_ssthresh = tcp_current_ssthresh(sk);
                /* reduce the slow start threshold */
                tp->snd_ssthresh = icsk->icsk_ca_ops->ssthresh(sk);
                /* report the CA_EVENT_LOSS event to the congestion control algorithm */
                tcp_ca_event(sk, CA_EVENT_LOSS);
        }

        tp->snd_cwnd = 1; /* reset the congestion window to one segment */
        tp->snd_cwnd_cnt = 0;
        tp->snd_cwnd_stamp = tcp_time_stamp;
        tp->bytes_acked = 0;
        tcp_clear_retrans_partial(tp); /* clear the retransmission-related counters */

        if (tcp_is_reno(tp))
                tcp_reset_reno_sack(tp); /* zero sacked_out */

        if (!how) { /* keep the SACK tags */
                tp->undo_marker = tp->snd_una; /* so the cwnd reduction can be undone later if appropriate */
        } else { /* discard the SACK tags */
                tp->sacked_out = 0;
                tp->fackets_out = 0;
        }
        tcp_clear_all_retrans_hints(tp);

        tcp_for_write_queue(skb, sk) { /* walk the sk->sk_write_queue send queue */
                if (skb == tcp_send_head(sk)) /* only segments between snd_una and snd_nxt */
                    break;

                if (TCP_SKB_CB(skb)->sacked & TCPCB_RETRANS)
                     tp->undo_marker = 0;

                /* clear the retransmit and lost tags */
                TCP_SKB_CB(skb)->sacked &= (~TCPCB_TAGBITS) | TCPCB_SACKED_ACKED;

                if (!(TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED) || how) {
                    TCP_SKB_CB(skb)->sacked &= ~TCPCB_SACKED_ACKED; /* drop the SACKed tag */
                    TCP_SKB_CB(skb)->sacked |= TCPCB_LOST; /* mark the segment as lost */
                    tp->lost_out += tcp_skb_pcount(skb); /* count the lost segments */
                    tp->retransmit_high = TCP_SKB_CB(skb)->end_seq;
                }
                }
        }

        tcp_verify_left_out(tp); /* warn if left_out > packets_out */
        tp->reordering = min_t(unsigned int, tp->reordering, sysctl_tcp_reordering);
        tcp_set_ca_state(sk, TCP_CA_Loss);
        tp->high_seq = tp->snd_nxt;
        TCP_ECN_queue_cwr(tp); /* indicate that the sender has entered a congestion state */

        /* Abort F-RTO algorithm if one is in progress */
        tp->frto_counter = 0;
}
#define tcp_for_write_queue(skb, sk)        \
        skb_queue_walk(&(sk)->sk_write_queue, skb)

#define skb_queue_walk(queue, skb)        \
        for (skb = (queue)->next;                         \
               prefetch(skb->next), (skb != (struct sk_buff *) (queue) ) ;      \
               skb = skb->next)

/* Due to TSO, an SKB can be composed of multiple actual packets.
 * To keep these tracked properly, we use this.
 */
static inline int tcp_skb_pcount(const struct sk_buff *skb)
{
        return skb_shinfo(skb)->gso_segs;
}
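
For example, a single 64 KB TSO skb built from 1460-byte segments has gso_segs of roughly 45, so
when tcp_enter_loss() marks it lost, lost_out grows by about 45 packets rather than by 1.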

struct sock {
        ...

        struct sk_buff_head sk_write_queue; /* head of the send queue */
        struct sk_buff *sk_send_head; /* first unsent segment (snd_nxt) */

        ...
} 
/* If cwnd > ssthresh, we may raise ssthresh to be half-way to cwnd. 
 * The exception is rate halving phase, when cwnd is decreasing towards ssthresh 
 */
static inline __u32 tcp_current_ssthresh(const struct sock *sk)
{
        const struct tcp_sock *tp = tcp_sk(sk);
        if ((1<<inet_csk(sk)->icsk_ca_state) & (TCPF_CA_CWR | TCPF_CA_Recovery))
                return tp->snd_ssthresh;  /* cwnd is shrinking in CWR and Recovery */
        else /* raise ssthresh */
                return max(tp->snd_ssthresh, ((tp->snd_cwnd>>1)+(tp->snd_cwnd>>2)));
}
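
As a worked example, suppose snd_cwnd is 40 segments and snd_ssthresh is 20 in the Open state:
the function returns max(20, 20 + 10) = 30, i.e. three quarters of the current window, which is
half-way between the old threshold and cwnd.
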
static inline void tcp_ca_event(struct sock *sk, const enum tcp_ca_event event)
{
        const struct inet_connection_sock *icsk = inet_csk(sk);
        if (icsk->icsk_ca_ops->cwnd_event)
                icsk->icsk_ca_ops->cwnd_event(sk, event);
}
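
The set_state and cwnd_event hooks used above are supplied by the pluggable congestion control
module behind icsk_ca_ops. Purely as an illustration, a hypothetical module named "demo" (not
part of the kernel tree) could wire the pieces together roughly like this, reusing the exported
Reno helpers for the parts it does not customise:

/* Hypothetical "demo" congestion control module, for illustration only. */
static u32 demo_ssthresh(struct sock *sk)
{
        const struct tcp_sock *tp = tcp_sk(sk);

        return max(tp->snd_cwnd >> 1U, 2U); /* classic Reno: halve cwnd, floor of two segments */
}

static void demo_set_state(struct sock *sk, u8 new_state)
{
        /* called from tcp_set_ca_state(); e.g. reset private per-connection state on TCP_CA_Loss */
}

static void demo_cwnd_event(struct sock *sk, enum tcp_ca_event ev)
{
        /* called from tcp_ca_event(); CA_EVENT_LOSS is reported by tcp_enter_loss() */
}

static struct tcp_congestion_ops demo_cong_ops __read_mostly = {
        .name           = "demo",
        .owner          = THIS_MODULE,
        .ssthresh       = demo_ssthresh,
        .cong_avoid     = tcp_reno_cong_avoid, /* reuse the stock Reno window growth */
        .set_state      = demo_set_state,
        .cwnd_event     = demo_cwnd_event,
};

static int __init demo_init(void)
{
        return tcp_register_congestion_control(&demo_cong_ops);
}
module_init(demo_init); /* a real module would also unregister in its exit hook */
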
Original source: https://www.cnblogs.com/aiwz/p/6333394.html