An analysis of the Linux TCP RTO estimator

RTO Estimator for CCID-2

This document presents issues of the RTO estimator used by CCID-2, discusses the suggested RFC2988 algorithm, contrasts this with recent research proposals, and analyses the Linux TCP RTO estimator. From this analysis, it suggests the TCP estimator as a much better solution.

1. Background

The computation of the TCP RTO timeout is standardised in RFC 2988, using the following algorithm:

	/* Given a new RTT measurement `RTT' */

	if (RTT is the first measurement made on this connection) {
		SRTT    := RTT
		RTTVAR  := RTT / 2
		RTO	:= SRTT + max(G, 2 * RTT)	/* G is clock granularity in seconds */
	} else {
		delta	:= RTT - SRTT
		SRTT'	:= SRTT  +  1/8 * delta
		RTTVAR' := 3/4 * RTTVAR  +  1/4 * |delta|
		RTO	:= SRTT' + max(G, 4 * RTTVAR')	
	}

The primed variables in the above refer to the next values obtained for these variables. We here ignore the clock granularity G (Linux provides a clock granularity of up to 1ms).

2. Range analysis and problems

The smoothed RTT SRTT and mean deviation RTTVAR cover an error interval of [SRTT-RTTVAR, SRTT+RTTVAR]. It is interesting to see what happens when delta falls inside or outside this interval.

2a) delta >= 0 (i.e., RTT >= SRTT)

In this case the formulas can be written as

	SRTT'	:= SRTT + 1/8 * delta
	RTTVAR'	:= 3/4 * RTTVAR + 1/4 * delta

	RTO'	:= SRTT + 9/8 * delta + 3 * RTTVAR

Two cases can happen:

RTO' > RTO when delta > 8/9 * RTTVAR
RTO' <= RTO when delta <= 8/9 * RTTVAR

If delta < 8/9 * RTTVAR, the RTO is slightly reduced, accounting for a reduced mean deviation. If delta > 8/9 * RTTVAR then the measured sample is "out of range", and an increase in the RTO results. The threshold of 8/9 * RTTVAR covers errors due to randomness: RTTVAR is expected to be less than SRTT (if it equals or exceeds SRTT then the measurement is not very reliable - the error is too high).

2b) delta < 0 (i.e., RTT < SRTT)

In this case the formulas have the form

	SRTT'	:= SRTT + 1/8 * delta
	RTTVAR'	:= 3/4 * RTTVAR - 1/4 * delta

	RTO'	:= SRTT - 7/8 * delta + 3 * RTTVAR

Two cases can happen:

RTO' > RTO when delta < -8/7 * RTTVAR
RTO' <= RTO when delta >= -8/7 * RTTVAR

The multiplier 8/7 creates a threshold as in case (2a). Note that there is a slight asymmetry: the RTO now increases if the absolute value of delta is greater than 8/7 * RTTVAR, whereas in case (2a) the RTO increases if the absolute value of delta exceeds 8/9 * RTTVAR -- the ratio (8/7 / 8/9) is circa 1.29.

In [LS00] it is pointed out that when the sampled RTT suddenly decreases by a large amount, the RTO increases similarly (to account for an increased mean deviation). This case occurs when RTT < SRTT - 8/7 * RTTVAR; the increment is

	RTO' - RTO = -7/8 * delta        - RTTVAR
		   =  7/8 * (SRTT - RTT) - RTTVAR

Thus, when is RTTVAR small, the RTO increase is close to the absolute value of delta.

Which cases are relevant for the RTO?

A recent analysis [RKS07] has emphasised the need for tight bounds on the RTO value, to avoid spurious timeouts as well as allowing fast loss detection.

When delta falls within the thresholds given in (2a) and (2b) above then the sampled mean deviation decreases, so that convergence towards a lower RTO value is intended. The thresholds depend on the alpha and beta weights (RFC 2988 terminology), the impact of changing these is analysed in [RKS07].

When the RTT sample exceeds the SRTT, and the resulting difference exceeds the threshold given by the mean deviation in (2a), then RTO should also increase.

When the RTT sample dips below the SRTT, an increase of the RTO is counter-intuitive. Below it is shown how Linux neutralises such increases.

3. Pseudo-code

In the current 2.6.24 kernel the tcp_rtt_estimator() function (net/ipv4/tcp_input.c) can be described by the following pseudo-code (for consistency, RTT, SRTT, and RTTVAR are as above, the lower-case variables are Linux-specific):

	/* given a new RTT measurement `RTT' */
	
	RTT := max(RTT, 1)		/* 1 jiffy sampling granularity */

	if (this is the first RTT measurement) {
		SRTT	 := RTT
		mdev	 := RTT/2
		mdev_max := max(RTT/2, 200msec/4)
		RTTVAR	 := mdev_max
		rtt_seq	 := SND.NXT
	} else {
		SRTT'	 := SRTT + 1/8 * (RTT - SRTT)

		if (RTT < SRTT - mdev)
			 mdev'	:= 31/32 * mdev + 1/32 * |RTT - SRTT|
		else
			 mdev'	:= 3/4   * mdev + 1/4  * |RTT - SRTT|
		
		if (mdev' > mdev_max) {
			 mdev_max := mdev'
			 if (mdev_max > RTTVAR)
				RTTVAR' := mdev_max
		}

		if (SND.UNA is `after' rtt_seq) {
			 if (mdev_max < RTTVAR)
				RTTVAR' := 3/4 * RTTVAR + 1/4 * mdev_max
			 rtt_seq  := SND.NXT
			 mdev_max := 200msec/4
		}
	}

	RTO' := SRTT + 4 * RTTVAR

This code agrees in large parts with RFC 2988 and includes improvements; some of these are described in [SK02]. It was subsequently confirmed that this code agrees with the one used for the analysis and measurements in [RKS07].

3a) Higher clock granularity

Linux allows to fine-tune the resolution by setting the HZ variable, so that a resolution of up to 1msec can be achieved. Granularity was a concern in [LS00] (section 3.4), but is no longer a problem today [RKS07].

3b) Initial RTO value

For the first measurement:

RTO = 3 * RTT when RTT > 100 msec (i.e., RTO > 300msec),
RTO = RTT + 200msec otherwise (i.e., 200msec < RTO <= 300msec).

Hence the initial RTO is at least 200msec.

3c) Avoiding spurious timeouts

Linux samples the RTT on each segment, not once per flight. The higher number of samples increases the accuracy, the higher sampling rate avoids aliasing effects on high-speed links (increasing the sampling rate beyond once per window is encouraged in section 3.1 of RFC 1323). The SRTT converges more quickly towards the actual RTT.

The potential pitfall here is that the higher sampling rate reduces the variation: RTTVAR goes towards zero as the sampling rate increases. The outcome is that the RTO `collapses' into the RTT so that spurious timeouts become more likely, as pointed out in section 3.2 of [LS00].

The Linux implementation avoids this pitfall by keeping track of the maximum mean deviation, using the rtt_seq, mdev, and mdev_max variables:

mdev_max is always at least 50msec and only increased when mdev > mdev_max;
RTTVAR is always greater than or equal to mdev_max.

The statement

	 if (mdev_max < RTTVAR)
		RTTVAR' := 3/4 * RTTVAR + 1/4 * mdev_max

is in effect only when 50msec <= mdev_max < RTTVAR in between updates of rtt_seq. To illustrate this:

when the first measurement is taken, RTTVAR = mdev_max;
until the first time that SND.UNA is `after' rtt_seq, RTTVAR >= mdev_max >= mdev;
when SND.UNA is first `after' rtt_seq, RTTVAR stays at its value and mdev_max is reset to 50msec;
in any subsequent round, until SND.UNA is again `after' rtt_seq,

if mdev <= mdev_max in between updates of rtt_seq, then mdev_max stays at 50msec,
when mdev_max has decreased with regard to its previous value stored in RTTVAR, then it is treated as if it were a regular error of RTTVAR that had been sampled once per flight (given by SND.NXT/SND.UNA) and not once per segment.

The result is that RTTVAR decays at most once per flight. Further, due to clamping mdev_max to 50msec, RTTVAR is never decayed below 50msec; hence the RTO is always greater than 200msec.

Frequent decay of the mean deviation due to a higher sampling frequency is blocked, only the long-term effect is tracked.

3d) Avoiding over-estimating RTO when RTT sample dips below SRTT

In (2b) above, RTO increases when the measured RTT suddenly drops by a larger amount (e.g. due to freeing queues at a congested router).

Linux neutralises this effect as described in [SK02]: whenever RTT < SRTT - mdev, the weights for the error of mdev are adjusted as follows:

	mdev' := 31/32 * mdev - 1/32 * delta

With the simplification that RTTVAR = mdev_max = mdev, we get

	RTO'  := SRTT + 1/8 * delta  +  31/8 * RTTVAR - 1/8 * delta
	       = SRTT + 31/8 * RTTVAR

Since 31/8 is close to 4, RTO' stays close to RTO, and then decays towards lower values.

However, since mdev' has only an impact on RTTVAR via mdev_max when mdev > mdev_max, it may be that the dip is entirely ignored, so that RTO' converges towards the lower value due to the (longer-term) decrease of SRTT.

4. Conclusions

The RTO estimator of RFC 2988 is based on the assumptions of older technology. RFC 1323 encourages a sampling rate of more than once per window, but [LS00] shows that this can lead to problems. A further suggestion in [LS00], and also in [KM05], is to adapt the RTO value according to window size. This is interesting for further investigation, but this has not been tested much and is therefore not so well understood. In particular, the timer in [LS00] has rather intricate dynamics.

The Linux RTO estimator addresses the problems of sampling rate, of not collapsing the RTO into the RTT, and to avoid spikes when the RTT suddenly decreases. It also has fared well in the evaluation against other implementations presented in [RKS07]. It therefore is the RTO implementation of choice for CCID-2.

5. References

[KM05]: Kesselman, Alexander and Yishay Mansour. Optimizing TCP Retransmission Timeout. Proceedings of The 4th International Conference on Networking (ICN'05), volume 2 of LNCS, pages 133--140. 2005. Springer.

[LS00]: Ludwig, Reiner and Keith Sklower. The Eifel Retransmission Timer. ACM SIGCOMM Computer Communication Review, 30(3):17-27, 7/2000.

[RKS07]: Rewaskar, Sushant, Jasleen Kaur and F. Donelson Smith. A Performance Study of Loss Detection/Recovery in Real-world TCP Implementations. Proceedings of 15th IEEE International Conference on Network Protocols (ICNP-07). 2007.

[SK02]: Sarolahti, Pasi and Alexey Kuznetsov. Congestion Control in Linux TCP. Proceedings of the FREENIX Track: 2002 USENIX Annual Technical Conference, pages 49-62. 2002.