RTO Estimator for CCID-2
This document presents issues of
the RTO estimator used by CCID-2, discusses the suggested RFC2988
algorithm, contrasts this with recent research proposals, and analyses
the Linux TCP RTO estimator. From this analysis, it suggests the TCP
estimator as a much better solution.
1. Background
The computation of the TCP RTO timeout is standardised in RFC 2988,
using
the following algorithm:
/* Given a new RTT measurement `RTT' */
if (RTT is the first measurement made on this connection) {
SRTT := RTT
RTTVAR := RTT / 2
RTO := SRTT + max(G, 2 * RTT) /* G is clock granularity in seconds */
} else {
delta := RTT - SRTT
SRTT' := SRTT + 1/8 * delta
RTTVAR' := 3/4 * RTTVAR + 1/4 * |delta|
RTO := SRTT' + max(G, 4 * RTTVAR')
}
The primed variables in the
above refer to the next values
obtained for these variables. We here ignore the clock granularity G (Linux provides a clock
granularity of
up to 1ms).
2. Range analysis and problems
The smoothed RTT SRTT and mean deviation RTTVAR cover an error interval
of [SRTT-RTTVAR,
SRTT+RTTVAR].
It is interesting to see what happens when delta falls inside or outside
this interval.
2a) delta >= 0 (i.e., RTT >= SRTT)
In this case the formulas can be written as
SRTT' := SRTT + 1/8 * delta
RTTVAR' := 3/4 * RTTVAR + 1/4 * delta
RTO' := SRTT + 9/8 * delta + 3 * RTTVAR
Two cases can happen:
- RTO' > RTO
when delta > 8/9 *
RTTVAR
- RTO' <= RTO when delta <= 8/9 * RTTVAR
If delta < 8/9 *
RTTVAR, the RTO is slightly reduced, accounting for a reduced
mean deviation. If delta >
8/9 * RTTVAR then the measured sample is "out of range", and
an increase in the RTO results. The threshold of 8/9 * RTTVAR covers errors due
to randomness: RTTVAR is expected to be less
than SRTT (if it equals or exceeds SRTT then the measurement is not very
reliable - the error is too high).
2b) delta <
0 (i.e., RTT <
SRTT)
In this case the formulas have the form
SRTT' := SRTT + 1/8 * delta
RTTVAR' := 3/4 * RTTVAR - 1/4 * delta
RTO' := SRTT - 7/8 * delta + 3 * RTTVAR
Two cases can happen:
- RTO' > RTO
when delta < -8/7 *
RTTVAR
- RTO' <= RTO when delta >= -8/7 * RTTVAR
The multiplier 8/7 creates a threshold
as in case (2a). Note that there is a slight asymmetry: the RTO now increases
if the absolute value of delta
is greater than 8/7 * RTTVAR,
whereas in case (2a) the RTO increases if the absolute value of delta exceeds 8/9 * RTTVAR -- the ratio (8/7 / 8/9) is circa 1.29.
In [LS00] it is pointed out that when the sampled
RTT suddenly decreases by a
large amount, the
RTO increases similarly (to
account for an increased mean deviation). This
case occurs when RTT < SRTT -
8/7 * RTTVAR;
the
increment is
RTO' - RTO = -7/8 * delta - RTTVAR
= 7/8 * (SRTT - RTT) - RTTVAR
Thus, when is RTTVAR
small,
the RTO increase is close to the absolute value of delta.
Which cases are relevant for the RTO?
A recent analysis [RKS07] has emphasised the need for tight bounds on the RTO
value, to avoid spurious timeouts as well as allowing fast loss
detection.
When delta falls within the thresholds given in
(2a) and (2b) above then the sampled mean deviation decreases, so that
convergence towards a lower RTO value is intended. The thresholds
depend on the alpha and beta weights (RFC 2988
terminology), the impact of changing these is analysed in [RKS07].
When the RTT sample exceeds the SRTT,
and the resulting difference exceeds the threshold given by the mean
deviation in (2a), then RTO should also increase.
When the RTT sample dips below the
SRTT, an increase of
the RTO is
counter-intuitive. Below it is shown how Linux neutralises such
increases.
3. Pseudo-code
In the current 2.6.24 kernel the tcp_rtt_estimator()
function (net/ipv4/tcp_input.c)
can be described by the following pseudo-code (for consistency, RTT, SRTT, and RTTVAR are as above, the
lower-case variables are Linux-specific):
/* given a new RTT measurement `RTT' */
RTT := max(RTT, 1) /* 1 jiffy sampling granularity */
if (this is the first RTT measurement) {
SRTT := RTT
mdev := RTT/2
mdev_max := max(RTT/2, 200msec/4)
RTTVAR := mdev_max
rtt_seq := SND.NXT
} else {
SRTT' := SRTT + 1/8 * (RTT - SRTT)
if (RTT < SRTT - mdev)
mdev' := 31/32 * mdev + 1/32 * |RTT - SRTT|
else
mdev' := 3/4 * mdev + 1/4 * |RTT - SRTT|
if (mdev' > mdev_max) {
mdev_max := mdev'
if (mdev_max > RTTVAR)
RTTVAR' := mdev_max
}
if (SND.UNA is `after' rtt_seq) {
if (mdev_max < RTTVAR)
RTTVAR' := 3/4 * RTTVAR + 1/4 * mdev_max
rtt_seq := SND.NXT
mdev_max := 200msec/4
}
}
RTO' := SRTT + 4 * RTTVAR
This code agrees in large parts with RFC 2988 and includes
improvements; some of these are described in [SK02]. It was subsequently
confirmed that this code agrees with the one used for the analysis
and measurements in [RKS07].
3a) Higher clock granularity
Linux allows to fine-tune the resolution by setting the HZ variable, so
that a resolution of up to 1msec can be achieved. Granularity was a
concern in
[LS00] (section 3.4), but is no
longer a problem today [RKS07].
3b) Initial RTO value
For the first measurement:
- RTO = 3 * RTT
when RTT > 100 msec (i.e.,
RTO > 300msec),
- RTO = RTT + 200msec
otherwise
(i.e., 200msec <
RTO <= 300msec).
Hence the initial RTO is at
least 200msec.
3c) Avoiding spurious timeouts
Linux samples the RTT on each segment, not once per flight. The higher
number of samples increases the accuracy, the higher sampling rate
avoids aliasing effects on high-speed links (increasing the sampling rate
beyond once
per window is encouraged in section 3.1 of RFC 1323). The SRTT converges more quickly
towards the actual RTT.
The potential pitfall here is that the higher sampling rate reduces the
variation: RTTVAR goes
towards zero as the sampling rate increases. The outcome is that the
RTO `collapses' into the RTT so that spurious
timeouts become more
likely, as pointed out in section 3.2 of [LS00].
The Linux
implementation avoids this pitfall by keeping track of the maximum mean
deviation, using the rtt_seq,
mdev, and mdev_max variables:
- mdev_max is always
at least 50msec and only increased when mdev > mdev_max;
- RTTVAR is always
greater than or equal to mdev_max.
The statement
if (mdev_max < RTTVAR)
RTTVAR' := 3/4 * RTTVAR + 1/4 * mdev_max
is in effect only when 50msec
<= mdev_max < RTTVAR in between updates of rtt_seq. To illustrate this:
- when the first measurement
is taken, RTTVAR = mdev_max;
- until the first time that SND.UNA
is `after' rtt_seq, RTTVAR >= mdev_max >= mdev;
- when SND.UNA is
first `after' rtt_seq, RTTVAR stays at its value and mdev_max is reset to 50msec;
- in any subsequent round,
until SND.UNA is again
`after' rtt_seq,
- if mdev <= mdev_max
in between updates of rtt_seq,
then mdev_max stays at
50msec,
- when mdev_max has
decreased with regard to its previous value stored in RTTVAR, then it is treated as
if it were a regular error of RTTVAR
that had been sampled once per
flight
(given by SND.NXT/SND.UNA)
and not once per segment.
The result is that RTTVAR decays at most once per
flight. Further, due to clamping mdev_max
to 50msec, RTTVAR is
never decayed below 50msec; hence the RTO is always greater than
200msec.
Frequent decay of the mean deviation due to a higher sampling frequency
is blocked, only the long-term effect is tracked.
3d) Avoiding over-estimating RTO when RTT sample dips below SRTT
In (2b) above, RTO increases
when the measured RTT suddenly drops by a larger amount (e.g. due to freeing queues at a
congested router).
Linux neutralises this effect as described in [SK02]: whenever RTT < SRTT -
mdev, the weights for the error of mdev are adjusted as follows:
mdev' := 31/32 * mdev - 1/32 * delta
With the simplification that
RTTVAR = mdev_max = mdev, we get
RTO' := SRTT + 1/8 * delta + 31/8 * RTTVAR - 1/8 * delta
= SRTT + 31/8 * RTTVAR
Since 31/8 is close to 4, RTO'
stays close to RTO, and
then decays towards lower values.
However, since mdev' has
only an impact on RTTVAR
via mdev_max when mdev > mdev_max, it may be
that the dip is entirely ignored, so that RTO' converges towards the
lower value due to the (longer-term) decrease of SRTT.
4. Conclusions
The RTO estimator of RFC 2988 is based on the assumptions of older
technology. RFC 1323 encourages a sampling rate of more than once per
window, but [LS00] shows
that this can
lead to problems. A further suggestion in [LS00], and also in [KM05], is to adapt the RTO value according to window
size. This is interesting for further investigation, but this has not
been tested much and is therefore not so well understood.
In particular, the timer in [LS00]
has rather intricate dynamics.
The Linux RTO estimator addresses the problems of sampling rate, of not
collapsing the RTO into the RTT, and to avoid spikes when the RTT
suddenly decreases. It also has fared well in the evaluation against
other implementations presented in [RKS07]. It
therefore is the RTO implementation of choice for CCID-2.
5. References
- [KM05]
- Kesselman, Alexander and Yishay Mansour. Optimizing TCP
Retransmission Timeout. Proceedings of The 4th International
Conference on Networking (ICN'05), volume 2 of LNCS,
pages 133--140. 2005. Springer.
- [LS00]
- Ludwig, Reiner and Keith Sklower. The Eifel Retransmission Timer.
ACM SIGCOMM Computer Communication Review, 30(3):17-27,
7/2000.
- [RKS07]
- Rewaskar, Sushant, Jasleen Kaur and F. Donelson
Smith. A Performance Study of Loss Detection/Recovery in Real-world TCP
Implementations. Proceedings of 15th IEEE International Conference
on Network Protocols (ICNP-07). 2007.
- [SK02]
- Sarolahti, Pasi and Alexey Kuznetsov. Congestion Control in Linux
TCP. Proceedings of the FREENIX Track: 2002 USENIX Annual
Technical Conference, pages 49-62. 2002.