Running CCID-3 over loopback/localhost

Note: These notes are outdated; they were the result of a buggy implementation which caused the window counter not to be updated for RTTs of less than 4 microseconds. This bug has since been fixed. Please bear this in mind when reading: the text is not a real performance evaluation (someone should do this). Loopback is nevertheless interesting for performance tests, since it has a very small RTT and very high throughput.

1. Background

The following problem was reported on dccp@vger: when operating CCID-3 over loopback, after a while it would cause huge delays, eventually slowing the sender down to the absolute minimum of one packet per t_mbi = 64 seconds.

Adding a 1 millisecond delay remedied the problem:
tc qdisc add dev lo root netem delay 1ms
This page analyses what really happened here.

2. Measurements

An application-limited scenario was used with the following parameters: a constant packet size `s' of 22 bytes and a constant bitrate of 1.8 kbps (225 bytes/second). The loopback interface can easily transport packets at Gigabit speed (a quick test with iperf showed 1.75 Gbps on this Pentium-4), so the application sending rate will not saturate the link.

The CCID-3 measurements over loopback resulted in the following: as expected there was no congestion and hence no loss (p = 0). The bitrate X_recv measured at the receiver was a very stable 219 bytes per second (min. 224, max. 225, stddev. 32.52).

The interesting deviations from the expectations follow below. 

The average RTT was 8.3 microseconds (min. 3, max. 63, stddev. 12.5). We will consider the minimum of 3 microseconds as the "true" link RTT.
[Figure: Unmodified loopback: RTT]
There is a notable peak starting with the first packet; afterwards the RTT converges to the actual link RTT. The reason for the peak is that the first RTT measurement is taken from the Request/Response exchange. Since packet exchanges during connection setup take slightly longer than those in steady state, a higher value results. The slow decline is due to the fact that an exponentially weighted moving average is used: it takes a while to converge to the actual link RTT of 3 microseconds.
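RFC 3448, 4.3 updates the RTT estimate as R = q*R + (1-q)*R_sample with a recommended weight of q = 0.9. The following is only a plain-C illustration of that smoothing (not the kernel code), showing how slowly an estimate that starts at the 63-microsecond handshake sample approaches the 3-microsecond link RTT:

#include <stdio.h>

/* Illustration of the EWMA RTT smoothing of RFC 3448, 4.3:
 * R = q*R + (1-q)*R_sample, with the recommended weight q = 0.9.
 * All values are in microseconds. */
int main(void)
{
        double rtt = 63.0;            /* first sample: Request/Response exchange */
        const double q = 0.9;         /* EWMA weight */
        const double sample = 3.0;    /* subsequent samples equal the link RTT */

        for (int i = 1; i <= 40; i++) {
                rtt = q * rtt + (1.0 - q) * sample;
                if (i % 10 == 0)
                        printf("after %2d samples: RTT estimate = %5.2f usec\n", i, rtt);
        }
        return 0;
}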

The RTT changes coincide with the changes of X:
[Figure: Unmodified loopback: Allowed sending rate X]
The loopback interface has an MTU of 16436 bytes. With regard to the RFC 3390 initial sending rate this means, for the packet size s = 22:
        W_init/RTT  =  min(4*22, max(2*22, 4380)) bytes / 63 usec  =  88 bytes / 63 usec  =  11.17 Mbps
which is the value of the first point plotted above.
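As a cross-check, the initial rate can be computed directly from the RFC 3390 formula; the helper below is merely a worked calculation (not kernel code), with s in bytes and the RTT in microseconds:

#include <stdio.h>

static unsigned int min_u(unsigned int a, unsigned int b) { return a < b ? a : b; }
static unsigned int max_u(unsigned int a, unsigned int b) { return a > b ? a : b; }

/* RFC 3390 initial window in bytes: W_init = min(4*s, max(2*s, 4380)),
 * and the resulting initial rate W_init/RTT.  Illustration only. */
static double initial_rate_mbps(unsigned int s, double rtt_usec)
{
        unsigned int w_init = min_u(4 * s, max_u(2 * s, 4380));

        return w_init * 8.0 / rtt_usec;   /* bits per microsecond == Mbit/s */
}

int main(void)
{
        /* s = 22 bytes, initial RTT = 63 usec -> 88 bytes / 63 usec = 11.17 Mbit/s */
        printf("%.2f Mbit/s\n", initial_rate_mbps(22, 63.0));
        return 0;
}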

CCID-3 then performs a kind of slow-start until about 2 seconds. This is not a crystal-clear case of slow-start, so a more detailed analysis is necessary.

The Linux implementation uses revision 00 of draft rfc3448bis, where an idle sender is defined as one which has not sent anything for more than 2 RTTs. Since the RTT is low in this case, in each computation of X the sender is classified as "idle" and gets the benefit of being constrained by the RFC 3390 initial rate rather than by 2 * X_recv:
        min_rate = max(W_init/RTT, 2*X_recv)       =  max(4*22*8/RTT, 2*X_recv)
        X        = max(min(2*X, min_rate), s/RTT)  =  max(min(2*X, min_rate), 22*8/RTT)
Since the sender is application-limited, X_recv is near-constant and equals the application sending rate of 1800 bits per second. This gives the following simplification:
        X = max(min(2*X, max(W_init/RTT, 2*X_recv)), s/RTT)
          = max(min(2*X, max(4*s/RTT, 2*X_recv)), s/RTT)
          = max(min(2*X, max(4*22*8/RTT, 1800)), 22*8/RTT)
The inner `max' simplifies to W_init/RTT since this is greater than 1800 for all RTTs less than 391 msec. Hence
        X = max(min(2*X, 4*s/RTT), s/RTT)
          = max(min(2*X, 4*22*8/RTT), 22*8/RTT)
This means that a "classic" slow-start (where X' = 2*X) is performed only as long as
        s/RTT  <  2*X  <  4*s/RTT = W_init/RTT
which, with the initial RTT of 63 usec, gives the interval 1.4 Mbps < X < 5.6 Mbps. This is only a narrow stretch in the above diagram (in the dccp_probe log, this doubling of the sending rate happened between 0.826 and 1.022 seconds). As long as 2*X <= s/RTT, the s/RTT term dominates, so that X = s/RTT (in the log this was the case from the beginning up to 0.826 seconds). Lastly, consider the stage where 2*X >= W_init/RTT: since W_init/RTT is always greater than s/RTT, at this stage the value of X depends solely on W_init/RTT (in the log this happens from 1.022 seconds onwards).
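To make these three regimes explicit, the sketch below iterates the simplified update X' = max(min(2*X, W_init/RTT), s/RTT) derived above. It is a plain-C illustration of the formula, not the kernel code; rates are in bit/s and the RTT is held constant at its initial 63 microseconds:

#include <stdio.h>

#define S_BYTES  22                 /* packet size s                       */
#define W_INIT   (4 * S_BYTES)      /* RFC 3390 window, since 4*s <= 4380  */

static double dmin(double a, double b) { return a < b ? a : b; }
static double dmax(double a, double b) { return a > b ? a : b; }

/* Simplified rfc3448bis-00 update for an application-limited, idle sender
 * with p == 0:  X' = max(min(2*X, W_init/RTT), s/RTT).  Rates in bit/s.
 *   - 2*X <= s/RTT:           the s/RTT floor dominates, X' = s/RTT
 *   - s/RTT < 2*X < 4*s/RTT:  classic doubling, X' = 2*X
 *   - 2*X >= 4*s/RTT:         capped at X' = W_init/RTT = 4*s/RTT        */
static double update_x(double x, double rtt_sec)
{
        double w_init_rate = W_INIT * 8.0 / rtt_sec;
        double s_rate      = S_BYTES * 8.0 / rtt_sec;

        return dmax(dmin(2.0 * x, w_init_rate), s_rate);
}

int main(void)
{
        double rtt = 63e-6;     /* initial RTT of 63 usec */
        double x   = 0.0;

        for (int i = 0; i < 6; i++) {
                x = update_x(x, rtt);
                printf("X = %7.3f Mbit/s\n", x / 1e6);
        }
        return 0;
}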

The end of this phase was reached when the RTT had converged to the actual link RTT; X peaks at about 2 seconds. The peak value is 234.7 Mbps (reached at 2.391 seconds), and it is simply 4*s/RTT with the link RTT of 3 microseconds, i.e. 4*22*8/3 Mbps.

The t_ipi plot is the inverse of the X plot (t_ipi = s/X):
[Figure: Inter-packet Interval t_ipi]

It is useful to also plot the inverse of t_ipi, the allowed number of packets per second:

[Figure: Allowed number of packets per second]
This again shows the problem: after reaching the peak at about 2 seconds, the allowed number of packets per second gradually declines; in the reported case it eventually dropped to the minimum of one packet per 64 seconds (which corresponds to about 0.016 packets per second). This "dead" minimum is not reached in the above plot, but it is the asymptotic value.

3. What is the explanation?

Summarising the above observations, two culprits were found for this behaviour: frequent expiry of the nofeedback timer, and the absence of receiver feedback once the RTT value had reached the link RTT of 3 microseconds.

3.1 Reason for decline of X

The reason for the decline of X (and hence the increase of t_ipi) is the nofeedback timer: according to RFC 3448, 4.3, each time a feedback packet is received, the nofeedback timer is set to expire after
        max(4*RTT, 2*s/X)  =  max(12, 2*22*8/X)  microseconds   (with X expressed in Mbit/s)
When it expires, X is halved by modifying the cached value of X_recv. Revision 00 of rfc3448bis had the additional rule whereby X is halved directly if either (i) no feedback has been received yet or (ii) p == 0, according to
X = max(X/2, s/t_mbi)
where s/t_mbi is the asymptotic value of 1 packet in 64 seconds. Since in the given scenario p == 0, this is what happens (and this rule is still in place in revision 05 of rfc3448bis).
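A minimal sketch of that halving rule (an illustration of the formula above, not the kernel code; the function name is made up):

/* Halving rule applied on nofeedback-timer expiry when p == 0 or no feedback
 * has been received yet (rfc3448bis): X = max(X/2, s/t_mbi).
 * Illustration only; for s = 22 bytes the floor s/t_mbi is 2.75 bit/s,
 * i.e. one packet per 64 seconds. */
#define T_MBI_SEC  64.0   /* maximum back-off interval */

double halve_x(double x_bps, unsigned int s_bytes)
{
        double floor_bps = s_bytes * 8.0 / T_MBI_SEC;

        return (x_bps / 2.0 > floor_bps) ? x_bps / 2.0 : floor_bps;
}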

The nofeedback timer will expire every 12 microseconds as long as s/X < 6 microseconds, i.e. as long as X > 29.34 Mbps. Once X has been reduced below 29.34 Mbps, s/X > 2*RTT, so the next timer expiry happens after 2*s/X. In the X diagram above, the first time that X crosses 29.34 Mbps is between 1.02 and 1.11 seconds; but this is still during the semi slow-start phase, where X is increasing.

The second time that X passes 29.34 Mbps is at about 2.68 seconds into the trace, by which time it has already been halved three times (relative to the peak value of 234.7 Mbps). At this point X was 29.3 Mbps, so the next expiry occurs about 12 microseconds later. After halving, X has a value of 14.7 Mbps; the next expiry of the nofeedback timer thus comes after 2*22*8/14.7 = 24 microseconds, leading to further expiries after 48, 96, 192, ... microseconds.

An astute observer will notice that such a rapid decline of X does not appear in the diagram. The reason is that the Linux implementation clamps the nofeedback expiry interval to a user-configurable minimum which defaults to 100 milliseconds:
        actual_expiry_interval  =  max(user_config_value, max(4*RTT, 2*s/X))
This default value is incidentally close to the inter-packet gap (98 milliseconds) of the sending application. Hence the sending rate is halved approximately once per application send interval.
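A sketch of this scheduling rule (illustration only; the function and parameter names are made up, all times are in microseconds, X is in bit/s and s in bytes):

/* Expiry interval for the nofeedback timer as described above:
 * the RFC 3448 value max(4*RTT, 2*s/X), clamped to a user-configurable
 * minimum (default 100 ms).  Illustration only. */
unsigned long nofeedback_expiry_usec(unsigned long rtt_usec, double x_bps,
                                     unsigned int s_bytes, unsigned long min_usec)
{
        unsigned long two_s_over_x = (unsigned long)(2.0 * s_bytes * 8.0 * 1e6 / x_bps);
        unsigned long rfc_interval = (4 * rtt_usec > two_s_over_x) ? 4 * rtt_usec
                                                                   : two_s_over_x;

        return rfc_interval > min_usec ? rfc_interval : min_usec;
}

/* Example: RTT = 3 usec, X = 29.34 Mbit/s, s = 22 bytes, minimum = 100 ms
 * -> approximately 12 usec, clamped to 100000 usec (100 ms). */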

A further slow-down occurs when t_ipi = s/X > 50 msec, i.e. when X is less than 3.52 kbps: from that point on, 2*s/X exceeds the 100 millisecond minimum, so the nofeedback timer expires less often than once per 100 milliseconds. This can be observed in all the above diagrams as a change in gradient after about 4 seconds into the trace. Consider the X plot above: sampling points are taken each time a new datagram is sent (i.e. every 98 milliseconds); after 4 seconds the gradient changes, and one can see that two consecutive samples often still have the same X value, indicating that X now decreases at a slower rate.

3.2 Why is there no longer any feedback?

X will be halved as long as there is no feedback. And indeed, in the trace the receiver stopped sending feedback approximately at the time when the EWMA of the RTT had converged to the actual link RTT of 3 microseconds.

The reason for this is in the window-counter algorithm of RFC 4342: the receiver only sends feedback when the CCVal changes, and the algorithm uses the following expression:
quarter_RTTs = floor((current_time - last_WC_time) / (RTT/4))
Since the implementation uses microsecond resolution, RTT/4 truncates to zero for any RTT smaller than 4 microseconds, so the expression cannot be evaluated without a divide-by-zero; as noted at the top, the buggy implementation consequently did not update the window counter at all for such RTTs. The CCVal therefore never changes, there is no feedback, and the nofeedback timer continues to halve the sending rate X until it reaches the asymptotic value of s/t_mbi.
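The problem can be sketched as follows. This is a hypothetical helper, not the actual kernel function: times are in microseconds, and the update follows the window-counter scheme along the lines of RFC 4342, 8.1. For rtt < 4 the divisor rtt/4 truncates to 0, so either the division faults or, with a guard as below, the counter simply never advances:

#include <stdint.h>

/* Hypothetical sketch (not the kernel code) of the sender-side window-counter
 * update along the lines of RFC 4342, 8.1.  All times in microseconds.
 * For rtt < 4 the divisor rtt/4 truncates to zero: without the guard the
 * division would fault; with it, the counter never advances, the CCVal seen
 * by the receiver stays constant, and hence no feedback is triggered. */
uint8_t update_window_counter(uint8_t ccval, uint32_t now, uint32_t last_wc_time,
                              uint32_t rtt)
{
        uint32_t quarter_rtts;

        if (rtt / 4 == 0)           /* RTT < 4 usec: cannot compute quarter-RTTs */
                return ccval;

        quarter_rtts = (now - last_wc_time) / (rtt / 4);
        if (quarter_rtts > 0)       /* advance by at most 5, modulo 16 */
                ccval = (ccval + (quarter_rtts > 5 ? 5 : quarter_rtts)) % 16;

        return ccval;
}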

4. Counter-measure

Since the window-counter algorithm does not work for RTTs smaller than 4 microseconds, the obvious counter-measure is to clamp all RTT samples to a minimum of at least 4 microseconds.
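A minimal sketch of such a clamp (illustration only; the constant and function names below are made up):

#include <stdint.h>

/* Clamp each RTT sample (in microseconds) before it is used by the
 * window-counter algorithm; 4 usec is the smallest value for which RTT/4
 * does not truncate to zero.  Hypothetical names, illustration only. */
#define CCID3_MIN_RTT_USEC  4U

uint32_t clamp_rtt_sample(uint32_t rtt_sample_usec)
{
        return rtt_sample_usec < CCID3_MIN_RTT_USEC ? CCID3_MIN_RTT_USEC
                                                    : rtt_sample_usec;
}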

Below this is shown to work, by testing with an artificially inflated RTT; subsequently an equilibrium condition is computed.

4.1 Reaching stability by artificially inflating the RTT

Experiments were performed in which the RTT was artificially inflated by adding 0.5 to 1 msec of link delay, as described in section 1.
[Figure: Allowed sending rate X when RTT was inflated]

The steps can be traced in the 1ms log and the 2ms log. The following graphs are a consequence of this stabilisation (as they derive from X).

[Figure: Inter-packet-interval t_ipi when RTT was inflated]

[Figure: Number of packets per second allowed by CCID-3 when RTT was inflated]

4.2 Computing an equilibrium condition

The particulars of the given situation are:
  1. The inter-packet gap IPG of the sending application is such that IPG > 2*RTT. The sender thus qualifies as "idle between send()s" according to draft rfc3448bis-00, so that the new value X' is computed as
        X' = max(min(2*X, max(W_init/RTT, 2*X_recv)), s/RTT)
  2. The sending rate is application-limited, i.e. X_send < s/RTT. The receiver reports this as X_recv = X_send < s/RTT.
  3. Since X_recv < s/RTT, and hence 2*X_recv < 4*s/RTT = W_init/RTT, we can simplify
        X' = max(min(2*X, W_init/RTT), s/RTT)
This can be read as the recurrence of a series; the value towards which it converges depends on the starting value:
  1. When the start value X >= W_init/(2*RTT) then the series converges towards W_init/RTT.
  2. When the start value X < s/(2*RTT) then the series converges towards s/RTT.
What has not been considered is that the nofeedback timer also interferes with this processing when 4*RTT is smaller than the 100 msec minimum expiry interval, which in turn is approximately equal to the IPG (as is the case above). Case 2 is therefore the likely outcome, and it is indeed observable in the above diagrams: s/RTT is 176 kbps for an RTT of 1 msec (0.5 msec added delay) and 88 kbps for an RTT of 2 msec (1 msec added delay).
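The quoted equilibrium rates are easy to verify; the short program below is just a worked check of s/RTT (and, for comparison, W_init/RTT) for the two inflated RTTs:

#include <stdio.h>

/* Worked check of the equilibrium values quoted above (illustration only). */
int main(void)
{
        const double s_bits      = 22 * 8;         /* packet size in bits          */
        const double w_init_bits = 4 * 22 * 8;     /* RFC 3390 window (88 bytes)   */
        const double rtts_sec[]  = { 1e-3, 2e-3 }; /* 0.5 ms and 1 ms added delay  */

        for (int i = 0; i < 2; i++)
                printf("RTT = %.0f ms:  s/RTT = %3.0f kbit/s,  W_init/RTT = %3.0f kbit/s\n",
                       rtts_sec[i] * 1e3, s_bits / rtts_sec[i] / 1e3,
                       w_init_bits / rtts_sec[i] / 1e3);
        return 0;
}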