Re: Re: [PATCH 2/25]: Avoid accumulation of large send credit

Ian, I would appreciate it if in future you would not copy patch
descriptions over from dccp@vger to dccp@ietf.

Apart from the fact that I don't like it, this creates the wrong idea among
people who have little or nothing to do with the actual protocol implementation:
it produces an impression of "let's talk about some implementation bugs".
(Competent implementation feedback is, of course, welcome and solicited on dccp@vger.)

This is all the more regrettable since you are right to raise this point
as a general one: it is indeed a limitation of [RFC 3448, 4.6] with regard
to non-realtime OSes. To clarify, the main issues arising from this limitation
are summarised below.


I. Uncontrollable speeds
------------------------
Non-realtime OSes schedule processes in discrete timeslices with a granularity
of t_gran = 1/HZ. When packets are paced using this mechanism, the maximum rate
is naturally limited to HZ packets per second.
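
To make the numbers concrete: with one packet per tick, the pacing ceiling works
out as s * HZ, where s is the packet size. The small user-space sketch below
(illustrative values only, not taken from any running system) prints the
granularity and the resulting ceiling for a few common HZ settings:

#include <stdio.h>

int main(void)
{
	const unsigned int hz_values[] = { 100, 250, 1000 };
	const unsigned int s = 1500;	/* packet size in bytes, example value */
	unsigned int i;

	for (i = 0; i < 3; i++) {
		unsigned int hz = hz_values[i];
		double t_gran_ms = 1000.0 / hz;			/* scheduling granularity */
		double limit_mbps = (double)s * 8 * hz / 1e6;	/* s * HZ ceiling	  */

		printf("HZ=%4u: t_gran=%5.2f ms, max pacing rate = s*HZ = %5.1f Mbps\n",
		       hz, t_gran_ms, limit_mbps);
	}
	return 0;
}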

There are two speeds involved here: the packet rate `A' of the application
(user-space), and the allowed sending rate `X' determined by the TFRC mechanism
(kernel-space). 

These speeds are not related to one another. The allowed sending rate X will, under
normal circumstances, approach the link bandwidth, following the principles of slow
start. The application sending rate A may be fixed or may vary over time.

No major problems arise when it is ensured that A is always below X. Numerical
example: A=32kbps, X=94Mbps (standard 100 Mb Ethernet link speed). When loss 
occurs, X is reduced according to p. As long as X remains above A, the sender
can send as before; if X is reduced below A, the sender will be limited.

Now the problem: when the application rate A is above s * HZ, there is a range
of speeds where the TFRC mechanism is effectively out of control, i.e. requests
to reduce the sending rate in response to congestion events (ECN-marked or lost
packets) will not be followed.

Numerical example: HZ=1000, X=94Mbps, A=59Mbps, s=1500 bytes. The controllable
limit is s * HZ = 1500 * 8 * 1000 bps = 12Mbps. Assume loss occurs in steady state
such that X is to be reduced to X_reduced. Then, if
                      s * HZ  <  X_reduced  <=  A,
nothing will happen and the effective speed after computing X_reduced will remain at A.
This is even more problematic if A is not fixed but can increase above its current rate.
So, with regard to the numerical example, nothing will happen if X_reduced lies between
12Mbps and 59Mbps: the speed after the congestion event will remain at A=59Mbps.
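
For illustration only, the following user-space sketch replays this check with the
example figures; the value chosen for X_reduced (40 Mbps) is arbitrary and merely
shows that a reduction landing inside (s*HZ, A] has no effect:

#include <stdio.h>

int main(void)
{
	const double hz    = 1000.0;
	const double s     = 1500.0 * 8;	/* packet size in bits	       */
	const double a     = 59e6;		/* application rate A, in bps  */
	const double limit = s * hz;		/* controllable limit: 12 Mbps */
	const double x_reduced = 40e6;		/* hypothetical TFRC result    */
	double effective;

	if (x_reduced >= a)
		effective = a;			/* application-limited anyway  */
	else if (x_reduced > limit)
		effective = a;			/* in (s*HZ, A]: no effect     */
	else
		effective = x_reduced;		/* below s*HZ: reduction bites */

	printf("X_reduced = %.0f Mbps -> effective sending rate = %.0f Mbps\n",
	       x_reduced / 1e6, effective / 1e6);
	return 0;
}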

The problem is even more serious when considering that Gigabit NICs are standard
in most laptops and desktop PCs: here X will ramp up even higher, so that the range
for mayhem is even greater. (Standard Linux even ships with 10 Gbit Ethernet drivers.)

Again: the problem is that TFRC/CCID3 cannot control speeds above s * HZ on a non-realtime
operating system. In car manufacturer terms, this is like a car whose accelerator works
normally over part of its range but jumps to top speed somewhere beyond it. Obviously,
no one would be allowed to sell cars with such a deficiency.

A safer solution, therefore, would be to insert a throttle that limits application speeds
to below s * HZ, to keep applications from stealing bandwidth which they are not supposed to use.
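
As a rough sketch of what such a throttle could look like (this is not the actual
ccid3 code; the function below is hypothetical), the rate handed to the tick-based
scheduler would simply be capped at s * HZ:

#include <stdint.h>
#include <stdio.h>

#define HZ 1000U	/* example configuration */

/* Cap the rate given to the tick-based packet scheduler at s * HZ,
 * since anything above that cannot be paced reliably anyway.
 */
static uint64_t clamp_pacing_rate(uint64_t x_bps, unsigned int s_bytes)
{
	uint64_t limit = (uint64_t)s_bytes * 8 * HZ;	/* s * HZ in bps */

	return x_bps < limit ? x_bps : limit;
}

int main(void)
{
	printf("X = 94 Mbps -> paced at %llu bps\n",
	       (unsigned long long)clamp_pacing_rate(94000000ULL, 1500));
	printf("X = 10 Mbps -> paced at %llu bps\n",
	       (unsigned long long)clamp_pacing_rate(10000000ULL, 1500));
	return 0;
}

Where exactly such a cap would live (CCID3 sender, socket layer, or elsewhere) is a
separate question; the point is only that rates the scheduler cannot pace should not
be granted in the first place.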


II. Accumulation of send credits
--------------------------------
This second problem is also conceptual and is described as accumulation of send credits.
It has been discussed on this list before, please refer to those threads for a more
detailed description of how this comes about. The relevant point here is that accumulation
of send credits will also happen  as a natural consequence of using [RFC 3448, 4.6] on 
non-realtime operating systems. 

The reason is that the use of discrete time slices leads to a quantisation problem,
where t_nom is always set earlier than required by the exact formula: 0.9 msec becomes
0 msec, 1.7 msec becomes 1 msec, 2.8 msec becomes 2 msec, and so forth (this assumes
HZ=1000; it is even worse with lower values of HZ).

Thus, after a few packets, the sender will be "too early" by the sum total of the
quantisation errors that have occurred so far. In the given numerical example, the
sender is skewed by (0.9 + 0.7 + 0.8) msec = 2.4 msec, which is broken into a send
credit of 2 msec plus a remainder of 0.4 msec that might clear at a later stage.
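
The arithmetic can be replayed with a few lines of user-space C (HZ=1000 assumed; the
inter-packet times are the example values above, not measured ones):

#include <math.h>
#include <stdio.h>

int main(void)
{
	const double gaps_ms[] = { 0.9, 1.7, 2.8 };	/* exact inter-packet times */
	double credit_ms = 0.0;
	int i;

	for (i = 0; i < 3; i++) {
		double scheduled = floor(gaps_ms[i]);	/* what a 1 ms timer gives  */
		double error     = gaps_ms[i] - scheduled;

		credit_ms += error;
		printf("exact %.1f ms -> scheduled %.0f ms, sender early by %.1f ms\n",
		       gaps_ms[i], scheduled, error);
	}
	printf("accumulated: %.1f ms = %.0f ms send credit + %.1f ms remainder\n",
	       credit_ms, floor(credit_ms), credit_ms - floor(credit_ms));
	return 0;
}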

In addition, this will lead to speeds which are typically faster than allowed by the exact
value of t_nom: measurements have shown that in the ``linear'' range of speeds below s * HZ,
the real implementation is more than 3 times faster than allowed by the sending rate X = s/t_ipi.


III. Accumulation of inaccuracies
---------------------------------
Due to context switch latencies, interrupt handling, and processing overhead, a scheduling-based
packet pacing will not schedule packets at the exact time, they may be sent slightly earlier or
later. This is another source where send credits can accumulate, but it is not fully understood
yet. It would require measurements to see how far off on average the scheduling is. It does seem
however that this problem is less serious than I/II; scheduling inaccuracies might cancel each other
out over the long term.


NOTE: numerical examples serve to illustrate the principle only. Please do not interpret this as an 
      invitation for discussion of numerical examples.

Thanks.

  
|  On 4/18/07, Lars Eggert <lars.eggert@xxxxxxxxx> wrote:
|  > On 2007-4-18, at 19:16, ext Colin Perkins wrote:
|  > > On 11 Apr 2007, at 23:45, Ian McDonald wrote:
|  > >> On 4/12/07, Gerrit Renker <gerrit@xxxxxxxxxxxxxx> wrote:
|  > >>> There is no way to stop a Linux CCID3 sender from ramping X up to
|  > >>> the link bandwidth of 1 Gbit/sec; but the scheduler can only
|  > >>> control packet pacing up to a rate of s * HZ bytes per second.
|  > >>
|  > >> Let's start to think laterally about this. Many of the problems
|  > >> around
|  > >> CCID3/TFRC implementation seem to be on local LANs and rtt is less
|  > >> than t_gran. We get really badly affected by how we do x_recv etc and
|  > >> the rate is basically all over the show. We get affected by send
|  > >> credits and numerous other problems.
|  > >
|  > > As a data point, we've seen similar stability issues with our user-
|  > > space TFRC implementation, although at somewhat larger RTTs (order
|  > > of a few milliseconds or less). We're still checking whether these
|  > > are bugs in our code, or issues with TFRC, but this may be a
|  > > broader issue than problems with the Linux DCCP implementation.
|  >
|  > I think Vlad saw similar issues with the KAME code when running over
|  > a local area network. (Vlad?)
|  >
|  > Lars
|  >
|  >
|  >
|  >

