Re: [PATCH 1/1] DCCP: Fix up t_nom - FOLLOW-UP

Quoting Ian McDonald:
|  Gerrit,
|  
|  Thank you for this summary. I think that this discussion shows the
|  fundamental difference in our patches and therefore the fundamental
|  issue we need to resolve. It would be helpful if Eddie or others
|  threw in their 2 cents' worth.
|  
|  On 10/01/07, Gerrit Renker <gerrit@xxxxxxxxxxxxxx> wrote:
|  >                 Packet scheduling tardiness problem
|  >                 -----------------------------------
|  >
|  > I would like to continue discussion on your patch; I hope you are not dismayed
|  > about the response: this concerned only the code changes as such. I think that
|  > there is a valid point which you were trying to resolve.
|  >
|  I'm not dismayed at all by the response. These issues sometimes take a
|  lot of discussion before they go into the kernel. Look at hrtimers,
|  kevents, netchannels, reiser4 etc. and you will see that we are in a
|  low number of revisions! It is far better to discuss so that we all
|  understand what is happening and can get the best possible solution.
My sincere thanks for this answer. It was short-sighted of me to simply contradict the
patch without considering its purpose. I may well be wrong, and I would be a fool to
ignore valid points. I think we have the same interests here, and I am keen to hear any
suggestions you have for how we can tackle these problems.
This response is a great help: apart from the usual arguments about code, the work is
hampered by the fact that the module has already gone through so many revisions - things
that may have worked for an early form of TFRC in FreeBSD do not necessarily work for
CCID 3 in Linux.

|  > (A) The granularity of the rate-based packet scheduling is too coarse
|  >
|  >     We are resolving t_ipi with microsecond granularity but the timing for the
|  >     delay between packets uses millisecond granularity (schedule_timeout). This
|  >     means we can only generate inter-packet delays in the range of 1 millisecond
|  >     up to 64 seconds (the largest possible inter-packet interval, which
|  >     corresponds to 1 byte per 64 seconds).
|  >
|  >     As a result, our range of speeds that can be influenced is:
|  >
|  >             1/64 byte per second  ....  1000 bytes per second
|  >
|  >     In all other cases, the delay will be 0 (due to integer division), and hence
|  >     the packet spacing solely depends on how fast the hardware can cope.
|  >
|  >     In essence, this is like a car whose accelerator works really well in the
|  >     range 1 meter/hour up to 2 miles per hour, and for everything else it tries
|  >     to use top speed of 120 mph.
|  >
|  
|  This is correct provided you are working on the assumption of part B.
|  I think that this is the whole crux of the problem and why we haven't
|  got an agreed patch between us yet.
I just noticed that the calculation above is flawed, since it does not take `s'
into account: it assumes that s is 1 byte. To see the other extreme, if s=1260
(IPv6 path MTU), then

 t_ipi |  X
 ------+------------------
 1 ms  |  1.26  Mbyte/sec
 10ms  |  126   kbyte/sec   
 100ms |  12.6  kbyte/sec

Whenever I run my typical `struggling receiver' scenario, I see delays of up to
500 ... 600ms in the system logs.
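
To make the loss of resolution concrete, here is a minimal sketch of the kind of
conversion the sender ends up doing (the helper name and HZ value are illustrative;
this is not the actual ccid3 code):

	/*
	 * Sketch: converting t_ipi (microseconds) into a scheduler delay.
	 * Assumes HZ=1000, i.e. one jiffy = 1000 usec.
	 */
	static inline long ipi_to_delay(long t_ipi_usec)
	{
		/* Integer division truncates: any t_ipi below one jiffy
		 * (1000 usec here) yields a delay of 0, i.e. no pacing. */
		return t_ipi_usec / (1000000 / HZ);
	}

	/*
	 * t_ipi =   60 usec -> delay 0 jiffies (send at hardware speed)
	 * t_ipi =  999 usec -> delay 0 jiffies
	 * t_ipi = 1500 usec -> delay 1 jiffy  (rounded down to 1 ms)
	 */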

|  Rather than espouse my views directly I'll try and comment on the RFC
|  which is the canonical definition of what we should be doing. I've put
|  in the whole 4.6 section of RFC 3448 and commented inline.
|  
|  4.6.  Scheduling of Packet Transmissions
|  
|     As TFRC is rate-based, and as operating systems typically cannot
|     schedule events precisely, it is necessary to be opportunistic about
|     sending data packets so that the correct average rate is maintained
|     despite the coarse-grain or irregular scheduling of the operating
|     system.
|  
|  Note here that the requirement is not to maintain precise timing. The
|  requirement is to maintain the correct average rate. This was the
|  light that went on in my head when I was reviewing both of our
|  previous iterations. I believe we were coming from the wrong starting
|  point - we were trying to improve scheduling, not maintain the correct
|  rate.
Have you got any ideas or suggestions on how to achieve this? I am at a loss at the
moment; I honestly don't know how we can achieve a fair average rate. The difficulties
I see are that
 * when we use jiffies rather than timers, we get at best a resolution of 1ms (HZ=1000)
 * when we use scheduling, we are likewise bound by a resolution of 1ms

With a typical HZ=100 we even have t_gran/2 = 5ms.

Are you thinking in terms of queueing disciplines in order to maintain an average rate?
To follow this up I looked at the token bucket filter in net/sched/sch_tbf.c, but the
comments say it is also bounded by the HZ resolution.
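
One direction, just to make the accounting concrete (a sketch only, with invented
names, not a concrete patch): keep t_nom advancing from its old value, so that
lateness accumulates as send credit, in the spirit of a token bucket:

	/* Sketch: credit-preserving pacing; times in microseconds. */
	struct pacing_state {
		long t_nom;	/* nominal send time of next packet */
		long t_ipi;	/* current inter-packet interval    */
	};

	/*
	 * Returns 1 if a packet may be sent now. The key point: t_nom
	 * advances by t_ipi from its OLD value, never from t_now, so
	 * being late accumulates as credit which is paid back in a
	 * burst, preserving the average rate X = s/t_ipi.
	 */
	static int may_send(struct pacing_state *p, long t_now, long delta)
	{
		if (t_now < p->t_nom - delta)
			return 0;		/* too early - reschedule */
		p->t_nom += p->t_ipi;		/* NOT p->t_nom = t_now + t_ipi */
		return 1;
	}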
  

|     When a sender first starts sending at time t_0, it calculates t_ipi,
|     and calculates a nominal send time t_1 = t_0 + t_ipi for packet 1.
|     When the application becomes idle, it checks the current time, t_now,
|     and then requests re-scheduling after (t_ipi - (t_now - t_0))
|     seconds.  When the application is re-scheduled, it checks the current
|     time, t_now, again.  If (t_now > t_1 - delta) then packet 1 is sent.
|  
|  Note that initially we set t_ipi to 1 second. This could be set to a
|  better value based on connection setup as per Eddie and your
|  discussion earlier, but I haven't implemented this yet. In this sense my
|  code is a hack: I remove the 1 second and use the initial RTT once
|  we obtain it. This ugliness can be removed when we make the code
|  base conform to the RFC's intent (it is not in the RFC yet, but Eddie
|  said he would propose it for revision).
The initial value comes from section 4.2, "Sender initialisation", where the
initial packet rate X/s is set to 1 packet/second; with t_ipi = s/X this gives
an initial t_ipi of 1 second.
  

|     Now a new t_ipi may be calculated, and used to calculate a nominal
|     send time t_2 for packet 2: t_2 = t_1 + t_ipi.  The process then
|     repeats, with each successive packet's send time being calculated
|     from the nominal send time of the previous packet.
|  
|     In some cases, when the nominal send time, t_i, of the next packet is
|     calculated, it may already be the case that t_now > t_i - delta.  In
|     such a case the packet should be sent immediately.  Thus if the
|     operating system has coarse timer granularity and the transmit rate
|     is high, then TFRC may send short bursts of several packets separated
|     by intervals of the OS timer granularity.
|  
|  Note a couple of phrases here "the packet should be sent immediately",
|  "TFRC may send short bursts of several packets separated by intervals
|  of the OS timer granularity". This is why I don't think we should ever
|  reset t_nom to the current time. Doing this stops us achieving the
|  average rate required. By resetting t_nom, what we are doing is
|  treating X as the maximum instantaneous rate rather than the average
|  rate we are trying to achieve.
As a numerical example, let's assume that t_now = t_nom + 5 * t_ipi 
(5 inter-packet intervals too late), neglecting the value of delta for the moment. 
It then takes 5 more packets to `break even' so we indeed have a burst of packets.

If t_nom were reset at that point then the gap of being 5 packets behind would
be ignored, which would indeed lead to a lower average rate.

So I believe you are right: resetting t_nom leads to a lower average sending
rate, which means the patch that resets t_nom should be dropped.
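
To convince myself, I wrote up the numerical example above as a throwaway
user-space simulation (illustrative only):

	/*
	 * Sketch: packets sent within a 10*t_ipi window when the sender
	 * wakes up 5*t_ipi late. Keeping t_nom gives a 5-packet catch-up
	 * burst (average rate preserved); resetting t_nom loses the gap.
	 */
	#include <stdio.h>

	int main(void)
	{
		const long t_ipi = 10;		/* arbitrary time units     */
		long t_now = 5 * t_ipi;		/* woke up 5 intervals late */
		const long end = 10 * t_ipi;
		long t_nom_keep = 0, t_nom_reset = t_now;
		int sent_keep = 0, sent_reset = 0;

		for (; t_now <= end; t_now++) {
			while (t_nom_keep <= t_now) {	/* catch-up burst */
				t_nom_keep += t_ipi;
				sent_keep++;
			}
			while (t_nom_reset <= t_now) {
				t_nom_reset += t_ipi;
				sent_reset++;
			}
		}
		printf("keep t_nom:  %d packets\n", sent_keep);	/* 11 */
		printf("reset t_nom: %d packets\n", sent_reset);	/*  6 */
		return 0;
	}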

But I would still like to find out what the limits of the sending/scheduling
mechanism we currently use are, so that similar problems do not crop up elsewhere.

The problem I see is that
 * scheduling granularity is at best 1ms
 * hence t_gran/2 is at best 500usec
 * and so when t_ipi < 1ms, packets will always be sent in bursts

So we have a `critical point' after which the average sending rate is dominated by the
hardware and no longer under the rate-based control of t_ipi.

I have done a back-of-the-envelope calculation below for different values of s
(9 kbyte is, I think, the maximum size of an Ethernet jumbo frame):

   ------------+------------+-------------+-------------+-------------+-----------+-------------+-----------+
    s (bytes)  |     32     |     100     |     250     |     500     |    1000   |     1500    |    9000   |
   ------------+------------+-------------+-------------+-------------+-----------+-------------+-----------+
    X_critical | 32 kbyte/s | 100 kbyte/s | 250 kbyte/s | 500 kbyte/s | 1 Mbyte/s | 1.5 Mbyte/s | 9 Mbyte/s |
   ------------+------------+-------------+-------------+-------------+-----------+-------------+-----------+

That means that with the standard 1500-byte Ethernet MTU we can only expect
rate-controlled behaviour up to about 1.5 Mbyte/sec (12 Mbit/sec), and even with
jumbo frames only up to 9 Mbyte/sec (72 Mbit/sec)?????

I am dumbstruck - it means that the whole endeavour to try and use Gigabit cards (or
even 100 Mbit ethernet cards) is futile and we should be using the old 10 Mbit cards???

Ian, do you have any advice regarding test scenarios? Maybe it was utterly naive
of me to expect that we could do gigabit speeds with CCID 3.

Or is there another explanation?
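
For reference, the arithmetic behind the table above is simply
X_critical = s / t_gran with t_gran = 1ms; a throwaway user-space sketch:

	/*
	 * Sketch: critical rate above which t_ipi < 1 jiffy (HZ=1000),
	 * i.e. where rate-based pacing degenerates into hardware bursts.
	 */
	#include <stdio.h>

	int main(void)
	{
		const int sizes[] = { 32, 100, 250, 500, 1000, 1500, 9000 };
		const double t_gran = 1e-3;	/* 1 ms scheduling granularity */
		size_t i;

		for (i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++)
			printf("s = %4d bytes -> X_critical = %7.1f kbyte/sec\n",
			       sizes[i], sizes[i] / t_gran / 1000.0);
		return 0;
	}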


|  >     Therefore I wonder if there is some kind of `micro/nanosleep' which we can use?
|  >     Did some grepping and inevitably landed in kernel/hrtimer.c - any advice on
|  >     how to best deploy these?
|  >
|  >     On healthy links the inter-packet times are often in the range of multiples of
|  >     10 microseconds (60 microseconds is frequent).
|  >
|  I don't believe we need to do this from a practical point of view as
|  the RFC says we aim for average packet rate rather than precise
|  timing. It is interesting from a research point of view whether we
|  achieve better results from smoother packet flow given precise timing,
|  and I have seen published literature on this effect. It is not
|  necessary for RFC compliance though.
|  
|  However if you want to implement for research that is fine provided
|  that we satisfy a couple of criteria:
|  - negligible performance impact. I would have thought that hrtimers
|  would add a reasonable amount of overhead. Is the point of hrtimers to
|  achieve a large number of timers per second with low overhead, or does
|  it allow you to precisely schedule an event at a time you precisely
|  require? By my calculations, using CCID3 on a 1 Gbit/sec link with one
|  hrtimer per packet would mean around 90,000 timer fires per second vs
|  a maximum of 1000 timeslices if we are using standard setup and HZ of
|  1000. Interrupting other operations 90,000 times per second is surely
|  not good??
|  - cross platform support. I'm not sure if platforms like ARM support
|  hrtimers. I want to see CCID3 used across many platforms and small
|  embedded devices based on ARM would be a key market (your cellphone
|  for video calls for example).
|  
|  Based on this I think hrtimer support is not needed but if it is done
|  then it should be an option not a requirement.
Agree with your points.
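
Should someone still want to experiment in this direction, I imagine it would
look roughly like the following (a sketch only, against the hrtimer API as I
understand it; the function names and the wiring into ccid3 are invented):

	/* Sketch: optional hrtimer-based pacing. */
	#include <linux/hrtimer.h>
	#include <linux/ktime.h>

	static struct hrtimer ccid3_pace_timer;		/* illustrative */

	static enum hrtimer_restart ccid3_pace_fired(struct hrtimer *t)
	{
		/* Here one would dequeue and transmit the next packet,
		 * then re-arm the timer for the following t_nom.       */
		return HRTIMER_NORESTART;
	}

	static void ccid3_pace_setup(void)
	{
		hrtimer_init(&ccid3_pace_timer, CLOCK_MONOTONIC,
			     HRTIMER_MODE_REL);
		ccid3_pace_timer.function = ccid3_pace_fired;
	}

	static void ccid3_pace_arm(long t_ipi_usec)
	{
		hrtimer_start(&ccid3_pace_timer,
			      ktime_set(0, t_ipi_usec * 1000),
			      HRTIMER_MODE_REL);
	}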
  
|  >
|  > (B) Fixing send time for packets which are too late
|  >
|  >     You were mentioning bursts of packets which appear to be too late. I consulted
|  >     a colleague about how to fix this: the solution seems much more complicated
|  >     than the current infrastructure supports.
|  >     Suppose the TX queue length is n and the packet at the head of the queue is too
|  >     late. Then one would need to recompute the sending times for each packet in the
|  >     TX queue by taking into account the tardiness (it would not be sufficient to
|  >     simply drain the queue). It seems that we would need to implement a type of
|  >     credit-based system (e.g. like a Token Bucket Filter); in effect a kind of Qdisc
|  >     on layer 4.
|  >
|  > When using only a single t_nom for the next-packet-to-send as we are doing at the moment,
|  > resetting t_nom to t_now when packets are late seems the most sensible thing to do.
|  >
|  > So what I think we should do is fix the packet scheduling algorithm to use finer-grained
|  > delays. Since in practice no one is really interested in speeds of 1kbyte/sec and below,
|  > the current behaviour in effect means that we are not controlling packet spacing at all.
|  >
|  See above.
|  
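
If the credit/Token-Bucket idea from (B) is pursued, the accounting might look
roughly like this (user-space style sketch; the burst cap and all names are my
invention, nothing like this is in the tree):

	/*
	 * Sketch: token-bucket variant of keeping t_nom. Lateness still
	 * accumulates as credit, but the catch-up burst is capped so a
	 * long stall cannot turn into an arbitrarily long packet train.
	 */
	static int packets_allowed(long *t_nom, long t_ipi, long t_now,
				   int burst_max)
	{
		int n = 0;

		while (*t_nom <= t_now && n < burst_max) {
			*t_nom += t_ipi;
			n++;
		}
		/* Drop credit in excess of burst_max packets' worth. */
		if (*t_nom < t_now - burst_max * t_ipi)
			*t_nom = t_now - burst_max * t_ipi;
		return n;
	}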