Re: dccp bugs (Was: Re: panic on 2.6.24rc5)

| > | And so I guess that this NET_XMIT_DROP should be ignored by dccp code?
| >
| > In the test tree it is now (almost) ignored. Since the time you reported
| > this problem I have changed the DCCP_BUG to a DCCP_WARN, so that the
| > drop will still be logged, but there will be fewer such warnings in the
| > log now (in DCCP_WARN, printk is rate-limited).
| >
| I'm not sure it should print the warning. Ok, it's nice to know that the 
| packet was dropped, but:
| a) in the real world there would be no such information,
| b) it's not unusual for DCCP to lose a packet without warning.
| Maybe it should only warn if a debugging option is enabled?
| 
Ok, that is a good point. What I will do is change the DCCP_WARN() for
this error message to a dccp_pr_debug(), so that the message only gets
printed when DCCP debugging is enabled.


| > | >  * to avoid this, there is a kernel configuration option of CCID-3
| > | >    to set an upper bound for this.
| > |
| > | How do I set it?
| >
| > In the menu under
| >  Networking -> Network Options -> The DCCP Protocol (EXPERIMENTAL)
| >  -> DCCP CCIDs Configuration (EXPERIMENTAL) -> CCID3
| >  -> (100) Use higher bound for nofeedback timer
| > Ah - just remembered -- the default is 100 milliseconds, so this will
| > probably have caught the problems with the low RTT.
| >
| 100ms? Then it doesn't work as expected because adding just 1ms delay with 
| netem fixed the problem.
| If you meant 100us then it still doesn't work. Changing the parameter to 1000 
| delays the bug - I have to send more packets for it to happen.
| 
It is not as simple as that. This value sets a lower bound on the timeout,
in order to cope with very low RTTs (i.e. less than 1 millisecond). What it
changes is when the nofeedback timer is triggered. The timeout is normally
max(4*RTT, 2*s/X), i.e. at least 4 RTTs. But when RTTs are very low, the
nofeedback timer can fire several times between sending two frames (e.g.
with a VoIP inter-frame interval of 20 ms). The configuration value
therefore enforces a lower bound:
	timeout = max(CONFIG_RTO_MIN, max(4*RTT, 2*s/X))
to cope with the problem of low RTTs.
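
If it helps, here is a small userspace sketch of that formula. It is not the
ccid3.c code; the function names and the example numbers are mine, and the
last argument stands in for the configured minimum from the Kconfig option:

/* all times in microseconds, s in bytes, x in bytes/second */
#include <stdio.h>

static unsigned long max_ul(unsigned long a, unsigned long b)
{
	return a > b ? a : b;
}

static unsigned long nofeedback_timeout(unsigned long rtt, unsigned long s,
					 unsigned long x, unsigned long rto_min)
{
	unsigned long t_nfb = max_ul(4 * rtt, 2 * s * 1000000UL / x);

	/* the configured value acts as a floor, so a tiny RTT (e.g. on
	 * loopback) cannot shrink the timeout to almost nothing */
	return max_ul(rto_min, t_nfb);
}

int main(void)
{
	/* 48 usec RTT as measured over lo, 1500 byte packets,
	 * X = 125000 bytes/s (1 Mbit/s), 100 ms configured minimum */
	printf("timeout = %lu usec\n",
	       nofeedback_timeout(48, 1500, 125000, 100 * 1000));
	return 0;
}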

You are calling this a bug -- I think it is quite likely that CCID-3 is simply
not meant for the way you are using it. Please see below.


| > So from what you wrote I read
| >  * without additional delay the described problem occurs and CCID-3
| >    gets into 1-packet-per-64-seconds mode
| >  * when you add 1millisecond delay to the interface then it works ok.
| >
| That's exactly what I meant. After adding this 1ms delay I was not able to 
| reproduce the bug. This does not mean it would not happen after long enough 
| testing.
| 
Thanks for providing the dccp_probe data. I have had a look at it, and it
is more or less as expected. The RTT values are:

   min: 3 usec,  avg: 48.0 usec,  max: 4.513 msec;  stddev 324.68.

The loss rate was 0 the whole time; the receiver reported a rate of about
1.8 kbps (230 bytes per second).

The start-up behaviour is a spike in the RTT, which reaches its maximum
(4.5 msec) right at the beginning; this fades out over the first 5 seconds,
after which the RTT settles at the average of 48 microseconds. That is about
5..10 times lower than a standard PC RTT (250..500 microseconds).

Now I have a question: adding 1 ms delay at the interface avoided the
hang-up, but what do you mean by "long enough"? Is this 60 seconds?

I have attached a plot of the t_ipi, which indeed climbs to astronomical
values after the initial period (when the RTT was low).

But since the average RTT is 48 microseconds in the `outfile', I am
assuming that the outfile was produced when the delay was not added to
the interface?

There is a long-standing problem (at least 1 year) with regard to CCID-3
over high-speed networks (and the lo interface is in fact a high-speed
interface): high link speeds are outside the control region of CCID-3.

The peak limit of controllable speed is about 12 Mbps. Everything higher
than that will cause problems. I have not tested this, but you may also be
able to silence this behaviour by using a netem token-bucket filter with
e.g. a maximum bitrate of <= 10 Mbps.
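
As a rough back-of-envelope for where the ~12 Mbps figure comes from (this
is my own estimate, assuming 1500-byte packets and roughly 1 ms of
timer/scheduling granularity, so about 1000 packets per second at most):

#include <stdio.h>

int main(void)
{
	const double s_bits = 1500 * 8;	/* assumed packet size in bits     */
	const double t_gran = 1e-3;	/* assumed ~1 ms timer granularity */

	/* smallest enforceable inter-packet gap ~ 1 ms -> ~1000 pkt/s */
	printf("max controllable rate ~ %.1f Mbit/s\n",
	       s_bits / t_gran / 1e6);
	return 0;
}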


So I think the least we need to do is put a warning into the DCCP Wiki
that people should add interface delay when using CCID-3 for testing
over loopback.

Attachment: t_ipi.png
Description: PNG image

