| 2.1: I still don't understand all of this discussion. One concrete
| issue with the "fifth problem":
|
|   To the receiver this condition will look as if the inter-packet gap
|   suddenly doubled, meaning it will use samples of twice the actual
|   RTT.
|
| I don't see why. Say X before the loss event is 8 packets/RTT, and
| after it is 4 packets/RTT, and RTT=1s. Here are the window counters before:
|
|   time   wctr
|   0.000  1
|   0.125  1
|   0.250  2
|   0.375  2
|   0.500  3
|   ...
|   1.000  5
|   1.125  5
|
| After:
|
|   time   wctr
|   6.000  1
|   6.250  2
|   6.500  3
|
| So where do you get that "inter-packet gap doubling causes samples of
| twice the actual RTT"? You are NOT supposed to use the inter-packet gap
| to calculate the RTT. You are supposed to use WINDOW COUNTERS plus
| inter-packet gaps. And the window counters have, correctly, been
| updated to the new sending rate: the quantity (average interpacket
| spacing / average wctr delta), which should equal R/4, has remained the
| same, namely the correct value of 0.25s.
|

Thank you for this counter-example. I think that Michael, Gorry, and you
are right -- there should not be room for speculation in the draft, and
causes for this behaviour should be more carefully tested and analysed -
perhaps as a separate research problem.

Below I'll try my best to recollect facts, hopefully you can help to rule
out a few more factors.

Meanwhile, regarding the draft, I think it is best to leave out speculation
and to follow Gorry's advice in his dccp@ietf posting of 20th November:

  "In summary, as regards the ID, I think we should say less on the
   specifics, and simply indicate that a bad RTT can result in odd
   behaviour for various reasons."
  http://www.ietf.org/mail-archive/web/dccp/current/msg03762.html

I still hope to find an explanation for this behaviour - in the above
posting interactions with other mechanisms were mentioned (QoS,
load-balancing, mobility, ...), that might trigger the same conditions and
behaviour.

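To make the counter-example above concrete, here is a minimal sketch in
Python (an illustration only, not the kernel code) that computes the quoted
quantity, average interpacket spacing / average wctr delta, for both tables.
The function name, the sample lists, and the modulus of 16 (assuming a 4-bit
window counter) are my own assumptions for the illustration.

```python
WCTR_MODULUS = 16  # assumption: 4-bit window counter, wraps at 16


def quarter_rtt_estimate(samples):
    """Estimate RTT/4 from (arrival_time, window_counter) pairs.

    Total elapsed time divided by total window-counter advance, which is
    the same as (average interpacket spacing) / (average wctr delta).
    """
    elapsed = samples[-1][0] - samples[0][0]
    advance = 0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # Modular difference so that a counter wrap-around is handled.
        advance += (cur - prev) % WCTR_MODULUS
    if advance == 0:
        raise ValueError("window counter did not advance, no estimate possible")
    return elapsed / advance


# Values taken from the two tables in the quoted text (RTT = 1s).
before = [(0.000, 1), (0.125, 1), (0.250, 2), (0.375, 2), (0.500, 3)]
after = [(6.000, 1), (6.250, 2), (6.500, 3)]

print(quarter_rtt_estimate(before))  # 0.25
print(quarter_rtt_estimate(after))   # 0.25
```

Both cases give 0.25s = R/4, so for these numbers the halved sending rate
alone does not change the window-counter-based estimate.
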
First, I checked again whether there is a bug in the implementation:
 * samples are accepted if 1 <= CCVal difference <= 4, or
 * if 4 < CCVal difference and RTT_estimate/2 < sample < RTT_estimate
   (this is an optimisation in order to get more samples: if the RTT is
   low, many packets will be sent with a CCVal difference of 5);
 * the implementation looks correct and has been in use for several years.

This clarified the log messages seen during the outage:

> Jul 15 22:01:26 kernel: [ 2311.949466] dccp_sane_rtt: RTT sample 4766615 out of bounds!
> Jul 15 22:01:39 kernel: [ 2324.335916] dccp_sane_rtt: RTT sample 12373169 out of bounds!
> Jul 15 22:02:11 kernel: [ 2356.548447] dccp_sane_rtt: RTT sample 32193564 out of bounds!
> Jul 15 22:03:15 kernel: [ 2420.760223] dccp_sane_rtt: RTT sample 64201733 out of bounds!

==> The message is printed only if the CCVal difference is not 0, and the
    sample is the time interval since receiving the last packet.

One possible cause could be reverse-path loss of feedback packets, causing
halving of X.

Five months after the event and with little data, the only way to know for
sure is to repeat the measurements. Here is the setup as far as I recall:
 * the access point was a D-Link System DWL-G122 802.11g USB Adapter
   (ralink rt73),
 * channel in the 2412 - 2467 MHz range (hostapd.conf set to channel 12),
 * the client laptop used an Intel 3945ABG (iwl3945, 802.11g),
 * the distance between access point and client was less than 3 meters,
 * RTS/CTS were set to 'off',
 * the 'retry' parameter for MAC retransmissions was set to 3-7,
 * the average link RTT was 2msec,
 * 2-3 competing access points and roaming clients in the neighbourhood -
   not sure about microwave ovens, cellular phones (bluetooth), DECT,
 * from TCP wireshark traces it is clear that there was interference on
   the channel (dupAcks and retransmitted packets).

Gerrit