On Fri, 8 Aug 2008, Bill Fink wrote:

> On Thu, 7 Aug 2008, Ilpo Järvinen wrote:
>
> > On Wed, 6 Aug 2008, Dâniel Fraga wrote:
> >
> > > On Thu, 31 Jul 2008 15:47:55 +0200
> > > Thomas Jarosch <thomas.jarosch@xxxxxxxxxxxxx> wrote:
> > >
> > > > If your problem is really FRTO related (that what the patch is for),
> > > > you could try to disable FRTO temporarily:
> > >
> > > Hi, the patch helped, but what's the conclusion? Is the problem
> > > "solved"? Will this patch be merged in the next kernel? This thread
> > > seems to be forgotten.
> >
> > ...Dave, I think we should probably put this FRTO work-around to net-2.6
> > and -stable to remain somewhat robust (it's currently worked around only
> > for newreno anyway). ...But I leave the final decision up to you.
>
> Since you suspect the problem is being caused by a broken middlebox,

It seems very likely. Any split-TCPish approach that tries to hide some
of the losses that would happen on access links could cause this, though
it's very stupid to put such a box there when there's a physical wire
rather than wireless. And even with wireless the given configuration is
not going to help but make things worse :-), the box is plain stupid as
is (I guess it's deployed because some marketing guy has convinced some
clueless whoever that they need the box :-)).

In theory it could be at the receiver below the TCP layer too, but it's
quite unlikely that an smtp server would run on such a stack. And even
then it's a kind of middlebox, as TCP works end-to-end (not end host to
end host) while the rest remains a black box to it, even if something is
performed on the very same host below the TCP layer. An even less likely
thing is that the TCP receiver itself would do this, and that wouldn't
explain the pacing of ACKs at all. ...It would be at least a kind of
twisting of the specs, if not out-of-spec somewhere.

> would it perhaps be a better approach to add a per-route option to
> allow disabling of FRTO for the given destination.  This would be
> similar to Stephen Hemminger's fix for broken middleboxes that don't
> handle window scaling properly.  It seems this would be better than
> modifying FRTO behavior for everyone else that is being compliant.

Sure, but that still requires some thought. I'll try after the weekend
so that I can think about it a bit more, because there are plenty of
states we can end up in after the first RTO has been detected as
spurious.

It might even be interesting to run CA_Recovery on RTOs when we detect
this kind of middlebox, because the RTOs basically happen due to a lack
of duplicate ACKs. Then we could efficiently use partial ACKs to send
just the lost segments rather than everything, which is what causes
problems after the recovery has finished because we sent at too high a
rate while recovering. Then fall back to CA_Loss if an RTO is triggered
again in CA_Recovery. But I'm not sure it's worth the effort.

> A question then arises is if the bogus scenario has a TCP signature
> that could be used to print a warning message for the unsuspecting
> user so they could then take necessary corrective action.

Probably yes, but I need to add some state. I could probably also make
it switch per flow to a more robust approach on demand once enough
evidence is gathered. ...I think I'll add a 1-bit history counter per
flow so that it's possible to print the warning and switch when there's
a third RTO in a single window (while the first two were found
spurious). IMHO it's unlikely enough that there will be three latency
spikes (each longer than the previous) within a single window to make
the decision. I wouldn't trust two, because hand-overs can take time and
have non-trivial effects.

--
 i.
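
For illustration only, the per-flow detection idea from the last reply (warn and switch on a third RTO within a single window, after the first two were found spurious) could be sketched roughly as below. This is not kernel code: `FlowState`, `on_rto`, `on_rto_spurious` and all field names are hypothetical, a plain counter stands in for the 1-bit history that would be combined with existing F-RTO state, and sequence-number wraparound is ignored.

```python
from dataclasses import dataclass


@dataclass
class FlowState:
    """Hypothetical per-flow bookkeeping (not actual kernel fields)."""
    window_end: int = -1         # snd_nxt recorded at the first RTO of the window
    spurious_rtos: int = 0       # RTOs in this window that were judged spurious
    frto_disabled: bool = False  # True once the flow has switched to the robust mode


def on_rto(flow: FlowState, snd_una: int, snd_nxt: int) -> bool:
    """Handle a retransmission timeout.

    Returns True when the flow should stop trusting spurious-RTO
    detection, i.e. this is at least the third RTO inside one window
    and the two before it were both declared spurious.
    """
    # Everything outstanding at the first RTO has been ACKed, so a new
    # window starts and the history is reset (wraparound ignored).
    if snd_una >= flow.window_end:
        flow.window_end = snd_nxt
        flow.spurious_rtos = 0

    if flow.spurious_rtos >= 2:
        if not flow.frto_disabled:
            print("warning: repeated spurious RTOs in one window, "
                  "possibly a broken middlebox; using conservative recovery")
        flow.frto_disabled = True
    return flow.frto_disabled


def on_rto_spurious(flow: FlowState) -> None:
    """Record that the most recent RTO was declared spurious."""
    flow.spurious_rtos += 1
```

Requiring three RTOs rather than two follows the reasoning above that a hand-over alone can plausibly produce two latency spikes in one window.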