Re: [PATCH] tcp FRTO: in-order-only "TCP proxy" fragility workaround

"Ilpo Järvinen" <ilpo.jarvinen@xxxxxxxxxxx> · Mon, 22 Sep 2008 14:22:12 +0300 (EEST)

On Mon, 22 Sep 2008, Dâniel Fraga wrote:

> On Fri, 19 Sep 2008 00:04:23 +0300 (EEST)
> "Ilpo Järvinen" <ilpo.jarvinen@xxxxxxxxxxx> wrote:
> 
> > Anyway, if/when you succeed in collecting some strace of the server 
> > processes, please let me know (though making a full one available might 
> > not be a wise thing, like I said earlier). After thinking about it a bit, 
> > it might be enough to start the strace with -p for all server processes 
> > of a service during a stall and then resolve it after some amount of 
> > waiting with nmap (and hope that strace doesn't resolve it by interfering 
> > with something relevant :-), you will see that from the fact that it 
> > resolves without nmap then). That would probably reveal if the processes 
> > were waiting in accept() or not, and if not, where they were.
> 
> 	Hi again Ilpo, I waited the whole day for a stall, and
> fortunately it happened while I was stracing dovecot and child
> processes. The stall happened at 01:11 (at the end). I hope that it
> has something useful.

It definitely shows a stall: there are _no_ events between 0:53 and 1:11, 
while there isn't any other period like that; every other minute since the 
start has some activity going on :-). So this might not be related to 
networking at all, like we've kind of already figured out (accept() 
definitely has very little to do with this). There weren't any close()s 
there either, so it looks very stuck on something that's outside of the 
syscalls we listed in -e, I suppose...
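
A gap like that can also be picked out of a -t/-tt log mechanically. A 
rough sketch only: strace_gaps is a name made up for illustration, the 60 
second default threshold is arbitrary, and it assumes the timestamp is the 
first or second field of each line (the latter when a PID column is 
present). Gaps that span midnight are not handled.

```shell
# Print gaps of at least MIN seconds between consecutive timestamped
# lines of an strace log taken with -t or -tt.
# Usage: strace_gaps LOGFILE [MIN]   (hypothetical helper, MIN defaults to 60)
strace_gaps() {
    awk -v min="${2:-60}" '
    {
        # The timestamp is field 1, or field 2 if a PID column comes first.
        ts = ""
        for (i = 1; i <= 2 && i <= NF; i++)
            if ($i ~ /^[0-9][0-9]:[0-9][0-9]:[0-9][0-9]/) { ts = $i; break }
        if (ts == "") next

        # Convert HH:MM:SS[.usec] to seconds since midnight.
        split(ts, t, ":")
        now = t[1] * 3600 + t[2] * 60 + t[3]

        if (seen && now - prev >= min)
            printf "%s -> %s (%d s)\n", prev_ts, ts, now - prev
        prev = now; prev_ts = ts; seen = 1
    }' "$1"
}
```

E.g., "strace_gaps dovecot.strace 300" would list only the silent periods 
of five minutes or more (the log file name is just an example).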

It seems that the next sensible step is to just obtain a full strace to see 
what actually took place during those long minutes, if anything (it's 
better that you keep that log private and just grep over it on request). 
...A full strace might grow huge though. Also, for strace use -tt instead 
of -t to get microsecond-resolution timestamps, and add -T to show the time 
spent in each syscall.
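
Roughly, attaching to all processes of the service could look like the 
following. This is a sketch under assumptions: pid_args is a helper made up 
for illustration, the log path is hypothetical, and it assumes pgrep is 
available and that the processes match the name dovecot.

```shell
# Turn a list of PIDs on stdin (one per line) into "-p PID" arguments.
# pid_args is a hypothetical helper, not part of strace.
pid_args() {
    while read -r pid; do
        if [ -n "$pid" ]; then
            printf -- '-p %s ' "$pid"
        fi
    done
}

# Usage during a stall (needs root; service name and log path are assumptions):
#   strace -f -tt -T -o /root/dovecot-full.strace $(pgrep dovecot | pid_args)
# -f follows children, -o writes the trace to a file instead of stderr.
```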

When you get the stall next time, please also check that the processes are 
actually sleeping instead of looping like crazy in some buggy userspace 
code :-) (obviously before resolving it with nmap).
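
One way to check that, assuming a Linux /proc; proc_state is a name made up 
for illustration and the process name dovecot is an assumption:

```shell
# Print the kernel's one-letter state for a PID (Linux /proc assumed):
# S = sleeping, D = uninterruptible sleep, R = running or runnable.
proc_state() {
    awk '/^State:/ { print $2 }' "/proc/$1/status"
}

# Usage: show the state of every dovecot process (process name assumed).
#   for p in $(pgrep dovecot); do echo "$p $(proc_state "$p")"; done
```

A process that stays in R and keeps accumulating CPU time is probably 
looping; S or D means it is sleeping in the kernel.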

When using nmap to resolve it, take note of the exact timestamp (including 
seconds). E.g., 
$ date > nmap.ts; nmap ...

-- 
 i.