On Thu, 11 Sep 2008, Dâniel Fraga wrote:

> On Thu, 11 Sep 2008 16:44:20 +0300 (EEST)
> "Ilpo Järvinen" <ilpo.jarvinen@xxxxxxxxxxx> wrote:
>
> > ...I guess it would be possible to remove SCHED_FEAT_HRTICK from
> > /proc/sys/kernel/sched_features then while keeping the hrtimers
> > otherwise enabled to test this.
> >
> > It's possible that hrtimers just affect on how easy it is to trigger
> > but at least it seems an useful lead until proven otherwise.
>
> You're right Ilpo. After days and days without the problem,
> today it triggered (but I wasn't online at the time, so I couldn't grab
> any data).

Thanks. Once we know what the userspace at the server is doing, it might
make the problem immediately obvious, though I'm a bit afraid that e.g.
strace might interfere with the problem so that it resolves right away
and we're again left with nothing...

> So, you're correct. HRtimers just affect on how easy it is to
> trigger the issue. In other words: with high resolution timer enabled,
> the problem appears more frequently.
>
> At least if we discovered a way how to trigger this, we could
> test it more easily. The problem is to wait a long time for it to
> happen.
>
> Just a curiosity: on your servers,

I don't really have anything I would call a "server" in the sense you
mean. I might occasionally set one up for testing for a very limited
period, but normally it's just ssh and some other services which I use
so rarely that I'd hardly notice, and that's it. I was planning, however,
to set up a distcc stress test some day using all my spare CPU cycles
(I'd like to put it under kvm, but that got stalled due to some timing
issue in the guest making it go into an infinite loop). Once I get that
working, I could probably easily put other test-only stuff into that
framework as well.

But there are other people around the world besides us :-), and afaict
this is the only (outstanding) report which relates to accept() ceasing,
so I doubt it's a very regularly occurring thing or we would have heard
of it.

> do you use x86_64?

At least on some machines, but as you have discovered it seems to be
service dependent, so that some processes never get stuck. I might only
be running such processes, who knows...

> It seems
> this problem is very specific to x86_64 or appear more often on x86_64
> than x86_32. It never happens on my x86_32 bit servers.

Ok.

-- 
 i.