Hi Mulyadi,

On 3 September 2010 21:23, Mulyadi Santosa <mulyadi.santosa@xxxxxxxxx> wrote:
> On Sat, Sep 4, 2010 at 07:07, Vimal <j.vimal@xxxxxxxxx> wrote:
>> Hi all,
>>
>> We're making some modifications to the scheduler and the kernel
>> (2.6.35) just crashes without any error whatsoever. The crash is such
>> that the kernel responds to pings for a while; but the mouse doesn't
>> work, screen doesn't refresh and we're not able to ssh as well.
>
> Could we see the code somewhere? it's hard to judge just by reading
> the "raw facts" you wrote here.

Sure. In fact, we removed our modifications and narrowed down the
crash to the following patches:
http://thread.gmane.org/gmane.linux.kernel/979066.

More specifically, the bug is in patches 3 and 4, since those are the
only ones that deal with enqueuing/dequeuing of tasks. We also
verified that this was the case.

The patches (not written by us) provide a bandwidth mechanism for the
CFS scheduler, wherein a task group can be restricted to some
percentage of CPU time (i.e., rate limited).

We have several observations:

* The patched kernel crashes on an Intel Core i7 (8 threads, 12GB
  RAM), at random times, when saturating all 8 threads with CPU
  intensive, but rate-limited, processes.

* The patched kernel hasn't crashed, yet, on an Intel Xeon (2 threads,
  2GB RAM, 64-bit).

* The time to crash is longer when we hot-unplug 6 out of 8 threads
  on the Core i7 machine.

* The crash happens (within 10 hours) only if we compile the kernel
  with HZ=1000. A tickless kernel gives rise to other problems,
  wherein a "throttled" task took a long time to be dequeued. htop
  showed that the task's status was R(unning), but the process's CPU
  exec time didn't change and it also didn't respond immediately to
  SIGKILL. It did respond, after a "long" (variable) time.

I could explain in detail what tests we conducted, if that's useful.
It was mainly starting and stopping a lot of CPU intensive
(while(1);) tasks that were rate limited.

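For reference, here is a minimal sketch of that kind of start/stop
stress test. It is not the actual test harness: the function names and
the STRESS_MAIN macro are made up for illustration. It also does not
apply any rate limit itself; it assumes it is launched inside a task
group that already has the bandwidth limit from the patches configured.

```c
#include <assert.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork up to n children that each spin forever; returns how many started. */
int spawn_spinners(int n, pid_t *pids)
{
    int started = 0;

    for (int i = 0; i < n; i++) {
        pid_t pid = fork();

        if (pid < 0) {
            perror("fork");
            break;
        }
        if (pid == 0) {
            /* Child: CPU-bound busy loop, the while(1); task. */
            volatile unsigned long spin = 0;
            while (1)
                spin++;
        }
        pids[started++] = pid;
    }
    return started;
}

/* SIGKILL every child, then reap them; returns how many were reaped. */
int stop_spinners(int n, const pid_t *pids)
{
    int reaped = 0;

    for (int i = 0; i < n; i++)
        kill(pids[i], SIGKILL);
    for (int i = 0; i < n; i++)
        if (waitpid(pids[i], NULL, 0) == pids[i])
            reaped++;
    return reaped;
}

/* One cycle: start ntasks spinners, let them run, then stop them all. */
int run_cycle(int ntasks, unsigned int seconds)
{
    assert(ntasks > 0);

    pid_t *pids = calloc((size_t)ntasks, sizeof(*pids));
    if (!pids)
        return -1;

    int started = spawn_spinners(ntasks, pids);
    sleep(seconds);
    int reaped = stop_spinners(started, pids);

    free(pids);
    return reaped;
}

#ifdef STRESS_MAIN
int main(void)
{
    /* Hypothetical parameters: 8 spinners, 5 seconds per cycle, forever. */
    for (;;)
        printf("reaped %d tasks\n", run_cycle(8, 5));
}
#endif
```

Built with -DSTRESS_MAIN this gives a standalone binary that keeps
cycling. A waitpid() that takes unusually long to return would
correspond to the slow SIGKILL response mentioned above.
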
(A throttled task is one that has been dequeued since it has consumed
more CPU time than it was allotted.)

Our hunch is that it's a race condition / deadlock somewhere. We fear
that the race condition might not occur/might take longer to surface
if we run it on an emulator, given our observations. We're in touch
with the authors of the patch and are trying to work through it.

>
> But IMO, as long you just deal with the time slice calculation
> algorithm, it shouldn't introduce trouble...but once you touch other
> things like for example you put tasks into multi queue, changing the
> way a task is kicked out of the running queue etc, it might be the
> cause.

The patches play around with enqueuing/dequeuing tasks.

>
> NB: you need full system emulator I guess, like Qemu...put your kernel
> there...build it as debuggable kernel..and hook gdb into Qemu's gdb
> stub. It is not 100% identical to real machine behaviour...but at
> least you don't need to kick your reset button everytime it hangs...

We don't mind hitting the reset button every time it hangs, but if
you're suggesting that there's no way to debug the scheduler on a live
machine, then I guess qemu might be the only option. :(

Thanks a lot!
--
Vimal