Hi Vimal,

I sense this would be a nice discussion for everybody @ kernelnewbies, so let's keep it moving :)

On Sat, Sep 4, 2010 at 14:32, Vimal <j.vimal@xxxxxxxxx> wrote:
> Sure. In fact, we removed our modifications and narrowed down the
> crash to the following patches:
> http://thread.gmane.org/gmane.linux.kernel/979066.

Hmmm, "CFS bandwidth control"... further reading reveals it tries to
implement a hard limit on CPU time. Please CMIIW...

> More specifically, the bug is in patches 3 and 4, since those are the
> only ones that deal with enqueuing/dequeuing of tasks.

IMO you already did a good job of finding the needle in the haystack.
OK, so let's assume (temporarily) that it's due to {en,de}queueing. The
first thing that crosses my mind: it must be something that isn't done
atomically or isn't protected by locks... or something that wasn't
designed with large-scale hardware in mind (read: multi-processor or
multi-core). Or maybe it's about task migration between CPUs.

> The patches (not written by us) provide a bandwidth mechanism for the
> CFS scheduler wherein a task group can be restricted to some
> percentage of CPU time (i.e., rate limited).

Nice summary; I reached the same conclusion.

> We have several observations:
>
> * The patched kernel crashes on an Intel Core i7 (8 threads, 12GB
>   RAM), at random times, when saturating all 8 cores with
>   CPU-intensive but rate-limited processes.
> * The patched kernel hasn't crashed, yet, on an Intel Xeon (2 threads,
>   2GB RAM, 64-bit).

Likely the patch's bug is a corner case, something nobody thought to
anticipate. But it could be the other way around: the patch merely
exposes a bug in the kernel itself. BFS (Con Kolivas' scheduler)
sometimes shows more or less the same thing.

> * The time to crash is longer when we hot-unplugged 6 out of 8 threads
>   on the Core i7 machine.

w00t? OK, so we can conclude that fewer threads means a better
situation, am I right?
> * The crash happens (within 10 hours) only if we compile the kernel
>   with HZ=1000.

I wonder why a higher tick frequency contributes to this issue...
something is fishy with the time-slice accounting, or with the way the
HPET/PIT is (re)programmed for the next timer shot.

> A tickless kernel gives rise to other problems, wherein a "throttled"
> task took a long time to be dequeued. htop showed that the task's
> status was R(unning), but the process's CPU exec time didn't change
> and it also didn't respond immediately to SIGKILL. It did respond,
> after a "long" (variable) time.

AFAIK, tickless relies on the HPET for high-precision timer shots, so
that might confirm my suspicion above. It responds after a "long" time?
Aha... a signal is handled when context switches back from interrupt to
kernel or user mode; IIRC delivery is fastest when full preemption is
enabled. So far, then: a time-slicing bug + buggy HPET reprogramming +
buggy enqueueing(?).

> I could explain in detail what tests we conducted, if that's useful.

Personally, I think it would be nice (and I welcome it) if you shared
that.

> It was mainly starting and stopping a lot of CPU-intensive (while(1);)
> tasks that were rate limited.
>
> (A throttled task is one that has been dequeued since it has consumed
> more CPU time than it was allotted.)
>
> Our hunch is that it's a race condition / deadlock somewhere. We fear
> that the race condition might not occur / might take longer to surface
> if we run it on an emulator, given our observations.

Sure, an emulator serializes things; it hardly does true
multiprocessing, so running under an emulator might yield a very
different result. Still, with Qemu-KVM it might be worth a shot. BTW,
could these scheduler patches be adapted to the User Mode Linux
architecture? If so, IMHO that could be a more promising platform for
debugging in this case.
> We don't mind hitting the reset button every time it hangs, but if
> you're suggesting that there's no way to debug the scheduler on a live
> machine, then I guess qemu might be the only option. :(

My knowledge is limited, so you're free to give your own point of view
here. IMHO, we're really dealing with a corner-case logic flaw,
something that can be very hard to reproduce. I suggest running very
heavy multithreaded stress tests over and over again and trying to find
the pattern. I am sure it will eventually be found; it just takes time.

--
regards,

Mulyadi Santosa
Freelance Linux trainer and consultant
blog: the-hydra.blogspot.com
training: mulyaditraining.blogspot.com