On 4 September 2010 08:44, Mulyadi Santosa <mulyadi.santosa@xxxxxxxxx> wrote:
> Hi Vimal...
>
> I sense it would be a nice discussion for everybody @
> kernelnewbies...so let's keep it on move :)

Sure. Sounds good. :)

> Hmmm "CFS bandwith control"....and further reading reveal it tries to
> implement CPU time hard limit. Please CMIIW....

You're right.

> First thing that crosses my mind, it must be something about something
> not done atomically or protected by locks....or .... something isn't
> designed with quite large scalability in mind (read: multi processor
> or multi core).
>
> or maybe..it's about task migration between CPUs...

Yes, we thought about that as well. Let me explain what happens in
detail.

A task group is rate-limited by setting a period/quota pair, written
via the cgroup file system. This means the task group cannot get more
than "quota" amount of CPU time in any one "period". We tested with
quota/period = 100ms/500ms.

It's implemented like a token bucket (the structs are defined in
sched.c):

* When a task group's quota/period are set, a timer is started that
  fires every "period".
* When the timer fires, it refreshes the quota and unthrottles tasks
  if needed. The code is do_sched_cfs_period_timer() in sched_fair.c.

So where does the throttling happen? The scheduler tick accounts the
CPU usage of all running processes: it calls account_cfs_rq_quota()
in sched_fair.c, which throttles the runqueue if the quota is
exhausted. With HZ=1000 this happens 1000 times a second and leads to
very accurate accounting. On a tickless kernel, however, this
accounting runs only when some interrupt/system event happens, so the
bandwidth mechanism isn't as accurate, and we observed exactly that
(see Expt 1, later in this mail). A toy sketch of this accounting
flow is included further down, just before Expt 1.

> Likely, the patch's bug is a corner case...something that hadn't been
> thought to be anticipated. But it could be the other way around: it
> shows a bug in kernel.

Yes, we did consider this possibility, but it is really hard to see
where things go wrong. I'll try the qemu+KVM method and see whether
the problem shows up there.

>> * The time to crash is longer when we hot-unplugged 6 out of 8 threads
>> on the core i7 machine.
>
> w00t? ok...so, we can conclude that fewer threads means better
> situation, am I right?

Yes, that's right, and the bug seems very hard to reproduce. The Intel
Xeon machine hasn't crashed in 10 hours, but the core i7 machine
crashed within 10 hours even when I ran it with just 2 cores (cpu0 and
cpu1; the rest were unplugged).

>> * The crash happens (within 10 hours) only if we compile the kernel
>> with HZ=1000.
>
> Wonder why higher tick frequents contributes to this issue...something
> is fishy with the time slice accounting....or the way HPET/PIT is
> (re)programmed to do another time shot.

We tried a kernel with HZ=1000 but without the CFS bandwidth patches,
and it hasn't crashed. Also, the time slice accounting is done by the
core scheduler and is not touched by the bandwidth patches, so I would
guess that's not where the problem is.

>> A tickless kernel gives rise to other problems, wherein
>> a "throttled" task took a long time to be dequeued. htop showed that
>> the task's status was R(unning), but the process's CPU exec time
>> didn't change and it also didn't respond immediately to SIGKILL. It
>> did respond, after a "long" (variable) time.
>
> AFAIK, tickless relies on HPET to do high precision time shot, so it
> might confirm my above suspicion.

Yep, we're aware of the HPET.
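To make that flow concrete before getting into the experiments, here
is a toy userspace model of the token bucket described above. This is
only a sketch, not the patch code: the struct and function names
below are made up, though they loosely mirror what
do_sched_cfs_period_timer() and account_cfs_rq_quota() do.

/* Toy model: a timer refills the quota every "period"; each
 * accounting tick charges runtime and throttles the group once the
 * quota is gone.  It compiles and runs standalone, and just prints
 * when the simulated group gets throttled/unthrottled. */
#include <stdio.h>

struct toy_cfs_rq {
	long quota_us;		/* refilled every period */
	long remaining_us;	/* left in the current period */
	int throttled;
};

/* Plays the role of the period timer: refill and unthrottle. */
static void period_timer(struct toy_cfs_rq *rq)
{
	rq->remaining_us = rq->quota_us;
	if (rq->throttled) {
		rq->throttled = 0;	/* re-enqueue the group's tasks */
		printf("unthrottled\n");
	}
}

/* Plays the role of the per-tick accounting. */
static void account_tick(struct toy_cfs_rq *rq, long delta_us)
{
	if (rq->throttled)
		return;
	rq->remaining_us -= delta_us;
	if (rq->remaining_us <= 0) {
		rq->throttled = 1;	/* dequeue the group's entities */
		printf("throttled, overran by %ld us\n", -rq->remaining_us);
	}
}

int main(void)
{
	/* quota/period = 100ms/500ms, as in our tests */
	struct toy_cfs_rq rq = { .quota_us = 100000, .remaining_us = 100000 };
	long tick_us = 1000;	/* HZ=1000: 1ms between accounting calls */

	/* Simulate one second of wall time.  On a tickless kernel the
	 * effective delta between accounting calls can be much larger
	 * than 1ms, which is where the accuracy loss comes from. */
	for (long t = tick_us; t <= 1000000; t += tick_us) {
		if (t % 500000 == 0)
			period_timer(&rq);
		account_tick(&rq, tick_us);
	}
	return 0;
}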
But what we observed is this:

Expt 1
~~~~~~
Setup: 2.6.35, Intel Core i7, 12G RAM, all 8 threads enabled.

Start a simple while(1){} task with quota/period = 100ms/500ms.
Recap: this means that in every 500ms the task should not get more
than 100ms of CPU time.

However, when we looked at the CPU usage, it was happening in
"bursts". The task consumed 1s of CPU time (as shown by htop) and was
then dequeued for roughly 4 seconds. This confirms our suspicion about
the accuracy of the throttle/unthrottle path. On a HZ=1000 kernel,
since the task accounting happens at a finer granularity (by
definition, every 1ms), we didn't observe this artifact. The authors
are aware of this and are trying to debug this issue as well.

> It respons after "long" time? aha.... signal is handled when a context
> is switching from interrupt to kernel or user mode, IIRC the fastest
> is when you enabled full preemption.

Yes, the "long" time is particularly vague. In all our experiments, we
used a program that does the following:

1. Start a timer to interrupt the program in 10 seconds (setitimer +
   ITIMER_REAL)
2. while(1) {}
3. When the timer fires, output the user exec time and the real time
   that passed.

For quota/period = 100ms/500ms, the user exec time should be around 2s
and the real time should be 10s. On a tickless kernel, we found the
reported "real time" to be as high as 300 seconds! On a HZ=1000
kernel, it was fine, except for the crash. :)

Attached is the code (a minimal sketch of it is also included at the
end of this mail, for archive readers). The program is invoked as

$ ./cpu-stress seconds microseconds

which sets a timer for the given duration.

>> I could explain in detail what tests we conducted, if that's useful.
>
> personally, i think it would be nice (and I welcome it) if you share it...

Sure. We stress test as follows (a rough C equivalent of this loop is
sketched further down):

repeat 50 times:
    for N in [2, 6, 10, 20, 40, 80, 160]:
        repeat 10 times:
            start N cpu-stress programs with 10 second timers
            (./cpu-stress 10 0)

This should take 50 * 7 * 10 * 10 = 35000 seconds to complete, i.e.,
about 10 hours.

We also tried manually pinning the programs to CPUs in a round-robin
fashion, to see if it's a problem with task migration. It *still*
crashed. :(

> sure..emulator is serializing things...it hardly does true multi
> processing... so using emulator might yield very different result. But
> still, maybe with Qemu-KVM, it still worth a shot.

I'll try that. Thanks!

> BTW, does this scheduler patches could be adapted to User Mode Linux
> architecture? if it can, IMHO it could be more promising platform for
> debugging purpose in this case.

AFAIK, UML models tasks as user processes, and the scheduling is
handled by the host kernel itself.

> IMHO, we're really dealing with corner case of logic flaws...something
> that sometimes is hardly reproduced. I suggest to do very high load
> multithreaded stress testing over and over again and try to find the
> patern. I am sure eventually it could be found...only it takes time.

Yes, thanks a lot! We've been trying hard.
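(For completeness, here is a rough C equivalent of the stress loop
above. We actually drive it from a script, so treat this as a sketch;
the ./cpu-stress path and the minimal error handling are assumptions
made for brevity.)

#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	int counts[] = { 2, 6, 10, 20, 40, 80, 160 };
	int ncounts = sizeof(counts) / sizeof(counts[0]);

	for (int rep = 0; rep < 50; rep++) {
		for (int i = 0; i < ncounts; i++) {
			for (int inner = 0; inner < 10; inner++) {
				/* start N copies of cpu-stress, each with
				 * a 10 second timer */
				for (int n = 0; n < counts[i]; n++) {
					pid_t pid = fork();
					if (pid == 0) {
						execl("./cpu-stress",
						      "cpu-stress", "10", "0",
						      (char *)NULL);
						perror("execl");
						_exit(1);
					}
					if (pid < 0)
						perror("fork");
				}
				/* wait for the whole batch (~10 seconds) */
				while (wait(NULL) > 0)
					;
			}
		}
	}
	return 0;
}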
Our hunch is that the following functions in sched_fair.c cause
trouble:

static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
static void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)

For example:

static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
{
	struct sched_entity *se;

	se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];

	account_hier_tasks(se, -cfs_rq->h_nr_tasks);
	for_each_sched_entity(se) {
		struct cfs_rq *cfs_rq = cfs_rq_of(se);

		dequeue_entity(cfs_rq, se, 1);
		if (cfs_rq->load.weight || cfs_rq_throttled(cfs_rq))
			/* ^^ this line:
			 * cfs_rq_throttled(cfs_rq) returns cfs_rq->throttled,
			 * which is only set to 1 at the end of this function,
			 * after the loop.  So it means that this cfs_rq might
			 * be touched somewhere else too. */
			break;
	}

	cfs_rq->throttled = 1;
	cfs_rq->throttled_timestamp = rq_of(cfs_rq)->clock;
}

Thanks,
-- Vimal
Attachment:
cpu-stress.c
Description: Binary data
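(Since the attachment shows up as binary data in the archive, here is
a minimal sketch of what cpu-stress.c does, based only on the
description in the mail above. The real attachment may differ; the
output format and variable names here are assumptions.)

#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <sys/time.h>
#include <sys/resource.h>

static volatile sig_atomic_t done;

static void on_alarm(int sig)
{
	(void)sig;
	done = 1;
}

int main(int argc, char *argv[])
{
	struct itimerval it = { 0 };
	struct timeval start, end;
	struct rusage ru;

	if (argc != 3) {
		fprintf(stderr, "usage: %s seconds microseconds\n", argv[0]);
		return 1;
	}

	/* arm a wall-clock timer for the given duration */
	it.it_value.tv_sec = atol(argv[1]);
	it.it_value.tv_usec = atol(argv[2]);

	signal(SIGALRM, on_alarm);
	gettimeofday(&start, NULL);
	setitimer(ITIMER_REAL, &it, NULL);

	/* burn CPU until the timer fires */
	while (!done)
		;

	gettimeofday(&end, NULL);
	getrusage(RUSAGE_SELF, &ru);

	printf("user time: %.3f s\n",
	       ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6);
	printf("real time: %.3f s\n",
	       (end.tv_sec - start.tv_sec) +
	       (end.tv_usec - start.tv_usec) / 1e6);
	return 0;
}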