[Hotplug_sig] Re: [lhcs-devel] Automated Hotplug CPU Regression Test Cases

Hi Mark-

On Sat, 2005-02-12 at 07:22 -0800, Mark Wong wrote:
> > > What are the test cases we need?
> > 
> > I think it might be valuable to recount some of the bugs that have been
> > encountered in the past.  Off the top of my head:
> > 
> > 1) tasks not scheduled on a newly onlined cpu
> 
> Just to make sure I understand, can this be done with a cpu that was
> just offlined, as opposed to a system that is brought up with 2 cpus
> and a 3rd one is onlined?

Hmm, are you asking whether we can check for the correct behavior in
both scenarios?  If so, yes.
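
A rough sketch of the newly-onlined-cpu half might look like the below
(cpu1, the count of 8 busy loops, and the 10 second settle time are just
placeholders; the pass condition would be that at least one task reports
the new cpu in the PSR column):

   # offline cpu1, then bring it back online
   echo 0 > /sys/devices/system/cpu/cpu1/online
   echo 1 > /sys/devices/system/cpu/cpu1/online

   # start more busy loops than there are cpus, give the scheduler
   # a moment, then see which cpu each one last ran on (PSR column)
   for i in $(seq 1 8); do ( while :; do :; done ) & done
   sleep 10
   for p in $(jobs -p); do ps -o pid=,psr= -p $p; done
   kill $(jobs -p)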

> > 2) tasks wrongly migrated to a cpu while it is going offline
> > 3) irqs with cpu affinity not migrated from an offlined cpu
> 
> Does the kernel update the cpu affinity itself?  The angle I'm looking
> at this from is how to verify correct behavior.  For example, will we
> only be able to check something in /proc to see that interrupts are
> going to another cpu, or can we also check to see if the affinity mask
> has been updated?

With tasks, the kernel changes the affinity (the task_struct's
cpus_allowed bitmask) only if it has to -- that is, only if the task is
bound to the dead cpu and that cpu only.  See move_task_off_dead_cpu in
kernel/sched.c.
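
So a test could bind a task to the cpu being offlined and nothing else,
then confirm the mask gets rewritten.  Roughly (cpu1 and using the
current shell as the victim task are just examples; taskset is from
util-linux):

   taskset -pc 1 $$         # bind this shell to cpu1, and only cpu1
   echo 0 > /sys/devices/system/cpu/cpu1/online
   taskset -p $$            # mask should no longer be just cpu1
   echo 1 > /sys/devices/system/cpu/cpu1/online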

For irqs, check /proc/irq/xxx/smp_affinity -- the output is a bitmask of
cpus which are allowed to service that interrupt number.  That *should*
be updated, if necessary, but I just tried on a ppc64 system offlining a
cpu with an interrupt bound to it, and the mask is unchanged.  I think
it's a cosmetic bug, though -- I verified that that interrupt was being
serviced by other cpus in /proc/interrupts.  I'll get back to you on
this.  (Heh, just talking about a test suite is helping to find latent
bugs! :)
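
For the suite, something along these lines might do; the irq number is
made up (pick a real device irq from /proc/interrupts), and writing the
mask assumes the architecture lets you set smp_affinity by hand:

   IRQ=21                                  # hypothetical device irq
   echo 2 > /proc/irq/$IRQ/smp_affinity    # bind it to cpu1 (mask 0x2)
   echo 0 > /sys/devices/system/cpu/cpu1/online
   cat /proc/irq/$IRQ/smp_affinity         # has the mask been updated?
   grep "^ *$IRQ:" /proc/interrupts        # which cpus service it now?
   echo 1 > /sys/devices/system/cpu/cpu1/online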

> 
> > 4) mm_struct slab leak (affected only some architectures)
> 
> I'm not familiar with this at all, either with slabs or with how this
> can be verified.

It was a memory leak.  The way it was discovered was to continuously
online and offline cpus, while keeping the system busy with kernel
compiles, for example.  Eventually (after hours and hours) the system
either locked up or OOMed.  So I would say that long-running stress
tests would be beneficial.
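
A minimal version of that stress loop could be something like this
(cpu1, the iteration count, and the choice of load are all just
examples; mm_struct shows up in /proc/slabinfo, so its object counts
can be logged along the way):

   # kick off a kernel compile (or similar load) in the background first
   for i in $(seq 1 10000); do
           echo 0 > /sys/devices/system/cpu/cpu1/online
           sleep 1
           echo 1 > /sys/devices/system/cpu/cpu1/online
           sleep 1
           grep '^mm_struct ' /proc/slabinfo >> slab.log
   done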

> > 5) offlining the last running cpu in the system :)
> 
> Just because I'm mostly ignorant, what's the correct behavior?  For
> example, the attribute to offline goes away, is not changeable, etc.
> Definitely easy to test.  :)

The write to /sys/devices/system/cpu/cpuX/online should return EBUSY.
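
So the check could offline everything except one cpu and make sure the
final write fails.  A sketch, with cpu1 standing in for whichever cpu
is left as the survivor:

   # take every cpu except cpu1 offline first
   for f in /sys/devices/system/cpu/cpu[0-9]*/online; do
           [ "$f" = /sys/devices/system/cpu/cpu1/online ] && continue
           echo 0 > $f
   done
   # this write should fail with EBUSY and cpu1 should stay "1"
   echo 0 > /sys/devices/system/cpu/cpu1/online
   echo "write exit status: $?"
   cat /sys/devices/system/cpu/cpu1/online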
 
> > > 4. Offline a CPU, say cpu1.
> > > 
> > >    echo 0 > /sys/devices/system/cpu/cpu1/online
> > > 
> > >    cpu0 is not hotswappable on some architectures.
> > 
> > Right, and the sysfs entry for cpu0 will not have an online attribute in
> > such cases.
> 
> Is it silly for me to ask whether the lack of an online attribute
> needs to be checked on those particular architectures?

That wouldn't be a bad thing to check IMO.
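
On those architectures the check is just whether the file exists; which
architectures to expect that on would have to be encoded in the test:

   if [ -f /sys/devices/system/cpu/cpu0/online ]; then
           echo "cpu0 has an online attribute (hotpluggable)"
   else
           echo "cpu0 has no online attribute, as expected on this arch"
   fi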

> > 
> > > 5. Do we see any change in /proc/interrupts?
> > 
> > This is architecture-specific behavior iirc.  ppc64 lists only online
> > cpus.
> 
> Ok, so we can still infer correct behavior by seeing the remaining
> cpus pick up interrupts.

Yep.
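
E.g. sample /proc/interrupts across the offline and check that the
surviving cpus' counts keep climbing -- roughly (the sleep and the diff
are just one way to do the comparison):

   cat /proc/interrupts > before
   echo 0 > /sys/devices/system/cpu/cpu1/online
   sleep 10
   cat /proc/interrupts > after
   diff before after        # surviving cpus' counts should have increased
   echo 1 > /sys/devices/system/cpu/cpu1/online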



