[Hotplug_sig] Re: [lhcs-devel] Automated Hotplug CPU Regression Test Cases

markw at osdl.org (Mark Wong) · Sat Feb 12 07:22:21 2005

Hi Nathan,

I'm also lagging on email since I'm away from the office, but I should
be able to check mail more frequently on this leg of the trip.

On Thu, Feb 10, 2005 at 09:05:38PM -0600, Nathan Lynch wrote:
> Hi Mark-
> 
> [sorry for long delay in replying]
> 
> On Thu, 2005-01-06 at 16:14 -0800, Mark Wong wrote:
> > Here's what I started drafting for input, ideas, and contributions in
> > response to getting a regression test suite going.
> 
> I'd like to help.
> 
> > Automated Regression Test for Hotplug CPU 
> > 
> > What are the test cases we need?
> 
> I think it might be valuable to recount some of the bugs that have been
> encountered in the past.  Off the top of my head:
> 
> 1) tasks not scheduled on a newly onlined cpu

Just to make sure I understand, can this be tested with a cpu that was
just offlined, as opposed to a system that is brought up with 2 cpus
and then has a 3rd one onlined?

> 2) tasks wrongly migrated to a cpu while it is going offline
> 3) irqs with cpu affinity not migrated from an offlined cpu

Does the kernel update the cpu affinity itself?  The angle I'm coming at
this from is how to verify correct behavior.  For example, will we only
be able to check something in /proc to see that interrupts are going to
another cpu, or can we also check whether the affinity mask has been
updated?
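
A rough sketch of the second option, assuming the mask in
/proc/irq/IRQ#/smp_affinity is a hex string (possibly split into
comma-separated 32-bit words on larger systems).  The helper names
cpus_in_mask and irq_affinity are made up for illustration, and whether
the kernel rewrites the mask on offline is exactly the open question
here, so this only shows how a test could read the mask back:

```python
def cpus_in_mask(mask_text):
    """Return the set of CPU numbers whose bits are set in a hex mask."""
    # Assumption: commas only separate 32-bit words, most significant first.
    value = int(mask_text.strip().replace(",", ""), 16)
    return {bit for bit in range(value.bit_length()) if (value >> bit) & 1}


def irq_affinity(irq, proc_root="/proc"):
    """Read the current affinity of one IRQ as a set of CPU numbers."""
    with open(f"{proc_root}/irq/{irq}/smp_affinity") as f:
        return cpus_in_mask(f.read())
```

A test could then check, say, whether 1 is still in irq_affinity(14)
after cpu1 goes offline (IRQ 14 is only a placeholder), once we know
what the expected answer is.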

> 4) mm_struct slab leak (affected only some architectures)

I'm not familiar with this at all, either with slabs in general or
with how a leak could be verified.

> 5) offlining the last running cpu in the system :)

Just because I'm mostly ignorant, what's the correct behavior?  For
example, does the attribute to offline go away, become unchangeable,
etc.?  Definitely easy to test.  :)

> > How do we verify success in each test case?
> 
> Some of the above are easier than others to check in an automated way.
> (1) is hard to verify directly but we might indirectly check on this by
> looking at /proc/stat or measuring the relative performance of some
> parallelized benchmark before and after onlining the cpu.  The scheduler
> currently has an assertion in migration_call which catches (2), not sure
> what else could be done in userspace to check that.  (3) it looks as if
> you have a start at handling below.  (4) is hard... (5) is pretty
> easy :)
> 
> > Test Case 1:
> > Verify interrupts are moved off of a CPU when offlined through sysfs on
> > a multiprocessor system?
> > 
> > 1. Set the IRQ smp_affinity mask for the disk controller to the CPU.
> >    Echo the appropriate hex mask into /proc/irq/IRQ#/smp_affinity
> > 
> >    Test interrupts from devices other than disk controllers?
> > 
> > 2. Start watching the interrupt counts in /proc/interrupts.
> > 
> >    Is it worth verifying tools such as sar at the same time?
> > 
> >    Other statistics to monitor?
> > 
> > 3. Start writing to a disk.
> > 
> >    Suggestions for what to do in order to be able to verify all writes
> >    are completed and correct?
> > 
> > 4. Offline a CPU, say cpu1.
> > 
> >    echo 0 > /sys/devices/system/cpu/cpu1/online
> > 
> >    cpu0 is not hotswappable on some architectures.
> 
> Right, and the sysfs entry for cpu0 will not have an online attribute in
> such cases.

Is it silly for me to ask whether the lack of an online attribute needs
to be checked on those particular architectures?
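
If it is worth checking, it could be as simple as the sketch below.  It
assumes the sysfs layout quoted above
(/sys/devices/system/cpu/cpuN/online); hotpluggable_cpus and
set_cpu_online are hypothetical helper names.  The write opens the
existing file without creating it, so a missing attribute or a rejected
request shows up as an OSError:

```python
import os


def hotpluggable_cpus(sysfs_cpu_dir="/sys/devices/system/cpu"):
    """Return sorted CPU numbers whose sysfs directory has an 'online' file."""
    found = []
    for name in os.listdir(sysfs_cpu_dir):
        # Only cpuN directories count; skip siblings such as 'cpufreq'.
        if name.startswith("cpu") and name[3:].isdigit():
            if os.path.exists(os.path.join(sysfs_cpu_dir, name, "online")):
                found.append(int(name[3:]))
    return sorted(found)


def set_cpu_online(cpu, online, sysfs_cpu_dir="/sys/devices/system/cpu"):
    """Write 1 or 0 to cpuN/online; raises OSError if the write fails."""
    path = os.path.join(sysfs_cpu_dir, f"cpu{cpu}", "online")
    fd = os.open(path, os.O_WRONLY)  # no O_CREAT: a missing file is an error
    try:
        os.write(fd, b"1\n" if online else b"0\n")
    finally:
        os.close(fd)
```

On an architecture where cpu0 cannot be removed, the test would expect
0 to be absent from hotpluggable_cpus().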

> >    Can we pinpoint when a CPU goes offline?  Or when can we know it is
> >    safe to physically remove a CPU.
>
> When the write() which wrote '1' or '0' to the online file returns
> successfully, the operation has completed from the kernel's point of
> view.
> 
> >    It's my understanding that timeslice overrun prevents
> >    'time echo 0 > /sys/devices/system/cpu/cpu1/online' from being an
> >    accurate measure of how long it takes to offline a CPU.
> 
> Hm, I don't know anything about this.
> 
> >    Does the return of an echo signify the CPU has completed offlining?
> 
> Yes, but check the return code (zero for success).
> 
> > 5. Do we see any change in /proc/interrupts?
> 
> This is architecture-specific behavior iirc.  ppc64 lists only online
> cpus.

Ok, so we can still infer correct behavior by seeing the remaining
cpus pick up interrupts.
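
One way a test might do that, sketched under the assumption that
/proc/interrupts has a header row of CPU names followed by one row per
interrupt with a count for each listed cpu.  The function names are
invented for this example, and rows that do not carry one count per
cpu are simply skipped:

```python
def parse_interrupts(text):
    """Map each IRQ label to {cpu_name: count} from /proc/interrupts text."""
    lines = text.splitlines()
    if not lines:
        return {}
    cpus = lines[0].split()  # e.g. ['CPU0', 'CPU1']
    counts = {}
    for line in lines[1:]:
        fields = line.split()
        if not fields or not fields[0].endswith(":"):
            continue
        values = fields[1:1 + len(cpus)]
        # Keep only rows with a numeric count for every listed cpu.
        if len(values) == len(cpus) and all(v.isdigit() for v in values):
            counts[fields[0].rstrip(":")] = dict(zip(cpus, map(int, values)))
    return counts


def cpus_with_new_interrupts(before, after, irq):
    """Return the cpus whose count for irq grew between two snapshots."""
    old = parse_interrupts(before).get(irq, {})
    new = parse_interrupts(after).get(irq, {})
    return sorted(cpu for cpu, n in new.items() if n > old.get(cpu, 0))
```

Because the cpus are taken from each snapshot's own header, this also
copes with an architecture that lists only online cpus: the offlined
cpu just disappears from the second snapshot.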

> >    Any interesting kernel messages in /var/log/messages?
> 
> ppc64 currently outputs stuff, but i386 didn't the last time I checked.
> In any case I don't think this should be relied on.
> 
> > Test Case 2:
> > Verify running processes are moved off of a CPU when offlined through
> > sysfs on a multiprocessor system?
> 
> As I mentioned, this might be hard to verify from userspace, but the
> kernel does BUG() on this.
> 
> Hope this helps.
> 
> 
> Nathan
> 
> 
>