Hi Mark-

[sorry for the long delay in replying]

On Thu, 2005-01-06 at 16:14 -0800, Mark Wong wrote:
> Here's what I started drafting for input, ideas, and contributions in
> response to getting a regression test suite going.

I'd like to help.

> Automated Regression Test for Hotplug CPU
>
> What are the test cases we need?

I think it might be valuable to recount some of the bugs that have been
encountered in the past.  Off the top of my head:

1) tasks not scheduled on a newly onlined cpu
2) tasks wrongly migrated to a cpu while it is going offline
3) irqs with cpu affinity not migrated from an offlined cpu
4) mm_struct slab leak (affected only some architectures)
5) offlining the last running cpu in the system :)

> How do we verify success in each test case?

Some of the above are easier than others to check in an automated way.

(1) is hard to verify directly but we might indirectly check on this by
looking at /proc/stat or measuring the relative performance of some
parallelized benchmark before and after onlining the cpu.

The scheduler currently has an assertion in migration_call which catches
(2), not sure what else could be done in userspace to check that.

(3) it looks as if you have a start at handling below.

(4) is hard...

(5) is pretty easy :)

> Test Case 1:
> Verify interrupts are moved off of a CPU when offlined through sysfs on
> a multiprocessor system?
>
> 1. Set the IRQ smp_affinity mask for the disk controller to the CPU.
>    Echo the appropriate hex mask into /proc/irq/IRQ#/smp_affinity
>
>    Test interrupts from devices other than disk controllers?
>
> 2. Start watching the interrupt counts in /proc/interrupts.
>
>    Is it worth verifying tools such as sar at the same time?
>
>    Other statistics to monitor?
>
> 3. Start writing to a disk.
>
>    Suggestions for what to do in order to be able to verify all writes
>    are completed and correct?
>
> 4. Offline a CPU, say cpu1.
>
>    echo 0 > /sys/devices/system/cpu/cpu1/online
>
>    cpu0 is not hotswappable on some architectures.

Right, and the sysfs entry for cpu0 will not have an online attribute in
such cases.

>    Can we pinpoint when a CPU goes offline?  Or when can we know it is
>    safe to physically remove a CPU.

When the write() which wrote '1' or '0' to the online file returns
successfully, the operation has completed from the kernel's point of
view.

>    It's my understanding that timeslice overrun prevents
>    'time echo 0 > /sys/devices/system/cpu/cpu1/online' from being an
>    accurate measure of how long it takes to offline a CPU.

Hm, I don't know anything about this.

>    Does the return of an echo signify the CPU has completed offlining?

Yes, but check the return code (zero for success).

> 5. Do we see any change in /proc/interrupts?

This is architecture-specific behavior iirc.  ppc64 lists only online
cpus.

>    Any interesting kernel messages in /var/log/messages?

ppc64 currently outputs stuff, but i386 didn't the last time I checked.
In any case I don't think this should be relied on.

> Test Case 2:
> Verify running processors are moved off of a CPU when offlined through
> sysfs on a multiprocessor system?

As I mentioned, this might be hard to verify from userspace, but the
kernel does BUG() on this.

Hope this helps.

Nathan
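
For illustration only, here is a minimal sketch of how a test harness might apply two points from the replies above: treat a successful write() to the online file as completion of the operation, and skip cpus (such as cpu0 on some architectures) that have no online attribute. The function name set_cpu_online, its return convention, and the root parameter are assumptions made for this sketch and are not part of any existing test suite; the root parameter exists only so the logic can be exercised against a scratch directory instead of the real sysfs tree.

```python
import os

# Default location of the per-cpu sysfs directories.
SYSFS_CPU_ROOT = "/sys/devices/system/cpu"


def set_cpu_online(cpu, online, root=SYSFS_CPU_ROOT):
    """Write '1' or '0' to cpu<N>/online (hypothetical helper).

    Returns None if the cpu has no online attribute (not hotpluggable),
    True if the write() succeeded, and False if it failed with an error.
    """
    path = os.path.join(root, "cpu%d" % cpu, "online")
    if not os.path.exists(path):
        # No online attribute: this cpu cannot be hotplugged here.
        return None
    try:
        fd = os.open(path, os.O_WRONLY)
        try:
            # A successful write() means the operation has completed
            # from the kernel's point of view.
            os.write(fd, b"1" if online else b"0")
        finally:
            os.close(fd)
    except OSError:
        # The write was rejected (for example, insufficient privileges).
        return False
    return True
```

On a real system this would need root privileges, e.g. set_cpu_online(1, False) to offline cpu1 and set_cpu_online(1, True) to bring it back.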