I've started to outline more regression test cases based on Nathan's
suggestion (attached). Comments? We'll probably discuss it a bit on
tomorrow's Hotplug SIG conference call.

Thanks,
Mark

-------------- next part --------------
Test Case 1: What happens to disk controller interrupts when you offline a
CPU on a multiprocessor system?

1. Note the current smp_affinity mask for the disk controller to stress.
   Set the IRQ smp_affinity mask for the disk controller to all CPUs:
   echo the appropriate hex mask into /proc/irq/IRQ#/smp_affinity.
   Verify the smp_affinity mask.

2. Start watching the interrupt counts in /proc/interrupts.
   Is it worth verifying tools such as sar at the same time?

3. Start writing to a disk.

       while true; do echo 1 > dud; sleep 1; done

   Suggestions for how to verify that all writes are completed and
   correct?

4. Offline a CPU; pick on cpu1.

       echo 0 > /sys/devices/system/cpu/cpu1/online

   cpu0 is not hotswappable on some architectures and will not have an
   online attribute.
   Can we pinpoint when a CPU goes offline? It's my understanding that
   timeslice overrun prevents
   'time echo 0 > /sys/devices/system/cpu/cpu1/online' from being an
   accurate measure of how long it takes to offline a CPU. A return of 0
   (zero) signifies the successful completion of offlining the CPU from
   the kernel's point of view.
   Verify the smp_affinity mask of the affected disk controller.

5. Analyze the data collected from /proc/interrupts?
   Depending on the architecture under test, relevant messages about the
   procedure will appear in /var/log/messages.

Test Case 2: What happens to a process when you offline a CPU on a
multiprocessor system?

1. Start a shell script that spins on a CPU.

2. Note the current processor affinity mask of the spinning process using
   taskset. I believe there is at least one other tool available. How
   important is it to note all of them?
   Set the processor affinity mask to cpu1, using taskset.
   Verify the processor affinity mask, using taskset.

3. Start recording the CPU utilization using sar.
   Is it worth verifying other tools at the same time?

4. Offline a CPU; pick on cpu1.

       echo 0 > /sys/devices/system/cpu/cpu1/online

   cpu0 is not hotswappable on some architectures and will not have an
   online attribute.
   Verify the processor affinity mask, using taskset.

5. Analyze the processor utilization data collected.
   Depending on the architecture under test, relevant messages about the
   procedure will appear in /var/log/messages.

Test Case 3: Check that tasks are scheduled on a newly onlined CPU.

1. Offline a CPU.

2. Start one CPU-spinning script per processor in the system, counting
   the CPU just offlined.

3. Monitor the processor utilization on each CPU.

4. Online the CPU from step 1.

5. Analyze the processor utilization to determine whether one of the
   spinning tasks migrated to the newly onlined CPU.

Test Case 4: Offline the last running CPU.

1. Starting from the highest-numbered CPU rather than cpu0, offline every
   CPU except cpu0.

2. Determine whether cpu0 can be offlined by checking for the existence of
   /sys/devices/system/cpu/cpu0/online.

3. If the attribute exists, try to offline cpu0 and check for EBUSY as the
   correct behavior.

Test Case 5: Stress Test

1. Start monitoring memory usage. vmstat? sar?

2. Start a CPU- and memory-intensive test for a duration of 4(?) hours.
   (tpc-c, reaim, suggestions?)

3. Offline and online processors at regular intervals throughout the
   duration of the test.

4. At the end of the test, analyze the system statistics to look for
   memory leaks.
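
Several of the test cases above repeat the same few operations: building an "all CPUs" affinity mask, checking whether a CPU has an online attribute, and writing 0 or 1 to it. A rough sketch of shell helpers for those steps might look like the following. The function names and the CPU_SYSFS_ROOT variable are made up for illustration and are not part of any existing tool. The mask helper assumes 32 or fewer CPUs, since larger masks are written as comma-separated 32-bit groups.

```shell
#!/bin/sh
# Hypothetical helpers for the hotplug test cases; all names are illustrative.

# Root of the CPU sysfs tree. Overridable (an assumption of this sketch) so
# the helpers can be tried against a scratch directory instead of a live system.
CPU_SYSFS_ROOT="${CPU_SYSFS_ROOT:-/sys/devices/system/cpu}"

# Print a hex affinity mask covering CPUs 0..(n-1). Assumes n <= 32.
all_cpus_mask() {
    printf '%x\n' $(( (1 << $1) - 1 ))
}

# Succeed if cpuN exposes an "online" attribute. On some architectures
# cpu0 has none and cannot be offlined.
cpu_is_hotpluggable() {
    [ -e "$CPU_SYSFS_ROOT/cpu$1/online" ]
}

# Write 0 (offline) or 1 (online) to cpuN/online.
# Returns 2 if the attribute is missing, else the exit status of the write.
set_cpu_online() {
    cpu_is_hotpluggable "$1" || return 2
    echo "$2" > "$CPU_SYSFS_ROOT/cpu$1/online"
}

# Print the current contents of cpuN/online (0 or 1).
cpu_online_state() {
    cat "$CPU_SYSFS_ROOT/cpu$1/online"
}
```

For example, on a 4-CPU system "all_cpus_mask 4" prints f, which is the value step 1 of Test Case 1 would echo into smp_affinity, and "set_cpu_online 1 0" performs the offline step used in Test Cases 1 and 2. Checking the return status of set_cpu_online, rather than timing it, matches the note that a return of 0 is what signals success from the kernel's point of view.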