FYI for discussion on today's call, here was Bryce's update that was
accidentally missed for the last SIG call when both Bryce and I were
on vacation. This does not include what he has done in the last two
weeks.

-------- Forwarded Message --------

Here's the current version of my todo list...

Bryce

Hotplug Testing Project
-----------------------

0. Initialization
==================
* (DONE) Request getting added to the hotplug project
* (DONE) Have Mark teach me what he knows about using the system
* (DONE) Make sure hotplug project includes request for 64-bit hardware
* (DONE) Ask John Cherry to create /dev/hotplug on developer
* (DONE) Set up a preliminary website; plan further work
* (DONE) Locate high level info about Hotplug and plan reading
* (DONE) Put hotplug testing website out on developer

1. Background
=============
* (DONE) Check with cem about status of 3rd power machine
* (DONE) Automate pulling of patches into PLM for doing cross compiles
* (DONE) Read http://developer.osdl.org/maryedie/HOTPLUG/CPU_statement_scope.txt
* (DONE) Read http://developer.osdl.org/maryedie/HOTPLUG/Memory_statement_scope.txt
* (DONE) Read http://developer.osdl.org/maryedie/HOTPLUG/planning/hotplug_cpu_test_plan_status.html
* (DONE) Read http://www.developer.osdl.org/maryedie/HOTPLUG/HOTPLUG_SIG_0120.pdf
* (DONE) Read http://lists.osdl.org/pipermail/hotplug_sig/2004-December/000097.htm
* (DONE) Locate the hotplug memory patches
* (DONE) Review http://lhms.sourceforge.net/
* (DONE) Review http://openipmi.sourceforge.net/
* (DONE) Review http://openhpi.sourceforge.net/
* (DONE) Read minutes from previous meetings
* (DONE) Read the high level document Martine wrote about Hotplug memory
* (DONE) Learn about hypervisor
* (DONE) Read Dave Hansen's "Hotplug Memory Redux" paper

2. Preliminary Testing
======================
* (DONE) Write PackageRetriever script to pull the memory hotplug patches
* (DONE) Figure out machine usage situation
* (DONE) Set up testing mounts on dev4-013
* (DONE) Run the CPU onlining/offlining tests a few times
* (DONE) Run the CPU onlining/offlining tests as stress test over a weekend
* (DONE) Create a hotplug testing NFS file share - reuse NFS one
* (DONE) Set up PackageRetriever as a cron job on cl023
* Implement mechanism to have package retriever trigger testing
* (DONE) Add hotplug memory patches to package retriever input.txt & test
* Find and review the existing tests and plan modifying them to check
  behaviors when run.

3. Planning
===========
* Plan out per-patch testing
  + Report PLM results
  + Download and compile patch on target platform; report success/fail
  + Try booting system with new patch
  + Run tests and report results
* Plan out testing reports that will be useful to the community
* Plan out regression, performance, and integration testing for memory
  hotplug, with priority on regression.
  + Run any other test, then take memory sections on/offline
* Plan out "thorough regression testing" for CPU hotplug
* Plan out integration testing for CPU hotplug
  + Verify stability of code when run with other new kernel features,
    other RAS features, and user-land performance tools
* Plan out how to verify Reliability/Availability for replacing
  defective CPUs and memory
* Plan out how to verify dynamic partitioning by transferring CPU and
  memory resources to workloads that need them
* Plan out how to demonstrate use of CPU/memory add/remove for
  virtualization applications
* Plan demoing "Instant Capacity" by onlining CPUs or memory as needed
* Plan what architectures make sense to test - enterprise-like systems
  + user, DMA, kernel memory
  + SMP, NUMA
* Plan performance testing; use cases, comparisons to code without the
  patch, etc.
* Plan out regression tests needing to be written for memory hotplug
* Plan out regression test automation
* On each -mm release, run CPU scripts to test it

4. Research
============
* Play with the /sys/devices/memory interfaces
* Ensure that the scripts check the status of the CPU when offlined to
  make sure it was offlined
* Create test that checks if the CPU should not be allowed to be taken
  offline
* Add tests on affinity checking

Future
======
* If can find a 64-bit system, try doing testing on it
* Plan what to do to get more info about why a CPU offlining failed
* Plan doing stress testing
* Identify intermediary milestones for hotplug memory and define test
  that could be run at those steps
* For memory, need to define intermediate stages
* For CPU, need to verify proper on/offlining

-- 
Mary Edie Meredith
maryedie@xxxxxxxx
503-906-1942
Data Center Linux Initiative Manager
Open Source Development Labs