Hi Thomas,

On 5/13/24 5:44 AM, Thomas Gleixner wrote:
> On Fri, May 10 2024 at 12:06, Dongli Zhang wrote:
>> The absence of IRQD_MOVE_PCNTXT prevents immediate effectiveness of
>> interrupt affinity reconfiguration via procfs. Instead, the change is
>> deferred until the next instance of the interrupt being triggered on the
>> original CPU.
>>
>> When the interrupt next triggers on the original CPU, the new affinity is
>> enforced within __irq_move_irq(). A vector is allocated from the new CPU,
>> but if the old vector on the original CPU remains online, it is not
>> immediately reclaimed. Instead, apicd->move_in_progress is flagged, and the
>> reclaiming process is delayed until the next trigger of the interrupt on
>> the new CPU.
>>
>> Upon the subsequent triggering of the interrupt on the new CPU,
>> irq_complete_move() adds a task to the old CPU's vector_cleanup list if it
>> remains online. Subsequently, the timer on the old CPU iterates over its
>> vector_cleanup list, reclaiming vectors.
>>
>> However, if the old CPU is offline before the interrupt triggers again on
>> the new CPU, irq_complete_move() simply resets both apicd->move_in_progress
>> and apicd->prev_vector to 0. Consequently, the vector remains unreclaimed
>> in vector_matrix, resulting in a CPU vector leak.
>
> I doubt that.
>
> Any interrupt which is affine to an outgoing CPU is migrated and
> eventually pending moves are enforced:
>
>     cpu_down()
>       ...
>         cpu_disable_common()
>           fixup_irqs()
>             irq_migrate_all_off_this_cpu()
>               migrate_one_irq()
>                 irq_force_complete_move()
>                   free_moved_vector();
>
> No?

I noticed this, but finally abandoned the idea of fixing it in
migrate_one_irq(), because:

1. The objective of the irq_force_complete_move() call in migrate_one_irq()
appears to be to clean up before irq_do_set_affinity().

2. irq_needs_fixup() may return false, so that irq_force_complete_move()
never gets the chance to run.

3. Even if irq_force_complete_move() is triggered, it exits early when
apicd->prev_vector == 0. apicd->prev_vector can be cleared by
__vector_schedule_cleanup(), because cpu_disable_common() releases the
vector_lock after the CPU is flagged offline:

void cpu_disable_common(void)
{
        int cpu = smp_processor_id();

        remove_siblinginfo(cpu);

        /* It's now safe to remove this processor from the online map */
        lock_vector_lock();
        remove_cpu_from_maps(cpu);   ---> CPU is flagged offline
        unlock_vector_lock();        ---> release the vector_lock here!
        fixup_irqs();
        lapic_offline();
}

Therefore, the bugfix might become something like this (just to demo the idea):

diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index 185738c72766..247a53fe9ada 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -1035,8 +1035,6 @@ static void __vector_schedule_cleanup(struct apic_chip_data *apicd)
 			cl->timer.expires = jiffies + 1;
 			add_timer_on(&cl->timer, cpu);
 		}
-	} else {
-		apicd->prev_vector = 0;		// or print a warning
 	}
 	raw_spin_unlock(&vector_lock);
 }
diff --git a/kernel/irq/cpuhotplug.c b/kernel/irq/cpuhotplug.c
index 1ed2b1739363..5ecd072a34fe 100644
--- a/kernel/irq/cpuhotplug.c
+++ b/kernel/irq/cpuhotplug.c
@@ -69,6 +69,14 @@ static bool migrate_one_irq(struct irq_desc *desc)
 		return false;
 	}
 
+	/*
+	 * Complete an eventually pending irq move cleanup. If this
+	 * interrupt was moved in hard irq context, then the vectors need
+	 * to be cleaned up. It can't wait until this interrupt actually
+	 * happens and this CPU was involved.
+	 */
+	irq_force_complete_move(desc);
+
 	/*
 	 * No move required, if:
 	 * - Interrupt is per cpu
@@ -87,14 +95,6 @@ static bool migrate_one_irq(struct irq_desc *desc)
 		return false;
 	}
 
-	/*
-	 * Complete an eventually pending irq move cleanup. If this
-	 * interrupt was moved in hard irq context, then the vectors need
-	 * to be cleaned up. It can't wait until this interrupt actually
-	 * happens and this CPU was involved.
-	 */
-	irq_force_complete_move(desc);
-
 	/*
 	 * If there is a setaffinity pending, then try to reuse the pending
 	 * mask, so the last change of the affinity does not get lost. If

That's why I modified only __vector_schedule_cleanup(), as it looked simple.

I will fix it in the CPU hotplug interrupt migration code instead.

>
> In fact irq_complete_move() should never see apicd->move_in_progress
> with apicd->prev_cpu pointing to an offline CPU.

I think it is possible. The fact that a CPU is flagged offline doesn't mean
fixup_irqs() has already been triggered: the vector_lock is released after
the CPU is flagged offline, so __vector_schedule_cleanup() can run in that
window.

>
> The CPU offline case in __vector_schedule_cleanup() should not even
> exist or at least just emit a warning.
>
> If you can trigger that case, then there is something fundamentally
> wrong with the CPU hotplug interrupt migration code and that needs to be
> investigated and fixed.

I can easily reproduce the issue (a rough script sketch of the loop follows
the steps):

1. Create a QEMU VM (12 vCPUs) with virtio-net and 8 RX/TX queues (16
vectors). Run an active iperf3 multiqueue workload.

2. Affine all 16 vectors to CPU=11, and sleep 1 second.

3. Affine all 16 vectors to CPU=10.

4. Offline CPU=11, and sleep for 2 seconds.

5. Online CPU=11, and go back to step 2 in a loop.
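A rough sketch of that loop, for illustration only (the IRQ numbers, assumed
here to be 32-47 for the virtio-net queue vectors, have to be looked up in
/proc/interrupts for the actual VM):

#!/bin/bash
# Rough reproducer sketch; the IRQ range below is an example, not from a real VM.
IRQS=$(seq 32 47)

while true; do
        # Step 2: affine all queue vectors to CPU 11 and let them fire there.
        for irq in $IRQS; do
                echo 11 > /proc/irq/$irq/smp_affinity_list
        done
        sleep 1

        # Step 3: move them to CPU 10. Without IRQD_MOVE_PCNTXT the move
        # stays pending until each interrupt triggers again.
        for irq in $IRQS; do
                echo 10 > /proc/irq/$irq/smp_affinity_list
        done

        # Step 4: offline the old CPU before the cleanup can happen.
        echo 0 > /sys/devices/system/cpu/cpu11/online
        sleep 2

        # Step 5: bring CPU 11 back and repeat.
        echo 1 > /sys/devices/system/cpu/cpu11/online
done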
After hours, CPU=11 has many un-reclaimed vectors:

# cat /sys/kernel/debug/irq/domains/VECTOR
name:   VECTOR
 size:   0
 mapped: 47
 flags:  0x00000103
Online bitmaps:       12
Global available:   2302
Global reserved:       7
Total allocated:     122
System: 38: 0-19,21,50,128,236,240-242,244,246-255
 | CPU | avl | man | mac | act | vectors
     0   198    1    1      4  32-34,48
     1   198    1    1      4  32-35
     2   198    1    1      4  32-35
     3   199    1    1      3  32-34
     4   199    1    1      3  32-34
     5   199    1    1      3  32-34
     6   198    1    1      4  32-35
     7   200    1    1      2  32-33
     8   199    1    1      3  32,34-35
     9   198    1    1      4  32-33,36-37
    10   198    1    1      4  32-33,35,41
    11   118    1    1     84  32-49,51-110,113-115,117,119,121

I will fix it in the interrupt migration code.

Thank you very much!

Dongli Zhang