Hi Thomas,

On 5/13/24 5:44 AM, Thomas Gleixner wrote:
> On Fri, May 10 2024 at 12:06, Dongli Zhang wrote:
>> The absence of IRQD_MOVE_PCNTXT prevents immediate effectiveness of
>> interrupt affinity reconfiguration via procfs. Instead, the change is
>> deferred until the next instance of the interrupt being triggered on the
>> original CPU.
>>
>> When the interrupt next triggers on the original CPU, the new affinity is
>> enforced within __irq_move_irq(). A vector is allocated from the new CPU,
>> but if the old vector on the original CPU remains online, it is not
>> immediately reclaimed. Instead, apicd->move_in_progress is flagged, and the
>> reclaiming process is delayed until the next trigger of the interrupt on
>> the new CPU.
>>
>> Upon the subsequent triggering of the interrupt on the new CPU,
>> irq_complete_move() adds a task to the old CPU's vector_cleanup list if it
>> remains online. Subsequently, the timer on the old CPU iterates over its
>> vector_cleanup list, reclaiming vectors.
>>
>> However, if the old CPU is offline before the interrupt triggers again on
>> the new CPU, irq_complete_move() simply resets both apicd->move_in_progress
>> and apicd->prev_vector to 0. Consequently, the vector remains unreclaimed
>> in vector_matrix, resulting in a CPU vector leak.
>
> I doubt that.
>
> Any interrupt which is affine to an outgoing CPU is migrated and
> eventually pending moves are enforced:
>
>     cpu_down()
>       ...
>         cpu_disable_common()
>           fixup_irqs()
>             irq_migrate_all_off_this_cpu()
>               migrate_one_irq()
>                 irq_force_complete_move()
>                   free_moved_vector();
>
> No?

I noticed this, but finally abandoned the idea of fixing it in
migrate_one_irq(), because:

1. The objective of the irq_force_complete_move() call in migrate_one_irq()
appears to be to clean up before irq_do_set_affinity().

2. irq_needs_fixup() may return false, so that irq_force_complete_move()
never gets the chance to run.

3. Even if irq_force_complete_move() is triggered, it exits early when
apicd->prev_vector == 0. apicd->prev_vector can be cleared by
__vector_schedule_cleanup(), because cpu_disable_common() releases the
vector_lock after the CPU is flagged offline:

void cpu_disable_common(void)
{
        int cpu = smp_processor_id();

        remove_siblinginfo(cpu);

        /* It's now safe to remove this processor from the online map */
        lock_vector_lock();
        remove_cpu_from_maps(cpu);   ---> CPU is flagged offline
        unlock_vector_lock();        ---> release the vector_lock here!
        fixup_irqs();
        lapic_offline();
}

Therefore, the bugfix might become something like this (just to demo the idea):

diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index 185738c72766..247a53fe9ada 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -1035,8 +1035,6 @@ static void __vector_schedule_cleanup(struct apic_chip_data *apicd)
 			cl->timer.expires = jiffies + 1;
 			add_timer_on(&cl->timer, cpu);
 		}
-	} else {
-		apicd->prev_vector = 0;		// or print a warning
 	}
 	raw_spin_unlock(&vector_lock);
 }
diff --git a/kernel/irq/cpuhotplug.c b/kernel/irq/cpuhotplug.c
index 1ed2b1739363..5ecd072a34fe 100644
--- a/kernel/irq/cpuhotplug.c
+++ b/kernel/irq/cpuhotplug.c
@@ -69,6 +69,14 @@ static bool migrate_one_irq(struct irq_desc *desc)
 		return false;
 	}
 
+	/*
+	 * Complete an eventually pending irq move cleanup. If this
+	 * interrupt was moved in hard irq context, then the vectors need
+	 * to be cleaned up. It can't wait until this interrupt actually
+	 * happens and this CPU was involved.
+	 */
+	irq_force_complete_move(desc);
+
 	/*
 	 * No move required, if:
 	 * - Interrupt is per cpu
@@ -87,14 +95,6 @@ static bool migrate_one_irq(struct irq_desc *desc)
 		return false;
 	}
 
-	/*
-	 * Complete an eventually pending irq move cleanup. If this
-	 * interrupt was moved in hard irq context, then the vectors need
-	 * to be cleaned up. It can't wait until this interrupt actually
-	 * happens and this CPU was involved.
-	 */
-	irq_force_complete_move(desc);
-
 	/*
 	 * If there is a setaffinity pending, then try to reuse the pending
 	 * mask, so the last change of the affinity does not get lost. If

That's why I modified only __vector_schedule_cleanup(), as it looked simple.

I will fix it in the CPU hotplug interrupt migration code instead.

>
> In fact irq_complete_move() should never see apicd->move_in_progress
> with apicd->prev_cpu pointing to an offline CPU.

I think it is possible. The fact that a CPU is flagged offline doesn't mean
fixup_irqs() has already been triggered: the vector_lock is released after
the CPU is flagged offline, so __vector_schedule_cleanup() can run in that
window.

>
> The CPU offline case in __vector_schedule_cleanup() should not even
> exist or at least just emit a warning.
>
> If you can trigger that case, then there is something fundamentally
> wrong with the CPU hotplug interrupt migration code and that needs to be
> investigated and fixed.

I can easily reproduce the issue (a rough script sketch of the loop follows
the steps):

1. Create a QEMU VM (12 vCPUs) with virtio-net and 8 RX/TX queues (16
vectors). Run an active iperf3 multiqueue workload.

2. Affine all 16 vectors to CPU=11, and sleep 1 second.

3. Affine all 16 vectors to CPU=10.

4. Offline CPU=11, and sleep for 2 seconds.

5. Online CPU=11, and go back to step 2 in a loop.
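A rough sketch of that loop, for illustration only (the IRQ numbers, assumed
here to be 32-47 for the virtio-net queue vectors, have to be looked up in
/proc/interrupts for the actual VM):

#!/bin/bash
# Rough reproducer sketch; the IRQ range below is an example, not from a real VM.
IRQS=$(seq 32 47)

while true; do
        # Step 2: affine all queue vectors to CPU 11 and let them fire there.
        for irq in $IRQS; do
                echo 11 > /proc/irq/$irq/smp_affinity_list
        done
        sleep 1

        # Step 3: move them to CPU 10. Without IRQD_MOVE_PCNTXT the move
        # stays pending until each interrupt triggers again.
        for irq in $IRQS; do
                echo 10 > /proc/irq/$irq/smp_affinity_list
        done

        # Step 4: offline the old CPU before the cleanup can happen.
        echo 0 > /sys/devices/system/cpu/cpu11/online
        sleep 2

        # Step 5: bring CPU 11 back and repeat.
        echo 1 > /sys/devices/system/cpu/cpu11/online
done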
After hours, CPU=11 has many un-reclaimed vectors:

# cat /sys/kernel/debug/irq/domains/VECTOR
name:   VECTOR
 size:   0
 mapped: 47
 flags:  0x00000103
Online bitmaps:       12
Global available:   2302
Global reserved:       7
Total allocated:     122
System: 38: 0-19,21,50,128,236,240-242,244,246-255
 | CPU | avl | man | mac | act | vectors
     0   198    1    1      4  32-34,48
     1   198    1    1      4  32-35
     2   198    1    1      4  32-35
     3   199    1    1      3  32-34
     4   199    1    1      3  32-34
     5   199    1    1      3  32-34
     6   198    1    1      4  32-35
     7   200    1    1      2  32-33
     8   199    1    1      3  32,34-35
     9   198    1    1      4  32-33,36-37
    10   198    1    1      4  32-33,35,41
    11   118    1    1     84  32-49,51-110,113-115,117,119,121

I will fix it in the interrupt migration code.

Thank you very much!

Dongli Zhang