On 5/13/24 3:46 PM, Thomas Gleixner wrote:
> On Mon, May 13 2024 at 10:43, Dongli Zhang wrote:
>> On 5/13/24 5:44 AM, Thomas Gleixner wrote:
>>> On Fri, May 10 2024 at 12:06, Dongli Zhang wrote:
>>> Any interrupt which is affine to an outgoing CPU is migrated and
>>> eventually pending moves are enforced:
>>>
>>>     cpu_down()
>>>       ...
>>>         cpu_disable_common()
>>>           fixup_irqs()
>>>             irq_migrate_all_off_this_cpu()
>>>               migrate_one_irq()
>>>                 irq_force_complete_move()
>>>                   free_moved_vector();
>>>
>>> No?
>>
>> I noticed this but finally abandoned the idea of fixing it in
>> migrate_one_irq(), because:
>>
>> 1. The objective of migrate_one_irq()-->irq_force_complete_move() looks
>> to be to clean up before irq_do_set_affinity().
>>
>> 2. irq_needs_fixup() may return false, so that irq_force_complete_move()
>> does not get the chance to trigger.
>>
>> 3. Even if irq_force_complete_move() is triggered, it exits early if
>> apicd->prev_vector==0.
>
> But that's not the case, really.
>
>> The apicd->prev_vector can be cleared by __vector_schedule_cleanup()
>> because cpu_disable_common() releases the vector_lock after the CPU is
>> flagged offline.
>
> Nothing can schedule vector cleanup at that point because _all_ other
> CPUs spin in stop_machine() with interrupts disabled and therefore
> cannot handle interrupts which might invoke it.

Thank you very much for the explanation! Now I see why dropping the
vector_lock is not an issue.

[snip]

>
>> diff --git a/kernel/irq/cpuhotplug.c b/kernel/irq/cpuhotplug.c
>> index 1ed2b1739363..5ecd072a34fe 100644
>> --- a/kernel/irq/cpuhotplug.c
>> +++ b/kernel/irq/cpuhotplug.c
>> @@ -69,6 +69,14 @@ static bool migrate_one_irq(struct irq_desc *desc)
>>  		return false;
>>  	}
>>
>> +	/*
>> +	 * Complete an eventually pending irq move cleanup. If this
>> +	 * interrupt was moved in hard irq context, then the vectors need
>> +	 * to be cleaned up. It can't wait until this interrupt actually
>> +	 * happens and this CPU was involved.
>> +	 */
>> +	irq_force_complete_move(desc);
>
> You cannot do that here because it is only valid when the interrupt is
> affine to the outgoing CPU.
>
> In the problem case the interrupt was affine to the outgoing CPU, but
> the core code does not know that it has not been cleaned up yet. It does
> not even know that the interrupt was affine to the outgoing CPU before.
>
> So in principle we could just go and do:
>
>  	} else {
> -		apicd->prev_vector = 0;
> +		free_moved_vector(apicd);
>  	}
>  	raw_spin_unlock(&vector_lock);
>
> but that would not give enough information and would depend on the
> interrupt to actually arrive.
>
> irq_force_complete_move() already describes this case, but obviously it
> does not work because the core code does not invoke it in that
> situation.
>
> So yes, moving the invocation of irq_force_complete_move() before the
> irq_needs_fixup() call makes sense, but it wants this to actually work
> correctly:
>
> --- a/arch/x86/kernel/apic/vector.c
> +++ b/arch/x86/kernel/apic/vector.c
> @@ -1036,7 +1036,8 @@ static void __vector_schedule_cleanup(st
>  			add_timer_on(&cl->timer, cpu);
>  		}
>  	} else {
> -		apicd->prev_vector = 0;
> +		pr_warn("IRQ %u schedule cleanup for offline CPU %u\n", apicd->irq, cpu);
> +		free_moved_vector(apicd);
>  	}
>  	raw_spin_unlock(&vector_lock);
>  }
> @@ -1097,10 +1098,11 @@ void irq_force_complete_move(struct irq_
>  		goto unlock;
>
>  	/*
> -	 * If prev_vector is empty, no action required.
> +	 * If prev_vector is empty or the descriptor was previously
> +	 * not on the outgoing CPU no action required.
>  	 */
>  	vector = apicd->prev_vector;
> -	if (!vector)
> +	if (!vector || apicd->prev_cpu != smp_processor_id())
>  		goto unlock;
>

The above may not work. migrate_one_irq() relies on irq_force_complete_move()
to always reclaim apicd->prev_vector. Otherwise, the later call to
irq_do_set_affinity() may return -EBUSY from assign_vector_locked():

	/*
	 * Careful here. @apicd might either have move_in_progress set or
	 * be enqueued for cleanup. Assigning a new vector would either
	 * leave a stale vector on some CPU around or in case of a pending
	 * cleanup corrupt the hlist.
	 */
	if (apicd->move_in_progress || !hlist_unhashed(&apicd->clist))
		return -EBUSY;
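If I read the code correctly, the failing sequence on the outgoing CPU would
be roughly as below (a sketch; the intermediate steps of the affinity path
are elided):

    migrate_one_irq()
      irq_force_complete_move()
        /* prev_cpu != smp_processor_id() --> goto unlock, prev_vector kept */
      irq_do_set_affinity()
        ...
          assign_vector_locked()
            /* move_in_progress or !hlist_unhashed(&apicd->clist) --> -EBUSY */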
Or maybe add a flag to control whether to enforce the cleanup under all
conditions?

-void irq_force_complete_move(struct irq_desc *desc)
+void irq_force_complete_move(struct irq_desc *desc, bool always_force)
 {
 	struct apic_chip_data *apicd;
 	struct irq_data *irqd;
@@ -1102,6 +1103,9 @@ void irq_force_complete_move(struct irq_desc *desc)
 	if (!vector)
 		goto unlock;

+	if (!always_force && apicd->prev_cpu != smp_processor_id())
+		goto unlock;
+
 	/*
 	 * This is tricky. If the cleanup of the old vector has not been
 	 * done yet, then the following setaffinity call will fail with
...

... and call irq_force_complete_move() with different arguments at the two
call sites in migrate_one_irq()?

@@ -79,6 +79,11 @@ static bool migrate_one_irq(struct irq_desc *desc)
 	 * interrupt.
 	 */
 	if (irqd_is_per_cpu(d) || !irqd_is_started(d) || !irq_needs_fixup(d)) {
+		/*
+		 * Complete an eventually pending irq move cleanup if this
+		 * interrupt was affine to the outgoing CPU.
+		 */
+		irq_force_complete_move(desc, false);
 		/*
 		 * If an irq move is pending, abort it if the dying CPU is
 		 * the sole target.
@@ -93,7 +98,7 @@ static bool migrate_one_irq(struct irq_desc *desc)
 	 * to be cleaned up. It can't wait until this interrupt actually
 	 * happens and this CPU was involved.
 	 */
-	irq_force_complete_move(desc);
+	irq_force_complete_move(desc, true);

 	/*
 	 * If there is a setaffinity pending, then try to reuse the pending

Thank you very much!

Dongli Zhang