Re: [PATCH 1/1] x86/vector: Fix vector leak during CPU offline

Dongli Zhang <dongli.zhang@xxxxxxxxxx> · Wed, 22 May 2024 14:44:36 -0700

On 5/21/24 5:00 AM, Thomas Gleixner wrote:
> On Wed, May 15 2024 at 12:51, Dongli Zhang wrote:
>> On 5/13/24 3:46 PM, Thomas Gleixner wrote:
>>> So yes, moving the invocation of irq_force_complete_move() before the
>>> irq_needs_fixup() call makes sense, but it wants this to actually work
>>> correctly:
>>> @@ -1097,10 +1098,11 @@ void irq_force_complete_move(struct irq_
>>>  		goto unlock;
>>>  
>>>  	/*
>>> -	 * If prev_vector is empty, no action required.
>>> +	 * If prev_vector is empty or the descriptor was previously
>>> +	 * not on the outgoing CPU no action required.
>>>  	 */
>>>  	vector = apicd->prev_vector;
>>> -	if (!vector)
>>> +	if (!vector || apicd->prev_cpu != smp_processor_id())
>>>  		goto unlock;
>>>  
>>
>> The above may not work. migrate_one_irq() relies on irq_force_complete_move() to
>> always reclaim the apicd->prev_vector. Otherwise, the call of
>> irq_do_set_affinity() later may return -EBUSY.
> 
> You're right. But that still can be handled in irq_force_complete_move()
> with a single unconditional invocation in migrate_one_irq():
> 
> 	cpu = smp_processor_id();
> 	if (!vector || (apicd->cur_cpu != cpu && apicd->prev_cpu != cpu))
> 		goto unlock;

The current target CPU is apicd->cpu, not apicd->cur_cpu :)

Thank you very much for the suggestion!

> 
> because there are only two cases when a cleanup is required:
> 
>    1) The outgoing CPU is the current target
> 
>    2) The outgoing CPU was the previous target
> 
> No?

I agree with this statement.
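
The two cases can be modelled in plain user-space C. This is only a sketch
under stated assumptions: struct vector_state and needs_forced_cleanup() are
hypothetical names, not kernel code. The fields mirror apicd->prev_vector,
apicd->cpu and apicd->prev_cpu, and the outgoing CPU is passed in as an
argument instead of being read via smp_processor_id().

```c
#include <stdbool.h>

/* Hypothetical user-space stand-in for the apic_chip_data fields used here. */
struct vector_state {
	unsigned int prev_vector;	/* 0 means no previous vector is pending */
	unsigned int cpu;		/* current target CPU */
	unsigned int prev_cpu;		/* previous target CPU */
};

/*
 * Returns true when the previous vector has to be reclaimed on behalf of
 * the outgoing CPU, i.e. when the "goto unlock" shortcut must NOT be taken.
 */
static bool needs_forced_cleanup(const struct vector_state *s,
				 unsigned int outgoing_cpu)
{
	/* No previous vector: nothing to clean up. */
	if (!s->prev_vector)
		return false;

	/* Case 1: current target. Case 2: previous target. */
	return s->cpu == outgoing_cpu || s->prev_cpu == outgoing_cpu;
}
```

With prev_vector set, cpu == 1 and prev_cpu == 2, the model asks for a cleanup
when CPU 1 or CPU 2 goes offline, and skips it for any other CPU.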

My only concern is that the two checks rely on different data: we compare
against apicd->cpu, while irq_needs_fixup() looks at
d->common->effective_affinity or d->common->affinity to decide whether to go
ahead and migrate the interrupt.
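
A simplified user-space model shows how a mask-based decision and a CPU-based
decision can disagree. Everything in it is an assumption made for
illustration: the masks are plain 64-bit bitmaps, mask_based_fixup() is a
hypothetical stand-in that only captures the choice between effective affinity
and affinity (it is not the real irq_needs_fixup()), and cpu_based_cleanup()
is a hypothetical stand-in for the per-CPU check on the vector data.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one interrupt: one bit per CPU, CPUs 0..63 only. */
struct irq_model {
	uint64_t effective_affinity;	/* models d->common->effective_affinity */
	uint64_t affinity;		/* models d->common->affinity */
	unsigned int prev_vector;	/* 0 means no previous vector is pending */
	unsigned int cpu;		/* current target CPU */
	unsigned int prev_cpu;		/* previous target CPU */
};

/*
 * Mask-based decision. Assumption: the effective mask is used when it is
 * non-empty, otherwise the general affinity mask. outgoing_cpu must be < 64.
 */
static bool mask_based_fixup(const struct irq_model *m,
			     unsigned int outgoing_cpu)
{
	uint64_t mask = m->effective_affinity ? m->effective_affinity
					      : m->affinity;

	return (mask >> outgoing_cpu) & 1;
}

/* CPU-based decision on the vector data (current or previous target). */
static bool cpu_based_cleanup(const struct irq_model *m,
			      unsigned int outgoing_cpu)
{
	if (!m->prev_vector)
		return false;

	return m->cpu == outgoing_cpu || m->prev_cpu == outgoing_cpu;
}
```

For an interrupt that has already moved from CPU 2 to CPU 1 but still holds a
previous vector on CPU 2, the mask-based check says "no fixup" when CPU 2 goes
offline, while the CPU-based check still asks for the cleanup. In this model,
that is the situation in which the previous vector is never reclaimed if the
cleanup runs only after the mask-based check.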

I spent some time reading the 2017 discussion linked below. As far as I
understand, CONFIG_GENERIC_IRQ_EFFECTIVE_AFF_MASK always depends on CONFIG_SMP,
so we should not be able to hit that issue on x86.

https://lore.kernel.org/all/alpine.DEB.2.20.1710042208400.2406@nanos/T/#u

I have tested the new patch for a while and have not encountered any issues.

Therefore, I will send v2.

Thank you very much for all the suggestions!

Dongli Zhang