On 5/13/24 3:46 PM, Thomas Gleixner wrote:
> On Mon, May 13 2024 at 10:43, Dongli Zhang wrote:
>> On 5/13/24 5:44 AM, Thomas Gleixner wrote:
>>> On Fri, May 10 2024 at 12:06, Dongli Zhang wrote:
>>> Any interrupt which is affine to an outgoing CPU is migrated and
>>> eventually pending moves are enforced:
>>>
>>>     cpu_down()
>>>       ...
>>>         cpu_disable_common()
>>>           fixup_irqs()
>>>             irq_migrate_all_off_this_cpu()
>>>               migrate_one_irq()
>>>                 irq_force_complete_move()
>>>                   free_moved_vector();
>>>
>>> No?
>>
>> I noticed this but finally abandoned the idea of fixing it in
>> migrate_one_irq(), because:
>>
>> 1. The objective of migrate_one_irq()-->irq_force_complete_move() looks
>> to be to clean up before irq_do_set_affinity().
>>
>> 2. irq_needs_fixup() may return false, so that irq_force_complete_move()
>> does not get the chance to trigger.
>>
>> 3. Even if irq_force_complete_move() is triggered, it exits early if
>> apicd->prev_vector==0.
>
> But that's not the case, really.
>
>> The apicd->prev_vector can be cleared by __vector_schedule_cleanup()
>> because cpu_disable_common() releases the vector_lock after the CPU is
>> flagged offline.
>
> Nothing can schedule vector cleanup at that point because _all_ other
> CPUs spin in stop_machine() with interrupts disabled and therefore
> cannot handle interrupts which might invoke it.

Thank you very much for the explanation! Now I see why dropping the
vector_lock is not an issue.

[snip]

>
>> diff --git a/kernel/irq/cpuhotplug.c b/kernel/irq/cpuhotplug.c
>> index 1ed2b1739363..5ecd072a34fe 100644
>> --- a/kernel/irq/cpuhotplug.c
>> +++ b/kernel/irq/cpuhotplug.c
>> @@ -69,6 +69,14 @@ static bool migrate_one_irq(struct irq_desc *desc)
>>  		return false;
>>  	}
>>
>> +	/*
>> +	 * Complete an eventually pending irq move cleanup. If this
>> +	 * interrupt was moved in hard irq context, then the vectors need
>> +	 * to be cleaned up. It can't wait until this interrupt actually
>> +	 * happens and this CPU was involved.
>> +	 */
>> +	irq_force_complete_move(desc);
>
> You cannot do that here because it is only valid when the interrupt is
> affine to the outgoing CPU.
>
> In the problem case the interrupt was affine to the outgoing CPU, but
> the core code does not know that it has not been cleaned up yet. It does
> not even know that the interrupt was affine to the outgoing CPU before.
>
> So in principle we could just go and do:
>
>  	} else {
> -		apicd->prev_vector = 0;
> +		free_moved_vector(apicd);
>  	}
>  	raw_spin_unlock(&vector_lock);
>
> but that would not give enough information and would depend on the
> interrupt to actually arrive.
>
> irq_force_complete_move() already describes this case, but obviously it
> does not work because the core code does not invoke it in that
> situation.
>
> So yes, moving the invocation of irq_force_complete_move() before the
> irq_needs_fixup() call makes sense, but it wants this to actually work
> correctly:
>
> --- a/arch/x86/kernel/apic/vector.c
> +++ b/arch/x86/kernel/apic/vector.c
> @@ -1036,7 +1036,8 @@ static void __vector_schedule_cleanup(st
>  			add_timer_on(&cl->timer, cpu);
>  		}
>  	} else {
> -		apicd->prev_vector = 0;
> +		pr_warn("IRQ %u schedule cleanup for offline CPU %u\n", apicd->irq, cpu);
> +		free_moved_vector(apicd);
>  	}
>  	raw_spin_unlock(&vector_lock);
>  }
> @@ -1097,10 +1098,11 @@ void irq_force_complete_move(struct irq_
>  		goto unlock;
>
>  	/*
> -	 * If prev_vector is empty, no action required.
> +	 * If prev_vector is empty or the descriptor was previously
> +	 * not on the outgoing CPU no action required.
>  	 */
>  	vector = apicd->prev_vector;
> -	if (!vector)
> +	if (!vector || apicd->prev_cpu != smp_processor_id())
>  		goto unlock;
>

The above may not work. migrate_one_irq() relies on irq_force_complete_move()
to always reclaim apicd->prev_vector. Otherwise, the later call to
irq_do_set_affinity() may return -EBUSY from assign_vector_locked():

	/*
	 * Careful here. @apicd might either have move_in_progress set or
	 * be enqueued for cleanup. Assigning a new vector would either
	 * leave a stale vector on some CPU around or in case of a pending
	 * cleanup corrupt the hlist.
	 */
	if (apicd->move_in_progress || !hlist_unhashed(&apicd->clist))
		return -EBUSY;
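If I read the code correctly, the failing sequence on the outgoing CPU would
be roughly as below (a sketch; the intermediate steps of the affinity path
are elided):

    migrate_one_irq()
      irq_force_complete_move()
        /* prev_cpu != smp_processor_id() --> goto unlock, prev_vector kept */
      irq_do_set_affinity()
        ...
          assign_vector_locked()
            /* move_in_progress or !hlist_unhashed(&apicd->clist) --> -EBUSY */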
Or maybe add a flag to control whether to enforce the cleanup under all
conditions?

-void irq_force_complete_move(struct irq_desc *desc)
+void irq_force_complete_move(struct irq_desc *desc, bool always_force)
 {
 	struct apic_chip_data *apicd;
 	struct irq_data *irqd;
@@ -1102,6 +1103,9 @@ void irq_force_complete_move(struct irq_desc *desc)
 	if (!vector)
 		goto unlock;

+	if (!always_force && apicd->prev_cpu != smp_processor_id())
+		goto unlock;
+
 	/*
 	 * This is tricky. If the cleanup of the old vector has not been
 	 * done yet, then the following setaffinity call will fail with
...

... and call irq_force_complete_move() with different arguments at the two
call sites in migrate_one_irq()?

@@ -79,6 +79,11 @@ static bool migrate_one_irq(struct irq_desc *desc)
 	 * interrupt.
 	 */
 	if (irqd_is_per_cpu(d) || !irqd_is_started(d) || !irq_needs_fixup(d)) {
+		/*
+		 * Complete an eventually pending irq move cleanup if this
+		 * interrupt was affine to the outgoing CPU.
+		 */
+		irq_force_complete_move(desc, false);
 		/*
 		 * If an irq move is pending, abort it if the dying CPU is
 		 * the sole target.
@@ -93,7 +98,7 @@ static bool migrate_one_irq(struct irq_desc *desc)
 	 * to be cleaned up. It can't wait until this interrupt actually
 	 * happens and this CPU was involved.
 	 */
-	irq_force_complete_move(desc);
+	irq_force_complete_move(desc, true);

 	/*
 	 * If there is a setaffinity pending, then try to reuse the pending

Thank you very much!

Dongli Zhang