On Fri, May 10 2024 at 12:06, Dongli Zhang wrote:
> The absence of IRQD_MOVE_PCNTXT means that an interrupt affinity
> reconfiguration via procfs does not take effect immediately. Instead,
> the change is deferred until the interrupt next triggers on the
> original CPU.
>
> When the interrupt next triggers on the original CPU, the new affinity
> is enforced within __irq_move_irq(). A vector is allocated on the new
> CPU, but if the original CPU is still online, its old vector is not
> reclaimed immediately. Instead, apicd->move_in_progress is flagged, and
> reclaiming is deferred until the interrupt next triggers on the new
> CPU.
>
> When the interrupt subsequently triggers on the new CPU,
> irq_complete_move() adds an entry to the old CPU's vector_cleanup list
> if that CPU is still online. The cleanup timer on the old CPU then
> iterates over its vector_cleanup list and reclaims the vectors.
>
> However, if the old CPU goes offline before the interrupt triggers
> again on the new CPU, irq_complete_move() simply resets both
> apicd->move_in_progress and apicd->prev_vector to 0. Consequently, the
> vector is never reclaimed in vector_matrix, resulting in a CPU vector
> leak.

I doubt that. Any interrupt which is affine to an outgoing CPU is
migrated and eventually pending moves are enforced:

    cpu_down()
      ...
        cpu_disable_common()
          fixup_irqs()
            irq_migrate_all_off_this_cpu()
              migrate_one_irq()
                irq_force_complete_move()
                  free_moved_vector();

No?

In fact irq_complete_move() should never see apicd->move_in_progress
with apicd->prev_cpu pointing to an offline CPU.

The CPU offline case in __vector_schedule_cleanup() should not even
exist, or should at least just emit a warning.

If you can trigger that case, then there is something fundamentally
wrong with the CPU hotplug interrupt migration code and that needs to
be investigated and fixed.

Thanks,

        tglx
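
P.S. For reference, the offline branch in __vector_schedule_cleanup()
under discussion looks roughly like the sketch below. This is an
abridged reconstruction from the description above, not verbatim kernel
source: the cleanup-timer machinery is elided, and the WARN_ON_ONCE()
is merely the warning suggested above, not existing code.

    static void __vector_schedule_cleanup(struct apic_chip_data *apicd)
    {
            unsigned int cpu = apicd->prev_cpu;

            raw_spin_lock(&vector_lock);
            apicd->move_in_progress = 0;
            if (cpu_online(cpu)) {
                    /*
                     * Queue the stale vector on the old CPU's
                     * vector_cleanup list; the per-CPU cleanup timer
                     * reclaims it later. (Elided.)
                     */
            } else {
                    /*
                     * The old CPU is already offline: the previous
                     * vector is forgotten without ever being freed in
                     * vector_matrix. Per the argument above this path
                     * should be unreachable, hence the warning.
                     */
                    WARN_ON_ONCE(1);
                    apicd->prev_vector = 0;
            }
            raw_spin_unlock(&vector_lock);
    }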