Re: [RFC PATCH v3 3/3] genirq/cpuhotplug: Adjust managed irqs according to change of housekeeping CPU

On Tue, Oct 29 2024 at 14:05, Costa Shulyupin wrote:
> index afc920116d42..44c7da0e1b8d 100644
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -171,7 +171,7 @@ static bool cpuhp_step_empty(bool bringup, struct cpuhp_step *step)
>   *
>   * Return: %0 on success or a negative errno code
>   */
> -static int cpuhp_invoke_callback(unsigned int cpu, enum cpuhp_state state,
> +int cpuhp_invoke_callback(unsigned int cpu, enum cpuhp_state state,
>  				 bool bringup, struct hlist_node *node,
>  				 struct hlist_node **lastp)

This is deep internal functionality of CPU hotplug and is only valid to
call when the hotplug lock is write held, or when it is read held _and_
the state mutex is held.

Otherwise it is completely unprotected against a concurrent state or
instance insertion/removal and concurrent invocations of this function.

And no, we are not going to expose the state mutex just because. CPU
hotplug is complex enough already and we really don't need more side
channels into it.
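
To make that constraint concrete, here is a minimal sketch (illustration
only, not a proposed interface, assuming a hypothetical helper living
inside kernel/cpu.c) of the protection any invocation needs:

#include <linux/cpu.h>
#include <linux/cpuhotplug.h>

/* Hypothetical helper: invoke a single state callback safely. */
static int example_invoke(unsigned int cpu, enum cpuhp_state state,
			  bool bringup)
{
	int ret;

	/*
	 * The write-held hotplug lock excludes concurrent state or
	 * instance setup/removal and concurrent invocations.
	 */
	cpus_write_lock();
	ret = cpuhp_invoke_callback(cpu, state, bringup, NULL, NULL);
	cpus_write_unlock();

	return ret;
}

With only cpus_read_lock() held, the caller would additionally need the
state mutex, which is static to kernel/cpu.c and stays that way.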

There is another issue with this approach in general:

   1) The three block states are just the tip of the iceberg. You are
      going to play a whack-a-mole game adding other subsystems/drivers
      as well.

   2) The whole logic has ordering constraints. The states have strict
      ordering for a reason. So what guarantees that e.g. BLK_MQ_ONLINE
      has no dependencies on non-BLK-related states being invoked before
      it? I'm failing to see a correctness analysis here.

      Just because it did not explode right away does not make it
      correct. We've had enough subtle problems with ordering and
      dependencies in the past. No need to introduce new ones.

CPU hotplug solves this problem without any hackery. Take a CPU offline,
change the mask of that CPU and bring it online again. Repeat until all
CPU changes are done.
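
As a rough userspace illustration of that sequence (assuming root and
the standard sysfs CPU hotplug files; the mask-update step is left as a
placeholder, since providing that interface is what this series is
about):

#include <stdio.h>
#include <stdlib.h>

/* Write 0 or 1 to /sys/devices/system/cpu/cpuN/online. */
static void cpu_set_online(unsigned int cpu, int online)
{
	char path[64];
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/devices/system/cpu/cpu%u/online", cpu);
	f = fopen(path, "w");
	if (!f) {
		perror(path);
		exit(EXIT_FAILURE);
	}
	fprintf(f, "%d\n", online);
	fclose(f);
}

int main(void)
{
	unsigned int cpu = 3;	/* example CPU whose isolation changes */

	cpu_set_online(cpu, 0);	/* take it offline */
	/*
	 * ... update the housekeeping/isolation mask for this CPU here,
	 * via whatever interface is provided for that (placeholder) ...
	 */
	cpu_set_online(cpu, 1);	/*
				 * bring it back online; the existing
				 * hotplug callbacks re-evaluate the
				 * managed interrupt affinities
				 */
	return 0;
}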

If some user space component cannot deal with that, then fix that
instead of inflicting fragile and unmaintainable complexity on the
kernel. That kubernetes problem has been known since 2018 and nobody has
actually sat down and solved it. Now we waste another six years making
it "work" magically in the kernel.

This needs userspace awareness anyway. If you isolate a CPU, then tasks
or containers which are assigned to that CPU need to move away and the
container has to exclude that CPU. If you remove the isolation, then
what magically opens up the CPU for existing containers?

I'm not buying any of this "it will just work and nobody will notice"
handwaving.

Thanks,

        tglx



