Re: [PATCH 07/13] irqdomain: Add max_affinity argument to irq_domain_alloc_descs()

On Thu, Oct 08 2020 at 08:21, David Woodhouse wrote:
> On Wed, 2020-10-07 at 17:57 +0200, Thomas Gleixner wrote:
>> Multiqueue devices want to have at most one queue per CPU, or if the
>> device has fewer queues than CPUs, they want the queues to have a
>> fixed association to a set of CPUs.
>> 
>> At setup time this is established considering possible CPUs to handle
>> 'physical' hotplug correctly.
>> 
>> If a queue has no online CPUs it cannot be started. If it's active and
>> the last CPU goes down then it's quiesced and stopped and the core code
>> shuts down the interrupt and does not move it to a still online CPU.
>> 
>> So with your hackery, we end up in a situation where we have a large
>> possible mask, but not all CPUs in that mask can be reached, which means
>> in a 1 queue per CPU scenario all unreachable CPUs would have
>> dysfunctional queues.
>> 
>> So that spreading algorithm needs to know about this limitation.
>
> OK, thanks. So the queue exists, with an MSI assigned to point at
> offline CPU(s), but it cannot actually be used until/unless at least
> one CPU in its mask comes online.

The MSI entry in that case is actually directed to an online CPU's
MANAGED_IRQ_SHUTDOWN_VECTOR to catch cases where an interrupt is raised
by the device after shutdown.
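
As a rough illustration of that redirect, here is a conceptual sketch.
Only MANAGED_IRQ_SHUTDOWN_VECTOR and the cpumask accessor are real;
hypothetical_update_msi_target() is a made-up stand-in for the arch
specific code which rewrites the MSI message, not an existing function:

  /*
   * Conceptual sketch only: when the last CPU in a managed interrupt's
   * affinity mask goes offline, the vector is not left dangling. The
   * MSI entry is pointed at a catch-all vector on some online CPU so a
   * late interrupt from the device is caught instead of hitting a
   * stale vector.
   */
  static void managed_shutdown_redirect(struct irq_data *irqd)
  {
          /* Any online CPU can host the catch-all vector. */
          unsigned int cpu = cpumask_first(cpu_online_mask);

          /* Stand-in for the arch specific MSI message update. */
          hypothetical_update_msi_target(irqd, MANAGED_IRQ_SHUTDOWN_VECTOR,
                                         cpu);
  }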

> So when I said I wanted to try treating "reachable" the same way as
> "online", that would mean the queue can't start until/unless at least
> one *reachable* CPU in its mask comes online.
>
> The underlying problem here is that until a CPU comes online, we don't
> actually *know* if it's reachable or not.

It's known before onlining, i.e. when the CPU is registered, which
happens either at boot time for present CPUs or at 'physical' hotplug.
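
On x86 without interrupt remapping the limiting factor is the 8 bit MSI
destination ID, so a check along these lines could in principle be made
at registration time. apic_id_msi_addressable() is a hypothetical
helper for illustration, not an existing kernel function:

  /*
   * Hypothetical helper: without interrupt remapping the MSI
   * destination ID field is 8 bits wide, so APIC IDs >= 256 cannot be
   * targeted at all.
   */
  static bool apic_id_msi_addressable(u32 apicid)
  {
          return apicid < 256;
  }

Since the APIC ID is available at registration, such a check does not
have to wait for the CPU to actually come online.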

> So if we want to carefully create the affinity masks at setup time so that
> they don't include any unreachable CPUs... that basically means we
> don't include any non-present CPUs at all (unless they've been added
> once and then removed).

That breaks _all_ multi-queue assumptions in one go. :)

> But those really do seem like hacks which might only apply on x86,
> while the generic approach of treating "reachable" like "online" seems
> like it would work in other cases too.
>
> Fundamentally, there are three sets of CPUs. There are those known to
> be reachable, those known not to be, and those which are not yet
> known.

Unfortunately there are lots of assumptions all over the place that
possible CPUs are reachable. Multi-queue using managed interrupts is
just the tip of the iceberg.

> So another approach we could use is to work with a cpumask of those
> *known* not to be reachable, and to filter those *out* of the prebuilt
> affinities. That gives us basically the right behaviour without
> hotplug, but does include absent CPUs in a mask that, *if* they are
> ever added, wouldn't be able to receive the IRQ. Which does mean we'd
> have to refrain from bringing up the corresponding queue.

The multi-queue drivers rely on the interrupt setup to create their
queues and the fundamental assumption is that this setup works. The
managed interrupt mechanism guarantees that the queue has a vector
available on all CPUs which are in the queue's assigned affinity mask. As
of today it also guarantees that these CPUs are reachable once they come
online.
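
For reference, this is roughly how a multi-queue driver asks for that
guarantee today. pci_alloc_irq_vectors_affinity() and the PCI_IRQ_*
flags are the existing core API; the example_ wrapper is made up for
the sketch:

  #include <linux/interrupt.h>
  #include <linux/pci.h>

  /* Illustrative wrapper, not real driver code. */
  static int example_alloc_queue_irqs(struct pci_dev *pdev,
                                      unsigned int nr_queues)
  {
          struct irq_affinity affd = {
                  /* One config/admin vector kept out of the spread. */
                  .pre_vectors = 1,
          };

          /*
           * PCI_IRQ_AFFINITY makes the queue vectors managed: the core
           * spreads them over the possible CPUs, reserves a vector on
           * every CPU in each queue's mask and shuts the interrupt
           * down when the last CPU of that mask goes offline.
           */
          return pci_alloc_irq_vectors_affinity(pdev, 1, nr_queues + 1,
                                                PCI_IRQ_MSIX | PCI_IRQ_AFFINITY,
                                                &affd);
  }

The driver then sizes its queues from the returned vector count and the
resulting affinity masks, which is exactly where the 'every CPU in the
mask is reachable' assumption is baked in.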

So in order to make that work you'd need to teach the multi-queue stuff
about this new world order:

 1) On hotplug the queue needs to be able to figure out whether the
    interrupt is functional. If not, it has to redirect any requests to
    some actually functional queue.

 2) On unplug it needs to be able to figure out whether the interrupt
    will shut down because the outgoing CPU is the last reachable one in
    the group, and if there are still online but unreachable CPUs, then
    use the redirect mechanism.

I'm sure that the multi-queue people will be enthusiastic to add all of
this and deal with all the nasty corner cases coming out of it.
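
Just to make the scope concrete, such a redirect would presumably hang
off a CPU hotplug state along the lines of the sketch below. Only the
cpuhp_* and cpumask_* calls are existing kernel API; everything
prefixed with example_ is invented for illustration:

  #include <linux/cpu.h>
  #include <linux/cpuhotplug.h>
  #include <linux/cpumask.h>
  #include <linux/init.h>

  /* Hypothetical per-queue state, not real block or driver code. */
  struct example_queue {
          struct cpumask reachable_mask;  /* CPUs which can take the IRQ */
          bool redirected;                /* requests go to another queue */
  };

  /* Hypothetical lookup/redirect hooks the driver would have to provide. */
  static struct example_queue *example_cpu_to_queue(unsigned int cpu);
  static void example_queue_redirect(struct example_queue *q, bool on);

  static int example_queue_cpu_online(unsigned int cpu)
  {
          struct example_queue *q = example_cpu_to_queue(cpu);

          /* 1) The queue becomes functional once a reachable CPU of its
           *    mask is online, so the redirect can be undone. */
          if (q->redirected && cpumask_test_cpu(cpu, &q->reachable_mask))
                  example_queue_redirect(q, false);
          return 0;
  }

  static int example_queue_cpu_offline(unsigned int cpu)
  {
          struct example_queue *q = example_cpu_to_queue(cpu);
          unsigned int other;

          /* 2) If no reachable CPU other than the outgoing one stays
           *    online, the interrupt will be shut down and requests have
           *    to be redirected to a still functional queue. */
          for_each_cpu_and(other, &q->reachable_mask, cpu_online_mask) {
                  if (other != cpu)
                          return 0;
          }
          example_queue_redirect(q, true);
          return 0;
  }

  static int __init example_register_hotplug(void)
  {
          int ret;

          /* CPUHP_AP_ONLINE_DYN returns the allocated state on success. */
          ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "example/queue:online",
                                  example_queue_cpu_online,
                                  example_queue_cpu_offline);
          return ret < 0 ? ret : 0;
  }

And that still leaves the actual request redirection and all the error
handling to the individual drivers or the block layer.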

The overall conclusion for this is:

 1) X2APIC support on bare metal w/o irq remapping is not going to
    happen unless you:

      - added support in multi-queue devices which utilize managed
        interrupts
        
      - audited the whole tree for other assumptions related to the
        reachability of possible CPUs.

    I'm not expecting you to be done with that before I retire so for
    me it's just not going to happen :)

 2) X2APIC support on VIRT is possible if the extended ID magic is
    supported by the hypervisor because that does not make any CPU
    unreachable for MSI and therefore the multi-queue muck and
    everything else just works.

    This requires either having the domain affinity limitation for HPET
    in place, or just force-disabling HPET, or at least HPET-MSI, which
    is a reasonable tradeoff (command line example below).

    HPET is not required for guests which have kvmclock and
    APIC/deadline timer and known (hypervisor provided) frequencies.
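
    For force-disabling HPET altogether the existing command line knobs
    are sufficient, e.g. booting with:

        hpet=disable

    or the x86 shorthand:

        nohpet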

Anything else is just wishful thinking, really.

Thanks,

        tglx


