Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection

Thomas Gleixner <tglx@xxxxxxxxxxxxx> · Sun, 08 Nov 2020 19:47:24 +0100

On Fri, Nov 06 2020 at 20:12, Jason Gunthorpe wrote:
> All IMS device drivers will work correctly. No VMM device emulation is
> ever needed to translate addr/data pairs.
>
> Earlier in this thread Kevin said hyper-v is already working this way,
> even for MSI/MSI-X. To me this says it is fundamentally a KVM platform
> problem and it should not be solved by PCI capability flags.

I mostly agree but want to add a few clarifications about the
terminology and the boundaries because I think there is where lot of the
confusion comes from.

Let me go back to the basic structure both at the hardware and at the
software level.

The basic structure is:

  [CPU] -- [Bridge] -- Bus -- [Device]

This applies to all kind of buses where the bridge directly translates
into the CPUs address space. Now let's look at the boundaries:

                |
                |
  [CPU] -- [Bri | dge] -- Bus -- [Device]
                |   
                |

The boundary is in the middle of the bridge because the CPU side of the
bridge is obviously CPU and therefore architecture specific. The Bus
side of the bridge is architecture agnostic.

Now let's add an IOMMU:

  [CPU] -- [IOMMU] -- [Bridge] -- Bus -- [Device]

and in theory the boundary moves now to:

               |
               |
  [CPU] -- [IO | MMU] -- [Bridge] -- Bus -- [Device]
               |
               |

because with an IOMMU the bridge could become CPU and architecture
agnostic. In reality this is not the case as the bridge is still the
same thing.

Now let's look at MSI. As established above, the Bus and the Device are
CPU and architecture agnostic and the Device merily uses a composed
message which is stored at some place accessible to the device to send
that message when it raises an interrupt. So where is this message
composed?

The basic case:

                   |
                   |
  [CPU]    -- [Bri | dge] -- Bus -- [Device]
                   |
  Alloc +           
  Compose                   Store     Use

The Bridge is irrelevant here as it just is involved in the
transport. Nevertheless the Bridge is only transport in the view of the
interrupt subsystem.

The IOMMU case:

               |
               |
  [CPU] -- [IO | MMU] -- [Bridge] -- Bus -- [Device]
               |
            Alloc +
  Alloc     Compose                 Store     Use

That's exactly reflected in hierarchical irq domains:

                       |
                       |
  [CPU]        -- [Bri | dge] --    Bus    -- [Device]
                       |   
  Alloc +           
  Compose                         Store        Use

  Vectordomain                   Busdomain

and:

                     |
                     |
  [CPU]       -- [IO | MMU]  -- [Bridge] --    Bus    -- [Device]
                     |
                  Alloc +   
  Alloc           Compose                    Store       Use

  Vectordomain   Remapdomain                Busdomain

Now if we look at the virtualization scenario and device hand through
then the structure in the guest view is not any different from the basic
case. This works with PCI-MSI[X] and the IDXD IMS variant because the
hypervisor can trap the access to the storage and translate the message:

                   |
                   |
  [CPU]    -- [Bri | dge] -- Bus -- [Device]
                   |
  Alloc +
  Compose                   Store     Use
                             |
                             | Trap
                             v
                             Hypervisor translates and stores

But obviously with an IMS storage location which is software controlled
by the guest side driver (the case Jason is interested in) the above
cannot work for obvious reasons.

That means the guest needs a way to ask the hypervisor for a proper
translation, i.e. a hypercall. Now where to do that? Looking at the
above remapping case it's pretty obvious:

                     |
                     |
  [CPU]       -- [VI | RT]  -- [Bridge] --    Bus    -- [Device]
                     |
  Alloc          "Compose"                   Store         Use

  Vectordomain   HCALLdomain                Busdomain
                 |        ^
                 |        |
                 v        | 
            Hypervisor    
               Alloc + Compose

Why? Because it reflects the boundaries and leaves the busdomain part
agnostic as it should be. And it works for _all_ variants of Busdomains.

Now the question which I can't answer is whether this can work correctly
in terms of isolation. If the IMS storage is in guest memory (queue
storage) then the guest driver can obviously write random crap into it
which the device will happily send. (For MSI and IDXD style IMS it
still can trap the store).

Is the IOMMU/Interrupt remapping unit able to catch such messages which
go outside the space to which the guest is allowed to signal to? If yes,
problem solved. If no, then IMS storage in guest memory can't ever work.

Coming back to this:

> In the end pci_subdevice_msi_create_irq_domain() is a platform
> function. Either it should work completely on every device with no
> device-specific emulation required in the VMM, or it should not work
> at all and return -EOPNOTSUPP.

The subdevice domain is a 'Busdomain' according to the structure
above. It does not and should never have any clue about the underlying
system. It's in the agnostic part and always works. It simply does not
care what's underneath. So it won't return -EOPNOTSUPP.

What it has to do is to transport the IMS in queue memory requirement to
the underlying parent domain.

So in case that the HCALL domain is missing, the Vector domain needs
return an error code on domain creation. If the HCALL domain is there
then the domain creation works and in case of actual interrupt
allocation the hypercall either returns a valid composed message or an
appropriate error code.

But there's a catch:

This only works when the guest OS actually knows that it runs in a
VM. If the guest can't figure that out, i.e. via CPUID, this cannot be
solved because from the guest OS view that's the same as running on bare
metal. Obviously on bare metal the Vector domain can and must handle
this.

So this needs some thought.

Thanks,

        tglx