Re: [PATCH] vfio/pci: Virtualize Maximum Payload Size

Auger Eric <eric.auger@xxxxxxxxxx> · Wed, 20 Sep 2017 09:59:31 +0200

Hi Alex,

On 19/09/2017 22:20, Alex Williamson wrote:
> [cc +linux-pci, Sinan]
> 
> On Tue, 19 Sep 2017 19:50:37 +0200
> Auger Eric <eric.auger@xxxxxxxxxx> wrote:
> 
>> Hi Alex,
>>
>> On 19/09/2017 18:58, Alex Williamson wrote:
>>> With virtual PCI-Express chipsets, we now see userspace/guest drivers
>>> trying to match the physical MPS setting to a virtual downstream port.
>>> Of course a lone physical device surrounded by virtual interconnects
>>> cannot make a correct decision for a proper MPS setting.  Instead,
>>> let's virtualize the MPS control register so that writes through to
>>> hardware are disallowed.  Userspace drivers like QEMU assume they can
>>> write anything to the device and we'll filter out anything dangerous.
>>> Since mismatched MPS can lead to AER and other faults, let's add it
>>> to the kernel side rather than relying on userspace virtualization to
>>> handle it.
>>>
>>> Signed-off-by: Alex Williamson <alex.williamson@xxxxxxxxxx>
>>> ---
>>>
>>> Do we have any reason to suspect that a userspace driver has any
>>> dependencies on the physical MPS setting or is this only tuning the
>>> protocol layer and it's transparent to the driver?  Note that per the
>>> PCI spec, a device supporting only 128B MPS can hardwire the control
>>> register to 000b, but it doesn't seem PCIe compliant to hardwire it to
>>> any given value, such as would be the appearance if we exposed this as
>>> a read-only register rather than virtualizing it.  QEMU would then be
>>> responsible for virtualizing it, which makes coordinating the upgrade
>>> troublesome.
>>>
>>>  drivers/vfio/pci/vfio_pci_config.c |    6 ++++--
>>>  1 file changed, 4 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
>>> index 5628fe114347..91335e6de88a 100644
>>> --- a/drivers/vfio/pci/vfio_pci_config.c
>>> +++ b/drivers/vfio/pci/vfio_pci_config.c
>>> @@ -849,11 +849,13 @@ static int __init init_pci_cap_exp_perm(struct perm_bits *perm)
>>>  
>>>  	/*
>>>  	 * Allow writes to device control fields, except devctl_phantom,
>>> -	 * which could confuse IOMMU, and the ARI bit in devctl2, which
>>> +	 * which could confuse IOMMU, MPS, which can break communication
>>> +	 * with other physical devices, and the ARI bit in devctl2, which
>>>  	 * is set at probe time.  FLR gets virtualized via our writefn.
>>>  	 */
>>>  	p_setw(perm, PCI_EXP_DEVCTL,
>>> -	       PCI_EXP_DEVCTL_BCR_FLR, ~PCI_EXP_DEVCTL_PHANTOM);
>>> +	       PCI_EXP_DEVCTL_BCR_FLR | PCI_EXP_DEVCTL_PAYLOAD,
>>> +	       ~PCI_EXP_DEVCTL_PHANTOM);
>>>  	p_setw(perm, PCI_EXP_DEVCTL2, NO_VIRT, ~PCI_EXP_DEVCTL2_ARI);  
>> Is it correct that the read value still will be the one written by the
>> guest?
> 
> If we don't do this, the register is read-only, which is what I'm
> suggesting in my comment above doesn't seem spec compliant.  If the
> kernel makes it read-only, then userspace (ex. QEMU) would need to
> virtualize it, and we get into a more complicated, two part fix.
>  
>> I see the MMRS can take the read MPS value in some pcie_bus_config
>> values. So a consequence could be that the applied MMRS (which is not
>> virtualized) is lower than what is set by host, due to a guest pcie root
>> port MPSS for instance.
>>
>> So if the above is not totally wrong, shouldn't we virtualize MMRS as well?
> 
> I think MPSS is Maximum Payload Size Supported and MMRS... did you mean
> MRRS (Maximum Read Request Size)?
Yes sorry for the typo. Indeed I meant MRRS.
> 
> My impression is that MRRS is predominantly device and driver
> dependent, not topology dependent.  A device can send a read request
> with a size larger than MPS, which implies that the device supplying
> the read data would split it into multiple TLPs based on MPS.
I read that too on the net. However in in 6.3.4.1. (3.0. Nov 10), Rules
for SW Configuration it is written:
"Software must set Max_Read_Request_Size of an isochronous-configured
device with a value that does not exceed the Max_Payload_Size set for
the device."

But on the the other hand some drivers are setting the MMRS directly
without further checking the MPS?
  The
> implementation note in PCIe rev3.1 sec 7.8.4 suggests there may be QoS
> implications of MRRS such that if MRRS>MPS, then the device requesting
> the data may use fewer tokens for the request.  I have no idea how we
> could infer any sort of QoS policy though and I don't see that
> in-kernel drivers are bound to any sort of policy.
> 
> The pcie_write_mrrs() function also implies a protocol (which I can't
> find a spec reference to) where it will re-try writing MRRS if the
> value written doesn't stick (and generate a dev_warn on each
> iteration).  Unless we want extra warnings in VMs, that suggests we
> don't want this to be read-only.
> 
> So I think we need to allow MRRS writes through to hardware, but I do
> question whether we need to fill the gap that a VM might casually write
> MRRS to a value less than the physical MPS.  This is what we'd expect
> today where the virtual topology has an MPS of 128B regardless of the
> physical topology.  Do we virtualize MRRS writes such that an MRRS
> value less the physical MPS value never reaches hardware?  Can we
> assume that's always invalid?  Thanks,

Thanks

Eric
> 
> Alex
>