Hi Alex, On 19/09/2017 22:20, Alex Williamson wrote: > [cc +linux-pci, Sinan] > > On Tue, 19 Sep 2017 19:50:37 +0200 > Auger Eric <eric.auger@xxxxxxxxxx> wrote: > >> Hi Alex, >> >> On 19/09/2017 18:58, Alex Williamson wrote: >>> With virtual PCI-Express chipsets, we now see userspace/guest drivers >>> trying to match the physical MPS setting to a virtual downstream port. >>> Of course a lone physical device surrounded by virtual interconnects >>> cannot make a correct decision for a proper MPS setting. Instead, >>> let's virtualize the MPS control register so that writes through to >>> hardware are disallowed. Userspace drivers like QEMU assume they can >>> write anything to the device and we'll filter out anything dangerous. >>> Since mismatched MPS can lead to AER and other faults, let's add it >>> to the kernel side rather than relying on userspace virtualization to >>> handle it. >>> >>> Signed-off-by: Alex Williamson <alex.williamson@xxxxxxxxxx> >>> --- >>> >>> Do we have any reason to suspect that a userspace driver has any >>> dependencies on the physical MPS setting or is this only tuning the >>> protocol layer and it's transparent to the driver? Note that per the >>> PCI spec, a device supporting only 128B MPS can hardwire the control >>> register to 000b, but it doesn't seem PCIe compliant to hardwire it to >>> any given value, such as would be the appearance if we exposed this as >>> a read-only register rather than virtualizing it. QEMU would then be >>> responsible for virtualizing it, which makes coordinating the upgrade >>> troublesome. >>> >>> drivers/vfio/pci/vfio_pci_config.c | 6 ++++-- >>> 1 file changed, 4 insertions(+), 2 deletions(-) >>> >>> diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c >>> index 5628fe114347..91335e6de88a 100644 >>> --- a/drivers/vfio/pci/vfio_pci_config.c >>> +++ b/drivers/vfio/pci/vfio_pci_config.c >>> @@ -849,11 +849,13 @@ static int __init init_pci_cap_exp_perm(struct perm_bits *perm) >>> >>> /* >>> * Allow writes to device control fields, except devctl_phantom, >>> - * which could confuse IOMMU, and the ARI bit in devctl2, which >>> + * which could confuse IOMMU, MPS, which can break communication >>> + * with other physical devices, and the ARI bit in devctl2, which >>> * is set at probe time. FLR gets virtualized via our writefn. >>> */ >>> p_setw(perm, PCI_EXP_DEVCTL, >>> - PCI_EXP_DEVCTL_BCR_FLR, ~PCI_EXP_DEVCTL_PHANTOM); >>> + PCI_EXP_DEVCTL_BCR_FLR | PCI_EXP_DEVCTL_PAYLOAD, >>> + ~PCI_EXP_DEVCTL_PHANTOM); >>> p_setw(perm, PCI_EXP_DEVCTL2, NO_VIRT, ~PCI_EXP_DEVCTL2_ARI); >> Is it correct that the read value still will be the one written by the >> guest? > > If we don't do this, the register is read-only, which is what I'm > suggesting in my comment above doesn't seem spec compliant. If the > kernel makes it read-only, then userspace (ex. QEMU) would need to > virtualize it, and we get into a more complicated, two part fix. > >> I see the MMRS can take the read MPS value in some pcie_bus_config >> values. So a consequence could be that the applied MMRS (which is not >> virtualized) is lower than what is set by host, due to a guest pcie root >> port MPSS for instance. >> >> So if the above is not totally wrong, shouldn't we virtualize MMRS as well? > > I think MPSS is Maximum Payload Size Supported and MMRS... did you mean > MRRS (Maximum Read Request Size)? Yes sorry for the typo. Indeed I meant MRRS. > > My impression is that MRRS is predominantly device and driver > dependent, not topology dependent. A device can send a read request > with a size larger than MPS, which implies that the device supplying > the read data would split it into multiple TLPs based on MPS. I read that too on the net. However in in 6.3.4.1. (3.0. Nov 10), Rules for SW Configuration it is written: "Software must set Max_Read_Request_Size of an isochronous-configured device with a value that does not exceed the Max_Payload_Size set for the device." But on the the other hand some drivers are setting the MMRS directly without further checking the MPS? The > implementation note in PCIe rev3.1 sec 7.8.4 suggests there may be QoS > implications of MRRS such that if MRRS>MPS, then the device requesting > the data may use fewer tokens for the request. I have no idea how we > could infer any sort of QoS policy though and I don't see that > in-kernel drivers are bound to any sort of policy. > > The pcie_write_mrrs() function also implies a protocol (which I can't > find a spec reference to) where it will re-try writing MRRS if the > value written doesn't stick (and generate a dev_warn on each > iteration). Unless we want extra warnings in VMs, that suggests we > don't want this to be read-only. > > So I think we need to allow MRRS writes through to hardware, but I do > question whether we need to fill the gap that a VM might casually write > MRRS to a value less than the physical MPS. This is what we'd expect > today where the virtual topology has an MPS of 128B regardless of the > physical topology. Do we virtualize MRRS writes such that an MRRS > value less the physical MPS value never reaches hardware? Can we > assume that's always invalid? Thanks, Thanks Eric > > Alex >