On Tue, Sep 13, 2022 at 02:52:44PM +0100, Alexandru Elisei wrote: > Hi, > > On Thu, Sep 08, 2022 at 03:42:09PM +0100, Jean-Philippe Brucker wrote: > > Although the PCI Status register only contains read-only and > > write-1-to-clear bits, we currently keep anything written there, which > > can confuse a guest. > > > > The problem was highlighted by recent Linux commit 6cd514e58f12 ("PCI: > > Clear PCI_STATUS when setting up device"), which unconditionally writes > > 0xffff to the Status register in order to clear pending errors. Then the > > EDAC driver sees the parity status bits set and attempts to clear them > > by writing 0xc100, which in turn clears the Capabilities List bit. > > Later on, when the virtio-pci driver starts probing, it assumes due to > > missing capabilities that the device is using the legacy transport, and > > fails to setup the device because of mismatched protocol. > > > > Filter writes to the config space, keeping only those to writable > > fields. Tighten the access size check while we're at it, to prevent > > overflow. This is only a small step in the right direction, not a > > foolproof solution, because a guest could still write both Command and > > Status registers using a single 32-bit write. More work is needed for: > > * Supporting arbitrary sized writes. > > * Sanitizing accesses to capabilities, which are device-specific. > > * Fine-grained filtering of the Command register, where only some bits > > are writable. > > I'm confused here. Why not do value &= mask to keep only those bits that > writable? Sure, I can add it > > > > > Also remove the old hack that filtered accesses. It was wrong and not > > properly explained in the git history, but whatever it was guarding > > against should be prevented by these new checks. > > If I remember correctly, that was guarding against the guest kernel poking > the ROM base address register for drivers that assumed that the ROM was > always there, I vaguely remember that was the case with GPUs. Pairs with > the similar check in the vfio callback, vfio_pci_cfg_write(). Right, makes sense. I think that's what I assumed when rewriting pci__config_wr() hence the current comment but the original commits didn't say anything about it. > > > > > Reported-by: Pierre Gondois <pierre.gondois@xxxxxxx> > > Signed-off-by: Jean-Philippe Brucker <jean-philippe@xxxxxxxxxx> > > --- > > Note that the issue described here only shows up during ACPI boot for > > me, because edac_init() happens after PCI enumeration. With DT boot, > > edac_pci_clear_parity_errors() runs earlier and doesn't find any device. > > --- > > pci.c | 41 ++++++++++++++++++++++++++++++++--------- > > 1 file changed, 32 insertions(+), 9 deletions(-) > > > > diff --git a/pci.c b/pci.c > > index a769ae27..84dc7d1d 100644 > > --- a/pci.c > > +++ b/pci.c > > @@ -350,6 +350,24 @@ static void pci_config_bar_wr(struct kvm *kvm, > > pci_activate_bar_regions(kvm, old_addr, bar_size); > > } > > > > +/* > > + * Bits that are writable in the config space header. > > + * Write-1-to-clear Status bits are missing since we never set them. > > + */ > > +static const u8 pci_config_writable[PCI_STD_HEADER_SIZEOF] = { > > + [PCI_COMMAND] = > > + PCI_COMMAND_IO | > > + PCI_COMMAND_MEMORY | > > + PCI_COMMAND_MASTER | > > + PCI_COMMAND_PARITY, > > + [PCI_COMMAND + 1] = > > + (PCI_COMMAND_SERR | > > + PCI_COMMAND_INTX_DISABLE) >> 8, > > + [PCI_INTERRUPT_LINE] = 0xff, > > + [PCI_BASE_ADDRESS_0 ... PCI_BASE_ADDRESS_5 + 3] = 0xff, > > + [PCI_CACHE_LINE_SIZE] = 0xff, > > +}; > > + > > void pci__config_wr(struct kvm *kvm, union pci_config_address addr, void *data, int size) > > { > > void *base; > > @@ -357,7 +375,7 @@ void pci__config_wr(struct kvm *kvm, union pci_config_address addr, void *data, > > u16 offset; > > struct pci_device_header *pci_hdr; > > u8 dev_num = addr.device_number; > > - u32 value = 0; > > + u32 value = 0, mask = 0; > > > > if (!pci_device_exists(addr.bus_number, dev_num, 0)) > > return; > > @@ -368,12 +386,12 @@ void pci__config_wr(struct kvm *kvm, union pci_config_address addr, void *data, > > if (pci_hdr->cfg_ops.write) > > pci_hdr->cfg_ops.write(kvm, pci_hdr, offset, data, size); > > > > - /* > > - * legacy hack: ignore writes to uninitialized regions (e.g. ROM BAR). > > - * Not very nice but has been working so far. > > - */ > > - if (*(u32 *)(base + offset) == 0) > > - return; > > + /* We don't sanity-check capabilities for the moment */ > > + if (offset < PCI_STD_HEADER_SIZEOF) { > > + memcpy(&mask, pci_config_writable + offset, size); > > + if (!mask) > > + return; > > Shouldn't this be performed before the VFIO callbacks? Yes I think I can move it up > Also, the vfio callbacks still do the writes to the VFIO in-kernel PCI > header, but now kvmtool would skip those writes entirely. Shouldn't > kvmtool's view of the configuration space be identical to that of VFIO? VFIO also skips writes to read-only fields, so they should now be more in sync than before :) But their views are already desynchronized, because kvmtool doesn't read back the config space virtualized by VFIO after writing to it. We should probably improve it, but that's also for a future patch. > > > + } > > > > if (offset == PCI_COMMAND) { > > memcpy(&value, data, size); > > @@ -419,8 +437,13 @@ static void pci_config_mmio_access(struct kvm_cpu *vcpu, u64 addr, u8 *data, > > cfg_addr.w = (u32)addr; > > cfg_addr.enable_bit = 1; > > > > - if (len > 4) > > - len = 4; > > + /* > > + * "Root Complex implementations are not required to support the > > + * generation of Configuration Requests from accesses that cross DW > > + * [4 bytes] boundaries." > > + */ > > + if ((addr & 3) + len > 4) > > + return; > > Isn't that a change in behaviour? Yes, but it should be safe to change since it is implementation defined. According to the spec 64-bit config space writes through ECAM are not expected to work and I find the old behaviour, truncating the write, worse than rejecting the whole thing. It looks like at least linux, freebsd, u-boot and edk2 don't issue 64-bit writes. Thanks, Jean > How about: > > len = 4 - (addr & 3); > > Which should conform to the spec, but still allow writes like before. > > Thanks, > Alex > > > > > if (is_write) > > pci__config_wr(kvm, cfg_addr, data, len); > > -- > > 2.37.3 > >