Hi, On Thu, Sep 08, 2022 at 03:42:09PM +0100, Jean-Philippe Brucker wrote: > Although the PCI Status register only contains read-only and > write-1-to-clear bits, we currently keep anything written there, which > can confuse a guest. > > The problem was highlighted by recent Linux commit 6cd514e58f12 ("PCI: > Clear PCI_STATUS when setting up device"), which unconditionally writes > 0xffff to the Status register in order to clear pending errors. Then the > EDAC driver sees the parity status bits set and attempts to clear them > by writing 0xc100, which in turn clears the Capabilities List bit. > Later on, when the virtio-pci driver starts probing, it assumes due to > missing capabilities that the device is using the legacy transport, and > fails to setup the device because of mismatched protocol. > > Filter writes to the config space, keeping only those to writable > fields. Tighten the access size check while we're at it, to prevent > overflow. This is only a small step in the right direction, not a > foolproof solution, because a guest could still write both Command and > Status registers using a single 32-bit write. More work is needed for: > * Supporting arbitrary sized writes. > * Sanitizing accesses to capabilities, which are device-specific. > * Fine-grained filtering of the Command register, where only some bits > are writable. I'm confused here. Why not do value &= mask to keep only those bits that writable? > > Also remove the old hack that filtered accesses. It was wrong and not > properly explained in the git history, but whatever it was guarding > against should be prevented by these new checks. If I remember correctly, that was guarding against the guest kernel poking the ROM base address register for drivers that assumed that the ROM was always there, I vaguely remember that was the case with GPUs. Pairs with the similar check in the vfio callback, vfio_pci_cfg_write(). > > Reported-by: Pierre Gondois <pierre.gondois@xxxxxxx> > Signed-off-by: Jean-Philippe Brucker <jean-philippe@xxxxxxxxxx> > --- > Note that the issue described here only shows up during ACPI boot for > me, because edac_init() happens after PCI enumeration. With DT boot, > edac_pci_clear_parity_errors() runs earlier and doesn't find any device. > --- > pci.c | 41 ++++++++++++++++++++++++++++++++--------- > 1 file changed, 32 insertions(+), 9 deletions(-) > > diff --git a/pci.c b/pci.c > index a769ae27..84dc7d1d 100644 > --- a/pci.c > +++ b/pci.c > @@ -350,6 +350,24 @@ static void pci_config_bar_wr(struct kvm *kvm, > pci_activate_bar_regions(kvm, old_addr, bar_size); > } > > +/* > + * Bits that are writable in the config space header. > + * Write-1-to-clear Status bits are missing since we never set them. > + */ > +static const u8 pci_config_writable[PCI_STD_HEADER_SIZEOF] = { > + [PCI_COMMAND] = > + PCI_COMMAND_IO | > + PCI_COMMAND_MEMORY | > + PCI_COMMAND_MASTER | > + PCI_COMMAND_PARITY, > + [PCI_COMMAND + 1] = > + (PCI_COMMAND_SERR | > + PCI_COMMAND_INTX_DISABLE) >> 8, > + [PCI_INTERRUPT_LINE] = 0xff, > + [PCI_BASE_ADDRESS_0 ... PCI_BASE_ADDRESS_5 + 3] = 0xff, > + [PCI_CACHE_LINE_SIZE] = 0xff, > +}; > + > void pci__config_wr(struct kvm *kvm, union pci_config_address addr, void *data, int size) > { > void *base; > @@ -357,7 +375,7 @@ void pci__config_wr(struct kvm *kvm, union pci_config_address addr, void *data, > u16 offset; > struct pci_device_header *pci_hdr; > u8 dev_num = addr.device_number; > - u32 value = 0; > + u32 value = 0, mask = 0; > > if (!pci_device_exists(addr.bus_number, dev_num, 0)) > return; > @@ -368,12 +386,12 @@ void pci__config_wr(struct kvm *kvm, union pci_config_address addr, void *data, > if (pci_hdr->cfg_ops.write) > pci_hdr->cfg_ops.write(kvm, pci_hdr, offset, data, size); > > - /* > - * legacy hack: ignore writes to uninitialized regions (e.g. ROM BAR). > - * Not very nice but has been working so far. > - */ > - if (*(u32 *)(base + offset) == 0) > - return; > + /* We don't sanity-check capabilities for the moment */ > + if (offset < PCI_STD_HEADER_SIZEOF) { > + memcpy(&mask, pci_config_writable + offset, size); > + if (!mask) > + return; Shouldn't this be performed before the VFIO callbacks? Also, the vfio callbacks still do the writes to the VFIO in-kernel PCI header, but now kvmtool would skip those writes entirely. Shouldn't kvmtool's view of the configuration space be identical to that of VFIO? > + } > > if (offset == PCI_COMMAND) { > memcpy(&value, data, size); > @@ -419,8 +437,13 @@ static void pci_config_mmio_access(struct kvm_cpu *vcpu, u64 addr, u8 *data, > cfg_addr.w = (u32)addr; > cfg_addr.enable_bit = 1; > > - if (len > 4) > - len = 4; > + /* > + * "Root Complex implementations are not required to support the > + * generation of Configuration Requests from accesses that cross DW > + * [4 bytes] boundaries." > + */ > + if ((addr & 3) + len > 4) > + return; Isn't that a change in behaviour? How about: len = 4 - (addr & 3); Which should conform to the spec, but still allow writes like before. Thanks, Alex > > if (is_write) > pci__config_wr(kvm, cfg_addr, data, len); > -- > 2.37.3 >