Hi, On Tue, Sep 13, 2022 at 07:40:02PM +0100, Jean-Philippe Brucker wrote: > On Tue, Sep 13, 2022 at 02:52:44PM +0100, Alexandru Elisei wrote: > > Hi, > > > > On Thu, Sep 08, 2022 at 03:42:09PM +0100, Jean-Philippe Brucker wrote: > > > Although the PCI Status register only contains read-only and > > > write-1-to-clear bits, we currently keep anything written there, which > > > can confuse a guest. > > > > > > The problem was highlighted by recent Linux commit 6cd514e58f12 ("PCI: > > > Clear PCI_STATUS when setting up device"), which unconditionally writes > > > 0xffff to the Status register in order to clear pending errors. Then the > > > EDAC driver sees the parity status bits set and attempts to clear them > > > by writing 0xc100, which in turn clears the Capabilities List bit. > > > Later on, when the virtio-pci driver starts probing, it assumes due to > > > missing capabilities that the device is using the legacy transport, and > > > fails to setup the device because of mismatched protocol. > > > > > > Filter writes to the config space, keeping only those to writable > > > fields. Tighten the access size check while we're at it, to prevent > > > overflow. This is only a small step in the right direction, not a > > > foolproof solution, because a guest could still write both Command and > > > Status registers using a single 32-bit write. More work is needed for: > > > * Supporting arbitrary sized writes. > > > * Sanitizing accesses to capabilities, which are device-specific. > > > * Fine-grained filtering of the Command register, where only some bits > > > are writable. > > > > I'm confused here. Why not do value &= mask to keep only those bits that > > writable? > > Sure, I can add it > > > > > > > > > Also remove the old hack that filtered accesses. It was wrong and not > > > properly explained in the git history, but whatever it was guarding > > > against should be prevented by these new checks. > > > > If I remember correctly, that was guarding against the guest kernel poking > > the ROM base address register for drivers that assumed that the ROM was > > always there, I vaguely remember that was the case with GPUs. Pairs with > > the similar check in the vfio callback, vfio_pci_cfg_write(). > > Right, makes sense. I think that's what I assumed when rewriting > pci__config_wr() hence the current comment but the original commits didn't > say anything about it. > > > > > > > > > Reported-by: Pierre Gondois <pierre.gondois@xxxxxxx> > > > Signed-off-by: Jean-Philippe Brucker <jean-philippe@xxxxxxxxxx> > > > --- > > > Note that the issue described here only shows up during ACPI boot for > > > me, because edac_init() happens after PCI enumeration. With DT boot, > > > edac_pci_clear_parity_errors() runs earlier and doesn't find any device. > > > --- > > > pci.c | 41 ++++++++++++++++++++++++++++++++--------- > > > 1 file changed, 32 insertions(+), 9 deletions(-) > > > > > > diff --git a/pci.c b/pci.c > > > index a769ae27..84dc7d1d 100644 > > > --- a/pci.c > > > +++ b/pci.c > > > @@ -350,6 +350,24 @@ static void pci_config_bar_wr(struct kvm *kvm, > > > pci_activate_bar_regions(kvm, old_addr, bar_size); > > > } > > > > > > +/* > > > + * Bits that are writable in the config space header. > > > + * Write-1-to-clear Status bits are missing since we never set them. > > > + */ > > > +static const u8 pci_config_writable[PCI_STD_HEADER_SIZEOF] = { > > > + [PCI_COMMAND] = > > > + PCI_COMMAND_IO | > > > + PCI_COMMAND_MEMORY | > > > + PCI_COMMAND_MASTER | > > > + PCI_COMMAND_PARITY, > > > + [PCI_COMMAND + 1] = > > > + (PCI_COMMAND_SERR | > > > + PCI_COMMAND_INTX_DISABLE) >> 8, > > > + [PCI_INTERRUPT_LINE] = 0xff, > > > + [PCI_BASE_ADDRESS_0 ... PCI_BASE_ADDRESS_5 + 3] = 0xff, > > > + [PCI_CACHE_LINE_SIZE] = 0xff, > > > +}; > > > + > > > void pci__config_wr(struct kvm *kvm, union pci_config_address addr, void *data, int size) > > > { > > > void *base; > > > @@ -357,7 +375,7 @@ void pci__config_wr(struct kvm *kvm, union pci_config_address addr, void *data, > > > u16 offset; > > > struct pci_device_header *pci_hdr; > > > u8 dev_num = addr.device_number; > > > - u32 value = 0; > > > + u32 value = 0, mask = 0; > > > > > > if (!pci_device_exists(addr.bus_number, dev_num, 0)) > > > return; > > > @@ -368,12 +386,12 @@ void pci__config_wr(struct kvm *kvm, union pci_config_address addr, void *data, > > > if (pci_hdr->cfg_ops.write) > > > pci_hdr->cfg_ops.write(kvm, pci_hdr, offset, data, size); > > > > > > - /* > > > - * legacy hack: ignore writes to uninitialized regions (e.g. ROM BAR). > > > - * Not very nice but has been working so far. > > > - */ > > > - if (*(u32 *)(base + offset) == 0) > > > - return; > > > + /* We don't sanity-check capabilities for the moment */ > > > + if (offset < PCI_STD_HEADER_SIZEOF) { > > > + memcpy(&mask, pci_config_writable + offset, size); > > > + if (!mask) > > > + return; > > > > Shouldn't this be performed before the VFIO callbacks? > > Yes I think I can move it up > > > Also, the vfio callbacks still do the writes to the VFIO in-kernel PCI > > header, but now kvmtool would skip those writes entirely. Shouldn't > > kvmtool's view of the configuration space be identical to that of VFIO? > > VFIO also skips writes to read-only fields, so they should now be more in > sync than before :) But their views are already desynchronized, because > kvmtool doesn't read back the config space virtualized by VFIO after > writing to it. We should probably improve it, but that's also for a future > patch. Ah, I see now. My concern was that kvmtool was still writing to the VFIO config space, while ignoring those writes to its own instance of the config space. > > > > > > + } > > > > > > if (offset == PCI_COMMAND) { > > > memcpy(&value, data, size); > > > @@ -419,8 +437,13 @@ static void pci_config_mmio_access(struct kvm_cpu *vcpu, u64 addr, u8 *data, > > > cfg_addr.w = (u32)addr; > > > cfg_addr.enable_bit = 1; > > > > > > - if (len > 4) > > > - len = 4; > > > + /* > > > + * "Root Complex implementations are not required to support the > > > + * generation of Configuration Requests from accesses that cross DW > > > + * [4 bytes] boundaries." > > > + */ > > > + if ((addr & 3) + len > 4) > > > + return; > > > > Isn't that a change in behaviour? > > Yes, but it should be safe to change since it is implementation defined. > According to the spec 64-bit config space writes through ECAM are not > expected to work and I find the old behaviour, truncating the write, worse > than rejecting the whole thing. It looks like at least linux, freebsd, > u-boot and edk2 don't issue 64-bit writes. Ok, great, so this change won't break existing VMs because the guest kernels don't do these kind of imp def writes. Would you mind also moving this before the cfg_ops.write callback? Thanks, Alex > > Thanks, > Jean > > > How about: > > > > len = 4 - (addr & 3); > > > > Which should conform to the spec, but still allow writes like before. > > > > Thanks, > > Alex > > > > > > > > if (is_write) > > > pci__config_wr(kvm, cfg_addr, data, len); > > > -- > > > 2.37.3 > > >