On Mon, 15 Aug 2022 17:32:06 -0400, "Michael S. Tsirkin" <mst@xxxxxxxxxx> wrote: > On Mon, Aug 15, 2022 at 01:53:30PM -0700, Andres Freund wrote: > > Hi, > > > > On 2022-08-15 16:21:51 -0400, Michael S. Tsirkin wrote: > > > On Mon, Aug 15, 2022 at 10:46:17AM -0700, Andres Freund wrote: > > > > Hi, > > > > > > > > On 2022-08-15 12:50:52 -0400, Michael S. Tsirkin wrote: > > > > > On Mon, Aug 15, 2022 at 09:45:03AM -0700, Andres Freund wrote: > > > > > > Hi, > > > > > > > > > > > > On 2022-08-15 11:40:59 -0400, Michael S. Tsirkin wrote: > > > > > > > OK so this gives us a quick revert as a solution for now. > > > > > > > Next, I would appreciate it if you just try this simple hack. > > > > > > > If it crashes we either have a long standing problem in virtio > > > > > > > code or more likely a gcp bug where it can't handle smaller > > > > > > > rings than what device requestes. > > > > > > > Thanks! > > > > > > > > > > > > I applied the below and the problem persists. > > > > > > > > > > > > [...] > > > > > > > > > > Okay! > > > > > > > > Just checking - I applied and tested this atop 6.0-rc1, correct? Or did you > > > > want me to test it with the 762faee5a267 reverted? I guess what you're trying > > > > to test if a smaller queue than what's requested you'd want to do so without > > > > the problematic patch applied... > > > > > > > > > > > > Either way, I did this, and there are no issues that I could observe. No > > > > oopses, no broken networking. But: > > > > > > > > To make sure it does something I added a debugging printk - which doesn't show > > > > up. I assume this is at a point at least earlyprintk should work (which I see > > > > getting enabled via serial)? > > > > > > > > > Sorry if I was unclear. I wanted to know whether the change somehow > > > exposes a driver bug or a GCP bug. So what I wanted to do is to test > > > this patch on top of *5.19*, not on top of the revert. > > > > Right, the 5.19 part was clear, just the earlier test: > > > > > > > > On 2022-08-15 11:40:59 -0400, Michael S. Tsirkin wrote: > > > > > > > OK so this gives us a quick revert as a solution for now. > > > > > > > Next, I would appreciate it if you just try this simple hack. > > > > > > > If it crashes we either have a long standing problem in virtio > > > > > > > code or more likely a gcp bug where it can't handle smaller > > > > > > > Thanks! > > > > I wasn't sure about. > > > > After I didn't see any effect on 5.19 + your patch, I grew a bit suspicious > > and added the printks. > > > > > > > Yes I think printk should work here. > > > > The reason the debug patch didn't change anything, and that my debug printk > > didn't show, is that gcp uses the legacy paths... > > Wait a second. Eureka I think! > > So I think GCP is not broken. > I think what's broken is this patch: > > commit cdb44806fca2d0ad29ca644cbf1505433902ee0c > Author: Xuan Zhuo <xuanzhuo@xxxxxxxxxxxxxxxxx> > Date: Mon Aug 1 14:38:54 2022 +0800 > > virtio_pci: support the arg sizes of find_vqs() > > > Specifically: > > diff --git a/drivers/virtio/virtio_pci_legacy.c b/drivers/virtio/virtio_pci_legacy.c > index 2257f1b3d8ae..d75e5c4e637f 100644 > --- a/drivers/virtio/virtio_pci_legacy.c > +++ b/drivers/virtio/virtio_pci_legacy.c > @@ -112,6 +112,7 @@ static struct virtqueue *setup_vq(struct virtio_pci_device *vp_dev, > unsigned int index, > void (*callback)(struct virtqueue *vq), > const char *name, > + u32 size, > bool ctx, > u16 msix_vec) > { > @@ -125,10 +126,13 @@ static struct virtqueue *setup_vq(struct virtio_pci_device *vp_dev, > if (!num || vp_legacy_get_queue_enable(&vp_dev->ldev, index)) > return ERR_PTR(-ENOENT); > > + if (!size || size > num) > + size = num; > + > info->msix_vector = msix_vec; > > /* create the vring */ > - vq = vring_create_virtqueue(index, num, > + vq = vring_create_virtqueue(index, size, > VIRTIO_PCI_VRING_ALIGN, &vp_dev->vdev, > true, false, ctx, > vp_notify, callback, name); > > > > So if you pass the size parameter for a legacy device it will > try to make the ring smaller and that is not legal with > legacy at all. But the driver treats legacy and modern > the same, it allocates a smaller queue anyway. > > > Lo and behold, I pass disable-modern=on to qemu and it happily > corrupts memory exactly the same as GCP does. Yes, I think you are right. Thank you very much. > > > So the new find_vqs API is actually completely broken, it can not work for > legacy at all and for added fun there's no way to find out > that it's legacy. Maybe we should interpret the patch > > So I think I will also revert > > 04ca0b0b16f11faf74fa92468dab51b8372586cd..fe3dc04e31aa51f91dc7f741a5f76cc4817eb5b4 > > > > > > > > > If there were a bug in the legacy path, it'd explain why the problem only > > shows on gcp, and not in other situations. > > > > I'll queue testing the legacy path with the equivalent change. > > > > - Andres > > > > > > Greetings, > > > > Andres Freund > _______________________________________________ Virtualization mailing list Virtualization@xxxxxxxxxxxxxxxxxxxxxxxxxx https://lists.linuxfoundation.org/mailman/listinfo/virtualization