Re: [REGRESSION][v6.8-rc1] virtio-pci: Introduce admin virtqueue

Jason Wang <jasowang@xxxxxxxxxx> · Thu, 16 May 2024 17:52:44 +0800

On Thu, May 16, 2024 at 5:46 PM Catherine Redfield
<catherine.redfield@xxxxxxxxxxxxx> wrote:
>
> Feng,
>
> Thank you for providing your debugging steps; I used them on a gce image locally and was not able to replicate the issue.  I also attempted to replicate in qemu/virsh using qemu-guest-agent to enable the S3 suspend state, also without success (that is S3 suspend state worked without any problems).  I have brought this back to the cloud for further debugging of their config and guest agent to try and determine what the issue is.
>
> Thank you very much for all your help on this issue and time looking into it!
> Catherine

Does this fix the issue? I guess the reason is that GCE is using legacy virtio.

https://lore.kernel.org/kvm/CACGkMEth_9Baewekq862YgZwuozwG96Z3G6oYqHzyCj2JPUZ3g@xxxxxxxxxxxxxx/T/

Thanks

>
> On Thu, May 9, 2024 at 5:03 AM Feng Liu <feliu@xxxxxxxxxx> wrote:
>>
>>
>> On 2024-05-08 a.m.7:18, Catherine Redfield wrote:
>> > *External email: Use caution opening links or attachments*
>> >
>> >
>> > On a VM with the GCP kernel (where we first identified the problem), I see:
>> >
>> > 1. The full kernel log from `journalctl --system > kernlog` attached.
>> > The specific suspend section is here:
>> >
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
>> > systemd[1]: Reached target sleep.target - Sleep.
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
>> > systemd[1]: Starting systemd-suspend.service - System Suspend...
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
>> > systemd-sleep[1413]: Performing sleep operation 'suspend'...
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
>> > kernel: PM: suspend entry (deep)
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
>> > kernel: Filesystems sync: 0.008 seconds
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
>> > kernel: Freezing user space processes
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
>> > kernel: Freezing user space processes completed (elapsed 0.001 seconds)
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
>> > kernel: OOM killer disabled.
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
>> > kernel: Freezing remaining freezable tasks
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
>> > kernel: Freezing remaining freezable tasks completed (elapsed 0.000 seconds)
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
>> > kernel: printk: Suspending console(s) (use no_console_suspend to debug)
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
>> > kernel: port 00:03:0.0: PM: dpm_run_callback():
>> > pm_runtime_force_suspend+0x0/0x130 returns -16
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
>> > kernel: port 00:03:0.0: PM: failed to suspend: error -16
>>
>> Thanks Joesph and Catherine's help.
>>
>> Hi,
>>
>> I have alreay synced up with Cananical guys offline about this issue.
>>
>> I can run "suspend/resume" sucessfully on my local server and VM.
>> And "PM: failed to suspend: error -16" looks like not cause by my
>> previous virtio patch ( fd27ef6b44be  ("virtio-pci: Introduce admin
>> virtqueue")) which only modified "virtio_device_freeze" about "suspend"
>> action.
>>
>> So I have provide the my steps and debug patch to Joesph and Catherine.
>> I will also sync up the information here, as follow:
>>
>> I have read the qemu code and find a way to trigger "suspend/resume" on
>> my setup, and add some debug message in the latest kerenel
>>
>> My setps are:
>> 1. QEMU cmdline add following
>> ....
>> -global PIIX4_PM.disable_s3=0 \
>> -global PIIX4_PM.disable_s4=1 \
>> ....
>> -netdev type=tap,ifname=tap0,id=hostnet0,script=no,downscript=no \
>> -device
>> virtio-net-pci,netdev=hostnet0,id=net0,mac=$SSH_MAC,bus=pci.0,addr=0x3 \
>> ......
>>
>> 2. In the VM, run "systemctl suspend" to PM suspend the VM into memory
>> 3. In qemu hmp shell, run "system_wakeup" to resume the VM again
>>
>> My VM configuration:
>> NIC:     1 virtio nic emulated by QEMU
>> OS:      Ubuntu 22.04.4 LTS
>> kernel:  latest kernel, 6.9-rc7: ee5b455b0ada (kernel2/net-next-virito,
>> kernel2/master, master) Merge tag 'slab-for-6.9-rc7-fixes' of
>> git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab)
>>
>>
>> I add some debug message on the latest kernel, and do above steps to
>> trigger "suspen/resume". Everything of VM is OK, VM could suspend/resume
>> successfully.
>> Follwing is the kernel log:
>> ----------------------------------------------------------------------------
>> ........
>> May  6 15:59:52 feliu-vm kernel: [   43.446737] PM: suspend entry (deep)
>> May  6 16:00:04 feliu-vm kernel: [   43.467640] Filesystems sync: 0.020
>> seconds
>> May  6 16:00:04 feliu-vm kernel: [   43.467923] Freezing user space
>> processes
>> May  6 16:00:04 feliu-vm kernel: [   43.470294] Freezing user space
>> processes completed (elapsed 0.002 seconds)
>> May  6 16:00:04 feliu-vm kernel: [   43.470299] OOM killer disabled.
>> May  6 16:00:04 feliu-vm kernel: [   43.470301] Freezing remaining
>> freezable tasks
>> May  6 16:00:04 feliu-vm kernel: [   43.471482] Freezing remaining
>> freezable tasks completed (elapsed 0.001 seconds)
>> May  6 16:00:04 feliu-vm kernel: [   43.471495] printk: Suspending
>> console(s) (use no_console_suspend to debug)
>> May  6 16:00:04 feliu-vm kernel: [   43.474034] virtio_net virtio0:
>> godeng virtio device freeze
>> May  6 16:00:04 feliu-vm kernel: [   43.475714] virtio_net virtio0 ens3:
>> godfeng virtnet_freeze done
>> May  6 16:00:04 feliu-vm kernel: [   43.475717] virtio_net virtio0:
>> godfeng VIRTIO_F_ADMIN_VQ not enabled
>> May  6 16:00:04 feliu-vm kernel: [   43.475719] virtio_net virtio0:
>> godeng virtio device freeze done
>> ........
>> May  6 16:00:04 feliu-vm kernel: [   43.535382] smpboot: CPU 1 is now
>> offline
>> May  6 16:00:04 feliu-vm kernel: [   43.537283] IRQ fixup: irq 1 move in
>> progress, old vector 32
>> May  6 16:00:04 feliu-vm kernel: [   43.538504] smpboot: CPU 2 is now
>> offline
>> May  6 16:00:04 feliu-vm kernel: [   43.541392] smpboot: CPU 3 is now
>> offline
>>
>> ......
>>
>> May  6 16:00:04 feliu-vm kernel: [   54.973285] smpboot: Booting Node 0
>> Processor 15 APIC 0xf
>> May  6 16:00:04 feliu-vm kernel: [   54.975190] CPU15 is up
>> May  6 16:00:04 feliu-vm kernel: [   54.976011] ACPI: PM: Waking up from
>> system sleep state S3
>> May  6 16:00:04 feliu-vm kernel: [   54.986071] virtio_net virtio0:
>> godeng virtio device restore
>> May  6 16:00:04 feliu-vm kernel: [   54.987563] virtio_net virtio0 ens3:
>> godfeng virtnet_restore done
>> May  6 16:00:04 feliu-vm kernel: [   54.987635] virtio_net virtio0:
>> godfeng: virtio device restore done
>> ......
>> May  6 16:00:04 feliu-vm kernel: [   55.307221] ata8: SATA link down
>> (SStatus 0 SControl 300)
>> May  6 16:00:04 feliu-vm kernel: [   55.442048] OOM killer enabled.
>> May  6 16:00:04 feliu-vm kernel: [   55.442051] Restarting tasks ... done.
>> May  6 16:00:04 feliu-vm kernel: [   55.443576] random: crng reseeded on
>> system resumption
>> May  6 16:00:04 feliu-vm kernel: [   55.443582] PM: suspend exit
>>
>> ----------------------------------------------------------------------------
>>
>> Attachment is the full kernel log. I think maybe it is some configration
>> error.
>>
>>
>> Thanks
>> Feng
>>
>>
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
>> > kernel: sd 0:0:1:0: [sda] Synchronizing SCSI cache
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
>> > kernel: PM: Some devices failed to suspend, or early wake event detected
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
>> > kernel: OOM killer enabled.
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
>> > kernel: Restarting tasks ... done.
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
>> > kernel: random: crng reseeded on system resumption
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
>> > kernel: PM: suspend exit
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
>> > kernel: PM: suspend entry (s2idle)
>> > -- Boot 61828bc938b44fc68a8aeedc16a23a9d --
>> > May 08 11:09:03 localhost kernel: Linux version 6.8.0-1007-gcp
>> > (buildd@lcy02-amd64-079) (x86_64-linux-gnu-gcc-13 (Ubuntu
>> > 13.2.0-23ubuntu4) 13.2.0, GNU ld (GNU Binutils for Ubuntu) 2.42)
>> > #7-Ubuntu SMP Sat Apr 20 00:58:31 UTC 2024 (Ubuntu 6.8.0-1007.7-gcp 6.8.1)
>> > May 08 11:09:03 localhost kernel: Command line:
>> > BOOT_IMAGE=/vmlinuz-6.8.0-1007-gcp
>> > root=PARTUUID=7a949935-6bf2-4cae-b404-803c95163572 ro
>> > console=ttyS0,115200 panic=-1
>> >
>> > 2. The features the devices has:
>> >
>> > catred@kernel-test-202405080702:~$ cat
>> > /sys/bus/virtio/devices/virtio0/features
>> > 0110000000000000000000000000010000000000000000000000000000000000
>> > catred@kernel-test-202405080702:~$ cat
>> > /sys/bus/virtio/devices/virtio1/features
>> > 1110010110011001110000100000010000000000000000000000000000000000
>> > catred@kernel-test-202405080702:~$ cat
>> > /sys/bus/virtio/devices/virtio2/features
>> > 1110000000000000000000000000000000000000000000000000000000000000
>> > catred@kernel-test-202405080702:~$ cat
>> > /sys/bus/virtio/devices/virtio3/features
>> > 0000000000000000000000000000000000000000000000000000000000000000
>> >
>> > Catherine
>> >
>> > On Tue, May 7, 2024 at 11:34 PM Jason Wang <jasowang@xxxxxxxxxx
>> > <mailto:jasowang@xxxxxxxxxx>> wrote:
>> >
>> >     On Sat, May 4, 2024 at 2:10 AM Joseph Salisbury
>> >     <joseph.salisbury@xxxxxxxxxxxxx
>> >     <mailto:joseph.salisbury@xxxxxxxxxxxxx>> wrote:
>> >      >
>> >      > Hi Feng,
>> >      >
>> >      > During testing, a kernel bug was identified with the suspend/resume
>> >      > functionality on instances running in a public cloud [0].  This
>> >     bug is a
>> >      > regression introduced in v6.8-rc1.  After a kernel bisect, the
>> >     following
>> >      > commit was identified as the cause of the regression:
>> >      >
>> >      >         fd27ef6b44be  ("virtio-pci: Introduce admin virtqueue")
>> >
>> >     Have a quick glance at the patch it seems it should not damage the
>> >     freeze/restore as it should behave as in the past.
>> >
>> >     But I found something interesting:
>> >
>> >     1) assumes 1 admin vq which is not what spec said
>> >     2) special function for admin virtqueue during freeze/restore, but it
>> >     doesn't do anything special than del_vq()
>> >     3) lack real users but I guess e.g the destroy_avq() needs to be
>> >     synchronized with the one that is using admin virtqueue
>> >
>> >      >
>> >      > I was hoping to get your feedback, since you are the patch author. Do
>> >      > you think gathering any additional data will help diagnose this
>> >     issue?
>> >
>> >     Yes, please show us
>> >
>> >     1) the kernel log here.
>> >     2) the features that the device has like
>> >     /sys/bus/virtio/devices/virtio0/features
>> >
>> >      > This commit is depended upon by other virtio commits, so a revert
>> >     test
>> >      > is not really straight forward without reverting all the
>> >     dependencies.
>> >      > Any ideas you have would be greatly appreciated.
>> >
>> >     Thanks
>> >
>> >      >
>> >      >
>> >      > Thanks,
>> >      >
>> >      > Joe
>> >      >
>> >      > http://pad.lv/2063315 <http://pad.lv/2063315>
>> >      >
>> >