On Thu, May 16, 2024 at 5:46 PM Catherine Redfield <catherine.redfield@xxxxxxxxxxxxx> wrote:
>
> Feng,
>
> Thank you for providing your debugging steps; I used them on a GCE image
> locally and was not able to replicate the issue. I also attempted to
> replicate it in qemu/virsh, using qemu-guest-agent to enable the S3 suspend
> state, also without success (that is, S3 suspend worked without any
> problems). I have brought this back to the cloud for further debugging of
> their config and guest agent to try to determine what the issue is.
>
> Thank you very much for all your help on this issue and time looking into it!
> Catherine

Does this fix the issue? I guess the reason is that GCE is using legacy virtio.

https://lore.kernel.org/kvm/CACGkMEth_9Baewekq862YgZwuozwG96Z3G6oYqHzyCj2JPUZ3g@xxxxxxxxxxxxxx/T/

Thanks

>
> On Thu, May 9, 2024 at 5:03 AM Feng Liu <feliu@xxxxxxxxxx> wrote:
>>
>>
>> On 2024-05-08 a.m.7:18, Catherine Redfield wrote:
>> >
>> > On a VM with the GCP kernel (where we first identified the problem), I see:
>> >
>> > 1. The full kernel log from `journalctl --system > kernlog` is attached.
>> > The specific suspend section is here:
>> >
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal systemd[1]: Reached target sleep.target - Sleep.
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal systemd[1]: Starting systemd-suspend.service - System Suspend...
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal systemd-sleep[1413]: Performing sleep operation 'suspend'...
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal kernel: PM: suspend entry (deep)
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal kernel: Filesystems sync: 0.008 seconds
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal kernel: Freezing user space processes
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal kernel: Freezing user space processes completed (elapsed 0.001 seconds)
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal kernel: OOM killer disabled.
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal kernel: Freezing remaining freezable tasks
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal kernel: Freezing remaining freezable tasks completed (elapsed 0.000 seconds)
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal kernel: printk: Suspending console(s) (use no_console_suspend to debug)
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal kernel: port 00:03:0.0: PM: dpm_run_callback(): pm_runtime_force_suspend+0x0/0x130 returns -16
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal kernel: port 00:03:0.0: PM: failed to suspend: error -16
>>
>> Thanks to Joseph and Catherine for their help.
>>
>> Hi,
>>
>> I have already synced up with the Canonical guys offline about this issue.
>>
>> I can run "suspend/resume" successfully on my local server and VM.
>> And "PM: failed to suspend: error -16" does not look like it is caused by my
>> previous virtio patch (fd27ef6b44be ("virtio-pci: Introduce admin
>> virtqueue")), which only modified "virtio_device_freeze" in the "suspend"
>> path.
>>
>> So I have provided my steps and debug patch to Joseph and Catherine.
>> I will also sync up the information here, as follows:
>>
>> I have read the QEMU code and found a way to trigger "suspend/resume" on
>> my setup, and added some debug messages to the latest kernel.
>>
>> My steps are:
>> 1. Add the following to the QEMU command line:
>> ....
>> -global PIIX4_PM.disable_s3=0 \
>> -global PIIX4_PM.disable_s4=1 \
>> ....
>> -netdev type=tap,ifname=tap0,id=hostnet0,script=no,downscript=no \
>> -device virtio-net-pci,netdev=hostnet0,id=net0,mac=$SSH_MAC,bus=pci.0,addr=0x3 \
>> ......
>>
>> 2. In the VM, run "systemctl suspend" to suspend the VM to memory.
>> 3. In the QEMU HMP shell, run "system_wakeup" to resume the VM.
>>
>> My VM configuration:
>> NIC: 1 virtio NIC emulated by QEMU
>> OS: Ubuntu 22.04.4 LTS
>> kernel: latest kernel, 6.9-rc7: ee5b455b0ada (kernel2/net-next-virito,
>> kernel2/master, master) Merge tag 'slab-for-6.9-rc7-fixes' of
>> git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab)
>>
>>
>> I added some debug messages to the latest kernel and followed the steps
>> above to trigger "suspend/resume". Everything in the VM is OK; the VM
>> could suspend/resume successfully.
>> The kernel log follows:
>> ----------------------------------------------------------------------------
>> ........
>> May 6 15:59:52 feliu-vm kernel: [ 43.446737] PM: suspend entry (deep)
>> May 6 16:00:04 feliu-vm kernel: [ 43.467640] Filesystems sync: 0.020 seconds
>> May 6 16:00:04 feliu-vm kernel: [ 43.467923] Freezing user space processes
>> May 6 16:00:04 feliu-vm kernel: [ 43.470294] Freezing user space processes completed (elapsed 0.002 seconds)
>> May 6 16:00:04 feliu-vm kernel: [ 43.470299] OOM killer disabled.
>> May 6 16:00:04 feliu-vm kernel: [ 43.470301] Freezing remaining freezable tasks
>> May 6 16:00:04 feliu-vm kernel: [ 43.471482] Freezing remaining freezable tasks completed (elapsed 0.001 seconds)
>> May 6 16:00:04 feliu-vm kernel: [ 43.471495] printk: Suspending console(s) (use no_console_suspend to debug)
>> May 6 16:00:04 feliu-vm kernel: [ 43.474034] virtio_net virtio0: godeng virtio device freeze
>> May 6 16:00:04 feliu-vm kernel: [ 43.475714] virtio_net virtio0 ens3: godfeng virtnet_freeze done
>> May 6 16:00:04 feliu-vm kernel: [ 43.475717] virtio_net virtio0: godfeng VIRTIO_F_ADMIN_VQ not enabled
>> May 6 16:00:04 feliu-vm kernel: [ 43.475719] virtio_net virtio0: godeng virtio device freeze done
>> ........
>> May 6 16:00:04 feliu-vm kernel: [ 43.535382] smpboot: CPU 1 is now offline
>> May 6 16:00:04 feliu-vm kernel: [ 43.537283] IRQ fixup: irq 1 move in progress, old vector 32
>> May 6 16:00:04 feliu-vm kernel: [ 43.538504] smpboot: CPU 2 is now offline
>> May 6 16:00:04 feliu-vm kernel: [ 43.541392] smpboot: CPU 3 is now offline
>>
>> ......
>>
>> May 6 16:00:04 feliu-vm kernel: [ 54.973285] smpboot: Booting Node 0 Processor 15 APIC 0xf
>> May 6 16:00:04 feliu-vm kernel: [ 54.975190] CPU15 is up
>> May 6 16:00:04 feliu-vm kernel: [ 54.976011] ACPI: PM: Waking up from system sleep state S3
>> May 6 16:00:04 feliu-vm kernel: [ 54.986071] virtio_net virtio0: godeng virtio device restore
>> May 6 16:00:04 feliu-vm kernel: [ 54.987563] virtio_net virtio0 ens3: godfeng virtnet_restore done
>> May 6 16:00:04 feliu-vm kernel: [ 54.987635] virtio_net virtio0: godfeng: virtio device restore done
>> ......
>> May 6 16:00:04 feliu-vm kernel: [ 55.307221] ata8: SATA link down (SStatus 0 SControl 300)
>> May 6 16:00:04 feliu-vm kernel: [ 55.442048] OOM killer enabled.
>> May 6 16:00:04 feliu-vm kernel: [ 55.442051] Restarting tasks ... done.
>> May 6 16:00:04 feliu-vm kernel: [ 55.443576] random: crng reseeded on system resumption
>> May 6 16:00:04 feliu-vm kernel: [ 55.443582] PM: suspend exit
>>
>> ----------------------------------------------------------------------------
>>
>> The full kernel log is attached. I think it may be a configuration
>> error.
>>
>>
>> Thanks
>> Feng
>>
>>
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal kernel: sd 0:0:1:0: [sda] Synchronizing SCSI cache
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal kernel: PM: Some devices failed to suspend, or early wake event detected
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal kernel: OOM killer enabled.
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal kernel: Restarting tasks ... done.
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal kernel: random: crng reseeded on system resumption
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal kernel: PM: suspend exit
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal kernel: PM: suspend entry (s2idle)
>> > -- Boot 61828bc938b44fc68a8aeedc16a23a9d --
>> > May 08 11:09:03 localhost kernel: Linux version 6.8.0-1007-gcp (buildd@lcy02-amd64-079) (x86_64-linux-gnu-gcc-13 (Ubuntu 13.2.0-23ubuntu4) 13.2.0, GNU ld (GNU Binutils for Ubuntu) 2.42) #7-Ubuntu SMP Sat Apr 20 00:58:31 UTC 2024 (Ubuntu 6.8.0-1007.7-gcp 6.8.1)
>> > May 08 11:09:03 localhost kernel: Command line: BOOT_IMAGE=/vmlinuz-6.8.0-1007-gcp root=PARTUUID=7a949935-6bf2-4cae-b404-803c95163572 ro console=ttyS0,115200 panic=-1
>> >
>> > 2.
>> >    The features the devices have:
>> >
>> > catred@kernel-test-202405080702:~$ cat /sys/bus/virtio/devices/virtio0/features
>> > 0110000000000000000000000000010000000000000000000000000000000000
>> > catred@kernel-test-202405080702:~$ cat /sys/bus/virtio/devices/virtio1/features
>> > 1110010110011001110000100000010000000000000000000000000000000000
>> > catred@kernel-test-202405080702:~$ cat /sys/bus/virtio/devices/virtio2/features
>> > 1110000000000000000000000000000000000000000000000000000000000000
>> > catred@kernel-test-202405080702:~$ cat /sys/bus/virtio/devices/virtio3/features
>> > 0000000000000000000000000000000000000000000000000000000000000000
>> >
>> > Catherine
>> >
>> > On Tue, May 7, 2024 at 11:34 PM Jason Wang <jasowang@xxxxxxxxxx> wrote:
>> >
>> > On Sat, May 4, 2024 at 2:10 AM Joseph Salisbury
>> > <joseph.salisbury@xxxxxxxxxxxxx> wrote:
>> > >
>> > > Hi Feng,
>> > >
>> > > During testing, a kernel bug was identified with the suspend/resume
>> > > functionality on instances running in a public cloud [0]. This bug is a
>> > > regression introduced in v6.8-rc1. After a kernel bisect, the following
>> > > commit was identified as the cause of the regression:
>> > >
>> > > fd27ef6b44be ("virtio-pci: Introduce admin virtqueue")
>> >
>> > From a quick glance at the patch, it seems it should not break
>> > freeze/restore, as it should behave as in the past.
>> >
>> > But I found something interesting:
>> >
>> > 1) it assumes a single admin vq, which is not what the spec says
>> > 2) it has a special function for the admin virtqueue during
>> > freeze/restore, but that function does nothing more than del_vq()
>> > 3) it lacks real users, but I guess e.g. destroy_avq() needs to be
>> > synchronized with whoever is using the admin virtqueue
>> >
>> > >
>> > > I was hoping to get your feedback, since you are the patch author.
>> > > Do you think gathering any additional data will help diagnose this issue?
>> >
>> > Yes, please show us
>> >
>> > 1) the kernel log here.
>> > 2) the features that the device has, e.g.
>> > /sys/bus/virtio/devices/virtio0/features
>> >
>> > > This commit is depended upon by other virtio commits, so a revert test
>> > > is not really straightforward without reverting all the dependencies.
>> > > Any ideas you have would be greatly appreciated.
>> >
>> > Thanks
>> >
>> > >
>> > >
>> > > Thanks,
>> > >
>> > > Joe
>> > >
>> > > [0] http://pad.lv/2063315
>> > >
>> >
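
For anyone who wants to check the "legacy virtio" guess against the feature strings quoted in this thread, here is a minimal sketch in Python. It assumes the sysfs `features` file prints one character per feature bit in ascending order, starting at bit 0, and it uses the virtio spec's bit numbers VIRTIO_F_VERSION_1 = 32 and VIRTIO_F_ADMIN_VQ = 41. The helper names (`set_bits`, `has_bit`, `is_legacy`) are made up for illustration and are not part of any kernel or QEMU tooling.

```python
# Sketch: decode a /sys/bus/virtio/devices/virtioN/features string.
# Assumption: character i of the string corresponds to feature bit i,
# with '1' meaning the bit is set.
VIRTIO_F_VERSION_1 = 32  # set for modern (virtio 1.0+) devices, clear for legacy
VIRTIO_F_ADMIN_VQ = 41   # admin virtqueue feature bit


def set_bits(features: str) -> list[int]:
    """Return the indices of all feature bits that are set."""
    return [i for i, c in enumerate(features.strip()) if c == "1"]


def has_bit(features: str, bit: int) -> bool:
    """True if the given feature bit is present and set in the string."""
    features = features.strip()
    return bit < len(features) and features[bit] == "1"


def is_legacy(features: str) -> bool:
    """A device without VIRTIO_F_VERSION_1 is treated as legacy here."""
    return not has_bit(features, VIRTIO_F_VERSION_1)
```

Reading each `features` file and passing its contents to `is_legacy()` and `has_bit(..., VIRTIO_F_ADMIN_VQ)` shows at a glance whether a device is being driven as legacy and whether the admin virtqueue feature is set at all.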