Thanks, Michael, for the analysis.

I have tried the kdump steps on Oracle Linux 9.4 with the 6.13.0 kernel as well.
Although I couldn't reproduce the soft lockup, I do see some other VMBus
failures. I agree that the bootup is extremely slow, which should be due to the
same reason.

My system has a newer UEFI version, and I wonder whether that version
(UEFI Release v4.1 08/23/2024) is causing this difference in behaviour.

Relevant part of the logs:
---------------------------------------------------------
echo 1 > /proc/sys/kernel/sysrq
echo c > /proc/sysrq-trigger
[ 982.948352] sysrq: Trigger a crash
[ 982.949553] Kernel panic - not syncing: sysrq triggered crash
[ 982.951515] CPU: 31 UID: 0 PID: 6938 Comm: bash Kdump: loaded Not tainted 6.13.0 #1
[ 982.954115] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 08/23/2024
[ 982.957641] Call Trace:
[ 982.958508] <TASK>
[ 982.959251] panic+0x37e/0x3b0
[ 982.960373] ? _printk+0x64/0x90
[ 982.961452] sysrq_handle_crash+0x1a/0x20
[ 982.962840] __handle_sysrq+0x9b/0x190
[ 982.964145] write_sysrq_trigger+0x5f/0x80
[ 982.965578] proc_reg_write+0x59/0xb0
[ 982.966905] vfs_write+0x111/0x470
[ 982.968004] ? __count_memcg_events+0xbf/0x150
[ 982.969432] ? count_memcg_events.constprop.0+0x26/0x50
[ 982.971190] ksys_write+0x6e/0xf0
[ 982.972307] do_syscall_64+0x62/0x180
[ 982.973438] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 982.975102] RIP: 0033:0x7f3d570fdbd7
[ 982.976421] Code: 0f 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
[ 982.982893] RSP: 002b:00007fff6d613c48 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 982.985424] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f3d570fdbd7
[ 982.987613] RDX: 0000000000000002 RSI: 000056362a928470 RDI: 0000000000000001
[ 982.989774] RBP: 000056362a928470 R08: 0000000000000000 R09: 00007f3d571b0d40
[ 982.992109] R10: 00007f3d571b0c40 R11: 0000000000000246 R12: 0000000000000002
[ 982.994321] R13: 00007f3d571fa780 R14: 0000000000000002 R15: 00007f3d571f59e0
[ 982.996461] </TASK>
[ 982.998317] Kernel Offset: 0x10c00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)

[ 0.000000] Linux version 6.13.0 (lisatest@lisa--505-e0-n0) (gcc (GCC) 11.5.0 20240719 (Red Hat 11.5.0-2.0.1), GNU ld version 2.35.2-54.0.1.el9) #1 SMP PREEMPT_DYNAMIC Thu Feb 6 10:05:27 UTC 2025
[ 0.000000] Command line: elfcorehdr=0xd000000 BOOT_IMAGE=(hd0,gpt1)/vmlinuz-6.13.0 ro console=tty0 console=ttyS0,115200n8 rd.lvm.vg=rootvg irqpoll nr_cpus=1 reset_devices cgroup_disable=memory mce=off numa=off udev.children-max=2 panic=10 acpi_no_memhotplug transparent_hugepage=never nokaslr hest_disable novmcoredd cma=0 hugetlb_cma=0 iommu=off disable_cpu_apicid=0
[ 0.000000] BIOS-provided physical RAM map:
[ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x0000000000000fff] reserved
[ 0.000000] BIOS-e820: [mem 0x0000000000001000-0x000000000009ffff] usable
[ 0.000000] BIOS-e820: [mem 0x00000000000c0000-0x00000000000fffff] reserved
[ 0.000000] BIOS-e820: [mem 0x000000000d0e00b0-0x000000002cffffff] usable
[ 0.000000] BIOS-e820: [mem 0x000000003eead000-0x000000003eeb3fff] reserved
[ 0.000000] BIOS-e820: [mem 0x000000003ff41000-0x000000003ffc8fff] reserved
[ 0.000000] BIOS-e820: [mem 0x000000003ffc9000-0x000000003fffafff] ACPI data
[ 0.000000] BIOS-e820: [mem 0x000000003fffb000-0x000000003fffefff] ACPI NVS
[ 0.000000] random: crng init done
<snip>
[ 0.928063] Console: switching to colour frame buffer device 128x48
[ 13.391297] fb0: EFI VGA frame buffer device
<snip>
[ 590.199511] hv_netvsc 7c1e527c-2980-7c1e-527c-29807c1e527c (unnamed net_device) (uninitialized): VF slot 1 added
[ 595.120270] Console: switching to colour dummy device 80x25
[ 605.203700] hyperv_fb: Time out on waiting vram location ack
[ 605.206161] iounmap: bad address 0000000005f4dac5
[ 605.207740] CPU: 0 UID: 0 PID: 30 Comm: kworker/u4:2 Not tainted 6.13.0 #1
[ 605.209984] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 08/23/2024
[ 605.213869] Workqueue: async async_run_entry_fn
[ 605.215601] Call Trace:
[ 605.216382] <TASK>
[ 605.217123] dump_stack_lvl+0x66/0x90
[ 605.218184] hvfb_putmem+0x32/0x110 [hyperv_fb]
[ 605.219646] hvfb_probe+0x27f/0x360 [hyperv_fb]
[ 605.221120] vmbus_probe+0x3d/0xa0 [hv_vmbus]
[ 605.222623] really_probe+0xd9/0x390
[ 605.223779] __driver_probe_device+0x78/0x160
[ 605.225213] driver_probe_device+0x1e/0xa0
[ 605.226591] __driver_attach_async_helper+0x5e/0xe0
[ 605.228166] async_run_entry_fn+0x34/0x130
[ 605.229681] process_one_work+0x187/0x3b0
[ 605.231075] worker_thread+0x24e/0x360
[ 605.232376] ? __pfx_worker_thread+0x10/0x10
[ 605.233758] kthread+0xd3/0x100
[ 605.234805] ? __pfx_kthread+0x10/0x10
[ 605.236053] ret_from_fork+0x34/0x50
[ 605.237251] ? __pfx_kthread+0x10/0x10
[ 605.238519] ret_from_fork_asm+0x1a/0x30
[ 605.239833] </TASK>
[ 605.240855] hv_vmbus: probe failed for device 5620e0c7-8062-4dce-aeb7-520c7ef76171 (-110)
[ 605.243404] hyperv_fb 5620e0c7-8062-4dce-aeb7-520c7ef76171: probe with driver hyperv_fb failed with error -110
[ 605.254672] hv_vmbus: registering driver hv_pci

- Saurabh

> -----Original Message-----
> From: Michael Kelley <mhklinux@xxxxxxxxxxx>
> Sent: 07 February 2025 02:30
> To: Michael Kelley <mhklinux@xxxxxxxxxxx>; Thomas Tai
> <thomas.tai@xxxxxxxxxx>; mhkelley58@xxxxxxxxx; Haiyang Zhang
> <haiyangz@xxxxxxxxxxxxx>; wei.liu@xxxxxxxxxx; Dexuan Cui
> <decui@xxxxxxxxxxxxx>; drawat.floss@xxxxxxxxx; javierm@xxxxxxxxxx;
> Helge Deller <deller@xxxxxx>; daniel@xxxxxxxx; airlied@xxxxxxxxx;
> tzimmermann@xxxxxxx
> Cc: dri-devel@xxxxxxxxxxxxxxxxxxxxx; linux-fbdev@xxxxxxxxxxxxxxx;
> linux-kernel@xxxxxxxxxxxxxxx; linux-hyperv@xxxxxxxxxxxxxxx
> Subject: [EXTERNAL] RE: hyper_bf soft lockup on Azure Gen2 VM when taking
> kdump or executing kexec
>
> From: Michael Kelley <mhklinux@xxxxxxxxxxx>
> >
> > From: Thomas Tai <thomas.tai@xxxxxxxxxx> Sent: Thursday, January 30, 2025 12:44 PM
> > >
> > > > -----Original Message-----
> > > > From: Michael Kelley <mhklinux@xxxxxxxxxxx> Sent: Thursday, January 30, 2025 3:20 PM
> > > >
> > > > From: Thomas Tai <thomas.tai@xxxxxxxxxx> Sent: Thursday, January 30, 2025 10:50 AM
> > > > >
> > > > > Sorry for the typo in the subject title. It should have been
> > > > > 'hyperv_fb soft lockup on Azure Gen2 VM when taking kdump or
> > > > > executing kexec'
> > > > >
> > > > > Thomas
> > > > >
> > > > > >
> > > > > > Hi Michael,
> > > > > >
> > > > > > We see an issue with the mainline kernel on the Azure Gen 2 VM
> > > > > > when trying to induce a kernel panic with sysrq commands. The
> > > > > > VM would hang with soft lockup. A similar issue happens when
> > > > > > executing kexec on the VM.
> > > > > > This issue is seen only with Gen2 VMs (with UEFI boot). Gen1
> > > > > > VMs with BIOS boot are fine.
> > > > > >
> > > > > > git bisect identifies that the issue is caused by commit
> > > > > > 20ee2ae8c5899
> > > > > > ("fbdev/hyperv_fb: Fix logic error for Gen2 VMs in hvfb_getmem()").
> > > > > > However, reverting the commit would cause the frame buffer not
> > > > > > to work on the Gen2 VM.
> > > > > >
> > > > > > Do you have any hints on what caused this issue?
> > > > > >
> > > > > > To reproduce the issue with kdump:
> > > > > > - Install mainline kernel on an Azure Gen 2 VM and trigger a
> > > > > >   kdump
> > > > > > - echo 1 > /proc/sys/kernel/sysrq
> > > > > > - echo c > /proc/sysrq-trigger
> > > > > >
> > > > > > To reproduce the issue with executing kexec:
> > > > > > - Install mainline kernel on Azure Gen 2 VM and use kexec
> > > > > > - sudo kexec -l /boot/vmlinuz --initrd=/boot/initramfs.img
> > > > > >   --command-line="$( cat /proc/cmdline )"
> > > > > > - sudo kexec -e
> > > > > >
> > > > > > Thank you,
> > > > > > Thomas
> > > >
> > > > I will take a look, but it might be early next week before I can do so.
> > > >
> > >
> > > Thank you, Michael, for your help!
> > >
> > > > It looks like your soft lockup log below is from the kdump kernel
> > > > (or the newly kexec'ed kernel). Can you confirm? Also, this looks
> > > > like a subset of the full log.
> > >
> > > Yes, the soft lockup log below is from the kdump kernel.
> > >
> > > > Do you have the full serial console log that you could email to
> > > > me? Seeing everything might be helpful. Of course, I'll try to
> > > > repro the problem myself as well.
> > >
> > > I have attached the complete bootup and kdump kernel log.
> > >
> > > File: bootup_and_kdump.log
> > > Line 1 ... 984 (bootup log)
> > > Line 990 (kdump kernel booting up)
> > > Line 1351 (soft lockup)
> > >
> > > Thank you,
> > > Thomas
> > >
> >
> > I have reproduced the problem in an Azure VM running Oracle Linux
> > 9.4 with the 6.13.0 kernel. Interestingly, the problem does not occur
> > in a VM running on a locally installed Hyper-V with Ubuntu 20.04 and
> > the 6.13.0 kernel. There are several differences in the two
> > environments: the version of Hyper-V, the VM configuration, the Linux
> > distro, and the .config file used to build the 6.13.0 kernel. I'll try
> > to figure out what makes the difference, and then the root cause.
> >
>
> This has been a real bear to investigate. :-( The key observation is that
> with older kernel versions, the efifb driver does *not* try to load when
> running in the kdump kernel, and everything works. In newer kernels, the
> efifb driver *does* try to load, and it appears to hang. (Actually, it is
> causing the VM to run very slowly. More on that in a minute.)
>
> I've bisected the kernel again, compensating for the fact that commit
> 20ee2ae8c5899 is needed to make the Hyper-V frame buffer work. With that
> compensation, the actual problematic commit is 2bebc3cd4870 (Revert
> "firmware/sysfb: Clear screen_info state after consuming it"). Doing the
> revert causes screen_info.orig_video_isVGA to retain its value of 0x70
> (VIDEO_TYPE_EFI), which the kdump kernel picks up, causing it to load the
> efifb driver.
>
> Then the question is why the efifb driver doesn't work in the kdump kernel.
> Actually, it *does* work in many cases. I built the 6.13.0 kernel on the
> Oracle Linux 9.4 system, and transferred the kernel image binary and module
> binaries to an Ubuntu 20.04 VM in Azure. In that VM, the efifb driver is
> loaded as part of the kdump kernel, and it doesn't cause any problems. But
> there's an interesting difference. In the Oracle Linux 9.4 VM, the efifb
> driver finds the framebuffer at 0x40000000, while on the Ubuntu 20.04 VM,
> it finds the framebuffer at 0x40900000. This difference is due to
> differences in how the screen_info variable gets set up in the two VMs.
>
> When the normal kernel starts in a freshly booted VM, Hyper-V provides the
> EFI framebuffer at 0x40000000, and it works. But after the Hyper-V FB driver
> or Hyper-V DRM driver has initialized, Linux has picked a different MMIO
> address range and told Hyper-V to use the new address range (which often
> starts at 0x40900000). A kexec does *not* reset Hyper-V's transition to the
> new range, so when the efifb driver tries to use the framebuffer at
> 0x40000000, the accesses trap to Hyper-V and probably fail or time out (I'm
> not sure of the details). After the guest does some number of these bad
> references, Hyper-V considers itself to be under attack from an ill-behaving
> guest, and throttles the guest so that it doesn't run for a few seconds. The
> throttling repeats, and results in extremely slow running in the kdump
> kernel.
>
> Somehow in the Ubuntu 20.04 VM, the location of the frame buffer as stored
> in screen_info.lfb_base gets updated to be 0x40900000. I haven't fully
> debugged how that happens. But with that update, the efifb driver is using
> the updated framebuffer address and it works. On the Oracle Linux 9.4
> system, that update doesn't appear to happen, and the problem occurs.
>
> This is an interim update on the problem. I'm still investigating how
> screen_info.lfb_base is set in the kdump kernel, and why it is different in
> the Ubuntu 20.04 VM vs. in the Oracle Linux 9.4 VM. Once that is well
> understood, we can contemplate how to fix the problem. Undoing the revert
> that is commit 2bebc3cd4870 doesn't seem like the solution since the
> original code there was reported to cause many other issues.
> The solution focus will likely be on how to ensure the kdump kernel gets the
> correct framebuffer address so the efifb driver works, since the framebuffer
> address changing is a quirk of Hyper-V behavior.
>
> If anyone else has insight into what's going on here, please chime in.
> What I've learned so far is still somewhat tentative.
>
> Michael
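
The slow running that Michael attributes to Hyper-V throttling shows up in the kdump console log as large jumps between consecutive timestamps (for example, between the "Console: switching to colour frame buffer device" and "fb0: EFI VGA frame buffer device" lines). For anyone comparing logs from different VMs, here is a minimal sketch of one way to list those jumps. It is only an illustration: the helper name `find_stalls` and the 5-second threshold are hypothetical choices, not part of any kernel tool, and a gap in the log only suggests where the time went without proving the cause.

```python
import re

# Matches a console-log prefix such as "[ 605.203700] message".
TIMESTAMPED = re.compile(r"^\[\s*(\d+\.\d+)\]\s?(.*)$")


def find_stalls(log_text, min_gap=5.0):
    """Return (gap_seconds, earlier_message, later_message) for every pair of
    adjacent timestamped lines that are at least min_gap seconds apart.

    A line without a timestamp (for example "<snip>") resets the comparison,
    so elided parts of the log are not reported as stalls.
    """
    stalls = []
    prev = None
    for raw in log_text.splitlines():
        match = TIMESTAMPED.match(raw.strip())
        if match is None:
            prev = None  # something was elided; do not compare across it
            continue
        stamp, message = float(match.group(1)), match.group(2)
        if prev is not None and stamp - prev[0] >= min_gap:
            stalls.append((stamp - prev[0], prev[1], message))
        prev = (stamp, message)
    return stalls


# Lines copied from the kdump kernel log earlier in this mail.
SAMPLE = """\
[ 0.928063] Console: switching to colour frame buffer device 128x48
[ 13.391297] fb0: EFI VGA frame buffer device
<snip>
[ 590.199511] hv_netvsc 7c1e527c-2980-7c1e-527c-29807c1e527c (unnamed net_device) (uninitialized): VF slot 1 added
[ 595.120270] Console: switching to colour dummy device 80x25
[ 605.203700] hyperv_fb: Time out on waiting vram location ack
"""

for gap, before, after in find_stalls(SAMPLE):
    print(f"{gap:8.3f}s between {before!r} and {after!r}")
```

To use it on a full capture, read the serial console log into a string and pass it to `find_stalls`; lowering `min_gap` surfaces shorter pauses at the cost of more noise.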