Re: [REGRESSION][BISECTED] vmwgfx crashes with command buffer error after update

Zack Rusin <zack.rusin@xxxxxxxxxxxx> · Thu, 15 Aug 2024 14:40:09 -0400

On Thu, Aug 15, 2024 at 1:48 PM Christian Heusel <christian@xxxxxxxxx> wrote:
>
> Hello Zack,
>
> the user rdkehn (in CC) on the Arch Linux Forums reports that after
> updating to the 6.10.4 stable kernel inside their VMware Workstation
> VM, the driver crashes with the error attached below. This error is
> also present on the latest mainline release, 6.11-rc3.
>
> We have bisected the issue together down to the following commit:
>
>     d6667f0ddf46 ("drm/vmwgfx: Fix handling of dumb buffers")
>
> Reverting this commit on top of 6.11-rc3 fixes the issue.
>
> While we were still debugging the issue, Brad (also CC'ed) messaged me
> that they were seeing similar failures in their ESXi-based test
> pipelines, except for one box that was running on legacy BIOS (so maybe
> that is relevant). They noticed this because they had set panic_on_warn.
>
> Cheers,
> Chris
>
> ---
>
> #regzbot introduced: d6667f0ddf46
> #regzbot title: drm/vmwgfx: driver crashes due to command buffer error
> #regzbot link: https://bbs.archlinux.org/viewtopic.php?id=298491
>
> ---
>
> dmesg snippet:
> [   13.297084] ------------[ cut here ]------------
> [   13.297086] Command buffer error.
> [   13.297139] WARNING: CPU: 0 PID: 186 at drivers/gpu/drm/vmwgfx/vmwgfx_cmdbuf.c:399 vmw_cmdbuf_ctx_process+0x268/0x270 [vmwgfx]
> [   13.297160] Modules linked in: uas usb_storage hid_generic usbhid mptspi sr_mod cdrom scsi_transport_spi vmwgfx serio_raw mptscsih ata_generic atkbd drm_ttm_helper libps2 pata_acpi vivaldi_fmap ttm mptbase crc32c_intel xhci_pci intel_agp xhci_pci_renesas ata_piix intel_gtt i8042 serio
> [   13.297172] CPU: 0 PID: 186 Comm: irq/16-vmwgfx Not tainted 6.10.4-arch2-1 #1 517ed45cc9c4492ee5d5bfc2d2fe6ef1f2e7a8eb
> [   13.297174] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 11/12/2020
> [   13.297175] RIP: 0010:vmw_cmdbuf_ctx_process+0x268/0x270 [vmwgfx]
> [   13.297186] Code: 01 00 01 e8 ba 8c 4f f9 0f 0b 4c 89 ff e8 40 fb ff ff e9 9d fe ff ff 48 c7 c7 99 d9 3f c0 c6 05 52 2f 01 00 01 e8 98 8c 4f f9 <0f> 0b e9 1f fe ff ff 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90
> [   13.297187] RSP: 0018:ffffb9c1805e3d78 EFLAGS: 00010282
> [   13.297188] RAX: 0000000000000000 RBX: 0000000000000003 RCX: 0000000000000003
> [   13.297189] RDX: 0000000000000000 RSI: 0000000000000003 RDI: 0000000000000001
> [   13.297190] RBP: ffff907fc8274c98 R08: 0000000000000000 R09: ffffb9c1805e3bf8
> [   13.297191] R10: ffff9086dbdfffa8 R11: 0000000000000003 R12: ffff907fc4db5b00
> [   13.297192] R13: ffff907fc83fd318 R14: ffff907fc8274c88 R15: ffff907fc83fd300
> [   13.297193] FS:  0000000000000000(0000) GS:ffff9086dbe00000(0000) knlGS:0000000000000000
> [   13.297194] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   13.297194] CR2: 0000774dc57671ca CR3: 00000006b9e20005 CR4: 00000000003706f0
> [   13.297196] Call Trace:
> [   13.297198]  <TASK>
> [   13.297199]  ? vmw_cmdbuf_ctx_process+0x268/0x270 [vmwgfx a4fe13044bca4eda782d964fb8c4ca15afb325e9]
> [   13.297209]  ? __warn.cold+0x8e/0xe8
> [   13.297211]  ? vmw_cmdbuf_ctx_process+0x268/0x270 [vmwgfx a4fe13044bca4eda782d964fb8c4ca15afb325e9]
> [   13.297221]  ? report_bug+0xff/0x140
> [   13.297222]  ? console_unlock+0x84/0x130
> [   13.297225]  ? handle_bug+0x3c/0x80
> [   13.297226]  ? exc_invalid_op+0x17/0x70
> [   13.297227]  ? asm_exc_invalid_op+0x1a/0x20
> [   13.297230]  ? vmw_cmdbuf_ctx_process+0x268/0x270 [vmwgfx a4fe13044bca4eda782d964fb8c4ca15afb325e9]
> [   13.297238]  ? vmw_cmdbuf_ctx_process+0x268/0x270 [vmwgfx a4fe13044bca4eda782d964fb8c4ca15afb325e9]
> [   13.297245]  vmw_cmdbuf_man_process+0x5d/0x100 [vmwgfx a4fe13044bca4eda782d964fb8c4ca15afb325e9]
> [   13.297253]  vmw_cmdbuf_irqthread+0x25/0x30 [vmwgfx a4fe13044bca4eda782d964fb8c4ca15afb325e9]
> [   13.297261]  vmw_thread_fn+0x3a/0x70 [vmwgfx a4fe13044bca4eda782d964fb8c4ca15afb325e9]
> [   13.297271]  irq_thread_fn+0x20/0x60
> [   13.297273]  irq_thread+0x18a/0x270
> [   13.297274]  ? __pfx_irq_thread_fn+0x10/0x10
> [   13.297276]  ? __pfx_irq_thread_dtor+0x10/0x10
> [   13.297277]  ? __pfx_irq_thread+0x10/0x10
> [   13.297278]  kthread+0xcf/0x100
> [   13.297281]  ? __pfx_kthread+0x10/0x10
> [   13.297282]  ret_from_fork+0x31/0x50
> [   13.297285]  ? __pfx_kthread+0x10/0x10
> [   13.297286]  ret_from_fork_asm+0x1a/0x30
> [   13.297288]  </TASK>
> [   13.297289] ---[ end trace 0000000000000000 ]---

Hi, Christian.

Thanks for the report! Just to be clear: vmwgfx itself doesn't crash.
It emits a warning, and because the kernel is running with
panic_on_warn set, it is that warning which triggers the panic, right?
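
For anyone who wants to confirm which case applies on an affected guest,
here is a minimal sketch that reads the panic_on_warn sysctl. It assumes
a Linux system exposing /proc/sys/kernel/panic_on_warn; the helper name
panic_on_warn_enabled is made up for illustration and is not part of any
existing tool.

```python
from pathlib import Path
from typing import Optional


def panic_on_warn_enabled(
    path: str = "/proc/sys/kernel/panic_on_warn",
) -> Optional[bool]:
    """Return True/False for the sysctl value, or None if it cannot be read.

    `path` is a parameter only so the function can be pointed at a test file.
    """
    try:
        text = Path(path).read_text().strip()
    except OSError:
        # File missing or unreadable (e.g. not Linux, restricted container).
        return None
    try:
        return int(text) != 0
    except ValueError:
        # Unexpected content; report "unknown" rather than guess.
        return None


if __name__ == "__main__":
    state = panic_on_warn_enabled()
    if state is None:
        print("panic_on_warn: could not be determined")
    elif state:
        print("panic_on_warn is set: a kernel WARNING will trigger a panic")
    else:
        print("panic_on_warn is not set: a kernel WARNING is only logged")
```

If this reports that panic_on_warn is not set, the "Command buffer error"
WARNING by itself should not be what brings the system down.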

I haven't seen this on any of our systems, so I'm guessing the affected
systems aren't running GNOME/KDE? Is there any chance I could see the
full "journalctl -b" log and the vmware.log file associated with those
warnings? They could give me some clues on how to reproduce this.

z