Hi,
I have been seeing "MES failed to respond to msg=MISC (WAIT_REG_MEM)" messages in my kernel log since 6.10-rc5.
This message is usually followed by "[drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait".

[ 8972.590502] input: Noble FoKus Mystique (AVRCP) as /devices/virtual/input/input21
[ 9964.748433] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
[ 9964.748433] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
[ 9964.748434] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
[ 9964.748434] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
[ 9964.748434] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
[ 9964.748434] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
[ 9964.748434] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
[ 9964.748434] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
[ 9964.748434] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
[ 9964.748434] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
[ 9964.748434] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
[ 9964.748434] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
[ 9964.748434] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
[ 9964.748434] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
[ 9964.748434] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
[ 9964.748434] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
[ 9964.748493] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[ 9964.748494] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[ 9964.748493] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[ 9964.748493] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[ 9964.748476] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[ 9964.748478] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[ 9964.748479] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[ 9964.748477] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[ 9964.748477] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[ 9964.748477] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[ 9964.748478] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[ 9964.748477] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[ 9964.748477] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[ 9964.748478] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[ 9964.748661] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[ 9964.748770] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[ 9977.224893] Bluetooth: hci0: ACL packet for unknown connection handle 3837
[ 9980.347061] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
[ 9980.347077] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[ 9980.349857] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
[ 9980.349857] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
[ 9980.349857] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
[ 9980.349858] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
[ 9980.349858] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
[ 9980.349858] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
[ 9980.349858] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
[ 9980.349858] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
[ 9980.349858] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
[ 9980.349858] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
[ 9980.349858] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
[ 9980.349859] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
[ 9980.349858] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
[ 9980.349858] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
[ 9980.349859] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
[ 9980.349870] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[ 9980.349868] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[ 9980.349870] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[ 9980.349890] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[ 9980.349866] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[ 9980.349865] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[ 9980.349865] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[ 9980.349866] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[ 9980.349866] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[ 9980.349867] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[ 9980.349867] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[ 9980.349869] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[ 9980.349871] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[ 9980.349871] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[ 9980.349871] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[10037.250083] Bluetooth: hci0: ACL packet for unknown connection handle 3837
[12054.238867] workqueue: gc_worker [nf_conntrack] hogged CPU for >10000us 1027 times, consider switching to WQ_UNBOUND
[12851.087896] fossilize_repla (45968) used greatest stack depth: 17440 bytes left

Unfortunately, it is not easily reproducible. It usually appears after I have played the game "STAR WARS Jedi: Survivor" for several hours.
That is why the bisect took so long.

git bisect start
# status: waiting for both good and bad commits
# bad: [f2661062f16b2de5d7b6a5c42a9a5c96326b8454] Linux 6.10-rc5
git bisect bad f2661062f16b2de5d7b6a5c42a9a5c96326b8454
# good: [50736169ecc8387247fe6a00932852ce7b057083] Merge tag 'for-6.10-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
git bisect good 50736169ecc8387247fe6a00932852ce7b057083
# bad: [d4ba3313e84dfcdeb92a13434a2d02aad5e973e1] Merge tag 'loongarch-fixes-6.10-2' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson
git bisect bad d4ba3313e84dfcdeb92a13434a2d02aad5e973e1
# good: [264efe488fd82cf3145a3dc625f394c61db99934] Merge tag 'ovl-fixes-6.10-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs
git bisect good 264efe488fd82cf3145a3dc625f394c61db99934
# bad: [35bb670d65fc0f80c62383ab4f2544cec85ac57a] Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
git bisect bad 35bb670d65fc0f80c62383ab4f2544cec85ac57a
# good: [f0d576f840153392d04b2d52cf3adab8f62e8cb6] drm/amdgpu: fix UBSAN warning in kv_dpm.c
git bisect good f0d576f840153392d04b2d52cf3adab8f62e8cb6
# bad: [07e06189c5ea7ffe897d12b546c918380d3bffb1] Merge tag 'amd-drm-fixes-6.10-2024-06-19' of https://gitlab.freedesktop.org/agd5f/linux into drm-fixes
git bisect bad 07e06189c5ea7ffe897d12b546c918380d3bffb1
# bad: [ed5a4484f074aa2bfb1dad99ff3628ea8da4acdc] drm/amdgpu: init TA fw for psp v14
git bisect bad ed5a4484f074aa2bfb1dad99ff3628ea8da4acdc
# bad: [e356d321d0240663a09b139fa3658ddbca163e27] drm/amdgpu: cleanup MES11 command submission
git bisect bad e356d321d0240663a09b139fa3658ddbca163e27
# first bad commit: [e356d321d0240663a09b139fa3658ddbca163e27] drm/amdgpu: cleanup MES11 command submission

Author: Christian König <christian.koenig@xxxxxxx>
Date: Fri May 31 10:56:00 2024 +0200

drm/amdgpu: cleanup MES11 command submission

The approach of having a separate WB slot for each submission doesn't
really work well and for example breaks GPU reset.

Use a status query packet for the fence update instead since those
should always succeed we can use the fence of the original packet to
signal the state of the operation.

While at it cleanup the coding style.

Fixes: eef016ba8986 ("drm/amdgpu/mes11: Use a separate fence per transaction")
Reviewed-by: Mukul Joshi <mukul.joshi@xxxxxxx>
Signed-off-by: Christian König <christian.koenig@xxxxxxx>
Signed-off-by: Alex Deucher <alexander.deucher@xxxxxxx>

I can confirm that after reverting e356d321d024 I played for a whole day, and the "MES failed to respond" error message no longer appeared.

My hardware specs are: https://linux-hardware.org/?probe=78d8c680db

Christian, can you look into it, please?

-- 
Best Regards,
Mike Gavrilov.
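
P.S. Since the errors arrive in bursts and are easy to miss in a long log, here is a rough sketch of how the two messages could be counted per second of uptime when checking whether a given kernel is good or bad. This is only an illustration: the function name count_mes_errors and the labels "mes_timeout" and "reg_wait_fail" are made up for this example, and the two search strings are simply copied from the log excerpt in this report.

```python
import re
from collections import Counter

# Message fragments copied from the log excerpt in this report.
MES_TIMEOUT = "MES failed to respond to msg=MISC (WAIT_REG_MEM)"
REG_WAIT_FAIL = "failed to reg_write_reg_wait"

# dmesg-style timestamp at the start of a line, e.g. "[ 9964.748433]".
TIMESTAMP = re.compile(r"^\[\s*(\d+)\.\d+\]")


def count_mes_errors(lines):
    """Count both error messages, grouped by whole second of uptime.

    `lines` is any iterable of log lines (an open file, a list, ...).
    Lines without a leading dmesg timestamp are ignored.
    Returns a Counter keyed by (second, label).
    """
    counts = Counter()
    for line in lines:
        match = TIMESTAMP.match(line)
        if match is None:
            continue
        second = int(match.group(1))
        if MES_TIMEOUT in line:
            counts[(second, "mes_timeout")] += 1
        elif REG_WAIT_FAIL in line:
            counts[(second, "reg_wait_fail")] += 1
    return counts
```

It can be fed a saved log, for example count_mes_errors(open("dmesg.txt")), where dmesg.txt is a hypothetical file containing the output of dmesg. An empty result after a long gaming session would suggest the kernel under test is good.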
Attachment: dmesg.zip
Attachment: .config.zip (Zip archive)