On Mon, Jul 22, 2024 at 4:50 AM Christian König
<ckoenig.leichtzumerken@xxxxxxxxx> wrote:
>
> That's a known issue and we are already working on it.

Do either of these patches help?
https://patchwork.freedesktop.org/patch/605437/
https://patchwork.freedesktop.org/patch/605201/

Alex

>
> Regards,
> Christian.
>
> On 20.07.24 at 19:08, Mikhail Gavrilov wrote:
> > Hi,
> > I have been seeing "MES failed to respond to msg=MISC (WAIT_REG_MEM)"
> > messages in my kernel log since 6.10-rc5.
> > They are usually followed by "[drm:amdgpu_mes_reg_write_reg_wait
> > [amdgpu]] *ERROR* failed to reg_write_reg_wait".
> >
> > [ 8972.590502] input: Noble FoKus Mystique (AVRCP) as
> > /devices/virtual/input/input21
> > [ 9964.748433] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> > msg=MISC (WAIT_REG_MEM)
> > [ 9964.748433] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> > msg=MISC (WAIT_REG_MEM)
> > [ 9964.748434] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> > msg=MISC (WAIT_REG_MEM)
> > [ 9964.748434] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> > msg=MISC (WAIT_REG_MEM)
> > [ 9964.748434] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> > msg=MISC (WAIT_REG_MEM)
> > [ 9964.748434] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> > msg=MISC (WAIT_REG_MEM)
> > [ 9964.748434] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> > msg=MISC (WAIT_REG_MEM)
> > [ 9964.748434] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> > msg=MISC (WAIT_REG_MEM)
> > [ 9964.748434] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> > msg=MISC (WAIT_REG_MEM)
> > [ 9964.748434] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> > msg=MISC (WAIT_REG_MEM)
> > [ 9964.748434] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> > msg=MISC (WAIT_REG_MEM)
> > [ 9964.748434] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> > msg=MISC (WAIT_REG_MEM)
> > [ 9964.748434] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> > msg=MISC (WAIT_REG_MEM)
> > [ 9964.748434] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> > msg=MISC (WAIT_REG_MEM)
> > [ 9964.748434] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> > msg=MISC (WAIT_REG_MEM)
> > [ 9964.748434] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> > msg=MISC (WAIT_REG_MEM)
> > [ 9964.748493] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> > failed to reg_write_reg_wait
> > [ 9964.748494] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> > failed to reg_write_reg_wait
> > [ 9964.748493] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> > failed to reg_write_reg_wait
> > [ 9964.748493] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> > failed to reg_write_reg_wait
> > [ 9964.748476] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> > failed to reg_write_reg_wait
> > [ 9964.748478] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> > failed to reg_write_reg_wait
> > [ 9964.748479] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> > failed to reg_write_reg_wait
> > [ 9964.748477] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> > failed to reg_write_reg_wait
> > [ 9964.748477] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> > failed to reg_write_reg_wait
> > [ 9964.748477] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> > failed to reg_write_reg_wait
> > [ 9964.748478] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> > failed to reg_write_reg_wait
> > [ 9964.748477] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> > failed to reg_write_reg_wait
> > [ 9964.748477] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> > failed to reg_write_reg_wait
> > [ 9964.748478] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> > failed to reg_write_reg_wait
> > [ 9964.748661] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> > failed to reg_write_reg_wait
> > [ 9964.748770] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> > failed to reg_write_reg_wait
> > [ 9977.224893] Bluetooth: hci0: ACL packet for unknown connection handle 3837
> > [ 9980.347061] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> > msg=MISC (WAIT_REG_MEM)
> > [ 9980.347077] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> > failed to reg_write_reg_wait
> > [ 9980.349857] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> > msg=MISC (WAIT_REG_MEM)
> > [ 9980.349857] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> > msg=MISC (WAIT_REG_MEM)
> > [ 9980.349857] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> > msg=MISC (WAIT_REG_MEM)
> > [ 9980.349858] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> > msg=MISC (WAIT_REG_MEM)
> > [ 9980.349858] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> > msg=MISC (WAIT_REG_MEM)
> > [ 9980.349858] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> > msg=MISC (WAIT_REG_MEM)
> > [ 9980.349858] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> > msg=MISC (WAIT_REG_MEM)
> > [ 9980.349858] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> > msg=MISC (WAIT_REG_MEM)
> > [ 9980.349858] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> > msg=MISC (WAIT_REG_MEM)
> > [ 9980.349858] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> > msg=MISC (WAIT_REG_MEM)
> > [ 9980.349858] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> > msg=MISC (WAIT_REG_MEM)
> > [ 9980.349859] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> > msg=MISC (WAIT_REG_MEM)
> > [ 9980.349858] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> > msg=MISC (WAIT_REG_MEM)
> > [ 9980.349858] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> > msg=MISC (WAIT_REG_MEM)
> > [ 9980.349859] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> > msg=MISC (WAIT_REG_MEM)
> > [ 9980.349870] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> > failed to reg_write_reg_wait
> > [ 9980.349868] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> > failed to reg_write_reg_wait
> > [ 9980.349870] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> > failed to reg_write_reg_wait
> > [ 9980.349890] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> > failed to reg_write_reg_wait
> > [ 9980.349866] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> > failed to reg_write_reg_wait
> > [ 9980.349865] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> > failed to reg_write_reg_wait
> > [ 9980.349865] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> > failed to reg_write_reg_wait
> > [ 9980.349866] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> > failed to reg_write_reg_wait
> > [ 9980.349866] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> > failed to reg_write_reg_wait
> > [ 9980.349867] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> > failed to reg_write_reg_wait
> > [ 9980.349867] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> > failed to reg_write_reg_wait
> > [ 9980.349869] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> > failed to reg_write_reg_wait
> > [ 9980.349871] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> > failed to reg_write_reg_wait
> > [ 9980.349871] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> > failed to reg_write_reg_wait
> > [ 9980.349871] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> > failed to reg_write_reg_wait
> > [10037.250083] Bluetooth: hci0: ACL packet for unknown connection handle 3837
> > [12054.238867] workqueue: gc_worker [nf_conntrack] hogged CPU for >10000us 1027 times, consider switching to WQ_UNBOUND
> > [12851.087896] fossilize_repla (45968) used greatest stack depth:
> > 17440 bytes left
> >
> > Unfortunately, it is not easily reproducible.
> > It usually appears after I have played the game "STAR WARS Jedi:
> > Survivor" for several hours.
> > That is why the bisect took so long.
> >
> > git bisect start
> > # status: waiting for both good and bad commits
> > # bad: [f2661062f16b2de5d7b6a5c42a9a5c96326b8454] Linux 6.10-rc5
> > git bisect bad f2661062f16b2de5d7b6a5c42a9a5c96326b8454
> > # good: [50736169ecc8387247fe6a00932852ce7b057083] Merge tag
> > 'for-6.10-rc4-tag' of
> > git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
> > git bisect good 50736169ecc8387247fe6a00932852ce7b057083
> > # bad: [d4ba3313e84dfcdeb92a13434a2d02aad5e973e1] Merge tag
> > 'loongarch-fixes-6.10-2' of
> > git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson
> > git bisect bad d4ba3313e84dfcdeb92a13434a2d02aad5e973e1
> > # good: [264efe488fd82cf3145a3dc625f394c61db99934] Merge tag
> > 'ovl-fixes-6.10-rc5' of
> > git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs
> > git bisect good 264efe488fd82cf3145a3dc625f394c61db99934
> > # bad: [35bb670d65fc0f80c62383ab4f2544cec85ac57a] Merge tag
> > 'scsi-fixes' of
> > git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
> > git bisect bad 35bb670d65fc0f80c62383ab4f2544cec85ac57a
> > # good: [f0d576f840153392d04b2d52cf3adab8f62e8cb6] drm/amdgpu: fix
> > UBSAN warning in kv_dpm.c
> > git bisect good f0d576f840153392d04b2d52cf3adab8f62e8cb6
> > # bad: [07e06189c5ea7ffe897d12b546c918380d3bffb1] Merge tag
> > 'amd-drm-fixes-6.10-2024-06-19' of
> > https://gitlab.freedesktop.org/agd5f/linux into drm-fixes
> > git bisect bad 07e06189c5ea7ffe897d12b546c918380d3bffb1
> > # bad: [ed5a4484f074aa2bfb1dad99ff3628ea8da4acdc] drm/amdgpu: init TA
> > fw for psp v14
> > git bisect bad ed5a4484f074aa2bfb1dad99ff3628ea8da4acdc
> > # bad: [e356d321d0240663a09b139fa3658ddbca163e27] drm/amdgpu: cleanup
> > MES11 command submission
> > git bisect bad e356d321d0240663a09b139fa3658ddbca163e27
> > # first bad commit: [e356d321d0240663a09b139fa3658ddbca163e27]
> > drm/amdgpu: cleanup MES11 command submission
> >
> > Author: Christian König <christian.koenig@xxxxxxx>
> > Date: Fri May 31 10:56:00 2024 +0200
> >
> >     drm/amdgpu: cleanup MES11 command submission
> >
> >     The approach of having a separate WB slot for each submission doesn't
> >     really work well and for example breaks GPU reset.
> >
> >     Use a status query packet for the fence update instead since those
> >     should always succeed we can use the fence of the original packet to
> >     signal the state of the operation.
> >
> >     While at it cleanup the coding style.
> >
> >     Fixes: eef016ba8986 ("drm/amdgpu/mes11: Use a separate fence per
> >     transaction")
> >     Reviewed-by: Mukul Joshi <mukul.joshi@xxxxxxx>
> >     Signed-off-by: Christian König <christian.koenig@xxxxxxx>
> >     Signed-off-by: Alex Deucher <alexander.deucher@xxxxxxx>
> >
> > And I can confirm that after reverting e356d321d024 I played the whole
> > day, and the "MES failed to respond" error message no longer appeared.
> >
> > My hardware specs are: https://linux-hardware.org/?probe=78d8c680db
> >
> > Christian, can you look into it, please?
> >
>
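
When comparing a kernel with and without a candidate fix (or the revert), it can be easier to count the failure bursts than to scan dmesg by eye. The following is only a minimal sketch: it assumes unwrapped kernel log lines in the same format as the report quoted in this thread, and the name summarize_mes_failures is hypothetical, not part of any amdgpu tooling.

```python
import re
from collections import Counter

# Matches unwrapped kernel log lines of the form seen in the report, e.g.
# [ 9964.748433] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
# The exact layout is an assumption taken from the quoted log.
MES_FAIL = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+amdgpu\s+\S+\s+amdgpu:\s+"
    r"MES failed to respond to\s+msg=(?P<msg>.+?)\s*$"
)


def summarize_mes_failures(lines):
    """Count 'MES failed to respond' messages per (uptime second, msg type)."""
    bursts = Counter()
    for line in lines:
        match = MES_FAIL.search(line)
        if match:
            second = int(float(match.group("ts")))
            bursts[(second, match.group("msg"))] += 1
    return bursts


def format_summary(bursts):
    """Render one text line per burst, ordered by uptime."""
    return [
        f"{second:>8}s  {count:>3}x  msg={msg}"
        for (second, msg), count in sorted(bursts.items())
    ]
```

For example, saving the output of `dmesg` to a file and passing the open file object to summarize_mes_failures gives one entry per second of uptime in which the message appeared. An empty result after a long gaming session on a patched kernel would suggest the patch helps, though given how rarely the problem reproduces it would not prove it.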