Re: 6.10/bisected/regression - Since commit e356d321d024 in the kernel log appears the message "MES failed to respond to msg=MISC (WAIT_REG_MEM)" which were never seen before

On Mon, Jul 22, 2024 at 4:50 AM Christian König
<ckoenig.leichtzumerken@xxxxxxxxx> wrote:
>
> That's a known issue and we are already working on it.

Do either of these patches help?
https://patchwork.freedesktop.org/patch/605437/
https://patchwork.freedesktop.org/patch/605201/

Alex

>
> Regards,
> Christian.
>
> Am 20.07.24 um 19:08 schrieb Mikhail Gavrilov:
> > Hi,
> > I have been seeing "MES failed to respond to msg=MISC (WAIT_REG_MEM)"
> > messages in my kernel log since 6.10-rc5.
> > This message is usually followed by
> > "[drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to
> > reg_write_reg_wait".
> >
> > [ 8972.590502] input: Noble FoKus Mystique (AVRCP) as
> > /devices/virtual/input/input21
> > [ 9964.748433] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> > msg=MISC (WAIT_REG_MEM)
> > [ 9964.748433] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> > msg=MISC (WAIT_REG_MEM)
> > [ 9964.748434] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> > msg=MISC (WAIT_REG_MEM)
> > [ 9964.748434] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> > msg=MISC (WAIT_REG_MEM)
> > [ 9964.748434] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> > msg=MISC (WAIT_REG_MEM)
> > [ 9964.748434] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> > msg=MISC (WAIT_REG_MEM)
> > [ 9964.748434] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> > msg=MISC (WAIT_REG_MEM)
> > [ 9964.748434] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> > msg=MISC (WAIT_REG_MEM)
> > [ 9964.748434] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> > msg=MISC (WAIT_REG_MEM)
> > [ 9964.748434] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> > msg=MISC (WAIT_REG_MEM)
> > [ 9964.748434] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> > msg=MISC (WAIT_REG_MEM)
> > [ 9964.748434] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> > msg=MISC (WAIT_REG_MEM)
> > [ 9964.748434] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> > msg=MISC (WAIT_REG_MEM)
> > [ 9964.748434] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> > msg=MISC (WAIT_REG_MEM)
> > [ 9964.748434] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> > msg=MISC (WAIT_REG_MEM)
> > [ 9964.748434] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> > msg=MISC (WAIT_REG_MEM)
> > [ 9964.748493] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> > failed to reg_write_reg_wait
> > [ 9964.748494] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> > failed to reg_write_reg_wait
> > [ 9964.748493] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> > failed to reg_write_reg_wait
> > [ 9964.748493] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> > failed to reg_write_reg_wait
> > [ 9964.748476] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> > failed to reg_write_reg_wait
> > [ 9964.748478] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> > failed to reg_write_reg_wait
> > [ 9964.748479] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> > failed to reg_write_reg_wait
> > [ 9964.748477] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> > failed to reg_write_reg_wait
> > [ 9964.748477] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> > failed to reg_write_reg_wait
> > [ 9964.748477] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> > failed to reg_write_reg_wait
> > [ 9964.748478] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> > failed to reg_write_reg_wait
> > [ 9964.748477] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> > failed to reg_write_reg_wait
> > [ 9964.748477] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> > failed to reg_write_reg_wait
> > [ 9964.748478] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> > failed to reg_write_reg_wait
> > [ 9964.748661] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> > failed to reg_write_reg_wait
> > [ 9964.748770] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> > failed to reg_write_reg_wait
> > [ 9977.224893] Bluetooth: hci0: ACL packet for unknown connection handle 3837
> > [ 9980.347061] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> > msg=MISC (WAIT_REG_MEM)
> > [ 9980.347077] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> > failed to reg_write_reg_wait
> > [ 9980.349857] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> > msg=MISC (WAIT_REG_MEM)
> > [ 9980.349857] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> > msg=MISC (WAIT_REG_MEM)
> > [ 9980.349857] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> > msg=MISC (WAIT_REG_MEM)
> > [ 9980.349858] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> > msg=MISC (WAIT_REG_MEM)
> > [ 9980.349858] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> > msg=MISC (WAIT_REG_MEM)
> > [ 9980.349858] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> > msg=MISC (WAIT_REG_MEM)
> > [ 9980.349858] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> > msg=MISC (WAIT_REG_MEM)
> > [ 9980.349858] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> > msg=MISC (WAIT_REG_MEM)
> > [ 9980.349858] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> > msg=MISC (WAIT_REG_MEM)
> > [ 9980.349858] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> > msg=MISC (WAIT_REG_MEM)
> > [ 9980.349858] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> > msg=MISC (WAIT_REG_MEM)
> > [ 9980.349859] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> > msg=MISC (WAIT_REG_MEM)
> > [ 9980.349858] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> > msg=MISC (WAIT_REG_MEM)
> > [ 9980.349858] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> > msg=MISC (WAIT_REG_MEM)
> > [ 9980.349859] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
> > msg=MISC (WAIT_REG_MEM)
> > [ 9980.349870] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> > failed to reg_write_reg_wait
> > [ 9980.349868] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> > failed to reg_write_reg_wait
> > [ 9980.349870] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> > failed to reg_write_reg_wait
> > [ 9980.349890] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> > failed to reg_write_reg_wait
> > [ 9980.349866] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> > failed to reg_write_reg_wait
> > [ 9980.349865] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> > failed to reg_write_reg_wait
> > [ 9980.349865] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> > failed to reg_write_reg_wait
> > [ 9980.349866] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> > failed to reg_write_reg_wait
> > [ 9980.349866] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> > failed to reg_write_reg_wait
> > [ 9980.349867] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> > failed to reg_write_reg_wait
> > [ 9980.349867] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> > failed to reg_write_reg_wait
> > [ 9980.349869] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> > failed to reg_write_reg_wait
> > [ 9980.349871] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> > failed to reg_write_reg_wait
> > [ 9980.349871] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> > failed to reg_write_reg_wait
> > [ 9980.349871] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
> > failed to reg_write_reg_wait
> > [10037.250083] Bluetooth: hci0: ACL packet for unknown connection handle 3837
> > [12054.238867] workqueue: gc_worker [nf_conntrack] hogged CPU for >10000us 1027 times, consider switching to WQ_UNBOUND
> > [12851.087896] fossilize_repla (45968) used greatest stack depth:
> > 17440 bytes left
> >
> > Unfortunately, it is not easily reproducible.
> > It usually appears after I have played the game "STAR WARS
> > Jedi: Survivor" for several hours.
> > That is why the bisection took so long.
> >
> > git bisect start
> > # status: waiting for both good and bad commits
> > # bad: [f2661062f16b2de5d7b6a5c42a9a5c96326b8454] Linux 6.10-rc5
> > git bisect bad f2661062f16b2de5d7b6a5c42a9a5c96326b8454
> > # good: [50736169ecc8387247fe6a00932852ce7b057083] Merge tag
> > 'for-6.10-rc4-tag' of
> > git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
> > git bisect good 50736169ecc8387247fe6a00932852ce7b057083
> > # bad: [d4ba3313e84dfcdeb92a13434a2d02aad5e973e1] Merge tag
> > 'loongarch-fixes-6.10-2' of
> > git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson
> > git bisect bad d4ba3313e84dfcdeb92a13434a2d02aad5e973e1
> > # good: [264efe488fd82cf3145a3dc625f394c61db99934] Merge tag
> > 'ovl-fixes-6.10-rc5' of
> > git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs
> > git bisect good 264efe488fd82cf3145a3dc625f394c61db99934
> > # bad: [35bb670d65fc0f80c62383ab4f2544cec85ac57a] Merge tag
> > 'scsi-fixes' of
> > git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
> > git bisect bad 35bb670d65fc0f80c62383ab4f2544cec85ac57a
> > # good: [f0d576f840153392d04b2d52cf3adab8f62e8cb6] drm/amdgpu: fix
> > UBSAN warning in kv_dpm.c
> > git bisect good f0d576f840153392d04b2d52cf3adab8f62e8cb6
> > # bad: [07e06189c5ea7ffe897d12b546c918380d3bffb1] Merge tag
> > 'amd-drm-fixes-6.10-2024-06-19' of
> > https://gitlab.freedesktop.org/agd5f/linux into drm-fixes
> > git bisect bad 07e06189c5ea7ffe897d12b546c918380d3bffb1
> > # bad: [ed5a4484f074aa2bfb1dad99ff3628ea8da4acdc] drm/amdgpu: init TA
> > fw for psp v14
> > git bisect bad ed5a4484f074aa2bfb1dad99ff3628ea8da4acdc
> > # bad: [e356d321d0240663a09b139fa3658ddbca163e27] drm/amdgpu: cleanup
> > MES11 command submission
> > git bisect bad e356d321d0240663a09b139fa3658ddbca163e27
> > # first bad commit: [e356d321d0240663a09b139fa3658ddbca163e27]
> > drm/amdgpu: cleanup MES11 command submission
> >
> > Author: Christian König <christian.koenig@xxxxxxx>
> > Date:   Fri May 31 10:56:00 2024 +0200
> >
> >      drm/amdgpu: cleanup MES11 command submission
> >
> >      The approach of having a separate WB slot for each submission doesn't
> >      really work well and for example breaks GPU reset.
> >
> >      Use a status query packet for the fence update instead since those
> >      should always succeed we can use the fence of the original packet to
> >      signal the state of the operation.
> >
> >      While at it cleanup the coding style.
> >
> >      Fixes: eef016ba8986 ("drm/amdgpu/mes11: Use a separate fence per
> > transaction")
> >      Reviewed-by: Mukul Joshi <mukul.joshi@xxxxxxx>
> >      Signed-off-by: Christian König <christian.koenig@xxxxxxx>
> >      Signed-off-by: Alex Deucher <alexander.deucher@xxxxxxx>
> >
> > I can confirm that after reverting e356d321d024 I played for a whole
> > day and the "MES failed to respond" error message no longer appeared.
> >
> > My hardware specs are: https://linux-hardware.org/?probe=78d8c680db
> >
> > Christian, can you look into it, please?
> >
>
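
Since the problem takes hours to show up, it may be easier to compare kernels by counting the messages in a saved log than by watching for them. The following is only a minimal sketch, not part of the driver or of either patch: the function name count_mes_timeouts and the SAMPLE text are made up for illustration, and it assumes the log was saved with one message per line (for example from "dmesg"), not line-wrapped as in the quoted mail above.

```python
import re
from collections import Counter

# Matches an unwrapped kernel log line such as:
# [ 9964.748433] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
MES_TIMEOUT = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s.*MES failed to respond to msg=(?P<msg>.+)$"
)


def count_mes_timeouts(log_text):
    """Return a Counter mapping whole-second uptime to the number of MES timeouts."""
    bursts = Counter()
    for line in log_text.splitlines():
        match = MES_TIMEOUT.match(line)
        if match:
            # Truncate the timestamp to whole seconds so one burst lands in one bucket.
            bursts[int(float(match.group("ts")))] += 1
    return bursts


# Hypothetical sample, shortened from the log quoted in this thread.
SAMPLE = """\
[ 9964.748433] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
[ 9964.748434] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
[ 9964.748493] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[ 9977.224893] Bluetooth: hci0: ACL packet for unknown connection handle 3837
[ 9980.347061] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
"""

if __name__ == "__main__":
    for second, count in sorted(count_mes_timeouts(SAMPLE).items()):
        print(f"uptime {second}s: {count} MES timeout(s)")
```

An empty result after a long gaming session on a patched kernel would suggest the patch helps, although with a bug this hard to reproduce a single clean run is not conclusive.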



