RE: [Bug report] Regression with kernel v6.13-rc2

"Li, Yunxiang (Teddy)" <Yunxiang.Li@xxxxxxx> · Wed, 18 Dec 2024 16:25:57 +0000



[Public]

+Felix

> From: Tobias Klausmann <klausman@xxxxxxxxxxxxxxx>
> Sent: Wednesday, December 18, 2024 10:54
>
> I have been hitting kernel messages from AMDGPU since v6.13-rc2, for
> example:
>
> [Wed Dec 18 15:56:24 2024] gmc_v11_0_process_interrupt: 10 callbacks
> suppressed [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: [gfxhub]
> page fault (src_id:0 ring:169 vmid:0 pasid:0)
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu:   in page starting at
> address 0x0000000000000000 from client 10
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu:
> GCVM_L2_PROTECTION_FAULT_STATUS:0x00040B52
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu:          Faulty UTCL2
> client ID: CPC (0x5)
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu:
> MORE_FAULTS: 0x0
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu:
> WALKER_ERROR: 0x1
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu:
> PERMISSION_FAULTS: 0x5
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu:
> MAPPING_ERROR: 0x1
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu:          RW: 0x1
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault
> (src_id:0 ring:153 vmid:0 pasid:0)
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu:   in page starting at
> address 0x0000000000000000 from client 10
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu:
> GCVM_L2_PROTECTION_FAULT_STATUS:0x00000B33
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu:          Faulty UTCL2
> client ID: CPC (0x5)
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu:
> MORE_FAULTS: 0x1
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu:
> WALKER_ERROR: 0x1
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu:
> PERMISSION_FAULTS: 0x3
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu:
> MAPPING_ERROR: 0x1
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu:          RW: 0x0
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault
> (src_id:0 ring:169 vmid:0 pasid:0)
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu:   in page starting at
> address 0x0000000000000000 from client 10
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault
> (src_id:0 ring:153 vmid:0 pasid:0)
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu:   in page starting at
> address 0x0000000000000000 from client 10
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault
> (src_id:0 ring:169 vmid:0 pasid:0)
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu:   in page starting at
> address 0x0000000000000000 from client 10
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault
> (src_id:0 ring:153 vmid:0 pasid:0)
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu:   in page starting at
> address 0x0000000000000000 from client 10
>
> This happens when loading nontrivial (~6g) models using PyTorch. There is no
> immediate crash, but if exercise the model for a few minutes, evetually, the GPU
> crashes (sometimes the whole machine).
>
> I bisected this betwee -rc1 (which works fine) and -rc2, and I landed on this commit:
>
> commit 438b39ac74e2a9dc0a5c9d653b7d8066877e86b1
> Author: Jesse.zhang@xxxxxxx <Jesse.zhang@xxxxxxx>
> Date:   Thu Dec 5 17:41:26 2024 +0800
>
>     drm/amdkfd: pause autosuspend when creating pdd
>
>     When using MES creating a pdd will require talking to the GPU to
>     setup the relevant context. The code here forgot to wake up the GPU
>     in case it was in suspend, this causes KVM to EFAULT for passthrough
>     GPU for example. This issue can be masked if the GPU was woken up by
>     other things (e.g. opening the KMS node) first and have not yet gone to sleep.
>
>     v4: do the allocation of proc_ctx_bo in a lazy fashion
>     when the first queue is created in a process (Felix)
>
>     Signed-off-by: Jesse Zhang <jesse.zhang@xxxxxxx>
>     Reviewed-by: Yunxiang Li <Yunxiang.Li@xxxxxxx>
>     Signed-off-by: Alex Deucher <alexander.deucher@xxxxxxx>
>     Cc: stable@xxxxxxxxxxxxxxx
>
>  .../gpu/drm/amd/amdkfd/kfd_device_queue_manager.c  | 15 ++++++++++++++
>  drivers/gpu/drm/amd/amdkfd/kfd_process.c           | 23 ++--------------------
>  2 files changed, 17 insertions(+), 21 deletions(-)
>
> I am not sure what the causal relation ship between the commit and the messages I
> get is, but I thought this report might be useful.
>
> Since I am not subscribed to the list, please CC me on replies. Thank you!
>
> Best,
> Tobias