[Bug report] Regression with kernel v6.13-rc2

Tobias Klausmann <klausman@xxxxxxxxxxxxxxx> · Wed, 18 Dec 2024 16:53:43 +0100

Hi!

I have been hitting kernel messages from AMDGPU since v6.13-rc2, for
example:

[Wed Dec 18 15:56:24 2024] gmc_v11_0_process_interrupt: 10 callbacks suppressed
[Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:169 vmid:0 pasid:0)
[Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu:   in page starting at address 0x0000000000000000 from client 10
[Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00040B52
[Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu:          Faulty UTCL2 client ID: CPC (0x5)
[Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu:          MORE_FAULTS: 0x0
[Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu:          WALKER_ERROR: 0x1
[Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu:          PERMISSION_FAULTS: 0x5
[Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu:          MAPPING_ERROR: 0x1
[Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu:          RW: 0x1
[Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:153 vmid:0 pasid:0)
[Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu:   in page starting at address 0x0000000000000000 from client 10
[Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000B33
[Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu:          Faulty UTCL2 client ID: CPC (0x5)
[Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu:          MORE_FAULTS: 0x1
[Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu:          WALKER_ERROR: 0x1
[Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu:          PERMISSION_FAULTS: 0x3
[Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu:          MAPPING_ERROR: 0x1
[Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu:          RW: 0x0
[Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:169 vmid:0 pasid:0)
[Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu:   in page starting at address 0x0000000000000000 from client 10
[Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:153 vmid:0 pasid:0)
[Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu:   in page starting at address 0x0000000000000000 from client 10
[Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:169 vmid:0 pasid:0)
[Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu:   in page starting at address 0x0000000000000000 from client 10
[Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:153 vmid:0 pasid:0)
[Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu:   in page starting at address 0x0000000000000000 from client 10
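
The decoded fields above can be reproduced from the raw
GCVM_L2_PROTECTION_FAULT_STATUS words. Below is a minimal sketch in
Python. The bit layout in it is an assumption, inferred only from the
two status values in this log (0x00040B52 and 0x00000B33) and not taken
from the driver headers. The names FIELDS and decode_fault_status are
made up for illustration.

```python
# Assumed bit layout of GCVM_L2_PROTECTION_FAULT_STATUS as (shift, width).
# Inferred from the two decoded status words in the log above; it is NOT
# copied from the amdgpu register headers and may be incomplete.
FIELDS = {
    "MORE_FAULTS": (0, 1),
    "WALKER_ERROR": (1, 3),
    "PERMISSION_FAULTS": (4, 4),
    "MAPPING_ERROR": (8, 1),
    "CID": (9, 9),  # "Faulty UTCL2 client ID" in the log; width is a guess
    "RW": (18, 1),
}


def decode_fault_status(status):
    """Split a raw status word into named fields using the assumed layout."""
    return {
        name: (status >> shift) & ((1 << width) - 1)
        for name, (shift, width) in FIELDS.items()
    }


if __name__ == "__main__":
    # The two raw values that appear in the log above.
    for raw in (0x00040B52, 0x00000B33):
        decoded = decode_fault_status(raw)
        print(f"0x{raw:08X}:")
        for name, value in decoded.items():
            print(f"    {name}: 0x{value:x}")
```

Under that assumed layout, both words decode to the same field values
the kernel printed, so the two faults differ only in MORE_FAULTS,
PERMISSION_FAULTS and RW.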

This happens when loading nontrivial (~6g) models using PyTorch. There
is no immediate crash, but if I exercise the model for a few minutes,
the GPU eventually crashes (and sometimes the whole machine with it).

I bisected this between -rc1 (which works fine) and -rc2, and I landed on
this commit:

commit 438b39ac74e2a9dc0a5c9d653b7d8066877e86b1
Author: Jesse.zhang@xxxxxxx <Jesse.zhang@xxxxxxx>
Date:   Thu Dec 5 17:41:26 2024 +0800

    drm/amdkfd: pause autosuspend when creating pdd

    When using MES creating a pdd will require talking to the GPU to
    setup the relevant context. The code here forgot to wake up the GPU
    in case it was in suspend, this causes KVM to EFAULT for passthrough
    GPU for example. This issue can be masked if the GPU was woken up by
    other things (e.g. opening the KMS node) first and have not yet gone to sleep.

    v4: do the allocation of proc_ctx_bo in a lazy fashion
    when the first queue is created in a process (Felix)

    Signed-off-by: Jesse Zhang <jesse.zhang@xxxxxxx>
    Reviewed-by: Yunxiang Li <Yunxiang.Li@xxxxxxx>
    Signed-off-by: Alex Deucher <alexander.deucher@xxxxxxx>
    Cc: stable@xxxxxxxxxxxxxxx

 .../gpu/drm/amd/amdkfd/kfd_device_queue_manager.c  | 15 ++++++++++++++
 drivers/gpu/drm/amd/amdkfd/kfd_process.c           | 23 ++--------------------
 2 files changed, 17 insertions(+), 21 deletions(-)

I am not sure what the causal relationship between the commit and the
messages I get is, but I thought this report might be useful.

Since I am not subscribed to the list, please CC me on replies. Thank
you!

Best,
Tobias