Hi!

I have been hitting kernel messages from AMDGPU since v6.13-rc2, for example:

[Wed Dec 18 15:56:24 2024] gmc_v11_0_process_interrupt: 10 callbacks suppressed
[Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:169 vmid:0 pasid:0)
[Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: in page starting at address 0x0000000000000000 from client 10
[Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00040B52
[Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: Faulty UTCL2 client ID: CPC (0x5)
[Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: MORE_FAULTS: 0x0
[Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: WALKER_ERROR: 0x1
[Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: PERMISSION_FAULTS: 0x5
[Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: MAPPING_ERROR: 0x1
[Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: RW: 0x1
[Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:153 vmid:0 pasid:0)
[Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: in page starting at address 0x0000000000000000 from client 10
[Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000B33
[Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: Faulty UTCL2 client ID: CPC (0x5)
[Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: MORE_FAULTS: 0x1
[Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: WALKER_ERROR: 0x1
[Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: PERMISSION_FAULTS: 0x3
[Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: MAPPING_ERROR: 0x1
[Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: RW: 0x0
[Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:169 vmid:0 pasid:0)
[Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: in page starting at address 0x0000000000000000 from client 10
[Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:153 vmid:0 pasid:0)
[Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: in page starting at address 0x0000000000000000 from client 10
[Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:169 vmid:0 pasid:0)
[Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: in page starting at address 0x0000000000000000 from client 10
[Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:153 vmid:0 pasid:0)
[Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: in page starting at address 0x0000000000000000 from client 10

This happens when loading nontrivial (~6 GB) models using PyTorch. There is no immediate crash, but if I exercise the model for a few minutes, the GPU eventually crashes (sometimes the whole machine).
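For what it's worth, the workload is roughly of the following shape (a simplified stand-in, not the exact script I run; the real code loads a pretrained model, while this just builds a model of comparable size and exercises it on the GPU):

# Simplified stand-in for the workload: the real script loads a pretrained
# model, this just allocates ~6 GB of parameters and runs them in a loop.
import torch
import torch.nn as nn

device = torch.device("cuda")  # the ROCm GPU, exposed through PyTorch's CUDA API

# ~45 x Linear(8192, 8192) in fp16 comes to roughly 6 GB of parameters.
model = nn.Sequential(*[nn.Linear(8192, 8192) for _ in range(45)])
model = model.half().to(device).eval()

x = torch.randn(64, 8192, dtype=torch.float16, device=device)
with torch.no_grad():
    for _ in range(10000):  # keep the GPU busy for a few minutes
        y = model(x)
torch.cuda.synchronize()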
I bisected this between -rc1 (which works fine) and -rc2, and I landed on this commit:

commit 438b39ac74e2a9dc0a5c9d653b7d8066877e86b1
Author: Jesse.zhang@xxxxxxx <Jesse.zhang@xxxxxxx>
Date:   Thu Dec 5 17:41:26 2024 +0800

    drm/amdkfd: pause autosuspend when creating pdd

    When using MES creating a pdd will require talking to the GPU to setup
    the relevant context. The code here forgot to wake up the GPU in case
    it was in suspend, this causes KVM to EFAULT for passthrough GPU for
    example. This issue can be masked if the GPU was woken up by other
    things (e.g. opening the KMS node) first and have not yet gone to
    sleep.

    v4: do the allocation of proc_ctx_bo in a lazy fashion when the first
        queue is created in a process (Felix)

    Signed-off-by: Jesse Zhang <jesse.zhang@xxxxxxx>
    Reviewed-by: Yunxiang Li <Yunxiang.Li@xxxxxxx>
    Signed-off-by: Alex Deucher <alexander.deucher@xxxxxxx>
    Cc: stable@xxxxxxxxxxxxxxx

 .../gpu/drm/amd/amdkfd/kfd_device_queue_manager.c | 15 ++++++++++++++
 drivers/gpu/drm/amd/amdkfd/kfd_process.c          | 23 ++--------------------
 2 files changed, 17 insertions(+), 21 deletions(-)

I am not sure what the causal relationship between this commit and the messages I get is, but I thought this report might be useful.

Since I am not subscribed to the list, please CC me on replies. Thank you!

Best,
Tobias