[Public]

+Felix

> From: Tobias Klausmann <klausman@xxxxxxxxxxxxxxx>
> Sent: Wednesday, December 18, 2024 10:54
>
> I have been hitting kernel messages from AMDGPU since v6.13-rc2, for
> example:
>
> [Wed Dec 18 15:56:24 2024] gmc_v11_0_process_interrupt: 10 callbacks suppressed
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:169 vmid:0 pasid:0)
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: in page starting at address 0x0000000000000000 from client 10
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00040B52
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: Faulty UTCL2 client ID: CPC (0x5)
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: MORE_FAULTS: 0x0
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: WALKER_ERROR: 0x1
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: PERMISSION_FAULTS: 0x5
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: MAPPING_ERROR: 0x1
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: RW: 0x1
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:153 vmid:0 pasid:0)
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: in page starting at address 0x0000000000000000 from client 10
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000B33
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: Faulty UTCL2 client ID: CPC (0x5)
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: MORE_FAULTS: 0x1
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: WALKER_ERROR: 0x1
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: PERMISSION_FAULTS: 0x3
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: MAPPING_ERROR: 0x1
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: RW: 0x0
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:169 vmid:0 pasid:0)
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: in page starting at address 0x0000000000000000 from client 10
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:153 vmid:0 pasid:0)
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: in page starting at address 0x0000000000000000 from client 10
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:169 vmid:0 pasid:0)
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: in page starting at address 0x0000000000000000 from client 10
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:153 vmid:0 pasid:0)
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: in page starting at address 0x0000000000000000 from client 10
>
> This happens when loading nontrivial (~6g) models using PyTorch. There is no
> immediate crash, but if I exercise the model for a few minutes, eventually the GPU
> crashes (sometimes the whole machine).
>
> I bisected this between -rc1 (which works fine) and -rc2, and I landed on this commit:
>
> commit 438b39ac74e2a9dc0a5c9d653b7d8066877e86b1
> Author: Jesse.zhang@xxxxxxx <Jesse.zhang@xxxxxxx>
> Date:   Thu Dec 5 17:41:26 2024 +0800
>
>     drm/amdkfd: pause autosuspend when creating pdd
>
>     When using MES creating a pdd will require talking to the GPU to
>     setup the relevant context. The code here forgot to wake up the GPU
>     in case it was in suspend, this causes KVM to EFAULT for passthrough
>     GPU for example. This issue can be masked if the GPU was woken up by
>     other things (e.g. opening the KMS node) first and have not yet gone to sleep.
>
>     v4: do the allocation of proc_ctx_bo in a lazy fashion
>     when the first queue is created in a process (Felix)
>
>     Signed-off-by: Jesse Zhang <jesse.zhang@xxxxxxx>
>     Reviewed-by: Yunxiang Li <Yunxiang.Li@xxxxxxx>
>     Signed-off-by: Alex Deucher <alexander.deucher@xxxxxxx>
>     Cc: stable@xxxxxxxxxxxxxxx
>
>  .../gpu/drm/amd/amdkfd/kfd_device_queue_manager.c | 15 ++++++++++++++
>  drivers/gpu/drm/amd/amdkfd/kfd_process.c          | 23 ++--------------------
>  2 files changed, 17 insertions(+), 21 deletions(-)
>
> I am not sure what the causal relationship between the commit and the messages I
> get is, but I thought this report might be useful.
>
> Since I am not subscribed to the list, please CC me on replies. Thank you!
>
> Best,
> Tobias
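
For anyone comparing further fault dumps against the two quoted above, here is a minimal Python sketch that splits a raw GCVM_L2_PROTECTION_FAULT_STATUS value into the fields the driver prints. The bit positions and widths are an assumption, inferred only from the two samples in this report (0x00040B52 and 0x00000B33) and not taken from the register headers, so please check them against the GC 11 definitions before relying on the output. The names FIELDS and decode_fault_status are made up for this illustration.

```python
# Assumed bit layout: (field name, shift, width). These positions are
# inferred from the two status values in the quoted log, not from the
# hardware register headers.
FIELDS = [
    ("MORE_FAULTS", 0, 1),
    ("WALKER_ERROR", 1, 3),
    ("PERMISSION_FAULTS", 4, 4),
    ("MAPPING_ERROR", 8, 1),
    ("CID", 9, 9),  # printed by the driver as "Faulty UTCL2 client ID"
    ("RW", 18, 1),
]


def decode_fault_status(status):
    """Return a dict mapping each assumed field name to its value."""
    return {
        name: (status >> shift) & ((1 << width) - 1)
        for name, shift, width in FIELDS
    }


if __name__ == "__main__":
    # The two raw values reported in the quoted kernel log.
    for raw in (0x00040B52, 0x00000B33):
        print(f"GCVM_L2_PROTECTION_FAULT_STATUS:0x{raw:08X}")
        for name, value in decode_fault_status(raw).items():
            print(f"    {name}: 0x{value:x}")
```

With this layout both samples decode to the same field values the kernel printed (CID 0x5, WALKER_ERROR 0x1, MAPPING_ERROR 0x1, and so on), which is the only evidence for the layout offered here.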