[Public]

Hi Tobias and Teddy,

-----Original Message-----
From: Li, Yunxiang (Teddy) <Yunxiang.Li@xxxxxxx>
Sent: Thursday, December 19, 2024 12:46 AM
To: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Tobias Klausmann <klausman@xxxxxxxxxxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
Cc: Zhang, Jesse(Jie) <Jesse.Zhang@xxxxxxx>; Kuehling, Felix <Felix.Kuehling@xxxxxxx>
Subject: RE: [Bug report] Regression with kernel v6.13-rc2

[Public]

> From: Tobias Klausmann <klausman@xxxxxxxxxxxxxxx>
> Sent: Wednesday, December 18, 2024 10:54
>
> Hi!
>
> I have been hitting kernel messages from AMDGPU since v6.13-rc2, for
> example:
>
> [Wed Dec 18 15:56:24 2024] gmc_v11_0_process_interrupt: 10 callbacks suppressed
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:169 vmid:0 pasid:0)
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: in page starting at address 0x0000000000000000 from client 10
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00040B52
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: Faulty UTCL2 client ID: CPC (0x5)
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: MORE_FAULTS: 0x0
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: WALKER_ERROR: 0x1
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: PERMISSION_FAULTS: 0x5
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: MAPPING_ERROR: 0x1
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: RW: 0x1
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:153 vmid:0 pasid:0)
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: in page starting at address 0x0000000000000000 from client 10
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000B33
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: Faulty UTCL2 client ID: CPC (0x5)
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: MORE_FAULTS: 0x1
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: WALKER_ERROR: 0x1
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: PERMISSION_FAULTS: 0x3
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: MAPPING_ERROR: 0x1
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: RW: 0x0
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:169 vmid:0 pasid:0)
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: in page starting at address 0x0000000000000000 from client 10
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:153 vmid:0 pasid:0)
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: in page starting at address 0x0000000000000000 from client 10
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:169 vmid:0 pasid:0)
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: in page starting at address 0x0000000000000000 from client 10
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:153 vmid:0 pasid:0)
> [Wed Dec 18 15:56:24 2024] amdgpu 0000:03:00.0: amdgpu: in page starting at address 0x0000000000000000 from client 10
>
> This happens when loading nontrivial (~6g) models using PyTorch. There
> is no immediate crash, but if I exercise the model for a few minutes,
> eventually the GPU crashes (sometimes the whole machine).
>
> I bisected this between -rc1 (which works fine) and -rc2, and I landed on this commit:
>

Hi Tobias,
With this patch, PyTorch works on my side. Please help verify it on your side.
https://lists.freedesktop.org/archives/amd-gfx/2024-December/118058.html

> commit 438b39ac74e2a9dc0a5c9d653b7d8066877e86b1
> Author: Jesse.zhang@xxxxxxx <Jesse.zhang@xxxxxxx>
> Date:   Thu Dec 5 17:41:26 2024 +0800
>
>     drm/amdkfd: pause autosuspend when creating pdd
>
>     When using MES creating a pdd will require talking to the GPU to
>     setup the relevant context. The code here forgot to wake up the GPU
>     in case it was in suspend, this causes KVM to EFAULT for passthrough
>     GPU for example. This issue can be masked if the GPU was woken up by
>     other things (e.g. opening the KMS node) first and have not yet gone to sleep.
>
>     v4: do the allocation of proc_ctx_bo in a lazy fashion
>     when the first queue is created in a process (Felix)
>
>     Signed-off-by: Jesse Zhang <jesse.zhang@xxxxxxx>
>     Reviewed-by: Yunxiang Li <Yunxiang.Li@xxxxxxx>
>     Signed-off-by: Alex Deucher <alexander.deucher@xxxxxxx>
>     Cc: stable@xxxxxxxxxxxxxxx
>
>  .../gpu/drm/amd/amdkfd/kfd_device_queue_manager.c | 15 ++++++++++++++
>  drivers/gpu/drm/amd/amdkfd/kfd_process.c          | 23 ++--------------------
>  2 files changed, 17 insertions(+), 21 deletions(-)
>
> I am not sure what the causal relationship between the commit and the
> messages I get is, but I thought this report might be useful.

If I had to guess, I'd say that something uses pdd->proc_ctx_gpu_addr before add_queue_mes is called, and since this patch moved the init into add_queue_mes, null is passed to the GPU and we get the page fault.

Hi Teddy,
The MES debugger gets enabled before the MES queue is added, and the MES debugger uses pdd->proc_ctx_gpu_addr.

Thanks,
Jesse

+Alex as well for awareness.

> Since I am not subscribed to the list, please CC me on replies. Thank you!
>
> Best,
> Tobias
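
A side note for anyone triaging similar logs: the fields printed after each GCVM_L2_PROTECTION_FAULT_STATUS value can be recomputed from the raw register value. The sketch below is only an illustration. The bit positions and widths are an assumption (they are not stated anywhere in this thread) and have been checked only against the two status values quoted in the log above; decode_fault_status is a hypothetical helper, not a driver function.

```python
# Field layout as (shift, width). This is an ASSUMPTION about the
# GCVM_L2_PROTECTION_FAULT_STATUS register, cross-checked only against
# the two status values quoted in the log in this thread.
FAULT_STATUS_FIELDS = {
    "MORE_FAULTS": (0, 1),
    "WALKER_ERROR": (1, 3),
    "PERMISSION_FAULTS": (4, 4),
    "MAPPING_ERROR": (8, 1),
    "CID": (9, 9),  # printed in the log as "Faulty UTCL2 client ID"
    "RW": (18, 1),
}


def decode_fault_status(status):
    """Split a raw fault status value into its named bit fields."""
    return {
        name: (status >> shift) & ((1 << width) - 1)
        for name, (shift, width) in FAULT_STATUS_FIELDS.items()
    }


if __name__ == "__main__":
    # The two raw values reported in the log.
    for raw in (0x00040B52, 0x00000B33):
        fields = decode_fault_status(raw)
        decoded = ", ".join(f"{k}=0x{v:x}" for k, v in fields.items())
        print(f"0x{raw:08X}: {decoded}")
```

Under that assumed layout, both values decode to the same numbers the kernel printed (client ID 0x5 in both cases, RW 0x1 for the first fault and 0x0 for the second), so the helper can serve as a quick sanity check when only the raw status value is available.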