On Wed, Apr 26, 2023 at 7:00 AM Chen, Guchun <Guchun.Chen@xxxxxxx> wrote:
>
> After reviewing this whole history, maybe attached patch is able to fix your problem. Can you have a try please?
>
> Regards,
> Guchun
>

Thanks, I tested this patch for 6 days.
The error "BUG: KASAN: null-ptr-deref in drm_sched_job_cleanup+0x96" has not appeared since.
Instead, I began to notice GPU hangs which happen randomly after "[gfxhub] page fault".
I am not sure whether there is anything useful to see in the page fault messages:

amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:40 vmid:1 pasid:32779, for process steamwebhelper pid 15552 thread steamwebhe:cs0 pid 15832)
amdgpu 0000:03:00.0: amdgpu: in page starting at address 0x00008001012c3000 from client 0x1b (UTCL2)
amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00141051
amdgpu 0000:03:00.0: amdgpu: Faulty UTCL2 client ID: TCP (0x8)
amdgpu 0000:03:00.0: amdgpu: MORE_FAULTS: 0x1
amdgpu 0000:03:00.0: amdgpu: WALKER_ERROR: 0x0
amdgpu 0000:03:00.0: amdgpu: PERMISSION_FAULTS: 0x5
amdgpu 0000:03:00.0: amdgpu: MAPPING_ERROR: 0x0
amdgpu 0000:03:00.0: amdgpu: RW: 0x1

amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:2 pasid:32794, for process EvilDead-Win64- pid 12883 thread EvilDead-W:cs0 pid 13035)
amdgpu 0000:03:00.0: amdgpu: in page starting at address 0x00008001e62a5000 from client 10
amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00201030
amdgpu 0000:03:00.0: amdgpu: Faulty UTCL2 client ID: TCP (0x8)
amdgpu 0000:03:00.0: amdgpu: MORE_FAULTS: 0x0
amdgpu 0000:03:00.0: amdgpu: WALKER_ERROR: 0x0
amdgpu 0000:03:00.0: amdgpu: PERMISSION_FAULTS: 0x3
amdgpu 0000:03:00.0: amdgpu: MAPPING_ERROR: 0x0
amdgpu 0000:03:00.0: amdgpu: RW: 0x0

amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:1 pasid:32770, for process Xwayland pid 3706 thread Xwayland:cs0 pid 3713)
amdgpu 0000:03:00.0: amdgpu: in page starting at address 0x0000800100c04000 from client 10
amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00101031
amdgpu 0000:03:00.0: amdgpu: Faulty UTCL2 client ID: TCP (0x8)
amdgpu 0000:03:00.0: amdgpu: MORE_FAULTS: 0x1
amdgpu 0000:03:00.0: amdgpu: WALKER_ERROR: 0x0
amdgpu 0000:03:00.0: amdgpu: PERMISSION_FAULTS: 0x3
amdgpu 0000:03:00.0: amdgpu: MAPPING_ERROR: 0x0
amdgpu 0000:03:00.0: amdgpu: RW: 0x0

amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:40 vmid:2 pasid:32784, for process thedivision.exe pid 168608 thread thedivision.exe pid 168733)
amdgpu 0000:03:00.0: amdgpu: in page starting at address 0x0000800000372000 from client 10
amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00240C51
amdgpu 0000:03:00.0: amdgpu: Faulty UTCL2 client ID: CPG (0x6)
amdgpu 0000:03:00.0: amdgpu: MORE_FAULTS: 0x1
amdgpu 0000:03:00.0: amdgpu: WALKER_ERROR: 0x0
amdgpu 0000:03:00.0: amdgpu: PERMISSION_FAULTS: 0x5
amdgpu 0000:03:00.0: amdgpu: MAPPING_ERROR: 0x0
amdgpu 0000:03:00.0: amdgpu: RW: 0x1

amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:5 pasid:32797, for process thedivision.exe pid 9902 thread thedivision.exe pid 9962)
amdgpu 0000:03:00.0: amdgpu: in page starting at address 0x000080013b3cc000 from client 10
amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00500830
amdgpu 0000:03:00.0: amdgpu: Faulty UTCL2 client ID: CPF (0x4)
amdgpu 0000:03:00.0: amdgpu: MORE_FAULTS: 0x0
amdgpu 0000:03:00.0: amdgpu: WALKER_ERROR: 0x0
amdgpu 0000:03:00.0: amdgpu: PERMISSION_FAULTS: 0x3
amdgpu 0000:03:00.0: amdgpu: MAPPING_ERROR: 0x0
amdgpu 0000:03:00.0: amdgpu: RW: 0x0

Since the hangs are random, it is very difficult to relate them to any particular change.
I really want to add
Tested-by: Mikhail Gavrilov <mikhail.v.gavrilov@xxxxxxxxx>
but I'm not sure I have the right to do so while the GPU is, for some unknown reason, still not stable.

Full kernel logs are attached below.

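For anyone comparing the raw GCVM_L2_PROTECTION_FAULT_STATUS values across the attached logs, here is a minimal Python sketch that splits a status word into the same fields the driver prints. The bit layout in FIELDS is an assumption, not taken from this thread: it was chosen because it reproduces the decoded values for all five status words quoted above. The widths of WALKER_ERROR and CID, and the VMID field itself, are not fully confirmed by these samples. The helper name decode_fault_status is hypothetical.

```python
# Sketch: decode a GCVM_L2_PROTECTION_FAULT_STATUS value into its fields.
# ASSUMPTION: the (shift, width) layout below is inferred from the log lines
# quoted in this mail; it is not taken from the thread itself and should be
# checked against the register headers for the GPU generation in question.
FIELDS = {
    "MORE_FAULTS": (0, 1),
    "WALKER_ERROR": (1, 3),
    "PERMISSION_FAULTS": (4, 4),
    "MAPPING_ERROR": (8, 1),
    "CID": (9, 9),    # the "Faulty UTCL2 client ID" number
    "RW": (18, 1),
    "VMID": (20, 4),  # matches the vmid: value in the page fault line
}


def decode_fault_status(status):
    """Return a dict mapping each field name to its integer value."""
    return {
        name: (status >> shift) & ((1 << width) - 1)
        for name, (shift, width) in FIELDS.items()
    }


if __name__ == "__main__":
    # The five status words quoted in the log above.
    for raw in (0x00141051, 0x00201030, 0x00101031, 0x00240C51, 0x00500830):
        fields = decode_fault_status(raw)
        text = ", ".join(f"{k}=0x{v:X}" for k, v in fields.items())
        print(f"0x{raw:08X}: {text}")
```

Running this over the status words from several logs makes it easier to see whether the faults share a client ID or permission pattern, which the per-line dmesg output tends to hide.
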
On Wed, Apr 26, 2023 at 4:50 PM Christian König <ckoenig.leichtzumerken@xxxxxxxxx> wrote:
>
> Sending that once more from my mailing list address since AMD internal
> servers are blocking the mail.
>
> Regards,
> Christian.
>
> On 26.04.23 at 13:48, Christian König wrote:
> > WTF? I own you a beer!
> >
> > I've fixed exactly that problem during the review process of the
> > cleanup patch and because of this didn't considered that the code is
> > still there.
> >
> > It also explains why we don't see that in our testing.
> >
> > @Mikhail can you test that patch with drm-misc-next?

Christian, should I test Guchun's patch on drm-misc-next, or something else?
I already tested Guchun's patch on top of 6.4-git58390c8ce1bd and shared my results above.

-- 
Best Regards,
Mike Gavrilov.

Attachment:
dmesg-gfxhub-page-fault-7.tar.xz
Description: Binary data
Attachment:
dmesg-gfxhub-page-fault-6.tar.xz
Description: Binary data
Attachment:
dmesg-gfxhub-page-fault-5.tar.xz
Description: Binary data
Attachment:
dmesg-gfxhub-page-fault-4.tar.xz
Description: Binary data
Attachment:
dmesg-gfxhub-page-fault-3.tar.xz
Description: Binary data