Re: BUG: KASAN: null-ptr-deref in drm_sched_job_cleanup+0x96/0x290 [gpu_sched]

Mikhail Gavrilov <mikhail.v.gavrilov@xxxxxxxxx> · Wed, 3 May 2023 00:28:58 +0500

On Wed, Apr 26, 2023 at 7:00 AM Chen, Guchun <Guchun.Chen@xxxxxxx> wrote:
>
> After reviewing the whole history, I think the attached patch may fix your problem. Could you give it a try, please?
>
> Regards,
> Guchun
>

Thanks, I tested this patch for 6 days, and the error
"BUG: KASAN: null-ptr-deref in drm_sched_job_cleanup+0x96"
no longer appears.
Instead, I started noticing GPU hangs that happen randomly after a
"[gfxhub] page fault".
I'm not sure whether there is anything useful to see in the page fault
messages:

amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:40
vmid:1 pasid:32779, for process steamwebhelper pid 15552 thread
steamwebhe:cs0 pid 15832)
amdgpu 0000:03:00.0: amdgpu:   in page starting at address
0x00008001012c3000 from client 0x1b (UTCL2)
amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00141051
amdgpu 0000:03:00.0: amdgpu: Faulty UTCL2 client ID: TCP (0x8)
amdgpu 0000:03:00.0: amdgpu: MORE_FAULTS: 0x1
amdgpu 0000:03:00.0: amdgpu: WALKER_ERROR: 0x0
amdgpu 0000:03:00.0: amdgpu: PERMISSION_FAULTS: 0x5
amdgpu 0000:03:00.0: amdgpu: MAPPING_ERROR: 0x0
amdgpu 0000:03:00.0: amdgpu: RW: 0x1

amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24
vmid:2 pasid:32794, for process EvilDead-Win64- pid 12883 thread
EvilDead-W:cs0 pid 13035)
amdgpu 0000:03:00.0: amdgpu:   in page starting at address
0x00008001e62a5000 from client 10
amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00201030
amdgpu 0000:03:00.0: amdgpu: Faulty UTCL2 client ID: TCP (0x8)
amdgpu 0000:03:00.0: amdgpu: MORE_FAULTS: 0x0
amdgpu 0000:03:00.0: amdgpu: WALKER_ERROR: 0x0
amdgpu 0000:03:00.0: amdgpu: PERMISSION_FAULTS: 0x3
amdgpu 0000:03:00.0: amdgpu: MAPPING_ERROR: 0x0
amdgpu 0000:03:00.0: amdgpu: RW: 0x0

amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24
vmid:1 pasid:32770, for process Xwayland pid 3706 thread Xwayland:cs0
pid 3713)
amdgpu 0000:03:00.0: amdgpu:   in page starting at address
0x0000800100c04000 from client 10
amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00101031
amdgpu 0000:03:00.0: amdgpu: Faulty UTCL2 client ID: TCP (0x8)
amdgpu 0000:03:00.0: amdgpu: MORE_FAULTS: 0x1
amdgpu 0000:03:00.0: amdgpu: WALKER_ERROR: 0x0
amdgpu 0000:03:00.0: amdgpu: PERMISSION_FAULTS: 0x3
amdgpu 0000:03:00.0: amdgpu: MAPPING_ERROR: 0x0
amdgpu 0000:03:00.0: amdgpu: RW: 0x0

amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:40
vmid:2 pasid:32784, for process thedivision.exe pid 168608 thread
thedivision.exe pid 168733)
amdgpu 0000:03:00.0: amdgpu:   in page starting at address
0x0000800000372000 from client 10
amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00240C51
amdgpu 0000:03:00.0: amdgpu: Faulty UTCL2 client ID: CPG (0x6)
amdgpu 0000:03:00.0: amdgpu: MORE_FAULTS: 0x1
amdgpu 0000:03:00.0: amdgpu: WALKER_ERROR: 0x0
amdgpu 0000:03:00.0: amdgpu: PERMISSION_FAULTS: 0x5
amdgpu 0000:03:00.0: amdgpu: MAPPING_ERROR: 0x0
amdgpu 0000:03:00.0: amdgpu: RW: 0x1

amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24
vmid:5 pasid:32797, for process thedivision.exe pid 9902 thread
thedivision.exe pid 9962)
amdgpu 0000:03:00.0: amdgpu:   in page starting at address
0x000080013b3cc000 from client 10
amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00500830
amdgpu 0000:03:00.0: amdgpu: Faulty UTCL2 client ID: CPF (0x4)
amdgpu 0000:03:00.0: amdgpu: MORE_FAULTS: 0x0
amdgpu 0000:03:00.0: amdgpu: WALKER_ERROR: 0x0
amdgpu 0000:03:00.0: amdgpu: PERMISSION_FAULTS: 0x3
amdgpu 0000:03:00.0: amdgpu: MAPPING_ERROR: 0x0
amdgpu 0000:03:00.0: amdgpu: RW: 0x0
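
For reference, the decoded fields in each block above can be reproduced
from the raw GCVM_L2_PROTECTION_FAULT_STATUS value. Below is a minimal
Python sketch. The bit layout in FIELDS is an assumption (it is not stated
anywhere in this thread); it has only been checked against the five status
values printed above. The helper name decode_fault_status is hypothetical.

```python
# Assumed (shift, width) of each field in GCVM_L2_PROTECTION_FAULT_STATUS.
# This layout is an assumption checked only against the logs in this mail.
FIELDS = {
    "MORE_FAULTS": (0, 1),
    "WALKER_ERROR": (1, 3),
    "PERMISSION_FAULTS": (4, 4),
    "MAPPING_ERROR": (8, 1),
    "CID": (9, 9),    # printed as "Faulty UTCL2 client ID"
    "RW": (18, 1),
    "VMID": (20, 4),  # printed as "vmid:" on the first line of each fault
}


def decode_fault_status(status):
    """Split a raw status value into the named bit fields above."""
    return {
        name: (status >> shift) & ((1 << width) - 1)
        for name, (shift, width) in FIELDS.items()
    }


if __name__ == "__main__":
    # Status values copied from the five page faults reported above.
    for raw in (0x00141051, 0x00201030, 0x00101031, 0x00240C51, 0x00500830):
        fields = decode_fault_status(raw)
        print(f"0x{raw:08X}: " + " ".join(f"{k}=0x{v:x}" for k, v in fields.items()))
```

Under this layout, 0x00141051 gives PERMISSION_FAULTS 0x5, CID 0x8, RW 0x1
and VMID 1, which agrees with the first fault above.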

Since the hangs occur randomly, it is very difficult to tie them to
any particular change.

I really want to add Tested-by: Mikhail Gavrilov <mikhail.v.gavrilov@xxxxxxxxx>
but I'm not sure I have the right to do so while the GPU is still
unstable for some unknown reason.

The full kernel logs are attached below.

On Wed, Apr 26, 2023 at 4:50 PM Christian König
<ckoenig.leichtzumerken@xxxxxxxxx> wrote:
>
> Sending that once more from my mailing list address since AMD internal
> servers are blocking the mail.
>
> Regards,
> Christian.
>
> Am 26.04.23 um 13:48 schrieb Christian König:
> > WTF? I owe you a beer!
> >
> > I fixed exactly that problem during the review process of the
> > cleanup patch, and because of this I didn't consider that the code
> > is still there.
> >
> > It also explains why we don't see that in our testing.
> >
> > @Mikhail can you test that patch with drm-misc-next?

Christian, on drm-misc-next, should I test Guchun's patch or
something else?
I already tested Guchun's patch on top of 6.4-git58390c8ce1bd and
shared my results above.

-- 
Best Regards,
Mike Gavrilov.

Attachments:
dmesg-gfxhub-page-fault-7.tar.xz
dmesg-gfxhub-page-fault-6.tar.xz
dmesg-gfxhub-page-fault-5.tar.xz
dmesg-gfxhub-page-fault-4.tar.xz
dmesg-gfxhub-page-fault-3.tar.xz