RE: [PATCH] drm/amdgpu: validate process_context_addr for the MES shader debugger

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



[Public]

> From: Kim, Jonathan <Jonathan.Kim@xxxxxxx>
> Sent: Friday, January 17, 2025 12:21 AM
> To: Liang, Prike <Prike.Liang@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Xiao,
> Jack <Jack.Xiao@xxxxxxx>
> Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Kuehling, Felix
> <Felix.Kuehling@xxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx>;
> Zhang, Jesse(Jie) <Jesse.Zhang@xxxxxxx>
> Subject: RE: [PATCH] drm/amdgpu: validate process_context_addr for the MES
> shader debugger
>
> [Public]
>
> > -----Original Message-----
> > From: Liang, Prike <Prike.Liang@xxxxxxx>
> > Sent: Thursday, January 16, 2025 4:16 AM
> > To: Kim, Jonathan <Jonathan.Kim@xxxxxxx>;
> > amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Xiao, Jack <Jack.Xiao@xxxxxxx>
> > Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Kuehling, Felix
> > <Felix.Kuehling@xxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx>
> > Subject: RE: [PATCH] drm/amdgpu: validate process_context_addr for the
> > MES shader debugger
> >
> > [Public]
> >
> > > From: Kim, Jonathan <Jonathan.Kim@xxxxxxx>
> > > Sent: Wednesday, January 15, 2025 1:14 AM
> > > To: Liang, Prike <Prike.Liang@xxxxxxx>;
> > > amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> > > Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Kuehling, Felix
> > > <Felix.Kuehling@xxxxxxx>; Koenig, Christian
> > > <Christian.Koenig@xxxxxxx>
> > > Subject: RE: [PATCH] drm/amdgpu: validate process_context_addr for
> > > the MES shader debugger
> > >
> > > [Public]
> > >
> > > > -----Original Message-----
> > > > From: Liang, Prike <Prike.Liang@xxxxxxx>
> > > > Sent: Tuesday, January 14, 2025 12:15 AM
> > > > To: amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> > > > Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Kuehling,
> > > > Felix <Felix.Kuehling@xxxxxxx>; Koenig, Christian
> > > > <Christian.Koenig@xxxxxxx>; Kim, Jonathan <Jonathan.Kim@xxxxxxx>;
> > > > Liang, Prike <Prike.Liang@xxxxxxx>
> > > > Subject: [PATCH] drm/amdgpu: validate process_context_addr for the
> > > > MES shader debugger
> > > >
> > > > The following page fault was observed during the exit moment of
> > > > the HIP test process. In this particular error case, the HIP test
> > > > (./MemcpyPerformance -h) does not require the AQL queue. As a
> > > > result,
> > >
> > > I don't think this has anything to do with AQL compute specifically
> > > but a quirk in the KFD where the KFD only creates the process device
> > > mes context when adding the first queue.
> > > Maybe we should move context creation from the KFD add_queue_mes
> > > call to
> > KFD
> > > process device creation when MES is supported.
>
> Looking at the git logs, the allocation got moved to queue creation recently which
> is probably why we're seeing this bug now (added Jesse).
>
> Commit description:
>     drm/amdkfd: pause autosuspend when creating pdd
>
>     When using MES creating a pdd will require talking to the GPU to
>     setup the relevant context. The code here forgot to wake up the GPU
>     in case it was in suspend, this causes KVM to EFAULT for passthrough
>     GPU for example. This issue can be masked if the GPU was woken up by
>     other things (e.g. opening the KMS node) first and have not yet gone to sleep.
>
>     v4: do the allocation of proc_ctx_bo in a lazy fashion
>     when the first queue is created in a process (Felix)
>
>     Signed-off-by: Jesse Zhang <jesse.zhang@xxxxxxx>
>     Reviewed-by: Yunxiang Li <Yunxiang.Li@xxxxxxx>
>
> Maybe the solution should have been to transiently disable runtime power
> management and leave allocation where it originally was.
> set_shader_debugger (wrapper for mes flush) also gets called on debug attach,
> which can be a queue-less call that requires MES process context.
> Having to remember that MES process context creation happens at the queue
> level is confusing.
>
> Jon

Thank @Kim, Jonathan for the input, if the original issue can be resolved by exiting the runtime suspension at the kfd_create_process_device_data() then we can revert the regression commit.

Thanks,
Prike

> >
> > I didn't look much into what queue type is created in the HIP test:
> > MemcpyPerformance, but the issue is that the KFD queue isn't created
> > when executing MemcpyPerformance -h. This means that the MES context
> > process isn't created yet in the KFD process that is released. In this
> > case, the MES queue will not be set up. Do we need to create the MES process
> at each KFD process creation?
> >
> > > Strangely enough KGD has an amdgpu_mes_create_process call that
> > > doesn't
> > seem
> > > to be used but does exactly this.
> >
> > Yes, it seems like dead code and may be implemented for user queue
> > cases in future usage. Maybe we can exclude
> > amdgpu_mes_create_process() and
> > amdgpu_mes_destroy_process() from compiling to reduce the kernel
> > binary size. Hi, @Xiao, Jack  could you please clarify why we need the
> > unused function
> > amdgpu_mes_create_process() here?
> >
> > > Otherwise, since the driver is always supposed to allocate and pass
> > > a valid context for any mes call, maybe it's better to do a context
> > > null check with some comments
> > in
> > > the kfd_process_dequeue_from_all_devices call to reflect this quirk.
> > >
> > Yes, I prefer to add the MES context check in the
> > kfd_process_dequeue_from_all_devices() function rather than creating
> > the MES context regardless of whether it is required to create the
> > queue at each KFD process creation.
> >
> > Thanks,
> >   Prike
> >
> >
> >
> > > Jon
> >
> >
> > >
> > > > the process_context_addr was not assigned when the KFD process was
> > > > released, ultimately leading to this page fault during the
> > > > execution of kfd_process_dequeue_from_all_devices().
> > > >
> > > > [345962.294891] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault
> > > > (src_id:0
> > > > ring:153 vmid:0 pasid:0)
> > > > [345962.295333] amdgpu 0000:03:00.0: amdgpu:   in page starting at
> address
> > > > 0x0000000000000000 from client 10
> > > > [345962.295775] amdgpu 0000:03:00.0: amdgpu:
> > > > GCVM_L2_PROTECTION_FAULT_STATUS:0x00000B33
> > > > [345962.296097] amdgpu 0000:03:00.0: amdgpu:     Faulty UTCL2 client ID:
> > CPC
> > > > (0x5)
> > > > [345962.296394] amdgpu 0000:03:00.0: amdgpu:     MORE_FAULTS: 0x1
> > > > [345962.296633] amdgpu 0000:03:00.0: amdgpu:     WALKER_ERROR: 0x1
> > > > [345962.296876] amdgpu 0000:03:00.0: amdgpu:
> PERMISSION_FAULTS:
> > 0x3
> > > > [345962.297135] amdgpu 0000:03:00.0: amdgpu:     MAPPING_ERROR:
> 0x1
> > > > [345962.297377] amdgpu 0000:03:00.0: amdgpu:     RW: 0x0
> > > > [345962.297682] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault
> > > > (src_id:0
> > > > ring:169 vmid:0 pasid:0)
> > > >
> > > > Signed-off-by: Prike Liang <Prike.Liang@xxxxxxx>
> > > > ---
> > > >  drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c | 5 +++++
> > > >  1 file changed, 5 insertions(+)
> > > >
> > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
> > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
> > > > index cee38bb6cfaf..4d313144cc4b 100644
> > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
> > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
> > > > @@ -1062,6 +1062,11 @@ int amdgpu_mes_flush_shader_debugger(struct
> > > > amdgpu_device *adev,
> > > >               return -EINVAL;
> > > >       }
> > > >
> > > > +     if (!process_context_addr) {
> > > > +             dev_warn(adev->dev, "invalidated process context addr\n");
> > > > +             return -EINVAL;
> > > > +     }
> > > > +
> > > >       op_input.op = MES_MISC_OP_SET_SHADER_DEBUGGER;
> > > >       op_input.set_shader_debugger.process_context_addr =
> > > > process_context_addr;
> > > >       op_input.set_shader_debugger.flags.process_ctx_flush = true;
> > > > --
> > > > 2.34.1
> > >
> >
>

Attachment: 0002-drm-amdkfd-exit-runtime-suspend-for-the-PDD.patch
Description: 0002-drm-amdkfd-exit-runtime-suspend-for-the-PDD.patch

Attachment: 0001-Revert-drm-amdkfd-pause-autosuspend-when-creating-pd.patch
Description: 0001-Revert-drm-amdkfd-pause-autosuspend-when-creating-pd.patch


[Index of Archives]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux