RE: [PATCH] drm/amdgpu: validate process_context_addr for the MES shader debugger

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



[Public]

> -----Original Message-----
> From: Liang, Prike <Prike.Liang@xxxxxxx>
> Sent: Thursday, January 16, 2025 4:16 AM
> To: Kim, Jonathan <Jonathan.Kim@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Xiao,
> Jack <Jack.Xiao@xxxxxxx>
> Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Kuehling, Felix
> <Felix.Kuehling@xxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx>
> Subject: RE: [PATCH] drm/amdgpu: validate process_context_addr for the MES
> shader debugger
>
> [Public]
>
> > From: Kim, Jonathan <Jonathan.Kim@xxxxxxx>
> > Sent: Wednesday, January 15, 2025 1:14 AM
> > To: Liang, Prike <Prike.Liang@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> > Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Kuehling, Felix
> > <Felix.Kuehling@xxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx>
> > Subject: RE: [PATCH] drm/amdgpu: validate process_context_addr for the MES
> > shader debugger
> >
> > [Public]
> >
> > > -----Original Message-----
> > > From: Liang, Prike <Prike.Liang@xxxxxxx>
> > > Sent: Tuesday, January 14, 2025 12:15 AM
> > > To: amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> > > Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Kuehling, Felix
> > > <Felix.Kuehling@xxxxxxx>; Koenig, Christian
> > > <Christian.Koenig@xxxxxxx>; Kim, Jonathan <Jonathan.Kim@xxxxxxx>;
> > > Liang, Prike <Prike.Liang@xxxxxxx>
> > > Subject: [PATCH] drm/amdgpu: validate process_context_addr for the MES
> > > shader debugger
> > >
> > > The following page fault was observed during the exit moment of the
> > > HIP test process. In this particular error case, the HIP test
> > > (./MemcpyPerformance -h) does not require the AQL queue. As a result,
> >
> > I don't think this has anything to do with AQL compute specifically but a quirk in the
> > KFD where the KFD only creates the process device mes context when adding the
> > first queue.
> > Maybe we should move context creation from the KFD add_queue_mes call to
> KFD
> > process device creation when MES is supported.

Looking at the git logs, the allocation got moved to queue creation recently which is probably why we're seeing this bug now (added Jesse).

Commit description:
    drm/amdkfd: pause autosuspend when creating pdd

    When using MES creating a pdd will require talking to the GPU to
    setup the relevant context. The code here forgot to wake up the GPU
    in case it was in suspend, this causes KVM to EFAULT for passthrough
    GPU for example. This issue can be masked if the GPU was woken up by
    other things (e.g. opening the KMS node) first and have not yet gone to sleep.

    v4: do the allocation of proc_ctx_bo in a lazy fashion
    when the first queue is created in a process (Felix)

    Signed-off-by: Jesse Zhang <jesse.zhang@xxxxxxx>
    Reviewed-by: Yunxiang Li <Yunxiang.Li@xxxxxxx>

Maybe the solution should have been to transiently disable runtime power management and leave allocation where it originally was.
set_shader_debugger (wrapper for mes flush) also gets called on debug attach, which can be a queue-less call that requires MES process context.
Having to remember that MES process context creation happens at the queue level is confusing.

Jon

>
> I didn't look much into what queue type is created in the HIP test:
> MemcpyPerformance, but the issue is that the KFD queue isn't created when
> executing MemcpyPerformance -h. This means that the MES context process isn't
> created yet in the KFD process that is released. In this case, the MES queue will not
> be set up. Do we need to create the MES process at each KFD process creation?
>
> > Strangely enough KGD has an amdgpu_mes_create_process call that doesn't
> seem
> > to be used but does exactly this.
>
> Yes, it seems like dead code and may be implemented for user queue cases in future
> usage. Maybe we can exclude amdgpu_mes_create_process() and
> amdgpu_mes_destroy_process() from compiling to reduce the kernel binary size. Hi,
> @Xiao, Jack  could you please clarify why we need the unused function
> amdgpu_mes_create_process() here?
>
> > Otherwise, since the driver is always supposed to allocate and pass a valid context
> > for any mes call, maybe it's better to do a context null check with some comments
> in
> > the kfd_process_dequeue_from_all_devices call to reflect this quirk.
> >
> Yes, I prefer to add the MES context check in the
> kfd_process_dequeue_from_all_devices() function rather than creating the MES
> context regardless of whether it is required to create the queue at each KFD process
> creation.
>
> Thanks,
>   Prike
>
>
>
> > Jon
>
>
> >
> > > the process_context_addr was not assigned when the KFD process was
> > > released, ultimately leading to this page fault during the execution
> > > of kfd_process_dequeue_from_all_devices().
> > >
> > > [345962.294891] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault
> > > (src_id:0
> > > ring:153 vmid:0 pasid:0)
> > > [345962.295333] amdgpu 0000:03:00.0: amdgpu:   in page starting at address
> > > 0x0000000000000000 from client 10
> > > [345962.295775] amdgpu 0000:03:00.0: amdgpu:
> > > GCVM_L2_PROTECTION_FAULT_STATUS:0x00000B33
> > > [345962.296097] amdgpu 0000:03:00.0: amdgpu:     Faulty UTCL2 client ID:
> CPC
> > > (0x5)
> > > [345962.296394] amdgpu 0000:03:00.0: amdgpu:     MORE_FAULTS: 0x1
> > > [345962.296633] amdgpu 0000:03:00.0: amdgpu:     WALKER_ERROR: 0x1
> > > [345962.296876] amdgpu 0000:03:00.0: amdgpu:     PERMISSION_FAULTS:
> 0x3
> > > [345962.297135] amdgpu 0000:03:00.0: amdgpu:     MAPPING_ERROR: 0x1
> > > [345962.297377] amdgpu 0000:03:00.0: amdgpu:     RW: 0x0
> > > [345962.297682] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault
> > > (src_id:0
> > > ring:169 vmid:0 pasid:0)
> > >
> > > Signed-off-by: Prike Liang <Prike.Liang@xxxxxxx>
> > > ---
> > >  drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c | 5 +++++
> > >  1 file changed, 5 insertions(+)
> > >
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
> > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
> > > index cee38bb6cfaf..4d313144cc4b 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
> > > @@ -1062,6 +1062,11 @@ int amdgpu_mes_flush_shader_debugger(struct
> > > amdgpu_device *adev,
> > >               return -EINVAL;
> > >       }
> > >
> > > +     if (!process_context_addr) {
> > > +             dev_warn(adev->dev, "invalidated process context addr\n");
> > > +             return -EINVAL;
> > > +     }
> > > +
> > >       op_input.op = MES_MISC_OP_SET_SHADER_DEBUGGER;
> > >       op_input.set_shader_debugger.process_context_addr =
> > > process_context_addr;
> > >       op_input.set_shader_debugger.flags.process_ctx_flush = true;
> > > --
> > > 2.34.1
> >
>





[Index of Archives]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux