RE: [PATCHv2 2/3] drm/amdkfd: Update queue unmap after VM fault with MES

"Joshi, Mukul" <Mukul.Joshi@xxxxxxx> · Thu, 15 Aug 2024 21:08:30 +0000

[AMD Official Use Only - AMD Internal Distribution Only]

Hi Felix,

> -----Original Message-----
> From: Kuehling, Felix <Felix.Kuehling@xxxxxxx>
> Sent: Thursday, August 15, 2024 2:25 PM
> To: Joshi, Mukul <Mukul.Joshi@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>
> Subject: Re: [PATCHv2 2/3] drm/amdkfd: Update queue unmap after VM fault
> with MES
>
> On 2024-08-14 19:27, Mukul Joshi wrote:
> > MEC FW expects MES to unmap all queues when a VM fault is observed on
> > a queue and then resumed once the affected process is terminated.
> > Use the MES Suspend and Resume APIs to achieve this.
> >
> > Signed-off-by: Mukul Joshi <mukul.joshi@xxxxxxx>
> > ---
> > v1->v2:
> > - Add MES FW version check.
> > - Separate out the kfd_dqm_evict_pasid into another function.
> > - Use amdgpu_mes_suspend/amdgpu_mes_resume to suspend/resume
> queues.
> >
> >   .../drm/amd/amdkfd/kfd_device_queue_manager.c | 79
> ++++++++++++++++++-
> >   1 file changed, 77 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
> > b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
> > index f6e211070299..cb5b866eee3b 100644
> > --- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
> > +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
> > @@ -319,6 +319,42 @@ static int remove_all_queues_mes(struct
> device_queue_manager *dqm)
> >     return retval;
> >   }
> >
> > +static int suspend_all_queues_mes(struct device_queue_manager *dqm) {
> > +   struct amdgpu_device *adev = (struct amdgpu_device *)dqm->dev-
> >adev;
> > +   int r = 0;
> > +
> > +   if (dqm->is_hws_hang)
> > +           return -EIO;
> > +
> > +   r = amdgpu_mes_suspend(adev);
> > +   if (r) {
> > +           dev_err(adev->dev, "failed to suspend gangs from MES\n");
> > +           dev_err(adev->dev, "MES might be in unrecoverable state,
> issue a GPU reset\n");
> > +           kfd_hws_hang(dqm);
> > +   }
> > +
> > +   return r;
> > +}
> > +
> > +static int resume_all_queues_mes(struct device_queue_manager *dqm) {
> > +   struct amdgpu_device *adev = (struct amdgpu_device *)dqm->dev-
> >adev;
> > +   int r = 0;
> > +
> > +   if (dqm->is_hws_hang)
> > +           return -EIO;
> > +
> > +   r = amdgpu_mes_resume(adev);
> > +   if (r) {
> > +           dev_err(adev->dev, "failed to resume gangs from MES\n");
> > +           dev_err(adev->dev, "MES might be in unrecoverable state,
> issue a GPU reset\n");
> > +           kfd_hws_hang(dqm);
> > +   }
> > +
> > +   return r;
> > +}
> > +
> >   static void increment_queue_count(struct device_queue_manager *dqm,
> >                               struct qcm_process_device *qpd,
> >                               struct queue *q)
> > @@ -2835,6 +2871,40 @@ void device_queue_manager_uninit(struct
> device_queue_manager *dqm)
> >     kfree(dqm);
> >   }
> >
> > +static int kfd_dqm_evict_pasid_mes(struct device_queue_manager *dqm,
> > +                              struct qcm_process_device *qpd) {
> > +   struct device *dev = dqm->dev->adev->dev;
> > +   int ret = 0;
> > +
> > +   /* Check if process is already evicted */
> > +   dqm_lock(dqm);
> > +   if (qpd->evicted) {
> > +           dqm_unlock(dqm);
> > +           goto out;
>
> qpd->evicted is a reference count. Without this shortcut,
> dqm->ops.evict_process_queues will increment the ref count. You probably
> need to increment it here before dropping the lock. Otherwise two things can
> go wrong:
>
>  1. The corresponding dqm->ops.restore_process_queues will underflow the
>     reference count
>  2. A race condition where the queues get restored too early
>

The intent here is to check if the process queues are already evicted or not. If its not, then we want
to suspend all queues, evict all queues (which also updates the evicted refcount) of the affected process,
and resume all queues.
If I increment the refcount here, then dqm->ops.evict_process_queues will not evict the queues unless we
change that function.
And this function would be called only for the VM fault case, so the process is going to be terminated. Is it
possible to have dqm->ops.restore_process_queues called on it? Even if it called, I don't think we can have
underflow of the refcount with the current code.

Can you please explain the case where the dqm->ops.restore_process_queues can cause an underflow
with the current code? And the scenario for the race?

Regards,
Mukul

> Regards,
>    Felix
>
>
> > +   }
> > +   dqm_unlock(dqm);
> > +
> > +   ret = suspend_all_queues_mes(dqm);
> > +   if (ret) {
> > +           dev_err(dev, "Suspending all queues failed");
> > +           goto out;
> > +   }
> > +
> > +   ret = dqm->ops.evict_process_queues(dqm, qpd);
> > +   if (ret) {
> > +           dev_err(dev, "Evicting process queues failed");
> > +           goto out;
> > +   }
> > +
> > +   ret = resume_all_queues_mes(dqm);
> > +   if (ret)
> > +           dev_err(dev, "Resuming all queues failed");
> > +
> > +out:
> > +   return ret;
> > +}
> > +
> >   int kfd_dqm_evict_pasid(struct device_queue_manager *dqm, u32 pasid)
> >   {
> >     struct kfd_process_device *pdd;
> > @@ -2845,8 +2915,13 @@ int kfd_dqm_evict_pasid(struct
> device_queue_manager *dqm, u32 pasid)
> >             return -EINVAL;
> >     WARN(debug_evictions, "Evicting pid %d", p->lead_thread->pid);
> >     pdd = kfd_get_process_device_data(dqm->dev, p);
> > -   if (pdd)
> > -           ret = dqm->ops.evict_process_queues(dqm, &pdd->qpd);
> > +   if (pdd) {
> > +           if (dqm->dev->kfd->shared_resources.enable_mes)
> > +                   ret = kfd_dqm_evict_pasid_mes(dqm, &pdd->qpd);
> > +           else
> > +                   ret = dqm->ops.evict_process_queues(dqm, &pdd-
> >qpd);
> > +   }
> > +
> >     kfd_unref_process(p);
> >
> >     return ret;