[+Xinhui]

On 2021-06-15 at 1:50 p.m., Amber Lin wrote:
> Calling free_mqd inside of destroy_queue_nocpsch_locked can cause a
> circular lock. destroy_queue_nocpsch_locked is called under a DQM lock,
> which is taken in MMU notifiers, potentially in FS reclaim context.
> Taking another lock, which is BO reservation lock from free_mqd, while
> causing an FS reclaim inside the DQM lock creates a problematic circular
> lock dependency. Therefore move free_mqd out of
> destroy_queue_nocpsch_locked and call it after unlocking DQM.
>
> Signed-off-by: Amber Lin <Amber.Lin@xxxxxxx>
> Reviewed-by: Felix Kuehling <Felix.Kuehling@xxxxxxx>

Let's submit this patch as is. I'm making some comments inline for things
that Xinhui can address in his race condition patch.

> ---
>  .../drm/amd/amdkfd/kfd_device_queue_manager.c | 18 +++++++++++++-----
>  1 file changed, 13 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
> index 72bea5278add..c069fa259b30 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
> @@ -486,9 +486,6 @@ static int destroy_queue_nocpsch_locked(struct device_queue_manager *dqm,
>  	if (retval == -ETIME)
>  		qpd->reset_wavefronts = true;
>
> -
> -	mqd_mgr->free_mqd(mqd_mgr, q->mqd, q->mqd_mem_obj);
> -
>  	list_del(&q->list);
>  	if (list_empty(&qpd->queues_list)) {
>  		if (qpd->reset_wavefronts) {
> @@ -523,6 +520,8 @@ static int destroy_queue_nocpsch(struct device_queue_manager *dqm,
>  	int retval;
>  	uint64_t sdma_val = 0;
>  	struct kfd_process_device *pdd = qpd_to_pdd(qpd);
> +	struct mqd_manager *mqd_mgr =
> +		dqm->mqd_mgrs[get_mqd_type_from_queue_type(q->properties.type)];
>
>  	/* Get the SDMA queue stats */
>  	if ((q->properties.type == KFD_QUEUE_TYPE_SDMA) ||
> @@ -540,6 +539,8 @@ static int destroy_queue_nocpsch(struct device_queue_manager *dqm,
>  		pdd->sdma_past_activity_counter += sdma_val;
>  	dqm_unlock(dqm);
>
> +	mqd_mgr->free_mqd(mqd_mgr, q->mqd, q->mqd_mem_obj);
> +
>  	return retval;
>  }
>
> @@ -1629,7 +1630,7 @@ static bool set_cache_memory_policy(struct device_queue_manager *dqm,
>  static int process_termination_nocpsch(struct device_queue_manager *dqm,
>  		struct qcm_process_device *qpd)
>  {
> -	struct queue *q, *next;
> +	struct queue *q;
>  	struct device_process_node *cur, *next_dpn;
>  	int retval = 0;
>  	bool found = false;
> @@ -1637,12 +1638,19 @@ static int process_termination_nocpsch(struct device_queue_manager *dqm,
>  	dqm_lock(dqm);
>
>  	/* Clear all user mode queues */
> -	list_for_each_entry_safe(q, next, &qpd->queues_list, list) {
> +	while (!list_empty(&qpd->queues_list)) {
> +		struct mqd_manager *mqd_mgr;
>  		int ret;
>
> +		q = list_first_entry(&qpd->queues_list, struct queue, list);
> +		mqd_mgr = dqm->mqd_mgrs[get_mqd_type_from_queue_type(
> +				q->properties.type)];
>  		ret = destroy_queue_nocpsch_locked(dqm, qpd, q);
>  		if (ret)
>  			retval = ret;
> +		dqm_unlock(dqm);
> +		mqd_mgr->free_mqd(mqd_mgr, q->mqd, q->mqd_mem_obj);
> +		dqm_lock(dqm);

This is the correct way to clean up the list when the DQM lock is dropped
in the middle of the loop. Xinhui, you can use the same method in
process_termination_cpsch.

I believe swapping q->mqd with a temporary variable is not needed. By the
time free_mqd is called, the queue is no longer on qpd->queues_list, so
destroy_queue cannot race with it. If we ensure that queues are always
removed from the list before free_mqd is called, and that the list removal
happens under the DQM lock, then there should be no risk of a race
condition that causes a double free.

Regards,
  Felix

>  	}
>
>  	/* Unregister process */
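
For reference, the drain pattern discussed above (take the first entry, unlink it under the lock, drop the lock to free, retake the lock and re-check the list) can be sketched in plain user-space C. This is only an illustration, not the amdkfd code: struct item, struct owner, release_payload, owner_add and owner_drain are hypothetical names, a pthread mutex stands in for the DQM lock, and unlike the patch the sketch also frees the list node itself.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct queue and its MQD allocation. */
struct item {
	struct item *next;
	void *payload;			/* plays the role of q->mqd */
};

/* Hypothetical stand-in for the DQM lock plus qpd->queues_list. */
struct owner {
	pthread_mutex_t lock;
	struct item *head;
};

/* Plays the role of free_mqd: may take other locks, so call it unlocked. */
static void release_payload(struct item *it)
{
	free(it->payload);
	it->payload = NULL;
}

/* Push a new item onto the list. Returns 0 on success, -1 on failure. */
static int owner_add(struct owner *o, size_t payload_size)
{
	struct item *it = malloc(sizeof(*it));

	if (!it)
		return -1;
	it->payload = malloc(payload_size);
	if (!it->payload) {
		free(it);
		return -1;
	}
	pthread_mutex_lock(&o->lock);
	it->next = o->head;
	o->head = it;
	pthread_mutex_unlock(&o->lock);
	return 0;
}

/* Drain the list. Returns the number of items released. */
static int owner_drain(struct owner *o)
{
	int released = 0;

	pthread_mutex_lock(&o->lock);
	while (o->head) {
		/* Re-read the head on every pass: the list may have
		 * changed while the lock was dropped. */
		struct item *it = o->head;

		o->head = it->next;	/* unlink while holding the lock */
		pthread_mutex_unlock(&o->lock);

		/* 'it' is no longer reachable through the list, so no
		 * other list user can free it a second time. */
		release_payload(it);
		free(it);
		released++;

		pthread_mutex_lock(&o->lock);
	}
	pthread_mutex_unlock(&o->lock);
	return released;
}
```

The point of the while loop over a "safe" iterator is that a saved next pointer can go stale once the lock is released, whereas the head is always re-read under the lock.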