> On Jun 16, 2021, at 12:36, Kuehling, Felix <Felix.Kuehling@xxxxxxx> wrote:
>
> On 2021-06-16 at 12:01 a.m., Pan, Xinhui wrote:
>>> On Jun 16, 2021, at 02:22, Kuehling, Felix <Felix.Kuehling@xxxxxxx> wrote:
>>>
>>> [+Xinhui]
>>>
>>>
>>> On 2021-06-15 at 1:50 p.m., Amber Lin wrote:
>>>> Calling free_mqd inside destroy_queue_nocpsch_locked can cause a
>>>> circular lock dependency. destroy_queue_nocpsch_locked is called under
>>>> the DQM lock, which is taken in MMU notifiers, potentially in FS reclaim
>>>> context. Taking another lock (the BO reservation lock from free_mqd,
>>>> which can cause an FS reclaim) inside the DQM lock creates a problematic
>>>> circular lock dependency. Therefore move free_mqd out of
>>>> destroy_queue_nocpsch_locked and call it after unlocking DQM.
>>>>
>>>> Signed-off-by: Amber Lin <Amber.Lin@xxxxxxx>
>>>> Reviewed-by: Felix Kuehling <Felix.Kuehling@xxxxxxx>
>>> Let's submit this patch as is. I'm making some comments inline for
>>> things that Xinhui can address in his race condition patch.
>>>
>>>
>>>> ---
>>>>  .../drm/amd/amdkfd/kfd_device_queue_manager.c | 18 +++++++++++++-----
>>>>  1 file changed, 13 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
>>>> index 72bea5278add..c069fa259b30 100644
>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
>>>> @@ -486,9 +486,6 @@ static int destroy_queue_nocpsch_locked(struct device_queue_manager *dqm,
>>>>  	if (retval == -ETIME)
>>>>  		qpd->reset_wavefronts = true;
>>>>
>>>> -
>>>> -	mqd_mgr->free_mqd(mqd_mgr, q->mqd, q->mqd_mem_obj);
>>>> -
>>>>  	list_del(&q->list);
>>>>  	if (list_empty(&qpd->queues_list)) {
>>>>  		if (qpd->reset_wavefronts) {
>>>> @@ -523,6 +520,8 @@ static int destroy_queue_nocpsch(struct device_queue_manager *dqm,
>>>>  	int retval;
>>>>  	uint64_t sdma_val = 0;
>>>>  	struct kfd_process_device *pdd = qpd_to_pdd(qpd);
>>>> +	struct mqd_manager *mqd_mgr =
>>>> +		dqm->mqd_mgrs[get_mqd_type_from_queue_type(q->properties.type)];
>>>>
>>>>  	/* Get the SDMA queue stats */
>>>>  	if ((q->properties.type == KFD_QUEUE_TYPE_SDMA) ||
>>>> @@ -540,6 +539,8 @@ static int destroy_queue_nocpsch(struct device_queue_manager *dqm,
>>>>  		pdd->sdma_past_activity_counter += sdma_val;
>>>>  	dqm_unlock(dqm);
>>>>
>>>> +	mqd_mgr->free_mqd(mqd_mgr, q->mqd, q->mqd_mem_obj);
>>>> +
>>>>  	return retval;
>>>>  }
>>>>
>>>> @@ -1629,7 +1630,7 @@ static bool set_cache_memory_policy(struct device_queue_manager *dqm,
>>>>  static int process_termination_nocpsch(struct device_queue_manager *dqm,
>>>>  					struct qcm_process_device *qpd)
>>>>  {
>>>> -	struct queue *q, *next;
>>>> +	struct queue *q;
>>>>  	struct device_process_node *cur, *next_dpn;
>>>>  	int retval = 0;
>>>>  	bool found = false;
>>>> @@ -1637,12 +1638,19 @@ static int process_termination_nocpsch(struct device_queue_manager *dqm,
>>>>  	dqm_lock(dqm);
>>>>
>>>>  	/* Clear all user mode queues */
>>>> -	list_for_each_entry_safe(q, next, &qpd->queues_list, list) {
>>>> +	while (!list_empty(&qpd->queues_list)) {
>>>> +		struct mqd_manager *mqd_mgr;
>>>>  		int ret;
>>>>
>>>> +		q = list_first_entry(&qpd->queues_list, struct queue, list);
>>>> +		mqd_mgr = dqm->mqd_mgrs[get_mqd_type_from_queue_type(
>>>> +				q->properties.type)];
>>>>  		ret = destroy_queue_nocpsch_locked(dqm, qpd, q);
>>>>  		if (ret)
>>>>  			retval = ret;
>>>> +		dqm_unlock(dqm);
>>>> +		mqd_mgr->free_mqd(mqd_mgr, q->mqd, q->mqd_mem_obj);
>>>> +		dqm_lock(dqm);
>>> This is the correct way to clean up the list when dropping the dqm-lock
>>> in the middle. Xinhui, you can use the same method in
>>> process_termination_cpsch.
>>>
>> yes, that is the right way to walk through the list. thanks.
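To make the pattern discussed above concrete outside the kernel tree, here is a minimal userspace sketch of the same idea. This is not the KFD code: the node type, the pthread mutex standing in for the dqm lock, and free_payload() standing in for mqd_mgr->free_mqd() are all illustrative assumptions. The point is to restart from the list head on every iteration instead of caching a "next" pointer the way list_for_each_entry_safe() does, because a cached pointer can go stale while the lock is dropped.

#include <pthread.h>
#include <stdlib.h>

struct node {
	struct node *next;
	void *payload;
};

static struct node *queues_list;	/* stands in for qpd->queues_list */
static pthread_mutex_t dqm_lock = PTHREAD_MUTEX_INITIALIZER;

/* stands in for mqd_mgr->free_mqd(), which must not run under dqm_lock */
static void free_payload(void *p)
{
	free(p);
}

static void clear_all_queues(void)
{
	pthread_mutex_lock(&dqm_lock);
	while (queues_list) {			/* re-check the head every time  */
		struct node *q = queues_list;
		queues_list = q->next;		/* unlink while the lock is held */

		pthread_mutex_unlock(&dqm_lock);/* drop the lock for the cleanup */
		free_payload(q->payload);	/* that may sleep or reclaim     */
		free(q);
		pthread_mutex_lock(&dqm_lock);
	}
	pthread_mutex_unlock(&dqm_lock);
}

int main(void)
{
	/* build a two-element list, then tear it down */
	for (int i = 0; i < 2; i++) {
		struct node *n = calloc(1, sizeof(*n));

		n->payload = malloc(16);
		n->next = queues_list;
		queues_list = n;
	}
	clear_all_queues();
	return 0;
}

The unlink happens before the unlock, so another thread that takes the lock afterwards can no longer find the node; that ordering is what makes dropping the lock mid-walk safe.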
>>> I believe the swapping of the q->mqd with a temporary variable is not
>>> needed. When free_mqd is called, the queue is no longer on the
>>> qpd->queues_list, so destroy_queue cannot race with it. If we ensure
>>> that queues are always removed from the list before calling free_mqd,
>>> and that list removal happens under the dqm_lock, then there should be
>>> no risk of a race condition that causes a double free.
>>>
>> no, the double free exists because pqm_destroy_queue fetches the queue
>> from its qid with get_queue_by_qid().
>> the race is like below.
>>
>> pqm_destroy_queue
>>   get_queue_by_qid                     process_termination_cpsch
>>   destroy_queue_cpsch
>>                                          lock
>>                                          list_for_each_entry_safe
>>                                          list_del(q)
>>                                          unlock
>>                                          free_mqd
>>   lock
>>   list_del(q)
>>   unlock
>>   free_mqd
>
> I think if both those threads try to free the same queue, they both need
> to hold the same process->mutex. For pqm_destroy_queue that happens in
> kfd_ioctl_destroy_queue. For process_termination_cpsch that happens in
> kfd_process_notifier_release before it calls
> kfd_process_dequeue_from_all_devices.

oh, yes, you are right. So the double free I am seeing has a different root cause. :(
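Felix's observation can also be illustrated with a small userspace sketch. It is an assumption-laden stand-in, not the KFD code: process_mutex, lookup_queue() and unlink_queue() play the roles of process->mutex and the qid lookup. Because both paths unlink the queue under the same mutex before freeing it, whichever path loses the race fails the lookup and never reaches free(), so a double free cannot happen.

#include <pthread.h>
#include <stdlib.h>

struct queue {
	int qid;
	struct queue *next;
};

static struct queue *queue_table;	/* stands in for the per-process qid table */
static pthread_mutex_t process_mutex = PTHREAD_MUTEX_INITIALIZER;

static struct queue *lookup_queue(int qid)	/* roughly get_queue_by_qid() */
{
	struct queue *q;

	for (q = queue_table; q; q = q->next) {
		if (q->qid == qid)
			return q;
	}
	return NULL;
}

static void unlink_queue(struct queue *q)	/* caller holds process_mutex */
{
	struct queue **pp;

	for (pp = &queue_table; *pp; pp = &(*pp)->next) {
		if (*pp == q) {
			*pp = q->next;
			break;
		}
	}
}

static int destroy_one_queue(int qid)		/* ioctl-style per-queue path */
{
	struct queue *q;

	pthread_mutex_lock(&process_mutex);
	q = lookup_queue(qid);			/* NULL if teardown already removed it */
	if (q)
		unlink_queue(q);
	pthread_mutex_unlock(&process_mutex);
	if (!q)
		return -1;
	free(q);				/* freed exactly once */
	return 0;
}

static void destroy_all_queues(void)		/* process-teardown path */
{
	pthread_mutex_lock(&process_mutex);
	while (queue_table) {
		struct queue *q = queue_table;

		unlink_queue(q);
		free(q);
	}
	pthread_mutex_unlock(&process_mutex);
}

int main(void)
{
	struct queue *q = calloc(1, sizeof(*q));

	q->qid = 1;
	queue_table = q;
	destroy_all_queues();			/* teardown wins the race */
	if (destroy_one_queue(1) == -1)
		return 0;			/* lookup fails, nothing is double-freed */
	return 1;
}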