[AMD Official Use Only - Internal Distribution Only]

Hi, Felix,

>>>The failure to execute_queues should probably not be reported to the caller of create_queue, because the queue was already created, and the problem with execute_queues has a much bigger scope than this one caller. So I think the correct solution is to ignore the return value from execute_queues.

Got it. I have created a patch v2 according to your suggestion.
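Roughly, the idea in v2 is to keep the original ordering in create_queue_cpsch and simply stop checking the execute_queues_cpsch return value, along these lines (a sketch of the idea only, not the exact v2 hunk):

	list_add(&q->list, &qpd->queues_list);
	qpd->queue_count++;

	if (q->properties.is_active) {
		increment_queue_count(dqm, q->properties.type);

		/*
		 * Ignore the return value: a runlist failure here affects
		 * all queues in all processes, not just this caller, and
		 * the queue itself has already been created.
		 */
		execute_queues_cpsch(dqm,
				KFD_UNMAP_QUEUES_FILTER_DYNAMIC_QUEUES, 0);
	}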
>>>As a follow up, we should probably handle all the error scenarios inside execute_queues and make it a void function. Failure to unmap queues already triggers a GPU reset, so nothing new needs to be done for that.

>>>But we need to add handling of failures to map queues. It doesn't require a GPU reset, because the problem is in the kernel (e.g. out of memory), not the GPU. The best we can do is report this asynchronously as a GPU hang to all KFD processes, so they know the GPU is no longer going to work for them.

Understood. I will follow up on this issue and prepare a solution to discuss with you.
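To make sure I am heading in the right direction before sending anything out, this is the rough shape I have in mind for execute_queues_cpsch as a void function. It assumes the current unmap_queues_cpsch/map_queues_cpsch split stays as it is, and kfd_notify_gpu_hang_to_processes() is only a placeholder name for whatever mechanism we choose to report the hang to user mode:

	/* Rough sketch only: execute_queues_cpsch handles its own errors
	 * and becomes void. kfd_notify_gpu_hang_to_processes() is a
	 * hypothetical placeholder, not an existing helper.
	 */
	static void execute_queues_cpsch(struct device_queue_manager *dqm,
					 enum kfd_unmap_queues_filter filter,
					 uint32_t filter_param)
	{
		/* Unmap failures already trigger a GPU reset, so stop here. */
		if (unmap_queues_cpsch(dqm, filter, filter_param))
			return;

		/*
		 * Map failures are a kernel-side problem (e.g. out of
		 * memory), not a GPU hang, so no reset; asynchronously
		 * report the GPU as hung to all KFD processes instead
		 * (placeholder call).
		 */
		if (map_queues_cpsch(dqm))
			kfd_notify_gpu_hang_to_processes(dqm->dev);
	}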
Best Regards
Dennis Li

-----Original Message-----
From: Kuehling, Felix <Felix.Kuehling@xxxxxxx>
Sent: Wednesday, September 2, 2020 11:26 AM
To: Li, Dennis <Dennis.Li@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Zhang, Hawking <Hawking.Zhang@xxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx>
Subject: Re: [PATCH] drm/kfd: fix a system crash issue during GPU recovery

On 2020-09-01 11:21 a.m., Li, Dennis wrote:
> [AMD Official Use Only - Internal Distribution Only]
>
> Hi, Felix,
> If the GPU hangs, execute_queues_cpsch will fail to unmap or map queues, and then create_queue_cpsch will return an error. If pqm_create_queue finds that create_queue_cpsch failed, it will call uninit_queue to free the queue object. However, this queue object has already been added to qpd->queues_list in the old code.

Right, that's a problem. I think the intention here is to keep going because a failure to execute the runlist affects not just the queue that was just created, but all queues in all processes. The failure to execute_queues should probably not be reported to the caller of create_queue, because the queue was already created, and the problem with execute_queues has a much bigger scope than this one caller. So I think the correct solution is to ignore the return value from execute_queues.

As a follow up, we should probably handle all the error scenarios inside execute_queues and make it a void function. Failure to unmap queues already triggers a GPU reset, so nothing new needs to be done for that. But we need to add handling of failures to map queues. It doesn't require a GPU reset, because the problem is in the kernel (e.g. out of memory), not the GPU. The best we can do is report this asynchronously as a GPU hang to all KFD processes, so they know the GPU is no longer going to work for them.

Regards,
  Felix

>
> Best Regards
> Dennis Li
>
> -----Original Message-----
> From: Kuehling, Felix <Felix.Kuehling@xxxxxxx>
> Sent: Tuesday, September 1, 2020 9:26 PM
> To: Li, Dennis <Dennis.Li@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Zhang, Hawking <Hawking.Zhang@xxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx>
> Subject: Re: [PATCH] drm/kfd: fix a system crash issue during GPU recovery
>
> I'm not sure how the bug you're fixing is caused, but your fix is clearly in the wrong place.
>
> A queue being disabled is not the same thing as a queue being destroyed. Queues can be disabled for legitimate reasons, but they still should exist and be in the qpd->queues_list.
>
> If a destroyed queue is left on the qpd->queues_list, that would be a problem. Can you point out where such a thing is happening?
>
> Thanks,
>   Felix
>
>
> Am 2020-08-31 um 9:36 p.m. schrieb Dennis Li:
>> The crash log is as below:
>>
>> [Thu Aug 20 23:18:14 2020] general protection fault: 0000 [#1] SMP NOPTI
>> [Thu Aug 20 23:18:14 2020] CPU: 152 PID: 1837 Comm: kworker/152:1 Tainted: G OE 5.4.0-42-generic #46~18.04.1-Ubuntu
>> [Thu Aug 20 23:18:14 2020] Hardware name: GIGABYTE G482-Z53-YF/MZ52-G40-00, BIOS R12 05/13/2020
>> [Thu Aug 20 23:18:14 2020] Workqueue: events amdgpu_ras_do_recovery [amdgpu]
>> [Thu Aug 20 23:18:14 2020] RIP: 0010:evict_process_queues_cpsch+0xc9/0x130 [amdgpu]
>> [Thu Aug 20 23:18:14 2020] Code: 49 8d 4d 10 48 39 c8 75 21 eb 44 83 fa 03 74 36 80 78 72 00 74 0c 83 ab 68 01 00 00 01 41 c6 45 41 00 48 8b 00 48 39 c8 74 25 <80> 78 70 00 c6 40 6d 01 74 ee 8b 50 28 c6 40 70 00 83 ab 60 01 00
>> [Thu Aug 20 23:18:14 2020] RSP: 0018:ffffb29b52f6fc90 EFLAGS: 00010213
>> [Thu Aug 20 23:18:14 2020] RAX: 1c884edb0a118914 RBX: ffff8a0d45ff3c00 RCX: ffff8a2d83e41038
>> [Thu Aug 20 23:18:14 2020] RDX: 0000000000000000 RSI: 0000000000000082 RDI: ffff8a0e2e4178c0
>> [Thu Aug 20 23:18:14 2020] RBP: ffffb29b52f6fcb0 R08: 0000000000001b64 R09: 0000000000000004
>> [Thu Aug 20 23:18:14 2020] R10: ffffb29b52f6fb78 R11: 0000000000000001 R12: ffff8a0d45ff3d28
>> [Thu Aug 20 23:18:14 2020] R13: ffff8a2d83e41028 R14: 0000000000000000 R15: 0000000000000000
>> [Thu Aug 20 23:18:14 2020] FS:  0000000000000000(0000) GS:ffff8a0e2e400000(0000) knlGS:0000000000000000
>> [Thu Aug 20 23:18:14 2020] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [Thu Aug 20 23:18:14 2020] CR2: 000055c783c0e6a8 CR3: 00000034a1284000 CR4: 0000000000340ee0
>> [Thu Aug 20 23:18:14 2020] Call Trace:
>> [Thu Aug 20 23:18:14 2020]  kfd_process_evict_queues+0x43/0xd0 [amdgpu]
>> [Thu Aug 20 23:18:14 2020]  kfd_suspend_all_processes+0x60/0xf0 [amdgpu]
>> [Thu Aug 20 23:18:14 2020]  kgd2kfd_suspend.part.7+0x43/0x50 [amdgpu]
>> [Thu Aug 20 23:18:14 2020]  kgd2kfd_pre_reset+0x46/0x60 [amdgpu]
>> [Thu Aug 20 23:18:14 2020]  amdgpu_amdkfd_pre_reset+0x1a/0x20 [amdgpu]
>> [Thu Aug 20 23:18:14 2020]  amdgpu_device_gpu_recover+0x377/0xf90 [amdgpu]
>> [Thu Aug 20 23:18:14 2020]  ? amdgpu_ras_error_query+0x1b8/0x2a0 [amdgpu]
>> [Thu Aug 20 23:18:14 2020]  amdgpu_ras_do_recovery+0x159/0x190 [amdgpu]
>> [Thu Aug 20 23:18:14 2020]  process_one_work+0x20f/0x400
>> [Thu Aug 20 23:18:14 2020]  worker_thread+0x34/0x410
>>
>> When the GPU hangs, the user process will fail to create a compute queue whose struct object will be freed later, but the driver wrongly adds this queue to the queue list of the process. kfd_process_evict_queues will then access freed memory, which causes a system crash.
>>
>> Signed-off-by: Dennis Li <Dennis.Li@xxxxxxx>
>>
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
>> index 560adc57a050..d5e6b07ffb27 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
>> @@ -1296,16 +1296,18 @@ static int create_queue_cpsch(struct device_queue_manager *dqm, struct queue *q,
>>  	mqd_mgr->init_mqd(mqd_mgr, &q->mqd, q->mqd_mem_obj,
>>  				&q->gart_mqd_addr, &q->properties);
>>  
>> -	list_add(&q->list, &qpd->queues_list);
>> -	qpd->queue_count++;
>> -
>>  	if (q->properties.is_active) {
>>  		increment_queue_count(dqm, q->properties.type);
>>  
>>  		retval = execute_queues_cpsch(dqm,
>>  				KFD_UNMAP_QUEUES_FILTER_DYNAMIC_QUEUES, 0);
>> +		if (retval)
>> +			goto out_execute_cpsch;
>>  	}
>>  
>> +	list_add(&q->list, &qpd->queues_list);
>> +	qpd->queue_count++;
>> +
>>  	/*
>>  	 * Unconditionally increment this counter, regardless of the queue's
>>  	 * type or whether the queue is active.
>> @@ -1318,6 +1320,9 @@ static int create_queue_cpsch(struct device_queue_manager *dqm, struct queue *q,
>>  	dqm_unlock(dqm);
>>  	return retval;
>>  
>> +out_execute_cpsch:
>> +	decrement_queue_count(dqm, q->properties.type);
>> +	dqm_unlock(dqm);
>>  out_deallocate_doorbell:
>>  	deallocate_doorbell(qpd, q);
>>  out_deallocate_sdma_queue:
_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/amd-gfx