[PATCH] drm/amdgpu: resolve deadlock between reset and cs_ioctl

> but the UMD/app submits its jobs from one thread per ctx
That's correct for Mesa, but not necessarily for other user space stacks.

But that's actually irrelevant; IOCTLs must be thread safe to the point 
that they at least don't crash or produce a hardware hang or anything 
like that.

Aborting with a result that indicates concurrent access is forbidden 
(like -EBUSY) is ok, but that is hard to get right and so should usually 
be avoided as well.
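
For illustration, a minimal sketch of the two options; the names here
(my_ctx, my_cs_ioctl, submit_lock) are hypothetical and not from the
driver:

	#include <linux/mutex.h>

	struct my_ctx {
		struct mutex submit_lock;	/* serializes submissions on this ctx */
	};

	static int my_cs_ioctl(struct my_ctx *ctx)
	{
		/* Blocking variant: concurrent callers simply wait, so the
		 * IOCTL stays thread safe and submission order is defined. */
		mutex_lock(&ctx->submit_lock);

		/* The -EBUSY variant would instead be:
		 *
		 *	if (!mutex_trylock(&ctx->submit_lock))
		 *		return -EBUSY;
		 *
		 * which is legal but pushes retry handling to user space. */

		/* ... validate, submit and fence the job here ... */

		mutex_unlock(&ctx->submit_lock);
		return 0;
	}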

Regards,
Christian.

On 12.09.2017 at 05:47, Liu, Monk wrote:
>> The ordering of job->base.s_fence and the prevention of concurrent CS used to be guaranteed by the VM PD being reserved; that is no longer the case, so we need a new lock for protection.
> I thought of this too, but the UMD/app submits its jobs from one thread per ctx (if it submits jobs from multiple threads on one ctx, the order clearly doesn't matter to the UMD/app). So even after we unreserve the PD's resv lock,
> we still only have one thread doing the cs_ioctl on the *current ctx*.
>
> So why doesn't that work anymore? Thanks!
>
> BR Monk
>
> -----Original Message-----
> From: Christian König [mailto:deathsimple at vodafone.de]
> Sent: September 11, 2017 17:13
> To: Liu, Monk <Monk.Liu at amd.com>; amd-gfx at lists.freedesktop.org
> Subject: Re: [PATCH] drm/amdgpu: resolve deadlock between reset and cs_ioctl
>
> On 11.09.2017 at 11:02, Monk Liu wrote:
>> We need to unreserve the ttm bo before "cs_add_fence" and "entity_push_job";
>> otherwise there will be a deadlock between "recover_vram_from_shadow"
>> and the previous two routines on the ttm bo's resv lock.
>>
>> Change-Id: I9c18b677e474ce5b45a9cc24da3fa255d77b3d44
>> Signed-off-by: Monk Liu <Monk.Liu at amd.com>
> Yeah, I was thinking about this as well. The problem is that this won't work without further changes.
>
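As the commit message describes it, the deadlock looks roughly like the
following; this is an illustrative outline of the two paths as I read
them from this thread, not code taken from the driver:

	/*
	 * Thread A: amdgpu_cs_ioctl(), before this patch
	 *   ttm_eu_reserve_buffers()        takes the PD/PT BOs' resv locks
	 *   amdgpu_ctx_add_fence()          can block while a reset is in flight
	 *   amd_sched_entity_push_job()     likewise
	 *   ttm_eu_fence_buffer_objects()   resv locks only dropped here
	 *
	 * Thread B: gpu_reset() -> recover_vram_from_shadow()
	 *   tries to reserve the same PD/PT BOs that thread A still holds,
	 *   while thread A waits for the reset to finish: an AB/BA deadlock.
	 */
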
>> ---
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 35 ++++++++++++++++++----------------
>>    1 file changed, 19 insertions(+), 16 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>> index c1f1ae9..823ac12 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>> @@ -736,26 +736,12 @@ static int amdgpu_cs_sync_rings(struct amdgpu_cs_parser *p)
>>    /**
>>     * cs_parser_fini() - clean parser states
>>     * @parser:	parser structure holding parsing context.
>> - * @error:	error number
>> - *
>> - * If error is set than unvalidate buffer, otherwise just free memory
>> - * used by parsing context.
>>     **/
>> -static void amdgpu_cs_parser_fini(struct amdgpu_cs_parser *parser, int error, bool backoff)
>> +static void amdgpu_cs_parser_fini(struct amdgpu_cs_parser *parser)
>>    {
>>    	struct amdgpu_fpriv *fpriv = parser->filp->driver_priv;
>>    	unsigned i;
>>    
>> -	if (!error) {
>> -		amdgpu_vm_move_pt_bos_in_lru(parser->adev, &fpriv->vm);
>> -
>> -		ttm_eu_fence_buffer_objects(&parser->ticket,
>> -					    &parser->validated,
>> -					    parser->fence);
>> -	} else if (backoff) {
>> -		ttm_eu_backoff_reservation(&parser->ticket,
>> -					   &parser->validated);
>> -	}
>>    	fence_put(parser->fence);
>>    
>>    	if (parser->ctx)
>> @@ -1075,6 +1061,16 @@ static int amdgpu_cs_submit(struct amdgpu_cs_parser *p,
>>    	job->owner = p->filp;
>>    	job->fence_ctx = entity->fence_context;
>>    	p->fence = fence_get(&job->base.s_fence->finished);
>> +
>> +	/* Hook the sched fence to all BOs' reservations in the validated
>> +	 * list and unreserve them.
>> +	 *
>> +	 * We unreserve here because otherwise there would be a deadlock
>> +	 * between ctx_add_fence/sched_entity_push_job and the gpu_reset
>> +	 * routine's recover_bo_from_shadow on the PD/PTEs' ttm bo lock.
>> +	 */
>> +	ttm_eu_fence_buffer_objects(&p->ticket, &p->validated, p->fence);
> The ordering of job->base.s_fence and the prevention of concurrent CS used to be guaranteed by the VM PD being reserved; that is no longer the case, so we need a new lock for protection.
>
> I suggest adding another lock to the context, which we grab in
> amdgpu_ctx_get() and release in amdgpu_ctx_put().
>
> Regards,
> Christian.
>
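For illustration, a rough sketch of that suggestion: the ctx->lock field
(which would need initializing in ctx setup) is my assumption, and the
existing lookup/refcount bodies are simplified, not quoted from the
driver:

	/* Sketch only: ctx->lock is the assumed new member, held from
	 * amdgpu_ctx_get() until the matching amdgpu_ctx_put(). */
	struct amdgpu_ctx *amdgpu_ctx_get(struct amdgpu_fpriv *fpriv, uint32_t id)
	{
		struct amdgpu_ctx_mgr *mgr = &fpriv->ctx_mgr;
		struct amdgpu_ctx *ctx;

		mutex_lock(&mgr->lock);
		ctx = idr_find(&mgr->ctx_handles, id);
		if (ctx)
			kref_get(&ctx->refcount);
		mutex_unlock(&mgr->lock);

		if (ctx)
			mutex_lock(&ctx->lock);	/* new: serializes CS on this ctx */
		return ctx;
	}

	int amdgpu_ctx_put(struct amdgpu_ctx *ctx)
	{
		if (ctx == NULL)
			return -EINVAL;

		mutex_unlock(&ctx->lock);	/* new: release before dropping the ref */
		kref_put(&ctx->refcount, amdgpu_ctx_do_release);
		return 0;
	}
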
>> +
>>    	r = amdgpu_ctx_add_fence(p->ctx, ring, p->fence, &seq);
>>    	if (r) {
>>    		fence_put(p->fence);
>> @@ -1095,6 +1091,7 @@ int amdgpu_cs_ioctl(struct drm_device *dev, void *data, struct drm_file *filp)
>>    {
>>    	struct amdgpu_device *adev = dev->dev_private;
>>    	union drm_amdgpu_cs *cs = data;
>> +	struct amdgpu_fpriv *fpriv;
>>    	struct amdgpu_cs_parser parser = {};
>>    	bool reserved_buffers = false;
>>    	int i, r;
>> @@ -1104,6 +1101,7 @@ int amdgpu_cs_ioctl(struct drm_device *dev, void *data, struct drm_file *filp)
>>    
>>    	parser.adev = adev;
>>    	parser.filp = filp;
>> +	fpriv = filp->driver_priv;
>>    
>>    	r = amdgpu_cs_parser_init(&parser, data);
>>    	if (r) {
>> @@ -1141,7 +1139,12 @@ int amdgpu_cs_ioctl(struct drm_device *dev, void *data, struct drm_file *filp)
>>    	r = amdgpu_cs_submit(&parser, cs);
>>    
>>    out:
>> -	amdgpu_cs_parser_fini(&parser, r, reserved_buffers);
>> +	if (!r)
>> +		amdgpu_vm_move_pt_bos_in_lru(parser.adev, &fpriv->vm);
>> +	else if (reserved_buffers)
>> +		ttm_eu_backoff_reservation(&parser.ticket, &parser.validated);
>> +
>> +	amdgpu_cs_parser_fini(&parser);
>>    	return r;
>>    }
>>    
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx



