On 13.02.2017 19:58, Nicolai Hähnle wrote:
> On 13.02.2017 19:38, Samuel Pitoiset wrote:
>>
>>
>> On 02/13/2017 07:09 PM, Nicolai Hähnle wrote:
>>> On 13.02.2017 19:04, Nicolai Hähnle wrote:
>>>> On 13.02.2017 18:49, Samuel Pitoiset wrote:
>>>>>
>>>>>
>>>>> On 02/13/2017 05:25 PM, Nicolai Hähnle wrote:
>>>>>> On 09.02.2017 11:33, Samuel Pitoiset wrote:
>>>>>>> When ttm_bo_init() fails, the reservation mutex should be unlocked.
>>>>>>>
>>>>>>> In debug build, the kernel reported "possible recursive locking
>>>>>>> detected" in this codepath. For debugging purposes, I also added a
>>>>>>> "WARN_ON(ww_mutex_is_locked())" when ttm_bo_init() fails and the
>>>>>>> mutex was locked as expected.
>>>>>>>
>>>>>>> This should fix (random) GPU hangs. The easy way to reproduce the
>>>>>>> issue is to change the "Super Sampling" option from 1.0 to 2.0 in
>>>>>>> Hitman. It will create a huge buffer, evict a bunch of buffers
>>>>>>> (around ~5k) and deadlock.
>>>>>>>
>>>>>>> This regression has been introduced pretty recently.
>>>>>>>
>>>>>>> v2: only release the mutex if resv is NULL
>>>>>>>
>>>>>>> Fixes: 12a852219583 ("drm/amdgpu: improve AMDGPU_GEM_CREATE_VRAM_CLEARED handling (v2)")
>>>>>>> Signed-off-by: Samuel Pitoiset <samuel.pitoiset at gmail.com>
>>>>>>> ---
>>>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 5 ++++-
>>>>>>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>>>>> index d1ef1d064de4..556236a112c1 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>>>>> @@ -403,8 +403,11 @@ int amdgpu_bo_create_restricted(struct amdgpu_device *adev,
>>>>>>>  			&bo->placement, page_align, !kernel, NULL,
>>>>>>>  			acc_size, sg, resv ? resv : &bo->tbo.ttm_resv,
>>>>>>>  			&amdgpu_ttm_bo_destroy);
>>>>>>> -	if (unlikely(r != 0))
>>>>>>> +	if (unlikely(r != 0)) {
>>>>>>> +		if (!resv)
>>>>>>> +			ww_mutex_unlock(&bo->tbo.resv->lock);
>>>>>>>  		return r;
>>>>>>> +	}
>>>>>>
>>>>>> I was looking at this myself a couple of weeks back, and I'm pretty
>>>>>> sure I had this exact same patch, just to realize that it's actually
>>>>>> incorrect.
>>>>>>
>>>>>> The problem is that ttm_bo_init will actually call the destroy
>>>>>> function (in our case, amdgpu_ttm_bo_destroy), so at this point, bo
>>>>>> has been freed.
>>>>>>
>>>>>> This code is a huge mess. I'm surprised though: have you verified
>>>>>> that this patch actually fixes a hang?
>>>>>
>>>>> Yes, I triple-checked. I can't reproduce the hangs with Hitman.
>>>>
>>>> That's surprising, but a relief. Maybe it ties into some of the other
>>>> problems I'm seeing as well.
>>>>
>>>> This means we need a real fix for this; I still think the current
>>>> patch is broken.
>>>>
>>>>
>>>>> This fixes a deadlock, here's the report:
>>>>> https://hastebin.com/durodivoma.xml
>>>>>
>>>>> The resv->lock has to be unlocked when ttm_bo_init() fails (I checked
>>>>> with a WARN_ON(is_locked)) because it doesn't call the destroy
>>>>> function in all situations. Presumably, when drm_vma_offset_add()
>>>>> fails and resv is not NULL, the mutex is not unlocked.
>>>>
>>>> On which code path is the destroy function not called? If that is the
>>>> case, we're leaking memory.
>>>>
>>>> With the patch as-is, the error paths are either leaking memory (if
>>>> you're right) or accessing memory after it's freed (otherwise).
>>>> Obviously, neither is good.
>>>
>>> Actually, I find it extremely suspicious that this patch resolves
>>> hangs. By all rights, no other task should have a pointer to this bo
>>> left. It points at problems elsewhere in the code, possibly the
>>> precise problem I've been trying to track down.
>>
>> Well, maybe we are just lucky, but as I said, I tried many times to
>> reproduce the issue with that patch applied, without any success; you
>> can trust me. Although I'm also starting to think that this is not the
>> right solution (and could introduce other problems).
>>
>>>
>>> Could you please revert the patch, reproduce the hang, and report
>>> /proc/$pid/stack for all the hung tasks?
>>
>> Sure. The thing is: Hitman's branch was updated during the weekend and
>> my local installation is broken. I need to re-download the whole game
>> (it will take a while).
>>
>> I will let you know when I'm able to grab that report.
>
> Hmm, so I thought about this some more, and I'm no longer so sure that
> your bug and mine are the same. If it were related, I'd somehow expect
> you to get an error about a mutex being destroyed while it's held (at
> least with lock debugging enabled).
>
> Anyway... we need to change the contract of ttm_bo_init; I'm just not
> yet sure how, because there are two points of failure: one quite early
> on, and a second, rather late one which gets cleaned up by ttm_bo_unref.

Maybe it would actually be best to split ttm_bo_init into two parts: the
initial bulk of structure initialization as the first half, and the
ttm_bo_validate call as the second half.

Cheers,
Nicolai

>
> Cheers,
> Nicolai
>
>> Thanks, Nicolai.
>>>
>>> Thanks,
>>> Nicolai
>
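
To make the proposed contract a bit more concrete, here is a minimal
standalone sketch of the two-phase pattern suggested above. All names in
it (object_init_reserved, object_validate, object_unref) are invented
for illustration and are not the TTM API; the point is only that once
initialization and validation are separate steps, every failure path
has exactly one unambiguous owner for unlocking and freeing the object.

#include <stdio.h>
#include <stdlib.h>

struct object {
        int refcount;
        int locked;                     /* stands in for the reservation lock */
        void (*destroy)(struct object *);
};

static void object_destroy(struct object *obj)
{
        printf("destroy callback: freeing object\n");
        free(obj);
}

/* First half: plain structure initialization. On failure nothing has been
 * published and no lock is held, so the caller simply frees its allocation. */
static int object_init_reserved(struct object *obj, int fail)
{
        if (fail)
                return -1;
        obj->refcount = 1;
        obj->locked = 1;                /* handed back reserved */
        obj->destroy = object_destroy;
        return 0;
}

static void object_unref(struct object **obj)
{
        if (--(*obj)->refcount == 0)
                (*obj)->destroy(*obj);
        *obj = NULL;
}

/* Second half: placement/validation. A failure here is cleaned up through the
 * refcount and the destroy callback, never by the caller touching freed
 * memory or guessing at the lock state. */
static int object_validate(struct object *obj, int fail)
{
        (void)obj;
        return fail ? -1 : 0;
}

int main(void)
{
        struct object *obj = calloc(1, sizeof(*obj));

        if (!obj)
                return 1;

        if (object_init_reserved(obj, 0)) {
                free(obj);              /* nothing published, nothing locked */
                return 1;
        }

        if (object_validate(obj, 1)) {
                obj->locked = 0;        /* caller unlocks, then drops its ref */
                object_unref(&obj);     /* destroy callback runs exactly once */
                return 1;
        }

        obj->locked = 0;
        object_unref(&obj);
        return 0;
}

In the real code the first half would correspond to the structure setup
currently done inside ttm_bo_init and the second half to the
ttm_bo_validate call, so a caller like amdgpu_bo_create_restricted would
never have to guess whether the destroy callback has already run.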