[PATCH v2 1/2] drm/amdgpu: fix a potential deadlock in amdgpu_bo_create_restricted()

nhaehnle@xxxxxxxxx (Nicolai Hähnle) · Mon, 13 Feb 2017 19:19:54 +0100

On 13.02.2017 19:11, Samuel Pitoiset wrote:
>
>
> On 02/13/2017 07:04 PM, Nicolai HÃ¤hnle wrote:
>> On 13.02.2017 18:49, Samuel Pitoiset wrote:
>>>
>>>
>>> On 02/13/2017 05:25 PM, Nicolai HÃ¤hnle wrote:
>>>> On 09.02.2017 11:33, Samuel Pitoiset wrote:
>>>>> When ttm_bo_init() fails, the reservation mutex should be unlocked.
>>>>>
>>>>> In debug build, the kernel reported "possible recursive locking
>>>>> detected" in this codepath. For debugging purposes, I also added
>>>>> a "WARN_ON(ww_mutex_is_locked())" when ttm_bo_init() fails and the
>>>>> mutex was locked as expected.
>>>>>
>>>>> This should fix (random) GPU hangs. The easy way to reproduce the
>>>>> issue is to change the "Super Sampling" option from 1.0 to 2.0 in
>>>>> Hitman. It will create a huge buffer, evict a bunch of buffers
>>>>> (around ~5k) and deadlock.
>>>>>
>>>>> This regression has been introduced pretty recently.
>>>>>
>>>>> v2: only release the mutex if resv is NULL
>>>>>
>>>>> Fixes: 12a852219583 ("drm/amdgpu: improve
>>>>> AMDGPU_GEM_CREATE_VRAM_CLEARED handling (v2)")
>>>>> Signed-off-by: Samuel Pitoiset <samuel.pitoiset at gmail.com>
>>>>> ---
>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 5 ++++-
>>>>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>>> index d1ef1d064de4..556236a112c1 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>>> @@ -403,8 +403,11 @@ int amdgpu_bo_create_restricted(struct
>>>>> amdgpu_device *adev,
>>>>>              &bo->placement, page_align, !kernel, NULL,
>>>>>              acc_size, sg, resv ? resv : &bo->tbo.ttm_resv,
>>>>>              &amdgpu_ttm_bo_destroy);
>>>>> -    if (unlikely(r != 0))
>>>>> +    if (unlikely(r != 0)) {
>>>>> +        if (!resv)
>>>>> +            ww_mutex_unlock(&bo->tbo.resv->lock);
>>>>>          return r;
>>>>> +    }
>>>>
>>>> I was looking at this myself a couple of weeks back, and I'm pretty
>>>> sure
>>>> I had this exact same patch just to realize that it's actually
>>>> incorrect.
>>>>
>>>> The problem is that ttm_bo_init will actually call the destroy function
>>>> (in our case, amdgpu_ttm_bo_destroy), so at this point, bo has been
>>>> freed.
>>>>
>>>> This code is a huge mess. I'm surprised though: have you verified that
>>>> this patch actually fixes a hang?
>>>
>>> Yes, I triple-checked. I can't reproduce the hangs with Hitman.
>>
>> That's surprising, but a relief. Maybe it ties into some of the other
>> problems I'm seeing as well.
>>
>> This means we need a real fix for this; I still think the current patch
>> is broken.
>
> Maybe the issue is somewhere else and this not the proper solution, but
> I don't think the given patch is broken as-is. It fixes deadlocks which
> are pretty easy to reproduce with Hitman (as explained in the commit
> description).

I'm sorry, but a use-after-free is clearly broken.

Nicolai

>
>>
>>
>>> This fixes a deadlock, here's the report:
>>> https://hastebin.com/durodivoma.xml
>>>
>>> The resv->lock has to be unlocked when ttm_bo_init() fails (I checked
>>> with a WARN_ON(is_locked)) because it doesn't call the destroy function
>>> in all situations. Presumably, when drm_vma_offset_add() fails and resv
>>> is not NULL, the mutex is not unlocked.
>>
>> On which code path is the destroy function not called? If that is the
>> case, we're leaking memory.
>>
>> With the patch as-is, the error paths are either leaking memory (if
>> you're right) or accessing memory after it's freed (otherwise).
>> Obviously, neither is good.
>
> No, I was wrong. resv is always NULL in this situation. The best
> solution is probably to try to clean up that code path because I do
> agree: it's a bit messy.
>
>>
>> Nicolai