[PATCH v2 1/2] drm/amdgpu: fix a potential deadlock in amdgpu_bo_create_restricted()

samuel.pitoiset@xxxxxxxxx (Samuel Pitoiset) · Mon, 13 Feb 2017 19:45:37 +0100

On 02/13/2017 07:41 PM, Christian KÃ¶nig wrote:
> Am 13.02.2017 um 19:32 schrieb Samuel Pitoiset:
>>
>>
>> On 02/13/2017 07:19 PM, Nicolai HÃ¤hnle wrote:
>>> On 13.02.2017 19:11, Samuel Pitoiset wrote:
>>>>
>>>>
>>>> On 02/13/2017 07:04 PM, Nicolai HÃ¤hnle wrote:
>>>>> On 13.02.2017 18:49, Samuel Pitoiset wrote:
>>>>>>
>>>>>>
>>>>>> On 02/13/2017 05:25 PM, Nicolai HÃ¤hnle wrote:
>>>>>>> On 09.02.2017 11:33, Samuel Pitoiset wrote:
>>>>>>>> When ttm_bo_init() fails, the reservation mutex should be unlocked.
>>>>>>>>
>>>>>>>> In debug build, the kernel reported "possible recursive locking
>>>>>>>> detected" in this codepath. For debugging purposes, I also added
>>>>>>>> a "WARN_ON(ww_mutex_is_locked())" when ttm_bo_init() fails and the
>>>>>>>> mutex was locked as expected.
>>>>>>>>
>>>>>>>> This should fix (random) GPU hangs. The easy way to reproduce the
>>>>>>>> issue is to change the "Super Sampling" option from 1.0 to 2.0 in
>>>>>>>> Hitman. It will create a huge buffer, evict a bunch of buffers
>>>>>>>> (around ~5k) and deadlock.
>>>>>>>>
>>>>>>>> This regression has been introduced pretty recently.
>>>>>>>>
>>>>>>>> v2: only release the mutex if resv is NULL
>>>>>>>>
>>>>>>>> Fixes: 12a852219583 ("drm/amdgpu: improve
>>>>>>>> AMDGPU_GEM_CREATE_VRAM_CLEARED handling (v2)")
>>>>>>>> Signed-off-by: Samuel Pitoiset <samuel.pitoiset at gmail.com>
>>>>>>>> ---
>>>>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 5 ++++-
>>>>>>>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>>>>>> index d1ef1d064de4..556236a112c1 100644
>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>>>>>> @@ -403,8 +403,11 @@ int amdgpu_bo_create_restricted(struct
>>>>>>>> amdgpu_device *adev,
>>>>>>>>              &bo->placement, page_align, !kernel, NULL,
>>>>>>>>              acc_size, sg, resv ? resv : &bo->tbo.ttm_resv,
>>>>>>>>              &amdgpu_ttm_bo_destroy);
>>>>>>>> -    if (unlikely(r != 0))
>>>>>>>> +    if (unlikely(r != 0)) {
>>>>>>>> +        if (!resv)
>>>>>>>> + ww_mutex_unlock(&bo->tbo.resv->lock);
>>>>>>>>          return r;
>>>>>>>> +    }
>>>>>>>
>>>>>>> I was looking at this myself a couple of weeks back, and I'm pretty
>>>>>>> sure
>>>>>>> I had this exact same patch just to realize that it's actually
>>>>>>> incorrect.
>>>>>>>
>>>>>>> The problem is that ttm_bo_init will actually call the destroy
>>>>>>> function
>>>>>>> (in our case, amdgpu_ttm_bo_destroy), so at this point, bo has been
>>>>>>> freed.
>>>>>>>
>>>>>>> This code is a huge mess. I'm surprised though: have you verified
>>>>>>> that
>>>>>>> this patch actually fixes a hang?
>>>>>>
>>>>>> Yes, I triple-checked. I can't reproduce the hangs with Hitman.
>>>>>
>>>>> That's surprising, but a relief. Maybe it ties into some of the other
>>>>> problems I'm seeing as well.
>>>>>
>>>>> This means we need a real fix for this; I still think the current
>>>>> patch
>>>>> is broken.
>>>>
>>>> Maybe the issue is somewhere else and this not the proper solution, but
>>>> I don't think the given patch is broken as-is. It fixes deadlocks which
>>>> are pretty easy to reproduce with Hitman (as explained in the commit
>>>> description).
>>>
>>> I'm sorry, but a use-after-free is clearly broken.
>>
>> You are right. If the destroy callback is called, there is a
>> use-after-free which is bad, really..
>
> bad. Calling the destroy callback when something goes wrong sound fishy
> to me in the first place when the structure initialized here is
> allocated by the caller.
>
> Probably best to clean that up from the beginning.

Yes.
I wonder where the original issue comes from though. I will need to 
investigate more.

>
> Regards,
> Christian.
>
>>
>>>
>>> Nicolai
>>>
>>>>
>>>>>
>>>>>
>>>>>> This fixes a deadlock, here's the report:
>>>>>> https://hastebin.com/durodivoma.xml
>>>>>>
>>>>>> The resv->lock has to be unlocked when ttm_bo_init() fails (I checked
>>>>>> with a WARN_ON(is_locked)) because it doesn't call the destroy
>>>>>> function
>>>>>> in all situations. Presumably, when drm_vma_offset_add() fails and
>>>>>> resv
>>>>>> is not NULL, the mutex is not unlocked.
>>>>>
>>>>> On which code path is the destroy function not called? If that is the
>>>>> case, we're leaking memory.
>>>>>
>>>>> With the patch as-is, the error paths are either leaking memory (if
>>>>> you're right) or accessing memory after it's freed (otherwise).
>>>>> Obviously, neither is good.
>>>>
>>>> No, I was wrong. resv is always NULL in this situation. The best
>>>> solution is probably to try to clean up that code path because I do
>>>> agree: it's a bit messy.
>>>>
>>>>>
>>>>> Nicolai
>>>
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx at lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>
>