[PATCH v2 1/2] drm/amdgpu: fix a potential deadlock in amdgpu_bo_create_restricted()

nhaehnle@xxxxxxxxx (Nicolai Hähnle) · Tue, 14 Feb 2017 09:30:59 +0100

On 14.02.2017 03:56, zhoucm1 wrote:
>
>
> On 2017å¹´02æ??14æ?¥ 03:03, Nicolai HÃ¤hnle wrote:
>> On 13.02.2017 19:58, Nicolai HÃ¤hnle wrote:
>>> On 13.02.2017 19:38, Samuel Pitoiset wrote:
>>>>
>>>>
>>>> On 02/13/2017 07:09 PM, Nicolai HÃ¤hnle wrote:
>>>>> On 13.02.2017 19:04, Nicolai HÃ¤hnle wrote:
>>>>>> On 13.02.2017 18:49, Samuel Pitoiset wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 02/13/2017 05:25 PM, Nicolai HÃ¤hnle wrote:
>>>>>>>> On 09.02.2017 11:33, Samuel Pitoiset wrote:
>>>>>>>>> When ttm_bo_init() fails, the reservation mutex should be
>>>>>>>>> unlocked.
>>>>>>>>>
>>>>>>>>> In debug build, the kernel reported "possible recursive locking
>>>>>>>>> detected" in this codepath. For debugging purposes, I also added
>>>>>>>>> a "WARN_ON(ww_mutex_is_locked())" when ttm_bo_init() fails and the
>>>>>>>>> mutex was locked as expected.
>>>>>>>>>
>>>>>>>>> This should fix (random) GPU hangs. The easy way to reproduce the
>>>>>>>>> issue is to change the "Super Sampling" option from 1.0 to 2.0 in
>>>>>>>>> Hitman. It will create a huge buffer, evict a bunch of buffers
>>>>>>>>> (around ~5k) and deadlock.
>>>>>>>>>
>>>>>>>>> This regression has been introduced pretty recently.
>>>>>>>>>
>>>>>>>>> v2: only release the mutex if resv is NULL
>>>>>>>>>
>>>>>>>>> Fixes: 12a852219583 ("drm/amdgpu: improve
>>>>>>>>> AMDGPU_GEM_CREATE_VRAM_CLEARED handling (v2)")
>>>>>>>>> Signed-off-by: Samuel Pitoiset <samuel.pitoiset at gmail.com>
>>>>>>>>> ---
>>>>>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 5 ++++-
>>>>>>>>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>>>>>>> index d1ef1d064de4..556236a112c1 100644
>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>>>>>>> @@ -403,8 +403,11 @@ int amdgpu_bo_create_restricted(struct
>>>>>>>>> amdgpu_device *adev,
>>>>>>>>>              &bo->placement, page_align, !kernel, NULL,
>>>>>>>>>              acc_size, sg, resv ? resv : &bo->tbo.ttm_resv,
>>>>>>>>>              &amdgpu_ttm_bo_destroy);
>>>>>>>>> -    if (unlikely(r != 0))
>>>>>>>>> +    if (unlikely(r != 0)) {
>>>>>>>>> +        if (!resv)
>>>>>>>>> + ww_mutex_unlock(&bo->tbo.resv->lock);
>>>>>>>>>          return r;
>>>>>>>>> +    }
>>>>>>>>
>>>>>>>> I was looking at this myself a couple of weeks back, and I'm pretty
>>>>>>>> sure
>>>>>>>> I had this exact same patch just to realize that it's actually
>>>>>>>> incorrect.
>>>>>>>>
>>>>>>>> The problem is that ttm_bo_init will actually call the destroy
>>>>>>>> function
>>>>>>>> (in our case, amdgpu_ttm_bo_destroy), so at this point, bo has been
>>>>>>>> freed.
>>>>>>>>
>>>>>>>> This code is a huge mess. I'm surprised though: have you verified
>>>>>>>> that
>>>>>>>> this patch actually fixes a hang?
>>>>>>>
>>>>>>> Yes, I triple-checked. I can't reproduce the hangs with Hitman.
>>>>>>
>>>>>> That's surprising, but a relief. Maybe it ties into some of the other
>>>>>> problems I'm seeing as well.
>>>>>>
>>>>>> This means we need a real fix for this; I still think the current
>>>>>> patch
>>>>>> is broken.
>>>>>>
>>>>>>
>>>>>>> This fixes a deadlock, here's the report:
>>>>>>> https://hastebin.com/durodivoma.xml
>>>>>>>
>>>>>>> The resv->lock has to be unlocked when ttm_bo_init() fails (I
>>>>>>> checked
>>>>>>> with a WARN_ON(is_locked)) because it doesn't call the destroy
>>>>>>> function
>>>>>>> in all situations. Presumably, when drm_vma_offset_add() fails and
>>>>>>> resv
>>>>>>> is not NULL, the mutex is not unlocked.
>>>>>>
>>>>>> On which code path is the destroy function not called? If that is the
>>>>>> case, we're leaking memory.
>>>>>>
>>>>>> With the patch as-is, the error paths are either leaking memory (if
>>>>>> you're right) or accessing memory after it's freed (otherwise).
>>>>>> Obviously, neither is good.
>>>>>
>>>>> Actually, I find it extremely suspicious that this patch resolves
>>>>> hangs.
>>>>> By all rights, no other task should have a pointer to this bo left. It
>>>>> points at problems elsewhere in the code, possibly the precise problem
>>>>> I've been trying to track down.
>>>>
>>>> Well, maybe we are just lucky but as I said, I checked many times to
>>>> reproduce the issue with that patch applied without any success, you
>>>> can
>>>> trust me. Although I'm also starting to think that's not the right
>>>> solution (and could introduce other ones).
>>>>
>>>>>
>>>>> Could you please revert the patch, reproduce the hang, and report
>>>>> /proc/$pid/stack for all the hung tasks?
>>>>
>>>> Sure. The thing is: Hitman's branch has been updated during the weekend
>>>> and my local installation is broken. I need to re-download the whole
>>>> game (will take a while).
>>>>
>>>> I will let you know when I'm able to grab that report.
>>>
>>> Hmm, so I thought about this some more, and I'm no longer so sure that
>>> your bug and mine are the same. If it was related, I'd somehow expect
>>> you to get an error about a mutex being destroyed while it's held (at
>>> least with lock debugging enabled).
>>>
>>> Anyway... we need to change the contract of ttm_bo_init, I'm just not
>>> yet sure how, because there are two points of failure: one quite early
>>> on, and the second rather late which gets cleaned up by ttm_bo_unref.
>>
>> Maybe it would actually be best to split ttm_bo_init into two parts:
>> the initial bulk of structure initialization as the first half, and
>> the ttm_bo_validate in the second half.
> Agreed. Have you gone ahead with your proposal?

I'm looking into it now.

Nicolai

> Although Samuel's patch isn't best way, it indeed fix a OCL bug which is
> trying to allocate multiple big buffers.
>
> Thanks,
> David Zhou
>>
>> Cheers,
>> Nicolai
>>
>>>
>>> Cheers,
>>> Nicolai
>>>
>>>> Thanks Nicolai.
>>>>>
>>>>> Thanks,
>>>>> Nicolai
>>>
>>
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx at lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>