On 13.02.2017 19:11, Samuel Pitoiset wrote: > > > On 02/13/2017 07:04 PM, Nicolai Hähnle wrote: >> On 13.02.2017 18:49, Samuel Pitoiset wrote: >>> >>> >>> On 02/13/2017 05:25 PM, Nicolai Hähnle wrote: >>>> On 09.02.2017 11:33, Samuel Pitoiset wrote: >>>>> When ttm_bo_init() fails, the reservation mutex should be unlocked. >>>>> >>>>> In debug build, the kernel reported "possible recursive locking >>>>> detected" in this codepath. For debugging purposes, I also added >>>>> a "WARN_ON(ww_mutex_is_locked())" when ttm_bo_init() fails and the >>>>> mutex was locked as expected. >>>>> >>>>> This should fix (random) GPU hangs. The easy way to reproduce the >>>>> issue is to change the "Super Sampling" option from 1.0 to 2.0 in >>>>> Hitman. It will create a huge buffer, evict a bunch of buffers >>>>> (around ~5k) and deadlock. >>>>> >>>>> This regression has been introduced pretty recently. >>>>> >>>>> v2: only release the mutex if resv is NULL >>>>> >>>>> Fixes: 12a852219583 ("drm/amdgpu: improve >>>>> AMDGPU_GEM_CREATE_VRAM_CLEARED handling (v2)") >>>>> Signed-off-by: Samuel Pitoiset <samuel.pitoiset at gmail.com> >>>>> --- >>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 5 ++++- >>>>> 1 file changed, 4 insertions(+), 1 deletion(-) >>>>> >>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c >>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c >>>>> index d1ef1d064de4..556236a112c1 100644 >>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c >>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c >>>>> @@ -403,8 +403,11 @@ int amdgpu_bo_create_restricted(struct >>>>> amdgpu_device *adev, >>>>> &bo->placement, page_align, !kernel, NULL, >>>>> acc_size, sg, resv ? resv : &bo->tbo.ttm_resv, >>>>> &amdgpu_ttm_bo_destroy); >>>>> - if (unlikely(r != 0)) >>>>> + if (unlikely(r != 0)) { >>>>> + if (!resv) >>>>> + ww_mutex_unlock(&bo->tbo.resv->lock); >>>>> return r; >>>>> + } >>>> >>>> I was looking at this myself a couple of weeks back, and I'm pretty >>>> sure >>>> I had this exact same patch just to realize that it's actually >>>> incorrect. >>>> >>>> The problem is that ttm_bo_init will actually call the destroy function >>>> (in our case, amdgpu_ttm_bo_destroy), so at this point, bo has been >>>> freed. >>>> >>>> This code is a huge mess. I'm surprised though: have you verified that >>>> this patch actually fixes a hang? >>> >>> Yes, I triple-checked. I can't reproduce the hangs with Hitman. >> >> That's surprising, but a relief. Maybe it ties into some of the other >> problems I'm seeing as well. >> >> This means we need a real fix for this; I still think the current patch >> is broken. > > Maybe the issue is somewhere else and this not the proper solution, but > I don't think the given patch is broken as-is. It fixes deadlocks which > are pretty easy to reproduce with Hitman (as explained in the commit > description). I'm sorry, but a use-after-free is clearly broken. Nicolai > >> >> >>> This fixes a deadlock, here's the report: >>> https://hastebin.com/durodivoma.xml >>> >>> The resv->lock has to be unlocked when ttm_bo_init() fails (I checked >>> with a WARN_ON(is_locked)) because it doesn't call the destroy function >>> in all situations. Presumably, when drm_vma_offset_add() fails and resv >>> is not NULL, the mutex is not unlocked. >> >> On which code path is the destroy function not called? If that is the >> case, we're leaking memory. >> >> With the patch as-is, the error paths are either leaking memory (if >> you're right) or accessing memory after it's freed (otherwise). >> Obviously, neither is good. > > No, I was wrong. resv is always NULL in this situation. The best > solution is probably to try to clean up that code path because I do > agree: it's a bit messy. > >> >> Nicolai