On 02/13/2017 07:41 PM, Christian König wrote: > Am 13.02.2017 um 19:32 schrieb Samuel Pitoiset: >> >> >> On 02/13/2017 07:19 PM, Nicolai Hähnle wrote: >>> On 13.02.2017 19:11, Samuel Pitoiset wrote: >>>> >>>> >>>> On 02/13/2017 07:04 PM, Nicolai Hähnle wrote: >>>>> On 13.02.2017 18:49, Samuel Pitoiset wrote: >>>>>> >>>>>> >>>>>> On 02/13/2017 05:25 PM, Nicolai Hähnle wrote: >>>>>>> On 09.02.2017 11:33, Samuel Pitoiset wrote: >>>>>>>> When ttm_bo_init() fails, the reservation mutex should be unlocked. >>>>>>>> >>>>>>>> In debug build, the kernel reported "possible recursive locking >>>>>>>> detected" in this codepath. For debugging purposes, I also added >>>>>>>> a "WARN_ON(ww_mutex_is_locked())" when ttm_bo_init() fails and the >>>>>>>> mutex was locked as expected. >>>>>>>> >>>>>>>> This should fix (random) GPU hangs. The easy way to reproduce the >>>>>>>> issue is to change the "Super Sampling" option from 1.0 to 2.0 in >>>>>>>> Hitman. It will create a huge buffer, evict a bunch of buffers >>>>>>>> (around ~5k) and deadlock. >>>>>>>> >>>>>>>> This regression has been introduced pretty recently. >>>>>>>> >>>>>>>> v2: only release the mutex if resv is NULL >>>>>>>> >>>>>>>> Fixes: 12a852219583 ("drm/amdgpu: improve >>>>>>>> AMDGPU_GEM_CREATE_VRAM_CLEARED handling (v2)") >>>>>>>> Signed-off-by: Samuel Pitoiset <samuel.pitoiset at gmail.com> >>>>>>>> --- >>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 5 ++++- >>>>>>>> 1 file changed, 4 insertions(+), 1 deletion(-) >>>>>>>> >>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c >>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c >>>>>>>> index d1ef1d064de4..556236a112c1 100644 >>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c >>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c >>>>>>>> @@ -403,8 +403,11 @@ int amdgpu_bo_create_restricted(struct >>>>>>>> amdgpu_device *adev, >>>>>>>> &bo->placement, page_align, !kernel, NULL, >>>>>>>> acc_size, sg, resv ? resv : &bo->tbo.ttm_resv, >>>>>>>> &amdgpu_ttm_bo_destroy); >>>>>>>> - if (unlikely(r != 0)) >>>>>>>> + if (unlikely(r != 0)) { >>>>>>>> + if (!resv) >>>>>>>> + ww_mutex_unlock(&bo->tbo.resv->lock); >>>>>>>> return r; >>>>>>>> + } >>>>>>> >>>>>>> I was looking at this myself a couple of weeks back, and I'm pretty >>>>>>> sure >>>>>>> I had this exact same patch just to realize that it's actually >>>>>>> incorrect. >>>>>>> >>>>>>> The problem is that ttm_bo_init will actually call the destroy >>>>>>> function >>>>>>> (in our case, amdgpu_ttm_bo_destroy), so at this point, bo has been >>>>>>> freed. >>>>>>> >>>>>>> This code is a huge mess. I'm surprised though: have you verified >>>>>>> that >>>>>>> this patch actually fixes a hang? >>>>>> >>>>>> Yes, I triple-checked. I can't reproduce the hangs with Hitman. >>>>> >>>>> That's surprising, but a relief. Maybe it ties into some of the other >>>>> problems I'm seeing as well. >>>>> >>>>> This means we need a real fix for this; I still think the current >>>>> patch >>>>> is broken. >>>> >>>> Maybe the issue is somewhere else and this not the proper solution, but >>>> I don't think the given patch is broken as-is. It fixes deadlocks which >>>> are pretty easy to reproduce with Hitman (as explained in the commit >>>> description). >>> >>> I'm sorry, but a use-after-free is clearly broken. >> >> You are right. If the destroy callback is called, there is a >> use-after-free which is bad, really.. > > bad. Calling the destroy callback when something goes wrong sound fishy > to me in the first place when the structure initialized here is > allocated by the caller. > > Probably best to clean that up from the beginning. Yes. I wonder where the original issue comes from though. I will need to investigate more. > > Regards, > Christian. > >> >>> >>> Nicolai >>> >>>> >>>>> >>>>> >>>>>> This fixes a deadlock, here's the report: >>>>>> https://hastebin.com/durodivoma.xml >>>>>> >>>>>> The resv->lock has to be unlocked when ttm_bo_init() fails (I checked >>>>>> with a WARN_ON(is_locked)) because it doesn't call the destroy >>>>>> function >>>>>> in all situations. Presumably, when drm_vma_offset_add() fails and >>>>>> resv >>>>>> is not NULL, the mutex is not unlocked. >>>>> >>>>> On which code path is the destroy function not called? If that is the >>>>> case, we're leaking memory. >>>>> >>>>> With the patch as-is, the error paths are either leaking memory (if >>>>> you're right) or accessing memory after it's freed (otherwise). >>>>> Obviously, neither is good. >>>> >>>> No, I was wrong. resv is always NULL in this situation. The best >>>> solution is probably to try to clean up that code path because I do >>>> agree: it's a bit messy. >>>> >>>>> >>>>> Nicolai >>> >> _______________________________________________ >> amd-gfx mailing list >> amd-gfx at lists.freedesktop.org >> https://lists.freedesktop.org/mailman/listinfo/amd-gfx > >