On 14.02.2017 03:56, zhoucm1 wrote: > > > On 2017å¹´02æ??14æ?¥ 03:03, Nicolai Hähnle wrote: >> On 13.02.2017 19:58, Nicolai Hähnle wrote: >>> On 13.02.2017 19:38, Samuel Pitoiset wrote: >>>> >>>> >>>> On 02/13/2017 07:09 PM, Nicolai Hähnle wrote: >>>>> On 13.02.2017 19:04, Nicolai Hähnle wrote: >>>>>> On 13.02.2017 18:49, Samuel Pitoiset wrote: >>>>>>> >>>>>>> >>>>>>> On 02/13/2017 05:25 PM, Nicolai Hähnle wrote: >>>>>>>> On 09.02.2017 11:33, Samuel Pitoiset wrote: >>>>>>>>> When ttm_bo_init() fails, the reservation mutex should be >>>>>>>>> unlocked. >>>>>>>>> >>>>>>>>> In debug build, the kernel reported "possible recursive locking >>>>>>>>> detected" in this codepath. For debugging purposes, I also added >>>>>>>>> a "WARN_ON(ww_mutex_is_locked())" when ttm_bo_init() fails and the >>>>>>>>> mutex was locked as expected. >>>>>>>>> >>>>>>>>> This should fix (random) GPU hangs. The easy way to reproduce the >>>>>>>>> issue is to change the "Super Sampling" option from 1.0 to 2.0 in >>>>>>>>> Hitman. It will create a huge buffer, evict a bunch of buffers >>>>>>>>> (around ~5k) and deadlock. >>>>>>>>> >>>>>>>>> This regression has been introduced pretty recently. >>>>>>>>> >>>>>>>>> v2: only release the mutex if resv is NULL >>>>>>>>> >>>>>>>>> Fixes: 12a852219583 ("drm/amdgpu: improve >>>>>>>>> AMDGPU_GEM_CREATE_VRAM_CLEARED handling (v2)") >>>>>>>>> Signed-off-by: Samuel Pitoiset <samuel.pitoiset at gmail.com> >>>>>>>>> --- >>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 5 ++++- >>>>>>>>> 1 file changed, 4 insertions(+), 1 deletion(-) >>>>>>>>> >>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c >>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c >>>>>>>>> index d1ef1d064de4..556236a112c1 100644 >>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c >>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c >>>>>>>>> @@ -403,8 +403,11 @@ int amdgpu_bo_create_restricted(struct >>>>>>>>> amdgpu_device *adev, >>>>>>>>> &bo->placement, page_align, !kernel, NULL, >>>>>>>>> acc_size, sg, resv ? resv : &bo->tbo.ttm_resv, >>>>>>>>> &amdgpu_ttm_bo_destroy); >>>>>>>>> - if (unlikely(r != 0)) >>>>>>>>> + if (unlikely(r != 0)) { >>>>>>>>> + if (!resv) >>>>>>>>> + ww_mutex_unlock(&bo->tbo.resv->lock); >>>>>>>>> return r; >>>>>>>>> + } >>>>>>>> >>>>>>>> I was looking at this myself a couple of weeks back, and I'm pretty >>>>>>>> sure >>>>>>>> I had this exact same patch just to realize that it's actually >>>>>>>> incorrect. >>>>>>>> >>>>>>>> The problem is that ttm_bo_init will actually call the destroy >>>>>>>> function >>>>>>>> (in our case, amdgpu_ttm_bo_destroy), so at this point, bo has been >>>>>>>> freed. >>>>>>>> >>>>>>>> This code is a huge mess. I'm surprised though: have you verified >>>>>>>> that >>>>>>>> this patch actually fixes a hang? >>>>>>> >>>>>>> Yes, I triple-checked. I can't reproduce the hangs with Hitman. >>>>>> >>>>>> That's surprising, but a relief. Maybe it ties into some of the other >>>>>> problems I'm seeing as well. >>>>>> >>>>>> This means we need a real fix for this; I still think the current >>>>>> patch >>>>>> is broken. >>>>>> >>>>>> >>>>>>> This fixes a deadlock, here's the report: >>>>>>> https://hastebin.com/durodivoma.xml >>>>>>> >>>>>>> The resv->lock has to be unlocked when ttm_bo_init() fails (I >>>>>>> checked >>>>>>> with a WARN_ON(is_locked)) because it doesn't call the destroy >>>>>>> function >>>>>>> in all situations. Presumably, when drm_vma_offset_add() fails and >>>>>>> resv >>>>>>> is not NULL, the mutex is not unlocked. >>>>>> >>>>>> On which code path is the destroy function not called? If that is the >>>>>> case, we're leaking memory. >>>>>> >>>>>> With the patch as-is, the error paths are either leaking memory (if >>>>>> you're right) or accessing memory after it's freed (otherwise). >>>>>> Obviously, neither is good. >>>>> >>>>> Actually, I find it extremely suspicious that this patch resolves >>>>> hangs. >>>>> By all rights, no other task should have a pointer to this bo left. It >>>>> points at problems elsewhere in the code, possibly the precise problem >>>>> I've been trying to track down. >>>> >>>> Well, maybe we are just lucky but as I said, I checked many times to >>>> reproduce the issue with that patch applied without any success, you >>>> can >>>> trust me. Although I'm also starting to think that's not the right >>>> solution (and could introduce other ones). >>>> >>>>> >>>>> Could you please revert the patch, reproduce the hang, and report >>>>> /proc/$pid/stack for all the hung tasks? >>>> >>>> Sure. The thing is: Hitman's branch has been updated during the weekend >>>> and my local installation is broken. I need to re-download the whole >>>> game (will take a while). >>>> >>>> I will let you know when I'm able to grab that report. >>> >>> Hmm, so I thought about this some more, and I'm no longer so sure that >>> your bug and mine are the same. If it was related, I'd somehow expect >>> you to get an error about a mutex being destroyed while it's held (at >>> least with lock debugging enabled). >>> >>> Anyway... we need to change the contract of ttm_bo_init, I'm just not >>> yet sure how, because there are two points of failure: one quite early >>> on, and the second rather late which gets cleaned up by ttm_bo_unref. >> >> Maybe it would actually be best to split ttm_bo_init into two parts: >> the initial bulk of structure initialization as the first half, and >> the ttm_bo_validate in the second half. > Agreed. Have you gone ahead with your proposal? I'm looking into it now. Nicolai > Although Samuel's patch isn't best way, it indeed fix a OCL bug which is > trying to allocate multiple big buffers. > > Thanks, > David Zhou >> >> Cheers, >> Nicolai >> >>> >>> Cheers, >>> Nicolai >>> >>>> Thanks Nicolai. >>>>> >>>>> Thanks, >>>>> Nicolai >>> >> >> _______________________________________________ >> amd-gfx mailing list >> amd-gfx at lists.freedesktop.org >> https://lists.freedesktop.org/mailman/listinfo/amd-gfx >