On 02/13/2017 05:25 PM, Nicolai Hähnle wrote: > On 09.02.2017 11:33, Samuel Pitoiset wrote: >> When ttm_bo_init() fails, the reservation mutex should be unlocked. >> >> In debug build, the kernel reported "possible recursive locking >> detected" in this codepath. For debugging purposes, I also added >> a "WARN_ON(ww_mutex_is_locked())" when ttm_bo_init() fails and the >> mutex was locked as expected. >> >> This should fix (random) GPU hangs. The easy way to reproduce the >> issue is to change the "Super Sampling" option from 1.0 to 2.0 in >> Hitman. It will create a huge buffer, evict a bunch of buffers >> (around ~5k) and deadlock. >> >> This regression has been introduced pretty recently. >> >> v2: only release the mutex if resv is NULL >> >> Fixes: 12a852219583 ("drm/amdgpu: improve >> AMDGPU_GEM_CREATE_VRAM_CLEARED handling (v2)") >> Signed-off-by: Samuel Pitoiset <samuel.pitoiset at gmail.com> >> --- >> drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 5 ++++- >> 1 file changed, 4 insertions(+), 1 deletion(-) >> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c >> index d1ef1d064de4..556236a112c1 100644 >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c >> @@ -403,8 +403,11 @@ int amdgpu_bo_create_restricted(struct >> amdgpu_device *adev, >> &bo->placement, page_align, !kernel, NULL, >> acc_size, sg, resv ? resv : &bo->tbo.ttm_resv, >> &amdgpu_ttm_bo_destroy); >> - if (unlikely(r != 0)) >> + if (unlikely(r != 0)) { >> + if (!resv) >> + ww_mutex_unlock(&bo->tbo.resv->lock); >> return r; >> + } > > I was looking at this myself a couple of weeks back, and I'm pretty sure > I had this exact same patch just to realize that it's actually incorrect. > > The problem is that ttm_bo_init will actually call the destroy function > (in our case, amdgpu_ttm_bo_destroy), so at this point, bo has been freed. > > This code is a huge mess. I'm surprised though: have you verified that > this patch actually fixes a hang? Yes, I triple-checked. I can't reproduce the hangs with Hitman. This fixes a deadlock, here's the report: https://hastebin.com/durodivoma.xml The resv->lock has to be unlocked when ttm_bo_init() fails (I checked with a WARN_ON(is_locked)) because it doesn't call the destroy function in all situations. Presumably, when drm_vma_offset_add() fails and resv is not NULL, the mutex is not unlocked. > > Cheers, > Nicolai > > >> >> bo->tbo.priority = ilog2(bo->tbo.num_pages); >> if (kernel) >> >