[PATCH] drm/amdgpu: fix error handling in amdgpu_bo_do_create

michel@xxxxxxxxxxx (Michel Dänzer) · Tue, 31 Oct 2017 16:33:04 +0100

On 31/10/17 04:25 PM, Christian KÃ¶nig wrote:
> Am 31.10.2017 um 16:14 schrieb Michel DÃ¤nzer:
>> On 31/10/17 04:03 PM, Christian KÃ¶nig wrote:
>>> Am 31.10.2017 um 09:43 schrieb Michel DÃ¤nzer:
>>>> On 31/10/17 09:37 AM, Christian KÃ¶nig wrote:
>>>>> From: Christian KÃ¶nig <christian.koenig at amd.com>
>>>>>
>>>>> The bo structure is freed up in case of an error, so we can't do any
>>>>> accounting if that happens.
>>>>>
>>>>> Signed-off-by: Christian KÃ¶nig <christian.koenig at amd.com>
>>>>> ---
>>>>> Â Â  drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 6 +++---
>>>>> Â Â  1 file changed, 3 insertions(+), 3 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>>> index e44b880eefdd..ff6f842655d1 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>>> @@ -369,6 +369,9 @@ static int amdgpu_bo_do_create(struct
>>>>> amdgpu_device *adev,
>>>>> Â Â Â Â Â Â  r = ttm_bo_init_reserved(&adev->mman.bdev, &bo->tbo, size,
>>>>> type,
>>>>> Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â  &bo->placement, page_align, !kernel, NULL,
>>>>> Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â  acc_size, sg, resv, &amdgpu_ttm_bo_destroy);
>>>>> +Â Â Â  if (unlikely(r != 0))
>>>>> +Â Â Â Â Â Â Â  return r;
>>>>> +
>>>>> Â Â Â Â Â Â  bytes_moved = atomic64_read(&adev->num_bytes_moved) -
>>>>> Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â  initial_bytes_moved;
>>>>> Â Â Â Â Â Â  if (adev->mc.visible_vram_size < adev->mc.real_vram_size &&
>>>>> @@ -378,9 +381,6 @@ static int amdgpu_bo_do_create(struct
>>>>> amdgpu_device *adev,
>>>>> Â Â Â Â Â Â  else
>>>>> Â Â Â Â Â Â Â Â Â Â  amdgpu_cs_report_moved_bytes(adev, bytes_moved, 0);
>>>>> Â Â  -Â Â Â  if (unlikely(r != 0))
>>>>> -Â Â Â Â Â Â Â  return r;
>>>>> -
>>>>> Â Â Â Â Â Â  if (kernel)
>>>>> Â Â Â Â Â Â Â Â Â Â  bo->tbo.priority = 1;
>>>>> Â  
>>>> Hmm. If it's at least theoretically possible that ttm_bo_init_reserved
>>>> moves some BOs before returning an error, we should instead make the
>>>> accounting safe WRT bo having been freed. But assuming it's not
>>>> possible,
>>> I have made patches for this back in April and send them to the list,
>>> but the value is rather low and you need to change a lot of places in
>>> other drivers as well.
>> At least in this case, simply moving the
>>
>> Â Â Â Â if (adev->mc.visible_vram_size < adev->mc.real_vram_size &&
>> Â Â Â Â Â Â Â  bo->tbo.mem.mem_type == TTM_PL_VRAM &&
>> Â Â Â Â Â Â Â  bo->tbo.mem.start < adev->mc.visible_vram_size >> PAGE_SHIFT)
>>
>> test before the ttm_bo_init_reserved call, and then passing bytes_moved
>> or 0 for the third parameter of amdgpu_cs_report_moved_bytes depending
>> on its result, would work as well, wouldn't it?
> No, that won't work. Neither bo->tbo.mem.mem_type nor bo->tbo.mem.start
> are determined yet before the call to ttm_bo_init().
> 
> The root problem is that we have a mis-assumption about how visible VRAM
> works here (e.g. it doesn't handle the case where a BO is partially in
> visible VRAM) and that we use the global bytes moved counter as
> indicator for the local operation (e.g. to know how many bytes moved the
> current client).
> 
> The only real solution is to give ttm_bo_init_* and ttm_bo_validate a
> context parameter where we can note how much work we had to do to
> fulfill the request.

Right, though in the meantime I wonder if over-accounting (i.e. calling
amdgpu_cs_report_moved_bytes(adev, bytes_moved, bytes_moved)) in the
failure case wouldn't be better than not accounting at all.

-- 
Earthling Michel DÃ¤nzer               |               http://www.amd.com
Libre software enthusiast             |             Mesa and X developer