On 09/05/2024 13:40, Tvrtko Ursulin wrote:
On 08/05/2024 19:09, Tvrtko Ursulin wrote:
From: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxx>
For the last few days I have been looking at the situation with VRAM
oversubscription: what happens versus what perhaps should happen. I browsed
through the driver and ran some simple experiments.
I ended up with this patch series which, as a disclaimer, may be completely
wrong. But since I found some things which look suspicious, to me at least, I
thought it was a good point to stop and request some comments.
To summarise the main issues I think I found:
* Migration rate limiting does not bother knowing if actual migration
  happened and so can over-account and unfairly penalise.
* Migration rate limiting does not even work, at least not for the common
  case where userspace configures VRAM+GTT. It thinks it can stop migration
  attempts by playing with bo->allowed_domains vs bo->preferred_domains but,
  both from the code and from empirical experiments, I see that not working
  at all. Both masks are identical so fiddling with them achieves nothing.
* The idea of the fallback placement only works when VRAM has free space. As
  soon as it does not, ttm_resource_compatible is happy to leave the buffers
  in the secondary placement forever.
* The driver thinks it will be re-validating evicted buffers on the next
  submission, but it does not for the very common case of VRAM+GTT because
  it only checks if the current placement is *none* of the preferred
  placements.
All those problems are addressed in individual patches.
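
As an illustration only (this is not code from the series or from the driver),
the second and fourth points can be modelled in a few lines of C. All names
below (struct sketch_bo, the SKETCH_DOMAIN_* bits and both helpers) are
hypothetical and simply mirror the description above: throttling picks between
two masks, and re-validation triggers only when the current placement is none
of the preferred ones.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical domain bits; the values are illustrative only. */
#define SKETCH_DOMAIN_GTT  (1u << 0)
#define SKETCH_DOMAIN_VRAM (1u << 1)

/* Simplified, hypothetical stand-in for a buffer object. */
struct sketch_bo {
	uint32_t preferred_domains; /* where userspace would like the buffer */
	uint32_t allowed_domains;   /* where the buffer may fall back to */
	uint32_t current_domain;    /* single bit: where the buffer is now */
};

/*
 * Throttling by mask choice: ask for the preferred domains while migration
 * budget remains, otherwise only for the allowed domains.
 */
static uint32_t sketch_pick_domains(const struct sketch_bo *bo, bool budget_left)
{
	return budget_left ? bo->preferred_domains : bo->allowed_domains;
}

/*
 * Re-validation check as described in the list: only true when the current
 * placement is none of the preferred placements.
 */
static bool sketch_needs_revalidation(const struct sketch_bo *bo)
{
	return (bo->current_domain & bo->preferred_domains) == 0;
}
```

With preferred_domains == allowed_domains == VRAM|GTT, sketch_pick_domains()
returns the same mask whether or not budget is left, and a buffer sitting in
GTT never satisfies sketch_needs_revalidation(). With a VRAM-only preferred
mask both helpers behave as the throttling logic expects.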
The end result of this series appears to be a driver which will try harder to
move buffers back into VRAM, but will be (more) correctly throttled in doing
so by the existing rate limiting logic.
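
To make the accounting point concrete, here is a minimal sketch of charging
the migration budget only when a buffer really moved. It is a toy model under
assumed names (struct sketch_budget, sketch_account(), sketch_budget_left()
are all hypothetical), not the actual amdgpu accounting code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-submission migration budget. */
struct sketch_budget {
	uint64_t bytes_moved;     /* bytes charged so far */
	uint64_t bytes_threshold; /* budget for this submission */
};

/*
 * Charge the budget only when validation really moved the buffer, i.e. the
 * placement before and after differ. Charging unconditionally would
 * over-account buffers that stayed where they were.
 */
static void sketch_account(struct sketch_budget *b, uint32_t domain_before,
			   uint32_t domain_after, uint64_t bo_size)
{
	if (domain_before != domain_after)
		b->bytes_moved += bo_size;
}

/* Budget remains while fewer bytes than the threshold have been charged. */
static bool sketch_budget_left(const struct sketch_budget *b)
{
	return b->bytes_moved < b->bytes_threshold;
}
```

A validation which leaves the buffer in place costs nothing in this model, so
only real migrations eat into the budget.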
I have run a quick benchmark of Cyberpunk 2077 and cannot say that I saw a
change, but that could be a good thing too. At least I did not break anything,
perhaps. On one occasion I did see the rate limiting logic get confused: for a
period of a few minutes it went into a mode where it was constantly giving a
high migration budget. But that recovered by itself when I switched clients and
did not come back, so I don't know. If there is something wrong there, I don't
think it would be caused by any patches in this series.
Since yesterday I also briefly tested with Far Cry New Dawn. One run each, so
it possibly doesn't mean anything apart from showing that there isn't a
regression, i.e. migration throttling is keeping things in check even with
increased requests to migrate things back to VRAM:

                     before               after
min/avg/max fps      36/44/54             37/45/55

Cyberpunk 2077 from yesterday was similarly close:

                     26.96/29.59/30.40    29.70/30.00/30.32

I guess the real story is a proper dGPU, where misplaced buffers have a real
cost.
I found one game which regresses spectacularly badly with this series:
Assassin's Creed Valhalla. The built-in benchmark at least. The game
appears to have a working set much larger than the other games I tested,
around 5GiB total during the benchmark. And for some reason migration
throttling totally fails to keep it in check. I will be investigating
this shortly.
Regards,
Tvrtko
The series is probably rough but should be good enough for discussion. I am
curious to hear if I identified at least something correctly as a real problem.
It would also be good to hear which games are suggested for checking whether
there is any improvement.
Cc: Christian König <christian.koenig@xxxxxxx>
Cc: Friedrich Vock <friedrich.vock@xxxxxx>
Tvrtko Ursulin (5):
drm/amdgpu: Fix migration rate limiting accounting
drm/amdgpu: Actually respect buffer migration budget
drm/ttm: Add preferred placement flag
drm/amdgpu: Use preferred placement for VRAM+GTT
drm/amdgpu: Re-validate evicted buffers
drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 38 +++++++++++++++++-----
drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 8 +++--
drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 21 ++++++++++--
drivers/gpu/drm/ttm/ttm_resource.c | 13 +++++---
include/drm/ttm/ttm_placement.h | 3 ++
5 files changed, 65 insertions(+), 18 deletions(-)