Re: [Intel-xe] [PATCH 1/3] drm/suballoc: Introduce a generic suballocation manager

Christian König <christian.koenig@xxxxxxx> · Wed, 22 Feb 2023 15:20:10 +0100

Am 22.02.23 um 14:54 schrieb Thomas Hellström:
Hi,

On 2/22/23 12:39, Christian König wrote:
Hi Thomas,

Am 22.02.23 um 12:00 schrieb Thomas Hellström:
Hi, Christian,

So I resurrected Maarten's previous patch series around this (the 
amdgpu suballocator) slightly modified the code to match the API of 
this patch series, re-introduced the per-allocation alignment as per 
a previous review comment from you on that series, and made 
checkpatch.pl pass mostly, except for pre-existing style problems, 
and added / fixed some comments. No memory corruption seen so far on 
limited Xe testing.

To move this forward I suggest starting with that as a common drm 
suballocator. I'll post the series later today. We can follow up 
with potential simplifactions lif needed.

I also made a kunit test also reporting some timing information. 
Will post that as a follow up. Some interesting preliminary 
conclusions:

* drm_mm is per se not a cpu hog, If the rb tree processing is 
disabled and the EVICT algorithm is changed from MRU to ring-like 
LRU traversal, it's more or less just as fast as the ring suballocator.

* With a single ring, and the suballocation buffer never completely 
filled (no sleeps) the amd suballocator is a bit faster per 
allocation / free. (Around 250 ns instead of 350). Allocation is 
slightly slower on the amdgpu one, freeing is faster, mostly due to 
the locking overhead incurred when setting up the fence callbacks, 
and for avoiding irq-disabled processing on the one I proposed.

For some more realistic numbers try to signal the fence from another 
CPU. Alternative you can invalidate all the CPU read cache lines 
touched by the fence callback so that they need to be read in again 
from the allocating CPU.

Fences are signalled using hr-timer driven fake "ring"s, so should 
probably be distributed among cpus in a pretty realistic way. But 
anyway I agree results obtained from that kunit test can and should be 
challenged before we actually use them for improvements.

I would double check that. My expectation is that hr-timers execute by 
default on the CPU from which they are started.

* With multiple rings and varying allocation sizes and signalling 
times creating fragmentation, the picture becomes different as the 
amdgpu allocator starts to sleep/throttle already round 50% - 75% 
fill. The one I proposed between 75% to 90% fill, and once that 
happens, the CPU cost of putting to sleep and waking up should 
really shadow the above numbers.

So it's really a tradeoff. Where IMO also code size and 
maintainability should play a role.

Also I looked at the history of the amdgpu allocator originating 
back to Radeon 2012-ish, but couldn't find any commits mentioning 
fence callbacks nor problem with those. Could you point me to that 
discussion?

Uff that was ~10 years ago. I don't think I can find that again.

OK, fair enough. But what was the objective reasoning against using 
fence callbacks for this sort of stuff, was it unforeseen locking 
problems, caching issues or something else?

Well caching line bouncing is one major problem. Also take a look at the 
discussion about using list_head in interrupt handlers, that should be 
easy to find on LWN.

The allocator usually manages enough memory so that it never runs into 
waiting for anything, only in extreme cases like GPU resets we actually 
wait for allocations to be freed.

So the only cache lines which is accessed from more than one CPU should 
be the signaled flag of the fence.

With moving list work into the interrupt handler you have at least 3 
cache lines which start to bounce between different CPUs.

Regards,
Christian.

Thanks,

Thomas

Regards,
Christian.

Thanks,

Thomas

On 2/17/23 14:51, Thomas Hellström wrote:

On 2/17/23 14:18, Christian König wrote:
Am 17.02.23 um 14:10 schrieb Thomas Hellström:
[SNIP]

Any chance you could do a quick performance comparison? If 
not, anything against merging this without the amd / radeon 
changes until we can land a simpler allocator?

Only if you can stick the allocator inside Xe and not drm, 
cause this seems to be for a different use case than the 
allocators inside radeon/amdgpu.

Hmm. No It's allocating in a ring-like fashion as well. Let me 
put together a unit test for benchmaking. I think it would be a 
failure for the community to end up with three separate 
suballocators doing the exact same thing for the same problem, 
really.

Well exactly that's the point. Those allocators aren't the same 
because they handle different problems.

The allocator in radeon is simpler because it only had to deal 
with a limited number of fence timelines. The one in amdgpu is a 
bit more complex because of the added complexity for more fence 
timelines.

We could take the one from amdgpu and use it for radeon and 
others as well, but the allocator proposed here doesn't even 
remotely matches the requirements.

But again, what *are* those missing requirements exactly? What is 
the pathological case you see for the current code?

Well very low CPU overhead and don't do anything in a callback.

Well, dma_fence_wait_any() will IIRC register callbacks on all 
affected fences, although admittedly there is no actual allocator 
processing in them.

From what I can tell the amdgpu suballocator introduces excessive 
complexity to coalesce waits for fences from the same contexts, 
whereas the present code just frees from the fence callback if 
the fence wasn't already signaled.

And this is exactly the design we had previously which we removed 
after Dave stumbled over tons of problems with it.

So is the worry that those problems have spilled over in this code 
then? It's been pretty extensively tested, or is it you should 
never really use dma-fence callbacks?

The fence signalling code that fires that callback is typcally 
always run anyway on scheduler fences.

The reason we had for not using the amdgpu suballocator as 
originally planned was that this complexity made it very hard for 
us to undertand it and to fix issues we had with it.

Well what are those problems? The idea is actually not that 
hardware to understand.

We hit memory corruption, and we spent substantially more time 
trying to debug it than to put together this patch, while never 
really understanding what  happened, nor why you don't see that 
with amdgpu.

We could simplify it massively for the cost of only waiting for 
the oldest fence if that helps.

Let me grab the latest version from amdgpu and give it a try again, 
but yes I think that to make it common code we'll need it simpler 
(and my personal wish would be to separate the allocator 
functionality a bit more from the fence waiting, which I guess 
should be OK if the fence waiting is vastly simplified).

/Thomas

Regards,
Christian.

Regards,

Thomas