Support for amdgpu VM update via CPU on large-bar systems

deathsimple@xxxxxxxxxxx (Christian König) · Sat, 13 May 2017 11:08:57 +0200

On 12.05.2017 at 21:25, Felix Kuehling wrote:
> On 17-05-12 04:43 AM, Christian König wrote:
>> On 12.05.2017 at 10:37, zhoucm1 wrote:
>>>
>>>
>>> If the SDMA is faster, then even when they wait for it to finish,
>>> the total time is still shorter than with the CPU, isn't it? Of
>>> course, the precondition is that the SDMA is exclusive. They can
>>> reserve an SDMA engine for PT updates.
>>>
>> No, if I understood Felix's numbers correctly, the setup and wait time
>> for SDMA is a bit (but not much) longer than doing it with the CPU.
> I'm skeptical of claims that SDMA is faster. Even when you use SDMA to
> write the page table, the CPU still has to do about the same amount of
> work writing PTEs into the SDMA IBs. SDMA can only save CPU time in
> certain cases:
>
>    * Copying PTEs from GART table if they are on the same GPU (not
>      possible on Vega10 due to different MTYPE bits)
>    * Generating PTEs for contiguous VRAM BOs
>
> At least for system memory BOs writing the PTEs directly to
> write-combining VRAM should be faster than writing them to cached system
> memory IBs first and then kicking off an SDMA transfer and waiting for
> completion.

That's unfortunately not correct at all.

Nicolai did quite a few measurements on this, and on most systems, even
with WC enabled, the SDMA is more efficient than the CPU at transferring
even small amounts of memory over the bus.

And no, we couldn't figure out why; it indeed doesn't make much sense
when WC is enabled.

I think the SDMA is simply optimized for those kinds of transfers, so it
comes out ahead even considering the overhead of allocating an IB.

So anything larger than, I would say, 1KB is handled faster when you
write it to system memory and then copy it to VRAM with the SDMA.
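
To make that rule of thumb concrete, here is a minimal sketch of how such
a size-based choice could look. This is not existing amdgpu code: the
names pick_pt_update_path, pt_update_path and
PT_UPDATE_SDMA_THRESHOLD_BYTES are made up for illustration, the 1KB
cutoff is only my rough estimate from above, and I'm assuming 8 bytes
per PTE.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical crossover size; a rough estimate, not a measured constant. */
#define PT_UPDATE_SDMA_THRESHOLD_BYTES 1024u

/* Hypothetical names for the two ways of writing page table entries. */
enum pt_update_path {
	PT_UPDATE_CPU,	/* write PTEs directly into CPU-visible (WC) VRAM */
	PT_UPDATE_SDMA,	/* write PTEs into an IB in system memory, let SDMA copy */
};

/*
 * Pick an update path from the size of the update.
 * Assumes 64-bit (8 byte) PTEs. If the page table is not CPU visible
 * (no large BAR), the CPU path is not an option at all.
 */
static enum pt_update_path pick_pt_update_path(size_t num_ptes,
					       bool vram_cpu_visible)
{
	size_t bytes = num_ptes * sizeof(uint64_t);

	if (!vram_cpu_visible)
		return PT_UPDATE_SDMA;

	/* "Anything larger than 1KB" goes through the SDMA. */
	return bytes > PT_UPDATE_SDMA_THRESHOLD_BYTES ? PT_UPDATE_SDMA
						      : PT_UPDATE_CPU;
}
```

With these assumptions an update of up to 128 PTEs would stay on the CPU
and anything larger would go through the SDMA. The real crossover point
would have to be measured per system.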

Regards,
Christian.