Re: [PATCH 0/5] radeon: Write-combined CPU mappings of BOs in GTT

Michel Dänzer <michel@xxxxxxxxxxx> · Wed, 23 Jul 2014 16:21:15 +0900

On 23.07.2014 15:42, Christian König wrote:
> Am 23.07.2014 05:54, schrieb Michel Dänzer:
>> On 21.07.2014 17:07, Christian König wrote:
>>> Am 19.07.2014 03:15, schrieb Michel Dänzer:
>>>> On 19.07.2014 00:47, Christian König wrote:
>>>>> Am 18.07.2014 05:07, schrieb Michel Dänzer:
>>>>>>>> [PATCH 5/5] drm/radeon: Use VRAM for indirect buffers on >= SI
>>>>>>> I'm still not very keen on this change since I still don't understand
>>>>>>> the reason why it's faster than with GTT. Definitely needs more testing
>>>>>>> on a wider range of systems.
>>>>>> Sure. If anyone wants to give this patch a spin and see if they can
>>>>>> measure any performance difference, good or bad, that would be
>>>>>> interesting.
>>>>>>
>>>>>>> Maybe limit it to APUs for now?
>>>>>> But IIRC, CPU writes to VRAM vs. write-combined GTT are actually an even
>>>>>> bigger win with dedicated GPUs than with the Kaveri built-in GPU on my
>>>>>> system. I suspect it may depend on the bandwidth available for PCIe vs.
>>>>>> system memory though.
>>>>> I've made a few tests today with the kernel part of the patches running
>>>>> Xonotic on Ultra in 1920 x 1080.
>>>>>
>>>>> Without any patches I get ~47.0fps on average with my dedicated
>>>>> HD7870.
>>>>>
>>>>> Adding only "drm/radeon: Use write-combined CPU mappings of rings and
>>>>> IBs on >= SI" and that goes down to ~45.3fps.
>>>>>
>>>>> Adding on top of that "drm/radeon: Use VRAM for indirect buffers on >=
>>>>> SI" and the frame rate goes down to ~27.74fps.
>>>> Hmm, looks like I'll need to do more benchmarking of 3D workloads as
>>>> well.
>> I haven't been able to consistently[0] measure any significant
>> difference between all placements of the rings and IBs with Xonotic or
>> Reaction Quake with my Bonaire. I'd expect Xonotic to be shader / GPU
>> memory bandwidth bound rather than CS bound anyway, so a ~40% hit from
>> that kernel patch alone is very surprising. Are you sure it wasn't just
>> the same kind of variation as described below?
> 
> Yes, I've measured that multiple times and the results were quite
> consistent.
> 
> But I didn't measure it on a Bonaire, where the bottleneck probably
> isn't the CPU load. I measured it on a fast Pitcairn

Ahem, my Bonaire is cranking out ~90fps of Xonotic Ultra at 1920x1080.
:) (And AFAIK there are even faster Bonaire variants)

> and there Xonotic was clearly affected by the patches.

Okay, I hadn't realized we're not doing any command stream checking as
of CIK; that probably explains the difference.

>>> My tests clearly show that we still can use USWC for the ring buffer on
>>> SI and probably earlier chips as well.
>> Yeah, that might be the safest approach for now.
> How about using USWC for the rings on all chips since R600

Any particular reason against doing it for older chips which support
unsnooped access as well?

> and for the IB only on CIK? As far as I can see that should do the trick
> quite well.

Yeah, sounds good.
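
As a rough sketch of the policy we seem to be converging on (USWC for the
rings since R600, USWC for IBs only on CIK), something like the following.
The enum and function names are made up for illustration and are not the
driver's actual identifiers; whether pre-R600 chips should get USWC rings
as well is still open, so the sketch conservatively leaves them out.

```c
#include <stdbool.h>

/* Hypothetical, simplified generation ordering for illustration only;
 * this is not the radeon driver's chip family enum. */
enum gpu_gen {
    GEN_PRE_R600,
    GEN_R600,
    GEN_SI,
    GEN_CIK,
};

/* Rings: use a write-combined (USWC) CPU mapping on R600 and newer. */
static bool ring_wants_uswc(enum gpu_gen gen)
{
    return gen >= GEN_R600;
}

/* Indirect buffers: use USWC only on CIK and newer. Assumption: on older
 * chips the kernel still checks command streams, and those CPU accesses
 * to the IB are presumably what made the write-combined mapping slower. */
static bool ib_wants_uswc(enum gpu_gen gen)
{
    return gen >= GEN_CIK;
}
```

With this split, an SI part such as Pitcairn gets USWC rings but keeps
cached IBs, while a CIK part such as Bonaire gets USWC for both.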

-- 
Earthling Michel Dänzer            |                  http://www.amd.com
Libre software enthusiast          |                Mesa and X developer