Re: [PATCH] drm/amdgpu: optimize the padding with hw optimization

Marek Olšák <maraeo@xxxxxxxxx> · Wed, 7 Aug 2024 12:42:55 -0400

On Wed, Aug 7, 2024 at 4:21 AM Tvrtko Ursulin <tursulin@xxxxxxxxxxx> wrote:
>
>
> On 04/08/2024 19:11, Marek Olšák wrote:
> > On Thu, Aug 1, 2024 at 2:55 PM Marek Olšák <maraeo@xxxxxxxxx> wrote:
> >>
> >> On Thu, Aug 1, 2024, 03:37 Christian König <christian.koenig@xxxxxxx> wrote:
> >>>
> >>> Am 01.08.24 um 08:53 schrieb Marek Olšák:
> >>>
> >>> On Thu, Aug 1, 2024, 00:28 Khatri, Sunil <sukhatri@xxxxxxx> wrote:
> >>>>
> >>>>
> >>>> On 8/1/2024 8:49 AM, Marek Olšák wrote:
> >>>>>> +       /* Header is at index 0, followed by num_nops - 1 NOP packet's */
> >>>>>> +       for (i = 1; i < num_nop; i++)
> >>>>>> +               amdgpu_ring_write(ring, ring->funcs->nop);
> >>>>> This loop should be removed. It's unnecessary CPU overhead and we
> >>>>> should never get more than 0x3fff NOPs (maybe use BUG_ON). Leaving the
> >>>>> whole packet body uninitialized is the fastest option.
> >>>> That was the original intent to just move the WPTR for the no of nops
> >>>> and tried too. Based on Christian inputs we should not let the nops packet
> >>>>
> >>>> as garbage or whatever was there originally as a threat/safety measure.
> >>>
> >>>
> >>> It doesn't help safety. It can only be read by the GPU with kernel-level permissions.
> >>>
> >>> Initializing the packet body is useless and adds CPU overhead, especially with the 256 NOPs or so that we use for no reason.
> >>>
> >>>
> >>> Not filling the remaining ring buffers with NOPs is a pretty clear NAK from my side. Leaving garbage in the ring buffer is not even remotely defensive.
> >>
> >>
> >> What are you defending against? You know the ring is kernel-owned memory, right?
> >
> > This was pushed without any justification why you need to clear
> > kernel-allocated memory with some constant number up to 30000 times
> > per second that only the kernel can read.
>
> I see that this seems to be controversial, but FWIW, if the loop ends up
> staying, at least we could replace it with memset32 as I have shown in
> https://lore.kernel.org/amd-gfx/20240715104026.6311-1-tursulin@xxxxxxxxxx/
> that the inefficient amdgpu_ring_write can show up in the profile.
>
> And also maybe consider other than gfx? Again, I did something in
> https://lore.kernel.org/amd-gfx/20240712152855.45284-4-tursulin@xxxxxxxxxx/,
> but AMD folks will know if there is a similar (like in this series)
> approach which also improves the GPU side processing and not just CPU side.

1. Yes, we should reduce CPU overhead by not using amdgpu_ring_write
to flll a ring buffer.

2. We should stop clearing NOP content.

3. We should stop padding to 256 dwords when we really just need to
pad to 8 dwords.

4. We should get rid of amdgpu_ring_write and use a more efficient way
to write into a ring buffer, which is described below.
amdgpu_ring_write generates bad CPU code because the compiler can't
determine pointer aliasing.

When starting to write into a ring buffer, count_dw, *ring, buf_mask,
and wptr should be copied to local variables and ring writes should
only use those. After everything has been written into the ring
buffer, the local variables should be copied back to amdgpu_ring, and
ptr_mask should be applied only then. That allows the compiler to use
constant offsets for the writes, and
reorder/merge/const-evaluate/optimize all operations on the local
variables, which doesn't happen with non-local variables. 3 macros
(amdgpu_ring_begin, amdgpu_ring_write, amdgpu_ring_end) can be created
to encapsulate this logic. Example:

amdgpu_ring_begin(ring);
amdgpu_ring_write(value0);
amdgpu_ring_write(value1);
amdgpu_ring_write(value2);
amdgpu_ring_write(value3);
amdgpu_ring_end();

Marek