Re: Looking for clarifications around gfx/kcq/kiq

Alex Deucher <alexdeucher@xxxxxxxxx> · Mon, 6 Dec 2021 15:01:45 -0500

On Mon, Dec 6, 2021 at 5:29 AM Yann Dirson <ydirson@xxxxxxx> wrote:
>
> Hello,
>
> Context: trying to understand what happens with my Renoir passed through
> to a Xen domu [0] (starting with the "VCN disabled" because I don't need it
> now (so let's postpone the problem with its _fini) and with "PSP disabled"
> because the alternative issue seems easier to solve -- so ip_block_mask=0xF7).
>
> I'm slowed down by a number of additional terms:
>
> * KIQ: we have the acronym, but a few more words about it would be great:
>   it seems to relate to a ring buffer provided by the GFX IP, but this one
>   does not talk much to me (e.g. it tells me less than the names of the
>   "gfx" and "compute" ones)

Kernel Interface Queue.  This is a control queue used by the kernel
driver to manage other gfx and compute queues on the GFX/compute
engine.  You can use it to map/unmap additional queues, etc.

> * "me", "mec" = ?  In some places at least "me" stands for "micro engine" but
>   what are those ?  A "mec" contains pipes which contain queues.  And in
>   amdgpu_ring the "me" field seems to identify a "mec"

MicroEngine Compute.  This is the microcontroller that controls the
compute queues on the GFX/compute engine.

> * "mes", rather looks like an IP/block family than the plural of "me".
>   A specific list of those IPs / hw blocks would be useful (maybe with
>   a diagram showing how they interact, much as what was started by
>   Rodrigo for the DC pipeline, but a first components/subcomponents diagram
>   would probably be helpful)

MicroEngine Scheduler.  This is a new engine for managing queues.
It is currently unused.

> * RLC ?  Looks like a "micro engine" inside the GFX IPs ?

RunList Controller.  This is another microcontroller in the
GFX/Compute engine.  It handles power management related functionality
within the GFX/Compute engine.  The name is a vestige of old hardware
where it was originally added and doesn't really have much relation to
what the engine does now.

> * one starting point for enhancing doc would be to start with amdgpu.h, where
>   a number of acronyms used in structs are not self-explanatory: IB, SS, CP,
>   ACP, CAC, HPD, ...

IB = Indirect Buffer.  A command buffer for a particular engine.
Rather than writing commands directly to the queue, you can write the
commands into a piece of memory and then put a pointer to the memory
into the queue.  The hardware will then follow the pointer, execute
the commands in the memory, and then return to the rest of the commands
in the ring.
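
To make the IB idea concrete, here is a minimal toy sketch in plain C.
It is not driver code: the opcodes, struct names and the toy_* functions
are all made up for illustration, and a real engine parses packets from
GPU memory rather than C structs.  It only shows the control flow: the
ring holds a pointer to a separate buffer, the buffer is executed, and
execution then resumes with the next ring entry.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy opcodes; real engines define their own packet formats. */
enum toy_op {
	TOY_OP_EMIT = 1,	/* record 'value' in the log */
	TOY_OP_IB   = 2,	/* jump into an indirect buffer, then come back */
};

struct toy_cmd {
	enum toy_op op;
	uint32_t value;			/* operand for TOY_OP_EMIT */
	const struct toy_cmd *ib;	/* buffer pointer for TOY_OP_IB */
	size_t ib_len;			/* number of commands in that buffer */
};

struct toy_log {
	uint32_t vals[8];
	size_t n;
};

/* Walk a command list; on TOY_OP_IB, run the pointed-to buffer and return. */
void toy_exec(const struct toy_cmd *cmds, size_t len,
	      struct toy_log *log, int in_ib)
{
	for (size_t i = 0; i < len; i++) {
		if (cmds[i].op == TOY_OP_EMIT) {
			assert(log->n < 8);
			log->vals[log->n++] = cmds[i].value;
		} else if (cmds[i].op == TOY_OP_IB) {
			/* This toy model allows only one level of indirection. */
			assert(!in_ib);
			toy_exec(cmds[i].ib, cmds[i].ib_len, log, 1);
		}
	}
}

/* Ring with one command, one IB reference, and one trailing command. */
struct toy_log toy_demo(void)
{
	static const struct toy_cmd ib[] = {
		{ .op = TOY_OP_EMIT, .value = 10 },
		{ .op = TOY_OP_EMIT, .value = 20 },
	};
	const struct toy_cmd ring[] = {
		{ .op = TOY_OP_EMIT, .value = 1 },
		{ .op = TOY_OP_IB, .ib = ib, .ib_len = 2 },
		{ .op = TOY_OP_EMIT, .value = 2 },
	};
	struct toy_log log = { .n = 0 };

	toy_exec(ring, 3, &log, 0);
	return log;
}
```

The log from toy_demo() ends up in the order 1, 10, 20, 2: the IB
contents run in the middle, and the last ring command runs after the
IB has finished.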

SS = Spread Spectrum.

CP = Command Processor.  The name for the hardware block that
encompasses the front end of the GFX/Compute pipeline.  Consists
mainly of a bunch of microcontrollers (PFP, ME, CE, MEC).  The
firmware that runs on these microcontrollers provides the driver
interface to interact with the GFX/Compute engine.

>
> Do we have somewhere a description of what the hardware expects to find in
> those queues ?

It depends on the engine.  Each engine has its own packet format.
GFX/Compute uses one format, SDMA uses another, VCN uses another.
They are documented in the code and headers for the relevant engines.
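
As a rough illustration for the GFX/Compute case, the sketch below packs
and unpacks a type-3 packet header.  The bit layout is modeled on the
PACKET3() macro in the driver headers (e.g. soc15d.h) as I understand
it: packet type in bits 31:30, a count field in bits 29:16, and the
opcode in bits 15:8.  Treat that layout as an assumption and check it
against the headers for your ASIC.  The toy_* names are hypothetical and
are not driver functions.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PACKET_TYPE3	3u	/* assumed value of the type field */
#define TOY_OP_NOP		0x10u	/* assumed to match PACKET3_NOP */

/* Build a type-3 header from an opcode and a count field. */
uint32_t toy_pm4_type3_header(uint32_t opcode, uint32_t count)
{
	return (TOY_PACKET_TYPE3 << 30) |
	       ((count & 0x3FFFu) << 16) |
	       ((opcode & 0xFFu) << 8);
}

/* Field extractors, mirroring the packing above. */
uint32_t toy_pm4_type(uint32_t header)   { return header >> 30; }
uint32_t toy_pm4_count(uint32_t header)  { return (header >> 16) & 0x3FFFu; }
uint32_t toy_pm4_opcode(uint32_t header) { return (header >> 8) & 0xFFu; }

/* Round-trip check: what was packed must be what is decoded. */
void toy_pm4_selfcheck(void)
{
	uint32_t h = toy_pm4_type3_header(TOY_OP_NOP, 2);

	assert(toy_pm4_type(h) == TOY_PACKET_TYPE3);
	assert(toy_pm4_count(h) == 2);
	assert(toy_pm4_opcode(h) == TOY_OP_NOP);
}
```

Under those assumptions, toy_pm4_type3_header(TOY_OP_NOP, 0) yields
0xC0001000.  SDMA and VCN use different encodings, so this applies
to the GFX/Compute rings only.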

>
> About amdgpu_gfx_enable_kcq():
> - Isn't the `DRM_INFO("kiq ring mec %d pipe %d q %d\n"` line rather meant as
>   DRM_DEBUG ?

It's informational so we can see what queue slot is being used for
KIQ.  There are requirements around the physical queue slot for KIQ so
it is useful to know it.  That said, it could probably be made debug
only.

> - An error from amdgpu_ring_alloc() is reported as "failed to lock", but looks
>   like "failed to allocate space on ring" ?
>
> amdgpu_ring_alloc() itself is unconditionally setting count_dw, which looked
> suspicious to me -- so I added the check shown below, and it does look like
> ring_alloc() gets called again too soon.  Am I right in thinking this could be
> the cause of amdgpu_ring_test_helper() failing in timeout ?
>

Not likely.  The PSP failing to load firmware is most likely the
problem.  You need to have a functional PSP for any of the other
engines to be usable.  If we can't load the firmware for the
microcontrollers, the driver can't interact with them.

> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
> @@ -70,6 +70,9 @@ int amdgpu_ring_alloc(struct amdgpu_ring *ring, unsigned ndw)
>         if (WARN_ON_ONCE(ndw > ring->max_dw))
>                 return -ENOMEM;
>
> +       /* check we're not allocating too fast */
> +       WARN_ON_ONCE(ring->count_dw);
> +
>         ring->count_dw = ndw;
>         ring->wptr_old = ring->wptr;
>
>
> About gfx_v9_0_sw_fini():
> - the 2 calls to bo_free are called here without condition, whereas they are
>   allocated from rlc_init, not directly from sw_init.  Is this asymmetry wanted ?
>
>
> Maybe such info should join the documentation at some point?

Yeah, would be useful.

Alex

>
> [0] https://lists.freedesktop.org/archives/amd-gfx/2021-November/071855.html