Looking for clarifications around gfx/kcq/kiq

Yann Dirson <ydirson@xxxxxxx> · Sun, 5 Dec 2021 21:18:27 +0100 (CET)

Hello,

Context: I am trying to understand what happens with my Renoir passed through
to a Xen domU [0].  I am starting with VCN disabled, because I don't need it
for now (so the problem with its _fini can wait), and with PSP disabled,
because the alternative issue seems easier to solve; hence ip_block_mask=0xF7.
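
To spell out that mask: as far as I understand, ip_block_mask is a bitmask in
which bit i enables the i-th IP block in the device's block list, so 0xF7
clears bit 3 and, read as a 32-bit value, every bit from 8 upwards.  A
standalone sketch of that reading follows; the helper names are mine and do
not exist in the driver, and which block sits at which index depends on the
ASIC and is not modelled here.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: true when bit `index` of the mask is set, i.e. when
 * the IP block at that position in the block list is left enabled.
 */
static bool ip_block_enabled(uint32_t mask, unsigned int index)
{
	return index < 32 && ((mask >> index) & 1u) != 0;
}

/* Hypothetical helper: number of enabled blocks among the first num_blocks. */
static unsigned int count_enabled_blocks(uint32_t mask, unsigned int num_blocks)
{
	unsigned int i, n = 0;

	for (i = 0; i < num_blocks; i++)
		if (ip_block_enabled(mask, i))
			n++;
	return n;
}
```

With mask 0xF7, ip_block_enabled() is false for index 3 and for any index of 8
or more, and true for the other indices below 8.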

I'm slowed down by a number of terms that I could not find explained:

* KIQ: we have the acronym, but a few more words about it would be great:
  it seems to relate to a ring buffer provided by the GFX IP, but the name
  does not tell me much (e.g. it tells me less than the names of the
  "gfx" and "compute" ones)
* "me", "mec" = ?  In some places at least, "me" stands for "micro engine",
  but what are those?  A "mec" contains pipes, which contain queues.  And in
  amdgpu_ring the "me" field seems to identify a "mec"
* "mes" looks like an IP/block family rather than the plural of "me".
  A specific list of those IPs / hw blocks would be useful (maybe with
  a diagram showing how they interact, much like what Rodrigo started
  for the DC pipeline, though even a first components/subcomponents
  diagram would probably be helpful)
* RLC?  It looks like a "micro engine" inside the GFX IPs?
* one starting point for improving the doc would be amdgpu.h, where
  a number of acronyms used in structs are not self-explanatory: IB, SS, CP,
  ACP, CAC, HPD, ...
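
To make the mec/pipe/queue question concrete, this is the picture I currently
have in mind, written as a standalone sketch: a queue is named by a
(mec, pipe, queue) triple, the same three numbers the "kiq ring mec %d pipe %d
q %d" message prints, and the triple can be flattened into a single index.
All struct and function names below are hypothetical, as are the counts used
to size the hierarchy; corrections are welcome if the nesting is wrong.

```c
/* Hypothetical description of the hierarchy: MECs hold pipes, pipes hold queues. */
struct mec_topology {
	unsigned int num_mec;
	unsigned int num_pipe_per_mec;
	unsigned int num_queue_per_pipe;
};

/* Hypothetical identifier of one hardware queue. */
struct queue_id {
	unsigned int mec;
	unsigned int pipe;
	unsigned int queue;
};

/* Flatten a (mec, pipe, queue) triple; returns -1 if it is out of range. */
static int queue_to_index(const struct mec_topology *t, struct queue_id id)
{
	if (id.mec >= t->num_mec || id.pipe >= t->num_pipe_per_mec ||
	    id.queue >= t->num_queue_per_pipe)
		return -1;

	return (int)((id.mec * t->num_pipe_per_mec + id.pipe) *
		     t->num_queue_per_pipe + id.queue);
}

/* Inverse mapping; returns -1 if the index does not name a queue. */
static int index_to_queue(const struct mec_topology *t, unsigned int index,
			  struct queue_id *out)
{
	unsigned int per_mec = t->num_pipe_per_mec * t->num_queue_per_pipe;

	if (per_mec == 0 || index >= t->num_mec * per_mec)
		return -1;

	out->mec = index / per_mec;
	out->pipe = (index % per_mec) / t->num_queue_per_pipe;
	out->queue = index % t->num_queue_per_pipe;
	return 0;
}
```

For example, with 2 MECs of 4 pipes with 8 queues each (made-up numbers), the
triple (1, 2, 3) maps to index 51 and back.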

Is there a description somewhere of what the hardware expects to find in
those queues?

About amdgpu_gfx_enable_kcq():
- Isn't the `DRM_INFO("kiq ring mec %d pipe %d q %d\n"` line rather meant to
  be a DRM_DEBUG?
- An error from amdgpu_ring_alloc() is reported as "failed to lock", but it
  looks more like "failed to allocate space on ring"?

amdgpu_ring_alloc() itself sets count_dw unconditionally, which looked
suspicious to me, so I added the check shown below, and it does look like
ring_alloc() gets called again too soon.  Am I right in thinking this could be
the cause of amdgpu_ring_test_helper() failing with a timeout?

--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
@@ -70,6 +70,9 @@ int amdgpu_ring_alloc(struct amdgpu_ring *ring, unsigned ndw)
        if (WARN_ON_ONCE(ndw > ring->max_dw))
                return -ENOMEM;
 
+       /* check we're not allocating too fast */
+       WARN_ON_ONCE(ring->count_dw);
+
        ring->count_dw = ndw;
        ring->wptr_old = ring->wptr;
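
To make the intent of that check concrete, here is a toy userspace model of
the accounting as I understand it.  It is not driver code: every toy_ name is
made up, and it assumes that each written dword decrements count_dw and that
nothing else resets it.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_RING_DW 64u	/* ring size in dwords, a power of two */

struct toy_ring {
	uint32_t buf[TOY_RING_DW];
	unsigned int wptr;	/* next dword to write */
	unsigned int max_dw;	/* largest allocation allowed */
	unsigned int count_dw;	/* dwords reserved but not yet written */
	bool realloc_seen;	/* stands in for the WARN_ON_ONCE() above */
};

static void toy_ring_init(struct toy_ring *ring, unsigned int max_dw)
{
	*ring = (struct toy_ring){ .max_dw = max_dw };
}

/* Reserve ndw dwords; record whether a previous reservation was still open. */
static int toy_ring_alloc(struct toy_ring *ring, unsigned int ndw)
{
	if (ndw > ring->max_dw)
		return -1;

	if (ring->count_dw)
		ring->realloc_seen = true;

	ring->count_dw = ndw;
	return 0;
}

/* Write one dword, consuming one reserved slot. */
static int toy_ring_write(struct toy_ring *ring, uint32_t v)
{
	if (ring->count_dw == 0)
		return -1;

	ring->buf[ring->wptr] = v;
	ring->wptr = (ring->wptr + 1) & (TOY_RING_DW - 1);
	ring->count_dw--;
	return 0;
}
```

In this model the flag is set whenever a second toy_ring_alloc() happens
before all reserved dwords are written.  That also covers a caller that simply
reserves more dwords than it ends up writing, so if the driver has such
callers, the warning alone may not prove a premature allocation.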


About gfx_v9_0_sw_fini():
- the 2 calls to bo_free are made here unconditionally, whereas the buffers
  are allocated from rlc_init, not directly from sw_init.  Is this asymmetry
  intended?


Maybe this kind of information should go into the documentation at some point?

[0] https://lists.freedesktop.org/archives/amd-gfx/2021-November/071855.html