Re: [PATCH] drm/amdgpu: add initial documentation for debugfs files

Alex Deucher <alexdeucher@xxxxxxxxx> · Tue, 4 Mar 2025 14:25:13 -0500

On Tue, Mar 4, 2025 at 12:37 PM Rodrigo Siqueira <siqueira@xxxxxxxxxx> wrote:
>
> Hi Alex,
>
> I added a few suggestions and questions.
>
> On 03/03, Alex Deucher wrote:
> > Describes what debugfs files are available and what
> > they are used for.
> >
> > Signed-off-by: Alex Deucher <alexander.deucher@xxxxxxx>
> > ---
> >  Documentation/gpu/amdgpu/debugfs.rst | 201 +++++++++++++++++++++++++++
> >  Documentation/gpu/amdgpu/index.rst   |   1 +
> >  2 files changed, 202 insertions(+)
> >  create mode 100644 Documentation/gpu/amdgpu/debugfs.rst
> >
> > diff --git a/Documentation/gpu/amdgpu/debugfs.rst b/Documentation/gpu/amdgpu/debugfs.rst
> > new file mode 100644
> > index 0000000000000..9d82c770c1e78
> > --- /dev/null
> > +++ b/Documentation/gpu/amdgpu/debugfs.rst
> > @@ -0,0 +1,201 @@
> > +==============
> > +AMDGPU DebugFS
> > +==============
> > +
> > +The amdgpu driver provides a number of debugfs files to aid in debugging
> > +issues in the driver.
> > +
> > +DebugFS Files
> > +=============
> > +
> > +amdgpu_benchmark
> > +----------------
> > +
> > +Run benchmarks using the DMA engine the driver uses for GPU memory paging.
> > +Write a number to the file to run the test.  The results are written to the
> > +kernel log.  The following tests are available:
> > +
> > +- 1: simple test, VRAM to GTT and GTT to VRAM
>
> I know GTT is part of the glossary, but to improve the readability of this
> part of the doc, I suggest spelling out the acronym the first time you
> mention GTT. You already use this approach in the rest of this patch.

Sure.

>
> > +- 2: simple test, VRAM to VRAM
> > +- 3: GTT to VRAM, buffer size sweep, powers of 2
> > +- 4: VRAM to GTT, buffer size sweep, powers of 2
> > +- 5: VRAM to VRAM, buffer size sweep, powers of 2
> > +- 6: GTT to VRAM, buffer size sweep, common modes
>
> What do you mean by "common modes"? Maybe consider adding a brief
> explanation or point to the documentation that explains it.

Sure.

>
> > +- 7: VRAM to GTT, buffer size sweep, common modes
> > +- 8: VRAM to VRAM, buffer size sweep, common modes
> > +
> > +amdgpu_test_ib
> > +--------------
> > +
> > +Read this file to run simple IB (Indirect Buffer) tests on all kernel managed
> > +rings.  IBs are command buffers usually generated by userspace applications
> > +which are submitted to the kernel for execution on a particular GPU engine.
> > +This just runs the simple IB tests included in the kernel.
>
> How about adding the path to the simple IB test that you mentioned?

It's different for each engine type, but the basic idea is that it
provides the minimum viable IB that can exercise the functionality.

>
> > +
> > +amdgpu_discovery
> > +----------------
> > +
> > +Provides raw access to the IP discovery binary provided by the GPU.  Read this
> > +file to acess the raw binary.
>
> /acess/access/

Will fix.

>
> Just out of curiosity, what is the use for this debugfs file? Why might
> users want to use it? Can you get this binary from one device and load
> it into another device for testing?

It's the binary that the driver parses to determine which IPs are
present on the GPU.  It's mainly for debugging the actual binary.  You
shouldn't try to use this on any GPU other than the one that
generated it.

>
> > +
> > +amdgpu_vbios
> > +------------
> > +
> > +Provides raw access to the ROM binary image from the GPU.  Read this file to
> > +access the raw binary.
> > +
>
> I repeat my previous question:
>
> Can you get this binary from one device and load it into another device
> for testing?

Similar to the discovery binary.  Mainly for debugging the contents of
the binary.  E.g., the data tables and command tables included in it.
You wouldn't want to use this for any GPU other than the one it came
from.
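
If you want to save either raw binary for offline inspection, a rough
sketch like the one below would do it.  It just copies the file and
returns its size and a SHA-256 digest so you can tell dumps apart.
`dump_blob` is a made-up helper name, and the debugfs path in the
comment is only an example; adjust the DRI index for your system.

```python
import hashlib
from pathlib import Path
from typing import Tuple


def dump_blob(src: str, dst: str) -> Tuple[int, str]:
    """Copy a raw debugfs binary to dst; return (size in bytes, SHA-256 hex)."""
    # Reading debugfs normally requires root.
    data = Path(src).read_bytes()
    Path(dst).write_bytes(data)
    return len(data), hashlib.sha256(data).hexdigest()


# Example call (hypothetical path, DRI index 0 is an assumption):
# dump_blob("/sys/kernel/debug/dri/0/amdgpu_vbios", "vbios.bin")
```

The dump is only meaningful for the GPU it came from.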

>
> > +amdgpu_evict_gtt
> > +----------------
> > +
> > +Evict all buffers from the GTT memory pool.  Read this file to evict all
> > +buffers from this pool.
> > +
> > +amdgpu_evict_vram
> > +-----------------
> > +
> > +Evict all buffers from the VRAM memory pool.  Read this file to evict all
> > +buffers from this pool.
> > +
> > +amdgpu_gpu_recover
> > +------------------
> > +
> > +Read this file to trigger a full GPU reset.  All work currently running
> > +on the GPU will be lost.
>
> iirc, AMD has 3 reset modes. By full GPU reset, do you mean the Mode
> that resets the entire device (mode 0?)?

I meant whole GPU reset rather than per queue reset.  How the driver
accomplishes this depends on the individual chip (some will use mode1,
some will use mode2, etc.).

>
> > +
> > +amdgpu_ring_<name>
> > +------------------
> > +
> > +Provides read access to the kernel managed ring buffers for each ring <name>.
> > +These are useful for debugging problems on a particular ring.  The ring buffer
> > +is how the CPU sends commands to the GPU.  The CPU writes commands into the
> > +buffer and then asks the GPU engine to process it.
>
> When I checked this debugfs, it printed non-human-readable output
> (maybe I did something wrong?). How can users use this output for
> debugging? Is there a way to parse the output?

This is the raw content of the ring buffer itself.  You can use UMR to
parse the contents and print them in human-readable form.
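
If you only want to eyeball the raw data, a rough sketch like the one
below splits a dump into 32-bit words and prints them as hex.  It
assumes the file contents are little-endian dwords; `to_dwords` and
`format_dwords` are made-up helper names.  It does not decode packets
the way UMR does.

```python
import struct
from typing import List


def to_dwords(raw: bytes) -> List[int]:
    """Split a raw dump into little-endian 32-bit words, dropping trailing bytes."""
    usable = len(raw) - (len(raw) % 4)
    return [word for (word,) in struct.iter_unpack("<I", raw[:usable])]


def format_dwords(words: List[int], per_line: int = 4) -> str:
    """Render words as hex, per_line words per row, prefixed by the dword index."""
    rows = []
    for i in range(0, len(words), per_line):
        chunk = " ".join(f"{w:08x}" for w in words[i:i + per_line])
        rows.append(f"{i:06d}: {chunk}")
    return "\n".join(rows)


# Example call (hypothetical path; the ring name depends on the GPU):
# raw = open("/sys/kernel/debug/dri/0/amdgpu_ring_gfx", "rb").read()
# print(format_dwords(to_dwords(raw)))
```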

>
> > +
> > +amdgpu_mqd_<name>
> > +-----------------
> > +
>
> Same as my previous question.

This is also the raw content.  You'll need a separate tool to parse
this.  I don't remember offhand whether UMR can.

>
> > +Provides read access to the kernel managed MQD (Memory Queue Descriptor) for
> > +ring <name> managed by the kernel driver.  MQDs define the features of the ring
> > +and are used to store the ring's state when it is not connected to hardware.
> > +The driver writes the requested ring features and metadata (GPU addresses of
> > +the ring itself and associated buffers) to the MQD and the firmware uses the MQD
> > +to populate the hardware when the ring is mapped to a hardware slot.  Only
> > +available on engines which use MQDs.
> > +
> > +amdgpu_error_<name>
> > +-------------------
> > +
> > +Provides an interface to set an error on fences associated with ring <name>.
> > +The error code specified is propagated to all fences associated with the
> > +ring.
>
> I don't know how this error works. Is it something like this:
>
> echo 23 > /sys/kernel/debug/dri/1/amdgpu_error_gfx # 23 is a random number
>
> And if there is a fence error in the gfx ring, should I see the error
> code 23 in the dmesg?

It would need to be a valid error number for a dma fence.  E.g., like
-ETIME.  The status of the fences determines the status of jobs on the
ring.
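
As a rough sketch of what that looks like from userspace, the helper
below turns an errno name into its negative value and writes it to the
file.  It assumes the file accepts a signed decimal integer and that
Python's errno values match the kernel's (true on Linux); the function
names are made up for illustration.

```python
import errno


def fence_error_string(name: str) -> str:
    """Return the negative errno for `name` (e.g. "ETIME") as a decimal string."""
    code = getattr(errno, name)  # raises AttributeError for unknown names
    return str(-code)


def set_ring_error(path: str, name: str) -> None:
    """Write the negative errno to an amdgpu_error_<name> file (needs root)."""
    with open(path, "w") as f:
        f.write(fence_error_string(name) + "\n")


# Example call, using the path from the question above:
# set_ring_error("/sys/kernel/debug/dri/1/amdgpu_error_gfx", "ETIME")
```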

>
> > +
> > +amdgpu_pm_info
> > +--------------
> > +
> > +Provides human readable information about the power management features
> > +and state of the GPU.  This includes current GFX clock, Memory clock,
> > +voltages, average SoC power, temperature, GFX load, Memory load, SMU
> > +feature mask, VCN power state, clock and power gating features.
> > +
> > +amdgpu_firmware_info
> > +--------------------
> > +
> > +Lists the firmware versions for all firmwares used by the GPU.  Only
> > +entries with a non-0 version are valid.  If the version is 0, the firmware
> > +is not valid for the GPU.
> > +
> > +amdgpu_fence_info
> > +-----------------
> > +
> > +Shows the last signalled and emitted fence sequence numbers for each
> > +kernel driver managed ring.  Fences are associated with submissions
> > +to the engine.  Emitted fences have been submitted to the ring
> > +and signalled fences have been signalled by the GPU.  Rings with a
> > +larger emitted fence value have outstanding work that is still being
> > +processed by the engine that owns that ring.  When the emitted and
> > +signalled fence values are equal, the ring is idle.
> > +
> > +amdgpu_gem_info
> > +---------------
> > +
> > +Lists all of the PIDs using the GPU and the GPU buffers that they have
> > +allocated.  This lists the buffer size, pool (VRAM, GTT, etc.), and buffer
> > +attributes (CPU access required, CPU cache attributes, etc.).
> > +
> > +amdgpu_vm_info
> > +--------------
> > +
> > +Lists all of the PIDs using the GPU and the GPU buffers that they have
> > +allocated as well as the status of those buffers relative to that process'
> > +GPU virtual address space (e.g., evicted, idle, invalidated, etc.).
> > +
> > +amdgpu_sa_info
> > +--------------
>
> Is sa == SubAllocation?

Yes.  Will update.

>
> > +
> > +Prints out all of the suballocations by the suballocation manager in the
> > +kernel driver.  Prints the GPU address, size, and fence info associated
> > +with each suballocation.  The suballocations are used internally within
> > +the kernel driver for various things.
> > +
> > +amdgpu_<pool>_mm
> > +----------------
> > +
> > +Prints TTM information about the memory pool <pool>.
> > +
> > +amdgpu_vram
> > +-----------
> > +
> > +Provides direct access to VRAM.  Used by tools like UMR to inspect
> > +objects in VRAM.
> > +
> > +amdgpu_iomem
> > +------------
> > +
> > +Provides direct access to GTT memory.  Used by tools like UMR to inspect
> > +GTT memory.
> > +
> > +amdgpu_regs_*
> > +-------------
> > +
> > +Provides direct access to various register apertures on the GPU.  Used
> > +by tools like UMR to access GPU registers.
> > +
> > +amdgpu_regs2
> > +------------
> > +
> > +Provides an IOCTL interface used by UMR for interacting with GPU registers.
> > +
> > +
> > +amdgpu_sensors
> > +--------------
> > +
> > +Provides an interface to query GPU power metrics (temperature, average
> > +power, etc.).  Used by tools like UMR to query GPU power metrics.
> > +
> > +
> > +amdgpu_gca_config
> > +-----------------
>
> What is GCA? Could you add this to the amdgpu glossary?

yes.  Graphics and Compute Array (it's another name for the GFX/GC IP).

>
> > +
> > +Provides an interface to query GPU details (GFX config, PCI config,
> > +GPU family, etc.).  Used by tools like UMR to query GPU details.
> > +
> > +amdgpu_wave
> > +-----------
> > +
> > +Used to query GFX/compute wave information from the hardware.  Used by tools
> > +like UMR to query GFX/compute wave information.
> > +
> > +amdgpu_gpr
> > +----------
> > +
> > +Used to query GFX/compute GPR (General Purpose Register) information from the
>
> It looks like GPR is not part of the amdgpu glossary.

Wasn't sure if it was worth putting this in the glossary since GPR is
a pretty common term in computer processors.

>
> > +hardware.  Used by tools like UMR to query GPRs when debugging shaders.
> > +
> > +amdgpu_gprwave
> > +--------------
> > +
> > +Provides an IOCTL interface used by UMR for interacting with shader waves.
> > +
> > +amdgpu_fw_attestation
> > +---------------------
> > +
> > +Provides an interface for reading back firmware attestation records.
>
> What is this attestation record?

It's the attestation results for the firmwares used by the GPU.

>
> Is this available for all GPUs and APUs?

It's available on certain dGPUs.

>
> > diff --git a/Documentation/gpu/amdgpu/index.rst b/Documentation/gpu/amdgpu/index.rst
> > index 302d039928ee8..5254f3a162f84 100644
> > --- a/Documentation/gpu/amdgpu/index.rst
> > +++ b/Documentation/gpu/amdgpu/index.rst
> > @@ -17,4 +17,5 @@ Next (GCN), Radeon DNA (RDNA), and Compute DNA (CDNA) architectures.
> >     driver-misc
> >     debugging
>
> I believe this page is directly related to the debugging page. In this
> sense, maybe add a new section about the debugfs entries to the
> debugging page.

Will do.

>
> Thanks
>
> >     process-isolation
> > +   debugfs
> >     amdgpu-glossary
> > --
> > 2.48.1
> >
>
> --
> Rodrigo Siqueira