Re: [PATCH 2/2] [v6] drm/nouveau: expose GSP-RM logging buffers via debugfs

Timur Tabi <ttabi@xxxxxxxxxx> · Tue, 30 Jul 2024 21:49:15 +0000

On Tue, 2024-07-30 at 02:21 +0200, Danilo Krummrich wrote:
> Hi Timur,
> 
> > This method has the advantage that it allows the buffers to be parsed
> > even when the logging ELF file is not available to the user. However,
> > it has the disadvantage the debubfs entries need to remain until the
> 
> debubfs -> debugfs

Ok

> > driver is unloaded.
> > 
> > The buffers are exposed via debugfs. If GSP-RM fails to initialize,
> > then Nouveau immediately shuts down the GSP interface. This would
> > normally also deallocate the logging buffers, thereby preventing the
> > user from capturing the debug logs.
> > 
> > To avoid this, introduce the keep-gsp-logging command line parameter.
> > If specified, and if at least one logging buffer has content, then
> > Nouveau will migrate these buffers into new debugfs entries that are
> > retained until the driver unloads.
> > 
> > An end-user can capture the logs using the following commands:
> > 
> >     cp /sys/kernel/debug/dri/<path>/loginit loginit
> >     cp /sys/kernel/debug/dri/<path>/logrm logrm
> >     cp /sys/kernel/debug/dri/<path>/logintr logintr
> >     cp /sys/kernel/debug/dri/<path>/logpmu logpmu
> > 
> > where <path> is the PCI ID of the GPU (e.g. 0000:65:00.0).
> 
> I think this depends on the actual device type. Could also be a platform or even
> aux device.
> 
> No need to add more examples though, I think it's enough if we just change
> this to:
> 
> "where (for a PCI device) <path> is the PCI ID of the GPU (e.g. 0000:65:00.0)."

I'll make the change, but AFAIK non-PCI versions of Turing+ GPUs do not
exist.  Even the GH100, which uses an nvlink data connection between GPU to
the ARM host, still uses PCI for control.

> > 
> > +/**
> > + * nvif_log - structure for tracking logging buffers
> > + * @entry: an entry in a list of struct nvif_logs
> > + * @shutdown: pointer to function to call to clean up
> > + *
> > + * Structure used to track logging buffers so that they can be cleaned
> > up
> > + * when the driver exits.
> 
> "Driver exit" is a bit of an undefined term, the closest thing is probably
> "driver detach" (from a device). In this case I think you actually mean
> "module
> exit" though.

Yes, I use driver and module interchangeably, but I guess that's not
accurate.

> > + *
> > + * The @shutdown function is called when the driver exits. It should
> > free all
> 
> Same here.

I replaced 'driver' with 'module'.

> 
> > + * backing resources, such as logging buffers.
> > + */
> > +struct nvif_log {
> > +	struct list_head entry;
> > +	void (*shutdown)(struct nvif_log *log);
> > +};
> > +
> > +#define NVIF_LOGS_DECLARE(_log) \
> > +	struct nvif_log _log = { LIST_HEAD_INIT(_log.entry) }
> 
> If you declare this as
> 
> 	#define NVIF_LOGS_DECLARE(_log) \
> 		struct nvif_log _log = { LIST_HEAD_INIT(_log.entry), nvif_log_shutdown }
> 
> and change the signature of nvif_log_shutdown() to
> 
> 	static inline void nvif_log_shutdown(struct nvif_log *logs)
> 
> you can just call
> 
> 	gsp_logs.shutdown(&gsp_logs);
> 
> in nouveau_drm_exit().
> 
> Admittedly, maybe a bit too sneaky though. :)

gsp_logs.shutdown(&gsp_logs) -- are you sure you want this?  This is some
weird C++ wanna-be code, IMHO.  I don't think this is an improvement.  I'd
rather keep it as-is.

> > +/*
> > + * GSP-RM uses a pseudo-class mechanism to define of a variety of per-"engine"
> > + * data structures, and each engine has a "class ID" genererated by a
> > + * pre-processor. This is the class ID for the PMU.
> > + */
> > +#define NV_GSP_MSG_EVENT_UCODE_LIBOS_CLASS_PMU		0xf3d722
> > +
> > +/**
> > + * rpc_ucode_libos_print_v1E_08 - RPC payload for libos print buffers
> 
> Is this structure versioned? If so, does it relate to a specific GSP-RM version?

Yes, the structure is versioned, but it's a little weird.  When a new RPC is
added, the developer can choose to use any version number that matches the
"current" VGPU version or earlier.  src/nvidia/generated/g_rpc-structures.h
is generated from a special "def" file, and the developer just chooses
whatever number he wants.  Typically, the developer will choose whatever
version that VGPU is at when he starts working on the code.  But since this
is the first and only version of rpc_ucode_libos_print, he could have chosen
v01_00.

So "previous RPC version (if it exists)" < "new RPC version" <= "current
VGPU version" at the time the RPC was created/updated.

> > +static void
> > +r535_gsp_libos_debugfs_init(struct nvkm_gsp *gsp)
> > +{
> > +	struct dentry *dir, *dir_init;
> > +	struct dentry *dir_intr = NULL, *dir_rm = NULL, *dir_pmu = NULL;
> > +	struct device *dev = gsp->subdev.device->dev;
> > +
> > +	/* Each GPU has a subdir based on its device name, so find it */
> > +	struct drm_device *drm_dev = dev_get_drvdata(dev);
> > +
> > +	if (!drm_dev || !drm_dev->debugfs_root) {
> 
> I don't think we need any of those checks, please remove them.

Ok

> > +	nvkm_debug(&gsp->subdev, "created debugfs GSP-RM logging entries\n");
> > +
> > +	if (keep_gsp_logging) {
> > +		nvkm_warn(&gsp->subdev,
> 
> nvkm_info() is probably enough.

Ok.

> > +static void r535_gsp_retain_logging(struct nvkm_gsp *gsp)
> > +{
> > +	struct device *dev = gsp->subdev.device->dev;
> > +	struct dentry *root, *dir;
> > +	struct r535_gsp_log *log;
> > +	int ret;
> > +
> > +	/* If we were told not to keep logs, then don't. */
> > +	if (!keep_gsp_logging)
> > +		return;
> > +
> > +	/* Check to make sure at least one buffer has data. */
> > +	if (is_empty(&gsp->blob_init) && is_empty(&gsp->blob_intr) &&
> > +	    is_empty(&gsp->blob_rm) && is_empty(&gsp->blob_rm)) {
> > +		nvkm_warn(&gsp->subdev, "all logging buffers are empty\n");
> > +		return;
> > +	}
> > +
> > +	/* Find the 'dri' root debugfs entry. Every GPU has a dentry under it */
> > +	root = debugfs_lookup("dri", NULL);
> > +	if (IS_ERR(root)) {
> 
> I don't think this can ever happen. This entry is created in drm_core_init().

So first, the check is wrong.  It should be "if (!root)".

However, I think it can happen.  drm_core_init() does not fail if
debugfs_create_dir() fails:

	drm_debugfs_root = debugfs_create_dir("dri", NULL);

It doesn't even print a warning message.  It just keeps going.  So I think
there should be some error checking somewhere.

I tested this, and if drm_core_init() fails to create the dentry, then
r535_gsp_retain_logging() will just keep going trying to create debugfs
entries, because a root of NULL is actually valid, and the entries are
created in /sys/kernel/debug/0000:65:00.0/ instead of
/sys/kernel/debug/dri/0000:65:00.0/

In fact, I think I found a small bug in dput():

void dput(struct dentry *dentry)
{
	if (!dentry)
		return;

This should probably be "if (IS_ERR_OR_NULL(dentry))".