Re: [PATCH v2 RESEND 4/5] drm: panthor: add debugfs knob to dump successful jobs

Adrian Larumbe <adrian.larumbe@xxxxxxxxxxxxx> · Thu, 22 Aug 2024 13:36:15 +0100

Hi Daniel,

> On 21.08.2024 11:37, Daniel Almeida wrote:
> It can be advantageous for userspace to have access to successful jobs.

While it's true that perhaps having additional jobs as part of the same devcoredump file
could provide further information as to why a later job failed, I see a few snags with this
approach:

- The time since the debugfs knob is triggered and therefore some successful jobs dumped
until a later job fails might be very long, so the preceding jobs maybe won't provide
much context.
- Besides being mostly interested in immediately preceding jobs, I think we'd want these
to belong to the same scheduling group and VM as the failing job, but this approach
will dump them all consecutively, even if they belong to a different open DRM file.
- In my experience writing a similar feature for the Panfrost driver, I think it's best
to wait until users of the driver run into specific bugs for us to come up with debug
features that would be useful for them, rather than sort of trying to guess them instead,
because there's the risk they'll never be used and then just add cruft into the codebase.

Other than that, the preceding patches look absolutely gorgeous to me, so I
think it's best if you resubmit them, and maybe keep patches 3-5 on hold until
we run into a bug scenario where they might prove useful.

Cheers,
Adrian

> Allow this as an opt-in through a debugfs file.
> 
> Note that the devcoredump infrastructure will discard new dumps if a
> previous dump hasn't been read. A future patch will add support for
> multi-job dumps which will work around this limitation.
> 
> Signed-off-by: Daniel Almeida <daniel.almeida@xxxxxxxxxxxxx>
> ---
>  drivers/gpu/drm/panthor/panthor_sched.c | 17 +++++++++++++++++
>  1 file changed, 17 insertions(+)
> 
> diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
> index afd644c7d9f1..ea2696c1075a 100644
> --- a/drivers/gpu/drm/panthor/panthor_sched.c
> +++ b/drivers/gpu/drm/panthor/panthor_sched.c
> @@ -10,6 +10,7 @@
>  
>  #include <linux/build_bug.h>
>  #include <linux/clk.h>
> +#include <linux/debugfs.h>
>  #include <linux/delay.h>
>  #include <linux/dma-mapping.h>
>  #include <linux/dma-resv.h>
> @@ -317,6 +318,9 @@ struct panthor_scheduler {
>  		 */
>  		struct list_head stopped_groups;
>  	} reset;
> +
> +	/** @dump_successful_jobs: whether to dump successful jobs through coredumpv */
> +	bool dump_successful_jobs;
>  };
>  
>  /**
> @@ -2946,6 +2950,16 @@ queue_run_job(struct drm_sched_job *sched_job)
>  	queue->iface.input->extract = queue->iface.output->extract;
>  	queue->iface.input->insert = job->ringbuf.end;
>  
> +	if (sched->dump_successful_jobs) {
> +		struct panthor_core_dump_args core_dump_args = {
> +			.ptdev = ptdev,
> +			.group_vm = job->group->vm,
> +			.group = job->group,
> +		};
> +
> +		panthor_core_dump(&core_dump_args);
> +	}
> +
>  	if (group->csg_id < 0) {
>  		/* If the queue is blocked, we want to keep the timeout running, so we
>  		 * can detect unbounded waits and kill the group when that happens.
> @@ -3609,5 +3623,8 @@ void panthor_sched_debugfs_init(struct drm_minor *minor)
>  	struct panthor_device *ptdev =
>  		container_of(minor->dev, struct panthor_device, base);
>  	struct panthor_scheduler *sched = ptdev->scheduler;
> +
> +	debugfs_create_bool("dump_successful_jobs", 0644, minor->debugfs_root,
> +			    &sched->dump_successful_jobs);
>  }
>  #endif /* CONFIG_DEBUG_FS */
> -- 
> 2.45.2