alan: I will respin this rev with the changes mentioned below - thanks Zhanjun for the time in reviewing this. :)

On Mon, 2025-02-10 at 18:41 -0500, Dong, Zhanjun wrote:
> See my comments inline below.
> 
> Regards,
> Zhanjun Dong
> 
> On 2025-01-28 1:36 p.m., Alan Previn wrote:
> > GuC-Err-Capture should not be storing register snapshot
> > nodes directly inside of the top-level xe_devcoredump_snapshot
> > structure that it doesn't control. Furthermore, that is
> > not right from a driver subsystem layering perspective.
> > 
> > Instead, when a matching GuC-Err-Capture register snapshot is
> > available, store it into the xe_hw_engine_snapshot structure.
> > 
> > To ensure the manual snapshot can be retrieved and released
> > like the firmware-reported snapshot nodes, replace xe_engine_manual_capture
> > with xe_guc_capture_snapshot_store_manual_job (which generates
> > and stores the manual GuC-Err-Capture register snapshot
> > within its internal outlist).
> > 
> > Signed-off-by: Alan Previn <alan.previn.teres.alexis@xxxxxxxxx>
> > ---
> >  drivers/gpu/drm/xe/xe_devcoredump.c       |   3 -
> >  drivers/gpu/drm/xe/xe_devcoredump_types.h |   6 -
> >  drivers/gpu/drm/xe/xe_guc_capture.c       | 153 ++++++++++------------
> >  drivers/gpu/drm/xe/xe_guc_capture.h       |   9 +-
> >  drivers/gpu/drm/xe/xe_guc_submit.c        |  12 +-
> >  drivers/gpu/drm/xe/xe_hw_engine.c         |  32 ++---
> >  drivers/gpu/drm/xe/xe_hw_engine_types.h   |   8 ++
> >  7 files changed, 102 insertions(+), 121 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/xe/xe_devcoredump.c b/drivers/gpu/drm/xe/xe_devcoredump.c
> > index 39fe485d2085..006041997550 100644
> > --- a/drivers/gpu/drm/xe/xe_devcoredump.c
> > +++ b/drivers/gpu/drm/xe/xe_devcoredump.c
> > @@ -149,9 +149,6 @@ static void xe_devcoredump_snapshot_free(struct xe_devcoredump_snapshot *ss)
> >  	xe_guc_ct_snapshot_free(ss->guc.ct);
> >  	ss->guc.ct = NULL;
> > 
> > -	xe_guc_capture_put_matched_nodes(&ss->gt->uc.guc);
> > -	ss->matched_node = NULL;
> > -
> Sounds like this removes the capture_put_matched_nodes call from
> devcoredump_snapshot_free and, rather than that, frees it via
> xe_hw_engine_snapshot_free - that's fine.
alan: that's right - from a driver layering perspective, the node retrieved from guc-err-cap is a subset of the xe_hw_engine_snapshot, not of the top-level devcore-dump. Also, as per the series header, guc-err-capture shall not reach up into structures it doesn't own to place-hold the matching node handle - i.e. guc-err-cap won't modify xe_hw_engine_snapshot; instead, xe_hw_engine shall retrieve the matching node from guc-err-cap and store it in the xe_hw_engine_snapshot which it allocated in the first place. That's why the free has moved.
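To make the new ownership concrete, this is roughly the shape I'm going for (a sketch only - the matched_node field is what this patch adds to xe_hw_engine_types.h, but the free path below assumes the snapshot keeps an hwe back-pointer and reuses the existing put helper; exact details may differ in the respin):

	struct xe_hw_engine_snapshot {
		/* ...existing snapshot fields unchanged... */

		/**
		 * @matched_node: GuC-Err-Capture register snapshot for this
		 * engine. Retrieved (or manually generated) by xe_hw_engine,
		 * so it is owned and released by this structure - not by
		 * xe_devcoredump.
		 */
		struct xe_guc_capture_snapshot *matched_node;
	};

	void xe_hw_engine_snapshot_free(struct xe_hw_engine_snapshot *snapshot)
	{
		if (!snapshot)
			return;

		/* hand the capture node back to guc-err-cap (it recycles
		 * nodes through its internal pools) rather than kfree it here
		 */
		if (snapshot->matched_node)
			xe_guc_capture_put_matched_nodes(&snapshot->hwe->gt->uc.guc);
		snapshot->matched_node = NULL;

		kfree(snapshot);
	}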
> >  	xe_guc_exec_queue_snapshot_free(ss->ge);
> >  	ss->ge = NULL;
> > 
> > diff --git a/drivers/gpu/drm/xe/xe_devcoredump_types.h b/drivers/gpu/drm/xe/xe_devcoredump_types.h
> > index c94ce21043a8..28486ed93314 100644
> > --- a/drivers/gpu/drm/xe/xe_devcoredump_types.h
> > +++ b/drivers/gpu/drm/xe/xe_devcoredump_types.h
> > @@ -53,12 +53,6 @@ struct xe_devcoredump_snapshot {
> >  	struct xe_hw_engine_snapshot *hwe[XE_NUM_HW_ENGINES];
> >  	/** @job: Snapshot of job state */
> >  	struct xe_sched_job_snapshot *job;
> > -	/**
> > -	 * @matched_node: The matched capture node for timedout job
> > -	 * this single-node tracker works because devcoredump will always only
> > -	 * produce one hw-engine capture per devcoredump event
> > -	 */
> > -	struct xe_guc_capture_snapshot *matched_node;
> >  	/** @vm: Snapshot of VM state */
> >  	struct xe_vm_snapshot *vm;
> > 
> > diff --git a/drivers/gpu/drm/xe/xe_guc_capture.c b/drivers/gpu/drm/xe/xe_guc_capture.c
> > index e04c87739267..f118e8dd0ecb 100644
> > --- a/drivers/gpu/drm/xe/xe_guc_capture.c
> > +++ b/drivers/gpu/drm/xe/xe_guc_capture.c
> > @@ -1532,35 +1532,18 @@ read_reg_to_node(struct xe_hw_engine *hwe, const struct __guc_mmio_reg_descr_gro
> >  	}
> >  }
> > 
> > -/**
> > - * xe_engine_manual_capture - Take a manual engine snapshot from engine.
> > - * @hwe: Xe HW Engine.
> > - * @snapshot: The engine snapshot
> > - *
> > - * Take engine snapshot from engine read.
> > - *
> > - * Returns: None
> > - */
> > -void
> > -xe_engine_manual_capture(struct xe_hw_engine *hwe, struct xe_hw_engine_snapshot *snapshot)
> > +static struct xe_guc_capture_snapshot *
> > +guc_capture_get_manual_snapshot(struct xe_guc *guc, struct xe_hw_engine *hwe)
> >  {
> > -	struct xe_gt *gt = hwe->gt;
> > -	struct xe_device *xe = gt_to_xe(gt);
> > -	struct xe_guc *guc = &gt->uc.guc;
> > -	struct xe_devcoredump *devcoredump = &xe->devcoredump;
> > +	struct xe_gt *gt = guc_to_gt(guc);
> >  	enum guc_capture_list_class_type capture_class;
> >  	const struct __guc_mmio_reg_descr_group *list;
> >  	struct xe_guc_capture_snapshot *new;
> >  	enum guc_state_capture_type type;
> > -	u16 guc_id = 0;
> > -	u32 lrca = 0;
> > -
> > -	if (IS_SRIOV_VF(xe))
> > -		return;
> Sounds like xe_engine_manual_capture was split into 2 functions, and in
> the process this SRIOV check was lost. Not an issue by itself, since the
> caller checks it, but it looks unbalanced after the split. Do you think
> that will be an issue? I mean, at a later time when maintaining this
> code, people might forget that the caller needs to check this condition.
alan: yeah - you have a point. This function is further reused in patch 5, so I should add back the sriov check above, which I lost.
> > 
> >  	new = guc_capture_get_prealloc_node(guc);
> >  	if (!new)
> > -		return;
> > +		return NULL;
> > 
> >  	capture_class = xe_engine_class_to_guc_capture_class(hwe->class);
> >  	for (type = GUC_STATE_CAPTURE_TYPE_GLOBAL; type < GUC_STATE_CAPTURE_TYPE_MAX; type++) {
> > @@ -1594,26 +1577,64 @@ xe_engine_manual_capture(struct xe_hw_engine *hwe, struct xe_hw_engine_snapshot
> >  		}
> >  	}
> > 
> > -	if (devcoredump && devcoredump->captured) {
> > -		struct xe_guc_submit_exec_queue_snapshot *ge = devcoredump->snapshot.ge;
> > 
> > -		if (ge) {
> > -			guc_id = ge->guc.id;
> > -			if (ge->lrc[0])
> > -				lrca = ge->lrc[0]->context_desc;
> > -		}
> The code removed here handled the case where devcoredump was already
> captured: we can take guc_id and lrca from the queue snapshot, no matter
> whether q has been killed or not.
> I wonder if there might be some corner case between the new and old
> implementation. To be clarified.
alan: As per offline conversation, I wanted to create two separate functions for retrieving a guc-err-cap manual snapshot:
1. an early manual snapshot associated with a job (see the below function "xe_guc_capture_snapshot_store_manual_job") - this will always have a valid queue and is only called as part of the drm-timedout-job sequence.
2. a raw/late manual snapshot not associated with any job (see patch #5's "xe_guc_capture_snapshot_manual_hwe") - this will not have a valid queue and is to be called by xe_hw_engine if no match and no queue were available, which is what happens when it is called as part of the debugfs or gt-reset sequence.
As per offline conversation, it's better if I add this info into this patch's comments.
> > +	return new;
> > +}
> > +
> > +/**
> > + * xe_guc_capture_snapshot_store_manual_job - Generate and store a manual engine register dump
> > + * @guc: Target GuC for manual capture
> > + * @q: Associated xe_exec_queue to simulate a manual capture on its behalf.
> > + *
> > + * Generate a manual GuC-Error-Capture snapshot of engine instance + engine class registers
> > + * for the engine of the given exec queue. Stores this node in internal outlist for future
> > + * retrieval with the ability to match up against the same queue.
> > + *
> > + * Returns: None
> > + */
> > +void
> > +xe_guc_capture_snapshot_store_manual_job(struct xe_guc *guc, struct xe_exec_queue *q)
> > +{
> > +	struct xe_guc_capture_snapshot *new;
> > +	struct xe_gt *gt = guc_to_gt(guc);
> > +	struct xe_hw_engine *hwe;
> > +	enum xe_hw_engine_id id;
> > +
> > +	/* we don't support GuC-Error-Capture, including manual captures, on VFs */
> > +	if (IS_SRIOV_VF(guc_to_xe(guc)))
> > +		return;
> > +
> > +	if (!q) {
> > +		xe_gt_warn(gt, "Manual GuC Error capture requested with invalid job\n");
> Same "depends on q being alive" issue.
> Also, is that a warning?
alan: yes - I wanted to be sure we don't have incorrect plumbing wired up (across the layers, flows and events). In v7 I don't see any CI failures from this, and in fact none of the suspected regressions have anything to do with this series, so it looks like we are good. However, I do see your concern in the bigger picture of racy drm-timeout vs guc-exec-quanta vs the user doing ctrl-c, so I'll change this to drm_dbg.
> > +		return;
> >  	}
> > 
> > -	new->eng_class = xe_engine_class_to_guc_class(hwe->class);
> > -	new->eng_inst = hwe->instance;
> > -	new->guc_id = guc_id;
> > -	new->lrca = lrca;
> > +	/* Find hwe for the queue */
> > +	for_each_hw_engine(hwe, gt, id) {
> > +		if (hwe != q->hwe)
> > +			continue;
> > +		break;
> > +	}
> > +	if (hwe != q->hwe) {
> > +		xe_gt_warn(gt, "Manual GuC Error capture failed to find matching engine\n");
> Not found - is that a warning or debug info?
alan: I shall change this to drm_dbg as per above.
> > +		return;
> > +	}
> > +
> > +	new = guc_capture_get_manual_snapshot(guc, hwe);
> > +	if (!new)
> > +		return;
> > +
> > +	new->guc_id = q->guc->id;
> > +	new->lrca = xe_lrc_ggtt_addr(q->lrc[0]);
> Same "depends on q being alive" issue.
alan: as per offline conversation and my response above to your comment about the removal of the following hunk:
-		if (ge) {
-			guc_id = ge->guc.id;
-			if (ge->lrc[0])
-				lrca = ge->lrc[0]->context_desc;
-		}
... in patch 5 we add a separate function just to handle jobless snapshots.
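For reference, that patch #5 jobless variant is shaped roughly like this (a sketch from my work-in-progress tree - the naming, and how a job-less node is marked, here by leaving guc_id/lrca zeroed, are not final):

	void
	xe_guc_capture_snapshot_manual_hwe(struct xe_guc *guc, struct xe_hw_engine *hwe)
	{
		struct xe_guc_capture_snapshot *new;

		/* no GuC-Error-Capture, manual or otherwise, on VFs */
		if (IS_SRIOV_VF(guc_to_xe(guc)))
			return;

		new = guc_capture_get_manual_snapshot(guc, hwe);
		if (!new)
			return;

		/* raw/late capture: no queue, so there is no guc_id/lrca to
		 * match against later - callers are the debugfs and gt-reset
		 * sequences, not the timedout-job path
		 */
		new->guc_id = 0;
		new->lrca = 0;
		new->is_partial = 0;
		new->locked = 1;
		new->source = XE_ENGINE_CAPTURE_SOURCE_MANUAL;

		guc_capture_add_node_to_outlist(guc->capture, new);
	}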
> >  	new->is_partial = 0;
> >  	new->locked = 1;
> >  	new->source = XE_ENGINE_CAPTURE_SOURCE_MANUAL;
> > 
> >  	guc_capture_add_node_to_outlist(guc->capture, new);
> > -	devcoredump->snapshot.matched_node = new;
> > +
> > +	return;
> >  }
alan: snip
> > 
> > @@ -1893,51 +1902,23 @@ xe_engine_snapshot_capture_for_queue(struct xe_exec_queue *q)
> >  			coredump->snapshot.hwe[id] = NULL;
> >  			continue;
> >  		}
> > -
> > -		if (!coredump->snapshot.hwe[id]) {
> > -			coredump->snapshot.hwe[id] =
> > -				xe_hw_engine_snapshot_capture(hwe, q);
> > -		} else {
> > -			struct xe_guc_capture_snapshot *new;
> > -
> > -			new = xe_guc_capture_get_matching_and_lock(q);
> > -			if (new) {
> > -				struct xe_guc *guc = &q->gt->uc.guc;
> > -
> > -				/*
> > -				 * If we are in here, it means we found a fresh
> > -				 * GuC-err-capture node for this engine after
> > -				 * previously failing to find a match in the
> > -				 * early part of guc_exec_queue_timedout_job.
> > -				 * Thus we must free the manually captured node
> > -				 */
> > -				guc_capture_free_outlist_node(guc->capture,
> > -							      coredump->snapshot.matched_node);
> > -				coredump->snapshot.matched_node = new;
> > -			}
> > -		}
> > -
> > -		break;
> > +		coredump->snapshot.hwe[id] = xe_hw_engine_snapshot_capture(hwe, q);
> Now xe_engine_snapshot_capture_for_queue is completely separated from
> manual capture, so the above logic is no longer needed. Nice.
alan: I like the cleaner dedicated helpers :)
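alan: and just to close the loop on how the pieces fit together on the timedout-job path after this series - a pseudocode sketch of the intended call ordering only; the wrapper function below is illustrative and not from the actual patches:

	static void timedout_job_capture_flow(struct xe_exec_queue *q)
	{
		struct xe_guc *guc = &q->gt->uc.guc;

		/* 1) if GuC never reported a capture node for this queue,
		 *    generate a manual one into guc-err-cap's internal outlist
		 */
		if (!xe_guc_capture_get_matching_and_lock(q))
			xe_guc_capture_snapshot_store_manual_job(guc, q);

		/* 2) devcoredump asks xe_hw_engine for per-engine snapshots;
		 *    xe_hw_engine retrieves the matching node (manual or
		 *    firmware-reported) from guc-err-cap and stores it in the
		 *    xe_hw_engine_snapshot it owns
		 */
		xe_engine_snapshot_capture_for_queue(q);
	}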