alan: I will respin this rev with the changes mentioned below - thanks Zhanjun for the time in reviewing this. :)

On Mon, 2025-02-10 at 18:41 -0500, Dong, Zhanjun wrote:
> See my comments inline below.
> 
> Regards,
> Zhanjun Dong
> 
> On 2025-01-28 1:36 p.m., Alan Previn wrote:
> > GuC-Err-Capture should not be storing register snapshot
> > nodes directly inside of the top-level xe_devcoredump_snapshot
> > structure that it doesn't control. Furthermore, that is
> > not right from a driver subsystem layering perspective.
> > 
> > Instead, when a matching GuC-Err-Capture register snapshot is
> > available, store it into the xe_hw_engine_snapshot structure.
> > 
> > To ensure the manual snapshot can be retrieved and released
> > like the firmware-reported snapshot nodes, replace xe_engine_manual_capture
> > with xe_guc_capture_snapshot_store_manual_job (which generates
> > and stores the manual GuC-Err-Capture register snapshot
> > within its internal outlist).
> > 
> > Signed-off-by: Alan Previn <alan.previn.teres.alexis@xxxxxxxxx>
> > ---
> >  drivers/gpu/drm/xe/xe_devcoredump.c       |   3 -
> >  drivers/gpu/drm/xe/xe_devcoredump_types.h |   6 -
> >  drivers/gpu/drm/xe/xe_guc_capture.c       | 153 ++++++++++------------
> >  drivers/gpu/drm/xe/xe_guc_capture.h       |   9 +-
> >  drivers/gpu/drm/xe/xe_guc_submit.c        |  12 +-
> >  drivers/gpu/drm/xe/xe_hw_engine.c         |  32 ++---
> >  drivers/gpu/drm/xe/xe_hw_engine_types.h   |   8 ++
> >  7 files changed, 102 insertions(+), 121 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/xe/xe_devcoredump.c b/drivers/gpu/drm/xe/xe_devcoredump.c
> > index 39fe485d2085..006041997550 100644
> > --- a/drivers/gpu/drm/xe/xe_devcoredump.c
> > +++ b/drivers/gpu/drm/xe/xe_devcoredump.c
> > @@ -149,9 +149,6 @@ static void xe_devcoredump_snapshot_free(struct xe_devcoredump_snapshot *ss)
> >  	xe_guc_ct_snapshot_free(ss->guc.ct);
> >  	ss->guc.ct = NULL;
> > 
> > -	xe_guc_capture_put_matched_nodes(&ss->gt->uc.guc);
> > -	ss->matched_node = NULL;
> > -
> Sounds like this removes the capture_put_matched_nodes call from
> devcoredump_snapshot_free and, rather than that, frees it via
> xe_hw_engine_snapshot_free - that's fine.
alan: that's right - from a driver layering perspective, the node retrieved from guc-err-cap is a subset of the xe_hw_engine_snapshot, not of the top-level devcore-dump. Also, as per the series header, guc-err-capture shall not reach up into structures it doesn't own to place-hold the matching node handle - i.e. guc-err-cap won't modify xe_hw_engine_snapshot; instead, xe_hw_engine shall retrieve the matching node from guc-err-cap and store it in the xe_hw_engine_snapshot which it allocated in the first place. That's why the free has moved.
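To make the new ownership concrete, this is roughly the shape I'm going for (a sketch only - the matched_node field is what this patch adds to xe_hw_engine_types.h, but the free path below assumes the snapshot keeps an hwe back-pointer and reuses the existing put helper; exact details may differ in the respin):

	struct xe_hw_engine_snapshot {
		/* ...existing snapshot fields unchanged... */

		/**
		 * @matched_node: GuC-Err-Capture register snapshot for this
		 * engine. Retrieved (or manually generated) by xe_hw_engine,
		 * so it is owned and released by this structure - not by
		 * xe_devcoredump.
		 */
		struct xe_guc_capture_snapshot *matched_node;
	};

	void xe_hw_engine_snapshot_free(struct xe_hw_engine_snapshot *snapshot)
	{
		if (!snapshot)
			return;

		/* hand the capture node back to guc-err-cap (it recycles
		 * nodes through its internal pools) rather than kfree it here
		 */
		if (snapshot->matched_node)
			xe_guc_capture_put_matched_nodes(&snapshot->hwe->gt->uc.guc);
		snapshot->matched_node = NULL;

		kfree(snapshot);
	}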
> >  	xe_guc_exec_queue_snapshot_free(ss->ge);
> >  	ss->ge = NULL;
> > 
> > diff --git a/drivers/gpu/drm/xe/xe_devcoredump_types.h b/drivers/gpu/drm/xe/xe_devcoredump_types.h
> > index c94ce21043a8..28486ed93314 100644
> > --- a/drivers/gpu/drm/xe/xe_devcoredump_types.h
> > +++ b/drivers/gpu/drm/xe/xe_devcoredump_types.h
> > @@ -53,12 +53,6 @@ struct xe_devcoredump_snapshot {
> >  	struct xe_hw_engine_snapshot *hwe[XE_NUM_HW_ENGINES];
> >  	/** @job: Snapshot of job state */
> >  	struct xe_sched_job_snapshot *job;
> > -	/**
> > -	 * @matched_node: The matched capture node for timedout job
> > -	 * this single-node tracker works because devcoredump will always only
> > -	 * produce one hw-engine capture per devcoredump event
> > -	 */
> > -	struct xe_guc_capture_snapshot *matched_node;
> >  	/** @vm: Snapshot of VM state */
> >  	struct xe_vm_snapshot *vm;
> > 
> > diff --git a/drivers/gpu/drm/xe/xe_guc_capture.c b/drivers/gpu/drm/xe/xe_guc_capture.c
> > index e04c87739267..f118e8dd0ecb 100644
> > --- a/drivers/gpu/drm/xe/xe_guc_capture.c
> > +++ b/drivers/gpu/drm/xe/xe_guc_capture.c
> > @@ -1532,35 +1532,18 @@ read_reg_to_node(struct xe_hw_engine *hwe, const struct __guc_mmio_reg_descr_gro
> >  	}
> >  }
> > 
> > -/**
> > - * xe_engine_manual_capture - Take a manual engine snapshot from engine.
> > - * @hwe: Xe HW Engine.
> > - * @snapshot: The engine snapshot
> > - *
> > - * Take engine snapshot from engine read.
> > - *
> > - * Returns: None
> > - */
> > -void
> > -xe_engine_manual_capture(struct xe_hw_engine *hwe, struct xe_hw_engine_snapshot *snapshot)
> > +static struct xe_guc_capture_snapshot *
> > +guc_capture_get_manual_snapshot(struct xe_guc *guc, struct xe_hw_engine *hwe)
> >  {
> > -	struct xe_gt *gt = hwe->gt;
> > -	struct xe_device *xe = gt_to_xe(gt);
> > -	struct xe_guc *guc = &gt->uc.guc;
> > -	struct xe_devcoredump *devcoredump = &xe->devcoredump;
> > +	struct xe_gt *gt = guc_to_gt(guc);
> >  	enum guc_capture_list_class_type capture_class;
> >  	const struct __guc_mmio_reg_descr_group *list;
> >  	struct xe_guc_capture_snapshot *new;
> >  	enum guc_state_capture_type type;
> > -	u16 guc_id = 0;
> > -	u32 lrca = 0;
> > -
> > -	if (IS_SRIOV_VF(xe))
> > -		return;
> Sounds like xe_engine_manual_capture was split into 2 functions, and in
> the process this SRIOV check was lost. Not an issue by itself, since the
> caller checks it, but it looks unbalanced after the split. Do you think
> that will be an issue? I mean, at a later time when maintaining this
> code, people might forget that the caller needs to check this condition.
alan: yeah - you have a point. This function is further reused in patch 5, so I should add back the sriov check above, which I lost.
> > 
> >  	new = guc_capture_get_prealloc_node(guc);
> >  	if (!new)
> > -		return;
> > +		return NULL;
> > 
> >  	capture_class = xe_engine_class_to_guc_capture_class(hwe->class);
> >  	for (type = GUC_STATE_CAPTURE_TYPE_GLOBAL; type < GUC_STATE_CAPTURE_TYPE_MAX; type++) {
> > @@ -1594,26 +1577,64 @@ xe_engine_manual_capture(struct xe_hw_engine *hwe, struct xe_hw_engine_snapshot
> >  		}
> >  	}
> > 
> > -	if (devcoredump && devcoredump->captured) {
> > -		struct xe_guc_submit_exec_queue_snapshot *ge = devcoredump->snapshot.ge;
> > 
> > -		if (ge) {
> > -			guc_id = ge->guc.id;
> > -			if (ge->lrc[0])
> > -				lrca = ge->lrc[0]->context_desc;
> > -		}
> The code removed here handled the case where devcoredump was already
> captured: we can take guc_id and lrca from the queue snapshot, no matter
> whether q has been killed or not.
> I wonder if there might be some corner case between the new and old
> implementation. To be clarified.
alan: As per offline conversation, I wanted to create two separate functions for retrieving a guc-err-cap manual snapshot:
1. an early manual snapshot associated with a job (see the below function "xe_guc_capture_snapshot_store_manual_job") - this will always have a valid queue and is only called as part of the drm-timedout-job sequence.
2. a raw/late manual snapshot not associated with any job (see patch #5's "xe_guc_capture_snapshot_manual_hwe") - this will not have a valid queue and is to be called by xe_hw_engine if no match and no queue were available, which is what happens when it is called as part of the debugfs or gt-reset sequence.
As per offline conversation, it's better if I add this info into this patch's comments.
> > +	return new;
> > +}
> > +
> > +/**
> > + * xe_guc_capture_snapshot_store_manual_job - Generate and store a manual engine register dump
> > + * @guc: Target GuC for manual capture
> > + * @q: Associated xe_exec_queue to simulate a manual capture on its behalf.
> > + *
> > + * Generate a manual GuC-Error-Capture snapshot of engine instance + engine class registers
> > + * for the engine of the given exec queue. Stores this node in internal outlist for future
> > + * retrieval with the ability to match up against the same queue.
> > + *
> > + * Returns: None
> > + */
> > +void
> > +xe_guc_capture_snapshot_store_manual_job(struct xe_guc *guc, struct xe_exec_queue *q)
> > +{
> > +	struct xe_guc_capture_snapshot *new;
> > +	struct xe_gt *gt = guc_to_gt(guc);
> > +	struct xe_hw_engine *hwe;
> > +	enum xe_hw_engine_id id;
> > +
> > +	/* we don't support GuC-Error-Capture, including manual captures, on VFs */
> > +	if (IS_SRIOV_VF(guc_to_xe(guc)))
> > +		return;
> > +
> > +	if (!q) {
> > +		xe_gt_warn(gt, "Manual GuC Error capture requested with invalid job\n");
> Same "depends on q being alive" issue.
> Also, is that a warning?
alan: yes - I wanted to be sure we don't have incorrect plumbing wired up (across the layers, flows and events). In v7 I don't see any CI failures from this, and in fact none of the suspected regressions have anything to do with this series, so it looks like we are good. However, I do see your concern in the bigger picture of racy drm-timeout vs guc-exec-quanta vs the user doing ctrl-c, so I'll change this to drm_dbg.
> > +		return;
> >  	}
> > 
> > -	new->eng_class = xe_engine_class_to_guc_class(hwe->class);
> > -	new->eng_inst = hwe->instance;
> > -	new->guc_id = guc_id;
> > -	new->lrca = lrca;
> > +	/* Find hwe for the queue */
> > +	for_each_hw_engine(hwe, gt, id) {
> > +		if (hwe != q->hwe)
> > +			continue;
> > +		break;
> > +	}
> > +	if (hwe != q->hwe) {
> > +		xe_gt_warn(gt, "Manual GuC Error capture failed to find matching engine\n");
> Not found - is that a warning or debug info?
alan: I shall change this to drm_dbg as per above.
> > +		return;
> > +	}
> > +
> > +	new = guc_capture_get_manual_snapshot(guc, hwe);
> > +	if (!new)
> > +		return;
> > +
> > +	new->guc_id = q->guc->id;
> > +	new->lrca = xe_lrc_ggtt_addr(q->lrc[0]);
> Same "depends on q being alive" issue.
alan: as per offline conversation and my response above to your comment about the removal of the following hunk:
-		if (ge) {
-			guc_id = ge->guc.id;
-			if (ge->lrc[0])
-				lrca = ge->lrc[0]->context_desc;
-		}
... in patch 5 we add a separate function just to handle jobless snapshots.
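For reference, that patch #5 jobless variant is shaped roughly like this (a sketch from my work-in-progress tree - the naming, and how a job-less node is marked, here by leaving guc_id/lrca zeroed, are not final):

	void
	xe_guc_capture_snapshot_manual_hwe(struct xe_guc *guc, struct xe_hw_engine *hwe)
	{
		struct xe_guc_capture_snapshot *new;

		/* no GuC-Error-Capture, manual or otherwise, on VFs */
		if (IS_SRIOV_VF(guc_to_xe(guc)))
			return;

		new = guc_capture_get_manual_snapshot(guc, hwe);
		if (!new)
			return;

		/* raw/late capture: no queue, so there is no guc_id/lrca to
		 * match against later - callers are the debugfs and gt-reset
		 * sequences, not the timedout-job path
		 */
		new->guc_id = 0;
		new->lrca = 0;
		new->is_partial = 0;
		new->locked = 1;
		new->source = XE_ENGINE_CAPTURE_SOURCE_MANUAL;

		guc_capture_add_node_to_outlist(guc->capture, new);
	}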
> >  	new->is_partial = 0;
> >  	new->locked = 1;
> >  	new->source = XE_ENGINE_CAPTURE_SOURCE_MANUAL;
> > 
> >  	guc_capture_add_node_to_outlist(guc->capture, new);
> > -	devcoredump->snapshot.matched_node = new;
> > +
> > +	return;
> >  }
alan: snip
> > 
> > @@ -1893,51 +1902,23 @@ xe_engine_snapshot_capture_for_queue(struct xe_exec_queue *q)
> >  			coredump->snapshot.hwe[id] = NULL;
> >  			continue;
> >  		}
> > -
> > -		if (!coredump->snapshot.hwe[id]) {
> > -			coredump->snapshot.hwe[id] =
> > -				xe_hw_engine_snapshot_capture(hwe, q);
> > -		} else {
> > -			struct xe_guc_capture_snapshot *new;
> > -
> > -			new = xe_guc_capture_get_matching_and_lock(q);
> > -			if (new) {
> > -				struct xe_guc *guc = &q->gt->uc.guc;
> > -
> > -				/*
> > -				 * If we are in here, it means we found a fresh
> > -				 * GuC-err-capture node for this engine after
> > -				 * previously failing to find a match in the
> > -				 * early part of guc_exec_queue_timedout_job.
> > -				 * Thus we must free the manually captured node
> > -				 */
> > -				guc_capture_free_outlist_node(guc->capture,
> > -							      coredump->snapshot.matched_node);
> > -				coredump->snapshot.matched_node = new;
> > -			}
> > -		}
> > -
> > -		break;
> > +		coredump->snapshot.hwe[id] = xe_hw_engine_snapshot_capture(hwe, q);
> Now xe_engine_snapshot_capture_for_queue is completely separated from
> manual capture, so the above logic is no longer needed. Nice.
alan: I like the cleaner dedicated helpers :)
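alan: and just to close the loop on how the pieces fit together on the timedout-job path after this series - a pseudocode sketch of the intended call ordering only; the wrapper function below is illustrative and not from the actual patches:

	static void timedout_job_capture_flow(struct xe_exec_queue *q)
	{
		struct xe_guc *guc = &q->gt->uc.guc;

		/* 1) if GuC never reported a capture node for this queue,
		 *    generate a manual one into guc-err-cap's internal outlist
		 */
		if (!xe_guc_capture_get_matching_and_lock(q))
			xe_guc_capture_snapshot_store_manual_job(guc, q);

		/* 2) devcoredump asks xe_hw_engine for per-engine snapshots;
		 *    xe_hw_engine retrieves the matching node (manual or
		 *    firmware-reported) from guc-err-cap and stores it in the
		 *    xe_hw_engine_snapshot it owns
		 */
		xe_engine_snapshot_capture_for_queue(q);
	}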