On Fri, 2023-04-14 at 17:27 -0700, John.C.Harrison@xxxxxxxxx wrote: > From: John Harrison <John.C.Harrison@xxxxxxxxx> > > GuC based register dumps in error capture logs were basically broken > for virtual engines. This can be seen in igt@gem_exec_balancer@hang: > [IGT] gem_exec_balancer: starting subtest hang > [drm] GPU HANG: ecode 12:4:e1524110, in gem_exec_balanc [6388] > [drm] GT0: GUC: No register capture node found for 0x1005 / 0xFEDC311D > [drm] GPU HANG: ecode 12:4:00000000, in gem_exec_balanc [6388] > [IGT] gem_exec_balancer: exiting, ret=0 > > The test causes a hang on both engines of a virtual engine context. > The engine instance zero hang gets a valid error capture but the > non-instance-zero hang does not. > > Fix that by scanning through the list of pending register captures > when a hang notification for a virtual engine is received. That way, > the hang can be assigned to the correct physical engine prior to > starting the error capture process. So later on, when the error capture > handler tries to find the engine register list, it looks for one on > the correct engine. > > Also, sneak in a missing blank line before a comment in the node > search code. > > Signed-off-by: John Harrison <John.C.Harrison@xxxxxxxxx> LGTM - thanks for fixing this! :D Reviewed-by: Alan Previn <alan.previn.teres.alexis@xxxxxxxxx> A side conversation - potentially requring an unrelated future patch,... i notice that the array "error->reset_engine_count[]" (which is being used for error state reporting and as some verification in selftests) seem to have a size indicating of engine-instance-count but the reading/wrting of members of this array keep using the engine->uabi_class as index... meaning its being used to track engine class reset counts? Maybe this is an issue or i am misundertanding that. Either way, that issue is unrelated to the intent of this patch - i just wanted to get that highlighted for future action if needed. I can take that onus if its in fact an issue.