Re: [PATCH] drm/i915/pmu: Connect engine busyness stats from GuC to pmu

Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxxxxxxx> · Thu, 7 Oct 2021 09:17:34 +0100





On 06/10/2021 21:45, Umesh Nerlige Ramappa wrote:
On Wed, Oct 06, 2021 at 10:11:58AM +0100, Tvrtko Ursulin wrote:

[snip]

@@ -762,12 +764,25 @@ submission_disabled(struct intel_guc *guc)
 static void disable_submission(struct intel_guc *guc)
 {
     struct i915_sched_engine * const sched_engine = guc->sched_engine;
+    struct intel_gt *gt = guc_to_gt(guc);
+    struct intel_engine_cs *engine;
+    enum intel_engine_id id;
+    unsigned long flags;
     if (__tasklet_is_enabled(&sched_engine->tasklet)) {
         GEM_BUG_ON(!guc->ct.enabled);
         __tasklet_disable_sync_once(&sched_engine->tasklet);
         sched_engine->tasklet.callback = NULL;
     }
+
+    cancel_delayed_work(&guc->timestamp.work);

I am not sure when disable_submission gets called so a question - 
could it be important to call cancel_delayed_work_sync here to ensure 
if the worker was running it had exited before proceeding?

disable_submission is called in the reset_prepare path for uc resets. I 
see this happening only with busy-hang test which does a global gt 
reset. The counterpart for this is the guc_init_engine_stats which is 
called post reset in the path to initialize GuC.

I tried cancel_delayed_work_sync both here and in park. Seems to work 
fine, so will change the calls to _sync versions.

From park is not allowed to sleep so can't do sync from there. It might 
have been my question which put you on a wrong path, sorry. Now I think 
question remains what happens if the ping worker happens to be sampling 
GuC state as GuC is being reset? Do you need some sort of a lock to 
protect that, or make sure worker skips if reset in progress?


Also, does this interact with the open about resets? Should/could 
parking helper be called from here?

It is related to reset. Below, I am only updating the engine prev_total 
to 0 since it gets reset on gt reset. I thought that's all we need to 
keep the busyness increasing monotonically. By calling parking helper, 
are you suggesting we should update the other stats too (total, start, 
gt_stamp etc.)?

Don't know, was just asking.

Looking at it now again, resetting prev_total looks correct to me if it 
tracks rec->total_runtime which is also reset by GuC. 
Engine->stats.total_gt_clks is then purely software managed state which 
you only keep adding to. Yes looks fine to me.


+    mod_delayed_work(system_highpri_wq, &guc->timestamp.work,
+             guc->timestamp.ping_delay);
+}
+
+static int guc_action_enable_usage_stats(struct intel_guc *guc)
+{
+    u32 offset = intel_guc_engine_usage_offset(guc);
+    u32 action[] = {
+        INTEL_GUC_ACTION_SET_ENG_UTIL_BUFF,
+        offset,
+        0,
+    };
+
+    return intel_guc_send(guc, action, ARRAY_SIZE(action));
+}
+
+static void guc_init_engine_stats(struct intel_guc *guc)
+{
+    struct intel_gt *gt = guc_to_gt(guc);
+    intel_wakeref_t wakeref;
+
+    mod_delayed_work(system_highpri_wq, &guc->timestamp.work,
+             guc->timestamp.ping_delay);

Not sure how this slots in with unpark. It will probably be called two 
times but it also probably does not matter? If you can figure it out 
perhaps you can remove this call from here. Or maybe there is a 
separate path where disable-enable can be called without the 
park-unpark transition. In which case you could call the unpark helper 
here. Not sure really.

- disable_submission pairs with guc_init_engine_stats for the gt reset 
  path.
- park/unpark just follow the gt_park/gt_unpark paths.

I haven't checked if reset eventually results in park/unpark or if they 
are separate paths though. In the reset path, there are a bunch of 
i915_requests going on, so difficult to say if reset caused the 
gt_park/gt_unpark or was it the requests.

The cases where mod_delayed_work is called twice are:

1) module load
2) i915_gem_resume (based on rc6-suspend test)

In both cases, unpark is followed by guc_init_engine_stats. Looking a 
bit at what is returned from the mod_delayed_work, I see that it just 
modifies the time if the work is already queued/pending, so I am 
thinking we should be okay.

I don't see cancel getting called twice without a mod_delayed_work in 
between.

Sounds good.

Regards,

Tvrtko