Re: [PATCH v3] drm/i915/pmu: Reconstruct active state on starting busy-stats

Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxxxxxxx> · Fri, 12 Jan 2018 14:12:10 +0000

On 12/01/2018 13:24, Chris Wilson wrote:
Quoting Tvrtko Ursulin (2018-01-12 13:19:30)

On 12/01/2018 13:03, Chris Wilson wrote:
Quoting Tvrtko Ursulin (2018-01-12 11:43:11)

On 12/01/2018 10:35, Chris Wilson wrote:
Quoting Tvrtko Ursulin (2018-01-12 10:30:26)


On 12/01/2018 09:51, Chris Wilson wrote:
Quoting Tvrtko Ursulin (2018-01-12 09:40:40)
So submit side doesn't work in either case, unless I am missing
something. Would need the pair of port manipulation and context_in to be
atomic.

Sure, there is a small window with the result that we either never turn
off the stats, or turn them off one too early (and then recover if the
my understanding of the underflow protection holds). The same problem as
existed before the reconstruction, except now the window is much
smaller. I'm not that scared by this (it should not explode, nor result
in hopelessly wrong stats) so it can wait until stats enabling doesn't
need to peek at execlists. I think you will need to postpone enabling
until the next context-switch if we wanted to avoid the atomics; except
that poses a problem with the igt expecting busy-start to be accurate. A
dilemma for later.

My analysis was partially incorrect, yes, there is underflow protection
already.

But I don't see that there is a race window where we end up with
permanent 100% busyness before the reconstruction patch. Where do you
see that?

The worst I see without the reconstruction is to miss the accounting of
the batch currently in progress when stats get enabled. Which is a much
less serious, totally ignorable event.

One is observable via pmu, the other not. If we fail to turn off
busy-stats accounting, nothing is lost except for a few wasted cycles.
Except on the next pmu, it starts from the previous cs instead of the
enabling -- but that is a problem that also exists with the second user.

I don't follow, I am talking about permanent 100%. Let me copy what I
wrote earlier:

port0 context complete
context_out - not enabled, no-op
stats enable - port0 busy, active = 1
port0 clear
submit
context_in - active = 2 !!! BAD
port0 set

That leads to permanent 100% until busy stats are disabled. I think that
is hugely less desirable than just failing to account for the currently
running batch, which was the case before reconstruction on enable, and
is 99.9% only a problem for IGTs.

Do you think there is a flaw in this analysis or something else?

No, I don't think it's a huge problem, an improbable race for which the
quick-and-dirty cure may worse than the disease:

diff --git a/drivers/gpu/drm/i915/intel_engine_cs.c b/drivers/gpu/drm/i915/intel_engine_cs.c
index 25360ce0353f..62d9ee9d45a6 100644
--- a/drivers/gpu/drm/i915/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/intel_engine_cs.c
@@ -1980,16 +1980,22 @@ intel_engine_lookup_user(struct drm_i915_private *i915, u8 class, u8 instance)
    */
   int intel_enable_engine_stats(struct intel_engine_cs *engine)
   {
+       struct intel_engine_execlists *execlists = &engine->execlists;
          unsigned long flags;
+       int err = 0;
   
          if (!intel_engine_supports_stats(engine))
                  return -ENODEV;
   
+       tasklet_disable(&execlists->tasklet);
          spin_lock_irqsave(&engine->stats.lock, flags);
-       if (engine->stats.enabled == ~0)
-               goto busy;
+
+       if (unlikely(engine->stats.enabled == ~0)) {
+               err = -EBUSY;
+               goto unlock;
+       }
+
          if (engine->stats.enabled++ == 0) {
-               struct intel_engine_execlists *execlists = &engine->execlists;
                  const struct execlist_port *port = execlists->port;
                  unsigned int num_ports = execlists_num_ports(execlists);
   
@@ -2004,14 +2010,12 @@ int intel_enable_engine_stats(struct intel_engine_cs *engine)
                  if (engine->stats.active)
                          engine->stats.start = engine->stats.enabled_at;
          }
-       spin_unlock_irqrestore(&engine->stats.lock, flags);
   
-       return 0;
-
-busy:
+unlock:
          spin_unlock_irqrestore(&engine->stats.lock, flags);
+       tasklet_enable(&execlists->tasklet);
   
-       return -EBUSY;
+       return err;
   }
   
   static ktime_t __intel_engine_get_busy_time(struct intel_engine_cs *engine)

using the tasklet control as a spinlock. Whereas the previous race did
not take much effort to observe.

By the previous race you are referring to the missed currently running
batch, or the permanent 100% with active state reconstruction?

You are against reverting the reconstruction? Even if it is unlikely, I
see it as very severe, while the missed current batch is more likely but
not at all severe. So I would be really for the revert. Unfortunately I
did not spot this issue during review. :(

I am against it since we can demonstrate one race very easily in igt,
and the other requires a stress test to be written, when I expect it to
be fixed. We both saw it as a temporary measure as it is poking in places
it should, and the above patch would provide the protection it needs.

Solution with tasklet_disable looks fine to me, I just thought you are 
against that as well - since you said worst than the disease.

Regards,

Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx