Re: [PATCH 04/43] drm/i915: Do a synchronous switch-to-kernel-context on idling

On 07/03/2019 22:24, Chris Wilson wrote:
Quoting Tvrtko Ursulin (2019-03-07 17:06:58)

On 07/03/2019 13:29, Chris Wilson wrote:
Quoting Tvrtko Ursulin (2019-03-07 13:07:18)

On 06/03/2019 14:24, Chris Wilson wrote:
+static bool switch_to_kernel_context_sync(struct drm_i915_private *i915)
+{
+     if (i915_gem_switch_to_kernel_context(i915))
+             return false;

Is it worth still trying to idle if this fails? Since the timeout is
short, maybe a reset in the idle state brings less havoc than not idling. It can
only fail on memory allocations I think, and on being terminally wedged, in which
case it is still okay.

Terminally wedged is hard wired to return 0 (this, the next patch?) so
that we don't bail during i915_gem_suspend() for this reason.

We do still idle if this fails, as we mark the driver/GPU as wedged.
Perform a GPU reset so that it hopefully isn't shooting kittens any
more, and pull a pillow over our heads.

I didn't find a path which idles before wedging if
switch_to_kernel_context_sync fails due to
i915_gem_switch_to_kernel_context failing. Where is it?

Wedging implies idling. When we are wedged, the GPU is reset and left
pointing into the void (effectively idle, GT powersaving should be
unaffected by the wedge, don't ask how that works on ilk). All the
callers do

if (!switch_to_kernel_context_sync(i915))
	i915_gem_set_wedged(i915);

with more or less intermediate steps. Hmm, given that is true why not
pull it into switch_to_kernel_context_sync()...

If all the callers follow up with a wedge, then maybe, yes.
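
For reference, a rough sketch of what folding the wedge into the helper could look like, pieced together from the fragments quoted in this thread (the void return and the error-path layout are assumptions, not the patch as posted):

static void switch_to_kernel_context_sync(struct drm_i915_private *i915)
{
	/* Ask every engine to switch back to the pinned kernel context. */
	if (i915_gem_switch_to_kernel_context(i915))
		goto err_wedge;

	/* Short, bounded wait for the GPU to drain rather than hanging. */
	if (i915_gem_wait_for_idle(i915,
				   I915_WAIT_LOCKED |
				   I915_WAIT_FOR_IDLE_BOOST,
				   HZ / 10))
		goto err_wedge;

	return;

err_wedge:
	/* Wedging implies idling: reset the GPU and leave it pointing at nothing. */
	dev_err(i915->drm.dev, "Failed to idle engines, declaring wedged!\n");
	GEM_TRACE_DUMP();
	i915_gem_set_wedged(i915);
}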


It is a minor concern, don't get me wrong. It is unlikely to fail like
this. I was simply thinking: why not try and wait for the current work to
finish before suspending in this case? It might be a better experience
after resume.

For the desktop use case, it's immaterial as the hotplug, reconfigure
and redraw take care of that. (Fbcon is also cleared.)

But whether the context state is sane or corrupt, I think, comes into play. A short wait for idle before suspend might still work even if the extremely unlikely failure in i915_gem_switch_to_kernel_context happens.

Seems more robust to me to try regardless since the timeout is short.

It's the server use case with persistent jobs that worries me, and which I
expect will be upset unless we can do a preempt-to-idle beforehand. Are
suspend cycles even a factor? Probably not, but you never know with
thermald and host migration what they may get up to.

Yeah don't know.

That's pretty much our answer in every case, wedge and hope the users go
away.

+
+     if (i915_gem_wait_for_idle(i915,
+                                I915_WAIT_LOCKED |
+                                I915_WAIT_FOR_IDLE_BOOST,
+                                HZ / 10))

You'll presumably change HZ / 10 to that new macro.

Yup.

    static void
    i915_gem_idle_work_handler(struct work_struct *work)
    {
-     struct drm_i915_private *dev_priv =
-             container_of(work, typeof(*dev_priv), gt.idle_work.work);
+     struct drm_i915_private *i915 =
+             container_of(work, typeof(*i915), gt.idle_work.work);
+     typeof(i915->gt) *gt = &i915->gt;

I am really not sure about the typeof idiom in normal C code. :( It
saves a little bit of typing, and a little bit of churn if the type name
changes, but it just feels weird to use it in some places and not in others.

But then we have to name things! We're not sold on gt; it means quite a
few different things around the bspec. This bit is actually part of what
I'm earmarking for i915_gem itself (high-level idle/power/user management),
and I'm contemplating i915_gpu for the bits that sit beneath us but are still a
management layer over the hardware (and intel_foo for the bits that talk to
hardware. Maybe that too will change if we completely split out into
different modules.)

So you could have left it as is for now and have a smaller diff. But
okay.. have it if you insist.

No, you've stated on a few occasions that you don't like gt->X, so I'll
have to find a new strategy and fix up patches as I remember your
distaste.

I know I don't like the sprinkling of typeof to declare locals, but I don't remember disliking anything more than that. I am not sure what your "no" refers to now: that you feel you had to do this in this patch, or that you don't accept my "have it if you insist"?
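
As an aside on the idiom itself, here is a minimal GNU C illustration of what the typeof-declared local buys; every name in it is a hypothetical stand-in, not real i915 code:

struct hypo_gt { int active_requests; };	/* hypothetical stand-ins */
struct hypo_i915 { struct hypo_gt gt; };

static void hypo_idle(struct hypo_i915 *i915)
{
	/* typeof tracks the member, so renaming struct hypo_gt needs no edit here */
	typeof(i915->gt) *gt = &i915->gt;

	/* the explicit spelling must be kept in sync with any rename by hand */
	struct hypo_gt *gt2 = &i915->gt;

	gt->active_requests++;
	gt2->active_requests--;
}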


The general plan is still to try and split drm_i915_private into i915_gem
and i915_gpu (for us at least).

        /*
-      * New request retired after this work handler started, extend active
-      * period until next instance of the work.
+      * Flush out the last user context, leaving only the pinned
+      * kernel context resident. Should anything unfortunate happen
+      * while we are idle (such as the GPU being power cycled), no users
+      * will be harmed.
         */
-     if (new_requests_since_last_retire(dev_priv))
-             goto out_unlock;
-
-     __i915_gem_park(dev_priv);
+     if (!gt->active_requests && !work_pending(&gt->idle_work.work)) {

Could have left the new_requests_since_last_retire for readability.

As always, it changed several times. I actually think considering that
we play games with active_requests, keeping that in the open is
important. ~o~

Ok.
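
For context, the helper being open-coded here read roughly as follows before this patch (a sketch from memory, not quoted from the series):

static bool new_requests_since_last_retire(const struct drm_i915_private *i915)
{
	/* Active requests, or a pending idle worker, both count as "new work". */
	return (READ_ONCE(i915->gt.active_requests) ||
		work_pending(&i915->gt.idle_work.work));
}

The open-coded !gt->active_requests && !work_pending(&gt->idle_work.work) test in the hunk above is its negation, which is what keeps the games played with active_requests visible at the call site.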


+             ++gt->active_requests; /* don't requeue idle */
+
+             if (!switch_to_kernel_context_sync(i915)) {
+                     dev_err(i915->drm.dev,
+                             "Failed to idle engines, declaring wedged!\n");
+                     GEM_TRACE_DUMP();
+                     i915_gem_set_wedged(i915);
+             }
+             i915_retire_requests(i915);

@@ -3128,7 +3120,6 @@ int i915_gem_wait_for_idle(struct drm_i915_private *i915,
                        return err;
i915_retire_requests(i915);
-             GEM_BUG_ON(i915->gt.active_requests);

Belongs to this patch?

Yes. See the above chunk.

The idle work handler? More hand-holding required, I'm afraid.

Ok, in the idle work, we bump active_requests before calling
i915_gem_wait_for_idle(), so active_requests is never 0 at that point and we
intentionally hit the GEM_BUG_ON; the bump is there to prevent requeuing the
idle work handler when we emit the request to switch to the kernel context,
which is why the GEM_BUG_ON has to be removed in this patch.

Well, doh... seems like I had a small stack overflow. I didn't see the wait since it is "hidden" in switch_to_kernel_context_sync.
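
Condensing the hunks above with that explanation inlined (the comments are added here for illustration; the tail of the handler is not quoted in this thread):

	if (!gt->active_requests && !work_pending(&gt->idle_work.work)) {
		++gt->active_requests; /* don't requeue idle */

		/*
		 * The wait is "hidden" here: switch_to_kernel_context_sync()
		 * calls i915_gem_wait_for_idle(), which retires everything,
		 * yet active_requests stays at 1 because of the bump above,
		 * so the old GEM_BUG_ON(i915->gt.active_requests) inside
		 * i915_gem_wait_for_idle() would fire and has to be dropped.
		 */
		if (!switch_to_kernel_context_sync(i915)) {
			dev_err(i915->drm.dev,
				"Failed to idle engines, declaring wedged!\n");
			GEM_TRACE_DUMP();
			i915_gem_set_wedged(i915);
		}
		i915_retire_requests(i915);

		/* presumably the bump is undone and the GPU parked after this */
	}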

Regards,

Tvrtko



