Re: [PATCH 24/46] drm/i915: Do a synchronous switch-to-kernel-context on idling

Daniele Ceraolo Spurio <daniele.ceraolospurio@xxxxxxxxx> · Thu, 21 Feb 2019 16:29:07 -0800






On 2/21/19 3:25 PM, Chris Wilson wrote:
Quoting Daniele Ceraolo Spurio (2019-02-21 22:53:41)


On 2/21/19 1:42 PM, Chris Wilson wrote:
Quoting Daniele Ceraolo Spurio (2019-02-21 21:31:45)


On 2/21/19 1:17 PM, Chris Wilson wrote:
Quoting Daniele Ceraolo Spurio (2019-02-21 19:48:01)

<snip>

@@ -4481,19 +4471,7 @@ int i915_gem_suspend(struct drm_i915_private *i915)
          * state. Fortunately, the kernel_context is disposable and we do
          * not rely on its state.
          */
-     if (!i915_terminally_wedged(&i915->gpu_error)) {
-             ret = i915_gem_switch_to_kernel_context(i915);
-             if (ret)
-                     goto err_unlock;
-
-             ret = i915_gem_wait_for_idle(i915,
-                                          I915_WAIT_INTERRUPTIBLE |
-                                          I915_WAIT_LOCKED |
-                                          I915_WAIT_FOR_IDLE_BOOST,
-                                          HZ / 5);
-             if (ret == -EINTR)
-                     goto err_unlock;
-
+     if (!switch_to_kernel_context_sync(i915)) { >                   /* Forcibly cancel outstanding work and leave the gpu quiet. */
                 i915_gem_set_wedged(i915);
         }

GuC-related question: what's your expectation here in regards to GuC
status? The current i915 flow expect either uc_reset_prepare() or
uc_suspend() to be called to clean up the guc status, but we're calling
neither of them here if the switch is successful. Do you expect the
resume code to always blank out the GuC status before a reload?

(A few patches later on I propose that we always just do a reset+wedge
on suspend in lieu of hangcheck.)

On resume, we have to bring the HW up from scratch and do another reset
in the process. Some platforms have been known to survive the trips to
PCI_D3 (someone is lying!) and so we _have_ to do a reset to be sure we
clear the HW state. I expect we would need to force a reset on resume
even for the guc, to be sure we cover all cases such as kexec.
-Chris

More than about the HW state, my question here was about the SW
tracking. At which point do we go and stop guc communication and mark
guc as not loaded/accessible? e.g. we need to disable and re-enable CT
buffers before GuC is reset/suspended to make sure the shared memory
area is cleaned correctly (we currently avoid memsetting all of it on
reload since it is quite big). Also, communication with GuC is going to
increase going forward, so we'll need to make sure we accurately track
its state and do all the relevant cleanups.

Across suspend/resume, we issue a couple of resets and scrub/sanitize our
state tracking. By the time we load the fw again, both the fw and our
state should be starting from scratch.

That all seems unavoidable, so I am not understanding the essence of
your question.
-Chris


We're not doing the state scrubbing for guc in all paths at the moment.
There is logic in gem_suspend_late(), but that doesn't seem to be called
on all paths; e.g. it isn't when we run
igt@gem_exec_suspend@basic-s4-devices

Yup, the dummy hibernate code throws a few surprises, and why
i915_gem_sanitize is so fiddly to get right between that and
gem_eio/suspend.

and that's why Suja's patch moved
the disabling of communication from uc_sanitize to uc_suspend.

That should also help as previously it tried to talk to the guc after we
reset it.

But only helps if we do call uc_suspend ;). I'm wondering if it ends up 
being better to call it from both places.


The guc
resume code also doesn't currently clean everything as some of the
structures (including stuff we allocate for guc usage) are carried over.
We can either add something more in the cleanup path or go and rework
the resume to blank everything (which would be time consuming since
there is tens of MBs involved), but before putting down any code one way
or another I wanted to understand what the expectation is.

I may be naive, but my expectations is that we just have to reset the
comm ringbuffer pointers. We shouldn't need to hand the guc pristine
pages, it will zero on allocate when its needs to, surely? We do have to
rebuild the set of clients everytime we load the guc, so that can't be
the issue (as that has to be done on resume, device reset etc today),
although that should only have to be the pinned clients?

GuC doesn't clean up some of the state stored in the memory we allocate 
for its use. In the specific example of the CT buffers, the registration 
is not automatically cleaned by GuC, it is only cleaned when the 
disable_communication H2G is issued or if we just memset the guc memory. 
This is to allow re-use of the same buffers across resets without having 
to issue an H2G to re-enable them. Similar approach is taken for other 
info (e.g. lrc info required gen11+), again to allow the host to 
seamlessly restart after a reset or suspend/resume. We always need to 
recreate the clients because the doorbells are a HW state and thus they 
can get reset with the guc; the firmware also saves db status in the 
WOPCM rather then in the shared memory for speed, so that does get 
cleaned on reload.

In the current gen11 guc code (which hopefully will hit the ML soon) we 
assumed that uc_suspend would be called on all suspend paths to make 
sure the state in the shared structures was clean, but if it doesn't 
then we'll have to do some tweaks to cope. BTW, we need to add 
uc_reset_prepare() to __i915_gem_set_wedged as well.

Daniele


We have to restart the comm channels on loading the guc, so what's
changing?

On suspend, hit the device reset & kill guc. On resume, load the guc fw,
restart comm. After fiddling about making sure we are in the right
callpaths, the intent is that resume just looks like a fresh module
load (so we only have to reason about init sequence [nearly] once).
-Chris

_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx