On Mon, Nov 28, 2016 at 03:55:42PM +0000, Tvrtko Ursulin wrote:
> 
> On 28/11/2016 14:19, Chris Wilson wrote:
> >On Mon, Nov 28, 2016 at 02:02:33PM +0000, Tvrtko Ursulin wrote:
> >>
> >>On 25/11/2016 09:30, Chris Wilson wrote:
> >>>Something I missed before sending off the partial series was that the
> >>>non-scheduler guc reset path was broken (in the full series, this is
> >>>pushed to the execlists reset handler). The issue is that after a reset,
> >>>we have to refill the GuC workqueues, which we do by resubmitting the
> >>>requests. However, if we already have submitted them, the fences within
> >>>them have already been used and triggering them again is an error.
> >>>Instead, just repopulate the guc workqueue.
> >>>
> >>>[ 115.858560] [IGT] gem_busy: starting subtest hang-render
> >>>[ 135.839867] [drm] GPU HANG: ecode 9:0:0xe757fefe, in gem_busy [1716], reason: Hang on render ring, action: reset
> >>>[ 135.839902] drm/i915: Resetting chip after gpu hang
> >>>[ 135.839957] [drm] RC6 on
> >>>[ 135.858351] ------------[ cut here ]------------
> >>>[ 135.858357] WARNING: CPU: 2 PID: 45 at drivers/gpu/drm/i915/i915_sw_fence.c:108 i915_sw_fence_complete+0x25/0x30
> >>>[ 135.858357] Modules linked in: rfcomm bnep binfmt_misc nls_iso8859_1 input_leds snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core btusb btrtl snd_hwdep snd_pcm 8250_dw snd_seq_midi hid_lenovo snd_seq_midi_event snd_rawmidi iwlwifi x86_pkg_temp_thermal coretemp snd_seq crct10dif_pclmul snd_seq_device hci_uart snd_timer crc32_pclmul ghash_clmulni_intel idma64 aesni_intel virt_dma btbcm snd btqca aes_x86_64 btintel lrw cfg80211 bluetooth gf128mul glue_helper ablk_helper cryptd soundcore intel_lpss_pci intel_pch_thermal intel_lpss_acpi intel_lpss acpi_als mfd_core kfifo_buf acpi_pad industrialio autofs4 hid_plantronics usbhid dm_mirror dm_region_hash dm_log sdhci_pci ahci sdhci libahci i2c_hid hid
> >>>[ 135.858389] CPU: 2 PID: 45 Comm: kworker/2:1 Tainted: G W 4.9.0-rc4+ #238
> >>>[ 135.858389] Hardware name: /NUC6i3SYB, BIOS SYSKLi35.86A.0024.2015.1027.2142 10/27/2015
> >>>[ 135.858392] Workqueue: events_long i915_hangcheck_elapsed
> >>>[ 135.858394] ffffc900001bf9b8 ffffffff812bb238 0000000000000000 0000000000000000
> >>>[ 135.858396] ffffc900001bf9f8 ffffffff8104f621 0000006c00000000 ffff8808296137f8
> >>>[ 135.858398] 0000000000000a00 ffff8808457a0000 ffff880845764e60 ffff880845760000
> >>>[ 135.858399] Call Trace:
> >>>[ 135.858403] [<ffffffff812bb238>] dump_stack+0x4d/0x65
> >>>[ 135.858405] [<ffffffff8104f621>] __warn+0xc1/0xe0
> >>>[ 135.858406] [<ffffffff8104f748>] warn_slowpath_null+0x18/0x20
> >>>[ 135.858408] [<ffffffff813f8c15>] i915_sw_fence_complete+0x25/0x30
> >>>[ 135.858410] [<ffffffff813f8fad>] i915_sw_fence_commit+0xd/0x30
> >>>[ 135.858412] [<ffffffff8142e591>] __i915_gem_request_submit+0xe1/0xf0
> >>>[ 135.858413] [<ffffffff8142e5c8>] i915_gem_request_submit+0x28/0x40
> >>>[ 135.858415] [<ffffffff814433e7>] i915_guc_submit+0x47/0x210
> >>>[ 135.858417] [<ffffffff81443e98>] i915_guc_submission_enable+0x468/0x540
> >>>[ 135.858419] [<ffffffff81442495>] intel_guc_setup+0x715/0x810
> >>>[ 135.858421] [<ffffffff8142b6b4>] i915_gem_init_hw+0x114/0x2a0
> >>>[ 135.858423] [<ffffffff813eeaa8>] i915_reset+0xe8/0x120
> >>>[ 135.858424] [<ffffffff813f3937>] i915_reset_and_wakeup+0x157/0x180
> >>>[ 135.858426] [<ffffffff813f79db>] i915_handle_error+0x1ab/0x230
> >>>[ 135.858428] [<ffffffff812c760d>] ? scnprintf+0x4d/0x90
> >>>[ 135.858430] [<ffffffff81435985>] i915_hangcheck_elapsed+0x275/0x3d0
> >>>[ 135.858432] [<ffffffff810668cf>] process_one_work+0x12f/0x410
> >>>[ 135.858433] [<ffffffff81066bf3>] worker_thread+0x43/0x4d0
> >>>[ 135.858435] [<ffffffff81066bb0>] ? process_one_work+0x410/0x410
> >>>[ 135.858436] [<ffffffff81066bb0>] ? process_one_work+0x410/0x410
> >>>[ 135.858438] [<ffffffff8106bbb4>] kthread+0xd4/0xf0
> >>>[ 135.858440] [<ffffffff8106bae0>] ? kthread_park+0x60/0x60
> >>>
> >>>v2: Only resubmit submitted requests
> >>>v3: Don't forget the pending requests have reserved space.
> >>>
> >>>Fixes: d55ac5bf97c6 ("drm/i915: Defer transfer onto execution timeline to actual hw submission")
> >>>Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
> >>>Cc: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxx>
> >>>---
> >>>drivers/gpu/drm/i915/i915_guc_submission.c | 37 ++++++++++++++++----------------
> >>>1 file changed, 20 insertions(+), 17 deletions(-)
> >>>
> >>>diff --git a/drivers/gpu/drm/i915/i915_guc_submission.c b/drivers/gpu/drm/i915/i915_guc_submission.c
> >>>index 00b5fa871644..e12ff881d38d 100644
> >>>--- a/drivers/gpu/drm/i915/i915_guc_submission.c
> >>>+++ b/drivers/gpu/drm/i915/i915_guc_submission.c
> >>>@@ -602,12 +602,9 @@ static int guc_ring_doorbell(struct i915_guc_client *gc)
> >>>}
> >>>
> >>>/**
> >>>- * i915_guc_submit() - Submit commands through GuC
> >>>+ * __i915_guc_submit() - Submit commands through GuC
> >>> * @rq: request associated with the commands
> >>> *
> >>>- * Return: 0 on success, otherwise an errno.
> >>>- * (Note: nonzero really shouldn't happen!)
> >>>- *
> >>> * The caller must have already called i915_guc_wq_reserve() above with
> >>> * a result of 0 (success), guaranteeing that there is space in the work
> >>> * queue for the new request, so enqueuing the item cannot fail.
> >>>@@ -619,7 +616,7 @@ static int guc_ring_doorbell(struct i915_guc_client *gc)
> >>> * The only error here arises if the doorbell hardware isn't functioning
> >>> * as expected, which really shouln't happen.
> >>> */
> >>>-static void i915_guc_submit(struct drm_i915_gem_request *rq)
> >>>+static void __i915_guc_submit(struct drm_i915_gem_request *rq)
> >>>{
> >>> struct drm_i915_private *dev_priv = rq->i915;
> >>> struct intel_engine_cs *engine = rq->engine;
> >>>@@ -628,17 +625,6 @@ static void i915_guc_submit(struct drm_i915_gem_request *rq)
> >>> struct i915_guc_client *client = guc->execbuf_client;
> >>> int b_ret;
> >>>
> >>>- /* We keep the previous context alive until we retire the following
> >>>- * request. This ensures that any the context object is still pinned
> >>>- * for any residual writes the HW makes into it on the context switch
> >>>- * into the next object following the breadcrumb. Otherwise, we may
> >>>- * retire the context too early.
> >>>- */
> >>>- rq->previous_context = engine->last_context;
> >>>- engine->last_context = rq->ctx;
> >>>-
> >>>- i915_gem_request_submit(rq);
> >>>-
> >>> spin_lock(&client->wq_lock);
> >>> guc_wq_item_append(client, rq);
> >>>
> >>>@@ -658,6 +644,23 @@ static void i915_guc_submit(struct drm_i915_gem_request *rq)
> >>> spin_unlock(&client->wq_lock);
> >>>}
> >>>
> >>>+static void i915_guc_submit(struct drm_i915_gem_request *rq)
> >>>+{
> >>>+ struct intel_engine_cs *engine = rq->engine;
> >>>+
> >>>+ /* We keep the previous context alive until we retire the following
> >>>+ * request. This ensures that any the context object is still pinned
> >>>+ * for any residual writes the HW makes into it on the context switch
> >>>+ * into the next object following the breadcrumb. Otherwise, we may
> >>>+ * retire the context too early.
> >>>+ */
> >>>+ rq->previous_context = engine->last_context;
> >>>+ engine->last_context = rq->ctx;
> >>>+
> >>>+ i915_gem_request_submit(rq);
> >>>+ __i915_guc_submit(rq);
> >>>+}
> >>>+
> >>>/*
> >>> * Everything below here is concerned with setup & teardown, and is
> >>> * therefore not part of the somewhat time-critical batch-submission
> >>>@@ -1556,7 +1559,7 @@ int i915_guc_submission_enable(struct drm_i915_private *dev_priv)
> >>> /* Replay the current set of previously submitted requests */
> >>> list_for_each_entry(rq, &engine->timeline->requests, link) {
> >>> client->wq_rsvd += sizeof(struct guc_wq_item);
> >>>- i915_guc_submit(rq);
> >>>+ __i915_guc_submit(rq);
> >>> }
> >>> }
> >>>
> >>
> >>The only thing bothering me here is how this interacts with the
> >>common, or "execlist", part of the reset.
> >
> >Also applies to ringbuffer (some parts at least).
> >
> >>Copying Mika in hope he can help here. :)
> >>
> >>I see that i915_gem_reset_engine runs before this GuC wq refill and
> >>does some magic with the requests on the engine->timeline->requests
> >>lists.
> >>
> >>Would it be possible to somehow better integrate the two things?
> >
> >I thought they were nicely separated. The common work to disable the
> >guilty context is handled by the core reset function. The specific work
> >required to restart and replay the requests is handled by the different
> >backends.
> >
> >i915_gem_reset_engine: is for disabling the guilty context
> >engine->reset: updates the context / ring for restarting past a hung
> >batch (if required)
> >engine->init: starts the engine, processing the pending requests if any
> >
> >engine->reset is a bad name, since we have per-engine resets as well,
> >but I haven't a better verb yet for handling the request bumping after a
> >reset.
> 
> Yes, but the final part of the reset when GuC is enabled is not in any
> of those and is not even called from i915_gem_reset. It happens
> later via i915_gem_init_hw in i915_guc_submission_enable.

Which is the init_hw...

> I was wondering if that is the most logical way of doing it - does
> it belong with the GEM reset, or the hardware reset?

It's just GuC not fitting in very well.

> For example could we, for GuC, not use reset_common_ring but add
> guc_reset_engine (yes, why did you name it ring when we went to
> engines?).

It was already called ring, I just cut'n'sed. :)

> But I see that the ordering of operations would make it very
> problematic, since GuC is not set up until later following a reset.
> 
> So it all becomes very complicated, and since it also predates this
> patch it doesn't matter. Just confused me a little bit, and probably
> will again in the future. :)

Yes, I think with care we could remove the GuC special casing. Just the
discontent hasn't yet boiled over into anger with a bug to justify some
gutting.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx
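
For reference, the shape of the fix being discussed above: request submission is
split into a GuC-only helper and a wrapper that completes the submit fence, and
the post-reset replay in i915_guc_submission_enable() calls only the helper, so
fences that were already signalled before the hang are not completed a second
time (the i915_sw_fence_complete warning in the log). A condensed sketch,
paraphrased from the patch above - locking kept, doorbell error handling and
per-engine bookkeeping elided; not the exact upstream code:

/* GuC-only half: append the work item and ring the doorbell.  This never
 * touches the request's submit fence, so it is also safe to call when
 * refilling the GuC workqueue after a reset. */
static void __i915_guc_submit(struct drm_i915_gem_request *rq)
{
	struct i915_guc_client *client = rq->i915->guc.execbuf_client;

	spin_lock(&client->wq_lock);
	guc_wq_item_append(client, rq);
	guc_ring_doorbell(client);	/* b_ret / doorbell-failure handling elided */
	spin_unlock(&client->wq_lock);
}

/* Normal submission: i915_gem_request_submit() completes the submit fence
 * exactly once, which is why this wrapper must not be used for the replay. */
static void i915_guc_submit(struct drm_i915_gem_request *rq)
{
	struct intel_engine_cs *engine = rq->engine;

	rq->previous_context = engine->last_context;
	engine->last_context = rq->ctx;

	i915_gem_request_submit(rq);
	__i915_guc_submit(rq);
}

/* Post-reset replay in i915_guc_submission_enable(): re-reserve the
 * workqueue space the pending requests already held (the v3 note) and
 * repopulate the GuC workqueue without touching the fences again. */
list_for_each_entry(rq, &engine->timeline->requests, link) {
	client->wq_rsvd += sizeof(struct guc_wq_item);
	__i915_guc_submit(rq);
}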