Nick Hoath <nicholas.hoath@xxxxxxxxx> writes:

> Use the first retired request on a new context to unpin
> the old context. This ensures that the hw context remains
> bound until it has been written back to by the GPU.
> Now that the context is pinned until later in the request/context
> lifecycle, it no longer needs to be pinned from context_queue to
> retire_requests.
> This fixes an issue with GuC submission where the GPU might not
> have finished writing back the context before it is unpinned. This
> results in a GPU hang.
>
> v2: Moved the new pin to cover GuC submission (Alex Dai)
>     Moved the new unpin to request_retire to fix coverage leak
> v3: Added switch to default context if freeing a still pinned
>     context just in case the hw was actually still using it
> v4: Unwrapped context unpin to allow calling without a request
> v5: Only create a switch to idle context if the ring doesn't
>     already have a request pending on it (Alex Dai)
>     Rename unsaved to dirty to avoid double negatives (Dave Gordon)
>     Changed _no_req postfix to __ prefix for consistency (Dave Gordon)
>     Split out per engine cleanup from context_free as it
>     was getting unwieldy
>     Corrected locking (Dave Gordon)
> v6: Removed some bikeshedding (Mika Kuoppala)
>     Added explanation of the GuC hang that this fixes (Daniel Vetter)
> v7: Removed extra per request pinning from ring reset code (Alex Dai)
>     Added forced ring unpin/clean in error case in context free (Alex Dai)
> v8: Renamed lrc specific last_context to lrc_last_context as there
>     were some reset cases where the codepaths leaked (Mika Kuoppala)
>     NULL'd last_context in reset case - there was a pointer leak
>     if someone did reset->close context.
>
> Signed-off-by: Nick Hoath <nicholas.hoath@xxxxxxxxx>
> Issue: VIZ-4277
> Cc: Daniel Vetter <daniel.vetter@xxxxxxxx>
> Cc: David Gordon <david.s.gordon@xxxxxxxxx>
> Cc: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
> Cc: Alex Dai <yu.dai@xxxxxxxxx>
> Cc: Mika Kuoppala <mika.kuoppala@xxxxxxxxxxxxxxx>

There is a splat with this patch included. I haven't yet checked
whether plain din (without this patch) hits it too.
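For anyone following along, the lifecycle the commit message describes
boils down to roughly this (a simplified sketch lifted from the
advance_and_submit and complete_check hunks below, with locking, GuC
plumbing and error paths stripped; not meant as drop-in code):

	/* Submit side: the first time a non-default context reaches
	 * the hw, take an extra pin and mark it dirty, meaning the hw
	 * may still have a writeback of this context pending.
	 */
	if (request->ctx != ring->default_context &&
	    !request->ctx->engine[ring->id].dirty) {
		intel_lr_context_pin(request);
		request->ctx->engine[ring->id].dirty = true;
	}

	/* Retire side: the first retired request running on a
	 * different context proves the hw has switched away, so the
	 * previous context must have been written back by now and can
	 * finally drop its pin.
	 */
	if (ring->lrc_last_context && ring->lrc_last_context != req->ctx &&
	    ring->lrc_last_context->engine[ring->id].dirty) {
		__intel_lr_context_unpin(ring, ring->lrc_last_context);
		ring->lrc_last_context->engine[ring->id].dirty = false;
	}
	ring->lrc_last_context = req->ctx;

The default context sits out of this dance, as before, since it stays
pinned for the lifetime of the driver.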
The script to repro is:

intel-gpu-tools/tests/drv_module_reload_basic
intel-gpu-tools/tests/drv_hangman --r error-state-basic
intel-gpu-tools/tests/kms_addfb_basic --r addfb25-framebuffer-vs-set-tiling
intel-gpu-tools/tests/gem_mmap_gtt --r basic-write-gtt-no-prefault

Thanks,
-Mika

[ 420.497681] Console: switching to colour dummy device 80x25
[ 421.175758] BUG: unable to handle kernel NULL pointer dereference at 0000000000000060
[ 421.175767] IP: [<ffffffffa064a631>] intel_lr_context_free+0x41/0x210 [i915]
[ 421.175798] PGD 0
[ 421.175803] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
[ 421.175810] Modules linked in: snd_hda_codec_hdmi nls_iso8859_1 i915(-) i2c_algo_bit x86_pkg_temp_thermal drm_kms_helper coretemp syscopyarea sysfillrect sysimgblt fb_sys_fops kvm_intel drm kvm irqbypass crct10dif_pclmul snd_hda_codec crc32_pclmul snd_hda_core snd_pcm aesni_intel aes_x86_64 glue_helper lrw snd_hwdep gf128mul ablk_helper cryptd snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq serio_raw snd_timer snd_seq_device mei_me snd mei soundcore lpc_ich video acpi_pad mac_hid parport_pc ppdev lp parport hid_generic usbhid hid e1000e ptp ahci libahci pps_core sdhci_acpi sdhci [last unloaded: snd_hda_intel]
[ 421.175892] CPU: 3 PID: 1872 Comm: rmmod Not tainted 4.4.0-rc4+ #34
[ 421.175898] Hardware name: Intel Corporation Broadwell Client platform/SawTooth Peak, BIOS BDW-E1R1.86C.0115.R00.1502122119 02/12/2015
[ 421.175906] task: ffff8800a7e127c0 ti: ffff880145704000 task.ti: ffff880145704000
[ 421.175913] RIP: 0010:[<ffffffffa064a631>]  [<ffffffffa064a631>] intel_lr_context_free+0x41/0x210 [i915]
[ 421.175937] RSP: 0018:ffff880145707ce8  EFLAGS: 00010286
[ 421.175941] RAX: 0000000000000000 RBX: ffff8800a7ab8738 RCX: 000000000000000f
[ 421.175946] RDX: 0000000000000000 RSI: ffffffff81e5c240 RDI: ffff8800a7ab86d8
[ 421.175952] RBP: ffff880145707d20 R08: 0000000000000001 R09: 0000000000000000
[ 421.175957] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8800a9f731e0
[ 421.175962] R13: ffff8800a9c083c0 R14: ffff8801481eb1b0 R15: ffff8800a7ab86d8
[ 421.175967] FS:  00007fcf9f523700(0000) GS:ffff88014e2c0000(0000) knlGS:0000000000000000
[ 421.175974] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 421.175979] CR2: 0000000000000060 CR3: 0000000143119000 CR4: 00000000003406e0
[ 421.175984] Stack:
[ 421.175987]  ffff8800a7ab87b0 ffff880145707d10 ffff8800a9f791f0 ffff8800a7ab86d8
[ 421.175996]  ffff8800a7ab86d8 ffffffffa06f80e0 0000557f7f5ad090 ffff880145707d48
[ 421.176004]  ffffffffa0624b8c ffff8800a9f791f0 ffff8800a9f791f0 ffff8800a7ab86d8
[ 421.176013] Call Trace:
[ 421.176028]  [<ffffffffa0624b8c>] i915_gem_context_free+0x1bc/0x220 [i915]
[ 421.176044]  [<ffffffffa06253a2>] i915_gem_context_fini+0xa2/0x140 [i915]
[ 421.176068]  [<ffffffffa06bc8f2>] i915_driver_unload+0x162/0x290 [i915]
[ 421.176082]  [<ffffffffa04a8879>] drm_dev_unregister+0x29/0xb0 [drm]
[ 421.176094]  [<ffffffffa04a8f33>] drm_put_dev+0x23/0x70 [drm]
[ 421.176105]  [<ffffffffa05fb1a5>] i915_pci_remove+0x15/0x20 [i915]
[ 421.176112]  [<ffffffff8145ffd9>] pci_device_remove+0x39/0xc0
[ 421.176119]  [<ffffffff8156d3a6>] __device_release_driver+0x96/0x130
[ 421.176125]  [<ffffffff8156d558>] driver_detach+0xb8/0xc0
[ 421.176131]  [<ffffffff8156c538>] bus_remove_driver+0x58/0xd0
[ 421.176137]  [<ffffffff8156df3c>] driver_unregister+0x2c/0x50
[ 421.176142]  [<ffffffff8145e6ba>] pci_unregister_driver+0x2a/0x70
[ 421.176154]  [<ffffffffa04aa769>] drm_pci_exit+0x79/0xa0 [drm]
[ 421.176176]  [<ffffffffa06bd095>] i915_exit+0x20/0xf8b [i915]
[ 421.176183]  [<ffffffff811246bd>] SyS_delete_module+0x19d/0x1f0
[ 421.176190]  [<ffffffff81825d36>] entry_SYSCALL_64_fastpath+0x16/0x7a
[ 421.176195] Code: 41 54 49 89 ff 53 48 8d 5f 60 48 83 ec 10 48 89 45 c8 4c 8b 2b 4d 85 ed 0f 84 a4 00 00 00 4c 8b 73 08 4d 8b 66 10 49 8b 44 24 10 <8b> 40 60 83 f8 01 0f 84 11 01 00 00 4d 3b bc 24 48 02 00 00 0f
[ 421.176243] RIP  [<ffffffffa064a631>] intel_lr_context_free+0x41/0x210 [i915]
[ 421.176264]  RSP <ffff880145707ce8>
[ 421.176267] CR2: 0000000000000060
[ 421.176272] ---[ end trace c465bebdbea3b712 ]---
[ 421.176278] BUG: sleeping function called from invalid context at include/linux/sched.h:2776
[ 421.176285] in_atomic(): 0, irqs_disabled(): 1, pid: 1872, name: rmmod
[ 421.176289] INFO: lockdep is turned off.
[ 421.176293] irq event stamp: 80390
[ 421.176297] hardirqs last enabled at (80389): [<ffffffff818253f6>] _raw_spin_unlock_irqrestore+0x36/0x60
[ 421.176306] hardirqs last disabled at (80390): [<ffffffff8182828e>] error_entry+0x6e/0xc0
[ 421.176315] softirqs last enabled at (78860): [<ffffffff81087cc4>] __do_softirq+0x374/0x470
[ 421.176324] softirqs last disabled at (78847): [<ffffffff81087fda>] irq_exit+0xfa/0x130
[ 421.176333] CPU: 3 PID: 1872 Comm: rmmod Tainted: G      D         4.4.0-rc4+ #34
[ 421.176339] Hardware name: Intel Corporation Broadwell Client platform/SawTooth Peak, BIOS BDW-E1R1.86C.0115.R00.1502122119 02/12/2015
[ 421.176348]  ffffffff81ca2673 ffff8801457079f0 ffffffff8140cf39 ffff8800a7e127c0
[ 421.176356]  ffff880145707a18 ffffffff810ad40d ffffffff81ca2673 0000000000000ad8
[ 421.176364]  0000000000000000 ffff880145707a40 ffffffff810ad519 ffff8800a7e127c0
[ 421.176373] Call Trace:
[ 421.176378]  [<ffffffff8140cf39>] dump_stack+0x4b/0x72
[ 421.176384]  [<ffffffff810ad40d>] ___might_sleep+0x17d/0x240
[ 421.176390]  [<ffffffff810ad519>] __might_sleep+0x49/0x80
[ 421.176395]  [<ffffffff81094914>] exit_signals+0x24/0x130
[ 421.176401]  [<ffffffff81085474>] do_exit+0xb4/0xbd0
[ 421.176407]  [<ffffffff810f27d0>] ? kmsg_dump+0x120/0x190
[ 421.176413]  [<ffffffff8101f331>] oops_end+0xa1/0xd0
[ 421.176418]  [<ffffffff8106d4fd>] no_context+0x10d/0x380
[ 421.176424]  [<ffffffff8106d7f6>] __bad_area_nosemaphore+0x86/0x200
[ 421.176429]  [<ffffffff8106d983>] bad_area_nosemaphore+0x13/0x20
[ 421.176434]  [<ffffffff8106dc29>] __do_page_fault+0x99/0x450
[ 421.176441]  [<ffffffff8142b3fd>] ? debug_object_activate+0x13d/0x1c0
[ 421.176446]  [<ffffffff8106e00f>] do_page_fault+0x2f/0x80
[ 421.176452]  [<ffffffff81826907>] ? native_iret+0x7/0x7
[ 421.176458]  [<ffffffff81828078>] page_fault+0x28/0x30
[ 421.176476]  [<ffffffffa064a631>] ? intel_lr_context_free+0x41/0x210 [i915]
[ 421.176492]  [<ffffffffa0624b8c>] i915_gem_context_free+0x1bc/0x220 [i915]
[ 421.176507]  [<ffffffffa06253a2>] i915_gem_context_fini+0xa2/0x140 [i915]
[ 421.176531]  [<ffffffffa06bc8f2>] i915_driver_unload+0x162/0x290 [i915]
[ 421.176543]  [<ffffffffa04a8879>] drm_dev_unregister+0x29/0xb0 [drm]
[ 421.176555]  [<ffffffffa04a8f33>] drm_put_dev+0x23/0x70 [drm]
[ 421.176565]  [<ffffffffa05fb1a5>] i915_pci_remove+0x15/0x20 [i915]
[ 421.176571]  [<ffffffff8145ffd9>] pci_device_remove+0x39/0xc0
[ 421.176577]  [<ffffffff8156d3a6>] __device_release_driver+0x96/0x130
[ 421.176583]  [<ffffffff8156d558>] driver_detach+0xb8/0xc0
[ 421.176589]  [<ffffffff8156c538>] bus_remove_driver+0x58/0xd0
[ 421.176595]  [<ffffffff8156df3c>] driver_unregister+0x2c/0x50
[ 421.176600]  [<ffffffff8145e6ba>] pci_unregister_driver+0x2a/0x70
[ 421.176612]  [<ffffffffa04aa769>] drm_pci_exit+0x79/0xa0 [drm]
[ 421.176634]  [<ffffffffa06bd095>] i915_exit+0x20/0xf8b [i915]
[ 421.176640]  [<ffffffff811246bd>] SyS_delete_module+0x19d/0x1f0
[ 421.176646]  [<ffffffff81825d36>] entry_SYSCALL_64_fastpath+0x16/0x7a
[ 421.219190] drv_hangman: executing
[ 421.219688] drv_hangman: starting subtest error-state-basic
[ 421.221143] drv_hangman: exiting, ret=0
[ 421.235754] kms_addfb_basic: executing
[ 421.237545] kms_addfb_basic: exiting, ret=0
[ 421.246746] gem_mmap_gtt: executing
[ 421.248650] gem_mmap_gtt: exiting, ret=0

> ---
>  drivers/gpu/drm/i915/i915_drv.h         |   1 +
>  drivers/gpu/drm/i915/i915_gem.c         |   7 +-
>  drivers/gpu/drm/i915/intel_lrc.c        | 138 ++++++++++++++++++++++++++------
>  drivers/gpu/drm/i915/intel_lrc.h        |   1 +
>  drivers/gpu/drm/i915/intel_ringbuffer.h |   1 +
>  5 files changed, 121 insertions(+), 27 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index 9ab3e25..a59ca13 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -884,6 +884,7 @@ struct intel_context {
>  	struct {
>  		struct drm_i915_gem_object *state;
>  		struct intel_ringbuffer *ringbuf;
> +		bool dirty;
>  		int pin_count;
>  	} engine[I915_NUM_RINGS];
>
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index a6997a8..cd27ecc 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -1362,6 +1362,9 @@ static void i915_gem_request_retire(struct drm_i915_gem_request *request)
>  {
>  	trace_i915_gem_request_retire(request);
>
> +	if (i915.enable_execlists)
> +		intel_lr_context_complete_check(request);
> +
>  	/* We know the GPU must have read the request to have
>  	 * sent us the seqno + interrupt, so use the position
>  	 * of tail of the request to update the last known position
> @@ -2772,10 +2775,6 @@ static void i915_gem_reset_ring_cleanup(struct drm_i915_private *dev_priv,
>  					   struct drm_i915_gem_request,
>  					   execlist_link);
>  		list_del(&submit_req->execlist_link);
> -
> -		if (submit_req->ctx != ring->default_context)
> -			intel_lr_context_unpin(submit_req);
> -
>  		i915_gem_request_unreference(submit_req);
>  	}
>  	spin_unlock_irq(&ring->execlist_lock);
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index 4ebafab..f96fb51 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -571,9 +571,6 @@ static int execlists_context_queue(struct drm_i915_gem_request *request)
>  	struct drm_i915_gem_request *cursor;
>  	int num_elements = 0;
>
> -	if (request->ctx != ring->default_context)
> -		intel_lr_context_pin(request);
> -
>  	i915_gem_request_reference(request);
>
>  	spin_lock_irq(&ring->execlist_lock);
> @@ -737,6 +734,13 @@ intel_logical_ring_advance_and_submit(struct drm_i915_gem_request *request)
>  	if (intel_ring_stopped(ring))
>  		return;
>
> +	if (request->ctx != ring->default_context) {
> +		if (!request->ctx->engine[ring->id].dirty) {
> +			intel_lr_context_pin(request);
> +			request->ctx->engine[ring->id].dirty = true;
> +		}
> +	}
> +
>  	if (dev_priv->guc.execbuf_client)
>  		i915_guc_submit(dev_priv->guc.execbuf_client, request);
>  	else
> @@ -963,12 +967,6 @@ void intel_execlists_retire_requests(struct intel_engine_cs *ring)
>  	spin_unlock_irq(&ring->execlist_lock);
>
>  	list_for_each_entry_safe(req, tmp, &retired_list, execlist_link) {
> -		struct intel_context *ctx = req->ctx;
> -		struct drm_i915_gem_object *ctx_obj =
> -				ctx->engine[ring->id].state;
> -
> -		if (ctx_obj && (ctx != ring->default_context))
> -			intel_lr_context_unpin(req);
>  		list_del(&req->execlist_link);
>  		i915_gem_request_unreference(req);
>  	}
> @@ -1063,21 +1061,39 @@ reset_pin_count:
>  	return ret;
>  }
>
> -void intel_lr_context_unpin(struct drm_i915_gem_request *rq)
> +static void __intel_lr_context_unpin(struct intel_engine_cs *ring,
> +				     struct intel_context *ctx)
>  {
> -	struct intel_engine_cs *ring = rq->ring;
> -	struct drm_i915_gem_object *ctx_obj = rq->ctx->engine[ring->id].state;
> -	struct intel_ringbuffer *ringbuf = rq->ringbuf;
> -
> +	struct drm_i915_gem_object *ctx_obj = ctx->engine[ring->id].state;
> +	struct intel_ringbuffer *ringbuf = ctx->engine[ring->id].ringbuf;
>  	if (ctx_obj) {
>  		WARN_ON(!mutex_is_locked(&ring->dev->struct_mutex));
> -		if (--rq->ctx->engine[ring->id].pin_count == 0) {
> +		if (--ctx->engine[ring->id].pin_count == 0) {
>  			intel_unpin_ringbuffer_obj(ringbuf);
>  			i915_gem_object_ggtt_unpin(ctx_obj);
>  		}
>  	}
>  }
>
> +void intel_lr_context_unpin(struct drm_i915_gem_request *rq)
> +{
> +	__intel_lr_context_unpin(rq->ring, rq->ctx);
> +}
> +
> +void intel_lr_context_complete_check(struct drm_i915_gem_request *req)
> +{
> +	struct intel_engine_cs *ring = req->ring;
> +
> +	if (ring->lrc_last_context && ring->lrc_last_context != req->ctx &&
> +	    ring->lrc_last_context->engine[ring->id].dirty) {
> +		__intel_lr_context_unpin(
> +				ring,
> +				ring->lrc_last_context);
> +		ring->lrc_last_context->engine[ring->id].dirty = false;
> +	}
> +	ring->lrc_last_context = req->ctx;
> +}
> +
>  static int intel_logical_ring_workarounds_emit(struct drm_i915_gem_request *req)
>  {
>  	int ret, i;
> @@ -2352,6 +2368,76 @@ populate_lr_context(struct intel_context *ctx, struct drm_i915_gem_object *ctx_o
>  }
>
>  /**
> + * intel_lr_context_clean_ring() - clean the ring specific parts of an LRC
> + * @ctx: the LR context being freed.
> + * @ring: the engine being cleaned
> + * @ctx_obj: the hw context being unreferenced
> + * @ringbuf: the ringbuf being freed
> + *
> + * Take care of cleaning up the per-engine backing
> + * objects and the logical ringbuffer.
> + */
> +static void
> +intel_lr_context_clean_ring(struct intel_context *ctx,
> +			    struct intel_engine_cs *ring,
> +			    struct drm_i915_gem_object *ctx_obj,
> +			    struct intel_ringbuffer *ringbuf)
> +{
> +	int ret;
> +
> +	WARN_ON(!mutex_is_locked(&ring->dev->struct_mutex));
> +
> +	if (ctx == ring->default_context) {
> +		intel_unpin_ringbuffer_obj(ringbuf);
> +		i915_gem_object_ggtt_unpin(ctx_obj);
> +	}
> +
> +	if (ctx->engine[ring->id].dirty) {
> +		struct drm_i915_gem_request *req = NULL;
> +
> +		/**
> +		 * If there is already a request pending on
> +		 * this ring, wait for that to complete,
> +		 * otherwise create a switch to idle request
> +		 */
> +		if (list_empty(&ring->request_list)) {
> +			int ret;
> +
> +			ret = i915_gem_request_alloc(
> +					ring,
> +					ring->default_context,
> +					&req);
> +			if (!ret)
> +				i915_add_request(req);
> +			else
> +				DRM_DEBUG("Failed to ensure context saved");
> +		} else {
> +			req = list_first_entry(
> +					&ring->request_list,
> +					typeof(*req), list);
> +		}
> +		if (req) {
> +			ret = i915_wait_request(req);
> +			if (ret != 0) {
> +				/**
> +				 * If we get here, there's probably been a ring
> +				 * reset, so we just clean up the dirty flag &
> +				 * pin count.
> +				 */
> +				ctx->engine[ring->id].dirty = false;
> +				__intel_lr_context_unpin(
> +						ring,
> +						ctx);
> +			}
> +		}
> +	}
> +
> +	WARN_ON(ctx->engine[ring->id].pin_count);
> +	intel_ringbuffer_free(ringbuf);
> +	drm_gem_object_unreference(&ctx_obj->base);
> +}
> +
> +/**
>   * intel_lr_context_free() - free the LRC specific bits of a context
>   * @ctx: the LR context to free.
>   *
> @@ -2363,7 +2449,7 @@ void intel_lr_context_free(struct intel_context *ctx)
>  {
>  	int i;
>
> -	for (i = 0; i < I915_NUM_RINGS; i++) {
> +	for (i = 0; i < I915_NUM_RINGS; ++i) {
>  		struct drm_i915_gem_object *ctx_obj = ctx->engine[i].state;
>
>  		if (ctx_obj) {
> @@ -2371,13 +2457,10 @@ void intel_lr_context_free(struct intel_context *ctx)
>  				ctx->engine[i].ringbuf;
>  			struct intel_engine_cs *ring = ringbuf->ring;
>
> -			if (ctx == ring->default_context) {
> -				intel_unpin_ringbuffer_obj(ringbuf);
> -				i915_gem_object_ggtt_unpin(ctx_obj);
> -			}
> -			WARN_ON(ctx->engine[ring->id].pin_count);
> -			intel_ringbuffer_free(ringbuf);
> -			drm_gem_object_unreference(&ctx_obj->base);
> +			intel_lr_context_clean_ring(ctx,
> +						    ring,
> +						    ctx_obj,
> +						    ringbuf);
>  		}
>  	}
>  }
> @@ -2539,5 +2622,14 @@ void intel_lr_context_reset(struct drm_device *dev,
>
>  		ringbuf->head = 0;
>  		ringbuf->tail = 0;
> +
> +		if (ctx->engine[ring->id].dirty) {
> +			__intel_lr_context_unpin(
> +					ring,
> +					ctx);
> +			ctx->engine[ring->id].dirty = false;
> +			if (ring->lrc_last_context == ctx)
> +				ring->lrc_last_context = NULL;
> +		}
>  	}
>  }
> diff --git a/drivers/gpu/drm/i915/intel_lrc.h b/drivers/gpu/drm/i915/intel_lrc.h
> index 0b821b9..fd1b6b4 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.h
> +++ b/drivers/gpu/drm/i915/intel_lrc.h
> @@ -91,6 +91,7 @@ void intel_lr_context_reset(struct drm_device *dev,
>  			struct intel_context *ctx);
>  uint64_t intel_lr_context_descriptor(struct intel_context *ctx,
>  				     struct intel_engine_cs *ring);
> +void intel_lr_context_complete_check(struct drm_i915_gem_request *req);
>
>  /* Execlists */
>  int intel_sanitize_enable_execlists(struct drm_device *dev, int enable_execlists);
> diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
> index 5d1eb20..7c913e7 100644
> --- a/drivers/gpu/drm/i915/intel_ringbuffer.h
> +++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
> @@ -275,6 +275,7 @@ struct intel_engine_cs {
>  				  u32 flush_domains);
>  	int		(*emit_bb_start)(struct drm_i915_gem_request *req,
>  					 u64 offset, unsigned dispatch_flags);
> +	struct intel_context *lrc_last_context;
>
>  	/**
>  	 * List of objects currently involved in rendering from the
> --
> 1.9.1
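A quick first read of the splat above: RIP is intel_lr_context_free+0x41
with RAX == 0 and CR2 == 0x60, which from the Code bytes looks like the
new ctx_obj -> ringbuf -> ring chain dereferencing an engine field that
is already NULL by the time the context is finally freed on unload,
possibly ring->dev after the engine has been torn down. I haven't
decoded it fully, so treat that as a guess. Below is an untested sketch
only to localize where the chain can break, not a proposed fix (the
unload ordering itself probably needs sorting out):

	/* Hypothetical guard in intel_lr_context_free(), untested:
	 * skip the per-engine cleanup when the engine state is already
	 * gone, instead of walking ctx_obj -> ringbuf -> ring blindly.
	 */
	for (i = 0; i < I915_NUM_RINGS; ++i) {
		struct drm_i915_gem_object *ctx_obj = ctx->engine[i].state;
		struct intel_ringbuffer *ringbuf = ctx->engine[i].ringbuf;

		if (!ctx_obj)
			continue;

		if (WARN_ON(!ringbuf || !ringbuf->ring ||
			    !ringbuf->ring->dev)) {
			/* Engine already dead; just drop our reference. */
			drm_gem_object_unreference(&ctx_obj->base);
			continue;
		}

		intel_lr_context_clean_ring(ctx, ringbuf->ring,
					    ctx_obj, ringbuf);
	}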