On Fri, Jan 09, 2015 at 06:47:41PM +0000, Dave Gordon wrote: > On 08/01/15 13:40, Mika Kuoppala wrote: > > i915_gem_validate_context() will check the engine->state to see if it can > > submit into a ringbuffer. But when we are releasing the context we leave the > > engine state to a non null value. Thus after a successful hang recovery > > we might mistakenly submit to a non initialized ringbuffer resulting in: > > > > [ 1991.356418] ------------[ cut here ]------------ > > [ 1991.359192] WARNING: CPU: 1 PID: 2335 at lib/iomap.c:43 bad_io_access+0x3d/0x40() > > [ 1991.361966] Bad IO access at port 0x24 (outl(val,port)) > > [ 1991.364750] Modules linked in: snd_hda_codec_hdmi i915 x86_pkg_temp_thermal coretemp kvm_intel kvm snd_hda_intel snd_hda_controller snd_hda_codec crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hwdep snd_pcm aesni_intel aes_x86_64 glue_helper lrw i2c_algo_bit gf128mul ablk_helper drm_kms_helper cryptd snd_seq_midi snd_seq_midi_event serio_raw drm snd_rawmidi snd_seq snd_seq_device snd_timer video snd soundcore mei_me lpc_ich bnep mac_hid acpi_pad mei rfcomm bluetooth parport_pc ppdev lp parport nls_iso8859_1 e1000e ptp ahci libahci pps_core sdhci_acpi sdhci > > [ 1991.370827] CPU: 1 PID: 2335 Comm: gem_ringfill Tainted: G W 3.19.0-rc3+ #50 > > [ 1991.373838] Hardware name: Intel Corporation Broadwell Client platform/SawTooth Peak, BIOS BDW-E1R1.86C.0092.R00.1408311942 08/31/2014 > > [ 1991.376902] ffffffff81aa1a46 ffff88014910fac8 ffffffff8173dbcf 0000000000000001 > > [ 1991.379978] ffff88014910fb18 ffff88014910fb08 ffffffff8107007a ffff88014910fb28 > > [ 1991.383037] ffff880147209940 ffff8800aafa8718 ffff8800aafa0000 ffff8800aafa1918 > > [ 1991.386094] Call Trace: > > [ 1991.389140] [<ffffffff8173dbcf>] dump_stack+0x45/0x57 > > [ 1991.392207] [<ffffffff8107007a>] warn_slowpath_common+0x8a/0xc0 > > [ 1991.395268] [<ffffffff810700f6>] warn_slowpath_fmt+0x46/0x50 > > [ 1991.398330] [<ffffffffa053290c>] ? intel_logical_ring_begin+0x3c/0x240 [i915] > > [ 1991.401395] [<ffffffff813985bd>] bad_io_access+0x3d/0x40 > > [ 1991.404462] [<ffffffff81398763>] iowrite32+0x33/0x40 > > [ 1991.407529] [<ffffffffa0533585>] gen8_init_rcs_context+0xd5/0x170 [i915] > > [ 1991.410605] [<ffffffffa0533d17>] intel_lr_context_deferred_create+0x657/0x8e0 [i915] > > [ 1991.413668] [<ffffffffa050eff1>] i915_gem_do_execbuffer.isra.22+0xed1/0xf60 [i915] > > [ 1991.416736] [<ffffffff811c0125>] ? __kmalloc+0x55/0x1b0 > > [ 1991.419801] [<ffffffffa051029c>] ? i915_gem_execbuffer2+0x6c/0x2c0 [i915] > > [ 1991.422772] [<ffffffffa05102e1>] i915_gem_execbuffer2+0xb1/0x2c0 [i915] > > [ 1991.425632] [<ffffffffa01b8ab4>] drm_ioctl+0x1a4/0x630 [drm] > > [ 1991.428454] [<ffffffff811258bc>] ? acct_account_cputime+0x1c/0x20 > > [ 1991.431255] [<ffffffff811ee378>] do_vfs_ioctl+0x2f8/0x510 > > [ 1991.434009] [<ffffffff8109f834>] ? vtime_account_user+0x54/0x60 > > [ 1991.436778] [<ffffffff811ee611>] SyS_ioctl+0x81/0xa0 > > [ 1991.439553] [<ffffffff81745cb4>] ? int_check_syscall_exit_work+0x34/0x3d > > [ 1991.442306] [<ffffffff81745a2d>] system_call_fastpath+0x16/0x1b > > > > Fix this by setting all the engine fields properly when lrc is freed. > > > > Cc: Thomas Daniel <thomas.daniel@xxxxxxxxx> > > Signed-off-by: Mika Kuoppala <mika.kuoppala@xxxxxxxxx> > > --- > > drivers/gpu/drm/i915/intel_lrc.c | 4 ++++ > > 1 file changed, 4 insertions(+) > > > > diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c > > index 7670a0f..32684d9 100644 > > --- a/drivers/gpu/drm/i915/intel_lrc.c > > +++ b/drivers/gpu/drm/i915/intel_lrc.c > > @@ -1777,6 +1777,10 @@ void intel_lr_context_free(struct intel_context *ctx) > > intel_destroy_ringbuffer_obj(ringbuf); > > kfree(ringbuf); > > drm_gem_object_unreference(&ctx_obj->base); > > + WARN_ON(ctx->engine[i].unpin_count != 0); > > + ctx->engine[i].unpin_count = 0; > > + ctx->engine[i].ringbuf = NULL; > > + ctx->engine[i].state = NULL; > > } > > } > > } > > Hi, > > I don't quite see how this can fix the problem illustrated by the stack > trace above. AFAICS intel_lr_context_free() is called /only/ from > i915_gem_context_free(), which should mean that the refcount on the > intel_context object is already zero, and that it will be freed on > return. So the contents of ctx->engine[] should be irrelevant ... I've now managed to hit this fairly reliably with 'gem_ringfill --r render' + ctrl-c straight after boot. To me it looks like some kind of bigger mess with olr. If olr was already there, logical_ring_alloc_request() will do absolutely nothing, which means the ringbuf won't even be pinned/mapped and ringbug->virtual_start will remain NULL. And that leads to the w/a init to call iowrite() with some close to 0 iomem addressses. I'm not sure how this is supposed to actually work. To me it looks olr should never be allowed to be != NULL when creating the context/ringbuf to avoid this problem. And yet somehow the presence of olr doesn't always trigger the 'bad IO access' warns. In fact I can't seem to trigger it more than once until I reboot the machine. I tried to add a warning [1] to catch all cases when olr is already there, and it will trigger in most cases when I hit ctrl-c while running gem_ringfill. And yet the 'Bad IO access' doesn't follow it except once after boot. [1] @@ -1906,6 +1926,8 @@ int intel_lr_context_deferred_create(struct intel_context *ctx, } + WARN_ON(ring->outstanding_lazy_request); + ret = populate_lr_context(ctx, ctx_obj, ring, ringbuf); if (ret) { DRM_DEBUG_DRIVER("Failed to populate LRC: %d\n", ret); > > void i915_gem_context_free(struct kref *ctx_ref) > { > struct intel_context *ctx = container_of(ctx_ref, > typeof(*ctx), ref); > > trace_i915_context_free(ctx); > > if (i915.enable_execlists) > intel_lr_context_free(ctx); > > i915_ppgtt_put(ctx->ppgtt); > > if (ctx->legacy_hw_ctx.rcs_state) > drm_gem_object_unreference(&ctx->legacy_hw_ctx.rcs_state->base); > > list_del(&ctx->link); > kfree(ctx); > } > > ... unless something is trying to reuse the context while it is still in > the process of being deleted :( > > In addition, the stack trace above implies that ctx->engine[].state WAS > NULL when i915_gem_validate_context() was called, otherwise it would not > have called intel_lr_context_deferred_create() > > if (i915.enable_execlists && !ctx->engine[ring->id].state) { > int ret = intel_lr_context_deferred_create(ctx, ring); > if (ret) { > DRM_DEBUG("Could not create LRC %u: %d\n", ctx_id, ret); > return ERR_PTR(ret); > } > } > > and likewise that function would not have called gen8_init_rcs_context() > unless this was a new context: > > if (ctx == ring->default_context) > lrc_setup_hardware_status_page(ring, ctx_obj); > else if (ring->id == RCS && !ctx->rcs_initialized) { > if (ring->init_context) { > ret = ring->init_context(ring, ctx); > if (ret) { > DRM_ERROR("ring init context: %d\n", ret); > ctx->engine[ring->id].ringbuf = NULL; > ctx->engine[ring->id].state = NULL; > goto error; > } > } > > ctx->rcs_initialized = true; > } > > Note that rcs_initialized is never cleared, even with your change, so > that in a use-after-free situation we wouldn't end up in this path. So I > think the mystery is how this context ended up in an inconsistent state: > has it been partially freed and then reused, or has some part of the new > context allocation path failed but not been unwound correctly? > > And if setting to NULL a pointer that's inside a structure that's in the > process of being freed actually makes a difference, doesn't that mean > there's a use-after-free issue somewhere? > > .Dave. > _______________________________________________ > Intel-gfx mailing list > Intel-gfx@xxxxxxxxxxxxxxxxxxxxx > http://lists.freedesktop.org/mailman/listinfo/intel-gfx -- Ville Syrjälä Intel OTC _______________________________________________ Intel-gfx mailing list Intel-gfx@xxxxxxxxxxxxxxxxxxxxx http://lists.freedesktop.org/mailman/listinfo/intel-gfx