On Thu, Nov 14, 2013 at 05:01:44PM -0800, Ben Widawsky wrote: > If an object was bound in the ppgtt, and we do a GPU reset, but the > PPGTT was not brought back up on reset, trying to unbind the object > later will result in a NULL ptr. Ideally this (failed PPGTT) should > never happen, but it is allowed in the code, and therefore we should > prevent the OOPS. > > Since Broadwell hangs/reset is still under development, and apparently > so is aliasing PPGTT after rest, this helps alleviate some of the pain. > > NOTE: With the coming PPGTT patches this can't ever occur since there if > PPGTT is supposed to come up, and doesn't the driver will fail to load > (since it will make context loading fail). > > Here is an example splat: > > [ 588.795571] ---[ end trace f23239922ecdffbc ]--- > [ 598.427072] [drm] stuck on render ring > [ 598.473116] [drm] GPU crash dump saved to /sys/class/drm/card0/error > [ 598.550946] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace. > [ 598.663996] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel > [ 598.772830] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue. > [ 598.891172] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it. > [ 599.006668] [drm] Simulated gpu hang, resetting stop_rings > [ 599.084004] [drm:__gen6_gt_force_wake_mt_get] *ERROR* Timed out waiting for forcewake old ack to clear. > [ 599.204258] [drm] PPGTT enable failed. This is not fatal, but unexpected > [ 599.287287] BUG: unable to handle kernel NULL pointer dereference at 0000000000000108 > [ 599.389563] IP: [<ffffffffa0559794>] i915_ppgtt_unbind_object+0x14/0x60 [i915] > [ 599.484426] PGD 34ab6067 PUD 50b2a067 PMD 0 > [ 599.542964] Oops: 0000 [#1] PREEMPT SMP > [ 599.597171] Modules linked in: i915 drm_kms_helper drm intel_gtt agpgart i2c_algo_bit i2c_core netconsole configfs ext4 x86_pkg_temp_thermal coretemp crc16 mbcache kvm_intel jbd2 kvm ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd microcode serio_raw evdev thermal fan battery e1000e acpi_cpufreq video ptp button pps_core ac processor snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_timer snd soundcore hid_generic usbhid hid btrfs libcrc32c xor raid6_pq sd_mod ehci_pci ehci_hcd ahci libahci crc32c_intel libata usbcore scsi_mod usb_common > [ 600.268119] CPU: 0 PID: 2612 Comm: kms_flip Tainted: G W 3.12.0-BEN+ #38 > [ 600.366889] Hardware name: Intel Corporation Broadwell Client platform/WhiteTip Mountain 1, BIOS BDW-E1R1.86C.0048.R02.1310291000 10/29/2013 > [ 600.527990] task: ffff88009d28b0c0 ti: ffff88003b5d4000 task.ti: ffff88003b5d4000 > [ 600.626121] RIP: 0010:[<ffffffffa0559794>] [<ffffffffa0559794>] i915_ppgtt_unbind_object+0x14/0x60 [i915] > [ 600.750986] RSP: 0018:ffff88003b5d5ca0 EFLAGS: 00010202 > [ 600.822380] RAX: 0000000000000004 RBX: ffff88008eac5a40 RCX: 00000000000000fe > [ 600.916281] RDX: 0000000000000000 RSI: ffff88008eac5a40 RDI: ffff88008eac5a40 > [ 601.010181] RBP: ffff88003b5d5cb8 R08: 0000000000000000 R09: 0000000000000000 > [ 601.104083] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 > [ 601.197983] R13: ffff88008eac5b30 R14: ffff880145208000 R15: ffff88008eac5ac8 > [ 601.291686] FS: 00007fabc555a8c0(0000) GS:ffff88014f400000(0000) knlGS:0000000000000000 > [ 601.397339] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 601.474153] CR2: 0000000000000108 CR3: 000000008eadd000 CR4: 00000000003407f0 > [ 601.568035] Stack: > [ 601.599051] ffff88008eac5a40 ffff88005cb96000 ffff88008eac5b30 ffff88003b5d5cf0 > [ 601.696409] ffffffffa055003f ffff88008eac5a40 ffff88005cb96000 ffff88008eac5b30 > [ 601.793636] ffff880145208000 ffff88008eac5ac8 ffff88003b5d5d30 ffffffffa05513fe > [ 601.890889] Call Trace: > [ 601.927323] [<ffffffffa055003f>] i915_vma_unbind+0x28f/0x340 [i915] > [ 602.011520] [<ffffffffa05513fe>] i915_gem_free_object+0x9e/0x340 [i915] > [ 602.100135] [<ffffffff810b81cd>] ? trace_hardirqs_on+0xd/0x10 > [ 602.178010] [<ffffffffa04c248a>] drm_gem_object_free+0x2a/0x30 [drm] > [ 602.263249] [<ffffffffa04c29fa>] drm_gem_object_handle_unreference_unlocked+0x11a/0x130 [drm] > [ 602.375214] [<ffffffffa04c2ae6>] drm_gem_handle_delete+0xd6/0x1d0 [drm] > [ 602.463759] [<ffffffffa04c3358>] drm_gem_close_ioctl+0x28/0x30 [drm] > [ 602.549031] [<ffffffffa04c0d92>] drm_ioctl+0x502/0x640 [drm] > [ 602.625820] [<ffffffff8115ac70>] ? might_fault+0xa0/0xb0 > [ 602.698152] [<ffffffff8115ac27>] ? might_fault+0x57/0xb0 > [ 602.770831] [<ffffffff8100f0ec>] ? __restore_xstate_sig+0x13c/0x600 > [ 602.855035] [<ffffffff811bb6c5>] do_vfs_ioctl+0x305/0x530 > [ 602.928680] [<ffffffff811c73a7>] ? fget_light+0x387/0x4f0 > [ 603.001415] [<ffffffff811bb971>] SyS_ioctl+0x81/0xa0 > [ 603.069506] [<ffffffff814dd6d6>] system_call_fastpath+0x1a/0x1f > [ 603.148332] Code: 89 e7 89 c2 41 ff d6 5b 41 5c 41 5d 41 5e 5d c3 0f 1f 80 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41 55 41 54 49 89 fc 48 89 f7 53 <4d> 8b ac 24 08 01 00 00 48 8b 56 08 48 8b 9e b8 00 00 00 48 8b > [ 603.392564] RIP [<ffffffffa0559794>] i915_ppgtt_unbind_object+0x14/0x60 [i915] > [ 603.487472] RSP <ffff88003b5d5ca0> > [ 603.535732] CR2: 0000000000000108 > [ 603.622175] ---[ end trace f23239922ecdffbd ]--- > > Signed-off-by: Ben Widawsky <ben@xxxxxxxxxxxx> > --- > drivers/gpu/drm/i915/i915_gem.c | 7 +++++++ > 1 file changed, 7 insertions(+) > > diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c > index 40d9dcf..2b9245d 100644 > --- a/drivers/gpu/drm/i915/i915_gem.c > +++ b/drivers/gpu/drm/i915/i915_gem.c > @@ -2741,6 +2741,13 @@ int i915_vma_unbind(struct i915_vma *vma) > > if (obj->has_global_gtt_mapping) > i915_gem_gtt_unbind_object(obj); > + > + if (unlikely(!dev_priv->mm.aliasing_ppgtt && > + obj->has_aliasing_ppgtt_mapping)) { > + DRM_DEBUG_DRIVER("Leftover PPGTT mapping after reset\n"); > + obj->has_aliasing_ppgtt_mapping = 0; > + } Nack on sprinkling band-aids all over the place when imo the real bug is that the our init_hw functions destroys sw tracking state behind our back: http://www.mail-archive.com/intel-gfx@xxxxxxxxxxxxxxxxxxxxx/msg27025.html If we rip out the i915_gem_cleanup_aliasing_ppgtt we shouldn't blow up, but can keep on going (without the gt ofc). -Daniel > + > if (obj->has_aliasing_ppgtt_mapping) { > i915_ppgtt_unbind_object(dev_priv->mm.aliasing_ppgtt, obj); > obj->has_aliasing_ppgtt_mapping = 0; > -- > 1.8.4.2 > > _______________________________________________ > Intel-gfx mailing list > Intel-gfx@xxxxxxxxxxxxxxxxxxxxx > http://lists.freedesktop.org/mailman/listinfo/intel-gfx -- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch _______________________________________________ Intel-gfx mailing list Intel-gfx@xxxxxxxxxxxxxxxxxxxxx http://lists.freedesktop.org/mailman/listinfo/intel-gfx