Re: [Bug #13819] system freeze when switching to console

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Tue, 8 Sep 2009 11:06:21 -0700 (PDT)

On Tue, 8 Sep 2009, reinette chatre wrote:
> 
> As you can see from the kernel version it is not a build of a vanilla
> kernel. It only contains changes related to the wireless networking work
> I am doing.
> 
> Here is the output:

Thanks, this is great. It pinpoints the problem very effectively.

> [  352.803960] BUG: unable to handle kernel NULL pointer dereference at 0000000000000084
> [  352.804006] IP: [<ffffffffa03ecaab>] i915_driver_irq_handler+0x26b/0xd20 [i915]

The code here is

	  16:	48 8b 80 00 01 00 00 	mov    0x100(%rax),%rax
	  1d:	48 8b 50 08          	mov    0x8(%rax),%rdx
	  21:	48 85 d2             	test   %rdx,%rdx
	  24:	74 11                	je     0x37
	  26:	49 8b 44 24 78       	mov    0x78(%r12),%rax
	  2b:*	8b 80 84 00 00 00    	mov    0x84(%rax),%eax     <-- trapping instruction
	  31:	89 82 08 08 00 00    	mov    %eax,0x808(%rdx)
	  37:	f6 45 a0 02          	testb  $0x2,-0x60(%rbp)

and that "testb $0x2, -0x60(%rbp)" seems to be the

	if (iir & I915_USER_INTERRUPT) {

test if I'm reading things right. Although it could also be the

	if (eir & I915_ERROR_MEMORY_REFRESH) {

thing. The disassembly is totally impossible to read, because the stupid 
i915 driver is chock-full of crap like

	if (IS_G4X(dev)) {
		..

which expands to insane amounts of code that check the PCI ID's one by 
one.

Intel guys: could you _please_ stop doing that. Create a capability mask 
in the device or something, so that you can test for "is this a G4x" with 
a single bit test, rather than have code like this:

        mov    0x31c(%rsi),%eax
        cmp    $0x2982,%eax
        je     0xffffffff8124b669 <i915_driver_irq_handler+177>
        cmp    $0x2972,%eax
        je     0xffffffff8124b669 <i915_driver_irq_handler+177>
        cmp    $0x2992,%eax
        je     0xffffffff8124b669 <i915_driver_irq_handler+177>
        cmp    $0x29a2,%eax
        je     0xffffffff8124b669 <i915_driver_irq_handler+177>
        cmp    $0x2a02,%eax
        je     0xffffffff8124b669 <i915_driver_irq_handler+177>
        cmp    $0x2a12,%eax
        je     0xffffffff8124b669 <i915_driver_irq_handler+177>
        cmp    $0x2a42,%eax
        je     0xffffffff8124b669 <i915_driver_irq_handler+177>
        cmp    $0x2e02,%eax
        je     0xffffffff8124b669 <i915_driver_irq_handler+177>
        cmp    $0x2e12,%eax
        je     0xffffffff8124b669 <i915_driver_irq_handler+177>
        cmp    $0x2e22,%eax
        je     0xffffffff8124b669 <i915_driver_irq_handler+177>
        cmp    $0x2e32,%eax
        je     0xffffffff8124b669 <i915_driver_irq_handler+177>
        cmp    $0x42,%eax
        je     0xffffffff8124b669 <i915_driver_irq_handler+177>

for that IS_G4X() thing (I'm not kidding - that's exactly a hundred bytes 
of code for that _stupid_ test, and it's inlined!)

Anyway, we're getting that DRM irq, and it has a normal IRQ stack trace:

> [  352.804006] Process Xorg (pid: 4424, threadinfo ffff8800b6b1a000, task ffff880037373c00)
> [  352.804006] Call Trace:
> [  352.804006]  <IRQ> 
> [  352.804006]  [<ffffffff8106db7d>] ? mark_held_locks+0x6d/0x90
> [  352.804006]  [<ffffffff81098ee8>] handle_IRQ_event+0x68/0x170
> [  352.804006]  [<ffffffff8109ac01>] handle_edge_irq+0xc1/0x160
> [  352.804006]  [<ffffffff8100e76f>] handle_irq+0x1f/0x30
> [  352.804006]  [<ffffffff8100dc6a>] do_IRQ+0x6a/0xf0
> [  352.804006]  [<ffffffff8100c793>] ret_from_intr+0x0/0xf

.. but it happened just as we're tearing down the DRM irq handling:

> [  352.804006]  <EOI> 
> [  352.804006]  [<ffffffff81070b88>] ? lock_acquire+0xe8/0x100
> [  352.804006]  [<ffffffffa03c0b85>] ? drm_irq_uninstall+0x65/0x180 [drm]
> [  352.804006]  [<ffffffff8132d7b5>] ? mutex_lock_nested+0x45/0x320
> [  352.804006]  [<ffffffffa03c0b85>] ? drm_irq_uninstall+0x65/0x180 [drm]
> [  352.804006]  [<ffffffff8106de85>] ? trace_hardirqs_on_caller+0x145/0x190
> [  352.804006]  [<ffffffff8106dedd>] ? trace_hardirqs_on+0xd/0x10
> [  352.804006]  [<ffffffffa03c0b85>] ? drm_irq_uninstall+0x65/0x180 [drm]
> [  352.804006]  [<ffffffffa03f3335>] ? i915_gem_idle+0x225/0x330 [i915]
> [  352.804006]  [<ffffffffa03f34c7>] ? i915_gem_leavevt_ioctl+0x37/0x50 [i915]
> [  352.804006]  [<ffffffffa03bdafd>] ? drm_ioctl+0x17d/0x3c0 [drm]
> [  352.804006]  [<ffffffffa03f3490>] ? i915_gem_leavevt_ioctl+0x0/0x50 [i915]

so what is going on is that the i915 driver has obviously torn down some 
state before it uninstalls the irq, so the irq happens when the state has 
already been torn down, and the irq handler is not ready for that.

This patch *may* fix it - simply by getting rid of the irq early. However, 
I did not check whether maybe something in i915_gem_idle() actually needs 
the interrupt to be able to happen, so this is TOTALLY UNTESTED!

		Linus
---
 drivers/gpu/drm/i915/i915_gem.c |    6 +-----
 1 files changed, 1 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 7edb5b9..80e5ba4 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -4232,15 +4232,11 @@ int
 i915_gem_leavevt_ioctl(struct drm_device *dev, void *data,
 		       struct drm_file *file_priv)
 {
-	int ret;
-
 	if (drm_core_check_feature(dev, DRIVER_MODESET))
 		return 0;
 
-	ret = i915_gem_idle(dev);
 	drm_irq_uninstall(dev);
-
-	return ret;
+	return i915_gem_idle(dev);
 }
 
 void
--
To unsubscribe from this list: send the line "unsubscribe kernel-testers" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html