Re: [PATCH 2/8] drm/i915: Complete the fences as they are cancelled due to wedging

Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> · Mon, 03 Dec 2018 17:36:32 +0000

Quoting Tvrtko Ursulin (2018-12-03 17:11:59)
> 
> On 03/12/2018 11:36, Chris Wilson wrote:
> > We inspect the requests under the assumption that they will be marked as
> > completed when they are removed from the queue. Currently however, in the
> > process of wedging the requests will be removed from the queue before they
> > are completed, so rearrange the code to complete the fences before the
> > locks are dropped.
> > 
> > <1>[  354.473346] BUG: unable to handle kernel NULL pointer dereference at 0000000000000250
> > <6>[  354.473363] PGD 0 P4D 0
> > <4>[  354.473370] Oops: 0000 [#1] PREEMPT SMP PTI
> > <4>[  354.473380] CPU: 0 PID: 4470 Comm: gem_eio Tainted: G     U            4.20.0-rc4-CI-CI_DRM_5216+ #1
> > <4>[  354.473393] Hardware name: Intel Corporation NUC7CJYH/NUC7JYB, BIOS JYGLKCPX.86A.0027.2018.0125.1347 01/25/2018
> > <4>[  354.473480] RIP: 0010:__i915_schedule+0x311/0x5e0 [i915]
> > <4>[  354.473490] Code: 49 89 44 24 20 4d 89 4c 24 28 4d 89 29 44 39 b3 a0 04 00 00 7d 3a 41 8b 44 24 78 85 c0 74 13 48 8b 93 78 04 00 00 48 83 e2 fc <39> 82 50 02 00 00 79 1e 44 89 b3 a0 04 00 00 48 8d bb d0 03 00 00
> 
> This confuses me, isn't the code segment usually at the end?

*shrug* It was cut and paste.

> And then 
> you have another after the call trace which doesn't match 
> __i915_schedule. Anyway, _this_ code seems to be this part:
> 
>                  if (node_to_request(node)->global_seqno &&
>   90d:   8b 43 78                mov    eax,DWORD PTR [rbx+0x78]
>   910:   85 c0                   test   eax,eax
>   912:   74 13                   je     927 <__i915_schedule+0x317>
>  
> i915_seqno_passed(port_request(engine->execlists.port)->global_seqno,
>   914:   49 8b 97 c0 04 00 00    mov    rdx,QWORD PTR [r15+0x4c0]
>   91b:   48 83 e2 fc             and    rdx,0xfffffffffffffffc
>                  if (node_to_request(node)->global_seqno &&
>   91f:   39 82 50 02 00 00       cmp    DWORD PTR [rdx+0x250],eax
>   925:   79 1e                   jns    945 <__i915_schedule+0x335>
> 
> > <4>[  354.473515] RSP: 0018:ffffc900001bba90 EFLAGS: 00010046
> > <4>[  354.473524] RAX: 0000000000000003 RBX: ffff8882624c8008 RCX: f34a737800000000
> > <4>[  354.473535] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8882624c8048
> > <4>[  354.473545] RBP: ffffc900001bbab0 R08: 000000005963f1f1 R09: 0000000000000000
> > <4>[  354.473556] R10: ffffc900001bba10 R11: ffff8882624c8060 R12: ffff88824fdd7b98
> > <4>[  354.473567] R13: ffff88824fdd7bb8 R14: 0000000000000001 R15: ffff88824fdd7750
> > <4>[  354.473578] FS:  00007f44b4b5b980(0000) GS:ffff888277e00000(0000) knlGS:0000000000000000
> > <4>[  354.473590] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > <4>[  354.473599] CR2: 0000000000000250 CR3: 000000026976e000 CR4: 0000000000340ef0
> 
> Given the registers above, I think it means this - eax is global_seqno 
> of the node rq. rdx is port_request, so NULL and bang. No request in 
> port, but why would there always be one at the point we are scheduling 
> in a new request to the runnable queue?

Correct. The answer, as I chose to interpret it, is the submitted+dequeued
requests that are left incomplete during cancellation, which this patch
attempts to address.
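
For anyone following the register analysis, here is a minimal userspace
sketch of the faulting check. It is not the driver code: struct request,
port_pack() and check_node() are hypothetical stand-ins, and the NULL guard
exists only to make the failing case visible (it is not what the patch
does). What is taken from the thread: the port slot is masked with ~3
before use (and rdx,0xfffffffffffffffc), the comparison is a wrap-safe
signed difference (cmp + jns), and with RDX == 0 the load from
[rdx+0x250] gives CR2 == 0x250.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the request; only global_seqno matters here. */
struct request {
	uint32_t global_seqno;
};

/*
 * Wrap-safe "has seq1 reached seq2?" test. The signed difference is what
 * the cmp/jns pair in the disassembly evaluates.
 */
static int seqno_passed(uint32_t seq1, uint32_t seq2)
{
	return (int32_t)(seq1 - seq2) >= 0;
}

/* Pack a small count into the two low bits of an aligned pointer. */
static uintptr_t port_pack(struct request *rq, unsigned int count)
{
	return (uintptr_t)rq | (count & 3);
}

/* Strip the low bits again; an empty port yields NULL. */
static struct request *port_request(uintptr_t port)
{
	return (struct request *)(port & ~(uintptr_t)3);
}

/*
 * Returns 0 if the node has no global_seqno, -1 if the port is empty
 * (the case that oopsed, since the original dereferences unconditionally),
 * otherwise the result of the seqno comparison.
 */
static int check_node(uintptr_t port, uint32_t node_seqno)
{
	struct request *rq = port_request(port);

	if (!node_seqno)
		return 0;
	if (!rq)
		return -1;
	return seqno_passed(rq->global_seqno, node_seqno);
}
```

With the values from the oops (node seqno 3 in RAX, empty port in RDX),
check_node(port_pack(NULL, 0), 3) takes the -1 branch, which is where the
unguarded original reads from address 0x250.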
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx