Re: [PATCH 2/8] drm/i915: Complete the fences as they are cancelled due to wedging

Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxxxxxxx> · Tue, 4 Dec 2018 10:30:15 +0000

On 03/12/2018 17:36, Chris Wilson wrote:
Quoting Tvrtko Ursulin (2018-12-03 17:11:59)

On 03/12/2018 11:36, Chris Wilson wrote:
We inspect the requests under the assumption that they will be marked as
completed when they are removed from the queue. Currently, however, in the
process of wedging, the requests will be removed from the queue before they
are completed, so rearrange the code to complete the fences before the
locks are dropped.

<1>[  354.473346] BUG: unable to handle kernel NULL pointer dereference at 0000000000000250
<6>[  354.473363] PGD 0 P4D 0
<4>[  354.473370] Oops: 0000 [#1] PREEMPT SMP PTI
<4>[  354.473380] CPU: 0 PID: 4470 Comm: gem_eio Tainted: G     U            4.20.0-rc4-CI-CI_DRM_5216+ #1
<4>[  354.473393] Hardware name: Intel Corporation NUC7CJYH/NUC7JYB, BIOS JYGLKCPX.86A.0027.2018.0125.1347 01/25/2018
<4>[  354.473480] RIP: 0010:__i915_schedule+0x311/0x5e0 [i915]
<4>[  354.473490] Code: 49 89 44 24 20 4d 89 4c 24 28 4d 89 29 44 39 b3 a0 04 00 00 7d 3a 41 8b 44 24 78 85 c0 74 13 48 8b 93 78 04 00 00 48 83 e2 fc <39> 82 50 02 00 00 79 1e 44 89 b3 a0 04 00 00 48 8d bb d0 03 00 00

This confuses me, isn't the code segment usually at the end?

*shrug* It was cut and paste.

And then
you have another after the call trace which doesn't match
__i915_schedule. Anyway, _this_ code seems to be this part:

                  if (node_to_request(node)->global_seqno &&
   90d:   8b 43 78                mov    eax,DWORD PTR [rbx+0x78]
   910:   85 c0                   test   eax,eax
   912:   74 13                   je     927 <__i915_schedule+0x317>

i915_seqno_passed(port_request(engine->execlists.port)->global_seqno,
   914:   49 8b 97 c0 04 00 00    mov    rdx,QWORD PTR [r15+0x4c0]
   91b:   48 83 e2 fc             and    rdx,0xfffffffffffffffc
                  if (node_to_request(node)->global_seqno &&
   91f:   39 82 50 02 00 00       cmp    DWORD PTR [rdx+0x250],eax
   925:   79 1e                   jns    945 <__i915_schedule+0x335>

<4>[  354.473515] RSP: 0018:ffffc900001bba90 EFLAGS: 00010046
<4>[  354.473524] RAX: 0000000000000003 RBX: ffff8882624c8008 RCX: f34a737800000000
<4>[  354.473535] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8882624c8048
<4>[  354.473545] RBP: ffffc900001bbab0 R08: 000000005963f1f1 R09: 0000000000000000
<4>[  354.473556] R10: ffffc900001bba10 R11: ffff8882624c8060 R12: ffff88824fdd7b98
<4>[  354.473567] R13: ffff88824fdd7bb8 R14: 0000000000000001 R15: ffff88824fdd7750
<4>[  354.473578] FS:  00007f44b4b5b980(0000) GS:ffff888277e00000(0000) knlGS:0000000000000000
<4>[  354.473590] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[  354.473599] CR2: 0000000000000250 CR3: 000000026976e000 CR4: 0000000000340ef0

Given the registers above, I think it means this: eax is the global_seqno
of the node rq, and rdx is port_request, which is NULL, so bang. No request
in port, but why would there always be one at the point we are scheduling
in a new request to the runnable queue?

Correct. The answer, as I chose to interpret it, is because of the
incomplete submitted+dequeued requests during cancellation which this
patch attempts to address.
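
For anyone following the register analysis above, here is a minimal user-space sketch of why an empty port faults at exactly CR2 = 0x250. It is not the i915 code: struct fake_request, port_pack(), port_unpack(), seqno_access_address() and seqno_passed() are hypothetical names made up for illustration. The only things taken from the thread are the 0x250 offset, the masking of the low two bits (and rdx,0xfffffffffffffffc) and the cmp/jns signed comparison.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the request. Only the field offset matters:
 * 0x250 comes from "cmp DWORD PTR [rdx+0x250],eax" in the disassembly.
 */
struct fake_request {
	unsigned char pad[0x250];
	uint32_t global_seqno;
};

/* The disassembly clears the low two bits before dereferencing. */
#define PORT_COUNT_MASK ((uintptr_t)0x3)

/* Pack a request pointer and a small count into one word (assumed layout). */
static inline uintptr_t port_pack(struct fake_request *rq, unsigned int count)
{
	assert(((uintptr_t)rq & PORT_COUNT_MASK) == 0);
	assert(count <= PORT_COUNT_MASK);
	return (uintptr_t)rq | count;
}

/* Recover the pointer; mirrors "and rdx,0xfffffffffffffffc". */
static inline struct fake_request *port_unpack(uintptr_t port)
{
	return (struct fake_request *)(port & ~PORT_COUNT_MASK);
}

/* Address the CPU would read when fetching global_seqno via the port. */
static inline uintptr_t seqno_access_address(uintptr_t port)
{
	return (port & ~PORT_COUNT_MASK) +
	       offsetof(struct fake_request, global_seqno);
}

/* Wrap-safe "seq1 is at or after seq2", the same shape as cmp + jns. */
static inline int seqno_passed(uint32_t seq1, uint32_t seq2)
{
	return (int32_t)(seq1 - seq2) >= 0;
}
```

With an empty port the unpacked pointer is NULL, so seqno_access_address() yields 0x250, which matches both RDX = 0 and CR2 = 0x250 in the oops. The sketch only computes the address rather than dereferencing it, so it is safe to run.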

I couldn't find any other route to this state myself, so on the basis of 
that, but with a little bit of fear from "Could it have really been so 
much simpler all along?!":

Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxx>

Regards,

Tvrtko
