On 03/12/2018 17:36, Chris Wilson wrote:
Quoting Tvrtko Ursulin (2018-12-03 17:11:59)
On 03/12/2018 11:36, Chris Wilson wrote:
We inspect the requests under the assumption that they will be marked as
completed when they are removed from the queue. Currently, however, in the
process of wedging, the requests are removed from the queue before they
are completed, so rearrange the code to complete the fences before the
locks are dropped.
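A toy model of the ordering problem described above, as a sketch only. This is not the i915 code: the struct, the function names, and the "lock" (indicated purely by comments) are hypothetical. It only shows that an observer relying on "not queued implies completed" can see a violation if completion happens after the unlock point.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a request: only the two states discussed. */
struct request {
	bool queued;
	bool completed;
};

/* What an observer assumes whenever the lock is not held:
 * a request that has left the queue has been completed. */
static bool invariant_holds(const struct request *rq, size_t n)
{
	assert(rq != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if (!rq[i].queued && !rq[i].completed)
			return false;
	return true;
}

/* Problematic order: dequeue under the lock, drop it, complete later.
 * Returns what an observer would see at the unlock point. */
static bool wedge_dequeue_then_complete(struct request *rq, size_t n)
{
	bool seen_at_unlock;

	/* lock taken here */
	for (size_t i = 0; i < n; i++)
		rq[i].queued = false;
	/* lock dropped here */
	seen_at_unlock = invariant_holds(rq, n);

	for (size_t i = 0; i < n; i++)
		rq[i].completed = true;
	return seen_at_unlock;
}

/* Rearranged order: complete before the lock is dropped. */
static bool wedge_complete_then_unlock(struct request *rq, size_t n)
{
	/* lock taken here */
	for (size_t i = 0; i < n; i++) {
		rq[i].queued = false;
		rq[i].completed = true;
	}
	/* lock dropped here */
	return invariant_holds(rq, n);
}
```

With at least one pending request, the first variant exposes a dequeued but incomplete request at the unlock point; the second does not.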
<1>[ 354.473346] BUG: unable to handle kernel NULL pointer dereference at 0000000000000250
<6>[ 354.473363] PGD 0 P4D 0
<4>[ 354.473370] Oops: 0000 [#1] PREEMPT SMP PTI
<4>[ 354.473380] CPU: 0 PID: 4470 Comm: gem_eio Tainted: G U 4.20.0-rc4-CI-CI_DRM_5216+ #1
<4>[ 354.473393] Hardware name: Intel Corporation NUC7CJYH/NUC7JYB, BIOS JYGLKCPX.86A.0027.2018.0125.1347 01/25/2018
<4>[ 354.473480] RIP: 0010:__i915_schedule+0x311/0x5e0 [i915]
<4>[ 354.473490] Code: 49 89 44 24 20 4d 89 4c 24 28 4d 89 29 44 39 b3 a0 04 00 00 7d 3a 41 8b 44 24 78 85 c0 74 13 48 8b 93 78 04 00 00 48 83 e2 fc <39> 82 50 02 00 00 79 1e 44 89 b3 a0 04 00 00 48 8d bb d0 03 00 00
This confuses me, isn't the code segment usually at the end?
*shrug* It was cut and paste.
And then
you have another after the call trace which doesn't match
__i915_schedule. Anyway, _this_ code seems to be this part:
if (node_to_request(node)->global_seqno &&
90d: 8b 43 78 mov eax,DWORD PTR [rbx+0x78]
910: 85 c0 test eax,eax
912: 74 13 je 927 <__i915_schedule+0x317>
i915_seqno_passed(port_request(engine->execlists.port)->global_seqno,
914: 49 8b 97 c0 04 00 00 mov rdx,QWORD PTR [r15+0x4c0]
91b: 48 83 e2 fc and rdx,0xfffffffffffffffc
if (node_to_request(node)->global_seqno &&
91f: 39 82 50 02 00 00 cmp DWORD PTR [rdx+0x250],eax
925: 79 1e jns 945 <__i915_schedule+0x335>
<4>[ 354.473515] RSP: 0018:ffffc900001bba90 EFLAGS: 00010046
<4>[ 354.473524] RAX: 0000000000000003 RBX: ffff8882624c8008 RCX: f34a737800000000
<4>[ 354.473535] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8882624c8048
<4>[ 354.473545] RBP: ffffc900001bbab0 R08: 000000005963f1f1 R09: 0000000000000000
<4>[ 354.473556] R10: ffffc900001bba10 R11: ffff8882624c8060 R12: ffff88824fdd7b98
<4>[ 354.473567] R13: ffff88824fdd7bb8 R14: 0000000000000001 R15: ffff88824fdd7750
<4>[ 354.473578] FS: 00007f44b4b5b980(0000) GS:ffff888277e00000(0000) knlGS:0000000000000000
<4>[ 354.473590] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[ 354.473599] CR2: 0000000000000250 CR3: 000000026976e000 CR4: 0000000000340ef0
Given the registers above, I think it means this: eax is the global_seqno
of the node rq, and rdx is port_request, which is NULL, hence the bang.
There is no request in the port, but why would there always be one at the
point we are scheduling in a new request to the runnable queue?
Correct. The answer, as I chose to interpret it, is because of the
incomplete submitted+dequeued requests during cancellation which this
patch attempts to address.
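For reference, the decode discussed above (eax holding the node's global_seqno, rdx holding the port pointer with its low two bits masked off, then a read at rdx+0x250) can be sketched in standalone C. This is an illustration under assumptions, not the driver source: struct request, port_pack() and port_has_passed() are hypothetical names, and the NULL guard is there only to show where the faulting load sits; it is not the fix this patch makes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the real request; only the field used here. */
struct request {
	uint32_t global_seqno;
};

/* The low two pointer bits carry a count, matching `and rdx, ~3` above. */
#define PORT_COUNT_MASK ((uintptr_t)3)

static uintptr_t port_pack(struct request *rq, unsigned int count)
{
	/* The pointer must be at least 4-byte aligned for the tag bits. */
	assert(((uintptr_t)rq & PORT_COUNT_MASK) == 0);
	return (uintptr_t)rq | (count & PORT_COUNT_MASK);
}

/* Strip the count bits; an empty port (0) yields NULL. */
static struct request *port_request(uintptr_t packed)
{
	return (struct request *)(packed & ~PORT_COUNT_MASK);
}

/* Wrap-safe "seq1 is at or after seq2": the cmp/jns pair in the dump. */
static bool seqno_passed(uint32_t seq1, uint32_t seq2)
{
	return (int32_t)(seq1 - seq2) >= 0;
}

/* Guarded variant (illustration only): an empty port counts as
 * "not passed" instead of dereferencing NULL + offset. */
static bool port_has_passed(uintptr_t packed, uint32_t seqno)
{
	const struct request *rq = port_request(packed);

	return rq && seqno_passed(rq->global_seqno, seqno);
}
```

Without the guard, an empty port gives a NULL request, and reading its global_seqno is the load at offset 0x250 that shows up in CR2.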
I couldn't find any other route to this state myself, so on the basis of
that, but with a little bit of fear from "Could it have really been so
much simpler all along?!":
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxx>
Regards,
Tvrtko