On 07/10/2021 19:18, Christian König wrote:
Am 07.10.21 um 17:53 schrieb Tvrtko Ursulin:
On 07/10/2021 16:18, Vudum, Lakshminarayana wrote:
-----Original Message-----
From: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxxxxxxx>
Sent: Thursday, October 7, 2021 6:41 AM
To: Christian König <ckoenig.leichtzumerken@xxxxxxxxx>;
intel-gfx@xxxxxxxxxxxxxxxxxxxxx
Cc: Vudum, Lakshminarayana <lakshminarayana.vudum@xxxxxxxxx>
Subject: Re: ✗ Fi.CI.IGT: failure for series starting
with [v7,1/8] drm/i915/gem: Break out some shmem backend utils
On 07/10/2021 13:57, Christian König wrote:
Am 07.10.21 um 12:51 schrieb Tvrtko Ursulin:
On 07/10/2021 10:19, Christian König wrote:
Am 07.10.21 um 11:15 schrieb Tvrtko Ursulin:
Hi,
On 06/10/2021 16:26, Patchwork wrote:
*Patch Details*
*Series:* series starting with [v7,1/8] drm/i915/gem: Break out
some shmem backend utils
*URL:* https://patchwork.freedesktop.org/series/95501/
*State:* failure
*Details:*
https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21264/index.html
CI Bug Log - changes from CI_DRM_10688_full ->
Patchwork_21264_full
Summary
*FAILURE*
Serious unknown changes coming with Patchwork_21264_full
absolutely need to be verified manually.
If you think the reported changes have nothing to do with the
changes introduced in Patchwork_21264_full, please notify your bug
team to allow them to document this new failure mode, which will
reduce false positives in CI.
Possible new issues
Here are the unknown changes that may have been introduced in
Patchwork_21264_full:
IGT changes
Possible regressions
* igt@gem_sync@basic-many-each:
  o shard-apl: NOTRUN -> INCOMPLETE
    <https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21264/shard-apl7/igt@gem_sync@xxxxxxxxxxxxxxxxxxxx>
Something still fishy in the unlocked iterator? Or
dma_resv_get_fences using it?
Probably the latter. I'm going to take a look.
Thanks for the notice,
Christian.
<6> [187.551235] [IGT] gem_sync: starting subtest basic-many-each
<1> [188.935462] BUG: kernel NULL pointer dereference, address: 0000000000000010
<1> [188.935485] #PF: supervisor write access in kernel mode
<1> [188.935495] #PF: error_code(0x0002) - not-present page
<6> [188.935504] PGD 0 P4D 0
<4> [188.935512] Oops: 0002 [#1] PREEMPT SMP NOPTI
<4> [188.935521] CPU: 2 PID: 1467 Comm: gem_sync Not tainted 5.15.0-rc4-CI-Patchwork_21264+ #1
<4> [188.935535] Hardware name: /NUC6CAYB, BIOS AYAPLCEL.86A.0049.2018.0508.1356 05/08/2018
<4> [188.935546] RIP: 0010:dma_resv_get_fences+0x116/0x2d0
<4> [188.935560] Code: 10 85 c0 7f c9 be 03 00 00 00 e8 15 8b df ff eb bd e8 8e c6 ff ff eb b6 41 8b 04 24 49 8b 55 00 48 89 e7 8d 48 01 41 89 0c 24 <4c> 89 34 c2 e8 41 f2 ff ff 49 89 c6 48 85 c0 75 8c 48 8b 44 24 10
<4> [188.935583] RSP: 0018:ffffc900011dbcc8 EFLAGS: 00010202
<4> [188.935593] RAX: 0000000000000000 RBX: 00000000ffffffff RCX: 0000000000000001
<4> [188.935603] RDX: 0000000000000010 RSI: ffffffff822e343c RDI: ffffc900011dbcc8
<4> [188.935613] RBP: ffffc900011dbd48 R08: ffff88812d255bb8 R09: 00000000fffffffe
<4> [188.935623] R10: 0000000000000001 R11: 0000000000000000 R12: ffffc900011dbd44
<4> [188.935633] R13: ffffc900011dbd50 R14: ffff888113d29cc0 R15: 0000000000000000
<4> [188.935643] FS: 00007f68d17e9700(0000) GS:ffff888277900000(0000) knlGS:0000000000000000
<4> [188.935655] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4> [188.935665] CR2: 0000000000000010 CR3: 000000012d0a4000 CR4: 00000000003506e0
<4> [188.935676] Call Trace:
<4> [188.935685] i915_gem_object_wait+0x1ff/0x410 [i915]
<4> [188.935988] i915_gem_wait_ioctl+0xf2/0x2a0 [i915]
<4> [188.936272] ? i915_gem_object_wait+0x410/0x410 [i915]
<4> [188.936533] drm_ioctl_kernel+0xae/0x140
<4> [188.936546] drm_ioctl+0x201/0x3d0
<4> [188.936555] ? i915_gem_object_wait+0x410/0x410 [i915]
<4> [188.936820] ? __fget_files+0xc2/0x1c0
<4> [188.936830] ? __fget_files+0xda/0x1c0
<4> [188.936839] __x64_sys_ioctl+0x6d/0xa0
<4> [188.936848] do_syscall_64+0x3a/0xb0
<4> [188.936859] entry_SYSCALL_64_after_hwframe+0x44/0xae
FWIW if you disassemble the code it seems to be crashing in:
(*shared)[(*shared_count)++] = fence; // mov %r14, (%rdx, %rax, 8)
RDX is *shared, RAX is *shared_count, RCX is *shared_count++ (for the
next iteration). R13 is shared and R12 is shared_count.
That *shared can contain 0000000000000010 makes no sense to me. At
least yet. :)
Yeah, me neither. I've gone over the whole code multiple times now and
absolutely don't get what's happening here.
Adding some more selftests didn't help either. As far as I can see
the code works as intended.
Do we have any other reports of crashes?
Yes, sporadic but present across different platforms since the change
went in:
https://intel-gfx-ci.01.org/tree/drm-tip/igt@gem_sync@xxxxxxxxxxxxxxxxxxxx.
So the issue is probably real.
Did not find any other tests failing with the same signature. Lakshmi,
are you perhaps able to search for the same or similar signature
across the whole set of recent results?
[Lakshmi] Both regression failures are new. I filed the issues below
and reported them.
Thanks Lakshmi!
Christian, maybe revert for now since it looks tricky to figure out? I
at least couldn't spend much time looking at it today. Or try to find
a third set of eyes to look at it quickly in case we are not seeing
something.
Looks like a good selftest will be needed here for robustness,
including threads to trigger restarts and external manipulation to hit
the zero-refcount case.
Yeah, agree. Already working on that.
Going to send out the revert for dma_resv_get_fences() tomorrow.
Looks like the issue is actually in the unlocked iterator.
What happens in practice when it crashes is that the fence count in the
shared fences object is zero, which means no space gets allocated in
dma_resv_get_fences. But clearly shared_count was not zero in
dma_resv_iter_walk_unlocked, otherwise the loop in dma_resv_get_fences
wouldn't run.
I suspect it is not safe to drop the RCU lock after having peeked at the
dma_resv_shared_list.
Regards,
Tvrtko