On 12/05/2019 18:41, Pierre JUHEN wrote:
Hi,
the bug is present in 5.0.11, 5.0.13 and 5.0.14 (rawhide).
Please see:
https://bugzilla.redhat.com/show_bug.cgi?id=1708315
I guess it will be a tough one, since it seems clearly linked to the
gcc version: the same code works under Fedora 29 (gcc 8) and fails
under Fedora 30 (gcc 9).
Regards,
Pierre
I haven't upgraded to any 5.x release yet and am still using gcc 8.3,
but seeing that this particular issue appears to trigger upon attaching
the cache device, it made me wonder whether an issue I have encountered
recently could be related and therefore of some help. If it is not, then
I apologise for the noise.
The issue, which I had not seen before, is an Oops on bootup after
upgrading to stable 4.19.38 from an earlier 4.19 release. Specifically,
it occurs when doing one of these in a startup script (I haven't been
able to narrow down exactly which yet):
echo writeback > /sys/block/bcache0/bcache/cache_mode
echo 4200000000 > /sys/block/bcache0/bcache/sequential_cutoff
echo 50 > /sys/block/bcache0/bcache/writeback_percent
echo 0 > /sys/block/bcache0/bcache/cache/congested_write_threshold_us
echo 0 > /sys/block/bcache0/bcache/cache/congested_read_threshold_us
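One way to narrow down which of these writes is responsible might be
to apply them one at a time and announce each write in the kernel log
first, so that the last announcement seen on the serial console names
the suspect. The following is only a rough sketch: BCACHE_SYSFS,
LOG_TARGET and STEP_DELAY are made-up variable names, not bcache
features, and the pause between steps is a guess based on the Oops
firing from a workqueue, which may run some seconds after the write
that set it up.

```shell
#!/bin/sh
# Sketch only: apply the bcache tunables one at a time, logging before
# and after each write. The three variables below are hypothetical knobs
# for this script; BCACHE_SYSFS must point at the device's bcache sysfs
# directory (the .../bcache directory used in the commands above).
LOG_TARGET="${LOG_TARGET:-/dev/kmsg}"   # writing here (as root) adds a kernel log line
STEP_DELAY="${STEP_DELAY:-10}"          # seconds to wait for a delayed Oops

apply_setting() {
    # $1 = value to write, $2 = attribute path relative to $BCACHE_SYSFS
    dir="${BCACHE_SYSFS:?set BCACHE_SYSFS to the bcache sysfs directory}"
    echo "bcache-bisect: writing '$1' to $2" >> "$LOG_TARGET"
    sync                                 # flush dirty data before the risky write
    if echo "$1" > "$dir/$2"; then
        echo "bcache-bisect: $2 ok" >> "$LOG_TARGET"
    else
        echo "bcache-bisect: $2 FAILED" >> "$LOG_TARGET"
    fi
    sleep "$STEP_DELAY"
}

apply_all() {
    # Same values and order as the startup script quoted above.
    apply_setting writeback  cache_mode
    apply_setting 4200000000 sequential_cutoff
    apply_setting 50         writeback_percent
    apply_setting 0          cache/congested_write_threshold_us
    apply_setting 0          cache/congested_read_threshold_us
}
```

With BCACHE_SYSFS set, calling apply_all in place of the five echo
lines should leave a "writing ..." line in the log just before any
crash. If the Oops still arrives after an "ok" line, the delay is
probably too short to separate the steps.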
I managed to capture some of the Oops on my serial terminal, but
unfortunately some lines of it were corrupted when the machine rebooted
and subsequent serial output overwrote them. These are the lines which
did not get overwritten:
[ 205.046081] BUG: unable to handle kernel NULL pointer dereference at 0000000000000340
[ 205.053962] PGD 0 P4D 0
[ 205.056506] Oops: 0000 [#1] SMP NOPTI
[ 205.060220] CPU: 2 PID: 27 Comm: kworker/2:0 Tainted: G W O T 4.19.38-rc1 #1
[ 205.068266] Hardware name: Supermicro H8DG6/H8DGi/H8DG6/H8DGi, BIOS 3.5c 03/18/2016
[ 205.076489] Workqueue: events update_writeback_rate [bcache]
[ 205.082166] RIP: 0010:update_writeback_rate+0x2f/0x300 [bcache]
[ 205.088161] Code: 41 57 41 56 41 55 41 54 55 53 4c 8b a7 00 f4 ff ff f0 80 8f 20 f4 ff ff 10 f0 83 44 24 fc 00 48 8b 87 20 f4 ff ff a8 08 74 57 <49> 8b 84 24 40 03 00 00 48 c1 e8 03 83 e0 01 48 89 c5 75 43 8b 47
[ 205.107050] RSP: 0018:ffffc900032ffe68 EFLAGS: 00010202
[ 205.107052] RAX: 0000000000000018 RBX: ffff8884c3620c80 RCX: ffff8884178a01e0
[ 205.107052] RDX: 0000000000000001 RSI: ffff8884c3620c88 RDI: ffff8884c3620c80
[ 205.107053] RBP: ffff8884178a01c0 R08: 0000000000000000 R09: 000073746e657665
[ 205.107054] R10: 8080808080808080 R11: 0000002f93e8e556 R12: 0000000000000000
[ 205.107054] R13: 0000000000000000 R14: ffff888415b13c80 R15: ffff8884c3620c88
[ 205.107056] FS: 0000000000000000(0000) GS:ffff888417880000(0000) knlGS:0000000000000000
[ 205.107057] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 205.107057] CR2: 0000000000000340 CR3: 00000002c479c000 CR4: 00000000000406e0
[ 205.107058] Call Trace:
[ 205.107066] process_one_work+0x1a7/0x3a0
[ 205.107075] worker_thread+0x30/0x390
There is more to the Call Trace but, as I say, it was corrupted by
subsequent serial data. I will try to capture the full Oops if I get
time this week, and hopefully a full crash dump.
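Even the partial Oops can probably be mapped back to a source line with
the helper scripts shipped in the kernel tree. As a rough sketch,
assuming the captured text is saved to a file (oops.txt is a made-up
name) and that the bcache module was built with debug info from the
same tree as the crashed kernel, the faulting "symbol+offset/size" can
be pulled out of the RIP line like this:

```shell
#!/bin/sh
# Sketch only: print the first "symbol+offset/size" found on a RIP line
# of a saved Oops log. oops_rip_symbol is a hypothetical helper name.
oops_rip_symbol() {
    # $1 = file containing the captured Oops text
    sed -n 's/.*RIP: [0-9a-f]*:\([A-Za-z0-9_.]*+0x[0-9a-f]*\/0x[0-9a-f]*\).*/\1/p' "$1" \
        | head -n 1
}

# Possible follow-up, run from the top of the matching kernel build tree
# (not executed here; both scripts live under scripts/ in the kernel source):
#   ./scripts/faddr2line drivers/md/bcache/bcache.ko "$(oops_rip_symbol oops.txt)"
#   ./scripts/decodecode < oops.txt
```

For the trace above this should hand update_writeback_rate+0x2f/0x300
to faddr2line, while decodecode disassembles the "Code:" bytes around
the faulting instruction.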
The fact that it occurred on updating to a very recent 4.19 stable
release, and that the other issue you guys have experienced with
corruption is with a very recent kernel, makes me wonder if perhaps a
recent change somewhere else in the kernel, present in 5.x and
backported to stable, could be causing both issues.
I'm not sure whether my issue would actually have led to corruption, as
I completely discarded the bcache data right after I had the Oops,
re-created the device without a cache device, and now run it like that
(so maybe it is the exact same issue). I plan to add the cache device
again when I get time. Sorry for the incomplete bug report for now; as
I say, I hope to get time to investigate more fully soon.
Eddie