Re: Critical bug on bcache kernel module in Fedora 30

Eddie Chapman <eddie@xxxxxxxx> · Mon, 13 May 2019 12:43:55 +0100

On 12/05/2019 18:41, Pierre JUHEN wrote:
Hi,

the bug is present in 5.0.11, 5.0.13 et 5.0.14 (rawhide).

Please see :

https://bugzilla.redhat.com/show_bug.cgi?id=1708315

I guess it will be a tough one, since it's seems clearly linked to the 
gcc version, since the same code works under Fedora 29 (gcc 8), and 
fails under Fedora 30 (gcc 9).

Regards,

Pierre

I haven't upgraded to any 5.x release yet and still using gcc 8.3 but 
seeing that this particular issue appears to trigger upon attaching the 
cache device, it made me wonder if an issue I have encountered recently 
could be related and therefore some help. If it is not then I apologise 
for the noise.

The issue I have encountered recently, which I had not before, is an 
Oops on bootup, after upgrading to stable 4.19.38 from an earlier 4.19 
release. Specifically it occurs when doing one of these in a startup 
script (haven't been able to narrow down exactly which yet):

echo writeback > /sys/block/bcach0/bcache/cache_mode
echo 4200000000 > /sys/block/bcach0/bcache/sequential_cutoff
echo 50 > /sys/block/bcach0/bcache/writeback_percent
echo 0 > /sys/block/bcach0/bcache/cache/congested_write_threshold_us
echo 0 > /sys/block/bcach0/bcache/cache/congested_read_threshold_us

I managed to get some of the Oops in my serial terminal, but 
unfortunately some lines of it were corrupted when the machine rebooted 
and subsequent serial output overwrote them.  But these are the lines 
which did not get overwritten:

[  205.046081] BUG: unable to handle kernel NULL pointer dereference at 
0000000000000340
[  205.053962] PGD 0 P4D 0
[  205.056506] Oops: 0000 [#1] SMP NOPTI
[  205.060220] CPU: 2 PID: 27 Comm: kworker/2:0 Tainted: G        W  O 
 T 4.19.38-rc1 #1
[  205.068266] Hardware name: Supermicro H8DG6/H8DGi/H8DG6/H8DGi, BIOS 
3.5c       03/18/2016
[  205.076489] Workqueue: events update_writeback_rate [bcache]
[  205.082166] RIP: 0010:update_writeback_rate+0x2f/0x300 [bcache]
[  205.088161] Code: 41 57 41 56 41 55 41 54 55 53 4c 8b a7 00 f4 ff ff 
f0 80 8f 20 f4 ff ff 10 f0 83 44 24 fc 00 48 8b 87 20 f4 ff ff a8 08 74 
57 <49> 8b 84 24 40 03 00 00 48 c1 e8 03 83 e0 01 48 89 c5 75 43 8b 47
[  205.107050] RSP: 0018:ffffc900032ffe68 EFLAGS: 00010202
[  205.107052] RAX: 0000000000000018 RBX: ffff8884c3620c80 RCX: 
ffff8884178a01e0
[  205.107052] RDX: 0000000000000001 RSI: ffff8884c3620c88 RDI: 
ffff8884c3620c80
[  205.107053] RBP: ffff8884178a01c0 R08: 0000000000000000 R09: 
000073746e657665
[  205.107054] R10: 8080808080808080 R11: 0000002f93e8e556 R12: 
0000000000000000
[  205.107054] R13: 0000000000000000 R14: ffff888415b13c80 R15: 
ffff8884c3620c88
[  205.107056] FS:  0000000000000000(0000) GS:ffff888417880000(0000) 
knlGS:0000000000000000
[  205.107057] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  205.107057] CR2: 0000000000000340 CR3: 00000002c479c000 CR4: 
00000000000406e0
[  205.107058] Call Trace:
[  205.107066]  process_one_work+0x1a7/0x3a0
[  205.107075]  worker_thread+0x30/0x390

There is some more Call Trace but as I say it is corrupted by subsequent 
serial data.  I will try and capture full oops if I get time this week, 
and hopefully a full crash dump.

The fact that it occurred on updating to a very recent 4.19 stable 
release, and that the other issue you guys have experienced with 
corruption is with a very recent kernel, makes me wonder if perhaps a 
recent change somewhere else in the kernel that is present in 5.x and 
been backported to stable could be causing both issues.

I'm not sure if my issue actually would have led to corruption as I 
discarded completely the bcache data right after I had the oops, and 
re-created it without a cache device and now run it like that (maybe it 
is the exact same issue). I plan to add cache device again when I get 
time. So sorry for the incomplete bug report for now, as I say hope to 
get time to investigate more fully soon.

Eddie