raid5-cache: deeply broken (with write-back?)

I have spent the last few days attempting to recover my system.  I
built a raid 5 array with an external journal residing on a 32G Optane,
and I had "echo write-back > /sys/block/md0/md/journal_mode" in an
/etc/local.d script.  I had either a power failure or a hardware glitch
and rebooted abnormally.  Following this I was unable to recover the
journal.  With 4.14, it hung with "Cannot get enough stripes due to
memory pressure. Recovery failed." exactly as described here:
https://www.spinics.net/lists/raid/msg59856.html
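
For reference, the whole local.d script was a one-liner along these
lines (a reconstruction; the file name here is made up):

    # /etc/local.d/journal-mode.start (hypothetical name)
    # Switch the md0 journal from the default write-through mode
    # to write-back caching at boot.
    echo write-back > /sys/block/md0/md/journal_mode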


[  250.985001] md: md0 stopped.
[  250.988985] md/raid:md0: device sda1 operational as raid disk 0
[  250.988985] md/raid:md0: device sdd1 operational as raid disk 3
[  250.988986] md/raid:md0: device sdc1 operational as raid disk 2
[  250.988986] md/raid:md0: device sdb1 operational as raid disk 1
[  250.989232] md/raid:md0: raid level 5 active with 4 out of 4 devices, algorithm 2
[  250.994896] md/raid:md0: Cannot get enough stripes due to memory pressure. Recovery failed.
[  250.995259] kmem_cache_destroy raid5-md0: Slab cache still has objects
[  250.995260] CPU: 14 PID: 4156 Comm: mdadm Not tainted 4.14.65-gentoo #1
[  250.995261] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X470 Taichi Ultimate, BIOS P1.20 04/17/2018
[  250.995261] Call Trace:
[  250.995266]  dump_stack+0x5d/0x79
[  250.995269]  kmem_cache_destroy+0x4d/0xfc
[  250.995272]  free_conf+0x87/0x12f [raid456]
[  250.995274]  raid5_run+0x765/0x831 [raid456]
[  250.995276]  md_run+0x5c1/0x9a5
[  250.995278]  ? __wake_up_common_lock+0x76/0xa6
[  250.995279]  do_md_run+0xa/0x94
[  250.995280]  md_ioctl+0x13e5/0x13fc
[  250.995282]  ? __switch_to_asm+0x24/0x60
[  250.995282]  ? __switch_to_asm+0x30/0x60
[  250.995283]  ? __switch_to_asm+0x24/0x60
[  250.995284]  ? __switch_to_asm+0x30/0x60
[  250.995284]  ? __switch_to_asm+0x24/0x60
[  250.995285]  ? __switch_to_asm+0x30/0x60
[  250.995287]  blkdev_ioctl+0x7f4/0x80a
[  250.995288]  ? __switch_to_asm+0x30/0x60
[  250.995289]  block_ioctl+0x37/0x3a
[  250.995291]  vfs_ioctl+0x19/0x26
[  250.995292]  do_vfs_ioctl+0x508/0x51b
[  250.995293]  ? __schedule+0x664/0x6b6
[  250.995294]  SyS_ioctl+0x4d/0x6f
[  250.995296]  do_syscall_64+0x6d/0xfe
[  250.995297]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[  250.995298] RIP: 0033:0x7f770a5fec97
[  250.995299] RSP: 002b:00007fff52a72ea8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  250.995300] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007f770a5fec97
[  250.995300] RDX: 0000000000000000 RSI: 00000000400c0930 RDI: 0000000000000004
[  250.995301] RBP: 00007fff52a73020 R08: 0000559aaa49a4c0 R09: 0000000000000004
[  250.995301] R10: 0000000000000001 R11: 0000000000000246 R12: 000000000000000b
[  250.995302] R13: 0000559aaa49af00 R14: 00007fff52a73770 R15: 0000000000000004
[  250.995326] md/raid:md0: failed to run raid set.
[  250.995327] md: pers->run() failed ...

Of course, the problem has nothing to do with available memory; I have
32GiB.  I made many attempts using various boot disks (Gentoo admin CD,
Debian live DVD, etc.).  Having read that the regression began somewhere
after 4.11, I tried a Debian live DVD using 4.9.110 and came up with
different and scarier errors:


Aug 24 06:15:41 localhost kernel: [  304.346350] md: bind<sda1>
Aug 24 06:16:53 localhost kernel: [  376.687498] md: md0 stopped.
Aug 24 06:16:53 localhost kernel: [  376.687506] md: unbind<sda1>
Aug 24 06:16:53 localhost kernel: [  376.704800] md: export_rdev(sda1)
Aug 24 06:17:05 localhost kernel: [  388.580468] md: md0 stopped.
Aug 24 06:17:05 localhost kernel: [  388.581973] md: bind<sdb1>
Aug 24 06:17:05 localhost kernel: [  388.582254] md: bind<sdc1>
Aug 24 06:17:05 localhost kernel: [  388.582539] md: bind<sdd1>
Aug 24 06:17:05 localhost kernel: [  388.582613] md: bind<nvme0n1>
Aug 24 06:17:05 localhost kernel: [  388.582858] md: bind<sda1>
Aug 24 06:17:05 localhost kernel: [  388.669564] raid6: sse2x1   gen()  6270 MB/s
Aug 24 06:17:05 localhost kernel: [  388.737557] raid6: sse2x1   xor()  5958 MB/s
Aug 24 06:17:05 localhost kernel: [  388.805570] raid6: sse2x2   gen() 13217 MB/s
Aug 24 06:17:05 localhost kernel: [  388.873573] raid6: sse2x2   xor()  9201 MB/s
Aug 24 06:17:05 localhost kernel: [  388.941581] raid6: sse2x4   gen() 13708 MB/s
Aug 24 06:17:06 localhost kernel: [  389.009582] raid6: sse2x4   xor()  8690 MB/s
Aug 24 06:17:06 localhost kernel: [  389.077587] raid6: avx2x1   gen() 15528 MB/s
Aug 24 06:17:06 localhost kernel: [  389.145602] raid6: avx2x2   gen() 20335 MB/s
Aug 24 06:17:06 localhost kernel: [  389.213597] raid6: avx2x4   gen() 21409 MB/s
Aug 24 06:17:06 localhost kernel: [  389.213600] raid6: using algorithm avx2x4 gen() 21409 MB/s
Aug 24 06:17:06 localhost kernel: [  389.213601] raid6: using avx2x2 recovery algorithm
Aug 24 06:17:06 localhost kernel: [  389.217427] async_tx: api initialized (async)
Aug 24 06:17:06 localhost kernel: [  389.224559] xor: automatically using best checksumming function   avx
Aug 24 06:17:06 localhost kernel: [  389.244159] md: raid6 personality registered for level 6
Aug 24 06:17:06 localhost kernel: [  389.244160] md: raid5 personality registered for level 5
Aug 24 06:17:06 localhost kernel: [  389.244161] md: raid4 personality registered for level 4
Aug 24 06:17:06 localhost kernel: [  389.244534] md/raid:md0: device sda1 operational as raid disk 0
Aug 24 06:17:06 localhost kernel: [  389.244538] md/raid:md0: device sdd1 operational as raid disk 3
Aug 24 06:17:06 localhost kernel: [  389.244540] md/raid:md0: device sdc1 operational as raid disk 2
Aug 24 06:17:06 localhost kernel: [  389.244542] md/raid:md0: device sdb1 operational as raid disk 1
Aug 24 06:17:06 localhost kernel: [  389.244883] md/raid:md0: allocated 4374kB
Aug 24 06:17:06 localhost kernel: [  389.244915] md/raid:md0: raid level 5 active with 4 out of 4 devices, algorithm 2
Aug 24 06:17:06 localhost kernel: [  389.244916] RAID conf printout:
Aug 24 06:17:06 localhost kernel: [  389.244916]  --- level:5 rd:4 wd:4
Aug 24 06:17:06 localhost kernel: [  389.244917]  disk 0, o:1, dev:sda1
Aug 24 06:17:06 localhost kernel: [  389.244918]  disk 1, o:1, dev:sdb1
Aug 24 06:17:06 localhost kernel: [  389.244919]  disk 2, o:1, dev:sdc1
Aug 24 06:17:06 localhost kernel: [  389.244920]  disk 3, o:1, dev:sdd1
Aug 24 06:17:06 localhost kernel: [  389.244928] md/raid:md0: using device nvme0n1 as journal
Aug 24 06:17:06 localhost kernel: [  389.247994] attempt to access beyond end of device
Aug 24 06:17:06 localhost kernel: [  389.247996] nvme0n1: rw=8, want=116880821, limit=57149440
Aug 24 06:17:06 localhost kernel: [  389.247996] attempt to access beyond end of device
Aug 24 06:17:06 localhost kernel: [  389.247997] nvme0n1: rw=8, want=59764149, limit=57149440
Aug 24 06:17:06 localhost kernel: [  389.248264] attempt to access beyond end of device
Aug 24 06:17:06 localhost kernel: [  389.248265] nvme0n1: rw=8, want=248587438, limit=57149440
Aug 24 06:17:06 localhost kernel: [  389.248266] attempt to access beyond end of device
Aug 24 06:17:06 localhost kernel: [  389.248267] nvme0n1: rw=8, want=4474931993, limit=57149440
Aug 24 06:17:06 localhost kernel: [  389.248268] attempt to access beyond end of device
Aug 24 06:17:06 localhost kernel: [  389.248269] nvme0n1: rw=8, want=4417815321, limit=57149440
Aug 24 06:17:06 localhost kernel: [  389.248269] attempt to access beyond end of device
Aug 24 06:17:06 localhost kernel: [  389.248270] nvme0n1: rw=8, want=4360698649, limit=57149440
Aug 24 06:17:06 localhost kernel: [  389.248271] attempt to access beyond end of device
... repeats 2371 times with different offsets.

I should note that for some reason the "32G" Optane only has 29.260 GB
(27.3 GiB), which at 512 bytes per sector is exactly the limit=57149440
shown in the log.  Some of the want= offsets above are roughly 2 TiB
into a 27 GiB device.
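
That sector count is easy to confirm; blockdev reports the device size
in 512-byte sectors and in bytes:

    # Prints 57149440, matching the limit= in the log above.
    blockdev --getsz /dev/nvme0n1
    # Size in bytes: 29260513280, i.e. ~29.26 GB / ~27.25 GiB.
    blockdev --getsize64 /dev/nvme0n1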

I overlooked that Documentation/md/raid5-cache.txt states "write-back
mode (supported since 4.10)", so I will *guess* that these errors are
due to attempting to replay a write-back journal on a kernel without
support for it.  Is there no versioning or other mechanism in the
journal metadata to indicate that it contains write-back data?  This
alone seems like a critical flaw: the kernel should have refused to use
the journal, stating that it uses features that are not available.
(BTW, I didn't try booting a 4.10 kernel and attempting recovery.)
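
As far as I can tell, the mode exists only as a runtime knob; nothing
on disk records it.  On a running array the sysfs file shows the
active mode in brackets, something like:

    $ cat /sys/block/md0/md/journal_mode
    write-through [write-back]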

So I'm guessing there are two distinct bugs: 1) the journal recovery
being broken, and 2) a missing or improperly used versioning mechanism
to prevent older kernels from trying to use newer journal data formats.
Furthermore, 3) this feature is not marked as experimental and is
anything but working.

I propose that, AT A VERY MINIMUM, write-back functionality be gated
by a Kconfig option that informs the user that it is "(EXPERIMENTAL)".
If there are similar bugs affecting raid5-cache in write-through mode,
then I propose they both be gated in this way so that raid5-cache
support is not even built unless explicitly enabled in .config, probably
as a sub-option of MD_RAID456.
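
Something along these lines is what I have in mind (only a sketch; the
option name MD_RAID456_WRITE_BACK is made up):

    config MD_RAID456_WRITE_BACK
            bool "RAID 4/5/6 write-back journaling (EXPERIMENTAL)"
            depends on MD_RAID456
            default n
            help
              Enable write-back mode for the raid5-cache journal.
              Journal recovery after an unclean shutdown is known
              to be unreliable; say N unless you can afford to
              lose the array.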

I greatly appreciate Facebook contributing to FOSS, but we also don't
want to set up traps where people adopt features that will lead to
heartache the first time their system crashes for some reason.

As for my system, I have just assembled the array read-only without the
journal, roughly as shown below, and copied the data elsewhere.  I'm
probably going to recreate the md device without an external journal.
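
For anyone else who ends up here, the assembly step was approximately
the following (device names are from my setup; --force was needed
because the array looks degraded without its journal device):

    mdadm --assemble --force --readonly /dev/md0 \
        /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1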

Regards,
Daniel

PS: As a side-note, it would be really nice if the stripe journal could
also serve as a cache of non-dirty stripes, tracking last use (similar
to the page cache), keeping a minimum of free entries for new writes and
discarding the least-used stripes as needed.  I know that would be
involved, but it's just a thought. :)





