Hi all,
I'm seeing a reshape issue where the array gets stuck with requests
seemingly getting blocked and md0_raid6 process taking 100% CPU whenever
I --continue the reshape.
From what I can tell, the md0_raid6 process is stuck processing a set of
stripes over and over via handle_stripe() without progressing.
Log excerpt of one handle_stripe() of an affected stripe with some extra
logging is below.
The 4600-5200 integers are line numbers for
http://onse.fi/files/reshape-infloop-issue/raid5.c .
0x1401 = STRIPE_ACTIVE STRIPE_EXPANDING STRIPE_EXPAND_READY
0x1402 = STRIPE_HANDLE STRIPE_EXPANDING STRIPE_EXPAND_READY
0x813 = R5_UPTODATE R5_LOCKED R5_Insync R5_Expanded
0x811 = R5_UPTODATE R5_Insync R5_Expanded
0xa01 = R5_UPTODATE R5_ReadError R5_Expanded
[ 499.262769] XX handle_stripe 4694, state 0x1402, reconstr 6
[ 499.263376] XX handle_stripe 4703, state 0x1401, reconstr 6
[ 499.263681] XX handle_stripe 4709, state 0x1401, reconstr 6
[ 499.263988] XX handle_stripe 4713, state 0x1401, reconstr 6
[ 499.264355] XX handle_stripe 4732, state 0x1401, reconstr 6
[ 499.264657] handling stripe 198248960, state=0x1401 cnt=1, pd_idx=19,
qd_idx=0
, check:0, reconstruct:6
[ 499.265304] check 19: state 0x813 read 000000000bfabb56 write
000000000bfabb56 written 000000000bfabb56
[ 499.265649] check 18: state 0xa01 read 000000000bfabb56 write
000000000bfabb56 written 000000000bfabb56
[ 499.265978] check 17: state 0xa01 read 000000000bfabb56 write
000000000bfabb56 written 000000000bfabb56
[ 499.266337] check 16: state 0xa01 read 000000000bfabb56 write
000000000bfabb56 written 000000000bfabb56
[ 499.266658] check 15: state 0x811 read 000000000bfabb56 write
000000000bfabb56 written 000000000bfabb56
[ 499.266988] check 14: state 0xa01 read 000000000bfabb56 write
000000000bfabb56 written 000000000bfabb56
[ 499.267335] check 13: state 0xa01 read 000000000bfabb56 write
000000000bfabb56 written 000000000bfabb56
[ 499.267657] check 12: state 0xa01 read 000000000bfabb56 write
000000000bfabb56 written 000000000bfabb56
[ 499.267985] check 11: state 0x811 read 000000000bfabb56 write
000000000bfabb56 written 000000000bfabb56
[ 499.268349] check 10: state 0x811 read 000000000bfabb56 write
000000000bfabb56 written 000000000bfabb56
[ 499.268670] check 9: state 0x811 read 000000000bfabb56 write
000000000bfabb56 written 000000000bfabb56
[ 499.269021] check 8: state 0x811 read 000000000bfabb56 write
000000000bfabb56 written 000000000bfabb56
[ 499.269371] check 7: state 0xa01 read 000000000bfabb56 write
000000000bfabb56 written 000000000bfabb56
[ 499.269695] check 6: state 0xa01 read 000000000bfabb56 write
000000000bfabb56 written 000000000bfabb56
[ 499.270027] check 5: state 0xa01 read 000000000bfabb56 write
000000000bfabb56 written 000000000bfabb56
[ 499.270376] check 4: state 0xa01 read 000000000bfabb56 write
000000000bfabb56 written 000000000bfabb56
[ 499.270700] check 3: state 0x811 read 000000000bfabb56 write
000000000bfabb56 written 000000000bfabb56
[ 499.271031] check 2: state 0x811 read 000000000bfabb56 write
000000000bfabb56 written 000000000bfabb56
[ 499.271380] check 1: state 0x811 read 000000000bfabb56 write
000000000bfabb56 written 000000000bfabb56
[ 499.271707] check 0: state 0x813 read 000000000bfabb56 write
000000000bfabb56 written 000000000bfabb56
[ 499.272039] XX handle_stripe 4742, state 0x1401, reconstr 6
[ 499.272410] XX handle_stripe 4746, state 0x1401, reconstr 6
[ 499.272740] XX handle_stripe 4753, state 0x1401, reconstr 6
[ 499.273093] XX handle_stripe 4765, state 0x1401, reconstr 6
[ 499.273446] locked=2 uptodate=20 to_read=0 to_write=0 failed=10
failed_num=18,17
[ 499.273786] XX too many failed
[ 499.274174] XX handle_stripe 4834, state 0x1401, reconstr 0
[ 499.274523] XX handle_stripe 4847, state 0x1401, reconstr 0
[ 499.274877] XX handle_stripe 4874, state 0x1401, reconstr 0
[ 499.275250] XX handle_stripe 4882, state 0x1401, reconstr 0
[ 499.275591] XX handle_stripe 4893, state 0x1401, reconstr 0
[ 499.275939] XX handle_stripe 4923, state 0x1401, reconstr 0
[ 499.276324] XX handle_stripe 4939, state 0x1401, reconstr 0
[ 499.276666] XX handle_stripe 4956, state 0x1401, reconstr 0
[ 499.277033] XX handle_stripe 4965, state 0x1401, reconstr 0
[ 499.277399] XX handle_stripe 4990, state 0x1401, reconstr 0
[ 499.277742] XX handle_stripe 5019, state 0x1401, reconstr 0
[ 499.278090] handle_stripe: 5026
[ 499.278477] XX handle_stripe 5035, state 0x1401, reconstr 3
[ 499.278831] XX handle_stripe 5040, state 0x1401, reconstr 3
[ 499.279198] XX handle_stripe 5043, state 0x1401, reconstr 3
[ 499.279547] XX handle_stripe 5057, state 0x1401, reconstr 3
[ 499.279898] XX handle_stripe 5087, state 0x1401, reconstr 3
[ ... raid_run_ops() call with STRIPE_OP_RECONSTRUCT ... ]
[ 499.280292] XX handle_stripe 5091, state 0x1403, reconstr 6
[ 499.280645] XX handle_stripe 5094, state 0x1403, reconstr 6
[ 499.281042] XX handle_stripe 5108, state now 0x1402
More details:
- The reshape (level 6, raid-devices 21 => 20) was happening normally
until almost the end when the array became at least partially
unresponsive and I was not able to login anymore (/home is on the array)
so I rebooted the system. Root FS is not on the array so any kernel
errors should have been logged to journal - I did not see any kernel
messages, though.
- The current issue now happens immediately whenever I run --grow
--continue --backup-file=xx on the array.
- The reshape was started on Fedora's 5.2.17-200.fc30.x86_64, but I'm
now debugging this on torvalds master from a couple of days ago
(v5.4-rc3-1-g2abd839aa7e6).
- The above log excerpt has 10 devices indicated as "failed" (the array
state is clean). Those devices all have the exact same 192 sectors in
their MD bad blocks list, except for sdq which only has 40 sectors
(subset of the 192). These seem to be bogus entries as I can dd-read the
entire /dev/md0 just fine and I can read the reported sectors directly
just fine with dd.
- A full log the above excerpt is from is at
http://onse.fi/files/reshape-infloop-issue/dmesg.txt .
- Before I started the reshape, I had succesfully --replaced several
members.
- Before I started the reshape, I had marked RAID member /dev/md8 with
--replace (i.e. to be replaced once the reshape is complete and a spare
is available).
- The array was originally 74230862272 kB long (21 devices of size
3906887488 kB). My intention was to have it end up with 20 slightly
larger members but the same total size, so I used --grow
--size=4124036736 to increase the device size slightly, and then
--array-size=74230862272K to reduce the available size back to original
before starting the reshape. Note that the --array-size I used is
smaller than the "actual" size after the reshape (74232661248 kB), in
case it matters.
- I have not noticed any data loss, the array seems to be functioning
properly after assembly until I run --grow --continue.
- I think "mdadm --assemble --backup-file=/xx" tries to start a
continuation of the --grow operation in the background via systemd but
fails because the --backup-file is not propagated to
mdadm-grow-continue@.service. But in this case I guess it is a good
thing as the array keeps being operational :) .. I guess
--freeze-reshape would be the equivalent proper way to achieve this.
- System is Fedora 30 with mdadm 4.1rc2. --grow --continue on mdadm git
master gives the same result.
So, regardless of how we ended up in this state, clearly there is at
least some bug as getting stuck in an infinite loop should not happen -
hopefully the cause can be determined, let me know if you need anything
more, e.g. more detailed log prints.
I'd also like to get the array reshape to continue, of course :)
Array details (before running --grow --continue) and the --examine
details of one component are below.
Examine output for all components is at
http://onse.fi/files/reshape-infloop-issue/examine-all.txt .
/dev/md0:
Version : 1.1
Creation Time : Mon Jun 2 00:53:32 2014
Raid Level : raid6
Array Size : 74232661248 (70793.78 GiB 76014.25 GB)
Used Dev Size : 4124036736 (3932.99 GiB 4223.01 GB)
Raid Devices : 20
Total Devices : 21
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Tue Oct 15 04:03:05 2019
State : clean, reshaping
Active Devices : 21
Working Devices : 21
Failed Devices : 0
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 64K
Consistency Policy : bitmap
Reshape Status : 97% complete
Delta Devices : -1, (21->20)
Name : delta.onse.fi:delta (local to host delta.onse.fi)
UUID : b18ea5a9:fe91f26f:96b96dce:73a5d373
Events : 27216022
Number Major Minor RaidDevice State
0 66 32 0 active sync /dev/sdai
1 65 96 1 active sync /dev/sdw
2 8 192 2 active sync /dev/sdm
3 65 32 3 active sync /dev/sds
24 66 80 4 active sync /dev/sdal
23 66 64 5 active sync /dev/sdak
22 8 224 6 active sync /dev/sdo
28 9 8 7 active sync /dev/md8
16 65 144 8 active sync /dev/sdz
9 65 48 9 active sync /dev/sdt
10 65 16 10 active sync /dev/sdr
11 65 64 11 active sync /dev/sdu
27 65 80 12 active sync /dev/sdv
13 65 0 13 active sync /dev/sdq
25 66 48 14 active sync /dev/sdaj
15 65 128 15 active sync /dev/sdy
20 65 112 16 active sync /dev/sdx
26 8 160 17 active sync /dev/sdk
18 8 240 18 active sync /dev/sdp
17 8 208 19 active sync /dev/sdn
21 8 176 20 active sync /dev/sdl
md0 : active raid6 sdai[0] sdl[21] sdn[17] sdp[18] sdk[26] sdx[20]
sdy[15] sdaj[25] sdq[13] sdv[27] sdu[11] sdr[10] sdt[9] sdz[16] md8[28]
sdo[22] sdak[23] sdal[24] sds[3] sdm[2] sdw[1]
74232661248 blocks super 1.1 level 6, 64k chunk, algorithm 2
[20/20] [UUUUUUUUUUUUUUUUUUUU]
[===================>.] reshape = 97.5% (4024886912/4124036736)
finish=154611756.8min speed=0K/sec
bitmap: 5/31 pages [20KB], 65536KB chunk
$ sudo mdadm --examine-badblocks /dev/sdal
Bad-blocks on /dev/sdal:
198276520 for 88 sectors
198503424 for 32 sectors
198503600 for 8 sectors
198503680 for 8 sectors
198503808 for 8 sectors
198503936 for 8 sectors
198504576 for 8 sectors
198504704 for 8 sectors
198504832 for 8 sectors
198504960 for 8 sectors
198505088 for 8 sectors
$ sudo mdadm --examine /dev/sdal
/dev/sdal:
Magic : a92b4efc
Version : 1.1
Feature Map : 0xd
Array UUID : b18ea5a9:fe91f26f:96b96dce:73a5d373
Name : delta.onse.fi:delta (local to host delta.onse.fi)
Creation Time : Mon Jun 2 00:53:32 2014
Raid Level : raid6
Raid Devices : 20
Avail Dev Size : 8248073584 (3932.99 GiB 4223.01 GB)
Array Size : 74232661248 (70793.78 GiB 76014.25 GB)
Used Dev Size : 8248073472 (3932.99 GiB 4223.01 GB)
Data Offset : 256000 sectors
Super Offset : 0 sectors
Unused Space : before=255928 sectors, after=7379723696 sectors
State : active
Device UUID : 7f100c8b:a7fb3d2d:e75224bf:229be6ff
Internal Bitmap : 8 sectors from superblock
Reshape pos'n : 1784696832 (1702.02 GiB 1827.53 GB)
Delta Devices : -1 (21->20)
Update Time : Fri Oct 18 20:53:56 2019
Bad Block Log : 512 entries available at offset 24 sectors - bad
blocks present.
Checksum : 8a873a9f - correct
Events : 27357529
Layout : left-symmetric
Chunk Size : 64K
Device Role : Active device 4
Array State : AAAAAAAAAAAAAAAAAAAAA ('A' == active, '.' == missing,
'R' == replacing)
--
Anssi Hannula