Sorry for the delayed reply.

On Sat, Oct 19, 2019 at 2:10 AM Anssi Hannula <anssi.hannula@xxxxxx> wrote:
>
> Hi all,
>
> I'm seeing a reshape issue where the array gets stuck with requests
> seemingly getting blocked and md0_raid6 process taking 100% CPU whenever
> I --continue the reshape.
>
> From what I can tell, the md0_raid6 process is stuck processing a set of
> stripes over and over via handle_stripe() without progressing.
>
> Log excerpt of one handle_stripe() of an affected stripe with some extra
> logging is below.
> The 4600-5200 integers are line numbers for
> http://onse.fi/files/reshape-infloop-issue/raid5.c .

Maybe add sh->sector to DEBUGPRINT()? Also, please add more DEBUGPRINT() in the

        if (sh->reconstruct_state == reconstruct_state_result) {

case.

>
> 0x1401 = STRIPE_ACTIVE STRIPE_EXPANDING STRIPE_EXPAND_READY
> 0x1402 = STRIPE_HANDLE STRIPE_EXPANDING STRIPE_EXPAND_READY
>
> 0x813 = R5_UPTODATE R5_LOCKED R5_Insync R5_Expanded
> 0x811 = R5_UPTODATE R5_Insync R5_Expanded
> 0xa01 = R5_UPTODATE R5_ReadError R5_Expanded
>
> [ 499.262769] XX handle_stripe 4694, state 0x1402, reconstr 6
> [ 499.263376] XX handle_stripe 4703, state 0x1401, reconstr 6
> [ 499.263681] XX handle_stripe 4709, state 0x1401, reconstr 6
> [ 499.263988] XX handle_stripe 4713, state 0x1401, reconstr 6
> [ 499.264355] XX handle_stripe 4732, state 0x1401, reconstr 6
> [ 499.264657] handling stripe 198248960, state=0x1401 cnt=1, pd_idx=19, qd_idx=0
> , check:0, reconstruct:6
> [ 499.265304] check 19: state 0x813 read 000000000bfabb56 write 000000000bfabb56 written 000000000bfabb56
> [ 499.265649] check 18: state 0xa01 read 000000000bfabb56 write 000000000bfabb56 written 000000000bfabb56
> [ 499.265978] check 17: state 0xa01 read 000000000bfabb56 write 000000000bfabb56 written 000000000bfabb56
> [ 499.266337] check 16: state 0xa01 read 000000000bfabb56 write 000000000bfabb56 written 000000000bfabb56
> [ 499.266658] check 15: state 0x811 read 000000000bfabb56 write 000000000bfabb56 written 000000000bfabb56
> [ 499.266988] check 14: state 0xa01 read 000000000bfabb56 write 000000000bfabb56 written 000000000bfabb56
> [ 499.267335] check 13: state 0xa01 read 000000000bfabb56 write 000000000bfabb56 written 000000000bfabb56
> [ 499.267657] check 12: state 0xa01 read 000000000bfabb56 write 000000000bfabb56 written 000000000bfabb56
> [ 499.267985] check 11: state 0x811 read 000000000bfabb56 write 000000000bfabb56 written 000000000bfabb56
> [ 499.268349] check 10: state 0x811 read 000000000bfabb56 write 000000000bfabb56 written 000000000bfabb56
> [ 499.268670] check 9: state 0x811 read 000000000bfabb56 write 000000000bfabb56 written 000000000bfabb56
> [ 499.269021] check 8: state 0x811 read 000000000bfabb56 write 000000000bfabb56 written 000000000bfabb56
> [ 499.269371] check 7: state 0xa01 read 000000000bfabb56 write 000000000bfabb56 written 000000000bfabb56
> [ 499.269695] check 6: state 0xa01 read 000000000bfabb56 write 000000000bfabb56 written 000000000bfabb56
> [ 499.270027] check 5: state 0xa01 read 000000000bfabb56 write 000000000bfabb56 written 000000000bfabb56
> [ 499.270376] check 4: state 0xa01 read 000000000bfabb56 write 000000000bfabb56 written 000000000bfabb56
> [ 499.270700] check 3: state 0x811 read 000000000bfabb56 write 000000000bfabb56 written 000000000bfabb56
> [ 499.271031] check 2: state 0x811 read 000000000bfabb56 write 000000000bfabb56 written 000000000bfabb56
> [ 499.271380] check 1: state 0x811 read 000000000bfabb56 write 000000000bfabb56 written 000000000bfabb56
> [ 499.271707] check 0: state 0x813 read 000000000bfabb56 write 000000000bfabb56 written 000000000bfabb56
> [ 499.272039] XX handle_stripe 4742, state 0x1401, reconstr 6
> [ 499.272410] XX handle_stripe 4746, state 0x1401, reconstr 6
> [ 499.272740] XX handle_stripe 4753, state 0x1401, reconstr 6
> [ 499.273093] XX handle_stripe 4765, state 0x1401, reconstr 6
> [ 499.273446] locked=2 uptodate=20 to_read=0 to_write=0 failed=10 failed_num=18,17
> [ 499.273786] XX too many failed
> [ 499.274174] XX handle_stripe 4834, state 0x1401, reconstr 0
> [ 499.274523] XX handle_stripe 4847, state 0x1401, reconstr 0
> [ 499.274877] XX handle_stripe 4874, state 0x1401, reconstr 0
> [ 499.275250] XX handle_stripe 4882, state 0x1401, reconstr 0
> [ 499.275591] XX handle_stripe 4893, state 0x1401, reconstr 0
> [ 499.275939] XX handle_stripe 4923, state 0x1401, reconstr 0
> [ 499.276324] XX handle_stripe 4939, state 0x1401, reconstr 0
> [ 499.276666] XX handle_stripe 4956, state 0x1401, reconstr 0
> [ 499.277033] XX handle_stripe 4965, state 0x1401, reconstr 0
> [ 499.277399] XX handle_stripe 4990, state 0x1401, reconstr 0
> [ 499.277742] XX handle_stripe 5019, state 0x1401, reconstr 0
> [ 499.278090] handle_stripe: 5026
> [ 499.278477] XX handle_stripe 5035, state 0x1401, reconstr 3
> [ 499.278831] XX handle_stripe 5040, state 0x1401, reconstr 3
> [ 499.279198] XX handle_stripe 5043, state 0x1401, reconstr 3
> [ 499.279547] XX handle_stripe 5057, state 0x1401, reconstr 3
> [ 499.279898] XX handle_stripe 5087, state 0x1401, reconstr 3
> [ ... raid_run_ops() call with STRIPE_OP_RECONSTRUCT ... ]
> [ 499.280292] XX handle_stripe 5091, state 0x1403, reconstr 6
> [ 499.280645] XX handle_stripe 5094, state 0x1403, reconstr 6

After this the stripe should be handled again, but I didn't find it in the
dmesg file. Could you please retry with the extra debug information?

> [ 499.281042] XX handle_stripe 5108, state now 0x1402

[...]

> - The array was originally 74230862272 kB long (21 devices of size
>   3906887488 kB). My intention was to have it end up with 20 slightly
>   larger members but the same total size, so I used --grow
>   --size=4124036736 to increase the device size slightly, and then
>   --array-size=74230862272K to reduce the available size back to original
>   before starting the reshape. Note that the --array-size I used is
>   smaller than the "actual" size after the reshape (74232661248 kB), in
>   case it matters.

That's an impressive array, btw.

Thanks,
Song