Re: Assemblin journaled array fails

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 5/19/20 1:55 AM, Song Liu wrote:

Thanks for these information.

It looks like the kernel had some issue in recovery. Could you please
try the following:

1. check dmesg, we should see one of the following at some point:
     pr_info("md/raid:%s: starting from clean shutdown\n",
or
     pr_info("md/raid:%s: recovering %d data-only stripes and %d
data-parity stripes\n",

2. try use bcc/bpftrace to trace r5l_recovery_read_page(),
specifically, the 4th argument.
With bcc, it is something like:

     trace.py -M 100 'r5l_recovery_read_page() "%llx", arg4'

-M above limits the number of outputs to 100 lines. We may need to
increase the limit or
remove the constraint. If the system doesn't have bcc/bpftrace. You
can also try with
kprobe.

Thanks,
Song


Will do the above tests in coming days.

As for dmesg, it was always clean after mdadm's invocation (that's why I never noticed it), but today I found on the machine that the following popped around *19* hours after mdadm assembly was started:

Note - since the last dmesg entry yesterday, nothing new showed up in dmesg, and the mdadm is still running.

ps a -o command,lstart | grep mdadm
strace -f -o mdadm.strace.l Mon May 18 13:25:45 2020
mdadm -A /dev/md/r5_big /de Mon May 18 13:25:45 2020



[May19 09:53] md/raid:md124: recovering 24667 data-only stripes and 50212 data-parity stripes
[May19 09:57] INFO: task mdadm:4698 blocked for more than 120 seconds.
[ +0.007114] Not tainted 5.4.0-0.bpo.4-amd64 #1 Debian 5.4.19-1~bpo10+1 [ +0.007881] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  +0.008754] mdadm           D    0  4698   4695 0x00004003
[  +0.006128] Call Trace:
[  +0.002599]  ? __schedule+0x2e6/0x6f0
[  +0.004222]  schedule+0x2f/0xa0
[  +0.003519]  r5l_start.cold.43+0x544/0x549 [raid456]
[  +0.005560]  ? finish_wait+0x80/0x80
[  +0.003901]  md_start.part.48+0x2e/0x50 [md_mod]
[  +0.005175]  do_md_run+0x64/0x100 [md_mod]
[  +0.004582]  md_ioctl+0xe7d/0x17d0 [md_mod]
[  +0.004675]  blkdev_ioctl+0x4d0/0xa00
[  +0.004104]  block_ioctl+0x39/0x40
[  +0.003806]  do_vfs_ioctl+0xa4/0x630
[  +0.004004]  ksys_ioctl+0x60/0x90
[  +0.003697]  __x64_sys_ioctl+0x16/0x20
[  +0.003966]  do_syscall_64+0x52/0x160
[  +0.004100]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  +0.005658] RIP: 0033:0x7efcb9969427
[  +0.004118] Code: Bad RIP value.
[ +0.003611] RSP: 002b:00007ffd902f2508 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [ +0.008461] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007efcb9969427 [ +0.007964] RDX: 0000000000000000 RSI: 00000000400c0930 RDI: 0000000000000004 [ +0.007965] RBP: 00007ffd902f2680 R08: 0000561686dc6f40 R09: 00005616858b9b60 [ +0.007968] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000002 [ +0.008034] R13: 0000000000000000 R14: 00007ffd902f2e30 R15: 0000000000000004
[May19 09:59] INFO: task mdadm:4698 blocked for more than 241 seconds.
[ +0.007113] Not tainted 5.4.0-0.bpo.4-amd64 #1 Debian 5.4.19-1~bpo10+1 [ +0.031011] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  +0.032417] mdadm           D    0  4698   4695 0x00004003
[  +0.030883] Call Trace:
[  +0.027037]  ? __schedule+0x2e6/0x6f0
[  +0.027109]  schedule+0x2f/0xa0
[  +0.026651]  r5l_start.cold.43+0x544/0x549 [raid456]
[  +0.028842]  ? finish_wait+0x80/0x80
[  +0.026646]  md_start.part.48+0x2e/0x50 [md_mod]
[  +0.027846]  do_md_run+0x64/0x100 [md_mod]
[  +0.027423]  md_ioctl+0xe7d/0x17d0 [md_mod]
[  +0.027029]  blkdev_ioctl+0x4d0/0xa00
[  +0.027716]  block_ioctl+0x39/0x40
[  +0.028195]  do_vfs_ioctl+0xa4/0x630
[  +0.028868]  ksys_ioctl+0x60/0x90
[  +0.028845]  __x64_sys_ioctl+0x16/0x20
[  +0.030159]  do_syscall_64+0x52/0x160
[  +0.029944]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  +0.029988] RIP: 0033:0x7efcb9969427
[  +0.027728] Code: Bad RIP value.
[ +0.025732] RSP: 002b:00007ffd902f2508 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [ +0.028386] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007efcb9969427 [ +0.028462] RDX: 0000000000000000 RSI: 00000000400c0930 RDI: 0000000000000004 [ +0.028854] RBP: 00007ffd902f2680 R08: 0000561686dc6f40 R09: 00005616858b9b60 [ +0.027692] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000002 [ +0.028838] R13: 0000000000000000 R14: 00007ffd902f2e30 R15: 0000000000000004
[May19 10:01] INFO: task mdadm:4698 blocked for more than 362 seconds.
[ +0.027842] Not tainted 5.4.0-0.bpo.4-amd64 #1 Debian 5.4.19-1~bpo10+1 [ +0.028293] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  +0.029933] mdadm           D    0  4698   4695 0x00004003
[  +0.027836] Call Trace:
[  +0.025804]  ? __schedule+0x2e6/0x6f0
[  +0.025857]  schedule+0x2f/0xa0
[  +0.025504]  r5l_start.cold.43+0x544/0x549 [raid456]
[  +0.026469]  ? finish_wait+0x80/0x80
[  +0.026253]  md_start.part.48+0x2e/0x50 [md_mod]
[  +0.026693]  do_md_run+0x64/0x100 [md_mod]
[  +0.027043]  md_ioctl+0xe7d/0x17d0 [md_mod]
[  +0.027063]  blkdev_ioctl+0x4d0/0xa00
[  +0.025390]  block_ioctl+0x39/0x40
[  +0.026334]  do_vfs_ioctl+0xa4/0x630
[  +0.025508]  ksys_ioctl+0x60/0x90
[  +0.024969]  __x64_sys_ioctl+0x16/0x20
[  +0.024461]  do_syscall_64+0x52/0x160
[  +0.025266]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  +0.025898] RIP: 0033:0x7efcb9969427
[  +0.025336] Code: Bad RIP value.
[ +0.023623] RSP: 002b:00007ffd902f2508 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [ +0.029437] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007efcb9969427 [ +0.028688] RDX: 0000000000000000 RSI: 00000000400c0930 RDI: 0000000000000004 [ +0.028993] RBP: 00007ffd902f2680 R08: 0000561686dc6f40 R09: 00005616858b9b60 [ +0.028810] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000002 [ +0.028715] R13: 0000000000000000 R14: 00007ffd902f2e30 R15: 0000000000000004
[May19 10:03] INFO: task mdadm:4698 blocked for more than 483 seconds.
[ +0.028312] Not tainted 5.4.0-0.bpo.4-amd64 #1 Debian 5.4.19-1~bpo10+1 [ +0.029922] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  +0.029811] mdadm           D    0  4698   4695 0x00004003
[  +0.028094] Call Trace:
[  +0.024963]  ? __schedule+0x2e6/0x6f0
[  +0.025823]  schedule+0x2f/0xa0
[  +0.029291]  r5l_start.cold.43+0x544/0x549 [raid456]
[  +0.031421]  ? finish_wait+0x80/0x80
[  +0.028282]  md_start.part.48+0x2e/0x50 [md_mod]
[  +0.026902]  do_md_run+0x64/0x100 [md_mod]
[  +0.026903]  md_ioctl+0xe7d/0x17d0 [md_mod]
[  +0.026249]  blkdev_ioctl+0x4d0/0xa00
[  +0.026441]  block_ioctl+0x39/0x40
[  +0.026720]  do_vfs_ioctl+0xa4/0x630
[  +0.025734]  ksys_ioctl+0x60/0x90
[  +0.025040]  __x64_sys_ioctl+0x16/0x20
[  +0.025255]  do_syscall_64+0x52/0x160
[  +0.024974]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  +0.026676] RIP: 0033:0x7efcb9969427
[  +0.025491] Code: Bad RIP value.
[ +0.023661] RSP: 002b:00007ffd902f2508 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [ +0.029253] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007efcb9969427 [ +0.027865] RDX: 0000000000000000 RSI: 00000000400c0930 RDI: 0000000000000004 [ +0.029299] RBP: 00007ffd902f2680 R08: 0000561686dc6f40 R09: 00005616858b9b60 [ +0.028662] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000002 [ +0.028661] R13: 0000000000000000 R14: 00007ffd902f2e30 R15: 0000000000000004
[May19 10:05] INFO: task mdadm:4698 blocked for more than 604 seconds.
[ +0.028120] Not tainted 5.4.0-0.bpo.4-amd64 #1 Debian 5.4.19-1~bpo10+1 [ +0.028921] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  +0.030050] mdadm           D    0  4698   4695 0x00004003
[  +0.028201] Call Trace:
[  +0.025739]  ? __schedule+0x2e6/0x6f0
[  +0.025130]  schedule+0x2f/0xa0
[  +0.025862]  r5l_start.cold.43+0x544/0x549 [raid456]
[  +0.026863]  ? finish_wait+0x80/0x80
[  +0.026362]  md_start.part.48+0x2e/0x50 [md_mod]
[  +0.028033]  do_md_run+0x64/0x100 [md_mod]
[  +0.026302]  md_ioctl+0xe7d/0x17d0 [md_mod]
[  +0.027130]  blkdev_ioctl+0x4d0/0xa00
[  +0.025276]  block_ioctl+0x39/0x40
[  +0.025804]  do_vfs_ioctl+0xa4/0x630
[  +0.025543]  ksys_ioctl+0x60/0x90
[  +0.025133]  __x64_sys_ioctl+0x16/0x20
[  +0.025289]  do_syscall_64+0x52/0x160
[  +0.025019]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  +0.027423] RIP: 0033:0x7efcb9969427
[  +0.025241] Code: Bad RIP value.
[ +0.024188] RSP: 002b:00007ffd902f2508 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [ +0.029330] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007efcb9969427 [ +0.028777] RDX: 0000000000000000 RSI: 00000000400c0930 RDI: 0000000000000004 [ +0.029321] RBP: 00007ffd902f2680 R08: 0000561686dc6f40 R09: 00005616858b9b60 [ +0.030312] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000002 [ +0.028756] R13: 0000000000000000 R14: 00007ffd902f2e30 R15: 0000000000000004
[May19 10:07] INFO: task mdadm:4698 blocked for more than 724 seconds.
[ +0.028161] Not tainted 5.4.0-0.bpo.4-amd64 #1 Debian 5.4.19-1~bpo10+1 [ +0.029025] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  +0.029438] mdadm           D    0  4698   4695 0x00004003
[  +0.027906] Call Trace:
[  +0.024749]  ? __schedule+0x2e6/0x6f0
[  +0.025400]  schedule+0x2f/0xa0
[  +0.027335]  r5l_start.cold.43+0x544/0x549 [raid456]
[  +0.030223]  ? finish_wait+0x80/0x80
[  +0.028959]  md_start.part.48+0x2e/0x50 [md_mod]
[  +0.029618]  do_md_run+0x64/0x100 [md_mod]
[  +0.026927]  md_ioctl+0xe7d/0x17d0 [md_mod]
[  +0.026187]  blkdev_ioctl+0x4d0/0xa00
[  +0.026722]  block_ioctl+0x39/0x40
[  +0.026070]  do_vfs_ioctl+0xa4/0x630
[  +0.026268]  ksys_ioctl+0x60/0x90
[  +0.025397]  __x64_sys_ioctl+0x16/0x20
[  +0.025492]  do_syscall_64+0x52/0x160
[  +0.025293]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  +0.026295] RIP: 0033:0x7efcb9969427
[  +0.025152] Code: Bad RIP value.
[ +0.023763] RSP: 002b:00007ffd902f2508 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [ +0.029591] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007efcb9969427 [ +0.027850] RDX: 0000000000000000 RSI: 00000000400c0930 RDI: 0000000000000004 [ +0.028603] RBP: 00007ffd902f2680 R08: 0000561686dc6f40 R09: 00005616858b9b60 [ +0.028714] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000002 [ +0.028870] R13: 0000000000000000 R14: 00007ffd902f2e30 R15: 0000000000000004
[May19 10:09] INFO: task mdadm:4698 blocked for more than 845 seconds.
[ +0.028077] Not tainted 5.4.0-0.bpo.4-amd64 #1 Debian 5.4.19-1~bpo10+1 [ +0.029226] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  +0.029321] mdadm           D    0  4698   4695 0x00004003
[May19 10:10] Call Trace:
[  +0.025422]  ? __schedule+0x2e6/0x6f0
[  +0.025490]  schedule+0x2f/0xa0
[  +0.026722]  r5l_start.cold.43+0x544/0x549 [raid456]
[  +0.026858]  ? finish_wait+0x80/0x80
[  +0.026399]  md_start.part.48+0x2e/0x50 [md_mod]
[  +0.027444]  do_md_run+0x64/0x100 [md_mod]
[  +0.026228]  md_ioctl+0xe7d/0x17d0 [md_mod]
[  +0.027008]  blkdev_ioctl+0x4d0/0xa00
[  +0.025656]  block_ioctl+0x39/0x40
[  +0.026068]  do_vfs_ioctl+0xa4/0x630
[  +0.025899]  ksys_ioctl+0x60/0x90
[  +0.025539]  __x64_sys_ioctl+0x16/0x20
[  +0.025885]  do_syscall_64+0x52/0x160
[  +0.025075]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  +0.026798] RIP: 0033:0x7efcb9969427
[  +0.024981] Code: Bad RIP value.
[ +0.024604] RSP: 002b:00007ffd902f2508 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [ +0.028628] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007efcb9969427 [ +0.028698] RDX: 0000000000000000 RSI: 00000000400c0930 RDI: 0000000000000004 [ +0.028732] RBP: 00007ffd902f2680 R08: 0000561686dc6f40 R09: 00005616858b9b60 [ +0.027649] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000002 [ +0.029800] R13: 0000000000000000 R14: 00007ffd902f2e30 R15: 0000000000000004
[May19 10:12] INFO: task mdadm:4698 blocked for more than 966 seconds.
[ +0.030285] Not tainted 5.4.0-0.bpo.4-amd64 #1 Debian 5.4.19-1~bpo10+1 [ +0.032412] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  +0.033870] mdadm           D    0  4698   4695 0x00004003
[  +0.031620] Call Trace:
[  +0.028343]  ? __schedule+0x2e6/0x6f0
[  +0.028456]  schedule+0x2f/0xa0
[  +0.025514]  r5l_start.cold.43+0x544/0x549 [raid456]
[  +0.028027]  ? finish_wait+0x80/0x80
[  +0.025278]  md_start.part.48+0x2e/0x50 [md_mod]
[  +0.028173]  do_md_run+0x64/0x100 [md_mod]
[  +0.027243]  md_ioctl+0xe7d/0x17d0 [md_mod]
[  +0.026750]  blkdev_ioctl+0x4d0/0xa00
[  +0.026450]  block_ioctl+0x39/0x40
[  +0.025187]  do_vfs_ioctl+0xa4/0x630
[  +0.025689]  ksys_ioctl+0x60/0x90
[  +0.024076]  __x64_sys_ioctl+0x16/0x20
[  +0.025253]  do_syscall_64+0x52/0x160
[  +0.024310]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  +0.027527] RIP: 0033:0x7efcb9969427
[  +0.024193] Code: Bad RIP value.
[ +0.024483] RSP: 002b:00007ffd902f2508 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [ +0.028585] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007efcb9969427 [ +0.029009] RDX: 0000000000000000 RSI: 00000000400c0930 RDI: 0000000000000004 [ +0.028403] RBP: 00007ffd902f2680 R08: 0000561686dc6f40 R09: 00005616858b9b60 [ +0.027510] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000002 [ +0.028629] R13: 0000000000000000 R14: 00007ffd902f2e30 R15: 0000000000000004
[May19 10:14] INFO: task mdadm:4698 blocked for more than 1087 seconds.
[ +0.028117] Not tainted 5.4.0-0.bpo.4-amd64 #1 Debian 5.4.19-1~bpo10+1 [ +0.029008] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  +0.030891] mdadm           D    0  4698   4695 0x00004003
[  +0.027018] Call Trace:
[  +0.024698]  ? __schedule+0x2e6/0x6f0
[  +0.025217]  schedule+0x2f/0xa0
[  +0.025613]  r5l_start.cold.43+0x544/0x549 [raid456]
[  +0.027065]  ? finish_wait+0x80/0x80
[  +0.026013]  md_start.part.48+0x2e/0x50 [md_mod]
[  +0.027664]  do_md_run+0x64/0x100 [md_mod]
[  +0.026160]  md_ioctl+0xe7d/0x17d0 [md_mod]
[  +0.027902]  blkdev_ioctl+0x4d0/0xa00
[  +0.026085]  block_ioctl+0x39/0x40
[  +0.026057]  do_vfs_ioctl+0xa4/0x630
[  +0.025886]  ksys_ioctl+0x60/0x90
[  +0.024539]  __x64_sys_ioctl+0x16/0x20
[  +0.025231]  do_syscall_64+0x52/0x160
[  +0.025571]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  +0.026934] RIP: 0033:0x7efcb9969427
[  +0.024585] Code: Bad RIP value.
[ +0.024635] RSP: 002b:00007ffd902f2508 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [ +0.029237] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007efcb9969427 [ +0.031792] RDX: 0000000000000000 RSI: 00000000400c0930 RDI: 0000000000000004 [ +0.031758] RBP: 00007ffd902f2680 R08: 0000561686dc6f40 R09: 00005616858b9b60 [ +0.031567] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000002 [ +0.028862] R13: 0000000000000000 R14: 00007ffd902f2e30 R15: 0000000000000004
[May19 10:16] INFO: task mdadm:4698 blocked for more than 1208 seconds.
[ +0.028273] Not tainted 5.4.0-0.bpo.4-amd64 #1 Debian 5.4.19-1~bpo10+1 [ +0.029463] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  +0.031789] mdadm           D    0  4698   4695 0x00004003
[  +0.028022] Call Trace:
[  +0.024606]  ? __schedule+0x2e6/0x6f0
[  +0.026123]  schedule+0x2f/0xa0
[  +0.024892]  r5l_start.cold.43+0x544/0x549 [raid456]
[  +0.027620]  ? finish_wait+0x80/0x80
[  +0.025492]  md_start.part.48+0x2e/0x50 [md_mod]
[  +0.027706]  do_md_run+0x64/0x100 [md_mod]
[  +0.026976]  md_ioctl+0xe7d/0x17d0 [md_mod]
[  +0.026705]  blkdev_ioctl+0x4d0/0xa00
[  +0.027531]  block_ioctl+0x39/0x40
[  +0.024963]  do_vfs_ioctl+0xa4/0x630
[  +0.025568]  ksys_ioctl+0x60/0x90
[  +0.024137]  __x64_sys_ioctl+0x16/0x20
[  +0.025538]  do_syscall_64+0x52/0x160
[  +0.024200]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  +0.026633] RIP: 0033:0x7efcb9969427
[  +0.024619] Code: Bad RIP value.
[ +0.024698] RSP: 002b:00007ffd902f2508 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [ +0.029381] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007efcb9969427 [ +0.028794] RDX: 0000000000000000 RSI: 00000000400c0930 RDI: 0000000000000004 [ +0.028724] RBP: 00007ffd902f2680 R08: 0000561686dc6f40 R09: 00005616858b9b60 [ +0.028828] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000002 [ +0.028030] R13: 0000000000000000 R14: 00007ffd902f2e30 R15: 0000000000000004



[Index of Archives]     [Linux RAID Wiki]     [ATA RAID]     [Linux SCSI Target Infrastructure]     [Linux Block]     [Linux IDE]     [Linux SCSI]     [Linux Hams]     [Device Mapper]     [Device Mapper Cryptographics]     [Kernel]     [Linux Admin]     [Linux Net]     [GFS]     [RPM]     [git]     [Yosemite Forum]


  Powered by Linux