On 6/4/20 12:07 AM, Song Liu wrote:
The hang happens at expected place.
[Jun 3 09:02] INFO: task mdadm:2858 blocked for more than 120 seconds.
[ +0.060545] Tainted: G E 5.4.19-msl-00001-gbf39596faf12 #2
[ +0.062932] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Could you please try disabling the timeout message with
echo 0 > /proc/sys/kernel/hung_task_timeout_secs
And during this wait (after message
"r5c_recovery_flush_data_only_stripes before wait_event"),
check whether the raid disks (not the journal disk) are taking IOs
(using tools like iostat).
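
If iostat is not at hand, the same check can be approximated by diffing two snapshots of /proc/diskstats. The following is only a minimal sketch: the function names and the device names (sdb, sdc) are made up for illustration and are not from this thread. It assumes the standard diskstats layout, where the 6th and 10th fields are sectors read and sectors written in 512-byte units.

```python
import time


def parse_diskstats(text):
    """Return {device: (sectors_read, sectors_written)} from /proc/diskstats text."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 10:
            continue  # skip blank or truncated lines
        # fields[2] = device name, fields[5] = sectors read, fields[9] = sectors written
        stats[fields[2]] = (int(fields[5]), int(fields[9]))
    return stats


def io_delta_kb(before, after, devices):
    """kB read and written per device between two snapshots (1 sector = 512 bytes)."""
    out = {}
    for dev in devices:
        r0, w0 = before[dev]
        r1, w1 = after[dev]
        out[dev] = ((r1 - r0) * 512 / 1024, (w1 - w0) * 512 / 1024)
    return out


def sample(devices, interval=10):
    """Take two snapshots `interval` seconds apart and return the per-device deltas."""
    with open("/proc/diskstats") as f:
        before = parse_diskstats(f.read())
    time.sleep(interval)
    with open("/proc/diskstats") as f:
        after = parse_diskstats(f.read())
    return io_delta_kb(before, after, devices)
```

Calling sample(["sdb", "sdc"]) with the real member disk names substituted would show whether any of them moved data during the interval; all-zero deltas during the wait would suggest the raid disks are idle.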
Will report tomorrow (machine was restarted, so gotta wait 19+ hours
again until r5c_recovery_flush_log / processing gets its part of the job
completed).
Non-assembling raid issue aside, any idea why it is so inhumanly slow?
It's not really of much use in a production scenario in this state.
Following are the every-10-seconds stats from the journal device after the
assembly of the main raid started.
Device    tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
md125     3.00     3072.00         0.00      30720          0
md125     2.80     2867.20         0.00      28672          0
md125     2.10     2150.40         0.00      21504          0
md125     1.90     1945.60         0.00      19456          0
md125     2.00     1920.40         0.00      19204          0
md125     1.30     1331.20         0.00      13312          0
md125     1.50     1536.00         0.00      15360          0
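
To put a number on "slow", the samples above can be summarized with a few lines of Python. This is just a sketch over the seven rows quoted in this mail; the function name and the assumption that columns 2 and 3 are tps and kB_read/s (as in the iostat header) are mine.

```python
SAMPLES = """\
md125 3.00 3072.00 0.00 30720 0
md125 2.80 2867.20 0.00 28672 0
md125 2.10 2150.40 0.00 21504 0
md125 1.90 1945.60 0.00 19456 0
md125 2.00 1920.40 0.00 19204 0
md125 1.30 1331.20 0.00 13312 0
md125 1.50 1536.00 0.00 15360 0
"""


def summarize(text):
    """Return (mean kB_read/s, mean kB per transfer) over the sampled intervals."""
    rows = [line.split() for line in text.strip().splitlines()]
    tps = [float(r[1]) for r in rows]         # transfers per second
    read_rate = [float(r[2]) for r in rows]   # kB read per second
    mean_rate = sum(read_rate) / len(read_rate)
    kb_per_transfer = sum(read_rate) / sum(tps)
    return mean_rate, kb_per_transfer


mean_rate, kb_per_transfer = summarize(SAMPLES)
print(f"mean read rate: {mean_rate:.1f} kB/s")
print(f"mean size per transfer: {kb_per_transfer:.1f} kB")
```

Over these rows the journal is read at roughly 2.1 MB/s, in transfers of about 1 MB each, at only 1 to 3 transfers per second.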