On 6/9/20 11:51 PM, Michal Soltys wrote:
On 20/06/09 20:36, Song Liu wrote:
On Tue, Jun 9, 2020 at 2:36 AM Michal Soltys <msoltyspl@xxxxxxxxx> wrote:
On 6/5/20 2:26 PM, Michal Soltys wrote:
> On 6/4/20 12:07 AM, Song Liu wrote:
>>
>> The hang happens at the expected place.
>>
>>> [Jun 3 09:02] INFO: task mdadm:2858 blocked for more than 120 seconds.
>>> [ +0.060545] Tainted: G E 5.4.19-msl-00001-gbf39596faf12 #2
>>> [ +0.062932] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>
>> Could you please try disabling the timeout message with
>>
>> echo 0 > /proc/sys/kernel/hung_task_timeout_secs
>>
>> And during this wait (after the message
>> "r5c_recovery_flush_data_only_stripes before wait_event"),
>> check whether the raid disks (not the journal disk) are taking IOs
>> (using tools like iostat).
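
A minimal sketch of such a check (assuming the component disks from the
mdadm command further below, and iostat from the sysstat package):

    # extended per-device stats every 2 seconds; r/s and w/s staying
    # at zero means the raid members are taking no IO
    iostat -x sdg sdh sdi sdj 2
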
>>
>
> No activity on component drives.
To expand on that - while there is no I/O activity whatsoever on the
component drives (nor on the journal), the CPU is of course still
fully loaded (5 days so far):
UID    PID   PPID  C   SZ   RSS  PSR  STIME  TTY         TIME  CMD
root  8129   6755 15  740  1904   10  Jun04  pts/2   17:42:34  mdadm -A /dev/md/r5_big /dev/md/r1_journal_big /dev/sdj1 /dev/sdi1 /dev/sdg1 /dev/sdh1
root  8147      2 84    0     0   30  Jun04  ?     4-02:09:47  [md124_raid5]
I guess the md thread is stuck at some stripe. Does the kernel have
CONFIG_DYNAMIC_DEBUG enabled? If so, could you please try enabling
some pr_debug() calls in the function handle_stripe()?
Thanks,
Song
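
A minimal sketch of enabling those pr_debug() sites at runtime (assuming
CONFIG_DYNAMIC_DEBUG=y and debugfs mounted at /sys/kernel/debug):

    # turn on every pr_debug() call in handle_stripe(); use -p to turn
    # them off again once the data has been collected
    echo 'func handle_stripe +p' > /sys/kernel/debug/dynamic_debug/control
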
Massive spam in dmesg with messages like these:
[464836.603033] handling stripe 1551540328, state=0x41 cnt=1, pd_idx=3, qd_idx=-1, check:0, reconstruct:0
[464836.603036] handling stripe 1551540336, state=0x41 cnt=1, pd_idx=3, qd_idx=-1, check:0, reconstruct:0
[464836.603038] handling stripe 1551540344, state=0x41 cnt=1, pd_idx=3, qd_idx=-1, check:0, reconstruct:0
<cut>
So what should be the next step in debugging/fixing this?