On Wed, Jun 17, 2020 at 5:53 AM Michal Soltys <msoltyspl@xxxxxxxxx> wrote: > > On 6/17/20 2:48 AM, Song Liu wrote: > > On Tue, Jun 9, 2020 at 2:51 PM Michal Soltys <msoltyspl@xxxxxxxxx> wrote: > >> > >> On 20/06/09 20:36, Song Liu wrote: > >>> On Tue, Jun 9, 2020 at 2:36 AM Michal Soltys <msoltyspl@xxxxxxxxx> wrote: > >>>> > >>>> On 6/5/20 2:26 PM, Michal Soltys wrote: > >>>>> On 6/4/20 12:07 AM, Song Liu wrote: > >>>>>> > >>>>>> The hang happens at expected place. > >>>>>> > >>>>>>> [Jun 3 09:02] INFO: task mdadm:2858 blocked for more than 120 seconds. > >>>>>>> [ +0.060545] Tainted: G E > >>>>>>> 5.4.19-msl-00001-gbf39596faf12 #2 > >>>>>>> [ +0.062932] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" > >>>>>>> disables this message. > >>>>>> > >>>>>> Could you please try disable the timeout message with > >>>>>> > >>>>>> echo 0 > /proc/sys/kernel/hung_task_timeout_secs > >>>>>> > >>>>>> And during this wait (after message > >>>>>> "r5c_recovery_flush_data_only_stripes before wait_event"), > >>>>>> checks whether the raid disks (not the journal disk) are taking IOs > >>>>>> (using tools like iostat). > >>>>>> > >>>>> > >>>>> No activity on component drives. > >>>> > >>>> To expand on that - while there is no i/o activity whatsoever at the component drives (as well as journal), the cpu is of course still fully loaded (5 days so far): > >>>> > >>>> UID PID PPID C SZ RSS PSR STIME TTY TIME CMD > >>>> root 8129 6755 15 740 1904 10 Jun04 pts/2 17:42:34 mdadm -A /dev/md/r5_big /dev/md/r1_journal_big /dev/sdj1 /dev/sdi1 /dev/sdg1 /dev/sdh1 > >>>> root 8147 2 84 0 0 30 Jun04 ? 4-02:09:47 [md124_raid5] > >>> > >>> I guess the md thread stuck at some stripe. Does the kernel have > >>> CONFIG_DYNAMIC_DEBUG enabled? If so, could you please try enable some pr_debug() > >>> in function handle_stripe()? > >>> > >>> Thanks, > >>> Song > >>> > >> > >> Massive spam in dmesg with messages like these: > >> > >> [464836.603033] handling stripe 1551540328, state=0x41 cnt=1, pd_idx=3, > >> qd_idx=-1 > >> , check:0, reconstruct:0 > >> <cut> > >> > > > > I am really sorry for the delay. A few questions: > > > > No worries. > > > 1. There are two pr_debug() calls in handle_stripe(): > > pr_debug("handling stripe %llu, state=%#lx cnt=%d, " > > pr_debug("locked=%d uptodate=%d to_read=%d" > > > > Did you enable all of them? Or only the first one? > > I enabled all of them (I think), to be precise: > > echo -n 'func handle_stripe +p' >/sys/kernel/debug/dynamic_debug/control > > Haven't seen any `locked` lines though. That's a little weird, and probably explains why we stuck. Could you please try the attached patch? Thanks, Song
Attachment:
0001-Debug-raid5-recovery-hang.patch
Description: Binary data