Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected

Thanks for the information!


On Tue, Jan 23, 2024 at 3:58 PM Dan Moulding <dan@xxxxxxxx> wrote:
>
> > It appears the md thread has hit an infinite loop, so I would like to
> > know what it is doing. We can probably get the information with the
> > perf tool, something like:
> >
> > perf record -a
> > perf report
>
> Here you go!
>
> # Total Lost Samples: 0
> #
> # Samples: 78K of event 'cycles'
> # Event count (approx.): 83127675745
> #
> # Overhead  Command          Shared Object                   Symbol
> # ........  ...............  ..............................  ...................................................
> #
>     49.31%  md0_raid5        [kernel.kallsyms]               [k] handle_stripe
>     18.63%  md0_raid5        [kernel.kallsyms]               [k] ops_run_io
>      6.07%  md0_raid5        [kernel.kallsyms]               [k] handle_active_stripes.isra.0
>      5.50%  md0_raid5        [kernel.kallsyms]               [k] do_release_stripe
>      3.09%  md0_raid5        [kernel.kallsyms]               [k] _raw_spin_lock_irqsave
>      2.48%  md0_raid5        [kernel.kallsyms]               [k] r5l_write_stripe
>      1.89%  md0_raid5        [kernel.kallsyms]               [k] md_wakeup_thread
>      1.45%  ksmd             [kernel.kallsyms]               [k] ksm_scan_thread
>      1.37%  md0_raid5        [kernel.kallsyms]               [k] stripe_is_lowprio
>      0.87%  ksmd             [kernel.kallsyms]               [k] memcmp
>      0.68%  ksmd             [kernel.kallsyms]               [k] xxh64
>      0.56%  md0_raid5        [kernel.kallsyms]               [k] __wake_up_common
>      0.52%  md0_raid5        [kernel.kallsyms]               [k] __wake_up
>      0.46%  ksmd             [kernel.kallsyms]               [k] mtree_load
>      0.44%  ksmd             [kernel.kallsyms]               [k] try_grab_page
>      0.40%  ksmd             [kernel.kallsyms]               [k] follow_p4d_mask.constprop.0
>      0.39%  md0_raid5        [kernel.kallsyms]               [k] r5l_log_disk_error
>      0.37%  md0_raid5        [kernel.kallsyms]               [k] _raw_spin_lock_irq
>      0.33%  md0_raid5        [kernel.kallsyms]               [k] release_stripe_list
>      0.31%  md0_raid5        [kernel.kallsyms]               [k] release_inactive_stripe_list

It appears the thread is indeed doing something. I haven't had any luck
reproducing this on my hosts. Could you please try whether the following
change fixes the issue (without reverting 0de40f76d567)? I will keep
trying to reproduce the issue on my side.

Junxiao,

Please also help look into this.

Thanks,
Song
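
(The diff that followed here was not preserved in this archive copy.
Commit 0de40f76d567 is the revert of "md/raid5: Wait for
MD_SB_CHANGE_PENDING in raid5d", so the change under test was
presumably some way to stop raid5d from spinning while that flag is
set. The hunk below is an illustrative sketch of such a change, not
the original patch:)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ static void raid5d(struct md_thread *thread)
 		spin_unlock_irq(&conf->device_lock);
 		md_check_recovery(mddev);
 		spin_lock_irq(&conf->device_lock);
+
+		/*
+		 * Illustrative sketch: if a superblock update is still
+		 * pending, stop handling stripes for now instead of
+		 * spinning in handle_stripe() until the flag clears.
+		 */
+		if (test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags))
+			break;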