On Wed, Sep 16, 2020 at 3:42 AM Vojtech Myslivec <vojtech@xxxxxxxxxxxx> wrote:
>
> Hello,
>
> it seems my last e-mail was filtered, as I can't find it in the archives.
> So I will resend it and include all attachments in one tarball.
>
>
> On 26. 08. 20 20:07, Chris Murphy wrote:
> > OK so from the attachments..
> >
> > cat /proc/<pid>/stack for md1_raid6
> >
> > [<0>] rq_qos_wait+0xfa/0x170
> > [<0>] wbt_wait+0x98/0xe0
> > [<0>] __rq_qos_throttle+0x23/0x30
> > [<0>] blk_mq_make_request+0x12a/0x5d0
> > [<0>] generic_make_request+0xcf/0x310
> > [<0>] submit_bio+0x42/0x1c0
> > [<0>] md_update_sb.part.71+0x3c0/0x8f0 [md_mod]
> > [<0>] r5l_do_reclaim+0x32a/0x3b0 [raid456]
> > [<0>] md_thread+0x94/0x150 [md_mod]
> > [<0>] kthread+0x112/0x130
> > [<0>] ret_from_fork+0x22/0x40
> >
> >
> > Btrfs snapshot flushing might instigate the problem, but it seems to me
> > there's some kind of contention or blocking happening within md, and
> > that's why everything stalls. But I can't tell why.
> >
> > Do you have any iostat output at the time of this problem? I'm
> > wondering if md is waiting on disks. If not, try `iostat -dxm 5` and
> > share a few minutes before and after the freeze/hang.
>
> We detected the issue on Monday 31.09.2020 at 15:24. It must have
> happened sometime between 15:22 and 15:24, as we monitor the state every
> 2 minutes.
>
> We recorded the stacks of blocked processes, the sysrq+w output, and the
> requested `iostat`. Then at 15:45 we performed the manual "unstuck"
> procedure by accessing the md1 device via a dd command (reading a few
> random blocks).
>
> I hope the attached file names are self-explanatory.
>
> Please let me know if we can do anything more to track down the issue or
> if I forgot something.
>
> Thanks a lot,
> Vojtech and Michal
>
>
> Description of the devices in iostat, just as a recap:
> - sda-sdf: 6 HDD disks
> - sdg, sdh: 2 SSD disks
>
> - md0: raid1 over sdg1 and sdh1 ("SSD RAID", Physical Volume for LVM)
> - md1: our "problematic" raid6 over sda-sdf ("HDD RAID", btrfs formatted)
>
> - Logical volumes over the md0 Physical Volume (on the SSD RAID):
>   - dm-0: 4G LV for swap
>   - dm-1: 16G LV for the root file system (ext4 formatted)
>   - dm-2: 1G LV for the md1 journal

It's kind of a complicated setup. When this problem happens, can you check
swap pressure? Look at pgfault, and maybe also pgmajfault, in
/sys/fs/cgroup/memory.stat and see if they're going up; you can also look
at vmstat and see how heavily swap is being used at the time.

The thing is, any heavy eviction means writes to dm-0 -> md0 raid1 ->
sdg+sdh SSDs, which are the same SSDs the md1 raid6 mdadm journal goes to.
So if you have any kind of swap pressure, it very likely stalls the
journal, or at least slows it down substantially, and you get blocked
tasks as the pressure builds more and more, because now you have a ton of
dirty writes in Btrfs that can't make it to disk.

If there is minimal swap usage, then this hypothesis is false and
something else is going on. I also don't have an explanation for why your
workaround works.

--
Chris Murphy
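
For reference, a minimal sketch of those checks (the cgroup path below is
only an assumption; point it at whichever cgroup or slice the workload
actually runs in):

    # sample the fault counters; steadily rising pgmajfault suggests real
    # memory pressure (adjust the memory.stat path to your cgroup layout)
    watch -n 5 "grep -E '^(pgfault|pgmajfault) ' /sys/fs/cgroup/system.slice/memory.stat"

    # overall swap traffic: the si/so columns show swap-in/swap-out per second
    vmstat 5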