Re: Linux RAID with btrfs stuck and consume 100 % CPU

Song Liu <songliubraving@xxxxxx> · Thu, 30 Jul 2020 06:45:04 +0000

> On Jul 29, 2020, at 2:06 PM, Guoqing Jiang <guoqing.jiang@xxxxxxxxxxxxxxx> wrote:
> 
> Hi,
> 
> On 7/22/20 10:47 PM, Vojtech Myslivec wrote:
>> 1. What should be the cause of this problem?
> 
> Just a quick glance based on the stacks which you attached, I guess it could be
> a deadlock issue of raid5 cache super write.
> 
> Maybe the commit 8e018c21da3f ("raid5-cache: fix a deadlock in superblock
> write") didn't fix the problem completely.  Cc Song.
> 
> And I am curious why md thread is not waked if mddev_trylock fails, you can
> give it a try but I can't promise it helps ...
> 
> --- a/drivers/md/raid5-cache.c
> +++ b/drivers/md/raid5-cache.c
> @@ -1337,8 +1337,10 @@ static void r5l_write_super_and_discard_space(struct r5l_log *log,
>          */
>         set_mask_bits(&mddev->sb_flags, 0,
>                 BIT(MD_SB_CHANGE_DEVS) | BIT(MD_SB_CHANGE_PENDING));
> -       if (!mddev_trylock(mddev))
> +       if (!mddev_trylock(mddev)) {
> +               md_wakeup_thread(mddev->thread);
>                 return;
> +       }
>         md_update_sb(mddev, 1);
>         mddev_unlock(mddev);
> 

Thanks Guoqing!

I am not sure we are hitting the mddev_trylock() failure path here. It looks
like the md1_raid6 thread is already running at 100% CPU.

A few questions:

1. I see wbt_wait in the stack trace. Is writeback throttling (wbt) in use here?
2. Could you please capture /proc/<pid>/stack, where <pid> is the PID of
   md1_raid6? It would help to sample it several times.
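
In case it helps, here is a rough sketch for collecting both. It assumes a
POSIX shell and root access, and that the wbt_lat_usec queue attribute is
present (it only shows up when the kernel is built with writeback throttling;
a value of 0 means it is disabled). PROC_ROOT and SYS_ROOT are made-up
overrides for trying the functions against a scratch directory, and sda/sdb
below are placeholders for the real member disks.

```shell
# Sketch only, not a polished tool.

# Print the wbt latency target for each block device named on the command line.
show_wbt() {
    for dev in "$@"; do
        f="${SYS_ROOT:-/sys}/block/$dev/queue/wbt_lat_usec"
        if [ -r "$f" ]; then
            echo "$dev: wbt_lat_usec=$(cat "$f")"
        else
            echo "$dev: wbt_lat_usec not available"
        fi
    done
}

# Dump the kernel stack of one task <count> times, <interval> seconds apart.
sample_stack() {
    pid=$1
    count=${2:-10}
    interval=${3:-1}
    i=1
    while [ "$i" -le "$count" ]; do
        echo "=== sample $i ==="
        cat "${PROC_ROOT:-/proc}/$pid/stack"
        i=$((i + 1))
        if [ "$i" -le "$count" ]; then
            sleep "$interval"
        fi
    done
}

# Example (as root; device names are placeholders):
#   show_wbt sda sdb
#   sample_stack "$(pgrep -x md1_raid6)" 10 1
```

If the samples all show the same frames, that points at a real stall rather
than a busy but progressing thread.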

Thanks,
Song