On Thu, Nov 14, 2024 at 11:07 PM Christian Theune <ct@xxxxxxxxxxxxxxx> wrote:
>
> Hi,
>
> just a followup: the system ran over 2 days without my workload being
> able to trigger the issue. I've seen there is another thread where this
> patch wasn't sufficient, and if I understand correctly, Yu and Xiao are
> working on an amalgamated fix?
>
> Christian

Hi Christian,

Besides the stuck-bitmap problem, the other thread has a new problem. But
it looks like you don't have the new problem, because you already ran
without failure for 2 days. I'll send patches against 6.13 and 6.11.

Regards
Xiao

> > On 12. Nov 2024, at 07:57, Christian Theune <ct@xxxxxxxxxxxxxxx> wrote:
> >
> > Hi,
> >
> > my workload has been running successfully for 22 hours now - it seems
> > that the patch works.
> >
> > If this gets accepted, then I'd kindly ask for an LTS backport to 6.6.
> >
> > Thanks to everyone for helping to figure this out and fix it!
> >
> > Christian
> >
> >> On 11. Nov 2024, at 15:34, Christian Theune <ct@xxxxxxxxxxxxxxx> wrote:
> >>
> >> I've been running with my workflow that is known to trigger the issue
> >> reliably for about 6 hours now. This is longer than it lasted before.
> >> I'm leaving the office for today, will leave things running over night,
> >> and will report back tomorrow.
> >>
> >> Christian
> >>
> >>> On 11. Nov 2024, at 09:00, Christian Theune <ct@xxxxxxxxxxxxxxx> wrote:
> >>>
> >>> Hi,
> >>>
> >>> I'm trying this with 6.11.7 today.
> >>>
> >>> Christian
> >>>
> >>>> On 9. Nov 2024, at 12:35, Xiao Ni <xni@xxxxxxxxxx> wrote:
> >>>>
> >>>> On Thu, Nov 7, 2024 at 3:55 PM Yu Kuai <yukuai1@xxxxxxxxxxxxxxx> wrote:
> >>>>>
> >>>>> Hi!
> >>>>>
> >>>>> On 2024/11/06 14:40, Christian Theune wrote:
> >>>>>> Hi,
> >>>>>>
> >>>>>>> On 6. Nov 2024, at 07:35, Yu Kuai <yukuai1@xxxxxxxxxxxxxxx> wrote:
> >>>>>>>
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> On 2024/11/05 18:15, Christian Theune wrote:
> >>>>>>>> Hi,
> >>>>>>>> after about 2 hours it stalled again. Here's the full blocked-process
> >>>>>>>> dump. (Tell me if this isn't helpful; otherwise I'll keep posting it,
> >>>>>>>> as it's the only real data I can show.)
> >>>>>>>
> >>>>>>> This is bad news :(
> >>>>>>
> >>>>>> Yeah. But: the good news is that we aren't eating any data so far … ;)
> >>>>>>
> >>>>>>> While reviewing the related code, I came up with a plan to move the
> >>>>>>> bitmap start/end write ops to the upper layer: make sure each write IO
> >>>>>>> from the upper layer starts once and ends once. That makes it easy to
> >>>>>>> keep the calls balanced, and it avoids many calls, which should improve
> >>>>>>> performance as well.
> >>>>>>
> >>>>>> Sounds like a plan!
> >>>>>>
> >>>>>>> However, I need a few days to cook up a patch after work.
> >>>>>>
> >>>>>> Sure thing! I'll switch off bitmaps for that time - I'm happy we found a
> >>>>>> workaround so we can take time to resolve it cleanly. :)
> >>>>>
> >>>>> I wrote a simple and crude version, please give it a test again.
> >>>>>
> >>>>> Thanks,
> >>>>> Kuai
> >>>>>
> >>>>> diff --git a/drivers/md/md.c b/drivers/md/md.c
> >>>>> index d3a837506a36..5e1a82b79e41 100644
> >>>>> --- a/drivers/md/md.c
> >>>>> +++ b/drivers/md/md.c
> >>>>> @@ -8753,6 +8753,30 @@ void md_submit_discard_bio(struct mddev *mddev, struct md_rdev *rdev,
> >>>>>  }
> >>>>>  EXPORT_SYMBOL_GPL(md_submit_discard_bio);
> >>>>>
> >>>>> +static bool is_raid456(struct mddev *mddev)
> >>>>> +{
> >>>>> +	return mddev->pers->level == 4 || mddev->pers->level == 5 ||
> >>>>> +	       mddev->pers->level == 6;
> >>>>> +}
> >>>>> +
> >>>>> +static void bitmap_startwrite(struct mddev *mddev, struct bio *bio)
> >>>>> +{
> >>>>> +	if (!is_raid456(mddev) || !mddev->bitmap)
> >>>>> +		return;
> >>>>> +
> >>>>> +	md_bitmap_startwrite(mddev->bitmap, bio_offset(bio), bio_sectors(bio),
> >>>>> +			     0);
> >>>>> +}
> >>>>> +
> >>>>> +static void bitmap_endwrite(struct mddev *mddev, struct bio *bio, sector_t sectors)
> >>>>> +{
> >>>>> +	if (!is_raid456(mddev) || !mddev->bitmap)
> >>>>> +		return;
> >>>>> +
> >>>>> +	md_bitmap_endwrite(mddev->bitmap, bio_offset(bio), sectors,
> >>>>> +			   bio->bi_status == BLK_STS_OK, 0);
> >>>>> +}
> >>>>> +
> >>>>>  static void md_end_clone_io(struct bio *bio)
> >>>>>  {
> >>>>>  	struct md_io_clone *md_io_clone = bio->bi_private;
> >>>>> @@ -8765,6 +8789,7 @@ static void md_end_clone_io(struct bio *bio)
> >>>>>  	if (md_io_clone->start_time)
> >>>>>  		bio_end_io_acct(orig_bio, md_io_clone->start_time);
> >>>>>
> >>>>> +	bitmap_endwrite(mddev, orig_bio, md_io_clone->sectors);
> >>>>>  	bio_put(bio);
> >>>>>  	bio_endio(orig_bio);
> >>>>>  	percpu_ref_put(&mddev->active_io);
> >>>>> @@ -8778,6 +8803,7 @@ static void md_clone_bio(struct mddev *mddev, struct bio **bio)
> >>>>>  		bio_alloc_clone(bdev, *bio, GFP_NOIO, &mddev->io_clone_set);
> >>>>>
> >>>>>  	md_io_clone = container_of(clone, struct md_io_clone, bio_clone);
> >>>>> +	md_io_clone->sectors = bio_sectors(*bio);
> >>>>>  	md_io_clone->orig_bio = *bio;
> >>>>>  	md_io_clone->mddev = mddev;
> >>>>>  	if (blk_queue_io_stat(bdev->bd_disk->queue))
> >>>>> @@ -8790,6 +8816,7 @@ static void md_clone_bio(struct mddev *mddev, struct bio **bio)
> >>>>>
> >>>>>  void md_account_bio(struct mddev *mddev, struct bio **bio)
> >>>>>  {
> >>>>> +	bitmap_startwrite(mddev, *bio);
> >>>>>  	percpu_ref_get(&mddev->active_io);
> >>>>>  	md_clone_bio(mddev, bio);
> >>>>>  }
> >>>>> @@ -8807,6 +8834,8 @@ void md_free_cloned_bio(struct bio *bio)
> >>>>>  	if (md_io_clone->start_time)
> >>>>>  		bio_end_io_acct(orig_bio, md_io_clone->start_time);
> >>>>>
> >>>>> +	bitmap_endwrite(mddev, orig_bio, md_io_clone->sectors);
> >>>>> +
> >>>>>  	bio_put(bio);
> >>>>>  	percpu_ref_put(&mddev->active_io);
> >>>>>  }
> >>>>> diff --git a/drivers/md/md.h b/drivers/md/md.h
> >>>>> index a0d6827dced9..0c2794230e0a 100644
> >>>>> --- a/drivers/md/md.h
> >>>>> +++ b/drivers/md/md.h
> >>>>> @@ -837,6 +837,7 @@ struct md_io_clone {
> >>>>>  	struct mddev *mddev;
> >>>>>  	struct bio *orig_bio;
> >>>>>  	unsigned long start_time;
> >>>>> +	sector_t sectors;
> >>>>>  	struct bio bio_clone;
> >>>>>  };
> >>>>> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> >>>>> index c14cf2410365..4f009e32f68a 100644
> >>>>> --- a/drivers/md/raid5.c
> >>>>> +++ b/drivers/md/raid5.c
> >>>>> @@ -3561,12 +3561,6 @@ static void __add_stripe_bio(struct stripe_head *sh, struct bio *bi,
> >>>>>  	 * is added to a batch, STRIPE_BIT_DELAY cannot be changed
> >>>>>  	 * any more.
> >>>>>  	 */
> >>>>> -	set_bit(STRIPE_BITMAP_PENDING, &sh->state);
> >>>>> -	spin_unlock_irq(&sh->stripe_lock);
> >>>>> -	md_bitmap_startwrite(conf->mddev->bitmap, sh->sector,
> >>>>> -			     RAID5_STRIPE_SECTORS(conf), 0);
> >>>>> -	spin_lock_irq(&sh->stripe_lock);
> >>>>> -	clear_bit(STRIPE_BITMAP_PENDING, &sh->state);
> >>>>>  	if (!sh->batch_head) {
> >>>>>  		sh->bm_seq = conf->seq_flush+1;
> >>>>>  		set_bit(STRIPE_BIT_DELAY, &sh->state);
> >>>>> @@ -3621,7 +3615,6 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
> >>>>>  	BUG_ON(sh->batch_head);
> >>>>>  	for (i = disks; i--; ) {
> >>>>>  		struct bio *bi;
> >>>>> -		int bitmap_end = 0;
> >>>>>
> >>>>>  		if (test_bit(R5_ReadError, &sh->dev[i].flags)) {
> >>>>>  			struct md_rdev *rdev = conf->disks[i].rdev;
> >>>>> @@ -3646,8 +3639,6 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
> >>>>>  		sh->dev[i].towrite = NULL;
> >>>>>  		sh->overwrite_disks = 0;
> >>>>>  		spin_unlock_irq(&sh->stripe_lock);
> >>>>> -		if (bi)
> >>>>> -			bitmap_end = 1;
> >>>>>
> >>>>>  		log_stripe_write_finished(sh);
> >>>>> @@ -3662,10 +3653,6 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
> >>>>>  			bio_io_error(bi);
> >>>>>  			bi = nextbi;
> >>>>>  		}
> >>>>> -		if (bitmap_end)
> >>>>> -			md_bitmap_endwrite(conf->mddev->bitmap, sh->sector,
> >>>>> -					   RAID5_STRIPE_SECTORS(conf), 0, 0);
> >>>>> -		bitmap_end = 0;
> >>>>>  		/* and fail all 'written' */
> >>>>>  		bi = sh->dev[i].written;
> >>>>>  		sh->dev[i].written = NULL;
> >>>>> @@ -3674,7 +3661,6 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
> >>>>>  			sh->dev[i].page = sh->dev[i].orig_page;
> >>>>>  		}
> >>>>>
> >>>>> -		if (bi) bitmap_end = 1;
> >>>>>  		while (bi && bi->bi_iter.bi_sector <
> >>>>>  		       sh->dev[i].sector + RAID5_STRIPE_SECTORS(conf)) {
> >>>>>  			struct bio *bi2 = r5_next_bio(conf, bi, sh->dev[i].sector);
> >>>>> @@ -3708,9 +3694,6 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
> >>>>>  			bi = nextbi;
> >>>>>  		}
> >>>>>  	}
> >>>>> -	if (bitmap_end)
> >>>>> -		md_bitmap_endwrite(conf->mddev->bitmap, sh->sector,
> >>>>> -				   RAID5_STRIPE_SECTORS(conf), 0, 0);
> >>>>>  	/* If we were in the middle of a write the parity block might
> >>>>>  	 * still be locked - so just clear all R5_LOCKED flags
> >>>>>  	 */
> >>>>> @@ -4059,10 +4042,6 @@ static void handle_stripe_clean_event(struct r5conf *conf,
> >>>>>  					bio_endio(wbi);
> >>>>>  					wbi = wbi2;
> >>>>>  				}
> >>>>> -				md_bitmap_endwrite(conf->mddev->bitmap, sh->sector,
> >>>>> -						   RAID5_STRIPE_SECTORS(conf),
> >>>>> -						   !test_bit(STRIPE_DEGRADED, &sh->state),
> >>>>> -						   0);
> >>>>>  				if (head_sh->batch_head) {
> >>>>>  					sh = list_first_entry(&sh->batch_list,
> >>>>>  							      struct stripe_head,
> >>>>> @@ -5788,13 +5767,6 @@ static void make_discard_request(struct mddev *mddev, struct bio *bi)
> >>>>>  		}
> >>>>>  		spin_unlock_irq(&sh->stripe_lock);
> >>>>>  		if (conf->mddev->bitmap) {
> >>>>> -			for (d = 0;
> >>>>> -			     d < conf->raid_disks - conf->max_degraded;
> >>>>> -			     d++)
> >>>>> -				md_bitmap_startwrite(mddev->bitmap,
> >>>>> -						     sh->sector,
> >>>>> -						     RAID5_STRIPE_SECTORS(conf),
> >>>>> -						     0);
> >>>>>  			sh->bm_seq = conf->seq_flush + 1;
> >>>>>  			set_bit(STRIPE_BIT_DELAY, &sh->state);
> >>>>>  		}
> >>>>>
> >>>>>
> >>>>>
> >>>>>>
> >>>>>> Thanks a lot for your help!
> >>>>>> Christian
> >>>>>>
> >>>>>
> >>>>>
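(Aside for anyone following along: below is a minimal userspace sketch, purely
hypothetical and not kernel code, of the invariant the patch above tries to
establish. Every original write bio should take exactly one bitmap "start"
when it enters md and drop exactly one "end" when its clone completes, no
matter how many stripe_heads raid5 splits it across; if the per-stripe calls
get unbalanced, the in-flight counter never returns to zero and the bitmap
accounting gets stuck.)

#include <assert.h>
#include <stdio.h>

#define NR_CHUNKS 16

/* One in-flight write counter per bitmap chunk, as md-bitmap keeps internally. */
static int pending[NR_CHUNKS];

static void bitmap_start(int chunk)
{
	pending[chunk]++;		/* bit must stay dirty while > 0 */
}

static void bitmap_end(int chunk)
{
	assert(pending[chunk] > 0);	/* an unbalanced end is a bug */
	if (--pending[chunk] == 0)
		printf("chunk %d idle, bit can be cleared later\n", chunk);
}

int main(void)
{
	/*
	 * New scheme: one start when the original bio enters md
	 * (md_account_bio()), one end when the clone completes
	 * (md_end_clone_io()/md_free_cloned_bio()), so the counter
	 * always returns to zero no matter how the stripes are split,
	 * batched or retried in between.
	 */
	bitmap_start(3);	/* bio enters md */
	/* ... raid5 handles any number of stripe_heads here ... */
	bitmap_end(3);		/* clone completes */
	return 0;
}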
> >>>>
> >>>> Hi Kuai,
> >>>>
> >>>> Maybe it's not good to move the bitmap operations from raid5 into md
> >>>> when the new API is only used by raid5. And the bitmap region which
> >>>> raid5 needs to handle is based on the member disk, so it should be
> >>>> calculated rather than taken from the bio address space, because the
> >>>> bio address space is for the whole array.
> >>>>
> >>>> We have a customer who reported a similar problem. There is a patch
> >>>> from David; I put it in the attachment.
> >>>>
> >>>> @Christian, can you have a try with the patch? It can be applied
> >>>> cleanly on 6.11-rc6.
> >>>>
> >>>> Regards
> >>>> Xiao
> >>>> <md_raid5_one_bitmap_claim_per_stripe_head.patch>
>
> Kind regards,
> Christian Theune
>
> --
> Christian Theune · ct@xxxxxxxxxxxxxxx · +49 345 219401 0
> Flying Circus Internet Operations GmbH · https://flyingcircus.io
> Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland
> HR Stendal HRB 21169 · Managing directors: Christian Theune, Christian Zagrodnick
>
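(On Xiao's address-space point above: a bio carries logical sectors in the
whole-array address space, while raid5's bitmap calls use per-member-disk
stripe sectors such as sh->sector. A simplified, hypothetical sketch of the
core arithmetic of raid5_compute_sector(), leaving out parity rotation and
layout variants, shows why the two differ:)

#include <stdio.h>

/*
 * Map a logical array sector to a sector on a member disk for raid5
 * with the given number of data disks and chunk size in sectors.
 * Simplified illustration only; the real code also computes which
 * data/parity device the sector lands on.
 */
static unsigned long long array_to_dev_sector(unsigned long long logical,
					      unsigned int data_disks,
					      unsigned int chunk_sectors)
{
	unsigned long long chunk = logical / chunk_sectors;	/* chunk index in the array */
	unsigned long long offset = logical % chunk_sectors;	/* offset inside the chunk */
	unsigned long long stripe = chunk / data_disks;		/* stripe index */

	return stripe * chunk_sectors + offset;			/* sector on the member disk */
}

int main(void)
{
	/* e.g. a 4-disk raid5 => 3 data disks; 512 KiB chunk = 1024 sectors */
	unsigned long long logical = 5000;

	printf("array sector %llu -> device sector %llu\n",
	       logical, array_to_dev_sector(logical, 3, 1024));
	return 0;
}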