Re: LVM-on-LVM: error while submitting device barriers

Patrick Plenefisch <simonpatp@xxxxxxxxx> · Sun, 10 Mar 2024 14:11:11 -0400

On Sun, Mar 10, 2024 at 11:27 AM Mike Snitzer <snitzer@xxxxxxxxxx> wrote:
>
> On Sun, Mar 10 2024 at  7:34P -0400,
> Ming Lei <ming.lei@xxxxxxxxxx> wrote:
>
> > On Sat, Mar 09, 2024 at 03:39:02PM -0500, Patrick Plenefisch wrote:
> > > On Wed, Mar 6, 2024 at 11:00 AM Ming Lei <ming.lei@xxxxxxxxxx> wrote:
> > > >
> > > > #!/usr/bin/bpftrace
> > > >
> > > > #ifndef BPFTRACE_HAVE_BTF
> > > > #include <linux/blkdev.h>
> > > > #endif
> > > >
> > > > kprobe:submit_bio_noacct,
> > > > kprobe:submit_bio
> > > > / (((struct bio *)arg0)->bi_opf & (1 << __REQ_PREFLUSH)) != 0 /
> > > > {
> > > >         $bio = (struct bio *)arg0;
> > > >         @submit_stack[arg0] = kstack;
> > > >         @tracked[arg0] = 1;
> > > > }
> > > >
> > > > kprobe:bio_endio
> > > > /@tracked[arg0] != 0/
> > > > {
> > > >         $bio = (struct bio *)arg0;
> > > >
> > > >         if (($bio->bi_flags & (1 << BIO_CHAIN)) && $bio->__bi_remaining.counter > 1) {
> > > >                 return;
> > > >         }
> > > >
> > > >         if ($bio->bi_status != 0) {
> > > >                 printf("dev %s bio failed %d, submitter %s completion %s\n",
> > > >                         $bio->bi_bdev->bd_disk->disk_name,
> > > >                         $bio->bi_status, @submit_stack[arg0], kstack);
> > > >         }
> > > >         delete(@submit_stack[arg0]);
> > > >         delete(@tracked[arg0]);
> > > > }
> > > >
> > > > END {
> > > >         clear(@submit_stack);
> > > >         clear(@tracked);
> > > > }
> > > >
> > >
> > > Attaching 4 probes...
> > > dev dm-77 bio failed 10, submitter
> > >        submit_bio_noacct+5
> > >        __send_duplicate_bios+358
> > >        __send_empty_flush+179
> > >        dm_submit_bio+857
> > >        __submit_bio+132
> > >        submit_bio_noacct_nocheck+345
> > >        write_all_supers+1718
> > >        btrfs_commit_transaction+2342
> > >        transaction_kthread+345
> > >        kthread+229
> > >        ret_from_fork+49
> > >        ret_from_fork_asm+27
> > > completion
> > >        bio_endio+5
> > >        dm_submit_bio+955
> > >        __submit_bio+132
> > >        submit_bio_noacct_nocheck+345
> > >        write_all_supers+1718
> > >        btrfs_commit_transaction+2342
> > >        transaction_kthread+345
> > >        kthread+229
> > >        ret_from_fork+49
> > >        ret_from_fork_asm+27
> > >
> > > dev dm-86 bio failed 10, submitter
> > >        submit_bio_noacct+5
> > >        write_all_supers+1718
> > >        btrfs_commit_transaction+2342
> > >        transaction_kthread+345
> > >        kthread+229
> > >        ret_from_fork+49
> > >        ret_from_fork_asm+27
> > > completion
> > >        bio_endio+5
> > >        clone_endio+295
> > >        clone_endio+295
> > >        process_one_work+369
> > >        worker_thread+635
> > >        kthread+229
> > >        ret_from_fork+49
> > >        ret_from_fork_asm+27
> > >
> > >
> > > For context, dm-86 is /dev/lvm/brokenDisk and dm-77 is /dev/lowerVG/lvmPool
> >
> > io_status is 10(BLK_STS_IOERR), which is produced in submission code path on
> > /dev/dm-77(/dev/lowerVG/lvmPool) first, so looks it is one device mapper issue.
> >
> > The error should be from the following code only:
> >
> > static void __map_bio(struct bio *clone)
> >
> >       ...
> >       if (r == DM_MAPIO_KILL)
> >               dm_io_dec_pending(io, BLK_STS_IOERR);
> >       else
> >               dm_io_dec_pending(io, BLK_STS_DM_REQUEUE);
> >     break;
>
> I agree that the above bpf stack traces for dm-77 indicate that
> dm_submit_bio failed, which would end up in the above branch if the
> target's ->map() returned DM_MAPIO_KILL or DM_MAPIO_REQUEUE.
>
> But such an early failure speaks to the flush bio never being
> submitted to the underlying storage. No?
>
> dm-raid.c:raid_map does return DM_MAPIO_REQUEUE with:
>
>         /*
>          * If we're reshaping to add disk(s)), ti->len and
>          * mddev->array_sectors will differ during the process
>          * (ti->len > mddev->array_sectors), so we have to requeue
>          * bios with addresses > mddev->array_sectors here or
>          * there will occur accesses past EOD of the component
>          * data images thus erroring the raid set.
>          */
>         if (unlikely(bio_end_sector(bio) > mddev->array_sectors))
>                 return DM_MAPIO_REQUEUE;
>
> But a flush doesn't have an end_sector (it'd be 0 afaik).. so it seems
> weird relative to a flush.
>
> > Patrick, you mentioned lvmPool is raid1, can you explain how lvmPool is
> > built? It is dm-raid1 target or over plain raid1 device which is
> > build over /dev/lowerVG?

LVM raid1:
lvcreate --type raid1 -m 1 ...

I had previously added raidintegrity and caching like
"lowerVG/single", but I had removed them in trying to root cause this
bug

>
> In my earlier reply I asked Patrick for both:
> lsblk
> dmsetup table

Oops, here they are, trimmed for relevance:

NAME
sdb
└─sdb2
  ├─lowerVG-single_corig_rmeta_1
  │ └─lowerVG-single_corig
  │   └─lowerVG-single
  ├─lowerVG-single_corig_rimage_1_imeta
  │ └─lowerVG-single_corig_rimage_1
  │   └─lowerVG-single_corig
  │     └─lowerVG-single
  ├─lowerVG-single_corig_rimage_1_iorig
  │ └─lowerVG-single_corig_rimage_1
  │   └─lowerVG-single_corig
  │     └─lowerVG-single
  ├─lowerVG-lvmPool_rmeta_0
  │ └─lowerVG-lvmPool
  │   ├─lvm-a
  │   └─lvm-brokenDisk
  ├─lowerVG-lvmPool_rimage_0
  │ └─lowerVG-lvmPool
  │   ├─lvm-a
  │   └─lvm-brokenDisk
sdc
└─sdc3
  ├─lowerVG-single_corig_rmeta_0
  │ └─lowerVG-single_corig
  │   └─lowerVG-single
  ├─lowerVG-single_corig_rimage_0_imeta
  │ └─lowerVG-single_corig_rimage_0
  │   └─lowerVG-single_corig
  │     └─lowerVG-single
  ├─lowerVG-single_corig_rimage_0_iorig
  │ └─lowerVG-single_corig_rimage_0
  │   └─lowerVG-single_corig
  │     └─lowerVG-single
sdd
└─sdd3
  ├─lowerVG-lvmPool_rmeta_1
  │ └─lowerVG-lvmPool
  │   ├─lvm-a
  │   └─lvm-brokenDisk
  └─lowerVG-lvmPool_rimage_1
    └─lowerVG-lvmPool
      ├─lvm-a
      └─lvm-brokenDisk
sdf
├─sdf2
│ ├─lowerVG-lvmPool_rimage_1
│ │ └─lowerVG-lvmPool
│ │   ├─lvm-a
│ │   └─lvm-brokenDisk

lowerVG-single: 0 5583462400 cache 254:32 254:31 254:71 128 2
metadata2 writethrough mq 0
lowerVG-singleCache_cvol: 0 104857600 linear 259:13 104859648
lowerVG-singleCache_cvol-cdata: 0 104775680 linear 254:30 81920
lowerVG-singleCache_cvol-cmeta: 0 81920 linear 254:30 0
lowerVG-single_corig: 0 5583462400 raid raid1 3 0 region_size 4096 2
254:33 254:36 254:67 254:70
lowerVG-single_corig_rimage_0: 0 5583462400 integrity 254:35 0 4 J 8
meta_device:254:34 recalculate journal_sectors:130944
interleave_sectors:1 buffer_sectors:128 journal_watermark:50
commit_time:10000 internal_hash:crc32c
lowerVG-single_corig_rimage_0_imeta: 0 44802048 linear 8:35 5152466944
lowerVG-single_corig_rimage_0_iorig: 0 4724465664 linear 8:35 427821056
lowerVG-single_corig_rimage_0_iorig: 4724465664 431005696 linear 8:35 5362001920
lowerVG-single_corig_rimage_0_iorig: 5155471360 427819008 linear 8:35 2048
lowerVG-single_corig_rimage_0_iorig: 5583290368 172032 linear 8:35 5152294912
lowerVG-single_corig_rimage_1: 0 5583462400 integrity 254:69 0 4 J 8
meta_device:254:68 recalculate journal_sectors:130944
interleave_sectors:1 buffer_sectors:128 journal_watermark:50
commit_time:10000 internal_hash:crc32c
lowerVG-single_corig_rimage_1_imeta: 0 44802048 linear 8:18 5583472640
lowerVG-single_corig_rimage_1_iorig: 0 5583462400 linear 8:18 10240
lowerVG-single_corig_rmeta_0: 0 8192 linear 8:35 5152286720
lowerVG-single_corig_rmeta_1: 0 8192 linear 8:18 2048
lowerVG-lvmPool: 0 6442450944 raid raid1 3 0 region_size 4096 2 254:73
254:74 254:75 254:76
lowerVG-lvmPool_rimage_0: 0 2967117824 linear 8:18 5628282880
lowerVG-lvmPool_rimage_0: 2967117824 59875328 linear 8:18 12070733824
lowerVG-lvmPool_rimage_0: 3026993152 3415457792 linear 8:18 8655276032
lowerVG-lvmPool_rimage_1: 0 2862260224 linear 8:51 10240
lowerVG-lvmPool_rimage_1: 2862260224 164732928 linear 8:82 3415459840
lowerVG-lvmPool_rimage_1: 3026993152 3415457792 linear 8:82 2048
lowerVG-lvmPool_rmeta_0: 0 8192 linear 8:18 5628274688
lowerVG-lvmPool_rmeta_1: 0 8192 linear 8:51 2048
lvm-a: 0 1468006400 linear 254:77 1310722048
lvm-brokenDisk: 0 1310720000 linear 254:77 2048
lvm-brokenDisk: 1310720000 83886080 linear 254:77 2778728448
lvm-brokenDisk: 1394606080 2015404032 linear 254:77 4427040768
lvm-brokenDisk: 3410010112 884957184 linear 254:77 2883586048

As a side note, is there a way to make lsblk only show things the
first time they come up?

>
> Picking over the described IO stacks provided earlier (or Goffredo's
> interpretation of it, via ascii art) isn't really a great way to see
> the IO stacks that are in use/question.
>
> > Mike, the logic in the following code doesn't change from v5.18-rc2 to
> > v5.19, but I still can't understand why STS_IOERR is set in
> > dm_io_complete() in case of BLK_STS_DM_REQUEUE && !__noflush_suspending(),
> > since DMF_NOFLUSH_SUSPENDING is only set in __dm_suspend() which
> > is supposed to not happen in Patrick's case.
> >
> > dm_io_complete()
> >       ...
> >       if (io->status == BLK_STS_DM_REQUEUE) {
> >               unsigned long flags;
> >               /*
> >                * Target requested pushing back the I/O.
> >                */
> >               spin_lock_irqsave(&md->deferred_lock, flags);
> >               if (__noflush_suspending(md) &&
> >                   !WARN_ON_ONCE(dm_is_zone_write(md, bio))) {
> >                       /* NOTE early return due to BLK_STS_DM_REQUEUE below */
> >                       bio_list_add_head(&md->deferred, bio);
> >               } else {
> >                       /*
> >                        * noflush suspend was interrupted or this is
> >                        * a write to a zoned target.
> >                        */
> >                       io->status = BLK_STS_IOERR;
> >               }
> >               spin_unlock_irqrestore(&md->deferred_lock, flags);
> >       }
>
> Given the reason from dm-raid.c:raid_map returning DM_MAPIO_REQUEUE
> I think the DM device could be suspending without flush.
>
> But regardless, given you logged BLK_STS_IOERR lets assume it isn't,
> the assumption that "noflush suspend was interrupted" seems like a
> stale comment -- especially given that target's like dm-raid are now
> using DM_MAPIO_REQUEUE without concern for the historic tight-coupling
> of noflush suspend (which was always the case for the biggest historic
> reason for this code: dm-multipath, see commit 2e93ccc1933d0 from
> 2006 -- predates my time with developing DM).
>
> So all said, this code seems flawed for dm-raid (and possibly other
> targets that return DM_MAPIO_REQUEUE).  I'll look closer this week.
>
> Mike