On Sun, Mar 10, 2024 at 11:27 AM Mike Snitzer <snitzer@xxxxxxxxxx> wrote: > > On Sun, Mar 10 2024 at 7:34P -0400, > Ming Lei <ming.lei@xxxxxxxxxx> wrote: > > > On Sat, Mar 09, 2024 at 03:39:02PM -0500, Patrick Plenefisch wrote: > > > On Wed, Mar 6, 2024 at 11:00 AM Ming Lei <ming.lei@xxxxxxxxxx> wrote: > > > > > > > > #!/usr/bin/bpftrace > > > > > > > > #ifndef BPFTRACE_HAVE_BTF > > > > #include <linux/blkdev.h> > > > > #endif > > > > > > > > kprobe:submit_bio_noacct, > > > > kprobe:submit_bio > > > > / (((struct bio *)arg0)->bi_opf & (1 << __REQ_PREFLUSH)) != 0 / > > > > { > > > > $bio = (struct bio *)arg0; > > > > @submit_stack[arg0] = kstack; > > > > @tracked[arg0] = 1; > > > > } > > > > > > > > kprobe:bio_endio > > > > /@tracked[arg0] != 0/ > > > > { > > > > $bio = (struct bio *)arg0; > > > > > > > > if (($bio->bi_flags & (1 << BIO_CHAIN)) && $bio->__bi_remaining.counter > 1) { > > > > return; > > > > } > > > > > > > > if ($bio->bi_status != 0) { > > > > printf("dev %s bio failed %d, submitter %s completion %s\n", > > > > $bio->bi_bdev->bd_disk->disk_name, > > > > $bio->bi_status, @submit_stack[arg0], kstack); > > > > } > > > > delete(@submit_stack[arg0]); > > > > delete(@tracked[arg0]); > > > > } > > > > > > > > END { > > > > clear(@submit_stack); > > > > clear(@tracked); > > > > } > > > > > > > > > > Attaching 4 probes... > > > dev dm-77 bio failed 10, submitter > > > submit_bio_noacct+5 > > > __send_duplicate_bios+358 > > > __send_empty_flush+179 > > > dm_submit_bio+857 > > > __submit_bio+132 > > > submit_bio_noacct_nocheck+345 > > > write_all_supers+1718 > > > btrfs_commit_transaction+2342 > > > transaction_kthread+345 > > > kthread+229 > > > ret_from_fork+49 > > > ret_from_fork_asm+27 > > > completion > > > bio_endio+5 > > > dm_submit_bio+955 > > > __submit_bio+132 > > > submit_bio_noacct_nocheck+345 > > > write_all_supers+1718 > > > btrfs_commit_transaction+2342 > > > transaction_kthread+345 > > > kthread+229 > > > ret_from_fork+49 > > > ret_from_fork_asm+27 > > > > > > dev dm-86 bio failed 10, submitter > > > submit_bio_noacct+5 > > > write_all_supers+1718 > > > btrfs_commit_transaction+2342 > > > transaction_kthread+345 > > > kthread+229 > > > ret_from_fork+49 > > > ret_from_fork_asm+27 > > > completion > > > bio_endio+5 > > > clone_endio+295 > > > clone_endio+295 > > > process_one_work+369 > > > worker_thread+635 > > > kthread+229 > > > ret_from_fork+49 > > > ret_from_fork_asm+27 > > > > > > > > > For context, dm-86 is /dev/lvm/brokenDisk and dm-77 is /dev/lowerVG/lvmPool > > > > io_status is 10(BLK_STS_IOERR), which is produced in submission code path on > > /dev/dm-77(/dev/lowerVG/lvmPool) first, so looks it is one device mapper issue. > > > > The error should be from the following code only: > > > > static void __map_bio(struct bio *clone) > > > > ... > > if (r == DM_MAPIO_KILL) > > dm_io_dec_pending(io, BLK_STS_IOERR); > > else > > dm_io_dec_pending(io, BLK_STS_DM_REQUEUE); > > break; > > I agree that the above bpf stack traces for dm-77 indicate that > dm_submit_bio failed, which would end up in the above branch if the > target's ->map() returned DM_MAPIO_KILL or DM_MAPIO_REQUEUE. > > But such an early failure speaks to the flush bio never being > submitted to the underlying storage. No? > > dm-raid.c:raid_map does return DM_MAPIO_REQUEUE with: > > /* > * If we're reshaping to add disk(s)), ti->len and > * mddev->array_sectors will differ during the process > * (ti->len > mddev->array_sectors), so we have to requeue > * bios with addresses > mddev->array_sectors here or > * there will occur accesses past EOD of the component > * data images thus erroring the raid set. > */ > if (unlikely(bio_end_sector(bio) > mddev->array_sectors)) > return DM_MAPIO_REQUEUE; > > But a flush doesn't have an end_sector (it'd be 0 afaik).. so it seems > weird relative to a flush. > > > Patrick, you mentioned lvmPool is raid1, can you explain how lvmPool is > > built? It is dm-raid1 target or over plain raid1 device which is > > build over /dev/lowerVG? LVM raid1: lvcreate --type raid1 -m 1 ... I had previously added raidintegrity and caching like "lowerVG/single", but I had removed them in trying to root cause this bug > > In my earlier reply I asked Patrick for both: > lsblk > dmsetup table Oops, here they are, trimmed for relevance: NAME sdb └─sdb2 ├─lowerVG-single_corig_rmeta_1 │ └─lowerVG-single_corig │ └─lowerVG-single ├─lowerVG-single_corig_rimage_1_imeta │ └─lowerVG-single_corig_rimage_1 │ └─lowerVG-single_corig │ └─lowerVG-single ├─lowerVG-single_corig_rimage_1_iorig │ └─lowerVG-single_corig_rimage_1 │ └─lowerVG-single_corig │ └─lowerVG-single ├─lowerVG-lvmPool_rmeta_0 │ └─lowerVG-lvmPool │ ├─lvm-a │ └─lvm-brokenDisk ├─lowerVG-lvmPool_rimage_0 │ └─lowerVG-lvmPool │ ├─lvm-a │ └─lvm-brokenDisk sdc └─sdc3 ├─lowerVG-single_corig_rmeta_0 │ └─lowerVG-single_corig │ └─lowerVG-single ├─lowerVG-single_corig_rimage_0_imeta │ └─lowerVG-single_corig_rimage_0 │ └─lowerVG-single_corig │ └─lowerVG-single ├─lowerVG-single_corig_rimage_0_iorig │ └─lowerVG-single_corig_rimage_0 │ └─lowerVG-single_corig │ └─lowerVG-single sdd └─sdd3 ├─lowerVG-lvmPool_rmeta_1 │ └─lowerVG-lvmPool │ ├─lvm-a │ └─lvm-brokenDisk └─lowerVG-lvmPool_rimage_1 └─lowerVG-lvmPool ├─lvm-a └─lvm-brokenDisk sdf ├─sdf2 │ ├─lowerVG-lvmPool_rimage_1 │ │ └─lowerVG-lvmPool │ │ ├─lvm-a │ │ └─lvm-brokenDisk lowerVG-single: 0 5583462400 cache 254:32 254:31 254:71 128 2 metadata2 writethrough mq 0 lowerVG-singleCache_cvol: 0 104857600 linear 259:13 104859648 lowerVG-singleCache_cvol-cdata: 0 104775680 linear 254:30 81920 lowerVG-singleCache_cvol-cmeta: 0 81920 linear 254:30 0 lowerVG-single_corig: 0 5583462400 raid raid1 3 0 region_size 4096 2 254:33 254:36 254:67 254:70 lowerVG-single_corig_rimage_0: 0 5583462400 integrity 254:35 0 4 J 8 meta_device:254:34 recalculate journal_sectors:130944 interleave_sectors:1 buffer_sectors:128 journal_watermark:50 commit_time:10000 internal_hash:crc32c lowerVG-single_corig_rimage_0_imeta: 0 44802048 linear 8:35 5152466944 lowerVG-single_corig_rimage_0_iorig: 0 4724465664 linear 8:35 427821056 lowerVG-single_corig_rimage_0_iorig: 4724465664 431005696 linear 8:35 5362001920 lowerVG-single_corig_rimage_0_iorig: 5155471360 427819008 linear 8:35 2048 lowerVG-single_corig_rimage_0_iorig: 5583290368 172032 linear 8:35 5152294912 lowerVG-single_corig_rimage_1: 0 5583462400 integrity 254:69 0 4 J 8 meta_device:254:68 recalculate journal_sectors:130944 interleave_sectors:1 buffer_sectors:128 journal_watermark:50 commit_time:10000 internal_hash:crc32c lowerVG-single_corig_rimage_1_imeta: 0 44802048 linear 8:18 5583472640 lowerVG-single_corig_rimage_1_iorig: 0 5583462400 linear 8:18 10240 lowerVG-single_corig_rmeta_0: 0 8192 linear 8:35 5152286720 lowerVG-single_corig_rmeta_1: 0 8192 linear 8:18 2048 lowerVG-lvmPool: 0 6442450944 raid raid1 3 0 region_size 4096 2 254:73 254:74 254:75 254:76 lowerVG-lvmPool_rimage_0: 0 2967117824 linear 8:18 5628282880 lowerVG-lvmPool_rimage_0: 2967117824 59875328 linear 8:18 12070733824 lowerVG-lvmPool_rimage_0: 3026993152 3415457792 linear 8:18 8655276032 lowerVG-lvmPool_rimage_1: 0 2862260224 linear 8:51 10240 lowerVG-lvmPool_rimage_1: 2862260224 164732928 linear 8:82 3415459840 lowerVG-lvmPool_rimage_1: 3026993152 3415457792 linear 8:82 2048 lowerVG-lvmPool_rmeta_0: 0 8192 linear 8:18 5628274688 lowerVG-lvmPool_rmeta_1: 0 8192 linear 8:51 2048 lvm-a: 0 1468006400 linear 254:77 1310722048 lvm-brokenDisk: 0 1310720000 linear 254:77 2048 lvm-brokenDisk: 1310720000 83886080 linear 254:77 2778728448 lvm-brokenDisk: 1394606080 2015404032 linear 254:77 4427040768 lvm-brokenDisk: 3410010112 884957184 linear 254:77 2883586048 As a side note, is there a way to make lsblk only show things the first time they come up? > > Picking over the described IO stacks provided earlier (or Goffredo's > interpretation of it, via ascii art) isn't really a great way to see > the IO stacks that are in use/question. > > > Mike, the logic in the following code doesn't change from v5.18-rc2 to > > v5.19, but I still can't understand why STS_IOERR is set in > > dm_io_complete() in case of BLK_STS_DM_REQUEUE && !__noflush_suspending(), > > since DMF_NOFLUSH_SUSPENDING is only set in __dm_suspend() which > > is supposed to not happen in Patrick's case. > > > > dm_io_complete() > > ... > > if (io->status == BLK_STS_DM_REQUEUE) { > > unsigned long flags; > > /* > > * Target requested pushing back the I/O. > > */ > > spin_lock_irqsave(&md->deferred_lock, flags); > > if (__noflush_suspending(md) && > > !WARN_ON_ONCE(dm_is_zone_write(md, bio))) { > > /* NOTE early return due to BLK_STS_DM_REQUEUE below */ > > bio_list_add_head(&md->deferred, bio); > > } else { > > /* > > * noflush suspend was interrupted or this is > > * a write to a zoned target. > > */ > > io->status = BLK_STS_IOERR; > > } > > spin_unlock_irqrestore(&md->deferred_lock, flags); > > } > > Given the reason from dm-raid.c:raid_map returning DM_MAPIO_REQUEUE > I think the DM device could be suspending without flush. > > But regardless, given you logged BLK_STS_IOERR lets assume it isn't, > the assumption that "noflush suspend was interrupted" seems like a > stale comment -- especially given that target's like dm-raid are now > using DM_MAPIO_REQUEUE without concern for the historic tight-coupling > of noflush suspend (which was always the case for the biggest historic > reason for this code: dm-multipath, see commit 2e93ccc1933d0 from > 2006 -- predates my time with developing DM). > > So all said, this code seems flawed for dm-raid (and possibly other > targets that return DM_MAPIO_REQUEUE). I'll look closer this week. > > Mike