Re: [PATCH 1/3] dm: don't grab target io reference in dm_zone_map_bio

Ming Lei <ming.lei@xxxxxxxxxx> · Mon, 11 Apr 2022 22:18:05 +0800

On Mon, Apr 11, 2022 at 04:42:59PM +0900, Damien Le Moal wrote:
> On 4/11/22 16:34, Ming Lei wrote:
> > On Mon, Apr 11, 2022 at 11:55:14AM +0900, Damien Le Moal wrote:
> >> On 4/11/22 11:19, Damien Le Moal wrote:
> >>> On 4/11/22 10:04, Ming Lei wrote:
> >>>> On Mon, Apr 11, 2022 at 09:50:57AM +0900, Damien Le Moal wrote:
> >>>>> On 4/11/22 09:36, Ming Lei wrote:
> >>>>>> On Mon, Apr 11, 2022 at 09:18:56AM +0900, Damien Le Moal wrote:
> >>>>>>> On 4/9/22 02:12, Ming Lei wrote:
> >>>>>>>> dm_zone_map_bio() is only called from __map_bio in which the io's
> >>>>>>>> reference is grabbed already, and the reference won't be released
> >>>>>>>> until the bio is submitted, so no necessary to do it dm_zone_map_bio
> >>>>>>>> any more.
> >>>>>>>
> >>>>>>> I do not think that this patch is correct. Removing the extra reference on
> >>>>>>> the io can lead to problems if the io is completed in the target
> >>>>>>> ->map(ti, clone) call or before dm_zone_map_bio_end() is called for the
> >>>>>>> DM_MAPIO_SUBMITTED or DM_MAPIO_REMAPPED cases. dm_zone_map_bio_end() needs
> >>>>>>
> >>>>>> __map_bio():
> >>>>>> 	...
> >>>>>> 	dm_io_inc_pending(io);
> >>>>>> 	...
> >>>>>> 	dm_zone_map_bio(tio);
> >>>>>> 	...
> >>>>>
> >>>>> dm-crypt (for instance) may terminate the clone bio immediately in its
> >>>>> ->map() function context, resulting in the bio_endio()clone) ->
> >>>>> clone_endio() -> dm_io_dec_pending() call chain.
> >>>>>
> >>>>> With that, the io is gone and dm_zone_map_bio_end() will not have a valid
> >>>>> reference on the orig bio.
> >>>>
> >>>> Any target can complete io during ->map. Here looks nothing is special with
> >>>> dm-crypt or dm-zone, why does only dm zone need extra reference?
> >>>>
> >>>> The reference counter is initialized as 1 in init_clone_info(), dm_io_inc_pending()
> >>>> in __map_bio() increases it to 2, so after the above call chain you mentioned is done,
> >>>> the counter becomes 1. The original bio can't be completed until dm_io_dec_pending()
> >>>> in dm_split_and_process_bio() is called.
> >>>>
> >>>> Or maybe I miss any extra requirement from dm-zone?
> >>>
> >>> Something is wrong... With and without your patch, when I setup a dm-crypt
> >>> target on top of a zoned nullblk device, I get:
> >>>
> >>> [  292.596454] device-mapper: uevent: version 1.0.3
> >>> [  292.602746] device-mapper: ioctl: 4.46.0-ioctl (2022-02-22)
> >>> initialised: dm-devel@xxxxxxxxxx
> >>> [  292.732217] general protection fault, probably for non-canonical
> >>> address 0xdffffc0000000002: 0000 [#1] PREEMPT SMP KASAN PTI
> >>> [  292.743724] KASAN: null-ptr-deref in range
> >>> [0x0000000000000010-0x0000000000000017]
> >>> [  292.751409] CPU: 0 PID: 4259 Comm: systemd-udevd Not tainted
> >>> 5.18.0-rc2+ #1458
> >>> [  292.758746] Hardware name: Supermicro Super Server/X11DPL-i, BIOS 3.3
> >>> 02/21/2020
> >>> [  292.766250] RIP: 0010:dm_zone_map_bio+0x146/0x1740 [dm_mod]
> >>> [  292.771938] Code: 00 00 4d 8b 65 10 48 8d 43 28 48 89 44 24 10 49 8d 44
> >>> 24 10 48 89 c2 48 89 44 24 18 48 b8 00 00 00 00 00 fc ff df 48 c1 ea 03
> >>> <0f> b6 04 02 84 c0 74 08 3c 03 0f 8e 78 0e 00 00 45 8b 7c 24 10 41
> >>> [  292.790946] RSP: 0018:ffff8883cd847218 EFLAGS: 00010202
> >>> [  292.796260] RAX: dffffc0000000000 RBX: ffff8885c5bcdce8 RCX:
> >>> 1ffff11034470027
> >>> [  292.803496] RDX: 0000000000000002 RSI: 0000000000000008 RDI:
> >>> ffff8885c5bcdc60
> >>> [  292.810732] RBP: 1ffff11079b08e4f R08: ffff8881a23801d8 R09:
> >>> ffff8881a238013f
> >>> [  292.817970] R10: ffff88821c594040 R11: 0000000000000001 R12:
> >>> 0000000000000000
> >>> [  292.825206] R13: ffff8885c5bcdc50 R14: ffff8881a2380000 R15:
> >>> ffff8885c5bcdd08
> >>> [  292.832442] FS:  00007fe169b06b40(0000) GS:ffff88880fc00000(0000)
> >>> knlGS:0000000000000000
> >>> [  292.840646] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >>> [  292.846481] CR2: 00007ffd80a57a38 CR3: 00000004b91b0006 CR4:
> >>> 00000000007706f0
> >>> [  292.853722] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> >>> 0000000000000000
> >>> [  292.860957] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> >>> 0000000000000400
> >>> [  292.868194] PKRU: 55555554
> >>> [  292.870949] Call Trace:
> >>> [  292.873446]  <TASK>
> >>> [  292.875593]  ? lock_is_held_type+0xd7/0x130
> >>> [  292.879860]  ? dm_set_zones_restrictions+0x8f0/0x8f0 [dm_mod]
> >>> [  292.885718]  ? __module_address.part.0+0x25/0x300
> >>> [  292.890509]  ? is_module_address+0x43/0x70
> >>> [  292.894674]  ? static_obj+0x8a/0xc0
> >>> [  292.898233]  __map_bio+0x352/0x740 [dm_mod]
> >>> [  292.902512]  dm_submit_bio+0x72f/0x17a0 [dm_mod]
> >>> [  292.907222]  ? find_held_lock+0x2c/0x110
> >>> [  292.911217]  ? __send_empty_flush+0x2b0/0x2b0 [dm_mod]
> >>> [  292.916459]  ? lock_release+0x3b2/0x6f0
> >>> [  292.920368]  ? lock_downgrade+0x6d0/0x6d0
> >>> [  292.924458]  ? lock_is_held_type+0xd7/0x130
> >>> [  292.928714]  __submit_bio+0x12a/0x1f0
> >>> [  292.932450]  submit_bio_noacct_nocheck+0x324/0x840
> >>> [  292.937324]  ? should_fail_request+0x70/0x70
> >>> [  292.941670]  ? rcu_read_lock_sched_held+0x3f/0x70
> >>> [  292.946458]  ? submit_bio_noacct+0xfa4/0x1530
> >>> [  292.950888]  ? lock_is_held_type+0xd7/0x130
> >>> [  292.957813]  mpage_readahead+0x32e/0x4b0
> >>> [  292.964470]  ? do_mpage_readpage+0x17c0/0x17c0
> >>> [  292.971661]  ? blkdev_write_begin+0x20/0x20
> >>> [  292.978567]  ? lock_release+0x3b2/0x6f0
> >>> [  292.985073]  ? folio_add_lru+0x217/0x3f0
> >>> [  292.991620]  ? lock_downgrade+0x6d0/0x6d0
> >>> [  292.998237]  read_pages+0x18c/0x990
> >>> [  293.004308]  ? memcg_list_lru_alloc+0x810/0x810
> >>> [  293.011404]  ? folio_add_lru+0x238/0x3f0
> >>> [  293.017805]  ? file_ra_state_init+0xd0/0xd0
> >>> [  293.024395]  ? policy_node+0xbb/0x140
> >>> [  293.030416]  page_cache_ra_unbounded+0x258/0x410
> >>> [  293.037376]  force_page_cache_ra+0x281/0x400
> >>> [  293.043944]  filemap_get_pages+0x25e/0x1290
> >>> [  293.050342]  ? __lock_acquire+0x1603/0x6180
> >>> [  293.056654]  ? filemap_add_folio+0x140/0x140
> >>> [  293.063002]  ? lock_is_held_type+0xd7/0x130
> >>> [  293.069236]  filemap_read+0x29e/0x910
> >>> [  293.074927]  ? filemap_get_pages+0x1290/0x1290
> >>> [  293.081378]  ? __lock_acquire+0x1603/0x6180
> >>> [  293.087558]  blkdev_read_iter+0x20c/0x640
> >>> [  293.093529]  ? cp_new_stat+0x47a/0x590
> >>> [  293.099190]  ? cp_old_stat+0x470/0x470
> >>> [  293.104795]  new_sync_read+0x2e4/0x520
> >>> [  293.110362]  ? __x64_sys_lseek+0x1d0/0x1d0
> >>> [  293.116269]  ? lock_acquire+0x1b2/0x4d0
> >>> [  293.121928]  ? find_held_lock+0x2c/0x110
> >>> [  293.127648]  vfs_read+0x312/0x430
> >>> [  293.132755]  ksys_read+0xf3/0x1d0
> >>> [  293.137863]  ? __x64_sys_pwrite64+0x1f0/0x1f0
> >>> [  293.144032]  do_syscall_64+0x35/0x80
> >>> [  293.149391]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> >>>
> >>> The crash is at: drivers/md/dm-zone.c:499, which is
> >>> dm_need_zone_wp_tracking() called from dm_zone_map_bio(). The orig_bio
> >>> pointer is invalid. Weird. Investigating.
> >>>
> >>> Also checking why our weekly test runs did not catch this.
> >>
> >> This fixes the issue:
> >>
> >> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> >> index 3c5fad7c4ee6..3dd6735450c5 100644
> >> --- a/drivers/md/dm.c
> >> +++ b/drivers/md/dm.c
> >> @@ -581,7 +581,7 @@ static struct dm_io *alloc_io(struct mapped_device
> >> *md, struct bio *bio)
> >>         io->status = 0;
> >>         atomic_set(&io->io_count, 1);
> >>         this_cpu_inc(*md->pending_io);
> >> -       io->orig_bio = NULL;
> >> +       io->orig_bio = bio;
> >>         io->md = md;
> >>         io->map_task = current;
> >>         spin_lock_init(&io->lock);
> >>
> >> Otherwise, the dm-zone.c code sees a NULL orig_bio.
> >> However, this change may be messing up the bio accounting. Need to check that.
> > 
> > Looks it is one recent regression since:
> > 
> > commit 0fbb4d93b38b ("dm: add dm_submit_bio_remap interface")
> 
> Yep, saw that. Problem is, I really do not understand that change setting
> io->orig_bio *after* __map_bio() is called. It seems that the accounting
> is done on each fragment of the orig_bio instead of once for the entire
> BIO... So my "fix" above seems wrong. Apart from passing along orig_bio as
> an argument to  __map_bio() from __split_and_process_bio(), I do not think
> my change is correct. Thoughts ?

Frankly speaking, both changing ->orig_bio after split and setting ->orig_bio
after ->map() looks ugly & tricky, and the following change should avoid the
issue, meantime simplify dm accounting a bit:

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 3c5fad7c4ee6..f1fe83113608 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -581,7 +581,7 @@ static struct dm_io *alloc_io(struct mapped_device *md, struct bio *bio)
 	io->status = 0;
 	atomic_set(&io->io_count, 1);
 	this_cpu_inc(*md->pending_io);
-	io->orig_bio = NULL;
+	io->orig_bio = bio;
 	io->md = md;
 	io->map_task = current;
 	spin_lock_init(&io->lock);
@@ -1223,19 +1223,11 @@ void dm_submit_bio_remap(struct bio *clone, struct bio *tgt_clone)
 	 * Account io->origin_bio to DM dev on behalf of target
 	 * that took ownership of IO with DM_MAPIO_SUBMITTED.
 	 */
-	if (io->map_task == current) {
+	if (io->map_task == current)
 		/* Still in target's map function */
 		dm_io_set_flag(io, DM_IO_START_ACCT);
-	} else {
-		/*
-		 * Called by another thread, managed by DM target,
-		 * wait for dm_split_and_process_bio() to store
-		 * io->orig_bio
-		 */
-		while (unlikely(!smp_load_acquire(&io->orig_bio)))
-			msleep(1);
+	else
 		dm_start_io_acct(io, clone);
-	}
 
 	__dm_submit_bio_remap(tgt_clone, disk_devt(io->md->disk),
 			      tio->old_sector);
@@ -1562,7 +1554,7 @@ static void dm_split_and_process_bio(struct mapped_device *md,
 				     struct dm_table *map, struct bio *bio)
 {
 	struct clone_info ci;
-	struct bio *orig_bio = NULL;
+	struct bio *new_bio = NULL;
 	int error = 0;
 
 	init_clone_info(&ci, md, map, bio);
@@ -1578,22 +1570,14 @@ static void dm_split_and_process_bio(struct mapped_device *md,
 	if (error || !ci.sector_count)
 		goto out;
 
-	/*
-	 * Remainder must be passed to submit_bio_noacct() so it gets handled
-	 * *after* bios already submitted have been completely processed.
-	 * We take a clone of the original to store in ci.io->orig_bio to be
-	 * used by dm_end_io_acct() and for dm_io_complete() to use for
-	 * completion handling.
-	 */
-	orig_bio = bio_split(bio, bio_sectors(bio) - ci.sector_count,
-			     GFP_NOIO, &md->queue->bio_split);
-	bio_chain(orig_bio, bio);
-	trace_block_split(orig_bio, bio->bi_iter.bi_sector);
-	submit_bio_noacct(bio);
+	new_bio = bio_alloc_clone(bio->bi_bdev, bio, GFP_NOIO,
+			&md->queue->bio_split);
+	bio_trim(new_bio, bio_sectors(bio) - ci.sector_count, ci.sector_count);
+	bio_trim(bio, 0, bio_sectors(bio) - ci.sector_count);
+	bio_chain(new_bio, bio);
+	trace_block_split(new_bio, new_bio->bi_iter.bi_sector);
+	submit_bio_noacct(new_bio);
 out:
-	if (!orig_bio)
-		orig_bio = bio;
-	smp_store_release(&ci.io->orig_bio, orig_bio);
 	if (dm_io_flagged(ci.io, DM_IO_START_ACCT))
 		dm_start_io_acct(ci.io, NULL);
 

-- 
Ming
--
dm-devel mailing list
dm-devel@xxxxxxxxxx
https://listman.redhat.com/mailman/listinfo/dm-devel