On 4/12/22 10:02, Ming Lei wrote: > On Tue, Apr 12, 2022 at 09:28:46AM +0900, Damien Le Moal wrote: >> On 4/12/22 09:09, Ming Lei wrote: >>> On Tue, Apr 12, 2022 at 08:33:04AM +0900, Damien Le Moal wrote: >>>> On 4/11/22 23:18, Ming Lei wrote: >>>>>>>> This fixes the issue: >>>>>>>> >>>>>>>> diff --git a/drivers/md/dm.c b/drivers/md/dm.c >>>>>>>> index 3c5fad7c4ee6..3dd6735450c5 100644 >>>>>>>> --- a/drivers/md/dm.c >>>>>>>> +++ b/drivers/md/dm.c >>>>>>>> @@ -581,7 +581,7 @@ static struct dm_io *alloc_io(struct mapped_device >>>>>>>> *md, struct bio *bio) >>>>>>>> io->status = 0; >>>>>>>> atomic_set(&io->io_count, 1); >>>>>>>> this_cpu_inc(*md->pending_io); >>>>>>>> - io->orig_bio = NULL; >>>>>>>> + io->orig_bio = bio; >>>>>>>> io->md = md; >>>>>>>> io->map_task = current; >>>>>>>> spin_lock_init(&io->lock); >>>>>>>> >>>>>>>> Otherwise, the dm-zone.c code sees a NULL orig_bio. >>>>>>>> However, this change may be messing up the bio accounting. Need to check that. >>>>>>> >>>>>>> Looks it is one recent regression since: >>>>>>> >>>>>>> commit 0fbb4d93b38b ("dm: add dm_submit_bio_remap interface") >>>>>> >>>>>> Yep, saw that. Problem is, I really do not understand that change setting >>>>>> io->orig_bio *after* __map_bio() is called. It seems that the accounting >>>>>> is done on each fragment of the orig_bio instead of once for the entire >>>>>> BIO... So my "fix" above seems wrong. Apart from passing along orig_bio as >>>>>> an argument to __map_bio() from __split_and_process_bio(), I do not think >>>>>> my change is correct. Thoughts ? >>>>> >>>>> Frankly speaking, both changing ->orig_bio after split and setting ->orig_bio >>>>> after ->map() looks ugly & tricky, and the following change should avoid the >>>>> issue, meantime simplify dm accounting a bit: >>>>> >>>>> diff --git a/drivers/md/dm.c b/drivers/md/dm.c >>>>> index 3c5fad7c4ee6..f1fe83113608 100644 >>>>> --- a/drivers/md/dm.c >>>>> +++ b/drivers/md/dm.c >>>>> @@ -581,7 +581,7 @@ static struct dm_io *alloc_io(struct mapped_device *md, struct bio *bio) >>>>> io->status = 0; >>>>> atomic_set(&io->io_count, 1); >>>>> this_cpu_inc(*md->pending_io); >>>>> - io->orig_bio = NULL; >>>>> + io->orig_bio = bio; >>>>> io->md = md; >>>>> io->map_task = current; >>>>> spin_lock_init(&io->lock); >>>>> @@ -1223,19 +1223,11 @@ void dm_submit_bio_remap(struct bio *clone, struct bio *tgt_clone) >>>>> * Account io->origin_bio to DM dev on behalf of target >>>>> * that took ownership of IO with DM_MAPIO_SUBMITTED. >>>>> */ >>>>> - if (io->map_task == current) { >>>>> + if (io->map_task == current) >>>>> /* Still in target's map function */ >>>>> dm_io_set_flag(io, DM_IO_START_ACCT); >>>>> - } else { >>>>> - /* >>>>> - * Called by another thread, managed by DM target, >>>>> - * wait for dm_split_and_process_bio() to store >>>>> - * io->orig_bio >>>>> - */ >>>>> - while (unlikely(!smp_load_acquire(&io->orig_bio))) >>>>> - msleep(1); >>>>> + else >>>> >>>> Curly brackets around the else here. >>>> >>>>> dm_start_io_acct(io, clone); >>>>> - } >>>>> >>>>> __dm_submit_bio_remap(tgt_clone, disk_devt(io->md->disk), >>>>> tio->old_sector); >>>>> @@ -1562,7 +1554,7 @@ static void dm_split_and_process_bio(struct mapped_device *md, >>>>> struct dm_table *map, struct bio *bio) >>>>> { >>>>> struct clone_info ci; >>>>> - struct bio *orig_bio = NULL; >>>>> + struct bio *new_bio = NULL; >>>>> int error = 0; >>>>> >>>>> init_clone_info(&ci, md, map, bio); >>>>> @@ -1578,22 +1570,14 @@ static void dm_split_and_process_bio(struct mapped_device *md, >>>>> if (error || !ci.sector_count) >>>>> goto out; >>>>> >>>>> - /* >>>>> - * Remainder must be passed to submit_bio_noacct() so it gets handled >>>>> - * *after* bios already submitted have been completely processed. >>>>> - * We take a clone of the original to store in ci.io->orig_bio to be >>>>> - * used by dm_end_io_acct() and for dm_io_complete() to use for >>>>> - * completion handling. >>>>> - */ >>>> >>>> This comment should remain with some adjustment. >>> >>> Fine, just felt the approach is very straightforward. >>> >>>> >>>>> - orig_bio = bio_split(bio, bio_sectors(bio) - ci.sector_count, >>>>> - GFP_NOIO, &md->queue->bio_split); >>>>> - bio_chain(orig_bio, bio); >>>>> - trace_block_split(orig_bio, bio->bi_iter.bi_sector); >>>>> - submit_bio_noacct(bio); >>>>> + new_bio = bio_alloc_clone(bio->bi_bdev, bio, GFP_NOIO, >>>>> + &md->queue->bio_split); >>>> >>>> Why not keep using bio_split() ? >>> >>> With bio_split(), 'bio' actually tracks the remainder, and the returned >>> 'orig_bio' tracks the part for current target io, so ->orig_bio has to >>> be updated in this way. >>> >>> With bio_alloc_clone() and bio_trim(), ->orig_bio needn't to be >>> changed, and assigning it in alloc_io() works perfectly. >> >> OK. Got it. >> >>>>> + bio_trim(new_bio, bio_sectors(bio) - ci.sector_count, ci.sector_count); >>>>> + bio_trim(bio, 0, bio_sectors(bio) - ci.sector_count); >>>>> + bio_chain(new_bio, bio); >>>>> + trace_block_split(new_bio, new_bio->bi_iter.bi_sector); >>>>> + submit_bio_noacct(new_bio); >>>>> out: >>>>> - if (!orig_bio) >>>>> - orig_bio = bio; >>>>> - smp_store_release(&ci.io->orig_bio, orig_bio); >>>>> if (dm_io_flagged(ci.io, DM_IO_START_ACCT)) >>>>> dm_start_io_acct(ci.io, NULL); >>>> >>>> I tested this and it works. Need to check the accounting though. >>>> And I agree this is a lot cleaner :) >>> >>> BTW, the cloned bio for split is just for accounting purpose, if >>> ->bi_iter.bi_sector & ->bi_iter.bi_size can be stored in 'struct dm_io', >>> the cloned bio can be avoided, but code may become not readable as >>> before. >> >> The BIO op would be needed too for remapping zone append to regular writes >> when zone append emulation is enabled. That is actually why >> dm_zone_map_bio() needs the original BIO, to have the unmodified BIO op >> since the clone op may not be the same. > > clone inherits the original bio's op. I meant to store the > real bio from FS as ->orig_bio always in alloc_io(), and simply trim it in case > of split & re-submission, meantime store orig_bio's ->bi_sector & ->size into > 'dm_io' for account purpose. But it looks a bit complicated and messy. Indeed. > Wrt. dm zone, I'd suggest to double check anywhere orig bio is used, > since only part of orig bio may be mapped in case of splitting, which is > actually post-split. > > If bio split won't happen for dm-zone, your patch is fine. But I guess > it isn't true for dm-zone. Zone append operations can never be split. That is checked before a user bio reaches DM __map_bio()/dm_zone_map_bio(). So I think that there is no issue with the clone+trim change since that will never be done for a zone append operation. In the case of emulated zone append, regular writes will still go through the dm_zone_map_bio() function, and these writes may be split. But then the orig_bio does not really matter in that case and the zone write pointer tracking relies on the clone sector & size, not on the original BIO. I will run more tests with this patch. But so far, this seems all OK. >> So I think this fix+cleanup as is is good for now. Will you send a proper >> patch ? > > Not yet, the fix+cleanup patch actually breaks recent dm io polling, and I > don't figure out one solution yet. OK. Let me know if you need help testing. -- Damien Le Moal Western Digital Research -- dm-devel mailing list dm-devel@xxxxxxxxxx https://listman.redhat.com/mailman/listinfo/dm-devel