On Mon, May 16, 2022 at 06:54:11PM +0200, Pankaj Raghav wrote: > Superblocks for zoned devices are fixed as 2 zones at 0, 512GB and 4TB. > These are fixed at these locations so that recovery tools can reliably > retrieve the superblocks even if one of the mirror gets corrupted. > > power of 2 zone sizes align at these offsets irrespective of their > value but non power of 2 zone sizes will not align. > > To make sure the first zone at mirror 1 and mirror 2 align, write zero > operation is performed to move the write pointer of the first zone to > the expected offset. This operation is performed only after a zone reset > of the first zone, i.e., when the second zone that contains the sb is FULL. Is it a good idea to do the "write zeros", instead of a plain "set write pointer"? I assume setting write pointer is instant, while writing potentially hundreds of megabytes may take significiant time. As the functions may be called from random contexts, the increased time may become a problem. > Signed-off-by: Pankaj Raghav <p.raghav@xxxxxxxxxxx> > --- > fs/btrfs/zoned.c | 68 ++++++++++++++++++++++++++++++++++++++++++++---- > 1 file changed, 63 insertions(+), 5 deletions(-) > > diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c > index 3023c871e..805aeaa76 100644 > --- a/fs/btrfs/zoned.c > +++ b/fs/btrfs/zoned.c > @@ -760,11 +760,44 @@ int btrfs_check_mountopts_zoned(struct btrfs_fs_info *info) > return 0; > } > > +static int fill_sb_wp_offset(struct block_device *bdev, struct blk_zone *zone, > + int mirror, u64 *wp_ret) > +{ > + u64 offset = 0; > + int ret = 0; > + > + ASSERT(!is_power_of_two_u64(zone->len)); > + ASSERT(zone->wp == zone->start); > + ASSERT(mirror != 0); This could simply accept 0 as the mirror offset too, the calculation is trivial. > + > + switch (mirror) { > + case 1: > + div64_u64_rem(BTRFS_SB_LOG_FIRST_OFFSET >> SECTOR_SHIFT, > + zone->len, &offset); > + break; > + case 2: > + div64_u64_rem(BTRFS_SB_LOG_SECOND_OFFSET >> SECTOR_SHIFT, > + zone->len, &offset); > + break; > + } > + > + ret = blkdev_issue_zeroout(bdev, zone->start, offset, GFP_NOFS, 0); > + if (ret) > + return ret; > + > + zone->wp += offset; > + zone->cond = BLK_ZONE_COND_IMP_OPEN; > + *wp_ret = zone->wp << SECTOR_SHIFT; > + > + return 0; > +} > + > static int sb_log_location(struct block_device *bdev, struct blk_zone *zones, > - int rw, u64 *bytenr_ret) > + int rw, int mirror, u64 *bytenr_ret) > { > u64 wp; > int ret; > + bool zones_empty = false; > > if (zones[0].type == BLK_ZONE_TYPE_CONVENTIONAL) { > *bytenr_ret = zones[0].start << SECTOR_SHIFT; > @@ -775,13 +808,31 @@ static int sb_log_location(struct block_device *bdev, struct blk_zone *zones, > if (ret != -ENOENT && ret < 0) > return ret; > > + if (ret == -ENOENT) > + zones_empty = true; > + > if (rw == WRITE) { > struct blk_zone *reset = NULL; > + bool is_sb_offset_write_req = false; > + u32 reset_zone_nr = -1; > > - if (wp == zones[0].start << SECTOR_SHIFT) > + if (wp == zones[0].start << SECTOR_SHIFT) { > reset = &zones[0]; > - else if (wp == zones[1].start << SECTOR_SHIFT) > + reset_zone_nr = 0; > + } else if (wp == zones[1].start << SECTOR_SHIFT) { > reset = &zones[1]; > + reset_zone_nr = 1; > + } > + > + /* > + * Non po2 zone sizes will not align naturally at > + * mirror 1 (512GB) and mirror 2 (4TB). The wp of the > + * 1st zone in those superblock mirrors need to be > + * moved to align at those offsets. > + */ Please move this comment to the helper fill_sb_wp_offset itself, there it's more discoverable. > + is_sb_offset_write_req = > + (zones_empty || (reset_zone_nr == 0)) && mirror && > + !is_power_of_2(zones[0].len); Accepting 0 as the mirror number would also get rid of this wild expression substituting and 'if'. > > if (reset && reset->cond != BLK_ZONE_COND_EMPTY) { > ASSERT(sb_zone_is_full(reset)); > @@ -795,6 +846,13 @@ static int sb_log_location(struct block_device *bdev, struct blk_zone *zones, > reset->cond = BLK_ZONE_COND_EMPTY; > reset->wp = reset->start; > } > + > + if (is_sb_offset_write_req) { And get rid of the conditional. The point of supporting both po2 and nonpo2 is to hide any implementation details to wrappers as much as possible. > + ret = fill_sb_wp_offset(bdev, &zones[0], mirror, &wp); > + if (ret) > + return ret; > + } > + > } else if (ret != -ENOENT) { > /* > * For READ, we want the previous one. Move write pointer to -- dm-devel mailing list dm-devel@xxxxxxxxxx https://listman.redhat.com/mailman/listinfo/dm-devel