On 2022-05-17 14:42, David Sterba wrote: > On Mon, May 16, 2022 at 06:54:11PM +0200, Pankaj Raghav wrote: >> Superblocks for zoned devices are fixed as 2 zones at 0, 512GB and 4TB. >> These are fixed at these locations so that recovery tools can reliably >> retrieve the superblocks even if one of the mirror gets corrupted. >> >> power of 2 zone sizes align at these offsets irrespective of their >> value but non power of 2 zone sizes will not align. >> >> To make sure the first zone at mirror 1 and mirror 2 align, write zero >> operation is performed to move the write pointer of the first zone to >> the expected offset. This operation is performed only after a zone reset >> of the first zone, i.e., when the second zone that contains the sb is FULL. > > Is it a good idea to do the "write zeros", instead of a plain "set write > pointer"? I assume setting write pointer is instant, while writing > potentially hundreds of megabytes may take significiant time. As the > functions may be called from random contexts, the increased time may > become a problem. > Unfortunately it is not possible to just move the WP in zoned devices. The only alternative that I could use is to do write zeroes which are natively supported by some devices such as ZNS. It would be nice to know if someone had a better solution to this instead of doing write zeroes in zoned devices. >> Signed-off-by: Pankaj Raghav <p.raghav@xxxxxxxxxxx> >> --- >> fs/btrfs/zoned.c | 68 ++++++++++++++++++++++++++++++++++++++++++++---- >> 1 file changed, 63 insertions(+), 5 deletions(-) >> >> diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c >> index 3023c871e..805aeaa76 100644 >> --- a/fs/btrfs/zoned.c >> +++ b/fs/btrfs/zoned.c >> @@ -760,11 +760,44 @@ int btrfs_check_mountopts_zoned(struct btrfs_fs_info *info) >> return 0; >> } >> >> +static int fill_sb_wp_offset(struct block_device *bdev, struct blk_zone *zone, >> + int mirror, u64 *wp_ret) >> +{ >> + u64 offset = 0; >> + int ret = 0; >> + >> + ASSERT(!is_power_of_two_u64(zone->len)); >> + ASSERT(zone->wp == zone->start); >> + ASSERT(mirror != 0); > > This could simply accept 0 as the mirror offset too, the calculation is > trivial. > Ok. I will fix it up! >> + >> + switch (mirror) { >> + case 1: >> + div64_u64_rem(BTRFS_SB_LOG_FIRST_OFFSET >> SECTOR_SHIFT, >> + zone->len, &offset); >> + break; >> + case 2: >> + div64_u64_rem(BTRFS_SB_LOG_SECOND_OFFSET >> SECTOR_SHIFT, >> + zone->len, &offset); >> + break; >> + } >> + >> + ret = blkdev_issue_zeroout(bdev, zone->start, offset, GFP_NOFS, 0); >> + if (ret) >> + return ret; >> + >> + /* >> + * Non po2 zone sizes will not align naturally at >> + * mirror 1 (512GB) and mirror 2 (4TB). The wp of the >> + * 1st zone in those superblock mirrors need to be >> + * moved to align at those offsets. >> + */ > > Please move this comment to the helper fill_sb_wp_offset itself, there > it's more discoverable. > Ok. >> + is_sb_offset_write_req = >> + (zones_empty || (reset_zone_nr == 0)) && mirror && >> + !is_power_of_2(zones[0].len); > > Accepting 0 as the mirror number would also get rid of this wild > expression substituting and 'if'. > >> >> if (reset && reset->cond != BLK_ZONE_COND_EMPTY) { >> ASSERT(sb_zone_is_full(reset)); >> @@ -795,6 +846,13 @@ static int sb_log_location(struct block_device *bdev, struct blk_zone *zones, >> reset->cond = BLK_ZONE_COND_EMPTY; >> reset->wp = reset->start; >> } >> + >> + if (is_sb_offset_write_req) { > > And get rid of the conditional. The point of supporting both po2 and > nonpo2 is to hide any implementation details to wrappers as much as > possible. > Alright. I will move the logic to the wrapper instead of having the conditional in this function. >> + ret = fill_sb_wp_offset(bdev, &zones[0], mirror, &wp); >> + if (ret) >> + return ret; >> + } >> + >> } else if (ret != -ENOENT) { >> /* >> * For READ, we want the previous one. Move write pointer to Thanks for your comments.