On 2020/7/15 2:40, Aravind Ramesh wrote: > Thanks for the valuable feedback. > My comments are inline. > Will send the V3 with the feedback incorporated. > > Regards, > Aravind > >> -----Original Message----- >> From: Chao Yu <yuchao0@xxxxxxxxxx> >> Sent: Tuesday, July 14, 2020 5:28 PM >> To: Aravind Ramesh <Aravind.Ramesh@xxxxxxx>; jaegeuk@xxxxxxxxxx; linux- >> fsdevel@xxxxxxxxxxxxxxx; linux-f2fs-devel@xxxxxxxxxxxxxxxxxxxxx; hch@xxxxxx >> Cc: Damien Le Moal <Damien.LeMoal@xxxxxxx>; Niklas Cassel >> <Niklas.Cassel@xxxxxxx>; Matias Bjorling <Matias.Bjorling@xxxxxxx> >> Subject: Re: [PATCH v2 1/1] f2fs: support zone capacity less than zone size >> >> On 2020/7/11 1:43, Aravind Ramesh wrote: >>> NVMe Zoned Namespace devices can have zone-capacity less than zone-size. >>> Zone-capacity indicates the maximum number of sectors that are usable >>> in a zone beginning from the first sector of the zone. This makes the >>> sectors sectors after the zone-capacity till zone-size to be unusable. >>> This patch set tracks zone-size and zone-capacity in zoned devices and >>> calculate the usable blocks per segment and usable segments per section. >>> >>> If zone-capacity is less than zone-size mark only those segments which >>> start before zone-capacity as free segments. All segments at and >>> beyond zone-capacity are treated as permanently used segments. In >>> cases where zone-capacity does not align with segment size the last >>> segment will start before zone-capacity and end beyond the >>> zone-capacity of the zone. For such spanning segments only sectors within the >> zone-capacity are used. >>> >>> During writes and GC manage the usable segments in a section and >>> usable blocks per segment. Segments which are beyond zone-capacity are >>> never allocated, and do not need to be garbage collected, only the >>> segments which are before zone-capacity needs to garbage collected. >>> For spanning segments based on the number of usable blocks in that >>> segment, write to blocks only up to zone-capacity. >>> >>> Zone-capacity is device specific and cannot be configured by the user. >>> Since NVMe ZNS device zones are sequentially write only, a block >>> device with conventional zones or any normal block device is needed >>> along with the ZNS device for the metadata operations of F2fs. >>> >>> A typical nvme-cli output of a zoned device shows zone start and >>> capacity and write pointer as below: >>> >>> SLBA: 0x0 WP: 0x0 Cap: 0x18800 State: EMPTY Type: SEQWRITE_REQ >>> SLBA: 0x20000 WP: 0x20000 Cap: 0x18800 State: EMPTY Type: SEQWRITE_REQ >>> SLBA: 0x40000 WP: 0x40000 Cap: 0x18800 State: EMPTY Type: SEQWRITE_REQ >>> >>> Here zone size is 64MB, capacity is 49MB, WP is at zone start as the >>> zones are in EMPTY state. For each zone, only zone start + 49MB is >>> usable area, any lba/sector after 49MB cannot be read or written to, >>> the drive will fail any attempts to read/write. So, the second zone >>> starts at 64MB and is usable till 113MB (64 + 49) and the range >>> between 113 and 128MB is again unusable. The next zone starts at 128MB, and so >> on. >>> >>> Signed-off-by: Aravind Ramesh <aravind.ramesh@xxxxxxx> >>> Signed-off-by: Damien Le Moal <damien.lemoal@xxxxxxx> >>> Signed-off-by: Niklas Cassel <niklas.cassel@xxxxxxx> >>> --- >>> Documentation/filesystems/f2fs.rst | 15 +++ >>> fs/f2fs/f2fs.h | 5 + >>> fs/f2fs/gc.c | 27 +++-- >>> fs/f2fs/gc.h | 42 +++++++- >>> fs/f2fs/segment.c | 154 ++++++++++++++++++++++++++--- >>> fs/f2fs/segment.h | 21 ++-- >>> fs/f2fs/super.c | 41 ++++++-- >>> 7 files changed, 267 insertions(+), 38 deletions(-) >>> >>> diff --git a/Documentation/filesystems/f2fs.rst >>> b/Documentation/filesystems/f2fs.rst >>> index 535021c46260..cec2167a31db 100644 >>> --- a/Documentation/filesystems/f2fs.rst >>> +++ b/Documentation/filesystems/f2fs.rst >>> @@ -766,3 +766,18 @@ Compress metadata layout:: >>> +-------------+-------------+----------+----------------------------+ >>> | data length | data chksum | reserved | compressed data | >>> >>> +-------------+-------------+----------+----------------------------+ >>> + >>> +NVMe Zoned Namespace devices >>> +---------------------------- >>> + >>> +- ZNS defines a per-zone capacity which can be equal or less than the >>> + zone-size. Zone-capacity is the number of usable blocks in the zone. >>> + F2fs checks if zone-capacity is less than zone-size, if it is, then >>> +any >>> + segment which starts after the zone-capacity is marked as not-free >>> +in >>> + the free segment bitmap at initial mount time. These segments are >>> +marked >>> + as permanently used so they are not allocated for writes and >>> + consequently are not needed to be garbage collected. In case the >>> + zone-capacity is not aligned to default segment size(2MB), then a >>> +segment >>> + can start before the zone-capacity and span across zone-capacity boundary. >>> + Such spanning segments are also considered as usable segments. All >>> +blocks >>> + past the zone-capacity are considered unusable in these segments. >>> diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h index >>> e6e47618a357..73219e4e1ba4 100644 >>> --- a/fs/f2fs/f2fs.h >>> +++ b/fs/f2fs/f2fs.h >>> @@ -1232,6 +1232,7 @@ struct f2fs_dev_info { #ifdef >>> CONFIG_BLK_DEV_ZONED >>> unsigned int nr_blkz; /* Total number of zones */ >>> unsigned long *blkz_seq; /* Bitmap indicating sequential zones */ >>> + block_t *zone_capacity_blocks; /* Array of zone capacity in blks */ >>> #endif >>> }; >>> >>> @@ -3395,6 +3396,10 @@ void f2fs_destroy_segment_manager_caches(void); >>> int f2fs_rw_hint_to_seg_type(enum rw_hint hint); enum rw_hint >>> f2fs_io_type_to_rw_hint(struct f2fs_sb_info *sbi, >>> enum page_type type, enum temp_type temp); >>> +unsigned int f2fs_usable_segs_in_sec(struct f2fs_sb_info *sbi, >>> + unsigned int segno); >>> +unsigned int f2fs_usable_blks_in_seg(struct f2fs_sb_info *sbi, >>> + unsigned int segno); >>> >>> /* >>> * checkpoint.c >>> diff --git a/fs/f2fs/gc.c b/fs/f2fs/gc.c index >>> 9a40761445d3..dfa6d91cffcb 100644 >>> --- a/fs/f2fs/gc.c >>> +++ b/fs/f2fs/gc.c >>> @@ -266,13 +266,14 @@ static unsigned int get_cb_cost(struct f2fs_sb_info >> *sbi, unsigned int segno) >>> unsigned char age = 0; >>> unsigned char u; >>> unsigned int i; >>> + unsigned int usable_segs_per_sec = f2fs_usable_segs_in_sec(sbi, >>> +segno); >>> >>> - for (i = 0; i < sbi->segs_per_sec; i++) >>> + for (i = 0; i < usable_segs_per_sec; i++) >>> mtime += get_seg_entry(sbi, start + i)->mtime; >>> vblocks = get_valid_blocks(sbi, segno, true); >>> >>> - mtime = div_u64(mtime, sbi->segs_per_sec); >>> - vblocks = div_u64(vblocks, sbi->segs_per_sec); >>> + mtime = div_u64(mtime, usable_segs_per_sec); >>> + vblocks = div_u64(vblocks, usable_segs_per_sec); >>> >>> u = (vblocks * 100) >> sbi->log_blocks_per_seg; >>> >>> @@ -536,6 +537,7 @@ static int gc_node_segment(struct f2fs_sb_info *sbi, >>> int phase = 0; >>> bool fggc = (gc_type == FG_GC); >>> int submitted = 0; >>> + unsigned int usable_blks_in_seg = f2fs_usable_blks_in_seg(sbi, >>> +segno); >>> >>> start_addr = START_BLOCK(sbi, segno); >>> >>> @@ -545,7 +547,7 @@ static int gc_node_segment(struct f2fs_sb_info *sbi, >>> if (fggc && phase == 2) >>> atomic_inc(&sbi->wb_sync_req[NODE]); >>> >>> - for (off = 0; off < sbi->blocks_per_seg; off++, entry++) { >>> + for (off = 0; off < usable_blks_in_seg; off++, entry++) { >>> nid_t nid = le32_to_cpu(entry->nid); >>> struct page *node_page; >>> struct node_info ni; >>> @@ -1033,13 +1035,14 @@ static int gc_data_segment(struct f2fs_sb_info *sbi, >> struct f2fs_summary *sum, >>> int off; >>> int phase = 0; >>> int submitted = 0; >>> + unsigned int usable_blks_in_seg = f2fs_usable_blks_in_seg(sbi, >>> +segno); >>> >>> start_addr = START_BLOCK(sbi, segno); >>> >>> next_step: >>> entry = sum; >>> >>> - for (off = 0; off < sbi->blocks_per_seg; off++, entry++) { >>> + for (off = 0; off < usable_blks_in_seg; off++, entry++) { >>> struct page *data_page; >>> struct inode *inode; >>> struct node_info dni; /* dnode info for the data */ @@ -1201,7 >>> +1204,16 @@ static int do_garbage_collect(struct f2fs_sb_info *sbi, >>> SUM_TYPE_DATA : >> SUM_TYPE_NODE; >>> int submitted = 0; >>> >>> - if (__is_large_section(sbi)) >>> + /* >>> + * zone-capacity can be less than zone-size in zoned devices, >>> + * resulting in less than expected usable segments in the zone, >>> + * calculate the end segno in the zone which can be garbage collected >>> + */ >>> + if (f2fs_sb_has_blkzoned(sbi)) >>> + end_segno -= sbi->segs_per_sec - >>> + f2fs_usable_segs_in_sec(sbi, segno); >>> + >>> + else if (__is_large_section(sbi)) >>> end_segno = rounddown(end_segno, sbi->segs_per_sec); >> >> end_segno could be beyond end of section, so below calculation could adjust it to >> the end of section first, then adjust to zone-capacity boundary. >> >> if (__is_large_section(sbi)) >> end_segno = rounddown(end_segno, sbi->segs_per_sec); >> >> if (f2fs_sb_has_blkzoned(sbi)) >> end_segno -= sbi->segs_per_sec - >> f2fs_usable_segs_in_sec(sbi, segno); > Ok, will change it. >> >>> >>> /* readahead multi ssa blocks those have contiguous address */ @@ >>> -1356,7 +1368,8 @@ int f2fs_gc(struct f2fs_sb_info *sbi, bool sync, >>> goto stop; >>> >>> seg_freed = do_garbage_collect(sbi, segno, &gc_list, gc_type); >>> - if (gc_type == FG_GC && seg_freed == sbi->segs_per_sec) >>> + if (gc_type == FG_GC && >>> + seg_freed == f2fs_usable_segs_in_sec(sbi, segno)) >>> sec_freed++; >>> total_freed += seg_freed; >>> >>> diff --git a/fs/f2fs/gc.h b/fs/f2fs/gc.h index >>> db3c61046aa4..463b4e38b864 100644 >>> --- a/fs/f2fs/gc.h >>> +++ b/fs/f2fs/gc.h >>> @@ -44,13 +44,47 @@ struct gc_inode_list { >>> /* >>> * inline functions >>> */ >>> + >>> +/* >>> + * On a Zoned device zone-capacity can be less than zone-size and if >>> + * zone-capacity is not aligned to f2fs segment size(2MB), then the >>> +segment >>> + * starting just before zone-capacity has some blocks spanning across >>> +the >>> + * zone-capacity, these blocks are not usable. >>> + * Such spanning segments can be in free list so calculate the sum of >>> +usable >>> + * blocks in currently free segments including normal and spanning segments. >>> + */ >>> +static inline block_t free_segs_blk_count_zoned(struct f2fs_sb_info >>> +*sbi) { >>> + block_t free_seg_blks = 0; >>> + struct free_segmap_info *free_i = FREE_I(sbi); >>> + int j; >>> + >> >> spin_lock(&free_i->segmap_lock); > Ok >> >>> + for (j = 0; j < MAIN_SEGS(sbi); j++) >>> + if (!test_bit(j, free_i->free_segmap)) >>> + free_seg_blks += f2fs_usable_blks_in_seg(sbi, j); >> >> spin_unlock(&free_i->segmap_lock); > Ok >> >>> + >>> + return free_seg_blks; >>> +} >>> + >>> +static inline block_t free_segs_blk_count(struct f2fs_sb_info *sbi) { >>> + if (f2fs_sb_has_blkzoned(sbi)) >>> + return free_segs_blk_count_zoned(sbi); >>> + >>> + return free_segments(sbi) << sbi->log_blocks_per_seg; } >>> + >>> static inline block_t free_user_blocks(struct f2fs_sb_info *sbi) { >>> - if (free_segments(sbi) < overprovision_segments(sbi)) >>> + block_t free_blks, ovp_blks; >>> + >>> + free_blks = free_segs_blk_count(sbi); >>> + ovp_blks = overprovision_segments(sbi) << sbi->log_blocks_per_seg; >>> + >>> + if (free_blks < ovp_blks) >>> return 0; >>> - else >>> - return (free_segments(sbi) - overprovision_segments(sbi)) >>> - << sbi->log_blocks_per_seg; >>> + >>> + return free_blks - ovp_blks; >>> } >>> >>> static inline block_t limit_invalid_user_blocks(struct f2fs_sb_info >>> *sbi) diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c index >>> c35614d255e1..d75c1849dc83 100644 >>> --- a/fs/f2fs/segment.c >>> +++ b/fs/f2fs/segment.c >>> @@ -869,10 +869,10 @@ static void locate_dirty_segment(struct f2fs_sb_info >> *sbi, unsigned int segno) >>> ckpt_valid_blocks = get_ckpt_valid_blocks(sbi, segno); >> >> unsigned int usable_blocks = f2fs_usable_blks_in_seg(sbi, segno); >> >>> >>> if (valid_blocks == 0 && (!is_sbi_flag_set(sbi, SBI_CP_DISABLED) || >>> - ckpt_valid_blocks == sbi->blocks_per_seg)) { >> >> ckpt_valid_blocks == usable_blocks >> >>> + ckpt_valid_blocks == f2fs_usable_blks_in_seg(sbi, segno))) { >>> __locate_dirty_segment(sbi, segno, PRE); >>> __remove_dirty_segment(sbi, segno, DIRTY); >>> - } else if (valid_blocks < sbi->blocks_per_seg) { >>> + } else if (valid_blocks < f2fs_usable_blks_in_seg(sbi, segno)) { >> >> } else if (valid_blocks < usable_blocks) { >> > Ok. >> >> BTW, would you consider to add a field as below to avoid calculation whenever we >> want to get usable_blocks in segment: > > The reason to do it like this is that f2fs supports using multiple devices to create a volume. > So, if 2 ZNS devices are used where both have same zone-size but different zone capacity, > then having this single field to indicate usable blocks will not work, as usable blocks of zones The field @usable_blocks in 'struct seg_entry' is per-segment, that's not single, it can be initialized to 0 or blocks_per_seg or 'sec_cap_blkaddr - seg_start' based on its locatation in which zone of device during build_sit_entries(). > from different devices are not same. In such a scenario, we would need to maintain an array, > which stores the usable blocks of each zone based on its zone-capacity. > This approach can eliminate the calculation. > > So we actually implemented this approach with a buffer and at mount time this buffer would be > allocated and populated with all the zone's usable_blocks information and would be read from > this buffer subsequently. But we did not see any performance difference when compared to > current approach and concluded that the overhead of calculation was negligible and also > current approach does not need to modify core data structure, as this field will be used > only when using a ZNS drive and if that drive has zone capacity less than zone size. > And for all other cases, the function f2fs_usable_blks_in_seg() just returns sbi->blocks_per_seg > without any calculation. > > If you think, like the calculation overhead is more, then we need to add a pointer and allocate memory > to it to accommodate all zone's zone-capacity info. In my opinion, the gain is very less because of the > reasons stated above. However, I am open to change it based on your feedback. I just thought those fixed size could be recorded in advance to avoid redundant calculation, but the change I suggested is not critical, and I do agree that it will not help performance a bit. Anyway, I'm not against with keeping it as it is, let's go ahead. :) > >> >> struct seg_entry { >> unsigned long long type:6; /* segment type like >> CURSEG_XXX_TYPE */ >> unsigned long long valid_blocks:10; /* # of valid blocks */ >> unsigned long long ckpt_valid_blocks:10; /* # of valid blocks last cp >> */ >> unsigned long long usable_blocks:10; /* usable block count in segment >> */ >> unsigned long long padding:28; /* padding */ >> ... >> >> For non-zns device, usable_blocks equals to blks_per_seg, otherwise it depends on >> where it locates, we can initialize it in build_sit_entries(), thoughts? >> >>> __locate_dirty_segment(sbi, segno, DIRTY); >>> } else { >>> /* Recovery routine with SSR needs this */ @@ -915,9 +915,11 >> @@ >>> block_t f2fs_get_unusable_blocks(struct f2fs_sb_info *sbi) >>> for_each_set_bit(segno, dirty_i->dirty_segmap[DIRTY], MAIN_SEGS(sbi)) { >>> se = get_seg_entry(sbi, segno); >>> if (IS_NODESEG(se->type)) >>> - holes[NODE] += sbi->blocks_per_seg - se->valid_blocks; >>> + holes[NODE] += f2fs_usable_blks_in_seg(sbi, segno) - >>> + se->valid_blocks; >>> else >>> - holes[DATA] += sbi->blocks_per_seg - se->valid_blocks; >>> + holes[DATA] += f2fs_usable_blks_in_seg(sbi, segno) - >>> + se->valid_blocks; >>> } >>> mutex_unlock(&dirty_i->seglist_lock); >>> >>> @@ -2167,7 +2169,7 @@ static void update_sit_entry(struct f2fs_sb_info *sbi, >> block_t blkaddr, int del) >>> offset = GET_BLKOFF_FROM_SEG0(sbi, blkaddr); >>> >>> f2fs_bug_on(sbi, (new_vblocks >> (sizeof(unsigned short) << 3) || >>> - (new_vblocks > sbi->blocks_per_seg))); >>> + (new_vblocks > f2fs_usable_blks_in_seg(sbi, segno)))); >>> >>> se->valid_blocks = new_vblocks; >>> se->mtime = get_mtime(sbi, false); >>> @@ -2933,9 +2935,9 @@ int f2fs_trim_fs(struct f2fs_sb_info *sbi, >>> struct fstrim_range *range) static bool __has_curseg_space(struct >>> f2fs_sb_info *sbi, int type) { >>> struct curseg_info *curseg = CURSEG_I(sbi, type); >>> - if (curseg->next_blkoff < sbi->blocks_per_seg) >>> - return true; >>> - return false; >>> + >>> + return curseg->next_blkoff < f2fs_usable_blks_in_seg(sbi, >>> + curseg->segno); >>> } >>> >>> int f2fs_rw_hint_to_seg_type(enum rw_hint hint) @@ -4294,9 +4296,12 >>> @@ static void init_free_segmap(struct f2fs_sb_info *sbi) { >>> unsigned int start; >>> int type; >>> + struct seg_entry *sentry; >>> >>> for (start = 0; start < MAIN_SEGS(sbi); start++) { >>> - struct seg_entry *sentry = get_seg_entry(sbi, start); >>> + if (f2fs_usable_blks_in_seg(sbi, start) == 0) >>> + continue; >>> + sentry = get_seg_entry(sbi, start); >>> if (!sentry->valid_blocks) >>> __set_free(sbi, start); >>> else >>> @@ -4316,7 +4321,7 @@ static void init_dirty_segmap(struct f2fs_sb_info *sbi) >>> struct dirty_seglist_info *dirty_i = DIRTY_I(sbi); >>> struct free_segmap_info *free_i = FREE_I(sbi); >>> unsigned int segno = 0, offset = 0, secno; >>> - unsigned short valid_blocks; >>> + unsigned short valid_blocks, usable_blks_in_seg; >>> unsigned short blks_per_sec = BLKS_PER_SEC(sbi); >>> >>> while (1) { >>> @@ -4326,9 +4331,10 @@ static void init_dirty_segmap(struct f2fs_sb_info *sbi) >>> break; >>> offset = segno + 1; >>> valid_blocks = get_valid_blocks(sbi, segno, false); >>> - if (valid_blocks == sbi->blocks_per_seg || !valid_blocks) >>> + usable_blks_in_seg = f2fs_usable_blks_in_seg(sbi, segno); >>> + if (valid_blocks == usable_blks_in_seg || !valid_blocks) >>> continue; >>> - if (valid_blocks > sbi->blocks_per_seg) { >>> + if (valid_blocks > usable_blks_in_seg) { >>> f2fs_bug_on(sbi, 1); >>> continue; >>> } >>> @@ -4678,6 +4684,101 @@ int f2fs_check_write_pointer(struct >>> f2fs_sb_info *sbi) >>> >>> return 0; >>> } >>> + >>> +static bool is_conv_zone(struct f2fs_sb_info *sbi, unsigned int zone_idx, >>> + unsigned int dev_idx) >>> +{ >>> + if (!bdev_is_zoned(FDEV(dev_idx).bdev)) >>> + return true; >>> + return !test_bit(zone_idx, FDEV(dev_idx).blkz_seq); } >>> + >>> +/* Return the zone index in the given device */ static unsigned int >>> +get_zone_idx(struct f2fs_sb_info *sbi, unsigned int secno, >>> + int dev_idx) >>> +{ >>> + block_t sec_start_blkaddr = START_BLOCK(sbi, GET_SEG_FROM_SEC(sbi, >>> +secno)); >>> + >>> + return (sec_start_blkaddr - FDEV(dev_idx).start_blk) >> >>> + sbi->log_blocks_per_blkz; >>> +} >>> + >>> +/* >>> + * Return the usable segments in a section based on the zone's >>> + * corresponding zone capacity. Zone is equal to a section. >>> + */ >>> +static inline unsigned int f2fs_usable_zone_segs_in_sec( >>> + struct f2fs_sb_info *sbi, unsigned int segno) { >>> + unsigned int dev_idx, zone_idx, unusable_segs_in_sec; >>> + >>> + dev_idx = f2fs_target_device_index(sbi, START_BLOCK(sbi, segno)); >>> + zone_idx = get_zone_idx(sbi, GET_SEC_FROM_SEG(sbi, segno), dev_idx); >>> + >>> + /* Conventional zone's capacity is always equal to zone size */ >>> + if (is_conv_zone(sbi, zone_idx, dev_idx)) >>> + return sbi->segs_per_sec; >>> + >>> + /* >>> + * If the zone_capacity_blocks array is NULL, then zone capacity >>> + * is equal to the zone size for all zones >>> + */ >>> + if (!FDEV(dev_idx).zone_capacity_blocks) >>> + return sbi->segs_per_sec; >>> + >>> + /* Get the segment count beyond zone capacity block */ >>> + unusable_segs_in_sec = (sbi->blocks_per_blkz - >>> + FDEV(dev_idx).zone_capacity_blocks[zone_idx]) >>>> >>> + sbi->log_blocks_per_seg; >>> + return sbi->segs_per_sec - unusable_segs_in_sec; } >>> + >>> +/* >>> + * Return the number of usable blocks in a segment. The number of >>> +blocks >>> + * returned is always equal to the number of blocks in a segment for >>> + * segments fully contained within a sequential zone capacity or a >>> + * conventional zone. For segments partially contained in a >>> +sequential >>> + * zone capacity, the number of usable blocks up to the zone capacity >>> + * is returned. 0 is returned in all other cases. >>> + */ >>> +static inline unsigned int f2fs_usable_zone_blks_in_seg( >>> + struct f2fs_sb_info *sbi, unsigned int segno) { >>> + block_t seg_start, sec_start_blkaddr, sec_cap_blkaddr; >>> + unsigned int zone_idx, dev_idx, secno; >>> + >>> + secno = GET_SEC_FROM_SEG(sbi, segno); >>> + seg_start = START_BLOCK(sbi, segno); >>> + dev_idx = f2fs_target_device_index(sbi, seg_start); >>> + zone_idx = get_zone_idx(sbi, secno, dev_idx); >>> + >>> + /* >>> + * Conventional zone's capacity is always equal to zone size, >>> + * so, blocks per segment is unchanged. >>> + */ >>> + if (is_conv_zone(sbi, zone_idx, dev_idx)) >>> + return sbi->blocks_per_seg; >>> + >>> + if (!FDEV(dev_idx).zone_capacity_blocks) >>> + return sbi->blocks_per_seg; >>> + >>> + sec_start_blkaddr = START_BLOCK(sbi, GET_SEG_FROM_SEC(sbi, secno)); >>> + sec_cap_blkaddr = sec_start_blkaddr + >>> + FDEV(dev_idx).zone_capacity_blocks[zone_idx]; >>> + >>> + /* >>> + * If segment starts before zone capacity and spans beyond >>> + * zone capacity, then usable blocks are from seg start to >>> + * zone capacity. If the segment starts after the zone capacity, >>> + * then there are no usable blocks. >>> + */ >>> + if (seg_start >= sec_cap_blkaddr) >>> + return 0; >>> + if (seg_start + sbi->blocks_per_seg > sec_cap_blkaddr) >>> + return sec_cap_blkaddr - seg_start; >>> + >>> + return sbi->blocks_per_seg; >>> +} >>> #else >>> int f2fs_fix_curseg_write_pointer(struct f2fs_sb_info *sbi) { @@ >>> -4688,7 +4789,36 @@ int f2fs_check_write_pointer(struct f2fs_sb_info >>> *sbi) { >>> return 0; >>> } >>> + >>> +static inline unsigned int f2fs_usable_zone_blks_in_seg(struct f2fs_sb_info *sbi, >>> + unsigned int segno) >>> +{ >>> + return 0; >>> +} >>> + >>> +static inline unsigned int f2fs_usable_zone_segs_in_sec(struct f2fs_sb_info *sbi, >>> + unsigned int segno) >>> +{ >>> + return 0; >>> +} >>> #endif >>> +unsigned int f2fs_usable_blks_in_seg(struct f2fs_sb_info *sbi, >>> + unsigned int segno) >>> +{ >>> + if (f2fs_sb_has_blkzoned(sbi)) >>> + return f2fs_usable_zone_blks_in_seg(sbi, segno); >>> + >>> + return sbi->blocks_per_seg; >>> +} >>> + >>> +unsigned int f2fs_usable_segs_in_sec(struct f2fs_sb_info *sbi, >>> + unsigned int segno) >>> +{ >>> + if (f2fs_sb_has_blkzoned(sbi)) >>> + return f2fs_usable_zone_segs_in_sec(sbi, segno); >>> + >>> + return sbi->segs_per_sec; >>> +} >>> >>> /* >>> * Update min, max modified time for cost-benefit GC algorithm diff >>> --git a/fs/f2fs/segment.h b/fs/f2fs/segment.h index >>> f261e3e6a69b..41bc08013160 100644 >>> --- a/fs/f2fs/segment.h >>> +++ b/fs/f2fs/segment.h >>> @@ -411,6 +411,7 @@ static inline void __set_free(struct f2fs_sb_info *sbi, >> unsigned int segno) >>> unsigned int secno = GET_SEC_FROM_SEG(sbi, segno); >>> unsigned int start_segno = GET_SEG_FROM_SEC(sbi, secno); >>> unsigned int next; >>> + unsigned int usable_segs = f2fs_usable_segs_in_sec(sbi, segno); >>> >>> spin_lock(&free_i->segmap_lock); >>> clear_bit(segno, free_i->free_segmap); @@ -418,7 +419,7 @@ static >>> inline void __set_free(struct f2fs_sb_info *sbi, unsigned int segno) >>> >>> next = find_next_bit(free_i->free_segmap, >>> start_segno + sbi->segs_per_sec, start_segno); >>> - if (next >= start_segno + sbi->segs_per_sec) { >>> + if (next >= start_segno + usable_segs) { >>> clear_bit(secno, free_i->free_secmap); >>> free_i->free_sections++; >>> } >>> @@ -444,6 +445,7 @@ static inline void __set_test_and_free(struct f2fs_sb_info >> *sbi, >>> unsigned int secno = GET_SEC_FROM_SEG(sbi, segno); >>> unsigned int start_segno = GET_SEG_FROM_SEC(sbi, secno); >>> unsigned int next; >>> + unsigned int usable_segs = f2fs_usable_segs_in_sec(sbi, segno); >>> >>> spin_lock(&free_i->segmap_lock); >>> if (test_and_clear_bit(segno, free_i->free_segmap)) { @@ -453,7 >>> +455,7 @@ static inline void __set_test_and_free(struct f2fs_sb_info *sbi, >>> goto skip_free; >>> next = find_next_bit(free_i->free_segmap, >>> start_segno + sbi->segs_per_sec, start_segno); >>> - if (next >= start_segno + sbi->segs_per_sec) { >>> + if (next >= start_segno + usable_segs) { >>> if (test_and_clear_bit(secno, free_i->free_secmap)) >>> free_i->free_sections++; >>> } >>> @@ -546,8 +548,8 @@ static inline bool has_curseg_enough_space(struct >> f2fs_sb_info *sbi) >>> /* check current node segment */ >>> for (i = CURSEG_HOT_NODE; i <= CURSEG_COLD_NODE; i++) { >>> segno = CURSEG_I(sbi, i)->segno; >>> - left_blocks = sbi->blocks_per_seg - >>> - get_seg_entry(sbi, segno)->ckpt_valid_blocks; >>> + left_blocks = f2fs_usable_blks_in_seg(sbi, segno) - >>> + get_seg_entry(sbi, segno)->ckpt_valid_blocks; >>> >>> if (node_blocks > left_blocks) >>> return false; >>> @@ -555,7 +557,7 @@ static inline bool has_curseg_enough_space(struct >>> f2fs_sb_info *sbi) >>> >>> /* check current data segment */ >>> segno = CURSEG_I(sbi, CURSEG_HOT_DATA)->segno; >>> - left_blocks = sbi->blocks_per_seg - >>> + left_blocks = f2fs_usable_blks_in_seg(sbi, segno) - >>> get_seg_entry(sbi, segno)->ckpt_valid_blocks; >>> if (dent_blocks > left_blocks) >>> return false; >>> @@ -677,21 +679,22 @@ static inline int check_block_count(struct f2fs_sb_info >> *sbi, >>> bool is_valid = test_bit_le(0, raw_sit->valid_map) ? true : false; >>> int valid_blocks = 0; >>> int cur_pos = 0, next_pos; >>> + unsigned int usable_blks_per_seg = f2fs_usable_blks_in_seg(sbi, >>> +segno); >>> >>> /* check bitmap with valid block count */ >>> do { >>> if (is_valid) { >>> next_pos = find_next_zero_bit_le(&raw_sit->valid_map, >>> - sbi->blocks_per_seg, >>> + usable_blks_per_seg, >>> cur_pos); >>> valid_blocks += next_pos - cur_pos; >>> } else >>> next_pos = find_next_bit_le(&raw_sit->valid_map, >>> - sbi->blocks_per_seg, >>> + usable_blks_per_seg, >>> cur_pos); >>> cur_pos = next_pos; >>> is_valid = !is_valid; >>> - } while (cur_pos < sbi->blocks_per_seg); >>> + } while (cur_pos < usable_blks_per_seg); >> >> I meant we need to check that there should be no valid slot locates beyond zone- >> capacity in bitmap: > Ahh, ok. Got it. Will change. >> >> if (usable_blks_per_seg < sbi->blocks_per_seg) >> f2fs_bug_on( >> find_next_bit_le(&raw_sit->valid_map, >> sbi->blocks_per_seg, >> usable_blks_per_seg) != sbi->blocks_per_seg) >> >>> >>> if (unlikely(GET_SIT_VBLOCKS(raw_sit) != valid_blocks)) { >>> f2fs_err(sbi, "Mismatch valid blocks %d vs. %d", @@ -701,7 +704,7 >>> @@ static inline int check_block_count(struct f2fs_sb_info *sbi, >>> } >>> >>> /* check segment usage, and check boundary of a given segment number */ >>> - if (unlikely(GET_SIT_VBLOCKS(raw_sit) > sbi->blocks_per_seg >>> + if (unlikely(GET_SIT_VBLOCKS(raw_sit) > usable_blks_per_seg >>> || segno > TOTAL_SEGS(sbi) - 1)) { >>> f2fs_err(sbi, "Wrong valid blocks %d or segno %u", >>> GET_SIT_VBLOCKS(raw_sit), segno); diff --git >> a/fs/f2fs/super.c >>> b/fs/f2fs/super.c index 80cb7cd358f8..7ced425dc102 100644 >>> --- a/fs/f2fs/super.c >>> +++ b/fs/f2fs/super.c >>> @@ -1164,6 +1164,7 @@ static void destroy_device_list(struct f2fs_sb_info *sbi) >>> blkdev_put(FDEV(i).bdev, FMODE_EXCL); #ifdef >> CONFIG_BLK_DEV_ZONED >>> kvfree(FDEV(i).blkz_seq); >>> + kfree(FDEV(i).zone_capacity_blocks); >>> #endif >>> } >>> kvfree(sbi->devs); >>> @@ -3039,13 +3040,26 @@ static int init_percpu_info(struct >>> f2fs_sb_info *sbi) } >>> >>> #ifdef CONFIG_BLK_DEV_ZONED >>> + >>> +struct f2fs_report_zones_args { >>> + struct f2fs_dev_info *dev; >>> + bool zone_cap_mismatch; >>> +}; >>> + >>> static int f2fs_report_zone_cb(struct blk_zone *zone, unsigned int idx, >>> - void *data) >>> + void *data) >>> { >>> - struct f2fs_dev_info *dev = data; >>> + struct f2fs_report_zones_args *rz_args = data; >>> + >>> + if (zone->type == BLK_ZONE_TYPE_CONVENTIONAL) >>> + return 0; >>> + >>> + set_bit(idx, rz_args->dev->blkz_seq); >>> + rz_args->dev->zone_capacity_blocks[idx] = zone->capacity >> >>> + F2FS_LOG_SECTORS_PER_BLOCK; >>> + if (zone->len != zone->capacity && !rz_args->zone_cap_mismatch) >>> + rz_args->zone_cap_mismatch = true; >>> >>> - if (zone->type != BLK_ZONE_TYPE_CONVENTIONAL) >>> - set_bit(idx, dev->blkz_seq); >>> return 0; >>> } >>> >>> @@ -3053,6 +3067,7 @@ static int init_blkz_info(struct f2fs_sb_info >>> *sbi, int devi) { >>> struct block_device *bdev = FDEV(devi).bdev; >>> sector_t nr_sectors = bdev->bd_part->nr_sects; >>> + struct f2fs_report_zones_args rep_zone_arg; >>> int ret; >>> >>> if (!f2fs_sb_has_blkzoned(sbi)) >>> @@ -3078,12 +3093,26 @@ static int init_blkz_info(struct f2fs_sb_info *sbi, int >> devi) >>> if (!FDEV(devi).blkz_seq) >>> return -ENOMEM; >>> >>> - /* Get block zones type */ >>> + /* Get block zones type and zone-capacity */ >>> + FDEV(devi).zone_capacity_blocks = f2fs_kzalloc(sbi, >>> + FDEV(devi).nr_blkz * sizeof(block_t), >>> + GFP_KERNEL); >>> + if (!FDEV(devi).zone_capacity_blocks) >>> + return -ENOMEM; >>> + >>> + rep_zone_arg.dev = &FDEV(devi); >>> + rep_zone_arg.zone_cap_mismatch = false; >>> + >>> ret = blkdev_report_zones(bdev, 0, BLK_ALL_ZONES, f2fs_report_zone_cb, >>> - &FDEV(devi)); >>> + &rep_zone_arg); >>> if (ret < 0) >> >> kfree(FDEV(devi).zone_capacity_blocks); > Actually, we do not need to free this here, because if error (ret < 0) is returned, then the control goes back to > destroy_device_list() and deallocates the buffer. Confirmed. Thanks, > >> >> Thanks, >> >>> return ret; >>> >>> + if (!rep_zone_arg.zone_cap_mismatch) { >>> + kfree(FDEV(devi).zone_capacity_blocks); >>> + FDEV(devi).zone_capacity_blocks = NULL; >>> + } >>> + >>> return 0; >>> } >>> #endif >>> > . >