On 2015-05-20 16:43, Ryusuke Konishi wrote:
> On Sun, 3 May 2015 12:05:22 +0200, Andreas Rohner wrote:
>> It doesn't really matter if the number of reclaimable blocks for a
>> segment is inaccurate, as long as the overall performance is better
>> than the simple timestamp algorithm and starvation is prevented.
>>
>> The following steps will lead to starvation of a segment:
>>
>> 1. The segment is written.
>> 2. A snapshot is created.
>> 3. The files in the segment are deleted and the number of live
>>    blocks for the segment is decremented to a very low value.
>> 4. The GC tries to free the segment, but there are no reclaimable
>>    blocks, because they are all protected by the snapshot. To
>>    prevent an infinite loop the GC has to adjust the number of live
>>    blocks to the correct value.
>> 5. The snapshot is converted to a checkpoint and the blocks in the
>>    segment become reclaimable.
>> 6. The GC will never attempt to clean the segment again, because it
>>    looks as if it had a high number of live blocks.
>>
>> To prevent this, the already existing padding field of the SUFILE
>> entry is used to track the number of snapshot blocks in the segment.
>> This number is only set by the GC, since it collects the necessary
>> information anyway, so there is no need to track which block belongs
>> to which segment. In step 4 of the list above the GC sets the new
>> field su_nsnapshot_blks. In step 5 all entries in the SUFILE are
>> checked, and entries with a large su_nsnapshot_blks field get their
>> su_nlive_blks field reduced.
>>
>> Signed-off-by: Andreas Rohner <andreas.rohner@xxxxxxx>
>
> I still don't know whether this workaround is the way we should take
> or not. This patch has several drawbacks:
>
> 1. It adds overhead to every "chcp cp" operation, due to the
>    traversal rewrite of the SUFILE. If the ratio of
>    snapshot-protected blocks is high, this overhead will be big.
>
> 2. The traversal rewrite of the SUFILE causes many SUFILE blocks to
>    be written out. If most blocks are protected by a snapshot, more
>    than 4MB of SUFILE blocks will be written per 1TB of capacity.
>
>    Even though this rewrite may not happen for consecutive "chcp cp"
>    operations, it still has the potential to create SUFILE write
>    bursts if the application on top of nilfs manipulates snapshots
>    frequently.

I could also implement this functionality in nilfs_cleanerd in
userspace. Every time a "chcp cp" happens, some kind of permanent
flag like "snapshot_was_recently_deleted" would be set at an
appropriate location. The flag could be returned with the GET_SUSTAT
ioctl(). Then nilfs_cleanerd would, at certain intervals and if the
flag is set, check all segments with the GET_SUINFO ioctl() and fix
the ones that have potentially invalid values with the SET_SUINFO
ioctl(). After that it would clear the
"snapshot_was_recently_deleted" flag. If the policy is "timestamp",
the GC would of course skip this scan, because it is unnecessary.

What do you think about this idea?
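To make the idea a bit more concrete, here is a rough sketch of the
cleanerd-side scan. nilfs_get_sustat(), nilfs_get_suinfo() and
nilfs_set_suinfo() already exist in libnilfs, but the "recently
deleted" flag (ss_flags / NILFS_SUSTAT_SNAPSHOT_DELETED) and the
sui_nlive_blks / sui_nsnapshot_blks fields would come from this patch
set, so those names are placeholders:

static int cleanerd_calibrate_segments(struct nilfs *nilfs)
{
	struct nilfs_sustat sustat;
	struct nilfs_suinfo si[256];
	__u32 max_segblks = nilfs_get_blocks_per_segment(nilfs) / 2;
	__u64 segnum;
	ssize_t n;
	int i;

	if (nilfs_get_sustat(nilfs, &sustat) < 0)
		return -1;
	/* hypothetical flag proposed above */
	if (!(sustat.ss_flags & NILFS_SUSTAT_SNAPSHOT_DELETED))
		return 0;

	for (segnum = 0; segnum < sustat.ss_nsegs; segnum += n) {
		n = nilfs_get_suinfo(nilfs, segnum, si, 256);
		if (n <= 0)
			break;
		for (i = 0; i < n; ++i) {
			/* sui_* live-block fields are from this patch set */
			if (si[i].sui_nsnapshot_blks <= max_segblks ||
			    si[i].sui_nlive_blks <= max_segblks)
				continue;

			struct nilfs_suinfo_update sup = {
				.sup_segnum = segnum + i,
				/* hypothetical update flag */
				.sup_flags = NILFS_SUINFO_UPDATE_NLIVE_BLKS,
			};
			sup.sup_sui.sui_nlive_blks = max_segblks;
			sup.sup_sui.sui_nsnapshot_blks = max_segblks;
			nilfs_set_suinfo(nilfs, &sup, 1);
		}
	}
	/* finally, clear "snapshot_was_recently_deleted" (not shown) */
	return 0;
}

This would keep all of the extra write-out of SUFILE blocks out of
the "chcp cp" path itself and let the cleaner do it at its own pace.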
> 3. The ratio of the threshold "max_segblks" is hard-coded to 50% of
>    blocks_per_segment. It is not clear whether that ratio is a good
>    (versatile) choice.

The interval and the percentage could be made configurable in
/etc/nilfs_cleanerd.conf. I chose 50% somewhat arbitrarily. My intent
was to encourage the GC to check the segment again in the future. I
guess anything between 25% and 75% would also work.

> I will add comments inline below.

>> ---
>>  fs/nilfs2/ioctl.c  | 50 +++++++++++++++++++++++++++++++-
>>  fs/nilfs2/sufile.c | 85 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>  fs/nilfs2/sufile.h |  3 ++
>>  3 files changed, 137 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/nilfs2/ioctl.c b/fs/nilfs2/ioctl.c
>> index 40bf74a..431725f 100644
>> --- a/fs/nilfs2/ioctl.c
>> +++ b/fs/nilfs2/ioctl.c
>> @@ -200,6 +200,49 @@ static int nilfs_ioctl_getversion(struct inode *inode, void __user *argp)
>>  }
>>
>>  /**
>> + * nilfs_ioctl_fix_starving_segs - fix potentially starving segments
>> + * @nilfs: nilfs object
>> + * @inode: inode object
>> + *
>> + * Description: Scans for segments that are potentially starving and
>> + * reduces the number of live blocks to less than half of the maximum
>> + * number of blocks in a segment. This requires a scan of the whole
>> + * SUFILE, which can take a long time on certain devices and under
>> + * certain conditions. To avoid blocking other file system operations
>> + * for too long, the SUFILE is scanned in steps of
>> + * NILFS_SUFILE_STARVING_SEGS_STEP. After each step the locks are
>> + * released and cond_resched() is called.
>> + *
>> + * Return Value: On success, 0 is returned and on error, one of the
>> + * following negative error codes is returned.
>> + *
>> + * %-EIO - I/O error.
>> + *
>> + * %-ENOMEM - Insufficient amount of memory available.
>> + */
>
>> +static int nilfs_ioctl_fix_starving_segs(struct the_nilfs *nilfs,
>> +					 struct inode *inode) {
>
> This "inode" argument is meaningless for this routine.
> Consider passing "sb" instead.

I agree.

> The function name "fix starving segs" feels odd to me. It sounds
> like a workaround rather than a solution to the root problem of the
> GC in nilfs. What this patch is actually doing is "calibrating" the
> live block count.

I like the name "calibrating". I will change it.

>> +	struct nilfs_transaction_info ti;
>
>> +	unsigned long i, nsegs = nilfs_sufile_get_nsegments(nilfs->ns_sufile);
>
> nsegs is set outside the transaction lock.
>
> Since the file system can be resized (both shrunk and extended)
> outside the lock, nsegs must be initialized or updated in the
> section where the transaction lock is held.

Good point. I'll change it.
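Something like the following, perhaps. This is only a sketch of the
reworked helper, folding in both points (passing "sb" instead of the
inode, and re-reading nsegs while the transaction lock is held); the
"calibrate" name follows the rename discussed above and is tentative:

static int nilfs_ioctl_calibrate_live_blks(struct the_nilfs *nilfs,
					   struct super_block *sb)
{
	struct nilfs_transaction_info ti;
	unsigned long i = 0, nsegs;
	int ret = 0;

	do {
		nilfs_transaction_begin(sb, &ti, 0);

		/* the segment count is only stable under the lock */
		nsegs = nilfs_sufile_get_nsegments(nilfs->ns_sufile);

		ret = nilfs_sufile_fix_starving_segs(nilfs->ns_sufile, i,
					NILFS_SUFILE_STARVING_SEGS_STEP);
		if (unlikely(ret < 0)) {
			nilfs_transaction_abort(sb);
			break;
		}

		nilfs_transaction_commit(sb); /* never fails */
		cond_resched();

		i += NILFS_SUFILE_STARVING_SEGS_STEP;
	} while (i < nsegs);

	return ret;
}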
>> +	int ret = 0;
>> +
>> +	for (i = 0; i < nsegs; i += NILFS_SUFILE_STARVING_SEGS_STEP) {
>> +		nilfs_transaction_begin(inode->i_sb, &ti, 0);
>> +
>> +		ret = nilfs_sufile_fix_starving_segs(nilfs->ns_sufile, i,
>> +				NILFS_SUFILE_STARVING_SEGS_STEP);
>> +		if (unlikely(ret < 0)) {
>> +			nilfs_transaction_abort(inode->i_sb);
>> +			break;
>> +		}
>> +
>> +		nilfs_transaction_commit(inode->i_sb); /* never fails */
>> +		cond_resched();
>> +	}
>> +
>> +	return ret;
>> +}
>> +
>> +/**
>>   * nilfs_ioctl_change_cpmode - change checkpoint mode (checkpoint/snapshot)
>>   * @inode: inode object
>>   * @filp: file object
>> @@ -224,7 +267,7 @@ static int nilfs_ioctl_change_cpmode(struct inode *inode, struct file *filp,
>>  	struct the_nilfs *nilfs = inode->i_sb->s_fs_info;
>>  	struct nilfs_transaction_info ti;
>>  	struct nilfs_cpmode cpmode;
>> -	int ret;
>> +	int ret, is_snapshot;
>>
>>  	if (!capable(CAP_SYS_ADMIN))
>>  		return -EPERM;
>> @@ -240,6 +283,7 @@ static int nilfs_ioctl_change_cpmode(struct inode *inode, struct file *filp,
>>  	mutex_lock(&nilfs->ns_snapshot_mount_mutex);
>>
>>  	nilfs_transaction_begin(inode->i_sb, &ti, 0);
>> +	is_snapshot = nilfs_cpfile_is_snapshot(nilfs->ns_cpfile, cpmode.cm_cno);
>>  	ret = nilfs_cpfile_change_cpmode(
>>  		nilfs->ns_cpfile, cpmode.cm_cno, cpmode.cm_mode);
>>  	if (unlikely(ret < 0))
>> @@ -248,6 +292,10 @@ static int nilfs_ioctl_change_cpmode(struct inode *inode, struct file *filp,
>>  	nilfs_transaction_commit(inode->i_sb); /* never fails */
>>
>>  	mutex_unlock(&nilfs->ns_snapshot_mount_mutex);
>> +
>
>> +	if (is_snapshot > 0 && cpmode.cm_mode == NILFS_CHECKPOINT &&
>> +	    nilfs_feature_track_live_blks(nilfs))
>> +		ret = nilfs_ioctl_fix_starving_segs(nilfs, inode);
>
> Should we use this return value? It doesn't relate to the success or
> failure of the "chcp" operation itself.
>
> nilfs_ioctl_fix_starving_segs() is called every time "chcp cp" is
> called. I would prefer to defer this extra work to a workqueue and
> to skip starting a new work item if the previous one is still
> running.

Good idea. I'll look into it.
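Something along these lines, maybe. This is only a sketch: the
ns_calibrate_work and ns_calibrate_pending fields and the ns_sb
back-pointer are made-up names that would have to be added to struct
the_nilfs, with INIT_WORK() called during setup and
cancel_work_sync() in the teardown path:

static void nilfs_calibrate_workfn(struct work_struct *work)
{
	struct the_nilfs *nilfs = container_of(work, struct the_nilfs,
					       ns_calibrate_work);

	nilfs_ioctl_fix_starving_segs(nilfs, nilfs->ns_sb);
	clear_bit(0, &nilfs->ns_calibrate_pending);
}

	/* in nilfs_ioctl_change_cpmode(), instead of the direct call: */
	if (is_snapshot > 0 && cpmode.cm_mode == NILFS_CHECKPOINT &&
	    nilfs_feature_track_live_blks(nilfs) &&
	    !test_and_set_bit(0, &nilfs->ns_calibrate_pending))
		schedule_work(&nilfs->ns_calibrate_work);

That way "chcp cp" returns immediately and at most one calibration
pass runs at a time.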
>>  out:
>>  	mnt_drop_write_file(filp);
>>  	return ret;
>> diff --git a/fs/nilfs2/sufile.c b/fs/nilfs2/sufile.c
>> index 9cd8820d..47e2c05 100644
>> --- a/fs/nilfs2/sufile.c
>> +++ b/fs/nilfs2/sufile.c
>> @@ -1215,6 +1215,91 @@ out_sem:
>>  }
>>
>>  /**
>> + * nilfs_sufile_fix_starving_segs - fix potentially starving segments
>> + * @sufile: inode of segment usage file
>> + * @segnum: segnum to start with
>> + * @nsegs: number of segments to check
>> + *
>> + * Description: Scans for segments that are potentially starving and
>> + * reduces the number of live blocks to less than half of the maximum
>> + * number of blocks in a segment. This way the segment is more likely
>> + * to be chosen by the GC. A segment is marked as potentially starving
>> + * if more than half of the blocks it contains are protected by
>> + * snapshots.
>> + *
>> + * Return Value: On success, 0 is returned and on error, one of the
>> + * following negative error codes is returned.
>> + *
>> + * %-EIO - I/O error.
>> + *
>> + * %-ENOMEM - Insufficient amount of memory available.
>> + */
>> +int nilfs_sufile_fix_starving_segs(struct inode *sufile, __u64 segnum,
>> +				   __u64 nsegs)
>> +{
>> +	struct buffer_head *su_bh;
>> +	struct nilfs_segment_usage *su;
>> +	size_t n, i, susz = NILFS_MDT(sufile)->mi_entry_size;
>> +	struct the_nilfs *nilfs = sufile->i_sb->s_fs_info;
>> +	void *kaddr;
>> +	unsigned long maxnsegs, segusages_per_block;
>> +	__u32 max_segblks = nilfs->ns_blocks_per_segment >> 1;
>> +	int ret = 0, blkdirty, dirty = 0;
>> +
>> +	down_write(&NILFS_MDT(sufile)->mi_sem);
>> +
>
>> +	maxnsegs = nilfs_sufile_get_nsegments(sufile);
>> +	segusages_per_block = nilfs_sufile_segment_usages_per_block(sufile);
>> +	nsegs += segnum;
>> +	if (nsegs > maxnsegs)
>> +		nsegs = maxnsegs;
>> +
>> +	while (segnum < nsegs) {
>
> The local variable "nsegs" is used here as an (exclusive) end
> segment number, which is confusing. You should define a separate
> "end" variable. It can simply be calculated as:
>
>	end = min_t(__u64, segnum + nsegs, nilfs_sufile_get_nsegments(sufile));
>
> ("maxnsegs" can then be removed.)
>
> Note that each argument is evaluated only once in the min_t() macro,
> since min_t() temporarily stores the evaluation results in hidden
> local variables and uses those for the comparison.

Ok.
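For reference, the reworked loop head would then look roughly like
this (sketch only):

	__u64 end;

	down_write(&NILFS_MDT(sufile)->mi_sem);

	/* clamp the scan window to the current number of segments */
	end = min_t(__u64, segnum + nsegs,
		    nilfs_sufile_get_nsegments(sufile));

	while (segnum < end) {
		n = nilfs_sufile_segment_usages_in_block(sufile, segnum,
							 end - 1);
		/* ... as before ... */
	}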
Regards,
Andreas Rohner

> Regards,
> Ryusuke Konishi
>
>
>> +		n = nilfs_sufile_segment_usages_in_block(sufile, segnum,
>> +							 nsegs - 1);
>> +
>> +		ret = nilfs_sufile_get_segment_usage_block(sufile, segnum,
>> +							   0, &su_bh);
>> +		if (ret < 0) {
>> +			if (ret != -ENOENT)
>> +				goto out;
>> +			/* hole */
>> +			segnum += n;
>> +			continue;
>> +		}
>> +
>> +		kaddr = kmap_atomic(su_bh->b_page);
>> +		su = nilfs_sufile_block_get_segment_usage(sufile, segnum,
>> +							  su_bh, kaddr);
>> +		blkdirty = 0;
>> +		for (i = 0; i < n; ++i, ++segnum, su = (void *)su + susz) {
>> +			if (le32_to_cpu(su->su_nsnapshot_blks) <= max_segblks)
>> +				continue;
>> +			if (le32_to_cpu(su->su_nlive_blks) <= max_segblks)
>> +				continue;
>> +
>> +			su->su_nlive_blks = cpu_to_le32(max_segblks);
>> +			su->su_nsnapshot_blks = cpu_to_le32(max_segblks);
>> +			blkdirty = 1;
>> +		}
>> +
>> +		kunmap_atomic(kaddr);
>> +		if (blkdirty) {
>> +			mark_buffer_dirty(su_bh);
>> +			dirty = 1;
>> +		}
>> +		put_bh(su_bh);
>> +		cond_resched();
>> +	}
>> +
>> +out:
>> +	if (dirty)
>> +		nilfs_mdt_mark_dirty(sufile);
>> +
>> +	up_write(&NILFS_MDT(sufile)->mi_sem);
>> +	return ret;
>> +}
>> +
>> +/**
>>   * nilfs_sufile_alloc_cache_node - allocate and insert a new cache node
>>   * @sufile: inode of segment usage file
>>   * @group: group to allocate a node for
>> diff --git a/fs/nilfs2/sufile.h b/fs/nilfs2/sufile.h
>> index 3466abb..f11e3e6 100644
>> --- a/fs/nilfs2/sufile.h
>> +++ b/fs/nilfs2/sufile.h
>> @@ -30,6 +30,7 @@
>>
>>  #define NILFS_SUFILE_CACHE_NODE_SHIFT	6
>>  #define NILFS_SUFILE_CACHE_NODE_COUNT	(1 << NILFS_SUFILE_CACHE_NODE_SHIFT)
>> +#define NILFS_SUFILE_STARVING_SEGS_STEP	(1 << 15)
>>
>>  struct nilfs_sufile_cache_node {
>>  	__u32 values[NILFS_SUFILE_CACHE_NODE_COUNT];
>> @@ -88,6 +89,8 @@ int nilfs_sufile_resize(struct inode *sufile, __u64 newnsegs);
>>  int nilfs_sufile_read(struct super_block *sb, size_t susize,
>>  		      struct nilfs_inode *raw_inode, struct inode **inodep);
>>  int nilfs_sufile_trim_fs(struct inode *sufile, struct fstrim_range *range);
>> +int nilfs_sufile_fix_starving_segs(struct inode *sufile, __u64 segnum,
>> +				   __u64 nsegs);
>>  int nilfs_sufile_dec_nlive_blks(struct inode *sufile, __u64 segnum);
>>  void nilfs_sufile_shrink_cache(struct inode *sufile);
>>  int nilfs_sufile_flush_cache(struct inode *sufile, int only_mark,
>> --
>> 2.3.7