On Aug 21, 2018, at 8:21 PM, Jaco Kroon <jaco@xxxxxxxxx> wrote:
>
> NOT READY FOR MERGE!!!!!!
>
> Limiting block allocations to a high watermark will eventually enable us
> to perform online shrinks of an ext4 filesystem.  As an immediate
> benefit it'll prevent allocation of blocks in the high range, which if
> performed as a precursor to an offline filesystem shrink will help to
> reduce the overall time a filesystem needs to be taken offline in order
> to shrink it.
>
> (possible) shortcomings:
>
> Currently this tunable does not get stored to the superblock, and thus
> needs to be set again after each mount.
>
> The ext4_statfs function doesn't adjust the f_bavail value currently, as
> such df will report incorrect results.
>
> The inode allocator hasn't been synced yet.

Hi Jaco,
sorry for the extreme delay in replying to this.  It was lost in my inbox
and I only just found it now.

Looking through the patch, it does seem OK for the basic functionality
intended, and would at least allow you to reduce the number of blocks
allocated at the end of the device, meaning that the offline shrink would
take less time (ideally none if all of the files are removed from the end
of the device).

With this first patch it should be possible to do an "online shrink" by
setting the high watermark, then walking the filesystem checking for any
files that have blocks beyond the HWM via "filefrag -v" and running
e4defrag on those files.  This should be largely transparent to userspace.

The current patch would not allow directly limiting inode allocation, but
the "inode_goal" tunable could be used to influence the inode selection
to allow "mkdir + rsync + mv" to move directory trees to lower inodes.
Only files currently open for write would not be safe to move to new
inodes.
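
To make the "files that have blocks beyond the HWM" check concrete, here is
a minimal userspace sketch.  The struct extent type and the
file_needs_migration() helper are hypothetical names used only for
illustration; the extent list is assumed to come from something like the
physical ranges that "filefrag -v" prints.  It also assumes the watermark
is exclusive, i.e. block numbers at or above the HWM count as "beyond" it.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical extent record, in units of filesystem blocks. */
struct extent {
	uint64_t physical_start;	/* first physical block of the extent */
	uint64_t length;		/* number of blocks in the extent */
};

/*
 * Return true if any block of the file lies at or beyond the watermark,
 * meaning the file would have to be moved (e.g. via e4defrag) before the
 * filesystem could be shrunk to "hwm" blocks.
 */
static inline bool file_needs_migration(const struct extent *ext,
					size_t count, uint64_t hwm)
{
	size_t i;

	for (i = 0; i < count; i++) {
		/* last block is start + length - 1, so compare the end */
		if (ext[i].length &&
		    ext[i].physical_start + ext[i].length > hwm)
			return true;
	}

	return false;
}
```

A file whose last extent ends exactly at the watermark is left alone, and
only files with at least one block at or past it are reported.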

I think for fully using this functionality in the kernel/e2fsprogs a few
more additions are needed, as you mentioned above:
- store the high watermark in the superblock via tune2fs, so that it is
  not lost if the system is rebooted or filesystem remounted
- fix ext4_statfs() to adjust available blocks appropriately
- avoid allocating inodes in blocks above the high watermark

Typically, using tune2fs to adjust a mounted filesystem should change the
value used by the kernel, so also having a /sys tunable gets tricky.  One
option would be to leave "sbi->s_block_high_watermark = 0" and use the
superblock value if sbi->s_block_high_watermark == 0, and only use
sbi->s_block_high_watermark if it is set directly?  Something like:

static inline ext4_fsblk_t ext4_blocks_max_allocatable(struct ext4_sb_info *sbi)
{
	ext4_fsblk_t blocks = ext4_blocks_count(sbi->s_es);

	if (unlikely(sbi->s_block_high_watermark &&
		     sbi->s_block_high_watermark < blocks))
		return sbi->s_block_high_watermark;

	if (unlikely(sbi->s_es->s_blk_high_watermark &&
		     le64_to_cpu(sbi->s_es->s_blk_high_watermark) < blocks))
		return le64_to_cpu(sbi->s_es->s_blk_high_watermark);

	return blocks;
}

This adds a bit more runtime overhead vs. setting s_block_high_watermark
from the superblock at mount time, but is more flexible.

For ext4_statfs() do we subtract only the free blocks beyond HWM from the
available count, or all blocks?  Subtracting the difference between
ext4_blocks_count() and ext4_blocks_max_allocatable() is easy (zero if no
high watermark), but the available blocks should not go negative if there
are lots of blocks used beyond the HWM and few free below it.  Better
would be if the available blocks would report the free blocks below the
HWM, but this would involve subtracting free blocks above the HWM and
adjusting this as blocks above the limit are freed.

For the inode allocation limit, it is fairly straightforward to map the
block HWM to an inode HWM based on the group descriptor that the HWM is in.
For future use (dynamic inode tables) it may be desirable to also have a
separately tunable inode HWM, but it could also be done later as needed.

On the e2fsprogs side, there should be a "-E block_high_watermark=N"
tunable added to set the field in the superblock, and support to print it
in dumpe2fs and modify it in debugfs via "ssv".  It may be useful to add a
"-f" force flag to e4defrag so that it moves inodes even if they are not
less fragmented afterward, so blocks beyond the HWM are always freed.

Alternately, block and inode move (for closed files) might be implemented
in userspace via resize2fs (essentially cp+rename) when it is doing an
online shrink of the filesystem?  That might be simpler from a user point
of view than needing to run e4defrag manually, which needs to be scripted
to find the files to be moved.

Optionally, should there be a "hard" and a "soft" block limit?  For
example, if the high watermark is set to a negative value -blocks it is a
soft limit (prefer lower allocation, but can exceed it if filesystem is
full), or have a separate "soft" flag stored somewhere else?  In the first
case, we should mask off the high bit when accessing this field, and use
it only for deciding if allocation can continue after a normal scan failed.

In the longer term, the resize ioctl could be enhanced to drop the last
group(s) if they are above the high watermark and have no used
blocks/inodes.  When trying to shrink a filesystem with in-use blocks, the
resize2fs tool could report that the HWM will be set and file migration is
needed, then do the online migration (reporting any files that are open
via lsof), returning an error at the end listing which processes are
blocking the resize.
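
To illustrate the arithmetic for the statfs clamp, the block-to-inode
mapping, and the soft-limit bit discussed above, here is a rough userspace
sketch.  None of these helpers exist in the kernel or in the patch: the
names, the choice of bit 63 as the "soft" flag, and the plain uint64_t
types are all assumptions made for illustration.  The inode mapping also
assumes inodes are only usable in groups that lie wholly below the group
containing the HWM, which ignores flex_bg placement of inode tables.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: the top bit of the stored watermark marks a "soft" limit. */
#define HWM_SOFT_FLAG	(UINT64_C(1) << 63)

/* Mask off the soft flag to get the watermark block number itself. */
static inline uint64_t hwm_block(uint64_t raw_hwm)
{
	return raw_hwm & ~HWM_SOFT_FLAG;
}

/* Soft limit: allocation may exceed it once a normal scan has failed. */
static inline bool hwm_is_soft(uint64_t raw_hwm)
{
	return (raw_hwm & HWM_SOFT_FLAG) != 0;
}

/* The watermark if it is set and below the block count, else the count. */
static inline uint64_t max_allocatable(uint64_t blocks_count, uint64_t raw_hwm)
{
	uint64_t hwm = hwm_block(raw_hwm);

	if (hwm && hwm < blocks_count)
		return hwm;

	return blocks_count;
}

/*
 * The "easy" statfs adjustment: subtract the whole region above the
 * watermark from the free count, clamping at zero so that many used
 * blocks above the HWM cannot drive the result negative.
 */
static inline uint64_t avail_blocks(uint64_t free_blocks,
				    uint64_t blocks_count, uint64_t raw_hwm)
{
	uint64_t excluded = blocks_count -
			    max_allocatable(blocks_count, raw_hwm);

	return free_blocks > excluded ? free_blocks - excluded : 0;
}

/* Inode limit: inodes in groups wholly below the group holding the HWM. */
static inline uint64_t inode_hwm(uint64_t hwm, uint64_t first_data_block,
				 uint64_t blocks_per_group,
				 uint64_t inodes_per_group)
{
	if (hwm <= first_data_block)
		return 0;

	return (hwm - first_data_block) / blocks_per_group * inodes_per_group;
}
```

This only covers the simple "subtract everything above the HWM" variant.
Reporting the actual free blocks below the HWM would need per-group free
counts, which this sketch does not attempt.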

Some minor nits in the patch inline below:

> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 0f0edd1cd0cd..dc30ea107c55 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1423,6 +1423,7 @@ struct ext4_sb_info {
> 	unsigned int s_mb_order2_reqs;
> 	unsigned int s_mb_group_prealloc;
> 	unsigned int s_max_dir_size_kb;
> +	ext4_fsblk_t s_block_high_watermark; /* allocators must not allocate blocks above this */

(style) should stay under 80 columns.  Easiest to just shorten comment to
something like "/* max allocatable block number */" or similar.

> @@ -2711,6 +2712,15 @@ static inline ext4_fsblk_t ext4_blocks_count(struct
> +static inline ext4_fsblk_t ext4_blocks_max_allocatable(struct ext4_sb_info *sbi)
> +{
> +	ext4_fsblk_t blocks = ext4_blocks_count(sbi->s_es);

(style) blank line after variable declarations

> +	if (sbi->s_block_high_watermark && sbi->s_block_high_watermark < blocks)
> +		return sbi->s_block_high_watermark;
> +	else
> +		return blocks;

(style) no need for "else" after "return".

> diff --git a/fs/ext4/sysfs.c b/fs/ext4/sysfs.c
> index 9212a026a1f1..2a1a955c2c0b 100644
> --- a/fs/ext4/sysfs.c
> +++ b/fs/ext4/sysfs.c
> @@ -304,6 +307,9 @@ static ssize_t ext4_attr_show(struct kobject *kobj,
> 		return print_tstamp(buf, sbi->s_es, s_first_error_time);
> 	case attr_last_error_time:
> 		return print_tstamp(buf, sbi->s_es, s_last_error_time);
> +	case attr_block_high_watermark:
> +		return snprintf(buf, PAGE_SIZE, "%llu\n",
> +				(s64) sbi->s_block_high_watermark);

(style) no space after typecast

Cheers, Andreas