Re: [PATCH] Add block_high_watermark sysfs tunable.

On Aug 21, 2018, at 8:57 PM, Jaco Kroon <jaco@xxxxxxxxx> wrote:
> 
> Hi,
> 
> The below is based on a suggestion from Andreas.  I opted for a sysfs
> tunable rather than an ioctl (which could also be used) because it
> allows the value to be set without a special tool.
> 
> Andreas mentioned a number of functions to adjust for allocating blocks,
> but only the two below seem to relate.  I've also worked through all
> uses of ext4_blocks_count() and could not find further use cases that
> need adjustment.  Some of the uses are during mount only (cluster
> reservations), and since this change is per-mount and not persistent in
> the superblock, adjustments won't have any effect there.  If this is
> desirable I'll try putting this in the superblock instead, but this would
> require allocating feature bits and I'm not sure this change is worth a
> feature bit.

Haven't had a chance to look at the patch yet, but thanks for submitting.
One comment below...

> I'll attempt limiting inode allocation in ialloc based on this next,
> just wanted to get feedback on the below first.
> 
> My big question is this:  how do I build a test case for this code?
> 
> As an aside, the resize2fs from 64T to 56T eventually finished some time
> during the night, so just under 17 days in total.  If an online resize
> of the same filesystem takes double or even triple that, it'll still be
> a massive win for me, even if we need to take the filesystem offline
> for two days for the final stages.  I know a fsck on this system takes
> ~18 hours (will probably be a bit less now), so a shrink without having
> to move data blocks will in all likelihood take at least that long.  A
> debugfs ncheck (which is needed to migrate inodes) took ~12 hours on
> 64TB; icheck took ~12 minutes, most of it spent opening the
> filesystem.

Note that if opening the filesystem takes a long time, you can use
"debugfs -c" to skip loading the block and inode bitmaps, which can
speed things up significantly.  That doesn't work for everything
(definitely not filesystem-modifying operations) but for many read-only
operations it is very useful.

Cheers, Andreas

> On 22/08/2018 04:21, Jaco Kroon wrote:
>> NOT READY FOR MERGE!!!!!!
>> 
>> Limiting block allocations to a high watermark will eventually enable us
>> to perform online shrinks of an ext4 filesystem.  As an immediate
>> benefit it'll prevent allocation of blocks in the high range, which if
>> performed as a precursor to an offline filesystem shrink will help to
>> reduce the overall time a filesystem needs to be taken offline in order
>> to shrink it.
>> 
>> (possible) shortcomings:
>> 
>> Currently this tunable does not get stored to the superblock, and thus
>> needs to be set again after each mount.
>> 
>> The ext4_statfs function doesn't currently adjust the f_bavail value,
>> so df will report incorrect results.
>> 
>> The inode allocator hasn't been synced yet.
>> ---
>> fs/ext4/balloc.c  |  2 +-
>> fs/ext4/ext4.h    | 10 ++++++++++
>> fs/ext4/mballoc.c |  2 +-
>> fs/ext4/sysfs.c   | 19 +++++++++++++++++++
>> 4 files changed, 31 insertions(+), 2 deletions(-)
>> 
>> diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
>> index e5d6ee61ff48..4f723c7a9c88 100644
>> --- a/fs/ext4/balloc.c
>> +++ b/fs/ext4/balloc.c
>> @@ -883,7 +883,7 @@ ext4_fsblk_t ext4_inode_to_goal_block(struct inode *inode)
>> 			block_group++;
>> 	}
>> 	bg_start = ext4_group_first_block_no(inode->i_sb, block_group);
>> -	last_block = ext4_blocks_count(EXT4_SB(inode->i_sb)->s_es) - 1;
>> +	last_block = ext4_blocks_max_allocatable(EXT4_SB(inode->i_sb)) - 1;
>> 
>> 	/*
>> 	 * If we are doing delayed allocation, we don't need take
>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
>> index 0f0edd1cd0cd..dc30ea107c55 100644
>> --- a/fs/ext4/ext4.h
>> +++ b/fs/ext4/ext4.h
>> @@ -1423,6 +1423,7 @@ struct ext4_sb_info {
>> 	unsigned int s_mb_order2_reqs;
>> 	unsigned int s_mb_group_prealloc;
>> 	unsigned int s_max_dir_size_kb;
>> +	ext4_fsblk_t s_block_high_watermark; /* allocators must not allocate blocks above this */
>> 	/* where last allocation was done - for stream allocation */
>> 	unsigned long s_mb_last_group;
>> 	unsigned long s_mb_last_start;
>> @@ -2711,6 +2712,15 @@ static inline ext4_fsblk_t ext4_blocks_count(struct ext4_super_block *es)
>> 		le32_to_cpu(es->s_blocks_count_lo);
>> }
>> 
>> +static inline ext4_fsblk_t ext4_blocks_max_allocatable(struct ext4_sb_info *sbi)
>> +{
>> +	ext4_fsblk_t blocks = ext4_blocks_count(sbi->s_es);
>> +	if (sbi->s_block_high_watermark && sbi->s_block_high_watermark < blocks)
>> +		return sbi->s_block_high_watermark;
>> +	else
>> +		return blocks;
>> +}
>> +
>> static inline ext4_fsblk_t ext4_r_blocks_count(struct ext4_super_block *es)
>> {
>> 	return ((ext4_fsblk_t)le32_to_cpu(es->s_r_blocks_count_hi) << 32) |
>> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
>> index e29fce2fbf25..a158c2c9de10 100644
>> --- a/fs/ext4/mballoc.c
>> +++ b/fs/ext4/mballoc.c
>> @@ -4232,7 +4232,7 @@ ext4_mb_initialize_context(struct ext4_allocation_context *ac,
>> 	/* start searching from the goal */
>> 	goal = ar->goal;
>> 	if (goal < le32_to_cpu(es->s_first_data_block) ||
>> -			goal >= ext4_blocks_count(es))
>> +			goal >= ext4_blocks_max_allocatable(sbi))
>> 		goal = le32_to_cpu(es->s_first_data_block);
>> 	ext4_get_group_no_and_offset(sb, goal, &group, &block);
>> 
>> diff --git a/fs/ext4/sysfs.c b/fs/ext4/sysfs.c
>> index 9212a026a1f1..2a1a955c2c0b 100644
>> --- a/fs/ext4/sysfs.c
>> +++ b/fs/ext4/sysfs.c
>> @@ -30,6 +30,7 @@ typedef enum {
>> 	attr_feature,
>> 	attr_pointer_ui,
>> 	attr_pointer_atomic,
>> +	attr_block_high_watermark,
>> } attr_id_t;
>> 
>> typedef enum {
>> @@ -167,6 +168,7 @@ EXT4_ATTR_FUNC(delayed_allocation_blocks, 0444);
>> EXT4_ATTR_FUNC(session_write_kbytes, 0444);
>> EXT4_ATTR_FUNC(lifetime_write_kbytes, 0444);
>> EXT4_ATTR_FUNC(reserved_clusters, 0644);
>> +EXT4_ATTR_FUNC(block_high_watermark, 0600);
>> 
>> EXT4_ATTR_OFFSET(inode_readahead_blks, 0644, inode_readahead,
>> 		 ext4_sb_info, s_inode_readahead_blks);
>> @@ -217,6 +219,7 @@ static struct attribute *ext4_attrs[] = {
>> 	ATTR_LIST(errors_count),
>> 	ATTR_LIST(first_error_time),
>> 	ATTR_LIST(last_error_time),
>> +	ATTR_LIST(block_high_watermark),
>> 	NULL,
>> };
>> 
>> @@ -304,6 +307,9 @@ static ssize_t ext4_attr_show(struct kobject *kobj,
>> 		return print_tstamp(buf, sbi->s_es, s_first_error_time);
>> 	case attr_last_error_time:
>> 		return print_tstamp(buf, sbi->s_es, s_last_error_time);
>> +	case attr_block_high_watermark:
>> +		return snprintf(buf, PAGE_SIZE, "%llu\n",
>> +				(unsigned long long) sbi->s_block_high_watermark);
>> 	}
>> 
>> 	return 0;
>> @@ -318,6 +324,7 @@ static ssize_t ext4_attr_store(struct kobject *kobj,
>> 	struct ext4_attr *a = container_of(attr, struct ext4_attr, attr);
>> 	void *ptr = calc_ptr(a, sbi);
>> 	unsigned long t;
>> +	unsigned long long t2;
>> 	int ret;
>> 
>> 	switch (a->attr_id) {
>> @@ -338,6 +345,18 @@ static ssize_t ext4_attr_store(struct kobject *kobj,
>> 		return inode_readahead_blks_store(sbi, buf, len);
>> 	case attr_trigger_test_error:
>> 		return trigger_test_error(sbi, buf, len);
>> +	case attr_block_high_watermark:
>> +		if (!ptr)
>> +			return 0;
>> +		ret = kstrtoull(skip_spaces(buf), 0, &t2);
>> +		if (ret)
>> +			return ret;
>> +		if (t2 > ext4_blocks_count(sbi->s_es))
>> +			return -EINVAL;
>> +		if (t2 && t2 < le32_to_cpu(sbi->s_es->s_first_data_block))
>> +			return -EINVAL;
>> +		sbi->s_block_high_watermark = t2;
>> +		return len;
>> 	}
>> 	return 0;
>> }
> 
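
On the question of building a test case: one low-cost option, offered only as a sketch, is to mirror the patch's arithmetic in a small userspace model and assert on the boundaries before testing on a real mount. Everything named model_* below is hypothetical and not part of the patch; the struct just stands in for the three values the patch reads (ext4_blocks_count(), s_first_data_block and s_block_high_watermark). It does not exercise the kernel allocators themselves.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

typedef uint64_t fsblk_t;	/* stand-in for ext4_fsblk_t (assumption) */

/* Hypothetical userspace model of the fields the patch consults. */
struct model_sb {
	fsblk_t blocks_count;		/* models ext4_blocks_count(es) */
	uint32_t first_data_block;	/* models es->s_first_data_block */
	fsblk_t high_watermark;		/* models s_block_high_watermark, 0 = unset */
};

/* Same clamp as ext4_blocks_max_allocatable() in the patch. */
static fsblk_t model_max_allocatable(const struct model_sb *sb)
{
	if (sb->high_watermark && sb->high_watermark < sb->blocks_count)
		return sb->high_watermark;
	return sb->blocks_count;
}

/* Same range checks as the attr_block_high_watermark store case. */
static int model_set_watermark(struct model_sb *sb, fsblk_t value)
{
	if (value > sb->blocks_count)
		return -EINVAL;
	if (value && value < sb->first_data_block)
		return -EINVAL;
	sb->high_watermark = value;
	return 0;
}

/* Same goal reset as the ext4_mb_initialize_context() hunk. */
static fsblk_t model_fix_goal(const struct model_sb *sb, fsblk_t goal)
{
	if (goal < sb->first_data_block || goal >= model_max_allocatable(sb))
		return sb->first_data_block;
	return goal;
}

/* Boundary checks: unset, out of range, at the watermark, cleared. */
void model_selftest(void)
{
	struct model_sb sb = {
		.blocks_count = 1000, .first_data_block = 1, .high_watermark = 0,
	};

	assert(model_max_allocatable(&sb) == 1000);	/* unset: no clamp */
	assert(model_set_watermark(&sb, 2000) == -EINVAL);
	assert(model_set_watermark(&sb, 600) == 0);
	assert(model_max_allocatable(&sb) == 600);
	assert(model_fix_goal(&sb, 599) == 599);	/* below watermark: kept */
	assert(model_fix_goal(&sb, 600) == 1);		/* at watermark: reset */
	assert(model_set_watermark(&sb, 0) == 0);	/* 0 clears the limit */
	assert(model_fix_goal(&sb, 600) == 600);
}
```

Calling model_selftest() from a small main() is enough to run it. Note that the model treats the watermark as exclusive (block number 600 is rejected when the watermark is 600), which is how the two changed call sites in the patch behave.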

