Re: [PATCH, RFC] ext4: New inode/block allocation algorithms for flex_bg filesystems

Andreas Dilger <adilger@xxxxxxx> · Tue, 24 Feb 2009 15:41:08 -0700

Ted Ts'o wrote something like the following (didnt' get original email):
> @@ -122,6 +122,9 @@ struct ext4_inode_info {
>  	struct list_head i_prealloc_list;
>  	spinlock_t i_prealloc_lock;
> 
> +	/* ialloc */
> +	ext4_group_t	i_last_alloc_group;

Even better would be to store i_last_alloc_inode.  In the past Eric
has demonstrated workloads that are allocating lots of inodes exhibit
O(n^2) behaviour because the entire group bitmap is searched from the
start each time, and that can cumulatively be very slow.  Having the
directory start searching from the most recently allocated inode would
make this O(n), and would not significantly alter behaviour.

If the inode is newly read from disk this would be initialized to 0,
or the inode number of the directory.  If it was recently creating
files it should be the previously allocated inode.  The only issue
is with inodes being repeatedly created and deleted in the same
directory, in which case we could reset this at delete time to be

    max(first inode in group, directory inode)

so that the same inodes keep being reused by the directory.  This could
be limited to clearing last_alloc_inode if the inode being freed is in
the parent group.

Hmm, it seems something like this is already done in find_group_orlov().
More discussion there.

> @@ -170,10 +172,23 @@ static ext4_fsblk_t ext4_ext_find_goal(struct inode *inode,
> +	if (flex_size >= 4) {
> +		block_group &= ~(flex_size-1);
> +		if (S_ISREG(inode->i_mode))
> +			block_group++;
> +	}

The usage of "4" in several places to threshold flexbg size is also
confusing to me.

> +	bg_start = (block_group * EXT4_BLOCKS_PER_GROUP(inode->i_sb)) +
>  		le32_to_cpu(EXT4_SB(inode->i_sb)->s_es->s_first_data_block);
>  	last_block = ext4_blocks_count(EXT4_SB(inode->i_sb)->s_es) - 1;

Similar to the above proposal, Eric and I discussed storing the most
recently selected (flex) group in the superblock would avoid repeated
group scanning for allocations in separate directories.

> @@ -446,12 +481,21 @@ static int find_group_orlov(struct super_block *sb, struct inode *parent,
> +	if (flex_size < 4)
> +		flex_size = 1;
> +	else {

Generally CodingStyle has { } around the first "if" clause if they are
around the "else" clause.

> @@ -460,71 +504,85 @@ static int find_group_orlov(struct super_block *sb, struct inode *parent,
> +	if (S_ISDIR(mode) &&
> +	    ((parent == sb->s_root->d_inode) ||
> +	     (EXT4_I(parent)->i_flags & EXT4_TOPDIR_FL))) {
> 
>  		get_random_bytes(&grp, sizeof(grp));
>  		parent_group = (unsigned)grp % ngroups;
>  		for (i = 0; i < ngroups; i++) {
> +			g = (parent_group + i) % ngroups;
> +			get_orlov_stats(sb, g, flex_size, &stats);

Two notes here -
- get_random_bytes() is actually a fairly CPU-intensive operation, though
  I'm not sure it is significant in this context
- more importantly, scanning ALL groups in a very large filesystem to find
  an optimal group.  Large filesystems can have upwards of 60-100k groups
  to scan, and even with flexbg it seems that get_orlov_stats() is not
  using the flex_group[] stats in the common case where flex_size is the
  same as the flexbg size, but recalculating these for every allocation.

> +			if (!stats.free_inodes)
>  				continue;
> +			if (stats.used_dirs >= best_ndir)
>  				continue;
> +			if (stats.free_inodes < avefreei)
>  				continue;
> +			if (stats.free_blocks < avefreeb)
>  				continue;
> +			grp = g;
>  			ret = 0;
> +			best_ndir = stats.used_dirs;

Based on the above, it would be preferable to settle for a "local"
optimal group after scanning N successful candidates, then checking
the "most recently selected" group saved in the superblock if it is
still better than the local optimum.  Of course it should continue to
scan all groups if there are no suitable groups yet found.

Since the starting group is random, this will avoid being stuck in
a bad local minimum, while avoiding an exhaustive search when there
are a lot of groups.

> +	min_inodes = avefreei - inodes_per_group*flex_size / 4;
> +	min_blocks = avefreeb - EXT4_BLOCKS_PER_GROUP(sb)*flex_size / 4;

style - "inodes_per_group * flex_size / 4"

>  static int find_group_other(struct super_block *sb, struct inode *parent,
> +			    ext4_group_t *group, int mode)
>  {
> +	/*
> +	 * If we are doing flex_bg style allocation, try to put
> +	 * special inodes in the first block group; start files and
> +	 * directories at the 2nd block group in the flex_bg.
> +	 */
> +	if (flex_size >= 4) {
> +		int ret, retry = 0;
> +
> +	try_again:
> +		i = (parent_group & ~(flex_size-1));
> +		last = i+flex_size;

Can you please use a better variable name than "i", "grp", or
"new_group" or similar?

> +		while (i < last) {
> +			desc = ext4_get_group_desc(sb, i, NULL);
> +			if (desc && ext4_free_inodes_count(sb, desc)) {
> +				*group = i;
> +				return 0;
> +			}
> +			i++;
> +		}
> +		if (!retry && EXT4_I(parent)->i_last_alloc_group != ~0) {
> +			retry = 1;
> +			parent_group = EXT4_I(parent)->i_last_alloc_group;
> +			goto try_again;
> +		}

It would seem that using i_last_alloc_group FIRST (if set) would be better
than always checking all groups in the flexbg.  Either i_last_alloc_group
is in the same flex group as parent_group (so no harm done) or a previous
allocation wasn't able to find any free inode in that flex group and it
had to do a more expensive search elsewhere.

All that needs to be changed is to possibly reset i_last_alloc_group
to be parent_group when there is a file unlinked from the directory,
though this wouldn't be great for huge directories since any unlink
would cause a full scan.

Hmm, maybe checking ONLY the parent group for free inodes, then the
i_last_alloc_group, then remaining groups in the flex group, then
all remaining groups is the best ordering.  It allows inodes to be
clustered if there is space in the parent group (whether due to freeing
inodes in this directory or others), but avoids a potentially lengthy
scan for each inode (which might be noticable if there are 64 or 128
groups per flexbg).

Another possibility is if the starting goal inode was based on the group
a majority of dirent inode numbers in the same directory leaf.  This would
require deferring inode number selection until the dirent is about to be
inserted (i.e. delalloc for inodes) by moving the inode selection into
ext4_add_entry().  Combining this with a hash->inode range mapping
(as I've proposed in the past, range size based on the directory size)
would improve the htree->inode ordering and would also improve performance
for large directories.

It might not be a bad idea to separate the inode selection from the
inode initialization anyway, as ext4_new_inode() is pretty huge already,
and I can't see anything except the i_ino setting and insert_inode_hash()
that is use the inode number.  That means it should be possible to defer
the inode selection to ext4_add_entry() and ext4_dx_add_entry() once the
leaf block is found.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html