Re: Regarding random grouop search start for allocation of inode.

Andreas Dilger <adilger@xxxxxxxxx> · Fri, 4 Dec 2015 14:54:59 -0700

> On Dec 3, 2015, at 12:13 PM, lokesh jaliminche <lokesh.jaliminche@xxxxxxxxx> wrote:
> 
> Ohh thanks for the clarification. There is one more thing I would like
> to point out here.
> In the code there  is a loop to scan the groups for inode
> alllocation(Inside find_group_orlov function).
> There are some policies for group selection . while scanning the
> groups, it checks for these
> policies to be satisfied.
> If a particular group satisfy these properties it should get selected
> for inode allocation but instead
> it does further lookup in next groups.
> I think there is missing breaking condition. I have added break over
> there and here is the
> patch for that. Any reason for not having break condition over here ?
> 
> diff -Nur linux-2.6.32-431.17.1.el6.x86_64/fs/ext4/ialloc.c
> linux-2.6.32-431.17.1.el6.x86_64/fs/ext4/ialloc.c
> --- linux-2.6.32-431.17.1.el6.x86_64/fs/ext4/ialloc.c    2014-04-12
> 01:20:31.000000000 +0530
> +++ linux-2.6.32-431.17.1.el6.x86_64/fs/ext4/ialloc.c    2015-11-29
> 21:36:51.805542209 +0530
> @@ -529,6 +529,7 @@
>             grp = g;
>             ret = 0;
>             best_ndir = stats.used_dirs;
> +            break;
>         }
>         if (ret)
>         goto fallback;

No, this isn't correct.  This loop is looking for the *best* group it can find,
and your "break" would have it exit the loop as soon as the *first* directory
that matches the conditions is found.  Since those conditions are fairly weak,
for example that the group actually has free inodes, and it has better than
average free inodes and blocks, it makes sense to search beyond just the first
matching group.

That said, it also doesn't make sense to search beyond a "perfect" group that
has no allocated inodes and no allocated blocks, so a break condition could be
added to this loop and make it more efficient, especially for very large
filesystems that have 128k+ groups.

It should be noted that this part of the algorithm only applies to "top level"
directories (those below the root inode, or with the EXT4_INODE_TOPDIR flag
set, so searching a bit longer for a good group is not a bad idea in this case.

Cheers, Andreas.

> Thanks & Regards,
>   Lokesh
> 
> 
> 
> On Thu, Dec 3, 2015 at 11:28 PM, Andreas Dilger <adilger@xxxxxxxxx> wrote:
>> On Dec 3, 2015, at 01:07, lokesh jaliminche <lokesh.jaliminche@xxxxxxxxx> wrote:
>>> 
>>> Thought of giving more clarification on my question
>>> why group search start is random ? because we can also start  search
>>> for valid groups for inode allocation from the start. As this group
>>> search is random  inode selection might go to end of groups which
>>> might affect IO performance
>> 
>> Starting the inode search at the beginning of the disk each time
>> means that inode allocation will be inefficient because it will search
>> over groups that are mostly or entirely full already.
>> 
>> Allocating the new directory in a semi-random group, one that is
>> relatively unused, ensures that new
>> inode and block allocations are relatively efficient afterward.
>> 
>> Cheers, Andreas
>> 
>>> On Thu, Dec 3, 2015 at 1:14 PM, lokesh jaliminche
>>> <lokesh.jaliminche@xxxxxxxxx> wrote:
>>>> hello folks,
>>>>               I am new to ext4 code. I was going through the
>>>> ext4-source for allocation of inode.
>>>> There is one thing that I did not understand while selection of groups
>>>> for inode allocation . I came across this code snippet which is part
>>>> of find_group_orlov function. question is, why group search start is
>>>> random ?
>>>> 
>>>> Code snippet:
>>>> ==========
>>>> В·В·В·if (qstr) {
>>>> В»В·В·В·В·В·В·В·В»В·В·В·В·В·В·В·В»В·В·В·В·В·В·В·hinfo.hash_version = LDISKFS_DX_HASH_HALF_MD4;
>>>> В»В·В·В·В·В·В·В·В»В·В·В·В·В·В·В·В»В·В·В·В·В·В·В·hinfo.seed = sbi->s_hash_seed;
>>>> В»В·В·В·В·В·В·В·В»В·В·В·В·В·В·В·В»В·В·В·В·В·В·В·ldiskfsfs_dirhash(qstr->name, qstr->len, &hinfo);
>>>> В»В·В·В·В·В·В·В·В»В·В·В·В·В·В·В·В»В·В·В·В·В·В·В·grp = hinfo.hash;
>>>> В»В·В·В·В·В·В·В·В»В·В·В·В·В·В·В·} else
>>>> В»В·В·В·В·В·В·В·В»В·В·В·В·В·В·В·В»В·В·В·В·В·В·В·get_random_bytes(&grp, sizeof(grp));
>>>> В»В·В·В·В·В·В·В·В»В·В·В·В·В·В·В·parent_group = (unsigned)grp % ngroups;
>>>> В»В·В·В·В·В·В·В·В»В·В·В·В·В·В·В·for (i = 0; i < ngroups; i++) {
>>>> В»В·В·В·В·В·В·В·В»В·В·В·В·В·В·В·В»В·В·В·В·В·В·В·g = (parent_group + i) % ngroups;
>>>> В»В·В·В·В·В·В·В·В»В·В·В·В·В·В·В·В»В·В·В·В·В·В·В·get_orlov_stats(sb, g, flex_size, &stats);
>>>> В»В·В·В·В·В·В·В·В»В·В·В·В·В·В·В·В»В·В·В·В·В·В·В·if (!stats.free_inodes)
>>>> В»В·В·В·В·В·В·В·В»В·В·В·В·В·В·В·В»В·В·В·В·В·В·В·В»В·В·В·В·В·В·В·continue;
>>>> В»В·В·В·В·В·В·В·В»В·В·В·В·В·В·В·В»В·В·В·В·В·В·В·if (stats.used_dirs >= best_ndir)
>>>> В»В·В·В·В·В·В·В·В»В·В·В·В·В·В·В·В»В·В·В·В·В·В·В·В»В·В·В·В·В·В·В·continue;
>>>> В»В·В·В·В·В·В·В·В»В·В·В·В·В·В·В·В»В·В·В·В·В·В·В·if (stats.free_inodes < avefreei)
>>>> В»В·В·В·В·В·В·В·В»В·В·В·В·В·В·В·В»В·В·В·В·В·В·В·В»В·В·В·В·В·В·В·continue;
>>>> В»В·В·В·В·В·В·В·В»В·В·В·В·В·В·В·В»В·В·В·В·В·В·В·if (stats.free_blocks < avefreeb)
>>>> В»В·В·В·В·В·В·В·В»В·В·В·В·В·В·В·В»В·В·В·В·В·В·В·В»В·В·В·В·В·В·В·continue;
>>>> В»В·В·В·В·В·В·В·В»В·В·В·В·В·В·В·В»В·В·В·В·В·В·В·grp = g;
>>>> В»В·В·В·В·В·В·В·В»В·В·В·В·В·В·В·В»В·В·В·В·В·В·В·ret = 0;
>>>> В»В·В·В·В·В·В·В·В»В·В·В·В·В·В·В·В»В·В·В·В·В·В·В·best_ndir = stats.used_dirs;
>>>> В»В·В·В·В·В·В·В·В»В·В·В·В·В·В·В·}
>>>> 
>>>> Thanks & Regards,
>>>> Lokesh
>>> N‹§Іжмrё›yъиљШbІX¬¶З§vШ^–)Ює{.nЗ+‰·ҐЉ{±{ xЉ{ayє К‡Ъ™л,j ўfЈў·hљ‹аz№ ®wҐўё ў·¦j:+v‰ЁЉwиjШm¶џяѕ «‘кзzZ+ѓщљЋЉЭўj"ќъ!¶i

Cheers, Andreas

Attachment:
signature.asc

Description: Message signed with OpenPGP using GPGMail