[RFC] Optimizing readdir()

Hi!

I am a student at Brno University of Technology working on my "final
project". I'm doing this with the great help of Lukas Czerner, who acts
as my mentor and guides me through the project.

I'd like to address the problem of getdents() returning entries in
hash order, which can have a significant impact on performance in some
cases. It has been mentioned here a few times already [1][2].

I did some tests [3] and it seems to me that the amount of available
page cache plays a crucial role here. It can compensate for the seeks
as long as the directory is "small enough". However, under memory
pressure, or when something starts to evict your pages, performance
can drop considerably.

The biggest performance difference can be observed when copying an
entire directory (the ext4-spd numbers are with the spd_readdir
preload - and it's *fast*):

    http://www.stud.fit.vutbr.cz/~xpazde00/ext4-tests/clean-3-1-block-files/cp.png
    http://www.stud.fit.vutbr.cz/~xpazde00/ext4-tests/clean-3-1-block-files/cp.results
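
For context, the trick spd_readdir does behind readdir() boils down
to the sketch below - this is my own standalone demo, not the actual
preload code: read the whole directory first, sort the entries by
inode number, and only then process them, so the inode tables are
touched sequentially instead of in hash order.

#include <sys/types.h>
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>

struct ent {
        ino_t ino;
        char name[256];
};

static int by_ino(const void *a, const void *b)
{
        ino_t x = ((const struct ent *)a)->ino;
        ino_t y = ((const struct ent *)b)->ino;

        return (x > y) - (x < y);
}

int main(int argc, char **argv)
{
        const char *path = argc > 1 ? argv[1] : ".";
        struct ent *ents = NULL;
        size_t n = 0, cap = 0;
        struct dirent *de;
        DIR *dir = opendir(path);

        if (!dir)
                return 1;

        while ((de = readdir(dir)) != NULL) {
                if (n == cap) {
                        cap = cap ? 2 * cap : 1024;
                        ents = realloc(ents, cap * sizeof(*ents));
                        if (!ents)
                                return 1;
                }
                ents[n].ino = de->d_ino;
                snprintf(ents[n].name, sizeof(ents[n].name), "%s",
                         de->d_name);
                n++;
        }
        closedir(dir);

        /* The crucial step: inode order ~ inode table order. */
        qsort(ents, n, sizeof(*ents), by_ino);

        for (size_t i = 0; i < n; i++)
                printf("%lu\t%s\n", (unsigned long)ents[i].ino,
                       ents[i].name);
        free(ents);
        return 0;
}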

Raw readdir() plus stat() on every file is OK up to really large
directories (as long as the page cache can hold all the inode tables,
I think). ext4's dir index holds up even with aged directories:

    http://www.stud.fit.vutbr.cz/~xpazde00/ext4-tests/readdir-stat_clean-vs-aged.results
    http://www.stud.fit.vutbr.cz/~xpazde00/ext4-tests/clean-1/readdir-stat.png
    http://www.stud.fit.vutbr.cz/~xpazde00/ext4-tests/aged/readdir-stat.png


My idea is to try to add a second tree to the current index that would
be used to retrieve entries in inode order. Its structure could be very
similar to the current HTree (the hash keys are 32 bits, and so are
inode numbers).
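
Concretely, since both keys are 32-bit values, the interior nodes
could reuse the dx_entry shape as-is, just with the inode number as
the key, and the lookup would mirror the binary search dx_probe()
already does. (itree_entry and itree_find are my names for it;
nothing like this exists yet.)

/* Same shape as dx_entry, only the key changes meaning. */
struct itree_entry {
        __le32 ino;     /* inode number instead of the name hash */
        __le32 block;   /* logical block of the child node/leaf */
};

/*
 * Mirrors the binary search in dx_probe(): find the last entry
 * whose key is <= ino. entries[0] would hold count/limit, as in
 * the HTree.
 */
static struct itree_entry *itree_find(struct itree_entry *entries,
                                      unsigned int count, __u32 ino)
{
        struct itree_entry *p = entries + 1;
        struct itree_entry *q = entries + count - 1;
        struct itree_entry *m;

        while (p <= q) {
                m = p + (q - p) / 2;
                if (le32_to_cpu(m->ino) > ino)
                        q = m - 1;
                else
                        p = m + 1;
        }
        return p - 1;
}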

The question is how hard this would hit creates, deletes, and renames.
An isolated create/delete would definitely be slower (ext4 would have
to do the same work twice). But the page cache could save the case of
creating/deleting a whole directory. Deletion might in fact benefit
a little from inode-order readdir (compare ext4 and ext4-spd):

    http://www.stud.fit.vutbr.cz/~xpazde00/ext4-tests/clean-1/delete.png
    http://www.stud.fit.vutbr.cz/~xpazde00/ext4-tests/clean-1/delete.results

One downside is that it would roughly double the size of directory
metadata, as there would probably have to be two dirents for each
entry (one per tree). If one tree linked to the other's dirents, node
splits would become extremely expensive. I had an idea about using an
array of dirents shared between the trees (see the sketch below), but
I'm not really sure how to manage free space in it efficiently.
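
To make the shared-array idea a bit more concrete, I imagine leaf
blocks where both trees keep only (key, offset) pairs and each name
is stored once in a common heap; the free-space management of that
heap is exactly the part I don't have an answer for. All names below
are hypothetical:

/*
 * Hypothetical layout: both trees index into one shared dirent
 * heap by offset, so each name is stored only once. The hard
 * part is the holes the heap accumulates on deletes/renames.
 */
struct shared_leaf_header {
        __le16 heap_start;      /* offset of the first dirent */
        __le16 free_head;       /* first hole, 0 = none (naive free list) */
        __le16 hash_count;      /* entries in the hash-keyed array */
        __le16 ino_count;       /* entries in the inode-keyed array */
};

struct key_ref {
        __le32 key;             /* name hash in one array, ino in the other */
        __le16 offset;          /* byte offset of the dirent in the heap */
        __le16 unused;
};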

The on-disk format would remain the same, apart from the dx_root
structure, which would have to carry a pointer to the root of the
second tree. I think there might be room in the embedded dx_root_info:

struct dx_root_info
{
        __le32 reserved_zero;
        u8 hash_version;
-        u8 info_length; /* 8 */
+        u8 info_length; /* 12 */
        u8 indirect_levels;
        u8 unused_flags;
+        __le32 itree_root_block; /* block of the inode-order tree root */
};
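
If I read dx_probe() in fs/ext4/namei.c correctly, the dx_entry array
is already located through info_length, roughly like this, so growing
the header by 4 bytes should keep the root block self-describing:

/* dx_probe() already does (approximately) this, so the entries
 * just shift by 4 bytes when info_length grows from 8 to 12: */
entries = (struct dx_entry *)(((char *)&root->info) +
                              root->info.info_length);

And dx_root_limit() already takes the info size as a parameter, so
the slightly smaller root limit should fall out of that, if I'm not
mistaken.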


What do you think about this approach? Is it worth trying?
Any feedback or suggestions are greatly appreciated :).


Radek Pazdera


[1] http://thread.gmane.org/gmane.comp.file-systems.ext4/23794
[2] http://thread.gmane.org/gmane.linux.file-systems/61910
[3] http://www.stud.fit.vutbr.cz/~xpazde00/ext4-tests/