Hello everyone, I am a university student from Brno /CZE/. I'm trying to optimise the readdir/stat scenario in ext4 as the final project to school. I posted some test results I got few months ago [1]. And following that, I tried to implement an additional tree for ext4's directory index that would be sorted by inode numbers. The tree is used by ext4_readdir() which should lead to substantial increase of performance of operations that manipulate a whole directory at once. The performance increase should be visible especially with large directories or in case of low memory or cache pressure. This patch series is what I've got so far. I must say, I originally thought it would be *much* simpler :). TLDR ==== The series contains the implementation of the new tree and several rather small changes to the original directory index. I also added a new implementation of ext4_readdir() that uses this new tree instead of the original one. It applies on 3.10-rc1, it should be basically working, however, it is highly experimental at the moment. It doesn't survive XFS tests yet (I still need to work on that). Changes from v1 =============== * rebased on ext4-dev (it applies also on 3.10-rc1) THE DESIGN ========== The tree comes with a new read-only compatible file system feature called "itree". It is created in a directory at the same time as the original dx tree -- when the directory file exceeds a signle block of size. It is indicated by an inode flag. The tree resides completely outside of the directory file. I am using 64bit block numbers (as suggested by Ted Ts'o), and the pointer to its root is stored in the end of the dx_root block. I tried to keep the structure of the tree as close as possible to the original dx tree. It is a B+ tree with a 64bit long compound key. The key consists of a inode number and a hash. The hash is there because of hard links. With ext4 there can be as much as 65,000 names for a single file which is, from the point of the inode tree, a collision. It is a very unlikely scenario to crate over 60 000 names for a file from a single directory. But it would be a very serious problem for readdir() if someone tried to do that, so I decided to add the hash to the key as well. This reduces the problems of collisions to the same as of the hash tree. The depth of the tree is limited by a constant in the code to 3 levels, but it can be increased anytime. I decided to keep the directory entries within the leaf nodes sorted, which might have been a bad idea (and it brought me quite a fewa sleepless nights of pointer debugging). However, it is more convenient for readdir and splits. And in the case of the inode tree, there shouldn't be that much memmoving, because inodes are allocated linearly, therefore we're adding to the end most of the time anyway. I also had to implement coalesce-on-delete. Unlike the hash values, the inode numbers are not random, so the tree would get fragmented really easily. For example when a range of inodes would be freed and allocated somewhere else, that part of the tree would stay empty forever. In this implementation I used 8 bits for "node fullness". Neighbour nodes in the tree are merged as soon as their total fullness is smaller than maximum fullness. This way, coalescing happens too often. I plan to reduce it to only 2 bits (25% full, 50% full, 75% full, 100% full). There are also some optimisations to increase the fullness of blocks within the tree, because if the file system adds a contiguous sequence of inodes to a directory and we split the nodes in half, there will be some tree nodes that will never get another entry and still be only 50% full. In the current implementation, an append is performed instead of a split in case the new entry should be added to the end of the full block. BENCHMARKS ========== I did some benchmarks and compared the performance with ext4/htree, XFS, and btrfs up to 5 000 000 of files in a single directory. Not all of them are done though (they run for days). Probably the biggest improvement can be observed with copying files. I used two physical SATA disks for the test. For 5M files, the itree implementation is 14x faster. With metadata only, ext4 is even a tiny bit faster than xfs: http://www.stud.fit.vutbr.cz/~xpazde00/soubory/ext4-5M/0B-files/cp.png With each files 4kB big, the inode tree is 11x faster: http://www.stud.fit.vutbr.cz/~xpazde00/soubory/ext4-5M/4096B-files/cp.png On the other hand, probably the biggest drawback of this implementation is that it slows down file creates, as we now have to insert the entry into both trees. The difference gets bigger as both trees grow (and their blocks get further apart). For 5M directory entries, the creation is roughly 40% slower. Hopefully, I'll be able to optimise it a bit in the future: http://www.stud.fit.vutbr.cz/~xpazde00/soubory/ext4-5M/0B-files/create.png http://www.stud.fit.vutbr.cz/~xpazde00/soubory/ext4-5M/4096B-files/create.png Deleting entries should also very much benefit from this. However, the increase of performance in my test is far lower than expected. I think that is due to my implementation of coalesce-on-delete, which happens too often and during these massive deletes, it coallesces all the time. I hope that when I fix that, it will get somewhere close to XFS as well: http://www.stud.fit.vutbr.cz/~xpazde00/soubory/ext4-5M/0B-files/delete.png http://www.stud.fit.vutbr.cz/~xpazde00/soubory/ext4-5M/4096B-files/delete.png All of the results have a graph and a table with precise values. You can always get the table by replacing the suffix of the graph *.png to *.results in the end of the link. Full results are available here: http://www.stud.fit.vutbr.cz/~xpazde00/soubory/ext4-5M/ I also did some tests on an aged file system (I used the simple 0.8 chance to create, 0.2 to delete a file) where the results of ext4 with itree are much better even than xfs, which gets fragmented: http://www.stud.fit.vutbr.cz/~xpazde00/soubory/5M-dirty/cp.png http://www.stud.fit.vutbr.cz/~xpazde00/soubory/5M-dirty/readdir-stat.png There are still some things to be done, the checksums are not yet finished and I certainly need to do a bit of cleaning up at some places. But as far as features go, it all should be there already. I'm working on this along with Lukas Czerner, who acts as my mentor and helps me with different things (big thanks!). Any feedback/ideas are greatly appeciated :)! Cheers, Radek Radek Pazdera (9): ext4: Adding itree feature and inode flags ext4: Allow sorting dx_map by inode as well ext4: Adding a link to itree to the dx_root struct ext4: Adding itree structures ext4: Adding itree implementation I - Core ext4: Adding itree implementation II - Inserting ext4: Adding itree implementation III - Deleting ext4: Make directory operations use itree ext4: Make ext4_readdir() use itree if available fs/ext4/dir.c | 177 +++- fs/ext4/ext4.h | 21 +- fs/ext4/namei.c | 2460 +++++++++++++++++++++++++++++++++++++++++++++++++++++-- 3 files changed, 2582 insertions(+), 76 deletions(-) -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html