Re: [RFC 0/9] ext4: An Auxiliary Tree for the Directory Index

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Sat, May 11, 2013 at 09:28:34PM +0800, Zheng Liu wrote:
>Hi Radek,
>
>Could you please rebase your patch series against dev branch of ext4
>tree?  I couldn't apply your patches against this branch.  You can get
>the tree from here:
>  https://git.kernel.org/cgit/linux/kernel/git/tytso/ext4.git/

Hi Zheng,

Sure, I will do the rebase and repost the series tomorrow.

-Radek

>Regards,
>                                                - Zheng
>
>On Sat, May 04, 2013 at 11:28:33PM +0200, Radek Pazdera wrote:
>> Hello everyone,
>> 
>> I am an university student from Brno /CZE/. I decided to try to optimise
>> the readdir/stat scenario in ext4 as the final project to school. I
>> posted some test results I got few months ago [1].
>> 
>> I tried to implement an additional tree for ext4's directory index
>> that would be sorted by inode numbers. The tree then would be used
>> by ext4_readdir() which should lead to substantial increase of
>> performance of operations that manipulate a whole directory at once.
>> 
>> The performance increase should be visible especially with large
>> directories or in case of low memory or cache pressure.
>> 
>> This patch series is what I've got so far. I must say, I originally
>> thought it would be *much* simpler :).
>> 
>> TLDR
>> ====
>> 
>> The series contains the implementation of the new tree and several
>> rather small changes to the original directory index. I also added
>> a new implementation of ext4_readdir() that uses this new tree
>> instead of the original one.
>> 
>> It applies on 3.9, it is basically working, however, it is highly
>> experimental at the moment. It doesn't survive XFS tests yet (I
>> still need to work on that).
>> 
>> DESIGN
>> ======
>> 
>> The tree comes with a new read-only compatible file system feature
>> called "itree". It is created in a directory at the same time as the
>> original dx tree -- when the directory file exceeds a signle block of
>> size. It is indicated by an inode flag.
>> 
>> The tree resides completely outside of the directory file. I am using
>> 64bit block numbers (as suggested by Ted Ts'o), and the pointer to its
>> root is stored in the end of the dx_root block.
>> 
>> I tried to keep the structure of the tree as close as possible to the
>> original dx tree. It is a B+ tree with a 64bit long compound key. The
>> key consists of a inode number and a hash.
>> 
>> The hash is there because of hard links. On ext4 there can be as much
>> as 65 000 names for a single file which is, from the point of the inode
>> tree, a collision. It is a very unlikely scenario to crate over 60 000
>> names for a file from a single directory. But it would be a very serious
>> problem for readdir() if someone tried to do that, so I decided to add
>> the hash to the key as well. This reduces the problems of collisions to
>> the same as of the hash tree.
>> 
>> The depth of the tree is limited by a constant in the code to 3 levels,
>> but it can be increased anytime.
>> 
>> I decided to keep the directory entries within the leaf nodes sorted,
>> which might have been a bad idea (and it brought me quite a few
>> sleepless nights of pointer debugging). However, it is more convenient
>> for readdir and splits. And in the case of the inode tree, there
>> shouldn't be that much memmoving, because inodes are allocated linearly,
>> therefore we're adding to the end most of the time anyway.
>> 
>> I also had to implement coalesce-on-delete. Unlike the hash values,
>> the inode numbers are not random, so the tree would get fragmented
>> really easily. For example when a range of inodes would be freed
>> and allocated somewhere else, that part of the tree would stay empty
>> forever.
>> 
>> In this implementation I used 8 bits for "node fullness". Neighbour
>> nodes in the tree are merged as soon as their total fullness is smaller
>> than maximum fullness. This way, coalescing happens too often. I plan
>> to reduce it to only 2 bits (25% full, 50% full, 75% full, 100% full).
>> 
>> There are also some optimisations to increase the fullness of blocks
>> within the tree, because if the file system adds a contiguous sequence
>> of inodes to a directory and we split the nodes in half, there will be
>> some tree nodes that will never get another entry and still be only 50%
>> full. In the current implementation, an append is performed instead of
>> a split in case the new entry should be added to the end of the full
>> block.
>> 
>> BENCHMARKS
>> ==========
>> 
>> I did some benchmarks and compared the performance with ext4/htree,
>> XFS, and btrfs up to 5 000 000 of files in a single directory. Not
>> all of them are done though (they run for days).
>> 
>> Probably the biggest improvement can be observed with copying files.
>> I used two physical SATA disks for the test. For 5M files, the itree
>> implementation is 14x faster. With metadata only, ext4 is even a tiny
>> bit faster than xfs:
>> 
>>     http://www.stud.fit.vutbr.cz/~xpazde00/soubory/ext4-5M/0B-files/cp.png
>> 
>> With each files 4kB big, the inode tree is 11x faster:
>> 
>>     http://www.stud.fit.vutbr.cz/~xpazde00/soubory/ext4-5M/4096B-files/cp.png
>> 
>> On the other hand, probably the biggest drawback of this implementation
>> is that it slows down file creates, as we now have to insert the entry
>> into both trees. The difference gets bigger as both trees grow (and their
>> blocks get further apart). For 5M directory entries, the creation is
>> roughly 40% slower. Hopefully, I'll be able to optimise it a bit in the
>> future:
>> 
>>     http://www.stud.fit.vutbr.cz/~xpazde00/soubory/ext4-5M/0B-files/create.png
>>     http://www.stud.fit.vutbr.cz/~xpazde00/soubory/ext4-5M/4096B-files/create.png
>> 
>> Deleting entries should also very much benefit from this. However, the
>> increase of performance in my test is far lower than expected. I think
>> that is due to my implementation of coalesce-on-delete, which happens
>> too often and during these massive deletes, it coallesces all the time.
>> I hope that when I fix that, it will get somewhere close to XFS as well:
>> 
>>     http://www.stud.fit.vutbr.cz/~xpazde00/soubory/ext4-5M/0B-files/delete.png
>>     http://www.stud.fit.vutbr.cz/~xpazde00/soubory/ext4-5M/4096B-files/delete.png
>> 
>> All of the results have a graph and a table with precise values.
>> You can always get the table by replacing the suffix of the graph
>> *.png to *.results in the end of the link.
>> 
>> Full results are available here:
>>     http://www.stud.fit.vutbr.cz/~xpazde00/soubory/ext4-5M/
>> 
>> 
>> I also did some tests on an aged file system (I used the simple 0.8
>> chance to create, 0.2 to delete a file) where the results of ext4
>> with itree are much better even than xfs, which gets fragmented:
>> 
>>     http://www.stud.fit.vutbr.cz/~xpazde00/soubory/5M-dirty/cp.png
>>     http://www.stud.fit.vutbr.cz/~xpazde00/soubory/5M-dirty/readdir-stat.png
>> 
>> 
>> There are still some things to be done, the checksums are not yet
>> finished and I certainly need to do a bit of cleaning up at some
>> places. But as far as features go, it all should be there already.
>> 
>> I'm working on this along with Lukas Czerner, who acts as my mentor
>> and helps me with different things (big thanks!).
>> 
>> Any feedback/ideas are greatly appeciated :)!
>> 
>> Cheers,
>> Radek
>> 
>> 
>> Radek Pazdera (9):
>>   ext4: Adding itree feature and inode flags
>>   ext4: Allow sorting dx_map by inode as well
>>   ext4: Adding a link to itree to the dx_root struct
>>   ext4: Adding itree structures
>>   ext4: Adding itree implementation I - Core
>>   ext4: Adding itree implementation II - Inserting
>>   ext4: Adding itree implementation III - Deleting
>>   ext4: Make directory operations use itree
>>   ext4: Make ext4_readdir() use itree if available
>> 
>>  fs/ext4/dir.c   |  177 +++-
>>  fs/ext4/ext4.h  |   21 +-
>>  fs/ext4/namei.c | 2442 +++++++++++++++++++++++++++++++++++++++++++++++++++++--
>>  3 files changed, 2565 insertions(+), 75 deletions(-)
>> 
>> -- 
>> 1.7.11.7
>> 
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html




[Index of Archives]     [Reiser Filesystem Development]     [Ceph FS]     [Kernel Newbies]     [Security]     [Netfilter]     [Bugtraq]     [Linux FS]     [Yosemite National Park]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Samba]     [Device Mapper]     [Linux Media]

  Powered by Linux