On Mar 6, 2014, at 11:32 AM, Theodore Ts'o <tytso@xxxxxxx> wrote:
> On Thu, Mar 06, 2014 at 06:54:05PM +0100, Lukáš Czerner wrote:
>>
>> All that said, I was getting to rewrite this mess a long time ago,
>> it's just a reminder that it's something that needs to be done.
>> Especially since the bigger requests are getting split unnecessarily
>> which hurts especially in fallocate case.
>
> We should try to get input from Andreas about what some of the more
> interesting heuristics in mballoc were trying to accomplish, since
> there's a lot going on that's not obvious, and one of the reasons why
> I've always been worried about trying to do cleanups was because
> something that looks ugly might be papering over some other dark
> corner of mballoc.c ---- and so I was fairly certain that once we
> started opening up mballoc.c, we'd have to do a lot of work on it, and
> a lot of performance measurements to make sure we didn't accidentally
> introduce some performance regression.

There is actually quite a lengthy description of mballoc at the start
of the file.  I guess it would make sense to turn anything in this
thread into comments for ext4_mb_normalize_request() once verified.

So, below is hopefully a summary of what ext4_mb_normalize_request()
is actually doing.  I've CC'd Alex to correct my mistakes.

I think the first few cases are commented accurately and are
self-explanatory:

* don't prealloc blocks for non-regular files (!EXT4_MB_HINT_DATA)
  - should we reconsider this for larger directories?
* don't use prealloc if caller wants exact (EXT4_MB_HINT_GOAL_ONLY)
  - currently unused, but would be useful for defrag
* don't reserve blocks if caller doesn't want it (EXT4_MB_HINT_NOPREALLOC)
  - used for small files or if requested data fits exactly into extent
* if write is a small file, use group prealloc (EXT4_MB_HINT_GROUP_ALLOC)
  - this combines multiple small writes into a single prealloc region
    and avoids read-modify-write of RAID stripes

The rest of the function is about handling large file writes efficiently.

* round up small writes to a power-of-two value for better alignment
  - we have a patch that makes the preallocation region sizes tunable,
    if that is something of interest.  That said, we don't really use it.
* if the request is large, align it to a power-of-two boundary
  - the allocation goal is based on the logical file offset, so that
    if a file is written sparsely by multiple threads, it can coalesce
    into a densely packed file in the end.  This is common for HPC jobs,
    or applications like BitTorrent.
* the list_for_each() loops align the prealloc region with other regions
  - this helps so that once the file becomes fully allocated, the
    regions will be contiguous on disk

I'm pretty sure some of this is not 100% accurate; hopefully Alex can
comment and correct any inconsistencies.

Cheers, Andreas
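
For illustration only, the two size-related steps in the summary above
(round the size up to a power of two, then align the start on the
logical file offset) can be sketched roughly as follows.  This is not
the code in mballoc.c: the helper names round_up_pow2(), align_down()
and normalize_sketch(), and the struct norm_req, are hypothetical, sizes
are counted in blocks, and the real function works from specific size
thresholds and also consults the existing preallocation lists.

```c
#include <stdint.h>

/*
 * Hypothetical helper: round a request length (in blocks) up to the
 * next power of two.  Assumes blocks <= 2^63 so the shift cannot wrap.
 */
static uint64_t round_up_pow2(uint64_t blocks)
{
	uint64_t size = 1;

	while (size < blocks)
		size <<= 1;
	return size;
}

/* Hypothetical helper: align 'logical' down to a multiple of 'size',
 * where 'size' must be a power of two. */
static uint64_t align_down(uint64_t logical, uint64_t size)
{
	return logical & ~(size - 1);
}

/* Hypothetical result type: the normalized window, in blocks. */
struct norm_req {
	uint64_t start;	/* first logical block of the window */
	uint64_t len;	/* window length, always a power of two */
};

/*
 * Sketch of the idea only: pick a power-of-two window, aligned on the
 * logical offset, that covers [logical, logical + len).  The window
 * depends only on the logical range, not on who is writing, so
 * independent writers to the same part of the file arrive at the
 * same window.
 */
static struct norm_req normalize_sketch(uint64_t logical, uint64_t len)
{
	struct norm_req r;
	uint64_t end = logical + len;
	uint64_t size = round_up_pow2(len);
	uint64_t start = align_down(logical, size);

	/* Grow the window until it covers the end of the request. */
	while (start + size < end) {
		size <<= 1;
		start = align_down(logical, size);
	}

	r.start = start;
	r.len = size;
	return r;
}
```

With these definitions, a 3-block write at logical block 5 maps to the
window starting at block 4 with length 4, and a 4-block write at block 6
(which straddles an 8-block boundary) grows to the window starting at
block 0 with length 16.  Because the window is derived from the logical
offset alone, threads writing different parts of a sparse file end up
targeting the same aligned regions, which is the coalescing effect
described in the list.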