Re: [PATCH 2/2] ext4: fix bug in ext4_mb_normalize_request()

Andreas Dilger <adilger@xxxxxxxxx> · Fri, 7 Mar 2014 14:09:10 -0700

On Mar 6, 2014, at 11:32 AM, Theodore Ts'o <tytso@xxxxxxx> wrote:
> On Thu, Mar 06, 2014 at 06:54:05PM +0100, Lukáš Czerner wrote:
>> 
>> All that said, I was getting to rewrite this mess a long time ago,
>> it's just a reminder that it's something that needs to be done.
>> Especially since the bigger requests are getting split unnecessarily
>> which hurts especially in fallocate case.
> 
> We should try to get input from Andreas about what some of the more
> interesting heuristics in mballoc were trying to accomplish, since
> there's a lot going on that's not obvious, and one of the reasons why
> I've always been worried about trying to do cleanups was because
> something that looks ugly might be papering over some other dark
> corner of mballoc.c ---- and so I was fairly certain that once we
> started opening up mballoc.c, we'd have to do a lot of work on it, and
> a lot of performance measurements to make sure we didn't accidentally
> introduce some performance regression.

There is actually quite a lengthy description of mballoc at the start
of the file.  I guess it would make sense to turn anything in this
thread into comments for ext4_mb_normalize_request() once verified.

So, below is hopefully a summary of what ext4_mb_normalize_request()
is actually doing.  I've CC'd Alex to correct my mistakes.  I think
the first few cases are commented accurately and are self-explanatory:

* don't prealloc blocks for non-regular files (!EXT4_MB_HINT_DATA)
  - should we reconsider this for larger directories?
* don't use prealloc if caller wants exact (EXT4_MB_HINT_GOAL_ONLY)
  - currently unused, but would be useful for defrag
* don't reserve blocks if caller doesn't want it (EXT4_MB_HINT_NOPREALLOC)
  - used for small files or if requested data fits exactly into extent
* if write is a small file, use group prealloc (EXT4_MB_HINT_GROUP_ALLOC)
  - this combines multiple small writes into a single prealloc region
    and avoids read-modify-write of RAID stripes
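
As a rough illustration of the order of those early checks, here is a
user-space sketch.  It is not the mballoc code: the bit values are
placeholders, and sketch_classify() and the result names are invented
for this example.  Only the order of the four tests follows the list
above.

```c
#include <assert.h>

/* Placeholder bit values for this sketch; not copied from the kernel headers. */
enum sketch_hint {
    SKETCH_HINT_DATA        = 1u << 0,  /* request is for regular file data */
    SKETCH_HINT_NOPREALLOC  = 1u << 1,  /* caller does not want preallocation */
    SKETCH_HINT_GROUP_ALLOC = 1u << 2,  /* small file: use group prealloc */
    SKETCH_HINT_GOAL_ONLY   = 1u << 3   /* caller wants exactly the goal blocks */
};

#define SKETCH_HINT_ALL (SKETCH_HINT_DATA | SKETCH_HINT_NOPREALLOC | \
                         SKETCH_HINT_GROUP_ALLOC | SKETCH_HINT_GOAL_ONLY)

/* Hypothetical result type: what the normalization step decides to do. */
enum sketch_action {
    SKETCH_SKIP,            /* leave the request untouched */
    SKETCH_GROUP_PREALLOC,  /* combine with other small writes */
    SKETCH_INODE_PREALLOC   /* continue to the large-file handling */
};

static enum sketch_action sketch_classify(unsigned int flags)
{
    /* The sketch only understands the four bits defined above. */
    assert((flags & ~(unsigned int)SKETCH_HINT_ALL) == 0);

    if (!(flags & SKETCH_HINT_DATA))
        return SKETCH_SKIP;             /* non-regular file */
    if (flags & SKETCH_HINT_GOAL_ONLY)
        return SKETCH_SKIP;             /* exact placement requested */
    if (flags & SKETCH_HINT_NOPREALLOC)
        return SKETCH_SKIP;             /* caller opted out */
    if (flags & SKETCH_HINT_GROUP_ALLOC)
        return SKETCH_GROUP_PREALLOC;   /* small-file path */
    return SKETCH_INODE_PREALLOC;       /* large-file path */
}
```

With these placeholder values, a data request with no other hints falls
through to the large-file path, while adding the goal-only bit makes the
sketch skip normalization even if the group-alloc bit is also set.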

The rest of the function is about handling large file writes efficiently.
* round up small writes to a power-of-two value for better alignment
  - we have a patch that makes the preallocation region sizes tunable,
    if that is something of interest.  That said, we don't really use it.
* if the request is large, align it to a power-of-two boundary
  - the allocation goal is based on the logical file offset, so that if
    a file is written sparsely by multiple threads, it can coalesce into
    a densely packed file in the end.  This is common for HPC jobs, or
    applications like bittorrent.
* the list_for_each() loops align the prealloc region with other regions
  - this helps ensure that once the file becomes fully allocated, the
    regions are contiguous on disk
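
To make the rounding and alignment idea concrete, here is a small
user-space sketch working in bytes.  It is an assumption-laden
illustration, not what ext4_mb_normalize_request() does line by line:
the 1 MiB threshold, the struct, and all sketch_* names are invented,
and extending the window to the next aligned boundary is a choice made
for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Invented threshold separating "small" from "large" requests in this sketch. */
#define SKETCH_LARGE_BYTES (UINT64_C(1) << 20)

/* Hypothetical logical range, in bytes from the start of the file. */
struct sketch_range {
    uint64_t start;
    uint64_t len;
};

/* Round v up to the next power of two (v itself if it already is one). */
static uint64_t sketch_roundup_pow2(uint64_t v)
{
    uint64_t p = 1;

    assert(v > 0 && v <= (UINT64_C(1) << 63));
    while (p < v)
        p <<= 1;
    return p;
}

static struct sketch_range sketch_normalize(uint64_t offset, uint64_t len)
{
    struct sketch_range out;
    uint64_t chunk, end;

    /* Bounds keep the arithmetic below free of overflow. */
    assert(len > 0 && len <= (UINT64_C(1) << 40));
    assert(offset <= (UINT64_C(1) << 60));

    chunk = sketch_roundup_pow2(len);
    if (len < SKETCH_LARGE_BYTES) {
        /* Small write: keep the start, round the size up. */
        out.start = offset;
        out.len = chunk;
        return out;
    }

    /*
     * Large write: align both ends to chunk boundaries in logical
     * offset space, so windows chosen independently by different
     * writers tile the file instead of overlapping arbitrarily.
     */
    out.start = offset & ~(chunk - 1);
    end = (offset + len + chunk - 1) & ~(chunk - 1);
    out.len = end - out.start;
    return out;
}
```

For example, under these assumptions two threads writing 1.5 MiB each at
logical offsets 0 and 2 MiB get the windows [0, 2 MiB) and [2 MiB, 4 MiB),
which sit next to each other in the file's logical space.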

I'm pretty sure some of this is not 100% accurate; hopefully Alex can
comment and correct any inconsistencies.

Cheers, Andreas
