Re: ext4: first write to large ext3 filesystem takes 96 seconds

Benjamin LaHaise <bcrl@xxxxxxxxx> · Wed, 30 Jul 2014 10:49:28 -0400

Hi Andreas, Ted,

I've finally had some more time to dig into this problem, and it's worse 
than I initially thought in that it occurs on normal ext4 filesystems.

On Mon, Jul 07, 2014 at 11:11:58PM -0600, Andreas Dilger wrote:
...
> The main problem here is that reading all of the block bitmaps takes
> a huge amount of time for a large filesystem.

Very true.

...
> 
> 7.8TB / 128MB/group ~= 8000 groups
> 8000 bitmaps / 100 seeks/sec = 80s
> 
> So that is what is making things slow. Once the allocator has all the
> blocks in memory there are no problems. There are some heuristics
> to skip bitmaps that are totally full, but they don't work in your case. 
> 
> This is why the flex_bg feature was created - to allow the bitmaps
> to be read from disk without seeks.  This also speeds up e2fsck by
> the same 96s that would otherwise be wasted waiting for the disk.

Unfortunately, that isn't the case.

> Backporting flex_bg to ext3 would be fairly trivial - just disable the checks
> for the location of the bitmaps at mount time. However, using it
> requires that you reformat your filesystem with "-O flex_bg" to
> get the improved layout. 

flex_bg is not sufficient to resolve this issue.  Using a native ext4 
formatted filesystem initialized with mke4fs 1.41.12, this problem still 
occurs.  I created a 7.1TB filesystem and filled it to about 92% full with 
8MB files.  The time to create a new 8MB file after a fresh mount ranges 
from 0.017 seconds to 13.2 seconds.  The outliers correlate with bitmaps 
being read from disk.  A copy of /proc/fs/ext4/dm-2/mb_groups from this 
92% full fs is available at http://www.kvack.org/~bcrl/mb_groups.ext4-92 

Note that it isn't necessarily the first allocating write to the filesystem 
that is the worst in terms of timing; it can end up being the 10th or even 
the 100th attempt.
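
For anyone wanting to reproduce the timings, here is a minimal sketch of how 
per-file creation latency could be collected.  It is not the script behind 
the numbers above; the function names, the "probe-NNNN" file names and the 
temporary directory standing in for the freshly mounted filesystem are all 
assumptions made for illustration.  The fsync() is there so that delayed 
allocation does not hide the cost of block allocation.

```python
import os
import tempfile
import time

FILE_SIZE = 8 * 1024 * 1024  # 8MB, the file size used in the test above


def time_file_creation(directory, name, size=FILE_SIZE):
    """Create one file of `size` bytes in `directory`; return elapsed seconds."""
    path = os.path.join(directory, name)
    payload = b"\0" * size
    start = time.monotonic()
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
    try:
        view = memoryview(payload)
        while view:
            # os.write() may write less than requested, so loop until done.
            written = os.write(fd, view)
            view = view[written:]
        # Force allocation and writeback before stopping the clock.
        os.fsync(fd)
    finally:
        os.close(fd)
    return time.monotonic() - start


def time_many(directory, count, size=FILE_SIZE):
    """Create `count` files in sequence and return the list of timings."""
    return [
        time_file_creation(directory, "probe-%04d" % i, size)
        for i in range(count)
    ]


if __name__ == "__main__":
    # Hypothetical usage: point `target` at the freshly mounted filesystem.
    with tempfile.TemporaryDirectory() as target:
        timings = time_many(target, 5)
        worst = max(range(len(timings)), key=timings.__getitem__)
        print("worst: attempt %d, %.3f s" % (worst + 1, timings[worst]))
```

Recording every attempt rather than only the first matters here, since the 
slowest create is not guaranteed to be the first one after mount.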

> The other option (if your runtime environment allows it) is to prefetch
> the block bitmaps using "dumpe2fs /dev/XXX > /dev/null" before the
> filesystem is in use. This still takes 90s, but can be started early in
> the boot process on each disk in parallel.

That isn't a solution.  Prefetching is impossible in my particular use-case, 
as the filesystem is being mounted after a failover from another node -- 
any data prefetched prior to switching active nodes is not guaranteed to be 
valid.

This seems like a pretty serious regression relative to ext3.  Why can't 
ext4's mballoc pick better block groups to attempt allocating from based 
on the free block counts in the block group summaries?
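
To illustrate the kind of selection I have in mind, here is a rough sketch 
in Python.  It is not how mballoc works today and not a proposed patch; the 
function name, the goal_group parameter and the distance-based ordering are 
assumptions for illustration only.  The point is that the per-group free 
block counts are already in memory after mount, so groups that cannot 
possibly satisfy the request can be skipped without reading their bitmaps.

```python
def candidate_groups(free_counts, blocks_needed, goal_group=0):
    """Return group numbers worth loading a bitmap for, nearest to goal first.

    free_counts[i] is the free block count recorded for group i in its
    group descriptor, so no bitmap has to be read to build this list.
    """
    ngroups = len(free_counts)
    # Skip any group whose summary says it cannot hold the request at all.
    candidates = [
        g for g, free in enumerate(free_counts) if free >= blocks_needed
    ]
    # Prefer groups at or after the goal group, wrapping around the end,
    # as a simple stand-in for locality.
    candidates.sort(key=lambda g: (g - goal_group) % ngroups)
    return candidates


if __name__ == "__main__":
    # Hypothetical free counts for seven groups; 2048 blocks is 8MB at 4KB.
    free = [0, 12, 32768, 5, 4096, 0, 2048]
    print(candidate_groups(free, 2048, goal_group=3))  # prints [4, 6, 2]
```

A free count says nothing about fragmentation, so a group that passes this 
filter may still lack a large enough contiguous extent once its bitmap is 
loaded.  Even so, the filter alone would avoid reading bitmaps for groups 
that are full or nearly full.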

		-ben
-- 
"Thought is the essence of where you are now."