How stable is ext3fs?

akpm@zip.com.au (Andrew Morton) · Fri, 01 Mar 2002 15:00:55 -0800

Theodore Tso wrote:
> 
> On Fri, Mar 01, 2002 at 02:15:24PM -0800, Andrew Morton wrote:
> >
> > Mail produces "slow-growth" files.  Which means that their blocks
> > are sprinkled all over the disk.   If you're adding a few k per hour to
> > a file, the fs just about never manages to allocate the blocks
> > contiguously.  A while back, I had a six-month-old multi-megabyte
> > mailbox which had precisely *zero* contiguous blocks.  It was 100%
> > fragmented!
> 
> Yeah, we really need to get preallocation working again for ext3,

I have half-a-patch for that.  It takes the preallocation out
of the bitmaps altogether, and puts it into (start_block, nr_blocks)
in the inode instead.  Which has the advantage that prealloc
doesn't stumble over stray already-used blocks.  And the prealloc
window can be grown dynamically, like readahead, to larger values.
It requires no tricky changes to the journalling, and need not
differ from an ext2 implementation.
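
A minimal sketch of the idea, not the actual patch: a toy allocator in
which each inode carries a (start_block, nr_blocks) window that is held
only in memory, so the on-disk bitmap records only blocks actually
written. Every name here (Disk, Inode, alloc_block, drop_prealloc,
MAX_WINDOW) is hypothetical, and doubling the window on each refill is
an assumed growth policy borrowed loosely from readahead.

```python
from dataclasses import dataclass, field

MAX_WINDOW = 64  # hypothetical cap on window growth


class Disk:
    def __init__(self, nr_blocks):
        self.bitmap = [False] * nr_blocks  # True = block allocated on disk
        self.reserved = set()              # in-memory reservations only

    def available(self, block):
        return not self.bitmap[block] and block not in self.reserved

    def find_free_run(self, goal, want):
        """First run of available blocks at or after goal (wrapping)."""
        n = len(self.bitmap)
        for start in list(range(goal, n)) + list(range(goal)):
            if self.available(start):
                length = 1
                while (length < want and start + length < n
                       and self.available(start + length)):
                    length += 1
                return start, length
        raise OSError("no free blocks")


@dataclass
class Inode:
    start_block: int = 0   # first block of the preallocation window
    nr_blocks: int = 0     # blocks left in the window
    window: int = 1        # size to request at the next refill
    blocks: list = field(default_factory=list)


def alloc_block(disk, inode, max_window=MAX_WINDOW):
    if inode.nr_blocks == 0:
        # Window used up: reserve a new run just past the file's last block.
        goal = inode.blocks[-1] + 1 if inode.blocks else 0
        start, length = disk.find_free_run(goal, inode.window)
        disk.reserved.update(range(start, start + length))
        inode.start_block, inode.nr_blocks = start, length
        inode.window = min(inode.window * 2, max_window)
    block = inode.start_block
    inode.start_block += 1
    inode.nr_blocks -= 1
    disk.reserved.discard(block)
    disk.bitmap[block] = True   # only now does the bitmap change
    inode.blocks.append(block)
    return block


def drop_prealloc(disk, inode):
    """Give unused reserved blocks back; the bitmap is never touched."""
    end = inode.start_block + inode.nr_blocks
    disk.reserved.difference_update(range(inode.start_block, end))
    inode.nr_blocks = 0


def extents(blocks):
    """Number of contiguous runs in a file's block list."""
    return sum(1 for i, b in enumerate(blocks)
               if i == 0 or b != blocks[i - 1] + 1)


def simulate(rounds, max_window=MAX_WINDOW):
    """Two slow-growth files, each appended one block at a time in turn."""
    disk, mail, log = Disk(256), Inode(), Inode()
    for _ in range(rounds):
        alloc_block(disk, mail, max_window)
        alloc_block(disk, log, max_window)
    return mail, log, disk
```

Calling simulate(8) and then extents() on each file's block list shows
how much the window helps against simulate(8, max_window=1), which
degenerates to one-block-at-a-time allocation. drop_prealloc() is the
hook for releasing the window, which matters for the small-file case.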

I'll finish that off reasonably soon, I think.  I was for a while
hoping that delayed allocation would suffice to solve the problem.
And indeed it does.  But it's too big for 2.4 - much too big.

> and
> it would be useful if the filesystem could notice the mail case, and
> to not release the preallocated blocks back to the system when the
> file descriptor is closed.

mm.  Allocate-on-flush partially solves this.   Dropping the
preallocation at the right time is absolutely vital for the 
many-small-file workloads.

> > For the above reasons, I partition my machines with all partitions
> > the same size, and keep one free.  For the monthly therapeutic
> > copy-all-files-and-switch-mountpoints speedup.
> >
> > It's all a bit sad, really.
> 
> Well, perhaps it's time that someone rewrote the defragger to work with
> 4k blocks, and so that it doesn't leave your filesystem a smoking heap
> of debris if your system crashes in the middle of the defrag
> operation.   :-)

I have 100%-journalled pagecache-coherent online defrag code
sitting here.  Haven't quite gotten around to designing the
userspace bit yet. :(

`cp -a' does the job.

> I haven't really noticed a major slowdown effect, but that's probably
> because I was used to the speed of using emacs RMAIL, and for large mail
> files, mutt is blazingly fast in comparison, fragmented files or no.
> 
> As always, there's always more work that we could do to make things
> better, and not enough time to do it.  :-)

The algorithm for placing directory inodes is the biggest performance
problem in ext2 and ext3.   I did a truckload of work on that last
year.  I ended up concluding that we need online defrag, which will
enable the placement of directory inodes in the same block group as
their parent.  We're talking a 5x speedup for some common workloads here.
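
A rough sketch of the two placement policies being contrasted. It
assumes a simplified model of the spreading rule (among block groups
with at least the average number of free inodes, take the one with the
most free blocks) and a hypothetical "stay with the parent" alternative.
The Group class, the function names and the one-inode-one-block cost of
a mkdir are illustrative assumptions, not kernel code.

```python
from dataclasses import dataclass


@dataclass
class Group:
    free_inodes: int
    free_blocks: int


def spread_dir_group(groups):
    """Spreading policy: above-average free inodes, then most free blocks."""
    avg = sum(g.free_inodes for g in groups) // len(groups)
    best = None
    for i, g in enumerate(groups):
        if g.free_inodes and g.free_inodes >= avg:
            if best is None or g.free_blocks > groups[best].free_blocks:
                best = i
    if best is None:
        raise OSError("no free inodes")
    return best


def parent_dir_group(groups, parent):
    """Hypothetical alternative: stay in the parent's group if possible."""
    if groups[parent].free_inodes > 0:
        return parent
    return spread_dir_group(groups)


def mkdir(groups, group):
    # Assumed cost of a new directory: one inode and one data block.
    groups[group].free_inodes -= 1
    groups[group].free_blocks -= 1
    return group


def place_subdirs(count, policy):
    """Create `count` subdirectories of a parent living in group 0."""
    groups = [Group(100, 1000) for _ in range(4)]
    if policy == "spread":
        return [mkdir(groups, spread_dir_group(groups)) for _ in range(count)]
    return [mkdir(groups, parent_dir_group(groups, 0)) for _ in range(count)]
```

Comparing place_subdirs(4, "spread") with place_subdirs(4, "parent")
shows the difference: the first scatters sibling directories across
block groups, so walking the tree seeks all over the disk, while the
second keeps them together at the cost of filling groups unevenly.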
