Re: fallocate vs ENOSPC

Dave Chinner <david@xxxxxxxxxxxxx> · Mon, 28 Nov 2011 16:10:54 +1100

On Sun, Nov 27, 2011 at 07:40:14PM -0500, Theodore Tso wrote:
> 
> On Nov 27, 2011, at 6:43 PM, Dave Chinner wrote:
> 
> > fallocate() style (or non-delalloc, write syscall time) allocation
> > leads to non-optimal file layouts and slower writeback because the
> > location that blocks are allocated in no way matches the writeback
> > pattern, hence causing an increase in seeks during writeback of
> > large numbers of files.
> > 
> > Further, filesystems that are alignment aware (e.g. XFS) will align
> > every fallocate() based allocation, greatly fragmenting free space
> > when used on small files and the filesystem is on a RAID array.
> > However, in XFS, delayed allocation will actually pack the
> > allocation across files tightly on disk, resulting in full stripe
> > writes (even for sub-stripe unit/width files) during write back.
> 
> Well, the question is whether you're optimizing for writing the files,
> or reading the files.    In some cases, files are write once, read never
> (well, almost never) --- i.e., the backup case.  In other cases, the files
> are write once, read many --- i.e., when installing software.

Doesn't matter. If delayed allocation is doing its job properly,
then you'll get unfragmented files when they are written. Delayed
allocation is supposed to make up-front preallocation of disk space
-unnecessary- for preventing fragmentation. Using preallocation instead
of delayed allocation implies your delayed allocation implementation
is sub-optimal and needs to be fixed.

Indeed, there is no guarantee that preallocation will even lay the
files out in a sane manner that will give you good read speeds
across multiple files - it may place them so far apart that the seek
penalty between files is worse than having a few fragments...

> In that case, optimizing for the file reading might mean that you
> want to make the files aligned on RAID stripes, although it will
> fragment free space.   It all depends on what you're optimizing
> for.

If you want to optimise for read speed - especially for small
files or random IO patterns - you want to *avoid* alignment to RAID
stripes. Doing so overloads the first disk in the RAID stripe
because all small file reads (and writes) hit that disk/LUN in the
stripe. Indeed, if you have RAID5/6 and lots of small files, it is
recommended that you turn off filesystem alignment at mkfs time for
XFS.
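
To put some arithmetic behind the hot-spotting, here is a toy model. It
assumes plain RAID-0 style striping and ignores the parity rotation a
real RAID5/6 layout has. The function name, the 64k stripe unit, the 8
data disks and the 16k file size are all made up for the example. It
compares small files that each start on a stripe width boundary with
the same files packed back to back:

```python
from collections import Counter

def disk_for_offset(offset, stripe_unit, ndisks):
    """Index of the data disk holding byte `offset` under plain striping."""
    return (offset // stripe_unit) % ndisks

STRIPE_UNIT = 64 * 1024      # hypothetical stripe unit (bytes)
NDISKS = 8                   # hypothetical number of data disks
STRIPE_WIDTH = STRIPE_UNIT * NDISKS
FILE_SIZE = 16 * 1024        # small file, well below the stripe unit
NFILES = 1024

# Every file starts on a stripe width boundary.
aligned = Counter(
    disk_for_offset(i * STRIPE_WIDTH, STRIPE_UNIT, NDISKS)
    for i in range(NFILES)
)

# Files packed back to back with no alignment at all.
packed = Counter(
    disk_for_offset(i * FILE_SIZE, STRIPE_UNIT, NDISKS)
    for i in range(NFILES)
)

print("aligned:", dict(aligned))
print("packed: ", dict(sorted(packed.items())))
```

In this model every aligned file lands on disk 0, while the packed
files spread evenly over all 8 disks.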

SGI hit this problem back in the early 90s, and it is one of the
reasons that XFS lays its metadata out such that it does not hot-spot
one drive in a RAID stripe trying to read/write frequently accessed
metadata (e.g. AG headers).

> I didn't realize that XFS was not aligning to RAID stripes when doing
> delayed allocation writes.

It certainly does do alignment during delayed allocation.

/me waits for the "but you said..."

That's because XFS does -selective- alignment during delayed
allocation.... :)

What people seem to forget about delayed allocation is that when
delayed allocation occurs, we have lots of information about the
data being written that is not available in the fallocate() context
- how big the delalloc extent is, how large the file currently is,
how much more data needs to be written, whether the file is still
growing, etc, and so delayed allocation can make a much more informed
decision about how to allocate the data extents compared to
fallocate().

For example, if the allocation is for offset zero of the file, the
filesystem is using aligned allocation and the file size is larger
than the stripe unit, the allocation will be stripe unit aligned.

Hence, if you've got lots of small files, they get packed because
aligned allocation is not triggered and each allocation gets peeled
from the front edge of the same free space extent.

If you've got large files, then they get aligned, leaving space
between them for the file to potentially grow and fill full stripe
units and widths.

And if you've got really large files still being written to, they
get aligned and over-allocated thanks to the speculative prealloc
beyond EOF, which effectively prevents fragmentation of large files
due to interleaving allocations between files when many files are
being written concurrently by writeback.....
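
The three cases above can be sketched as a decision function. This is
a toy model of the rules as described in this mail, not the XFS
allocator code: DelallocState, choose_layout and the large_file
threshold are hypothetical names and values, and anything not covered
above (e.g. allocations at a non-zero offset) simply falls through to
"packed" by assumption.

```python
from dataclasses import dataclass

@dataclass
class DelallocState:
    offset: int          # file offset the allocation starts at
    file_size: int       # current file size in bytes
    still_growing: bool  # is the file still being extended?

def choose_layout(state, stripe_unit, fs_aligned=True, large_file=1 << 30):
    """Return 'packed', 'aligned' or 'aligned+prealloc' (toy model).

    large_file is a hypothetical cut-off for "really large" files.
    """
    wants_alignment = (
        fs_aligned
        and state.offset == 0
        and state.file_size > stripe_unit
    )
    if not wants_alignment:
        # Small files (and, by assumption, every case not described in
        # the mail) are peeled off the front of the free space extent.
        return "packed"
    if state.still_growing and state.file_size >= large_file:
        # Aligned, plus speculative preallocation beyond EOF.
        return "aligned+prealloc"
    return "aligned"

SU = 64 * 1024  # hypothetical stripe unit
print(choose_layout(DelallocState(0, 16 * 1024, False), SU))
print(choose_layout(DelallocState(0, 1 << 20, False), SU))
print(choose_layout(DelallocState(0, 2 << 30, True), SU))
```

The point of the sketch is that every input to the decision is per-inode
state known at writeback time; none of it is available when fallocate()
is called up front.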

> I'm curious --- does it do this only when
> there are multiple files outstanding for delayed allocation in an 
> allocation group? 

Irrelevant - the consideration is solely to do with the state of the
current inode the allocation is being done for. If you're only
writing a single file, then it doesn't matter for performance
whether it is aligned or not. But it will matter from a free space
management POV, and hence for how the filesystem ages.

> If someone does a singleton cp of a large file
> without using fallocate, will XFS try to align the write?

The above should hopefully answer that question, especially with
respect to why delayed allocation should not be short-circuited by
using fallocate by default in generic system utilities.

> Also, if we are going to use fallocate() as a way of implicitly signaling
> to the file system that the file should be optimized for reads, as
> opposed to the write, maybe we should explicitly document it as such
> in the fallocate(2) man page, so that  application programmers
> understand that this is the semantics they should expect.

Preallocation is for preventing fragmentation that leads to
performance problems. Use of fallocate() does not imply the file
layout has been optimised for read access and, IMO, never should.

Quite frankly, if system utilities like cp and tar start to abuse
fallocate() by default so they can get "upfront ENOSPC detection",
then I will seriously consider making XFS use delayed allocation for
fallocate rather than unwritten extents so we don't lose the past 15
years worth of IO and aging optimisations that delayed allocation
provides us with....

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx