On Tue, Nov 29, 2011 at 02:11:48PM +0000, Pádraig Brady wrote:
> On 11/29/2011 12:24 AM, Dave Chinner wrote:
> > On Mon, Nov 28, 2011 at 08:55:02AM +0000, Pádraig Brady wrote:
> >> On 11/28/2011 05:10 AM, Dave Chinner wrote:
> >>> Quite frankly, if system utilities like cp and tar start to abuse
> >>> fallocate() by default so they can get "upfront ENOSPC detection",
> >>> then I will seriously consider making XFS use delayed allocation for
> >>> fallocate rather than unwritten extents so we don't lose the past 15
> >>> years worth of IO and aging optimisations that delayed allocation
> >>> provides us with....
> >>
> >> For the record I was considering fallocate() for these reasons.
> >>
> >> 1. Improved file layout for subsequent access
> >> 2. Immediate indication of ENOSPC
> >> 3. Efficient writing of NUL portions
> >>
> >> You lucidly detailed issues with 1. which I suppose could be somewhat
> >> mitigated by not fallocating < say 1MB, though I suppose file systems
> >> could be smarter here and not preallocate small chunks (or when
> >> otherwise not appropriate).
> >
> > When you consider that some high end filesystem deployments have
> > alignment characteristics over 50MB (e.g. so each uncompressed 4k
> > resolution video frame is located on a different set of non-overlapping
> > disks), arbitrary "don't fallocate below this amount" heuristics will
> > always have unforeseen failure cases...
>
> So about this alignment policy, I don't understand the issues so I'm
> guessing here.

Which, IMO, is exactly why you shouldn't be using fallocate() by
default. Every filesystem behaves differently, and optimises allocation
differently to be tuned for the filesystem's unique structure and
capability. fallocate() is a big hammer that ensures filesystems cannot
optimise allocation to match observed operational patterns.

> You say delalloc packs files, while fallocate() will align on XFS
> according to the stripe config. Is that assuming that when writing
> lots of files, that they will be more likely to be read together,
> rather than independently.

No, it's assuming that preallocation is used for enabling extremely
high performance, high bandwidth IO. This is what it has been used for
in XFS for the past 10+ years, and so that is what the implementation
in XFS is optimised for (and will continue to be optimised for). In
this environment, even when the file size is smaller than the
alignment unit, we want allocation alignment to be done.

A real world example for you: supporting multiple, concurrent,
realtime file-per-frame uncompressed 2k res video streams (@ ~12.5MB
per frame). Systems doing this sort of work are made from lots of HW
RAID5/6 LUNs (often spread across multiple arrays) that will have a
stripe width of 14MB. XFS will be configured with a stripe unit of
14MB. 4-6 of these LUNs will be striped together to give a stripe
width of 56-84MB from a filesystem perspective. Each file that is
preallocated needs to be aligned to a 14MB stripe unit so that each
frame IO goes to a different RAID LUN. Each frame write can be done as
a full stripe write without a RMW cycle in the back end array, and
each frame read loads all the disks in the LUN evenly. i.e. the load
is distributed evenly, optimally and deterministically across all the
back end storage.
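To make that concrete, the IO path of a frame writer in this kind of
application is essentially just "preallocate, then write". A minimal
sketch follows - file names and the frame size are purely
illustrative, and a real ingest system would also be using O_DIRECT,
aligned buffers and async IO:

	/*
	 * Minimal sketch of the file-per-frame preallocation pattern
	 * described above.  Frame size and path are illustrative only.
	 */
	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <unistd.h>
	#include <err.h>

	#define FRAME_SIZE ((off_t)13107200)	/* ~12.5MB, illustrative */

	static void write_frame(const char *path, const void *frame)
	{
		int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
		if (fd < 0)
			err(1, "open %s", path);

		/*
		 * Preallocate the whole frame up front.  This is the case
		 * the XFS allocator optimises for: the extent comes back
		 * aligned to the stripe unit, so the write below is a full
		 * stripe write with no RMW cycle in the back end array.
		 */
		if (fallocate(fd, 0, 0, FRAME_SIZE) < 0)
			err(1, "fallocate %s", path);

		if (write(fd, frame, FRAME_SIZE) != FRAME_SIZE)
			err(1, "write %s", path);

		close(fd);
	}

That only pays off because the whole pipeline - frame size, stripe
unit, LUN layout - has been designed together; it's not something a
generic tool like cp can assume about the files it copies.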
This is the sort of application that cannot be done effectively
without a lot of filesystem allocator support (indeed, XFS has the
special filestreams allocation policy for this workload), and it's
this sort of high performance application that we optimise
preallocation for.

In short, what XFS is doing here is optimising allocation patterns
for high performance, RAID based storage. If your write pattern
triggers repeated RMW cycles in a RAID array, your write performance
will fall by an order of magnitude or more.

Large files don't need packing because the writeback flusher threads
can do full stripe writes which avoids RMW cycles in the RAID array
if the files are aligned to the underlying RAID stripes. But small
files need tight packing to enable them to be aggregated into full
stripe writes in the elevator and/or RAID controller BBWC. This
aggregation then avoids RMW cycles in the RAID array and hence
writeback performance for both small and large files is similar
(i.e. close to maximum IO bandwidth). If you don't pack small files
tightly (and XFS won't if you use preallocation), then each file
write will cause a RMW cycle in the RAID array and the throughput is
effectively going to be about half the IOPS of a random write
workload....

> That's a big assumption if true. Also the converse is a big
> assumption, that fallocate() should be aligned, as that's more
> likely to be read independently.

You're guessing, making assumptions, etc, all about how one
filesystem works and what the impact of the change is going to be.
What about ext4, or btrfs? They are very different structurally to
XFS, and hence have different sets of issues when you start
preallocating everything. It is not a simple problem: allocation
optimisation is, IMO, the single most difficult and complex area of
filesystems, with many different, non-obvious, filesystem specific
trade-offs to be made....

> > fallocate is for preallocation, not for ENOSPC detection. If you
> > want efficient and effective ENOSPC detection before writing
> > anything, then you really want a space -reservation- extension to
> > fallocate. Filesystems that use delayed allocation already have a
> > space reservation subsystem - it's how they account for space that
> > is reserved by delayed allocation prior to the real allocation
> > being done. IMO, allowing userspace some level of access to those
> > reservations would be more appropriate for early detection of
> > ENOSPC than using preallocation for everything...
>
> Fair enough, so fallocate() would be a superset of reserve(),
> though I'm having a hard time thinking of why one might ever need to
> fallocate() then.

Exactly my point - the number of applications that actually need
-preallocation- for performance reasons is quite small.

I'd suggest that we implement a reservation mechanism as a separate
fallocate() flag, to tell fallocate() to reserve the space over the
given range rather than needing to preallocate it. I'd also suggest
that a reservation is not persistent (e.g. only guaranteed to last
for the life of the file descriptor the reservation was made for).
That would make it simple to implement in memory for all filesystems,
and provide you with the short-term ENOSPC-or-success style
reservation you are looking for...

Does that sound reasonable?
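From userspace that would look something like the sketch below.
FALLOC_FL_RESERVE is just a placeholder name for the proposed flag -
no such flag exists in any kernel today - but it shows the shape of
the usage cp would want: reserve, check for ENOSPC, then do normal
buffered writes that still go through delayed allocation:

	/*
	 * Sketch only: FALLOC_FL_RESERVE is a hypothetical flag name for
	 * the reservation mechanism proposed above; it does not exist,
	 * and the value here is arbitrary, for illustration only.
	 */
	#define _GNU_SOURCE
	#include <fcntl.h>

	#define FALLOC_FL_RESERVE	0x80	/* hypothetical */

	/*
	 * Returns 0 if the destination has space for src_size bytes,
	 * -1 on failure (ENOSPC being the interesting case).
	 */
	static int reserve_dest_space(int dst_fd, off_t src_size)
	{
		/*
		 * The reservation is tied to dst_fd and dropped when it is
		 * closed; nothing is allocated or aligned on disk, so the
		 * subsequent writes are still laid out by delayed
		 * allocation.
		 */
		return fallocate(dst_fd, FALLOC_FL_RESERVE, 0, src_size);
	}

A kernel that didn't know the flag would just fail the call with
EOPNOTSUPP, so cp could fall back to copying without the up-front
check.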
> > As to efficient writing of NULL ranges - that's what sparse files
> > are for - you do not need to write or even preallocate NULL ranges
> > when copying files. Indeed, the most efficient way of dealing with
> > NULL ranges is to punch a hole and let the filesystem deal with
> > it.....
>
> well not for `cp --sparse=never` which might be used
> so that processing of the copy will not result in ENOSPC.
>
> I'm also linking here to a related discussion.
> http://oss.sgi.com/archives/xfs/2011-06/msg00064.html

Right, and from that discussion you can see exactly why delayed
allocation in XFS significantly improves both data and metadata
allocation and IO patterns for operations like tar, cp, rsync, etc
whilst also minimising long term aging effects as compared to
preallocation:

http://oss.sgi.com/archives/xfs/2011-06/msg00092.html

> Note also that the gold linker does fallocate() on output files by
> default.

"He's doing it, so we should do it" is not a very convincing
technical argument.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx