Re: fallocate vs ENOSPC

Dave Chinner <david@xxxxxxxxxxxxx> · Mon, 28 Nov 2011 14:51:27 +1100

On Mon, Nov 28, 2011 at 12:13:31AM +0000, Pádraig Brady wrote:
> On 11/27/2011 11:43 PM, Dave Chinner wrote:
> > On Sat, Nov 26, 2011 at 10:14:55PM -0500, Ted Ts'o wrote:
> >> On Fri, Nov 25, 2011 at 05:40:50AM -0500, Christoph Hellwig wrote:
> >>> On Fri, Nov 25, 2011 at 10:26:09AM +0000, Pádraig Brady wrote:
> >>>> I was wondering about adding fallocate() to cp,
> >>>> where one of the benefits would be immediate indication of ENOSPC.
> >>>> I'm now wondering though might fallocate() fail to allocate an
> >>>> extent with ENOSPC, but there could be fragmented space available to write()?
> >>>
> >>> fallocate isn't guaranteed to allocate a single extent or even
> >>> contiguous extents, it just allocates the given amount of space, and if
> >>> the fs isn't too fragmented and the allocator not braindead it will be
> >>> sufficiently contiguous.  Also all Linux implementations may actually
> >>> still fail a write later in extreme corner cases when btree splits or
> >>> other metadata operations during unwritten extent conversions go over
> >>> the space limit.
> >>
> >> While this is true, *usually* fallocate will allocate enough space,
> >> but as Christoph has said, you still have to check the error returns
> >> for the write(2) and close(2) system call, and deal appropriately with
> >> any errors.
> >>
> >> The other reason to use fallocate is if you are copying a huge number
> >> of files, it's possible you'll get better block allocation layout,
> >> depending on the file system, and how insane the writeback code for a
> >> particular kernel version might be.  (Some versions of the kernel had
> >> writeback algorithms that would write 4MB of one file, then 4MB for
> >> another file, then 4MB for yet another file, then 4MB of the first
> >> file, etc. --- and some file systems can deal with this kind of write
> >> pattern better than others.)
> > 
> > Right, but....
> > 
> >> Using fallocate if you know the size of
> >> the file up front won't hurt, and on some systems it might help.
> > 
> > ... this is - as a generalisation - wrong. Up front fallocate() can
> > and does hurt performance, even when you know the size of the file
> > ahead of time.
> > 
> > Why? Because it defeats the primary, seek reducing writeback
> > optimisation that filesystems have these days: delayed allocation.
> > This has been mentioned before in previous threads where you've been
> > considering adding fallocate to cp. e.g:
> > 
> > http://www.mail-archive.com/linux-btrfs@xxxxxxxxxxxxxxx/msg10819.html
> > 
> > fallocate() style (or non-delalloc, write syscall time) allocation
> > leads to non-optimal file layouts and slower writeback because the
> > location that blocks are allocated in no way matches the writeback
> > pattern, hence causing an increase in seeks during writeback of
> > large numbers of files.
> 
> I'm interpreting the above to mean that,
> in the presence of concurrent writes to multiple files,
> fallocate() may cause slower _writes_, due to bypassing the
> delalloc write scheduler.

It's not even concurrent writes. A single process writing multiple
files into cache serially does not necessarily result in writeback
30s later writing the data to disk in the same order.

> Subsequent reads of the file should be no slower though,
> and perhaps faster, due to the greater likelihood of
> all the blocks for the file being contiguous.

If delayed allocation does its job, the files will be contiguous
and unfragmented and no slower to read.

> > Further, filesystems that are alignment aware (e.g. XFS) will align
> > every fallocate() based allocation, greatly fragmenting free space
> > when used on small files and the filesystem is on a RAID array.
> > However, in XFS, delayed allocation will actually pack the
> > allocation across files tightly on disk, resulting in full stripe
> > writes (even for sub-stripe unit/width files) during writeback.
> 
> Interesting. So what are the typical alignments involved.

Typical alignments range anywhere from 16k through to 16MB or
larger.

Consider this - a 1TB filesystem with a 1MB alignment unit
(stripe unit in the case of XFS) doing one aligned 16k allocation per
file will run out of aligned allocation slots after ~1,000,000 files
have been created. At that point, the largest contiguous free space
in the filesystem is under 1MB. When you then want to create that
multi-GB file, it's going to have lots of extents rather than
1-2 because the preallocation has spread the small file data all
over the place.

If you used delayed allocation, the small file data will be packed
close together without alignment, leaving large, multi-GB free space
extents for the large file you then want to create....
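
The arithmetic behind that comparison can be sketched as below. This is
a back-of-envelope model only: it ignores metadata, allocation groups
and everything else a real filesystem does, it uses binary units, and
the 4GB size chosen for the "multi-GB file" is an assumption made up
for illustration.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB
TIB = 1024 * GIB

fs_size = 1 * TIB        # filesystem size from the example above
stripe_unit = 1 * MIB    # alignment unit from the example above
small_file = 16 * KIB    # size of each preallocated small file
big_file = 4 * GIB       # ASSUMPTION: stand-in for "that multi-GB file"

# Aligned preallocation: one small file at the start of each stripe unit.
slots = fs_size // stripe_unit
largest_gap_aligned = stripe_unit - small_file
# Ceiling division: minimum extents needed if no gap exceeds that size.
extents_needed = -(-big_file // largest_gap_aligned)

# Delayed allocation: the same files packed back to back, no alignment.
packed_used = slots * small_file
largest_gap_packed = fs_size - packed_used

print("aligned slots:            ", slots)
print("largest free gap, aligned:", largest_gap_aligned // KIB, "KiB")
print("min extents for big file: ", extents_needed)
print("largest free gap, packed: ", largest_gap_packed // GIB, "GiB")
```

In this model the aligned case leaves no free extent larger than 1008k,
so the large file needs thousands of extents, while the packed case
leaves roughly 1008GB in one piece.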

> If you had to, what would you choose as a default min file size
> to enable fallocate() for?

I would not enable fallocate by default at all.
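
For illustration only, here is a minimal sketch of what "off by default"
could look like in a copy loop. It is not how cp is implemented; the
copy_file() function and its preallocate and chunk_size parameters are
hypothetical names. os.posix_fallocate() is the Python wrapper around
posix_fallocate(3). Note that the write and close errors are still
checked whether or not preallocation succeeded, as pointed out earlier
in the thread.

```python
import os


def copy_file(src, dst, preallocate=False, chunk_size=1 << 20):
    """Copy src to dst and return the number of bytes written.

    preallocate is off by default. When enabled it only gives an early
    ENOSPC indication; later writes can still fail.
    """
    with open(src, "rb") as fin:
        size = os.fstat(fin.fileno()).st_size
        fd = os.open(dst, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        try:
            if preallocate and size > 0:
                # Raises OSError (e.g. ENOSPC) if space cannot be reserved.
                os.posix_fallocate(fd, 0, size)
            written = 0
            while True:
                chunk = fin.read(chunk_size)
                if not chunk:
                    break
                view = memoryview(chunk)
                while view:
                    # os.write() raises OSError on failure and may write
                    # fewer bytes than requested, so loop until done.
                    n = os.write(fd, view)
                    view = view[n:]
                    written += n
        except BaseException:
            os.close(fd)
            raise
        # close() can report a deferred write error, so it is not ignored.
        os.close(fd)
    return written
```

A caller that really wants the early ENOSPC check would have to ask for
it explicitly with copy_file(src, dst, preallocate=True).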

> > Delayed allocation allows workloads such as cp to run as a bandwidth
> > bound operation because allocation is optimised to cause sequential
> > write IO, whereas up-front fallocate() causes it to run as an IOPS
> > bound operation because file layout does not match the writeback
> > pattern. And on large, high performance RAID arrays, bandwidth
> > capacity is much, much higher than IOPS capacity, so delayed
> > allocation is going to be far faster and have less long term impact
> > on the filesystem than using fallocate.
> 
> So the consequences are the same as those in the first paragraph?
> Though I don't understand the detrimental "long term impact" you mention.

Free space fragmentation will result in severe degradation of
performance as soon as all free spaces larger than the alignment
size are partially consumed. From then on, any large allocation
will be fragmented. i.e. small aligned preallocations accelerate
filesystem aging effects by increasing free space fragmentation.
This typically won't be noticed for months, until the fragmentation
starts causing noticeable performance issues - at which point it
will be difficult, if not impossible, to correct without a
backup/mkfs/restore cycle....

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx