Re: fallocate vs ENOSPC

Pádraig Brady <P@xxxxxxxxxxxxxx> · Mon, 28 Nov 2011 00:13:31 +0000

On 11/27/2011 11:43 PM, Dave Chinner wrote:
> On Sat, Nov 26, 2011 at 10:14:55PM -0500, Ted Ts'o wrote:
>> On Fri, Nov 25, 2011 at 05:40:50AM -0500, Christoph Hellwig wrote:
>>> On Fri, Nov 25, 2011 at 10:26:09AM +0000, Pádraig Brady wrote:
>>>> I was wondering about adding fallocate() to cp,
>>>> where one of the benefits would be immediate indication of ENOSPC.
>>>> I'm now wondering though might fallocate() fail to allocate an
>>>> extent with ENOSPC, but there could be fragmented space available to write()?
>>>
>>> fallocate isn't guaranteed to allocate a single extent, or even contiguous
>>> extents, it just allocates the given amount of space, and if the fs isn't
>>> too fragmented and the allocator not braindead it will be sufficiently
>>> contiguous.  Also all Linux implementations may actually still fail a write
>>> later in extreme corner cases when btree splits or other metadata
>>> operations during unwritten extent conversions go over the space limit.
>>
>> While this is true, *usually* fallocate will allocate enough space,
>> but as Christoph has said, you still have to check the error returns
>> for the write(2) and close(2) system calls, and deal appropriately with
>> any errors.
>>
>> The other reason to use fallocate is if you are copying a huge number
>> of files, it's possible you'll get better block allocation layout,
>> depending on the file system, and how insane the writeback code for a
>> particular kernel version might be.  (Some versions of the kernel had
>> writeback algorithms that would write 4MB of one file, then 4MB for
>> another file, then 4MB for yet another file, then 4MB of the first
>> file, etc. --- and some file systems can deal with this kind of write
>> pattern better than others.)
> 
> Right, but....
> 
>> Using fallocate if you know the size of
>> the file up front won't hurt, and on some systems it might help.
> 
> ... this is - as a generalisation - wrong. Up front fallocate() can
> and does hurt performance, even when you know the size of the file
> ahead of time.
> 
> Why? Because it defeats the primary seek-reducing writeback
> optimisation that filesystems have these days: delayed allocation.
> This has been mentioned before in previous threads where you've been
> considering adding fallocate to cp. e.g:
> 
> http://www.mail-archive.com/linux-btrfs@xxxxxxxxxxxxxxx/msg10819.html
> 
> fallocate() style (or non-delalloc, write syscall time) allocation
> leads to non-optimal file layouts and slower writeback because the
> location where blocks are allocated in no way matches the writeback
> pattern, hence causing an increase in seeks during writeback of
> large numbers of files.

I'm interpreting the above to mean that,
in the presence of concurrent writes to multiple files,
fallocate() may cause slower _writes_, due to bypassing the
delalloc write scheduler.
Subsequent reads of the file should be no slower though,
and perhaps faster, due to the greater likelihood of
all the blocks for the file being contiguous.

> Further, filesystems that are alignment aware (e.g. XFS) will align
> every fallocate() based allocation, greatly fragmenting free space
> when used on small files and the filesystem is on a RAID array.
> However, in XFS, delayed allocation will actually pack the
> allocation across files tightly on disk, resulting in full stripe
> writes (even for sub-stripe unit/width files) during writeback.

Interesting. So what are the typical alignments involved?
If you had to, what would you choose as a default min file size
to enable fallocate() for?

> Delayed allocation allows workloads such as cp to run as a bandwidth
> bound operation because allocation is optimised to cause sequential
> write IO, whereas up-front fallocate() causes it to run as an IOPS
> bound operation because file layout does not match the writeback
> pattern. And on large, high performance RAID arrays, bandwidth
> capacity is much, much higher than IOPS capacity, so delayed
> allocation is going to be far faster and have less long term impact
> on the filesystem than using fallocate.

So the consequences are the same as those in the first paragraph?
Though I don't understand the detrimental "long term impact" you mention.

> IOWs, use of fallocate() -by default- will speed filesystem aging
> because it removes the benefits delayed allocation has on both short
> and long term filesystem performance.
> 
> The three major Linux filesystems (XFS, BTRFS and ext4) use delayed
> allocation, and hence do not need fallocate() to be used by
> userspace utilities like cp, tar, etc. to avoid fragmentation. If a
> given filesystem is still prone to fragmentation of data extents
> when copying data via cp or tar, then that is a problem with the
> filesystem that needs to be fixed, not worked around in the
> userspace utilities in a manner that is detrimental to other
> filesystems that don't suffer from those problems...
> 
> Yes, fallocate can help reduce fragmentation and increase
> performance in some situations, so making it an -option- for people
> who know what they are doing is a good idea. However, it should not
> be made the default for all of the reasons above.
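
To make the opt-in idea concrete for myself, here is a minimal sketch of a
copy routine where preallocation is off unless explicitly requested. It is
only an illustration under stated assumptions, not coreutils code:
copy_file(), its preallocate flag and CHUNK are hypothetical names, and
Python's os.posix_fallocate() stands in for fallocate(). Note that glibc may
emulate posix_fallocate() by writing to the file on filesystems without
native support, which fallocate() itself does not do.

```python
import errno
import os

CHUNK = 1024 * 1024  # copy buffer size; an arbitrary choice for this sketch


def copy_file(src, dst, preallocate=False):
    """Copy src to dst, optionally reserving space for dst up front.

    Returns True if space was reserved before copying, False otherwise.
    Any OSError from preallocation (e.g. ENOSPC), write or close propagates.
    """
    reserved = False
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        size = os.fstat(fin.fileno()).st_size
        # Opt-in only: by default the filesystem's delayed allocation is
        # left to decide the layout at writeback time.
        if preallocate and size > 0 and hasattr(os, "posix_fallocate"):
            try:
                # Can fail with ENOSPC here, before any data is written.
                os.posix_fallocate(fout.fileno(), 0, size)
                reserved = True
            except OSError as e:
                # No preallocation support: fall back to a plain copy.
                if e.errno not in (errno.EOPNOTSUPP, errno.EINVAL, errno.ENOSYS):
                    raise
        while True:
            buf = fin.read(CHUNK)
            if not buf:
                break
            # A write can still fail with ENOSPC even after preallocation,
            # so errors here are not swallowed.
            fout.write(buf)
        fout.flush()
        os.fsync(fout.fileno())
    # Leaving the with block closes both files; a failing close raises too.
    return reserved
```

copy_file(src, dst) leaves allocation entirely to the filesystem, while
copy_file(src, dst, preallocate=True) tries to reserve the space first, so
ENOSPC can surface before any data is copied. Either way the errors from
write and close still have to be handled by the caller.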

Thanks for the excellent info,
Pádraig.