On 11/29/2011 11:37 PM, Dave Chinner wrote:
> On Tue, Nov 29, 2011 at 02:11:48PM +0000, Pádraig Brady wrote:
>> On 11/29/2011 12:24 AM, Dave Chinner wrote:
>>> On Mon, Nov 28, 2011 at 08:55:02AM +0000, Pádraig Brady wrote:
>>>> On 11/28/2011 05:10 AM, Dave Chinner wrote:
>>>>> Quite frankly, if system utilities like cp and tar start to abuse
>>>>> fallocate() by default so they can get "upfront ENOSPC detection",
>>>>> then I will seriously consider making XFS use delayed allocation for
>>>>> fallocate rather than unwritten extents, so we don't lose the past 15
>>>>> years worth of IO and aging optimisations that delayed allocation
>>>>> provides us with....
>>>>
>>>> For the record, I was considering fallocate() for these reasons:
>>>>
>>>> 1. Improved file layout for subsequent access
>>>> 2. Immediate indication of ENOSPC
>>>> 3. Efficient writing of NUL portions
>>>>
>>>> You lucidly detailed issues with 1, which I suppose could be somewhat
>>>> mitigated by not fallocating < say 1MB, though I suppose file systems
>>>> could be smarter here and not preallocate small chunks (or when
>>>> otherwise not appropriate).
>>>
>>> When you consider that some high end filesystem deployments have
>>> alignment characteristics over 50MB (e.g. so each uncompressed 4k
>>> resolution video frame is located on a different set of
>>> non-overlapping disks), arbitrary "don't fallocate below this
>>> amount" heuristics will always have unforeseen failure cases...
>>
>> So about this alignment policy: I don't understand the issues, so I'm
>> guessing here.
>
> Which, IMO, is exactly why you shouldn't be using fallocate() by
> default. Every filesystem behaves differently, and optimises
> allocation differently to be tuned for the filesystem's unique
> structure and capability. fallocate() is a big hammer that ensures
> filesystems cannot optimise allocation to match observed operational
> patterns.
>
>> You say delalloc packs files, while fallocate() will align on XFS according to
>> the stripe config. Is that assuming that, when writing lots of files, they
>> will be more likely to be read together, rather than independently?
>
> No, it's assuming that preallocation is used for enabling extremely
> high performance, high bandwidth IO. This is what it has been used
> for in XFS for the past 10+ years, and so that is what the
> implementation in XFS is optimised for (and will continue to be
> optimised for). In this environment, even when the file size is
> smaller than the alignment unit, we want allocation alignment to be
> done.
>
> A real world example for you: supporting multiple, concurrent,
> realtime file-per-frame uncompressed 2k res video streams (@ ~12.5MB
> per frame). Systems doing this sort of work are made from lots of
> HW RAID5/6 LUNs (often spread across multiple arrays) that will have
> a stripe width of 14MB. XFS will be configured with a stripe unit of
> 14MB. 4-6 of these LUNs will be striped together to give a stripe
> width of 56-84MB from a filesystem perspective. Each file that is
> preallocated needs to be aligned to a 14MB stripe unit so that each
> frame IO goes to a different RAID LUN. Each frame write can be done
> as a full stripe write without a RMW cycle in the back end array,
> and each frame read loads all the disks in the LUN evenly. i.e. the
> load is distributed evenly, optimally and deterministically across
> all the back end storage.
>
> This is the sort of application that cannot be done effectively
> without a lot of filesystem allocator support (indeed, XFS has the
> special filestreams allocation policy for this workload), and it's
> this sort of high performance application that we optimise
> preallocation for.
>
> In short, what XFS is doing here is optimising allocation patterns
> for high performance, RAID based storage.
> If your write pattern
> triggers repeated RMW cycles in a RAID array, your write performance
> will fall by an order of magnitude or more. Large files don't need
> packing because the writeback flusher threads can do full stripe
> writes, which avoids RMW cycles in the RAID array if the files are
> aligned to the underlying RAID stripes. But small files need tight
> packing to enable them to be aggregated into full stripe writes in
> the elevator and/or RAID controller BBWC. This aggregation then
> avoids RMW cycles in the RAID array, and hence writeback performance
> for both small and large files is similar (i.e. close to maximum IO
> bandwidth). If you don't pack small files tightly (and XFS won't if
> you use preallocation), then each file write will cause a RMW cycle
> in the RAID array and the throughput is effectively going to be about
> half the IOPS of a random write workload....
>
>> That's a big assumption if true. Also the converse is a big assumption, that
>> fallocate() should be aligned, as that's more likely to be read independently.
>
> You're guessing, making assumptions, etc, all about how one
> filesystem works and what the impact of the change is going to be.
> What about ext4, or btrfs? They are very different structurally to
> XFS, and hence have different sets of issues when you start
> preallocating everything. It is not a simple problem: allocation
> optimisation is, IMO, the single most difficult and complex area of
> filesystems, with many different, non-obvious, filesystem specific
> trade-offs to be made....
>
>>> fallocate is for preallocation, not for ENOSPC detection. If you
>>> want efficient and effective ENOSPC detection before writing
>>> anything, then you really want a space -reservation- extension to
>>> fallocate. Filesystems that use delayed allocation already have a
>>> space reservation subsystem - it's how they account for space that is
>>> reserved by delayed allocation prior to the real allocation being
>>> done.
>>> IMO, allowing userspace some level of access to those
>>> reservations would be more appropriate for early detection of ENOSPC
>>> than using preallocation for everything...
>>
>> Fair enough, so fallocate() would be a superset of reserve(),
>> though I'm having a hard time thinking of why one might ever need to
>> fallocate() then.
>
> Exactly my point - the number of applications that actually need
> -preallocation- for performance reasons is actually quite small.
>
> I'd suggest that we implement a reservation mechanism as a
> separate fallocate() flag, to tell fallocate() to reserve the space
> over the given range rather than needing to preallocate it. I'd also
> suggest that a reservation is not persistent (e.g. only guaranteed
> to last for the life of the file descriptor the reservation was made
> for). That would make it simple to implement in memory for all
> filesystems, and provide you with the short-term ENOSPC-or-success
> style reservation you are looking for...
>
> Does that sound reasonable?

But then posix_fallocate() would always be slow, I think,
requiring one to actually write the NULs.

TBH, it sounds like the best/minimal change is to the uncommon case,
i.e. add an ALIGN flag to fallocate() which specialised apps like
those described above can use.

>>> As to efficient writing of NULL ranges - that's what sparse files
>>> are for - you do not need to write or even preallocate NULL ranges
>>> when copying files. Indeed, the most efficient way of dealing with
>>> NULL ranges is to punch a hole and let the filesystem deal with
>>> it.....
>>
>> Well, not for `cp --sparse=never`, which might be used
>> so that processing of the copy will not result in ENOSPC.
>>
>> I'm also linking here to a related discussion.
>> http://oss.sgi.com/archives/xfs/2011-06/msg00064.html
>
> Right, and from that discussion you can see exactly why delayed
> allocation in XFS significantly improves both data and metadata
> allocation and IO patterns for operations like tar, cp, rsync, etc.,
> whilst also minimising long term aging effects as compared to
> preallocation:
>
> http://oss.sgi.com/archives/xfs/2011-06/msg00092.html
>
>> Note also that the gold linker does fallocate() on output files by default.
>
> "He's doing it, so we should do it" is not a very convincing
> technical argument. Just FYI.

cheers,
Pádraig.