On 11/29/2011 11:37 PM, Dave Chinner wrote:
> On Tue, Nov 29, 2011 at 02:11:48PM +0000, Pádraig Brady wrote:
>> On 11/29/2011 12:24 AM, Dave Chinner wrote:
>>> On Mon, Nov 28, 2011 at 08:55:02AM +0000, Pádraig Brady wrote:
>>>> On 11/28/2011 05:10 AM, Dave Chinner wrote:
>>>>> Quite frankly, if system utilities like cp and tar start to abuse
>>>>> fallocate() by default so they can get "upfront ENOSPC detection",
>>>>> then I will seriously consider making XFS use delayed allocation for
>>>>> fallocate rather than unwritten extents, so we don't lose the past 15
>>>>> years worth of IO and aging optimisations that delayed allocation
>>>>> provides us with....
>>>>
>>>> For the record, I was considering fallocate() for these reasons:
>>>>
>>>> 1. Improved file layout for subsequent access
>>>> 2. Immediate indication of ENOSPC
>>>> 3. Efficient writing of NUL portions
>>>>
>>>> You lucidly detailed issues with 1, which I suppose could be somewhat
>>>> mitigated by not fallocating < say 1MB, though I suppose file systems
>>>> could be smarter here and not preallocate small chunks (or when
>>>> otherwise not appropriate).
>>>
>>> When you consider that some high end filesystem deployments have
>>> alignment characteristics over 50MB (e.g. so each uncompressed 4k
>>> resolution video frame is located on a different set of
>>> non-overlapping disks), arbitrary "don't fallocate below this
>>> amount" heuristics will always have unforeseen failure cases...
>>
>> So about this alignment policy: I don't understand the issues, so I'm
>> guessing here.
>
> Which, IMO, is exactly why you shouldn't be using fallocate() by
> default. Every filesystem behaves differently, and optimises
> allocation differently to be tuned for the filesystem's unique
> structure and capability. fallocate() is a big hammer that ensures
> filesystems cannot optimise allocation to match observed operational
> patterns.
>
>> You say delalloc packs files, while fallocate() will align on XFS according to
>> the stripe config. Is that assuming that, when writing lots of files, they
>> will be more likely to be read together, rather than independently?
>
> No, it's assuming that preallocation is used for enabling extremely
> high performance, high bandwidth IO. This is what it has been used
> for in XFS for the past 10+ years, and so that is what the
> implementation in XFS is optimised for (and will continue to be
> optimised for). In this environment, even when the file size is
> smaller than the alignment unit, we want allocation alignment to be
> done.
>
> A real world example for you: supporting multiple, concurrent,
> realtime file-per-frame uncompressed 2k res video streams (@ ~12.5MB
> per frame). Systems doing this sort of work are made from lots of
> HW RAID5/6 LUNs (often spread across multiple arrays) that will have
> a stripe width of 14MB. XFS will be configured with a stripe unit of
> 14MB. 4-6 of these LUNs will be striped together to give a stripe
> width of 56-84MB from a filesystem perspective. Each file that is
> preallocated needs to be aligned to a 14MB stripe unit so that each
> frame IO goes to a different RAID LUN. Each frame write can be done
> as a full stripe write without a RMW cycle in the back end array,
> and each frame read loads all the disks in the LUN evenly. i.e. the
> load is distributed evenly, optimally and deterministically across
> all the back end storage.
>
> This is the sort of application that cannot be done effectively
> without a lot of filesystem allocator support (indeed, XFS has the
> special filestreams allocation policy for this workload), and it's
> this sort of high performance application that we optimise
> preallocation for.
>
> In short, what XFS is doing here is optimising allocation patterns
> for high performance, RAID based storage.
> If your write pattern
> triggers repeated RMW cycles in a RAID array, your write performance
> will fall by an order of magnitude or more. Large files don't need
> packing because the writeback flusher threads can do full stripe
> writes, which avoids RMW cycles in the RAID array if the files are
> aligned to the underlying RAID stripes. But small files need tight
> packing to enable them to be aggregated into full stripe writes in
> the elevator and/or RAID controller BBWC. This aggregation then
> avoids RMW cycles in the RAID array, and hence writeback performance
> for both small and large files is similar (i.e. close to maximum IO
> bandwidth). If you don't pack small files tightly (and XFS won't if
> you use preallocation), then each file write will cause a RMW cycle
> in the RAID array and the throughput is effectively going to be about
> half the IOPS of a random write workload....
>
>> That's a big assumption if true. Also the converse is a big assumption, that
>> fallocate() should be aligned, as that's more likely to be read independently.
>
> You're guessing, making assumptions, etc, all about how one
> filesystem works and what the impact of the change is going to be.
> What about ext4, or btrfs? They are very different structurally to
> XFS, and hence have different sets of issues when you start
> preallocating everything. It is not a simple problem: allocation
> optimisation is, IMO, the single most difficult and complex area of
> filesystems, with many different, non-obvious, filesystem specific
> trade-offs to be made....
>
>>> fallocate is for preallocation, not for ENOSPC detection. If you
>>> want efficient and effective ENOSPC detection before writing
>>> anything, then you really want a space -reservation- extension to
>>> fallocate. Filesystems that use delayed allocation already have a
>>> space reservation subsystem - it's how they account for space that is
>>> reserved by delayed allocation prior to the real allocation being
>>> done.
>>> IMO, allowing userspace some level of access to those
>>> reservations would be more appropriate for early detection of ENOSPC
>>> than using preallocation for everything...
>>
>> Fair enough, so fallocate() would be a superset of reserve(),
>> though I'm having a hard time thinking of why one might ever need to
>> fallocate() then.
>
> Exactly my point - the number of applications that actually need
> -preallocation- for performance reasons is actually quite small.
>
> I'd suggest that we implement a reservation mechanism as a
> separate fallocate() flag, to tell fallocate() to reserve the space
> over the given range rather than needing to preallocate it. I'd also
> suggest that a reservation is not persistent (e.g. only guaranteed
> to last for the life of the file descriptor the reservation was made
> for). That would make it simple to implement in memory for all
> filesystems, and provide you with the short-term ENOSPC-or-success
> style reservation you are looking for...
>
> Does that sound reasonable?

But then posix_fallocate() would always be slow, I think,
requiring one to actually write the NULs.

TBH, it sounds like the best/minimal change is to the uncommon case,
i.e. add an ALIGN flag to fallocate() which specialised apps like
those described above can use.

>>> As to efficient writing of NULL ranges - that's what sparse files
>>> are for - you do not need to write or even preallocate NULL ranges
>>> when copying files. Indeed, the most efficient way of dealing with
>>> NULL ranges is to punch a hole and let the filesystem deal with
>>> it.....
>>
>> Well, not for `cp --sparse=never`, which might be used
>> so that processing of the copy will not result in ENOSPC.
>>
>> I'm also linking here to a related discussion.
>> http://oss.sgi.com/archives/xfs/2011-06/msg00064.html
>
> Right, and from that discussion you can see exactly why delayed
> allocation in XFS significantly improves both data and metadata
> allocation and IO patterns for operations like tar, cp, rsync, etc.,
> whilst also minimising long term aging effects as compared to
> preallocation:
>
> http://oss.sgi.com/archives/xfs/2011-06/msg00092.html
>
>> Note also that the gold linker does fallocate() on output files by default.
>
> "He's doing it, so we should do it" is not a very convincing
> technical argument. Just FYI.

cheers,
Pádraig.