On Tue, Nov 29, 2011 at 02:11:48PM +0000, Pádraig Brady wrote:
> On 11/29/2011 12:24 AM, Dave Chinner wrote:
> > On Mon, Nov 28, 2011 at 08:55:02AM +0000, Pádraig Brady wrote:
> >> On 11/28/2011 05:10 AM, Dave Chinner wrote:
> >>> Quite frankly, if system utilities like cp and tar start to abuse
> >>> fallocate() by default so they can get "upfront ENOSPC detection",
> >>> then I will seriously consider making XFS use delayed allocation for
> >>> fallocate rather than unwritten extents so we don't lose the past 15
> >>> years worth of IO and aging optimisations that delayed allocation
> >>> provides us with....
> >>
> >> For the record I was considering fallocate() for these reasons.
> >>
> >> 1. Improved file layout for subsequent access
> >> 2. Immediate indication of ENOSPC
> >> 3. Efficient writing of NUL portions
> >>
> >> You lucidly detailed issues with 1. which I suppose could be somewhat
> >> mitigated by not fallocating < say 1MB, though I suppose file systems
> >> could be smarter here and not preallocate small chunks (or when
> >> otherwise not appropriate).
> >
> > When you consider that some high end filesystem deployments have
> > alignment characteristics over 50MB (e.g. so each uncompressed 4k
> > resolution video frame is located on a different set of non-overlapping
> > disks), arbitrary "don't fallocate below this amount" heuristics will
> > always have unforeseen failure cases...
>
> So about this alignment policy, I don't understand the issues so I'm
> guessing here.

Which, IMO, is exactly why you shouldn't be using fallocate() by
default. Every filesystem behaves differently, and optimises allocation
differently to be tuned for the filesystem's unique structure and
capability. fallocate() is a big hammer that ensures filesystems cannot
optimise allocation to match observed operational patterns.

> You say delalloc packs files, while fallocate() will align on XFS
> according to the stripe config. Is that assuming that when writing
> lots of files, that they will be more likely to be read together,
> rather than independently.

No, it's assuming that preallocation is used for enabling extremely
high performance, high bandwidth IO. This is what it has been used for
in XFS for the past 10+ years, and so that is what the implementation
in XFS is optimised for (and will continue to be optimised for). In
this environment, even when the file size is smaller than the
alignment unit, we want allocation alignment to be done.

A real world example for you: supporting multiple, concurrent,
realtime file-per-frame uncompressed 2k res video streams (@ ~12.5MB
per frame). Systems doing this sort of work are made from lots of HW
RAID5/6 LUNs (often spread across multiple arrays) that will have a
stripe width of 14MB. XFS will be configured with a stripe unit of
14MB. 4-6 of these LUNs will be striped together to give a stripe
width of 56-84MB from a filesystem perspective. Each file that is
preallocated needs to be aligned to a 14MB stripe unit so that each
frame IO goes to a different RAID LUN. Each frame write can be done as
a full stripe write without a RMW cycle in the back end array, and
each frame read loads all the disks in the LUN evenly. i.e. the load
is distributed evenly, optimally and deterministically across all the
back end storage.
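To make that concrete, the IO path of a frame writer in this kind of
application is essentially just "preallocate, then write". A minimal
sketch follows - file names and the frame size are purely
illustrative, and a real ingest system would also be using O_DIRECT,
aligned buffers and async IO:

	/*
	 * Minimal sketch of the file-per-frame preallocation pattern
	 * described above.  Frame size and path are illustrative only.
	 */
	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <unistd.h>
	#include <err.h>

	#define FRAME_SIZE ((off_t)13107200)	/* ~12.5MB, illustrative */

	static void write_frame(const char *path, const void *frame)
	{
		int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
		if (fd < 0)
			err(1, "open %s", path);

		/*
		 * Preallocate the whole frame up front.  This is the case
		 * the XFS allocator optimises for: the extent comes back
		 * aligned to the stripe unit, so the write below is a full
		 * stripe write with no RMW cycle in the back end array.
		 */
		if (fallocate(fd, 0, 0, FRAME_SIZE) < 0)
			err(1, "fallocate %s", path);

		if (write(fd, frame, FRAME_SIZE) != FRAME_SIZE)
			err(1, "write %s", path);

		close(fd);
	}

That only pays off because the whole pipeline - frame size, stripe
unit, LUN layout - has been designed together; it's not something a
generic tool like cp can assume about the files it copies.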
This is the sort of application that cannot be done effectively
without a lot of filesystem allocator support (indeed, XFS has the
special filestreams allocation policy for this workload), and it's
this sort of high performance application that we optimise
preallocation for.

In short, what XFS is doing here is optimising allocation patterns
for high performance, RAID based storage. If your write pattern
triggers repeated RMW cycles in a RAID array, your write performance
will fall by an order of magnitude or more.

Large files don't need packing because the writeback flusher threads
can do full stripe writes which avoids RMW cycles in the RAID array
if the files are aligned to the underlying RAID stripes. But small
files need tight packing to enable them to be aggregated into full
stripe writes in the elevator and/or RAID controller BBWC. This
aggregation then avoids RMW cycles in the RAID array and hence
writeback performance for both small and large files is similar
(i.e. close to maximum IO bandwidth). If you don't pack small files
tightly (and XFS won't if you use preallocation), then each file
write will cause a RMW cycle in the RAID array and the throughput is
effectively going to be about half the IOPS of a random write
workload....

> That's a big assumption if true. Also the converse is a big
> assumption, that fallocate() should be aligned, as that's more
> likely to be read independently.

You're guessing, making assumptions, etc, all about how one
filesystem works and what the impact of the change is going to be.
What about ext4, or btrfs? They are very different structurally to
XFS, and hence have different sets of issues when you start
preallocating everything. It is not a simple problem: allocation
optimisation is, IMO, the single most difficult and complex area of
filesystems, with many different, non-obvious, filesystem specific
trade-offs to be made....

> > fallocate is for preallocation, not for ENOSPC detection. If you
> > want efficient and effective ENOSPC detection before writing
> > anything, then you really want a space -reservation- extension to
> > fallocate. Filesystems that use delayed allocation already have a
> > space reservation subsystem - it's how they account for space that
> > is reserved by delayed allocation prior to the real allocation
> > being done. IMO, allowing userspace some level of access to those
> > reservations would be more appropriate for early detection of
> > ENOSPC than using preallocation for everything...
>
> Fair enough, so fallocate() would be a superset of reserve(),
> though I'm having a hard time thinking of why one might ever need to
> fallocate() then.

Exactly my point - the number of applications that actually need
-preallocation- for performance reasons is quite small.

I'd suggest that we implement a reservation mechanism as a separate
fallocate() flag, to tell fallocate() to reserve the space over the
given range rather than needing to preallocate it. I'd also suggest
that a reservation is not persistent (e.g. only guaranteed to last
for the life of the file descriptor the reservation was made for).
That would make it simple to implement in memory for all filesystems,
and provide you with the short-term ENOSPC-or-success style
reservation you are looking for...

Does that sound reasonable?
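From userspace that would look something like the sketch below.
FALLOC_FL_RESERVE is just a placeholder name for the proposed flag -
no such flag exists in any kernel today - but it shows the shape of
the usage cp would want: reserve, check for ENOSPC, then do normal
buffered writes that still go through delayed allocation:

	/*
	 * Sketch only: FALLOC_FL_RESERVE is a hypothetical flag name for
	 * the reservation mechanism proposed above; it does not exist,
	 * and the value here is arbitrary, for illustration only.
	 */
	#define _GNU_SOURCE
	#include <fcntl.h>

	#define FALLOC_FL_RESERVE	0x80	/* hypothetical */

	/*
	 * Returns 0 if the destination has space for src_size bytes,
	 * -1 on failure (ENOSPC being the interesting case).
	 */
	static int reserve_dest_space(int dst_fd, off_t src_size)
	{
		/*
		 * The reservation is tied to dst_fd and dropped when it is
		 * closed; nothing is allocated or aligned on disk, so the
		 * subsequent writes are still laid out by delayed
		 * allocation.
		 */
		return fallocate(dst_fd, FALLOC_FL_RESERVE, 0, src_size);
	}

A kernel that didn't know the flag would just fail the call with
EOPNOTSUPP, so cp could fall back to copying without the up-front
check.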
> > As to efficient writing of NULL ranges - that's what sparse files
> > are for - you do not need to write or even preallocate NULL ranges
> > when copying files. Indeed, the most efficient way of dealing with
> > NULL ranges is to punch a hole and let the filesystem deal with
> > it.....
>
> well not for `cp --sparse=never` which might be used
> so that processing of the copy will not result in ENOSPC.
>
> I'm also linking here to a related discussion.
> http://oss.sgi.com/archives/xfs/2011-06/msg00064.html

Right, and from that discussion you can see exactly why delayed
allocation in XFS significantly improves both data and metadata
allocation and IO patterns for operations like tar, cp, rsync, etc
whilst also minimising long term aging effects as compared to
preallocation:

http://oss.sgi.com/archives/xfs/2011-06/msg00092.html

> Note also that the gold linker does fallocate() on output files by
> default.

"He's doing it, so we should do it" is not a very convincing
technical argument.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx