On Sun, Nov 27, 2011 at 07:40:14PM -0500, Theodore Tso wrote:
>
> On Nov 27, 2011, at 6:43 PM, Dave Chinner wrote:
>
> > fallocate() style (or non-delalloc, write syscall time) allocation
> > leads to non-optimal file layouts and slower writeback because the
> > location that blocks are allocated in no way matches the writeback
> > pattern, hence causing an increase in seeks during writeback of
> > large numbers of files.
> >
> > Further, filesystems that are alignment aware (e.g. XFS) will align
> > every fallocate() based allocation, greatly fragmenting free space
> > when used on small files and the filesystem is on a RAID array.
> > However, in XFS, delayed allocation will actually pack the
> > allocation across files tightly on disk, resulting in full stripe
> > writes (even for sub-stripe unit/width files) during write back.
>
> Well, the question is whether you're optimizing for writing the files,
> or reading the files.  In some cases, files are write once, read never
> (well, almost never) --- i.e., the backup case.  In other cases, the files
> are write once, read many --- i.e., when installing software.

Doesn't matter. If delayed allocation is doing its job properly,
then you'll get unfragmented files when they are written. Delayed
allocation is supposed to make up front preallocation of disk space
-unnecessary- to prevent fragmentation. Using preallocation instead
of delayed allocation implies your delayed allocation implementation
is sub-optimal and needs to be fixed.

Indeed, there is no guarantee that preallocation will even lay the
files out in a sane manner that will give you good read speeds
across multiple files - it may place them so far apart that the seek
penalty between files is worse than having a few fragments...

> In that case, optimizing for the file reading might mean that you
> want to make the files aligned on RAID stripes, although it will
> fragment free space.  It all depends on what you're optimizing
> for.

If you want to optimise for read speed - especially for small files
or random IO patterns - you want to *avoid* alignment to RAID
stripes. Aligning overloads the first disk in the RAID stripe
because all small file reads (and writes) hit that disk/LUN in the
stripe. Indeed, if you have RAID5/6 and lots of small files, it is
recommended that you turn off filesystem alignment at mkfs time for
XFS.

SGI hit this problem back in the early 90s, and it is one of the
reasons that XFS lays its metadata out such that it does not
hot-spot one drive in a RAID stripe trying to read/write frequently
accessed metadata (e.g. AG headers).

> I didn't realize that XFS was not aligning to RAID stripes when doing
> delayed allocation writes.

It certainly does do alignment during delayed allocation.

/me waits for the "but you said..."

That's because XFS does -selective- alignment during delayed
allocation.... :)

What people seem to forget about delayed allocation is that when
delayed allocation occurs, we have lots of information about the
data being written that is not available in the fallocate() context
- how big the delalloc extent is, how large the file currently is,
how much more data needs to be written, whether the file is still
growing, etc. - and so delayed allocation can make a much more
informed decision about how to allocate the data extents compared to
fallocate().

For example, if the allocation is for offset zero of the file, the
filesystem is using aligned allocation and the file size is larger
than the stripe unit, the allocation will be stripe unit aligned.
Hence, if you've got lots of small files, they get packed because
aligned allocation is not triggered and each allocation gets peeled
from the front edge of the same free space extent. If you've got
large files, then they get aligned, leaving space between them for
the file to potentially grow and fill full stripe units and widths.
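To make that rule concrete, here is a rough sketch in Python. It is
not the XFS code: the function name, its parameters and the use of
plain block numbers are all made up for illustration, and it ignores
everything else the allocator takes into account. It only encodes the
three conditions listed above.

```python
def delalloc_start_block(file_offset, file_size, stripe_unit,
                         aligned_alloc, next_free):
    """Pick a start block for a delayed-allocation extent.

    Hypothetical model, all values in filesystem blocks:
      file_offset   - file offset the allocation is for
      file_size     - current size of the file
      stripe_unit   - RAID stripe unit
      aligned_alloc - True if the filesystem uses aligned allocation
      next_free     - front edge of the free space extent being used
    """
    wants_alignment = (
        file_offset == 0
        and aligned_alloc
        and file_size > stripe_unit
    )
    if not wants_alignment:
        # Small files: peel the allocation off the front edge of the
        # free extent so consecutive files pack tightly.
        return next_free
    # Large files: round the start up to the next stripe unit boundary.
    return -(-next_free // stripe_unit) * stripe_unit


# Three small files followed by one large file, stripe unit of 16 blocks.
next_free = 0
for size in (4, 4, 4, 40):
    start = delalloc_start_block(0, size, 16, True, next_free)
    print(f"size={size:3d} start={start:3d}")
    next_free = start + size
```

With these assumed numbers the three 4-block files land at blocks 0,
4 and 8, and the 40-block file starts at block 16, the next stripe
unit boundary.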
And if you've got really large files still being written to, they
get aligned and over-allocated thanks to the speculative prealloc
beyond EOF, which effectively prevents fragmentation of large files
due to interleaving allocations between files when many files are
being written concurrently by writeback.....

> I'm curious --- does it do this only when
> there are multiple files outstanding for delayed allocation in an
> allocation group?

Irrelevant - the consideration is solely to do with the state of the
current inode the allocation is being done for. If you're only
writing a single file, then it doesn't matter for performance
whether it is aligned or not. But it will matter from a freespace
management POV, and hence for how the filesystem ages.

> If someone does a singleton cp of a large file
> without using fallocate, will XFS try to align the write?

The above should hopefully answer that question, especially with
respect to why delayed allocation should not be short-circuited by
using fallocate by default in generic system utilities.

> Also, if we are going to use fallocate() as a way of implicitly signaling
> to the file system that the file should be optimized for reads, as
> opposed to the write, maybe we should explicitly document it as such
> in the fallocate(2) man page, so that application programmers
> understand that this is the semantics they should expect.

Preallocation is for preventing fragmentation that leads to
performance problems. Use of fallocate() does not imply the file
layout has been optimised for read access and, IMO, never should.

Quite frankly, if system utilities like cp and tar start to abuse
fallocate() by default so they can get "upfront ENOSPC detection",
then I will seriously consider making XFS use delayed allocation for
fallocate rather than unwritten extents so we don't lose the past 15
years' worth of IO and aging optimisations that delayed allocation
provides us with....

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx