On Thu, Jun 13, 2024 at 10:28:55AM +0200, Christoph Hellwig wrote: > On Thu, Jun 13, 2024 at 04:03:53PM +1000, Dave Chinner wrote: > > I disagree, there was a very good reason for this behaviour: > > preventing append-only log files from getting excessively fragmented > > because speculative prealloc would get removed on close(). > > Where is that very clear intent documented? Not in the original > commit message (which is very sparse) and no where in any documentation > I can find. We've lost all the internal SGI bug databases, so there's little to know evidence I can point at. But at the time, it was a well known problem amongst Irix XFS engineers that append-only log files would regularly get horribly fragmented. There'd been several escalations over that behaviour over the years w.r.t. large remote servers (think of facilities that "don't trust the logs on client machines because they might be compromised"). In general, the fixes for these applications tended to require the loggin server application to use F_RESVSP to do the append-only log file initialisation. That got XFS_DIFLAG_PREALLOC set on the files, so then anything allocated by appending writes beyond EOF was left alone. That small change was largely sufficient to mitigate worst case log file fragmentation on Irix-XFS. So when adding a flag on disk for Linux-XFS to say "this is an append only file" it made lots of sense to make it behave like XFS_DIFLAG_PREALLOC had already been set on the inode without requring the application to do anything to set that up. I'll note that the patches sent to the list by Ethan Benson to originally implement XFS_DIFLAG_APPEND (and others) is not exactly what was committed in this commit: https://marc.info/?l=linux-xfs&m=106360278223548&w=2 The last version posted on the list was this: https://marc.info/?l=linux-xfs&m=106109662212214&w=2 but the version committed had lots of things renamed, sysctls for sync and nodump inheritance and other bits and pieces including the EOF freeing changes to skip if DIFLAG_APPEND was set. It is clear that there was internal SGI discussion, modification and review of the original proposed patch set, and none of that internal discussion is on open mailing lists. We might have the historical XFS code and Linux mailing list archives, but that doesn't always tell us what institutional knowledge was behind subtle changes to publicly proposed patches like this.... > > i.e. applications that slowly log messages to append only files > > with the pattern open(O_APPEND); write(a single line to the log); > > close(); caused worst case file fragmentation because the close() > > always removed the speculative prealloc beyond EOF. > > That case should be covered by the XFS_IDIRTY_RELEASE, at least > except for O_SYNC workloads. Ah, so I fixed the problem independently 7 or 8 years later to fix Linux NFS server performance issues. Ok, that makes removing the flag less bad, but I still don't see the harm in keeping it there given that behaviour has existed for the past 20 years.... > > The fix for this pessimisitic XFS behaviour is for the application > > to use chattr +A (like they would for ext3/4) hence triggering the > > existence of XFS_DIFLAG_APPEND and that avoided the removal > > speculative delalloc removed when the file is closed. hence the > > fragmentation problems went away. > > For ext4 the EXT4_APPEND_FL flag does not cause any difference > in allocation behavior. Sure, but ext4 doesn't have speculative preallocation beyond EOF to prevent fragmentation, either. > For the historic ext2 driver it apparently > did just, with an XXX comment marking this as a bug, but for ext3 it > also never did looking back quite a bit in history. Ditto - when the filesystem isn't allocating anything beyond EOF, there's little point in trying to removing blocks beyond EOF that can't exist on final close()... -Dave. -- Dave Chinner david@xxxxxxxxxxxxx