On Wed, Feb 06, 2019 at 09:21:14PM -0800, Darrick J. Wong wrote: > On Thu, Feb 07, 2019 at 04:08:10PM +1100, Dave Chinner wrote: > > Hi folks, > > > > I've just finished analysing an IO trace from a application > > generating an extreme filesystem fragmentation problem that started > > with extent size hints and ended with spurious ENOSPC reports due to > > massively fragmented files and free space. While the ENOSPC issue > > looks to have previously been solved, I still wanted to understand > > how the application had so comprehensively defeated extent size > > hints as a method of avoiding file fragmentation. > > > > The key behaviour that I discovered was that specific "append write > > only" files that had extent size hints to prevent fragmentation > > weren't actually write only. The application didn't do a lot of > > writes to the file, but it kept the file open and appended to the > > file (from the traces I have) in chunks of between ~3000 bytes and > > ~160000 bytes. This didn't explain the problem. I did notice that > > the files were opened O_SYNC, however. > > > > I then found was another process that, once every second, opened the > > log file O_RDONLY, read 28 bytes from offset zero, then closed the > > file. Every second. IOWs, between every appending write that would > > allocate an extent size hint worth of space beyond EOF and then > > write a small chunk of it, there were numerous open/read/close > > cycles being done on the same file. > > > > And what do we do on close()? We call xfs_release() and that can > > truncate away blocks beyond EOF. For some reason the close wasn't > > triggering the IDIRTY_RELEASE heuristic that preventd close from > > removing EOF blocks prematurely. Then I realised that O_SYNC writes > > don't leave delayed allocation blocks behind - they are always > > converted in the context of the write. That's why it wasn't > > triggering, and that meant that the open/read/close cycle was > > removing the extent size hint allocation beyond EOF prematurely. > > beyond EOF prematurely. > > <urk> > > > Then it occurred to me that extent size hints don't use delalloc > > either, so they behave the same was as O_SYNC writes in this > > situation. > > > > Oh, and we remove EOF blocks on O_RDONLY file close, too. i.e. we > > modify the file without having write permissions. > > Yikes! > > > I suspect there's more cases like this when combined with repeated > > open/<do_something>/close operations on a file that is being > > written, but the patches address just these ones I just talked > > about. The test script to reproduce them is below. Fragmentation > > reduction results are in the commit descriptions. It's running > > through fstests for a couple of hours now, no issues have been > > noticed yet. > > > > FWIW, I suspect we need to have a good hard think about whether we > > should be trimming EOF blocks on close by default, or whether we > > should only be doing it in very limited situations.... > > > > Comments, thoughts, flames welcome. > > > > -Dave. > > > > > > #!/bin/bash > > # > > # Test 1 > > Can you please turn these into fstests to cause the maintainer maximal > immediate pain^W^W^Wmake everyone pay attention^W^W^W^Westablish a basis > for regression testing and finding whatever other problems we can find > from digging deeper? :) I will, but not today - I only understood the cause well enough to write a prototype reproducer about 4 hours ago. The rest of the time since then has been fixing the issues and running smoke tests. My brain is about fried now.... FWIW, I think the scope of the problem is quite widespread - anything that does open/something/close repeatedly on a file that is being written to with O_DSYNC or O_DIRECT appending writes will kill the post-eof extent size hint allocated space. That's why I suspect we need to think about not trimming by default and trying to enumerating only the cases that need to trim eof blocks. e.g. I closed the O_RDONLY case, but O_RDWR/read/close in a loop will still trigger removal of post EOF extent size hint preallocation and hence severe fragmentation. Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx