On Tue, Aug 05, 2014 at 01:21:23PM -0400, Theodore Ts'o wrote:
> On Tue, Aug 05, 2014 at 10:17:17PM +1000, Dave Chinner wrote:
> > IOWs, the longer term plan is to move all this stuff to async
> > workqueue processing and so be able to defer and batch unlink and
> > reclaim work more efficiently:
> >
> > http://xfs.org/index.php/Improving_inode_Caching#Inode_Unlink
>
> I discussed doing this for ext4 a while back (because on a very busy
> machine, unlink latency can be quite large). I got pushback because
> people were concerned that if a very large directory is getting
> deleted --- say, you're cleaning up the directory belonging to a
> (for example, Docker / Borg / Omega) job that has been shut down, so
> the equivalent of an "rm -rf" of several hundred files comprising
> tens or hundreds of megabytes or gigabytes --- the fact that all of
> the unlinks have returned without the space being available could
> confuse a number of programs. And it's not just "df": if the user is
> over quota, they still aren't allowed to write for seconds or
> minutes, because the block release isn't taking place except in a
> workqueue that could potentially get deferred for a non-trivial
> amount of time.

I'm not concerned about space usage in general - XFS already does
things that cause "unexpected" space usage issues (e.g. dynamic
speculative preallocation) and so we've demonstrated how to deal with
such issues. That is, make the radical change of behaviour and then
temper the new behaviour such that it doesn't affect users and
applications adversely whilst still maintaining the benefits the
change was intended to provide.

The reality is that nobody really even notices dynamic speculative
prealloc anymore, because we've refined it to only have short-term
impact on space usage, and we have triggers to reduce, turn off
and/or reclaim speculative prealloc if free space is low or the
workload is adversely affected by it.
The short-term differences in df, du and space usage just don't
matter....

Background unlink is no different. If large amounts of space freeing
are deferred, we can kick the queue to run sooner than its default
period. We can account space in deferred inactivations as "delayed
freeing" for statfs() and so hide such behaviour from userspace
completely. If the user hits EDQUOT, we can run a scan to truncate
any inodes belonging to that quota id that are in reclaim state. Same
for ENOSPC.

> I could imagine recruiting the process that tries to do a block
> allocation that would otherwise have failed with ENOSPC or EDQUOT
> to help complete the deallocation of inodes and so release disk
> space, but then we're moving the latency variability from the
> unlink() call to an otherwise innocent production job that is
> trying to do file writes. So the user visibility is more than just
> the df statistics; it's also some file writes either failing or
> suffering increased latency until the blocks can be reclaimed.

Sure, but we already do that to free up delayed allocation metadata
reservations (i.e. run fs-wide writeback) and to free speculative
preallocation (the eofblocks scan) before we fail the write with
ENOSPC/EDQUOT. It's a rare slow path that already has extremely
variable (and long!) latencies, so the overhead of adding more
inode/space reclaim work does not really change anything fundamental.

There's a simple principle of system design: if a latency-sensitive
application is running anywhere near slow paths, then the design is
fundamentally wrong. i.e. if someone chooses to run a
latency-sensitive app at EDQUOT/ENOSPC then that's not a problem we
can solve at the XFS design level, and as such we've never tried to
solve it. Indeed, best practice says you shouldn't run such
applications on XFS filesystems more than 85% full.....

> Have the XFS developers considered these sorts of concerns, and are
> there any solutions to these issues that you've contemplated?
I think it is obvious we've been walking this path for quite some
time now. ;)

The fundamental observation is that the vast majority of XFS
filesystem operation occurs when there is ample free space. Trading
off increased slow-path latency variation for increased fast-path
bulk throughput rates and improved resilience against filesystem
aging artifacts is exactly the right thing to be doing for the vast
majority of XFS users. We have always done this in XFS, especially
w.r.t. improving scalability.

Fact is, I'm quite happy to be flamed by people who think that such
behavioural changes are the work of the $ANTIDEITY. If we are not
making some people unhappy, then we are not pushing the boundaries of
what is possible enough. The key is to listen to why people are
unhappy about the change and then address those concerns without
compromising the benefits of the original change.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx