On Tue, Aug 05, 2014 at 01:21:23PM -0400, Theodore Ts'o wrote:
> On Tue, Aug 05, 2014 at 10:17:17PM +1000, Dave Chinner wrote:
> > IOWs, the longer term plan is to move all this stuff to async
> > workqueue processing and so be able to defer and batch unlink and
> > reclaim work more efficiently:
> >
> > http://xfs.org/index.php/Improving_inode_Caching#Inode_Unlink
>
> I discussed doing this for ext4 a while back (because on a very busy
> machine, unlink latency can be quite large). I got pushback because
> people were concerned that if a very large directory is getting
> deleted --- say, you're cleaning up the directory belonging to a
> (for example, Docker / Borg / Omega) job that has been shut down, so
> the equivalent of an "rm -rf" of several hundred files comprising
> tens or hundreds of megabytes or gigabytes --- the fact that all of
> the unlinks have returned without the space being available could
> confuse a number of programs. And it's not just "df": if the user is
> over quota, they still aren't allowed to write for seconds or
> minutes, because the block release isn't taking place except in a
> workqueue that could potentially get deferred for a non-trivial
> amount of time.

I'm not concerned about space usage in general - XFS already does
things that cause "unexpected" space usage issues (e.g. dynamic
speculative preallocation) and so we've demonstrated how to deal with
such issues. That is, make the radical change of behaviour and then
temper the new behaviour such that it doesn't affect users and
applications adversely whilst still maintaining the benefits the
change was intended to provide.

The reality is that nobody really even notices dynamic speculative
prealloc anymore, because we've refined it to only have short-term
impact on space usage, and we have triggers to reduce, turn off
and/or reclaim speculative prealloc if free space is low or the
workload is adversely affected by it.
The short-term differences in df, du and space usage just don't
matter....

Background unlink is no different. If large amounts of space freeing
are deferred, we can kick the queue to run sooner than its default
period. We can account space in deferred inactivations as "delayed
freeing" for statfs() and so hide such behaviour from userspace
completely. If the user hits EDQUOT, we can run a scan to truncate
any inodes belonging to that quota id that are in reclaim state. Same
for ENOSPC.

> I could imagine recruiting the process that tries to do a block
> allocation that would otherwise have failed with ENOSPC or EDQUOT
> to help complete the deallocation of inodes and so release disk
> space, but then we're moving the latency variability from the
> unlink() call to an otherwise innocent production job that is
> trying to do file writes. So the user visibility is more than just
> the df statistics; it's also some file writes either failing or
> suffering increased latency until the blocks can be reclaimed.

Sure, but we already do that to free up delayed allocation metadata
reservations (i.e. run fs-wide writeback) and to free speculative
preallocation (the eofblocks scan) before we fail the write with
ENOSPC/EDQUOT. It's a rare slow path that already has extremely
variable (and long!) latencies, so the overhead of adding more
inode/space reclaim work does not really change anything fundamental.

There's a simple principle of system design: if a latency-sensitive
application is running anywhere near slow paths, then the design is
fundamentally wrong. i.e. if someone chooses to run a
latency-sensitive app at EDQUOT/ENOSPC then that's not a problem we
can solve at the XFS design level, and as such we've never tried to
solve it. Indeed, best practice says you shouldn't run such
applications on XFS filesystems more than 85% full.....

> Have the XFS developers considered these sorts of concerns, and are
> there any solutions to these issues that you've contemplated?
I think it is obvious we've been walking this path for quite some
time now. ;)

The fundamental observation is that the vast majority of XFS
filesystem operation occurs when there is ample free space. Trading
off increased slow-path latency variation for increased fast-path
bulk throughput rates and improved resilience against filesystem
aging artifacts is exactly the right thing to be doing for the vast
majority of XFS users. We have always done this in XFS, especially
w.r.t. improving scalability.

Fact is, I'm quite happy to be flamed by people who think that such
behavioural changes are the work of the $ANTIDEITY. If we are not
making some people unhappy, then we are not pushing the boundaries of
what is possible enough. The key is to listen to why people are
unhappy about the change and then address those concerns without
compromising the benefits of the original change.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx