Hi Kenton,

On Thu, Dec 20, 2018 at 08:00:21PM -0800, Kenton Varda wrote:
> When XFS blocks direct reclaim, our service pretty much grinds to a
> halt on that machine, because everything is trying to allocate memory
> all the time. For example, as alluded by the subject of this thread,
> writing to a socket allocates memory, and thus will block waiting for
> XFS to write back inodes. What we find really frustrating is that we
> almost always have over 100GB of clean page cache that could be
> reclaimed immediately, without blocking, yet we end up waiting for the
> much-smaller inode cache to be written back to disk.

Sure, it's frustrating. It frustrates me that I've been trying for
years to get memory reclaim behaviour changed so that we don't have to
do this, but we are still stuck with it.

But taking out your frustrations on the people who are trying to fix
the problems you are seeing isn't productive. We are only a small team
and we can't fix every problem that everyone reports immediately. Some
things take time to fix.

> We really can't accept random multi-second pauses.

And that's the problem here. For the vast majority of XFS users, the
alternative (i.e. not blocking reclaim) leads to substantially lower
performance and a high risk of premature OOM kills. It basically moves
the reclaim blocking problem to a different context (i.e. to journal
space allocation), and that has even worse latency and global
filesystem scope, rather than being confined to the process doing
reclaim.

IOWs, there are relatively few applications that have such a
significant dependency on memory reclaim having extremely low latency,
but there are a lot that depend on memory reclaim being throttled
harshly to keep the filesystem out of even worse breakdown
conditions.....

> Our current plan is
> to roll out the patch Ivan linked to.

Which is perfectly fine by me. I read the link, and it looks like it
works just fine in your environment. In contrast, I ran the same patch
on my performance benchmarks and saw a 15-30% degradation in
performance on my inode cache heavy and mixed inode/page cache heavy
memory pressure benchmarks. IOWs, that change still doesn't work for
XFS in general. (The general shape of that change is sketched further
down.)

This is the beauty of open source software - you can easily tweak it
for your specific workload when such changes aren't really suitable
for the wider user base. I encourage people to make tweaks like this
to optimise their systems for their workloads. However, I also
encourage people to then discuss the problems that led to needing such
tweaks with upstream, so we are aware of the issues and can work
towards either incorporating the tweaks or modifying the
infrastructure to avoid the problem altogether.

Further, there's no need to be combative or get upset when upstream
determines that the tweak isn't generally applicable or is hiding
something deeper that needs fixing. All it means is that the upstream
developers know there is a deeper underlying problem and want to fix
that underlying problem rather than try to hide it for a specific
workload. The fact that we know there's a problem (and that it's not
just one workload it affects) helps us prioritise what we need to fix.

> But, if you have any other
> suggestions, we'd love to hear them. It would be great if we could
> agree on an upstream solution, and maybe solve Facebook's problem too.

I've already mentioned to Shaun the things we are working on to keep
the ball rolling on this.
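For anyone who wants to see concretely what is being discussed: the
class of tweak the FB and CF patches implement comes down to not
having the XFS inode cache shrinker wait on inode writeback. A minimal
sketch of that shape against the xfs_reclaim_inodes_nr() entry point -
this is illustrative only, not the linked patch verbatim, and the
variants differ in exactly how much waiting they remove:

long
xfs_reclaim_inodes_nr(
	struct xfs_mount	*mp,
	int			nr_to_scan)
{
	/* kick background reclaimer and push the AIL */
	xfs_reclaim_work_queue(mp);
	xfs_ail_push_all(mp->m_ail);

	/*
	 * Mainline passes SYNC_TRYLOCK | SYNC_WAIT here, so the shrinker
	 * (whether run from kswapd or from a direct reclaimer such as a
	 * task writing to a socket) waits for inode writeback. The tweak
	 * drops SYNC_WAIT so reclaim never blocks on inode IO; some
	 * variants keep SYNC_WAIT only when current_is_kswapd() is true.
	 */
	return xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK, &nr_to_scan);
}

As the benchmark numbers above indicate, shifting the blocking out of
the shrinker like this is not something we can turn on unconditionally
for every XFS user, which is why the upstream work takes a different
approach.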
e.g. Darrick's pipelined background inode inactivation work:

https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?h=djwong-wtf&id=85baea68d8e87803e6831bda7b5a3773cf0d8820

i.e. as I indicated above, inode reclaim can block on lots more things
than just inode writeback. We can do transactions in inode reclaim, so
we can block on log space, and that might require *thousands* of IOs
to complete before reclaim can make progress. The async kswapd mod
doesn't address this problem at all, because inactivation occurs in
the prune_icache() VFS shrinker context (i.e. ->destroy_inode())
rather than in the XFS inode cache shrinker context that the FB and CF
patches address.

Also, the upcoming inode cache xarray conversion will provide us with
many more per-inode state tags in the xarray, which will allow us to
track and execute multiple different post-VFS-reclaim inode states
directly. This will allow us to efficiently separate inodes that need
transactions and/or IO to be reclaimed from inodes that can be
reclaimed immediately, and it will allow efficient concurrent async
processing of the inodes that need IO to be reclaimed.

IOWs, we're trying to solve *all* the blocking problems that we know
can occur in inode reclaim so that it all just works for everyone
without tweaks being necessary. Yes, this takes longer than just
addressing the specific symptom that is causing you problems, but the
reality is that while fixing things properly takes time to get right,
everyone will benefit from it being fixed, not just one or two very
specific, latency sensitive workloads.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx