Re: Non-blocking socket stuck for multiple seconds on xfs_reclaim_inodes_ag()

Kenton Varda <kenton@xxxxxxxxxxxxxx> · Thu, 20 Dec 2018 20:00:21 -0800

Hi all,

I'm a coworker of Ivan's and wanted to add some background here -- in
particular to answer Dave's question about our workload.

For the purpose of this discussion, we can describe our workload as a
giant, glorified HTTP caching proxy. (We receive HTTP requests in. We
check if we have a cached response. If so, we return it to the client,
otherwise we forward the request on to its "origin" server. When the
origin responds, if the response is cacheable, we save it, and either
way we return it to the client.)
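
A minimal sketch of the request flow described above, for readers who want it spelled out. The in-memory dict, the `Response` tuple shape, and the `fetch_origin` callback are all hypothetical stand-ins chosen for illustration; they are not the actual service code.

```python
from typing import Callable, Dict, Tuple

# Hypothetical response shape: (status code, body, cacheable flag).
Response = Tuple[int, bytes, bool]


def handle_request(
    url: str,
    cache: Dict[str, Response],
    fetch_origin: Callable[[str], Response],
) -> Response:
    """Serve from cache if possible, otherwise forward to the origin."""
    cached = cache.get(url)
    if cached is not None:
        # Cache hit: return the stored response without contacting the origin.
        return cached

    # Cache miss: forward the request to the origin server.
    response = fetch_origin(url)

    # Save the response only if the origin marked it cacheable.
    if response[2]:
        cache[url] = response

    # Either way, the client gets the origin's response.
    return response
```

In this sketch a second request for the same cacheable URL never reaches `fetch_origin`, while an uncacheable URL is forwarded every time.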

Roughly speaking, each HTTP cache entry is stored as a file on disk.
Hence, we have a very large number of files, which are added and
removed frequently. We also rely heavily on the page cache for
performance, rather than some more complicated database scheme.
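
To make the "one file per cache entry" layout concrete, here is a rough sketch under stated assumptions. The SHA-256 naming, the two-character fan-out directory, and the helper names `entry_path`, `store`, `load`, and `evict` are all hypothetical; the real on-disk layout is not described in this thread. Reads go through ordinary file I/O, so repeated reads of a hot entry would be served from the page cache.

```python
import hashlib
import os
import tempfile
from typing import Optional


def entry_path(cache_dir: str, url: str) -> str:
    """Map a URL to a file path (hypothetical layout: <dir>/<2 hex chars>/<digest>)."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return os.path.join(cache_dir, digest[:2], digest)


def store(cache_dir: str, url: str, body: bytes) -> None:
    """Write an entry to a temp file, then rename it into place atomically."""
    path = entry_path(cache_dir, url)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path))
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(body)
        os.replace(tmp, path)  # readers never observe a half-written entry
    except BaseException:
        os.unlink(tmp)  # clean up the temp file if the write or rename failed
        raise


def load(cache_dir: str, url: str) -> Optional[bytes]:
    """Return the cached body, or None on a cache miss."""
    try:
        with open(entry_path(cache_dir, url), "rb") as f:
            return f.read()
    except FileNotFoundError:
        return None


def evict(cache_dir: str, url: str) -> None:
    """Remove an entry; a missing file is not an error."""
    try:
        os.unlink(entry_path(cache_dir, url))
    except FileNotFoundError:
        pass
```

Every `store` and `evict` here creates or destroys an inode, which is why a workload of this shape keeps the filesystem's inode cache busy.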

The HTTP requests we serve almost always come from live end users
interacting with a web site. So, any kind of delay means someone is
sitting and waiting. When delays get up over 15 seconds, we start
hitting timeouts, meaning someone's web site doesn't load at all or
loads "broken". Also note that any particular machine may serve
thousands of requests per second, so blocking one machine may affect
thousands of users.

When XFS blocks direct reclaim, our service pretty much grinds to a
halt on that machine, because everything is trying to allocate memory
all the time. For example, as alluded to by the subject of this thread,
writing to a socket allocates memory, and thus will block waiting for
XFS to write back inodes. What we find really frustrating is that we
almost always have over 100GB of clean page cache that could be
reclaimed immediately, without blocking, yet we end up waiting for the
much-smaller inode cache to be written back to disk.
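
One way to observe this symptom from user space is to time a send on a non-blocking socket, which is normally expected to return almost immediately. The sketch below is only an illustration of that idea; `timed_send` is a hypothetical helper, and this is not a description of how the stalls were actually diagnosed.

```python
import socket
import time
from typing import Tuple


def timed_send(sock: socket.socket, data: bytes) -> Tuple[int, float]:
    """Send on a non-blocking socket and report (bytes sent, seconds elapsed)."""
    start = time.monotonic()
    try:
        sent = sock.send(data)
    except BlockingIOError:
        # The socket buffer is full; a non-blocking send reports this
        # instead of waiting.
        sent = 0
    elapsed = time.monotonic() - start
    return sent, elapsed
```

A caller would put the socket into non-blocking mode with `sock.setblocking(False)` and log any call whose elapsed time exceeds some threshold. A multi-second value here means the call stalled inside the kernel even though the socket itself was non-blocking.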

We really can't accept random multi-second pauses. Our current plan is
to roll out the patch Ivan linked to. But, if you have any other
suggestions, we'd love to hear them. It would be great if we could
agree on an upstream solution, and maybe solve Facebook's problem too.

Hope that helps elucidate things.

Thanks,
-Kenton

On Wed, Dec 19, 2018 at 2:15 PM Ivan Babrou <ivan@xxxxxxxxxxxxxx> wrote:
>
> We're sticking with the following patch that allows runtime switching
> between XFS memory reclaim strategies:
>
> * https://github.com/bobrik/linux/pull/2
>
> There are some tests and graphs describing the issue and how it can be solved.
>
> Let me know if you think this can be incorporated upstream; I'm fine if not.
>
> On Thu, Nov 29, 2018 at 11:45 PM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> >
> > On Fri, Nov 30, 2018 at 05:49:08PM +1100, Dave Chinner wrote:
> > > Seriously: describe your workload in detail for me so I can write a
> > > reproducer for it. Without that I cannot help you any further and I
> > > am just wasting my time asking you to describe the workload over
> > > and over again.
> >
> > FWIW, here's the discussion about the FB issue. Go read it,
> > the first few emails are pretty much the same as this thread so far.
> >
> > https://www.spinics.net/lists/linux-xfs/msg01541.html
> >
> > Cheers,
> >
> > Dave.
> > --
> > Dave Chinner
> > david@xxxxxxxxxxxxx