Re: Using Folios for XFS metadata

On Mon, Jan 22, 2024 at 05:34:12AM -0800, Andi Kleen wrote:
> [fixed the subject, not sure what happened there]
> 
> FWIW I'm not sure fail-fast is always the right strategy here,
> in many cases even with some reclaim, compaction may win. Just not if you're
> on a tight budget for the latencies.
> 
> > I stress test and measure XFS metadata performance under sustained
> > memory pressure all the time. This change has not caused any
> > obvious regressions in the short time I've been testing it.
> 
> Did you test for tail latencies?

No, it's an RFC, and I mostly don't care about memory allocation
tail latencies because this is highly unlikely to be a *new issue*
we need to care about.

The fact is that we already do so much memory allocation and high
order memory allocation (e.g. through slub, xlog_kvmalloc(), user
data IO through the page cache, etc) that if there is a long tail
latency problem with high order memory allocation then it will
already be noticeably affecting XFS data and metadata IO
latencies.
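
For context, xlog_kvmalloc() already does the "fail fast on the
contiguous allocation, fall back to vmalloc" dance being discussed
here. From memory the pattern is roughly this - a sketch, not a
verbatim copy, so check the XFS log code for the exact flags:

#include <linux/slab.h>
#include <linux/vmalloc.h>

static inline void *
xlog_kvmalloc(size_t buf_size)
{
	gfp_t	flags = GFP_KERNEL;
	void	*p;

	/*
	 * Don't enter direct reclaim/compaction for the physically
	 * contiguous attempt - fail fast and fall back to vmalloc()
	 * instead of stalling.
	 */
	flags &= ~__GFP_DIRECT_RECLAIM;
	flags |= __GFP_NOWARN | __GFP_NORETRY;

	do {
		p = kmalloc(buf_size, flags);
		if (!p)
			p = vmalloc(buf_size);
	} while (!p);

	return p;
}

i.e. the cost of a failed high order allocation is a vmalloc
fallback, not a reclaim/compaction stall.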

Nobody is reporting problems about excessive long tail latencies
when using XFS, so my care factor about long tail latencies in this
specific memory allocation case is close to zero.

> There are some relatively simple ways to trigger memory fragmentation,
> the standard way is to allocate a very large THP backed file and then
> punch a lot of holes.

Or just run a filesystem with lots of variable sized high order
allocations and varying cached object lifetimes under sustained
memory pressure for a significant period of time....
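
FWIW, if anyone wants a quick-and-dirty fragmentation generator
along the lines Andi describes, something like the below works.
It's an illustrative sketch only - anonymous THP plus MADV_DONTNEED
rather than hole-punching a THP backed file, and the sizes are made
up:

#define _GNU_SOURCE
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t len = 8UL << 30;		/* 8GiB region, size is arbitrary */
	size_t off;
	char *p;

	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return 1;

	madvise(p, len, MADV_HUGEPAGE);	/* ask for THP backing */
	memset(p, 0xa5, len);		/* fault the whole region in */

	/*
	 * Free every second base page. The THPs get split, half the
	 * memory is returned as scattered order-0 pages, and the
	 * buddy allocator is left with almost no contiguous high
	 * order runs.
	 */
	for (off = 0; off < len; off += 2 * 4096)
		madvise(p + off, 4096, MADV_DONTNEED);

	pause();			/* hold the fragmented state */
	return 0;
}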

> > > I would in any case add a tunable for it in case people run into this.
> > 
> > No tunables. It either works or it doesn't. If we can't make
> > it work reliably by default, we throw it in the dumpster, light it
> > on fire and walk away.
> 
> I'm not sure there is a single definition of "reliably" here -- for
> many workloads tail latencies don't matter, so it's always reliable,
> as long as you have good aggregate throughput.
> 
> Others have very high expectations for them.
> 
> Forcing the high expectations on everyone is probably not a good
> general strategy though, as there are general trade-offs.

Yup, and we have to make those tradeoffs because filesystems need to
be good at "general purpose" workloads as their primary focus.
Minimising long tail latencies is really just "best effort only"
because we are intimately aware that there are global resource
limitations that cause long tail latencies in filesystem
implementations that simply cannot be worked around.

> I could see that having lots of small tunables for every use might not be a
> good idea. Perhaps there would be a case for a single general tunable
> that controls higher order folios for everyone.

You're not listening: no tunables. Code that has tunables because
you think it *may not work* is code that is not ready to be merged.

> > > Tail latencies are a common concern on many IO workloads.
> > 
> > Yes, for user data operations it's a common concern. For metadata,
> > not so much - there's so many far worse long tail latencies in
> > metadata operations (like waiting for journal space) that memory
> > allocation latencies in the metadata IO path are largely noise....
> 
> I've seen pretty long stalls in the past.
> 
> The difference to the journal is also that it is local to the file system, while
> the memory is normally shared with everyone on the node or system. So the
> scope of noisy neighbour impact can be quite different, especially on a large
> machine. 

Most systems run everything on a single filesystem. That makes them
just as global as memory allocation.  If the journal bottlenecks,
everyone on the system suffers the performance degradation, not just
the user who caused it.

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx



