On Mon, Jan 22, 2024 at 02:13:23AM -0800, Andi Kleen wrote:
> Dave Chinner <david@xxxxxxxxxxxxx> writes:
> 
> > Thoughts, comments, etc?
> 
> The interesting part is whether it will cause additional tail
> latencies allocating under fragmentation with direct reclaim,
> compaction etc. being triggered before it falls back to the base
> page path.

It's not like I don't know these problems exist with memory
allocation. Go have a look at xlog_kvmalloc() - it is an open-coded
kvmalloc() that allows high-order kmalloc allocations to fail fast
without triggering all the expensive and unnecessary direct reclaim
overhead (e.g. compaction!) because we can fall back to vmalloc
without huge concerns.
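For reference, the pattern is essentially this (a simplified sketch
of what xlog_kvmalloc() does, not the verbatim XFS source - the
helper name here is made up):

#include <linux/slab.h>
#include <linux/vmalloc.h>

/*
 * Sketch of the xlog_kvmalloc() fail-fast pattern: strip direct
 * reclaim from the gfp mask so a high-order kmalloc() fails
 * immediately instead of kicking off reclaim and compaction, then
 * fall back to vmalloc(), which is reliable but slower and less
 * scalable.
 */
static inline void *
fail_fast_kvmalloc(size_t buf_size)
{
	gfp_t	flags = GFP_KERNEL;
	void	*p;

	flags &= ~__GFP_DIRECT_RECLAIM;
	flags |= __GFP_NOWARN | __GFP_NORETRY;
	do {
		p = kmalloc(buf_size, flags);
		if (!p)
			p = vmalloc(buf_size);
	} while (!p);
	return p;
}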
When high-order allocations start to fail, we fall back to vmalloc,
and then we hit the long-standing vmalloc scalability problems
before anything else in XFS or the IO path becomes a bottleneck.
IOWs, we already know that fail-fast high-order allocation is a
more efficient and effective fast path than using
vmalloc/vm_map_ram() all the time.

As this is an RFC, I haven't implemented stuff like this yet - I
haven't seen anything in the profiles indicating that high-order
folio allocation is failing and causing lots of reclaim overhead,
so I simply haven't added fail-fast behaviour yet...

> In fact it is highly likely it will, the question is just how bad
> it is.
> 
> Unfortunately benchmarking for that isn't that easy, it needs
> artificial memory fragmentation and then some high stress
> workload, and then instrumenting the transactions for individual
> latencies.

I stress test and measure XFS metadata performance under sustained
memory pressure all the time. This change has not caused any
obvious regressions in the short time I've been testing it.

I still need to do perf testing on large directory block sizes.
That is where high-order allocations will get stressed - that's
where xlog_kvmalloc() starts dominating the profiles as it trips
over vmalloc scalability issues...

> I would in any case add a tunable for it in case people run into
> this.

No tunables. It either works or it doesn't. If we can't make it
work reliably by default, we throw it in the dumpster, light it on
fire and walk away.

> Tail latencies are a common concern on many IO workloads.

Yes, for user data operations it's a common concern. For metadata,
not so much - there are so many far worse long-tail latencies in
metadata operations (like waiting for journal space) that memory
allocation latencies in the metadata IO path are largely noise....

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx