On Mon, Feb 10, 2025 at 06:28:28PM +0100, Jan Kara wrote:
> On Tue 04-02-25 06:06:42, Christoph Hellwig wrote:
> > On Tue, Feb 04, 2025 at 01:50:08PM +1100, Dave Chinner wrote:
> > > I doubt that will create enough concurrency for a typical small
> > > server or desktop machine that only has a single NUMA node but has a
> > > couple of fast nvme SSDs in it.
> > >
> > > > 2) Fixed number of writeback contexts, say min(10, numcpu).
> > > > 3) NUMCPU/N number of writeback contexts.
> > >
> > > These don't take into account the concurrency available from
> > > the underlying filesystem or storage.
> > >
> > > That's the point I was making - CPU count has -zero- relationship to
> > > the concurrency the filesystem and/or storage provide the system. It
> > > is fundamentally incorrect to base decisions about IO concurrency on
> > > the number of CPU cores in the system.
> >
> > Yes. But as mentioned in my initial reply there is a use case for more
> > WB threads than fs writeback contexts, which is when the writeback
> > threads do CPU intensive work like compression. Being able to do that
> > from normal writeback threads vs forking out to fs level threads
> > would really simplify the btrfs code a lot. Not really interesting for
> > XFS right now of course.
> >
> > Or in other words: fs / device geometry really should be the main
> > driver, but if a filesystem supports compression (or really expensive
> > data checksums) being able to scale up the number of threads per
> > context might still make sense. But that's really the advanced part,
> > we'll need to get the fs geometry aligned to work first.
>
> As I'm reading the thread it sounds to me the writeback subsystem should
> provide an API for the filesystem to configure the number of writeback
> contexts, which would be kind of similar to what we currently do for
> cgroup aware writeback?

Yes, that's pretty much what I've been trying to say.
> Currently we create writeback context per cgroup so now additionally
> we'll have some property like "inode writeback locality" that will
> also influence what inode->i_wb gets set to and hence where
> mark_inode_dirty() files inodes etc.

Well, that's currently selected by __inode_attach_wb() based on whether
there is a memcg attached to the folio/task being dirtied or not. If
there isn't a cgroup based writeback task, then it uses the bdi->wb as
the wb context.

In my mind, what you are describing above sounds like we would be
heading down the same road list_lru started down back in 2012 to
support NUMA scalability for LRU based memory reclaim.

i.e. we originally had a single global LRU list for important caches.
This didn't scale up, so I introduced the list_lru construct to
abstract the physical layout of the LRU from the objects being stored
on it and the reclaim infrastructure walking it. That gave us
per-NUMA-node LRUs and NUMA-aware shrinkers for memory reclaim. The
fundamental concept was that we abstract away the sharding of the
object tracking into per-physical-node structures via generic
infrastructure (i.e. list_lru).

Then memcgs needed memory reclaim, and so they were added as extra
lists with a different indexing mechanism to the list_lru contexts.
These weren't per-node lists because there could be thousands of them.
Hence it was just a single "global" list per memcg, and so it didn't
scale on large machines. This wasn't seen as a problem initially, but
a few years later applications using memcgs wanted to scale properly
on large NUMA systems. So now we have each memcg tracking the physical
per-node memory usage for reclaim purposes (i.e. a combinatorial
explosion of memcg vs per-node lists).

Hence suggesting "physically sharded lists for global objects, single
per-cgroup lists for cgroup-owned objects" sounds like exactly the
same problem space progression is about to play out with writeback
contexts. i.e.
we shard the global writeback context into a set of physically sharded
lists for scalability and performance reasons, but leave cgroups with
the old single threaded list constructs. Then someone says "my cgroup
based workload doesn't perform the same as a global workload" and
we're off to solve the problem list_lru solves again.

So.... Should we be looking towards using a subset of the existing
list_lru functionality for writeback contexts here? i.e. create a
list_lru object with N-way scalability, allow the fs to provide an
inode-number-to-list mapping function, and use the list_lru interfaces
to abstract away everything physical and cgroup related for tracking
dirty inodes?

Then selecting inodes for writeback becomes a list_lru_walk() variant
depending on what needs to be written back (e.g. physical node, memcg,
both, everything that is dirty everywhere, etc).

> Then additionally you'd like to have the possibility to have more
> processes operate on one struct wb_writeback (again configurable by
> the filesystem?) to handle cases of cpu intensive writeback
> submission more efficiently.

Yeah, that is a separate problem...

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx