On Tue 11-02-25 12:13:18, Dave Chinner wrote:
> On Mon, Feb 10, 2025 at 06:28:28PM +0100, Jan Kara wrote:
> > On Tue 04-02-25 06:06:42, Christoph Hellwig wrote:
> > > On Tue, Feb 04, 2025 at 01:50:08PM +1100, Dave Chinner wrote:
> > > > I doubt that will create enough concurrency for a typical small
> > > > server or desktop machine that only has a single NUMA node but has
> > > > a couple of fast nvme SSDs in it.
> > > >
> > > > > 2) Fixed number of writeback contexts, say min(10, numcpu).
> > > > > 3) NUMCPU/N number of writeback contexts.
> > > >
> > > > These don't take into account the concurrency available from
> > > > the underlying filesystem or storage.
> > > >
> > > > That's the point I was making - CPU count has -zero- relationship to
> > > > the concurrency the filesystem and/or storage provide the system. It
> > > > is fundamentally incorrect to base decisions about IO concurrency on
> > > > the number of CPU cores in the system.
> > >
> > > Yes. But as mentioned in my initial reply, there is a use case for
> > > more WB threads than fs writeback contexts, which is when the
> > > writeback threads do CPU intensive work like compression. Being able
> > > to do that from normal writeback threads vs forking out to fs level
> > > threads would really simplify the btrfs code a lot. Not really
> > > interesting for XFS right now of course.
> > >
> > > Or in other words: fs / device geometry really should be the main
> > > driver, but if a file system supports compression (or really
> > > expensive data checksums), being able to scale up the number of
> > > threads per context might still make sense. But that's really the
> > > advanced part; we'll need to get the fs geometry alignment to work
> > > first.
> >
> > As I'm reading the thread, it sounds to me like the writeback
> > subsystem should provide an API for the filesystem to configure the
> > number of writeback contexts, which would be kind of similar to what
> > we currently do for cgroup aware writeback?
>
> Yes, that's pretty much what I've been trying to say.
>
> > Currently we create a writeback context per cgroup, so now
> > additionally we'll have some property like "inode writeback locality"
> > that will also influence what inode->i_wb gets set to and hence where
> > mark_inode_dirty() files inodes etc.
>
> Well, that's currently selected by __inode_attach_wb() based on
> whether there is a memcg attached to the folio/task being dirtied or
> not. If there isn't a cgroup based writeback task, then it uses the
> bdi->wb as the wb context.
>
> In my mind, what you are describing above sounds like we would be
> heading down the same road list_lru started down back in 2012 to
> support NUMA scalability for LRU based memory reclaim.
>
> i.e. we originally had a single global LRU list for important
> caches. This didn't scale up, so I introduced the list_lru construct
> to abstract the physical layout of the LRU from the objects being
> stored on it and the reclaim infrastructure walking it. That gave us
> per-NUMA-node LRUs and NUMA-aware shrinkers for memory reclaim. The
> fundamental concept was that we abstract away the sharding of the
> object tracking into per-physical-node structures via generic
> infrastructure (i.e. list_lru).
>
> Then memcgs needed memory reclaim, and so they were added as extra
> lists with a different indexing mechanism to the list_lru contexts.
> These weren't per-node lists because there could be thousands of
> them. Hence it was just a single "global" list per memcg, and so it
> didn't scale on large machines.
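To make the construct being referred to concrete, a simplified sketch of
that sharding -- per-NUMA-node lists for the global case, a single list
per memcg for the original cgroup case -- might look roughly like the
following. The type and field names are illustrative only, not the exact
definitions from include/linux/list_lru.h, which differ between kernel
versions.

#include <linux/cache.h>
#include <linux/list.h>
#include <linux/spinlock.h>

struct sketch_lru_one {
	struct list_head	list;		/* objects on this shard */
	long			nr_items;
};

struct sketch_lru_node {			/* one instance per NUMA node */
	spinlock_t		lock;		/* protects this node's lists */
	struct sketch_lru_one	lru;		/* "global" objects on this node */
} ____cacheline_aligned_in_smp;

struct sketch_lru {
	struct sketch_lru_node	*node;		/* array sized by nr_node_ids */
	/*
	 * The memcg-aware case adds per-memcg sketch_lru_one instances,
	 * which started out as one list per memcg (not per node) -- the
	 * scalability gap described in the next paragraph.
	 */
};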
> This wasn't seen as a problem initially, but a few years later
> applications using memcgs wanted to scale properly on large NUMA
> systems. So now we have each memcg tracking the physical per-node
> memory usage for reclaim purposes (i.e. a combinatorial explosion of
> memcg vs per-node lists).
>
> Hence suggesting "physically sharded lists for global objects,
> single per-cgroup lists for cgroup-owned objects" sounds like
> exactly the same problem space progression is about to play out with
> writeback contexts.
>
> i.e. we shard the global writeback context into a set of physically
> sharded lists for scalability and performance reasons, but leave
> cgroups with the old single threaded list constructs. Then someone
> says "my cgroup based workload doesn't perform the same as a global
> workload" and we're off to solve the problem list_lru solves again.
>
> So....
>
> Should we be looking towards using a subset of the existing list_lru
> functionality for writeback contexts here? i.e. create a list_lru
> object with N-way scalability, allow the fs to provide an
> inode-number-to-list mapping function, and use the list_lru
> interfaces to abstract away everything physical and cgroup related
> for tracking dirty inodes?

Interesting idea. Indeed, the similarity with the problems list_lru is
solving is significant. I like the idea.

								Honza
--
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR
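As a purely illustrative sketch of the interface floated above -- every
name below is hypothetical rather than an existing kernel symbol -- the
filesystem-supplied mapping might look something like this: the fs
reports how many writeback shards its geometry supports and maps each
inode onto one of them, while the generic writeback code sizes an N-way,
list_lru-style structure from that and consults the mapping when
mark_inode_dirty() files a dirty inode.

#include <linux/fs.h>

/* Hypothetical ops vector a filesystem could register with the
 * writeback code.  Nothing like this exists today; it only sketches
 * the shape of the proposal. */
struct wb_shard_ops {
	unsigned int (*nr_shards)(struct super_block *sb);
	unsigned int (*inode_to_shard)(struct super_block *sb,
				       struct inode *inode);
};

/*
 * Example mapping for an XFS-like filesystem that wants one writeback
 * context per allocation group.  example_nr_ags() stands in for
 * whatever geometry query the fs actually has; the modulo mapping is
 * just a placeholder.
 */
static unsigned int example_nr_ags(struct super_block *sb)
{
	return 8;
}

static unsigned int example_inode_to_shard(struct super_block *sb,
					   struct inode *inode)
{
	return inode->i_ino % example_nr_ags(sb);
}

static const struct wb_shard_ops example_wb_shard_ops = {
	.nr_shards	= example_nr_ags,
	.inode_to_shard	= example_inode_to_shard,
};

Whether the mapping keys off inode number, allocation group, or device
would then be entirely the filesystem's choice, which keeps CPU count
out of the generic code's concurrency decisions, as argued earlier in
the thread.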