Re: [PATCH v2 2/4] mm: vmalloc: use rwsem, mutex for vmap_area_lock and vmap_block->lock

Dave Chinner <david@xxxxxxxxxxxxx> · Tue, 21 Mar 2023 21:05:26 +1100

On Tue, Mar 21, 2023 at 07:45:56AM +0000, Lorenzo Stoakes wrote:
> On Tue, Mar 21, 2023 at 06:23:39AM +0100, Uladzislau Rezki wrote:
> > On Tue, Mar 21, 2023 at 12:09:12PM +1100, Dave Chinner wrote:
> > > On Sun, Mar 19, 2023 at 07:09:31AM +0000, Lorenzo Stoakes wrote:
> > > > vmalloc() is, by design, not permitted to be used in atomic context and
> > > > already contains components which may sleep, so avoiding spin locks is not
> > > > a problem from the perspective of atomic context.
> > > >
> > > > The global vmap_area_lock is held when the red/black tree rooted in
> > > > vmap_are_root is accessed and thus is rather long-held and under
> > > > potentially high contention. It is likely to be under contention for reads
> > > > rather than write, so replace it with a rwsem.
> > > >
> > > > Each individual vmap_block->lock is likely to be held for less time but
> > > > under low contention, so a mutex is not an outrageous choice here.
> > > >
> > > > A subset of test_vmalloc.sh performance results:-
> > > >
> > > > fix_size_alloc_test             0.40%
> > > > full_fit_alloc_test		2.08%
> > > > long_busy_list_alloc_test	0.34%
> > > > random_size_alloc_test		-0.25%
> > > > random_size_align_alloc_test	0.06%
> > > > ...
> > > > all tests cycles                0.2%
> > > >
> > > > This represents a tiny reduction in performance that sits barely above
> > > > noise.
> > >
> > > I'm travelling right now, but give me a few days and I'll test this
> > > against the XFS workloads that hammer the global vmalloc spin lock
> > > really, really badly. XFS can use vm_map_ram and vmalloc really
> > > heavily for metadata buffers and hit the global spin lock from every
> > > CPU in the system at the same time (i.e. highly concurrent
> > > workloads). vmalloc is also heavily used in the hottest path
> > > throught the journal where we process and calculate delta changes to
> > > several million items every second, again spread across every CPU in
> > > the system at the same time.
> > >
> > > We really need the global spinlock to go away completely, but in the
> > > mean time a shared read lock should help a little bit....
> > >
> 
> Hugely appreciated Dave, however I must disappoint on the rwsem as I have now
> reworked my patch set to use the original locks in order to satisfy Willy's
> desire to make vmalloc atomic in future, and Uladzislau's desire to not have a
> ~6% performance hit -
> https://lore.kernel.org/all/cover.1679354384.git.lstoakes@xxxxxxxxx/

Yeah, I'd already read that.

What I want to do, though, is to determine whether the problem
shared access contention or exclusive access contention. If it's
exclusive access contention, then an rwsem will do nothing to
alleviate the problem, and that's kinda critical to know before any
fix for the contention problems are worked out...

> > I am working on it. I submitted a proposal how to eliminate it:
> >
> >
> > <snip>
> > Hello, LSF.
> >
> > Title: Introduce a per-cpu-vmap-cache to eliminate a vmap lock contention
> >
> > Description:
> >  Currently the vmap code is not scaled to number of CPU cores in a system
> >  because a global vmap space is protected by a single spinlock. Such approach
> >  has a clear bottleneck if many CPUs simultaneously access to one resource.
> >
> >  In this talk i would like to describe a drawback, show some data related
> >  to contentions and places where those occur in a code. Apart of that i
> >  would like to share ideas how to eliminate it providing a few approaches
> >  and compare them.

If you want data about contention problems with vmalloc

> > Requirements:
> >  * It should be a per-cpu approach;

Hmmmm. My 2c worth on this: That is not a requirement.

That's a -solution-.

The requirement is that independent concurrent vmalloc/vfree
operations do not severely contend with each other.

Yes, the solution will probably involve sharding the resource space
across mulitple independent structures (as we do in filesystems with
block groups, allocations groups, etc) but that does not necessarily
need the structures to be per-cpu.

e.g per-node vmalloc arenas might be sufficient and allow more
expensive but more efficient indexing structures to be used because
we don't have to care about the explosion of memory that
fine-grained per-cpu indexing generally entails.  This may also fit
in to the existing per-node structure of the memory reclaim
infrastructure to manage things like compaction, balancing, etc of
vmalloc space assigned to the given node.

Hence I think saying "per-cpu is a requirement" kinda prevents
exploration of other novel solutions that may have advantages other
than "just solves the concurrency problem"...

> >  * Search of freed ptrs should not interfere with other freeing(as much as we can);
> >  *   - offload allocated areas(buzy ones) per-cpu;
> >  * Cache ready sized objects or merge them into one big per-cpu-space(split on demand);
> >  * Lazily-freed areas either drained per-cpu individually or by one CPU for all;
> >  * Prefetch a fixed size in front and allocate per-cpu

I'd call these desired traits and/or potential optimisations, not
hard requirements.

> > Goals:
> >  * Implement a per-cpu way of allocation to eliminate a contention.

The goal should be to "allow contention-free vmalloc operations", not
that we implement a specific solution.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx