Re: [PATCH 0/9] Mitigate a vmap lock contention

Hyeonggon Yoo <42.hyeyoo@xxxxxxxxx> · Wed, 24 May 2023 10:30:46 +0900

On Wed, May 24, 2023 at 07:43:38AM +1000, Dave Chinner wrote:
> On Wed, May 24, 2023 at 03:04:28AM +0900, Hyeonggon Yoo wrote:
> > On Tue, May 23, 2023 at 05:12:30PM +0200, Uladzislau Rezki wrote:
> > > > > 2. Motivation.
> > > > > 
> > > > > - The vmap code does not scale with the number of CPUs and this should be fixed;
> > > > > - XFS folks have complained several times that vmalloc might be contended on
> > > > >   their workloads:
> > > > > 
> > > > > <snip>
> > > > > commit 8dc9384b7d75012856b02ff44c37566a55fc2abf
> > > > > Author: Dave Chinner <dchinner@xxxxxxxxxx>
> > > > > Date:   Tue Jan 4 17:22:18 2022 -0800
> > > > > 
> > > > >     xfs: reduce kvmalloc overhead for CIL shadow buffers
> > > > >     
> > > > >     Oh, let me count the ways that the kvmalloc API sucks dog eggs.
> > > > >     
> > > > >     The problem is when we are logging lots of large objects, we hit
> > > > >     kvmalloc really damn hard with costly order allocations, and
> > > > >     behaviour utterly sucks:
> > > > 
> > > > Based on the commit, I guess XFS uses vmalloc/kvmalloc because
> > > > it allocates large buffers. How large could they be?
> > > > 
> > > They use kvmalloc(). When the page allocator is not able to serve a
> > > request, they fall back to vmalloc. From what I see, the sizes are:
> > > 
> > > from 73728 up to 1048576, i.e. 18 pages up to 256 pages.
> > > 
> > > > > 3. Test
> > > > > 
> > > > > On my: AMD Ryzen Threadripper 3970X 32-Core Processor, i have below figures:
> > > > > 
> > > > >     1-page     1-page-this-patch
> > > > > 1  0.576131   vs   0.555889
> > > > > 2   2.68376   vs    1.07895
> > > > > 3   4.26502   vs    1.01739
> > > > > 4   6.04306   vs    1.28924
> > > > > 5   8.04786   vs    1.57616
> > > > > 6   9.38844   vs    1.78142
> > > > 
> > > > <snip>
> > > > 
> > > > > 29    20.06   vs    3.59869
> > > > > 30  20.4353   vs     3.6991
> > > > > 31  20.9082   vs    3.73028
> > > > > 32  21.0865   vs    3.82904
> > > > > 
> > > > > 1..32 is the number of jobs. The results are in usec and show
> > > > > vmalloc()/vfree() pair throughput.
> > > > 
> > > > I would be more interested in real numbers than synthetic benchmarks.
> > > > Maybe XFS folks could help with profiling similar to commit 8dc9384b7d750,
> > > > with and without this patchset?
> > > > 
> > > I added Dave Chinner <david@xxxxxxxxxxxxx> to this thread.
> > 
> > Oh, I missed that, and it would be better to [+Cc linux-xfs]
> > 
> > > But. The contention exists.
> > 
> > I think "theoretically can be contended" doesn't necessarily mean it's actually
> > contended in the real world.
> 
> Did you not read the commit message for the XFS commit documented
> above? vmalloc lock contention most certainly does exist in the
> real world and the profiles in commit 8dc9384b7d75 ("xfs: reduce
> kvmalloc overhead for CIL shadow buffers") document it clearly.
>
> > Also I find it difficult to imagine vmalloc being highly contended because it was
> > historically considered slow and thus discouraged when performance is important.
> 
> Read the above XFS commit.
> 
> We use vmalloc in critical high performance fast paths that cannot
> tolerate high order memory allocation failure. XFS runs this
> fast path millions of times a second, and will call into
> vmalloc() several hundred thousand times a second with machine wide
> concurrency under certain types of workloads.
>
> > IOW vmalloc would not be contended when allocation size is small because we have
> > kmalloc/buddy API, and therefore I wonder which workloads are allocating very large
> > buffers and at the same time allocating very frequently, thus performance-sensitive.
> >
> > I am not against this series, but wondering which workloads would benefit ;)
> 
> Yup, you need to read the XFS commit message. If you understand what
> is in that commit message, then you wouldn't be doubting that
> vmalloc contention is real and that it is used in high performance
> fast paths that are traversed millions of times a second....

Oh, I did read the commit, but that part seems to have slipped my mind. Sorry
for such a dumb question; I get it now, and thank you so much. In any case,
I didn't mean to offend. I should have read and thought more before asking.

>
> > > Apart from that, the per-cpu-KVA allocator can go away if we make it generic instead.
> > 
> > Not sure I understand your point, can you elaborate please?
> > 
> > And I would like to ask some side questions:
> > 
> > 1. Is the vm_[un]map_ram() API still worth keeping with this patchset?
> 
> XFS also uses this interface for mapping multi-page buffers in the
> XFS buffer cache. These are the items that also require the high
> order costly kvmalloc allocations in the transaction commit path
> when they are modified.
> 
> So, yes, we need these mapping interfaces to scale just as well as
> vmalloc itself....

I mean, even before this series, vm_[un]map_ram() caches vmap_blocks
per CPU, but there is a limit on the size that can be cached per CPU.

Now that vmap() itself becomes scalable with this series, I wonder whether
vm_[un]map_ram() is still worth keeping. Why not replace it with v[un]map()?
> 
> > 2. How does this patchset deal with 32-bit machines where
> >    vmalloc address space is limited?
> 
> From the XFS side, we just don't care about 32 bit machines at all.
> XFS is aimed at server and HPC environments which have been entirely
> 64 bit for a long, long time now...

But Linux still supports 32-bit machines and is not going to drop
support for them anytime soon, so I think there should at least be a way
to disable this feature.

Thanks!

-- 
Hyeonggon Yoo

Doing kernel stuff as a hobby
Undergraduate | Chungnam National University
Dept. Computer Science & Engineering