On Wed, May 24, 2023 at 03:04:28AM +0900, Hyeonggon Yoo wrote:
> On Tue, May 23, 2023 at 05:12:30PM +0200, Uladzislau Rezki wrote:
> > > > 2. Motivation.
> > > >
> > > > - The vmap code does not scale to the number of CPUs and this should be fixed;
> > > > - XFS folks have complained several times that vmalloc can be contended on
> > > >   their workloads:
> > > >
> > > > <snip>
> > > > commit 8dc9384b7d75012856b02ff44c37566a55fc2abf
> > > > Author: Dave Chinner <dchinner@xxxxxxxxxx>
> > > > Date:   Tue Jan 4 17:22:18 2022 -0800
> > > >
> > > >     xfs: reduce kvmalloc overhead for CIL shadow buffers
> > > >
> > > >     Oh, let me count the ways that the kvmalloc API sucks dog eggs.
> > > >
> > > >     The problem is when we are logging lots of large objects, we hit
> > > >     kvmalloc really damn hard with costly order allocations, and
> > > >     behaviour utterly sucks:
> > >
> > > Based on the commit I guess xfs uses vmalloc/kvmalloc because it
> > > allocates large buffers; how large can they be?
> > >
> > They use kvmalloc(). When the page allocator is not able to serve a
> > request they fall back to vmalloc. At least from what I see, the sizes
> > are from 73728 up to 1048576 bytes, i.e. 18 pages up to 256 pages.
> >
> > > > 3. Test
> > > >
> > > > On my AMD Ryzen Threadripper 3970X 32-Core Processor I have the below figures:
> > > >
> > > >        1-page           1-page-this-patch
> > > >    1   0.576131    vs   0.555889
> > > >    2   2.68376     vs   1.07895
> > > >    3   4.26502     vs   1.01739
> > > >    4   6.04306     vs   1.28924
> > > >    5   8.04786     vs   1.57616
> > > >    6   9.38844     vs   1.78142
> > > <snip>
> > > >   29   20.06       vs   3.59869
> > > >   30   20.4353     vs   3.6991
> > > >   31   20.9082     vs   3.73028
> > > >   32   21.0865     vs   3.82904
> > > >
> > > > 1..32 is the number of jobs. The results are in usec and represent
> > > > vmalloc()/vfree() pair throughput.
> > >
> > > I would be more interested in real numbers than synthetic benchmarks.
> > > Maybe the XFS folks could help by performing profiling similar to
> > > commit 8dc9384b7d750, with and without this patchset?
> > >
> > I added Dave Chinner <david@xxxxxxxxxxxxx> to this thread.
>
> Oh, I missed that, and it would be better to [+Cc linux-xfs]
>
> > But the contention exists.
>
> I think "theoretically can be contended" doesn't necessarily mean it's
> actually contended in the real world.
>
> Also, I find it difficult to imagine vmalloc being highly contended,
> because it was historically considered slow and thus discouraged when
> performance is important.
>
> IOW, vmalloc would not be contended when the allocation size is small,
> because we have the kmalloc/buddy API for that, and therefore I wonder
> which workloads allocate very large buffers very frequently and are thus
> performance-sensitive.
>
> I am not against this series, but am wondering which workloads would
> benefit ;)
>
> > Apart from that, the per-cpu KVA allocator can go away if we make it
> > generic instead.
>
> Not sure I understand your point; can you elaborate, please?
>
> And I would like to ask some side questions:
>
> 1. Is the vm_[un]map_ram() API still worthwhile with this patchset?
>
It is up to the community to decide. As far as I can see, XFS needs it as
well. Maybe in the future it can be removed (who knows), if the vmalloc
code itself can deliver the same performance as the vm_map_* APIs.
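To make the discussion concrete, below is a minimal sketch of the usage
pattern behind vm_[un]map_ram(): the caller already owns the pages and only
asks for a contiguous kernel virtual range. vm_map_ram() and vm_unmap_ram()
are the real kernel APIs; the wrapper names are mine, for illustration only.

<snip>
#include <linux/mm.h>
#include <linux/numa.h>
#include <linux/vmalloc.h>

/* Map "count" caller-owned pages into one contiguous virtual range. */
static void *map_own_pages(struct page **pages, unsigned int count)
{
	/* NUMA_NO_NODE: no preference where the vmap metadata lives. */
	return vm_map_ram(pages, count, NUMA_NO_NODE);
}

static void unmap_own_pages(void *addr, unsigned int count)
{
	/* "count" must match the value passed to vm_map_ram(). */
	vm_unmap_ram(addr, count);
}
<snip>

For small mappings this interface is served from per-CPU vmap blocks, which
is the per-cpu KVA allocator referred to above.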
> 2. How does this patchset deal with 32-bit machines, where the vmalloc
>    address space is limited?
>
It can deal with them without any problems, though I am not sure it is
needed for 32-bit systems. The background is that the vmalloc code used to
be slow at lookup time: it was O(n), and was later improved to O(log n).
The vm_map_ram() and friends interface was added because of those vmalloc
drawbacks.

I am not sure there are 32-bit systems with 10/20/30... CPUs on board, and
that is the case where contention is worth caring about.
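For context, the per-pair figures quoted above can be collected with a tight
loop of roughly the following shape, run concurrently from N kernel threads;
the in-tree lib/test_vmalloc.c module does something similar. This is a
simplified sketch, and the helper name is illustrative, not an existing
kernel symbol.

<snip>
#include <linux/ktime.h>
#include <linux/math64.h>
#include <linux/mm.h>
#include <linux/vmalloc.h>

static u64 vmalloc_vfree_pair_cost_ns(u32 iterations)
{
	ktime_t start = ktime_get();
	u32 i;

	for (i = 0; i < iterations; i++) {
		void *p = vmalloc(PAGE_SIZE);	/* the "1-page" case above */

		if (!p)
			break;
		vfree(p);
	}

	/* Average nanoseconds spent per vmalloc()/vfree() pair. */
	return div_u64(ktime_to_ns(ktime_sub(ktime_get(), start)), iterations);
}
<snip>

--
Uladzislau Rezki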