On Fri, Jul 29, 2022 at 12:02 PM Mathieu Desnoyers
<mathieu.desnoyers@xxxxxxxxxxxx> wrote:
>
> Extend the rseq ABI to expose a NUMA node ID and a vm_vcpu_id field.

Thanks a lot, Mathieu - it is really exciting to see this happening!

I'll share our experiences here, in the hope that they may be useful.
I've also cc-ed Chris Kennelly, who worked on the userspace/tcmalloc
side, as he can provide more context/details if I miss or misrepresent
something.

The problem: tcmalloc maintains per-CPU freelists in userspace to make
memory allocations fast and efficient; it relies on rseq to do so, as
any manipulation of the freelists has to be protected against thread
migrations. However, a typical userspace process in a Google datacenter
is confined via cgroups to a relatively small number of CPUs (8-16),
while the servers typically have many more physical CPUs, so the
per-CPU freelist model is somewhat wasteful: if a process has at most
10 threads running at any time, for example, but those threads can
"wander" across 100 CPUs over the lifetime of the process, keeping 100
freelists instead of 10 noticeably wastes memory.

Note that although a typical process at Google has a limited CPU quota,
and thus uses only a small number of CPUs at any given time, it may
have many hundreds or thousands of threads, so per-thread freelists are
not a viable solution to the problem just described.

Our current solution: as you outlined in patch 9, tracking the number
of currently running threads per address space and exposing this
information via a vcpu_id abstraction lets tcmalloc noticeably reduce
its freelist overhead in the "narrow process running on a wide server"
situation, which is typical at Google.

We have experimented with several approaches here. The one we currently
use is the "flat" model: we allocate vcpu IDs while ignoring NUMA
nodes. We did try per-NUMA-node vcpus, but that showed no material
improvement over the "flat" model, perhaps because on our "widest"
servers the CPU topology is multi-level. Chris Kennelly may be able to
provide more details here.

On a more technical note, we use atomic operations extensively in the
kernel to keep vcpu IDs "tightly packed": if only N threads of a
process are currently running on physical CPUs, the vcpu IDs in use are
in the range [0, N-1] - no gaps, and nothing at N or above. This does
cost some extra CPU cycles, but the RAM savings far outweigh the extra
CPU cost; it will be interesting to see what you can do with the
optimizations you propose in this patchset.

Again, thanks a lot for this effort!

Peter

[...]
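P.S. To make the "tightly packed" allocation above more concrete, here
is a minimal userspace sketch (illustrative only - not our actual
kernel code) of the "grab the lowest free ID" building block, using C11
atomics, GCC's __builtin_ctzl, and a made-up MAX_VCPUS bound; the
vcpu_id_get/vcpu_id_put names are hypothetical. It only shows the
lowest-free-bit claim/release; keeping the range gap-free over time
also requires renumbering/handing off IDs at context switch, which is
not shown here.

/* Illustrative sketch: compact ID allocation via a lock-free bitmap. */
#include <stdatomic.h>
#include <stdio.h>

#define MAX_VCPUS 128
#define BITS_PER_WORD (8 * sizeof(unsigned long))

static atomic_ulong vcpu_bitmap[MAX_VCPUS / BITS_PER_WORD];

/* Claim the lowest currently-free ID; returns -1 if all are in use. */
static int vcpu_id_get(void)
{
        for (size_t w = 0; w < sizeof(vcpu_bitmap) / sizeof(vcpu_bitmap[0]); w++) {
                unsigned long old = atomic_load(&vcpu_bitmap[w]);

                while (~old) {  /* at least one clear bit in this word */
                        int bit = __builtin_ctzl(~old); /* lowest clear bit */

                        if (atomic_compare_exchange_weak(&vcpu_bitmap[w], &old,
                                                         old | (1UL << bit)))
                                return (int)(w * BITS_PER_WORD) + bit;
                        /* CAS failure reloaded 'old'; retry with fresh value. */
                }
        }
        return -1;
}

/* Release an ID when its thread stops running. */
static void vcpu_id_put(int id)
{
        atomic_fetch_and(&vcpu_bitmap[id / BITS_PER_WORD],
                         ~(1UL << (id % BITS_PER_WORD)));
}

int main(void)
{
        int a = vcpu_id_get(), b = vcpu_id_get();

        printf("ids: %d %d\n", a, b);           /* 0 1 */
        vcpu_id_put(a);
        printf("reused: %d\n", vcpu_id_get());  /* 0 again: lowest free slot */
        return 0;
}

The real thing has to hook into the scheduler's context-switch path and
cope with more than a fixed MAX_VCPUS, but the compare-and-swap on the
lowest clear bit is the essential idea, and it is where the extra CPU
cycles mentioned above are spent.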