----- On Aug 1, 2022, at 1:07 PM, Peter Oskolkov posk@xxxxxxx wrote:

> On Fri, Jul 29, 2022 at 12:02 PM Mathieu Desnoyers
> <mathieu.desnoyers@xxxxxxxxxxxx> wrote:
>>
>> Extend the rseq ABI to expose a NUMA node ID and a vm_vcpu_id field.
>
> Thanks a lot, Mathieu - it is really exciting to see this happening!
>
> I'll share our experiences here, with the hope that it may be useful.
> I've also cc-ed Chris Kennelly, who worked on the userspace/tcmalloc
> side, as he can provide more context/details if I miss or misrepresent
> something.

Thanks for sharing your experiences at Google. This helps put things in
perspective.

> The problem:
>
> tcmalloc maintains per-cpu freelists in the userspace to make userspace
> memory allocations fast and efficient; it relies on rseq to do so, as
> any manipulation of the freelists has to be protected vs thread
> migrations.
>
> However, as a typical userspace process at a Google datacenter is
> confined to a relatively small number of CPUs (8-16) via cgroups, while
> the servers typically have a much larger number of physical CPUs, the
> per-cpu freelist model is somewhat wasteful: if a process has only at
> most 10 threads running, for example, but these threads can "wander"
> across 100 CPUs over the lifetime of the process, keeping 100 freelists
> instead of 10 noticeably wastes memory.
>
> Note that although a typical process at Google has a limited CPU quota,
> thus using only a small number of CPUs at any given time, the process
> may often have many hundreds or thousands of threads, so per-thread
> freelists are not a viable solution to the problem just described.
>
> Our current solution:
>
> As you outlined in patch 9, tracking the number of currently running
> threads per address space and exposing this information via a vcpu_id
> abstraction helps tcmalloc to noticeably reduce its freelist overhead
> in the "narrow process running on a wide server" situation, which is
> typical at Google.
>
> We have experimented with several approaches here. The one that we are
> currently using is the "flat" model: we allocate vcpu IDs ignoring numa
> nodes.
>
> We did try per-numa-node vcpus, but it did not show any material
> improvement over the "flat" model, perhaps because on our most "wide"
> servers the CPU topology is multi-level. Chris Kennelly may provide
> more details here.

I would really like to know more about Google's per-numa-node vcpus
implementation. I suspect you guys may have taken a different turn
somewhere in the design which led to these results. But having not seen
that implementation, I can only guess.

I notice the following Google-specific prototype extension in tcmalloc:

// This is a prototype extension to the rseq() syscall. Since a process may
// run on only a few cores at a time, we can use a dense set of "v(irtual)
// cpus." This can reduce cache requirements, as we only need N caches for
// the cores we actually run on simultaneously, rather than a cache for every
// physical core.
union {
  struct {
    short numa_node_id;
    short vcpu_id;
  };
  int vcpu_flat;
};

Can you tell me more about the way the numa_node_id and vcpu_id are
allocated internally, and how they are expected to be used by userspace?
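To make sure we are talking about the same thing, here is a minimal
sketch of how I picture a userspace fast path consuming such a dense
vcpu_id to index per-vcpu freelists. The struct rseq_vcpu_ext name, the
rseq_ext TLS stand-in, MAX_VCPUS and per_vcpu_cache[] below are made up
purely for illustration (this is not tcmalloc's actual code), and a real
fast path would of course run within an rseq critical section so that a
migration between reading the vcpu_id and updating the freelist aborts
and retries:

/*
 * Illustration only: hypothetical names, not the tcmalloc implementation.
 * The field layout mirrors the prototype union quoted above.
 */
struct rseq_vcpu_ext {                  /* hypothetical extension layout */
        union {
                struct {
                        short numa_node_id;
                        short vcpu_id;
                };
                int vcpu_flat;
        };
};

/* Stand-in for the per-thread rseq area registered with the kernel. */
static __thread struct rseq_vcpu_ext rseq_ext;

#define MAX_VCPUS 512                   /* bounded by allowed concurrency */

struct freelist {
        void *head;
};

/* One cache per vcpu rather than one per physical CPU. */
static struct freelist per_vcpu_cache[MAX_VCPUS];

static struct freelist *current_cache(void)
{
        /*
         * Flat model: vcpu_id alone selects the cache. A per-numa-node
         * model would presumably combine (numa_node_id, vcpu_id) to pick
         * a per-node cache array instead.
         */
        return &per_vcpu_cache[rseq_ext.vcpu_id];
}

In other words, the caches are sized by how many threads can run
concurrently rather than by how many CPUs the process can ever touch,
which is the saving you describe. Is that roughly how vcpu_flat is
consumed in your flat model?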
> On a more technical note, we do use atomic operations extensively in
> the kernel to make sure vcpu IDs are "tightly packed", i.e. if only N
> threads of a process are currently running on physical CPUs, vcpu IDs
> will be in the range [0, N-1]: no gaps, no going to N and above; this
> does consume some extra CPU cycles, but the RAM savings we gain far
> outweigh the extra CPU cost; it will be interesting to see what you can
> do with the optimizations you propose in this patchset.

The optimizations I propose keep those "tightly packed" characteristics,
but skip the atomic operations in common scenarios. I'll welcome
benchmarks of the added overhead in representative workloads.

> Again, thanks a lot for this effort!

Thanks for your input. It really helps steer the effort in the right
direction.

Mathieu

>
> Peter
>
> [...]

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com