On Fri, Jul 29, 2022 at 12:02 PM Mathieu Desnoyers
<mathieu.desnoyers@xxxxxxxxxxxx> wrote:
>
> Extend the rseq ABI to expose a NUMA node ID and a vm_vcpu_id field.

Thanks a lot, Mathieu - it is really exciting to see this happening!

I'll share our experiences here, in the hope that they may be useful.
I've also cc-ed Chris Kennelly, who worked on the userspace/tcmalloc
side, as he can provide more context/details if I miss or misrepresent
something.

The problem: tcmalloc maintains per-CPU freelists in userspace to make
memory allocations fast and efficient; it relies on rseq to do so, as
any manipulation of the freelists has to be protected against thread
migrations. However, a typical userspace process in a Google datacenter
is confined via cgroups to a relatively small number of CPUs (8-16),
while the servers typically have many more physical CPUs, so the
per-CPU freelist model is somewhat wasteful: if a process has at most
10 threads running at any time, for example, but those threads can
"wander" across 100 CPUs over the lifetime of the process, keeping 100
freelists instead of 10 noticeably wastes memory.

Note that although a typical process at Google has a limited CPU quota,
and thus uses only a small number of CPUs at any given time, it may
have many hundreds or thousands of threads, so per-thread freelists are
not a viable solution to the problem just described.

Our current solution: as you outlined in patch 9, tracking the number
of currently running threads per address space and exposing this
information via a vcpu_id abstraction lets tcmalloc noticeably reduce
its freelist overhead in the "narrow process running on a wide server"
situation, which is typical at Google.

We have experimented with several approaches here. The one we currently
use is the "flat" model: we allocate vcpu IDs while ignoring NUMA
nodes. We did try per-NUMA-node vcpus, but that showed no material
improvement over the "flat" model, perhaps because on our "widest"
servers the CPU topology is multi-level. Chris Kennelly may be able to
provide more details here.

On a more technical note, we use atomic operations extensively in the
kernel to keep vcpu IDs "tightly packed": if only N threads of a
process are currently running on physical CPUs, the vcpu IDs in use are
in the range [0, N-1] - no gaps, and nothing at N or above. This does
cost some extra CPU cycles, but the RAM savings far outweigh the extra
CPU cost; it will be interesting to see what you can do with the
optimizations you propose in this patchset.

Again, thanks a lot for this effort!

Peter

[...]
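P.S. To make the "tightly packed" allocation above more concrete, here
is a minimal userspace sketch (illustrative only - not our actual
kernel code) of the "grab the lowest free ID" building block, using C11
atomics, GCC's __builtin_ctzl, and a made-up MAX_VCPUS bound; the
vcpu_id_get/vcpu_id_put names are hypothetical. It only shows the
lowest-free-bit claim/release; keeping the range gap-free over time
also requires renumbering/handing off IDs at context switch, which is
not shown here.

/* Illustrative sketch: compact ID allocation via a lock-free bitmap. */
#include <stdatomic.h>
#include <stdio.h>

#define MAX_VCPUS 128
#define BITS_PER_WORD (8 * sizeof(unsigned long))

static atomic_ulong vcpu_bitmap[MAX_VCPUS / BITS_PER_WORD];

/* Claim the lowest currently-free ID; returns -1 if all are in use. */
static int vcpu_id_get(void)
{
        for (size_t w = 0; w < sizeof(vcpu_bitmap) / sizeof(vcpu_bitmap[0]); w++) {
                unsigned long old = atomic_load(&vcpu_bitmap[w]);

                while (~old) {  /* at least one clear bit in this word */
                        int bit = __builtin_ctzl(~old); /* lowest clear bit */

                        if (atomic_compare_exchange_weak(&vcpu_bitmap[w], &old,
                                                         old | (1UL << bit)))
                                return (int)(w * BITS_PER_WORD) + bit;
                        /* CAS failure reloaded 'old'; retry with fresh value. */
                }
        }
        return -1;
}

/* Release an ID when its thread stops running. */
static void vcpu_id_put(int id)
{
        atomic_fetch_and(&vcpu_bitmap[id / BITS_PER_WORD],
                         ~(1UL << (id % BITS_PER_WORD)));
}

int main(void)
{
        int a = vcpu_id_get(), b = vcpu_id_get();

        printf("ids: %d %d\n", a, b);           /* 0 1 */
        vcpu_id_put(a);
        printf("reused: %d\n", vcpu_id_get());  /* 0 again: lowest free slot */
        return 0;
}

The real thing has to hook into the scheduler's context-switch path and
cope with more than a fixed MAX_VCPUS, but the compare-and-swap on the
lowest clear bit is the essential idea, and it is where the extra CPU
cycles mentioned above are spent.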