On Tue, Aug 2, 2022 at 8:01 AM Mathieu Desnoyers
<mathieu.desnoyers@xxxxxxxxxxxx> wrote:
>
[...]
> >
> > We have experimented with several approaches here. The one that we are
> > currently using is the "flat" model: we allocate vcpu IDs ignoring numa nodes.
> >
> > We did try per-numa-node vcpus, but it did not show any material improvement
> > over the "flat" model, perhaps because on our most "wide" servers the CPU
> > topology is multi-level. Chris Kennelly may provide more details here.
>
> I would really like to know more about Google's per-numa-node vcpus implementation.
> I suspect you guys may have taken a different turn somewhere in the design which
> led to these results. But having not seen that implementation, I can only guess.
>
> I notice the following Google-specific prototype extension in tcmalloc:
>
>   // This is a prototype extension to the rseq() syscall. Since a process may
>   // run on only a few cores at a time, we can use a dense set of "v(irtual)
>   // cpus." This can reduce cache requirements, as we only need N caches for
>   // the cores we actually run on simultaneously, rather than a cache for every
>   // physical core.
>   union {
>     struct {
>       short numa_node_id;
>       short vcpu_id;
>     };
>     int vcpu_flat;
>   };
>
> Can you tell me more about the way the numa_node_id and vcpu_id are allocated
> internally, and how they are expected to be used by userspace ?

Based on a "VCPU policy" flag passed by the userspace during rseq
registration request, our kernel would:
- do nothing re: vcpus, i.e. behave like it currently does upstream;
- allocate VCPUs in a "flat" manner, ignoring NUMA;
- populate numa_node_id with the value from the function with the same
  name in https://elixir.bootlin.com/linux/latest/source/include/linux/topology.h
  and allocate vcpu_id within the numa node in a tight manner.

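To make the third policy a bit more concrete, here is a minimal userspace
sketch of per-node "tight" allocation. It is not the actual kernel code: it
assumes "tight" means "hand out the lowest free ID on the node", and the
names (node_vcpu_mask, vcpu_alloc, vcpu_free, vcpu_migrate) and the
MAX_NODES / MAX_VCPUS_PER_NODE limits are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical limits, chosen only to keep the sketch small. */
#define MAX_NODES          4
#define MAX_VCPUS_PER_NODE 64

/* One bitmap per NUMA node; bit i set means vcpu ID i is in use there. */
static uint64_t node_vcpu_mask[MAX_NODES];

/* Take the lowest free vcpu ID on `node`; returns -1 if all are in use. */
static int vcpu_alloc(int node)
{
    assert(node >= 0 && node < MAX_NODES);
    for (int id = 0; id < MAX_VCPUS_PER_NODE; id++) {
        uint64_t bit = (uint64_t)1 << id;
        if (!(node_vcpu_mask[node] & bit)) {
            node_vcpu_mask[node] |= bit;
            return id;
        }
    }
    return -1;
}

/* Give vcpu ID `id` on `node` back to the pool. */
static void vcpu_free(int node, int id)
{
    assert(node >= 0 && node < MAX_NODES);
    assert(id >= 0 && id < MAX_VCPUS_PER_NODE);
    node_vcpu_mask[node] &= ~((uint64_t)1 << id);
}

/*
 * Cross-node migration: release the ID held on the old node, then take
 * the lowest free ID on the new node. Returns the new vcpu ID or -1.
 */
static int vcpu_migrate(int old_node, int old_id, int new_node)
{
    vcpu_free(old_node, old_id);
    return vcpu_alloc(new_node);
}
```

With lowest-free allocation, IDs on a node stay below the peak number of
threads that were running there concurrently, and a released ID is the
first one to be reused. The sketch does not re-compact IDs when one in
the middle is released.
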
Basically, if there are M threads running on node 0 and N threads
running on node 1 at time T, there will be [0,M-1] vcpu IDs associated
with node 0 and [0,N-1] vcpu IDs associated with node 1 at this moment
in time. If a thread migrates across nodes, the balance would change
accordingly.

I'm not sure exactly how tcmalloc tried to use VCPUs under this policy,
or what benefits were expected. The simplest way would be to keep
a freelist per node_id/vcpu_id pair (basically, per vcpu_flat in the
union), but this would tend to increase the number of freelists due to
thread migrations, so any benefits should be related to memory locality,
and so are somewhat difficult to measure precisely.

Chris Kennelly may offer more details here.

Thanks,
Peter