On 2022-09-22 16:10, Chris Kennelly wrote:
Hi,
I still need to update the code in TCMalloc to cooperate with the new
glibc ABI/convention. One concern I have is that it looks like I might
need to add an extra memory dereference (or two) to get the early
initialized offsets provided by glibc folded into the read of the cpu_id
field.
If you have a concrete example of this, I'd be happy to help and perhaps
we can improve your usage pattern.
I think I can avoid this by using %gs to point to the address of the
cpu_id field itself (which I think could be used to select between vCPUs
or not*), but %gs is a global piece of state that all of the libraries
in the program need to cooperate on.
I think what we are all looking for here is a scheme that would allow us
the fastest per-vcpu data structure accesses possible from userspace.
I think we could do something similar to what is done in the Linux
kernel for that, but in userspace. Here are some random ideas I have on
this topic:
We could introduce a new prctl(2) PT_{SET,GET}_GS_MODE on x86-64. This
would take as arguments the indexing mode and offset multiplier we want
to be applied to the GS segment selector on return to userspace:
    enum gs_index_mode {
            GS_INDEX_MODE_MM_VCPU,
    };

    struct prctl_set_gs_mode {
            enum gs_index_mode index_mode;
            u64 stride;
    };
For a memory space which has this gs mode set, the return to userspace
code would populate the GS segment selector register with:
    stride * current->mm_vcpu_id
The "stride" would be the virtual address space size allowed for
per-vcpu-data. This could be decided by the libc, with a tunable
allowing to increase/decrease this size. Another libc tunable could
disable populating the GS segment selector altogether (e.g. for
compatibility with applications like Wine which AFAIK use it).
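To make the stride arithmetic concrete, here is a minimal userspace sketch of the addressing it implies. It is only a model, not kernel or glibc code: the names percpu_area, gs_offset_for_vcpu, percpu_area_init and percpu_ptr are hypothetical, and the region is allocated with calloc() rather than reached through %gs. It assumes a 64-bit target.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical model of a per-vcpu data region; not an existing API. */
struct percpu_area {
	unsigned char *base;	/* start of the whole per-vcpu region */
	uint64_t stride;	/* bytes reserved for each vcpu */
	unsigned int nr_vcpus;	/* number of vcpu slots in the region */
};

/* Offset the proposal would apply for a given vcpu: stride * vcpu_id. */
static inline uint64_t gs_offset_for_vcpu(uint64_t stride, unsigned int vcpu_id)
{
	return stride * (uint64_t)vcpu_id;
}

/* Allocate a zeroed region of nr_vcpus slots, each 'stride' bytes long. */
static inline int percpu_area_init(struct percpu_area *a, uint64_t stride,
				   unsigned int nr_vcpus)
{
	assert(stride > 0 && nr_vcpus > 0);
	a->base = calloc(nr_vcpus, stride);
	if (!a->base)
		return -1;
	a->stride = stride;
	a->nr_vcpus = nr_vcpus;
	return 0;
}

/*
 * Address of the field at 'field_offset' within the slot of 'vcpu_id'.
 * Returns NULL if the vcpu or the offset falls outside the region.
 */
static inline void *percpu_ptr(const struct percpu_area *a,
			       unsigned int vcpu_id, size_t field_offset)
{
	if (vcpu_id >= a->nr_vcpus || field_offset >= a->stride)
		return NULL;
	return a->base + gs_offset_for_vcpu(a->stride, vcpu_id) + field_offset;
}
```

With a stride of 4096 and four vcpus, percpu_ptr(&a, 2, 8) resolves to a.base + 2 * 4096 + 8. Under the proposal, the multiply-and-add would be done once by the kernel on return to userspace instead of on every access.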
With this in place, I hope we could then do per-vcpu data access by
simply prefixing memory access instructions with a %gs: segment
selector prefix.
Thoughts?
Thanks,
Mathieu
Thanks,
Chris
* TCMalloc is already paying a load+pointer arithmetic to select between
cpu_id versus vcpu_id, so this would actually make things a little bit
faster.
On Thu, Sep 22, 2022 at 3:21 PM Mathieu Desnoyers
<mathieu.desnoyers@xxxxxxxxxxxx>
wrote:
Hi Chris,
Sorry, it looks like I forgot to CC you on this series. If you can give
it a spin with tcmalloc I would be very much interested in the result.
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com