Re: [RFC PATCH] getcpu_cache system call: caching current CPU number (x86)

Andy Lutomirski <luto@xxxxxxxxxxxxxx> · Fri, 17 Jul 2015 08:53:06 -0700

On Fri, Jul 17, 2015 at 3:21 AM, Ondřej Bílka <neleai@xxxxxxxxx> wrote:
> On Thu, Jul 16, 2015 at 12:27:10PM -0700, Andy Lutomirski wrote:
>> On Thu, Jul 16, 2015 at 11:08 AM, Mathieu Desnoyers
>> <mathieu.desnoyers@xxxxxxxxxxxx> wrote:
>> > ----- On Jul 14, 2015, at 5:34 AM, Ben Maurer bmaurer@xxxxxx wrote:
>> >>
>> >> That said, having the ability for the kernel to understand that TLS
>> >> implementations are laid out using the same offset on each thread seems like
>> >> something that could be valuable long term. Doing so makes it possible to build
>> >> other TLS-based features without forcing each thread to be registered.
>> >
>> > AFAIU, using a fixed hardcoded ABI between kernel and user-space might make
>> > transition from the pre-existing ABI (where this memory area is not
>> > reserved) a bit tricky without registering the area, or getting a "feature"
>> > flag, through a system call.
>> >
>> > The related question then becomes: should we issue this system call once
>> > per process, or once per thread at thread creation? Issuing it once per
>> > thread is marginally more costly for thread creation, but seems to be
>> > easier to deal with internally within the kernel.
>> >
>> > We could, however, ensure that only a single system call is needed per new
>> > thread, rather than one system call per feature. One way to do this would be
>> > to register an area that may contain more than just the CPU id. It could
>> > consist of an expandable structure with fixed offsets. When registered, we
>> > could pass the size of that structure as an argument to the system call, so
>> > the kernel knows which features are expected by user-space.
>>
>> If we actually bit the bullet and implemented per-cpu mappings, we
>> could have this be completely flexible because there would be no
>> format at all.  Similarly, if we implemented per-cpu segments,
>> userspace would need to agree with *itself* how to arbitrate it, but
>> the kernel wouldn't need to be involved.
>>
>> With this kind of memory poking, it's definitely messier, which is unfortunate.
>>
> Could you recapitulate the thread? On the libc side we didn't read most of
> it, so it would be appreciated.
>
> Do per-cpu mappings mean that there is a single virtual page that is
> mapped to different virtual pages?

Single virtual page that's mapped to different physical pages on
different cpus.  I believe that ARM has some hardware support for
this, but I'm not that familiar with ARM.  x86 can fake it (at the
cost of some context switch overhead).

>
> I had improving TLS access on my todo list. This would help TLS
> implementations for older ARMs and, in general, architectures that don't
> store the TCB in a register.
>
> My proposal is, modulo a small constant, equivalent to userspace accessing
> the tid without syscall overhead: just use an array of TCBs for the first
> 32768 tids and do a syscall only when the tid exceeds that.
>
> On the userspace side, my proposal would be to map that to a fixed virtual
> address and store the TCB in the first eight bytes. On context switch the
> kernel would save and restore these along with the registers. That would
> make TLS access cheap, as it would need only one extra load instruction
> versus a static variable.
>

The problem is that having the kernel access userspace memory on
context switch, while doable, is a little bit unpleasant.  We also
really need to get the ABI right the first time, because we don't
really get a second chance.

--Andy