----- On Jul 21, 2015, at 11:16 AM, Ondřej Bílka neleai@xxxxxxxxx wrote:

> On Tue, Jul 21, 2015 at 12:58:13PM +0000, Mathieu Desnoyers wrote:
>> ----- On Jul 21, 2015, at 3:30 AM, Ondřej Bílka neleai@xxxxxxxxx wrote:
>>
>> > On Tue, Jul 21, 2015 at 12:25:00AM +0000, Mathieu Desnoyers wrote:
>> >> >> Does it solve the Wine problem? If Wine uses gs for something and
>> >> >> calls a function that does this, Wine still goes boom, right?
>> >> >
>> >> > So the advantage of just making a global segment descriptor available
>> >> > is that it's not *that* expensive to just save/restore segments. So
>> >> > either wine could do it, or any library users would do it.
>> >> >
>> >> > But anyway, I'm not sure this is a good idea. The advantage of it is
>> >> > that the kernel support really is _very_ minimal.
>> >>
>> >> Considering that we'd at least also want this feature on ARM and
>> >> PowerPC 32/64, and that the gs segment selector approach clashes with
>> >> existing apps (wine), I'm not sure that implementing a gs segment
>> >> selector based approach to cpu number caching would lead to an overall
>> >> decrease in complexity if it leads to performance similar to those of
>> >> portable approaches.
>> >>
>> >> I'm perfectly fine with architecture-specific tweaks that lead to
>> >> fast-path speedups, but if we have to bite the bullet and implement
>> >> an approach based on TLS and registering a memory area at thread start
>> >> through a system call on other architectures anyway, it might end up
>> >> being less complex to add a new system call on x86 too, especially if
>> >> fast path overhead is similar.
>> >>
>> >> But I'm inclined to think that some aspect of the question eludes me,
>> >> especially given the amount of interest generated by the gs-segment
>> >> selector approach. What am I missing ?
>> >>
>> > As I wrote before you don't have to bite bullet as I said before. It
>> > suffices to create 128k element array with cpu for each tid, make that
>> > mmapable file and userspace could get cpu with nearly same performance
>> > without hacks.
>>
>> I don't see how this would be acceptable on memory-constrained embedded
>> systems. They have multiple cores, and performance requirements, so
>> having a fast getcpu would be useful there (e.g. telecom industry),
>> but they clearly cannot afford a 512kB table per process just for that.
>>
> Which just means that you need more complicated api and implementation
> for that but idea stays same. You would need syscalls
> register/deregister_cpuid_idx that would give you index used instead
> tid. A kernel would need to handle that many ids could be registered for
> each thread and resize mmaped file in syscalls.

I feel we're talking past each other here.

What I propose is to implement a system call that registers a TLS area.
It can be invoked at thread start. The kernel can then keep the current
CPU number within that registered area up-to-date. This system call does
not care how the TLS is implemented underneath; a rough userspace sketch
follows below.

My understanding is that you are suggesting a way to speed up TLS
accesses by creating a table indexed by TID. Although it might lead to
interesting speedups when reading the TLS, I don't see how your proposal
is useful in addressing the problem of caching the current CPU number
(other than possibly speeding up TLS accesses).

Or am I missing something fundamental to your proposal ?
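
For concreteness, here is a minimal, untested sketch of the userspace
side I have in mind. The system call does not exist today, so the
getcpu_cache_register() wrapper and the syscall number below are
placeholders for the proposal:

/*
 * Sketch only: the system call below does not exist; the syscall number
 * and the wrapper are placeholders illustrating the proposal. Each
 * thread registers its own TLS area once at thread start, and the
 * kernel keeps the cpu field current across migrations.
 */
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

#define __NR_getcpu_cache_register	326	/* placeholder number */

static __thread volatile int32_t cpu_cache = -1;	/* updated by the kernel */

static int getcpu_cache_register(void)
{
	/* Ask the kernel to keep cpu_cache up-to-date for this thread. */
	return syscall(__NR_getcpu_cache_register, &cpu_cache, sizeof(cpu_cache));
}

static inline int32_t read_cpu_fast(void)
{
	return cpu_cache;	/* fast path: a single TLS load, no trap */
}

int main(void)
{
	if (getcpu_cache_register() != 0)
		return 1;	/* a real implementation would fall back to sched_getcpu() */
	printf("running on cpu %d\n", (int)read_cpu_fast());
	return 0;
}

Compared to a tid-indexed table, the per-thread area keeps the memory
footprint at a few bytes per thread, independently of the tid space.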

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com