Re: [RFC PATCH] getcpu_cache system call: caching current CPU number (x86)

Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx> · Tue, 21 Jul 2015 00:25:00 +0000 (UTC)

----- On Jul 20, 2015, at 6:39 PM, Linus Torvalds torvalds@xxxxxxxxxxxxxxxxxxxx wrote:

> On Mon, Jul 20, 2015 at 2:09 PM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>>
>> Annoying problem one: the segment base field is only 32 bits in the GDT.
> 
> Ok. So if we go this way, we'd make the rule be something like "the
> segment base is the CPU number shifted up by the page size", and then
> you'd have to add some magic offset that we'd declare as the "per-cpu
> page offset".
> 
>>> - user space can just load the segment selector in %gs
>>
>> IIRC this is very expensive -- 40 cycles or so.  At this point
>> userspace might as well just use a real lock cmpxchg.
> 
> So cmpxchg may be as many cycles, but
> 
> (a) you can choose to load the segment just once, and do several
> operations with it
> 
> (b) often - but admittedly not always - the real cost of a
> non-cpu-local local and cmpxchg tends to be the cacheline ping-pong,
> not the CPU cycles.
> 
> so I agree, loading a segment isn't free. But it's not *that*
> expensive, and you could always decide to keep the segment loaded and
> just do
> 
> - read segment selector
> - if NULL segment, reload it.
> 
> although that only works if you own the segment entirely and can keep
> it as the percpu segment (ie obviously not the Wine case, for
> example).
> 
>> Does it solve the Wine problem?  If Wine uses gs for something and
>> calls a function that does this, Wine still goes boom, right?
> 
> So the advantage of just making a global segment descriptor available
> is that it's not *that* expensive to just save/restore segments. So
> either wine could do it, or any library users would do it.
> 
> But anyway, I'm not sure this is a good idea. The advantage of it is
> that the kernel support really is _very_ minimal.
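
To make the rule quoted above concrete, here is a rough sketch of the
address arithmetic as I understand it. It is only an illustration, not
kernel code: PERCPU_PAGE_SHIFT assumes 4 KiB pages, and
PERCPU_PAGE_OFFSET is a made-up value standing in for the "per-cpu page
offset" that would have to be declared. Both helper names are
hypothetical.

```c
#include <stdint.h>

/* Assumption: 4 KiB pages, so "shifted up by the page size" means << 12. */
#define PERCPU_PAGE_SHIFT	12u

/* Hypothetical placeholder for the declared "per-cpu page offset". */
#define PERCPU_PAGE_OFFSET	UINT64_C(0x7f0000000000)

/*
 * Segment base the kernel would install in the descriptor for a given CPU.
 * The GDT base field is only 32 bits wide, hence the uint32_t result.
 */
static inline uint32_t percpu_segment_base(uint32_t cpu)
{
	return cpu << PERCPU_PAGE_SHIFT;
}

/*
 * Linear address user space ends up touching when it accesses
 * %gs:(PERCPU_PAGE_OFFSET + offset) with that segment loaded.
 */
static inline uint64_t percpu_linear_address(uint32_t cpu, uint64_t offset)
{
	return (uint64_t)percpu_segment_base(cpu) + PERCPU_PAGE_OFFSET + offset;
}
```

With a 32-bit base and a 12-bit shift, the CPU number has to fit in
20 bits, which is why the magic offset is needed to place the per-cpu
pages anywhere useful in a 64-bit address space.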

Considering that we'd also want this feature at least on ARM and
PowerPC 32/64, and that the gs segment selector approach clashes with
existing apps (wine), I'm not sure that a gs segment selector based
approach to cpu number caching would reduce overall complexity if its
performance ends up similar to that of the portable approaches.

I'm perfectly fine with architecture-specific tweaks that lead to
fast-path speedups, but if we have to bite the bullet and implement
an approach based on TLS and registering a memory area at thread start
through a system call on other architectures anyway, it might end up
being less complex to add a new system call on x86 too, especially if
fast path overhead is similar.
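
For reference, the user-space side of the portable approach could look
roughly like the sketch below. This is a simplified illustration, not
the patch itself: cpu_cache, register_cpu_cache() and
read_current_cpu() are hypothetical names, and the registration
function is a stub that always reports "unsupported", since the system
call is only a proposal. The fallback uses the existing getcpu system
call.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical per-thread cache in TLS. Once registered, the kernel would
 * rewrite it whenever the thread migrates. -1 means "not valid".
 */
static __thread volatile int32_t cpu_cache = -1;

/*
 * Stand-in for the proposed registration system call, done once at thread
 * start. This stub registers nothing and returns -1 (unsupported).
 */
static inline int register_cpu_cache(volatile int32_t *cache)
{
	(void)cache;
	return -1;
}

/*
 * Fast path: a single TLS load. Slow path: fall back to the getcpu
 * system call when the cache has not been populated.
 */
static inline int read_current_cpu(void)
{
	int32_t cached = cpu_cache;
	unsigned int cpu;

	if (cached >= 0)
		return cached;
	if (syscall(SYS_getcpu, &cpu, NULL, NULL) == 0)
		return (int)cpu;
	return -1;
}
```

The point is that the fast path is one thread-local load with no
segment register involved, so it ports to any architecture with TLS.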

But I'm inclined to think that some aspect of the question eludes me,
especially given the amount of interest generated by the gs-segment
selector approach. What am I missing?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com