Re: TLS/NPTL for m68k and ColdFire

Richard Zidlicky <rz@xxxxxxxxxxxxxx> · Wed, 7 May 2008 13:53:26 +0200

Hello,

my 0.02c

On Mon, May 05, 2008 at 11:15:11AM -0600, Kurt Mahan wrote:
From reading the discussion it appears that Joseph's proposal of using
the VDSO approach was questioned and alternatives such as the ARM Magic
page were looked at but that VDSO still seems to be the way to go.  Is
this agreed or should alternatives be looked at?  If this isn't the
direction the community finds acceptable let's attempt to resolve it.

I am not experienced with either VDSOs or magic pages and cannot judge the
merits of either from the proposal alone.

In principle I am fine with VDSO, it might be interesting to discuss some
implementation details in depth.

As for the locking instructions that the ColdFire does not provide, would
it be more straightforward to use a new relocation type that would patch in
either the 680x0 insn inline or a subroutine call?

I did think of one way to handle this that may be relatively logical,
but it requires a few hooks and needs some comment. The basic idea is
to put two extra pointers into the thread in the kernel. One would be
the TLS area as set by user-space. The other is a pointer to the page
that is currently seen by that thread as the VDSO page that contains
the location where the VDSO will look for the TLS pointer. In the mm
code, we check on the copy to see if we are copying and replacing the
page that is held as the special VDSO page and update the pointer in
the thread to the new page. A context switch would always write the
value in the TLS save slot in the thread to the page referenced in the
thread structure as the VDSO thread. It does mean that the VDSO page(s)
must never be swapped out. Any comments on this idea? Could it work? Is
there a way we could make the VDSO swappable without breaking the code
in the context switch?
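
To make the two-pointer scheme above concrete, here is a minimal user-space
model in C. It is only a sketch: struct vdso_page, struct thread_state,
switch_in() and vdso_page_replaced() are hypothetical names, not existing
kernel interfaces, and the real hooks would live in the context switch and
in the mm copy path.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the VDSO data page; only the TLS slot is modelled. */
struct vdso_page {
    uintptr_t tls_ptr;          /* location the VDSO code would read */
};

/* Hypothetical per-thread state: the two extra pointers from the proposal. */
struct thread_state {
    uintptr_t tls_save;         /* TLS area as set by user space */
    struct vdso_page *vdso;     /* page this thread currently sees as its VDSO page */
};

/* Context-switch hook: publish the incoming thread's TLS value in its page. */
static void switch_in(struct thread_state *next)
{
    if (next->vdso != NULL)
        next->vdso->tls_ptr = next->tls_save;
}

/* MM hook: the page was copied and replaced, so retarget the thread's pointer. */
static void vdso_page_replaced(struct thread_state *t,
                               struct vdso_page *old_page,
                               struct vdso_page *new_page)
{
    if (t->vdso == old_page) {
        *new_page = *old_page;  /* the copy carries the old contents over */
        t->vdso = new_page;
    }
}

/* What the VDSO helper would do on behalf of user space: a plain read. */
static uintptr_t read_tls(const struct vdso_page *page)
{
    return page->tls_ptr;
}
```

In this model the write in switch_in() is the per-switch cost, and it only
works if the page referenced by the thread is always resident, which is the
"never swapped out" restriction mentioned above.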

For the multi-CPU case I think it is difficult to avoid a special
per CPU page holding the current TLS pointer. Am I missing something?

This page could in principle be shared by all processes per CPU so it is 
mostly irrelevant whether it can be swapped out. Either way it seems to
involve some interesting VM/MM magic.
I wonder if the TLS pointer is the only thing that would be variable between
processes? 

From userspace the special page might be accessed e.g. by open and mmap of a
special device, to keep the number of obscure syscalls low, although I do not
see this as an important detail.

However the more interesting question is how to update this special page, if
it is indeed the way to go.
Doing it on every context switch would make every context switch a few insns
slower for all processes, whereas I expect the TLS functionality to be
used rarely. Does anyone have a guess how frequently it would be used?
Last time I tried to benchmark it we had around 250 cycles per process switch
on a 68060. I hope a thread switch would be slightly faster still, so a few
instructions can matter.

I am not sure what the VDSO approach would be, but it seems to involve either
some MM tricks or some updating on context switch, so in principle there would
still be some cost on every context switch?

So I was wondering whether it is possible to avoid that penalty, and I have
some rough ideas.
Again there would have to be a special page for every CPU (shared by every
TLS-using process) but the context switch would do nothing to update this page.
Instead a thread wanting to use the TLS pointer would determine whether the
information in the page is valid or needs to be updated.
The key point is that the thread should be able to determine the validity of
the page very quickly, without involving any MM magic or kernel help, so that
the fast path does not take a hit.
For this purpose I believe it would be enough if the thread compared its
stored PID and current stack pointer against the PID and stack bounds stored in
the special page. The cost of that operation seems small enough compared
to GOT reloading, so it should be acceptable for a fast path?
In the fast case the values would match and the TLS lookup is a simple read;
otherwise the VDSO magic or whatever else would be invoked to update the
TLS pointer as well as the PID and stack bounds of the current thread.
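
A rough sketch of that check in C, to show how little the fast path has to
do. Everything here is an assumption for illustration: the page layout
(struct tls_cache_page), the per-thread record (struct thread_info_user) and
refresh_page(), which stands in for whatever slow path (VDSO call, syscall)
would really supply the values. The sketch also ignores what happens if the
thread is preempted between the check and the read.

```c
#include <stdint.h>

/* Hypothetical layout of the per-CPU special page. */
struct tls_cache_page {
    int32_t   pid;        /* PID of the thread the page is valid for */
    uintptr_t stack_lo;   /* lowest valid stack address (inclusive) */
    uintptr_t stack_hi;   /* end of the stack (exclusive) */
    uintptr_t tls_ptr;    /* cached TLS pointer */
};

/* Hypothetical record of what the slow path would report for a thread. */
struct thread_info_user {
    int32_t   pid;
    uintptr_t stack_lo;
    uintptr_t stack_hi;
    uintptr_t tls_ptr;
};

/* Counts slow-path invocations so the behaviour can be observed. */
static unsigned slow_path_calls;

/* Stand-in for the slow path: rewrite the page for the calling thread. */
static void refresh_page(struct tls_cache_page *page,
                         const struct thread_info_user *self)
{
    slow_path_calls++;
    page->pid      = self->pid;
    page->stack_lo = self->stack_lo;
    page->stack_hi = self->stack_hi;
    page->tls_ptr  = self->tls_ptr;
}

/* Fast path: one PID compare and two bounds compares, then a plain read. */
static uintptr_t get_tls(struct tls_cache_page *page,
                         const struct thread_info_user *self,
                         uintptr_t sp)
{
    if (page->pid != self->pid ||
        sp < page->stack_lo || sp >= page->stack_hi)
        refresh_page(page, self);
    return page->tls_ptr;
}
```

Threads of one process share the PID, so in this sketch it is the stack
bounds that tell them apart; the PID compare only separates processes.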

Alternatively it should even be possible to use assumptions about thread stack
area allocation to calculate a thread index into a TLS lookup table based on
the current stack pointer alone, making it a userspace-only solution. This
could even avoid magic pages, although they might be of some benefit as a quick
cache even in this case.
I am not familiar enough with it to see how easily this would break if some
thread does fancy things with the stack, although I think it would be fair
(and complicate things even further..) to have a slow kernel-based fallback
solution for such threads.
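
As an illustration of the stack-based index, here is a sketch in C under one
particular assumption that is mine, not something any existing thread library
is known to guarantee: all thread stacks are equal-sized power-of-two slots
carved out of one contiguous region. The names (struct stack_tls_map,
STACK_SLOT_SHIFT, MAX_THREADS) are made up for the example.

```c
#include <stddef.h>
#include <stdint.h>

#define STACK_SLOT_SHIFT 16u    /* assumed: 64 KiB of stack per thread slot */
#define MAX_THREADS      64u    /* assumed: size of the lookup table */

/* Hypothetical userspace-only table mapping stack slots to TLS pointers. */
struct stack_tls_map {
    uintptr_t region_base;          /* start of the stack region */
    uintptr_t tls[MAX_THREADS];     /* TLS pointer for each stack slot */
};

/* Returns 1 and stores the slot index if sp lies in the region, else 0. */
static int stack_to_index(const struct stack_tls_map *m, uintptr_t sp,
                          size_t *index)
{
    uintptr_t slot;

    if (sp < m->region_base)
        return 0;
    slot = (sp - m->region_base) >> STACK_SLOT_SHIFT;
    if (slot >= MAX_THREADS)
        return 0;
    *index = (size_t)slot;
    return 1;
}

/* Returns the TLS pointer, or 0 when the slow fallback would be needed. */
static uintptr_t lookup_tls(const struct stack_tls_map *m, uintptr_t sp)
{
    size_t i;

    return stack_to_index(m, sp, &i) ? m->tls[i] : 0;
}
```

A thread that switches to a stack outside the region (alternate signal
stack, user-level coroutines) falls out of the table and gets 0 here, which
is where the slow kernel-based fallback would have to take over.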

Richard