Re: anybody tried NPTL?

Ralf Baechle <ralf@xxxxxxxxxxxxxx> · Mon, 23 Aug 2004 19:12:57 +0200

On Mon, Aug 23, 2004 at 09:28:53AM -0400, Daniel Jacobowitz wrote:

> On Fri, Aug 20, 2004 at 02:46:11PM +0100, Dominic Sweetman wrote:
> > I guess our main message was that we felt it would be a mistake just
> > to add a thread register to o32 (which produces a substantially
> > incompatible new ABI anyway).
> 
> Completely agree...
> 
> > Until that all works, what we had in mind is that we'd do NPTL over
> > o32 by defining a system call to return a per-thread ID which is or
> > can be converted into a per-thread data pointer.  We suspected that
> > NPTL's per-thread-data model allows the use of cunning macros or
> > library functions to make that look OK.
> > 
> > Ought we to go further and see exactly how that can be done?
> 
> It shouldn't be at all hard.  The way NPTL's __thread support works,
> the only things that should have to know where the TLS base is are
> (A) GCC, so it can load it and (B) GDB, via some new ptrace op.  I
> don't know if you'd want to open-code the syscall or take the overhead
> of a function call.  Ralf had some ideas?

Thiemo and I have been compiling various pieces of code with different
gcc versions, trying to find the best possible register for that purpose.
We used code bloat as a (weak ...) indicator of register pressure.  It
turned out that $t9 was the best choice for all tested compiler versions;
thanks to the much improved register allocation of newer gcc the choice
of a particular register made far less difference on recent compilers
than on older compilers.

I've also implemented a fast system call for reading the thread register.
Benchmarks showed it to have about half the latency of a regular
syscall; the hope was that if gcc did some clever optimization, that
overhead would effectively become zero.
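
A minimal sketch of how the user-space side might look, under stated
assumptions: read_thread_pointer() is a hypothetical stand-in for the fast
syscall (here it just returns a static block so the example runs anywhere),
and struct thread_block and THREAD_VAR() are made-up names.  Marking the
accessor with GCC's const attribute, as glibc does for __errno_location(),
is one existing way to let the compiler merge repeated lookups within a
function.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-thread data layout; not taken from any real ABI. */
struct thread_block {
    int errno_value;
    int counter;
};

/* Stand-in storage so the sketch runs without kernel support. */
static struct thread_block demo_block;

/*
 * Hypothetical accessor.  A real port would issue the fast syscall here.
 * The const attribute tells GCC the result depends on nothing but the
 * (empty) argument list, so repeated calls may be merged or hoisted.
 */
__attribute__((const))
static struct thread_block *read_thread_pointer(void)
{
    return &demo_block;
}

/* "Cunning macro": makes a per-thread field look like a plain variable. */
#define THREAD_VAR(field) (read_thread_pointer()->field)

/* Each iteration names THREAD_VAR(); GCC is free to look the pointer up once. */
int bump_counter(int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        THREAD_VAR(counter) += 1;
    return THREAD_VAR(counter);
}
```

Calling bump_counter(3) on a fresh block leaves the counter at 3.  Whether
the lookup is really reduced to a single call depends on the optimization
level and on how the real syscall wrapper is written.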

I was favoring this low-overhead syscall approach because it would avoid
the loss of a register, thus leaving the performance of non-threaded code
unchanged, but other developers generally favor the permanent allocation
of $t9 as a thread register.

Other crazy ideas included a per-thread mapping containing the thread
pointer - and possibly more information in the future.
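
A rough user-space simulation of the per-thread mapping idea, assuming a
hypothetical struct thread_page layout.  In the real scheme the kernel would
map a different page for each thread at one fixed address; here
map_thread_page() just creates an anonymous read-only page with mmap() so
the lookup, a single load with no syscall, can be tried out.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical layout of the per-thread page. */
struct thread_page {
    void *thread_pointer;
    /* room for more per-thread information in the future */
};

/*
 * Simulates what the kernel would do at thread creation: create the page,
 * store the thread pointer in it, then make it read-only for user code.
 * Returns NULL on failure.
 */
struct thread_page *map_thread_page(void *tp)
{
    long page_size = sysconf(_SC_PAGESIZE);
    assert(page_size >= (long)sizeof(struct thread_page));

    void *page = mmap(NULL, (size_t)page_size, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED)
        return NULL;

    struct thread_page *p = page;
    p->thread_pointer = tp;

    if (mprotect(page, (size_t)page_size, PROT_READ) != 0) {
        munmap(page, (size_t)page_size);
        return NULL;
    }
    return p;
}

/* Reading the thread pointer is then one ordinary load. */
void *current_thread_pointer(const struct thread_page *p)
{
    return p->thread_pointer;
}
```

With a fixed mapping address the argument to current_thread_pointer() would
become a compile-time constant, which is what would make the approach cheap.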

Finally, one of the ideas was to use one of $k0 / $k1 as the thread pointer.
This would require changes to the exception handlers; any extra
instruction in the TLB refill handler would be particularly painful.

On the positive side, if we had multiple register sets on a MIPSxx V2
processor we could exploit that to get rid of this overhead and do
other nice optimizations for TLB reload as well.  Unfortunately these
register sets are only an optional feature of the architecture.

  Ralf