Re: Git and GCC

David Miller <davem@xxxxxxxxxxxxx> · Fri, 07 Dec 2007 17:55:29 -0800 (PST)

From: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
Date: Fri, 7 Dec 2007 09:23:47 -0800 (PST)

> 
> 
> On Fri, 7 Dec 2007, David Miller wrote:
> > 
> > Also I could end up being performance limited by SHA, it's not very
> > well tuned on Sparc.  It's been on my TODO list to code up the crypto
> > unit support for Niagara-2 in the kernel, then work with Herbert Xu on
> > the userland interfaces to take advantage of that in things like
> > libssl.  Even a better C/asm version would probably improve GIT
> > performance a bit.
> 
> I doubt yu can use the hardware support. Kernel-only hw support is 
> inherently broken for any sane user-space usage, the setup costs are just 
> way way too high. To be useful, crypto engines need to support direct user 
> space access (ie a regular instruction, with all state being held in 
> normal registers that get saved/restored by the kernel).

Unfortunately they are hypervisor calls, and you have to give
the thing physical addresses for the buffer to work on, so
letting userland get at it directly isn't currently doable.

I still believe that there are cases where userland can take
advantage of in-kernel crypto devices, such as when we are
streaming the data into the kernel anyways (for a write()
or sendmsg()) and the user just wants the transformation to
be done on that stream.

As a specific case, hardware crypto SSL support works quite
well for sendmsg() user packet data.  And this the kind of API
Solaris provides to get good SSL performance with Niagara.

> > Is SHA a significant portion of the compute during these repacks?
> > I should run oprofile...
> 
> SHA1 is almost totally insignificant on x86. It hardly shows up. But we 
> have a good optimized version there.

Ok.

> zlib tends to be a lot more noticeable (especially the uncompression: it 
> may be faster than compression, but it's done _so_ much more that it 
> totally dominates).

zlib is really hard to optimize on Sparc, I've tried numerous times.
Actually compress is the real cycle killer, and in that case the inner
loop wants to dereference 2-byte shorts at a time but they are
unaligned half of the time, and any the check for alignment nullifies
the gains of avoiding the two byte loads.

Uncompress I don't think is optimized at all on any platform with
asm stuff like the compress side is.  It's a pretty straightforward
transformation and the memory accesses dominate the overhead.

I'll do some profiling to see what might be worth looking into.
-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html