On Fri, 8 May 2009, Brandon Casey wrote:
> 
> plain 'git checkout' on linux kernel over NFS.

Thanks.

> Best time without patch: 1.20 seconds
> 
> 0.45user 0.71system 0:01.20elapsed 96%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+0outputs (0major+15467minor)pagefaults 0swaps
> 
> Best time with patch (core.preloadindex = true): 1.10 seconds
> 
> 0.43user 4.00system 0:01.10elapsed 402%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+0outputs (0major+13999minor)pagefaults 0swaps
> 
> Best time with patch (core.preloadindex = false): 0.84 seconds
> 
> 0.42user 0.39system 0:00.84elapsed 96%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+0outputs (0major+13965minor)pagefaults 0swaps

Ok, that is _disgusting_. The parallelism clearly works (402%CPU), but the
system time overhead is horrible. Going from 0.39s of system time to 4s of
system time is really quite nasty.

Is there any possibility you could oprofile this (run it in a loop to get
better profiles)? It very much sounds like some serious lock contention,
and I'd love to hear more about exactly which lock it's hitting.

Also, you're already almost totally CPU-bound, with 96% CPU for the
single-threaded case. So you may be running over NFS, but your NFS server
is likely pretty good and/or the client just captures everything in the
caches anyway.

I don't recall what the Linux NFS stat cache timeout is, but it's less
than a minute. I suspect that you ran things in a tight loop, which is why
you then got effectively the local caching behavior for the best times.
Can you do a "best time" check but with a 60-second pause between runs
(and before), to see what happens when the client doesn't do caching?

> Best time with read_cache_preload patch only: 1.38 seconds
> 
> 0.45user 4.42system 0:01.38elapsed 352%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+0outputs (0major+13990minor)pagefaults 0swaps

Yeah, here you're not getting any advantage of fewer lstats, and you show
the same "almost entirely CPU-bound on four cores" behavior, and the same
(probable) lock contention that has pushed the system time way up.

> The read_cache_preload() changes actually slow things down for me for
> this case.
> 
> Reduction in lstat's gives a nice 30% improvement.

Yes, I think the one-liner lstat avoidance is a real fix regardless.

And the preloading sounds like it hits serialization overhead in the
kernel, which I'm not at all surprised at, but not being surprised doesn't
mean that I'm not interested to hear where it is. The Linux VFS dcache
itself should scale better than that (but who knows - cacheline ping-pong
due to lock contention can easily cause a 10x slowdown even without being
_totally_ contended all the time).

So I would _suspect_ that it's some NFS lock that you're seeing, but I'd
love to know more.

Btw, those system times are pretty high to begin with, so I'd love to know
the kernel version and see a profile even without the parallel case and
its presumable lock contention. Because while I probably have a faster
machine anyway, what I see is:

  [torvalds@nehalem linux]$ /usr/bin/time git checkout
  0.13user 0.05system 0:00.19elapsed 98%CPU (0avgtext+0avgdata 0maxresident)k
  0inputs+0outputs (0major+13334minor)pagefaults 0swaps

ie my "system" time is _much_ lower than yours (and lower than your user
time). This is the 'without patch' time, btw, so this has extra lstat's.
And my system time is still lower than my user time, so I wonder where all
_your_ system time comes from. Your system time is much more comparable to
your user time even in the good case, and I wonder why?
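To be concrete about the profiling side: something like the following
completely untested sketch (oprofile's opcontrol interface, a vmlinux path
you'd obviously have to adjust, and an iteration count pulled out of thin
air) is roughly what I have in mind, running the checkout in a loop so the
samples actually add up:

  # untested sketch: needs oprofile installed, root for opcontrol
  sudo opcontrol --init
  sudo opcontrol --vmlinux=/path/to/vmlinux  # uncompressed image matching the running kernel
  sudo opcontrol --reset
  sudo opcontrol --start

  # run the workload in a loop to accumulate enough samples
  for i in $(seq 1 20); do
          git checkout
  done

  sudo opcontrol --stop
  opreport -l | head -40    # top symbols, kernel and user side
  sudo opcontrol --shutdown

If whatever we're contending on shows up as a kernel symbol near the top
of that report, that alone would narrow things down a lot.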
As to the "why": it could be just that kernel code tends to have more
cache misses, and my 8MB cache captures it all, and yours doesn't.

Regardless, a profile would be very interesting.

		Linus
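PS. For the cold-cache NFS numbers, something along these lines is what I
mean - again a completely untested sketch, assuming the core.preloadindex
knob from the patch you're testing, run from the kernel tree on the NFS
mount, with a sleep comfortably past the client attribute cache timeout:

  # untested sketch: three cold-cache runs for each preload setting
  for preload in false true; do
          git config core.preloadindex $preload
          for run in 1 2 3; do
                  sleep 65          # let the NFS attribute cache expire
                  /usr/bin/time git checkout
          done
  done

The interesting part is whether the parallel preload starts to win once
every lstat really has to go over the wire instead of being answered out
of the client-side cache.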