Re:

Linus Torvalds wrote:
> 
> On Fri, 8 May 2009, Brandon Casey wrote:
>> plain 'git checkout' on linux kernel over NFS.
> 
> Thanks.
> 
>> Best time without patch: 1.20 seconds
>>
>>   0.45user 0.71system 0:01.20elapsed 96%CPU (0avgtext+0avgdata 0maxresident)k
>>   0inputs+0outputs (0major+15467minor)pagefaults 0swaps
>>
>> Best time with patch (core.preloadindex = true): 1.10 seconds
>>
>>   0.43user 4.00system 0:01.10elapsed 402%CPU (0avgtext+0avgdata 0maxresident)k
>>   0inputs+0outputs (0major+13999minor)pagefaults 0swaps
>>
>> Best time with patch (core.preloadindex = false): 0.84 seconds
>>
>>   0.42user 0.39system 0:00.84elapsed 96%CPU (0avgtext+0avgdata 0maxresident)k
>>   0inputs+0outputs (0major+13965minor)pagefaults 0swaps
> 
> Ok, that is _disgusting_. The parallelism clearly works (402%CPU), but the 
> system time overhead is horrible. Going from 0.39s system time to 4s of 
> system time is really quite nasty.
> 
> Is there any possibility you could oprofile this (run it in a loop to get 
> better profiles)? It very much sounds like some serious lock contention, 
> and I'd love to hear more about exactly which lock it's hitting.

Possibly, I'll see if our sysadmin has time to "play".
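
If he does, I'd guess the session looks something like this -- just a sketch,
assuming the stock opcontrol/opreport tools are installed and a vmlinux with
symbols is available (the path below is a placeholder, and all of this runs
as root):

    opcontrol --init                          # load the oprofile module
    opcontrol --reset                         # clear any old samples
    opcontrol --vmlinux=/path/to/vmlinux      # kernel image with symbols
    opcontrol --start
    for i in $(seq 1 20); do git checkout >/dev/null 2>&1; done   # loop to accumulate samples
    opcontrol --dump                          # flush samples to disk
    opreport --symbols | head -40             # top symbols by sample count
    opcontrol --shutdown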

> Also, you're already almost totally CPU-bound, with 96% CPU for the 
> single-threaded case. So you may be running over NFS, but your NFS server 
> is likely pretty good and/or the client just captures everything in the 
> caches anyway.
> 
> I don't recall what the Linux NFS stat cache timeout is, but it's less 
> than a minute. I suspect that you ran things in a tight loop, which is why 
> you then got effectively the local caching behavior for the best times. 

Yeah, that's what I did.

> Can you do a "best time" check but with a 60-second pause between runs 
> (and before), to see what happens when the client doesn't do caching?

No problem.
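
Roughly what I plan to run -- a sketch; the 60-second sleep is just meant to
outlast the NFS attribute cache timeout, which I haven't actually checked on
this client:

    git config core.preloadindex false
    for i in 1 2 3; do sleep 60; /usr/bin/time git checkout; done

    git config core.preloadindex true
    for i in 1 2 3; do sleep 60; /usr/bin/time git checkout; done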

>> Best time with read_cache_preload patch only: 1.38 seconds
>>
>>   0.45user 4.42system 0:01.38elapsed 352%CPU (0avgtext+0avgdata 0maxresident)k
>>   0inputs+0outputs (0major+13990minor)pagefaults 0swaps
> 
> Yeah, here you're not getting any advantage of fewer lstats, and you 
> show the same "almost entirely CPU-bound on four cores" behavior, and the 
> same (probable) lock contention that has pushed the system time way up.
> 
>> The read_cache_preload() changes actually slow things down for me for this
>> case.
>>
>> Reduction in lstats gives a nice 30% improvement.
> 
> Yes, I think the one-liner lstat avoidance is a real fix regardless. And 
> the preloading sounds like it hits serialization overhead in the kernel, 
> which I'm not at all surprised at, but not being surprised doesn't mean 
> that I'm not interested to hear where it is.
> 
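
(For what it's worth, the lstat counts behind that 30% number can be sanity
checked with strace's syscall summary -- a rough sketch, with -f so the
preload threads are included; on this kernel the calls may show up as
lstat64:

    strace -f -c git checkout 2>&1 | grep -i lstat

strace adds enough overhead that I'd only use it for counting calls, not for
timing.)
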
> The Linux VFS dcache itself should scale better than that (but who knows - 
> cacheline ping-pong due to lock contention can easily cause a 10x slowdown 
> even without being _totally_ contended all the time). So I would _suspect_ 
> that it's some NFS lock that you're seeing, but I'd love to know more.
> 
> Btw, those system times are pretty high to begin with, so I'd love to know 
> kernel version and see a profile even without the parallel case and 
> presumably lock contention. Because while I probably have a faster 
> machine anyway, what I see is:
> 
> 	[torvalds@nehalem linux]$ /usr/bin/time git checkout
> 	0.13user 0.05system 0:00.19elapsed 98%CPU (0avgtext+0avgdata 0maxresident)k
> 	0inputs+0outputs (0major+13334minor)pagefaults 0swaps
> 
> ie my "system" time is _much_ lower than yours (and lower than your system 
> time). This is the 'without patch' time, btw, so this has extra lstat's. 
> And my system time is still lower than my user time, so I wonder where all 
> _your_ system time comes from. Your system time is much more comparable to 
> user time even in the good case, and I wonder why?
> 
> Could be just that kernel code tends to have more cache misses, and my 8MB 
> cache captures it all, and yours doesn't. Regardless, a profile would be 
> very interesting.

Something is definitely up.

I provided timing results for your original preload_cache implementation
(which affected status and diff); those results were part of the justification
for merging it in.

   http://article.gmane.org/gmane.comp.version-control.git/100998

You can see that cold-cache system time for 'git status' went from 0.36 to
0.52 seconds.  Fine.  I just ran it again, and now I'm getting a system time
of 10 seconds!  This is the same machine.

Similarly for the cold-cache 'git checkout' reruns:

Best time without patch: 6.02 seconds (system time 1.57)

  0.43user 1.57system 0:06.02elapsed 33%CPU (0avgtext+0avgdata 0maxresident)k
  5336inputs+0outputs (12major+15472minor)pagefaults 0swaps

Best time with patch (preload_cache, lstat reduction): 2.69 seconds (system time 10.47)

  0.45user 10.47system 0:02.69elapsed 405%CPU (0avgtext+0avgdata 0maxresident)k
  5336inputs+0outputs (12major+13985minor)pagefaults 0swaps


OS: CentOS 4.7

$ cat /proc/version
Linux version 2.6.9-78.0.17.ELsmp (mockbuild@xxxxxxxxxxxxxxxxxxxx) (gcc version 3.4.6 20060404 (Red Hat 3.4.6-9)) #1 SMP Thu Mar 12 20:05:15 EDT 2009

-brandon
