On Fri, 8 May 2009, Brandon Casey wrote:
> 
> plain 'git checkout' on linux kernel over NFS.

Thanks.

> Best time without patch: 1.20 seconds
> 
> 0.45user 0.71system 0:01.20elapsed 96%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+0outputs (0major+15467minor)pagefaults 0swaps
> 
> Best time with patch (core.preloadindex = true): 1.10 seconds
> 
> 0.43user 4.00system 0:01.10elapsed 402%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+0outputs (0major+13999minor)pagefaults 0swaps
> 
> Best time with patch (core.preloadindex = false): 0.84 seconds
> 
> 0.42user 0.39system 0:00.84elapsed 96%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+0outputs (0major+13965minor)pagefaults 0swaps

Ok, that is _disgusting_. The parallelism clearly works (402%CPU), but the
system time overhead is horrible. Going from 0.39s of system time to 4s of
system time is really quite nasty.

Is there any possibility you could oprofile this (run it in a loop to get
better profiles)? It very much sounds like some serious lock contention,
and I'd love to hear more about exactly which lock it's hitting.

Also, you're already almost totally CPU-bound, with 96% CPU for the
single-threaded case. So you may be running over NFS, but your NFS server
is likely pretty good and/or the client just captures everything in the
caches anyway.

I don't recall what the Linux NFS stat cache timeout is, but it's less
than a minute. I suspect that you ran things in a tight loop, which is why
you then got effectively the local caching behavior for the best times.
Can you do a "best time" check but with a 60-second pause between runs
(and before), to see what happens when the client doesn't do caching?

> Best time with read_cache_preload patch only: 1.38 seconds
> 
> 0.45user 4.42system 0:01.38elapsed 352%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+0outputs (0major+13990minor)pagefaults 0swaps

Yeah, here you're not getting any advantage of fewer lstats, and you show
the same "almost entirely CPU-bound on four cores" behavior, and the same
(probable) lock contention that has pushed the system time way up.

> The read_cache_preload() changes actually slow things down for me for
> this case.
> 
> Reduction in lstat's gives a nice 30% improvement.

Yes, I think the one-liner lstat avoidance is a real fix regardless.

And the preloading sounds like it hits serialization overhead in the
kernel, which I'm not at all surprised at, but not being surprised doesn't
mean that I'm not interested to hear where it is. The Linux VFS dcache
itself should scale better than that (but who knows - cacheline ping-pong
due to lock contention can easily cause a 10x slowdown even without being
_totally_ contended all the time).

So I would _suspect_ that it's some NFS lock that you're seeing, but I'd
love to know more.

Btw, those system times are pretty high to begin with, so I'd love to know
the kernel version and see a profile even without the parallel case and
its presumable lock contention. Because while I probably have a faster
machine anyway, what I see is:

  [torvalds@nehalem linux]$ /usr/bin/time git checkout
  0.13user 0.05system 0:00.19elapsed 98%CPU (0avgtext+0avgdata 0maxresident)k
  0inputs+0outputs (0major+13334minor)pagefaults 0swaps

ie my "system" time is _much_ lower than yours (and lower than your user
time). This is the 'without patch' time, btw, so this has extra lstat's.
And my system time is still lower than my user time, so I wonder where all
_your_ system time comes from. Your system time is much more comparable to
your user time even in the good case, and I wonder why?
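To be concrete about the profiling side: something like the following
completely untested sketch (oprofile's opcontrol interface, a vmlinux path
you'd obviously have to adjust, and an iteration count pulled out of thin
air) is roughly what I have in mind, running the checkout in a loop so the
samples actually add up:

  # untested sketch: needs oprofile installed, root for opcontrol
  sudo opcontrol --init
  sudo opcontrol --vmlinux=/path/to/vmlinux  # uncompressed image matching the running kernel
  sudo opcontrol --reset
  sudo opcontrol --start

  # run the workload in a loop to accumulate enough samples
  for i in $(seq 1 20); do
          git checkout
  done

  sudo opcontrol --stop
  opreport -l | head -40    # top symbols, kernel and user side
  sudo opcontrol --shutdown

If whatever we're contending on shows up as a kernel symbol near the top
of that report, that alone would narrow things down a lot.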
As to the "why": it could be just that kernel code tends to have more
cache misses, and my 8MB cache captures it all, and yours doesn't.

Regardless, a profile would be very interesting.

		Linus
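PS. For the cold-cache NFS numbers, something along these lines is what I
mean - again a completely untested sketch, assuming the core.preloadindex
knob from the patch you're testing, run from the kernel tree on the NFS
mount, with a sleep comfortably past the client attribute cache timeout:

  # untested sketch: three cold-cache runs for each preload setting
  for preload in false true; do
          git config core.preloadindex $preload
          for run in 1 2 3; do
                  sleep 65          # let the NFS attribute cache expire
                  /usr/bin/time git checkout
          done
  done

The interesting part is whether the parallel preload starts to win once
every lstat really has to go over the wire instead of being answered out
of the client-side cache.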