On Thu, 30 Apr 2009, Jakub Narebski wrote:

> Jakub Narebski <jnareb@xxxxxxxxx> writes:
>
> es> Two: Maybe Git is fast because Linus Torvalds is so smart.
>
> [non answer; the details are important]

I think Linus is certainly responsible for a big part of Git's speed.
He came up with the basic data structure used by Git, which has a lot
to do with it.  Also, he designed Git specifically to fulfill a need
for which none of the alternatives were fast enough.  Hence Git was
designed from the ground up with speed as one of the primary design
goals, such as being able to create multiple commits per second
instead of the other way around (several seconds per commit).  And
yes, Linus is usually smart enough, with the proper mindset, to
achieve such goals.

> es> Three: Maybe Git is fast because it's written in C instead of one
> es> of those newfangled higher-level languages.
> es>
> es> Nah, probably not.  Lots of people have written fast software in
> es> C#, Java or Python.
> es>
> es> And lots of people have written really slow software in
> es> traditional native languages like C/C++.  [...]
>
> Well, I guess that access to low-level optimization techniques like
> mmap is important for performance.  But here I am guessing and
> speculating like Eric did; well, I am asking on a proper forum ;-)
>
> We have some anecdotal evidence supporting this possibility (which
> Eric dismisses), namely the fact that pure-Python Bazaar is the
> slowest of the three most common open source DVCSs (Git, Mercurial,
> Bazaar) and the fact that parts of Mercurial were written in C for
> better performance.
>
> We can also compare implementations of Git in other, higher level
> languages with the reference implementation in C (and shell scripts,
> and Perl ;-)).  For example the most complete, I think, though still
> not fully complete, Java implementation: JGit.  I hope that the JGit
> developers can tell us whether using a higher level language affects
> performance, how much, and what features of the higher-level language
> cause the decrease in performance.  Of course we have to take into
> account the possibility that JGit simply isn't as well optimized
> because of less manpower.

One of the main JGit developers is Shawn Pearce.  If you look at
Shawn's contributions to C Git, they are mostly related to performance
issues.  Amongst other things, he is the author of git-fast-import, he
contributed the pack access windowing code, and he was also involved
in the initial design of pack v4.  Hence Shawn is a smart guy who
certainly knows a thing or two about performance optimizations.  Yet
he reported on this list that his efforts to make JGit faster were not
very successful anymore, most probably due to the language overhead.
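
To make the mmap point above a bit more concrete: one of the things
the pack access windowing code does is map regions of a pack file
directly into memory and let the kernel's page cache do the buffering.
The sketch below is only my own illustration of that technique, not
the actual Git code; the function name and the fixed window size are
made up, and the real-world concerns (window reuse, LRU eviction,
32-bit address space limits, fuller error handling) are left out.

    #include <stddef.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <sys/types.h>

    /* Hypothetical window size; must be a multiple of the page size. */
    #define WINDOW_SIZE ((off_t)32 * 1024 * 1024)

    /*
     * Map the window of the pack file (given by fd) that contains
     * 'offset' and return a pointer to that offset inside the mapping.
     * '*avail' tells the caller how many mapped bytes follow that
     * point.  Returns NULL on failure.
     */
    static const unsigned char *map_pack_window(int fd, off_t offset,
                                                size_t *avail)
    {
            struct stat st;
            off_t start, len;
            void *win;

            if (fstat(fd, &st) < 0 || offset >= st.st_size)
                    return NULL;

            /* Align the window start so mmap() gets a page-aligned offset. */
            start = offset & ~(WINDOW_SIZE - 1);
            len = st.st_size - start;
            if (len > WINDOW_SIZE)
                    len = WINDOW_SIZE;

            /*
             * The kernel pages the data in on demand and keeps it in
             * the page cache, shared with every other process reading
             * the same pack: no read() calls, no user-space copies.
             */
            win = mmap(NULL, (size_t)len, PROT_READ, MAP_PRIVATE, fd, start);
            if (win == MAP_FAILED)
                    return NULL;

            *avail = (size_t)(len - (offset - start));
            return (const unsigned char *)win + (offset - start);
    }

A managed language can get at mmap too, of course (Java has NIO mapped
buffers), but every access then goes through extra bounds checks and
abstraction layers, which is presumably part of the overhead Shawn is
fighting in JGit.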
> es> Four: Maybe Git is fast because being fast is the primary goal
> es> for Git.
>
> [non answer; the details are important]

Still, this one is actually true (see about Linus above).  Without
such a goal, you quickly lose sight of performance regressions.

> es> Maybe Git is fast because every time they faced one of these "buy
> es> vs. build" choices, they decided to just write it themselves.
>
> I don't think so.  Rather the opposite is true.  Git uses libcurl for
> HTTP transport.  Git uses zlib for compression.  Git uses SHA-1 from
> OpenSSL or from Mozilla.  Git uses (modified, internal) LibXDiff for
> (binary) deltifying, for diffs and for merges.

Well, I think he's right on this point as well.  libcurl is not so
relevant since it is rarely the bottleneck (the network bandwidth
itself usually is).  zlib is already as fast as it can be, as multiple
attempts to make it faster didn't succeed.  Git already carries its
own version of the SHA-1 code for ARM and PPC because the alternatives
were slower.  The fact that libxdiff was made internal is indeed to
get better impedance matching with the core code, otherwise it could
have remained fully external just like zlib.  And the binary delta
code is not libxdiff anymore but a much smaller, straightforward
version, optimized to death to achieve speed over versatility (no need
to be versatile when strictly dealing with Git's needs only).  A toy
sketch of that copy/insert delta idea is appended at the end of this
message.

> es> Seven: Maybe Git isn't really that fast.
> es>
> es> If there is one thing I've learned about version control it's
> es> that everybody's situation is different.  It is quite likely that
> es> Git is a lot faster for some scenarios than it is for others.
> es>
> es> How does Git handle really large trees?  Git was designed
> es> primarily to support the efforts of the Linux kernel developers.
> es> A lot of people think the Linux kernel is a large tree, but it's
> es> really not.  Many enterprise configuration management
> es> repositories are FAR bigger than the Linux kernel.
>
> cf. "Why Perforce is more scalable than Git" by Steve Hanov
> http://gandolf.homelinux.org/blog/index.php?id=50
>
> I don't really know about this.

Git certainly sucks big time with large files.  Git also sucks, to a
lesser extent (but still), with very large repositories.  But large
trees?  I don't think Git is worse than anything out there with a
large tree of average size files.  Yet this point is misleading:
when people give Git the reputation of being faster, it certainly
comes from comparing operations performed on the same source tree.
Who cares about scenarios for which the tool was not designed?  Those
"enterprise configuration management repositories" are not what Git
was designed for indeed, but neither were Mercurial or Bazaar, or any
other contender Git is usually compared to.

Nicolas
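
P.S. Here is the toy copy/insert sketch mentioned above.  To be clear,
this is not Git's actual pack delta encoding (which squeezes commands,
offsets and lengths into variable-length byte sequences); the opcodes
and the fixed-width layout below are invented purely to show the
principle, and a real implementation would validate every offset and
length against the buffer sizes.

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Made-up opcode values for this example only. */
    enum { CMD_COPY = 1, CMD_INSERT = 2 };

    /*
     * Rebuild a target buffer from a source buffer plus a delta.
     * Command layout (invented for this example):
     *   COPY:   1-byte opcode, 4-byte source offset, 4-byte length
     *   INSERT: 1-byte opcode, 4-byte length, then the literal bytes
     * Returns the number of bytes written to 'dst'.
     */
    static size_t apply_toy_delta(const unsigned char *src,
                                  const unsigned char *delta,
                                  size_t delta_len,
                                  unsigned char *dst)
    {
            size_t i = 0, out = 0;

            while (i < delta_len) {
                    unsigned char cmd = delta[i++];
                    uint32_t off, len;

                    if (cmd == CMD_COPY) {
                            /* Reuse a chunk of the source object. */
                            memcpy(&off, delta + i, 4); i += 4;
                            memcpy(&len, delta + i, 4); i += 4;
                            memcpy(dst + out, src + off, len);
                    } else {
                            /* CMD_INSERT: literal data from the delta. */
                            memcpy(&len, delta + i, 4); i += 4;
                            memcpy(dst + out, delta + i, len);
                            i += len;
                    }
                    out += len;
            }
            return out;
    }

Git's real delta application is the same idea in the end: one tight
loop of memcpy()s driven by the delta stream, which is part of why
reconstructing objects from delta chains is so cheap.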