Why Git is so fast (was: Re: Eric Sink's blog - notes on git, dscms and a "whole product" approach)

Jakub Narebski <jnareb@xxxxxxxxx> · Thu, 30 Apr 2009 05:17:58 -0700 (PDT)

Jakub Narebski <jnareb@xxxxxxxxx> writes:

> Martin Langhoff <martin.langhoff@xxxxxxxxx> writes:
> 
> > Eric Sink hs been working on the (commercial, proprietary) centralised
> > SCM Vault for a while. He's written recently about his explorations
> > around the new crop of DSCMs, and I think it's quite interesting.

[...]

> > So here's the blog -  http://www.ericsink.com/

[...]
> *  "Mercurial, Subversion, and Wesley Snipes"
> >   http://www.ericsink.com/entries/hg_denzel.html
> 
> which I will comment now.  The 'ES>' prefix means quoting above blog
> post.

[...]
> ES> * The one where I speculate cluelessly about why Git is so fast
> 
> where Eric guesses instead of asking on git mailing list or #git
> channel... ;-)

This issue is interesting: what features and what design decision
make Git fast? One of the goals of Git was good performance; are
we there?

All quotes marked 'es> ' below are from "Why is Git so Fast?" post
http://www.ericsink.com/entries/why_is_git_fast.html

es> One:  Maybe Git is fast simply because it's a DVCS.
es>
es> There's probably some truth here.  One of the main benefits touted
es> by the DVCS fanatics is the extra performance you get when
es> everything is "local".

This is I think quite obvious.  Accessing memory is faster than
acessing disk, which in turn is faster than accessing network.  So if
commit and (change)log does not require access to server via network,
they are so much faster.

BTW. that is why Subversion stores along working copy 'pristine'
versions of files: to make status and diff fast enough to be usable.
Which in turn might make SVN checkout to be larger than full Git
clone ;-)

es>
es> But this answer isn't enough.  Maybe it explains why Git is faster
es> than Subversion, but it doesn't explain why Git is so often
es> described as being faster than the other DVCSs.

Not only described; see http://git.or.cz/gitwiki/GitBenchmarks
(although some, if not most of those benchmarks are dated,
and e.g. Bazaar claims to have much better performance now).

es>
es> Two:  Maybe Git is fast because Linus Torvalds is so smart.

[non answer; the details are important]

es> Three: Maybe Git is fast because it's written in C instead of one
es> of those newfangled higher-level languages.
es>
es> Nah, probably not.  Lots of people have written fast software in
es> C#, Java or Python.
es>
es> And lots of people have written really slow software in
es> traditional native languages like C/C++. [...]

Well, I guess that access to low-level optimization techniques like
mmap are important for performance.  But here I am guessing and
speculating like Eric did; well, I am asking on a proper forum ;-)

We have some anegdotical evidence supporting this possibility (which
Eric dismisses), namely the fact that pure-Python Bazaar is slowest of
three most common open source DVCS (Git, Mercurial, bazaar) and the
fact that parts of Mercurial were written in C for better performance.

We can also compare implementations of Git in other, higher level
languages, with reference implementation in C (and shell scripts, and
Perl ;-)).  For example most complete I think but still not fully
complete Java implementation: JGit.  I hope that JGit developers can
tell us whether using higher level language affects performance, how
much, and what features of higher-level language are causing decrease
in performance.  Of course we have to take into account the
possibility that JGit isn't simply as well optimized because of less
manpower.

es>
es> Four: Maybe Git is fast because being fast is the primary goal for
es> Git.

[non answer; the details are important]

es>
es> Five:  Maybe Git is fast because it does less.
es>
es> One of my favorite recent blog entries is this piece[1] which
es> claims that the way to make code faster is to have it do less.
es>
es> [1] "How to write fast code" by Kas Thomas
es>     http://asserttrue.blogspot.com/2009/03/how-to-write-fast-code.html
[...]

es>
es> For example, the way you get something in the Git index is you use
es> the "git add" command.  Git doesn't scan your working copy for
es> changed files unless you explicitly tell it to.  This can be a
es> pretty big performance win for huge trees.  Even when you use the
es> "remember the timestamp" trick, detecting modified files in a
es> really big tree can take a noticeable amount of time.

That of course depends on how you compare performance of different
version control systems (to not compare apples with oranges).  But if
you compare e.g. "<scm> commit" with Git equivalent "git commit -a"
the above is simply not true.

BTW. when doing comparison you have to take care of the reverse,
e.g. git doing more like calculating and dislaying diffstat by default
for merges/pulls.

es>
es> Or maybe Git's shortcut for handling renames is faster than doing
es> them more correctly[2] like Bazaar does.
es>
es> [2] "Renaming is the killer app of distributed version control"
es>     http://www.markshuttleworth.com/archives/123

Errr... what?

es> Six:  Maybe Git is fast because it doesn't use much external code.
es>
es> Very often, when you are facing a decision to use somebody else's
es> code or write it yourself, there is a performance tradeoff.  Not
es> always, but often.  Maybe the third party code is just slower than
es> the code you could write yourself if you had time to do it.  Or
es> maybe there is an impedance mismatch between the API of the
es> external library and your own architecture.
es>
es> This can happen even when the library is very high quality.  For
es> example, consider libcurl.  This is a great library.  Tons of
es> people use it.  But it does have one problem that will cause
es> performance problems for some users: When using libcurl to fetch
es> an object, it wants to own the buffer.  In some situations, this
es> can end up forcing you to use extra memcpys or temporary files.
es> The reason all the low level calls like send() and recv() allow
es> the caller to own the loop and the buffer is because this is the
es> best way to avoid the need to make extra copies of the data on
es> disk or in memory.
[...]

es>
es> Maybe Git is fast because every time they faced one of these "buy
es> vs. build" choices, they decided to just write it themselves.

I don't think so.  Rather the opposite is true.  Git uses libcurl for
HTTP transport.  Git uses zlib for compression.  Git uses SHA-1 from
OpenSSL or from Mozilla.  Git uses (modified, internal) LibXDiff for
(binary) deltaifying, for diffs and for merges.

OTOH Git includes several own micro-libraries: parseopt, strbuf,
ALLOC_GROW, etc.  NIH syndrome?  I don't think so; rather avoiding
extra dependencies (bstring vs strbuf), and existing solutions not
fitting all needs (popt/argp/getopt vs parse-options).

es> Seven:  Maybe Git isn't really that fast.
es>
es> If there is one thing I've learned about version control it's that
es> everybody's situation is different.  It is quite likely that Git
es> is a lot faster for some scenarios than it is for others.
es>
es> How does Git handle really large trees?  Git was designed primary
es> to support the efforts of the Linux kernel developers.  A lot of
es> people think the Linux kernel is a large tree, but it's really
es> not.  Many enterprise configuration management repositories are
es> FAR bigger than the Linux kernel.

c.f. "Why Perforce is more scalable than Git" by Steve Hanov
     http://gandolf.homelinux.org/blog/index.php?id=50

I don't really know about this.

But there is one issue Eric Sink didn't think about:

Eight: Git seems fast.
======================

Here I mean concentaring on low _latency_, which means that when git
produces more than one page of output (for example "git log"), it tries to output the first page as fast as possible; which means that first page e.g.
"git <sth> | head -25  >/dev/null" has to be fast, and not 
"git <sth> >/dev/null" itself.

Having progress indicator appearing whenever is longer wait (quite
fresh feature) also help impression of being fast...

And what do you think about this?

-- 
Jakub Narebski
Poland
ShadeHawk on #git
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html