RE: Git performance results on a large repository

[ wanted to reply to my initial msg, but wasn't subscribed to the list at time of mailing, so replying to most recent post instead ]

Thanks to everyone for the questions and suggestions.  I'll try to respond here.  One high-level clarification - this synthetic repo for which I've reported perf times is representative of where we think we'll be in the future.  Git is slow but marginally acceptable for today.  We want to start planning now for any big changes we need to make going forward.

Evgeny Sazhin, Slinky and Ævar Arnfjörð Bjarmason suggested splitting up the repo into multiple smaller repos.  I indicated before that we have a lot of cross-dependencies.  Our largest repo by number of files and commits is the one containing the front-end server.  It is a large code base in which the tight integration of various components produces many of those cross-dependencies.  We are slowly working to split things up more, for example into services, but that is a long-term process.

To get a bit abstract for a moment: in an ideal world, the performance constraints of a source-control system shouldn't dictate how we choose to structure our code.  Ideally, we should be able to structure our code in whatever way maximizes developer productivity.  If development and code/release management are easier in a single repo, then why not make an SCM that can handle it?  This is one reason I've been leaning towards figuring out an SCM approach that works well with our current practices, rather than changing them as a prerequisite for good SCM performance.

Sam Vilain:  Thanks for the pointer, I didn't realize that fast-import was bi-directional.  I used it for generating the synthetic repo.  I'll look into using it the other way around, though that presumably still won't speed up things like git-blame?  The sparse-checkout issue you mention is a good one.  There is a real question of how to support quick checkout, branch switching, clone, push and so forth.  I'll look into the approaches you suggest.  One consideration is coming up with a high-leverage approach - i.e., not doing heavy dev work if we can avoid it.  On the other hand, it would be nice if we (including the entire community :) ) improved git in areas where others with similar issues would benefit as well.
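
For concreteness, something like this is roughly what I have in mind for experimenting with a fast-export/fast-import round trip and with the sparse-checkout machinery that exists today (just a sketch; the paths are made up):

  # Round-trip the repo through the stream format
  git fast-export --all > /tmp/stream
  git init /tmp/roundtrip && cd /tmp/roundtrip && git fast-import < /tmp/stream

  # Narrow the working tree using the existing sparse-checkout support
  git config core.sparseCheckout true
  echo "frontend/" > .git/info/sparse-checkout
  git read-tree -mu HEAD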

Matt Graham:  I don't have file stats at the moment.  It's mostly code files, with a few larger data files here and there.  We also don't do sparse checkouts, primarily because most people use git (whether on top of SVN or not), which doesn't support them.
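
When I do get to a terminal, I can pull rough file stats with something along these lines (a quick sketch; the extension-based bucketing is approximate):

  # Number of tracked files, and a rough breakdown by extension
  git ls-files | wc -l
  git ls-files | sed -n 's/.*\.\([A-Za-z0-9]*\)$/\1/p' | sort | uniq -c | sort -rn | head

  # Largest tracked files in the working tree, by size in bytes
  git ls-files -z | xargs -0 du -b | sort -rn | head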

Chris Lee:  When I was building up the repo (e.g., doing lots of commits, before I started using fast-import), I noticed that flash was not much faster - stat'ing the whole repo takes a lot of kernel time, even with flash.  My hunch is that we'd see similar issues with other operations, like git-blame.
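
For anyone who wants to reproduce the cold/warm comparison on their own hardware, this is essentially the procedure behind the numbers I posted (needs root for the cache drop; exact timings will of course vary):

  # Cold-cache run: flush the page/dentry/inode caches first
  sync && echo 3 | tee /proc/sys/vm/drop_caches
  time git status

  # Warm-cache runs: repeat without dropping caches
  time git status
  time git status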

Zeki Mokhtarzada:  Dumping history would, I think, speed up operations for which we don't care about old history, like git-blame when we only want to see recent modifications.  We'd also need a good story for other kinds of operations.  In my mental model of git scalability, I categorize git structures into three kinds: those for reasoning about history, those for the index, and those for the working directory (yeah, I know these don't map precisely to actual on-disk things like the object store, including trees, etc.).  One scaling approach we've been thinking of is to tackle each individually: develop a specialized thing to handle history commands efficiently (git-blame, git-log, git-diff, etc.), something to speed up or bypass the index, and something to make large changes to the working directory quickly.
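
As a cheap way to experiment with "dumping history" using what git already has, a shallow clone (or a graft that turns an old commit into a root) might be worth measuring.  Purely a sketch - the depth and sha1 below are placeholders:

  # Keep only recent history in a fresh clone
  git clone --depth 1000 /path/to/repo shallow-copy

  # Or, in an existing clone, make an old commit look like a root commit
  echo <old-commit-sha1> > .git/info/grafts

(filter-branch can make a graft permanent, though rewriting history at this scale is its own problem.)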

Joey Hess:  Separating the factors is a good suggestion.  My hunch is that the various git operations already test the performance issues in isolation.  For example, git-status performance depends just on the number of files, not on the depth of history.  On the other hand, my guess is that git-blame performance is more a function of the length of history than the number of files.  Though, certainly with compression and indexing in pack files, I could imagine cross-effects between length of history and number of files.  The git-status suggestion definitely helps when you know which directory you are concerned about.  Often I'm lazy and stat the repo root, so I trade off slowness for being more sure I'm not missing anything.
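
Joey's factor-separation idea could also be probed fairly directly, e.g. by limiting git-status to a pathspec (mostly factors out working-tree size) and comparing git-blame on a rarely-touched vs. a frequently-changed file (mostly factors out history length).  Sketch only - the paths are made up:

  # Working-tree-size factor: one subdirectory vs. the whole tree
  time git status -- some/subdir/
  time git status

  # History-length factor: short-history file vs. long-history file
  time git blame some/rarely-touched-file.c
  time git blame some/frequently-changed-file.c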

@Joey, I think you're also touching on a good meta point, which is that there's probably no silver bullet here.  If we want git to efficiently handle repos that are large across a number of dimensions (size, # commits, # files, etc.), there are multiple parts of git that would need enhancement of some form.

Nguyen Thai Ngoc Duy:  At which point in the test flow should I insert git-update-index?  I'm happy to try it out.  I'll compress the index when I next get to a terminal; my guess is it'll compress a bunch.  It's also conceivable that, if git had an external interface for attaching other systems that can efficiently report which files have changed (e.g., via file-system integration), we could avoid managing the index in many cases.  I know that would be a big change, but the benefits are intriguing.
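
To be concrete about what I plan to try from Duy's two suggestions (flagging files assume-unchanged before the timed operations, and checking how well the index compresses), something like the following - a sketch, run from the repo root:

  # Mark all tracked files assume-unchanged so mass lstat() is skipped
  git ls-files -z | git update-index -z --assume-unchanged --stdin

  # See how much the 191 MB index shrinks under gzip
  gzip -c .git/index | wc -c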

Cheers,
Josh




________________________________________
From: Nguyen Thai Ngoc Duy [pclouds@xxxxxxxxx]
Sent: Friday, February 03, 2012 10:53 PM
To: Joshua Redstone
Cc: git@xxxxxxxxxxxxxxx
Subject: Re: Git performance results on a large repository

On Fri, Feb 3, 2012 at 9:20 PM, Joshua Redstone <joshua.redstone@xxxxxx> wrote:
> I timed a few common operations with both a warm OS file cache and a cold
> cache.  i.e., I did a 'echo 3 | tee /proc/sys/vm/drop_caches' and then did
> the operation in question a few times (first timing is the cold timing,
> the next few are the warm timings).  The following results are on a server
> with average hard drive (I.e., not flash)  and > 10GB of ram.
>
> 'git status' :   39 minutes cold, and 24 seconds warm.
>
> 'git blame':   44 minutes cold, 11 minutes warm.
>
> 'git add' (appending a few chars to the end of a file and adding it):   7
> seconds cold and 5 seconds warm.
>
> 'git commit -m "foo bar3" --no-verify --untracked-files=no --quiet
> --no-status':  41 minutes cold, 20 seconds warm.  I also hacked a version
> of git to remove the three or four places where 'git commit' stats every
> file in the repo, and this dropped the times to 30 minutes cold and 8
> seconds warm.

Have you tried "git update-index --assume-unchanged"? That should
reduce mass lstat() and hopefully improve the above numbers. The
interface is not exactly easy-to-use, but if it has significant gain,
then we can try to improve UI.

On the index size issue, ideally we should make minimum writes to
index instead of rewriting 191 MB index. An improvement we could do
now is to compress it, reduce disk footprint, thus disk I/O. If you
compress the index with gzip, how big is it?
--
Duy

