Re: Git performance results on a large repository

Hi all,

Nguyen, thanks for pointing out the assume-unchanged part.  That, and
especially the suggestion of making assume-unchanged files read-only, is
interesting.  It does require explicit specification of what's changed.
Hmm, I wonder if that could be a candidate API through which something
like a CoW file system could let git know what's changed.  Btw, I think
you asked earlier, but the index compresses from 158MB to 58MB - keep in
mind that the majority of file names in the repo are synthetic, so take
it with a big grain of salt.
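
For anyone following along, the recipe would presumably be something
like this (these are real git-update-index flags; the path is made up):

  # Tell git to skip stat()ing this file, and make it read-only so
  # accidental edits fail loudly instead of being silently ignored:
  git update-index --assume-unchanged path/to/file
  chmod a-w path/to/file

  # When you actually intend to change it:
  chmod u+w path/to/file
  git update-index --no-assume-unchanged path/to/file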

Joey, it sounds like it might be good if git-mv and other commands were
consistent in how they treat the assume-unchanged bit.

David Mohs:  Yeah, it's an open question whether we'd be better off
somehow forcing the repos to split apart more.  As a practical matter,
what may happen is that we incrementally solve our problem by addressing
pain points as they come up (e.g., git status being slow).  One risk with
that approach is that it leads to overly short-term thinking and we get
stuck in a local minimum.  I totally agree that good modularization and
code health are valuable.  I think that getting to good modularization
sometimes does involve some technical work - like maybe moving
functionality between systems so they split apart better, having some
notion of versioning and dependency and managing that, and so forth.  I
suppose the other aspect to the problem is that we want to make sure we
have a good source-control story even if the modularization effort takes
a long time - we'd rather not end up in a race between long-term
modularization efforts and source-control performance going south too
fast.  I suppose this comes back to the desire that modularization not be
a prerequisite for good source-control performance.  Oh, and in case I
didn't mention it - we are working on modularization and splitting off
large chunks of code, both into separable libraries and into separate
services, but it's a long-term process.

Matt, some of our repos are still on SVN, many are on pure git.  One of
the main ones that is on SVN is, at least at the moment, not amenable to
sparse checkouts because of its structure.
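
(For the repos that are amenable, Matt's --ignore-paths suggestion below
would look roughly like this - repo URL and paths invented for
illustration:

  git svn clone --ignore-paths='^(third_party|generated)/' \
      https://svn.example.com/repo/trunk big-repo

Everything matching the regex never gets fetched into the working copy.)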

Tomas, yeah, I think one of the big questions is how little technical
work we can get away with, and where the point of maximum leverage is in
terms of how much engineering time we invest.

Greg, 'git commit' does some stat'ing of every file, even with all those
flags - for example, I believe it re-stats everything just in case any
pre-commit hooks touched any files.  Regarding the perf numbers, I ran
them on a beefy Linux box.  Have you tried doing your measurements with
the drop_caches trick to make sure the file cache is totally cold?
Sorry for the dumb question, but how do I check the vnode cache size?
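
(By the drop_caches trick I mean the usual Linux incantation:

  sync
  echo 3 | sudo tee /proc/sys/vm/drop_caches

which flushes dirty pages and then drops the page cache, dentries, and
inodes, so the next measurement starts truly cold.)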

David Lang and David Barr, I generated the pack files by doing a repack:
"git repack -a -d -f --max-pack-size=10g --depth=100 --window=250"  after
generating the repo.
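(For those curious about the flags: -a repacks everything reachable into
fresh packs, -d deletes the now-redundant old packs, -f forces deltas to
be recomputed from scratch instead of reusing existing ones, and the
large --window/--depth values trade a much longer repack for better
delta compression.)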

One other update: the command I was running to get a histogram of all
file sizes in the repo finally completed.  The histogram (file size in
bytes) is:

[       0.0 -        6.4): 3
[       6.4 -       41.3): 27
[      41.3 -      265.7): 6
[     265.7 -     1708.1): 652594
[    1708.1 -    10980.6): 673482
[   10980.6 -    70591.6): 19519
[   70591.6 -   453814.3): 1583
[  453814.3 -  2917451.4): 276
[ 2917451.4 - 18755519.0): 61
[18755519.0 - 120574242.0]: 4
n=1347555 mean=3697.917708, median=1770.000000, stddev=122940.890559

The smaller files are all text (code), and the large ones are probably
binary.
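
In case anyone wants to reproduce the summary line on their own repo,
here's a rough sketch (GNU stat assumed; this isn't the exact command I
ran, and the bucketing isn't shown):

  git ls-files -z | xargs -0 stat -c %s | sort -n |
  awk '{ a[NR] = $1; s += $1; ss += $1 * $1 }
       END {
         m = s / NR
         printf "n=%d mean=%f median=%f stddev=%f\n",
                NR, m, a[int((NR + 1) / 2)], sqrt(ss / NR - m * m)
       }'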

Cheers,
Josh



On 2/6/12 11:23 AM, "Matt Graham" <mdg149@xxxxxxxxx> wrote:

>On Sat, Feb 4, 2012 at 18:05, Joshua Redstone <joshua.redstone@xxxxxx>
>wrote:
>> [ wanted to reply to my initial msg, but wasn't subscribed to the list
>>at time of mailing, so replying to most recent post instead ]
>>
>> Matt Graham:  I don't have file stats at the moment.  It's mostly code
>>files, with a few larger data files here and there.    We also don't do
>>sparse checkouts, primarily because most people use git (whether on top
>>of SVN or not), which doesn't support it.
>
>
>This doesn't help your original goal, but while you're still working
>with git-svn, you can do sparse checkouts. Use --ignore-paths when you
>do the original clone and it will filter out directories that are not
>of interest.
>
>We used this at Etsy to keep git svn checkouts manageable when we
>still had a gigantic svn repo.  You've repeatedly said you don't want
>to reorganize your repos, but you may find this writeup informative
>about how Etsy migrated to git (which included a healthy amount of
>repo manipulation):
>http://codeascraft.etsy.com/2011/12/02/moving-from-svn-to-git-in-1000-easy-steps/
