[RFC] git-clone should create packed refs

Johan Herland <johan@xxxxxxxxxxx> · Fri, 15 Feb 2008 01:33:19 +0100

Hi,

I'm experimenting with converting deep (lots of history) CVS repos to Git, 
and I notice that cloning the resulting Git repos is _slow_. E.g. an 
example repo with 10000 tags and 1000 branches will take ~24 seconds to 
clone. Debugging shows that >95% of that time is spent by calling "git 
update-ref" for each of the 11000 refs. I can easily get the total runtime 
down to ~4 seconds by replacing the "git update-ref ..." with something 
like "echo $sha1 $destname >> $GIT_DIR/packed-refs". Some more 
investigation shows that what's actually taking so long is not writing all 
these 40-byte ref files and their corresponding reflogs, but rather the
overhead of creating the "git update-ref" process 11000 times (echo is a 
shell builtin, I presume, so doesn't have the same overhead). My conclusion 
is therefore that making "git clone" a builtin will solve my performance 
problems (since the update-ref is now a function call, rather than a 
subprocess).
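
To illustrate the difference, here is a minimal sketch that does the same N
appends two ways. Note that /bin/echo is only a stand-in for an external
command such as "git update-ref", and both function names are made up for
this example; nothing here is taken from git-clone.sh.

```sh
#!/bin/sh
# Sketch: the same N appends done two ways. /bin/echo is only a stand-in
# for an external command such as "git update-ref"; the function names
# are hypothetical.

# append_with_spawn N FILE: one fork+exec per line written
append_with_spawn() {
	i=0
	while [ "$i" -lt "$1" ]; do
		/bin/echo "line $i" >>"$2"
		i=$((i + 1))
	done
}

# append_with_builtin N FILE: uses the shell's own echo, no new process
append_with_builtin() {
	i=0
	while [ "$i" -lt "$1" ]; do
		echo "line $i" >>"$2"
		i=$((i + 1))
	done
}
```

In bash, running "time append_with_spawn 11000 /tmp/spawn.out" and then
"time append_with_builtin 11000 /tmp/builtin.out" should make the gap
visible; the absolute numbers will depend on the machine and OS.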

Searching the list, I find that - lo and behold - someone (CCed) is actually 
already working on this. :)
(BTW, a progress report on this work would be nice...)

So the only niggle I have left is that when git-clone is cloning repos with
thousands of refs, it makes sense to create a packed-refs file directly in 
the clone, instead of having to run "git pack-refs" (or "git gc") 
afterwards to (re)pack the refs. The reasoning is pretty much the same as for
transferring and storing the objects in packs instead of exploding them
into loose objects.

In my case, the upstream repo already has packed refs, so it just seems 
stupid to explode them into "loose" refs when cloning, and make me re-pack 
them afterwards.

Looking at git-clone.sh, I even find that when cloning, the refs are 
transferred in a format similar (but not identical) to the packed-refs file 
format (see CLONE_HEAD in git-clone.sh).
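
As a rough sketch of how small the conversion could be: the function below
assumes the ref listing is a series of "<sha1> <refname>" lines, with peeled
tag entries marked by a trailing "^{}". The function name, the mapping of
refs/heads/* to refs/remotes/origin/*, and the choice to drop everything
else are my assumptions for illustration. It writes no header line and no
"^<sha1>" peeled lines, so it is not a complete packed-refs writer.

```sh
#!/bin/sh
# to_packed_refs (hypothetical name): read "<sha1> <refname>" lines on
# stdin and print "<sha1> <destname>" lines on stdout.
# Assumption: branches map to refs/remotes/origin/*, tags keep their name.
to_packed_refs() {
	while read -r sha1 name; do
		case "$name" in
		*'^{}')
			continue ;;	# peeled tag entry: skipped in this sketch
		refs/heads/*)
			destname="refs/remotes/origin/${name#refs/heads/}" ;;
		refs/tags/*)
			destname="$name" ;;
		*)
			continue ;;	# HEAD and anything else is ignored here
		esac
		echo "$sha1 $destname"
	done
}
```

It would be used along the lines of
	to_packed_refs <"$GIT_DIR/CLONE_HEAD" >>"$GIT_DIR/packed-refs"
with all refs handled by a single shell process.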

AFAICS, the only complication with this proposal is how to deal with the 
reflogs. Right now, for each ref created, a corresponding reflog with a 
single entry is written. Therefore - in my example repo above - the 
current "git clone" writes ~22000 files, and my proposal only cuts the
number of files written by ~50%, instead of ~100%. For reference, the
reflog entries written by "git clone" look like this:
	"000... $sha1 A U Thor <e@mail> $timestamp  clone: from $repo"
IMHO, these entries don't carry much value:
- The $sha1 is self-evident (and if later changed, will still be mentioned
  in the next reflog entry).
- The author name and email would probably be self-evident/uninteresting in
  most cases.
- The timestamp might be marginally useful, as I can't immediately think of
  another way of getting the time of cloning.
- The $repo would also be self-evident in many cases, and would in any case
  also be listed in the config file in the "origin" remote section.
I'd therefore suggest making reflog creation in "git clone" optional, in
order to avoid having the number of files written be proportional to the 
number of refs.
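
The file-count arithmetic behind the ~50% and ~100% figures can be spelled
out as below. The function and mode names are made up for this example, and
it only counts ref-related files (one per loose ref, one per reflog, one
packed-refs file), nothing else that a clone writes.

```sh
#!/bin/sh
# files_written NREFS MODE (hypothetical helper): number of ref-related
# files a clone would write for NREFS refs under each scheme.
files_written() {
	case "$2" in
	loose+reflog)
		echo $(( $1 * 2 )) ;;	# one ref file plus one reflog per ref
	packed+reflog)
		echo $(( $1 + 1 )) ;;	# one packed-refs file plus one reflog per ref
	packed)
		echo 1 ;;		# just the packed-refs file
	*)
		echo "unknown mode: $2" >&2
		return 1 ;;
	esac
}
```

For the 11000 refs in my example repo, "files_written 11000 loose+reflog"
gives 22000, "packed+reflog" gives 11001, and "packed" gives 1.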

I would imagine that even though the time used on Linux for writing 
thousands of files might be negligible, this is not the case on certain 
other OSes...

Have fun! :)

...Johan

-- 
Johan Herland, <johan@xxxxxxxxxxx>
www.herland.net