Hi,

I'm experimenting with converting deep (lots of history) CVS repos to Git, and I notice that cloning the resulting Git repos is _slow_. E.g. an example repo with 10000 tags and 1000 branches takes ~24 seconds to clone.

Debugging shows that >95% of that time is spent calling "git update-ref" for each of the 11000 refs. I can easily get the total runtime down to ~4 seconds by replacing the "git update-ref ..." with something like "echo $sha1 $destname >> $GIT_DIR/packed-refs".

Some more investigation shows that what's actually taking so long is not writing all these 40-byte ref files and their corresponding reflogs, but rather the overhead of creating the "git update-ref" process 11000 times (echo is a shell builtin, I presume, so it doesn't have the same overhead).

My conclusion is therefore that making "git clone" a builtin will solve my performance problem, since the update-ref would then be a function call rather than a subprocess. Searching the list, I find that - lo and behold - someone (CCed) is actually already working on this. :) (BTW, a progress report on this work would be nice...)

So the only niggle I have left is that when git-clone is cloning repos with thousands of refs, it makes sense to create a packed-refs file directly in the clone, instead of having to run "git pack-refs" (or "git gc") afterwards to (re)pack the refs. The reasoning is pretty much the same as for transferring and storing the objects in packs instead of exploding them into loose objects. In my case, the upstream repo already has packed refs, so it just seems stupid to explode them into "loose" refs when cloning, and make me re-pack them afterwards. Looking at git-clone.sh, I even find that when cloning, the refs are transferred in a format similar (but not identical) to the packed-refs file format (see CLONE_HEAD in git-clone.sh).

AFAICS, the only complication with this proposal is how to deal with the reflogs. Right now, for each ref created, a corresponding reflog with a single entry is written. Therefore - in my example repo above - the current "git clone" writes ~22000 files, and my proposal only offers a net reduction in the number of files written of ~50%, instead of ~100%.

For reference, the reflog entries written by "git clone" look like this:

  000... $sha1 A U Thor <e@mail> $timestamp clone: from $repo

IMHO, these entries don't carry much value:

- The $sha1 is self-evident (and if later changed, will still be mentioned in the next reflog entry).
- The author name and email would probably be self-evident/uninteresting in most cases.
- The timestamp might be marginally useful, as I can't immediately think of another way of getting the time of cloning.
- The $repo would also be self-evident in many cases, and would in any case also be listed in the config file, in the "origin" remote section.

I'd therefore suggest making reflog creation in "git clone" optional, in order to avoid having the number of files written be proportional to the number of refs. I would imagine that even though the time used on Linux for writing thousands of files might be negligible, this is not the case on certain other OSes...

Have fun! :)

...Johan

-- 
Johan Herland, <johan@xxxxxxxxxxx>
www.herland.net
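
P.S.: To illustrate the packed-refs part of the proposal, here is a rough, untested sketch of what I imagine the ref-writing step could look like. I'm assuming the "$sha1 $refname" pairs are already available one per line (as in CLONE_HEAD), and I'm glossing over the mapping of remote ref names to their local destination names:

  # Instead of forking "git update-ref $destname $sha1" once per ref
  # (which creates one loose ref file plus one reflog file for each of
  # the 11000 refs), write a single packed-refs file directly, in the
  # format "git pack-refs" produces.  The "^" peeled lines for annotated
  # tags (and any "^{}" entries coming from the transport) are ignored
  # here for brevity; git can peel tags on demand.
  while read sha1 refname
  do
          echo "$sha1 $refname"
  done <"$GIT_DIR/CLONE_HEAD" >"$GIT_DIR/packed-refs"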