Hi all,

One thing I find curious about git is that objects mostly aren't shared between multiple repositories on the local system. For example, if I do:

    git clone git://git.kernel.org/pub/scm/git/git.git git1
    git clone git://git.kernel.org/pub/scm/git/git.git git2

then I end up downloading the same objects from kernel.org *twice*.

If I use --reference on the second clone, I can avoid re-downloading all the objects, and it's much faster. Unfortunately, I have to provide that option by hand, which is a problem for git-submodule: it goes out to clone someone else's repository automatically and has no way to guess a value for --reference.

Another thing I commonly want to do with submodules is to rm -rf the submodule's files, e.g. because I changed branches and git doesn't clean them up automatically. But then when I switch back to the branch with the submodule, git wants to go and re-download the submodule *again*. Redoing the checkout makes sense to me (just as git deletes and recreates files when I normally switch branches), but re-downloading seems silly.

So here's my suggestion to minimize downloads in a pretty easy way:

- whenever git creates a packfile in any repo (e.g. during git gc or git fetch), make an *extra* hardlink of it in ~/.gitcache/objects/pack.

- whenever git is considering which objects it does/doesn't currently have, also consider the packs in ~/.gitcache/objects/pack (i.e. using the .git/objects/info/alternates mechanism). If one of the packs qualifies, hardlink it into the current repo. Maybe give it a .keep file to indicate that it's counterproductive to repack this pack.

- after git deletes a packfile in any repo (e.g. during git gc), check the link count of that pack in ~/.gitcache/objects/pack; if it's now down to just 1, there are no other users of the pack, so delete it there too. You would also need to prune the cachedir occasionally to deal with repositories that were deleted in other ways (e.g. rm -rf).
- share the list of refs in a similar way (noting, of course, that multiple repos will probably have different refs that are all named "refs/heads/master") so that fetches will be efficient.

- extra improvement to submodule behaviour: hardlink packs from the submodule into the superproject's objects/pack directory (or use a different directory like .git/submodules/pack to keep things separate). Also, submodules should use the superproject's pack directory as an alternate, in case (as often happens for me) the superproject already contains a bunch of objects from the submodule because the modules were split at some point.

I believe this would be quite easy to implement and would give an immediate efficiency improvement. The ~/.gitcache feature could be enabled/disabled by a config option.

Is there any reason not to do it?

Thanks,

Avery

P.S. I've been testing git's behaviour with lots of very large packs (I'm currently using about 58 packs of about 1 GB each) as part of my 'bup' git-based backup tool (http://apenwarr.ca/log/?m=201001#04). Repacking and fsck are obviously horrendously slow with that much data, but bup avoids those operations as much as possible, and a ~/.gitcache wouldn't need to worry about them either (since each repo is still responsible for repacking its own packs). Overall performance for other git operations seems to be fine, though. And searching the cache as a last resort can be optimized by always searching packs in MRU order, in case git doesn't already do this.
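
To make the hardlink bookkeeping from the first three list items above a bit more concrete, here is a minimal sketch in Python. It is not how git does (or would) implement this: the function names (publish_pack, adopt_pack, release_pack, prune_cache) and the demo paths are made up for illustration, the cache directory is passed in rather than fixed at ~/.gitcache, and it ignores the .idx file that accompanies every real pack. It also assumes the cache and the repos live on the same filesystem, since hardlinks can't cross filesystems.

```python
import os
import tempfile
from pathlib import Path


def publish_pack(pack: Path, cache_dir: Path) -> Path:
    """A repo created a pack: add an extra hardlink to it in the shared cache."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached = cache_dir / pack.name
    if not cached.exists():
        os.link(pack, cached)  # second name for the same inode, no extra disk space
    return cached


def adopt_pack(cached: Path, repo_pack_dir: Path) -> Path:
    """Another repo wants a cached pack: hardlink it in and mark it with .keep."""
    repo_pack_dir.mkdir(parents=True, exist_ok=True)
    target = repo_pack_dir / cached.name
    if not target.exists():
        os.link(cached, target)
    target.with_suffix(".keep").touch()  # pack-X.keep next to pack-X.pack
    return target


def release_pack(pack: Path, cache_dir: Path) -> bool:
    """A repo deletes its pack; drop the cache entry if no other repo links it.

    Returns True if the cache entry was removed as well.
    """
    keep = pack.with_suffix(".keep")
    if keep.exists():
        keep.unlink()
    pack.unlink()
    cached = cache_dir / pack.name
    if cached.exists() and cached.stat().st_nlink == 1:
        cached.unlink()  # the cache held the last remaining link
        return True
    return False


def prune_cache(cache_dir: Path) -> list:
    """Occasional sweep for repos removed behind git's back (e.g. rm -rf)."""
    removed = []
    for cached in sorted(cache_dir.glob("*.pack")):
        if cached.stat().st_nlink == 1:
            cached.unlink()
            removed.append(cached.name)
    return removed


def demo() -> dict:
    """Walk one fake pack through publish, adopt and release in a temp dir."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        cache = root / "gitcache" / "objects" / "pack"
        repo1 = root / "git1" / "objects" / "pack"
        repo2 = root / "git2" / "objects" / "pack"
        repo1.mkdir(parents=True)
        pack = repo1 / "pack-demo.pack"
        pack.write_bytes(b"placeholder bytes, not a real pack")

        cached = publish_pack(pack, cache)
        adopted = adopt_pack(cached, repo2)
        links_shared = cached.stat().st_nlink  # git1 + git2 + cache

        first = release_pack(pack, cache)      # git2 still uses the pack
        second = release_pack(adopted, cache)  # last user is gone

        return {
            "links_shared": links_shared,
            "first_release_dropped_cache": first,
            "second_release_dropped_cache": second,
            "cache_empty": not any(cache.glob("*.pack")),
        }


if __name__ == "__main__":
    print(demo())
```

The only state is the filesystem link count, so no separate registry of repositories is needed. The cost is that a repo removed with rm -rf leaves a stale cache entry until something like prune_cache runs.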