On Jan 22, 2008, at 9:46 PM, Junio C Hamano wrote:
> Kevin Ballard <kevin@xxxxxx> writes:
>
>> I just glanced at git-filter-branch.sh (and I must say I was incredibly surprised to find out it was a shell script) and it seems it never runs git-gc or git-repack. Doesn't that end up with the same problems as git-svn sans git-repack when filtering a large number of commits?
>>
>> I was just thinking, if I were to git-filter-branch on my massive repo (in fact, the same repo that started this thread, with over 33000 commits in the upstream svn repo), even if I just do something as simple as change the commit msg, won't I end up with thousands of unreachable objects? I shudder to think how many unreachable objects I would have if I pruned the entire dports directory off of the tree.
>>
>> Am I missing something, or does git-filter-branch really not do any garbage collection? I tried reading the source, but complex bash scripts are almost as bad as perl in terms of readability.
>
> Theoretically yes, and it largely depends on what you do, but filter-branch goes over the objects that already exist in your repository, and hopefully you won't be rewriting the majority of them. So the impact of not repacking is probably much less painful in practice.
>
> But again as I said, it largely depends on what you do in your filter. If you are upcasing (or converting to NFD ;-)) the contents of all of your blob objects, you would certainly want to repack every once in a while.
I'm actually considering what the cost would be of switching macports to git (not that it will ever happen - too many anonymous people pull from svn trunk). Right now the svn trunk contains a subfolder for the source code and another subfolder for all ~4400 Portfiles. In such a theoretical move, I'd want to split that up, probably into two unrelated branches.

Doing so would mean running git-filter-branch over a linear commit history that's 31580 commits long, with a tree filter to prune the dports directory away and a msg filter to remove the git-svn-id lines that git-svn left behind. This means that every single commit object would be changed, as well as the root tree object for every single commit. That would be about 63160 objects (2 x 31580). I'd also have to figure out some way to entirely remove the commits that only touch the dports directory. Then I'd have to do it again with the opposite tree filter (to prune everything but the dports directory and move the contents of the dports directory up one level) and the same msg filter.

Granted, if I do the first operation in a branch, that leaves no unreachable objects (since the originals are still referenced), but the second operation definitely would. And were I to clone the repository instead and do the two operations in different repos (which is perfectly legitimate - otherwise I'd have to clone it after everything else and then delete branches), then both operations would leave thousands of objects unreachable.
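A rough sketch of the two passes described above, not tested against the real repository. It swaps the tree filter for an index filter (which avoids checking out every revision), uses --subdirectory-filter for the dports-only pass, and relies on --prune-empty, which newer versions of git-filter-branch provide, to drop commits that become empty. The names MSG_FILTER, drop_dports and keep_only_dports are made up for illustration, and the sed pattern assumes the git-svn lines start with "git-svn-id:".

```shell
#!/bin/sh
# Sketch only: MSG_FILTER, drop_dports and keep_only_dports are hypothetical names.

# Message filter: delete the "git-svn-id:" lines that git-svn appends.
# filter-branch evaluates this string as a shell command per commit.
MSG_FILTER='sed -e "/^git-svn-id:/d"'

# Pass 1: rewrite the current branch without the dports directory.
# Assumption: an index filter stands in for the tree filter mentioned above.
drop_dports() {
    git filter-branch --prune-empty \
        --index-filter 'git rm -r -q --cached --ignore-unmatch dports' \
        --msg-filter "$MSG_FILTER" \
        -- HEAD
}

# Pass 2: keep only dports and make its contents the new project root.
keep_only_dports() {
    git filter-branch --prune-empty \
        --subdirectory-filter dports \
        --msg-filter "$MSG_FILTER" \
        -- HEAD
}
```

Each function would be run once, in its own clone (or on its own branch), since both rewrite HEAD in place.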
I'd suggest a patch to run git gc --auto, but it looks like you just did so in a subsequent email. As for your comments about the reflogs, can't I disable recording those, at least temporarily? I'd rather clean up after myself as I work than balloon the repository and collapse it in a single operation at the end.
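For illustration, a sketch of what turning off reflogs and cleaning up after a rewrite could look like. The function names are hypothetical. As far as I understand it, core.logAllRefUpdates=false only stops git from creating new reflog files, so existing reflogs would still need to be expired before the old objects can be pruned.

```shell
#!/bin/sh
# Sketch only: disable_reflogs and cleanup_after_rewrite are hypothetical names.

# Stop git from creating new reflogs in the current repository.
disable_reflogs() {
    git config core.logAllRefUpdates false
}

# Drop what still points at the pre-rewrite objects, then repack and prune.
cleanup_after_rewrite() {
    # filter-branch saves the original refs under refs/original/.
    git for-each-ref --format='%(refname)' refs/original/ |
    while read -r ref; do
        git update-ref -d "$ref"
    done
    # Expire all existing reflog entries so they no longer keep objects alive.
    git reflog expire --expire=now --all
    # Repack and remove unreachable objects immediately.
    git gc --prune=now
}
```

Note that cleanup_after_rewrite is destructive: once it has run, the pre-rewrite history can no longer be recovered from that repository.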
-Kevin Ballard

-- 
Kevin Ballard
http://kevin.sb.org
kevin@xxxxxx
http://www.tildesoft.com