Hi,

On Fri, Feb 4, 2011 at 2:17 PM, Ken Brownfield <krb@xxxxxxxxxxx> wrote:
> Thanks for the feedback on git_fast_filter. It takes 11.5 hours on our
> repository instead of 6.5 days, so that's a significant improvement. :-)
> I have a couple of observations:
>
> 1) You said that your repo would have taken 2-3 months to filter with
> git-filter-branch, and the time was reduced to ~1hr. I'm surprised our
> reduction was not quite as dramatic, although I presume the variability
> of repo contents are the explanation.

Variability of the repo certainly would account for some differences,
though I suspect more of the differences come from what kind of
filtering we were doing. For example, the advantage of git_fast_filter
over filter-branch's --index-filter will be much less than its advantage
over filter-branch's --tree-filter. Further, in my case, I was parsing
and potentially editing the contents of all files. That becomes much
more painful with filter-branch, because you need to re-edit the exact
same contents in every revision of history in which the file remains
unchanged (in other words, duplicating the same work hundreds or
thousands of times). With git_fast_filter, I only needed to parse/edit
a given version of some file exactly once. That's what really helped in
my case.

> 2) The resulting repository pack files are actually much larger. A
> garbage collection reduces the size below the original, but only
> slightly. I'm concerned that the recreated repository has redundant or
> inefficiently stored information, but I'm not sure how to verify what
> objects are taking up what space.

You may want to use packinfo.pl from under contrib/stats/ in the git
repository to find out what objects take up how much space. From my
notes on using it for this purpose:

    git verify-pack -v .git/objects/pack/pack-<sha1sum>.idx | packinfo.pl -tree -filenames > tree-info.txt
    sort -k 4 -n tree-info.txt | grep -v ^$ | less

> 3) git_fast_filter doesn't currently support remote submodules. When it
> tries to parse a submodule line, the regex fails and the code aborts:
>
> Expected:
>         M 100644 :433236 foo/bar/bletch
> Received, something like:
>         M 100644 cd821b4c0ea8e9493069ff43712a0b09 foo/bar/bletch
>
> To correct the issue, I modified git_fast_filter to simply skip these.
> While we no longer utilize remote submodules, I would prefer not to have
> them removed.
>
> Any feedback on what the proper behavior would be in the submodule case?
> Perhaps this is covered in your internal version?

git_fast_filter would need to be modified to handle this kind of input
and create an appropriate object type, and that object type would need
to be able to output itself appropriately later. Since submodules
haven't really been relevant for me, I've never bothered implementing
this[*]. The assumption that git-fast-export will produce numeric ids
(i.e. that submodules are not present) is somewhat hardwired in, so it'd
take a little bit of refactoring, though probably not too bad.

Elijah

[*] Well, actually we did hit it once somewhat recently when someone
created a commit containing a submodule...and then also immediately
reverted it. Since we don't want to use submodules, I simply put in a
hack that would recognize them and unconditionally strip them out on the
input parsing end, which sounds like the same thing you did. That's
obviously not what you're asking for.
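
To make the refactoring described above a little more concrete, here is a
rough sketch of parsing a filemodify line that accepts either a numeric
mark or a raw SHA-1 and can write the line back out unchanged. It is not
git_fast_filter's actual code: the names FILEMODIFY_RE, FileModify and
parse_filemodify are made up for illustration. It assumes the fast-import
stream grammar ("M <mode> <dataref> <path>", with <dataref> being
":<idnum>" or a full 40-hex-digit SHA-1) and that submodule entries carry
mode 160000; it ignores the "inline" dataref form and treats the path as
an opaque string.

```python
import re
from typing import NamedTuple, Optional

# 'M' SP <mode> SP <dataref> SP <path>, where <dataref> is either a mark
# reference (":<idnum>") or a full 40-hex-digit SHA-1. Assumption: the
# "inline" dataref form is not handled here.
FILEMODIFY_RE = re.compile(
    r"^M (?P<mode>\d+) "
    r"(?::(?P<mark>\d+)|(?P<sha1>[0-9a-f]{40})) "
    r"(?P<path>.+)$"
)

GITLINK_MODE = "160000"  # mode git uses for submodule (gitlink) entries


class FileModify(NamedTuple):
    """Hypothetical record for one filemodify line; not git_fast_filter's type."""

    mode: str
    mark: Optional[int]  # set when the line referenced a blob by mark
    sha1: Optional[str]  # set when the line carried a raw object id
    path: str

    @property
    def is_submodule(self) -> bool:
        return self.mode == GITLINK_MODE

    def dump(self) -> str:
        # Emit the same kind of reference that was read in.
        ref = f":{self.mark}" if self.mark is not None else self.sha1
        return f"M {self.mode} {ref} {self.path}"


def parse_filemodify(line: str) -> FileModify:
    """Parse one 'M ...' line, accepting mark or SHA-1 references."""
    m = FILEMODIFY_RE.match(line.rstrip("\n"))
    if m is None:
        raise ValueError(f"not a recognized filemodify line: {line!r}")
    mark = m.group("mark")
    return FileModify(
        mode=m.group("mode"),
        mark=int(mark) if mark is not None else None,
        sha1=m.group("sha1"),
        path=m.group("path"),
    )
```

With something along these lines, a submodule entry could be passed
through untouched by calling dump() on it rather than being stripped on
the input side, while entries with marks continue to be handled as before.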