Re: Performance issue exposed by git-filter-branch

Hi,

On Fri, Feb 4, 2011 at 2:17 PM, Ken Brownfield <krb@xxxxxxxxxxx> wrote:
> Thanks for the feedback on git_fast_filter. It takes 11.5 hours on our repository instead of 6.5 days, so that's a significant improvement. :-) I have a couple of observations:
>
> 1) You said that your repo would have taken 2-3 months to filter with git-filter-branch, and the time was reduced to ~1hr. I'm surprised our reduction was not quite as dramatic, although I presume the variability of repo contents are the explanation.

Variability of the repo certainly would account for some differences,
though I suspect more of the differences come from what kind of
filtering we were doing.  For example, the advantage of
git_fast_filter over filter-branch's --index-filter will be much less
than its advantage over filter-branch's --tree-filter.  Further, in my
case, I was parsing and potentially editing the contents of all files,
which becomes much more painful with filter-branch as you'll need to
re-edit the exact same contents in as many revisions of history as the
file remains unchanged in (in other words, duplicating the same work
hundreds or thousands of times).  With git_fast_filter, I only needed
to parse/edit a given version of some file exactly once.  That's what
really helped in my case.
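
To make the difference concrete, here is a rough Python sketch of the two
strategies. It is not code from git_fast_filter or filter-branch; the
edit_contents function, the dict-based "history", and both filter
functions are made up for illustration. It only shows why keying the
work on blob contents means each distinct file version is edited once,
while a per-revision walk repeats the edit for every revision.

```python
import hashlib


def edit_contents(contents: bytes) -> bytes:
    """Hypothetical per-file edit; stands in for whatever filtering is wanted."""
    return contents.replace(b"secret", b"xxxxxx")


def filter_per_revision(history):
    """Tree-filter style: every file in every revision is edited again."""
    calls = 0
    out = []
    for tree in history:
        new_tree = {}
        for path, contents in tree.items():
            new_tree[path] = edit_contents(contents)
            calls += 1
        out.append(new_tree)
    return out, calls


def filter_per_blob(history):
    """Blob-oriented style: each distinct file version is edited only once."""
    cache = {}  # content hash -> already-edited contents
    calls = 0
    out = []
    for tree in history:
        new_tree = {}
        for path, contents in tree.items():
            key = hashlib.sha1(contents).hexdigest()
            if key not in cache:
                cache[key] = edit_contents(contents)
                calls += 1
            new_tree[path] = cache[key]
        out.append(new_tree)
    return out, calls


if __name__ == "__main__":
    # Three identical revisions, then one where only a.txt changes.
    history = [{"a.txt": b"secret 1", "b.txt": b"x"}] * 3
    history.append({"a.txt": b"secret 2", "b.txt": b"x"})
    print(filter_per_revision(history)[1], "edits per revision")
    print(filter_per_blob(history)[1], "edits per blob")
```

The gap between the two counts grows with the number of revisions in
which a file stays unchanged, which is why the speedup depends so much
on the repository and on the kind of filtering being done.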

> 2) The resulting repository pack files are actually much larger. A garbage collection reduces the size below the original, but only slightly. I'm concerned that the recreated repository has redundant or inefficiently stored information, but I'm not sure how to verify what objects are taking up what space.

You may want to use packinfo.pl from under contrib/stats/ in the git
repository to find out what objects take up how much space.  From my
notes on using it for this purpose:

  git verify-pack -v .git/objects/pack/pack-<sha1sum>.idx |
    packinfo.pl -tree -filenames > tree-info.txt
  sort -k 4 -n tree-info.txt | grep -v '^$' | less

> 3) git_fast_filter doesn't currently support remote submodules. When it tries to parse a submodule line, the regex fails and the code aborts:
>
> Expected:
>     M 100644 :433236 foo/bar/bletch
> Received, something like:
>     M 100644 cd821b4c0ea8e9493069ff43712a0b09 foo/bar/bletch
>
> To correct the issue, I modified git_fast_filter to simply skip these. While we no longer utilize remote submodules, I would prefer not to have them removed.
>
> Any feedback on what the proper behavior would be in the submodule case? Perhaps this is covered in your internal version?

git_fast_filter would need to be modified to handle this kind of
input, create an appropriate object type, and that object type would
need to be able to appropriately output itself later.  Since
submodules haven't really been relevant for me, I've never bothered
implementing this[*].  The assumption that git-fast-export will
produce numeric ids (i.e. that submodules are not present) is somewhat
hardwired in, so it'd take a little bit of refactoring, though
probably not too bad.
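
As a rough idea of the parsing side, the sketch below accepts both forms
of data reference on a filemodify line: a numeric mark (":433236") or a
full 40-hex-digit SHA-1, which is what fast-export emits for a gitlink
(submodule) entry, normally with mode 160000. This is not the actual
git_fast_filter code; FILEMODIFY_RE, parse_filemodify and
format_filemodify are hypothetical names, and the sketch ignores quoted
paths and the "inline" data form.

```python
import re

# Assumption: unquoted paths and a full 40-hex SHA-1; "inline" is not handled.
FILEMODIFY_RE = re.compile(
    rb"^M (?P<mode>[0-7]+) (?P<ref>:\d+|[0-9a-f]{40}) (?P<path>.+)$"
)


def parse_filemodify(line: bytes):
    """Return (kind, mode, ref, path); kind is "mark" or "sha1"."""
    m = FILEMODIFY_RE.match(line.rstrip(b"\n"))
    if m is None:
        raise ValueError("not a recognized filemodify line: %r" % line)
    mode, ref, path = m.group("mode"), m.group("ref"), m.group("path")
    if ref.startswith(b":"):
        # Numeric mark referring to a blob seen earlier in the stream.
        return ("mark", mode, int(ref[1:]), path)
    # Raw object id, e.g. the commit a submodule entry points at.
    return ("sha1", mode, ref, path)


def format_filemodify(kind, mode, ref, path) -> bytes:
    """Write the entry back out in the same form it was read in."""
    if kind == "mark":
        return b"M %s :%d %s\n" % (mode, ref, path)
    return b"M %s %s %s\n" % (mode, ref, path)


if __name__ == "__main__":
    for raw in (
        b"M 100644 :433236 foo/bar/bletch\n",
        b"M 160000 " + b"cd821b4c" * 5 + b" foo/bar/bletch\n",
    ):
        entry = parse_filemodify(raw)
        assert format_filemodify(*entry) == raw
        print(entry[0], entry[3].decode())
```

Passing SHA-1 entries through unchanged like this would keep the
submodule pointers in the rewritten history instead of stripping them,
though the rest of the tool would still need to cope with an entry that
has no mark.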


Elijah

[*] Well, actually we did hit it once somewhat recently when someone
created a commit containing a submodule...and then also immediately
reverted it.  Since we don't want to use submodules, I simply put in a
hack that would recognize them and unconditionally strip them out on
the input parsing end, which sounds like the same thing you did.
That's obviously not what you're asking for.