Re: Replacing large blobs in git history

Neal Kreitzinger <nkreitzinger@xxxxxxxxx> · Tue, 06 Mar 2012 14:49:49 -0600

On 3/6/2012 10:09 AM, Barry Roberts wrote:
I started this question on #git last week, but this is getting long,
and things have changed some, so I'm going to try here.

I had a third-party jar file checked in to our git repository.  It was
about 4 MB, so no big deal.  Then about 17 months ago somebody
checked in a 550 MB version.  There were several versions of the
original file in several different directories.  The large version
replaced the small version in some of those directories (but not all
of them). Then somebody found a "small" version that was only 110 MB
and replaced some of the 550 MB files and some of the old 4 MB
files. Finally, several months after that, we got the correct, updated
5 MB latest version.  But I'm still carrying around an extra 660 MB in
my object database, and we are adding developers and moving to an
off-site location with lower bandwidth and higher latency, so I
would like to clean this up.

My first attempt just removed the blob (by hash ID).  It's been over
a year since the small correct file was checked in, so the odds of
ever needing to build anything that old are very slim. But after
thinking about it some, I came up with this to replace the blob with
the correct one and wanted to see if this is a reasonable way to do
this before I actually back up and then replace my central git
repository.

git filter-branch --index-filter 'killem=$(git ls-files --stage |
grep 7a36af54a6c47\\\|abe809091bcb3 ) ; if [ -n "$killem" ] ; then
git ls-files --stage | grep 7a36af54a6c47\\\|abe809091bcb3 |
sed -f /home/blr/tmp/chgblob.sed | git update-index --index-info ; fi'

chgblob.sed looks like this:
s/7a36af54a6c47a29eb9690caefa132489d39c4d0/8924ef0f78b3d09957a8697ca93cce6700771071/g
s/abe809091bcb37a06284f8353366074622d72373/8924ef0f78b3d09957a8697ca93cce6700771071/g

7a36af is the 550 MB blob, abe80909 is the 110 MB one, and 8924ef0f is
the new 5 MB version.
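
To make the substitution concrete, here is a minimal sketch of what the
sed step does to a single index entry.  It assumes the "mode, object,
stage, TAB, path" layout that 'git ls-files --stage' prints and that
'git update-index --index-info' accepts.  The function name
rewrite_blob_ids and the path lib/vendor.jar are made up for
illustration; only the three hashes come from the message above.

```shell
# The two oversized blobs and the 5 MB replacement, as named in the message.
OLD_550=7a36af54a6c47a29eb9690caefa132489d39c4d0
OLD_110=abe809091bcb37a06284f8353366074622d72373
NEW_5=8924ef0f78b3d09957a8697ca93cce6700771071

# Hypothetical helper: read index entries on stdin and swap either old
# object ID for the new one. The surrounding spaces keep the match on the
# object-ID field so a path that happens to contain a hash is left alone.
rewrite_blob_ids() {
    sed -e "s/ $OLD_550 / $NEW_5 /" -e "s/ $OLD_110 / $NEW_5 /"
}

# One sample entry (illustrative path). The mode, stage and path pass
# through unchanged; only the object ID is replaced.
printf '100644 %s 0\tlib/vendor.jar\n' "$OLD_550" | rewrite_blob_ids
```

Unlike chgblob.sed, this sketch anchors each pattern on the spaces
around the object ID; that is a design choice of the sketch, not
something the original script does.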

This isn't extremely efficient since it does the 'git ls-files
--stage' twice (once to see if the blob is used, then again to
change it ONLY if the blob is referenced in the current index).  But
that only adds a few seconds to the 28 minute runtime, so I'm not
too worried about that.  And yes, I could just check for the return
value of grep, but I did echo $killem while I was debugging and that
was useful, so I just left it like that.
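
For comparison, a single-pass variant might look like the sketch
below: the listing is filtered once, the matches are kept in $killem
for debugging, and grep's exit status decides whether anything needs
rewriting.  The helper name entries_to_fix and the sample paths are
hypothetical, and the sample listing stands in for the real
'git ls-files --stage' output.

```shell
# Hypothetical helper: keep only entries whose object ID starts with one
# of the two abbreviated hashes used in the filter above. grep exits
# non-zero when nothing matches.
entries_to_fix() {
    grep -E '^[0-7]+ (7a36af54a6c47|abe809091bcb3)'
}

# Stand-in for `git ls-files --stage` output (paths are illustrative).
listing=$(printf '%s\t%s\n' \
    '100644 7a36af54a6c47a29eb9690caefa132489d39c4d0 0' 'lib/a.jar' \
    '100644 8924ef0f78b3d09957a8697ca93cce6700771071 0' 'lib/b.jar')

# The assignment takes the exit status of the pipeline, so one pass both
# captures the matches and tells us whether there were any.
if killem=$(printf '%s\n' "$listing" | entries_to_fix); then
    printf '%s\n' "$killem"   # still available for debugging
    # In the real filter, $killem would be piped through
    # sed -f chgblob.sed and then into git update-index --index-info.
fi
```

Here only the lib/a.jar entry is printed, since it is the only one
that references an oversized blob.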

Does this look like a reasonable way to accomplish what I'm trying
to do, or am I doing something that's going to cause grief later?

Be aware that you are rewriting history.  I assume this is published
history that you are going to run filter-branch on.  That means everyone 
who cloned from the old history (pre-filter-branch), not to mention 
those who also have WIP based on the old history, will need to somehow 
adjust to the new history.  How do you plan on addressing that?  (see 
git-rebase manpage section "recovering from upstream rebase" for more 
info on the implications of rewriting history.)

(I have never done filter-branch, and am not an expert on git, but do 
find this subject relevant to normal use of git.)

v/r,
neal