Re: Replacing large blobs in git history

On 03/06/2012 05:09 PM, Barry Roberts wrote:
> I started this question on #git last week, but this is getting long,
> and things have changed some, so I'm going to try here.
> 
> I had a third-party jar file checked in to our git repository.  It was
> about 4 MB, so no big deal.  Then about 17 months ago somebody checked
> in a 550 MB version.  There were several versions of the original file
> in several different directories.  The large version replaced the
> small version in some of those directories (but not all of them).
> Then somebody found a "small" version that was only 110 MB and
> replaced some of the 550 MB files and some of the old 4 MB files.
> Finally, several months after that, we got the correct updated 5 MB
> latest version.  But I'm still carrying around an extra 660 MB in my
> object database, and we are adding developers and moving to an
> off-site location with lower bandwidth and higher latency, so I would
> like to clean this up.
> 
> My first attempt just removed the blob (by hash ID).  It's been over a
> year since the small correct file was checked in, so the odds of ever
> needing to build anything that old are very slim. But after thinking
> about it some, I came up with this to replace the blob with the
> correct one and wanted to see if this is a reasonable way to do this
> before I actually back up and then replace my central git repository.
> 
> git filter-branch --index-filter '
>   killem=$(git ls-files --stage | grep 7a36af54a6c47\\\|abe809091bcb3)
>   if [ -n "$killem" ]; then
>     git ls-files --stage | grep 7a36af54a6c47\\\|abe809091bcb3 |
>       sed -f /home/blr/tmp/chgblob.sed |
>       git update-index --index-info
>   fi'
> 
> chgblob.sed looks like this:
> s/7a36af54a6c47a29eb9690caefa132489d39c4d0/8924ef0f78b3d09957a8697ca93cce6700771071/g
> s/abe809091bcb37a06284f8353366074622d72373/8924ef0f78b3d09957a8697ca93cce6700771071/g
> 
> 7a36af is the 550 MB blob, abe80909 is the 110 MB one, and 8924ef0f is the
> 5 MB new version.

You could use "git replace" to cause the bad blobs to be replaced
everywhere they appear:

    $ git replace 7a36af54a6c47a29eb9690caefa132489d39c4d0 \
                  8924ef0f78b3d09957a8697ca93cce6700771071
    $ git replace abe809091bcb37a06284f8353366074622d72373 \
                  8924ef0f78b3d09957a8697ca93cce6700771071
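
If you want to see the effect before touching the real repository, a
sketch along the following lines may help.  It builds a throwaway
repository with two tiny stand-in blobs (the contents, the variable
names and the temporary directory are made up for illustration), replaces
one with the other, and then reads the "bad" blob both with and without
replacement.  Only "git hash-object", "git replace", "git cat-file" and
the "--no-replace-objects" option are taken from git itself.

```shell
#!/bin/sh
# Illustrative only: stand-in blobs in a scratch repository, not the real jars.
set -e
demo=$(mktemp -d)
cd "$demo"
git init -q .

# Write two blobs straight into the object database.
big=$(printf 'pretend this is the 550 MB jar\n' | git hash-object -w --stdin)
small=$(printf 'small jar\n' | git hash-object -w --stdin)

# Ask git to substitute $small whenever $big is read.
git replace "$big" "$small"

# List the replacement refs; this prints the name of the replaced object.
git replace -l

# With replacement active, reading $big yields the contents of $small.
seen=$(git cat-file blob "$big")

# With replacement disabled, the original contents are still there.
raw=$(git --no-replace-objects cat-file blob "$big")

echo "with replace:    $seen"
echo "without replace: $raw"
```

The second read shows that the original object has not gone anywhere:
"git replace" only changes what git returns when the object is looked up.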

Then you could use "git filter-branch" to "bake in" the substitutions
(but please see the caveats mentioned by Neal).

An alternative to using "git filter-branch" would be to share the
"git replace" references across repositories.  This would make the
small version of the file appear wherever it should without requiring
history to be rewritten.  But I don't believe that this approach would
allow the large versions of the file to be discarded by the git garbage
collector, so it would not help you reduce clone sizes.
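
Replacement references live under refs/replace/, which the default
fetch refspec does not cover, so sharing them takes an explicit refspec.
The following is a rough sketch of that round trip using three scratch
repositories; the names "alice", "bob" and "central.git" and the blob
contents are invented for the example, and in practice you would push to
and fetch from your real central repository instead.

```shell
#!/bin/sh
# Illustrative only: scratch repositories standing in for two developers
# and a central server.
set -e
demo=$(mktemp -d)
cd "$demo"
git init -q --bare central.git
git init -q alice
git init -q bob

# Alice creates the replacement, as in the earlier commands.
cd "$demo/alice"
big=$(printf 'stand-in for the oversized jar\n' | git hash-object -w --stdin)
small=$(printf 'stand-in for the correct jar\n' | git hash-object -w --stdin)
git replace "$big" "$small"

# Publish every replacement ref to the central repository.
git push -q ../central.git 'refs/replace/*:refs/replace/*'

# Bob has to ask for them explicitly; a plain fetch would not bring them.
cd "$demo/bob"
git fetch -q ../central.git 'refs/replace/*:refs/replace/*'

# Bob now has the same replacement ref that Alice created.
listed=$(git replace -l)
echo "bob's replacement refs: $listed"
```

Each developer would need to run the fetch (or add the refspec to the
remote's configuration) for the substitution to take effect locally.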

Michael

-- 
Michael Haggerty
mhagger@xxxxxxxxxxxx
http://softwareswirl.blogspot.com/