Re: removing content from git history

On Wed, 21 Feb 2007, J. Bruce Fields wrote:
> 
> Reconstructing history with a bunch of merges seems like something that
> could be a huge pain.  (Though with some tools it might be doable.)

It's not actually that painful, but it *is* expensive.

I wrote git-convert-cache (now "git convert-objects") back when we did the 
SHA1/compression switchover changes and the date format translation, so 
we've actually had a tool that can do history rewriting pretty much since 
day 1 (well, "day 14", to be exact, but still.. April 2005).

BUT:

 - I'm not guaranteeing that it works any more. We haven't changed the 
   fundamental object format since, so that particular program has never 
   gotten any testing. It still compiles, but does it work? I dunno.

   I actually tested it on git itself. It converted the top of the git 
   tree successfully, and generated a *new* git history. Why? Because it 
   will actually rewrite the old git tree entries that have permission 
   0664 into 0644: the *data* will be identical (and no git tools except 
   for "git fsck --strict" will even notice the difference), but the 
   converted tree avoids one of the legacy decisions that we never fixed 
   in the git repository itself.

   So it works at least to *some* degree, but I would suggest you be very 
   very careful!

 - it can be slow. For something like git, which isn't *that* big, and 
   where we actually don't need to do a lot of rewriting (ie all the blobs 
   stay the same, and only a few trees have to be rewritten, and so it's 
   really just rewriting commits), it's not that bad. It actually 
   converted the whole git history in less than ten seconds for me.

   But if you have a *huge* tree, and you actually convert objects too 
   (say, you started using git on Windows before the "autocrlf" thing, and 
   want to convert the old blobs from CRLF -> LF), it would

    (a) require some extensions to convert-objects.c to do the blob 
        conversion
    (b) be *much* slower
    (c) generate tons of unpacked objects (because git-convert-objects 
        doesn't know to pack in between, and doesn't use anything 
        newfangled like "git-fast-import" to do anything clever)

   For the kernel, it took 2 minutes, but again, it was exactly the same 
   thing: just a few old tree objects that it rewrote, and as a result, 
   every single commit SHA1 changed. Still, it was almost _only_ commits 
   (it generated 49521 new objects, 49332 of which were the new commit 
   history).

   If you want to rewrite a *lot* (ie something that exists in more than 
   just a few trees), and you have lots of history, it can be very 
   expensive indeed.

 - It currently doesn't convert the SHA1 numbers that show up in commit 
   messages. It could, and it should. But it doesn't. So once you convert 
   a git project, it doesn't do the nice "gitk does links from the SHA1 
   text in a commit message to the commit it talks about" any more.

   Somebody should fix that.
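
For illustration only, here is roughly what that message fix-up could look 
like. This is a Python sketch, not code from git-convert-objects: the 
function name and the old-to-new mapping are hypothetical, and it assumes 
the converter has recorded a mapping from each old SHA1 to its converted 
one. It only handles full 40-character SHA1s, not abbreviated ones.

```python
import re

# A full SHA1 written out as 40 lowercase hex digits, on word boundaries.
SHA1_RE = re.compile(r"\b[0-9a-f]{40}\b")


def rewrite_message(message, old_to_new):
    """Replace each full SHA1 in a commit message that has a known conversion.

    old_to_new is a hypothetical dict built up during conversion, mapping the
    old hex SHA1 to the new one. SHA1s not in the mapping are left untouched.
    """
    def substitute(match):
        old = match.group(0)
        return old_to_new.get(old, old)

    return SHA1_RE.sub(substitute, message)


if __name__ == "__main__":
    old = "a" * 40
    new = "b" * 40
    print(rewrite_message("This reverts commit " + old + ".", {old: new}))
```

The mapping has to be complete for the commits a message refers to, which 
works out naturally if commits are converted parents-first: anything a 
message mentions is normally older than the commit that mentions it.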

Anyway, git-convert-objects does kind of give you a starting point. It 
should be fixed to use "git-fast-import" or repack once in a while (so 
that it doesn't leave tons and tons of unpacked objects), and it should be 
fixed to fix up any commit messages that mention SHA1's that it has 
already converted to something else, but it seems to still work. It would 
not be impossible at all to extend the tree-rewriting logic to remove some 
file or a particular SHA1 object you want to replace.
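
To make the tree-rewriting idea concrete, here is a small Python sketch 
(again, not the actual convert-objects code; all function names are made 
up for the example). It assumes the usual raw tree layout of 
"<mode> <name>\0<20-byte SHA1>" entries, rewrites mode 100664 to 100644, 
and optionally drops entries by name. It handles a single tree only: a 
real converter would recurse into subtrees and then rewrite every commit 
that points at a changed tree.

```python
import hashlib


def hash_object(kind, body):
    """Hex SHA1 of a git object: '<kind> <size>\\0' header plus the body."""
    header = kind.encode() + b" " + str(len(body)).encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def parse_tree(body):
    """Split a raw tree body into (mode, name, raw 20-byte SHA1) tuples."""
    entries = []
    pos = 0
    while pos < len(body):
        space = body.index(b" ", pos)
        nul = body.index(b"\0", space)
        mode = body[pos:space]
        name = body[space + 1:nul]
        sha = body[nul + 1:nul + 21]
        entries.append((mode, name, sha))
        pos = nul + 21
    return entries


def build_tree(entries):
    """Serialize (mode, name, raw SHA1) tuples back into a raw tree body."""
    return b"".join(mode + b" " + name + b"\0" + sha
                    for mode, name, sha in entries)


def rewrite_tree(body, drop_names=()):
    """Return a new tree body with legacy modes fixed and some names removed."""
    out = []
    for mode, name, sha in parse_tree(body):
        if name in drop_names:
            continue              # remove the unwanted file from this tree
        if mode == b"100664":
            mode = b"100644"      # normalize the legacy group-writable mode
        out.append((mode, name, sha))
    return build_tree(out)


if __name__ == "__main__":
    blob = bytes.fromhex(hash_object("blob", b"hello\n"))
    old = build_tree([(b"100664", b"README", blob),
                      (b"100644", b"secret.txt", blob)])
    new = rewrite_tree(old, drop_names={b"secret.txt"})
    print(hash_object("tree", old))
    print(hash_object("tree", new))
```

The blob SHA1s are carried over untouched, which is why this kind of 
rewrite is cheap compared to converting blob contents. The tree SHA1 does 
change, though, and that is what forces every commit on top of it to be 
rewritten as well.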

			Linus