Re: removing content from git history

Junio C Hamano <junkio@xxxxxxx> · Wed, 21 Feb 2007 11:01:55 -0800

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> writes:

>  - explicit support for "missing objects". We don't do it right now, but 
>    we could add it. It was discussed for things like limited history etc 
>    (the "shallow clone" kind of thing, before people actually added 
>    shallow clones), and it would support the notion of "we export all our 
>    history, but for internal reasons we cannot make certain objects 
>    available" kinds of workflows.
> ...
> But at least in theory, it wouldn't be impossible to extend on the 
> ".git/grafts" kind of setup to say "this object has been consciously 
> deleted", and that could in some circumstances be a better model. The 
> biggest headache there would be the need to extend the native git protocol 
> with a way to add such objects.

While I agree in principle with the argument that there is no
taking back what's already published, I've heard of people
wanting to just stop distributing further, without worrying
about copies already out there.  'missing objects' support would
help us in such a situation.

Supporting 'missing objects' in general would be painful, when
they contain pointers to other objects (i.e. tags, commits, and
trees).

Thinking aloud...

 * missing blob: we can have 'stub blob' objects.  Probably the
   object header for such an object would look like:

	stub <length> NUL
	-----------------
        object <object name of the real blob object>
        type blob

   Hashing a 'stub' object (along with its header as usual, in
   write_sha1_file_prepare()) would instead just report the
   object name recorded there.

   When packing (this applies both to local repacking and
   push/fetch object transfer to other repositories), the stub
   object is included.  The delta algorithm would probably not
   delta other objects against it.

 * missing commit and tag: 'stub object' needs to be extended to
   include these object types, and we would also need 'stub
   commit' and 'stub tag' objects, that copy the structural
   fields from the corresponding true object.  So a stub commit
   would probably look like:

	stub <length> NUL
	-----------------
        object <object name of the real commit object>
        type commit
        tree <object name of the tree contained in the real commit object>
        parent <object name of the first parent in the real commit object>
        parent <object name of the second parent in the real commit object>

 * missing tree would only be useful to conceal pathnames
   recorded in the real tree object.  I am not sure if that is
   needed.

 * fsck and verify-pack need to be taught about 'stub' objects,
   so that they know that their filenames (or the data pointed
   at by pack .idx) do not match the result of hashing them.
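
To make the naming rule above concrete, here is a minimal sketch in
Python (not git code).  object_name() follows the existing rule of
hashing "<type> <length> NUL" plus the body with SHA-1.  The stub body
layout and the helpers make_stub(), parse_stub() and effective_name()
are hypothetical and only follow the format proposed in this message.

```python
import hashlib


def object_name(obj_type: str, body: bytes) -> str:
    """Usual object name: SHA-1 over '<type> <length>' + NUL + body."""
    header = f"{obj_type} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def make_stub(real_type: str, real_body: bytes, structural=()) -> bytes:
    """Build a hypothetical stub body for a real object.

    'structural' is a sequence of (key, value) pairs such as
    ("tree", name) or ("parent", name) copied from a real commit.
    """
    lines = [f"object {object_name(real_type, real_body)}",
             f"type {real_type}"]
    lines += [f"{key} {value}" for key, value in structural]
    return ("\n".join(lines) + "\n").encode()


def parse_stub(body: bytes) -> dict:
    """Parse 'key value' lines; 'parent' may appear more than once."""
    fields = {"parent": []}
    for line in body.decode().splitlines():
        key, _, value = line.partition(" ")
        if key == "parent":
            fields["parent"].append(value)
        else:
            fields[key] = value
    return fields


def effective_name(obj_type: str, body: bytes) -> str:
    """A stub reports the recorded name; anything else is hashed."""
    if obj_type == "stub":
        return parse_stub(body)["object"]
    return object_name(obj_type, body)
```

The point for fsck and verify-pack is that, for a stub,
effective_name() and object_name() disagree by design, so the check
"filename matches hash of contents" has to use the former.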

If we were to do this, I suspect we could implement only
'missing blob' first and cover a lot of ground, but we would
eventually need 'missing commit' to replace real commit objects
that have sensitive information in their log messages.

As Nico pointed out, this has serious security implications.  We
would need a separate list of objects that are Ok to be stubbed
out, probably with an explanation of why they are stubbed out, and
fsck should compare the stub objects found in the repository
against that list.
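
The check itself could be as simple as the following Python sketch.
The list format (one "<object name> <reason>" per line, '#' for
comments) and the function names are assumptions made up for
illustration; nothing like this exists in git today.

```python
def load_allow_list(text: str) -> dict:
    """Parse a hypothetical list file into {object name: reason}."""
    allowed = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, _, reason = line.partition(" ")
        allowed[name] = reason.strip()
    return allowed


def unapproved_stubs(stub_names, allowed: dict) -> list:
    """Return names that are stubbed out but not on the list."""
    return sorted({name for name in stub_names if name not in allowed})
```

fsck would then complain about every name returned by
unapproved_stubs(), since a stub that nobody approved may be an
attempt to hide or replace an object.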
