Re: [PATCH] Documentation/git-bundle.txt: Dumping contents of any bundle

Jeff King <peff@xxxxxxxx> · Fri, 2 Jan 2009 03:27:09 -0500

On Thu, Jan 01, 2009 at 11:15:19PM -0800, Shawn O. Pearce wrote:

> > OK, I wish you luck in the fruition of the new --dump-delta option, and
> > can proofread the man pages involved, otherwise this is no area for
> > junior programmer me.
> 
> This is rather insane.  There's very little data inside of a delta.
> That's sort of the point of that level of compression, it takes
> up very little disk space and yet describes the change made.
> Almost nobody is going to want the delta without the base object
> it applies onto.  No user of git is going to need that.  I'd rather
> not carry dead code around in the tree for something nobody will
> ever use.

I somewhat agree. Obviously we can come up with contrived cases where
the delta is a pure "add" and this option magically lets you recover
some text via "strings" on the resulting delta dump. But in practice,
it's hard to say exactly how useful it would be, especially since the
"motivation" here seems to be more academic than any actual real-world
problem. We can approximate a real-world case with something like:

  git clone git://git.kernel.org/pub/scm/git/git.git
  cd git
  git bundle create ../bundle.git v1.6.0..v1.6.1
  mkdir ../broken && cd ../broken
  sed '/^PACK/,$!d' ../bundle.git >pack
  git init
  git unpack-objects --dump-deltas <pack
  strings .git/lost-found/delta/* | less

where maybe you lost your actual repository, but you still have a backup
of a bundle you sneaker-netted between major versions. In this instance
we have 6000 objects in the bundle, 2681 of which are blobs (and
therefore presumably the most interesting things to recover). Of those,
1070 are non-delta and can be recovered completely. For the remaining
1611, our strings command shows us snippets of what was there. There are
definitely recognizable pieces of code. But likewise there are pieces of
code that are missing subtle parts. E.g.:

                  if (textconv_one) {
                        size_t size;
                        mf1.ptr = run_textconv(textconv_one, one, &size);
                        if (!mf1.
ptr)
                        mf1.size = size;
                if (textconv_two) {
                        size_t size;
                        mf2.ptr = run_textconv(textconv_two, two, &size);
                        if (!mf2.
ptr)
                        mf2.size = size;

So while there is _something_ to be recovered there, it is basically as
easy to rewrite the code as it is to piece together whatever fragments
are available into something comprehensible.
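
The reason "strings" turns up anything at all is that a delta is a list
of copy and insert instructions, and only the inserts carry literal
bytes from the new version; everything covered by a copy lives in the
missing base. A minimal sketch in Python of pulling out just the insert
data, assuming the dump holds the inflated delta in git's usual pack
delta encoding (the function names and the hand-built sample delta are
made up for illustration, not taken from the patch):

```python
def read_varint(buf, pos):
    """Decode one little-endian base-128 size from the delta header."""
    value = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            return value, pos


def literal_chunks(delta):
    """Return the byte strings carried by the delta's insert instructions."""
    _base_size, pos = read_varint(delta, 0)
    _result_size, pos = read_varint(delta, pos)
    chunks = []
    while pos < len(delta):
        cmd = delta[pos]
        pos += 1
        if cmd & 0x80:
            # Copy from base: each set bit in the low 7 bits flags one
            # offset or size byte that follows. Nothing recoverable here.
            pos += bin(cmd & 0x7F).count("1")
        elif cmd:
            # Insert: the next `cmd` bytes are literal data.
            chunks.append(delta[pos:pos + cmd])
            pos += cmd
        else:
            raise ValueError("reserved delta opcode 0")
    return chunks


# Hypothetical delta: base size 10, result size 11,
# copy 5 bytes from base offset 0, then insert b"return".
delta = bytes([10, 11, 0x90, 5, 6]) + b"return"
print(literal_chunks(delta))
```

Only the six inserted bytes come back; the five copied bytes are just
an offset and a length pointing into a base we no longer have, which is
why the recovered code above has holes in it.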

So in practice, the delta dump would only be useful if:

  1. You have an incomplete thin pack, which generally means you are
     using bundles (or you interrupted a fetch and kept the tmp_pack).

  2. There is _no_ other copy of the basis. The results you get from
     this method are so awful that it should really only be last-ditch.
     I think you would be insane to say "Oh, I don't have net access
     right now. Let me just spend hours picking through these deltas to
     find a scrap of something useful instead of just waiting until I
     get access again."

  3. The changes in the pack tend to produce deltas rather than full
     blobs, but the deltas tend to be very add-heavy.

I don't know how popular bundles are, but I would expect (1) puts us
very much in the minority. On top of that, given the nature of git, I
find (2) to be pretty unlikely. If you're sneaker-netting data with a
bundle, then it seems rare that both ends of the net will be lost at
once. As for (3), it seems source code is not a good candidate here.
Perhaps if you were writing a novel in a single file, you might salvage
whole paragraphs or even chapters.
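
One rough way to put a number on "add-heavy" in (3) is the share of the
result's bytes that come from insert instructions rather than copies. A
sketch in Python, assuming an inflated delta in git's pack delta
encoding as input (insert_fraction, read_size and the sample deltas are
hypothetical names and data, not anything in git):

```python
def read_size(buf, pos):
    """Decode one little-endian base-128 size from the delta header."""
    value = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            return value, pos


def insert_fraction(delta):
    """Fraction of the result's bytes supplied by insert instructions."""
    _base_size, pos = read_size(delta, 0)
    result_size, pos = read_size(delta, pos)
    inserted = 0
    while pos < len(delta):
        cmd = delta[pos]
        pos += 1
        if cmd & 0x80:
            # Copy from base: skip the flagged offset/size bytes.
            pos += bin(cmd & 0x7F).count("1")
        elif cmd:
            # Insert: `cmd` literal bytes follow and survive in the dump.
            inserted += cmd
            pos += cmd
        else:
            raise ValueError("reserved delta opcode 0")
    return inserted / result_size if result_size else 0.0


pure_add = bytes([10, 5, 5]) + b"hello"      # result is all inserted text
pure_copy = bytes([10, 10, 0x90, 10])        # result is all copied from base
print(insert_fraction(pure_add), insert_fraction(pure_copy))
```

A delta near 1.0 is the novel-in-a-single-file case; typical source
code edits would be expected to sit much closer to 0.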

So I am inclined to leave it as-is: a patch in the list archive. If and
when the day comes when somebody loses some super-important data and
somehow matches all of these criteria, then they can consult whatever
aged and senile git gurus still exist to pull the patch out and see if
anything can be recovered.

-Peff