Re: Tackling Git Limitations with Singular Large Line-separated Plaintext files

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Fri, 27 Jun 2014 12:38:49 -0700

On Fri, Jun 27, 2014 at 10:48 AM, Junio C Hamano <gitster@xxxxxxxxx> wrote:
>
> Even though the original question mentioned "delta discovery", I
> think what was being asked is not "delta" in the Git sense (which
> your answer is about) but is "can we diff two long sequences of text
> (that happen to consist of only a 4-letter alphabet, but that is an
> irrelevant detail) without holding both in-core in their entirety?",
> which is a more relevant question/desire from the application point
> of view.

.. even there, there's another issue. With enough memory, the diff
itself should be fairly reasonable to do, but we do not have any sane
*format* for diffing those kinds of things.

The regular textual diff is line-based, and is not amenable to
comparing two long lines. You'll just get a diff that says "the two
really long lines are different".
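
To illustrate the point, here is a minimal sketch using Python's
standard difflib (not git's own diff code). The two "sequences" are
made-up single-line strings that differ in one character; a line-based
diff can only report that the whole line changed.

```python
import difflib


def line_diff(a: str, b: str) -> list[str]:
    """Return a unified diff of two texts, compared line by line."""
    return list(difflib.unified_diff(
        a.splitlines(keepends=True),
        b.splitlines(keepends=True),
        fromfile="a",
        tofile="b",
    ))


# Hypothetical data: one long line each, differing in a single letter.
old = "ACGT" * 10 + "\n"
new = "ACGT" * 5 + "ACCT" + "ACGT" * 4 + "\n"

diff = line_diff(old, new)
# The hunk body is just "-<entire old line>" and "+<entire new line>";
# nothing in the output says where inside the line the change is.
print("".join(diff), end="")
```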

The binary diff option should work, but it is a horrible output
format, and not very helpful. It contains all the relevant data ("copy
this chunk from here to here"), but it's then shown in a binary
encoding that isn't really all that useful if you want to say "what
are the differences between these two chromosomes".

I think it might be possible to just specify a special diff algorithm
(git already supports that, obviously), and just introduce a new "use
binary diffs with a textual representation" model.
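
A rough sketch of what such a textual representation could look like,
assuming a simple "copy <offset> <length>" / "insert <literal>"
vocabulary. The op names and the helpers text_delta() and apply_delta()
are invented here for illustration; this is not git's delta format or
algorithm, and difflib.SequenceMatcher would not scale to inputs
anywhere near 1GB.

```python
import difflib


def text_delta(base: str, target: str) -> list[str]:
    """Describe target as copies from base plus literal inserts."""
    ops = []
    matcher = difflib.SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Unchanged run: reference it by offset and length in base.
            ops.append(f"copy {i1} {i2 - i1}")
        elif tag in ("replace", "insert"):
            # New material: carry the literal text from target.
            ops.append(f"insert {target[j1:j2]}")
        # "delete" needs no op: that part of base is simply never copied.
    return ops


def apply_delta(base: str, ops: list[str]) -> str:
    """Rebuild the target from base and the ops from text_delta()."""
    out = []
    for op in ops:
        kind, _, rest = op.partition(" ")
        if kind == "copy":
            offset, length = map(int, rest.split())
            out.append(base[offset:offset + length])
        else:
            out.append(rest)
    return "".join(out)


# Hypothetical data: two letters inserted in the middle of a long line.
base = "ACGT" * 10
target = base[:18] + "TT" + base[18:]

ops = text_delta(base, target)
for op in ops:
    print(op)
```

The output is readable as "copy this chunk, insert these letters,
copy the rest", and apply_delta(base, ops) gives back the target, so
the textual form carries the same information a binary delta would.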

But it also sounds like there might be some actual performance problem
with these 1GB file delta-calculations. Which I wouldn't be surprised
about at all.

Jarrad - is there any public data you could give as an example and for
people to play with?

                Linus