RE: Tackling Git Limitations with Singular Large Line-separated Plaintext files

"Jason Pyeron" <jpyeron@xxxxxxxx> · Fri, 27 Jun 2014 15:55:59 -0400

> -----Original Message-----
> From: Linus Torvalds
> Sent: Friday, June 27, 2014 15:39
> 
> On Fri, Jun 27, 2014 at 10:48 AM, Junio C Hamano 
> <gitster@xxxxxxxxx> wrote:
> >
> > Even though the original question mentioned "delta discovery", I
> > think what was being asked is not "delta" in the Git sense (which
> > your answer is about) but is "can we diff two long sequences of text
> > (that happens to consist of only a 4-letter alphabet, but that is an
> > irrelevant detail) without holding both in-core in their entirety?",
> > which is a more relevant question/desire from the application point
> > of view.
> 
> .. even there, there's another issue. With enough memory, the diff
> itself should be fairly reasonable to do, but we do not have any sane
> *format* for diffing those kinds of things.
> 
> The regular textual diff is line-based, and is not amenable to
> comparing two long lines. You'll just get a diff that says "the two
> really long lines are different".
> 
> The binary diff option should work, but it is a horrible output
> format, and not very helpful. It contains all the relevant data ("copy
> this chunk from here to here"), but it's then shown in a binary
> encoding that isn't really all that useful if you want to say "what
> are the differences between these two chromosomes".
> 
> I think it might be possible to just specify a special diff algorithm
> (git already supports that, obviously), and just introduce a new "use
> binary diffs with a textual representation" model.
> 
> But it also sounds like there might be some actual performance problem
> with these 1GB file delta-calculations. Which I wouldn't be surprised
> about at all.
> 
> Jarrad - is there any public data you could give as an example and for
> people to play with?
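
To make the "binary diffs with a textual representation" idea a bit more
concrete, here is a rough sketch. It uses Python's difflib, not anything git
ships, and the copy/insert/delete wording and the textual_delta() name are made
up for illustration. It also holds both sequences in memory and is far too slow
for 1GB inputs, so it only shows what the output could look like.

```python
from difflib import SequenceMatcher

def textual_delta(old, new):
    """Describe how to turn `old` into `new` as readable copy/insert/delete ops.

    Hypothetical output format, for illustration only.
    """
    ops = []
    # autojunk=False: do not let difflib treat frequent characters as junk,
    # which matters for a 4-letter alphabet on longer inputs.
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(f"copy {i1}..{i2}")
        if tag in ("delete", "replace"):
            ops.append(f"delete {i1}..{i2}")
        if tag in ("insert", "replace"):
            ops.append(f"insert {new[j1:j2]!r}")
    return ops

for op in textual_delta("GATTACAGATTACA", "GATTACATGATTACA"):
    print(op)
```

The offsets refer to positions in the old sequence, so an insertion shows up as
a single "insert" op between two "copy" ops instead of "the two really long
lines are different".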

Until Jarrad replies, see the sample sequence formats here:
http://www.genomatix.de/online_help/help/sequence_formats.html

The issue is that any change other than a same-length substitution (e.g. Down's
Syndrome, where there is an insertion of code) shifts everything after it, so
the file would need one code per line for the diffs to work nicely.
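
A quick sketch of the one-code-per-line idea, using Python's difflib as a
stand-in for a line-based diff. The sequences and the sample.seq file name are
made up, and one_code_per_line() is just a helper for this example:

```python
import difflib

def one_code_per_line(sequence):
    """Split a sequence into lines holding a single code each."""
    return [code + "\n" for code in sequence if not code.isspace()]

old = "GATTACAGATTACA"
new = "GATTACATGATTACA"  # hypothetical: one "T" inserted in the middle

# A line-based diff now sees the insertion as one added line.
diff = list(difflib.unified_diff(
    one_code_per_line(old), one_code_per_line(new),
    fromfile="a/sample.seq", tofile="b/sample.seq", n=2))
print("".join(diff), end="")
```

With the whole sequence on a single line, the same change would show up as one
removed line and one added line, each containing the entire sequence.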

-Jason

--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
-                                                               -
- Jason Pyeron                      PD Inc. http://www.pdinc.us -
- Principal Consultant              10 West 24th Street #100    -
- +1 (443) 269-1555 x333            Baltimore, Maryland 21218   -
-                                                               -
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
This message is copyright PD Inc, subject to license 20080407P00.
