Re: Tackling Git Limitations with Singular Large Line-separated Plaintext files

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Fri, 27 Jun 2014 13:13:57 -0700

On Fri, Jun 27, 2014 at 12:55 PM, Jason Pyeron <jpyeron@xxxxxxxx> wrote:
>
> The issue will be, if we talk about changes other than same length substitutions
> (e.g. Down's Syndrome where it has an insertion of code) would require one code
> per line for the diffs to work nicely.

Not my area of expertise, but depending on what you are interested in
(protein encoding, etc.), I really think you don't need to do things
character-by-character. You might want to break at interesting
sequences (TATA box, and/or known long repeating sequences).

So you could basically turn the "one long line" representation into
multiple lines, by just looking for particular known interesting (or
known particularly *UN*interesting) patterns, and whenever you see the
pattern you create a new line, describing the pattern ("TATAAA" or
"run of 128 U"), and then continue on the next line.

Then you diff those "semantically enriched" streams instead of the raw data.
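
A rough sketch of that idea in Python, just to make it concrete. The
motif list, the run threshold, and the names enrich() and diff_streams()
are all made up for illustration; which patterns are actually worth
breaking on is a question for somebody who knows the biology.

```python
import difflib
import re

MOTIFS = ["TATAAA"]  # assumed list of "interesting" patterns; must be non-empty
MIN_RUN = 8          # assumed minimum length of a single-base run (>= 2)


def enrich(seq, motifs=MOTIFS, min_run=MIN_RUN):
    """Split one long sequence into lines, breaking at motifs and long runs."""
    motif_alt = "|".join(re.escape(m) for m in motifs)
    # Either a known motif, or one base repeated at least min_run times.
    pattern = re.compile(
        r"(?P<motif>%s)|(?P<run>(?P<base>[ACGTU])(?P=base){%d,})"
        % (motif_alt, min_run - 1)
    )
    lines = []
    pos = 0
    for m in pattern.finditer(seq):
        if m.start() > pos:
            # Raw data between two patterns stays on a line of its own.
            lines.append(seq[pos:m.start()])
        if m.group("motif"):
            lines.append(m.group("motif"))
        else:
            run = m.group("run")
            lines.append("run of %d %s" % (len(run), run[0]))
        pos = m.end()
    if pos < len(seq):
        lines.append(seq[pos:])
    return lines


def diff_streams(old_lines, new_lines):
    """Unified diff of two enriched streams, as a list of strings."""
    return list(difflib.unified_diff(
        old_lines, new_lines, fromfile="old", tofile="new", lineterm=""))


# Made-up data: new_seq has a single base inserted before the long run.
old_seq = "GGCTATAAAGC" + "A" * 128 + "CG"
new_seq = "GGCTATAAAGCT" + "A" * 128 + "CG"

for line in diff_streams(enrich(old_seq), enrich(new_seq)):
    print(line)
```

The point is that the insertion only touches the one line it lands in,
instead of shifting everything after it the way it would in a
fixed-width or single-line layout.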

But it probably depends on what you are looking for and at. Sometimes
you might be looking at individual base pairs. And sometimes maybe you
want to look at the codons, and consider codons that translate to
the same amino acid to be the same, and not show up as a difference.
So I could well imagine that you might want to have multiple different
ways to generate these diffs. No?
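
For the codon case, something like the following Python sketch would
do. It assumes the standard genetic code, a reading frame starting at
the first base, and that leftover bases at the end can be dropped; the
name amino_lines() is made up. Real data would need a smarter choice
of frame.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def amino_lines(seq):
    """One amino-acid letter per codon ('*' is stop, '?' is unknown)."""
    seq = seq.upper().replace("U", "T")  # treat RNA and DNA alike
    usable = len(seq) - len(seq) % 3     # drop an incomplete trailing codon
    return [CODON_TABLE.get(seq[i:i + 3], "?") for i in range(0, usable, 3)]


# GCT and GCC both code for alanine, GAA and GAG both for glutamate,
# so these two sequences produce identical streams and no diff.
same = amino_lines("GCTGAA") == amino_lines("GCCGAG")

# GAA (glutamate) versus GAT (aspartate) is a real difference.
differs = amino_lines("GCTGAA") != amino_lines("GCTGAT")

print(same, differs)
```

Diffing the output of amino_lines() rather than the raw bases hides the
synonymous changes, while a per-base stream of the same data still
shows them.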

               Linus