On Fri, Jun 27, 2014 at 12:55 PM, Jason Pyeron <jpyeron@xxxxxxxx> wrote:
>
> The issue will be, if we talk about changes other than same length substitutions
> (e.g. Down's Syndrome where it has an insertion of code) would require one code
> per line for the diffs to work nicely.

Not my area of expertise, but depending on what you are interested in
(protein encoding, etc.), I really think you don't need to do things
character-per-character. You might want to break at interesting
sequences (a TATA box, and/or known long repeating sequences).

So you could basically turn the "one long line" representation into
multiple lines by looking for particular known interesting (or known
particularly *UN*interesting) patterns. Whenever you see such a pattern,
you start a new line that describes the pattern ("TATAAA" or "run of
128 U"), and then continue on the next line.

Then you diff those "semantically enriched" streams instead of the raw data.

But it probably depends on what you are looking for and at. Sometimes
you might be looking at individual base pairs. And sometimes you might
want to look at the codons, and consider codons that translate to the
same amino acid to be the same, so that they do not show up as a
difference.

So I could well imagine that you might want to have multiple different
ways to generate these diffs. No?

                 Linus
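
As a rough illustration of the two ideas in this message (breaking the one long line at known patterns, and treating synonymous codons as equal), here is a minimal Python sketch. Everything in it is an assumption made for illustration rather than anything proposed on the list: the function names `to_lines`, `codon_lines` and `semantic_diff`, the `MIN_RUN` threshold, the choice of "TATAAA" as the only named motif, and the use of the standard genetic code with a reading frame that starts at the first base. The diff itself uses Python's standard `difflib` module, not git.

```python
import difflib
import re

# Assumed threshold: a single-base run counts as "long" at this length.
MIN_RUN = 8

# Match either the TATA box or a run of at least MIN_RUN identical bases.
_PATTERN = re.compile(r"TATAAA|([ACGTU])\1{%d,}" % (MIN_RUN - 1))

# Standard genetic code, laid out in the conventional T, C, A, G order.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: _AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(_BASES)
    for j, b in enumerate(_BASES)
    for k, c in enumerate(_BASES)
}


def to_lines(seq):
    """Split one long sequence into lines, breaking at each known pattern."""
    lines = []
    pos = 0
    for m in _PATTERN.finditer(seq):
        if m.start() > pos:
            lines.append(seq[pos:m.start()])  # raw data before the pattern
        if m.group(1):
            # A long run is described rather than spelled out.
            lines.append("run of %d %s" % (len(m.group(0)), m.group(1)))
        else:
            lines.append(m.group(0))  # the named motif on its own line
        pos = m.end()
    if pos < len(seq):
        lines.append(seq[pos:])
    return lines


def codon_lines(seq):
    """One amino-acid letter per line, so synonymous codons compare equal.

    Assumes the reading frame starts at the first base; a trailing
    partial codon is ignored.
    """
    seq = seq.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3
    return [CODON_TABLE.get(seq[i:i + 3], "?") for i in range(0, usable, 3)]


def semantic_diff(old, new, view=to_lines):
    """Diff two sequences after converting each with the chosen view."""
    return list(difflib.unified_diff(view(old), view(new),
                                     "old", "new", lineterm=""))


if __name__ == "__main__":
    old = "GGCTATAAACG" + "U" * 10 + "AC"
    new = "GGCTATAAACGA" + "U" * 10 + "AC"
    print("\n".join(semantic_diff(old, new)))
```

Passing `view=codon_lines` to `semantic_diff` gives the second kind of diff: two sequences that differ only in synonymous codons produce no output at all, while the pattern-based view still reports the changed stretch between the motif and the run.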