On Mon, Jan 12, 2009 at 6:42 PM, Shawn O. Pearce <spearce@xxxxxxxxxxx> wrote: > >> I largely ported the cgit blame algorithm described here >> https://kerneltrap.org/mailarchive/git/2006/10/12/224187 , the >> relevant file is builtin-blame.c cf. >> http://repo.or.cz/w/git.git?a=blob;f=builtin-blame.c;hb=HEAD > > OK. Fortunately Junio has largely given his blessing to this code > being converted to Java under the BSD license. Glad to hear that, Dscho already mentioned that a port like this might raise license concerns. So I hope Junio would be okay with that. > Unfortunately I can't reach incava.org to download java-diff, but > its entry on sourceforge says it uses an LGPL license. Within JGit > we do want to stick to BSD and BSD dependencies, so bringing in > java-diff as a new dependency isn't something we really want to do. > > Fortunately Dscho has been working on a Java diff implementation for > JGit, and is considering releasing it under a BSD license so we can > include it. > > The interface is still to be decided, but in general we have > found that running against a byte[] is much faster than running > against String. We're probably going to want the diff library to > take a byte[] and an IntList (as created by RawParseUtils.lineMap) > and let it work against the byte array ranges directly, without any > additional allocations, except where it needs to build up its hash > of records. That sounds like a good plan. For now I am not all that concerned about performance myself (premature optimization and all that), but in the long run - and especially with rename/copy detection that will definitely a factor for usability. We could adopt the interface to allow for a "fast" implementation, and have existing diff implementations adapted to that. I'd be very interested to see Dscho's diff work, and maybe we can get something going on that front. A patience diff implementation would be rather neat too, and all my googling hasn't turned up anything. > I think eventually we'll have a BSD licensed LCS that will come with > JGit, which would eliminate the need for such a shim. I'd like > to see that happen before blame gets added, but if blame is ready > and the shim is reasonably well done, I'm inclined to bring blame > in early. While trying to look up the Myers diff algorithm I found a diff implementation in Apache wicket (cf. http://wicket.sourceforge.net/apidocs/wicket/util/diff/myers/package-summary.html ). This one is under an Apache license, is that any better? It's truly kind of sad that you need a degree in law these days to get any work done in this license jungle. I just happen to strongly oppose the reinvention of circular transportation-enabling devices... > - Don't use @author tags, we don't use them anywhere else. > > - Please include copyright headers on all new files. > > - I think the IDiff, IDifference should be in a ".diff" package so > we can reuse them for other diff applications. Especially if we > wind up using a shim to link to different LCS implementations. > > - I think the API for BlameEngine should be more like: > > public class BlameEngine { > private final RevWalk rw; > > public BlameEngine(final Repository repo) { > this(new RevWalk(repo)); > } > > public BlameEngine(final RevWalk walk) { > rw = walk; > } > > public RevWalk getRevWalk() { > return rw; > } > > public List<BlameEntry> blame(RevCommit c, String path); > } > > One reason for this style of API is we can have the engine cache > blame records. This can make a GUI navigator much more efficient > as it jumps around between commits and asks for blames over the > same data. > > - Internally you should *always* use RevCommit, not Commit. > The RevCommit class can be orders of magnitude faster than Commit. > > - Internally you should always use RevTree and TreeWalk to locate > blob IDs. Its orders of magnitude faster than Tree and TreeEntry. > > - Don't use new String(loader.getBytes()) (e.g. in Origin.getData) > to get the data for a file. We want to do raw diffs, so that the > character encoding is considered as part of the blame. E.g. if > a file switches character encodings, the blame has to go to the > commit that introduced the new encoding. This is a huge reason > to use byte[] and IntList over an array of String. > > - RawParseUtils.lineMap will be faster than a Pattern of "^(.*)$". > Thanks for the comments. This first version was basically just a proof-of-concept to see if and how it could work. I didn't want to pour too much effort into, before knowing if this project was viable at all. This input is very valuable and I will try to incorporate the suggestions into future revisions. I'll keep you posted on new developments - Manuel -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html