Re: [JGIT] Blame functionality for jgit

"Manuel Woelker" <manuel.woelker@xxxxxxxxx> · Mon, 12 Jan 2009 22:17:47 +0100

On Mon, Jan 12, 2009 at 6:42 PM, Shawn O. Pearce <spearce@xxxxxxxxxxx> wrote:
>
>> I largely ported the cgit blame algorithm described here
>> https://kerneltrap.org/mailarchive/git/2006/10/12/224187 , the
>> relevant file is builtin-blame.c cf.
>> http://repo.or.cz/w/git.git?a=blob;f=builtin-blame.c;hb=HEAD
>
> OK.  Fortunately Junio has largely given his blessing to this code
> being converted to Java under the BSD license.

Glad to hear that, Dscho already mentioned that a port like this might
raise license concerns. So I hope Junio would be okay with that.

> Unfortunately I can't reach incava.org to download java-diff, but
> its entry on sourceforge says it uses an LGPL license.  Within JGit
> we do want to stick to BSD and BSD dependencies, so bringing in
> java-diff as a new dependency isn't something we really want to do.
>
> Fortunately Dscho has been working on a Java diff implementation for
> JGit, and is considering releasing it under a BSD license so we can
> include it.
>
> The interface is still to be decided, but in general we have
> found that running against a byte[] is much faster than running
> against String.  We're probably going to want the diff library to
> take a byte[] and an IntList (as created by RawParseUtils.lineMap)
> and let it work against the byte array ranges directly, without any
> additional allocations, except where it needs to build up its hash
> of records.

That sounds like a good plan. For now I am not all that concerned
about performance myself (premature optimization and all that), but in
the long run - and especially with rename/copy detection that will
definitely a factor for usability. We could adopt the interface to
allow for a "fast" implementation, and have existing diff
implementations adapted to that. I'd be very interested to see Dscho's
diff work, and maybe we can get something going on that front. A
patience diff implementation would be rather neat too, and all my
googling hasn't turned up anything.

> I think eventually we'll have a BSD licensed LCS that will come with
> JGit, which would eliminate the need for such a shim.  I'd like
> to see that happen before blame gets added, but if blame is ready
> and the shim is reasonably well done, I'm inclined to bring blame
> in early.

While trying to look up the Myers diff algorithm I found a diff
implementation in Apache wicket (cf.
http://wicket.sourceforge.net/apidocs/wicket/util/diff/myers/package-summary.html
). This one is under an Apache license, is that any better? It's truly
kind of sad that you need a degree in law these days to get any work
done in this license jungle. I just happen to strongly oppose the
reinvention of circular transportation-enabling devices...

> - Don't use @author tags, we don't use them anywhere else.
>
> - Please include copyright headers on all new files.
>
> - I think the IDiff, IDifference should be in a ".diff" package so
> we can reuse them for other diff applications.  Especially if we
> wind up using a shim to link to different LCS implementations.
>
> - I think the API for BlameEngine should be more like:
>
>  public class BlameEngine {
>    private final RevWalk rw;
>
>    public BlameEngine(final Repository repo) {
>      this(new RevWalk(repo));
>    }
>
>    public BlameEngine(final RevWalk walk) {
>      rw = walk;
>    }
>
>    public RevWalk getRevWalk() {
>      return rw;
>    }
>
>    public List<BlameEntry> blame(RevCommit c, String path);
>  }
>
>  One reason for this style of API is we can have the engine cache
>  blame records.  This can make a GUI navigator much more efficient
>  as it jumps around between commits and asks for blames over the
>  same data.
>
> - Internally you should *always* use RevCommit, not Commit.
> The RevCommit class can be orders of magnitude faster than Commit.
>
> - Internally you should always use RevTree and TreeWalk to locate
> blob IDs.  Its orders of magnitude faster than Tree and TreeEntry.
>
> - Don't use new String(loader.getBytes()) (e.g. in Origin.getData)
> to get the data for a file.  We want to do raw diffs, so that the
> character encoding is considered as part of the blame.  E.g. if
> a file switches character encodings, the blame has to go to the
> commit that introduced the new encoding.  This is a huge reason
> to use byte[] and IntList over an array of String.
>
> - RawParseUtils.lineMap will be faster than a Pattern of "^(.*)$".
>
Thanks for the comments. This first version was basically just a
proof-of-concept to see if and how it could work. I didn't want to
pour too much effort into, before knowing if this project was viable
at all. This input is very valuable and I will try to incorporate the
suggestions into future revisions.

I'll keep you posted on new developments
 - Manuel
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html