On Sat, Apr 26, 2014 at 2:39 PM, David Kastrup <dak@xxxxxxx> wrote:
>
> At least the stuff I fixed with regard to performance would seem to be
> done right in JGit to start with.

Hah! It's Java. We have to do things right, otherwise it's too slow. :-)

>> It's still not as fast as I want it to be. :-)
>
> Most of the diff data/CRC is computed over and over because of the
> blackbox use of xdiff.

Yes, I have been thinking about this all week.

JGit blame uses HistogramDiff by default instead of MyersDiff. The
first stage, after we trim the common header/trailer from both files,
is to compute a hash of each line and store those hashes. Those hashes
are discarded as the blame algorithm moves to the next commit.

Clearly for a commit chain of A -> B -> C, the hashes computed at B for
the A->B compare could be reused for the B->C compare. This is not done
in either git-core or JGit, because the diff algorithm is a black box
to the blame algorithm. I think this is what you mean by the CRC being
computed again. (There is a rough sketch of the reuse idea at the end
of this mail.)

For any given compare, blame has a list of regions it is interested in
learning about from the diff algorithm. Anything outside of those
regions is useless noise that will be discarded. I have been pondering
pushing that region list down into the diff algorithm so it can avoid
executing on sections that are not relevant to the caller. At least
for HistogramDiff this makes some sense: the algorithm is recursively
applied after it finds a longest common subsequence. If one side of
the LCS is outside of the region of interest from blame, there is no
value in recursing on that portion. (A toy version of that pruning
check is also sketched at the end.)

If the blame region list covers a small enough portion, it may even
make sense to avoid the common header/trailer elimination
preprocessing step. Unfortunately that sounds "hard", as you could be
working with a file like a ChangeLog, which grows on one of those
sides.

> And then the delta-chain storage is packing
> stuff based on CRCs as well (not sure whether it keeps them around for
> unpacking).

There are CRCs validated by libz during inflation, but these aren't
rechecked once the inflated bytes are cached in that silly undersized
16M delta base cache.

> So there is a lot that could likely be improved while
> keeping the same basic algorithms, by cracking open the black boxes of
> the xdiff engine and the delta-chain coding.

The delta-chain coding has no relationship to the source file.
Currently even plain text files are delta coded on a byte basis, not a
line basis. Just matching up the delta coding against a source text
file to determine that lines 0-N are common is costly, since you have
a byte range in the delta coding and you want a line range out the
other end (see the last sketch at the end of this mail).

To make things more challenging, the delta coding can be against
completely different blobs. In a compare of A->B from the commit graph
being walked by blame, there is no requirement that the delta coding
uses this pairing, and it almost certainly never uses this direction
(it's usually B->A, if it's even this pair!).

Given your comments in the other patch, I understand why you probably
won't be working on blame further. But the above may help someone else
who has available time to continue.
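To make the line-hash reuse idea a bit more concrete, here is a
minimal, self-contained sketch of caching per-blob line hashes keyed
by blob id, so the hashes computed for B during the A->B compare
survive into the B->C compare. The class and method names are invented
for illustration; this is not JGit's HistogramDiff code.

  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;

  /**
   * Sketch of reusing per-blob line hashes across a commit chain
   * A -> B -> C instead of rehashing B's lines for both compares.
   * Hypothetical; not JGit's HistogramDiff implementation.
   */
  public class LineHashCache {
    private final Map<String, int[]> cache = new HashMap<>();

    /** Return cached line hashes for a blob, computing on first use. */
    public int[] hashesFor(String blobId, List<String> lines) {
      return cache.computeIfAbsent(blobId, id -> hashLines(lines));
    }

    private static int[] hashLines(List<String> lines) {
      int[] hashes = new int[lines.size()];
      for (int i = 0; i < lines.size(); i++)
        hashes[i] = lines.get(i).hashCode(); // placeholder line hash
      return hashes;
    }

    public static void main(String[] args) {
      LineHashCache c = new LineHashCache();
      List<String> b = List.of("int main() {", "  return 0;", "}");
      int[] first = c.hashesFor("blob-B", b);  // A->B hashes B once...
      int[] second = c.hashesFor("blob-B", b); // ...B->C reuses the array
      System.out.println("reused: " + (first == second));
    }
  }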
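Here is a toy version of the region-list pruning check: before
recursing into a subrange left over after an LCS split, ask whether
any blame region of interest overlaps it, and skip the recursion if
none does. Region and worthRecursing are made-up names for
illustration, not an existing JGit API.

  import java.util.List;

  /**
   * Toy region pruning: only recurse the diff into subranges that
   * overlap the blame regions of interest. Hypothetical sketch.
   */
  public class RegionPrune {
    static final class Region {
      final int begin, end; // half-open [begin, end) line range in B
      Region(int begin, int end) { this.begin = begin; this.end = end; }
      boolean overlaps(int b, int e) { return begin < e && b < end; }
    }

    /** Would recursing the diff over lines [begin, end) do useful work? */
    static boolean worthRecursing(List<Region> interesting, int begin, int end) {
      for (Region r : interesting)
        if (r.overlaps(begin, end))
          return true;
      return false; // nothing blame cares about here; skip this subrange
    }

    public static void main(String[] args) {
      List<Region> want = List.of(new Region(10, 20), new Region(40, 45));
      System.out.println(worthRecursing(want, 0, 5));   // false -> prune
      System.out.println(worthRecursing(want, 18, 30)); // true  -> recurse
    }
  }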
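And finally, a small illustration of why mapping the delta coding's
byte offsets back to blame's line numbers costs a scan: given a byte
range in a text blob, you have to count newlines to learn which lines
it touches. The helper below is hypothetical; nothing like it exists
in JGit or git-core today.

  import java.nio.charset.StandardCharsets;

  /**
   * Sketch: convert a byte range [byteBegin, byteEnd) within a text
   * blob into the covering 0-based line range, by counting '\n' bytes.
   * Illustrative only.
   */
  public class ByteRangeToLines {
    /** Returns {firstLine, lastLineExclusive}. */
    static int[] lineRange(byte[] blob, int byteBegin, int byteEnd) {
      int firstLine = 0;
      for (int i = 0; i < byteBegin; i++)
        if (blob[i] == '\n')
          firstLine++;
      int lastLine = firstLine;
      for (int i = byteBegin; i < byteEnd - 1 && i < blob.length; i++)
        if (blob[i] == '\n')
          lastLine++;
      return new int[] { firstLine, lastLine + 1 };
    }

    public static void main(String[] args) {
      byte[] blob = "aaa\nbbb\nccc\nddd\n".getBytes(StandardCharsets.UTF_8);
      int[] r = lineRange(blob, 5, 10); // bytes inside lines 1 and 2
      System.out.println("lines [" + r[0] + ", " + r[1] + ")");
    }
  }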