Re: [PATCH 1/2] blame: large-scale performance rewrite

David Kastrup <dak@xxxxxxx> · Sat, 26 Apr 2014 09:48:14 +0200

Shawn Pearce <spearce@xxxxxxxxxxx> writes:

> On Fri, Apr 25, 2014 at 4:56 PM, David Kastrup <dak@xxxxxxx> wrote:
>> The previous implementation used a single sorted linear list of blame
>> entries for organizing all partial or completed work.  Every subtask had
>> to scan the whole list, with most entries not being relevant to the
>> task.  The resulting run-time was quadratic to the number of separate
>> chunks.
>>
>> This change gives every subtask its own data to work with.  Subtasks are
>> organized into "struct origin" chains hanging off particular commits.
>> Commits are organized into a priority queue, processing them in commit
>> date order in order to keep most of the work affecting a particular blob
>> collated even in the presence of an extensive merge history.
>
> Without reading the code, this sounds like how JGit runs blame.
>
>> For large files with a diversified history, a speedup by a factor of 3
>> or more is not unusual.
>
> And JGit was already usually slower than git-core. Now it will be even
> slower! :-)

If your statement about JGit is accurate, it should likely have beat Git
for large use cases (where the performance improvements are most
important) as O(n) beats O(n^2) in the long run.

At any rate, I see that I ended up posting this patch series at the end
of the week again which makes for a somewhat lacklustre initial response
from those who code Git for a regular living.

Apropos: shaking the bugs regarding -M and -C options out of the code
had taken a large toll because -M can cause the same or overlapping line
regions to be responsible for different target regions and the original
code implementing the "straightforward" blame blew up on the overlap.
I spent a _lot_ of time tracking down that problem.

As I am lousy focusing on more than one task, and as I don't get a
regular paycheck anyway, this will have to remain my last contribution
to Git if I am not going to recoup my losses.

Patch 2 of this series tries giving the community of Git a serious
chance at picking that option (I mean, there are literally millions of
Git users around with a sizable number profiting) while not being
obnoxious about it.

My personal guess is that it will fail regarding both objectives.  But
then I've been surprised before by other free software communities when
trying to make those particular two ends meet.

At any rate, feedback about the performance of the patch from users
disappointed by regular git blame would be welcome.

Apart from the objective measurement of "total time", the more
subjective impression of interactive/incremental response (like in git
gui blame) where the order of results will significantly differ (current
git-blame --incremental focuses on getting blames resolved in
first-lines-first manner, the proposed git-blame rather works on a
newest-commits-first basis which might better match typical use cases)
might be worth reporting.

-- 
David Kastrup
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html