Re: [PATCH 3/4] combine-diff: Optimize combine_diff_path sets intersection

Kirill Smelkov <kirr@xxxxxxxxxx> · Wed, 29 Jan 2014 15:21:06 +0400

On Tue, Jan 28, 2014 at 01:55:09PM -0800, Junio C Hamano wrote:
> Kirill Smelkov <kirr@xxxxxxxxxx> writes:
> 
> > diff --git a/combine-diff.c b/combine-diff.c
> > index 3b92c448..98c2562 100644
> > --- a/combine-diff.c
> > +++ b/combine-diff.c
> > @@ -15,8 +15,8 @@
> > ...
> > +	while (1) {
> > ...
> > +		if (cmp < 0) {
> > +			if (pprev)
> > +				pprev->next = p->next;
> > +			ptmp = p;
> > +			p = p->next;
> > +			free(ptmp);
> > +			if (curr == ptmp)
> > +				curr = p;
> >  			continue;
> > ...
> > +		if (cmp > 0) {
> > +			i++;
> > +			continue;
> >  		}
> > ...
> > +
> > +		pprev = p;
> > +		p = p->next;
> > +		i++;
> >  	}
> >  	return curr;
> >  }
> 
> Thanks. I very much like the approach.
> 
> I was staring at the above part of the code, but couldn't help
> recalling this gem (look for "understand pointers" in the article):
> 
>   http://meta.slashdot.org/story/12/10/11/0030249/linus-torvalds-answers-your-questions
> 
> How about doing it this way (on top of your patch)?  It reduces 7
> lines even though it adds two comment lines ;-)
> 
>  combine-diff.c | 37 +++++++++++++++----------------------
>  1 file changed, 15 insertions(+), 22 deletions(-)

Thanks, this is sound approach and adding guiding comments is good, and
also now some of us with self-taught heritage understand (or at least
they think so) pointers a bit better :)

Now some nitpicks:

> diff --git a/combine-diff.c b/combine-diff.c
> index 2d79312..0809e79 100644
> --- a/combine-diff.c
> +++ b/combine-diff.c
> @@ -15,11 +15,10 @@
>  static struct combine_diff_path *intersect_paths(struct combine_diff_path *curr, int n, int num_parent)
>  {
>  	struct diff_queue_struct *q = &diff_queued_diff;
> -	struct combine_diff_path *p, *pprev, *ptmp;
> +	struct combine_diff_path *p, **tail = &curr;
>  	int i, cmp;
>  
>  	if (!n) {
> -		struct combine_diff_path *list = NULL, **tail = &list;
>  		for (i = 0; i < q->nr; i++) {
>  			int len;
>  			const char *path;
> @@ -43,35 +42,30 @@ static struct combine_diff_path *intersect_paths(struct combine_diff_path *curr,
>  			*tail = p;
>  			tail = &p->next;
>  		}
> -		return list;
> +		return curr;
>  	}
>  
>  	/*
> -	 * NOTE paths are coming sorted here (= in tree order)
> +	 * paths in curr (linked list) and q->queue[] (array) are
> +	 * both sorted in the tree order.
>  	 */
> -
> -	pprev = NULL;
> -	p = curr;
>  	i = 0;
> +	while ((p = *tail) != NULL) {
> +		cmp = ((i >= q->nr)
> +		       ? -1 : strcmp(p->path, q->queue[i]->two->path));

I liked cmp assignment being the original way - when "-1" is on one line
and strcmp is on another - to me it reads better. I'm not insisting on
it though.


> -	while (1) {
> -		if (!p)
> -			break;
> -
> -		cmp = (i >= q->nr) ? -1
> -				   : strcmp(p->path, q->queue[i]->two->path);
>  		if (cmp < 0) {
> -			if (pprev)
> -				pprev->next = p->next;
> -			ptmp = p;
> -			p = p->next;
> -			free(ptmp);
> -			if (curr == ptmp)
> -				curr = p;
> +			/* p->path not in q->queue[]; drop it */
> +			struct combine_diff_path *next = p->next;
> +
> +			if ((*tail = next) != NULL)
> +				tail = &next->next;
> +			free(p);
>  			continue;
>  		}

A bug crept in here - if we are removing the first element, i.e. when
p=curr, we have to advance curr as well - as we are returning curr back
as new intersected paths set list start. That's why there was curr
change.

Now curr stays the same, and if we'll remove the first element, curr
will be pointing to freed memory -> oops. A possible fixup could be:

---- 8< ----

diff --git a/combine-diff.c b/combine-diff.c
index 0809e79..6a61a25 100644
--- a/combine-diff.c
+++ b/combine-diff.c
@@ -60,6 +60,8 @@ static struct combine_diff_path *intersect_paths(struct combine_diff_path *curr,
 
                        if ((*tail = next) != NULL)
                                tail = &next->next;
+                       if (curr == p)
+                               curr = next;
                        free(p);
                        continue;
                }
---- 8< ----

but this is blind code, as I had not tested it.


>  
>  		if (cmp > 0) {
> +			/* q->queue[i] not in p->path; skip it */
>  			i++;
>  			continue;
>  		}
> @@ -80,8 +74,7 @@ static struct combine_diff_path *intersect_paths(struct combine_diff_path *curr,
>  		p->parent[n].mode = q->queue[i]->one->mode;
>  		p->parent[n].status = q->queue[i]->status;
>  
> -		pprev = p;
> -		p = p->next;
> +		tail = &p->next;
>  		i++;
>  	}
>  	return curr;


P.S. I'm slowly working on to speedup combine-diff further - the same
way as diff_tree() skips path for two trees, for combine-diff we could
traverse a merge tree and n parents simultaneously, even not delving
into generating (usually huge) diff(merge,parent_i) for a path, if we
know such diff for parent_j will be empty.

I have no numbers yet, but this should give significant speedup, as my
tracing showed for e.g. linux.git a lot of diffing is done for
combine-diff for merges to e.g. second parents (mean value of diff to
HEAD^2 is ~ 1500 paths) and almost all of them annulate when intersected
to diff(HEAD, HEAD^1).

Only this can't work (or at least I don't know how) if rename/copy
detection is on, so there will be two codepaths - fast, if we run
without -M/-C, and generic, but slower, where combine-diff paths are
computed as intersections:

    D(A,P1,P2,...Pn) = D(A,P1) ^ D(A,P2) ^ ... ^ D(A,Pn)

i.e. the current way.

Does this approach sound reasonable? My draft not-working-yet code is here:

http://repo.or.cz/w/git/kirr.git/shortlog/refs/heads/x/combinediff-sorted
(look for diff_tree_combined_X)


Thanks,
Kirill
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html