Re: [PATCH v2 18/19] tree-diff: rework diff_tree() to generate diffs for multiparent cases as well

Kirill Smelkov <kirr@xxxxxxxxxxxxxx> · Tue, 8 Apr 2014 00:26:17 +0400

On Mon, Apr 07, 2014 at 10:29:46AM -0700, Junio C Hamano wrote:
> Kirill Smelkov <kirr@xxxxxxxxxxxxxx> writes:
> 
> > The following
> > ...
> > maybe looks a bit simpler, but calls tree_entry_pathcmp twice more times.
> >
> > Besides for important nparent=1 case we were not calling
> > tree_entry_pathcmp at all and here we'll call it once, which would slow
> > execution down a bit, as base_name_compare shows measurable enough in profile.
> > To avoid that we'll need to add 'if (i==imin) continue' and this won't
> > be so simple then. And for general nparent case, as I've said, we'll be
> > calling tree_entry_pathcmp twice more times...
> >
> > Because of all that I'd suggest to go with my original version.
> 
> OK.

Thanks.

> > ... After some break on the topic,
> > with a fresh eye I see a lot of confusion goes from the notation I've
> > chosen initially (because of how I was reasoning about it on paper, when
> > it was in flux) - i.e. xi for x[imin] and also using i as looping
> > variable. And also because xi was already used for x[imin] I've used
> > another letter 'k' denoting all other x'es, which leads to confusion...
> >
> >
> > I propose we do the following renaming to clarify things:
> >
> >     A/a     ->      T/t     (to match resulting tree t name in the code)
> >     X/x     ->      P/p     (to match parents trees tp in the code)
> >     i       ->      imin    (so that i would be free for other tasks)
> >
> > then the above (with a prologue) would look like
> >
> > ---- 8< ----
> >  *       T     P1       Pn
> >  *       -     -        -
> >  *      |t|   |p1|     |pn|
> >  *      |-|   |--| ... |--|      imin = argmin(p1...pn)
> >  *      | |   |  |     |  |
> >  *      |-|   |--|     |--|
> >  *      |.|   |. |     |. |
> >  *       .     .        .
> >  *       .     .        .
> >  *
> >  * at any time there could be 3 cases:
> >  *
> >  *      1)  t < p[imin];
> >  *      2)  t > p[imin];
> >  *      3)  t = p[imin].
> >  *
> >  * Schematic deduction of what every case means, and what to do, follows:
> >  *
> >  * 1)  t < p[imin]  ->  ∀j t ∉ Pj  ->  "+t" ∈ D(T,Pj)  ->  D += "+t";  t↓
> >  *
> >  * 2)  t > p[imin]
> >  *
> >  *     2.1) ∃j: pj > p[imin]  ->  "-p[imin]" ∉ D(T,Pj)  ->  D += ø;  ∀ pi=p[imin]  pi↓
> >  *     2.2) ∀i  pi = p[imin]  ->  pi ∉ T  ->  "-pi" ∈ D(T,Pi)  ->  D += "-p[imin]";  ∀i pi↓
> >  *
> >  * 3)  t = p[imin]
> >  *
> >  *     3.1) ∃j: pj > p[imin]  ->  "+t" ∈ D(T,Pj)  ->  only pi=p[imin] remains to investigate
> >  *     3.2) pi = p[imin]  ->  investigate δ(t,pi)
> >  *      |
> >  *      |
> >  *      v
> >  *
> >  *     3.1+3.2) looking at δ(t,pi) ∀i: pi=p[imin] - if all != ø  ->
> >  *
> >  *                       ⎧δ(t,pi)  - if pi=p[imin]
> >  *              ->  D += ⎨
> >  *                       ⎩"+t"     - if pi>p[imin]
> >  *
> >  *
> >  *     in any case t↓  ∀ pi=p[imin]  pi↓
> > ...
> > now xk is gone and i matches p[i] (= pi) etc so variable names correlate
> > to algorithm description better.
> >
> > Does that maybe clarify things?
> 
> That sounds more consistent (modulo perhaps s/argmin/min/ at the
> beginning?).

Thanks. argmin is there on purpose - min(p1...pn) is the minimal p, and
argmin(p1...pn) is imin such that p[imin] is minimal. As we are finding
the index of the minimal element we should use argmin.

> > P.S. Sorry for maybe some crept-in mistakes - I've tried to verify it
> > thoroughly, but am too sleepy to be completely sure. On the other hand I
> > think and hope the patch should be ok.
> 
> Thanks and do not be sorry for "mistakes"---we have the review
> process exactly for catching them.

Thanks, I appreciate that.

On Mon, Apr 07, 2014 at 11:07:47AM -0700, Junio C Hamano wrote:
> Kirill Smelkov <kirr@xxxxxxxxxxxxxx> writes:
> 
> >> > +			if (!DIFF_OPT_TST(opt, FIND_COPIES_HARDER)) {
> >> > +				for (i = 0; i < nparent; ++i)
> >> > +					if (tp[i].entry.mode & S_IFXMIN_NEQ)
> >> > +						goto skip_emit_tp;
> >> > +			}
> >> > +
> >> > +			p = emit_path(p, base, opt, nparent,
> >> > +					/*t=*/NULL, tp, imin);
> >> > +
> >> > +		skip_emit_tp:
> >> > +			/* ∀ xk=ximin  xk↓ */
> >> > +			update_tp_entries(tp, nparent);
> >> 
> >> There are parents whose path sort earlier than what is in 't'
> >> (i.e. they were lost in the result---we would want to show
> >> removal).  What makes us jump to the skip label?
> >> 
> >>     We are looking at path in 't', and some parents have paths that
> >>     sort earlier than that path.  We will not go to skip label if
> >>     any one of the parent's entry sorts after some other parent (or
> >>     the parent in question has ran out its entries), which means we
> >>     show the entry from the parents only when all the parents have
> >>     that same path, which is missing from 't'.
> >> 
> >> I am not sure if I am reading this correctly, though.
> >> 
> >> For the two-way diff, the above degenerates to "show all parent
> >> entries that come before the first entry in 't'", which is correct.
> >> For the combined diff, the current intersect_paths() makes sure that
> >> each path appears in all the pair-wise diff between t and tp[],
> >> which again means that the above logic match the current behaviour.
> >
> > Yes, correct (modulo we *will* go to skip label if any one of the
> > parent's entry sorts after some other parent). By definition of combined
> > diff we show a path only if it shows in every diff D(T,Pi), and if 
> >
> >     2.1) ∃j: pj > p[imin]  ->  "-p[imin]" ∉ D(T,Pj)  ->  D += ø;  ∀ pi=p[imin]  pi↓
> >
> > some pj sorts after p[imin] that would mean that Pj does not have
> > p[imin] and since t > p[imin] (which means T does not have p[imin]
> > either) diff D(T,Pj) does not have p[imin]. And because of that we know
> > the whole combined-diff will not have p[imin] as, by definition,
> > combined diff is sets intersection and one of the sets does not have
> > that path.
> >
> >   ( In usual words p[imin] is not changed between Pj..T - it was
> >     e.g. removed in Pj~, so merging parents to T does not bring any new
> >     information wrt path p[imin] and that is why we do not want to show
> >     p[imin] in combined-diff output - no new change about that path )
> >
> > So nothing to append to the output, and update minimum tree entries,
> > preparing for the next step.
> 
> That's all in line with the current and traditional definition of
> combined diff.

Ok.

> This is a tangent that is outside the scope of this current topic,
> but I wonder if you found it disturbing that we treat the result 't'
> that has a path and the result 't' that does not have a path with
> respect to a parent that does not have the path in a somewhat
> assymmetric way.
> 
> With a merge M between commits A and B, where they all have the same
> path with different contents, we obviously show that path in the
> combined diff format.  A merge N that records exactly the same tree
> as M that merges the same commits A and B plus another commit C that
> does not have that path still shows the combined diff, with one
> extra column to express "everything in the result N has been added
> with respect to C which did not have the path at all".
> 
> However, a merge O between the same commits A and B, where A and B
> have a path and O loses it, shows the path in the combined format.
> A merge P among the same A, B and an extra parent C that does not
> have that path ceases to show it (this is the assymmetry).

Symmetry properties are very important and if something does not fit into
symmetry - then something somewhere is really wrong for sure, but I
think there is no asymmetry here. In your example

      M∆∙    N∆∙
     / \ . '''
    / .'\  '  '
    A∆   B∙    Cø
    \ ''/. '  '
     \ /   '''
      Oø     Pø

let's say a path can be:

    ø   - empty
    ∆   - triangle
    ∙   - bullet
    ∆∙  - triangle+bullet

then some symmetry operation is

    ∆∙  <-> ø
    ∙   <-> ∙
    ∆   <-> ∆

so you "mirror" ∆∙ to ø (M,N->O,P), and so ø is mirrored back to ∆∙
(O,P->M,N). But then, when we mirror the whole graph, Cø should be
mirrored to C'∆∙ and then that would be correct:

    A∆   B∙    C'∆∙
    \ ''/. '  '
     \ /   '''
      Oø     Pø

P to A,B,C' would show the path as N to A,B,C(without')

In other words a merge P among the same A, B and extra parent C' with
"contains everything" content is symmetrical to merge N to A,B and C
with empty content.

> It is a natural extension of "Do not show the path when the result
> matches one of the parent" rule, and in this case the result P takes
> contents, "the path does not exist", from one parent "C", so it is
> internally consistent, and I originally designed it that way on
> purpose, but somehow it feels a bit strange.

I hope it does not fill strange once the true symmetry is discovered,
and also P does not add a change compared to C, so the combined diff
should be empty as no new information is present in a merge itself.

A bit unusual on the first glance, but not strange, once you are used to
it and consistent.

Thanks,
Kirill
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html