Re: [PATCH 1/2] tree-diff: rework diff_tree() to generate diffs for multiparent cases as well

Kirill Smelkov <kirr@xxxxxxxxxxxxxx> · Sun, 16 Feb 2014 12:08:29 +0400

On Fri, Feb 14, 2014 at 09:37:00AM -0800, Junio C Hamano wrote:
> Kirill Smelkov <kirr@xxxxxxxxxx> writes:
> 
> > Previously diff_tree(), which is now named __diff_tree_sha1(), was
> 
> That name with two leading underscores is a rather unfortunate,
> especially for a function that is not a file scope static.  No way
> to rename this to something more sensible?

I agree. In preparatory patches I thought this will go away, but it
stayed. I'll try to come up with something reasonable.

> > That impedance mismatch *hurts* *performance* *badly* for generating
> > combined diffs - in c839f1bd (combine-diff: optimize combine_diff_path
> 
> Please avoid referring to a commit that is not in 'master' by its
> object name.  It can be reworked later and get a different name.

I agree, this makes sense. Is it ok to refer to nearby commits, by say
HEAD~3, if we know we are referring to 3-times previous commit?

> > That slowness comes from the fact that currently, while generating
> > combined diff, a lot of time is spent computing diff(commit,commit^2)
> > just to only then intersect that huge diff to almost small set of files
> > from diff(commit,commit^1).
> 
> Good observation.
> 
> >    |a|   |b|    a < b   ->  a ∉ B   ->   D(A,B) +=  +a    a↓
> >    |-|   |-|    a > b   ->  b ∉ A   ->   D(A,B) +=  -b    b↓
> >    | |   | |    a = b   ->  investigate δ(a,b)            a↓ b↓
> 
> In both the "n-parallel" and "diff-tree", when an entry 'a' is a
> tree, I take this "D(A,B) += +a" to mean (recursively) adding all
> the paths within 'a' to the result as addition.  Sounds sensible.

Correct.

> > D(A,B)
> >
> > is by definition the same as combined diff
> >
> > D(A,[B]),
> >
> > so if we could rework the code for common case and make it be not slower
> > for nparent=1 case, usual diff(t1,t2) generation will not be slower, and
> > multiparent diff tree-walker would greatly benefit generating
> > combine-diff.
> 
> OK.

Thanks. My first goal was to demonstrate it is doable - i.e. we could
join two diff tree-walkers into generalized one, and that approach would
be sound and have chances to be accepted.

> > What we do is as follows:
> >
> > 1) diff tree-walker __diff_tree_sha1() is internally reworked to be
> >    a paths generator (new name diff_tree_paths()), with each generated path
> >    being `struct combine_diff_path` with info for path, new sha1,mode and for
> >    every parent which sha1,mode it was in it.
> >
> > 2) From that info, we can still generate usual diff queue with
> >    struct diff_filepairs, via "exporting" generated
> >    combine_diff_path, if we know we run for nparent=1 case.
> >    (see emit_diff() which is now named emit_diff_p0only())
> 
> s/p0/first_parent_/; perhaps?

Ok.

> > 3) In order for diff_can_quit_early(), which checks
> >
> >        DIFF_OPT_TST(opt, HAS_CHANGES))
> >
> >    to work, that exporting have to be happening not in bulk, but
> >    incrementally, one diff path at a time.
> 
> Good thinking.

Yes. This requirement made the diff-paths producer be real generator,
which emits paths incrementally, which is imho, could be a good design
for making the component more reusable.

> > Some notes(*):
> >
> > 1) For loops,
> >
> >  i = 0; do { ... } while (++i < nparent);
> >
> > is used instead of
> >
> >  for (i = 0; i < nparent; ++i)
> >      ...
> >
> > because for the former case, the compiler have to emit additional
> > prologue code which checks for i >= nparent case before entering the
> > loop.
> >
> > As we require nparent must be >0, that additional overhead
> > conflicts with the "runs not slower for nparent=1 case than before"
> > goal.
> 
> Unfortunate.  I'd rather see us stick to more readable and familiar
> form for maintainability if this were not measurable.

The most effect on performance were avoiding mallocs and reduce register
pressure. This too, was measurable, but of lower impact. If giving away
some part of percent for nparent=1 case is ok, this could be back to
usual for loops.

By the way, I find it a bit unfortunate, we don't have some for-loop
form with post-conditions in C, only do-while. Another observation, is
that it would be good to say to compiler e.g.

    assume nparent > 0;

and then in e.g.

    for (i = 0; i < nparent; i++)
        ...

the compiler could know it should not do pre-conditions checks before
entering the loop.

I've verified, that such behaviour, at least with gcc, could be achieved
with

    if (nparent <= 0)
        return;

in function prologue - then the compiler deduces, at least with O2, that
after return point, nparent is >0, but this adds slight overhead on each
entry to function, and we have as many calls as there would be
recursions.

To me, the do-while is still readable, but if fors are preferred, I'll
re-measure what it costs and come back.

> > 2) alloca(), for small arrays, is used for the same reason - if we change
> > it to xmalloc()/free() the timings get worse
> 
> Do you see any use of it outside compat/?
> 
> I thought we specifically avoid alloca() for portability.  Also we
> do not use variable-length-arrays on the stack either, I think.

No, no usage outside compat/ and I knew alloca and VLAs are not used in
Git codebase for portability, and I understand alloca will be
criticized, but wanted to start the discussion rolling.

I've actually started without alloca, and used xmalloc/free for
[nparent] vectors, but the impact was measurable, so it just had to be
changed to something more optimal.

For me, personally, alloca is ok, but I understand there could be
portability issues (by the way, what compiler/system Git cares about
does not have working alloca?). Thats why I propose we do the following

1. at configure time, determine, do we have working alloca, and define

    #define HAVE_ALLOCA

   if yes.

2. in code

    #ifdef HAVE_ALLOCA
    # define xalloca(size)      (alloca(size))
    # define xalloca_free(p)    do {} while(0)
    #else
    # define xalloca(size)      (xmalloc(size))
    # define xalloca_free(p)    (free(p))
    #endif

   and use it like

   func() {
       p = xalloca(size);
       ...

       xalloca_free(p);
   }

This way, for systems, where alloca is available, we'll have optimal
on-stack allocations with fast executions. On the other hand, on
systems, where alloca is not available, this gracefully fallbacks to
xmalloc/free.

Please tell me what you think.

> > 3) For every parent tree, we need to keep a tag, whether entry from that
> > parent equals to entry from minimal parent. For performance reasons I'm
> > keeping that tag in entry's mode field in unused bit - see S_IFXMIN_NEQ.
> 
> Unfortunate, but I do not see another place to keep this
> information offhand (nor implement this approach without keeping
> that piece of information).

Me neither. I've actually started with separate on-stack (via alloca)
[nparent] vector for such tags, and removing it was measurable.

> > P.S. and combined diff is not some exotic/for-play-only stuff - for
> 
> No need to convince us about that ;-)

Thanks ;) May I ask to please keep my "motivation" part in the commit
log for historical reasons?

> > example for a program I write to represent Git archives as readonly
> > filesystem, there is initial scan with
> >
> >     `git log --reverse --raw --no-abbrev --no-renames -c`
> >
> > to extract log of what was created/changed when, as a result building a
> > map
> >
> >     {}  sha1    ->  in which commit (and date) a content was added
> >
> > that `-c` means also show combined diff for merges, and without them, if
> > a merge is non-trivial (merges changes from two parents with both having
> > separate changes to a file), or an evil one, the map will not be full,
> > i.e. some valid sha1 would be absent from it.
> >
> > That case was my initial motivation for combined diffs speedup.
> 
> I wonder if this machinery can be reused for "log -m" as well (or
> perhaps you do that already?).  After all, by performing a single
> parallel scan, you are gathering all the necessary information to
> let you pretend that you did N pairwise diff-tree.

Unfortunately, as it is now, no, and let me explain why:

The reason that is not true, is that we omit recursing into directories,
if we know D(A,some-parent) for that path is empty. That means we don't
calculate D(A,any-other-parents) for that path and subpaths.

More structured description is that combined diff and "log -m", which
could be though as all diffs D(A,Pi) are different things:

    - the combined diff is D(A,B) generalization based on "^" (sets
      intersection) operator, and

    - log -m, aka "all diffs" is D(A,B) generalization based on "v"
      (sets union) operator.

Intersection means, we can omit calculating parts from other sets, if we
know some set does not have an element (remember "don't recurse into
subdirectories"?), and unioning does not have this property.

It does so happen, that "^" case (combine-diff) is more interesting,
because in the end it allows to see new information - the diff a merge
itself introduces. "log -m" does not have this property and is no more
interesting to what plain diff(HEAD,HEAD^n) can provide - in other words
it's just a convenience.

Now, the diff tree-walker could be generalized once more, to allow
clients specify, which diffs combination operator to use - intersection
or unioning, but I doubt that for unioning case that would add
significant speedup - we can't reduce any diff generation based on
another diff and the only saving is that we traverse resulting commit
tree once, but for some cases that could be maybe slower, say if result
and some parents don't have a path and some parent does, we'll be
recursing into that path and do more work compared to plain D(A,Pi) for
Pi that lacks the path.

In short: it could be generalized more, if needed, but I propose we
first establish the ground with generalizing to just combine-diff.

> > diff --git a/tree-diff.c b/tree-diff.c
> > index ab61a0a..2b7c991 100644
> > --- a/tree-diff.c
> > +++ b/tree-diff.c
> > @@ -7,6 +7,25 @@
> >  #include "tree.h"
> >  
> >  /*
> > + * internal mode marker, saying a tree entry != entry of tp[imin]
> > + * (see __diff_tree_paths for what it means there)
> > + *
> > + * it *must* not overlap with any valid modes, and we will update/use/emit
> > + * entry for diff only with it unset. Only non-overlapping to valid modes is
> > + * required, because mode in tree_desc, comes here canonicalized via
> > + * canon_mode().
> > + *
> > + * the definition assumes unsigned is at least 32 bits.
> > + */
> > +#define S_IFXMIN_NEQ	0x80000000
> 
> To allow better coordination across multiple codepaths that deal
> with modes, I am wondering if this should be defined in cache.h
> where made-up S_FIGITLINK and S_IFINVALID are defined (note the
> comment that is there, as well).

I knew about S_IFGITLINK and S_IFINVALID being located in cache.h with
comments.

The reason I decided to place S_IFXMIN_NEQ here, is that it is local tag
- it is used locally in diff-tree, and neither is coming in, nor
coming out of it in set state.

So putting it in cache.h would mean we'll reserve the bit for something
which is used only temporarily and that bit would be not available for
other temporary uses.

On the other hand, it would be better, to mark in cache.h that some bits
could be used temporarily for application specific things. Maybe provide
a mask there or something similar to S_IFUSR1, S_IFUSR2 :)

S_IFXMIN_NEQ is just has no meaning for code outside of tree-diff.c ...

What do you think?

> > +static struct combine_diff_path *__diff_tree_paths(
> > +	struct combine_diff_path *p, const unsigned char *sha1,
> > +	const unsigned char **parents_sha1, int nparent,
> > +	struct strbuf *base, struct diff_options *opt);
> 
> Most of our code do not name private helper functions with leading
> underscores.

What would be a good name here? In contrast to __diff_tree_sha1, it is
static, and adding some prefix/suffix to the name is maybe adding more
noise than value? We have diff_tree_paths() which does the setup and
cleanup and this worker __diff_tree_paths... Some names come to me as
diff_tree_paths_raw, or diff_tree_paths_worker, but having Linux
heritage, I like __diff_tree_paths more.

What would be a good approach for Git style here?

> I do like the direction this is going, but it looks to me that
> "struct combine_diff" is now misnamed, because it no longer is about
> combined diff.  You are introducing a good framework for n-way diff,
> and producing combined diff (i.e. -c or --cc) is now merely one way
> to use that framework.  We may want to clean these names up after
> this series settles---perhaps "struct nway_diff" or something.

Thanks. It was my main question whether this reworking will be a
good/accepted idea or not. I agree about `struct combine_diff_path`
name and that it should be changed afterwards (my idea was `struct
diff_path` for it).

Given that we had diff_filepair for n=2 diffs, it would logical to call
it as maybe

    diff_nway,
    diff_filevector,
    or something else

but maybe emphasizing n/vector/... in the name is not so good in the
long run, and using, for generalized diff just

    diff_file, or
    diff_path

is better.

Naming is important and I don't settled on it however, yet...

> > +
> > +/*
> >   * Compare two tree entries, taking into account only path/S_ISDIR(mode),
> >   * but not their sha1's.
> >   *
> > @@ -33,72 +52,152 @@ static int tree_entry_pathcmp(struct tree_desc *t1, struct tree_desc *t2)
> >  }
> >  
> >  
> > -/* convert path, t1/t2 -> opt->diff_*() callbacks */
> > -static void emit_diff(struct diff_options *opt, struct strbuf *path,
> > -		      struct tree_desc *t1, struct tree_desc *t2)
> > +/*
> > + * convert path -> opt->diff_*() callbacks
> > + *
> > + * emits diff to parent0 only.
> 
> Please call that "first parent".

Ok.

> > + */
> 
> "Returns 0 to tell the caller that we are done with p and it can be
> freed" or something?
> 
> > +static int emit_diff_p0only(struct diff_options *opt, struct combine_diff_path *p)
> >  {
> > ...
> > +
> > +	return 0;	/* = no need to keep allocated combine_diff_path */
> 
> Curious; what is that equal sign in the comment?

It stands for "that means", "i.e." or something similar. In other words
it is semantically equal to your comment from the above.

I'm open to reworking the commenting style for it to reads more well.
Should I?

~~~~

Thanks again for reviewing this and for generally accepting the
approach.  Please tell me your thoughts. Based on your input, I'll try
to come up with something improved on monday.

Thanks,
Kirill
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html