Re: [PATCH v4 1/4] Introduce wholesame directory move detection in diffcore.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Mon, Oct 04, 2010 at 02:28:50AM -0500, Jonathan Nieder wrote:
> > of "mv a b" and "mv a/* b/" are displayed as "Rnnn a/ b/", which is
> > probably not what we want.  "Rnnn a/* b/" could be a good choice for
> > the latter if we want them to be distinguished, and even if we want
> > them to look the same.
> 
> Example?

It's just that every time we detect a bulk rename (of all files of a
directory) it is currently shown as a directory rename, even if they
all get moved into an existing dir.  Technically, a "directory rename"
(which I'd label "mv a b") is a special case of "bulk rename" (for
which "mv a/* b/" is more correct), where the target directory is
empty.  Currently the distinction between the two is not done, but
everything is incorrectly advertized as a rename.  I shall fix that.


> > Other future developements to be made on top of this include:
> > * extension of unified-diff format to express this
> [...]
> > ---
> 
> Oh, so this is for diff --raw only.

Right, but that TODO list has not be revised in this iteration.
--detect-dir-renames does not need much to gain unified-diff, it is
mostly --hide-dir-rename-details which needs thougth and work - but as
it is now targetted at humans only, we can start with whatever seems
adequate and rewrite/refine as see fit.


> For the confused: the discussion from v3 perhaps explains why.

Thanks for this work of putting everything into context - I should
have done that myself.


>  $ ./git-diff-tree --abbrev=6 ee491 --factorize-renames -r 
>  [DBG] possible rename from arm/ to bar/
>  [DBG] possible rename from ppc/ to moved/
>  [DBG] discarding dir rename of arm/, mixing moved/ and bar/
>  [DBG] ppc/* -> moved/* makes ppc/sha1ppc.S -> moved/sha1ppc.S uninteresting
>  [DBG] ppc/* -> moved/* makes ppc/sha1.c -> moved/sha1.c uninteresting
>  ee491a42190ec6e716f46a55fa0a7f4e307f1629
>  :040000 040000 000000... 000000... R100	ppc/	moved/
>  :100644 100644 9e3ae0... 9e3ae0... R100	arm/sha1.c	bar/sha1.c
>  :100644 100644 395264... 395264... R100	arm/sha1.h	bar/sha1.h
>  :100644 100644 c3c51a... c065ee... R099	ppc/sha1.h	moved/sha1.h
>  :100644 100644 8c1cb9... 8c1cb9... R100	arm/sha1_arm.S	moved/sha1_arm.S
> 
> Presumably this patch takes care of the first step (grouping potential
> full-directory moves) and not the second (hiding the redundant file renames).

No, it did hide the two moves labeled as "uninteresting" in the debug
traces above.  The ones left are one which is part of the dif rename
but was also modified, and the other ones (from arm/) are part of a
directory split (see 3rd debug line).

> Naïve review:
> 
> > --- a/diff-lib.c
> > +++ b/diff-lib.c
> > @@ -208,7 +208,8 @@ int run_diff_files(struct rev_info *revs, unsigned int option)
> >  						    ce_option, &dirty_submodule);
> >  		if (!changed && !dirty_submodule) {
> >  			ce_mark_uptodate(ce);
> > -			if (!DIFF_OPT_TST(&revs->diffopt, FIND_COPIES_HARDER))
> > +			if (!DIFF_OPT_TST(&revs->diffopt, FIND_COPIES_HARDER) &&
> > +			    !DIFF_OPT_TST(&revs->diffopt, DETECT_DIR_RENAMES))
> >  				continue;
> 
> Hm, why?
> > @@ -338,7 +339,8 @@ static int show_modified(struct rev_info *revs,
> [...]
> 
> Likewise.

The code below that (calls to diff_change) was previously only used
for FIND_COPIES_HARDER, but we also need to go there for
DETECT_DIR_RENAMES.  Removing those eg. make test "validate the output
for a move without a subdir." fail.  I admit I have not re-analyzed
this part recently, those tests were already parts of the first
iterations.


> [...]
> > --- a/diffcore-rename.c
> > +++ b/diffcore-rename.c
> > @@ -11,6 +11,7 @@
> >  static struct diff_rename_dst {
> >  	struct diff_filespec *two;
> >  	struct diff_filepair *pair;
> > +	int i_am_not_single:1; // does not look for a match, only here to be looked at
> >  } *rename_dst;
> 
> What does single mean?

"looking for a match" :)

> > @@ -451,13 +656,22 @@ void diffcore_rename(struct diff_options *options)
> >  				p->one->rename_used++;
> >  			register_rename_src(p->one, p->score);
> >  		}
> > +		else {
> [...]
> > +			if (DIFF_OPT_TST(options, DETECT_DIR_RENAMES)) {
> > +				/* similarly, rename factorization needs to
> > +				 * see all files from second tree, but we don't
> > +				 * want them to be matched against single sources.
> > +				 */
> > +				locate_rename_dst(p->two, 1)->i_am_not_single = 1;
> 
> Huh?

> > @@ -386,8 +414,11 @@ static int find_exact_renames(void)
> >  	for (i = 0; i < rename_src_nr; i++)
> >  		insert_file_table(&file_table, -1, i, rename_src[i].one);
> >  
> > -	for (i = 0; i < rename_dst_nr; i++)
> > +	for (i = 0; i < rename_dst_nr; i++) {
> > +		if (rename_dst[i].i_am_not_single)
> > +			continue;
> >  		insert_file_table(&file_table, 1, i, rename_dst[i].two);
> > +	}
> 
> What is this code path for?  (Sorry for the tedious questions.  My
> thinking is, if I cannot answer them without doing some legwork, how
> could the person about to break your code who is only glancing for
> a moment before plunging forward on an exciting new feature?)

Those are definitely related to the above.  This mimics the
FIND_COPIES_HARDER behaviour (which is adding untouched files from
*src* tree to the rename_dst list so the -C algorithm works): here we
need to see all files from *dst* tree so we can tell if there was any
file left in a dir after a move.  This has a cost: we must ensure in
other places that codepaths not intended for those extra files are ok.

Hm, thinking quickly about it, maybe we could use a separate list to
store them, and only run through them when explicitely needed, that
would cost less and avoid the polutions you pointed at.  I shall have
a look at that.


> > @@ -49,9 +50,36 @@ static struct diff_rename_dst *locate_rename_dst(struct diff_filespec *two,
> [...]
> >  
> > +static struct diff_rename_dst *locate_rename_dst_dir(struct diff_filespec *dir)
> > +{
> > +	/* code mostly duplicated from locate_rename_dst - not sure we
> > +	 * could merge them efficiently,though
> > +	 */
> > +	int first, last;
> > +	int prefixlength = strlen(dir->path);
> > +
> > +	first = 0;
> > +	last = rename_dst_nr;
> > +	while (last > first) {
> > +		int next = (last + first) >> 1;
> > +		struct diff_rename_dst *dst = &(rename_dst[next]);
> > +		int cmp = strncmp(dir->path, dst->two->path, prefixlength);
> > +		if (!cmp)
> > +			return dst;
> > +		if (cmp < 0) {
> > +			last = next;
> > +			continue;
> > +		}
> > +		first = next+1;
> > +	}
> > +	/* not found */
> > +	return NULL;
> > +}
> 
> Binary search --- this is just a way to index into the sorted list
> of rename_dsts, right?

right.

> At first I thought it was searching for a good rename destination.
> A comment (or overview in the log message) could help clarify.

Ok.  More comments are surely good in this hairy piece of code :)


> > +// FIXME: prevent possible overflow
> > +/*
> > + * Copy dirname of src into dst, suitable to append a filename without
> > + * an additional "/".
> > + * Only handles relative paths since there is no absolute path in a git repo.
> > + * Writes "" when there is no "/" in src.
> > + * May overwrite more chars than really needed, if src ends with a "/".
> > + */
> > +static const char* copy_dirname(char* dst, const char* src)
> > +{
> > +	char* lastslash = strrchr(src, '/');
> > +	if (!lastslash) {
> > +		*dst = '\0';
> > +		return dst;
> > +	}
> > +	strncpy(dst, src, lastslash - src + 1);
> 
> memcpy?

right

> > +	dst[lastslash - src + 1] = '\0';
> > +
> > +	// if src ends with a "/" strip the last component
> > +	if (lastslash[1] == '\0') {
> > +		lastslash = strrchr(dst, '/');
> > +		if (!lastslash)
> > +			return strcpy(dst, ".");
> > +		lastslash[1] = '\0';
> > +	}
> 
> It might be easier to read like this:
> 
> /* Write name of the parent directory of src to dest. */
> static char *remove_last_component(char *dst, const char *src)
> {
> 	size_t len = strlen(src);
> 	const char *slash;
> 
> 	if (len > 0 && src[len - 1] == '/')
> 		/* Trailing slash.  Ignore it. */
> 		len--;
> 
> 	slash = memrchr(src, '/', len);
> 	if (!slash) {
> 		*dst = '\0';
> 		return dst;
> 	}
> 
> 	*mempcpy(dst, src, slash - src) = '\0';
> 	return dst;
> }
> 
> Requires the glibc-specific function memrchr(), but that is easy to
> implement.  Compare strchrnul() [in git-compat-util.h].

Right, that's much more elegant - and should also be more efficient in
the "ends with /" case.

> [...]
> > +static struct diff_dir_rename* factorization_candidates = NULL;
> > +static void diffcore_factorize_renames(void)
> 
> Maybe this could be refactored into multiple functions?
> 
> > @@ -509,6 +723,8 @@ void diffcore_rename(struct diff_options *options)
> >  
> >  		if (rename_dst[i].pair)
> >  			continue; /* dealt with exact match already. */
> > +		if (rename_dst[i].i_am_not_single)
> > +			continue; /* not looking for a match. */
> 
> Oh, not single means "not seeking a new relationship".
> 
> > @@ -569,7 +785,28 @@ void diffcore_rename(struct diff_options *options)
> >  	/* At this point, we have found some renames and copies and they
> >  	 * are recorded in rename_dst.  The original list is still in *q.
> >  	 */
> > +
> > +	/* Now possibly factorize those renames and copies. */
> > +	if (DIFF_OPT_TST(options, DETECT_DIR_RENAMES))
> > +		diffcore_factorize_renames();
> 
> Huh?  Factorize?

"--factorize-renames" was how the single-option was called in the
previous series.  It has still not turned completely false, but sure
using "bulk rename" would be easier to read.


> Looks like I'll need to read diffcore_factorize_renames() after all. :(

Hopefully that part is better commented :)


> Sorry, not much to say yet.  Hopefully some of that can be useful
> nonetheless.

Thanks for this review, that work is always useful.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]