Re: [PATCH v4 1/4] Introduce wholesame directory move detection in diffcore.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



(+cc: Duy and Baz, who commented before)

Yann Dirson wrote:

> This feature tries to group together files moving from and to
> identical directories - the most common case being directory renames.
> 
> It is activated by the new --detect-dir-renames diffcore
> flag.
> 
> This is only the first step, adding the basic functionnality and
> adding support to raw diff output (and it breaks unified-diff output
> which does not know how to handle that stuff yet).
[...]
>                                                For now both the result
> of "mv a b" and "mv a/* b/" are displayed as "Rnnn a/ b/", which is
> probably not what we want.  "Rnnn a/* b/" could be a good choice for
> the latter if we want them to be distinguished, and even if we want
> them to look the same.

Example?

> Other future developements to be made on top of this include:
> * extension of unified-diff format to express this
[...]
> ---

Oh, so this is for diff --raw only.

For the confused: the discussion from v3 perhaps explains why.  But:

 1. Just like --word-diff, this could be used as a user-facing feature
    for diffs that cannot be applied

 2. Even if no one agrees on the proper diff format, it would be good
    plumbing to have.

v3: http://thread.gmane.org/gmane.comp.version-control.git/99780/focus=99782
v2: http://thread.gmane.org/gmane.comp.version-control.git/99529/focus=99528
v1: http://thread.gmane.org/gmane.comp.version-control.git/94612/focus=96807

I kind of like the cover letter to v1:

 The idea is to add a new pass to diffcore-rename, to group file renames
 looking like a full directory move, and then to hide those file
 renames which do not carry any additionnal information.

 Here is a sample run:

 $ ./git-diff-tree --abbrev=6 ee491 --factorize-renames -r 
 [DBG] possible rename from arm/ to bar/
 [DBG] possible rename from ppc/ to moved/
 [DBG] discarding dir rename of arm/, mixing moved/ and bar/
 [DBG] ppc/* -> moved/* makes ppc/sha1ppc.S -> moved/sha1ppc.S uninteresting
 [DBG] ppc/* -> moved/* makes ppc/sha1.c -> moved/sha1.c uninteresting
 ee491a42190ec6e716f46a55fa0a7f4e307f1629
 :040000 040000 000000... 000000... R100	ppc/	moved/
 :100644 100644 9e3ae0... 9e3ae0... R100	arm/sha1.c	bar/sha1.c
 :100644 100644 395264... 395264... R100	arm/sha1.h	bar/sha1.h
 :100644 100644 c3c51a... c065ee... R099	ppc/sha1.h	moved/sha1.h
 :100644 100644 8c1cb9... 8c1cb9... R100	arm/sha1_arm.S	moved/sha1_arm.S

Presumably this patch takes care of the first step (grouping potential
full-directory moves) and not the second (hiding the redundant file renames).

NaÃve review:

> --- a/diff-lib.c
> +++ b/diff-lib.c
> @@ -208,7 +208,8 @@ int run_diff_files(struct rev_info *revs, unsigned int option)
>  						    ce_option, &dirty_submodule);
>  		if (!changed && !dirty_submodule) {
>  			ce_mark_uptodate(ce);
> -			if (!DIFF_OPT_TST(&revs->diffopt, FIND_COPIES_HARDER))
> +			if (!DIFF_OPT_TST(&revs->diffopt, FIND_COPIES_HARDER) &&
> +			    !DIFF_OPT_TST(&revs->diffopt, DETECT_DIR_RENAMES))
>  				continue;

Hm, why?

> @@ -338,7 +339,8 @@ static int show_modified(struct rev_info *revs,
[...]

Likewise.

[...]
> --- a/diffcore-rename.c
> +++ b/diffcore-rename.c
> @@ -11,6 +11,7 @@
>  static struct diff_rename_dst {
>  	struct diff_filespec *two;
>  	struct diff_filepair *pair;
> +	int i_am_not_single:1; // does not look for a match, only here to be looked at
>  } *rename_dst;

What does single mean?

> @@ -49,9 +50,36 @@ static struct diff_rename_dst *locate_rename_dst(struct diff_filespec *two,
[...]
>  
> +static struct diff_rename_dst *locate_rename_dst_dir(struct diff_filespec *dir)
> +{
> +	/* code mostly duplicated from locate_rename_dst - not sure we
> +	 * could merge them efficiently,though
> +	 */
> +	int first, last;
> +	int prefixlength = strlen(dir->path);
> +
> +	first = 0;
> +	last = rename_dst_nr;
> +	while (last > first) {
> +		int next = (last + first) >> 1;
> +		struct diff_rename_dst *dst = &(rename_dst[next]);
> +		int cmp = strncmp(dir->path, dst->two->path, prefixlength);
> +		if (!cmp)
> +			return dst;
> +		if (cmp < 0) {
> +			last = next;
> +			continue;
> +		}
> +		first = next+1;
> +	}
> +	/* not found */
> +	return NULL;
> +}

Binary search --- this is just a way to index into the sorted list
of rename_dsts, right?

At first I thought it was searching for a good rename destination.
A comment (or overview in the log message) could help clarify.

> @@ -386,8 +414,11 @@ static int find_exact_renames(void)
>  	for (i = 0; i < rename_src_nr; i++)
>  		insert_file_table(&file_table, -1, i, rename_src[i].one);
>  
> -	for (i = 0; i < rename_dst_nr; i++)
> +	for (i = 0; i < rename_dst_nr; i++) {
> +		if (rename_dst[i].i_am_not_single)
> +			continue;
>  		insert_file_table(&file_table, 1, i, rename_dst[i].two);
> +	}

What is this code path for?  (Sorry for the tedious questions.  My
thinking is, if I cannot answer them without doing some legwork, how
could the person about to break your code who is only glancing for
a moment before plunging forward on an exciting new feature?)

> @@ -414,6 +445,180 @@ static void record_if_better(struct diff_score m[], struct diff_score *o)
>  		m[worst] = *o;
>  }
>  
> +struct diff_dir_rename {
> +	struct diff_filespec *one;
> +	struct diff_filespec *two;
> +	int discarded;
> +	struct diff_dir_rename* next;
> +};

Okay.

> +// FIXME: prevent possible overflow
> +/*
> + * Copy dirname of src into dst, suitable to append a filename without
> + * an additional "/".
> + * Only handles relative paths since there is no absolute path in a git repo.
> + * Writes "" when there is no "/" in src.
> + * May overwrite more chars than really needed, if src ends with a "/".
> + */
> +static const char* copy_dirname(char* dst, const char* src)
> +{
> +	char* lastslash = strrchr(src, '/');
> +	if (!lastslash) {
> +		*dst = '\0';
> +		return dst;
> +	}
> +	strncpy(dst, src, lastslash - src + 1);

memcpy?

> +	dst[lastslash - src + 1] = '\0';
> +
> +	// if src ends with a "/" strip the last component
> +	if (lastslash[1] == '\0') {
> +		lastslash = strrchr(dst, '/');
> +		if (!lastslash)
> +			return strcpy(dst, ".");
> +		lastslash[1] = '\0';
> +	}

It might be easier to read like this:

/* Write name of the parent directory of src to dest. */
static char *remove_last_component(char *dst, const char *src)
{
	size_t len = strlen(src);
	const char *slash;

	if (len > 0 && src[len - 1] == '/')
		/* Trailing slash.  Ignore it. */
		len--;

	slash = memrchr(src, '/', len);
	if (!slash) {
		*dst = '\0';
		return dst;
	}

	*mempcpy(dst, src, slash - src) = '\0';
	return dst;
}

Requires the glibc-specific function memrchr(), but that is easy to
implement.  Compare strchrnul() [in git-compat-util.h].

[...]
> +static struct diff_dir_rename* factorization_candidates = NULL;
> +static void diffcore_factorize_renames(void)

Maybe this could be refactored into multiple functions?

> @@ -451,13 +656,22 @@ void diffcore_rename(struct diff_options *options)
>  				p->one->rename_used++;
>  			register_rename_src(p->one, p->score);
>  		}
> +		else {
[...]
> +			if (DIFF_OPT_TST(options, DETECT_DIR_RENAMES)) {
> +				/* similarly, rename factorization needs to
> +				 * see all files from second tree, but we don't
> +				 * want them to be matched against single sources.
> +				 */
> +				locate_rename_dst(p->two, 1)->i_am_not_single = 1;

Huh?

> @@ -509,6 +723,8 @@ void diffcore_rename(struct diff_options *options)
>  
>  		if (rename_dst[i].pair)
>  			continue; /* dealt with exact match already. */
> +		if (rename_dst[i].i_am_not_single)
> +			continue; /* not looking for a match. */

Oh, not single means "not seeking a new relationship".

> @@ -569,7 +785,28 @@ void diffcore_rename(struct diff_options *options)
>  	/* At this point, we have found some renames and copies and they
>  	 * are recorded in rename_dst.  The original list is still in *q.
>  	 */
> +
> +	/* Now possibly factorize those renames and copies. */
> +	if (DIFF_OPT_TST(options, DETECT_DIR_RENAMES))
> +		diffcore_factorize_renames();

Huh?  Factorize?

>  	DIFF_QUEUE_CLEAR(&outq);
> +
> +	// Now turn non-discarded factorization_candidates into real renames
> +	struct diff_dir_rename* candidate;
> +	for (candidate=factorization_candidates; candidate; candidate = candidate->next) {
> +		struct diff_filepair* pair;
> +		if (candidate->discarded) continue;
> +		// visualize toplevel dir if needed - FIXME: wrong place for this ?
> +		if (!*candidate->one->path)
> +			candidate->one->path = "./";
> +		if (!*candidate->two->path)
> +			candidate->two->path = "./";
> +		pair = diff_queue(&outq, candidate->one, candidate->two);
> +		pair->score = MAX_SCORE;
> +		pair->renamed_pair = 1;
> +	}

Looks like I'll need to read diffcore_factorize_renames() after all. :(

Sorry, not much to say yet.  Hopefully some of that can be useful
nonetheless.

Jonathan
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]