On Mon, Oct 04, 2010 at 08:06:20PM -0500, Jonathan Nieder wrote: > > @@ -338,7 +339,8 @@ static int show_modified(struct rev_info *revs, > > > > oldmode = old->ce_mode; > > if (mode == oldmode && !hashcmp(sha1, old->sha1) && !dirty_submodule && > > - !DIFF_OPT_TST(&revs->diffopt, FIND_COPIES_HARDER)) > > + !DIFF_OPT_TST(&revs->diffopt, FIND_COPIES_HARDER) && > > + !DIFF_OPT_TST(&revs->diffopt, DETECT_DIR_RENAMES)) > > return 0; > > The diff-index part. > > Would we also need something like v1.6.4-rc0~45^2 (Avoid "diff-index > --cached" optimization under --find-copies-harder, 2009-05-22) in > order to grok unchanged subtrees, or is that taken care of some other > way? Note to self: look at that > [...] > > --- a/diffcore-rename.c > > +++ b/diffcore-rename.c > > @@ -11,6 +11,7 @@ > > static struct diff_rename_dst { > > struct diff_filespec *two; > > struct diff_filepair *pair; > > + int i_am_not_single:1; // does not look for a match, only here to be looked at > > } *rename_dst; > > If we're looking for directory renames but not finding-copies-harder, > then unmodified files, while interesting and deserving of dst entries, > should not be considered candidates for tracing individual files. > > Yes? If so, maybe the less colorful > > unsigned rename_target_candidate:1; > > would make that clearer Right - but maybe we can use a separate list for those non-candidates as I suggested elsewhere, to make things even more clear. > (and more importantly, the new kind of dst entry should be mentioned > in the commit log message). Hm, the commit message is a bit long already, isn't that too much detail ? > > +static struct diff_rename_dst *locate_rename_dst_dir(struct diff_filespec *dir) > > +{ > > + /* code mostly duplicated from locate_rename_dst - not sure we > > + * could merge them efficiently,though > > + */ > > + int first, last; > > + int prefixlength = strlen(dir->path); > > + > > + first = 0; > > + last = rename_dst_nr; > > + while (last > first) { > > + int next = (last + first) >> 1; > > + struct diff_rename_dst *dst = &(rename_dst[next]); > > + int cmp = strncmp(dir->path, dst->two->path, prefixlength); > > prefixcmp? Since prefixlength is constant accross the loop, strncmp spares comparing every char with \0 - we only do that once. I guess it's worth keeping. > > + if (!cmp) > > + return dst; > > + if (cmp < 0) { > > + last = next; > > + continue; > > + } > > + first = next+1; > > + } > > + /* not found */ > > + return NULL; > > +} > > Hmm, so this is like locate_rename_dst(..., 0), except it matches > prefixes. Result is a rename_dst entry within that directory (why > not spend the time to find the first?). We just don't need the first. In most cases, finding one entry just invalidates the possiblity of a bulk move. Then when we want all of them, we have to scan the list in both directions - and yes, a clean way to avoid code duplication would be nice for those. > > > > /* Find the renames */ > > i = for_each_hash(&file_table, find_same_files); > > @@ -414,6 +445,180 @@ static void record_if_better(struct diff_score m[], struct diff_score *o) > > m[worst] = *o; > > } > > > > +struct diff_dir_rename { > > + struct diff_filespec *one; > > + struct diff_filespec *two; > > + int discarded; > > + struct diff_dir_rename* next; > > +}; > > A linked list of renamed directories. What functions maintain it? That is just the type decl. For this patch it is only used from diffcore_factorize_renames() and diffcore_rename(), but the --hide-dir-rename-details part needs it for maybe_mark_uninteresting, which explains its position in the source file. > [...] > > +static struct diff_dir_rename* factorization_candidates = NULL; > > +static void diffcore_factorize_renames(void) > > +{ > > + char one_parent_path[PATH_MAX], two_parent_path[PATH_MAX]; > > + int i; > > + > > + for (i = 0; i < rename_dst_nr; i++) { > > For each rename target candidate: ... > > Would it make sense to give the body of this loop its own function? Yes, that makes things more readable. > > + struct diff_dir_rename* seen; > > + > > + // FIXME: what are those ? > > + if (!rename_dst[i].pair) > > + continue; > > Genuine new files are not interesting to us (since they do not > provide evidence about directory renames, postive or negative). Well, maybe except for Junio's idea of using bulk-rename detection to bump rename scores (we may want to look at that afterwards, but I fear it will require heavy surgery). > > + // dummy renames used by copy detection > > + if (!strcmp(rename_dst[i].pair->one->path, rename_dst[i].pair->two->path)) > > + continue; > > Whether a file was copied is probably interesting to us, but the > corresponding filepair isn't. Good point: bulk-rename can already be made to direct move+copy detection to select the correct dst for the move. > > + > > + struct diff_filespec* one_parent = alloc_filespec(one_parent_path); > > + fill_filespec(one_parent, null_sha1 /*FIXME*/, S_IFDIR); > > Initialize a filespec, for use in filepairs in case the containing > directories are a source/target pair. (?) We need a filespec for locate_rename_dst_dir, but it is silly, we can just pass it a path string, and initialize the filespec when really needed. Good catch. > > + > > + // After this commit, are there any files still under one_parent_path ? > > + struct diff_rename_dst* one_leftover = locate_rename_dst_dir(one_parent); > > + if (one_leftover) { // FIXME: should only be run if !seen > [...] > > + } > > A candidate for a separate function, no? Right. > The idea, if I understand it: too many unchanged files can > disqualify a rename source. Not exactly: any file left in a dir disqualifies this dir for bulk-rename. > > + > > + // already considered ? > > + for (seen=factorization_candidates; seen; seen = seen->next) > > + if (!strcmp(seen->one->path, one_parent_path)) break; > > Lookup. A candidate for a separate function, I think. > > > + if (!seen) { > > + // record potential dir rename > > + seen = xmalloc(sizeof(*seen)); > > + seen->one = one_parent; > > + seen->two = alloc_filespec(two_parent_path); > > + fill_filespec(seen->two, null_sha1 /*FIXME*/, S_IFDIR); > > + seen->discarded = 0; > > + seen->next = factorization_candidates; > > + factorization_candidates = seen; > > + fprintf(stderr, "[DBG] %s -> %s suggests possible rename from %s to %s\n", > > + rename_dst[i].pair->one->path, > > + rename_dst[i].pair->two->path, > > + one_parent_path, two_parent_path); > > Insertion. Likewise. Hm, it's only used once, and once the other splits are done the func fits on a screen (at least on my display ;) OTOH, if we decide for a sorted list, to allow for binary search to be more efficient, then yes. > > + > > + /* all checks ok, we keep that entry */ > > + } > > + > > + return; > > +} > > + > > void diffcore_rename(struct diff_options *options) > > { > > int detect_rename = options->detect_rename; > > @@ -451,13 +656,22 @@ void diffcore_rename(struct diff_options *options) > [...] > > if (detect_rename == DIFF_DETECT_COPY) { > > /* > > * Increment the "rename_used" score by > > * one, to indicate ourselves as a user. > > */ > > p->one->rename_used++; > > register_rename_src(p->one, p->score); > > } > > + if (DIFF_OPT_TST(options, DETECT_DIR_RENAMES)) { > > + /* similarly, rename factorization needs to > > + * see all files from second tree, but we don't > > + * want them to be matched against single sources. > > + */ > > + locate_rename_dst(p->two, 1)->i_am_not_single = 1; > > ... except when --find-copies-harder is being used, right? I have not investigated interactions with copy detection yet. > > @@ -569,7 +785,28 @@ void diffcore_rename(struct diff_options *options) > > /* At this point, we have found some renames and copies and they > > * are recorded in rename_dst. The original list is still in *q. > > */ > > + > > + /* Now possibly factorize those renames and copies. */ > > + if (DIFF_OPT_TST(options, DETECT_DIR_RENAMES)) > > + diffcore_factorize_renames(); > > + > > DIFF_QUEUE_CLEAR(&outq); > > + > > + // Now turn non-discarded factorization_candidates into real renames > > + struct diff_dir_rename* candidate; > > + for (candidate=factorization_candidates; candidate; candidate = candidate->next) { > > + struct diff_filepair* pair; > > + if (candidate->discarded) continue; > > + // visualize toplevel dir if needed - FIXME: wrong place for this ? > > + if (!*candidate->one->path) > > + candidate->one->path = "./"; > > + if (!*candidate->two->path) > > + candidate->two->path = "./"; > > + pair = diff_queue(&outq, candidate->one, candidate->two); > > + pair->score = MAX_SCORE; > > + pair->renamed_pair = 1; > > Outputting the discovered directory renames. Not really outputting, just queueing for ouput here. > Conclusions: > > - the basic idea looks sane > - the main function would benefit a lot from being split up a bit > - would be nice to have an overview of the design (especially, a > quick description of the heuristics used) for the commit message Thanks for this detailed review. Pushing to http://repo.or.cz/w/git/ydirson.git a set of patches addressing those remarks and others from this thread. -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html