On Wed, Jul 09, 2014 at 08:51:07AM -0700, Junio C Hamano wrote:

> > The delta heuristics in pack-objects use pack_name_hash, which claims:
> >
> >         /*
> >          * This effectively just creates a sortable number from the
> >          * last sixteen non-whitespace characters. Last characters
> >          * count "most", so things that end in ".c" sort together.
> >          */
> >
> > which might be another option (and seems like a superset of the basename
> > check, short of basenames that are longer than 16 characters).
>
> Perhaps.
>
> I am however not sure if the code to compute similarity score is as
> OK with false positives, i.e. dissimilar names that happen to hash
> together getting clumped in a same bin or in close bins, as the
> existing callers of pack_name_hash().

I think the hash here does not collide in that way. It really is just
the last sixteen characters shoved into a uint32_t. But thinking on it
more, that is useful to the delta code because it wants to create a
sorted list of items.

In the rename code we are doing pairwise comparisons, so we are more
flexible. We can compare whole basenames, or whole suffixes (so
"a/foo/bar.c" is closer to "b/foo/bar.c" than to "c/other/bar.c"). Or
just use a general-purpose edit-distance function.

The tricky part is that the rename detection seems to take the score as
a binary 0/1 "is it the same", but we would want to express more nuance
(i.e., the "best" match among those that have similar content scores).

-Peff
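
To make the "whole suffixes" idea concrete, here is a minimal sketch of
a pairwise name score. It is not git code: common_suffix_len() is a
hypothetical helper invented for illustration, and it compares raw
bytes rather than path components, which may or may not be what the
rename code would want.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not part of git: return the number of trailing
 * bytes that two paths have in common. A longer shared suffix means
 * the two names are "closer" when breaking ties between candidates.
 */
size_t common_suffix_len(const char *a, const char *b)
{
	size_t la, lb, n = 0;

	assert(a != NULL && b != NULL);
	la = strlen(a);
	lb = strlen(b);

	/* Walk backwards from the end of both strings while bytes match. */
	while (n < la && n < lb && a[la - 1 - n] == b[lb - 1 - n])
		n++;
	return n;
}
```

With this sketch, "a/foo/bar.c" against "b/foo/bar.c" scores 10 (the
shared "/foo/bar.c"), while "a/foo/bar.c" against "c/other/bar.c"
scores 6 (only "/bar.c"). Unlike a 0/1 basename check, that gives a
graded value that could rank candidates whose content scores are
similar.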