Elijah Newren <newren@xxxxxxxxx> writes:

> Heh, good point.  And more generally, due to how the similarity checks
> work (split the file into lines/chunks, hash each to an integer, then
> sort the list of integers and compare the list of integers between two
> files), whenever you keep all the original lines of a file but permute
> their order, you will see a 100% match.

Stepping back a bit, internally there is a distinction between
"identical files" and "similar but not identical files".  A file with
permuted lines is not identical to the original, but is very similar.
The decision not to surface this distinction to end-users was made long
ago.  Also, when minimum_score is set via "-M100" from the command
line, we skip the similarity computation altogether after finding
identical renames, and that is a documented UI that cannot change.  So
you are right to say that the similarity index of an inexact rename
should be capped at 99 and never reach 100.

I do not know offhand if your "how about" computation ...

>  static int similarity_index(struct diff_filepair *p)
>  {
> -	return p->score * 100 / MAX_SCORE;
> +	return p->score * 100 / (MAX_SCORE+1);
>  }

... is the way to go, though.  We have the filepair, so shouldn't we be
doing something more like

	if (identical_rename(p))
		return 100;
	else
		return 99 * p->score / MAX_SCORE;

instead?  It will also keep working when a minor pet peeve of mine is
fixed.  If you have a file with 1000 lines, all hashing to distinct
values, and you change only one line, the similarity is computed as 99%
by the current logic, even though it actually is 99.9%, which is much
closer to 100%.  But we always round down, which feels wrong.

Alternatively, we could cap p->score to (MAX_SCORE-1) when the
similarity is _computed_ for a filepair between non-identical blobs.
That should happen in diffcore-rename.c:estimate_similarity().  It
would allow us to still _show_ the "1 line changed among 1000 lines"
case as R100 (once the round-down above is fixed) and still reject
inexact renames via "-M100" from the command line, I think, because the
"exact renames only" short-cut works with the finer-grained "score"
values and not the coarser "percent" values.  That might give us a
better approach.
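
To illustrate what I mean by that last paragraph, here is a throwaway
toy (not a patch; the helper names are made up, and I am quoting
MAX_SCORE as 60000 from memory) that models capping the computed score
for non-identical blobs:

	#include <stdio.h>

	#define MAX_SCORE 60000	/* diffcore.h says 60000.0, if I recall correctly */

	/* toy stand-in for estimate_similarity(), with the proposed cap */
	static int toy_score(int copied, int total, int identical)
	{
		int score = (int)((double)copied * MAX_SCORE / total);
		if (!identical && score >= MAX_SCORE)
			score = MAX_SCORE - 1;	/* non-identical blobs never reach MAX_SCORE */
		return score;
	}

	/* toy stand-in for similarity_index(), unchanged from today */
	static int toy_similarity_index(int score)
	{
		return score * 100 / MAX_SCORE;
	}

	int main(void)
	{
		int permuted = toy_score(1000, 1000, 0);	/* all lines kept, order shuffled */
		int exact    = toy_score(1000, 1000, 1);	/* byte-for-byte identical */

		/* "-M100" asks for minimum_score == MAX_SCORE */
		printf("permuted: score=%d passes -M100=%s shown as R%d\n",
		       permuted, permuted >= MAX_SCORE ? "yes" : "no",
		       toy_similarity_index(permuted));
		printf("exact:    score=%d passes -M100=%s shown as R%d\n",
		       exact, exact >= MAX_SCORE ? "yes" : "no",
		       toy_similarity_index(exact));
		return 0;
	}

The permuted rename comes out as score 59999, gets rejected by "-M100",
and shows as R99, while the byte-for-byte identical one keeps its R100.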
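
And if the round-down were fixed at the same time, the percent
conversion could round to nearest, along these lines (again only a
sketch with a made-up name, not a patch):

	static int toy_similarity_index_rounded(int score)
	{
		/* round to the nearest percent instead of truncating */
		return (score * 100 + MAX_SCORE / 2) / MAX_SCORE;
	}

With that, the "1 line changed among 1000 lines" case (keeping 999 of
1000 lines gives score 59940 with the toy's arithmetic) displays as
R100, and so does the capped 59999, while "-M100" still rejects both,
because the exact-renames-only short-cut compares the raw score, not
the percentage.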