On Fri, Oct 14, 2022 at 10:58 AM Tao Klerks <tao@xxxxxxxxxx> wrote:
>
> I don't understand this suggestion; doesn't it only catch duplicates
> where both instances were introduced in the same 100-commit range?

Yes. It was a bit half-baked, but the main idea was to limit the
search to a smaller subset (not the whole tree) and to check
incrementally for newly introduced duplicates instead of doing a
full-tree search. I think that's basically Elijah's idea: get all
(added?) files introduced in a certain revision range (last change,
since yesterday, etc.) and then check only those against the tree for
duplicates, however you define "duplicates".

On Fri, Oct 14, 2022 at 10:50 AM Tao Klerks <tao@xxxxxxxxxx> wrote:
>
> Directories have been the problem, in "my" repo, around one-third of
> the time - typically someone does a directory rename, and someone else
> does a bad merge and reintroduces the old directory.

That adds a bit of complexity :/ but it should still be doable.

Not perfect, but maybe something along these lines? (Caveat: possibly
GNU-only.)

#!/bin/sh

# files added between revisions x y
added_files() {
	git diff --diff-filter=A --name-only --no-renames $1 $2
}

# folders of those added files
added_folders() {
	added_files $1 $2 |
	sed -e '/[^\/]*/s@^@./@' -e 's@/[^/]*$@/@' |
	sort -u
}

# all files tracked by git in *those* folders at HEAD
possible_dupes() {
	added_folders $1 $2 |
	xargs git ls-tree --name-only HEAD
}

# case-insensitive columns separated by \x1
# e.g.
#	path\x1PaTh
#	path\x1path
case_insensitive() {
	sed -e 's@.*@\L\0\E\x1\0@' | sort
}

x=$1
y=$2

# Find all duplicate paths (case-insensitive)
# in directories to which files were added between $x and $y
possible_dupes $x $y |
case_insensitive |
awk -F '\x1' '
	# actual "duplicate" paths, column $2,
	# as determined by case-insensitive column $1
	$1 in a { print a[$1]; print $2 }
	{ a[$1] = $2 }
' | uniq
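
For reference, a hypothetical invocation (assuming the script above is
saved as check-dupes.sh and made executable; the two arguments are
simply the revisions handed to "git diff"):

	# paths colliding case-insensitively, introduced since yesterday
	./check-dupes.sh 'HEAD@{yesterday}' HEAD

	# or against the last 100 commits
	./check-dupes.sh HEAD~100 HEAD

Any output means at least two tracked paths in the scanned directories
differ only by case, i.e. they would collide on a case-insensitive
filesystem.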