This seems to be working, thank you!!! Two updates I had to make, in case this is useful to anyone else: 1: I'm getting some weird behavior I can't explain yet, where some paths are returned from the ls-tree call twice: The input to ls-tree is all unique paths, but the output somehow includes a relatively small subset of paths twice. This mysterious issue is easily addressed by adding an extra "uniq" call to remove the "trivial dupes" before hunting for the "case-insensitive dupes" we're interested in: git diff --diff-filter=A --no-renames --name-only HEAD~1 HEAD | all-leading-dirs.py | xargs --no-run-if-empty git ls-tree --name-only -t HEAD | sort | uniq | uniq -i -d 2: The xargs call has issues with paths with spaces in them. Adding -d"\n" seems to be a clean way to fix this git diff --diff-filter=A --no-renames --name-only HEAD~1 HEAD | all-leading-dirs.py | xargs -d"\n" --no-run-if-empty git ls-tree --name-only -t HEAD | sort | uniq | uniq -i -d Not only does this approach seem to work well, but it also has far better performance characteristics than I was expecting! Simple small commit (10 files): 20ms Reasonably large commit (10,000 files): 250ms Diff from empty on a smaller branch (100,000 files): 2,800ms Diff from empty on a larger branch (200,000 files): 5,400ms It still makes sense to check the number of files/lines after doing the diff, and do a "simple" 800ms full-tree (no-pathspec) dupe check instead of this when your diff size goes past some file count threshold, but it looks like that threshold would be quite high in my environment - 30k files maybe? I will have a go at writing a full update hook, and (without knowing whether it will make sense from a performance perspective) I'd like to try to implement the "all-leading-dirs" logic in bash 4 using associative arrays, to remove the python dependency. If I make it work I'll post back here. This seems to cover what I needed icase pathspec magic for in ls-tree, without having to implement it - so thanks again! Tao On Fri, Oct 14, 2022 at 7:06 PM Elijah Newren <newren@xxxxxxxxx> wrote: > > On Fri, Oct 14, 2022 at 1:48 AM Tao Klerks <tao@xxxxxxxxxx> wrote: > > > > On Fri, Oct 14, 2022 at 9:41 AM Elijah Newren <newren@xxxxxxxxx> wrote: > > > > [...] > > > I don't see why you need to do full-tree with existing options, nor > > > why the ls-tree option you want would somehow make it easier to avoid. > > > I think you can avoid the full-tree search with something like: > > > > > > git diff --diff-filter=A --no-renames --name-only $OLDHASH $NEWHASH | > > > sed -e s%/[^/]*$%/% | uniq | xargs git ls-tree --name-only $NEWHASH | > > > \ > > > sort | uniq -i -d > > > > > > The final "sort | uniq -i -d" is taken from Torsten's suggestion. > > > > > > The git diff ... xargs git ls-tree section on the first line will > > > provide a list of all files (& subdirs) in the same directory as any > > > added file. (Although, it has a blind spot for paths in the toplevel > > > directory.) > > > > The theoretical problem with this approach is that it only addresses > > case-insensitive-duplicate files, not directories. > > It'll catch some case-insensitive-duplicate directories too -- note > that I did call out that it'd print subdirs. But to be more cautious, > you would need to carefully grab all leading directories of any added > path, not just the immediate leading directory. > > > Directories have been the problem, in "my" repo, around one-third of > > the time - typically someone does a directory rename, and someone else > > does a bad merge and reintroduces the old directory. > > > > That said, what "icase pathspec magic" actually *does*, is break down > > the pathspec into iteratively more complete paths, level by level, > > looking for case-duplicates at each level. That's something I could > > presumably do in shell scripting, collecting all the interesting > > sub-paths first, and then getting ls-tree to tell me about the > > immediate children for each sub-path, doing case-insensitive dupe > > searches across children for each of these sub-paths. > > > > ls-tree supporting icase pathspec magic would clearly be more > > efficient (I wouldn't need N ls-tree git processes, where N is the > > number of sub-paths in the diff), but this should be plenty efficient > > for normal commits, with a fallback to the full search > > > > This seems like a sensible direction, I'll have a play. > > If you create a script that gives you all leading directories of any > listed path (plus replacing the toplevel dir with ':/'), such as this > (which I'm calling 'all-leading-dirs.py'): > > """ > #!/usr/bin/env python3 > > import os > import sys > > paths = sys.stdin.read().splitlines() > dirs_seen = set() > for path in paths: > dir = path > while dir: > dir = os.path.dirname(dir) > if dir in dirs_seen: > continue > dirs_seen.add(dir) > if dirs_seen: > # Replace top-level dir of "" with ":"; we'll add the trailing '/' > below when adding it to all other dirs > dirs_seen.remove("") > dirs_seen.add(':') > for dir in dirs_seen: > print(dir+'/') # ls-tree wants the trailing '/' if we are going > to list contents within that tree rather than just the tree itself > """ > > Then the following will catch duplicates files and directories for you: > > git diff --diff-filter=A --no-renames --name-only HEAD~1 HEAD | > all-leading-dirs.py | xargs --no-run-if-empty git ls-tree --name-only > -t HEAD | sort | uniq -i -d > > and it no longer has problems catching duplicates in the toplevel > directory either. It does have (at most) two git invocations, but > only one invocation of ls-tree. Here's a test script to prove it > works: > > """ > #!/bin/bash > > git init -b main nukeme > cd nukeme > mkdir -p dir1/subdir/whatever > mkdir -p dir2/subdir/whatever > >dir1/subdir/whatever/foo > >dir2/subdir/whatever/foo > git add . > git commit -m initial > > mkdir -p dir1/SubDir/whatever > >dir1/SubDir/whatever/foo > git add . > git commit -m stuff > > git diff --diff-filter=A --no-renames --name-only HEAD~1 HEAD | > all-leading-dirs.py | xargs --no-run-if-empty git ls-tree --name-only > -t HEAD | sort | uniq -i -d > """ > > The output of this script is > """ > dir1/subdir > """ > which correctly notifies on the duplicate (dir1/SubDir being the > other; uniq is the one that picks which of the two duplicate names to > print)