Thanks again for the code samples above! I've spent some time productizing this for my environment, and I think it's pretty close to optimal for my environment at least. In case this can be of value to anyone else, I've attached the final thing. This version is genericized a little - two specific specializations for my environment not included here are: * a specific strategy for determining the most-likely-correct upstream branch to use as a default for "diff from" for new branches, to avoid having to do a full-tree dupe search; in most environments choosing "master" or "main" would likely be the right thing most of the time * a specific list of branch paths that we should validate (vs others that we should not) Best regards, Tao On Sun, Oct 16, 2022 at 12:06 AM Tao Klerks <tao@xxxxxxxxxx> wrote: > > This seems to be working, thank you!!! > > Two updates I had to make, in case this is useful to anyone else: > > 1: I'm getting some weird behavior I can't explain yet, where some > paths are returned from the ls-tree call twice: The input to ls-tree > is all unique paths, but the output somehow includes a relatively > small subset of paths twice. > This mysterious issue is easily addressed by adding an extra "uniq" > call to remove the "trivial dupes" before hunting for the > "case-insensitive dupes" we're interested in: > > git diff --diff-filter=A --no-renames --name-only HEAD~1 HEAD | > all-leading-dirs.py | xargs --no-run-if-empty git ls-tree --name-only > -t HEAD | sort | uniq | uniq -i -d > > 2: The xargs call has issues with paths with spaces in them. Adding > -d"\n" seems to be a clean way to fix this > > git diff --diff-filter=A --no-renames --name-only HEAD~1 HEAD | > all-leading-dirs.py | xargs -d"\n" --no-run-if-empty git ls-tree --name-only > -t HEAD | sort | uniq | uniq -i -d > > > Not only does this approach seem to work well, but it also has far > better performance characteristics than I was expecting! > > Simple small commit (10 files): 20ms > Reasonably large commit (10,000 files): 250ms > Diff from empty on a smaller branch (100,000 files): 2,800ms > Diff from empty on a larger branch (200,000 files): 5,400ms > > > It still makes sense to check the number of files/lines after doing > the diff, and do a "simple" 800ms full-tree (no-pathspec) dupe check > instead of this when your diff size goes past some file count > threshold, but it looks like that threshold would be quite high in my > environment - 30k files maybe? > > I will have a go at writing a full update hook, and (without knowing > whether it will make sense from a performance perspective) I'd like to > try to implement the "all-leading-dirs" logic in bash 4 using > associative arrays, to remove the python dependency. If I make it work > I'll post back here. > > This seems to cover what I needed icase pathspec magic for in ls-tree, > without having to implement it - so thanks again! > > Tao > > On Fri, Oct 14, 2022 at 7:06 PM Elijah Newren <newren@xxxxxxxxx> wrote: > > > > On Fri, Oct 14, 2022 at 1:48 AM Tao Klerks <tao@xxxxxxxxxx> wrote: > > > > > > On Fri, Oct 14, 2022 at 9:41 AM Elijah Newren <newren@xxxxxxxxx> wrote: > > > > > > [...] > > > > I don't see why you need to do full-tree with existing options, nor > > > > why the ls-tree option you want would somehow make it easier to avoid. > > > > I think you can avoid the full-tree search with something like: > > > > > > > > git diff --diff-filter=A --no-renames --name-only $OLDHASH $NEWHASH | > > > > sed -e s%/[^/]*$%/% | uniq | xargs git ls-tree --name-only $NEWHASH | > > > > \ > > > > sort | uniq -i -d > > > > > > > > The final "sort | uniq -i -d" is taken from Torsten's suggestion. > > > > > > > > The git diff ... xargs git ls-tree section on the first line will > > > > provide a list of all files (& subdirs) in the same directory as any > > > > added file. (Although, it has a blind spot for paths in the toplevel > > > > directory.) > > > > > > The theoretical problem with this approach is that it only addresses > > > case-insensitive-duplicate files, not directories. > > > > It'll catch some case-insensitive-duplicate directories too -- note > > that I did call out that it'd print subdirs. But to be more cautious, > > you would need to carefully grab all leading directories of any added > > path, not just the immediate leading directory. > > > > > Directories have been the problem, in "my" repo, around one-third of > > > the time - typically someone does a directory rename, and someone else > > > does a bad merge and reintroduces the old directory. > > > > > > That said, what "icase pathspec magic" actually *does*, is break down > > > the pathspec into iteratively more complete paths, level by level, > > > looking for case-duplicates at each level. That's something I could > > > presumably do in shell scripting, collecting all the interesting > > > sub-paths first, and then getting ls-tree to tell me about the > > > immediate children for each sub-path, doing case-insensitive dupe > > > searches across children for each of these sub-paths. > > > > > > ls-tree supporting icase pathspec magic would clearly be more > > > efficient (I wouldn't need N ls-tree git processes, where N is the > > > number of sub-paths in the diff), but this should be plenty efficient > > > for normal commits, with a fallback to the full search > > > > > > This seems like a sensible direction, I'll have a play. > > > > If you create a script that gives you all leading directories of any > > listed path (plus replacing the toplevel dir with ':/'), such as this > > (which I'm calling 'all-leading-dirs.py'): > > > > """ > > #!/usr/bin/env python3 > > > > import os > > import sys > > > > paths = sys.stdin.read().splitlines() > > dirs_seen = set() > > for path in paths: > > dir = path > > while dir: > > dir = os.path.dirname(dir) > > if dir in dirs_seen: > > continue > > dirs_seen.add(dir) > > if dirs_seen: > > # Replace top-level dir of "" with ":"; we'll add the trailing '/' > > below when adding it to all other dirs > > dirs_seen.remove("") > > dirs_seen.add(':') > > for dir in dirs_seen: > > print(dir+'/') # ls-tree wants the trailing '/' if we are going > > to list contents within that tree rather than just the tree itself > > """ > > > > Then the following will catch duplicates files and directories for you: > > > > git diff --diff-filter=A --no-renames --name-only HEAD~1 HEAD | > > all-leading-dirs.py | xargs --no-run-if-empty git ls-tree --name-only > > -t HEAD | sort | uniq -i -d > > > > and it no longer has problems catching duplicates in the toplevel > > directory either. It does have (at most) two git invocations, but > > only one invocation of ls-tree. Here's a test script to prove it > > works: > > > > """ > > #!/bin/bash > > > > git init -b main nukeme > > cd nukeme > > mkdir -p dir1/subdir/whatever > > mkdir -p dir2/subdir/whatever > > >dir1/subdir/whatever/foo > > >dir2/subdir/whatever/foo > > git add . > > git commit -m initial > > > > mkdir -p dir1/SubDir/whatever > > >dir1/SubDir/whatever/foo > > git add . > > git commit -m stuff > > > > git diff --diff-filter=A --no-renames --name-only HEAD~1 HEAD | > > all-leading-dirs.py | xargs --no-run-if-empty git ls-tree --name-only > > -t HEAD | sort | uniq -i -d > > """ > > > > The output of this script is > > """ > > dir1/subdir > > """ > > which correctly notifies on the duplicate (dir1/SubDir being the > > other; uniq is the one that picks which of the two duplicate names to > > print)
Attachment:
case-insensitive-duplicate-check-update-hook.bash
Description: Binary data