Re: icase pathspec magic support in ls-tree

Tao Klerks <tao@xxxxxxxxxx> · Sun, 16 Oct 2022 00:06:50 +0200

This seems to be working, thank you!!!

Two updates I had to make, in case this is useful to anyone else:

1: I'm getting some weird behavior I can't explain yet, where some
paths are returned from the ls-tree call twice: The input to ls-tree
is all unique paths, but the output somehow includes a relatively
small subset of paths twice.
This mysterious issue is easily addressed by adding an extra "uniq"
call to remove the "trivial dupes" before hunting for the
"case-insensitive dupes" we're interested in:

git diff --diff-filter=A --no-renames --name-only HEAD~1 HEAD |
  all-leading-dirs.py |
  xargs --no-run-if-empty git ls-tree --name-only -t HEAD |
  sort | uniq | uniq -i -d

2: The xargs call has issues with paths with spaces in them. Adding
-d"\n" (a GNU xargs option, like --no-run-if-empty) makes newline the
only delimiter, which seems to be a clean way to fix this:

git diff --diff-filter=A --no-renames --name-only HEAD~1 HEAD |
  all-leading-dirs.py |
  xargs -d"\n" --no-run-if-empty git ls-tree --name-only -t HEAD |
  sort | uniq | uniq -i -d
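
For anyone who would rather do the final filtering step outside of
sort/uniq, here is a minimal sketch of the same idea in Python. The
function name case_insensitive_dupes is made up for illustration, and
str.casefold() is only an approximation of whatever case rules a given
case-insensitive filesystem applies. Grouping on a folded key also
drops the exact repeats (the set does that), and it does not depend on
the sort order putting case variants on adjacent lines:

```python
from collections import defaultdict


def case_insensitive_dupes(paths):
    """Return groups of distinct paths that differ only by case."""
    groups = defaultdict(set)
    for path in paths:
        # Paths that fold to the same key collide; the set silently
        # absorbs exact repeats (the "trivial dupes").
        groups[path.casefold()].add(path)
    # Keep only keys with more than one distinct spelling.
    return sorted(sorted(g) for g in groups.values() if len(g) > 1)


# Example input: one exact repeat and one case-only collision.
names = ["dir1/subdir", "dir1/SubDir", "dir1/subdir", "dir2"]
print(case_insensitive_dupes(names))  # [['dir1/SubDir', 'dir1/subdir']]
```

Unlike "uniq -i -d", which prints one name per collision, this returns
every spelling involved.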

Not only does this approach seem to work well, but it also has far
better performance characteristics than I was expecting!

Simple small commit (10 files): 20ms
Reasonably large commit (10,000 files): 250ms
Diff from empty on a smaller branch (100,000 files): 2,800ms
Diff from empty on a larger branch (200,000 files): 5,400ms

It still makes sense to check the number of files/lines after doing
the diff, and to fall back to a "simple" 800ms full-tree (no-pathspec)
dupe check once the diff goes past some file-count threshold. That
threshold looks quite high in my environment, though - 30k files
maybe?
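
As a rough sanity check on that guess, the sketch below linearly
interpolates between two of the timings above (10,000 files at 250ms
and 100,000 files at 2,800ms) to find where the targeted check would
cost as much as the 800ms full-tree check. Linear scaling between those
two points is an assumption, and break_even_files is just an
illustrative name:

```python
def break_even_files(p1, p2, full_tree_ms):
    """Estimate the file count at which the targeted check takes full_tree_ms.

    p1 and p2 are (file_count, milliseconds) measurements; the cost is
    assumed to grow linearly between them.
    """
    (n1, t1), (n2, t2) = p1, p2
    ms_per_file = (t2 - t1) / (n2 - n1)
    return n1 + (full_tree_ms - t1) / ms_per_file


estimate = break_even_files((10_000, 250), (100_000, 2_800), 800)
print(round(estimate))  # 29412
```

So the measurements are at least consistent with a threshold in the
neighbourhood of 30k files.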

I will have a go at writing a full update hook, and (without knowing
whether it will make sense from a performance perspective) I'd like to
try to implement the "all-leading-dirs" logic in bash 4 using
associative arrays, to remove the python dependency. If I make it work
I'll post back here.
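
The logic to port is small. A sketch of it as a pure function, with a
set standing in for what would be the keys of a bash associative array
(the function name and the use of posixpath are my choices here, not
part of the original script):

```python
import posixpath


def all_leading_dirs(paths):
    """Return every leading directory of the given repo-relative paths.

    Each entry gets a trailing '/', and the top level is spelled ':/'.
    """
    seen = set()
    for path in paths:
        d = posixpath.dirname(path)
        # Walk upwards until we reach a directory that was already
        # recorded; all of its ancestors were recorded with it.
        while d not in seen:
            seen.add(d)
            d = posixpath.dirname(d)
    # "" is the top-level directory; emit it as ':' plus the shared '/'.
    return sorted((d or ":") + "/" for d in seen)


print(all_leading_dirs(["dir1/SubDir/whatever/foo"]))
# [':/', 'dir1/', 'dir1/SubDir/', 'dir1/SubDir/whatever/']
```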

This seems to cover what I needed icase pathspec magic for in ls-tree,
without having to implement it - so thanks again!

Tao

On Fri, Oct 14, 2022 at 7:06 PM Elijah Newren <newren@xxxxxxxxx> wrote:
>
> On Fri, Oct 14, 2022 at 1:48 AM Tao Klerks <tao@xxxxxxxxxx> wrote:
> >
> > On Fri, Oct 14, 2022 at 9:41 AM Elijah Newren <newren@xxxxxxxxx> wrote:
> > >
> [...]
> > > I don't see why you need to do full-tree with existing options, nor
> > > why the ls-tree option you want would somehow make it easier to avoid.
> > > I think you can avoid the full-tree search with something like:
> > >
> > > git diff --diff-filter=A --no-renames --name-only $OLDHASH $NEWHASH |
> > > sed -e s%/[^/]*$%/% | uniq | xargs git ls-tree --name-only $NEWHASH |
> > >    sort | uniq -i -d
> > >
> > > The final "sort | uniq -i -d" is taken from Torsten's suggestion.
> > >
> > > The git diff ... xargs git ls-tree section on the first line will
> > > provide a list of all files (& subdirs) in the same directory as any
> > > added file.  (Although, it has a blind spot for paths in the toplevel
> > > directory.)
> >
> > The theoretical problem with this approach is that it only addresses
> > case-insensitive-duplicate files, not directories.
>
> It'll catch some case-insensitive-duplicate directories too -- note
> that I did call out that it'd print subdirs.  But to be more cautious,
> you would need to carefully grab all leading directories of any added
> path, not just the immediate leading directory.
>
> > Directories have been the problem, in "my" repo, around one-third of
> > the time - typically someone does a directory rename, and someone else
> > does a bad merge and reintroduces the old directory.
> >
> > That said, what "icase pathspec magic" actually *does*, is break down
> > the pathspec into iteratively more complete paths, level by level,
> > looking for case-duplicates at each level. That's something I could
> > presumably do in shell scripting, collecting all the interesting
> > sub-paths first, and then getting ls-tree to tell me about the
> > immediate children for each sub-path, doing case-insensitive dupe
> > searches across children for each of these sub-paths.
> >
> > ls-tree supporting icase pathspec magic would clearly be more
> > efficient (I wouldn't need N ls-tree git processes, where N is the
> > number of sub-paths in the diff), but this should be plenty efficient
> > for normal commits, with a fallback to the full search.
> >
> > This seems like a sensible direction, I'll have a play.
>
> If you create a script that gives you all leading directories of any
> listed path (plus replacing the toplevel dir with ':/'), such as this
> (which I'm calling 'all-leading-dirs.py'):
>
> """
> #!/usr/bin/env python3
>
> import os
> import sys
>
> paths = sys.stdin.read().splitlines()
> dirs_seen = set()
> for path in paths:
>   dir = path
>   while dir:
>     dir = os.path.dirname(dir)
>     if dir in dirs_seen:
>       continue
>     dirs_seen.add(dir)
> if dirs_seen:
>   # Replace top-level dir of "" with ":"; we'll add the trailing '/'
>   # below when adding it to all other dirs
>   dirs_seen.remove("")
>   dirs_seen.add(':')
>   for dir in dirs_seen:
>     # ls-tree wants the trailing '/' if we are going to list contents
>     # within that tree rather than just the tree itself
>     print(dir+'/')
> """
>
> Then the following will catch duplicate files and directories for you:
>
> git diff --diff-filter=A --no-renames --name-only HEAD~1 HEAD |
>   all-leading-dirs.py |
>   xargs --no-run-if-empty git ls-tree --name-only -t HEAD |
>   sort | uniq -i -d
>
> and it no longer has problems catching duplicates in the toplevel
> directory either.  It does have (at most) two git invocations, but
> only one invocation of ls-tree.  Here's a test script to prove it
> works:
>
> """
> #!/bin/bash
>
> git init -b main nukeme
> cd nukeme
> mkdir -p dir1/subdir/whatever
> mkdir -p dir2/subdir/whatever
> >dir1/subdir/whatever/foo
> >dir2/subdir/whatever/foo
> git add .
> git commit -m initial
>
> mkdir -p dir1/SubDir/whatever
> >dir1/SubDir/whatever/foo
> git add .
> git commit -m stuff
>
> git diff --diff-filter=A --no-renames --name-only HEAD~1 HEAD |
>   all-leading-dirs.py |
>   xargs --no-run-if-empty git ls-tree --name-only -t HEAD |
>   sort | uniq -i -d
> """
>
> The output of this script is
> """
> dir1/subdir
> """
> which correctly notifies on the duplicate (dir1/SubDir being the
> other; uniq is the one that picks which of the two duplicate names to
> print)