Re: icase pathspec magic support in ls-tree

Tao Klerks <tao@xxxxxxxxxx> · Mon, 17 Oct 2022 17:46:44 +0200

Thanks again for the code samples above!

I've spent some time productizing this for my environment, and I think
it's pretty close to optimal for my environment at least.

In case this can be of value to anyone else, I've attached the final thing.

This version is genericized a little - two specific specializations
for my environment not included here are:
* a specific strategy for determining the most-likely-correct upstream
branch to use as a default for "diff from" for new branches, to avoid
having to do a full-tree dupe search; in most environments choosing
"master" or "main" would likely be the right thing most of the time
* a specific list of branch paths that we should validate (vs others
that we should not)

Best regards,
Tao

On Sun, Oct 16, 2022 at 12:06 AM Tao Klerks <tao@xxxxxxxxxx> wrote:
>
> This seems to be working, thank you!!!
>
> Two updates I had to make, in case this is useful to anyone else:
>
> 1: I'm getting some weird behavior I can't explain yet, where some
> paths are returned from the ls-tree call twice: The input to ls-tree
> is all unique paths, but the output somehow includes a relatively
> small subset of paths twice.
> This mysterious issue is easily addressed by adding an extra "uniq"
> call to remove the "trivial dupes" before hunting for the
> "case-insensitive dupes" we're interested in:
>
> git diff --diff-filter=A --no-renames --name-only HEAD~1 HEAD |
> all-leading-dirs.py | xargs --no-run-if-empty git ls-tree --name-only
> -t HEAD | sort | uniq | uniq -i -d
>
> 2: The xargs call has issues with paths with spaces in them. Adding
> -d"\n" seems to be a clean way to fix this
>
> git diff --diff-filter=A --no-renames --name-only HEAD~1 HEAD |
> all-leading-dirs.py | xargs -d"\n" --no-run-if-empty git ls-tree --name-only
> -t HEAD | sort | uniq | uniq -i -d
>
>
> Not only does this approach seem to work well, but it also has far
> better performance characteristics than I was expecting!
>
> Simple small commit (10 files): 20ms
> Reasonably large commit (10,000 files): 250ms
> Diff from empty on a smaller branch (100,000 files): 2,800ms
> Diff from empty on a larger branch (200,000 files): 5,400ms
>
>
> It still makes sense to check the number of files/lines after doing
> the diff, and do a "simple" 800ms full-tree (no-pathspec) dupe check
> instead of this when your diff size goes past some file count
> threshold, but it looks like that threshold would be quite high in my
> environment - 30k files maybe?
>
> I will have a go at writing a full update hook, and (without knowing
> whether it will make sense from a performance perspective) I'd like to
> try to implement the "all-leading-dirs" logic in bash 4 using
> associative arrays, to remove the python dependency. If I make it work
> I'll post back here.
>
> This seems to cover what I needed icase pathspec magic for in ls-tree,
> without having to implement it - so thanks again!
>
> Tao
>
> On Fri, Oct 14, 2022 at 7:06 PM Elijah Newren <newren@xxxxxxxxx> wrote:
> >
> > On Fri, Oct 14, 2022 at 1:48 AM Tao Klerks <tao@xxxxxxxxxx> wrote:
> > >
> > > On Fri, Oct 14, 2022 at 9:41 AM Elijah Newren <newren@xxxxxxxxx> wrote:
> > > >
> > [...]
> > > > I don't see why you need to do full-tree with existing options, nor
> > > > why the ls-tree option you want would somehow make it easier to avoid.
> > > > I think you can avoid the full-tree search with something like:
> > > >
> > > > git diff --diff-filter=A --no-renames --name-only $OLDHASH $NEWHASH |
> > > > sed -e s%/[^/]*$%/% | uniq | xargs git ls-tree --name-only $NEWHASH |
> > > > \
> > > >    sort | uniq -i -d
> > > >
> > > > The final "sort | uniq -i -d" is taken from Torsten's suggestion.
> > > >
> > > > The git diff ... xargs git ls-tree section on the first line will
> > > > provide a list of all files (& subdirs) in the same directory as any
> > > > added file.  (Although, it has a blind spot for paths in the toplevel
> > > > directory.)
> > >
> > > The theoretical problem with this approach is that it only addresses
> > > case-insensitive-duplicate files, not directories.
> >
> > It'll catch some case-insensitive-duplicate directories too -- note
> > that I did call out that it'd print subdirs.  But to be more cautious,
> > you would need to carefully grab all leading directories of any added
> > path, not just the immediate leading directory.
> >
> > > Directories have been the problem, in "my" repo, around one-third of
> > > the time - typically someone does a directory rename, and someone else
> > > does a bad merge and reintroduces the old directory.
> > >
> > > That said, what "icase pathspec magic" actually *does*, is break down
> > > the pathspec into iteratively more complete paths, level by level,
> > > looking for case-duplicates at each level. That's something I could
> > > presumably do in shell scripting, collecting all the interesting
> > > sub-paths first, and then getting ls-tree to tell me about the
> > > immediate children for each sub-path, doing case-insensitive dupe
> > > searches across children for each of these sub-paths.
> > >
> > > ls-tree supporting icase pathspec magic would clearly be more
> > > efficient (I wouldn't need N ls-tree git processes, where N is the
> > > number of sub-paths in the diff), but this should be plenty efficient
> > > for normal commits, with a fallback to the full search
> > >
> > > This seems like a sensible direction, I'll have a play.
> >
> > If you create a script that gives you all leading directories of any
> > listed path (plus replacing the toplevel dir with ':/'), such as this
> > (which I'm calling 'all-leading-dirs.py'):
> >
> > """
> > #!/usr/bin/env python3
> >
> > import os
> > import sys
> >
> > paths = sys.stdin.read().splitlines()
> > dirs_seen = set()
> > for path in paths:
> >   dir = path
> >   while dir:
> >     dir = os.path.dirname(dir)
> >     if dir in dirs_seen:
> >       continue
> >     dirs_seen.add(dir)
> > if dirs_seen:
> >   # Replace top-level dir of "" with ":"; we'll add the trailing '/'
> > below when adding it to all other dirs
> >   dirs_seen.remove("")
> >   dirs_seen.add(':')
> >   for dir in dirs_seen:
> >     print(dir+'/')  # ls-tree wants the trailing '/' if we are going
> > to list contents within that tree rather than just the tree itself
> > """
> >
> > Then the following will catch duplicates files and directories for you:
> >
> > git diff --diff-filter=A --no-renames --name-only HEAD~1 HEAD |
> > all-leading-dirs.py | xargs --no-run-if-empty git ls-tree --name-only
> > -t HEAD | sort | uniq -i -d
> >
> > and it no longer has problems catching duplicates in the toplevel
> > directory either.  It does have (at most) two git invocations, but
> > only one invocation of ls-tree.  Here's a test script to prove it
> > works:
> >
> > """
> > #!/bin/bash
> >
> > git init -b main nukeme
> > cd nukeme
> > mkdir -p dir1/subdir/whatever
> > mkdir -p dir2/subdir/whatever
> > >dir1/subdir/whatever/foo
> > >dir2/subdir/whatever/foo
> > git add .
> > git commit -m initial
> >
> > mkdir -p dir1/SubDir/whatever
> > >dir1/SubDir/whatever/foo
> > git add .
> > git commit -m stuff
> >
> > git diff --diff-filter=A --no-renames --name-only HEAD~1 HEAD |
> > all-leading-dirs.py | xargs --no-run-if-empty git ls-tree --name-only
> > -t HEAD | sort | uniq -i -d
> > """
> >
> > The output of this script is
> > """
> > dir1/subdir
> > """
> > which correctly notifies on the duplicate (dir1/SubDir being the
> > other; uniq is the one that picks which of the two duplicate names to
> > print)
Attachment:
case-insensitive-duplicate-check-update-hook.bash

Description: Binary data