Re: icase pathspec magic support in ls-tree

Elijah Newren <newren@xxxxxxxxx> · Fri, 14 Oct 2022 10:06:23 -0700

On Fri, Oct 14, 2022 at 1:48 AM Tao Klerks <tao@xxxxxxxxxx> wrote:
>
> On Fri, Oct 14, 2022 at 9:41 AM Elijah Newren <newren@xxxxxxxxx> wrote:
> >
[...]
> > I don't see why you need to do full-tree with existing options, nor
> > why the ls-tree option you want would somehow make it easier to avoid.
> > I think you can avoid the full-tree search with something like:
> >
> > git diff --diff-filter=A --no-renames --name-only $OLDHASH $NEWHASH |
> > sed -e s%/[^/]*$%/% | uniq | xargs git ls-tree --name-only $NEWHASH | \
> >    sort | uniq -i -d
> >
> > The final "sort | uniq -i -d" is taken from Torsten's suggestion.
> >
> > The git diff ... xargs git ls-tree section on the first line will
> > provide a list of all files (& subdirs) in the same directory as any
> > added file.  (Although, it has a blind spot for paths in the toplevel
> > directory.)
>
> The theoretical problem with this approach is that it only addresses
> case-insensitive-duplicate files, not directories.

It'll catch some case-insensitive-duplicate directories too -- note
that I did call out that it'd print subdirs.  But to be more cautious,
you would need to carefully grab all leading directories of any added
path, not just the immediate leading directory.

> Directories have been the problem, in "my" repo, around one-third of
> the time - typically someone does a directory rename, and someone else
> does a bad merge and reintroduces the old directory.
>
> That said, what "icase pathspec magic" actually *does*, is break down
> the pathspec into iteratively more complete paths, level by level,
> looking for case-duplicates at each level. That's something I could
> presumably do in shell scripting, collecting all the interesting
> sub-paths first, and then getting ls-tree to tell me about the
> immediate children for each sub-path, doing case-insensitive dupe
> searches across children for each of these sub-paths.
>
> ls-tree supporting icase pathspec magic would clearly be more
> efficient (I wouldn't need N ls-tree git processes, where N is the
> number of sub-paths in the diff), but this should be plenty efficient
> for normal commits, with a fallback to the full search
>
> This seems like a sensible direction, I'll have a play.

If you create a script that gives you all leading directories of any
listed path (plus replacing the toplevel dir with ':/'), such as this
(which I'm calling 'all-leading-dirs.py'):

"""
#!/usr/bin/env python3

import os
import sys

paths = sys.stdin.read().splitlines()
dirs_seen = set()
for path in paths:
  dir = path
  while dir:
    dir = os.path.dirname(dir)
    if dir in dirs_seen:
      continue
    dirs_seen.add(dir)
if dirs_seen:
  # Replace top-level dir of "" with ":"; we'll add the trailing '/'
  # below when adding it to all other dirs
  dirs_seen.remove("")
  dirs_seen.add(':')
  for dir in dirs_seen:
    # ls-tree wants the trailing '/' if we are going to list contents
    # within that tree rather than just the tree itself
    print(dir+'/')
"""

Then the following will catch duplicate files and directories for you:

git diff --diff-filter=A --no-renames --name-only HEAD~1 HEAD |
  all-leading-dirs.py |
  xargs --no-run-if-empty git ls-tree --name-only -t HEAD |
  sort | uniq -i -d

and it no longer has problems catching duplicates in the toplevel
directory either.  It does have (at most) two git invocations, but
only one invocation of ls-tree.  Here's a test script to prove it
works:

"""
#!/bin/bash

git init -b main nukeme
cd nukeme
mkdir -p dir1/subdir/whatever
mkdir -p dir2/subdir/whatever
>dir1/subdir/whatever/foo
>dir2/subdir/whatever/foo
git add .
git commit -m initial

mkdir -p dir1/SubDir/whatever
>dir1/SubDir/whatever/foo
git add .
git commit -m stuff

git diff --diff-filter=A --no-renames --name-only HEAD~1 HEAD |
  all-leading-dirs.py |
  xargs --no-run-if-empty git ls-tree --name-only -t HEAD |
  sort | uniq -i -d
"""

The output of this script is
"""
dir1/subdir
"""
which correctly reports the duplicate (dir1/SubDir being the other;
uniq is the one that picks which of the two duplicate names to
print).
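
If seeing only one name from each colliding pair is inconvenient, the
final "sort | uniq -i -d" stage could be swapped for a small filter
that prints every member of each case-insensitive group.  What follows
is only a minimal sketch, not something from the thread above: the
script name 'case-dupes.py' and the function name are hypothetical,
and it assumes that Python's str.casefold() is a close enough stand-in
for whatever case folding the target filesystem really does.  Unlike
uniq, it also ignores exact repeats of the same name.

```python
#!/usr/bin/env python3
# case-dupes.py (hypothetical name): print every path that collides
# case-insensitively with a differently-spelled path read from stdin.

import sys
from collections import defaultdict


def case_insensitive_dupes(paths):
    """Return sorted groups of distinct paths that match after case folding."""
    groups = defaultdict(set)
    for path in paths:
        # Assumption: casefold() approximates the filesystem's notion of
        # "same name"; real filesystems may fold case differently.
        groups[path.casefold()].add(path)
    # Keep only groups with more than one distinct spelling.
    return [sorted(names) for _, names in sorted(groups.items())
            if len(names) > 1]


if __name__ == "__main__":
    for group in case_insensitive_dupes(sys.stdin.read().splitlines()):
        print("\n".join(group))
```

Given the three names dir1/subdir, dir1/SubDir and dir2/subdir on
stdin, it prints dir1/SubDir and dir1/subdir, and leaves dir2/subdir
out because it only collides with names in a different directory.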