Re: [PATCH v2] clone: report duplicate entries on case-insensitive filesystems

Jeff King <peff@xxxxxxxx> · Wed, 8 Aug 2018 18:31:39 -0400

On Wed, Aug 08, 2018 at 03:48:04PM -0400, Jeff Hostetler wrote:

> > ce_match_stat() may not be a very good measure to see if two paths
> > refer to the same file, though.  After a fresh checkout, I would not
> > be surprised if two completely unrelated paths have the same size
> > and the same mtime/ctime.  In its original use case, i.e. "I have
> > one specific path in mind and took a snapshot of its metadata
> > earlier.  Is it still the same, or has it been touched?", that may
> > be sufficient to detect that the path has been _modified_, but
> > without a reliable inum, it may be a poor measure to say two paths
> > refer to the same file.
> 
> I agree with Junio on this one.  The mtime values are sloppy at best.
> On FAT file systems, they have 2 second resolution.  Even NTFS IIRC
> has only 100ns resolution, so we might get a lot of false matches
> using this technique, right?

Yeah, I think anything less than the inode (or some system equivalent)
is going to be too flaky.

> It might be better to build an equivalence-class hash-map for the
> colliding entries.  Compute a "normalized" version of the pathname
> (such as convert to lowercase, strip final-dots/spaces, strip the
> digits following the tilde of a shortname, and so on for the Mac's
> UTF-isms).
> Then when you rescan the index entries to find the matches, apply the
> equivalence operator on the pathname and do the hashmap lookup.
> When you find a match, you have a "potential" collider pair (I say
> potential only because of the ambiguity of shortnames).  Then we
> can use inum/file-index/whatever to see if they actually collide.

I think we really want to avoid doing that normalization ourselves if we
can. There are just too many filesystem-specific rules.

If we have an equivalence-class hashmap and feed it inodes (or again,
some system equivalent) as the keys, we should get buckets of
collisions. I started to write a "something like this..." earlier, but
got bogged down in boilerplate around the C hashmap.

But here it is in perl. ;)

-- >8 --
# pretend we have these paths in our index
paths='foo FOO and some other paths'

# create them; this will make a single path on a case-insensitive system
for i in $paths; do
  echo $i >$i
done

# now find the duplicates
perl -le '
  for my $path (@ARGV) {
    # this would be an ntfs unique-id on Windows
    my $inode = (lstat($path))[1];
    push @{$h{$inode}}, $path;
  }

  for my $group (grep { @$_ > 1 } values(%h)) {
    print "group:";
    print "  ", $_ for (@$group);
  }
' $paths
-- >8 --

which should show the obvious pair (it does for me on vfat-on-linux,
though where it gets those inodes from, I have no idea ;) ).

-Peff