On Wed, Aug 08, 2018 at 03:48:04PM -0400, Jeff Hostetler wrote:

> > ce_match_stat() may not be a very good measure to see if two paths
> > refer to the same file, though. After a fresh checkout, I would not
> > be surprised if two completely unrelated paths have the same size
> > and have same mtime/ctime. In its original use case, i.e. "I have
> > one specific path in mind and took a snapshot of its metadata
> > earlier. Is it still the same, or has it been touched?", that may
> > be sufficient to detect that the path has been _modified_, but
> > without reliable inum, it may be a poor measure to say two paths
> > refer to the same.
>
> I agree with Junio on this one. The mtime values are sloppy at best.
> On FAT file systems, they have 2 second resolution. Even NTFS IIRC
> has only 100ns resolution, so we might get a lot of false matches
> using this technique, right?

Yeah, I think anything less than inode (or some system equivalent) is
going to be too flaky.

> It might be better to build an equivalence-class hash-map for the
> colliding entries. Compute a "normalized" version of the pathname
> (such as convert to lowercase, strip final-dots/spaces, strip the
> digits following tilda of a shortname, and etc for the MAC's UTF-isms).
> Then when you rescan the index entries to find the matches, apply the
> equivalence operator on the pathname and do the hashmap lookup.
> When you find a match, you have a "potential" collider pair (I say
> potential only because of the ambiguity of shortnames). Then we
> can use inum/file-index/whatever to see if they actually collide.

I think we really want to avoid doing that normalization ourselves if we
can. There are just too many filesystem-specific rules.

If we have an equivalence-class hashmap and feed it inodes (or again,
some system equivalent) as the keys, we should get buckets of
collisions. I started to write a "something like this..." earlier, but
got bogged down in boilerplate around the C hashmap.

But here it is in perl. ;)

-- >8 --
# pretend we have these paths in our index
paths='foo FOO and some other paths'

# create them; this will make a single path on a case-insensitive system
for i in $paths; do
	echo $i >$i
done

# now find the duplicates
perl -le '
	for my $path (@ARGV) {
		# this would be an ntfs unique-id on Windows
		my $inode = (lstat($path))[1];
		push @{$h{$inode}}, $path;
	}
	for my $group (grep { @$_ > 1 } values(%h)) {
		print "group:";
		print " ", $_ for (@$group);
	}
' $paths
-- >8 --

which should show the obvious pair (it does for me on vfat-on-linux,
though where it gets those inodes from, I have no idea ;) ).

-Peff
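
The same bucketing can also be sketched in Python, for anyone who wants
to poke at it without perl. This is only an illustration under stated
assumptions: group_by_inode is a hypothetical name and not anything in
git, and it keys on the (st_dev, st_ino) pair rather than the inode
alone, on the assumption that the paths might live on more than one
device.

```python
import os
import sys
from collections import defaultdict


def group_by_inode(paths):
    """Return lists of paths that lstat() reports as the same file.

    Hypothetical helper for illustration; not part of git.
    """
    buckets = defaultdict(list)
    for path in paths:
        st = os.lstat(path)
        # Assumption: inode numbers are only unique within one device,
        # so the bucket key is the (device, inode) pair.
        buckets[(st.st_dev, st.st_ino)].append(path)
    # Only buckets holding more than one path are collisions.
    return [group for group in buckets.values() if len(group) > 1]


if __name__ == "__main__":
    for group in group_by_inode(sys.argv[1:]):
        print("group:")
        for path in group:
            print(" ", path)
```

Run with the paths from the script above as arguments, it should report
foo and FOO as one group on a case-insensitive filesystem. Hard links
land in the same bucket too, which is equally true of the perl version.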