On 8/7/2018 3:31 PM, Junio C Hamano wrote:
> Nguyễn Thái Ngọc Duy <pclouds@xxxxxxxxx> writes:
>
>> One nice thing about this is we don't need platform specific code for
>> detecting the duplicate entries. I think ce_match_stat() works even
>> on Windows. And it's now equally expensive on all platforms :D
>
> ce_match_stat() may not be a very good measure to see if two paths
> refer to the same file, though. After a fresh checkout, I would not
> be surprised if two completely unrelated paths have the same size
> and have same mtime/ctime. In its original use case, i.e. "I have
> one specific path in mind and took a snapshot of its metadata
> earlier. Is it still the same, or has it been touched?", that may
> be sufficient to detect that the path has been _modified_, but
> without reliable inum, it may be a poor measure to say two paths
> refer to the same.

I agree with Junio on this one. The mtime values are sloppy at best.
On FAT file systems, they have 2 second resolution. Even NTFS IIRC
has only 100ns resolution, so we might get a lot of false matches
using this technique, right?
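
To make the FAT case concrete, here is a small sketch in Python (not
git code). The `stat_key` helper and the sample sizes and timestamps
are made up for illustration; it only mimics a size+mtime comparison
and is not what ce_match_stat() actually does.

```python
def stat_key(size, mtime_ns, resolution_ns):
    """Return (size, mtime) as a file system with the given timestamp
    resolution would record it. Hypothetical helper, not a git API."""
    return (size, mtime_ns - mtime_ns % resolution_ns)

FAT_RESOLUTION_NS = 2 * 1_000_000_000  # 2-second mtime resolution

# Two unrelated files of equal size, written half a second apart
# during the same checkout (made-up values).
file_a = (4096, 1_533_670_260_100_000_000)
file_b = (4096, 1_533_670_260_600_000_000)

# Both mtimes are truncated to the same 2-second tick, so the two
# unrelated files become indistinguishable by size+mtime alone.
looks_same_on_fat = (
    stat_key(*file_a, FAT_RESOLUTION_NS) == stat_key(*file_b, FAT_RESOLUTION_NS)
)
```

With a fresh checkout writing many files per second, any two files of
equal size that land in the same tick would compare equal this way.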
It might be better to build an equivalence-class hash-map for the
colliding entries. Compute a "normalized" version of the pathname
(such as converting to lowercase, stripping final dots/spaces, stripping
the digits following the tilde of a shortname, etc., plus the Mac's
UTF-isms).
Then when you rescan the index entries to find the matches, apply the
equivalence operator on the pathname and do the hashmap lookup.
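
A rough sketch of that normalize-and-group step, in Python rather than
git's C for brevity. The function names are hypothetical, the
normalization rules are only the ones listed above, and using NFC for
the Mac case is an assumption; a real implementation would have to
match each file system's actual folding rules.

```python
import re
import unicodedata
from collections import defaultdict

def normalize_component(name):
    """Fold one path component to a rough equivalence-class key."""
    # Assumption: NFC folds the composed/decomposed difference seen on
    # macOS; lower() stands in for the file system's case folding.
    key = unicodedata.normalize("NFC", name).lower()
    # Trailing dots and spaces are ignored on Windows.
    key = key.rstrip(". ")
    # Drop the digits after the tilde of a possible 8.3 shortname,
    # e.g. "progra~1" -> "progra~". Deliberately lossy.
    return re.sub(r"~\d+", "~", key)

def normalize_path(path):
    return "/".join(normalize_component(c) for c in path.split("/"))

def potential_colliders(paths):
    """Group paths by normalized key; keep groups with more than one member."""
    classes = defaultdict(list)
    for path in paths:
        classes[normalize_path(path)].append(path)
    return [group for group in classes.values() if len(group) > 1]

paths = ["Makefile", "makefile", "docs/README.", "docs/readme", "src/main.c"]
groups = potential_colliders(paths)
# groups == [["Makefile", "makefile"], ["docs/README.", "docs/readme"]]
```

Each group is only a candidate set; the lookup is a single pass over
the index entries instead of a pairwise comparison.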
When you find a match, you have a "potential" collider pair (I say
potential only because of the ambiguity of shortnames). Then we
can use inum/file-index/whatever to see if they actually collide.
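
The confirmation step could look roughly like this (again a Python
sketch with a hypothetical helper name, not git code). It assumes the
platform reports a usable device/inode pair; Python documents st_ino
as the file index on Windows, but whether that is reliable on every
file system is exactly the open question here.

```python
import os

def same_file(path_a, path_b):
    """True if both paths resolve to the same on-disk file."""
    try:
        st_a = os.lstat(path_a)
        st_b = os.lstat(path_b)
    except OSError:
        # A path that cannot be stat'ed cannot be confirmed as a collider.
        return False
    return (st_a.st_dev, st_a.st_ino) == (st_b.st_dev, st_b.st_ino)
```

Note that hard links would also compare equal, which should not
matter right after a fresh checkout.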
Jeff