Re: [PATCH v4] clone: report duplicate entries on case-insensitive filesystems

Torsten Bögershausen <tboegi@xxxxxx> · Thu, 16 Aug 2018 16:03:12 +0200

On Wed, Aug 15, 2018 at 12:38:49PM -0700, Junio C Hamano wrote:

This should answer Duys comments as well.
> Torsten Bögershausen <tboegi@xxxxxx> writes:
> 
[snip]
> > Should the following be protected by core.checkstat ? 
> > 	if (check_stat) {
> 
> I do not think such a if statement is strictly necessary.
> 
> Even if check_stat tells us "when checking if a cached stat
> information tells us that the path may have modified, use minimum
> set of fields from the 'struct stat'", we still capture and update
> the values from the same "full" set of fields when we mark a cache
> entry up-to-date.  So it all depends on why you are limiting with
> check_stat.  Is it because stdev is unusable?  Is it because nsec is
> unusable?  Is it because ino is unusable?  Only in the last case,
> paying attention to check_stat will reduce the false positive.
> 
> But then you made me wonder what value check_stat has on Windows.
> If it is false, perhaps we do not even need the conditional
> compilation, which is a huge plus.

Agreed:
check_stat is 0 on Windows, and inum is allways 0 in lstat().
I was thinking about systems which don't have inodes and inum,
and then generate an inum in memory, sometimes random.
After a reboot or a re-mount of the file systems those ino values
change.
However, for the initial clone we are fine in any case.

> 
> >> +		if (dup->ce_stat_data.sd_ino == st->st_ino) {
> >> +			dup->ce_flags |= CE_MATCHED;
> >> +			break;
> >> +		}
> >> +	}
> >> +#endif
> >
> > Another thing is that we switch of the ASCII case-folding-detection-logic
> > off for Windows users, even if we otherwise rely on icase.
> > I think we can use fspathcmp() as a fallback. when inodes fail,
> > because we may be on a network file system.
> >
> > (I don't have a test setup at the moment, but what happens with inodes
> > when a Windows machine exports a share to Linux or Mac ?)
> >
> > Is there a chance to get the fspathcmp() back, like this ?
> 
> If fspathcmp() never gives false positives, I do not think we would
> mind using it like your update.  False negatives are fine, as that
> is better than just punting the whole thing when there is no usable
> inum.  And we do not care all that much if it is more expensive;
> this is an error codepath after all.
> 
> And from code structure's point of view, I think it makes sense.  It
> would be even better if we can lose the conditional compilation.

The current implementation of fspathcmp() does not give false positvies,
and future versions should not either.
All case-insentive file systems have always treated 'a-z' equal to 'A-Z'.
In FAT MS/DOS there had only been uppercase letters as file names,
and `type file.txt` (the equivilant to ´cat file.txt´ in *nix)
simply resultet in `type FILE.TXT`
Later, with VFAT and later with HPFS/NTFS a file could be stored on
disk as "File.txt".
>From now on  ´type FILE.TXT´ still worked, (and all other upper-lowercase
combinations).
This all is probably nothing new.
The main point should be that fspathcmp() should never return a false positive,
and I think we all agree on that. 

Now back to the compiler switch:
Windows always set inum to 0 and I can't think about a situation where
a file in a working tree gets inum = 0, can we use the following:

static void mark_colliding_entries(const struct checkout *state,
				   struct cache_entry *ce, struct stat *st)
{
	int i;
	ce->ce_flags |= CE_MATCHED;

	for (i = 0; i < state->istate->cache_nr; i++) {
		struct cache_entry *dup = state->istate->cache[i];
		int folded = 0;

		if (dup == ce)
			break;

		if (dup->ce_flags & (CE_MATCHED | CE_VALID | CE_SKIP_WORKTREE))
			continue;
		/*
		 * Windows sets ino to 0. On other FS ino = 0 will already be
		 *  used, so we don't see it for a file in a Git working tree
		 */
		if (st->st_ino && (dup->ce_stat_data.sd_ino == st->st_ino))
			folded = 1;

		/*
		 * Fallback for NTFS and other case insenstive FS,
		 * which don't use POSIX inums
		 */
		if (!fspathcmp(dup->name, ce->name))
			folded = 1;

		if (folded) {
			dup->ce_flags |= CE_MATCHED;
			break;
		}
	}
}

> 
> Another thing we maybe want to see is if we can update the caller of
> this function so that we do not overwrite the earlier checkout with
> the data for this path.  When two paths collide, we check out one of
> the paths without reporting (because we cannot notice), then attempt
> to check out the other path and report (because we do notice the
> previous one with lstat()).  The current code then goes on and overwrites
> the file with the contents from the "other" path.
> 
> Even if we had false negative in this loop, if we leave the contents
> for the earlier path while reporting the "other" path, then the user
> can get curious, inspect what contents the "other" path has on the
> filesystem, and can notice that it belongs to the (unreported--due
> to false negative) earlier path.
> 
[snip]