On Thu, Oct 13, 2022 at 08:35:11AM +0200, Tao Klerks wrote:
> On Sun, Oct 2, 2022 at 9:07 PM brian m. carlson
> <sandals@xxxxxxxxxxxxxxxxxxxx> wrote:
> >
> > On 2022-09-30 at 13:53:16, Ævar Arnfjörð Bjarmason wrote:
> > > You might find ASCII-only sufficient, but note that even if you get this
> > > working you won't catch the more complex Unicode normalization rules
> > > various filesystems perform, see the fsck code we carefully crafted to
> > > make sure we don't get something those FS's will mistake for a ".git"
> > > directory in-tree.
> >
> > What's even worse is that different OSes case-fold differently and the
> > behaviour differs based on the version of the OS that formatted the file
> > system (which is of course not exposed to userspace), so in general it's
> > impossible to know exactly how case folding works on a particular
> > system.
> >
> > It might be possible to implement some general rules that are
> > overzealous (in that they will catch patterns that will case-fold on
> > _some_ system), but in general this is very difficult. The rules will
> > also almost certainly change with newer versions of Unicode.
> >
> > I'll also point out that there is no locale-independent way to correctly
> > case-fold Unicode text. Correct case-folding is sensitive to the
> > language, script, and region.
>
> Thanks for the feedback!
>
> If I'm understanding correctly, both of these responses were targeted
> specifically at my motivation/usecase (preventing the submission of
> case-insensitively duplicate files into a repository) rather than the
> question of whether anyone has worked or is working on anything
> relevant to adding icase pathspec magic support to ls-tree.
>
> I understand that case-folding is a complex topic, and doing it
> correctly in some universal sense is undoubtedly beyond me - but "my"
> context certainly does not require a high standard of correctness:
> There's a repo shared by some 1000 engineers, 200k files, lots of
> activity, three different OSes of which two default to
> case-insensitive filesystems, and every once in a while a user on
> linux creates a case-insensitive-duplicate file with differing
> content, which causes git on case-insensitive filesystems to lose the
> plot (you end up with one file's content looking like changes to the
> other file's content - "ghost changes" that appear as soon as you
> check out, that prevent you from doing a simple "pull", and that you
> just can't reset).
>
> I don't imagine I can make a perfectly correct and universal fix to
> this, but with case-insensitive matching on ls-tree in an update hook
> I believe I could reduce the frequency of this already-infrequent
> issue by at least 1000X, which would suit my purposes just fine. In my
> case filenames are mostly ansi-based, and I don't expect we've ever
> had Turkish filenames (turkish "i" being the most famous case-folding
> gotcha I think?).
>
> Coming at this from another angle, I guess we could teach git on
> case-insensitive filesystems to detect this situation (where two files
> in the index, with different contents, are pointing to the exact same
> filesystem file) and more explicitly warn the user of what's wrong,
> giving them clear help on how to fix it? And temporarily exclude those
> two files from its change reconciliation processes altogether to avoid
> ghost changes interfering with recovery actions like "pull"? Certainly
> that would be better than the current "ghost changes" behavior... but
> it would still be far less convenient than preventing (the vast
> majority of) these issues altogether, be that with a custom hook or a
> core option prohibiting clearly case-insensitive-duplicate files from
> being pushed.
>
> By the time a case-insensitive-FS-user encounters this issue in their
> repo as they checkout or clone, it's likely that the problem is in
> master/main and others are already affected, and both the cycle time
> to fixing the issue, and the communication impact in the current state
> ("please wait, the issue is being addressed, once the remote branch is
> fixed here's what you'll do to 'pull' successfully in spite of the
> local repo thinking there are filesystem changes that really don't
> exist and can't be reset") are... suboptimal.
>
> It feels like adding case-insensitivity pathspec magic support to
> ls-tree (however reliable or universal the subsequent
> duplicate-detection is or isn't) *should* be much simpler than what it
> would have taken to support it in ls-files in the first place - but at
> a glance, I see the official pathspec-supporting function
> "match_pathspec()" is deep in index-land, with an "index_state"
> structure being passed around all over the place. If it really was
> easy, someone would already have done it I guess? :)
>
> I don't see this being something I can take on in my spare time, so
> for now I suspect I'll have to do a full-tree duplicate-file-search on
> every ref update, and simply accept the 1-second update hook
> processing time/delay per pushed ref :(
>
> I'm assuming the "ghost changes" behavior I allude to here (where two
> different files in the index, with different contents, point to the
> same single file in the case-insensitive filesystem, and one or the
> other index file appears modified / the working tree looks "dirty") is
> a known issue, but if there's any value in my opening a thread more
> clearly/explicitly about this behavior, please let me know.
>
> Thanks,
> Tao

Thanks for sharing your experience in detail.
Have you considered writing a shell script that can detect
icase collisions?
For example, we can use Linux:

  git ls-files | tr 'A-Z' 'a-z' | sort | uniq -d ; echo $?
  include/uapi/linux/netfilter_ipv4/ipt_ecn.h
  include/uapi/linux/netfilter_ipv4/ipt_ttl.h
  [snip the other files]

The GNU version of uniq allows an even shorter command
(the POSIX version has no -i option):

  git ls-files | sort -f | uniq -i -d

(sort -f keeps names that differ only in case next to each other;
uniq compares adjacent lines only.)

I think that a script like this could do the trick:

  #!/bin/sh
  ret=1
  >/tmp/$$-exp
  git ls-files | sort -f | uniq -i -d >/tmp/$$-act &&
    cmp /tmp/$$-exp /tmp/$$-act &&
    ret=0
  rm -f /tmp/$$-exp /tmp/$$-act
  exit $ret

####################
The use of files in /tmp is debatable; I just want to illustrate
how a shell script combined with existing commands can be used.

The biggest step may be introducing a server-side hook that does the check.
But once that is done and working, you will probably not want to be without it.
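
For the server-side case, the same idea can probably be applied to the
tree of the pushed commit instead of the index. What follows is only a
rough sketch, not a tested hook: the function names icase_dups and
check_rev are made up for illustration, the folding is ASCII-only (as
with the tr command above), and it assumes path names without embedded
newlines.

```shell
#!/bin/sh
# Sketch only. icase_dups and check_rev are hypothetical names.

# Read path names on stdin, one per line, and print every
# lower-cased name that occurs more than once (ASCII folding only).
icase_dups() {
    tr 'A-Z' 'a-z' | LC_ALL=C sort | uniq -d
}

# Check the full tree of revision $1 for case-insensitive duplicates.
# Prints the colliding names to stderr and returns 1 if any are found.
check_rev() {
    dups=$(git ls-tree -r --name-only "$1" | icase_dups)
    if [ -n "$dups" ]; then
        echo "case-insensitive duplicate paths:" >&2
        echo "$dups" >&2
        return 1
    fi
    return 0
}
```

An update hook receives the ref name, the old object name and the new
object name as its three arguments, so it could end with something like
check_rev "$3" || exit 1. A ref deletion passes an all-zero new object
name, which the hook would have to skip before calling check_rev.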