Re: icase pathspec magic support in ls-tree

On Sun, Oct 2, 2022 at 9:07 PM brian m. carlson
<sandals@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> On 2022-09-30 at 13:53:16, Ævar Arnfjörð Bjarmason wrote:
> > You might find ASCII-only sufficient, but note that even if you get this
> > working you won't catch the more complex Unicode normalization rules
> > various filesystems perform, see the fsck code we carefully crafted to
> > make sure we don't get something those FS's will mistake for a ".git"
> > directory in-tree.
>
> What's even worse is that different OSes case-fold differently and the
> behaviour differs based on the version of the OS that formatted the file
> system (which is of course not exposed to userspace), so in general it's
> impossible to know exactly how case folding works on a particular
> system.
>
> It might be possible to implement some general rules that are
> overzealous (in that they will catch patterns that will case-fold on
> _some_ system), but in general this is very difficult.  The rules will
> also almost certainly change with newer versions of Unicode.
>
> I'll also point out that there is no locale-independent way to correctly
> case-fold Unicode text.  Correct case-folding is sensitive to the
> language, script, and region.

Thanks for the feedback!

If I'm understanding correctly, both of these responses were targeted
specifically at my motivation/use case (preventing the submission of
case-insensitively duplicate files into a repository) rather than the
question of whether anyone has worked or is working on anything
relevant to adding icase pathspec magic support to ls-tree.

I understand that case-folding is a complex topic, and doing it
correctly in some universal sense is undoubtedly beyond me - but "my"
context certainly does not require a high standard of correctness:
There's a repo shared by some 1000 engineers, 200k files, lots of
activity, three different OSes of which two default to
case-insensitive filesystems, and every once in a while a user on
Linux creates a case-insensitive-duplicate file with differing
content, which causes git on case-insensitive filesystems to lose the
plot (you end up with one file's content looking like changes to the
other file's content - "ghost changes" that appear as soon as you
check out, that prevent you from doing a simple "pull", and that you
just can't reset).

I don't imagine I can make a perfectly correct and universal fix to
this, but with case-insensitive matching on ls-tree in an update hook
I believe I could reduce the frequency of this already-infrequent
issue by at least 1000X, which would suit my purposes just fine. In my
case filenames are mostly ASCII, and I don't expect we've ever
had Turkish filenames (the Turkish "i" being the most famous
case-folding gotcha, I think?).
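
To make that "ASCII-only" level of correctness concrete, here is a
minimal sketch in Python. The helper name ascii_fold is hypothetical
(it is not anything in git); it lowercases only A-Z and leaves every
other character alone, so it is deliberately blind to the Turkish
dotted/dotless "i" and to any Unicode normalization a filesystem
might apply:

```python
def ascii_fold(path):
    """Lowercase only the ASCII letters A-Z; leave all other characters untouched.

    Hypothetical helper for illustration: this is the "good enough for
    mostly-ASCII filenames" comparison key, not a correct Unicode case fold.
    """
    return "".join(
        chr(ord(c) + 32) if "A" <= c <= "Z" else c
        for c in path
    )


def ascii_icase_equal(a, b):
    """True if two paths differ at most in the case of ASCII letters."""
    return ascii_fold(a) == ascii_fold(b)
```

With this key, "Docs/README.TXT" and "docs/readme.txt" compare equal,
while a name containing U+0130 (capital I with dot above) is passed
through unchanged and would only ever match itself.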

Coming at this from another angle, I guess we could teach git on
case-insensitive filesystems to detect this situation (where two files
in the index, with different contents, are pointing to the exact same
filesystem file) and more explicitly warn the user of what's wrong,
giving them clear help on how to fix it? And temporarily exclude those
two files from its change reconciliation processes altogether to avoid
ghost changes interfering with recovery actions like "pull"? Certainly
that would be better than the current "ghost changes" behavior... but
it would still be far less convenient than preventing (the vast
majority of) these issues altogether, be that with a custom hook or a
core option prohibiting clearly case-insensitive-duplicate files from
being pushed.

By the time a case-insensitive-FS-user encounters this issue in their
repo as they checkout or clone, it's likely that the problem is in
master/main and others are already affected, and both the cycle time
to fix the issue and the communication impact in the current state
("please wait, the issue is being addressed, once the remote branch is
fixed here's what you'll do to 'pull' successfully in spite of the
local repo thinking there are filesystem changes that really don't
exist and can't be reset") are... suboptimal.

It feels like adding case-insensitivity pathspec magic support to
ls-tree (however reliable or universal the subsequent
duplicate-detection is or isn't) *should* be much simpler than what it
would have taken to support it in ls-files in the first place - but at
a glance, I see the official pathspec-supporting function
"match_pathspec()" is deep in index-land, with an "index_state"
structure being passed around all over the place. If it really were
easy, someone would already have done it, I guess? :)

I don't see this being something I can take on in my spare time, so
for now I suspect I'll have to do a full-tree duplicate-file-search on
every ref update, and simply accept the 1-second update hook
processing time/delay per pushed ref :(
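
For what it's worth, the full-tree search could look roughly like the
sketch below. It assumes Python is available to the hook, and the
function names are hypothetical. It relies on "git ls-tree -r
--name-only -z" for the path list and on Python's str.casefold(),
which is a locale-independent Unicode fold and will not match any
particular filesystem's rules exactly:

```python
import subprocess
from collections import defaultdict


def parse_ls_tree_z(raw):
    """Split NUL-separated `git ls-tree --name-only -z` output into paths."""
    return [
        p.decode("utf-8", errors="surrogateescape")
        for p in raw.split(b"\0")
        if p
    ]


def find_case_duplicates(paths):
    """Group paths whose case-folded form is identical.

    Returns a list of groups (each a sorted list of two or more paths).
    Assumption: str.casefold() is an acceptable approximation of what
    the case-insensitive filesystems in question consider "the same name".
    """
    groups = defaultdict(list)
    for path in paths:
        groups[path.casefold()].append(path)
    return [sorted(g) for g in groups.values() if len(g) > 1]


def list_tree_paths(ref):
    """List every file path in the tree of `ref` (runs git; needs a repo)."""
    raw = subprocess.run(
        ["git", "ls-tree", "-r", "--name-only", "-z", ref],
        check=True,
        capture_output=True,
    ).stdout
    return parse_ls_tree_z(raw)
```

In an update hook the new object name arrives as the third argument,
so the check would be find_case_duplicates(list_tree_paths(new_sha)),
rejecting the push if the result is non-empty. Note that this compares
whole paths, so it also flags files that collide only because a parent
directory differs in case.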

I'm assuming the "ghost changes" behavior I allude to here (where two
different files in the index, with different contents, point to the
same single file in the case-insensitive filesystem, and one or the
other index file appears modified / the working tree looks "dirty") is
a known issue, but if there's any value in my opening a thread more
clearly/explicitly about this behavior, please let me know.

Thanks,
Tao
