Re: Possible bug in .gitignore

Jeff King <peff@xxxxxxxx> · Fri, 26 Jul 2024 01:26:21 -0400

On Thu, Jul 25, 2024 at 01:01:45PM +0900, KwonHyun Kim wrote:

> I am experimenting with git and I found there is something not working
> as explained in the documentation
> 
> When I place `text_[가나].txt` in `.gitignore` it does not ignore
> text_가.txt nor text_나.txt
> 
> I experimented with `text_[ab].txt` and it works fine.
> 
> So I thought it might work bytewise so I put
> `text_[\200-\352][\200-\352][\200-\352].txt` with no effect. (가 is
> "\352\260\200" when core.quotepath is set to true)
> 
> So I think it must be a bug that a pattern like [abc] or [a-z] does
> not accept non-ascii characters, but I am not sure.

The globbing in git is generally done by wildmatch.c, which was imported
from rsync. Looking in that file, it looks like it does not support
multi-byte characters at all inside brackets.

So I don't see a way to make it work except to place the _literal_ bytes
making up the utf8 sequence, each inside its own single-byte match.
Like:

  printf 'text_[\352\353][\260\202][\200\230].txt\n' >.gitignore

But then your .gitignore file is itself invalid utf8 (not to mention
that this is obviously something a user shouldn't have to do).
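
As a rough illustration of why the multi-byte bracket fails while the
byte-per-bracket form matches, here is a small Python sketch. It uses
Python's fnmatch on bytes purely as a stand-in for a matcher that treats
each bracket as one byte; it is not git's wildmatch.c, and the two may
differ in details. The helper names (octal, is_valid_utf8) are made up
for this example.

```python
import fnmatch


def octal(data: bytes) -> str:
    """Render bytes as backslash-octal escapes (the style quoted in the report)."""
    return "".join(f"\\{b:03o}" for b in data)


def is_valid_utf8(data: bytes) -> bool:
    """True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Escapes are used so the source does not depend on editor normalization.
ga = "\uac00".encode("utf-8")  # the first Hangul syllable from the report
na = "\ub098".encode("utf-8")  # the second Hangul syllable from the report

# Pattern as the user wrote it: both characters inside one bracket.
naive = "text_[\uac00\ub098].txt".encode("utf-8")
# Pattern from the printf above: one bracket per byte position.
bytewise = b"text_[\352\353][\260\202][\200\230].txt"

for ch in (ga, na):
    name = b"text_" + ch + b".txt"
    print(octal(ch),
          "naive:", fnmatch.fnmatchcase(name, naive),
          "bytewise:", fnmatch.fnmatchcase(name, bytewise))

print("naive is valid UTF-8:", is_valid_utf8(naive))
print("bytewise is valid UTF-8:", is_valid_utf8(bytewise))
```

In a byte-wise matcher the naive bracket is just a set of six bytes and
consumes only one of them, so the remaining two bytes of the character
never line up with ".txt". The byte-wise pattern is also looser than
intended, since it accepts mixed combinations such as \352\202\230.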

So I guess the fix would be to teach wildmatch.c to recognize and match
multi-byte sequences inside []. That probably requires that we assume
the pattern and the path are utf8, which will usually be true, but not
always. So we might need some kind of config switch there.
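
To make the idea concrete, here is a minimal Python sketch of matching a
bracket set one UTF-8 sequence at a time, falling back to single bytes
when the data is not well-formed UTF-8. This is only an illustration of
the approach, not a proposed patch: the function names are hypothetical,
it handles plain member lists only (no ranges, negation, or named
classes), and the validity check is simplified (it does not reject
overlong forms or surrogates).

```python
def utf8_len(data: bytes, i: int) -> int:
    """Length of the UTF-8 sequence starting at data[i], or 1 if malformed."""
    lead = data[i]
    if lead < 0x80:
        n = 1
    elif 0xC2 <= lead <= 0xDF:
        n = 2
    elif 0xE0 <= lead <= 0xEF:
        n = 3
    elif 0xF0 <= lead <= 0xF4:
        n = 4
    else:
        return 1  # stray continuation or invalid lead byte: treat as one byte
    tail = data[i + 1:i + n]
    if len(tail) != n - 1 or any((b & 0xC0) != 0x80 for b in tail):
        return 1  # truncated or broken sequence: fall back to one byte
    return n


def split_units(data: bytes) -> list:
    """Split bytes into UTF-8 sequences (or single bytes where malformed)."""
    units, i = [], 0
    while i < len(data):
        n = utf8_len(data, i)
        units.append(data[i:i + n])
        i += n
    return units


def class_matches(members: bytes, path: bytes, pos: int) -> int:
    """Bytes of path consumed if the unit at pos is in the set, else 0."""
    if pos >= len(path):
        return 0
    n = utf8_len(path, pos)
    return n if path[pos:pos + n] in split_units(members) else 0


members = "\uac00\ub098".encode("utf-8")      # contents of the bracket
path = "text_\uac00.txt".encode("utf-8")
print(class_matches(members, path, 5))        # bytes consumed by the bracket
```

The single-byte fallback means non-UTF-8 patterns and paths would keep
behaving as they do today in this sketch, which is one possible way to
handle the "not always utf8" case without a config switch; whether that
is acceptable for git is a separate question.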

There is also probably a deep rabbit hole of corner cases there (e.g.,
NFD vs NFC, matching é versus "e" + combining accent). But I suspect
that even recognizing multi-byte sequences as a single char to match
would be a big improvement.
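
The NFC/NFD corner case can be seen with Python's standard unicodedata
module. This is just a sketch of the mismatch, not a suggestion for how
git should handle it: the same visible character has two different
encodings, so even a matcher that understands multi-byte sequences
would not equate them without a normalization step.

```python
import unicodedata

# Composed form: the single code point U+00E9.
composed = unicodedata.normalize("NFC", "e\u0301")
# Decomposed form: "e" followed by the combining acute accent U+0301.
decomposed = unicodedata.normalize("NFD", "\u00e9")

print(len(composed), composed.encode("utf-8"))
print(len(decomposed), decomposed.encode("utf-8"))

# The two strings render the same but compare unequal...
print(composed == decomposed)
# ...until both sides are brought to the same normalization form.
print(unicodedata.normalize("NFC", decomposed) == composed)
```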

-Peff