Re: [PATCH] rebase -p: avoid grep on potentailly non-ASCII data

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Torsten Bögershausen <tboegi@xxxxxx> writes:

> On 09.03.16 21:26, Junio C Hamano wrote:
>> Anders Kaseorg <andersk@xxxxxxx> writes:
> []
>>  sane_grep () {
>> -	GREP_OPTIONS= LC_ALL=C grep "$@"
>> +	GREP_OPTIONS= LC_ALL=C grep @@SANE_TEXT_GREP@@ "$@"
>>  }
>>  
>>  sane_egrep () {
>> -	GREP_OPTIONS= LC_ALL=C egrep "$@"
>> +	GREP_OPTIONS= LC_ALL=C egrep @@SANE_TEXT_GREP@@ "$@"
>>  }
>>
>
> I always wondered why we do LC_ALL=C.

It is because, at least back when we started scripted Porcelains of
Git, that is the only thing we could count on available everywhere,
and more importantly, that is how you disable all the i18n gotchas
and ask tools to use the traditional "a file stores a stream of
bytes" semantics.

> Isn't that begging for trouble, when we feed UTF-8, ISO-8895-1
> or other stuff into a program and say LC_ALL=C at the same time ?

Yes and no.  It is true that in "C" ("POSIX"), behaviour on anything
outside the portable and control character sets is undefined, so in
theory, it is bad that we relied on that undefined behaviour to be
to treat the input as just a stream of bytes.  But at the same time,
it is no worse than letting the user's locale take effect or use
hardcoded en_US.UTF-8 and passing unknown end user data that could
be in latin1 and other stuff.  And sensible implementors of "C"
locale seemed to choose the "stream of bytes" back when fa9d3485
(git-am: force egrep to use correct characters set, 2009-09-25) was
written, which is where LC_ALL=C comes from.  I wasn't around when
that patch was accepted, so I cannot quite tell if it was done to
fix a reported bug (i.e. it would reintroduce the bug if we lost
LC_ALL=C) or it was done "just because".

Back when e1622bfc (Protect scripted Porcelains from GREP_OPTIONS
insanity, 2009-11-23) introduced sane_grep, the only caller of it
that wanted to use LC_ALL=C was this callsite in "git am", and we no
longer have that callsite since 783d7e86 (builtin-am: remove
redirection to git-am.sh, 2015-08-04), so I think whatever fa9d3485
wanted to do will not be broken if we removed LC_ALL=C from
sane_grep.  It however is possible that any callsites of sane_grep
added after e1622bfc implicitly depended on having LC_ALL=C in the
implementation.

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]