Re: [PATCH/RFC] parse-options: report invalid UTF-8 switches

Erik Faye-Lund <kusmabite@xxxxxxxxx> · Mon, 11 Feb 2013 15:27:19 +0100

On Mon, Feb 11, 2013 at 3:05 PM, Matthieu Moy
<Matthieu.Moy@xxxxxxxxxxxxxxx> wrote:
> Erik Faye-Lund <kusmabite@xxxxxxxxx> writes:
>
>> But isn't UTF-8 constructed to be very unlikely to clash with existing
>> encodings? If so, I could add a case for non-ascii and non-UTF-8, that
>> simply writes the byte as a hex-tuple?
>
> If it's non-ascii and non-UTF-8, I think you'd want to display the byte
> as it is, because this is how it was entered. IOW, I'd say we should
> keep the current behavior in this case.
>

Yes, you are of course right. We should detect UTF-8, and only do
anything special in those cases. The likely alternatives are
other 8-bit encodings (which the terminal should already grok, since
the user was able to input them), or other multi-byte encodings (which
are already broken, and are tricky to handle). So at least we'd only
break in very unlikely cases.
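
To make "detect UTF-8" concrete, here is a minimal sketch of a check that
returns the byte length of a well-formed UTF-8 sequence, or 0 so the caller
can fall back to printing the single byte as it was entered. The name
utf8_seq_len and its signature are made up for illustration; this is not
existing git code.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not part of git): return the length in bytes of
 * the well-formed UTF-8 sequence starting at s, or 0 if the bytes are
 * not valid UTF-8. At most n bytes are examined.
 */
size_t utf8_seq_len(const unsigned char *s, size_t n)
{
	size_t len, i;
	unsigned long cp;

	if (n == 0)
		return 0;
	if (s[0] < 0x80)
		return 1;	/* plain ASCII */

	if ((s[0] & 0xE0) == 0xC0) {
		len = 2;
		cp = s[0] & 0x1F;
	} else if ((s[0] & 0xF0) == 0xE0) {
		len = 3;
		cp = s[0] & 0x0F;
	} else if ((s[0] & 0xF8) == 0xF0) {
		len = 4;
		cp = s[0] & 0x07;
	} else {
		return 0;	/* stray continuation byte or invalid lead byte */
	}

	if (n < len)
		return 0;	/* sequence is truncated */

	for (i = 1; i < len; i++) {
		if ((s[i] & 0xC0) != 0x80)
			return 0;	/* not a continuation byte */
		cp = (cp << 6) | (s[i] & 0x3F);
	}

	/* reject overlong encodings, UTF-16 surrogates and out-of-range values */
	if ((len == 2 && cp < 0x80) ||
	    (len == 3 && cp < 0x800) ||
	    (len == 4 && cp < 0x10000))
		return 0;
	if (cp >= 0xD800 && cp <= 0xDFFF)
		return 0;
	if (cp > 0x10FFFF)
		return 0;

	return len;
}
```

A Latin-1 byte such as 0xE9 followed by ASCII looks like a three-byte lead
byte without continuation bytes, so the function returns 0 and the byte
would be shown unchanged, which is the current behavior.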

But, I wonder, could mbrlen be used to detect the length instead? It
consults LC_CTYPE to find out what encoding to use, which seems like
it might give the correct answer in all non-corrupted cases... I'm far
from an expert on UNIX-internationalization, though. And this approach
is likely to break on Windows, but I suspect that we can perform some
well-placed hack for it, as we already know that we're doing UTF-8
there.
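
Roughly what I have in mind, as an untested sketch: the helper name
locale_char_len is made up, and it assumes the program has already called
setlocale(LC_CTYPE, "") so that mbrlen follows the user's locale rather
than the default "C" locale.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper (not part of git): return the length in bytes of
 * the first character of s in the encoding selected by LC_CTYPE, or 0
 * if s starts with a NUL byte or with an invalid or incomplete sequence.
 * At most n bytes are examined.
 */
size_t locale_char_len(const char *s, size_t n)
{
	mbstate_t state;
	size_t len;

	/* start from the initial conversion state */
	memset(&state, 0, sizeof(state));

	len = mbrlen(s, n, &state);
	if (len == (size_t)-1 || len == (size_t)-2)
		return 0;	/* invalid (-1) or incomplete (-2) sequence */
	return len;
}
```

When this returns 0 for a non-NUL byte, the caller would just print that
one byte as it was entered.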