On Mon, Feb 11, 2013 at 3:05 PM, Matthieu Moy <Matthieu.Moy@xxxxxxxxxxxxxxx> wrote:
> Erik Faye-Lund <kusmabite@xxxxxxxxx> writes:
>
>> But isn't UTF-8 constructed to be very unlikely to clash with existing
>> encodings? If so, I could add a case for non-ascii and non-UTF-8, that
>> simply writes the byte as a hex-tuple?
>
> If it's non-ascii and non-UTF-8, I think you'd want to display the byte
> as it is, because this is how it was entered. IOW, I'd say we should
> keep the current behavior in this case.
>

Yes, you are of course right. We should detect UTF-8, and only do
anything special in those cases. The likely alternatives are other
8-bit encodings (which the terminal should already grok, since the
user was able to input them), or other multi-byte encodings (which are
already broken, and are tricky to handle). So at least we'd only break
in very unlikely cases.

But, I wonder, could mbrlen be used to detect the length instead? It
consults LC_CTYPE to find out what encoding to use, which seems like it
might give the correct answer in all non-corrupted cases... I'm far from
an expert on UNIX internationalization, though. And this approach is
likely to break on Windows, but I suspect that we can perform some
well-placed hack for it, as we already know that we're doing UTF-8
there.
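
To make the two options concrete, here is a rough sketch of what I have in mind. Neither function exists in git; utf8_seq_len and locale_char_len are names made up for illustration. The first checks by hand whether the bytes at a position form a well-formed UTF-8 sequence, so anything else can be left alone and shown as it was entered. The second asks mbrlen, and therefore depends on whatever LC_CTYPE says.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper: length in bytes of the well-formed UTF-8 sequence
 * starting at s (n bytes available), or 0 if there is none. A return of 0
 * means "not UTF-8, keep the current behavior for this byte".
 */
static size_t utf8_seq_len(const unsigned char *s, size_t n)
{
	size_t len, i;
	unsigned long cp;

	assert(s != NULL);
	if (n == 0)
		return 0;
	if (s[0] < 0x80)
		return 1; /* plain ASCII */
	else if ((s[0] & 0xe0) == 0xc0) { len = 2; cp = s[0] & 0x1f; }
	else if ((s[0] & 0xf0) == 0xe0) { len = 3; cp = s[0] & 0x0f; }
	else if ((s[0] & 0xf8) == 0xf0) { len = 4; cp = s[0] & 0x07; }
	else
		return 0; /* stray continuation byte or invalid lead byte */

	if (n < len)
		return 0; /* truncated sequence */
	for (i = 1; i < len; i++) {
		if ((s[i] & 0xc0) != 0x80)
			return 0; /* not a continuation byte */
		cp = (cp << 6) | (s[i] & 0x3f);
	}
	/* reject overlong forms, surrogates and values above U+10FFFF */
	if ((len == 2 && cp < 0x80) ||
	    (len == 3 && cp < 0x800) ||
	    (len == 4 && cp < 0x10000) ||
	    (cp >= 0xd800 && cp <= 0xdfff) ||
	    cp > 0x10ffff)
		return 0;
	return len;
}

/*
 * Hypothetical alternative: let mbrlen decide, using the encoding selected
 * by LC_CTYPE. Returns 0 for an invalid or incomplete sequence.
 */
static size_t locale_char_len(const char *s, size_t n)
{
	mbstate_t st;
	size_t r;

	assert(s != NULL);
	memset(&st, 0, sizeof(st)); /* start in the initial shift state */
	r = mbrlen(s, n, &st);
	if (r == (size_t)-1 || r == (size_t)-2)
		return 0; /* invalid or incomplete */
	return r ? r : 1; /* mbrlen returns 0 for a NUL byte */
}
```

Note that locale_char_len only follows the user's locale if the program has called setlocale(LC_CTYPE, "") first; otherwise the "C" locale applies. The hand-rolled check needs no such setup, which might also make it the simpler route on Windows.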