Re: Source character set detection

Ian Lance Taylor <iant@xxxxxxxxxx> · Wed, 30 Jun 2010 17:57:19 -0700

gcc-help@xxxxxxxxx writes:

> When I compile this code:
>
> #include <stdio.h>
>
> int main(void)
> {
>     char c = '?;                       /* ISO-8859-1 0xFC */
>
>     printf("%c\n", c);
>
>     return 0;
> }

This source file did not make it through the mail system intact.  I
assume you meant it to be
    char c = 'X';
where in the source file X is the single byte 0xfc.

> with gcc 4.3.2 under Linux with the locale specifying UTF-8 encoding,
> but the source file having ISO-8859-1 encoding, I don't get any
> diagnostics, and the output of the printf is a binary 0xFC.  I get the
> same results if I compile with
>
> -finput-charset=iso8859-1 -fexec-charset=iso8859-1
>
> or
>
> -finput-charset=utf-8 -fexec-charset=utf-8

The compiler does not attempt to validate the contents of a character
constant or string.  In both of these cases the input and execution
character sets are the same, so you have not asked for any character set
conversion, and the compiler has not applied any.
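
To make "character set conversion" concrete, here is a minimal sketch of
what converting ISO-8859-1 bytes to UTF-8 involves.  This is only an
illustration, not gcc's code; the function name latin1_to_utf8 is made
up for this example.  The single byte 0xfc becomes the two bytes 0xc3
0xbc.

```c
#include <stddef.h>

/* Hypothetical helper, for illustration only (not part of gcc).
   Converts len bytes of ISO-8859-1 text in `in` to UTF-8 in `out`.
   The caller must provide room for 2 * len bytes in `out`.
   Returns the number of bytes written. */
size_t latin1_to_utf8(const unsigned char *in, size_t len,
                      unsigned char *out)
{
    size_t i, n = 0;

    for (i = 0; i < len; i++) {
        if (in[i] < 0x80) {
            /* ASCII is encoded identically in both character sets. */
            out[n++] = in[i];
        } else {
            /* U+0080..U+00FF take two bytes: 110000xx 10xxxxxx. */
            out[n++] = (unsigned char)(0xC0 | (in[i] >> 6));
            out[n++] = (unsigned char)(0x80 | (in[i] & 0x3F));
        }
    }
    return n;
}
```

When the input and execution character sets are the same, no step like
this is needed, so the byte 0xfc reaches the output unchanged.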

> My understanding is that gcc should default to UTF-8 source encoding,
> and should give a diagnostic when it encounters the illegal UTF-8 start
> byte of 0xFC.

That is not how the compiler works today.  By default, the compiler
applies no conversion.  That is, it assumes that your input is valid,
and it takes it unchanged.
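
If you want that check, you can run it on the file yourself before
compiling.  The following is a rough sketch of a UTF-8 well-formedness
test along the lines of RFC 3629, under which 0xfc can never appear in
valid UTF-8.  The function name is_valid_utf8 is invented for this
example, and this is not something gcc provides.

```c
#include <stddef.h>

/* Hypothetical helper, for illustration only (not part of gcc).
   Returns 1 if buf[0..len) is well-formed UTF-8, 0 otherwise. */
int is_valid_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0, k, need;
    unsigned long cp, min;

    while (i < len) {
        unsigned char b = buf[i];

        if (b < 0x80) {                      /* plain ASCII */
            i++;
            continue;
        } else if (b >= 0xC2 && b <= 0xDF) { /* 2-byte sequence */
            need = 1; cp = b & 0x1F; min = 0x80;
        } else if (b >= 0xE0 && b <= 0xEF) { /* 3-byte sequence */
            need = 2; cp = b & 0x0F; min = 0x800;
        } else if (b >= 0xF0 && b <= 0xF4) { /* 4-byte sequence */
            need = 3; cp = b & 0x07; min = 0x10000;
        } else {
            /* 0x80..0xC1 and 0xF5..0xFF (including 0xFC) can never
               start a sequence. */
            return 0;
        }

        if (len - i - 1 < need)              /* truncated sequence */
            return 0;

        for (k = 1; k <= need; k++) {
            unsigned char c = buf[i + k];
            if ((c & 0xC0) != 0x80)          /* not a continuation byte */
                return 0;
            cp = (cp << 6) | (c & 0x3F);
        }

        /* Reject overlong forms, surrogates, and values past U+10FFFF. */
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;

        i += need + 1;
    }
    return 1;
}
```

Reading the source file into a buffer and passing it to a function like
this would flag the lone 0xfc byte, at the cost of an extra pass over
the input.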

> I get the expected diagnostic if I compile with
>
> -finput-charset=utf-8 -fexec-charset=iso8859-1
>
> (converting to execution character set: Invalid argument)

There you go.

Most people value compilation speed.  Most people do not write invalid
character strings in their programs.  So overall I think gcc is making a
sensible choice in not bothering to validate character strings in the
input file.

Ian