Re: How to compile c++ code without strip off utf-8 BOM?

Dario Saccavino <kathoum@xxxxxxxxx> · Wed, 18 Feb 2009 10:33:35 +0100

> Hi Tao Wang,
>
> My test.cpp source is UTF-8 with BOM.
>
> If I compile it like this...
>
> g++ -x c++ <(xxd -g 1 -s 3 test.cpp | xxd -g 1 -s -3 -r) -o a.out
>
> ... that strips out the first three bytes of the file.  For test.cpp, these happen to be the BOM (ef bb bf).
>
> You may want to create a little 'stripBOM' program that behaves like 'cat', but gobbles the BOM if present.
>
> Or you could use awk, sed, perl, or your favorite-text-munging-tool-of-choice to perform the same conversion.  I just used xxd because it was quick, for illustrative purposes.  (There's probably a more suitable unix tool than xxd for this kind of cat-with-offset, but you'd want something that filters out BOM rather than always offsetting.)
>
> HTH,
> --Eljay
>

Hi Eljay and Tao Wang,

I have experienced the same problem working in a multi-platform
environment with a shared repository.

In my case the source files have no BOM (they are stored on the server
using the Windows machines' native encoding), so my solution was to
add -finput-charset=WINDOWS-1252 to gcc's command line. Unfortunately,
it seems like iconv has no way to insert/remove the BOM, so Tao Wang
is out of luck.

Eljay's solution isn't always viable either: it only filters the file
given on the command line, so if the source file #includes a header
with a BOM, the compilation still fails.

I think there are two possible ways out:
1) Automatically execute a conversion command (like uconv
--remove-signature) at checkouts/commits
2) Install a modified libiconv with an additional character set "UTF8-BOM"
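
As a rough sketch of the kind of filter that option 1 could run on each
file (it is also the 'stripBOM' idea from Eljay's mail), something like
the following might do. The names strip_utf8_bom and strip_utf8_bom_copy
are made up for illustration; they are not part of gcc, iconv or ICU.

```cpp
#include <cstring>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical helper: copies `in` to `out`, dropping a leading
// UTF-8 BOM (bytes EF BB BF) if one is present.
// Returns true when a BOM was found and removed.
bool strip_utf8_bom(std::istream& in, std::ostream& out)
{
    static const char bom[3] = { '\xEF', '\xBB', '\xBF' };

    // Look at the first three bytes (fewer if the input is shorter).
    char head[3];
    in.read(head, 3);
    const std::streamsize got = in.gcount();

    const bool has_bom = (got == 3 && std::memcmp(head, bom, 3) == 0);
    if (!has_bom)
        out.write(head, got);   // not a BOM: pass the bytes through

    // Copy the rest of the input unchanged.
    char buf[4096];
    while (in.read(buf, sizeof buf) || in.gcount() > 0)
        out.write(buf, in.gcount());

    return has_bom;
}

// Convenience wrapper for in-memory text (also hypothetical).
std::string strip_utf8_bom_copy(const std::string& text)
{
    std::istringstream in(text);
    std::ostringstream out;
    strip_utf8_bom(in, out);
    return out.str();
}
```

A cat-like tool would simply call strip_utf8_bom(std::cin, std::cout)
from main. Only a BOM at the very start of the input is removed, and
every header would have to be filtered as well, not just the .cpp file.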

Best regards

   Dario