Re: [PATCH 0/2] Improve documentation on UTF-16

On 28/12/2018 08:59, Johannes Sixt wrote:
> On 28.12.18 at 00:45, brian m. carlson wrote:
>> On Thu, Dec 27, 2018 at 08:55:27PM +0100, Johannes Sixt wrote:
>>> But why do you add another U+FEFF on the way to UTF-8? There is one in the incoming UTF-16 data, and only *that* one must be converted. If there is no U+FEFF in the UTF-16 data, there should not be one in UTF-8, either.
>>> Puzzled...
>>
>> So for UTF-16, there must be a BOM. For UTF-16LE and UTF-16BE, there
>> must not be a BOM. So if we do this:
>>
>>    $ printf '\xfe\xff\x00\x0a' | iconv -f UTF-16BE -t UTF-16 | xxd -g1
>>    00000000: ff fe ff fe 0a 00 ......
>
> What sort of braindamage is this? Fix iconv.
>
> But as I said, I'm not an expert. I just vented my worries that widespread existing practice would be ignored under the excuse "you are the outlier".
>
> -- Hannes

For reference, I dug out a Microsoft document [1] setting out its view of BOMs, which can be compared with the reference [0] that Brian gave.

[1] https://docs.microsoft.com/en-us/windows/desktop/intl/using-byte-order-marks

[0] https://unicode.org/faq/utf_bom.html#bom9

Maybe the documentation patch ([PATCH 1/2] Documentation: document UTF-16-related behavior) should include the phrase ", because we encode into UTF-8 internally,", along with a link to ref [0], and maybe [1].


Whether the various Windows programs actually follow the Microsoft convention is another matter altogether.
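
To make the BOM rule in the quoted iconv example a little more concrete, here is a minimal sketch using Python's built-in codecs. It is only an illustration: it shows how Python treats "utf-16" versus "utf-16-be", which I am assuming follows the same convention as ref [0], and it says nothing about what iconv or Git actually do internally. The variable names are made up for the example.

```python
import codecs

# The four bytes from the quoted printf command: U+FEFF followed by U+000A,
# both in big-endian byte order.
raw = b"\xfe\xff\x00\x0a"

# "utf-16-be" fixes the byte order up front, so the leading FE FF is decoded
# as an ordinary U+FEFF character and kept in the text.
text_be = raw.decode("utf-16-be")

# "utf-16" carries no fixed byte order, so the decoder treats FE FF as a BOM,
# uses it to select big-endian, and drops it from the decoded text.
text_bom = raw.decode("utf-16")

# Encoding to "utf-16" prepends a BOM in the platform's native byte order,
# so the U+FEFF kept above ends up preceded by a second FE FF / FF FE pair.
reencoded = text_be.encode("utf-16")

# Encoding the same text to UTF-8 carries the U+FEFF over as EF BB BF.
utf8 = text_be.encode("utf-8")

print(repr(text_be), repr(text_bom))
print(reencoded.hex())
print(utf8.hex())
```

The byte order of the re-encoded output depends on the machine, so the sketch deliberately does not hard-code it; on a little-endian host it should correspond to the bytes in the quoted xxd line.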

--

Philip




