On 2021-06-02 at 10:50:53, Ævar Arnfjörð Bjarmason wrote:
> I debugged this a bit more, it's probably *also* an issue in our use of
> libiconv, but it goes wrong just with our test setup with
> iconv(1). I.e. on my boring linux box:
>
>     echo x | iconv -f UTF-8 -t UTF-16 | perl -0777 -MData::Dumper -ne 'my @a = map { sprintf "0x%x", $_ } unpack "C*"; print Dumper \@a'
>     $VAR1 = [
>               '0xff',
>               '0xfe',
>               '0x78',
>               '0x0',
>               '0xa',
>               '0x0'
>             ];
>

This is a little-endian encoding of UTF-16 with a BOM.  The BOM is
required here since the default, if no BOM is provided, is big endian.

However, as I alluded to in 79444c92943, while the standard permits the
BOM to be omitted, doing so is generally improvident because that leads
to breakage when interoperating with Windows machines, many programs for
which assume little endian.  I mean, I don't use Windows and I think
those programs are broken and their authors rightfully should have known
better, but practically, using a BOM solves the problem easily, and if
we can be slightly nicer to the poor, hapless users of those programs,
why not?

> On the AIX box to get the same I need to do that as:
>
>     (printf '\376\377'; echo x | iconv -f UTF-8 -t UTF-16LE) | [...]
>
> I.e. we omit the BOM *and* AIX's idea of our UTF-16 is little-endian
> UTF-16, a plain UTF-16 gives you the big-endian version. To make things
> worse the same is true of UTF-32, except "iconv -l" lists no UTF-32LE
> version. So it seems we can't get the same result at all for that one.

But what do you get if you just use UTF-16?  Is it little endian with
BOM, big endian with BOM, or big endian without BOM?  If it's big endian
without BOM, did you set ICONV_OMITS_BOM when building?

> So from the outset the code added around 79444c92943 (utf8: handle
> systems that don't write BOM for UTF-16, 2019-02-12) needs to be more
> careful (although this looked broken before), i.e. we should test exact
> known-good bytes and see if UTF-16 is really what we think it is,
> etc. This is likely broken on any big-endian non-GNUish iconv
> implementation.

We probably could have been more careful here.  Part of the problem is
that I don't have access to any affected systems here, so it's not in
general easy for me to write a test (or even a patch) for this case.

We also did use iconv(1) before that, but I _think_ it's possible to
remove it.  The thing that's tricky is the use of SHIFT-JIS, which has
known round-tripping problems, but I don't think we rely on using the
system iconv(3) there and encoding any valid SHIFT-JIS sequence is
probably fine.
-- 
brian m. carlson (he/him or they/them)
Houston, Texas, US