Re: UTF-BOM was: [PATCH] t2080: fix cp invocation...

Torsten Bögershausen <tboegi@xxxxxx> · Wed, 2 Jun 2021 21:13:44 +0200

On Wed, Jun 02, 2021 at 03:36:57PM +0200, Ævar Arnfjörð Bjarmason wrote:

> >> >> There's still a failure[1] in t2082-parallel-checkout-attributes.sh
> >> >> though, which is new in 2.32.0-rc*. The difference is in an unexpected
> >> >> BOM:
> >> >>
> >> >>     avar@gcc119:[/scratch/avar/git/t]perl -nle 'print unpack "H*"' trash\ directory.t2082-parallel-checkout-attributes/encoding/A.internal
> >> >>     efbbbf74657874
> >> >>     avar@gcc119:[/scratch/avar/git/t]perl -nle 'print unpack "H*"' trash\ directory.t2082-parallel-checkout-attributes/encoding/utf8-text
> >> >>     74657874
> >> >>
> >> >> I.e. the A.internal starts with 0xefbbbf. The 2nd test of t0028*.sh also
> >> >> fails similarly[2], so perhaps it's some old/iconv/whatever issue not
> >> >> per-se related to any change of yours.
> >> >
> >> > The 0xefbbbf looks interesting, it's BOM for utf-8.
> >> >
> >> >> I tried compiling with both NO_ICONV=Y and ICONV_OMITS_BOM=Y, both have
> >> >> the same failure.
> >> >
> >> > I didn't check the code-path for NO_ICONV=Y but ICONV_OMITS_BOM=Y only
> >> > affects output of converting *to* utf-16 and utf-32.
> >> >
> >> > So, I think AIX iconv implementation automatically add BOM to utf-8?
> >> >
> >> > Perhap we need to call skip_utf8_bom somewhere?
> >>
> >> I debugged this a bit more, it's probably *also* an issue in our use of
> >> libiconv, but it goes wrong just with our test setup with
> >> iconv(1). I.e. on my boring linux box:
> >>
> >>     echo x | iconv -f UTF-8 -t UTF-16 | perl -0777 -MData::Dumper -ne 'my @a = map { sprintf "0x%x", $_ } unpack "C*"; print Dumper \@a'
> >>     $VAR1 = [
> >>               '0xff',
> >>               '0xfe',
> >>               '0x78',
> >>               '0x0',
> >>               '0xa',
> >>               '0x0'
> >>             ];
> >>
> >>
> >> On the AIX box to get the same I need to do that as:
> >>
> >>     (printf '\376\377'; echo x | iconv -f UTF-8 -t UTF-16LE) | [...]
> >
> > FWIW, my Linux with musl-libc also need to be done like this.
> >
> >> I.e. we omit the BOM *and* AIX's idea of our UTF-16 is little-endian
> >> UTF-16, a plain UTF-16 gives you the big-endian version.
> >
> > Per spec, plain UTF-16 *is* big-endian. [1]
> >
> > 	In the table <BOM> indicates that the byte order is determined
> > 	by a byte order mark, if present at the beginning of the data
> > 	stream, otherwise it is big-endian.
> >
> >> To make things
> >> worse the same is true of UTF-32, except "iconv -l" lists no UTF-32LE
> >> version. So it seems we can't get the same result at all for that one.
> >
> > Ditto for UTF-32
> >
> >> So from the outset the code added around 79444c92943 (utf8: handle
> >> systems that don't write BOM for UTF-16, 2019-02-12) needs to be more
> >> careful (although this looked broken before), i.e. we should test exact
> >> known-good bytes and see if UTF-16 is really what we think it is,
> >> etc. This is likely broken on any big-endian non-GNUish iconv
> >> implementation.
> >
> > Linux with musl-libc on little endian also thinks UTF-16 without BOM is UTF-16-BE
> >
> > I still think we should strip UTF-8 BOM after reencode_string_len
> > I.e. something like this, I can't test this, though, since I don't have any AIX box.
> > And my Linux with musl-libc doesn't output BOM for utf-8
> > It doesn't write BOM for utf-16be and utf-32be, anyway.
> >
> > -----8<----
> > diff --git a/utf8.c b/utf8.c
> > index de4ce5c0e6..73631632bd 100644
> > --- a/utf8.c
> > +++ b/utf8.c
> > @@ -8,6 +8,7 @@ static const char utf16_be_bom[] = {'\xFE', '\xFF'};
> >  static const char utf16_le_bom[] = {'\xFF', '\xFE'};
> >  static const char utf32_be_bom[] = {'\0', '\0', '\xFE', '\xFF'};
> >  static const char utf32_le_bom[] = {'\xFF', '\xFE', '\0', '\0'};
> > +const char utf8_bom[] = "\357\273\277";
> >
> >  struct interval {
> >  	ucs_char_t first;
> > @@ -28,6 +29,12 @@ size_t display_mode_esc_sequence_len(const char *s)
> >  	return p - s;
> >  }
> >
> > +static int has_utf8_bom(const char *text, size_t len)
> > +{
> > +	return len >= strlen(utf8_bom) &&
> > +		memcmp(text, utf8_bom, strlen(utf8_bom)) == 0;
> > +}
> > +
> >  /* auxiliary function for binary search in interval table */
> >  static int bisearch(ucs_char_t ucs, const struct interval *table, int max)
> >  {
> > @@ -539,12 +546,13 @@ static const char *fallback_encoding(const char *name)
> >
> >  char *reencode_string_len(const char *in, size_t insz,
> >  			  const char *out_encoding, const char *in_encoding,
> > -			  size_t *outsz)
> > +			  size_t *outsz_p)
> >  {
> >  	iconv_t conv;
> >  	char *out;
> >  	const char *bom_str = NULL;
> >  	size_t bom_len = 0;
> > +	size_t outsz = 0;
> >
> >  	if (!in_encoding)
> >  		return NULL;
> > @@ -590,10 +598,16 @@ char *reencode_string_len(const char *in, size_t insz,
> >  		if (conv == (iconv_t) -1)
> >  			return NULL;
> >  	}
> > -	out = reencode_string_iconv(in, insz, conv, bom_len, outsz);
> > +	out = reencode_string_iconv(in, insz, conv, bom_len, &outsz);
> >  	iconv_close(conv);
> >  	if (out && bom_str && bom_len)
> >  		memcpy(out, bom_str, bom_len);
> > +	if (is_encoding_utf8(out_encoding) && has_utf8_bom(out, outsz)) {
> > +		outsz -= strlen(utf8_bom);
> > +		memmove(out, out + strlen(utf8_bom), outsz + 1);
> > +	}
> > +	if (outsz_p)
> > +		*outsz_p = outsz;
> >  	return out;
> >  }
> >  #endif
> > @@ -782,12 +796,9 @@ int is_hfs_dotmailmap(const char *path)
> >  	return is_hfs_dot_str(path, "mailmap");
> >  }
> >
> > -const char utf8_bom[] = "\357\273\277";
> > -
> >  int skip_utf8_bom(char **text, size_t len)
> >  {
> > -	if (len < strlen(utf8_bom) ||
> > -	    memcmp(*text, utf8_bom, strlen(utf8_bom)))
> > +	if (!has_utf8_bom(*text, len))
> >  		return 0;
> >  	*text += strlen(utf8_bom);
> >  	return 1;
> > ---->8------
> >
> > 1: https://unicode.org/faq/utf_bom.html
>
> That's getting us there, now we don't fail on the 2nd test, but do start
> failing on the third "re-encode to UTF-16 on checkout" and other
> "checkout" tests.
>
> The "test_cmp" at the end of that 3rd tests shows that the difference in
> test.utf16.raw and test.utf16 is now that the "raw" one has the BOM, but
> not the "test.utf16" file.

What I read from all of this is that "the iconv" does not handle BOMs
correctly. When converting from UTF-16 or UTF-32 to UTF-8, the BOM should be
removed, but that does not seem to be the case here.
The patch above for utf8.c in Git will fix this - OK so far.
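
To see whether a given iconv(1) leaves a UTF-8 BOM in its output, something
like the following could be used. This is only a rough sketch: the helper
name starts_with_utf8_bom is made up for illustration and is not part of
Git's test-lib, and it assumes a POSIX od(1) that understands -An, -tx1 and -N.

```shell
# Hypothetical helper, not part of Git's test-lib.
# Succeeds if stdin starts with the UTF-8 BOM (bytes ef bb bf).
starts_with_utf8_bom () {
	# Dump only the first three bytes as hex and drop od's whitespace.
	test "$(od -An -tx1 -N3 | tr -d ' \n')" = "efbbbf"
}
```

Feeding it the UTF-16 input from a pipeline such as
"printf '\377\376t\000e\000x\000t\000' | iconv -f UTF-16 -t UTF-8 | starts_with_utf8_bom"
would then tell whether the local iconv keeps the BOM; the outcome depends on
the platform, so no result is claimed here.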

For t2082-parallel-checkout-attributes.sh and t0028 we may be able to prepare
the "right" versions of the expected data on e.g. a Linux box and add that
material to the test cases.
And remove the invocation of a potentially broken iconv binary completely from
the test scripts.
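
As an illustration of what "prepared" expected data could look like, the
bytes can be written with printf alone, so no iconv binary is involved. This
is a sketch under assumptions: the helper names are invented for this mail,
and the byte layout (BOM 0xff 0xfe followed by little-endian code units) is
simply what the Linux iconv produced for "UTF-16" earlier in this thread.

```shell
# Hypothetical helpers, not part of Git's test-lib.

# Emit the ASCII string "text" as UTF-16LE with a leading BOM.
# \377\376 is the BOM; each ASCII character is followed by a \000 byte.
write_utf16le_bom_text () {
	printf '\377\376t\000e\000x\000t\000'
}

# Hex-dump stdin as one unbroken lowercase string, for exact comparisons.
hex_of () {
	od -An -tx1 | tr -d ' \n'
}
```

"write_utf16le_bom_text | hex_of" should then print fffe7400650078007400,
which a test could compare against the checked-out file byte for byte.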