Re: t0028-working-tree-encoding.sh failing on musl based systems (Alpine Linux)

Rich Felker <dalias@xxxxxxxx> · Fri, 8 Feb 2019 01:04:03 -0500

On Fri, Feb 08, 2019 at 12:17:05AM +0000, brian m. carlson wrote:
> [Please skip using Reply-To and instead use Mail-Followup-To so that
> responses also go to the list.]
> 
> On Thu, Feb 07, 2019 at 10:59:35PM +0100, Kevin Daudt wrote:
> > I'm trying to get the git test suite passing on Alpine Linux, which is
> > based on musl libc.
> > 
> > All tests in t0028-working-tree-encoding.sh are currently failing,
> > because musl iconv does not support stateful output of UTF-16/32 (e.g.,
> > it does not output a BOM), while git is expecting that to be present:
> > 
> > > hint: The file 'test.utf16' is missing a byte order mark (BOM). Please
> > > use UTF-16BE or UTF-16LE (depending on the byte order) as
> > > working-tree-encoding.
> > > fatal: BOM is required in 'test.utf16' if encoded as utf-16
> > 
> > Because adding the file to git fails, all the other tests fail as well
> > as they expect the file to be present in the repository.
> > 
> > Any idea how to get around this?
> 
> I think musl needs to patch their libc. RFC 2781 says that if there's no
> BOM in UTF-16, then "the text SHOULD be interpreted as being
> big-endian."
> 
> Unfortunately for all of us, many Windows-based programs have chosen to
> ignore that advice (technically, it's only a SHOULD) and interpret it as
> little-endian instead. Git can't safely assume anything about the
> endianness of a UTF-16 stream that doesn't contain a BOM. Technically,
> since the RFC doesn't specify a MUST requirement, musl can't, either.
> 
> Even if Git were to produce a BOM to work around this issue, then we'd
> still have the problem that any program using musl will write data in
> UTF-16 without a BOM. Moreover, because musl, in violation of the RFC,
> doesn't read and process BOMs, someone using little-endian UTF-16 (with
> a proper BOM) with musl and Git will have their data corrupted,
> according to my reading of the musl website.

That information is outdated and someone from our side should update
it; since 1.1.19, musl treats "UTF-16" input as having ambiguous
endianness, determined by the BOM and defaulting to big-endian if there
is no BOM. However, output is always big-endian, such that processes
conforming to the Unicode SHOULD clause will interpret it correctly.
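
To illustrate the input rule described above, here is a minimal sketch
of BOM-based endianness detection with a big-endian default. It is not
musl's actual code; the names utf16_order and utf16_detect are made up
for this example.

```c
#include <stddef.h>

/* Hypothetical names, for illustration only (not from musl). */
enum utf16_order { UTF16_BE, UTF16_LE };

/* Decide the byte order of a "UTF-16" input buffer.
 * A leading BOM selects the order and is skipped (*skip = 2);
 * without a BOM the input is treated as big-endian (*skip = 0). */
enum utf16_order utf16_detect(const unsigned char *buf, size_t len,
                              size_t *skip)
{
    *skip = 0;
    if (len >= 2) {
        if (buf[0] == 0xFE && buf[1] == 0xFF) {
            *skip = 2;
            return UTF16_BE;
        }
        if (buf[0] == 0xFF && buf[1] == 0xFE) {
            *skip = 2;
            return UTF16_LE;
        }
    }
    return UTF16_BE; /* no BOM: default to big-endian */
}
```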

The portable way to get little endian with a BOM is to open a
conversion descriptor for "UTF-16LE" (which should not add any BOM)
and write a BOM manually.
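
A rough sketch of that approach, assuming a UTF-8 source and a caller
supplied output buffer. The function name utf8_to_utf16le_bom is made
up for this example, and error handling is kept minimal.

```c
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: convert UTF-8 to UTF-16LE with a leading BOM.
 * Returns the number of bytes written to out, or (size_t)-1 on error
 * (including an output buffer that is too small). */
size_t utf8_to_utf16le_bom(const char *in, size_t inlen,
                           char *out, size_t outcap)
{
    /* U+FEFF encoded as little-endian UTF-16. */
    static const char bom[2] = { (char)0xFF, (char)0xFE };

    if (outcap < sizeof bom)
        return (size_t)-1;

    /* "UTF-16LE" has explicit endianness, so no BOM should be emitted. */
    iconv_t cd = iconv_open("UTF-16LE", "UTF-8");
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    /* Write the BOM by hand, then convert into the rest of the buffer. */
    memcpy(out, bom, sizeof bom);

    char *inp = (char *)in;          /* iconv() takes a non-const pointer */
    char *outp = out + sizeof bom;
    size_t inleft = inlen;
    size_t outleft = outcap - sizeof bom;

    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);

    if (rc == (size_t)-1)
        return (size_t)-1;
    return (size_t)(outp - out);
}
```

For the two-byte input "hi" this should produce six bytes: the BOM
FF FE followed by 68 00 69 00.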

In any case, this test seems mainly relevant to Windows users wanting
to store source files in UTF-16LE with BOM. This doesn't really make
sense to do on a Linux/musl system, so I'm not sure any action is
needed here from either side.

Rich