On Wed, Jun 02 2021, Đoàn Trần Công Danh wrote:

> On 2021-06-02 12:50:53+0200, Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> wrote:
>>
>> On Wed, Jun 02 2021, Đoàn Trần Công Danh wrote:
>>
>> > On 2021-05-31 16:01:01+0200, Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> wrote:
>> >>
>> >> On Thu, May 27 2021, Ævar Arnfjörð Bjarmason wrote:
>> >>
>> >> > On Wed, May 26 2021, Matheus Tavares wrote:
>> >> >
>> >> >> t2080 makes a few copies of a test repository and later performs a
>> >> >> branch switch on each one of the copies to verify that parallel checkout
>> >> >> and sequential checkout produce the same results. However, the
>> >> >> repository is copied with `cp -R` which, on some systems, defaults to
>> >> >> following symlinks on the directory hierarchy and copying their target
>> >> >> files instead of copying the symlinks themselves. AIX is one example of
>> >> >> system where this happens. Because the symlinks are not preserved, the
>> >> >> copied repositories have paths that do not match what is in the index,
>> >> >> causing git to abort the checkout operation that we want to test. This
>> >> >> makes the test fail on these systems.
>> >> >>
>> >> >> Fix this by copying the repository with the POSIX flag '-P', which
>> >> >> forces cp to copy the symlinks instead of following them. Note that we
>> >> >> already use this flag for other cp invocations in our test suite (see
>> >> >> t7001). With this change, t2080 now passes on AIX.
>> >> >>
>> >> >> Reported-by: Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx>
>> >> >> Signed-off-by: Matheus Tavares <matheus.bernardino@xxxxxx>
>> >> >> ---
>> >> >>  t/t2080-parallel-checkout-basics.sh | 2 +-
>> >> >>  1 file changed, 1 insertion(+), 1 deletion(-)
>> >> >>
>> >> >> diff --git a/t/t2080-parallel-checkout-basics.sh b/t/t2080-parallel-checkout-basics.sh
>> >> >> index 7087818550..3e0f8c675f 100755
>> >> >> --- a/t/t2080-parallel-checkout-basics.sh
>> >> >> +++ b/t/t2080-parallel-checkout-basics.sh
>> >> >> @@ -114,7 +114,7 @@ do
>> >> >>
>> >> >>  	test_expect_success "$mode checkout" '
>> >> >>  		repo=various_$mode &&
>> >> >> -		cp -R various $repo &&
>> >> >> +		cp -R -P various $repo &&
>> >> >>
>> >> >>  		# The just copied files have more recent timestamps than their
>> >> >>  		# associated index entries. So refresh the cached timestamps
>> >> >
>> >> > Thanks for the quick fix, I can confirm that this makes the test pass on
>> >> > AIX 7.2.
>> >>
>> >> There's still a failure[1] in t2082-parallel-checkout-attributes.sh
>> >> though, which is new in 2.32.0-rc*. The difference is in an unexpected
>> >> BOM:
>> >>
>> >>     avar@gcc119:[/scratch/avar/git/t]perl -nle 'print unpack "H*"' trash\ directory.t2082-parallel-checkout-attributes/encoding/A.internal
>> >>     efbbbf74657874
>> >>     avar@gcc119:[/scratch/avar/git/t]perl -nle 'print unpack "H*"' trash\ directory.t2082-parallel-checkout-attributes/encoding/utf8-text
>> >>     74657874
>> >>
>> >> I.e. the A.internal starts with 0xefbbbf. The 2nd test of t0028*.sh also
>> >> fails similarly[2], so perhaps it's some old/iconv/whatever issue not
>> >> per-se related to any change of yours.
>> >
>> > The 0xefbbbf looks interesting, it's BOM for utf-8.
>> >
>> >> I tried compiling with both NO_ICONV=Y and ICONV_OMITS_BOM=Y, both have
>> >> the same failure.
>> >
>> > I didn't check the code-path for NO_ICONV=Y but ICONV_OMITS_BOM=Y only
>> > affects output of converting *to* utf-16 and utf-32.
>> >
>> > So, I think AIX iconv implementation automatically add BOM to utf-8?
>> >
>> > Perhap we need to call skip_utf8_bom somewhere?
>>
>> I debugged this a bit more, it's probably *also* an issue in our use of
>> libiconv, but it goes wrong just with our test setup with
>> iconv(1). I.e. on my boring linux box:
>>
>>     echo x | iconv -f UTF-8 -t UTF-16 | perl -0777 -MData::Dumper -ne 'my @a = map { sprintf "0x%x", $_ } unpack "C*"; print Dumper \@a'
>>     $VAR1 = [
>>               '0xff',
>>               '0xfe',
>>               '0x78',
>>               '0x0',
>>               '0xa',
>>               '0x0'
>>             ];
>>
>>
>> On the AIX box to get the same I need to do that as:
>>
>>     (printf '\376\377'; echo x | iconv -f UTF-8 -t UTF-16LE) | [...]
>
> FWIW, my Linux with musl-libc also need to be done like this.
>
>> I.e. we omit the BOM *and* AIX's idea of our UTF-16 is little-endian
>> UTF-16, a plain UTF-16 gives you the big-endian version.
>
> Per spec, plain UTF-16 *is* big-endian. [1]
>
> In the table <BOM> indicates that the byte order is determined
> by a byte order mark, if present at the beginning of the data
> stream, otherwise it is big-endian.
>
>> To make things
>> worse the same is true of UTF-32, except "iconv -l" lists no UTF-32LE
>> version. So it seems we can't get the same result at all for that one.
>
> Ditto for UTF-32
>
>> So from the outset the code added around 79444c92943 (utf8: handle
>> systems that don't write BOM for UTF-16, 2019-02-12) needs to be more
>> careful (although this looked broken before), i.e. we should test exact
>> known-good bytes and see if UTF-16 is really what we think it is,
>> etc. This is likely broken on any big-endian non-GNUish iconv
>> implementation.
>
> Linux with musl-libc on little endian also thinks UTF-16 without BOM is UTF-16-BE
>
> I still think we should strip UTF-8 BOM after reencode_string_len
> I.e. something like this, I can't test this, though, since I don't have any AIX box.
> And my Linux with musl-libc doesn't output BOM for utf-8
> It doesn't write BOM for utf-16be and utf-32be, anyway.
>
> -----8<----
> diff --git a/utf8.c b/utf8.c
> index de4ce5c0e6..73631632bd 100644
> --- a/utf8.c
> +++ b/utf8.c
> @@ -8,6 +8,7 @@ static const char utf16_be_bom[] = {'\xFE', '\xFF'};
>  static const char utf16_le_bom[] = {'\xFF', '\xFE'};
>  static const char utf32_be_bom[] = {'\0', '\0', '\xFE', '\xFF'};
>  static const char utf32_le_bom[] = {'\xFF', '\xFE', '\0', '\0'};
> +const char utf8_bom[] = "\357\273\277";
>
>  struct interval {
>  	ucs_char_t first;
> @@ -28,6 +29,12 @@ size_t display_mode_esc_sequence_len(const char *s)
>  	return p - s;
>  }
>
> +static int has_utf8_bom(const char *text, size_t len)
> +{
> +	return len >= strlen(utf8_bom) &&
> +		memcmp(text, utf8_bom, strlen(utf8_bom)) == 0;
> +}
> +
>  /* auxiliary function for binary search in interval table */
>  static int bisearch(ucs_char_t ucs, const struct interval *table, int max)
>  {
> @@ -539,12 +546,13 @@ static const char *fallback_encoding(const char *name)
>
>  char *reencode_string_len(const char *in, size_t insz,
>  			  const char *out_encoding, const char *in_encoding,
> -			  size_t *outsz)
> +			  size_t *outsz_p)
>  {
>  	iconv_t conv;
>  	char *out;
>  	const char *bom_str = NULL;
>  	size_t bom_len = 0;
> +	size_t outsz = 0;
>
>  	if (!in_encoding)
>  		return NULL;
> @@ -590,10 +598,16 @@ char *reencode_string_len(const char *in, size_t insz,
>  		if (conv == (iconv_t) -1)
>  			return NULL;
>  	}
> -	out = reencode_string_iconv(in, insz, conv, bom_len, outsz);
> +	out = reencode_string_iconv(in, insz, conv, bom_len, &outsz);
>  	iconv_close(conv);
>  	if (out && bom_str && bom_len)
>  		memcpy(out, bom_str, bom_len);
> +	if (is_encoding_utf8(out_encoding) && has_utf8_bom(out, outsz)) {
> +		outsz -= strlen(utf8_bom);
> +		memmove(out, out + strlen(utf8_bom), outsz + 1);
> +	}
> +	if (outsz_p)
> +		*outsz_p = outsz;
>  	return out;
>  }
>  #endif
> @@ -782,12 +796,9 @@ int is_hfs_dotmailmap(const char *path)
>  	return is_hfs_dot_str(path, "mailmap");
>  }
>
> -const char utf8_bom[] = "\357\273\277";
> -
>  int skip_utf8_bom(char **text, size_t len)
>  {
> -	if (len < strlen(utf8_bom) ||
> -	    memcmp(*text, utf8_bom, strlen(utf8_bom)))
> +	if (!has_utf8_bom(*text, len))
>  		return 0;
>  	*text += strlen(utf8_bom);
>  	return 1;
> ---->8------
>
> 1: https://unicode.org/faq/utf_bom.html

That's getting us there: we no longer fail on the 2nd test, but we do
start failing on the third, "re-encode to UTF-16 on checkout", and on
other "checkout" tests. The "test_cmp" at the end of that 3rd test shows
that the difference between test.utf16.raw and test.utf16 is now that the
"raw" one has the BOM, but the "test.utf16" file does not.
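
For the "test exact known-good bytes" idea, here is a rough, untested-on-AIX sketch of how one might classify whatever BOM a conversion actually produced. The names (bom_kind, classify_bom, has_prefix, BOM_DEMO_MAIN) are made up for illustration and are not anything in utf8.c; only the byte sequences match the BOM constants there.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names for illustration; none of this exists in git's utf8.c. */
enum bom_kind {
	BOM_NONE,
	BOM_UTF8,
	BOM_UTF16_BE,
	BOM_UTF16_LE,
	BOM_UTF32_BE,
	BOM_UTF32_LE
};

/* Same byte sequences as the BOM constants near the top of utf8.c. */
static const char bom_utf8[] = {'\xEF', '\xBB', '\xBF'};
static const char bom_utf16_be[] = {'\xFE', '\xFF'};
static const char bom_utf16_le[] = {'\xFF', '\xFE'};
static const char bom_utf32_be[] = {'\0', '\0', '\xFE', '\xFF'};
static const char bom_utf32_le[] = {'\xFF', '\xFE', '\0', '\0'};

static int has_prefix(const char *buf, size_t len,
		      const char *bom, size_t bom_len)
{
	return len >= bom_len && !memcmp(buf, bom, bom_len);
}

/*
 * Classify the byte order mark at the start of buf, if any.
 * UTF-32 is checked before UTF-16 because the UTF-32LE BOM
 * (FF FE 00 00) begins with the UTF-16LE BOM (FF FE). That means
 * UTF-16LE text whose first character after the BOM is U+0000 is
 * reported as UTF-32LE; this sketch accepts that ambiguity.
 */
static enum bom_kind classify_bom(const char *buf, size_t len)
{
	if (has_prefix(buf, len, bom_utf32_be, sizeof(bom_utf32_be)))
		return BOM_UTF32_BE;
	if (has_prefix(buf, len, bom_utf32_le, sizeof(bom_utf32_le)))
		return BOM_UTF32_LE;
	if (has_prefix(buf, len, bom_utf8, sizeof(bom_utf8)))
		return BOM_UTF8;
	if (has_prefix(buf, len, bom_utf16_be, sizeof(bom_utf16_be)))
		return BOM_UTF16_BE;
	if (has_prefix(buf, len, bom_utf16_le, sizeof(bom_utf16_le)))
		return BOM_UTF16_LE;
	return BOM_NONE;
}

#ifdef BOM_DEMO_MAIN
int main(void)
{
	/* The two hex dumps from t2082 quoted earlier in this thread. */
	const char a_internal[] = {'\xEF', '\xBB', '\xBF', 't', 'e', 'x', 't'};
	const char utf8_text[] = {'t', 'e', 'x', 't'};

	assert(classify_bom(a_internal, sizeof(a_internal)) == BOM_UTF8);
	assert(classify_bom(utf8_text, sizeof(utf8_text)) == BOM_NONE);
	return 0;
}
#endif
```

Building with something like "cc -DBOM_DEMO_MAIN" enables the small self-check. The point is only that a test helper could assert on the exact leading bytes instead of assuming what a plain "UTF-16" or "UTF-32" means to the local iconv.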