Re: js/drop-mingw-test-cmp, was Re: What's cooking in git.git (Dec 2022, #03; Sun, 11)

On 25.12.2022 at 16:44, Johannes Sixt wrote:
> On 24.12.22 at 14:31, René Scharfe wrote:
>> On 24.12.22 at 09:10, Johannes Sixt wrote:
>>> On 21.12.22 at 14:05, Junio C Hamano wrote:
>>>> I know we have been operating under such a test environment, but
>>>> after seeing the exchange between René and J6t, I was hoping that we
>>>> do not have to keep being sloppy.
>>>
>>> Things did not turn out to be as simple. After ripping out all
>>> special-casing of GIT_TEST_CMP from a MinGW build, I notice at least one
>>> case that needs special treatment (it's `tar tf` that writes CRLF
>>> output).
>>
>> That would affect t6132 and perhaps t9502, right?
>>
>> How can I reproduce it?  I get only LF:
>>
>>    $ uname -rs
>>    MINGW64_NT-10.0-22621 3.3.6-341.x86_64
>>
>>    $ git archive HEAD Makefile | tar tf - | hexdump.exe -C
>>    00000000  4d 61 6b 65 66 69 6c 65  0a                       |Makefile.|
>>    00000009
>>
>> Is there some configuration option that I need to set?
>
> Good catch! Looks like I am wrong. In my environment I give the Windows
> tools precedence over the MinGW tools. There is a tar.exe in
> C:\Windows\System32 that writes CRLF (instead of just LF) that I was
> using in my test runs.

Interesting, I have that program as well:

   $ git archive HEAD Makefile | /c/Windows/System32/tar.exe tf - | hexdump.exe -C
   00000000  4d 61 6b 65 66 69 6c 65  0d 0a                    |Makefile..|
   0000000a

   $ /c/Windows/System32/tar.exe --version
   bsdtar 3.5.2 - libarchive 3.5.2 zlib/1.2.5.f-ipp bz2lib/1.0.6

And indeed on my system the SDK's variant is used:

   $ which tar
   /usr/bin/tar

There are a few more commands in System32 that could shadow the ones
in the SDK:

   $ for f in /c/Windows/system32/*.exe; do p=${f%.exe}; w="$(which ${p##*/} 2>/dev/null)" && test "$p" != "$w" && echo "$w"; done
   /usr/bin/bash
   /usr/bin/cmd
   /mingw64/bin/curl
   /usr/bin/expand
   /usr/bin/find
   /usr/bin/klist
   /usr/bin/notepad
   /usr/bin/sort
   /usr/bin/tar
   /usr/bin/timeout
   /usr/bin/whoami

Windows' find and sort are quite different from GNU's.  That's handled
in t/test-lib.sh by forcing the use of the variants from /usr/bin.
Should we do the same for tar?
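
If we do, something like this next to the existing find and sort
overrides in t/test-lib.sh should suffice (untested sketch):

   # prefer the MSYS tar over C:\Windows\System32\tar.exe (bsdtar),
   # which writes CRLF in its listing output
   tar () {
   	/usr/bin/tar "$@"
   }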

Not sure about curl; /mingw64/bin/curl writes CRLFs as well.

expand.exe is very different from GNU's, but we only use it in
t/t4135/make-patches, which is not executed during a test suite run.

>>> For the time being, I suggest to take Dscho's patch.
>>
>> The patch is intended to make comparisons faster.  That works for big
>> files, but the test suite compares small ones.   The total duration of
>> a test suite run is about one minute longer with the patch than without
>> it for me [1].  I retried with 7c2ef319c5 (The first batch for 2.40,
>> 2022-12-19), and that's still the case.  Do you get different numbers?
>
> I'm not going to measure it (one full run takes ~2 hours). I trusted
> Dscho's argument that this patch brings a much better improvement based
> on the one case that is cited in the commit message, but if that was
> just an extraordinary outlier, I am not sure anymore...

Painful, I know.

Logged the sizes of files handed to test_cmp (on macOS).  19170 calls,
median size 42 bytes, average size 617 bytes.  2307 calls with empty
files, 1093 of which in t1092 alone.  The two biggest files were in
t1050, at 2000000 and 2500000 bytes; t9300 came third with 180333
bytes, an order of magnitude smaller.
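
In case anyone wants to reproduce these numbers: a throwaway override
along these lines is enough; the log path is arbitrary:

   # temporary instrumentation: record the sizes of both files handed
   # to test_cmp, then run the configured comparison as before
   test_cmp () {
   	wc -c "$@" >>/tmp/test_cmp_sizes.log 2>/dev/null
   	eval "$GIT_TEST_CMP" '"$@"'
   }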

t1050 at 8a4e8f6a67 (The second batch, 2022-12-26) on Windows:

Benchmark 1: sh.exe t1050-large.sh
  Time (mean ± σ):     18.312 s ±  0.069 s    [User: 0.000 s, System: 0.003 s]
  Range (min … max):   18.218 s … 18.422 s    10 runs

... and with the patch:

Benchmark 1: sh.exe t1050-large.sh
  Time (mean ± σ):      5.709 s ±  0.046 s    [User: 0.000 s, System: 0.003 s]
  Range (min … max):    5.647 s …  5.787 s    10 runs
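
These timings are hyperfine output; the invocation was roughly:

   $ hyperfine --runs 10 'sh.exe t1050-large.sh'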

So it works as advertised for big files, but calling an external
program 19000 times takes time as well, which explains the longer
overall test suite duration.

If we use test_cmp_bin for the two biggest comparisons we get the
same speedup:

Benchmark 1: sh.exe t1050-large.sh
  Time (mean ± σ):      5.719 s ±  0.089 s    [User: 0.000 s, System: 0.006 s]
  Range (min … max):    5.659 s …  5.960 s    10 runs

Is this safe?  The files consist of X's and Y's at the point of
comparison, so they aren't typical binary files, but they contain no
line endings or user-readable content at all, so I think treating
them as blobs is appropriate.

---
 t/t1050-large.sh | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/t/t1050-large.sh b/t/t1050-large.sh
index c71932b024..c184540855 100755
--- a/t/t1050-large.sh
+++ b/t/t1050-large.sh
@@ -90,7 +90,7 @@ test_expect_success 'checkout a large file' '
 	large1=$(git rev-parse :large1) &&
 	git update-index --add --cacheinfo 100644 $large1 another &&
 	git checkout another &&
-	test_cmp large1 another
+	test_cmp_bin large1 another
 '

 test_expect_success 'packsize limit' '
@@ -192,7 +192,7 @@ test_expect_success 'pack-objects with large loose object' '
 	test_create_repo packed &&
 	mv pack-* packed/.git/objects/pack &&
 	GIT_DIR=packed/.git git cat-file blob $SHA1 >actual &&
-	test_cmp huge actual
+	test_cmp_bin huge actual
 '

 test_expect_success 'tar archiving' '
--
2.38.1.windows.1



