Sorry for the delay. I wanted to gather some data but could only do it in small chunks at a time. Hopefully that doesn't make this too rambling and/or disjointed. One upside is that I finally got around to teaching the Fedora builds to include not just the failing test output, but tar up the contents of test-results and any trash directories on failures. That makes it easier to poke around when a build fails on a remote build system where I can't get a shell. :) Jeff King wrote: > On Wed, Nov 30, 2022 at 05:09:34PM -0500, Todd Zullinger wrote: > >> The changes in 73c49a4474 (t: run t5551 tests with both HTTP >> and HTTP/2, 2022-11-11) seem to rather frequently trigger >> test failures on Fedora. > > Thanks for the report. I can't seem to reproduce here on my Debian > system, even with --stress. One notable difference between Debian/Ubuntu and Fedora is that Debian/Ubuntu uses mod_http2 included with the upstream Apache httpd source. Fedora is using the newer, stand-alone module (which is the upstream source for the http2 module). Ubuntu 22.04.1 has httpd-2.4.52 with mod_http2-1.15.26 (per MOD_HTTP2_VERSION in modules/http2/h2_version.h). Fedora 37, 36, and rawhide have httpd-httpd-2.4.54 with mod_http2-2.0.9. They also have curl-7.86 in rawhide (where I've done the most testing), 7.85.0 in 37, and 7.82.0 in 36. Interestingly, if I build the same git source rpm against RHEL-9 which has httpd-2.4.53 and mod_http2-1.15.19, I don't see any failures. I'm tempted to do a build of mod_http2-1.x or the embedded mod_http2 and test with that, but I have not yet made time to do so. Searching for errors similar to what I found in the test failures, I found an old curl issue and a related Fedora bug: https://github.com/curl/curl/issues/3750 https://bugzilla.redhat.com/show_bug.cgi?id=1690971 Of course, these are from 2019 so you'd think this particular issue was long-gone. But perhaps with mod_http2 still being actively developed, similar bugs can crop up. > I can try a Fedora chroot or vm, but it may take some time to set up. It > also sounds like you're not able to reproduce it locally on a Fedora > system? Have you tried "./t5559-* --stress"? I had, but at the time the failures I got seemed unrelated, particularly because they triggered on earlier tests and somewhat randomly. After I changed the test to append $test_count to err so I could see the reason for the failure more clearly, I think they are more similar. With --stress, http2 fails at various points with errors like: expecting success of 5559.6 'fetch changes via http': echo content >>file && git commit -a -m two && git push public && (cd clone && git pull) && test_cmp file clone/file +++ echo content +++ git commit -a -m two [main 5647d5f] two Author: A U Thor <author@xxxxxxxxxxx> 1 file changed, 1 insertion(+) +++ git push public To /builddir/build/BUILD/git-2.39.0.rc2/t/trash directory.t5559-http-fetch-smart-http2.stress-13/httpd/www/repo.git f39a923..5647d5f main -> main +++ cd clone +++ git pull error: RPC failed; curl 16 Error in the HTTP2 framing layer fatal: expected flush after ref listing error: last command exited with $?=1 not ok 6 - fetch changes via http or sometimes with the same error in test 30, as below. >> The most frequent test to fail is "large fetch-pack requests >> can be sent using chunked encoding" (t5559.30), but earlier >> tests have also failed on occasion. For the common failure, >> the test exits with: >> >> expecting success of 5559.30 'large fetch-pack requests can be sent using chunked encoding': >> GIT_TRACE_CURL=true git -c http.postbuffer=65536 \ >> clone --bare "$HTTPD_URL/smart/repo.git" split.git 2>err && >> { >> test_have_prereq HTTP2 || >> grep "^=> Send header: Transfer-Encoding: chunked" err >> } >> +++ GIT_TRACE_CURL=true >> +++ git -c http.postbuffer=65536 clone --bare http://127.0.0.1:5559/smart/repo.git split.git >> error: last command exited with $?=128 >> not ok 30 - large fetch-pack requests can be sent using chunked encoding >> >> (Less frequently, I've seen the last command exit 141.) > > Interesting. 141 implies SIGPIPE, but that may be a red herring. The > operation isn't supposed to fail, but if it does due to the server > bailing, then the client sometimes hitting SIGPIPE is a known unrelated > problem. > > Is it possible to see the contents of "err" here? That will have the > message from the client, but also the curl trace, which is probably > helpful. Appending -$test_count to err in all the places it's used in the test helped here, e.g.: sed -Ei 's/\berr\b/&-$test_count/' t/t5551-http-fetch-smart.sh With that, one of the more common errors is: error: RPC failed; HTTP 101 curl 92 HTTP/2 stream 1 was not closed cleanly before end of the underlying stream others are: error: RPC failed; HTTP 101 curl 16 Error in the HTTP2 framing layer fatal: expected flush after ref listing error: RPC failed; HTTP 101 curl 16 Error in the HTTP2 framing layer error: RPC failed; curl 16 Error in the HTTP2 framing layer fatal: expected 'packfile' The full err output is usually around 5k lines, so it's not ideal for inclusion here, but I've saved a number of them and can either extract portions or post them somewhere and link to them, if you think they might be useful. > It might also be interesting to see the server's view, which would be in > httpd/error.log in the trash directory (also access.log, but probably > that is not interesting). The error.log looks the same for the failed runs I've collected: [Tue Dec 06 16:35:40.840250 2022] [mpm_event:notice] [pid 298654:tid 298654] AH00489: Apache/2.4.54 (Fedora Linux) configured -- resuming normal operations [Tue Dec 06 16:35:40.840388 2022] [core:notice] [pid 298654:tid 298654] AH00094: Command line: '/usr/sbin/httpd -d /tmp/git-t.oOfy/trash directory.t5559-http-fetch-smart-http2/httpd -f /builddir/build/BUILD/git-2.39.0.rc2/t/lib-httpd/apache.conf -D HTTP2 -c Listen 127.0.0.1:5559' [Tue Dec 06 16:35:42.372099 2022] [cgi:error] [pid 298658:tid 298768] [remote 127.0.0.1:55644] AH01215: stderr from /builddir/build/BUILD/git-2.39.0.rc2/git-http-backend: Unsupported service: getanyfile [Tue Dec 06 16:35:46.209758 2022] [cgi:error] [pid 300664:tid 300695] [client 127.0.0.1:60712] AH01215: stderr from /builddir/build/BUILD/git-2.39.0.rc2/git-http-backend: fatal: git upload-pack: not our ref 51c0379e3f0bea4537b50664334a941865f61b94 [Tue Dec 06 16:35:46.912816 2022] [cgi:error] [pid 298659:tid 298663] [remote 127.0.0.1:60718] AH01215: stderr from /builddir/build/BUILD/git-2.39.0.rc2/git-http-backend: fatal: git upload-pack: not our ref 51c0379e3f0bea4537b50664334a941865f61b94 [Tue Dec 06 16:35:53.962296 2022] [auth_basic:error] [pid 300664:tid 300707] [client 127.0.0.1:60926] AH01618: user bogus not found: /auth/smart/repo.git/info/refs [Tue Dec 06 16:35:56.627946 2022] [auth_basic:error] [pid 300664:tid 300713] [client 127.0.0.1:37644] AH01617: user user@host: authentication failure for "/auth/smart/repo.git/info/refs": Password Mismatch [Tue Dec 06 16:35:58.185699 2022] [mpm_event:notice] [pid 298654:tid 298654] AH00491: caught SIGTERM, shutting down The LogLevel might need to be adjusted to get more useful output there, perhaps? > If t5551 isn't failing similarly, then presumably the problem is either: > > 1. There is some race or bad interaction between curl and apache > speaking http2 (possibly a buggy version of one?) > > 2. http2 requires using mpm_event instead of mpm_prefork. Maybe this > introduces problems with apache on this system? > > 3. there is a bug in git. ;) This is always a possibility, of course, > but we literally do nothing differently from our end, except pass > it to curl via CURLOPT_HTTP_VERSION. > > It would be interesting to see if doing this: > > diff --git a/t/lib-httpd/apache.conf b/t/lib-httpd/apache.conf > index 0294739a77..b24ca44a95 100644 > --- a/t/lib-httpd/apache.conf > +++ b/t/lib-httpd/apache.conf > @@ -73,16 +73,9 @@ LockFile accept.lock > LoadModule unixd_module modules/mod_unixd.so > </IfModule> > > -<IfDefine HTTP2> > <IfModule !mod_mpm_event.c> > LoadModule mpm_event_module modules/mod_mpm_event.so > </IfModule> > -</IfDefine> > -<IfDefine !HTTP2> > -<IfModule !mod_mpm_prefork.c> > - LoadModule mpm_prefork_module modules/mod_mpm_prefork.so > -</IfModule> > -</IfDefine> > </IfVersion> > > PassEnv GIT_VALGRIND > > also causes t5551 to start failing. If so, then we can blame mpm_event, > and not http2. Good idea. With that applied, I've still not seen a failure in t5551, not even when run via --stress for some minutes. I'm not sure whether any of this points to a bug in Git's http2 code at all. It _seems_ like it's going to be elsewhere, in curl and/or httpd/mod_http2. In other words, your 1 above. If nothing jumps out to point to a possible issue in git, I'll extract a reproduction recipe from the test suite and file a Fedora bug. Maybe the folks who have looked at similar issues in curl and httpd/mod_http2 will spot something. Thanks! -- Todd