On Wed, 27 Feb 2019 at 09:49, SZEDER Gábor <szeder.dev@xxxxxxxxx> wrote: > > Hi Luke, > > I saw rare failures in test 6 'git p4 sync uninitialized repo' in > 't9800-git-p4-basic.sh' on Travis CI, because the 'cleanup_git' > function failed to do its job. The (redacted) trace looks like this: Thanks for the *very* detailed analysis! I think you are right - git-p4 should wait() for all of its children, and that ought to fix this. I think I may have even added the bit of code you mention (about a decade ago now). I'll have a look and see what can be done. Thanks! Luke > > + cleanup_git > + retry_until_success rm -r /home/szeder/src/git/t/trash directory.t9800-git-p4-basic/git > + time_in_seconds > + cd / > + /usr/bin/python -c import time; print(int(time.time())) > + timeout=1551233042 > + rm -r /home/szeder/src/git/t/trash directory.t9800-git-p4-basic/git > + test_must_fail test -d /home/szeder/src/git/t/trash directory.t9800-git-p4-basic/git > test_must_fail: command succeeded: test -d /home/szeder/src/git/t/trash directory.t9800-git-p4-basic/git > + eval_ret=1 > + : > not ok 6 - git p4 sync uninitialized repo > > Trying to reproduce this failure with stock Git can be tricky: I've > seen > > ./t9800-git-p4-basic.sh --stress -r 1,2,6,22 > > fail within the first 10 tries, but it also run overnight without a > single failure... However, the following two-liner patch can reliably > trigger this failure: > > diff --git a/fast-import.c b/fast-import.c > index b7ba755c2b..54715018b3 100644 > --- a/fast-import.c > +++ b/fast-import.c > @@ -3296,6 +3296,7 @@ int cmd_main(int argc, const char **argv) > rc_free[i].next = &rc_free[i + 1]; > rc_free[cmd_save - 1].next = NULL; > > + sleep(1); > start_packfile(); > set_die_routine(die_nicely); > set_checkpoint_signal(); > diff --git a/t/lib-git-p4.sh b/t/lib-git-p4.sh > index b3be3ba011..2d2ef50bfa 100644 > --- a/t/lib-git-p4.sh > +++ b/t/lib-git-p4.sh > @@ -190,6 +190,7 @@ kill_p4d () { > > cleanup_git () { > retry_until_success rm -r "$git" > + sleep 2 > test_must_fail test -d "$git" && > retry_until_success mkdir "$git" > } > > > What's going on is this: 'git p4' invokes 'git fast-import' via > Python's subprocess.Popen(), and then there are two chain of events > racing with each other: > > - In the foreground: > > - 'git p4' notices that there are no p4 branches and die()s as > expected, but leaves its child fast-import behind > - 'test_i18ngrep' quickly verifies that 'git p4' died with the > expected error message > - the cleanup function removes the "$TRASH_DIRECTORY/git" > directory, and > - checks that the directory is indeed gone. > > - Meanwhile in the background: > > - 'git fast-import' starts up, kicks off the dashed external > 'git-fast-import' process, > - which opens a tmp packfile in "$TRASH_DIRECTORY/git" for writing > (the start_packfile() call in the patch context above), creating > any leading directories if necessary, > - notices that its stdin is closed (because 'git p4' died) and it > has nothing left to do, so > - it cleans up and exits. Note that this cleanup only removes the > tmp packfile, but leaves any newly created leading directories > behind. > > While unlikely, it does apparently happen from time to time that the > test's cleanup function first removes "$TRASH_DIRECTORY/git", but then > 'git fast-import' re-creates it for its packfile before the cleanup > function checks the result of the removal. The two well-placed > sleep()s in the patch above force such a problematic scheduling. > > There are 4 test cases running 'test_must_fail git p4 sync': the above > patch triggers a failure in 3 of them, and with a bit of extra modding > I could trigger a failure in the fourth one as well. > > We could work this around by waiting for 'git p4' and all its child > processes in the affected tests themselves, using the same shell > trickery as in ef09036cf3 (t6500: wait for detached auto gc at the end > of the test script, 2017-04-13), but this feels like, well, a > workaround. > > I think the proper solution would be to ensure that 'git p4' kills and > waits for all its child processes when die()ing. Alternatively (or in > addition?), it could perform all the necessary sanity-checking (and > potential die()ing) before starting the 'git fast-import' process in > the first place. > > I've glanced through all subprocess.Popen() callsites in 'git p4' and > found most of them OK, in the sense that they wait for whatever child > process they create. Alas, there was one exception: p4CmdList() can > invoke an optional callback function before wait()ing on its 'p4' > child, and the streamP4FilesCb() callback function can die() without > waiting for the 'p4' process (but it does wait() for 'git > fast-import'!). > > On a related note: this check for the just-removed directory was added > in 23aee4199a (git-p4: retry kill/cleanup operations in tests with > timeout, 2015-11-19), which mentions flaky cleanup operations. > Perhaps this is the same flakiness?! > > Anyway, as I mentioned elsewhere before, I have no idea why/how 'git > p4' works, so I'll leave it up to you how it's best to deal with this > issue... > > > Gábor >