Re: Weird (seemingly flakey) p4 breakage in t9833

SZEDER Gábor <szeder.dev@xxxxxxxxx> · Wed, 6 Feb 2019 12:11:11 +0100

On Wed, Feb 06, 2019 at 11:33:21AM +0100, Johannes Schindelin wrote:
> Hi Luke,
> 
> in a private Azure Pipeline (sorry...) I noticed an intermittent problem
> in the p4 tests on osx-gcc.
> 
> I would point you to a public log, but the Azure Pipelines support *just*
> made it to next, so I *just* set up a public one targeting anything else
> than my `vsts-ci` branch, at
> https://dev.azure.com/gitgitgadget/git/_build/index?definitionId=6. And
> those builds do not show that problem, so it must be a flakey test.
> 
> But maybe you can spot anything familiar from the log?
> 
> -- snip --
> [...]
> ++ P4TICKETS='/Users/vsts/agent/2.146.0/work/1/s/t/trash
> directory.t9833-errors/cli/tickets'
> ++ P4USER=short_expiry_user
> ++ echo password
> ++ p4 login
> Enter password: 
> User short_expiry_user logged in.
> Perforce db files in '.' will be created if missing...
> ++ cd '/Users/vsts/agent/2.146.0/work/1/s/t/trash
> directory.t9833-errors/git'
> ++ git p4 sync
> ++ true
> +++ time_in_seconds
> +++ cd /
> +++ /usr/bin/python -c 'import time; print(int(time.time()))'
> ++ test 1549411312 -gt 1549411605
> ++ sleep 1
> Perforce db files in '.' will be created if missing...
> failure accessing depot: perforce ticket expires in 1 seconds
> Performing incremental import into refs/remotes/p4/master git branch
> Depot paths: //depot/
> error: last command exited with $?=1
> ++ true
> +++ time_in_seconds
> +++ cd /
> +++ /usr/bin/python -c 'import time; print(int(time.time()))'
> ++ test 1549411314 -gt 1549411605
> ++ sleep 1
> not ok 6 - git operation with expired ticket
> #	
> #		P4TICKETS="$cli/tickets" &&
> #		P4USER=short_expiry_user &&
> #		echo "password" | p4 login &&
> #		(
> #			cd "$git" &&
> #			git p4 sync &&
> #			sleep 5 &&
> #			test_must_fail git p4 sync 2>errmsg &&
> #			grep "failure accessing depot" errmsg
> #		)
> #	
> -- snap --

I saw this on Travis CI a couple of times, and looked into it, though
I have no idea how p4 is supposed to work...  anyway, my theory is:

The previous test 'create group with short ticket expiry' creates a
"ticket" with 3 seconds expiration time, to be used in this failing
one.  This timeout might be just a bit too short when running the test
under high load, and it takes long enough to reach the first 'git p4
sync', so long that the timeout is (almost?) up, and then 'git p4'
errors out.

I'm not sure what the proper solution would be (assuming that my
theory is right, that is): increasing the ticket timeout to a "must be
surely long enough" value would require longer 'sleep' in this test,
which is not good.  I wonder why that failing 'git p4 sync' is
necessary in the first place, and whether it's really necessary to
test ticket expiration, but then again: I have no idea how p4 works.

On a related note, I think it would be better if these two tests were
squashed into one, so we would get the whole picture.

> BTW I find it very odd to see a `sleep 1` in the trace but not in the
> snippet (there is only a `sleep 5` instead, which I fail to see in the
> trace). Odd?

Intentional.  p4d seems to be prone to hang, to circumvent this
start_p4d() from 'lib-git-p4.sh' starts a watchdog process in the
background to kill it after a long-enough timeout is up.  These three
lines in the log:

> +++ /usr/bin/python -c 'import time; print(int(time.time()))'
> ++ test 1549411312 -gt 1549411605
> ++ sleep 1

come from that background process.

A couple of cleanup patches on top of your 'test_atexit' will
eventually get rid of it...  once I get around to clean them up :)