On Thu, Sep 06, 2018 at 02:26:49PM +0200, Ævar Arnfjörð Bjarmason wrote:
On Thu, Sep 06 2018, SZEDER Gábor wrote:
Several tests failed occasionally when the test suite was run with
'GIT_TEST_SPLIT_INDEX=yes'. Here are those that I managed to trace
back to this racy split index problem, starting with those failing
more frequently, with a link to a failing Travis CI build job for
each. The highlighted line shows when the racy file was written,
which is not always in the failing test (but note that those lines are
in the 'after failure' fold, and your browser might unhelpfully fold
it up before you could take a good look).
Thanks for working on this. When I package up git I run the tests
under a few different modes, in the case of split index I've been
doing:
GIT_TEST_SPLIT_INDEX=true GIT_SKIP_TESTS="t3903 t4015.77"
Yeah, I noticed that you mentioned this in an unrelated thread the
other day, that's why I put you on Cc. ... and that's why I pulled
this series from the backburner and cleaned it up for submission.
(Gah, most of these commits have an author date back in February...)
Since those were the ones I spotted failing under that mode, but
I still had occasional other failures, I don't have a record of
which, maybe some of these other tests you mention, maybe not.
I poked around the Travis CI API, and managed to get the logs of all
build jobs that failed with 'GIT_TEST_SPLIT_INDEX=yes' but succeeded
without it. Here is the list of failed test scripts, along with how
many times they failed:
1 t0027-auto-crlf.sh
1 t0090-cache-tree.sh
1 t3404-rebase-interactive.sh
1 t5520-pull.sh
1 t5573-pull-verify-signatures.sh
1 t5608-clone-2gb.sh
1 t7406-submodule-update.sh
2 t2200-add-update.sh
2 t4002-diff-basic.sh
2 t5504-fetch-receive-strict.sh
3 t0028-working-tree-encoding.sh
4 t1000-read-tree-m-3way.sh
6 t4015-diff-whitespace.sh
10 t4024-diff-optimize-common.sh
17 t3030-merge-recursive.sh
17 t3402-rebase-merge.sh
17 t3501-revert-cherry-pick.sh
17 t6022-merge-rename.sh
17 t6032-merge-large-rename.sh
17 t6034-merge-rename-nocruft.sh
17 t6042-merge-rename-corner-cases.sh
17 t6043-merge-rename-directories.sh
17 t6046-merge-skip-unneeded-updates.sh
17 t7003-filter-branch.sh
17 t7601-merge-pull-config.sh
23 t3903-stash.sh
I doubt that this racy split index problem is responsible for all
these failures; e.g. I looked at the failures of a few merge-related
test scripts, and these logs show that they tend to fail because of
memory corruption, i.e. with 'Aborted (core dumped)' or 'Segmentation
fault (core dumped)'.
To test how this this series improves things, I've been running
this on a 56 core CentOS 7.5 machine:
while true; do GIT_TEST_SPLIT_INDEX=yes prove -j$(parallel --number-of-cores) t3903-stash.sh t4024-diff-optimize-common.sh t4015-diff-whitespace.sh t2200-add-update.sh t0090-cache-tree.sh && echo "OK $(date) $(git describe)" >>log2 || echo "FAIL $(date) $(git describe)" >>log2; done
While, in another window to get some load on the machine (these seem to
fail more under load):
while true; do prove -j$(parallel --number-of-cores) t[156789]*.sh; done
The results with this series applied up to 4/5. I.e. without the actual
fix:
92 OK v2.19.0-rc2-6-ged839bd155
8 FAIL v2.19.0-rc2-6-ged839bd155
I.e. when running this 100 times, I got 8 failures. So 8%.
Lucky you ;)
I could only get t3903 to fail on me, but not the others. That was
enough to eventually track down and fix the problem (fun! ;), and then
I looked at the logs of failed git/git Travis CI build jobs to see,
what other failures might have been caused by it.
With this patch applied:
389 OK v2.19.0-rc2-5-g05a5a13935
11 FAIL v2.19.0-rc2-5-g05a5a13935
This time I ran the tests 400 times, and got 11 failures, i.e. a
~2.8% failure rate. I don't have a full account of what stuff
failed (this was just scrolling past in my terminal), but most
were:
t0090-cache-tree.sh (Wstat: 256 Tests: 21 Failed: 3)
Failed tests: 10-12
Non-zero exit status: 1
I.e. these tests:
ok 10 - commit --interactive gives cache-tree on partial commit
ok 11 - commit in child dir has cache-tree
ok 12 - reset --hard gives cache-tree
But hey, 't0090 --verbose-log -x -i' just failed on me with the fix
applied while writing this email, yay! In its logs I see the
following errors in all three tests you mention:
error: index uses ?½þA extension, which we do not understand
fatal: index file corrupt
Test 13 then starts with 'rm -f .git/index', and then all is well for
the remaining tests.
There was a recent discussion about a similar error starting at:
https://public-inbox.org/git/20180901214157.hxlqmbz3fds7hsdl@ltop.local/
and leading to a fix in 6c003d6ffb (reopen_tempfile(): truncate opened
file, 2018-09-04). I certainly hope that not my fix is buggy, but
combined with 'GIT_TEST_SPLIT_INDEX=yes' it can occasionally trigger
the same error, and Peff's (Cc'd) fix will resolve it as well.
Could you run your stress tests with Peff's fix merged with mine?
Apparetly your setup is much more capable of triggering these issues
than mine...