Re: [RFC PATCH 5/5] split-index: smudge and add racily clean cache entries to split index

Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> · Thu, 06 Sep 2018 14:26:49 +0200

On Thu, Sep 06 2018, SZEDER Gábor wrote:

> Ever since the split index feature was introduced [1], refreshing a
> split index is prone to a variant of the classic racy git problem.
>
> Consider the following sequence of commands updating the split index
> when the shared index contains a racily clean cache entry, i.e. an
> entry whose cached stat data matches with the corresponding file in
> the worktree and the cached mtime matches that of the index:
>
>   echo "cached content" >file
>   git update-index --split-index --add file
>   echo "dirty worktree" >file    # size stays the same!
>   # ... wait ...
>   git update-index --add other-file
>
> Normally, when a non-split index is updated, then do_write_index()
> (the function responsible for writing all kinds of indexes, "regular",
> split, and shared) recognizes racily clean cache entries, and writes
> them with smudged stat data, i.e. with file size set to 0.  When
> subsequent git commands read the index, they will notice that the
> smudged stat data doesn't match with the file in the worktree, and
> then go on to check the file's content.
>
> In the above example, however, in the second 'git update-index'
> prepare_to_write_split_index() gathers all cache entries that should
> be written to the new split index.  Alas, this function never looks
> out for racily clean cache entries, and since the file's stat data in
> the worktree hasn't changed since the shared index was written, it
> won't be replaced in the new split index.  Consequently,
> do_write_index() doesn't even get this racily clean cache entry, and
> can't smudge its stat data.  Subsequent git commands will then see
> that the index has more recent mtime than the file and that the (not
> smudged) cached stat data still matches with the file in the worktree,
> and, ultimately, will erroneously consider the file clean.
>
> Modify prepare_to_write_split_index() to recognize racily clean cache
> entries, and mark them to be added to the split index.  This way
> do_write_index() will get these racily clean cache entries as well,
> and will then write them with smudged stat data to the new split
> index.
>
> Note that after this change if the index is split when it contains a
> racily clean cache entry, then a smudged cache entry will be written
> both to the new shared and to the new split indexes.  This doesn't
> affect regular git commands: as far as they are concerned this is just
> an entry in the split index replacing an outdated entry in the shared
> index.  It did affect a few tests in 't1700-split-index.sh', though,
> because they actually check which entries are stored in the split
> index; the previous patch made the necessary adjustments.  And racily
> clean cache entries and index splitting are rare enough to not worry
> about the resulting duplicated smudged cache entries, and the
> additional complexity required to prevent them is not worth it.
>
> Several tests failed occasionally when the test suite was run with
> 'GIT_TEST_SPLIT_INDEX=yes'.  Here are those that I managed to trace
> back to this racy split index problem, starting with those failing
> more frequently, with a link to a failing Travis CI build job for
> each.  The highlighted line shows when the racy file was written,
> which is not always in the failing test (but note that those lines are
> in the 'after failure' fold, and your browser might unhelpfully fold
> it up before you could take a good look).

Thanks for working on this. When I package up git I run the tests
under a few different modes. For split index I've been using:

    GIT_TEST_SPLIT_INDEX=true GIT_SKIP_TESTS="t3903 t4015.77"

I skip those because they were the ones I spotted failing under that
mode, but I still had occasional other failures. I don't have a record
of which; maybe some of the other tests you mention, maybe not.

To test how this series improves things, I've been running
this on a 56-core CentOS 7.5 machine:

    while true
    do
        GIT_TEST_SPLIT_INDEX=yes prove -j$(parallel --number-of-cores) \
            t3903-stash.sh t4024-diff-optimize-common.sh \
            t4015-diff-whitespace.sh t2200-add-update.sh t0090-cache-tree.sh &&
        echo "OK $(date) $(git describe)" >>log2 ||
        echo "FAIL $(date) $(git describe)" >>log2
    done

Meanwhile, in another window, to put some load on the machine (these
seem to fail more often under load):

    while true; do prove -j$(parallel --number-of-cores) t[156789]*.sh; done

The results with this series applied up to 4/5, i.e. without the actual
fix:

     92 OK v2.19.0-rc2-6-ged839bd155
      8 FAIL v2.19.0-rc2-6-ged839bd155

I.e. when running this 100 times, I got 8 failures. So 8%.
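
Counts and percentages like these can be derived from log2 directly. A
minimal sketch, assuming every line of log2 has the form written by the
loop above (status in the first field, `git describe` output in the
last); the function names tally_runs and failure_rate are made up for
illustration:

```shell
# Count runs per status and version.  Expects lines of the form
# "OK <date> <git describe>" or "FAIL <date> <git describe>".
tally_runs () {
	# First field is the status, last field is the version string.
	awk '{ print $1, $NF }' "$1" | sort | uniq -c
}

# Print the share of FAIL lines as a percentage with one decimal place.
failure_rate () {
	awk '$1 == "FAIL" { fail++ }
	     END { if (NR) printf "%.1f%%\n", 100 * fail / NR }' "$1"
}
```

Run as `tally_runs log2` and `failure_rate log2`. Note that
failure_rate does not separate versions, so it is only meaningful if
the log holds runs of a single build.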

With this patch applied:

    389 OK v2.19.0-rc2-5-g05a5a13935
     11 FAIL v2.19.0-rc2-5-g05a5a13935

This time I ran the tests 400 times, and got 11 failures, i.e. a
~2.8% failure rate. I don't have a full account of what stuff
failed (this was just scrolling past in my terminal), but most
were:

    t0090-cache-tree.sh          (Wstat: 256 Tests: 21 Failed: 3)
      Failed tests:  10-12
      Non-zero exit status: 1

I.e. these tests:

    ok 10 - commit --interactive gives cache-tree on partial commit
    ok 11 - commit in child dir has cache-tree
    ok 12 - reset --hard gives cache-tree

Then I saw two of these fail, and no other failures:

    t3903-stash.sh               (Wstat: 256 Tests: 90 Failed: 1)
      Failed test:  55
      Non-zero exit status: 1

I.e. this:

    ok 55 - stash branch should not drop the stash if the apply fails

I don't have output from those under -x -v. I'm running them in a loop
now to try to make them fail like that, but no luck yet. Maybe one of
those options "fixes" the race condition, or I'm just unlucky.
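
A loop of that kind could look roughly like the sketch below. The
helper name run_until_failure, the attempt limit, and the output file
name are all made up for illustration; the only things taken from the
test suite are the -x and -v options and the GIT_TEST_SPLIT_INDEX
variable:

```shell
# Run a command repeatedly until it fails or the attempt limit is hit.
# Each run overwrites $out_file, so after a failure the file holds the
# output of the failing run only.
run_until_failure () {
	max=$1 out_file=$2
	shift 2
	i=1
	while [ "$i" -le "$max" ]
	do
		if ! "$@" >"$out_file" 2>&1
		then
			echo "failed on attempt $i, output kept in $out_file"
			return 0
		fi
		i=$((i + 1))
	done
	echo "no failure in $max attempts"
	return 1
}

# Hypothetical usage from git's t/ directory:
#   run_until_failure 1000 t3903.out \
#       env GIT_TEST_SPLIT_INDEX=yes ./t3903-stash.sh -x -v
```

Capturing the output to a file rather than the terminal means the -x
trace of the failing run survives, at the cost of possibly changing the
timing enough to hide the race.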

[Side note: All of the above assumes that running the tests in a loop
 without GIT_TEST_SPLIT_INDEX=yes would pass reliably. I haven't
 actually tested that, but I've never seen one of these transitory
 failures in the past without GIT_TEST_SPLIT_INDEX=yes, so I'm fairly
 sure it would]

So this definitely seems like an improvement, i.e. the transitory
failure rate is much lower now, but it looks like there's still some
race condition related to split index left to solve.

However, one thing that makes me paranoid is that without your patch I
do get failures on t3903-stash.sh, but it's a *different* failure than
the (much less likely to happen) failure after your patch. I.e. I've
only seen it fail like this before:

    t3903-stash.sh               (Wstat: 256 Tests: 90 Failed: 1)
      Failed test:  60
      Non-zero exit status: 1

That's this test, i.e. #60, not the #55 test that occasionally fails
after these patches:

    ok 60 - handle stash specification with spaces

This series doesn't change t3903-stash.sh at all, so this really is a
different failure.

You *do* modify t0090-cache-tree.sh, but tests 10-12, the ones failing
with your patch, come earlier in the file than the new test you added,
so I believe that's a new failure as well. It could just be sampling
bias or bad luck, but I don't have a single failure for
t0090-cache-tree.sh without this patch, and with the patch it's the
most common failure.