Re: [PATCH 4/5] t6022, t6046: test expected behavior instead of testing a proxy for it

Elijah Newren <newren@xxxxxxxxx> · Fri, 13 Mar 2020 10:45:50 -0700

On Fri, Mar 13, 2020 at 10:17 AM SZEDER Gábor <szeder.dev@xxxxxxxxx> wrote:
>
> On Thu, Mar 12, 2020 at 09:01:57PM +0100, SZEDER Gábor wrote:
> > > > > I saw this test fail today in one of my custom CI builds:
> > > > >
> > > > >   +git checkout change
> > > > >   Switched to branch 'change'
> > > > >   +test-tool chmtime =-1 M
> > > > >   +test-tool chmtime --get M
> > > > >   +GIT_MERGE_VERBOSITY=3 git merge change+rename
> > > > >   +test-tool chmtime --get B
> > > > >   +cat old-mtime
> > > > >   +cat new-mtime
> > > > >   +test 1583967731 -lt 1583967731
> > > > >   error: last command exited with $?=1
> > > > >   not ok 12 - merge of identical changes in a renamed file
> > > > >
> > > > > The contents of 'out', i.e. the output of the 'git merge' command
> > > > > before the failure is:
> > > > >
> > > > >   Auto-merging B
> > > > >   Merge made by the 'recursive' strategy.
> > > > >    A => B | 0
> > > > >    1 file changed, 0 insertions(+), 0 deletions(-)
> > > > >    rename A => B (100%)
> > > > >
> > > > > This is a rare failure, this is the first time I saw it, and to make
> > > > > things worse, this one time it happened on big-endian and with all the
> > > > > GIT_TEST_* knobs enabled.
> > > > >
> > > > >   https://travis-ci.org/github/szeder/git-cooking-topics-for-travis-ci/jobs/661294571#L4020
> > > >
> > > > This is very troubling.  The workflow is basically:
> > > >   - Manually set the mtime to a file to something old (I happened to
> > > > pick 1 second before now, but picking something from 1970 would have
> > > > been fine too).
> > > >   - Run a merge which is known to need to overwrite the file.  Your
> > > > output ("Auto-merging B") suggests that we should have been in such a
> > > > case.
> > > >   - Verify that the file was actually updated as expected.  Since the
> > > > OS is supposed to update the mtime when it writes the file, it should
> > > > have set it to something recent, i.e. something *different* than what
> > > > it had before.
> > > >
> > > > So, now I'm left wondering how the mtime possibly could have been not
> > > > updated.  Maybe the file wasn't actually written?  (But if so, why
> > > > didn't other people see the failure?  Or your stress runs not see it?)
> > > >  Or maybe it was written but all file contents and metadata were
> > > > delayed in writing to disk such that a subsequent command still sees
> > > > the old file??  Or maybe it was written but the mtime update was
> > > > delayed and the test was able to check it in that intermediate
> > > > state???  Or perhaps the mtime check before the merge raced with the
> > > > setting of the mtime backwards and got the mtime before it was
> > > > rewound????
> > > >
> > > > I don't have a plausible scenario under which any of these should be
> > > > possible; I'm at a loss.
> > >
> > > Hmm.  Maybe leap seconds, or clock updates via ntp at an unfortunate
> > > time?
> >
> > I'm now fairly confident that 'git merge' is OK, and suspect that it's
> > an issue with Travis CI's s390x environment (multi-architecture
> > support is an alpha-stage feature).
>
> I could finally reproduce the issue on my own machine, so apparently
> it's not an issue in Travis CI's multi-arch environments.
>
> >   test_expect_success 'test' '
> >       touch file &&
> >       old=$(test-tool chmtime --get =-1 file) &&
> >       touch file &&
> >       new=$(test-tool chmtime --get file) &&
> >       test $old -lt $new
> >   '
>
> I wanted to report the issue to Travis CI, and in order to do so I
> turned the above test case into a script that doesn't at all depend on
> out test framework and 'test-tool' but uses only coreutils commands
> (with a whole lot of GNUisms...):
>
>   cat >timestamp.sh <<-\EOF
>   #!/bin/sh
>
>   set -ex
>
>   i=0
>   while true
>   do
>         printf "$i\r"
>         i=$((i + 1))
>
>         # set a file's mtime to one second ago
>         now=$(date "+%s.%N")
>         one_sec_ago=$(date -d "@$((${now%.*} - 1))" "+%Y%m%d%H%M%S")
>         touch -t ${one_sec_ago%??}.${one_sec_ago#????????????} file
>         # save its actual mtime
>         old=$(date -r file "+%s.%N")
>         # set its mtime to now
>         touch file
>         # the current the mtime should be different, but sometimes it isn't
>         new=$(date -r file "+%s.%N")
>         test "${old%.*}" != "${new%.*}"
>   done
>   EOF
>   chmod u+x timestamp.sh
>   ./timestamp.sh 2>out
>   echo
>   tail -n11 out
>
> And this script usually fails after a few hundred iterations on my
> machine:
>
>   308
>   + date +%s.%N
>   + now=1584118408.002458987
>   + date -d @1584118407 +%Y%m%d%H%M%S
>   + one_sec_ago=20200313175327
>   + touch -t 202003131753.27 file
>   + date -r file +%s.%N
>   + old=1584118407.000000000
>   + touch file
>   + date -r file +%s.%N
>   + new=1584118407.997464837
>   + test 1584118407 != 1584118407
>
> Note that the mtime update at the end of the iteration results in an
> mtime that is _before_ the current time at the beginning of the
> iteration.  Well, I'm puzzled :)

Interesting...

>From https://stackoverflow.com/questions/14392975/timestamp-accuracy-on-ext4-sub-millsecond:
'''
date will call the gettimeofday system call which will always return
the most accurate time available based on the cached kernel time,
adjusted by the CPU cycle time if available to give nanosecond
resolution. The timestamps stored in the file system however, are only
based on the cached kernel time. ie The time calculated at the last
timer interrupt.
'''

also, from https://apenwarr.ca/log/20181113 says:

'''
Does mtime get set to >= the current time?

No, this depends on clock granularity. For example, gettimeofday() can
return times in microseconds on my system, but ext4 rounds timestamps
down to the previous ~10ms (but not exactly 10ms) increment, with the
surprising result that a newly-created file is almost always created
in the past:

  $ python -c "
  import os, time
  t0 = time.time()
  open('testfile', 'w').close()
  print os.stat('testfile').st_mtime - t0
  "

  -0.00234484672546
'''

So, perhaps instead of setting the file mtime relative to the current
clock (and truncated to the nearest integer), we need to set the file
mtime relative to its recorded mtime.

> Yes, setting the old timestamp 2 seconds in the past seems to be
> sufficient, at least I haven't seen it fail in a few 100k repetitions.

Or yeah, we could just use bigger numbers, but given that wall clock
and file-system clock are apparently different, I like the idea of
setting them relative to the mtimes that existed before.  Looks like
`test-tool chmtime` supports that too, so I'll cook up a patch.