Re: [PATCH 04/30] directory rename detection: basic testcases

Elijah Newren <newren@xxxxxxxxx> · Mon, 13 Nov 2017 16:57:31 -0800

On Mon, Nov 13, 2017 at 2:04 PM, Stefan Beller <sbeller@xxxxxxxxxx> wrote:
> On Fri, Nov 10, 2017 at 11:05 AM, Elijah Newren <newren@xxxxxxxxx> wrote:
> (minor thought:)
> After rereading the docs above this is clear; I wonder if instead of A, B, C
> a notation of Base, ours, theirs would be easier to understand?

Sure, that'd be an easy change.

>> +test_expect_success '1a-setup: Simple directory rename detection' '
>> +test_expect_failure '1a-check: Simple directory rename detection' '
>
> Thanks for splitting the setup and the check into two different test cases!
>
>
>> +       git checkout B^0 &&
>
> Any reason for ^0 ? (to make clear it is a branch?)

Partially force-of-habit (did the same with t6036 and t6042), but it
seemed to have the nice feature of making debugging easier through
improved reproducability.  In particular, if I had checked out B
rather than B^0 in the testcase and a merge succeeded when I didn't
expect it, then attempting to run the same commands gives me a
different starting point for the merge.  By instead explicitly
checking out B^0, then even if the merge succeeds, someone who again
runs checkout B^0 will have the same starting point.

>> +test_expect_success '1b-setup: Merge a directory with another' '
>> +       git rm -rf . &&
>> +       git clean -fdqx &&
>> +       rm -rf .git &&
>> +       git init &&
>
> This is quite a strong statement to start a test with.

It was actually copy-paste from t6036, and achieved two purposes:
  1) Even if one test fails, subsequent ones continue running.  (Had
lots of problems with this in t6036 years ago and just ended up with
those four steps)
  2) Someone who runs into a failing testcase has a _much_ easier time
figuring out what is going on with the testcase.  I find it takes a
fair amount of time to figure out what's going on with other tests in
git's testsuite because of the presence of so many files completely
unrelated to the given test, which have simply accumulated from
previous tests.  For many tests, that may be fine, but in particular
for t6036, t6042, and now t6043, since these are mostly about corner
cases that are hard enough to reason about, I didn't want the extra
distractions.

but...

> Nowadays we rather do
>
>     test_when_finished "some command" &&
>
> than polluting the follow up tests. But as you split up the previous test
> into 2 tests, it is not clear if this would bring any good.

test_when_finished looks pretty cool; I didn't know about it.  Thanks
for the pointer.  Not sure if we want to use it here (if we do, we'd
only do so in the check tests rather than the setup ones).

> Also these are four cleanup commands, I'd have expected fewer.
> Maybe just "rm -rf ." ? Or as we make a new repo for each test case,
>
>     test_create_repo 1a &&
>     cd 1a
>
> in the first setup, and here we do
>     test_create_repo 1b &&
>     cd 1b
>
> relying on test_done to cleanup everything afterwards?

rm aborts for me with 'rm -rf .', but I could possibly make it 'rm -rf
* .* && git init .'

The test_create_repo might not be so bad as long as every test used it
and put all files under it's own little repo.  It does create some
clutter, but it's at least somewhat managed.  I'm still a bit partial
to just totally cleaning it out, but if others would prefer, I can
switch.

>> +       test 3 -eq $(git ls-files -s | wc -l) &&
>
>     git ls-files >out &&
>     test_line_count = 3 out &&
>
> maybe? Piping output of git commands somewhere is an
> anti-pattern as we cannot examine the return code of ls-files in this case.

Um...I guess that makes sense, if you assumed that I cared about the
return code of ls-files.  But it seems to make the tests with multiple
calls to ls-files in a row (of which there are many) considerably
uglier, so I'd personally prefer to avoid that.  Also, why would I
care about the return code of ls-files?

Your suggestion made me curious, so I went looking.  As far as I can
tell, the return code of ls-files is always 0 unless you both specify
both --error-unmatch and one or more paths, neither of which I did.
Am I missing something?

If you feel the return code of ls-files is important, perhaps I could
just have a separate
   git ls-files -s >/dev/null &&
call before the others?

>> +       test $(git rev-parse HEAD:y/b) = $(git rev-parse A:z/b) &&
>> +       test $(git rev-parse HEAD:y/c) = $(git rev-parse A:z/c) &&
>> +       test $(git rev-parse HEAD:y/d) = $(git rev-parse A:x/d) &&
>
> Speaking of these, there are quite a lot of invocations of rev-parse,
> though it looks clean; recently Junio had some good ideas regarding a
> similar test[1].
> So maybe
>
>   git rev-parse >actual \
>     HEAD:y/b  HEAD:y/c HEAD:y/d &&
>   git rev-parse >expect \
>     A:z/b    A:z/c    A:x/d &&
>   test_cmp expect actual
>
> Not sure if that is any nicer, but has fewer calls.

Sure, I can switch it over.

>> +       test_must_fail git rev-parse HEAD:x/d &&
>> +       test_must_fail git rev-parse HEAD:z/d &&
>> +       test ! -f z/d
>> +'
>> +
>> +# Testcase 1d, Directory renames (merging two directories into one new one)
>> +#              cause a rename/rename(2to1) conflict
>> +#   (Related to testcases 1c and 7b)
>> +#   Commit A. z/{b,c},        y/{d,e}
>> +#   Commit B. x/{b,c},        y/{d,e,m,wham}
>> +#   Commit C. z/{b,c,n,wham}, x/{d,e}
>> +#   Expected: x/{b,c,d,e,m,n}, CONFLICT:(y/wham & z/wham -> x/wham)
>> +#   Note: y/m & z/n should definitely move into x.  By the same token, both
>> +#         y/wham & z/wham should to...giving us a conflict.
>
> If wham are equal (in oid), shouldn't this not conflict and only conflict
> when z/wham and x/wham differ in oid, but have the same sub-path?

Good eyes, I should have labelled these as wham_1 and wham_2, since
the testcase did explicitly make them different (having contents of
"wham1\n" and "wham2\n"), but it could make sense to add a test to the
testsuite where two such colliding files are identical.

I thought about that, figured it wasn't worth it, and didn't add the
code to handle it.  I could add a testcase for it, though I'd be very
tempted to just leave it as test_expect_failure and let someone else
come along and implement it if someone feels so inclined.  Note,
though, that the situation is made slightly more complex due to the
fact that it might be N colliding files rather than just 2 (due to the
possibility of multiple directories being merged into one on either
side of history).

>> +
>> +# Testcase 1e, Renamed directory, with all filenames being renamed too
>> +#   Commit A: z/{oldb,oldc}
>> +#   Commit B: y/{newb,newc}
>> +#   Commit C: z/{oldb,oldc,d}
>
> What about oldd ?
> (expecting a pattern across many files of s/old/new/ isn't unreasonable,
> but maybe too much for now?)
> By having no "old" prefix there is only one thing to do, which is y/d

I'm not following.  The "old" and "new" in the filenames were just
there so a human reading the testcase could easily tell which
filenames were related and involved in renames.  There is no rename
associated with d, so why would I have called it "old" or "new"?

>> +# Testcase 1f, Split a directory into two other directories
>> +#   (Related to testcases 3a, all of section 2, and all of section 4)
>> +#   Commit A: z/{b,c,d,e,f}
>> +#   Commit B: z/{b,c,d,e,f,g}
>> +#   Commit C: y/{b,c}, x/{d,e,f}
>> +#   Expected: y/{b,c}, x/{d,e,f,g}
>
> Why not y/g ? Because there are more files in x?

Yes.

> I can come up with a reasonable counter example:

Oh sure, of course.  You should also be able to come up with
counter-examples to the "correctness" of git's content merging of
files when it fails to report any conflicts (e.g. one side added
another call to a function, the other side changed the number of
parameters the function took).  In short, this is just a case of where
we need to come up with simple predictable rules since we can't read
minds.  Here's the situation we're faced with, and my line of thinking
in coming up with this simple, predictable, but definitely coarse
rule:

  * Users sometimes rename directories on one side of history and add
files to the original directory on the other side.
  * We would like to "detect" the directory rename, and move the new
files into the correct directory.
  * We don't really have any hints for detecting directory renames
other by comparing content, i.e. basing it on the file rename
detection we already have.
  * There is a possibility that not all files in a certain directory
went to the same location.  It's possible that a few may have gone
elsewhere.
  * Only in weird corner cases would I expect renamed-file location to
be split nearly 50-50 (as in this contrived testcase) -- although for
such cases, as you point out, there is a much higher chance of us
getting the merge "wrong".  Thus, our rule should be simple so people
can understand and expect what we did and have an easy time fixing it.

So, what to do?  There are a few options:
  1) Don't do directory rename detection at all.  Just declare it to
be an anti-feature.
  2) Try to guess why the user moved different files to different
directories and try to mimic that somehow.
  3) Only allow directory rename detection to happen when ALL renamed
files in the directory went to the same directory.
  4) Use a simple predictable rule like majority wins.

I think 2 is insanity.  Choices 1 and 3 don't have much appeal to me;
I'm strongly against them.  I'm unware of any remaining choices other
than 4, but 4 seems pretty reasonable to me.  It won't always be
right, but neither is simple content merging.