Re: [PATCH 04/30] directory rename detection: basic testcases

Stefan Beller <sbeller@xxxxxxxxxx> · Mon, 13 Nov 2017 17:21:49 -0800

On Mon, Nov 13, 2017 at 4:57 PM, Elijah Newren <newren@xxxxxxxxx> wrote:
> On Mon, Nov 13, 2017 at 2:04 PM, Stefan Beller <sbeller@xxxxxxxxxx> wrote:
>> On Fri, Nov 10, 2017 at 11:05 AM, Elijah Newren <newren@xxxxxxxxx> wrote:
>> (minor thought:)
>> After rereading the docs above this is clear; I wonder if instead of A, B, C
>> a notation of Base, ours, theirs would be easier to understand?
>
> Sure, that'd be an easy change.
>
>>> +test_expect_success '1a-setup: Simple directory rename detection' '
>>> +test_expect_failure '1a-check: Simple directory rename detection' '
>>
>> Thanks for splitting the setup and the check into two different test cases!
>>
>>
>>> +       git checkout B^0 &&
>>
>> Any reason for ^0 ? (to make clear it is a branch?)
>
> Partially force-of-habit (did the same with t6036 and t6042), but it
> seemed to have the nice feature of making debugging easier through
> improved reproducability.  In particular, if I had checked out B
> rather than B^0 in the testcase and a merge succeeded when I didn't
> expect it, then attempting to run the same commands gives me a
> different starting point for the merge.  By instead explicitly
> checking out B^0, then even if the merge succeeds, someone who again
> runs checkout B^0 will have the same starting point.

Thanks for enlightening me. Makes sense.

>
>>> +test_expect_success '1b-setup: Merge a directory with another' '
>>> +       git rm -rf . &&
>>> +       git clean -fdqx &&
>>> +       rm -rf .git &&
>>> +       git init &&
>>
>> This is quite a strong statement to start a test with.
>
> It was actually copy-paste from t6036, and achieved two purposes:
>   1) Even if one test fails, subsequent ones continue running.  (Had
> lots of problems with this in t6036 years ago and just ended up with
> those four steps)
>   2) Someone who runs into a failing testcase has a _much_ easier time
> figuring out what is going on with the testcase.  I find it takes a
> fair amount of time to figure out what's going on with other tests in
> git's testsuite because of the presence of so many files completely
> unrelated to the given test, which have simply accumulated from
> previous tests.  For many tests, that may be fine, but in particular
> for t6036, t6042, and now t6043, since these are mostly about corner
> cases that are hard enough to reason about, I didn't want the extra
> distractions.

I agree with these reasons to strongly want a clean slate.

>> Nowadays we rather do
>>
>>     test_when_finished "some command" &&
>>
>> than polluting the follow up tests. But as you split up the previous test
>> into 2 tests, it is not clear if this would bring any good.
>
> test_when_finished looks pretty cool; I didn't know about it.  Thanks
> for the pointer.  Not sure if we want to use it here (if we do, we'd
> only do so in the check tests rather than the setup ones).
>
>> Also these are four cleanup commands, I'd have expected fewer.
>> Maybe just "rm -rf ." ? Or as we make a new repo for each test case,
>>
>>     test_create_repo 1a &&
>>     cd 1a
>>
>> in the first setup, and here we do
>>     test_create_repo 1b &&
>>     cd 1b
>>
>> relying on test_done to cleanup everything afterwards?
>
> rm aborts for me with 'rm -rf .', but I could possibly make it 'rm -rf
> * .* && git init .'
>
> The test_create_repo might not be so bad as long as every test used it
> and put all files under it's own little repo.

That is my current favorite, I would think.

>  It does create some
> clutter, but it's at least somewhat managed.

But the clutter is outside the repository under test, which
may help with the situation.

> I'm still a bit partial
> to just totally cleaning it out, but if others would prefer, I can
> switch.

(slightly dreaming:)
I wonder if we could teach our test suite to accept multiple test_done
or restart_tests or such to resurrect the clean slate.

>
>>> +       test 3 -eq $(git ls-files -s | wc -l) &&
>>
>>     git ls-files >out &&
>>     test_line_count = 3 out &&
>>
>> maybe? Piping output of git commands somewhere is an
>> anti-pattern as we cannot examine the return code of ls-files in this case.
>
> Um...I guess that makes sense, if you assumed that I cared about the
> return code of ls-files.

As this is the test suite, we care about the return code of any git
command, for current git as well as future-gits.

>  But it seems to make the tests with multiple
> calls to ls-files in a row (of which there are many) considerably
> uglier, so I'd personally prefer to avoid that.  Also, why would I
> care about the return code of ls-files?

While I understand the rationale here and your examination of ls-files
seems to indicate that we are ok doing it here, a lot of (test) code
is taken for inspiration (copied, adapted) to other test cases.
And most of the time we actually care if the return code is correct
additionally to the actions performed, so I was shooting from the hip
here.

> Your suggestion made me curious, so I went looking.  As far as I can
> tell, the return code of ls-files is always 0 unless you both specify
> both --error-unmatch and one or more paths, neither of which I did.
> Am I missing something?

I am not saying it was a cargo-culting reaction, but rather relaying
a well discussed style issue to you. ;)

> If you feel the return code of ls-files is important, perhaps I could
> just have a separate
>    git ls-files -s >/dev/null &&
> call before the others?

I would prefer to not add any further calls; also (speaking generically)
this would bring in potential racing issues (what if the second ls-files
behaves differently than the first?)

>>> +       test $(git rev-parse HEAD:y/b) = $(git rev-parse A:z/b) &&
>>> +       test $(git rev-parse HEAD:y/c) = $(git rev-parse A:z/c) &&
>>> +       test $(git rev-parse HEAD:y/d) = $(git rev-parse A:x/d) &&
>>
>> Speaking of these, there are quite a lot of invocations of rev-parse,
>> though it looks clean; recently Junio had some good ideas regarding a
>> similar test[1].
>> So maybe
>>
>>   git rev-parse >actual \
>>     HEAD:y/b  HEAD:y/c HEAD:y/d &&
>>   git rev-parse >expect \
>>     A:z/b    A:z/c    A:x/d &&
>>   test_cmp expect actual
>>
>> Not sure if that is any nicer, but has fewer calls.
>
> Sure, I can switch it over.

Well, that was also just a quick suggestion, maybe we'll find an
even nicer way that has also very few invocations.

>>> +
>>> +# Testcase 1e, Renamed directory, with all filenames being renamed too
>>> +#   Commit A: z/{oldb,oldc}
>>> +#   Commit B: y/{newb,newc}
>>> +#   Commit C: z/{oldb,oldc,d}
>>
>> What about oldd ?
>> (expecting a pattern across many files of s/old/new/ isn't unreasonable,
>> but maybe too much for now?)
>> By having no "old" prefix there is only one thing to do, which is y/d
>
> I'm not following.  The "old" and "new" in the filenames were just
> there so a human reading the testcase could easily tell which
> filenames were related and involved in renames.  There is no rename
> associated with d, so why would I have called it "old" or "new"?

because a user may be impressed by gits pattern matching in the
rename detection:

 A: z/{oldb,oldc}
 B: z/{newb,newc}
 C: z/{oldb, oldc, oldd}

Obviously A->B is z/{old->new}-* (not a directory rename,
but just patterns), so one might be tempted to expect newd
as the expectation. But that is nonsense(!?)

>
>   * Users sometimes rename directories on one side of history and add
> files to the original directory on the other side.
>   * We would like to "detect" the directory rename, and move the new
> files into the correct directory.
>   * We don't really have any hints for detecting directory renames
> other by comparing content, i.e. basing it on the file rename
> detection we already have.
>   * There is a possibility that not all files in a certain directory
> went to the same location.  It's possible that a few may have gone
> elsewhere.
>   * Only in weird corner cases would I expect renamed-file location to
> be split nearly 50-50 (as in this contrived testcase) -- although for
> such cases, as you point out, there is a much higher chance of us
> getting the merge "wrong".  Thus, our rule should be simple so people
> can understand and expect what we did and have an easy time fixing it.

Makes sense

>
> So, what to do?  There are a few options:
>   1) Don't do directory rename detection at all.  Just declare it to
> be an anti-feature.
>   2) Try to guess why the user moved different files to different
> directories and try to mimic that somehow.
>   3) Only allow directory rename detection to happen when ALL renamed
> files in the directory went to the same directory.
>   4) Use a simple predictable rule like majority wins.

I wonder if eventually (down the road, not now) we would want to
also teach custom merge/diff drivers about potential dir renaming.
(Well not these drivers itself, but rather a method by which a custom diff
driver can tell us to use another custom rule for thees rename detections)

>
> I think 2 is insanity.

or the place where hooks/custom code shines. :)

> Choices 1 and 3 don't have much appeal to me;
> I'm strongly against them.  I'm unware of any remaining choices other
> than 4, but 4 seems pretty reasonable to me.  It won't always be
> right, but neither is simple content merging.

That makes sense, thanks!
Stefan