Re: [PATCH 00/15] tests: don't ignore "git" exit codes

Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> · Thu, 03 Mar 2022 16:35:16 +0100

On Thu, Mar 03 2022, Phillip Wood wrote:

> On 03/03/2022 02:02, Derrick Stolee wrote:
>> On 3/2/2022 12:27 PM, Ævar Arnfjörð Bjarmason wrote:
>>> As an aside we still have various potential issues with hidden
>>> segfaults etc. in the test suite after this that are tricky to solve,
>>> because:
>>>
>>>   * Our tests will (mostly) catch segfaults and abort(), but if we
>>>     invoke a command that invokes another command it needs to ferry the
>>>     exit code up to us.
>>>
>>>   * run-command.c notably does not do that, so for e.g. "git push"
>
> I'm not sure what you mean by this, the return value of run_command()
> already indicates which signal, if any, killed the child; see for example
> run_specified_editor() which re-raises SIGINT and SIGQUIT if the
> editor is killed by those signals.

Yes, it returns the correct status code, but that doesn't help with
(pseudo)code like:

	if (run_command("foo")) /* exits with e.g. 123 */
		die("oh no, foo failed"); /* exits with 128 */

I should have said "code using run-command.c does not do that...",
sorry.

I.e. if "pack-objects" or whatever invoked by a "git push" segfaults
we might exit with status 128, not the code that the underlying command
failed with, and thus lose the segfault or abort().
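
To make the loss concrete, here is a minimal, self-contained sketch. It
does not use git's run-command API: run_child() and lossy_parent() are
hypothetical helpers, and the "128 + signal number" encoding is assumed
here because it is what a POSIX shell reports for a signal death.

```c
#include <errno.h>
#include <signal.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (not git's run_command()): fork a child that
 * either dies of "signo" (if non-zero) or exits with "code". Returns a
 * shell-style status: the exit code, or 128 + the signal number.
 */
int run_child(int signo, int code)
{
	int status;
	pid_t pid = fork();

	if (pid < 0)
		return -1;
	if (!pid) {
		if (signo) {
			/* Avoid littering core files from the demo. */
			struct rlimit no_core = { 0, 0 };
			setrlimit(RLIMIT_CORE, &no_core);
			signal(signo, SIG_DFL);
			raise(signo);
		}
		_exit(code);
	}
	while (waitpid(pid, &status, 0) < 0)
		if (errno != EINTR)
			return -1;
	if (WIFSIGNALED(status))
		return 128 + WTERMSIG(status);
	return WEXITSTATUS(status);
}

/*
 * The lossy pattern from the pseudocode above: every non-zero child
 * status collapses into 128, the status a die() would exit with.
 */
int lossy_parent(int signo, int code)
{
	if (run_child(signo, code))
		return 128; /* stands in for die("oh no, foo failed") */
	return 0;
}
```

With these helpers, run_child(SIGSEGV, 0) yields 128 + SIGSEGV, but
lossy_parent(SIGSEGV, 0) and lossy_parent(0, 123) both yield 128, so a
test looking only at the parent cannot tell a crash from an ordinary
failure.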

If you add the appropriate "log_path" to LSAN_OPTIONS and run the test
suite you can see where this fails. I'm adding a mode to test-lib.sh
sooner rather than later (waiting on the outstanding ab/test-leak-diag)
to make it trivial to report on those.

>>>     tests where we expect a failure and an underlying "git" command
>>>     fails we won't ferry up the segfault or abort exit code.
>>
>> Perhaps run-command.c could auto-exit for certain well-known error
>> codes that could only happen on certain kinds of failures (segfault,
>> for example). A simple die() might be something that is expected to
>> be handled by the top-level command in some cases.
>
> I think we need to be careful that run_command() does not re-raise a
> signal before the caller has a chance to do any cleanup. A caller to 
> run_command() can already check the return value and choose to die
> based on that after doing any cleanup. If run_command() starts dying
> we'll end up adding more unsafe signal handlers to do the cleanup.

I think the method I described in
https://lore.kernel.org/git/220303.86fsnz5o9w.gmgdl@xxxxxxxxxxxxxxxxxxx/
in the side-thread doesn't suffer from those problems.

I.e. I think the solution to this is not to interrupt whatever code
calls run-command. We let it, and whatever else the calling program
wants to do, run to completion.

We'll just ignore whatever exit(status) we picked at the very end and
exit instead with the status of a failing child process we invoked,
if any of them returned a status that "test_must_fail" would count as
"failed but BAD!".
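
A rough sketch of that idea, not the code from the linked message: all
three function names are hypothetical, and the status_is_bad()
predicate (126, 127, or 130 through 192) is only an approximation of
what "test_must_fail" rejects, so check it against
t/test-lib-functions.sh before relying on it.

```c
/* Hypothetical state: the first "bad" child status seen, 0 if none. */
static int bad_child_status;

/*
 * ASSUMPTION: approximates the statuses test_must_fail treats as
 * "failed but BAD": 126, 127, or a signal death reported as 130..192.
 */
int status_is_bad(int status)
{
	return status == 126 || status == 127 ||
	       (status > 129 && status <= 192);
}

/*
 * Called wherever a child's status is collected. It only records the
 * status; the caller keeps running and can do its own cleanup.
 */
void note_child_status(int status)
{
	if (!bad_child_status && status_is_bad(status))
		bad_child_status = status;
}

/*
 * Called once at the very end: a recorded bad child status overrides
 * whatever status the program was about to exit with.
 */
int final_exit_status(int requested)
{
	return bad_child_status ? bad_child_status : requested;
}
```

The final exit would then look like exit(final_exit_status(128)): a
parent whose child died of SIGSEGV (status 139 on most platforms) exits
with 139 rather than the 128 it asked for, while ordinary failures such
as a child exiting with 1 leave the requested status untouched.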