Re: RFC - kernel test result specification (KTAP)

Frank Rowand <frowand.list@xxxxxxxxx> · Thu, 3 Feb 2022 20:44:59 -0600

On 8/30/21 2:58 PM, Kees Cook wrote:
> On Mon, Aug 30, 2021 at 05:48:07PM +0000, Tim.Bird@xxxxxxxx wrote:
>> From: Kees Cook <keescook@xxxxxxxxxxxx>
>>>> Test case result lines
>>>> ----------------------
>>>>
>>>> Test case result lines must have the format:
>>>>
>>>>     <result> <number> [-] [<description>] [<directive>] [<diagnostic data>]
>>>
>>> "[<diagnostic data>]" is not defined below. I think what you mean is
>>> "directive details".
>>
>> IMHO it's useful to keep this called 'diagnostic data'.
>>
>> The directive is optional, and having this be diagnostic data means you can
>> put explanatory text, or result details, that are independent of the description,
>> on the same line as the result.  This is convenient in some cases.
>>
>> I think diagnostic data should always be prefixed by a hash sign.
>> Or, in the case where diagnostic data appears on the result line,
>> the hash serves as the delimiter.  This means that the description
>> text must not include a hash character.
>>
>> For example:
>> ok 5 - check return code # rcode=0
>> or
>> not ok 5 - check return code # rcode=17
> 
> The directive is currently used to answer the question "why?" as in,
> "why is this test not ok?" ("a timeout occurred") or "why is this test
> ok?" ("a SKIP")
> 
> The "supporting evidence" for the "why?" answer are specific details
> about that: "the exit was was NNN" or "the timeout was YYY seconds".
> 
> Test diagnostics are all the lines preceding that (currently prefixed
> with a "#").
> 
>>> Yes, though the optional "- " is strictly part of the optional
>>> description.
>>
>> It's mildly annoying that "-" is optional.  It's trivial to deal with in the parser
>> to just ignore it if it's present.  But it has no semantic meaning whatsoever.
>> IMHO it would be nice to either mandate it or remove it, for consistency's sake.
>> This could be based solely on the consensus for whether it added or detracted
>> from human readability, since parsers don't care.
> 
> I have no strong opinion on the "-". I was surprised to encounter it
> as it's not part of the TAP 13 spec. I would prefer to drop it, if I had
> to choose.
> 
>> ERROR usually means that there was some kind of failure
>> in the testing process that makes the testcase result indeterminate.
>> (like loss of connection to a server, or crash of the program executing
>> the testcase).  This can be due to a bug in the test or the test system.
>> I'm, not sure if ERROR is used this way in kselftest.  It is, IMHO, a useful
>> distinction to make.  TIMEOUT is a special case of ERROR, IMHO.
> 
> Well, I think TIMEOUT is a "why not ok?" answer. "not ok" seems to be
> the same as ERROR, but I guess it's really a catch-all, but must come
> with the "evidence" portion:
> 
> not ok 1 description goes here # TIMEOUT 45 seconds with no results
> not ok 1 description goes here # ERROR lost connection to server
> not ok 1 description goes here # ERROR non-zero exit code: NNN
> 
> I might go so far as to say the directive is _required_ for the "not ok"
> state.
> 
>> As an aside, I've never liked XFAIL or XPASS.  Semantically, I believe they
>> are quite confusing, and depend on the level of testcase execution.  XPASS is also
>> redundant.  Isn't the default case always XPASS?
> 
> Nothing currently uses XPASS, and we should remove it.
> 
> XFAIL capture the condition of the test logic in that a test for
> something arm64-specific on a non-arm64 machine is _not_ a "pass". It
> will fail. But it's an expected failure. And, as a directive, must come
> with the evidence bit:
> 
> not ok 1 PAN catches bad accessg # XFAIL PAN CPU feature is arm64 only
> 
> As a test author, I want to see this as distinct from failure and
> success.
> 
>> In the case of XFAIL, does it mean that an operation that was expected to fail
>> all the time, failed, and so the test passed?  Or does it mean that this is a known
>> bug that a particular developer or site has decided to ignore (as in "we know this
>> is a bug, but we expected it to fail, so just ignore it for now).  The latter, I really
>> don't think should be encoded in the tests themselves, but should be put
>> into a layer above the test and parser, that is applied according to developer
>> or testlab policy.
> 
> I agree that "just ignore it for now" shouldn't be there.
> 
> Other arguments are that xfail tests shouldn't be run at all, but I
> don't think that's right because it makes it very hard to do large-scale
> test result comparisons because random stuff is missing, depending on
> various configs, archs, etc. It blocks a 1-to-1 comparison, and begs
> questions like "why is the test here in one case and not in another?"
> where as an XFAIL will encode the reason it is an XFAIL in its output,
> providing a self-documenting test result.
> 
>>> I still think there should be a distinction between "diagnostic lines"
>>> and "unknown lines". For example, if kselftest is running on a console,
>>> the dmesg output may be intermixed, and possibly temporally offset. Such
>>> lines may not be useful, and may not be strictly correlated, where-as
>>> the output from kselftest is at least self-consistent.
>>
>> I think this would be a very good distinction to make.  You might have
>> unknown lines that (inadvertently?) start with a hash, but trying to 
>> make sure that the test itself always outputs diagnostic data with a preceding
>> hash gives a reader potentially more information about where the message
>> came from, and could be useful.
> 
> Right, so lines with a hash shouldn't be machine-parsed. (And this is a
> mistake I made in the current kselftest output for the LAVA parser.) The
> good news for the kernel, is that nothing else in dmesg leads with a
> "#" currently. :)

Actually, not correct.  Devicetree unittest (drivers/of/unittest.c) leads
messages with "### dt-test ### ".  But the good news is that this is
easily changeable.

-Frank

> 
>>> No; diagnostic lines must have no meaning: they are for humans and nothing
>>> else. If other details are needed for machines, we should explicitly
>>> create new format lines. I made a mistake when I used a diagnostic line
>>> for embedding the test names. :( There is a need for parsers to discover
>>> the name of a given test, though, so we do likely need something for this.
>>
>> I agree it is handy to have the test name for sub-tests.  However, I don't
>> see any semantic difference whether we define sub-test declarations via
>> diagnostic lines with special syntax or by creating a new syntax separate
>> from diagnostic lines.  Either way, we have a new syntax for parsers to
>> understand.
> 
> Pragmatically, I agree, which is what lead me to my mistake. However, in
> reconsidering it, I realize this leads to a slippery slope of just
> dumping stuff into diagnostic lines and pretending nothing will read it.
> 
> I suspect what we need is an optional "starting test" line, like:
> 
> test N DESC
> # diag...
> # diag...
> # diag...
> ok N
> 
> The nesting looks like:
> 
> TAP version 13
> 1..3
> test 1 seccomp
> # TAP version 13
> # 1..27
> # test 1 prctl
> # ok 1
> # test 2 strict
> # ok 2
> # ...
> # test 27 user_notif
> # # eek missing CONFIG_....
> # not ok 27 user_notif # ERROR can't use ioctl
> not ok 1 seccomp # ERROR non-zero exit code 1
> test 2 netfilter
> ok 2 netfilter
> test 3 bpf
> # exciting things happen
> ok 3 bpf
> 
> This is what we have now except basically just replacing "# $name" with
> "test $num $name"
> 
>>>
>>>> Diagnostic lines of the format "# Subtest:
>>>> <test_name>" indicate the start of a test with subtests. Also,
>>>> diagnostic lines of the format "# <test_name>: <description>" refer to
>>>> a specific test and tend to occur before the test result line of that
>>>> test but are optional.
>>>
>>> I don't think the above should be included in the spec -- diag lines
>>> should have no parseable meaning.
>>>
>>>>
>>>> Bail out line
>>>> -------------
>>>>
>>>> A bail out line can occur anywhere in the KTAP output and will
>>>> indicate that a test has crashed. The format of a bail out line is
>>>> "Bail out! [<description>]",  where the description can give
>>>> information on why the bail out occurred and can be any string.
>>>
>>> I'm not a fan of the Bail out line. It's not a problem, exactly, but I
>>> find it redundant. If we want an "end of test" line, let's make one.
>>> "Bail out" is a special case of exit. If we want to handle test exit,
>>> let's define it. E.g. make kselftest's test summary lines not
>>> diagnostic lines:
>>>
>>> # FAILED: 85 / 87 tests passed.
>>> # Totals: pass:83 fail:2 xfail:0 xpass:0 skip:2 error:0
>>
>> How consistent is it for selftests to produce test summary lines?
> 
> Any of the tools built with kselftest.h will spit it out. (Which is to
> say several of the "subtests".) The bulk of kselftest output is from the
> runner, which doesn't produce this summary.
> 
>>> Please no. There is no reason to force a nested test to suddenly have
>>> to know about its test execution depth/environment. A subtest is just a
>>> nested TAP result. That it is nested is only meaningful to the parser, not
>>> the executing test, and it must have the same output, nested or not. (e.g.
>>> running the test alone or running the test under a full kselftest run,
>>> the only difference is the indentation level.)
>> I agree that the sub-test should not know about this line.  I thought
>> the "parent" test wrote the line.
> 
> The suggestion was that the subtest should _not_ emit the "TAP" line,
> which is what I strongly disagree with: it should be a self-contained
> TAP-emitting test.
> 
>>> Again, I see only downsides to this. Nesting for the spec is simple
>>> indentation-based recursion. Let's just keep what we already have:
>>>
>>> TAP version 13
>>> 1..1
>>> # TAP version 13
>>> # 1..2
>>> # ok 1 - test_1
>>> # ok 2 - test_2
>>> ok 1 - test_suite
>>
>> I prefer spaces, but whatever.  Is the prefix on the 'TAP version 13' line guaranteed
>> to be the prefix on all lines for the (sub-)test?
> 
> That's the current implementation, yes.
> 
> As for tabs vs spaces, I think it's fine to swap "# " for "  ". We just
> need to patch LAVA to deal with it.
> 
>> Just as a side note, in some Fuego tests, it was very useful to include an identifier
>> in thethe prefix nested tests.  The output looked like this:
>>
>> TAP version 13
>> 1..2
>> [batch_id 4] TAP version 13
>> [batch_id 4] 1..2
>> [batch_id 4] ok 1 - cyclictest with 1000 cycles
>> [batch_id 4] # problem setting CLOCK_REALTIME
>> [batch_id 4] not ok 2 - cyclictest with CLOCK_REALTIME
>> not ok 1 - check realtime
>> [batch_id 4] TAP version 13
>> [batch_id 4] 1..1
>> [batch_id 4] ok 1 - IOZone read/write 4k blocks
>> ok 2 - check I/O performance
>>
>> Can I propose that the prefix not be fixed by the spec, but that the spec indicates that
>> whatever the prefix is on the TAP version line, that prefix must be used with the output for
>> all lines from the test (with the exception of unknown lines)?
> 
> Oh, interesting. This would also allow parallel (unique) test execution
> to be parsable. That sounds workable. (Again, this needs LAVA patching
> again...)
> 
>>>> Major differences between TAP 14 and KTAP specification
>>>> -------------------------------------------------------
>>>>
>>>> Note that the major differences between TAP 14 and KTAP specification:
>>>> - yaml and json are not allowed in diagnostic messages
>>>
>>> Agreed -- these are overkill (and very difficult to implement as they
>>> _follow_ the test result line: anything generating them has already
>>> finished running).
>>
>> I thought that some tests put the diagnostic data before the result line.
>>
>> I don't care whether it is before or after (and indeed Fuego's parser
>> can handle either case).  But it can't handle it when some testcases
>> put diagnostic data before the result line, and some testcases put
>> it after, in the same test.  There's no delimiter for the separation
>> between the two different testcases diagnostic data.
> 
> Right. The current implementation puts it always before. TAP 13 (14?) has
> the YAML _after_ the results line, but that's totally unworkable for
> kselftest, which is designed to be first human readable and second machine
> readable. Waiting to see diagnostics from a slow test is terrible --
> we can't collect it and spit it out later. Also, a parser would have
> no idea about when a test is finished if diagnostics follow the last
> test result of a plan.
> 
> Making it "always before" is fine by me, and the motivation for making
> the "summary" lines be non-diagnostics because they technically followed
> the last test result.
> 
>> Could we possibly have a rule about diagnostic lines positioning relative
>> to the testcase result line?  Maybe a rule like this:  For a single test,
>> diagnostic lines should either always preceded the testcase they
>> are related to, or follow the testcase they are related to.
> 
> For the kernel, I would rather avoid anything after because that just
> confuses things.
> 
>>> Yes. (Though "exit=" is a mistake in runner.sh -- this should probably
>>> be reported without the '#')
>>
>> Please No!  (Unless I'm misinterpreting that line).
>>
>> If you remove the '#', that makes the 'exit=127' part of the test description.
>> The test description should be invariant between success and failure cases.
>> The exit=127 (if I'm reading that line right) is part of data explaining or
>> describing the failure. It would be 'exit=0' if the test succeeded.
>>
>> In order to correlate test cases (ie show results over time), the description
>> should not change from one invocation of the test to the next.  At least Fuego
>> uses the test description this way.  That is, the testcase description is used as
>> a testcase identifier, and if it changes from run to run things get confusing.
> 
> Agreed, I think it should be part of an ERROR directive:
> 
> not ok 14 awesome test # ERROR exit=127
> 
>