Re: [PATCH] fstests: report: always save the dmesg as system-err if KEEP_DMESG is set

"Theodore Ts'o" <tytso@xxxxxxx> · Mon, 19 Dec 2022 13:56:58 -0500

On Mon, Dec 19, 2022 at 09:39:33AM -0800, Darrick J. Wong wrote:
> On Fri, Dec 16, 2022 at 02:51:21PM +0800, Qu Wenruo wrote:
> > When KEEP_DMESG is set to "yes", we will always save the dmesg of any
> > test case (no matter if it passed or not) into "$seqnum.dmesg".
> > 
> > But this KEEP_DMESG behavior doesn't affect xunit report.
> > 
> > This patch will make xunit report to follow KEEP_DMESG setting.

This may be dangerous; if the XML file is too large, the XML parser
may end up rejecting the whole file, because otherwise a too-large
XML file can be used to trigger a denial of service attack[1].  (This
is why I implemented "xunit-quiet".)

[1] https://gitlab.com/gitlab-org/gitlab/-/issues/25357

So if you are running a large number of tests (e.g., "-g auto"),
adding dmesg for all tests might very well end up bloating the XML
file to the point where it will be unmanageable.  For example, this
is the size of my syslog file after running "-g auto" on the
"xfs/quota" config:

-rw-r----- 1 tytso primarygroup 10316684 Aug 25 10:35 ae/syslog

The syslog files for all of the xfs configs are 9-10 megabytes each.
If I combined the 12 xfs configs that we run into a single xunit XML
file with the dmesg output, this would be *guaranteed* to blow out
most XML parsers.
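
For a rough sense of scale, here is a minimal Python sketch of that
arithmetic.  The 10 MiB budget (MAX_XML_BYTES) and the helper names
are hypothetical, chosen only for illustration; real XML parsers and
CI systems each set their own limits.

```python
import os

# Hypothetical size budget; not taken from any particular parser.
MAX_XML_BYTES = 10 * 1024 * 1024


def combined_size(paths):
    """Return the total size in bytes of the given files."""
    return sum(os.path.getsize(p) for p in paths)


def fits_budget(total_bytes, budget=MAX_XML_BYTES):
    """Return True if total_bytes is within the (assumed) budget."""
    return total_bytes <= budget


# Back-of-the-envelope estimate: 12 configs, each about the size of
# the syslog file listed above (10316684 bytes).
estimate = 12 * 10316684
print(estimate, fits_budget(estimate))
```

With real files, combined_size() can be fed the per-config syslog
paths instead of using the estimate.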

Personally, I find that a better solution is to use the syslog daemon
to save the dmesg output for all of the tests into a single file.  I
prefer this for three reasons:

  * The single file is more compressible compared to having it broken
    out into separate $seqnum.dmesg files.
  * By keeping dmesg and other test artifacts separate from the xml
    file I can archive the xml file for a much longer period of time
    (perhaps indefinitely) while allowing the much more voluminous
    test artifacts to be archived for a shorter time (say, 3-6 months).
  * When there are test isolation issues (it's not uncommon for a
    previous test to fail with some kind of global or cgroup-specific
    OOM-kill), or when I'm testing on bare metal with real hardware
    where hardware failures are a Thing, being able to look for
    unusual kernel messages before the start of a particular test can
    often be quite revealing.
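
As an illustration of the last point, here is a minimal Python sketch
that pulls one test's messages out of a combined log, together with
some lines of context from before the test started.  It assumes each
test is introduced in the log by a marker line of the form
"run fstests <test> at <timestamp>"; that format, and the helper name
dmesg_for_test, are assumptions to adjust for your setup.

```python
import re

# Assumed per-test marker format: "run fstests <test> at <timestamp>".
MARKER = re.compile(r"run fstests (\S+) at ")


def dmesg_for_test(lines, seqnum, context=20):
    """Return the log lines for one test, plus `context` lines before it.

    The slice runs from `context` lines ahead of the test's marker up
    to (but not including) the next test's marker.
    """
    start = end = None
    for i, line in enumerate(lines):
        m = MARKER.search(line)
        if not m:
            continue
        if start is None:
            # Still looking for the marker of the requested test.
            if m.group(1) == seqnum:
                start = i
        else:
            # First marker after our test: the next test begins here.
            end = i
            break
    if start is None:
        return []
    if end is None:
        end = len(lines)
    return lines[max(0, start - context):end]
```

For example, dmesg_for_test(open("syslog").read().splitlines(),
"xfs/002") would return the xfs/002 messages preceded by the last 20
lines logged before that test began, which is where an OOM-kill from
an earlier test would show up.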

Cheers,

						- Ted