Re: Help deciding about backported patch (kernel bug 214767, 19f4e7cc8197 xfs: Fix CIL throttle hang when CIL space used going backwards)

Christian Theune <ct@xxxxxxxxxxxxxxx> · Wed, 23 Feb 2022 07:17:48 +0100

Hi Dave,

thanks a lot - those are the instructions I was missing!

Unfortunately I mixed up the results directories and the baseline test output that goes to the /tmp directory, so I don’t have the true full diffs available, only what’s in the stdout log. In any case, here’s what I found:

Baseline
--------

On my vanilla kernel (5.10.76) I get between 18 and 20 test failures when running auto mode as you instructed. The affected tests on vanilla are:

generic/035 generic/050 generic/388 generic/452 generic/594 generic/623 generic/646 generic/670* xfs/031 xfs/033 xfs/071 xfs/154 xfs/158 xfs/177 xfs/185 xfs/506 xfs/513 xfs/539 xfs/540 xfs/542

Notable here is generic/670 (marked with * above), which failed only in the baseline and not on the patched kernel:

generic/670 10s ... - output mismatch (see /home/ctheune/fc-nixos/results/generic/670.out.bad)
     Reflink and mmap reread the files!
    +61 61 61 61 61 61 61 61 61 62 62 62 62 62 62 62
     Finished reflinking

Patched
-------

On my patched kernel I get between 18 and 22 test failures. I’m listing only the ones not failing in the baseline:

generic/471
    -RWF_NOWAIT time is within limits.
    +RWF_NOWAIT took 0.2517 seconds

generic/475
     Silence is golden.
    +your 131072x1 screen size is bogus. expect trouble
    +your 131072x1 screen size is bogus. expect trouble
    +your 131072x1 screen size is bogus. expect trouble
    +your 131072x1 screen size is bogus. expect trouble
    +your 131072x1 screen size is bogus. expect trouble
    ...

generic/648
     Silence is golden.
    +your 131072x1 screen size is bogus. expect trouble

Also notable is that only the second run on the patched kernel contained additional failures compared to the baseline. The first and third run were “clean” compared to the baseline.
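
For reference, the comparison Dave described (failures on the patched kernel that never show up in the baseline) boils down to a set difference over the union of all runs. Below is a minimal sketch in Python, not a definitive tool. The function names are my own, and parse_failures() assumes the stdout log of the check script ends each run with a "Failures: ..." summary line, so adjust it if your log looks different. The per-run split in the example is assumed for illustration; the test names are the ones listed in this mail.

```python
def parse_failures(log_text):
    """Collect test names from 'Failures:' summary lines in a stdout log.

    Assumption: each run ends with a line such as
    'Failures: generic/035 xfs/031'.
    """
    failed = set()
    for line in log_text.splitlines():
        if line.startswith("Failures:"):
            failed.update(line.split()[1:])
    return failed


def regressions(baseline_runs, patched_runs):
    """Tests that failed in any patched run but in no baseline run."""
    baseline = set().union(*baseline_runs)
    patched = set().union(*patched_runs)
    return patched - baseline


def baseline_only(baseline_runs, patched_runs):
    """Tests that failed in some baseline run but in no patched run."""
    baseline = set().union(*baseline_runs)
    patched = set().union(*patched_runs)
    return baseline - patched


# Union of baseline failures as listed in this mail.
BASELINE = set(
    "generic/035 generic/050 generic/388 generic/452 generic/594 "
    "generic/623 generic/646 generic/670 xfs/031 xfs/033 xfs/071 "
    "xfs/154 xfs/158 xfs/177 xfs/185 xfs/506 xfs/513 xfs/539 "
    "xfs/540 xfs/542".split()
)

# Assumed for illustration: one patched run without generic/670 and
# with the three additional failures from the second run.
PATCHED_RUN = (BASELINE - {"generic/670"}) | {
    "generic/471", "generic/475", "generic/648"
}

if __name__ == "__main__":
    print(sorted(regressions([BASELINE], [PATCHED_RUN])))
    print(sorted(baseline_only([BASELINE], [PATCHED_RUN])))
```

Taking the union across runs means a test that fails only intermittently in the baseline is not reported as a regression, which matches the fluctuating failure counts I saw.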

I’m guessing that the “screen size is bogus” messages are due to me running the tests in a ‘screen’ + sudo environment. That leaves generic/471, which doesn’t sound too bad, but honestly I have no idea … :)

Also, I’d be happy to pay back a bit by adding your instructions to the documentation or wiki (or wherever googling has a higher chance of finding them).

Kind regards,
Christian

> On 19. Feb 2022, at 22:14, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> 
> On Thu, Feb 17, 2022 at 10:22:49AM +0100, Christian Theune wrote:
>> Hi,
>> 
>> I’ve been debugging an elusive XFS issue that I could not track
>> down to any other parameters than it being an xfs internal bug.
>> I’ve recorded what I’ve seen so far in
>> https://bugzilla.kernel.org/show_bug.cgi?id=214767 and Dave
>> recommended that "19f4e7cc8197 xfs: Fix CIL throttle hang when CIL
>> space used going backwards" is likely the issue. AFAICT this was
>> not backported to the 5.10 branch and we’ve been updating to
>> vanilla kernels diligently and still keep seeing this issue.
>> Unfortunately within a fleet of around 1k VMs it strikes about
>> once every week or so and there’s no way to predict when and
>> where.
>> 
>> So, I took Dave’s pointer and applied the patch to our 5.10 series
>> (based on 5.10.76 at that point) and it applied cleanly. The
>> machine boots fine and I ran the XFS test suite. However, I
>> haven’t done any tests using the test suite before and I’m getting
>> a number of errors where I don’t know how to interpret the
>> results. Some of those seem to be due to not having the DEBUG flag
>> set in the kernel, others … I’m not sure.
> 
> Run the "auto" group tests ('-g auto') only, which will weed out
> tests that are broken, likely to fail or crash the machine (i.e.
> test-to-failure scenarios). You can ignore "not run" reports - they
> aren't failures, just indicative of the kernel not supporting that
> functionality (like not being built with DEBUG functionality).
> 
> Then run the tests across an unmodified kernel 2-3 times to get a
> baseline set of results (should be identical each run), then do the
> same thing for the patched kernel.
> 
> Now compare baseline vs patched results, looking for things that
> failed in the patched kernel that didn't fail in the baseline kernel
> - those are the regressions that need more investigation. If there
> are no regressions (very likely), you are good to go.
> 
>> I’m attaching the test runner output, unfortunately I lost the
>> actual outputs as the test ran quite long and the outputs were
>> cleaned up by the tempfile watcher faster than I could retrieve
>> them. I can run them again, my estimation currently is it takes
>> around 3-4 days to complete them, though.
> 
> The auto group tests should take ~3-6 hours to run a full cycle
> depending on storage config.
> 
> Cheers,
> 
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx

Kind regards,
Christian Theune

--
Christian Theune · ct@xxxxxxxxxxxxxxx · +49 345 219401 0
Flying Circus Internet Operations GmbH · http://flyingcircus.io
Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland
HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick