Re: Help deciding about backported patch (kernel bug 214767, 19f4e7cc8197 xfs: Fix CIL throttle hang when CIL space used going backwards)

Dave Chinner <david@xxxxxxxxxxxxx> · Sun, 20 Feb 2022 08:14:19 +1100

On Thu, Feb 17, 2022 at 10:22:49AM +0100, Christian Theune wrote:
> Hi,
> 
> I’ve been debugging an elusive XFS issue that I could not track
> down to any other parameters than it being an xfs internal bug.
> I’ve recorded what I’ve seen so far in
> https://bugzilla.kernel.org/show_bug.cgi?id=214767 and Dave
> recommended that “19f4e7cc8197 xfs: Fix CIL throttle hang when CIL
> space used going backwards” is likely the issue. AFAICT this was
> not backported to the 5.10 branch and we’ve been updating to
> vanilla kernels diligently and still keep seeing this issue.
> Unfortunately within a fleet of around 1k VMs it strikes about
> once every week or so and there’s no way to predict when and
> where.
> 
> So, I took Dave’s pointer and applied the patch to our 5.10 series
> (based on 5.10.76 at that point) and it applied cleanly. The
> machine boots fine and I ran the XFS test suite. However, I
> haven’t done any tests using the test suite before and I’m getting
> a number of errors where I don’t know how to interpret the
> results. Some of those seem to be due to not having the DEBUG flag
> set in the kernel, others … I’m not sure.

Run the "auto" group tests ('-g auto') only, which will weed out
tests that are broken, likely to fail or crash the machine (i.e.
test-to-failure scenarios). You can ignore "not run" reports - they
aren't failures, just indicative of the kernel not supporting that
functionality (like not being built with DEBUG functionality).

Then run the tests across an unmodified kernel 2-3 times to get a
baseline set of results (should be identical each run), then do the
same thing for the patched kernel.

Now compare baseline vs patched results, looking for things that
failed in the patched kernel that didn't fail in the baseline kernel
- those are the regressions that need more investigation. If there
are no regressions (very likely), you are good to go.
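
As a rough illustration of that comparison, here is a minimal Python
sketch. It assumes you have saved the runner output of each run to a
text file and that failed tests are listed on lines of the form
"Failures: generic/475 xfs/033"; adjust the parsing if your output
looks different. The function names and the command-line layout are
made up for this example and are not part of the test suite.

```python
import sys


def parse_failures(text):
    """Collect test names from lines like 'Failures: generic/475 xfs/033'.

    Assumption: the saved runner output lists failed tests on lines that
    start with 'Failures:'. Every other line is ignored.
    """
    failed = set()
    for line in text.splitlines():
        if line.startswith("Failures:"):
            failed.update(line.split()[1:])
    return failed


def find_regressions(baseline_logs, patched_logs):
    """Return tests that failed in a patched run but in no baseline run."""
    baseline = set().union(*(parse_failures(t) for t in baseline_logs))
    patched = set().union(*(parse_failures(t) for t in patched_logs))
    return sorted(patched - baseline)


def read(path):
    with open(path, encoding="utf-8", errors="replace") as f:
        return f.read()


if __name__ == "__main__":
    # Hypothetical usage:
    #   compare.py base1.log base2.log -- patched1.log patched2.log
    args = sys.argv[1:]
    if "--" not in args:
        sys.exit("usage: compare.py BASELINE_LOG... -- PATCHED_LOG...")
    split = args.index("--")
    regressions = find_regressions(
        [read(p) for p in args[:split]],
        [read(p) for p in args[split + 1:]],
    )
    for name in regressions:
        print(name)
    if not regressions:
        print("no regressions")
```

Pooling the failures of all baseline runs means a test that fails only
intermittently on the unmodified kernel is not reported as a
regression, which is the point of running the baseline more than once.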

> I’m attaching the test runner output. Unfortunately I lost the
> actual outputs as the test ran quite long and the outputs were
> cleaned up by the tempfile watcher faster than I could retrieve
> them. I can run them again, my estimation currently is it takes
> around 3-4 days to complete them, though.

The auto group tests should take ~3-6 hours to run a full cycle
depending on storage config.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx