Re: Known and unfixed active data loss bug in MM + XFS with large folios since Dec 2021 (any kernel from 6.1 upwards)

Christian Theune <ct@xxxxxxxxxxxxxxx> · Thu, 10 Oct 2024 08:29:14 +0200

> On 1. Oct 2024, at 02:56, Chris Mason <clm@xxxxxxxx> wrote:
> 
> Not disagreeing with Linus at all, but given that you've got IO
> throttling too, we might really just be waiting.  It's hard to tell
> because the hung task timeouts only give you information about one process.
> 
> I've attached a minimal version of a script we use here to show all the
> D state processes, it might help explain things.  The only problem is
> you have to actually ssh to the box and run it when you're stuck.
> 
> The idea is to print the stack trace of every D state process, and then
> also print out how often each unique stack trace shows up.  When we're
> deadlocked on something, there are normally a bunch of the same stack
> (say waiting on writeback) and then one jerk sitting around in a
> different stack who is causing all the trouble.
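
The approach described in the quoted text could be sketched roughly as follows. This is not the attached script, only a minimal illustration. It assumes a Linux /proc layout and root privileges for reading /proc/<pid>/stack, it looks at processes only (not individual threads), and the function names are made up for this example.

```python
#!/usr/bin/env python3
"""Sketch: dump kernel stacks of D-state processes and count duplicates."""
import os
from collections import Counter


def task_state(stat_text):
    """Return the state letter from the contents of /proc/<pid>/stat."""
    # The comm field is wrapped in parentheses and may itself contain
    # spaces or ')', so split after the last ')'.
    return stat_text.rsplit(")", 1)[1].split()[0]


def count_stacks(stacks):
    """Map each unique stack trace to the number of tasks showing it."""
    return Counter(stacks)


def d_state_stacks(proc="/proc"):
    """Yield (pid, stack) for every process in uninterruptible sleep."""
    try:
        entries = os.listdir(proc)
    except OSError:
        return  # no procfs available (assumption: Linux only)
    for pid in filter(str.isdigit, entries):
        try:
            with open(f"{proc}/{pid}/stat") as f:
                if task_state(f.read()) != "D":
                    continue
            # Reading /proc/<pid>/stack normally requires root.
            with open(f"{proc}/{pid}/stack") as f:
                yield pid, f.read()
        except OSError:
            continue  # process exited or access was denied


def main():
    found = list(d_state_stacks())
    for pid, stack in found:
        print(f"pid {pid}\n{stack}")
    # Most frequent stacks first; the odd one out ends up last.
    for stack, n in count_stacks(s for _, s in found).most_common():
        print(f"{n} task(s) with stack:\n{stack}")


if __name__ == "__main__":
    main()
```

Run as root while the machine is stuck; the stack that appears only once at the end of the summary is the first candidate to look at.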

I think I should be able to trigger this. I’ve seen around 100 of those issues over the last week, and the chance of it happening correlates with a certain workload that should be easy to reproduce. Also, the condition persists for around 5 minutes, so I should be able to trace it in an interactive session when I see the alert.

I’ve verified that I can run your script, and I’ll get back to you in the next few days.

Christian

-- 
Christian Theune · ct@xxxxxxxxxxxxxxx · +49 345 219401 0
Flying Circus Internet Operations GmbH · https://flyingcircus.io
Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland
HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick