Re: Known and unfixed active data loss bug in MM + XFS with large folios since Dec 2021 (any kernel from 6.1 upwards)

> On 19. Sep 2024, at 08:57, Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> 
> Yeah, right now Jens is still going to run some more testing, but I
> think the plan is to just backport
> 
>  a4864671ca0b ("lib/xarray: introduce a new helper xas_get_order")
>  6758c1128ceb ("mm/filemap: optimize filemap folio adding")
> 
> and I think we're at the point where you might as well start testing
> that if you have the cycles for it. Jens is mostly trying to confirm
> the root cause, but even without that, I think you running your load
> with those two changes back-ported is worth it.
> 
> (Or even just try running it on plain 6.10 or 6.11, both of which
> already have those commits)

I’ve discussed this with my team and we’re preparing to switch all our 
non-prod machines as well as those production machines that have shown
the error before.

This will require a bit of user communication and reboot scheduling.
Our release prep will be able to roll this out starting early next week
and the production machines in question around Sept 30.

We would run 6.11, since our understanding so far is that the most
current kernel would generate the most insight and be easier for you
all to work with. Is that right?

(Generally we run a mostly vanilla LTS kernel once it has passed
x.y.50, so we might later downgrade to 6.6 when this is fixed.)

> So considering how well the reproducer works for Jens and Chris, my
> main worry is whether your load might have some _additional_ issue.
> 
> Unlikely, but still .. The two commits fix the reproducer, so I think
> the important thing to make sure is that it really fixes the original
> issue too.
> 
> And yeah, I'd be surprised if it doesn't, but at the same time I would
> _not_ suggest you try to make your load look more like the case we
> already know gets fixed.
> 
> So yes, it will be "weeks of not seeing crashes" until we'd be
> _really_ confident it's all the same thing, but I'd rather still have
> you test that, than test something else than what caused issues
> originally, if you see what I mean.

Agreed, I’m fully on board with that.

Kind regards,
Christian Theune

-- 
Christian Theune · ct@xxxxxxxxxxxxxxx · +49 345 219401 0
Flying Circus Internet Operations GmbH · https://flyingcircus.io
Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland
HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick





