Re: Intermittent storage (dm-crypt?) freeze - regression 6.4->6.5

Marek Marczykowski-Górecki <marmarek@xxxxxxxxxxxxxxxxxxxxxx> · Mon, 30 Oct 2023 14:30:52 +0100

On Mon, Oct 30, 2023 at 01:25:13PM +0100, Jan Kara wrote:
> On Mon 30-10-23 12:30:23, Vlastimil Babka wrote:
> > On 10/30/23 12:22, Mikulas Patocka wrote:
> > > On Mon, 30 Oct 2023, Vlastimil Babka wrote:
> > > 
> > >> Ah, missed that. And the traces don't show that we would be waiting for
> > >> that. I'm starting to think the allocation itself is really not the issue
> > >> here. Also I don't think it deprives something else of large order pages, as
> > >> per the sysrq listing they still existed.
> > >> 
> > >> What I rather suspect is what happens next to the allocated bio such that it
> > >> works well with order-0 or up to costly_order pages, but there's some
> > >> problem causing a deadlock if the bio contains larger pages than that?
> > > 
> > > Yes. There are many "if (order > PAGE_ALLOC_COSTLY_ORDER)" branches in the 
> > > memory allocation code and I suppose that one of them does something bad 
> > > and triggers this bug. But I don't know which one.
> > 
> > It's not what I meant. All the interesting branches for costly order in page
> > allocator/compaction only apply with __GFP_DIRECT_RECLAIM, so we can't be
> > hitting those here.
> > The traces I've seen suggest the allocation of the bio succeeded, and
> > problems arose only after it was submitted.
> > 
> > I wouldn't even be surprised if the threshold for hitting the bug was not
> > exactly order > PAGE_ALLOC_COSTLY_ORDER but order > PAGE_ALLOC_COSTLY_ORDER
> > + 1 or + 2 (has that been tested?) or rather that there's no exact
> > threshold, but probability increases with order.
> 
> Well, it would be possible that larger pages in a bio would trip e.g. bio
> splitting due to maximum segment size the disk supports (which can be e.g.
> 0xffff) and that upsets something somewhere. But this is pure
> speculation. We definitely need more debug data to be able to tell more.
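
To put rough numbers on the segment-size idea above, here is a small
back-of-the-envelope sketch (plain Python, not kernel code). It assumes
4 KiB base pages and a device limit of 0xffff bytes per segment, as in
the example quoted above; segments_needed() is a made-up helper, and the
real block layer splitting also depends on other queue limits that are
ignored here.

```python
PAGE_SIZE = 4096               # assumption: 4 KiB base pages
PAGE_ALLOC_COSTLY_ORDER = 3    # value defined in mainline kernels


def segments_needed(order, max_segment_size=0xFFFF):
    """Minimum number of segments for one physically contiguous
    page of the given order, if no segment may exceed max_segment_size."""
    length = PAGE_SIZE << order
    return -(-length // max_segment_size)  # ceiling division


if __name__ == "__main__":
    for order in range(0, 7):
        note = " (costly)" if order > PAGE_ALLOC_COSTLY_ORDER else ""
        print(f"order {order}: {PAGE_SIZE << order} bytes, "
              f"{segments_needed(order)} segment(s){note}")
```

Under these assumptions an order-3 page (32 KiB) still fits in a single
segment, while an order-4 page (64 KiB) is one byte over 0xffff and would
have to be split. That would line up with the PAGE_ALLOC_COSTLY_ORDER
boundary, but it is only arithmetic, not evidence.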

I can collect more info, but I need some guidance on how :) Perhaps a
patch adding extra debug messages?
Note I collect those via serial console (writing to disk doesn't work
when it freezes), and that has some limits in the amount of data I can
extract especially when printed quickly. For example sysrq-t is too much.
Or maybe there is some trick to it, like increasing log_buf_len?

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab