On Mon 30-10-23 12:30:23, Vlastimil Babka wrote:
> On 10/30/23 12:22, Mikulas Patocka wrote:
> > On Mon, 30 Oct 2023, Vlastimil Babka wrote:
> >
> >> Ah, missed that. And the traces don't show that we would be waiting for
> >> that. I'm starting to think the allocation itself is really not the issue
> >> here. Also I don't think it deprives something else of large order pages, as
> >> per the sysrq listing they still existed.
> >>
> >> What I rather suspect is what happens next to the allocated bio such that it
> >> works well with order-0 or up to costly_order pages, but there's some
> >> problem causing a deadlock if the bio contains larger pages than that?
> >
> > Yes. There are many "if (order > PAGE_ALLOC_COSTLY_ORDER)" branches in the
> > memory allocation code and I suppose that one of them does something bad
> > and triggers this bug. But I don't know which one.
>
> It's not what I meant. All the interesting branches for costly order in page
> allocator/compaction only apply with __GFP_DIRECT_RECLAIM, so we can't be
> hitting those here.
> The traces I've seen suggest the allocation of the bio succeeded, and
> problems arose only after it was submitted.
>
> I wouldn't even be surprised if the threshold for hitting the bug was not
> exactly order > PAGE_ALLOC_COSTLY_ORDER but order > PAGE_ALLOC_COSTLY_ORDER
> + 1 or + 2 (has that been tested?), or rather that there's no exact
> threshold, but the probability increases with order.

Well, it is possible that larger pages in a bio trip, e.g., bio splitting
due to the maximum segment size the disk supports (which can be, e.g.,
0xffff bytes), and that this upsets something somewhere. But this is pure
speculation. We definitely need more debug data to be able to tell more.

Honza

-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR