On Thu 02-11-23 10:28:57, Mikulas Patocka wrote:
> On Thu, 2 Nov 2023, Marek Marczykowski-Górecki wrote:
> > On Tue, Oct 31, 2023 at 06:24:19PM +0100, Mikulas Patocka wrote:
> >
> > > > Hi
> > > >
> > > > I would like to ask you to try this patch. Revert the changes to "order"
> > > > and "PAGE_ALLOC_COSTLY_ORDER" back to normal and apply this patch on a
> > > > clean upstream kernel.
> > > >
> > > > Does it deadlock?
> > > >
> > > > There is a bug in dm-crypt that it doesn't account large pages in
> > > > cc->n_allocated_pages, this patch fixes the bug.
> >
> > This patch did not help.
> >
> > > If the previous patch didn't fix it, try this patch (on a clean upstream
> > > kernel).
> > >
> > > This patch allocates large pages, but it breaks them up into single-page
> > > entries when adding them to the bio.
> >
> > But this does help.
>
> Thanks. So we can stop blaming the memory allocator and start blaming the
> NVMe subsystem. ;-)
> I added NVMe maintainers to this thread - the summary of the problem is:
> In dm-crypt, we allocate a large compound page and add this compound page
> to the bio as a single big vector entry. Marek reports that on his system
> it causes deadlocks, the deadlocks look like a lost bio that was never
> completed. When I chop the large compound page to individual pages in
> dm-crypt and add bio vector for each of them, Marek reports that there are
> no longer any deadlocks. So, we have a problem (either hardware or
> software) that the NVMe subsystem doesn't like bio vectors with large
> bv_len. This is the original bug report:
> https://lore.kernel.org/stable/ZTNH0qtmint%2FzLJZ@mail-itl/

Actually, Ming Lei has already identified [1] that we are apparently stuck
in an endless retry loop in nvme_queue_rq(), with each attempt ending in
BLK_STS_RESOURCE.

								Honza

[1] https://lore.kernel.org/all/ZUHE52SznRaZQxnG@fedora

-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR
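
For readers following along, the difference between the two dm-crypt
behaviours described above (one big bio vector entry covering the whole
compound page vs. one entry per PAGE_SIZE piece) can be sketched roughly as
below. This is only an illustration, not Mikulas's actual patch; the helper
names are made up, and the caller is assumed to have allocated "page" with
alloc_pages(..., order) and reserved enough bvec slots in the bio.

/*
 * Illustration only - not the dm-crypt code under discussion.
 */
#include <linux/bio.h>
#include <linux/mm.h>

/* Whole compound page as one bio vector entry with a large bv_len. */
static void add_as_one_bvec(struct bio *bio, struct page *page,
			    unsigned int order)
{
	__bio_add_page(bio, page, PAGE_SIZE << order, 0);
}

/*
 * Compound page chopped into order-0 pieces, one bvec per PAGE_SIZE.
 * __bio_add_page() is used here because bio_add_page() may merge the
 * physically contiguous pieces back into a single large bvec, which
 * would defeat the purpose of the experiment.
 */
static void add_as_single_pages(struct bio *bio, struct page *page,
				unsigned int order)
{
	unsigned int i;

	for (i = 0; i < (1U << order); i++)
		__bio_add_page(bio, nth_page(page, i), PAGE_SIZE, 0);
}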
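
As for the retry loop itself: a blk-mq driver's ->queue_rq() handler that
cannot service a request right now returns BLK_STS_RESOURCE, and the block
layer requeues the request and calls the handler again later. If the
condition the driver treats as transient never clears, the request bounces
there forever and the bio never completes. The sketch below shows only that
generic contract; it is not the real nvme_queue_rq(), and my_dev, my_prep()
and my_submit() are hypothetical.

#include <linux/blk-mq.h>

struct my_dev;						/* hypothetical driver state */
bool my_prep(struct my_dev *dev, struct request *rq);	/* e.g. build DMA mapping / SGL */
void my_submit(struct my_dev *dev, struct request *rq);

static blk_status_t my_queue_rq(struct blk_mq_hw_ctx *hctx,
				const struct blk_mq_queue_data *bd)
{
	struct request *rq = bd->rq;
	struct my_dev *dev = hctx->queue->queuedata;

	/*
	 * A failed preparation is reported as BLK_STS_RESOURCE; the block
	 * layer requeues the request and retries later.  If preparation
	 * keeps failing for the same request, this becomes an endless
	 * retry loop.
	 */
	if (!my_prep(dev, rq))
		return BLK_STS_RESOURCE;

	blk_mq_start_request(rq);
	my_submit(dev, rq);
	return BLK_STS_OK;
}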