Re: Intermittent storage (dm-crypt?) freeze - regression 6.4->6.5

Keith Busch <kbusch@xxxxxxxxxx> · Thu, 2 Nov 2023 08:02:34 -0600

On Wed, Nov 01, 2023 at 07:23:05PM +0800, Ming Lei wrote:
> On Wed, Nov 01, 2023 at 11:15:02AM +0100, Hannes Reinecke wrote:
> > > nvme_queue_rq() on the above request.
> > > 
> > And that is something I've been wondering (for quite some time now):
> > What _is_ the appropriate error handling for -ENOMEM?
> 
> It is just my guess.
> 
> Actually it shouldn't fail since the sgl allocation is backed with
> memory pool, but there is also dma pool allocation and dma mapping.
> 
> > At this time, we assume it to be a retryable error and re-run the queue
> > in the hope that things will sort itself out.
> 
> It should not be hard to figure out why nvme_queue_rq() can't move on.

There's only a few reasons nvme_queue_rq would return BLK_STS_RESOURCE
for a typical read/write command:

  DMA mapping error
  Can't allocate SGL from mempool
  Can't allocate PRP from dma_pool
  Controller stuck in resetting state

We should always be able to get at least one allocation from the memory
pools, so I think the only one the driver doesn't have a way to
guarantee eventual forward progress are the DMA mapping error
conditions. Is there some other limit that the driver needs to consider
when configuring it's largest supported transfers?