On Tue, Sep 10, 2024 at 06:27:55PM +0100, Robert Beckett wrote:
> nvme.io_queue_depth=2 appears to fix it. Could you explain the
> implications of this? I assume it is limiting to 2 outstanding
> requests concurrently.

You'd think so, but not quite. NVMe queues need to leave one entry
empty, so a submission queue with depth "2" means you can have at most
1 command outstanding (see the sketch at the end of this mail).

> Does it suggest an issue with the specific device's FW?

I think that sounds probable, especially considering the dmapool code
has had considerable run time in real life, and no other such issue
has been reported.

> I assume this would suggest that it is not actually anything wrong
> with the dmapool, it was just exposing the issue of the device/fw?

That's what I'm thinking, though if you have a single queue with depth
2, we're not stressing the dmapool implementation either: it will
always return the same dma block for each command.

> Any advice for handling this and/or investigating further?

If you have the resources for it, get a protocol analyzer trace and
show it to your nvme vendor.

> My initial speculation was that maybe the disk fw is signalling
> completion of an access before it has actually finished making its
> way to RAM. I checked the code and saw that the dmapool appears to be
> used for storing the buffer page addresses, so I imagine that is not
> updated by the disk at all, which would rule out my assumption.

Right, it's used to build the prp/sgl list. Once we get a completion,
that dma block becomes immediately available for the very next
command. If you have a higher queue depth, it's possible that dma
block is reused immediately while the driver is still notifying the
block layer of the completion.

If we're thinking the device is completing the command before it's
really done with the list (which could explain your observation), that
would be a problem. Going to single queue depth might introduce a
delay or work around some firmware issue when dealing with concurrent
commands. Prior to the "new" dmapool allocation, it was much less
likely (though I think still possible) for your next command to reuse
the same dma block as the command currently being completed.
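
For reference, here is a minimal sketch of the ring arithmetic behind
the "depth 2 means 1 outstanding command" point above. This is not the
driver's actual code; the identifiers are made up for illustration:

	#include <stdbool.h>
	#include <stdint.h>

	/*
	 * Hypothetical sketch of NVMe submission queue ring indices.
	 * The queue is full when advancing the tail would make it
	 * equal the head, so one of the qsize slots always stays
	 * unused: a queue with qsize == 2 holds at most 1 command.
	 */
	struct sq_ring {
		uint16_t head;  /* consumer index, advanced by device */
		uint16_t tail;  /* producer index, advanced by driver */
		uint16_t qsize; /* number of slots in the ring */
	};

	static bool sq_full(const struct sq_ring *sq)
	{
		return ((sq->tail + 1) % sq->qsize) == sq->head;
	}

	static bool sq_empty(const struct sq_ring *sq)
	{
		return sq->tail == sq->head;
	}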
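
And to illustrate the reuse window I described: dmapool can hand a
just-freed block straight back to the next allocation. Again a sketch
under assumed names (struct my_cmd, complete_cmd, submit_cmd are
hypothetical), not the nvme driver's actual completion path:

	#include <linux/dmapool.h>
	#include <linux/errno.h>
	#include <linux/gfp.h>

	struct my_cmd {
		void *prp_list;      /* CPU address of the prp/sgl list */
		dma_addr_t prp_dma;  /* bus address handed to the device */
	};

	static void complete_cmd(struct dma_pool *prp_pool, struct my_cmd *cmd)
	{
		/*
		 * The block holding the prp/sgl list goes back to the
		 * pool here. The pool can return this same block to
		 * the very next dma_pool_alloc(), so if the device is
		 * still reading the list after posting the completion,
		 * the next command's list can overwrite it underneath
		 * the device.
		 */
		dma_pool_free(prp_pool, cmd->prp_list, cmd->prp_dma);

		/* ... block layer is notified of the completion after ... */
	}

	static int submit_cmd(struct dma_pool *prp_pool, struct my_cmd *cmd)
	{
		/* May receive the exact block freed above. */
		cmd->prp_list = dma_pool_alloc(prp_pool, GFP_ATOMIC,
					       &cmd->prp_dma);
		if (!cmd->prp_list)
			return -ENOMEM;

		/* ... build the prp entries and ring the doorbell ... */
		return 0;
	}

With a single queue at depth 2 there is only ever one command in
flight, so this race can't occur, which is consistent with your
workaround hiding the problem rather than fixing it.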