On Tue, Sep 10, 2024 at 06:27:55PM +0100, Robert Beckett wrote:
> nvme.io_queue_depth=2 appears to fix it. Could you explain the
> implications of this? I assume it is limiting to 2 outstanding
> requests concurrently.

You'd think so, but not quite. NVMe queues need to leave one entry
empty, so a submission queue with depth "2" means you can have at most
1 command outstanding (see the sketch at the end of this mail).

> Does it suggest an issue with the specific device's FW?

I think that sounds probable, especially considering the dmapool code
has had considerable run time in real life, and no other such issue
has been reported.

> I assume this would suggest that it is not actually anything wrong
> with the dmapool, it was just exposing the issue of the device/fw?

That's what I'm thinking, though if you have a single queue with depth
2, we're not stressing the dmapool implementation either: it will
always return the same dma block for each command.

> Any advice for handling this and/or investigating further?

If you have the resources for it, get a protocol analyzer trace and
show it to your nvme vendor.

> My initial speculation was that maybe the disk fw is signalling
> completion of an access before it has actually finished making its
> way to RAM. I checked the code and saw that the dmapool appears to be
> used for storing the buffer page addresses, so I imagine that is not
> updated by the disk at all, which would rule out my assumption.

Right, it's used to build the prp/sgl list. Once we get a completion,
that dma block becomes immediately available for the very next
command. If you have a higher queue depth, it's possible that dma
block is reused immediately while the driver is still notifying the
block layer of the completion.

If we're thinking the device is completing the command before it's
really done with the list (which could explain your observation), that
would be a problem. Going to single queue depth might introduce a
delay or work around some firmware issue when dealing with concurrent
commands. Prior to the "new" dmapool allocation, it was much less
likely (though I think still possible) for your next command to reuse
the same dma block as the command currently being completed.
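
For reference, here is a minimal sketch of the ring arithmetic behind
the "depth 2 means 1 outstanding command" point above. This is not the
driver's actual code; the identifiers are made up for illustration:

	#include <stdbool.h>
	#include <stdint.h>

	/*
	 * Hypothetical sketch of NVMe submission queue ring indices.
	 * The queue is full when advancing the tail would make it
	 * equal the head, so one of the qsize slots always stays
	 * unused: a queue with qsize == 2 holds at most 1 command.
	 */
	struct sq_ring {
		uint16_t head;  /* consumer index, advanced by device */
		uint16_t tail;  /* producer index, advanced by driver */
		uint16_t qsize; /* number of slots in the ring */
	};

	static bool sq_full(const struct sq_ring *sq)
	{
		return ((sq->tail + 1) % sq->qsize) == sq->head;
	}

	static bool sq_empty(const struct sq_ring *sq)
	{
		return sq->tail == sq->head;
	}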
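
And to illustrate the reuse window I described: dmapool can hand a
just-freed block straight back to the next allocation. Again a sketch
under assumed names (struct my_cmd, complete_cmd, submit_cmd are
hypothetical), not the nvme driver's actual completion path:

	#include <linux/dmapool.h>
	#include <linux/errno.h>
	#include <linux/gfp.h>

	struct my_cmd {
		void *prp_list;      /* CPU address of the prp/sgl list */
		dma_addr_t prp_dma;  /* bus address handed to the device */
	};

	static void complete_cmd(struct dma_pool *prp_pool, struct my_cmd *cmd)
	{
		/*
		 * The block holding the prp/sgl list goes back to the
		 * pool here. The pool can return this same block to
		 * the very next dma_pool_alloc(), so if the device is
		 * still reading the list after posting the completion,
		 * the next command's list can overwrite it underneath
		 * the device.
		 */
		dma_pool_free(prp_pool, cmd->prp_list, cmd->prp_dma);

		/* ... block layer is notified of the completion after ... */
	}

	static int submit_cmd(struct dma_pool *prp_pool, struct my_cmd *cmd)
	{
		/* May receive the exact block freed above. */
		cmd->prp_list = dma_pool_alloc(prp_pool, GFP_ATOMIC,
					       &cmd->prp_dma);
		if (!cmd->prp_list)
			return -ENOMEM;

		/* ... build the prp entries and ring the doorbell ... */
		return 0;
	}

With a single queue at depth 2 there is only ever one command in
flight, so this race can't occur, which is consistent with your
workaround hiding the problem rather than fixing it.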