Damien,

> Exposing an HDD through multiple queues, each with a high queue depth,
> is simply asking for trouble. Commands will end up spending so much
> time sitting in the queues that they will time out.

Yep!

> This can already be observed with the smartpqi SAS HBA, which exposes
> single drives as multiqueue block devices with a high queue depth.
> Exercising these drives heavily leads to thousands of commands being
> queued and to timeouts. It is fairly easy to trigger this without a
> manual change to the QD. This has been on my to-do list of fixes for
> some time now (lacking time to do it).

Controllers that queue internally are very susceptible to application or
filesystem timeouts when drives are struggling to keep up.

> NVMe HDDs need to have an interface setup that matches their speed,
> that is, something like a SAS interface: *single* queue pair with a max
> QD of 256 or less, depending on what the drive can take. There is no
> TASK_SET_FULL notification on NVMe, so throttling has to come from the
> max QD of the SQ, which the drive will advertise to the host.

At the very minimum we'll need low queue depths. But I have my doubts
whether we can make this work well enough without some kind of TASK SET
FULL style AER to throttle the I/O.

> NVMe specs will need an update to have a "NONROT" (non-rotational) bit
> in the identify data for all this to fit well in the current stack.

Absolutely.

--
Martin K. Petersen	Oracle Linux Engineering
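
[Editor's illustration] A rough back-of-the-envelope sketch of the timeout
problem described above: a small userspace C program that estimates how long
the last queued command waits for a given queue count and queue depth. The
~150 random-IOPS figure, the 30 second timeout, and the queue configurations
are illustrative assumptions, not measurements from any particular drive or
HBA.

/*
 * Back-of-the-envelope estimate of worst-case queueing delay for an HDD
 * exposed as a multiqueue device. All numbers are illustrative
 * assumptions.
 */
#include <stdio.h>

int main(void)
{
	/* Assumed HDD throughput for small random I/O (IOPS). */
	const double drive_iops = 150.0;

	/* Default block layer command timeout in seconds. */
	const double timeout_sec = 30.0;

	/* A few queue configurations to compare. */
	struct {
		const char *label;
		unsigned int nr_queues;
		unsigned int queue_depth;
	} cfg[] = {
		{ "SAS-like: 1 queue, QD 256",          1,  256 },
		{ "NVMe-ish: 8 queues, QD 1024 each",   8, 1024 },
		{ "NVMe-ish: 64 queues, QD 1024 each", 64, 1024 },
	};

	for (size_t i = 0; i < sizeof(cfg) / sizeof(cfg[0]); i++) {
		unsigned long outstanding =
			(unsigned long)cfg[i].nr_queues * cfg[i].queue_depth;

		/* The last queued command waits for everything ahead of it. */
		double worst_wait = outstanding / drive_iops;

		printf("%-38s %6lu cmds, worst wait ~%7.1f s%s\n",
		       cfg[i].label, outstanding, worst_wait,
		       worst_wait > timeout_sec ? "  -> exceeds timeout" : "");
	}
	return 0;
}

Under these assumptions a single SAS-like queue of 256 commands keeps the
worst-case wait around a couple of seconds, while a few thousand commands
spread over deep submission queues push the tail command minutes past the
30 second timeout, which is the failure mode described above.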