On 11/04/2024 01:29, Luis Chamberlain wrote:
On Tue, Mar 26, 2024 at 01:38:13PM +0000, John Garry wrote:
From: Alan Adamson <alan.adamson@xxxxxxxxxx>
Add support to set block layer request_queue atomic write limits. The
limits will be derived from either the namespace or controller atomic
parameters.
NVMe atomic-related parameters are grouped into "normal" and "power-fail"
(or PF) class of parameter. For atomic write support, only PF parameters
are of interest. The "normal" parameters are concerned with racing reads
and writes (which also applies to PF). See NVM Command Set Specification
Revision 1.0d section 2.1.4 for reference.
Whether to use per namespace or controller atomic parameters is decided by
NSFEAT bit 1 - see Figure 97: Identify – Identify Namespace Data
Structure, NVM Command Set.
NVMe namespaces may define an atomic boundary, whereby no atomic guarantees
are provided for a write which straddles this per-lba space boundary. The
block layer merging policy is such that no merges may occur in which the
resultant request would straddle such a boundary.
Unlike SCSI, NVMe specifies no granularity or alignment rules, apart from
atomic boundary rule.
Larger IU drives a larger alignment *preference*, and it can be multiples
of the LBA format, it's called Namespace Preferred Write Granularity (NPWG)
and the NVMe driver already parses it. So say you have a 4k LBA format
but a 16k NPWG. I suspect this means we'd want atomics writes to align to 16k
but I can let Dan confirm.
If we need to be aligned to NPWG, then the min atomic write unit would
also need to be NPWG. Any NPWG relation to atomic writes is not defined
in the spec, AFAICS.
We simply use the LBA data size as the min atomic unit in this patch.
Note on NABSPF:
There seems to be some vagueness in the spec as to whether NABSPF applies
for NSFEAT bit 1 being unset. Figure 97 does not explicitly mention NABSPF
and how it is affected by bit 1. However Figure 4 does tell to check Figure
97 for info about per-namespace parameters, which NABSPF is, so it is
implied. However currently nvme_update_disk_info() does check namespace
parameter NABO regardless of this bit.
Yeah that its quirky.
Also today we set the physical block size to min(npwg, atomic) and that
means for a today's average 4k IU drive if they get 16k atomic the
physical block size would still be 4k. As the physical block size in
practice can also lift the sector size filesystems used it would seem
odd only a larger npwg could lift it.
It seems to me that if you want to provide atomic guarantees for this
large "physical block size", then it needs to be based on (N)AWUPF and NPWG.
So we may want to revisit this
eventually, specially if we have an API to do atomics properly across the
block layer.
Thanks,
John