RE: write atomicity with xfs ... current status?


 



Dave, 
Is the NVMe 1.4 specification really broken? It provides boundaries as noted.
https://nvmexpress.org/wp-content/uploads/NVM-Express-1_4-2019.06.10-Ratified.pdf

Check section 6.4, page 249. There are several ways to do this, and this ratified specification goes into considerable depth on atomic writes and describes what you are saying. I know you told me in another note that there is a glaring hole in the specifications, but is that hole still present in 1.4?

The layers above the drive could use Identify Controller/Namespace, through which the drive's controller advertises AWUN and AWUPF to anyone who asks. But if this were required to be an offset, all we can provide is either 512B or 4096B, which are the two NVMe block sizes that are atomic on our drives today.
If AWUN/AWUPF were the offset to our standard block size (512B or 4096B), would that work?

nvme id-ctrl /dev/nvme0n1 | grep aw
awun      : 0
awupf     : 0
Frank

-----Original Message-----
From: Dave Chinner <david@xxxxxxxxxxxxx> 
Sent: Tuesday, March 17, 2020 7:27 PM
To: Ober, Frank <frank.ober@xxxxxxxxx>
Cc: Darrick J. Wong <darrick.wong@xxxxxxxxxx>; Dimitri <dimitri.kravtchuk@xxxxxxxxxx>; linux-xfs@xxxxxxxxxxxxxxx; Barczak, Mariusz <mariusz.barczak@xxxxxxxxx>; Barajas, Felipe <felipe.barajas@xxxxxxxxx>
Subject: Re: write atomicity with xfs ... current status?

[ Hi Frank, your email program is really badly mangling quoting and line wrapping. Can you see if you can get it to behave better for us? I think I've fixed it below. ]

On Tue, Mar 17, 2020 at 10:56:53PM +0000, Ober, Frank wrote:
> Thanks Dave and Darrick, adding Dimitri Kravtchuk from Oracle to this 
> thread.
> 
> If Intel produced an SSD that was atomic at just the block size level 
> (as in using awun - atomic write unit of the NVMe spec)

What is this "atomic block size" going to be, and how is it going to be advertised to the block layer and filesystems?

> would that constitute that we could do the following

> > We've plumbed RWF_DSYNC to use REQ_FUA IO for pure overwrites if the 
> > hardware supports it. We can do exactly the same thing for 
> > RWF_ATOMIC - it succeeds if:
> > 
> > - we can issue it as a single bio
> > - the lower layers can take the entire atomic bio without
> >   splitting it. 
> > - we treat O_ATOMIC as O_DSYNC so that any metadata changes
> >   required also get synced to disk before signalling IO
> >   completion. If no metadata updates are required, then it's an
> >   open question as to whether REQ_FUA is also required with
> >   REQ_ATOMIC...
> > 
> > Anything else returns a "atomic write IO not possible" error.

So, as I said, you're agreeing that an atomic write is essentially a variant of a data integrity write, but with stricter size and alignment requirements than a normal RWF_DSYNC write?

> One design goal on the hw side, is to not slow the SSD down, the 
> footprint of firmware code is smaller in an Optane SSD and we don't 
> want to slow that down.

I really don't care what the impact on the SSD firmware size or speed is - if the hardware can't guarantee atomic writes right down to the physical media with full data integrity guarantees, and/or doesn't advertise its atomic write limits to the OS and filesystem, then it's simply not usable.

Please focus on correctness of behaviour first - speed is completely irrelevant if we don't have correctness guarantees from the hardware.

> What's the fastest approach for
> something like InnoDB writes? Can we take small steps that produce 
> value for DirectIO and specific files which is common in databases 
> architectures even 1 table per file ? Streamlining one block size that 
> can be tied to specific file opens seems valuable.

Atomic writes have nothing to do with individual files. Either the device under the filesystem can do atomic writes or it can't. What files we do atomic writes to is irrelevant; what we need to know at the filesystem level is the alignment and size restrictions on atomic writes, so we can allocate space appropriately and/or reject user IO as out of bounds.

i.e. we already have size and alignment restrictions for direct IO (typically single logical sector size). For atomic direct IO we will have a different set of size and alignment restrictions, and like the logical sector size, we need to get that from the hardware somehow, and then make use of it in the filesystem appropriately.

Ideally the hardware would supply us with a minimum atomic IO size and alignment, and a maximum size. e.g. the minimum might be the physical sector size (we can always do atomic physical sector sized/aligned IOs), but the maximum is likely going to be some device-internal limit. If we require a minimum and a maximum from the device, a device that only supports one atomic IO size can simply set min = max.

Then it will be up to the filesystem to align extents to those limits, and prevent user IOs that don't match the device size/alignment restrictions placed on atomic writes...

But, first, you're going to need to get sane atomic write behaviour standardised in the NVMe spec, yes? Otherwise nobody can use it because we aren't guaranteed the same behaviour from device to device...

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx



