Re: write atomicity with xfs ... current status?

[ Hi Frank, your email program is really badly mangling quoting and
line wrapping. Can you see if you can get it to behave better for
us? I think I've fixed it below. ]

On Tue, Mar 17, 2020 at 10:56:53PM +0000, Ober, Frank wrote:
> Thanks Dave and Darrick, adding Dimitri Kravtchuk from Oracle to
> this thread.
> 
> If Intel produced an SSD that was atomic at just the block size
> level (as in using awun - atomic write unit of the NVMe spec)

What is this "atomic block size" going to be, and how is it going to
be advertised to the block layer and filesystems?

> would that constitute that we could do the following

> > We've plumbed RWF_DSYNC to use REQ_FUA IO for pure overwrites if
> > the hardware supports it. We can do exactly the same thing for
> > RWF_ATOMIC - it succeeds if:
> > 
> > - we can issue it as a single bio
> > - the lower layers can take the entire atomic bio without
> >   splitting it. 
> > - we treat O_ATOMIC as O_DSYNC so that any metadata changes
> >   required also get synced to disk before signalling IO
> >   completion. If no metadata updates are required, then it's an
> >   open question as to whether REQ_FUA is also required with
> >   REQ_ATOMIC...
> > 
> > Anything else returns a "atomic write IO not possible" error.

So, as I said, you're agreeing that an atomic write is essentially a
variant of a data integrity write, but with stricter size and
alignment requirements than a normal RWF_DSYNC write?

> One design goal on the hw side, is to not slow the SSD down, the
> footprint of firmware code is smaller in an Optane SSD and we
> don't want to slow that down.

I really don't care what the impact on the SSD firmware size or
speed is - if the hardware can't guarantee atomic writes right down
to the physical media with full data integrity guarantees, and/or
doesn't advertise its atomic write limits to the OS and filesystem,
then it's simply not usable.

Please focus on correctness of behaviour first - speed is completely
irrelevant if we don't have correctness guarantees from the
hardware.

> What's the fastest approach for
> something like InnoDB writes? Can we take small steps that produce
> value for DirectIO and specific files which is common in databases
> architectures even 1 table per file ? Streamlining one block size
> that can be tied to specific file opens seems valuable.

Atomic writes have nothing to do with individual files. Either the
device under the filesystem can do atomic writes or it can't. What
files we do atomic writes to is irrelevant; what we need to know at
the filesystem level is the alignment and size restrictions on
atomic writes so we can allocate space appropriately and/or reject
user IO as out of bounds.

i.e. we already have size and alignment restrictions for direct IO
(typically single logical sector size). For atomic direct IO we will
have a different set of size and alignment restrictions, and like
the logical sector size, we need to get that from the hardware
somehow, and then make use of it in the filesystem appropriately.

Ideally the hardware would supply us with a minimum atomic IO size
and alignment and a maximum size. e.g. minimum might be the
physical sector size (we can always do atomic physical sector
size/aligned IOs) but the maximum is likely going to be some device
internal limit. If we require a minimum and maximum from the device
and the device only supports one atomic IO size, it can simply set
min = max.

Then it will be up to the filesystem to align extents to those
limits, and prevent user IOs that don't match the device
size/alignment restrictions placed on atomic writes...

But, first, you're going to need to get sane atomic write behaviour
standardised in the NVMe spec, yes? Otherwise nobody can use it
because we aren't guaranteed the same behaviour from device to
device...

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx


