Re: [PATCH 1/2] block: Add support for atomic writes

Matthew Wilcox <willy@xxxxxxxxxxxxxxx> · Wed, 13 Nov 2013 16:35:54 -0500

On Wed, Nov 13, 2013 at 03:44:38PM -0500, Chris Mason wrote:
> Quoting Matthew Wilcox (2013-11-12 10:11:51)
> > On Thu, Nov 07, 2013 at 08:52:20AM -0500, Chris Mason wrote:
> > > Unfortunately, it's hard to say.  I think the fusionio cards are the
> > > only shipping devices that support this, but I've definitely heard that
> > > others plan to support it as well.  mariadb/percona already support the
> > > atomics via fusionio specific ioctls, and turning that into a real
> > > O_ATOMIC is a priority so other hardware can just hop on the train.
> > > 
> > > This feature in general is pretty natural for the log structured squirrels
> > > they stuff inside flash, so I'd expect everyone to support it.  Matthew,
> > > how do you feel about all of this?
> > 
> > NVMe doesn't have support for this functionality.  I know what stories I've
> > heard from our internal device teams about what they can and can't support
> > in the way of this kind of thing, but I obviously can't repeat them here!
> 
> There are some atomics in the NVMe spec though, correct?  The minimum
> needed for database use is only ~16-64K.

Yes, NVMe has limited atomic support.  It has two fields:

  Atomic Write Unit Normal (AWUN): This field indicates the atomic write
  size for the controller during normal operation. This field is specified
  in logical blocks and is a 0’s based value. If a write is submitted
  of this size or less, the host is guaranteed that the write is atomic
  to the NVM with respect to other read or write operations. If a write
  is submitted that is greater than this size, there is no guarantee
  of atomicity.

  A value of FFFFh indicates all commands are atomic as this is the
  largest command size. It is recommended that implementations support
  a minimum of 128KB (appropriately scaled based on LBA size).

  Atomic Write Unit Power Fail (AWUPF): This field indicates the atomic
  write size for the controller during a power fail condition. This
  field is specified in logical blocks and is a 0’s based value. If a
  write is submitted of this size or less, the host is guaranteed that
  the write is atomic to the NVM with respect to other read or write
  operations. If a write is submitted that is greater than this size,
  there is no guarantee of atomicity.

Basically just exposing what is assumed to be true for SCSI and generally
assumed to be lies for ATA drives :-)
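
To make the units concrete, here is a rough sketch (plain userspace C, not
kernel code) of how a host might turn one of these fields into a byte count.
The helper names are made up for illustration; the only things taken from
the spec text above are that the field counts logical blocks and is 0's
based.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not an existing kernel or NVMe driver API.
 * Converts a 0's based AWUN/AWUPF field (in logical blocks) into bytes.
 * A field value of 0 therefore means one logical block.
 */
static inline uint64_t nvme_atomic_bytes(uint16_t field, uint32_t lba_size)
{
	assert(lba_size != 0);
	return ((uint64_t)field + 1) * lba_size;
}

/*
 * Hypothetical check: does a write of 'len' bytes fit inside the
 * advertised atomic write unit?  Larger writes get no guarantee.
 */
static inline bool nvme_write_is_atomic(uint64_t len, uint16_t field,
					uint32_t lba_size)
{
	return len <= nvme_atomic_bytes(field, lba_size);
}
```

With 512-byte logical blocks, a field value of 31 works out to 16K, which
is the low end of the range Chris mentions for databases, and 127 gives
the recommended 128 blocks (64K at that block size).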

> > I took a look at the SCSI Block Command spec.  If I understand it
> > correctly, SCSI would implement this with the WRITE USING TOKEN command.
> > I don't see why it couldn't implement this API, though it seems like
> > SCSI would prefer a separate setup step before the write comes in.  I'm
> > not sure that's a reasonable request to make of the application (nor
> > am I sure I understand SBC correctly).
> 
> What kind of setup would we have to do?  We have all the IO in hand, so
> it can be organized in just about any way needed.

Someone who understands SCSI better than I do assures me this is NOT the
proposal that allows SCSI devices to do scattered writes.  Apparently that
proposal is still in progress.  This appears to be true; from the t10
NEW list:

  12-087r6  SBC-4 Gathered reads, optionally atomic
            Rob Elliott, Ashish Batwara, Walt Hubis  (Missing)
  12-086r6  SBC-4 SPC-5 Scattered writes, optionally atomic
            Rob Elliott, Ashish Batwara, Walt Hubis  (Missing)

> Grin, almost Btrfs already does this...COW means that btrfs needs to
> update metadata to point to new locations.  To avoid an ugly
> flush-all-the-io-every-commit mess, we track pending writes and update
> the metadata when the write is fully on media.
> 
> We're missing a firm line that makes sure all the metadata updates for a
> single write happen in the same transaction, but that part isn't hard.
> 
> We're missing good performance in database workloads, which is a
> slightly bigger trick.

Yeah ... if only you could find a database company to ... oh, wait :-)
