Re: [PATCH 1/2] block: Add support for atomic writes

Chris Mason <chris.mason@xxxxxxxxxxxx> · Wed, 13 Nov 2013 15:44:38 -0500

Quoting Matthew Wilcox (2013-11-12 10:11:51)
> On Thu, Nov 07, 2013 at 08:52:20AM -0500, Chris Mason wrote:
> > Unfortunately, it's hard to say.  I think the fusionio cards are the
> > only shipping devices that support this, but I've definitely heard that
> > others plan to support it as well.  mariadb/percona already support the
> > atomics via fusionio specific ioctls, and turning that into a real
> > O_ATOMIC is a priority so other hardware can just hop on the train.
> > 
> > This feature in general is pretty natural for the log structured squirrels
> > they stuff inside flash, so I'd expect everyone to support it.  Matthew,
> > how do you feel about all of this?
> 
> NVMe doesn't have support for this functionality.  I know what stories I've
> heard from our internal device teams about what they can and can't support
> in the way of this kind of thing, but I obviously can't repeat them here!

There are some atomics in the NVMe spec though, correct?  The minimum
needed for database use is only ~16-64K.

> 
> I took a look at the SCSI Block Command spec.  If I understand it
> correctly, SCSI would implement this with the WRITE USING TOKEN command.
> I don't see why it couldn't implement this API, though it seems like
> SCSI would prefer a separate setup step before the write comes in.  I'm
> not sure that's a reasonable request to make of the application (nor
> am I sure I understand SBC correctly).

What kind of setup would we have to do?  We have all the IO in hand, so
it can be organized in just about any way needed.

> 
> I like the API, but I'm a little confused not to see a patch saying "Oh,
> and here's how we implemented it in btrfs without any hardware support"
> ;-)  It seems to me that the concept is just as good a match for an
> advanced filesystem that supports snapshots as it is for the FTL inside
> a drive.

Grin, Btrfs almost does this already...COW means that btrfs needs to
update metadata to point to new locations.  To avoid an ugly
flush-all-the-io-every-commit mess, we track pending writes and update
the metadata when the write is fully on media.

We're missing a firm line that makes sure all the metadata updates for a
single write happen in the same transaction, but that part isn't hard.
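
As a rough illustration of the scheme described above, here is a toy
Python model.  It is not btrfs code; the class and method names
(CowFile, submit_write, complete_io) are hypothetical.  It assumes each
write gets fresh physical blocks (COW), that completions arrive one
block at a time, and that the metadata is switched over only once every
block of the write is on media, in one step standing in for a single
transaction.

```python
class CowFile:
    """Toy copy-on-write file: metadata maps logical block -> physical block."""

    def __init__(self):
        self.metadata = {}        # committed mapping: logical -> physical
        self.media = {}           # physical block -> data that reached media
        self.pending = {}         # write id -> new mapping + outstanding blocks
        self.next_physical = 0
        self.next_write_id = 0

    def submit_write(self, blocks):
        """Start a multi-block write. blocks is {logical: data}.

        Returns (write_id, ios), where ios is a list of (physical, data).
        Nothing is visible to readers yet.
        """
        new_map = {}
        ios = []
        for logical, data in blocks.items():
            physical = self.next_physical   # COW: never overwrite in place
            self.next_physical += 1
            new_map[logical] = physical
            ios.append((physical, data))
        write_id = self.next_write_id
        self.next_write_id += 1
        self.pending[write_id] = {
            "map": new_map,
            "outstanding": {physical for physical, _ in ios},
        }
        return write_id, ios

    def complete_io(self, write_id, physical, data):
        """Called when one block of a pending write is on media."""
        self.media[physical] = data
        entry = self.pending[write_id]
        entry["outstanding"].discard(physical)
        if not entry["outstanding"]:
            # Every block is on media: apply all metadata updates for this
            # write together, modelling "the same transaction".
            self.metadata.update(entry["map"])
            del self.pending[write_id]

    def read(self, logical):
        physical = self.metadata.get(logical)
        return None if physical is None else self.media[physical]
```

A reader of this model sees either all of the old blocks or all of the
new ones, never a mix, because the old mapping stays in place until the
last completion arrives.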

We're missing good performance in database workloads, which is a
slightly bigger trick.

-chris
