Quoting Matthew Wilcox (2013-11-12 10:11:51)
> On Thu, Nov 07, 2013 at 08:52:20AM -0500, Chris Mason wrote:
> > Unfortunately, it's hard to say. I think the fusionio cards are the
> > only shipping devices that support this, but I've definitely heard that
> > others plan to support it as well. mariadb/percona already support the
> > atomics via fusionio specific ioctls, and turning that into a real
> > O_ATOMIC is a priority so other hardware can just hop on the train.
> >
> > This feature in general is pretty natural for the log structured squirrels
> > they stuff inside flash, so I'd expect everyone to support it. Matthew,
> > how do you feel about all of this?
>
> NVMe doesn't have support for this functionality. I know what stories I've
> heard from our internal device teams about what they can and can't support
> in the way of this kind of thing, but I obviously can't repeat them here!

There are some atomics in the NVMe spec though, correct? The minimum
needed for database use is only ~16-64K.

>
> I took a look at the SCSI Block Command spec. If I understand it
> correctly, SCSI would implement this with the WRITE USING TOKEN command.
> I don't see why it couldn't implement this API, though it seems like
> SCSI would prefer a separate setup step before the write comes in. I'm
> not sure that's a reasonable request to make of the application (nor
> am I sure I understand SBC correctly).

What kind of setup would we have to do? We have all the IO in hand, so
it can be organized in just about any way needed.

>
> I like the API, but I'm a little confused not to see a patch saying "Oh,
> and here's how we implemented it in btrfs without any hardware support"
> ;-) It seems to me that the concept is just as good a match for an
> advanced filesystem that supports snapshots as it is for the FTL inside
> a drive.

Grin, almost. Btrfs already does this...COW means that btrfs needs to
update metadata to point to new locations.
To avoid an ugly flush-all-the-io-every-commit mess, we track pending
writes and update the metadata when the write is fully on media.

We're missing a firm line that makes sure all the metadata updates for a
single write happen in the same transaction, but that part isn't hard.
We're missing good performance in database workloads, which is a
slightly bigger trick.

-chris
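
For illustration only, the ordering described above (data goes to a new location, and metadata is pointed at it only once the whole write is on media) can be sketched as a toy in-memory model. This is not btrfs code: the class `CowStore` and its methods are hypothetical names, and "on media" is simulated by an explicit `complete()` call.

```python
class CowStore:
    """Toy copy-on-write store (hypothetical, not btrfs).

    metadata maps a logical block number to a physical location.
    Readers only ever follow metadata, so they never see a partial write.
    """

    def __init__(self):
        self.media = {}      # physical location -> data that has landed
        self.metadata = {}   # logical block -> physical location
        self.pending = {}    # write id -> [(logical, physical, data), ...]
        self._next_loc = 0
        self._next_id = 0

    def submit(self, blocks):
        """Start a multi-block write. New locations are allocated for the
        data, but metadata is left untouched while the write is pending."""
        write_id = self._next_id
        self._next_id += 1
        extents = []
        for logical, data in blocks.items():
            extents.append((logical, self._next_loc, data))
            self._next_loc += 1
        self.pending[write_id] = extents
        return write_id

    def complete(self, write_id):
        """The write is fully on media: record the data, then switch every
        pointer for this write in a single metadata update."""
        extents = self.pending.pop(write_id)
        for _, physical, data in extents:
            self.media[physical] = data
        self.metadata.update({lg: ph for lg, ph, _ in extents})

    def read(self, logical):
        physical = self.metadata.get(logical)
        return None if physical is None else self.media[physical]


store = CowStore()
store.complete(store.submit({0: b"old0", 1: b"old1"}))

w = store.submit({0: b"new0", 1: b"new1"})
before = (store.read(0), store.read(1))   # still the old contents
store.complete(w)
after = (store.read(0), store.read(1))    # both blocks switch together
```

In this sketch the single `metadata.update()` call stands in for the "same transaction" requirement: either all of a write's pointers change or none do.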