On Wed, Nov 13, 2013 at 03:44:38PM -0500, Chris Mason wrote:
> Quoting Matthew Wilcox (2013-11-12 10:11:51)
> > On Thu, Nov 07, 2013 at 08:52:20AM -0500, Chris Mason wrote:
> > > Unfortunately, it's hard to say. I think the fusionio cards are the
> > > only shipping devices that support this, but I've definitely heard that
> > > others plan to support it as well. mariadb/percona already support the
> > > atomics via fusionio specific ioctls, and turning that into a real
> > > O_ATOMIC is a priority so other hardware can just hop on the train.
> > >
> > > This feature in general is pretty natural for the log structured squirrels
> > > they stuff inside flash, so I'd expect everyone to support it. Matthew,
> > > how do you feel about all of this?
> >
> > NVMe doesn't have support for this functionality. I know what stories I've
> > heard from our internal device teams about what they can and can't support
> > in the way of this kind of thing, but I obviously can't repeat them here!
>
> There are some atomics in the NVMe spec though, correct? The minimum
> needed for database use is only ~16-64K.

Yes, NVMe has limited atomic support. It has two fields:

Atomic Write Unit Normal (AWUN): This field indicates the atomic write
size for the controller during normal operation. This field is specified
in logical blocks and is a 0’s based value. If a write is submitted of
this size or less, the host is guaranteed that the write is atomic to
the NVM with respect to other read or write operations. If a write
is submitted that is greater than this size, there is no guarantee
of atomicity. A value of FFFFh indicates all commands are atomic as
this is the largest command size. It is recommended that implementations
support a minimum of 128KB (appropriately scaled based on LBA size).

Atomic Write Unit Power Fail (AWUPF): This field indicates the atomic
write size for the controller during a power fail condition. This field
is specified in logical blocks and is a 0’s based value. If a write is
submitted of this size or less, the host is guaranteed that the write is
atomic to the NVM with respect to other read or write operations. If a
write is submitted that is greater than this size, there is no guarantee
of atomicity.

Basically just exposing what is assumed to be true for SCSI and generally
assumed to be lies for ATA drives :-)

> > I took a look at the SCSI Block Command spec. If I understand it
> > correctly, SCSI would implement this with the WRITE USING TOKEN command.
> > I don't see why it couldn't implement this API, though it seems like
> > SCSI would prefer a separate setup step before the write comes in. I'm
> > not sure that's a reasonable request to make of the application (nor
> > am I sure I understand SBC correctly).
>
> What kind of setup would we have to do? We have all the IO in hand, so
> it can be organized in just about any way needed.

Someone who understands SCSI better than I do assures me that WRITE USING
TOKEN is NOT the proposal that allows SCSI devices to do scattered writes.
Apparently that proposal is still in progress. This appears to be true;
from the t10 NEW list:

12-087r6 SBC-4 Gathered reads, optionally atomic
         Rob Elliott, Ashish Batwara, Walt Hubis         Missing
12-086r6 SBC-4 SPC-5 Scattered writes, optionally atomic
         Rob Elliott, Ashish Batwara, Walt Hubis         Missing

> Grin, almost Btrfs already does this...COW means that btrfs needs to
> update metadata to point to new locations. To avoid an ugly
> flush-all-the-io-every-commit mess, we track pending writes and update
> the metadata when the write is fully on media.
>
> We're missing a firm line that makes sure all the metadata updates for a
> single write happen in the same transaction, but that part isn't hard.
>
> We're missing good performance in database workloads, which is a
> slightly bigger trick.

Yeah ... if only you could find a database company to ... oh, wait :-)
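
As a rough illustration of the "0's based" encoding in the AWUN/AWUPF
descriptions above, the sketch below converts a raw field value into a byte
count and checks whether a write of a given length fits inside the atomic
unit. The function names and the example LBA sizes are illustrative
assumptions, not taken from the NVMe spec or from any driver.

```python
def atomic_write_bytes(raw_field: int, lba_size: int) -> int:
    """Convert a 0's based AWUN/AWUPF value into a size in bytes.

    raw_field: the 16-bit value reported by the controller (hypothetical input).
    lba_size:  logical block size in bytes, assumed to be known by the caller.
    """
    if not 0 <= raw_field <= 0xFFFF:
        raise ValueError("AWUN/AWUPF is a 16-bit field")
    if lba_size <= 0:
        raise ValueError("LBA size must be positive")
    # 0's based: a raw value of 0 means one logical block.
    return (raw_field + 1) * lba_size


def write_is_atomic(nblocks: int, raw_field: int) -> bool:
    """True if a write of nblocks logical blocks is no larger than the atomic unit."""
    return 1 <= nblocks <= raw_field + 1
```

With 512-byte logical blocks, a raw value of 255 corresponds to 256 blocks,
which is the 128KB the field description recommends as a minimum; a raw
value of 31 would give 16KB, the low end of the database range mentioned
earlier in the thread.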