On Thu, Apr 24, 2014 at 02:50:23PM -0400, Chris Mason wrote:
> On 04/24/2014 02:23 PM, Dan Williams wrote:
> > On Thu, Apr 24, 2014 at 11:03 AM, Chris Mason <clm@xxxxxx> wrote:
> > > On 04/24/2014 01:39 PM, Matthew Wilcox wrote:
> > > > NVMe allows the drive to tell the host what atomicity guarantees it
> > > > provides for a write command. At the moment, I don't think Linux has
> > > > a way for the driver to pass that information up to the filesystem.
> > > >
> > > > The value that is most interesting to report is Atomic Write Unit
> > > > Power Fail ("if you send a write no larger than this, the drive
> > > > guarantees to write all of it or none of it"), minimum value 1
> > > > sector. [1]
> > > >
> > > > There's a proposal before the NVMe workgroup to add a boundary
> > > > size/offset to modify AWUPF ("except if you cross this boundary,
> > > > then AWUPF is not guaranteed"). Think RAID stripe crossing.
> > > >
> > > > So, three questions. Is there somewhere already to pass boundary
> > > > information up to the filesystem? Can filesystems make use of a
> > > > larger atomic write unit than a single sector? And, if the device
> > > > is internally a RAID device, is knowing the boundary size/offset
> > > > useful?
> > > >
> > > > [1] There is also Atomic Write Unit Normal ("if you send two writes,
> > > > neither of which is larger than this, subsequent reads will get
> > > > either one or the other, not a mixture of both"), which I don't
> > > > think we care about because the page cache prevents us from sending
> > > > two writes which overlap with each other.
> > >
> > > I think we really need the atomics to be vectored. Send N writes
> > > which as a unit are not larger than X, but which may span anywhere on
> > > the device. An array with a writeback cache, or a log structured
> > > squirrel in the FTL, should be able to provide this pretty easily?
> > >
> > > The immediate use case is mysql (16K writes) on a fragmented
> > > filesystem. The FS needs to be able to collect a single atomic write
> > > made up of N 4K sectors.
> >
> > How big does N need to be before it starts to be generally useful?
> > Here it seems we're talking on the order of tens of writes, but for
> > the upper bound Dave said that N could be in the hundreds of thousands
>
> Right, if you ask the filesystem guys, we'll want to dump the entire
> contents of RAM down to the storage in atomic fashion. I do agree
> with Dave here, bigger is definitely better.

Right, bigger is better, but what about minimum requirements? The minimum
requirement I need for converting XFS is around 4MB of discontiguous
single-sector IOs for the worst case event. That covers the largest
*single* atomic transaction log reservation we currently make on XFS at
64k block sizes.

> 16K and up are useful, depending on which workload you're targeting.
> The fusion devices can do 1MB.

User data workloads, yes. The moment we start thinking about atomic
filesystem metadata updates, the requirements go way, way up....

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx
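
To make the AWUPF-plus-boundary semantics described in the thread
concrete, here is a minimal sketch in C. The struct and function names
are invented for illustration and come from no kernel or NVMe header;
the logic is just the two conditions stated above: the write must be no
larger than AWUPF, and it must not straddle one of the proposed
boundaries (which sit at boundary_offset + k * boundary_sectors).

#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative only: these names come from no kernel or NVMe header.
 * awupf_sectors is the drive's Atomic Write Unit Power Fail limit;
 * boundary_sectors/boundary_offset model the proposed boundary
 * extension (boundary_sectors == 0 means no boundary constraint).
 */
struct atomic_limits {
	uint32_t awupf_sectors;		/* writes <= this are all-or-nothing */
	uint32_t boundary_sectors;	/* atomicity is void across this stride */
	uint32_t boundary_offset;	/* where the stride pattern starts,
					 * assumed < boundary_sectors */
};

/*
 * Does a write of len sectors at LBA lba get the power-fail guarantee?
 * Two conditions, per the thread above: no larger than AWUPF, and not
 * straddling a boundary.
 */
static bool write_is_power_fail_atomic(const struct atomic_limits *l,
					uint64_t lba, uint32_t len)
{
	if (len == 0 || len > l->awupf_sectors)
		return false;
	if (l->boundary_sectors) {
		/*
		 * Stride index of the first and last sector of the write;
		 * the +size bias keeps the unsigned arithmetic from
		 * wrapping when lba < boundary_offset.
		 */
		uint64_t size = l->boundary_sectors;
		uint64_t first = (lba + size - l->boundary_offset) / size;
		uint64_t last = (lba + len - 1 + size - l->boundary_offset)
				/ size;
		if (first != last)
			return false;	/* crosses a boundary */
	}
	return true;
}

For example, with boundary_sectors = 8 and boundary_offset = 0, a
4-sector write at LBA 6 spans sectors 6-9 and crosses the boundary at
sector 8, so it would not carry the power-fail guarantee even if it
fits within AWUPF; this is the RAID-stripe-crossing case Matthew
mentions.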
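
Chris's vectored-atomic proposal can likewise be sketched as a
hypothetical interface. Nothing like this existed at the time, and the
names below are invented purely to pin down what "N writes which as a
unit are not larger than X" would look like at the block layer
boundary:

#include <stdint.h>

/*
 * Hypothetical interface sketch, invented for illustration only.
 * Each segment is contiguous on disk, but segments may land anywhere
 * on the device; the device commits all of them or none of them across
 * a power failure, provided the total payload stays within the
 * advertised limit X.
 */
struct atomic_write_seg {
	uint64_t lba;		/* starting sector of this segment */
	uint32_t nr_sectors;	/* contiguous length of this segment */
	uint32_t pad;
	void *data;		/* payload, nr_sectors * 512 bytes */
};

struct atomic_write_vec {
	uint32_t nr_segs;		/* the N in the discussion above */
	struct atomic_write_seg *segs;
};

/*
 * Sizing from the thread: a 16K MySQL page on a fragmented filesystem
 * needs nr_segs = 4 segments of 4K each; Dave's XFS worst case of
 * ~4MB of discontiguous single-sector IOs needs
 * nr_segs = 4MB / 512B = 8192.
 */

The vectoring is what makes the MySQL case work at all: on a fragmented
filesystem the 16K page maps to non-adjacent 4K extents, so a single
contiguous AWUPF-sized write cannot cover it, and the segments have to
be bound together into one atomic unit instead.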