Re: [RFC 0/2] New MAP_PMEM_AWARE mmap flag

On Thu, Feb 25, 2016 at 01:08:28PM -0800, Phil Terry wrote:
> On 02/25/2016 12:15 PM, Dave Chinner wrote:
> >On Thu, Feb 25, 2016 at 02:11:49PM -0500, Jeff Moyer wrote:
> >>Jeff Moyer <jmoyer@xxxxxxxxxx> writes:
> >>
> >>>>The big issue we have right now is that we haven't made the DAX/pmem
> >>>>infrastructure work correctly and reliably for general use.  Hence
> >>>>adding new APIs to work around cases where we haven't yet provided
> >>>>correct behaviour, let alone optimised for performance, is, quite
> >>>>frankly, a clear case of premature optimisation.
> >>>Again, I see the two things as separate issues.  You need both.
> >>>Implementing MAP_SYNC doesn't mean we don't have to solve the bigger
> >>>issue of making existing applications work safely.
> >>I want to add one more thing to this discussion, just for the sake of
> >>clarity.  When I talk about existing applications and pmem, I mean
> >>applications that already know how to detect and recover from torn
> >>sectors.  Any application that assumes hardware does not tear sectors
> >>should be run on a file system layered on top of the btt.
> >Which turns off DAX, and hence makes this a moot discussion because
> >mmap is then buffered through the page cache and hence applications
> >*must use msync/fsync* to provide data integrity. Which also makes
> >them safe to use with DAX if we have a working fsync.
> >
> >Keep in mind that existing storage technologies tear filesystem data
> >writes, too, because user data writes are filesystem block sized and
> >not atomic at the device level (i.e.  typical is 512 byte sector, 4k
> >filesystem block size, so there are 7 points in a single write where
> >a tear can occur on a crash).
> Is that really true? Storage to date is on the PCIE/SATA etc IO
> chain. The locks and application crash scenarios when traversing
> down this chain are such that the device will not have its DMA
> programmed until the whole 4K etc page is flushed to memory, pinned

Has nothing to do with DMA semantics. Storage devices we have to
deal with have volatile write caches, and we can't assume anything
about what they write when power fails except that single sector
writes are atomic.

> In both cases, btt is not indirecting the buffer (as for a DMA
> master IO type device) but is simply using the same pmem api
> primitives to manage its own meta data about the filesystem writes
> to detect and recover from tears after the event. In what sense is
> DAX disabled for this?

BTT is, IIRC, using write-ahead logging to stage every IO into pmem
so that after a crash the entire write can be recovered and replayed
to overwrite any torn sectors. This requires buffering at page cache
level, as direct writes to the pmem will not get logged. Hence DAX
cannot be used on BTT devices. Indeed:

static const struct block_device_operations btt_fops = {
        .owner =                THIS_MODULE,
        .rw_page =              btt_rw_page,
        .getgeo =               btt_getgeo,
        .revalidate_disk =      nvdimm_revalidate_disk,
};

There's no .direct_access method implemented for btt devices, so
it's clear that filesystems on BTT devices cannot enable DAX.
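
As a rough illustration of the kind of check this implies, here is a
minimal userspace sketch. It is not kernel code: struct mock_bdev_ops,
the mock_* helpers and can_enable_dax() are hypothetical names made up
for this example, and the method signatures are simplified. The only
point is that a missing .direct_access pointer is enough to rule out
DAX for a device.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct block_device_operations; only the
 * two methods relevant to this discussion are modelled, and their
 * signatures are simplified assumptions. */
struct mock_bdev_ops {
	int  (*rw_page)(void *dev, unsigned long sector, void *page, int rw);
	long (*direct_access)(void *dev, unsigned long sector, void **kaddr);
};

/* Dummy method bodies so the tables below have something to point at. */
static int mock_rw_page(void *dev, unsigned long sector, void *page, int rw)
{
	(void)dev; (void)sector; (void)page; (void)rw;
	return 0;
}

static long mock_direct_access(void *dev, unsigned long sector, void **kaddr)
{
	(void)dev; (void)sector; (void)kaddr;
	return 0;
}

/* Mirrors btt_fops above: no .direct_access, so the pointer is NULL. */
const struct mock_bdev_ops mock_btt_ops = {
	.rw_page = mock_rw_page,
};

/* A device that does provide direct access (assumed, for contrast). */
const struct mock_bdev_ops mock_pmem_ops = {
	.rw_page       = mock_rw_page,
	.direct_access = mock_direct_access,
};

/* A filesystem can only consider DAX when the underlying device
 * exposes a direct_access method. Returns 1 if it does, 0 otherwise. */
int can_enable_dax(const struct mock_bdev_ops *ops)
{
	return ops != NULL && ops->direct_access != NULL;
}
```

With these tables, can_enable_dax(&mock_btt_ops) returns 0 and
can_enable_dax(&mock_pmem_ops) returns 1.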

> So I think (please correct me if I'm wrong) but actually the
> hardware/firmware guys have been fixing the torn sector problem for

I was not talking about torn /sectors/. I was talking about a user
data write being made up of *multiple sectors*, and so there is no
atomicity guarantee for a user data write on existing storage when
the filesystem block size (user data IO size) is larger than the
device sector size. 
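
To put a number on that, here is a small sketch of the arithmetic
behind the "7 points" in the 512 byte sector / 4k block example quoted
above. tear_points() is a hypothetical helper written for this mail; it
assumes each sector is written atomically and that the write is a whole
number of sectors.

```c
#include <assert.h>
#include <stddef.h>

/* Number of sector boundaries inside a single write of write_size
 * bytes. A crash at any of these boundaries can leave the earlier
 * sectors holding new data and the later sectors holding old data.
 * Assumes individual sector writes are atomic. */
size_t tear_points(size_t write_size, size_t sector_size)
{
	size_t sectors;

	assert(sector_size > 0);
	assert(write_size % sector_size == 0);

	sectors = write_size / sector_size;
	return sectors ? sectors - 1 : 0;
}
```

tear_points(4096, 512) gives 7, while tear_points(4096, 4096) gives 0,
which is the case where the user data IO size matches the size the
device writes atomically.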

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
