On Mon, Sep 28, 2015 at 08:08:13PM -0700, Dan Williams wrote: > On Mon, Sep 28, 2015 at 7:18 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote: > > On Mon, Sep 28, 2015 at 03:57:29PM -0700, Dan Williams wrote: > > I'm concerned with making what we have work before we go and change > > everything. You might want to move really quickly, but without sane > > filesystem support you can't ship anything worth a damn. There's all > > sorts of issues here, and introducing struct pages doesn't solve all > > of them. > > > > Let's concentrate on ensuring the basic operation of DAX is robust > > first - get the page fault vs extent manipulations serialised, sane > > and scalable before we start changing anything else. If we don't > > solve these problems, then nothing else we do will be reliable, and > > the problems exist regardless of whether we are using struct pages > > or not. Hence these are the critical problems we need to fix before > > anything else. > > > > Once we have these issues sorted out, switching between struct page > > and pfn should be much simpler because we don't have to worry about > > different locking strategies to protect against truncate, racing > > page faults, etc. > > It sounds like you have a page-independent/scalable method in mind for > solving the truncate protection problem? In mind? It was added to XFS back in 4.1 by commit 653c60b ("xfs: introduce mmap/truncate lock"). > I had always thought that > must require struct page, but if you're happy to carry that solution > in the filesystem you're not going to see resistance from me. The page based "truncate serialisation" solution in Linux has always been a horrible, nasty hack. It only works because we always modify the inode size before invalidating pages in the page cache. Hence we can check once we have the page lock to see if the page is beyond EOF. This EOF update hack avoids the need for a filesystem level lock to guarantee multi-page truncate invalidation atomicity. This, however, does not work for operations which require atomic multi-page invalidation within EOF over multiple long running operations. The page is not beyond EOF, so we have no way of determining from the page taht we are racing with an invalidation of the page. This affects operations such as hole punching, extent shifting up/down, swapping extents between different inodes, etc. These *require* the filesystem to be able to lock out page faults for the duration of the manipulation operation, as the extent manipulations may not be done atomically from the perspective of userspace driven page faults. They *are atomic* from the perspective of the read/write syscall interfaces thanks to the XFS_IOLOCK_* locking (usually the i_mutex provides this in other filesystems). The XFS_MMAPLOCK_[SHARED|EXCL] locking was added to provide this multi-page invalidation vs page fault serialisation to XFS. It works completely independently of any given struct page in the file, and so does not rely on DAX to use struct page to serialise correctly. I've solved this problem in XFS because a generic solution was not happening. Any filesystem that supports hole punching and other fallocate-based manipulations needs similar locking to be safe against page fault based /data corruption/ races in these operations. FWIW, direct IO also needs to exclude page faults for proper mmap coherency, but that's much harder problem to solve because direct IO takes the mmap_sem below the filesystem, and hence we can't hold any locks over DIO submission that might be required in a page fault within get_user_pages().... > >> I do not think introducing page-back persistent memory sets us back to > >> square 1. Instead, given the functionality that is enabled when pages > >> are present I think it is safe to assume most platforms will arrange > >> for page backed persistent memory. > > > > Sure, but it will take a little time to get there. Moving fast > > doesn't help us here - it only results in stuff we have to revert or > > redo in the near future and that means progress is much slower than > > it should be. Let's solve the DAX problems in the right order - it > > will make things simpler and faster down the road. > > Sounds workable, although this thread is missing an ext4 > representative so far. Hopefully ext4 is equally open to solving > these problems generically without struct page. It appears to me that none of the ext4 developers have shown any interest in fixing the problems reported with DAX. Some of those problems existed before DAX came along (e.g. unwritten extent conversion issues that can cause data corruption) but I've seen no interest in fixing them, either. Hence I'm quite happy to drop ext4 DAX support until an ext4 dev steps up and fixes the issues that need fixing... Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html