On Tue, Aug 11, 2015 at 07:17:12PM +0300, Boaz Harrosh wrote: > On 08/11/2015 06:28 PM, Kirill A. Shutemov wrote: > > We also used lock_page() to make sure we shoot out all pages as we don't > > exclude page faults during truncate. Consider this race: > > > > <fault> <truncate> > > get_block > > check i_size > > update i_size > > unmap > > setup pte > > > > Please consider this senario then: > > <fault> <truncate> > read_lock(inode) > > get_block > check i_size > > read_unlock(inode) > > write_lock(inode) > > update i_size > * remove allocated blocks > unmap > > write_unlock(inode) > > setup pte > > IS what you suppose to do in xfs Do you realize that you describe a race? :-P Exactly in this scenario pfn your pte point to is not belong to the file anymore. Have fun. > > With normal page cache we make sure that all pages beyond i_size is > > dropped using lock_page() in truncate_inode_pages_range(). > > > > Yes there is no truncate_inode_pages_range() in DAX again radix tree is > empty. > > Please do you have a reproducer I would like to see this race and also > experiment with xfs (I guess you saw it in ext4) I don't. And I don't see how race like above can be FS-specific. All critical points in generic code. > > For DAX we need a way to stop all page faults to the pgoff range before > > doing unmap. > > > > Why ? Because you can end up with ptes pointing to pfns which fs consider not be part of the file. <truncate> <fault> unmap.. fault in pfn which unmap already unmapped ..continue unmap > >> Because with DAX there is no inode->mapping "mapping" at all. You have the call > >> into the FS with get_block() to replace "holes" (zero pages) with real allocated > >> blocks, on WRITE faults, but this conversion should be protected inside the FS > >> already. Then there is the atomic exchange of the PTE which is fine. > >> (And vis versa with holes mapping and writes) > > > > Having unmap_mapping_range() in PMD fault handling is very unfortunate. > > Go to rmap just to solve page fault is very wrong. > > BTW, we need to do it in write path too. > > > > Only the write path and only when we exchange a zero-page (hole) with > a new allocated (written to) page. Both write fault and/or write-path No. Always on new BH. We don't have anything (except rmap) to find out if any other process have zero page for the pgoff. > > I'm not convinced that all these "let's avoid backing storage allocation" > > in DAX code is not layering violation. I think the right place to solve > > this is filesystem. And we have almost all required handles for this in > > place. We only need to change vm_ops->page_mkwrite() interface to be able > > to return different page than what was given on input. > > > > What? there is no page returned for DAX page_mkwrite(), it is all insert_mixed > with direct pmd. That was bogus idea, please ignore. > > Hm. Where does XFS take this read-write lock in fault path? > > > > IIUC, truncation vs. page fault serialization relies on i_size being > > updated before doing truncate_pagecache() and checking i_size under > > page_lock() on fault side. We don't have i_size fence for punch hole. > > > > again truncate_pagecache() is NONE. > And yes the read-write locking will protect punch-hole just as truncate > see my locking senario above. Do you mean as racy as your truncate scenario? ;-P -- Kirill A. Shutemov -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html