I think the ultimate goal here has to be to have the truncate code lock
the DAX entry in the radix tree and delete it.  Then we can have
do_cow_fault() unlock the radix tree entry instead of the i_mmap_lock.
So we'll need another element in struct vm_fault where we can pass back
a pointer into the radix tree instead of a pointer to struct page (or
add another bit to VM_FAULT_ that indicates that 'page' is not actually
a page, but a pointer to an exceptional entry ... or have the MM code
understand the exceptional bit ... there are a few ways we can go here).

-----Original Message-----
From: Jan Kara [mailto:jack@xxxxxxx]
Sent: Monday, March 14, 2016 3:01 AM
To: Wilcox, Matthew R
Cc: Jan Kara; linux-fsdevel@xxxxxxxxxxxxxxx; Ross Zwisler; Williams, Dan J; linux-nvdimm@xxxxxxxxxxxx; NeilBrown
Subject: Re: [PATCH 05/12] dax: Remove synchronization using i_mmap_lock

On Thu 10-03-16 20:10:09, Wilcox, Matthew R wrote:
> Here's the race:
>
> CPU 0                         CPU 1
> do_cow_fault()
> __do_fault()
> takes sem
> dax_fault()
> releases sem
>                               truncate()
>                               unmap_mapping_range()
>                               i_mmap_lock_write()
>                               unmap_mapping_range_tree()
>                               i_mmap_unlock_write()
> do_set_pte()
>
> Holding i_mmap_lock_read() from inside __do_fault() prevents the truncate
> from proceeding until the page is inserted with do_set_pte().

Ah, right. Thanks for reminding me. I was hoping to get rid of this
i_mmap_lock abuse in DAX code but obviously it needs more work :).

								Honza

> -----Original Message-----
> From: Jan Kara [mailto:jack@xxxxxxx]
> Sent: Thursday, March 10, 2016 12:05 PM
> To: Wilcox, Matthew R
> Cc: Jan Kara; linux-fsdevel@xxxxxxxxxxxxxxx; Ross Zwisler; Williams, Dan J; linux-nvdimm@xxxxxxxxxxxx; NeilBrown
> Subject: Re: [PATCH 05/12] dax: Remove synchronization using i_mmap_lock
>
> On Thu 10-03-16 19:55:21, Wilcox, Matthew R wrote:
> > This locking's still necessary.  i_mmap_sem has already been released by
> > the time we're back in do_cow_fault(), so it doesn't protect that page,
> > and truncate can have whizzed past and thinks there's nothing to unmap.
> > So a task can have a MAP_PRIVATE page still in its address space after
> > it's supposed to have been unmapped.
>
> I don't think this is possible. Filesystem holds its inode->i_mmap_sem for
> reading when handling the fault. That synchronizes against truncate...
>
> 								Honza
>
> > -----Original Message-----
> > From: Jan Kara [mailto:jack@xxxxxxx]
> > Sent: Thursday, March 10, 2016 11:19 AM
> > To: linux-fsdevel@xxxxxxxxxxxxxxx
> > Cc: Wilcox, Matthew R; Ross Zwisler; Williams, Dan J; linux-nvdimm@xxxxxxxxxxxx; NeilBrown; Jan Kara
> > Subject: [PATCH 05/12] dax: Remove synchronization using i_mmap_lock
> >
> > At one point DAX used i_mmap_lock to synchronize page faults with page
> > table invalidation during truncate. However these days DAX uses
> > filesystem specific RW semaphores to protect against these races
> > (i_mmap_sem in ext2 & ext4 cases, XFS_MMAPLOCK in xfs case). So remove
> > the unnecessary locking.
> >
> > Signed-off-by: Jan Kara <jack@xxxxxxx>
> > ---
> >  fs/dax.c    | 19 -------------------
> >  mm/memory.c | 14 --------------
> >  2 files changed, 33 deletions(-)
> >
> > diff --git a/fs/dax.c b/fs/dax.c
> > index 9c4d697fb6fc..e409e8fc13b7 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -563,8 +563,6 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
> >  	pgoff_t size;
> >  	int error;
> >
> > -	i_mmap_lock_read(mapping);
> > -
> >  	/*
> >  	 * Check truncate didn't happen while we were allocating a block.
> >  	 * If it did, this block may or may not be still allocated to the
> > @@ -597,8 +595,6 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
> >  	error = vm_insert_mixed(vma, vaddr, dax.pfn);
> >
> >  out:
> > -	i_mmap_unlock_read(mapping);
> > -
> >  	return error;
> >  }
> >
> > @@ -695,17 +691,6 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
> >  		if (error)
> >  			goto unlock_page;
> >  		vmf->page = page;
> > -		if (!page) {
> > -			i_mmap_lock_read(mapping);
> > -			/* Check we didn't race with truncate */
> > -			size = (i_size_read(inode) + PAGE_SIZE - 1) >>
> > -								PAGE_SHIFT;
> > -			if (vmf->pgoff >= size) {
> > -				i_mmap_unlock_read(mapping);
> > -				error = -EIO;
> > -				goto out;
> > -			}
> > -		}
> >  		return VM_FAULT_LOCKED;
> >  	}
> >
> > @@ -895,8 +880,6 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> >  		truncate_pagecache_range(inode, lstart, lend);
> >  	}
> >
> > -	i_mmap_lock_read(mapping);
> > -
> >  	/*
> >  	 * If a truncate happened while we were allocating blocks, we may
> >  	 * leave blocks allocated to the file that are beyond EOF.  We can't
> > @@ -1013,8 +996,6 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> >  	}
> >
> >  out:
> > -	i_mmap_unlock_read(mapping);
> > -
> >  	if (buffer_unwritten(&bh))
> >  		complete_unwritten(&bh, !(result & VM_FAULT_ERROR));
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 8132787ae4d5..13f76eb08f33 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -2430,8 +2430,6 @@ void unmap_mapping_range(struct address_space *mapping,
> >  	if (details.last_index < details.first_index)
> >  		details.last_index = ULONG_MAX;
> >
> > -
> > -	/* DAX uses i_mmap_lock to serialise file truncate vs page fault */
> >  	i_mmap_lock_write(mapping);
> >  	if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap)))
> >  		unmap_mapping_range_tree(&mapping->i_mmap, &details);
> > @@ -3019,12 +3017,6 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> >  		if (fault_page) {
> >  			unlock_page(fault_page);
> >  			page_cache_release(fault_page);
> > -		} else {
> > -			/*
> > -			 * The fault handler has no page to lock, so it holds
> > -			 * i_mmap_lock for read to protect against truncate.
> > -			 */
> > -			i_mmap_unlock_read(vma->vm_file->f_mapping);
> >  		}
> >  		goto uncharge_out;
> >  	}
> > @@ -3035,12 +3027,6 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> >  	if (fault_page) {
> >  		unlock_page(fault_page);
> >  		page_cache_release(fault_page);
> > -	} else {
> > -		/*
> > -		 * The fault handler has no page to lock, so it holds
> > -		 * i_mmap_lock for read to protect against truncate.
> > -		 */
> > -		i_mmap_unlock_read(vma->vm_file->f_mapping);
> >  	}
> >  	return ret;
> >  uncharge_out:
> > --
> > 2.6.2
> >
> --
> Jan Kara <jack@xxxxxxxx>
> SUSE Labs, CR
--
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR
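
For anyone following the race discussed in this thread, the two orderings
can be replayed in user space with a rough sketch like the one below.  It
is a single-threaded model, not kernel code: struct toy_mapping, its
reader count standing in for i_mmap_lock, and every toy_* function are
hypothetical names made up for illustration, and the "PTE" is just a flag.
A real writer would sleep on the lock; here the truncate step simply
reports that it would have blocked.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model; none of these names are kernel APIs. */
struct toy_mapping {
	int readers;       /* read-side holders of the stand-in for i_mmap_lock */
	bool pte_present;  /* stand-in for the PTE of one MAP_PRIVATE page */
	bool truncated;    /* set once the truncate path has unmapped the range */
};

static void toy_lock_read(struct toy_mapping *m)
{
	m->readers++;
}

static void toy_unlock_read(struct toy_mapping *m)
{
	assert(m->readers > 0);
	m->readers--;
}

/*
 * Models truncate() -> unmap_mapping_range().  Returns false when a
 * fault still holds the lock for read, i.e. the writer would block.
 */
static bool toy_truncate(struct toy_mapping *m)
{
	if (m->readers > 0)
		return false;
	m->pte_present = false;  /* unmap whatever is mapped right now */
	m->truncated = true;
	return true;
}

/* Ordering from the race diagram: nothing is held across do_set_pte(). */
static bool toy_replay_unlocked(void)
{
	struct toy_mapping m = { 0, false, false };

	/* CPU 0: dax_fault() has returned, filesystem sem already released */
	toy_truncate(&m);        /* CPU 1: finds nothing to unmap */
	m.pte_present = true;    /* CPU 0: do_set_pte() */
	return m.truncated && m.pte_present;  /* true: stale mapping survives */
}

/* Same ordering, but the read lock is kept until after do_set_pte(). */
static bool toy_replay_locked(void)
{
	struct toy_mapping m = { 0, false, false };
	bool ran_early;

	toy_lock_read(&m);             /* CPU 0: taken inside the fault handler */
	ran_early = toy_truncate(&m);  /* CPU 1: has to wait */
	m.pte_present = true;          /* CPU 0: do_set_pte() */
	toy_unlock_read(&m);           /* CPU 0: do_cow_fault() drops the lock */
	toy_truncate(&m);              /* CPU 1: now runs and unmaps the page */
	return ran_early || (m.truncated && m.pte_present);
}
```

In this model toy_replay_unlocked() reports a stale mapping and
toy_replay_locked() does not, which is the property the
i_mmap_lock_read() in the fault path was providing for the case where
there is no struct page to lock.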