The patch titled Fix the race between invalidate_inode_pages and do_no_page has been added to the -mm tree. Its filename is fix-the-race-between-invalidate_inode_pages-and-do_no_page.patch See http://www.zip.com.au/~akpm/linux/patches/stuff/added-to-mm.txt to find out what to do about this ------------------------------------------------------ Subject: Fix the race between invalidate_inode_pages and do_no_page From: Nick Piggin <nickpiggin@xxxxxxxxxxxx> Andrea Arcangeli identified a subtle race between invalidation of pages from pagecache with userspace mappings, and do_no_page. The issue is that invalidation has to shoot down all mappings to the page, before it can be discarded from the pagecache. Between shooting down ptes to a particular page, and actually dropping the struct page from the pagecache, do_no_page from any process might fault on that page and establish a new mapping to the page just before it gets discarded from the pagecache. The most common case where such invalidation is used is in file truncation. This case was catered for by doing a sort of open-coded seqlock between the file's i_size, and its truncate_count. Truncation will decrease i_size, then increment truncate_count before unmapping userspace pages; do_no_page will read truncate_count, then find the page if it is within i_size, and then check truncate_count under the page table lock and back out and retry if it had subsequently been changed (ptl will serialise against unmapping, and ensure a potentially updated truncate_count is actually visible). Complexity and documentation issues aside, the locking protocol fails in the case where we would like to invalidate pagecache inside i_size. do_no_page can come in anytime and filemap_nopage is not aware of the invalidation in progress (as it is when it is outside i_size). The end result is that dangling (->mapping == NULL) pages that appear to be from a particular file may be mapped into userspace with nonsense data. Valid mappings to the same place will see a different page. Andrea implemented two working fixes, one using a real seqlock, another using a page->flags bit. He also proposed using the page lock in do_no_page, but that was initially considered too heavyweight. However, it is not a global or per-file lock, and the page cacheline is modified in do_no_page to increment _count and _mapcount anyway, so a further modification should not be a large performance hit. Scalability is not an issue. This patch implements this latter approach. ->nopage implementations return with the page locked if it is possible for their underlying file to be invalidated (in that case, they must set a special vm_flags bit to indicate so). do_no_page only unlocks the page after setting up the mapping completely. invalidation is excluded because it holds the page lock during invalidation of each page. This allows significant simplifications in do_no_page. kbuild performance is, surprisingly, maybe slightly improved: 2xP4 Xeon 3.6GHz with HT: kbuild -j8 in shmfs original: 505.16user 28.59system 2:15.87elapsed 392%CPU 505.08user 28.28system 2:15.81elapsed 392%CPU 504.16user 29.28system 2:15.85elapsed 392%CPU patched: 502.35user 26.86system 2:14.77elapsed 392%CPU 501.89user 27.15system 2:14.86elapsed 392%CPU 502.22user 27.01system 2:14.87elapsed 392%CPU kbuild -j8 in ext3 original: 505.65user 27.72system 2:15.75elapsed 392%CPU 506.38user 26.73system 2:16.38elapsed 390%CPU 507.00user 26.21system 2:15.76elapsed 392%CPU patched: 501.03user 28.67system 2:14.98elapsed 392%CPU 500.84user 29.34system 2:14.99elapsed 392%CPU 501.18user 28.76system 2:15.05elapsed 392%CPU Signed-off-by: Nick Piggin <npiggin@xxxxxxx> Cc: Andrea Arcangeli <andrea@xxxxxxxxxx> Signed-off-by: Andrew Morton <akpm@xxxxxxxx> --- fs/ncpfs/mmap.c | 1 fs/ocfs2/mmap.c | 1 fs/xfs/linux-2.6/xfs_file.c | 2 + include/linux/mm.h | 6 +++ mm/filemap.c | 45 +++------------------------ mm/memory.c | 55 +++++++++------------------------- mm/mmap.c | 3 + mm/shmem.c | 10 ++++-- 8 files changed, 42 insertions(+), 81 deletions(-) diff -puN fs/ncpfs/mmap.c~fix-the-race-between-invalidate_inode_pages-and-do_no_page fs/ncpfs/mmap.c --- a/fs/ncpfs/mmap.c~fix-the-race-between-invalidate_inode_pages-and-do_no_page +++ a/fs/ncpfs/mmap.c @@ -100,6 +100,7 @@ static struct page* ncp_file_mmap_nopage static struct vm_operations_struct ncp_file_mmap = { .nopage = ncp_file_mmap_nopage, + .vm_flags = VM_CAN_INVLD, }; diff -puN fs/ocfs2/mmap.c~fix-the-race-between-invalidate_inode_pages-and-do_no_page fs/ocfs2/mmap.c --- a/fs/ocfs2/mmap.c~fix-the-race-between-invalidate_inode_pages-and-do_no_page +++ a/fs/ocfs2/mmap.c @@ -78,6 +78,7 @@ out: static struct vm_operations_struct ocfs2_file_vm_ops = { .nopage = ocfs2_nopage, + .vm_flags = VM_CAN_INVLD, }; int ocfs2_mmap(struct file *file, struct vm_area_struct *vma) diff -puN fs/xfs/linux-2.6/xfs_file.c~fix-the-race-between-invalidate_inode_pages-and-do_no_page fs/xfs/linux-2.6/xfs_file.c --- a/fs/xfs/linux-2.6/xfs_file.c~fix-the-race-between-invalidate_inode_pages-and-do_no_page +++ a/fs/xfs/linux-2.6/xfs_file.c @@ -597,6 +597,7 @@ const struct file_operations xfs_dir_fil static struct vm_operations_struct xfs_file_vm_ops = { .nopage = filemap_nopage, .populate = filemap_populate, + .vm_flags = VM_CAN_INVLD, }; #ifdef CONFIG_XFS_DMAPI @@ -606,5 +607,6 @@ static struct vm_operations_struct xfs_d #ifdef HAVE_VMOP_MPROTECT .mprotect = xfs_vm_mprotect, #endif + .vm_flags = VM_CAN_INVLD, }; #endif /* CONFIG_XFS_DMAPI */ diff -puN include/linux/mm.h~fix-the-race-between-invalidate_inode_pages-and-do_no_page include/linux/mm.h --- a/include/linux/mm.h~fix-the-race-between-invalidate_inode_pages-and-do_no_page +++ a/include/linux/mm.h @@ -165,6 +165,11 @@ extern unsigned int kobjsize(const void #define VM_NONLINEAR 0x00800000 /* Is non-linear (remap_file_pages) */ #define VM_MAPPED_COPY 0x01000000 /* T if mapped copy of data (nommu mmap) */ #define VM_INSERTPAGE 0x02000000 /* The vma has had "vm_insert_page()" done on it */ +#define VM_CAN_INVLD 0x04000000 /* The mapping may be invalidated, + * eg. truncate or invalidate_inode_*. + * In this case, do_no_page must + * return with the page locked. + */ #ifndef VM_STACK_DEFAULT_FLAGS /* arch can override this */ #define VM_STACK_DEFAULT_FLAGS VM_DATA_DEFAULT_FLAGS @@ -210,6 +215,7 @@ struct vm_operations_struct { int (*migrate)(struct vm_area_struct *vma, const nodemask_t *from, const nodemask_t *to, unsigned long flags); #endif + unsigned long vm_flags; /* vm_flags to copy into any mapping vmas */ }; struct mmu_gather; diff -puN mm/filemap.c~fix-the-race-between-invalidate_inode_pages-and-do_no_page mm/filemap.c --- a/mm/filemap.c~fix-the-race-between-invalidate_inode_pages-and-do_no_page +++ a/mm/filemap.c @@ -1384,6 +1384,8 @@ struct page *filemap_nopage(struct vm_ar unsigned long size, pgoff; int did_readaround = 0, majmin = VM_FAULT_MINOR; + BUG_ON(!(area->vm_flags & VM_CAN_INVLD)); + pgoff = ((address-area->vm_start) >> PAGE_CACHE_SHIFT) + area->vm_pgoff; retry_all: @@ -1408,7 +1410,7 @@ retry_all: * Do we have something in the page cache already? */ retry_find: - page = find_get_page(mapping, pgoff); + page = find_lock_page(mapping, pgoff); if (!page) { unsigned long ra_pages; @@ -1442,7 +1444,7 @@ retry_find: start = pgoff - ra_pages / 2; do_page_cache_readahead(mapping, file, start, ra_pages); } - page = find_get_page(mapping, pgoff); + page = find_lock_page(mapping, pgoff); if (!page) goto no_cached_page; } @@ -1504,30 +1506,6 @@ page_not_uptodate: majmin = VM_FAULT_MAJOR; count_vm_event(PGMAJFAULT); } - lock_page(page); - - /* Did it get unhashed while we waited for it? */ - if (!page->mapping) { - unlock_page(page); - page_cache_release(page); - goto retry_all; - } - - /* Did somebody else get it up-to-date? */ - if (PageUptodate(page)) { - unlock_page(page); - goto success; - } - - error = mapping->a_ops->readpage(file, page); - if (!error) { - wait_on_page_locked(page); - if (PageUptodate(page)) - goto success; - } else if (error == AOP_TRUNCATED_PAGE) { - page_cache_release(page); - goto retry_find; - } /* * Umm, take care of errors if the page isn't up-to-date. @@ -1535,20 +1513,6 @@ page_not_uptodate: * because there really aren't any performance issues here * and we need to check for errors. */ - lock_page(page); - - /* Somebody truncated the page on us? */ - if (!page->mapping) { - unlock_page(page); - page_cache_release(page); - goto retry_all; - } - - /* Somebody else successfully read it in? */ - if (PageUptodate(page)) { - unlock_page(page); - goto success; - } ClearPageError(page); error = mapping->a_ops->readpage(file, page); if (!error) { @@ -1746,6 +1710,7 @@ EXPORT_SYMBOL(filemap_populate); struct vm_operations_struct generic_file_vm_ops = { .nopage = filemap_nopage, .populate = filemap_populate, + .vm_flags = VM_CAN_INVLD, }; /* This is used for a general mmap of a disk file */ diff -puN mm/memory.c~fix-the-race-between-invalidate_inode_pages-and-do_no_page mm/memory.c --- a/mm/memory.c~fix-the-race-between-invalidate_inode_pages-and-do_no_page +++ a/mm/memory.c @@ -1659,6 +1659,13 @@ static int unmap_mapping_range_vma(struc unsigned long restart_addr; int need_break; + /* + * files that support invalidating or truncating portions of the + * file from under mmaped areas must set the VM_CAN_INVLD flag, and + * have their .nopage function return the page locked. + */ + BUG_ON(!(vma->vm_flags & VM_CAN_INVLD)); + again: restart_addr = vma->vm_truncate_count; if (is_restart_addr(restart_addr) && start_addr < restart_addr) { @@ -1789,17 +1796,8 @@ void unmap_mapping_range(struct address_ spin_lock(&mapping->i_mmap_lock); - /* serialize i_size write against truncate_count write */ - smp_wmb(); - /* Protect against page faults, and endless unmapping loops */ + /* Protect against endless unmapping loops */ mapping->truncate_count++; - /* - * For archs where spin_lock has inclusive semantics like ia64 - * this smp_mb() will prevent to read pagetable contents - * before the truncate_count increment is visible to - * other cpus. - */ - smp_mb(); if (unlikely(is_restart_addr(mapping->truncate_count))) { if (mapping->truncate_count == 0) reset_vma_truncate_counts(mapping); @@ -1811,6 +1809,7 @@ void unmap_mapping_range(struct address_ unmap_mapping_range_tree(&mapping->i_mmap, &details); if (unlikely(!list_empty(&mapping->i_mmap_nonlinear))) unmap_mapping_range_list(&mapping->i_mmap_nonlinear, &details); + spin_unlock(&mapping->i_mmap_lock); } EXPORT_SYMBOL(unmap_mapping_range); @@ -2129,9 +2128,8 @@ static int do_no_page(struct mm_struct * { spinlock_t *ptl; struct page *new_page; - struct address_space *mapping = NULL; + struct page *old_page; pte_t entry; - unsigned int sequence = 0; int ret = VM_FAULT_MINOR; int anon = 0; struct page *dirty_page = NULL; @@ -2139,26 +2137,14 @@ static int do_no_page(struct mm_struct * pte_unmap(page_table); BUG_ON(vma->vm_flags & VM_PFNMAP); - if (vma->vm_file) { - mapping = vma->vm_file->f_mapping; - sequence = mapping->truncate_count; - smp_rmb(); /* serializes i_size against truncate_count */ - } -retry: new_page = vma->vm_ops->nopage(vma, address & PAGE_MASK, &ret); - /* - * No smp_rmb is needed here as long as there's a full - * spin_lock/unlock sequence inside the ->nopage callback - * (for the pagecache lookup) that acts as an implicit - * smp_mb() and prevents the i_size read to happen - * after the next truncate_count read. - */ /* no page was available -- either SIGBUS or OOM */ if (new_page == NOPAGE_SIGBUS) return VM_FAULT_SIGBUS; if (new_page == NOPAGE_OOM) return VM_FAULT_OOM; + old_page = new_page; /* * Should we do an early C-O-W break? @@ -2191,19 +2177,6 @@ retry: } page_table = pte_offset_map_lock(mm, pmd, address, &ptl); - /* - * For a file-backed vma, someone could have truncated or otherwise - * invalidated this page. If unmap_mapping_range got called, - * retry getting the page. - */ - if (mapping && unlikely(sequence != mapping->truncate_count)) { - pte_unmap_unlock(page_table, ptl); - page_cache_release(new_page); - cond_resched(); - sequence = mapping->truncate_count; - smp_rmb(); - goto retry; - } /* * This silly early PAGE_DIRTY setting removes a race @@ -2249,10 +2222,14 @@ unlock: set_page_dirty_balance(dirty_page); put_page(dirty_page); } +out: + if (likely(vma->vm_flags & VM_CAN_INVLD)) + unlock_page(old_page); return ret; oom: page_cache_release(new_page); - return VM_FAULT_OOM; + ret = VM_FAULT_OOM; + goto out; } /* diff -puN mm/mmap.c~fix-the-race-between-invalidate_inode_pages-and-do_no_page mm/mmap.c --- a/mm/mmap.c~fix-the-race-between-invalidate_inode_pages-and-do_no_page +++ a/mm/mmap.c @@ -1097,6 +1097,9 @@ munmap_back: goto free_vma; } + if (vma->vm_ops) + vma->vm_flags |= vma->vm_ops->vm_flags; + /* We set VM_ACCOUNT in a shared mapping's vm_flags, to inform * shmem_zero_setup (perhaps called through /dev/zero's ->mmap) * that memory reservation must be checked; but that reservation diff -puN mm/shmem.c~fix-the-race-between-invalidate_inode_pages-and-do_no_page mm/shmem.c --- a/mm/shmem.c~fix-the-race-between-invalidate_inode_pages-and-do_no_page +++ a/mm/shmem.c @@ -79,6 +79,7 @@ enum sgp_type { SGP_READ, /* don't exceed i_size, don't allocate page */ SGP_CACHE, /* don't exceed i_size, may allocate page */ SGP_WRITE, /* may exceed i_size, may allocate page */ + SGP_NOPAGE, /* same as SGP_CACHE, return with page locked */ }; static int shmem_getpage(struct inode *inode, unsigned long idx, @@ -1202,8 +1203,10 @@ repeat: } done: if (*pagep != filepage) { - unlock_page(filepage); *pagep = filepage; + if (sgp != SGP_NOPAGE) + unlock_page(filepage); + } return 0; @@ -1222,13 +1225,15 @@ struct page *shmem_nopage(struct vm_area unsigned long idx; int error; + BUG_ON(!(vma->vm_flags & VM_CAN_INVLD)); + idx = (address - vma->vm_start) >> PAGE_SHIFT; idx += vma->vm_pgoff; idx >>= PAGE_CACHE_SHIFT - PAGE_SHIFT; if (((loff_t) idx << PAGE_CACHE_SHIFT) >= i_size_read(inode)) return NOPAGE_SIGBUS; - error = shmem_getpage(inode, idx, &page, SGP_CACHE, type); + error = shmem_getpage(inode, idx, &page, SGP_NOPAGE, type); if (error) return (error == -ENOMEM)? NOPAGE_OOM: NOPAGE_SIGBUS; @@ -2221,6 +2226,7 @@ static struct vm_operations_struct shmem .set_policy = shmem_set_policy, .get_policy = shmem_get_policy, #endif + .vm_flags = VM_CAN_INVLD, }; _ Patches currently in -mm which might be from nickpiggin@xxxxxxxxxxxx are git-gfs2.patch radix-tree-rcu-lockless-readside-semicolon.patch adix-tree-rcu-lockless-readside-update-tidy.patch fix-the-race-between-invalidate_inode_pages-and-do_no_page.patch sched2-sched-domain-sysctl.patch - To unsubscribe from this list: send the line "unsubscribe mm-commits" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html