The patch titled mm: fix pagecache write deadlocks has been added to the -mm tree. Its filename is mm-fix-pagecache-write-deadlocks.patch See http://www.zip.com.au/~akpm/linux/patches/stuff/added-to-mm.txt to find out what to do about this ------------------------------------------------------ Subject: mm: fix pagecache write deadlocks From: Andrew Morton <akpm@xxxxxxxx> and Nick Piggin <npiggin@xxxxxxx> The idea is to modify the core write() code so that it won't take a pagefault while holding a lock on the pagecache page. There are a number of different deadlocks possible if we try to do such a thing: 1. generic_buffered_write 2. lock_page 3. prepare_write 4. unlock_page+vmtruncate 5. copy_from_user 6. mmap_sem(r) 7. handle_mm_fault 8. lock_page (filemap_nopage) 9. commit_write 1. unlock_page b. sys_munmap / sys_mlock / others c. mmap_sem(w) d. make_pages_present e. get_user_pages f. handle_mm_fault g. lock_page (filemap_nopage) 2,8 - recursive deadlock if page is same 2,8;2,7 - ABBA deadlock is page is different 2,6;c,g - ABBA deadlock if page is same - Instead of copy_from_user(), use inc_preempt_count() and copy_from_user_inatomic(). - If the copy_from_user_inatomic() hits a pagefault, it'll return a short copy. - if the page was not uptodate, we cannot commit the write, because the uncopied bit could have uninitialised data. Commit zero length copy, which should do the right thing (ie. not set the page uptodate). - if the page was uptodate, commit the copied portion so we make some progress. - unlock_page() - go back and try to fault the page in again, redo the lock_page, prepare_write, copy_from_user_inatomic(), etc. - Now we have a non uptodate page, and we keep faulting on a 2nd or later iovec, this gives a deadlock, because fault_in_pages readable only faults in the first iovec. To fix this situation, if we fault on a !uptodate page, make the next iteration only attempt to copy a single iovec. - This also showed up a number of buggy prepare_write / commit_write implementations that were setting the page uptodate in the prepare_write side: bad! this allows uninitialised data to be read. Fix these. (also, rename maxlen to seglen, because it was confusing) Signed-off-by: Andrew Morton <akpm@xxxxxxxx> --- fs/buffer.c | 41 +++++------------ fs/libfs.c | 41 +++++++++-------- mm/filemap.c | 112 +++++++++++++++++++++++++++++++------------------ mm/filemap.h | 74 +++++++++++++++++++++----------- 4 files changed, 154 insertions(+), 114 deletions(-) diff -puN fs/buffer.c~mm-fix-pagecache-write-deadlocks fs/buffer.c --- a/fs/buffer.c~mm-fix-pagecache-write-deadlocks +++ a/fs/buffer.c @@ -1874,6 +1874,9 @@ static int __block_commit_write(struct i unsigned blocksize; struct buffer_head *bh, *head; + if (from == to) + return 0; + blocksize = 1 << inode->i_blkbits; for(bh = head = page_buffers(page), block_start = 0; @@ -2335,17 +2338,6 @@ int nobh_prepare_write(struct page *page if (is_mapped_to_disk) SetPageMappedToDisk(page); - SetPageUptodate(page); - - /* - * Setting the page dirty here isn't necessary for the prepare_write - * function - commit_write will do that. But if/when this function is - * used within the pagefault handler to ensure that all mmapped pages - * have backing space in the filesystem, we will need to dirty the page - * if its contents were altered. - */ - if (dirtied_it) - set_page_dirty(page); return 0; @@ -2354,17 +2346,6 @@ failed: if (read_bh[i]) free_buffer_head(read_bh[i]); } - - /* - * Error recovery is pretty slack. Clear the page and mark it dirty - * so we'll later zero out any blocks which _were_ allocated. - */ - kaddr = kmap_atomic(page, KM_USER0); - memset(kaddr, 0, PAGE_CACHE_SIZE); - flush_dcache_page(page); - kunmap_atomic(kaddr, KM_USER0); - SetPageUptodate(page); - set_page_dirty(page); return ret; } EXPORT_SYMBOL(nobh_prepare_write); @@ -2372,13 +2353,16 @@ EXPORT_SYMBOL(nobh_prepare_write); int nobh_commit_write(struct file *file, struct page *page, unsigned from, unsigned to) { - struct inode *inode = page->mapping->host; - loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to; + if (to > from) { + struct inode *inode = page->mapping->host; + loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to; - set_page_dirty(page); - if (pos > inode->i_size) { - i_size_write(inode, pos); - mark_inode_dirty(inode); + SetPageUptodate(page); + set_page_dirty(page); + if (pos > inode->i_size) { + i_size_write(inode, pos); + mark_inode_dirty(inode); + } } return 0; } @@ -2469,6 +2453,7 @@ int nobh_truncate_page(struct address_sp memset(kaddr + offset, 0, PAGE_CACHE_SIZE - offset); flush_dcache_page(page); kunmap_atomic(kaddr, KM_USER0); + SetPageUptodate(page); set_page_dirty(page); } unlock_page(page); diff -puN fs/libfs.c~mm-fix-pagecache-write-deadlocks fs/libfs.c --- a/fs/libfs.c~mm-fix-pagecache-write-deadlocks +++ a/fs/libfs.c @@ -327,32 +327,35 @@ int simple_readpage(struct file *file, s int simple_prepare_write(struct file *file, struct page *page, unsigned from, unsigned to) { - if (!PageUptodate(page)) { - if (to - from != PAGE_CACHE_SIZE) { - void *kaddr = kmap_atomic(page, KM_USER0); - memset(kaddr, 0, from); - memset(kaddr + to, 0, PAGE_CACHE_SIZE - to); - flush_dcache_page(page); - kunmap_atomic(kaddr, KM_USER0); - } + if (PageUptodate(page)) + return 0; + + if (to - from != PAGE_CACHE_SIZE) { + clear_highpage(page); + flush_dcache_page(page); SetPageUptodate(page); } + return 0; } int simple_commit_write(struct file *file, struct page *page, - unsigned offset, unsigned to) + unsigned from, unsigned to) { - struct inode *inode = page->mapping->host; - loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to; - - /* - * No need to use i_size_read() here, the i_size - * cannot change under us because we hold the i_mutex. - */ - if (pos > inode->i_size) - i_size_write(inode, pos); - set_page_dirty(page); + if (to > from) { + struct inode *inode = page->mapping->host; + loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to; + + if (to - from == PAGE_CACHE_SIZE) + SetPageUptodate(page); + /* + * No need to use i_size_read() here, the i_size + * cannot change under us because we hold the i_mutex. + */ + if (pos > inode->i_size) + i_size_write(inode, pos); + set_page_dirty(page); + } return 0; } diff -puN mm/filemap.c~mm-fix-pagecache-write-deadlocks mm/filemap.c --- a/mm/filemap.c~mm-fix-pagecache-write-deadlocks +++ a/mm/filemap.c @@ -2090,7 +2090,7 @@ generic_file_buffered_write(struct kiocb do { pgoff_t index; /* Pagecache index for current page */ unsigned long offset; /* Offset into pagecache page */ - unsigned long maxlen; /* Bytes remaining in current iovec */ + unsigned long seglen; /* Bytes remaining in current iovec */ size_t bytes; /* Bytes to write to page */ size_t copied; /* Bytes copied from user */ @@ -2100,18 +2100,16 @@ generic_file_buffered_write(struct kiocb if (bytes > count) bytes = count; - maxlen = cur_iov->iov_len - iov_offset; - if (maxlen > bytes) - maxlen = bytes; + seglen = min(cur_iov->iov_len - iov_offset, bytes); +retry_noprogress: #ifndef CONFIG_DEBUG_VM /* - * Bring in the user page that we will copy from _first_. - * Otherwise there's a nasty deadlock on copying from the - * same page as we're writing to, without it being marked - * up-to-date. + * Bring in the user page that we will copy from _first_, this + * minimises the chance we have to break into the slowpath + * below. */ - fault_in_pages_readable(buf, maxlen); + fault_in_pages_readable(buf, seglen); #endif page = __grab_cache_page(mapping,index,&cached_page,&lru_pvec); @@ -2122,8 +2120,6 @@ generic_file_buffered_write(struct kiocb status = a_ops->prepare_write(file, page, offset, offset+bytes); if (unlikely(status)) { - loff_t isize = i_size_read(inode); - if (status != AOP_TRUNCATED_PAGE) unlock_page(page); page_cache_release(page); @@ -2131,52 +2127,86 @@ generic_file_buffered_write(struct kiocb continue; /* * prepare_write() may have instantiated a few blocks - * outside i_size. Trim these off again. + * outside i_size. Trim these off again. Don't need + * i_size_read because we hold i_mutex. */ - if (pos + bytes > isize) - vmtruncate(inode, isize); + if (pos + bytes > inode->i_size) + vmtruncate(inode, inode->i_size); break; } + + /* + * Must not enter the pagefault handler here, because we hold + * the page lock, so we might recursively deadlock on the same + * lock, or get an ABBA deadlock against a different lock, or + * against the mmap_sem (which nests outside the page lock). + * So increment preempt count, and use _atomic usercopies. + */ + inc_preempt_count(); if (likely(nr_segs == 1)) - copied = filemap_copy_from_user(page, offset, + copied = filemap_copy_from_user_atomic(page, offset, buf, bytes); else - copied = filemap_copy_from_user_iovec(page, offset, - cur_iov, iov_offset, bytes); + copied = filemap_copy_from_user_iovec_atomic(page, + offset, cur_iov, iov_offset, + bytes); + dec_preempt_count(); + + if (!PageUptodate(page)) { + /* + * If the page is not uptodate, we cannot allow a + * partial commit_write, because that might expose + * uninitialised data. + * + * We will enter the single-segment path below, which + * should get the filesystem to bring the page + * uputodate for us next time. + */ + if (unlikely(copied != bytes)) + copied = 0; + } + flush_dcache_page(page); - status = a_ops->commit_write(file, page, offset, offset+bytes); + status = a_ops->commit_write(file, page, offset, offset+copied); if (status == AOP_TRUNCATED_PAGE) { page_cache_release(page); continue; } - if (likely(copied > 0)) { - if (!status) - status = copied; - - if (status >= 0) { - written += status; - count -= status; - pos += status; - buf += status; - if (unlikely(nr_segs > 1)) { - filemap_set_next_iovec(&cur_iov, - &iov_offset, status); - if (count) - buf = cur_iov->iov_base + - iov_offset; - } else { - iov_offset += status; - } - } - } - if (unlikely(copied != bytes)) - if (status >= 0) - status = -EFAULT; unlock_page(page); mark_page_accessed(page); page_cache_release(page); + if (status < 0) break; + if (likely(copied > 0)) { + written += copied; + count -= copied; + pos += copied; + buf += copied; + if (unlikely(nr_segs > 1)) { + filemap_set_next_iovec(&cur_iov, + &iov_offset, copied); + if (count) + buf = cur_iov->iov_base + iov_offset; + } else { + iov_offset += copied; + } + } else { +#ifdef CONFIG_DEBUG_VM + fault_in_pages_readable(buf, bytes); +#endif + /* + * OK, we took a fault without making progress. Fall + * back to trying just a single segment next time. + * This is important to prevent deadlock if a full + * page write was not causing the page to be brought + * uptodate. A smaller write will tend to bring it + * uptodate in ->prepare_write, and thus we have a + * chance of making some progress. + */ + bytes = seglen; + goto retry_noprogress; + } balance_dirty_pages_ratelimited(mapping); cond_resched(); } while (count); diff -puN mm/filemap.h~mm-fix-pagecache-write-deadlocks mm/filemap.h --- a/mm/filemap.h~mm-fix-pagecache-write-deadlocks +++ a/mm/filemap.h @@ -22,19 +22,19 @@ __filemap_copy_from_user_iovec_inatomic( /* * Copy as much as we can into the page and return the number of bytes which - * were sucessfully copied. If a fault is encountered then clear the page - * out to (offset+bytes) and return the number of bytes which were copied. + * were sucessfully copied. If a fault is encountered then return the number of + * bytes which were copied. * - * NOTE: For this to work reliably we really want copy_from_user_inatomic_nocache - * to *NOT* zero any tail of the buffer that it failed to copy. If it does, - * and if the following non-atomic copy succeeds, then there is a small window - * where the target page contains neither the data before the write, nor the - * data after the write (it contains zero). A read at this time will see - * data that is inconsistent with any ordering of the read and the write. - * (This has been detected in practice). + * NOTE: For this to work reliably we really want + * copy_from_user_inatomic_nocache to *NOT* zero any tail of the buffer that it + * failed to copy. If it does, and if the following non-atomic copy succeeds, + * then there is a small window where the target page contains neither the data + * before the write, nor the data after the write (it contains zero). A read at + * this time will see data that is inconsistent with any ordering of the read + * and the write. (This has been detected in practice). */ static inline size_t -filemap_copy_from_user(struct page *page, unsigned long offset, +filemap_copy_from_user_atomic(struct page *page, unsigned long offset, const char __user *buf, unsigned bytes) { char *kaddr; @@ -44,23 +44,32 @@ filemap_copy_from_user(struct page *page left = __copy_from_user_inatomic_nocache(kaddr + offset, buf, bytes); kunmap_atomic(kaddr, KM_USER0); - if (left != 0) { - /* Do it the slow way */ - kaddr = kmap(page); - left = __copy_from_user_nocache(kaddr + offset, buf, bytes); - kunmap(page); - } + return bytes - left; +} + +static inline size_t +filemap_copy_from_user_nonatomic(struct page *page, unsigned long offset, + const char __user *buf, unsigned bytes) +{ + char *kaddr; + int left; + + kaddr = kmap(page); + left = __copy_from_user_nocache(kaddr + offset, buf, bytes); + kunmap(page); + return bytes - left; } /* - * This has the same sideeffects and return value as filemap_copy_from_user(). + * This has the same sideeffects and return value as + * filemap_copy_from_user_atomic(). * The difference is that on a fault we need to memset the remainder of the * page (out to offset+bytes), to emulate filemap_copy_from_user()'s * single-segment behaviour. */ static inline size_t -filemap_copy_from_user_iovec(struct page *page, unsigned long offset, +filemap_copy_from_user_iovec_atomic(struct page *page, unsigned long offset, const struct iovec *iov, size_t base, size_t bytes) { char *kaddr; @@ -70,14 +79,27 @@ filemap_copy_from_user_iovec(struct page copied = __filemap_copy_from_user_iovec_inatomic(kaddr + offset, iov, base, bytes); kunmap_atomic(kaddr, KM_USER0); - if (copied != bytes) { - kaddr = kmap(page); - copied = __filemap_copy_from_user_iovec_inatomic(kaddr + offset, iov, - base, bytes); - if (bytes - copied) - memset(kaddr + offset + copied, 0, bytes - copied); - kunmap(page); - } + return copied; +} + +/* + * This has the same sideeffects and return value as + * filemap_copy_from_user_nonatomic(). + * The difference is that on a fault we need to memset the remainder of the + * page (out to offset+bytes), to emulate filemap_copy_from_user_nonatomic()'s + * single-segment behaviour. + */ +static inline size_t +filemap_copy_from_user_iovec_nonatomic(struct page *page, unsigned long offset, + const struct iovec *iov, size_t base, size_t bytes) +{ + char *kaddr; + size_t copied; + + kaddr = kmap(page); + copied = __filemap_copy_from_user_iovec_inatomic(kaddr + offset, iov, + base, bytes); + kunmap(page); return copied; } _ Patches currently in -mm which might be from akpm@xxxxxxxx are revert-pci-quirk-for-ibm-dock-ii-cardbus-controllers.patch ioc4-fixes.patch ioc4-kconfig-fix.patch remove_mapping-fix.patch proc_numbuf-is-wrong.patch carta_random32-fix-linkage.patch rename-net_random-to-random32-fixes.patch rename-net_random-to-random32-fix-2.patch vmalloc-dont-pass-__gfp_zero-to-slab.patch fix-build-breakage-with-config_ppc32-fix.patch revert-generic_file_buffered_write-handle-zero-length-iovec-segments.patch revert-generic_file_buffered_write-deadlock-on-vectored-write.patch generic_file_buffered_write-cleanup.patch mm-fix-pagecache-write-deadlocks.patch generic_file_buffered_write-fix-page-prefaulting.patch generic_file_buffered_write-max_len-cleanup.patch fix-pagecache-write-deadlocks.patch git-acpi.patch i386-acpi-build-fix.patch acpi-asus-s3-resume-fix.patch sony_apci-resume.patch git-dvb-build-fix.patch git-infiniband.patch git-input.patch git-input-fixup.patch git-libata-all.patch mtd-maps-support-for-bios-flash-chips-on-intel-esb2-southbridge.patch git-netdev-all.patch libphy-dont-do-that.patch drivers-net-ns83820c-add-paramter-to-disable-auto.patch git-pcmcia-fixup.patch git-serial-fixup.patch git-scsi-target-fixup.patch git-scsi-target-vs-git-block.patch fix-gregkh-usb-usbatm-fix-tiny-race.patch xpad-dance-pad-support.patch git-watchdog.patch x86_64-dump_trace-atomicity-fix.patch spinlock-debug-all-cpu-backtrace.patch xfs-rename-uio_read.patch touchkit-ps-2-touchscreen-driver.patch get-rid-of-zone_table.patch new-scheme-to-preempt-swap-token-tidy.patch radix-tree-rcu-lockless-readside.patch acx1xx-wireless-driver.patch swsusp-add-resume_offset-command-line-parameter-rev-2.patch deprecate-smbfs-in-favour-of-cifs.patch edac-new-opteron-athlon64-memory-controller-driver.patch add-address_space_operationsbatch_write.patch kbuild-dont-put-temp-files-in-the-source-tree.patch lockdep-annotate-nfs-nfsd-in-kernel-sockets-tidy.patch bug-test-1.patch log2-implement-a-general-integer-log2-facility-in-the-kernel-fix.patch fs-cache-provide-a-filesystem-specific-syncable-page-bit-ext4.patch fs-cache-make-kafs-use-fs-cache-fix.patch fs-cache-make-kafs-use-fs-cache-vs-streamline-generic_file_-interfaces-and-filemap.patch nfs-use-local-caching-12-fix.patch fs-cache-cachefiles-a-cache-that-backs-onto-a-mounted-filesystem-log2-fix.patch swap_prefetch-vs-zoned-counters.patch readahead-sysctl-parameters.patch make-copy_from_user_inatomic-not-zero-the-tail-on-i386-vs-reiser4.patch make-kmem_cache_destroy-return-void-reiser4.patch reiser4-hardirq-include-fix.patch reiser4-run-truncate_inode_pages-in-reiser4_delete_inode.patch reiser4-get_sb_dev-fix.patch reiser4-vs-zoned-allocator.patch reiser4-rename-generic_sounding_globalspatch-fix.patch hpt3xx-rework-rate-filtering-tidy.patch gtod-persistent-clock-support-i386.patch hrtimers-state-tracking.patch clockevents-drivers-for-i386.patch gtod-mark-tsc-unusable-for-highres-timers.patch round_jiffies-infrastructure-fix.patch kevent-core-files-fix.patch kevent-core-files-s390-hack.patch kevent-socket-notifications-fix-2.patch kevent-socket-notifications-fix-4.patch kevent-timer-notifications-fix.patch nr_blockdev_pages-in_interrupt-warning.patch device-suspend-debug.patch mutex-subsystem-synchro-test-module-fix.patch slab-leaks3-default-y.patch x86-kmap_atomic-debugging.patch restore-rogue-readahead-printk.patch put_bh-debug.patch acpi_format_exception-debug.patch warn-if-setting-non-uptodate-page-dirty.patch jmicron-warning-fix.patch squash-ipc-warnings.patch squash-udf-warnings.patch - To unsubscribe from this list: send the line "unsubscribe mm-commits" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html