The patch titled fs: prepare_write fixes has been added to the -mm tree. Its filename is fs-prepare_write-fixes.patch See http://www.zip.com.au/~akpm/linux/patches/stuff/added-to-mm.txt to find out what to do about this ------------------------------------------------------ Subject: fs: prepare_write fixes From: Nick Piggin <nickpiggin@xxxxxxxxxxxx> Some prepare/commit_write implementations have possible pre-existing bugs, possible data leaks (by setting uptodate too early) and data corruption (by not reading in non-modified parts of a !uptodate page). Others are (also) broken when commit_write passes in a 0 length commit with a !uptodate page (a change caused by buffered write deadlock fix patch). Fix filesystems as best we can. GFS2, OCFS2, Reiserfs, JFFS are nontrivial and are likely broken. All others at least need a glance. Also, update vfs.txt to describe the prepare_write/commit_write changes. Signed-off-by: Nick Piggin <npiggin@xxxxxxx> Signed-off-by: Andrew Morton <akpm@xxxxxxxx> --- Documentation/filesystems/vfs.txt | 28 ++++++++---- fs/affs/file.c | 4 + fs/cifs/file.c | 43 ++++++------------- fs/ecryptfs/mmap.c | 17 +++---- fs/ext3/inode.c | 63 +++++++++++++++------------- fs/fat/inode.c | 6 ++ fs/fuse/file.c | 3 + fs/hostfs/hostfs_kern.c | 13 +++++ fs/jffs/inode-v23.c | 11 ++++ fs/jffs2/file.c | 3 + fs/nfs/file.c | 3 + fs/smbfs/file.c | 9 ++++ fs/udf/file.c | 7 +++ 13 files changed, 130 insertions(+), 80 deletions(-) diff -puN Documentation/filesystems/vfs.txt~fs-prepare_write-fixes Documentation/filesystems/vfs.txt --- a/Documentation/filesystems/vfs.txt~fs-prepare_write-fixes +++ a/Documentation/filesystems/vfs.txt @@ -607,22 +607,32 @@ struct address_space_operations { the given range of bytes is about to be written. The address_space should check that the write will be able to complete, by allocating space if necessary and doing any other - internal housekeeping. If the write will update parts of - any basic-blocks on storage, then those blocks should be - pre-read (if they haven't been read already) so that the + internal housekeeping. If the write will *not* update any + parts of any basic-blocks of the page, then those blocks should + be pre-read (if they are not already uptodate) so that the updated blocks can be written out properly. The page will be locked. If prepare_write wants to unlock the page it, like readpage, may do so and return AOP_TRUNCATED_PAGE. In this case the prepare_write will be retried one the lock is - regained. + regained.a + prepare_write should *not* set the page uptodate unless it has + actually been filled with valid data (eg. that may be returned + with a read(2) syscall). commit_write: If prepare_write succeeds, new data will be copied - into the page and then commit_write will be called. It will - typically update the size of the file (if appropriate) and - mark the inode as dirty, and do any other related housekeeping - operations. It should avoid returning an error if possible - - errors should have been handled by prepare_write. + into the page and then commit_write will be called. commit_write may + be called with a range that is smaller than that passed in to + prepare_write, it could even be zero. If the page is not uptodate, + the range will *only* be zero or the full length that was passed to + prepare_write, if it is zero, the page should not be marked uptodate + (success should still be returned, if possible -- the write will be + retried). + commit_write will typically update the size of the file (if + appropriate) and mark the inode as dirty, and do any other related + housekeeping operations. A successful commit_write should mark the page + uptodate if it is not already. It should avoid returning an error if + possible - errors should have been handled by prepare_write. bmap: called by the VFS to map a logical block offset within object to physical block number. This method is used by the FIBMAP diff -puN fs/affs/file.c~fs-prepare_write-fixes fs/affs/file.c --- a/fs/affs/file.c~fs-prepare_write-fixes +++ a/fs/affs/file.c @@ -655,6 +655,10 @@ static int affs_commit_write_ofs(struct int written; pr_debug("AFFS: commit_write(%u, %ld, %d, %d)\n", (u32)inode->i_ino, page->index, from, to); + + if (to - from == 0) + return 0; + bsize = AFFS_SB(sb)->s_data_blksize; data = page_address(page); diff -puN fs/cifs/file.c~fs-prepare_write-fixes fs/cifs/file.c --- a/fs/cifs/file.c~fs-prepare_write-fixes +++ a/fs/cifs/file.c @@ -1365,6 +1365,9 @@ static int cifs_commit_write(struct file loff_t position = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to; char *page_data; + if (to - offset == 0) + return 0; + xid = GetXid(); cFYI(1, ("commit write for page %p up to position %lld for %d", page, position, to)); @@ -1418,8 +1421,6 @@ static int cifs_commit_write(struct file rc = 0; /* else if (rc < 0) should we set writebehind rc? */ kunmap(page); - } else { - set_page_dirty(page); } FreeXid(xid); @@ -1973,36 +1974,20 @@ int is_size_safe_to_change(struct cifsIn static int cifs_prepare_write(struct file *file, struct page *page, unsigned from, unsigned to) { - int rc = 0; loff_t offset = (loff_t)page->index << PAGE_CACHE_SHIFT; cFYI(1, ("prepare write for page %p from %d to %d",page,from,to)); - if (!PageUptodate(page)) { - /* if (to - from != PAGE_CACHE_SIZE) { - void *kaddr = kmap_atomic(page, KM_USER0); - memset(kaddr, 0, from); - memset(kaddr + to, 0, PAGE_CACHE_SIZE - to); - flush_dcache_page(page); - kunmap_atomic(kaddr, KM_USER0); - } */ - /* If we are writing a full page it will be up to date, - no need to read from the server */ - if ((to == PAGE_CACHE_SIZE) && (from == 0)) - SetPageUptodate(page); - - /* might as well read a page, it is fast enough */ - if ((file->f_flags & O_ACCMODE) != O_WRONLY) { - rc = cifs_readpage_worker(file, page, &offset); - } else { - /* should we try using another file handle if there is one - - how would we lock it to prevent close of that handle - racing with this read? - In any case this will be written out by commit_write */ - } - } - /* BB should we pass any errors back? - e.g. if we do not have read access to the file */ - return 0; + if (PageUptodate(page)) + return 0; + if (to - from == PAGE_CACHE_SIZE) + return 0; + + /* might as well read a page, it is fast enough */ + if ((file->f_flags & O_ACCMODE) != O_WRONLY) + return cifs_readpage_worker(file, page, &offset); + + printk("CIFS does not yet support partial page writes on O_WRONLY files\n"); + return -EIO; } const struct address_space_operations cifs_addr_ops = { diff -puN fs/ecryptfs/mmap.c~fs-prepare_write-fixes fs/ecryptfs/mmap.c --- a/fs/ecryptfs/mmap.c~fs-prepare_write-fixes +++ a/fs/ecryptfs/mmap.c @@ -337,16 +337,12 @@ out: static int ecryptfs_prepare_write(struct file *file, struct page *page, unsigned from, unsigned to) { - int rc = 0; - kmap(page); - if (from == 0 && to == PAGE_CACHE_SIZE) - goto out; /* If we are writing a full page, it will be - up to date. */ - if (!PageUptodate(page)) - rc = ecryptfs_do_readpage(file, page, page->index); -out: - return rc; + if (PageUptodate(page)) + return 0; + if (to - from == PAGE_CACHE_SIZE) + return 0; + return ecryptfs_do_readpage(file, page, page->index); } int ecryptfs_grab_and_map_lower_page(struct page **lower_page, @@ -634,6 +630,9 @@ static int ecryptfs_commit_write(struct struct ecryptfs_crypt_stat *crypt_stat; int rc; + if (to - from == 0) + return 0; + inode = page->mapping->host; lower_inode = ecryptfs_inode_to_lower(inode); lower_file = ecryptfs_file_to_lower(file); diff -puN fs/ext3/inode.c~fs-prepare_write-fixes fs/ext3/inode.c --- a/fs/ext3/inode.c~fs-prepare_write-fixes +++ a/fs/ext3/inode.c @@ -1214,21 +1214,24 @@ static int ext3_ordered_commit_write(str struct inode *inode = page->mapping->host; int ret = 0, ret2; - ret = walk_page_buffers(handle, page_buffers(page), - from, to, NULL, ext3_journal_dirty_data); + if (to - from > 0) { + ret = walk_page_buffers(handle, page_buffers(page), + from, to, NULL, ext3_journal_dirty_data); - if (ret == 0) { - /* - * generic_commit_write() will run mark_inode_dirty() if i_size - * changes. So let's piggyback the i_disksize mark_inode_dirty - * into that. - */ - loff_t new_i_size; + if (ret == 0) { + /* + * generic_commit_write() will run mark_inode_dirty() + * if i_size changes. So let's piggyback the + * i_disksize mark_inode_dirty into that. + */ + loff_t new_i_size; - new_i_size = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to; - if (new_i_size > EXT3_I(inode)->i_disksize) - EXT3_I(inode)->i_disksize = new_i_size; - ret = generic_commit_write(file, page, from, to); + new_i_size = ((loff_t)page->index << PAGE_CACHE_SHIFT) + + to; + if (new_i_size > EXT3_I(inode)->i_disksize) + EXT3_I(inode)->i_disksize = new_i_size; + ret = generic_commit_write(file, page, from, to); + } } ret2 = ext3_journal_stop(handle); if (!ret) @@ -1268,23 +1271,25 @@ static int ext3_journalled_commit_write( int partial = 0; loff_t pos; - /* - * Here we duplicate the generic_commit_write() functionality - */ - pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to; + if (to - from > 0) { + /* + * Here we duplicate the generic_commit_write() functionality + */ + pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to; - ret = walk_page_buffers(handle, page_buffers(page), from, - to, &partial, commit_write_fn); - if (!partial) - SetPageUptodate(page); - if (pos > inode->i_size) - i_size_write(inode, pos); - EXT3_I(inode)->i_state |= EXT3_STATE_JDATA; - if (inode->i_size > EXT3_I(inode)->i_disksize) { - EXT3_I(inode)->i_disksize = inode->i_size; - ret2 = ext3_mark_inode_dirty(handle, inode); - if (!ret) - ret = ret2; + ret = walk_page_buffers(handle, page_buffers(page), from, + to, &partial, commit_write_fn); + if (!partial) + SetPageUptodate(page); + if (pos > inode->i_size) + i_size_write(inode, pos); + EXT3_I(inode)->i_state |= EXT3_STATE_JDATA; + if (inode->i_size > EXT3_I(inode)->i_disksize) { + EXT3_I(inode)->i_disksize = inode->i_size; + ret2 = ext3_mark_inode_dirty(handle, inode); + if (!ret) + ret = ret2; + } } ret2 = ext3_journal_stop(handle); if (!ret) diff -puN fs/fat/inode.c~fs-prepare_write-fixes fs/fat/inode.c --- a/fs/fat/inode.c~fs-prepare_write-fixes +++ a/fs/fat/inode.c @@ -150,7 +150,11 @@ static int fat_commit_write(struct file unsigned from, unsigned to) { struct inode *inode = page->mapping->host; - int err = generic_commit_write(file, page, from, to); + int err; + if (to - from > 0) + return 0; + + err = generic_commit_write(file, page, from, to); if (!err && !(MSDOS_I(inode)->i_attrs & ATTR_ARCH)) { inode->i_mtime = inode->i_ctime = CURRENT_TIME_SEC; MSDOS_I(inode)->i_attrs |= ATTR_ARCH; diff -puN fs/fuse/file.c~fs-prepare_write-fixes fs/fuse/file.c --- a/fs/fuse/file.c~fs-prepare_write-fixes +++ a/fs/fuse/file.c @@ -467,6 +467,9 @@ static int fuse_commit_write(struct file if (is_bad_inode(inode)) return -EIO; + if (to - from == 0) + return 0; + req = fuse_get_req(fc); if (IS_ERR(req)) return PTR_ERR(req); diff -puN fs/hostfs/hostfs_kern.c~fs-prepare_write-fixes fs/hostfs/hostfs_kern.c --- a/fs/hostfs/hostfs_kern.c~fs-prepare_write-fixes +++ a/fs/hostfs/hostfs_kern.c @@ -469,6 +469,11 @@ int hostfs_prepare_write(struct file *fi long long start, tmp; int err; + if (PageUptodate(page)) + return 0; + if (to - from == PAGE_CACHE_SIZE) + return 0; + start = (long long) page->index << PAGE_CACHE_SHIFT; buffer = kmap(page); if(from != 0){ @@ -498,10 +503,14 @@ int hostfs_commit_write(struct file *fil long long start; int err = 0; + if (to - from == 0) + return 0; + start = (((long long) page->index) << PAGE_CACHE_SHIFT) + from; buffer = kmap(page); err = write_file(FILE_HOSTFS_I(file)->fd, &start, buffer + from, to - from); + kunmap(page); if(err > 0) err = 0; /* Actually, if !err, write_file has added to-from to start, so, despite @@ -511,7 +520,9 @@ int hostfs_commit_write(struct file *fil if(!err && (start > inode->i_size)) inode->i_size = start; - kunmap(page); + if (!PageUptodate(page) && to - from == PAGE_CACHE_SIZE) + SetPageUptodate(page); + return(err); } diff -puN fs/jffs/inode-v23.c~fs-prepare_write-fixes fs/jffs/inode-v23.c --- a/fs/jffs/inode-v23.c~fs-prepare_write-fixes +++ a/fs/jffs/inode-v23.c @@ -1531,7 +1531,7 @@ jffs_prepare_write(struct file *filp, st /* FIXME: we should detect some error conditions here */ /* Bugger that. We should make sure the page is uptodate */ - if (!PageUptodate(page) && (from || to < PAGE_CACHE_SIZE)) + if (!PageUptodate(page) && (to - from < PAGE_CACHE_SIZE)) return jffs_do_readpage_nolock(filp, page); return 0; @@ -1541,11 +1541,18 @@ static int jffs_commit_write(struct file *filp, struct page *page, unsigned from, unsigned to) { + int ret; + void *addr = page_address(page) + from; /* XXX: PAGE_CACHE_SHIFT or PAGE_SHIFT */ loff_t pos = page_offset(page) + from; - return jffs_file_write(filp, addr, to-from, &pos); + if (to - from == 0) + return 0; + + ret = jffs_file_write(filp, addr, to-from, &pos); + if (!PageUptodate(page) && (to - from == PAGE_CACHE_SIZE)) + SetPageUptodate(page); } /* jffs_commit_write() */ /* This is our ioctl() routine. */ diff -puN fs/jffs2/file.c~fs-prepare_write-fixes fs/jffs2/file.c --- a/fs/jffs2/file.c~fs-prepare_write-fixes +++ a/fs/jffs2/file.c @@ -222,6 +222,9 @@ static int jffs2_commit_write (struct fi D1(printk(KERN_DEBUG "jffs2_commit_write(): ino #%lu, page at 0x%lx, range %d-%d, flags %lx\n", inode->i_ino, pg->index << PAGE_CACHE_SHIFT, start, end, pg->flags)); + if (end - start == 0) + return 0; + if (end == PAGE_CACHE_SIZE) { if (!start) { /* We need to avoid deadlock with page_cache_read() in diff -puN fs/nfs/file.c~fs-prepare_write-fixes fs/nfs/file.c --- a/fs/nfs/file.c~fs-prepare_write-fixes +++ a/fs/nfs/file.c @@ -299,6 +299,9 @@ static int nfs_commit_write(struct file { long status; + if (to - offset == 0) + return 0; + lock_kernel(); status = nfs_updatepage(file, page, offset, to-offset); unlock_kernel(); diff -puN fs/smbfs/file.c~fs-prepare_write-fixes fs/smbfs/file.c --- a/fs/smbfs/file.c~fs-prepare_write-fixes +++ a/fs/smbfs/file.c @@ -293,6 +293,10 @@ out: static int smb_prepare_write(struct file *file, struct page *page, unsigned offset, unsigned to) { + if (!PageUptodate(page) && to - offset < PAGE_CACHE_SIZE) { + printk("SMB does not yet support partial page writes\n"); + return -EIO; + } return 0; } @@ -301,10 +305,15 @@ static int smb_commit_write(struct file { int status; + if (to - offset == 0) + return 0; + status = -EFAULT; lock_kernel(); status = smb_updatepage(file, page, offset, to-offset); unlock_kernel(); + if (!PageUptodate(page) && to - offset == PAGE_CACHE_SIZE) + SetPageUptodate(page); return status; } diff -puN fs/udf/file.c~fs-prepare_write-fixes fs/udf/file.c --- a/fs/udf/file.c~fs-prepare_write-fixes +++ a/fs/udf/file.c @@ -75,6 +75,10 @@ static int udf_adinicb_writepage(struct static int udf_adinicb_prepare_write(struct file *file, struct page *page, unsigned offset, unsigned to) { + if (!PageUptodate(page) && to - offset < PAGE_CACHE_SIZE) { + printk("UDF does not yet support partial page writes\n"); + return -EIO; + } kmap(page); return 0; } @@ -84,6 +88,9 @@ static int udf_adinicb_commit_write(stru struct inode *inode = page->mapping->host; char *kaddr = page_address(page); + if (to - offset == 0) + return 0; + memcpy(UDF_I_DATA(inode) + UDF_I_LENEATTR(inode) + offset, kaddr + offset, to - offset); mark_inode_dirty(inode); _ Patches currently in -mm which might be from nickpiggin@xxxxxxxxxxxx are origin.patch oom-killer-meets-userspace-headers.patch mm-fix-pagecache-write-deadlocks-xip.patch fs-prepare_write-fixes.patch memory-page-alloc-minor-cleanups.patch memory-page-alloc-minor-cleanups-fix.patch cpuset-remove-useless-sched-domain-line.patch cpuset-explicit-dynamic-sched-domain-control-flags.patch sched2-sched-domain-sysctl.patch - To unsubscribe from this list: send the line "unsubscribe mm-commits" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html