Re: [patch 2/3] fs: buffer_head writepage no zero

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Fri, Jul 10, 2009 at 01:46:51PM +0200, Jan Kara wrote:
> On Fri 10-07-09 11:34:03, Nick Piggin wrote:
> > 
> > When writing a page to filesystem, buffer.c zeroes out parts of the page past
> > i_size in an attempt to get zeroes into those blocks on disk, so as to honour
> > the requirement that an expanding truncate should zero-fill the file.
> > 
> > Unfortunately, this is racy. The reason we can get something other than
> > zeroes here is via an mmaped write to the block beyond i_size. Zeroing it
> > out before writepage narrows the window, but it is still possible to store
> > junk beyond i_size on disk, by storing into the page after writepage zeroes,
> > but before DMA (or copy) completes. This allows process A to break posix
> > semantics for process B (or even inadvertently for itsef).
> > 
> > It could also be possible that the filesystem has written data into the
> > block but not yet expanded the inode size when the system crashes for
> > some reason. Unless its journal reply / fsck process etc checks for this
> > condition, it could also cause subsequent breakage in semantics.
>   Actually, it should be possible to fix the posix semantics by zeroing out
> the page when i_size is going to be extended - hmm, I see you're trying to
> do something like that in ext2 code. Ugh. Since we have to lock the

Yeah, it could probably do it in write_begin in generic code, that
part was a bit ugly.

> old last page to make mkwrite work anyway, I think we should do it in a
> generic code (probably in a separate patch and just note it here...).
>   I can include it in my mkwrite fixes when I port them on top of your
> patches.
> 
> > @@ -2752,7 +2741,6 @@ has_buffers:
> >  	}
> >  	zero_user(page, offset, length);
> >  	set_page_dirty(page);
> > -	err = 0;
> >  
> >  unlock:
> >  	unlock_page(page);
>   Above two chunks are just style cleanup, aren't they? Could you maybe separate
> it from the logical changes?

Yes I think so. They devolved from something that was actually useful,
and I should remove them.

 
> > @@ -2802,15 +2790,20 @@ int block_truncate_page(struct address_s
> >  		pos += blocksize;
> >  	}
> >  
> > -	err = 0;
> >  	if (!buffer_mapped(bh)) {
> >  		WARN_ON(bh->b_size != blocksize);
> >  		err = get_block(inode, iblock, bh, 0);
> >  		if (err)
> >  			goto unlock;
> > -		/* unmapped? It's a hole - nothing to do */
> > -		if (!buffer_mapped(bh))
> > +		/*
> > +		 * unmapped? It's a hole - must zero out partial
> > +		 * in the case of an extending truncate where mmap has
> > +		 * previously written past i_size of the page
> > +		 */
> > +		if (!buffer_mapped(bh)) {
> > +			zero_user(page, offset, length);
> >  			goto unlock;
>   Hmm, but who was zeroing out the page previously? Because the end of the
> page gets zeroed already now...

Yes it does aready get zeroed, however I think ftruncate semantics
say that expanding ftruncate shoud leave the new area with zero
filled. A partial-mmap on the last page could have dirtied these
parts of the page and so break the guarantee.

I guess it could be ignored because such partial mmap writes are
supposed to result in undefined behaviour, however I think it is
a bit wrong (also the result could change based on memory pressure
even when another program opens the file, so I think zeroing here
is best).
 

> > -	/*
> > -	 * The page straddles i_size.  It must be zeroed out on each and every
> > -	 * writepage invokation because it may be mmapped.  "A file is mapped
> > -	 * in multiples of the page size.  For a file that is not a multiple of
> > -	 * the  page size, the remaining memory is zeroed when mapped, and
> > -	 * writes to that region are not written out to the file."
> > -	 */
> > -	zero_user_segment(page, offset, PAGE_CACHE_SIZE);
> >  	return __block_write_full_page(inode, page, get_block, wbc, handler);
> >  }
>   I suppose you should also update __block_write_full_page() - there's
> comment about zeroing. Also I'm not sure that marking buffer as uptodate
> there is a good idea when the buffer isn't zeroed.

Thanks I'll check it out.


> >  EXPORT_SYMBOL_GPL(xip_truncate_page);
>   Again, only a style change, right?

Yes.

> > Index: linux-2.6/fs/ext2/inode.c
> > ===================================================================
> > --- linux-2.6.orig/fs/ext2/inode.c
> > +++ linux-2.6/fs/ext2/inode.c
> > @@ -777,14 +777,40 @@ ext2_write_begin(struct file *file, stru
> >  	return ret;
> >  }
> >  
> > +int __block_truncate_page(struct address_space *mapping,
> > +			loff_t from, loff_t to, get_block_t *get_block);
>   Uf, that's ugly... Shouldn't it be in some header?
> 
> >  static int ext2_write_end(struct file *file, struct address_space *mapping,
> >  			loff_t pos, unsigned len, unsigned copied,
> >  			struct page *page, void *fsdata)
> >  {
> > +	struct inode *inode = mapping->host;
> >  	int ret;
> >  
> > -	ret = generic_write_end(file, mapping, pos, len, copied, page, fsdata);
> > -	if (ret < len) {
> > +	ret = block_write_end(file, mapping, pos, len, copied, page, fsdata);
> > +	unlock_page(page);
> > +	page_cache_release(page);
> > +        if (pos+copied > inode->i_size) {
> > +		int err;
> > +                if (pos > inode->i_size) {
> > +                        /* expanding a hole */
> > +			err = __block_truncate_page(mapping, inode->i_size,
> > +						pos, ext2_get_block);
> > +			if (err) {
> > +				ret = err;
> > +				goto out;
> > +			}
> > +			err = __block_truncate_page(mapping, pos+copied,
> > +						LLONG_MAX, ext2_get_block);
> > +			if (err) {
> > +				ret = err;
> > +				goto out;
> > +			}
> > +                }
> > +                i_size_write(inode, pos+copied);
> > +                mark_inode_dirty(inode);
> > +        }
> > +out:
> > +	if (ret < 0 || ret < len) {
> >  		struct inode *inode = mapping->host;
> >  		loff_t isize = inode->i_size;
> >  		if (pos + len > isize)
>   There are whitespace problems above... Also calling __block_truncate_page()
> on old i_size looks strange - we just want to zero-out the page if it
> exists (this way we'd unnecessarily read it from disk). Also I think
> block_write_end() should do this.
>   Finally, zeroing after pos+copied does not make sence to be conditioned
> by pos > inode->i_size and again I don't think it's needed...

Yeah this part was ugly because it was just a result of working
through bugs and I didn't really try to make it nice. I agree if
we can move as much as possible to generic code it woud be
best.

Thanks for review. I'll try to post another version soon.

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[Index of Archives]     [Linux Ext4 Filesystem]     [Union Filesystem]     [Filesystem Testing]     [Ceph Users]     [Ecryptfs]     [AutoFS]     [Kernel Newbies]     [Share Photos]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux Cachefs]     [Reiser Filesystem]     [Linux RAID]     [Samba]     [Device Mapper]     [CEPH Development]
  Powered by Linux