Re: [patch 7/8] fs: fix or note I_DIRTY handling bugs in filesystems

On Mon, Jan 03, 2011 at 11:58:21AM -0500, Christoph Hellwig wrote:
> On Mon, Jan 03, 2011 at 03:03:29PM +0000, Steven Whitehouse wrote:
> > 
> >  - With "journaled data" files
> >    - Do a log flush conditional upon the inode's glock
> >    - The core code then writes back any dirty pages
> 
> Any data writeback is done before calling ->fsync.
> 
> >  - With regular files/directories
> >   - If datasync is not set, we need to write back the metadata including
> > timestamp updates, so that is done via ->write_inode. Note that an extra
> > complication here is that we need to get the glock on the inode if we
> > don't already have it in order to check and conditionally update the
> > atime. The call to ->write_inode includes an implicit (conditional) log
> > flush.
> >  - If datasync is set, we assume that only the data pages need to be
> > written out. My understanding of datasync was that it was only supposed
> > to write out data and never any of the metadata. The reason for the call
> > to flush the log for "stuffed" files is that the data shares a disk
> > block with the inode metadata, so we cannot avoid the log flush in this
> > case, since we must unpin the block to write it back.
> 
> What happens to indirect blocks, inode size updates, etc?  In general
> the only correct form to use the datasync argument is along the lines
> of:
> 
> 	if ((inode->i_state & I_DIRTY_DATASYNC) ||
> 	    ((inode->i_state & I_DIRTY_SYNC) && !datasync)) {
> 		/* write out the inode */
> 	} else {
> 		/*
> 		 * VFS inode not dirty, no need to write it out.
> 		 *
> 		 * If the filesystem supports asynchronous inode writes,
> 		 * we may have to wait for them here.
> 		 */
> 	}
> 
> or rather mostly correct, as pointed out by Nick in this series, that's
> why the above gets replaced with an equivalent check that also
> participates in the writeback locking protocol in this series.
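
For anyone who wants to poke at the quoted check outside the kernel,
here is a minimal userspace sketch of just the decision logic. The flag
values and the helper name inode_needs_write() are stand-ins I made up
for illustration, not kernel definitions, and it deliberately ignores
the writeback locking protocol, which is exactly the part the series
fixes.

```c
#include <stdbool.h>

/* Stand-in flag bits for this sketch only; not taken from kernel headers. */
#define I_DIRTY_SYNC     (1u << 0)	/* metadata dirty, not needed by fdatasync */
#define I_DIRTY_DATASYNC (1u << 1)	/* metadata dirty, needed to find the data */

/*
 * Hypothetical helper: decide whether an fsync/fdatasync call has to
 * write out the inode, following the quoted condition.
 */
static bool inode_needs_write(unsigned int i_state, bool datasync)
{
	/* Datasync-relevant metadata must always be written. */
	if (i_state & I_DIRTY_DATASYNC)
		return true;

	/* Other metadata (e.g. timestamps) only matters for a full fsync. */
	if ((i_state & I_DIRTY_SYNC) && !datasync)
		return true;

	/* Inode not dirty in a way this call cares about. */
	return false;
}
```

So fdatasync on an inode with only I_DIRTY_SYNC set skips the inode
write, while a full fsync on the same inode does not.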

Just to recap, basically we have 2 main problems in vfs/filesystems:

- i_state dirtiness is checked outside the correct synchronization
  protocol, so an inode may be seen as clean before a concurrent writer
  has finished.

- For a dirty inode at a sync point, .write_inode is only guaranteed to
  be called once, and that call may be in either sync or async mode.
  Many filesystems incorrectly assumed it would be called once *in
  synchronous mode*.

  The optimal approach for .write_inode seems to be to clean the struct
  inode so that it may eventually be reclaimed, and then have your
  .fsync and .sync_fs implementations enforce the actual data integrity.

  Note that "clean struct inode" often means copying the metadata
  somewhere else to be scheduled for asynchronous writeout. Be careful
  here: if you allow the inode to be evicted at this point, without a
  data integrity point in .evict_inode as well, then .sync_fs (and any
  subsequent .fsync, if the inode is reopened) must still enforce
  integrity for these potentially evicted inodes.
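
As a rough illustration of that split, here is a toy userspace model.
Everything in it (the toy_* names, the fixed-size pending array, the
durable counter) is hypothetical and only meant to show the shape of
the idea: write_inode cleans the in-memory inode by queueing a copy,
and only fsync/sync_fs wait for the queued copies to become durable.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_LOG_SLOTS 16

/* Toy in-memory inode; "dirty" stands in for the i_state dirty bits. */
struct toy_inode {
	unsigned long ino;
	long size;
	bool dirty;
};

/* Copy of inode metadata queued for asynchronous writeout. */
struct toy_log_entry {
	unsigned long ino;
	long size;
};

struct toy_fs {
	struct toy_log_entry pending[TOY_LOG_SLOTS];
	size_t npending;	/* queued, not yet on stable storage */
	size_t ndurable;	/* entries that have reached stable storage */
};

/* Like .write_inode: clean the inode by queueing a copy, never wait. */
static int toy_write_inode(struct toy_fs *fs, struct toy_inode *inode)
{
	if (!inode->dirty)
		return 0;
	if (fs->npending == TOY_LOG_SLOTS)
		return -1;	/* toy queue full */
	fs->pending[fs->npending].ino = inode->ino;
	fs->pending[fs->npending].size = inode->size;
	fs->npending++;
	inode->dirty = false;	/* struct inode is now reclaimable */
	return 0;
}

/* Integrity point: everything queued so far becomes durable. */
static void toy_log_flush(struct toy_fs *fs)
{
	fs->ndurable += fs->npending;
	fs->npending = 0;
}

/* Like .fsync: clean the inode if needed, then enforce integrity. */
static int toy_fsync(struct toy_fs *fs, struct toy_inode *inode)
{
	int err = toy_write_inode(fs, inode);

	if (err)
		return err;
	toy_log_flush(fs);
	return 0;
}

/*
 * Like .sync_fs: works from the queue, not from in-memory inodes, so
 * it still covers inodes that were cleaned and then evicted.
 */
static void toy_sync_fs(struct toy_fs *fs)
{
	toy_log_flush(fs);
}
```

The point of the model is that toy_sync_fs() never looks at a
toy_inode, so an inode that was cleaned by toy_write_inode() and then
thrown away is still covered by the next sync.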

Everyone happy with this? Please review your filesystems and look at
my patches :)

Thanks,
Nick
