Re: [RFC] relaxed barrier semantics

Hi,

On Wed, 2010-07-28 at 11:28 +0200, Christoph Hellwig wrote:
> On Wed, Jul 28, 2010 at 11:17:06AM +0200, Tejun Heo wrote:
> > Well, if disabling barrier works around the problem for them (which is
> > basically what was suggested in the first message), that's not too
> > bad for short term, I think.
> 
> It's a pretty horrible workaround.  Requiring manual mount options to
> get performance out of a setup which could trivially work out of the
> box is a bad workaround.
> 
> > I'll re-read barrier code and see how hard it would be to implement a
> > proper solution.
> 
> If we move all filesystems to non-draining barriers with pre- and post-
> flushes that might actually be a relatively easy first step.  We don't
> have the complications to deal with multiple types of barriers to
> start with, and it'll fix the issue for devices without volatile write
> caches completely.
> 
> I just need some help from the filesystem folks to determine if they
> are safe with them.
> 
> I know for sure that ext3 and xfs are from looking through them.  And
> I know reiserfs is if we make sure it doesn't hit the code path that
> relies on it that is currently enabled by the barrier option.
> 
> I'll just need more feedback from ext4, gfs2, btrfs and nilfs folks.
> That already ends our small list of barrier supporting filesystems, and
> possibly ocfs2, too - although the barrier implementation there seems
> incomplete as it doesn't seem to flush caches in fsync.

GFS2 uses barriers only on journal flushing. There are three reasons for
flushing the journal:

1. It's full and we need more space (or the periodic timer has expired,
and there is at least one transaction to flush)
2. We are doing fsync or a full fs sync
3. We need to release a glock to another node, and that glock has some
journaled blocks associated with it

In case #1, I don't think there is any need to actually issue a flush
along with the barrier - the fs will always be correct after (for
example) a power failure, and only the amount of data which might be
lost depends on the write cache size. This is basically the same as
for any local filesystem.

In case #2 we must always flush.

In case #3 we need to be certain that all I/O up to and including the
barrier (and any in-place metadata written back after it) has
reached the storage device (and is not still lurking in the I/O
elevator) before we release the lock, but there is no actual need to
flush the write cache of the device itself. In other words, we need to
flush the non-shared bit of the stack, but not the shared bit on the
device itself. The same caveats about the amount of data which may be
lost on power failure apply as per case #1.

I have also made the assumption that a barrier issued from one node to
the shared device will affect I/O from all nodes equally. If that is not
the case, then the above will not apply and we must always flush in case
#3.
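
To make the three cases concrete, here is a minimal sketch of the
decision table described above. All the names in it (log_flush_reason,
barrier_needs, barrier_needs_for) are made up for illustration and are
not GFS2 or block layer identifiers; the barrier_orders_all_nodes
argument stands for the assumption that a barrier issued from one node
orders I/O from every node.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names: none of these exist in GFS2 or the block layer. */
enum log_flush_reason {
	FLUSH_JOURNAL_FULL,	/* case 1: journal full or periodic timer expired */
	FLUSH_SYNC,		/* case 2: fsync or a full fs sync */
	FLUSH_GLOCK_RELEASE,	/* case 3: glock with journaled blocks goes to another node */
};

struct barrier_needs {
	bool order_writes;	/* journal writes must stay ordered around the barrier */
	bool reach_device;	/* I/O must have left the local stack (elevator) */
	bool flush_cache;	/* the device's own write cache must be flushed */
};

/*
 * barrier_orders_all_nodes is the assumption stated in the text: a barrier
 * issued by one node affects I/O from all nodes equally. If it is false,
 * case 3 falls back to always flushing.
 */
static struct barrier_needs
barrier_needs_for(enum log_flush_reason reason, bool barrier_orders_all_nodes)
{
	/* Conservative default: behave like a full sync. */
	struct barrier_needs n = { true, true, true };

	switch (reason) {
	case FLUSH_JOURNAL_FULL:
		/* Ordering alone keeps the fs consistent after power failure. */
		n.reach_device = false;
		n.flush_cache = false;
		break;
	case FLUSH_SYNC:
		/* Must always flush. */
		break;
	case FLUSH_GLOCK_RELEASE:
		/* Flush the non-shared part of the stack, not the shared device cache. */
		n.flush_cache = !barrier_orders_all_nodes;
		break;
	default:
		assert(0 && "unknown flush reason");
		break;
	}
	return n;
}
```

With this sketch, barrier_needs_for(FLUSH_GLOCK_RELEASE, true) asks for
the I/O to reach the device but not for a cache flush, while passing
false for the second argument turns case #3 into a full flush.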

Currently the code waits for I/O to drain in cases #1 and #3 as well
as in case #2, since it was simpler to implement all cases the same
way, at least to start with.

Also in case #3, if we were to implement a non-flushing barrier, then I
think we would need to add a barrier after the in-place metadata
writeback of the inode that is being released, in order to be sure that
cross-node ordering was correct. Hmmm. Maybe we should be doing that
anyway....

Steve.



