Re: [patch 9/9] mm: do_sync_mapping_range integrity fix

steve@xxxxxxxxxxx · Fri, 31 Oct 2008 14:30:48 +0000

Hi,

On Fri, Oct 31, 2008 at 10:10:19AM -0400, Chris Mason wrote:
> UID: 518429                                                  
> 
> On Fri, 2008-10-31 at 10:16 +0100, Nick Piggin wrote:
> > On Thu, Oct 30, 2008 at 04:13:44PM -0700, Andrew Morton wrote:
> > > On Wed, 29 Oct 2008 01:47:24 +1100
> > > npiggin@xxxxxxx wrote:
> > > 
> > > > Chris Mason notices do_sync_mapping_range didn't actually ask for
> > data
> > > > integrity writeout. Unfortunately, it is advertised as being
> > usable for
> > > > data integrity operations.
> > > > 
> > > > This is a data interity bug.
> > > > 
> 
> [ use WB_SYNC_ALL instead of WB_SYNC_NONE ]
> 
> > >  If the caller
> > > is using sync_file_range() for integrity then the caller has done a
> > > SYNC_FILE_RANGE_WAIT_BEFORE.
> > 
> > No disputes about whether the API works "by design". But I think the
> > implementation has a bug. I'll explain:
> 
> I'll definitely agree the current usage is clumsy, and there is a bug in
> fs/buffer.c.  A grep through the rest of the filesystems doesn't turn up
> many assumptions that WB_SYNC_NONE means it's ok to skip dirty pages.
> 
> Greps for WB_SYNC_ALL and WB_SYNC_NONE in the fs code reveal:
> 
> fs/buffer.c:__block_write_full_page()
>                if (wbc->sync_mode != WB_SYNC_NONE || !wbc->nonblocking) {
>                         lock_buffer(bh);
>                 } else if (!trylock_buffer(bh)) {
>                         redirty_page_for_writepage(wbc, page);
>                         continue;
>                 }
> 
> Easily fixed s/||/&&/, which is what XFS does.  reiser3 has the same bug
> in fs/reiserfs/inode.c
> 
> ntfs and gfs2 each have a check that assumes WB_SYNC_NONE means optional
> writeback, both seem fixable with one liners.
>
I doubt that is the case for GFS2... Its a bit subtle, but in order to
ensure that we have the correct lock ordering the code in writepage
for jdata (the only place this occurs, and I assume is what you are
referring to) relies in the fact that because we have a writepages
set for jdata, writepage will only ever be called with WB_SYNC_NONE
and can thus be skipped.

The lock ordering in question is between the transaction (glock)
and the page lock. The former has to be grabbed first and this
is also the reason that this only affects jdata as writeback and
ordered don't need transactions.

This is also the same reason that we have our own version of
write_cache_pages, its more or less identical to the VFS version
except its got our transaction lock stuck in the middle of it.

Ideally (and I speak entirely from a selfish, personal, GFS2 point of
view here!) we would either have a version of writepage which didn't
hold the page lock on entry or we'd just move over to writepages
entirely. This issue came up at the filesystem workshop earlier in the year
and I think it was generally agreed that it was a good idea. I did
have a look into doing it at one stage, but the code scared me off a bit :-)

Steve.

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html