Re: [patch 9/9] mm: do_sync_mapping_range integrity fix

Nick Piggin <npiggin@xxxxxxx> · Fri, 31 Oct 2008 10:16:16 +0100

On Thu, Oct 30, 2008 at 04:13:44PM -0700, Andrew Morton wrote:
> On Wed, 29 Oct 2008 01:47:24 +1100
> npiggin@xxxxxxx wrote:
> 
> > Chris Mason notices do_sync_mapping_range didn't actually ask for data
> > integrity writeout. Unfortunately, it is advertised as being usable for
> > data integrity operations.
> > 
> > This is a data interity bug.
> > 
> > Signed-off-by: Nick Piggin <npiggin@xxxxxxx>
> > ---
> > Index: linux-2.6/fs/sync.c
> > ===================================================================
> > --- linux-2.6.orig/fs/sync.c
> > +++ linux-2.6/fs/sync.c
> > @@ -269,7 +269,7 @@ int do_sync_mapping_range(struct address
> >  
> >  	if (flags & SYNC_FILE_RANGE_WRITE) {
> >  		ret = __filemap_fdatawrite_range(mapping, offset, endbyte,
> > -						WB_SYNC_NONE);
> > +						WB_SYNC_ALL);
> >  		if (ret < 0)
> >  			goto out;
> >  	}
> > 
> 
> Really?
> 
> Some thought did go into the code which you're "fixing".

Yes, I even remember something of a flamewar involving me and you :)

>  If the caller
> is using sync_file_range() for integrity then the caller has done a
> SYNC_FILE_RANGE_WAIT_BEFORE.

No disputes about whether the API works "by design". But I think the
implementation has a bug. I'll explain:

> So all we need to guarantee here is that
> __filemap_fdatawrite_range(WB_SYNC_NONE) will start writeout on all
> dirty pages in the range.  Probably that gets broken lower down as part
> of various hacks^woptimisations have gone in, but which ones, and
> where?  Perhaps _this_ (if it's there) is what should be fixed.

WB_SYNC_NONE has never (until this was introduced) been used for data
integrity AFAIKS. There is code littered throughout fs/ which assumes
WB_SYNC_NONE ~= "efficiency / background writeback". At least
definitely the buffer.c "trylock" will cause dirty pages to be
skipped. There is also a fair amount of filesystem code I haven't looked
at. The fs-writeback.c code might not affect this path, but it also
definitely makes the same assumption about WB_SYNC_NONE, so it would be
ugly to mandate WB_SYNC_NONE is for data integrity from mapping downard,
but not from inode upward...

I didn't check, but I suspect this has been broken since it got merged.

> And if we _do_ make the above change, we don't need to run the
> wait_on_page_writeback_range() if userspace asked for
> SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE, yes?

Now you're asking the hard questions... I think we still have to wait,
because SYNC_FILE_RANGE_WRITE itself doesn't necessarily wait for writeback.
After the optimisation to skip waiting for clean and writeback pages
in write_cache_pages, actually I think this change (to use WB_SYNC_ALL)
should not hurt very much...

> 
> 
> 
> IOW, I don't think enough thought (or at least description of that
> thought) has gone into this one.

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html