On Thu, Oct 30, 2008 at 04:13:44PM -0700, Andrew Morton wrote: > On Wed, 29 Oct 2008 01:47:24 +1100 > npiggin@xxxxxxx wrote: > > > Chris Mason notices do_sync_mapping_range didn't actually ask for data > > integrity writeout. Unfortunately, it is advertised as being usable for > > data integrity operations. > > > > This is a data interity bug. > > > > Signed-off-by: Nick Piggin <npiggin@xxxxxxx> > > --- > > Index: linux-2.6/fs/sync.c > > =================================================================== > > --- linux-2.6.orig/fs/sync.c > > +++ linux-2.6/fs/sync.c > > @@ -269,7 +269,7 @@ int do_sync_mapping_range(struct address > > > > if (flags & SYNC_FILE_RANGE_WRITE) { > > ret = __filemap_fdatawrite_range(mapping, offset, endbyte, > > - WB_SYNC_NONE); > > + WB_SYNC_ALL); > > if (ret < 0) > > goto out; > > } > > > > Really? > > Some thought did go into the code which you're "fixing". Yes, I even remember something of a flamewar involving me and you :) > If the caller > is using sync_file_range() for integrity then the caller has done a > SYNC_FILE_RANGE_WAIT_BEFORE. No disputes about whether the API works "by design". But I think the implementation has a bug. I'll explain: > So all we need to guarantee here is that > __filemap_fdatawrite_range(WB_SYNC_NONE) will start writeout on all > dirty pages in the range. Probably that gets broken lower down as part > of various hacks^woptimisations have gone in, but which ones, and > where? Perhaps _this_ (if it's there) is what should be fixed. WB_SYNC_NONE has never (until this was introduced) been used for data integrity AFAIKS. There is code littered throughout fs/ which assumes WB_SYNC_NONE ~= "efficiency / background writeback". At least definitely the buffer.c "trylock" will cause dirty pages to be skipped. There is also a fair amount of filesystem code I haven't looked at. The fs-writeback.c code might not affect this path, but it also definitely makes the same assumption about WB_SYNC_NONE, so it would be ugly to mandate WB_SYNC_NONE is for data integrity from mapping downard, but not from inode upward... I didn't check, but I suspect this has been broken since it got merged. > And if we _do_ make the above change, we don't need to run the > wait_on_page_writeback_range() if userspace asked for > SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE, yes? Now you're asking the hard questions... I think we still have to wait, because SYNC_FILE_RANGE_WRITE itself doesn't necessarily wait for writeback. After the optimisation to skip waiting for clean and writeback pages in write_cache_pages, actually I think this change (to use WB_SYNC_ALL) should not hurt very much... > > > > IOW, I don't think enough thought (or at least description of that > thought) has gone into this one. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html