On Wed, 10 Dec 2008 08:42:09 +0100 Nick Piggin <npiggin@xxxxxxx> wrote:

> OK, there has not been any further discussion on this approach since I last
> posted it, so I am going to go out on a limb and suggest that we take this
> approach, if any, rather than Mikulas' one.
>
> The advantages of my approach, I think:
> - nothing added to non-sync fastpaths
> - close to the theoretically fewest number of pages will be written/waited
>   for in the fsync case
> - works nicely even in unusual cases (eg. a file with 1GB of dirty data,
>   where the fsync_range only needs to sync a few pages, will not get stuck
>   behind a concurrent dirtier)
>
> And some comments:
> - adds 8 bytes to the radix tree node, but this doesn't really change its
>   fitting into slab or cachelines, so the effective impact is basically
>   zero for this addition.
> - adds an extra lock, but as per the comments, this lock seems to be
>   required in order to fix a bug anyway. And we already tend to hold
>   i_mutex over at least some of the fsync operation. Although if anybody
>   thinks this will be a problem, I'd like to hear it.
>
> Disadvantages:
> - more complex. Although in a way I consider Mikulas' change to have
>   more complex heuristics, which I don't like. I think Mikulas' version
>   would be more complex to analyse at runtime. Also, much of the complexity
>   comes from the extra lock, which as I said fixes a bug.
>
> Any additions or disputes? :)
>
> --
> This patch fixes fsync starvation problems in the presence of concurrent
> dirtiers.
>
> To take an extreme example: if thread A calls fsync on a file with one
> dirty page, at index 1 000 000; at the same time, thread B starts dirtying
> the file from offset 0 onwards.
>
> Thread B perhaps will be allowed to dirty 1 000 pages before hitting its
> dirty threshold, then it will start throttling. Thread A will start writing
> out B's pages. They'll proceed more or less in lockstep until thread B
> finishes writing.
>
> While these semantics are correct, we'd prefer a more timely notification
> that pages dirty at the time fsync was called are safely on disk. In the
> above scenario, we may have to wait until many times the machine's RAM
> capacity has been written to disk before the fsync returns success.
> Ideally, thread A would write the single page at index 1 000 000, then
> return.
>
> This patch introduces a new pagecache tag, PAGECACHE_TAG_FSYNC. Data
> integrity syncs then start by looking through the pagecache for pages which
> are DIRTY and/or WRITEBACK within the requested range, and tagging all
> those as FSYNC.
>
> Subsequent writeout and wait phases then need only look up those pages in
> the pagecache which are tagged with PAGECACHE_TAG_FSYNC.
>
> After the sync operation has completed, the FSYNC tags are removed from the
> radix tree. This design requires exclusive usage of the FSYNC tags for the
> duration of a sync operation, so a lock on the address space is required.
>
> For simplicity, I have removed the "don't wait for writeout if we hit -EIO"
> logic from a couple of places. I don't know if this is really worth the
> added complexity (EIO will still get reported, it will just take a bit
> longer; an app can't rely on specific behaviour or timeliness here).
>
> This lock also solves a real data integrity problem that I only noticed as
> I was writing the livelock avoidance code. If we consider the lock as the
> solution to this bug, this makes the livelock avoidance code much more
> attractive, because then it does not introduce a new lock.
>
> The bug is that fsync errors do not get propagated back up to the caller
> properly in some cases. Consider where we write a page in the writeout
> path, then it encounters an IO error and finishes writeback; in the
> meantime, another process (eg. via sys_sync, or another fsync) clears the
> mapping error bits. Then our fsync will have appeared to finish
> successfully, but actually should have returned an error.
Has *anybody* *ever* complained about this behaviour?  I think maybe one
person after sixish years?  Why fix it?

> ...
>
> --- linux-2.6.orig/mm/filemap.c
> +++ linux-2.6/mm/filemap.c
> @@ -147,6 +147,28 @@ void remove_from_page_cache(struct page
>  	spin_unlock_irq(&mapping->tree_lock);
>  }
>
> +static int sleep_on_fsync(void *word)
> +{
> +	io_schedule();
> +	return 0;
> +}
> +
> +void mapping_fsync_lock(struct address_space *mapping)
> +{
> +	wait_on_bit_lock(&mapping->flags, AS_FSYNC_LOCK, sleep_on_fsync,
> +						TASK_UNINTERRUPTIBLE);
> +	WARN_ON(mapping_tagged(mapping, PAGECACHE_TAG_FSYNC));
> +}
> +
> +void mapping_fsync_unlock(struct address_space *mapping)
> +{
> +	WARN_ON(mapping_tagged(mapping, PAGECACHE_TAG_FSYNC));
> +	WARN_ON(!test_bit(AS_FSYNC_LOCK, &mapping->flags));
> +	clear_bit_unlock(AS_FSYNC_LOCK, &mapping->flags);
> +	smp_mb__after_clear_bit();

hm, I wonder why clear_bit_unlock() didn't already do that.  The
clear_bit_unlock() documentation is rather crappy.

> +	wake_up_bit(&mapping->flags, AS_FSYNC_LOCK);
> +}

It wouldn't hurt to document this interface a little bit.

>  static int sync_page(void *word)
>  {
>  	struct address_space *mapping;
> @@ -287,7 +309,64 @@ int wait_on_page_writeback_range(struct
>
>  			/* until radix tree lookup accepts end_index */
>  			if (page->index > end)
> -				continue;
> +				break;
> +
> +			wait_on_page_writeback(page);
> +			if (PageError(page))
> +				ret = -EIO;
> +		}
> +		pagevec_release(&pvec);
> +		cond_resched();
> +	}
> +
> +	/* Check for outstanding write errors */
> +	if (test_and_clear_bit(AS_ENOSPC, &mapping->flags))
> +		ret = -ENOSPC;
> +	if (test_and_clear_bit(AS_EIO, &mapping->flags))
> +		ret = -EIO;
> +
> +	return ret;
> +}
> +
> +int wait_on_page_writeback_range_fsync(struct address_space *mapping,
> +				pgoff_t start, pgoff_t end)

We already have a wait_on_page_writeback_range().  The reader of your code
will be wondering why this variant exists, and how it differs.  Sigh.
> +{
> +	struct pagevec pvec;
> +	int nr_pages;
> +	int ret = 0;
> +	pgoff_t index;
> +
> +	WARN_ON(!test_bit(AS_FSYNC_LOCK, &mapping->flags));
> +
> +	if (end < start)
> +		goto out;
> +
> +	pagevec_init(&pvec, 0);
> +	index = start;
> +	while ((index <= end) &&
> +		(nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
> +			PAGECACHE_TAG_FSYNC,
> +			min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1)) != 0) {
> +		unsigned i;
> +
> +		spin_lock_irq(&mapping->tree_lock);
> +		for (i = 0; i < nr_pages; i++) {
> +			struct page *page = pvec.pages[i];
> +
> +			/* until radix tree lookup accepts end_index */
> +			if (page->index > end)
> +				break;
> +
> +			radix_tree_tag_clear(&mapping->page_tree, page->index, PAGECACHE_TAG_FSYNC);

Don't mucky up the nice kernel code please.

> +		}
> +		spin_unlock_irq(&mapping->tree_lock);
> +
> +		for (i = 0; i < nr_pages; i++) {
> +			struct page *page = pvec.pages[i];
> +
> +			/* until radix tree lookup accepts end_index */
> +			if (page->index > end)
> +				break;
>
>  			wait_on_page_writeback(page);
>  			if (PageError(page))
> @@ -303,6 +382,7 @@ int wait_on_page_writeback_range(struct
>  	if (test_and_clear_bit(AS_EIO, &mapping->flags))
>  		ret = -EIO;
>
> +out:
>  	return ret;
>  }
>
> @@ -325,18 +405,20 @@ int sync_page_range(struct inode *inode,
>  {
>  	pgoff_t start = pos >> PAGE_CACHE_SHIFT;
>  	pgoff_t end = (pos + count - 1) >> PAGE_CACHE_SHIFT;
> -	int ret;
> +	int ret, ret2;
>
>  	if (!mapping_cap_writeback_dirty(mapping) || !count)
>  		return 0;
> +	mutex_lock(&inode->i_mutex);

I am unable to determine why i_mutex is being taken here.

> +	mapping_fsync_lock(mapping);
>  	ret = filemap_fdatawrite_range(mapping, pos, pos + count - 1);
> -	if (ret == 0) {
> -		mutex_lock(&inode->i_mutex);
> +	if (ret == 0)
>  		ret = generic_osync_inode(inode, mapping, OSYNC_METADATA);
> -		mutex_unlock(&inode->i_mutex);
> -	}
> +	mutex_unlock(&inode->i_mutex);

Nor why it was dropped at this particular place.
> +	ret2 = wait_on_page_writeback_range_fsync(mapping, start, end);
>  	if (ret == 0)
> -		ret = wait_on_page_writeback_range(mapping, start, end);
> +		ret = ret2;
> +	mapping_fsync_unlock(mapping);
>  	return ret;
>  }
>  EXPORT_SYMBOL(sync_page_range);
> @@ -357,15 +439,18 @@ int sync_page_range_nolock(struct inode
>  {
>  	pgoff_t start = pos >> PAGE_CACHE_SHIFT;
>  	pgoff_t end = (pos + count - 1) >> PAGE_CACHE_SHIFT;
> -	int ret;
> +	int ret, ret2;
>
>  	if (!mapping_cap_writeback_dirty(mapping) || !count)
>  		return 0;
> +	mapping_fsync_lock(mapping);
>  	ret = filemap_fdatawrite_range(mapping, pos, pos + count - 1);
>  	if (ret == 0)
>  		ret = generic_osync_inode(inode, mapping, OSYNC_METADATA);
> +	ret2 = wait_on_page_writeback_range_fsync(mapping, start, end);
>  	if (ret == 0)
> -		ret = wait_on_page_writeback_range(mapping, start, end);
> +		ret = ret2;
> +	mapping_fsync_unlock(mapping);
>  	return ret;
>  }
>  EXPORT_SYMBOL(sync_page_range_nolock);
> @@ -389,23 +474,30 @@ int filemap_fdatawait(struct address_spa
>  }
>  EXPORT_SYMBOL(filemap_fdatawait);
>
> +int filemap_fdatawait_fsync(struct address_space *mapping)
> +{
> +	loff_t i_size = i_size_read(mapping->host);
> +
> +	if (i_size == 0)
> +		return 0;
> +
> +	return wait_on_page_writeback_range_fsync(mapping, 0,
> +				(i_size - 1) >> PAGE_CACHE_SHIFT);
> +}

filemap_fdatawait() is documented...

>  int filemap_write_and_wait(struct address_space *mapping)
>  {
>  	int err = 0;
>
>  	if (mapping->nrpages) {
> +		int err2;
> +
> +		mapping_fsync_lock(mapping);
>  		err = filemap_fdatawrite(mapping);
> -		/*
> -		 * Even if the above returned error, the pages may be
> -		 * written partially (e.g. -ENOSPC), so we wait for it.
> -		 * But the -EIO is special case, it may indicate the worst
> -		 * thing (e.g. bug) happened, so we avoid waiting for it.
> -		 */
> -		if (err != -EIO) {
> -			int err2 = filemap_fdatawait(mapping);
> -			if (!err)
> -				err = err2;
> -		}
> +		err2 = filemap_fdatawait_fsync(mapping);
> +		if (!err)
> +			err = err2;
> +		mapping_fsync_unlock(mapping);
>  	}
>  	return err;
>  }
> @@ -428,16 +520,16 @@ int filemap_write_and_wait_range(struct
>  	int err = 0;
>
>  	if (mapping->nrpages) {
> -		err = __filemap_fdatawrite_range(mapping, lstart, lend,
> -						 WB_SYNC_ALL);
> -		/* See comment of filemap_write_and_wait() */
> -		if (err != -EIO) {
> -			int err2 = wait_on_page_writeback_range(mapping,
> +		int err2;
> +
> +		mapping_fsync_lock(mapping);
> +		err = filemap_fdatawrite_range(mapping, lstart, lend);
> +		err2 = wait_on_page_writeback_range_fsync(mapping,
>  					lstart >> PAGE_CACHE_SHIFT,
>  					lend >> PAGE_CACHE_SHIFT);
> -			if (!err)
> -				err = err2;
> -		}
> +		if (!err)
> +			err = err2;
> +		mapping_fsync_unlock(mapping);
>  	}
>  	return err;
>  }
>
> ...
>
> --- linux-2.6.orig/mm/page-writeback.c
> +++ linux-2.6/mm/page-writeback.c
> @@ -872,9 +872,11 @@ int write_cache_pages(struct address_spa
>  	pgoff_t index;
>  	pgoff_t end;		/* Inclusive */
>  	pgoff_t done_index;
> +	unsigned int tag = PAGECACHE_TAG_DIRTY;
>  	int cycled;
>  	int range_whole = 0;
>  	long nr_to_write = wbc->nr_to_write;
> +	int sync = wbc->sync_mode != WB_SYNC_NONE;
>
>  	if (wbc->nonblocking && bdi_write_congested(bdi)) {
>  		wbc->encountered_congestion = 1;
> @@ -897,13 +899,40 @@ int write_cache_pages(struct address_spa
>  		range_whole = 1;
>  		cycled = 1; /* ignore range_cyclic tests */
>  	}
> +
> +	if (sync) {
> +		WARN_ON(!test_bit(AS_FSYNC_LOCK, &mapping->flags));

hm.  Is fsync the only caller of write_cache_pages(!WB_SYNC_NONE)?
Surprised.

> +		/*
> +		 * If any pages are writeback or dirty, mark them fsync now.
> +		 * These are the pages we need to wait in in order to meet our

s/in/on/

> +		 * data integrity contract.
> +		 *
> +		 * Writeback pages need to be tagged, so we'll wait for them
> +		 * at the end of the writeout phase.
> +		 * However, the lookup below
> +		 * could just look up pages which are _DIRTY AND _FSYNC,
> +		 * because we don't care about them for the writeout phase.
> +		 */
> +		spin_lock_irq(&mapping->tree_lock);
> +		if (!radix_tree_gang_tag_set_if_tagged(&mapping->page_tree,
> +				index, end,
> +				(1UL << PAGECACHE_TAG_DIRTY) |
> +				(1UL << PAGECACHE_TAG_WRITEBACK),
> +				(1UL << PAGECACHE_TAG_FSYNC))) {

ooh, so that's what that thing does.

> +			/* nothing tagged */
> +			spin_unlock_irq(&mapping->tree_lock);
> +			return 0;

Can we please avoid the deeply-nested-return hand grenade?

> +		}
> +		spin_unlock_irq(&mapping->tree_lock);
> +		tag = PAGECACHE_TAG_FSYNC;
> +	}
> +
>  retry:
>  	done_index = index;
>  	while (!done && (index <= end)) {
>  		int i;
>
>  		nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
> -			      PAGECACHE_TAG_DIRTY,
> +			      tag,
>  			      min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1);
>  		if (nr_pages == 0)
>  			break;
> @@ -951,7 +980,7 @@ continue_unlock:
>  			}
>
>  			if (PageWriteback(page)) {
> -				if (wbc->sync_mode != WB_SYNC_NONE)
> +				if (sync)
>  					wait_on_page_writeback(page);
>  				else
>  					goto continue_unlock;
> @@ -981,7 +1010,7 @@ continue_unlock:
>  			}
>  		}
>
> -		if (wbc->sync_mode == WB_SYNC_NONE) {
> +		if (!sync) {
>  			wbc->nr_to_write--;
>  			if (wbc->nr_to_write <= 0) {
>  				done = 1;
> Index: linux-2.6/drivers/usb/gadget/file_storage.c
> ===================================================================
> --- linux-2.6.orig/drivers/usb/gadget/file_storage.c
> +++ linux-2.6/drivers/usb/gadget/file_storage.c
> @@ -1873,13 +1873,15 @@ static int fsync_sub(struct lun *curlun)
>
>  	inode = filp->f_path.dentry->d_inode;
>  	mutex_lock(&inode->i_mutex);
> +	mapping_fsync_lock(mapping);

Dood.  Do `make allmodconfig ; make'

>  	rc = filemap_fdatawrite(inode->i_mapping);
>  	err = filp->f_op->fsync(filp, filp->f_path.dentry, 1);
>  	if (!rc)
>  		rc = err;
> -	err = filemap_fdatawait(inode->i_mapping);
> +	err = filemap_fdatawait_fsync(inode->i_mapping);
>  	if (!rc)
>  		rc = err;
> +	mapping_fsync_unlock(mapping);
>
> ...
I won't apply this because of the build breakage.

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html