On Fri, Jun 04, 2010 at 08:47:11PM +0200, Jan Kara wrote: > We try to avoid livelocks of writeback when some steadily creates > dirty pages in a mapping we are writing out. For memory-cleaning > writeback, using nr_to_write works reasonably well but we cannot > really use it for data integrity writeback. This patch tries to > solve the problem. > > The idea is simple: Tag all pages that should be written back > with a special tag (TOWRITE) in the radix tree. This can be done > rather quickly and thus livelocks should not happen in practice. > Then we start doing the hard work of locking pages and sending > them to disk only for those pages that have TOWRITE tag set. > > Signed-off-by: Jan Kara <jack@xxxxxxx> > --- > include/linux/fs.h | 1 + > include/linux/radix-tree.h | 2 +- > mm/page-writeback.c | 44 ++++++++++++++++++++++++++++++++++++++++++-- > 3 files changed, 44 insertions(+), 3 deletions(-) > > diff --git a/include/linux/fs.h b/include/linux/fs.h > index 3428393..fe308f0 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -685,6 +685,7 @@ struct block_device { > */ > #define PAGECACHE_TAG_DIRTY 0 > #define PAGECACHE_TAG_WRITEBACK 1 > +#define PAGECACHE_TAG_TOWRITE 2 > > int mapping_tagged(struct address_space *mapping, int tag); > > diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h > index efdfb07..f7ebff8 100644 > --- a/include/linux/radix-tree.h > +++ b/include/linux/radix-tree.h > @@ -55,7 +55,7 @@ static inline int radix_tree_is_indirect_ptr(void *ptr) > > /*** radix-tree API starts here ***/ > > -#define RADIX_TREE_MAX_TAGS 2 > +#define RADIX_TREE_MAX_TAGS 3 > > /* root tags are stored in gfp_mask, shifted by __GFP_BITS_SHIFT */ > struct radix_tree_root { > diff --git a/mm/page-writeback.c b/mm/page-writeback.c > index b289310..f590a12 100644 > --- a/mm/page-writeback.c > +++ b/mm/page-writeback.c > @@ -807,6 +807,30 @@ void __init page_writeback_init(void) > } > > /** > + * tag_pages_for_writeback - tag pages to be written by write_cache_pages > + * @mapping: address space structure to write > + * @start: starting page index > + * @end: ending page index (inclusive) > + * > + * This function scans the page range from @start to @end and tags all pages > + * that have DIRTY tag set with a special TOWRITE tag. The idea is that > + * write_cache_pages (or whoever calls this function) will then use TOWRITE tag > + * to identify pages eligible for writeback. This mechanism is used to avoid > + * livelocking of writeback by a process steadily creating new dirty pages in > + * the file (thus it is important for this function to be damn quick so that it > + * can tag pages faster than a dirtying process can create them). > + */ > +void tag_pages_for_writeback(struct address_space *mapping, > + pgoff_t start, pgoff_t end) > +{ > + spin_lock_irq(&mapping->tree_lock); > + radix_tree_gang_tag_if_tagged(&mapping->page_tree, start, end, > + PAGECACHE_TAG_DIRTY, PAGECACHE_TAG_TOWRITE); > + spin_unlock_irq(&mapping->tree_lock); > +} > +EXPORT_SYMBOL(tag_pages_for_writeback); > + > +/** > * write_cache_pages - walk the list of dirty pages of the given address space and write all of them. > * @mapping: address space structure to write > * @wbc: subtract the number of written pages from *@wbc->nr_to_write > @@ -820,6 +844,13 @@ void __init page_writeback_init(void) > * the call was made get new I/O started against them. If wbc->sync_mode is > * WB_SYNC_ALL then we were called for data integrity and we must wait for > * existing IO to complete. > + * > + * To avoid livelocks (when other process dirties new pages), we first tag > + * pages which should be written back with TOWRITE tag and only then start > + * writing them. For data-integrity sync we have to be careful so that we do > + * not miss some pages (e.g., because some other process has cleared TOWRITE > + * tag we set). The rule we follow is that TOWRITE tag can be cleared only > + * by the process clearing the DIRTY tag (and submitting the page for IO). > */ > int write_cache_pages(struct address_space *mapping, > struct writeback_control *wbc, writepage_t writepage, > @@ -836,6 +867,7 @@ int write_cache_pages(struct address_space *mapping, > int cycled; > int range_whole = 0; > long nr_to_write = wbc->nr_to_write; > + int tag; > > pagevec_init(&pvec, 0); > if (wbc->range_cyclic) { > @@ -853,13 +885,18 @@ int write_cache_pages(struct address_space *mapping, > range_whole = 1; > cycled = 1; /* ignore range_cyclic tests */ > } > + if (wbc->sync_mode == WB_SYNC_ALL) > + tag = PAGECACHE_TAG_TOWRITE; > + else > + tag = PAGECACHE_TAG_DIRTY; > retry: > + if (wbc->sync_mode == WB_SYNC_ALL) > + tag_pages_for_writeback(mapping, index, end); I wonder if this is too much spinlock latency in a huge dirty file? Some kid of batching of the operation perhaps would be good? > done_index = index; > while (!done && (index <= end)) { > int i; > > - nr_pages = pagevec_lookup_tag(&pvec, mapping, &index, > - PAGECACHE_TAG_DIRTY, > + nr_pages = pagevec_lookup_tag(&pvec, mapping, &index, tag, > min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1); > if (nr_pages == 0) > break; Would it be neat to clear the tag even if we didn't set page to writeback? It should be uncommon case. > @@ -1319,6 +1356,9 @@ int test_set_page_writeback(struct page *page) > radix_tree_tag_clear(&mapping->page_tree, > page_index(page), > PAGECACHE_TAG_DIRTY); > + radix_tree_tag_clear(&mapping->page_tree, > + page_index(page), > + PAGECACHE_TAG_TOWRITE); > spin_unlock_irqrestore(&mapping->tree_lock, flags); > } else { > ret = TestSetPageWriteback(page); It would be nice to have bitwise tag clearing so we can clear multiple at once. Then clear_tag = PAGECACHE_TAG_TOWRITE; if (!PageDirty(page)) clear_tag |= PAGECACHE_TAG_DIRTY; That could reduce overhead a bit more. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html