On Mon 25-05-09 12:34:10, Jens Axboe wrote: > On Mon, May 25 2009, Jan Kara wrote: > > Hi Jens, > > > > I've noticed a few more things: > > > > On Mon 25-05-09 09:34:39, Jens Axboe wrote: > > > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c > > > index 1137408..7cb4d02 100644 > > > --- a/fs/fs-writeback.c > > > +++ b/fs/fs-writeback.c > > > @@ -19,6 +19,8 @@ > > > #include <linux/sched.h> > > > #include <linux/fs.h> > > > #include <linux/mm.h> > > > +#include <linux/kthread.h> > > > +#include <linux/freezer.h> > > > #include <linux/writeback.h> > > > #include <linux/blkdev.h> > > > #include <linux/backing-dev.h> > > > @@ -61,10 +63,193 @@ int writeback_in_progress(struct backing_dev_info *bdi) > > > */ > > > static void writeback_release(struct backing_dev_info *bdi) > > > { > > > - BUG_ON(!writeback_in_progress(bdi)); > > > + WARN_ON_ONCE(!writeback_in_progress(bdi)); > > > + bdi->wb_arg.nr_pages = 0; > > > + bdi->wb_arg.sb = NULL; > > > clear_bit(BDI_pdflush, &bdi->state); > > > } > > > > > > +int bdi_start_writeback(struct backing_dev_info *bdi, struct super_block *sb, > > > + long nr_pages, enum writeback_sync_modes sync_mode) > > > +{ > > > + /* > > > + * This only happens the first time someone kicks this bdi, so put > > > + * it out-of-line. > > > + */ > > > + if (unlikely(!bdi->task)) { > > > + bdi_add_default_flusher_task(bdi); > > > + return 1; > > > + } > > > + > > > + if (writeback_acquire(bdi)) { > > > + bdi->wb_arg.nr_pages = nr_pages; > > > + bdi->wb_arg.sb = sb; > > > + bdi->wb_arg.sync_mode = sync_mode; > > > + /* > > > + * make above store seen before the task is woken > > > + */ > > > + smp_mb(); > > > + wake_up(&bdi->wait); > > > + } > > Hmm, wouldn't the interface be more useful if we could just pass down the > > whole writeback_control to the flusher threads? > > Yeah, that's what the later patch does too, when support for > 1 thread > has been added. Really? I cannot see it in the series. I see that later you introduce struct bdi_work but it still contains just nr_pages, sb_data and sync_mode. What I meant was we could include 'struct writeback_control' into the bdi (or bdi_work later) and just pass it down so that caller can have more finegrained control of what writeback the thread is going to perform (like nonblocking, older_than_this, maybe something else). > > Maybe a real question is, what should be the role of flusher threads? > > Should they be just threads for kupdate / pdflush style writeout or do we > > expect them to be used also for other cases when we need to submit IO to > > the BDI? > > The thought has occured to me to punt other types of work to these > threads, now that we have them available. I haven't really thought that > through yet (and other changes may be required too), but it could be > things like async IO or even simple things like the unplugging that is > now done by kblockd. OK. > > > + > > > + oldest_jif = jiffies - msecs_to_jiffies(dirty_expire_interval * 10); > > > + > > > + nr_to_write = global_page_state(NR_FILE_DIRTY) + > > > + global_page_state(NR_UNSTABLE_NFS) + > > > + (inodes_stat.nr_inodes - inodes_stat.nr_unused); > > > + > > > + while (nr_to_write > 0) { > > > + wbc.more_io = 0; > > > + wbc.encountered_congestion = 0; > > > + wbc.nr_to_write = MAX_WRITEBACK_PAGES; > > > + generic_sync_bdi_inodes(NULL, &wbc); > > > + if (wbc.nr_to_write > 0) > > > + break; /* All the old data is written */ > > > + nr_to_write -= MAX_WRITEBACK_PAGES; > > > + } > > > +} > > > + > > > +static inline bool over_bground_thresh(void) > > > +{ > > > + unsigned long background_thresh, dirty_thresh; > > > + > > > + get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL); > > > + > > > + return (global_page_state(NR_FILE_DIRTY) + > > > + global_page_state(NR_UNSTABLE_NFS) >= background_thresh); > > > +} > > > + > > > +static void bdi_pdflush(struct backing_dev_info *bdi) > > > +{ > > > + struct writeback_control wbc = { > > > + .bdi = bdi, > > > + .sync_mode = bdi->wb_arg.sync_mode, > > > + .older_than_this = NULL, > > > + .range_cyclic = 1, > > > + }; > > > + long nr_pages = bdi->wb_arg.nr_pages; > > > + > > > + for (;;) { > > > + if (wbc.sync_mode == WB_SYNC_NONE && nr_pages <= 0 && > > > + !over_bground_thresh()) > > > + break; > > > + > > > + wbc.more_io = 0; > > > + wbc.encountered_congestion = 0; > > > + wbc.nr_to_write = MAX_WRITEBACK_PAGES; > > > + wbc.pages_skipped = 0; > > > + generic_sync_bdi_inodes(bdi->wb_arg.sb, &wbc); > > > + nr_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write; > > > + /* > > > + * If we ran out of stuff to write, bail unless more_io got set > > > + */ > > > + if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) { > > > + if (wbc.more_io) > > > + continue; > > > + break; > > > + } > > > + } > > > +} > > > + > > > +/* > > > + * Handle writeback of dirty data for the device backed by this bdi. Also > > > + * wakes up periodically and does kupdated style flushing. > > > + */ > > > +int bdi_writeback_task(struct backing_dev_info *bdi) > > > +{ > > > + while (!kthread_should_stop()) { > > > + unsigned long wait_jiffies; > > > + DEFINE_WAIT(wait); > > > + > > > + prepare_to_wait(&bdi->wait, &wait, TASK_INTERRUPTIBLE); > > > + wait_jiffies = msecs_to_jiffies(dirty_writeback_interval * 10); > > > + schedule_timeout(wait_jiffies); > > > + try_to_freeze(); > > > + > > > + /* > > > + * We get here in two cases: > > > + * > > > + * schedule_timeout() returned because the dirty writeback > > > + * interval has elapsed. If that happens, we will be able > > > + * to acquire the writeback lock and will proceed to do > > > + * kupdated style writeout. > > > + * > > > + * Someone called bdi_start_writeback(), which will acquire > > > + * the writeback lock. This means our writeback_acquire() > > > + * below will fail and we call into bdi_pdflush() for > > > + * pdflush style writeout. > > > + * > > > + */ > > > + if (writeback_acquire(bdi)) > > > + bdi_kupdated(bdi); > > > + else > > > + bdi_pdflush(bdi); > > > + > > > + writeback_release(bdi); > > > + finish_wait(&bdi->wait, &wait); > > > + } > > Hmm, what does protect this thread from racing with umount? Note that old > > flusher threads took s_umount semaphore and also elevated sb->s_count. > > If everything is fine, we should have a comment somewhere around here what > > stops umount from removing things under us or why it does not matter... > > Maybe this is the reason of the oopses Yanmin saw. > > I think it was that oops, it got fixed by generic_sync_sb_inodes() doing > sync writeout again. So I think what happened is that it closed that > hole, but there could still be trouble. I'll look into that as well. > > > BTW, one more difference here: Previously, if pdflush saw congestion on > > the device, it did congestion_wait() and retried. Now it end's up waiting > > whole dirty_writeback_interval so it should result in a bursty writeback on > > a slow disk, couldn't it? > > Now it'll block on the queue instead of giving up, so in my opinion that > should provide smoother writeout. From the tests I did, it does. Ah, OK, you don't set wbc->nonblocking from kupdate or pdflush. Makes sence. > > > +void bdi_writeback_all(struct super_block *sb, long nr_pages, > > > + enum writeback_sync_modes sync_mode) > > > +{ > > > + struct backing_dev_info *bdi, *tmp; > > > + > > > + mutex_lock(&bdi_lock); > > > + > > > + list_for_each_entry_safe(bdi, tmp, &bdi_list, bdi_list) { > > > + if (!bdi_has_dirty_io(bdi)) > > > + continue; > > > + bdi_start_writeback(bdi, sb, nr_pages, sync_mode); > > > + } > > > + > > > + mutex_unlock(&bdi_lock); > > > +} > > > + > > > /** > > > * __mark_inode_dirty - internal function > > > * @inode: inode to mark > > ... > > > @@ -591,13 +718,10 @@ static void generic_sync_bdi_inodes(struct backing_dev_info *bdi, > > > void generic_sync_sb_inodes(struct super_block *sb, > > > struct writeback_control *wbc) > > > { > > > - const int is_blkdev_sb = sb_is_blkdev_sb(sb); > > > - struct backing_dev_info *bdi; > > > - > > > - mutex_lock(&bdi_lock); > > > - list_for_each_entry(bdi, &bdi_list, bdi_list) > > > - generic_sync_bdi_inodes(bdi, wbc, sb, is_blkdev_sb); > > > - mutex_unlock(&bdi_lock); > > > + if (wbc->bdi) > > > + generic_sync_bdi_inodes(sb, wbc); > > > + else > > > + bdi_writeback_all(sb, wbc->nr_to_write, wbc->sync_mode); > > Hmm, you write in the changelog that bdi_writeback_all() writes inline > > now but AFAICS it still happens through the writeback threads. Which, on a > > second though probably *is* what we want because we want writeback to go in > > parallel on all devices we have. > > Note that this is patch 4, later it is sync. But yes, I think it should Ah, right. I missed that. > be async as well, but we do need a way to wait on the submitted work to > be processed in case of WB_SYNC_ALL. Yes, this is a bit complicated. You'd need some notification bdi_work is processed. > > > * Start writeback of `nr_pages' pages. If `nr_pages' is zero, write back > > > * the whole world. Returns 0 if a pdflush thread was dispatched. Returns > > > * -1 if all pdflush threads were busy. > > > */ > > > -int wakeup_pdflush(long nr_pages) > > > +void wakeup_flusher_threads(long nr_pages) > > > { > > > if (nr_pages == 0) > > > nr_pages = global_page_state(NR_FILE_DIRTY) + > > > global_page_state(NR_UNSTABLE_NFS); > > > - return pdflush_operation(background_writeout, nr_pages); > > > + bdi_writeback_all(NULL, nr_pages, WB_SYNC_NONE); > > > + return; > > > } > > ... > > > - > > > -static void laptop_flush(unsigned long unused) > > > -{ > > > - sys_sync(); > > > -} > > > - > > > static void laptop_timer_fn(unsigned long unused) > > > { > > > - pdflush_operation(laptop_flush, 0); > > > + wakeup_flusher_threads(0); > > > } > > > > > > /* > > Here you significantly change the behavior of laptop mode - previously it > > reliably synced all the data to disk once the "laptop writeout timeout" > > elapsed. The idea is that we want to write all the dirty data we have so > > that the drive can go to sleep for a longer period. > Yeah I know, I wrote the original laptop mode implementation :-) Sorry, I wasn't aware of that. > > Now you changed that to asynchronous WB_SYNC_NONE writeback of some > > number of pages. In particular if the disk gets congested, we'll just stop > > doing writeback which probably isn't what we want for laptop mode... > > Congestion doesn't matter, we don't do nonblocking writeout from the > threads at all. And it is sync later in the series. > > I'll switch it back to async in general throughout the series, and add > some way of making sure we actually have things submitted before doing > the filemap_fdatawait() in generic_sync_sb_inodes(). I think that should > fix it. OK, looking forward to your patches :) Honza -- Jan Kara <jack@xxxxxxx> SUSE Labs, CR -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html