On Wed 16-03-11 12:53:31, Vivek Goyal wrote:
> On Tue, Mar 08, 2011 at 11:31:13PM +0100, Jan Kara wrote:
> [..]
> > +/*
> > + * balance_dirty_pages() must be called by processes which are generating dirty
> > + * data. It looks at the number of dirty pages in the machine and will force
> > + * the caller to perform writeback if the system is over `vm_dirty_ratio'.
> > + * If we're over `background_thresh' then the writeback threads are woken to
> > + * perform some writeout.
> > + */
> > +static void balance_dirty_pages(struct address_space *mapping,
> > +				unsigned long write_chunk)
> > +{
> > +	struct backing_dev_info *bdi = mapping->backing_dev_info;
> > +	struct balance_waiter bw;
> > +	struct dirty_limit_state st;
> > +	int dirty_exceeded = check_dirty_limits(bdi, &st);
> > +
> > +	if (dirty_exceeded < DIRTY_MAY_EXCEED_LIMIT ||
> > +	    (dirty_exceeded == DIRTY_MAY_EXCEED_LIMIT &&
> > +	     !bdi_task_limit_exceeded(&st, current))) {
> > +		if (bdi->dirty_exceeded &&
> > +		    dirty_exceeded < DIRTY_MAY_EXCEED_LIMIT)
> > +			bdi->dirty_exceeded = 0;
> >  		/*
> > -		 * Increase the delay for each loop, up to our previous
> > -		 * default of taking a 100ms nap.
> > +		 * In laptop mode, we wait until hitting the higher threshold
> > +		 * before starting background writeout, and then write out all
> > +		 * the way down to the lower threshold. So slow writers cause
> > +		 * minimal disk activity.
> > +		 *
> > +		 * In normal mode, we start background writeout at the lower
> > +		 * background_thresh, to keep the amount of dirty memory low.
> >  		 */
> > -		pause <<= 1;
> > -		if (pause > HZ / 10)
> > -			pause = HZ / 10;
> > +		if (!laptop_mode && dirty_exceeded == DIRTY_EXCEED_BACKGROUND)
> > +			bdi_start_background_writeback(bdi);
> > +		return;
> >  	}
> >
> > -	/* Clear dirty_exceeded flag only when no task can exceed the limit */
> > -	if (!min_dirty_exceeded && bdi->dirty_exceeded)
> > -		bdi->dirty_exceeded = 0;
> > +	if (!bdi->dirty_exceeded)
> > +		bdi->dirty_exceeded = 1;
> >
> > -	if (writeback_in_progress(bdi))
> > -		return;
> > +	trace_writeback_balance_dirty_pages_waiting(bdi, write_chunk);
> > +	/* Kick flusher thread to start doing work if it isn't already */
> > +	bdi_start_background_writeback(bdi);
> >
> > +	bw.bw_wait_pages = write_chunk;
> > +	bw.bw_task = current;
> > +	spin_lock(&bdi->balance_lock);
> >  	/*
> > -	 * In laptop mode, we wait until hitting the higher threshold before
> > -	 * starting background writeout, and then write out all the way down
> > -	 * to the lower threshold. So slow writers cause minimal disk activity.
> > -	 *
> > -	 * In normal mode, we start background writeout at the lower
> > -	 * background_thresh, to keep the amount of dirty memory low.
> > +	 * First item? Need to schedule distribution of IO completions among
> > +	 * items on balance_list
> > +	 */
> > +	if (list_empty(&bdi->balance_list)) {
> > +		bdi->written_start = bdi_stat_sum(bdi, BDI_WRITTEN);
> > +		/* FIXME: Delay should be autotuned based on dev throughput */
> > +		schedule_delayed_work(&bdi->balance_work, HZ/10);
> > +	}
> > +	/*
> > +	 * Add work to the balance list, from now on the structure is handled
> > +	 * by distribute_page_completions()
> > +	 */
> > +	list_add_tail(&bw.bw_list, &bdi->balance_list);
> > +	bdi->balance_waiters++;
> Had a query.
>
> - What makes sure that flusher thread will not stop writing back till all
>   the waiters on the bdi have been woken up. IIUC, flusher thread will
>   stop once global background ratio is with-in limit.
>   Is it possible that there are still some waiter on some bdi waiting for
>   more pages to finish writeback and that might not happen for sometime.

Yes, this can happen, but once distribute_page_completions() gets called
(after a given time), it will notice that we are below the limits and wake
all waiters.

Under normal circumstances, we should have a decent estimate of when
distribute_page_completions() needs to be called, and that should be long
before the flusher thread finishes its work. But in cases when a bdi has
only a small share of the global dirty limit, what you describe can happen.

								Honza
-- 
Jan Kara <jack@xxxxxxx>
SUSE Labs, CR
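
To make the fallback described above concrete, here is a minimal userspace
sketch. It is not the kernel code from the patch: struct toy_bdi, struct
waiter, toy_add_waiter() and toy_distribute() are hypothetical stand-ins,
and the even split of completions among waiters is an assumption made only
for illustration. The point it shows is that a periodic distribution pass
wakes every remaining waiter once the dirty count is below the limit, even
if no further pages get written.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; none of these are kernel definitions. */
struct waiter {
	unsigned long wait_pages;	/* completions this waiter still needs */
	bool woken;			/* set once the waiter may continue */
	struct waiter *next;
};

struct toy_bdi {
	unsigned long written;		/* pages written so far */
	unsigned long written_start;	/* 'written' at the previous pass */
	unsigned long dirty;		/* pages currently dirty */
	unsigned long dirty_limit;	/* throttling threshold */
	struct waiter *head;		/* list of sleeping waiters */
};

/* Queue a waiter that wants 'pages' completions before it continues. */
static void toy_add_waiter(struct toy_bdi *bdi, struct waiter *w,
			   unsigned long pages)
{
	w->wait_pages = pages;
	w->woken = false;
	w->next = bdi->head;
	bdi->head = w;
}

/*
 * Periodic pass (the patch runs its equivalent from delayed work).
 * Splits the completions seen since the last pass evenly among waiters
 * (an assumption of this sketch) and wakes those that are satisfied.
 * If we are below the dirty limit, everybody is woken regardless.
 * Returns the number of waiters still asleep.
 */
static size_t toy_distribute(struct toy_bdi *bdi)
{
	bool below_limit = bdi->dirty < bdi->dirty_limit;
	unsigned long done, share;
	size_t n = 0, asleep = 0;
	struct waiter *w, **pp;

	assert(bdi->written >= bdi->written_start);
	done = bdi->written - bdi->written_start;
	bdi->written_start = bdi->written;

	for (w = bdi->head; w; w = w->next)
		n++;
	share = n ? done / n : 0;

	pp = &bdi->head;
	while ((w = *pp) != NULL) {
		if (below_limit || w->wait_pages <= share) {
			w->woken = true;
			*pp = w->next;		/* unlink the woken waiter */
		} else {
			w->wait_pages -= share;
			asleep++;
			pp = &w->next;
		}
	}
	return asleep;
}
```

In this model, a waiter asking for far more pages than the device ends up
writing stays queued only until the first pass that observes dirty <
dirty_limit. The window Vivek describes corresponds to the time between the
flusher going idle and that next pass.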