On Sun, 2009-08-23 at 21:00 +0800, Wu Fengguang wrote:
> On Sun, Aug 23, 2009 at 05:33:33PM +0800, Richard Kennedy wrote:
> > On Sat, 2009-08-22 at 10:51 +0800, Wu Fengguang wrote:
> > > >
> > > >  mm/page-writeback.c |  116 +++++++++++++++---------------------------
> > > >  1 file changed, 43 insertions(+), 73 deletions(-)
> > > >
> > > > diff -puN mm/page-writeback.c~mm-balance_dirty_pages-reduce-calls-to-global_page_state-to-reduce-cache-references mm/page-writeback.c
> > > > --- a/mm/page-writeback.c~mm-balance_dirty_pages-reduce-calls-to-global_page_state-to-reduce-cache-references
> > > > +++ a/mm/page-writeback.c
> > > > @@ -249,32 +249,6 @@ static void bdi_writeout_fraction(struct
> > > >  	}
> > > >  }
> > > >
> > > > -/*
> > > > - * Clip the earned share of dirty pages to that which is actually available.
> > > > - * This avoids exceeding the total dirty_limit when the floating averages
> > > > - * fluctuate too quickly.
> > > > - */
> > > > -static void clip_bdi_dirty_limit(struct backing_dev_info *bdi,
> > > > -		unsigned long dirty, unsigned long *pbdi_dirty)
> > > > -{
> > > > -	unsigned long avail_dirty;
> > > > -
> > > > -	avail_dirty = global_page_state(NR_FILE_DIRTY) +
> > > > -		global_page_state(NR_WRITEBACK) +
> > > > -		global_page_state(NR_UNSTABLE_NFS) +
> > > > -		global_page_state(NR_WRITEBACK_TEMP);
> > > > -
> > > > -	if (avail_dirty < dirty)
> > > > -		avail_dirty = dirty - avail_dirty;
> > > > -	else
> > > > -		avail_dirty = 0;
> > > > -
> > > > -	avail_dirty += bdi_stat(bdi, BDI_RECLAIMABLE) +
> > > > -		bdi_stat(bdi, BDI_WRITEBACK);
> > > > -
> > > > -	*pbdi_dirty = min(*pbdi_dirty, avail_dirty);
> > > > -}
> > > > -
> > > >  static inline void task_dirties_fraction(struct task_struct *tsk,
> > > >  		long *numerator, long *denominator)
> > > >  {
> > > > @@ -465,7 +439,6 @@ get_dirty_limits(unsigned long *pbackgro
> > > >  		bdi_dirty = dirty * bdi->max_ratio / 100;
> > > >
> > > >  		*pbdi_dirty = bdi_dirty;
> > > > -		clip_bdi_dirty_limit(bdi, dirty, pbdi_dirty);
> > > >  		task_dirty_limit(current, pbdi_dirty);
> > > >  	}
> > > >  }
> > > > @@ -499,45 +472,12 @@ static void balance_dirty_pages(struct a
> > > >  		};
> > > >
> > > >  		get_dirty_limits(&background_thresh, &dirty_thresh,
> > > > -				&bdi_thresh, bdi);
> > > > +				 &bdi_thresh, bdi);
> > > >
> > > >  		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
> > > > -					global_page_state(NR_UNSTABLE_NFS);
> > > > -		nr_writeback = global_page_state(NR_WRITEBACK);
> > > > -
> > > > -		bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
> > > > -		bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
> > > > -
> > > > -		if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
> > > > -			break;
> > > > -
> > > > -		/*
> > > > -		 * Throttle it only when the background writeback cannot
> > > > -		 * catch-up. This avoids (excessively) small writeouts
> > > > -		 * when the bdi limits are ramping up.
> > > > -		 */
> > > > -		if (nr_reclaimable + nr_writeback <
> > > > -				(background_thresh + dirty_thresh) / 2)
> > > > -			break;
> > > > -
> > > > -		if (!bdi->dirty_exceeded)
> > > > -			bdi->dirty_exceeded = 1;
> > > > -
> > > > -		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
> > > > -		 * Unstable writes are a feature of certain networked
> > > > -		 * filesystems (i.e. NFS) in which data may have been
> > > > -		 * written to the server's write cache, but has not yet
> > > > -		 * been flushed to permanent storage.
> > > > -		 * Only move pages to writeback if this bdi is over its
> > > > -		 * threshold otherwise wait until the disk writes catch
> > > > -		 * up.
> > > > -		 */
> > > > -		if (bdi_nr_reclaimable > bdi_thresh) {
> > > > -			generic_sync_bdi_inodes(NULL, &wbc);
> > > > -			pages_written += write_chunk - wbc.nr_to_write;
> > > > -			get_dirty_limits(&background_thresh, &dirty_thresh,
> > > > -					&bdi_thresh, bdi);
> > > > -		}
> > > > +				global_page_state(NR_UNSTABLE_NFS);
> > > > +		nr_writeback = global_page_state(NR_WRITEBACK) +
> > > > +				global_page_state(NR_WRITEBACK_TEMP);
> > > >
> > > >  		/*
> > > >  		 * In order to avoid the stacked BDI deadlock we need
> > > > @@ -557,16 +497,48 @@ static void balance_dirty_pages(struct a
> > > >  			bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
> > > >  		}
> > > >
> > > > -		if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
> > > > -			break;
> > > > -		if (pages_written >= write_chunk)
> > > > -			break;		/* We've done our duty */
> > > >
> > > > +		/* always throttle if over threshold */
> > > > +		if (nr_reclaimable + nr_writeback < dirty_thresh) {
> > >
> > > That 'if' is a big behavior change. It effectively blocks everyone
> > > and cancels Peter's proportional throttling work: the less a process
> > > dirtied, the less it should be throttled.
> > >
> > I don't think it does. The code ends up looking like:
> >
> > FOR
> >     IF less than dirty_thresh THEN
> >         check bdi limits etc
> >     ENDIF
> >
> >     throttle
> > ENDFOR
> >
> > Therefore we always throttle when over the threshold, otherwise we
> > apply the per-bdi limits to decide if we throttle.
> >
> > In the existing code clip_bdi_dirty_limit modified the bdi_thresh so
> > that it would not let a bdi dirty enough pages to go over the
> > dirty_threshold. All I've done is to bring the check of dirty_thresh
> > up into balance_dirty_pages.
> >
> > So isn't this effectively the same?
>
> Yes and no. For the bdi_thresh part it somehow makes the
> clip_bdi_dirty_limit() logic simpler and more obvious, so I tend to
> agree with you and Peter on doing something like this:
>
> 	if (nr_reclaimable + nr_writeback < dirty_thresh) {
> 		/* compute bdi_* */
> 		if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
> 			break;
> 	}
>
> For the other two 'if's..
>
> > > I'd propose to remove the above 'if' and liberate the following three 'if's.
> > >
> > > > +
> > > > +			if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
> > > > +				break;
> > > > +
> > > > +			/*
> > > > +			 * Throttle it only when the background writeback cannot
> > > > +			 * catch-up. This avoids (excessively) small writeouts
> > > > +			 * when the bdi limits are ramping up.
> > > > +			 */
> > > > +			if (nr_reclaimable + nr_writeback <
> > > > +					(background_thresh + dirty_thresh) / 2)
> > > > +				break;
>
> That 'if' can be trivially moved out.

OK,
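so, pulling your suggestions together, the top of the loop would end
up looking something like this? (Untested sketch only, just to check
we mean the same thing; the bdi counter sampling and the stacked-BDI
deadlock avoidance are elided:)

	for (;;) {
		get_dirty_limits(&background_thresh, &dirty_thresh,
				 &bdi_thresh, bdi);

		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
				 global_page_state(NR_UNSTABLE_NFS);
		nr_writeback = global_page_state(NR_WRITEBACK) +
			       global_page_state(NR_WRITEBACK_TEMP);

		/* under the global limit, so the per-bdi check decides */
		if (nr_reclaimable + nr_writeback < dirty_thresh) {
			/* sample bdi_nr_reclaimable & bdi_nr_writeback */
			if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
				break;
		}

		/*
		 * Throttle only when background writeback cannot catch
		 * up, now moved out of the dirty_thresh branch.
		 */
		if (nr_reclaimable + nr_writeback <
				(background_thresh + dirty_thresh) / 2)
			break;

		/* ... writeback & throttling, as below ... */
	}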
> > > > +
> > > > +			/* done enough? */
> > > > +			if (pages_written >= write_chunk)
> > > > +				break;
>
> That 'if' must be moved out, otherwise it can block a light writer
> forever, as long as another heavy dirtier keeps the dirty numbers
> high.

Yes, I see. But I was worried about a failing device that gets stuck.
Doesn't this let the application keep dirtying pages forever if the
pages never get written to the device?

Maybe something like this?

	if (nr_writeback < background_thresh &&
	    pages_written >= write_chunk)
		break;

or bdi_nr_writeback < bdi_thresh/2 ?

> > > > +		}
> > > > +		if (!bdi->dirty_exceeded)
> > > > +			bdi->dirty_exceeded = 1;
> > > >
> > > > +		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
> > > > +		 * Unstable writes are a feature of certain networked
> > > > +		 * filesystems (i.e. NFS) in which data may have been
> > > > +		 * written to the server's write cache, but has not yet
> > > > +		 * been flushed to permanent storage.
> > > > +		 * Only move pages to writeback if this bdi is over its
> > > > +		 * threshold otherwise wait until the disk writes catch
> > > > +		 * up.
> > > > +		 */
> > > > +		if (bdi_nr_reclaimable > bdi_thresh) {
>
> I'd much prefer its original form
>
> 	if (bdi_nr_reclaimable) {
>
> Let's push dirty pages to disk ASAP :)

That change comes from my previous patch; it's there to stop this code
over-reacting and pushing all the available dirty pages to the
writeback list.

> > > > +			writeback_inodes(&wbc);
> > > > +			pages_written += write_chunk - wbc.nr_to_write;
> > >
> > > > +			if (wbc.nr_to_write == 0)
> > > > +				continue;
> > >
> > > What's the purpose of the above 2 lines?
> >
> > This is to try to replicate the existing code as closely as possible.
> >
> > If writeback_inodes wrote write_chunk pages in one pass then skip to
> > the top of the loop to recheck the limits and decide if we can let
> > the application continue. Otherwise it's not making enough forward
> > progress due to congestion, so do the congestion_wait & loop.
>
> It makes sense. We have wbc.encountered_congestion for that purpose.
> However it may not be able to write enough pages for other reasons,
> like lock contention. So I'd suggest testing (wbc.nr_to_write <= 0).
>
> Thanks,
> Fengguang

I didn't test the congestion flag directly because we don't care about
it if writeback_inodes did enough. If write_chunk pages get moved to
writeback then we don't need to do the congestion_wait.

Can writeback_inodes do more work than it was asked to do? But OK, I
can make that change if you think it worthwhile.
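For reference, here's roughly how I picture the tail of the loop ending
up with that test (sketch only, untested):

	if (bdi_nr_reclaimable > bdi_thresh) {
		writeback_inodes(&wbc);
		pages_written += write_chunk - wbc.nr_to_write;
		/* wrote a full chunk: go round & recheck the limits */
		if (wbc.nr_to_write <= 0)
			continue;
	}
	/* not enough forward progress, so back off before retrying */
	congestion_wait(WRITE, HZ/10);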
regards
Richard