On Tue, Sep 08, 2009 at 12:29:36PM -0400, Chris Mason wrote: > On Tue, Sep 08, 2009 at 06:06:23PM +0200, Peter Zijlstra wrote: > > On Tue, 2009-09-08 at 13:37 +0300, Artem Bityutskiy wrote: > > > Hi, > > > > > > On 09/08/2009 12:23 PM, Jens Axboe wrote: > > > > From: Theodore Ts'o<tytso@xxxxxxx> > > > > > > > > Originally, MAX_WRITEBACK_PAGES was hard-coded to 1024 because of a > > > > concern of not holding I_SYNC for too long. (At least, that was the > > > > comment previously.) This doesn't make sense now because the only > > > > time we wait for I_SYNC is if we are calling sync or fsync, and in > > > > that case we need to write out all of the data anyway. Previously > > > > there may have been other code paths that waited on I_SYNC, but not > > > > any more. > > > > > > > > According to Christoph, the current writeback size is way too small, > > > > and XFS had a hack that bumped out nr_to_write to four times the value > > > > sent by the VM to be able to saturate medium-sized RAID arrays. This > > > > value was also problematic for ext4 as well, as it caused large files > > > > to be come interleaved on disk by in 8 megabyte chunks (we bumped up > > > > the nr_to_write by a factor of two). > > > > > > > > So, in this patch, we make the MAX_WRITEBACK_PAGES a tunable, > > > > max_writeback_mb, and set it to a default value of 128 megabytes. > > > > > > > > http://bugzilla.kernel.org/show_bug.cgi?id=13930 > > > > > > > > Signed-off-by: "Theodore Ts'o"<tytso@xxxxxxx> > > > > Signed-off-by: Jens Axboe<jens.axboe@xxxxxxxxxx> > > > > > > Would be nice to update doc files like > > > > > > Documentation/sysctl/vm.txt > > > Documentation/filesystems/proc.txt > > > > I'm still not convinced this knob is worth the patch and I'm inclined to > > flat out NAK it.. > > > > The whole point of MAX_WRITEBACK_PAGES seems to occasionally check the > > dirty stats again and not write out too much. > > The problem is that 'too much' is a very abstract thing. When a process > is stuck in balance_dirty_pages, we want them to do the minimal amount > of work (or waiting) required to get them safely back inside file_write(). It seems that balance_dirty_pages() is not coupled with MAX_WRITEBACK_PAGES. Instead it uses the much smaller (ratelimit_pages + ratelimit_pages / 2). So I feel that we could just increase MAX_WRITEBACK_PAGES. It won't lead to bumpy throttled writes. It does affect fairness of background writes, ie. small files will have to wait more time for large files. But I'm fine with MAX_WRITEBACK_PAGES=64MB, which means for desktop that a large file may only delay others for 1 second, which is small enough comparing to the 30 second dirty expire time. On the other hand, I find that the (ratelimit_pages + ratelimit_pages / 2) used for balance_dirty_pages() may fall below the real number of dirtied pages, which is not safe if some filesystem choose to dirty 2 * ratelimit_pages before calling balance_dirty_pages_ratelimited_nr(). So, how about this patch? Thanks, Fengguang --- writeback: balance_dirty_pages() shall write more than dirtied pages Some filesystem may choose to write much more than ratelimit_pages before calling balance_dirty_pages_ratelimited_nr(). So it is safer to determine number to write based on real number of dirtied pages. Signed-off-by: Wu Fengguang <fengguang.wu@xxxxxxxxx> --- mm/page-writeback.c | 13 +++++++------ 1 file changed, 7 insertions(+), 6 deletions(-) --- linux.orig/mm/page-writeback.c 2009-09-09 16:53:15.000000000 +0800 +++ linux/mm/page-writeback.c 2009-09-09 17:05:14.000000000 +0800 @@ -44,12 +44,12 @@ static long ratelimit_pages = 32; /* * When balance_dirty_pages decides that the caller needs to perform some * non-background writeback, this is how many pages it will attempt to write. - * It should be somewhat larger than RATELIMIT_PAGES to ensure that reasonably + * It should be somewhat larger than dirtied pages to ensure that reasonably * large amounts of I/O are submitted. */ -static inline long sync_writeback_pages(void) +static inline long sync_writeback_pages(unsigned long dirtied) { - return ratelimit_pages + ratelimit_pages / 2; + return dirtied + dirtied / 2; } /* The following parameters are exported via /proc/sys/vm */ @@ -481,7 +481,8 @@ get_dirty_limits(unsigned long *pbackgro * If we're over `background_thresh' then pdflush is woken to perform some * writeout. */ -static void balance_dirty_pages(struct address_space *mapping) +static void balance_dirty_pages(struct address_space *mapping, + unsigned long write_chunk) { long nr_reclaimable, bdi_nr_reclaimable; long nr_writeback, bdi_nr_writeback; @@ -489,7 +490,6 @@ static void balance_dirty_pages(struct a unsigned long dirty_thresh; unsigned long bdi_thresh; unsigned long pages_written = 0; - unsigned long write_chunk = sync_writeback_pages(); struct backing_dev_info *bdi = mapping->backing_dev_info; @@ -634,9 +634,10 @@ void balance_dirty_pages_ratelimited_nr( p = &__get_cpu_var(ratelimits); *p += nr_pages_dirtied; if (unlikely(*p >= ratelimit)) { + ratelimit = sync_writeback_pages(*p); *p = 0; preempt_enable(); - balance_dirty_pages(mapping); + balance_dirty_pages(mapping, ratelimit); return; } preempt_enable(); -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html