Hi Jan,

On Thu, Jan 13, 2011 at 05:43:03AM +0800, Jan Kara wrote:
> Hi Fengguang,
>
> On Mon 13-12-10 22:46:47, Wu Fengguang wrote:
> > I noticed that my NFSROOT test system responds slowly when there is
> > heavy dd to a local disk. Traces show that the NFSROOT's bdi limit
> > is near 0 and many tasks in the system are repeatedly stuck in
> > balance_dirty_pages().
> >
> > There are two generic problems:
> >
> > - light dirtiers at one device (more often than not the rootfs) get
> >   heavily impacted by heavy dirtiers on another independent device
> >
> > - the lightly dirtied device does heavy throttling because its bdi
> >   limit is 0, and the heavy throttling may in turn hold its bdi
> >   limit at 0, as it cannot dirty fast enough to grow the bdi's
> >   proportional weight
> >
> > Fix it by introducing a "low pass" gate: a small (<=32MB) amount
> > reserved out of the global dirty margin that a bdi running low can
> > safely "steal". It does not need to be big to help the bdi gain its
> > initial weight.

> I'm sorry for the late reply, but I didn't get to your patches earlier...

It's fine. Honestly speaking, the patches are still "experiments" and
will need some major refactoring. When testing a 10-disk JBOD setup, I
found that bdi_dirty_limit fluctuates too much, so I'm considering
using global_dirty_limit as the control target instead.

Attached is the JBOD test result for XFS. Other filesystems share the
same problem more or less. Here you can find some old graphs:

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/16G-10HDD-JBOD/

> ...
> > -unsigned long bdi_dirty_limit(struct backing_dev_info *bdi, unsigned long dirty)
> > + *
> > + * There is a chicken and egg problem: when bdi A (eg. /pub) is
> > + * heavily dirtied and bdi B (eg. /) is lightly dirtied and hence
> > + * has a 0 dirty limit, tasks writing to B always get heavily
> > + * throttled and bdi B's dirty limit might never be able to grow up
> > + * from 0.
> > + * So we do tricks to reserve some global margin and honour it to
> > + * the bdi's that run low.
> > + */
> > +unsigned long bdi_dirty_limit(struct backing_dev_info *bdi,
> > +			      unsigned long dirty,
> > +			      unsigned long dirty_pages)
> >  {
> >  	u64 bdi_dirty;
> >  	long numerator, denominator;
> >
> >  	/*
> > +	 * Provide a global safety margin of ~1%, or up to 32MB for a 20GB box.
> > +	 */
> > +	dirty -= min(dirty / 128, 32768UL >> (PAGE_SHIFT-10));
> > +
> > +	/*
> >  	 * Calculate this BDI's share of the dirty ratio.
> >  	 */
> >  	bdi_writeout_fraction(bdi, &numerator, &denominator);
> > @@ -459,6 +472,15 @@ unsigned long bdi_dirty_limit(struct bac
> >  	do_div(bdi_dirty, denominator);
> >
> >  	bdi_dirty += (dirty * bdi->min_ratio) / 100;
> > +
> > +	/*
> > +	 * If we can dirty N more pages globally, honour N/2 to the bdi that
> > +	 * runs low, so as to help it ramp up.
> > +	 */
> > +	if (unlikely(bdi_dirty < (dirty - dirty_pages) / 2 &&
> > +		     dirty > dirty_pages))
> > +		bdi_dirty = (dirty - dirty_pages) / 2;
> > +

> I wonder how well this works - have you tried that? Because from my naive

Yes, I've been running it in the tests. It does show some undesirable
effects in multi-disk tests. For example, it leads to a higher than
necessary bdi_dirty_limit for the slow USB key in the test case of
concurrent writes to 1 UKEY and 1 HDD. In the second attached graph you
can see that it takes a long time for the UKEY's bdi_dirty_limit to
shrink back to normal, and that avg_dirty and bdi_dirty also depart
from each other too much. I'll fix these in the next update, where
bdi_dirty_limit will no longer play as big a role as in the current
code, so this patch will also need to be reconsidered and may look much
different then.

> understanding if we have say two drives - sda, sdb. Someone is banging
> sda really hard (several processes writing to the disk as fast as they
> can), then we are really close to the dirty limit anyway and thus we
> won't give much space for sdb to ramp up its writeout fraction...
> Didn't you intend to use 'dirty' without the safety margin subtracted
> in the above condition? That would then make more sense to me (i.e.
> those 32MB are then used as the ramp-up area).
>
> If I'm right in the above, maybe you could simplify the above condition
> to:
> 	if (bdi_dirty < margin)
> 		bdi_dirty = margin;
>
> Effectively it seems rather similar to me and it's immediately obvious
> how it behaves. The global limit is enforced anyway, so the logic just
> differs in the number of dirtiers on the ramping-up bdi needed to suck
> out the margin.

Sigh... I've been hassled a lot by the possible disharmonies between
the bdi and global dirty limits.

One example is the graph below, where the bdi dirty pages constantly
exceed the bdi dirty limit. The root cause is that "(dirty +
background) / 2" may be close to, or even exceed, bdi_dirty_limit.

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/256M/ext3-2dd-1M-8p-191M-2.6.37-rc5+-2010-12-09-13-42/dirty-pages-200.png

Another problem is the btrfs JBOD case, where the global limit can be
exceeded at times. The root cause is that some bdi limits are dropping
while others are increasing. If a bdi's dirty limit drops too fast --
so fast that it falls below that bdi's dirty pages -- then even though
the sum of all bdi dirty limits stays below the global limit, the sum
of all bdi dirty pages can still exceed it.

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/16G-10HDD-JBOD/btrfs-fio-jbod-sync-128k-24p-15977M-2.6.37-rc8-dt5+-2010-12-31-10-06/global_dirty_state.png

The "enforced" global limit will jump into action here. However, that
turns out to be very undesirable behavior. In the tests I run some
tasks to collect vmstat information, and whenever the global limit is
exceeded I see disrupted samples in the vmstat graph. The problem is
that when the global limit is exceeded, it blocks _all_ dirtiers in the
system, whether a given dirtier is a light one or is writing to an
independent fast storage device.
I hope the move to the global dirty pages/limit as the main control
feedback, with bdi_dirty_limit as a secondary control feedback, will
help address the problem nicely.

Thanks,
Fengguang
Attachment: xfs-jbod-balance_dirty_pages-pages.png (PNG image)
Attachment: ukey+hdd-balance_dirty_pages-pages.png (PNG image)