On Tue 22-11-11 10:20:53, Wu Fengguang wrote:
> On Tue, Nov 22, 2011 at 04:37:42AM +0800, Jan Kara wrote:
> > On Thu 17-11-11 19:59:14, Wu Fengguang wrote:
> > > The sleep based balance_dirty_pages() can pause at most MAX_PAUSE=200ms
> > > on every 4KB page, which means it cannot throttle a task below
> > > 4KB/200ms=20KB/s. So when there are more than 512 dd tasks writing to a
> > > 10MB/s USB stick, its bdi dirty pages can grow out of control.
> > >
> > > Even if we could increase MAX_PAUSE, the minimal task_ratelimit = 1
> > > still means a limit of 4KB/s.
> > >
> > > They can eventually be safeguarded by the global limit check
> > > (nr_dirty < dirty_thresh). However, if someone is also writing to an
> > > HDD at the same time, they'll get poor HDD write performance.
> > >
> > > We at least want to maintain good write performance for other devices
> > > when one device is attacked by some "massive parallel" workload, or
> > > suffers from slow write bandwidth, or somehow gets stalled due to some
> > > error condition (eg. NFS server not responding).
> > >
> > > For a stalled device, we need to completely block its dirtiers, too,
> > > before its bdi dirty pages grow all the way up to the global limit and
> > > leave no space for the other functional devices.
> >   This is a fundamental question - how much do you allow the dirty cache
> > of one device to grow when other devices are relatively idle? Every
> > choice has advantages and disadvantages. If you allow a device to occupy
> > a lot of the cache, you may later find yourself short on dirtiable memory
> > when other devices become active. On the other hand, allowing more dirty
> > memory can improve the IO pattern and thus writeout speed. So whatever
> > choice we make, we should explain it somewhere in the code and stick to
> > it...
>
> The answer lies in bdi_thresh, and that will continue to be so with this
> patch :)
>
> Basically we let active devices grow their quota of dirty pages and
> stalled/inactive devices shrink theirs over time. That's the rationale
> behind Peter's "floating proportions".
  Sure, but how we use bdi_thresh has changed in the past - for example, at
which global level of dirty pages we decide to enforce bdi_thresh. Now you
again slightly change the logic for when bdi_thresh is enforced and when
not... So my suggestion was aimed more at documenting somewhere when we
enforce bdi_thresh and when we don't, together with a description of the
goals we try to achieve with this setting.

> It works well except in the case of "an inactive disk suddenly goes
> busy", where the initial quota may be too small. To mitigate this,
> bdi_position_ratio() has the below line to raise a small bdi_thresh
> when it's safe to do so, so that the disk gets a reasonably large
> initial quota for fast rampup:
>
>	bdi_thresh = max(bdi_thresh, (limit - dirty) / 8);
>
> > > So change the loop exit condition to
> > >
> > >	/*
> > >	 * Always enforce global dirty limit; also enforce bdi dirty limit
> > >	 * if the normal max_pause sleeps cannot keep things under control.
> > >	 */
> > >	if (nr_dirty < dirty_thresh &&
> > >	    (bdi_dirty < bdi_thresh || bdi->dirty_ratelimit > 1))
> > >		break;
> > >
> > > which can be further simplified to
> > >
> > >	if (task_ratelimit)
> > >		break;
> >   Hmm, but if pos_ratio == 0, task_ratelimit is uninitialized... Generally,
>
> Shy.. That's fixed by a recent commit 3a73dbbc9bb ("writeback: fix
> uninitialized task_ratelimit") in Linus' tree.
  Ah, OK, I missed that.
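As a back-of-the-envelope illustration of the MAX_PAUSE arithmetic quoted at
the top of the thread, the standalone user-space sketch below (not kernel
code; the constants simply mirror the 200ms pause, 4KB page and ~10MB/s
figures above) reproduces the 20KB/s per-task floor and the 512-task
break-even point:

	/*
	 * Model of the per-task throttling floor described in the patch
	 * changelog: sleeping at most MAX_PAUSE per dirtied page puts a
	 * lower bound on how slowly a task can be made to dirty pages.
	 */
	#include <stdio.h>

	int main(void)
	{
		const double page_kb   = 4.0;     /* one 4KB page per pause        */
		const double max_pause = 0.200;   /* MAX_PAUSE = 200ms, in seconds */
		const double bdi_kbps  = 10240.0; /* USB stick writeout, ~10MB/s   */

		/* 4KB / 200ms = 20KB/s: no task can be throttled below this. */
		double min_task_kbps = page_kb / max_pause;

		/*
		 * Once enough tasks dirty pages at that floor rate, they
		 * collectively outrun the device and its bdi dirty pages
		 * grow out of control.
		 */
		double tasks_to_overrun = bdi_kbps / min_task_kbps;

		printf("per-task throttle floor: %.0f KB/s\n", min_task_kbps);
		printf("tasks needed to outrun a %.0f KB/s device: %.0f\n",
		       bdi_kbps, tasks_to_overrun);
		return 0;
	}

With these numbers the program prints a 20KB/s floor and 512 tasks, matching
the figures in the changelog quoted above.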
> > I would find it more robust to have a test there directly on the numbers
> > of dirty pages - then it would be independent of whatever changes we make
> > in the ratelimit computations in the future.
>
> In principle, task_ratelimit has to be 0 when over the global dirty
> limit. But yeah, the explicit check sounds more robust. I can add it
> back when doing some more tricky ratelimit calculations (eg. for the
> async write IO controller).

								Honza
--
Jan Kara <jack@xxxxxxx>
SUSE Labs, CR