Re: [PATCH 5/5] writeback: IO-less balance_dirty_pages()

Vivek Goyal <vgoyal@xxxxxxxxxx> · Mon, 22 Aug 2011 13:22:30 -0400

On Sun, Aug 21, 2011 at 11:46:58AM +0800, Wu Fengguang wrote:
> On Sat, Aug 20, 2011 at 03:00:37AM +0800, Vivek Goyal wrote:
> > On Fri, Aug 19, 2011 at 10:54:06AM +0800, Wu Fengguang wrote:
> > > Hi Vivek,
> > > 
> > > > > +		base_rate = bdi->dirty_ratelimit;
> > > > > +		pos_ratio = bdi_position_ratio(bdi, dirty_thresh,
> > > > > +					       background_thresh, nr_dirty,
> > > > > +					       bdi_thresh, bdi_dirty);
> > > > > +		if (unlikely(pos_ratio == 0)) {
> > > > > +			pause = MAX_PAUSE;
> > > > > +			goto pause;
> > > > >  		}
> > > > > +		task_ratelimit = (u64)base_rate *
> > > > > +					pos_ratio >> RATELIMIT_CALC_SHIFT;
> > > > 
> > > > Hi Fengguang,
> > > > 
> > > > I am a little confused here. I see that you have already taken pos_ratio
> > > > into account in bdi_update_dirty_ratelimit(), and I am wondering why it
> > > > is taken into account again in balance_dirty_pages().
> > > > 
> > > > We calculated the pos_rate and balanced_rate and adjusted the
> > > > bdi->dirty_ratelimit accordingly in bdi_update_dirty_ratelimit().
> > > 
> > > Good question. There are some inter-dependencies in the calculation,
> > > and the dependency chain is the opposite to the one in your mind:
> > > balance_dirty_pages() used pos_ratio in the first place, so that
> > > bdi_update_dirty_ratelimit() has to use pos_ratio in the calculation
> > > of the balanced dirty rate, too.
> > > 
> > > Let's return to how the balanced dirty rate is estimated. Please pay
> > > special attention to the last paragraphs below the "......" line.
> > > 
> > > Start by throttling each dd task at rate
> > > 
> > >         task_ratelimit = task_ratelimit_0                               (1)
> > >                          (any non-zero initial value is OK)
> > > 
> > > After 200ms, we measured
> > > 
> > >         dirty_rate = # of pages dirtied by all dd's / 200ms
> > >         write_bw   = # of pages written to the disk / 200ms
> > > 
> > > For the aggressive dd dirtiers, the equality holds
> > > 
> > >         dirty_rate == N * task_rate
> > >                    == N * task_ratelimit
> > >                    == N * task_ratelimit_0                              (2)
> > > Or     
> > >         task_ratelimit_0 = dirty_rate / N                               (3)
> > > 
> > > Now we conclude that the balanced task ratelimit can be estimated by
> > > 
> > >         balanced_rate = task_ratelimit_0 * (write_bw / dirty_rate)      (4)
> > > 
> > > Because with (2) and (3), (4) yields the desired equality (1):
> > > 
> > >         balanced_rate == (dirty_rate / N) * (write_bw / dirty_rate)
> > >                       == write_bw / N
> > 
> > Hi Fengguang,
> > 
> > Following is my understanding. Please correct me where I got it wrong.
> > 
> > Ok, I think I follow up to this point. I think what you are saying is
> > that the following is our goal in a stable system.
> > 
> > 	task_ratelimit = write_bw/N				(6)
> > 
> > So we measure the write_bw of a bdi over a period of time and use that
> > as a feedback loop to modify bdi->dirty_ratelimit, which in turn modifies
> > task_ratelimit, and hence we achieve the balance. So we will start with
> > some arbitrary task limit, say task_ratelimit_0, and modify that limit
> > over a period of time based on our feedback loop to achieve a balanced
> > system. And the following seems to be the formula.
> > 					    write_bw
> > 	task_ratelimit = task_ratelimit_0 * ------- 		(7)
> > 					    dirty_rate
> > 
> > Now I also understand that by using (2) and (3), you proved
> > how (7) will lead to (6), and that is our desired goal.
> 
> That's right.
> 
> > > 
> > > .............................................................................
> > > 
> > > Now let's revisit (1). Since balance_dirty_pages() chooses to execute
> > > the ratelimit
> > > 
> > >         task_ratelimit = task_ratelimit_0
> > >                        = dirty_ratelimit * pos_ratio                    (5)
> > > 
> > 
> > So balance_dirty_pages() chose to take pos_ratio into account as well,
> > because for various reasons taking only bandwidth variation into account
> > as feedback was not sufficient. So we also took pos_ratio into account,
> > which in turn depends on the global dirty pages and the per-bdi
> > dirty pages/rate.
> 
> That's right so far. balance_dirty_pages() needs to do dirty position
> control, so it uses formula (5).
> 
> > So we refined the formula for calculating a task's effective rate
> > over a period of time to the following.
> > 					    write_bw
> > 	task_ratelimit = task_ratelimit_0 * ------- * pos_ratio		(9)
> > 					    dirty_rate
> > 
> 
> That's not true. It should still be formula (7) when
> balance_dirty_pages() considers pos_ratio.

Why is it not true? When I do the math, it looks right. Let me summarize
my understanding again.

- In a steady-state, stable system we want dirty_bw == write_bw, in other words:

  dirty_bw/write_bw = 1  		(1)

  If we can achieve the above, it means we are throttling tasks at
  just the right rate.

  Or

  dirty_bw == write_bw
  N * task_ratelimit == write_bw
  task_ratelimit = write_bw/N          (2)

  So as long as we can come up with a system where balance_dirty_pages()
  calculates task_ratelimit to be write_bw/N, we should be fine.

- But this does not take care of imbalances. If the system goes out of
  balance before the feedback loop kicks in and the dirty rate shoots up,
  then the cache size will grow and the number of dirty pages will shoot
  up. Hence we brought in the notion of position ratio, where we also vary
  a task's dirty ratelimit based on the number of dirty pages. So our
  effective formula became:

  task_ratelimit = write_bw/N * pos_ratio     (3)

  So as long as we meet (3), we should reach a stable state.

-  But here N is unknown in advance, so balance_dirty_pages() cannot make
   use of this formula directly. But write_bw and dirty_bw from the previous
   200ms are known. So the following can replace (3).

				       write_bw
   task_ratelimit = task_ratelimit_0 * --------- * pos_ratio      (4)
					dirty_bw	

   dirty_bw = task_ratelimit_0 * N                (5)

   Substituting (5) into (4) gives

   task_ratelimit = write_bw/N * pos_ratio      (6)

   (6) is the same as (3) and has been derived from (4), which means that
   at any given point in time (4) can be used by balance_dirty_pages() to
   calculate a task's throttling rate.

- Now going back to (4). Because we have a feedback loop where we
  continuously update the previous number based on feedback, we can track
  the previous value in bdi->dirty_ratelimit.

				       write_bw
   task_ratelimit = task_ratelimit_0 * --------- * pos_ratio 
					dirty_bw	

   Or

   task_ratelimit = bdi->dirty_ratelimit * pos_ratio         (7)

   where
					    write_bw	
  bdi->dirty_ratelimit = task_ratelimit_0 * ---------
					    dirty_bw

  Because task_ratelimit_0 is just the initial value and we will keep
  coming up with a new value every 200ms, we should be able to write
  the above as follows.

						      write_bw
  bdi->dirty_ratelimit_n = bdi->dirty_ratelimit_n-1 * --------  (8)
						      dirty_bw

  Effectively we start with an initial value of task_ratelimit_0 and
  then keep on updating it based on rate change feedback every 200ms.

  To summarize,

  We need to achieve (3) for a balanced system. Because we don't know the
  value of N in advance, we can use (4) to achieve the effect of (3). So we
  start with a default value of task_ratelimit_0 and update it every
  200ms based on how the write and dirty rates on the device are changing
  (8). We further refine that rate by pos_ratio so that any variation in
  the number of dirty pages due to temporary imbalances in the system is
  accounted for (7).
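
To sanity-check the chain (4) -> (6) -> (8) numerically, here is a minimal
user-space sketch in C. It is not kernel code: the function names, the use
of doubles, and the assumption that pos_ratio == 1 inside the feedback loop
are mine, made only for illustration. Each of the N dirtiers is assumed to
dirty pages at exactly the current limit, as in (5).

```c
#include <assert.h>

/*
 * Formula (4): one-step estimate of a task's throttle rate from the
 * previous limit and the bandwidths measured over the last period.
 * Hypothetical helper, not a kernel function.
 */
double task_ratelimit_step(double task_ratelimit_0, double write_bw,
                           double dirty_bw, double pos_ratio)
{
	assert(dirty_bw > 0.0);	/* avoid dividing by zero */
	return task_ratelimit_0 * (write_bw / dirty_bw) * pos_ratio;
}

/*
 * Formula (8) iterated over several 200ms periods.
 * Assumption: pos_ratio == 1, so every one of the n_tasks dirtiers runs
 * at exactly the current limit and dirty_bw follows formula (5).
 */
double simulate_dirty_ratelimit(double ratelimit_0, double write_bw,
                                int n_tasks, int periods)
{
	double ratelimit = ratelimit_0;
	int i;

	assert(ratelimit_0 > 0.0 && n_tasks > 0);

	for (i = 0; i < periods; i++) {
		/* (5): aggregate dirty rate of n_tasks aggressive dirtiers */
		double dirty_bw = ratelimit * n_tasks;

		/* (8): scale the previous limit by write_bw / dirty_bw */
		ratelimit = ratelimit * (write_bw / dirty_bw);
	}
	return ratelimit;
}
```

For example, with task_ratelimit_0 = 10, write_bw = 100 and N = 4 (arbitrary
units), simulate_dirty_ratelimit(10.0, 100.0, 4, 5) settles at
write_bw/N = 25, which matches (2). The sketch leaves out the case where the
measured dirty_bw already contains the effect of pos_ratio.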

I see that you also use (7). I think the only point of contention is how
(8) is perceived. So can you please explain why you think that the
above calculation, or (9), is wrong?

I can kind of understand that you have made various adjustments to keep
task_ratelimit and bdi->dirty_ratelimit relatively stable. It is just that
I am not able to understand your calculations for updating
bdi->dirty_ratelimit.

Thanks
Vivek