On Mon, Oct 14, 2013 at 05:09:17PM +0800, Hong Zhiguo wrote: [..] Hi Hong, Thanks for the token based throttling implementation. In general it looks good and it simplies the logic. Couple of comments/concerns below. > @@ -133,14 +135,13 @@ struct throtl_grp { > /* IOPS limits */ > unsigned int iops[2]; > > - /* Number of bytes disptached in current slice */ > - uint64_t bytes_disp[2]; > - /* Number of bio's dispatched in current slice */ > - unsigned int io_disp[2]; > + /* Token for throttling by bps */ > + uint64_t bytes_token[2]; > + /* Token for throttling by iops */ > + unsigned int io_token[2]; > > - /* When did we start a new slice */ > - unsigned long slice_start[2]; > - unsigned long slice_end[2]; > + /* Time check-point */ > + unsigned long t_c[2]; Can we rename this variable to say last_dispatch[2]. [..] > > @@ -852,41 +708,26 @@ static bool tg_with_in_bps_limit(struct throtl_grp *tg, struct bio *bio, > unsigned long *wait) > { > bool rw = bio_data_dir(bio); > - u64 bytes_allowed, extra_bytes, tmp; > - unsigned long jiffy_elapsed, jiffy_wait, jiffy_elapsed_rnd; > - > - jiffy_elapsed = jiffy_elapsed_rnd = jiffies - tg->slice_start[rw]; > - > - /* Slice has just started. Consider one slice interval */ > - if (!jiffy_elapsed) > - jiffy_elapsed_rnd = throtl_slice; > - > - jiffy_elapsed_rnd = roundup(jiffy_elapsed_rnd, throtl_slice); > + u64 extra_bytes, token; > + unsigned long jiffy_wait; > > - tmp = tg->bps[rw] * jiffy_elapsed_rnd; > - do_div(tmp, HZ); > - bytes_allowed = tmp; > + token = (u64)tg->bps[rw] * (jiffies - tg->t_c[rw]); So here we seem to be calculating how many tokens a group is eligible for. This is done based on since last time check. And last time check was updated in last dispatch from the group. What happens if no IO happens in a group for some time, say for 5 minutes, and then one big IO comes in. IIUC, we will allow a big burst of IO for the first IO (tokens worth of full 5 minutes will be given to this group). I think we should have a mechanism where we don't allow too big a burst for the first IO. (Possibly limit the burst to THROTL_BURST_IO and THROTL_BURST_BYTES until and unless a bigger IO is already queued and we are waiting for it). In previous time slice logic, I took care of it by extend time slice by 100ms and if no IO happens in the group for 100ms, time slice will expire and next IO will get at max of 100ms of worth of credit (equivalent of burst limits). Thanks Vivek -- To unsubscribe from this list: send the line "unsubscribe cgroups" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html