On Thu 28-04-16 12:53:50, Jens Axboe wrote:
> >2) As far as I can see in patch 8/8, you have plugged the throttling above
> >   the IO scheduler. When there are e.g. multiple cgroups with different IO
> >   limits operating, this throttling can lead to strange results (like a
> >   cgroup with low limit using up all available background "slots" and thus
> >   effectively stopping background writeback for other cgroups)? So won't
> >   it make more sense to plug this below the IO scheduler? Now I understand
> >   there may be other problems with this but I think we should put more
> >   thought to that and provide some justification in changelogs.
>
> One complexity is that we have to do this early for blk-mq, since once you
> get a request, you're already sitting on the hw tag. CoDel should actually
> work fine at each hop, so hopefully this will as well.

OK, I see. But then this suggests that any IO scheduling and / or
cgroup-related throttling should happen before we get a request for blk-mq
as well? And then we can still do writeback throttling below that layer?

> But yes, fairness is something that we have to pay attention to. Right now
> the wait queue has no priority associated with it, that should probably be
> improved to be able to wakeup in a more appropriate order.
> Needs testing, but hopefully it works out since if you do run into
> starvation, then you'll go to the back of the queue for the next attempt.

Yeah, once I hunt down that regression with the old disk, I can have a look
into how writeback throttling plays together with blkio-controller.

> >>+static int __latency_exceeded(struct rq_wb *rwb, struct blk_rq_stat *stat)
> >>+{
> >>+	u64 thislat;
> >>+
> >>+	/*
> >>+	 * If our stored sync issue exceeds the window size, or it
> >>+	 * exceeds our min target AND we haven't logged any entries,
> >>+	 * flag the latency as exceeded.
> >>+	 */
> >>+	thislat = rwb_sync_issue_lat(rwb);
> >>+	if (thislat > rwb->cur_win_nsec ||
> >>+	    (thislat > rwb->min_lat_nsec && !stat[0].nr_samples)) {
> >>+		trace_wbt_lat(rwb->bdi, thislat);
> >>+		return LAT_EXCEEDED;
> >>+	}
> >
> >So I'm trying to wrap my head around this. If I read the code right,
> >rwb_sync_issue_lat() returns the time that has passed since issuing a sync
> >request that is still running. We basically randomly pick which sync
> >request we track as we always start tracking a sync request when some is
> >issued and we are not tracking any at that moment. This is to detect the
> >case when latency of sync IO is very large compared to measurement window
> >so we would not get enough samples to make it valid?
>
> Right, that's pretty close. Since wbt uses the completion latencies to make
> decisions, if an IO hasn't completed, we don't know about it. If the device
> is flooded with writes, and we then issue a read, maybe that read won't
> complete for multiple monitoring windows. During that time, we keep thinking
> everything is fine. But in reality, it's not completing because of the write
> load. So this logic attempts to track the single sync IO request case. If
> that exceeds a monitoring window of time and we saw no other sync IO in that
> window, then treat that case as if it had completed but exceeded the min
> latency. And then scale back.
>
> We'll always treat a state sample with 1 read as valuable, but for this
> case, we don't have that sample until it completes.
>
> Does that make more sense?

OK, makes sense. Thanks for the explanation.

								Honza
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR