On Tue, Jul 09, 2013 at 09:42:57PM +0400, Konstantin Khlebnikov wrote:

[..]
> >So what kind of priority inversion you are facing with blkcg and how would
> >you avoid it with your implementation?
> >
> >I know that serialization can happen at filesystem level while trying
> >to commit journal. But I think same thing will happen with your
> >implementation too.
>
> Yes, metadata changes are serialized and they depends on data commits,
> thus block layer cannot delay write requests without introducing nasty priority
> inversions.

Tejun had some thoughts on how to solve this problem. I don't remember
the details though. Tejun?

> Cached read requests cannot be delayed at all.

Who wants to delay reads that are served from the cache? That sounds
like a misfeature.

> All solutions either
> breaks throttling or adds PI. So block layer is just wrong place for this.

Well, implementing throttling at the block layer lets you cache writes, so
the application does not see a delay for small writes, while still
preventing that burst from being visible on the device and impacting other
IO going to the device. Not sure how much it matters, but at least this was
one discussion point in the past. Implementing it at the device level
provides better control when it comes to avoiding interference from bursty
buffered writes.

>
> >
> >One simple way of avoiding that will be to throttle IO even earlier
> >but that means we do not take advantage of writeback cache and buffered
> >writes will slow down.
>
> If we want to control writeback speed we also must control size of dirty set.
> There are several possibilities: we either can start writeback earlier,
> or when dirty set exceeds some threshold we will start charging that dirty
> memory into throttler and slow down all tasks who generates this dirty memory.
> Because dirty memory is charged and accounted we can write it without delays.

Ok, so this is equivalent to allowing bursty IO.
Admit a burst of IO (the dirty set) and then apply throttling rules. The
dirty set can be flushed without throttling if sync requires that, but
future admission of IO will be delayed. That can avoid PI problems arising
from file system journaling.

We have discussed implementing throttling at a higher layer in the past
too. Various proof-of-concept implementations have been posted to do
throttling at a higher layer.

blk-throttle: Throttle buffered WRITEs in balance_dirty_pages()
https://lkml.org/lkml/2011/6/28/243

buffered write IO controller in balance_dirty_pages()
https://lkml.org/lkml/2012/3/28/275

Andrea Righi had posted some proof-of-concept implementations too.

None of these implementations ever made any progress. Tejun always
liked the idea of doing throttling at lower layers and then generating
back pressure on the bdi, which in turn controls the size of the dirty set.

To me, solving the issue of priority inversion in file systems is an
important one. If we can't solve that reasonably with the existing
mechanism, it does make a case for why throttling at a higher level might
be interesting.

>
> >
> >So I am curious how would you take care of these serialization issue.
> >
> >Also the throttlers you are planning to implement, what kind of throttling
> >do they provide. Is it throttling rate per cgroup or per file per cgroup
> >or rules will be per bdi per cgroup or something else.
>
> Currently I'm thinking about per-cgroup X per-tier. Each bdi will be assigned
> to some tier. It's flexible enough and solves chicken-and-egg problem:
> when disk appears it will be assigned to default tier and can be reassigned.

Ok, this is a completely orthogonal issue. It has nothing to do with
whether to apply throttling at the block layer or at a higher layer. To
solve the chicken and egg problem we need to take the help of user space
here and not rely on the kernel storing the rules and applying them when
devices show up.

Also, how would you create rules for assigning a bdi to a tier?
How would you identify a bdi uniquely in a persistent manner?

Thanks
Vivek
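
For illustration only, here is a toy Python model of the "admit a burst, then
throttle future admission" scheme discussed above. It is not taken from any of
the posted patches. The class name `DirtyThrottler`, its fields, and the units
are all hypothetical, and it assumes a single cgroup with one fixed rate.

```python
from dataclasses import dataclass


@dataclass
class DirtyThrottler:
    """Hypothetical model: throttle at admission time, never at flush time."""
    rate: float              # bytes/second a task may dirty once over threshold
    threshold: float         # dirty bytes admitted as a free burst
    dirty: float = 0.0       # bytes admitted but not yet written back
    next_free: float = 0.0   # earliest time the next write may be admitted

    def write(self, now: float, nbytes: float) -> float:
        """Admit a buffered write at time `now`; return the delay imposed."""
        delay = max(0.0, self.next_free - now)
        start = now + delay
        self.dirty += nbytes
        # Only the part of this write that lands above the threshold is charged.
        over = min(nbytes, max(0.0, self.dirty - self.threshold))
        self.next_free = start + over / self.rate
        return delay

    def flush(self, nbytes: float) -> None:
        """Write back admitted data. It was already charged, so no delay here."""
        self.dirty = max(0.0, self.dirty - nbytes)
```

In this model a journal commit that needs the dirty set on disk calls
`flush()` and is never held back, while the task that generated the dirty
memory pays for it on its next `write()`. That is the property that would
sidestep the priority inversion; whether it can be made to work in the kernel
is exactly the open question in this thread.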