On Tue 31-12-13 15:34:40, Chris Mason wrote:
> On Tue, 2013-12-31 at 22:22 +0800, Tao Ma wrote:
> > Hi Chris,
> > On 12/31/2013 09:19 PM, Chris Mason wrote:
> > > So I'd like to throttle the rate at which dirty pages are created,
> > > preferably based on the rates currently calculated in the BDI of how
> > > quickly the device is doing IO. This way we can limit dirty creation to
> > > a percentage of the disk capacity during the current workload
> > > (regardless of random vs buffered).
> > Fengguang had already done some work on this, but it seems that the
> > community doesn't have a consensus on where this control file should go.
> > You can look at this link: https://lkml.org/lkml/2011/4/4/205
>
> I had forgotten Wu's patches here, it's very close to the starting point
> I was hoping for.

I specifically don't like those patches because throttling the pagecache
dirty rate is IMHO a rather poor interface. What people want to do is to
limit IO from a container. That means reads & writes, buffered & direct IO.
So the dirty rate is just one of several things contributing to the total
IO rate. When both direct IO and buffered IO are happening in the container
they influence each other, so a dirty rate of 50 MB/s may be fine when
nothing else is going on in the container but may be far too much for the
system if heavy direct IO reads are happening as well. So you really need
to tune the limit on the dirty rate depending on how fast writeback can
happen (which is what the current IO-less throttling does), not based on
some hard throughput number like 50 MB/s (which is what Fengguang's patches
did if I remember right).

What could work a tad bit better (and that seems to be something you are
proposing) is to have a weight for each memcg, and each memcg would be
allowed to dirty at a rate proportional to its weight * writeback
throughput (a minimal sketch of that computation follows below). But this
still has a couple of problems:

1) This doesn't take into account the local situation in a memcg - for a
   memcg full of dirty pages you want to throttle dirtying much more than
   for a memcg which has no dirty pages.

2) The flusher thread (or workqueue these days) doesn't know anything
   about memcgs. So it can happily flush a memcg which is relatively OK
   for a rather long time while some other memcg is full of dirty pages
   and struggling to make any progress.

3) This will be somewhat unfair since the total IO allowed to happen from
   a container will depend on whether you are doing only reads (or DIO),
   only writes, or both reads & writes.

In an ideal world you could compute the writeback throughput for each memcg
(and writeback from a memcg would be accounted in a proper blkcg - we would
need a unified memcg & blkcg hierarchy for that), take into account the
number of dirty pages in each memcg, and compute the dirty rate according
to these two numbers. But whether this can work in practice heavily depends
on the memcg size and on how smooth / fair the writeback from different
memcgs can be, so that we don't have excessive stalls and throughput
estimation errors...

								Honza
--
Jan Kara <jack@xxxxxxx>
SUSE Labs, CR
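
For illustration only, here is a minimal userspace sketch of the
"weight * writeback throughput" idea discussed above. It is not kernel
code and the names (memcg_sample, dirty_rate_limit, the weights and the
bandwidth figure) are invented for the example; it also deliberately shows
the weakness from point 1) - the per-memcg dirty page count plays no role.

    /*
     * Hypothetical sketch: derive a per-memcg dirty rate limit from an
     * administrator-assigned weight and the BDI's estimated writeback
     * bandwidth.  Each memcg gets a share of the bandwidth proportional
     * to weight / sum-of-weights.
     */
    #include <stdio.h>

    struct memcg_sample {
            const char *name;
            unsigned int weight;    /* administrator-assigned weight */
    };

    static unsigned long dirty_rate_limit(const struct memcg_sample *cg,
                                          unsigned long wb_bandwidth_kbps,
                                          unsigned long total_weight)
    {
            /* This memcg's proportional share of writeback bandwidth. */
            return wb_bandwidth_kbps * cg->weight / total_weight;
    }

    int main(void)
    {
            /* Assume the BDI currently estimates ~100 MB/s of writeback. */
            unsigned long wb_bandwidth_kbps = 100 * 1024;
            struct memcg_sample cgs[] = {
                    { "batch",       100 },
                    { "interactive", 300 },
            };
            unsigned long total_weight = 0;
            unsigned int i;

            for (i = 0; i < 2; i++)
                    total_weight += cgs[i].weight;

            for (i = 0; i < 2; i++)
                    printf("%s: allowed dirty rate %lu KB/s\n", cgs[i].name,
                           dirty_rate_limit(&cgs[i], wb_bandwidth_kbps,
                                            total_weight));
            return 0;
    }

With the numbers above the "interactive" group would be allowed to dirty at
roughly three times the rate of "batch"; since the limit is recomputed from
the estimated writeback bandwidth rather than a hard MB/s number, it scales
with what the device can actually sustain under the current workload.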