On Thu, 2014-01-02 at 07:46 +0100, Jan Kara wrote:
> On Tue 31-12-13 15:34:40, Chris Mason wrote:
> > On Tue, 2013-12-31 at 22:22 +0800, Tao Ma wrote:
> > > Hi Chris,
> > > On 12/31/2013 09:19 PM, Chris Mason wrote:
> > >
> > > > So I'd like to throttle the rate at which dirty pages are created,
> > > > preferably based on the rates currently calculated in the BDI of how
> > > > quickly the device is doing IO.  This way we can limit dirty creation to
> > > > a percentage of the disk capacity during the current workload
> > > > (regardless of random vs buffered).
> > > Fengguang had already done some work on this, but it seems that the
> > > community doesn't have a consensus on where this control file should go.
> > > You can look at this link: https://lkml.org/lkml/2011/4/4/205
> >
> > I had forgotten Wu's patches here, it's very close to the starting point
> > I was hoping for.
>   I specifically don't like those patches because throttling pagecache
> dirty rate is IMHO rather poor interface. What people want to do is to
> limit IO from a container. That means reads & writes, buffered & direct IO.
> So dirty rate is just one of several things which contributes to total IO
> rate. When you have both direct IO & buffered IO happening in the container
> they influence each other so dirty rate 50 MB/s may be fine when nothing
> else is going on in the container but may be far too much for the system if
> there are heavy direct IO reads happening as well.
>
> So you really need to tune the limit on the dirty rate depending on how
> fast the writeback can happen (which is what current IO-less throttling
> does), not based on some hard throughput number like
> 50 MB/s (which is what Fengguang's patches did if I remember right).
>
> What could work a tad bit better (and that seems to be something you are
> proposing) is to have a weight for each memcg and each memcg would be
> allowed to dirty at a rate proportional to its weight * writeback
> throughput. But this still has a couple of problems:
> 1) This doesn't take into account local situation in a memcg - for memcg
>    full of dirty pages you want to throttle dirtying much more than for a
>    memcg which has no dirty pages.
> 2) Flusher thread (or workqueue these days) doesn't know anything about
>    memcgs. So it can happily flush a memcg which is relatively OK for a
>    rather long time while some other memcg is full of dirty pages and
>    struggling to do any progress.
> 3) This will be somewhat unfair since the total IO allowed to happen from a
>    container will depend on whether you are doing only reads (or DIO), only
>    writes or both reads & writes.
>
> In an ideal world you could compute writeback throughput for each memcg
> (and writeback from a memcg would be accounted in a proper blkcg - we would
> need unified memcg & blkcg hierarchy for that), take into account number of
> dirty pages in each memcg, and compute dirty rate according to these two
> numbers. But whether this can work in practice heavily depends on the memcg
> size and how smooth / fair can the writeback from different memcgs be so
> that we don't have excessive stalls and throughput estimation errors...

[ Adding Tejun, Vivek and Li from another thread ]

I do agree that a basket of knobs is confusing and it doesn't really help the admin.
My first idea was a complex system where the controller in the
block layer and the BDI flushers all communicated about current usage
and cooperated on a single set of reader/writer rates.  I think it could
work, but it'll be fragile.

But there are a limited number of non-pagecache methods to do IO.  Why
not just push the accounting and throttling for O_DIRECT into a new BDI
controller idea?  Tejun was just telling me how he'd rather fix the
existing controllers than add a new one, but I think we can have a much
better admin experience by having a single entry point based on
BDIs.

-chris
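
For reference, the weight-based scheme Jan describes (each memcg may dirty
at a rate proportional to its weight * writeback throughput) can be
sketched in a few lines of Python. This is only an illustration of the
arithmetic, not kernel code: the function names, the dictionary of
weights, and in particular the linear backlog scaling in the second
function are assumptions made up for this sketch. The second function is
one possible way to address Jan's problem 1), nothing more.

```python
def weighted_dirty_limits(writeback_rate, weights):
    """Split a measured writeback rate among memcgs by weight.

    writeback_rate: estimated device writeback throughput (e.g. MB/s).
    weights: hypothetical mapping of memcg name -> positive weight.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one memcg needs a positive weight")
    return {name: writeback_rate * w / total for name, w in weights.items()}


def backlog_scaled_limit(base_limit, dirty_pages, dirty_threshold):
    """Reduce a memcg's dirty rate as its own dirty backlog grows.

    Assumption for illustration: the allowed rate falls linearly to zero
    as dirty_pages approaches dirty_threshold.
    """
    if dirty_threshold <= 0:
        raise ValueError("dirty_threshold must be positive")
    headroom = max(0.0, 1.0 - dirty_pages / dirty_threshold)
    return base_limit * headroom


if __name__ == "__main__":
    limits = weighted_dirty_limits(100.0, {"a": 3, "b": 1})
    print(limits)
    # memcg "a" is already half full of dirty pages, so it gets less.
    print(backlog_scaled_limit(limits["a"], 500, 1000))
```

Note that neither function sees reads or O_DIRECT traffic, so the
unfairness in Jan's problem 3) is untouched by a scheme like this.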