Hello, Konstantin.

Sorry about the delay.

On Thu, Jan 15, 2015 at 09:49:10PM +0300, Konstantin Khebnikov wrote:
> This is ressurection of my old RFC patch for dirty-set accounting cgroup [1]
> Now it's merged into memory cgroup and got bandwidth controller as a bonus.
>
> That shows alternative solution: less accurate but much less monstrous than
> accurate page-based dirty-set controller from Tejun Heo.

I went over the whole patchset and ISTR having reviewed this a while
ago and the conclusion is the same. This appears to be simpler on the
surface but this is a hackjob of a design, to put it nicely. You're
working around the complexity of pressure propagation from the lower
layer by building a separate pressure layer at the topmost layer. In
doing so, it's duplicating what already exists below in degenerate
forms, at the cost of fundamentally crippling the whole thing.

This, even in its current simplistic form, is already a dead end.
E.g. iops or bw aren't even the proper resources to distribute for
rotating disks; IO time is, which is what a large proportion of cfq is
trying to estimate and distribute. What if there are multiple
filesystems on a single device? Or if a cgroup accesses multiple
backing devices? How would you guarantee low latency access to a high
priority cgroup while a huge inode from a low pri cgroup is being
written out, when none of the lower layers have any idea what they're
doing?

Sure, these issues can be dealt with partially with various
workarounds and additions and I'm sure we'll be doing that if we go
down this path, but the only thing that'll lead to is duplicating more
of what's already in the lower layers, with an ever-growing list of
behavioral and interface oddities which are inherent to the design.

Even in the absence of alternatives, I'd be strongly against this
direction. I think this sort of ad-hoc "let's solve this one
immediate issue in the easiest way possible" approach is often worse
than not doing anything.
In the longer term, things like this paint us into a corner which we
can't easily get out of. memcg happens to be an area where that sort
of thing took place quite a bit in the past and people have been
desperately trying to right the course, so, no, I don't think this is
happening.

I agree that propagating backpressure from the lower layer involves
more complexity, but it is a full solution which is straightforward
both conceptually and design-wise and doesn't need to get constantly
papered over. This is the right thing to do.

It can be argued that the amount of complexity can be reduced by
tracking dirty pages per-inode but, if we're gonna do that, we should
be converting memcg itself to be per address space too. The arguments
would be exactly the same for memcg, and memcg and writeback must be
on the same foundation.

Thanks.

-- 
tejun