On Wed, Jul 29, 2015 at 05:47:18PM +0200, Michal Hocko wrote:
> On Wed 29-07-15 18:28:17, Vladimir Davydov wrote:
> > On Wed, Jul 29, 2015 at 04:26:19PM +0200, Michal Hocko wrote:
> > > On Wed 29-07-15 16:59:07, Vladimir Davydov wrote:
> > > > On Wed, Jul 29, 2015 at 02:36:30PM +0200, Michal Hocko wrote:
> > > > > On Sun 19-07-15 15:31:09, Vladimir Davydov wrote:
> > > > > [...]
> > > > > > ---- USER API ----
> > > > > >
> > > > > > The user API consists of two new proc files:
> > > > >
> > > > > I was thinking about this for a while. I dislike the interface.
> > > > > It is quite awkward to use - e.g. you have to read the full
> > > > > memory to check the idleness of a single memcg. This might turn
> > > > > out to be a problem, especially on large machines.
> > > >
> > > > Yes, with this API estimating the wss of a single memory cgroup
> > > > will cost almost as much as doing it for the whole system.
> > > >
> > > > Come to think of it, does anyone really need to estimate the
> > > > idleness of one particular cgroup?
> > >
> > > It is certainly interesting for setting the low limit.
> >
> > Yes, but IMO there is no point in setting the low limit for one
> > particular cgroup w/o considering what's going on with the rest of
> > the system.
>
> If you use the low limit for isolating an important load then you do
> not have to care about the others that much. All you care about is
> setting a reasonable protection level and letting the others compete
> for the rest.

That's a use case, you're right. Well, it's a natural limitation of
this API - you just have to perform a full PFN scan then. You can avoid
costly rmap walks for the cgroups you are not interested in by
filtering them out using /proc/kpagecgroup though (see the sketch
below).

> > [...]
> > > > > I would assume that most users are interested only in a single
> > > > > number which tells the idleness of the system/memcg.
> > > >
> > > > Yes, that's what I need it for - estimating containers' wss for
> > > > setting their limits accordingly.
> > >
> > > So why don't we export single per-memcg and global knobs then?
> > > This would have a few advantages. First of all, it would be much
> > > easier to use, you wouldn't have to export memcg ids, and finally
> > > the implementation could be changed without any user visible
> > > changes (e.g. lru vs. pfn walks), potential caching and who knows
> > > what. In other words, Michel had a single number interface AFAIR,
> > > what was the primary reason to move away from that API?
> >
> > Because there is too much to be taken care of in the kernel with
> > such an approach and chances are high that it won't satisfy
> > everyone. What should the scan period be equal to?
>
> No, just gather the data on the read request and let userspace decide
> when/how often etc. If we are clever enough we can cache the numbers
> and avoid the walk. Write to the file and do the mark_idle stuff.

Still, scan rate limiting would be an issue IMO.

> > Knob. How many kthreads do we want? Knob. I want to keep history
> > for the last N intervals (this was a part of Michel's
> > implementation), what should N be equal to? Knob.
>
> This all relates to the kernel thread implementation, which I wasn't
> suggesting. I was referring to Michel's work, which might have
> implied that. I was merely referring to a single-number output. Sorry
> about the confusion.

Still, what about idle stats history? I mean having info about how many
pages were idle for N scans. It might be useful for more robust/accurate
wss estimation.
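FWIW, such a history can be kept entirely in userspace on top of the
proposed files. Here is a rough sketch of what I mean - untested, and it
assumes the semantics from this series: /proc/kpageidle is a bitmap with
one bit per PFN packed into 8-byte words, where a set bit means the page
is idle and writing a set bit marks the page idle, and /proc/kpagecgroup
yields one u64 per PFN holding the inode number of the memcg the page is
charged to. It also shows the cgroup filtering mentioned above:

/* idle_age.c - a rough sketch, not a tested tool */
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

#define BITS_PER_WORD	64

/* Mark every page idle; the kernel clears the bit when a page gets
 * referenced.  For brevity, assumes nr_pages is a multiple of 64. */
static void mark_all_idle(int idle_fd, uint64_t nr_pages)
{
	uint64_t word = ~0ULL, i;

	for (i = 0; i < nr_pages / BITS_PER_WORD; i++)
		pwrite(idle_fd, &word, 8, i * 8);
}

/* One pass: for each PFN charged to @cgrp_ino, bump idle_age[pfn] if
 * the page is still idle, reset it otherwise.  idle_age[pfn] == N then
 * means "idle for the last N scans" - the history discussed above,
 * with no state kept in the kernel.  (8-bit ages wrap at 255, which is
 * good enough for a sketch.) */
static void count_idle(int idle_fd, int cg_fd, uint64_t nr_pages,
		       uint64_t cgrp_ino, uint8_t *idle_age)
{
	uint64_t pfn, word = 0, ino;

	for (pfn = 0; pfn < nr_pages; pfn++) {
		if (pfn % BITS_PER_WORD == 0 &&
		    pread(idle_fd, &word, 8, pfn / BITS_PER_WORD * 8) != 8)
			break;
		if (pread(cg_fd, &ino, 8, pfn * 8) != 8)
			break;
		if (ino != cgrp_ino)		/* not our cgroup, skip */
			continue;
		if (word & (1ULL << (pfn % BITS_PER_WORD)))
			idle_age[pfn]++;	/* idle for one more scan */
		else
			idle_age[pfn] = 0;	/* referenced, reset */
	}
}

int main(int argc, char **argv)
{
	uint64_t nr_pages, cgrp_ino;
	uint8_t *idle_age;
	int idle_fd, cg_fd;

	if (argc < 3)
		return 1;
	nr_pages = strtoull(argv[1], NULL, 0);	/* how many PFNs to scan */
	cgrp_ino = strtoull(argv[2], NULL, 0);	/* memcg inode number */
	idle_age = calloc(nr_pages, 1);
	idle_fd = open("/proc/kpageidle", O_RDWR);
	cg_fd = open("/proc/kpagecgroup", O_RDONLY);
	if (!idle_age || idle_fd < 0 || cg_fd < 0)
		return 1;

	for (;;) {
		mark_all_idle(idle_fd, nr_pages);
		sleep(60);	/* the scan period is up to the user */
		count_idle(idle_fd, cg_fd, nr_pages, cgrp_ino, idle_age);
	}
}

With something like this, the scan period, the length of the history,
and the set of cgroups to look at all become plain userspace decisions
instead of kernel knobs.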
> > I want to be able to choose between an instant scan and a scan
> > distributed in time. Knob. I want to see stats for
> > anon/locked/file/dirty memory separately,
>
> Why is this useful for setting memcg limits or for wss estimation? I
> can imagine that a further breakdown of the numbers might be
> interesting from the debugging POV, but I fail to see what kind of
> decisions userspace would make based on them.

A couple of examples that pop up in my mind: it's difficult to make wss
estimation perfect. By mlocking pages, a workload might give the system
a hint that it will be really unhappy if they are evicted. One might
want to consider anon and/or dirty pages as not idle in order to
protect them and hence avoid expensive swapout/pageout (see the sketch
in the P.S. below).

> > [...]
> > > Yes, this is really tricky with the current LRU implementation. I
> > > was playing with some ideas (do some checkpoints on the way), but
> > > none of them really worked out on busy systems. But the LRU
> > > implementation might change in the future.
> >
> > It might. Then we could come up with a new /proc or /sys file which
> > would do the same as /proc/kpageidle, but on a per-LRU^w
> > whatever-it-is basis, and give people a choice which one to use.
>
> This just leads to the proc file count explosion we are seeing
> already... Proc ended up as a dumping ground for different things
> which didn't fit elsewhere, and I am not very happy about it, to be
> honest.

Moving the API to memcg is not a good idea either IMO, because the
feature can actually be useful with memcg disabled, e.g. it might help
estimate whether the system is over- or underloaded. /proc/kpageidle
should probably live somewhere in /sys/kernel/mm, but I added it where
similar files are located (kpagecount, kpageflags) to keep things
consistent.

Thanks,
Vladimir
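P.S. Regarding the mlocked/anon/dirty point above: that policy could
also stay in userspace, on top of the already existing
/proc/kpageflags. A hypothetical helper that could be dropped into the
sketch earlier in this mail (bit numbers are from
Documentation/vm/pagemap.txt; untested):

#define KPF_DIRTY	4	/* bits per Documentation/vm/pagemap.txt */
#define KPF_ANON	12
#define KPF_MLOCKED	33

/* Decide whether an idle page should be counted as cheaply
 * reclaimable.  @flags_fd is an open fd for /proc/kpageflags
 * (one u64 of flags per PFN). */
static int page_is_discountable(int flags_fd, uint64_t pfn)
{
	uint64_t flags;

	if (pread(flags_fd, &flags, 8, pfn * 8) != 8)
		return 0;
	if (flags & (1ULL << KPF_MLOCKED))
		return 0;	/* the workload asked to keep it */
	if (flags & ((1ULL << KPF_ANON) | (1ULL << KPF_DIRTY)))
		return 0;	/* eviction would mean swapout/pageout */
	return 1;
}

I.e. the per-type stats could be derived by the same scanner rather
than baked into the kernel interface.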