On Thu, Sep 8, 2011 at 9:17 PM, Glauber Costa <glommer@xxxxxxxxxxxxx> wrote: > On 09/08/2011 06:53 PM, Greg Thelen wrote: >> >> If a container must fit within 100M, then a single limit solution >> would set the limit to 100M and never change it. In a split limit >> solution a >> user daemon (e.g. uswapd) would need to monitor the usage and the amount >> of >> active memory vs inactive user memory and unreferenced kernel memory to >> determine where to apply pressure. > > Or it can just define some parameters and let the kernel do the > rest. Like for instance, a maximum proportion allowed, a maximum > proportion desired, etc. Agreed. >> With some more knobs such a uswapd could >> attempt to keep ahead of demand. But eventually direct reclaim would >> be needed to satisfy rapid growth spikes. Example: If the 100M container >> starts with limits of 20M kmem and 80M user memory but later its kernel >> memory needs grow to 70M. With separate user and kernel memory >> limits the kernel memory allocation could fail despite there being >> reclaimable user pages available. > > No no, this is a ratio, not a *limit*. A limit is something you > should not be allowed to go over. A good limit of kernel memory for > a 100 Mb container could be something like... 100 mb. But there is > more to that. > > Risking being a bit polemic here, I think that when we do > containers, we have to view the kernel a little bit like a shared resource > that is not accounted to anybody. It is easy to account things like tcp > buffer, but who do you account page tables for shared pages to? Or pinned > dentries for shared filesystems ? This is a tough challenge. If you don't account page tables, then fork bombs can grab lots of kernel memory. Containing such pigs is good. We are working on some patches that charge page tables to the memcg they are allocated from. This charge is checked against the memcg's limit and thus the page table allocation may fail due if over limit. I suggest charging dentries to the cgroup that allocated the dentry. If containers are used for isolation, then it seems that containers should typically not share files. That would break isolation. Of course there are shared files (/bin/bash, etc), but at least those are read-only. > Being the shared table under everybody, the kernel is more or less > like buffers inside a physical hdd, or cache lines. You know it is > there, you know it has a size, but provided you have some sane > protections, you don't really care - because in most cases you can't > - who is using it. I agree that it makes sense to not bother charging fixed sized resources (struct page, etc.) to containers; they are part of the platform. But the set of resources that a process can allocate are ideally charged to a cgroup to limit container memory usage (i.e. prevent DoS attacks, isolate performance). >> The job should have a way to >> transition to memory limits to 70M+ kernel and 30M- of user. > > Yes, and I don't see how what I propose prevents that. I don't think your current proposal prevents that. I was thinking about the future of the proposed kmem cgroup. I want to make sure the kernel has a way to apply reclaim pressure bidirectionally between cgroup kernel memory and cgroup user memory. >> I suppose a GFP_WAIT slab kernel page allocation could wakeup user space >> to >> perform user-assisted direct reclaim. User space would then lower the >> user >> limit thereby causing the kernel to direct reclaim user pages, then >> the user daemon would raise the kernel limit allowing the slab allocation >> to >> succeed. My hunch is that this would be prone to deadlocks (what prevents >> uswapd from needing more even more kmem?) I'll defer to more >> experienced minds to know if user assisted direct memory reclaim has >> other pitfalls. It scares me. > > Good that it scares you, it should. OTOH, userspace being > able to set parameters to it, has nothing scary at all. A daemon in > userspace can detect that you need more kernel space memory , and > then - according to a policy you abide to - write to a file allowing > it more, or maybe not - according to that same policy. It is very > far away from "userspace driven reclaim". Agreed. I have no problem with user space policy being used to updating kernel control files that alter reclaim. >> If kmem expands to include reclaimable kernel memory (e.g. dentry) then I >> presume the kernel would have no way to exchange unused user pages for >> dentry >> pages even if the user memory in the container is well below its limit. >> This is >> motivation for the above user assisted direct reclaim. > > Dentry is not always reclaimable. If it is pinned, it is non > reclaimable. Speaking of it, Would you take a look at > https://lkml.org/lkml/2011/8/14/110 ? > > I am targetting dentry as well. But since it is hard to assign a > dentry to a process all the time, going through a different path. I > however, haven't entirely given up of doing it cgroups based, so any > ideas are welcome =) My hope is that dentry consumption can be effectively limited by limiting the memory needed to allocate the dentries. IOW: with a memcg aware slab allocator which, when possible, charges the calling process' cgroup. > Well, I am giving this an extra thought... Having separate knobs > adds flexibility, but - as usual - also complexity. For the goals I > have in mind, "kernel memory" would work just as fine. > > If you look carefully at the other patches in the series besides > this one, you'll see that it is just a matter of billing from kernel > memory instead of tcp-memory, and then all the rest is the same. > > Do you think that a single kernel-memory knob would be better for > your needs? I am willing to give it a try. Regarding the tcp buffers we're discussing, how does an application consume lots of buffer memory? IOW, what would a malicious app do to cause a grows of tcp buffer usage? Or is this the kind of resource that is difficult to directly exploit? Also, how does the pressure get applied. I see you have a per-protocol, per-cgroup pressure pressure setting. Once set, how does this pressure cause memory to be freed? It looks like the pressure is set when allocating memory and later when packets are freed the associated memory _might_ be freed if pressure was previously detected. Did I get this right? For me the primary concern is that both user and kernel memory are eventually charged to the same counter with an associated limit. Per-container memory pressure is based on this composite limit. So I don't have a strong opinion as to how many kernel memory counters there are so long as they also feed into container memory usage counter. >> I agree that kernel memory is somewhat different. In some (I argue most) >> situations containers want the ability to exchange job kmem and job umem. >> Either split or combined accounting protects the system and isolates other >> containers from kmem allocations of a bad job. To me it seems natural to >> indicate that job X gets Y MB of memory. I have more trouble dividing the >> Y MB of memory into dedicated slices for different types of memory. > > I understand. And I don't think anyone doing containers > should be mandated to define a division. But a limit... SGTM, so long as it is an optional kernel memory limit (see below). >> I think >> memcg aware slab accounting does a good job of limiting a job's >> memory allocations. >> Would such slab accounting meet your needs? > > Well, the slab alone, no. There are other objects - like tcp buffers > - that aren't covered by the slab. Others are usually shared among > many cgroups, and others don't really belong to anybody in > particular.igh Sorry, I am dumb wrt. networking. I thought that these tcp buffers were allocated using slab with __alloc_skb(). Are they done in container process context or arbitrary context? > How do you think then, about turning this into 2 files inside memcg: > > - kernel_memory_hard_limit. > - kernel_memory_soft_limit. > > tcp memory would be the one defined in /proc, except if it is > greater than any of the limits. Instead of testing for memory > allocation against kmem.tcp_allocated_memory, we'd test it against > memcg.kmem_pinned_memory. memcg control files currently include: memory.limit_in_bytes memory.soft_limit_in_bytes memory.usage_in_bytes If you want a kernel memory limit, then we can introduce two new memcg APIs: memory.kernel_limit_in_bytes memory.kernel_soft_limit_in_bytes memory.kernel_usage_in_bytes Any memory charged to memory.kernel_limit_in_bytes would also be charged to memory.limit_in_bytes. IOW, memory.limit_in_bytes would include both user pages (as it does today) and some kernel pages. With these new files there are two cgroup memory limits: total and kernel. The total limit is a combination of user and some kinds of kernel memory. Kernel page reclaim would occur if memory.kernel_usage_in_bytes exceeds memory.kernel_limit_in_bytes. There would be no need to reclaim user pages in this situation. If kernel usage exceeds kernel_soft_limit_in_bytes, then the protocol pressure flags would be set. Memcg-wide page reclaim would occur if memory.usage_in_bytes exceeds memory.limit_in_bytes. This reclaim would consider either container kernel memory or user pages associated with the container. This would be a behavioral memcg change which would need to be carefully considered. Today the limit_in_bytes is just a user byte count that does not include kernel memory. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href