Wow, detailed notes - thanks, I'm still looking through them.  If you
don't mind, I'll use a link to the archive of this email
(https://lists.linux-foundation.org/pipermail/containers/2009-October/021227.html)
in the final summary.

thanks,
-serge

Quoting Ying Han (yinghan@xxxxxxxxxx):
> On Tue, Oct 6, 2009 at 8:56 AM, Serge E. Hallyn <serue@xxxxxxxxxx> wrote:
> > Hi,
> >
> > the kernel summit is rapidly approaching.  One of the agenda
> > items is 'the containers end-game and how do we get there.'
> > As of now I don't yet know who will be there to represent the
> > containers community in that discussion.  I hope someone is
> > planning on that?  In the hope that there is, here is a summary
> > of the info I gathered in June, in case it is helpful.  If it
> > doesn't look like anyone will be attending ksummit representing
> > containers, then I'll send the final version of this info to the
> > ksummit mailing list so that someone can stand in.
> >
> > 1. There will be an IO controller minisummit before KS.  I
> >    trust someone (Balbir?) will be sending meeting notes to
> >    the cgroup list, so that highlights can be mentioned at KS?
> >
> > 2. There was a checkpoint/restart BOF plus talk at Plumbers.
> >    Notes on the BOF are here:
> >    https://lists.linux-foundation.org/pipermail/containers/2009-September/020915.html
> >
> > 3. There was an OOM notification talk or BOF at Plumbers.
> >    Dave or Balbir, are there any notes about that meeting?
>
> Serge:
> Here are some notes I took from Dave's OOM talk:
>
> Change the OOM killer's policy.
>
> The current goal of the OOM killer is to kill a rogue memory-hogging
> task, which will free up memory and allow the system or container to
> resume normal operation.  Under an OOM condition, the kernel scans
> the task list of the system or container and scores each task using
> a heuristic.  The task with the highest score is picked to be
> killed.  The kernel also provides the /proc/pid/oom_adj API for
> layering user policy on top of the score; it allows an admin to tune
> the "badness" on a per-task basis.
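> As a concrete illustration of that knob, here is a minimal C sketch
> of how an admin tool might set a task's oom_adj.  This is just a
> sketch of the interface described above, not something shown in the
> talk; the value range (roughly -16 to +15, with -17 meaning "never
> kill") is from the 2009-era kernels.
>
> /* Sketch: tune a task's OOM "badness" via /proc/<pid>/oom_adj. */
> #include <stdio.h>
> #include <stdlib.h>
> #include <sys/types.h>
>
> static int set_oom_adj(pid_t pid, int adj)
> {
>         char path[64];
>         FILE *f;
>
>         snprintf(path, sizeof(path), "/proc/%d/oom_adj", (int)pid);
>         f = fopen(path, "w");          /* needs appropriate privilege */
>         if (!f)
>                 return -1;
>         fprintf(f, "%d\n", adj);       /* e.g. -17 to protect sshd */
>         return fclose(f);
> }
>
> int main(int argc, char *argv[])
> {
>         if (argc != 3) {
>                 fprintf(stderr, "usage: %s <pid> <adj>\n", argv[0]);
>                 return 1;
>         }
>         if (set_oom_adj(atoi(argv[1]), atoi(argv[2])))
>                 perror("oom_adj");
>         return 0;
> }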
> Linux theory: a free page is a wasted page of RAM, and Linux will
> always fill up memory with disk caches.  When we time the run of an
> application, we normally follow the sequence "flush cache - time -
> run app - time - flush cache".  So being OOM is normal; it is not a
> bug.
>
> Linux-mm has a list describing the possible OOM conditions:
> http://linux-mm.org/OOM
>
> User perspectives:
>
> High Performance Computing: I will take as much memory as can be
> given; please tell me how much memory that is.  In these systems,
> swapping is the devil.
>
> Enterprise: applications do their own memory management.  If the
> system gets low on memory, I want the kernel to tell me, and I will
> give some of mine back.  A memory notification system drew a lot of
> attention here; a couple of proposals have been posted on linux-mm,
> but none of them seems to fulfill all the requirements.
>
> Desktop: this is what the OOM killer was designed for.  When
> OpenOffice/Firefox blows up, please just kill it quickly; I will
> reopen it in a minute.  But please don't kill sshd.
>
> Memory reclaim
>
> If there is no free memory, we scan the LRU lists and try to free
> pages.  Recent page reclaim work focuses on scalability: in 1991,
> with 4MB of DRAM, we had 1024 pages to scan; in 2009, with 4GB of
> DRAM, we have 1,048,576 pages to scan (assuming 4KB pages).  The
> growth in memory size makes the reclaim job harder and harder.
>
> Beat the LRU into shape:
> * Never run out of memory; never reclaim, and never look at the LRU.
> * Use a larger page size.  IBM uses 64K pages instead of 4K pages
>   ("IBM uses 64K pages; it is more a kernel change than a userspace
>   change if they use libc").
> * Keep troublesome pages off the LRU lists, including unreclaimable
>   pages (anon, mlock, shm, slab, dirty pages) and hugetlbfs pages,
>   which are not counted in RSS.
> * Split up the LRU lists.  This includes the NUMA implementation as
>   well as the unevictable patch from Rik (~2.6.28).
>
> What is next?
>
> Having the OOM killer always pick the "right" application to kill
> is a tough problem, and it has been a hot topic upstream, with
> several patches posted.  The notification system got a lot of
> attention during the talk; here is a summary of the currently
> posted patches:
>
> Linux killed Kenny, bastard!
>
> Evgeniy Polyakov posted this patch early this year.  It provides an
> API that lets an admin specify the OOM victim by process name.  No
> one on linux-mm liked the patch.  The argument was about the current
> mechanism of calculating the "badness" score, which is too complex
> for an admin to work out which task will be killed.  Alan Cox simply
> answered the question: "its always heuristic", and he also pointed
> out: "What you actually need is notifiers to work on /proc.  In fact
> containers are probably the right way to do it".
>
> Cgroup based OOM killer controller
>
> Nikanth Karthikesan re-posted the patch which adds cgroup support.
> The patch adds an adjustable value, "oom.victim", to each oom
> cgroup.  The OOM killer kills all the processes in a cgroup with a
> higher oom.victim value before killing a process in a cgroup with a
> lower oom.victim value.  Among tasks with the same oom.victim value,
> the usual "badness" heuristics apply.  It goes one step further by
> making use of the cgroup hierarchy for the OOM killer subsystem.
> However, the same question was raised: "What is the difference
> between oom_adj and this oom.victim to the user?"  Nikanth answered:
> "Using this oom.victim users can specify the exact order to kill
> processes."  In other words, oom_adj works as a hint to the kernel,
> while oom.victim gives a strict order.
>
> Per-cgroup OOM handler
>
> Ying Han posted the Google in-house patch to linux-mm, which defers
> OOM kill decisions to userspace.  It allows userspace to respond to
> OOM by adding nodes, dropping caches, raising the memcg limit, or
> sending a signal.  An alternative is /dev/mem_notify, which David
> Rientjes proposed on linux-mm.  The idea is similar: instead of
> waiting on oom_await, userspace can poll for the information during
> a low-memory condition and respond accordingly (a rough sketch of
> that model follows below).
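> To make the polling model concrete, here is a rough sketch of the
> userspace side of the proposed /dev/mem_notify interface.  The
> device was never merged, so the exact semantics are an assumption
> based on the posted patches: poll() on the device is taken to
> return POLLIN when the kernel signals memory pressure.
>
> /* Sketch of a lowmem daemon on the *proposed* /dev/mem_notify. */
> #include <fcntl.h>
> #include <poll.h>
> #include <stdio.h>
>
> int main(void)
> {
>         struct pollfd pfd;
>
>         pfd.fd = open("/dev/mem_notify", O_RDONLY);
>         if (pfd.fd < 0) {
>                 perror("/dev/mem_notify");
>                 return 1;
>         }
>         pfd.events = POLLIN;
>
>         for (;;) {
>                 if (poll(&pfd, 1, -1) < 0)
>                         continue;
>                 if (pfd.revents & POLLIN) {
>                         /* Low-memory event: respond from userspace,
>                          * e.g. drop caches, raise a memcg limit, or
>                          * signal a victim, as described above. */
>                         fprintf(stderr, "lowmem event\n");
>                 }
>         }
>         return 0;       /* not reached */
> }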
> Vladislav Buzov posted a patch which extends memcg by adding a
> notification system for system low-memory conditions.  The feedback
> looks more promising this time, although lots of changes still need
> to be made.  Discussion focused on the implementation of the
> notification mechanism: Balbir Singh mentioned cgroupstats, a
> genetlink-based mechanism for event delivery and request/response
> applications, and Paul Menage proposed a couple of options,
> including a new ioctl on cgroup files, a new syscall, and a new
> per-cgroup file.
>
> --Ying Han
>
> > 4. The actual title of the KS discussion is 'containers end-game'.
> >    The containers-specific info I gathered in June was mainly about
> >    additional resources which we might containerize.  I expect that
> >    will be useful in helping the KS community decide how far down
> >    the containerization path they are willing to go - i.e. whether
> >    we want to call what we have good enough and say you must use kvm
> >    for anything more, whether we want to be able to provide all the
> >    features of a full VM with containers, or something in between,
> >    say targeting specific uses (perhaps only expanding on cooperative
> >    resource management containers).  With that in mind, here are
> >    some items that were mentioned in June as candidates for more
> >    containerization work:
> >
> >    1. Cpu hard limits, memory soft limits (Balbir)
> >    2. Large pages, mlock, shared page accounting (Balbir)
> >    3. Oom notification (Balbir - was anything decided on this
> >       at Plumbers?)
> >    4. There is agreement on getting rid of the ns cgroup,
> >       provided that:
> >       a. user namespaces can provide container confinement
> >          guarantees
> >       b. a compatibility flag is created to clone the parent
> >          cgroup when creating a new cgroup (Paul and Daniel)
> >    5. Poweroff/reboot handling in containers (Daniel)
> >    6. Full user namespaces to segregate uids in different
> >       containers and confine root users in containers, i.e.
> >       with respect to file systems like cgroupfs.
> >    7. Checkpoint/restart (c/r) will want time virtualization (Daniel)
> >    8. C/r will want inode virtualization (Daniel)
> >    9. Sunrpc containerization (required to allow multiple
> >       containers separate NFS client access to the same server)
> >    10. Sysfs tagging, support for physical netifs to migrate
> >        between network namespaces, and /sys/class/net virtualization
> >
> > Again, the point of this list isn't to ask for discussion about
> > whether or how to implement each at this KS, but rather to give
> > an idea of how much work is left to do.  Though let the discussion
> > lead where it may, of course.
> >
> > I don't have it here, but maybe it would also be useful to have a
> > list ready of things we can do today with containerization - both
> > with upstream, and with under-development patchsets.
> >
> > I also hope that someone will take notes on the ksummit discussion
> > to send to the containers and cgroup lists.  I expect there will be
> > a good LWN writeup, but a more containers-focused set of notes will
> > probably be useful too.
> >
> > thanks,
> > -serge

_______________________________________________
Containers mailing list
Containers@xxxxxxxxxxxxxxxxxxxxxxxxxx
https://lists.linux-foundation.org/mailman/listinfo/containers