Wow, detailed notes - thanks, I'm still looking through them.  If you
don't mind, I'll use a link to the archive of this email
(https://lists.linux-foundation.org/pipermail/containers/2009-October/021227.html)
in the final summary.

thanks,
-serge

Quoting Ying Han (yinghan@xxxxxxxxxx):
> On Tue, Oct 6, 2009 at 8:56 AM, Serge E. Hallyn <serue@xxxxxxxxxx> wrote:
> > Hi,
> >
> > the kernel summit is rapidly approaching.  One of the agenda
> > items is 'the containers end-game and how do we get there.'
> > As of now I don't yet know who will be there to represent the
> > containers community in that discussion.  I hope someone is
> > planning on that?  In the hope that there is, here is a summary
> > of the info I gathered in June, in case it is helpful.  If it
> > doesn't look like anyone will be attending ksummit representing
> > containers, then I'll send the final version of this info to the
> > ksummit mailing list so that someone can stand in.
> >
> > 1. There will be an IO controller minisummit before KS.  I
> >    trust someone (Balbir?) will be sending meeting notes to
> >    the cgroup list, so that highlights can be mentioned at KS?
> >
> > 2. There was a checkpoint/restart BOF plus talk at Plumbers.
> >    Notes on the BOF are here:
> >    https://lists.linux-foundation.org/pipermail/containers/2009-September/020915.html
> >
> > 3. There was an OOM notification talk or BOF at Plumbers.
> >    Dave or Balbir, are there any notes about that meeting?
>
> Serge:
> Here are some notes I took from Dave's OOM talk:
>
> Change the OOM killer's policy.
>
> The current goal of the OOM killer is to kill a rogue memory-hogging
> task, which will free up memory and allow the system or container to
> resume normal operation.  Under an OOM condition, the kernel scans
> the task list of the system or container and scores each task using
> a heuristic.  The task with the highest score is picked to be
> killed.  The kernel also provides the /proc/pid/oom_adj API for
> layering user policy on top of the score; it allows an admin to tune
> the "badness" on a per-task basis.
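> As a concrete illustration of that knob, here is a minimal C sketch
> of how an admin tool might set a task's oom_adj.  This is just a
> sketch of the interface described above, not something shown in the
> talk; the value range (roughly -16 to +15, with -17 meaning "never
> kill") is from the 2009-era kernels.
>
> /* Sketch: tune a task's OOM "badness" via /proc/<pid>/oom_adj. */
> #include <stdio.h>
> #include <stdlib.h>
> #include <sys/types.h>
>
> static int set_oom_adj(pid_t pid, int adj)
> {
>         char path[64];
>         FILE *f;
>
>         snprintf(path, sizeof(path), "/proc/%d/oom_adj", (int)pid);
>         f = fopen(path, "w");          /* needs appropriate privilege */
>         if (!f)
>                 return -1;
>         fprintf(f, "%d\n", adj);       /* e.g. -17 to protect sshd */
>         return fclose(f);
> }
>
> int main(int argc, char *argv[])
> {
>         if (argc != 3) {
>                 fprintf(stderr, "usage: %s <pid> <adj>\n", argv[0]);
>                 return 1;
>         }
>         if (set_oom_adj(atoi(argv[1]), atoi(argv[2])))
>                 perror("oom_adj");
>         return 0;
> }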
> Linux theory: a free page is a wasted page of RAM, and Linux will
> always fill up memory with disk caches.  When we time the run of an
> application, we normally follow the sequence "flush cache - time -
> run app - time - flush cache".  So being OOM is normal; it is not a
> bug.
>
> Linux-mm has a list describing the possible OOM conditions:
> http://linux-mm.org/OOM
>
> User perspectives:
>
> High Performance Computing: I will take as much memory as can be
> given; please tell me how much memory that is.  In these systems,
> swapping is the devil.
>
> Enterprise: applications do their own memory management.  If the
> system gets low on memory, I want the kernel to tell me, and I will
> give some of mine back.  A memory notification system drew a lot of
> attention here; a couple of proposals have been posted on linux-mm,
> but none of them seems to fulfill all the requirements.
>
> Desktop: this is what the OOM killer was designed for.  When
> OpenOffice/Firefox blows up, please just kill it quickly; I will
> reopen it in a minute.  But please don't kill sshd.
>
> Memory reclaim
>
> If there is no free memory, we scan the LRU lists and try to free
> pages.  Recent page reclaim work focuses on scalability: in 1991,
> with 4MB of DRAM, we had 1024 pages to scan; in 2009, with 4GB of
> DRAM, we have 1,048,576 pages to scan (assuming 4KB pages).  The
> growth in memory size makes the reclaim job harder and harder.
>
> Beat the LRU into shape:
> * Never run out of memory; never reclaim, and never look at the LRU.
> * Use a larger page size.  IBM uses 64K pages instead of 4K pages
>   ("IBM uses 64K pages; it is more a kernel change than a userspace
>   change if they use libc").
> * Keep troublesome pages off the LRU lists, including unreclaimable
>   pages (anon, mlock, shm, slab, dirty pages) and hugetlbfs pages,
>   which are not counted in RSS.
> * Split up the LRU lists.  This includes the NUMA implementation as
>   well as the unevictable patch from Rik (~2.6.28).
>
> What is next?
>
> Having the OOM killer always pick the "right" application to kill
> is a tough problem, and it has been a hot topic upstream, with
> several patches posted.  The notification system got a lot of
> attention during the talk; here is a summary of the currently
> posted patches:
>
> Linux killed Kenny, bastard!
>
> Evgeniy Polyakov posted this patch early this year.  It provides an
> API that lets an admin specify the OOM victim by process name.  No
> one on linux-mm liked the patch.  The argument was about the current
> mechanism of calculating the "badness" score, which is too complex
> for an admin to work out which task will be killed.  Alan Cox simply
> answered the question: "its always heuristic", and he also pointed
> out: "What you actually need is notifiers to work on /proc.  In fact
> containers are probably the right way to do it".
>
> Cgroup based OOM killer controller
>
> Nikanth Karthikesan re-posted the patch which adds cgroup support.
> The patch adds an adjustable value, "oom.victim", to each oom
> cgroup.  The OOM killer kills all the processes in a cgroup with a
> higher oom.victim value before killing a process in a cgroup with a
> lower oom.victim value.  Among tasks with the same oom.victim value,
> the usual "badness" heuristics apply.  It goes one step further by
> making use of the cgroup hierarchy for the OOM killer subsystem.
> However, the same question was raised: "What is the difference
> between oom_adj and this oom.victim to the user?"  Nikanth answered:
> "Using this oom.victim users can specify the exact order to kill
> processes."  In other words, oom_adj works as a hint to the kernel,
> while oom.victim gives a strict order.
>
> Per-cgroup OOM handler
>
> Ying Han posted the Google in-house patch to linux-mm, which defers
> OOM kill decisions to userspace.  It allows userspace to respond to
> OOM by adding nodes, dropping caches, raising the memcg limit, or
> sending a signal.  An alternative is /dev/mem_notify, which David
> Rientjes proposed on linux-mm.  The idea is similar: instead of
> waiting on oom_await, userspace can poll for the information during
> a low-memory condition and respond accordingly (a rough sketch of
> that model follows below).
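> To make the polling model concrete, here is a rough sketch of the
> userspace side of the proposed /dev/mem_notify interface.  The
> device was never merged, so the exact semantics are an assumption
> based on the posted patches: poll() on the device is taken to
> return POLLIN when the kernel signals memory pressure.
>
> /* Sketch of a lowmem daemon on the *proposed* /dev/mem_notify. */
> #include <fcntl.h>
> #include <poll.h>
> #include <stdio.h>
>
> int main(void)
> {
>         struct pollfd pfd;
>
>         pfd.fd = open("/dev/mem_notify", O_RDONLY);
>         if (pfd.fd < 0) {
>                 perror("/dev/mem_notify");
>                 return 1;
>         }
>         pfd.events = POLLIN;
>
>         for (;;) {
>                 if (poll(&pfd, 1, -1) < 0)
>                         continue;
>                 if (pfd.revents & POLLIN) {
>                         /* Low-memory event: respond from userspace,
>                          * e.g. drop caches, raise a memcg limit, or
>                          * signal a victim, as described above. */
>                         fprintf(stderr, "lowmem event\n");
>                 }
>         }
>         return 0;       /* not reached */
> }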
> Vladislav Buzov posted a patch which extends memcg by adding a
> notification system for system low-memory conditions.  The feedback
> looks more promising this time, although lots of changes still need
> to be made.  Discussion focused on the implementation of the
> notification mechanism: Balbir Singh mentioned cgroupstats, a
> genetlink-based mechanism for event delivery and request/response
> applications, and Paul Menage proposed a couple of options,
> including a new ioctl on cgroup files, a new syscall, and a new
> per-cgroup file.
>
> --Ying Han
>
> > 4. The actual title of the KS discussion is 'containers end-game'.
> >    The containers-specific info I gathered in June was mainly about
> >    additional resources which we might containerize.  I expect that
> >    will be useful in helping the KS community decide how far down
> >    the containerization path they are willing to go - i.e. whether
> >    we want to call what we have good enough and say you must use kvm
> >    for anything more, whether we want to be able to provide all the
> >    features of a full VM with containers, or something in between,
> >    say targeting specific uses (perhaps only expanding on cooperative
> >    resource management containers).  With that in mind, here are
> >    some items that were mentioned in June as candidates for more
> >    containerization work:
> >
> >    1. Cpu hard limits, memory soft limits (Balbir)
> >    2. Large pages, mlock, shared page accounting (Balbir)
> >    3. Oom notification (Balbir - was anything decided on this
> >       at Plumbers?)
> >    4. There is agreement on getting rid of the ns cgroup,
> >       provided that:
> >       a. user namespaces can provide container confinement
> >          guarantees
> >       b. a compatibility flag is created to clone the parent
> >          cgroup when creating a new cgroup (Paul and Daniel)
> >    5. Poweroff/reboot handling in containers (Daniel)
> >    6. Full user namespaces to segregate uids in different
> >       containers and confine root users in containers, i.e.
> >       with respect to file systems like cgroupfs.
> >    7. Checkpoint/restart (c/r) will want time virtualization (Daniel)
> >    8. C/r will want inode virtualization (Daniel)
> >    9. Sunrpc containerization (required to allow multiple
> >       containers separate NFS client access to the same server)
> >    10. Sysfs tagging, support for physical netifs to migrate
> >        between network namespaces, and /sys/class/net virtualization
> >
> > Again, the point of this list isn't to ask for discussion about
> > whether or how to implement each at this KS, but rather to give
> > an idea of how much work is left to do.  Though let the discussion
> > lead where it may, of course.
> >
> > I don't have it here, but maybe it would also be useful to have a
> > list ready of things we can do today with containerization - both
> > with upstream, and with under-development patchsets.
> >
> > I also hope that someone will take notes on the ksummit discussion
> > to send to the containers and cgroup lists.  I expect there will be
> > a good LWN writeup, but a more containers-focused set of notes will
> > probably be useful too.
> >
> > thanks,
> > -serge

_______________________________________________
Containers mailing list
Containers@xxxxxxxxxxxxxxxxxxxxxxxxxx
https://lists.linux-foundation.org/mailman/listinfo/containers