On Tue, Oct 6, 2009 at 11:21 AM, Serge E. Hallyn <serue@xxxxxxxxxx> wrote:
> Wow, detailed notes - thanks, I'm still looking through them. If you don't mind, I'll use a link to the archive of this email
> ( https://lists.linux-foundation.org/pipermail/containers/2009-October/021227.html )
> in the final summary.
>

Sure. The archive works with me. :)

--Ying

> thanks,
> -serge
>
> Quoting Ying Han (yinghan@xxxxxxxxxx):
> > On Tue, Oct 6, 2009 at 8:56 AM, Serge E. Hallyn <serue@xxxxxxxxxx> wrote:
> > > Hi,
> > >
> > > the kernel summit is rapidly approaching. One of the agenda items is 'the containers end-game and how do we get there.' As of now I don't yet know who will be there to represent the containers community in that discussion. I hope there is someone planning on that? In the hopes that there is, here is a summary of the info I gathered in June, in case that is helpful. If it doesn't look like anyone will be attending ksummit representing containers, then I'll send the final version of this info to the ksummit mailing list so that someone can stand in.
> > >
> > > 1. There will be an IO controller minisummit before KS. I trust someone (Balbir?) will be sending meeting notes to the cgroup list, so that highlights can be mentioned at KS?
> > >
> > > 2. There was a checkpoint/restart BOF plus talk at plumber's. Notes on the BOF are here:
> > >
> > > https://lists.linux-foundation.org/pipermail/containers/2009-September/020915.html
> > >
> > > 3. There was an OOM notification talk or BOF at plumber's. Dave or Balbir, are there any notes about that meeting?
> >
> > Serge:
> > Here are some notes I took from Dave's OOM talk:
> >
> > Change the OOM killer's policy.
> >
> > The current goal of the OOM killer is to kill a rogue memory-hogging task, which will lead to future memory freeing and allow the system or container to resume normal operation. Under an OOM condition, the kernel scans the tasklist of the system or container and scores each task with a heuristic mechanism. The task with the highest score is picked to kill. The kernel also provides the /proc/pid/oom_adj API for adding user policy on top of the score; it allows the admin to tune the "badness" on a per-task basis.
> >
> > Linux theory: a free page is a wasted page of RAM, and Linux will always fill up memory with disk caches. When we timestamp the running time of an application, we normally follow the sequence "flush cache - time - run app - time - flush cache". So being OOM is normal; it is not a bug.
> >
> > Linux-mm has a list describing the possible OOM conditions:
> > http://linux-mm.org/OOM
> >
> > User perspectives:
> >
> > High-performance computing: I will take as much memory as can be given; please tell me how much memory that is. In these systems, swapping is the devil.
> >
> > Enterprise: Applications do their own memory management. If the system gets low on memory, I want the kernel to tell me, and I will give some of mine back. A memory notification system gets a lot of attention here; a couple of proposals have been posted to linux-mm, and none of them seems to fulfill all the requirements.
> >
> > Desktop: This is what the OOM killer was designed for. When OpenOffice/Firefox blows up, please just kill it quickly; I will reopen it in a minute. Also, please don't kill sshd.
> >
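As a quick illustration of the per-task knob mentioned above (and of the "please don't kill sshd" case): on current 2.6 kernels this is done by writing to /proc/<pid>/oom_adj, where -17 (OOM_DISABLE) exempts a task from badness scoring entirely. A minimal sketch, with a placeholder pid:

    /* A minimal sketch, not from the talk: exempt a task (e.g. sshd) from
     * OOM killing via the 2.6-era /proc/<pid>/oom_adj interface.  -17 is
     * OOM_DISABLE; +15 would make the task the preferred victim.  The pid
     * below is a placeholder. */
    #include <stdio.h>

    int main(void)
    {
        int pid = 1234;                 /* placeholder, e.g. pidof sshd */
        char path[64];
        FILE *f;

        snprintf(path, sizeof(path), "/proc/%d/oom_adj", pid);
        f = fopen(path, "w");
        if (!f) {
            perror("fopen");
            return 1;
        }
        fprintf(f, "%d\n", -17);        /* OOM_DISABLE */
        fclose(f);
        return 0;
    }

The proposals summarized below are all attempts to give the admin something richer than this single per-task hint.
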
> > Memory reclaim
> > If there is no free memory, we scan the LRU and try to free pages. Recent issues on page reclaim focus on scalability. In 1991, with 4M of DRAM, we had 1024 pages to scan. In 2009, with 4G of DRAM, we have 1048576 pages to scan. The increase in memory size makes the reclaim job harder and harder.
> >
> > Beat the LRU into shape:
> > * Never run out of memory, never reclaim, and never look at the LRU.
> > * Use a larger page size. IBM uses 64K pages instead of 4K pages ("IBM uses 64K pages; it is more of a kernel change than a userspace change if they use libc").
> > * Keep troublesome pages off the LRU lists, including unreclaimable pages (anon, mlock, shm, slab, dirty pages) and hugetlbfs pages, which are not counted in RSS.
> > * Split up the LRU lists. This includes the NUMA implementation as well as the unevictable patch from Rik (~2.6.28).
> >
> > What is next:
> >
> > Having the OOM killer always pick the "right" application to kill is a tough problem, and it has been a hot topic upstream, with several patches posted. The notification system got a lot of attention during the talk; here is a summary of the currently posted patches:
> >
> > Linux killed Kenny, bastard!
> > Evgeniy Polyakov posted this patch early this year. What the patch does is provide an API through which the admin can specify the OOM victim by process name. No one in linux-mm liked the patch. The argument is about the current mechanism of calculating the "badness score", which is far too complex for the admin to determine which task will be killed. Alan Cox simply answered the question: "its always heuristic", and he also pointed out: "What you actually need is notifiers to work on /proc. In fact containers are probably the right way to do it".
> >
> > Cgroup-based OOM killer controller
> > Nikanth Karthikesan re-posted the patch that adds cgroup support. The patch adds an adjustable value "oom.victim" for each oom cgroup. The OOM killer would kill all the processes in a cgroup with a higher oom.victim value before killing a process in a cgroup with a lower oom.victim value. Among tasks with the same oom.victim value, the usual "badness" heuristics would be applied.
> > It is one step further, making use of the cgroup hierarchy for the OOM killer subsystem. However, the same question was raised: "What is the difference between oom_adj and this oom.victim to the user?". Nikanth answered: "Using this oom.victim users can specify the exact order to kill processes.". In other words, oom_adj works as a hint to the kernel, while oom.victim gives a strict order.
> >
> > Per-cgroup OOM handler
> > Ying Han posted the Google in-house patch to linux-mm, which defers OOM kill decisions to userspace. It allows userspace to respond to the OOM by adding nodes, dropping caches, raising the memcg limit, or sending a signal. An alternative is to use /dev/mem_notify, which David Rientjes proposed in linux-mm. The idea is similar: instead of waiting on oom_await, userspace can poll for the information during a lowmem condition and respond accordingly.
> >
> > Vladislav Buzov posted a patch which extends the memcg by adding a notification system for the system lowmem condition. The feedback looks promising this time, although there are still lots of changes that need to be done. Discussions focused on the implementation of the notification mechanism.
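Whichever delivery mechanism ends up winning, the userspace side of all of these notification schemes looks roughly the same: sleep on a pollable file descriptor and react when the kernel signals memory pressure. Below is a minimal sketch assuming a /dev/mem_notify-style pollable file as in the proposal above; the path and event semantics are illustrative only, since none of these interfaces is merged.

    /* A minimal sketch of a userspace lowmem handler.  It assumes a
     * pollable /dev/mem_notify-style file as proposed on linux-mm; the
     * path and event semantics are illustrative, not a merged ABI. */
    #include <fcntl.h>
    #include <poll.h>
    #include <stdio.h>

    int main(void)
    {
        int fd = open("/dev/mem_notify", O_RDONLY);
        struct pollfd pfd;

        if (fd < 0) {
            perror("open /dev/mem_notify");
            return 1;
        }
        pfd.fd = fd;
        pfd.events = POLLIN;

        for (;;) {
            if (poll(&pfd, 1, -1) > 0 && (pfd.revents & POLLIN)) {
                /* Respond: drop caches, raise the memcg limit, signal
                 * cooperating applications, add nodes, etc. */
                fprintf(stderr, "low-memory event received\n");
            }
        }
    }

The real design question is kernel-side: what event gets delivered, and whether it is per-cgroup or system-wide.
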
> > Balbir Singh mentioned cgroupstats - a genetlink-based mechanism for event delivery and request/response applications. Paul Menage proposed a couple of options, including a new ioctl on cgroup files, a new syscall, and a new per-cgroup file.
> >
> > --Ying Han
> >
> > > 4. The actual title of the KS discussion is 'containers end-game'. The containers-specific info I gathered in June was mainly about additional resources which we might containerize. I expect that will be useful in helping the KS community decide how far down the containerization path they are willing to go - i.e. whether we want to call what we have good enough and say you must use kvm for anything more, whether we want to be able to provide all the features of a full VM with containers, or something in between, say targeting specific uses (perhaps only expanding on cooperative resource management containers). With that in mind, here are some items that were mentioned in June as candidates for more containerization work:
> > >
> > > 1. Cpu hard limits, memory soft limits (Balbir)
> > > 2. Large pages, mlock, shared page accounting (Balbir)
> > > 3. Oom notification (Balbir - was anything decided on this at plumber's?)
> > > 4. There is agreement on getting rid of the ns cgroup, provided that:
> > >    a. user namespaces can provide container confinement guarantees
> > >    b. a compatibility flag is created to clone the parent cgroup when creating a new cgroup (Paul and Daniel)
> > > 5. Poweroff/reboot handling in containers (Daniel)
> > > 6. Full user namespaces to segregate uids in different containers and confine root users in containers, i.e. with respect to file systems like cgroupfs.
> > > 7. Checkpoint/restart (c/r) will want time virtualization (Daniel)
> > > 8. C/r will want inode virtualization (Daniel)
> > > 9. Sunrpc containerization (required to allow multiple containers separate NFS client access to the same server)
> > > 10. Sysfs tagging, support for physical netifs to migrate between network namespaces, and /sys/class/net virtualization
> > >
> > > Again, the point of this list isn't to ask for discussion about whether or how to implement each at this KS, but rather to give an idea of how much work is left to do. Though let the discussion lead where it may, of course.
> > >
> > > I don't have it here, but maybe it would also be useful to have a list ready of things we can do today with containerization? Both with upstream, and with under-development patchsets.
> > >
> > > I also hope that someone will take notes on the ksummit discussion to send to the containers and cgroup lists. I expect there will be a good LWN writeup, but a more containers-focused set of notes will probably be useful too.
> > >
> > > thanks,
> > > -serge
> >

_______________________________________________
Containers mailing list
Containers@xxxxxxxxxxxxxxxxxxxxxxxxxx
https://lists.linux-foundation.org/mailman/listinfo/containers