On 05.12.2017 18:43, Michal Hocko wrote:
> On Tue 05-12-17 18:34:59, Kirill Tkhai wrote:
>> On 05.12.2017 18:15, Michal Hocko wrote:
>>> On Tue 05-12-17 13:00:54, Kirill Tkhai wrote:
>>>> Currently, the number of available aio requests may be
>>>> limited only globally. There are two sysctl variables,
>>>> aio_max_nr and aio_nr, which implement the limitation
>>>> and request accounting. They help to avoid the situation
>>>> when all the memory is eaten by in-flight requests, which
>>>> are written by a slow block device and which can't be
>>>> reclaimed by the shrinker.
>>>>
>>>> This becomes a problem when many containers are used on
>>>> the hardware node. Since aio_max_nr is a global limit, any
>>>> container may occupy all the available aio requests and
>>>> deprive the others of the possibility to use aio at all.
>>>> This may happen because of evil intentions of the
>>>> container's user or because of a program error, when the
>>>> user does this accidentally.
>>>>
>>>> The patch fixes the problem. It adds memcg accounting of
>>>> the aio data allocated on behalf of the user (the biggest
>>>> part is the bunch of aio_kiocb; the ring buffer is the
>>>> second biggest), so a user of a certain memcg won't be
>>>> able to allocate more aio request memory than the cgroup
>>>> allows, and will bump into the limit.
>>>
>>> So what happens when we hit the hard limit and oom kill somebody?
>>> Are those charged objects somehow bound to a process context?
>>
>> There is exit_aio() called from __mmput(), which waits until
>> the charged objects complete and decrement the reference counter.
>
> OK, so it is bound to _a_ process context. The oom killer will not know
> which process has consumed those objects, but the effect will at least
> be reduced to a memcg.
>
>> If there were a problem with oom in a memcg, there would be
>> the same problem on global oom; as can be seen, there are
>> no __GFP_NOFAIL flags anywhere in the aio code.
>>
>> But it seems everything is safe.
>
> Could you share your testing scenario and the way the system behaved
> during heavy aio?
>
> I am not saying the patch is wrong, I am just trying to understand all
> the consequences.

My test is a simple program which creates an aio context and then starts
an infinite io_submit() cycle. I've tested the cases when certain stages
fail: io_setup() meets oom, io_submit() meets oom, io_getevents() meets
oom. This was simply tested by inserting sleep() before the stage in
question and moving the task to an appropriate cgroup with a low memory
limit.

In most cases I got bash killed (I moved it to the cgroup too).

I've also executed the test in parallel.

If you want, I can send you the source code, but I don't think it will be
easy to use if you are not the author.

Kirill
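
For reference, a rough sketch of such a reproducer (not the actual test
source, just an illustration: it assumes the raw AIO syscalls via
syscall(2), O_DIRECT writes to a local file named "aio-testfile", and an
arbitrary ring size of 4096) could look like this:

/*
 * Rough reproducer sketch, not the actual test program: create an aio
 * context and submit O_DIRECT writes in an endless loop, so that the
 * ring buffer and the in-flight aio_kiocb objects get charged to the
 * task's memcg. File name and ring size are arbitrary.
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/aio_abi.h>

static long sys_io_setup(unsigned nr, aio_context_t *ctx)
{
	return syscall(SYS_io_setup, nr, ctx);
}

static long sys_io_submit(aio_context_t ctx, long nr, struct iocb **iocbs)
{
	return syscall(SYS_io_submit, ctx, nr, iocbs);
}

int main(void)
{
	static char buf[4096] __attribute__((aligned(4096)));
	aio_context_t ctx = 0;
	struct iocb cb, *cbs[1] = { &cb };
	int fd;

	fd = open("aio-testfile", O_RDWR | O_CREAT | O_DIRECT, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* The ring buffer allocated here is one of the charged objects. */
	if (sys_io_setup(4096, &ctx) < 0) {
		perror("io_setup");
		return 1;
	}

	memset(&cb, 0, sizeof(cb));
	cb.aio_fildes = fd;
	cb.aio_lio_opcode = IOCB_CMD_PWRITE;
	cb.aio_buf = (unsigned long)buf;
	cb.aio_nbytes = sizeof(buf);
	cb.aio_offset = 0;

	/* Endless io_submit() cycle: each accepted request pins an aio_kiocb. */
	for (;;) {
		if (sys_io_submit(ctx, 1, cbs) < 0)
			perror("io_submit");
	}

	return 0;
}

Run inside a memcg with a low memory limit, the ring buffer allocated by
io_setup() and the aio_kiocb objects pinned by the in-flight writes are
what the patch charges, so the task bumps into the cgroup limit instead
of only the global aio_max_nr.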