On 05.12.2017 01:59, Jeff Moyer wrote:
> Kirill Tkhai <ktkhai@xxxxxxxxxxxxx> writes:
>
>> On 05.12.2017 00:52, Tejun Heo wrote:
>>> Hello, Kirill.
>>>
>>> On Tue, Dec 05, 2017 at 12:44:00AM +0300, Kirill Tkhai wrote:
>>>>> Can you please explain how this is a fundamental resource which can't
>>>>> be controlled otherwise?
>>>>
>>>> Currently, aio_nr and aio_max_nr are global. In case of containers this
>>>> means that a single container may occupy all aio requests, which are
>>>> available in the system, and to deprive others possibility to use aio
>>>> at all. This may happen because of evil intentions of the container's
>>>> user or because of the program error, when the user makes this occasionally.
>>>
>>> Hmm... I see. It feels really wrong to me to make this a first class
>>> resource because there is a system wide limit. The only reason I can
>>> think of for the system wide limit is to prevent too much kernel
>>> memory consumed by creating a lot of aios but that squarely falls
>>> inside cgroup memory controller protection. If there are other
>>> reasons why the number of aios should be limited system-wide, please
>>> bring them up.
>>>
>>> If the only reason is kernel memory consumption protection, the only
>>> thing we need to do is making sure that memory used for aio commands
>>> are accounted against cgroup kernel memory consumption and
>>> relaxing/removing system wide limit.
>>
>> So, we just use GFP_KERNEL_ACCOUNT flag for allocation of internal aio
>> structures and pages, and all the memory will be accounted in kmem and
>> limited by memcg. Looks very good.
>>
>> One detail about memory consumption. io_submit() calls primitives
>> file_operations::write_iter and read_iter. It's not clear for me whether
>> they consume the same memory as if writev() or readv() system calls
>> would be used instead. writev() may delay the actual write till dirty
>> pages limit will be reached, so it seems logic of the accounting should
>> be the same. So aio mustn't use more not accounted system memory in file
>> system internals, then simple writev().
>>
>> Could you please to say if you have thoughts about this?
>
> I think you just need to account the completion ring.

A request of struct aio_kiocb type consumes much more memory, than
struct io_event does. Shouldn't we account it too?

Kirill
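
For a rough sense of scale of the completion ring mentioned above, here is a minimal userspace sketch. It assumes the four-field 64-bit layout of struct io_event from the uapi header linux/aio_abi.h; the names io_event_mirror and ring_event_bytes() are hypothetical and used only for illustration. It counts event slots only, not the ring header, and it says nothing about the size of struct aio_kiocb, which is kernel-internal and not computed here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical userspace mirror of struct io_event (linux/aio_abi.h).
 * Assumption: four 64-bit fields, as in the uapi definition.
 */
struct io_event_mirror {
	uint64_t data;	/* user data copied from the iocb */
	uint64_t obj;	/* pointer to the originating iocb */
	int64_t  res;	/* result code of the request */
	int64_t  res2;	/* secondary result */
};

/* Fail the build if the assumed layout does not give 32 bytes per event. */
static_assert(sizeof(struct io_event_mirror) == 32,
	      "unexpected io_event_mirror size");

/*
 * Bytes taken by the event slots of a completion ring holding
 * nr_events entries. The ring header is deliberately left out.
 */
size_t ring_event_bytes(size_t nr_events)
{
	return nr_events * sizeof(struct io_event_mirror);
}
```

Under these assumptions, ring_event_bytes(128) gives 4096, that is, one 4 KiB page of event slots for a 128-entry ring. The per-request cost in question comes on top of that for every in-flight request.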