Kirill Tkhai <ktkhai@xxxxxxxxxxxxx> writes:

> On 05.12.2017 00:52, Tejun Heo wrote:
>> Hello, Kirill.
>>
>> On Tue, Dec 05, 2017 at 12:44:00AM +0300, Kirill Tkhai wrote:
>>>> Can you please explain how this is a fundamental resource which can't
>>>> be controlled otherwise?
>>>
>>> Currently, aio_nr and aio_max_nr are global. In case of containers this
>>> means that a single container may occupy all aio requests, which are
>>> available in the system, and to deprive others possibility to use aio
>>> at all. This may happen because of evil intentions of the container's
>>> user or because of the program error, when the user makes this occasionally.
>>
>> Hmm... I see. It feels really wrong to me to make this a first class
>> resource because there is a system wide limit. The only reason I can
>> think of for the system wide limit is to prevent too much kernel
>> memory consumed by creating a lot of aios but that squarely falls
>> inside cgroup memory controller protection. If there are other
>> reasons why the number of aios should be limited system-wide, please
>> bring them up.
>>
>> If the only reason is kernel memory consumption protection, the only
>> thing we need to do is making sure that memory used for aio commands
>> are accounted against cgroup kernel memory consumption and
>> relaxing/removing system wide limit.
>
> So, we just use GFP_KERNEL_ACCOUNT flag for allocation of internal aio
> structures and pages, and all the memory will be accounted in kmem and
> limited by memcg. Looks very good.
>
> One detail about memory consumption. io_submit() calls primitives
> file_operations::write_iter and read_iter. It's not clear for me whether
> they consume the same memory as if writev() or readv() system calls
> would be used instead. writev() may delay the actual write till dirty
> pages limit will be reached, so it seems logic of the accounting should
> be the same. So aio mustn't use more not accounted system memory in file
> system internals, then simple writev().
>
> Could you please to say if you have thoughts about this?

I think you just need to account the completion ring.

Cheers,
Jeff
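
To make "account the completion ring" concrete, here is a rough userspace
model in C, not kernel code. It only illustrates the idea that the ring's
size grows with the number of events requested, and that charging those
bytes to a per-group counter bounds how many rings a group can create.
Everything named *_model is hypothetical and invented for this sketch; the
32-byte header, the 4096-byte page size and the 32-byte event size are
assumptions. In the kernel the charge would come from allocating with an
accounted GFP flag such as GFP_KERNEL_ACCOUNT, not from an explicit counter
like this one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a memory cgroup's kernel-memory counter. */
struct memcg_model {
    size_t limit_bytes;    /* maximum bytes the group may have charged */
    size_t charged_bytes;  /* bytes currently charged to the group */
};

/* One completion event: four 64-bit fields, 32 bytes (assumed layout). */
struct event_model {
    uint64_t data;
    uint64_t obj;
    int64_t  res;
    int64_t  res2;
};

#define PAGE_SIZE_MODEL   4096u  /* assumed page size */
#define RING_HEADER_MODEL 32u    /* assumed ring header size */

/* Bytes needed for a ring of nr_events entries, rounded up to whole pages. */
static size_t ring_bytes(unsigned nr_events)
{
    size_t raw = RING_HEADER_MODEL +
                 (size_t)nr_events * sizeof(struct event_model);

    return (raw + PAGE_SIZE_MODEL - 1) / PAGE_SIZE_MODEL * PAGE_SIZE_MODEL;
}

/* Charge bytes to the group; return 0 on success, -1 if over the limit. */
static int charge_model(struct memcg_model *cg, size_t bytes)
{
    assert(cg->charged_bytes <= cg->limit_bytes);
    if (bytes > cg->limit_bytes - cg->charged_bytes)
        return -1;
    cg->charged_bytes += bytes;
    return 0;
}

/* Model of ring setup: the ring exists only if its pages can be charged. */
static int setup_ring_model(struct memcg_model *cg, unsigned nr_events)
{
    return charge_model(cg, ring_bytes(nr_events));
}
```

With a 16384-byte limit, two 128-event rings (8192 bytes each in this
model) fit, and a third ring of any size is refused. That gives a
per-container bound without a separate aio_nr-style counter.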