On 08/09/2015 09:00 PM, Greg Kroah-Hartman wrote:
> In chatting with Daniel on IRC, he is writing up a summary of how the
> kdbus memory pools work in more detail, and he said he would send that
> out in a day or so, so that everyone can review.

Yes, let me quickly describe again how the kdbus pool logic works.

Every bus connection (peer) owns a buffer which is used to receive
payloads. Such payloads are either messages sent by other connections,
notifications, or structures returned in answer to query commands
(name lists, etc.).

To avoid the kernel having to maintain an internal buffer that
connections then read from with an extra command, we decided to let the
connections own their buffer directly, so they can mmap() the memory
into their task. Allocating a local buffer to collect asynchronous
messages is something they would have to do anyway, so we implemented a
short-cut that allows the kernel to access the memory directly and
write into it.

The size of this buffer pool is configured by each connection
individually, during the HELLO call, so the kernel interface is as
flexible as any other memory allocation scheme the kernel provides, and
it is subject to the same limits.

Internally, the connection pool is simply a shmem-backed file. From the
context of the HELLO ioctl, we call into shmem_file_setup(), so the
file is eventually owned by the task connecting to the bus. One reason
why we do the shmem file allocation in the kernel, on behalf of the
userspace task, is that we clear the VM_MAYWRITE bit to prevent the
task from writing to the pool through its mapped buffer. We also do not
set VM_NORESERVE, so the entire buffer is pre-accounted for the task
that created the connection. (A reduced sketch of these two steps
follows at the end of this mail.)

The pool implementation uses an r/b tree to organize the buffer into
slices. Those slices can be kept by userspace for as long as the
parsing implementation needs access to them, and they are freed
afterwards. A simple ring buffer cannot cope with the gaps that emerge
that way. (See the second sketch below.)

When a connection buffer is written to, that happens from the context
of another task which calls into the kdbus code through one of the
ioctls. The memcg implementation should hence charge the task that acts
as the writer, which is maybe not ideal, but can be changed easily with
some addition to the internal APIs. We omitted it for the current
version, which is non-intrusive with regard to other kernel subsystems.

The kdbus implementation is actually comparable to two tasks X and Y
which both have their own buffer file open and mmap()ed, and which both
pass their FD to the other side. If X now writes to Y's file, and that
causes a page fault, X is accounted for it, correct? (The third sketch
below reproduces exactly that situation with plain shared memory.)

The kernel does *not* do any memory allocation of its own to buffer
payload, and all other allocations (for instance, to keep around the
internal state of a connection, names, etc.) are subject to
conservatively chosen limits. There is no unbounded memory allocation
in kdbus that I am aware of. If there was, it would clearly be a bug.

To address the point Andy made earlier: yes, due to memory
overcommitment, OOM situations may happen with certain patterns, but
the kernel should have the same measures to deal with them that it
already has for other types of shared userspace memory. Right?

Hope that all makes sense. We're open to discussions around the desired
accounting details, and I've copied linux-mm to let more people have a
look at this again.
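To make the pool setup a bit more concrete, here is a reduced sketch of
the two steps described above: creating the shmem file from the HELLO
context, and denying write access when the connection mmap()s it.
pool_file_new() and pool_mmap() are illustrative names for this mail,
not the exact upstream functions, and error handling is trimmed down;
shmem_file_setup(), fput(), get_file() and the VM_* flags are the real
kernel interfaces.

#include <linux/shmem_fs.h>
#include <linux/file.h>
#include <linux/fs.h>
#include <linux/mm.h>

/* Create the per-connection pool file; called from the HELLO ioctl.
 * We intentionally do not pass VM_NORESERVE, so the entire size is
 * pre-accounted for the connecting task. */
static struct file *pool_file_new(size_t size)
{
        return shmem_file_setup("kdbus-pool", size, 0);
}

/* mmap() handler for the connection: userspace only ever gets a
 * read-only view of the pool. */
static int pool_mmap(struct file *pool_file, struct vm_area_struct *vma)
{
        /* deny write access to the pool ... */
        if (vma->vm_flags & VM_WRITE)
                return -EPERM;

        /* ... and prevent a later mprotect(PROT_WRITE) upgrade */
        vma->vm_flags &= ~VM_MAYWRITE;

        /* redirect the mapping to the shmem file backing the pool */
        if (vma->vm_file)
                fput(vma->vm_file);
        vma->vm_file = get_file(pool_file);

        return pool_file->f_op->mmap(pool_file, vma);
}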
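The slice bookkeeping can be sketched like this. The struct layout and
the helpers pool_slice_link()/pool_slice_release() are made up for
illustration; the rbtree calls are the stock kernel API. Because slices
are keyed by their offset, releasing one in the middle of the pool
simply leaves a gap that a later allocation of suitable size can
re-use, which is exactly what a plain ring buffer cannot express.

#include <linux/rbtree.h>

/* one region of the pool, handed out to a receiver as a slice */
struct pool_slice {
        struct rb_node rb;      /* node in the per-pool tree */
        size_t off;             /* offset of the slice in the pool */
        size_t size;            /* size of the slice */
        bool free;              /* released, may be handed out again */
};

/* insert a slice into the per-pool tree, keyed by offset */
static void pool_slice_link(struct rb_root *slices, struct pool_slice *new)
{
        struct rb_node **n = &slices->rb_node;
        struct rb_node *parent = NULL;

        while (*n) {
                struct pool_slice *s = rb_entry(*n, struct pool_slice, rb);

                parent = *n;
                if (new->off < s->off)
                        n = &(*n)->rb_left;
                else
                        n = &(*n)->rb_right;
        }

        rb_link_node(&new->rb, parent, n);
        rb_insert_color(&new->rb, slices);
}

/* userspace is done parsing: keep the node, mark the range re-usable;
 * the real implementation would also merge adjacent free slices */
static void pool_slice_release(struct pool_slice *s)
{
        s->free = true;
}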
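And the X/Y question above can be tried without kdbus at all. The
following stand-alone demo (hypothetical, assumes Linux 3.17+ for
memfd_create(); glibc has no wrapper for it yet, hence syscall()) has
one process create and own the buffer file while a forked child writes
to it. The child triggers the page faults, so the child is the task the
pages are accounted to, even though the parent owns the file, which is
the situation the kdbus writer path is compared to.

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/wait.h>

#define POOL_SIZE (16UL * 1024 * 1024)

int main(void)
{
        /* task Y: create and own the buffer file */
        int fd = syscall(SYS_memfd_create, "pool-demo", 0);

        if (fd < 0 || ftruncate(fd, POOL_SIZE) < 0) {
                perror("memfd");
                return 1;
        }

        if (fork() == 0) {
                /* task X: map Y's file and write to it; the faults
                 * happen in X's context, so X is charged for the
                 * pages, not Y */
                char *p = mmap(NULL, POOL_SIZE, PROT_READ | PROT_WRITE,
                               MAP_SHARED, fd, 0);

                if (p == MAP_FAILED) {
                        perror("mmap");
                        _exit(1);
                }
                memset(p, 0x55, POOL_SIZE);
                _exit(0);
        }

        wait(NULL);
        return 0;
}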
Thanks,
Daniel