Hi everyone,

I came up with an idea for speeding up memory sharing with memfd. Memfd introduced the idea of memory sealing: the memory is sealed when only one mapping of it is allowed to exist, which forces peers to repeat a costly mmap()/munmap() dance. This overhead makes the memfd mechanism beneficial only for transfers larger than about 512 kB.

My idea is to avoid calling mmap()/munmap() at all by modifying only the 1st-level entries of the process page tables (PT). This allows the page-fault mechanism to be used to prevent access to the buffer without destroying and recreating page tables.

_INTERNALS_

The new semantics consist of two operations, LOCK and UNLOCK. The reader/writer role is recognized by the access flags passed to mmap() or open().

('*' marks a comment)

LOCK()
- if the owner is not NULL, return -EPERM
- set ownership of the buffer to the current process
  * protection from multiple writers

FOR READER
- invalidate the cache
  * avoid reading no-longer-valid data from the reader's L1 cache

FOR WRITER
- restore the entry in the 1st-level PT (if any)
  * accessing the memory will no longer cause page faults

UNLOCK()
- set the buffer owner to NULL
  * allow other writers/readers to access the buffer

FOR WRITER
- store a pointer to the 2nd-level page table (PT) in the fd's private data
  * allows the writer's PT to be restored without recreating it
- flush the data cache for the buffer
  * make sure that updated data reached L2 and is visible to other processes
- set the entry in the 1st-level PT to PTE_NONE
  * force a page fault on any access to the buffer without owning it
- invalidate the TLB for the buffer's virtual memory region
  * prevent a page fault from being skipped because the page table entry is cached in the TLB

Accessing a buffer as a writer outside a LOCK()/UNLOCK() session will cause a page fault. The virtually indexed L1 cache is flushed, so the CPU must use the TLB to translate the virtual address to a physical one. No such entry exists after flushing the TLB in UNLOCK(), so the CPU must do a page table walk. The walk will fail because the entry in the 1st-level PT is empty, causing a page fault. The page fault handler must check whether the process has ownership of the buffer it tries to access. If the owner is NULL or some other process, the page fault is "upgraded" to a SEGFAULT, effectively killing the process that has broken the memfd protocol.

_USE CASE_

A simple use case for the new semantics is described below. There are two processes, called reader and writer. The writer mmap()s a buffer with read/write access rights; the reader uses a read-only mapping. The writer fills the buffer and passes it to the reader in the following steps (a rough userspace sketch of both sides follows the step lists):

1. Open a memfd descriptor and set up its size.
2. Pass the fd to the reader using sockets.
3. mmap() the buffer - reserve a region in the virtual address space that refers to a single entry in the 1st-level page tables (1-4 MiB depending on the architecture).
4. LOCK(buffer) (details below).
5. Fill the buffer with data - populate the page table on write faults.
6. UNLOCK(buffer).
7. Ping the reader using another API (eventfd/sockets/signals).

The reader processes the buffer in the following steps:

Pre. Assume that the fd is already shared by the writer and mmap()ed with RDONLY flags.

1. LOCK(buffer)
2. Read the buffer
3. UNLOCK(buffer)
4. Ping the writer that the buffer was processed.
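To make the flow above concrete, here is a minimal userspace sketch of both sides. The MEMFD_LOCK/MEMFD_UNLOCK ioctls are purely hypothetical placeholders invented for this sketch (the real interface could just as well be an fcntl() or a new syscall); error handling is omitted and the fd handover is shown separately.

        #define _GNU_SOURCE
        #include <string.h>
        #include <unistd.h>
        #include <sys/mman.h>
        #include <sys/ioctl.h>

        /* One 1st-level PT entry worth of address space (1 MiB here; arch dependent). */
        #define BUF_SIZE        (1 << 20)

        /* Hypothetical ioctls, made up for illustration only. */
        #define MEMFD_LOCK      _IO('M', 0)
        #define MEMFD_UNLOCK    _IO('M', 1)

        void writer(void)
        {
                int fd = memfd_create("shared-buf", 0);         /* step 1 */
                ftruncate(fd, BUF_SIZE);
                /* step 2: send fd to the reader over a unix socket (SCM_RIGHTS) */
                char *buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
                                 MAP_SHARED, fd, 0);            /* step 3 */

                ioctl(fd, MEMFD_LOCK);                          /* step 4 */
                memset(buf, 0xab, BUF_SIZE);                    /* step 5 */
                ioctl(fd, MEMFD_UNLOCK);                        /* step 6 */
                /* step 7: ping the reader (eventfd/socket/signal) */
        }

        void reader(int fd)
        {
                /* fd was received from the writer; map it read-only */
                const char *buf = mmap(NULL, BUF_SIZE, PROT_READ, MAP_SHARED, fd, 0);
                char sum = 0;

                ioctl(fd, MEMFD_LOCK);                          /* step 1 */
                for (size_t i = 0; i < BUF_SIZE; i++)           /* step 2: read the buffer */
                        sum += buf[i];
                ioctl(fd, MEMFD_UNLOCK);                        /* step 3 */
                (void)sum;
                /* step 4: ping the writer that the buffer was processed */
        }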
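Step 2 of the writer flow (handing the memfd over to the reader) needs nothing new; the standard SCM_RIGHTS ancillary-data mechanism over a connected unix socket is enough. A minimal sketch of the sending side, again with error handling omitted:

        #include <string.h>
        #include <sys/socket.h>
        #include <sys/uio.h>

        /* Send an already-created memfd to the peer over a connected unix socket. */
        void send_fd(int sock, int fd)
        {
                char dummy = 'x';
                struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
                union {
                        char buf[CMSG_SPACE(sizeof(int))];
                        struct cmsghdr align;
                } u;
                struct msghdr msg = {
                        .msg_iov = &iov,
                        .msg_iovlen = 1,
                        .msg_control = u.buf,
                        .msg_controllen = sizeof(u.buf),
                };
                struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

                cmsg->cmsg_level = SOL_SOCKET;
                cmsg->cmsg_type = SCM_RIGHTS;
                cmsg->cmsg_len = CMSG_LEN(sizeof(int));
                memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

                sendmsg(sock, &msg, 0);
        }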
_SUMMARY_

The benefits: basically, this speeds up sharing on the writer's side. There is no need to destroy the writer's page tables in order to seal a buffer. Moreover, the page tables are cached in the fd's private data, so there is no need to recreate the PT after re-acquiring the buffer's ownership.

It is possible to modify the LOCK/UNLOCK semantics to disallow concurrent reads. This might be useful for deciphering buffer content in place on the server's side.

The problems: the main disadvantage is the difficult portability of the presented solution. The size of the 2nd-level page tables may differ greatly from platform to platform. Moreover, the mechanism reserves a huge region in the virtual address space for the use of a single buffer. This might be a great waste of a valuable resource on 32-bit machines. A possible workaround might be using 3rd-level entries for smaller buffers.

I understand that implementing such a change might require a very good understanding of MM's 'infernals'. It should be investigated what the actual bottleneck of the current memfd sharing mechanism is. If the slowdown is caused by updating the PTs, then the new mechanism will be very beneficial. If the performance loss is caused by cache flushes and TLB flushing, then the gain might be negligible. More profiling data are needed.

Regards,
Tomasz Stanislawski