[sorry for not being as responsive as I usually am]

>>
>> They share a *mm* including a consistent virtual memory layout (VMA
>> list). Page table sharing is just a side product of that. You could even
>> call page tables just an implementation detail to produce that
>> consistent virtual memory layout -- described for that MM via a
>> different data structure.
>
> Yes, sharing an mm and vma chain does make it different from implementation point of view.
>
>>
>>> A number of people have commented on potential usefulness of this concept
>>> and implementation.
>>
>> ... and a lot of people raised concerns. Yes, page table sharing to
>> reduce memory consumption/tlb misses/... is something reasonable to
>> have. But that doesn't require mshare, as hugetlb has proven.
>>
>> The design might be useful for a handful of corner (!) cases, but as the
>> cover letter only talks about memory consumption of page tables, I'll
>> not care about those. Once these corner cases are explained and deemed
>> important, we might want to think of possible alternatives to explore
>> the solution space.
>
> Memory consumption by page tables is turning out to be significant issue. I mentioned one real-world example from a
> customer where a 300GB SGA on a 512GB server resulted in OOM when 1500+ processes tried to map parts of the SGA into
> their address space. Some customers are able to solve this issue by switching to hugetlbfs but that is not feasible for
> every one.

Yes. Another use case I am aware of is KVM-based virtual machines, where VM memory (shmem, file-backed) is not only mapped into the emulator process, but also into other processes used to carry out I/O (e.g., vhost-user). In that case, it's tempting to simply share the page tables between all processes for the shared mapping -- automatically, just like shmem/hugetlb already does.
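Just to put rough numbers on the quoted SGA example (back-of-the-envelope; assuming 4k base pages, 8 byte PTEs, ignoring the upper-level tables, and assuming each process ends up mapping most of the SGA):

  300 GiB / 4 KiB pages           -> ~78.6 million PTEs per process
  ~78.6 million PTEs * 8 bytes    -> ~600 MiB of PTE tables per process
  1500 processes * ~600 MiB       -> ~880 GiB of page tables

on a 512 GiB machine. So yes, the page table overhead alone can explain the OOM.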
[...]

>>
>>> suggestion to extend hugetlb PMD sharing was discussed briefly. Conclusion from that discussion and earlier discussion
>>> on mailing list was hugetlb PMD sharing is built with special case code in too many places in the kernel and it is
>>> better to replace it with something more general purpose than build even more on it. Mike can correct me if I got that
>>> wrong.
>>
>> Yes, I pushed for the removal of that yet-another-hugetlb-special-stuff,
>> and asked the honest question if we can just remove it and replace it by
>> something generic in the future. And as I learned, we most probably
>> cannot rip that out without affecting existing user space. Even
>> replacing it by mshare() would degrade existing user space.
>>
>> So the natural thing to reduce page table consumption (again, what this
>> cover letter talks about) for user space (semi- ?)automatically for
>> MAP_SHARED files is to factor out what hugetlb has, and teach generic MM
>> code to cache and reuse page tables (PTE and PMD tables should be
>> sufficient) where suitable.
>>
>> For reasonably aligned mappings and mapping sizes, it shouldn't be too
>> hard (I know, locking ...), to cache and reuse page tables attached to
>> files -- similar to what hugetlb does, just in a generic way. We might
>> want a mechanism to enable/disable this for specific processes and/or
>> VMAs, but these are minor details.
>>
>> And that could come for free for existing user space, because page
>> tables, and how they are handled, would just be an implementation detail.
>>
>>
>> I'd be really interested into what the major roadblocks/downsides
>> file-based page table sharing has. Because I am not convinced that a
>> mechanism like mshare() -- that has to be explicitly implemented+used by
>> user space -- is required for that.
>>
>
> I see two parts to what you are suggesting (please correct me if I get this wrong):
>
> 1. Implement a generic page table sharing mechanism
> 2. Implement a way to use this mechanism from userspace

Yes. Whereby 2) would usually just be some heuristic (e.g., file > X MiB -> start sharing), with an additional way to just disable it or just enable it. But yes, most of that stuff should just be automatic.

>
> For 1, your suggestion seems to be extract the page table sharing code from hugetlb and make it generic. My approach is
> to create a special mm struct to host the shared page tables and create a minimal set of changes to simply get PTEs from
> this special mm struct whenever a shared VMA is accessed. There may be value to extracting hugetlb page table sharing
> code and recasting it into this framework of special mm struct. I will look some more into it.

The basic idea would be that whenever a MAP_SHARED VMA has a reasonable size, is aligned in a suitable way (including the mapping offset), and protections match, you can just share PTE tables and even PMD tables.

As page tables of shared mappings usually don't really store per-process information (exceptions I am aware of are userfaultfd and softdirty tracking), we can simply share/unshare page tables of shared mappings fairly easily.

Then, you'd have e.g., 2 sets of page tables cached by the fd that can be mapped into processes:

1) PROT_READ|PROT_WRITE
2) PROT_READ

On VMA protection changes, one would have to unshare (unmap the page table) and either map another shared one, or map a private one. I don't think there would be a need to optimize e.g., for PROT_NONE, but of course, other combinations could make sense to cache. PROT_NONE and other corner cases (softdirty tracking) would simply not use shared page tables.

Shared page tables would have to be refcounted and one could e.g., implement a shrinker that frees unused page tables in the fd cache when memory reclaim kicks in.

With something like that in place, page table consumption could be reduced and vmscan/rmap walks could turn out more efficient.
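To make that a little more concrete, here is a rough sketch of what I have in mind -- purely illustrative pseudo-C: all names (shared_pt_set, can_share_page_tables, ...) are made up, nothing here is existing kernel API, and locking/lifetime are completely hand-waved:

/*
 * Hypothetical per-file cache of shareable page tables: one set per
 * (PMD-aligned file range, protection) combination, hanging off the
 * file's address_space.
 */
struct shared_pt_set {
        pgoff_t first_pgoff;        /* PMD-aligned file offset covered */
        unsigned long nr_tables;    /* number of PTE tables in the set */
        pgprot_t prot;              /* derived from R/W vs. R/O mappings */
        atomic_t refcount;          /* one reference per VMA using the set */
        struct list_head lru;       /* unused sets, freeable by a shrinker */
        struct page **pte_tables;   /* the shared PTE tables themselves */
};

/*
 * Could this VMA use shared page tables at all? Roughly: MAP_SHARED,
 * PMD-aligned start/end and suitably aligned file offset, and no
 * per-process page table state (uffd-wp, softdirty tracking, ...).
 */
static bool can_share_page_tables(struct vm_area_struct *vma)
{
        if (!(vma->vm_flags & VM_SHARED))
                return false;
        if (!IS_ALIGNED(vma->vm_start, PMD_SIZE) ||
            !IS_ALIGNED(vma->vm_end, PMD_SIZE) ||
            !IS_ALIGNED(vma->vm_pgoff, PTRS_PER_PTE))
                return false;
        if (userfaultfd_wp(vma))    /* uffd-wp keeps state in the PTEs */
                return false;
        /* likewise, bail out if softdirty tracking is in use for the VMA */
        return true;
}

The fault path would then look up (or allocate) the set matching the VMA's protection and wire its PTE tables into the process' PMD entries; mprotect()/munmap() would drop the reference again, and the shrinker would only free sets that ended up with no users.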
>
> As for 2, is it fair to say you are not fond of explicit opt-in from userspace and would rather have the sharing be file
> based like hugetlb? That is worth considering but is limiting page table sharing to just file objects reasonable? A goal
> for mshare mechanism was to allow shared objects to be files, anonymous pages, RDMA buffers, whatever. Idea being if you
> can map it, you can share it with shared page tables. Maybe that is too ambitious a goal and I am open to course correction.

We can glue it to the file or anything else that's shared, I think -- I don't think we particularly care, as long as it's something shared between processes to be mapped. And to be quite honest, whenever I read about anonymous memory (i.e., MAP_PRIVATE) I hear my inner voice screaming: just use *shared* memory when you want to *share* memory between processes, and optimize that if anything is missing.

Having that said, I understood from previous discussions that there is a use case for efficient read-only protection across many processes/VMAs. I was wondering if that could be handled on the fs-level (pte_mkwrite). I remember I raised the idea before: if one could have a userfaultfd-wp-style (overlay?) file (system?), user-space could protect/unprotect file pages via a different mechanism (ioctl) and get notified about write access via something similar to userfaultfd user-space handlers, not via signals. Instead of adjusting VMAs, one would only adjust file page mappings to map the relevant pages R/O when protecting -- if page tables are shared, that would be efficient.

Now, that is just a very vague brain dump to get it out of my (overloaded) system. What I think the overall message is: let's try not to design new features around page table sharing; let's use page table sharing as an rmap performance optimization and as a mechanism to reduce page table overhead.

I hope what I said makes any sense, I might just be wrong.

--
Thanks,

David / dhildenb