On Mon, Jul 31, 2023 at 12:35:00PM +0800, Rongwei Wang wrote:
> Hi Matthew
>
> May I ask you another question about mshare under this RFC? I remember you said you would redesign mshare to be per-VMA rather than per-mapping (apologies if I remember wrongly) at the last MM alignment session. I also referred to you when re-coding this part in our internal version (based on this RFC). It seems that per-VMA sharing can simplify the structure of page table sharing and does not even have to care about differing permissions on the file mapping; those are the advantages (maybe) that I can see. But IMHO they do not seem like a strong enough reason to switch from per-mapping to per-VMA.
>
> And I can't guess what other considerations upstream may have. Can you share the reason for redesigning this in a per-VMA way -- is it to integrate with hugetlbfs page table sharing or anonymous page sharing?

It was David who wants to make page table sharing be per-VMA.  I think
he is advocating for the wrong approach.  In any case, I don't have time
to work on mshare and Khalid is on leave until September, so I don't
think anybody is actively working on mshare.

> Thanks for your time.
>
> On 2023/4/27 00:49, Khalid Aziz wrote:
> > Memory pages shared between processes require a page table entry (PTE) for each process. Each of these PTEs consumes some memory, and as long as the number of mappings being maintained is small enough, the space consumed by page tables is not objectionable. When very few memory pages are shared between processes, the number of page table entries (PTEs) to maintain is mostly constrained by the number of pages of memory on the system. As the number of shared pages and the number of times pages are shared goes up, the amount of memory consumed by page tables starts to become significant. This issue does not apply to threads: any number of threads can share the same pages inside a process while sharing the same PTEs. Extending this same model to sharing pages across processes can eliminate this issue for sharing across processes as well.
> >
> > Some field deployments commonly see memory pages shared across 1000s of processes. On x86_64, each page requires a PTE that is only 8 bytes long, which is very small compared to the 4K page size. But when 2000 processes map the same page in their address space, each one of them requires 8 bytes for its PTE, and together that adds up to 16K of memory just to hold the PTEs for one 4K page. On a database server with a 300GB SGA, a system crash with an out-of-memory condition was seen when 1500+ clients tried to share this SGA even though the system had 512GB of memory. On this server, the worst case scenario of all 1500 processes mapping every page from the SGA would have required 878GB+ for just the PTEs. If these PTEs could be shared, the amount of memory saved is very significant.
> >
> > This patch series adds a new flag to the mmap() call - MAP_SHARED_PT. This flag can be specified along with MAP_SHARED by a process to hint to the kernel that it wishes to share page table entries for this file-mapping mmap region with other processes. Any other process that mmaps the same file with the MAP_SHARED_PT flag can then share the same page table entries. Besides specifying the MAP_SHARED_PT flag, the processes must map the files at a PMD-aligned address, with a size that is a multiple of the PMD size, and at the same virtual addresses. This last requirement of same virtual addresses can possibly be relaxed if that is the consensus.
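For anyone reading along, roughly what the userspace opt-in described above would look like -- a minimal sketch only, assuming headers built from this series (MAP_SHARED_PT does not exist in mainline), a 2 MB PMD as on x86_64 with 4K pages, and an arbitrarily chosen fixed address and file path:

/*
 * Sketch only: MAP_SHARED_PT comes from the patched uapi headers and is
 * not in mainline.  PMD_SZ, ADDR and the file path are illustrative; the
 * series requires a PMD-aligned address, a PMD-multiple length and the
 * same virtual address in every sharing process (hence MAP_FIXED here).
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>

#define PMD_SZ	(2UL * 1024 * 1024)		/* assumed 2 MB PMD size */
#define ADDR	((void *)0x7f0000000000UL)	/* same address in every process */
#define LEN	(512 * PMD_SZ)			/* 1 GB, a multiple of PMD_SZ */

int main(void)
{
	int fd = open("/dev/shm/sga", O_RDWR);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/*
	 * MAP_SHARED_PT on top of MAP_SHARED asks the kernel to share the
	 * page tables, not just the pages, with other mappers of this file.
	 */
	void *p = mmap(ADDR, LEN, PROT_READ | PROT_WRITE,
		       MAP_SHARED | MAP_SHARED_PT | MAP_FIXED, fd, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	printf("mapped %lu bytes at %p with shared page tables\n", LEN, p);
	return 0;
}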
> > When mmap() is called with the MAP_SHARED_PT flag, a new host mm struct is created to hold the shared page tables. The host mm struct is not attached to a process. Start and size of the host mm are set to the start and size of the mmap region, and a VMA covering this range is also added to the host mm struct. Existing page table entries from the process that creates the mapping are copied over to the host mm struct. All processes mapping this shared region are considered guest processes. When a guest process mmaps the shared region, a vm flag VM_SHARED_PT is added to the VMAs in the guest process. Upon a page fault, the VMA is checked for the presence of the VM_SHARED_PT flag. If the flag is found, its corresponding PMD is updated with the PMD from the host mm struct so the PMD will point to the page tables in the host mm struct. The vm_mm pointer of the VMA is also updated to point to the host mm struct for the duration of fault handling, to ensure fault handling happens in the context of the host mm struct. When a new PTE is created, it is created in the host mm struct's page tables and the PMD in the guest mm points to the same PTEs.
> >
> > This is a basic working implementation. It will need to go through more testing and refinements. Some notes and questions:
> >
> > - PMD size alignment and size requirement is currently hard coded in. Is there a need or desire to make this more flexible and work with other alignments/sizes? PMD size allows for adapting this infrastructure to form the basis for hugetlbfs page table sharing as well. More work will be needed to make that happen.
> >
> > - Is there a reason to allow a userspace app to query this size and alignment requirement for MAP_SHARED_PT in some way?
> >
> > - Shared PTEs mean that an mprotect() call made by one process affects all processes sharing the same mapping, and that behavior will need to be documented clearly. The effect of an mprotect call being different for processes using shared page tables is the primary reason to require an explicit opt-in from userspace processes to share page tables. With transparent sharing derived from MAP_SHARED alone, the changed effect of mprotect could break a significant number of userspace apps. One could work around that by unsharing whenever mprotect changes modes on a shared mapping, but that introduces complexity, and the capability to execute a single mprotect to change modes across 1000s of processes sharing a mapped database is a feature explicitly asked for by database folks. This capability has a significant performance advantage compared to sending messages to every process using the shared mapping to call mprotect and change modes in each process, or using traps on permissions mismatch in each process.
> >
> > - This implementation does not allow unmapping page table shared mappings partially. Should that be supported in the future?
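On the mprotect() point in the list above, the semantics are easy to show from userspace: because all sharers use one set of PTEs, a single permission change is visible to every sharing process at once. A hypothetical helper, assuming the region was mapped with MAP_SHARED | MAP_SHARED_PT at the same address in every process as in the earlier sketch:

/*
 * Sketch: assumes all cooperating processes mapped the same file at the
 * same address with MAP_SHARED | MAP_SHARED_PT (see the earlier example).
 * With shared page tables, a single mprotect() here changes the access
 * permissions for every sharing process -- the behaviour the cover letter
 * says must be documented and explicitly opted into.
 */
#include <sys/mman.h>
#include <stdio.h>

int make_region_readonly(void *shared_base, size_t shared_len)
{
	/*
	 * One call, one set of PTEs: every process sharing these page
	 * tables now sees the region as read-only; their next write faults.
	 */
	if (mprotect(shared_base, shared_len, PROT_READ) < 0) {
		perror("mprotect");
		return -1;
	}
	return 0;
}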
> > Some concerns in this RFC:
> >
> > - When page tables for a process are freed upon process exit, pmd_free_tlb() gets called at one point to free all PMDs allocated by the process. For a shared page table, shared PMDs cannot be released when a guest process exits. These shared PMDs are released when the host mm struct is released, upon the end of the last reference to the page table shared region hosted by this mm. For now, to stop PMDs being released, this RFC introduces the following change in mm/memory.c, which works but does not feel like the right approach. Any suggestions for a better long-term approach would be very much appreciated:
> >
> > @@ -210,13 +221,19 @@ static inline void free_pmd_range(struct mmu_gather *tlb,
> >                                   pud_t *pud,
> >
> >         pmd = pmd_offset(pud, start);
> >         pud_clear(pud);
> > -       pmd_free_tlb(tlb, pmd, start);
> > -       mm_dec_nr_pmds(tlb->mm);
> > +       if (shared_pte) {
> > +               tlb_flush_pud_range(tlb, start, PAGE_SIZE);
> > +               tlb->freed_tables = 1;
> > +       } else {
> > +               pmd_free_tlb(tlb, pmd, start);
> > +               mm_dec_nr_pmds(tlb->mm);
> > +       }
> >  }
> >
> >  static inline void free_pud_range(struct mmu_gather *tlb, p4d_t *p4d,
> >
> > - This implementation requires an additional VM flag. Since all lower 32 bits are currently in use, the new VM flag must come from the upper 32 bits, which restricts this feature to 64-bit processors.
> >
> > - This feature is implemented for file mappings only. Is there a need to support it for anonymous memory as well?
> >
> > - Accounting for MAP_SHARED_PT mapped filepages in a process and page table bytes is not quite accurate yet in this RFC and will be fixed in the non-RFC version of the patches.
> >
> > I appreciate any feedback on these patches and ideas for improvements before moving these patches out of the RFC stage.
> >
> >
> > Changes from RFC v1:
> > - Broke the patches up into smaller patches
> > - Fixed a few bugs related to freeing PTEs and PMDs incorrectly
> > - Cleaned up the code a bit
> >
> >
> > Khalid Aziz (4):
> >   mm/ptshare: Add vm flag for shared PTE
> >   mm/ptshare: Add flag MAP_SHARED_PT to mmap()
> >   mm/ptshare: Create new mm struct for page table sharing
> >   mm/ptshare: Add page fault handling for page table shared regions
> >
> >  include/linux/fs.h                     |   2 +
> >  include/linux/mm.h                     |   8 +
> >  include/trace/events/mmflags.h         |   3 +-
> >  include/uapi/asm-generic/mman-common.h |   1 +
> >  mm/Makefile                            |   2 +-
> >  mm/internal.h                          |  21 ++
> >  mm/memory.c                            | 105 ++++++++--
> >  mm/mmap.c                              |  88 +++++++++
> >  mm/ptshare.c                           | 263 +++++++++++++++++++++++++
> >  9 files changed, 476 insertions(+), 17 deletions(-)
> >  create mode 100644 mm/ptshare.c
> >
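As a back-of-the-envelope check of the PTE numbers quoted at the top of the cover letter (a standalone sketch; it assumes 4K pages, 8-byte PTEs, and reads 300GB as 300 GiB):

/*
 * Sketch: reproduces the worst-case PTE overhead quoted in the cover
 * letter for a 300 GB SGA mapped by 1500 processes, assuming 4 KiB pages
 * and 8-byte PTEs on x86_64.
 */
#include <stdio.h>

int main(void)
{
	unsigned long long sga   = 300ULL << 30;	/* 300 GiB SGA */
	unsigned long long pages = sga >> 12;		/* 4 KiB pages */
	unsigned long long ptes  = pages * 8;		/* 8-byte PTEs per process */
	unsigned long long total = ptes * 1500;		/* 1500 processes */

	printf("per-process PTEs: %llu MiB\n", ptes >> 20);
	printf("1500 processes:   %llu GiB\n", total >> 30);
	return 0;
}

which prints 600 MiB per process and 878 GiB in total, consistent with the 878GB+ figure above.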