On Mon, Aug 16, 2021 at 03:24:38PM +0200, David Hildenbrand wrote:
> On 16.08.21 14:46, Matthew Wilcox wrote:
> > On Mon, Aug 16, 2021 at 02:20:43PM +0200, David Hildenbrand wrote:
> > > On 16.08.21 14:07, Matthew Wilcox wrote:
> > > > On Mon, Aug 16, 2021 at 10:02:22AM +0200, David Hildenbrand wrote:
> > > > > > Mappings within this address range behave as if they were shared
> > > > > > between threads, so a write to a MAP_PRIVATE mapping will create a
> > > > > > page which is shared between all the sharers. The first process that
> > > > > > declares an address range mshare'd can continue to map objects in the
> > > > > > shared area. All other processes that want mshare'd access to this
> > > > > > memory area can do so by calling mshare(). After this call, the
> > > > > > address range given by mshare becomes a shared range in its address
> > > > > > space. Anonymous mappings will be shared and not COWed.
> > > > >
> > > > > Did I understand correctly that you want to share actual page tables
> > > > > between processes and consequently different MMs? That sounds like a
> > > > > very bad idea.
> > > >
> > > > That is the entire point. Consider a machine with 10,000 instances
> > > > of an application running (process model, not thread model). If each
> > > > application wants to map 1TB of RAM using 2MB pages, that's 4MB of page
> > > > tables per process or 40GB of RAM for the whole machine.
> > >
> > > What speaks against 1 GB pages then?
> >
> > Until recently, the CPUs only having 4 1GB TLB entries. I'm sure we
> > still have customers using that generation of CPUs. 2MB pages perform
> > better than 1GB pages on the previous generation of hardware, and I
> > haven't seen numbers for the next generation yet.
>
> I read that somewhere else before, yet we have heavy 1 GiB page users,
> especially in the context of VMs and DPDK.

I wonder if those users actually benchmarked. Or whether the memory
savings worked out so well for them that the loss of TLB performance
didn't matter.

> So, it only works for hugetlbfs in case uffd is not in place (-> no
> per-process data in the page table) and we have actual shared mappings.
> When unsharing, we zap the PUD entry, which will result in allocating a
> per-process page table on next fault.

I think uffd was a huge mistake. It should have been a filesystem
instead of a hack on the side of anonymous memory.

> I will rephrase my previous statement "hugetlbfs just doesn't raise these
> problems because we are special casing it all over the place already".
> For example, not allowing to swap such pages. Disallowing MADV_DONTNEED.
> Special hugetlbfs locking.

Sure, that's why I want to drag this feature out of "oh this is a
hugetlb special case" and into "this is something Linux supports".
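
The cover-letter text quoted at the top of the thread describes how mshare()
is meant to be used. Below is a minimal user-space sketch of that flow; the
prototype, the MSHARE_CREATE flag and the "appcache" name are assumptions
invented purely for illustration (the mail does not give the real interface),
and the stub only prints what a real call would be asked to do.

/*
 * Hypothetical sketch of the mshare() usage described above.  The prototype,
 * flag and region name are invented for illustration; the stub below just
 * prints what a real system call would be asked to do.
 */
#include <stdio.h>
#include <stddef.h>

#define MSHARE_CREATE 0x1	/* hypothetical: first sharer declares the range */

/* Stub standing in for the proposed system call. */
static int mshare(const char *name, void *addr, size_t len, int flags)
{
	printf("mshare(\"%s\", %p, %zu, %#x)\n", name, addr, len, flags);
	return 0;
}

int main(void)
{
	void *base = (void *)0x7f0000000000UL;	/* arbitrary example address */
	size_t len = (size_t)1 << 40;		/* the 1TB region from the example */

	/*
	 * First process: declares the range mshare'd and then maps objects
	 * into it.  Per the quoted text, MAP_PRIVATE anonymous mappings made
	 * inside the range are shared with every sharer rather than COWed.
	 */
	mshare("appcache", base, len, MSHARE_CREATE);

	/*
	 * Every other process: attaches to the same range by calling
	 * mshare(); the kernel inserts the already-populated page tables
	 * into its address space instead of building private ones.
	 */
	mshare("appcache", base, len, 0);
	return 0;
}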
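
As a sanity check on the 4MB-per-process / 40GB-per-machine figures above,
here is a small back-of-the-envelope program. It assumes x86-64 conventions
(8-byte page table entries, one PMD entry per 2MB mapping) and ignores the
higher-level tables, which are comparatively tiny.

/*
 * Back-of-the-envelope check of the figures quoted above: mapping 1TB
 * with 2MB pages takes one 8-byte PMD entry per 2MB, i.e. ~4MB of page
 * tables per process, or ~40GB across 10,000 unshared copies.
 */
#include <stdio.h>

int main(void)
{
	const unsigned long long mapping   = 1ULL << 40;	/* 1TB mapped per process */
	const unsigned long long page_size = 2ULL << 20;	/* 2MB pages */
	const unsigned long long pmd_entry = 8;			/* bytes per entry (x86-64) */
	const unsigned long long nproc     = 10000;		/* process-model instances */

	unsigned long long entries  = mapping / page_size;	/* 524,288 PMD entries */
	unsigned long long per_proc = entries * pmd_entry;	/* 4MB of page tables */
	unsigned long long machine  = per_proc * nproc;

	printf("per process: %llu MiB\n", per_proc >> 20);
	printf("whole machine: %.1f GiB (quoted as ~40GB in the mail)\n",
	       (double)machine / (1ULL << 30));
	return 0;
}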