> -----Original Message-----
> From: Khalid Aziz [mailto:khalid.aziz@xxxxxxxxxx]
> Sent: Saturday, January 22, 2022 12:42 AM
> To: Matthew Wilcox <willy@xxxxxxxxxxxxx>; Barry Song <21cnbao@xxxxxxxxx>
> Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>; Arnd Bergmann <arnd@xxxxxxxx>;
> Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>; David Hildenbrand
> <david@xxxxxxxxxx>; LKML <linux-kernel@xxxxxxxxxxxxxxx>; Linux-MM
> <linux-mm@xxxxxxxxx>; Longpeng (Mike, Cloud Infrastructure Service Product
> Dept.) <longpeng2@xxxxxxxxxx>; Mike Rapoport <rppt@xxxxxxxxxx>; Suren
> Baghdasaryan <surenb@xxxxxxxxxx>
> Subject: Re: [RFC PATCH 0/6] Add support for shared PTEs across processes
>
> On 1/21/22 07:47, Matthew Wilcox wrote:
> > On Fri, Jan 21, 2022 at 08:35:17PM +1300, Barry Song wrote:
> >> On Fri, Jan 21, 2022 at 3:13 PM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
> >>> On Fri, Jan 21, 2022 at 09:08:06AM +0800, Barry Song wrote:
> >>>>> A file under /sys/fs/mshare can be opened and read from. A read from
> >>>>> this file returns two long values - (1) starting address, and (2)
> >>>>> size of the mshare'd region.
> >>>>>
> >>>>> --
> >>>>> int mshare_unlink(char *name)
> >>>>>
> >>>>> A shared address range created by mshare() can be destroyed using
> >>>>> mshare_unlink() which removes the shared named object. Once all
> >>>>> processes have unmapped the shared object, the shared address range
> >>>>> references are de-allocated and destroyed.
> >>>>
> >>>>> mshare_unlink() returns 0 on success or -1 on error.
> >>>>
> >>>> I am still struggling with the user scenarios of these new APIs. This patch
> >>>> supposes multiple processes will have same virtual address for the shared
> >>>> area? How can this be guaranteed while different processes can map different
> >>>> stack, heap, libraries, files?
> >>>
> >>> The two processes choose to share a chunk of their address space.
> >>> They can map anything they like in that shared area, and then also
> >>> anything they like in the areas that aren't shared. They can choose
> >>> for that shared area to have the same address in both processes
> >>> or different locations in each process.
> >>>
> >>> If two processes want to put a shared library in that shared address
> >>> space, that should work. They probably would need to agree to use
> >>> the same virtual address for the shared page tables for that to work.
> >>
> >> we are depending on an elf loader and ld to map the library
> >> dynamically, so hardly can we find a chance in users' code to call
> >> mshare() to map libraries in application level?
> >
> > If somebody wants to modify ld.so to take advantage of mshare(), they
> > could. That wasn't our primary motivation here, so if it turns out to
> > not work for that usecase, well, that's a shame.
> >
> >>> Think of this like hugetlbfs, only instead of sharing hugetlbfs
> >>> memory, you can share _anything_ that's mmapable.
> >>
> >> yep, we can call mshare() on any kind of memory. for example, if multiple
> >> processes use SYSV shmem, posix shmem or mmap the same file. but
> >> it seems it is more sensible to let kernel do it automatically rather than
> >> depending on calling mshare() from users? It is difficult for users to
> >> decide which areas should be applied mshare(). users might want to call
> >> mshare() for all shared areas to save memory coming from duplicated PTEs?
> >> unlike SYSV shmem and POSIX shmem which are a feature for inter-processes
> >> communications, mshare() looks not like a feature for applications,
> >> but like a feature for the whole system level? why would applications
> >> have to call something which doesn't directly help them? without
> >> mshare(), those applications will still work without any problem, right?
> >> is there anything in mshare() which is a must-have for applications?
> >> or mshare() is only a suggestion from applications like madvise()?
> >
> > Our use case is that we have some very large files stored on persistent
> > memory which we want to mmap in thousands of processes. So the first
> > one shares a chunk of its address space and mmaps all the files into
> > that chunk of address space. Subsequent processes find that a suitable
> > address space already exists and use it, sharing the page tables and
> > avoiding the calls to mmap.
> >
> > Sharing page tables is akin to running multiple threads in a single
> > address space; except that only part of the address space is the same.
> > There does need to be a certain amount of trust between the processes
> > sharing the address space. You don't want to do it to an unsuspecting
> > process.
> >
>
> Hello Barry,
>
> mshare() is really meant for sharing data across unrelated processes by
> sharing address space explicitly and hence opt-in is required. As Matthew
> said, the processes sharing this virtual address space need to have a
> level of trust. Permissions on the msharefs files control who can access
> this shared address space. It is possible to adapt this mechanism to
> share stack, libraries etc but that is not the intent. This feature will
> be used by applications that share data with multiple processes using
> shared mapping normally and it helps them avoid the overhead of large
> number of duplicated PTEs which consume memory. This extra memory
> consumed by PTEs reduces amount of memory available for applications and
> can result in out-of-memory condition. An example from the patch 0/6:
>
> "On a database server with 300GB SGA, a system crash was seen with
> out-of-memory condition when 1500+ clients tried to share this SGA
> even though the system had 512GB of memory. On this server, in the
> worst case scenario of all 1500 processes mapping every page from
> SGA would have required 878GB+ for just the PTEs.
> If these PTEs could be shared, amount of memory saved is very significant."
>

The PTE memory overhead would be greatly reduced if hugetlbfs were used in
this case, so why not use hugetlbfs here?

> --
> Khalid