RE: [RFC PATCH 0/6] Add support for shared PTEs across processes

"Longpeng (Mike, Cloud Infrastructure Service Product Dept.)" <longpeng2@xxxxxxxxxx> · Sat, 22 Jan 2022 01:39:46 +0000

> -----Original Message-----
> From: Khalid Aziz [mailto:khalid.aziz@xxxxxxxxxx]
> Sent: Saturday, January 22, 2022 12:42 AM
> To: Matthew Wilcox <willy@xxxxxxxxxxxxx>; Barry Song <21cnbao@xxxxxxxxx>
> Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>; Arnd Bergmann <arnd@xxxxxxxx>;
> Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>; David Hildenbrand
> <david@xxxxxxxxxx>; LKML <linux-kernel@xxxxxxxxxxxxxxx>; Linux-MM
> <linux-mm@xxxxxxxxx>; Longpeng (Mike, Cloud Infrastructure Service Product
> Dept.) <longpeng2@xxxxxxxxxx>; Mike Rapoport <rppt@xxxxxxxxxx>; Suren
> Baghdasaryan <surenb@xxxxxxxxxx>
> Subject: Re: [RFC PATCH 0/6] Add support for shared PTEs across processes
> 
> On 1/21/22 07:47, Matthew Wilcox wrote:
> > On Fri, Jan 21, 2022 at 08:35:17PM +1300, Barry Song wrote:
> >> On Fri, Jan 21, 2022 at 3:13 PM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
> >>> On Fri, Jan 21, 2022 at 09:08:06AM +0800, Barry Song wrote:
> >>>>> A file under /sys/fs/mshare can be opened and read from. A read from
> >>>>> this file returns two long values - (1) starting address, and (2)
> >>>>> size of the mshare'd region.
> >>>>>
> >>>>> --
> >>>>> int mshare_unlink(char *name)
> >>>>>
> >>>>> A shared address range created by mshare() can be destroyed using
> >>>>> mshare_unlink() which removes the  shared named object. Once all
> >>>>> processes have unmapped the shared object, the shared address range
> >>>>> references are de-allocated and destroyed.
> >>>>
> >>>>> mshare_unlink() returns 0 on success or -1 on error.
> >>>>
> >>>> I am still struggling with the user scenarios of these new APIs. This patch
> >>>> supposes multiple processes will have same virtual address for the shared
> >>>> area? How can this be guaranteed while different processes can map different
> >>>> stack, heap, libraries, files?
> >>>
> >>> The two processes choose to share a chunk of their address space.
> >>> They can map anything they like in that shared area, and then also
> >>> anything they like in the areas that aren't shared.  They can choose
> >>> for that shared area to have the same address in both processes
> >>> or different locations in each process.
> >>>
> >>> If two processes want to put a shared library in that shared address
> >>> space, that should work.  They probably would need to agree to use
> >>> the same virtual address for the shared page tables for that to work.
> >>
> >> we are depending on an elf loader and ld to map the library
> >> dynamically , so hardly
> >> can we find a chance in users' code to call mshare() to map libraries
> >> in application
> >> level?
> >
> > If somebody wants to modify ld.so to take advantage of mshare(), they
> > could.  That wasn't our primary motivation here, so if it turns out to
> > not work for that usecase, well, that's a shame.
> >
> >>> Think of this like hugetlbfs, only instead of sharing hugetlbfs
> >>> memory, you can share _anything_ that's mmapable.
> >>
> >> yep, we can call mshare() on any kind of memory. for example, if multiple
> >> processes use SYSV shmem, posix shmem or mmap the same file. but
> >> it seems it is more sensible to let kernel do it automatically rather than
> >> depending on calling mshare() from users? It is difficult for users to
> >> decide which areas should be applied mshare(). users might want to call
> >> mshare() for all shared areas to save memory coming from duplicated PTEs?
> >> unlike SYSV shmem and POSIX shmem which are a feature for inter-processes
> >> communications,  mshare() looks not like a feature for applications,
> >> but like a feature
> >> for the whole system level? why would applications have to call something
> which
> >> doesn't directly help them? without mshare(), those applications
> >> will still work without any problem, right? is there anything in
> >> mshare() which is
> >> a must-have for applications? or mshare() is only a suggestion from
> applications
> >> like madvise()?
> >
> > Our use case is that we have some very large files stored on persistent
> > memory which we want to mmap in thousands of processes.  So the first
> > one shares a chunk of its address space and mmaps all the files into
> > that chunk of address space.  Subsequent processes find that a suitable
> > address space already exists and use it, sharing the page tables and
> > avoiding the calls to mmap.
> >
> > Sharing page tables is akin to running multiple threads in a single
> > address space; except that only part of the address space is the same.
> > There does need to be a certain amount of trust between the processes
> > sharing the address space.  You don't want to do it to an unsuspecting
> > process.
> >
> 
> Hello Barry,
> 
> mshare() is really meant for sharing data across unrelated processes by sharing
> address space explicitly and hence
> opt-in is required. As Matthew said, the processes sharing this virtual address
> space need to have a level of trust.
> Permissions on the msharefs files control who can access this shared address
> space. It is possible to adapt this
> mechanism to share stack, libraries etc but that is not the intent. This feature
> will be used by applications that share
> data with multiple processes using shared mapping normally and it helps them
> avoid the overhead of large number of
> duplicated PTEs which consume memory. This extra memory consumed by PTEs reduces
> amount of memory available for
> applications and can result in out-of-memory condition. An example from the patch
> 0/6:
> 
> "On a database server with 300GB SGA, a system crash was seen with
> out-of-memory condition when 1500+ clients tried to share this SGA
> even though the system had 512GB of memory. On this server, in the
> worst case scenario of all 1500 processes mapping every page from
> SGA would have required 878GB+ for just the PTEs. If these PTEs
> could be shared, amount of memory saved is very significant."
> 

The memory overhead of PTEs would be significantly saved if we use
hugetlbfs in this case, but why not?

> --
> Khalid