Re: [PATCH v1 00/14] Add support for shared PTEs across processes

Khalid Aziz <khalid.aziz@xxxxxxxxxx> · Wed, 29 Jun 2022 11:48:17 -0600

On 5/30/22 05:18, David Hildenbrand wrote:
On 30.05.22 12:48, Barry Song wrote:
On Tue, Apr 12, 2022 at 4:07 AM Khalid Aziz <khalid.aziz@xxxxxxxxxx> wrote:

Page tables in the kernel consume memory, and as long as the number
of mappings being maintained is small enough, the space consumed by
page tables is not objectionable. When very few memory pages are
shared between processes, the number of page table entries (PTEs) to
maintain is mostly constrained by the number of pages of memory on the
system. As the number of shared pages and the number of times pages
are shared go up, the amount of memory consumed by page tables starts
to become significant.

Some field deployments commonly see memory pages shared across
1000s of processes. On x86_64, each page requires a PTE that is only 8
bytes long, which is very small compared to the 4K page size. When 2000
processes map the same page in their address space, each one of them
requires 8 bytes for its PTE, and together that adds up to about 16K of
memory just to hold the PTEs for one 4K page. On a database server with
a 300GB SGA, a system crash was seen with an out-of-memory condition
when 1500+ clients tried to share this SGA, even though the system had
512GB of memory. On this server, the worst-case scenario of all 1500
processes mapping every page of the SGA would have required 878GB+ for
just the PTEs. If these PTEs could be shared, the amount of memory
saved would be very significant.
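
To make the arithmetic above easy to re-check, here is a minimal sketch
in C. The function names are made up for illustration and are not part
of the patch series. It counts only last-level PTEs (8 bytes each, 4K
pages, as stated above) and ignores the upper levels of the page table,
so it is a lower bound rather than an exact accounting.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helpers, not from the patch series.
 * Bytes of last-level PTEs needed when every one of `nprocs` processes
 * maps all of `region_bytes` through its own private page tables.
 */
uint64_t pte_bytes_unshared(uint64_t region_bytes, uint64_t nprocs,
                            uint64_t page_size, uint64_t pte_size)
{
        assert(page_size != 0);
        return (region_bytes / page_size) * pte_size * nprocs;
}

/*
 * Same region, but with a single set of PTEs shared by all processes,
 * so the cost no longer depends on the number of processes.
 */
uint64_t pte_bytes_shared(uint64_t region_bytes, uint64_t page_size,
                          uint64_t pte_size)
{
        assert(page_size != 0);
        return (region_bytes / page_size) * pte_size;
}
```

With region_bytes = 300GB, nprocs = 1500, page_size = 4096 and
pte_size = 8, the unshared figure is 943718400000 bytes (a little under
879GB), while the shared figure is 629145600 bytes (600MB).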

This patch series implements a mechanism in the kernel to allow
userspace processes to opt into sharing PTEs. It adds two new system
calls - (1) mshare(), which can be used by a process to create a region
(we will call it an mshare'd region) which can be used by other
processes to map the same pages using shared PTEs, (2) mshare_unlink(),
which is used to detach from the mshare'd region. Once an mshare'd
region is created, other process(es), assuming they have the right
permissions, can make the mshare() system call to map the shared pages
into their address space using the shared PTEs. When a process is done
using this mshare'd region, it makes an mshare_unlink() system call to
end its access. When the last process accessing the mshare'd region
calls mshare_unlink(), the mshare'd region is torn down and the memory
used by it is freed.

API
===

The mshare API consists of two system calls - mshare() and mshare_unlink().

--
int mshare(char *name, void *addr, size_t length, int oflags, mode_t mode)

mshare() creates and opens a new, or opens an existing mshare'd
region that will be shared at PTE level. "name" refers to shared object
name that exists under /sys/fs/mshare. "addr" is the starting address
of this shared memory area and length is the size of this area.
oflags is a combination of:

- O_RDONLY opens shared memory area for read only access by everyone
- O_RDWR opens shared memory area for read and write access
- O_CREAT creates the named shared memory area if it does not exist
- O_EXCL If O_CREAT was also specified, and a shared memory area
   exists with that name, return an error.

mode represents the creation mode for the shared object under
/sys/fs/mshare.

mshare() returns an error code if it fails, otherwise it returns 0.

PTEs are shared at pgdir level, which imposes the following
requirements on the address and size given to mshare():

- Starting address must be aligned to pgdir size (512GB on x86_64).
   This alignment value can be looked up in /proc/sys/vm/mshare_size
- Size must be a multiple of pgdir size
- Any mappings created in this address range at any time become
   shared automatically
- Shared address range can have unmapped addresses in it. Any access
   to an unmapped address will result in SIGBUS
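
The first two requirements can be checked in userspace before making
the call. The following is only a sketch: the helper names and the
PGDIR_SIZE_X86_64 constant are made up for illustration, and rejecting
a zero length is an assumption rather than documented behavior. A real
program would read the alignment value from the kernel instead of
hardcoding 512GB.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* 512GB, the pgdir size quoted above for x86_64 (illustrative only). */
#define PGDIR_SIZE_X86_64 (512ULL << 30)

/*
 * Hypothetical helper: true if addr and length satisfy the alignment
 * rules listed above. Treating length == 0 as invalid is an assumption.
 */
bool mshare_range_ok(uint64_t addr, uint64_t length, uint64_t pgdir_size)
{
        assert(pgdir_size != 0);
        if (length == 0)
                return false;
        return addr % pgdir_size == 0 && length % pgdir_size == 0;
}

/* Hypothetical helper: round length up to a multiple of pgdir_size. */
uint64_t mshare_round_up(uint64_t length, uint64_t pgdir_size)
{
        assert(pgdir_size != 0);
        return (length + pgdir_size - 1) / pgdir_size * pgdir_size;
}
```

For example, a 300GB area would need to be rounded up to a 512GB
region, and a start address of 2TB passes the check because 2TB is a
multiple of 512GB.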

Mappings within this address range behave as if they were shared
between threads, so a write to a MAP_PRIVATE mapping will create a
page which is shared between all the sharers. The first process that
declares an address range mshare'd can continue to map objects in
the shared area. All other processes that want mshare'd access to
this memory area can do so by calling mshare(). After this call, the
address range given by mshare becomes a shared range in its address
space. Anonymous mappings will be shared and not COWed.

A file under /sys/fs/mshare can be opened and read from. A read from
this file returns two long values - (1) starting address, and (2)
size of the mshare'd region.

--
int mshare_unlink(char *name)

A shared address range created by mshare() can be destroyed using
mshare_unlink(), which removes the named shared object. Once all
processes have unmapped the shared object, the shared address range
references are de-allocated and destroyed.

mshare_unlink() returns 0 on success or -1 on error.

Example Code
============

Snippet of the code that a donor process would run looks like below:

-----------------
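         /* Anonymous mappings take fd -1 for portability */
         addr = mmap((void *)TB(2), GB(512), PROT_READ | PROT_WRITE,
                         MAP_SHARED | MAP_ANONYMOUS, -1, 0);
         if (addr == MAP_FAILED) {
                 perror("ERROR: mmap failed");
                 exit(1);
         }

         /* mode is octal: 0600 gives the owner read/write access */
         err = syscall(MSHARE_SYSCALL, "testregion", (void *)TB(2),
                         GB(512), O_CREAT|O_RDWR|O_EXCL, 0600);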
         if (err < 0) {
                 perror("mshare() syscall failed");
                 exit(1);
         }

         strncpy(addr, "Some random shared text",
                         sizeof("Some random shared text"));
-----------------

Snippet of code that a consumer process would execute looks like:

-----------------
         struct mshare_info minfo;

         /* The named object lives under /sys/fs/mshare */
         fd = open("/sys/fs/mshare/testregion", O_RDONLY);
         if (fd < 0) {
                 perror("open failed");
                 exit(1);
         }

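         /* Store the byte count first, then test it */
         count = read(fd, &minfo, sizeof(struct mshare_info));
         if (count > 0) {
                 printf("INFO: %ld bytes shared at addr 0x%lx \n",
                                 minfo.size, minfo.start);
         } else {
                 /* minfo is not valid, so do not go on to use it */
                 perror("read failed");
                 exit(1);
         }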

         close(fd);

         addr = (void *)minfo.start;
         err = syscall(MSHARE_SYSCALL, "testregion", addr, minfo.size,
                         O_RDWR, 0600);
         if (err < 0) {
                 perror("mshare() syscall failed");
                 exit(1);
         }

         /* %px is a kernel printk format; userspace printf uses %p */
         printf("Guest mmap at %p:\n", addr);
         printf("%s\n", (char *)addr);
         printf("\nDone\n");

         err = syscall(MSHARE_UNLINK_SYSCALL, "testregion");
         if (err < 0) {
                 perror("mshare_unlink() failed");
                 exit(1);
         }
-----------------

Does that mean those shared pages will get page_mapcount()=1 ?

AFAIU, for mshare() that is the case.

A big pain for a memory-limited system like a desktop/embedded system is
that reverse mapping takes tons of CPU in the memory reclamation path,
especially for pages mapped by multiple processes. Sometimes we see
100% CPU utilization while scanning the LRU, just to find out whether a
page has been accessed by reading the young bit of each PTE.

Regarding PTE-table sharing:

Even if we'd account each logical mapping (independent of page table
sharing) in the page_mapcount(), we would benefit from page table
sharing. Simply when we unmap the page from the shared page table, we'd
have to adjust the mapcount accordingly. So unmapping from a single
(shared) pagetable could directly result in the mapcount dropping to zero.

What I am trying to say is: how the mapcount is handled might be an
implementation detail for PTE-sharing. Not sure how hugetlb handles that
with its PMD-table sharing.

We'd have to clarify what the mapcount actually expresses. Having the
mapcount express "is this page mapped by multiple processes or at
multiple VMAs" might be helpful in some cases. Right now it mostly
expresses exactly that.

Right, that is the question - what does mapcount represent? I am
interpreting mapcount as the number of PTEs that map the page. Since
mshare uses one PTE for each shared page irrespective of how many
processes share the page, a mapcount of 1 sounds reasonable to me.

If we end up with only one PTE with this patchset, it means we get a
significant performance improvement in kernel LRU handling,
particularly when free memory approaches the watermarks.

But I don't see how a new system call like mshare() can be used
by those systems, as they might need a more automatic PTE sharing
mechanism.

IMHO, we should look into automatic PTE-table sharing of MAP_SHARED
mappings, similar to what hugetlb provides for PMD table sharing, which
leaves semantics unchanged for existing user space. Maybe there is a way
to factor that out and reuse it for PTE-table sharing.

I can understand that there are use cases for explicit sharing with new
(e.g., mprotect) semantics.

It is tempting to make this sharing automatic and mshare may evolve to
that. Since mshare assumes significant trust between the processes
sharing pages (shared pages share attributes and possibly protection
keys), it sounds dangerous to make that assumption automatically
without processes explicitly declaring that level of trust.

Thanks,
Khalid