Re: [RFC PATCH 0/6] Add support for shared PTEs across processes

Khalid Aziz <khalid.aziz@xxxxxxxxxx> · Mon, 24 Jan 2022 15:30:23 -0700

On 1/24/22 12:45, Andy Lutomirski wrote:
On Mon, Jan 24, 2022 at 10:54 AM Khalid Aziz <khalid.aziz@xxxxxxxxxx> wrote:

On 1/22/22 04:31, Mike Rapoport wrote:
(added linux-api)

On Tue, Jan 18, 2022 at 02:19:12PM -0700, Khalid Aziz wrote:
Page tables in the kernel consume memory, and as long as the
number of mappings being maintained is small enough, the space
consumed by page tables is not objectionable. When very few memory
pages are shared between processes, the number of page table entries
(PTEs) to maintain is mostly constrained by the number of pages of
memory on the system. As the number of shared pages and the number
of times pages are shared go up, the amount of memory consumed by page
tables starts to become significant.

Some of the field deployments commonly see memory pages shared
across 1000s of processes. On x86_64, each page requires a PTE that
is only 8 bytes long which is very small compared to the 4K page
size. When 2000 processes map the same page in their address space,
each one of them requires 8 bytes for its PTE and together that adds
up to about 16K of memory just to hold the PTEs for one 4K page. On a
database server with a 300GB SGA, a system crash was seen with an
out-of-memory condition when 1500+ clients tried to share this SGA
even though the system had 512GB of memory. On this server, the
worst case scenario of all 1500 processes mapping every page of the
SGA would have required 878GB+ for just the PTEs. If these PTEs
could be shared, the amount of memory saved would be very significant.
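
As a rough sanity check of the 878GB+ figure, the arithmetic can be
sketched in C. This is only an illustration: the function names are
made up for this example, "GB" is taken as 2^30 bytes, and only
last-level PTEs are counted (upper-level page tables are ignored).

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE 4096ULL  /* 4K base pages on x86_64 */
#define PTE_SIZE  8ULL     /* bytes per PTE on x86_64 */

/*
 * Bytes of PTEs needed when nprocs processes each map all of
 * region_bytes through their own private page tables.
 * Hypothetical helper; upper-level tables are not counted.
 */
static uint64_t pte_bytes_unshared(uint64_t region_bytes, uint64_t nprocs)
{
    assert(region_bytes % PAGE_SIZE == 0);
    return (region_bytes / PAGE_SIZE) * PTE_SIZE * nprocs;
}

/*
 * With a single shared set of PTEs, the cost no longer scales with
 * the number of processes mapping the region.
 */
static uint64_t pte_bytes_shared(uint64_t region_bytes)
{
    return pte_bytes_unshared(region_bytes, 1);
}
```

For a 300GB region and 1500 processes, pte_bytes_unshared() comes to
just under 879GB, while pte_bytes_shared() comes to 600MB.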

This is a proposal to implement a mechanism in the kernel to allow
userspace processes to opt into sharing PTEs. The proposal is to add
a new system call - mshare(), which can be used by a process to
create a region (we will call it an mshare'd region) which can be used
by other processes to map the same pages using shared PTEs. Other
process(es), assuming they have the right permissions, can then make
the mshare() system call to map the shared pages into their address
space using the shared PTEs. When a process is done using this
mshare'd region, it makes an mshare_unlink() system call to end its
access. When the last process accessing the mshare'd region calls
mshare_unlink(), the mshare'd region is torn down and the memory used by
it is freed.

API Proposal
============

The mshare API consists of two system calls - mshare() and mshare_unlink()

--
int mshare(char *name, void *addr, size_t length, int oflags, mode_t mode)

mshare() creates and opens a new, or opens an existing mshare'd
region that will be shared at PTE level. "name" refers to shared object
name that exists under /sys/fs/mshare. "addr" is the starting address
of this shared memory area and length is the size of this area.
oflags can be a combination of:

- O_RDONLY opens shared memory area for read only access by everyone
- O_RDWR opens shared memory area for read and write access
- O_CREAT creates the named shared memory area if it does not exist
- O_EXCL If O_CREAT was also specified, and a shared memory area
    exists with that name, return an error.

mode represents the creation mode for the shared object under
/sys/fs/mshare.

mshare() returns an error code if it fails, otherwise it returns 0.
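
To make the oflags semantics concrete, here is a small userspace toy
model of just the name lookup/creation part of mshare(). It is not
kernel code and not the patch's implementation: model_mshare_open(),
the fixed-size name table, and the specific errno values (borrowed
from open(2)) are all assumptions made for this illustration. The
text above only says that an error code is returned on failure.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>

#define MAX_REGIONS 8
#define NAME_LEN    32

/* Toy stand-in for the names under /sys/fs/mshare. */
static char region_names[MAX_REGIONS][NAME_LEN];

/* Return the slot holding "name", or -1 if no such region exists. */
static int find_region(const char *name)
{
    for (int i = 0; i < MAX_REGIONS; i++)
        if (region_names[i][0] != '\0' &&
            strcmp(region_names[i], name) == 0)
            return i;
    return -1;
}

/*
 * Models only how oflags decide between "open existing" and "create".
 * Returns 0 on success or a negative errno (values are assumptions).
 */
static int model_mshare_open(const char *name, int oflags)
{
    assert(name != NULL && name[0] != '\0');
    if (strlen(name) >= NAME_LEN)
        return -ENAMETOOLONG;

    /* Region exists: only O_CREAT together with O_EXCL makes this fail. */
    if (find_region(name) >= 0)
        return ((oflags & O_CREAT) && (oflags & O_EXCL)) ? -EEXIST : 0;

    /* Region does not exist: it is created only if O_CREAT was given. */
    if (!(oflags & O_CREAT))
        return -ENOENT;

    for (int i = 0; i < MAX_REGIONS; i++) {
        if (region_names[i][0] == '\0') {
            strcpy(region_names[i], name);  /* length checked above */
            return 0;
        }
    }
    return -ENOSPC;  /* toy table is full */
}
```

In this model, opening "sga" with O_RDWR alone fails until some caller
has passed O_CREAT, and O_CREAT | O_EXCL fails once the name exists.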

Did you consider returning a file descriptor from the mshare() system call?
Then there would be no need for mshare_unlink() as close(fd) would work.

That is an interesting idea. It could work and would eliminate the need for a new system call. It could be confusing though
for application writers. A close() call with a side-effect of deleting the shared mapping would be odd. One of the use cases
for having files for mshare'd regions is to allow for orphaned mshare'd regions to be cleaned up by calling
mshare_unlink() with the region name. In the current implementation this can require calling mshare_unlink() multiple times to
bring the refcount for the mshare'd region to 0, at which point mshare_unlink() finally cleans up the region. This would be problematic
with close() semantics though, unless there was another way to force the refcount to 0. Right?

I'm not sure I understand the problem.  If you're sharing a portion of
an mm and the mm goes away, then all that should be left are some
struct files that are no longer useful.  They'll go away when their
refcount goes to zero.

--Andy

The mm that holds shared PTEs is a separate mm not tied to a task. I started out by sharing a portion of the donor process
but that necessitated keeping the donor process alive. If the donor process dies, its mm and the mshare'd
portion go away.

One of the requirements I have is that the process that creates the mshare'd region can terminate, possibly involuntarily, but
the mshare'd region persists and the rest of the consumer processes continue without a hiccup. So I create a separate mm to
hold shared PTEs and that mm is cleaned up when all references to the mshare'd region go away. Each call to mshare()
increments the refcount and each call to mshare_unlink() decrements the refcount.
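
To illustrate the refcounting described above, here is a userspace toy
model. It is not the kernel implementation: struct region_model and
the model_*() helpers are hypothetical names invented for this sketch,
and the separate mm is reduced to a single "live" flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one mshare'd region; field names are hypothetical. */
struct region_model {
    int  refcount;  /* one reference per mshare() call */
    bool live;      /* false once the region has been torn down */
};

/* Models mshare(): the first call creates the region, every call takes a reference. */
static void model_mshare(struct region_model *r)
{
    r->live = true;
    r->refcount++;
}

/*
 * Models mshare_unlink(): drops one reference. Returns true only when
 * this call dropped the last reference and tore the region down.
 */
static bool model_mshare_unlink(struct region_model *r)
{
    assert(r->live && r->refcount > 0);
    if (--r->refcount == 0) {
        r->live = false;
        return true;
    }
    return false;
}

/*
 * Cleans up an orphaned region by calling the unlink model until the
 * region is gone; returns how many calls that took.
 */
static int model_reap_orphan(struct region_model *r)
{
    int calls = 0;

    while (r->live) {
        model_mshare_unlink(r);
        calls++;
    }
    return calls;
}
```

In this model a region whose users all died without unlinking still
holds one reference per earlier mshare() call, so cleaning it up takes
that many mshare_unlink() calls.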

--
Khalid