On 13.08.21 21:49, Khalid Aziz wrote:
On Tue, 2021-07-13 at 00:57 +0000, Longpeng (Mike, Cloud Infrastructure
Service Product Dept.) wrote:
Hi Matthew,
-----Original Message-----
From: Matthew Wilcox [mailto:willy@xxxxxxxxxxxxx]
Sent: Monday, July 12, 2021 9:30 AM
To: Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
<longpeng2@xxxxxxxxxx>
Cc: Steven Sistare <steven.sistare@xxxxxxxxxx>; Anthony Yznaga
<anthony.yznaga@xxxxxxxxxx>; linux-kernel@xxxxxxxxxxxxxxx;
linux-mm@xxxxxxxxx; Gonglei (Arei) <arei.gonglei@xxxxxxxxxx>
Subject: Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
On Mon, Jul 12, 2021 at 09:05:45AM +0800, Longpeng (Mike, Cloud
Infrastructure Service Product Dept.) wrote:
Let me describe my use case more clearly (just ignore if you're not
interested in it):
1. Prog A mmap()s 4GB of memory (anon or file-mapping); suppose the
allocated VA range is [0x40000000,0x140000000).
2. Prog A specifies that [0x48000000,0x50000000) and
[0x80000000,0x100000000) will be shared with its child.
3. Prog A fork()s Prog B and then Prog B exec()s a new ELF binary.
4. Prog B notices the shared ranges (e.g. via input parameters or
...) and remaps them to a contiguous VA range, as in the sketch
below.
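For concreteness, a minimal sketch of the Prog A side. The
MADV_DOEXEC advice value is not in mainline uapi headers, so it is
defined locally here purely for illustration:

#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MAP_FIXED_NOREPLACE
#define MAP_FIXED_NOREPLACE 0x100000
#endif

/* Placeholder value; the real one would come from the RFC's uapi */
#ifndef MADV_DOEXEC
#define MADV_DOEXEC 32
#endif

int main(void)
{
	/* Step 1: map 4GB at 0x40000000 */
	void *base = mmap((void *)0x40000000UL, 4UL << 30,
			  PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED_NOREPLACE,
			  -1, 0);
	if (base == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Step 2: mark the two subranges to be preserved across exec() */
	if (madvise((void *)0x48000000UL, 0x08000000UL, MADV_DOEXEC) ||
	    madvise((void *)0x80000000UL, 0x80000000UL, MADV_DOEXEC)) {
		perror("madvise");
		return 1;
	}

	/* Steps 3 and 4: fork() and exec() Prog B; the marked ranges
	 * survive the exec. Prog B learns about them (here via argv)
	 * and can mremap() them to a contiguous VA range.
	 */
	if (fork() == 0) {
		execl("./progB", "progB",
		      "0x48000000-0x50000000", "0x80000000-0x100000000",
		      (char *)NULL);
		_exit(127);
	}
	return 0;
}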
This is dangerous. There must be an active step for Prog B to accept
Prog A's
ranges into its address space. Otherwise Prog A could almost
completely fill
Prog B's address space and so control where Prog B places its
mappings. It
could also provoke a latent bug in Prog B if it doesn't handle
address space
exhaustion gracefully.
I had a proposal to handle this. Would it meet your requirements?
https://lore.kernel.org/lkml/20200730152250.GG23808@xxxxxxxxxxxxxxxxxxxx/
I noticed your proposal for project Sileby and I think it can meet
Steven's requirement, but I'm not sure whether it's suitable for mine
because there's no sample code yet. Is it in progress?
Hi Mike,
I am working on refining the ideas from project Sileby. I am also
working on designing the implementation. Since the original concept,
the mshare API has evolved further. Here is what it looks like:
The mshare API consists of two system calls: mshare() and
mshare_unlink().
mshare
======
int mshare(char *name, void *addr, size_t length, int oflags,
mode_t mode)
mshare() creates and opens a new, or opens an existing, shared memory
area that will be shared at the PTE level. name refers to the shared
object's name under /dev/mshare (this is subject to change; there may
be better ways to manage the names for mshare'd areas). addr is the
starting address of the shared memory area and length is its size.
oflags can be one of:
O_RDONLY opens shared memory area for read only access by everyone
O_RDWR opens shared memory area for read and write access
O_CREAT creates the named shared memory area if it does not exist
O_EXCL If O_CREAT was also specified, and a shared memory area
exists with that name, return an error.
mode represents the creation mode for the shared object under
/dev/mshare.
Return Value
------------
mshare() returns a file descriptor. A read from this file descriptor
returns two long values - (1) starting address, and (2) size of the
shared memory area.
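A usage sketch of the creator side (hypothetical: the proposed
syscall has no wrapper or number yet, and the object name below is
made up): create the region, then read back its geometry.

#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical prototype for the proposed system call */
int mshare(char *name, void *addr, size_t length, int oflags,
	   mode_t mode);

int main(void)
{
	/* addr and length must be pgdir-aligned; see the notes below */
	void *addr = (void *)(1UL << 40);	/* 1TB, 512GB-aligned */
	size_t len = 512UL << 30;		/* one pgdir's worth */

	int fd = mshare("vm_pages", addr, len, O_CREAT | O_RDWR, 0600);
	if (fd < 0) {
		perror("mshare");
		return 1;
	}

	/* The read returns the starting address and size as two longs */
	long info[2];
	if (read(fd, info, sizeof(info)) != (ssize_t)sizeof(info)) {
		perror("read");
		return 1;
	}
	printf("shared range: addr=%#lx size=%#lx\n", info[0], info[1]);
	return 0;
}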
Notes
-----
PTEs are shared at the pgdir level, which imposes the following
requirements on the address and size given to mshare():
- Starting address must be aligned to pgdir size (512GB on x86_64)
- Size must be a multiple of pgdir size
- Any mappings created in this address range at any time become
shared automatically
- The shared address range can contain unmapped addresses; any
access to an unmapped address will result in SIGBUS
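Expressed as code, the alignment rule from the first two notes would
look something like this (values for x86_64 with 4-level paging):

#define PGDIR_SHIFT	39			/* x86_64, 4-level paging */
#define PGDIR_SIZE	(1UL << PGDIR_SHIFT)	/* 512GB */

/* What mshare() would have to verify about its addr/length arguments */
static int mshare_args_aligned(unsigned long addr, unsigned long len)
{
	return !(addr & (PGDIR_SIZE - 1)) &&
	       len && !(len & (PGDIR_SIZE - 1));
}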
Mappings within this address range behave as if they were shared
between threads, so a write to a MAP_PRIVATE mapping will create a
page which is shared between all the sharers. The first process that
declares an address range mshare'd can continue to map objects in the
shared area. All other processes that want mshare'd access to this
memory area can do so by calling mshare(). After this call, the
address range given by mshare becomes a shared range in its address
space. Anonymous mappings will be shared and not COWed.
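A matching sketch for a second process attaching (again hypothetical;
the description above does not spell out which addr/length a
non-creating caller passes, so NULL/0 is an assumption here):

#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

int mshare(char *name, void *addr, size_t length, int oflags,
	   mode_t mode);

int main(void)
{
	int fd = mshare("vm_pages", NULL, 0, O_RDWR, 0);
	if (fd < 0) {
		perror("mshare");
		return 1;
	}

	long info[2];
	if (read(fd, info, sizeof(info)) != (ssize_t)sizeof(info)) {
		perror("read");
		return 1;
	}

	/* [info[0], info[0] + info[1]) is now shared at the page table
	 * level: stores here are visible to every sharer, even through
	 * what would otherwise be MAP_PRIVATE mappings. This assumes
	 * the creator mapped something at the start of the range;
	 * otherwise the access raises SIGBUS per the notes above.
	 */
	*(volatile long *)info[0] = 42;
	return 0;
}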
Did I understand correctly that you want to share actual page tables
between processes and consequently different MMs? That sounds like a
very bad idea.
--
Thanks,
David / dhildenb