Re: [LSF/MM/BPF TOPIC] Sharing page tables across processes (mshare)

On 2/29/24 02:21, David Hildenbrand wrote:
On 28.02.24 23:56, Khalid Aziz wrote:
Threads of a process share the address space and page tables, which
allows for two key advantages:

1. The amount of memory required for PTEs to map physical pages stays
low even when a large number of threads share the same pages, since
the PTEs are shared across threads.

2. Page protection attributes are shared across threads, and a change
of attributes applies immediately to every thread without any overhead
of coordinating protection bit changes across threads.

These advantages no longer apply when unrelated processes share pages.
Large database applications can easily comprise 1000s of processes
that share 100s of GB of pages. In cases like this, the amount of
memory consumed by page tables can exceed the size of the actual
shared data. On a database server with a 300GB SGA, a system crash
due to an out-of-memory condition was seen when 1500+ clients tried
to share this SGA, even though the system had 512GB of memory. On
this server, the worst-case scenario of all 1500 processes mapping
every page of the SGA would have required 878GB+ for the PTEs alone.

I have sent proposals and patches to solve this problem by adding a
mechanism to the kernel that processes can use to opt into sharing
page tables with other processes. We have had discussions on the
original proposal and subsequent refinements, but we have not
converged on a solution. As systems with multi-TB memory and
in-memory databases become more and more common, this is becoming a
significant issue. An interactive discussion can help us reach a
consensus on how to solve this.

Hi,

I was hoping for a follow-up to my previous comments from ~4 months ago [1], so one problem of "not converging" might be "no follow-up discussion".

Ideally, this session would not focus on mshare as previously discussed at LSF/MM, but take a step back and discuss requirements and possible adjustments to the original concept to get something possibly cleaner.

For example, I raised some ideas for not having to re-route mprotect()/mmap() calls. At least discussing somewhere why they are all bad would be helpful ;)

[1] https://lore.kernel.org/lkml/927b6339-ac5f-480c-9cdc-49c838cbef20@xxxxxxxxxx/


Hi David,

That is fair. A face-to-face discussion can help resolve these more easily, but I will attempt to address them here, and maybe we can come to a better understanding of the requirements. I do want to focus on requirements and let that drive the implementation.

On 11/2/23 14:25, David Hildenbrand wrote:
> On 01.11.23 23:40, Khalid Aziz wrote:
>> is slow and impacts database performance significantly. For each process to have to handle a fault/signal whenever page
>> protection is changed impacts every process. By sharing the same PTEs across all processes, any page protection changes apply
>
> ... and everyone has to get the fault and mprotect() again,
>
> Which is one of the reasons why I said that mprotect() is simply the wrong tool to use here.
>
> You want to protect a pagecache page from write access, catch write access and handle it, to then allow write-access
> again without successive fault->signal. Something similar is being done by filesystems already with the writenotify
> infrastructure I believe. You just don't get a signal on write access, because it's all handled internally in the FS.
>

My understanding of the requirement from database applications is that they want to create a large shared memory region for 1000s of processes. This region may or may not be file-backed. One of the processes can be a control process that serves as gatekeeper to various parts of this shared region. It can open up write access to a part of the shared region (which can span thousands of pages), populate/update data, and then close down write access to that part again. Any other process that tries to write to this region at that time can get a signal and choose to handle it or simply be killed. All the gatekeeper process wants to do is close access to the shared region at any time without having to coordinate that with 1000s of processes, and let the other processes deal with access having been closed.

With this requirement, what database applications have found to be effective is to use mprotect() to apply protection to the part of the shared region and then have it propagate to everyone attempting to access that region. With the currently available mechanisms, that means sending messages to every process so that each one applies the same mprotect() bits to its own PTEs and honors the gatekeeper's request. With shared PTEs, opted into explicitly, the protection bits for all processes change at the same time with no additional action required by the 1000s of processes. That helps performance very significantly.
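
To make the pattern concrete, here is a rough sketch of what the gatekeeper side looks like today with plain POSIX shared memory. The name "/sga" and the sizes/offsets are made up for illustration, the shared object is assumed to already exist and be sized, and this is deliberately not the proposed mshare API:

/*
 * Gatekeeper sketch (hypothetical names and sizes): open write access
 * to one segment of the shared region, update it, then close it again.
 * With per-process page tables this mprotect() only changes the
 * caller's own PTEs, so every other process has to be told to repeat
 * it; with explicitly shared page tables one protection change would
 * be visible to all sharers at once.
 */
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>

#define SGA_SIZE        (300UL << 30)   /* whole shared region */
#define SEG_OFF         (1UL << 30)     /* segment being updated */
#define SEG_LEN         (64UL << 20)

int main(void)
{
        char *sga;
        int fd;

        fd = shm_open("/sga", O_RDWR, 0600);
        if (fd < 0)
                return 1;

        sga = mmap(NULL, SGA_SIZE, PROT_READ, MAP_SHARED, fd, 0);
        if (sga == MAP_FAILED)
                return 1;

        /* open the gate for this segment */
        mprotect(sga + SEG_OFF, SEG_LEN, PROT_READ | PROT_WRITE);
        memset(sga + SEG_OFF, 0, SEG_LEN);      /* populate/update data */

        /* close the gate again */
        mprotect(sga + SEG_OFF, SEG_LEN, PROT_READ);
        return 0;
}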

The second big win here is the memory saved that would otherwise have been used for PTEs in all the processes. The memory saved this way literally takes a system from being completely infeasible to one with room to spare (referring to the case I described in my original mail, where more memory was needed to store the PTEs than was installed on the system).
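
For reference, the 878GB figure from my original mail follows directly from the PTE overhead, assuming 4KB base pages and 8-byte PTEs (the common x86-64/arm64 case) and ignoring the higher page table levels:

  300GB / 4KB page size        = 78,643,200 PTEs to map the SGA
  78,643,200 PTEs * 8 bytes    = 600MB of PTEs per process
  600MB * 1500 processes       = ~879GB of PTEs in total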

>> instantly to all processes (there is the TLB shootdown issue but as discussed in the meeting, it can be handled). The
>> mshare proposal implements the instant page protection change while bringing in benefits of shared page tables at the
>> same time. So the two requirements of this feature are not separable.
>
> Right, and I think we should talk about the problem we are trying to solve and not a solution to the problem. Because
> the current solution really requires sharing of page tables, which I absolutely don't like.
>
> It absolutely makes no sense to bring in mprotect and VMAs when wanting to catch all write accesses to a pagecache page.
> And because we still decide to do so, we have to come up with ways of making page table sharing a user-visible feature
> with weird VMA semantics.

We are not trying to catch write access to a pagecache page here. We simply want to prevent write access to a large, multi-page memory region by all processes sharing it, and to do it instantly and efficiently by allowing the gatekeeper to close the gates and call it done.
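
On the client side, "deal with access having been closed" is ordinary signal handling today. A minimal sketch, with the recovery policy left entirely to the application:

/*
 * Client sketch: a write into a segment the gatekeeper has closed
 * arrives as SIGSEGV.  info->si_addr identifies the faulting address;
 * the application can back off, retry later, or simply exit (or
 * install no handler at all and be killed).
 */
#include <signal.h>
#include <unistd.h>

static void on_segv(int sig, siginfo_t *info, void *ctx)
{
        (void)sig; (void)info; (void)ctx;
        /* e.g. record info->si_addr somewhere async-signal-safe */
        _exit(1);               /* simplest policy: give up */
}

int main(void)
{
        struct sigaction sa = { 0 };

        sa.sa_sigaction = on_segv;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);

        /* ... map the shared region and do work ... */
        return 0;
}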

Thanks,
Khalid



