Re: [LSF/MM/BPF TOPIC] Sharing page tables across processes (mshare)

Hi,

I was hoping for a follow-up to my previous comments from ~4 months ago [1], so one reason for "not converging" might be
"no follow-up discussion".

Ideally, this session would not focus on mshare as previously discussed at LSF/MM, but take a step back and discuss the
requirements and possible adjustments to the original concept, to arrive at something cleaner.

For example, I raised some ideas for avoiding the need to re-route mprotect()/mmap() calls. At least discussing
somewhere why they are all bad would be helpful ;)

[1] https://lore.kernel.org/lkml/927b6339-ac5f-480c-9cdc-49c838cbef20@xxxxxxxxxx/


Hi David,

That is fair. A face-to-face discussion can resolve these more easily, but I will attempt to address them here, and
maybe we can come to a better understanding of the requirements. I do want to focus on requirements and let those drive
the implementation.

Hi Khalid,

sorry for the late reply, my mailbox got a bit flooded.


On 11/2/23 14:25, David Hildenbrand wrote:
  > On 01.11.23 23:40, Khalid Aziz wrote:
  >> is slow and impacts database performance significantly. For each process to have to handle a fault/signal whenever page
  >> protection is changed impacts every process. By sharing same PTE across all processes, any page protection changes apply
  >
  > ... and everyone has to get the fault and mprotect() again,
  >
  > Which is one of the reasons why I said that mprotect() is simply the wrong tool to use here.
  >
  > You want to protect a pagecache page from write access, catch write access and handle it, to then allow write-access
  > again without successive fault->signal. Something similar is being done by filesystems already with the writenotify
  > infrastructure I believe. You just don't get a signal on write access, because it's all handled internally in the FS.
  >

My understanding of the requirement from database applications is that they want to create a large shared memory region
for thousands of processes. The region may or may not contain file-backed pages. One of the processes can be a control
process that serves as gatekeeper to various parts of this shared region. That process can open up write access to a
part of the shared region (which can span thousands of pages), populate/update the data, and then close down write
access again. Any other process that tries to write to the region at that time gets a signal and can choose to handle it
or simply be killed.

Got it.

All the gatekeeper process wants to do is close access to the shared region at any time without having to coordinate
that with thousands of processes, and let the other processes deal with access having been closed. With this
requirement, what database applications have found effective is to use mprotect() to apply protection to the part of the
shared region and have it propagate to everyone attempting to access that region. Using currently available mechanisms,
that meant sending messages to every process to apply the same mprotect() bits to their own PTEs and honor the
gatekeeper's

Yes, mprotect() over multiple processes is indeed stupid. It's also the same thing one currently has to do with
uffd-wp: each process has to protect the pages in its own page tables.

request. With shared PTEs, opted into explicitly, the protection bits change for all processes at the same time, with no
additional action required by thousands of processes. That helps performance very significantly.

The second big win here is the memory saved that would otherwise be consumed by PTEs in all the processes. The memory
saved this way literally takes a system from being completely infeasible to one with room to spare (referring to the
case I described in my original mail, where we needed more memory to store PTEs than was installed in the system).
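The scale of that overhead can be seen with back-of-the-envelope arithmetic (assuming x86-64's 4 KiB pages and 8-byte
PTEs, and counting leaf PTEs only; upper-level page tables add slightly more):

```c
/* Bytes of leaf PTEs needed when `nprocs` processes each map the same
 * region with their own, unshared page tables (4 KiB pages, 8-byte PTEs). */
static unsigned long long unshared_pte_bytes(unsigned long long region_bytes,
                                             unsigned long long nprocs)
{
    return region_bytes / 4096 * 8 * nprocs;
}
```

For a 1 TiB region, that is 2 GiB of leaf PTEs per process; with a thousand processes, roughly 2 TB of page tables in
total, which can exceed the region itself. With shared page tables the cost is paid once.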

Yes, I understood all that.


  >> instantly to all processes (there is the TLB shootdown issue but as discussed in the meeting, it can be handled). The
  >> mshare proposal implements the instant page protection change while bringing in benefits of shared page tables at the
  >> same time. So the two requirements of this feature are not separable.
  >
  > Right, and I think we should talk about the problem we are trying to solve and not a solution to the problem. Because
  > the current solution really requires sharing of page tables, which I absolutely don't like.
  >
  > It absolutely makes no sense to bring in mprotect and VMAs when wanting to catch all write accesses to a pagecache page.
  > And because we still decide to do so, we have to come up with ways of making page table sharing a user-visible feature
  > with weird VMA semantics.

We are not trying to catch write access to a pagecache page here. We simply want to prevent write access to a large,
multi-page memory region by all processes sharing it, and to do it instantly and efficiently by allowing the gatekeeper
to close the gates and call it done.

Thanks for these details!


I'll have a bunch of other questions. Finding some way to discuss them with you in detail would be great. Will you be at LSF/MM so we can talk in person? Ideally, we could talk before any LSF/MM session.

--
Cheers,

David / dhildenb




