On 29.02.24 15:12, Matthew Wilcox wrote:
On Thu, Feb 29, 2024 at 10:21:26AM +0100, David Hildenbrand wrote:
On 28.02.24 23:56, Khalid Aziz wrote:
Threads of a process share an address space and page tables, which allows for
two key advantages:
1. The amount of memory required for PTEs to map physical pages stays low
even when a large number of threads share the same pages, since PTEs are
shared across threads.
2. Page protection attributes are shared across threads and a change
of attributes applies immediately to every thread without any overhead
of coordinating protection bit changes across threads.
These advantages no longer apply when unrelated processes share pages.
Large database applications can easily consist of 1000s of processes
that share 100s of GB of pages. In cases like this, the amount of memory
consumed by page tables can exceed the size of the actual shared data.
On a database server with a 300GB SGA, a system crash with an
out-of-memory condition was seen when 1500+ clients tried to share this
SGA, even though the system had 512GB of memory. On this server, the
worst-case scenario of all 1500 processes mapping every page of the SGA
would have required 878GB+ for the PTEs alone.
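
The 878GB+ figure can be reproduced with a short back-of-the-envelope
calculation. The sketch below is only an illustration and rests on
assumptions not stated in the mail: 4 KiB base pages, 8-byte PTEs (as on
x86-64), only last-level PTEs counted (upper page table levels ignored),
and "GB" read as GiB. The function name pte_bytes is hypothetical.

```python
# Rough estimate of last-level page table memory when many processes
# each map the same shared region with private page tables.
# Assumptions (not from the mail): 4 KiB pages, 8-byte PTEs, GB == GiB.

PAGE_SIZE = 4 * 1024   # bytes per base page (assumed)
PTE_SIZE = 8           # bytes per page table entry (assumed, x86-64)
MIB = 1024 ** 2
GIB = 1024 ** 3


def pte_bytes(shared_bytes, processes, page_size=PAGE_SIZE, pte_size=PTE_SIZE):
    """Bytes of PTEs needed if every process maps every shared page."""
    pages = shared_bytes // page_size
    return pages * pte_size * processes


per_process = pte_bytes(300 * GIB, 1)
total = pte_bytes(300 * GIB, 1500)

print(f"{per_process / MIB:.0f} MiB of PTEs per process")
print(f"{total / GIB:.1f} GiB of PTEs for 1500 processes")
```

Under these assumptions each process needs about 600 MiB of PTEs to map
the whole SGA, and 1500 processes together need roughly 878.9 GiB, which
is more than the 512GB of memory in the machine.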
I have sent proposals and patches to solve this problem by adding a
mechanism to the kernel for processes to use to opt into sharing
page tables with other processes. We have had discussions on the original
proposal and subsequent refinements but we have not converged on a
solution. As systems with multi-TB memory and in-memory databases
are becoming more and more common, this is becoming a significant issue.
An interactive discussion can help us reach a consensus on how to
solve this.
Hi,
I was hoping for a follow-up to my previous comments from ~4 months ago [1],
so one problem of "not converging" might be "no follow-up discussion".
Ideally, this session would not focus on mshare as previously discussed at
LSF/MM, but take a step back and discuss requirements and possible
adjustments to the original concept to get something possibly cleaner.
I think the concept is clean.
Your concept doesn't fit our use case!
Which one exactly are you talking about in particular?
I raised various alternatives/modifications for discussion, learning
what works and what doesn't work on the way. (I never understood why
protection on the pagecache level wouldn't work for your use case, but
let's put that aside).
In my last mail, I had the following:
"
It's been a while, but I remember that the feedback in the room was
primarily that:
(a) the original mshare approach/implementation had a very dangerous
smell to it. Rerouting mmap/mprotect/... is just absolutely nasty.
(b) that pure page table sharing itself might be a reasonable
optimization worth having.
I still think generic page table sharing (as a pure optimization) can be
something reasonable to have, and can help existing use cases without
the need to modify any software (well, except maybe give a hint that it
might be reasonable).
As said, I see value in some fd-thingy that can be mmaped, but is
internally assembled from other fds (using protect ioctls, not mmap)
with sub-protection (using protect ioctls, not mprotect). The ioctls
would be minimal and clearly specified. Most madvise()/uffd/... would
simply fail when seeing a VMA that mmaps such a fd thingy. No rerouting
of mmap, munmap, mprotect, ...
Under the hood, one can use a MM to manage all that and share page
tables. But it would be an implementation detail.
"
So I do think original mshare could be done "less scary" [1] by exposing
a different, well defined and restricted interface to manage the
"content" of mshare.
I have a lot in mind that I could describe, but it doesn't make
sense to describe it if it won't solve your use case.
In my world it would end up cleaner, and naive me would have thought
that you would enjoy something close to original mshare, just a bit less
scary :)
So essentially what you're asking for is for us to do a lot of work
which doesn't solve our problem. You can imagine our lack of enthusiasm
for this.
I recall that implementing generic page table sharing is a lot of work
that Oracle isn't interested in doing; fair enough, I understood that.
Really, the amount of work is unclear if we don't talk about the actual
solution.
I cannot really do more than offer help like I did:
"I'm happy to discuss further. In a bi-weekly MM meeting, off-list or
here.".
But if my comments are so unreasonable that they are not even worth
discussing, I likely wouldn't be of any help in another mshare session.
[1] https://lwn.net/Articles/895217/
--
Cheers,
David / dhildenb