On Wed, Jan 26, 2022 at 04:04:48AM +0000, Matthew Wilcox wrote:
> On Tue, Jan 25, 2022 at 06:59:50PM +0000, Matthew Wilcox wrote:
> > On Tue, Jan 25, 2022 at 09:57:05PM +0300, Kirill A. Shutemov wrote:
> > > On Tue, Jan 25, 2022 at 02:09:47PM +0000, Matthew Wilcox wrote:
> > > > > I think zero-API approach (plus madvise() hints to tweak it) is worth
> > > > > considering.
> > > >
> > > > I think the zero-API approach actually misses out on a lot of
> > > > possibilities that the mshare() approach offers. For example, mshare()
> > > > allows you to mmap() many small files in the shared region -- you
> > > > can't do that with zeroAPI.
> > >
> > > Do you consider a use-case for many small files to be common? I would
> > > think that the main consumer of the feature to be mmap of huge files.
> > > And in this case zero enabling burden on userspace side sounds like a
> > > sweet deal.
> >
> > mmap() of huge files is certainly the Oracle use-case. With occasional
> > funny business like mprotect() of a single page in the middle of a 1GB
> > hugepage.
>
> Bill and I were talking about this earlier and realised that this is
> the key point. There's a requirement that when one process mprotects
> a page that it gets protected in all processes. You can't do that
> without *some* API because that's different behaviour than any existing
> API would produce.

"hurr, durr, we are Oracle" :P

Sounds like a very niche requirement. I doubt there will be more than a
single-digit user count for the feature. Maybe only the DB.

> So how about something like this ...
>
> int mcreate(const char *name, int flags, mode_t mode);
>
> creates a new mm_struct with a refcount of 2. returns an fd (one
> of the two refcounts) and creates a name for it (inside msharefs,
> holds the other refcount).
>
> You can then mmap() that fd to attach it to a chunk of your address
> space. Once attached, you can start to populate it by calling
> mmap() and specifying an address inside the attached mm as the first
> argument to mmap().

That is not what mmap() would normally do to an existing mapping. So it
requires special treatment.

In general, mmap() of a mm_struct scares me. I can't wrap my head around
the implications.

Like how does it work on fork()?

How does accounting work? What happens on OOM? What prevents creating
loops, like mapping a mm_struct inside itself?

What do mremap()/munmap() do to such a mapping? Will they affect the
mapping of the mm_struct or will they target a mapping inside the
mm_struct?

Maybe it just didn't click for me, I dunno.

> Maybe mcreate() is just a library call, and it's really a thin wrapper
> around open() that happens to know where msharefs is mounted.

-- 
 Kirill A. Shutemov
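
For reference, the existing behaviour that the quoted requirement departs from can be sketched in a few lines of C. Today, protections belong to each process's own mm, so mprotect() in one process leaves another process's view of the same MAP_SHARED memory untouched. This is only an illustration, assuming Linux and a forked child that shares one anonymous MAP_SHARED page. The function name is made up for this example and is not part of the proposed mcreate()/mshare() interface.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper, for illustration only.
 *
 * Returns 1 if the parent can still store to a MAP_SHARED page while a
 * live child has mprotect()ed its own view read-only and the child then
 * sees the stored byte. Returns -1 on any setup or child failure.
 * If protections were shared between processes, the store in the parent
 * would raise SIGSEGV instead of returning.
 */
int parent_write_survives_child_mprotect(void)
{
    long page = sysconf(_SC_PAGESIZE);
    int to_parent[2], to_child[2], status = 0, stored = 0;
    char c = 0;
    char *p;
    pid_t pid;

    assert(page > 0);
    p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    if (pipe(to_parent) < 0)
        goto out_unmap;
    if (pipe(to_child) < 0) {
        close(to_parent[0]);
        close(to_parent[1]);
        goto out_unmap;
    }

    pid = fork();
    if (pid < 0) {
        close(to_parent[0]); close(to_parent[1]);
        close(to_child[0]);  close(to_child[1]);
        goto out_unmap;
    }
    if (pid == 0) {
        /* Child: make its own view of the page read-only. */
        close(to_parent[0]);
        close(to_child[1]);
        if (mprotect(p, (size_t)page, PROT_READ) != 0)
            _exit(2);
        if (write(to_parent[1], "r", 1) != 1)
            _exit(3);
        /* Stay alive, still read-only, until the parent has stored. */
        if (read(to_child[0], &c, 1) != 1)
            _exit(4);
        /* The data is shared even though the protection is not. */
        _exit(p[0] == 'x' ? 0 : 5);
    }

    /* Parent: wait for the child's mprotect(), then store to the page. */
    close(to_parent[1]);
    close(to_child[0]);
    if (read(to_parent[0], &c, 1) == 1) {
        p[0] = 'x';
        stored = (write(to_child[1], "g", 1) == 1);
    }
    close(to_parent[0]);
    close(to_child[1]);
    if (waitpid(pid, &status, 0) != pid)
        stored = 0;
    munmap(p, (size_t)page);
    return (stored && WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 1 : -1;

out_unmap:
    munmap(p, (size_t)page);
    return -1;
}
```

Called from a small test program on Linux, this should return 1: the child's read-only protection stays in the child's mm while the data itself remains shared. Making a protection change visible in every attached process is the part that would need a new interface.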