Re: [PATCH v8 0/4] Introduce mseal

Jeff Xu <jeffxu@xxxxxxxxxxxx> · Wed, 31 Jan 2024 17:27:11 -0800

On Wed, Jan 31, 2024 at 11:34 AM Liam R. Howlett
<Liam.Howlett@xxxxxxxxxx> wrote:
>
> Please add me to the Cc list of these patches.
Ok.
>
> * jeffxu@xxxxxxxxxxxx <jeffxu@xxxxxxxxxxxx> [240131 12:50]:
> > From: Jeff Xu <jeffxu@xxxxxxxxxxxx>
> >
> > This patchset proposes a new mseal() syscall for the Linux kernel.
> >
> > In a nutshell, mseal() protects the VMAs of a given virtual memory
> > range against modifications, such as changes to their permission bits.
> >
> > Modern CPUs support memory permissions, such as the read/write (RW)
> > and no-execute (NX) bits. Linux has supported NX since the release of
> > kernel version 2.6.8 in August 2004 [1]. The memory permission feature
> > improves the security stance on memory corruption bugs, as an attacker
> > cannot simply write to arbitrary memory and point the code to it. The
> > memory must be marked with the X bit, or else an exception will occur.
> > Internally, the kernel maintains the memory permissions in a data
> > structure called VMA (vm_area_struct). mseal() additionally protects
> > the VMA itself is against modifications of the selected seal type.
>
> ... The v8 cut Jonathan's email discussion [1] off and
> instead of
> replying there, I'm going to add my question here.
>
> The best plan to ensure it is a general safety measure for all of linux
> is to work with the community before it lands upstream.  It's much
> harder to change functionality provided to users after it is upstream.
> I'm happy to hear google is super excited about sharing this, but so
> far, the community isn't as excited.
>
> It seems Theo has a lot of experience trying to add a feature very close
> to what you are doing and has real data on how this went [2].  Can we
> see if there is a solution that is, at least, different enough from what
> he tried to do for a shot of success?  Do we have anyone in the
> toolchain groups that sees this working well?  If this means Stephen
> needs to do something, can we get that to happen please?
>
For Theo's input from OpenBSD's perspective;
IIUC: as today, the mseal-Linux and mimmutable-OpenBSD has the same
scope on what operations to seal, e.g. considering the progress made
on both sides since the beginning of the RFC:
- mseal(Linux): dropped "multiple-bit" approach.
- mimmutable(OpenBSD): Dropped "downgradable"; Added madvise(DONOTNEED).

The difference is in mmap(), i.e.
- mseal(Linux): support of PROT_SEAL in mmap().
- mseal(Linux): use of MAP_SEALABLE in mmap().

I considered Theo's inputs from OpenBSD's perspective regarding the
difference, and I wasn't convinced that Linux should remove these. In
my view, those are two different kernels code, and the difference in
Linux is not added without reasons (for MAP_SEALABLE, there is a note
in the documentation section with details).

I would love to hear more from Linux developers on this.

> I mean, you specifically state that this is a 'very specific
> requirement' in your cover letter.  Does this mean even other browsers
> have no use for it?
>
No, I don’t mean “other browsers have no use for it”.

About specific requirements from Chrome, that refers to "The lifetime
of those mappings are not tied to the lifetime of the process, which
is not the case of libc" as in the cover letter. This addition to the
cover letter was made in V3, thus, it might be beneficial to provide
additional context to help answer the question.

This patch series begins with multiple-bit approaches (v1,v2,v3), the
rationale for this is that I am uncertain if Chrome's specific needs
are common enough for other use cases.  Consequently, I am unable to
make this decision myself without input from the community. To
accommodate this, multiple bits are selected initially due to their
adaptability.

Since V1, after hearing from the community, Chrome has changed its
design (no longer relying on separating out mprotect), and Linus
acknowledged the defect of madvise(DONOTNEED) [1]. With those inputs,
today mseal() has a simple design that:
 - meet Chrome's specific needs.
 - meet Libc's needs.
 - Chrome's specific need doesn't interfere with Libc's.

[1] https://lore.kernel.org/all/CAHk-=wiVhHmnXviy1xqStLRozC4ziSugTk=1JOc8ORWd2_0h7g@xxxxxxxxxxxxxx/

> I am very concerned this feature will land and have to be maintained by
> the core mm people for the one user it was specifically targeting.
>
See above. This feature is not specifically targeting Chrome.

> Can we also get some benchmarking on the impact of this feature?  I
> believe my answer in v7 removed the worst offender, but since there is
> no benchmarking we really are guessing (educated or not, hard data would
> help).  We still have an extra loop in madvise, mprotect_pkey, mremap_to
> (and mreamp syscall?).
>
Yes. There is an extra loop in mmap(FIXED), munmap(),
madvise(DONOTNEED), mremap(), to emulate the VMAs for the given
address range. I suspect the impact would be low, but having some hard
data would be good. I will see what I can find to assist the perf
testing. If you have a specific test suite in mind, I can also try it.

> You also did not clean up the loop you copied from mlock, which I
> pointed out [3].  Stating that your copy/paste is easier to review is
> not sufficient to keep unneeded assignments around.
>
OK.

> [1]. https://lore.kernel.org/linux-mm/87a5ong41h.fsf@xxxxxxxxxxxx/
> [2]. https://lore.kernel.org/linux-mm/86181.1705962897@xxxxxxxxxxxxxxx/
> [3]. https://lore.kernel.org/linux-mm/20240124200628.ti327diy7arb7byb@revolver/