Re: [PATCH linux-next] mm/madvise: allow KSM hints for process_madvise

Michal Hocko <mhocko@xxxxxxxx> · Fri, 1 Jul 2022 14:36:18 +0200

On Fri 01-07-22 14:09:24, David Hildenbrand wrote:
> On 01.07.22 14:02, Michal Hocko wrote:
> > On Fri 01-07-22 12:50:59, David Hildenbrand wrote:
> >> On 01.07.22 12:32, David Hildenbrand wrote:
> >>> On 01.07.22 11:11, Michal Hocko wrote:
> >>>> [Cc Jann]
> >>>>
> >>>> On Fri 01-07-22 08:43:23, cgel.zte@xxxxxxxxx wrote:
> >>>>> From: xu xin <xu.xin16@xxxxxxxxxx>
> >>>>>
> >>>>> The benefits of doing this are obvious because using madvise in user code
> >>>>> is the only current way to enable KSM, which is inconvenient for those
> >>>>> compiled app without marking MERGEABLE wanting to enable KSM.
> >>>>
> >>>> I would rephrase:
> >>>> "
> >>>> KSM functionality is currently available only to processes which are
> >>>> using MADV_MERGEABLE directly. This is limiting because there are
> >>>> usecases which will benefit from enabling KSM on a remote process. One
> >>>> example would be an application which cannot be modified (e.g. because
> >>>> it is only distributed as a binary). MORE EXAMPLES WOULD BE REALLY
> >>>> BENEFICIAL.
> >>>> "
> >>>>
> >>>>> Since we already have the syscall of process_madvise(), then reusing the
> >>>>> interface to allow external KSM hints is more acceptable [1].
> >>>>>
> >>>>> Although this patch was released by Oleksandr Natalenko, but it was
> >>>>> unfortunately terminated without any conclusions because there was debate
> >>>>> on whether it should use signal_pending() to check the target task besides
> >>>>> the task of current() when calling unmerge_ksm_pages of other task [2].
> >>>>
> >>>> I am not sure this is particularly interesting. I do not remember
> >>>> details of that discussion but checking signal_pending on a different
> >>>> task is rarely the right thing to do. In this case the check is meant to
> >>>> allow bailing out from the operation so that the caller could be
> >>>> terminated for example.
> >>>>
> >>>>> I think it's unneeded to check the target task. For example, when we set
> >>>>> the klob /sys/kernel/mm/ksm/run from 1 to 2,
> >>>>> unmerge_and_remove_all_rmap_items() doesn't use signal_pending() to check
> >>>>> all other target tasks either.
> >>>>>
> >>>>> I hope this patch can get attention again.
> >>>>
> >>>> One thing that the changelog is missing and it is quite important IMHO
> >>>> is the permission model. As we have discussed in previous incarnations
> >>>> of the remote KSM functionality that KSM has some security implications.
> >>>> It would be really great to refer to that in the changelog for the
> >>>> future reference (http://lkml.kernel.org/r/CAG48ez0riS60zcA9CC9rUDV=kLS0326Rr23OKv1_RHaTkOOj7A@xxxxxxxxxxxxxx)
> >>>>
> >>>> So this implementation requires PTRACE_MODE_READ_FSCREDS and
> >>>> CAP_SYS_NICE so the remote process would need to be allowed to
> >>>> introspect the address space. This is the same constrain applied to the
> >>>> remote momory reclaim. Is this sufficient?
> >>>>
> >>>> I would say yes because to some degree KSM mergning can have very
> >>>> similar effect to memory reclaim from the side channel POV. But it
> >>>> should be really documented in the changelog so that it is clear that
> >>>> this has been a deliberate decision and thought through.
> >>>>
> >>>> Other than that this looks like the most reasonable approach to me.
> >>>>
> >>>>> [1] https://lore.kernel.org/lkml/YoOrdh85+AqJH8w1@xxxxxxxxxxxxxx/
> >>>>> [2] https://lore.kernel.org/lkml/2a66abd8-4103-f11b-06d1-07762667eee6@xxxxxxx/
> >>>>>
> >>>
> >>> I have various concerns, but the biggest concern is that this modifies
> >>> VMA flags and can possibly break applications.
> >>>
> >>> process_madvise must not modify remote process state.
> >>>
> >>> That's why we only allow a very limited selection that are merely hints.
> >>>
> >>> So nack from my side.
> >>>
> >>
> >> [I'm quit ebusy, but I think some more explanation might be of value]
> >>
> >> One COW example where I think force-enabling KSM for processes is
> >> *currently* not a good idea (besides the side channel discussions, which
> >> is also why Windows stopped to enable KSM system wide a while ago):
> >>
> >> App:
> >>
> >> a) memset(page, 0);
> >> b) trigger R/O long-term pin on page (e.g., vfio)
> >>
> >> If between a) and b) KSM replaces the page by the shared zeropage you'll
> >> get an unreliable pin because we don't break yet COW when taking a R/O
> >> pin on the shared zeropage. And in the traditional sense, the app did
> >> everything right to guarantee that the pin will stay reliable.
> > 
> > Isn't this a bug in the existing implementation of the CoW?
> 
> One the one hand yes (pinning the shared zeropage is questionable), on
> the other hand no (user space did modify that memory ahead of time and
> filled it with something reasonable, that's how why always worked
> correctly in the absence of KSM).

I am not sure about exact details of the KSM implementation but if that
is not a desirable behavior then it should be handled on the KSM level.
The very sam thing can easily happen in a multithreaded (or in general
multi-process with shared mm) environment as well.

> >> Further, if an app explicitly decides to disable KSM one some region, we
> >> should not overwrite that.
> > 
> > Well, the interface is rather spartan. You cannot really tell "disable
> > KSM on some reqion". You can only tell "KSM can be applied to this
> > region" and later change your mind. Maybe this is what you had in
> > mind though.
> 
> That's what I meant. The hugepage interface has different semantics and
> you get three possible states:
> 
> 1: yes please: MADV_HUGEPAGE
> 2: don't care -- don't set anything
> 3. please no: MADV_NOHUGEPAGE
> 
> Currently for KSM we only have 1 and 2 internally I think (single
> flag), because it didn't matter in the past ebcause there was no
> force-enablement. One could convert it into all 3 states, changing the
> semantics of MADV_UNMERGEABLE slightly from
> 
> 
> 1: yes please: MADV_MERGEABLE
> 2: don't care: MADV_UNMERGEABLE
> 
> to
> 
> 1: yes please: MADV_MERGEABLE
> 2: don't care -- don't set anything
> 3. please no: MADV_UNMERGEABLE

Are you saying that any remote handling of the KSM has to deal with a
pre-existing semantic as well? Are we aware of any existing application
that really uses MADV_UNMERGEABLE in a hope to disable KSM for any of
its sensitive memory ranges? My understanding is that this is simply a
on/off knob and a remote way to do the same is in line with the existing
API.

To be completely honest I do not really buy an argument that this might
break something much more than the original application can do already.
Unless I am missing the ptrace check puts the bar rather high. Adversary
with this level of access to the target application has already broken
it. Or am I missing something?
-- 
Michal Hocko
SUSE Labs