Re: [PATCH v5 10/13] mm/madvise: add MADV_COLLAPSE to process_madvise()

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Wed, 11 May 2022, Zach O'Keefe wrote:

> Hey Rongwei,
> 
> Thanks for taking the time to review!
> 
> On Tue, May 10, 2022 at 5:49 PM Rongwei Wang
> <rongwei.wang@xxxxxxxxxxxxxxxxx> wrote:
> >
> > Hi, Zach
> >
> > Thanks for your great patchset!
> > Recently, We also try to collapse THP in this way, likes performance
> > degradation due to using too much hugepages in our scenes.
> >

Rongwei, could you elaborate on this?  I can understand undesired overhead 
for allocation of a hugepage at the time of fault if there is not enough 
benefit derived by the hugepages over the long term (like for a database 
workload), is this the performance degradation you're referring to?

Otherwise I'm unfamiliar with performance degradation for using too much 
hugepages after they have been allocated :)  Maybe RSS grows too much and 
we run into memory pressure?

It would be good to know more if you can share details here.

> > And there is a doubt about process_madvise(MADV_COLLAPSE) when we test
> > this patchset:. It seems that process_madvise(MADV_COLLAPSE) rely on
> > madvise(MADV_HUGEPAGE)? If the vma wasn't marked with 'hg',
> > process_madvise(MADV_COLLAPSE) will fail to collapse. And if I miss
> > something, please let me know.
> >
> 
> I tried to have MADV_COLLAPSE follow the same THP eligibility
> semantics as khugepaged and at-fault: either THP=always, or
> THP=madvise and the vma is marked with MADV_HUGEPAGE, as you point
> out.
> 
> If I understand you correctly, the usefulness of
> process_madvise(MADV_COLLAPSE) is limited in the case where
> THP=madvise and a CAP_SYS_ADMIN user is requesting a collapse of
> behalf of another process since they don't have a way to mark the
> target memory as eligible (which requires VM_HUGEPAGE).
> 
> If so, I think that's a valid point, and your suggestion below of a
> supporting MADV_[NO]HUGEPAGE for process_madvise(2) makes sense. For
> the sake of exploring all options, I'll mention that there was also a
> previous idea suggested by Yang Shi where MADV_COLLAPSE could also set
> VM_HUGEPAGE[1].
> 

If a user is doing MADV_COLLAPSE on behalf of itself, it seems unnecessary 
to need to do MADV_HUGEPAGE before that regardless of system-wide settings 
that it may not control?

Same point for a root user doing this on behalf of another user.  It could 
either do MADV_HUGEPAGE and change the behavior that the user perhaps 
requested behind its back (not desired) or it could temporarily set the 
system-wide setting to allow THP always before doing the MADV_COLLAPSE 
(it's root).

We can simply allow MADV_COLLAPSE to always collapse in either context 
regardless of VM_HUGEPAGE, VM_NOHUGEPAGE, or system-wide settings?

> Since it's possible supporting MADV_[NO]HUGEPAGE for
> process_madivse(2) has applications outside a subsequent
> MADV_COLLAPSE, and since I don't see process_madvise(MADV_COLLAPSE) to
> be in a hot path, I'd vote in favor of your suggestion and include
> process_madvise(MADV_[NO]HUGEPAGE) support in v6 unless others object.
> 
> Thanks again for your review and your suggestion!
> Zach
> 
> [1] https://lore.kernel.org/linux-mm/CAHbLzkqLRBd6u3qn=KqpOhRcPZtpGXbTXLUjK1z=4d_dQ06Pvw@xxxxxxxxxxxxxx/
> 
> > If so, how about introducing process_madvise(MADV_HUGEPAGE) or
> > process_madvise(MADV_NOHUGEPAGE)? The former helps to mark the target
> > vma with 'hg', and the collapse process can be finished completely with
> > the help of other processes. the latter could let some special vma avoid
> > collapsing when setting 'THP=always'.
> >
> > Best regards,
> > -wrw
> >
> > On 5/5/22 5:44 AM, Zach O'Keefe wrote:
> > > Allow MADV_COLLAPSE behavior for process_madvise(2) if caller has
> > > CAP_SYS_ADMIN or is requesting collapse of it's own memory.
> > >
> > > Signed-off-by: Zach O'Keefe <zokeefe@xxxxxxxxxx>
> > > ---
> > >   mm/madvise.c | 6 ++++--
> > >   1 file changed, 4 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > index 638517952bd2..08c11217025a 100644
> > > --- a/mm/madvise.c
> > > +++ b/mm/madvise.c
> > > @@ -1168,13 +1168,15 @@ madvise_behavior_valid(int behavior)
> > >   }
> > >
> > >   static bool
> > > -process_madvise_behavior_valid(int behavior)
> > > +process_madvise_behavior_valid(int behavior, struct task_struct *task)
> > >   {
> > >       switch (behavior) {
> > >       case MADV_COLD:
> > >       case MADV_PAGEOUT:
> > >       case MADV_WILLNEED:
> > >               return true;
> > > +     case MADV_COLLAPSE:
> > > +             return task == current || capable(CAP_SYS_ADMIN);
> > >       default:
> > >               return false;
> > >       }
> > > @@ -1452,7 +1454,7 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
> > >               goto free_iov;
> > >       }
> > >
> > > -     if (!process_madvise_behavior_valid(behavior)) {
> > > +     if (!process_madvise_behavior_valid(behavior, task)) {
> > >               ret = -EINVAL;
> > >               goto release_task;
> > >       }
> 




[Index of Archives]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Bugtraq]     [Linux OMAP]     [Linux MIPS]     [eCos]     [Asterisk Internet PBX]     [Linux API]

  Powered by Linux