Re: [PATCH v5 10/13] mm/madvise: add MADV_COLLAPSE to process_madvise()

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 





On 5/13/22 4:03 AM, David Rientjes wrote:
On Wed, 11 May 2022, Zach O'Keefe wrote:

Hey Rongwei,

Thanks for taking the time to review!

On Tue, May 10, 2022 at 5:49 PM Rongwei Wang
<rongwei.wang@xxxxxxxxxxxxxxxxx> wrote:

Hi, Zach

Thanks for your great patchset!
Recently, We also try to collapse THP in this way, likes performance
degradation due to using too much hugepages in our scenes.


Rongwei, could you elaborate on this?  I can understand undesired overhead
for allocation of a hugepage at the time of fault if there is not enough
benefit derived by the hugepages over the long term (like for a database
workload), is this the performance degradation you're referring to?

Hi David

Sorry for the late reply.

Actually, I'm not sure that the degradation is caused by using too much hugepages. Plus, I also referrd to Google's doc[1], and the TLB pressure mentioned here. Maybe the process_madvise(MADV_COLLAPSE) can help us to solve these issues.

[1] https://dyninst.github.io/scalable_tools_workshop/petascale2018/assets/slides/CSCADS%202018%20perf_events%20status%20update.pdf

-wrw

Otherwise I'm unfamiliar with performance degradation for using too much
hugepages after they have been allocated :)  Maybe RSS grows too much and
we run into memory pressure?

It would be good to know more if you can share details here.

And there is a doubt about process_madvise(MADV_COLLAPSE) when we test
this patchset:. It seems that process_madvise(MADV_COLLAPSE) rely on
madvise(MADV_HUGEPAGE)? If the vma wasn't marked with 'hg',
process_madvise(MADV_COLLAPSE) will fail to collapse. And if I miss
something, please let me know.


I tried to have MADV_COLLAPSE follow the same THP eligibility
semantics as khugepaged and at-fault: either THP=always, or
THP=madvise and the vma is marked with MADV_HUGEPAGE, as you point
out.

If I understand you correctly, the usefulness of
process_madvise(MADV_COLLAPSE) is limited in the case where
THP=madvise and a CAP_SYS_ADMIN user is requesting a collapse of
behalf of another process since they don't have a way to mark the
target memory as eligible (which requires VM_HUGEPAGE).

If so, I think that's a valid point, and your suggestion below of a
supporting MADV_[NO]HUGEPAGE for process_madvise(2) makes sense. For
the sake of exploring all options, I'll mention that there was also a
previous idea suggested by Yang Shi where MADV_COLLAPSE could also set
VM_HUGEPAGE[1].


If a user is doing MADV_COLLAPSE on behalf of itself, it seems unnecessary
to need to do MADV_HUGEPAGE before that regardless of system-wide settings
that it may not control?

Same point for a root user doing this on behalf of another user.  It could
either do MADV_HUGEPAGE and change the behavior that the user perhaps
requested behind its back (not desired) or it could temporarily set the
system-wide setting to allow THP always before doing the MADV_COLLAPSE
(it's root).

We can simply allow MADV_COLLAPSE to always collapse in either context
regardless of VM_HUGEPAGE, VM_NOHUGEPAGE, or system-wide settings?

Since it's possible supporting MADV_[NO]HUGEPAGE for
process_madivse(2) has applications outside a subsequent
MADV_COLLAPSE, and since I don't see process_madvise(MADV_COLLAPSE) to
be in a hot path, I'd vote in favor of your suggestion and include
process_madvise(MADV_[NO]HUGEPAGE) support in v6 unless others object.

Thanks again for your review and your suggestion!
Zach

[1] https://lore.kernel.org/linux-mm/CAHbLzkqLRBd6u3qn=KqpOhRcPZtpGXbTXLUjK1z=4d_dQ06Pvw@xxxxxxxxxxxxxx/

If so, how about introducing process_madvise(MADV_HUGEPAGE) or
process_madvise(MADV_NOHUGEPAGE)? The former helps to mark the target
vma with 'hg', and the collapse process can be finished completely with
the help of other processes. the latter could let some special vma avoid
collapsing when setting 'THP=always'.

Best regards,
-wrw

On 5/5/22 5:44 AM, Zach O'Keefe wrote:
Allow MADV_COLLAPSE behavior for process_madvise(2) if caller has
CAP_SYS_ADMIN or is requesting collapse of it's own memory.

Signed-off-by: Zach O'Keefe <zokeefe@xxxxxxxxxx>
---
   mm/madvise.c | 6 ++++--
   1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 638517952bd2..08c11217025a 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1168,13 +1168,15 @@ madvise_behavior_valid(int behavior)
   }

   static bool
-process_madvise_behavior_valid(int behavior)
+process_madvise_behavior_valid(int behavior, struct task_struct *task)
   {
       switch (behavior) {
       case MADV_COLD:
       case MADV_PAGEOUT:
       case MADV_WILLNEED:
               return true;
+     case MADV_COLLAPSE:
+             return task == current || capable(CAP_SYS_ADMIN);
       default:
               return false;
       }
@@ -1452,7 +1454,7 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
               goto free_iov;
       }

-     if (!process_madvise_behavior_valid(behavior)) {
+     if (!process_madvise_behavior_valid(behavior, task)) {
               ret = -EINVAL;
               goto release_task;
       }





[Index of Archives]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Bugtraq]     [Linux OMAP]     [Linux MIPS]     [eCos]     [Asterisk Internet PBX]     [Linux API]

  Powered by Linux