Re: [PATCH v6] mm/ksm: introduce ksm_force for each process

Oleksandr Natalenko <oleksandr@xxxxxxxxxxxxxx> · Fri, 13 May 2022 22:53:16 +0200

Hello.

On pátek 13. května 2022 22:32:10 CEST Andrew Morton wrote:
> On Fri, 13 May 2022 11:51:53 +0200 Oleksandr Natalenko <oleksandr@xxxxxxxxxxxxxx> wrote:
> > On pátek 13. května 2022 0:37:53 CEST Andrew Morton wrote:
> > > On Tue, 10 May 2022 15:30:36 +0200 Oleksandr Natalenko <oleksandr@xxxxxxxxxxxxxx> wrote:
> > > 
> > > > > If ksm_force is set to 1, force all anonymous and 'qualified' VMAs
> > > > > of this mm to be involved in KSM scanning without explicitly calling
> > > > > madvise to mark VMA as MADV_MERGEABLE. But It is effective only when
> > > > > the klob of /sys/kernel/mm/ksm/run is set as 1.
> > > > > 
> > > > > If ksm_force is set to 0, cancel the feature of ksm_force of this
> > > > > process (fallback to the default state) and unmerge those merged pages
> > > > > belonging to VMAs which is not madvised as MADV_MERGEABLE of this process,
> > > > > but still leave MADV_MERGEABLE areas merged.
> > > > 
> > > > To my best knowledge, last time a forcible KSM was discussed (see threads [1], [2], [3] and probably others) it was concluded that a) procfs was a horrible interface for things like this one; and b) process_madvise() syscall was among the best suggested places to implement this (which would require a more tricky handling from userspace, but still).
> > > > 
> > > > So, what changed since that discussion?
> > > > 
> > > > P.S. For now I do it via dedicated syscall, but I'm not trying to upstream this approach.
> > > 
> > > Why are you patching the kernel with a new syscall rather than using
> > > process_madvise()?
> > 
> > Because I'm not sure how to use `process_madvise()` to achieve $subj properly.
> > 
> > The objective is to mark all the eligible VMAs of the target task for KSM to consider them for merging.
> > 
> > For that, all the eligible VMAs have to be traversed.
> > 
> > Given `process_madvise()` has got an iovec API, this means the process that will call `process_madvise()` has to know the list of VMAs of the target process. In order to traverse them in a race-free manner the target task has to be SIGSTOP'ped or frozen, then the list of VMAs has to be obtained, then `process_madvise()` has to be called, and the the target task can continue. This is:
> > 
> > a) superfluous (the kernel already knows the list of VMAs of the target tasks, why proxy it through the userspace then?); and
> > b) may induce more latency than needed because the target task has to be stopped to avoid races.
> 
> OK.  And what happens to new vmas that the target process creates after
> the process_madvise()?

Call `process_madvise()` on them again. And do that again. Regularly, with some intervals. Use some daemon for that (like [1]).

[1] https://gitlab.com/post-factum/uksmd

> > OTOH, IIUC, even if `MADV_MERGEABLE` is allowed for `process_madvise()`,
> 
> Is it not?

It is not:

```
1158 static bool
1159 process_madvise_behavior_valid(int behavior)
1160 {
1161     switch (behavior) {
1162     case MADV_COLD:
1163     case MADV_PAGEOUT:
1164     case MADV_WILLNEED:
1165         return true;
1166     default:
1167         return false;
1168     }
1169 }
```

Initially, when `process_madvise()` stuff was being prepared for merging, I tried to enabled it but failed [2], and it was decided [3] to move forward without it.

[2] https://lore.kernel.org/linux-api/34f812b8-df54-eaad-5cf0-335f07da55c6@xxxxxxx/
[3] https://lore.kernel.org/lkml/20200623085944.cvob63vrv54fo7cs@butterfly.localdomain/

> > I cannot just call it like this:
> > 
> > ```
> > iovec.iov_base = 0;
> > iovec.iov_len = ~0ULL;
> > process_madvise(pidfd, &iovec, 1, MADV_MERGEABLE, 0);
> > ```
> > 
> > to cover the whole address space because iovec expects total size to be under ssize_t.
> > 
> > Or maybe there's no need to cover the whole address space, only the lower half of it?
> 
> Call process_madvise() twice, once for each half?

And still get `-ENOMEM`?

```
1191     /*
1192      * If the interval [start,end) covers some unmapped address
1193      * ranges, just ignore them, but return -ENOMEM at the end.
1194      * - different from the way of handling in mlock etc.
1195      */
```

I mean, it probably will work, and probably having the error returned is fine, but generally speaking an error value should hint that something is not being done right.

> > Or maybe there's another way of doing things, and I just look stupid and do not understand how this is supposed to work?..
> > 
> > I'm more than happy to read your comments on this.
> > 
> 
> I see the problem.  I do like the simplicity of the ksm_force concept. 
> Are there alternative ideas?

I do like it too. I wonder what to do with older concerns [4] [5] regarding presenting such an API.

[4] https://lore.kernel.org/lkml/20190516172452.GA2106@avx2/
[5] https://lore.kernel.org/lkml/20190515145151.GG16651@xxxxxxxxxxxxxx/

-- 
Oleksandr Natalenko (post-factum)