Hello. On pátek 13. května 2022 22:32:10 CEST Andrew Morton wrote: > On Fri, 13 May 2022 11:51:53 +0200 Oleksandr Natalenko <oleksandr@xxxxxxxxxxxxxx> wrote: > > On pátek 13. května 2022 0:37:53 CEST Andrew Morton wrote: > > > On Tue, 10 May 2022 15:30:36 +0200 Oleksandr Natalenko <oleksandr@xxxxxxxxxxxxxx> wrote: > > > > > > > > If ksm_force is set to 1, force all anonymous and 'qualified' VMAs > > > > > of this mm to be involved in KSM scanning without explicitly calling > > > > > madvise to mark VMA as MADV_MERGEABLE. But It is effective only when > > > > > the klob of /sys/kernel/mm/ksm/run is set as 1. > > > > > > > > > > If ksm_force is set to 0, cancel the feature of ksm_force of this > > > > > process (fallback to the default state) and unmerge those merged pages > > > > > belonging to VMAs which is not madvised as MADV_MERGEABLE of this process, > > > > > but still leave MADV_MERGEABLE areas merged. > > > > > > > > To my best knowledge, last time a forcible KSM was discussed (see threads [1], [2], [3] and probably others) it was concluded that a) procfs was a horrible interface for things like this one; and b) process_madvise() syscall was among the best suggested places to implement this (which would require a more tricky handling from userspace, but still). > > > > > > > > So, what changed since that discussion? > > > > > > > > P.S. For now I do it via dedicated syscall, but I'm not trying to upstream this approach. > > > > > > Why are you patching the kernel with a new syscall rather than using > > > process_madvise()? > > > > Because I'm not sure how to use `process_madvise()` to achieve $subj properly. > > > > The objective is to mark all the eligible VMAs of the target task for KSM to consider them for merging. > > > > For that, all the eligible VMAs have to be traversed. > > > > Given `process_madvise()` has got an iovec API, this means the process that will call `process_madvise()` has to know the list of VMAs of the target process. In order to traverse them in a race-free manner the target task has to be SIGSTOP'ped or frozen, then the list of VMAs has to be obtained, then `process_madvise()` has to be called, and the the target task can continue. This is: > > > > a) superfluous (the kernel already knows the list of VMAs of the target tasks, why proxy it through the userspace then?); and > > b) may induce more latency than needed because the target task has to be stopped to avoid races. > > OK. And what happens to new vmas that the target process creates after > the process_madvise()? Call `process_madvise()` on them again. And do that again. Regularly, with some intervals. Use some daemon for that (like [1]). [1] https://gitlab.com/post-factum/uksmd > > OTOH, IIUC, even if `MADV_MERGEABLE` is allowed for `process_madvise()`, > > Is it not? It is not: ``` 1158 static bool 1159 process_madvise_behavior_valid(int behavior) 1160 { 1161 switch (behavior) { 1162 case MADV_COLD: 1163 case MADV_PAGEOUT: 1164 case MADV_WILLNEED: 1165 return true; 1166 default: 1167 return false; 1168 } 1169 } ``` Initially, when `process_madvise()` stuff was being prepared for merging, I tried to enabled it but failed [2], and it was decided [3] to move forward without it. [2] https://lore.kernel.org/linux-api/34f812b8-df54-eaad-5cf0-335f07da55c6@xxxxxxx/ [3] https://lore.kernel.org/lkml/20200623085944.cvob63vrv54fo7cs@butterfly.localdomain/ > > I cannot just call it like this: > > > > ``` > > iovec.iov_base = 0; > > iovec.iov_len = ~0ULL; > > process_madvise(pidfd, &iovec, 1, MADV_MERGEABLE, 0); > > ``` > > > > to cover the whole address space because iovec expects total size to be under ssize_t. > > > > Or maybe there's no need to cover the whole address space, only the lower half of it? > > Call process_madvise() twice, once for each half? And still get `-ENOMEM`? ``` 1191 /* 1192 * If the interval [start,end) covers some unmapped address 1193 * ranges, just ignore them, but return -ENOMEM at the end. 1194 * - different from the way of handling in mlock etc. 1195 */ ``` I mean, it probably will work, and probably having the error returned is fine, but generally speaking an error value should hint that something is not being done right. > > Or maybe there's another way of doing things, and I just look stupid and do not understand how this is supposed to work?.. > > > > I'm more than happy to read your comments on this. > > > > I see the problem. I do like the simplicity of the ksm_force concept. > Are there alternative ideas? I do like it too. I wonder what to do with older concerns [4] [5] regarding presenting such an API. [4] https://lore.kernel.org/lkml/20190516172452.GA2106@avx2/ [5] https://lore.kernel.org/lkml/20190515145151.GG16651@xxxxxxxxxxxxxx/ -- Oleksandr Natalenko (post-factum)