On Mon, Nov 30, 2020 at 11:01 AM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote:
>
> On Wed, Nov 25, 2020 at 3:43 PM Minchan Kim <minchan@xxxxxxxxxx> wrote:
> >
> > On Wed, Nov 25, 2020 at 03:23:40PM -0800, Suren Baghdasaryan wrote:
> > > On Wed, Nov 25, 2020 at 3:13 PM Minchan Kim <minchan@xxxxxxxxxx> wrote:
> > > >
> > > > On Mon, Nov 23, 2020 at 09:39:42PM -0800, Suren Baghdasaryan wrote:
> > > > > process_madvise requires a vector of address ranges to be provided for
> > > > > its operations. When an advice should be applied to the entire process,
> > > > > the caller process has to obtain the list of VMAs of the target process
> > > > > by reading the /proc/pid/maps or some other way. The cost of this
> > > > > operation grows linearly with increasing number of VMAs in the target
> > > > > process. Even constructing the input vector can be non-trivial when
> > > > > target process has several thousands of VMAs and the syscall is being
> > > > > issued during high memory pressure period when new allocations for such
> > > > > a vector would only worsen the situation.
> > > > > In the case when advice is being applied to the entire memory space of
> > > > > the target process, this creates an extra overhead.
> > > > > Add PMADV_FLAG_RANGE flag for process_madvise enabling the caller to
> > > > > advise a memory range of the target process. For now, to keep it simple,
> > > > > only the entire process memory range is supported, vec and vlen inputs
> > > > > in this mode are ignored and can be NULL and 0.
> > > > > Instead of returning the number of bytes that advice was successfully
> > > > > applied to, the syscall in this mode returns 0 on success. This is due
> > > > > to the fact that the number of bytes would not be useful for the caller
> > > > > that does not know the amount of memory the call is supposed to affect.
> > > > > Besides, the ssize_t return type can be too small to hold the number of
> > > > > bytes affected when the operation is applied to a large memory range.
> > > >
> > > > Can we just use one element in iovec to indicate entire address rather
> > > > than using up the reserved flags?
> > > >
> > > > struct iovec {
> > > >         .iov_base = NULL,
> > > >         .iov_len = (~(size_t)0),
> > > > };
> > > >
> > > > Furthermore, it would be applied for other syscalls that support
> > > > iovec if we agree on it.
> > > >
> > >
> > > The flag also changes the return value semantics. If we follow your
> > > suggestion we should also agree that in this mode the return value
> > > will be 0 on success and negative otherwise instead of the number of
> > > bytes madvise was applied to.
> >
> > Well, return value will depend on each API. If the operation is
> > destructive, it should return the right size affected by the API but
> > would be okay with 0 or error, otherwise.
>
> I'm fine with dropping the flag, I just thought with the flag it would
> be more explicit that this is a special mode operating on ranges. This
> way the patch also becomes simpler.
> Andrew, Michal, Christian, what do you think about such API? Should I
> change the API this way / keep the flag / change it in some other way?

Friendly ping to get some feedback on the proposed API please.
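
To make the single-element iovec variant discussed above concrete, here is a rough sketch of what it might look like. It is only an illustration of the proposal: the helper names iovec_set_whole_range() and iovec_is_whole_range() are hypothetical and do not exist in the kernel or in any library, and nothing in this thread establishes that the kernel interprets this sentinel value. The first helper shows what a caller would fill in, the second shows the kind of check an implementation could use to recognize the "entire address space" request.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/uio.h>

/*
 * Hypothetical helper (not an existing API): fill *out with the sentinel
 * suggested in the thread, i.e. iov_base = NULL and iov_len = ~(size_t)0,
 * meaning "the entire address space of the target process".
 */
static void iovec_set_whole_range(struct iovec *out)
{
	assert(out != NULL);
	out->iov_base = NULL;
	out->iov_len = ~(size_t)0;
}

/*
 * Hypothetical check (assumption, not actual kernel code): the request
 * counts as "whole range" only when exactly one element is passed and
 * that element carries the sentinel values.
 */
static bool iovec_is_whole_range(const struct iovec *vec, size_t vlen)
{
	return vlen == 1 &&
	       vec != NULL &&
	       vec[0].iov_base == NULL &&
	       vec[0].iov_len == ~(size_t)0;
}
```

With this variant the caller would pass the filled element as process_madvise(pidfd, &vec, 1, advice, 0), so no VMA list has to be read or allocated and the reserved flags argument stays unused. The open question from the thread remains: whether the call should return 0 on success in this mode instead of a byte count.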