On Thu, Feb 21, 2019 at 09:54:06AM +0100, Oscar Salvador wrote:
> When using the mremap() syscall together with the MREMAP_FIXED flag,
> mremap() calls mremap_to(), which does the following:
>
> 1) unmaps the destination region where we are going to move the map
> 2) if the new region is going to be smaller, unmaps the last part
>    of the old region
>
> Then we eventually call move_vma() to do the actual move.
>
> move_vma() checks whether we are at least 4 maps below max_map_count
> before going further; otherwise it bails out with -ENOMEM.
> The problem is that we might have already unmapped the vma's in steps
> 1) and 2), so it is not possible for userspace to figure out the state
> of the vma's after it gets -ENOMEM, and it gets tricky for userspace
> to clean up properly on the error path.
>
> While it is true that we can return -ENOMEM for other reasons as well
> (e.g. see may_expand_vm() or move_page_tables()), I think that we can
> avoid this particular scenario if we check early in mremap_to()
> whether the operation has a high chance of succeeding map-wise.
>
> Should that not be the case, we can bail out before we even try to
> unmap anything, so we make sure the vma's are left untouched in case
> we are likely to be short of maps.
>
> The rule of thumb now is to rely on the worst case we can have.
> That is when both vma's (old region and new region) are going to be
> split in 3, so we get two more maps on top of the ones we already
> hold (one per each).
> If the current map count + 2 maps still leaves us 4 maps below the
> threshold, we are going to pass the check in move_vma().
>
> Of course, this is not free, as it might generate false positives
> when we really are tight map-wise but the unmap operation would have
> released several vma's, leaving us in a good state.
>
> Because of that I am sending this as an RFC.
> Another approach was also investigated [1], but it may be too much
> hassle for what it brings.

I believe we don't need the check in move_vma() with this patch.
Or do we?

> [1] https://lore.kernel.org/lkml/20190219155320.tkfkwvqk53tfdojt@xxxxxxxxxxxx/
>
> Signed-off-by: Oscar Salvador <osalvador@xxxxxxx>
> ---
>  mm/mremap.c | 17 +++++++++++++++++
>  1 file changed, 17 insertions(+)
>
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 3320616ed93f..e3edef6b7a12 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -516,6 +516,23 @@ static unsigned long mremap_to(unsigned long addr, unsigned long old_len,
>  	if (addr + old_len > new_addr && new_addr + new_len > addr)
>  		goto out;
>
> +	/*
> +	 * move_vma() need us to stay 4 maps below the threshold, otherwise
> +	 * it will bail out at the very beginning.
> +	 * That is a problem if we have already unmaped the regions here
> +	 * (new_addr, and old_addr), because userspace will not know the
> +	 * state of the vma's after it gets -ENOMEM.
> +	 * So, to avoid such scenario we can pre-compute if the whole
> +	 * operation has high chances to success map-wise.
> +	 * Worst-scenario case is when both vma's (new_addr and old_addr) get
> +	 * split in 3 before unmaping it.
> +	 * That means 2 more maps (1 for each) to the ones we already hold.
> +	 * Check whether current map count plus 2 still leads us to 4 maps below
> +	 * the threshold, otherwise return -ENOMEM here to be more safe.
> +	 */
> +	if ((mm->map_count + 2) >= sysctl_max_map_count - 3)

Nit: redundant parentheses around 'mm->map_count + 2'.

> +		return -ENOMEM;
> +
>  	ret = do_munmap(mm, new_addr, new_len, uf_unmap_early);
>  	if (ret)
>  		goto out;
> --
> 2.13.7

--
 Kirill A. Shutemov