On Fri, Aug 05, 2011 at 06:21:51PM +0200, Andrea Arcangeli wrote:
> On Fri, Aug 05, 2011 at 04:25:16PM +0100, Mel Gorman wrote:
> > > +int move_huge_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma,
> > > +		  unsigned long old_addr,
> > > +		  unsigned long new_addr, unsigned long old_end,
> > > +		  pmd_t *old_pmd, pmd_t *new_pmd)
> > > +{
> > > +	int ret = 0;
> > > +	pmd_t pmd;
> > > +
> > > +	struct mm_struct *mm = vma->vm_mm;
> > > +
> > > +	if ((old_addr & ~HPAGE_PMD_MASK) ||
> > > +	    (new_addr & ~HPAGE_PMD_MASK) ||
> >
> > How could these conditions ever be true? We are here because it was
> > pmd_trans_huge. There should be no way this can be aligned. If this
> > is paranoia, make it a BUG_ON.
>
> It actually happens. old_addr/new_addr aren't aligned to hpage
> beforehand; they come straight from the mremap syscall
> parameters. Returning 0 is needed here.
>

Ah yes, of course. Just because the PMD is huge does not mean you are
calling mremap on an aligned address.

> > > +	    (old_addr + HPAGE_PMD_SIZE) > old_end ||
> >
> > Again, is this possible? The old addr was already huge.
>
> This can happen too. old_end is also passed as a parameter from the
> syscall and it's not mangled to fit an hpage, just calculated as
> old_addr + len (len passed as a parameter). The length of the
> destination vma shouldn't be necessary to check; there's no new_end in
> the first place as a parameter of move_page_tables, so the caller must
> ensure there's enough space to copy in the destination. And if
> old_end - old_addr >= HPAGE_PMD_SIZE and old_addr and new_addr are
> aligned, we're sure the hugepmd is safe to create on the destination
> new_addr (if it's aligned).
>
> So I think all the checks are needed here.

Yes, blindingly obvious now of course :/

> In theory the caller could
> check this stuff before calling move_huge_pmd but I try to be as
> non-invasive as possible in the common code that may be built with
> TRANSPARENT_HUGEPAGE=n (these checks would still be computed in the
> pmd_trans_huge(*old_pmd) case only, but it's kind of nicer to hide
> them in huge_memory.c).
>

What you have at the moment is indeed nicer.

> > > +	/*
> > > +	 * The destination pmd shouldn't be established; free_pgtables()
> > > +	 * should have released it.
> > > +	 */
> > > +	if (!pmd_none(*new_pmd)) {
> > > +		WARN_ON(1);
> > > +		VM_BUG_ON(pmd_trans_huge(*new_pmd));
> > > +		goto out;
> > > +	}
> > > +
> >
> > Agreed that this should never happen. The mmap_sem is held for writing
> > and we are remapping to what should be empty space. It should not be
> > possible for a huge PMD to be established underneath us.
>
> Yes. The code will work fine also if the new_pmd points to a pte that
> wasn't freed by free_pgtables,

As we hold mmap_sem for write, I would expect any munmap to have
completed by the time this is reached and we're hardly calling mremap
during exit.

> but I think it's kind of safer to have a
> WARN_ON because if that really ever happens we would prefer to
> release the pte here and proceed mapping the hugepmd instead of
> splitting the source hugepmd.
>

It's healthy paranoia.

> If there's a hugepmd mapped instead, it means the memory wasn't
> munmapped. I could have run a memory compare of the pte against the
> zero page too, but I didn't... kind of overkill. But checking that
> there is no hugepmd is fast.
>

I think the check is paranoid but there is no harm in that.
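As an aside for anyone following along, the alignment point is easy to
see in user space. This is my own illustration, not the kernel code;
it assumes the x86_64 2M huge page size and just mirrors the bail-out
conditions quoted above:

#include <stdio.h>

#define HPAGE_PMD_SIZE	(2UL << 20)		/* 2M, x86_64 assumption */
#define HPAGE_PMD_MASK	(~(HPAGE_PMD_SIZE - 1))

/* mirrors the bail-out conditions in the quoted move_huge_pmd() */
static int huge_move_possible(unsigned long old_addr, unsigned long new_addr,
			      unsigned long old_end)
{
	if ((old_addr & ~HPAGE_PMD_MASK) ||	/* source not 2M aligned */
	    (new_addr & ~HPAGE_PMD_MASK) ||	/* destination not 2M aligned */
	    old_addr + HPAGE_PMD_SIZE > old_end) /* under one hugepage left */
		return 0;	/* caller must split and copy pte by pte */
	return 1;
}

int main(void)
{
	unsigned long base = 0x600000000000UL;

	/* 2M aligned source and destination with a 4M range: movable */
	printf("%d\n", huge_move_possible(base, base + (1UL << 30),
					  base + 2 * HPAGE_PMD_SIZE));
	/* destination shifted one page into the hugepage: must split */
	printf("%d\n", huge_move_possible(base, base + (1UL << 30) + 4096,
					  base + 2 * HPAGE_PMD_SIZE));
	return 0;
}

Even though the pmd backing old_addr is huge, nothing stops userland
passing a destination or length that is only page aligned, which is
exactly what the checks catch.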
> > > +	spin_lock(&mm->page_table_lock);
> > > +	if (likely(pmd_trans_huge(*old_pmd))) {
> > > +		if (pmd_trans_splitting(*old_pmd)) {
> > > +			spin_unlock(&mm->page_table_lock);
> > > +			wait_split_huge_page(vma->anon_vma, old_pmd);
> > > +			ret = -1;
> > > +		} else {
> > > +			pmd = pmdp_get_and_clear(mm, old_addr, old_pmd);
> > > +			VM_BUG_ON(!pmd_none(*new_pmd));
> > > +			set_pmd_at(mm, new_addr, new_pmd, pmd);
> > > +			spin_unlock(&mm->page_table_lock);
> > > +			ret = 1;
> > > +		}
> > > +	} else
> > > +		spin_unlock(&mm->page_table_lock);
> > > +
> >
> > The meaning of the return values of -1, 0, 1 with the caller doing
> >
> > if (err)
> > 	...
> > else if (!err)
> > 	...
> >
> > is tricky to work out. split_huge_page only needs to be called if
> > returning 0. Would it be possible to have the split_huge_page called
> > in this function? The end of the function would then look like
> >
> > 	return ret;
> >
> > out_split:
> > 	split_huge_page_pmd()
> > 	return ret;
> >
> > with either success or failure being returned instead of a tristate,
> > which is easier to understand.
>
> So basically always return 0 regardless of whether it was a
> pmd_trans_splitting or we split it ourselves, and only return 1 in
> case the move_huge_pmd was successful.
>

Exactly.

> I'm afraid we have other tristates returned by the other mm
> methods... if we clean up this one, later we may also clean up the
> others.
>

There are other tristates but let's not make life hard. When I've
needed tristates in recent patches, I've used enums with symbolic names
instead of -1, 0, 1 but that would be overkill here.

> I'll try to clean up this one for now, ok.
>
> > > diff --git a/mm/mremap.c b/mm/mremap.c
> > > --- a/mm/mremap.c
> > > +++ b/mm/mremap.c
> > > @@ -41,8 +41,7 @@ static pmd_t *get_old_pmd(struct mm_stru
> > >  		return NULL;
> > >
> > >  	pmd = pmd_offset(pud, addr);
> > > -	split_huge_page_pmd(mm, pmd);
> > > -	if (pmd_none_or_clear_bad(pmd))
> > > +	if (pmd_none(*pmd))
> > >  		return NULL;
> > >
> >
> > Ok, this is changing to pmd_none because it could be a huge PMD and
> > pmd_none_or_clear_bad triggers on a huge PMD. Right?
>
> Right. It'd bug on a hugepmd. Maybe we could someday change
> pmd_none_or_clear_bad not to choke on the PSE bit, but it was kind of
> safer to keep assuming the PSE bit was "bad" in case we forgot a
> split_huge_page somewhere; better to bug in that case than to ignore
> it.

Agreed.

> But as time passes and THP becomes rock solid, I have to admit it's
> becoming more a loss of reliability to consider the PSE bit "bad" and
> remove some pmd_none_or_clear_bad than a gain in reliability from
> validating the THP case. Maybe we should change that. It's a common
> problem, not specific to mremap.
>

It's something worth doing in a few releases' time. It would be nice
to have the bad PMD checks back at some point.
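If it happens, the helper Andrea is hinting at might look something
like the sketch below. This is hypothetical and untested, the name is
invented here and it is not part of this patch; it only shows the idea
of tolerating a huge pmd instead of treating the PSE bit as "bad":

static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
{
	pmd_t pmdval = *pmd;

	if (pmd_none(pmdval))
		return 1;		/* nothing mapped here */
	if (pmd_trans_huge(pmdval))
		return 1;		/* huge pmd: caller handles or splits */
	if (unlikely(pmd_bad(pmdval))) {
		pmd_clear_bad(pmd);	/* a genuinely corrupt entry */
		return 1;
	}
	return 0;			/* a normal page table is present */
}

Callers that can handle (or split) a huge pmd would use something like
this, while the plain pmd_none_or_clear_bad() would keep protecting
paths that must never see one.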
> > > @@ -65,8 +64,6 @@ static pmd_t *alloc_new_pmd(struct mm_st
> > >  		return NULL;
> > >
> > >  	VM_BUG_ON(pmd_trans_huge(*pmd));
> > > -	if (pmd_none(*pmd) && __pte_alloc(mm, vma, pmd, addr))
> > > -		return NULL;
> > >
> > >  	return pmd;
> > >  }
> > > @@ -80,11 +77,7 @@ static void move_ptes(struct vm_area_str
> > >  	struct mm_struct *mm = vma->vm_mm;
> > >  	pte_t *old_pte, *new_pte, pte;
> > >  	spinlock_t *old_ptl, *new_ptl;
> > > -	unsigned long old_start;
> > >
> > > -	old_start = old_addr;
> > > -	mmu_notifier_invalidate_range_start(vma->vm_mm,
> > > -					    old_start, old_end);
> >
> > The MMU notifier is now being called for a larger range. Previously
> > it would usually be ranges of 64 pages and now it looks like it
> > happens once for the entire range being remapped. This is not
> > mentioned in the leader. What are the consequences of having a large
> > gap between invalidate_start and invalidate_end? Would it be a big
> > deal to call the MMU notifier within move_huge_pmd()?
>
> Well, it should improve performance. The only downside is that it
> will stall secondary page faults for a longer time, but because the
> mremap can now run faster with fewer IPIs, I think overall it should
> improve performance.

That's a tough call. Increased jitter within KVM might be unwelcome
but ordinarily the only people that might care about latencies are
real-time people and I don't think they would care about the KVM case.

> Also it's probably not too common to do much mremap on apps with a
> secondary MMU attached, so in this place the mmu notifier is more
> about correctness I think, and correctness remains. Userland can
> always do mremap in smaller chunks if it needs to, and then it'll
> really run faster with no downside. After all, no app is supposed to
> touch the source addresses while they're being migrated as it could
> SEGFAULT at any time. So a very long mremap basically makes a mapping
> "not available" for a longer time, except now it'll be shorter
> because mremap will run faster. The mapping being available to the
> secondary mmu was an incidental implementation detail; now it's not
> available to the secondary mmu during the move, just as it is not
> available to userland. I doubt anybody is depending on trapping
> SIGSEGV to modify the mapping while it's under mremap.
>

I agree with you that overall it's probably for the best but splitting
it out as a separate patch and cc'ing the KVM people would do no harm.
Is there any impact for things like xpmem that might be using MMU
notifiers in some creative manner?

> > If it's safe to use larger ranges, it would be preferable to see it
> > in a separate patch or at the very least explained in the changelog.
>
> I can split it off to a separate patch. I knew I should have done
> that in the first place and rightfully I got caught :).
>

:)

> > >  	if (vma->vm_file) {
> > >  		/*
> > >  		 * Subtle point from Rajesh Venkatasubramanian: before
> > > @@ -111,7 +104,7 @@ static void move_ptes(struct vm_area_str
> > >  			new_pte++, new_addr += PAGE_SIZE) {
> > >  		if (pte_none(*old_pte))
> > >  			continue;
> > > -		pte = ptep_clear_flush(vma, old_addr, old_pte);
> > > +		pte = ptep_get_and_clear(mm, old_addr, old_pte);
> >
> > This looks like an unrelated optimisation. You hint at this in the
> > patch subject but it needs a separate patch or a better explanation
> > in the leader. If I'm reading this right, it looks like you are
> > deferring the TLB flush on each single page and making one flush
> > call later at the end of move_page_tables. At a glance, that seems
> > ok and would reduce IPIs but I'm not thinking about it properly
> > because I'm trying to think about THP shenanigans :)
>
> That's exactly what it does. Once I split it off you can concentrate
> on the two parts separately. This is also the part that requires
> moving the mmu notifier outside, along with the tlb flush.
>
> The THP shenanigans don't require moving the mmu notifier outside.
>

Right. I did notice that it would be less churn if the optimisation
went in first but that was about it.
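To summarise the optimisation for my own benefit, the shape before and
after is roughly this (simplified pseudo-kernel code, not the literal
patch; the 64-page chunking and pte locking details are omitted):

/* Before: one flush (and potential IPI) per present pte, with a
 * notifier range opened per chunk inside move_ptes() */
mmu_notifier_invalidate_range_start(mm, old_addr, chunk_end);
for (/* each pte in the chunk */) {
	pte = ptep_clear_flush(vma, old_addr, old_pte);	/* flushes TLB */
	set_pte_at(mm, new_addr, new_pte, pte);
}
mmu_notifier_invalidate_range_end(mm, old_addr, chunk_end);

/* After: clear without flushing, then one ranged flush for the whole
 * move at the end of move_page_tables() */
mmu_notifier_invalidate_range_start(mm, old_addr, old_end);
for (/* each pte in the whole range */) {
	pte = ptep_get_and_clear(mm, old_addr, old_pte);	/* no flush */
	set_pte_at(mm, new_addr, new_pte, pte);
	need_flush = true;
}
if (need_flush)
	flush_tlb_range(vma, old_end - len, old_end);	/* one flush */
mmu_notifier_invalidate_range_end(mm, old_addr, old_end);

The correctness argument is the one Andrea makes above: the source
range is dead to userland for the duration of the move anyway, so a
stale TLB entry can only be used by an access that was already
undefined.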
> The one IPI per page is a major bottleneck for java, and the lack of
> hugepmd migration is also a major bottleneck; here we get both
> combined, so we get 1 IPI for a ton of THP. The benchmark I ran was
> single threaded on a 12 core system (and a single-threaded app, if
> the scheduler is doing its job, won't require any IPI); you can only
> imagine the boost it gets on heavily multithreaded apps that require
> flooding IPIs on large SMP (I didn't measure that as I was already
> happy with what I got single threaded :).
>

Yeah, I imagine it should also be a boost during GC if the JVM is
using mremap to migrate full pages from an old heap to a new one.

> > > +	mmu_notifier_invalidate_range_start(vma->vm_mm, old_addr, old_end);
> > > +
> > >  	for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
> > >  		cond_resched();
> > >  		next = (old_addr + PMD_SIZE) & PMD_MASK;
> > > -		if (next - 1 > old_end)
> > > +		if (next > old_end)
> > >  			next = old_end;
> > >  		extent = next - old_addr;
> > >  		old_pmd = get_old_pmd(vma->vm_mm, old_addr);
> >
> > You asked if removing this "- 1" is correct. It's an overflow check
> > for a situation where old_addr + PMD_SIZE overflows. On what
> > architecture is it possible to call mremap() at the very top of the
> > address space or am I missing the point?
> >
> > Otherwise I think the existing check is harmless if obscure. It's
> > reasonable to assume PAGE_SIZE will be > 1 and I'm not seeing why it
> > is required by the rest of the patch.
>
> An arch where TASK_SIZE ends less than PMD_SIZE away from the end of
> the address space didn't cross my mind, sorry; Johannes also pointed
> it out. I'd find the following more readable than an off-by-one -1
> that looks erroneous.
>
> diff --git a/mm/mremap.c b/mm/mremap.c
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -136,9 +136,10 @@ unsigned long move_page_tables(struct vm
>  	for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
>  		cond_resched();
>  		next = (old_addr + PMD_SIZE) & PMD_MASK;
> -		if (next > old_end)
> -			next = old_end;
> +		/* even if next overflowed, extent below will be ok */
>  		extent = next - old_addr;
> +		if (extent > old_end - old_addr)
> +			extent = old_end - old_addr;
>  		old_pmd = get_old_pmd(vma->vm_mm, old_addr);
>  		if (!old_pmd)
>  			continue;
>
> What do you think? I can put the -1 back if you prefer. I doubt the
> speed difference can matter here; all the values should be in
> registers.
>

I don't feel very strongly about it. The change looks reasonable and,
with a fresh head once the patch is split out, it probably is more
readable.

> > > @@ -150,6 +145,23 @@ unsigned long move_page_tables(struct vm
> > >  		new_pmd = alloc_new_pmd(vma->vm_mm, vma, new_addr);
> > >  		if (!new_pmd)
> > >  			break;
> > > +		if (pmd_trans_huge(*old_pmd)) {
> > > +			int err = 0;
> > > +			if (extent == HPAGE_PMD_SIZE)
> > > +				err = move_huge_pmd(vma, new_vma, old_addr,
> > > +						    new_addr, old_end,
> > > +						    old_pmd, new_pmd);
> > > +			if (err > 0) {
> > > +				need_flush = true;
> > > +				continue;
> > > +			} else if (!err)
> > > +				split_huge_page_pmd(vma->vm_mm, old_pmd);
> > > +			VM_BUG_ON(pmd_trans_huge(*old_pmd));
> >
> > This tristate is hard to parse but I mentioned this already.
>
> Yep, I'll try to make it 0/1.
>
> > Functionally, I can't see a major problem with the patch. The minor
> > problems are that I'd like to see that tristate replaced for
> > readability, the optimisation better explained or in a separate
> > patch, and an explanation why the larger ranges for mmu_notifiers
> > are not a problem.
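For the record, the overflow we are worrying about is easy to
demonstrate in user space with 32-bit arithmetic. This is my own test
program, not kernel code; the addresses are invented to sit inside the
last pmd of a 32-bit address space:

#include <stdint.h>
#include <stdio.h>

#define PMD_SIZE ((uint32_t)1 << 21)	/* 2M */
#define PMD_MASK (~(PMD_SIZE - 1))

int main(void)
{
	uint32_t old_addr = 0xffe01000;	/* inside the last pmd */
	uint32_t old_end  = 0xfffff000;	/* a TASK_SIZE near the top */

	/* old_addr + PMD_SIZE wraps past zero, so next masks down to 0 */
	uint32_t next = (old_addr + PMD_SIZE) & PMD_MASK;

	/* with the bare "next > old_end" test, 0 > old_end is false,
	 * so the naive extent overshoots old_end by a page */
	printf("naive extent:   0x%x\n", next - old_addr);	/* 0x1ff000 */

	/* Andrea's version clamps against the overflow-free difference */
	uint32_t extent = next - old_addr;
	if (extent > old_end - old_addr)
		extent = old_end - old_addr;
	printf("clamped extent: 0x%x\n", extent);		/* 0x1fe000 */
	return 0;
}

With the old "next - 1 > old_end" form, the wrapped next of 0 becomes
0xffffffff after the subtraction and the check fires, which is why the
-1 was there in the first place.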
>
> Thanks for the review, very helpful as usual. I'll try to submit a
> 3-patch version with the cleanups you suggested soon enough. Ideally
> I'd like to replace the -1 with the above change, which should also
> guard against TASK_SIZE ending less than one pmd away from the end of
> the address space, if you like it.
>

Sounds good. Will review again when they come around.

-- 
Mel Gorman
SUSE Labs