On Fri, 7 May 2010 00:20:52 +0100 Mel Gorman <mel@xxxxxxxxx> wrote: > vma_adjust() is updating anon VMA information without locks being taken. > In contrast, file-backed mappings use the i_mmap_lock and this lack of > locking can result in races with users of rmap_walk such as page migration. > vma_address() can return -EFAULT for an address that will soon be valid. > For migration, this potentially leaves a dangling migration PTE behind > which can later cause a BUG_ON to trigger when the page is faulted in. > > With the recent anon_vma changes, there can be more than one anon_vma->lock > to take when walking a list of anon_vma_chains but as the order of anon_vmas > cannot be guaranteed, rmap_walk cannot take multiple locks without > potentially deadlocking. > > To resolve this problem, this patch has rmap_walk walk the anon_vma_chain > list but always starting from the "root" anon_vma which is the oldest > anon_vma in the list. It starts by locking the anon_vma lock associated > with a page. It then finds the "root" anon_vma using the anon_vma_chains > "same_vma" list as it is strictly ordered. The root anon_vma lock is taken > and rmap_walk traverses the list. This allows multiple locks to be taken > as the list is always traversed in the same direction. > > As spotted by Rik, to avoid any deadlocks versus mmu_notify, the order that > anon_vmas is locked in by mm_take_all_locks is reversed by this patch so that > both rmap_walk and mm_take_all_locks lock anon_vmas in the order of old->new. > > For vma_adjust(), the locking behaviour prior to the anon_vma is restored > so that rmap_walk() can be sure of the integrity of the VMA information and > lists when the anon_vma lock is held. With this patch, the vma->anon_vma->lock > is taken if > > a) If there is any overlap with the next VMA due to the adjustment > b) If there is a new VMA is being inserted into the address space > c) If the start of the VMA is being changed so that the > relationship between vm_start and vm_pgoff is preserved > for vma_address() > > Signed-off-by: Mel Gorman <mel@xxxxxxxxx> > --- > include/linux/rmap.h | 4 ++ > mm/ksm.c | 13 ++++++- > mm/mmap.c | 14 ++++++- > mm/rmap.c | 97 ++++++++++++++++++++++++++++++++++++++++++++++---- > 4 files changed, 118 insertions(+), 10 deletions(-) > > diff --git a/include/linux/rmap.h b/include/linux/rmap.h > index 7721674..1dc949f 100644 > --- a/include/linux/rmap.h > +++ b/include/linux/rmap.h > @@ -121,6 +121,10 @@ int anon_vma_prepare(struct vm_area_struct *); > void unlink_anon_vmas(struct vm_area_struct *); > int anon_vma_clone(struct vm_area_struct *, struct vm_area_struct *); > int anon_vma_fork(struct vm_area_struct *, struct vm_area_struct *); > +struct anon_vma *anon_vma_lock_nested(struct anon_vma *prev, > + struct anon_vma *next, struct anon_vma *root); > +struct anon_vma *anon_vma_lock_root(struct anon_vma *anon_vma); > +struct anon_vma *page_anon_vma_lock_root(struct page *page); > void __anon_vma_link(struct vm_area_struct *); > void anon_vma_free(struct anon_vma *); > > diff --git a/mm/ksm.c b/mm/ksm.c > index 3666d43..1db8656 100644 > --- a/mm/ksm.c > +++ b/mm/ksm.c > @@ -1655,6 +1655,7 @@ int rmap_walk_ksm(struct page *page, int (*rmap_one)(struct page *, > { > struct stable_node *stable_node; > struct hlist_node *hlist; > + struct anon_vma *nested_anon_vma = NULL; > struct rmap_item *rmap_item; > int ret = SWAP_AGAIN; > int search_new_forks = 0; > @@ -1671,9 +1672,16 @@ again: > struct anon_vma_chain *vmac; > struct vm_area_struct *vma; > > - spin_lock(&anon_vma->lock); > + anon_vma = anon_vma_lock_root(anon_vma); > + if (nested_anon_vma) { > + spin_unlock(&nested_anon_vma->lock); > + nested_anon_vma = NULL; > + } > list_for_each_entry(vmac, &anon_vma->head, same_anon_vma) { > vma = vmac->vma; > + nested_anon_vma = anon_vma_lock_nested(nested_anon_vma, > + vma->anon_vma, anon_vma); > + > if (rmap_item->address < vma->vm_start || > rmap_item->address >= vma->vm_end) > continue; > @@ -1697,6 +1705,9 @@ again: > if (!search_new_forks++) > goto again; > out: > + if (nested_anon_vma) > + spin_unlock(&nested_anon_vma->lock); > + > return ret; > } > > diff --git a/mm/mmap.c b/mm/mmap.c > index f90ea92..b447d5b 100644 > --- a/mm/mmap.c > +++ b/mm/mmap.c > @@ -505,6 +505,7 @@ int vma_adjust(struct vm_area_struct *vma, unsigned long start, > struct vm_area_struct *next = vma->vm_next; > struct vm_area_struct *importer = NULL; > struct address_space *mapping = NULL; > + struct anon_vma *anon_vma = NULL; > struct prio_tree_root *root = NULL; > struct file *file = vma->vm_file; > long adjust_next = 0; > @@ -578,6 +579,11 @@ again: remove_next = 1 + (end > next->vm_end); > } > } > > + if (vma->anon_vma && (insert || importer || start != vma->vm_start)) { > + anon_vma = vma->anon_vma; > + spin_lock(&anon_vma->lock); > + } > + > if (root) { > flush_dcache_mmap_lock(mapping); > vma_prio_tree_remove(vma, root); > @@ -620,6 +626,9 @@ again: remove_next = 1 + (end > next->vm_end); > if (mapping) > spin_unlock(&mapping->i_mmap_lock); > > + if (anon_vma) > + spin_unlock(&anon_vma->lock); > + > if (remove_next) { > if (file) { > fput(file); > @@ -2556,8 +2565,9 @@ int mm_take_all_locks(struct mm_struct *mm) > for (vma = mm->mmap; vma; vma = vma->vm_next) { > if (signal_pending(current)) > goto out_unlock; > + /* Lock the anon_vmas in the same order rmap_walk would */ > if (vma->anon_vma) > - list_for_each_entry(avc, &vma->anon_vma_chain, same_vma) > + list_for_each_entry_reverse(avc, &vma->anon_vma_chain, same_vma) > vm_lock_anon_vma(mm, avc->anon_vma); > } > > @@ -2620,7 +2630,7 @@ void mm_drop_all_locks(struct mm_struct *mm) > > for (vma = mm->mmap; vma; vma = vma->vm_next) { > if (vma->anon_vma) > - list_for_each_entry(avc, &vma->anon_vma_chain, same_vma) > + list_for_each_entry_reverse(avc, &vma->anon_vma_chain, same_vma) > vm_unlock_anon_vma(avc->anon_vma); > if (vma->vm_file && vma->vm_file->f_mapping) > vm_unlock_mapping(vma->vm_file->f_mapping); > diff --git a/mm/rmap.c b/mm/rmap.c > index 85f203e..2e65a75 100644 > --- a/mm/rmap.c > +++ b/mm/rmap.c > @@ -236,6 +236,81 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma) > return -ENOMEM; > } > > +/* > + * When walking an anon_vma_chain and locking each anon_vma encountered, > + * this function is responsible for checking if the next VMA is the > + * same as the root, locking it if not and released the previous lock > + * if necessary. > + * > + * It is assumed the caller has locked the root anon_vma > + */ > +struct anon_vma *anon_vma_lock_nested(struct anon_vma *prev, > + struct anon_vma *next, struct anon_vma *root) > +{ > + if (prev) > + spin_unlock(&prev->lock); > + if (next == root) > + return NULL; > + spin_lock_nested(&next->lock, SINGLE_DEPTH_NESTING); > + return next; > +} > + > +/* > + * Given an anon_vma, find the root of the chain, lock it and return the > + * root. This must be called with the rcu_read_lock held > + */ > +struct anon_vma *anon_vma_lock_root(struct anon_vma *anon_vma) > +{ > + struct anon_vma *root_anon_vma; > + struct anon_vma_chain *avc, *root_avc; > + struct vm_area_struct *vma; > + > + /* Lock the same_anon_vma list and make sure we are on a chain */ > + spin_lock(&anon_vma->lock); > + if (list_empty(&anon_vma->head)) { > + spin_unlock(&anon_vma->lock); > + return NULL; > + } > + > + /* > + * Get the root anon_vma on the list by depending on the ordering > + * of the same_vma list setup by __page_set_anon_rmap. Basically > + * we are doing > + * > + * local anon_vma -> local vma -> root vma -> root anon_vma > + */ > + avc = list_first_entry(&anon_vma->head, struct anon_vma_chain, same_anon_vma); > + vma = avc->vma; > + root_avc = list_entry(vma->anon_vma_chain.prev, struct anon_vma_chain, same_vma); > + root_anon_vma = root_avc->anon_vma; > + > + /* Get the lock of the root anon_vma */ > + if (anon_vma != root_anon_vma) { > + VM_BUG_ON(!rcu_read_lock_held()); > + spin_unlock(&anon_vma->lock); > + spin_lock(&root_anon_vma->lock); > + } I'm sorry but I don't think I understand this. Could you help me ? IIUC, anon_vma_chain is linked as 2D-mesh anon_vma1 anon_vma2 anon_vma3 | | | vma1 ----- 1 -------- 2 --------- 3 ----- | | | vma2 ----- 4 -------- 5 ---------- 6 ----- | | | vma3 ----- 7 -------- 8 ---------- 9 ----- Here, * vertical link is anon_vma->head, avc->same_anon_vma link. * horizontal link is vma->anon_vma_chain, avc->same_vma link. * 1-9 are avcs. When scanning pages, we may see a page whose anon_vma is anon_vma1 or anon_vma2 or anon_vma3. When we see anon_vma3 in page->mapping, we lock anon_vma1 and chase avc1->avc4->avc7. Then, start from vma1. Next, we visit vma2, we lock anon_vma2. At the last, we visit vma3 and lock anon_vma3.....And all are done under anon_vma1->lock. Right ? Hmm, one concern is anon_vma3 -> avc3 -> vma1 -> avc1 -> anon_vma1 chasing. What will prevent vma1 disappear right after releasling anon_vma3->lock ? ex) a1) At we chase, anon_vma3 -> avc3 -> vma1 -> anon_vma1, link was following. anon_vma1 anon_vma2 anon_vma3 | | | vma1 ----- 1 -------- 2 --------- 3 ----- | | | vma2 ----- 4 -------- 5 ---------- 6 ----- | | | vma3 ----- 7 -------- 8 ---------- 9 ----- We hold lock on anon_vma3. a2) After releasing anon_vma3 lock. vma1 can be unlinked. anon_vma1 anon_vma2 anon_vma3 | | | vma1 removed. | | | vma2 ----- 4 -------- 5 ---------- 6 ----- | | | vma3 ----- 7 -------- 8 ---------- 9 ----- But we know anon_vma1->head is not empty, and it's accessable. Then, no problem for our purpose. Right ? b1) Another thinking. At we chase, anon_vma3 -> avc3 -> vma1 -> anon_vma1, link was following. anon_vma1 anon_vma2 anon_vma3 | | | vma1 ----- 1 -------- 2 --------- 3 ----- | | | vma2 ----- 4 -------- 5 ---------- 6 ----- | | | vma3 ----- 7 -------- 8 ---------- 9 ----- We hold lock on anon_vma3. So, anon_vma1 anon_vma2 anon_vma3 | | | vma1 ----removed -----removed ------ 3 ----- | | | vma2 ----- 4 -------- 5 ---------- 6 ----- | | | vma3 ----- 7 -------- 8 ---------- 9 ----- we may see half-broken link while we take anon_vma3->lock. In this case, anon_vma1 can be caugt. Don't we need this ? void unlink_anon_vmas(struct vm_area_struct *vma) { struct anon_vma_chain *avc, *next; /* Unlink each anon_vma chained to the VMA. */ - list_for_each_entry_safe_reverse(avc, next, &vma->anon_vma_chain, same_vma) { + list_for_each_entry_safe_reverse(avc, next, &vma->anon_vma_chain, same_vma) { anon_vma_unlink(avc); list_del(&avc->same_vma); anon_vma_chain_free(avc); } } head avc should be removed last... Hmm ? I'm sorry if all are in correct order already. Thanks, -Kame -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxxx For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>