Paul Mackerras <paulus@xxxxxxxxx> writes: I will reply to the other parts in a seperate email, but the below >> +static void page_table_free_rcu(struct mmu_gather *tlb, unsigned long *table) >> +{ >> + struct page *page; >> + struct mm_struct *mm; >> + unsigned int bit, mask; >> + >> + mm = tlb->mm; >> + /* Free 2K page table fragment of a 64K page */ >> + page = virt_to_page(table); >> + bit = 1 << ((__pa(table) & ~PAGE_MASK) / PTE_FRAG_SIZE); >> + spin_lock(&mm->page_table_lock); >> + /* >> + * stash the actual mask in higher half, and clear the lower half >> + * and selectively, add remove from pgtable list >> + */ >> + mask = atomic_xor_bits(&page->_mapcount, bit | (bit << FRAG_MASK_BITS)); >> + if (!(mask & FRAG_MASK)) >> + list_del(&page->lru); >> + else { >> + /* >> + * Add the page table page to pgtable_list so that >> + * the free fragment can be used by the next alloc >> + */ >> + list_del_init(&page->lru); >> + list_add_tail(&page->lru, &mm->context.pgtable_list); >> + } >> + spin_unlock(&mm->page_table_lock); >> + tlb_remove_table(tlb, table); >> +} > > This looks like you're allowing a fragment that is being freed to be > reallocated and used again during the grace period when we are waiting > for any references to the fragment to disappear. Doesn't that allow a > race where one CPU traversing the page table and using the fragment in > its old location in the tree could see a PTE created after the > fragment was reallocated? In other words, why is it safe to allow the > fragment to be used during the grace period? If it is safe, it at > least needs a comment explaining why. > We don't allow it to be reallocated during the grace period. The trick is in the below lines of page_table_alloc() /* * Update with the higher order mask bits accumulated, * added as a part of rcu free. */ mask = mask | (mask >> FRAG_MASK_BITS); When checking for mask, we also look at the higher order bits. The reason we add the page back to &mm->context.pgtable_list in page_table_free_rcu is because we need to have access to struct mm_struct. We don't have that in the rcu call back. So we add early and make sure we don't reallocate them, until the grace period is over. I will definitely add more comments around the code to clarify these details. -aneesh -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>