On Sat, Aug 27, 2011 at 10:34 AM, Andrea Arcangeli <aarcange@xxxxxxxxxx> wrote: > Subject: thp: tail page refcounting fix > > From: Andrea Arcangeli <aarcange@xxxxxxxxxx> > > Michel while working on the working set estimation code, noticed that calling > get_page_unless_zero() on a random pfn_to_page(random_pfn) wasn't safe, if the > pfn ended up being a tail page of a transparent hugepage under splitting by > __split_huge_page_refcount(). He then found the problem could also > theoretically materialize with page_cache_get_speculative() during the > speculative radix tree lookups that uses get_page_unless_zero() in SMP if the > radix tree page is freed and reallocated and get_user_pages is called on it > before page_cache_get_speculative has a chance to call get_page_unless_zero(). > > So the best way to fix the problem is to keep page_tail->_count zero at all > times. This will guarantee that get_page_unless_zero() can never succeed on any > tail page. page_tail->_mapcount is guaranteed zero and is unused for all tail > pages of a compound page, so we can simply account the tail page references > there and transfer them to tail_page->_count in __split_huge_page_refcount() (in > addition to the head_page->_mapcount). > > While debugging this s/_count/_mapcount/ change I also noticed get_page is > called by direct-io.c on pages returned by get_user_pages. That wasn't entirely > safe because the two atomic_inc in get_page weren't atomic. As opposed other > get_user_page users like secondary-MMU page fault to establish the shadow > pagetables would never call any superflous get_page after get_user_page > returns. It's safer to make get_page universally safe for tail pages and to use > get_page_foll() within follow_page (inside get_user_pages()). get_page_foll() > is safe to do the refcounting for tail pages without taking any locks because > it is run within PT lock protected critical sections (PT lock for pte and > page_table_lock for pmd_trans_huge). The standard get_page() as invoked by > direct-io instead will now take the compound_lock but still only for tail > pages. The direct-io paths are usually I/O bound and the compound_lock is per > THP so very finegrined, so there's no risk of scalability issues with it. A > simple direct-io benchmarks with all lockdep prove locking and spinlock > debugging infrastructure enabled shows identical performance and no overhead. > So it's worth it. Ideally direct-io should stop calling get_page() on pages > returned by get_user_pages(). The spinlock in get_page() is already optimized > away for no-THP builds but doing get_page() on tail pages returned by GUP is > generally a rare operation and usually only run in I/O paths. > > This new refcounting on page_tail->_mapcount in addition to avoiding new RCU > critical sections will also allow the working set estimation code to work > without any further complexity associated to the tail page refcounting > with THP. > > Signed-off-by: Andrea Arcangeli <aarcange@xxxxxxxxxx> > Reported-by: Michel Lespinasse <walken@xxxxxxxxxx> > Reviewed-by: Michel Lespinasse <walken@xxxxxxxxxx> Looks great. I think some page_mapcount call sites would be easier to read if you took on my tail_page_count() suggestion (so we can casually see it's a refcount rather than mapcount). But you don't have to do it if you don't think it helps. I'm happy enough with the code already :) -- Michel "Walken" Lespinasse A program is never fully debugged until the last user dies. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>