On Tue, Aug 9, 2011 at 3:22 PM, Minchan Kim <minchan.kim@xxxxxxxxx> wrote: > On Tue, Aug 9, 2011 at 8:04 PM, Michel Lespinasse <walken@xxxxxxxxxx> wrote: >> On Sun, Aug 7, 2011 at 7:25 AM, Minchan Kim <minchan.kim@xxxxxxxxx> wrote: >>> On Thu, Aug 04, 2011 at 11:39:19PM -0700, Michel Lespinasse wrote: >>>> I can see that the page_cache_get_speculative comment in >>>> include/linux/pagemap.h maps out one way to prevent the issue. If >>>> thread T continually held an rcu read lock from the time it finds the >>>> pointer to P until the time it calls get_page_unless_zero on that >>>> page, AND there was a synchronize_rcu() call somewhere between the >>>> time a THP page gets allocated and the time __split_huge_page_refcount >>>> might first get called on that page, then things would be safe. >>>> However, that does not seem to be true today: I could not find a >>>> synchronize_rcu() call before __split_huge_page_refcount(), AND there >>>> are also places (such as deactivate_page() for example) that call >>>> get_page_unless_zero without being within an rcu read locked section >>>> (or holding the zone lru lock to provide exclusion against >>>> __split_huge_page_refcount). >> >> Going forward, I can see several possible solutions: >> - Use my proposed page count lock in order to avoid the race. One >> would have to convert all get_page_unless_zero() sites to use it. I >> expect the cost would be low but still measurable. > > It's not necessary to apply it on *all* get_page_unless_zero sites. > Because deactivate_page does it on file pages while THP handles only anon pages. > So the race should not a problem. But it doesn't matter what kind of page the get_page_unless_zero call site hopes to get a reference on - if it doesn't already hold a reference on the page (either directly as a reference, or if a known mapping points to that page and the page table lock is taken or interrupts are disabled in order to guarantee the mapping won't get yanked), then the page can get yanked and a THP page could show up there before the call site gets a reference. >> - Protect all get_page_unless_zero call sites with rcu read lock or >> lru lock (page_cache_get_speculative already has it, but there are >> others to consider), and add a synchronize_rcu() before splitting huge >> pages. > > I think it can't be a solution. > If we don't have any lock for protect write-side, page_count could be > unstable again while we peek page->count in > __split_huge_page_refcount after calling synchronize_rcu. > Do I miss something? The tail page count would be unstable for at most one rcu grace period after the page got allocated. This is guaranteed by making all get_page_unless_zero call sites make sure they somehow determine the page is not a THP tail page (for example because they found it in radix tree) before calling get_page_unless_zero and having an rcu read lock wrapping these two together. This is basically the protocol described in the comment for page_cache_get_speculative() in pagemap.h -- Michel "Walken" Lespinasse A program is never fully debugged until the last user dies. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>