On Thu, Sep 10, 2015 at 12:54:08PM +0200, Vlastimil Babka wrote: > On 09/03/2015 02:35 PM, Kirill A. Shutemov wrote: > > Hugh has pointed that compound_head() call can be unsafe in some > > context. There's one example: > > > > CPU0 CPU1 > > > > isolate_migratepages_block() > > page_count() > > compound_head() > > !!PageTail() == true > > put_page() > > tail->first_page = NULL > > head = tail->first_page > > alloc_pages(__GFP_COMP) > > prep_compound_page() > > tail->first_page = head > > __SetPageTail(p); > > !!PageTail() == true > > <head == NULL dereferencing> > > > > The race is pure theoretical. I don't it's possible to trigger it in > > practice. But who knows. > > > > We can fix the race by changing how encode PageTail() and compound_head() > > within struct page to be able to update them in one shot. > > > > The patch introduces page->compound_head into third double word block in > > front of compound_dtor and compound_order. Bit 0 encodes PageTail() and > > the rest bits are pointer to head page if bit zero is set. > > > > The patch moves page->pmd_huge_pte out of word, just in case if an > > architecture defines pgtable_t into something what can have the bit 0 > > set. > > > > hugetlb_cgroup uses page->lru.next in the second tail page to store > > pointer struct hugetlb_cgroup. The patch switch it to use page->private > > in the second tail page instead. The space is free since ->first_page is > > removed from the union. > > > > The patch also opens possibility to remove HUGETLB_CGROUP_MIN_ORDER > > limitation, since there's now space in first tail page to store struct > > hugetlb_cgroup pointer. But that's out of scope of the patch. > > > > That means page->compound_head shares storage space with: > > > > - page->lru.next; > > - page->next; > > - page->rcu_head.next; > > > > That's too long list to be absolutely sure, but looks like nobody uses > > bit 0 of the word. > > Given the discussion about rcu_head, that should warrant some summary here :) Agreed. > > --- a/include/linux/mm_types.h > > +++ b/include/linux/mm_types.h > > @@ -120,7 +120,13 @@ struct page { > > }; > > }; > > > > - /* Third double word block */ > > + /* > > + * Third double word block > > + * > > + * WARNING: bit 0 of the first word encode PageTail(). That means > > + * the rest users of the storage space MUST NOT use the bit to > > + * avoid collision and false-positive PageTail(). > > + */ > > union { > > struct list_head lru; /* Pageout list, eg. active_list > > * protected by zone->lru_lock ! > > @@ -143,12 +149,19 @@ struct page { > > */ > > /* First tail page of compound page */ > > "First tail" doesn't apply for compound_head. I'll adjust comments. > > index 097c7a4bfbd9..330377f83ac7 100644 > > --- a/mm/huge_memory.c > > +++ b/mm/huge_memory.c > > @@ -1686,8 +1686,7 @@ static void __split_huge_page_refcount(struct page *page, > > (1L << PG_unevictable))); > > page_tail->flags |= (1L << PG_dirty); > > > > - /* clear PageTail before overwriting first_page */ > > - smp_wmb(); > > + clear_compound_head(page_tail); > > I would sleep better if this was done before setting all the page->flags above, > previously, PageTail was cleared by the first operation which is > "page_tail->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;" > I do realize that it doesn't use WRITE_ONCE, so it might have been theoretically > broken already, if it does matter. Right. Nothing enforces particular order. If we really need to have some ordering on PageTail() vs. page->flags let's define it, but so far I don't see a reason to change this part. > > diff --git a/mm/internal.h b/mm/internal.h > > index 36b23f1e2ca6..89e21a07080a 100644 > > --- a/mm/internal.h > > +++ b/mm/internal.h > > @@ -61,9 +61,9 @@ static inline void __get_page_tail_foll(struct page *page, > > * speculative page access (like in > > * page_cache_get_speculative()) on tail pages. > > */ > > - VM_BUG_ON_PAGE(atomic_read(&page->first_page->_count) <= 0, page); > > + VM_BUG_ON_PAGE(atomic_read(&compound_head(page)->_count) <= 0, page); > > if (get_page_head) > > - atomic_inc(&page->first_page->_count); > > + atomic_inc(&compound_head(page)->_count); > > Doing another compound_head() seems like overkill when this code already assumes > PageTail. "Overkill"? It's too strong wording for re-read hot cache line. > All callers do it after if (PageTail()) which means they already did > READ_ONCE(page->compound_head) and here they do another one. Potentially with > different result in bit 0, which would be a subtle bug, that could be > interesting to catch with some VM_BUG_ON. I don't know if a direct plain > page->compound_head access here instead of compound_head() would also result in > a re-read, since the previous access did use READ_ONCE(). Maybe it would be best > to reorganize the code here and in the 3 call sites so that the READ_ONCE() used > to determine PageTail also obtains the compound head pointer. All we would possbily win by the change is few bytes in code. Additional READ_ONCE() only affect code generation. It doesn't introduce any cpu barriers. The cache line with compound_head is in L1 anyway. I don't see justification to change this part too. If you think we can gain something by reworking this code, feel free to propose patch on top. > Some of that is probably made moot by your other series, but better let's think > of this series as standalone first. > > > get_huge_page_tail(page); > > } > > > > diff --git a/mm/memory-failure.c b/mm/memory-failure.c > > index 1f4446a90cef..4d1a5de9653d 100644 > > --- a/mm/memory-failure.c > > +++ b/mm/memory-failure.c > > @@ -787,8 +787,6 @@ static int me_huge_page(struct page *p, unsigned long pfn) > > #define lru (1UL << PG_lru) > > #define swapbacked (1UL << PG_swapbacked) > > #define head (1UL << PG_head) > > -#define tail (1UL << PG_tail) > > -#define compound (1UL << PG_compound) > > #define slab (1UL << PG_slab) > > #define reserved (1UL << PG_reserved) > > > > @@ -811,12 +809,7 @@ static struct page_state { > > */ > > { slab, slab, MF_MSG_SLAB, me_kernel }, > > > > -#ifdef CONFIG_PAGEFLAGS_EXTENDED > > { head, head, MF_MSG_HUGE, me_huge_page }, > > - { tail, tail, MF_MSG_HUGE, me_huge_page }, > > -#else > > - { compound, compound, MF_MSG_HUGE, me_huge_page }, > > -#endif > > > > { sc|dirty, sc|dirty, MF_MSG_DIRTY_SWAPCACHE, me_swapcache_dirty }, > > { sc|dirty, sc, MF_MSG_CLEAN_SWAPCACHE, me_swapcache_clean }, > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > > index c6733cc3cbce..a56ad53ff553 100644 > > --- a/mm/page_alloc.c > > +++ b/mm/page_alloc.c > > @@ -424,15 +424,15 @@ out: > > /* > > * Higher-order pages are called "compound pages". They are structured thusly: > > * > > - * The first PAGE_SIZE page is called the "head page". > > + * The first PAGE_SIZE page is called the "head page" and have PG_head set. > > * > > - * The remaining PAGE_SIZE pages are called "tail pages". > > + * The remaining PAGE_SIZE pages are called "tail pages". PageTail() is encoded > > + * in bit 0 of page->compound_head. The rest of bits is pointer to head page. > > * > > - * All pages have PG_compound set. All tail pages have their ->first_page > > - * pointing at the head page. > > + * The first tail page's ->compound_dtor holds the offset in array of compound > > + * page destructors. See compound_page_dtors. > > * > > - * The first tail page's ->lru.next holds the address of the compound page's > > - * put_page() function. Its ->lru.prev holds the order of allocation. > > + * The first tail page's ->compound_order holds the order of allocation. > > * This usage means that zero-order pages may not be compound. > > */ > > > > @@ -452,10 +452,7 @@ void prep_compound_page(struct page *page, unsigned long order) > > for (i = 1; i < nr_pages; i++) { > > struct page *p = page + i; > > set_page_count(p, 0); > > - p->first_page = page; > > - /* Make sure p->first_page is always valid for PageTail() */ > > - smp_wmb(); > > - __SetPageTail(p); > > + set_compound_head(p, page); > > } > > } > > > > @@ -830,17 +827,30 @@ static void free_one_page(struct zone *zone, > > > > static int free_tail_pages_check(struct page *head_page, struct page *page) > > { > > - if (!IS_ENABLED(CONFIG_DEBUG_VM)) > > - return 0; > > + int ret = 1; > > + > > + /* > > + * We rely page->lru.next never has bit 0 set, unless the page > > + * is PageTail(). Let's make sure that's true even for poisoned ->lru. > > + */ > > + BUILD_BUG_ON((unsigned long)LIST_POISON1 & 1); > > + > > + if (!IS_ENABLED(CONFIG_DEBUG_VM)) { > > + ret = 0; > > + goto out; > > + } > > if (unlikely(!PageTail(page))) { > > bad_page(page, "PageTail not set", 0); > > - return 1; > > + goto out; > > } > > - if (unlikely(page->first_page != head_page)) { > > - bad_page(page, "first_page not consistent", 0); > > - return 1; > > + if (unlikely(compound_head(page) != head_page)) { > > + bad_page(page, "compound_head not consistent", 0); > > + goto out; > > Same here, although for a DEBUG_VM config only it's not as important. Ditto. > > > } > > - return 0; > > + ret = 0; > > +out: > > + clear_compound_head(page); > > + return ret; > > } > > > > static void __meminit __init_single_page(struct page *page, unsigned long pfn, > > @@ -888,6 +898,8 @@ static void init_reserved_page(unsigned long pfn) > > #else > > static inline void init_reserved_page(unsigned long pfn) > > { > > + /* Avoid false-positive PageTail() */ > > + INIT_LIST_HEAD(&pfn_to_page(pfn)->lru); > > } > > #endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */ > > > > diff --git a/mm/swap.c b/mm/swap.c > > index a3a0a2f1f7c3..faa9e1687dea 100644 > > --- a/mm/swap.c > > +++ b/mm/swap.c > > @@ -200,7 +200,7 @@ out_put_single: > > __put_single_page(page); > > return; > > } > > - VM_BUG_ON_PAGE(page_head != page->first_page, page); > > + VM_BUG_ON_PAGE(page_head != compound_head(page), page); > > /* > > * We can release the refcount taken by > > * get_page_unless_zero() now that > > @@ -261,7 +261,7 @@ static void put_compound_page(struct page *page) > > * Case 3 is possible, as we may race with > > * __split_huge_page_refcount tearing down a THP page. > > */ > > - page_head = compound_head_by_tail(page); > > + page_head = compound_head(page); > > This is also in a path after PageTail() is true. We can only save one branch here: other PageTail() is most likely in other compilation unit and compiler would not be able to eliminate additional load. Why bother? -- Kirill A. Shutemov -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>