On Thu, Sep 24, 2015 at 05:05:48PM +0200, Vlastimil Babka wrote:
> The problem is an endless loop in get_futex_key() when
> CONFIG_TRANSPARENT_HUGEPAGE is enabled and the s390x machine has emulated
> hugepages. The code tries to serialize against __split_huge_page_splitting(),
> but __get_user_pages_fast() fails on the hugetlbfs tail page. This happens
> because pmd_large() is false for emulated hugepages, so the code will proceed
> into gup_pte_range() and fail page_cache_get_speculative() through failing
> get_page_unless_zero() as the tail page count is zero. Failing __gup_fast is
> supposed to be temporary due to a race, so get_futex_key() will try again
> endlessly.
>
> This attempt for a fix is a bandaid solution and probably incomplete.
> Hopefully something better will emerge from the discussion. Fully fixing
> emulated hugepages just for stable backports is unlikely due to them being
> removed. Also THP refcounting redesign should soon remove the trickery from
> get_futex_key().

The THP refcounting redesign will simplify things a lot here, because
the head page cannot be freed from under us if we hold a reference on
the tail.

With the current split_huge_page, which cannot fail, it should be
possible to stop using __get_user_pages_fast to reach the head page and
to pin it before it can be freed from under us by using
compound_lock_irqsave too.

The old code could have done get_page on an already freed head page (if
the THP was split after compound_head returned), and this is why it
needed adjustment. Here we just need to safely get a refcount on the
head page.

If we do get_page_unless_zero() on the head page returned by
compound_head, take compound_lock_irqsave and check that the tail page
is still a tail (which means split_huge_page hasn't run yet, and it
cannot run anymore while we hold the compound_lock), then we can take a
reference on the head page. After we take a reference on the head we
just put_page the tail page and continue using the page_head.
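
To pin down the ordering, the sequence described above could be modelled
roughly as follows. This is an untested user-space toy, not kernel code,
and everything in it is an assumption made for illustration: struct page
is reduced to a refcount, a back-pointer that is non-NULL while the page
is a tail, and a lock flag; compound_lock()/compound_unlock() are simple
spinning stand-ins for compound_lock_irqsave()/compound_unlock_irqrestore();
and get_head_page() is only a hypothetical helper name. The final
put_page on the tail is left to the caller because the tail pin
accounting is not modelled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the kernel's struct page (assumption, not the real layout). */
struct page {
    atomic_int count;                   /* models page->_count */
    _Atomic(struct page *) first_page;  /* non-NULL while this page is a tail */
    atomic_bool locked;                 /* models the compound lock bit */
};

/* Take a reference only if the page has not already been freed. */
static bool get_page_unless_zero(struct page *p)
{
    int c = atomic_load(&p->count);

    while (c != 0) {
        /* On failure, c is reloaded with the current count. */
        if (atomic_compare_exchange_weak(&p->count, &c, c + 1))
            return true;
    }
    return false;
}

static void put_page(struct page *p)
{
    int old = atomic_fetch_sub(&p->count, 1);

    assert(old > 0);    /* dropping a reference we never held is a bug */
}

/* Spinning stand-ins for compound_lock_irqsave()/compound_unlock_irqrestore(). */
static void compound_lock(struct page *head)
{
    while (atomic_exchange(&head->locked, true))
        ;
}

static void compound_unlock(struct page *head)
{
    atomic_store(&head->locked, false);
}

/*
 * Hypothetical helper: return the head of 'page' with one extra reference,
 * or NULL if 'page' is not a tail, the head was already freed, or a split
 * won the race. No reference is taken on the tail itself.
 */
static struct page *get_head_page(struct page *page)
{
    struct page *head = atomic_load(&page->first_page);
    struct page *ret = NULL;

    if (!head)
        return NULL;                    /* not a tail (any more) */
    if (!get_page_unless_zero(head))
        return NULL;                    /* head was a dangling pointer */

    compound_lock(head);                /* a split cannot run from here on */
    if (atomic_load(&page->first_page) == head)
        ret = head;                     /* still a tail of this head */
    compound_unlock(head);

    if (!ret)
        put_page(head);                 /* lost the race: drop the reference */
    return ret;
}
```

On a NULL return the caller would simply retry the lookup; on success it
would drop its pin on the tail and work only with the returned head.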

It should be the very same logic as __get_page_tail, except we don't
want the refcount taken on the tail too (i.e. we must not increase the
mapcount and we should skip the get_huge_page_tail, or the head will be
freed again if split_huge_page runs as a result of MADV_DONTNEED and it
literally frees the head). We want only one more refcount on the head,
because the code then only works with page_head and we don't care about
the tail anymore. A new function get_head_page() may work for that and
avoid the pagetable walking.