Hi,

On Tue, Oct 15, 2013 at 01:46:08PM -0600, Khalid Aziz wrote:
> @@ -68,13 +69,26 @@ static void __put_compound_page(struct page *page)
> {
> 	compound_page_dtor *dtor;
>
> -	__page_cache_release(page);
> +	if (!PageHuge(page))
> +		__page_cache_release(page);
> 	dtor = get_compound_page_dtor(page);
> 	(*dtor)(page);

What is the above change about? The above two-liner is not even
upstream (what is the difference in 3.0 that requires it??). From a
very quick review it doesn't make sense to add a branch that removes
a branch. It just slows down all other cases with zero performance
gain for hugetlbfs. Gcc likely inlines the call anyway, so there's not
much to skip other than the test_bit vs. dtor comparison (if anything,
the PageLRU test should be faster even for hugetlbfs). In addition to
being a slowdown for everything, probably including hugetlbfs, it
looks risky overall: if any hugepage ever ended up on the LRU, this
change would lead to a kernel crash. I'd say the above is not ok for
stable (nor for upstream, in fact).

> }
>
> static void put_compound_page(struct page *page)
> {
> +	/*
> +	 * hugetlbfs pages cannot be split from under us. If this is a
> +	 * hugetlbfs page, check refcount on head page and release the page if
> +	 * the refcount becomes zero.
> +	 */
> +	if (PageHuge(page)) {
> +		page = compound_head(page);
> +		if (put_page_testzero(page))
> +			__put_compound_page(page);
> +		return;
> +	}
> +
> 	if (unlikely(PageTail(page))) {
> 		/* __split_huge_page_refcount can run under us */
> 		struct page *page_head = compound_trans_head(page);
> @@ -159,8 +173,20 @@ bool __get_page_tail(struct page *page)
> 	 */
> 	unsigned long flags;
> 	bool got = false;
> -	struct page *page_head = compound_trans_head(page);
> +	struct page *page_head;
> +
> +	/*
> +	 * If this is a hugetlbfs page it cannot be split under us. Simply
> +	 * increment refcount for the head page.
> +	 */
> +	if (PageHuge(page)) {
> +		page_head = compound_head(page);
> +		atomic_inc(&page_head->_count);
> +		got = true;
> +		goto out;
> +	}
>
> +	page_head = compound_trans_head(page);
> 	if (likely(page != page_head && get_page_unless_zero(page_head))) {
> 		/*
> 		 * page_head wasn't a dangling pointer but it
> @@ -178,6 +204,7 @@ bool __get_page_tail(struct page *page)
> 	if (unlikely(!got))
> 		put_page(page_head);
> 	}
> +out:
> 	return got;
> }
> EXPORT_SYMBOL(__get_page_tail);

The rest is ok, but I had feedback from real Oracle benchmarks on a
double FusionIO setup (capable of 3GB/sec x 2), i.e. the only case
where this optimization can matter, and there was absolutely zero
performance difference measured with this patch applied vs. stock. So
the whole patch is a dubious candidate for -stable. I have to guess
Orasim is way, way more I/O intensive than any real workload could
ever be. You should try with a real load too. Also note that the
Oracle block size is never 1MB AFAIK; the 64KB case is way more
interesting for AIO loads.

Still, I'm obviously ok with optimizing directio on hugetlbfs even if
it only shows up as an optimization for Orasim, but again I don't
think it's so important for -stable. Furthermore, this can only make a
difference with non-commodity, super-fast SSD hardware (8GB/sec). I
didn't even try benchmarking this on my commodity SSDs, which don't
exceed 500MB/sec.

For the long term I'm overall skeptical of the above changes if they
only optimize hugetlbfs. THP is likely going to be extended to tmpfs
(see Kirill and Ning's work), and that will be a perfect fit for
Oracle AMM (Automatic Memory Management), which can practically
replace hugetlbfs.

==== quote =====
Automatic Memory Management and HugePages on Linux are not compatible,
which means AMM is probably not a sensible option for any large
systems.
==== quote =====

For now tmpfs isn't supported, so you can only take the boost on
anonymous memory while running Oracle, but for certain appliances
that's a significant speedup already.

About the concern for the race between threads + fork + directio + COW
that upstream refuses to fix, I take the opportunity to remind you
that in Oracle proprietary code it's compulsory to mark every single
anon vma that is a target of directio in a threaded app that also runs
fork, with MADV_DONTFORK (like KVM has done for ages). Failure to do
so is fatal for userland data on upstream kernels. This isn't really
related to THP on/off at all. The race window would increase to 2MB
for hugetlbfs too if using MAP_PRIVATE with
LD_PRELOAD=libhugetlbfs.so. There's no safer and simpler way to
protect against this annoying, unfixed source of data corruption than
MADV_DONTFORK. Unfortunately the fixes for this one were not pretty,
and they added a tiny overhead to fork and directio (no other kernel
path was affected).
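To make this concrete, here's a minimal, untested userspace sketch of
the pattern I mean (the "datafile" name and the 2MB buffer size are
just placeholders for illustration):

#define _GNU_SOURCE	/* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define BUF_SIZE	(2UL * 1024 * 1024)	/* placeholder size */

int main(void)
{
	/* anon mapping, page aligned, so it satisfies O_DIRECT alignment */
	void *buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/*
	 * Mark the vma so fork() doesn't duplicate it into the child.
	 * Without this, a fork() racing with the directio read below
	 * can trigger COW on the buffer and silently corrupt userland
	 * data.
	 */
	if (madvise(buf, BUF_SIZE, MADV_DONTFORK)) {
		perror("madvise(MADV_DONTFORK)");
		return 1;
	}

	int fd = open("datafile", O_RDONLY | O_DIRECT);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* a fork() in another thread can no longer race with this read */
	if (read(fd, buf, BUF_SIZE) < 0)
		perror("read");

	close(fd);
	munmap(buf, BUF_SIZE);
	return 0;
}

The madvise() has to cover every anon mapping that directio targets;
fork() then simply doesn't copy those vmas into the child, so there's
no COW left to race with.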
Thanks,
Andrea