Re: [PATCH v2] mm: fix aio performance regression for database caused by THP (backport to stable)

Andrea Arcangeli <aarcange@xxxxxxxxxx> · Fri, 18 Oct 2013 01:34:05 +0200

Hi,

On Tue, Oct 15, 2013 at 01:46:08PM -0600, Khalid Aziz wrote:
> @@ -68,13 +69,26 @@ static void __put_compound_page(struct page *page)
>  {
>  	compound_page_dtor *dtor;
>  
> -	__page_cache_release(page);
> +	if (!PageHuge(page))
> +		__page_cache_release(page);
>  	dtor = get_compound_page_dtor(page);
>  	(*dtor)(page);

What is the above change about?

The above two liner is not even upstream (what is the differece in 3.0
that requires it??). From a very quick review it doesn't make sense to
add a branch that removes a branch. It just slowdown all other cases
but at zero performance gain for hugetlbfs.

Gcc likely inlines the call anyway so there's no much to sikp other
than testbit vs dtor comparison (if something PageLRU shall be faster
even for hugetlbfs).

In addition of just being a slowdown for everything, probably
including hugetlbfs, it looks risky overall, if any hugepage would end
up in the LRU this change would lead to a kernel crash. I'd say the
above not ok for stable (nor for upstream in fact).

>  }
>  
>  static void put_compound_page(struct page *page)
>  {
> +	/*
> +	 * hugetlbfs pages cannot be split from under us.  If this is a
> +	 * hugetlbfs page, check refcount on head page and release the page if
> +	 * the refcount becomes zero.
> +	 */
> +	if (PageHuge(page)) {
> +		page = compound_head(page);
> +		if (put_page_testzero(page))
> +			__put_compound_page(page);
> +		return;
> +	}
> +
>  	if (unlikely(PageTail(page))) {
>  		/* __split_huge_page_refcount can run under us */
>  		struct page *page_head = compound_trans_head(page);
> @@ -159,8 +173,20 @@ bool __get_page_tail(struct page *page)
>  	 */
>  	unsigned long flags;
>  	bool got = false;
> -	struct page *page_head = compound_trans_head(page);
> +	struct page *page_head;
> +
> +	/*
> +	 * If this is a hugetlbfs page it cannot be split under us.  Simply
> +	 * increment refcount for the head page.
> +	 */
> +	if (PageHuge(page)) {
> +		page_head = compound_head(page);
> +		atomic_inc(&page_head->_count);
> +		got = true;
> +		goto out;
> +	}
>  
> +	page_head = compound_trans_head(page);
>  	if (likely(page != page_head && get_page_unless_zero(page_head))) {
>  		/*
>  		 * page_head wasn't a dangling pointer but it
> @@ -178,6 +204,7 @@ bool __get_page_tail(struct page *page)
>  		if (unlikely(!got))
>  			put_page(page_head);
>  	}
> +out:
>  	return got;
>  }
>  EXPORT_SYMBOL(__get_page_tail);

Rest is ok, but I had feedback from real Oracle benchmarks on double
FusionIO (capable of 3GB/sec x 2) (i.e. the only case were this
optimization can matter) and there was absolutely zero performance
difference measured with this patch applied vs stock.

So the whole patch is a dubious candidate for -stable.

I have to guess Orasim is way way more I/O intensive than any real
workload could ever be. You should try with real load too.

Also note, Oracle blocksize is never 1MB AFIK, the 64kb case is way
more interesting for aio loads.

Still I'm obviously ok with optimizing directio on hugetlbfs even if
it only showup as an optimization for Orasim, but again I don't think
it's so important for -stable.

Furthermore this can only make difference with non-commodity super
fast SSD hardware (8GB/sec). I didn't even try benchmarking this on my
commodity SSDs not exceeding 500MB/sec.

For the long term I'm overall skeptical of the above changes if they
only optimize hugetlbfs.

THP is likely going to be extended to tmpfs (see Kirill and Ning
work), and that will be a perfect fit for Oracle AMM (Automatic Memory
Management) that can practically replace hugetlbfs.

==== quote =====
Automatic Memory Management and HugePages on Linux are not compatible,
which means AMM is probably not a sensible option for any large
systems. 
==== quote =====

For now tmpfs isn't supported you can only take the boost on anonymous
memory while running Oracle, but for certain appliances that's
significant speedup already.

About the concern for the race that upstream refuses to fix between
threads + fork + directio + COW, I take opportunity to remind you that
it's compulsory to mark every single anon vma that is target of
directio in a threaded app running fork too, with MADV_DONTFORK in
Oracle proprietary code (like KVM does for ages). Failure to do so, is
fatal for userland data running on upstream kernels. This isn't really
related to THP on/off at all. The race window would increase to 2m for
hugetlbfs too if using MAP_PRIVATE within
LD_RELOAD=libhugetlbfs.so. There's no safer and simpler way to protect
against this annoying data corruption source left unfixed, than
MADV_DONTFORK. Unfortunately the fixes were not pretty for this one
and they added tiny overhead to fork and directio (no other kernel
path was affected).

Thanks,
Andrea
--
To unsubscribe from this list: send the line "unsubscribe stable" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html