Re: [RFC 0/6] the big khugepaged redesign

Vlastimil Babka <vbabka@xxxxxxx> · Tue, 24 Feb 2015 11:32:30 +0100

On 02/23/2015 11:56 PM, Andrew Morton wrote:
On Mon, 23 Feb 2015 14:46:43 -0800 Davidlohr Bueso <dave@xxxxxxxxxxxx> wrote:

On Mon, 2015-02-23 at 13:58 +0100, Vlastimil Babka wrote:
Recently, there was concern expressed (e.g. [1]) whether the quite aggressive
THP allocation attempts on page faults are a good performance trade-off.

- THP allocations add to page fault latency, as high-order allocations are
   notoriously expensive. Page allocation slowpath now does extra checks for
   GFP_TRANSHUGE && !PF_KTHREAD to avoid the more expensive synchronous
   compaction for user page faults. But even async compaction can be expensive.
- During the first page fault in a 2MB range we cannot predict how much of the
   range will actually be accessed - we can theoretically waste as much as 511
   pages' worth of memory [2]. Or, the pages in the range might be accessed from
   CPUs on different NUMA nodes and while base pages could all be local, a THP
   could be remote to all but one CPU. The cost of remote accesses due to this
   false sharing would be higher than any savings on the TLB.
- The interactions with memcg are also problematic [1].
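
To make the "511 pages" figure above concrete, here is a minimal sketch of the
arithmetic. It assumes the common x86-64 sizes (4KB base pages, 2MB huge pages);
the helper name thp_wasted_pages() and the two macros are hypothetical and exist
only for this illustration, they are not kernel symbols.

```c
/* Assumed sizes for x86-64; other architectures use different values. */
#define BASE_PAGE_SIZE (4UL * 1024)          /* 4KB base page */
#define HUGE_PAGE_SIZE (2UL * 1024 * 1024)   /* 2MB transparent huge page */

/*
 * Hypothetical helper: number of base pages that a THP fault allocates
 * but that the process never touches, given how many base pages in the
 * huge-page-sized range are actually accessed.
 */
unsigned long thp_wasted_pages(unsigned long huge_size,
                               unsigned long base_size,
                               unsigned long touched)
{
	unsigned long total = huge_size / base_size;   /* 512 with the sizes above */

	/* Touching every base page (or more) means nothing is wasted. */
	return touched >= total ? 0 : total - touched;
}
```

With the assumed sizes, thp_wasted_pages(HUGE_PAGE_SIZE, BASE_PAGE_SIZE, 1)
gives 511: a process that touches a single base page after the fault leaves
511 of the 512 base pages (2044KB) allocated but unused.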

Now I don't have any hard data to show how big these problems are, and I
expect we will discuss this on LSF/MM (and hope somebody has such data [3]).
But it's certain that e.g. SAP recommends disabling THPs [4] for their apps
for performance reasons.

There are plenty of examples of this, e.g. for Oracle:

https://blogs.oracle.com/linux/entry/performance_issues_with_transparent_huge

hm, five months ago and I don't recall seeing any followup to this.

Actually it's a year and five months, but nevertheless...

Does anyone know what's happening?

I would suspect mmap_sem being held during the whole THP page fault
(including the needed reclaim and compaction), which I forgot to mention
in the first e-mail - it's not just a problem of page fault latency, but
also of potentially holding back other processes, which is why we should
allow shifting from THP page faults to deferred collapsing.
Although the attempts at opportunistic page faults without mmap_sem
would also help in this particular case.

Khugepaged also used to hold mmap_sem (for read) during the allocation
attempt, but that has since been fixed. It could also be zone lru_lock
pressure.
