Hello everyone,

On Tue, Jun 04, 2013 at 08:30:51PM +0800, Wanpeng Li wrote:
> On Tue, Jun 04, 2013 at 04:57:57PM +0800, Jianguo Wu wrote:
> >Hi all,
> >
> >I tested memcpy with perf bench, and found that in prefault case, When Transparent Hugepage is on,
> >memcpy has worse performance.
> >
> >When THP on is 3.672879 GB/Sec (with prefault), while THP off is 6.190187 GB/Sec (with prefault).
> >
>
> I get similar result as you against 3.10-rc4 in the attachment. This
> dues to the characteristic of thp takes a single page fault for each
> 2MB virtual region touched by userland.

I had a look at what prefault does, and page faults should not be
involved in the measurement of GB/sec. The "stats" also include the
page faults, but the page faults are not part of the printed GB/sec
if "-o" is used.

If the perf test is correct, it looks more like a hardware issue with
memcpy and large TLBs than a software one. memset doesn't exhibit it;
if this were something fundamental, memset should exhibit it too.

It should be possible to reproduce this with hugetlbfs, in fact... if
you want to be 100% sure it's not software, you should try that.

Chances are there's enough prefetching going on in the CPU to
optimize for those 4k TLB loads in streaming copies, and the
pagetables are also cached very nicely with streaming copies. Maybe
large TLBs somewhere are less optimized for streaming copies. Only
something smarter happening in the CPU, optimized for 4k and not yet
for 2M TLBs, can explain this: if the CPU were equally intelligent, it
should definitely be faster with THP on, even with "-o".

Overall I doubt there's anything in software to fix here.

Also note, this is not related to the additional cache usage during
page faults that I mentioned in the pdf. Page faults, and cache
effects in the page faults, are completely removed from the equation
because of "-o".
The prefault pass eliminates the page faults and trashes all the
cache (regardless of whether the page fault uses non-temporal stores
or not) before the "measured" memcpy load starts.

I don't think this is a major concern. As a rule of thumb, you just
need to prefix the "perf" command with "time" to see it: the THP
version still completes much faster, even though the prefault part of
it is slightly slower with THP on.

THP pays off the most during computations that access memory
randomly, not sequentially.

Thanks,
Andrea
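
P.S. For anyone who wants to cross-check the numbers outside of perf,
here is a rough sketch of the same kind of measurement: two anonymous
buffers, an untimed prefault copy, then one timed copy, with THP
requested or refused per mapping through madvise(). It is not what
perf bench does internally. The function name memcpy_gbps, the 64MB
default size and the choice of MADV_HUGEPAGE/MADV_NOHUGEPAGE are
assumptions made for illustration only. Whether the kernel really
backs the buffers with 2M pages depends on the transparent_hugepage
sysfs setting, and the absolute figures are not comparable with the
ones perf prints.

```python
import mmap
import time

PAGE = 4096  # assumed base page size; only used as the touch stride


def memcpy_gbps(size=64 << 20, hugepage=True, prefault=True):
    """Time one copy between two anonymous mappings and return GB/s."""
    src = mmap.mmap(-1, size)
    dst = mmap.mmap(-1, size)
    try:
        # Request (or refuse) THP on both mappings. The constants are
        # missing on non-Linux builds, hence the getattr fallback.
        name = "MADV_HUGEPAGE" if hugepage else "MADV_NOHUGEPAGE"
        advice = getattr(mmap, name, None)
        if advice is not None:
            for buf in (src, dst):
                try:
                    buf.madvise(advice)
                except OSError:
                    pass  # e.g. kernel built without THP support

        # Touch the source so it is backed by real pages, not the zero page.
        for off in range(0, size, PAGE):
            src[off] = 1

        with memoryview(src) as s, memoryview(dst) as d:
            if prefault:
                d[:] = s  # untimed pass: takes the destination page faults
            start = time.perf_counter()
            d[:] = s  # timed pass
            elapsed = time.perf_counter() - start

        return size / max(elapsed, 1e-9) / 1e9
    finally:
        dst.close()
        src.close()


if __name__ == "__main__":
    for thp in (False, True):
        rate = memcpy_gbps(hugepage=thp)
        print("THP %s: %.3f GB/s (with prefault)" % ("on" if thp else "off", rate))
```

Passing prefault=False leaves the destination page faults inside the
timed copy, which is the effect "-o" is meant to remove from the
printed GB/sec.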