Re: [PATCHv8 00/32] THP-enabled tmpfs/shmem using compound pages

On Mon, Jun 6, 2016 at 9:51 AM, Kirill A. Shutemov <kirill@xxxxxxxxxxxxx> wrote:
On Wed, May 25, 2016 at 03:11:55PM -0400, neha agarwal wrote:
> Hi All,
>
> I have been testing Hugh's and Kirill's huge tmpfs patch sets with
> Cassandra (a NoSQL database). I am seeing a significant performance gap
> (~30%) between the two implementations: Hugh's performs better than
> Kirill's, and I am surprised the gap is that large. My test setup is as
> follows.
>
> Patchsets
> ========
> - For Hugh's:
> I checked out 4.6-rc3, applied Hugh's preliminary patches (01 to 10) from
> https://lkml.org/lkml/2016/4/5/792, and then applied the THP patches posted
> on April 16 (01 to 29).
>
> - For Kirill's:
> I am using his branch
> "git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git hugetmpfs/v8",
> which is based on 4.6-rc3 and was posted on May 12.
>
>
> Khugepaged settings
> ================
> cd /sys/kernel/mm/transparent_hugepage
> echo 10 >khugepaged/alloc_sleep_millisecs
> echo 10 >khugepaged/scan_sleep_millisecs
> echo 511 >khugepaged/max_ptes_none
>
>
> Mount options
> ===========
> - For Hugh's:
> sudo sysctl -w vm/shmem_huge=2
> sudo mount -o remount,huge=1 /hugetmpfs
>
> - For Kirill's:
> sudo mount -o remount,huge=always /hugetmpfs
> echo force > /sys/kernel/mm/transparent_hugepage/shmem_enabled
> echo 511 >khugepaged/max_ptes_swap
>
>
> Workload Setting
> =============
> Please look at the attached setup document for Cassandra (NoSQL database):
> cassandra-setup.txt
>
>
> Machine setup
> ===========
> It is a 36-core (72 hardware thread) dual-socket x86 server with 512 GB of
> RAM running Ubuntu. I use control groups for resource isolation, and the
> server and client threads run on different sockets. The frequency governor
> is set to "performance" to remove fluctuations caused by frequency scaling.
>
>
> Throughput numbers
> ================
> Hugh's implementation: 74522.08 ops/sec
> Kirill's implementation: 54919.10 ops/sec

In my setup I don't see the difference:

v4.7-rc1 + my implementation:
[OVERALL], RunTime(ms), 822862.0
[OVERALL], Throughput(ops/sec), 60763.53021527304
ShmemPmdMapped:  4999168 kB

v4.6-rc2 + Hugh's implementation:
[OVERALL], RunTime(ms), 833157.0
[OVERALL], Throughput(ops/sec), 60012.698687042175
ShmemPmdMapped:  5021696 kB

That's basically within measurement error. 'ShmemPmdMapped' indicates how
much memory is mapped with huge pages at the end of the test.
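
For reference, a minimal way to sample that counter yourself (assuming the
kernel under test exports it in /proc/meminfo, which is where the numbers
above come from) is:

# ShmemPmdMapped = shmem/tmpfs memory currently mapped with huge (PMD) pages
grep ShmemPmdMapped /proc/meminfo
# or sample it periodically while the benchmark runs:
watch -n 5 'grep ShmemPmdMapped /proc/meminfo'

A value close to the size of Cassandra's mapped data suggests the working
set is mostly backed by PMD-mapped huge pages.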

This is on a dual-socket 24-core machine with 64G of RAM.

I guess there is some configuration difference between our setups, but so
far I don't see the drastic performance gap you've pointed to.

Maybe my implementation is slower on bigger machines, I don't know.
There's no architectural reason for this.

I'll post my updated patchset today.

--
 Kirill A. Shutemov

Thanks a lot for the testing, Kirill. It is interesting that you don't see any significant performance difference. Your absolute throughput numbers are also different from mine, especially for Hugh's implementation.

Can you please share your kernel config file? I will check whether any of my config settings differ. Also, I am assuming that you had DVFS turned off.
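
To make the comparison concrete, something along these lines should work from a kernel source checkout (the config paths are only placeholders; scripts/diffconfig ships with the kernel tree):

# show only the config options that differ between the two kernels
cd linux
scripts/diffconfig /path/to/config-hugh /path/to/config-kirill

# double-check that every CPU is actually on the "performance" governor
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | sort | uniq -c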

One thing I forgot to mention in my previous setup email: I use 8 cores for running the Cassandra server threads. Can you please tell me how many cores you used? Since Cassandra is CPU bound, that can make a difference in the throughput numbers we are seeing.
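
In case it is useful, the isolation I am describing is a cpuset cgroup roughly like the following (cgroup v1 as on my Ubuntu box; the core and memory-node numbers are only an example and depend on the machine's topology):

# pin the Cassandra server threads to 8 cores on one socket
sudo mkdir /sys/fs/cgroup/cpuset/cassandra
echo 0-7 | sudo tee /sys/fs/cgroup/cpuset/cassandra/cpuset.cpus
echo 0   | sudo tee /sys/fs/cgroup/cpuset/cassandra/cpuset.mems
# CASSANDRA_PID is the pid of the running Cassandra server process
echo "$CASSANDRA_PID" | sudo tee /sys/fs/cgroup/cpuset/cassandra/tasks

The client threads get a similar cpuset on the other socket so the two do not compete for cores.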


--
Thanks and Regards,
Neha
