Hi all!
I'm experimenting with GlusterFS for the first time and have built a simple three-node cluster using AWS 'i3en' type instances. These instances provide raw nvme devices that are incredibly fast.
What I'm finding in these tests is that gluster offers only a fraction of the raw nvme performance in a 3-replica set (i.e. 3 nodes with 1 brick each). I'm wondering if there is anything I can do to squeeze more performance out.
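For reference, the volume was created roughly like this (the volume name, hostnames and brick paths below are placeholders, not the exact ones I used):
$ # gv0, node1-3 and /data/brick1/gv0 are placeholder names
$ gluster volume create gv0 replica 3 \
    node1:/data/brick1/gv0 \
    node2:/data/brick1/gv0 \
    node3:/data/brick1/gv0
$ gluster volume start gv0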
For testing, I'm running fio against a 16GB test file with a 75/25 read/write split. Basically I'm trying to approximate a MySQL workload, since that's what I'd ideally like to host here (which I realise is probably not practical).
My fio test command is:
$ fio --name=fio-test2 --filename=fio-test \
--randrepeat=1 \
--ioengine=libaio \
--direct=1 \
--runtime=300 \
--bs=16k \
--iodepth=64 \
--size=16G \
--readwrite=randrw \
--rwmixread=75 \
--group_reporting \
--numjobs=4
When I test this command directly on the nvme disk, I get:
READ: bw=313MiB/s (328MB/s), 313MiB/s-313MiB/s (328MB/s-328MB/s), io=47.0GiB (51.5GB), run=156806-156806msec
WRITE: bw=105MiB/s (110MB/s), 105MiB/s-105MiB/s (110MB/s-110MB/s), io=16.0GiB (17.2GB), run=156806-156806msec
When I install the disk into a gluster 3-replica volume, I get:
READ: bw=86.3MiB/s (90.5MB/s), 86.3MiB/s-86.3MiB/s (90.5MB/s-90.5MB/s), io=25.3GiB (27.2GB), run=300002-300002msec
WRITE: bw=28.9MiB/s (30.3MB/s), 28.9MiB/s-28.9MiB/s (30.3MB/s-30.3MB/s), io=8676MiB (9098MB), run=300002-300002msec
If I do the same with only 2 replicas, I get the same performance results. I also see roughly the same values when running 'read', 'randread', 'write', and 'randwrite' tests.
I'm testing directly on one of the storage nodes, so there are no variables like client/server network performance in the mix.
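(For the gluster runs I'm mounting the volume locally via the FUSE client and pointing fio at a file on that mount, something like this, with placeholder names again:)
$ # gv0 and /mnt/gv0 are placeholders
$ mount -t glusterfs localhost:/gv0 /mnt/gv0
$ cd /mnt/gv0 && fio ...   # same fio command as above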
I ran the same test with EBS volumes and saw similar performance drops when offering the volume up through gluster. A "Provisioned IOPS" EBS volume that could deliver 10,000 IOPS directly was getting only about 3,500 IOPS when running as part of a gluster volume.
We're using TLS on the management and volume connections, but I'm not seeing any CPU or memory constraint when using these volumes, so I don't believe that is the bottleneck. Similarly, when I try with SSL turned off, I see no change in performance.
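For the SSL-off run I just cleared the SSL options on the volume, roughly like this (volume name is a placeholder):
$ # gv0 is a placeholder volume name
$ gluster volume set gv0 client.ssl off
$ gluster volume set gv0 server.ssl off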
Does anyone have any suggestions on things I might try to increase performance when using these very fast disks as part of a gluster volume, or is this to be expected given all the extra work gluster has to do to replicate data across the volume?
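If it would help diagnose things, I'm happy to gather per-brick stats with gluster's built-in profiler during a run, something along the lines of (placeholder volume name again):
$ gluster volume profile gv0 start
$ # ... run the fio test ...
$ gluster volume profile gv0 info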
Thanks very much for your time!! I'll put the two full fio outputs below if anyone wants more details.
Mike
- First full fio test, nvme device without gluster
fio-test: (groupid=0, jobs=4): err= 0: pid=5636: Sat Jan 4 23:09:18 2020
read: IOPS=20.0k, BW=313MiB/s (328MB/s)(47.0GiB/156806msec)
slat (usec): min=3, max=6476, avg=88.44, stdev=326.96
clat (usec): min=218, max=89292, avg=11141.58, stdev=1871.14
lat (usec): min=226, max=89311, avg=11230.16, stdev=1883.88
clat percentiles (usec):
| 1.00th=[ 3654], 5.00th=[ 8455], 10.00th=[ 9372], 20.00th=[10159],
| 30.00th=[10552], 40.00th=[10814], 50.00th=[11076], 60.00th=[11338],
| 70.00th=[11731], 80.00th=[12256], 90.00th=[13042], 95.00th=[13960],
| 99.00th=[15795], 99.50th=[16581], 99.90th=[19268], 99.95th=[23200],
| 99.99th=[36439]
bw ( KiB/s): min=75904, max=257120, per=25.00%, avg=80178.59, stdev=9421.58, samples=1252
iops : min= 4744, max=16070, avg=5011.15, stdev=588.85, samples=1252
write: IOPS=6702, BW=105MiB/s (110MB/s)(16.0GiB/156806msec); 0 zone resets
slat (usec): min=4, max=5587, avg=88.52, stdev=325.86
clat (usec): min=54, max=29847, avg=4491.18, stdev=1481.06
lat (usec): min=63, max=29859, avg=4579.83, stdev=1508.50
clat percentiles (usec):
| 1.00th=[ 947], 5.00th=[ 1975], 10.00th=[ 2737], 20.00th=[ 3458],
| 30.00th=[ 3916], 40.00th=[ 4178], 50.00th=[ 4424], 60.00th=[ 4686],
| 70.00th=[ 5014], 80.00th=[ 5473], 90.00th=[ 6259], 95.00th=[ 6980],
| 99.00th=[ 8717], 99.50th=[ 9503], 99.90th=[10945], 99.95th=[11600],
| 99.99th=[13698]
bw ( KiB/s): min=23296, max=86432, per=25.00%, avg=26812.24, stdev=3375.69, samples=1252
iops : min= 1456, max= 5402, avg=1675.75, stdev=210.98, samples=1252
lat (usec) : 100=0.01%, 250=0.01%, 500=0.06%, 750=0.11%, 1000=0.10%
lat (msec) : 2=1.12%, 4=7.69%, 10=28.88%, 20=61.95%, 50=0.06%
lat (msec) : 100=0.01%
cpu : usr=1.56%, sys=7.85%, ctx=1905114, majf=0, minf=56
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued rwts: total=3143262,1051042,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=64
Run status group 0 (all jobs):
READ: bw=313MiB/s (328MB/s), 313MiB/s-313MiB/s (328MB/s-328MB/s), io=47.0GiB (51.5GB), run=156806-156806msec
WRITE: bw=105MiB/s (110MB/s), 105MiB/s-105MiB/s (110MB/s-110MB/s), io=16.0GiB (17.2GB), run=156806-156806msec
Disk stats (read/write):
dm-4: ios=3455484/1154933, merge=0/0, ticks=35815316/4420412, in_queue=40257384, util=100.00%, aggrios=3456894/1155354, aggrmerge=0/0, aggrticks=35806896/4414972, aggrin_queue=40309192, aggrutil=99.99%
dm-2: ios=3456894/1155354, merge=0/0, ticks=35806896/4414972, in_queue=40309192, util=99.99%, aggrios=1728447/577677, aggrmerge=0/0, aggrticks=17902352/2207092, aggrin_queue=20122108, aggrutil=100.00%
dm-1: ios=3456894/1155354, merge=0/0, ticks=35804704/4414184, in_queue=40244216, util=100.00%, aggrios=3143273/1051086, aggrmerge=313621/104268, aggrticks=32277972/3937619, aggrin_queue=36289488, aggrutil=100.00%
nvme0n1: ios=3143273/1051086, merge=313621/104268, ticks=32277972/3937619, in_queue=36289488, util=100.00%
dm-0: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
- Second full fio test, nvme device as part of a gluster volume
fio-test2: (groupid=0, jobs=4): err= 0: pid=5537: Sat Jan 4 23:30:28 2020
read: IOPS=5525, BW=86.3MiB/s (90.5MB/s)(25.3GiB/300002msec)
slat (nsec): min=1159, max=894687k, avg=9822.60, stdev=990825.87
clat (usec): min=963, max=3141.5k, avg=37455.28, stdev=123109.88
lat (usec): min=968, max=3141.5k, avg=37465.21, stdev=123121.94
clat percentiles (msec):
| 1.00th=[ 7], 5.00th=[ 8], 10.00th=[ 8], 20.00th=[ 9],
| 30.00th=[ 9], 40.00th=[ 9], 50.00th=[ 10], 60.00th=[ 10],
| 70.00th=[ 11], 80.00th=[ 12], 90.00th=[ 48], 95.00th=[ 180],
| 99.00th=[ 642], 99.50th=[ 860], 99.90th=[ 1435], 99.95th=[ 1687],
| 99.99th=[ 2022]
bw ( KiB/s): min= 31, max=93248, per=26.30%, avg=23247.24, stdev=20716.86, samples=2280
iops : min= 1, max= 5828, avg=1452.92, stdev=1294.81, samples=2280
write: IOPS=1850, BW=28.9MiB/s (30.3MB/s)(8676MiB/300002msec); 0 zone resets
slat (usec): min=21, max=1586.3k, avg=2117.71, stdev=23082.86
clat (usec): min=20, max=2614.0k, avg=23888.03, stdev=99651.34
lat (usec): min=225, max=3141.2k, avg=26006.49, stdev=104758.57
clat percentiles (usec):
| 1.00th=[ 889], 5.00th=[ 2343], 10.00th=[ 3654],
| 20.00th=[ 5276], 30.00th=[ 5997], 40.00th=[ 6456],
| 50.00th=[ 6849], 60.00th=[ 7177], 70.00th=[ 7504],
| 80.00th=[ 7963], 90.00th=[ 8979], 95.00th=[ 74974],
| 99.00th=[ 513803], 99.50th=[ 717226], 99.90th=[1333789],
| 99.95th=[1518339], 99.99th=[1803551]
bw ( KiB/s): min= 31, max=30240, per=27.05%, avg=8009.39, stdev=6912.26, samples=2217
iops : min= 1, max= 1890, avg=500.56, stdev=432.02, samples=2217
lat (usec) : 50=0.03%, 100=0.02%, 250=0.01%, 500=0.06%, 750=0.08%
lat (usec) : 1000=0.11%
lat (msec) : 2=0.66%, 4=1.97%, 10=71.07%, 20=14.47%, 50=2.69%
lat (msec) : 100=2.23%, 250=3.21%, 500=1.94%, 750=0.82%, 1000=0.31%
cpu : usr=0.59%, sys=1.19%, ctx=1172180, majf=0, minf=56
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued rwts: total=1657579,555275,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=64
Run status group 0 (all jobs):
READ: bw=86.3MiB/s (90.5MB/s), 86.3MiB/s-86.3MiB/s (90.5MB/s-90.5MB/s), io=25.3GiB (27.2GB), run=300002-300002msec
WRITE: bw=28.9MiB/s (30.3MB/s), 28.9MiB/s-28.9MiB/s (30.3MB/s-30.3MB/s), io=8676MiB (9098MB), run=300002-300002msec