Hi all!
I'm experimenting with GlusterFS for the first time and have built a simple three-node cluster using AWS 'i3en' type instances. These instances provide raw nvme devices that are incredibly fast.
What I'm finding in these tests is that gluster offers only a fraction of the raw nvme performance in a 3-replica set (i.e. 3 nodes with 1 brick each). I'm wondering if there is anything I can do to squeeze more performance out.
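For reference, the volume was created roughly like this (the volume name, hostnames and brick paths below are placeholders, not the exact ones I used):
$ # gv0, node1-3 and /data/brick1/gv0 are placeholder names
$ gluster volume create gv0 replica 3 \
    node1:/data/brick1/gv0 \
    node2:/data/brick1/gv0 \
    node3:/data/brick1/gv0
$ gluster volume start gv0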
For testing, I'm running fio against a 16GB test file with a 75/25 read/write split. Basically I'm trying to approximate a MySQL workload, since that's what I'd ideally like to host here (which I realise is probably not practical).
My fio test command is:
$ fio --name=fio-test2 --filename=fio-test \
--randrepeat=1 \
--ioengine=libaio \
--direct=1 \
--runtime=300 \
--bs=16k \
--iodepth=64 \
--size=16G \
--readwrite=randrw \
--rwmixread=75 \
--group_reporting \
--numjobs=4
When I test this command directly on the nvme disk, I get:
READ: bw=313MiB/s (328MB/s), 313MiB/s-313MiB/s (328MB/s-328MB/s), io=47.0GiB (51.5GB), run=156806-156806msec
WRITE: bw=105MiB/s (110MB/s), 105MiB/s-105MiB/s (110MB/s-110MB/s), io=16.0GiB (17.2GB), run=156806-156806msec
When I install the disk into a gluster 3-replica volume, I get:
READ: bw=86.3MiB/s (90.5MB/s), 86.3MiB/s-86.3MiB/s (90.5MB/s-90.5MB/s), io=25.3GiB (27.2GB), run=300002-300002msec
WRITE: bw=28.9MiB/s (30.3MB/s), 28.9MiB/s-28.9MiB/s (30.3MB/s-30.3MB/s), io=8676MiB (9098MB), run=300002-300002msec
If I do the same with only 2 replicas, I get the same performance results. I also see roughly the same values when running 'read', 'randread', 'write', and 'randwrite' tests.
I'm testing directly on one of the storage nodes, so there are no variables like client/server network performance in the mix.
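(For the gluster runs I'm mounting the volume locally via the FUSE client and pointing fio at a file on that mount, something like this, with placeholder names again:)
$ # gv0 and /mnt/gv0 are placeholders
$ mount -t glusterfs localhost:/gv0 /mnt/gv0
$ cd /mnt/gv0 && fio ...   # same fio command as above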
I ran the same test with EBS volumes and saw similar performance drops when offering the volume up through gluster. A "Provisioned IOPS" EBS volume that could deliver 10,000 IOPS directly was getting only about 3,500 IOPS when running as part of a gluster volume.
We're using TLS on the management and volume connections, but I'm not seeing any CPU or memory constraint when using these volumes, so I don't believe that is the bottleneck. Similarly, when I try with SSL turned off, I see no change in performance.
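For the SSL-off run I just cleared the SSL options on the volume, roughly like this (volume name is a placeholder):
$ # gv0 is a placeholder volume name
$ gluster volume set gv0 client.ssl off
$ gluster volume set gv0 server.ssl off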
Does anyone have any suggestions on things I might try to increase performance when using these very fast disks as part of a gluster volume, or is this to be expected given all the extra work gluster has to do to replicate data across the volume?
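If it would help diagnose things, I'm happy to gather per-brick stats with gluster's built-in profiler during a run, something along the lines of (placeholder volume name again):
$ gluster volume profile gv0 start
$ # ... run the fio test ...
$ gluster volume profile gv0 info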
Thanks very much for your time!! I'll put the two full fio outputs below if anyone wants more details.
Mike
- First full fio test, nvme device without gluster
fio-test: (groupid=0, jobs=4): err= 0: pid=5636: Sat Jan 4 23:09:18 2020
read: IOPS=20.0k, BW=313MiB/s (328MB/s)(47.0GiB/156806msec)
slat (usec): min=3, max=6476, avg=88.44, stdev=326.96
clat (usec): min=218, max=89292, avg=11141.58, stdev=1871.14
lat (usec): min=226, max=89311, avg=11230.16, stdev=1883.88
clat percentiles (usec):
| 1.00th=[ 3654], 5.00th=[ 8455], 10.00th=[ 9372], 20.00th=[10159],
| 30.00th=[10552], 40.00th=[10814], 50.00th=[11076], 60.00th=[11338],
| 70.00th=[11731], 80.00th=[12256], 90.00th=[13042], 95.00th=[13960],
| 99.00th=[15795], 99.50th=[16581], 99.90th=[19268], 99.95th=[23200],
| 99.99th=[36439]
bw ( KiB/s): min=75904, max=257120, per=25.00%, avg=80178.59, stdev=9421.58, samples=1252
iops : min= 4744, max=16070, avg=5011.15, stdev=588.85, samples=1252
write: IOPS=6702, BW=105MiB/s (110MB/s)(16.0GiB/156806msec); 0 zone resets
slat (usec): min=4, max=5587, avg=88.52, stdev=325.86
clat (usec): min=54, max=29847, avg=4491.18, stdev=1481.06
lat (usec): min=63, max=29859, avg=4579.83, stdev=1508.50
clat percentiles (usec):
| 1.00th=[ 947], 5.00th=[ 1975], 10.00th=[ 2737], 20.00th=[ 3458],
| 30.00th=[ 3916], 40.00th=[ 4178], 50.00th=[ 4424], 60.00th=[ 4686],
| 70.00th=[ 5014], 80.00th=[ 5473], 90.00th=[ 6259], 95.00th=[ 6980],
| 99.00th=[ 8717], 99.50th=[ 9503], 99.90th=[10945], 99.95th=[11600],
| 99.99th=[13698]
bw ( KiB/s): min=23296, max=86432, per=25.00%, avg=26812.24, stdev=3375.69, samples=1252
iops : min= 1456, max= 5402, avg=1675.75, stdev=210.98, samples=1252
lat (usec) : 100=0.01%, 250=0.01%, 500=0.06%, 750=0.11%, 1000=0.10%
lat (msec) : 2=1.12%, 4=7.69%, 10=28.88%, 20=61.95%, 50=0.06%
lat (msec) : 100=0.01%
cpu : usr=1.56%, sys=7.85%, ctx=1905114, majf=0, minf=56
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued rwts: total=3143262,1051042,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=64
Run status group 0 (all jobs):
READ: bw=313MiB/s (328MB/s), 313MiB/s-313MiB/s (328MB/s-328MB/s), io=47.0GiB (51.5GB), run=156806-156806msec
WRITE: bw=105MiB/s (110MB/s), 105MiB/s-105MiB/s (110MB/s-110MB/s), io=16.0GiB (17.2GB), run=156806-156806msec
Disk stats (read/write):
dm-4: ios=3455484/1154933, merge=0/0, ticks=35815316/4420412, in_queue=40257384, util=100.00%, aggrios=3456894/1155354, aggrmerge=0/0, aggrticks=35806896/4414972, aggrin_queue=40309192, aggrutil=99.99%
dm-2: ios=3456894/1155354, merge=0/0, ticks=35806896/4414972, in_queue=40309192, util=99.99%, aggrios=1728447/577677, aggrmerge=0/0, aggrticks=17902352/2207092, aggrin_queue=20122108, aggrutil=100.00%
dm-1: ios=3456894/1155354, merge=0/0, ticks=35804704/4414184, in_queue=40244216, util=100.00%, aggrios=3143273/1051086, aggrmerge=313621/104268, aggrticks=32277972/3937619, aggrin_queue=36289488, aggrutil=100.00%
nvme0n1: ios=3143273/1051086, merge=313621/104268, ticks=32277972/3937619, in_queue=36289488, util=100.00%
dm-0: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
- Second full fio test, nvme device as part of a gluster volume
fio-test2: (groupid=0, jobs=4): err= 0: pid=5537: Sat Jan 4 23:30:28 2020
read: IOPS=5525, BW=86.3MiB/s (90.5MB/s)(25.3GiB/300002msec)
slat (nsec): min=1159, max=894687k, avg=9822.60, stdev=990825.87
clat (usec): min=963, max=3141.5k, avg=37455.28, stdev=123109.88
lat (usec): min=968, max=3141.5k, avg=37465.21, stdev=123121.94
clat percentiles (msec):
| 1.00th=[ 7], 5.00th=[ 8], 10.00th=[ 8], 20.00th=[ 9],
| 30.00th=[ 9], 40.00th=[ 9], 50.00th=[ 10], 60.00th=[ 10],
| 70.00th=[ 11], 80.00th=[ 12], 90.00th=[ 48], 95.00th=[ 180],
| 99.00th=[ 642], 99.50th=[ 860], 99.90th=[ 1435], 99.95th=[ 1687],
| 99.99th=[ 2022]
bw ( KiB/s): min= 31, max=93248, per=26.30%, avg=23247.24, stdev=20716.86, samples=2280
iops : min= 1, max= 5828, avg=1452.92, stdev=1294.81, samples=2280
write: IOPS=1850, BW=28.9MiB/s (30.3MB/s)(8676MiB/300002msec); 0 zone resets
slat (usec): min=21, max=1586.3k, avg=2117.71, stdev=23082.86
clat (usec): min=20, max=2614.0k, avg=23888.03, stdev=99651.34
lat (usec): min=225, max=3141.2k, avg=26006.49, stdev=104758.57
clat percentiles (usec):
| 1.00th=[ 889], 5.00th=[ 2343], 10.00th=[ 3654],
| 20.00th=[ 5276], 30.00th=[ 5997], 40.00th=[ 6456],
| 50.00th=[ 6849], 60.00th=[ 7177], 70.00th=[ 7504],
| 80.00th=[ 7963], 90.00th=[ 8979], 95.00th=[ 74974],
| 99.00th=[ 513803], 99.50th=[ 717226], 99.90th=[1333789],
| 99.95th=[1518339], 99.99th=[1803551]
bw ( KiB/s): min= 31, max=30240, per=27.05%, avg=8009.39, stdev=6912.26, samples=2217
iops : min= 1, max= 1890, avg=500.56, stdev=432.02, samples=2217
lat (usec) : 50=0.03%, 100=0.02%, 250=0.01%, 500=0.06%, 750=0.08%
lat (usec) : 1000=0.11%
lat (msec) : 2=0.66%, 4=1.97%, 10=71.07%, 20=14.47%, 50=2.69%
lat (msec) : 100=2.23%, 250=3.21%, 500=1.94%, 750=0.82%, 1000=0.31%
cpu : usr=0.59%, sys=1.19%, ctx=1172180, majf=0, minf=56
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued rwts: total=1657579,555275,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=64
Run status group 0 (all jobs):
READ: bw=86.3MiB/s (90.5MB/s), 86.3MiB/s-86.3MiB/s (90.5MB/s-90.5MB/s), io=25.3GiB (27.2GB), run=300002-300002msec
WRITE: bw=28.9MiB/s (30.3MB/s), 28.9MiB/s-28.9MiB/s (30.3MB/s-30.3MB/s), io=8676MiB (9098MB), run=300002-300002msec