On Jan 5, 2020 03:05, Michael Richardson <hello@xxxxxxxxxxxxxxxxxxxxx> wrote:
>
> Hi all!
>
> I'm experimenting with GFS for the first time and have built a simple three-node cluster using AWS 'i3en' type instances. These instances provide raw nvme devices that are incredibly fast.
>
> What I'm finding in these tests is that gluster is offering only a fraction of the raw nvme performance in a 3 replica set (ie, 3 nodes with 1 brick each). I'm wondering if there is anything I can do to squeeze more performance out.
>
> For testing, I'm running fio using a 16GB test file with a 75/25 read/write split. Basically I'm trying to replicate a MySQL database which is what I'd ideally like to host here (which I realise is probably not practical).
>
> My fio test command is:
> $ fio --name=fio-test2 --filename=fio-test \
> --randrepeat=1 \
> --ioengine=libaio \
> --direct=1 \
> --runtime=300 \
> --bs=16k \
> --iodepth=64 \
> --size=16G \
> --readwrite=randrw \
> --rwmixread=75 \
> --group_reporting \
> --numjobs=4
>
> When I test this command directly on the nvme disk, I get:
>
> READ: bw=313MiB/s (328MB/s), 313MiB/s-313MiB/s (328MB/s-328MB/s), io=47.0GiB (51.5GB), run=156806-156806msec
>
> WRITE: bw=105MiB/s (110MB/s), 105MiB/s-105MiB/s (110MB/s-110MB/s), io=16.0GiB (17.2GB), run=156806-156806msec
>
> When I install the disk into a gluster 3-replica volume, I get:
>
> READ: bw=86.3MiB/s (90.5MB/s), 86.3MiB/s-86.3MiB/s (90.5MB/s-90.5MB/s), io=25.3GiB (27.2GB), run=300002-300002msec
>
> WRITE: bw=28.9MiB/s (30.3MB/s), 28.9MiB/s-28.9MiB/s (30.3MB/s-30.3MB/s), io=8676MiB (9098MB), run=300002-300002msec
>
> If I do the same but with only 2 replicas, I get the same performance results. I also get the same rough values when doing 'read', 'randread', 'write', and 'randwrite' tests.
>
> I'm testing directly on one of the storage nodes, so there are no variables like client/server network performance in the mix.
>
> I ran the same test with EBS volumes and I saw similar performance drops when offering up the volume using gluster. A "Provisioned IOPS" EBS volume that could offer 10,000 IOPS directly was getting only about 3500 IOPS when running as part of a gluster volume.
>
> We're using TLS on the management and volume connections, but I'm not seeing any CPU or memory constraint when using these volumes, so I don't believe that is the bottleneck. Similarly, when I try with SSL turned off, I see no change in performance.
>
> Does anyone have any suggestions on things I might try to increase performance when using these very fast disks as part of a gluster volume, or is this to be expected when factoring in all the extra work that gluster needs to do when replicating data around volumes?
1. Which Gluster and OS versions are you running?
2. Check the I/O scheduler of the NVMe devices -> it should be none (noop on older kernels)
3. Apply the db-workload option group: gluster volume set <VOLNAME> group db-workload
[root@ovirt1 ~]# cat /var/lib/glusterd/groups/db-workload
performance.open-behind=on
performance.write-behind=off
performance.stat-prefetch=off
performance.quick-read=off
performance.strict-o-direct=on
performance.read-ahead=off
performance.io-cache=off
performance.readdir-ahead=off
performance.client-io-threads=on
server.event-threads=4
client.event-threads=4
performance.read-after-open=yes
4. Afterwards you can test different values for server.event-threads and client.event-threads (based on the number of CPU cores).
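
For point 2, a minimal Python sketch of the scheduler check. It assumes the usual Linux sysfs layout, where /sys/block/<device>/queue/scheduler lists the available schedulers and marks the active one in [brackets]; the helper names are made up for illustration and are not part of any Gluster tooling.

```python
import glob
import re
from pathlib import Path


def active_scheduler(sysfs_text):
    """Return the scheduler marked in [brackets].

    Falls back to the bare text when nothing is bracketed (some kernels
    only print "none" for NVMe devices); returns None for empty input.
    """
    match = re.search(r"\[([^\]]+)\]", sysfs_text)
    if match:
        return match.group(1)
    return sysfs_text.strip() or None


def nvme_schedulers(pattern="/sys/block/nvme*/queue/scheduler"):
    """Map each matching block device name to its active I/O scheduler."""
    result = {}
    for path in sorted(glob.glob(pattern)):
        # The path ends in <device>/queue/scheduler, so go two levels up.
        device = Path(path).parent.parent.name
        result[device] = active_scheduler(Path(path).read_text())
    return result


if __name__ == "__main__":
    for device, scheduler in nvme_schedulers().items():
        verdict = "ok" if scheduler in ("none", "noop") else "consider none"
        print(f"{device}: {scheduler} ({verdict})")
```

Run it on each storage node. Any device that reports something other than none/noop is a candidate for changing the scheduler before the fio test is repeated.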
> Thanks very much for your time!! I'll put the two full fio outputs below if anyone wants more details.
>
> Mike
>
>
> - First full fio test, nvme device without gluster
>
> fio-test: (groupid=0, jobs=4): err= 0: pid=5636: Sat Jan 4 23:09:18 2020
>
> read: IOPS=20.0k, BW=313MiB/s (328MB/s)(47.0GiB/156806msec)
>
> slat (usec): min=3, max=6476, avg=88.44, stdev=326.96
>
> clat (usec): min=218, max=89292, avg=11141.58, stdev=1871.14
>
> lat (usec): min=226, max=89311, avg=11230.16, stdev=1883.88
>
> clat percentiles (usec):
>
> | 1.00th=[ 3654], 5.00th=[ 8455], 10.00th=[ 9372], 20.00th=[10159],
>
> | 30.00th=[10552], 40.00th=[10814], 50.00th=[11076], 60.00th=[11338],
>
> | 70.00th=[11731], 80.00th=[12256], 90.00th=[13042], 95.00th=[13960],
>
> | 99.00th=[15795], 99.50th=[16581], 99.90th=[19268], 99.95th=[23200],
>
> | 99.99th=[36439]
>
> bw ( KiB/s): min=75904, max=257120, per=25.00%, avg=80178.59, stdev=9421.58, samples=1252
>
> iops : min= 4744, max=16070, avg=5011.15, stdev=588.85, samples=1252
>
> write: IOPS=6702, BW=105MiB/s (110MB/s)(16.0GiB/156806msec); 0 zone resets
>
> slat (usec): min=4, max=5587, avg=88.52, stdev=325.86
>
> clat (usec): min=54, max=29847, avg=4491.18, stdev=1481.06
>
> lat (usec): min=63, max=29859, avg=4579.83, stdev=1508.50
>
> clat percentiles (usec):
>
> | 1.00th=[ 947], 5.00th=[ 1975], 10.00th=[ 2737], 20.00th=[ 3458],
>
> | 30.00th=[ 3916], 40.00th=[ 4178], 50.00th=[ 4424], 60.00th=[ 4686],
>
> | 70.00th=[ 5014], 80.00th=[ 5473], 90.00th=[ 6259], 95.00th=[ 6980],
>
> | 99.00th=[ 8717], 99.50th=[ 9503], 99.90th=[10945], 99.95th=[11600],
>
> | 99.99th=[13698]
>
> bw ( KiB/s): min=23296, max=86432, per=25.00%, avg=26812.24, stdev=3375.69, samples=1252
>
> iops : min= 1456, max= 5402, avg=1675.75, stdev=210.98, samples=1252
>
> lat (usec) : 100=0.01%, 250=0.01%, 500=0.06%, 750=0.11%, 1000=0.10%
>
> lat (msec) : 2=1.12%, 4=7.69%, 10=28.88%, 20=61.95%, 50=0.06%
>
> lat (msec) : 100=0.01%
>
> cpu : usr=1.56%, sys=7.85%, ctx=1905114, majf=0, minf=56
>
> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
>
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
>
> issued rwts: total=3143262,1051042,0,0 short=0,0,0,0 dropped=0,0,0,0
>
> latency : target=0, window=0, percentile=100.00%, depth=64
>
>
>
> Run status group 0 (all jobs):
>
> READ: bw=313MiB/s (328MB/s), 313MiB/s-313MiB/s (328MB/s-328MB/s), io=47.0GiB (51.5GB), run=156806-156806msec
>
> WRITE: bw=105MiB/s (110MB/s), 105MiB/s-105MiB/s (110MB/s-110MB/s), io=16.0GiB (17.2GB), run=156806-156806msec
>
>
>
> Disk stats (read/write):
>
> dm-4: ios=3455484/1154933, merge=0/0, ticks=35815316/4420412, in_queue=40257384, util=100.00%, aggrios=3456894/1155354, aggrmerge=0/0, aggrticks=35806896/4414972, aggrin_queue=40309192, aggrutil=99.99%
>
> dm-2: ios=3456894/1155354, merge=0/0, ticks=35806896/4414972, in_queue=40309192, util=99.99%, aggrios=1728447/577677, aggrmerge=0/0, aggrticks=17902352/2207092, aggrin_queue=20122108, aggrutil=100.00%
>
> dm-1: ios=3456894/1155354, merge=0/0, ticks=35804704/4414184, in_queue=40244216, util=100.00%, aggrios=3143273/1051086, aggrmerge=313621/104268, aggrticks=32277972/3937619, aggrin_queue=36289488, aggrutil=100.00%
>
> nvme0n1: ios=3143273/1051086, merge=313621/104268, ticks=32277972/3937619, in_queue=36289488, util=100.00%
>
> dm-0: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>
> - Second full fio test, nvme device as part of a gluster volume
>
> fio-test2: (groupid=0, jobs=4): err= 0: pid=5537: Sat Jan 4 23:30:28 2020
>
> read: IOPS=5525, BW=86.3MiB/s (90.5MB/s)(25.3GiB/300002msec)
>
> slat (nsec): min=1159, max=894687k, avg=9822.60, stdev=990825.87
>
> clat (usec): min=963, max=3141.5k, avg=37455.28, stdev=123109.88
>
> lat (usec): min=968, max=3141.5k, avg=37465.21, stdev=123121.94
>
> clat percentiles (msec):
>
> | 1.00th=[ 7], 5.00th=[ 8], 10.00th=[ 8], 20.00th=[ 9],
>
> | 30.00th=[ 9], 40.00th=[ 9], 50.00th=[ 10], 60.00th=[ 10],
>
> | 70.00th=[ 11], 80.00th=[ 12], 90.00th=[ 48], 95.00th=[ 180],
>
> | 99.00th=[ 642], 99.50th=[ 860], 99.90th=[ 1435], 99.95th=[ 1687],
>
> | 99.99th=[ 2022]
>
> bw ( KiB/s): min= 31, max=93248, per=26.30%, avg=23247.24, stdev=20716.86, samples=2280
>
> iops : min= 1, max= 5828, avg=1452.92, stdev=1294.81, samples=2280
>
> write: IOPS=1850, BW=28.9MiB/s (30.3MB/s)(8676MiB/300002msec); 0 zone resets
>
> slat (usec): min=21, max=1586.3k, avg=2117.71, stdev=23082.86
>
> clat (usec): min=20, max=2614.0k, avg=23888.03, stdev=99651.34
>
> lat (usec): min=225, max=3141.2k, avg=26006.49, stdev=104758.57
>
> clat percentiles (usec):
>
> | 1.00th=[ 889], 5.00th=[ 2343], 10.00th=[ 3654],
>
> | 20.00th=[ 5276], 30.00th=[ 5997], 40.00th=[ 6456],
>
> | 50.00th=[ 6849], 60.00th=[ 7177], 70.00th=[ 7504],
>
> | 80.00th=[ 7963], 90.00th=[ 8979], 95.00th=[ 74974],
>
> | 99.00th=[ 513803], 99.50th=[ 717226], 99.90th=[1333789],
>
> | 99.95th=[1518339], 99.99th=[1803551]
>
> bw ( KiB/s): min= 31, max=30240, per=27.05%, avg=8009.39, stdev=6912.26, samples=2217
>
> iops : min= 1, max= 1890, avg=500.56, stdev=432.02, samples=2217
>
> lat (usec) : 50=0.03%, 100=0.02%, 250=0.01%, 500=0.06%, 750=0.08%
>
> lat (usec) : 1000=0.11%
>
> lat (msec) : 2=0.66%, 4=1.97%, 10=71.07%, 20=14.47%, 50=2.69%
>
> lat (msec) : 100=2.23%, 250=3.21%, 500=1.94%, 750=0.82%, 1000=0.31%
>
> cpu : usr=0.59%, sys=1.19%, ctx=1172180, majf=0, minf=56
>
> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
>
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
>
> issued rwts: total=1657579,555275,0,0 short=0,0,0,0 dropped=0,0,0,0
>
> latency : target=0, window=0, percentile=100.00%, depth=64
>
>
>
> Run status group 0 (all jobs):
>
> READ: bw=86.3MiB/s (90.5MB/s), 86.3MiB/s-86.3MiB/s (90.5MB/s-90.5MB/s), io=25.3GiB (27.2GB), run=300002-300002msec
>
> WRITE: bw=28.9MiB/s (30.3MB/s), 28.9MiB/s-28.9MiB/s (30.3MB/s-30.3MB/s), io=8676MiB (9098MB), run=300002-300002msec
>
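
One way to read the two fio reports above is through Little's law: with a fixed number of requests in flight (iodepth=64 x numjobs=4 = 256), the mean latency should be roughly the outstanding requests divided by the total IOPS. The sketch below only re-uses numbers from the two outputs; the function names are illustrative, and the read IOPS of the first run is taken as the rounded "20.0k" that fio printed.

```python
def expected_latency_ms(iodepth, numjobs, total_iops):
    """Little's law: mean latency = requests in flight / completion rate."""
    return iodepth * numjobs / total_iops * 1000.0


def observed_latency_ms(read_ios, read_lat_us, write_ios, write_lat_us):
    """I/O-count-weighted mean of fio's average read and write 'lat' values."""
    total_ios = read_ios + write_ios
    weighted_us = read_ios * read_lat_us + write_ios * write_lat_us
    return weighted_us / total_ios / 1000.0


# Values copied from the two fio reports (IOPS, "issued rwts", avg "lat").
runs = {
    "raw nvme": {
        "iops": 20000 + 6702,  # read IOPS rounded by fio to 20.0k
        "reads": 3143262, "read_lat_us": 11230.16,
        "writes": 1051042, "write_lat_us": 4579.83,
    },
    "gluster replica 3": {
        "iops": 5525 + 1850,
        "reads": 1657579, "read_lat_us": 37465.21,
        "writes": 555275, "write_lat_us": 26006.49,
    },
}

for name, run in runs.items():
    expected = expected_latency_ms(64, 4, run["iops"])
    observed = observed_latency_ms(
        run["reads"], run["read_lat_us"], run["writes"], run["write_lat_us"]
    )
    print(f"{name}: expected {expected:.1f} ms, observed {observed:.1f} ms")
```

In both runs the two figures land within a few percent of each other, which suggests the test is bound by per-request latency at this queue depth rather than by raw device bandwidth. If that reading is right, the tuning above should be judged by how much it lowers the average latency of the gluster run.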
Best Regards,
Strahil Nikolov