Re: Gluster Performance - 12 Gbps SSDs and 10 Gbps NIC

Strahil Nikolov <hunter86_bg@xxxxxxxxx> · Thu, 14 Dec 2023 12:54:22 +0000 (UTC)

Hi Gilberto,

Have you checked https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3.5/html/administration_guide/chap-configuring_red_hat_storage_for_enhancing_performance ?

I think that you will need to test the virt profile, as its settings will prevent some bad situations, especially during VM live migration.
You should also consider sharding, which can reduce healing time but also makes your life more difficult if you need to access the disks of the VMs.

I think that client.event-threads, server.event-threads and performance.io-thread-count can be tuned in your case. Consider setting up a VM that uses the gluster volume as its backing store and running the tests inside the VM to simulate a real workload (ideally a DB, web server, etc. inside the VM).
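
For reference, these options are set per volume with `gluster volume set`. A minimal sketch, assuming a volume named `data` (as in Danny's setup) and purely illustrative thread counts; the numbers are starting points to benchmark, not values recommended anywhere in this thread. The helper `print_tuning_cmds` is hypothetical and only prints the commands, so they can be reviewed before being run on a node:

```shell
#!/bin/sh
# Print (do not execute) the tuning commands for one volume.
# Arguments: volume name, event-thread count, io-thread count.
# All three values passed below are illustrative assumptions.
print_tuning_cmds() {
    volume="$1"
    event_threads="$2"
    io_threads="$3"
    echo "gluster volume set $volume client.event-threads $event_threads"
    echo "gluster volume set $volume server.event-threads $event_threads"
    echo "gluster volume set $volume performance.io-thread-count $io_threads"
    # Apply the predefined "virt" option group mentioned above.
    echo "gluster volume set $volume group virt"
}

print_tuning_cmds data 4 32
```

Changing one option at a time and rerunning the same benchmark makes it easier to see which setting actually helps.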

Best Regards,
Strahil Nikolov 

On Wednesday, December 13, 2023, 2:34 PM, Gilberto Ferreira <gilberto.nunes32@xxxxxxxxx> wrote:
Hi all,
Aravinda, I usually set this in a two-server environment and have never had a split brain:
gluster vol set VMS cluster.heal-timeout 5

gluster vol heal VMS enable

gluster vol set VMS cluster.quorum-reads false

gluster vol set VMS cluster.quorum-count 1

gluster vol set VMS network.ping-timeout 2

gluster vol set VMS cluster.favorite-child-policy mtime

gluster vol heal VMS granular-entry-heal enable

gluster vol set VMS cluster.data-self-heal-algorithm full

gluster vol set VMS features.shard on

Strahil, in general, I get 0.06 ms with a dedicated 1G NIC.
My environments are very simple, using Proxmox + QEMU/KVM, with 3 or 5 VMs.

---
Gilberto Nunes Ferreira
(47) 99676-7530 - Whatsapp / Telegram

On Wed, Dec 13, 2023 at 06:08, Strahil Nikolov <hunter86_bg@xxxxxxxxx> wrote:

Hi Aravinda,
Based on the output it’s a ‘replica 3 arbiter 1’ type.

Gilberto,
What’s the latency between the nodes?

Best Regards,
Strahil Nikolov 

On Wednesday, December 13, 2023, 7:36 AM, Aravinda <aravinda@xxxxxxxxxxx> wrote:
Only Replica 2 or Distributed Gluster volumes can be created with two servers. Replica 2 has a higher chance of split brain than a Replica 3 volume.

For NFS Ganesha, there is no issue exporting the volume even if only one server is available. Run NFS Ganesha servers on the Gluster server nodes, and NFS clients on the network can connect to any NFS Ganesha server.

You can use HAProxy + Keepalived (or any other load balancer) if high availability is required for the NFS Ganesha connections (e.g. if a server node goes down, the NFS client can connect to another NFS Ganesha server node).

--
Aravinda
Kadalu Technologies

---- On Wed, 13 Dec 2023 01:42:11 +0530 Gilberto Ferreira <gilberto.nunes32@xxxxxxxxx> wrote ---

Ah, that's nice. Does anybody know whether this can be achieved with two servers?

---
Gilberto Nunes Ferreira
(47) 99676-7530 - Whatsapp / Telegram

On Tue, Dec 12, 2023 at 17:08, Danny <dbray925+gluster@xxxxxxxxx> wrote:

________

Community Meeting Calendar: 

Schedule - 
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC 
Bridge: https://meet.google.com/cpu-eiue-hvk 
Gluster-users mailing list 
Gluster-users@xxxxxxxxxxx 
https://lists.gluster.org/mailman/listinfo/gluster-users 
Wow, HUGE improvement with NFS-Ganesha!

sudo dnf -y install glusterfs-ganesha
sudo vim /etc/ganesha/ganesha.conf

NFS_CORE_PARAM {
    mount_path_pseudo = true;
    Protocols = 3,4;
}
EXPORT_DEFAULTS {
    Access_Type = RW;
}

LOG {
    Default_Log_Level = WARN;
}

EXPORT{
    Export_Id = 1 ;     # Export ID unique to each export
    Path = "/data";     # Path of the volume to be exported

    FSAL {
        name = GLUSTER;
        hostname = "localhost"; # IP of one of the nodes in the trusted pool
        volume = "data";        # Volume name. Eg: "test_volume"
    }

    Access_type = RW;           # Access permissions
    Squash = No_root_squash;    # To enable/disable root squashing
    Disable_ACL = TRUE;         # To enable/disable ACL
    Pseudo = "/data";           # NFSv4 pseudo path for this export
    Protocols = "3","4" ;       # NFS protocols supported
    Transports = "UDP","TCP" ;  # Transport protocols supported
    SecType = "sys";            # Security flavors supported
}

sudo systemctl enable --now nfs-ganesha
sudo vim /etc/fstab 

localhost:/data             /data                 nfs    defaults,_netdev          0 0

sudo systemctl daemon-reload
sudo mount -a

fio --name=test --filename=/data/wow --size=1G --readwrite=write

Run status group 0 (all jobs):
  WRITE: bw=2246MiB/s (2355MB/s), 2246MiB/s-2246MiB/s (2355MB/s-2355MB/s), io=1024MiB (1074MB), run=456-456msec

Yeah 
2355MB/s is much better than the original 115MB/s

So in the end, I guess FUSE isn't the best choice. 
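
One way to sanity-check a result like this is to compare it with what the network can carry. A 10 Gbps link moves at most 1250 MB/s, and the second data brick sits on another node, so a sustained 2355 MB/s replicated write would not fit through it; the 1 GiB buffered fio run may largely be measuring client-side caching. The sketch below is only that arithmetic: the function name is made up for illustration and protocol overhead is ignored.

```shell
#!/bin/sh
# Compare a measured bandwidth with the raw line rate of the link.
# Arguments: measured bandwidth in MB/s, link speed in Gbit/s.
check_against_line_rate() {
    awk -v m="$1" -v g="$2" 'BEGIN {
        line = g * 1000 / 8    # line rate in MB/s, ignoring protocol overhead
        if (m > line)
            printf "%d MB/s exceeds the %d MB/s line rate: caching is likely involved\n", m, line
        else
            printf "%d MB/s is within the %d MB/s line rate\n", m, line
    }'
}

check_against_line_rate 2355 10   # NFS-Ganesha result
check_against_line_rate 115 10    # FUSE result
```

Rerunning the test with the flags Ramon suggested (`--direct=1`, `--end_fsync=1`) and a larger file would give a firmer comparison between FUSE and NFS-Ganesha.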

On Tue, Dec 12, 2023 at 3:00 PM Gilberto Ferreira <gilberto.nunes32@xxxxxxxxx> wrote:
FUSE adds some overhead. Take a look at libgfapi:
https://staged-gluster-docs.readthedocs.io/en/release3.7.0beta1/Features/libgfapi/

I know this doc is somewhat out of date, but it could be a hint.

---
Gilberto Nunes Ferreira
(47) 99676-7530 - Whatsapp / Telegram

On Tue, Dec 12, 2023 at 16:29, Danny <dbray925+gluster@xxxxxxxxx> wrote:
Nope, not a caching thing. I've tried multiple different types of fio tests, and all produce the same results: Gbps when hitting the disks locally, slow MB/s when hitting the Gluster FUSE mount.

I've been reading up on glusterfs-ganesha, and will give that a try.

On Tue, Dec 12, 2023 at 1:58 PM Ramon Selga <ramon.selga@xxxxxxxxx> wrote:
Disregard my first question: you have SAS 12 Gbps SSDs. Sorry!

On 12/12/23 at 19:52, Ramon Selga wrote:
May I ask which kind of disks you have in this setup? Rotational, SSD SAS/SATA, NVMe?

Is there a RAID controller with writeback caching?

It seems to me your fio test on the local brick gives an unclear result because of caching.

Try something like this (consider increasing the test file size depending on how much memory is available for caching):

fio --size=16G --name=test --filename=/gluster/data/brick/wow \
    --bs=1M --nrfiles=1 --direct=1 --sync=0 --randrepeat=0 \
    --rw=write --refill_buffers --end_fsync=1 --iodepth=200 \
    --ioengine=libaio

Also remember that a replica 3 arbiter 1 volume writes synchronously to two data bricks, halving the throughput of your network backend.
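
As a rough illustration of that halving, the sketch below divides the link's raw rate by the number of data bricks each write is sent to. It assumes both copies leave the client over the same link and ignores protocol overhead and the arbiter's metadata traffic; the function name is hypothetical.

```shell
#!/bin/sh
# Rough ceiling on client write throughput (MB/s) when every write is
# sent to each data brick over the same client link.
# Arguments: link speed in Gbit/s, number of data bricks written.
replica_write_ceiling() {
    awk -v g="$1" -v n="$2" 'BEGIN { printf "%.0f\n", g * 1000 / 8 / n }'
}

replica_write_ceiling 10 2   # 10 Gbps link, two data bricks: prints 625
```

When the client runs on one of the brick nodes, one copy stays local, so the real ceiling can be higher than this estimate.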

Try a similar fio run on the gluster mount, but I hardly ever see more than 300MB/s writing sequentially on a single FUSE mount, even with an NVMe backend. On the other hand, with 4 to 6 clients you can easily reach 1.5GB/s of aggregate throughput.

To start, I think it is better to try with default parameters for your replica volume.

 Best regards!

 Ramon

On 12/12/23 at 19:10, Danny wrote:
Sorry, I noticed that too after I posted, so I immediately upgraded to 10. The issue remains.

On Tue, Dec 12, 2023 at 1:09 PM Gilberto Ferreira <gilberto.nunes32@xxxxxxxxx> wrote:
I strongly suggest you update to version 10 or higher.
It comes with significant performance improvements.
---
Gilberto Nunes Ferreira
(47) 99676-7530 - Whatsapp / Telegram

On Tue, Dec 12, 2023 at 13:03, Danny <dbray925+gluster@xxxxxxxxx> wrote:
MTU is already 9000, and as you can see from the IPERF results, I've got a nice, fast connection between the nodes.

On Tue, Dec 12, 2023 at 9:49 AM Strahil Nikolov <hunter86_bg@xxxxxxxxx> wrote:
Hi,
Let’s try the simple things:

Check if you can use MTU 9000 and, if it’s possible, set it on the bond slaves and the bond devices:
ping GLUSTER_PEER -c 10 -M do -s 8972
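
The 8972 in that ping is the largest ICMP payload that fits in a single 9000-byte IPv4 packet once the headers are subtracted; with `-M do` (don't fragment), the ping fails if any hop has a smaller MTU. A small sketch of the arithmetic, assuming IPv4 without options (the function name is only for illustration):

```shell
#!/bin/sh
# Largest ICMP echo payload for a given MTU:
# MTU minus the 20-byte IPv4 header (no options) and the 8-byte ICMP header.
icmp_payload_for_mtu() {
    echo $(( $1 - 20 - 8 ))
}

icmp_payload_for_mtu 9000   # jumbo frames: prints 8972
icmp_payload_for_mtu 1500   # standard Ethernet: prints 1472
```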

Then try to follow the recommendations from https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3.5/html/administration_guide/chap-configuring_red_hat_storage_for_enhancing_performance

Best Regards,
Strahil Nikolov 

On Monday, December 11, 2023, 3:32 PM, Danny <dbray925+gluster@xxxxxxxxx> wrote:
Hello list, I'm hoping someone can let me know what setting I missed.

Hardware:
Dell R650 servers, Dual 24 Core Xeon 2.8 GHz, 1 TB RAM
8x SSDs, Negotiated Speed 12 Gbps
PERC H755 Controller - RAID 6
Created virtual "data" disk from the above 8 SSD drives, for a ~20 TB /dev/sdb
OS:
CentOS Stream
kernel-4.18.0-526.el8.x86_64
glusterfs-7.9-1.el8.x86_64

IPERF Test between nodes:
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  11.5 GBytes  9.90 Gbits/sec    0             sender
[  5]   0.00-10.04  sec  11.5 GBytes  9.86 Gbits/sec                  receiver

All good there. ~10 Gbps, as expected.

LVM Install:
export DISK="/dev/sdb"
sudo parted --script $DISK "mklabel gpt"
sudo parted --script $DISK "mkpart primary 0% 100%"
sudo parted --script $DISK "set 1 lvm on"
sudo pvcreate --dataalignment 128K /dev/sdb1
sudo vgcreate --physicalextentsize 128K gfs_vg /dev/sdb1
sudo lvcreate -L 16G -n gfs_pool_meta gfs_vg
sudo lvcreate -l 95%FREE -n gfs_pool gfs_vg
sudo lvconvert --chunksize 1280K --thinpool gfs_vg/gfs_pool --poolmetadata gfs_vg/gfs_pool_meta
sudo lvchange --zero n gfs_vg/gfs_pool
sudo lvcreate -V 19.5TiB --thinpool gfs_vg/gfs_pool -n gfs_lv
sudo mkfs.xfs -f -i size=512 -n size=8192 -d su=128k,sw=10 /dev/mapper/gfs_vg-gfs_lv
sudo vim /etc/fstab
/dev/mapper/gfs_vg-gfs_lv   /gluster/data/brick   xfs   rw,inode64,noatime,nouuid 0 0
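
A note on the geometry used above: `sw` in `mkfs.xfs` is the number of data disks in the RAID set, and `sw=10` with a 1280K chunk size corresponds to 10 data disks at a 128 KiB stripe unit. A RAID 6 set built from 8 drives has 6 data disks. The sketch below works out the matching values, assuming the PERC virtual disk really uses a 128 KiB stripe unit, which the thread does not state; the function name is hypothetical.

```shell
#!/bin/sh
# Full-stripe width in KiB for a parity RAID set.
# Arguments: total drives, parity drives (2 for RAID 6), stripe unit in KiB.
full_stripe_kib() {
    echo $(( ($1 - $2) * $3 ))
}

echo "data disks (sw): $(( 8 - 2 ))"
echo "full stripe:     $(full_stripe_kib 8 2 128)K"
```

Under that assumption the aligned values would be `sw=6` and a 768K full stripe. Whether this affects the FUSE result is a separate question, since the local brick test was already fast.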

sudo systemctl daemon-reload && sudo mount -a
fio --name=test --filename=/gluster/data/brick/wow --size=1G --readwrite=write

Run status group 0 (all jobs):
   WRITE: bw=2081MiB/s (2182MB/s), 2081MiB/s-2081MiB/s (2182MB/s-2182MB/s), io=1024MiB (1074MB), run=492-492msec

All good there. 2182MB/s =~ 17.5 Gbps. Nice!

Gluster install:
export NODE1='10.54.95.123'
export NODE2='10.54.95.124'
export NODE3='10.54.95.125'
sudo gluster peer probe $NODE2
sudo gluster peer probe $NODE3
sudo gluster volume create data replica 3 arbiter 1 $NODE1:/gluster/data/brick $NODE2:/gluster/data/brick $NODE3:/gluster/data/brick force
sudo gluster volume set data network.ping-timeout 5
sudo gluster volume set data performance.client-io-threads on
sudo gluster volume set data group metadata-cache
sudo gluster volume start data
sudo gluster volume info all

Volume Name: data
Type: Replicate
Volume ID: b52b5212-82c8-4b1a-8db3-52468bc0226e
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: 10.54.95.123:/gluster/data/brick
Brick2: 10.54.95.124:/gluster/data/brick
Brick3: 10.54.95.125:/gluster/data/brick (arbiter)
Options Reconfigured:
network.inode-lru-limit: 200000
performance.md-cache-timeout: 600
performance.cache-invalidation: on
performance.stat-prefetch: on
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
network.ping-timeout: 5
transport.address-family: inet
storage.fips-mode-rchecksum: on
nfs.disable: on
performance.client-io-threads: on

sudo vim /etc/fstab
localhost:/data             /data                 glusterfs defaults,_netdev          0 0

sudo systemctl daemon-reload && sudo mount -a
fio --name=test --filename=/data/wow --size=1G --readwrite=write

Run status group 0 (all jobs):
   WRITE: bw=109MiB/s (115MB/s), 109MiB/s-109MiB/s (115MB/s-115MB/s), io=1024MiB (1074MB), run=9366-9366msec

Oh no, what's wrong? From 2182MB/s down to only 115MB/s? What am I missing? I'm not expecting the above ~17 Gbps, but I'm thinking it should at least be close(r) to ~10 Gbps.
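
To put the two fio results on the same scale as the NIC speed, the decimal MB/s figures can be converted to Gbps by multiplying by 8 and dividing by 1000. A minimal sketch of that conversion (the function name is only for illustration):

```shell
#!/bin/sh
# Convert decimal megabytes per second to gigabits per second.
# 1 MB = 8 Mbit, 1 Gbit = 1000 Mbit.
mbps_to_gbps() {
    awk -v mb="$1" 'BEGIN { printf "%.2f\n", mb * 8 / 1000 }'
}

mbps_to_gbps 2182   # local brick write: prints 17.46
mbps_to_gbps 115    # FUSE mount write:  prints 0.92
```

So the FUSE mount is delivering a little under 1 Gbps of the roughly 10 Gbps that iperf measured between the nodes.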

Any suggestions?