Re: Setup recommendations

Strahil Nikolov <hunter86_bg@xxxxxxxxx> · Mon, 19 Oct 2020 15:15:36 +0000 (UTC)

Hi Nico,

You can also consider moving Ganesha out of the Gluster TSP (trusted storage pool), so that even if pacemaker kills a Ganesha node, it won't kill the Gluster bricks.
Also, check the pacemaker configuration, as a failover should be transparent for the clients (NFS v4 might "hang" for up to 90s, but should not go stale).

Best Regards,
Strahil Nikolov

On Monday, 19 October 2020, 14:40:01 GMT+3, Nico van Royen <nico@xxxxxxxxxxxx> wrote:

Hello,

>4 cores is quite low, especially when healing.
The 4 cores (and, by default, 8GB RAM) are a standard offering in our environment.  It is up to the specific usage of our end-users to see if that is enough (most deployed glusters in our environment average 5% total usage, so that does seem to be quite enough).  Even this particular gluster hardly ever goes above 10-15%, except when rebalancing after adding bricks (then it shoots to 80% during the several hours of rebalancing).

>Why not FUSE ? Ganesha is suitable for UNIX and BSD systems that do not support FUSE.
When we designed our offering we had a hard time choosing the default: FUSE vs NFS.  Since we have a (very) large environment, loads of network segments, layer-7 firewalls across subnets, and a variety of possible clients (Windows, AIX, Solaris, Linux), we opted for NFS (via Ganesha).  Each OS can handle NFS, and sticking to NFSv4.0 over TCP makes opening firewalls a lot simpler, since only TCP/2049 is needed (otherwise all of the different ports needed for brick connections + glusterfsd itself need to be opened).

>Consider increasing the 'token' and 'consensus' to a more meaningful values -> start with 10s token for example.
That is actually something we did not yet look at, thanks for the suggestion.  We would need to test this, but it sounds like a good recommendation (currently they're at Red Hat's defaults).

>For performance improvements , I would add some SSDs in the game (tier 1+ storage) and use the SSD-based LUNs as lvm caching.
As much as we'd like to, that is unfortunately not possible in our environment.  We use a 'private cloud' (which is not even a cloud, just a beefy VMware environment), and each tenant/consumer gets the same type of resources.
Problems of a large (and often sluggish) financial company....
It currently hosts almost 20.000 VM instances in total (80% RHEL based), and among those approximately 55 gluster clusters.

Customizing the corosync values to somewhat larger times does sound like it can help in this case (less busy glusters seem to be able to cope well), thanks for this suggestion!

Regards,
Nico

----- Original message -----
From: "Strahil Nikolov" <hunter86_bg@xxxxxxxxx>
To: "gluster-users" <gluster-users@xxxxxxxxxxx>, "Nico van Royen" <nico@xxxxxxxxxxxx>
Sent: Monday 19 October 2020 05:56:20
Subject: Re:  Setup recommendations

>Size is not that big, 600GB space with around half of that actually used.  GlusterFS servers themselves each have 4 cores and 12GB memory.  It might also be important to note that these are VMware hosted nodes that make use of  SAN storage for the datastores.

4 cores is quite low, especially when healing.

>Connected to that NFS (ganesha) exported share are just over 100 clients, all RHEL6 and RHEL7, some spanning 10 network hops away.  All of those clients are (currently) using the same virtual-IP, so all end up on the same server.

Why not FUSE ? Ganesha is suitable for UNIX and BSD systems that do not support FUSE.

>Note that I mentioned 'should', since at times it had anywhere between 250.000 and 1 million files in it (which of course is not advised).  Using some kind of hashing (subfolders spread per day/hour etc) was also already advised.
If you have multiple subvolumes (going from replicate to distributed-replicated), you can also spread the load, yet 'find' won't be faster :)
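
To illustrate the per-day/hour subfolder idea quoted above, here is a minimal sketch. It assumes each file can be bucketed by a timestamp; `hashed_path`, the mount point and the file name are hypothetical names for illustration only, not something taken from the actual setup.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath


def hashed_path(base: str, filename: str, ts: datetime) -> PurePosixPath:
    """Return base/YYYY/MM/DD/HH/filename so that no single directory
    accumulates hundreds of thousands of entries."""
    return PurePosixPath(base) / ts.strftime("%Y/%m/%d/%H") / filename


if __name__ == "__main__":
    # Hypothetical mount point and file name, for illustration only.
    when = datetime(2020, 10, 19, 14, 40, tzinfo=timezone.utc)
    print(hashed_path("/mnt/share/incoming", "report-0001.dat", when))
```

With hourly buckets each directory only holds the files written in that hour, which keeps directory listings short even though the total file count stays the same.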

Problems that are often seen:
>- Any kind of operation on VMware such as a vMotion, creating a VM snapshot etc. on the node that has these 100+ clients connected causes such a temporary pause that pacemaker decides to switch the resources (causing a failover of the virtual IP address, thus clients connected suffer delay).  
RH corosync defaults are not suitable for VMs. I prefer SUSE's defaults.
Consider increasing the 'token' and 'consensus' to more meaningful values -> start with a 10s token, for example.
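
As a rough sketch of what "10s token" could look like: the snippet below renders a totem fragment with both values in milliseconds. It assumes the rule from corosync.conf(5) that consensus defaults to 1.2 x token when it is not set explicitly; `totem_settings` is a hypothetical helper, and the output is only a fragment to merge into the existing totem section, not a complete corosync.conf. Please test the values before relying on them.

```python
def totem_settings(token_ms: int, consensus_factor: float = 1.2) -> str:
    """Render a totem fragment with consensus derived from the token timeout.

    Assumption: consensus = 1.2 * token, mirroring what corosync computes
    on its own when consensus is not specified.
    """
    consensus_ms = int(round(token_ms * consensus_factor))
    lines = [
        "totem {",
        f"    token: {token_ms}",
        f"    consensus: {consensus_ms}",
        "}",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # 10 s token, as suggested above as a starting point.
    print(totem_settings(10_000))
```

A longer token means corosync tolerates longer pauses (vMotion, snapshots) before declaring a node lost, at the cost of slower detection of a real failure.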

>One would expect this to last just shy under a minute, then clients would happily continue.  However connected clients are stuck with a non-working mountpoint (commands as df, ls, find etc simply hang.. they go into an uninterruptible sleep).
In regular HA NFS, there is a "notify" resource that notifies the clients about the failover. The stale mount happens because your IP is brought up before the NFS export is ready. As you haven't provided HA details, I can't help much there.

>Mount are 'hard' mounts to insure guaranteed writes.
That's good. It is also needed for the HA to work properly.

>- Once the number of files are over the 100.000 mark (again into a single, unhashed, folder) any operation on that share becomes very sluggish (even a df, on a client, would take 20/30 seconds,  a find command would take minutes to complete).
I think it's expected...

>If anyone can spot any ideas for improvement ?
I would first try switching to 'replica 3 arbiter 1', as the current setup is wasting storage, and then switch the clients to FUSE.
For performance improvements, I would add some SSDs to the mix (tier 1+ storage) and use the SSD-based LUNs for LVM caching.
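
To put a rough number on the "wasting storage" remark: the sketch below compares raw capacity for the 600GB volume under two layouts. It assumes the current layout is a full replica 3 (not stated in this message) and uses a purely hypothetical 10GB arbiter brick, since the arbiter only holds metadata and its real size depends on the number of files.

```python
def raw_storage_gb(usable_gb: float, data_copies: int, arbiter_gb: float = 0.0) -> float:
    """Raw capacity needed: one full copy per data brick plus the arbiter brick."""
    return usable_gb * data_copies + arbiter_gb


if __name__ == "__main__":
    usable = 600  # GB, the volume size mentioned in the thread

    # Assumption: the current layout keeps three full copies.
    full_replica = raw_storage_gb(usable, data_copies=3)

    # 'replica 3 arbiter 1': two full copies plus a metadata-only arbiter.
    # The 10 GB arbiter size is a hypothetical placeholder.
    with_arbiter = raw_storage_gb(usable, data_copies=2, arbiter_gb=10)

    print(f"replica 3:           {full_replica:.0f} GB raw")
    print(f"replica 3 arbiter 1: {with_arbiter:.0f} GB raw")
    print(f"difference:          {full_replica - with_arbiter:.0f} GB")
```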

Best Regards,
Strahil Nikolov

________

Community Meeting Calendar:

Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://bluejeans.com/441850968

Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-users
