Re: Gluster usage scenarios in HPC cluster management

Erik Jacobson <erik.jacobson@xxxxxxx> · Tue, 23 Mar 2021 09:43:33 -0500

> I still have to grasp the "leader node" concept.
> Weren't gluster nodes "peers"? Or by "leader" you mean that it's
> mentioned in the fstab entry like
> /l1,l2,l3:gv0 /mnt/gv0 glusterfs defaults 0 0
> while the peer list includes l1,l2,l3 and a bunch of other nodes?

Right, it's a list of 24 peers. The 24 peers are split into a 3x24
replicated/distributed setup for the volumes. They also have entries
for themselves as clients in /etc/fstab. I'll dump some volume info
at the end of this.

> > So we would have 24 leader nodes, each leader would have a disk serving
> > 4 bricks (one of which is simply a lock FS for CTDB, one is sharded,
> > one is for logs, and one is heavily optimized for non-object expanded
> > tree NFS). The term "disk" is loose.
> That's a system way bigger than ours (3 nodes, replica3arbiter1, up to
> 36 bricks per node).

I have one dedicated "disk" (could be disk, raid lun, single ssd) and
4 directories for volumes ("bricks"). Of course, the "ctdb" volume is just
for the lock and has a single file.

> 
> > Specs of a leader node at a customer site:
> >  * 256G RAM
> Glip! 256G for 4 bricks... No wonder I have had troubles running 26
> bricks in 64GB RAM... :)

I'm not an expert in memory pools or how they would be impacted by more
peers. I had to do a little research, and I think what you're after is
the output of "gluster volume status cm_shared mem" from a real cluster
that has a decent node count. I will see if I can do that.

TEST ENV INFO for those who care
--------------------------------
Here is some info on my own test environment, which you can skip.

I have the environment duplicated on my desktop using virtual machines and it
runs fine (slow but fine). It's a 3x1. I take out my giant 8GB cache
from the optimized volumes but other than that it is fine. In my
development environment, the gluster disk is a 40G qcow2 image.

Cache sizes changed from 8G to 100M to fit in the VM.
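
For anyone wanting to do the same shrink, here is a minimal sketch of how
the change could be scripted. It only builds and prints the command line
rather than running it. The volume list is an assumption (only
cm_obj_sharded is shown with an 8GB cache in this mail), and the helper
name cache_size_cmd is made up for illustration.

```python
import shlex


def cache_size_cmd(volume, size):
    """Build the argv that sets performance.cache-size on one volume."""
    return ["gluster", "volume", "set", volume, "performance.cache-size", size]


# Assumption: adjust this list to whichever volumes carry the large cache.
VOLUMES = ["cm_obj_sharded"]

for vol in VOLUMES:
    # Dry run: print the command instead of executing it.
    print(shlex.join(cache_size_cmd(vol, "100MB")))
```

Passing the returned list to subprocess.run on a leader node would apply
the setting for real.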

XML snips for memory, cpus:
<domain type='kvm' id='24'>
  <name>cm-leader1</name>
  <uuid>99d5a8fc-a32c-b181-2f1a-2929b29c3953</uuid>
  <memory unit='KiB'>3268608</memory>
  <currentMemory unit='KiB'>3268608</currentMemory>
  <vcpu placement='static'>2</vcpu>
  <resource>
......

I have 1 admin (head) node VM, 3 VM leader nodes like above, and one test
compute node for my development environment.

My desktop where I test this cluster stack is a beefy but not brand new
desktop:

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
Address sizes:       46 bits physical, 48 bits virtual
CPU(s):              16
On-line CPU(s) list: 0-15
Thread(s) per core:  2
Core(s) per socket:  8
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               79
Model name:          Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
Stepping:            1
CPU MHz:             2594.333
CPU max MHz:         3000.0000
CPU min MHz:         1200.0000
BogoMIPS:            4190.22
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            20480K
NUMA node0 CPU(s):   0-15
<SNIP>

(Not that it matters but this is a HP Z640 Workstation)

128G memory (good for a desktop I know, but I think 64G would work since
I also run a Windows 10 VM environment for unrelated reasons)

I was able to find a MegaRAID in the lab a few years ago and so I have 4
drives in a MegaRAID and carve off a separate volume for the VM disk
images. It has a cache. So that's also more beefy than a normal desktop.
(on the other hand, I have no SSDs. May experiment with that some day
but things work so well now I'm tempted to leave it until something
croaks :)

I keep all VMs for the test cluster with "Unsafe cache mode" since there
is no true data to worry about and it makes the test cases faster.

So I am able to test a complete cluster management stack including
3-leader-gluster servers, an admin, and compute all on my desktop using
virtual machines and shared networks within libvirt/qemu.

It is so much easier to do development when you don't have to reserve
scarce test clusters and compete with people. I can do 90% of my cluster
development work this way. Things fall over when I need to care about
BMCs/ILOs or need to do performance testing of course. Then I move to
real hardware and play the hunger-games-of-internal-test-resources :) :)

I mention all this just to show that beefy servers are not needed, nor
is the memory usage high. I'm not continually swapping or anything like
that.

Configuration Info from Real Machine
------------------------------------

Some info on an active 3x3 cluster. 2738 compute nodes.

The most active volume here is "cm_obj_sharded". It is where the image
objects live and this cluster uses image objects for compute node root
filesystems. I changed the IP addresses by hand, so there may be an
error or two from doing that.

Memory status for volume : cm_obj_sharded
----------------------------------------------
Brick : 10.1.0.5:/data/brick_cm_obj_sharded
Mallinfo
--------
Arena    : 20676608
Ordblks  : 2077
Smblks   : 518
Hblks    : 17
Hblkhd   : 17350656
Usmblks  : 0
Fsmblks  : 53728
Uordblks : 5223376
Fordblks : 15453232
Keepcost : 127616

----------------------------------------------
Brick : 10.1.0.6:/data/brick_cm_obj_sharded
Mallinfo
--------
Arena    : 21409792
Ordblks  : 2424
Smblks   : 604
Hblks    : 17
Hblkhd   : 17350656
Usmblks  : 0
Fsmblks  : 62304
Uordblks : 5468096
Fordblks : 15941696
Keepcost : 127616

----------------------------------------------
Brick : 10.1.0.7:/data/brick_cm_obj_sharded
Mallinfo
--------
Arena    : 24240128
Ordblks  : 2471
Smblks   : 563
Hblks    : 17
Hblkhd   : 17350656
Usmblks  : 0
Fsmblks  : 58832
Uordblks : 5565360
Fordblks : 18674768
Keepcost : 127616

----------------------------------------------
Brick : 10.1.0.8:/data/brick_cm_obj_sharded
Mallinfo
--------
Arena    : 22454272
Ordblks  : 2575
Smblks   : 528
Hblks    : 17
Hblkhd   : 17350656
Usmblks  : 0
Fsmblks  : 53920
Uordblks : 5583712
Fordblks : 16870560
Keepcost : 127616

----------------------------------------------
Brick : 10.1.0.9:/data/brick_cm_obj_sharded
Mallinfo
--------
Arena    : 22835200
Ordblks  : 2493
Smblks   : 570
Hblks    : 17
Hblkhd   : 17350656
Usmblks  : 0
Fsmblks  : 59728
Uordblks : 5424992
Fordblks : 17410208
Keepcost : 127616

----------------------------------------------
Brick : 10.1.0.10:/data/brick_cm_obj_sharded
Mallinfo
--------
Arena    : 23085056
Ordblks  : 2717
Smblks   : 697
Hblks    : 17
Hblkhd   : 17350656
Usmblks  : 0
Fsmblks  : 74016
Uordblks : 5631520
Fordblks : 17453536
Keepcost : 127616

----------------------------------------------
Brick : 10.1.0.11:/data/brick_cm_obj_sharded
Mallinfo
--------
Arena    : 26537984
Ordblks  : 3044
Smblks   : 985
Hblks    : 17
Hblkhd   : 17350656
Usmblks  : 0
Fsmblks  : 103056
Uordblks : 5702592
Fordblks : 20835392
Keepcost : 127616

----------------------------------------------
Brick : 10.1.0.12:/data/brick_cm_obj_sharded
Mallinfo
--------
Arena    : 23556096
Ordblks  : 2658
Smblks   : 735
Hblks    : 17
Hblkhd   : 17350656
Usmblks  : 0
Fsmblks  : 78720
Uordblks : 5568736
Fordblks : 17987360
Keepcost : 127616

----------------------------------------------
Brick : 10.1.0.13:/data/brick_cm_obj_sharded
Mallinfo
--------
Arena    : 26050560
Ordblks  : 3064
Smblks   : 926
Hblks    : 17
Hblkhd   : 17350656
Usmblks  : 0
Fsmblks  : 96816
Uordblks : 5807312
Fordblks : 20243248
Keepcost : 127616

----------------------------------------------
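
To make the numbers above easier to read, here is a rough sketch that
parses this kind of output and totals the malloc figures per brick. It
assumes the usual glibc mallinfo meanings (Arena = heap bytes obtained
from the OS, Hblkhd = bytes in mmapped blocks, Uordblks/Fordblks = used
and free bytes inside the arena). The function names are made up, and the
sample is just the first brick from the listing. Note this only covers
malloc statistics, not everything that contributes to a brick process's
footprint.

```python
import re

SAMPLE = """\
Brick : 10.1.0.5:/data/brick_cm_obj_sharded
Mallinfo
--------
Arena    : 20676608
Ordblks  : 2077
Smblks   : 518
Hblks    : 17
Hblkhd   : 17350656
Usmblks  : 0
Fsmblks  : 53728
Uordblks : 5223376
Fordblks : 15453232
Keepcost : 127616
"""


def parse_mallinfo(text):
    """Return {brick: {field: int}} from 'volume status ... mem' text."""
    bricks = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        m = re.match(r"^Brick\s*:\s*(\S+)$", line)
        if m:
            # Start collecting fields for a new brick.
            current = bricks.setdefault(m.group(1), {})
            continue
        m = re.match(r"^(\w+)\s*:\s*(\d+)$", line)
        if m and current is not None:
            current[m.group(1)] = int(m.group(2))
    return bricks


def heap_summary(fields):
    """Return (bytes obtained from the OS, bytes in use) for one brick."""
    from_os = fields["Arena"] + fields["Hblkhd"]
    in_use = fields["Uordblks"] + fields["Hblkhd"]
    return from_os, in_use


for brick, fields in parse_mallinfo(SAMPLE).items():
    from_os, in_use = heap_summary(fields)
    print(f"{brick}: {from_os / 2**20:.1f} MiB from OS, "
          f"{in_use / 2**20:.1f} MiB in use")
```

For the first brick, Arena equals Uordblks plus Fordblks, and Arena plus
Hblkhd comes to 38027264 bytes, so the malloc side of that brick is a few
tens of MiB.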

Volume configuration details for this one:

Volume Name: cm_obj_sharded
Type: Distributed-Replicate
Volume ID: 76c30b65-7194-4af2-80f7-bf876f426e5a
Status: Started
Snapshot Count: 0
Number of Bricks: 3 x 3 = 9
Transport-type: tcp
Bricks:
Brick1: 10.1.0.5:/data/brick_cm_obj_sharded
Brick2: 10.1.0.6:/data/brick_cm_obj_sharded
Brick3: 10.1.0.7:/data/brick_cm_obj_sharded
Brick4: 10.1.0.8:/data/brick_cm_obj_sharded
Brick5: 10.1.0.9:/data/brick_cm_obj_sharded
Brick6: 10.1.0.10:/data/brick_cm_obj_sharded
Brick7: 10.1.0.11:/data/brick_cm_obj_sharded
Brick8: 10.1.0.12:/data/brick_cm_obj_sharded
Brick9: 10.1.0.13:/data/brick_cm_obj_sharded
Options Reconfigured:
nfs.rpc-auth-allow: 10.1.*
auth.allow: 10.1.*
performance.client-io-threads: on
nfs.disable: off
storage.fips-mode-rchecksum: on
transport.address-family: inet
performance.cache-size: 8GB
performance.flush-behind: on
performance.cache-refresh-timeout: 60
performance.nfs.io-cache: on
nfs.nlm: off
nfs.export-volumes: on
nfs.export-dirs: on
nfs.exports-auth-enable: on
transport.listen-backlog: 16384
nfs.mount-rmtab: /-
performance.io-thread-count: 32
server.event-threads: 32
nfs.auth-refresh-interval-sec: 360
nfs.auth-cache-ttl-sec: 360
features.shard: on

There are 3 other volumes (this is the only sharded one). I can provide
more info if desired.

Typical boot time for 3k nodes and 9 leaders, ignoring BIOS setup time,
is 2-5 minutes. The power of the image objects is what makes that fast.
An expanded tree (traditional) NFS export, where the whole directory tree
is exported and used file by file, would be more like 9-12 minutes.

Erik