Erik:
I just want to say that I really appreciate you sharing this information with us.
I don't think my personal home lab micro-cluster environment will ever get complicated enough to need a virtualized testing/Gluster development setup like yours. On the other hand, as I mentioned before, I am running 100 Gbps InfiniBand, so what I am trying to use Gluster for is quite different from what and how most people deploy Gluster for production systems.
If I wanted to splurge, I'd get a second set of IB cables so that the high-speed interconnect layer could be split: jobs would run on one layer of the InfiniBand fabric while storage/Gluster ran on another.
But for that, I'll have to revamp my entire microcluster, so there are no plans to do that just yet.
Thank you.
Sincerely,
Ewen
From: gluster-users-bounces@xxxxxxxxxxx <gluster-users-bounces@xxxxxxxxxxx> on behalf of Erik Jacobson <erik.jacobson@xxxxxxx>
Sent: March 23, 2021 10:43 AM
To: Diego Zuccato <diego.zuccato@xxxxxxxx>
Cc: gluster-users@xxxxxxxxxxx <gluster-users@xxxxxxxxxxx>
Subject: Re: Gluster usage scenarios in HPC cluster management

> I still have to grasp the "leader node" concept.
> Weren't gluster nodes "peers"? Or by "leader" you mean that it's
> mentioned in the fstab entry like
>   /l1,l2,l3:gv0 /mnt/gv0 glusterfs defaults 0 0
> while the peer list includes l1,l2,l3 and a bunch of other nodes?

Right, it's a list of 24 peers. The 24 peers are split into a 3x24 replicated/distributed setup for the volumes. They also have entries for themselves as clients in /etc/fstab. I'll dump some volume info at the end of this.

> > So we would have 24 leader nodes, each leader would have a disk serving
> > 4 bricks (one of which is simply a lock FS for CTDB, one is sharded,
> > one is for logs, and one is heavily optimized for non-object expanded
> > tree NFS). The term "disk" is loose.
> That's a system way bigger than ours (3 nodes, replica3arbiter1, up to
> 36 bricks per node).

I have one dedicated "disk" (could be a disk, a RAID LUN, or a single SSD) and 4 directories for volumes ("bricks"). Of course, the "ctdb" volume is just for the lock and has a single file.

> > Specs of a leader node at a customer site:
> > * 256G RAM
> Glip! 256G for 4 bricks... No wonder I have had troubles running 26
> bricks in 64GB RAM... :)

I'm not an expert in memory pools or how they would be impacted by more peers. I had to do a little research, and I think what you're after is whether I can run

    gluster volume status cm_shared mem

on a real cluster that has a decent node count. I will see if I can do that.

TEST ENV INFO for those who care
--------------------------------
Here is some info on my own test environment, which you can skip. I have the environment duplicated on my desktop using virtual machines and it runs fine (slow but fine). It's a 3x1. I take out my giant 8GB cache from the optimized volumes, but other than that it is fine.

In my development environment, the gluster disk is a 40G qcow2 image. Cache sizes are changed from 8G to 100M to fit in the VM.
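As an aside on the fstab entry Diego quotes above: a multi-server mount can also be expressed with backup volfile servers on a manual mount. A sketch, reusing the hypothetical hostnames l1/l2/l3 and volume gv0 from his example (the `_netdev` option is my addition, commonly used for network filesystems; not from this thread):

```shell
# /etc/fstab entry: list several servers to try when fetching the volfile
l1,l2,l3:/gv0  /mnt/gv0  glusterfs  defaults,_netdev  0 0

# Equivalent manual mount with explicit fallback volfile servers
mount -t glusterfs -o backup-volfile-servers=l2:l3 l1:/gv0 /mnt/gv0

# Per-brick memory (mallinfo) statistics for a volume, as discussed above
gluster volume status cm_shared mem
```

Either way, the extra servers only matter at mount time (for fetching the volume file); after that, the client talks to all bricks directly.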
XML snips for memory, cpus:

<domain type='kvm' id='24'>
  <name>cm-leader1</name>
  <uuid>99d5a8fc-a32c-b181-2f1a-2929b29c3953</uuid>
  <memory unit='KiB'>3268608</memory>
  <currentMemory unit='KiB'>3268608</currentMemory>
  <vcpu placement='static'>2</vcpu>
  <resource>
......

I have 1 admin (head) node VM, 3 VM leader nodes like the above, and one test compute node for my development environment.

My desktop where I test this cluster stack is a beefy but not brand new desktop:

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
Address sizes:       46 bits physical, 48 bits virtual
CPU(s):              16
On-line CPU(s) list: 0-15
Thread(s) per core:  2
Core(s) per socket:  8
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               79
Model name:          Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
Stepping:            1
CPU MHz:             2594.333
CPU max MHz:         3000.0000
CPU min MHz:         1200.0000
BogoMIPS:            4190.22
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            20480K
NUMA node0 CPU(s):   0-15
<SNIP>

(Not that it matters, but this is an HP Z640 Workstation.)

128G memory (good for a desktop, I know, but I think 64G would work since I also run a Windows 10 VM environment for unrelated reasons).

I was able to find a MegaRAID in the lab a few years ago, so I have 4 drives in a MegaRAID and carve off a separate volume for the VM disk images. It has a cache, so that's also more beefy than a normal desktop. (On the other hand, I have no SSDs. I may experiment with that some day, but things work so well now I'm tempted to leave it until something croaks :)

I keep all VMs for the test cluster in "unsafe" cache mode since there is no true data to worry about and it makes the test cases faster.

So I am able to test a complete cluster management stack, including 3 leader gluster servers, an admin node, and a compute node, all on my desktop using virtual machines and shared networks within libvirt/qemu.
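The "unsafe cache mode" mentioned above corresponds to the `cache` attribute of the disk driver element in the libvirt domain XML. A minimal sketch (the file path and target device are illustrative, not taken from this thread):

```xml
<disk type='file' device='disk'>
  <!-- cache='unsafe' ignores guest flush requests entirely:
       fast, but only acceptable for throwaway test VMs -->
  <driver name='qemu' type='qcow2' cache='unsafe'/>
  <source file='/var/lib/libvirt/images/cm-leader1-gluster.qcow2'/>
  <target dev='vdb' bus='virtio'/>
</disk>
```

With flushes skipped, a host crash can corrupt the guest image, which is why this mode suits disposable test clusters like the one described here but never real data.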
It is so much easier to do development when you don't have to reserve scarce test clusters and compete with people. I can do 90% of my cluster development work this way. Things fall over when I need to care about BMCs/iLOs or need to do performance testing, of course. Then I move to real hardware and play the hunger-games-of-internal-test-resources :) :)

I mention all this just to show that beefy servers are not needed, nor is the memory usage high. I'm not continually swapping or anything like that.

Configuration Info from Real Machine
------------------------------------
Some info on an active 3x3 cluster with 2738 compute nodes. The most active volume here is "cm_obj_sharded". It is where the image objects live, and this cluster uses image objects for compute node root filesystems. I changed the IP addresses by hand (in case I made an error doing that).

Memory status for volume : cm_obj_sharded
----------------------------------------------
Brick : 10.1.0.5:/data/brick_cm_obj_sharded
Mallinfo
--------
Arena    : 20676608
Ordblks  : 2077
Smblks   : 518
Hblks    : 17
Hblkhd   : 17350656
Usmblks  : 0
Fsmblks  : 53728
Uordblks : 5223376
Fordblks : 15453232
Keepcost : 127616
----------------------------------------------
Brick : 10.1.0.6:/data/brick_cm_obj_sharded
Mallinfo
--------
Arena    : 21409792
Ordblks  : 2424
Smblks   : 604
Hblks    : 17
Hblkhd   : 17350656
Usmblks  : 0
Fsmblks  : 62304
Uordblks : 5468096
Fordblks : 15941696
Keepcost : 127616
----------------------------------------------
Brick : 10.1.0.7:/data/brick_cm_obj_sharded
Mallinfo
--------
Arena    : 24240128
Ordblks  : 2471
Smblks   : 563
Hblks    : 17
Hblkhd   : 17350656
Usmblks  : 0
Fsmblks  : 58832
Uordblks : 5565360
Fordblks : 18674768
Keepcost : 127616
----------------------------------------------
Brick : 10.1.0.8:/data/brick_cm_obj_sharded
Mallinfo
--------
Arena    : 22454272
Ordblks  : 2575
Smblks   : 528
Hblks    : 17
Hblkhd   : 17350656
Usmblks  : 0
Fsmblks  : 53920
Uordblks : 5583712
Fordblks : 16870560
Keepcost : 127616
----------------------------------------------
Brick : 10.1.0.9:/data/brick_cm_obj_sharded
Mallinfo
--------
Arena    : 22835200
Ordblks  : 2493
Smblks   : 570
Hblks    : 17
Hblkhd   : 17350656
Usmblks  : 0
Fsmblks  : 59728
Uordblks : 5424992
Fordblks : 17410208
Keepcost : 127616
----------------------------------------------
Brick : 10.1.0.10:/data/brick_cm_obj_sharded
Mallinfo
--------
Arena    : 23085056
Ordblks  : 2717
Smblks   : 697
Hblks    : 17
Hblkhd   : 17350656
Usmblks  : 0
Fsmblks  : 74016
Uordblks : 5631520
Fordblks : 17453536
Keepcost : 127616
----------------------------------------------
Brick : 10.1.0.11:/data/brick_cm_obj_sharded
Mallinfo
--------
Arena    : 26537984
Ordblks  : 3044
Smblks   : 985
Hblks    : 17
Hblkhd   : 17350656
Usmblks  : 0
Fsmblks  : 103056
Uordblks : 5702592
Fordblks : 20835392
Keepcost : 127616
----------------------------------------------
Brick : 10.1.0.12:/data/brick_cm_obj_sharded
Mallinfo
--------
Arena    : 23556096
Ordblks  : 2658
Smblks   : 735
Hblks    : 17
Hblkhd   : 17350656
Usmblks  : 0
Fsmblks  : 78720
Uordblks : 5568736
Fordblks : 17987360
Keepcost : 127616
----------------------------------------------
Brick : 10.1.0.13:/data/brick_cm_obj_sharded
Mallinfo
--------
Arena    : 26050560
Ordblks  : 3064
Smblks   : 926
Hblks    : 17
Hblkhd   : 17350656
Usmblks  : 0
Fsmblks  : 96816
Uordblks : 5807312
Fordblks : 20243248
Keepcost : 127616
----------------------------------------------

Volume configuration details for this one:

Volume Name: cm_obj_sharded
Type: Distributed-Replicate
Volume ID: 76c30b65-7194-4af2-80f7-bf876f426e5a
Status: Started
Snapshot Count: 0
Number of Bricks: 3 x 3 = 9
Transport-type: tcp
Bricks:
Brick1: 10.1.0.5:/data/brick_cm_obj_sharded
Brick2: 10.1.0.6:/data/brick_cm_obj_sharded
Brick3: 10.1.0.7:/data/brick_cm_obj_sharded
Brick4: 10.1.0.8:/data/brick_cm_obj_sharded
Brick5: 10.1.0.9:/data/brick_cm_obj_sharded
Brick6: 10.1.0.10:/data/brick_cm_obj_sharded
Brick7: 10.1.0.11:/data/brick_cm_obj_sharded
Brick8: 10.1.0.12:/data/brick_cm_obj_sharded
Brick9: 10.1.0.13:/data/brick_cm_obj_sharded
Options Reconfigured:
nfs.rpc-auth-allow: 10.1.*
auth.allow: 10.1.*
performance.client-io-threads: on
nfs.disable: off
storage.fips-mode-rchecksum: on
transport.address-family: inet
performance.cache-size: 8GB
performance.flush-behind: on
performance.cache-refresh-timeout: 60
performance.nfs.io-cache: on
nfs.nlm: off
nfs.export-volumes: on
nfs.export-dirs: on
nfs.exports-auth-enable: on
transport.listen-backlog: 16384
nfs.mount-rmtab: /-
performance.io-thread-count: 32
server.event-threads: 32
nfs.auth-refresh-interval-sec: 360
nfs.auth-cache-ttl-sec: 360
features.shard: on

There are 3 other volumes (this is the only sharded one). I can provide more info if desired.

Typical boot time for 3k nodes and 9 leaders, ignoring BIOS setup time, is 2-5 minutes. The power of the image objects is what makes that fast. An expanded-tree (traditional) NFS export, where the whole directory tree is exported and used file by file, would be more like 9-12 minutes.

Erik

________

Community Meeting Calendar:
Schedule - Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://meet.google.com/cpu-eiue-hvk
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-users