Re: Full-Mesh?

> I read a Ceph Benchmark paper (
> https://www.proxmox.com/en/downloads/proxmox-virtual-environment/documentation/proxmox-ve-ceph-benchmark-2023-12)
> where they demonstrated, among other things, the performance of using a
> Full-Mesh Network Schema for Ceph on a three node cluster.
>
> Is this method used at an enterprise level? I am seriously considering it.

The idea that you can connect three nodes by giving each a dual-port
card and letting them talk to "all the other nodes" without a router
or a switch is a neat corner case that only really exists at 3 hosts
(with 2 it is obvious that you can reach the other one, and with 4 or
more OSD hosts you are very soon going to need a switch or a router).
So while this setup works, it also limits you a lot in how easily you
can expand it later.
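
To put rough numbers on why it stops at 3 hosts, here is a small
back-of-the-envelope sketch (plain Python, nothing Ceph-specific, just
my own arithmetic):

    # A full mesh without a switch needs a direct link between every pair of
    # hosts: n*(n-1)/2 links in total, and n-1 mesh ports on every host.
    for n in range(2, 9):
        links = n * (n - 1) // 2
        ports_per_host = n - 1
        print(f"{n} hosts: {links} point-to-point links, {ports_per_host} mesh ports per host")

    # 3 hosts: 3 links, 2 ports per host  -> fits one dual-port card
    # 4 hosts: 6 links, 3 ports per host  -> already past a dual-port card
    # 8 hosts: 28 links, 7 ports per host -> you really want a switch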

If you ask ceph admins, they will often tell you never to go below
repl=3 (or, for EC, never to use a K+M profile where M is 1) because
of the inherent risks, and also to have at least one OSD node more
than your repl factor (or than the sum of K+M for EC pools) so that
the cluster can recover by itself when a node or disk dies.
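
As a hedged sketch of that sizing rule (my own little helper, not
anything that ships with Ceph):

    # Rule of thumb: to recover by itself after losing one node, the cluster
    # needs at least one OSD host more than the number of copies/shards a pool
    # requires (repl size for replicated pools, K+M for EC pools).
    def can_self_heal(osd_hosts: int, copies_or_shards: int) -> bool:
        return osd_hosts >= copies_or_shards + 1

    print(can_self_heal(osd_hosts=3, copies_or_shards=3))  # False: repl=3 on 3 hosts, nowhere to recover
    print(can_self_heal(osd_hosts=4, copies_or_shards=3))  # True: the 4th host is the recovery headroom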

Those two pieces of advice make this setup less attractive: it scales
badly beyond 3 OSD hosts, and at the same time, if you care about your
data, you will not run repl=2 (at least not for long).

I get that you can build a really decent, quick-to-set-up proxmox
storage this way, and the results were really nice, but if you put
this in place at your company and the setup becomes popular, will you
scale by building yet another cluster next to it that cannot see the
first one, or do you want to scale out (like the rest of us ceph
admins do) to tens or hundreds of OSD hosts?

Do note that the gains from scaling out your ceph cluster were visible
at the end of that paper: they added some old benchmark hosts and got
much better results, as expected, at least for reads.

This is what makes adding OSD hosts to ceph so nice: you get more CPU
for the compute-heavy parts like encryption and checksumming, you get
more RAM for caches, you add to the total network bandwidth, and you
get better/faster/easier recovery when a single OSD host dies, on top
of the simple fact of "I also got more free disk space".
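
Put very roughly into numbers (a simplistic sketch of my own that
assumes identical hosts, an even data spread and repl=3):

    # With repl=3, usable space is roughly raw/3, and both usable space and
    # aggregate network bandwidth grow linearly with the number of OSD hosts.
    def cluster_estimate(hosts: int, raw_tb_per_host: float, gbit_per_host: float, repl: int = 3):
        usable_tb = hosts * raw_tb_per_host / repl
        aggregate_gbit = hosts * gbit_per_host
        return usable_tb, aggregate_gbit

    print(cluster_estimate(3, 40, 25))   # ~40 TB usable, 75 Gbit/s aggregate
    print(cluster_estimate(10, 40, 25))  # ~133 TB usable, 250 Gbit/s aggregate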

So that design works, and as long as you don't have faults or
surprises, it will perform nicely. It's just that as the years pass,
the chance of a storage system not seeing surprises or faults
diminishes very rapidly. Drives die, PSUs fail, networks split. Stuff
happens, and the more popular this setup becomes internally at your
enterprise, the less forgiving the clients will be when a simple PSU
fault leaves the whole storage degraded until someone can get a
replacement, just because the cluster had no place to recover into,
since with 3 hosts you built it with "just as much as needed and
nothing else".

Compare this to some 16-drive raid box. If you really want to be able
to sleep when a drive dies, you would be running raid-10 or so, but
also have hot spares, and quite possibly cold spares lying around too.

In the ceph case, repl=3 corresponds to raid-1 (but with 3 copies) and
the striping corresponds to having several OSD hosts, but running
repl=3 on exactly 3 hosts is like that raid box having no hot spares.
When a drive dies, you can limp along by reading data off the other
disks, but you are at risk the whole time.

If your raid box has hot spares, it will automagically start repairing
into them, just as a 4-OSD-host ceph cluster with repl=3 pools will
recover data onto the remaining 3 hosts. The difference is that the
ceph cluster will use the 4th host too while everything is healthy,
but it has the ability to run on 3 (given enough free space) and can
get itself back into a fully redundant and performant state on its
own, so you can sleep overnight while it recovers and then wait for
replacement parts while the cluster is not in panic mode.
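
The "given enough free space" part can be checked with the same kind
of napkin math (again just an illustration of mine, not a Ceph tool;
the 0.85 is roughly the usual nearfull warning threshold):

    # When one of N hosts dies, the data it held has to be re-replicated onto
    # the remaining N-1 hosts, so every survivor needs room for its share.
    def survives_host_loss(hosts: int, used_tb_per_host: float, size_tb_per_host: float,
                           headroom_ratio: float = 0.85) -> bool:
        extra_per_survivor = used_tb_per_host / (hosts - 1)
        return used_tb_per_host + extra_per_survivor <= size_tb_per_host * headroom_ratio

    print(survives_host_loss(hosts=4, used_tb_per_host=20, size_tb_per_host=40))  # True, ~27 of 34 TB after recovery
    print(survives_host_loss(hosts=4, used_tb_per_host=30, size_tb_per_host=40))  # False, recovery would overfill the survivors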

-- 
May the most significant bit of your life be positive.


