Thank you for the detailed response.

At my primary facility I am already running a five-node Ceph cluster (among other backup servers and GPU stations), so it seems I am already past the point of benefiting from such a design. This cluster is currently connected to a Cisco N9K-C92160YC-X, using the 25 GbE interfaces, with plenty to spare. I think the lack of throughput comes down to a fundamental flaw in the way I have the OSDs and PGs configured.

Regards,

Anthony Fecarotta
Founder & President
anthony@xxxxxxxxxxx
224-339-1182 | (855) 625-0300
1 Mid America Plz, Flr 3, Oakbrook Terrace, IL 60181
www.linehaul.ai <http://www.linehaul.ai/>
LinkedIn: <https://www.linkedin.com/in/anthony-fec/>

On Tue, Feb 4, 2025 at 12:55 AM Janne Johansson <icepic.dz@xxxxxxxxx> wrote:
> > I read a Ceph Benchmark paper (
> > https://www.proxmox.com/en/downloads/proxmox-virtual-environment/documentation/proxmox-ve-ceph-benchmark-2023-12 )
> > where they demonstrated, among other things, the performance of using a
> > Full-Mesh Network Schema for Ceph on a three-node cluster.
> >
> > Is this method used at an enterprise level? I am seriously considering it.
>
> The idea that you can connect three nodes by each having a dual-port
> card and have them talk to "all the other nodes" without having a
> router or a switch is a neat corner case for having 3 hosts (i.e., with
> 2 it is obvious that you can reach the other, and with 4 or more OSD
> hosts you are really soon going to get a switch or a router), so while
> this setup works, it also limits you a lot in terms of how easy it
> will be to expand later.
>
> If you ask Ceph admins, they will often tell you to never go below
> repl=3 (or, with EC K+M, never let M be only 1) because of the inherent
> risks, and also to have at least one OSD node more than your repl
> factor (or the sum of K+M for EC pools), so that the cluster can
> recover by itself when a node or disk dies.
>
> Those two pieces of advice make this setup less attractive, since it
> scales badly beyond 3 OSD hosts, while at the same time, if you
> care about your data, you will not run repl=2 (at least not for long).
>
> I get that you can have a really decent, quick-to-set-up Proxmox
> storage this way, and the results were really nice, but if you put
> this in place at your company and the setup becomes popular, will you
> scale by building yet another one next to it, without the two being
> able to see each other, or do you want to scale out (like the rest of
> us Ceph admins do) to tens or hundreds of OSD hosts?
>
> Do note that the gains from scaling out your Ceph cluster were visible
> at the end of that paper: they added some old benchmark hosts and got
> much better results, as expected, at least for reads.
>
> This is the nice part about adding OSD hosts to Ceph: you get more CPU
> for the computing parts of crypto or checksumming, you get more RAM
> for caches, you add to the total network bandwidth, and you get
> better/faster/easier recovery when a single OSD host dies, apart from
> the simple fact of "I also got more free disk space".
>
> So that design works, and as long as you don't have faults or
> surprises, it will perform nicely. It's just that as years pass by,
> the chances of storage systems not seeing surprises or faults diminish
> very rapidly. Drives die, PSUs fail, networks split.
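To put rough numbers on the two rules of thumb above (a full mesh needs one more NIC port on every host for each host you add, and a replicated pool wants at least one more failure domain than its size before it can self-heal), here is a small, purely illustrative Python sketch. The host counts and the 40 TB per host are made-up example figures, not taken from any cluster discussed in this thread.

    # Back-of-the-envelope illustration of the two rules of thumb above.
    # All numbers are invented examples, not from any real cluster.

    def mesh_ports(hosts: int) -> tuple[int, int]:
        """Full-mesh cabling: NIC ports per host and total point-to-point links."""
        return hosts - 1, hosts * (hosts - 1) // 2

    def replicated_sizing(hosts: int, raw_tb_per_host: float, size: int = 3) -> None:
        """Rough sizing for a replicated pool with failure domain = host."""
        usable = hosts * raw_tb_per_host / size   # every object is stored `size` times

        # Can the cluster re-create all replicas by itself after losing a whole host?
        can_self_heal = (hosts - 1) >= size
        # If it can, the data must still fit on the survivors at full redundancy.
        safe_fill = (hosts - 1) * raw_tb_per_host / size if can_self_heal else 0.0

        ports, links = mesh_ports(hosts)
        print(f"{hosts} hosts: full mesh needs {ports} ports/host and {links} links; "
              f"size={size} gives ~{usable:.0f} TB usable, "
              f"self-heals after a host loss: {can_self_heal} "
              f"(safe fill ~{safe_fill:.0f} TB)")

    for n in (3, 4, 5):
        replicated_sizing(n, raw_tb_per_host=40.0)

With 3 hosts and size=3 the sketch reports that there is nowhere to recover into after a host loss, which is exactly the degraded-until-the-PSU-is-replaced situation described below; the 4th host is what buys you self-healing headroom.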
> Stuff happens, and the more popular this setup becomes internally at
> your enterprise, the less forgiving the clients will be when a simple
> PSU fault causes the whole storage to be degraded until someone can
> get a replacement PSU, just because the cluster had no place to
> recover into, since you run it with "just as much as needed and
> nothing else" when you build with 3 hosts.
>
> Compare this to some 16-drive raid box. If you really want to be able
> to sleep when a drive dies, you would be running raid-10 or so, but
> also have hot spares, and quite possibly cold spares lying around too.
>
> In the Ceph case, repl=3 equals the raid-1 (but with 3 copies) and the
> striping would be having several OSD hosts, but when you have repl=3
> and 3 hosts, it equals not having any hot spares in that raid box.
> When a drive dies, you can limp along by getting data off the other
> disks, but you are at risk the whole time.
>
> If your raid box has hot spares, it will automagically start repairing
> into them, just as a 4-OSD-host Ceph cluster with repl=3 pools will
> recover data into the remaining 3 hosts. The difference is that the
> Ceph cluster will use the 4th host too, but will have the ability to
> run on 3 (given enough free space) and can get back into a fully
> redundant and performant state by itself, so you can sleep overnight
> while it recovers and then wait for replacement parts while the
> cluster is not in a panic mode.
>
> --
> May the most significant bit of your life be positive.

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx