Thank you for the detailed response.

At my primary facility I am already running a five-node Ceph cluster (among other backup servers and GPU stations), so it seems I am already past the point of benefiting from such a design. This cluster is currently connected to a Cisco N9K-C92160YC-X, using the 25 GbE interfaces, with plenty to spare. I think the lack of throughput comes down to a fundamental flaw in the way I have the OSDs and PGs configured.

Regards,

Anthony Fecarotta
Founder & President
anthony@xxxxxxxxxxx
224-339-1182 | (855) 625-0300
1 Mid America Plz, Flr 3, Oakbrook Terrace, IL 60181
www.linehaul.ai <http://www.linehaul.ai/>
LinkedIn: <https://www.linkedin.com/in/anthony-fec/>

On Tue, Feb 4, 2025 at 12:55 AM Janne Johansson <icepic.dz@xxxxxxxxx> wrote:
> > I read a Ceph Benchmark paper (
> > https://www.proxmox.com/en/downloads/proxmox-virtual-environment/documentation/proxmox-ve-ceph-benchmark-2023-12 )
> > where they demonstrated, among other things, the performance of using a
> > Full-Mesh Network Schema for Ceph on a three-node cluster.
> >
> > Is this method used at an enterprise level? I am seriously considering it.
>
> The idea that you can connect three nodes by each having a dual-port
> card and have them talk to "all the other nodes" without having a
> router or a switch is a neat corner case for having 3 hosts (i.e., with
> 2 it is obvious that you can reach the other, and with 4 or more OSD
> hosts you are really soon going to get a switch or a router), so while
> this setup works, it also limits you a lot in terms of how easy it
> will be to expand later.
>
> If you ask Ceph admins, they will often tell you to never go below
> repl=3 (or, with EC K+M, never let M be only 1) because of the inherent
> risks, and also to have at least one OSD node more than your repl
> factor (or the sum of K+M for EC pools), so that the cluster can
> recover by itself when a node or disk dies.
>
> Those two pieces of advice make this setup less attractive, since it
> scales badly beyond 3 OSD hosts, while at the same time, if you
> care about your data, you will not run repl=2 (at least not for long).
>
> I get that you can have a really decent, quick-to-set-up Proxmox
> storage this way, and the results were really nice, but if you put
> this in place at your company and the setup becomes popular, will you
> scale by building yet another one next to it, without the two being
> able to see each other, or do you want to scale out (like the rest of
> us Ceph admins do) to tens or hundreds of OSD hosts?
>
> Do note that the gains from scaling out your Ceph cluster were visible
> at the end of that paper: they added some old benchmark hosts and got
> much better results, as expected, at least for reads.
>
> This is the nice part about adding OSD hosts to Ceph: you get more CPU
> for the computing parts of crypto or checksumming, you get more RAM
> for caches, you add to the total network bandwidth, and you get
> better/faster/easier recovery when a single OSD host dies, apart from
> the simple fact of "I also got more free disk space".
>
> So that design works, and as long as you don't have faults or
> surprises, it will perform nicely. It's just that as years pass by,
> the chances of storage systems not seeing surprises or faults diminish
> very rapidly. Drives die, PSUs fail, networks split.
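To put rough numbers on the two rules of thumb above (a full mesh needs one more NIC port on every host for each host you add, and a replicated pool wants at least one more failure domain than its size before it can self-heal), here is a small, purely illustrative Python sketch. The host counts and the 40 TB per host are made-up example figures, not taken from any cluster discussed in this thread.

    # Back-of-the-envelope illustration of the two rules of thumb above.
    # All numbers are invented examples, not from any real cluster.

    def mesh_ports(hosts: int) -> tuple[int, int]:
        """Full-mesh cabling: NIC ports per host and total point-to-point links."""
        return hosts - 1, hosts * (hosts - 1) // 2

    def replicated_sizing(hosts: int, raw_tb_per_host: float, size: int = 3) -> None:
        """Rough sizing for a replicated pool with failure domain = host."""
        usable = hosts * raw_tb_per_host / size   # every object is stored `size` times

        # Can the cluster re-create all replicas by itself after losing a whole host?
        can_self_heal = (hosts - 1) >= size
        # If it can, the data must still fit on the survivors at full redundancy.
        safe_fill = (hosts - 1) * raw_tb_per_host / size if can_self_heal else 0.0

        ports, links = mesh_ports(hosts)
        print(f"{hosts} hosts: full mesh needs {ports} ports/host and {links} links; "
              f"size={size} gives ~{usable:.0f} TB usable, "
              f"self-heals after a host loss: {can_self_heal} "
              f"(safe fill ~{safe_fill:.0f} TB)")

    for n in (3, 4, 5):
        replicated_sizing(n, raw_tb_per_host=40.0)

With 3 hosts and size=3 the sketch reports that there is nowhere to recover into after a host loss, which is exactly the degraded-until-the-PSU-is-replaced situation described below; the 4th host is what buys you self-healing headroom.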
> Stuff happens, and the more popular this setup becomes internally at
> your enterprise, the less forgiving the clients will be when a simple
> PSU fault causes the whole storage to be degraded until someone can
> get a replacement PSU, just because the cluster had no place to
> recover into, since you run it with "just as much as needed and
> nothing else" when you build with 3 hosts.
>
> Compare this to some 16-drive raid box. If you really want to be able
> to sleep when a drive dies, you would be running raid-10 or so, but
> also have hot spares, and quite possibly cold spares lying around too.
>
> In the Ceph case, repl=3 equals the raid-1 (but with 3 copies) and the
> striping would be having several OSD hosts, but when you have repl=3
> and 3 hosts, it equals not having any hot spares in that raid box.
> When a drive dies, you can limp along by getting data off the other
> disks, but you are at risk the whole time.
>
> If your raid box has hot spares, it will automagically start repairing
> into them, just as a 4-OSD-host Ceph cluster with repl=3 pools will
> recover data into the remaining 3 hosts. The difference is that the
> Ceph cluster will use the 4th host too, but will have the ability to
> run on 3 (given enough free space) and can get back into a fully
> redundant and performant state by itself, so you can sleep overnight
> while it recovers and then wait for replacement parts while the
> cluster is not in a panic mode.
>
> --
> May the most significant bit of your life be positive.

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx