> I read a Ceph Benchmark paper
> (https://www.proxmox.com/en/downloads/proxmox-virtual-environment/documentation/proxmox-ve-ceph-benchmark-2023-12)
> where they demonstrated, among other things, the performance of using a
> Full-Mesh Network Schema for Ceph on a three node cluster.
>
> Is this method used at an enterprise level? I am seriously considering it.

The idea that you can connect three nodes by giving each a dual-port card and have them talk to "all the other nodes" without a router or a switch is a neat corner case for exactly 3 hosts (i.e., with 2 it is obvious that you can reach the other one, and with 4 or more OSD hosts you are very soon going to get a switch or a router anyway), so while this setup works, it also limits you a lot in terms of how easy it will be to expand later (there is a small sketch of that arithmetic in a PS at the end of this mail).

If you ask ceph admins, they will often tell you to never go below repl=3 (and never use EC K+M pools where M is 1) because of the inherent risks, and also to have at least one OSD node more than your repl factor, or than the sum of K+M for EC pools, so that the cluster can recover by itself when a node or disk dies. Those two pieces of advice make this setup less attractive, since it scales badly beyond 3 OSD hosts, while at the same time, if you care about your data, you will not run repl=2 (at least not for long).

I get that you can have a really decent, quick-to-set-up Proxmox storage this way, and the results were really nice, but if you put this in place at your company and the setup becomes popular, will you scale by building yet another one next to it, with the two unable to see each other, or do you want to scale out (like the rest of us ceph admins do) to tens or hundreds of OSD hosts?

Do note that the gains from scaling out your ceph cluster were visible at the end of that paper: they added some old benchmark hosts and got much better results, as expected, at least for reads. That is the nice part about adding OSD hosts to ceph: you get more CPU for the crypto and checksumming work, more RAM for caches, you add to the total network bandwidth, and you get better/faster/easier recovery when a single OSD host dies, apart from the simple fact of "I also got more free disk space".

So that design works, and as long as you don't have faults or surprises, it will perform nicely. It's just that as the years pass, the chances of a storage system not seeing surprises or faults diminish very rapidly. Drives die, PSUs fail, networks split. Stuff happens, and the more popular this setup becomes internally at your enterprise, the less forgiving the clients will be when a simple PSU fault leaves the whole storage degraded until someone can get a replacement PSU, just because the cluster had no place to recover into, since you built it with "just as much as needed and nothing else" when you went with 3 hosts.

Compare this to some 16-drive raid box. If you really want to be able to sleep when a drive dies, you would be running raid-10 or so, but also have hot spares, and quite possibly cold spares lying around too. In the ceph case, repl=3 corresponds to the raid-1 part (but with 3 copies) and the striping corresponds to having several OSD hosts, but running repl=3 on exactly 3 hosts is like having no hot spares in that raid box. When a drive dies, you can limp along by reading data off the other disks, but you are at risk the whole time.
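To put a rough number on that "no hot spares" point, here is a purely illustrative back-of-the-envelope check (plain Python, not a real Ceph API; the host count, capacities and repl factor are made-up inputs) of whether a replicated cluster even has somewhere to rebuild after losing one OSD host:

    # Illustrative only: can a replicated ceph cluster re-create the copies
    # that lived on a dead OSD host? Two rough conditions:
    #   1. more hosts than replicas, so there is a spare failure domain
    #   2. enough raw space on the survivors to hold the rebuilt copies
    # (Ignores nearfull ratios and uneven data distribution, so a real
    # cluster needs more margin than this.)
    def can_self_heal(num_hosts, host_capacity_tb, used_raw_tb, repl=3):
        if num_hosts <= repl:
            return False                      # no spare host to recover into
        surviving_raw_tb = (num_hosts - 1) * host_capacity_tb
        return used_raw_tb <= surviving_raw_tb

    # 4 TB of data at repl=3 is 12 TB raw, on hosts with 10 TB each:
    print(can_self_heal(3, 10, 12))   # False: repl=3 on 3 hosts, nowhere to heal
    print(can_self_heal(4, 10, 12))   # True:  the 4th host gives it room to heal

It is deliberately simplistic, but it shows why the "one OSD host more than your repl factor" advice exists.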
If your raid box has hot spares, it will automagically start repairing into them, just as a 4-OSD-host ceph cluster with repl=3 pools will recover data into the remaining 3 hosts. The difference is that the ceph cluster will use the 4th host too, but it has the ability to run on 3 (given enough free space) and can get back into a fully redundant and performant state by itself, so you can sleep overnight while it recovers and then wait for replacement parts while the cluster is not in panic mode.

--
May the most significant bit of your life be positive.
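PS: on why the switchless full mesh is neat at exactly 3 hosts and awkward beyond, the arithmetic is just ports and cables. A tiny illustrative sketch (plain Python, nothing Proxmox- or Ceph-specific):

    # Full mesh of n hosts: every host links directly to every other host.
    def full_mesh(n):
        ports_per_host = n - 1            # NIC ports each host must dedicate
        cables = n * (n - 1) // 2         # one point-to-point link per pair
        return ports_per_host, cables

    for n in (3, 4, 5, 6):
        ports, cables = full_mesh(n)
        print(f"{n} hosts: {ports} ports per host, {cables} cables")
    # 3 hosts: 2 ports (one dual-port card) and 3 cables -- the sweet spot
    # 4 hosts: 3 ports and 6 cables, 5 hosts: 4 ports and 10 cables, and so on

From 4 hosts upwards you are buying bigger NICs and re-cabling every time you grow, which is the point where a switch starts looking cheap.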