Hello,

actually, replying in the other thread was fine by me, it was after all relevant to it in a sense. And you mentioned something important there which you didn't mention below: that you're coming from DRBD with a lot of experience there. So am I, and Ceph/RBD simply isn't (and probably never will be) an adequate replacement for DRBD in some use cases. I certainly plan to keep deploying DRBD where it makes more sense (IOPS/speed), while migrating everything else to Ceph.

Anyway, let's look at your mail:

On Fri, 25 Jul 2014 14:33:56 -0400 Robert Fantini wrote:

> I've a question regarding advice from these threads:
> https://mail.google.com/mail/u/0/#label/ceph/1476b93097673ad7?compose=1476ec7fef10fd01
> https://www.mail-archive.com/ceph-users at lists.ceph.com/msg11011.html
>
> Our current setup has 4 osd's per node. When a drive fails the
> cluster is almost unusable for data entry. I want to change our set up
> so that under no circumstances ever happens.
>
While you can pretty much prevent this from happening, your cluster should still be able to handle a recovery. Ceph is a bit more ham-fisted than DRBD and definitely needs more controls and tuning to make recoveries have less of an impact, but you would see something similar with DRBD and badly configured recovery speeds.

In essence, if your current setup can't handle the loss of a single disk, what happens if a node fails? You will need to design (HW) and configure (various Ceph options) your cluster to handle these things, because at some point a recovery might be unavoidable.

To prevent recoveries caused by failed disks, use RAID; for node failures you could permanently set OSD noout, or have your monitoring software do that when it detects a node failure.
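In practice that boils down to a handful of settings and one command. Take this as a rough starting point on my part, not gospel; the actual values need tuning for your hardware and workload:

    # ceph.conf, [osd] section: keep recovery/backfill from starving client I/O
    osd max backfills = 1
    osd recovery max active = 1
    osd recovery op priority = 1

    # Set before planned maintenance, or from your monitoring when a node dies,
    # so the cluster stays degraded instead of re-replicating everything:
    ceph osd set noout
    # ... repair / reboot the node ...
    ceph osd unset noout

Just remember to unset noout once the node is back, otherwise a genuinely dead OSD will never be marked out and its data never re-replicated.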
> Network: we use 2 IB switches and bonding in fail over mode.
> Systems are two Dell Poweredge r720 and Supermicro X8DT3.
>
I'm confused. Those Dells tend to have 8 drive bays normally, don't they? So you're just using 4 HDDs for OSDs? No SSD journals? Just 2 storage nodes?

Note that unless you do use RAIDed OSDs, this leaves you vulnerable to dual disk failures. Which will happen.

Also, that SM product number is for a motherboard, not a server; is that your monitor host? Anything production with data on it that you value should have 3 mon hosts. If you can't afford dedicated ones, putting them on an OSD node (preferably with the OS on SSDs to keep leveldb happy) is better than having just one, because if that one dies or gets corrupted, your data is inaccessible.

> So looking at how to do things better we will try '#4- anti-cephalopod'.
>
That is a seriously funny phrase!

> We'll switch to using raid-10 or raid-6 and have one osd per node, using
> high end raid controllers, hot spares etc.
>
Are you still talking about the same hardware as above, just 4 HDDs for storage? With 4 HDDs I'd go for RAID10 (you definitely want a hot spare there); if you have more bays, use up to 12 drives in RAID6 with a high-performance controller that has a large HW cache.

> And use one Intel 200gb S3700 per node for journal
>
At 365MB/s write speed that's barely enough for 4 individual HDDs, but it will do nicely if those are in a RAID10 (which writes at half the aggregate speed of the individual drives). Keep in mind that your node will never be able to write faster than the speed of its journal.

> My questions:
>
> is there a minimum number of OSD's which should be used?
>
If you have one OSD per node and the disks are RAIDed, 2 OSDs aka 2 nodes are sufficient to begin with. However, your performance might not be what you expect (an OSD process seems to be incapable of doing more than about 800 write IOPS). With a 4-disk RAID10 (essentially 2 HDDs, so about 200 IOPS) that's not much of an issue; in my case, with an 11-disk RAID6 AND the 4GB HW cache Areca controller, it certainly is rather frustrating.

In short, the more nodes (OSDs) you can deploy, the better the performance will be. And of course, should a node die and you don't think it can be brought back within a sensibly short time frame, having more than 2 nodes will enable you to rebalance and restore your redundancy to the desired level.

> should OSD's per node be the same?
>
It is advantageous to have identical disks and OSD sizes; it makes the whole thing more predictable and you don't have to play with weights.

As for having different numbers of OSDs per node, consider this example: 4 nodes with 1 OSD each and one node with 4 OSDs (all OSDs of the same size). What will happen here is that all the replicas from the single-OSD nodes might wind up on the 4-OSD node, so it had better have more power in all aspects than the single-OSD nodes. Now that node fails and you decide to let things rebalance because it can't be repaired quickly. But if your cluster was half full before, it will now be 100% full and become unusable (for writes).

So the moral of the story: deploy as much identical HW as possible.

Christian

> best regards, Rob
>
> PS: I had asked above in middle of another thread... please ignore
> there.

--
Christian Balzer        Network/Systems Engineer
chibi at gol.com        Global OnLine Japan/Fusion Communications
http://www.gol.com/