Hello Christian,

Let me supply more info and answer some questions.

* Our main concern is high availability, not speed. Our storage
requirements are not huge, but we want good keyboard response 99.99% of
the time. We mostly do data entry and reporting: 20-25 users doing order
and invoice processing plus email.

* DRBD has been very reliable, but I am the SPOF, meaning that when split
brain occurs [every 18-24 months] it is me or no one who knows what to
do. Try to explain how to deal with split brain in advance... (the
procedure I keep on file is in the P.S. below). For the future, Ceph
looks like it will be easier to maintain.

* We use Proxmox, so Ceph and the mons will share each node. I've used
Proxmox for a few years and like the KVM/OpenVZ management.

* Ceph hardware: four hosts with 8 drives each.
  - OS: RAID-1 on SSD.
  - OSDs: one four-disk RAID-10 array of 2TB drives per host. Two of the
    hosts will use Seagate Constellation ES.3 2TB 7200 RPM 128MB cache
    SAS 6Gb/s drives; the other two will use Western Digital RE
    WD2000FYYZ 2TB 7200 RPM 64MB cache SATA 6.0Gb/s drives.
  - Journal: 200GB Intel DC S3700.
  - Spare disk for the RAID.

* More questions. You wrote:

  "In essence, if your current setup can't handle the loss of a single
  disk, what happens if a node fails? You will need to design (HW) and
  configure (various Ceph options) your cluster to handle these things
  because at some point a recovery might be unavoidable. To prevent
  recoveries based on failed disks, use RAID, for node failures you could
  permanently set OSD noout or have a monitoring software do that when it
  detects a node failure."

  I'll research 'OSD noout'. Are there other settings I should read up on
  / consider? For node reboots due to kernel upgrades - how is that
  handled? Of course that would be scheduled for off hours. My current
  reading of the docs is sketched just below.
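If I understand the docs correctly, a planned kernel reboot boils down to
telling the monitors not to mark the node's OSDs "out" (and thus not to
trigger a rebalance) while it is down. A minimal sketch of what I have in
mind, assuming the standard ceph CLI - please correct me if this is wrong:

    # before taking the node down: don't auto-mark its OSDs "out"
    ceph osd set noout

    # ... reboot the node into the new kernel ...

    # once its OSDs have rejoined and PGs are active+clean again
    ceph osd unset noout

The other recovery-impact knobs I've found mentioned so far are these
[osd] settings in ceph.conf. The values here are just illustrative
low-impact settings, not tested recommendations:

    [osd]
    osd max backfills = 1          # concurrent backfills per OSD
    osd recovery max active = 1    # active recovery ops per OSD
    osd recovery op priority = 1   # deprioritize recovery vs client I/O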
Any other suggestions?

thanks for the suggestions,
Rob
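P.S. Since I mentioned split brain above: the manual recovery procedure I
keep on file boils down to picking a victim node whose unreplicated
changes get thrown away. This is the DRBD 8.4 syntax from the users guide
(8.3 spells the discard option differently), with "r0" as a placeholder
resource name:

    # on the split-brain victim - its unreplicated changes are discarded
    drbdadm secondary r0
    drbdadm connect --discard-my-data r0

    # on the survivor, only if it has dropped to StandAlone
    drbdadm connect r0

After reconnecting, DRBD resyncs the victim from the survivor.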
On Sat, Jul 26, 2014 at 1:47 AM, Christian Balzer <chibi at gol.com> wrote:

> Hello,
>
> actually replying in the other thread was fine by me, it was after all
> relevant to it in a sense.
> And you mentioned something important there which you didn't mention
> below: that you're coming from DRBD with a lot of experience there.
>
> So do I, and Ceph/RBD simply isn't (and probably never will be) an
> adequate replacement for DRBD in some use cases.
> I certainly plan to keep deploying DRBD where it makes more sense
> (IOPS/speed), while migrating everything else to Ceph.
>
> Anyway, let's look at your mail:
>
> On Fri, 25 Jul 2014 14:33:56 -0400 Robert Fantini wrote:
>
> > I've a question regarding advice from these threads:
> > https://mail.google.com/mail/u/0/#label/ceph/1476b93097673ad7?compose=1476ec7fef10fd01
> > https://www.mail-archive.com/ceph-users at lists.ceph.com/msg11011.html
> >
> > Our current setup has 4 OSDs per node. When a drive fails, the
> > cluster is almost unusable for data entry. I want to change our setup
> > so that under no circumstances does that ever happen.
>
> While you can pretty much prevent this from happening, your cluster
> should still be able to handle a recovery.
> While Ceph is a bit more hamfisted than DRBD and definitely needs more
> controls and tuning to make recoveries have less of an impact, you would
> see something similar with DRBD and badly configured recovery speeds.
>
> In essence, if your current setup can't handle the loss of a single
> disk, what happens if a node fails?
> You will need to design (HW) and configure (various Ceph options) your
> cluster to handle these things, because at some point a recovery might
> be unavoidable.
> To prevent recoveries triggered by failed disks, use RAID; for node
> failures you could permanently set OSD noout, or have monitoring
> software do that when it detects a node failure.
>
> > Network: we use 2 IB switches and bonding in failover mode.
> > Systems are two Dell PowerEdge R720s and a Supermicro X8DT3.
>
> I'm confused. Those Dells tend to have 8 drive bays normally, don't
> they? So you're just using 4 HDDs for OSDs? No SSD journals?
> Just 2 storage nodes?
> Note that unless you use RAIDed OSDs, this leaves you vulnerable to dual
> disk failures. Which will happen.
>
> Also, that SM product number is for a motherboard, not a server; is that
> your monitor host?
> Anything production with data on it that you value should have 3 mon
> hosts. If you can't afford dedicated ones, sharing them with an OSD node
> (preferably with the OS on SSDs to keep leveldb happy) is better than
> just one, because if that one dies or gets corrupted, your data is
> inaccessible.
>
> > So looking at how to do things better, we will try '#4 -
> > anti-cephalopod'.
>
> That is a seriously funny phrase!
>
> > We'll switch to using raid-10 or raid-6 and have one osd per node,
> > using high-end raid controllers, hot spares etc.
>
> Are you still talking about the same hardware as above, just 4 HDDs for
> storage?
> With 4 HDDs I'd go for RAID10 (you definitely want a hot spare there);
> if you have more bays, use up to 12 for RAID6 with a high-performance,
> large-cache HW controller.
>
> > And use one Intel 200gb S3700 per node for journal
>
> That's barely enough for 4 HDDs at 365MB/s write speed, but it will do
> nicely if those are in a RAID10 (half the speed of the individual
> drives).
> Keep in mind that your node will never be able to write faster than the
> speed of its journal.
>
> > My questions:
> >
> > is there a minimum number of OSDs which should be used?
>
> If you have one OSD per node and the disks are RAIDed, 2 OSDs aka 2
> nodes is sufficient to begin with.
> However, your performance might not be what you expect (an OSD process
> seems to be incapable of doing more than 800 write IOPS).
> But with a 4-disk RAID10 (essentially 2 HDDs, so about 200 IOPS) that's
> not so much of an issue.
> In my case, with an 11-disk RAID6 AND a 4GB HW cache Areca controller,
> it certainly is rather frustrating.
>
> In short, the more nodes (OSDs) you can deploy, the better the
> performance will be. And of course, in case a node dies and you don't
> think it can be brought back in a sensibly short time frame, having more
> than 2 nodes will enable you to do a recovery/rebalance and restore your
> redundancy to the desired level.
>
> > should the number of OSDs per node be the same?
>
> It is advantageous to have identical disks and OSD sizes; it makes the
> whole thing more predictable and you don't have to play with weights.
>
> As for having different numbers of OSDs per node, consider this example:
>
> 4 nodes with 1 OSD, one node with 4 OSDs (all OSDs of the same size).
> What will happen here is that all the replicas from the single-OSD nodes
> might wind up on the 4-OSD node, so it had better have more power in all
> aspects than the single-OSD nodes.
> Now that node fails and you decide to let things rebalance, as it can't
> be repaired shortly. But your cluster was half full, and now it will be
> 100% full and become unusable (for writes).
>
> So the moral of the story: deploy as much identical HW as possible.
>
> Christian
>
> > best regards, Rob
> >
> > PS: I had asked the above in the middle of another thread... please
> > ignore it there.
>
> --
> Christian Balzer        Network/Systems Engineer
> chibi at gol.com          Global OnLine Japan/Fusion Communications
> http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users at lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com