The OSDs were created using the Proxmox web page. There is no data that I want to save, so I'd like to start from scratch but not do a reinstall of the operating system. I'll check the documentation that you mentioned.

On Mon, Jul 28, 2014 at 4:38 AM, Christian Balzer <chibi at gol.com> wrote:
>
> On Mon, 28 Jul 2014 04:19:16 -0400 Robert Fantini wrote:
>
> > I have 3 hosts that I want to use to test the new setup...
> >
> > Currently they have 3-4 OSDs each.
> >
> How did you create the current cluster?
> ceph-deploy or something within Proxmox?
>
> > Could you suggest a fast way to remove all the OSDs?
> >
> There is documentation on how to remove OSDs in the manual deployment section.
>
> If you can (have no data on it), why not start from scratch?
>
> Christian
>
> > On Mon, Jul 28, 2014 at 3:49 AM, Christian Balzer <chibi at gol.com> wrote:
> >
> > > Hello,
> > >
> > > On Sun, 27 Jul 2014 18:20:43 -0400 Robert Fantini wrote:
> > >
> > > > Hello Christian,
> > > >
> > > > Let me supply more info and answer some questions.
> > > >
> > > > * Our main concern is high availability, not speed. Our storage requirements are not huge. However, we want good keyboard response 99.99% of the time. We mostly do data entry and reporting: 20-25 users doing mostly order and invoice processing, and email.
> > > >
> > > > * DRBD has been very reliable, but I am the SPOF, meaning that when split brain occurs [every 18-24 months] it is me or no one who knows what to do. Try to explain how to deal with split brain in advance... For the future, Ceph looks like it will be easier to maintain.
> > > >
> > > The DRBD people would of course tell you to configure things in a way that a split brain can't happen. ^o^
> > >
> > > Note that given the right circumstances (too many OSDs down, MONs down) Ceph can wind up in a similar state.
> > >
> > > > * We use Proxmox, so Ceph and the MONs will share each node. I've used Proxmox for a few years and like the KVM / OpenVZ management.
> > > >
> > > I tried it some time ago, but at that time it was still stuck with 2.6.32 due to OpenVZ and that wasn't acceptable to me for various reasons. I think it still is, too.
> > >
> > > > * Ceph hardware:
> > > >
> > > > Four hosts, 8 drives each.
> > > >
> > > > OPSYS: RAID-1 on SSD.
> > > >
> > > Good, that should be sufficient for running MONs (you will want 3).
> > >
> > > > OSD: four-disk RAID-10 array using 2TB drives.
> > > >
> > > > Two of the systems will use Seagate Constellation ES.3 2TB 7200 RPM 128MB Cache SAS 6Gb/s drives; the other two hosts use Western Digital RE WD2000FYYZ 2TB 7200 RPM 64MB Cache SATA 6.0Gb/s drives.
> > > >
> > > > Journal: 200GB Intel DC S3700 Series.
> > > >
> > > > Spare disk for RAID.
> > > >
> > > > * More questions. You wrote:
> > > > "In essence, if your current setup can't handle the loss of a single disk, what happens if a node fails? You will need to design (HW) and configure (various Ceph options) your cluster to handle these things because at some point a recovery might be unavoidable.
> > > >
> > > > To prevent recoveries based on failed disks, use RAID; for node failures you could permanently set OSD noout or have a monitoring software do that when it detects a node failure."
> > > >
> > > > I'll research 'OSD noout'.
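
A quick note to myself while researching that: noout seems to be a cluster-wide flag toggled from any node with an admin keyring, roughly like this (a sketch, not yet tried here):

    ceph osd set noout      # down OSDs stay "in", so no automatic rebalance/backfill starts
    ceph osd unset noout    # back to normal behaviour once the node is up again
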
> > > You probably might be happy with the "mon osd downout subtree limit" set to "host" as well.
> > > In that case you will need to manually trigger a rebuild (set that node/OSD to out) if you can't repair a failed node in a short time and want to keep your redundancy levels.
> > >
> > > > Are there other settings I should read up on / consider?
> > > >
> > > > For node reboots due to kernel upgrades - how is that handled? Of course that would be scheduled for off hours.
> > > >
> > > Set noout before a planned downtime, or live dangerously and assume it comes back within the timeout period (5 minutes IIRC).
> > >
> > > > Any other suggestions?
> > >
> > > Test your cluster extensively before going into production.
> > >
> > > Fill it with enough data to be close to what you're expecting and fail one node/OSD.
> > >
> > > See how bad things become, and try to determine where any bottlenecks are with tools like atop.
> > >
> > > While you've done pretty much everything to prevent that scenario from a disk failure with the RAID10, and by keeping nodes from being set out by whatever means you choose ("mon osd downout subtree limit = host" seems to work, I just tested it), having a cluster that doesn't melt down when recovering, or at least knowing how bad things will be in such a scenario, helps a lot.
> > >
> > > Regards,
> > >
> > > Christian
> > >
> > > > Thanks for the suggestions,
> > > > Rob
> > > >
> > > > On Sat, Jul 26, 2014 at 1:47 AM, Christian Balzer <chibi at gol.com> wrote:
> > > >
> > > > > Hello,
> > > > >
> > > > > Actually, replying in the other thread was fine by me; it was, after all, relevant in a sense to it.
> > > > > And you mentioned something important there which you didn't mention below: that you're coming from DRBD with a lot of experience there.
> > > > >
> > > > > So do I, and Ceph/RBD simply isn't (and probably never will be) an adequate replacement for DRBD in some use cases.
> > > > > I certainly plan to keep deploying DRBD where it makes more sense (IOPS/speed), while migrating everything else to Ceph.
> > > > >
> > > > > Anyway, let's look at your mail:
> > > > >
> > > > > On Fri, 25 Jul 2014 14:33:56 -0400 Robert Fantini wrote:
> > > > >
> > > > > > I've a question regarding advice from these threads:
> > > > > >
> > > > > > https://mail.google.com/mail/u/0/#label/ceph/1476b93097673ad7?compose=1476ec7fef10fd01
> > > > > > https://www.mail-archive.com/ceph-users at lists.ceph.com/msg11011.html
> > > > > >
> > > > > > Our current setup has 4 OSDs per node. When a drive fails, the cluster is almost unusable for data entry. I want to change our setup so that this never happens under any circumstances.
> > > > > >
> > > > > While you can pretty much avoid this from happening, your cluster should be able to handle a recovery.
> > > > > While Ceph is a bit more hamfisted than DRBD and definitely needs more controls and tuning to make recoveries have less of an impact, you would see something similar with DRBD and badly configured recovery speeds.
> > > > >
> > > > > In essence, if your current setup can't handle the loss of a single disk, what happens if a node fails?
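
For that node-failure case, the subtree-limit setting Christian mentions looks like a MON-side ceph.conf entry; my untested reading of it (using the option spelling I find in the config reference) would be roughly:

    [mon]
    # do not automatically mark out anything at host scope or larger; a dead node
    # then sits "down" until its OSDs are marked out by hand
    mon osd down out subtree limit = host
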
> > > > > You will need to design (HW) and configure (various Ceph options) your cluster to handle these things, because at some point a recovery might be unavoidable.
> > > > >
> > > > > To prevent recoveries based on failed disks, use RAID; for node failures you could permanently set OSD noout or have a monitoring software do that when it detects a node failure.
> > > > >
> > > > > > Network: we use 2 IB switches and bonding in failover mode.
> > > > > > Systems are two Dell PowerEdge R720 and a Supermicro X8DT3.
> > > > > >
> > > > > I'm confused. Those Dells tend to have 8 drive bays normally, don't they? So you're just using 4 HDDs for OSDs? No SSD journals? Just 2 storage nodes?
> > > > > Note that unless you do use RAIDed OSDs, this leaves you vulnerable to dual disk failures. Which will happen.
> > > > >
> > > > > Also, that SM product number is for a motherboard, not a server; is that your monitor host?
> > > > > Anything production with data in it that you value should have 3 MON hosts. If you can't afford dedicated ones, sharing them on an OSD node (preferably with the OS on SSDs to keep leveldb happy) is better than just one, because if that one dies or gets corrupted, your data is inaccessible.
> > > > >
> > > > > > So looking at how to do things better, we will try '#4 - anti-cephalopod'. That is a seriously funny phrase!
> > > > > >
> > > > > > We'll switch to using RAID-10 or RAID-6 and have one OSD per node, using high-end RAID controllers, hot spares, etc.
> > > > > >
> > > > > Are you still talking about the same hardware as above, just 4 HDDs for storage?
> > > > > With 4 HDDs I'd go for RAID10 (definitely want a hot spare there); if you have more bays, use up to 12 for RAID6 with a high-performance and large HW cache controller.
> > > > >
> > > > > > And use one Intel 200GB S3700 per node for the journal.
> > > > > >
> > > > > That's barely enough for 4 HDDs at 365MB/s write speed, but it will do nicely if those are in a RAID10 (half the speed of the individual drives).
> > > > > Keep in mind that your node will never be able to write faster than the speed of your journal.
> > > > >
> > > > > > My questions:
> > > > > >
> > > > > > Is there a minimum number of OSDs which should be used?
> > > > > >
> > > > > If you have one OSD per node and the disks are RAIDed, 2 OSDs aka 2 nodes is sufficient to begin with.
> > > > > However, your performance might not be what you expect (an OSD process seems to be incapable of doing more than 800 write IOPS).
> > > > > But with a 4-disk RAID10 (essentially 2 HDDs, so about 200 IOPS) that's not so much of an issue.
> > > > > In my case, with an 11-disk RAID6 AND the 4GB HW cache Areca controller, it certainly is rather frustrating.
> > > > >
> > > > > In short, the more nodes (OSDs) you can deploy, the better the performance will be. And of course, in case a node dies and you don't think it can be brought back in a sensibly short time frame, having more than 2 nodes will enable you to do a recovery/rebalance and restore your redundancy to the desired level.
> > > > >
> > > > > > Should OSDs per node be the same?
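
On the journal sizing a few paragraphs up, a rough back-of-the-envelope (assuming something like 150MB/s sequential per 7200 RPM drive): a 4-disk RAID10 writes at about 2 x 150 = 300MB/s, which stays below the 200GB S3700's ~365MB/s, so with one OSD per node on that array the journal shouldn't be the bottleneck.
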
> > > > > >
> > > > > It is advantageous to have identical disks and OSD sizes; it makes the whole thing more predictable and you don't have to play with weights.
> > > > >
> > > > > As for having a different number of OSDs per node, consider this example:
> > > > >
> > > > > 4 nodes with 1 OSD, one node with 4 OSDs (all OSDs are of the same size). What will happen here is that all the replicas from the single-OSD nodes might wind up on the 4-OSD node. So it had better have more power in all aspects than the single-OSD nodes.
> > > > > Now that node fails and you decide to let things rebalance, as it can't be repaired shortly. But your cluster was half full, and now it will be 100% full and become unusable (for writes).
> > > > >
> > > > > So the moral of the story: deploy as much identical HW as possible.
> > > > >
> > > > > Christian
> > > > >
> > > > > > Best regards, Rob
> > > > > >
> > > > > > PS: I had asked the above in the middle of another thread... please ignore it there.
> > > > >
> > > > > --
> > > > > Christian Balzer        Network/Systems Engineer
> > > > > chibi at gol.com          Global OnLine Japan/Fusion Communications
> > > > > http://www.gol.com/
> > > > > _______________________________________________
> > > > > ceph-users mailing list
> > > > > ceph-users at lists.ceph.com
> > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > >
> > > --
> > > Christian Balzer        Network/Systems Engineer
> > > chibi at gol.com          Global OnLine Japan/Fusion Communications
> > > http://www.gol.com/
>
> --
> Christian Balzer        Network/Systems Engineer
> chibi at gol.com          Global OnLine Japan/Fusion Communications
> http://www.gol.com/
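
PS: From a first look at the manual-deployment docs you pointed me to, tearing down each OSD seems to boil down to roughly this (untested on my side; <id> being the OSD number, and with no data we care about draining):

    ceph osd out <id>                # mark it out of the cluster
    /etc/init.d/ceph stop osd.<id>   # or however Proxmox stops the OSD daemon
    ceph osd crush remove osd.<id>   # remove it from the CRUSH map
    ceph auth del osd.<id>           # delete its authentication key
    ceph osd rm <id>                 # remove the OSD entry itself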