Hello Dan,

On Fri, 5 Sep 2014 07:46:12 +0000 Dan Van Der Ster wrote:

> Hi Christian,
> 
> > On 05 Sep 2014, at 03:09, Christian Balzer <chibi at gol.com> wrote:
> > 
> > Hello,
> > 
> > On Thu, 4 Sep 2014 14:49:39 -0700 Craig Lewis wrote:
> > 
> >> On Thu, Sep 4, 2014 at 9:21 AM, Dan Van Der Ster
> >> <daniel.vanderster at cern.ch> wrote:
> >> 
> >>> 1) How often are DC S3700's failing in your deployments?
> >>> 
> >> None of mine have failed yet. I am planning to monitor the wear level
> >> indicator, and preemptively replace any SSDs that go below 10%.
> >> Manually flushing the journal, replacing the SSD, and building a new
> >> journal is much faster than backfilling all the dependent OSDs.
> >> 
> > What Craig said.
> > 
> > Hell, not even one of the consumer Intels (3xx, 520s) I have has ever
> > failed, though they are aging faster of course.
> > Still got some ancient X-25s that haven't gone below 96% wearout.
> > 
> > I expect my DC 3700s to outlive 2 HDD generations. ^o^
> > 
> > Monitor and replace them accordingly and I doubt you'll ever lose one
> > in operation.
> 
> OK, that's good to know.
> 
> >>> 2) If you have SSD journals at a ratio of 1 to 4 or 5, how painful is
> >>> the backfilling which results from an SSD failure? Have you
> >>> considered tricks like increasing the down out interval so
> >>> backfilling doesn't happen in this case (leaving time for the SSD to
> >>> be replaced)?
> >>> 
> >> Replacing a failed SSD won't help your backfill. I haven't actually
> >> tested it, but I'm pretty sure that losing the journal effectively
> >> corrupts your OSDs. I don't know what steps are required to complete
> >> this operation, but it wouldn't surprise me if you need to re-format
> >> the OSD.
> >> 
> > This.
> > All the threads I've read about this indicate that journal loss during
> > operation means OSD loss. Total OSD loss, no recovery.
> > From what I gathered the developers are aware of this and it might be
> > addressed in the future.
> 
> I suppose I need to try it then. I don't understand why you can't just
> use ceph-osd -i 10 --mkjournal to rebuild osd 10's journal, for example.
> 
I think the logic is that if you shut down an OSD cleanly beforehand, you
can just do that (rough sketch further down). However, from what I gathered
there is no logic to re-issue transactions that made it to the journal but
not the filestore, so a journal SSD failing mid-operation under a busy OSD
would certainly leave the OSD in that state.
I'm sure (hope) somebody from the Ceph team will pipe up about this.

> > Now 200GB DC 3700s can write close to 400MB/s, so a 1:4 or even 1:5
> > ratio is sensible. However these will be the ones limiting your max
> > sequential write speed, if that is of importance to you. In nearly all
> > use cases you run out of IOPS (on your HDDs) long before that becomes
> > an issue, though.
> 
> IOPS is definitely the main limit, but we also only have 1 single
> 10Gig-E NIC on these servers, so 4 drives that can write (even only
> 200MB/s) would be good enough.
> 
Fair enough. ^o^

> Also, we'll put the SSDs in the first four ports of an SAS2008 HBA which
> is shared with the other 20 spinning disks. Counting the double writes,
> the HBA will run out of bandwidth before these SSDs, I expect.
> 
Depends on what PCIe slot it is in and so forth. A 2008 should give you
4GB/s, enough to keep the SSDs happy at least. ^o^
A 2008 has only 8 SAS/SATA ports, so are you using port expanders on your
case backplane?
In that case you might want to spread the SSDs out over channels, as in
having 3 HDDs share one channel with one SSD.
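
To make the journal replacement Craig described above concrete, this is
roughly the procedure I'd expect to work for a planned swap. Untested
sketch: osd.10 and /dev/sdX are just examples, and the service commands
depend on your init system.

  # check wear on the journal SSD (Intel SMART attribute 233)
  smartctl -A /dev/sdX | grep Media_Wearout_Indicator

  # stop the OSD cleanly and flush its journal into the filestore
  service ceph stop osd.10
  ceph-osd -i 10 --flush-journal

  # swap the SSD, recreate the journal partition, point the
  # /var/lib/ceph/osd/ceph-10/journal symlink at it, then
  ceph-osd -i 10 --mkjournal
  service ceph start osd.10

None of that helps with a journal that dies mid-flight, of course; that is
the "total OSD loss" case above.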

> > Raiding the journal SSDs seems wasteful given the cost and quality of
> > the DC 3700s.
> > Configure your cluster in a way that re-balancing doesn't happen unless
> > you want it to (when the load is low) by:
> > a) Setting the "mon osd downout subtree limit" so that a host going
> > down doesn't result in a full re-balancing and the resulting IO shit
> > storm. In nearly all cases nodes are recoverable, and if one isn't, the
> > OSDs may be. And even if that fails, you get to pick the time for the
> > recovery.
> 
> This is a good point - I have it set at the rack level now. The whole
> node failure we experienced manifested as a device remove of all 24
> drives followed quickly by a hot-insert. Restarting the daemons brought
> those OSDs back online (though it was outside of working hours, so
> backfilling kicked in before anyone noticed).
> 
Lucky! ^o^

> > b) As you mentioned and others have before, set the out interval so you
> > can react to things.
> 
> We use 15 minutes, which is so we can reboot a host without backfilling.
> What do you use?
> 
I'm not using it right now, but for the cluster I'm currently deploying I
will go with something like 4 hours (as do others here), or more if I feel
that I might not be in time to set the cluster to "noout" if warranted.
(A rough ceph.conf sketch of these knobs is further down in this mail.)

> > c) Configure the various backfill options to have only a small impact.
> > Journal SSDs will improve things compared to your current situation.
> > And if I recall correctly, you're using a replica size of 3 to 4, so
> > you can afford a more sedate recovery.
> 
> It's already at 1 backfill, 1 recovery, and the lowest queue priority
> (1/63) for recovery IOs.
> 
So how long does it take you to recover 1TB in the case of a single OSD
failure? And is that setting still impacting your performance more than
you'd like?

> > Journals on a filesystem go against KISS.
> > Not only do you add one more layer of complexity that can fail (and
> > filesystems do have bugs, as people were reminded when Firefly came
> > out), you're also wasting CPU cycles that might be needed over in the
> > less than optimal OSD code. ^o^
> > And you gain nothing from putting journals on a filesystem.
> 
> Well, the gains that I had in mind resulted from my assumption that you
> can create a new empty journal on another device, then restart the OSD.
> If that's not possible, then I agree there are no gains to speak of.
> 
You can always create a new partition as well, if there is enough space.

Regards,

Christian

> > You might want to look into cache pools (and dedicated SSD servers with
> > fast controllers and CPUs) in your test cluster and for the future.
> > Right now my impression is that there is quite a bit more polishing to
> > be done (retention of hot objects, etc.) and there have been stability
> > concerns raised here.
> 
> Right, Greg already said publicly not to use the cache tiers for RBD.
> 
> Thanks for your thorough response; you've provided a lot of confidence
> that the traditional journal deployment is still a good or even the best
> option.
> 
> Cheers, Dan
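
PS: Since a number of knobs came up in this thread, here is roughly what
they would look like in ceph.conf. The values are just the ones mentioned
above (Dan's 15 minutes, my 4 hours, 1 backfill/1 recovery at the lowest
priority), so treat this as a sketch and double-check the option names
against the documentation for your release:

  [global]
      # don't automatically mark out all the OSDs of a whole host at once
      mon osd downout subtree limit = host
      # wait 4 hours before marking a down OSD out (Dan uses 900 = 15 min)
      mon osd down out interval = 14400

  [osd]
      # throttle recovery/backfill so client IO keeps priority
      osd max backfills = 1
      osd recovery max active = 1
      osd recovery op priority = 1

For planned maintenance you can of course also "ceph osd set noout"
beforehand and "ceph osd unset noout" afterwards.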
> 
> > 
> > Regards,
> > 
> > Christian
> > --
> > Christian Balzer        Network/Systems Engineer
> > chibi at gol.com          Global OnLine Japan/Fusion Communications
> > http://www.gol.com/
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users at lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 

--
Christian Balzer        Network/Systems Engineer
chibi at gol.com          Global OnLine Japan/Fusion Communications
http://www.gol.com/