SSD journal deployment experiences

Hello Dan,

On Fri, 5 Sep 2014 07:46:12 +0000 Dan Van Der Ster wrote:

> Hi Christian,
> 
> > On 05 Sep 2014, at 03:09, Christian Balzer <chibi at gol.com> wrote:
> > 
> > 
> > Hello,
> > 
> > On Thu, 4 Sep 2014 14:49:39 -0700 Craig Lewis wrote:
> > 
> >> On Thu, Sep 4, 2014 at 9:21 AM, Dan Van Der Ster
> >> <daniel.vanderster at cern.ch> wrote:
> >> 
> >>> 
> >>> 
> >>> 1) How often are DC S3700's failing in your deployments?
> >>> 
> >> 
> >> None of mine have failed yet.  I am planning to monitor the wear level
> >> indicator, and preemptively replace any SSDs that go below 10%.
> >> Manually flushing the journal, replacing the SSD, and building a new
> >> journal is much faster than backfilling all the dependent OSDs.
> >> 
> > What Craig said.
> > 
> > Hell, not even one of the consumer Intels (3xx, 520s) I have has ever
> > failed, though they are aging faster of course.
> > Still got some ancient X-25s that haven't gone below 96% wearout.
> > 
> > I expect my DC 3700s to outlive 2 HDD generations. ^o^ 
> > 
> > Monitor and replace them accordingly and I doubt you'll ever lose one
> > in operation.
> 
> OK, that's good to know.
> 
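(Not that you asked, but for the archives: a rough sketch of how one can
watch the wear level, assuming smartmontools and an Intel SSD; the device
path is just a placeholder and the attribute name/number can differ per
model and firmware.)

    # Intel SSDs report remaining life as SMART attribute 233,
    # Media_Wearout_Indicator, which counts down from 100 towards 0.
    smartctl -A /dev/sdX | grep -i Media_Wearout

Wire that into whatever monitoring you use (a Nagios check or similar) and
alert well before it hits your replacement threshold.
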
> > 
> >> 
> >> 
> >>> 2) If you have SSD journals at a ratio of 1 to 4 or 5, how painful is
> >>> the backfilling which results from an SSD failure? Have you
> >>> considered tricks like increasing the down out interval so
> >>> backfilling doesn't happen in this case (leaving time for the SSD to
> >>> be replaced)?
> >>> 
> >> 
> >> Replacing a failed SSD won't help your backfill.  I haven't actually
> >> tested it, but I'm pretty sure that losing the journal effectively
> >> corrupts your OSDs.  I don't know what steps are required to complete
> >> this operation, but it wouldn't surprise me if you need to re-format
> >> the OSD.
> >> 
> > This.
> > All the threads I've read about this indicate that journal loss during
> > operation means OSD loss. Total OSD loss, no recovery.
> > From what I gathered the developers are aware of this and it might be
> > addressed in the future.
> > 
> 
> I suppose I need to try it then. I don't understand why you can't just
> use ceph-osd -i 10 --mkjournal to rebuild osd 10's journal, for example.
> 
I think the logic is that if you shut down an OSD cleanly beforehand you
can just do that.
However, from what I gathered there is no mechanism to re-issue
transactions that made it to the journal but not the filestore, so a
journal SSD failing mid-operation on a busy OSD would almost certainly
leave it in exactly that state.
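
For the clean-shutdown case, something along these lines ought to work
(a rough sketch only; the OSD id and the init invocation are just
examples, adjust for your distribution):

    # Only valid for a cleanly stopped OSD, not for a journal
    # that died mid-operation.
    service ceph stop osd.10
    ceph-osd -i 10 --flush-journal    # drain pending transactions to the filestore
    # ... point the journal symlink/config at the new device ...
    ceph-osd -i 10 --mkjournal        # create the new, empty journal
    service ceph start osd.10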

I'm sure (hope) somebody from the Ceph team will pipe up about this.

> > Now 200GB DC 3700s can write close to 400MB/s so a 1:4 or even 1:5
> > ratio is sensible. However these will be the ones limiting your max
> > sequential write speed if that is of importance to you. In nearly all
> > use cases you run out of IOPS (on your HDDs) long before that becomes
> > an issue, though.
> 
> IOPS is definitely the main limit, but we also only have a single
> 10Gig-E NIC on these servers, so 4 drives that can write (even only
> 200MB/s) would be good enough.
> 
Fair enough. ^o^
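
Quick back-of-the-envelope, assuming roughly 1.2GB/s usable on a single
10Gig-E link: 4 journals x ~200MB/s = ~800MB/s, so the SSDs and the NIC
are in the same ballpark and neither is a glaring bottleneck on its own.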

> Also, we'll put the SSDs in the first four ports of an SAS2008 HBA which
> is shared with the other 20 spinning disks. Counting the double writes,
> the HBA will run out of bandwidth before these SSDs, I expect.
> 
Depends on which PCIe slot it is in and so forth. A 2008 should give you
about 4GB/s, enough to keep the SSDs happy at least. ^o^

A 2008 has only 8 SAS/SATA ports, so are you using port expanders on your
case backplane? 
In that case you might want to spread the SSDs out over the channels, as
in having 3 HDDs share one channel with one SSD.
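
Rough numbers, assuming a PCIe 2.0 x8 slot (~4GB/s) and counting the
journal double write:

    4 SSD journals x ~400MB/s = ~1.6GB/s
    20 HDDs        x ~150MB/s = ~3.0GB/s
    worst-case streaming      = ~4.6GB/s

So yes, under a purely sequential load the HBA (or the expander links)
would give out a little before the SSDs do, as you expect.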

> > Raiding the journal SSDs seems wasteful given the cost and quality of
> > the DC 3700s. 
> > Configure your cluster in a way that re-balancing doesn't happen unless
> > you want it to (when the load is low) by:
> > a) Setting the "mon osd downout subtree limit" so that a host going
> > down doesn't result in a full re-balancing and the resulting IO shit
> > storm. In nearly all cases nodes are recoverable, and even if one
> > isn't, its OSDs may be. And even if that fails, you get to pick the
> > time for the recovery.
> 
> This is a good point; I have it set at the rack level now. The whole
> node failure we experienced manifested as a device remove of all 24
> drives followed quickly by a hot-insert. Restarting the daemons brought
> those OSDs back online (though it was outside of working hours, so
> backfilling kicked in before anyone noticed).
> 
Lucky! ^o^

> 
> > b) As you mentioned and others have before, set the out interval so you
> > can react to things. 
> 
> We use 15 minutes, which is so we can reboot a host without backfilling.
> What do you use?
> 
I'm not using it right now, but for the cluster I'm currently deploying I
will go with something like 4 hours (as others here do), or more if I feel
I might not manage to set the cluster to "noout" in time when warranted.
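
For reference, the relevant knobs would look something like this in
ceph.conf (values purely illustrative, not a recommendation):

    [mon]
        mon osd down out interval      = 14400   # 4 hours before auto-marking "out"
        mon osd down out subtree limit = host    # don't auto-out a whole host at once

And for planned maintenance you can of course set it by hand:

    ceph osd set noout
    # ... do the work ...
    ceph osd unset noout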

> > c) Configure the various backfill options to have only a small impact.
> > Journal SSDs will improve things compared to your current situation.
> > And if I recall correctly, you're using a replica size of 3 to 4, so
> > you can afford a more sedate recovery.
> 
> It's already at 1 backfill, 1 recovery, and the lowest queue priority
> (1/63) for recovery IOs.
> 
So how long does it take you then to recover 1TB in the case of a single
OSD failure?
And is that setting still impacting your performance more than you'd like?
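
For the archives, as I understand your description that would be roughly:

    [osd]
        osd max backfills        = 1
        osd recovery max active  = 1
        osd recovery op priority = 1    # client ops stay at the default of 63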
 
> > Journals on a filesystem go against KISS. 
> > Not only do you add one more layer of complexity that can fail (and
> > filesystems do have bugs as people were reminded when Firefly came
> > out), you're also wasting CPU cycles that might be needed over in the
> > less than optimal OSD code. ^o^
> > And you gain nothing from putting journals on a filesystem.
> 
> Well the gains that I had in mind resulted from my assumption that you
> can create a new empty journal on another device, then restart the OSD.
> If that's not possible, then I agree there are no gains to speak of.
> 
You can always create a new partition as well, if there is enough space.
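
A minimal sketch of what I mean (the partition UUID and OSD id are
placeholders; you can point the journal at the raw partition either via a
symlink or via ceph.conf):

    # with the OSD cleanly stopped and the old journal flushed, as above
    ln -sf /dev/disk/by-partuuid/<your-partition-uuid> \
        /var/lib/ceph/osd/ceph-10/journal
    ceph-osd -i 10 --mkjournal

or equivalently in ceph.conf:

    [osd.10]
        osd journal = /dev/disk/by-partuuid/<your-partition-uuid>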

Regards,

Christian

> 
> > You might want to look into cache pools (and dedicated SSD servers with
> > fast controllers and CPUs) in your test cluster and for the future.
> > Right now my impression is that there is quite a bit more polishing to
> > be done (retention of hot objects, etc) and there have been stability
> > concerns raised here.
> 
> Right, Greg already said publicly not to use the cache tiers for RBD.
> 
> Thanks for your thorough response; you've provided a lot of confidence
> that the traditional journal deployment is still a good or even the best
> option.
> 
> Cheers, Dan
> 
> 
> > 
> > Regards,
> > 
> > Christian
> > -- 
> > Christian Balzer        Network/Systems Engineer                
> > chibi at gol.com   	Global OnLine Japan/Fusion Communications
> > http://www.gol.com/
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users at lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


-- 
Christian Balzer        Network/Systems Engineer                
chibi at gol.com   	Global OnLine Japan/Fusion Communications
http://www.gol.com/

