SSD journal deployment experiences

Hi Craig,

September 4 2014 11:50 PM, "Craig Lewis" <clewis at centraldesktop.com> wrote: 
> On Thu, Sep 4, 2014 at 9:21 AM, Dan Van Der Ster <daniel.vanderster at cern.ch> wrote:
> 
>> 1) How often are DC S3700's failing in your deployments?
> 
> None of mine have failed yet.  I am planning to monitor the wear level indicator, and preemptively
> replace any SSDs that go below 10%.  Manually flushing the journal, replacing the SSD, and building
> a new journal is much faster than backfilling all the dependent OSDs.
> 

That's good to know. I would plan similarly for wear-out. But I also want to prepare for catastrophic failures -- in the past we've had SSDs just disappear, as if the device had been unplugged. Those were older OCZs, though...
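
For what it's worth, the wear check I have in mind is just polling smartctl for the Media_Wearout_Indicator attribute (233 on the Intel drives, as far as I can tell) -- a rough sketch only, with the device list and the 10% threshold as placeholders:

    # rough sketch of a wear check; /dev/sdb and /dev/sdc are placeholder
    # journal devices, 10 is the "replace below 10%" threshold mentioned above
    for dev in /dev/sdb /dev/sdc; do
        wear=$(smartctl -A "$dev" | awk '$1 == 233 {print $4}')
        [ -n "$wear" ] && [ "$wear" -lt 10 ] && echo "WARNING: $dev wear indicator at $wear"
    done

The normalized value starts at 100 and counts down, so alerting a bit above that floor should leave enough time to schedule the swap.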

>> 2) If you have SSD journals at a ratio of 1 to 4 or 5, how painful is the backfilling which
>> results from an SSD failure? Have you considered tricks like increasing the down out interval
>> so backfilling doesn't happen in this case (leaving time for the SSD to be replaced)?
> 
> Replacing a failed SSD won't help your backfill.  I haven't actually tested it, but I'm pretty sure
> that losing the journal effectively corrupts your OSDs.  I don't know what steps are required to
> complete this operation, but it wouldn't surprise me if you need to re-format the OSD.

I'm really curious about this point. When we lose a journal, isn't it just the in-flight writes that would fail? Recent writes should already be written (buffered) to the filestore, flushed out by the kernel eventually, and thus persisted. And the failing write that took out the journal, well, that would fail over to another OSD.

So I had assumed that re-using that filestore with a newly prepared (empty) journal would be ok.
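
If that assumption holds, the recovery I'm picturing is roughly the following -- completely untested, and osd.12 plus the Upstart syntax are just placeholders for illustration:

    ceph osd set noout                  # keep the cluster from marking OSDs out while we work
    stop ceph-osd id=12                 # if the journal loss hasn't already killed the daemon
    # ... replace the SSD, recreate the journal partition, and point
    # /var/lib/ceph/osd/ceph-12/journal at the new partition ...
    ceph-osd -i 12 --mkjournal          # write a fresh, empty journal for the existing filestore
    start ceph-osd id=12
    ceph osd unset noout

Bumping mon osd down out interval on the mons (the trick from my original question) would buy the same grace period without the global noout flag.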

>> Next, I wonder how people with puppet/chef/... are handling the creation/re-creation of the SSD
>> devices. Are you just wiping and rebuilding all the dependent OSDs completely when the journal
>> dev fails? I'm not keen on puppetizing the re-creation of journals for OSDs...
> 
> So far, I'm doing my disk zapping manually.  Automatically zapping disks makes me nervous.  :-)
> 
> I'm of the opinion that you shouldn't automate something until it saves you time versus doing it
> by hand.  My cluster is small enough that it's faster to do it manually.

I have the single-platter case puppetized, and drive replacements are already automated.

But moving to multi-device OSDs breaks my simple manifests. I'm still working out how best to map OSDs to a given SSD journal partition, and then how to find the correct one when replacing a device.
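
So far the best I have for the mapping is to read it back off the hosts themselves, since each OSD keeps a journal symlink under its data dir -- just a sketch of the discovery step, assuming the default /var/lib/ceph layout:

    # print osd -> journal partition, resolving the journal symlink
    for osd in /var/lib/ceph/osd/ceph-*; do
        echo "$(basename "$osd") -> $(readlink -f "$osd"/journal)"
    done
    # "ceph-disk list" gives roughly the same view from the device side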

>> We also have this crazy idea of failing over to a local journal file in case an SSD fails. In
>> this model, when an SSD fails we'd quickly create a new journal either on another SSD or on the
>> local OSD filesystem, then restart the OSDs before backfilling started. Thoughts?
> 
> See #2.
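
To spell out the crazy idea a bit more: assuming the filestore really does survive the journal loss (per my question above), the stopgap would just swap the journal symlink for a plain file before the down/out timer expires. Entirely untested, and osd.12 is again a placeholder:

    rm /var/lib/ceph/osd/ceph-12/journal    # dangling symlink to the dead SSD
    ceph-osd -i 12 --mkjournal              # should recreate the journal as a plain file on the
                                            # filestore disk, sized by "osd journal size"
    start ceph-osd id=12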

Cheers, Dan

