On Thu, Sep 4, 2014 at 9:21 AM, Dan Van Der Ster
<daniel.vanderster at cern.ch> wrote:
>
> 1) How often are DC S3700s failing in your deployments?
>

None of mine have failed yet. I am planning to monitor the wear level
indicator, and preemptively replace any SSDs that go below 10%.
Manually flushing the journal, replacing the SSD, and building a new
journal is much faster than backfilling all the dependent OSDs (rough
sketches of both at the end of this mail).

> 2) If you have SSD journals at a ratio of 1 to 4 or 5, how painful is
> the backfilling which results from an SSD failure? Have you considered
> tricks like increasing the down out interval so backfilling doesn't
> happen in this case (leaving time for the SSD to be replaced)?
>

Replacing a failed SSD won't help your backfill. I haven't actually
tested it, but I'm pretty sure that losing the journal effectively
corrupts your OSDs. I don't know what steps are required to recover
from that, but it wouldn't surprise me if you need to re-format the
OSD.

> Next, I wonder how people with puppet/chef/... are handling the
> creation/re-creation of the SSD devices. Are you just wiping and
> rebuilding all the dependent OSDs completely when the journal dev
> fails? I'm not keen on puppetizing the re-creation of journals for
> OSDs...
>

So far, I'm doing my disk zapping manually. Automatically zapping
disks makes me nervous. :-) I'm of the opinion that you shouldn't
automate something until the automation saves you time over doing it
by hand. My cluster is small enough that it's faster to do it
manually.

> We also have this crazy idea of failing over to a local journal file
> in case an SSD fails. In this model, when an SSD fails we'd quickly
> create a new journal either on another SSD or on the local OSD
> filesystem, then restart the OSDs before backfilling started.
> Thoughts?
>

See #2.
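
For reference, here's roughly how I plan to watch the wear level. On
Intel SSDs like the DC S3700 it's SMART attribute 233
(Media_Wearout_Indicator); the normalized value starts at 100 and
counts down. /dev/sdb is a placeholder for your journal SSD, and the
attribute name may differ on other vendors' drives:

    # print the normalized wear value; I'd replace the SSD below 10
    smartctl -A /dev/sdb | awk '/Media_Wearout_Indicator/ {print $4}'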
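
The preemptive journal swap itself would look something like the
following. Untested sketch: osd.12 and the paths are placeholders, and
the exact service commands depend on your distro/init system:

    ceph osd set noout               # don't mark stopped OSDs out
    service ceph stop osd.12         # repeat for every OSD on this SSD
    ceph-osd -i 12 --flush-journal   # drain the journal to the OSD disk
    # ...swap the SSD, partition it, and re-point the journal symlink
    # in /var/lib/ceph/osd/ceph-12/ at the new partition...
    ceph-osd -i 12 --mkjournal       # initialize the new journal
    service ceph start osd.12
    ceph osd unset noout

Because the journal was flushed cleanly, the OSDs come back with no
backfilling, which is the whole point of replacing the SSD before it
dies.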
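
On the down-out interval trick: I'm skeptical it helps with a real SSD
failure (see above), but if you want to experiment with it, the knob
is mon_osd_down_out_interval (300 seconds by default, IIRC). For
example, to give yourself an hour:

    # permanently, in ceph.conf on the monitors:
    [mon]
        mon osd down out interval = 3600

    # or injected at runtime:
    ceph tell mon.* injectargs '--mon-osd-down-out-interval 3600'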
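
And when I do rebuild an OSD by hand after losing its journal, it's
basically the following (again a sketch; the ids and devices are
placeholders, with the new journal on a partition of the replacement
SSD):

    # remove the dead OSD from the cluster
    ceph osd out 12
    ceph osd crush remove osd.12
    ceph auth del osd.12
    ceph osd rm 12

    # wipe the data disk and rebuild it with its journal on the SSD
    ceph-disk zap /dev/sdc
    ceph-disk prepare /dev/sdc /dev/sdb2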