Hello,

On Thu, 4 Sep 2014 14:49:39 -0700 Craig Lewis wrote:

> On Thu, Sep 4, 2014 at 9:21 AM, Dan Van Der Ster
> <daniel.vanderster at cern.ch> wrote:
>
> >
> > 1) How often are DC S3700's failing in your deployments?
> >
>
> None of mine have failed yet. I am planning to monitor the wear level
> indicator, and preemptively replace any SSDs that go below 10%. Manually
> flushing the journal, replacing the SSD, and building a new journal is
> much faster than backfilling all the dependent OSDs.
>
What Craig said.

Hell, none of the consumer Intels (3xx, 520s) I have has ever failed,
though they are aging faster of course.
Still got some ancient X-25s that haven't gone below 96% wearout.

I expect my DC S3700s to outlive 2 HDD generations. ^o^

Monitor and replace them accordingly and I doubt you'll ever lose one in
operation.

> >
> > 2) If you have SSD journals at a ratio of 1 to 4 or 5, how painful is
> > the backfilling which results from an SSD failure? Have you considered
> > tricks like increasing the down out interval so backfilling doesn't
> > happen in this case (leaving time for the SSD to be replaced)?
> >
>
> Replacing a failed SSD won't help your backfill. I haven't actually
> tested it, but I'm pretty sure that losing the journal effectively
> corrupts your OSDs. I don't know what steps are required to complete
> this operation, but it wouldn't surprise me if you need to re-format the
> OSD.
>
This.
All the threads I've read about this indicate that journal loss during
operation means OSD loss. Total OSD loss, no recovery.
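
For what it's worth, the "monitor the wear level indicator and replace below 10%" policy mentioned above could be scripted roughly as follows. This is only a sketch under assumptions: it assumes smartmontools is installed, that the drive reports its wear as the normalized VALUE column of an attribute named Media_Wearout_Indicator (as Intel SSDs typically do), and the function names and the device path in the usage note are made up for illustration.

```python
import subprocess

# Assumption: the SSD exposes wear as this SMART attribute name.
WEAR_ATTRIBUTE = "Media_Wearout_Indicator"
# Threshold taken from the thread: replace SSDs that go below 10%.
REPLACE_BELOW = 10


def parse_wear_level(smartctl_output):
    """Return the normalized VALUE of the wear attribute, or None if absent.

    Expects text in the attribute-table layout of `smartctl -A`:
    ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH ...
    """
    for line in smartctl_output.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == WEAR_ATTRIBUTE:
            return int(fields[3])
    return None


def needs_replacement(wear_level, threshold=REPLACE_BELOW):
    """True when a wear level was read and it is below the threshold."""
    return wear_level is not None and wear_level < threshold


def read_wear_level(device):
    """Run `smartctl -A` on the device and parse the wear level.

    check=False because smartctl uses a bitmask exit status that can be
    non-zero even when the attribute table was printed.
    """
    result = subprocess.run(
        ["smartctl", "-A", device],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_wear_level(result.stdout)
```

Something like needs_replacement(read_wear_level("/dev/sdX")) from a cron job, with the placeholder replaced by the journal SSD's device node, would then flag a drive for a planned swap while the journal can still be flushed cleanly.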