Re: Many concurrent drive failures - How do I activate pgs?

Hello,

First off, I don't have anything to add to your conclusions about the
current status. However, there are at least two folks here on the ML who
make a living from Ceph disaster recovery, so I hope you have been
contacted already.

Now, once your data is safe or you have a moment, I and others here would
probably be quite interested in some more details; see inline below.

On Wed, 20 Dec 2017 22:25:23 +0000 David Herselman wrote:

[snip]
> 
> We've happily been running a 6 node cluster with 4 x FileStore HDDs per node (journals on SSD partitions) for over a year and recently upgraded all nodes to Debian 9, Ceph Luminous 12.2.2 and kernel 4.13.8. We ordered 12 x Intel DC S4600 SSDs which arrived last week so we added two per node on Thursday evening and brought them up as BlueStore OSDs. We had proactively updated our existing pools to reference only devices classed as 'hdd', so that we could move select images over to ssd replicated and erasure coded pools.
> 
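Side note for the archives, since you mention moving pools onto the 'hdd'
device class: on Luminous that is roughly the below. Just a sketch, with
placeholder rule and pool names:

  # create a replicated rule that only maps to OSDs with device class 'hdd'
  ceph osd crush rule create-replicated replicated_hdd default host hdd
  # point an existing pool at that rule; Ceph then migrates the data
  ceph osd pool set rbd_hdd crush_rule replicated_hdd
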
Could you tell us more about that cluster: the hardware, how the SSDs are
connected, and the firmware version of the controller, if applicable?

Kernel 4.13.8 suggests that this is a hand-rolled, upstream kernel.
While not necessarily related, I'll note that as far as Debian kernels
(which are very lightly patched, if at all) are concerned, nothing beyond
4.9 has been working to my satisfaction.
4.11 still worked, but 4.12 crash-reboot-looped on all my Supermicro X10
machines (quite a varied selection).
The current 4.13.13 backport boots on some of those machines, but still
throws errors with the EDAC devices, which work fine with 4.9.

4.14 is known to happily destroy data if used with bcache, and even if you
don't use bcache, that should give you pause.

> We were pretty diligent and downloaded Intel's Firmware Update Tool and validated that each new drive had the latest available firmware before installing them in the nodes. We did numerous benchmarks on Friday and eventually moved some images over to the new storage pools. Everything was working perfectly and extensive tests on Sunday showed excellent performance. Sunday night one of the new SSDs died and Ceph replicated and redistributed data accordingly, then another failed in the early hours of Monday morning and Ceph did what it needed to.
> 
> We had the two failed drives replaced by 11am and Ceph was up to 2/4918587 objects degraded (0.000%) when a third drive failed. At this point we updated the crush maps for the rbd_ssd and ec_ssd pools and set the device class to 'hdd', to essentially evacuate everything off the SSDs. Other SSDs then failed at 3:22pm, 4:19pm, 5:49pm and 5:50pm. We've ultimately lost half the Intel S4600 drives, which are all completely inaccessible. Our status at 11:42pm Monday night was: 1/1398478 objects unfound (0.000%) and 339/4633062 objects degraded (0.007%).
> 
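Since your subject line asks how to activate PGs and you have an unfound
object: the usual sequence is something like the below. A sketch; 2.18f is
a made-up PG id, and mark_unfound_lost is a last resort, only once the
failed OSDs are definitely not coming back:

  ceph health detail       # lists the PGs with unfound/degraded objects
  ceph pg 2.18f query      # shows which OSDs were probed for the missing data
  # only when all hope of recovering the failed SSDs is gone:
  ceph pg 2.18f mark_unfound_lost revert
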
The relevant logs showing when and how those SSDs failed would be interesting.
Was the distribution of the failed SSDs random among the cluster?
Are you running smartd and did it have something to say?
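If not, a single minimal line in /etc/smartd.conf would at least have
flagged the first casualty, and the kernel log should show the failure
timeline; a sketch (the grep pattern is a guess at your device names):

  # /etc/smartd.conf: scan all disks, monitor everything, mail root on trouble
  DEVICESCAN -a -m root

  # the kernel's view of the dying disks:
  journalctl -k | grep -iE 'ata[0-9]+|sd[a-z]'
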

"Completely inaccessible" sounds a lot like the infamous self-bricking of
Intel SSDs when they discover something isn't right, or when they don't
like the color scheme inside the server (^.^).

I'm using quite a lot of Intel SSDs and have had only one "fatal" incident:
a DC S3700 detected that its power-loss capacitor had failed, but of course
kept working fine. Until a reboot was needed, at which point it promptly
bricked itself: data inaccessible, with SMART barely reporting that
something was there.

So one wonders what caused your SSDs to get their knickers in such a twist.
Are the survivors showing any unusual signs in their SMART output?
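Something like this on each survivor would be a good start, with sdX being
whichever device nodes the remaining S4600s got:

  smartctl -a /dev/sdX
  # on Intel DC SATA drives, watch attribute 5 (Reallocated_Sector_Ct),
  # 233 (Media_Wearout_Indicator) and the device error log at the end
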

Of course, what your vendor/Intel will have to say will also be of
interest. ^o^

Regards,

Christian
-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Rakuten Communications
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


