Hello,

first off, I don't have anything to add to your conclusions about the current status, but there are at least two folks here on the ML who make a living from Ceph disaster recovery, so I hope you have been contacted already.

Now, once your data is safe or you have a moment, I and others here would probably be quite interested in some more details; see inline below.

On Wed, 20 Dec 2017 22:25:23 +0000 David Herselman wrote:

[snip]
>
> We've happily been running a 6 node cluster with 4 x FileStore HDDs per node (journals on SSD partitions) for over a year and recently upgraded all nodes to Debian 9, Ceph Luminous 12.2.2 and kernel 4.13.8. We ordered 12 x Intel DC S4600 SSDs which arrived last week so we added two per node on Thursday evening and brought them up as BlueStore OSDs. We had proactively updated our existing pools to reference only devices classed as 'hdd', so that we could move select images over to ssd replicated and erasure coded pools.
>

Could you tell us more about that cluster: the hardware, how the SSDs are connected, and the firmware version of the controller, if applicable?

Kernel 4.13.8 suggests that this is a hand-rolled, upstream kernel. While not necessarily related, I'll note that as far as Debian kernels (which are very lightly patched, if at all) are concerned, nothing beyond 4.9 has been working to my satisfaction. 4.11 still worked, but 4.12 crash-reboot-looped on all my Supermicro X10 machines (quite a varied selection). The current 4.13.13 backport boots on some of those machines, but still throws errors with the EDAC devices, which work fine with 4.9. 4.14 is known to happily destroy data if used with bcache, and even if one doesn't use that, it should give you pause.

> We were pretty diligent and downloaded Intel's Firmware Update Tool and validated that each new drive had the latest available firmware before installing them in the nodes. We did numerous benchmarks on Friday and eventually moved some images over to the new storage pools. Everything was working perfectly and extensive tests on Sunday showed excellent performance. Sunday night one of the new SSDs died and Ceph replicated and redistributed data accordingly, then another failed in the early hours of Monday morning and Ceph did what it needed to.
>
> We had the two failed drives replaced by 11am and Ceph was up to 2/4918587 objects degraded (0.000%) when a third drive failed. At this point we updated the crush maps for the rbd_ssd and ec_ssd pools and set the device class to 'hdd', to essentially evacuate everything off the SSDs. Other SSDs then failed at 3:22pm, 4:19pm, 5:49pm and 5:50pm. We've ultimately lost half the Intel S4600 drives, which are all completely inaccessible. Our status at 11:42pm Monday night was: 1/1398478 objects unfound (0.000%) and 339/4633062 objects degraded (0.007%).
>

The relevant logs showing when and how those SSDs failed would be interesting. Was the distribution of the failed SSDs random across the cluster? Are you running smartd, and did it have anything to say?

Completely inaccessible sounds a lot like the infamous "self-bricking" of Intel SSDs when they discover that something isn't right, or when they simply don't like the color scheme inside the server (^.^).

I'm using quite a lot of Intel SSDs and have had only one "fatal" incident: a DC S3700 detected that its power-loss capacitor had failed, but of course kept working fine. Until a reboot was needed, when it promptly bricked itself, data inaccessible, SMART barely reporting that something was there.
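
Coming back to the evacuation step described above, and mostly for the benefit of the archives: below is a minimal sketch of how one can point pools back at the 'hdd' device class on Luminous and give the surviving drives a quick SMART check. The rule and profile names (rbd_hdd, ec_hdd_profile) and the k/m values are just placeholders, only rbd_ssd and ec_ssd are taken from David's mail; adjust to your actual setup.

  # What device classes and OSDs does the cluster currently know about?
  ceph osd crush class ls
  ceph osd tree

  # Replicated pool: create a rule restricted to the 'hdd' class and
  # point the pool at it; Ceph then backfills the data off the SSDs.
  ceph osd crush rule create-replicated rbd_hdd default host hdd
  ceph osd pool set rbd_ssd crush_rule rbd_hdd

  # EC pool: erasure rules are derived from a profile, so create one
  # with crush-device-class=hdd and switch the pool over to it.
  ceph osd erasure-code-profile set ec_hdd_profile k=4 m=2 crush-device-class=hdd
  ceph osd crush rule create-erasure ec_hdd ec_hdd_profile
  ceph osd pool set ec_ssd crush_rule ec_hdd

  # Watch the data move and check the survivors while it does.
  ceph -s
  smartctl -a /dev/sdX

None of that brings back drives that have bricked themselves, of course; it just keeps the remaining copies moving onto healthy media.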
So one wonders what caused your SSDs to get their knickers in such a twist. Are the survivors showing any unusual signs in their SMART output?

Of course, what your vendor/Intel will have to say will also be of interest. ^o^

Regards,

Christian
--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Rakuten Communications