Many concurrent drive failures - How do I activate pgs?


 



Hi,

 

I would be extremely thankful for any assistance in attempting to resolve our situation and would be happy to pay consultation/support fees:

[admin@kvm5b ~]# ceph health detail

HEALTH_WARN noscrub,nodeep-scrub flag(s) set; 168/4633062 objects misplaced (0.004%); 1/1398478 objects unfound (0.000%); Reduced data availability: 2 pgs inactive, 2 pgs down; Degraded data redundancy: 339/4633062 objects degraded (0.007%), 3 pgs unclean, 1 pg degraded, 1 pg undersized

OSDMAP_FLAGS noscrub,nodeep-scrub flag(s) set

OBJECT_MISPLACED 168/4633062 objects misplaced (0.004%)

OBJECT_UNFOUND 1/1398478 objects unfound (0.000%)

    pg 4.43 has 1 unfound objects

PG_AVAILABILITY Reduced data availability: 2 pgs inactive, 2 pgs down

    pg 7.4 is down+remapped, acting [2147483647,2147483647,31,2147483647,32,2147483647]

    pg 7.f is down+remapped, acting [2147483647,2147483647,32,2147483647,30,2147483647]

PG_DEGRADED Degraded data redundancy: 339/4633062 objects degraded (0.007%), 3 pgs unclean, 1 pg degraded, 1 pg undersized

    pg 4.43 is stuck undersized for 4933.586411, current state active+recovery_wait+undersized+degraded+remapped, last acting [27]

    pg 7.4 is stuck unclean for 30429.018746, current state down+remapped, last acting [2147483647,2147483647,31,2147483647,32,2147483647]

    pg 7.f is stuck unclean for 30429.010752, current state down+remapped, last acting [2147483647,2147483647,32,2147483647,30,2147483647]

 

 

We’ve happily been running a 6-node cluster with 4 x FileStore HDDs per node (journals on SSD partitions) for over a year, and recently upgraded all nodes to Debian 9, Ceph Luminous 12.2.2 and kernel 4.13.8. We ordered 12 x Intel DC S4600 SSDs, which arrived last week, so on Thursday evening we added two per node and brought them up as BlueStore OSDs. We had proactively updated our existing pools to reference only devices with the ‘hdd’ device class, so that we could move select images over to SSD-backed replicated and erasure coded pools.
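
Restricting the existing pools to the ‘hdd’ device class was done along these lines; the rule and pool names here are illustrative rather than necessarily the exact ones we used:

        # replicated CRUSH rule limited to devices classed as 'hdd'
        ceph osd crush rule create-replicated replicated_hdd default host hdd
        # repoint an existing pool at the hdd-only rule
        ceph osd pool set rbd crush_rule replicated_hdd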

 

We were pretty diligent and downloaded Intel’s Firmware Update Tool and validated that each new drive had the latest available firmware before installing them in the nodes. We did numerous benchmarks on Friday and eventually moved some images over to the new storage pools. Everything was working perfectly and extensive tests on Sunday showed excellent performance. Sunday night one of the new SSDs died and Ceph replicated and redistributed data accordingly, then another failed in the early hours of Monday morning and Ceph did what it needed to.

 

We had the two failed drives replaced by 11am and Ceph had recovered down to 2/4918587 objects degraded (0.000%) when a third drive failed. At this point we updated the CRUSH rules for the rbd_ssd and ec_ssd pools to use the ‘hdd’ device class, essentially to evacuate everything off the SSDs. Further SSDs then failed at 3:22pm, 4:19pm, 5:49pm and 5:50pm. We’ve ultimately lost half the Intel S4600 drives, and all of the failed drives are completely inaccessible. Our status at 11:42pm Monday night was: 1/1398478 objects unfound (0.000%) and 339/4633062 objects degraded (0.007%).
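
To be explicit about what ‘evacuate’ meant, the change was roughly the following (profile and rule names illustrative, reusing the hdd-only replicated rule from the sketch above):

        # replicated ssd pool: move it onto the hdd-only rule
        ceph osd pool set rbd_ssd crush_rule replicated_hdd
        # EC pool: create an hdd-backed profile and rule, then repoint the pool
        ceph osd erasure-code-profile set ec_hdd k=4 m=2 crush-failure-domain=host crush-device-class=hdd
        ceph osd crush rule create-erasure ec_hdd ec_hdd
        ceph osd pool set ec_ssd crush_rule ec_hdd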

 

 

We’re essentially looking for assistance with:

 - Copying images from the damaged pools (attempting to access these currently results in requests which never time out); see the sketch after this list

 - Advice on how to later, if Intel can unlock the failed SSDs, import the missing object shards so that we regain full access to the images
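
On the first point, once the PGs are active again the copy itself should be trivial; per image we expect to run something like the following (pool and image names illustrative), but at the moment these simply hang on the down PGs:

        # export an image out of the damaged pools, or copy it into a healthy pool
        rbd export rbd_ssd/image-b /exported_data/image-b.raw
        rbd cp rbd_ssd/image-b rbd/image-b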

 

 

There are two pools that were configured to use devices with an ‘ssd’ device class, namely a replicated pool called ‘rbd_ssd’ (size 3) and an erasure coded pool ‘ec_ssd’ (k=4 and m=2). One 80 GB image stored its data directly in the rbd_ssd pool, and 5 further images totalling 1.4 TB also lived in the rbd_ssd pool but with their data objects in the ec_ssd pool.
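
For completeness, the pools and images were set up roughly as follows (PG counts, sizes and image names here are illustrative):

        ceph osd erasure-code-profile set ec_ssd k=4 m=2 crush-failure-domain=host crush-device-class=ssd
        ceph osd pool create ec_ssd 64 64 erasure ec_ssd
        ceph osd pool set ec_ssd allow_ec_overwrites true
        # (rbd_ssd itself is a plain replicated pool, size 3, on an ssd-only replicated rule)
        # the 80 GB image, with its data directly in the replicated ssd pool
        rbd create rbd_ssd/image-a --size 80G
        # the EC-backed images: headers in rbd_ssd, data objects in ec_ssd
        rbd create rbd_ssd/image-b --size 300G --data-pool ec_ssd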

 

 

What we’ve done so far:

 - Run Intel diagnostic tools on the failed SSDs, which report ‘logically locked, contact Intel support’. This is giving us some hope that Intel, should the case finally land with someone who speaks English, can somehow unlock the drives.

 - Stopped the rest of the Intel SSD OSDs, peeked at the content of the BlueStore (ceph-objectstore-tool --op fuse --data-path /var/lib/ceph/osd/ceph-34 --mountpoint /mnt), unmounted it again and then exported placement groups (the full per-OSD sequence is sketched after this list):

        eg: ceph-objectstore-tool --op export --pgid 7.fs1 --data-path /var/lib/ceph/osd/ceph-34 --file /exported_data/osd34_7.fs1.export

 - Tried ‘ceph osd force-create-pg X’, ‘ceph pg deep-scrub X’, ‘ceph osd lost $ID --yes-i-really-mean-it’
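
Per stopped OSD the sequence looked roughly like this (OSD 34 as the example; the list-pgs step is simply how one would pick which PGs to export):

        systemctl stop ceph-osd@34
        # browse the BlueStore contents via the fuse mount, then unmount it again
        ceph-objectstore-tool --op fuse --data-path /var/lib/ceph/osd/ceph-34 --mountpoint /mnt
        fusermount -u /mnt
        # list the PGs held by this OSD, then export those belonging to the affected pools
        ceph-objectstore-tool --op list-pgs --data-path /var/lib/ceph/osd/ceph-34
        ceph-objectstore-tool --op export --pgid 7.fs1 --data-path /var/lib/ceph/osd/ceph-34 --file /exported_data/osd34_7.fs1.export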

 

We originally deleted the failed OSDs, but ‘ceph pg 4.43 mark_unfound_lost delete’ yielded ‘Error EINVAL: pg has 1 unfound objects but we haven't probed all sources, not marking lost’, so we recreated all the previous OSDs in partitions of a different SSD and, with the OSD services stopped, imported the previous exports.
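
The import side, per recreated OSD, was essentially the reverse of the export (again OSD 34 and one PG as the example):

        systemctl stop ceph-osd@34
        ceph-objectstore-tool --op import --data-path /var/lib/ceph/osd/ceph-34 --file /exported_data/osd34_7.fs1.export
        systemctl start ceph-osd@34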

 

 

Status now:

[admin@kvm5b ~]# ceph health detail

HEALTH_WARN noout flag(s) set; Reduced data availability: 3 pgs inactive, 3 pgs down; Degraded data redundancy: 3 pgs unclean

OSDMAP_FLAGS noout flag(s) set

PG_AVAILABILITY Reduced data availability: 3 pgs inactive, 3 pgs down

    pg 4.43 is down, acting [4,15,18]

    pg 7.4 is down, acting [8,5,21,18,15,0]

    pg 7.f is down, acting [23,0,16,5,11,14]

PG_DEGRADED Degraded data redundancy: 3 pgs unclean

    pg 4.43 is stuck unclean since forever, current state down, last acting [4,15,18]

    pg 7.4 is stuck unclean since forever, current state down, last acting [8,5,21,18,15,0]

    pg 7.f is stuck unclean since forever, current state down, last acting [23,0,16,5,11,14]

 

Original 'ceph pg X query' status (before we mucked around by exporting and deleting OSDs): https://pastebin.com/fBQhq6UQ

Current ‘ceph pg X query’ status (after recreating temporary OSDs with the original IDs and importing the exports): https://pastebin.com/qcN5uYkN

 

 

What we assume needs to be done:

 - Tell Ceph that the OSDs are lost (the query status in the pastebin above reports ‘starting or marking this osd lost may let us proceed’). We have, however, already stopped the temporary OSDs, marked them out and run ‘ceph osd lost $ID --yes-i-really-mean-it’.

 - Somehow get Ceph to forget about the sharded objects it doesn’t have sufficient pieces of (see the sketch after this list).

 - Copy the images to another pool so that we can get what data is still there off them and rebuild those systems.

 - Hopefully get Intel to unlock the drives, export as much of the content as possible and import the various exports so that we can ultimately copy off complete images.
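
We assume the first two points boil down to something like the following once all sources really have been probed or marked lost, but we are not certain it is safe or sufficient, hence this mail (OSD and PG IDs as above):

        # tell Ceph the dead OSDs are gone for good
        ceph osd lost 34 --yes-i-really-mean-it
        # give up on the unfound object in the replicated pool
        ceph pg 4.43 mark_unfound_lost delete
        # last resort for the down EC PGs: recreate them empty, accepting data loss
        ceph osd force-create-pg 7.4
        ceph osd force-create-pg 7.f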

 

Really, really hoping to have a Merry Christmas… ;)

 

PS: We got the 80 GB image out. It had a single 4 MB object hole, so we used ddrescue to read the source image forwards, rebooted the node when it stalled on the missing data, and then repeated the copy in the reverse direction…
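
In case it helps anyone else, that recovery looked roughly like this (device and file names illustrative; the image was mapped with krbd and the mapfile lets ddrescue resume after the reboot):

        rbd map rbd_ssd/image-a
        ddrescue /dev/rbd0 /exported_data/image-a.raw /exported_data/image-a.map
        # stalls on the missing 4 MB object; after rebooting and re-mapping, finish in reverse
        ddrescue -R /dev/rbd0 /exported_data/image-a.raw /exported_data/image-a.map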

 

 

Regards

David Herselman

