We have upgraded from Hammer to Jewel, and then to Luminous 12.2.2 as of today. During the Hammer-to-Jewel upgrade we lost two host servers and let the cluster rebalance/recover; it ran out of space and stalled. We then added three new host servers and again let the cluster rebalance/recover. Somewhere in that process we ended up with 4 PGs that cannot be repaired using "ceph pg repair xx.xx". I tried "ceph pg 11.720 query", and from what I can tell the missing information matches, but the PG is being blocked from being marked clean. I keep seeing references to using ceph-objectstore-tool as an export/restore method, but I cannot find a step-by-step process that fits our current predicament. It may also be acceptable for us to simply lose the data if it can't be extracted, so that we can at least return the cluster to a healthy state. Any thoughts? My rough understanding of the export/import procedure is sketched just below; after that I have pasted the ceph -s, OSD tree, and ceph health detail output.
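From piecing together list archives, I *think* the export/import procedure would look roughly like the following, using pg 11.720 (acting [21,10]) as an example. The OSD ids and paths here are only placeholders, and the flags are from my reading of the ceph-objectstore-tool help output, so please correct anything that is wrong:

    # keep the cluster from shuffling data while OSDs are down
    ceph osd set noout

    # on the node that still has the most complete copy of the PG,
    # stop that OSD so its store can be opened offline
    systemctl stop ceph-osd@21

    # export the PG from the stopped OSD (FileStore here; I assume
    # --journal-path is only needed if the journal is not in the
    # default location)
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-21 \
        --pgid 11.720 --op export --file /root/pg.11.720.export

    # to restore onto another OSD: stop it, remove any partial copy of
    # the PG, then import the export file (exact flags may vary by version)
    systemctl stop ceph-osd@10
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-10 \
        --pgid 11.720 --op remove
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-10 \
        --pgid 11.720 --op import --file /root/pg.11.720.export

    # restart the OSDs and let the PG peer again
    systemctl start ceph-osd@10
    systemctl start ceph-osd@21
    ceph osd unset noout
    ceph pg 11.720 query

And if we do decide to write the data off, am I right that something like this on the acting primary would let the PG go active again (accepting that whatever was only in that PG is gone)?

    # last resort: declare the PG complete with whatever this OSD has
    systemctl stop ceph-osd@21
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-21 \
        --pgid 11.720 --op mark-complete
    systemctl start ceph-osd@21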
Ceph -s output:

  cluster:
    health: HEALTH_ERR
            Reduced data availability: 4 pgs inactive, 4 pgs incomplete
            Degraded data redundancy: 4 pgs unclean
            4 stuck requests are blocked > 4096 sec
            too many PGs per OSD (2549 > max 200)

  services:
    mon: 3 daemons, quorum ukpixmon1,ukpixmon2,ukpixmon3
    mgr: ukpixmon1(active), standbys: ukpixmon3, ukpixmon2
    osd: 43 osds: 43 up, 43 in
    rgw: 3 daemons active

  data:
    pools:   12 pools, 37904 pgs
    objects: 8148k objects, 10486 GB
    usage:   21530 GB used, 135 TB / 156 TB avail
    pgs:     0.011% pgs not active
             37900 active+clean
             4     incomplete

OSD TREE output:

ID CLASS WEIGHT    TYPE NAME         STATUS REWEIGHT PRI-AFF
-1       156.10268 root default
-2        32.57996     host osdhost1
 0         3.62000         osd.0         up  1.00000 1.00000
 1         3.62000         osd.1         up  1.00000 1.00000
 2         3.62000         osd.2         up  1.00000 1.00000
 3         3.62000         osd.3         up  1.00000 1.00000
 4         3.62000         osd.4         up  1.00000 1.00000
 5         3.62000         osd.5         up  1.00000 1.00000
 6         3.62000         osd.6         up  1.00000 1.00000
 7         3.62000         osd.7         up  1.00000 1.00000
 8         3.62000         osd.8         up  1.00000 1.00000
-3        25.33997     host osdhost2
 9         3.62000         osd.9         up  1.00000 1.00000
10         3.62000         osd.10        up  1.00000 1.00000
11         3.62000         osd.11        up  1.00000 1.00000
12         3.62000         osd.12        up  1.00000 1.00000
15         3.62000         osd.15        up  1.00000 1.00000
16         3.62000         osd.16        up  1.00000 1.00000
17         3.62000         osd.17        up  1.00000 1.00000
-8        32.72758     host osdhost6
14         3.63640         osd.14        up  1.00000 1.00000
21         3.63640         osd.21        up  1.00000 1.00000
23         3.63640         osd.23        up  1.00000 1.00000
26         3.63640         osd.26        up  1.00000 1.00000
32         3.63640         osd.32        up  1.00000 1.00000
33         3.63640         osd.33        up  1.00000 1.00000
34         3.63640         osd.34        up  1.00000 1.00000
35         3.63640         osd.35        up  1.00000 1.00000
36         3.63640         osd.36        up  1.00000 1.00000
-9        32.72758     host osdhost7
19         3.63640         osd.19        up  1.00000 1.00000
37         3.63640         osd.37        up  1.00000 1.00000
38         3.63640         osd.38        up  1.00000 1.00000
39         3.63640         osd.39        up  1.00000 1.00000
40         3.63640         osd.40        up  1.00000 1.00000
41         3.63640         osd.41        up  1.00000 1.00000
42         3.63640         osd.42        up  1.00000 1.00000
43         3.63640         osd.43        up  1.00000 1.00000
44         3.63640         osd.44        up  1.00000 1.00000
-7        32.72758     host osdhost8
20         3.63640         osd.20        up  1.00000 1.00000
45         3.63640         osd.45        up  1.00000 1.00000
46         3.63640         osd.46        up  1.00000 1.00000
47         3.63640         osd.47        up  1.00000 1.00000
48         3.63640         osd.48        up  1.00000 1.00000
49         3.63640         osd.49        up  1.00000 1.00000
50         3.63640         osd.50        up  1.00000 1.00000
51         3.63640         osd.51        up  1.00000 1.00000
52         3.63640         osd.52        up  1.00000 1.00000

Ceph health detail output:

HEALTH_ERR Reduced data availability: 4 pgs inactive, 4 pgs incomplete; Degraded data redundancy: 4 pgs unclean; 4 stuck requests are blocked > 4096 sec; too many PGs per OSD (2549 > max 200)
PG_AVAILABILITY Reduced data availability: 4 pgs inactive, 4 pgs incomplete
    pg 11.720 is incomplete, acting [21,10]
    pg 11.9ab is incomplete, acting [14,2]
    pg 11.9fb is incomplete, acting [32,43]
    pg 11.c13 is incomplete, acting [42,26]
PG_DEGRADED Degraded data redundancy: 4 pgs unclean
    pg 11.720 is stuck unclean since forever, current state incomplete, last acting [21,10]
    pg 11.9ab is stuck unclean since forever, current state incomplete, last acting [14,2]
    pg 11.9fb is stuck unclean since forever, current state incomplete, last acting [32,43]
    pg 11.c13 is stuck unclean since forever, current state incomplete, last acting [42,26]
REQUEST_STUCK 4 stuck requests are blocked > 4096 sec
    4 ops are blocked > 33554.4 sec
    osds 21,26,32,42 have stuck requests > 33554.4 sec
TOO_MANY_PGS too many PGs per OSD (2549 > max 200)

-Brent