Recovering from PG in down+incomplete state

Mallikarjun Biradar <mallikarjuna.biradar@xxxxxxxxx> · Fri, 19 Dec 2014 12:11:49 +0530

Hi all,
I had 12 OSDs in my cluster across 2 OSD nodes. One of the OSDs was in the down state, so I removed that OSD from the cluster by removing the crush rule for that OSD.

The cluster, now with 11 OSDs, started rebalancing. After some time, the cluster status was:

ems@rack6-client-5:~$ sudo ceph -s
    cluster eb5452f4-5ce9-4b97-9bfd-2a34716855f1
     health HEALTH_WARN 1 pgs down; 252 pgs incomplete; 10 pgs peering; 73 pgs stale; 262 pgs stuck inactive; 73 pgs stuck stale; 262 pgs stuck unclean; clock skew detected on mon.rack6-client-5, mon.rack6-client-6
     monmap e1: 3 mons at {rack6-client-4=10.242.43.105:6789/0,rack6-client-5=10.242.43.106:6789/0,rack6-client-6=10.242.43.107:6789/0}, election epoch 12, quorum 0,1,2 rack6-client-4,rack6-client-5,rack6-client-6
     osdmap e2648: 11 osds: 11 up, 11 in
      pgmap v554251: 846 pgs, 3 pools, 4383 GB data, 1095 kobjects
            11668 GB used, 26048 GB / 37717 GB avail
                  63 stale+active+clean
                   1 down+incomplete
                 521 active+clean
                 251 incomplete
                  10 stale+peering
ems@rack6-client-5:~$
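
As a sanity check on the status above, the per-state PG counts can be tallied to see how they relate to the HEALTH_WARN summary. This is a minimal Python sketch: the five state lines are copied from the output, `tally_pg_states` is a hypothetical helper (not part of Ceph), and reading "stuck inactive" as "PGs without the active flag" is an assumption that happens to match the numbers here.

```python
import re
from collections import Counter

# PG state lines copied from the `ceph -s` output above.
STATUS_LINES = """\
63 stale+active+clean
1 down+incomplete
521 active+clean
251 incomplete
10 stale+peering
"""


def tally_pg_states(text):
    """Return (total PG count, Counter of PG count per individual state flag)."""
    total = 0
    per_flag = Counter()
    for line in text.splitlines():
        # Only accept lines of the form "<count> <state+state+...>".
        match = re.match(r"\s*(\d+)\s+(\S+)\s*$", line)
        if not match:
            continue
        count = int(match.group(1))
        total += count
        # A combined state such as "down+incomplete" counts toward each flag.
        for flag in match.group(2).split("+"):
            per_flag[flag] += count
    return total, per_flag


total, per_flag = tally_pg_states(STATUS_LINES)
not_active = total - per_flag["active"]

print("total PGs:", total)
print("incomplete:", per_flag["incomplete"])
print("stale:", per_flag["stale"])
print("without active flag:", not_active)
```

With these inputs the tally gives 846 PGs in total, 252 incomplete, 73 stale, and 262 without the active flag, which lines up with the figures in the health line.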

To fix this, I can't run "ceph osd lost <osd.id>" to clear the PG that is in the down state, because the OSD has already been removed from the cluster.

ems@rack6-client-4:~$ sudo ceph pg dump all | grep down
dumped all in format plain
1.38    1548    0       0       0       0       6492782592      3001    3001    down+incomplete 2014-12-18 15:58:29.681708      1118'508438     2648:1073892    [6,3,1]     6       [6,3,1] 6       76'437184       2014-12-16 12:38:35.322835      76'437184       2014-12-16 12:38:35.322835
ems@rack6-client-4:~$

ems@rack6-client-4:~$ sudo ceph pg 1.38 query
.............
"recovery_state": [
        { "name": "Started\/Primary\/Peering",
          "enter_time": "2014-12-18 15:58:29.681666",
          "past_intervals": [
                { "first": 1109,
                  "last": 1118,
                  "maybe_went_rw": 1,
...................
...................
"down_osds_we_would_probe": [
                7],
          "peering_blocked_by": []},
...................
...................
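
To pull the relevant field out of the query output without reading it by eye, something like the following Python sketch could be used. The `SAMPLE` dictionary is a trimmed stand-in built only from the fields visible above (the elided parts are left out, not reconstructed), and `down_osds_to_probe` is a hypothetical helper name. It assumes the output of `ceph pg 1.38 query` has been parsed as JSON into a dictionary.

```python
# Trimmed stand-in for the parsed JSON of `ceph pg 1.38 query`.
# Only fields shown in the post are included; everything else is omitted.
SAMPLE = {
    "recovery_state": [
        {
            "name": "Started/Primary/Peering",
            "enter_time": "2014-12-18 15:58:29.681666",
            "past_intervals": [
                {"first": 1109, "last": 1118, "maybe_went_rw": 1},
            ],
            "down_osds_we_would_probe": [7],
            "peering_blocked_by": [],
        }
    ]
}


def down_osds_to_probe(pg_query):
    """Collect every OSD id listed under down_osds_we_would_probe."""
    wanted = set()
    for state in pg_query.get("recovery_state", []):
        # Not every recovery_state entry carries this key, so default to [].
        wanted.update(state.get("down_osds_we_would_probe", []))
    return sorted(wanted)


print(down_osds_to_probe(SAMPLE))
```

For the excerpt above this reports OSD 7 as the one the PG still wants to probe.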

ems@rack6-client-4:~$ sudo ceph osd tree
# id    weight  type name       up/down reweight
-1      36.85   root default
-2      20.1            host rack2-storage-1
0       3.35                    osd.0   up      1
1       3.35                    osd.1   up      1
2       3.35                    osd.2   up      1
3       3.35                    osd.3   up      1
4       3.35                    osd.4   up      1
5       3.35                    osd.5   up      1
-3      16.75           host rack2-storage-5
6       3.35                    osd.6   up      1
8       3.35                    osd.8   up      1
9       3.35                    osd.9   up      1
10      3.35                    osd.10  up      1
11      3.35                    osd.11  up      1
ems@rack6-client-4:~$ sudo ceph osd lost 7 --yes-i-really-mean-it
osd.7 is not down or doesn't exist
ems@rack6-client-4:~$
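
The mismatch behind that error can be shown by comparing the OSD ids present in the tree with the id the PG wants to probe. This is a rough Python sketch: the tree text is copied from the output above, `osd_ids_in_tree` is a hypothetical helper, and the list `[7]` is taken from `down_osds_we_would_probe` in the PG query.

```python
import re

# `ceph osd tree` output copied from above.
OSD_TREE = """\
# id    weight  type name       up/down reweight
-1      36.85   root default
-2      20.1            host rack2-storage-1
0       3.35                    osd.0   up      1
1       3.35                    osd.1   up      1
2       3.35                    osd.2   up      1
3       3.35                    osd.3   up      1
4       3.35                    osd.4   up      1
5       3.35                    osd.5   up      1
-3      16.75           host rack2-storage-5
6       3.35                    osd.6   up      1
8       3.35                    osd.8   up      1
9       3.35                    osd.9   up      1
10      3.35                    osd.10  up      1
11      3.35                    osd.11  up      1
"""


def osd_ids_in_tree(text):
    """Return the sorted OSD ids that appear as osd.<id> in the tree text."""
    return sorted(int(osd_id) for osd_id in re.findall(r"\bosd\.(\d+)\b", text))


present = osd_ids_in_tree(OSD_TREE)
wanted = [7]  # from down_osds_we_would_probe in the PG query
missing = [osd_id for osd_id in wanted if osd_id not in present]

print("OSDs in tree:", present)
print("wanted but not in tree:", missing)
```

Here the tree holds 11 OSDs and osd.7 is the only wanted id that is absent, which is consistent with "osd.7 is not down or doesn't exist".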

Can somebody suggest any other recovery step to come out of this?

-Thanks & Regards,
Mallikarjun Biradar

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com