Why did you remove osd.7?
Something else appears to be wrong. With all 11 OSDs up, you shouldn't have any PGs stuck in stale or peering.
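If you want to pin down exactly which PGs are stuck and on which OSDs, something like this should show it (standard ceph CLI; dump_stuck takes inactive, unclean, or stale):

    ceph health detail           # lists each stuck PG with its acting set
    ceph pg dump_stuck stale     # just the stale ones
    ceph pg dump_stuck inactive  # the peering/incomplete ones show up here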
How badly are the clocks skewed across your nodes? If it's bad enough, it can cause communication problems between them; Ceph will complain if the clocks differ by more than 50 ms. It's best to run ntpd on all nodes.
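A quick way to check both what NTP thinks and what Ceph has measured (ntpq ships with ntpd; the grep is just a convenience):

    ntpq -p                         # offset column shows drift from your NTP peers, in ms
    ceph health detail | grep skew  # Ceph's own per-mon skew measurements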
I'm thinking that cleaning up the clock skew will fix most of your issues.
If that does fix the issue, you can try bringing osd.7 back in. Don't reformat it; just deploy it as you normally would. The CRUSH map will go back to the way it was before you removed osd.7, and Ceph will start to backfill+remap data onto the "new" OSD, then see that most of it is already there. It should recover relatively quickly... I think.
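For reference, here's a rough sketch of what re-adding it by hand might look like, assuming osd.7's data partition is intact and mountable. The exact steps depend on how you removed it and on your deployment tooling, so treat this as a guess rather than a recipe:

    ceph osd create                 # should hand back the lowest free id, i.e. 7
    ceph auth add osd.7 osd 'allow *' mon 'allow rwx' \
        -i /var/lib/ceph/osd/ceph-7/keyring        # only needed if you deleted its auth key
    ceph osd crush add osd.7 3.35 host=rack2-storage-5   # old weight and location
    start ceph-osd id=7             # Ubuntu/upstart syntax; adjust for your init system
    ceph osd in osd.7

Once it's in and up, you should see the backfill kick off in 'ceph -w'.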
On Fri, Dec 19, 2014 at 10:28 AM, Robert LeBlanc <robert@xxxxxxxxxxxxx> wrote:
I'm still pretty new at troubleshooting Ceph and since no one has responded yet I'll give it a stab.

What is the size of your pool?

    ceph osd pool get <pool name> size

Based on the number of incomplete PGs, it seems like it was '1'. I understand that if you are able to bring osd.7 back in, it would clear up. I'm just not seeing a secondary OSD for that PG.

Disclaimer: I could be totally wrong.

Robert LeBlanc

On Thu, Dec 18, 2014 at 11:41 PM, Mallikarjun Biradar <mallikarjuna.biradar@xxxxxxxxx> wrote:

Hi all,

I had 12 OSDs in my cluster across 2 OSD nodes. One of the OSDs was in the down state, and I removed it from the cluster by removing the crush rule for that OSD.

Now, with 11 OSDs, the cluster started rebalancing. After some time, the cluster status was:

ems@rack6-client-5:~$ sudo ceph -s
    cluster eb5452f4-5ce9-4b97-9bfd-2a34716855f1
     health HEALTH_WARN 1 pgs down; 252 pgs incomplete; 10 pgs peering; 73 pgs stale; 262 pgs stuck inactive; 73 pgs stuck stale; 262 pgs stuck unclean; clock skew detected on mon.rack6-client-5, mon.rack6-client-6
     monmap e1: 3 mons at {rack6-client-4=10.242.43.105:6789/0,rack6-client-5=10.242.43.106:6789/0,rack6-client-6=10.242.43.107:6789/0}, election epoch 12, quorum 0,1,2 rack6-client-4,rack6-client-5,rack6-client-6
     osdmap e2648: 11 osds: 11 up, 11 in
      pgmap v554251: 846 pgs, 3 pools, 4383 GB data, 1095 kobjects
            11668 GB used, 26048 GB / 37717 GB avail
                  63 stale+active+clean
                   1 down+incomplete
                 521 active+clean
                 251 incomplete
                  10 stale+peering
ems@rack6-client-5:~$

To fix this, I can't run "ceph osd lost <osd.id>" to remove the PG which is in the down state, as the OSD has already been removed from the cluster.

ems@rack6-client-4:~$ sudo ceph pg dump all | grep down
dumped all in format plain
1.38 1548 0 0 0 0 6492782592 3001 3001 down+incomplete 2014-12-18 15:58:29.681708 1118'508438 2648:1073892 [6,3,1] 6 [6,3,1] 6 76'437184 2014-12-16 12:38:35.322835 76'437184 2014-12-16 12:38:35.322835
ems@rack6-client-4:~$

ems@rack6-client-4:~$ sudo ceph pg 1.38 query
.............
"recovery_state": [
    { "name": "Started\/Primary\/Peering",
      "enter_time": "2014-12-18 15:58:29.681666",
      "past_intervals": [
          { "first": 1109,
            "last": 1118,
            "maybe_went_rw": 1,
......................................
      "down_osds_we_would_probe": [7],
      "peering_blocked_by": []},
......................................

ems@rack6-client-4:~$ sudo ceph osd tree
# id  weight  type name                 up/down  reweight
-1    36.85   root default
-2    20.1        host rack2-storage-1
0     3.35            osd.0             up       1
1     3.35            osd.1             up       1
2     3.35            osd.2             up       1
3     3.35            osd.3             up       1
4     3.35            osd.4             up       1
5     3.35            osd.5             up       1
-3    16.75       host rack2-storage-5
6     3.35            osd.6             up       1
8     3.35            osd.8             up       1
9     3.35            osd.9             up       1
10    3.35            osd.10            up       1
11    3.35            osd.11            up       1

ems@rack6-client-4:~$ sudo ceph osd lost 7 --yes-i-really-mean-it
osd.7 is not down or doesn't exist
ems@rack6-client-4:~$

Can somebody suggest any other recovery step to come out of this?

Thanks & Regards,
Mallikarjun Biradar
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com