Hi Dilip,
Looking at the output of ceph -s, the cluster is still recovering (there are still pgs in the recovery_wait, backfill_wait and recovering states), so you will have to be patient and let Ceph recover.
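If you want to keep an eye on the recovery progress, something along these lines should do (just a suggestion, all standard commands):

    # follow cluster status / recovery progress live
    ceph -w

    # or check the health details and the stuck pgs from time to time
    ceph health detail
    ceph pg dump_stuck unclean

The degraded/misplaced percentages in ceph -s should keep dropping while backfill runs.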
The output of ceph osd dump doesn't mention osd.7. The '7' in those pg_temp entries is the pool id part of the pg id (a pg id has the form <pool>.<pg>, so 7.xx means a pg in pool 7), not a reference to osd.7.
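If you want to double-check which pool that is, something like this should do (7.0 below is just an example pg id, take one from your own pg_temp lines):

    # map pool ids to pool names
    ceph osd lspools

    # show the pg_temp entries from the osdmap
    ceph osd dump | grep pg_temp

    # query one of those pgs to see its up/acting osds
    ceph pg 7.0 query | less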
Kind regards,
Caspar Smit
2018-04-18 11:10 GMT+02:00 Dilip Renkila <dilip.renkila278@xxxxxxxxx>:
Hi all,

We recently had an osd breakdown. After that I have manually added osd's, thinking that ceph would repair by itself. I am running ceph version 11.

root@node16:~# ceph -v
ceph version 11.2.1 (e0354f9d3b1eea1d75a7dd487ba8098311be38a7)

root@node16:~# ceph -s
    cluster 7c75f6e9-b858-4ac4-aa26-48ae1f33eda2
     health HEALTH_WARN
            371 pgs backfill_wait
            372 pgs degraded
            1 pgs recovering
            3 pgs recovery_wait
            372 pgs stuck degraded
            375 pgs stuck unclean
            372 pgs stuck undersized
            372 pgs undersized
            2 requests are blocked > 32 sec
            recovery 95173/453987 objects degraded (20.964%)
            recovery 103542/453987 objects misplaced (22.807%)
            recovery 1/149832 unfound (0.001%)
            pool cinder-volumes pg_num 300 > pgp_num 128
            pool ephemeral-vms pg_num 300 > pgp_num 128
            1 mons down, quorum 0,1 node15,node16
     monmap e2: 3 mons at {node15=10.0.5.15:6789/0,node16=10.0.5.16:6789/0,node17=10.0.5.17:6789/0}
            election epoch 1226, quorum 0,1 node15,node16
        mgr active: node16
     osdmap e7858: 6 osds: 6 up, 6 in; 375 remapped pgs
            flags sortbitwise,require_jewel_osds,require_kraken_osds
      pgmap v16570651: 600 pgs, 2 pools, 571 GB data, 146 kobjects
            1363 GB used, 4202 GB / 5566 GB avail
            95173/453987 objects degraded (20.964%)
            103542/453987 objects misplaced (22.807%)
            1/149832 unfound (0.001%)
                 368 active+undersized+degraded+remapped+backfill_wait
                 225 active+clean
                   3 active+remapped+backfill_wait
                   3 active+recovery_wait+undersized+degraded+remapped
                   1 active+recovering+undersized+degraded+remapped
  client io 17441 B/s rd, 271 kB/s wr, 42 op/s rd, 26 op/s wr

Many pgs are stuck degraded, remapped, etc.

root@node16:~# ceph osd tree
ID WEIGHT  TYPE NAME       UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 5.81839 root default
-2 1.81839     host node9
11 0.90919         osd.11       up  1.00000          1.00000
 1 0.90919         osd.1        up  1.00000          1.00000
-3 2.00000     host node10
 0 1.00000         osd.0        up  1.00000          1.00000
 2 1.00000         osd.2        up  1.00000          1.00000
-4 2.00000     host node8
 3 1.00000         osd.3        up  1.00000          1.00000
 6 1.00000         osd.6        up  1.00000          1.00000

I have attached the output of ceph osd dump. Interestingly you can see pg_temp. What does that mean and why is osd 7 involved there?

Here is the crush map:

root@node16:~# cat /tmp/crush.txt
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable straw_calc_version 1

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 device4
device 5 device5
device 6 osd.6
device 7 device7
device 8 device8
device 9 device9
device 10 device10
device 11 osd.11

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host node9 {
        id -2           # do not change unnecessarily
        # weight 1.818
        alg straw
        hash 0  # rjenkins1
        item osd.11 weight 0.909
        item osd.1 weight 0.909
}
host node10 {
        id -3           # do not change unnecessarily
        # weight 2.000
        alg straw
        hash 0  # rjenkins1
        item osd.0 weight 1.000
        item osd.2 weight 1.000
}
host node8 {
        id -4           # do not change unnecessarily
        # weight 2.000
        alg straw
        hash 0  # rjenkins1
        item osd.3 weight 1.000
        item osd.6 weight 1.000
}
root default {
        id -1           # do not change unnecessarily
        # weight 5.818
        alg straw
        hash 0  # rjenkins1
        item node9 weight 1.818
        item node10 weight 2.000
        item node8 weight 2.000
}

# rules
rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
}
# end crush map

But the interesting thing is that I am seeing the following lines in all osd logs:

2018-04-18 10:57:23.437006 7f883a14b700  0 -- 10.0.5.10:6802/25296 >> - conn(0x55f90cf8f000 :6802 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half accept state just closed
2018-04-18 10:57:26.715861 7f883a14b700  0 -- 10.0.5.10:6802/25296 >> - conn(0x55f90cf8f000 :6802 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half accept state just closed
2018-04-18 10:57:38.435193 7f883a14b700  0 -- 10.0.5.10:6802/25296 >> - conn(0x55f90d3d4800 :6802 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half accept state just closed
2018-04-18 10:57:41.717710 7f883a14b700  0 -- 10.0.5.10:6802/25296 >> - conn(0x55f8e2944800 :6802 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half accept state just closed

What does this mean?
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com