Hi all,
We recently had an OSD failure. After that I manually added new OSDs (roughly as sketched below), thinking Ceph would repair itself.
I am running Ceph 11 (Kraken):
root@node16:~# ceph -v
ceph version 11.2.1 (e0354f9d3b1eea1d75a7dd487ba8098311be38a7)
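For reference, the replacement OSDs were brought in roughly like this (a sketch rather than the exact commands; /dev/sdX stands for the actual data disk on each node):

ceph-disk prepare /dev/sdX     # partition and format the new disk for Ceph
ceph-disk activate /dev/sdX1   # register the new OSD and start the daemon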
root@node16:~# ceph -s
cluster 7c75f6e9-b858-4ac4-aa26-48ae1f33eda2
health HEALTH_WARN
371 pgs backfill_wait
372 pgs degraded
1 pgs recovering
3 pgs recovery_wait
372 pgs stuck degraded
375 pgs stuck unclean
372 pgs stuck undersized
372 pgs undersized
2 requests are blocked > 32 sec
recovery 95173/453987 objects degraded (20.964%)
recovery 103542/453987 objects misplaced (22.807%)
recovery 1/149832 unfound (0.001%)
pool cinder-volumes pg_num 300 > pgp_num 128
pool ephemeral-vms pg_num 300 > pgp_num 128
1 mons down, quorum 0,1 node15,node16
monmap e2: 3 mons at {node15=10.0.5.15:6789/0,node16=10.0.5.16:6789/0,node17=10.0.5.17:6789/0}
election epoch 1226, quorum 0,1 node15,node16
mgr active: node16
osdmap e7858: 6 osds: 6 up, 6 in; 375 remapped pgs
flags sortbitwise,require_jewel_osds,require_kraken_osds
pgmap v16570651: 600 pgs, 2 pools, 571 GB data, 146 kobjects
1363 GB used, 4202 GB / 5566 GB avail
95173/453987 objects degraded (20.964%)
103542/453987 objects misplaced (22.807%)
1/149832 unfound (0.001%)
368 active+undersized+degraded+remapped+backfill_wait
225 active+clean
3 active+remapped+backfill_wait
3 active+recovery_wait+undersized+degraded+remapped
1 active+recovering+undersized+degraded+remapped
client io 17441 B/s rd, 271 kB/s wr, 42 op/s rd, 26 op/s wr
Many PGs are stuck degraded, undersized, or remapped.
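To dig into this I have been looking at the following (a sketch of what I am trying; the pgp_num change is only what the pg_num > pgp_num warning above seems to ask for, so please correct me if that is the wrong move mid-recovery):

ceph health detail                            # per-PG detail for the stuck/unfound PGs and the blocked requests
ceph pg dump_stuck unclean                    # which PGs are unclean and which OSDs they map to
ceph osd pool set cinder-volumes pgp_num 300  # bring pgp_num up to pg_num as the warning suggests
ceph osd pool set ephemeral-vms pgp_num 300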
root@node16:~# ceph osd tree
ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 5.81839 root default
-2 1.81839 host node9
11 0.90919 osd.11 up 1.00000 1.00000
1 0.90919 osd.1 up 1.00000 1.00000
-3 2.00000 host node10
0 1.00000 osd.0 up 1.00000 1.00000
2 1.00000 osd.2 up 1.00000 1.00000
-4 2.00000 host node8
3 1.00000 osd.3 up 1.00000 1.00000
6 1.00000 osd.6 up 1.00000 1.00000
I have attached the output of ceph osd dump. Interestingly, you can see pg_temp entries there. What do those mean, and why is osd.7 involved, even though it no longer exists in the cluster?
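This is how I am inspecting the pg_temp entries and one of the affected PGs (a sketch; 2.xx is just a placeholder for one of the PG ids listed in the dump):

ceph osd dump | grep pg_temp   # temporary (remapped) acting sets currently published in the OSD map
ceph pg 2.xx query             # peering/recovery state and acting-set history of one such PG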
Here is the CRUSH map:
root@node16:~# cat /tmp/crush.txt
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable straw_calc_version 1
# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 device4
device 5 device5
device 6 osd.6
device 7 device7
device 8 device8
device 9 device9
device 10 device10
device 11 osd.11
# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root
# buckets
host node9 {
id -2 # do not change unnecessarily
# weight 1.818
alg straw
hash 0 # rjenkins1
item osd.11 weight 0.909
item osd.1 weight 0.909
}
host node10 {
id -3 # do not change unnecessarily
# weight 2.000
alg straw
hash 0 # rjenkins1
item osd.0 weight 1.000
item osd.2 weight 1.000
}
host node8 {
id -4 # do not change unnecessarily
# weight 2.000
alg straw
hash 0 # rjenkins1
item osd.3 weight 1.000
item osd.6 weight 1.000
}
root default {
id -1 # do not change unnecessarily
# weight 5.818
alg straw
hash 0 # rjenkins1
item node9 weight 1.818
item node10 weight 2.000
item node8 weight 2.000
}
# rules
rule replicated_ruleset {
ruleset 0
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}
# end crush map
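(For completeness, the map above was pulled and decompiled with the standard getcrushmap/crushtool steps, roughly:

ceph osd getcrushmap -o /tmp/crush.bin          # fetch the compiled CRUSH map from the monitors
crushtool -d /tmp/crush.bin -o /tmp/crush.txt   # decompile it to the text shown above
)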
The other interesting thing is that I am seeing the following lines in all OSD logs:
2018-04-18 10:57:23.437006 7f883a14b700 0 -- 10.0.5.10:6802/25296 >> - conn(0x55f90cf8f000 :6802 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half accept state just closed
2018-04-18 10:57:26.715861 7f883a14b700 0 -- 10.0.5.10:6802/25296 >> - conn(0x55f90cf8f000 :6802 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half accept state just closed
2018-04-18 10:57:38.435193 7f883a14b700 0 -- 10.0.5.10:6802/25296 >> - conn(0x55f90d3d4800 :6802 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half accept state just closed
2018-04-18 10:57:41.717710 7f883a14b700 0 -- 10.0.5.10:6802/25296 >> - conn(0x55f8e2944800 :6802 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half accept state just closed
What does this mean?
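To see what is poking that port I have been checking things like the following on the OSD host (a rough sketch; the address and port are taken from the log lines above, and the tools are generic network checks, nothing Ceph-specific):

ss -tnp | grep 6802        # what is currently connected to the OSD's messenger port
nc -zv 10.0.5.10 6802      # reachability of that port from another cluster node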
Attachment:
ceph-osd-dump