Hi cephers,
My Ceph cluster has a new problem: many OSDs are being wrongly marked down.
Each time, tens or even hundreds of OSDs are marked down. Most of the time they come back up soon afterwards.
Sometimes they do not, and I have to restart the OSD manually because it is blocked by a peer.
I searched dmesg and found nothing. Then I looked at the OSD log and found far too many lines like the one below:
2019-04-22 06:27:07.111256 7ffa9b4d6700 1 osd.880 22517 ms_handle_reset con 0x55f26cc0d300 session 0x55f2904b90c0
I have 1080 OSDs in this cluster, and in the last 24 hours this log line was repeated more than 57 million times.
I compared it with another cluster, where this line is repeated only tens of times in 24 hours.
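For anyone who wants to compare their own cluster, counts like these can be obtained roughly as follows. This is only a minimal sketch, not the exact method used for the numbers above: it assumes the log lines have the same layout as the ms_handle_reset line quoted earlier, and the name count_resets is made up for illustration.

```python
import re
from collections import Counter

# Matches OSD log lines shaped like the one quoted in this mail, e.g.
# 2019-04-22 06:27:07.111256 7ffa9b4d6700 1 osd.880 22517 ms_handle_reset con 0x... session 0x...
# The layout is an assumption based on that single sample line.
RESET_RE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<hour>\d{2}):\d{2}:\d{2}\.\d+\s+"
    r"\S+\s+\d+\s+(?P<osd>osd\.\d+)\s+\d+\s+ms_handle_reset\b"
)


def count_resets(lines):
    """Count ms_handle_reset lines per (osd, 'YYYY-MM-DD HH') bucket.

    `lines` is any iterable of log lines, for example an open log file.
    Lines that do not match the assumed layout are ignored.
    """
    counts = Counter()
    for line in lines:
        m = RESET_RE.match(line)
        if m:
            counts[(m["osd"], f"{m['date']} {m['hour']}")] += 1
    return counts


# Small self-check using the line from this mail plus an unrelated line.
sample = [
    "2019-04-22 06:27:07.111256 7ffa9b4d6700 1 osd.880 22517 "
    "ms_handle_reset con 0x55f26cc0d300 session 0x55f2904b90c0",
    "2019-04-22 06:27:08.000000 7ffa9b4d6700 0 some other message",
]
for (osd, hour), n in sorted(count_resets(sample).items()):
    print(osd, hour, n)
```

To run it on a real log, pass an open file object to count_resets instead of the sample list. Grouping by hour makes it easier to see whether the resets line up with the times the OSDs were marked down.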
Another key point: at first we used the async messenger, but then we ran into some bugs, such as I/O hanging without any errors in ceph status. So we changed ms_type to simple online, for each mon and osd service, one by one.
ceph -s
cluster 2bec9425-ea5f-4a48-b56a-fe88e126bced
health HEALTH_WARN
noout flag(s) set
monmap e1: 3 mons at {a=10.191.175.249:6789/0,b=10.191.175.250:6789/0,c=10.191.175.251:6789/0}
election epoch 26, quorum 0,1,2 a,b,c
osdmap e22551: 1080 osds: 1078 up, 1078 in
flags noout,sortbitwise,require_jewel_osds
pgmap v29327873: 90112 pgs, 3 pools, 69753 GB data, 30081 kobjects
214 TB used, 1500 TB / 1715 TB avail
90111 active+clean
1 active+clean+scrubbing+deep
client io 57082 kB/s rd, 207 MB/s wr, 1091 op/s rd, 7658 op/s wr
ceph osd pool ls detail
pool 5 'ssd-ctrl' replicated size 3 min_size 1 crush_ruleset 1 object_hash rjenkins pg_num 8192 pgp_num 8192 last_change 18865 owner 300 flags hashpspool stripe_width 0
removed_snaps [1~3,5~1,7~1]
pool 6 'ssd-img' replicated size 3 min_size 1 crush_ruleset 1 object_hash rjenkins pg_num 16384 pgp_num 16384 last_change 19416 owner 100 flags hashpspool stripe_width 0
removed_snaps [1~13,15~1,17~1,19~1,1f~1,21~2,24~1,2c~3,30~b,3d~a,49~4,4e~1,50~1,52~3,56~19,71~2,74~2,78~1,7a~1,7c~2,7f~1]
pool 7 '-sas-img' replicated size 3 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 65536 pgp_num 65536 last_change 19985 owner 100 flags hashpspool stripe_width 0
removed_snaps [1~5,7~1,f~2,12~1,1a~3,1e~5,24~6,2b~9,35~1,37~1,3b~18,57~3,5b~19,76~2,79~1,7b~1,7d~2,82~1]