10.2.10-many osd wrongly marked down and osd log has too much ms_handle_reset


 



Hi cephers,

My ceph cluster is facing a new problem: many OSDs are being wrongly marked down.
Each time, tens or hundreds of OSDs get marked down; usually they come back up again soon,
but sometimes they do not, and I have to restart the OSD manually because it is blocked by its peers.

I searched dmesg and found nothing. Then I looked at the OSD logs and found far too many lines like the one below:

2019-04-22 06:27:07.111256 7ffa9b4d6700  1 osd.880 22517 ms_handle_reset con 0x55f26cc0d300 session 0x55f2904b90c0

I have 1080 OSDs in this cluster, and in the last 24 hours this log line repeated more than 57 million times.
In comparison, on another cluster of mine this line appears only tens of times in 24 hours.
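For what it's worth, the 57-million figure comes from simply counting the matching lines. A self-contained sketch of that count (the sample log created here is a stand-in for the real file, e.g. /var/log/ceph/ceph-osd.880.log):

```shell
# Hypothetical sketch of the per-day count; the sample log below stands in
# for the real OSD log file, e.g. /var/log/ceph/ceph-osd.880.log.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
2019-04-22 06:27:07.111256 7ffa9b4d6700  1 osd.880 22517 ms_handle_reset con 0x55f26cc0d300 session 0x55f2904b90c0
2019-04-22 06:27:08.201312 7ffa9b4d6700  1 osd.880 22517 ms_handle_reset con 0x55f26cc0d400 session 0x55f2904b91c0
2019-04-22 06:27:09.301412 7ffa9b4d6700  1 osd.880 22518 some unrelated log line
EOF
grep -c 'ms_handle_reset' "$LOG"   # prints 2 for this sample
rm -f "$LOG"
```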

Another key point: at first we used the async messenger, but we then hit bugs such as IO hangs without any errors in ceph status, so we changed ms_type to simple online, restarting each mon and osd service one by one.
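For reference, the switch amounted to something like the following ceph.conf fragment (a sketch, assuming the default /etc/ceph/ceph.conf location; each mon and osd daemon then had to be restarted in turn for it to take effect):

```ini
# /etc/ceph/ceph.conf (fragment) -- assumed default location
[global]
# switch the messenger from async back to simple; a daemon restart
# is required for the change to take effect
ms_type = simple
```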
 

ceph -s
    cluster 2bec9425-ea5f-4a48-b56a-fe88e126bced
     health HEALTH_WARN
            noout flag(s) set
            election epoch 26, quorum 0,1,2 a,b,c
     osdmap e22551: 1080 osds: 1078 up, 1078 in
            flags noout,sortbitwise,require_jewel_osds
      pgmap v29327873: 90112 pgs, 3 pools, 69753 GB data, 30081 kobjects
            214 TB used, 1500 TB / 1715 TB avail
               90111 active+clean
                   1 active+clean+scrubbing+deep
  client io 57082 kB/s rd, 207 MB/s wr, 1091 op/s rd, 7658 op/s wr


ceph osd pool ls detail
pool 5 'ssd-ctrl' replicated size 3 min_size 1 crush_ruleset 1 object_hash rjenkins pg_num 8192 pgp_num 8192 last_change 18865 owner 300 flags hashpspool stripe_width 0
removed_snaps [1~3,5~1,7~1]
pool 6 'ssd-img' replicated size 3 min_size 1 crush_ruleset 1 object_hash rjenkins pg_num 16384 pgp_num 16384 last_change 19416 owner 100 flags hashpspool stripe_width 0
removed_snaps [1~13,15~1,17~1,19~1,1f~1,21~2,24~1,2c~3,30~b,3d~a,49~4,4e~1,50~1,52~3,56~19,71~2,74~2,78~1,7a~1,7c~2,7f~1]
pool 7 '-sas-img' replicated size 3 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 65536 pgp_num 65536 last_change 19985 owner 100 flags hashpspool stripe_width 0
removed_snaps [1~5,7~1,f~2,12~1,1a~3,1e~5,24~6,2b~9,35~1,37~1,3b~18,57~3,5b~19,76~2,79~1,7b~1,7d~2,82~1]
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
