rbd omap disappeared

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



my cluster occur a big error this morning.
many osd suicide because of heartbeat_map timeout.
when I start all osd manually.it looks well.

but when I using rbd info for a rbd from rbd ls, it say not such file or directory.
And the I use the way in https://fnordahl.com/2017/04/17/ceph-rbd-volume-header-recovery/ to recovery omap.

At last,most rbd works well.But still have 5 rbd can not reocvery.

2019-04-24 16:00:40.045342 7fd5fdffb700 -1 librbd::image::OpenRequest: failed to retreive immutable metadata: (5) Input/output error
2019-04-24 16:00:40.045419 7fd5fd7fa700 -1 librbd::ImageState: 0x5593262fb830 failed to open image: (5) Input/output error
rbd: error opening image volume-bef74858-0fcb-4a0b-b197-9618e6824c46: (5) Input/output error

I try to locate its omap head object, and stop master osd, delete the pg data in master osd, then restart master osd. It looks three replication is the same  when I using attr.

But I can not get the rbd info.Alwayse the vm among this  rbd is still alive and works well.

ceph -s
    cluster 2bec9425-ea5f-4a48-b56a-fe88e126bced
     health HEALTH_WARN
            noout flag(s) set
            election epoch 26, quorum 0,1,2 a,b,c
     osdmap e22551: 1080 osds: 1078 up, 1078 in
            flags noout,sortbitwise,require_jewel_osds
      pgmap v29327873: 90112 pgs, 3 pools, 69753 GB data, 30081 kobjects
            214 TB used, 1500 TB / 1715 TB avail
               90111 active+clean
                   1 active+clean+scrubbing+deep
  client io 57082 kB/s rd, 207 MB/s wr, 1091 op/s rd, 7658 op/s wr

the log of osd.219 which is the firstest log heartbeat no reply.

2019-04-24 04:00:54.905504 7f5eb7fab700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f5e97182700' had timed out after 15
...
2019-04-24 04:00:54.905536 7f5eb7fab700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f5e9b18a700' had timed out after 15
2019-04-24 04:00:57.499903 7f5df2997700  0 -- 10.191.175.20:6855/3169631 >> 10.191.175.33:6947/3001469 pipe(0x55d93b470800 sd=1914 :6855 s=2 pgs=514958 cs=17 l=0 c=0x55d91a3c1200).fault with nothing to send, going to standby
... repeat large
2019-04-24 04:00:57.651603 7f5e77735700  0 -- 10.191.175.20:6855/3169631 >> 10.191.175.36:6845/1062649 pipe(0x55d92d2cb400 sd=105 :45295 s=2 pgs=493749 cs=11 l=0 c=0x55d943755200).fault with nothing to send, going to standby
2019-04-24 04:00:59.905846 7f5eb7fab700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f5e97182700' had timed out after 15
...
2019-04-24 04:01:04.666159 7f5dbb6d7700  0 -- 10.191.175.20:6856/3169631 >> :/0 pipe(0x55d9234b1400 sd=1926 :6856 s=0 pgs=0 cs=0 l=0 c=0x55d924743600).accept failed to getpeername (107) Transport endpoint is not connected
2019-04-24 04:01:04.905958 7f5eb7fab700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f5e97182700' had timed out after 15
2019-04-24 04:01:14.550812 7f5db7793700  0 -- 10.191.175.20:6856/3169631 >> :/0 pipe(0x55d9234b2800 sd=1927 :6856 s=0 pgs=0 cs=0 l=0 c=0x55d924742a00).accept failed to getpeername (107) Transport endpoint is not connected
...
2019-04-24 04:01:14.691430 7f5db7793700  0 -- 10.191.175.20:6858/3169631 >> :/0 pipe(0x55d936a78800 sd=1383 :6858 s=0 pgs=0 cs=0 l=0 c=0x55d94ecf5a80).accept failed to getpeername (107) Transport endpoint is not connected
2019-04-24 04:01:14.692176 7f5e90469700  0 -- 10.191.175.20:6855/3169631 >> 10.191.175.31:6835/2092007 pipe(0x55d93c41a800 sd=67 :17328 s=2 pgs=571754 cs=83 l=0 c=0x55d939913180).fault, initiating reconnect
2019-04-24 04:01:14.693742 7f5e4c889700  0 -- 10.191.175.20:6855/3169631 >> 10.191.175.31:6835/2092007 pipe(0x55d93c41a800 sd=67 :17766 s=1 pgs=571754 cs=84 l=0 c=0x55d939913180).connect got RESETSESSION
2019-04-24 04:01:14.697098 7f5db256c700  0 -- 10.191.175.20:6908/169631 >> :/0 pipe(0x55d95606c800 sd=352 :6908 s=0 pgs=0 cs=0 l=0 c=0x55d920d79500).accept failed to getpeername (107) Transport endpoint is not connected
2019-04-24 04:01:14.697516 7f5e08523700  0 -- 10.191.175.20:6908/169631 >> :/0 pipe(0x55d91b30c000 sd=1926 :6908 s=0 pgs=0 cs=0 l=0 c=0x55d9263cb780).accept failed to getpeername (107) Transport endpoint is not connected
...
2019-04-24 04:01:14.704225 7f5dd7790700  0 -- 10.191.175.20:6908/169631 >> :/0 pipe(0x55d9531c5400 sd=1927 :6908 s=0 pgs=0 cs=0 l=0 c=0x55d9263ca280).accept failed to getpeername (107) Transport endpoint is not connected
2019-04-24 04:01:14.704511 7f5e78c4a700  0 -- 10.191.175.20:6855/3169631 >> 10.191.175.37:6801/2131766 pipe(0x55d939274800 sd=507 :10332 s=1 pgs=604906 cs=3 l=0 c=0x55d925714600).connect got RESETSESSION
...
2019-04-24 04:01:14.705970 7f5dc4980700  0 -- 10.191.175.20:6855/3169631 >> 10.191.175.38:6833/3181256 pipe(0x55d950fed400 sd=455 :58907 s=1 pgs=563194 cs=17 l=0 c=0x55d9320a2480).connect got RESETSESSION
2019-04-24 04:01:14.696315 7f5db548e700  0 -- 10.191.175.20:6908/169631 >> :/0 pipe(0x55d939e45400 sd=1929 :6908 s=0 pgs=0 cs=0 l=0 c=0x55d91b787d80).accept failed to getpeername (107) Transport endpoint is not connected
2019-04-24 04:01:14.708160 7f5e808c6700  0 -- 10.191.175.20:6855/3169631 >> 10.191.175.37:6823/2160572 pipe(0x55d943742800 sd=812 :41939 s=1 pgs=566937 cs=1 l=0 c=0x55d930f82900).connect got RESETSESSION
2019-04-24 04:01:14.708417 7f5e9a989700  1 heartbeat_map reset_timeout 'OSD::osd_op_tp thread 0x7f5e9a989700' had timed out after 15
...
2019-04-24 04:01:14.708949 7f5e97983700  1 heartbeat_map reset_timeout 'OSD::osd_op_tp thread 0x7f5e97983700' had timed out after 15
2019-04-24 04:01:14.709602 7f5e007ac700  0 -- 10.191.175.20:6855/3169631 >> 10.191.175.28:6821/1091360 pipe(0x55d94e69b400 sd=1419 :40317 s=1 pgs=566133 cs=1 l=0 c=0x55d938741480).connect got RESETSESSION
2019-04-24 04:01:14.710727 7f5e07216700  0 -- 10.191.175.20:6855/3169631 >> 10.191.175.44:6838/112579 pipe(0x55d946d74800 sd=809 :23481 s=1 pgs=32 cs=1 l=0 c=0x55d94f963a80).connect got RESETSESSION
2019-04-24 04:01:14.712774 7f5e8716c700  0 -- 10.191.175.20:6855/3169631 >> 10.191.175.34:6916/1001159 pipe(0x55d95be21400 sd=189 :55315 s=1 pgs=533958 cs=1 l=0 c=0x55d935329e00).connect got RESETSESSION
2019-04-24 04:01:14.713262 7f5e90e73700  0 -- 10.191.175.20:6855/3169631 >> 10.191.175.34:6945/1173248 pipe(0x55d9447e8000 sd=99 :37118 s=1 pgs=556344 cs=1 l=0 c=0x55d91ff36700).connect got RESETSESSION
2019-04-24 04:01:14.717779 7f5e87a75700  0 -- 10.191.175.20:6855/3169631 >> 10.191.175.26:6930/1101469 pipe(0x55d92c171400 sd=183 :23002 s=1 pgs=590456 cs=1 l=0 c=0x55d94e748580).connect got RESETSESSION
2019-04-24 04:01:14.717778 7f5db2f5d700  0 -- 10.191.175.20:6908/169631 >> :/0 pipe(0x55d948732800 sd=1936 :6908 s=0 pgs=0 cs=0 l=0 c=0x55d9276cd680).accept failed to getpeername (107) Transport endpoint is not connected
2019-04-24 04:01:14.717885 7f5e8a409700  0 -- 10.191.175.20:6855/3169631 >> 10.191.175.41:6827/2071106 pipe(0x55d93cc55400 sd=165 :48272 s=1 pgs=578891 cs=1 l=0 c=0x55d93d4aca00).connect got RESETSESSION
2019-04-24 04:01:14.717982 7f5e3dfa1700  0 -- 10.191.175.20:6855/3169631 >> 10.191.175.54:6851/2123643 pipe(0x55d920934000 sd=383 :42585 s=1 pgs=574915 cs=1 l=0 c=0x55d928482d00).connect got RESETSESSION
2019-04-24 04:01:14.718139 7f5e513d4700  0 -- 10.191.175.20:6855/3169631 >> 10.191.175.42:6861/3118363 pipe(0x55d93dcb0000 sd=808 :45616 s=1 pgs=492641 cs=1 l=0 c=0x55d930cf9800).connect got RESETSESSION
2019-04-24 04:01:14.718191 7f5db4b6b700  0 -- 10.191.175.20:6908/169631 >> :/0 pipe(0x55d934596000 sd=1932 :6908 s=0 pgs=0 cs=0 l=0 c=0x55d92f572000).accept failed to getpeername (107) Transport endpoint is not connected
2019-04-24 04:01:14.719112 7f5e8dc41700  0 -- 10.191.175.20:6855/3169631 >> 10.191.175.30:6815/3154304 pipe(0x55d946d73400 sd=545 :54330 s=1 pgs=552367 cs=1 l=0 c=0x55d94f019180).connect got RESETSESSION
2019-04-24 04:01:14.719200 7f5e07715700  0 -- 10.191.175.20:6908/169631 >> :/0 pipe(0x55d9234b1400 sd=1935 :6908 s=0 pgs=0 cs=0 l=0 c=0x55d92345c700).accept failed to getpeername (107) Transport endpoint is not connected
2019-04-24 04:01:14.719295 7f5ea5ac1700  0 -- 10.191.175.20:6908/169631 >> :/0 pipe(0x55d9234b2800 sd=1934 :6908 s=0 pgs=0 cs=0 l=0 c=0x55d941d0a780).accept failed to getpeername (107) Transport endpoint is not connected
2019-04-24 04:01:14.719556 7f5e97182700  1 heartbeat_map reset_timeout 'OSD::osd_op_tp thread 0x7f5e97182700' had timed out after 15
。。。。
2019-04-24 04:01:14.731783 7f5e8a70c700  0 -- 10.191.175.20:6855/3169631 >> 10.191.175.47:6890/1064637 pipe(0x55d93cc56800 sd=164 :56146 s=1 pgs=493037 cs=1 l=0 c=0x55d93d4ad600).connect got RESETSESSION
2019-04-24 04:01:14.731789 7f5e22f8c700  0 -- 10.191.175.20:6855/3169631 >> 10.191.175.35:6818/1109909 pipe(0x55d93b319400 sd=300 :48698 s=1 pgs=555765 cs=1 l=0 c=0x55d925407500).connect got RESETSESSION
2019-04-24 04:01:14.732623 7f5e2419e700  0 -- 10.191.175.20:6855/3169631 >> 10.191.175.29:6956/2086678 pipe(0x55d94b9cc800 sd=1070 :42700 s=1 pgs=582448 cs=1 l=0 c=0x55d94ef83b00).connect got RESETSESSION
2019-04-24 04:01:14.792353 7f5eb2c35700 -1 osd.219 22611 heartbeat_check: no reply from 10.191.175.20:6861 osd.218 since back 2019-04-24 04:00:33.420027 front 2019-04-24 04:00:33.420027 (cutoff 2019-04-24 04:00:54.792349)
... large amount of no reply
2019-04-24 04:01:14.810961 7f5eb2c35700 -1 osd.219 22611 heartbeat_check: no reply from 10.191.175.55:6896 osd.1069 since back 2019-04-24 04:00:33.420027 front 2019-04-24 04:00:33.420027 (cutoff 2019-04-24 04:00:54.792349)
2019-04-24 04:01:14.810972 7f5eb2c35700 -1 osd.219 22611 heartbeat_check: no reply from 10.191.175.55:6807 osd.1072 since back 2019-04-24 04:00:33.420027 front 2019-04-24 04:00:33.420027 (cutoff 2019-04-24 04:00:54.792349)
2019-04-24 04:01:14.820935 7f5e9a188700  0 -- 10.191.175.20:6908/169631 submit_message osd_op_reply(1103570 rbd_header.b0bc14677ca78c [watch ping cookie 140029489856640 gen 1] v0'0 uv0 _ondisk_ = -107 ((107) Transport endpoint is not connected)) v7 remote, 10.191.122.251:0/1594377730, failed lossy con, dropping message 0x55d957747600
2019-04-24 04:01:14.821340 7f5eb2c35700  0 log_channel(cluster) log [WRN] : 69 slow requests, 5 included below; oldest blocked for > 44.042324 secs
2019-04-24 04:01:14.821355 7f5eb2c35700  0 log_channel(cluster) log [WRN] : slow request 37.989255 seconds old, received at 2019-04-24 04:00:36.821750: osd_op(client.9569145.0:11357796 7.2339e183 (undecoded) ack+ondisk+write+known_if_redirected e22611) currently queued_for_pg
...
2019-04-24 04:10:15.049210 7f5eb7fab700  1 heartbeat_map is_healthy 'OSD::osd_tp thread 0x7f5e9c98d700' had timed out after 15
2019-04-24 04:10:15.049213 7f5eb7fab700  1 heartbeat_map is_healthy 'OSD::osd_tp thread 0x7f5e9c98d700' had suicide timed out after 150
2019-04-24 04:10:15.159190 7f5eb7fab700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(const ceph::heartbeat_handle_d*, const char*, time_t)' thread 7f5eb7fab70
0 time 2019-04-24 04:10:15.049507
common/HeartbeatMap.cc: 86: FAILED assert(0 == "hit suicide timeout")

 ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x82) [0x55d909a9ae12]
 2: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, long)+0x11f) [0x55d9099d2f4f]
 3: (ceph::HeartbeatMap::is_healthy()+0xd6) [0x55d9099d3976]
 4: (ceph::HeartbeatMap::check_touch_file()+0x2a) [0x55d9099d418a]
 5: (CephContextServiceThread::entry()+0x16c) [0x55d909ab4edc]
 6: (()+0x8064) [0x7f5ebc5dc064]
 7: (clone()+0x6d) [0x7f5eba6dd62d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
-10000> 2019-04-24 04:07:00.490378 7f5e9c18c700  5 osd.219 pg_epoch: 22818 pg[7.1ee8( v 22611'65709 (19416'62700,22611'65709] local-les=22799 n=217 ec=6370 les/c/f 22799/22800/0 22804/22804/22804) [437,1013] r=-1 lpr=22804 pi=22780-22803/1 crt=22611'65707 lcod 22611'65708 inactive NOTIFY] enter Reset
 -9999> 2019-04-24 04:07:00.490427 7f5e9c18c700  5 osd.219 pg_epoch: 22819 pg[7.1ee8( v 22611'65709 (19416'62700,22611'65709] local-les=22799 n=217 ec=6370 les/c/f 22799/22800/0 22818/22818/22818) [] r=-1 lpr=22818 pi=22780-22817/2 crt=22611'65707 lcod 22611'65708 inactive NOTIFY] exit Reset 0.000049 2 0.000045
 -9998> 2019-04-24 04:07:00.490437 7f5e9c18c700  5 osd.219 pg_epoch: 22819 pg[7.1ee8( v 22611'65709 (19416'62700,22611'65709] local-les=22799 n=217 ec=6370 les/c/f 22799/22800/0 22818/22818/22818) [] r=-1 lpr=22818 pi=22780-22817/2 crt=22611'65707 lcod 22611'65708 inactive NOTIFY] enter Started
 -9997> 2019-04-24 04:07:00.490443 7f5e9c18c700  5 osd.219 pg_epoch: 22819 pg[7.1ee8( v 22611'65709 (19416'62700,22611'65709] local-les=22799 n=217 ec=6370 les/c/f 22799/22800/0 22818/22818/22818) [] r=-1 lpr=22818 pi=22780-22817/2 crt=22611'65707 lcod 22611'65708 inactive NOTIFY] enter Start
 -9996> 2019-04-24 04:07:00.490448 7f5e9c18c700  1 osd.219 pg_epoch: 22819 pg[7.1ee8( v 22611'65709 (19416'62700,22611'65709] local-les=22799 n=217 ec=6370 les/c/f 22799/22800/0 22818/22818/22818) [] r=-1 lpr=22818 pi=22780-22817/2 crt=22611'65707 lcod 22611'65708 inactive NOTIFY] state<Start>: transitioning to Stray
 -9995> 2019-04-24 04:07:00.490455 7f5e9c18c700  5 osd.219 pg_epoch: 22819 pg[7.1ee8( v 22611'65709 (19416'62700,22611'65709] local-les=22799 n=217 ec=6370 les/c/f 22799/22800/0 22818/22818/22818) [] r=-1 lpr=22818 pi=22780-22817/2 crt=22611'65707 lcod 22611'65708 inactive NOTIFY] exit Start 0.000012 0 0.000000
 -9994> 2019-04-24 04:07:00.490462 7f5e9c18c700  5 osd.219 pg_epoch: 22819 pg[7.1ee8( v 22611'65709 (19416'62700,22611'65709] local-les=22799 n=217 ec=6370 les/c/f 22799/22800/0 22818/22818/22818) [] r=-1 lpr=22818 pi=22780-22817/2 crt=22611'65707 lcod 22611'65708 inactive NOTIFY] enter Started/Stray
 
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

[Index of Archives]     [Information on CEPH]     [Linux Filesystem Development]     [Ceph Development]     [Ceph Large]     [Linux USB Development]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [xfs]


  Powered by Linux