Re: OSD went down but no idea why

"blackpiglet J." <blackpigletbruce@xxxxxxxxx> · Tue, 30 Jan 2018 16:10:34 +0800

I really not sure why monitor mark the OSD to down state

 "Monitor daemon marked osd.9 down, but it is still running"

2018-01-30 16:07 GMT+08:00 blackpiglet J. <blackpigletbruce@xxxxxxxxx>:
Guys,
We had set up a five nodes Ceph cluster. Four are OSD servers and the other one is MON and MGR.
Recently, during RGW stability test, RGW default pool: default.rgw.buckets.data is accidentally written full. As a result, RGW is stuck. We don't know the exact steps to recover, then we deleted all default RGW pools directly. After RGW restarted, all pools are back.

Then I found 4 OSD processes are down. I am not whether this is related to our RGW pool delete operation. 

This a the log I think may be useful. The full version is in the attachment. 

Any help is appreciated. Thanks in advance.

2018-01-29 20:17:48.896544 7fc111122700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fc0fb912700' had timed out after 15
2018-01-29 20:17:52.360069 7fc110183700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fc0fb912700' had timed out after 15
2018-01-29 20:17:57.360192 7fc110183700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fc0fb912700' had timed out after 15
2018-01-29 20:18:02.360315 7fc110183700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fc0fb912700' had timed out after 15
2018-01-29 20:18:06.029867 7fc0fb912700  1 heartbeat_map reset_timeout 'OSD::osd_op_tp thread 0x7fc0fb912700' had timed out after 15
2018-01-29 20:18:06.030576 7fc0fc914700  1 osd.9 pg_epoch: 2481 pg[6.18( v 2276'10 (0'0,2276'10] local-lis/les=2479/2480 n=3 ec=704/704 lis/c 2479/2470 les/c/f 2480/2471/2312 2481/2481/2476) [9,91,51] r=0 lpr=2481 pi=[2470,2481)/2 luod=
0'0 crt=2276'10 lcod 2151'9 mlcod 0'0 active] start_peering_interval up [9,51] -> [9,91,51], acting [9,51] -> [9,91,51], acting_primary 9 -> 9, up_primary 9 -> 9, role 0 -> 0, features acting 2305244844532236283 upacting 230524484453223
6283
2018-01-29 20:18:06.030774 7fc0fc914700  1 osd.9 pg_epoch: 2481 pg[6.18( v 2276'10 (0'0,2276'10] local-lis/les=2479/2480 n=3 ec=704/704 lis/c 2479/2470 les/c/f 2480/2471/2312 2481/2481/2476) [9,91,51] r=0 lpr=2481 pi=[2470,2481)/2 crt=2
276'10 lcod 2151'9 mlcod 0'0 unknown] state<Start>: transitioning to Primary
2018-01-29 20:18:06.031236 7fc0fc914700  1 osd.9 pg_epoch: 2481 pg[6.1d( v 2316'568 (0'0,2316'568] local-lis/les=2479/2480 n=2 ec=704/704 lis/c 2479/2470 les/c/f 2480/2471/2312 2481/2481/2481) [91,79,9] r=2 lpr=2481 pi=[2470,2481)/2 luo
d=0'0 crt=2316'568 lcod 2312'567 active] start_peering_interval up [79,9] -> [91,79,9], acting [79,9] -> [91,79,9], acting_primary 79 -> 91, up_primary 79 -> 91, role 1 -> 2, features acting 2305244844532236283 upacting 2305244844532236
283
2018-01-29 20:18:06.031324 7fc0fc914700  1 osd.9 pg_epoch: 2481 pg[6.1d( v 2316'568 (0'0,2316'568] local-lis/les=2479/2480 n=2 ec=704/704 lis/c 2479/2470 les/c/f 2480/2471/2312 2481/2481/2481) [91,79,9] r=2 lpr=2481 pi=[2470,2481)/2 crt
=2316'568 lcod 2312'567 unknown NOTIFY] state<Start>: transitioning to Stray
2018-01-29 20:18:06.032167 7fc112124700  0 -- 10.1.248.1:6850/6003656 >> 10.1.248.4:6825/1003558 conn(0x556089bf4800 :6850 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=5245 cs=1 l=0).handle_connect_reply connect got RESETSESSION
2018-01-29 20:18:06.032184 7fc111923700  0 -- 10.1.248.1:6850/6003656 >> 10.1.248.3:6827/3726 conn(0x5560c0c06000 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=6734 cs=1 l=0).handle_connect_reply connect got RESETSESSION
2018-01-29 20:18:06.053899 7fc103121700  0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.9 down, but it is still running
2018-01-29 20:18:06.053917 7fc103121700  0 log_channel(cluster) log [DBG] : map e2485 wrongly marked me down at e2483
2018-01-29 20:18:06.053922 7fc103121700  0 osd.9 2485 _committed_osd_maps marked down 6 > osd_max_markdown_count 5 in last 600.000000 seconds, shutting down
2018-01-29 20:18:06.053927 7fc103121700  1 osd.9 2485 start_waiting_for_healthy
2018-01-29 20:18:06.064250 7fc0fc914700  1 osd.9 pg_epoch: 2483 pg[10.53( empty local-lis/les=2476/2477 n=0 ec=1815/1815 lis/c 2476/2476 les/c/f 2477/2477/2312 2483/2483/2483) [] r=-1 lpr=2483 pi=[2476,2483)/1 crt=0'0 active] start_peer
ing_interval up [9] -> [], acting [9] -> [], acting_primary 9 -> -1, up_primary 9 -> -1, role 0 -> -1, features acting 2305244844532236283 upacting 2305244844532236283
2018-01-29 20:18:06.064258 7fc0fd115700  1 osd.9 pg_epoch: 2483 pg[3.16( v 2284'372 (0'0,2284'372] local-lis/les=2476/2477 n=2 ec=1758/130 lis/c 2476/2476 les/c/f 2477/2477/2312 2483/2483/1760) [33,85] r=-1 lpr=2483 pi=[2476,2483)/1 luo
d=0'0 crt=2284'372 lcod 2284'371 active] start_peering_interval up [33,9,85] -> [33,85], acting [33,9,85] -> [33,85], acting_primary 33 -> 33, up_primary 33 -> 33, role 1 -> -1, features acting 2305244844532236283 upacting 2305244844532
236283
2018-01-29 20:18:06.064493 7fc0fc914700  1 osd.9 pg_epoch: 2485 pg[10.53( empty local-lis/les=2476/2477 n=0 ec=1815/1815 lis/c 2476/2476 les/c/f 2477/2477/2312 2483/2483/2483) [] r=-1 lpr=2483 pi=[2476,2483)/1 crt=0'0 unknown NOTIFY] st
ate<Start>: transitioning to Stray
2018-01-29 20:18:06.064545 7fc0fd115700  1 osd.9 pg_epoch: 2485 pg[3.16( v 2284'372 (0'0,2284'372] local-lis/les=2476/2477 n=2 ec=1758/130 lis/c 2476/2476 les/c/f 2477/2477/2312 2483/2483/1760) [33,85] r=-1 lpr=2483 pi=[2476,2483)/1 crt
=2284'372 lcod 2284'371 unknown NOTIFY] state<Start>: transitioning to Stray
2018-01-29 20:18:06.064886 7fc0fc914700  1 osd.9 pg_epoch: 2483 pg[10.3f( empty local-lis/les=2476/2477 n=0 ec=1815/1815 lis/c 2476/2476 les/c/f 2477/2477/2312 2483/2483/2483) [] r=-1 lpr=2483 pi=[2476,2483)/1 crt=0'0 active] start_peer
ing_interval up [9] -> [], acting [9] -> [], acting_primary 9 -> -1, up_primary 9 -> -1, role 0 -> -1, features acting 2305244844532236283 upacting 2305244844532236283
2018-01-29 20:18:06.064972 7fc0fd115700  1 osd.9 pg_epoch: 2483 pg[6.42( v 2298'367 (0'0,2298'367] local-lis/les=2476/2477 n=1 ec=1753/704 lis/c 2476/2476 les/c/f 2477/2477/2312 2483/2483/2367) [24,84] r=-1 lpr=2483 pi=[2476,2483)/1 luo
d=0'0 crt=2298'367 lcod 2298'366 active] start_peering_interval up [24,84,9] -> [24,84], acting [24,84,9] -> [24,84], acting_primary 24 -> 24, up_primary 24 -> 24, role 2 -> -1, features acting 2305244844532236283 upacting 2305244844532
236283
2018-01-29 20:18:06.065091 7fc0fc914700  1 osd.9 pg_epoch: 2485 pg[10.3f( empty local-lis/les=2476/2477 n=0 ec=1815/1815 lis/c 2476/2476 les/c/f 2477/2477/2312 2483/2483/2483) [] r=-1 lpr=2483 pi=[2476,2483)/1 crt=0'0 unknown NOTIFY] st
ate<Start>: transitioning to Stray

2018-01-29 20:18:06.072202 7fc0fc914700  1 osd.9 pg_epoch: 2483 pg[5.3c( v 2312'24 (0'0,2312'24] local-lis/les=2476/2477 n=1 ec=1748/700 lis/c 2476/2476 les/c/f 2477/2477/2312 2483/2483/2483) [32,66] r=-1 lpr=2483 pi=[2476,2483)/1 luod=
0'0 crt=2312'24 lcod 2312'23 active] start_peering_interval up [9,32,66] -> [32,66], acting [9,32,66] -> [32,66], acting_primary 9 -> 32, up_primary 9 -> 32, role 0 -> -1, features acting 2305244844532236283 upacting 2305244844532236283
2018-01-29 20:18:06.072324 7fc0fc914700  1 osd.9 pg_epoch: 2485 pg[5.3c( v 2312'24 (0'0,2312'24] local-lis/les=2476/2477 n=1 ec=1748/700 lis/c 2476/2476 les/c/f 2477/2477/2312 2483/2483/2483) [32,66] r=-1 lpr=2483 pi=[2476,2483)/1 crt=2
312'24 lcod 2312'23 unknown NOTIFY] state<Start>: transitioning to Stray
2018-01-29 20:18:06.072463 7fc103121700  0 osd.9 2485 _committed_osd_maps shutdown OSD via async signal
2018-01-29 20:18:06.072570 7fc0f4904700 -1 Fail to open '/proc/0/cmdline' error = (2) No such file or directory
2018-01-29 20:18:06.072597 7fc0f4904700 -1 received  signal: Interrupt from  PID: 0 task name: <unknown> UID: 0
2018-01-29 20:18:06.072603 7fc0f4904700 -1 osd.9 2485 *** Got signal Interrupt ***
2018-01-29 20:18:06.072609 7fc0f4904700  0 osd.9 2485 prepare_to_stop starting shutdown
2018-01-29 20:18:06.072614 7fc0f4904700 -1 osd.9 2485 shutdown
2018-01-29 20:18:06.429238 7fc10d936700  0 log_channel(cluster) log [WRN] : 2 slow requests, 2 included below; oldest blocked for > 55.682037 secs
2018-01-29 20:18:06.429251 7fc10d936700  0 log_channel(cluster) log [WRN] : slow request 55.682037 seconds old, received at 2018-01-29 20:17:10.747125: pg_notify((query:2482 sent:2482 6.18( v 2276'10 (0'0,2276'10] local-lis/les=2473/247
4 n=3 ec=704/704 lis/c 2473/2470 les/c/f 2474/2471/2312 2481/2481/2476))=([2470,2480] intervals=([2473,2475] acting 51,91),([2479,2480] acting 9,51)) epoch 2482) currently wait for new map
2018-01-29 20:18:06.429265 7fc10d936700  0 log_channel(cluster) log [WRN] : slow request 55.130508 seconds old, received at 2018-01-29 20:17:11.298655: pg_notify((query:2482 sent:2482 6.18( v 2276'10 (0'0,2276'10] local-lis/les=2479/248
0 n=3 ec=704/704 lis/c 2479/2470 les/c/f 2480/2471/2312 2481/2481/2476))=([2470,2480] intervals=([2473,2475] acting 51,91),([2479,2480] acting 9,51)) epoch 2482) currently wait for new map
2018-01-29 20:18:08.117736 7fc0f4904700  1 bluestore(/var/lib/ceph/osd/ceph-9) umount
2018-01-29 20:18:08.317882 7fc0f4904700  1 stupidalloc shutdown
2018-01-29 20:18:08.322660 7fc0f4904700  1 freelist shutdown
2018-01-29 20:18:08.322718 7fc0f4904700  4 rocksdb: [/build/ceph-12.2.2/src/rocksdb/db/db_impl.cc:217] Shutdown: canceling all background work
2018-01-29 20:18:08.339096 7fc0f4904700  4 rocksdb: [/build/ceph-12.2.2/src/rocksdb/db/db_impl.cc:343] Shutdown complete
2018-01-29 20:18:08.508354 7fc0f4904700  1 bluefs umount
2018-01-29 20:18:08.508555 7fc0f4904700  1 stupidalloc shutdown
2018-01-29 20:18:08.508571 7fc0f4904700  1 stupidalloc shutdown
2018-01-29 20:18:08.508573 7fc0f4904700  1 stupidalloc shutdown
2018-01-29 20:18:08.508718 7fc0f4904700  1 bdev(0x5560701c7440 /dev/nvme0n1p19) close
2018-01-29 20:18:08.786803 7fc0f4904700  1 bdev(0x5560701c6fc0 /dev/nvme0n1p20) close
2018-01-29 20:18:08.874747 7fc0f4904700  1 bdev(0x5560701c7200 /var/lib/ceph/osd/ceph-9/block) close
2018-01-29 20:18:09.030750 7fc0f4904700  1 bdev(0x5560701c6d80 /var/lib/ceph/osd/ceph-9/block) close

BR,
Bruce J.

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com