I really not sure why monitor mark the OSD to down state
"Monitor daemon marked osd.9 down, but it is still running"
2018-01-30 16:07 GMT+08:00 blackpiglet J. <blackpigletbruce@xxxxxxxxx>:
Guys,We had set up a five nodes Ceph cluster. Four are OSD servers and the other one is MON and MGR.Recently, during RGW stability test, RGW default pool: default.rgw.buckets.data is accidentally written full. As a result, RGW is stuck. We don't know the exact steps to recover, then we deleted all default RGW pools directly. After RGW restarted, all pools are back.Then I found 4 OSD processes are down. I am not whether this is related to our RGW pool delete operation.This a the log I think may be useful. The full version is in the attachment.Any help is appreciated. Thanks in advance.2018-01-29 20:17:48.896544 7fc111122700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fc0fb912700' had timed out after 152018-01-29 20:17:52.360069 7fc110183700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fc0fb912700' had timed out after 152018-01-29 20:17:57.360192 7fc110183700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fc0fb912700' had timed out after 152018-01-29 20:18:02.360315 7fc110183700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fc0fb912700' had timed out after 152018-01-29 20:18:06.029867 7fc0fb912700 1 heartbeat_map reset_timeout 'OSD::osd_op_tp thread 0x7fc0fb912700' had timed out after 152018-01-29 20:18:06.030576 7fc0fc914700 1 osd.9 pg_epoch: 2481 pg[6.18( v 2276'10 (0'0,2276'10] local-lis/les=2479/2480 n=3 ec=704/704 lis/c 2479/2470 les/c/f 2480/2471/2312 2481/2481/2476) [9,91,51] r=0 lpr=2481 pi=[2470,2481)/2 luod=0'0 crt=2276'10 lcod 2151'9 mlcod 0'0 active] start_peering_interval up [9,51] -> [9,91,51], acting [9,51] -> [9,91,51], acting_primary 9 -> 9, up_primary 9 -> 9, role 0 -> 0, features acting 2305244844532236283 upacting 23052448445322362832018-01-29 20:18:06.030774 7fc0fc914700 1 osd.9 pg_epoch: 2481 pg[6.18( v 2276'10 (0'0,2276'10] local-lis/les=2479/2480 n=3 ec=704/704 lis/c 2479/2470 les/c/f 2480/2471/2312 2481/2481/2476) [9,91,51] r=0 lpr=2481 pi=[2470,2481)/2 crt=2276'10 lcod 2151'9 mlcod 0'0 unknown] state<Start>: transitioning to Primary2018-01-29 20:18:06.031236 7fc0fc914700 1 osd.9 pg_epoch: 2481 pg[6.1d( v 2316'568 (0'0,2316'568] local-lis/les=2479/2480 n=2 ec=704/704 lis/c 2479/2470 les/c/f 2480/2471/2312 2481/2481/2481) [91,79,9] r=2 lpr=2481 pi=[2470,2481)/2 luod=0'0 crt=2316'568 lcod 2312'567 active] start_peering_interval up [79,9] -> [91,79,9], acting [79,9] -> [91,79,9], acting_primary 79 -> 91, up_primary 79 -> 91, role 1 -> 2, features acting 2305244844532236283 upacting 23052448445322362832018-01-29 20:18:06.031324 7fc0fc914700 1 osd.9 pg_epoch: 2481 pg[6.1d( v 2316'568 (0'0,2316'568] local-lis/les=2479/2480 n=2 ec=704/704 lis/c 2479/2470 les/c/f 2480/2471/2312 2481/2481/2481) [91,79,9] r=2 lpr=2481 pi=[2470,2481)/2 crt=2316'568 lcod 2312'567 unknown NOTIFY] state<Start>: transitioning to Stray2018-01-29 20:18:06.032167 7fc112124700 0 -- 10.1.248.1:6850/6003656 >> 10.1.248.4:6825/1003558 conn(0x556089bf4800 :6850 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=5245 cs=1 l=0).handle_connect_reply connect got RESETSESSION 2018-01-29 20:18:06.032184 7fc111923700 0 -- 10.1.248.1:6850/6003656 >> 10.1.248.3:6827/3726 conn(0x5560c0c06000 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=6734 cs=1 l=0).handle_connect_reply connect got RESETSESSION 2018-01-29 20:18:06.053899 7fc103121700 0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.9 down, but it is still running2018-01-29 20:18:06.053917 7fc103121700 0 log_channel(cluster) log [DBG] : map e2485 wrongly marked me down at e24832018-01-29 20:18:06.053922 7fc103121700 0 osd.9 2485 _committed_osd_maps marked down 6 > osd_max_markdown_count 5 in last 600.000000 seconds, shutting down2018-01-29 20:18:06.053927 7fc103121700 1 osd.9 2485 start_waiting_for_healthy2018-01-29 20:18:06.064250 7fc0fc914700 1 osd.9 pg_epoch: 2483 pg[10.53( empty local-lis/les=2476/2477 n=0 ec=1815/1815 lis/c 2476/2476 les/c/f 2477/2477/2312 2483/2483/2483) [] r=-1 lpr=2483 pi=[2476,2483)/1 crt=0'0 active] start_peering_interval up [9] -> [], acting [9] -> [], acting_primary 9 -> -1, up_primary 9 -> -1, role 0 -> -1, features acting 2305244844532236283 upacting 23052448445322362832018-01-29 20:18:06.064258 7fc0fd115700 1 osd.9 pg_epoch: 2483 pg[3.16( v 2284'372 (0'0,2284'372] local-lis/les=2476/2477 n=2 ec=1758/130 lis/c 2476/2476 les/c/f 2477/2477/2312 2483/2483/1760) [33,85] r=-1 lpr=2483 pi=[2476,2483)/1 luod=0'0 crt=2284'372 lcod 2284'371 active] start_peering_interval up [33,9,85] -> [33,85], acting [33,9,85] -> [33,85], acting_primary 33 -> 33, up_primary 33 -> 33, role 1 -> -1, features acting 2305244844532236283 upacting 23052448445322362832018-01-29 20:18:06.064493 7fc0fc914700 1 osd.9 pg_epoch: 2485 pg[10.53( empty local-lis/les=2476/2477 n=0 ec=1815/1815 lis/c 2476/2476 les/c/f 2477/2477/2312 2483/2483/2483) [] r=-1 lpr=2483 pi=[2476,2483)/1 crt=0'0 unknown NOTIFY] state<Start>: transitioning to Stray2018-01-29 20:18:06.064545 7fc0fd115700 1 osd.9 pg_epoch: 2485 pg[3.16( v 2284'372 (0'0,2284'372] local-lis/les=2476/2477 n=2 ec=1758/130 lis/c 2476/2476 les/c/f 2477/2477/2312 2483/2483/1760) [33,85] r=-1 lpr=2483 pi=[2476,2483)/1 crt=2284'372 lcod 2284'371 unknown NOTIFY] state<Start>: transitioning to Stray2018-01-29 20:18:06.064886 7fc0fc914700 1 osd.9 pg_epoch: 2483 pg[10.3f( empty local-lis/les=2476/2477 n=0 ec=1815/1815 lis/c 2476/2476 les/c/f 2477/2477/2312 2483/2483/2483) [] r=-1 lpr=2483 pi=[2476,2483)/1 crt=0'0 active] start_peering_interval up [9] -> [], acting [9] -> [], acting_primary 9 -> -1, up_primary 9 -> -1, role 0 -> -1, features acting 2305244844532236283 upacting 23052448445322362832018-01-29 20:18:06.064972 7fc0fd115700 1 osd.9 pg_epoch: 2483 pg[6.42( v 2298'367 (0'0,2298'367] local-lis/les=2476/2477 n=1 ec=1753/704 lis/c 2476/2476 les/c/f 2477/2477/2312 2483/2483/2367) [24,84] r=-1 lpr=2483 pi=[2476,2483)/1 luod=0'0 crt=2298'367 lcod 2298'366 active] start_peering_interval up [24,84,9] -> [24,84], acting [24,84,9] -> [24,84], acting_primary 24 -> 24, up_primary 24 -> 24, role 2 -> -1, features acting 2305244844532236283 upacting 23052448445322362832018-01-29 20:18:06.065091 7fc0fc914700 1 osd.9 pg_epoch: 2485 pg[10.3f( empty local-lis/les=2476/2477 n=0 ec=1815/1815 lis/c 2476/2476 les/c/f 2477/2477/2312 2483/2483/2483) [] r=-1 lpr=2483 pi=[2476,2483)/1 crt=0'0 unknown NOTIFY] state<Start>: transitioning to Stray2018-01-29 20:18:06.072202 7fc0fc914700 1 osd.9 pg_epoch: 2483 pg[5.3c( v 2312'24 (0'0,2312'24] local-lis/les=2476/2477 n=1 ec=1748/700 lis/c 2476/2476 les/c/f 2477/2477/2312 2483/2483/2483) [32,66] r=-1 lpr=2483 pi=[2476,2483)/1 luod=0'0 crt=2312'24 lcod 2312'23 active] start_peering_interval up [9,32,66] -> [32,66], acting [9,32,66] -> [32,66], acting_primary 9 -> 32, up_primary 9 -> 32, role 0 -> -1, features acting 2305244844532236283 upacting 23052448445322362832018-01-29 20:18:06.072324 7fc0fc914700 1 osd.9 pg_epoch: 2485 pg[5.3c( v 2312'24 (0'0,2312'24] local-lis/les=2476/2477 n=1 ec=1748/700 lis/c 2476/2476 les/c/f 2477/2477/2312 2483/2483/2483) [32,66] r=-1 lpr=2483 pi=[2476,2483)/1 crt=2312'24 lcod 2312'23 unknown NOTIFY] state<Start>: transitioning to Stray2018-01-29 20:18:06.072463 7fc103121700 0 osd.9 2485 _committed_osd_maps shutdown OSD via async signal2018-01-29 20:18:06.072570 7fc0f4904700 -1 Fail to open '/proc/0/cmdline' error = (2) No such file or directory2018-01-29 20:18:06.072597 7fc0f4904700 -1 received signal: Interrupt from PID: 0 task name: <unknown> UID: 02018-01-29 20:18:06.072603 7fc0f4904700 -1 osd.9 2485 *** Got signal Interrupt ***2018-01-29 20:18:06.072609 7fc0f4904700 0 osd.9 2485 prepare_to_stop starting shutdown2018-01-29 20:18:06.072614 7fc0f4904700 -1 osd.9 2485 shutdown2018-01-29 20:18:06.429238 7fc10d936700 0 log_channel(cluster) log [WRN] : 2 slow requests, 2 included below; oldest blocked for > 55.682037 secs2018-01-29 20:18:06.429251 7fc10d936700 0 log_channel(cluster) log [WRN] : slow request 55.682037 seconds old, received at 2018-01-29 20:17:10.747125: pg_notify((query:2482 sent:2482 6.18( v 2276'10 (0'0,2276'10] local-lis/les=2473/2474 n=3 ec=704/704 lis/c 2473/2470 les/c/f 2474/2471/2312 2481/2481/2476))=([2470,2480] intervals=([2473,2475] acting 51,91),([2479,2480] acting 9,51)) epoch 2482) currently wait for new map2018-01-29 20:18:06.429265 7fc10d936700 0 log_channel(cluster) log [WRN] : slow request 55.130508 seconds old, received at 2018-01-29 20:17:11.298655: pg_notify((query:2482 sent:2482 6.18( v 2276'10 (0'0,2276'10] local-lis/les=2479/2480 n=3 ec=704/704 lis/c 2479/2470 les/c/f 2480/2471/2312 2481/2481/2476))=([2470,2480] intervals=([2473,2475] acting 51,91),([2479,2480] acting 9,51)) epoch 2482) currently wait for new map2018-01-29 20:18:08.117736 7fc0f4904700 1 bluestore(/var/lib/ceph/osd/ceph-9) umount 2018-01-29 20:18:08.317882 7fc0f4904700 1 stupidalloc shutdown2018-01-29 20:18:08.322660 7fc0f4904700 1 freelist shutdown2018-01-29 20:18:08.322718 7fc0f4904700 4 rocksdb: [/build/ceph-12.2.2/src/rocksdb/db/db_impl.cc:217] Shutdown: canceling all background work 2018-01-29 20:18:08.339096 7fc0f4904700 4 rocksdb: [/build/ceph-12.2.2/src/rocksdb/db/db_impl.cc:343] Shutdown complete 2018-01-29 20:18:08.508354 7fc0f4904700 1 bluefs umount2018-01-29 20:18:08.508555 7fc0f4904700 1 stupidalloc shutdown2018-01-29 20:18:08.508571 7fc0f4904700 1 stupidalloc shutdown2018-01-29 20:18:08.508573 7fc0f4904700 1 stupidalloc shutdown2018-01-29 20:18:08.508718 7fc0f4904700 1 bdev(0x5560701c7440 /dev/nvme0n1p19) close2018-01-29 20:18:08.786803 7fc0f4904700 1 bdev(0x5560701c6fc0 /dev/nvme0n1p20) close2018-01-29 20:18:08.874747 7fc0f4904700 1 bdev(0x5560701c7200 /var/lib/ceph/osd/ceph-9/block) close 2018-01-29 20:18:09.030750 7fc0f4904700 1 bdev(0x5560701c6d80 /var/lib/ceph/osd/ceph-9/block) close BR,Bruce J.
_______________________________________________ ceph-users mailing list ceph-users@xxxxxxxxxxxxxx http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com