Random individual OSD failures with "connection refused reported by" another OSD?

Hello,

I've recently had a minor issue come up where random individual OSDs are marked failed because of a connection refused reported by another OSD. I say minor because it's not a node-wide issue and it appears to hit random OSDs on random nodes -- and besides that, the affected OSD comes back up on its own within about half a minute, as if it had been sent a "restart" or something.

On the MON I see this (note that the whole episode, from osd.77 being failed to the cluster being healthy again, spans roughly 37 seconds):

2018-03-26 22:39:36.821247 mon.mon-01 [INF] osd.77 failed (root=default,host=osd-05) (connection refused reported by osd.55)
2018-03-26 22:39:36.935445 mon.mon-01 [WRN] Health check failed: 1 osds down (OSD_DOWN)
2018-03-26 22:39:39.959496 mon.mon-01 [WRN] Health check failed: Reduced data availability: 6 pgs peering (PG_AVAILABILITY)
2018-03-26 22:39:41.969578 mon.mon-01 [WRN] Health check failed: Degraded data redundancy: 6528/1742880 objects degraded (0.375%), 46 pgs degraded (PG_DEGRADED)
2018-03-26 22:39:43.978429 mon.mon-01 [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 6 pgs peering)
2018-03-26 22:39:48.913112 mon.mon-01 [WRN] Health check update: Degraded data redundancy: 19411/1742880 objects degraded (1.114%), 136 pgs degraded (PG_DEGRADED)
2018-03-26 22:40:06.288138 mon.mon-01 [INF] Health check cleared: OSD_DOWN (was: 1 osds down)
2018-03-26 22:40:06.301955 mon.mon-01 [INF] osd.77 172.16.238.18:6818/57264 boot
2018-03-26 22:40:07.298884 mon.mon-01 [WRN] Health check update: Degraded data redundancy: 17109/1742880 objects degraded (0.982%), 120 pgs degraded (PG_DEGRADED)
2018-03-26 22:40:13.330362 mon.mon-01 [INF] Health check cleared: PG_DEGRADED (was: Degraded data redundancy: 5605/1742880 objects degraded (0.322%), 39 pgs degraded)
2018-03-26 22:40:13.330409 mon.mon-01 [INF] Cluster is now healthy
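
In case it's useful context, here's roughly what I've been running on the MON host to see how often these failure reports actually happen and which OSDs are involved (this assumes the default cluster log location; adjust the path if yours differs):

  # count the refused-connection failure reports in the cluster log
  grep -c 'connection refused reported by' /var/log/ceph/ceph.log
  # list the individual events, with timestamps, failed OSDs and reporters
  grep 'connection refused reported by' /var/log/ceph/ceph.log
  ceph health detail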


On host osd-05 (which hosts osd.77), there appears to be normal heartbeat traffic before the OSD daemon spontaneously restarts (truncated for brevity):

2018-03-26 22:33:00.773897 7efcaf20f700 0 -- 172.16.239.18:6818/7788 >> 172.16.239.21:6818/8675 conn(0x55c5ea2d9000 :6818 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 73 vs existing csq=73 existing_state=STATE_STANDBY
2018-03-26 22:33:00.774124 7efcaf20f700 0 -- 172.16.239.18:6818/7788 >> 172.16.239.21:6818/8675 conn(0x55c5ea2d9000 :6818 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 74 vs existing csq=73 existing_state=STATE_STANDBY
2018-03-26 22:39:56.832556 7f80ceea8e00 0 set uid:gid to 64045:64045 (ceph:ceph)
2018-03-26 22:39:56.832576 7f80ceea8e00 0 ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable), process (unknown), pid 57264
2018-03-26 22:39:56.849487 7f80ceea8e00 0 pidfile_write: ignore empty --pid-file
2018-03-26 22:39:56.859045 7f80ceea8e00 0 load: jerasure load: lrc load: isa
2018-03-26 22:39:56.859122 7f80ceea8e00 1 bdev create path /var/lib/ceph/osd/ceph-77/block type kernel
2018-03-26 22:39:56.859135 7f80ceea8e00 1 bdev(0x5568ccae4d80 /var/lib/ceph/osd/ceph-77/block) open path /var/lib/ceph/osd/ceph-77/block
2018-03-26 22:39:56.859398 7f80ceea8e00 1 bdev(0x5568ccae4d80 /var/lib/ceph/osd/ceph-77/block) open size 8001559724032 (0x74702200000, 7452 GB) block_size 4096 (4096 B) rotational
2018-03-26 22:39:56.859711 7f80ceea8e00 1 bluestore(/var/lib/ceph/osd/ceph-77) _set_cache_sizes max 0.5 < ratio 0.99
2018-03-26 22:39:56.859733 7f80ceea8e00 1 bluestore(/var/lib/ceph/osd/ceph-77) _set_cache_sizes cache_size 1073741824 meta 0.5 kv 0.5 data 0
2018-03-26 22:39:56.859738 7f80ceea8e00 1 bdev(0x5568ccae4d80 /var/lib/ceph/osd/ceph-77/block) close
2018-03-26 22:39:57.132071 7f80ceea8e00 1 bluestore(/var/lib/ceph/osd/ceph-77) _mount path /var/lib/ceph/osd/ceph-77
2018-03-26 22:39:57.132534 7f80ceea8e00 1 bdev create path /var/lib/ceph/osd/ceph-77/block type kernel
...truncated...
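
What stands out to me in that log is that this isn't just the monitor marking osd.77 down: the daemon itself clearly starts over (new pid 57264, the "set uid:gid" line, BlueStore re-mounting), so something on osd-05 killed or restarted the ceph-osd process. I'm planning to check the journal and kernel log on that host around the event, along these lines (assuming the usual systemd unit naming):

  # on osd-05
  journalctl -u ceph-osd@77 --since "2018-03-26 22:35" --until "2018-03-26 22:41"
  systemctl status ceph-osd@77
  # look for OOM kills or segfaults around the same time
  dmesg -T | grep -iE 'out of memory|oom|segfault'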


On host osd-07 (which hosts osd.55, the one that reported the connection refused), you likewise see normal heartbeat traffic, followed by something interesting, before normal heartbeat traffic returns:

2018-03-26 22:33:20.598576 7f495c52e700 0 -- 172.16.239.20:6810/7206 >> 172.16.239.21:6816/8668 conn(0x5619e70c9800 :6810 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 42 vs existing csq=41 existing_state=STATE_STANDBY
2018-03-26 22:39:36.974204 7f495cd2f700 0 -- 172.16.239.20:6810/7206 >> 172.16.239.17:6805/6991 conn(0x5619f476b000 :6810 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 57 vs existing csq=57 existing_state=STATE_STANDBY
2018-03-26 22:39:36.974457 7f495cd2f700 0 -- 172.16.239.20:6810/7206 >> 172.16.239.17:6805/6991 conn(0x5619f476b000 :6810 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 58 vs existing csq=57 existing_state=STATE_STANDBY
2018-03-26 22:39:36.978674 7f494851d700 1 osd.55 pg_epoch: 21688 pg[1.714( v 21687'7063 (21590'5500,21687'7063] local-lis/les=21654/21655 n=163 ec=480/480 lis/c 21654/21654 les/c/f 21655/21655/0 21688/21688/21654) [55,39] r=0 lpr=21688 pi=[21654,21688)/1 luod=0'0 crt=21687'7063 lcod 21687'7062 mlcod 0'0 active] start_peering_interval up [55,39,77] -> [55,39], acting [55,39,77] -> [55,39], acting_primary 55 -> 55, up_primary 55 -> 55, role 0 -> 0, features acting 2305244844532236283 upacting 2305244844532236283
2018-03-26 22:39:36.979216 7f494851d700 1 osd.55 pg_epoch: 21688 pg[1.714( v 21687'7063 (21590'5500,21687'7063] local-lis/les=21654/21655 n=163 ec=480/480 lis/c 21654/21654 les/c/f 21655/21655/0 21688/21688/21654) [55,39] r=0 lpr=21688 pi=[21654,21688)/1 crt=21687'7063 lcod 21687'7062 mlcod 0'0 unknown] state<Start>: transitioning to Primary
2018-03-26 22:40:06.596889 7f495cd2f700 0 -- 172.16.239.20:6810/7206 >> 172.16.239.19:6810/7272 conn(0x5619e6f75000 :6810 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 113 vs existing csq=113 existing_state=STATE_STANDBY
2018-03-26 22:40:06.597867 7f495cd2f700 0 -- 172.16.239.20:6810/7206 >> 172.16.239.19:6810/7272 conn(0x5619e6f75000 :6810 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 114 vs existing csq=113 existing_state=STATE_STANDBY
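
So, per the MON log, osd.55 apparently had a connection to osd.77 refused just before 22:39:36.82 and reported it; the monitor failed osd.77 right away (which I believe is intended behavior -- osd_fast_fail_on_connection_refused defaults to true in Luminous, so a refused connection skips the normal heartbeat grace), and pg 1.714 re-peered from [55,39,77] to [55,39] until osd.77 booted again. For what it's worth, these are the settings I'm sanity-checking via the admin sockets (the comments are just what I expect the defaults to be):

  # on osd-07
  ceph daemon osd.55 config get osd_heartbeat_grace                  # expecting 20
  ceph daemon osd.55 config get osd_fast_fail_on_connection_refused  # expecting true
  # on the MON host
  ceph daemon mon.mon-01 config get mon_osd_min_down_reporters       # expecting 2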


Any ideas what might cause this? Given that only about half a minute passes between the failure and the cluster returning to health, I'm wondering whether I should even be concerned at all. This was not happening a few days ago, however, and we did have networking issues before that, which we've since fixed.
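
Just in case, I'm also going to re-verify the cluster network between the two hosts involved (addresses taken from the logs above), something along these lines:

  # from osd-05, towards osd.55's cluster-network address on osd-07
  ping -c 20 172.16.239.20
  # and on both hosts, watch whether TCP retransmits/resets keep climbing
  netstat -s | grep -iE 'retrans|reset'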

Thanks in advance.

--
Andre Goree
-=-=-=-=-=-
Email     - andre at drenet.net
Website   - http://blog.drenet.net
PGP key   - http://www.drenet.net/pubkey.html
-=-=-=-=-=-


