Random individual OSD failures with "connection refused reported by" another OSD?

Hello,

I've recently had a minor issue come up where random individual OSDs are marked failed due to a connection refused reported by another OSD. I say minor because it's not a node-wide issue and it appears to hit random nodes -- and besides that, the OSD comes back up in less than a minute, as if the OSD were sent a "restart" or something.

On the MON I see this (note that the entire sequence, from the failure to the cluster going healthy again, spans about 37 seconds):

2018-03-26 22:39:36.821247 mon.mon-01 [INF] osd.77 failed (root=default,host=osd-05) (connection refused reported by osd.55)
2018-03-26 22:39:36.935445 mon.mon-01 [WRN] Health check failed: 1 osds down (OSD_DOWN)
2018-03-26 22:39:39.959496 mon.mon-01 [WRN] Health check failed: Reduced data availability: 6 pgs peering (PG_AVAILABILITY)
2018-03-26 22:39:41.969578 mon.mon-01 [WRN] Health check failed: Degraded data redundancy: 6528/1742880 objects degraded (0.375%), 46 pgs degraded (PG_DEGRADED)
2018-03-26 22:39:43.978429 mon.mon-01 [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 6 pgs peering)
2018-03-26 22:39:48.913112 mon.mon-01 [WRN] Health check update: Degraded data redundancy: 19411/1742880 objects degraded (1.114%), 136 pgs degraded (PG_DEGRADED)
2018-03-26 22:40:06.288138 mon.mon-01 [INF] Health check cleared: OSD_DOWN (was: 1 osds down)
2018-03-26 22:40:06.301955 mon.mon-01 [INF] osd.77 172.16.238.18:6818/57264 boot
2018-03-26 22:40:07.298884 mon.mon-01 [WRN] Health check update: Degraded data redundancy: 17109/1742880 objects degraded (0.982%), 120 pgs degraded (PG_DEGRADED)
2018-03-26 22:40:13.330362 mon.mon-01 [INF] Health check cleared: PG_DEGRADED (was: Degraded data redundancy: 5605/1742880 objects degraded (0.322%), 39 pgs degraded)
2018-03-26 22:40:13.330409 mon.mon-01 [INF] Cluster is now healthy
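
For reference, here's a minimal Python sketch (assuming the cluster log format above; "ceph.log" is just a placeholder path for the mon cluster log) of how those gaps can be measured:

from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"

events = {}
with open("ceph.log") as f:  # placeholder path to the mon cluster log
    for line in f:
        ts = " ".join(line.split()[:2])  # e.g. "2018-03-26 22:39:36.821247"
        if "osd.77 failed" in line:
            events["failed"] = datetime.strptime(ts, FMT)
        elif "osd.77" in line and " boot" in line:
            events["boot"] = datetime.strptime(ts, FMT)
        elif "Cluster is now healthy" in line:
            events["healthy"] = datetime.strptime(ts, FMT)

if "failed" in events and "boot" in events:
    print("failed -> boot:   ", events["boot"] - events["failed"])
if "failed" in events and "healthy" in events:
    print("failed -> healthy:", events["healthy"] - events["failed"])

Run against the excerpt above, that works out to roughly 30 seconds from failed to boot and 37 seconds from failed to healthy.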


On host osd-05 (which hosts osd.77) there appears to be normal heartbeat traffic right up until the OSD daemon spontaneously restarts (truncated for brevity):

2018-03-26 22:33:00.773897 7efcaf20f700 0 -- 172.16.239.18:6818/7788 >> 172.16.239.21:6818/8675 conn(0x55c5ea2d9000 :6818 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 73 vs existing csq=73 existing_state=STATE_STANDBY
2018-03-26 22:33:00.774124 7efcaf20f700 0 -- 172.16.239.18:6818/7788 >> 172.16.239.21:6818/8675 conn(0x55c5ea2d9000 :6818 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 74 vs existing csq=73 existing_state=STATE_STANDBY
2018-03-26 22:39:56.832556 7f80ceea8e00 0 set uid:gid to 64045:64045 (ceph:ceph)
2018-03-26 22:39:56.832576 7f80ceea8e00 0 ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable), process (unknown), pid 57264
2018-03-26 22:39:56.849487 7f80ceea8e00 0 pidfile_write: ignore empty --pid-file
2018-03-26 22:39:56.859045 7f80ceea8e00 0 load: jerasure load: lrc load: isa
2018-03-26 22:39:56.859122 7f80ceea8e00 1 bdev create path /var/lib/ceph/osd/ceph-77/block type kernel
2018-03-26 22:39:56.859135 7f80ceea8e00 1 bdev(0x5568ccae4d80 /var/lib/ceph/osd/ceph-77/block) open path /var/lib/ceph/osd/ceph-77/block
2018-03-26 22:39:56.859398 7f80ceea8e00 1 bdev(0x5568ccae4d80 /var/lib/ceph/osd/ceph-77/block) open size 8001559724032 (0x74702200000, 7452 GB) block_size 4096 (4096 B) rotational
2018-03-26 22:39:56.859711 7f80ceea8e00 1 bluestore(/var/lib/ceph/osd/ceph-77) _set_cache_sizes max 0.5 < ratio 0.99
2018-03-26 22:39:56.859733 7f80ceea8e00 1 bluestore(/var/lib/ceph/osd/ceph-77) _set_cache_sizes cache_size 1073741824 meta 0.5 kv 0.5 data 0
2018-03-26 22:39:56.859738 7f80ceea8e00 1 bdev(0x5568ccae4d80 /var/lib/ceph/osd/ceph-77/block) close
2018-03-26 22:39:57.132071 7f80ceea8e00 1 bluestore(/var/lib/ceph/osd/ceph-77) _mount path /var/lib/ceph/osd/ceph-77
2018-03-26 22:39:57.132534 7f80ceea8e00 1 bdev create path /var/lib/ceph/osd/ceph-77/block type kernel
...truncated...
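
To confirm it's just the daemon restarting (and not the host), I'm counting start markers in the OSD log -- a rough Python sketch, keyed off the "set uid:gid to" line that shows up once per daemon start in the excerpt above ("ceph-osd.77.log" is a placeholder path):

from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"

starts = []
with open("ceph-osd.77.log") as f:  # placeholder path to the OSD log
    for line in f:
        if "set uid:gid to" in line:  # emitted once each time the daemon starts
            ts = " ".join(line.split()[:2])
            starts.append(datetime.strptime(ts, FMT))

print("daemon starts:", len(starts))
for prev, cur in zip(starts, starts[1:]):
    print(f"{cur}  ({cur - prev} since the previous start)")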


Similarly, on host osd-07 (which hosts osd.55, the one that reported the connection refused), you'll see normal heartbeat traffic, then something interesting -- osd.55 immediately re-peers the affected PGs without osd.77 -- before normal heartbeat traffic returns:

2018-03-26 22:33:20.598576 7f495c52e700 0 -- 172.16.239.20:6810/7206 >> 172.16.239.21:6816/8668 conn(0x5619e70c9800 :6810 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 42 vs existing csq=41 existing_state=STATE_STANDBY
2018-03-26 22:39:36.974204 7f495cd2f700 0 -- 172.16.239.20:6810/7206 >> 172.16.239.17:6805/6991 conn(0x5619f476b000 :6810 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 57 vs existing csq=57 existing_state=STATE_STANDBY
2018-03-26 22:39:36.974457 7f495cd2f700 0 -- 172.16.239.20:6810/7206 >> 172.16.239.17:6805/6991 conn(0x5619f476b000 :6810 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 58 vs existing csq=57 existing_state=STATE_STANDBY
2018-03-26 22:39:36.978674 7f494851d700 1 osd.55 pg_epoch: 21688 pg[1.714( v 21687'7063 (21590'5500,21687'7063] local-lis/les=21654/21655 n=163 ec=480/480 lis/c 21654/21654 les/c/f 21655/21655/0 21688/21688/21654) [55,39] r=0 lpr=21688 pi=[21654,21688)/1 luod=0'0 crt=21687'7063 lcod 21687'7062 mlcod 0'0 active] start_peering_interval up [55,39,77] -> [55,39], acting [55,39,77] -> [55,39], acting_primary 55 -> 55, up_primary 55 -> 55, role 0 -> 0, features acting 2305244844532236283 upacting 2305244844532236283
2018-03-26 22:39:36.979216 7f494851d700 1 osd.55 pg_epoch: 21688 pg[1.714( v 21687'7063 (21590'5500,21687'7063] local-lis/les=21654/21655 n=163 ec=480/480 lis/c 21654/21654 les/c/f 21655/21655/0 21688/21688/21654) [55,39] r=0 lpr=21688 pi=[21654,21688)/1 crt=21687'7063 lcod 21687'7062 mlcod 0'0 unknown] state<Start>: transitioning to Primary
2018-03-26 22:40:06.596889 7f495cd2f700 0 -- 172.16.239.20:6810/7206 >> 172.16.239.19:6810/7272 conn(0x5619e6f75000 :6810 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 113 vs existing csq=113 existing_state=STATE_STANDBY
2018-03-26 22:40:06.597867 7f495cd2f700 0 -- 172.16.239.20:6810/7206 >> 172.16.239.19:6810/7272 conn(0x5619e6f75000 :6810 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 114 vs existing csq=113 existing_state=STATE_STANDBY
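
In case it helps narrow things down, I'm also tallying which OSDs show up in these "connection refused reported by" events, to see whether the failures cluster on a particular host or link -- another rough Python sketch, again assuming the mon cluster log format shown earlier:

import re
from collections import Counter

# Matches lines like:
#   ... [INF] osd.77 failed (root=default,host=osd-05) (connection refused reported by osd.55)
pat = re.compile(r"osd\.(\d+) failed .*connection refused reported by osd\.(\d+)")

failed, reporters = Counter(), Counter()
with open("ceph.log") as f:  # placeholder path to the mon cluster log
    for line in f:
        m = pat.search(line)
        if m:
            failed["osd." + m.group(1)] += 1
            reporters["osd." + m.group(2)] += 1

print("most-failed OSDs:       ", failed.most_common(5))
print("most-frequent reporters:", reporters.most_common(5))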


Any ideas what might cause this? Given that less than a minute passes between the failure and the cluster returning to health, I'm wondering whether I should even be concerned at all. However, this was not happening a few days ago, and we did have networking issues before then that we've since fixed.

Thanks in advance.

--
Andre Goree
-=-=-=-=-=-
Email     - andre at drenet.net
Website   - http://blog.drenet.net
PGP key   - http://www.drenet.net/pubkey.html
-=-=-=-=-=-


