Re: Random individual OSD failures with "connection refused reported by" another OSD?

We have seen similar behavior when there are network issues. AFAIK, the OSD is being reported down by a peer OSD that cannot reach it, but either another OSD that can reach it, or the heartbeat between the OSD and the monitor, then declares it up again. The OSD "boot" message does not seem to indicate an actual OSD restart.
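
If you want to sanity-check the thresholds involved, the relevant knobs are options like osd_heartbeat_grace, osd_heartbeat_interval, and mon_osd_min_down_reporters. A minimal sketch for reading them off the local admin sockets (assuming a Luminous cluster, and that you run it on the host that carries the daemon in question) would be something like:

import json
import subprocess

def daemon_config_get(daemon, option):
    # Ask a local Ceph daemon for one option over its admin socket.
    # "ceph daemon <name> config get <option>" prints a small JSON
    # object keyed by the option name.
    out = subprocess.check_output(
        ["ceph", "daemon", daemon, "config", "get", option])
    return json.loads(out)[option]

# Run on osd-07, where osd.55 (the reporter) lives:
for opt in ("osd_heartbeat_grace", "osd_heartbeat_interval"):
    print(opt, "=", daemon_config_get("osd.55", opt))

# Run on the monitor host to see how many reporters the mon needs
# before it marks an OSD down:
# print("mon_osd_min_down_reporters =",
#       daemon_config_get("mon.mon-01", "mon_osd_min_down_reporters"))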

Subhachandra

On Wed, Mar 28, 2018 at 10:30 AM, Andre Goree <andre@xxxxxxxxxx> wrote:
Hello,

I've recently had a minor issue come up where random individual OSDs are marked failed due to a connection refused reported by another OSD.  I say minor because it's not a node-wide issue and it appears to hit random nodes -- and besides that, the OSD comes back up within less than a minute, as if the OSD were sent a "restart" or something.

On the MON I see this (notice the entire sequence, from the failure report to "Cluster is now healthy", spans only about 36 seconds; see the quick timing check after the log):

2018-03-26 22:39:36.821247 mon.mon-01 [INF] osd.77 failed (root=default,host=osd-05) (connection refused reported by osd.55)
2018-03-26 22:39:36.935445 mon.mon-01 [WRN] Health check failed: 1 osds down (OSD_DOWN)
2018-03-26 22:39:39.959496 mon.mon-01 [WRN] Health check failed: Reduced data availability: 6 pgs peering (PG_AVAILABILITY)
2018-03-26 22:39:41.969578 mon.mon-01 [WRN] Health check failed: Degraded data redundancy: 6528/1742880 objects degraded (0.375%), 46 pgs degraded (PG_DEGRADED)
2018-03-26 22:39:43.978429 mon.mon-01 [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 6 pgs peering)
2018-03-26 22:39:48.913112 mon.mon-01 [WRN] Health check update: Degraded data redundancy: 19411/1742880 objects degraded (1.114%), 136 pgs degraded (PG_DEGRADED)
2018-03-26 22:40:06.288138 mon.mon-01 [INF] Health check cleared: OSD_DOWN (was: 1 osds down)
2018-03-26 22:40:06.301955 mon.mon-01 [INF] osd.77 172.16.238.18:6818/57264 boot
2018-03-26 22:40:07.298884 mon.mon-01 [WRN] Health check update: Degraded data redundancy: 17109/1742880 objects degraded (0.982%), 120 pgs degraded (PG_DEGRADED)
2018-03-26 22:40:13.330362 mon.mon-01 [INF] Health check cleared: PG_DEGRADED (was: Degraded data redundancy: 5605/1742880 objects degraded (0.322%), 39 pgs degraded)
2018-03-26 22:40:13.330409 mon.mon-01 [INF] Cluster is now healthy
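
For reference, the window is easy to measure straight from those cluster-log timestamps; a throwaway sketch fed the three relevant lines above puts osd.77 down for roughly 30 seconds and the whole failure-to-healthy sequence at roughly 36 seconds:

from datetime import datetime

# The three cluster-log lines of interest, copied from above.
LINES = [
    "2018-03-26 22:39:36.821247 mon.mon-01 [INF] osd.77 failed (root=default,host=osd-05) (connection refused reported by osd.55)",
    "2018-03-26 22:40:06.301955 mon.mon-01 [INF] osd.77 172.16.238.18:6818/57264 boot",
    "2018-03-26 22:40:13.330409 mon.mon-01 [INF] Cluster is now healthy",
]

def ts(line):
    # Cluster-log lines start with "YYYY-MM-DD HH:MM:SS.ffffff".
    return datetime.strptime(line[:26], "%Y-%m-%d %H:%M:%S.%f")

failed, boot, healthy = (ts(l) for l in LINES)
print("marked down for:    %.1f s" % (boot - failed).total_seconds())
print("failure to healthy: %.1f s" % (healthy - failed).total_seconds())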


On host osd-05 (which hosts osd.77) there appears to be normal heartbeat traffic before the OSD process spontaneously restarts (truncated for brevity):

2018-03-26 22:33:00.773897 7efcaf20f700  0 -- 172.16.239.18:6818/7788 >> 172.16.239.21:6818/8675 conn(0x55c5ea2d9000 :6818 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 73 vs existing csq=73 existing_state=STATE_STANDBY
2018-03-26 22:33:00.774124 7efcaf20f700  0 -- 172.16.239.18:6818/7788 >> 172.16.239.21:6818/8675 conn(0x55c5ea2d9000 :6818 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 74 vs existing csq=73 existing_state=STATE_STANDBY
2018-03-26 22:39:56.832556 7f80ceea8e00  0 set uid:gid to 64045:64045 (ceph:ceph)
2018-03-26 22:39:56.832576 7f80ceea8e00  0 ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable), process (unknown), pid 57264
2018-03-26 22:39:56.849487 7f80ceea8e00  0 pidfile_write: ignore empty --pid-file
2018-03-26 22:39:56.859045 7f80ceea8e00  0 load: jerasure load: lrc load: isa
2018-03-26 22:39:56.859122 7f80ceea8e00  1 bdev create path /var/lib/ceph/osd/ceph-77/block type kernel
2018-03-26 22:39:56.859135 7f80ceea8e00  1 bdev(0x5568ccae4d80 /var/lib/ceph/osd/ceph-77/block) open path /var/lib/ceph/osd/ceph-77/block
2018-03-26 22:39:56.859398 7f80ceea8e00  1 bdev(0x5568ccae4d80 /var/lib/ceph/osd/ceph-77/block) open size 8001559724032 (0x74702200000, 7452 GB) block_size 4096 (4096 B) rotational
2018-03-26 22:39:56.859711 7f80ceea8e00  1 bluestore(/var/lib/ceph/osd/ceph-77) _set_cache_sizes max 0.5 < ratio 0.99
2018-03-26 22:39:56.859733 7f80ceea8e00  1 bluestore(/var/lib/ceph/osd/ceph-77) _set_cache_sizes cache_size 1073741824 meta 0.5 kv 0.5 data 0
2018-03-26 22:39:56.859738 7f80ceea8e00  1 bdev(0x5568ccae4d80 /var/lib/ceph/osd/ceph-77/block) close
2018-03-26 22:39:57.132071 7f80ceea8e00  1 bluestore(/var/lib/ceph/osd/ceph-77) _mount path /var/lib/ceph/osd/ceph-77
2018-03-26 22:39:57.132534 7f80ceea8e00  1 bdev create path /var/lib/ceph/osd/ceph-77/block type kernel
...truncated...


The same on host osd-07 (which hosts osd.55, the OSD that reported the connection refused): you'll see normal heartbeat traffic, followed by something interesting, before normal heartbeat traffic returns:

2018-03-26 22:33:20.598576 7f495c52e700  0 -- 172.16.239.20:6810/7206 >> 172.16.239.21:6816/8668 conn(0x5619e70c9800 :6810 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 42 vs existing csq=41 existing_state=STATE_STANDBY
2018-03-26 22:39:36.974204 7f495cd2f700  0 -- 172.16.239.20:6810/7206 >> 172.16.239.17:6805/6991 conn(0x5619f476b000 :6810 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 57 vs existing csq=57 existing_state=STATE_STANDBY
2018-03-26 22:39:36.974457 7f495cd2f700  0 -- 172.16.239.20:6810/7206 >> 172.16.239.17:6805/6991 conn(0x5619f476b000 :6810 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 58 vs existing csq=57 existing_state=STATE_STANDBY
2018-03-26 22:39:36.978674 7f494851d700  1 osd.55 pg_epoch: 21688 pg[1.714( v 21687'7063 (21590'5500,21687'7063] local-lis/les=21654/21655 n=163 ec=480/480 lis/c 21654/21654 les/c/f 21655/21655/0 21688/21688/21654) [55,39] r=0 lpr=21688 pi=[21654,21688)/1 luod=0'0 crt=21687'7063 lcod 21687'7062 mlcod 0'0 active] start_peering_interval up [55,39,77] -> [55,39], acting [55,39,77] -> [55,39], acting_primary 55 -> 55, up_primary 55 -> 55, role 0 -> 0, features acting 2305244844532236283 upacting 2305244844532236283
2018-03-26 22:39:36.979216 7f494851d700  1 osd.55 pg_epoch: 21688 pg[1.714( v 21687'7063 (21590'5500,21687'7063] local-lis/les=21654/21655 n=163 ec=480/480 lis/c 21654/21654 les/c/f 21655/21655/0 21688/21688/21654) [55,39] r=0 lpr=21688 pi=[21654,21688)/1 crt=21687'7063 lcod 21687'7062 mlcod 0'0 unknown] state<Start>: transitioning to Primary
2018-03-26 22:40:06.596889 7f495cd2f700  0 -- 172.16.239.20:6810/7206 >> 172.16.239.19:6810/7272 conn(0x5619e6f75000 :6810 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 113 vs existing csq=113 existing_state=STATE_STANDBY
2018-03-26 22:40:06.597867 7f495cd2f700  0 -- 172.16.239.20:6810/7206 >> 172.16.239.19:6810/7272 conn(0x5619e6f75000 :6810 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 114 vs existing csq=113 existing_state=STATE_STANDBY


Any ideas on what might cause this?  Given that less than a minute passes between the failure and the cluster returning to health, I'm wondering whether I should even be concerned at all.  This was not happening a few days ago, however, and we did have networking issues before that, which we've since fixed.
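
If it happens again, I plan to just watch the cluster log for these reports as they arrive; a rough sketch (assuming the default cluster-log path /var/log/ceph/ceph.log on the mon host, and matching the wording of the failure line above) looks like this:

import re
import time

# Hypothetical watcher: follow the mon's cluster log and print every
# "connection refused reported by" failure as it appears.
CLUSTER_LOG = "/var/log/ceph/ceph.log"   # default location, adjust if needed
PATTERN = re.compile(r"(osd\.\d+) failed .*connection refused reported by (osd\.\d+)")

with open(CLUSTER_LOG) as log:
    log.seek(0, 2)                       # start at the end of the file
    while True:
        line = log.readline()
        if not line:
            time.sleep(1.0)
            continue
        m = PATTERN.search(line)
        if m:
            print("%s reported down by %s: %s" % (m.group(1), m.group(2), line.strip()))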

Thanks in advance.

--
Andre Goree
-=-=-=-=-=-
Email     - andre at drenet.net
Website   - http://blog.drenet.net
PGP key   - http://www.drenet.net/pubkey.html
-=-=-=-=-=-
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
