Hello,
I've recently had a minor issue come up where random individual OSDs get
marked failed due to a connection refused reported by another OSD. I say
minor because it's not a node-wide issue, it appears to hit random OSDs on
random nodes -- and besides that, the OSD comes right back up on its own
shortly afterwards, as if it had been sent a "restart" or something.
On the MON I see this (note that the whole episode, from the failure
report to the cluster being healthy again, spans only about 37 seconds):
2018-03-26 22:39:36.821247 mon.mon-01 [INF] osd.77 failed
(root=default,host=osd-05) (connection refused reported by osd.55)
2018-03-26 22:39:36.935445 mon.mon-01 [WRN] Health check failed: 1 osds
down (OSD_DOWN)
2018-03-26 22:39:39.959496 mon.mon-01 [WRN] Health check failed: Reduced
data availability: 6 pgs peering (PG_AVAILABILITY)
2018-03-26 22:39:41.969578 mon.mon-01 [WRN] Health check failed:
Degraded data redundancy: 6528/1742880 objects degraded (0.375%), 46 pgs
degraded (PG_DEGRADED)
2018-03-26 22:39:43.978429 mon.mon-01 [INF] Health check cleared:
PG_AVAILABILITY (was: Reduced data availability: 6 pgs peering)
2018-03-26 22:39:48.913112 mon.mon-01 [WRN] Health check update:
Degraded data redundancy: 19411/1742880 objects degraded (1.114%), 136
pgs degraded (PG_DEGRADED)
2018-03-26 22:40:06.288138 mon.mon-01 [INF] Health check cleared:
OSD_DOWN (was: 1 osds down)
2018-03-26 22:40:06.301955 mon.mon-01 [INF] osd.77
172.16.238.18:6818/57264 boot
2018-03-26 22:40:07.298884 mon.mon-01 [WRN] Health check update:
Degraded data redundancy: 17109/1742880 objects degraded (0.982%), 120
pgs degraded (PG_DEGRADED)
2018-03-26 22:40:13.330362 mon.mon-01 [INF] Health check cleared:
PG_DEGRADED (was: Degraded data redundancy: 5605/1742880 objects
degraded (0.322%), 39 pgs degraded)
2018-03-26 22:40:13.330409 mon.mon-01 [INF] Cluster is now healthy
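That window comes from the cluster log on the mon host; the same slice can
be pulled with a plain grep, assuming the default log location:

  grep -E '2018-03-26 22:(39|40)' /var/log/ceph/ceph.log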
On host osd-05 (which hosts osd.77) there appears to be normal heartbeat
traffic right up until the OSD daemon spontaneously restarts (truncated
for brevity):
2018-03-26 22:33:00.773897 7efcaf20f700 0 -- 172.16.239.18:6818/7788 >>
172.16.239.21:6818/8675 conn(0x55c5ea2d9000 :6818
s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
l=0).handle_connect_msg accept connect_seq 73 vs existing csq=73
existing_state=STATE_STANDBY
2018-03-26 22:33:00.774124 7efcaf20f700 0 -- 172.16.239.18:6818/7788 >>
172.16.239.21:6818/8675 conn(0x55c5ea2d9000 :6818
s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
l=0).handle_connect_msg accept connect_seq 74 vs existing csq=73
existing_state=STATE_STANDBY
2018-03-26 22:39:56.832556 7f80ceea8e00 0 set uid:gid to 64045:64045
(ceph:ceph)
2018-03-26 22:39:56.832576 7f80ceea8e00 0 ceph version 12.2.4
(52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable), process
(unknown), pid 57264
2018-03-26 22:39:56.849487 7f80ceea8e00 0 pidfile_write: ignore empty
--pid-file
2018-03-26 22:39:56.859045 7f80ceea8e00 0 load: jerasure load: lrc
load: isa
2018-03-26 22:39:56.859122 7f80ceea8e00 1 bdev create path
/var/lib/ceph/osd/ceph-77/block type kernel
2018-03-26 22:39:56.859135 7f80ceea8e00 1 bdev(0x5568ccae4d80
/var/lib/ceph/osd/ceph-77/block) open path
/var/lib/ceph/osd/ceph-77/block
2018-03-26 22:39:56.859398 7f80ceea8e00 1 bdev(0x5568ccae4d80
/var/lib/ceph/osd/ceph-77/block) open size 8001559724032 (0x74702200000,
7452 GB) block_size 4096 (4096 B) rotational
2018-03-26 22:39:56.859711 7f80ceea8e00 1
bluestore(/var/lib/ceph/osd/ceph-77) _set_cache_sizes max 0.5 < ratio
0.99
2018-03-26 22:39:56.859733 7f80ceea8e00 1
bluestore(/var/lib/ceph/osd/ceph-77) _set_cache_sizes cache_size
1073741824 meta 0.5 kv 0.5 data 0
2018-03-26 22:39:56.859738 7f80ceea8e00 1 bdev(0x5568ccae4d80
/var/lib/ceph/osd/ceph-77/block) close
2018-03-26 22:39:57.132071 7f80ceea8e00 1
bluestore(/var/lib/ceph/osd/ceph-77) _mount path
/var/lib/ceph/osd/ceph-77
2018-03-26 22:39:57.132534 7f80ceea8e00 1 bdev create path
/var/lib/ceph/osd/ceph-77/block type kernel
...truncated...
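Note that the pid in the startup banner (57264) differs from the nonce in
the earlier heartbeat lines (/7788), which for an OSD is normally its pid
-- so the daemon really does appear to restart. Something along these
lines should show what took the old process down (log path assumed from
the default /var/log/ceph layout):

  # look for a fatal signal/assert just before the new process's startup banner
  grep -B 40 'ceph version 12.2.4' /var/log/ceph/ceph-osd.77.log | grep -i -E 'signal|abort|assert'
  # and check whether the kernel's OOM killer was involved
  dmesg -T | grep -i -E 'out of memory|killed process'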
Same on host osd-07 (which hosts osd.55, the one that reported the
connection refused): you'll see normal heartbeat traffic, followed by
something interesting, before normal heartbeat traffic returns:
2018-03-26 22:33:20.598576 7f495c52e700 0 -- 172.16.239.20:6810/7206 >>
172.16.239.21:6816/8668 conn(0x5619e70c9800 :6810
s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
l=0).handle_connect_msg accept connect_seq 42 vs existing csq=41
existing_state=STATE_STANDBY
2018-03-26 22:39:36.974204 7f495cd2f700 0 -- 172.16.239.20:6810/7206 >>
172.16.239.17:6805/6991 conn(0x5619f476b000 :6810
s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
l=0).handle_connect_msg accept connect_seq 57 vs existing csq=57
existing_state=STATE_STANDBY
2018-03-26 22:39:36.974457 7f495cd2f700 0 -- 172.16.239.20:6810/7206 >>
172.16.239.17:6805/6991 conn(0x5619f476b000 :6810
s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
l=0).handle_connect_msg accept connect_seq 58 vs existing csq=57
existing_state=STATE_STANDBY
2018-03-26 22:39:36.978674 7f494851d700 1 osd.55 pg_epoch: 21688
pg[1.714( v 21687'7063 (21590'5500,21687'7063] local-lis/les=21654/21655
n=163 ec=480/480 lis/c 21654/21654 les/c/f 21655/21655/0
21688/21688/21654) [55,39] r=0 lpr=21688 pi=[21654,21688)/1 luod=0'0
crt=21687'7063 lcod 21687'7062 mlcod 0'0 active] start_peering_interval
up [55,39,77] -> [55,39], acting [55,39,77] -> [55,39], acting_primary
55 -> 55, up_primary 55 -> 55, role 0 -> 0, features acting
2305244844532236283 upacting 2305244844532236283
2018-03-26 22:39:36.979216 7f494851d700 1 osd.55 pg_epoch: 21688
pg[1.714( v 21687'7063 (21590'5500,21687'7063] local-lis/les=21654/21655
n=163 ec=480/480 lis/c 21654/21654 les/c/f 21655/21655/0
21688/21688/21654) [55,39] r=0 lpr=21688 pi=[21654,21688)/1
crt=21687'7063 lcod 21687'7062 mlcod 0'0 unknown] state<Start>:
transitioning to Primary
2018-03-26 22:40:06.596889 7f495cd2f700 0 -- 172.16.239.20:6810/7206 >>
172.16.239.19:6810/7272 conn(0x5619e6f75000 :6810
s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
l=0).handle_connect_msg accept connect_seq 113 vs existing csq=113
existing_state=STATE_STANDBY
2018-03-26 22:40:06.597867 7f495cd2f700 0 -- 172.16.239.20:6810/7206 >>
172.16.239.19:6810/7272 conn(0x5619e6f75000 :6810
s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
l=0).handle_connect_msg accept connect_seq 114 vs existing csq=113
existing_state=STATE_STANDBY
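The excerpt above shows the peering fallout but not the refused connection
itself, so the next occurrence probably warrants a grep through osd.55's
own log for the heartbeat failure and, if that's not enough, a temporarily
raised messenger debug level (log path assumed from the default layout;
the debug level should be turned back down afterwards):

  grep -E 'heartbeat_check|Connection refused|osd\.77' /var/log/ceph/ceph-osd.55.log
  # on osd-07 itself, via the admin socket:
  ceph daemon osd.55 config set debug_ms 1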
Any ideas what might cause this? Given that less than a minute passes
between the failure and the cluster returning to health, I'm wondering
whether I should even be concerned at all. This was not happening a few
days ago, however, and we did have networking issues before then that
have since been fixed.
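Given those recent network problems, it's probably also worth
double-checking the cluster network between the hosts involved; roughly
(the interface name and jumbo-frame size here are just placeholders):

  ip -s link show dev eth1                      # rx/tx errors and drops
  ethtool -S eth1 | grep -i -E 'err|drop|crc'   # NIC-level counters
  ping -M do -s 8972 -c 5 172.16.239.18         # path-MTU check from osd-07, assuming 9000 MTU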
Thanks in advance.
--
Andre Goree
-=-=-=-=-=-
Email - andre at drenet.net
Website - http://blog.drenet.net
PGP key - http://www.drenet.net/pubkey.html
-=-=-=-=-=-