OSD node suddenly slow to respond to commands

Hi,

I am running a Ceph Nautilus cluster, and one of my OSD nodes, which hosts 3 OSD services, suddenly went down and became very slow even when typing commands. I killed the ceph-osd processes, after which the system returned to normal, and then started all the OSD services again.

After it returned to normal, I figured the system had been slow due to low memory, but there is no OOM-killer entry in any of the logs.
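(In case it helps, a quick way to double-check for OOM-killer activity is a small script along these lines; a rough sketch only, and the kernel log path is an assumption for my Ubuntu nodes, so adjust it for your distro or use journalctl -k instead.)

#!/usr/bin/env python3
# Rough sketch: scan the kernel log for OOM-killer activity.
# The path below is an assumption (Ubuntu-style kern.log); journalctl -k covers the same data.
import re
from pathlib import Path

KERN_LOG = Path("/var/log/kern.log")  # assumed location; adjust for your setup
OOM_PATTERN = re.compile(r"oom-killer|Out of memory|Killed process", re.IGNORECASE)

for line in KERN_LOG.read_text(errors="replace").splitlines():
    if OOM_PATTERN.search(line):
        print(line)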

I can see heartbeat_check failures and a suicide timeout in the OSD log messages below, but why did this happen? What could be the issue?

osd.1

2019-10-16 20:49:38.433 7f246f302700  0 bluestore(/var/lib/ceph/osd/ceph-1) log_latency_fn slow operation observed for _txc_committed_kv, latency = 6.20965s, txc = 0x55a2541f1500
2019-10-16 20:50:05.233 7f247bf7d700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f245e4e6700' had timed out after 15
2019-10-16 20:50:06.721 7f247af7b700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f245e4e6700' had timed out after 15
2019-10-16 20:53:57.881 7f2475d15700 -1 osd.1 2035 heartbeat_check: no reply from 192.168.100.113:6805 osd.2 since back 2019-10-16 20:49:44.711088 front 2019-10-16 20:49:47.684146 (oldest deadline 2019-10-16 20:50:07.937387)
2019-10-16 20:55:26.997 7f2475d15700 -1 osd.1 2035 heartbeat_check: no reply from 192.168.100.111:6821 osd.3 since back 2019-10-16 20:49:44.711031 front 2019-10-16 20:49:44.711495 (oldest deadline 2019-10-16 20:50:07.937387)
2019-10-16 20:56:25.513 7f2475d15700 -1 osd.1 2035 heartbeat_check: no reply from 192.168.100.112:6805 osd.4 since back 2019-10-16 20:49:44.484031 front 2019-10-16 20:49:44.711337 (oldest deadline 2019-10-16 20:50:03.162338)
2019-10-16 20:57:52.557 7f2475d15700 -1 osd.1 2035 heartbeat_check: no reply from 192.168.100.113:6813 osd.5 since back 2019-10-16 20:49:44.484064 front 2019-10-16 20:49:44.711504 (oldest deadline 2019-10-16 20:50:03.162338)
2019-10-16 20:59:22.837 7f2475d15700 -1 osd.1 2035 heartbeat_check: no reply from 192.168.100.111:6805 osd.6 since back 2019-10-16 20:49:44.711483 front 2019-10-16 20:49:44.483337 (oldest deadline 2019-10-16 20:50:03.162338)
2019-10-16 21:01:18.761 7f2475d15700 -1 osd.1 2035 heartbeat_check: no reply from 192.168.100.112:6821 osd.7 since back 2019-10-16 20:49:44.711275 front 2019-10-16 20:49:44.711352 (oldest deadline 2019-10-16 20:50:07.937387)
2019-10-16 21:02:34.021 7f2475d15700 -1 osd.1 2035 heartbeat_check: no reply from 192.168.100.113:6821 osd.8 since back 2019-10-16 20:49:44.711517 front 2019-10-16 20:49:44.711470 (oldest deadline 2019-10-16 20:50:07.937387)
2019-10-16 21:04:26.361 7f247b77c700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f245d4e4700' had timed out after 60
2019-10-16 21:04:46.225 7f247b77c700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f245e4e6700' had timed out after 15
2019-10-16 21:05:01.453 7f247b77c700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f245e4e6700' had suicide timed out after 150
2019-10-16 21:34:06.365 7f0a6105cf80  0 set uid:gid to 64045:64045 (ceph:ceph)
2019-10-16 21:34:06.365 7f0a6105cf80  0 ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus (stable), process ceph-osd, pid 203416
2019-10-16 21:34:06.365 7f0a6105cf80  0 pidfile_write: ignore empty --pid-file


osd.4
2019-10-16 20:53:57.881 7f7ef6010700 -1 osd.4 2035 heartbeat_check: no reply from 192.168.100.112:6813 osd.1 since back 2019-10-16 20:49:41.268550 front 2019-10-16 20:49:41.268546 (oldest deadline 2019-10-16 20:50:07.683860)
2019-10-16 20:55:26.997 7f7ef6010700 -1 osd.4 2035 heartbeat_check: no reply from 192.168.100.113:6805 osd.2 since back 2019-10-16 20:49:44.484184 front 2019-10-16 20:49:44.483961 (oldest deadline 2019-10-16 20:50:07.683860)
2019-10-16 20:56:25.513 7f7ef6010700 -1 osd.4 2035 heartbeat_check: no reply from 192.168.100.111:6821 osd.3 since back 2019-10-16 20:49:41.192151 front 2019-10-16 20:49:41.191856 (oldest deadline 2019-10-16 20:50:07.683860)
2019-10-16 20:57:52.557 7f7ef6010700 -1 osd.4 2035 heartbeat_check: no reply from 192.168.100.113:6813 osd.5 since back 2019-10-16 20:49:44.484278 front 2019-10-16 20:49:47.920080 (oldest deadline 2019-10-16 20:50:07.683860)
2019-10-16 20:59:22.837 7f7ef6010700 -1 osd.4 2035 heartbeat_check: no reply from 192.168.100.111:6805 osd.6 since back 2019-10-16 20:49:41.268846 front 2019-10-16 20:49:41.192139 (oldest deadline 2019-10-16 20:50:07.683860)
2019-10-16 21:01:16.561 7f7ef6010700 -1 osd.4 2035 heartbeat_check: no reply from 192.168.100.112:6821 osd.7 since back 2019-10-16 20:49:41.380152 front 2019-10-16 20:49:41.380178 (oldest deadline 2019-10-16 20:50:07.683860)
2019-10-16 21:02:15.621 7f7ef6010700 -1 osd.4 2035 heartbeat_check: no reply from 192.168.100.113:6821 osd.8 since back 2019-10-16 20:49:43.862767 front 2019-10-16 20:49:44.484153 (oldest deadline 2019-10-16 20:50:07.683860)
2019-10-16 21:04:14.997 7f7efba77700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f7edd7df700' had timed out after 60
2019-10-16 21:04:31.689 7f7efba77700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f7eddfe0700' had timed out after 60
2019-10-16 21:04:56.025 7f7efba77700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f7ede7e1700' had timed out after 60
2019-10-16 21:05:10.409 7f7efc278700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f7edd7df700' had timed out after 60
2019-10-16 21:05:20.813 7f7efc278700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f7eddfe0700' had timed out after 60
2019-10-16 21:05:34.149 7f7efc278700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f7ede7e1700' had timed out after 60
2019-10-16 21:05:48.465 7f7efb276700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f7edd7df700' had timed out after 60
2019-10-16 21:06:15.409 7f7efb276700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f7eddfe0700' had timed out after 60
2019-10-16 21:06:32.173 7f7efb276700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f7ede7e1700' had timed out after 60
2019-10-16 21:21:53.125 7f7ee87f5700  4 rocksdb: [db/db_impl.cc:777] ------- DUMPING STATS -------
2019-10-16 21:22:15.137 7f7ee87f5700  4 rocksdb: [db/db_impl.cc:778]



osd.7

2019-10-16 20:53:57.881 7f95aa3fc700 -1 osd.7 2035 heartbeat_check: no reply from 192.168.100.112:6813 osd.1 since back 2019-10-16 20:49:41.380241 front 2019-10-16 20:49:41.380043 (oldest deadline 2019-10-16 20:50:07.683838)
2019-10-16 20:55:26.997 7f95aa3fc700 -1 osd.7 2035 heartbeat_check: no reply from 192.168.100.113:6805 osd.2 since back 2019-10-16 20:49:44.484063 front 2019-10-16 20:49:47.920022 (oldest deadline 2019-10-16 20:50:07.683838)
2019-10-16 20:56:25.513 7f95aa3fc700 -1 osd.7 2035 heartbeat_check: no reply from 192.168.100.111:6821 osd.3 since back 2019-10-16 20:49:41.379978 front 2019-10-16 20:49:41.380217 (oldest deadline 2019-10-16 20:50:07.683838)
2019-10-16 20:57:49.685 7f95aa3fc700 -1 osd.7 2035 heartbeat_check: no reply from 192.168.100.112:6805 osd.4 since back 2019-10-16 20:49:41.380202 front 2019-10-16 20:49:41.380119 (oldest deadline 2019-10-16 20:50:07.683838)
2019-10-16 20:59:18.389 7f95aa3fc700 -1 osd.7 2035 heartbeat_check: no reply from 192.168.100.113:6813 osd.5 since back 2019-10-16 20:49:43.862826 front 2019-10-16 20:49:43.723800 (oldest deadline 2019-10-16 20:50:07.683838)
2019-10-16 21:01:16.561 7f95aa3fc700 -1 osd.7 2035 heartbeat_check: no reply from 192.168.100.111:6805 osd.6 since back 2019-10-16 20:49:41.380229 front 2019-10-16 20:49:41.380023 (oldest deadline 2019-10-16 20:50:07.683838)
2019-10-16 21:02:19.197 7f95aa3fc700 -1 osd.7 2035 heartbeat_check: no reply from 192.168.100.113:6821 osd.8 since back 2019-10-16 20:49:44.484017 front 2019-10-16 20:49:44.484078 (oldest deadline 2019-10-16 20:50:07.683838)
2019-10-16 21:21:53.125 7f959cbe1700  4 rocksdb: [db/db_impl.cc:777] ------- DUMPING STATS -------
2019-10-16 21:22:20.221 7f959cbe1700  4 rocksdb: [db/db_impl.cc:778]
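
To see how long each peer had gone without a heartbeat reply, the "since back" timestamps in the heartbeat_check lines above can be compared against the log timestamps, e.g. with a small parser like the sketch below (it only understands the exact OSD log line format shown in this mail, and the input file name in the usage comment is just an example):

#!/usr/bin/env python3
# Rough sketch: parse heartbeat_check lines (format as shown above) and report how
# long each peer had been silent relative to the time the message was logged.
# Example usage: python3 hb_gaps.py < ceph-osd.1.log
import re
import sys
from datetime import datetime

TS_FMT = "%Y-%m-%d %H:%M:%S.%f"
LINE_RE = re.compile(
    r"^(?P<now>\d{4}-\d\d-\d\d \d\d:\d\d:\d\d\.\d+) .*heartbeat_check: no reply from "
    r"(?P<peer>\S+ osd\.\d+) since back (?P<back>\d{4}-\d\d-\d\d \d\d:\d\d:\d\d\.\d+)"
)

for line in sys.stdin:
    m = LINE_RE.search(line)
    if not m:
        continue
    now = datetime.strptime(m.group("now"), TS_FMT)
    back = datetime.strptime(m.group("back"), TS_FMT)
    gap = (now - back).total_seconds()
    print(f"{m.group('peer')}: no reply for {gap:.0f}s (logged {m.group('now')})")

For the osd.1 lines above, this shows every peer apparently silent since roughly 20:49:44, which at least narrows down when things stalled.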


journalctl output

Oct 16 20:53:38 test-node2 systemd-resolved[1005]: Using degraded feature set (TCP) for DNS server 192.168.4.12.
Oct 16 20:56:02 test-node2 ceph-osd[1710]: 2019-10-16 20:53:57.881 7f2475d15700 -1 osd.1 2035 heartbeat_check: no reply from 192.168.100.113:6805 osd.2 since back 2019-
Oct 16 20:56:29 test-node2 ceph-osd[91197]: 2019-10-16 20:53:57.881 7f95aa3fc700 -1 osd.7 2035 heartbeat_check: no reply from 192.168.100.112:6813 osd.1 since back 2019
Oct 16 20:53:40 test-node2 systemd-resolved[1005]: Using degraded feature set (TCP) for DNS server 192.168.4.7.
Oct 16 20:57:17 test-node2 ceph-osd[1713]: 2019-10-16 20:53:57.881 7f7ef6010700 -1 osd.4 2035 heartbeat_check: no reply from 192.168.100.112:6813 osd.1 since back 2019-
Oct 16 20:57:17 test-node2 ceph-osd[1713]: 2019-10-16 20:55:26.997 7f7ef6010700 -1 osd.4 2035 heartbeat_check: no reply from 192.168.100.113:6805 osd.2 since back 2019-
Oct 16 20:57:59 test-node2 ceph-osd[1710]: 2019-10-16 20:55:26.997 7f2475d15700 -1 osd.1 2035 heartbeat_check: no reply from 192.168.100.111:6821 osd.3 since back 2019-
Oct 16 20:57:59 test-node2 ceph-osd[1710]: 2019-10-16 20:56:25.513 7f2475d15700 -1 osd.1 2035 heartbeat_check: no reply from 192.168.100.112:6805 osd.4 since back 2019-
Oct 16 20:58:43 test-node2 ceph-osd[91197]: 2019-10-16 20:55:26.997 7f95aa3fc700 -1 osd.7 2035 heartbeat_check: no reply from 192.168.100.113:6805 osd.2 since back 2019
Oct 16 20:58:43 test-node2 ceph-osd[91197]: 2019-10-16 20:56:25.513 7f95aa3fc700 -1 osd.7 2035 heartbeat_check: no reply from 192.168.100.111:6821 osd.3 since back 2019
Oct 16 20:53:51 test-node2 systemd-networkd[907]: enp5s0: Failed to save LLDP data to /run/systemd/netif/lldp/2: No space left on device
Oct 16 21:00:13 test-node2 ceph-osd[1713]: 2019-10-16 20:56:25.513 7f7ef6010700 -1 osd.4 2035 heartbeat_check: no reply from 192.168.100.111:6821 osd.3 since back 2019-
Oct 16 21:00:13 test-node2 ceph-osd[1713]: 2019-10-16 20:57:52.557 7f7ef6010700 -1 osd.4 2035 heartbeat_check: no reply from 192.168.100.113:6813 osd.5 since back 2019-
Oct 16 21:01:42 test-node2 ceph-osd[1710]: 2019-10-16 20:57:52.557 7f2475d15700 -1 osd.1 2035 heartbeat_check: no reply from 192.168.100.113:6813 osd.5 since back 2019-
Oct 16 21:01:42 test-node2 ceph-osd[1710]: 2019-10-16 20:59:22.837 7f2475d15700 -1 osd.1 2035 heartbeat_check: no reply from 192.168.100.111:6805 osd.6 since back 2019-
Oct 16 21:02:33 test-node2 ceph-osd[91197]: 2019-10-16 20:57:49.685 7f95aa3fc700 -1 osd.7 2035 heartbeat_check: no reply from 192.168.100.112:6805 osd.4 since back 2019
Oct 16 21:02:33 test-node2 ceph-osd[91197]: 2019-10-16 20:59:18.389 7f95aa3fc700 -1 osd.7 2035 heartbeat_check: no reply from 192.168.100.113:6813 osd.5 since back 2019
Oct 16 20:54:24 test-node2 systemd-networkd[907]: enp5s0: Failed to save LLDP data to /run/systemd/netif/lldp/2: No space left on device
Oct 16 21:04:08 test-node2 ceph-osd[1713]: 2019-10-16 20:59:22.837 7f7ef6010700 -1 osd.4 2035 heartbeat_check: no reply from 192.168.100.111:6805 osd.6 since back 2019-
Oct 16 21:04:08 test-node2 ceph-osd[1713]: 2019-10-16 21:01:16.561 7f7ef6010700 -1 osd.4 2035 heartbeat_check: no reply from 192.168.100.112:6821 osd.7 since back 2019-
Oct 16 21:04:54 test-node2 ceph-osd[1710]: 2019-10-16 21:01:18.761 7f2475d15700 -1 osd.1 2035 heartbeat_check: no reply from 192.168.100.112:6821 osd.7 since back 2019-
Oct 16 21:04:54 test-node2 ceph-osd[1710]: 2019-10-16 21:02:34.021 7f2475d15700 -1 osd.1 2035 heartbeat_check: no reply from 192.168.100.113:6821 osd.8 since back 2019-
Oct 16 21:05:55 test-node2 ceph-osd[91197]: 2019-10-16 21:01:16.561 7f95aa3fc700 -1 osd.7 2035 heartbeat_check: no reply from 192.168.100.111:6805 osd.6 since back 2019
Oct 16 21:05:55 test-node2 ceph-osd[91197]: 2019-10-16 21:02:19.197 7f95aa3fc700 -1 osd.7 2035 heartbeat_check: no reply from 192.168.100.113:6821 osd.8 since back 2019
Oct 16 20:54:56 test-node2 systemd-networkd[907]: enp5s0: Failed to save LLDP data to /run/systemd/netif/lldp/2: No space left on device
Oct 16 21:07:37 test-node2 ceph-osd[1713]: 2019-10-16 21:02:15.621 7f7ef6010700 -1 osd.4 2035 heartbeat_check: no reply from 192.168.100.113:6821 osd.8 since back 2019-
Oct 16 21:08:12 test-node2 ceph-osd[1710]: *** Caught signal (Aborted) **
Oct 16 21:08:12 test-node2 ceph-osd[1710]:  in thread 7f245e4e6700 thread_name:tp_osd_tp
Oct 16 21:11:49 test-node2 ceph-osd[1710]: /build/ceph-14.2.4/src/common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(const ceph::heartbeat_handle_d*,
Oct 16 21:11:49 test-node2 ceph-osd[1710]: /build/ceph-14.2.4/src/common/HeartbeatMap.cc: 82: ceph_abort_msg("hit suicide timeout")

Oct 16 21:12:36 test-node2 ceph-osd[1710]:  ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus (stable)
Oct 16 21:12:36 test-node2 ceph-osd[1710]:  1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocato
Oct 16 21:34:05 test-node2 systemd[1]: systemd-journald.service: Main process exited, code=dumped, status=6/ABRT
Oct 16 21:34:05 test-node2 systemd[1]: systemd-journald.service: Failed with result 'watchdog'.
Oct 16 21:34:05 test-node2 systemd[1]: ceph-osd@1.service: Main process exited, code=killed, status=6/ABRT
Oct 16 21:34:05 test-node2 systemd[1]: ceph-osd@1.service: Failed with result 'signal'.
Oct 16 21:34:05 test-node2 systemd[1]: systemd-journald.service: Service has no hold-off time, scheduling restart.
Oct 16 21:34:05 test-node2 systemd[1]: systemd-journald.service: Scheduled restart job, restart counter is at 1.
Oct 16 21:34:05 test-node2 systemd[1]: ceph-osd@4.service: Main process exited, code=killed, status=9/KILL
Oct 16 21:34:05 test-node2 systemd[1]: ceph-osd@4.service: Failed with result 'signal'.
Oct 16 21:34:05 test-node2 systemd[1]: ceph-osd@1.service: Service hold-off time over, scheduling restart.
Oct 16 21:34:05 test-node2 systemd[1]: ceph-osd@1.service: Scheduled restart job, restart counter is at 1.
Oct 16 21:34:05 test-node2 systemd[1]: Started Ceph object storage daemon osd.4.
Oct 16 21:34:05 test-node2 systemd[1]: Started Ceph object storage daemon osd.1.
Oct 16 21:34:05 test-node2 systemd[1]: Started Ceph object storage daemon osd.7.

Oct 17 06:25:03 test-node2 ceph-osd[203415]: 2019-10-17 06:25:03.652 7f91880ca700 -1 Fail to open '/proc/204405/cmdline' error = (2) No such file or directory
Oct 17 06:25:03 test-node2 ceph-osd[203415]: 2019-10-17 06:25:03.680 7f91880ca700 -1 received  signal: Hangup from <unknown> (PID: 204405) UID: 0
Oct 17 06:25:03 test-node2 ceph-osd[203415]: 2019-10-17 06:25:03.736 7f91880ca700 -1 received  signal: Hangup from pkill -1 -x ceph-mon|ceph-mgr|ceph-mds|ceph-osd|ceph-
Oct 17 06:25:04 test-node2 ceph-osd[203417]: 2019-10-17 06:25:03.652 7f2ce27ac700 -1 Fail to open '/proc/204405/cmdline' error = (2) No such file or directory
Oct 17 06:25:04 test-node2 ceph-osd[203417]: 2019-10-17 06:25:03.692 7f2ce27ac700 -1 received  signal: Hangup from <unknown> (PID: 204405) UID: 0
Oct 17 06:25:04 test-node2 ceph-osd[203417]: 2019-10-17 06:25:03.736 7f2ce27ac700 -1 received  signal: Hangup from pkill -1 -x ceph-mon|ceph-mgr|ceph-mds|ceph-osd|ceph-
Oct 17 06:25:04 test-node2 ceph-osd[203416]: 2019-10-17 06:25:03.960 7f0a577ab700 -1 Fail to open '/proc/204406/cmdline' error = (2) No such file or directory
Oct 17 06:25:04 test-node2 ceph-osd[203416]: 2019-10-17 06:25:03.960 7f0a577ab700 -1 received  signal: Hangup from <unknown> (PID: 204406) UID: 0
Oct 17 06:25:04 test-node2 ceph-osd[203416]: 2019-10-17 06:25:03.960 7f0a577ab700 -1 Fail to open '/proc/204406/cmdline' error = (2) No such file or directory
Oct 17 06:25:04 test-node2 ceph-osd[203416]: 2019-10-17 06:25:03.960 7f0a577ab700 -1 received  signal: Hangup from <unknown> (PID: 204406) UID: 0

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
