Re: Ceph nvme timeout and then aborting

zxcs <zhuxiongcs@xxxxxxx> · Fri, 19 Feb 2021 18:22:00 +0800

BTW, I actually have two nodes with the same issue. The nvme output from the other failing node is below:

Smart Log for NVME device:nvme0n1 namespace-id:ffffffff
critical_warning                    : 0
temperature                         : 29 C
available_spare                     : 100%
available_spare_threshold           : 10%
percentage_used                     : 1%
data_units_read                     : 592,340,175
data_units_written                  : 26,443,352
host_read_commands                  : 5,341,278,662
host_write_commands                 : 515,730,885
controller_busy_time                : 14,052
power_cycles                        : 8
power_on_hours                      : 4,294
unsafe_shutdowns                    : 6
media_errors                        : 0
num_err_log_entries                 : 0
Warning Temperature Time            : 0
Critical Composite Temperature Time : 0
Temperature Sensor 1                : 29 C
Temperature Sensor 2                : 46 C
Temperature Sensor 3                : 0 C
Temperature Sensor 4                : 0 C
Temperature Sensor 5                : 0 C
Temperature Sensor 6                : 0 C
Temperature Sensor 7                : 0 C
Temperature Sensor 8                : 0 C

For comparison, here is the nvme output from a healthy node:

Smart Log for NVME device:nvme0n1 namespace-id:ffffffff
critical_warning                    : 0
temperature                         : 27 C
available_spare                     : 100%
available_spare_threshold           : 10%
percentage_used                     : 1%
data_units_read                     : 579,829,652
data_units_written                  : 28,271,336
host_read_commands                  : 5,237,750,233
host_write_commands                 : 518,979,861
controller_busy_time                : 14,166
power_cycles                        : 3
power_on_hours                      : 4,252
unsafe_shutdowns                    : 1
media_errors                        : 0
num_err_log_entries                 : 0
Warning Temperature Time            : 0
Critical Composite Temperature Time : 0
Temperature Sensor 1                : 27 C
Temperature Sensor 2                : 39 C
Temperature Sensor 3                : 0 C
Temperature Sensor 4                : 0 C
Temperature Sensor 5                : 0 C
Temperature Sensor 6                : 0 C
Temperature Sensor 7                : 0 C
Temperature Sensor 8                : 0 C
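
In case it helps to compare nodes field by field, here is a rough Python sketch that parses the plain-text `nvme smart-log` output shown above and lists the counters that differ. The function names and the choice of fields are only illustrative assumptions, not part of nvme-cli, and the sample text is a subset of the two outputs in this mail.

```python
import re

# Matches "field_name      : value" lines. The "Smart Log for ..." header
# has no whitespace-delimited " : ", so it is skipped.
FIELD_RE = re.compile(r"^\s*(.+?)\s+:\s+(.*)$")


def parse_smart_log(text):
    """Parse plain-text smart-log output into {field: int}."""
    fields = {}
    for line in text.splitlines():
        m = FIELD_RE.match(line)
        if not m:
            continue
        # Take the leading number and drop thousands separators,
        # ignoring unit suffixes such as "%" or " C".
        num = re.match(r"[\d,]+", m.group(2))
        if num:
            fields[m.group(1)] = int(num.group().replace(",", ""))
    return fields


def diff_fields(a, b, names):
    """Return {field: (a_value, b_value)} for the named fields that differ."""
    return {n: (a.get(n), b.get(n)) for n in names if a.get(n) != b.get(n)}


# Subset of the failing-node and healthy-node outputs from this mail.
FAILING = """Smart Log for NVME device:nvme0n1 namespace-id:ffffffff
critical_warning                    : 0
power_cycles                        : 8
unsafe_shutdowns                    : 6
media_errors                        : 0
num_err_log_entries                 : 0
"""

HEALTHY = """Smart Log for NVME device:nvme0n1 namespace-id:ffffffff
critical_warning                    : 0
power_cycles                        : 3
unsafe_shutdowns                    : 1
media_errors                        : 0
num_err_log_entries                 : 0
"""

# Hypothetical selection of fields worth comparing across nodes.
WATCHED = ["critical_warning", "power_cycles", "unsafe_shutdowns",
           "media_errors", "num_err_log_entries"]

failing = parse_smart_log(FAILING)
healthy = parse_smart_log(HEALTHY)
for name, (bad, good) in diff_fields(failing, healthy, WATCHED).items():
    print(f"{name}: failing={bad} healthy={good}")
```

With the numbers above, only power_cycles and unsafe_shutdowns differ. That matches the reboots I had to do on the failing nodes. The drive itself reports no media errors and no error log entries.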

Thanks,
zx

> On 19 Feb 2021, at 18:08, zxcs <zhuxiongcs@xxxxxxx> wrote:
> 
> Thank you very much, Konstantin!
> 
> Here is the output of `nvme smart-log /dev/nvme0n1`
> 
> Smart Log for NVME device:nvme0n1 namespace-id:ffffffff
> critical_warning                    : 0
> temperature                         : 27 C
> available_spare                     : 100%
> available_spare_threshold           : 10%
> percentage_used                     : 1%
> data_units_read                     : 602,417,903
> data_units_written                  : 24,350,864
> host_read_commands                  : 5,610,227,794
> host_write_commands                 : 519,030,512
> controller_busy_time                : 14,356
> power_cycles                        : 7
> power_on_hours                      : 4,256
> unsafe_shutdowns                    : 5
> media_errors                        : 0
> num_err_log_entries                 : 0
> Warning Temperature Time            : 0
> Critical Composite Temperature Time : 0
> Temperature Sensor 1                : 27 C
> Temperature Sensor 2                : 41 C
> Temperature Sensor 3                : 0 C
> Temperature Sensor 4                : 0 C
> Temperature Sensor 5                : 0 C
> Temperature Sensor 6                : 0 C
> Temperature Sensor 7                : 0 C
> Temperature Sensor 8                : 0 C
> 
> 
> Thanks,
> 
> zx
> 
>> On 19 Feb 2021, at 18:01, Konstantin Shalygin <k0ste@xxxxxxxx> wrote:
>> 
>> Please paste your `nvme smart-log /dev/nvme0n1` output
>> 
>> 
>> 
>> k
>> 
>>> On 19 Feb 2021, at 12:53, zxcs <zhuxiongcs@xxxxxxx> wrote:
>>> 
>>> I have one Ceph cluster running Nautilus 14.2.10, and each node has 3 SSDs and 4 HDDs.
>>> Each node also has two NVMe devices used as cache (nvme0n1 caches SSDs 0-2 and nvme1n1 caches HDDs 3-7).
>>> 
>>> On one node, nvme0n1 keeps hitting the issue below (see nvme ... I/O ... timeout, aborting), and then nvme0n1 suddenly disappears.
>>> After that I need to reboot the node to recover.
>>> Has anyone hit the same issue, and how can I solve it? Any suggestions are welcome. Thanks in advance!
>>> I googled the issue and found the link below, but it did not help:
>>> https://askubuntu.com/questions/981657/cannot-suspend-with-nvme-m-2-ssd
>>> 
>>> From syslog
>>> Feb 19 01:31:52 ip kernel: [1275313.393211] nvme 0000:03:00.0: I/O 949 QID 12 timeout, aborting
>>> Feb 19 01:31:53 ip kernel: [1275314.389232] nvme 0000:03:00.0: I/O 728 QID 5 timeout, aborting
>>> Feb 19 01:31:53 ip kernel: [1275314.389247] nvme 0000:03:00.0: I/O 515 QID 7 timeout, aborting
>>> Feb 19 01:31:53 ip kernel: [1275314.389252] nvme 0000:03:00.0: I/O 516 QID 7 timeout, aborting
>>> Feb 19 01:31:53 ip kernel: [1275314.389257] nvme 0000:03:00.0: I/O 517 QID 7 timeout, aborting
>>> Feb 19 01:31:53 ip kernel: [1275314.389263] nvme 0000:03:00.0: I/O 82 QID 9 timeout, aborting
>>> Feb 19 01:31:53 ip kernel: [1275314.389271] nvme 0000:03:00.0: I/O 853 QID 13 timeout, aborting
>>> Feb 19 01:31:53 ip kernel: [1275314.389275] nvme 0000:03:00.0: I/O 854 QID 13 timeout, aborting
>>> Feb 19 01:32:23 ip kernel: [1275344.401708] nvme 0000:03:00.0: I/O 728 QID 5 timeout, reset controller
>>> Feb 19 01:32:52 ip kernel: [1275373.394112] nvme 0000:03:00.0: I/O 0 QID 0 timeout, reset controller
>>> Feb 19 01:33:53 ip ceph-osd[3179]: /build/ceph-14.2.10/src/common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(const ceph::heartbeat_handle_d*, const char*, ceph::time_detail::coarse_mono_clock::rep)' thread 7f36c03fb700 time 2021-02-19 01:33:53.436018
>>> Feb 19 01:33:53 ip ceph-osd[3179]: /build/ceph-14.2.10/src/common/HeartbeatMap.cc: 82: ceph_abort_msg("hit suicide timeout")
>>> Feb 19 01:33:53 ip ceph-osd[3179]:  ceph version 14.2.10 (b340acf629a010a74d90da5782a2c5fe0b54ac20) nautilus (stable)
>>> Feb 19 01:33:53 ip ceph-osd[3179]:  1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xdf) [0x83eb8c]
>>> Feb 19 01:33:53 ip ceph-osd[3179]:  2: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, unsigned long)+0x4a5) [0xec56f5]
>>> Feb 19 01:33:53 ip ceph-osd[3179]:  3: (ceph::HeartbeatMap::is_healthy()+0x106) [0xec6846]
>>> Feb 19 01:33:53 ip ceph-osd[3179]:  4: (OSD::handle_osd_ping(MOSDPing*)+0x67c) [0x8aaf0c]
>>> Feb 19 01:33:53 ip ceph-osd[3179]:  5: (OSD::heartbeat_dispatch(Message*)+0x1eb) [0x8b3f4b]
>>> Feb 19 01:33:53 ip ceph-osd[3179]:  6: (DispatchQueue::fast_dispatch(boost::intrusive_ptr<Message> const&)+0x27d) [0x12456bd]
>>> Feb 19 01:33:53 ip ceph-osd[3179]:  7: (ProtocolV2::handle_message()+0x9d6) [0x129b4e6]
>>> Feb 19 01:33:53 ip ceph-osd[3179]:  8: (ProtocolV2::handle_read_frame_dispatch()+0x160) [0x12ad330]
>>> Feb 19 01:33:53 ip ceph-osd[3179]:  9: (ProtocolV2::handle_read_frame_epilogue_main(std::unique_ptr<ceph::buffer::v14_2_0::ptr_node, ceph::buffer::v14_2_0::ptr_node::disposer>&&, int)+0x178) [0x12ad598]
>>> Feb 19 01:33:53 ip ceph-osd[3179]:  10: (ProtocolV2::run_continuation(Ct<ProtocolV2>&)+0x34) [0x12956b4]
>>> Feb 19 01:33:53 ip ceph-osd[3179]:  11: (AsyncConnection::process()+0x186) [0x126f446]
>>> Feb 19 01:33:53 ip ceph-osd[3179]:  12: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0x7cd) [0x10b14cd]
>>> Feb 19 01:33:53 ip ceph-osd[3179]:  13: /usr/bin/ceph-osd() [0x10b3fd8]
>>> Feb 19 01:33:53 ip ceph-osd[3179]:  14: /usr/bin/ceph-osd() [0x162b59f]
>>> Feb 19 01:33:53 ip ceph-osd[3179]:  15: (()+0x76ba) [0x7f36c2ed46ba]
>>> Feb 19 01:33:53 ip ceph-osd[3179]:  16: (clone()+0x6d) [0x7f36c24db4dd]
>>> Feb 19 01:33:53 ip ceph-osd[3179]: *** Caught signal (Aborted) **
>> 
>> _______________________________________________
>> ceph-users mailing list -- ceph-users@xxxxxxx
>> To unsubscribe send an email to ceph-users-leave@xxxxxxx