[Moving this to ceph-users@xxxxxxx]

This looks like https://tracker.ceph.com/issues/43365, which *looks* like it
is an issue with the standard libraries in Ubuntu 18.04.  One user said:

"After upgrading our monitor Ubuntu 18.04 packages (apt-get upgrade) with
the 5.3.0-26-generic kernel, it seems that the crashes have been fixed
(they run stable now for 8 days)."

Can you give that a try?

Also,

On Wed, 5 Feb 2020, Micha Ballmann wrote:
> Hi,
>
> I have a Proxmox Ceph cluster, VE 6.1-5.
>
> # ceph -v
> ceph version 14.2.6 (ba51347bdbe28c7c0e2e9172fa2983111137bb60) nautilus
> (stable)
>
> My problem: since version 14.2.6 I'm receiving the following messages
> nearly every day:
>
> # ceph -s
>
> ...
>
>     health: HEALTH_WARN
>             2 daemons have recently crashed
>
> ...
>
> I archived the messages:
>
> # ceph crash archive-all
>
> But one or two days later the same problem occurs. I'm trying to find
> out what the problem is.
>
> For example:
>
> # ceph crash info <ID>

Note that this ID is a hash of the stack trace and is meant to be a unique
signature for this crash/bug, but it contains no identifying information.
Yours is probably one of

  6a617f9d477ab8df2d068af0768ff741c68adabcc5c1ecb5dd3e9872d613c943
  dacbff55030f3d0837e58d8f4961441b6902d5750b0e1579682df5650c33d44d

Please consider turning on telemetry so we get this crash information
automatically:

  https://docs.ceph.com/docs/master/mgr/telemetry/

Thanks!
sage

> Node4
> {
>     "os_version_id": "10",
>     "utsname_machine": "x86_64",
>     "entity_name": "mon.promo4",
>     "backtrace": [
>         "(()+0x12730) [0x7f30ca142730]",
>         "(gsignal()+0x10b) [0x7f30c9c257bb]",
>         "(abort()+0x121) [0x7f30c9c10535]",
>         "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a3) [0x7f30cb27be79]",
>         "(()+0x282000) [0x7f30cb27c000]",
>         "(Paxos::store_state(MMonPaxos*)+0xaa8) [0x5602540626f8]",
>         "(Paxos::handle_commit(boost::intrusive_ptr<MonOpRequest>)+0x2ea) [0x560254062a5a]",
>         "(Paxos::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x223) [0x560254068213]",
>         "(Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x131c) [0x560253f9db1c]",
>         "(Monitor::_ms_dispatch(Message*)+0x4aa) [0x560253f9e10a]",
>         "(Monitor::ms_dispatch(Message*)+0x26) [0x560253fcda36]",
>         "(Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x26) [0x560253fc9f66]",
>         "(DispatchQueue::entry()+0x1a49) [0x7f30cb4b4e69]",
>         "(DispatchQueue::DispatchThread::entry()+0xd) [0x7f30cb5629ed]",
>         "(()+0x7fa3) [0x7f30ca137fa3]",
>         "(clone()+0x3f) [0x7f30c9ce74cf]"
>     ],
>     "process_name": "ceph-mon",
>     "assert_line": 485,
>     "archived": "2020-01-21 07:02:49.036123",
>     "assert_file": "/mnt/npool/tlamprecht/pve-ceph/ceph-14.2.6/src/common/ceph_time.h",
>     "utsname_sysname": "Linux",
>     "os_version": "10 (buster)",
>     "os_id": "10",
>     "assert_msg": "/mnt/npool/tlamprecht/pve-ceph/ceph-14.2.6/src/common/ceph_time.h: In function 'ceph::time_detail::timespan ceph::to_timespan(ceph::time_detail::signedspan)' thread 7f30c11fe700 time 2020-01-21 03:43:48.848411\n/mnt/npool/tlamprecht/pve-ceph/ceph-14.2.6/src/common/ceph_time.h: 485: FAILED ceph_assert(z >= signedspan::zero())\n",
>     "assert_func": "ceph::time_detail::timespan ceph::to_timespan(ceph::time_detail::signedspan)",
>     "ceph_version": "14.2.6",
>     "os_name": "Debian GNU/Linux 10 (buster)",
>     "timestamp": "2020-01-21 02:43:48.891122Z",
>     "assert_thread_name": "ms_dispatch",
>     "utsname_release": "5.3.13-1-pve",
>     "utsname_hostname": "promo4",
>     "crash_id": "2020-01-21_02:43:48.891122Z_0aade13c-463f-43fe-9b05-76ca71f6bc1b",
>     "assert_condition": "z >= signedspan::zero()",
>     "utsname_version": "#1 SMP PVE 5.3.13-1 (Thu, 05 Dec 2019 07:18:14 +0100)"
> }
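
A note on the assert in the dump above: ceph::to_timespan() converts a signed
duration into an unsigned timespan and asserts that the input is not negative,
so the monitor aborts as soon as it computes a time difference that
unexpectedly comes out below zero. That seems more consistent with the
clock/library issue referenced in the tracker above than with anything
obviously wrong in your setup. A minimal sketch of the shape of that check,
using simplified stand-in types rather than the real ceph_time.h definitions:

#include <cassert>
#include <chrono>
#include <cstdint>

// Stand-in types only: Ceph's signedspan is a signed duration and timespan
// an unsigned one; the aliases here are illustrative, not the real code.
using signedspan = std::chrono::duration<int64_t, std::nano>;
using timespan   = std::chrono::duration<uint64_t, std::nano>;

timespan to_timespan(signedspan z)
{
    // This is the condition that failed on the monitor:
    //   FAILED ceph_assert(z >= signedspan::zero())
    // i.e. the function was handed a negative duration, which can happen
    // when a clock reading that should only move forward appears to step
    // backwards.
    assert(z >= signedspan::zero());
    return std::chrono::duration_cast<timespan>(z);
}

int main()
{
    (void)to_timespan(signedspan(42));   // fine
    // to_timespan(signedspan(-1));      // would abort, like the mon crash
    return 0;
}
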
"assert_condition": "z >= signedspan::zero()", > "utsname_version": "#1 SMP PVE 5.3.13-1 (Thu, 05 Dec 2019 07:18:14 +0100)" > } > > Node2 > { > "os_version_id": "10", > "utsname_machine": "x86_64", > "entity_name": "mon.promo2", > "backtrace": [ > "(()+0x12730) [0x7f74f6c3f730]", > "(gsignal()+0x10b) [0x7f74f67227bb]", > "(abort()+0x121) [0x7f74f670d535]", > "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a3) > [0x7f74f7d78e79]", > "(()+0x282000) [0x7f74f7d79000]", > "(Paxos::store_state(MMonPaxos*)+0xaa8) [0x55b9540ae6f8]", > "(Paxos::handle_commit(boost::intrusive_ptr<MonOpRequest>)+0x2ea) > [0x55b9540aea5a]", > "(Paxos::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x223) > [0x55b9540b4213]", > "(Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x131c) > [0x55b953fe9b1c]", > "(Monitor::_ms_dispatch(Message*)+0x4aa) [0x55b953fea10a]", > "(Monitor::ms_dispatch(Message*)+0x26) [0x55b954019a36]", > "(Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x26) > [0x55b954015f66]", > "(DispatchQueue::entry()+0x1a49) [0x7f74f7fb1e69]", > "(DispatchQueue::DispatchThread::entry()+0xd) [0x7f74f805f9ed]", > "(()+0x7fa3) [0x7f74f6c34fa3]", > "(clone()+0x3f) [0x7f74f67e44cf]" > ], > "process_name": "ceph-mon", > "assert_line": 485, > "archived": "2020-01-21 07:02:49.041386", > "assert_file": > "/mnt/npool/tlamprecht/pve-ceph/ceph-14.2.6/src/common/ceph_time.h", > "utsname_sysname": "Linux", > "os_version": "10 (buster)", > "os_id": "10", > "assert_msg": > "/mnt/npool/tlamprecht/pve-ceph/ceph-14.2.6/src/common/ceph_time.h: In > function 'ceph::time_detail::timespan > ceph::to_timespan(ceph::time_detail::signedspan)' thread 7f74edcfb700 time > 2020-01-20 > 22:32:56.933800\n/mnt/npool/tlamprecht/pve-ceph/ceph-14.2.6/src/common/ceph_time.h: > 485: FAILED ceph_assert(z >= signedspan::zero())\n", > "assert_func": "ceph::time_detail::timespan > ceph::to_timespan(ceph::time_detail::signedspan)", > "ceph_version": "14.2.6", > "os_name": "Debian GNU/Linux 10 (buster)", > "timestamp": "2020-01-20 21:32:56.947402Z", > "assert_thread_name": "ms_dispatch", > "utsname_release": "5.3.13-1-pve", > "utsname_hostname": "promo2", > "crash_id": > "2020-01-20_21:32:56.947402Z_3ae7220c-23c9-478a-a22d-626c2fa34414", > "assert_condition": "z >= signedspan::zero()", > "utsname_version": "#1 SMP PVE 5.3.13-1 (Thu, 05 Dec 2019 07:18:14 +0100)" > } > > Is there a problem with my NTP? Im syncing my time with CHRONY to my local NTP > Server. > > It would be nice if you can help. > > I have to say my ceph cluster is clean and works without any issue. All OSDs > are up and after ceph crash archive-all; cephs -s says; HEALTH_OK > > Regards > > Micha > > _______________________________________________ ceph-users mailing list -- ceph-users@xxxxxxx To unsubscribe send an email to ceph-users-leave@xxxxxxx