On 10/01/2020 10:41, Ashley Merrick wrote:
> Once you have fixed the issue you need to mark / archive the crash
> entries as seen here: https://docs.ceph.com/docs/master/mgr/crash/

Hi Ashley,

thanks, I didn't know this before... It turned out there were quite a
few old crashes (since I never archived them), and of the three most
recent ones, two were like this:

"assert_msg": "/build/ceph-14.2.5/src/common/ceph_time.h: In function
'ceph::time_detail::timespan ceph::to_timespan(ceph::time_detail::signedspan)'
thread 7fbda425a700 time 2020-01-02 17:37:56.885082\n
/build/ceph-14.2.5/src/common/ceph_time.h: 485: FAILED
ceph_assert(z >= signedspan::zero())\n",

And another one was too big to paste here ;-)

I did a `ceph crash archive-all` and now ceph is OK again :-)

Cheers

/Simon

> ---- On Fri, 10 Jan 2020 17:37:47 +0800 *Simon Oosthoek
> <s.oosthoek@xxxxxxxxxxxxx>* wrote ----
>
> Hi,
>
> last week I upgraded our ceph to 14.2.5 (from 14.2.4) and either during
> the procedure or shortly after that, some osds crashed. I re-initialised
> them and thought that would be enough to fix everything.
>
> I looked a bit further and I do see a lot of lines like this (which are
> worrying, I suppose):
>
> ceph.log:2020-01-10 10:06:41.049879 mon.cephmon3 (mon.0) 234423 :
> cluster [DBG] osd.97 reported immediately failed by osd.67
>
> osd.109
> osd.133
> osd.139
> osd.111
> osd.38
> osd.65
> osd.38
> osd.65
> osd.97
>
> Now everything seems to be OK, but the WARN status remains. Is this a
> "feature" of 14.2.5 or am I missing something?
>
> Below is the output of `ceph -s`:
>
> Cheers
>
> /Simon
>
> 10:13 [root@cephmon1 ~]# ceph -s
>   cluster:
>     id:     b489547c-ba50-4745-a914-23eb78e0e5dc
>     health: HEALTH_WARN
>             3 daemons have recently crashed
>
>   services:
>     mon: 3 daemons, quorum cephmon3,cephmon1,cephmon2 (age 27h)
>     mgr: cephmon3(active, since 27h), standbys: cephmon1, cephmon2
>     mds: cephfs:1 {0=cephmds1=up:active} 1 up:standby
>     osd: 168 osds: 168 up (since 6m), 168 in (since 3d); 11 remapped pgs
>
>   data:
>     pools:   10 pools, 5216 pgs
>     objects: 167.61M objects, 134 TiB
>     usage:   245 TiB used, 1.5 PiB / 1.8 PiB avail
>     pgs:     1018213/1354096231 objects misplaced (0.075%)
>              5203 active+clean
>              10   active+remapped+backfill_wait
>              2    active+clean+scrubbing+deep
>              1    active+remapped+backfilling
>
>   io:
>     client:   149 MiB/s wr, 0 op/s rd, 55 op/s wr
>     recovery: 0 B/s, 30 objects/s

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
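For anyone who lands on this thread with the same "daemons have recently
crashed" warning: the workflow behind the docs link above comes down to a
few mgr commands. A minimal sketch, assuming Nautilus 14.2.5 or later with
the crash module enabled; `<crash-id>` is a placeholder for an ID as
printed by the first command:

    # list crashes that have not been acknowledged yet
    # (these are what trigger "N daemons have recently crashed")
    ceph crash ls-new

    # show the full report for one crash, including the assert_msg
    ceph crash info <crash-id>

    # acknowledge a single crash so it stops raising the warning
    ceph crash archive <crash-id>

    # or acknowledge everything at once, as done above
    ceph crash archive-all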