With “ceph pg x.y query” you can check why it’s complaining; x.y is the PG id, e.g. 5.77. It would also be interesting to check why the mon fails to rejoin quorum, it may give you hints about your OSD issues.
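For example, something like this (5.77 is taken from your health detail; the other listed PGs work the same way):

    # the "recovery_state" section of the query output usually says what the PG is waiting for
    ceph pg 5.77 query

    # list which objects in that PG are unfound
    ceph pg 5.77 list_unfound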
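For the mon, I would start on the ceph-mon4 host itself, for example (the unit name below assumes a default non-containerized systemd deployment; adjust it if yours differs):

    # is the mon daemon running at all, and what has it logged recently?
    systemctl status ceph-mon@ceph-mon4
    journalctl -u ceph-mon@ceph-mon4 --since "2 hours ago"

    # if it runs but stays out of quorum, ask it for its own view over the admin socket
    ceph daemon mon.ceph-mon4 mon_status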
-
Etienne Menguy
etienne.menguy@xxxxxxxx

> On 29 Oct 2021, at 10:34, Michel Niyoyita <micou12@xxxxxxxxx> wrote:
>
> Hello Etienne
>
> This is the ceph -s output:
>
> root@ceph-mon1:~# ceph -s
>   cluster:
>     id:     43f5d6b4-74b0-4281-92ab-940829d3ee5e
>     health: HEALTH_ERR
>             1/3 mons down, quorum ceph-mon1,ceph-mon3
>             14/47681 objects unfound (0.029%)
>             1 scrub errors
>             Possible data damage: 13 pgs recovery_unfound, 1 pg inconsistent
>             Degraded data redundancy: 42/143043 objects degraded (0.029%), 13 pgs degraded
>             2 slow ops, oldest one blocked for 2897 sec, daemons [osd.0,osd.7] have slow ops.
>
>   services:
>     mon: 3 daemons, quorum ceph-mon1,ceph-mon3 (age 2h), out of quorum: ceph-mon4
>     mgr: ceph-mon1(active, since 25h), standbys: ceph-mon2
>     osd: 12 osds: 12 up (since 97m), 12 in (since 25h); 10 remapped pgs
>
>   data:
>     pools:   5 pools, 225 pgs
>     objects: 47.68k objects, 204 GiB
>     usage:   603 GiB used, 4.1 TiB / 4.7 TiB avail
>     pgs:     42/143043 objects degraded (0.029%)
>              2460/143043 objects misplaced (1.720%)
>              14/47681 objects unfound (0.029%)
>              211 active+clean
>              10  active+recovery_unfound+degraded+remapped
>              3   active+recovery_unfound+degraded
>              1   active+clean+inconsistent
>
>   io:
>     client: 2.0 KiB/s rd, 88 KiB/s wr, 2 op/s rd, 12 op/s wr
>
> On Fri, Oct 29, 2021 at 10:09 AM Etienne Menguy <etienne.menguy@xxxxxxxx> wrote:
> Hi,
>
> Please share “ceph -s” output.
>
> -
> Etienne Menguy
> etienne.menguy@xxxxxxxx
>
>> On 29 Oct 2021, at 10:03, Michel Niyoyita <micou12@xxxxxxxxx> wrote:
>>
>> Hello team
>>
>> I am running a Ceph cluster with 3 monitors and 4 OSD nodes running 3 OSDs each. I deployed the cluster with Ansible on Ubuntu 20.04; the Ceph version is Octopus. Yesterday the server hosting the OSD nodes restarted because of a power issue, and since it came back one of the monitors is out of quorum and some PGs are marked as damaged. Please help me solve this issue; below is the health detail output I am seeing. Note that 3 of the 4 OSD nodes are the same machines that run the monitors.
>>
>> Best regards.
>>
>> Michel
>>
>> root@ceph-mon1:~# ceph health detail
>> HEALTH_ERR 1/3 mons down, quorum ceph-mon1,ceph-mon3; 14/47195 objects unfound (0.030%); Possible data damage: 13 pgs recovery_unfound; Degraded data redundancy: 42/141585 objects degraded (0.030%), 13 pgs degraded; 2 slow ops, oldest one blocked for 322 sec, daemons [osd.0,osd.7] have slow ops.
>> [WRN] MON_DOWN: 1/3 mons down, quorum ceph-mon1,ceph-mon3
>>     mon.ceph-mon4 (rank 2) addr [v2:10.10.29.154:3300/0,v1:10.10.29.154:6789/0] is down (out of quorum)
>> [WRN] OBJECT_UNFOUND: 14/47195 objects unfound (0.030%)
>>     pg 5.77 has 1 unfound objects
>>     pg 5.6d has 2 unfound objects
>>     pg 5.6a has 1 unfound objects
>>     pg 5.65 has 1 unfound objects
>>     pg 5.4a has 1 unfound objects
>>     pg 5.30 has 1 unfound objects
>>     pg 5.28 has 1 unfound objects
>>     pg 5.25 has 1 unfound objects
>>     pg 5.19 has 1 unfound objects
>>     pg 5.1a has 1 unfound objects
>>     pg 5.1 has 1 unfound objects
>>     pg 5.b has 1 unfound objects
>>     pg 5.8 has 1 unfound objects
>> [ERR] PG_DAMAGED: Possible data damage: 13 pgs recovery_unfound
>>     pg 5.1 is active+recovery_unfound+degraded+remapped, acting [5,8,7], 1 unfound
>>     pg 5.8 is active+recovery_unfound+degraded+remapped, acting [6,11,8], 1 unfound
>>     pg 5.b is active+recovery_unfound+degraded+remapped, acting [7,0,5], 1 unfound
>>     pg 5.19 is active+recovery_unfound+degraded+remapped, acting [0,5,7], 1 unfound
>>     pg 5.1a is active+recovery_unfound+degraded, acting [10,11,8], 1 unfound
>>     pg 5.25 is active+recovery_unfound+degraded+remapped, acting [0,10,11], 1 unfound
>>     pg 5.28 is active+recovery_unfound+degraded+remapped, acting [6,11,8], 1 unfound
>>     pg 5.30 is active+recovery_unfound+degraded+remapped, acting [7,5,0], 1 unfound
>>     pg 5.4a is active+recovery_unfound+degraded, acting [0,11,7], 1 unfound
>>     pg 5.65 is active+recovery_unfound+degraded+remapped, acting [0,10,11], 1 unfound
>>     pg 5.6a is active+recovery_unfound+degraded, acting [0,11,7], 1 unfound
>>     pg 5.6d is active+recovery_unfound+degraded+remapped, acting [7,2,0], 2 unfound
>>     pg 5.77 is active+recovery_unfound+degraded+remapped, acting [5,6,8], 1 unfound
>> [WRN] PG_DEGRADED: Degraded data redundancy: 42/141585 objects degraded (0.030%), 13 pgs degraded
>>     pg 5.1 is active+recovery_unfound+degraded+remapped, acting [5,8,7], 1 unfound
>>     pg 5.8 is active+recovery_unfound+degraded+remapped, acting [6,11,8], 1 unfound
>>     pg 5.b is active+recovery_unfound+degraded+remapped, acting [7,0,5], 1 unfound
>>     pg 5.19 is active+recovery_unfound+degraded+remapped, acting [0,5,7], 1 unfound
>>     pg 5.1a is active+recovery_unfound+degraded, acting [10,11,8], 1 unfound
>>     pg 5.25 is active+recovery_unfound+degraded+remapped, acting [0,10,11], 1 unfound
>>     pg 5.28 is active+recovery_unfound+degraded+remapped, acting [6,11,8], 1 unfound
>>     pg 5.30 is active+recovery_unfound+degraded+remapped, acting [7,5,0], 1 unfound
>>     pg 5.4a is active+recovery_unfound+degraded, acting [0,11,7], 1 unfound
>>     pg 5.65 is active+recovery_unfound+degraded+remapped, acting [0,10,11], 1 unfound
>>     pg 5.6a is active+recovery_unfound+degraded, acting [0,11,7], 1 unfound
>>     pg 5.6d is active+recovery_unfound+degraded+remapped, acting [7,2,0], 2 unfound
>>     pg 5.77 is active+recovery_unfound+degraded+remapped, acting [5,6,8], 1 unfound
>> [WRN] SLOW_OPS: 2 slow ops, oldest one blocked for 322 sec, daemons [osd.0,osd.7] have slow ops.