Re: Cluster health error status

Hello Eugen

The failure domain is at the host level and the CRUSH rule is replicated_rule.
During troubleshooting I changed pool 5's pg_num from 32 to 128 to see whether
that would change anything, and the pool uses the default replica count (3).
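
For reference, those settings can be verified with the standard commands below
(<pool> is a placeholder for the name of pool 5):

ceph osd pool get <pool> size        # replica count (shows size: 3)
ceph osd pool get <pool> pg_num      # current pg_num (128 after the change)
ceph osd pool get <pool> crush_rule  # CRUSH rule assigned to the pool
ceph osd pool set <pool> pg_num 128  # the change applied during troubleshooting

Note that if the pg_autoscaler is enabled for the pool, it may adjust a manually
set pg_num again on its own.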

Thanks for your continuous help

On Fri, Oct 29, 2021 at 11:44 AM Etienne Menguy <etienne.menguy@xxxxxxxx>
wrote:

> Is there a way to force the mon to rejoin the quorum? I tried to restart it
> but nothing changed. I guess it is the cause, if I am not mistaken.
>
>
> No, but with quorum_status you can check the monitor status and whether it’s
> trying to join the quorum.
> You may have to use the daemon socket interface (asok file) to get info
> directly from this monitor.
>
>
> https://docs.ceph.com/en/latest/rados/operations/monitoring/#checking-monitor-status
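>
> For example (the asok path below is the package default; adjust it if your
> deployment places it elsewhere):
>
> ceph quorum_status --format json-pretty
> ceph daemon mon.ceph-mon4 mon_status        # run on the ceph-mon4 host
> # or point at the admin socket explicitly:
> ceph --admin-daemon /var/run/ceph/ceph-mon.ceph-mon4.asok mon_status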
>
> Which OSDs were down? As written by Eugen, having your CRUSH rule and
> failure domain would be useful. It’s unusual for a single host failure to
> trigger this issue.
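>
> You can grab both with:
>
> ceph osd tree              # host/OSD layout and up/down state
> ceph osd crush rule dump   # CRUSH rule definitions, including the failure domain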
>
>  I guess it is the cause, if I am not mistaken
>
> I don’t think the monitor issue is the root cause of the unfound objects. You
> could easily delete the monitor and deploy it again to “fix” your monitor
> quorum.
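>
> A rough sketch of that procedure (unit and path names below are the package
> defaults and may differ in an Ansible-based deployment):
>
> systemctl stop ceph-mon@ceph-mon4         # on the ceph-mon4 host
> ceph mon remove ceph-mon4                 # drop it from the monmap
> rm -rf /var/lib/ceph/mon/ceph-ceph-mon4   # wipe the old store before redeploying
> # then redeploy the monitor with the tooling used initially (ceph-ansible)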
>
>
> -
> Etienne Menguy
> etienne.menguy@xxxxxxxx
>
>
>
>
> On 29 Oct 2021, at 11:30, Michel Niyoyita <micou12@xxxxxxxxx> wrote:
>
> Dear Etienne
>
> Is there a way to force the mon to rejoin the quorum? I tried to restart it
> but nothing changed. I guess it is the cause, if I am not mistaken.
>
> below is the pg query output
>
>
> Regards
>
> On Fri, Oct 29, 2021 at 10:56 AM Etienne Menguy <etienne.menguy@xxxxxxxx>
> wrote:
>
>> With “ceph pg x.y query” you can check why it’s complaining.
>>
>> x.y is the PG ID, e.g. 5.77
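>>
>> For instance, for one of your unfound PGs:
>>
>> ceph pg 5.77 query          # the "recovery_state" section shows which OSDs were
>>                             # probed and why objects are still unfound
>> ceph pg 5.77 list_unfound   # lists the unfound objects in that PG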
>>
>> It would also be interesting to check why the mon fails to rejoin the quorum;
>> it may give you hints about your OSD issues.
>>
>> -
>> Etienne Menguy
>> etienne.menguy@xxxxxxxx
>>
>>
>>
>>
>> On 29 Oct 2021, at 10:34, Michel Niyoyita <micou12@xxxxxxxxx> wrote:
>>
>> Hello Etienne
>>
>> This is the ceph -s output
>>
>> root@ceph-mon1:~# ceph -s
>>   cluster:
>>     id:     43f5d6b4-74b0-4281-92ab-940829d3ee5e
>>     health: HEALTH_ERR
>>             1/3 mons down, quorum ceph-mon1,ceph-mon3
>>             14/47681 objects unfound (0.029%)
>>             1 scrub errors
>>             Possible data damage: 13 pgs recovery_unfound, 1 pg
>> inconsistent
>>             Degraded data redundancy: 42/143043 objects degraded
>> (0.029%), 13 pgs degraded
>>             2 slow ops, oldest one blocked for 2897 sec, daemons
>> [osd.0,osd.7] have slow ops.
>>
>>   services:
>>     mon: 3 daemons, quorum ceph-mon1,ceph-mon3 (age 2h), out of quorum:
>> ceph-mon4
>>     mgr: ceph-mon1(active, since 25h), standbys: ceph-mon2
>>     osd: 12 osds: 12 up (since 97m), 12 in (since 25h); 10 remapped pgs
>>
>>   data:
>>     pools:   5 pools, 225 pgs
>>     objects: 47.68k objects, 204 GiB
>>     usage:   603 GiB used, 4.1 TiB / 4.7 TiB avail
>>     pgs:     42/143043 objects degraded (0.029%)
>>              2460/143043 objects misplaced (1.720%)
>>              14/47681 objects unfound (0.029%)
>>              211 active+clean
>>              10  active+recovery_unfound+degraded+remapped
>>              3   active+recovery_unfound+degraded
>>              1   active+clean+inconsistent
>>
>>   io:
>>     client:   2.0 KiB/s rd, 88 KiB/s wr, 2 op/s rd, 12 op/s wr
>>
>> On Fri, Oct 29, 2021 at 10:09 AM Etienne Menguy <etienne.menguy@xxxxxxxx>
>> wrote:
>>
>>> Hi,
>>>
>>> Please share “ceph -s” output.
>>>
>>> -
>>> Etienne Menguy
>>> etienne.menguy@xxxxxxxx
>>>
>>>
>>>
>>>
>>> On 29 Oct 2021, at 10:03, Michel Niyoyita <micou12@xxxxxxxxx> wrote:
>>>
>>> Hello team
>>>
>>> I am running a Ceph cluster with 3 monitors and 4 OSD nodes running 3 OSDs
>>> each. I deployed the cluster with Ansible on Ubuntu 20.04, and the Ceph
>>> version is Octopus. Yesterday the server which hosts the OSD nodes restarted
>>> because of a power issue, and after it came back one of the monitors is out
>>> of quorum and some PGs are marked as damaged. Please help me solve this
>>> issue. Below is the health detail status I am seeing. The 4 OSD nodes are
>>> the same machines that run the monitors (3 of them).
>>>
>>> Best regards.
>>>
>>> Michel
>>>
>>>
>>> root@ceph-mon1:~# ceph health detail
>>> HEALTH_ERR 1/3 mons down, quorum ceph-mon1,ceph-mon3; 14/47195 objects
>>> unfound (0.030%); Possible data damage: 13 pgs recovery_unfound; Degraded
>>> data redundancy: 42/141585 objects degraded (0.030%), 13 pgs degraded; 2
>>> slow ops, oldest one blocked for 322 sec, daemons [osd.0,osd.7] have slow
>>> ops.
>>> [WRN] MON_DOWN: 1/3 mons down, quorum ceph-mon1,ceph-mon3
>>>    mon.ceph-mon4 (rank 2) addr [v2:
>>> 10.10.29.154:3300/0,v1:10.10.29.154:6789/0] is down (out of quorum)
>>> [WRN] OBJECT_UNFOUND: 14/47195 objects unfound (0.030%)
>>>    pg 5.77 has 1 unfound objects
>>>    pg 5.6d has 2 unfound objects
>>>    pg 5.6a has 1 unfound objects
>>>    pg 5.65 has 1 unfound objects
>>>    pg 5.4a has 1 unfound objects
>>>    pg 5.30 has 1 unfound objects
>>>    pg 5.28 has 1 unfound objects
>>>    pg 5.25 has 1 unfound objects
>>>    pg 5.19 has 1 unfound objects
>>>    pg 5.1a has 1 unfound objects
>>>    pg 5.1 has 1 unfound objects
>>>    pg 5.b has 1 unfound objects
>>>    pg 5.8 has 1 unfound objects
>>> [ERR] PG_DAMAGED: Possible data damage: 13 pgs recovery_unfound
>>>    pg 5.1 is active+recovery_unfound+degraded+remapped, acting [5,8,7], 1 unfound
>>>    pg 5.8 is active+recovery_unfound+degraded+remapped, acting [6,11,8], 1 unfound
>>>    pg 5.b is active+recovery_unfound+degraded+remapped, acting [7,0,5], 1 unfound
>>>    pg 5.19 is active+recovery_unfound+degraded+remapped, acting [0,5,7], 1 unfound
>>>    pg 5.1a is active+recovery_unfound+degraded, acting [10,11,8], 1 unfound
>>>    pg 5.25 is active+recovery_unfound+degraded+remapped, acting [0,10,11], 1 unfound
>>>    pg 5.28 is active+recovery_unfound+degraded+remapped, acting [6,11,8], 1 unfound
>>>    pg 5.30 is active+recovery_unfound+degraded+remapped, acting [7,5,0], 1 unfound
>>>    pg 5.4a is active+recovery_unfound+degraded, acting [0,11,7], 1 unfound
>>>    pg 5.65 is active+recovery_unfound+degraded+remapped, acting [0,10,11], 1 unfound
>>>    pg 5.6a is active+recovery_unfound+degraded, acting [0,11,7], 1 unfound
>>>    pg 5.6d is active+recovery_unfound+degraded+remapped, acting [7,2,0], 2 unfound
>>>    pg 5.77 is active+recovery_unfound+degraded+remapped, acting [5,6,8], 1 unfound
>>> [WRN] PG_DEGRADED: Degraded data redundancy: 42/141585 objects degraded (0.030%), 13 pgs degraded
>>>    pg 5.1 is active+recovery_unfound+degraded+remapped, acting [5,8,7], 1 unfound
>>>    pg 5.8 is active+recovery_unfound+degraded+remapped, acting [6,11,8], 1 unfound
>>>    pg 5.b is active+recovery_unfound+degraded+remapped, acting [7,0,5], 1 unfound
>>>    pg 5.19 is active+recovery_unfound+degraded+remapped, acting [0,5,7], 1 unfound
>>>    pg 5.1a is active+recovery_unfound+degraded, acting [10,11,8], 1 unfound
>>>    pg 5.25 is active+recovery_unfound+degraded+remapped, acting [0,10,11], 1 unfound
>>>    pg 5.28 is active+recovery_unfound+degraded+remapped, acting [6,11,8], 1 unfound
>>>    pg 5.30 is active+recovery_unfound+degraded+remapped, acting [7,5,0], 1 unfound
>>>    pg 5.4a is active+recovery_unfound+degraded, acting [0,11,7], 1 unfound
>>>    pg 5.65 is active+recovery_unfound+degraded+remapped, acting [0,10,11], 1 unfound
>>>    pg 5.6a is active+recovery_unfound+degraded, acting [0,11,7], 1 unfound
>>>    pg 5.6d is active+recovery_unfound+degraded+remapped, acting [7,2,0], 2 unfound
>>>    pg 5.77 is active+recovery_unfound+degraded+remapped, acting [5,6,8], 1 unfound
>>> [WRN] SLOW_OPS: 2 slow ops, oldest one blocked for 322 sec, daemons
>>> [osd.0,osd.7] have slow ops.
>>>
>>>
>>>
>>
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



