Hi Michel,

This "Structure needs cleaning" error means that your file system is corrupted; you should try "fsck".

Weiwen Hu

From: Michel Niyoyita<mailto:micou12@xxxxxxxxx>
Sent: 29 October 2021 20:10
To: Etienne Menguy<mailto:etienne.menguy@xxxxxxxx>
Cc: ceph-users<mailto:ceph-users@xxxxxxx>
Subject: Re: Cluster Health error's status

Hello team

Below is the error I am getting when I try to redeploy the same cluster:

TASK [ceph-mon : recursively fix ownership of monitor directory] ******************************************************************************************************************************************************
Friday 29 October 2021  12:07:18 +0000 (0:00:00.411)       0:01:41.157 ********
ok: [ceph-mon1]
ok: [ceph-mon2]
ok: [ceph-mon3]
An exception occurred during task execution. To see the full traceback, use -vvv. The error was: OSError: [Errno 117] Structure needs cleaning: b'/var/lib/ceph/mon/ceph-ceph-mon4/store.db/216815.log'
fatal: [ceph-mon4]: FAILED! => changed=false
  module_stderr: |-
    Traceback (most recent call last):
      File "<stdin>", line 102, in <module>
      File "<stdin>", line 94, in _ansiballz_main
      File "<stdin>", line 40, in invoke_module
      File "/usr/lib/python3.8/runpy.py", line 207, in run_module
        return _run_module_code(code, init_globals, run_name, mod_spec)
      File "/usr/lib/python3.8/runpy.py", line 97, in _run_module_code
        _run_code(code, mod_globals, init_globals,
      File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
        exec(code, run_globals)
      File "/tmp/ansible_file_payload_uucq1oqa/ansible_file_payload.zip/ansible/modules/files/file.py", line 940, in <module>
      File "/tmp/ansible_file_payload_uucq1oqa/ansible_file_payload.zip/ansible/modules/files/file.py", line 926, in main
      File "/tmp/ansible_file_payload_uucq1oqa/ansible_file_payload.zip/ansible/modules/files/file.py", line 665, in ensure_directory
      File "/tmp/ansible_file_payload_uucq1oqa/ansible_file_payload.zip/ansible/modules/files/file.py", line 340, in recursive_set_attributes
      File "/tmp/ansible_file_payload_uucq1oqa/ansible_file_payload.zip/ansible/module_utils/basic.py", line 1335, in set_fs_attributes_if_different
      File "/tmp/ansible_file_payload_uucq1oqa/ansible_file_payload.zip/ansible/module_utils/basic.py", line 988, in set_owner_if_different
      File "/tmp/ansible_file_payload_uucq1oqa/ansible_file_payload.zip/ansible/module_utils/basic.py", line 883, in user_and_group
    OSError: [Errno 117] Structure needs cleaning: b'/var/lib/ceph/mon/ceph-ceph-mon4/store.db/216815.log'
  module_stdout: ''
  msg: |-
    MODULE FAILURE
    See stdout/stderr for the exact error
  rc: 1

On Fri, Oct 29, 2021 at 12:37 PM Etienne Menguy <etienne.menguy@xxxxxxxx> wrote:

> Have you tried to restart one of the OSDs that seems to be blocking PG recovery?
>
> I don't think increasing PGs can help.
> -
> Etienne Menguy
> etienne.menguy@xxxxxxxx
>
>
> On 29 Oct 2021, at 11:53, Michel Niyoyita <micou12@xxxxxxxxx> wrote:
>
> Hello Eugen
>
> The failure_domain is host level and the crush rule is replicated_rule. During
> troubleshooting I changed pool 5's PG count from 32 to 128 to see if anything
> changed, and it has the default replica count (3).
>
> Thanks for your continuous help
>
> On Fri, Oct 29, 2021 at 11:44 AM Etienne Menguy <etienne.menguy@xxxxxxxx>
> wrote:
>
>> Is there a way you can force the mon to rejoin the quorum? I tried to
>> restart it but nothing changed. I guess it is the cause, if I am not
>> mistaken.
>>
>> No, but with quorum_status you can check the monitor's status and whether
>> it is trying to join the quorum.
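
As an illustrative sketch (not from the thread itself), the quorum check mentioned above can be run from any node that has an admin keyring:

ceph quorum_status --format json-pretty    # lists the current monmap and the quorum_names of mons in quorum
ceph mon stat                              # one-line summary of which mons are in or out of quorum

A monitor that is out of quorum (ceph-mon4 here) will be missing from quorum_names in the first command's output.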
>> You may have to use the daemon socket interface (asok file) to get
>> info for this monitor directly.
>>
>> https://docs.ceph.com/en/latest/rados/operations/monitoring/#checking-monitor-status
>>
>> Which OSDs were down? As written by Eugen, having the crush rule and
>> failure domain would be useful. It's unusual that a single host failure
>> triggers this issue.
>>
>> I guess it is the cause, if I am not mistaken
>>
>> I don't think the monitor issue is the root cause of the unfound objects. You
>> could easily delete the monitor and deploy it again to "fix" your monitor
>> quorum.
>>
>> -
>> Etienne Menguy
>> etienne.menguy@xxxxxxxx
>>
>>
>> On 29 Oct 2021, at 11:30, Michel Niyoyita <micou12@xxxxxxxxx> wrote:
>>
>> Dear Etienne
>>
>> Is there a way you can force the mon to rejoin the quorum? I tried to
>> restart it but nothing changed. I guess it is the cause, if I am not
>> mistaken.
>>
>> Below is the pg query output
>>
>> Regards
>>
>> On Fri, Oct 29, 2021 at 10:56 AM Etienne Menguy <etienne.menguy@xxxxxxxx>
>> wrote:
>>
>>> With "ceph pg x.y query" you can check why it's complaining.
>>>
>>> x.y is the pg id, e.g. 5.77
>>>
>>> It would also be interesting to check why the mon fails to rejoin the
>>> quorum; it may give you hints about your OSD issues.
>>>
>>> -
>>> Etienne Menguy
>>> etienne.menguy@xxxxxxxx
>>>
>>>
>>> On 29 Oct 2021, at 10:34, Michel Niyoyita <micou12@xxxxxxxxx> wrote:
>>>
>>> Hello Etienne
>>>
>>> This is the ceph -s output
>>>
>>> root@ceph-mon1:~# ceph -s
>>>   cluster:
>>>     id:     43f5d6b4-74b0-4281-92ab-940829d3ee5e
>>>     health: HEALTH_ERR
>>>             1/3 mons down, quorum ceph-mon1,ceph-mon3
>>>             14/47681 objects unfound (0.029%)
>>>             1 scrub errors
>>>             Possible data damage: 13 pgs recovery_unfound, 1 pg inconsistent
>>>             Degraded data redundancy: 42/143043 objects degraded (0.029%), 13 pgs degraded
>>>             2 slow ops, oldest one blocked for 2897 sec, daemons [osd.0,osd.7] have slow ops.
>>>
>>>   services:
>>>     mon: 3 daemons, quorum ceph-mon1,ceph-mon3 (age 2h), out of quorum: ceph-mon4
>>>     mgr: ceph-mon1(active, since 25h), standbys: ceph-mon2
>>>     osd: 12 osds: 12 up (since 97m), 12 in (since 25h); 10 remapped pgs
>>>
>>>   data:
>>>     pools:   5 pools, 225 pgs
>>>     objects: 47.68k objects, 204 GiB
>>>     usage:   603 GiB used, 4.1 TiB / 4.7 TiB avail
>>>     pgs:     42/143043 objects degraded (0.029%)
>>>              2460/143043 objects misplaced (1.720%)
>>>              14/47681 objects unfound (0.029%)
>>>              211 active+clean
>>>              10  active+recovery_unfound+degraded+remapped
>>>              3   active+recovery_unfound+degraded
>>>              1   active+clean+inconsistent
>>>
>>>   io:
>>>     client:   2.0 KiB/s rd, 88 KiB/s wr, 2 op/s rd, 12 op/s wr
>>>
>>> On Fri, Oct 29, 2021 at 10:09 AM Etienne Menguy <etienne.menguy@xxxxxxxx>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> Please share "ceph -s" output.
>>>>
>>>> -
>>>> Etienne Menguy
>>>> etienne.menguy@xxxxxxxx
>>>>
>>>>
>>>> On 29 Oct 2021, at 10:03, Michel Niyoyita <micou12@xxxxxxxxx> wrote:
>>>>
>>>> Hello team
>>>>
>>>> I am running a Ceph cluster with 3 monitors and 4 OSD nodes running 3 OSDs
>>>> each. I deployed the cluster with ansible, using Ubuntu 20.04 as the OS;
>>>> the Ceph version is Octopus. Yesterday the server hosting my OSD nodes
>>>> restarted because of a power issue, and after it came back one of the
>>>> monitors is out of quorum and some PGs are marked as damaged. Please help
>>>> me to solve this issue. Below is the health detail status I am seeing; the
>>>> 4 OSD nodes are the same machines that run the monitors (3 of them).
>>>>
>>>> Best regards.
>>>>
>>>> Michel
>>>>
>>>>
>>>> root@ceph-mon1:~# ceph health detail
>>>> HEALTH_ERR 1/3 mons down, quorum ceph-mon1,ceph-mon3; 14/47195 objects
>>>> unfound (0.030%); Possible data damage: 13 pgs recovery_unfound; Degraded
>>>> data redundancy: 42/141585 objects degraded (0.030%), 13 pgs degraded; 2
>>>> slow ops, oldest one blocked for 322 sec, daemons [osd.0,osd.7] have slow ops.
>>>> [WRN] MON_DOWN: 1/3 mons down, quorum ceph-mon1,ceph-mon3
>>>>     mon.ceph-mon4 (rank 2) addr [v2:10.10.29.154:3300/0,v1:10.10.29.154:6789/0] is down (out of quorum)
>>>> [WRN] OBJECT_UNFOUND: 14/47195 objects unfound (0.030%)
>>>>     pg 5.77 has 1 unfound objects
>>>>     pg 5.6d has 2 unfound objects
>>>>     pg 5.6a has 1 unfound objects
>>>>     pg 5.65 has 1 unfound objects
>>>>     pg 5.4a has 1 unfound objects
>>>>     pg 5.30 has 1 unfound objects
>>>>     pg 5.28 has 1 unfound objects
>>>>     pg 5.25 has 1 unfound objects
>>>>     pg 5.19 has 1 unfound objects
>>>>     pg 5.1a has 1 unfound objects
>>>>     pg 5.1 has 1 unfound objects
>>>>     pg 5.b has 1 unfound objects
>>>>     pg 5.8 has 1 unfound objects
>>>> [ERR] PG_DAMAGED: Possible data damage: 13 pgs recovery_unfound
>>>>     pg 5.1 is active+recovery_unfound+degraded+remapped, acting [5,8,7], 1 unfound
>>>>     pg 5.8 is active+recovery_unfound+degraded+remapped, acting [6,11,8], 1 unfound
>>>>     pg 5.b is active+recovery_unfound+degraded+remapped, acting [7,0,5], 1 unfound
>>>>     pg 5.19 is active+recovery_unfound+degraded+remapped, acting [0,5,7], 1 unfound
>>>>     pg 5.1a is active+recovery_unfound+degraded, acting [10,11,8], 1 unfound
>>>>     pg 5.25 is active+recovery_unfound+degraded+remapped, acting [0,10,11], 1 unfound
>>>>     pg 5.28 is active+recovery_unfound+degraded+remapped, acting [6,11,8], 1 unfound
>>>>     pg 5.30 is active+recovery_unfound+degraded+remapped, acting [7,5,0], 1 unfound
>>>>     pg 5.4a is active+recovery_unfound+degraded, acting [0,11,7], 1 unfound
>>>>     pg 5.65 is active+recovery_unfound+degraded+remapped, acting [0,10,11], 1 unfound
>>>>     pg 5.6a is active+recovery_unfound+degraded, acting [0,11,7], 1 unfound
>>>>     pg 5.6d is active+recovery_unfound+degraded+remapped, acting [7,2,0], 2 unfound
>>>>     pg 5.77 is active+recovery_unfound+degraded+remapped, acting [5,6,8], 1 unfound
>>>> [WRN] PG_DEGRADED: Degraded data redundancy: 42/141585 objects degraded (0.030%), 13 pgs degraded
>>>>     pg 5.1 is active+recovery_unfound+degraded+remapped, acting [5,8,7], 1 unfound
>>>>     pg 5.8 is active+recovery_unfound+degraded+remapped, acting [6,11,8], 1 unfound
>>>>     pg 5.b is active+recovery_unfound+degraded+remapped, acting [7,0,5], 1 unfound
>>>>     pg 5.19 is active+recovery_unfound+degraded+remapped, acting [0,5,7], 1 unfound
>>>>     pg 5.1a is active+recovery_unfound+degraded, acting [10,11,8], 1 unfound
>>>>     pg 5.25 is active+recovery_unfound+degraded+remapped, acting [0,10,11], 1 unfound
>>>>     pg 5.28 is active+recovery_unfound+degraded+remapped, acting [6,11,8], 1 unfound
>>>>     pg 5.30 is active+recovery_unfound+degraded+remapped, acting [7,5,0], 1 unfound
>>>>     pg 5.4a is active+recovery_unfound+degraded, acting [0,11,7], 1 unfound
>>>>     pg 5.65 is active+recovery_unfound+degraded+remapped, acting [0,10,11], 1 unfound
>>>>     pg 5.6a is active+recovery_unfound+degraded, acting [0,11,7], 1 unfound
>>>>     pg 5.6d is active+recovery_unfound+degraded+remapped, acting [7,2,0], 2 unfound
>>>>     pg 5.77 is active+recovery_unfound+degraded+remapped, acting [5,6,8], 1 unfound
>>>> [WRN] SLOW_OPS: 2 slow ops, oldest one blocked for 322 sec, daemons [osd.0,osd.7] have slow ops.
>>>> _______________________________________________
>>>> ceph-users mailing list -- ceph-users@xxxxxxx
>>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>>>>
>>>>
>>>
>>
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
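
As a rough sketch of the filesystem check Weiwen suggests at the top of the thread (the systemd unit name assumes a non-containerized ceph-ansible deployment, and the device is a placeholder to be replaced with whatever findmnt reports on ceph-mon4):

systemctl stop ceph-mon@ceph-mon4                # stop the monitor that owns the damaged store
findmnt -T /var/lib/ceph/mon/ceph-ceph-mon4      # identify the filesystem and device holding the mon store
umount /var/lib/ceph                             # only possible if the store sits on its own filesystem
fsck -y /dev/sdX1                                # run the check against the device reported by findmnt
mount /var/lib/ceph
systemctl start ceph-mon@ceph-mon4

If the store lives on the root filesystem it cannot be unmounted online, so the check would have to run from a rescue environment or via a forced fsck at reboot; and if the store itself is unrecoverable, removing and redeploying the monitor, as Etienne suggests earlier in the thread, is the simpler way to restore quorum.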