The OSDs are up and in, but I still have the problem on the PGs, as you can
see below:

root@ceph-mon1:~# ceph -s
  cluster:
    id:     43f5d6b4-74b0-4281-92ab-940829d3ee5e
    health: HEALTH_ERR
            1/3 mons down, quorum ceph-mon1,ceph-mon3
            14/32863 objects unfound (0.043%)
            Possible data damage: 13 pgs recovery_unfound
            Degraded data redundancy: 42/98589 objects degraded (0.043%), 9 pgs degraded
            5 daemons have recently crashed
            1 slow ops, oldest one blocked for 22521 sec, osd.7 has slow ops

  services:
    mon: 3 daemons, quorum ceph-mon1,ceph-mon3 (age 41m), out of quorum: ceph-mon4
    mgr: ceph-mon1(active, since 30h), standbys: ceph-mon2
    osd: 12 osds: 12 up (since 75m), 12 in (since 30h); 10 remapped pgs

  data:
    pools:   5 pools, 225 pgs
    objects: 32.86k objects, 129 GiB
    usage:   384 GiB used, 4.3 TiB / 4.7 TiB avail
    pgs:     42/98589 objects degraded (0.043%)
             1811/98589 objects misplaced (1.837%)
             14/32863 objects unfound (0.043%)
             212 active+clean
             6   active+recovery_unfound+degraded+remapped
             4   active+recovery_unfound+remapped
             3   active+recovery_unfound+degraded

  io:
    client:   34 KiB/s rd, 41 op/s rd, 0 op/s wr

On Fri, Oct 29, 2021 at 3:10 PM Etienne Menguy <etienne.menguy@xxxxxxxx> wrote:

> Could your hardware be faulty?
>
> Are you trying to redeploy the faulty monitor, or a whole new cluster?
>
> If you are trying to fix your cluster, you should focus on the OSDs.
> A cluster can run without big trouble on 2 monitors for a few days (if
> not years…).
>
> -
> Etienne Menguy
> etienne.menguy@xxxxxxxx
>
>
> On 29 Oct 2021, at 14:08, Michel Niyoyita <micou12@xxxxxxxxx> wrote:
>
> Hello team
>
> Below is the error I am getting when I try to redeploy the same cluster:
>
> TASK [ceph-mon : recursively fix ownership of monitor directory] ******************************************************************************************************************************************************
> Friday 29 October 2021  12:07:18 +0000 (0:00:00.411)       0:01:41.157 ********
> ok: [ceph-mon1]
> ok: [ceph-mon2]
> ok: [ceph-mon3]
> An exception occurred during task execution. To see the full traceback,
> use -vvv. The error was: OSError: [Errno 117] Structure needs cleaning:
> b'/var/lib/ceph/mon/ceph-ceph-mon4/store.db/216815.log'
> fatal: [ceph-mon4]: FAILED!
> => changed=false
>   module_stderr: |-
>     Traceback (most recent call last):
>       File "<stdin>", line 102, in <module>
>       File "<stdin>", line 94, in _ansiballz_main
>       File "<stdin>", line 40, in invoke_module
>       File "/usr/lib/python3.8/runpy.py", line 207, in run_module
>         return _run_module_code(code, init_globals, run_name, mod_spec)
>       File "/usr/lib/python3.8/runpy.py", line 97, in _run_module_code
>         _run_code(code, mod_globals, init_globals,
>       File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
>         exec(code, run_globals)
>       File "/tmp/ansible_file_payload_uucq1oqa/ansible_file_payload.zip/ansible/modules/files/file.py", line 940, in <module>
>       File "/tmp/ansible_file_payload_uucq1oqa/ansible_file_payload.zip/ansible/modules/files/file.py", line 926, in main
>       File "/tmp/ansible_file_payload_uucq1oqa/ansible_file_payload.zip/ansible/modules/files/file.py", line 665, in ensure_directory
>       File "/tmp/ansible_file_payload_uucq1oqa/ansible_file_payload.zip/ansible/modules/files/file.py", line 340, in recursive_set_attributes
>       File "/tmp/ansible_file_payload_uucq1oqa/ansible_file_payload.zip/ansible/module_utils/basic.py", line 1335, in set_fs_attributes_if_different
>       File "/tmp/ansible_file_payload_uucq1oqa/ansible_file_payload.zip/ansible/module_utils/basic.py", line 988, in set_owner_if_different
>       File "/tmp/ansible_file_payload_uucq1oqa/ansible_file_payload.zip/ansible/module_utils/basic.py", line 883, in user_and_group
>     OSError: [Errno 117] Structure needs cleaning: b'/var/lib/ceph/mon/ceph-ceph-mon4/store.db/216815.log'
>   module_stdout: ''
>   msg: |-
>     MODULE FAILURE
>     See stdout/stderr for the exact error
>   rc: 1
>
>
> On Fri, Oct 29, 2021 at 12:37 PM Etienne Menguy <etienne.menguy@xxxxxxxx>
> wrote:
>
>> Have you tried to restart one of the OSDs that seems to be blocking PG
>> recovery?
>>
>> I don't think increasing the PG count can help.
>> -
>> Etienne Menguy
>> etienne.menguy@xxxxxxxx
>>
>>
>> On 29 Oct 2021, at 11:53, Michel Niyoyita <micou12@xxxxxxxxx> wrote:
>>
>> Hello Eugen
>>
>> The failure_domain is host level and the crush rule is replicated_rule.
>> While troubleshooting I changed pool 5's PG count from 32 to 128 to see
>> if anything would change, and it has the default replica count (3).
>>
>> Thanks for your continuous help
>>
>> On Fri, Oct 29, 2021 at 11:44 AM Etienne Menguy <etienne.menguy@xxxxxxxx>
>> wrote:
>>
>>> Is there a way to force the mon to rejoin the quorum? I tried to
>>> restart it but nothing changed. I guess it is the cause, if I am not
>>> mistaken.
>>>
>>> No, but with quorum_status you can check the monitor status and whether
>>> it's trying to join the quorum.
>>> You may have to use the daemon socket interface (asok file) to get info
>>> directly from this monitor.
>>>
>>> https://docs.ceph.com/en/latest/rados/operations/monitoring/#checking-monitor-status
>>>
>>> Which OSDs were down? As written by Eugen, having the crush rule and
>>> failure domain would be useful. It's unusual that a single host failure
>>> triggers this issue.
>>>
>>> I guess it is the cause, if I am not mistaken
>>>
>>> I don't think the monitor issue is the root cause of the unfound objects.
>>> You could easily delete the monitor and deploy it again to “fix” your
>>> monitor quorum.
>>>
>>> -
>>> Etienne Menguy
>>> etienne.menguy@xxxxxxxx
>>>
>>>
>>> On 29 Oct 2021, at 11:30, Michel Niyoyita <micou12@xxxxxxxxx> wrote:
>>>
>>> Dear Etienne
>>>
>>> Is there a way to force the mon to rejoin the quorum? I tried to
>>> restart it but nothing changed.
>>> I guess it is the cause, if I am not mistaken.
>>>
>>> Below is the pg query output.
>>>
>>> Regards
>>>
>>> On Fri, Oct 29, 2021 at 10:56 AM Etienne Menguy <etienne.menguy@xxxxxxxx>
>>> wrote:
>>>
>>>> With “ceph pg x.y query” you can check why it's complaining.
>>>>
>>>> x.y is the pg id, like 5.77.
>>>>
>>>> It would also be interesting to check why the mon fails to rejoin the
>>>> quorum; it may give you hints about your OSD issues.
>>>>
>>>> -
>>>> Etienne Menguy
>>>> etienne.menguy@xxxxxxxx
>>>>
>>>>
>>>> On 29 Oct 2021, at 10:34, Michel Niyoyita <micou12@xxxxxxxxx> wrote:
>>>>
>>>> Hello Etienne
>>>>
>>>> This is the ceph -s output:
>>>>
>>>> root@ceph-mon1:~# ceph -s
>>>>   cluster:
>>>>     id:     43f5d6b4-74b0-4281-92ab-940829d3ee5e
>>>>     health: HEALTH_ERR
>>>>             1/3 mons down, quorum ceph-mon1,ceph-mon3
>>>>             14/47681 objects unfound (0.029%)
>>>>             1 scrub errors
>>>>             Possible data damage: 13 pgs recovery_unfound, 1 pg inconsistent
>>>>             Degraded data redundancy: 42/143043 objects degraded (0.029%), 13 pgs degraded
>>>>             2 slow ops, oldest one blocked for 2897 sec, daemons [osd.0,osd.7] have slow ops.
>>>>
>>>>   services:
>>>>     mon: 3 daemons, quorum ceph-mon1,ceph-mon3 (age 2h), out of quorum: ceph-mon4
>>>>     mgr: ceph-mon1(active, since 25h), standbys: ceph-mon2
>>>>     osd: 12 osds: 12 up (since 97m), 12 in (since 25h); 10 remapped pgs
>>>>
>>>>   data:
>>>>     pools:   5 pools, 225 pgs
>>>>     objects: 47.68k objects, 204 GiB
>>>>     usage:   603 GiB used, 4.1 TiB / 4.7 TiB avail
>>>>     pgs:     42/143043 objects degraded (0.029%)
>>>>              2460/143043 objects misplaced (1.720%)
>>>>              14/47681 objects unfound (0.029%)
>>>>              211 active+clean
>>>>              10  active+recovery_unfound+degraded+remapped
>>>>              3   active+recovery_unfound+degraded
>>>>              1   active+clean+inconsistent
>>>>
>>>>   io:
>>>>     client:   2.0 KiB/s rd, 88 KiB/s wr, 2 op/s rd, 12 op/s wr
>>>>
>>>> On Fri, Oct 29, 2021 at 10:09 AM Etienne Menguy <etienne.menguy@xxxxxxxx>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Please share the “ceph -s” output.
>>>>>
>>>>> -
>>>>> Etienne Menguy
>>>>> etienne.menguy@xxxxxxxx
>>>>>
>>>>>
>>>>> On 29 Oct 2021, at 10:03, Michel Niyoyita <micou12@xxxxxxxxx> wrote:
>>>>>
>>>>> Hello team
>>>>>
>>>>> I am running a Ceph cluster with 3 monitors and 4 OSD nodes running 3
>>>>> OSDs each. I deployed the cluster with ansible on Ubuntu 20.04, and the
>>>>> Ceph version is Octopus. Yesterday the server which hosts the OSD nodes
>>>>> restarted because of a power issue, and since it came back up one of
>>>>> the monitors is out of quorum and some PGs are marked as damaged.
>>>>> Please help me to solve this issue. Below is the health detail status I
>>>>> am seeing; the 4 OSD nodes are the same hosts that run the monitors (3
>>>>> of them).
>>>>>
>>>>> Best regards.
>>>>>
>>>>> Michel
>>>>>
>>>>> root@ceph-mon1:~# ceph health detail
>>>>> HEALTH_ERR 1/3 mons down, quorum ceph-mon1,ceph-mon3; 14/47195 objects
>>>>> unfound (0.030%); Possible data damage: 13 pgs recovery_unfound;
>>>>> Degraded data redundancy: 42/141585 objects degraded (0.030%), 13 pgs
>>>>> degraded; 2 slow ops, oldest one blocked for 322 sec, daemons
>>>>> [osd.0,osd.7] have slow ops.
>>>>> [WRN] MON_DOWN: 1/3 mons down, quorum ceph-mon1,ceph-mon3
>>>>>     mon.ceph-mon4 (rank 2) addr [v2:10.10.29.154:3300/0,v1:10.10.29.154:6789/0] is down (out of quorum)
>>>>> [WRN] OBJECT_UNFOUND: 14/47195 objects unfound (0.030%)
>>>>>     pg 5.77 has 1 unfound objects
>>>>>     pg 5.6d has 2 unfound objects
>>>>>     pg 5.6a has 1 unfound objects
>>>>>     pg 5.65 has 1 unfound objects
>>>>>     pg 5.4a has 1 unfound objects
>>>>>     pg 5.30 has 1 unfound objects
>>>>>     pg 5.28 has 1 unfound objects
>>>>>     pg 5.25 has 1 unfound objects
>>>>>     pg 5.19 has 1 unfound objects
>>>>>     pg 5.1a has 1 unfound objects
>>>>>     pg 5.1 has 1 unfound objects
>>>>>     pg 5.b has 1 unfound objects
>>>>>     pg 5.8 has 1 unfound objects
>>>>> [ERR] PG_DAMAGED: Possible data damage: 13 pgs recovery_unfound
>>>>>     pg 5.1 is active+recovery_unfound+degraded+remapped, acting [5,8,7], 1 unfound
>>>>>     pg 5.8 is active+recovery_unfound+degraded+remapped, acting [6,11,8], 1 unfound
>>>>>     pg 5.b is active+recovery_unfound+degraded+remapped, acting [7,0,5], 1 unfound
>>>>>     pg 5.19 is active+recovery_unfound+degraded+remapped, acting [0,5,7], 1 unfound
>>>>>     pg 5.1a is active+recovery_unfound+degraded, acting [10,11,8], 1 unfound
>>>>>     pg 5.25 is active+recovery_unfound+degraded+remapped, acting [0,10,11], 1 unfound
>>>>>     pg 5.28 is active+recovery_unfound+degraded+remapped, acting [6,11,8], 1 unfound
>>>>>     pg 5.30 is active+recovery_unfound+degraded+remapped, acting [7,5,0], 1 unfound
>>>>>     pg 5.4a is active+recovery_unfound+degraded, acting [0,11,7], 1 unfound
>>>>>     pg 5.65 is active+recovery_unfound+degraded+remapped, acting [0,10,11], 1 unfound
>>>>>     pg 5.6a is active+recovery_unfound+degraded, acting [0,11,7], 1 unfound
>>>>>     pg 5.6d is active+recovery_unfound+degraded+remapped, acting [7,2,0], 2 unfound
>>>>>     pg 5.77 is active+recovery_unfound+degraded+remapped, acting [5,6,8], 1 unfound
>>>>> [WRN] PG_DEGRADED: Degraded data redundancy: 42/141585 objects degraded (0.030%), 13 pgs degraded
>>>>>     pg 5.1 is active+recovery_unfound+degraded+remapped, acting [5,8,7], 1 unfound
>>>>>     pg 5.8 is active+recovery_unfound+degraded+remapped, acting [6,11,8], 1 unfound
>>>>>     pg 5.b is active+recovery_unfound+degraded+remapped, acting [7,0,5], 1 unfound
>>>>>     pg 5.19 is active+recovery_unfound+degraded+remapped, acting [0,5,7], 1 unfound
>>>>>     pg 5.1a is active+recovery_unfound+degraded, acting [10,11,8], 1 unfound
>>>>>     pg 5.25 is active+recovery_unfound+degraded+remapped, acting [0,10,11], 1 unfound
>>>>>     pg 5.28 is active+recovery_unfound+degraded+remapped, acting [6,11,8], 1 unfound
>>>>>     pg 5.30 is active+recovery_unfound+degraded+remapped, acting [7,5,0], 1 unfound
>>>>>     pg 5.4a is active+recovery_unfound+degraded, acting [0,11,7], 1 unfound
>>>>>     pg 5.65 is active+recovery_unfound+degraded+remapped, acting [0,10,11], 1 unfound
>>>>>     pg 5.6a is active+recovery_unfound+degraded, acting [0,11,7], 1 unfound
>>>>>     pg 5.6d is active+recovery_unfound+degraded+remapped, acting [7,2,0], 2 unfound
>>>>>     pg 5.77 is active+recovery_unfound+degraded+remapped, acting [5,6,8], 1 unfound
>>>>> [WRN] SLOW_OPS: 2 slow ops, oldest one blocked for 322 sec, daemons [osd.0,osd.7] have slow ops.
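
For reference, a minimal sketch of the PG and OSD checks suggested in the
thread above, assuming non-containerized daemons managed by systemd; the IDs
5.77 and osd.7 are taken from the health output, so substitute whatever your
own cluster reports:

# Ask the PG itself why it is stuck; the recovery_state section shows
# which OSDs it is still probing for the missing copies.
ceph pg 5.77 query

# List the objects that PG cannot find.
ceph pg 5.77 list_unfound

# Restart the OSD flagged with slow ops (run on the host that carries
# osd.7; assumes package-based ceph-osd@ systemd units, not containers).
systemctl restart ceph-osd@7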
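Similarly, a sketch of the monitor checks mentioned above (quorum_status and
the admin socket); the socket path follows the default naming for a mon id of
ceph-mon4 and is an assumption, not something shown in the thread:

# From any node with a working admin keyring: who is in quorum right now.
ceph quorum_status --format json-pretty

# On the ceph-mon4 host itself, talk to the daemon through its admin
# socket, which works even while it is out of quorum.
ceph daemon mon.ceph-mon4 mon_status
# equivalent, with the (assumed) default socket path spelled out:
ceph --admin-daemon /var/run/ceph/ceph-mon.ceph-mon4.asok mon_status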
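Finally, the ansible failure quoted earlier (OSError 117, "Structure needs
cleaning", under /var/lib/ceph/mon/ceph-ceph-mon4/store.db) is a corruption
error from the filesystem backing the mon store, not from Ceph itself. A rough
sketch of the "delete the monitor and deploy it again" path suggested above,
assuming an ext4 filesystem and systemd units; the device name is hypothetical,
and fsck must only be run on an unmounted filesystem:

# On ceph-mon4: stop the mon and look for the filesystem errors behind
# "Structure needs cleaning".
systemctl stop ceph-mon@ceph-mon4
dmesg | grep -i ext4

# Repair the backing filesystem while it is unmounted (device name is
# hypothetical; use whatever actually holds /var/lib/ceph).
umount /var/lib/ceph      # only if it is a separate mount
fsck -f /dev/sdX1

# Remove the broken mon from the monmap (from a node in quorum), clear
# its store on ceph-mon4, then re-run the playbook to recreate it.
ceph mon remove ceph-mon4
rm -rf /var/lib/ceph/mon/ceph-ceph-mon4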
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx