I think the recovery might be blocked by all those PGs in an inactive state:

https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/4/html/administration_guide/monitoring-a-ceph-storage-cluster#identifying-stuck-placement-groups_admin

"""
Inactive: Placement groups cannot process reads or writes because they are
waiting for an OSD with the most up-to-date data to come back up.
"""

What is your pool configuration, and what other settings have you changed?
Can you send the output of "ceph config dump" and "ceph osd pool ls detail"?
(A quick sketch of those and a few related commands is below the quoted thread.)

On 05/05 11:00, Andres Rojas Guerrero wrote:
> Yes, the main problem is that the MDS starts to report slow requests, the
> information is no longer accessible, and the cluster never recovers.
>
> # ceph status
>   cluster:
>     id:     c74da5b8-3d1b-483e-8b3a-739134db6cf8
>     health: HEALTH_WARN
>             2 clients failing to respond to capability release
>             2 MDSs report slow metadata IOs
>             1 MDSs report slow requests
>             2 MDSs behind on trimming
>             Reduced data availability: 238 pgs inactive, 8 pgs down, 230 pgs incomplete
>             Degraded data redundancy: 1400453/220552172 objects degraded (0.635%), 461 pgs degraded, 464 pgs undersized
>             241 slow ops, oldest one blocked for 638 sec, daemons [osd.101,osd.127,osd.155,osd.166,osd.172,osd.189,osd.200,osd.210,osd.214,osd.233]... have slow ops.
>
>   services:
>     mon: 3 daemons, quorum ceph2mon01,ceph2mon02,ceph2mon03 (age 25h)
>     mgr: ceph2mon02(active, since 6d), standbys: ceph2mon01, ceph2mon03
>     mds: nxtclfs:2 {0=ceph2mon01=up:active,1=ceph2mon02=up:active} 1 up:standby
>     osd: 768 osds: 736 up (since 11m), 736 in (since 95s); 416 remapped pgs
>
>   data:
>     pools:   2 pools, 16384 pgs
>     objects: 33.40M objects, 39 TiB
>     usage:   63 TiB used, 2.6 PiB / 2.6 PiB avail
>     pgs:     1.489% pgs not active
>              1400453/220552172 objects degraded (0.635%)
>              15676 active+clean
>              285   active+undersized+degraded+remapped+backfill_wait
>              230   incomplete
>              176   active+undersized+degraded+remapped+backfilling
>              8     down
>              6     peering
>              3     active+undersized+remapped
>
> On 5/5/21 at 10:54, David Caro wrote:
> >
> > Can you share more information?
> >
> > The output of 'ceph status' when the OSD is down would help; 'ceph health detail' could also be useful.
> >
> > On 05/05 10:48, Andres Rojas Guerrero wrote:
> >> Hi, I have a Nautilus cluster, version 14.2.6, and I have noticed that
> >> when some OSDs go down the cluster doesn't start to recover. I have
> >> checked that the noout option is unset.
> >>
> >> What could be the reason for this behavior?
> >>
> >> --
> >> *******************************************************
> >> Andrés Rojas Guerrero
> >> Unidad Sistemas Linux
> >> Area Arquitectura Tecnológica
> >> Secretaría General Adjunta de Informática
> >> Consejo Superior de Investigaciones Científicas (CSIC)
> >> Pinar 19
> >> 28006 - Madrid
> >> Tel: +34 915680059 -- Ext. 990059
> >> email: a.rojas@xxxxxxx
> >> ID comunicate.csic.es: @50852720l:matrix.csic.es
> >> *******************************************************
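For reference, here is a rough sketch of the commands that should produce the
output I'm asking for and show what the stuck PGs are waiting on (standard
Nautilus CLI; <pgid> is a placeholder for one of the PG IDs from the listings):

  # Cluster-wide settings and per-pool configuration (size, min_size, crush rule, pg_num)
  ceph config dump
  ceph osd pool ls detail

  # List the PGs stuck inactive/incomplete, then query one of them and check the
  # recovery_state section to see which OSDs it is waiting for
  ceph pg dump_stuck inactive
  ceph pg ls incomplete
  ceph pg <pgid> query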
--
David Caro
SRE - Cloud Services
Wikimedia Foundation <https://wikimediafoundation.org/>
PGP Signature: 7180 83A2 AC8B 314F B4CE 1171 4071 C7E1 D262 69C3

"Imagine a world in which every single human being can freely share in
the sum of all knowledge. That's our commitment."
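On the original question of why recovery never starts, a quick, generic way to
rule out cluster flags or still-down OSDs blocking it (plain Ceph CLI, nothing
cluster-specific assumed):

  ceph osd dump | grep flags   # noout/norecover/nobackfill/norebalance should not be set
  ceph osd tree down           # which OSDs are still down and where they sit in the CRUSH tree
  ceph health detail           # lists the inactive/incomplete PGs and the OSDs they involve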
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx