I think the recovery might be blocked by all those PGs in an inactive state:

https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/4/html/administration_guide/monitoring-a-ceph-storage-cluster#identifying-stuck-placement-groups_admin

"""
Inactive: Placement groups cannot process reads or writes because they are
waiting for an OSD with the most up-to-date data to come back up.
"""

What is your pool configuration, and what other settings have you changed?
Can you send the output of "ceph config dump" and "ceph osd pool ls detail"?
(A quick sketch of those and a few related commands is below the quoted thread.)

On 05/05 11:00, Andres Rojas Guerrero wrote:
> Yes, the main problem is that the MDS starts to report slow requests, the
> information is no longer accessible, and the cluster never recovers.
>
> # ceph status
>   cluster:
>     id:     c74da5b8-3d1b-483e-8b3a-739134db6cf8
>     health: HEALTH_WARN
>             2 clients failing to respond to capability release
>             2 MDSs report slow metadata IOs
>             1 MDSs report slow requests
>             2 MDSs behind on trimming
>             Reduced data availability: 238 pgs inactive, 8 pgs down, 230 pgs incomplete
>             Degraded data redundancy: 1400453/220552172 objects degraded (0.635%), 461 pgs degraded, 464 pgs undersized
>             241 slow ops, oldest one blocked for 638 sec, daemons [osd.101,osd.127,osd.155,osd.166,osd.172,osd.189,osd.200,osd.210,osd.214,osd.233]... have slow ops.
>
>   services:
>     mon: 3 daemons, quorum ceph2mon01,ceph2mon02,ceph2mon03 (age 25h)
>     mgr: ceph2mon02(active, since 6d), standbys: ceph2mon01, ceph2mon03
>     mds: nxtclfs:2 {0=ceph2mon01=up:active,1=ceph2mon02=up:active} 1 up:standby
>     osd: 768 osds: 736 up (since 11m), 736 in (since 95s); 416 remapped pgs
>
>   data:
>     pools:   2 pools, 16384 pgs
>     objects: 33.40M objects, 39 TiB
>     usage:   63 TiB used, 2.6 PiB / 2.6 PiB avail
>     pgs:     1.489% pgs not active
>              1400453/220552172 objects degraded (0.635%)
>              15676 active+clean
>              285   active+undersized+degraded+remapped+backfill_wait
>              230   incomplete
>              176   active+undersized+degraded+remapped+backfilling
>              8     down
>              6     peering
>              3     active+undersized+remapped
>
> On 5/5/21 at 10:54, David Caro wrote:
> >
> > Can you share more information?
> >
> > The output of 'ceph status' when the OSD is down would help; 'ceph health detail' could also be useful.
> >
> > On 05/05 10:48, Andres Rojas Guerrero wrote:
> >> Hi, I have a Nautilus cluster, version 14.2.6, and I have noticed that
> >> when some OSDs go down the cluster doesn't start to recover. I have
> >> checked that the noout option is unset.
> >>
> >> What could be the reason for this behavior?
> >>
> >> --
> >> *******************************************************
> >> Andrés Rojas Guerrero
> >> Unidad Sistemas Linux
> >> Area Arquitectura Tecnológica
> >> Secretaría General Adjunta de Informática
> >> Consejo Superior de Investigaciones Científicas (CSIC)
> >> Pinar 19
> >> 28006 - Madrid
> >> Tel: +34 915680059 -- Ext. 990059
> >> email: a.rojas@xxxxxxx
> >> ID comunicate.csic.es: @50852720l:matrix.csic.es
> >> *******************************************************
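For reference, here is a rough sketch of the commands that should produce the
output I'm asking for and show what the stuck PGs are waiting on (standard
Nautilus CLI; <pgid> is a placeholder for one of the PG IDs from the listings):

  # Cluster-wide settings and per-pool configuration (size, min_size, crush rule, pg_num)
  ceph config dump
  ceph osd pool ls detail

  # List the PGs stuck inactive/incomplete, then query one of them and check the
  # recovery_state section to see which OSDs it is waiting for
  ceph pg dump_stuck inactive
  ceph pg ls incomplete
  ceph pg <pgid> query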
--
David Caro
SRE - Cloud Services
Wikimedia Foundation <https://wikimediafoundation.org/>
PGP Signature: 7180 83A2 AC8B 314F B4CE 1171 4071 C7E1 D262 69C3

"Imagine a world in which every single human being can freely share in
the sum of all knowledge. That's our commitment."
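On the original question of why recovery never starts, a quick, generic way to
rule out cluster flags or still-down OSDs blocking it (plain Ceph CLI, nothing
cluster-specific assumed):

  ceph osd dump | grep flags   # noout/norecover/nobackfill/norebalance should not be set
  ceph osd tree down           # which OSDs are still down and where they sit in the CRUSH tree
  ceph health detail           # lists the inactive/incomplete PGs and the OSDs they involve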
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx