I see the problem: when the OSDs fail, the MDSs fail as well, with errors of the type "slow metadata IOs" and "slow requests", but they do not recover once the cluster itself has recovered ... Why?

On 5/5/21 at 11:07, Andres Rojas Guerrero wrote:
> Sorry, I had not understood the problem well. The problem I see is that
> once the OSDs fail, the cluster recovers but the MDSs remain faulty:
>
> # ceph status
>   cluster:
>     id:     c74da5b8-3d1b-483e-8b3a-739134db6cf8
>     health: HEALTH_WARN
>             3 clients failing to respond to capability release
>             2 MDSs report slow metadata IOs
>             2 MDSs report slow requests
>             2 MDSs behind on trimming
>             Reduced data availability: 256 pgs inactive, 18 pgs down, 238 pgs incomplete
>             22 slow ops, oldest one blocked for 26719 sec, daemons [osd.134,osd.210,osd.244,osd.251,osd.301,osd.514,osd.520,osd.528,osd.642,osd.713]... have slow ops.
>
>   services:
>     mon: 3 daemons, quorum ceph2mon01,ceph2mon02,ceph2mon03 (age 23h)
>     mgr: ceph2mon02(active, since 6d), standbys: ceph2mon01, ceph2mon03
>     mds: nxtclfs:2 {0=ceph2mon01=up:active,1=ceph2mon02=up:active} 1 up:standby
>     osd: 768 osds: 736 up (since 7h), 736 in (since 7h)
>
>   data:
>     pools:   2 pools, 16384 pgs
>     objects: 33.39M objects, 39 TiB
>     usage:   64 TiB used, 2.6 PiB / 2.6 PiB avail
>     pgs:     1.562% pgs not active
>              16128 active+clean
>              238   incomplete
>              18    down
>
> On 5/5/21 at 11:00, Andres Rojas Guerrero wrote:
>> Yes, the main problem is that the MDSs start to respond slowly, the
>> information is no longer accessible, and the cluster never recovers.
>>
>> # ceph status
>>   cluster:
>>     id:     c74da5b8-3d1b-483e-8b3a-739134db6cf8
>>     health: HEALTH_WARN
>>             2 clients failing to respond to capability release
>>             2 MDSs report slow metadata IOs
>>             1 MDSs report slow requests
>>             2 MDSs behind on trimming
>>             Reduced data availability: 238 pgs inactive, 8 pgs down, 230 pgs incomplete
>>             Degraded data redundancy: 1400453/220552172 objects degraded (0.635%), 461 pgs degraded, 464 pgs undersized
>>             241 slow ops, oldest one blocked for 638 sec, daemons [osd.101,osd.127,osd.155,osd.166,osd.172,osd.189,osd.200,osd.210,osd.214,osd.233]... have slow ops.
>>
>>   services:
>>     mon: 3 daemons, quorum ceph2mon01,ceph2mon02,ceph2mon03 (age 25h)
>>     mgr: ceph2mon02(active, since 6d), standbys: ceph2mon01, ceph2mon03
>>     mds: nxtclfs:2 {0=ceph2mon01=up:active,1=ceph2mon02=up:active} 1 up:standby
>>     osd: 768 osds: 736 up (since 11m), 736 in (since 95s); 416 remapped pgs
>>
>>   data:
>>     pools:   2 pools, 16384 pgs
>>     objects: 33.40M objects, 39 TiB
>>     usage:   63 TiB used, 2.6 PiB / 2.6 PiB avail
>>     pgs:     1.489% pgs not active
>>              1400453/220552172 objects degraded (0.635%)
>>              15676 active+clean
>>              285   active+undersized+degraded+remapped+backfill_wait
>>              230   incomplete
>>              176   active+undersized+degraded+remapped+backfilling
>>              8     down
>>              6     peering
>>              3     active+undersized+remapped
>>
>> On 5/5/21 at 10:54, David Caro wrote:
>>>
>>> Can you share more information?
>>>
>>> The output of 'ceph status' while the OSD is down would help; 'ceph health detail' could also be useful.
>>>
>>> On 05/05 10:48, Andres Rojas Guerrero wrote:
>>>> Hi, I have a Nautilus cluster (version 14.2.6), and I have noticed
>>>> that when some OSDs go down the cluster doesn't start to recover. I
>>>> have checked that the noout flag is unset.
>>>>
>>>> What could be the reason for this behavior?
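P.S. In case it helps, this is how I am inspecting the stuck placement groups while the MDSs report slow metadata IOs. A minimal sketch; the pg id 2.1ff below is only an example, substitute one taken from the listings:

# ceph health detail
# ceph pg ls incomplete
# ceph pg ls down
# ceph pg 2.1ff query

The 'recovery_state' section of the query output names the OSDs the pg is probing or blocked by, which should show why peering never completes for the incomplete pgs.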
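On the MDS side, I look at the daemons through their admin socket on the node where each active MDS runs (again a sketch; the daemon name ceph2mon01 comes from my cluster, adjust it as needed):

# ceph daemon mds.ceph2mon01 ops
# ceph daemon mds.ceph2mon01 objecter_requests
# ceph daemon mds.ceph2mon01 session ls

'ops' dumps the client requests currently stuck in the MDS, and 'objecter_requests' lists the RADOS operations the MDS itself has outstanding; here they should point at the same OSDs that own the incomplete pgs, which would explain why the MDSs stay behind on trimming until those pgs peer again.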