FS_DEGRADED indicates that your MDS restarted or stopped responding to
health beacons. Are your MDSs going OOM?

I see you have two active MDSs. Is your cluster more stable if you use
only a single active MDS?

-- Dan

On Wed, May 26, 2021 at 2:44 PM Andres Rojas Guerrero <a.rojas@xxxxxxx> wrote:
>
> OK, thanks, I will try to update Nautilus. But I really don't
> understand the problem; warnings appear apparently at random:
>
> [WRN] Health check failed: 1 MDSs report slow requests (MDS_SLOW_REQUEST)
>
> cluster [INF] Health check cleared: FS_DEGRADED (was: 1 filesystem is
> degraded)
>
> : cluster [DBG] mds.?
> [v2:10.100.190.39:6800/2624951349,v1:10.100.190.39:6801/2624951349]
> up:rejoin
> 2021-05-26 10:55:33.215102 mon.ceph2mon01 (mon.0) 700 : cluster [DBG]
> fsmap nxtclfs:2/2 {0=ceph2mon03=up:rejoin,1=ceph2mon01=up:active} 1
> up:standby
>
> These degrade the filesystem, and I have assumed that the problem is
> due to the memory consumption of the MDS process, which can reach
> around 80% or more of the total memory.
>
> On 26/5/21 at 13:21, Dan van der Ster wrote:
> > I've seen your other thread. Using 78GB of RAM when the memory limit
> > is set to 64GB is not highly unusual, and doesn't necessarily
> > indicate any problem. It *would* be a problem if the MDS memory
> > grows uncontrollably, however.
> >
> > Otherwise, check those new defaults for caps recall -- they were
> > released around 14.2.19 IIRC.
> >
> > -- Dan
> >
> > On Wed, May 26, 2021 at 12:46 PM Andres Rojas Guerrero <a.rojas@xxxxxxx> wrote:
> >>
> >> Thanks for the answer. Yes, during these last weeks I have had
> >> memory consumption problems in the MDS nodes that led, or at least
> >> it seemed to me, to performance problems in CephFS.
> >> I have been varying, for example:
> >>
> >> mds_cache_memory_limit
> >> mds_min_caps_per_client
> >> mds_health_cache_threshold
> >> mds_max_caps_per_client
> >> mds_cache_reservation
> >>
> >> But without much knowledge, and with a trial-and-error procedure,
> >> i.e. observing how CephFS behaved when changing one of the
> >> parameters. Although I have achieved some improvement, the procedure
> >> does not convince me at all, and that's why I was asking if there
> >> was something more reliable ...
> >>
> >> On 26/5/21 at 12:15, Dan van der Ster wrote:
> >>> Hi,
> >>>
> >>> The mds_cache_memory_limit should be set to something relative to
> >>> the RAM size of the MDS -- maybe 50% is a good rule of thumb,
> >>> because there are a few cases where the RSS can exceed this limit.
> >>> Your experience will help guide what size you need (metadata pool
> >>> IO activity will be really high if the MDS cache is too small).
> >>>
> >>> Otherwise, in recent releases of N/O/P the defaults for those
> >>> settings you mentioned are quite good [1]; I would be surprised if
> >>> they need further tuning for 99% of users.
> >>> Is there any reason you want to start adjusting these params?
> >>>
> >>> Best Regards,
> >>>
> >>> Dan
> >>>
> >>> [1] https://github.com/ceph/ceph/pull/38574
> >>>
> >>> On Wed, May 26, 2021 at 11:58 AM Andres Rojas Guerrero <a.rojas@xxxxxxx> wrote:
> >>>>
> >>>> Hi all, I have observed that the MDS Cache Configuration has 18
> >>>> parameters:
> >>>>
> >>>> mds_cache_memory_limit
> >>>> mds_cache_reservation
> >>>> mds_health_cache_threshold
> >>>> mds_cache_trim_threshold
> >>>> mds_cache_trim_decay_rate
> >>>> mds_recall_max_caps
> >>>> mds_recall_max_decay_threshold
> >>>> mds_recall_max_decay_rate
> >>>> mds_recall_global_max_decay_threshold
> >>>> mds_recall_warning_threshold
> >>>> mds_recall_warning_decay_rate
> >>>> mds_session_cap_acquisition_throttle
> >>>> mds_session_cap_acquisition_decay_rate
> >>>> mds_session_max_caps_throttle_ratio
> >>>> mds_cap_acquisition_throttle_retry_request_timeout
> >>>> mds_session_cache_liveness_magnitude
> >>>> mds_session_cache_liveness_decay_rate
> >>>> mds_max_caps_per_client
> >>>>
> >>>> I find the Ceph documentation in this section a bit cryptic, and I
> >>>> have tried to find some resources that talk about how to tune
> >>>> these parameters, but without success.
> >>>>
> >>>> Does anyone have experience in adjusting these parameters
> >>>> according to the characteristics of the Ceph cluster itself, the
> >>>> hardware and the use of the MDS?
> >>>>
> >>>> Regards!
> >>>> --
> >>>> *******************************************************
> >>>> Andrés Rojas Guerrero
> >>>> Unidad Sistemas Linux
> >>>> Area Arquitectura Tecnológica
> >>>> Secretaría General Adjunta de Informática
> >>>> Consejo Superior de Investigaciones Científicas (CSIC)
> >>>> Pinar 19
> >>>> 28006 - Madrid
> >>>> Tel: +34 915680059 -- Ext. 990059
> >>>> email: a.rojas@xxxxxxx
> >>>> ID comunicate.csic.es: @50852720l:matrix.csic.es
> >>>> *******************************************************
> >>>> _______________________________________________
> >>>> ceph-users mailing list -- ceph-users@xxxxxxx
> >>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
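
For reference, the two concrete suggestions in the thread can be sketched
as the commands below. This is a sketch, not something run against the
cluster in question: the filesystem name "nxtclfs" is taken from the fsmap
log line, the MDS daemon name "ceph2mon01" from the same line, and the
68719476736-byte (64 GiB) value assumes a hypothetical 128 GiB MDS host
under Dan's ~50% rule of thumb.

```shell
# 1) Fall back to a single active MDS (on Nautilus and later, lowering
#    max_mds makes the cluster stop the extra ranks automatically):
ceph fs set nxtclfs max_mds 1
ceph fs status nxtclfs   # watch rank 1 stop and return to standby

# 2) Cap the MDS cache at roughly 50% of host RAM (value in bytes;
#    64 GiB here, assuming a 128 GiB host):
ceph config set mds mds_cache_memory_limit 68719476736

# 3) Compare actual cache usage against the limit on a given MDS:
ceph daemon mds.ceph2mon01 cache status
```

Remember that the RSS of the MDS process can exceed mds_cache_memory_limit
by some margin, which is why the limit is sized well below total RAM.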