Meaning of Ceph MDS / Rank in "Stopped" state.

I am developing monitoring for our file clusters, and as part of the check I inspect `ceph mds stat` for damaged, failed, or stopped MDS ranks. Initially I set the check to alarm if any of these states was found, but as I rolled it out I noticed that one of our clusters reported the following:

    "failed": [],
    "damaged": [],
    "stopped": [
        2
    ],

However, the cluster health is good and the MDS state is: cephfs-2/2/2 up  {0=p3plcephmds001=up:active,1=p3plcephmds002=up:active}, 1 up:standby

After a little more digging, I found that the stopped state applies to a rank rather than to an MDS daemon, and may indicate that max_mds was previously set higher than its current value of 2. The "stopped" ranks would then simply be ranks that were once active and handed their state off to other ranks.

My question is: how can I inspect further which ranks are "stopped", and would it be appropriate to "clear" those stopped ranks if that is possible? Or should I modify my check to ignore stopped ranks and focus only on damaged/failed ranks?
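
To make the second option concrete, here is a rough sketch of a check that alarms only on failed or damaged ranks and reports stopped ranks as informational. It assumes the JSON from the cluster has already been parsed, and that the `failed`, `damaged` and `stopped` lists sit together in one object as in the snippet above. Because the exact nesting of that object may differ between commands and releases, the sketch searches the whole structure for it. The function names and the sample input are hypothetical, for illustration only.

```python
import json

# Keys that identify an MDS map object in the parsed output (assumption:
# all three lists appear side by side, as in the snippet above).
RANK_STATE_KEYS = ("failed", "damaged", "stopped")


def find_mdsmaps(node):
    """Yield every dict in the parsed JSON that carries all three rank lists."""
    if isinstance(node, dict):
        if all(key in node for key in RANK_STATE_KEYS):
            yield node
        for value in node.values():
            yield from find_mdsmaps(value)
    elif isinstance(node, list):
        for item in node:
            yield from find_mdsmaps(item)


def check_ranks(status):
    """Return (alarm, report): alarm on failed/damaged ranks, list stopped ones."""
    alarm = False
    report = []
    for mdsmap in find_mdsmaps(status):
        for key in ("failed", "damaged"):
            if mdsmap[key]:
                alarm = True
                report.append(f"{key} ranks: {sorted(mdsmap[key])}")
        if mdsmap["stopped"]:
            # Stopped ranks are reported but do not raise the alarm.
            report.append(
                f"stopped ranks (informational): {sorted(mdsmap['stopped'])}"
            )
    return alarm, report


# Hypothetical input shaped like the snippet above; the "mdsmap" wrapper
# key is an assumption and is not required by the search.
sample = json.loads('{"mdsmap": {"failed": [], "damaged": [], "stopped": [2]}}')
alarm, report = check_ranks(sample)
```

With the sample input, `alarm` stays false and the report only notes rank 2 as stopped, which matches a cluster that is otherwise healthy.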

The cluster is running Ceph 12.2.12 (Luminous).

Thanks.

Respectfully,

Wes Dillingham
wdillingham@xxxxxxxxxxx
Site Reliability Engineer IV - Platform Storage
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
