Hi Felix, "rejoin" took awhile in the past because the MDS needs to reload all inodes for all the open directories at boot time. In our experience this can take ~10 minutes on the most active clusters. In your case, I wonder if the MDS was going OOM in a loop while recovering? This was happening to us before -- there are recipes on this ML how to remove the "openfiles" objects to get out of that situation. Anyway, Octopus now has a feature to skip preloading the direntries at rejoin time: https://github.com/ceph/ceph/pull/44667 This will become the default soon, but you can switch off the preload already now. In our experience, rejoin is now taking under a minute on even the busiest clusters. Cheers, Dan On Tue, May 17, 2022 at 11:45 AM Felix Lee <felix@xxxxxxxxxx> wrote: > > Yes, we do plan to upgrade Ceph in near future for sure. > In any case, I used brutal way(kinda) to kick rejoin to active by > setting "mds_wipe_sessions = true" to all MDS. > Still, the entire MDS recovery process makes us blind to estimate the > service downtime. So, I am wondering if there is any way for us to > estimate the rejoin time? So that we can decide whether to wait or take > proactive action if necessary. > > > > Best regards, > Felix Lee ~ > > On 5/17/22 16:15, Jos Collin wrote: > > I suggest you to upgrade the cluster to the latest release [1], as > > nautilus reached EOL. > > > > [1] https://docs.ceph.com/en/latest/releases/ > > > > On 16/05/22 13:29, Felix Lee wrote: > >> Hi, Jos, > >> Many thanks for your reply. > >> And sorry, I missed to mention the version, which is 14.2.22. > >> > >> Here is the log: > >> https://drive.google.com/drive/folders/1qzPf64qw16VJDKSzcDoixZ690KL8XSoc?usp=sharing > >> > >> > >> Here, the ceph01(active) and ceph11(standby-replay) were the ones what > >> suffered crash. The log didn't tell us much but several slow request > >> were occurring. And, the ceph11 had "cache is too large" warning by > >> the time it went crashed, suppose it could happen when doing recovery. > >> (each MDS has 64GB memory, BTW ) > >> The ceph16 is current rejoin one, I've turned debug_mds to 20 for a > >> while as ceph-mds.ceph16.log-20220516.gz > >> > >> > >> Thanks > >> & > >> Best regards, > >> Felix Lee ~ > >> > >> > >> > >> On 5/16/22 14:45, Jos Collin wrote: > >>> It's hard to suggest without the logs. Do verbose logging > >>> debug_mds=20. What's the ceph version? Do you have the logs why the > >>> MDS crashed? > >>> > >>> On 16/05/22 11:20, Felix Lee wrote: > >>>> Dear all, > >>>> We currently have 7 multi-active MDS, with another 7 standby-replay. > >>>> We thought this should cover most of disasters, and it actually did. > >>>> But things just got happened, here is the story: > >>>> One of MDS crashed and standby-replay took over, but got stuck at > >>>> resolve state. > >>>> Then, the other two MDS(rank 0 and 5) received tones of slow > >>>> requests, and my colleague restarted them, thinking the > >>>> standby-replay would take over immediately (this seemed to be wrong > >>>> or at least unnecessary action, I guess...). Then, it resulted three > >>>> of them in resolve state... > >>>> In the meanwhile, I realized that the first failed rank(rank 2) had > >>>> abnormal memory usage and kept getting crashed, after couple > >>>> restarting, the memory usage was back to normal, and then, those > >>>> tree MDS entered into rejoin state. > >>>> Now, this rejoin state is there for three days and keeps going as > >>>> we're speaking. 
Cheers, Dan

On Tue, May 17, 2022 at 11:45 AM Felix Lee <felix@xxxxxxxxxx> wrote:
>
> Yes, we do plan to upgrade Ceph in the near future for sure.
> In any case, I used a brutal way (kinda) to kick rejoin to active by
> setting "mds_wipe_sessions = true" on all the MDS.
> Still, the entire MDS recovery process leaves us blind when trying to
> estimate the service downtime. So I am wondering if there is any way
> for us to estimate the rejoin time, so that we can decide whether to
> wait or take proactive action if necessary.
>
> Best regards,
> Felix Lee ~
>
> On 5/17/22 16:15, Jos Collin wrote:
> > I suggest you upgrade the cluster to the latest release [1], as
> > Nautilus has reached EOL.
> >
> > [1] https://docs.ceph.com/en/latest/releases/
> >
> > On 16/05/22 13:29, Felix Lee wrote:
> >> Hi, Jos,
> >> Many thanks for your reply.
> >> And sorry, I forgot to mention the version, which is 14.2.22.
> >>
> >> Here is the log:
> >> https://drive.google.com/drive/folders/1qzPf64qw16VJDKSzcDoixZ690KL8XSoc?usp=sharing
> >>
> >> Here, ceph01 (active) and ceph11 (standby-replay) were the ones that
> >> crashed. The logs didn't tell us much, but several slow requests
> >> were occurring. And ceph11 had the "cache is too large" warning by
> >> the time it crashed; I suppose that can happen during recovery.
> >> (Each MDS has 64GB of memory, BTW.)
> >> ceph16 is the one currently in rejoin; I've turned debug_mds up to
> >> 20 for a while, see ceph-mds.ceph16.log-20220516.gz
> >>
> >> Thanks
> >> &
> >> Best regards,
> >> Felix Lee ~
> >>
> >> On 5/16/22 14:45, Jos Collin wrote:
> >>> It's hard to suggest anything without the logs. Enable verbose
> >>> logging with debug_mds=20. What's the Ceph version? Do you have
> >>> logs showing why the MDS crashed?
> >>>
> >>> On 16/05/22 11:20, Felix Lee wrote:
> >>>> Dear all,
> >>>> We currently have 7 multi-active MDS, with another 7 standby-replay.
> >>>> We thought this should cover most disasters, and it actually did.
> >>>> But things happened anyway; here is the story:
> >>>> One of the MDS crashed and its standby-replay took over, but got
> >>>> stuck in the resolve state.
> >>>> Then, the other two MDS (rank 0 and 5) received tons of slow
> >>>> requests, and my colleague restarted them, thinking the
> >>>> standby-replay would take over immediately (this seems to have
> >>>> been wrong, or at least unnecessary, I guess...). That left three
> >>>> of them in the resolve state...
> >>>> In the meantime, I realized that the first failed rank (rank 2)
> >>>> had abnormal memory usage and kept crashing; after a couple of
> >>>> restarts, the memory usage was back to normal, and then those
> >>>> three MDS entered the rejoin state.
> >>>> Now, this rejoin state has been there for three days and keeps
> >>>> going as we're speaking. No significant error message shows up
> >>>> even with "debug_mds 10", so we have no idea when it's going to
> >>>> end, or whether it's really on track.
> >>>> So I am wondering: how do we check the MDS rejoin progress/status
> >>>> to make sure it's running normally? Or how do we estimate the
> >>>> rejoin time, and maybe improve it? We always need to give users
> >>>> an estimate of the recovery time.
> >>>>
> >>>> Thanks
> >>>> &
> >>>> Best regards,
> >>>> Felix Lee ~
> >>>>
> >>>
> >>
> >
>
> --
> Felix H.T Lee   Academia Sinica Grid & Cloud.
> Tel: +886-2-27898308
> Office: Room P111, Institute of Physics, 128 Academia Road, Section 2,
> Nankang, Taipei 115, Taiwan
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx