Re: Reasonable MDS rejoin time?

Hi Felix,

"rejoin" took awhile in the past because the MDS needs to reload all
inodes for all the open directories at boot time.
In our experience this can take ~10 minutes on the most active clusters.
In your case, I wonder if the MDS was going OOM in a loop while
recovering? This was happening to us before -- there are recipes on
this ML how to remove the "openfiles" objects to get out of that
situation.
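For reference, the recipe was roughly the following (a sketch from
memory -- the pool and object names here are examples, so double-check
them against your own metadata pool before deleting anything):

  # find the open file table objects in the CephFS metadata pool
  rados -p cephfs_metadata ls | grep openfiles
  # remove the objects for the affected rank, e.g. rank 2:
  rados -p cephfs_metadata rm mds2_openfiles.0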

Anyway, Octopus now has a feature to skip preloading the direntries at
rejoin time: https://github.com/ceph/ceph/pull/44667
This will become the default soon, but you can already switch off the
preload now. In our experience, rejoin now takes under a minute even on
the busiest clusters.
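If I remember correctly, the option to flip is mds_oft_prefetch_dirfrags
(please verify the name on your version with "ceph config help" before
setting it):

  # skip the open file table dirfrag prefetch during rejoin
  ceph config set mds mds_oft_prefetch_dirfrags false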

Cheers, Dan

On Tue, May 17, 2022 at 11:45 AM Felix Lee <felix@xxxxxxxxxx> wrote:
>
> Yes, we do plan to upgrade Ceph in the near future for sure.
> In any case, I used a brutal way (kinda) to kick rejoin to active by
> setting "mds_wipe_sessions = true" on all MDSes.
> Still, the entire MDS recovery process leaves us blind when estimating
> the service downtime. So, I am wondering if there is any way for us to
> estimate the rejoin time, so that we can decide whether to wait or take
> proactive action if necessary.
>
>
>
> Best regards,
> Felix Lee ~
>
> On 5/17/22 16:15, Jos Collin wrote:
> > I suggest you upgrade the cluster to the latest release [1], as
> > Nautilus has reached EOL.
> >
> > [1] https://docs.ceph.com/en/latest/releases/
> >
> > On 16/05/22 13:29, Felix Lee wrote:
> >> Hi, Jos,
> >> Many thanks for your reply.
> >> And sorry, I forgot to mention the version, which is 14.2.22.
> >>
> >> Here is the log:
> >> https://drive.google.com/drive/folders/1qzPf64qw16VJDKSzcDoixZ690KL8XSoc?usp=sharing
> >>
> >>
> >> Here, ceph01 (active) and ceph11 (standby-replay) were the ones that
> >> crashed. The logs didn't tell us much, but several slow requests
> >> were occurring. Also, ceph11 had a "cache is too large" warning by
> >> the time it crashed; I suppose that could happen during recovery.
> >> (Each MDS has 64GB of memory, BTW.)
> >> ceph16 is the one currently in rejoin; I've turned debug_mds up to 20
> >> for a while, see ceph-mds.ceph16.log-20220516.gz
> >>
> >>
> >> Thanks
> >> &
> >> Best regards,
> >> Felix Lee ~
> >>
> >>
> >>
> >> On 5/16/22 14:45, Jos Collin wrote:
> >>> It's hard to suggest anything without the logs. Please do verbose
> >>> logging with debug_mds=20. What's the ceph version? Do you have the
> >>> logs showing why the MDS crashed?
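> >>> For example, something like this (the daemon name is a placeholder):
> >>>
> >>>   # raise MDS debug logging on a running daemon via the admin socket
> >>>   ceph daemon mds.<name> config set debug_mds 20
> >>>   # or cluster-wide through the config database
> >>>   ceph config set mds debug_mds 20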
> >>>
> >>> On 16/05/22 11:20, Felix Lee wrote:
> >>>> Dear all,
> >>>> We currently have 7 active MDSes (multi-active setup), with another
> >>>> 7 as standby-replay.
> >>>> We thought this should cover most disasters, and it actually did.
> >>>> But then things happened; here is the story:
> >>>> One of the MDSes crashed and its standby-replay took over, but got
> >>>> stuck in the resolve state.
> >>>> Then, two other MDSes (rank 0 and 5) received tons of slow
> >>>> requests, and my colleague restarted them, thinking the
> >>>> standby-replays would take over immediately (this seems to have been
> >>>> wrong, or at least unnecessary, I guess...). That left three
> >>>> of them in the resolve state...
> >>>> In the meantime, I realized that the first failed rank (rank 2) had
> >>>> abnormal memory usage and kept crashing; after a couple of
> >>>> restarts, the memory usage was back to normal, and then those
> >>>> three MDSes entered the rejoin state.
> >>>> Now, this rejoin state has been going on for three days and counting.
> >>>> No significant error messages show up even with "debug_mds 10", so
> >>>> we have no idea when it's going to end or whether it's really on
> >>>> track.
> >>>> So, I am wondering: how do we check MDS rejoin progress/status to
> >>>> make sure it's running normally? And how do we estimate the
> >>>> rejoin time and maybe improve it? We always need to give
> >>>> users a time estimate for recovery.
> >>>>
> >>>>
> >>>> Thanks
> >>>> &
> >>>> Best regards,
> >>>> Felix Lee ~
> >>>>
> >>>
> >>
> >
> >
>
> --
> Felix H.T Lee                           Academia Sinica Grid & Cloud.
> Tel: +886-2-27898308
> Office: Room P111, Institute of Physics, 128 Academia Road, Section 2,
> Nankang, Taipei 115, Taiwan
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


