Re: Reasonable MDS rejoin time?

Hi, Dan,
> In our experience this can take ~10 minutes on the most active clusters.
Many thanks, this information is quite helpful for us.

> this ML how to remove the "openfiles" objects to get out of that
Yes, I read those mail threads as well. In fact, that was going to be my next move after setting "mds_wipe_sessions = true", but since the MDS soon became active, I didn't go ahead with removing openfiles.0.
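
For the record, in case someone else lands on this thread later: the recipe I had lined up was roughly the following. We never actually ran it since the MDS recovered first, the pool name below is just ours, and the object names are only what I would expect for rank 0 -- please list your own metadata pool before removing anything, and only do this with the affected MDS stopped.

    # list the per-rank open file table objects in the metadata pool
    rados -p cephfs_metadata ls | grep openfiles
    # expect names like mds0_openfiles.0, mds0_openfiles.1, ...
    # remove the objects belonging to the stuck rank only
    rados -p cephfs_metadata rm mds0_openfiles.0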

> Anyway, Octopus now has a feature to skip preloading the direntries at
> rejoin time: https://github.com/ceph/ceph/pull/44667
Sounds great, that gives us good motivation to speed up the Ceph upgrade.

Again, thank you all for the great input
&
Best regards,
Felix Lee ~



On 5/17/22 19:41, Dan van der Ster wrote:
Hi Felix,

"rejoin" took awhile in the past because the MDS needs to reload all
inodes for all the open directories at boot time.
In our experience this can take ~10 minutes on the most active clusters.
In your case, I wonder if the MDS was going OOM in a loop while
recovering? This was happening to us before -- there are recipes on
this ML how to remove the "openfiles" objects to get out of that
situation.

Anyway, Octopus now has a feature to skip preloading the direntries at
rejoin time: https://github.com/ceph/ceph/pull/44667
This will become the default soon, but you can already switch the
preload off now. In our experience, rejoin now takes under a minute on
even the busiest clusters.
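
If I remember the option from that PR correctly it is
mds_oft_prefetch_dirfrags -- please double-check the name against the docs
of whichever release you end up on -- so after the upgrade it would be
roughly:

    # skip preloading open file table direntries during rejoin
    ceph config set mds mds_oft_prefetch_dirfrags false

and the next rejoin should then skip the preload.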

Cheers, Dan

On Tue, May 17, 2022 at 11:45 AM Felix Lee <felix@xxxxxxxxxx> wrote:

Yes, we do plan to upgrade Ceph in the near future for sure.
In any case, I used a (kinda) brutal way to kick the MDSes from rejoin to
active, by setting "mds_wipe_sessions = true" on all of them.
Still, the entire MDS recovery process leaves us blind when trying to
estimate the service downtime. So I am wondering: is there any way for us
to estimate the rejoin time, so that we can decide whether to wait or to
take proactive action if necessary?
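
For completeness, what I did amounts to the following (the exact mechanism
for setting the option -- the config db as below, or ceph.conf plus a
restart -- probably doesn't matter much, and this is obviously a
last-resort knob that drops client sessions):

    # set the wipe flag for all MDS daemons, then restart the stuck ones
    ceph config set mds mds_wipe_sessions true
    # once they are active again, revert it
    ceph config set mds mds_wipe_sessions false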



Best regards,
Felix Lee ~

On 5/17/22 16:15, Jos Collin wrote:
I suggest you upgrade the cluster to the latest release [1], as
Nautilus has reached EOL.

[1] https://docs.ceph.com/en/latest/releases/

On 16/05/22 13:29, Felix Lee wrote:
Hi, Jos,
Many thanks for your reply.
And sorry, I forgot to mention the version, which is 14.2.22.

Here is the log:
https://drive.google.com/drive/folders/1qzPf64qw16VJDKSzcDoixZ690KL8XSoc?usp=sharing


Here, ceph01 (active) and ceph11 (standby-replay) were the ones that
crashed. The logs didn't tell us much, except that several slow requests
were occurring. Also, ceph11 had a "cache is too large" warning by the
time it crashed; I suppose that can happen during recovery.
(Each MDS has 64 GB of memory, BTW.)
ceph16 is the one currently in rejoin; I turned debug_mds up to 20 for a
while, see ceph-mds.ceph16.log-20220516.gz
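
For anyone following along: a generic way to sanity-check the cache
against its limit on the recovering daemon would be something like the
commands below. mds.ceph16 is just the name of our rejoining daemon,
adjust to yours.

    # configured MDS cache memory limit
    ceph config get mds mds_cache_memory_limit
    # live cache usage of the rejoining daemon (run on its host)
    ceph daemon mds.ceph16 cache status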


Thanks
&
Best regards,
Felix Lee ~



On 5/16/22 14:45, Jos Collin wrote:
It's hard to suggest anything without the logs. Enable verbose logging
with debug_mds=20. What's the Ceph version? Do you have the logs showing
why the MDS crashed?
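
For example, on the daemon that is currently recovering, something along
these lines (adjust the daemon name):

    # raise the MDS debug level at runtime, on the MDS host
    ceph daemon mds.<name> config set debug_mds 20
    # or remotely
    ceph tell mds.<name> config set debug_mds 20
    # drop it back afterwards, the logs grow quickly
    ceph tell mds.<name> config set debug_mds 1/5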

On 16/05/22 11:20, Felix Lee wrote:
Dear all,
We currently have 7 multi-active MDSes, with another 7 in standby-replay.
We thought this should cover most disasters, and it actually did.
But then things happened anyway; here is the story:
One of the MDSes crashed and its standby-replay took over, but got stuck
in the resolve state.
Then the other two MDSes (ranks 0 and 5) received tons of slow
requests, and my colleague restarted them, thinking their
standby-replays would take over immediately (this seems to have been
wrong, or at least unnecessary, I guess...). That left three
of them in the resolve state...
In the meantime, I realized that the first failed rank (rank 2) had
abnormal memory usage and kept crashing. After a couple of
restarts, the memory usage was back to normal, and then those
three MDSes entered the rejoin state.
Now, this rejoin state has lasted for three days and is still ongoing as
we speak. No significant error messages show up even
with "debug_mds 10", so we have no idea when it is going to end or
whether it is really on track.
So, I am wondering: how do we check the MDS rejoin progress/status to
make sure it is running normally? Or, how do we estimate the
rejoin time and maybe improve it? We always need to give users a time
estimate for the recovery.
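
The only generic things I can think of watching are roughly the ones
below, but I am not sure how well they reflect actual rejoin progress --
corrections welcome. mds.<name> is the rejoining daemon, and the "ceph
daemon" commands have to run on its host.

    # overall state of each rank
    ceph fs status
    ceph health detail
    # is the rejoining MDS still issuing RADOS reads, i.e. still doing work?
    ceph daemon mds.<name> objecter_requests
    # cache/inode counters growing over time would also suggest progress
    ceph daemon mds.<name> perf dump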


Thanks
&
Best regards,
Felix Lee ~





--
Felix H.T Lee                           Academia Sinica Grid & Cloud.
Tel: +886-2-27898308
Office: Room P111, Institute of Physics, 128 Academia Road, Section 2, Nankang, Taipei 115, Taiwan


_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


