Re: MDS lost, Filesystem degraded and wont mount

On 05/12/2020 09:26, Dan van der Ster wrote:
Hi Janek,

I'd love to hear your standard maintenance procedures. Are you
cleaning up those open files outside of "rejoin" OOMs?

No, of course not. But those rejoin problems happen more often than I'd like them to. It has become much better with recent releases, but if one of the clients trains a TensorFlow model on files in the CephFS or our Hadoop cluster starts reading from it, the MDS will almost certainly crash or at least degrade massively in performance. S3 doesn't have these problems at all, obviously.

That said, our metadata pool resides on rotating platters at the moment and we plan to move it to SSDs, but that should only fix latency issues and not the crash and rejoin problems (btw it doesn't matter how long you set the heartbeat interval, the rejoining MDS will always be replaced by a standby before it's finished).
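For reference, a minimal sketch of how a metadata pool can be pinned to SSD OSDs with a device-class CRUSH rule (the rule name below is made up, and the pool name follows the one used later in this thread; the pool backfills onto the SSD OSDs in the background afterwards):

# ceph osd crush rule create-replicated metadata-ssd default host ssd
# ceph osd pool set cephfs_metadata_pool crush_rule metadata-ssd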



I guess we're pretty lucky with our CephFS's, because we have more than
1k clients and it is pretty solid (though the last upgrade had a
hiccup when decreasing down to a single active MDS).

-- Dan



On Fri, Dec 4, 2020 at 8:20 PM Janek Bevendorff
<janek.bevendorff@xxxxxxxxxxxxx> wrote:
This is a very common issue. Deleting mdsX_openfiles.Y has become part of
my standard maintenance repertoire. As soon as you have a few more
clients and one of them starts opening and closing files in rapid
succession (or does other metadata-heavy things), it becomes very likely
that the MDS crashes and is unable to recover.
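A sketch of that maintenance step, for reference (pool name as used further down in this thread; only do this with all MDS daemons stopped): first list the per-rank open-file table objects, then remove the one for the affected rank:

# rados -p cephfs_metadata_pool ls | grep openfiles
# rados -p cephfs_metadata_pool rm mds0_openfiles.0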

There have been numerous fixes in the past, which improved the overall
stability, but it is far from perfect. I am happy to see another patch
in that direction, but I believe more effort needs to be spent here. It
is way too easy to DoS the MDS from a single client. Our 78-node CephFS
beats our old NFS RAID server in terms of throughput, but latency and
stability are way behind.

Janek

On 04/12/2020 11:39, Dan van der Ster wrote:
Excellent!

For the record, this PR is the plan to fix this:
https://github.com/ceph/ceph/pull/36089
(nautilus, octopus PRs here: https://github.com/ceph/ceph/pull/37382
https://github.com/ceph/ceph/pull/37383)

Cheers, Dan

On Fri, Dec 4, 2020 at 11:35 AM Anton Aleksandrov <anton@xxxxxxxxxxxxxx> wrote:
Thank you very much! This solution helped:

Stop all MDS, then:
# rados -p cephfs_metadata_pool rm mds0_openfiles.0
then start one MDS.

We are back online. Amazing!!!  :)


On 04.12.2020 12:20, Dan van der Ster wrote:
Please also make sure the mds_beacon_grace is high on the mon's too.
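For example (a rough sketch; the value is only an example, and it can also be set in ceph.conf instead of injected at runtime):

# ceph tell mon.* injectargs '--mds_beacon_grace=600'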

It doesn't matter which MDS you select to be the running one.

Is the process getting killed and restarted?
If you're confident that the mds is getting OOM killed during rejoin
step, then you might find this useful:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-August/028964.html

Stop all MDS, then:
# rados -p cephfs_metadata_pool rm mds0_openfiles.0
then start one MDS.
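A sketch of the full sequence with typical systemd unit names (the daemon id "mds1" is only a placeholder, adjust to your deployment):

# systemctl stop ceph-mds.target            <- on every MDS host
# rados -p cephfs_metadata_pool rm mds0_openfiles.0
# systemctl start ceph-mds@mds1             <- on one MDS host only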

-- Dan

On Fri, Dec 4, 2020 at 11:05 AM Anton Aleksandrov <anton@xxxxxxxxxxxxxx> wrote:
Yes, MDS eats all memory+swap, stays like this for a moment and then
frees memory.

mds_beacon_grace was already set to 1800

Also, on the other one this message is seen: "Map has assigned me to become a
standby".

Does it matter, which MDS we stop and which we leave running?

Anton


On 04.12.2020 11:53, Dan van der Ster wrote:
How many active MDS's did you have? (max_mds == 1, right?)

Stop the other two MDS's so you can focus on getting exactly one running.
Tail the log file and see what it is reporting.
Increase mds_beacon_grace to 600 so that the mon doesn't fail this MDS
while it is rejoining.

Is that single MDS running out of memory during the rejoin phase?
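For example, from that MDS's host (default log path; the daemon id "mds1" is only a placeholder), watch whether its resident memory keeps growing until the OOM killer steps in:

# tail -f /var/log/ceph/ceph-mds.mds1.log
# top -p $(pgrep ceph-mds)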

-- dan

On Fri, Dec 4, 2020 at 10:49 AM Anton Aleksandrov <anton@xxxxxxxxxxxxxx> wrote:
Hello community,

we are on ceph 13.2.8 - today something happened with one MDS and ceph
status reports that the filesystem is degraded. It won't mount either. I have
taken the server with the MDS that was not working down. There are 2 more MDS
servers, but they stay in the "rejoin" state. Also, only 1 is shown in
"services", even though there are 2.

Both running MDS servers have these lines in their logs:

heartbeat_map is_healthy 'MDSRank' had timed out after 15
mds.beacon.mds2 Skipping beacon heartbeat to monitors (last acked
28.8979s ago); MDS internal heartbeat is not healthy!

On one of the MDS nodes I enabled more detailed debug logging, so there I am
also getting:

mds.beacon.mds3 Sending beacon up:standby seq 178
mds.beacon.mds3 received beacon reply up:standby seq 178 rtt 0.000999968

It makes no sense and there is too much stress in my head... Could anyone help, please?

Anton.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


