Re: mds crashes after up:replay state

We've seen it use as much as 1.6 TB of RAM and swap. Swap makes it slow, but a slow recovery is better than no recovery. My coworker looked into it at the source-code level: it does some things suboptimally, but that's how it's currently written.

The MDS code needs some real love if Ceph is going to offer file services that can match what the back-end storage can actually provide.

--

Paul Mezzanini
Platform Engineer III
Research Computing

Rochester Institute of Technology

 Sent from my phone, please excuse typos and brevity
________________________________
From: Lars Köppel <lars.koeppel@xxxxxxxxxx>
Sent: Sunday, January 7, 2024 4:20:05 AM
To: Paul Mezzanini <pfmeec@xxxxxxx>
Cc: Patrick Donnelly <pdonnell@xxxxxxxxxx>; ceph-users@xxxxxxx <ceph-users@xxxxxxx>
Subject: Re:  Re: mds crashes after up:replay state

Hi Paul,

your suggestion was correct. The MDS went through the replay state and was in the active state for a few minutes. But then it got killed because of excessive memory consumption:
@mds.cephfs.storage01.pgperp.service: Main process exited, code=exited, status=137/n/a
How can I raise the memory limit for the MDS?

Judging from htop, it looked like a memory leak: the process consumed over 200 GB of memory while reporting that it actually used 20 - 30 GB.
Is this possible?
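
The status=137 in the unit log above fits the picture of the process being killed from outside rather than crashing on its own. By a common convention (used by shells and container runtimes), an exit status above 128 encodes 128 plus a signal number, and 137 - 128 = 9 is SIGKILL, the signal sent by the kernel OOM killer or when a container memory limit is exceeded. A minimal sketch of that arithmetic, assuming the convention applies to this unit (the function name is made up for illustration):

```python
import signal


def describe_exit_status(status: int) -> str:
    """Interpret an exit status using the 128 + signal-number convention."""
    if status > 128:
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Not a signal number known on this platform.
            name = f"signal {signum}"
        return f"terminated by {name}"
    return f"exited with code {status}"


# 137 is the status reported for the MDS unit in the log line above.
print(describe_exit_status(137))
```

If this reading is right, the kernel log on the host should show a matching out-of-memory kill at the same time.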

Best regards
Lars


Lars Köppel
Developer
Email: lars.koeppel@xxxxxxxxxx
Phone: +49 6221 5993580
ariadne.ai (Germany) GmbH
Häusserstraße 3, 69115 Heidelberg
Amtsgericht Mannheim, HRB 744040
Managing Director: Dr. Fabian Svara
https://ariadne.ai


On Sat, Jan 6, 2024 at 3:33 PM Paul Mezzanini <pfmeec@xxxxxxx> wrote:
I'm replying from my phone, so hopefully this works well. This sounds suspiciously similar to an issue we have run into, where there is an internal loop in the MDS that doesn't have a heartbeat in it. If that loop runs for too long, the MDS is marked as failed, and the process jumps to another server and starts again.

We get around it by "wedging it in a corner" and removing the ability to migrate. This is as simple as stopping all standby MDS services and just waiting for the MDS to complete.
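
As a rough illustration of the workaround above, the sketch below only prints the stop commands instead of running them. It assumes a cephadm-managed cluster, where `ceph orch daemon stop <name>` stops a single daemon; the daemon names and states are taken from the FSMap dump quoted further down in this thread, and the helper name is hypothetical.

```python
from __future__ import annotations

import shlex


def stop_commands(daemons: dict[str, str]) -> list[str]:
    """Return one 'ceph orch daemon stop' command per daemon in up:standby."""
    return [
        shlex.join(["ceph", "orch", "daemon", "stop", name])
        for name, state in sorted(daemons.items())
        if state == "up:standby"
    ]


# Daemon names and states as they appear in the FSMap dump in this thread.
daemons = {
    "mds.cephfs.storage01.pgperp": "up:replay",
    "mds.cephfs.storage02.zopcif": "up:standby",
}

# Dry run: print the commands for review rather than executing them.
for cmd in stop_commands(daemons):
    print(cmd)
```

Printing first makes it easy to check that only standby daemons are affected; the stopped daemons need to be started again once the remaining MDS is active.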



________________________________
From: Lars Köppel <lars.koeppel@xxxxxxxxxx>
Sent: Saturday, January 6, 2024 7:22:14 AM
To: Patrick Donnelly <pdonnell@xxxxxxxxxx>
Cc: ceph-users@xxxxxxx
Subject: Re: mds crashes after up:replay state

Hi Patrick,

thank you for your response.
I already changed the mentioned settings, but I had no luck with this.

The journal inspection I had running yesterday finished with: 'Overall
journal integrity: OK'.
So you are probably right that the mds is crashing shortly after the replay
finished.

I checked the logs, and a new FSMap epoch appears every few seconds without
any visible changes. One of the current epochs is included at the end of
this message. Is there anything useful in it?

When the replay is finished, the running MDS goes to the state
'up:reconnect' and a second later to the state 'up:rejoin'. After this,
no new FSMap appears for ~20 min until this message pops up:

> Jan 06 12:38:23 storage01 ceph-mds[223997]:
> mds.beacon.cephfs.storage01.pgperp Skipping beacon heartbeat to monitors
> (last acked 4.00012s ago); MDS internal heartbeat is not healthy!
>
A few seconds later (while the heartbeat message is still being logged) a
new FSMap is created with a new MDS, now in the replay state.
The last of the heartbeat messages reports 1446 seconds since the last ack.
Then the messages stop and no more warnings or errors are logged. One
minute after the last message, the MDS is back as a standby MDS.

> Jan 06 13:02:26 storage01 ceph-mds[223997]:
> mds.beacon.cephfs.storage01.pgperp Skipping beacon heartbeat to monitors
> (last acked 1446.6s ago); MDS internal heartbeat is not healthy!
>
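
To see how long the MDS stayed in this state, the "last acked" ages can be pulled out of the journal lines. The sketch below is only an illustration: the regular expression is an assumption based on the two messages quoted above, not a documented log format, and the function name is made up.

```python
from __future__ import annotations

import re

# Matches the age in "(last acked 1446.6s ago)" as seen in the quoted messages.
ACKED = re.compile(r"last acked ([0-9.]+)s ago")


def last_acked_seconds(log_text: str) -> list[float]:
    """Return every 'last acked' age found in the log text, in order."""
    return [float(m.group(1)) for m in ACKED.finditer(log_text)]


log_text = (
    "Jan 06 12:38:23 storage01 ceph-mds[223997]: "
    "mds.beacon.cephfs.storage01.pgperp Skipping beacon heartbeat to monitors "
    "(last acked 4.00012s ago); MDS internal heartbeat is not healthy!\n"
    "Jan 06 13:02:26 storage01 ceph-mds[223997]: "
    "mds.beacon.cephfs.storage01.pgperp Skipping beacon heartbeat to monitors "
    "(last acked 1446.6s ago); MDS internal heartbeat is not healthy!\n"
)

ages = last_acked_seconds(log_text)
print(f"first: {ages[0]} s, last: {ages[-1]} s")
print(f"unhealthy for roughly {(ages[-1] - ages[0]) / 60:.0f} minutes")
```

The difference between the two ages (about 1443 s) agrees with the gap between the two timestamps, 12:38:23 and 13:02:26.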

Also, I cannot find any warning in the logs when the MDS crashes. What
could I do to find the cause of the crash?

Best regards
Lars

> e205510
> enable_multiple, ever_enabled_multiple: 1,1
> default compat: compat={},rocompat={},incompat={1=base v0.20,2=client
> writeable ranges,3=default file layouts on dirs,4=dir inode in separate
> object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no
> anchor table,9=file layout v2,10=snaprealm v2}
> legacy client fscid: 3
>
> Filesystem 'cephfs' (3)
> fs_name cephfs
> epoch   205510
> flags   32 joinable allow_snaps allow_multimds_snaps allow_standby_replay
> created 2023-06-06T11:44:03.651905+0000
> modified        2024-01-06T10:28:14.676738+0000
> tableserver     0
> root    0
> session_timeout 60
> session_autoclose       300
> max_file_size   8796093022208
> required_client_features        {}
> last_failure    0
> last_failure_osd_epoch  42962
> compat  compat={},rocompat={},incompat={1=base v0.20,2=client writeable
> ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds
> uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline
> data,8=no anchor table,9=file layout v2,10=snaprealm v2}
> max_mds 1
> in      0
> up      {0=2178448}
> failed
> damaged
> stopped
> data_pools      [11,12]
> metadata_pool   10
> inline_data     disabled
> balancer
> standby_count_wanted    1
> [mds.cephfs.storage01.pgperp{0:2178448} state up:replay seq 4484
> join_fscid=3 addr
> [v2:192.168.0.101:6800/855849996,v1:192.168.0.101:6801/855849996]
> compat {c=[1],r=[1],i=[7ff]}]
>
>
> Filesystem 'cephfs_recovery' (4)
> fs_name cephfs_recovery
> epoch   193460
> flags   13 allow_snaps allow_multimds_snaps
> created 2024-01-05T10:47:32.224388+0000
> modified        2024-01-05T16:43:37.677241+0000
> tableserver     0
> root    0
> session_timeout 60
> session_autoclose       300
> max_file_size   1099511627776
> required_client_features        {}
> last_failure    0
> last_failure_osd_epoch  42904
> compat  compat={},rocompat={},incompat={1=base v0.20,2=client writeable
> ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds
> uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline
> data,8=no anchor table,9=file layout v2,10=snaprealm v2}
> max_mds 1
> in      0
> up      {}
> failed
> damaged 0
> stopped
> data_pools      [11,12]
> metadata_pool   13
> inline_data     disabled
> balancer
> standby_count_wanted    1
>
>
> Standby daemons:
>
> [mds.cephfs.storage02.zopcif{-1:2356728} state up:standby seq 1
> join_fscid=3 addr
> [v2:192.168.0.102:6800/3567764205,v1:192.168.0.102:6801/3567764205]
> compat {c=[1],r=[1],i=[7ff]}]
> dumped fsmap epoch 205510
>
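
When a new FSMap epoch shows up every few seconds, comparing the daemon entries between dumps is a quick way to spot what changed. The sketch below extracts name, rank, GID and state from the bracketed MDS lines of a dump like the one above. The pattern is inferred from this dump only, and reading the second number in `{0:2178448}` as the GID is an assumption based on the `up {0=2178448}` line; the class and function names are hypothetical.

```python
from __future__ import annotations

import re
from typing import NamedTuple


class MdsEntry(NamedTuple):
    name: str
    rank: int   # -1 for a standby daemon without a rank
    gid: int
    state: str


# Matches e.g. "[mds.cephfs.storage01.pgperp{0:2178448} state up:replay"
ENTRY = re.compile(r"\[(mds\.[^{\s]+)\{(-?\d+):(\d+)\}\s+state\s+(\S+)")


def parse_mds_entries(fsmap_text: str) -> list[MdsEntry]:
    """Extract one MdsEntry per bracketed MDS line in an FSMap dump."""
    return [
        MdsEntry(m.group(1), int(m.group(2)), int(m.group(3)), m.group(4))
        for m in ENTRY.finditer(fsmap_text)
    ]


# The two daemon lines from the dump above, including the quote prefix.
fsmap_text = (
    "> [mds.cephfs.storage01.pgperp{0:2178448} state up:replay seq 4484\n"
    "> [mds.cephfs.storage02.zopcif{-1:2356728} state up:standby seq 1\n"
)

for entry in parse_mds_entries(fsmap_text):
    print(entry.name, entry.rank, entry.gid, entry.state)
```

Running this over two consecutive dumps and comparing the resulting lists shows whether a rank changed hands or a daemon changed state between epochs.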




On Fri, Jan 5, 2024 at 7:52 PM Patrick Donnelly <pdonnell@xxxxxxxxxx> wrote:

> Hi Lars,
>
> On Fri, Jan 5, 2024 at 9:53 AM Lars Köppel <lars.koeppel@xxxxxxxxxx>
> wrote:
> >
> > Hello everyone,
> >
> > we are running a small cluster with 3 nodes and 25 OSDs per node, on
> > Ceph version 17.2.6.
> > Recently the active MDS crashed, and since then each newly started MDS
> > has always been in the up:replay state. In the output of the command
> > 'ceph tell mds.cephfs:0 status' you can see that the journal is
> > completely read in. As soon as it's finished, the MDS crashes and the
> > next one starts reading the journal.
> >
> > At the moment I have the journal inspection running ('cephfs-journal-tool
> > --rank=cephfs:0 journal inspect').
> >
> > Does anyone have any further suggestions on how I can get the cluster
> > running again as quickly as possible?
>
> Please review:
>
> https://docs.ceph.com/en/reef/cephfs/troubleshooting/#stuck-during-recovery
>
> Note: your MDS is probably not failing in up:replay but shortly after
> reaching one of the later states. Check the mon logs to see what the
> FSMap changes were.
>
>
> Patrick Donnelly, Ph.D.
> He / Him / His
> Red Hat Partner Engineer
> IBM, Inc.
> GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
>
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
