Re: mds crashes after up:replay state

Hi Paul,
Could you create a Ceph tracker issue (tracker.ceph.com) and list the things
that are suboptimal according to your investigation?
We'd like to hear more about this.

Alternatively, you could list the issues with the MDS here.

Thanks,
Milind

On Sun, Jan 7, 2024 at 4:37 PM Paul Mezzanini <pfmeec@xxxxxxx> wrote:
>
> We've seen it use as much as 1.6 TB of RAM/swap. Swap makes it slow, but a slow recovery is better than no recovery. My coworker looked into it at the source code level, and while it is doing some things suboptimally, that's how it's currently written.
>
> The MDS code needs some real love if Ceph is going to offer file services that can match what the back-end storage can actually provide.
>
> --
>
> Paul Mezzanini
> Platform Engineer III
> Research Computing
>
> Rochester Institute of Technology
>
>  Sent from my phone, please excuse typos and brevity
> ________________________________
> From: Lars Köppel <lars.koeppel@xxxxxxxxxx>
> Sent: Sunday, January 7, 2024 4:20:05 AM
> To: Paul Mezzanini <pfmeec@xxxxxxx>
> Cc: Patrick Donnelly <pdonnell@xxxxxxxxxx>; ceph-users@xxxxxxx <ceph-users@xxxxxxx>
> Subject: Re:  Re: mds crashes after up:replay state
>
> Hi Paul,
>
> Your suggestion was correct. The MDS went through the replay state and was in the active state for a few minutes, but then it was killed because of excessive memory consumption.
> @mds.cephfs.storage01.pgperp.service: Main process exited, code=exited, status=137/n/a
> How can I raise the memory limit for the MDS?
>
> From the looks of it in htop, there seems to be a memory leak: the MDS consumed over 200 GB of memory while reporting that it actually used only 20-30 GB.
> Is this possible?
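>
> (For reference, a minimal sketch of the knobs involved; the values and the full
> systemd unit name are assumptions. status=137 means the process was SIGKILLed,
> usually by the kernel OOM killer or a cgroup memory cap rather than by the MDS itself:
>
>     # Target size of the MDS cache in bytes (here ~40 GiB); actual RSS can exceed
>     # this target, especially during replay/rejoin
>     ceph config set mds mds_cache_memory_limit 42949672960
>
>     # If a systemd/cgroup cap is the limit, an override on the MDS unit could lift it,
>     # assuming the unit is named ceph-<fsid>@mds.cephfs.storage01.pgperp.service:
>     systemctl edit ceph-<fsid>@mds.cephfs.storage01.pgperp.service
>     # [Service]
>     # MemoryMax=infinity
> )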
>
> Best regards
> Lars
>
>
> [ariadne.ai Logo]       Lars Köppel
> Developer
> Email:  lars.koeppel@xxxxxxxxxx
> Phone:  +49 6221 5993580
> ariadne.ai (Germany) GmbH
> Häusserstraße 3, 69115 Heidelberg
> Amtsgericht Mannheim, HRB 744040
> Geschäftsführer: Dr. Fabian Svara
> https://ariadne.ai
>
>
> On Sat, Jan 6, 2024 at 3:33 PM Paul Mezzanini <pfmeec@xxxxxxx> wrote:
> I'm replying from my phone so hopefully this works well. This sounds suspiciously similar to an issue we have run into where there is an internal loop in the MDS that doesn't have a heartbeat in it. If that loop runs for too long, the MDS is marked as failed and the process jumps to another server and starts again.
>
> We get around it by "wedging it in a corner" and removing the ability to migrate. This is as simple as stopping all standby MDS services and just waiting for the MDS to complete.
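>
> A minimal sketch of that approach, assuming a cephadm-managed cluster and the daemon
> names shown in the fsmap further down this thread:
>
>     # Keep the rank from failing over mid-recovery: disable standby-replay and stop the standby
>     ceph fs set cephfs allow_standby_replay false
>     ceph orch daemon stop mds.cephfs.storage02.zopcif
>
>     # Once rank 0 reaches up:active and stays there, bring the standby back
>     ceph orch daemon start mds.cephfs.storage02.zopcif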
>
>
>
> --
>
> Paul Mezzanini
> Platform Engineer III
> Research Computing
>
> Rochester Institute of Technology
>
>  Sent from my phone, please excuse typos and brevity
> ________________________________
> From: Lars Köppel <lars.koeppel@xxxxxxxxxx>
> Sent: Saturday, January 6, 2024 7:22:14 AM
> To: Patrick Donnelly <pdonnell@xxxxxxxxxx>
> Cc: ceph-users@xxxxxxx <ceph-users@xxxxxxx>
> Subject:  Re: mds crashes after up:replay state
>
> Hi Patrick,
>
> thank you for your response.
> I already changed the settings you mentioned, but had no luck with them.
>
> The journal inspection I had running yesterday finished with: 'Overall
> journal integrity: OK'.
> So you are probably right that the MDS is crashing shortly after the replay
> finishes.
>
> I checked the logs, and every few seconds there is a new FSMap epoch without
> any visible changes. One of the current epochs is at the end of this mail. Is
> there anything useful in it?
>
> When the replay is finished, the running MDS goes to the state
> 'up:reconnect' and, after a second, to the state 'up:rejoin'. After this
> there is no new FSMap for ~20 minutes, until this message pops up:
>
> > Jan 06 12:38:23 storage01 ceph-mds[223997]:
> > mds.beacon.cephfs.storage01.pgperp Skipping beacon heartbeat to monitors
> > (last acked 4.00012s ago); MDS internal heartbeat is not healthy!
> >
> A few seconds later (while the heartbeat message is still there) a new FSMap is
> created with a new MDS now in the replay state.
> The last of the heartbeat messages appears after 1446 seconds. Then it is gone,
> and no more warnings or errors are displayed at this point. One minute
> after the last message the MDS is back as a standby MDS.
>
> > Jan 06 13:02:26 storage01 ceph-mds[223997]:
> > mds.beacon.cephfs.storage01.pgperp Skipping beacon heartbeat to monitors
> > (last acked 1446.6s ago); MDS internal heartbeat is not healthy!
> >
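>
> (This looks like the monitors replacing the rank after missed beacons. A minimal
> sketch of one possible mitigation, assuming the beacon grace period is the limiting
> factor; the value is arbitrary and should be reverted after recovery:
>
>     # Let the MDS miss beacons for longer before the mons mark the rank as failed
>     ceph config set mon mds_beacon_grace 1600
> )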
>
> Also, I cannot find any warning in the logs when the MDS crashes. What
> could I do to find the cause of the crash?
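>
> A minimal sketch of how more detail could be captured, assuming default log
> locations; the debug levels are just examples and should be lowered again afterwards:
>
>     # Check whether the crash module recorded anything
>     ceph crash ls
>     ceph crash info <crash-id>
>
>     # Temporarily raise MDS logging, reproduce the failure, then check /var/log/ceph/
>     ceph config set mds debug_mds 10
>     ceph config set mds debug_ms 1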
>
> Best regards
> Lars
>
> e205510
> > enable_multiple, ever_enabled_multiple: 1,1
> > default compat: compat={},rocompat={},incompat={1=base v0.20,2=client
> > writeable ranges,3=default file layouts on dirs,4=dir inode in separate
> > object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no
> > anchor table,9=file layout v2,10=snaprealm v2}
> > legacy client fscid: 3
> >
> > Filesystem 'cephfs' (3)
> > fs_name cephfs
> > epoch   205510
> > flags   32 joinable allow_snaps allow_multimds_snaps allow_standby_replay
> > created 2023-06-06T11:44:03.651905+0000
> > modified        2024-01-06T10:28:14.676738+0000
> > tableserver     0
> > root    0
> > session_timeout 60
> > session_autoclose       300
> > max_file_size   8796093022208
> > required_client_features        {}
> > last_failure    0
> > last_failure_osd_epoch  42962
> > compat  compat={},rocompat={},incompat={1=base v0.20,2=client writeable
> > ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds
> > uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline
> > data,8=no anchor table,9=file layout v2,10=snaprealm v2}
> > max_mds 1
> > in      0
> > up      {0=2178448}
> > failed
> > damaged
> > stopped
> > data_pools      [11,12]
> > metadata_pool   10
> > inline_data     disabled
> > balancer
> > standby_count_wanted    1
> > [mds.cephfs.storage01.pgperp{0:2178448} state up:replay seq 4484
> > join_fscid=3 addr [v2:
> > 192.168.0.101:6800/855849996,v1:192.168.0.101:6801/855849996] compat
> > {c=[1],r=[1],i=[7ff]}]
> >
> >
> > Filesystem 'cephfs_recovery' (4)
> > fs_name cephfs_recovery
> > epoch   193460
> > flags   13 allow_snaps allow_multimds_snaps
> > created 2024-01-05T10:47:32.224388+0000
> > modified        2024-01-05T16:43:37.677241+0000
> > tableserver     0
> > root    0
> > session_timeout 60
> > session_autoclose       300
> > max_file_size   1099511627776
> > required_client_features        {}
> > last_failure    0
> > last_failure_osd_epoch  42904
> > compat  compat={},rocompat={},incompat={1=base v0.20,2=client writeable
> > ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds
> > uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline
> > data,8=no anchor table,9=file layout v2,10=snaprealm v2}
> > max_mds 1
> > in      0
> > up      {}
> > failed
> > damaged 0
> > stopped
> > data_pools      [11,12]
> > metadata_pool   13
> > inline_data     disabled
> > balancer
> > standby_count_wanted    1
> >
> >
> > Standby daemons:
> >
> > [mds.cephfs.storage02.zopcif{-1:2356728} state up:standby seq 1
> > join_fscid=3 addr [v2:
> > 192.168.0.102:6800/3567764205,v1:192.168.0.102:6801/3567764205] compat
> > {c=[1],r=[1],i=[7ff]}]
> > dumped fsmap epoch 205510
> >
>
>
> [ariadne.ai Logo] Lars Köppel
> Developer
> Email: lars.koeppel@xxxxxxxxxx
> Phone: +49 6221 5993580
> ariadne.ai (Germany) GmbH
> Häusserstraße 3, 69115 Heidelberg
> Amtsgericht Mannheim, HRB 744040
> Geschäftsführer: Dr. Fabian Svara
> https://ariadne.ai
>
>
> On Fri, Jan 5, 2024 at 7:52 PM Patrick Donnelly <pdonnell@xxxxxxxxxx> wrote:
>
> > Hi Lars,
> >
> > On Fri, Jan 5, 2024 at 9:53 AM Lars Köppel <lars.koeppel@xxxxxxxxxx>
> > wrote:
> > >
> > > Hello everyone,
> > >
> > > we are running a small cluster with 3 nodes and 25 OSDs per node, on
> > > Ceph version 17.2.6.
> > > Recently the active MDS crashed, and since then the newly starting MDS has
> > > always been in the up:replay state. In the output of the command 'ceph
> > > tell mds.cephfs:0 status' you can see that the journal is completely read in.
> > > As soon as it's finished, the MDS crashes and the next one starts reading
> > > the journal.
> > >
> > > At the moment I have the journal inspection running ('cephfs-journal-tool
> > > --rank=cephfs:0 journal inspect').
> > >
> > > Does anyone have any further suggestions on how I can get the cluster
> > > running again as quickly as possible?
> >
> > Please review:
> >
> > https://docs.ceph.com/en/reef/cephfs/troubleshooting/#stuck-during-recovery
> >
> > Note: your MDS is probably not failing in up:replay but shortly after
> > reaching one of the later states. Check the mon logs to see what the
> > FSMap changes were.
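> >
> > A minimal sketch of how those transitions could be followed, assuming the mons log
> > to the default /var/log/ceph/ path (containerized deployments may log to journald instead):
> >
> >     # Current fsmap and filesystem status
> >     ceph fs dump
> >     ceph fs status
> >
> >     # MDS state transitions as recorded by the monitors
> >     grep -E "up:(replay|reconnect|rejoin|active)" /var/log/ceph/ceph-mon.*.log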
> >
> >
> > Patrick Donnelly, Ph.D.
> > He / Him / His
> > Red Hat Partner Engineer
> > IBM, Inc.
> > GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
> >
> >
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx



-- 
Milind
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



