Re: Stuck in replay?

Sake Ceph <ceph@xxxxxxxxxxx> · Mon, 22 Apr 2024 23:30:26 +0200 (CEST)

100 GB of RAM! Damn, that's a lot for a filesystem in my opinion, or am I wrong?

Kind regards, 
Sake 
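
For scale, the two warnings in the 'ceph health detail' output quoted further down in this thread can be turned into ratios. The following is only a rough sketch: the regular expressions assume the message wording exactly as pasted in this thread (other Ceph releases may word it differently), the helper name summarize is made up for illustration, and reading the segment backlog as the reason replay needed so much RAM is an assumption that nobody in the thread confirms.

```python
import re

# Lines copied from the 'ceph health detail' output quoted in this thread
# (wrapped lines rejoined).
HEALTH_DETAIL = """\
mds.slugfs.pr-md-01.xdtppo(mds.0): MDS cache is too large (19GB/8GB); 0 inodes in use by clients, 0 stray files
mds.slugfs.pr-md-01.xdtppo(mds.0): Behind on trimming (127084/250) max_segments: 250, num_segments: 127084
"""


def summarize(text):
    """Return (cache_ratio, segment_ratio) parsed from health-detail text.

    Hypothetical helper: the patterns match only the wording seen in
    this thread and are not an official Ceph output format contract.
    """
    cache = re.search(r"MDS cache is too large \((\d+)GB/(\d+)GB\)", text)
    trim = re.search(r"Behind on trimming \((\d+)/(\d+)\)", text)
    if cache is None or trim is None:
        raise ValueError("expected MDS_CACHE_OVERSIZED and MDS_TRIM lines")
    # Reported cache size divided by the configured cache limit.
    cache_ratio = int(cache.group(1)) / int(cache.group(2))
    # Journal segments currently held divided by max_segments.
    segment_ratio = int(trim.group(1)) / int(trim.group(2))
    return cache_ratio, segment_ratio


cache_ratio, segment_ratio = summarize(HEALTH_DETAIL)
print(f"cache is {cache_ratio:.1f}x its limit")            # cache is 2.4x its limit
print(f"journal holds {segment_ratio:.0f}x max_segments")  # journal holds 508x max_segments
```

With roughly 500 times the configured number of journal segments waiting to be processed, a replay that needs far more memory than the 8 GB cache limit would at least be plausible.
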

> On 22-04-2024 21:50 CEST, Erich Weiler <weiler@xxxxxxxxxxxx> wrote:
> 
>  
> I was able to start another MDS daemon on another node that had 512GB 
> RAM, and then the active MDS eventually migrated there, and went through 
> the replay (which consumed about 100 GB of RAM), and then things 
> recovered.  Phew.  I guess I need significantly more RAM in my MDS 
> servers...  I had no idea the MDS daemon could require that much RAM.
> 
> -erich
> 
> On 4/22/24 11:41 AM, Erich Weiler wrote:
> > Possibly, but it would be pretty time-consuming and difficult...
> > 
> > Is it maybe a RAM issue, since my MDS RAM is filling up?  Should I maybe 
> > bring up another MDS on another server with a huge amount of RAM and move 
> > the MDS there, in hopes it will have enough RAM to complete the replay?
> > 
> > On 4/22/24 11:37 AM, Sake Ceph wrote:
> >> Just a question: is it possible to block or disable all clients? Just 
> >> to prevent load on the system.
> >>
> >> Kind regards,
> >> Sake
> >>> On 22-04-2024 20:33 CEST, Erich Weiler <weiler@xxxxxxxxxxxx> wrote:
> >>>
> >>> I also see this from 'ceph health detail':
> >>>
> >>> # ceph health detail
> >>> HEALTH_WARN 1 filesystem is degraded; 1 MDSs report oversized cache; 1 MDSs behind on trimming
> >>> [WRN] FS_DEGRADED: 1 filesystem is degraded
> >>>       fs slugfs is degraded
> >>> [WRN] MDS_CACHE_OVERSIZED: 1 MDSs report oversized cache
> >>>       mds.slugfs.pr-md-01.xdtppo(mds.0): MDS cache is too large (19GB/8GB); 0 inodes in use by clients, 0 stray files
> >>> [WRN] MDS_TRIM: 1 MDSs behind on trimming
> >>>       mds.slugfs.pr-md-01.xdtppo(mds.0): Behind on trimming (127084/250) max_segments: 250, num_segments: 127084
> >>>
> >>> MDS cache too large?  The mds process is taking up 22GB right now and
> >>> starting to swap my server, so maybe it somehow is too large....
> >>>
> >>> On 4/22/24 11:17 AM, Erich Weiler wrote:
> >>>> Hi All,
> >>>>
> >>>> We have a somewhat serious situation where we have a cephfs filesystem
> >>>> (18.2.1), and 2 active MDSs (one standby).  I tried to restart one of
> >>>> the active daemons to unstick a bunch of blocked requests, and the
> >>>> standby went into 'replay' for a very long time, then RAM on that MDS
> >>>> server filled up, and it just stayed there for a while then eventually
> >>>> appeared to give up and switched to the standby, but the cycle started
> >>>> again.  So I restarted that MDS, and now I'm in a situation where I see
> >>>> this:
> >>>>
> >>>> # ceph fs status
> >>>> slugfs - 29 clients
> >>>> ======
> >>>> RANK   STATE            MDS            ACTIVITY   DNS    INOS   DIRS   CAPS
> >>>>    0     replay  slugfs.pr-md-01.xdtppo            3958k  57.1k  12.2k     0
> >>>>    1    resolve  slugfs.pr-md-02.sbblqq               0      3      1      0
> >>>>          POOL           TYPE     USED  AVAIL
> >>>>    cephfs_metadata    metadata   997G  2948G
> >>>> cephfs_md_and_data    data       0   87.6T
> >>>>      cephfs_data        data     773T   175T
> >>>>        STANDBY MDS
> >>>> slugfs.pr-md-03.mclckv
> >>>> MDS version: ceph version 18.2.1 (7fe91d5d5842e04be3b4f514d6dd990c54b29c76) reef (stable)
> >>>>
> >>>> It just stays there indefinitely.  All my clients are hung.  I tried
> >>>> restarting all MDS daemons and they just went back to this state after
> >>>> coming back up.
> >>>>
> >>>> Is there any way I can somehow escape this state of indefinite
> >>>> replay/resolve?
> >>>>
> >>>> Thanks so much!  I'm kinda nervous since none of my clients have
> >>>> filesystem access at the moment...
> >>>>
> >>>> cheers,
> >>>> erich
> >>> _______________________________________________
> >>> ceph-users mailing list -- ceph-users@xxxxxxx
> >>> To unsubscribe send an email to ceph-users-leave@xxxxxxx