Re: Stuck in replay?

IIRC, you have 8 GB configured for the MDS cache memory limit (mds_cache_memory_limit), and it doesn't seem to be enough. Does the host run into the OOM killer as well? But it's definitely a good approach to increase the cache limit (try 24 GB if possible, since it's trying to use at least 19 GB) on a host with enough RAM. Or can you maybe pinpoint a single client?
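(For what it's worth, a minimal sketch of how the limit could be raised cluster-wide via ceph config; the 24 GiB value is just the suggestion above and should match the RAM actually available on the MDS host:)

# show the current limit (the default is 4 GiB if it was never changed)
ceph config get mds mds_cache_memory_limit
# raise it to 24 GiB (25769803776 bytes) for all MDS daemons
ceph config set mds mds_cache_memory_limit 25769803776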

Quoting Erich Weiler <weiler@xxxxxxxxxxxx>:

Possibly, but it would be pretty time-consuming and difficult...

Is it maybe a RAM issue, since my MDS RAM is filling up? Should I maybe bring up another MDS on another server with a huge amount of RAM and move the MDS there, in hopes it will have enough RAM to complete the replay?
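(If the cluster is managed by cephadm, one way to do that might look roughly like the following; "big-ram-host" is a hypothetical host name and the placement list would need to name whichever existing MDS hosts you want to keep:)

# make the large-RAM host known to the orchestrator (if it isn't already)
ceph orch host add big-ram-host
# redeploy the mds service so a daemon is also placed on that host
ceph orch apply mds slugfs --placement="pr-md-01 pr-md-02 pr-md-03 big-ram-host"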

On 4/22/24 11:37 AM, Sake Ceph wrote:
Just a question: is it possible to block or disable all clients? Just to prevent load on the system.
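(If your Reef build has it, the filesystem-level refuse_client_session flag is one way to do that; it makes the MDS reject new and reconnecting client sessions until you flip it back:)

# stop accepting client sessions while recovery is running
ceph fs set slugfs refuse_client_session true
# allow clients back in once the filesystem is healthy
ceph fs set slugfs refuse_client_session false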

Kind regards,
Sake
On 22-04-2024 20:33 CEST, Erich Weiler <weiler@xxxxxxxxxxxx> wrote:

 I also see this from 'ceph health detail':

# ceph health detail
HEALTH_WARN 1 filesystem is degraded; 1 MDSs report oversized cache; 1
MDSs behind on trimming
[WRN] FS_DEGRADED: 1 filesystem is degraded
     fs slugfs is degraded
[WRN] MDS_CACHE_OVERSIZED: 1 MDSs report oversized cache
     mds.slugfs.pr-md-01.xdtppo(mds.0): MDS cache is too large
(19GB/8GB); 0 inodes in use by clients, 0 stray files
[WRN] MDS_TRIM: 1 MDSs behind on trimming
     mds.slugfs.pr-md-01.xdtppo(mds.0): Behind on trimming (127084/250)
max_segments: 250, num_segments: 127084

MDS cache too large?  The MDS process is taking up 22 GB right now and my
server is starting to swap, so maybe it somehow is too large...
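(A quick way to compare the daemon's actual cache usage against the configured limit, assuming the MDS still answers admin commands in this state:)

# what the MDS reports for its cache usage
ceph tell mds.slugfs.pr-md-01.xdtppo cache status
# the currently configured limit
ceph config get mds mds_cache_memory_limit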

On 4/22/24 11:17 AM, Erich Weiler wrote:
Hi All,

We have a somewhat serious situation with a CephFS filesystem (18.2.1)
and 2 active MDSs (one standby).  I tried to restart one of the active
daemons to unstick a bunch of blocked requests, and the standby went
into 'replay' for a very long time; then RAM on that MDS server filled
up, it stayed there for a while, and it eventually appeared to give up
and switched to the standby, but the cycle started again.  So I
restarted that MDS, and now I'm in a situation where I see this:

# ceph fs status
slugfs - 29 clients
======
RANK   STATE            MDS             ACTIVITY   DNS    INOS   DIRS   CAPS
 0     replay   slugfs.pr-md-01.xdtppo             3958k  57.1k  12.2k     0
 1     resolve  slugfs.pr-md-02.sbblqq                0      3      1      0
        POOL           TYPE     USED  AVAIL
  cephfs_metadata    metadata   997G  2948G
cephfs_md_and_data    data       0   87.6T
    cephfs_data        data     773T   175T
      STANDBY MDS
slugfs.pr-md-03.mclckv
MDS version: ceph version 18.2.1
(7fe91d5d5842e04be3b4f514d6dd990c54b29c76) reef (stable)

It just stays there indefinitely.  All my clients are hung.  I tried
restarting all MDS daemons and they just went back to this state after
coming back up.

Is there any way I can somehow escape this state of indefinite
replay/resolve?

Thanks so much!  I'm kinda nervous since none of my clients have
filesystem access at the moment...

cheers,
erich


_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



