100 GB of RAM! Damn, that's a lot for a filesystem in my opinion, or am I wrong?
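If it really is the MDS cache blowing up, the knobs that seem relevant to the warnings further down would be roughly these (just a sketch on my side, not tested against your cluster; the 68719476736, about 64 GiB, is only an example value):

# ceph config get mds mds_cache_memory_limit
# ceph config set mds mds_cache_memory_limit 68719476736
# ceph tell mds.slugfs.pr-md-01.xdtppo cache status

As far as I know mds_cache_memory_limit is only a soft target, so replay can still overshoot it, and the cache status command will only answer once the daemon is responsive again.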
Kind regards,
Sake

> On 22-04-2024 21:50 CEST, Erich Weiler <weiler@xxxxxxxxxxxx> wrote:
>
> I was able to start another MDS daemon on another node that had 512GB
> RAM, and then the active MDS eventually migrated there and went through
> the replay (which consumed about 100 GB of RAM), and then things
> recovered. Phew. I guess I need significantly more RAM in my MDS
> servers... I had no idea the MDS daemon could require that much RAM.
>
> -erich
>
> On 4/22/24 11:41 AM, Erich Weiler wrote:
> > Possibly, but it would be pretty time consuming and difficult...
> >
> > Is it maybe a RAM issue, since my MDS RAM is filling up? Should I maybe
> > bring up another MDS on another server with a huge amount of RAM and
> > move the MDS there, in the hope it will have enough RAM to complete
> > the replay?
> >
> > On 4/22/24 11:37 AM, Sake Ceph wrote:
> >> Just a question: is it possible to block or disable all clients? Just
> >> to prevent load on the system.
> >>
> >> Kind regards,
> >> Sake
> >>> On 22-04-2024 20:33 CEST, Erich Weiler <weiler@xxxxxxxxxxxx> wrote:
> >>>
> >>> I also see this from 'ceph health detail':
> >>>
> >>> # ceph health detail
> >>> HEALTH_WARN 1 filesystem is degraded; 1 MDSs report oversized cache;
> >>> 1 MDSs behind on trimming
> >>> [WRN] FS_DEGRADED: 1 filesystem is degraded
> >>>     fs slugfs is degraded
> >>> [WRN] MDS_CACHE_OVERSIZED: 1 MDSs report oversized cache
> >>>     mds.slugfs.pr-md-01.xdtppo(mds.0): MDS cache is too large
> >>>     (19GB/8GB); 0 inodes in use by clients, 0 stray files
> >>> [WRN] MDS_TRIM: 1 MDSs behind on trimming
> >>>     mds.slugfs.pr-md-01.xdtppo(mds.0): Behind on trimming (127084/250)
> >>>     max_segments: 250, num_segments: 127084
> >>>
> >>> MDS cache too large? The mds process is taking up 22GB right now and
> >>> starting to swap my server, so maybe it somehow is too large...
> >>>
> >>> On 4/22/24 11:17 AM, Erich Weiler wrote:
> >>>> Hi All,
> >>>>
> >>>> We have a somewhat serious situation with a cephfs filesystem
> >>>> (18.2.1) and 2 active MDSs (plus one standby). I tried to restart
> >>>> one of the active daemons to unstick a bunch of blocked requests,
> >>>> and the standby went into 'replay' for a very long time; then RAM on
> >>>> that MDS server filled up, it stayed there for a while, and
> >>>> eventually it appeared to give up and switched to the standby, but
> >>>> the cycle started again. So I restarted that MDS, and now I'm in a
> >>>> situation where I see this:
> >>>>
> >>>> # ceph fs status
> >>>> slugfs - 29 clients
> >>>> ======
> >>>> RANK  STATE             MDS              ACTIVITY  DNS    INOS   DIRS   CAPS
> >>>>  0    replay   slugfs.pr-md-01.xdtppo             3958k  57.1k  12.2k     0
> >>>>  1    resolve  slugfs.pr-md-02.sbblqq                 0      3      1     0
> >>>>        POOL           TYPE      USED   AVAIL
> >>>>  cephfs_metadata     metadata    997G  2948G
> >>>> cephfs_md_and_data     data         0  87.6T
> >>>>    cephfs_data         data      773T   175T
> >>>>      STANDBY MDS
> >>>> slugfs.pr-md-03.mclckv
> >>>> MDS version: ceph version 18.2.1
> >>>> (7fe91d5d5842e04be3b4f514d6dd990c54b29c76) reef (stable)
> >>>>
> >>>> It just stays there indefinitely. All my clients are hung. I tried
> >>>> restarting all MDS daemons and they just went back to this state
> >>>> after coming back up.
> >>>>
> >>>> Is there any way I can somehow escape this state of indefinite
> >>>> replay/resolve?
> >>>>
> >>>> Thanks so much! I'm kinda nervous since none of my clients have
> >>>> filesystem access at the moment...
> >>>>
> >>>> cheers,
> >>>> erich
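To summarize the two workarounds discussed above in command form, a rough sketch only, assuming a cephadm-managed Reef cluster, the fs name slugfs from the status output, host names inferred from the MDS daemon names, and a placeholder host "bigmem-host" standing in for the 512 GB node:

# ceph fs set slugfs refuse_client_session true
# ceph orch apply mds slugfs --placement="4 pr-md-01 pr-md-02 pr-md-03 bigmem-host"
# ceph fs set slugfs refuse_client_session false

The refuse_client_session flag only rejects new client sessions; clients that are already mounted keep their caps, so it mainly stops extra load from piling on while the ranks work through replay and resolve (remember to clear it afterwards, as in the last line). The placement line adds the big-memory host to the MDS service so cephadm can start a standby daemon there; which daemon ends up holding the active ranks is still decided by the cluster.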