Hi Erich,

Great that you recovered from this. It sounds like you had the same
problem I had a few months ago:

mds crashes after up:replay state - ceph-users - lists.ceph.io
<https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/IRV6K74GWE2SWAWQZVUDQAPSMY4J4R4D/#UBPVYXO5ODABKUZ436HN4WBX7QJUXY3P>

Kind regards,
Lars

Lars Köppel
Developer

Email: lars.koeppel@xxxxxxxxxx
Phone: +49 6221 5993580

ariadne.ai (Germany) GmbH
Häusserstraße 3, 69115 Heidelberg
Amtsgericht Mannheim, HRB 744040
Geschäftsführer: Dr. Fabian Svara
https://ariadne.ai

On Mon, Apr 22, 2024 at 11:31 PM Sake Ceph <ceph@xxxxxxxxxxx> wrote:

> 100 GB of RAM! Damn, that's a lot for a filesystem in my opinion, or am
> I wrong?
>
> Kind regards,
> Sake
>
> > Op 22-04-2024 21:50 CEST schreef Erich Weiler <weiler@xxxxxxxxxxxx>:
> >
> > I was able to start another MDS daemon on another node that had 512GB
> > RAM, and then the active MDS eventually migrated there, went through
> > the replay (which consumed about 100 GB of RAM), and then things
> > recovered. Phew. I guess I need significantly more RAM in my MDS
> > servers... I had no idea the MDS daemon could require that much RAM.
> >
> > -erich
> >
> > On 4/22/24 11:41 AM, Erich Weiler wrote:
> > > Possibly, but it would be pretty time consuming and difficult...
> > >
> > > Is it maybe a RAM issue, since my MDS RAM is filling up? Should I
> > > maybe bring up another MDS on another server with a huge amount of
> > > RAM and move the MDS there, in hopes it will have enough RAM to
> > > complete the replay?
> > >
> > > On 4/22/24 11:37 AM, Sake Ceph wrote:
> > >> Just a question: is it possible to block or disable all clients?
> > >> Just to prevent load on the system.
> > >>
> > >> Kind regards,
> > >> Sake
> > >>> Op 22-04-2024 20:33 CEST schreef Erich Weiler <weiler@xxxxxxxxxxxx>:
> > >>>
> > >>> I also see this from 'ceph health detail':
> > >>>
> > >>> # ceph health detail
> > >>> HEALTH_WARN 1 filesystem is degraded; 1 MDSs report oversized
> > >>> cache; 1 MDSs behind on trimming
> > >>> [WRN] FS_DEGRADED: 1 filesystem is degraded
> > >>>     fs slugfs is degraded
> > >>> [WRN] MDS_CACHE_OVERSIZED: 1 MDSs report oversized cache
> > >>>     mds.slugfs.pr-md-01.xdtppo(mds.0): MDS cache is too large
> > >>>     (19GB/8GB); 0 inodes in use by clients, 0 stray files
> > >>> [WRN] MDS_TRIM: 1 MDSs behind on trimming
> > >>>     mds.slugfs.pr-md-01.xdtppo(mds.0): Behind on trimming
> > >>>     (127084/250) max_segments: 250, num_segments: 127084
> > >>>
> > >>> MDS cache too large? The mds process is taking up 22GB right now
> > >>> and starting to swap my server, so maybe it somehow is too
> > >>> large....
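
For reference, the "19GB/8GB" in the MDS_CACHE_OVERSIZED warning above is
the current cache size versus the configured mds_cache_memory_limit. That
limit can be raised at runtime, though it is only a soft target and replay
can still push the daemon well past it. A minimal sketch, assuming the
limit is managed through the central config store; the 64 GiB value is
purely illustrative, not a recommendation:

# ceph config set mds mds_cache_memory_limit 68719476736   # 64 GiB, applies to all MDS daemons
# ceph config get mds mds_cache_memory_limit               # confirm the new value

Once the filesystem is healthy again, the limit can be lowered the same way
so the MDS does not keep more cache than the host can comfortably back with
RAM.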
> > >>> On 4/22/24 11:17 AM, Erich Weiler wrote:
> > >>>> Hi All,
> > >>>>
> > >>>> We have a somewhat serious situation where we have a cephfs
> > >>>> filesystem (18.2.1) and 2 active MDSs (one standby). I tried to
> > >>>> restart one of the active daemons to unstick a bunch of blocked
> > >>>> requests, and the standby went into 'replay' for a very long
> > >>>> time, then RAM on that MDS server filled up, and it just stayed
> > >>>> there for a while, then eventually appeared to give up and
> > >>>> switched to the standby, but the cycle started again. So I
> > >>>> restarted that MDS, and now I'm in a situation where I see this:
> > >>>>
> > >>>> # ceph fs status
> > >>>> slugfs - 29 clients
> > >>>> ======
> > >>>> RANK   STATE            MDS             ACTIVITY   DNS    INOS   DIRS   CAPS
> > >>>>  0     replay   slugfs.pr-md-01.xdtppo            3958k  57.1k  12.2k     0
> > >>>>  1     resolve  slugfs.pr-md-02.sbblqq                0      3      1     0
> > >>>>        POOL           TYPE      USED   AVAIL
> > >>>>  cephfs_metadata      metadata  997G   2948G
> > >>>>  cephfs_md_and_data   data         0   87.6T
> > >>>>  cephfs_data          data      773T    175T
> > >>>> STANDBY MDS
> > >>>> slugfs.pr-md-03.mclckv
> > >>>> MDS version: ceph version 18.2.1
> > >>>> (7fe91d5d5842e04be3b4f514d6dd990c54b29c76) reef (stable)
> > >>>>
> > >>>> It just stays there indefinitely. All my clients are hung. I
> > >>>> tried restarting all MDS daemons and they just went back to this
> > >>>> state after coming back up.
> > >>>>
> > >>>> Is there any way I can somehow escape this state of indefinite
> > >>>> replay/resolve?
> > >>>>
> > >>>> Thanks so much! I'm kinda nervous since none of my clients have
> > >>>> filesystem access at the moment...
> > >>>>
> > >>>> cheers,
> > >>>> erich
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
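
On the two points raised in the thread (keeping clients off the filesystem
while the MDS recovers, and running an MDS on a host with more RAM), a
minimal sketch, assuming a cephadm-managed Reef cluster; the hostname
"pr-md-04" is hypothetical, and refuse_client_session may not be available
on older releases:

# ceph fs set slugfs refuse_client_session true    # reject client sessions during recovery
# ceph orch apply mds slugfs --placement="3 pr-md-01 pr-md-02 pr-md-04"   # include the high-RAM host
# ceph fs set slugfs refuse_client_session false   # let clients back in once healthy

A newly placed daemon joins as a standby; an active rank only moves to it
when the current holder of that rank stops or fails over, so it is worth
watching 'ceph fs status' until the ranks settle on the intended hosts.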