Hi Erich,

When MDS cache usage is very high, recovery is very slow, so I use this command to drop the MDS cache:

ceph tell mds.* cache drop 600
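Roughly, something like the following. The daemon name here is just the rank-0 MDS from your 'ceph fs status' output, and the 16 GiB cache target is only an example value, so size it to the RAM actually available on the MDS host:

ceph tell mds.slugfs.pr-md-01.xdtppo cache drop 600      # ask clients to release caps, wait up to 600s, then trim this daemon's cache
ceph config set mds mds_cache_memory_limit 17179869184   # raise the MDS cache memory target to about 16 GiB (example value only)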
On Tue, Apr 23, 2024 at 4:36 PM Lars Köppel <lars.koeppel@xxxxxxxxxx> wrote:

> Hi Erich,
>
> Great that you recovered from this.
> It sounds like you had the same problem I had a few months ago:
> mds crashes after up:replay state - ceph-users - lists.ceph.io
> <https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/IRV6K74GWE2SWAWQZVUDQAPSMY4J4R4D/#UBPVYXO5ODABKUZ436HN4WBX7QJUXY3P>
>
> Kind regards,
> Lars
>
> Lars Köppel
> Developer
> Email: lars.koeppel@xxxxxxxxxx
> Phone: +49 6221 5993580
> ariadne.ai (Germany) GmbH
> Häusserstraße 3, 69115 Heidelberg
> Amtsgericht Mannheim, HRB 744040
> Geschäftsführer: Dr. Fabian Svara
> https://ariadne.ai
>
>
> On Mon, Apr 22, 2024 at 11:31 PM Sake Ceph <ceph@xxxxxxxxxxx> wrote:
>
> > 100 GB of RAM! Damn, that's a lot for a filesystem in my opinion, or am I
> > wrong?
> >
> > Kind regards,
> > Sake
> >
> > > On 22-04-2024 21:50 CEST, Erich Weiler <weiler@xxxxxxxxxxxx> wrote:
> > >
> > > I was able to start another MDS daemon on another node that had 512GB
> > > RAM, and then the active MDS eventually migrated there, went through
> > > the replay (which consumed about 100 GB of RAM), and then things
> > > recovered. Phew. I guess I need significantly more RAM in my MDS
> > > servers... I had no idea the MDS daemon could require that much RAM.
> > >
> > > -erich
> > >
> > > On 4/22/24 11:41 AM, Erich Weiler wrote:
> > > > Possibly, but it would be pretty time-consuming and difficult...
> > > >
> > > > Is it maybe a RAM issue, since my MDS RAM is filling up? Should I maybe
> > > > bring up another MDS on another server with a huge amount of RAM and
> > > > move the MDS there, in the hope that it has enough RAM to complete the
> > > > replay?
> > > >
> > > > On 4/22/24 11:37 AM, Sake Ceph wrote:
> > > >> Just a question: is it possible to block or disable all clients? Just
> > > >> to prevent load on the system.
> > > >>
> > > >> Kind regards,
> > > >> Sake
> > > >>
> > > >>> On 22-04-2024 20:33 CEST, Erich Weiler <weiler@xxxxxxxxxxxx> wrote:
> > > >>>
> > > >>> I also see this from 'ceph health detail':
> > > >>>
> > > >>> # ceph health detail
> > > >>> HEALTH_WARN 1 filesystem is degraded; 1 MDSs report oversized cache;
> > > >>> 1 MDSs behind on trimming
> > > >>> [WRN] FS_DEGRADED: 1 filesystem is degraded
> > > >>>     fs slugfs is degraded
> > > >>> [WRN] MDS_CACHE_OVERSIZED: 1 MDSs report oversized cache
> > > >>>     mds.slugfs.pr-md-01.xdtppo(mds.0): MDS cache is too large
> > > >>>     (19GB/8GB); 0 inodes in use by clients, 0 stray files
> > > >>> [WRN] MDS_TRIM: 1 MDSs behind on trimming
> > > >>>     mds.slugfs.pr-md-01.xdtppo(mds.0): Behind on trimming (127084/250)
> > > >>>     max_segments: 250, num_segments: 127084
> > > >>>
> > > >>> MDS cache too large? The mds process is taking up 22GB right now and
> > > >>> starting to swap my server, so maybe it somehow is too large....
> > > >>>
> > > >>> On 4/22/24 11:17 AM, Erich Weiler wrote:
> > > >>>> Hi All,
> > > >>>>
> > > >>>> We have a somewhat serious situation where we have a cephfs
> > > >>>> filesystem (18.2.1) and 2 active MDSs (one standby). I tried to
> > > >>>> restart one of the active daemons to unstick a bunch of blocked
> > > >>>> requests, and the standby went into 'replay' for a very long time.
> > > >>>> Then RAM on that MDS server filled up, it stayed there for a while,
> > > >>>> and it eventually appeared to give up and switched to the standby,
> > > >>>> but the cycle started again. So I restarted that MDS, and now I'm in
> > > >>>> a situation where I see this:
> > > >>>>
> > > >>>> # ceph fs status
> > > >>>> slugfs - 29 clients
> > > >>>> ======
> > > >>>> RANK   STATE            MDS              ACTIVITY   DNS    INOS   DIRS   CAPS
> > > >>>>  0     replay   slugfs.pr-md-01.xdtppo             3958k  57.1k  12.2k     0
> > > >>>>  1     resolve  slugfs.pr-md-02.sbblqq                 0      3      1     0
> > > >>>>         POOL            TYPE     USED   AVAIL
> > > >>>>   cephfs_metadata     metadata   997G   2948G
> > > >>>>  cephfs_md_and_data     data        0   87.6T
> > > >>>>     cephfs_data         data     773T    175T
> > > >>>> STANDBY MDS
> > > >>>> slugfs.pr-md-03.mclckv
> > > >>>> MDS version: ceph version 18.2.1
> > > >>>> (7fe91d5d5842e04be3b4f514d6dd990c54b29c76) reef (stable)
> > > >>>>
> > > >>>> It just stays there indefinitely. All my clients are hung. I tried
> > > >>>> restarting all MDS daemons and they just went back to this state
> > > >>>> after coming back up.
> > > >>>>
> > > >>>> Is there any way I can somehow escape this state of indefinite
> > > >>>> replay/resolve?
> > > >>>>
> > > >>>> Thanks so much! I'm kinda nervous since none of my clients have
> > > >>>> filesystem access at the moment...
> > > >>>>
> > > >>>> cheers,
> > > >>>> erich
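One closing note, in case it helps the next person who hits this: during replay the MDS can need far more memory than its configured cache target (as seen above, roughly 100 GB against an 8 GB target), so before restarting daemons again it is worth checking the current target and watching the recovery from outside. For example, with the filesystem name from this thread:

ceph config get mds mds_cache_memory_limit   # current cache memory target, in bytes
ceph fs status slugfs                        # rank states; replay/resolve should eventually become active
ceph health detail                           # the FS_DEGRADED, MDS_CACHE_OVERSIZED and MDS_TRIM warnings shown above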