On Fri, Feb 11, 2022 at 9:36 PM Izzy Kulbe <ceph@xxxxxxx> wrote: > > Hi, > > at the moment no clients should be connected to the MDS(since the MDS doesn't come up) and the cluster only serves these MDS. The MDS also didn't start properly with mds_wipe_sessions = true. > > ceph health detail with the MDS trying to run: > > HEALTH_WARN 1 failed cephadm daemon(s); 3 large omap objects; 1 MDSs report oversized cache; insufficient standby MDS daemons available > [WRN] CEPHADM_FAILED_DAEMON: 1 failed cephadm daemon(s) > daemon mds.backupfs.SERVER.fsrhfw on SERVER is in error state > [WRN] LARGE_OMAP_OBJECTS: 3 large omap objects > 3 large objects found in pool 'cephfs.backupfs.meta' > Search the cluster log for 'Large omap object found' for more details. > [WRN] MDS_CACHE_OVERSIZED: 1 MDSs report oversized cache > mds.backupfs.SERVER.fsrhfw(mds.0): MDS cache is too large (9GB/4GB); 0 inodes in use by clients, 0 stray files If the MDS host has enough spare memory, setting `mds_cache_memory_limit`[*] to 9GB (or more if it permits) would get rid of this warning. Could you check if that improves the situation? Normally, the MDS starts trimming its cache when it overshoots the cache limit. [*]: https://docs.ceph.com/en/latest/cephfs/cache-configuration/#confval-mds_cache_memory_limit BTW, you earlier mentioned MDS using 150+GB of memory. That does not seem to be happening now (after Dan's config suggestion?). > [WRN] MDS_INSUFFICIENT_STANDBY: insufficient standby MDS daemons available > have 0; want 1 more You don't seem to have a standby MDS configured? Is that intentional? > > > ceph versions > { > "mon": { > "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)": 5 > }, > "mgr": { > "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)": 2 > }, > "osd": { > "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)": 19 > }, > "mds": { > "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)": 1 > }, > "overall": { > "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)": 27 > } > } > > > This is what the logs on startup look like(followed by the logs I put in my first mail). > > 4 mds.0.purge_queue _recover: write_pos recovered > 1 mds.0.24252 Finished replaying journal > 1 mds.0.24252 making mds journal writeable > 2 mds.0.24252 i am alone, moving to state reconnect > 3 mds.0.24252 request_state up:reconnect > 5 mds.beacon.backupfs.SERVER.fsrhfw set_want_state: up:replay -> up:reconnect > 5 mds.beacon.backupfs.SERVER.fsrhfw Sending beacon up:reconnect seq 2 > 2 mds.0.cache Memory usage: total 475764, rss 55348, heap 340224, baseline 315648, 0 / 3 inodes have caps, 0 caps, 0 caps per inode > 1 mds.backupfs.SERVER.fsrhfw Updating MDS map to version 24253 from mon.2 > 1 mds.0.24252 handle_mds_map i am now mds.0.24252 > 1 mds.0.24252 handle_mds_map state change up:replay --> up:reconnect > 1 mds.0.24252 reconnect_start > 1 mds.0.24252 reopen_log > 4 mds.0.24252 apply_blocklist: killed 0 blocklisted sessions (0 blocklist entries, 0) > 7 mds.0.server reconnect_clients -- no sessions, doing nothing. > 7 mds.0.server reconnect_gather_finish. failed on 0 clients > 7 mds.0.server snaptable cache isn't synced, delaying state transition > 7 mds.0.tableserver(snaptable) finish_recovery > 7 mds.0.tableserver(snaptable) _do_server_recovery 0 > 7 mds.0.24252 mds has 2 queued contexts > 1 mds.0.24252 reconnect_done > 3 mds.0.24252 request_state up:rejoin > 5 mds.beacon.backupfs.SERVER.fsrhfw set_want_state: up:reconnect -> up:rejoin > 5 mds.beacon.backupfs.SERVER.fsrhfw Sending beacon up:rejoin seq 3 > 5 mds.beacon.backupfs.SERVER.fsrhfw received beacon reply up:reconnect seq 2 rtt 0.928015 > 2 mds.0.cache Memory usage: total 475764, rss 55348, heap 340224, baseline 315648, 0 / 3 inodes have caps, 0 caps, 0 caps per inode > 1 mds.backupfs.SERVER.fsrhfw Updating MDS map to version 24254 from mon.2 > 1 mds.0.24252 handle_mds_map i am now mds.0.24252 > 1 mds.0.24252 handle_mds_map state change up:reconnect --> up:rejoin > 1 mds.0.24252 rejoin_start > 7 mds.0.cache trim_non_auth > 1 mds.0.24252 rejoin_joint_start > 7 mds.0.cache rejoin_send_acks > 1 mds.0.24252 rejoin_done > 7 mds.0.cache show_cache > 7 mds.0.cache unlinked [inode 0x100 [...2,head] ~mds0/ auth v4366057 snaprealm=0x5575d46f7200 f(v0 10=0+10) n(v870435 rc2023-01-24T14:42:18.000000+0000 b9444016583 21744450=7002+21737448)/n(v0 rc2021-03-10T14:41:15.806425+0000 11=0+11) old_inodes=1 (iversion lock) | dirfrag=1 openingsnapparents=0 0x5575d4722580] > 7 mds.0.cache dirfrag [dir 0x100 ~mds0/ [2,head] auth v=0 cv=0/0 dir_auth=0 state=1073741824 f() n() hs=0+0,ss=0+0 | subtree=1 subtreetemp=0 0x5575d489a480] > 7 mds.0.cache unlinked [inode 0x1 [...2,head] / auth v8206964 snaprealm=0x5575d46f7600 f(v0 m2021-09-01T14:29:05.934507+0000 3=2+1) n(v1380349 rc2023-01-27T09:46:07.000000+0000 b29475132133328 rs287 143815639=91231423+52584216)/n(v0 rc2021-03-10T14:41:15.805702+0000 1=0+1) (iversion lock) | dirfrag=1 openingsnapparents=0 0x5575d4723080] > 7 mds.0.cache dirfrag [dir 0x1 / [2,head] auth v=0 cv=0/0 dir_auth=0 state=1073741824 f() n() hs=0+0,ss=0+0 | subtree=1 subtreetemp=0 0x5575d489a000] > 7 mds.0.cache unlinked [inode 0x3 [...2,head] #3/ auth v1 snaprealm=0x5575d46f7400 f() n(v0 rc2022-02-11T15:21:09.622612+0000 1=0+1) (iversion lock) 0x5575d4722b00] > 3 mds.0.24252 request_state up:active > 5 mds.beacon.backupfs.SERVER.fsrhfw set_want_state: up:rejoin -> up:active > 5 mds.beacon.backupfs.SERVER.fsrhfw Sending beacon up:active seq 4 > 5 mds.beacon.backupfs.SERVER.fsrhfw received beacon reply up:rejoin seq 3 rtt 1.05602 > 2 mds.0.cache Memory usage: total 475764, rss 55348, heap 340224, baseline 315648, 0 / 3 inodes have caps, 0 caps, 0 caps per inode > 1 mds.backupfs.SERVER.fsrhfw Updating MDS map to version 24255 from mon.2 > 1 mds.0.24252 handle_mds_map i am now mds.0.24252 > 1 mds.0.24252 handle_mds_map state change up:rejoin --> up:active > 1 mds.0.24252 recovery_done -- successful recovery! > 1 mds.0.24252 active_start At this point the cluster does have an active MDS (probably with an oversized cache). Could you try the suggestions I mentioned earlier? > 1 mds.0.24252 cluster recovered. > 4 mds.0.24252 set_osd_epoch_barrier: epoch=16563 > 7 mds.0.24252 mds has 1 queued contexts > 7 mds.0.24252 mds has 1 queued contexts > > > After you mentioned clients I noticed something rather curious in the mds client ls - listing the clients reports one IP that is changing the client id in rapid succession. The IP is the same as the first MDS server(i had a mount there for testing. mounting does not work tho and the mount has since been removed). These commands were executed back to back. Is the client.XXXX changing expected behaviour? That's the client-id of the ceph command you run. > > ceph tell mds.0 client ls > 2022-02-11T15:22:07.614+0000 7fe163fff700 0 client.18437634 ms_handle_reset on v2:x.x.x.x:6800/3145139402 > 2022-02-11T15:22:07.630+0000 7fe163fff700 0 client.18429308 ms_handle_reset on v2:x.x.x.x:6800/3145139402 > [] > > ceph tell mds.0 client ls > 2022-02-11T15:22:08.726+0000 7f593e7f4700 0 client.18429318 ms_handle_reset on v2:x.x.x.x:6800/3145139402 > 2022-02-11T15:22:08.770+0000 7f593e7f4700 0 client.18437649 ms_handle_reset on v2:x.x.x.x:6800/3145139402 > [] > > ceph tell mds.0 client ls > 2022-02-11T15:22:09.870+0000 7fa762ffd700 0 client.18427920 ms_handle_reset on v2:x.x.x.x:6800/3145139402 > 2022-02-11T15:22:09.886+0000 7fa762ffd700 0 client.18429338 ms_handle_reset on v2:x.x.x.x:6800/3145139402 > [] > > > Just for completeness here's a dump of the mds mempools: > > { > "by_pool": { > "bloom_filter": { > "items": 0, > "bytes": 0 > }, > "bluestore_alloc": { > "items": 0, > "bytes": 0 > }, > "bluestore_cache_data": { > "items": 0, > "bytes": 0 > }, > "bluestore_cache_onode": { > "items": 0, > "bytes": 0 > }, > "bluestore_cache_meta": { > "items": 0, > "bytes": 0 > }, > "bluestore_cache_other": { > "items": 0, > "bytes": 0 > }, > "bluestore_Buffer": { > "items": 0, > "bytes": 0 > }, > "bluestore_Extent": { > "items": 0, > "bytes": 0 > }, > "bluestore_Blob": { > "items": 0, > "bytes": 0 > }, > "bluestore_SharedBlob": { > "items": 0, > "bytes": 0 > }, > "bluestore_inline_bl": { > "items": 0, > "bytes": 0 > }, > "bluestore_fsck": { > "items": 0, > "bytes": 0 > }, > "bluestore_txc": { > "items": 0, > "bytes": 0 > }, > "bluestore_writing_deferred": { > "items": 0, > "bytes": 0 > }, > "bluestore_writing": { > "items": 0, > "bytes": 0 > }, > "bluefs": { > "items": 0, > "bytes": 0 > }, > "bluefs_file_reader": { > "items": 0, > "bytes": 0 > }, > "bluefs_file_writer": { > "items": 0, > "bytes": 0 > }, > "buffer_anon": { > "items": 63, > "bytes": 4132815 > }, > "buffer_meta": { > "items": 3, > "bytes": 264 > }, > "osd": { > "items": 0, > "bytes": 0 > }, > "osd_mapbl": { > "items": 0, > "bytes": 0 > }, > "osd_pglog": { > "items": 0, > "bytes": 0 > }, > "osdmap": { > "items": 868, > "bytes": 19064 > }, > "osdmap_mapping": { > "items": 0, > "bytes": 0 > }, > "pgmap": { > "items": 0, > "bytes": 0 > }, > "mds_co": { > "items": 576347798, > "bytes": 25705305678 > }, There is a lot in cache as compared to the cache memory limit set. If there are no clients (and nothing in stray), the MDS should be trimming its cache to keep it within limits. > "unittest_1": { > "items": 0, > "bytes": 0 > }, > "unittest_2": { > "items": 0, > "bytes": 0 > } > }, > "total": { > "items": 576348732, > "bytes": 25709457821 > } > } > > > > On Fri, 11 Feb 2022 at 14:40, Venky Shankar <vshankar@xxxxxxxxxx> wrote: >> >> On Fri, Feb 11, 2022 at 3:40 PM Izzy Kulbe <ceph@xxxxxxx> wrote: >> > >> > Hi, >> > >> > I've tried to rm the mds0.openfiles and did a `ceph config set mds >> > mds_oft_prefetch_dirfrags false` but still with the same result of ceph >> > status reporting the daemon as up and not a lot more. >> > >> > I also tried setting the cache to ridiculously small(128M) but the MDS' >> > memory usage would still go up to 50+GB either way. >> >> Setting the cache size too small would not help. >> >> The "oversized cache" could be due to clients not releasing >> capabilities or perhaps a bug. How many clients do you have and what >> versions are they running? >> >> Could you also check "ceph health detail"? >> >> > >> > cluster: >> > health: HEALTH_WARN >> > 1 MDSs report oversized cache >> > >> > [...] >> > services >> > [...] >> > mds: 1/1 daemons up >> > >> > Cheers, >> > Izzy >> > >> > On Fri, 11 Feb 2022 at 09:32, Dan van der Ster <dvanders@xxxxxxxxx> wrote: >> > >> > > Hi, >> > > >> > > Is the memory ballooning while the MDS is active or could it be while >> > > it is rejoining the cluster? >> > > If the latter, this could be another case of: >> > > https://tracker.ceph.com/issues/54253 >> > > >> > > Cheers, Dan >> > > >> > > >> > > On Wed, Feb 9, 2022 at 7:23 PM Izzy Kulbe <ceph@xxxxxxx> wrote: >> > > > >> > > > Hi, >> > > > >> > > > last weekend we upgraded one of our clusters from 16.2.5 to 16.2.7 using >> > > > cephadm. >> > > > >> > > > The upgrade itself seemed to run without a problem but shortly after the >> > > > upgrade we noticed the servers holding the MDS containers being laggy, >> > > then >> > > > unresponsive, then crashing outright due getting reaped by the OOM >> > > killer. >> > > > The MDS would restart, at some point resulting in a reset of the whole >> > > > machine. >> > > > >> > > > At this point the cluster status started showing a lot of PGs with >> > > > snaptrim/snaptrim_wait status, so we waited for those to finish with the >> > > > MDS shutdown. >> > > > >> > > > So far everything we've tried changed nothing about our situation - our >> > > MDS >> > > > will try to hog extreme amounts of memory they never needed before. Even >> > > > when we push the cache down to 1G or reset it to the default the will >> > > > report as having oversized caches, using up 150+GB of RAM+swap until >> > > > resource exhaustion makes the OOM killer reap them again. >> > > > >> > > > Looking at the output of ceph fs status and the logs it seems it's trying >> > > > to push too much into the cache. The MDS crashes at around 25-30M inodes >> > > > and a few thousand dirs. ceph fs status reports the MDS as being active, >> > > > unlike a few other posts with similar issues the MDS does not seem to be >> > > > stuck in replay. The journal reports no issues. Additionally perf reports >> > > > no strays, nor does listomapvals. >> > > > >> > > > The cluster contains ~33TB in ~220M files. There are multiple snapshots >> > > for >> > > > several folders. While ceph fs status reports the MDS as active, I am not >> > > > able to mount the filesystem. >> > > > >> > > > I will probably recreate/resync this soon but I'd still like to find out >> > > > why this happened since we plan on using Ceph in a few other applications >> > > > but at the moment I wouldn't feel comfortable using CephFS given that a >> > > > single cephadm upgrade command made these amounts of data unavailable. >> > > > >> > > > Below are a few lines of log that basically make up the MDS' logs until >> > > it >> > > > runs into "mds.0.purge_queue push: pushing inode" and shortly after that >> > > > the crash. >> > > > >> > > > Any pointers to what went wrong here, how to debug this further or how to >> > > > fix this would be greatly appreciated. I'll probably have the broken >> > > system >> > > > for another day or two before resetting/recreating the FS. >> > > > >> > > > Thanks, >> > > > Izzy Kulbe >> > > > >> > > > >> > > > MDS Log: >> > > > >> > > > 20 mds.0.cache.dir(0x604.001111000*) lookup_exact_snap (head, >> > > '10000eff917') >> > > > 10 mds.0.cache.snaprealm(0x10000eff917 seq 3398 0x55c541856200) >> > > > adjust_parent 0 -> 0x55ba9f1e2e00 >> > > > 12 mds.0.cache.dir(0x604.001111000*) add_primary_dentry [dentry >> > > > #0x100/stray4/10000eff917 [d89,head] auth (dversion lock) pv=0 v=38664524 >> > > > ino=0x10000eff917 state=1073741824 0x55c541854a00] >> > > > 12 mds.0.cache.dir(0x604.001111000*) _fetched got [dentry >> > > > #0x100/stray4/10000eff917 [d89,head] auth (dversion lock) pv=0 v=38664524 >> > > > ino=0x10000eff917 state=1073741824 0x55c541854a00] [inode 0x10000eff917 >> > > > [...d89,head] ~mds0/stray4/10000eff917/ auth v38642807 >> > > > snaprealm=0x55c541856200 f(v0 m2021-12-29T05:05:25.519523+0000) n(v4 >> > > > rc2021-12-29T05:05:25.527523+0000 1=0+1) old_inodes=1 (iversion lock) >> > > > 0x55c541851180] >> > > > 15 mds.0.cache.ino(0x10000eff917) maybe_ephemeral_rand unlinked >> > > directory: >> > > > cannot ephemeral random pin [inode 0x10000eff917 [...d89,head] >> > > > ~mds0/stray4/10000eff917/ auth v38642807 snaprealm=0x55c541856200 f(v0 >> > > > m2021-12-29T05:05:25.519523+0000) n(v4 rc2021-12-29T05:05:25.527523+0000 >> > > > 1=0+1) old_inodes=1 (iversion lock) 0x55c541851180] >> > > > 20 mds.0.cache.dir(0x604.001111000*) _fetched pos 137 marker 'i' dname >> > > > '10000efded1 [daf,head] >> > > > 20 mds.0.cache.dir(0x604.001111000*) lookup (head, '10000efded1') >> > > > 20 mds.0.cache.dir(0x604.001111000*) miss -> (1001124a935,head) >> > > > 20 mds.0.cache.ino(0x10000efded1) decode_snap_blob >> > > snaprealm(0x10000efded1 >> > > > seq d9e lc 0 cr d9e cps daf snaps={} >> > > > past_parent_snaps=be5,c3d,c95,ced,d45,d9d 0x55c541856400) >> > > > 20 mds.0.cache.dir(0x604.001111000*) lookup_exact_snap (head, >> > > '10000efded1') >> > > > 10 mds.0.cache.snaprealm(0x10000efded1 seq 3486 0x55c541856400) >> > > > adjust_parent 0 -> 0x55ba9f1e2e00 >> > > > 12 mds.0.cache.dir(0x604.001111000*) add_primary_dentry [dentry >> > > > #0x100/stray4/10000efded1 [daf,head] auth (dversion lock) pv=0 v=38664524 >> > > > ino=0x10000efded1 state=1073741824 0x55c541854f00] >> > > > 12 mds.0.cache.dir(0x604.001111000*) _fetched got [dentry >> > > > #0x100/stray4/10000efded1 [daf,head] auth (dversion lock) pv=0 v=38664524 >> > > > ino=0x10000efded1 state=1073741824 0x55c541854f00] [inode 0x10000efded1 >> > > > [...daf,head] ~mds0/stray4/10000efded1/ auth v38652010 >> > > > snaprealm=0x55c541856400 f(v0 m2022-01-04T19:09:57.033258+0000) n(v118 >> > > > rc2022-01-04T19:09:57.041258+0000 1=0+1) old_inodes=6 (iversion lock) >> > > > 0x55c541851700] >> > > > 15 mds.0.cache.ino(0x10000efded1) maybe_ephemeral_rand unlinked >> > > directory: >> > > > cannot ephemeral random pin [inode 0x10000efded1 [...daf,head] >> > > > ~mds0/stray4/10000efded1/ auth v38652010 snaprealm=0x55c541856400 f(v0 >> > > > m2022-01-04T19:09:57.033258+0000) n(v118 >> > > rc2022-01-04T19:09:57.041258+0000 >> > > > 1=0+1) old_inodes=6 (iversion lock) 0x55c541851700] >> > > > 20 mds.0.cache.dir(0x604.001111000*) _fetched pos 136 marker 'i' dname >> > > > '10000efad6f [daf,head] >> > > > 20 mds.0.cache.dir(0x604.001111000*) lookup (head, '10000efad6f') >> > > > 20 mds.0.cache.dir(0x604.001111000*) miss -> (1000539afa8,head) >> > > > 20 mds.0.cache.ino(0x10000efad6f) decode_snap_blob >> > > snaprealm(0x10000efad6f >> > > > seq d9e lc 0 cr d9e cps daf snaps={} >> > > > past_parent_snaps=be5,c3d,c95,ced,d45,d9d 0x55c541856600) >> > > > 20 mds.0.cache.dir(0x604.001111000*) lookup_exact_snap (head, >> > > '10000efad6f') >> > > > 10 mds.0.cache.snaprealm(0x10000efad6f seq 3486 0x55c541856600) >> > > > adjust_parent 0 -> 0x55ba9f1e2e00 >> > > > 12 mds.0.cache.dir(0x604.001111000*) add_primary_dentry [dentry >> > > > #0x100/stray4/10000efad6f [daf,head] auth (dversion lock) pv=0 v=38664524 >> > > > ino=0x10000efad6f state=1073741824 0x55c541855400] >> > > > 12 mds.0.cache.dir(0x604.001111000*) _fetched got [dentry >> > > > #0x100/stray4/10000efad6f [daf,head] auth (dversion lock) pv=0 v=38664524 >> > > > ino=0x10000efad6f state=1073741824 0x55c541855400] [inode 0x10000efad6f >> > > > [...daf,head] ~mds0/stray4/10000efad6f/ auth v38652024 >> > > > snaprealm=0x55c541856600 f() n(v0 rc2022-01-04T19:10:53.334202+0000 >> > > 1=0+1) >> > > > old_inodes=6 (iversion lock) 0x55c54185c000] >> > > > 15 mds.0.cache.ino(0x10000efad6f) maybe_ephemeral_rand unlinked >> > > directory: >> > > > cannot ephemeral random pin [inode 0x10000efad6f [...daf,head] >> > > > ~mds0/stray4/10000efad6f/ auth v38652024 snaprealm=0x55c541856600 f() >> > > n(v0 >> > > > rc2022-01-04T19:10:53.334202+0000 1=0+1) old_inodes=6 (iversion lock) >> > > > 0x55c54185c000] >> > > > 20 mds.0.cache.dir(0x604.001111000*) _fetched pos 135 marker 'i' dname >> > > > '10000ef98c9 [d89,head] >> > > > 20 mds.0.cache.dir(0x604.001111000*) lookup (head, '10000ef98c9') >> > > > 20 mds.0.cache.dir(0x604.001111000*) miss -> (100113dd761,head) >> > > > 20 mds.0.cache.ino(0x10000ef98c9) decode_snap_blob >> > > snaprealm(0x10000ef98c9 >> > > > seq d46 lc 0 cr d46 cps d89 snaps={} >> > > > past_parent_snaps=b8b,be5,c3d,c95,ced,d45 0x55c541856800) >> > > > _______________________________________________ >> > > > ceph-users mailing list -- ceph-users@xxxxxxx >> > > > To unsubscribe send an email to ceph-users-leave@xxxxxxx >> > > >> > _______________________________________________ >> > ceph-users mailing list -- ceph-users@xxxxxxx >> > To unsubscribe send an email to ceph-users-leave@xxxxxxx >> > >> >> >> -- >> Cheers, >> Venky >> >> _______________________________________________ >> ceph-users mailing list -- ceph-users@xxxxxxx >> To unsubscribe send an email to ceph-users-leave@xxxxxxx -- Cheers, Venky _______________________________________________ ceph-users mailing list -- ceph-users@xxxxxxx To unsubscribe send an email to ceph-users-leave@xxxxxxx