Re: Not able to start MDS after upgrade to 16.2.7

Hi,

at the moment no clients should be connected to the MDS (since the MDS
doesn't come up) and the cluster only serves this MDS. The MDS also didn't
start properly with mds_wipe_sessions = true.
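
For reference, this is roughly how the flag gets toggled here; the daemon
name is the one cephadm reports in the health output below, and restarting
it via the orchestrator is an assumption on my part:

ceph config set mds mds_wipe_sessions true
ceph orch daemon restart mds.backupfs.SERVER.fsrhfw
# revert once the attempt is done
ceph config set mds mds_wipe_sessions false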

ceph health detail with the MDS trying to run:

HEALTH_WARN 1 failed cephadm daemon(s); 3 large omap objects; 1 MDSs report
oversized cache; insufficient standby MDS daemons available
[WRN] CEPHADM_FAILED_DAEMON: 1 failed cephadm daemon(s)
    daemon mds.backupfs.SERVER.fsrhfw on SERVER is in error state
[WRN] LARGE_OMAP_OBJECTS: 3 large omap objects
    3 large objects found in pool 'cephfs.backupfs.meta'
    Search the cluster log for 'Large omap object found' for more details.
[WRN] MDS_CACHE_OVERSIZED: 1 MDSs report oversized cache
    mds.backupfs.SERVER.fsrhfw(mds.0): MDS cache is too large (9GB/4GB); 0
inodes in use by clients, 0 stray files
[WRN] MDS_INSUFFICIENT_STANDBY: insufficient standby MDS daemons available
    have 0; want 1 more
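
In case the large omap warning is related, this is roughly how I'd track
down the three objects; the cluster log path is a guess (cephadm may send
it to journald instead) and <object-name> is just a placeholder:

# find the offending objects in the cluster log
grep 'Large omap object found' /var/log/ceph/ceph.log
# count omap keys on a suspect object in the metadata pool
rados -p cephfs.backupfs.meta listomapkeys <object-name> | wc -l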


ceph versions
{
    "mon": {
        "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503)
pacific (stable)": 5
    },
    "mgr": {
        "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503)
pacific (stable)": 2
    },
    "osd": {
        "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503)
pacific (stable)": 19
    },
    "mds": {
        "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503)
pacific (stable)": 1
    },
    "overall": {
        "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503)
pacific (stable)": 27
    }
}


This is what the logs on startup look like (followed by the logs I put in my
first mail).

4 mds.0.purge_queue _recover: write_pos recovered
1 mds.0.24252 Finished replaying journal
1 mds.0.24252 making mds journal writeable
2 mds.0.24252 i am alone, moving to state reconnect
3 mds.0.24252 request_state up:reconnect
5 mds.beacon.backupfs.SERVER.fsrhfw set_want_state: up:replay ->
up:reconnect
5 mds.beacon.backupfs.SERVER.fsrhfw Sending beacon up:reconnect seq 2
2 mds.0.cache Memory usage:  total 475764, rss 55348, heap 340224, baseline
315648, 0 / 3 inodes have caps, 0 caps, 0 caps per inode
1 mds.backupfs.SERVER.fsrhfw Updating MDS map to version 24253 from mon.2
1 mds.0.24252 handle_mds_map i am now mds.0.24252
1 mds.0.24252 handle_mds_map state change up:replay --> up:reconnect
1 mds.0.24252 reconnect_start
1 mds.0.24252 reopen_log
4 mds.0.24252 apply_blocklist: killed 0 blocklisted sessions (0 blocklist
entries, 0)
7 mds.0.server reconnect_clients -- no sessions, doing nothing.
7 mds.0.server reconnect_gather_finish.  failed on 0 clients
7 mds.0.server  snaptable cache isn't synced, delaying state transition
7 mds.0.tableserver(snaptable) finish_recovery
7 mds.0.tableserver(snaptable) _do_server_recovery 0
7 mds.0.24252 mds has 2 queued contexts
1 mds.0.24252 reconnect_done
3 mds.0.24252 request_state up:rejoin
5 mds.beacon.backupfs.SERVER.fsrhfw set_want_state: up:reconnect ->
up:rejoin
5 mds.beacon.backupfs.SERVER.fsrhfw Sending beacon up:rejoin seq 3
5 mds.beacon.backupfs.SERVER.fsrhfw received beacon reply up:reconnect seq
2 rtt 0.928015
2 mds.0.cache Memory usage:  total 475764, rss 55348, heap 340224, baseline
315648, 0 / 3 inodes have caps, 0 caps, 0 caps per inode
1 mds.backupfs.SERVER.fsrhfw Updating MDS map to version 24254 from mon.2
1 mds.0.24252 handle_mds_map i am now mds.0.24252
1 mds.0.24252 handle_mds_map state change up:reconnect --> up:rejoin
1 mds.0.24252 rejoin_start
7 mds.0.cache trim_non_auth
1 mds.0.24252 rejoin_joint_start
7 mds.0.cache rejoin_send_acks
1 mds.0.24252 rejoin_done
7 mds.0.cache show_cache
7 mds.0.cache  unlinked [inode 0x100 [...2,head] ~mds0/ auth v4366057
snaprealm=0x5575d46f7200 f(v0 10=0+10) n(v870435
rc2023-01-24T14:42:18.000000+0000 b9444016583 21744450=7002+21737448)/n(v0
rc2021-03-10T14:41:15.806425+0000 11=0+11) old_inodes=1 (iversion lock) |
dirfrag=1 openingsnapparents=0 0x5575d4722580]
7 mds.0.cache   dirfrag [dir 0x100 ~mds0/ [2,head] auth v=0 cv=0/0
dir_auth=0 state=1073741824 f() n() hs=0+0,ss=0+0 | subtree=1 subtreetemp=0
0x5575d489a480]
7 mds.0.cache  unlinked [inode 0x1 [...2,head] / auth v8206964
snaprealm=0x5575d46f7600 f(v0 m2021-09-01T14:29:05.934507+0000 3=2+1)
n(v1380349 rc2023-01-27T09:46:07.000000+0000 b29475132133328 rs287
143815639=91231423+52584216)/n(v0 rc2021-03-10T14:41:15.805702+0000 1=0+1)
(iversion lock) | dirfrag=1 openingsnapparents=0 0x5575d4723080]
7 mds.0.cache   dirfrag [dir 0x1 / [2,head] auth v=0 cv=0/0 dir_auth=0
state=1073741824 f() n() hs=0+0,ss=0+0 | subtree=1 subtreetemp=0
0x5575d489a000]
7 mds.0.cache  unlinked [inode 0x3 [...2,head] #3/ auth v1
snaprealm=0x5575d46f7400 f() n(v0 rc2022-02-11T15:21:09.622612+0000 1=0+1)
(iversion lock) 0x5575d4722b00]
3 mds.0.24252 request_state up:active
5 mds.beacon.backupfs.SERVER.fsrhfw set_want_state: up:rejoin -> up:active
5 mds.beacon.backupfs.SERVER.fsrhfw Sending beacon up:active seq 4
5 mds.beacon.backupfs.SERVER.fsrhfw received beacon reply up:rejoin seq 3
rtt 1.05602
2 mds.0.cache Memory usage:  total 475764, rss 55348, heap 340224, baseline
315648, 0 / 3 inodes have caps, 0 caps, 0 caps per inode
1 mds.backupfs.SERVER.fsrhfw Updating MDS map to version 24255 from mon.2
1 mds.0.24252 handle_mds_map i am now mds.0.24252
1 mds.0.24252 handle_mds_map state change up:rejoin --> up:active
1 mds.0.24252 recovery_done -- successful recovery!
1 mds.0.24252 active_start
1 mds.0.24252 cluster recovered.
4 mds.0.24252 set_osd_epoch_barrier: epoch=16563
7 mds.0.24252 mds has 1 queued contexts
7 mds.0.24252 mds has 1 queued contexts
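
For what it's worth, this is roughly how I'm double-checking from the mon
side which state the MDS settles in (assuming the filesystem is called
backupfs, going by the pool and daemon names above):

ceph fs status backupfs
ceph fs dump | grep -i state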


After you mentioned clients I noticed something rather curious in the mds
client ls: listing the clients reports one IP whose client id changes in
rapid succession. The IP is the same as the first MDS server (I had a mount
there for testing; mounting does not work though, and the mount has since
been removed). These commands were executed back to back. Is the changing
client.XXXX expected behaviour?

ceph tell mds.0 client ls
2022-02-11T15:22:07.614+0000 7fe163fff700  0 client.18437634
ms_handle_reset on v2:x.x.x.x:6800/3145139402
2022-02-11T15:22:07.630+0000 7fe163fff700  0 client.18429308
ms_handle_reset on v2:x.x.x.x:6800/3145139402
[]

ceph tell mds.0 client ls
2022-02-11T15:22:08.726+0000 7f593e7f4700  0 client.18429318
ms_handle_reset on v2:x.x.x.x:6800/3145139402
2022-02-11T15:22:08.770+0000 7f593e7f4700  0 client.18437649
ms_handle_reset on v2:x.x.x.x:6800/3145139402
[]

ceph tell mds.0 client ls
2022-02-11T15:22:09.870+0000 7fa762ffd700  0 client.18427920
ms_handle_reset on v2:x.x.x.x:6800/3145139402
2022-02-11T15:22:09.886+0000 7fa762ffd700  0 client.18429338
ms_handle_reset on v2:x.x.x.x:6800/3145139402
[]
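
In case it helps, two more things I can still check about sessions from the
mon side (same daemon name as above; commands as I understand them under
Pacific, so treat this as a sketch):

# sessions the MDS itself still tracks (should be empty with no clients)
ceph tell mds.backupfs.SERVER.fsrhfw session ls
# leftover blocklist entries from the failed mount attempts, if any
ceph osd blocklist ls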


Just for completeness here's a dump of the mds mempools:

{
  "by_pool": {
    "bloom_filter": {
      "items": 0,
      "bytes": 0
    },
    "bluestore_alloc": {
      "items": 0,
      "bytes": 0
    },
    "bluestore_cache_data": {
      "items": 0,
      "bytes": 0
    },
    "bluestore_cache_onode": {
      "items": 0,
      "bytes": 0
    },
    "bluestore_cache_meta": {
      "items": 0,
      "bytes": 0
    },
    "bluestore_cache_other": {
      "items": 0,
      "bytes": 0
    },
    "bluestore_Buffer": {
      "items": 0,
      "bytes": 0
    },
    "bluestore_Extent": {
      "items": 0,
      "bytes": 0
    },
    "bluestore_Blob": {
      "items": 0,
      "bytes": 0
    },
    "bluestore_SharedBlob": {
      "items": 0,
      "bytes": 0
    },
    "bluestore_inline_bl": {
      "items": 0,
      "bytes": 0
    },
    "bluestore_fsck": {
      "items": 0,
      "bytes": 0
    },
    "bluestore_txc": {
      "items": 0,
      "bytes": 0
    },
    "bluestore_writing_deferred": {
      "items": 0,
      "bytes": 0
    },
    "bluestore_writing": {
      "items": 0,
      "bytes": 0
    },
    "bluefs": {
      "items": 0,
      "bytes": 0
    },
    "bluefs_file_reader": {
      "items": 0,
      "bytes": 0
    },
    "bluefs_file_writer": {
      "items": 0,
      "bytes": 0
    },
    "buffer_anon": {
      "items": 63,
      "bytes": 4132815
    },
    "buffer_meta": {
      "items": 3,
      "bytes": 264
    },
    "osd": {
      "items": 0,
      "bytes": 0
    },
    "osd_mapbl": {
      "items": 0,
      "bytes": 0
    },
    "osd_pglog": {
      "items": 0,
      "bytes": 0
    },
    "osdmap": {
      "items": 868,
      "bytes": 19064
    },
    "osdmap_mapping": {
      "items": 0,
      "bytes": 0
    },
    "pgmap": {
      "items": 0,
      "bytes": 0
    },
    "mds_co": {
      "items": 576347798,
      "bytes": 25705305678
    },
    "unittest_1": {
      "items": 0,
      "bytes": 0
    },
    "unittest_2": {
      "items": 0,
      "bytes": 0
    }
  },
  "total": {
    "items": 576348732,
    "bytes": 25709457821
  }
}
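
(The mempool dump can be reproduced from the daemon's admin socket, roughly
like this on the MDS host with cephadm; mds_co is the pool accounting for
essentially all of the ~25GB:)

cephadm enter --name mds.backupfs.SERVER.fsrhfw
ceph daemon mds.backupfs.SERVER.fsrhfw dump_mempools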



On Fri, 11 Feb 2022 at 14:40, Venky Shankar <vshankar@xxxxxxxxxx> wrote:

> On Fri, Feb 11, 2022 at 3:40 PM Izzy Kulbe <ceph@xxxxxxx> wrote:
> >
> > Hi,
> >
> > I've tried to rm the mds0.openfiles and did a `ceph config set mds
> > mds_oft_prefetch_dirfrags false` but still with the same result of ceph
> > status reporting the daemon as up and not a lot more.
> >
> > I also tried setting the cache to ridiculously small (128M) but the MDS'
> > memory usage would still go up to 50+GB either way.
>
> Setting the cache size too small would not help.
>
> The "oversized cache" could be due to clients not releasing
> capabilities or perhaps a bug. How many clients do you have and what
> versions are they running?
>
> Could you also check "ceph health detail"?
>
> >
> > cluster:
> >   health: HEALTH_WARN
> >     1 MDSs report oversized cache
> >
> > [...]
> > services
> >     [...]
> >     mds: 1/1 daemons up
> >
> > Cheers,
> > Izzy
> >
> > On Fri, 11 Feb 2022 at 09:32, Dan van der Ster <dvanders@xxxxxxxxx>
> wrote:
> >
> > > Hi,
> > >
> > > Is the memory ballooning while the MDS is active or could it be while
> > > it is rejoining the cluster?
> > > If the latter, this could be another case of:
> > > https://tracker.ceph.com/issues/54253
> > >
> > > Cheers, Dan
> > >
> > >
> > > On Wed, Feb 9, 2022 at 7:23 PM Izzy Kulbe <ceph@xxxxxxx> wrote:
> > > >
> > > > Hi,
> > > >
> > > > last weekend we upgraded one of our clusters from 16.2.5 to 16.2.7
> using
> > > > cephadm.
> > > >
> > > > The upgrade itself seemed to run without a problem but shortly after
> the
> > > > upgrade we noticed the servers holding the MDS containers being
> laggy,
> > > then
> > > > unresponsive, then crashing outright due to getting reaped by the OOM
> > > killer.
> > > > The MDS would restart, at some point resulting in a reset of the
> whole
> > > > machine.
> > > >
> > > > At this point the cluster status started showing a lot of PGs with
> > > > snaptrim/snaptrim_wait status, so we waited for those to finish with
> the
> > > > MDS shut down.
> > > >
> > > > So far everything we've tried changed nothing about our situation -
> our
> > > MDS
> > > > will try to hog extreme amounts of memory they never needed before.
> Even
> > > > when we push the cache down to 1G or reset it to the default, they will
> > > > report as having oversized caches, using up 150+GB of RAM+swap until
> > > > resource exhaustion makes the OOM killer reap them again.
> > > >
> > > > Looking at the output of ceph fs status and the logs it seems it's
> trying
> > > > to push too much into the cache. The MDS crashes at around 25-30M
> inodes
> > > > and a few thousand dirs. ceph fs status reports the MDS as being
> > > > active; unlike in a few other posts with similar issues, the MDS does
> > > > not seem to be stuck in replay. The journal reports no issues.
> > > > Additionally, perf reports no strays, nor does listomapvals.
> > > >
> > > > The cluster contains ~33TB in ~220M files. There are multiple
> snapshots
> > > for
> > > > several folders. While ceph fs status reports the MDS as active, I
> am not
> > > > able to mount the filesystem.
> > > >
> > > > I will probably recreate/resync this soon, but I'd still like to find
> > > > out why this happened, since we plan on using Ceph in a few other
> > > > applications, but at the moment I wouldn't feel comfortable using
> > > > CephFS given that a single cephadm upgrade command made this amount
> > > > of data unavailable.
> > > >
> > > > Below are a few lines of log that basically make up the MDS' logs
> until
> > > it
> > > > runs into "mds.0.purge_queue push: pushing inode" and shortly after
> that
> > > > the crash.
> > > >
> > > > Any pointers to what went wrong here, how to debug this further or
> how to
> > > > fix this would be greatly appreciated. I'll probably have the broken
> > > system
> > > > for another day or two before resetting/recreating the FS.
> > > >
> > > > Thanks,
> > > > Izzy Kulbe
> > > >
> > > >
> > > > MDS Log:
> > > >
> > > > 20 mds.0.cache.dir(0x604.001111000*) lookup_exact_snap (head,
> > > '10000eff917')
> > > > 10  mds.0.cache.snaprealm(0x10000eff917 seq 3398 0x55c541856200)
> > > > adjust_parent 0 -> 0x55ba9f1e2e00
> > > > 12 mds.0.cache.dir(0x604.001111000*) add_primary_dentry [dentry
> > > > #0x100/stray4/10000eff917 [d89,head] auth (dversion lock) pv=0
> v=38664524
> > > > ino=0x10000eff917 state=1073741824 0x55c541854a00]
> > > > 12 mds.0.cache.dir(0x604.001111000*) _fetched  got [dentry
> > > > #0x100/stray4/10000eff917 [d89,head] auth (dversion lock) pv=0
> v=38664524
> > > > ino=0x10000eff917 state=1073741824 0x55c541854a00] [inode
> 0x10000eff917
> > > > [...d89,head] ~mds0/stray4/10000eff917/ auth v38642807
> > > > snaprealm=0x55c541856200 f(v0 m2021-12-29T05:05:25.519523+0000) n(v4
> > > > rc2021-12-29T05:05:25.527523+0000 1=0+1) old_inodes=1 (iversion lock)
> > > > 0x55c541851180]
> > > > 15 mds.0.cache.ino(0x10000eff917) maybe_ephemeral_rand unlinked
> > > directory:
> > > > cannot ephemeral random pin [inode 0x10000eff917 [...d89,head]
> > > > ~mds0/stray4/10000eff917/ auth v38642807 snaprealm=0x55c541856200
> f(v0
> > > > m2021-12-29T05:05:25.519523+0000) n(v4
> rc2021-12-29T05:05:25.527523+0000
> > > > 1=0+1) old_inodes=1 (iversion lock) 0x55c541851180]
> > > > 20 mds.0.cache.dir(0x604.001111000*) _fetched pos 137 marker 'i'
> dname
> > > > '10000efded1 [daf,head]
> > > > 20 mds.0.cache.dir(0x604.001111000*) lookup (head, '10000efded1')
> > > > 20 mds.0.cache.dir(0x604.001111000*)   miss -> (1001124a935,head)
> > > > 20 mds.0.cache.ino(0x10000efded1) decode_snap_blob
> > > snaprealm(0x10000efded1
> > > > seq d9e lc 0 cr d9e cps daf snaps={}
> > > > past_parent_snaps=be5,c3d,c95,ced,d45,d9d 0x55c541856400)
> > > > 20 mds.0.cache.dir(0x604.001111000*) lookup_exact_snap (head,
> > > '10000efded1')
> > > > 10  mds.0.cache.snaprealm(0x10000efded1 seq 3486 0x55c541856400)
> > > > adjust_parent 0 -> 0x55ba9f1e2e00
> > > > 12 mds.0.cache.dir(0x604.001111000*) add_primary_dentry [dentry
> > > > #0x100/stray4/10000efded1 [daf,head] auth (dversion lock) pv=0
> v=38664524
> > > > ino=0x10000efded1 state=1073741824 0x55c541854f00]
> > > > 12 mds.0.cache.dir(0x604.001111000*) _fetched  got [dentry
> > > > #0x100/stray4/10000efded1 [daf,head] auth (dversion lock) pv=0
> v=38664524
> > > > ino=0x10000efded1 state=1073741824 0x55c541854f00] [inode
> 0x10000efded1
> > > > [...daf,head] ~mds0/stray4/10000efded1/ auth v38652010
> > > > snaprealm=0x55c541856400 f(v0 m2022-01-04T19:09:57.033258+0000)
> n(v118
> > > > rc2022-01-04T19:09:57.041258+0000 1=0+1) old_inodes=6 (iversion lock)
> > > > 0x55c541851700]
> > > > 15 mds.0.cache.ino(0x10000efded1) maybe_ephemeral_rand unlinked
> > > directory:
> > > > cannot ephemeral random pin [inode 0x10000efded1 [...daf,head]
> > > > ~mds0/stray4/10000efded1/ auth v38652010 snaprealm=0x55c541856400
> f(v0
> > > > m2022-01-04T19:09:57.033258+0000) n(v118
> > > rc2022-01-04T19:09:57.041258+0000
> > > > 1=0+1) old_inodes=6 (iversion lock) 0x55c541851700]
> > > > 20 mds.0.cache.dir(0x604.001111000*) _fetched pos 136 marker 'i'
> dname
> > > > '10000efad6f [daf,head]
> > > > 20 mds.0.cache.dir(0x604.001111000*) lookup (head, '10000efad6f')
> > > > 20 mds.0.cache.dir(0x604.001111000*)   miss -> (1000539afa8,head)
> > > > 20 mds.0.cache.ino(0x10000efad6f) decode_snap_blob
> > > snaprealm(0x10000efad6f
> > > > seq d9e lc 0 cr d9e cps daf snaps={}
> > > > past_parent_snaps=be5,c3d,c95,ced,d45,d9d 0x55c541856600)
> > > > 20 mds.0.cache.dir(0x604.001111000*) lookup_exact_snap (head,
> > > '10000efad6f')
> > > > 10  mds.0.cache.snaprealm(0x10000efad6f seq 3486 0x55c541856600)
> > > > adjust_parent 0 -> 0x55ba9f1e2e00
> > > > 12 mds.0.cache.dir(0x604.001111000*) add_primary_dentry [dentry
> > > > #0x100/stray4/10000efad6f [daf,head] auth (dversion lock) pv=0
> v=38664524
> > > > ino=0x10000efad6f state=1073741824 0x55c541855400]
> > > > 12 mds.0.cache.dir(0x604.001111000*) _fetched  got [dentry
> > > > #0x100/stray4/10000efad6f [daf,head] auth (dversion lock) pv=0
> v=38664524
> > > > ino=0x10000efad6f state=1073741824 0x55c541855400] [inode
> 0x10000efad6f
> > > > [...daf,head] ~mds0/stray4/10000efad6f/ auth v38652024
> > > > snaprealm=0x55c541856600 f() n(v0 rc2022-01-04T19:10:53.334202+0000
> > > 1=0+1)
> > > > old_inodes=6 (iversion lock) 0x55c54185c000]
> > > > 15 mds.0.cache.ino(0x10000efad6f) maybe_ephemeral_rand unlinked
> > > directory:
> > > > cannot ephemeral random pin [inode 0x10000efad6f [...daf,head]
> > > > ~mds0/stray4/10000efad6f/ auth v38652024 snaprealm=0x55c541856600 f()
> > > n(v0
> > > > rc2022-01-04T19:10:53.334202+0000 1=0+1) old_inodes=6 (iversion lock)
> > > > 0x55c54185c000]
> > > > 20 mds.0.cache.dir(0x604.001111000*) _fetched pos 135 marker 'i'
> dname
> > > > '10000ef98c9 [d89,head]
> > > > 20 mds.0.cache.dir(0x604.001111000*) lookup (head, '10000ef98c9')
> > > > 20 mds.0.cache.dir(0x604.001111000*)   miss -> (100113dd761,head)
> > > > 20 mds.0.cache.ino(0x10000ef98c9) decode_snap_blob
> > > snaprealm(0x10000ef98c9
> > > > seq d46 lc 0 cr d46 cps d89 snaps={}
> > > > past_parent_snaps=b8b,be5,c3d,c95,ced,d45 0x55c541856800)
> > >
> >
>
>
> --
> Cheers,
> Venky
>
>


