MDS rank 0 damaged after update to 14.2.20

Hi *,

I tried a minor update (14.2.9 --> 14.2.20) on our Ceph cluster today and ended up with a damaged CephFS. It's rather urgent since no one can really work right now, so any quick help is highly appreciated.

For the update I followed the usual procedure; after all MONs were finished I started restarting the OSDs, but suddenly our CephFS became unresponsive (and still is).

I believe these lines are the critical ones:

---snap---
   -12> 2021-05-18 09:53:01.488 7f7e9ed82700  5 mds.beacon.mds01 received beacon reply up:replay seq 906 rtt 0
   -11> 2021-05-18 09:53:01.624 7f7e9f583700 10 monclient: get_auth_request con 0x5608a5171600 auth_method 0
   -10> 2021-05-18 09:53:03.732 7f7e94d6e700 -1 mds.0.journaler.mdlog(ro) try_read_entry: decode error from _is_readable
    -9> 2021-05-18 09:53:03.732 7f7e94d6e700  0 mds.0.log _replay journaler got error -22, aborting
    -8> 2021-05-18 09:53:03.732 7f7e94d6e700 -1 log_channel(cluster) log [ERR] : Error loading MDS rank 0: (22) Invalid argument
    -7> 2021-05-18 09:53:03.732 7f7e94d6e700  5 mds.beacon.mds01 set_want_state: up:replay -> down:damaged
    -6> 2021-05-18 09:53:03.732 7f7e94d6e700 10 log_client  log_queue is 1 last_log 1 sent 0 num 1 unsent 1 sending 1
    -5> 2021-05-18 09:53:03.732 7f7e94d6e700 10 log_client  will send 2021-05-18 09:53:03.735824 mds.mds01 (mds.0) 1 : cluster [ERR] Error loading MDS rank 0: (22) Invalid argument
    -4> 2021-05-18 09:53:03.732 7f7e94d6e700 10 monclient: _send_mon_message to mon.ceph01 at v2:XXX.XXX.XXX.XXX:3300/0
    -3> 2021-05-18 09:53:03.732 7f7e94d6e700  5 mds.beacon.mds01 Sending beacon down:damaged seq 907
    -2> 2021-05-18 09:53:03.732 7f7e94d6e700 10 monclient: _send_mon_message to mon.ceph01 at v2:XXX.XXX.XXX.XXX:3300/0
    -1> 2021-05-18 09:53:03.908 7f7e9ed82700  5 mds.beacon.mds01 received beacon reply down:damaged seq 907 rtt 0.176001
     0> 2021-05-18 09:53:03.908 7f7e94d6e700  1 mds.mds01 respawn!
---snap---

These logs are from my attempt to bring the MDS rank back up with

ceph mds repaired 0

I've attached a longer excerpt of the log file in case it helps. Before trying anything from the disaster recovery steps I'd like to ask for your input, since those can make things even worse. The current status is below; please let me know if more information is required.
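For reference, the only thing I was planning to do so far (and would appreciate a sanity check on) are the read-only first steps from the disaster-recovery docs: export a backup of the journal and inspect it, nothing destructive. A sketch, assuming the filesystem is named "cephfs" and rank 0 is the damaged one (the backup path is just an example):

```shell
# Back up the damaged rank's journal before touching anything.
cephfs-journal-tool --rank=cephfs:0 journal export /root/cephfs-mds0-journal.bin

# Read-only consistency check; reports where the journal is unreadable
# without modifying it.
cephfs-journal-tool --rank=cephfs:0 journal inspect

# Show the journal header (trim/expire/write positions) for reference.
cephfs-journal-tool --rank=cephfs:0 header get
```

The destructive steps (`event recover_dentries summary`, `journal reset`) are what I'd rather not run without your input.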

Thanks!
Eugen


ceph01:~ # ceph -s
  cluster:
    id:     655cb05a-435a-41ba-83d9-8549f7c36167
    health: HEALTH_ERR
            1 filesystem is degraded
            1 filesystem is offline
            1 mds daemon damaged
            noout flag(s) set
            Some pool(s) have the nodeep-scrub flag(s) set

  services:
    mon: 3 daemons, quorum ceph01,ceph02,ceph03 (age 116m)
    mgr: ceph03(active, since 118m), standbys: ceph02, ceph01
    mds: cephfs:0/1 3 up:standby, 1 damaged
    osd: 32 osds: 32 up (since 64m), 32 in (since 8w)
         flags noout

  data:
    pools:   14 pools, 512 pgs
    objects: 5.08M objects, 8.6 TiB
    usage:   27 TiB used, 33 TiB / 59 TiB avail
    pgs:     512 active+clean



-71> 2021-05-18 09:52:57.488 7f7e9a579700  5 mds.beacon.mds01 Sending beacon up:standby seq 905
   -70> 2021-05-18 09:52:57.488 7f7e9a579700 10 monclient: _send_mon_message to mon.ceph01 at v2:XXX.XXX.XXX.XXX:3300/0
   -69> 2021-05-18 09:52:57.568 7f7e9ed82700  5 mds.beacon.mds01 received beacon reply up:standby seq 905 rtt 0.0800005
   -68> 2021-05-18 09:53:01.468 7f7e9cd7e700  1 mds.mds01 Updating MDS map to version 27240 from mon.0
   -67> 2021-05-18 09:53:01.468 7f7e9cd7e700  4 mds.0.purge_queue operator():  data pool 35 not found in OSDMap
   -66> 2021-05-18 09:53:01.468 7f7e9cd7e700  4 mds.0.purge_queue operator():  data pool 53 not found in OSDMap
   -65> 2021-05-18 09:53:01.472 7f7e9cd7e700  5 asok(0x5608a42c6000) register_command objecter_requests hook 0x5608a4214200
   -64> 2021-05-18 09:53:01.472 7f7e9cd7e700 10 monclient: _renew_subs
   -63> 2021-05-18 09:53:01.472 7f7e9cd7e700 10 monclient: _send_mon_message to mon.ceph01 at v2:XXX.XXX.XXX.XXX:3300/0
   -62> 2021-05-18 09:53:01.472 7f7e9cd7e700 10 log_channel(cluster) update_config to_monitors: true to_syslog: false syslog_facility: daemon prio: info to_graylog: false graylog_host: 127.0.0.1 graylog_port: 12201)
   -61> 2021-05-18 09:53:01.472 7f7e9cd7e700  4 mds.0.purge_queue operator():  data pool 35 not found in OSDMap
   -60> 2021-05-18 09:53:01.472 7f7e9cd7e700  4 mds.0.purge_queue operator():  data pool 53 not found in OSDMap
   -59> 2021-05-18 09:53:01.472 7f7e9cd7e700  4 mds.0.0 handle_osd_map epoch 0, 0 new blacklist entries
   -58> 2021-05-18 09:53:01.472 7f7e9cd7e700  1 mds.0.27240 handle_mds_map i am now mds.0.27240
   -57> 2021-05-18 09:53:01.472 7f7e9cd7e700  1 mds.0.27240 handle_mds_map state change up:boot --> up:replay
   -56> 2021-05-18 09:53:01.472 7f7e9cd7e700  5 mds.beacon.mds01 set_want_state: up:standby -> up:replay
   -55> 2021-05-18 09:53:01.472 7f7e9cd7e700  1 mds.0.27240 replay_start
   -54> 2021-05-18 09:53:01.472 7f7e9cd7e700  1 mds.0.27240  waiting for osdmap 296433 (which blacklists prior instance)
   -53> 2021-05-18 09:53:01.472 7f7e9cd7e700  4 mds.0.27240 handle_osd_map epoch 296433, 0 new blacklist entries
   -52> 2021-05-18 09:53:01.472 7f7e9cd7e700 10 monclient: _renew_subs
   -51> 2021-05-18 09:53:01.472 7f7e9cd7e700 10 monclient: _send_mon_message to mon.ceph01 at v2:XXX.XXX.XXX.XXX:3300/0
   -50> 2021-05-18 09:53:01.472 7f7e96571700  2 mds.0.27240 Booting: 0: opening inotable
   -49> 2021-05-18 09:53:01.472 7f7e96571700  2 mds.0.27240 Booting: 0: opening sessionmap
   -48> 2021-05-18 09:53:01.472 7f7e96571700  2 mds.0.27240 Booting: 0: opening mds log
   -47> 2021-05-18 09:53:01.472 7f7e96571700  5 mds.0.log open discovering log bounds
   -46> 2021-05-18 09:53:01.472 7f7e96571700  2 mds.0.27240 Booting: 0: opening purge queue (async)
   -45> 2021-05-18 09:53:01.472 7f7e96571700  4 mds.0.purge_queue open: opening
   -44> 2021-05-18 09:53:01.472 7f7e96571700  1 mds.0.journaler.pq(ro) recover start
   -43> 2021-05-18 09:53:01.472 7f7e96571700  1 mds.0.journaler.pq(ro) read_head
   -42> 2021-05-18 09:53:01.472 7f7e96571700  2 mds.0.27240 Booting: 0: loading open file table (async)
   -41> 2021-05-18 09:53:01.472 7f7e96571700  2 mds.0.27240 Booting: 0: opening snap table
   -40> 2021-05-18 09:53:01.472 7f7e95d70700  4 mds.0.journalpointer Reading journal pointer '400.00000000'
   -39> 2021-05-18 09:53:01.472 7f7e9f583700 10 monclient: get_auth_request con 0x5608a4ffe480 auth_method 0
   -38> 2021-05-18 09:53:01.472 7f7e9fd84700 10 monclient: get_auth_request con 0x5608a4ffe000 auth_method 0
   -37> 2021-05-18 09:53:01.472 7f7e9ed82700 10 monclient: get_auth_request con 0x5608a4ffe900 auth_method 0
   -36> 2021-05-18 09:53:01.476 7f7e9fd84700 10 monclient: get_auth_request con 0x5608a4ffed80 auth_method 0
   -35> 2021-05-18 09:53:01.476 7f7e95d70700  1 mds.0.journaler.mdlog(ro) recover start
   -34> 2021-05-18 09:53:01.476 7f7e95d70700  1 mds.0.journaler.mdlog(ro) read_head
   -33> 2021-05-18 09:53:01.476 7f7e95d70700  4 mds.0.log Waiting for journal 0x200 to recover...
   -32> 2021-05-18 09:53:01.476 7f7e9f583700 10 monclient: get_auth_request con 0x5608a4fffb00 auth_method 0
   -31> 2021-05-18 09:53:01.476 7f7e97573700  1 mds.0.journaler.pq(ro) _finish_read_head loghead(trim 352682246144, expire 352684157231, write 352684790299, stream_format 1).  probing for end of log (from 352684790299)...
   -30> 2021-05-18 09:53:01.476 7f7e97573700  1 mds.0.journaler.pq(ro) probing for end of the log
   -29> 2021-05-18 09:53:01.476 7f7e9ed82700 10 monclient: get_auth_request con 0x5608a5153b00 auth_method 0
   -28> 2021-05-18 09:53:01.476 7f7e9fd84700 10 monclient: get_auth_request con 0x5608a515a000 auth_method 0
   -27> 2021-05-18 09:53:01.476 7f7e96571700  1 mds.0.journaler.mdlog(ro) _finish_read_head loghead(trim 59544981995520, expire 59544983437675, write 59545225782916, stream_format 1).  probing for end of log (from 59545225782916)...
   -26> 2021-05-18 09:53:01.476 7f7e96571700  1 mds.0.journaler.mdlog(ro) probing for end of the log
   -25> 2021-05-18 09:53:01.476 7f7e96571700  1 mds.0.journaler.mdlog(ro) _finish_probe_end write_pos = 59545228977934 (header had 59545225782916). recovered.
   -24> 2021-05-18 09:53:01.476 7f7e95d70700  4 mds.0.log Journal 0x200 recovered.
   -23> 2021-05-18 09:53:01.476 7f7e95d70700  4 mds.0.log Recovered journal 0x200 in format 1
   -22> 2021-05-18 09:53:01.476 7f7e97573700  1 mds.0.journaler.pq(ro) _finish_probe_end write_pos = 352684790299 (header had 352684790299). recovered.
   -21> 2021-05-18 09:53:01.476 7f7e97573700  4 mds.0.purge_queue operator(): open complete
   -20> 2021-05-18 09:53:01.476 7f7e97573700  1 mds.0.journaler.pq(ro) set_writeable
   -19> 2021-05-18 09:53:01.476 7f7e95d70700  2 mds.0.27240 Booting: 1: loading/discovering base inodes
   -18> 2021-05-18 09:53:01.476 7f7e95d70700  0 mds.0.cache creating system inode with ino:0x100
   -17> 2021-05-18 09:53:01.476 7f7e95d70700  0 mds.0.cache creating system inode with ino:0x1
   -16> 2021-05-18 09:53:01.480 7f7e96571700  2 mds.0.27240 Booting: 2: replaying mds log
   -15> 2021-05-18 09:53:01.480 7f7e96571700  2 mds.0.27240 Booting: 2: waiting for purge queue recovered
   -14> 2021-05-18 09:53:01.488 7f7e9a579700  5 mds.beacon.mds01 Sending beacon up:replay seq 906
   -13> 2021-05-18 09:53:01.488 7f7e9a579700 10 monclient: _send_mon_message to mon.ceph01 at v2:XXX.XXX.XXX.XXX:3300/0
   -12> 2021-05-18 09:53:01.488 7f7e9ed82700  5 mds.beacon.mds01 received beacon reply up:replay seq 906 rtt 0
   -11> 2021-05-18 09:53:01.624 7f7e9f583700 10 monclient: get_auth_request con 0x5608a5171600 auth_method 0
   -10> 2021-05-18 09:53:03.732 7f7e94d6e700 -1 mds.0.journaler.mdlog(ro) try_read_entry: decode error from _is_readable
    -9> 2021-05-18 09:53:03.732 7f7e94d6e700  0 mds.0.log _replay journaler got error -22, aborting
    -8> 2021-05-18 09:53:03.732 7f7e94d6e700 -1 log_channel(cluster) log [ERR] : Error loading MDS rank 0: (22) Invalid argument
    -7> 2021-05-18 09:53:03.732 7f7e94d6e700  5 mds.beacon.mds01 set_want_state: up:replay -> down:damaged
    -6> 2021-05-18 09:53:03.732 7f7e94d6e700 10 log_client  log_queue is 1 last_log 1 sent 0 num 1 unsent 1 sending 1
    -5> 2021-05-18 09:53:03.732 7f7e94d6e700 10 log_client  will send 2021-05-18 09:53:03.735824 mds.mds01 (mds.0) 1 : cluster [ERR] Error loading MDS rank 0: (22) Invalid argument
    -4> 2021-05-18 09:53:03.732 7f7e94d6e700 10 monclient: _send_mon_message to mon.ceph01 at v2:XXX.XXX.XXX.XXX:3300/0
    -3> 2021-05-18 09:53:03.732 7f7e94d6e700  5 mds.beacon.mds01 Sending beacon down:damaged seq 907
    -2> 2021-05-18 09:53:03.732 7f7e94d6e700 10 monclient: _send_mon_message to mon.ceph01 at v2:XXX.XXX.XXX.XXX:3300/0
    -1> 2021-05-18 09:53:03.908 7f7e9ed82700  5 mds.beacon.mds01 received beacon reply down:damaged seq 907 rtt 0.176001
     0> 2021-05-18 09:53:03.908 7f7e94d6e700  1 mds.mds01 respawn!
--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   1/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 0 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 1 reserver
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 rgw_sync
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
   1/ 5 compressor
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   4/ 5 memdb
   1/ 5 kinetic
   1/ 5 fuse
   1/ 5 mgr
   1/ 5 mgrc
   1/ 5 dpdk
   1/ 5 eventtrace
   1/ 5 prioritycache
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-mds.mds01.log
--- end dump of recent events ---
2021-05-18 09:53:03.920 7f7e94d6e700  1 mds.mds01  e: '/usr/bin/ceph-mds'
2021-05-18 09:53:03.920 7f7e94d6e700  1 mds.mds01  0: '/usr/bin/ceph-mds'
2021-05-18 09:53:03.920 7f7e94d6e700  1 mds.mds01  1: '-f'
2021-05-18 09:53:03.920 7f7e94d6e700  1 mds.mds01  2: '--cluster'
2021-05-18 09:53:03.920 7f7e94d6e700  1 mds.mds01  3: 'ceph'
2021-05-18 09:53:03.920 7f7e94d6e700  1 mds.mds01  4: '--id'
2021-05-18 09:53:03.920 7f7e94d6e700  1 mds.mds01  5: 'mds01'
2021-05-18 09:53:03.920 7f7e94d6e700  1 mds.mds01  6: '--setuser'
2021-05-18 09:53:03.920 7f7e94d6e700  1 mds.mds01  7: 'ceph'
2021-05-18 09:53:03.920 7f7e94d6e700  1 mds.mds01  8: '--setgroup'
2021-05-18 09:53:03.920 7f7e94d6e700  1 mds.mds01  9: 'ceph'
2021-05-18 09:53:03.920 7f7e94d6e700  1 mds.mds01 respawning with exe /usr/bin/ceph-mds
2021-05-18 09:53:03.920 7f7e94d6e700  1 mds.mds01  exe_path /proc/self/exe
2021-05-18 09:53:04.004 7f466a2a7540  0 ceph version 14.2.20-402-g6aa76c6815 (6aa76c681583e1eb4912f4e663591aba466c73ff) nautilus (stable), process ceph-mds, pid 17993
2021-05-18 09:53:04.004 7f466a2a7540  0 pidfile_write: ignore empty --pid-file
2021-05-18 09:53:04.012 7f4658856700  1 mds.mds01 Updating MDS map to version 27240 from mon.0
2021-05-18 09:53:04.104 7f4658856700  1 mds.mds01 Updating MDS map to version 27241 from mon.0
2021-05-18 09:53:09.004 7f4658856700  1 mds.mds01 Updating MDS map to version 27242 from mon.0
2021-05-18 09:53:09.004 7f4658856700  1 mds.mds01 Monitors have assigned me to become a standby.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx