Hi all,

unfortunately I'm still struggling to bring CephFS back up after one of the MDS ranks was marked "damaged" (see my messages from Monday).

1. When I mark the rank as "repaired", this is what I get in the monitor log (leaving unrelated leveldb compaction chatter aside):

2017-10-10 10:51:23.177865 7f3290710700 0 log_channel(audit) log [INF] : from='client.? 147.87.226.72:0/1658479115' entity='client.admin' cmd='[{"prefix": "mds repaired", "rank": "6"}]': finished
2017-10-10 10:51:23.177993 7f3290710700 0 log_channel(cluster) log [DBG] : fsmap cephfs-9/9/9 up {0=mds1=up:resolve,1=mds2=up:resolve,2=mds3=up:resolve,3=mds4=up:resolve,4=mds5=up:resolve,5=mds6=up:resolve,6=mds9=up:replay,7=mds7=up:resolve,8=mds8=up:resolve}
[...]
2017-10-10 10:51:23.492040 7f328ab1c700 1 mon.mon1@0(leader).mds e96186 mds mds.? 147.87.226.189:6800/524543767 can't write to fsmap compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=file layout v2}
[...]
2017-10-10 10:51:24.291827 7f328d321700 -1 log_channel(cluster) log [ERR] : Health check failed: 1 mds daemon damaged (MDS_DAMAGE)

2. ...and this is what I get on the MDS:

2017-10-10 11:21:26.537204 7fcb01702700 -1 mds.6.journaler.pq(ro) _decode error from assimilate_prefetch
2017-10-10 11:21:26.537223 7fcb01702700 -1 mds.6.purge_queue _recover: Error -22 recovering write_pos

(see attachment for the full MDS log during the "repair" action)

I'm really stuck here and would greatly appreciate any help. How can I find out what the actual problem is? Running ceph-mon/ceph-mds with raised debug levels just logs "damaged" as quoted above, but doesn't tell me what is wrong or why it is failing. Would going back to a single MDS with "ceph fs reset" allow me to access the data again?

Regards,
Daniel
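P.S. For completeness, these are the commands I have been running, plus the debug bump I intend to try next (rank 6 and daemon mds9 are from my setup; debug_journaler is only my guess at the relevant extra subsystem):

  # mark rank 6 as repaired again
  ceph mds repaired 6

  # raise debug levels on the affected daemon while it is still up
  ceph tell mds.mds9 injectargs '--debug_mds 20 --debug_journaler 20'

The rados commands I intend to use next to look at the purge queue objects directly are at the very end, below the attached log.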
--- attached mds log (mds9, debug_mds=20) during the "repair" attempt ---

2017-10-10 11:21:26.419394 7fcb0670c700 10 mds.mds9 mdsmap compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=file layout v2}
2017-10-10 11:21:26.419399 7fcb0670c700 10 mds.mds9 map says I am 147.87.226.189:6800/1182896077 mds.6.96195 state up:replay
2017-10-10 11:21:26.419623 7fcb0670c700 4 mds.6.purge_queue operator(): data pool 7 not found in OSDMap
2017-10-10 11:21:26.419679 7fcb0670c700 10 mds.mds9 handle_mds_map: initializing MDS rank 6
2017-10-10 11:21:26.419916 7fcb0670c700 10 mds.6.0 update_log_config log_to_monitors {default=true}
2017-10-10 11:21:26.419920 7fcb0670c700 10 mds.6.0 create_logger
2017-10-10 11:21:26.420138 7fcb0670c700 7 mds.6.server operator(): full = 0 epoch = 0
2017-10-10 11:21:26.420146 7fcb0670c700 4 mds.6.purge_queue operator(): data pool 7 not found in OSDMap
2017-10-10 11:21:26.420150 7fcb0670c700 4 mds.6.0 handle_osd_map epoch 0, 0 new blacklist entries
2017-10-10 11:21:26.420159 7fcb0670c700 10 mds.6.server apply_blacklist: killed 0
2017-10-10 11:21:26.420338 7fcb0670c700 10 mds.mds9 handle_mds_map: handling map as rank 6
2017-10-10 11:21:26.420347 7fcb0670c700 1 mds.6.96195 handle_mds_map i am now mds.6.96195
2017-10-10 11:21:26.420351 7fcb0670c700 1 mds.6.96195 handle_mds_map state change up:boot --> up:replay
2017-10-10 11:21:26.420366 7fcb0670c700 10 mds.beacon.mds9 set_want_state: up:standby -> up:replay
2017-10-10 11:21:26.420370 7fcb0670c700 1 mds.6.96195 replay_start
2017-10-10 11:21:26.420375 7fcb0670c700 7 mds.6.cache set_recovery_set 0,1,2,3,4,5,7,8
2017-10-10 11:21:26.420380 7fcb0670c700 1 mds.6.96195 recovery set is 0,1,2,3,4,5,7,8
2017-10-10 11:21:26.420395 7fcb0670c700 1 mds.6.96195 waiting for osdmap 18593 (which blacklists prior instance)
2017-10-10 11:21:26.420401 7fcb0670c700 4 mds.6.purge_queue operator(): data pool 7 not found in OSDMap
2017-10-10 11:21:26.421206 7fcb0670c700 7 mds.6.server operator(): full = 0 epoch = 18593
2017-10-10 11:21:26.421217 7fcb0670c700 4 mds.6.96195 handle_osd_map epoch 18593, 0 new blacklist entries
2017-10-10 11:21:26.421220 7fcb0670c700 10 mds.6.server apply_blacklist: killed 0
2017-10-10 11:21:26.421253 7fcb00700700 10 MDSIOContextBase::complete: 12C_IO_Wrapper
2017-10-10 11:21:26.421263 7fcb00700700 10 MDSInternalContextBase::complete: 15C_MDS_BootStart
2017-10-10 11:21:26.421267 7fcb00700700 2 mds.6.96195 boot_start 0: opening inotable
2017-10-10 11:21:26.421285 7fcb00700700 10 mds.6.inotable: load
2017-10-10 11:21:26.421441 7fcb00700700 2 mds.6.96195 boot_start 0: opening sessionmap
2017-10-10 11:21:26.421449 7fcb00700700 10 mds.6.sessionmap load
2017-10-10 11:21:26.421551 7fcb00700700 2 mds.6.96195 boot_start 0: opening mds log
2017-10-10 11:21:26.421558 7fcb00700700 5 mds.6.log open discovering log bounds
2017-10-10 11:21:26.421720 7fcaff6fe700 10 mds.6.log _submit_thread start
2017-10-10 11:21:26.423002 7fcb00700700 10 MDSIOContextBase::complete: N12_GLOBAL__N_112C_IO_SM_LoadE
2017-10-10 11:21:26.423017 7fcb00700700 10 mds.6.sessionmap _load_finish loaded version 0
2017-10-10 11:21:26.423023 7fcb00700700 10 mds.6.sessionmap _load_finish: omap load complete
2017-10-10 11:21:26.423025 7fcb00700700 10 mds.6.sessionmap _load_finish: v 0, 0 sessions
2017-10-10 11:21:26.423027 7fcb00700700 10 mds.6.sessionmap dump
2017-10-10 11:21:26.423195 7fcb00700700 10 MDSIOContextBase::complete: 12C_IO_MT_Load
2017-10-10 11:21:26.423206 7fcb00700700 10 mds.6.inotable: load_2 got 34 bytes
2017-10-10 11:21:26.423211 7fcb00700700 10 mds.6.inotable: load_2 loaded v1
2017-10-10 11:21:26.423490 7fcaffeff700 4 mds.6.log Waiting for journal 0x206 to recover...
2017-10-10 11:21:26.426426 7fcaffeff700 4 mds.6.log Journal 0x206 recovered.
2017-10-10 11:21:26.426443 7fcaffeff700 4 mds.6.log Recovered journal 0x206 in format 1
2017-10-10 11:21:26.426451 7fcaffeff700 10 MDSInternalContextBase::complete: 15C_MDS_BootStart
2017-10-10 11:21:26.426454 7fcaffeff700 2 mds.6.96195 boot_start 1: loading/discovering base inodes
2017-10-10 11:21:26.426474 7fcaffeff700 0 mds.6.cache creating system inode with ino:0x106
2017-10-10 11:21:26.426523 7fcaffeff700 10 mds.6.cache.ino(0x106) fetch
2017-10-10 11:21:26.426736 7fcaffeff700 4 mds.6.purge_queue open: opening
2017-10-10 11:21:26.426823 7fcaffeff700 0 mds.6.cache creating system inode with ino:0x1
2017-10-10 11:21:26.428293 7fcb00700700 10 MDSIOContextBase::complete: 18C_IO_Inode_Fetched
2017-10-10 11:21:26.428309 7fcb00700700 10 mds.6.cache.ino(0x106) _fetched got 0 and 536
2017-10-10 11:21:26.428317 7fcb00700700 10 mds.6.cache.ino(0x106) magic is 'ceph fs volume v011' (expecting 'ceph fs volume v011')
2017-10-10 11:21:26.428334 7fcb00700700 10 mds.6.cache.snaprealm(0x106 seq 1 0x557db04edc00) open_parents [1,head]
2017-10-10 11:21:26.428338 7fcb00700700 20 mds.6.cache.ino(0x106) decode_snap_blob snaprealm(0x106 seq 1 lc 0 cr 0 cps 1 snaps={} 0x557db04edc00)
2017-10-10 11:21:26.428343 7fcb00700700 10 mds.6.cache.ino(0x106) _fetched [inode 0x106 [...2,head] ~mds6/ auth v19 snaprealm=0x557db04edc00 f(v0 10=0+10) n(v3 rc2017-10-03 22:56:32.400835 b6253 88=11+77)/n(v0 11=0+11) (iversion lock) 0x557db06fa700]
2017-10-10 11:21:26.535132 7fcb01702700 4 mds.6.purge_queue operator(): open complete
2017-10-10 11:21:26.535135 7fcb01702700 4 mds.6.purge_queue operator(): recovering write_pos
2017-10-10 11:21:26.537204 7fcb01702700 -1 mds.6.journaler.pq(ro) _decode error from assimilate_prefetch
2017-10-10 11:21:26.537223 7fcb01702700 -1 mds.6.purge_queue _recover: Error -22 recovering write_pos
2017-10-10 11:21:26.537228 7fcb01702700 10 mds.beacon.mds9 set_want_state: up:replay -> down:damaged
2017-10-10 11:21:26.537235 7fcb01702700 20 mds.beacon.mds9 0 slow request found
2017-10-10 11:21:26.537264 7fcb01702700 10 mds.beacon.mds9 _send down:damaged seq 80
2017-10-10 11:21:26.537334 7fcb01702700 20 mds.beacon.mds9 send_and_wait: awaiting 80 for up to 5s
2017-10-10 11:21:26.543091 7fcb0670c700 10 mds.beacon.mds9 handle_mds_beacon down:damaged seq 80 rtt 0.005819
2017-10-10 11:21:26.543144 7fcb01702700 1 mds.mds9 respawn
2017-10-10 11:21:26.543160 7fcb01702700 1 mds.mds9 e: 'ceph-mds'
2017-10-10 11:21:26.543163 7fcb01702700 1 mds.mds9 0: 'ceph-mds'
2017-10-10 11:21:26.543166 7fcb01702700 1 mds.mds9 1: '-i'
2017-10-10 11:21:26.543168 7fcb01702700 1 mds.mds9 2: 'mds9'
2017-10-10 11:21:26.543169 7fcb01702700 1 mds.mds9 3: '-f'
2017-10-10 11:21:26.543171 7fcb01702700 1 mds.mds9 4: '--debug-mds=20'
2017-10-10 11:21:26.543210 7fcb01702700 1 mds.mds9 respawning with exe /usr/bin/ceph-mds
2017-10-10 11:21:26.543216 7fcb01702700 1 mds.mds9 exe_path /proc/self/exe
2017-10-10 11:21:26.571407 7fc08e357240 0 ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable), process (unknown), pid 2237
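One more thing, in case it helps narrow this down: since the errors come from mds.6.journaler.pq / mds.6.purge_queue, the next thing I plan to try is looking at the rank-6 purge queue journal objects directly. This is only a sketch based on my (possibly wrong) understanding that the purge queue for rank N is journaled at inode 0x500 + N, i.e. 0x506 for rank 6, in the metadata pool; "cephfs_metadata" is simply the pool name in my setup:

  # list the purge queue journal objects for rank 6 (assuming inode 0x506)
  rados -p cephfs_metadata ls | grep '^506\.'

  # look at the first/header object of that journal (which I assume records the journal bounds)
  rados -p cephfs_metadata stat 506.00000000
  rados -p cephfs_metadata get 506.00000000 /tmp/pq-header.bin

If that is the wrong way to approach it, pointers to the right objects or tools would be very welcome.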