Hi all,

unfortunately I'm still struggling to bring CephFS back up after one of the MDS ranks was marked "damaged" (see my messages from Monday).

1. When I mark the rank as "repaired", this is what I get in the monitor log (leaving unrelated leveldb compaction chatter aside):

2017-10-10 10:51:23.177865 7f3290710700 0 log_channel(audit) log [INF] : from='client.? 147.87.226.72:0/1658479115' entity='client.admin' cmd='[{"prefix": "mds repaired", "rank": "6"}]': finished
2017-10-10 10:51:23.177993 7f3290710700 0 log_channel(cluster) log [DBG] : fsmap cephfs-9/9/9 up {0=mds1=up:resolve,1=mds2=up:resolve,2=mds3=up:resolve,3=mds4=up:resolve,4=mds5=up:resolve,5=mds6=up:resolve,6=mds9=up:replay,7=mds7=up:resolve,8=mds8=up:resolve}
[...]
2017-10-10 10:51:23.492040 7f328ab1c700 1 mon.mon1@0(leader).mds e96186 mds mds.? 147.87.226.189:6800/524543767 can't write to fsmap compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=file layout v2}
[...]
2017-10-10 10:51:24.291827 7f328d321700 -1 log_channel(cluster) log [ERR] : Health check failed: 1 mds daemon damaged (MDS_DAMAGE)

2. ...and this is what I get on the MDS:

2017-10-10 11:21:26.537204 7fcb01702700 -1 mds.6.journaler.pq(ro) _decode error from assimilate_prefetch
2017-10-10 11:21:26.537223 7fcb01702700 -1 mds.6.purge_queue _recover: Error -22 recovering write_pos

(see attachment for the full MDS log during the "repair" action)

I'm really stuck here and would greatly appreciate any help. How can I find out what the actual problem is? Running ceph-mon/ceph-mds with raised debug levels just logs "damaged" as quoted above, but doesn't tell me what is wrong or why it is failing. Would going back to a single MDS with "ceph fs reset" allow me to access the data again?

Regards,
Daniel
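P.S. For completeness, these are the commands I have been running, plus the debug bump I intend to try next (rank 6 and daemon mds9 are from my setup; debug_journaler is only my guess at the relevant extra subsystem):

  # mark rank 6 as repaired again
  ceph mds repaired 6

  # raise debug levels on the affected daemon while it is still up
  ceph tell mds.mds9 injectargs '--debug_mds 20 --debug_journaler 20'

The rados commands I intend to use next to look at the purge queue objects directly are at the very end, below the attached log.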
--- attached mds log (mds9, debug_mds=20) during the "repair" attempt ---

2017-10-10 11:21:26.419394 7fcb0670c700 10 mds.mds9 mdsmap compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=file layout v2}
2017-10-10 11:21:26.419399 7fcb0670c700 10 mds.mds9 map says I am 147.87.226.189:6800/1182896077 mds.6.96195 state up:replay
2017-10-10 11:21:26.419623 7fcb0670c700 4 mds.6.purge_queue operator(): data pool 7 not found in OSDMap
2017-10-10 11:21:26.419679 7fcb0670c700 10 mds.mds9 handle_mds_map: initializing MDS rank 6
2017-10-10 11:21:26.419916 7fcb0670c700 10 mds.6.0 update_log_config log_to_monitors {default=true}
2017-10-10 11:21:26.419920 7fcb0670c700 10 mds.6.0 create_logger
2017-10-10 11:21:26.420138 7fcb0670c700 7 mds.6.server operator(): full = 0 epoch = 0
2017-10-10 11:21:26.420146 7fcb0670c700 4 mds.6.purge_queue operator(): data pool 7 not found in OSDMap
2017-10-10 11:21:26.420150 7fcb0670c700 4 mds.6.0 handle_osd_map epoch 0, 0 new blacklist entries
2017-10-10 11:21:26.420159 7fcb0670c700 10 mds.6.server apply_blacklist: killed 0
2017-10-10 11:21:26.420338 7fcb0670c700 10 mds.mds9 handle_mds_map: handling map as rank 6
2017-10-10 11:21:26.420347 7fcb0670c700 1 mds.6.96195 handle_mds_map i am now mds.6.96195
2017-10-10 11:21:26.420351 7fcb0670c700 1 mds.6.96195 handle_mds_map state change up:boot --> up:replay
2017-10-10 11:21:26.420366 7fcb0670c700 10 mds.beacon.mds9 set_want_state: up:standby -> up:replay
2017-10-10 11:21:26.420370 7fcb0670c700 1 mds.6.96195 replay_start
2017-10-10 11:21:26.420375 7fcb0670c700 7 mds.6.cache set_recovery_set 0,1,2,3,4,5,7,8
2017-10-10 11:21:26.420380 7fcb0670c700 1 mds.6.96195 recovery set is 0,1,2,3,4,5,7,8
2017-10-10 11:21:26.420395 7fcb0670c700 1 mds.6.96195 waiting for osdmap 18593 (which blacklists prior instance)
2017-10-10 11:21:26.420401 7fcb0670c700 4 mds.6.purge_queue operator(): data pool 7 not found in OSDMap
2017-10-10 11:21:26.421206 7fcb0670c700 7 mds.6.server operator(): full = 0 epoch = 18593
2017-10-10 11:21:26.421217 7fcb0670c700 4 mds.6.96195 handle_osd_map epoch 18593, 0 new blacklist entries
2017-10-10 11:21:26.421220 7fcb0670c700 10 mds.6.server apply_blacklist: killed 0
2017-10-10 11:21:26.421253 7fcb00700700 10 MDSIOContextBase::complete: 12C_IO_Wrapper
2017-10-10 11:21:26.421263 7fcb00700700 10 MDSInternalContextBase::complete: 15C_MDS_BootStart
2017-10-10 11:21:26.421267 7fcb00700700 2 mds.6.96195 boot_start 0: opening inotable
2017-10-10 11:21:26.421285 7fcb00700700 10 mds.6.inotable: load
2017-10-10 11:21:26.421441 7fcb00700700 2 mds.6.96195 boot_start 0: opening sessionmap
2017-10-10 11:21:26.421449 7fcb00700700 10 mds.6.sessionmap load
2017-10-10 11:21:26.421551 7fcb00700700 2 mds.6.96195 boot_start 0: opening mds log
2017-10-10 11:21:26.421558 7fcb00700700 5 mds.6.log open discovering log bounds
2017-10-10 11:21:26.421720 7fcaff6fe700 10 mds.6.log _submit_thread start
2017-10-10 11:21:26.423002 7fcb00700700 10 MDSIOContextBase::complete: N12_GLOBAL__N_112C_IO_SM_LoadE
2017-10-10 11:21:26.423017 7fcb00700700 10 mds.6.sessionmap _load_finish loaded version 0
2017-10-10 11:21:26.423023 7fcb00700700 10 mds.6.sessionmap _load_finish: omap load complete
2017-10-10 11:21:26.423025 7fcb00700700 10 mds.6.sessionmap _load_finish: v 0, 0 sessions
2017-10-10 11:21:26.423027 7fcb00700700 10 mds.6.sessionmap dump
2017-10-10 11:21:26.423195 7fcb00700700 10 MDSIOContextBase::complete: 12C_IO_MT_Load
2017-10-10 11:21:26.423206 7fcb00700700 10 mds.6.inotable: load_2 got 34 bytes
2017-10-10 11:21:26.423211 7fcb00700700 10 mds.6.inotable: load_2 loaded v1
2017-10-10 11:21:26.423490 7fcaffeff700 4 mds.6.log Waiting for journal 0x206 to recover...
2017-10-10 11:21:26.426426 7fcaffeff700 4 mds.6.log Journal 0x206 recovered.
2017-10-10 11:21:26.426443 7fcaffeff700 4 mds.6.log Recovered journal 0x206 in format 1
2017-10-10 11:21:26.426451 7fcaffeff700 10 MDSInternalContextBase::complete: 15C_MDS_BootStart
2017-10-10 11:21:26.426454 7fcaffeff700 2 mds.6.96195 boot_start 1: loading/discovering base inodes
2017-10-10 11:21:26.426474 7fcaffeff700 0 mds.6.cache creating system inode with ino:0x106
2017-10-10 11:21:26.426523 7fcaffeff700 10 mds.6.cache.ino(0x106) fetch
2017-10-10 11:21:26.426736 7fcaffeff700 4 mds.6.purge_queue open: opening
2017-10-10 11:21:26.426823 7fcaffeff700 0 mds.6.cache creating system inode with ino:0x1
2017-10-10 11:21:26.428293 7fcb00700700 10 MDSIOContextBase::complete: 18C_IO_Inode_Fetched
2017-10-10 11:21:26.428309 7fcb00700700 10 mds.6.cache.ino(0x106) _fetched got 0 and 536
2017-10-10 11:21:26.428317 7fcb00700700 10 mds.6.cache.ino(0x106) magic is 'ceph fs volume v011' (expecting 'ceph fs volume v011')
2017-10-10 11:21:26.428334 7fcb00700700 10 mds.6.cache.snaprealm(0x106 seq 1 0x557db04edc00) open_parents [1,head]
2017-10-10 11:21:26.428338 7fcb00700700 20 mds.6.cache.ino(0x106) decode_snap_blob snaprealm(0x106 seq 1 lc 0 cr 0 cps 1 snaps={} 0x557db04edc00)
2017-10-10 11:21:26.428343 7fcb00700700 10 mds.6.cache.ino(0x106) _fetched [inode 0x106 [...2,head] ~mds6/ auth v19 snaprealm=0x557db04edc00 f(v0 10=0+10) n(v3 rc2017-10-03 22:56:32.400835 b6253 88=11+77)/n(v0 11=0+11) (iversion lock) 0x557db06fa700]
2017-10-10 11:21:26.535132 7fcb01702700 4 mds.6.purge_queue operator(): open complete
2017-10-10 11:21:26.535135 7fcb01702700 4 mds.6.purge_queue operator(): recovering write_pos
2017-10-10 11:21:26.537204 7fcb01702700 -1 mds.6.journaler.pq(ro) _decode error from assimilate_prefetch
2017-10-10 11:21:26.537223 7fcb01702700 -1 mds.6.purge_queue _recover: Error -22 recovering write_pos
2017-10-10 11:21:26.537228 7fcb01702700 10 mds.beacon.mds9 set_want_state: up:replay -> down:damaged
2017-10-10 11:21:26.537235 7fcb01702700 20 mds.beacon.mds9 0 slow request found
2017-10-10 11:21:26.537264 7fcb01702700 10 mds.beacon.mds9 _send down:damaged seq 80
2017-10-10 11:21:26.537334 7fcb01702700 20 mds.beacon.mds9 send_and_wait: awaiting 80 for up to 5s
2017-10-10 11:21:26.543091 7fcb0670c700 10 mds.beacon.mds9 handle_mds_beacon down:damaged seq 80 rtt 0.005819
2017-10-10 11:21:26.543144 7fcb01702700 1 mds.mds9 respawn
2017-10-10 11:21:26.543160 7fcb01702700 1 mds.mds9 e: 'ceph-mds'
2017-10-10 11:21:26.543163 7fcb01702700 1 mds.mds9 0: 'ceph-mds'
2017-10-10 11:21:26.543166 7fcb01702700 1 mds.mds9 1: '-i'
2017-10-10 11:21:26.543168 7fcb01702700 1 mds.mds9 2: 'mds9'
2017-10-10 11:21:26.543169 7fcb01702700 1 mds.mds9 3: '-f'
2017-10-10 11:21:26.543171 7fcb01702700 1 mds.mds9 4: '--debug-mds=20'
2017-10-10 11:21:26.543210 7fcb01702700 1 mds.mds9 respawning with exe /usr/bin/ceph-mds
2017-10-10 11:21:26.543216 7fcb01702700 1 mds.mds9 exe_path /proc/self/exe
2017-10-10 11:21:26.571407 7fc08e357240 0 ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable), process (unknown), pid 2237
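One more thing, in case it helps narrow this down: since the errors come from mds.6.journaler.pq / mds.6.purge_queue, the next thing I plan to try is looking at the rank-6 purge queue journal objects directly. This is only a sketch based on my (possibly wrong) understanding that the purge queue for rank N is journaled at inode 0x500 + N, i.e. 0x506 for rank 6, in the metadata pool; "cephfs_metadata" is simply the pool name in my setup:

  # list the purge queue journal objects for rank 6 (assuming inode 0x506)
  rados -p cephfs_metadata ls | grep '^506\.'

  # look at the first/header object of that journal (which I assume records the journal bounds)
  rados -p cephfs_metadata stat 506.00000000
  rados -p cephfs_metadata get 506.00000000 /tmp/pq-header.bin

If that is the wrong way to approach it, pointers to the right objects or tools would be very welcome.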