I also did something wrong during the upgrade restart...
After rescanning with:
cephfs-data-scan scan_extents cephfs_data (with threads)
cephfs-data-scan scan_inodes cephfs_data (with threads)
cephfs-data-scan scan_links
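(By "with threads" I mean several worker processes run in parallel. A minimal sketch of what that looks like, based on the worker options described in the disaster-recovery docs; the choice of 4 workers here is just an example:

cephfs-data-scan scan_extents --worker_n 0 --worker_m 4 cephfs_data
cephfs-data-scan scan_extents --worker_n 1 --worker_m 4 cephfs_data
cephfs-data-scan scan_extents --worker_n 2 --worker_m 4 cephfs_data
cephfs-data-scan scan_extents --worker_n 3 --worker_m 4 cephfs_data
# same pattern again for scan_inodes once all extent workers have finished
)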
My MDS still crashes and won't replay.
1: (()+0x3ec320) [0x55b0e2bd2320]
2: (()+0x12890) [0x7fc3adce3890]
3: (gsignal()+0xc7) [0x7fc3acddbe97]
4: (abort()+0x141) [0x7fc3acddd801]
5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x250) [0x7fc3ae3cc080]
6: (()+0x26c0f7) [0x7fc3ae3cc0f7]
7: (()+0x21eb27) [0x55b0e2a04b27]
8: (MDCache::journal_dirty_inode(MutationImpl*, EMetaBlob*, CInode*, snapid_t)+0xc0) [0x55b0e2a04d40]
9: (Locker::check_inode_max_size(CInode*, bool, unsigned long, unsigned long, utime_t)+0x91d) [0x55b0e2a6a0fd]
10: (RecoveryQueue::_recovered(CInode*, int, unsigned long, utime_t)+0x39f) [0x55b0e2a3ca2f]
11: (MDSIOContextBase::complete(int)+0x119) [0x55b0e2b54ab9]
12: (Filer::C_Probe::finish(int)+0xe7) [0x55b0e2bd94e7]
13: (Context::complete(int)+0x9) [0x55b0e28e9719]
14: (Finisher::finisher_thread_entry()+0x12e) [0x7fc3ae3ca4ce]
15: (()+0x76db) [0x7fc3adcd86db]
16: (clone()+0x3f) [0x7fc3acebe88f]
Did you do something else before starting the MDSs again?
On 05/10/18 21:17, Sergey Malinin wrote:
I ended up rescanning the entire fs using the alternate metadata pool approach as in http://docs.ceph.com/docs/mimic/cephfs/disaster-recovery/
The process has not completed yet because during the recovery our cluster encountered another problem with OSDs that I got fixed yesterday (thanks to Igor Fedotov @ SUSE).
The first stage (scan_extents) completed in 84 hours (120M objects in the data pool on 8 HDD OSDs on 4 hosts). The second (scan_inodes) was interrupted by the OSD failure, so I have no timing stats, but it seems to be running 2-3 times faster than the extents scan.
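For reference, the sequence I followed from that page was roughly the one below. This is a condensed sketch; the pool/fs names "recovery" and "recovery-fs", the placeholders, and the exact flags are taken from the doc's example, so please check them against the page itself rather than copying from here:

ceph osd pool create recovery <pg-num> replicated
ceph fs new recovery-fs recovery <original data pool> --allow-dangerous-metadata-overlay
cephfs-data-scan init --force-init --filesystem recovery-fs --alternate-pool recovery
ceph fs reset recovery-fs --yes-i-really-mean-it
cephfs-table-tool recovery-fs:all reset session
cephfs-table-tool recovery-fs:all reset snap
cephfs-table-tool recovery-fs:all reset inode
cephfs-data-scan scan_extents --alternate-pool recovery --filesystem <original fs name> <original data pool>
cephfs-data-scan scan_inodes --alternate-pool recovery --filesystem <original fs name> --force-corrupt --force-init <original data pool>
cephfs-data-scan scan_links --filesystem recovery-fs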
As to the root cause -- in my case I recall that during the upgrade I had forgotten to restart 3 OSDs, one of which was holding metadata pool contents, before restarting the MDS daemons, and that seemed to have had an impact on the MDS journal corruption, because when I restarted those OSDs the MDS was able to start up but soon failed, throwing lots of 'loaded dup inode' errors.
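In hindsight, a quick version check before restarting the MDS daemons would have caught the OSDs still running the old release, e.g.:

ceph versions    # counts of running daemons per release, broken down by daemon type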
Same problem...
# cephfs-journal-tool --journal=purge_queue journal inspect
2018-10-05 18:37:10.704 7f01f60a9bc0 -1 Missing object 500.0000016c
Overall journal integrity: DAMAGED
Objects missing:
0x16c
Corrupt regions:
0x5b000000-ffffffffffffffff
Just after upgrade to 13.2.2
Did you fix it?
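For anyone hitting the same thing: these are the commands I am using to look at both journals, plus a backup of the purge_queue journal before touching it (the output filename is arbitrary):

cephfs-journal-tool --journal=mdlog journal inspect
cephfs-journal-tool --journal=purge_queue journal inspect
cephfs-journal-tool --journal=purge_queue journal export purge_queue.backup.bin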
On 26/09/18 13:05, Sergey Malinin wrote:
Hello,
Followed the standard upgrade procedure to upgrade from 13.2.1 to 13.2.2.
After the upgrade the MDS cluster is down; mds rank 0 and the purge_queue journal are damaged. Resetting the purge_queue does not seem to work well, as the journal still appears to be damaged.
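The damage itself shows up in the cluster status, e.g.:

ceph health detail    # reports the damaged MDS rank
ceph fs status        # shows the filesystem and which rank/daemon is damaged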
Can anybody help?
mds log:
-789> 2018-09-26 18:42:32.527 7f70f78b1700 1 mds.mds2 Updating MDS map to version 586 from mon.2
-788> 2018-09-26 18:42:32.527 7f70f78b1700 1 mds.0.583 handle_mds_map i am now mds.0.583
-787> 2018-09-26 18:42:32.527 7f70f78b1700 1 mds.0.583 handle_mds_map state change up:rejoin --> up:active
-786> 2018-09-26 18:42:32.527 7f70f78b1700 1 mds.0.583 recovery_done -- successful recovery!
<skip>
-38> 2018-09-26 18:42:32.707 7f70f28a7700 -1 mds.0.purge_queue _consume: Decode error at read_pos=0x322ec6636
-37> 2018-09-26 18:42:32.707 7f70f28a7700 5 mds.beacon.mds2 set_want_state: up:active -> down:damaged
-36> 2018-09-26 18:42:32.707 7f70f28a7700 5 mds.beacon.mds2 _send down:damaged seq 137
-35> 2018-09-26 18:42:32.707 7f70f28a7700 10 monclient: _send_mon_message to mon.ceph3 at mon:6789/0
-34> 2018-09-26 18:42:32.707 7f70f28a7700 1 -- mds:6800/e4cc09cf --> mon:6789/0 -- mdsbeacon(14c72/mds2 down:damaged seq 137 v24a) v7 -- 0x563b321ad480 con 0
<skip>
-3> 2018-09-26 18:42:32.743 7f70f98b5700 5 -- mds:6800/3838577103 >> mon:6789/0 conn(0x563b3213e000 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=8 cs=1 l=1). rx mon.2 seq 29 0x563b321ab880 mdsbeacon(85106/mds2 down:damaged seq 311 v587) v7
-2> 2018-09-26 18:42:32.743 7f70f98b5700 1 -- mds:6800/3838577103 <== mon.2 mon:6789/0 29 ==== mdsbeacon(85106/mds2 down:damaged seq 311 v587) v7 ==== 129+0+0 (3296573291 0 0) 0x563b321ab880 con 0x563b3213e000
-1> 2018-09-26 18:42:32.743 7f70f98b5700 5 mds.beacon.mds2 handle_mds_beacon down:damaged seq 311 rtt 0.038261
0> 2018-09-26 18:42:32.743 7f70f28a7700 1 mds.mds2 respawn!
# cephfs-journal-tool --journal=purge_queue journal inspect
Overall journal integrity: DAMAGED
Corrupt regions:
0x322ec65d9-ffffffffffffffff
# cephfs-journal-tool --journal=purge_queue journal reset
old journal was 13470819801~8463
new journal start will be 13472104448 (1276184 bytes past old end)
writing journal head
done
# cephfs-journal-tool --journal=purge_queue journal inspect
2018-09-26 19:00:52.848 7f3f9fa50bc0 -1 Missing object 500.00000c8c
Overall journal integrity: DAMAGED
Objects missing:
0xc8c
Corrupt regions:
0x323000000-ffffffffffffffff
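(Note for anyone finding this thread later: as I understand the disaster-recovery docs, even after the journals have been repaired or reset, the rank stays flagged as damaged until it is explicitly marked repaired, along the lines of the sketch below. Rank 0 and the daemon name mds2 are assumptions based on the log above.)

ceph mds repaired 0              # clear the damaged flag so the mons will assign an MDS to rank 0 again
systemctl restart ceph-mds@mds2  # then restart the standby MDS daemon (unit name is an assumption)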