You can run a 13.2.1 MDS on another machine. Kill all client sessions and wait until the purge queue is empty; then it is safe to run the 13.2.2 MDS. To check the purge queue, run:

cephfs-journal-tool --rank=cephfs_name:rank --journal=purge_queue header get

The purge queue is empty when write_pos == expire_pos.
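For example, an untested sketch along these lines could poll the header until the queue drains. The file system name and rank ("cephfs_name:0") are placeholders, and the field extraction assumes the JSON-style output of "header get" -- check the field names against your version's output:

```
#!/bin/sh
# Untested sketch: wait until the purge queue is fully drained,
# i.e. write_pos == expire_pos in the purge_queue header.
# "cephfs_name:0" is a placeholder -- substitute your file system name and rank.
while true; do
    header=$(cephfs-journal-tool --rank=cephfs_name:0 --journal=purge_queue header get)
    write_pos=$(echo "$header" | grep '"write_pos"' | grep -oE '[0-9]+' | head -n1)
    expire_pos=$(echo "$header" | grep '"expire_pos"' | grep -oE '[0-9]+' | head -n1)
    echo "write_pos=$write_pos expire_pos=$expire_pos"
    [ -n "$write_pos" ] && [ "$write_pos" = "$expire_pos" ] && break
    sleep 10
done
```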
On Wed, Nov 21, 2018 at 8:49 AM Chris Martin wrote:
>
> I am also having this problem. Zheng (or anyone else), any idea how to
> perform this downgrade on a node that is also a monitor and an OSD
> node?
>
> dpkg complains of a dependency conflict when I try to install
> ceph-mds_13.2.1-1xenial_amd64.deb:
>
> ```
> dpkg: dependency problems prevent configuration of ceph-mds:
>  ceph-mds depends on ceph-base (= 13.2.1-1xenial); however:
>   Version of ceph-base on system is 13.2.2-1xenial.
> ```
>
> I don't think I want to downgrade ceph-base to 13.2.1.
>
> Thank you,
>
> Chris Martin
>
> > Sorry, this was caused by a wrong backport. Downgrading the MDS to
> > 13.2.1 and marking the MDS repaired can resolve this.
> >
> > Yan, Zheng
> >
> > On Sat, Oct 6, 2018 at 8:26 AM Sergey Malinin wrote:
> > >
> > > Update:
> > > I discovered http://tracker.ceph.com/issues/24236 and https://github.com/ceph/ceph/pull/22146
> > > Make sure that it is not relevant in your case before proceeding to operations that modify on-disk data.
> > >
> > > On 6.10.2018, at 03:17, Sergey Malinin wrote:
> > >
> > > I ended up rescanning the entire fs using the alternate metadata pool approach described in http://docs.ceph.com/docs/mimic/cephfs/disaster-recovery/
> > > The process has not completed yet because during the recovery our cluster encountered another problem with OSDs, which I got fixed yesterday (thanks to Igor Fedotov @ SUSE).
> > > The first stage (scan_extents) completed in 84 hours (120M objects in the data pool on 8 HDD OSDs across 4 hosts). The second (scan_inodes) was interrupted by the OSD failure, so I have no timing stats, but it seems to be running 2-3 times faster than the extents scan.
> > > As to the root cause: in my case I recall that during the upgrade I had forgotten to restart 3 OSDs, one of which was holding metadata pool contents, before restarting the MDS daemons. That seems to have had an impact on the MDS journal corruption, because when I restarted those OSDs the MDS was able to start up, but it soon failed, throwing lots of 'loaded dup inode' errors.
> > >
> > > On 6.10.2018, at 00:41, Alfredo Daniel Rezinovsky wrote:
> > >
> > > Same problem...
> > >
> > > # cephfs-journal-tool --journal=purge_queue journal inspect
> > > 2018-10-05 18:37:10.704 7f01f60a9bc0 -1 Missing object 500.0000016c
> > > Overall journal integrity: DAMAGED
> > > Objects missing:
> > >   0x16c
> > > Corrupt regions:
> > >   0x5b000000-ffffffffffffffff
> > >
> > > Just after upgrading to 13.2.2.
> > >
> > > Did you fix it?
> > >
> > > On 26/09/18 13:05, Sergey Malinin wrote:
> > >
> > > Hello,
> > > I followed the standard upgrade procedure to upgrade from 13.2.1 to 13.2.2.
> > > After the upgrade the MDS cluster is down; mds rank 0 and the purge_queue journal are damaged. Resetting the purge_queue does not seem to work, as the journal still appears to be damaged.
> > > Can anybody help?
> > >
> > > mds log:
> > >
> > > -789> 2018-09-26 18:42:32.527 7f70f78b1700  1 mds.mds2 Updating MDS map to version 586 from mon.2
> > > -788> 2018-09-26 18:42:32.527 7f70f78b1700  1 mds.0.583 handle_mds_map i am now mds.0.583
> > > -787> 2018-09-26 18:42:32.527 7f70f78b1700  1 mds.0.583 handle_mds_map state change up:rejoin --> up:active
> > > -786> 2018-09-26 18:42:32.527 7f70f78b1700  1 mds.0.583 recovery_done -- successful recovery!
> > > [...]
> > >  -38> 2018-09-26 18:42:32.707 7f70f28a7700 -1 mds.0.purge_queue _consume: Decode error at read_pos=0x322ec6636
> > >  -37> 2018-09-26 18:42:32.707 7f70f28a7700  5 mds.beacon.mds2 set_want_state: up:active -> down:damaged
> > >  -36> 2018-09-26 18:42:32.707 7f70f28a7700  5 mds.beacon.mds2 _send down:damaged seq 137
> > >  -35> 2018-09-26 18:42:32.707 7f70f28a7700 10 monclient: _send_mon_message to mon.ceph3 at mon:6789/0
> > >  -34> 2018-09-26 18:42:32.707 7f70f28a7700  1 -- mds:6800/e4cc09cf --> mon:6789/0 -- mdsbeacon(14c72/mds2 down:damaged seq 137 v24a) v7 -- 0x563b321ad480 con 0
> > > [...]
> > >   -3> 2018-09-26 18:42:32.743 7f70f98b5700  5 -- mds:6800/3838577103 >> mon:6789/0 conn(0x563b3213e000 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=8 cs=1 l=1). rx mon.2 seq 29 0x563b321ab880 mdsbeacon(85106/mds2 down:damaged seq 311 v587) v7
> > >   -2> 2018-09-26 18:42:32.743 7f70f98b5700  1 -- mds:6800/3838577103 <== mon.2 mon:6789/0 29 ==== mdsbeacon(85106/mds2 down:damaged seq 311 v587) v7 ==== 129+0+0 (3296573291 0 0) 0x563b321ab880 con 0x563b3213e000
> > >   -1> 2018-09-26 18:42:32.743 7f70f98b5700  5 mds.beacon.mds2 handle_mds_beacon down:damaged seq 311 rtt 0.038261
> > >    0> 2018-09-26 18:42:32.743 7f70f28a7700  1 mds.mds2 respawn!
> > >
> > > # cephfs-journal-tool --journal=purge_queue journal inspect
> > > Overall journal integrity: DAMAGED
> > > Corrupt regions:
> > >   0x322ec65d9-ffffffffffffffff
> > >
> > > # cephfs-journal-tool --journal=purge_queue journal reset
> > > old journal was 13470819801~8463
> > > new journal start will be 13472104448 (1276184 bytes past old end)
> > > writing journal head
> > > done
> > >
> > > # cephfs-journal-tool --journal=purge_queue journal inspect
> > > 2018-09-26 19:00:52.848 7f3f9fa50bc0 -1 Missing object 500.00000c8c
> > > Overall journal integrity: DAMAGED
> > > Objects missing:
> > >   0xc8c
> > > Corrupt regions:
> > >   0x323000000-ffffffffffffffff
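For reference, regarding the "marking mds repaired" step quoted above: once a 13.2.1 MDS is in place, the step would look roughly like the following sketch. The file system name "cephfs" and rank 0 are placeholders for your own values:

```
# Sketch only: assumes the damaged rank is 0 and the file system is named "cephfs".
# Clear the "damaged" flag so the rank can be taken over by an available MDS:
ceph mds repaired cephfs:0
# Then confirm that an MDS picks up the rank and goes active:
ceph status
```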