Re: MDS damaged after mimic 13.2.1 to 13.2.2 upgrade

On Mon, Oct 8, 2018 at 5:43 PM Sergey Malinin <hell@xxxxxxxxxxx> wrote:
>
>
>
> > On 8.10.2018, at 12:37, Yan, Zheng <ukernel@xxxxxxxxx> wrote:
> >
> > On Mon, Oct 8, 2018 at 4:37 PM Sergey Malinin <hell@xxxxxxxxxxx> wrote:
> >>
> >> What additional steps need to be taken in order to (try to) regain access to the fs, given that I backed up the metadata pool, created an alternate metadata pool, and ran scan_extents, scan_links, scan_inodes, and a (more or less) recursive scrub (command sketched below)?
> >> After that I only mounted the fs read-only to back up the data.
> >> Would anything even work, given that the MDS journal and purge queue were truncated?
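> >>
> >> (For the scrub, I used the MDS admin socket, roughly:
> >>
> >> ceph daemon mds.<id> scrub_path / recursive repair
> >>
> >> with <id> being the active MDS daemon name.)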
> >>
> >
> > Did you back up the whole metadata pool? Did you make any modifications
> > to the original metadata pool? If you did, what modifications?
>
> I backed up both the journal and the purge queue and used cephfs-journal-tool to recover dentries, then reset the journal and purge queue on the original metadata pool.
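>
> Roughly this sequence (file names here are illustrative; exact tool syntax may differ by release):
>
> cephfs-journal-tool journal export backup.journal.bin
> cephfs-journal-tool --journal=purge_queue journal export backup.purge_queue.bin
> cephfs-journal-tool event recover_dentries summary
> cephfs-journal-tool journal reset
> cephfs-journal-tool --journal=purge_queue journal reset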

You can try restoring the original journal and purge queue, then downgrading
the MDS to 13.2.1. Journal object names are 20x.xxxxxxxx; purge queue
object names are 50x.xxxxxxxxx.
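
For example, with <metadata pool> standing in for your metadata pool name,
and assuming each object was backed up to a file of the same name:

# list journal and purge queue objects
rados -p <metadata pool> ls | egrep '^(20|50)'
# put one object back from its backup copy
rados -p <metadata pool> put 200.00000000 /backup/200.00000000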

> Before proceeding to alternate metadata pool recovery I was able to start the MDS, but it soon failed, throwing lots of 'loaded dup inode' errors; I am not sure whether that involved changing anything in the pool.
> I have left the original metadata pool untouched since then.
>
>
> >
> > Yan, Zheng
> >
> >>
> >>> On 8.10.2018, at 05:15, Yan, Zheng <ukernel@xxxxxxxxx> wrote:
> >>>
> >>> Sorry, this is caused by a wrong backport. Downgrading the MDS to 13.2.1 and
> >>> marking the MDS repaired can resolve this.
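> >>> For example, after the downgrade, something like:
> >>>
> >>> ceph mds repaired <fs name>:0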
> >>>
> >>> Yan, Zheng
> >>> On Sat, Oct 6, 2018 at 8:26 AM Sergey Malinin <hell@xxxxxxxxxxx> wrote:
> >>>>
> >>>> Update:
> >>>> I discovered http://tracker.ceph.com/issues/24236 and https://github.com/ceph/ceph/pull/22146
> >>>> Make sure that it is not relevant in your case before proceeding to operations that modify on-disk data.
> >>>>
> >>>>
> >>>> On 6.10.2018, at 03:17, Sergey Malinin <hell@xxxxxxxxxxx> wrote:
> >>>>
> >>>> I ended up rescanning the entire fs using the alternate metadata pool approach as in http://docs.ceph.com/docs/mimic/cephfs/disaster-recovery/ (the command sequence is sketched below).
> >>>> The process has not completed yet, because during the recovery our cluster encountered another problem with OSDs that I got fixed yesterday (thanks to Igor Fedotov @ SUSE).
> >>>> The first stage (scan_extents) completed in 84 hours (120M objects in the data pool on 8 HDD OSDs on 4 hosts). The second (scan_inodes) was interrupted by the OSD failure, so I have no timing stats, but it seems to be running 2-3 times faster than the extents scan.
> >>>> As to the root cause -- in my case I recall that during the upgrade I had forgotten to restart 3 OSDs, one of which was holding metadata pool contents, before restarting the MDS daemons, and that seemed to have had an impact on the MDS journal corruption: when I restarted those OSDs, the MDS was able to start up but soon failed, throwing lots of 'loaded dup inode' errors.
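> >>>>
> >>>> For reference, the recovery was essentially the sequence from the doc linked above (pool/fs names abbreviated here; check the doc for your release before running anything, since these commands modify on-disk data):
> >>>>
> >>>> ceph osd pool create recovery <pg num> replicated
> >>>> ceph fs new recovery-fs recovery <data pool> --allow-dangerous-metadata-overlay
> >>>> cephfs-data-scan init --force-init --filesystem recovery-fs --alternate-pool recovery
> >>>> ceph fs reset recovery-fs --yes-i-really-mean-it
> >>>> cephfs-table-tool recovery-fs:all reset session
> >>>> cephfs-table-tool recovery-fs:all reset snap
> >>>> cephfs-table-tool recovery-fs:all reset inode
> >>>> cephfs-data-scan scan_extents --alternate-pool recovery --filesystem <fs name> <data pool>
> >>>> cephfs-data-scan scan_inodes --alternate-pool recovery --filesystem <fs name> --force-corrupt --force-init <data pool>
> >>>> cephfs-data-scan scan_links --filesystem recovery-fs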
> >>>>
> >>>>
> >>>> On 6.10.2018, at 00:41, Alfredo Daniel Rezinovsky <alfrenovsky@xxxxxxxxx> wrote:
> >>>>
> >>>> Same problem...
> >>>>
> >>>> # cephfs-journal-tool --journal=purge_queue journal inspect
> >>>> 2018-10-05 18:37:10.704 7f01f60a9bc0 -1 Missing object 500.0000016c
> >>>> Overall journal integrity: DAMAGED
> >>>> Objects missing:
> >>>> 0x16c
> >>>> Corrupt regions:
> >>>> 0x5b000000-ffffffffffffffff
> >>>>
> >>>> Just after upgrading to 13.2.2.
> >>>>
> >>>> Did you fix it?
> >>>>
> >>>>
> >>>> On 26/09/18 13:05, Sergey Malinin wrote:
> >>>>
> >>>> Hello,
> >>>> I followed the standard upgrade procedure to upgrade from 13.2.1 to 13.2.2.
> >>>> After the upgrade the MDS cluster is down; mds rank 0 and the purge_queue journal are damaged. Resetting the purge_queue does not seem to work well, as the journal still appears to be damaged.
> >>>> Can anybody help?
> >>>>
> >>>> mds log:
> >>>>
> >>>> -789> 2018-09-26 18:42:32.527 7f70f78b1700  1 mds.mds2 Updating MDS map to version 586 from mon.2
> >>>> -788> 2018-09-26 18:42:32.527 7f70f78b1700  1 mds.0.583 handle_mds_map i am now mds.0.583
> >>>> -787> 2018-09-26 18:42:32.527 7f70f78b1700  1 mds.0.583 handle_mds_map state change up:rejoin --> up:active
> >>>> -786> 2018-09-26 18:42:32.527 7f70f78b1700  1 mds.0.583 recovery_done -- successful recovery!
> >>>> <skip>
> >>>>  -38> 2018-09-26 18:42:32.707 7f70f28a7700 -1 mds.0.purge_queue _consume: Decode error at read_pos=0x322ec6636
> >>>>  -37> 2018-09-26 18:42:32.707 7f70f28a7700  5 mds.beacon.mds2 set_want_state: up:active -> down:damaged
> >>>>  -36> 2018-09-26 18:42:32.707 7f70f28a7700  5 mds.beacon.mds2 _send down:damaged seq 137
> >>>>  -35> 2018-09-26 18:42:32.707 7f70f28a7700 10 monclient: _send_mon_message to mon.ceph3 at mon:6789/0
> >>>>  -34> 2018-09-26 18:42:32.707 7f70f28a7700  1 -- mds:6800/e4cc09cf --> mon:6789/0 -- mdsbeacon(14c72/mds2 down:damaged seq 137 v24a) v7 -- 0x563b321ad480 con 0
> >>>> <skip>
> >>>>   -3> 2018-09-26 18:42:32.743 7f70f98b5700  5 -- mds:6800/3838577103 >> mon:6789/0 conn(0x563b3213e000 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=8 cs=1 l=1). rx mon.2 seq 29 0x563b321ab880 mdsbeacon(85106/mds2 down:damaged seq 311 v587) v7
> >>>>   -2> 2018-09-26 18:42:32.743 7f70f98b5700  1 -- mds:6800/3838577103 <== mon.2 mon:6789/0 29 ==== mdsbeacon(85106/mds2 down:damaged seq 311 v587) v7 ==== 129+0+0 (3296573291 0 0) 0x563b321ab880 con 0x563b3213e000
> >>>>   -1> 2018-09-26 18:42:32.743 7f70f98b5700  5 mds.beacon.mds2 handle_mds_beacon down:damaged seq 311 rtt 0.038261
> >>>>    0> 2018-09-26 18:42:32.743 7f70f28a7700  1 mds.mds2 respawn!
> >>>>
> >>>> # cephfs-journal-tool --journal=purge_queue journal inspect
> >>>> Overall journal integrity: DAMAGED
> >>>> Corrupt regions:
> >>>> 0x322ec65d9-ffffffffffffffff
> >>>>
> >>>> # cephfs-journal-tool --journal=purge_queue journal reset
> >>>> old journal was 13470819801~8463
> >>>> new journal start will be 13472104448 (1276184 bytes past old end)
> >>>> writing journal head
> >>>> done
> >>>>
> >>>> # cephfs-journal-tool --journal=purge_queue journal inspect
> >>>> 2018-09-26 19:00:52.848 7f3f9fa50bc0 -1 Missing object 500.00000c8c
> >>>> Overall journal integrity: DAMAGED
> >>>> Objects missing:
> >>>> 0xc8c
> >>>> Corrupt regions:
> >>>> 0x323000000-ffffffffffffffff
> >>
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


