After your comment about the dual mds servers, I decided to just give up
trying to get the second one restarted.  After eyeballing what I had on one
of the new Ryzen boxes for drive space, I decided to just dump the
filesystem.  That will also make things go faster if and when I flip
everything over to bluestore.

So far so good... I just took a peek and saw the files being owned by
Mr. root, though.  Is there going to be an ownership reset at some point,
or will I have to resolve that by hand?

On 10/12/2017 06:09 AM, John Spray wrote:
> On Thu, Oct 12, 2017 at 12:23 AM, Bill Sharer <bsharer@xxxxxxxxxxxxxx> wrote:
>> I was wondering, in case I can't get the second mds back up: that offline
>> backward scrub check sounds like it should also be able to salvage what
>> it can of the two pools to a normal filesystem.  Is there an option for
>> that, or has someone written some form of salvage tool?
>
> Yep, cephfs-data-scan can do that.
>
> To scrape the files out of a CephFS data pool to a local filesystem, do this:
>
>   cephfs-data-scan scan_extents <data pool name>    # this discovers all the file sizes
>   cephfs-data-scan scan_inodes --output-dir /tmp/my_output <data pool name>
>
> The time taken by both these commands scales linearly with the number
> of objects in your data pool.
>
> This tool may not see the correct filename for recently created files
> (any file whose metadata is in the journal but not yet flushed); these
> files will go into a lost+found directory, named after their inode
> number.
>
> John
>
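(For the archives, this is more or less the sequence I'm running: John's
commands from above, filled in with the data pool name from my cluster and
his example output directory.  The user:group in the chown is just a
placeholder for whatever I settle on if nothing resets ownership for me.)

  cephfs-data-scan scan_extents cephfs_data
  cephfs-data-scan scan_inodes --output-dir /tmp/my_output cephfs_data
  # everything under the output dir currently shows up owned by root, so
  # the by-hand fallback is simply to re-own the tree afterwards, e.g.:
  chown -R someuser:somegroup /tmp/my_output
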
>> On 10/11/2017 07:07 AM, John Spray wrote:
>>> On Wed, Oct 11, 2017 at 1:42 AM, Bill Sharer <bsharer@xxxxxxxxxxxxxx> wrote:
>>>> I've been in the process of updating my Gentoo-based cluster, both with
>>>> new hardware and a somewhat postponed update.  This includes some major
>>>> stuff, including the switch from gcc 4.x to 5.4.0 on existing hardware
>>>> and using gcc 6.4.0 to make better use of AMD Ryzen on the new
>>>> hardware.  The existing cluster was on 10.2.2, but I was going to
>>>> 10.2.7-r1 as an interim step before moving on to 12.2.0 to begin
>>>> transitioning to bluestore on the OSDs.
>>>>
>>>> The Ryzen units are slated to be bluestore-based OSD servers if and when
>>>> I get to that point.  Up until the mds failure, they were simply cephfs
>>>> clients.  I had three OSD servers updated to 10.2.7-r1 (one is also a
>>>> MON) and had two servers left to update.  Both of these are also MONs
>>>> and were acting as a pair of dual active MDS servers running 10.2.2.
>>>> Monday morning I found out the hard way that the UPS one of them was on
>>>> has a dead battery.  After I fsck'd and came back up, I saw the
>>>> following assertion error when it was trying to start its mds.B server:
>>>>
>>>>  ==== mdsbeacon(64162/B up:replay seq 3 v4699) v7 ==== 126+0+0 (709014160 0 0) 0x7f6fb4001bc0 con 0x55f94779d8d0
>>>>      0> 2017-10-09 11:43:06.935662 7f6fa9ffb700 -1 mds/journal.cc: In function 'virtual void EImportStart::replay(MDSRank*)' thread 7f6fa9ffb700 time 2017-10-09 11:43:06.934972
>>>> mds/journal.cc: 2929: FAILED assert(mds->sessionmap.get_version() == cmapv)
>>>>
>>>>  ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
>>>>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x82) [0x55f93d64a122]
>>>>  2: (EImportStart::replay(MDSRank*)+0x9ce) [0x55f93d52a5ce]
>>>>  3: (MDLog::_replay_thread()+0x4f4) [0x55f93d4a8e34]
>>>>  4: (MDLog::ReplayThread::entry()+0xd) [0x55f93d25bd4d]
>>>>  5: (()+0x74a4) [0x7f6fd009b4a4]
>>>>  6: (clone()+0x6d) [0x7f6fce5a598d]
>>>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>>>
>>>> --- logging levels ---
>>>>    0/ 5 none
>>>>    0/ 1 lockdep
>>>>    0/ 1 context
>>>>    1/ 1 crush
>>>>    1/ 5 mds
>>>>    1/ 5 mds_balancer
>>>>    1/ 5 mds_locker
>>>>    1/ 5 mds_log
>>>>    1/ 5 mds_log_expire
>>>>    1/ 5 mds_migrator
>>>>    0/ 1 buffer
>>>>    0/ 1 timer
>>>>    0/ 1 filer
>>>>    0/ 1 striper
>>>>    0/ 1 objecter
>>>>    0/ 5 rados
>>>>    0/ 5 rbd
>>>>    0/ 5 rbd_mirror
>>>>    0/ 5 rbd_replay
>>>>    0/ 5 journaler
>>>>    0/ 5 objectcacher
>>>>    0/ 5 client
>>>>    0/ 5 osd
>>>>    0/ 5 optracker
>>>>    0/ 5 objclass
>>>>    1/ 3 filestore
>>>>    1/ 3 journal
>>>>    0/ 5 ms
>>>>    1/ 5 mon
>>>>    0/10 monc
>>>>    1/ 5 paxos
>>>>    0/ 5 tp
>>>>    1/ 5 auth
>>>>    1/ 5 crypto
>>>>    1/ 1 finisher
>>>>    1/ 5 heartbeatmap
>>>>    1/ 5 perfcounter
>>>>    1/ 5 rgw
>>>>    1/10 civetweb
>>>>    1/ 5 javaclient
>>>>    1/ 5 asok
>>>>    1/ 1 throttle
>>>>    0/ 0 refs
>>>>    1/ 5 xio
>>>>    1/ 5 compressor
>>>>    1/ 5 newstore
>>>>    1/ 5 bluestore
>>>>    1/ 5 bluefs
>>>>    1/ 3 bdev
>>>>    1/ 5 kstore
>>>>    4/ 5 rocksdb
>>>>    4/ 5 leveldb
>>>>    1/ 5 kinetic
>>>>    1/ 5 fuse
>>>>   -2/-2 (syslog threshold)
>>>>   -1/-1 (stderr threshold)
>>>>   max_recent     10000
>>>>   max_new         1000
>>>>   log_file /var/log/ceph/ceph-mds.B.log
>>>>
>>>> When I was googling around, I ran into this CERN presentation and tried
>>>> out the offline backward scrubbing commands on slide 25 first:
>>>>
>>>> https://indico.cern.ch/event/531810/contributions/2309925/attachments/1357386/2053998/GoncaloBorges-HEPIX16-v3.pdf
>>>>
>>>> Both ran without any messages, so I'm assuming I have sane contents in
>>>> the cephfs_data and cephfs_metadata pools.  Still no luck getting things
>>>> restarted, so I tried the cephfs-journal-tool journal reset on slide 23.
>>>> That didn't work either.  Just for giggles, I tried setting up the two
>>>> Ryzen boxes as new mds.C and mds.D servers which would run on 10.2.7-r1
>>>> instead of using mds.A and mds.B (10.2.2).  The D server fails with the
>>>> same assert as follows:
>>>
>>> Because this system was running multiple active MDSs on Jewel (based
>>> on seeing an EImportStart journal entry), and that was known to be
>>> unstable, I would advise you to blow away the filesystem and create a
>>> fresh one using luminous (where multi-mds is stable), rather than
>>> trying to debug it.  Going back to try to work out what went wrong
>>> with the Jewel code is probably not a very valuable activity unless
>>> you have irreplaceable data.
>>>
>>> If you do want to get this filesystem back on its feet in place
>>> (first stopping all MDSs): I'm guessing that your cephfs-journal-tool
>>> reset didn't help because you had multiple MDS ranks, and that tool
>>> just operates on rank 0 by default.  You need to work out which rank's
>>> journal is actually damaged (it's part of the prefix to MDS log
>>> messages), and then pass a --rank argument to cephfs-journal-tool.
>>> You will also need to reset all the other ranks' journals to keep
>>> things consistent, and then do a "ceph fs reset" so that it will start
>>> up with a single MDS next time.  If you get the filesystem up and
>>> running again, I'd still recommend copying anything important off it
>>> and creating a new one using luminous, rather than continuing to run
>>> with maybe-still-subtly-damaged metadata.
>>>
>>> John
>>>
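(Filling in the commands for anyone who finds this thread later; this is
just my reading of John's advice above, not something I've tested.  "cephfs"
is a placeholder for the filesystem name, the pool names are the ones from
my cluster, and with dual active MDSs the ranks here would be 0 and 1.
Exact option syntax may differ between Jewel and Luminous, so check
cephfs-journal-tool --help and the docs first.)

  # in-place route, with every mds daemon stopped first:
  cephfs-journal-tool --rank=0 journal reset
  cephfs-journal-tool --rank=1 journal reset
  ceph fs reset cephfs --yes-i-really-mean-it    # come back up with a single active mds

  # or the blow-away-and-recreate route on luminous, with the mds daemons stopped/failed:
  ceph fs rm cephfs --yes-i-really-mean-it
  ceph fs new cephfs cephfs_metadata cephfs_data # reusing non-empty pools may need --force, or use fresh pools
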
>>>>  === 132+0+1979520 (4198351460 0 1611007530) 0x7fffc4000a70 con 0x7fffe0013310
>>>>      0> 2017-10-09 13:01:31.571195 7fffd99f5700 -1 mds/journal.cc: In function 'virtual void EImportStart::replay(MDSRank*)' thread 7fffd99f5700 time 2017-10-09 13:01:31.570608
>>>> mds/journal.cc: 2949: FAILED assert(mds->sessionmap.get_version() == cmapv)
>>>>
>>>>  ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185)
>>>>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x555555b7ebc8]
>>>>  2: (EImportStart::replay(MDSRank*)+0x9ea) [0x555555a5674a]
>>>>  3: (MDLog::_replay_thread()+0xe51) [0x5555559cef21]
>>>>  4: (MDLog::ReplayThread::entry()+0xd) [0x5555557778cd]
>>>>  5: (()+0x7364) [0x7ffff7bc5364]
>>>>  6: (clone()+0x6d) [0x7ffff6051ccd]
>>>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>>>
>>>> _______________________________________________
>>>> ceph-users mailing list
>>>> ceph-users@xxxxxxxxxxxxxx
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com