Re: assertion error trying to start mds server

On Fri, Oct 13, 2017 at 4:13 AM, Bill Sharer <bsharer@xxxxxxxxxxxxxx> wrote:
> After your comment about the dual mds servers, I decided to just give up
> trying to get the second one restarted.  After eyeballing what I had on one
> of the new Ryzen boxes for drive space, I decided to just dump the
> filesystem.  That will also make things go faster if and when I flip
> everything over to bluestore.  So far so good...  I just took a peek and
> saw that the files are owned by Mr root, though.  Is there going to be an
> ownership reset at some point, or will I have to resolve that by hand?

You'll have to do that by hand, I'm afraid -- the tool only works with
what it finds in the data pool, which is just the path and the layout.
The rest of the metadata is all lost; that also includes any xattrs,
ACLs, etc.
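
Something like the following should sort the ownership out, assuming the
recovered tree ended up under /tmp/my_output and that bill:users stands in
for whichever owner and group you actually want (both are just placeholders):

  # hand the recovered tree back to its rightful owner
  chown -R bill:users /tmp/my_output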

John

>
> On 10/12/2017 06:09 AM, John Spray wrote:
>> On Thu, Oct 12, 2017 at 12:23 AM, Bill Sharer <bsharer@xxxxxxxxxxxxxx> wrote:
>>> I was wondering, if I can't get the second mds back up, whether that
>>> offline backward scrub check could also salvage what it can from the two
>>> pools to a normal filesystem.  Is there an option for that, or has someone
>>> written some form of salvage tool?
>> Yep, cephfs-data-scan can do that.
>>
>> To scrape the files out of a CephFS data pool to a local filesystem, do this:
>>
>>   cephfs-data-scan scan_extents <data pool name>   # discovers all the file sizes
>>   cephfs-data-scan scan_inodes --output-dir /tmp/my_output <data pool name>
>>
>> The time taken by both these commands scales linearly with the number
>> of objects in your data pool.
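>>
>> If that's going to take too long, the scans can (if memory serves) be split
>> across parallel workers with --worker_n/--worker_m, e.g. with four workers
>> you would run four copies of each command, one per worker index:
>>
>>   cephfs-data-scan scan_extents --worker_n 0 --worker_m 4 <data pool name>
>>   cephfs-data-scan scan_extents --worker_n 1 --worker_m 4 <data pool name>
>>   # ...and likewise for workers 2 and 3, then the same again for scan_inodes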
>>
>> This tool may not see the correct filename for recently created files
>> (any file whose metadata is in the journal but not yet flushed); those
>> files will go into a lost+found directory, named after their inode
>> number.
>>
>> John
>>
>>> On 10/11/2017 07:07 AM, John Spray wrote:
>>>> On Wed, Oct 11, 2017 at 1:42 AM, Bill Sharer <bsharer@xxxxxxxxxxxxxx> wrote:
>>>>> I've been in the process of updating my gentoo-based cluster, both with
>>>>> new hardware and a somewhat postponed software update.  This involves
>>>>> some major changes, including the switch from gcc 4.x to 5.4.0 on the
>>>>> existing hardware and using gcc 6.4.0 to make better use of AMD Ryzen on
>>>>> the new hardware.  The existing cluster was on 10.2.2, but I was going to
>>>>> 10.2.7-r1 as an interim step before moving on to 12.2.0 to begin
>>>>> transitioning to bluestore on the OSDs.
>>>>>
>>>>> The Ryzen units are slated to be bluestore based OSD servers if and when
>>>>> I get to that point.  Up until the mds failure, they were simply cephfs
>>>>> clients.  I had three OSD servers updated to 10.2.7-r1 (one is also a
>>>>> MON) and had two servers left to update.  Both of these are also MONs
>>>>> and were acting as a pair of dual active MDS servers running 10.2.2.
>>>>> Monday morning I found out the hard way that the UPS one of them was on
>>>>> has a dead battery.  After I fsck'd and the box came back up, I saw the
>>>>> following assertion error when it was trying to start its mds.B server:
>>>>>
>>>>>
>>>>> ==== mdsbeacon(64162/B up:replay seq 3 v4699) v7 ==== 126+0+0 (709014160
>>>>> 0 0) 0x7f6fb4001bc0 con 0x55f94779d8d0
>>>>>      0> 2017-10-09 11:43:06.935662 7f6fa9ffb700 -1 mds/journal.cc: In
>>>>> function 'virtual void EImportStart::replay(MDSRank*)' thread
>>>>> 7f6fa9ffb700 time 2017-10-09 11:43:06.934972
>>>>> mds/journal.cc: 2929: FAILED assert(mds->sessionmap.get_version() == cmapv)
>>>>>
>>>>>  ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
>>>>>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>>>>> const*)+0x82) [0x55f93d64a122]
>>>>>  2: (EImportStart::replay(MDSRank*)+0x9ce) [0x55f93d52a5ce]
>>>>>  3: (MDLog::_replay_thread()+0x4f4) [0x55f93d4a8e34]
>>>>>  4: (MDLog::ReplayThread::entry()+0xd) [0x55f93d25bd4d]
>>>>>  5: (()+0x74a4) [0x7f6fd009b4a4]
>>>>>  6: (clone()+0x6d) [0x7f6fce5a598d]
>>>>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
>>>>> needed to interpret this.
>>>>>
>>>>> --- logging levels ---
>>>>>    0/ 5 none
>>>>>    0/ 1 lockdep
>>>>>    0/ 1 context
>>>>>    1/ 1 crush
>>>>>    1/ 5 mds
>>>>>    1/ 5 mds_balancer
>>>>>    1/ 5 mds_locker
>>>>>    1/ 5 mds_log
>>>>>    1/ 5 mds_log_expire
>>>>>    1/ 5 mds_migrator
>>>>>    0/ 1 buffer
>>>>>    0/ 1 timer
>>>>>    0/ 1 filer
>>>>>    0/ 1 striper
>>>>>    0/ 1 objecter
>>>>>    0/ 5 rados
>>>>>    0/ 5 rbd
>>>>>    0/ 5 rbd_mirror
>>>>>    0/ 5 rbd_replay
>>>>>    0/ 5 journaler
>>>>>    0/ 5 objectcacher
>>>>>    0/ 5 client
>>>>>    0/ 5 osd
>>>>>    0/ 5 optracker
>>>>>    0/ 5 objclass
>>>>>    1/ 3 filestore
>>>>>    1/ 3 journal
>>>>>    0/ 5 ms
>>>>>    1/ 5 mon
>>>>>    0/10 monc
>>>>>    1/ 5 paxos
>>>>>    0/ 5 tp
>>>>>    1/ 5 auth
>>>>>    1/ 5 crypto
>>>>>    1/ 1 finisher
>>>>>    1/ 5 heartbeatmap
>>>>>    1/ 5 perfcounter
>>>>>    1/ 5 rgw
>>>>>    1/10 civetweb
>>>>>    1/ 5 javaclient
>>>>>    1/ 5 asok
>>>>>    1/ 1 throttle
>>>>>    0/ 0 refs
>>>>>    1/ 5 xio
>>>>>    1/ 5 compressor
>>>>>    1/ 5 newstore
>>>>>    1/ 5 bluestore
>>>>>    1/ 5 bluefs
>>>>>    1/ 3 bdev
>>>>>    1/ 5 kstore
>>>>>    4/ 5 rocksdb
>>>>>    4/ 5 leveldb
>>>>>    1/ 5 kinetic
>>>>>    1/ 5 fuse
>>>>>   -2/-2 (syslog threshold)
>>>>>   -1/-1 (stderr threshold)
>>>>>   max_recent     10000
>>>>>   max_new         1000
>>>>>   log_file /var/log/ceph/ceph-mds.B.log
>>>>>
>>>>>
>>>>>
>>>>> When I was googling around, I ran into this CERN presentation and tried
>>>>> out the offline backward scrubbing commands on slide 25 first:
>>>>>
>>>>> https://indico.cern.ch/event/531810/contributions/2309925/attachments/1357386/2053998/GoncaloBorges-HEPIX16-v3.pdf
>>>>>
>>>>>
>>>>> Both ran without any messages, so I'm assuming I have sane contents in
>>>>> the cephfs_data and cephfs_metadata pools.  Still no luck getting things
>>>>> restarted, so I tried the cephfs-journal-tool journal reset on slide
>>>>> 23.  That didn't work either.  Just for giggles, I tried setting up the
>>>>> two Ryzen boxes as new mds.C and mds.D servers, which would run on
>>>>> 10.2.7-r1 instead of mds.A and mds.B (10.2.2).  The mds.D server fails
>>>>> with the same assert, as follows:
>>>> Because this system was running multiple active MDSs on Jewel (based
>>>> on seeing an EImportStart journal entry), and that was known to be
>>>> unstable, I would advise you to blow away the filesystem and create a
>>>> fresh one using luminous (where multi-mds is stable), rather than
>>>> trying to debug it.  Going back to try and work out what went wrong
>>>> with Jewel code is probably not a very valuable activity unless you
>>>> have irreplaceable data.
>>>>
>>>> If you do want to get this filesystem back on its feet in place
>>>> (first stopping all MDSs): I'm guessing that your cephfs-journal-tool
>>>> reset didn't help because you had multiple MDS ranks, and that tool
>>>> just operates on rank 0 by default.  You need to work out which rank's
>>>> journal is actually damaged (the rank is part of the prefix on MDS log
>>>> messages), and then pass a --rank argument to cephfs-journal-tool.
>>>> You will also need to reset all the other ranks' journals to keep
>>>> things consistent, and then do a "ceph fs reset" so that it will start
>>>> up with a single MDS next time.  If you get the filesystem up and
>>>> running again, I'd still recommend copying anything important off it
>>>> and creating a new one using luminous, rather than continuing to run
>>>> with maybe-still-subtly-damaged metadata.
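>>>>
>>>> As a rough sketch -- assuming two ranks and a filesystem named "cephfs",
>>>> and noting that the exact --rank syntax can differ between releases --
>>>> with all MDS daemons stopped:
>>>>
>>>>   cephfs-journal-tool --rank=0 journal inspect   # find the damaged rank(s)
>>>>   cephfs-journal-tool --rank=1 journal inspect
>>>>   cephfs-journal-tool --rank=0 journal reset     # reset every rank's journal
>>>>   cephfs-journal-tool --rank=1 journal reset
>>>>   ceph fs reset cephfs --yes-i-really-mean-it    # back to a single active MDS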
>>>>
>>>> John
>>>>
>>>>> === 132+0+1979520 (4198351460 0 1611007530) 0x7fffc4000a70 con
>>>>> 0x7fffe0013310
>>>>>      0> 2017-10-09 13:01:31.571195 7fffd99f5700 -1 mds/journal.cc: In
>>>>> function 'virtual void EImportStart::replay(MDSRank*)' thread
>>>>> 7fffd99f5700 time 2017-10-09 13:01:31.570608
>>>>> mds/journal.cc: 2949: FAILED assert(mds->sessionmap.get_version() == cmapv)
>>>>>  ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185)
>>>>>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>>>>> const*)+0x80) [0x555555b7ebc8]
>>>>>  2: (EImportStart::replay(MDSRank*)+0x9ea) [0x555555a5674a]
>>>>>  3: (MDLog::_replay_thread()+0xe51) [0x5555559cef21]
>>>>>  4: (MDLog::ReplayThread::entry()+0xd) [0x5555557778cd]
>>>>>  5: (()+0x7364) [0x7ffff7bc5364]
>>>>>  6: (clone()+0x6d) [0x7ffff6051ccd]
>>>>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
>>>>> needed to interpret this.
>>>>>
>>>>>
>>>
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


