Re: Can't get MDS running after a power outage

Webert de Souza Lima <webert.boss@xxxxxxxxx> · Thu, 29 Mar 2018 09:18:31 -0300

I'd also try starting only one MDS, not both, and wait until it's fully up and running. Sometimes they keep switching states between each other.

Regards,
Webert Lima
DevOps Engineer at MAV Tecnologia
Belo Horizonte - Brasil
IRC NICK - WebertRLZ

On Thu, Mar 29, 2018 at 7:32 AM, John Spray <jspray@xxxxxxxxxx> wrote:
On Thu, Mar 29, 2018 at 8:16 AM, Zhang Qiang <dotslash.lu@xxxxxxxxx> wrote:

> Hi,
>
> Ceph version 10.2.3. After a power outage, I tried to start the MDS
> daemons, but they got stuck replaying journals forever. I had no idea why
> they were taking that long, because this is just a small cluster for
> testing purposes with only hundreds of MB of data. I restarted them, and
> the error below was encountered.

Usually if an MDS is stuck in replay, it's because it's waiting for
the OSDs to service the reads of the journal.  Are all your PGs up and
healthy?
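
To make that question concrete, here is a minimal Python sketch of the check. It assumes you have already collected each PG's state string yourself (for example by reading the output of `ceph health detail` or `ceph pg dump`); the function name `unhealthy_pgs` and the sample data are hypothetical, and treating "active" plus "clean" as healthy is a simplification.

```python
def unhealthy_pgs(pg_states):
    """Return the PGs whose state lacks the 'active' or 'clean' flag.

    pg_states maps a PG id to its state string, e.g. "active+clean".
    How that mapping is gathered is left to the caller (assumption).
    """
    bad = {}
    for pgid, state in pg_states.items():
        flags = set(state.split("+"))
        # A PG that can serve the journal reads should be at least
        # active and clean; extra flags such as scrubbing are fine.
        if not {"active", "clean"} <= flags:
            bad[pgid] = state
    return bad


# Hypothetical sample data, not taken from the cluster in this thread.
sample = {
    "1.0": "active+clean",
    "1.1": "down+peering",
    "1.2": "active+clean+scrubbing",
}

for pgid, state in unhealthy_pgs(sample).items():
    print(pgid, state)
```

If this returns anything, the MDS may simply be waiting on those PGs, so fixing the OSD side comes first.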

>
> Any chance I can restore them?
>
> Mar 28 14:20:30 node01 systemd: Started Ceph metadata server daemon.
> Mar 28 14:20:30 node01 systemd: Starting Ceph metadata server daemon...
> Mar 28 14:20:30 node01 ceph-mds: 2018-03-28 14:20:30.796255
> 7f0150c8c180 -1 deprecation warning: MDS id 'mds.0' is invalid and
> will be forbidden in a future version.  MDS names may not start with a
> numeric digit.

If you're really using "0" as an MDS name, now would be a good time to
fix that -- most people use a hostname or something like that.  The
reason that numeric MDS names are invalid is that it makes commands
like "ceph mds fail 0" ambiguous (do we mean the name 0 or the rank
0?).
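
The ambiguity can be illustrated with a small Python sketch. This is not how Ceph itself parses the argument; `is_valid_mds_name` and `interpret_target` are hypothetical helpers that only encode the rule from the warning (names may not start with a digit) and the name-versus-rank clash described above.

```python
def is_valid_mds_name(name):
    """Apply the rule from the deprecation warning: no leading digit."""
    return bool(name) and not name[0].isdigit()


def interpret_target(target, daemon_names):
    """Say how a target such as the '0' in 'ceph mds fail 0' could be read.

    daemon_names is the set of MDS daemon names in the cluster
    (supplied by the caller; an assumption of this sketch).
    """
    looks_like_rank = target.isdigit()
    is_name = target in daemon_names
    if looks_like_rank and is_name:
        return "ambiguous"  # could be the daemon named 0 or rank 0
    if looks_like_rank:
        return "rank"
    if is_name:
        return "name"
    return "unknown"


print(is_valid_mds_name("0"))            # a purely numeric name
print(is_valid_mds_name("node01"))       # a hostname-style name
print(interpret_target("0", {"0"}))      # daemon named "0" exists
print(interpret_target("0", {"node01"}))
```

With a hostname-style name such as node01, a numeric target can only mean a rank, which is why renaming the daemon removes the problem.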

> Mar 28 14:20:30 node01 ceph-mds: starting mds.0 at :/0
> Mar 28 14:20:30 node01 ceph-mds: ./mds/MDSMap.h: In function 'const
> entity_inst_t MDSMap::get_inst(mds_rank_t)' thread 7f014ac6c700 time
> 2018-03-28 14:20:30.942480
> Mar 28 14:20:30 node01 ceph-mds: ./mds/MDSMap.h: 582: FAILED assert(up.count(m))
> Mar 28 14:20:30 node01 ceph-mds: ceph version 10.2.3
> (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
> Mar 28 14:20:30 node01 ceph-mds: 1: (ceph::__ceph_assert_fail(char
> const*, char const*, int, char const*)+0x85) [0x7f01512aba45]
> Mar 28 14:20:30 node01 ceph-mds: 2: (MDSMap::get_inst(int)+0x20f)
> [0x7f0150ee5e3f]
> Mar 28 14:20:30 node01 ceph-mds: 3:
> (MDSRankDispatcher::handle_mds_map(MMDSMap*, MDSMap*)+0x7b9)
> [0x7f0150ed6e49]

This is a weird assertion.  I can't see how it could be reached :-/

John

> Mar 28 14:20:30 node01 ceph-mds: 4:
> (MDSDaemon::handle_mds_map(MMDSMap*)+0xe3d) [0x7f0150eb396d]
> Mar 28 14:20:30 node01 ceph-mds: 5:
> (MDSDaemon::handle_core_message(Message*)+0x7b3) [0x7f0150eb4eb3]
> Mar 28 14:20:30 node01 ceph-mds: 6:
> (MDSDaemon::ms_dispatch(Message*)+0xdb) [0x7f0150eb514b]
> Mar 28 14:20:30 node01 ceph-mds: 7: (DispatchQueue::entry()+0x78a)
> [0x7f01513ad4aa]
> Mar 28 14:20:30 node01 ceph-mds: 8:
> (DispatchQueue::DispatchThread::entry()+0xd) [0x7f015129098d]
> Mar 28 14:20:30 node01 ceph-mds: 9: (()+0x7dc5) [0x7f0150095dc5]
> Mar 28 14:20:30 node01 ceph-mds: 10: (clone()+0x6d) [0x7f014eb61ced]
> Mar 28 14:20:30 node01 ceph-mds: NOTE: a copy of the executable, or
> `objdump -rdS <executable>` is needed to interpret this.

> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
