Re: Can't get MDS running after a power outage

I'd also try starting only one MDS and waiting until it's fully up and running before starting the other, rather than both at once.
Sometimes they keep switching states between each other.


Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
Belo Horizonte - Brasil
IRC NICK - WebertRLZ

On Thu, Mar 29, 2018 at 7:32 AM, John Spray <jspray@xxxxxxxxxx> wrote:
On Thu, Mar 29, 2018 at 8:16 AM, Zhang Qiang <dotslash.lu@xxxxxxxxx> wrote:
> Hi,
>
> Ceph version 10.2.3. After a power outage, I tried to start the MDS
> daemons, but they got stuck replaying journals forever. I had no idea why
> they were taking that long, because this is just a small test cluster
> with only a few hundred MB of data. I restarted them, and
> encountered the error below.

Usually if an MDS is stuck in replay, it's because it's waiting for
the OSDs to service the reads of the journal.  Are all your PGs up and
healthy?
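
As a rough illustration of that check: a PG has to be active before it can serve I/O, so any PG state without "active" in it is a candidate for what the MDS is waiting on. The sketch below assumes you have already copied the per-state PG counts (for example from the summary that `ceph pg stat` prints) into a dict by hand; the function name and the dict shape are illustrative, not part of any Ceph API.

```python
def inactive_pg_states(state_counts):
    """Return the PG states (with counts) that do not include 'active'.

    state_counts maps a PG state string such as 'active+clean' to the
    number of PGs in that state. This shape is an assumption made for
    the example, not the output format of a ceph command.
    """
    return {
        state: count
        for state, count in state_counts.items()
        if count > 0 and "active" not in state.split("+")
    }


# Hypothetical counts, not taken from the cluster in this thread.
example = {"active+clean": 120, "down+peering": 8}
print(inactive_pg_states(example))
```

An empty result means every PG is at least active; it does not by itself mean the cluster is fully healthy (degraded or inconsistent PGs would still pass).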

>
> Any chance I can restore them?
>
> Mar 28 14:20:30 node01 systemd: Started Ceph metadata server daemon.
> Mar 28 14:20:30 node01 systemd: Starting Ceph metadata server daemon...
> Mar 28 14:20:30 node01 ceph-mds: 2018-03-28 14:20:30.796255
> 7f0150c8c180 -1 deprecation warning: MDS id 'mds.0' is invalid and
> will be forbidden in a future version.  MDS names may not start with a
> numeric digit.

If you're really using "0" as an MDS name, now would be a good time to
fix that -- most people use a hostname or something like that.  The
reason that numeric MDS names are invalid is that it makes commands
like "ceph mds fail 0" ambiguous (do we mean the name 0 or the rank
0?).
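
The rule in the deprecation warning is simple enough to check before picking a new name. This is only a sketch of the one rule quoted in the log above (names may not start with a numeric digit); the function name is made up for the example, and Ceph may enforce further restrictions that are not covered here.

```python
def is_valid_mds_name(name):
    """True if name is non-empty and does not start with a numeric digit.

    Only covers the rule quoted in the deprecation warning; this is not
    Ceph's own validation code.
    """
    return bool(name) and name[0] not in "0123456789"


for candidate in ("0", "node01"):
    print(candidate, is_valid_mds_name(candidate))
```

A name like "0" fails the check, while a hostname-style name such as "node01" passes, so "ceph mds fail 0" can then only refer to the rank.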

> Mar 28 14:20:30 node01 ceph-mds: starting mds.0 at :/0
> Mar 28 14:20:30 node01 ceph-mds: ./mds/MDSMap.h: In function 'const
> entity_inst_t MDSMap::get_inst(mds_rank_t)' thread 7f014ac6c700 time
> 2018-03-28 14:20:30.942480
> Mar 28 14:20:30 node01 ceph-mds: ./mds/MDSMap.h: 582: FAILED assert(up.count(m))
> Mar 28 14:20:30 node01 ceph-mds: ceph version 10.2.3
> (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
> Mar 28 14:20:30 node01 ceph-mds: 1: (ceph::__ceph_assert_fail(char
> const*, char const*, int, char const*)+0x85) [0x7f01512aba45]
> Mar 28 14:20:30 node01 ceph-mds: 2: (MDSMap::get_inst(int)+0x20f)
> [0x7f0150ee5e3f]
> Mar 28 14:20:30 node01 ceph-mds: 3:
> (MDSRankDispatcher::handle_mds_map(MMDSMap*, MDSMap*)+0x7b9)
> [0x7f0150ed6e49]

This is a weird assertion.  I can't see how it could be reached :-/

John

> Mar 28 14:20:30 node01 ceph-mds: 4:
> (MDSDaemon::handle_mds_map(MMDSMap*)+0xe3d) [0x7f0150eb396d]
> Mar 28 14:20:30 node01 ceph-mds: 5:
> (MDSDaemon::handle_core_message(Message*)+0x7b3) [0x7f0150eb4eb3]
> Mar 28 14:20:30 node01 ceph-mds: 6:
> (MDSDaemon::ms_dispatch(Message*)+0xdb) [0x7f0150eb514b]
> Mar 28 14:20:30 node01 ceph-mds: 7: (DispatchQueue::entry()+0x78a)
> [0x7f01513ad4aa]
> Mar 28 14:20:30 node01 ceph-mds: 8:
> (DispatchQueue::DispatchThread::entry()+0xd) [0x7f015129098d]
> Mar 28 14:20:30 node01 ceph-mds: 9: (()+0x7dc5) [0x7f0150095dc5]
> Mar 28 14:20:30 node01 ceph-mds: 10: (clone()+0x6d) [0x7f014eb61ced]
> Mar 28 14:20:30 node01 ceph-mds: NOTE: a copy of the executable, or
> `objdump -rdS <executable>` is needed to interpret this.
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com