Re: Down monitors after adding mds node

On Mon, Oct 3, 2016 at 6:29 AM, Adam Tygart <mozes@xxxxxxx> wrote:
> I put this in the #ceph-dev on Friday,
>
> (gdb) print info
> $7 = (const MDSMap::mds_info_t &) @0x55555fb1da68: {
>   global_id = {<boost::totally_ordered1<mds_gid_t,
> boost::totally_ordered2<mds_gid_t, unsigned long,
> boost::detail::empty_base<mds_gid_t> > >> =
> {<boost::less_than_comparable1<mds_gid_t,
> boost::equality_comparable1<mds_gid_t,
> boost::totally_ordered2<mds_gid_t, unsigned long,
> boost::detail::empty_base<mds_gid_t> > > >> =
> {<boost::equality_comparable1<mds_gid_t,
> boost::totally_ordered2<mds_gid_t, unsigned long,
> boost::detail::empty_base<mds_gid_t> > >> =
> {<boost::totally_ordered2<mds_gid_t, unsigned long,
> boost::detail::empty_base<mds_gid_t> >> =
> {<boost::less_than_comparable2<mds_gid_t, unsigned long,
> boost::equality_comparable2<mds_gid_t, unsigned long,
> boost::detail::empty_base<mds_gid_t> > >> =
> {<boost::equality_comparable2<mds_gid_t, unsigned long,
> boost::detail::empty_base<mds_gid_t> >> =
> {<boost::detail::empty_base<mds_gid_t>> = {<No data fields>}, <No data
> fields>}, <No data fields>}, <No data fields>}, <No data fields>}, <No
> data fields>}, <No data fields>}, t = 1055992652}, name = "mormo",
> rank = -1, inc = 0,
>   state = MDSMap::STATE_STANDBY, state_seq = 2, addr = {type = 0,
> nonce = 8835, {addr = {ss_family = 2, __ss_align = 0, __ss_padding =
> '\000' <repeats 111 times>}, addr4 = {sin_family = 2, sin_port =
> 36890,
>         sin_addr = {s_addr = 50398474}, sin_zero =
> "\000\000\000\000\000\000\000"}, addr6 = {sin6_family = 2, sin6_port =
> 36890, sin6_flowinfo = 50398474, sin6_addr = {__in6_u = {
>             __u6_addr8 = '\000' <repeats 15 times>, __u6_addr16 = {0,
> 0, 0, 0, 0, 0, 0, 0}, __u6_addr32 = {0, 0, 0, 0}}}, sin6_scope_id =
> 0}}}, laggy_since = {tv = {tv_sec = 0, tv_nsec = 0}},
>   standby_for_rank = 0, standby_for_name = "", standby_for_fscid =
> 328, standby_replay = true, export_targets = std::set with 0 elements,
> mds_features = 1967095022025}
> (gdb) print target_role
> $8 = {rank = 0, fscid = <optimized out>}
>
> It looks like target_role.fscid was somehow optimized out.
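> For reference, reproducing the above from the core is roughly the
> following; the core path and frame number are placeholders (use
> whatever "bt" shows for MDSMonitor::maybe_promote_standby):
>
>   $ gdb /usr/bin/ceph-mon /path/to/core
>   (gdb) bt
>   (gdb) frame <N>                # or repeat "up" until in maybe_promote_standby
>   (gdb) print target_role.rank
>   (gdb) print target_role.fscid  # this one came back <optimized out>
>   (gdb) print info.standby_for_rank
>   (gdb) print info.standby_for_fscid
>   (gdb) print info.legacy_client_fscid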

Thanks for this, let's switch discussion to the ticket (I think I know
what's wrong now).

John

>
> --
> Adam
>
> On Sun, Oct 2, 2016 at 4:26 PM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
>> On Sat, Oct 1, 2016 at 7:19 PM, Adam Tygart <mozes@xxxxxxx> wrote:
>>> The wip-fixup-mds-standby-init branch doesn't seem to allow the
>>> ceph-mons to start up correctly. I disabled all mds servers before
>>> starting the monitors up, so it would seem the pending mdsmap update
>>> is in durable storage. Now that the mds servers are down, can we clear
>>> the mdsmap of active and standby servers while initializing the mons?
>>> I would hope that, now that all the versions are in sync, a bad
>>> standby_for_fscid would not be possible with new mds servers starting.
>>
>> Looks like my first guess about the run-time initialization being
>> confused was wrong. :(
>> Given that, we're pretty befuddled. But I commented on irc:
>>
>>> if you've still got a core dump, can you go up a frame (to MDSMonitor::maybe_promote_standby) and check the values of target_role.rank and target_role.fscid, and how that compares to info.standby_for_fscid, info.legacy_client_fscid, and info.standby_for_rank?
>>
>> That might turn up something that isn't accessible in the log you
>> posted. We also can't see an osdmap or a map dump; if you could either
>> extract and print those, or get a log which includes them, that might
>> show something.
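>> Something along these lines should capture those, assuming enough mons
>> come up to form quorum; failing that, raising the debug levels on a mon
>> and restarting it should get the maps into its log:
>>
>>   ceph osd dump
>>   ceph fs dump
>>
>>   # or, in ceph.conf on a mon host, then restart that mon:
>>   [mon]
>>       debug mon = 20
>>       debug ms = 1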
>>
>> I don't think we changed the mds<->mon protocol or anything in the
>> point releases, so the different package version *shouldn't*
>> matter... right, John? ;)
>> -Greg
>>
>>>
>>> --
>>> Adam
>>>
>>> On Fri, Sep 30, 2016 at 3:49 PM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
>>>> On Fri, Sep 30, 2016 at 11:39 AM, Adam Tygart <mozes@xxxxxxx> wrote:
>>>>> Hello all,
>>>>>
>>>>> Not sure if this went through before or not, as I can't check the
>>>>> mailing list archives.
>>>>>
>>>>> I've gotten myself into a bit of a bind. I was prepping to add a new
>>>>> mds node to my ceph cluster. e.g. ceph-deploy mds create mormo
>>>>>
>>>>> Unfortunately, it started the mds server before I was ready. My
>>>>> cluster was running 10.2.1, and the newly deployed mds is 10.2.3.
>>>>>
>>>>> This caused 3 of my 5 monitors to crash. Since I immediately realized
>>>>> the mds was a newer version, I took that opportunity to upgrade my
>>>>> monitors to 10.2.3. Three of the 5 monitors continue to crash. And it
>>>>> looks like they are crashing when trying to apply a pending mdsmap
>>>>> update.
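>>>>> If a backtrace or a core would help, running one of the crashing mons
>>>>> in the foreground with debugging turned up should produce both, e.g.
>>>>> something like:
>>>>>
>>>>>   ulimit -c unlimited
>>>>>   ceph-mon -i hobbit01 -d --debug_mon 20 --debug_ms 1
>>>>>
>>>>> where -d keeps the daemon in the foreground and logs to stderr.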
>>>>>
>>>>> The log is available here:
>>>>> http://people.cis.ksu.edu/~mozes/hobbit01.mon-20160930.log.gz
>>>>>
>>>>> I have attempted (making backups, of course) to extract the monmap
>>>>> from a working monitor and insert it into a broken one. No luck, and
>>>>> the backup was restored.
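>>>>> For reference, the extract/inject was along these lines (both
>>>>> monitors stopped first, mon IDs as placeholders):
>>>>>
>>>>>   ceph-mon -i <working-id> --extract-monmap /tmp/monmap
>>>>>   ceph-mon -i <broken-id> --inject-monmap /tmp/monmap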
>>>>>
>>>>> Since I had 2 working monitors, I backed up their monitor stores,
>>>>> updated the monmaps to remove the broken ones, and tried to restart
>>>>> them. I then tried to restart the "working" ones; they failed in the
>>>>> same way. I've now restored my backups of those monitors.
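>>>>> Dropping the broken mons from the extracted monmap and putting it
>>>>> back is roughly:
>>>>>
>>>>>   monmaptool /tmp/monmap --rm <broken-id>     # repeat --rm for each broken mon
>>>>>   ceph-mon -i <working-id> --inject-monmap /tmp/monmap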
>>>>>
>>>>> I need to get these monitors back up post-haste.
>>>>>
>>>>> If you've got any ideas, I would be grateful.
>>>>
>>>> I'm not sure, but it looks like it's now too late to keep the problem
>>>> out of durable storage; if you try again, make sure you turn off the
>>>> MDS first.
>>>>
>>>> It sort of looks like you've managed to get a failed MDS with an
>>>> invalid fscid (i.e., a CephFS filesystem ID).
>>>>
>>>> ...or maybe just a terrible coding mistake. As mentioned on irc,
>>>> wip-fixup-mds-standby-init should fix it. I've created a ticket as
>>>> well: http://tracker.ceph.com/issues/17466
>>>> -Greg
>>>>
>>>>
>>>>>
>>>>> --
>>>>> Adam
>>>>> _______________________________________________
>>>>> ceph-users mailing list
>>>>> ceph-users@xxxxxxxxxxxxxx
>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


