Re: Mismatching nonce for 'ceph osd.0 tell'

Willem Jan Withagen <wjw@xxxxxxxxxxx> · Tue, 13 Dec 2016 02:34:03 +0100

On 13-12-2016 02:02, Willem Jan Withagen wrote:
> On 10-12-2016 01:39, Willem Jan Withagen wrote:
>> On 9-12-2016 13:02, Willem Jan Withagen wrote:
>>> On 9-12-2016 10:22, Willem Jan Withagen wrote:
>>>> On 9-12-2016 09:59, kefu chai wrote:
>>>>> On Thu, Dec 8, 2016 at 8:30 PM, Willem Jan Withagen <wjw@xxxxxxxxxxx> wrote:
>>>>>> On 8-12-2016 11:03, kefu chai wrote:
>>>>>>> On Tue, Oct 4, 2016 at 7:57 PM, Willem Jan Withagen <wjw@xxxxxxxxxxx> wrote:
>>>>>>>> On 3-10-2016 19:50, Gregory Farnum wrote:
>>>>>>>>>> Question here is:
>>>>>>>>>>   If I ask 'ceph osd dump', I'm actually asking ceph-mon.
>>>>>>>>>>   And cehp-mon has learned this from (crush?)maps being sent to it by
>>>>>>>>>>   ceph-osd.
>>>>>>>>> The monitor has learned about specific IP addresses/nonces/etc via
>>>>>>>>> MOSDBoot messages from the OSDs. The crush locations are set via
>>>>>>>>> monitor command messages, generally invoked as part of the init
>>>>>>>>> scripts. Maps are generated entirely on the monitor. :)
>>>>>>>>>
>>>>>>>>>> Is there an easy way to debug/monitor the content of what ceph-osd sends
>>>>>>>>>> and ceph-mon receives in the maps?
>>>>>>>>>> Just to make sure that it is clear where the problem occurs.
>>>>>>>>> You should be able to see the info going in and out by bumping the
>>>>>>>>> debug levels up — every message's "print" function is invoked when
>>>>>>>>> it's sent/received as long as you have "debug ms = 1". It looks like
>>>>>>>>> the MOSDBoot message doesn't natively dump its addresses but you can
>>>>>>>>> add them easily if you need to.
>>>>>>>> Hi Greg,
>>>>>>>>
>>>>>>>> Thanx for the answer....
>>>>>>>>
>>>>>>>> I've got debug_ms already pumped up all the way to 20.
>>>>>>>> So I do get to see what addresses are selected during bind. But still
>>>>>>>> they do not end up at the MON, and 'ceph osd dump' reports:
>>>>>>>>         :/0
>>>>>>>> as bind address.
>>>>>>>>
>>>>>>>> I'm going to add some more debugs to actually see what MOSDBoot is doing....
>>>>>>> there are multiple messengers used by ceph-osd, the one connected by
>>>>>>> rados client is the external/public messenger. it is also used by osd
>>>>>>> to talk with the monitor.
>>>>>>>
>>>>>>> the nonce of the external address of an OSD does not change after it's
>>>>>>> up: it's always the pid of ceph-osd process. and the (peer) address of
>>>>>>> the booting OSD collected by monitor comes from the connection's
>>>>>>> peer_addr field, which is set when the monitor accepts the connection
>>>>>>> from OSD. see STATE_ACCEPTING_WAIT_BANNER_ADDR case block in
>>>>>>> AsyncConnection::_process_connection().
>>>>>>>
>>>>>>> but there are chances that an OSD is restarted and fail to bind its
>>>>>>> external messenger to the specified the port. in that case, ceph-osd
>>>>>>> will try with another port, but keep the nonce the same. but when it
>>>>>>> comes to other messengers used by ceph-osd, their nonces increase by
>>>>>>> 1000000 every time they rebind. that's why "ceph osd thrash" can
>>>>>>> change the nonces of the cluster_addr, heartbeat_back_addr and
>>>>>>> heartbeat_front_addr. the PR of
>>>>>>> https://github.com/ceph/ceph/pull/11706 actually changes the behavior
>>>>>>> of the messengers of these three messengers. and it has nothing to do
>>>>>>> with the external messenger to which the ceph cli client is
>>>>>>> connecting.
>>>>>>>
>>>>>>> so you might want to check
>>>>>>> 1) how/why the nonce of the messenger in MonClient is 1000000 + $pid
>>>>>>> 2) while the nonce of the same messenger is $pid when the ceph cli
>>>>>>> connects to it.
>>>>>>>
>>>>>>> my PR of https://github.com/ceph/ceph/pull/11804 is more of a cleanup.
>>>>>>> it avoids setting the nonce before the rebind finishes. and i tried
>>>>>>> with your producer on my linux box, no luck =(
>>>>>> Right,
>>>>>>
>>>>>> You gave me a lot of things to think about, and to start figuring out.
>>>>>>
>>>>>> And you are right that something really bad needs to happen to an OSD to
>>>>>> get in this state. But that is what the tests actually do: They just
>>>>>> down/up or kill OSDs and restart.
>>>>>>
>>>>>> And from previous discussions I "learned" that if the process doesn't
>>>>>> die but needs to rebind on the port, the OSD stays at the same port but
>>>>>> increments the nonce to indicate that it is a fresh connection. And log
>>>>> the external messenger should *not* increment its nonce.
>>>>>
>>>>>> printing actually shows that the code is going thru a rebind.
>>>>> and it should *not* go through rebind().
>>>> I have to dig thru the testscript but as far as I can tell just about
>>>> all of the daemons are getting reboots in this test.
>>>>
>>>> So when would I get a rebind?
>>>>
>>>> I thought it was because I had an OSD incorrectly marked down:
>>>> ./src/osd/OSD.cc:7074:                 << " wrongly marked me down";
>>>> This I found in the logs, and then I got a rebind.
>>>>
>>>> Wido suggested looking for this message, on my question why my OSDs were
>>>> not getting UP after a good hustle with all OSDs and MONs.
>>>>
>>>> And that is one of the tests in cephtool-test-mon.sh.
>>>> right before the 'ceph tell osd.0 version' there are tests like:
>>>>   ceph osd set noup
>>>>   ceph osd down 0
>>>>   ceph osd dump | grep 'osd.0 down'
>>>>   ceph osd unset noup
>>>> and
>>>>   ceph osd reweight osd.0 .5
>>>>   ceph osd dump | grep ^osd.0 | grep 'weight 0.5'
>>>>   ceph osd out 0
>>>>   ceph osd in 0
>>>>   ceph osd dump | grep ^osd.0 | grep 'weight 0.5'
>>>>
>>>>
>>>>>> Now the bad thing is that the Linux and FreeBSD log do comparable things
>>>>>> with my (small) change to the setting of addr. And the nonce is indeed
>>>>>> incremented, which increment is actually picked up by all ceph components.
>>>> So now I have 2 challenges??
>>>>
>>>> 1) Find out why I get a rebind, where you think I should not.
>>>>    For that I'll have to collect all maltreatment that is done in
>>>>    cephtool-test-mon.sh. And again compare the Linux and FreeBSD logs
>>>>    to see what is up.
>>>> 2) If we get a rebind...
>>>>    Why doesn't the FreeBSD version end up with consistent noncees.
>>>>
>>>> "Good thing" about the previous code was that I could tweak it, and at
>>>> least get it to Work for FreeBSD. Have not had the time to see if I
>>>> could again with this code....
>>> So the smallest sequence I can find that demonstrates the problem:
>>> function test_mon_rebind()
>>> {
>>>   ceph osd set noup
>>>   ceph osd down 0
>>>   ceph osd dump | grep 'osd.0 down'
>>>   ceph osd unset noup
>>>   max_run=1000
>>>   for ((i=0; i < $max_run; i++)); do
>>>     if ! ceph osd dump | grep 'osd.0 up'; then
>>>       echo "waiting for osd.0 to come back up ($i/$max_run)"
>>>       sleep 1
>>>     else
>>>       break
>>>     fi
>>>   done
>>>   ceph osd dump | grep 'osd.0 up'
>>>
>>>   for id in `ceph osd ls` ; do
>>>     retry_eagain 5 map_enxio_to_eagain ceph tell osd.$id version
>>>   done
>>> }
>>>
>>> Which matches with what I thought I knew:
>>>   OSD down => up => rebind
>>> which follows from the log where the osd complains about being marked
>>> down incorrectly.
>>> search for
>>>    log_channel(cluster) log [WRN] : map e8 wrongly marked me down
>>> in the osd.0.log
>>>
>>
>> I ran the code snippet, and it DOES generate a rebind even on Linux
>> Centos 7.
>> So I expect that the rebind action is called for?
>>
>> Also note that at least one nonce is incremented, AND new ports are
>> selected.
>> 6801,6802,6803 are avoided, and it looks like already connected ports
>> are avoided as well.
>> (obvious, since binding on them will not work)
>>
>> Next is the FreeBSD log, and that one is relatively seriously wrong
>> since nonce are 0 .
>> And that must be incorrect.
>>
>> Last is the log of the current master code with the patch I made of
>> msg->my_addr, versus addr.
>> IMHO a change that should be semantical equal?
>>   return bind(msgr->my_inst.addr, new_avoid);
>>   // return bind(addr, new_avoid);
>> One thing that is not working there is the port avoidance. That I have
>> to look into since there is no reason
>> why this should not work under FreeBSD.
> 
> I have tested with the new code Kefu submitted to master this afternoon:
> The Linux variant gives:
> 7f36b1dd1700  1 -- 127.0.0.1:6800/13799 <== mon.2 127.0.0.1:8204/0 38
> ==== osd_map(18..21 src has 1..21) v3 ==== 1012+0+0 (424389225 0 0)
> 0x7f36cdcc7cc0 con 0x7f36cdb24000
> 7f36b5ac6700  0 log_channel(cluster) log [WRN] : map e21 wrongly marked
> me down
> 7f36b5ac6700  1 -- 127.0.0.1:6801/13799 rebind rebind avoid 6801,6802,6803
> 7f36b5ac6700  1 -- 127.0.0.1:6801/13799 shutdown_connections
> 7f36b5ac6700  1 -- 127.0.0.1:6812/1013799 _finish_bind bind my_inst.addr
> is 127.0.0.1:6812/1013799
> 7f36b5ac6700  1  Processor -- start
> 
> Note that the new allocated channel is on port 6812 with an incremented
> nonce, and the channel on port 6800 is not touched. Probably because it
> is still open, and it is not the channel being rebound.
> 
> Now the FreeBSD version:
> bb99f80  1 -- 127.0.0.1:6800/6449 <== mon.1 127.0.0.1:7203/0 158 ====
> osd_map(172..175 src has 1..175) v3 ==== 1045+0+0 (1598708051 0 0)
> 0xc08eb40 con 0xbbca000
> bb98d80  0 log_channel(cluster) log [WRN] : map e175 wrongly marked me down
> bb98d80  1 -- 127.0.0.1:6801/6449 rebind rebind avoid 6801,6802,6803
> bb98d80  1 -- 127.0.0.1:6801/6449 shutdown_connections
> bb98d80  1 -- 127.0.0.1:6800/1006449 _finish_bind bind my_inst.addr is
> 127.0.0.1:6800/1006449
> bb98d80  1  Processor -- start
> 
> And here the new channel is allocated on port 6800, as if that would be
> available. Should it also not be marked as such in avoid_ports??
> But that port is still being used as is indicated in the first line.
> FreeBSD does not seem to have a problem binding to that port ??
> And thus a channel is listening on 6800, but with a new nonce.
> So that could be the reason that ceph and rados are complaining that
> they cannot talk to osd.0
> 
> So why is 6800 not in the list for channels to avoid?

I patched osd/OSD.cc: (added to fit with 80 chars)

--- a/src/osd/OSD.cc
+++ b/src/osd/OSD.cc
@@ -7126,6 +7126,7 @@ void OSD::_committed_osd_maps(epoch_t first,
epoch_t last, MOSDMap *m)
        start_waiting_for_healthy();

        set<int> avoid_ports;
+        avoid_ports.insert(client_messenger->get_myaddr().get_port());
        avoid_ports.insert(cluster_messenger->get_myaddr().get_port());
 avoid_ports.insert(hb_back_server_messenger->get_myaddr().get_port());
 avoid_ports.insert(hb_front_server_messenger->get_myaddr().get_port());

And just adding this extra port to avoid gets things done.

I guess that there are enough remarks to be made on this patch??

--WjW

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html