Re: Mismatching nonce for 'ceph osd.0 tell'

Willem Jan Withagen <wjw@xxxxxxxxxxx> · Tue, 13 Dec 2016 02:02:12 +0100

On 10-12-2016 01:39, Willem Jan Withagen wrote:
> On 9-12-2016 13:02, Willem Jan Withagen wrote:
>> On 9-12-2016 10:22, Willem Jan Withagen wrote:
>>> On 9-12-2016 09:59, kefu chai wrote:
>>>> On Thu, Dec 8, 2016 at 8:30 PM, Willem Jan Withagen <wjw@xxxxxxxxxxx> wrote:
>>>>> On 8-12-2016 11:03, kefu chai wrote:
>>>>>> On Tue, Oct 4, 2016 at 7:57 PM, Willem Jan Withagen <wjw@xxxxxxxxxxx> wrote:
>>>>>>> On 3-10-2016 19:50, Gregory Farnum wrote:
>>>>>>>>> Question here is:
>>>>>>>>>   If I ask 'ceph osd dump', I'm actually asking ceph-mon.
>>>>>>>>>   And cehp-mon has learned this from (crush?)maps being sent to it by
>>>>>>>>>   ceph-osd.
>>>>>>>> The monitor has learned about specific IP addresses/nonces/etc via
>>>>>>>> MOSDBoot messages from the OSDs. The crush locations are set via
>>>>>>>> monitor command messages, generally invoked as part of the init
>>>>>>>> scripts. Maps are generated entirely on the monitor. :)
>>>>>>>>
>>>>>>>>> Is there an easy way to debug/monitor the content of what ceph-osd sends
>>>>>>>>> and ceph-mon receives in the maps?
>>>>>>>>> Just to make sure that it is clear where the problem occurs.
>>>>>>>> You should be able to see the info going in and out by bumping the
>>>>>>>> debug levels up — every message's "print" function is invoked when
>>>>>>>> it's sent/received as long as you have "debug ms = 1". It looks like
>>>>>>>> the MOSDBoot message doesn't natively dump its addresses but you can
>>>>>>>> add them easily if you need to.
>>>>>>> Hi Greg,
>>>>>>>
>>>>>>> Thanx for the answer....
>>>>>>>
>>>>>>> I've got debug_ms already pumped up all the way to 20.
>>>>>>> So I do get to see what addresses are selected during bind. But still
>>>>>>> they do not end up at the MON, and 'ceph osd dump' reports:
>>>>>>>         :/0
>>>>>>> as bind address.
>>>>>>>
>>>>>>> I'm going to add some more debugs to actually see what MOSDBoot is doing....
>>>>>> there are multiple messengers used by ceph-osd, the one connected by
>>>>>> rados client is the external/public messenger. it is also used by osd
>>>>>> to talk with the monitor.
>>>>>>
>>>>>> the nonce of the external address of an OSD does not change after it's
>>>>>> up: it's always the pid of ceph-osd process. and the (peer) address of
>>>>>> the booting OSD collected by monitor comes from the connection's
>>>>>> peer_addr field, which is set when the monitor accepts the connection
>>>>>> from OSD. see STATE_ACCEPTING_WAIT_BANNER_ADDR case block in
>>>>>> AsyncConnection::_process_connection().
>>>>>>
>>>>>> but there are chances that an OSD is restarted and fail to bind its
>>>>>> external messenger to the specified the port. in that case, ceph-osd
>>>>>> will try with another port, but keep the nonce the same. but when it
>>>>>> comes to other messengers used by ceph-osd, their nonces increase by
>>>>>> 1000000 every time they rebind. that's why "ceph osd thrash" can
>>>>>> change the nonces of the cluster_addr, heartbeat_back_addr and
>>>>>> heartbeat_front_addr. the PR of
>>>>>> https://github.com/ceph/ceph/pull/11706 actually changes the behavior
>>>>>> of the messengers of these three messengers. and it has nothing to do
>>>>>> with the external messenger to which the ceph cli client is
>>>>>> connecting.
>>>>>>
>>>>>> so you might want to check
>>>>>> 1) how/why the nonce of the messenger in MonClient is 1000000 + $pid
>>>>>> 2) while the nonce of the same messenger is $pid when the ceph cli
>>>>>> connects to it.
>>>>>>
>>>>>> my PR of https://github.com/ceph/ceph/pull/11804 is more of a cleanup.
>>>>>> it avoids setting the nonce before the rebind finishes. and i tried
>>>>>> with your producer on my linux box, no luck =(
>>>>> Right,
>>>>>
>>>>> You gave me a lot of things to think about, and to start figuring out.
>>>>>
>>>>> And you are right that something really bad needs to happen to an OSD to
>>>>> get in this state. But that is what the tests actually do: They just
>>>>> down/up or kill OSDs and restart.
>>>>>
>>>>> And from previous discussions I "learned" that if the process doesn't
>>>>> die but needs to rebind on the port, the OSD stays at the same port but
>>>>> increments the nonce to indicate that it is a fresh connection. And log
>>>> the external messenger should *not* increment its nonce.
>>>>
>>>>> printing actually shows that the code is going thru a rebind.
>>>> and it should *not* go through rebind().
>>> I have to dig thru the testscript but as far as I can tell just about
>>> all of the daemons are getting reboots in this test.
>>>
>>> So when would I get a rebind?
>>>
>>> I thought it was because I had an OSD incorrectly marked down:
>>> ./src/osd/OSD.cc:7074:                 << " wrongly marked me down";
>>> This I found in the logs, and then I got a rebind.
>>>
>>> Wido suggested looking for this message, on my question why my OSDs were
>>> not getting UP after a good hustle with all OSDs and MONs.
>>>
>>> And that is one of the tests in cephtool-test-mon.sh.
>>> right before the 'ceph tell osd.0 version' there are tests like:
>>>   ceph osd set noup
>>>   ceph osd down 0
>>>   ceph osd dump | grep 'osd.0 down'
>>>   ceph osd unset noup
>>> and
>>>   ceph osd reweight osd.0 .5
>>>   ceph osd dump | grep ^osd.0 | grep 'weight 0.5'
>>>   ceph osd out 0
>>>   ceph osd in 0
>>>   ceph osd dump | grep ^osd.0 | grep 'weight 0.5'
>>>
>>>
>>>>> Now the bad thing is that the Linux and FreeBSD log do comparable things
>>>>> with my (small) change to the setting of addr. And the nonce is indeed
>>>>> incremented, which increment is actually picked up by all ceph components.
>>> So now I have 2 challenges??
>>>
>>> 1) Find out why I get a rebind, where you think I should not.
>>>    For that I'll have to collect all maltreatment that is done in
>>>    cephtool-test-mon.sh. And again compare the Linux and FreeBSD logs
>>>    to see what is up.
>>> 2) If we get a rebind...
>>>    Why doesn't the FreeBSD version end up with consistent noncees.
>>>
>>> "Good thing" about the previous code was that I could tweak it, and at
>>> least get it to Work for FreeBSD. Have not had the time to see if I
>>> could again with this code....
>> So the smallest sequence I can find that demonstrates the problem:
>> function test_mon_rebind()
>> {
>>   ceph osd set noup
>>   ceph osd down 0
>>   ceph osd dump | grep 'osd.0 down'
>>   ceph osd unset noup
>>   max_run=1000
>>   for ((i=0; i < $max_run; i++)); do
>>     if ! ceph osd dump | grep 'osd.0 up'; then
>>       echo "waiting for osd.0 to come back up ($i/$max_run)"
>>       sleep 1
>>     else
>>       break
>>     fi
>>   done
>>   ceph osd dump | grep 'osd.0 up'
>>
>>   for id in `ceph osd ls` ; do
>>     retry_eagain 5 map_enxio_to_eagain ceph tell osd.$id version
>>   done
>> }
>>
>> Which matches with what I thought I knew:
>>   OSD down => up => rebind
>> which follows from the log where the osd complains about being marked
>> down incorrectly.
>> search for
>>    log_channel(cluster) log [WRN] : map e8 wrongly marked me down
>> in the osd.0.log
>>
> 
> I ran the code snippet, and it DOES generate a rebind even on Linux
> Centos 7.
> So I expect that the rebind action is called for?
> 
> Also note that at least one nonce is incremented, AND new ports are
> selected.
> 6801,6802,6803 are avoided, and it looks like already connected ports
> are avoided as well.
> (obvious, since binding on them will not work)
> 
> Next is the FreeBSD log, and that one is relatively seriously wrong
> since nonce are 0 .
> And that must be incorrect.
> 
> Last is the log of the current master code with the patch I made of
> msg->my_addr, versus addr.
> IMHO a change that should be semantical equal?
>   return bind(msgr->my_inst.addr, new_avoid);
>   // return bind(addr, new_avoid);
> One thing that is not working there is the port avoidance. That I have
> to look into since there is no reason
> why this should not work under FreeBSD.

I have tested with the new code Kefu submitted to master this afternoon:
The Linux variant gives:
7f36b1dd1700  1 -- 127.0.0.1:6800/13799 <== mon.2 127.0.0.1:8204/0 38
==== osd_map(18..21 src has 1..21) v3 ==== 1012+0+0 (424389225 0 0)
0x7f36cdcc7cc0 con 0x7f36cdb24000
7f36b5ac6700  0 log_channel(cluster) log [WRN] : map e21 wrongly marked
me down
7f36b5ac6700  1 -- 127.0.0.1:6801/13799 rebind rebind avoid 6801,6802,6803
7f36b5ac6700  1 -- 127.0.0.1:6801/13799 shutdown_connections
7f36b5ac6700  1 -- 127.0.0.1:6812/1013799 _finish_bind bind my_inst.addr
is 127.0.0.1:6812/1013799
7f36b5ac6700  1  Processor -- start

Note that the new allocated channel is on port 6812 with an incremented
nonce, and the channel on port 6800 is not touched. Probably because it
is still open, and it is not the channel being rebound.

Now the FreeBSD version:
bb99f80  1 -- 127.0.0.1:6800/6449 <== mon.1 127.0.0.1:7203/0 158 ====
osd_map(172..175 src has 1..175) v3 ==== 1045+0+0 (1598708051 0 0)
0xc08eb40 con 0xbbca000
bb98d80  0 log_channel(cluster) log [WRN] : map e175 wrongly marked me down
bb98d80  1 -- 127.0.0.1:6801/6449 rebind rebind avoid 6801,6802,6803
bb98d80  1 -- 127.0.0.1:6801/6449 shutdown_connections
bb98d80  1 -- 127.0.0.1:6800/1006449 _finish_bind bind my_inst.addr is
127.0.0.1:6800/1006449
bb98d80  1  Processor -- start

And here the new channel is allocated on port 6800, as if that would be
available. Should it also not be marked as such in avoid_ports??
But that port is still being used as is indicated in the first line.
FreeBSD does not seem to have a problem binding to that port ??
And thus a channel is listening on 6800, but with a new nonce.
So that could be the reason that ceph and rados are complaining that
they cannot talk to osd.0

So why is 6800 not in the list for channels to avoid?

--WjW

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html