Re: Mismatching nonce for 'ceph osd.0 tell'

Willem Jan Withagen <wjw@xxxxxxxxxxx> · Fri, 9 Dec 2016 10:22:15 +0100

On 9-12-2016 09:59, kefu chai wrote:
> On Thu, Dec 8, 2016 at 8:30 PM, Willem Jan Withagen <wjw@xxxxxxxxxxx> wrote:
>> On 8-12-2016 11:03, kefu chai wrote:
>>> On Tue, Oct 4, 2016 at 7:57 PM, Willem Jan Withagen <wjw@xxxxxxxxxxx> wrote:
>>>> On 3-10-2016 19:50, Gregory Farnum wrote:
>>>>>> Question here is:
>>>>>>   If I ask 'ceph osd dump', I'm actually asking ceph-mon.
>>>>>>   And ceph-mon has learned this from (crush?)maps being sent to it by
>>>>>>   ceph-osd.
>>>>>
>>>>> The monitor has learned about specific IP addresses/nonces/etc via
>>>>> MOSDBoot messages from the OSDs. The crush locations are set via
>>>>> monitor command messages, generally invoked as part of the init
>>>>> scripts. Maps are generated entirely on the monitor. :)
>>>>>
>>>>>> Is there an easy way to debug/monitor the content of what ceph-osd sends
>>>>>> and ceph-mon receives in the maps?
>>>>>> Just to make sure that it is clear where the problem occurs.
>>>>>
>>>>> You should be able to see the info going in and out by bumping the
>>>>> debug levels up — every message's "print" function is invoked when
>>>>> it's sent/received as long as you have "debug ms = 1". It looks like
>>>>> the MOSDBoot message doesn't natively dump its addresses but you can
>>>>> add them easily if you need to.
>>>>
>>>> Hi Greg,
>>>>
>>>> Thanx for the answer....
>>>>
>>>> I've got debug_ms already pumped up all the way to 20.
>>>> So I do get to see what addresses are selected during bind. But still
>>>> they do not end up at the MON, and 'ceph osd dump' reports:
>>>>         :/0
>>>> as bind address.
>>>>
>>>> I'm going to add some more debugs to actually see what MOSDBoot is doing....
>>>
>>> there are multiple messengers used by ceph-osd, the one connected by
>>> rados client is the external/public messenger. it is also used by osd
>>> to talk with the monitor.
>>>
>>> the nonce of the external address of an OSD does not change after it's
>>> up: it's always the pid of ceph-osd process. and the (peer) address of
>>> the booting OSD collected by monitor comes from the connection's
>>> peer_addr field, which is set when the monitor accepts the connection
>>> from OSD. see STATE_ACCEPTING_WAIT_BANNER_ADDR case block in
>>> AsyncConnection::_process_connection().
>>>
>>> but there are chances that an OSD is restarted and fails to bind its
>>> external messenger to the specified port. in that case, ceph-osd
>>> will try with another port, but keep the nonce the same. but when it
>>> comes to other messengers used by ceph-osd, their nonces increase by
>>> 1000000 every time they rebind. that's why "ceph osd thrash" can
>>> change the nonces of the cluster_addr, heartbeat_back_addr and
>>> heartbeat_front_addr. the PR of
>>> https://github.com/ceph/ceph/pull/11706 actually changes the behavior
>>> of these three messengers. and it has nothing to do
>>> with the external messenger to which the ceph cli client is
>>> connecting.
>>>
>>> so you might want to check
>>> 1) how/why the nonce of the messenger in MonClient is 1000000 + $pid
>>> 2) while the nonce of the same messenger is $pid when the ceph cli
>>> connects to it.
>>>
>>> my PR of https://github.com/ceph/ceph/pull/11804 is more of a cleanup.
>>> it avoids setting the nonce before the rebind finishes. and i tried
>>> with your reproducer on my linux box, no luck =(
>>
>> Right,
>>
>> You gave me a lot of things to think about, and to start figuring out.
>>
>> And you are right that something really bad needs to happen to an OSD to
>> get in this state. But that is what the tests actually do: They just
>> down/up or kill OSDs and restart.
>>
>> And from previous discussions I "learned" that if the process doesn't
>> die but needs to rebind on the port, the OSD stays at the same port but
>> increments the nonce to indicate that it is a fresh connection. And log
> 
> the external messenger should *not* increment its nonce.
> 
>> printing actually shows that the code is going thru a rebind.
> 
> and it should *not* go through rebind().

I have to dig through the test script, but as far as I can tell just about
all of the daemons get restarted in this test.

So when would I get a rebind?

I thought it was because I had an OSD incorrectly marked down:
./src/osd/OSD.cc:7074:                 << " wrongly marked me down";
This I found in the logs, and then I got a rebind.

Wido suggested looking for this message when I asked why my OSDs were
not coming up after all OSDs and MONs had been shaken up thoroughly.

And that is one of the tests in cephtool-test-mon.sh.
Right before the 'ceph tell osd.0 version' there are tests like:
  ceph osd set noup
  ceph osd down 0
  ceph osd dump | grep 'osd.0 down'
  ceph osd unset noup
and
  ceph osd reweight osd.0 .5
  ceph osd dump | grep ^osd.0 | grep 'weight 0.5'
  ceph osd out 0
  ceph osd in 0
  ceph osd dump | grep ^osd.0 | grep 'weight 0.5'
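
To compare what the monitor reports before and after steps like these, the
addresses can be pulled out of the saved 'ceph osd dump' text. A minimal
sketch in Python, assuming addresses are printed as ip:port/nonce tokens
separated by whitespace (the empty ":/0" form mentioned earlier in this
thread is accepted too). The function name and the sample line are made up
for illustration and are not real output.

```python
import re

# One whitespace-delimited token of the form "ip:port/nonce".
# ip and port may be empty so that ":/0" is matched as well.
ADDR_RE = re.compile(r"(?<!\S)(\S*?):(\d*)/(\d+)(?!\S)")


def parse_addrs(line):
    """Return a list of (ip, port, nonce) tuples found in one line of text.

    ip and port are kept as strings (they can be empty); nonce is an int.
    """
    return [(ip, port, int(nonce)) for ip, port, nonce in ADDR_RE.findall(line)]


# Hypothetical line, only shaped like an osd entry; not copied from a cluster.
sample = "osd.0 up in weight 0.5 127.0.0.1:6800/4242 127.0.0.1:6801/1004242 :/0"

if __name__ == "__main__":
    for ip, port, nonce in parse_addrs(sample):
        print(repr(ip), repr(port), nonce)
```

Running this on dumps taken before and after the down/up sequence would show
which of the addresses changed its nonce.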

>> Now the bad thing is that the Linux and FreeBSD logs show comparable things
>> with my (small) change to the setting of addr. And the nonce is indeed
>> incremented, and that increment is actually picked up by all ceph components.

So now I have 2 challenges??

1) Find out why I get a rebind where you think I should not.
   For that I'll have to collect all the maltreatment that is done in
   cephtool-test-mon.sh, and again compare the Linux and FreeBSD logs
   to see what is up.
2) If we get a rebind...
   Why doesn't the FreeBSD version end up with consistent nonces?
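
For the log comparison, the nonce rule quoted above (nonce starts at the pid,
and the non-external messengers add 1000000 on every rebind) can be turned
into a small check. This is only a sketch of that rule, not Ceph code: the
function names are invented here, and it assumes the pid is below 1000000 so
that pid and rebind count can be separated again.

```python
REBIND_STEP = 1000000  # per-rebind increment, as stated in the quoted text


def split_nonce(nonce, step=REBIND_STEP):
    """Split a nonce into (base, rebinds), assuming nonce = base + rebinds * step.

    base is the pid if the pid is smaller than step (an assumption).
    """
    rebinds, base = divmod(nonce, step)
    return base, rebinds


def describe_mismatch(nonce_a, nonce_b):
    """Say how two nonces for the same messenger relate to each other."""
    if nonce_a == nonce_b:
        return "match"
    base_a, rebinds_a = split_nonce(nonce_a)
    base_b, rebinds_b = split_nonce(nonce_b)
    if base_a == base_b:
        return "same pid, %d rebind(s) apart" % abs(rebinds_a - rebinds_b)
    return "different pid"


if __name__ == "__main__":
    # The case from the quoted questions: 1000000 + $pid on one side, $pid on the other.
    pid = 4242  # made-up pid
    print(describe_mismatch(REBIND_STEP + pid, pid))
```

A result of "same pid, 1 rebind(s) apart" would point at one side having
picked up the rebind and the other not, while "different pid" would point at
a stale address from an earlier process.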

"Good thing" about the previous code was that I could tweak it, and at
least get it to work for FreeBSD. I have not had the time to see if I
could do so again with this code....

--WjW