On 9-12-2016 10:22, Willem Jan Withagen wrote:
> On 9-12-2016 09:59, kefu chai wrote:
>> On Thu, Dec 8, 2016 at 8:30 PM, Willem Jan Withagen <wjw@xxxxxxxxxxx> wrote:
>>> On 8-12-2016 11:03, kefu chai wrote:
>>>> On Tue, Oct 4, 2016 at 7:57 PM, Willem Jan Withagen <wjw@xxxxxxxxxxx> wrote:
>>>>> On 3-10-2016 19:50, Gregory Farnum wrote:
>>>>>>> Question here is:
>>>>>>> If I ask 'ceph osd dump', I'm actually asking ceph-mon.
>>>>>>> And ceph-mon has learned this from (crush?)maps being sent to it by
>>>>>>> ceph-osd.
>>>>>>
>>>>>> The monitor has learned about specific IP addresses/nonces/etc via
>>>>>> MOSDBoot messages from the OSDs. The crush locations are set via
>>>>>> monitor command messages, generally invoked as part of the init
>>>>>> scripts. Maps are generated entirely on the monitor. :)
>>>>>>
>>>>>>> Is there an easy way to debug/monitor the content of what ceph-osd sends
>>>>>>> and ceph-mon receives in the maps?
>>>>>>> Just to make sure that it is clear where the problem occurs.
>>>>>>
>>>>>> You should be able to see the info going in and out by bumping the
>>>>>> debug levels up — every message's "print" function is invoked when
>>>>>> it's sent/received as long as you have "debug ms = 1". It looks like
>>>>>> the MOSDBoot message doesn't natively dump its addresses, but you can
>>>>>> add them easily if you need to.
>>>>>
>>>>> Hi Greg,
>>>>>
>>>>> Thanx for the answer....
>>>>>
>>>>> I've got debug_ms already pumped up all the way to 20.
>>>>> So I do get to see what addresses are selected during bind. But still
>>>>> they do not end up at the MON, and 'ceph osd dump' reports:
>>>>>     :/0
>>>>> as bind address.
>>>>>
>>>>> I'm going to add some more debugging to actually see what MOSDBoot is doing....
>>>>
>>>> there are multiple messengers used by ceph-osd; the one connected to by
>>>> rados clients is the external/public messenger. it is also used by the
>>>> osd to talk with the monitor.
>>>>
>>>> the nonce of the external address of an OSD does not change after it's
>>>> up: it's always the pid of the ceph-osd process. and the (peer) address of
>>>> the booting OSD collected by the monitor comes from the connection's
>>>> peer_addr field, which is set when the monitor accepts the connection
>>>> from the OSD. see the STATE_ACCEPTING_WAIT_BANNER_ADDR case block in
>>>> AsyncConnection::_process_connection().
>>>>
>>>> but there are chances that an OSD is restarted and fails to bind its
>>>> external messenger to the specified port. in that case, ceph-osd
>>>> will try another port, but keep the nonce the same. when it
>>>> comes to the other messengers used by ceph-osd, their nonces increase by
>>>> 1000000 every time they rebind. that's why "ceph osd thrash" can
>>>> change the nonces of the cluster_addr, heartbeat_back_addr and
>>>> heartbeat_front_addr. the PR at
>>>> https://github.com/ceph/ceph/pull/11706 actually changes the behavior
>>>> of these three messengers, and it has nothing to do
>>>> with the external messenger to which the ceph cli client is
>>>> connecting.
>>>>
>>>> so you might want to check
>>>> 1) how/why the nonce of the messenger in MonClient is 1000000 + $pid
>>>> 2) while the nonce of the same messenger is $pid when the ceph cli
>>>>    connects to it.
>>>>
>>>> my PR at https://github.com/ceph/ceph/pull/11804 is more of a cleanup.
>>>> it avoids setting the nonce before the rebind finishes. and i tried
>>>> with your reproducer on my linux box, no luck =(
>>>
>>> Right,
>>>
>>> You gave me a lot of things to think about, and to start figuring out.
>>>
>>> And you are right that something really bad needs to happen to an OSD to
>>> get into this state. But that is what the tests actually do: they just
>>> down/up or kill OSDs and restart them.
>>>
>>> And from previous discussions I "learned" that if the process doesn't
>>> die but needs to rebind on the port, the OSD stays at the same port but
>>> increments the nonce to indicate that it is a fresh connection. And log
>>
>> the external messenger should *not* increment its nonce.
>>
>>> printing actually shows that the code is going through a rebind.
>>
>> and it should *not* go through rebind().
>
> I have to dig through the test script, but as far as I can tell just about
> all of the daemons are getting reboots in this test.
>
> So when would I get a rebind?
>
> I thought it was because I had an OSD incorrectly marked down:
>     ./src/osd/OSD.cc:7074: << " wrongly marked me down";
> This I found in the logs, and then I got a rebind.
>
> Wido suggested looking for this message, on my question why my OSDs were
> not coming UP after a good hustle with all OSDs and MONs.
>
> And that is one of the tests in cephtool-test-mon.sh.
> Right before the 'ceph tell osd.0 version' there are tests like:
>     ceph osd set noup
>     ceph osd down 0
>     ceph osd dump | grep 'osd.0 down'
>     ceph osd unset noup
> and
>     ceph osd reweight osd.0 .5
>     ceph osd dump | grep ^osd.0 | grep 'weight 0.5'
>     ceph osd out 0
>     ceph osd in 0
>     ceph osd dump | grep ^osd.0 | grep 'weight 0.5'
>
>>> Now the bad thing is that the Linux and FreeBSD logs do comparable things
>>> with my (small) change to the setting of addr. And the nonce is indeed
>>> incremented, and that increment is actually picked up by all ceph components.
>
> So now I have two challenges:
>
> 1) Find out why I get a rebind, where you think I should not.
>    For that I'll have to collect all the maltreatment that is done in
>    cephtool-test-mon.sh, and again compare the Linux and FreeBSD logs
>    to see what is up.
> 2) If we do get a rebind...
>    Why doesn't the FreeBSD version end up with consistent nonces?
>
> "Good thing" about the previous code was that I could tweak it, and at
> least get it to work for FreeBSD. I have not had the time to see if I
> can do that again with this code....

So the smallest sequence I can find that demonstrates the problem:

function test_mon_rebind() {
    ceph osd set noup
    ceph osd down 0
    ceph osd dump | grep 'osd.0 down'
    ceph osd unset noup
    # give osd.0 up to $max_run seconds to notice and come back up
    max_run=1000
    for ((i=0; i < $max_run; i++)); do
        if ! ceph osd dump | grep 'osd.0 up'; then
            echo "waiting for osd.0 to come back up ($i/$max_run)"
            sleep 1
        else
            break
        fi
    done
    ceph osd dump | grep 'osd.0 up'
    for id in `ceph osd ls`; do
        retry_eagain 5 map_enxio_to_eagain ceph tell osd.$id version
    done
}

Which matches what I thought I knew: OSD down => up => rebind.
This follows from the log, where the OSD complains about being wrongly
marked down; search for

    log_channel(cluster) log [WRN] : map e8 wrongly marked me down

in osd.0.log.

--WjW
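P.S. In case it helps anyone reproduce: below is a rough, untested sketch of
how I plan to compare the nonces of osd.0's four addresses before and after a
down/up cycle. It assumes the usual plain-text 'ceph osd dump' layout where
every address is printed as ip:port/nonce. If kefu is right, the first
(public) nonce should stay at the pid, and the other three should only jump
by 1000000 when a real rebind() happens.

nonces_of_osd0() {
    # pull the ip:port/nonce addresses off the osd.0 line and keep only
    # the nonce part after the '/'
    ceph osd dump | grep '^osd.0 ' | grep -oE '[0-9.]+:[0-9]+/[0-9]+' | cut -d/ -f2
}

before=$(nonces_of_osd0)
ceph osd down 0
# wait for osd.0 to notice it was wrongly marked down and come back up
while ! ceph osd dump | grep -q 'osd.0 up'; do sleep 1; done
after=$(nonces_of_osd0)

echo "nonces before: $before"
echo "nonces after:  $after"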