On Tue, Sep 13, 2016 at 2:00 AM, Willem Jan Withagen <wjw@xxxxxxxxxxx> wrote:
> On 13-9-2016 04:29, Haomai Wang wrote:
>> On Tue, Sep 13, 2016 at 6:59 AM, Willem Jan Withagen <wjw@xxxxxxxxxxx> wrote:
>>> Hi
>>>
>>> When running cephtool-test-mon.sh, part of it executes:
>>> ceph tell osd.0 version
>>> I see reports on the command line, I guess that this is the OSD
>>> complaining that things are wrong:
>>>
>>> 2016-09-12 23:50:39.239037 814e50e00 0 -- 127.0.0.1:0/1925715881 >>
>>> 127.0.0.1:6800/26384 conn(0x814fde800 sd=18 :-1
>>> s=STATE_CONNECTING_WAIT_BANNER_AND_IDENTIFY pgs=0 cs=0
>>> l=1)._process_connection connect claims to be 127.0.0.1:6800/1026384 not
>>> 127.0.0.1:6800/26384 - wrong node!
>>>
>>> Which it keeps repeating until it is shot down.... after 3600 secs.
>>>
>>> The nonce is incremented by 1000000 on every rebind.
>>>
>>> But what I do not understand is how this mismatch has occurred.
>>> I would expect port 6800 to be the port to which the OSD is connected,
>>> so the connecting party (ceph in this case) thinks the nonce is
>>> 1026384. Did the MON have this information? And where did the MON then
>>> get it from....
>>>
>>> Somewhere one of the parts did not receive the new nonce, or did not
>>> also increment it?
>>
>> nonce is a part of ceph_entity_addr, so OSDMap will take this
>
> Right, but then the following is also suspicious???
>
> ====
> # ceph osd dump
> epoch 188
> fsid 2e02472d-ecbb-43ac-a687-bbf2523233d9
> created 2016-09-13 10:28:07.970254
> modified 2016-09-13 10:34:57.318988
> flags sortbitwise,require_jewel_osds,require_kraken_osds
> pool 0 'rbd' replicated size 3 min_size 1 crush_ruleset 0 object_hash
> rjenkins pg_num 8 pgp_num 8 last_change 1 flags hashpspool stripe_width 0
> max_osd 10
> osd.0 up in weight 1 up_from 175 up_thru 185 down_at 172
> last_clean_interval [8,174) 127.0.0.1:6800/36565 127.0.0.1:6800/1036565
> 127.0.0.1:6804/1036565 127.0.0.1:6805/1036565 exists,up
> e0e44b9c-9869-49d8-8afb-bdb71c04ea27
> osd.1 up in weight 1 up_from 10 up_thru 184 down_at 0
> last_clean_interval [0,0) 127.0.0.1:6804/36579 127.0.0.1:6805/36579
> 127.0.0.1:6806/36579 127.0.0.1:6807/36579 exists,up
> b554849c-2cf1-4cf7-a5fd-3529d33345ff
> osd.2 up in weight 1 up_from 12 up_thru 185 down_at 0
> last_clean_interval [0,0) 127.0.0.1:6808/36593 127.0.0.1:6809/36593
> 127.0.0.1:6810/36593 127.0.0.1:6811/36593 exists,up
> 2d6648ba-72e1-4c53-ae10-929a9d13a3dd
> ====
>
> osd.0 has:
> 127.0.0.1:6800/36565
> 127.0.0.1:6800/1036565
> 127.0.0.1:6804/1036565
> 127.0.0.1:6805/1036565
>
> So I guess that one nonce did not get updated, because I would expect
> all ports to be rebound, and the nonce incremented?
>
> The other bad thing is that ports 6804 and 6805 are now both in osd.0
> and osd.1, which is also going to create some trouble, I would guess.
>
> So this is what the osdmap distributes?
> And the MON then reports to clients?
>
> How would I retrieve the dump from osd.0 itself?
> Trying:
> # ceph -c ceph.conf daemon osd.0 dump
> Can't get admin socket path: [Errno 2] No such file or directory
>
> Using the admin-socket directly does work.
> But there is not really a command to get something equal to:
> ====
> osd.0 up in weight 1 up_from 175 up_thru 185 down_at 172
> last_clean_interval [8,174) 127.0.0.1:6800/36565 127.0.0.1:6800/1036565
> 127.0.0.1:6804/1036565 127.0.0.1:6805/1036565 exists,up
> e0e44b9c-9869-49d8-8afb-bdb71c04ea27
> ====
> So that one can see what the OSD itself thinks it is....
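
To make the quoted failure concrete: the nonce is carried inside the entity
address, so it travels with the OSDMap, and the connecting side rejects any
peer whose claimed ip:port/nonce differs from the address it took out of the
map. Below is a minimal sketch of that check and of the nonce bump on rebind;
the names and types are made up for illustration, this is not the real
messenger code:

====
// A simplified sketch, not the real AsyncMessenger code: made-up names,
// hand-rolled types. It only illustrates why a stale nonce produces the
// "wrong node!" loop quoted above.
#include <cstdint>
#include <iostream>
#include <string>

struct entity_addr {          // stand-in for ceph_entity_addr
  std::string ip;
  uint16_t    port;
  uint64_t    nonce;          // distinguishes incarnations of a daemon on the same ip:port
  bool operator==(const entity_addr &o) const {
    return ip == o.ip && port == o.port && nonce == o.nonce;
  }
};

// As described above, a rebind bumps the nonce by 1000000 (26384 -> 1026384),
// and one would expect every address the daemon re-registers to get the bump.
void rebind(entity_addr &a) {
  a.nonce += 1000000;
}

// What the connecting side does once the peer identifies itself: if the
// address the peer claims differs (nonce included) from the address that was
// dialled, taken from the OSDMap, the connection is rejected.
bool accept_peer(const entity_addr &claimed, const entity_addr &expected) {
  if (!(claimed == expected)) {
    std::cerr << "connect claims to be " << claimed.ip << ":" << claimed.port
              << "/" << claimed.nonce << " not " << expected.ip << ":"
              << expected.port << "/" << expected.nonce << " - wrong node!\n";
    return false;
  }
  return true;
}

int main() {
  entity_addr from_map{"127.0.0.1", 6800, 26384};  // what the client got from the map
  entity_addr osd_side{"127.0.0.1", 6800, 26384};
  rebind(osd_side);                                // OSD rebound; map entry kept the old nonce
  return accept_peer(osd_side, from_map) ? 0 : 1;  // prints the "wrong node!" line
}
====

If a rebind bumps the nonce the daemon answers with, but the address the
clients got from the map still carries the old value, every connect to that
port fails exactly like the loop in the quoted log.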

Is osd.0 actually running? If so it *should* have a socket, unless you've
disabled them somehow. Check the logs and see if there are failures when it
gets set up, I guess?

Anyway, something has indeed gone terribly wrong here. I know at one point you
had some messenger patches you were using to try to get things going on BSD;
if you still have some of those in place, I think you need to consider them
suspect. Otherwise, uh... the network stack is behaving very differently from
Linux's?

-Greg