On 6-9-2016 19:36, Gregory Farnum wrote:
> On Mon, Sep 5, 2016 at 8:18 AM, Willem Jan Withagen <wjw@xxxxxxxxxxx> wrote:
>> On 5-9-2016 13:42, Willem Jan Withagen wrote:
>>> Hi
>>>
>>> I would be interested to know why I get two different answers:
>>>
>>> (this is asking the OSDs directly?)
>>> ceph osd dump
>>> but osd.0 is in, up, up_from 179 up_thru 185 down_at 176
>>> osd.1/2 are in, up, up_from 8/13 up_thru 224 down_at 0
>>>
>>> ceph -s reports 1 OSD down
>>>
>>> So all osds in the dump are in and up, but ...
>>> I guess that osd.0 is telling me when it came back and that it does not
>>> have all the data that osd.1/2 have, because they go up to 224.
>>>
>>> Is that why ceph -s tells me that one osd is down?
>>> Or did the leading MON not get fully informed?
>>>
>>> What daemon is missing what part of the communication?
>>> And what type of warning/error should I look for in the log files?
>>
>> A bit of extra info:
>>
>> The trouble starts with:
>> ceph osd down 0
>> 4: marked down osd.0.
>>
>> ceph osd dump
>> 4: epoch 174
>> 4: fsid eee1774b-3560-46ec-83b4-bac0e8763e93
>> 4: created 2016-09-05 16:47:43.301220
>> 4: modified 2016-09-05 16:51:16.529826
>> 4: flags noup,sortbitwise,require_jewel_osds,require_kraken_osds
>> 4: pool 0 'rbd' replicated size 3 min_size 1 crush_ruleset 0 object_hash
>> rjenkins pg_num 8 pgp_num 8 last_change 1 flags hashpspool stripe_width 0
>> 4: max_osd 3
>> 4: osd.0 down in weight 1 up_from 4 up_thru 167 down_at 173
>> last_clean_interval [0,0) 127.0.0.1:6800/45257 127.0.0.1:6801/45257
>> 127.0.0.1:6802/45257 127.0.0.1:6803/45257 exists
>> 7688bac7-75ec-4a3c-8edc-ce7245071a90
>> 4: osd.1 up in weight 1 up_from 8 up_thru 173 down_at 0
>> last_clean_interval [0,0) 127.0.0.1:6804/45271 127.0.0.1:6805/45271
>> 127.0.0.1:6806/45271 127.0.0.1:6807/45271 exists,up
>> 7639c41c-be59-42e4-9eb4-fcb6372e7042
>> 4: osd.2 up in weight 1 up_from 14 up_thru 173 down_at 0
>> last_clean_interval [0,0) 127.0.0.1:6808/45285 127.0.0.1:6809/45285
>> 127.0.0.1:6810/45285 127.0.0.1:6811/45285 exists,up
>> 39e812a4-2f50-4088-bebd-6392dc05c76c
>> 4: pg_temp 0.0 [0,2,1]
>> 4: pg_temp 0.1 [2,0,1]
>> 4: pg_temp 0.2 [0,1,2]
>> 4: pg_temp 0.3 [2,0,1]
>> 4: pg_temp 0.4 [0,2,1]
>> 4: pg_temp 0.5 [0,2,1]
>> 4: pg_temp 0.6 [0,1,2]
>> 4: pg_temp 0.7 [1,0,2]
>>
>> ceph daemon osd.0 status
>> 4: {
>> 4:     "cluster_fsid": "eee1774b-3560-46ec-83b4-bac0e8763e93",
>> 4:     "osd_fsid": "7688bac7-75ec-4a3c-8edc-ce7245071a90",
>> 4:     "whoami": 0,
>> 4:     "state": "active",
>> 4:     "oldest_map": 1,
>> 4:     "newest_map": 172,
>> 4:     "num_pgs": 8
>> 4: }
>> 4: /home/wjw/wip/qa/workunits/cephtool/test.sh:15: osd_state: ceph -s
>> 4:     cluster eee1774b-3560-46ec-83b4-bac0e8763e93
>> 4:      health HEALTH_WARN
>> 4:             3 pgs degraded
>> 4:             3 pgs stale
>> 4:             3 pgs undersized
>> 4:             1/3 in osds are down
>> 4:             noup,sortbitwise,require_jewel_osds,require_kraken_osds flag(s) set
>> 4:      monmap e1: 3 mons at {a=127.0.0.1:7202/0,b=127.0.0.1:7203/0,c=127.0.0.1:7204/0}
>> 4:             election epoch 6, quorum 0,1,2 a,b,c
>> 4:      osdmap e174: 3 osds: 2 up, 3 in; 8 remapped pgs
>> 4:             flags noup,sortbitwise,require_jewel_osds,require_kraken_osds
>> 4:       pgmap v294: 8 pgs, 1 pools, 0 bytes data, 0 objects
>> 4:             300 GB used, 349 GB / 650 GB avail
>> 4:                    3 stale+active+clean
>> 4:                    3 active+undersized+degraded
>> 4:                    2 active+clean
>>
>> And then:
>> ceph osd unset noup
>> 4: noup is unset
>>
>> And we wait/loop until 'ceph osd dump' reports osd.0 up.
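As an aside: the wait there is nothing more than polling 'ceph osd dump'.
Roughly like the sketch below; the timeout and the grep pattern are my own
illustration, not the actual helper in test.sh.

    # Illustrative only: poll the osdmap until osd.0 shows up as "up",
    # or give up after ~60 seconds. Not the real test.sh code.
    for i in $(seq 1 60); do
        if ceph osd dump | grep -q '^osd\.0 up '; then
            echo "osd.0 is up in the osdmap"
            break
        fi
        sleep 1
    done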
>>
>> ceph osd dump
>> 4: epoch 177
>> 4: fsid eee1774b-3560-46ec-83b4-bac0e8763e93
>> 4: created 2016-09-05 16:47:43.301220
>> 4: modified 2016-09-05 16:51:20.098412
>> 4: flags sortbitwise,require_jewel_osds,require_kraken_osds
>> 4: pool 0 'rbd' replicated size 3 min_size 1 crush_ruleset 0 object_hash
>> rjenkins pg_num 8 pgp_num 8 last_change 1 flags hashpspool stripe_width 0
>> 4: max_osd 3
>> 4: osd.0 up in weight 1 up_from 176 up_thru 176 down_at 173
>> last_clean_interval [4,175) 127.0.0.1:6800/45257 127.0.0.1:6812/1045257
>> 127.0.0.1:6813/1045257 127.0.0.1:6814/1045257 exists,up
>> 7688bac7-75ec-4a3c-8edc-ce7245071a90
>> 4: osd.1 up in weight 1 up_from 8 up_thru 176 down_at 0
>> last_clean_interval [0,0) 127.0.0.1:6804/45271 127.0.0.1:6805/45271
>> 127.0.0.1:6806/45271 127.0.0.1:6807/45271 exists,up
>> 7639c41c-be59-42e4-9eb4-fcb6372e7042
>> 4: osd.2 up in weight 1 up_from 14 up_thru 176 down_at 0
>> last_clean_interval [0,0) 127.0.0.1:6808/45285 127.0.0.1:6809/45285
>> 127.0.0.1:6810/45285 127.0.0.1:6811/45285 exists,up
>> 39e812a4-2f50-4088-bebd-6392dc05c76c
>> 4: pg_temp 0.2 [1,2]
>> 4: pg_temp 0.6 [1,2]
>> 4: pg_temp 0.7 [1,2]
>>
>> ceph daemon osd.0 status
>> 4: {
>> 4:     "cluster_fsid": "eee1774b-3560-46ec-83b4-bac0e8763e93",
>> 4:     "osd_fsid": "7688bac7-75ec-4a3c-8edc-ce7245071a90",
>> 4:     "whoami": 0,
>> 4:     "state": "booting",
>> 4:     "oldest_map": 1,
>> 4:     "newest_map": 177,
>> 4:     "num_pgs": 8
>> 4: }
>>
>> ceph -s
>> 4:     cluster eee1774b-3560-46ec-83b4-bac0e8763e93
>> 4:      health HEALTH_WARN
>> 4:             3 pgs degraded
>> 4:             3 pgs peering
>> 4:             3 pgs undersized
>> 4:      monmap e1: 3 mons at {a=127.0.0.1:7202/0,b=127.0.0.1:7203/0,c=127.0.0.1:7204/0}
>> 4:             election epoch 6, quorum 0,1,2 a,b,c
>> 4:      osdmap e177: 3 osds: 3 up, 3 in; 3 remapped pgs
>> 4:             flags sortbitwise,require_jewel_osds,require_kraken_osds
>> 4:       pgmap v298: 8 pgs, 1 pools, 0 bytes data, 0 objects
>> 4:             300 GB used, 349 GB / 650 GB avail
>> 4:                    3 peering
>> 4:                    3 active+undersized+degraded
>> 4:                    2 active+clean
>>
>> And even though the cluster is reported up, the osd itself is still booting,
>> and it never gets out of that state.
>>
>> I have reworked the code that actually rebinds the accepters, which seems
>> to be working. But obviously not enough work gets done for the OSD to
>> really go to the active state.
>> Wido suggested looking for "wrongly marked me down", because that should
>> trigger the reconnect state. Which led me to SimpleMessenger.
>>
>> Does anybody have suggestions as to where to start looking this time?
>
> Just at a first guess, I think your rebind changes might have busted
> something. I haven't looked at it in any depth, but
> https://github.com/ceph/ceph/pull/10976 made me nervous when I saw
> it: in general we expect a slightly different address to guarantee
> that the OSD knows it was previously marked down.

Right, what you say is the other side of what I noticed: during the rebind
there are a few attempts to actually get connected, causing a rather long
delay before the rebind completes. And the test did not complete either.

The first part of the change is https://github.com/ceph/ceph/pull/10720,
which is really needed because shutdown() on a listening socket does not
work on FreeBSD.

I'll take the change (#10976) out and watch the rebinding more carefully,
and start working from the state machine to find out why the OSD does not
leave the booting state...

Thanks for the suggestion.
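For the record, this is roughly how I plan to watch it. The log path is an
assumption from my local setup (a vstart.sh cluster normally writes to
out/osd.0.log); the admin socket command is the same one shown above.

    # Debugging sketch only; the log path and timings are assumptions.
    LOG=out/osd.0.log

    # Did osd.0 ever notice it was marked down? It should log
    # "wrongly marked me down" and then restart its boot sequence.
    grep -n 'wrongly marked me down' "$LOG"

    # Watch the daemon's own view of its state via the admin socket;
    # it should go from "booting" to "active" once boot really completes.
    while sleep 2; do
        ceph daemon osd.0 status | grep '"state"'
    done

If the "wrongly marked me down" line never shows up, that would point at the
mark-down detection around the rebind rather than at the boot state machine
itself.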
--WjW