Re: OSD/monitor timeouts?

On Tue, Jan 28, 2014 at 6:43 PM, Stuart Longland <stuartl@xxxxxxxxxx> wrote:
> Hi Gregory,
> On 28/01/14 15:51, Gregory Farnum wrote:
>>> I do note ntp doesn't seem to be doing its job, but that's a side issue.
>> Actually, that could be it. If you take down one of the monitors and
>> the other two have enough of a time gap that they won't talk to each
>> other, your cluster won't be able to make any progress. The OSDs don't
>> much care, but your monitor nodes need to have a well-synced clock.
>
> I've done some more testing here.  Had lots of fun and games learning
> about the intricacies of running NTP peers (as opposed to servers) and
> the like.
>
> One thing I observe is that mon.0 seems to always dominate.  I'll have
> all three nodes up, kill mon.0 via the power switch, wait for the
> clients to switch to another monitor, then power it back up.
>
> Suddenly monitors 1 and 2 are the ones with the clocks askew, not
> monitor 0, which has just loaded the time from the hardware clock and
> hasn't yet synchronised to NTP.  After a while it all sorts itself out
> (I guess once NTP brings them back into sync), but until then I get the
> HEALTH_WARN state in ceph.

Yeah, the monitors have no idea who's "right", just that they differ.
mon.0 will tend to be elected as the leader, and everybody speaks in
terms of the leader's clock.
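
That 0.05s in the warning comes from the "mon clock drift allowed"
option, which defaults to 0.05 seconds. If the transient skew right
after a cold boot keeps tripping the warning you could loosen it a
little on the monitor nodes; purely as a sketch (keeping the clocks
tight is still the better fix):

  [mon]
  mon clock drift allowed = 0.1

You can check what a running monitor is actually using via its admin
socket, e.g. "ceph --admin-daemon /var/run/ceph/ceph-mon.<id>.asok
config get mon_clock_drift_allowed".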

>
> i.e. initially ceph health reports:
>> HEALTH_OK
>
> then I power off sn0, I get:
>> HEALTH_WARN 759 pgs degraded; 387 pgs stuck unclean; recovery 102277/306831 objects degraded (33.333%); 2/6 in osds are
>>  down; 1 mons down, quorum 1,2 1,2
>
> and ceph -w shows:
>> 2014-01-29 12:39:12.127994 mon.1 [INF] pgmap v86707: 800 pgs: 41 active+remapped, 759 active+degraded; 399 GB data, 1199 GB used, 15559 GB / 16758 GB avail; 17362 kB/s rd, 135 op/s; 102277/306831 objects degraded (33.333%)
>> 2014-01-29 12:39:14.023174 mon.1 [INF] pgmap v86708: 800 pgs: 41 active+remapped, 759 active+degraded; 399 GB data, 1199 GB used, 15559 GB / 16758 GB avail; 5340 kB/s rd, 41 op/s; 102277/306831 objects degraded (33.333%)
>> 2014-01-29 12:39:16.032460 mon.1 [INF] pgmap v86709: 800 pgs: 41 active+remapped, 759 active+degraded; 399 GB data, 1199 GB used, 15559 GB / 16758 GB avail; 4196 kB/s rd, 32 op/s; 102277/306831 objects degraded (33.333%)
>
> I power up sn0 now, and after an OSD recovery:
>> HEALTH_WARN clock skew detected on mon.1
>
> and ceph -w shows:
>> 2014-01-29 12:40:36.042565 mon.1 [INF] pgmap v86757: 800 pgs: 41 active+remapped, 759 active+degraded; 399 GB data, 1199 GB used, 15559 GB / 16758 GB avail; 5124 kB/s rd, 40 op/s; 102277/306831 objects degraded (33.333%)
>> 2014-01-29 12:40:37.047329 mon.1 [INF] pgmap v86758: 800 pgs: 41 active+remapped, 759 active+degraded; 399 GB data, 1199 GB used, 15559 GB / 16758 GB avail; 15798 kB/s rd, 123 op/s; 102277/306831 objects degraded (33.333%)
>> 2014-01-29 12:40:39.033544 mon.1 [INF] pgmap v86759: 800 pgs: 41 active+remapped, 759 active+degraded; 399 GB data, 1199 GB used, 15559 GB / 16758 GB avail; 15920 kB/s rd, 124 op/s; 102277/306831 objects degraded (33.333%)
>> 2014-01-29 12:40:40.556413 mon.0 [INF] mon.0 calling new monitor election
>> 2014-01-29 12:40:40.562922 mon.0 [INF] mon.0 calling new monitor election
>> 2014-01-29 12:40:40.563389 mon.0 [INF] mon.0@0 won leader election with quorum 0,1,2
>> 2014-01-29 12:40:40.565821 mon.0 [WRN] mon.1 10.87.160.225:6789/0 clock skew 0.106539s > max 0.05s
>> 2014-01-29 12:40:40.670626 mon.1 [INF] mon.1 calling new monitor election
>> 2014-01-29 12:40:40.687451 mon.2 [INF] mon.2 calling new monitor election
>> 2014-01-29 12:40:40.866972 mon.0 [INF] pgmap v86759: 800 pgs: 41 active+remapped, 759 active+degraded; 399 GB data, 1199 GB used, 15559 GB / 16758 GB avail; 102277/306831 objects degraded (33.333%)
>> 2014-01-29 12:40:40.867048 mon.0 [INF] mdsmap e1: 0/0/1 up
>> 2014-01-29 12:40:40.867199 mon.0 [INF] osdmap e501: 6 osds: 4 up, 6 in
>> 2014-01-29 12:40:40.867285 mon.0 [INF] monmap e1: 3 mons at {0=10.20.30.224:6789/0,1=10.20.30.225:6789/0,2=10.20.30.226:6789/0}
>> 2014-01-29 12:40:40.962267 mon.0 [WRN] message from mon.2 was stamped 0.125828s in the future, clocks not synchronized
>> 2014-01-29 12:40:41.003207 mon.0 [INF] osdmap e502: 6 osds: 4 up, 6 in
>> 2014-01-29 12:40:41.006869 mon.0 [INF] pgmap v86760: 800 pgs: 41 active+remapped, 759 active+degraded; 399 GB data, 1199 GB used, 15559 GB / 16758 GB avail; 5168 kB/s rd, 40 op/s; 102277/306831 objects degraded (33.333%)
>> 2014-01-29 12:40:41.971799 mon.0 [INF] osdmap e503: 6 osds: 4 up, 6 in
>
> Presently, I've got three authoritative NTP servers on the network:
> - border router (Ubuntu Linux 12.04)
> - interior router (Cisco IOS)
> - file server (Ubuntu Linux 10.04)
>
> I initially had the Ceph nodes synced to them, and to each other, i.e.
> in ntp.conf:
>
> server ntp0.domain
> server ntp1.domain
> server ntp2.domain
> peer a-mon-node
> peer another-mon-node
>
> Having discovered NTP doesn't like these synchronisation loops (where a
> peer uses the same reference as another), I've since changed this to:
>
> server ntp.domain # RR A record for ntp0, ntp1 and ntp2
> server 0.au.pool.ntp.org
> server 1.au.pool.ntp.org
> server 2.au.pool.ntp.org
> server 3.au.pool.ntp.org
> peer a-mon-node
> peer another-mon-node
>
> That means they all pick a number of servers at random from the public
> NTP pool.  Not sure what tricks others have tried with regard to
> NTP, but at least now the peers are marked as "candidates" and not
> "rejected".
>
>> And in your trace everything got wedged for so long the system just
>> gave up; that's probably a result of the cluster having data it
>> couldn't write to for too long. (Like I said before, you should make
>> sure your CRUSH map and rules look right.)
>
> At the moment my CRUSH map looks like this:
>> root@sn2:~# ceph osd getcrushmap > crushmap.bin
>> got crush map from osdmap epoch 489
>> root@sn2:~# crushtool -d crushmap.bin > crushmap.txt
>> root@sn2:~# cat crushmap.txt
>> # begin crush map
>>
>> # devices
>> device 0 osd.0
>> device 1 osd.1
>> device 2 osd.2
>> device 3 osd.3
>> device 4 osd.4
>> device 5 osd.5
>>
>> # types
>> type 0 osd
>> type 1 host
>> type 2 rack
>> type 3 row
>> type 4 room
>> type 5 datacenter
>> type 6 root
>>
>> # buckets
>> host sn0 {
>>         id -2           # do not change unnecessarily
>>         # weight 5.460
>>         alg straw
>>         hash 0  # rjenkins1
>>         item osd.0 weight 2.730
>>         item osd.1 weight 2.730
>> }
>> host sn1 {
>>         id -3           # do not change unnecessarily
>>         # weight 5.460
>>         alg straw
>>         hash 0  # rjenkins1
>>         item osd.2 weight 2.730
>>         item osd.3 weight 2.730
>> }
>> host sn2 {
>>         id -4           # do not change unnecessarily
>>         # weight 5.460
>>         alg straw
>>         hash 0  # rjenkins1
>>         item osd.4 weight 2.730
>>         item osd.5 weight 2.730
>> }
>> root default {
>>         id -1           # do not change unnecessarily
>>         # weight 16.380
>>         alg straw
>>         hash 0  # rjenkins1
>>         item sn0 weight 5.460
>>         item sn1 weight 5.460
>>         item sn2 weight 5.460
>> }
>>
>> # rules
>> rule data {
>>         ruleset 0
>>         type replicated
>>         min_size 1
>>         max_size 10
>>         step take default
>>         step chooseleaf firstn 0 type host
>>         step emit
>> }
>> rule metadata {
>>         ruleset 1
>>         type replicated
>>         min_size 1
>>         max_size 10
>>         step take default
>>         step chooseleaf firstn 0 type host
>>         step emit
>> }
>> rule rbd {
>>         ruleset 2
>>         type replicated
>>         min_size 1
>>         max_size 10
>>         step take default
>>         step chooseleaf firstn 0 type host
>>         step emit
>> }
>>
>> # end crush map
>
> The three nodes are in the same rack, so same power, same switch.  Any
> disk is as good as any other.  My intent is that a copy of each object
> is stored on each of the 3 nodes (presently 3 is all we have, but that
> may change).
>
> The default "size" for the cluster is set to 3, and this seems to be
> reflected in the current "size' if each of the pools.  My understanding
> of the above is:
>
> - take default: pick the top-level bucket
> - chooseleaf firstn 0 type host: choose {pool-size} distinct hosts, then
> choose an OSD on each host at random
> - emit: place the objects at those locations
>
> That was the default, and if my understanding is correct, it'll do what
> I'm after.

Yep, that all looks good.
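
If you want to sanity-check the placement without pulling power on a
node, crushtool can simulate the rule against the map you already
extracted. Roughly (flag names vary a little between versions):

  crushtool -i crushmap.bin --test --rule 0 --num-rep 3 \
      --show-mappings --min-x 0 --max-x 9

Each mapping it prints should list three OSDs, one from each of sn0,
sn1 and sn2.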

> And only just now do I get the all-clear from `ceph health`.
>
> I'm not sure what triggered the A-OK message.  Nothing seems different
> in ntp, and I'm not sure how else to measure clock sync.

Ceph health is reporting on more than just "reads and writes succeed";
it's also telling you how durable your data is and whether it's located
where it should be. Taking a node down while continuing to read and
write the data meant that when it came back up it had to re-sync, and
that can take some time.
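
If you want to watch that rather than just waiting for the all-clear,
"ceph health detail" breaks out which PGs are still degraded and
"ceph -w" shows the recovery rate as it runs; for example:

  ceph health detail   # per-PG view of what's still degraded/recovering
  ceph -w              # live log; watch the degraded % count down
  ceph osd stat        # confirm all 6 OSDs are back up and in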
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



