Re: OSD/monitor timeouts?

On Mon, Jan 27, 2014 at 9:05 PM, Stuart Longland <stuartl@xxxxxxxxxx> wrote:
> On 25/01/14 16:41, Stuart Longland wrote:
>> Hi Gregory,
>> On 24/01/14 12:20, Gregory Farnum wrote:
>>> Did the cluster actually detect the node as down? (You could check
>>> this by looking at the ceph -w output or similar when running the
>>> test.) If it was detected as down and the VM continued to block
>>> (modulo maybe a little time for the client to decide its monitor was
>>> down; I forget what the timeouts are there), that would be odd.
>>
>> I shall give that command a try next time I get near the cluster
>> (Tuesday).  (I could do it today I guess, but I can't remotely power
>> nodes back on, or hard-power them off from home.)
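
(For reference, a couple of standard commands that show whether the
cluster actually marked the OSD down when the node went away; run
from any host with an admin keyring:)

    # Stream cluster events live; OSD down/out and PG state changes appear here
    ceph -w

    # Or check the current up/down state per OSD after pulling a node
    ceph osd tree
    ceph osd dump | grep down
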
>
> Okay, I did some further tests today.  In addition to the Windows 2008R2
> VM, I also started pummelling it with my own laptop (2.6GHz Core i5
> 3220M; 8GB RAM) which runs Gentoo Linux AMD64 and kernel 3.12.4.
>
> ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60) was
> installed from Gentoo's repository.
>
> I mapped a 20GB RBD using `rbd map`, formatted it XFS, then started
> pummeling that with my gigabit link (which passes through a couple of
> shared VLAN trunks), various disk stress testers and dd.
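
(For anyone reproducing the client side, that setup amounts to roughly
the following; the image name, device node, and mount point are
illustrative:)

    # Create and map a 20GB image, then format it XFS and mount it
    rbd create testimg --size 20480
    rbd map testimg
    mkfs.xfs /dev/rbd0
    mkdir -p /mnt/rbdtest
    mount /dev/rbd0 /mnt/rbdtest
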
>
> Whilst that was proceeding, I then wandered to the server rack and
> started fiddling.
>
> Before simulating outages, I was getting write speeds between
> 74MB/sec and 145MB/sec according to dbench.  dd was getting about
> 15.1MB/sec writing 1GB of random data.
>
> With a bash script running dd in a loop, and also running bonnie++ to
> really push things, I started playing with the nodes, rebooting some,
> powering off others.
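
(A minimal sketch of that kind of load: a dd loop writing 1GB of
pseudo-random data per pass, with bonnie++ running alongside in
another shell. Paths are illustrative.)

    #!/bin/bash
    # Keep writing 1GB files of pseudo-random data until interrupted
    while true; do
        dd if=/dev/urandom of=/mnt/rbdtest/ddfile bs=1M count=1024 oflag=direct
        sync
    done

    # In a second shell, something like:
    #   bonnie++ -d /mnt/rbdtest -u root
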
>
> It seems there's a limit to how often you can power things off, even if
> you wait for the cluster health to recover before proceeding.
> Eventually the client (kernel or userspace) gets fed up, as seen in the
> attached log.
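
(If the intent is to only pull the next node once the cluster has
fully recovered, one simple way to gate each power-off, as a sketch:)

    # Block until the cluster reports HEALTH_OK before the next power-off
    until ceph health | grep -q HEALTH_OK; do
        sleep 10
    done
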
>
> At present, `ceph -s` reports:
>> HEALTH_WARN clock skew detected on mon.2
>>     cluster b9b2ed48-e249-48ee-8e76-86493c2cc849
>>      health HEALTH_WARN clock skew detected on mon.2
>>      monmap e1: 3 mons at {0=10.87.160.224:6789/0,1=10.87.160.225:6789/0,2=10.87.160.226:6789/0}, election epoch 42, quorum 0,1,2 0,1,2
>>      osdmap e174: 6 osds: 6 up, 6 in
>>       pgmap v45386: 800 pgs, 4 pools, 398 GB data, 102026 objects
>>             1195 GB used, 15563 GB / 16758 GB avail
>>                  800 active+clean
>>
>
> and out of `ceph -w` I get:
>> 6758 GB avail; 1130 B/s wr, 0 op/s
>> 2014-01-28 14:49:20.812284 mon.0 [INF] pgmap v45379: 800 pgs: 800 active+clean; 398 GB data, 1195 GB used, 15563 GB / 16758 GB avail; 1126 B/s wr, 0 op/s
>> 2014-01-28 14:49:34.225852 mon.0 [INF] pgmap v45380: 800 pgs: 800 active+clean; 398 GB data, 1195 GB used, 15563 GB / 16758 GB avail; 71 B/s wr, 0 op/s
>> 2014-01-28 14:49:48.056665 mon.0 [INF] pgmap v45381: 800 pgs: 800 active+clean; 398 GB data, 1195 GB used, 15563 GB / 16758 GB avail
>> 2014-01-28 14:49:49.065547 mon.0 [INF] pgmap v45382: 800 pgs: 800 active+clean; 398 GB data, 1195 GB used, 15563 GB / 16758 GB avail
>> 2014-01-28 14:49:50.074878 mon.0 [INF] pgmap v45383: 800 pgs: 800 active+clean; 398 GB data, 1195 GB used, 15563 GB / 16758 GB avail; 16270 B/s wr, 0 op/s
>> 2014-01-28 14:49:51.083527 mon.0 [INF] pgmap v45384: 800 pgs: 800 active+clean; 398 GB data, 1195 GB used, 15563 GB / 16758 GB avail; 16742 B/s wr, 0 op/s
>> 2014-01-28 14:50:10.437994 mon.0 [WRN] mon.2 10.87.160.226:6789/0 clock skew 4.05188s > max 0.05s
>> 2014-01-28 14:50:19.813536 mon.0 [INF] pgmap v45385: 800 pgs: 800 active+clean; 398 GB data, 1195 GB used, 15563 GB / 16758 GB avail; 1140 B/s wr, 0 op/s
>> 2014-01-28 14:50:20.818168 mon.0 [INF] pgmap v45386: 800 pgs: 800 active+clean; 398 GB data, 1195 GB used, 15563 GB / 16758 GB avail; 1136 B/s wr, 0 op/s
>> 2014-01-28 14:50:49.816479 mon.0 [INF] pgmap v45387: 800 pgs: 800 active+clean; 398 GB data, 1195 GB used, 15563 GB / 16758 GB avail; 1130 B/s wr, 0 op/s
>> 2014-01-28 14:50:50.825369 mon.0 [INF] pgmap v45388: 800 pgs: 800 active+clean; 398 GB data, 1195 GB used, 15563 GB / 16758 GB avail; 1126 B/s wr, 0 op/s
>> 2014-01-28 14:51:19.819779 mon.0 [INF] pgmap v45389: 800 pgs: 800 active+clean; 398 GB data, 1195 GB used, 15563 GB / 16758 GB avail; 1130 B/s wr, 0 op/s
>
> I do note ntp doesn't seem to be doing its job, but that's a side issue.

Actually, that could be it. If you take down one of the monitors and
the other two have enough of a time gap that they won't talk to each
other, your cluster won't be able to make any progress. The OSDs don't
much care, but your monitor nodes need to have well-synced clocks.
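
A quick way to confirm this on your end:

    # On each monitor node: is ntpd actually syncing against its peers?
    ntpq -p

    # From any admin host: which mon is skewed and by how much
    ceph health detail

The warning threshold is the "mon clock drift allowed" option (0.05s
by default, which is the "max 0.05s" in your log); raising it only
hides the problem, so getting ntp working is the real fix.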

And in your trace everything got wedged for so long that the system
just gave up; that's probably a result of the cluster being unable to
accept writes for too long. (Like I said before, you should make sure
your CRUSH map and rules look right.)
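
And for the CRUSH check, the map and rules can be dumped to plain text
and read directly (file names here are arbitrary):

    # Fetch the compiled CRUSH map and decompile it for inspection
    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt

    # Quick sanity check of the hierarchy and where each OSD sits
    ceph osd tree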
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



