Monitor/OSD report tuning question

See inline:
Ceph version:
>>> [root at ceph2 ceph]# ceph -v
>>> ceph version 0.80.5 (38b73c67d375a2552d8ed67843c8a65c2c0feba6)

Initial testing was with 30 OSDs (10/storage server) with the following HW:
>>> 4TB SATA disks - 1 HDD per OSD - 30 HDDs/server - 6 SSDs/server forming an md raid0 virtual drive with 30 96GB partitions, 1 partition per OSD journal.

Storage Server HW:
>>> 2 x Xeon e5-2630 2.6GHz 24 cores total with 128GB/server

Monitor HW:
>>> Monitor: 2 x Xeon e5-2630 2.6GHz 24 cores total with 64GB - system disks are 4 x 480GB SAS ssd configured as virtual md raid0

It seems my cluster's main issue is osd_heartbeat_grace, since I constantly see OSD failures for reporting outside the 20 second grace. The setting was in place from boot time (I completely tore down the original cluster and rebuilt it with osd_heartbeat_grace increased to 35). As you can see below, an OSD is marked down, the cluster then goes into an osdmap/pgmap rebalancing cycle, and everything ends up UP/IN with PG states of 'active+clean' - for a few moments - and then the OSD flapping and map rebalancing start again.

All of the OSDs are configured with, and report, an osd_heartbeat_grace of 35. Any idea why OSDs are still being failed against a grace of ~20 seconds?
root at ceph0 ceph]# sh -x ./ceph0-daemon-config.sh beat_grace
+ '[' 1 '!=' 1 ']'
+ for i in '{0..29}'
+ ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show
+ grep beat_grace
  "mon_osd_adjust_heartbeat_grace": "true",
  "osd_heartbeat_grace": "35",
+ for i in '{0..29}'
+ grep beat_grace
+ ceph --admin-daemon /var/run/ceph/ceph-osd.1.asok config show
  "mon_osd_adjust_heartbeat_grace": "true",
  "osd_heartbeat_grace": "35",
+ for i in '{0..29}'
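
One thing I'm wondering about: if I understand it correctly, the grace shown in the failure messages below (~21-22 sec) is computed on the monitor side from the monitor's own osd_heartbeat_grace plus the laggy adjustment (since mon_osd_adjust_heartbeat_grace is true), so the monitor's effective value matters as much as the OSDs'. A quick check - assuming the monitor daemon is named ceph-mon01 and uses the default admin socket path (adjust if not):

[root at ceph-mon01 ceph]# ceph --admin-daemon /var/run/ceph/ceph-mon.ceph-mon01.asok config show | grep beat_grace

If that still reports 20, it could be raised at runtime (untested sketch) and made persistent by also setting 'osd heartbeat grace = 35' under [global] in ceph.conf on the monitor host:

[root at ceph-mon01 ceph]# ceph tell mon.ceph-mon01 injectargs '--osd_heartbeat_grace 35'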

2014-08-25 10:18:10.812179 mon.0 [INF] osd.26 209.243.160.83:6878/4819 failed (279 reports from 56 peers after 21.006896 >= grace 20.995963)
2014-08-25 10:18:10.812440 mon.0 [INF] osd.29 209.243.160.83:6887/7439 failed (254 reports from 51 peers after 21.007140 >= grace 20.995963)
2014-08-25 10:18:10.817675 mon.0 [INF] osd.18 209.243.160.83:6854/30165 failed (280 reports from 56 peers after 21.012978 >= grace 20.995962)
2014-08-25 10:18:10.817850 mon.0 [INF] osd.19 209.243.160.83:6857/31036 failed (245 reports from 49 peers after 21.013135 >= grace 20.995962)
2014-08-25 10:18:11.127275 mon.0 [INF] osdmap e25128: 91 osds: 82 up, 90 in
2014-08-25 10:18:11.157030 mon.0 [INF] pgmap v51553: 5760 pgs: 519 stale+active+clean, 5241 active+clean; 0 bytes data, 135 GB used, 327 TB / 327 TB avail
2014-08-25 10:18:11.924773 mon.0 [INF] osd.5 209.243.160.83:6815/19790 failed (270 reports from 54 peers after 22.120541 >= grace 21.991499)
2014-08-25 10:18:11.924858 mon.0 [INF] osd.7 209.243.160.83:6821/21303 failed (240 reports from 48 peers after 22.120345 >= grace 21.991499)
2014-08-25 10:18:11.924894 mon.0 [INF] osd.11 209.243.160.83:6833/24394 failed (260 reports from 52 peers after 22.120297 >= grace 21.991499)
2014-08-25 10:18:11.924943 mon.0 [INF] osd.16 209.243.160.83:6848/28431 failed (265 reports from 53 peers after 22.120080 >= grace 21.991499)
2014-08-25 10:18:11.924977 mon.0 [INF] osd.17 209.243.160.83:6851/29253 failed (250 reports from 50 peers after 22.120067 >= grace 21.991499)
2014-08-25 10:18:11.925012 mon.0 [INF] osd.23 209.243.160.83:6869/2073 failed (270 reports from 54 peers after 22.120020 >= grace 21.991499)
2014-08-25 10:18:11.925065 mon.0 [INF] osd.24 209.243.160.83:6872/3025 failed (260 reports from 52 peers after 22.120010 >= grace 21.991499)
2014-08-25 10:15:17.753867 osd.10 [WRN] map e25128 wrongly marked me down
2014-08-25 10:15:17.960953 osd.18 [WRN] map e25128 wrongly marked me down
2014-08-25 10:15:18.217959 osd.29 [WRN] map e25128 wrongly marked me down
2014-08-25 10:18:11.925143 mon.0 [INF] osd.28 209.243.160.83:6884/6572 failed (275 reports from 55 peers after 22.670894 >= grace 21.991288)
2014-08-25 10:18:12.204918 mon.0 [INF] pgmap v51554: 5760 pgs: 519 stale+active+clean, 5241 active+clean; 0 bytes data, 135 GB used, 327 TB / 327 TB avail

-----Original Message-----
From: Christian Balzer [mailto:chibi@xxxxxxx] 
Sent: Monday, August 25, 2014 1:15 AM
To: ceph-users at ceph.com
Cc: Bruce McFarland
Subject: Re: Monitor/OSD report tuning question


Hello,

On Sat, 23 Aug 2014 20:23:55 +0000 Bruce McFarland wrote:

Firstly, while the runtime changes you injected into the cluster should have done something (and I hope a Ceph developer comments on that), you're asking for tuning advice, which really isn't the issue here.

Your cluster should not need any tuning to become functional; what you're seeing suggests something massively wrong with it.

> Hello,
> I have a Cluster
Which version? I assume Firefly, since the single monitor suggests a test cluster, but if you're running a development version all bets are off.

>>> [root at ceph2 ceph]# ceph -v
>>> ceph version 0.80.5 (38b73c67d375a2552d8ed67843c8a65c2c0feba6)

> with 30 OSDs

What disks? How connected? SSD journals?

>>> 4TB SATA disks, 1 HDD per OSD - 30 HDDs/server - 6 SSDs forming an md raid0 virtual drive with 30 96GB partitions, 1 partition per OSD journal.

> distributed over 3 Storage Servers

What CPUs, how much memory? 
Where (HDD/SSD) is the OS and thus the ceph logs and state information stored?

>>> 2 x Xeon e5-2630 2.6GHz 24 cores total with 128GB/server

The errors you're seeing very much suggest either networking problems of some sort or a huge overload/slowdown of the OSDs: running out of CPU, being swapped out, or having trouble writing their logs in time.
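
A couple of quick, low-effort checks per storage node for the CPU/swap part (just a sketch, standard tools):

vmstat 1 5    # si/so columns show swap activity, wa/id show IO wait and idle CPU
free -m       # memory and swap actually in use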


> connected by a 10G cluster link and connected to the Monitor over 1G. 
> I

A single monitor is asking Murphy to come along and trash its leveldb, thus losing all data. Even with a test cluster that's something to consider.

Also, specs for that machine please.

>>> Monitor: 2 x Xeon e5-2630 2.6GHz 24 cores total with 64GB - system disks are 4 x 480GB SAS ssd configured as virtual md raid0

> still have a lot to understand with Ceph. Observing the cluster 
> messages in a "ceph -watch" window I see a lot of osd "flapping" when 
> it is sitting in a configured state and page/placement groups 
> constantly changing status. The cluster was configured and came up to 
> 1920 'active+clean' pages.
> 
Run atop on all 3 storage nodes, see how busy things are, see if there are errors on the network.
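
For the network side, something like the following on each node can reveal interface errors or drops (assuming the cluster NIC is eth0 here; substitute the actual device):

ip -s link show dev eth0                  # RX/TX error, dropped and overrun counters
ethtool -S eth0 | grep -iE 'err|drop'     # per-driver counters, where supported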

The examples you're giving below are all for osd.2x, suggesting they are all on the same node.
If that's so, focus on that node and its network connectivity.

> The 3 status outputs below were issued over the course of about 2 to 
> minutes. As you can see there is a lot of activity where I'm assuming 
> the osd reporting is occasionally outside the heartbeat TO and various 
> pages/placement groups get set to 'stale' and/or 'degraded' but still 
> 'active'. There are osd's being  marked 'out' in the osd map that I 
> see in the watch window as reported of failures that very quickly 
> report "wrongly marked me down". I'm assuming I need to 'tune' some of 
> the many TO values so that the osd's and page/placement groups all can 
> report within the TO window.
> 

Again, 20 seconds really are an eternity, especially on a cluster that, according to your outputs, is totally empty and should be idle.

Regards,

Christian

> 
> A quick look at the -admin-daemon config show cmd tells me that I 
> might consider tuning some of these values:
> 
> [root at ceph0 ceph]# ceph --admin-daemon /var/run/ceph/ceph-osd.20.asok config show | grep report
>   "mon_osd_report_timeout": "900",
>   "mon_osd_min_down_reporters": "1",
>   "mon_osd_min_down_reports": "3",
>   "osd_mon_report_interval_max": "120",
>   "osd_mon_report_interval_min": "5",
>   "osd_pg_stat_report_interval_max": "500",
> [root at ceph0 ceph]#
> 
> Which osd and/or mon settings should I increase/decrease to eliminate 
> all this state flapping while the cluster sits configured with no data?
> Thanks, Bruce
> 
> 2014-08-23 13:16:15.564932 mon.0 [INF] osd.20 209.243.160.83:6800/20604 failed (65 reports from 20 peers after 23.380808 >= grace 21.991016)
> 2014-08-23 13:16:15.565784 mon.0 [INF] osd.23 209.243.160.83:6810/29727 failed (79 reports from 20 peers after 23.675170 >= grace 21.990903)
> 2014-08-23 13:16:15.566038 mon.0 [INF] osd.25 209.243.160.83:6808/31984 failed (65 reports from 20 peers after 23.380921 >= grace 21.991016)
> 2014-08-23 13:16:15.566206 mon.0 [INF] osd.26 209.243.160.83:6811/518 failed (65 reports from 20 peers after 23.381043 >= grace 21.991016)
> 2014-08-23 13:16:15.566372 mon.0 [INF] osd.27 209.243.160.83:6822/2511 failed (65 reports from 20 peers after 23.381195 >= grace 21.991016)
> . . .
> 2014-08-23 13:17:09.547684 osd.20 [WRN] map e27128 wrongly marked me down
> 2014-08-23 13:17:10.826541 osd.23 [WRN] map e27130 wrongly marked me down
> 2014-08-23 13:20:09.615826 mon.0 [INF] osdmap e27134: 30 osds: 26 up, 30 in
> 2014-08-23 13:17:10.954121 osd.26 [WRN] map e27130 wrongly marked me down
> 2014-08-23 13:17:19.125177 osd.25 [WRN] map e27135 wrongly marked me down
> 
> [root at ceph-mon01 ceph]# ceph -s
>     cluster f919f2e4-8e3c-45d1-a2a8-29bc604f9f7d
>      health HEALTH_OK
>      monmap e1: 1 mons at {ceph-mon01=209.243.160.84:6789/0}, election epoch 2, quorum 0 ceph-mon01
>      osdmap e26636: 30 osds: 30 up, 30 in
>       pgmap v56534: 1920 pgs, 3 pools, 0 bytes data, 0 objects
>             26586 MB used, 109 TB / 109 TB avail
>                 1920 active+clean
> [root at ceph-mon01 ceph]# ceph -s
>     cluster f919f2e4-8e3c-45d1-a2a8-29bc604f9f7d
>      health HEALTH_WARN 160 pgs degraded; 83 pgs stale
>      monmap e1: 1 mons at {ceph-mon01=209.243.160.84:6789/0}, election epoch 2, quorum 0 ceph-mon01
>      osdmap e26641: 30 osds: 30 up, 30 in
>       pgmap v56545: 1920 pgs, 3 pools, 0 bytes data, 0 objects
>             26558 MB used, 109 TB / 109 TB avail
>                   83 stale+active+clean
>                  160 active+degraded
>                 1677 active+clean
> [root at ceph-mon01 ceph]# ceph -s
>     cluster f919f2e4-8e3c-45d1-a2a8-29bc604f9f7d
>      health HEALTH_OK
>      monmap e1: 1 mons at {ceph-mon01=209.243.160.84:6789/0}, election epoch 2, quorum 0 ceph-mon01
>      osdmap e26657: 30 osds: 30 up, 30 in
>       pgmap v56584: 1920 pgs, 3 pools, 0 bytes data, 0 objects
>             26610 MB used, 109 TB / 109 TB avail
>                 1920 active+clean
> [root at ceph-mon01 ceph]#
> 


-- 
Christian Balzer        Network/Systems Engineer                
chibi at gol.com   	Global OnLine Japan/Fusion Communications
http://www.gol.com/

