Monitor/OSD report tuning question

Hello,

On Sat, 23 Aug 2014 20:23:55 +0000 Bruce McFarland wrote:

Firstly, while the runtime changes you injected into the cluster
should have done something (and I hope some Ceph developer comments
on that), you're asking for tuning advice, which really isn't the issue here.

Your cluster should not need any tuning to become functional; what you're
seeing points to something massively wrong with it.
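
For reference, and purely as a sketch since I don't know which values you
actually changed, runtime changes are normally injected and verified along
these lines (the 35 second grace is just a made-up example value):

  # push a new heartbeat grace to all OSDs at runtime (example value only)
  ceph tell osd.* injectargs '--osd-heartbeat-grace 35'
  # check on one OSD via its admin socket that the value actually took
  ceph --admin-daemon /var/run/ceph/ceph-osd.20.asok config show | grep osd_heartbeat_grace

But even the default grace should be more than enough for an idle cluster.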

> Hello,
> I have a Cluster 
Which version? I assume Firefly, since the single monitor suggests a
test cluster, but if you're running a development version all bets are off.
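
If in doubt, check both the installed binaries and the running daemons
(the socket path below just mirrors the osd.20 one from your own example):

  ceph -v                                                      # version of the ceph tools on each node
  ceph --admin-daemon /var/run/ceph/ceph-osd.20.asok version   # what the running OSD reports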

> with 30 OSDs 

What disks? How connected? SSD journals?
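
You can answer that yourself on each storage node with something like the
following (device names will of course differ on your hardware):

  lsblk -o NAME,SIZE,ROTA,TYPE,MOUNTPOINT    # ROTA 0 = SSD, 1 = spinning disk
  ls -l /var/lib/ceph/osd/ceph-*/journal     # where each OSD journal actually points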

> distributed over 3 Storage Servers

What CPUs, how much memory? 
Where (HDD/SSD) is the OS and thus the ceph logs and state information
stored?

The errors you're seeing very much suggest either networking problems of
some sort or a huge overload/slowdown of the OSDs: running out of CPU, being
swapped out, or having trouble writing their logs in time.

> connected by a 10G cluster link and connected to the Monitor over 1G. I

A single monitor is asking Murphy to go and trash its leveldb and thus
lose all data. Even with a test cluster that is something to consider.

Also, specs for that machine please.
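
If the cluster was rolled out with ceph-deploy, growing the monitor quorum
to 3 is quick; the hostnames here are of course placeholders for whatever
machines you'd put the additional monitors on:

  ceph-deploy mon add ceph-mon02
  ceph-deploy mon add ceph-mon03
  ceph quorum_status --format json-pretty    # confirm all 3 monitors joined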

> still have a lot to understand with Ceph. Observing the cluster messages
> in a "ceph -w" watch window I see a lot of OSD "flapping" while it is
> sitting in a configured state, and placement groups constantly
> changing status. The cluster was configured and came up with 1920
> 'active+clean' placement groups.
> 
Run atop on all 3 storage nodes, see how busy things are, see if there are
errors on the network.
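
Something along these lines on each node, with the interface name replaced
by whatever your 10G cluster link is actually called:

  atop 2                                   # live view of CPU, memory, disk and network load
  ip -s link show eth2                     # RX/TX error and drop counters
  ethtool -S eth2 | grep -iE 'err|drop'    # NIC-level error statistics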

The examples you're giving below are all for osd.2x, suggesting they are
all on the same node.
If that's so, focus on that node and its network connectivity.
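
You can confirm at a glance which OSDs live on which host:

  ceph osd tree    # shows the host -> OSD layout plus up/down state and weight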

> The 3 status outputs below were issued over the course of about two
> minutes. As you can see there is a lot of activity where I'm assuming
> the OSD reporting is occasionally outside the heartbeat timeout and
> various placement groups get set to 'stale' and/or 'degraded' but still
> 'active'. There are OSDs being marked 'out' in the osd map that I see
> in the watch window as reported failures which very quickly report
> "wrongly marked me down". I'm assuming I need to 'tune' some of the many
> timeout values so that the OSDs and placement groups can all report
> within the timeout window.
> 

Again, 20 seconds is an eternity really, especially on a cluster that,
according to your outputs, is totally empty and should be idle.
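
If you want to see the actual heartbeat parameters in play, and how the
OSDs themselves are doing latency-wise, these are harmless to run (again
reusing the osd.20 socket path from your example):

  ceph --admin-daemon /var/run/ceph/ceph-osd.20.asok config show | grep heartbeat
  ceph osd perf    # per-OSD commit/apply latency as seen by the cluster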

Regards,

Christian

> 
> A quick look at the --admin-daemon config show cmd tells me that I might
> consider tuning some of these values:
> 
> [root@ceph0 ceph]# ceph --admin-daemon /var/run/ceph/ceph-osd.20.asok
> config show | grep report
>   "mon_osd_report_timeout": "900",
>   "mon_osd_min_down_reporters": "1",
>   "mon_osd_min_down_reports": "3",
>   "osd_mon_report_interval_max": "120",
>   "osd_mon_report_interval_min": "5",
>   "osd_pg_stat_report_interval_max": "500",
> [root@ceph0 ceph]#
> 
> Which osd and/or mon settings should I increase/decrease to eliminate
> all this state flapping while the cluster sits configured with no data?
> Thanks, Bruce
> 
> 2014-08-23 13:16:15.564932 mon.0 [INF] osd.20 209.243.160.83:6800/20604 failed (65 reports from 20 peers after 23.380808 >= grace 21.991016)
> 2014-08-23 13:16:15.565784 mon.0 [INF] osd.23 209.243.160.83:6810/29727 failed (79 reports from 20 peers after 23.675170 >= grace 21.990903)
> 2014-08-23 13:16:15.566038 mon.0 [INF] osd.25 209.243.160.83:6808/31984 failed (65 reports from 20 peers after 23.380921 >= grace 21.991016)
> 2014-08-23 13:16:15.566206 mon.0 [INF] osd.26 209.243.160.83:6811/518 failed (65 reports from 20 peers after 23.381043 >= grace 21.991016)
> 2014-08-23 13:16:15.566372 mon.0 [INF] osd.27 209.243.160.83:6822/2511 failed (65 reports from 20 peers after 23.381195 >= grace 21.991016)
> . . .
> 2014-08-23 13:17:09.547684 osd.20 [WRN] map e27128 wrongly marked me down
> 2014-08-23 13:17:10.826541 osd.23 [WRN] map e27130 wrongly marked me down
> 2014-08-23 13:20:09.615826 mon.0 [INF] osdmap e27134: 30 osds: 26 up, 30 in
> 2014-08-23 13:17:10.954121 osd.26 [WRN] map e27130 wrongly marked me down
> 2014-08-23 13:17:19.125177 osd.25 [WRN] map e27135 wrongly marked me down
> 
> [root@ceph-mon01 ceph]# ceph -s
>     cluster f919f2e4-8e3c-45d1-a2a8-29bc604f9f7d
>      health HEALTH_OK
>      monmap e1: 1 mons at {ceph-mon01=209.243.160.84:6789/0}, election epoch 2, quorum 0 ceph-mon01
>      osdmap e26636: 30 osds: 30 up, 30 in
>       pgmap v56534: 1920 pgs, 3 pools, 0 bytes data, 0 objects
>             26586 MB used, 109 TB / 109 TB avail
>                 1920 active+clean
> [root@ceph-mon01 ceph]# ceph -s
>     cluster f919f2e4-8e3c-45d1-a2a8-29bc604f9f7d
>      health HEALTH_WARN 160 pgs degraded; 83 pgs stale
>      monmap e1: 1 mons at {ceph-mon01=209.243.160.84:6789/0}, election epoch 2, quorum 0 ceph-mon01
>      osdmap e26641: 30 osds: 30 up, 30 in
>       pgmap v56545: 1920 pgs, 3 pools, 0 bytes data, 0 objects
>             26558 MB used, 109 TB / 109 TB avail
>                   83 stale+active+clean
>                  160 active+degraded
>                 1677 active+clean
> [root@ceph-mon01 ceph]# ceph -s
>     cluster f919f2e4-8e3c-45d1-a2a8-29bc604f9f7d
>      health HEALTH_OK
>      monmap e1: 1 mons at {ceph-mon01=209.243.160.84:6789/0}, election epoch 2, quorum 0 ceph-mon01
>      osdmap e26657: 30 osds: 30 up, 30 in
>       pgmap v56584: 1920 pgs, 3 pools, 0 bytes data, 0 objects
>             26610 MB used, 109 TB / 109 TB avail
>                 1920 active+clean
> [root@ceph-mon01 ceph]#
> 


-- 
Christian Balzer        Network/Systems Engineer                
chibi at gol.com   	Global OnLine Japan/Fusion Communications
http://www.gol.com/

