osd heartbeat interval

Cláudio Martins <ctpm@xxxxxxxxxx> · Tue, 27 Mar 2012 20:27:44 +0100

 Hi,

 While testing a cluster with 47 OSDs, we noticed considerable network
traffic (around 2 Mbit/s), most of it apparently from OSD heartbeats
alone (measured while no clients were generating I/O). OSD CPU
consumption was also clearly measurable, constantly around 1-2% on a
3.2 GHz Xeon CPU.

 So we experimented by including

 osd heartbeat interval = 10

 in ceph.conf on all nodes and, as suspected, network traffic dropped
and the CPU usage of an idle OSD is no longer measurable in top.
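
 For reference, a minimal sketch of how that setting might sit in
ceph.conf. Placing it under the [osd] section is an assumption here
(the text above only says it was added to ceph.conf on all nodes), and
the comment line is illustrative:

 [osd]
     ; seconds between heartbeat pings to peer OSDs (value used in our test)
     osd heartbeat interval = 10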

 Since there is a considerable number of OSDs in this cluster, we think
that even with a 10-second heartbeat interval, a down OSD is likely to
be detected reasonably quickly by at least some of the other OSDs. In
fact, we saw in the mon log that, when we stopped an OSD, it was flagged
as "failed" by other OSDs within just a few seconds.

 So, we would like to know the opinion of the list about increasing the
heartbeat interval on large clusters (and perhaps suggesting that in
the official documentation), in particular whether you think there
might be negative consequences that we haven't foreseen.

 Thanks in advance

Best regards

Cláudio

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html