On Wed, Jul 18, 2018 at 3:20 AM Anthony D'Atri <aad@xxxxxxxxxxxxxx> wrote:
>
> The documentation here:
>
> http://docs.ceph.com/docs/master/rados/configuration/mon-osd-interaction/
>
> says
>
> "Each Ceph OSD Daemon checks the heartbeat of other Ceph OSD Daemons
> every 6 seconds"
>
> and
>
> "If a neighboring Ceph OSD Daemon doesn't show a heartbeat within a 20
> second grace period, the Ceph OSD Daemon may consider the neighboring
> Ceph OSD Daemon down and report it back to a Ceph Monitor,"
>
> I've always thought that each OSD heartbeats with *every* other OSD,
> which of course means that total heartbeat traffic grows roughly
> quadratically. However, in extended testing we've observed that the
> number of other OSDs that a given OSD heartbeats (heartbeated?) with
> was < N, which has us wondering whether only OSDs with which a given
> OSD shares PGs are contacted -- or some other subset.

OSDs heartbeat with their peers: the set of OSDs with which they share at
least one PG. You can see the heartbeat peers (HB_PEERS) in "ceph pg dump"
-- look after the header "OSD_STAT USED AVAIL TOTAL HB_PEERS..." (a rough
way to pull the same counts out of the JSON output is sketched at the end
of this mail).

This is one of the nice features of the placement group concept: the
heartbeat and peering load on each OSD scales with the number of PGs per
OSD, which stays roughly constant as a cluster grows, rather than with the
total number of OSDs in the cluster.

Cheers, Dan

> I plan to submit a doc fix for mon_osd_min_down_reporters and wanted to
> resolve this FUD first.
>
> -- aad
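
P.S. In case it helps with your testing: here is a rough, unofficial
sketch of counting each OSD's heartbeat peers from the JSON form of
"ceph pg dump". The field names ("osd_stats", "hb_peers") and the optional
"pg_map" wrapper are what I'd expect to see, but treat them as assumptions
and check them against the output of your own release.

#!/usr/bin/env python
# Rough sketch, not an official tool: count heartbeat peers per OSD from
# the JSON output of "ceph pg dump". The field names used below
# ("osd_stats", "hb_peers") and the optional "pg_map" wrapper are
# assumptions -- verify them against your cluster's output.
import json
import subprocess

raw = subprocess.check_output(["ceph", "pg", "dump", "--format=json"])
dump = json.loads(raw)

# Some releases nest the dump under a top-level "pg_map" object.
osd_stats = dump.get("pg_map", dump).get("osd_stats", [])

for stat in osd_stats:
    peers = stat.get("hb_peers", [])
    print("osd.%d has %d heartbeat peers" % (stat["osd"], len(peers)))

If the per-OSD peer counts stay roughly bounded by (PGs per OSD) x
(replica size) no matter how many OSDs you add, that's the constant
per-OSD behaviour described above, rather than the ~N^2 total heartbeat
traffic you'd expect if every OSD pinged every other OSD.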