On Wed, Jul 18, 2018 at 3:20 AM Anthony D'Atri <aad@xxxxxxxxxxxxxx> wrote:
>
> The documentation here:
>
> http://docs.ceph.com/docs/master/rados/configuration/mon-osd-interaction/
>
> says
>
> "Each Ceph OSD Daemon checks the heartbeat of other Ceph OSD Daemons
> every 6 seconds"
>
> and
>
> "If a neighboring Ceph OSD Daemon doesn't show a heartbeat within a 20
> second grace period, the Ceph OSD Daemon may consider the neighboring
> Ceph OSD Daemon down and report it back to a Ceph Monitor,"
>
> I've always thought that each OSD heartbeats with *every* other OSD,
> which of course means that total heartbeat traffic grows roughly
> quadratically. However, in extended testing we've observed that the
> number of other OSDs that a given OSD heartbeats (heartbeated?) with
> was < N, which has us wondering whether only OSDs with which a given
> OSD shares PGs are contacted -- or some other subset.

OSDs heartbeat with their peers: the set of OSDs with which they share at
least one PG. You can see the heartbeat peers (HB_PEERS) in "ceph pg dump"
-- look after the header "OSD_STAT USED AVAIL TOTAL HB_PEERS..." (a rough
way to pull the same counts out of the JSON output is sketched at the end
of this mail).

This is one of the nice features of the placement group concept: the
heartbeat and peering load on each OSD scales with the number of PGs per
OSD, which stays roughly constant as a cluster grows, rather than with the
total number of OSDs in the cluster.

Cheers, Dan

> I plan to submit a doc fix for mon_osd_min_down_reporters and wanted to
> resolve this FUD first.
>
> -- aad
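
P.S. In case it helps with your testing: here is a rough, unofficial
sketch of counting each OSD's heartbeat peers from the JSON form of
"ceph pg dump". The field names ("osd_stats", "hb_peers") and the optional
"pg_map" wrapper are what I'd expect to see, but treat them as assumptions
and check them against the output of your own release.

#!/usr/bin/env python
# Rough sketch, not an official tool: count heartbeat peers per OSD from
# the JSON output of "ceph pg dump". The field names used below
# ("osd_stats", "hb_peers") and the optional "pg_map" wrapper are
# assumptions -- verify them against your cluster's output.
import json
import subprocess

raw = subprocess.check_output(["ceph", "pg", "dump", "--format=json"])
dump = json.loads(raw)

# Some releases nest the dump under a top-level "pg_map" object.
osd_stats = dump.get("pg_map", dump).get("osd_stats", [])

for stat in osd_stats:
    peers = stat.get("hb_peers", [])
    print("osd.%d has %d heartbeat peers" % (stat["osd"], len(peers)))

If the per-OSD peer counts stay roughly bounded by (PGs per OSD) x
(replica size) no matter how many OSDs you add, that's the constant
per-OSD behaviour described above, rather than the ~N^2 total heartbeat
traffic you'd expect if every OSD pinged every other OSD.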