Re: osd laggy algorithm

Gregory Farnum <greg@xxxxxxxxxxx> · Mon, 16 Mar 2015 10:28:43 -0700



On Wed, Mar 11, 2015 at 8:40 AM, Artem Savinov <asavinov@xxxxxxxx> wrote:
> Hello.
> By default, Ceph marks an OSD node down after receiving 3 reports that it
> has failed. Reports are sent every "osd heartbeat grace" seconds, but with
> the settings "mon_osd_adjust_heartbeat_grace = true,
> mon_osd_adjust_down_out_interval = true" the timeout for marking nodes down
> may vary. Could you tell me what algorithm changes the timeout for marking
> nodes down/out, and which parameters affect it?
> Thanks.

The monitors keep track of which detected failures are incorrect
(based on reports from the marked-down/out OSDs) and build up an
expectation about how often the failures are correct based on an
exponential backoff of the data points. You can look at the code in
OSDMonitor.cc if you're interested, but basically they apply that
expectation to modify the down interval and the down-out interval to a
value large enough that they believe the OSD is really down (assuming
these config options are set). It's not terribly interesting. :)
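
To make the idea concrete, here is a rough sketch in Python of how such an
adjustment could work. It is not copied from OSDMonitor.cc: the function
names, the blending weight, the half-life, and the exact formula are all
assumptions chosen for illustration. It keeps a running estimate of how often
an OSD was wrongly marked down and how long it was typically laggy, then
stretches the base grace period by that estimate.

```python
import math


def update_laggy_estimate(prob, interval_s, was_wrong, observed_lag_s, weight=0.3):
    """Blend one new data point into the running estimates.

    prob        -- estimated probability that a failure report is wrong
    interval_s  -- estimated time (seconds) the OSD stays laggy
    was_wrong   -- True if the OSD reported it was wrongly marked down
    weight      -- hypothetical blending weight; older points fade geometrically
    """
    sample = 1.0 if was_wrong else 0.0
    prob = prob * (1.0 - weight) + sample * weight
    if was_wrong:
        # Only wrongly-marked-down events tell us how long the lag lasted.
        interval_s = interval_s * (1.0 - weight) + observed_lag_s * weight
    return prob, interval_s


def adjusted_grace(base_grace_s, prob, interval_s, failed_for_s, halflife_s=3600.0):
    """Return the base grace period plus an allowance for expected lagginess.

    The allowance is halved for every halflife_s seconds the OSD has already
    been reported as failed (halflife_s is an assumed, illustrative value).
    """
    decay = math.exp(failed_for_s * math.log(0.5) / halflife_s)
    return base_grace_s + decay * prob * interval_s
```

With these assumed values, an OSD that has never been wrongly marked down
keeps the base grace period, while one with a history of false failure
reports gets a longer timeout before it is marked down.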
-Greg