Re: Unstable clock

Sage Weil <sage@xxxxxxxxxxxx> · Tue, 17 Oct 2017 13:27:39 +0000 (UTC)

On Tue, 17 Oct 2017, Mohamad Gebai wrote:
> Hi,
> 
> I am looking at the following issue: http://tracker.ceph.com/issues/21375
> 
> In summary, during a 'rados bench', impossible latency values (e.g.
> 9.00648e+07) are suddenly reported. I looked briefly at the code, and it
> seems CLOCK_REALTIME is used, which means that wall-clock changes would
> affect this output. This is a VM cluster, so the hypothesis was that the
> system's clock was falling behind for some reason and then being
> readjusted (that's the only way I could reproduce the issue), which I
> think is quite possible in a virtual environment.
> 
> A concern was raised: are there more critical parts of Ceph where a
> clock jumping around might interfere with the behavior of the cluster?

Yes, definitely.

> It would be good to know if there are any, and maybe prepare for them?

Adam added a new set of clock primitives that include a monotonic clock
option, which should be used in all cases where we're measuring the passage
of time rather than reading the wall-clock time.  There is a longstanding
Trello card to go through and change the latency calculations to use the
monotonic clock.  There are probably dozens of places where an ill-timed
clock jump is liable to trigger some random assert.  It's just a matter of
going through and auditing calls to the legacy ceph_clock_now() function.
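
As a rough illustration of the difference (a minimal sketch, not the actual
Ceph primitives): the snippet below uses std::chrono::steady_clock as a
stand-in for a monotonic clock. LatencyTimer and wall_clock_latency are
hypothetical names made up for this example. The first measures elapsed
time in a way that a wall clock step cannot disturb; the second shows why a
latency computed from two wall-clock readings is off by exactly the size of
any adjustment that lands between them.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper, not part of Ceph: times an operation with a
// monotonic clock so that wall-clock adjustments cannot affect the result.
class LatencyTimer {
public:
  using clock = std::chrono::steady_clock;  // monotonic by definition
  static_assert(clock::is_steady, "steady_clock must be monotonic");

  LatencyTimer() : start_(clock::now()) {}

  // Seconds elapsed since construction.
  double elapsed_seconds() const {
    const double d =
        std::chrono::duration<double>(clock::now() - start_).count();
    assert(d >= 0.0);  // holds because the clock never goes backwards
    return d;
  }

private:
  clock::time_point start_;
};

// Hypothetical illustration of the failure mode: both arguments are
// wall-clock readings in seconds. If the clock is stepped between the two
// readings, the step is silently folded into the reported "latency".
inline double wall_clock_latency(double start_wall, double end_wall) {
  return end_wall - start_wall;
}
```

With this sketch, an operation that really took half a second but straddled
a large forward step of the wall clock would be reported by
wall_clock_latency() as roughly the size of the step, while a LatencyTimer
started before the operation would still report about half a second.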

sage