On Tue, 17 Oct 2017, Mohamad Gebai wrote:
> Hi,
>
> I am looking at the following issue: http://tracker.ceph.com/issues/21375
>
> In summary, during a 'rados bench', impossible latency values (e.g.
> 9.00648e+07) are suddenly reported. I looked briefly at the code, it
> seems CLOCK_REALTIME is used, which means that wall clock changes would
> affect this output. This is a VM cluster, so the hypothesis was that the
> system's clock was falling behind for some reason, then getting
> readjusted (that's the only way I could reproduce the issue), which I
> think is quite possible in a virtual environment.
>
> A concern was raised: are there more critical parts of Ceph where a
> clock jumping around might interfere with the behavior of the cluster?

Yes, definitely.

> It would be good to know if there are any, and maybe prepare for them?

Adam added a new set of clock primitives that include a monotonic clock
option that should be used in all cases where we're measuring the passage
of time instead of the wall clock time. There is a longstanding trello
card to go through and change the latency calculations to use the
monotonic clock.

There are probably dozens of places where an ill-timed clock jump is
liable to trigger some random assert. It's just a matter of going through
and auditing calls to the legacy ceph_clock_now() method.

sage
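
To illustrate the difference between measuring elapsed time and reading
the wall clock, here is a minimal sketch in standard C++. It does not use
Ceph's clock primitives: std::chrono::steady_clock stands in for the
monotonic option, and elapsed_seconds() and latency_with_clock_step() are
hypothetical helper names made up for this example. The second helper only
simulates a wall-clock adjustment with plain arithmetic; it does not touch
the system clock.

```cpp
#include <chrono>
#include <thread>

// steady_clock is guaranteed monotonic: it never jumps when the
// system's wall clock is adjusted.
static_assert(std::chrono::steady_clock::is_steady,
              "steady_clock must be monotonic");

// Hypothetical helper: time a callable with the monotonic clock and
// return the elapsed time in seconds.
template <typename Op>
double elapsed_seconds(Op&& op) {
  const auto start = std::chrono::steady_clock::now();
  op();
  const auto end = std::chrono::steady_clock::now();
  return std::chrono::duration<double>(end - start).count();
}

// Hypothetical helper: what a latency computed from two wall-clock
// timestamps looks like if the clock is stepped by `step_s` seconds
// while an operation of true duration `true_duration_s` is in flight.
// The start timestamp is an arbitrary illustrative value.
double latency_with_clock_step(double true_duration_s, double step_s) {
  const double start_wall_s = 1000.0;
  const double end_wall_s = start_wall_s + true_duration_s + step_s;
  return end_wall_s - start_wall_s;
}
```

With a true duration of 0.5 s, a forward step of 9.0e7 s yields a reported
latency of roughly 9.0e7 s, the same kind of impossible value as in the
tracker issue, and a backward step yields a negative latency. A measurement
taken through elapsed_seconds() is unaffected by either kind of step.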