In CephFS testing, we've observed transient failures caused by what appear to be messages being dropped [1,2]. These appear to have been caused by the recent refactor PR [3,4], but I have no evidence other than the problems appearing during testing with [4] after [4] was merged. I'm running tests [5] to see if I can get more debugging output (debug ms = 20), but I wanted to canvass for ideas/advice before I get much deeper.

Has anyone else seen transient failures with messages getting dropped?

[1] http://tracker.ceph.com/issues/36389
[2] http://tracker.ceph.com/issues/36349
[3] https://github.com/ceph/ceph/pull/23415
[4] https://github.com/ceph/ceph/pull/24305
[5] http://pulpito.ceph.com/?branch=wip-pdonnell-testing-20181011.152759

--
Patrick Donnelly