On 16/10/2018 02:58, Patrick Donnelly wrote:
> In CephFS testing, we've observed transient failures caused by what
> appears to be messages being dropped [1,2]. These appear to have been
> caused by the recent refactor PR [3,4] but I have no evidence other
> than the problems appearing during testing with [4] after [4] was
> merged.
>
> I'm running tests [5] to see if I can get more debugging (debug ms =
> 20) but I wanted to canvass for ideas/advice before I get much deeper.
> Has anyone else seen transient failures with messages getting dropped?

If you successfully reproduce these issues with "debug ms = 20", I'm
fairly sure that we will be able to find the root cause.

In the meantime, I'll take a look at the message dispatch code to see
if I can find anything strange.

>
> [1] http://tracker.ceph.com/issues/36389
> [2] http://tracker.ceph.com/issues/36349
> [3] https://github.com/ceph/ceph/pull/23415
> [4] https://github.com/ceph/ceph/pull/24305
> [5] http://pulpito.ceph.com/?branch=wip-pdonnell-testing-20181011.152759
>

-- 
Ricardo Dias
Senior Software Engineer - Storage Team
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton,
HRB 21284 (AG Nürnberg)