On Tue, Oct 16, 2018 at 9:12 AM Patrick Donnelly <pdonnell@xxxxxxxxxx> wrote:
>
> On Tue, Oct 16, 2018 at 9:05 AM Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> >
> > On Mon, Oct 15, 2018 at 6:58 PM Patrick Donnelly <pdonnell@xxxxxxxxxx> wrote:
> > >
> > > In CephFS testing, we've observed transient failures caused by what
> > > appears to be messages being dropped [1,2]. These appear to have been
> > > caused by the recent refactor PR [3,4] but I have no evidence other
> > > than the problems appearing during testing with [4] after [4] was
> > > merged.
> > >
> > > I'm running tests [5] to see if I can get more debugging (debug ms =
> > > 20) but I wanted to canvass for ideas/advice before I get much deeper.
> > > Has anyone else seen transient failures with messages getting dropped?
> >
> > I will note that these tickets are both from after patch 1 but before
> > patch 2.
>
> No, the tickets were both from testing with the second patch ([4] in my OP).
>
> I'll report back if I can reproduce this with higher debugging.
Here is one example:

2018-10-16 19:57:47.356 1b1c9700 5 mds.beacon.e Sending beacon up:active seq 2214
2018-10-16 19:57:47.356 1b1c9700 1 -- 172.21.15.179:6816/2028340967 --> 172.21.15.179:6789/0 -- mdsbeacon(4307/e up:active seq 2214 v283) v7 -- 0x209f2f40 con 0
2018-10-16 19:57:47.356 1b1c9700 20 -- 172.21.15.179:6816/2028340967 >> 172.21.15.179:6789/0 conn(0x1f544e20 legacy :-1 s=OPENED pgs=143 cs=1 l=1).prepare_send_message m mdsbeacon(4307/e up:active seq 2214 v283) v7
2018-10-16 19:57:47.357 1b1c9700 20 -- 172.21.15.179:6816/2028340967 >> 172.21.15.179:6789/0 conn(0x1f544e20 legacy :-1 s=OPENED pgs=143 cs=1 l=1).prepare_send_message encoding features 4611087854031142911 0x209f2f40 mdsbeacon(4307/e up:active seq 2214 v283) v7
2018-10-16 19:57:47.357 1b1c9700 15 -- 172.21.15.179:6816/2028340967 >> 172.21.15.179:6789/0 conn(0x1f544e20 legacy :-1 s=OPENED pgs=143 cs=1 l=1).send_message inline write is denied, reschedule m=0x209f2f40

From: /ceph/teuthology-archive/pdonnell-2018-10-16_16:46:31-multimds-wip-pdonnell-testing-20181011.152759-distro-basic-smithi/3148285/remote/smithi179/log/ceph-mds.e.log.gz

It looks like the messenger just never sent the message. FWIW, the mds
and mon in this particular case are on the same host.

I looked around at other beacon sends (grep "beacon.e") and the actual
send by the msgr happens promptly afterwards. For some reason, that's
not the case for seq=2214 ... but I'm not that familiar with msgr
debugging. Help would be appreciated!

--
Patrick Donnelly
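
For anyone who wants to look for the same pattern in other logs, the manual grep above could be automated roughly as follows. This is only a sketch under a stated assumption: a message counts as "stalled" when its pointer appears in a "reschedule m=0x..." line and no later line in the log mentions that pointer again. That rule is a heuristic based on the excerpt above, not on knowledge of what the messenger logs on a successful send. The names find_stalled_messages and scan_log are made up for illustration.

```python
import gzip
import re

# Matches the "reschedule m=0x..." suffix seen in the log excerpt above.
RESCHEDULE_RE = re.compile(r"reschedule m=(0x[0-9a-f]+)")
# Any hex pointer in a line; used to spot later mentions of a pending message.
POINTER_RE = re.compile(r"0x[0-9a-f]+")


def find_stalled_messages(lines):
    """Map message pointer -> 0-based index of its last 'reschedule' line,
    keeping only pointers that no later line mentions again.

    Assumption (heuristic): any later mention of the pointer means the
    messenger did something further with that message.
    """
    pending = {}
    for idx, line in enumerate(lines):
        resched = RESCHEDULE_RE.search(line)
        # A line that mentions a pending pointer clears it from the list.
        for ptr in POINTER_RE.findall(line):
            pending.pop(ptr, None)
        # A reschedule line (re)registers its pointer as pending.
        if resched:
            pending[resched.group(1)] = idx
    return pending


def scan_log(path):
    """Run the heuristic over a gzip-compressed log file."""
    with gzip.open(path, "rt", errors="replace") as fh:
        return find_stalled_messages(fh)
```

Calling scan_log() on a local copy of ceph-mds.e.log.gz would return a dict of suspect pointers and the line index where each was last rescheduled. Pointers can be reused once a message is freed, so a reused address may hide a stalled message; treat the result as a list of places to look at by hand.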