Re: msgr bug in master caused by recent protocol refactor (?)

On Tue, Oct 16, 2018 at 9:12 AM Patrick Donnelly <pdonnell@xxxxxxxxxx> wrote:
>
> On Tue, Oct 16, 2018 at 9:05 AM Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> >
> > On Mon, Oct 15, 2018 at 6:58 PM Patrick Donnelly <pdonnell@xxxxxxxxxx> wrote:
> > >
> > > In CephFS testing, we've observed transient failures caused by what
> > > appears to be messages being dropped [1,2]. These appear to have been
> > > caused by the recent refactor PR [3,4] but I have no evidence other
> > > than the problems appearing during testing with [4] after [4] was
> > > merged.
> > >
> > > I'm running tests [5] to see if I can get more debugging (debug ms =
> > > 20) but I wanted to canvass for ideas/advice before I get much deeper.
> > > Has anyone else seen transient failures with messages getting dropped?
> >
> > I will note that these tickets are both from after patch 1 but before
> > patch 2.
>
> No, the tickets were both from testing with the second patch ([4] in my OP).
>
> I'll report back if I can reproduce this with higher debugging.

Here is one example:

2018-10-16 19:57:47.356 1b1c9700  5 mds.beacon.e Sending beacon up:active seq 2214
2018-10-16 19:57:47.356 1b1c9700  1 -- 172.21.15.179:6816/2028340967 --> 172.21.15.179:6789/0 -- mdsbeacon(4307/e up:active seq 2214 v283) v7 -- 0x209f2f40 con 0
2018-10-16 19:57:47.356 1b1c9700 20 -- 172.21.15.179:6816/2028340967 >> 172.21.15.179:6789/0 conn(0x1f544e20 legacy :-1 s=OPENED pgs=143 cs=1 l=1).prepare_send_message m mdsbeacon(4307/e up:active seq 2214 v283) v7
2018-10-16 19:57:47.357 1b1c9700 20 -- 172.21.15.179:6816/2028340967 >> 172.21.15.179:6789/0 conn(0x1f544e20 legacy :-1 s=OPENED pgs=143 cs=1 l=1).prepare_send_message encoding features 4611087854031142911 0x209f2f40 mdsbeacon(4307/e up:active seq 2214 v283) v7
2018-10-16 19:57:47.357 1b1c9700 15 -- 172.21.15.179:6816/2028340967 >> 172.21.15.179:6789/0 conn(0x1f544e20 legacy :-1 s=OPENED pgs=143 cs=1 l=1).send_message inline write is denied, reschedule m=0x209f2f40

From: /ceph/teuthology-archive/pdonnell-2018-10-16_16:46:31-multimds-wip-pdonnell-testing-20181011.152759-distro-basic-smithi/3148285/remote/smithi179/log/ceph-mds.e.log.gz

It looks like the messenger just never sent the message. FWIW, the mds
and mon in this particular case are on the same host. I looked around
at other beacon sends (grep "beacon.e") and the actual send by the
msgr happens promptly afterwards. For some reason, that's not the
case for seq=2214 ... but I'm not that familiar with msgr debugging.
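
For anyone who wants to scan a whole log for the same pattern rather
than eyeballing grep output, here is a minimal Python sketch. It rests
on assumptions I have not verified: that the unwrapped log has one
entry per line, and that a rescheduled message is mentioned again by
its pointer (0x209f2f40 above) once the msgr actually writes it.
Pointers can also be reused for later messages, so treat hits as leads
only. The function name find_stalled_messages is made up for
illustration.

```python
import re

# Matches the "reschedule" line seen in the excerpt above and captures
# the message pointer.
DENIED_RE = re.compile(
    r"send_message inline write is denied, reschedule m=(0x[0-9a-f]+)"
)
# Any hex pointer appearing in a log line.
PTR_RE = re.compile(r"0x[0-9a-f]+")


def find_stalled_messages(lines):
    """Return {pointer: line_number} for messages whose inline write was
    denied and whose pointer never shows up in any later line.

    ASSUMPTION: a later mention of the pointer means the msgr made
    progress on that message. This is a heuristic, not msgr semantics.
    """
    pending = {}
    for lineno, line in enumerate(lines, 1):
        denied = DENIED_RE.search(line)
        if denied:
            pending[denied.group(1)] = lineno
            continue
        # Any later mention of a pending pointer clears it.
        for ptr in PTR_RE.findall(line):
            pending.pop(ptr, None)
    return pending


if __name__ == "__main__":
    import gzip
    import sys

    # Usage: python find_stalled.py ceph-mds.e.log.gz
    with gzip.open(sys.argv[1], "rt", errors="replace") as log:
        for ptr, lineno in find_stalled_messages(log).items():
            print(f"{ptr}: denied at line {lineno}, never seen again")
```

Messages rescheduled right before the log ends will also be reported,
so anything flagged near the tail of the file is probably noise.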

Help would be appreciated!

-- 
Patrick Donnelly


