Re: msgr bug in master caused by recent protocol refactor (?)

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Thu, Oct 18, 2018 at 12:51 AM Patrick Donnelly <pdonnell@xxxxxxxxxx> wrote:
>
> On Wed, Oct 17, 2018 at 9:54 AM Patrick Donnelly <pdonnell@xxxxxxxxxx> wrote:
> >
> > On Wed, Oct 17, 2018 at 5:01 AM Ricardo Dias <rdias@xxxxxxxx> wrote:
> > > From the above history, the strange thing that I see is that the
> > > EventManager didn't call the handle_write on the messenger connection
> > > since 19:57:45 until the connection is stopped by the MonClient.
> > > This is the cause for the keepalives and mdsbeacon messages to not be
> > > send to the monitor.
> > > But I don't quite understand why this happens. Maybe the EventManager is
> > > too busy handling other events?
> >
> > It's possible. It may also be relevant that the MDS is running in
> > valgrind. I'm goign to see if we have another instance of this in
> > testing without valgrind.
>
> Another job with the same test configuration (with valgrind) but this
> one is different in that the connection eventually sent the beacons
> without requiring a reconnect:
>
> 2018-10-16 18:32:55.287 1b1c9700  5 mds.beacon.c Sending beacon
> up:active seq 634
> 2018-10-16 18:32:55.287 1b1c9700  1 -- 172.21.15.156:6819/417220422
> --> 172.21.15.156:6789/0 -- mdsbeacon(4315/c up:active seq 634 v331)
> v7 -- 0x21aabd20 con 0
> 2018-10-16 18:32:55.287 1b1c9700 20 -- 172.21.15.156:6819/417220422 >>
> 172.21.15.156:6789/0 conn(0x1487fcd0 legacy :-1 s=OPENED pgs=78 cs=1
> l=1).prepare_send_message m mdsbeacon(4315/c up:active seq 634 v331)
> v7
> 2018-10-16 18:32:55.287 1b1c9700 20 -- 172.21.15.156:6819/417220422 >>
> 172.21.15.156:6789/0 conn(0x1487fcd0 legacy :-1 s=OPENED pgs=78 cs=1
> l=1).prepare_send_message encoding features 4611087854031142911
> 0x21aabd20 mdsbeacon(4315/c up:active seq 634 v331) v7
> 2018-10-16 18:32:55.287 1b1c9700 15 -- 172.21.15.156:6819/417220422 >>
> 172.21.15.156:6789/0 conn(0x1487fcd0 legacy :-1 s=OPENED pgs=78 cs=1
> l=1).send_message inline write is denied, reschedule m=0x21aabd20
> ...
> 2018-10-16 18:33:07.993 159be700  4 -- 172.21.15.156:6819/417220422 >>
> 172.21.15.156:6789/0 conn(0x1487fcd0 legacy :-1
> s=STATE_CONNECTION_ESTABLISHED l=1).handle_write
> 2018-10-16 18:33:07.993 159be700 10 -- 172.21.15.156:6819/417220422 >>
> 172.21.15.156:6789/0 conn(0x1487fcd0 legacy :-1 s=OPENED pgs=78 cs=1
> l=1).write_event
> 2018-10-16 18:33:07.993 159be700 10 -- 172.21.15.156:6819/417220422 >>
> 172.21.15.156:6789/0 conn(0x1487fcd0 legacy :-1 s=OPENED pgs=78 cs=1
> l=1).append_keepalive_or_ack
> 2018-10-16 18:33:07.993 159be700 20 -- 172.21.15.156:6819/417220422 >>
> 172.21.15.156:6789/0 conn(0x1487fcd0 legacy :-1 s=OPENED pgs=78 cs=1
> l=1).write_message no session security
> 2018-10-16 18:33:07.993 159be700 20 -- 172.21.15.156:6819/417220422 >>
> 172.21.15.156:6789/0 conn(0x1487fcd0 legacy :-1 s=OPENED pgs=78 cs=1
> l=1).write_message sending message type=100 src mds.2 front=416 data=0
> off 0
> 2018-10-16 18:33:07.993 159be700 20 -- 172.21.15.156:6819/417220422 >>
> 172.21.15.156:6789/0 conn(0x1487fcd0 legacy :-1 s=OPENED pgs=78 cs=1
> l=1).write_message sending 721 0x21aabd20
> 2018-10-16 18:33:07.994 159be700 10 -- 172.21.15.156:6819/417220422 >>
> 172.21.15.156:6789/0 conn(0x1487fcd0 legacy :-1
> s=STATE_CONNECTION_ESTABLISHED l=1)._try_send sent bytes 500 remaining
> bytes 0
>
> From: /ceph/teuthology-archive/pdonnell-2018-10-16_16:46:31-multimds-wip-pdonnell-testing-20181011.152759-distro-basic-smithi/3148290/remote/smithi156/log/ceph-mds.c.log.gz
>
> So 12 seconds inexplicably spanned the beacon queue and the actual
> send on the socket.
>
> Notice it also was also slow to "receive" three mdsmaps in the same time period:
>
> 2018-10-16 18:33:07.477 159be700  5 -- 172.21.15.156:6819/417220422 >>
> 172.21.15.156:6789/0 conn(0x1487fcd0 legacy :-1
> s=READ_FOOTER_AND_DISPATCH pgs=78 cs=1 l=1). rx mon.1 seq 998
> 0x12348cd0 mdsmap(e 331) v1
> ...
> 2018-10-16 18:33:07.481 159be700  5 -- 172.21.15.156:6819/417220422 >>
> 172.21.15.156:6789/0 conn(0x1487fcd0 legacy :-1
> s=READ_FOOTER_AND_DISPATCH pgs=78 cs=1 l=1). rx mon.1 seq 999
> 0x12350210 mdsmap(e 332) v1
> ...
> 2018-10-16 18:33:07.486 159be700  5 -- 172.21.15.156:6819/417220422 >>
> 172.21.15.156:6789/0 conn(0x1487fcd0 legacy :-1
> s=READ_FOOTER_AND_DISPATCH pgs=78 cs=1 l=1). rx mon.1 seq 1000
> 0x123652c0 mdsmap(e 333) v1
>
> All three rapidly received in sequence.
>
> The mons sent the above mdsmaps in a timely fashion, e.g. e331:
>
> 2018-10-16 18:32:43.343 1d101700 20 -- 172.21.15.156:6789/0 done
> calling dispatch on 0x22151430
> 2018-10-16 18:32:43.344 1f105700  4 -- 172.21.15.156:6789/0 >>
> 172.21.15.156:6819/417220422 conn(0x217e96e0 legacy :6789
> s=STATE_CONNECTION_ESTABLISHED l=1).handle_write
> 2018-10-16 18:32:43.344 1f105700 10 -- 172.21.15.156:6789/0 >>
> 172.21.15.156:6819/417220422 conn(0x217e96e0 legacy :6789 s=OPENED
> pgs=1 cs=1 l=1).write_event
> 2018-10-16 18:32:43.344 1f105700 20 -- 172.21.15.156:6789/0 >>
> 172.21.15.156:6819/417220422 conn(0x217e96e0 legacy :6789 s=OPENED
> pgs=1 cs=1 l=1).prepare_send_message m mdsmap(e 330) v1
> 2018-10-16 18:32:43.344 1f105700 20 -- 172.21.15.156:6789/0 >>
> 172.21.15.156:6819/417220422 conn(0x217e96e0 legacy :6789 s=OPENED
> pgs=1 cs=1 l=1).prepare_send_message encoding features
> 4611087854031142911 0x224309d0 mdsmap(e 330) v1
> 2018-10-16 18:32:43.345 1f105700 20 -- 172.21.15.156:6789/0 >>
> 172.21.15.156:6819/417220422 conn(0x217e96e0 legacy :6789 s=OPENED
> pgs=1 cs=1 l=1).write_message signed m=0x224309d0): sig = 0
> 2018-10-16 18:32:43.345 1f105700 20 -- 172.21.15.156:6789/0 >>
> 172.21.15.156:6819/417220422 conn(0x217e96e0 legacy :6789 s=OPENED
> pgs=1 cs=1 l=1).write_message sending message type=21 src mon.1
> front=1905 data=0 off 0
> 2018-10-16 18:32:43.345 1f105700 20 -- 172.21.15.156:6789/0 >>
> 172.21.15.156:6819/417220422 conn(0x217e96e0 legacy :6789 s=OPENED
> pgs=1 cs=1 l=1).write_message sending 995 0x224309d0
> 2018-10-16 18:32:43.345 1f105700 10 -- 172.21.15.156:6789/0 >>
> 172.21.15.156:6819/417220422 conn(0x217e96e0 legacy :6789
> s=STATE_CONNECTION_ESTABLISHED l=1)._try_send sent bytes 1980
> remaining bytes 0
> 2018-10-16 18:32:43.345 1f105700 10 -- 172.21.15.156:6789/0 >>
> 172.21.15.156:6819/417220422 conn(0x217e96e0 legacy :6789 s=OPENED
> pgs=1 cs=1 l=1).write_message sending 0x224309d0 done.
>
> From: /ceph/teuthology-archive/pdonnell-2018-10-16_16:46:31-multimds-wip-pdonnell-testing-20181011.152759-distro-basic-smithi/3148290/remote/smithi156/log/ceph-mon.a.log.gz

I found another protocol refactor regression:

  https://github.com/idryomov/ceph/commit/b7a62742fcb7cbfca0766e34530d784aef8a6b98
  https://github.com/ceph/ceph/pull/24693

(I was handed that PR link after a long wait.  Looks like github is
suffering an outage, so it may not be accessible right away.)

>From a quick look at the logs it doesn't appear to be related to this
one, but I thought it might be worth including in multimds runs just in
case.

Thanks,

                Ilya



[Index of Archives]     [CEPH Users]     [Ceph Large]     [Information on CEPH]     [Linux BTRFS]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux