Re: msgr bug in master caused by recent protocol refactor (?)

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




On 17/10/2018 04:22, Patrick Donnelly wrote:
> On Tue, Oct 16, 2018 at 9:12 AM Patrick Donnelly <pdonnell@xxxxxxxxxx> wrote:
> 
> Here is one example:
> 
> 2018-10-16 19:57:47.356 1b1c9700  5 mds.beacon.e Sending beacon
> up:active seq 2214
> 2018-10-16 19:57:47.356 1b1c9700  1 -- 172.21.15.179:6816/2028340967
> --> 172.21.15.179:6789/0 -- mdsbeacon(4307/e up:active seq 2214 v283)
> v7 -- 0x209f2f40 con 0
> 2018-10-16 19:57:47.356 1b1c9700 20 -- 172.21.15.179:6816/2028340967
> 172.21.15.179:6789/0 conn(0x1f544e20 legacy :-1 s=OPENED pgs=143
> cs=1 l=1).prepare_send_message m mdsbeacon(4307/e up:active seq 2214
> v283) v7
> 2018-10-16 19:57:47.357 1b1c9700 20 -- 172.21.15.179:6816/2028340967
> 172.21.15.179:6789/0 conn(0x1f544e20 legacy :-1 s=OPENED pgs=143
> cs=1 l=1).prepare_send_message encoding features 4611087854031142911
> 0x209f2f40 mdsbeacon(4307/e up:active seq 2214 v283) v7
> 2018-10-16 19:57:47.357 1b1c9700 15 -- 172.21.15.179:6816/2028340967
> 172.21.15.179:6789/0 conn(0x1f544e20 legacy :-1 s=OPENED pgs=143
> cs=1 l=1).send_message inline write is denied, reschedule m=0x209f2f40
> 
> From: /ceph/teuthology-archive/pdonnell-2018-10-16_16:46:31-multimds-wip-pdonnell-testing-20181011.152759-distro-basic-smithi/3148285/remote/smithi179/log/ceph-mds.e.log.gz
> 
> It looks like the messenger just never sent the message. FWIW, the mds
> and mon in this particular case are on the same host. I looked around
> at other beacon sends (grep "beacon.e") and the actual send by the
> msgr happens promptly afterwards. For some reason, that's not the case
> case for seq=2214 ... but I'm not that familiar with msgr debugging.
> 
> Help would be appreciated!

I analyzed the ceph-mds.e.log and here is the summary of what I've seen:

########################################################################

19:57:43.294 mds sends mdsbeacon(seq=2213)
19:57:43.294 messenger connection queues the mdsbeacon message
         - at this point the connection asks the EventManager to
           asynchronously run the handle_write (which writes the queued
           messages to the socket).
19:57:43.295 EventManager calls handle_write on the messenger connection
19:57:43.297 messenger connection writes mdsbeacon message to the socket
19:57:43.326 messenger connection receives mdsbeacon(seq=2213) from
             monitor

19:57:45.054 mds MonClient sends keepalive
19:57:45.054 messenger connection flags that keepalive must be sent
         - at this point the connection asks the EventManager to
           asynchronously run the handle_write (which writes the
           keepalive message to the socket)

19:57:47.356 mds sends mdsbeacon(seq=2214)
19:57:47.357 messenger connection queues the mdsbeacon message
         - at this point the connection asks the EventManager to
           asynchronously run the handle_write (which writes the queued
           messages to the socket).

19:57:51.439 mds sends mdsbeacon(seq=2215)
19:57:51.439 messenger connection queues the mdsbeacon message

19:57:55.217 mds MonClient sends keepalive
19:57:55.217 messenger connection flags that keepalive must be sent

19:57:55.629 mds sends mdsbeacon(seq=2216)
19:57:55.629 messenger connection queues the mdsbeacon message

19:58:00.152 mds sends mdsbeacon(seq=2217)
19:58:00.153 messenger connection queues the mdsbeacon message

19:58:05.221 mds MonClient sends keepalive
19:58:05.221 messenger connection flags that keepalive must be sent

19:58:05:221 mark_down is called on messenger connection
             I assume that this was called by the MonClient due to
             reaching mon_client_ping_timeout
19:58:05:221 messenger connection discards message queue (all pending
             mdsbeacon messages are lost)

########################################################################

From the above history, the strange thing that I see is that the
EventManager didn't call the handle_write on the messenger connection
since 19:57:45 until the connection is stopped by the MonClient.
This is the cause for the keepalives and mdsbeacon messages to not be
send to the monitor.
But I don't quite understand why this happens. Maybe the EventManager is
too busy handling other events?

Also, let's imagine that the EventManager is thrashing and really takes
that long to issue a handle_write in the connection. Shouldn't the MDS
be aware that the MonClient might restart the connection to the monitor
due to not receiving keepalives ack, and take care of that situation?

In the meantime, I'm going to look at the code of the EventManager
(which wasn't changed by the messenger refactorings) to understand why
the above situation happened.

-- 
Ricardo Dias
Senior Software Engineer - Storage Team
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton,
HRB 21284
(AG Nürnberg)

Attachment: signature.asc
Description: OpenPGP digital signature


[Index of Archives]     [CEPH Users]     [Ceph Large]     [Information on CEPH]     [Linux BTRFS]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux