Re: Kernel mounted RBD's hanging

> -----Original Message-----
> From: Ilya Dryomov [mailto:idryomov@xxxxxxxxx]
> Sent: 31 July 2017 11:36
> To: Nick Fisk <nick@xxxxxxxxxx>
> Cc: Ceph Users <ceph-users@xxxxxxxxxxxxxx>
> Subject: Re:  Kernel mounted RBD's hanging
> 
> On Thu, Jul 13, 2017 at 12:54 PM, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
> > On Wed, Jul 12, 2017 at 7:15 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
> >>> Hi Ilya,
> >>>
> >>> I have managed today to capture the kernel logs with debugging turned on and the ms+osd debug logs from the mentioned OSD.
> >>> However, this is from a few minutes after the stall starts, not before. The very random nature of the stalls has made it
> >>> difficult to have debug logging on for extended periods of time. I hope there is something in here which might give a clue
> >>> to what is happening; otherwise, I will continue to try to capture debug logs from before the stall occurs.
> >>>
> >>> The kernel logs and OSD logs are attached as a zip.
> >>>
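
[Editor's note: for anyone else chasing similar stalls, the client-side debugging mentioned above can be turned on via dynamic debug. A rough sketch follows; the debugfs paths are the standard ones, but osd.37 and the debug levels are only the values from this thread, so adjust to taste.]

```shell
#!/bin/sh
# Sketch: enable kernel-client (libceph/rbd) debug output and bump ms/osd
# debug on one OSD. Requires root, a mounted debugfs and an admin keyring;
# the osd id and debug levels here are illustrative only.
enable_ceph_debug() {
    ctrl=/sys/kernel/debug/dynamic_debug/control
    if [ -w "$ctrl" ]; then
        echo 'module libceph +p' > "$ctrl"   # messenger/osd_client tracing
        echo 'module rbd +p' > "$ctrl"       # rbd driver tracing
    else
        echo "debugfs not writable; skipping client-side debug" >&2
    fi
    if command -v ceph >/dev/null 2>&1; then
        ceph tell osd.37 injectargs '--debug_ms 1 --debug_osd 20'
    fi
}
enable_ceph_debug
```

Remember to flip `+p` back to `-p` when done; this logging is very chatty.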
> >>> Sample of the hung requests
> >>> Wed 12 Jul 11:28:01 BST 2017
> >>> REQUESTS 8 homeless 0
> >>> 11457738        osd37   17.be732844     [37,74,14]/37   [37,74,14]/37   rbd_data.15d8670238e1f29.00000000001a2068       0x400024        1       0'0     set-alloc-hint,write
> >>> 11457759        osd37   17.9e3d4404     [37,74,14]/37   [37,74,14]/37   rbd_data.15d8670238e1f29.0000000000118149       0x400024        1       0'0     set-alloc-hint,write
> >>> 11457770        osd37   17.86ec0d14     [37,72,74]/37   [37,72,74]/37   rbd_data.15d8670238e1f29.000000000006665c       0x400024        1       0'0     set-alloc-hint,write
> >>> 11457818        osd37   17.e80ed1c0     [37,75,14]/37   [37,75,14]/37   rbd_data.158f204238e1f29.00000000000d7f1e       0x400014        1       0'0     read
> >>> 11457822        osd37   17.9db0a684     [37,74,14]/37   [37,74,14]/37   rbd_data.15d8670238e1f29.000000000000211c       0x400024        1       0'0     set-alloc-hint,write
> >>> 11457916        osd37   17.1848293b     [37,3,73]/37    [37,3,73]/37    rbd_data.158f204238e1f29.00000000000d7f8e       0x400024        1       0'0     set-alloc-hint,write
> >>> 11457967        osd37   17.56b0f4c0     [37,75,14]/37   [37,75,14]/37   rbd_data.158f204238e1f29.0000000000047ffc       0x400014        1       0'0     read
> >>> 11457970        osd37   17.65ad6d40     [37,75,14]/37   [37,75,14]/37   rbd_data.15d8670238e1f29.0000000000060318       0x400024        1       0'0     set-alloc-hint,write
> >>>
> >>> Also a nice hung_task_timeout
> >>> [57154.424300] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >>> [57154.424331] nfsd            D    0  1650      2 0x00000000
> >>> [57154.424334] Call Trace:
> >>> [57154.424341]  __schedule+0x3c6/0x8c0
> >>> [57154.424344]  schedule+0x36/0x80
> >>> [57154.424346]  rwsem_down_write_failed+0x230/0x3a0
> >>> [57154.424389]  ? xfs_file_buffered_aio_write+0x68/0x270 [xfs]
> >>> [57154.424392]  call_rwsem_down_write_failed+0x17/0x30
> >>> [57154.424394]  ? call_rwsem_down_write_failed+0x17/0x30
> >>> [57154.424396]  down_write+0x2d/0x40
> >>> [57154.424422]  xfs_ilock+0xa7/0x110 [xfs]
> >>> [57154.424446]  xfs_file_buffered_aio_write+0x68/0x270 [xfs]
> >>> [57154.424449]  ? iput+0x8a/0x230
> >>> [57154.424450]  ? __check_object_size+0x100/0x19d
> >>> [57154.424474]  xfs_file_write_iter+0x103/0x150 [xfs]
> >>> [57154.424482]  __do_readv_writev+0x2fb/0x390
> >>> [57154.424484]  do_readv_writev+0x8d/0xe0
> >>> [57154.424487]  ? security_file_open+0x8a/0x90
> >>> [57154.424488]  ? do_dentry_open+0x27a/0x310
> >>> [57154.424511]  ? xfs_extent_busy_ag_cmp+0x20/0x20 [xfs]
> >>> [57154.424513]  vfs_writev+0x3c/0x50
> >>> [57154.424514]  ? vfs_writev+0x3c/0x50
> >>> [57154.424525]  nfsd_vfs_write+0xc6/0x3a0 [nfsd]
> >>> [57154.424531]  nfsd_write+0x144/0x200 [nfsd]
> >>> [57154.424538]  nfsd3_proc_write+0xaa/0x140 [nfsd]
> >>> [57154.424549]  nfsd_dispatch+0xc8/0x260 [nfsd]
> >>> [57154.424566]  svc_process_common+0x374/0x6b0 [sunrpc]
> >>> [57154.424575]  svc_process+0xfe/0x1b0 [sunrpc]
> >>> [57154.424581]  nfsd+0xe9/0x150 [nfsd]
> >>> [57154.424583]  kthread+0x109/0x140
> >>> [57154.424588]  ? nfsd_destroy+0x60/0x60 [nfsd]
> >>> [57154.424590]  ? kthread_create_on_node+0x70/0x70
> >>> [57154.424592]  ret_from_fork+0x2c/0x40
> >>>
> >>> Regards,
> >>> Nick
> >>>
> >>>
> >>
> >> Just looking into this some more, I have found laggy requests to osd37 in the kernel logs:
> >>
> >> Jul 12 11:25:45 MS-CEPH-Proxy1 kernel: [66691.685559] libceph:   req ffff9adad97a1450 tid 11457738 on osd37 is laggy
> >> Jul 12 11:25:45 MS-CEPH-Proxy1 kernel: [66691.685560] libceph:   req ffff9adad0df9040 tid 11457759 on osd37 is laggy
> >> Jul 12 11:25:45 MS-CEPH-Proxy1 kernel: [66691.685561] libceph:   req ffff9adad97a5d70 tid 11457770 on osd37 is laggy
> >> Jul 12 11:25:45 MS-CEPH-Proxy1 kernel: [66691.685562] libceph:   req ffff9adad9036590 tid 11457818 on osd37 is laggy
> >> Jul 12 11:25:45 MS-CEPH-Proxy1 kernel: [66691.685563] libceph:   req ffff9adad9035960 tid 11457822 on osd37 is laggy
> >> Jul 12 11:25:45 MS-CEPH-Proxy1 kernel: [66691.685564] libceph:   req ffff9adad9036db0 tid 11457916 on osd37 is laggy
> >> Jul 12 11:25:45 MS-CEPH-Proxy1 kernel: [66691.685565] libceph:   req ffff9adad90330c0 tid 11457967 on osd37 is laggy
> >> Jul 12 11:25:45 MS-CEPH-Proxy1 kernel: [66691.685565] libceph:   req ffff9adad97a5550 tid 11457970 on osd37 is laggy
> >
> > Hi Nick,
> >
> > I took a look, but didn't see anything useful.  There are gaps in
> > these excerpts, like:
> >
> > Jul 12 11:26:10 MS-CEPH-Proxy1 kernel: [66716.981713] libceph:  try_write start ffff9adaebf00288 state 5
> > Jul 12 11:26:10 MS-CEPH-Proxy1 kernel: [66716.981714] libceph:  try_write out_kvec_bytes 0
> > Jul 12 11:26:10 MS-CEPH-Proxy1 kernel: [66716.981716] libceph:  try_write nothing else to write.
> > Jul 12 11:26:10 MS-CEPH-Proxy1 kernel: [66716.981716] libceph:  try_write done on ffff9adaebf00288 ret 0
> > Jul 12 11:26:12 MS-CEPH-Proxy1 systemd[1]: Started Session 2398 of user sysprosupport.
> > Jul 12 11:26:16 MS-CEPH-Proxy1 postfix/pickup[13798]: 042FC385: uid=1000 from=<sysprosupport>
> > Jul 12 11:26:16 MS-CEPH-Proxy1 postfix/cleanup[28873]: 042FC385: message-id=<20170712102616.042FC385@MS-CEPH-Proxy1>
> > Jul 12 11:26:16 MS-CEPH-Proxy1 postfix/qmgr[2151]: 042FC385: from=<sysprosupport@MS-CEPH-Proxy1>, size=487, nrcpt=1 (queue active)
> > Jul 12 11:26:16 MS-CEPH-Proxy1 postfix/local[28875]: 042FC385: to=<root@MS-CEPH-Proxy1>, orig_to=<root>, relay=local, delay=0.24, delays=0.11/0.11/0/0.01, dsn=2.0.0, status=sent (delivered to mailbox)
> > Jul 12 11:26:16 MS-CEPH-Proxy1 postfix/qmgr[2151]: 042FC385: removed
> > Jul 12 11:26:18 MS-CEPH-Proxy1 systemd[1]: Started Session c2 of user root.
> > Jul 12 11:27:01 MS-CEPH-Proxy1 CRON[29822]: (root) CMD (   date >> /home/sysprosupport/osdc.txt;cat /sys/kernel/debug/ceph/*/osdc >> /home/sysprosupport/osdc.txt)
> > Jul 12 11:27:01 MS-CEPH-Proxy1 CRON[29823]: (root) CMD (date >> /home/sysprosupport/buddy.txt;cat /proc/buddyinfo >> /home/sysprosupport/buddy.txt)
> > Jul 12 11:28:01 MS-CEPH-Proxy1 CRON[31057]: (root) CMD (date >> /home/sysprosupport/buddy.txt;cat /proc/buddyinfo >> /home/sysprosupport/buddy.txt)
> > Jul 12 11:28:01 MS-CEPH-Proxy1 CRON[31058]: (root) CMD (   date >> /home/sysprosupport/osdc.txt;cat /sys/kernel/debug/ceph/*/osdc >> /home/sysprosupport/osdc.txt)
> > Jul 12 11:29:01 MS-CEPH-Proxy1 CRON[32293]: (root) CMD (date >> /home/sysprosupport/buddy.txt;cat /proc/buddyinfo >> /home/sysprosupport/buddy.txt)
> > Jul 12 11:29:01 MS-CEPH-Proxy1 CRON[32294]: (root) CMD (   date >> /home/sysprosupport/osdc.txt;cat /sys/kernel/debug/ceph/*/osdc >> /home/sysprosupport/osdc.txt)
> >
> > There is no way the kernel client would suddenly stay silent for
> > minutes.  Looks like your osdc cron job was running - could you attach
> > all output between 10:00 and 12:00?  I'd like to see when the first of
> > these 8 requests showed up and when they cleared up.  It looks like
> > they cleared up at around 11:35, but I want to confirm.
> >
> >>
> >> And have found in the kernel client where it displays that message.
> >>
> >> https://github.com/ceph/ceph-client/blob/master/net/ceph/osd_client.c
> >> #L3073-L3098
> >>
> >> Does that suggest that for some reason the TCP connection is getting broken, causing the hang? For whatever reason it can't
> >> re-establish, or takes longer than it should, and thus causes the outage?
> >
> > It could be.  When the kernel client detects laggy requests, it sends
> > the keepalive tag -- so in this case it should have been sending it to
> > osd37 every 5 seconds.  If those keepalives (along with the original
> > OSD requests) were going nowhere until the networking stack closed the
> > socket
> >
> > Jul 12 11:35:34 MS-CEPH-Proxy1 kernel: [67280.665519] libceph: osd37
> > 10.3.31.14:6829 socket closed (con state OPEN)
> >
> > thus triggering the resend of the OSD requests, that would certainly
> > do it ;)  Who closed osd37 socket at 11:35 (the OSD or the kernel) was
> > one of the things I was going to check for, but the OSD log ends at 11:28.
> >
> > Try capturing netstat output along with osdc and osdmap from debugfs
> > every 5-10 seconds on the kernel client boxes.  That's not a lot of
> > output and correlating send/recv queue sizes for IPs from osdmap to
> > osdc might help.
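
[Editor's note: Ilya's suggestion above is easy to script. A minimal sketch follows; the output directory and the 5-second interval are arbitrary, and `ss` stands in where `netstat` isn't installed.]

```shell
#!/bin/sh
# Sketch: periodically snapshot the kernel client's in-flight requests (osdc),
# its osdmap, and TCP socket state, so that send/recv queue sizes can be
# matched up with OSD addresses after a stall. Run on each kernel-client box.
OUT=${OUT:-/tmp/rbd-debug}
mkdir -p "$OUT"

capture_once() {
    date >> "$OUT/osdc.txt"
    cat /sys/kernel/debug/ceph/*/osdc >> "$OUT/osdc.txt" 2>/dev/null
    date >> "$OUT/osdmap.txt"
    cat /sys/kernel/debug/ceph/*/osdmap >> "$OUT/osdmap.txt" 2>/dev/null
    date >> "$OUT/netstat.txt"
    { netstat -tn 2>/dev/null || ss -tn 2>/dev/null; } >> "$OUT/netstat.txt" || true
}

capture_once    # change to `while :; do capture_once; sleep 5; done` for continuous capture
```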
> 
> Hi Nick,
> 
> Any progress on this?
> 
> Thanks,
> 
>                 Ilya

Hi Ilya,

Yes, I think I have made some progress. I was away for the last couple of weeks, so apologies for the lack of recent response. I found a setting on the VM acting as the proxy which controls LRO of the virtual NIC. It had been disabled as part of a setting which is meant to improve latency; after switching it back on, the hangs seem to have disappeared, although I need to run it for a longer time to make sure this has not just reduced how often they occur.

Unfortunately, I don't understand why this would effectively cause TCP connections to drop/disappear in Linux, or why it has suddenly started happening, as this setting had been in place for a while. If anything, disabling these offload features is meant to make things more reliable. I would still like to get to the bottom of this and work out in more detail what was happening and why Linux/Ceph couldn't recover more gracefully. I will hopefully have some time in the coming weeks to re-enable the setting and try to capture some more focussed debugging data.
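
[Editor's note: the in-guest side of an offload setting like this can be inspected with ethtool. A hedged sketch follows; "eth0" is a placeholder, and on a VM the hypervisor's vNIC configuration may override what the guest reports.]

```shell
#!/bin/sh
# Sketch: check whether LRO (large-receive-offload) is enabled on a NIC,
# and the command that would toggle it. The NIC name is a placeholder.
NIC=${NIC:-eth0}

show_lro() {
    ethtool -k "$1" | grep large-receive-offload
}

if command -v ethtool >/dev/null 2>&1 && [ -e "/sys/class/net/$NIC" ]; then
    show_lro "$NIC"
    # ethtool -K "$NIC" lro on    # re-enable LRO (left commented on purpose)
fi
```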

Nick

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


