> -----Original Message-----
> From: Ilya Dryomov [mailto:idryomov@xxxxxxxxx]
> Sent: 29 June 2017 18:54
> To: Nick Fisk <nick@xxxxxxxxxx>
> Cc: Ceph Users <ceph-users@xxxxxxxxxxxxxx>
> Subject: Re: Kernel mounted RBD's hanging
>
> On Thu, Jun 29, 2017 at 6:22 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
> >> -----Original Message-----
> >> From: Ilya Dryomov [mailto:idryomov@xxxxxxxxx]
> >> Sent: 29 June 2017 16:58
> >> To: Nick Fisk <nick@xxxxxxxxxx>
> >> Cc: Ceph Users <ceph-users@xxxxxxxxxxxxxx>
> >> Subject: Re: Kernel mounted RBD's hanging
> >>
> >> On Thu, Jun 29, 2017 at 4:30 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
> >> > Hi All,
> >> >
> >> > Putting out a call for help to see if anyone can shed some light on this.
> >> >
> >> > Configuration:
> >> > Ceph cluster presenting RBDs -> XFS -> NFS -> ESXi.
> >> > Running 10.2.7 on the OSDs and a 4.11 kernel on the NFS gateways in a pacemaker cluster.
> >> > Both OSDs and clients go into a pair of switches, single L2 domain
> >> > (no sign from pacemaker that there are network connectivity issues).
> >> >
> >> > Symptoms:
> >> > - All RBDs on a single client randomly hang for 30s to several
> >> > minutes, confirmed by pacemaker and ESXi hosts complaining
> >>
> >> Hi Nick,
> >>
> >> What is a "single client" here?
> >
> > I mean a node of the pacemaker cluster. So all RBDs on the same pacemaker node hang.
> >
> >>
> >> > - Cluster load is minimal when this happens most times
> >>
> >> Can you post gateway syslog and point at when this happened?
> >> Corresponding pacemaker excerpts won't hurt either.
> >
> > Jun 28 16:35:38 MS-CEPH-Proxy1 lrmd[2026]: warning: p_export_ceph-ds1_monitor_60000 process (PID 17754) timed out
> > Jun 28 16:35:43 MS-CEPH-Proxy1 lrmd[2026]: crit: p_export_ceph-ds1_monitor_60000 process (PID 17754) will not die!
> > Jun 28 16:43:51 MS-CEPH-Proxy1 lrmd[2026]: warning: p_export_ceph-ds1_monitor_60000:17754 - timed out after 30000ms
> > Jun 28 16:43:52 MS-CEPH-Proxy1 IPaddr(p_vip_ceph-ds1)[28482]: INFO: ifconfig ens224:0 down
> > Jun 28 16:43:52 MS-CEPH-Proxy1 lrmd[2026]: notice: p_vip_ceph-ds1_stop_0:28482:stderr [ SIOCDELRT: No such process ]
> > Jun 28 16:43:52 MS-CEPH-Proxy1 crmd[2029]: notice: Operation p_vip_ceph-ds1_stop_0: ok (node=MS-CEPH-Proxy1, call=471, rc=0, cib-update=318, confirmed=true)
> > Jun 28 16:43:52 MS-CEPH-Proxy1 exportfs(p_export_ceph-ds1)[28499]: INFO: Un-exporting file system ...
> > Jun 28 16:43:52 MS-CEPH-Proxy1 exportfs(p_export_ceph-ds1)[28499]: INFO: unexporting 10.3.20.0/24:/mnt/Ceph-DS1
> > Jun 28 16:43:52 MS-CEPH-Proxy1 exportfs(p_export_ceph-ds1)[28499]: INFO: Unlocked NFS export /mnt/Ceph-DS1
> > Jun 28 16:43:52 MS-CEPH-Proxy1 exportfs(p_export_ceph-ds1)[28499]: INFO: Un-exported file system(s)
> > Jun 28 16:43:52 MS-CEPH-Proxy1 crmd[2029]: notice: Operation p_export_ceph-ds1_stop_0: ok (node=MS-CEPH-Proxy1, call=473, rc=0, cib-update=319, confirmed=true)
> > Jun 28 16:43:52 MS-CEPH-Proxy1 exportfs(p_export_ceph-ds1)[28549]: INFO: Exporting file system(s) ...
> > Jun 28 16:43:52 MS-CEPH-Proxy1 exportfs(p_export_ceph-ds1)[28549]: INFO: exporting 10.3.20.0/24:/mnt/Ceph-DS1
> > Jun 28 16:43:52 MS-CEPH-Proxy1 exportfs(p_export_ceph-ds1)[28549]: INFO: directory /mnt/Ceph-DS1 exported
> > Jun 28 16:43:52 MS-CEPH-Proxy1 crmd[2029]: notice: Operation p_export_ceph-ds1_start_0: ok (node=MS-CEPH-Proxy1, call=474, rc=0, cib-update=320, confirmed=true)
> >
> > If I enable the read/write checks for the FS resource, they also time out at the same time.
>
> What about syslog that the above corresponds to?

I get exactly the same "_monitor" timeout message. Is there anything I can do, logging-wise, with the kernel client to record when an IO is taking a long time? Something like the slow request warnings in Ceph, but on the client side?

> Thanks,
>
> Ilya

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
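On the client-side slow-request question above: the kernel RBD/CephFS client exposes its in-flight OSD requests through debugfs, normally under /sys/kernel/debug/ceph/<fsid>.client<id>/osdc, so one low-tech option is to poll that file and log any request that stays outstanding past a threshold. The sketch below is illustrative only, not something from the thread: it assumes debugfs is mounted at /sys/kernel/debug and that the first column of each osdc line is the request tid (the exact column layout varies between kernel versions), and the threshold and polling interval are arbitrary.

#!/usr/bin/env python3
# Rough sketch: log kernel-client OSD requests that stay in flight too long.
# Assumptions: debugfs mounted at /sys/kernel/debug; first osdc column is the tid.

import glob
import time

THRESHOLD = 30   # seconds an op may stay in flight before we complain
INTERVAL = 5     # polling interval in seconds

first_seen = {}  # (osdc_path, tid) -> timestamp when the op was first seen

def snapshot():
    """Return the set of (osdc_path, tid, raw_line) currently in flight."""
    ops = set()
    for path in glob.glob('/sys/kernel/debug/ceph/*/osdc'):
        try:
            with open(path) as f:
                for line in f:
                    line = line.strip()
                    if not line:
                        continue
                    tid = line.split()[0]
                    ops.add((path, tid, line))
        except OSError:
            pass  # client instance went away between glob() and open()
    return ops

while True:
    now = time.time()
    live_keys = set()
    for path, tid, raw in snapshot():
        key = (path, tid)
        live_keys.add(key)
        started = first_seen.setdefault(key, now)
        age = now - started
        if age > THRESHOLD:
            print(f'{time.ctime(now)} slow client op: {age:.0f}s {path} {raw}')
    # forget ops that have completed since the last poll
    for key in list(first_seen):
        if key not in live_keys:
            del first_seen[key]
    time.sleep(INTERVAL)

For a quick one-off look while a hang is actually happening, simply cat'ing the osdc (and monc) files in the same debugfs directory on the affected gateway should show whether requests are stuck waiting on a particular OSD, which can then be matched against the pacemaker monitor timeouts in syslog.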