Re: Kernel mounted RBD's hanging

> -----Original Message-----
> From: Ilya Dryomov [mailto:idryomov@xxxxxxxxx]
> Sent: 29 June 2017 16:58
> To: Nick Fisk <nick@xxxxxxxxxx>
> Cc: Ceph Users <ceph-users@xxxxxxxxxxxxxx>
> Subject: Re:  Kernel mounted RBD's hanging
> 
> On Thu, Jun 29, 2017 at 4:30 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
> > Hi All,
> >
> > Putting out a call for help to see if anyone can shed some light on this.
> >
> > Configuration:
> > Ceph cluster presenting RBDs -> XFS -> NFS -> ESXi. Running 10.2.7 on
> > the OSDs and a 4.11 kernel on the NFS gateways, which are in a
> > pacemaker cluster. Both OSDs and clients go into a pair of switches,
> > single L2 domain (no sign from pacemaker of any network connectivity
> > issues).
> >
> > Symptoms:
> > - All RBDs on a single client randomly hang for 30s to several
> > minutes, confirmed by pacemaker and by ESXi hosts complaining
> 
> Hi Nick,
> 
> What is a "single client" here?

I mean a node of the pacemaker cluster. So all RBDs on the same pacemaker node hang.
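
For what it's worth, a quick way to see what the kernel client on that node is actually waiting on while one of these hangs is in progress is the libceph debugfs files. A rough sketch, assuming debugfs is mounted at /sys/kernel/debug:

    # which images this node currently has mapped
    rbd showmapped

    # in-flight OSD requests for each kernel client instance; anything
    # sitting in here for a long time points at a stuck request
    cat /sys/kernel/debug/ceph/*/osdc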

> 
> > - Cluster load is minimal when this happens most times
> 
> Can you post gateway syslog and point at when this happened?
> Corresponding pacemaker excerpts won't hurt either.

Jun 28 16:35:38 MS-CEPH-Proxy1 lrmd[2026]:  warning: p_export_ceph-ds1_monitor_60000 process (PID 17754) timed out
Jun 28 16:35:43 MS-CEPH-Proxy1 lrmd[2026]:     crit: p_export_ceph-ds1_monitor_60000 process (PID 17754) will not die!
Jun 28 16:43:51 MS-CEPH-Proxy1 lrmd[2026]:  warning: p_export_ceph-ds1_monitor_60000:17754 - timed out after 30000ms
Jun 28 16:43:52 MS-CEPH-Proxy1 IPaddr(p_vip_ceph-ds1)[28482]: INFO: ifconfig ens224:0 down
Jun 28 16:43:52 MS-CEPH-Proxy1 lrmd[2026]:   notice: p_vip_ceph-ds1_stop_0:28482:stderr [ SIOCDELRT: No such process ]
Jun 28 16:43:52 MS-CEPH-Proxy1 crmd[2029]:   notice: Operation p_vip_ceph-ds1_stop_0: ok (node=MS-CEPH-Proxy1, call=471, rc=0, cib-update=318, confirmed=true)
Jun 28 16:43:52 MS-CEPH-Proxy1 exportfs(p_export_ceph-ds1)[28499]: INFO: Un-exporting file system ...
Jun 28 16:43:52 MS-CEPH-Proxy1 exportfs(p_export_ceph-ds1)[28499]: INFO: unexporting 10.3.20.0/24:/mnt/Ceph-DS1
Jun 28 16:43:52 MS-CEPH-Proxy1 exportfs(p_export_ceph-ds1)[28499]: INFO: Unlocked NFS export /mnt/Ceph-DS1
Jun 28 16:43:52 MS-CEPH-Proxy1 exportfs(p_export_ceph-ds1)[28499]: INFO: Un-exported file system(s)
Jun 28 16:43:52 MS-CEPH-Proxy1 crmd[2029]:   notice: Operation p_export_ceph-ds1_stop_0: ok (node=MS-CEPH-Proxy1, call=473, rc=0, cib-update=319, confirmed=true)
Jun 28 16:43:52 MS-CEPH-Proxy1 exportfs(p_export_ceph-ds1)[28549]: INFO: Exporting file system(s) ...
Jun 28 16:43:52 MS-CEPH-Proxy1 exportfs(p_export_ceph-ds1)[28549]: INFO: exporting 10.3.20.0/24:/mnt/Ceph-DS1
Jun 28 16:43:52 MS-CEPH-Proxy1 exportfs(p_export_ceph-ds1)[28549]: INFO: directory /mnt/Ceph-DS1 exported
Jun 28 16:43:52 MS-CEPH-Proxy1 crmd[2029]:   notice: Operation p_export_ceph-ds1_start_0: ok (node=MS-CEPH-Proxy1, call=474, rc=0, cib-update=320, confirmed=true)

If I enable the read/write checks for the FS resource, they also time out at the same time.
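
For context, the read/write check here is the depth-20 monitor on ocf:heartbeat:Filesystem: with OCF_CHECK_LEVEL=20 the monitor writes and reads back a small status file under the mountpoint, so it blocks whenever the RBD-backed filesystem does. A rough sketch of that kind of resource in crm shell syntax (the device path, names and timeouts are illustrative only, not the exact configuration):

    primitive p_fs_ceph-ds1 ocf:heartbeat:Filesystem \
        params device="/dev/rbd/rbd/ceph-ds1" directory="/mnt/Ceph-DS1" fstype="xfs" \
        op monitor interval="60s" timeout="40s" OCF_CHECK_LEVEL="20"

(Depending on the crmsh version, OCF_CHECK_LEVEL may need to be given via op_params on the monitor op rather than inline.)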

> 
> > - All other clients with RBDs are not affected (same RADOS pool), so
> > it seems more of a client issue than a cluster issue
> > - It looks like pacemaker then tries to stop the RBD+FS resources, but
> > this also hangs
> > - Eventually pacemaker succeeds in stopping the resources and
> > immediately restarts them, and IO returns to normal
> > - No errors, slow requests, or any other abnormal Ceph status are
> > reported on the cluster or in ceph.log
> > - Client logs show nothing apart from the pacemaker messages
> >
> > Things I've tried:
> > - Different kernels (potentially happened less with older kernels, but
> > can't be 100% sure)
> 
> But still happened?  Do you have a list of all the kernels you've tried?

4.5 and 4.11. 

> 
> > - Disabling scrubbing and anything else that could be causing high
> > load
> > - Enabling kernel RBD debugging (the problem happens maybe a couple of
> > times a day, so leaving debug logging on was not practical and I can't
> > reproduce it on demand)
> 
> When did it start occurring?  Can you think of any configuration changes
> that might have been the trigger, or is this a new setup?

It has always done this from what I can tell. The majority of the time IO resumed before ESXi went All Paths Down, so it wasn't on my list of priorities to fix. But recently the hangs have been lasting a lot longer. I need to go back to the 4.5 kernel, as I don't remember it happening as often or being as disruptive before upgrading to 4.11.
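
Kernel RBD debugging of the sort mentioned above is normally toggled through dynamic debug, roughly as follows (a sketch; it assumes debugfs is mounted at /sys/kernel/debug and the kernel was built with CONFIG_DYNAMIC_DEBUG):

    # turn on verbose kernel client logging (very chatty, goes to dmesg/syslog)
    echo 'module rbd +p'     > /sys/kernel/debug/dynamic_debug/control
    echo 'module libceph +p' > /sys/kernel/debug/dynamic_debug/control

    # turn it back off once the hang has been captured
    echo 'module rbd -p'     > /sys/kernel/debug/dynamic_debug/control
    echo 'module libceph -p' > /sys/kernel/debug/dynamic_debug/control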

> 
> Thanks,
> 
>                 Ilya

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


