> -----Original Message-----
> From: Ilya Dryomov [mailto:idryomov@xxxxxxxxx]
> Sent: 29 June 2017 18:54
> To: Nick Fisk <nick@xxxxxxxxxx>
> Cc: Ceph Users <ceph-users@xxxxxxxxxxxxxx>
> Subject: Re: Kernel mounted RBD's hanging
>
> On Thu, Jun 29, 2017 at 6:22 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
> >> -----Original Message-----
> >> From: Ilya Dryomov [mailto:idryomov@xxxxxxxxx]
> >> Sent: 29 June 2017 16:58
> >> To: Nick Fisk <nick@xxxxxxxxxx>
> >> Cc: Ceph Users <ceph-users@xxxxxxxxxxxxxx>
> >> Subject: Re: Kernel mounted RBD's hanging
> >>
> >> On Thu, Jun 29, 2017 at 4:30 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
> >> > Hi All,
> >> >
> >> > Putting out a call for help to see if anyone can shed some light on this.
> >> >
> >> > Configuration:
> >> > Ceph cluster presenting RBDs -> XFS -> NFS -> ESXi.
> >> > Running 10.2.7 on the OSDs and a 4.11 kernel on the NFS gateways in a pacemaker cluster.
> >> > Both OSDs and clients go into a pair of switches, single L2 domain
> >> > (no sign from pacemaker that there are network connectivity issues).
> >> >
> >> > Symptoms:
> >> > - All RBDs on a single client randomly hang for 30s to several
> >> > minutes, confirmed by pacemaker and ESXi hosts complaining
> >>
> >> Hi Nick,
> >>
> >> What is a "single client" here?
> >
> > I mean a node of the pacemaker cluster. So all RBDs on the same pacemaker node hang.
> >
> >>
> >> > - Cluster load is minimal when this happens most times
> >>
> >> Can you post gateway syslog and point at when this happened?
> >> Corresponding pacemaker excerpts won't hurt either.
> >
> > Jun 28 16:35:38 MS-CEPH-Proxy1 lrmd[2026]: warning: p_export_ceph-ds1_monitor_60000 process (PID 17754) timed out
> > Jun 28 16:35:43 MS-CEPH-Proxy1 lrmd[2026]: crit: p_export_ceph-ds1_monitor_60000 process (PID 17754) will not die!
> > Jun 28 16:43:51 MS-CEPH-Proxy1 lrmd[2026]: warning: p_export_ceph-ds1_monitor_60000:17754 - timed out after 30000ms
> > Jun 28 16:43:52 MS-CEPH-Proxy1 IPaddr(p_vip_ceph-ds1)[28482]: INFO: ifconfig ens224:0 down
> > Jun 28 16:43:52 MS-CEPH-Proxy1 lrmd[2026]: notice: p_vip_ceph-ds1_stop_0:28482:stderr [ SIOCDELRT: No such process ]
> > Jun 28 16:43:52 MS-CEPH-Proxy1 crmd[2029]: notice: Operation p_vip_ceph-ds1_stop_0: ok (node=MS-CEPH-Proxy1, call=471, rc=0, cib-update=318, confirmed=true)
> > Jun 28 16:43:52 MS-CEPH-Proxy1 exportfs(p_export_ceph-ds1)[28499]: INFO: Un-exporting file system ...
> > Jun 28 16:43:52 MS-CEPH-Proxy1 exportfs(p_export_ceph-ds1)[28499]: INFO: unexporting 10.3.20.0/24:/mnt/Ceph-DS1
> > Jun 28 16:43:52 MS-CEPH-Proxy1 exportfs(p_export_ceph-ds1)[28499]: INFO: Unlocked NFS export /mnt/Ceph-DS1
> > Jun 28 16:43:52 MS-CEPH-Proxy1 exportfs(p_export_ceph-ds1)[28499]: INFO: Un-exported file system(s)
> > Jun 28 16:43:52 MS-CEPH-Proxy1 crmd[2029]: notice: Operation p_export_ceph-ds1_stop_0: ok (node=MS-CEPH-Proxy1, call=473, rc=0, cib-update=319, confirmed=true)
> > Jun 28 16:43:52 MS-CEPH-Proxy1 exportfs(p_export_ceph-ds1)[28549]: INFO: Exporting file system(s) ...
> > Jun 28 16:43:52 MS-CEPH-Proxy1 exportfs(p_export_ceph-ds1)[28549]: INFO: exporting 10.3.20.0/24:/mnt/Ceph-DS1
> > Jun 28 16:43:52 MS-CEPH-Proxy1 exportfs(p_export_ceph-ds1)[28549]: INFO: directory /mnt/Ceph-DS1 exported
> > Jun 28 16:43:52 MS-CEPH-Proxy1 crmd[2029]: notice: Operation p_export_ceph-ds1_start_0: ok (node=MS-CEPH-Proxy1, call=474, rc=0, cib-update=320, confirmed=true)
> >
> > If I enable the read/write checks for the FS resource, they also time out at the same time.
>
> What about syslog that the above corresponds to?

I get exactly the same "_monitor" timeout message. Is there anything I can do, logging-wise, with the kernel client to record when an IO is taking a long time? Something like the slow request warnings in Ceph, but on the client side?

> Thanks,
>
> Ilya

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
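On the client-side slow-request question above: the kernel RBD/CephFS client exposes its in-flight OSD requests through debugfs, normally under /sys/kernel/debug/ceph/<fsid>.client<id>/osdc, so one low-tech option is to poll that file and log any request that stays outstanding past a threshold. The sketch below is illustrative only, not something from the thread: it assumes debugfs is mounted at /sys/kernel/debug and that the first column of each osdc line is the request tid (the exact column layout varies between kernel versions), and the threshold and polling interval are arbitrary.

#!/usr/bin/env python3
# Rough sketch: log kernel-client OSD requests that stay in flight too long.
# Assumptions: debugfs mounted at /sys/kernel/debug; first osdc column is the tid.

import glob
import time

THRESHOLD = 30   # seconds an op may stay in flight before we complain
INTERVAL = 5     # polling interval in seconds

first_seen = {}  # (osdc_path, tid) -> timestamp when the op was first seen

def snapshot():
    """Return the set of (osdc_path, tid, raw_line) currently in flight."""
    ops = set()
    for path in glob.glob('/sys/kernel/debug/ceph/*/osdc'):
        try:
            with open(path) as f:
                for line in f:
                    line = line.strip()
                    if not line:
                        continue
                    tid = line.split()[0]
                    ops.add((path, tid, line))
        except OSError:
            pass  # client instance went away between glob() and open()
    return ops

while True:
    now = time.time()
    live_keys = set()
    for path, tid, raw in snapshot():
        key = (path, tid)
        live_keys.add(key)
        started = first_seen.setdefault(key, now)
        age = now - started
        if age > THRESHOLD:
            print(f'{time.ctime(now)} slow client op: {age:.0f}s {path} {raw}')
    # forget ops that have completed since the last poll
    for key in list(first_seen):
        if key not in live_keys:
            del first_seen[key]
    time.sleep(INTERVAL)

For a quick one-off look while a hang is actually happening, simply cat'ing the osdc (and monc) files in the same debugfs directory on the affected gateway should show whether requests are stuck waiting on a particular OSD, which can then be matched against the pacemaker monitor timeouts in syslog.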