Re: Kernel-mounted RBDs hanging

On Fri, Jun 30, 2017 at 2:14 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
>
>
>> -----Original Message-----
>> From: Ilya Dryomov [mailto:idryomov@xxxxxxxxx]
>> Sent: 29 June 2017 18:54
>> To: Nick Fisk <nick@xxxxxxxxxx>
>> Cc: Ceph Users <ceph-users@xxxxxxxxxxxxxx>
>> Subject: Re: Kernel-mounted RBDs hanging
>>
>> On Thu, Jun 29, 2017 at 6:22 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
>> >> -----Original Message-----
>> >> From: Ilya Dryomov [mailto:idryomov@xxxxxxxxx]
>> >> Sent: 29 June 2017 16:58
>> >> To: Nick Fisk <nick@xxxxxxxxxx>
>> >> Cc: Ceph Users <ceph-users@xxxxxxxxxxxxxx>
>> >> Subject: Re: Kernel-mounted RBDs hanging
>> >>
>> >> On Thu, Jun 29, 2017 at 4:30 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
>> >> > Hi All,
>> >> >
>> >> > Putting out a call for help to see if anyone can shed some light on this.
>> >> >
>> >> > Configuration:
>> >> > Ceph cluster presenting RBDs -> XFS -> NFS -> ESXi. Running
>> >> > 10.2.7 on the OSDs and a 4.11 kernel on the NFS gateways in a
>> >> > pacemaker cluster. Both OSDs and clients go into a pair of
>> >> > switches, single L2 domain (no sign from pacemaker of any
>> >> > network connectivity issues)
>> >> >
>> >> > Symptoms:
>> >> > - All RBDs on a single client randomly hang for 30s to several
>> >> > minutes, confirmed by pacemaker and by ESXi hosts complaining
>> >>
>> >> Hi Nick,
>> >>
>> >> What is a "single client" here?
>> >
>> > I mean a node of the pacemaker cluster. So all RBDs on the same
>> > pacemaker node hang.
>> >
>> >>
>> >> > - Cluster load is usually minimal when this happens
>> >>
>> >> Can you post gateway syslog and point at when this happened?
>> >> Corresponding pacemaker excerpts won't hurt either.
>> >
>> > Jun 28 16:35:38 MS-CEPH-Proxy1 lrmd[2026]:  warning: p_export_ceph-ds1_monitor_60000 process (PID 17754) timed out
>> > Jun 28 16:35:43 MS-CEPH-Proxy1 lrmd[2026]:     crit: p_export_ceph-ds1_monitor_60000 process (PID 17754) will not die!
>> > Jun 28 16:43:51 MS-CEPH-Proxy1 lrmd[2026]:  warning: p_export_ceph-ds1_monitor_60000:17754 - timed out after 30000ms
>> > Jun 28 16:43:52 MS-CEPH-Proxy1 IPaddr(p_vip_ceph-ds1)[28482]: INFO: ifconfig ens224:0 down
>> > Jun 28 16:43:52 MS-CEPH-Proxy1 lrmd[2026]:   notice: p_vip_ceph-ds1_stop_0:28482:stderr [ SIOCDELRT: No such process ]
>> > Jun 28 16:43:52 MS-CEPH-Proxy1 crmd[2029]:   notice: Operation p_vip_ceph-ds1_stop_0: ok (node=MS-CEPH-Proxy1, call=471, rc=0, cib-update=318, confirmed=true)
>> > Jun 28 16:43:52 MS-CEPH-Proxy1 exportfs(p_export_ceph-ds1)[28499]: INFO: Un-exporting file system ...
>> > Jun 28 16:43:52 MS-CEPH-Proxy1 exportfs(p_export_ceph-ds1)[28499]: INFO: unexporting 10.3.20.0/24:/mnt/Ceph-DS1
>> > Jun 28 16:43:52 MS-CEPH-Proxy1 exportfs(p_export_ceph-ds1)[28499]: INFO: Unlocked NFS export /mnt/Ceph-DS1
>> > Jun 28 16:43:52 MS-CEPH-Proxy1 exportfs(p_export_ceph-ds1)[28499]: INFO: Un-exported file system(s)
>> > Jun 28 16:43:52 MS-CEPH-Proxy1 crmd[2029]:   notice: Operation p_export_ceph-ds1_stop_0: ok (node=MS-CEPH-Proxy1, call=473, rc=0, cib-update=319, confirmed=true)
>> > Jun 28 16:43:52 MS-CEPH-Proxy1 exportfs(p_export_ceph-ds1)[28549]: INFO: Exporting file system(s) ...
>> > Jun 28 16:43:52 MS-CEPH-Proxy1 exportfs(p_export_ceph-ds1)[28549]: INFO: exporting 10.3.20.0/24:/mnt/Ceph-DS1
>> > Jun 28 16:43:52 MS-CEPH-Proxy1 exportfs(p_export_ceph-ds1)[28549]: INFO: directory /mnt/Ceph-DS1 exported
>> > Jun 28 16:43:52 MS-CEPH-Proxy1 crmd[2029]:   notice: Operation p_export_ceph-ds1_start_0: ok (node=MS-CEPH-Proxy1, call=474, rc=0, cib-update=320, confirmed=true)
>> >
>> > If I enable the read/write checks for the FS resource, they also
>> > time out at the same time.
>>
>> What about the syslog that the above corresponds to?
>
> I get exactly the same "_monitor" timeout message.

No "libceph: " or "rbd: " messages at all?  No WARNs or hung tasks?

>
> Is there anything logging-wise I can do with the kernel client to log when an I/O is taking a long time? Something like the slow requests in Ceph, but client-side?

Nothing out of the box, as slow requests are usually not the client
implementation's fault.  Can you put together a script that would
snapshot all files in /sys/kernel/debug/ceph/<cluster-fsid.client-id>/*
on the gateways every second and rotate on an hourly basis?  One of
those files, osdc, lists in-flight requests.  If that's empty when the
timeouts occur, then it's probably not krbd.
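
Something along these lines would do as a rough sketch, assuming
Python 3 on the gateways (the output directory and filenames below are
just examples, adjust to taste):

#!/usr/bin/env python3
# Rough sketch: snapshot the kernel client's debugfs files every
# second, rotating the capture file hourly.  Assumes it runs as root
# (debugfs is root-only) and the default mount at /sys/kernel/debug.
import glob
import os
import time

DEBUGFS = "/sys/kernel/debug/ceph"  # one subdir per <cluster-fsid.client-id>
OUTDIR = "/var/log/ceph-debugfs"    # example capture directory

os.makedirs(OUTDIR, exist_ok=True)

while True:
    # One capture file per hour, e.g. ceph-debugfs-20170628-16.log.
    stamp = time.strftime("%Y%m%d-%H")
    with open(os.path.join(OUTDIR, "ceph-debugfs-%s.log" % stamp), "a") as out:
        out.write("--- %s ---\n" % time.strftime("%Y-%m-%d %H:%M:%S"))
        # osdc (in-flight OSD requests) is the interesting file, but
        # grab everything in the client dir while we're at it.
        for path in sorted(glob.glob(os.path.join(DEBUGFS, "*", "*"))):
            try:
                with open(path) as f:
                    data = f.read()
            except OSError:
                continue  # client dirs can come and go mid-scan
            out.write("== %s ==\n%s\n" % (path, data))
    time.sleep(1)

Then, when pacemaker next reports a monitor timeout, match the
timestamp against the capture and look at what osdc showed at that
moment.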

What Maged said, and also: can you clarify what those "read/write
checks for the FS resource" do exactly?  Do they read/write to the
local xfs on /dev/rbd*, or further up the stack?

Thanks,

                Ilya
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


