On Thu, Jun 29, 2017 at 4:30 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
> Hi All,
>
> Putting out a call for help to see if anyone can shed some light on this.
>
> Configuration:
> Ceph cluster presenting RBDs -> XFS -> NFS -> ESXi
> Running 10.2.7 on the OSDs and a 4.11 kernel on the NFS gateways in a
> pacemaker cluster
> Both OSDs and clients go into a pair of switches, single L2 domain (no
> sign from pacemaker that there are network connectivity issues)
>
> Symptoms:
> - All RBDs on a single client randomly hang for 30s to several minutes,
>   confirmed by pacemaker and by the ESXi hosts complaining

Hi Nick,

What is a "single client" here?

> - Cluster load is minimal when this happens most times

Can you post gateway syslog and point at when this happened?
Corresponding pacemaker excerpts won't hurt either.

> - All other clients with RBDs are not affected (same RADOS pool), so it
>   seems more of a client issue than a cluster issue
> - It looks like pacemaker tries to also stop the RBD+FS resource, but this
>   also hangs
> - Eventually pacemaker succeeds in stopping the resources and immediately
>   restarts them, and IO returns to normal
> - No errors, slow requests, or any other non-normal Ceph status is reported
>   on the cluster or in ceph.log
> - Client logs show nothing apart from pacemaker
>
> Things I've tried:
> - Different kernels (potentially happened less with older kernels, but I
>   can't be 100% sure)

But still happened?  Do you have a list of all the kernels you've tried?

> - Disabling scrubbing and anything else that could be causing high load
> - Enabling kernel RBD debugging (the problem maybe happens a couple of
>   times a day, so debug logging was not practical as I can't reproduce it
>   on demand)

When did it start occurring?  Can you think of any configuration changes
that might have been the trigger, or is this a new setup?

Thanks,

                Ilya
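
P.S. Since leaving rbd debug logging on all day isn't practical, one option
is to enable kernel dynamic debug for the rbd and libceph modules only for a
short window around the times the hangs tend to show up, and read the output
back from the ring buffer afterwards.  Below is a minimal, untested Python
sketch of that idea; it assumes debugfs is mounted at /sys/kernel/debug,
that the gateway kernel was built with CONFIG_DYNAMIC_DEBUG, and the
300-second default window is just a placeholder.  Be warned that "+p" on
rbd/libceph is very verbose.

#!/usr/bin/env python3
"""Sketch: enable kernel rbd/libceph dynamic debug for a bounded window,
then switch it off again, so the ring buffer isn't flooded all day.
Run as root on the NFS gateway; read the output back with `dmesg -T`."""

import sys
import time

# Assumes debugfs is mounted here and CONFIG_DYNAMIC_DEBUG is enabled.
CONTROL = "/sys/kernel/debug/dynamic_debug/control"
MODULES = ("rbd", "libceph")

def set_debug(enabled):
    flag = "+p" if enabled else "-p"
    for mod in MODULES:
        # One command per write(), e.g. "module rbd +p".
        with open(CONTROL, "w") as ctl:
            ctl.write("module %s %s\n" % (mod, flag))

def main():
    # Window length in seconds; 300 is an arbitrary placeholder.
    window = int(sys.argv[1]) if len(sys.argv) > 1 else 300
    set_debug(True)
    try:
        time.sleep(window)
    finally:
        # Always turn debugging back off, even on Ctrl-C.
        set_debug(False)

if __name__ == "__main__":
    main()

Something like this could be kicked off from cron around the busy periods or
from a pacemaker alert, and once a hang has been confirmed the corresponding
window can be pulled out of `dmesg -T` or the gateway syslog.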