Re: Deadly slow Ceph cluster revisited

On Fri, Jul 17, 2015 at 12:19 PM, Mark Nelson <mnelson@xxxxxxxxxx> wrote:
> Maybe try some iperf tests between the different OSD nodes in your
> cluster and also the client to the OSDs.

This proved to be an excellent suggestion.  One of these is not like the others:

f16 inbound: 6Gbps
f16 outbound: 6Gbps
f17 inbound: 6Gbps
f17 outbound: 6Gbps
f18 inbound: 6Gbps
f18 outbound: 1.2Mbps
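For anyone wanting to repeat this, the numbers above came from pairwise iperf runs along these lines (hostnames as above; plain iperf defaults, nothing exotic):

```shell
# Start a listener on each OSD node (f16, f17, f18):
#   iperf -s
# Then, from each peer node and from the client, measure throughput
# toward every listener for 10 seconds:
for host in f16 f17 f18; do
    echo "=== to $host ==="
    iperf -c "$host" -t 10
done
# Repeat from each node in turn so both directions get covered --
# that's how f18's outbound leg stood out.
```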

There is flatly no explanation for the outbound performance on f18.
There are no errors in ifconfig/netstat, nothing logged on the switch,
etc.  Even with tcpdump running during iperf, there aren't retransmits
or anything.  It's just slow.
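For the record, the checks that came up empty looked roughly like this (the interface name is a placeholder for whatever slave your bond uses):

```shell
IF=eth2   # hypothetical name for the bond's 10GbE slave

ifconfig "$IF" | grep -iE 'errors|dropped'   # per-interface error counters
netstat -s | grep -i retrans                 # TCP retransmission totals
ethtool -S "$IF" | grep -iE 'err|drop'       # NIC-level stats

# Capture during an iperf run, inspect for retransmits afterwards
# (5001 is iperf's default port):
tcpdump -i "$IF" -w iperf.pcap 'port 5001'
```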

ifconfig'ing the primary bond interface down immediately resolved the
problem.  The iostat running in the virtual machine immediately surged
to 500+ IOPS and 40M-60M/sec.
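Concretely, forcing the failover amounts to something like this (bond and slave names are guesses; substitute whatever `/proc/net/bonding/` shows on your nodes):

```shell
# See which slave is currently carrying traffic on the bond:
grep 'Currently Active Slave' /proc/net/bonding/bond0

# Down the primary slave; the bond fails over to the backup:
ifconfig eth2 down    # eth2 = hypothetical primary slave

# Bringing it back up later returns traffic to the primary:
ifconfig eth2 up
```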

Weirdly, ifconfig'ing the primary device back up did not bring the
problem back.  It switched back to that interface, but everything is
still fine (and iperf gives 6Gbps) at the moment.  There's no way of
telling if that will last, but it's a solid lead either way.

These are onboard Intel dual-port X540s using the ixgbe driver.  If it
were a driver problem, we've got tons of these, so I'd expect to see
this problem elsewhere.  If it's a hardware problem, ifconfig down/up
doesn't seem like it would "fix" it.  Very mysterious!

Thanks!
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


