I have 3 nodes, each running a MON and 30 OSDs. When I test the cluster with either rados bench or with fio via a 10GbE client using RBD (representative commands at the end of this message), I get great initial speeds of >900 MB/s and max out my 10GbE links for a while. Then something goes wrong: performance falters and the cluster stops responding altogether. I see a monitor call for a new election, then my OSDs mark each other down and complain that they've been wrongly marked down, and I get slow request warnings of >30 and >60 seconds. This eventually resolves itself and the cluster recovers, but then it recurs right away. Sometimes, via fio, I'll get an I/O error and it will bail.

The amount of time before the cluster starts acting up varies: sometimes it runs fine for hours, sometimes it fails after 10 seconds. Nothing significant shows up in dmesg. A snippet from ceph-osd.77.log (for example) is at: http://pastebin.com/Zb92Ei7a

I'm not sure why I can run at full speed for a little while, or what the problem is when it stops working. Please help!

My nodes:
Ubuntu 14.04 - Linux storage3 3.13.0-32-generic #57-Ubuntu SMP Tue Jul 15 03:51:08 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
ceph version 0.80.5 (38b73c67d375a2552d8ed67843c8a65c2c0feba6)
2 x 6-core Xeon 2620s
64GB RAM
30 x 3TB Seagate ST3000DM001-1CH166
6 x 128GB Samsung 840 Pro SSD
1 x dual-port Broadcom NetXtreme II 5771x/578xx 10GbE
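For reference, the benchmark invocations were along these lines. The pool and image names below are placeholders, and the exact flags from my runs may have differed slightly:

    # RADOS level: 60 seconds of 4 MB writes with 16 concurrent ops
    # (pool name is a placeholder)
    rados bench -p testpool 60 write -t 16 --no-cleanup

    # RBD level, run from the 10GbE client via fio's rbd engine
    # (pool/image names and flags are illustrative)
    fio --name=rbd-seq-write --ioengine=rbd --clientname=admin \
        --pool=testpool --rbdname=benchimg \
        --rw=write --bs=4M --iodepth=16 --direct=1 --size=10G

The problem shows up with both tools, so it doesn't appear to be specific to the RBD path.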