Hello!

*** Shameless plug: Sage, I'm working with Dirk Grunwald on this cluster; I believe some of the members of your thesis committee were students of his =)

We have a modest cluster at CU Boulder and are frequently plagued by "requests are blocked" issues. I'd greatly appreciate any insight or pointers. The issue is not specific to any one OSD; I'm pretty sure they've all shown up in ceph health detail at this point (the commands I've been using to chase them are in the PPS below).

We have 8 identical nodes:
- 5 * 1TB Seagate enterprise SAS drives
  - btrfs
- 1 * Intel 480G S3500 SSD
  - with 5 * 16G partitions as journals
  - also hosting the OS, unfortunately
- 64G RAM
- 2 * Xeon E5-2630 v2
  - so 24 hyperthreads @ 2.60 GHz
- 10G-ish IPoIB for networking

So the cluster has 40TB over 40 OSDs total, with a very straightforward crushmap. These nodes are also (unfortunately, for the time being) OpenStack compute nodes, and 99% of the usage is OpenStack volumes/images.

I see a lot of kernel messages like:

    ib_mthca 0000:02:00.0: Async event 16 for bogus QP 00dc0408

which may or may not be correlated with the Ceph hangs.

Other info: we have 3 mons on 3 of the 8 nodes listed above. The openstack volumes pool has 4096 PGs and is sized 3. This is probably too many PGs; it came from an initial misunderstanding of the formula in the documentation (worked numbers in the PPS below).

Thanks,
Matt

PS - I'm trying to secure funds for an additional 8 nodes, with a little less RAM and CPU, to move the OSDs to: dual 10G Ethernet, and a SATA DOM for the OS so the SSD can be strictly journal. I may even be able to get an additional SSD or two per node to use for caching, or simply to set a higher primary affinity.
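
PPS - For concreteness, here's roughly how I've been trying to catch the blocked ops in the act. This is just a sketch: it assumes the OSDs' admin sockets are in the default location, and osd.12 is a placeholder for whichever OSD health detail names.

    # which OSDs currently have slow/blocked requests?
    ceph health detail

    # then, on the node hosting the implicated OSD:
    ceph daemon osd.12 dump_ops_in_flight    # ops currently in progress/stuck
    ceph daemon osd.12 dump_historic_ops     # recent slow ops, with per-event timestamps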
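
And for the record, the arithmetic I should have done for the PG count, using the (OSDs * 100) / replicas rule of thumb from the docs:

    (40 OSDs * 100) / 3 replicas ~= 1333, rounded up to a power of two = 2048

    We created the pool with 4096 instead; at size 3 that's roughly
    4096 * 3 / 40 ~= 300 PG replicas per OSD from this pool alone, vs. the
    ~100 per OSD the docs aim for.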