Hello!

*** Shameless plug: Sage, I'm working with Dirk Grunwald on this cluster; I believe some of the members of your thesis committee were students of his =)

We have a modest cluster at CU Boulder and are frequently plagued by "requests are blocked" issues. I'd greatly appreciate any insight or pointers. The issue is not specific to any one OSD; I'm pretty sure they've all shown up in ceph health detail at this point (the commands I've been using to chase them are in the PPS below).

We have 8 identical nodes:
- 5 * 1TB Seagate enterprise SAS drives
  - btrfs
- 1 * Intel 480G S3500 SSD
  - with 5 * 16G partitions as journals
  - also hosting the OS, unfortunately
- 64G RAM
- 2 * Xeon E5-2630 v2
  - so 24 hyperthreads @ 2.60 GHz
- 10G-ish IPoIB for networking

So the cluster has 40TB over 40 OSDs total, with a very straightforward crushmap. These nodes are also (unfortunately, for the time being) OpenStack compute nodes, and 99% of the usage is OpenStack volumes/images.

I see a lot of kernel messages like:

    ib_mthca 0000:02:00.0: Async event 16 for bogus QP 00dc0408

which may or may not be correlated with the Ceph hangs.

Other info: we have 3 mons on 3 of the 8 nodes listed above. The openstack volumes pool has 4096 PGs and is sized 3. This is probably too many PGs; it came from an initial misunderstanding of the formula in the documentation (worked numbers in the PPS below).

Thanks,
Matt

PS - I'm trying to secure funds for an additional 8 nodes, with a little less RAM and CPU, to move the OSDs to: dual 10G Ethernet, and a SATA DOM for the OS so the SSD can be strictly journal. I may even be able to get an additional SSD or two per node to use for caching, or simply to set a higher primary affinity.
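
PPS - For concreteness, here's roughly how I've been trying to catch the blocked ops in the act. This is just a sketch: it assumes the OSDs' admin sockets are in the default location, and osd.12 is a placeholder for whichever OSD health detail names.

    # which OSDs currently have slow/blocked requests?
    ceph health detail

    # then, on the node hosting the implicated OSD:
    ceph daemon osd.12 dump_ops_in_flight    # ops currently in progress/stuck
    ceph daemon osd.12 dump_historic_ops     # recent slow ops, with per-event timestamps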
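
And for the record, the arithmetic I should have done for the PG count, using the (OSDs * 100) / replicas rule of thumb from the docs:

    (40 OSDs * 100) / 3 replicas ~= 1333, rounded up to a power of two = 2048

    We created the pool with 4096 instead; at size 3 that's roughly
    4096 * 3 / 40 ~= 300 PG replicas per OSD from this pool alone, vs. the
    ~100 per OSD the docs aim for.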