Re: requests are blocked > 32 sec woes

On 02/09/2015 08:20 AM, Gregory Farnum wrote:
> There are a lot of next steps on 
> http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/
> 
> You probably want to look at the bits about using the admin socket, and 
> diagnosing slow requests. :) -Greg

Yeah, I've been through most of that. It's still been difficult to pinpoint
what's causing the blocking. Can I get some clarification on this comment:

> Ceph acknowledges writes after journaling, so fast SSDs are an attractive 
> option to accelerate the response time–particularly when using the ext4 or 
> XFS filesystems. By contrast, the btrfs filesystem can write and journal 
> simultaneously.

Does this mean btrfs doesn't need a separate journal partition/block device?
I.e., is what ceph-disk does when creating with --fs-type btrfs entirely
non-optimal (it creates a 5G journal partition and uses the rest as a btrfs
partition)?

I just don't get the "by contrast." If the OSD is btrfs on a rotational disk,
why doesn't putting the journal on an SSD help (or help as much?) if writes
are acknowledged after journaling?
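For reference, the admin-socket checks I've been running from that page look
roughly like this (osd.0 is just a placeholder; substitute whichever OSD shows
up in health detail, and run the daemon commands on the node hosting it):

```shell
# Which OSDs are implicated right now
ceph health detail | grep blocked

# Ops currently stuck, and recently completed slow ops
ceph daemon osd.0 dump_ops_in_flight
ceph daemon osd.0 dump_historic_ops

# Perf counters, to see whether journal latency is the bottleneck
ceph daemon osd.0 perf dump
```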


> On Sun, Feb 8, 2015 at 8:48 PM, Matthew Monaco <matt@xxxxxxxxx> wrote:
>> Hello!
>> 
>> *** Shameless plug: Sage, I'm working with Dirk Grunwald on this cluster; I
>> believe some of the members of your thesis committee were students of his =)
>> 
>> We have a modest cluster at CU Boulder and are frequently plagued by 
>> "requests are blocked" issues. I'd greatly appreciate any insight or 
>> pointers. The issue is not specific to any one OSD; I'm pretty sure
>> they've all shown up in ceph health detail at this point.
>> 
>> We have 8 identical nodes:
>> 
>> - 5 * 1TB Seagate enterprise SAS drives (btrfs)
>> - 1 * Intel 480G S3500 SSD
>>   - with 5 * 16G partitions as journals
>>   - also hosting the OS, unfortunately
>> - 64G RAM
>> - 2 * Xeon E5-2630 v2, so 24 hyperthreads @ 2.60 GHz
>> - 10G-ish IPoIB for networking
>> 
>> So the cluster has 40TB over 40 OSDs total with a very straightforward 
>> crushmap. These nodes are also (unfortunately for the time being)
>> OpenStack compute nodes and 99% of the usage is OpenStack volumes/images. I
>> see a lot of kernel messages like:
>> 
>> ib_mthca 0000:02:00.0: Async event 16 for bogus QP 00dc0408
>> 
>> which may or may not be correlated w/ the Ceph hangs.
>> 
>> Other info: we have 3 mons on 3 of the 8 nodes listed above. The OpenStack
>> volumes pool has 4096 PGs and a replication size of 3. This is probably too
>> many PGs, but that came from an initial misunderstanding of the formula in
>> the documentation.
>> 
>> Thanks, Matt
>> 
>> 
>> PS - I'm trying to secure funds for an additional 8 nodes with a little
>> less RAM and CPU to move the OSDs to, with dual 10G Ethernet and a SATA
>> DOM for the OS so the SSD can be strictly journal. I may even be able to
>> get an additional SSD or two per node to use for caching, or simply to set
>> a higher primary affinity.
>> 
>> 
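For what it's worth, redoing the pg_num arithmetic with the rule of thumb from
the docs (roughly (OSDs * 100) / replicas, rounded up to the next power of two)
gives:

```shell
# pg_num rule of thumb: (num_osds * 100) / replicas, rounded up to a power of 2.
# Numbers below are this cluster's.
osds=40
replicas=3
raw=$(( osds * 100 / replicas ))   # 1333
pg=1
while [ "$pg" -lt "$raw" ]; do
  pg=$(( pg * 2 ))
done
echo "$pg"                          # 2048, vs the 4096 we actually created
```

So 2048 would have been closer to the recommendation, though as far as I
understand, too many PGs mostly costs memory and peering time rather than
directly causing blocked requests.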


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
