On Mon, Feb 9, 2015 at 7:12 PM, Matthew Monaco <matt@xxxxxxxxx> wrote:
> On 02/09/2015 08:20 AM, Gregory Farnum wrote:
>> There are a lot of next steps on
>> http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/
>>
>> You probably want to look at the bits about using the admin socket, and
>> diagnosing slow requests. :)
>> -Greg
>
> Yeah, I've been through most of that. It's still been difficult to
> pinpoint what's causing the blocking. Can I get some clarification on
> this comment:
>
>> Ceph acknowledges writes after journaling, so fast SSDs are an
>> attractive option to accelerate the response time–particularly when
>> using the ext4 or XFS filesystems. By contrast, the btrfs filesystem
>> can write and journal simultaneously.
>
> Does this mean btrfs doesn't need separate journal partition/block
> device? I.e., is what ceph-disk does when creating with --fs-type btrfs
> entirely non-optimal (creates a 5G journal partition and the rest a
> btrfs partition).
>
> I just don't get the "by contrast." If the OSD is btrfs+rotational,
> then why doesn't putting the journal on an SSD help (as much?) if
> writes are returned after journaling?

Yeah, that's not quite the best phrasing. btrfs' parallel journaling can
be a big advantage in all-spinner cases where under the right kinds of
load the filesystem actually has a chance of committing data to disk
faster than the journal does. There aren't many situations where that's
likely, though — it's more useful for direct librados users who might
want to proceed once data is readable rather than when it's durable.
That's not an option with xfs.
-Greg

>
>> On Sun, Feb 8, 2015 at 8:48 PM, Matthew Monaco <matt@xxxxxxxxx> wrote:
>>> Hello!
>>>
>>> *** Shameless plug: Sage, I'm working with Dirk Grunwald on this
>>> cluster; I believe some of the members of your thesis committee were
>>> students of his =)
>>>
>>> We have a modest cluster at CU Boulder and are frequently plagued by
>>> "requests are blocked" issues. I'd greatly appreciate any insight or
>>> pointers. The issue is not specific to any one OSD; I'm pretty sure
>>> they've all showed up in ceph health detail at this point.
>>>
>>> We have 8 identical nodes:
>>>
>>> - 5 * 1TB Seagate enterprise SAS drives - btrfs
>>> - 1 * Intel 480G S3500 SSD
>>>   - with 5*16G partitions as journals
>>>   - also hosting the OS, unfortunately
>>> - 64G RAM
>>> - 2 * Xeon E5-2630 v2 - So 24 hyperthreads @ 2.60 GHz
>>> - 10G-ish IPoIB for networking
>>>
>>> So the cluster has 40TB over 40 OSDs total with a very straightforward
>>> crushmap. These nodes are also (unfortunately for the time being)
>>> OpenStack compute nodes and 99% of the usage is OpenStack
>>> volumes/images. I see a lot of kernel messages like:
>>>
>>> ib_mthca 0000:02:00.0: Async event 16 for bogus QP 00dc0408
>>>
>>> which may or may not be correlated w/ the Ceph hangs.
>>>
>>> Other info: we have 3 mons on 3 of the 8 nodes listed above. The
>>> openstack volumes pool has 4096 pgs and is sized 3. This is probably
>>> too many PGs, but came from an initial misunderstanding of the formula
>>> in the documentation.
>>>
>>> Thanks,
>>> Matt
>>>
>>>
>>> PS - I'm trying to secure funds to get an additional 8 nodes with a
>>> little less RAM and CPU to move the OSDs to, with dual 10G Ethernet,
>>> and a SATA DOM for the OS so the SSD will be strictly journal. I may
>>> even be able to get an additional SSD or two per-node to use for
>>> caching or simply to set a higher primary affinity
>>>
>>>
>>> _______________________________________________
>>> ceph-users mailing list
>>> ceph-users@xxxxxxxxxxxxxx
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>
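For anyone chasing a similar "requests are blocked" report, the
admin-socket checks referenced above look roughly like the following;
osd.12 and the socket path are placeholders, and the exact output fields
vary a bit between releases:

    # Which OSDs currently have slow/blocked requests?
    ceph health detail

    # Per-OSD filestore commit/apply latency; a consistently slow journal
    # or data disk usually stands out here.
    ceph osd perf

    # On the node hosting a suspect OSD (osd.12 is a placeholder), ask its
    # admin socket what it is working on right now and which recent ops
    # were slow.
    ceph --admin-daemon /var/run/ceph/ceph-osd.12.asok dump_ops_in_flight
    ceph --admin-daemon /var/run/ceph/ceph-osd.12.asok dump_historic_ops
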
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
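A rough sketch of the FileStore journal-mode settings behind the doc
wording discussed in this thread, using option names from the filestore
configuration reference; the values are illustrative only, since
FileStore already defaults to writeahead on xfs/ext4 and to parallel on
btrfs:

    [osd]
    # writeahead: commit to the journal first, acknowledge the write, then
    # apply it to the filesystem; the default on xfs/ext4, which is why a
    # fast SSD journal helps write latency there.
    filestore journal writeahead = true

    # parallel: journal and filesystem writes proceed concurrently; only
    # safe on btrfs (and the default there), because btrfs snapshots give
    # the OSD a consistent point to replay from.
    #filestore journal parallel = true

Either way FileStore still writes through the journal partition that
ceph-disk creates, so putting a btrfs OSD's journal on an SSD still helps
ordinary acknowledged writes; parallel mode only pays off in the narrow
case Greg describes, where the filesystem manages to commit before the
journal does.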