On Mon, Feb 9, 2015 at 7:12 PM, Matthew Monaco <matt@xxxxxxxxx> wrote:
> On 02/09/2015 08:20 AM, Gregory Farnum wrote:
>> There are a lot of next steps on
>> http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/
>>
>> You probably want to look at the bits about using the admin socket, and
>> diagnosing slow requests. :)
>> -Greg
>
> Yeah, I've been through most of that. It's still been difficult to
> pinpoint what's causing the blocking. Can I get some clarification on
> this comment:
>
>> Ceph acknowledges writes after journaling, so fast SSDs are an
>> attractive option to accelerate the response time–particularly when
>> using the ext4 or XFS filesystems. By contrast, the btrfs filesystem
>> can write and journal simultaneously.
>
> Does this mean btrfs doesn't need separate journal partition/block
> device? I.e., is what ceph-disk does when creating with --fs-type btrfs
> entirely non-optimal (creates a 5G journal partition and the rest a
> btrfs partition).
>
> I just don't get the "by contrast." If the OSD is btrfs+rotational,
> then why doesn't putting the journal on an SSD help (as much?) if
> writes are returned after journaling?

Yeah, that's not quite the best phrasing. btrfs' parallel journaling can
be a big advantage in all-spinner cases where under the right kinds of
load the filesystem actually has a chance of committing data to disk
faster than the journal does. There aren't many situations where that's
likely, though — it's more useful for direct librados users who might
want to proceed once data is readable rather than when it's durable.
That's not an option with xfs.
-Greg

>
>> On Sun, Feb 8, 2015 at 8:48 PM, Matthew Monaco <matt@xxxxxxxxx> wrote:
>>> Hello!
>>>
>>> *** Shameless plug: Sage, I'm working with Dirk Grunwald on this
>>> cluster; I believe some of the members of your thesis committee were
>>> students of his =)
>>>
>>> We have a modest cluster at CU Boulder and are frequently plagued by
>>> "requests are blocked" issues. I'd greatly appreciate any insight or
>>> pointers. The issue is not specific to any one OSD; I'm pretty sure
>>> they've all showed up in ceph health detail at this point.
>>>
>>> We have 8 identical nodes:
>>>
>>> - 5 * 1TB Seagate enterprise SAS drives - btrfs
>>> - 1 * Intel 480G S3500 SSD
>>>   - with 5*16G partitions as journals
>>>   - also hosting the OS, unfortunately
>>> - 64G RAM
>>> - 2 * Xeon E5-2630 v2 - So 24 hyperthreads @ 2.60 GHz
>>> - 10G-ish IPoIB for networking
>>>
>>> So the cluster has 40TB over 40 OSDs total with a very straightforward
>>> crushmap. These nodes are also (unfortunately for the time being)
>>> OpenStack compute nodes and 99% of the usage is OpenStack
>>> volumes/images. I see a lot of kernel messages like:
>>>
>>> ib_mthca 0000:02:00.0: Async event 16 for bogus QP 00dc0408
>>>
>>> which may or may not be correlated w/ the Ceph hangs.
>>>
>>> Other info: we have 3 mons on 3 of the 8 nodes listed above. The
>>> openstack volumes pool has 4096 pgs and is sized 3. This is probably
>>> too many PGs, but came from an initial misunderstanding of the formula
>>> in the documentation.
>>>
>>> Thanks,
>>> Matt
>>>
>>>
>>> PS - I'm trying to secure funds to get an additional 8 nodes with a
>>> little less RAM and CPU to move the OSDs to, with dual 10G Ethernet,
>>> and a SATA DOM for the OS so the SSD will be strictly journal. I may
>>> even be able to get an additional SSD or two per-node to use for
>>> caching or simply to set a higher primary affinity
>>>
>>>
>>> _______________________________________________
>>> ceph-users mailing list
>>> ceph-users@xxxxxxxxxxxxxx
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>
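For anyone chasing a similar "requests are blocked" report, the
admin-socket checks referenced above look roughly like the following;
osd.12 and the socket path are placeholders, and the exact output fields
vary a bit between releases:

    # Which OSDs currently have slow/blocked requests?
    ceph health detail

    # Per-OSD filestore commit/apply latency; a consistently slow journal
    # or data disk usually stands out here.
    ceph osd perf

    # On the node hosting a suspect OSD (osd.12 is a placeholder), ask its
    # admin socket what it is working on right now and which recent ops
    # were slow.
    ceph --admin-daemon /var/run/ceph/ceph-osd.12.asok dump_ops_in_flight
    ceph --admin-daemon /var/run/ceph/ceph-osd.12.asok dump_historic_ops
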
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
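A rough sketch of the FileStore journal-mode settings behind the doc
wording discussed in this thread, using option names from the filestore
configuration reference; the values are illustrative only, since
FileStore already defaults to writeahead on xfs/ext4 and to parallel on
btrfs:

    [osd]
    # writeahead: commit to the journal first, acknowledge the write, then
    # apply it to the filesystem; the default on xfs/ext4, which is why a
    # fast SSD journal helps write latency there.
    filestore journal writeahead = true

    # parallel: journal and filesystem writes proceed concurrently; only
    # safe on btrfs (and the default there), because btrfs snapshots give
    # the OSD a consistent point to replay from.
    #filestore journal parallel = true

Either way FileStore still writes through the journal partition that
ceph-disk creates, so putting a btrfs OSD's journal on an SSD still helps
ordinary acknowledged writes; parallel mode only pays off in the narrow
case Greg describes, where the filesystem manages to commit before the
journal does.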