Re: Interpretation Guidance for Slow Requests

Hi Christian (heh),

thanks for picking this up. :)

This has become a rather long post as I added more details and some of our history, but if we make progress then maybe this can help others in the future. I find slow requests extremely hard to debug and, as I said, aside from scratching my own itch I’d be happy to help future travellers.

> On 6 Dec 2016, at 00:59, Christian Balzer <chibi@xxxxxxx> wrote:
> 
> Hello,
> 
> On Mon, 5 Dec 2016 15:25:37 +0100 Christian Theune wrote:
> 
>> Hi,
>> 
>> we’re currently expanding our cluster to grow the number of IOPS we can provide to clients. We’re still on Hammer but in the process of upgrading to Jewel. 
> 
> You might want to wait until the next Jewel release, given the current
> number of issues.

Our issues or Jewel’s? (This is only half a joke; a friend’s Ceph cluster’s issues on Jewel are making me quite nervous and I’m already re-evaluating whether to postpone.)

>> We started adding pure-SSD OSDs in the last days (based on MICRON
>> S610DC-3840) and the slow requests we’ve seen in the past have started to
>> show a different pattern.
>> 
> I looked in the archives and can't find a full description of your cluster
> hardware in any posts from you or the other Christian (hint, hint). 
> Slow requests can nearly all the time be traced back to HW
> issues/limitations.

We’re currently running on 42% Christians. ;)

We currently have hosts of 2 generations in our cluster. They’re both SuperMicro, sold by Thomas Krenn.

Type 1 (4 Hosts)

    SuperMicro X9DR3-F
    64 GiB RAM
    2 Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz
    LSI MegaRAID SAS 9271-8i
    Dual 10-Gigabit X540-AT2
    OS on RAID 1 SATA HGST HUS724020ALS640 1.818 TB 7.2k
    Journal on 400G Intel MLC NVME PCI-E 3.0 (DC P3600) (I thought those should be DC P3700, maybe my inventory is wrong)

    Pool “rbd.ssd”

        2 OSDs on 800GB Intel DC S3510 SSDSC2BX80 (jbod/raid 0)

    Pool “rbd.hdd”

        5-6 OSDs on SATA HGST HUS724020ALS640 1.818 TB 7.2k (jbod/raid 0)
        1 OSD on SAS MICRON S610DC-3840 3.492 TB (jbod/raid 0)

Type 2 (2 Hosts)

    SuperMicro X9SRL-F
    32 GiB RAM
    2 Intel(R) Xeon(R) CPU E5-2630L v2 @ 2.40GHz
    LSI MegaRAID SAS 9260-4i
    Dual 10-Gigabit X540-AT2
    OS on RAID 1 SATA (HGST HUS724020ALS640 1.818 TB)
    Journal on 400G Intel MLC NVME PCI-E 3.0 (DC P3600) (I thought those should be DC P3700, maybe my inventory is wrong)

    Pool “rbd.ssd”

        2 OSDs on 800GB Intel DC S3510 SSDSC2BX80 (jbod/raid 0)

    Pool “rbd.hdd”

        7 OSDs on HITACHI HUS156060VLS600 558.406 GB 15k (jbod/raid 0)
        1 OSD on SAS MICRON S610DC-3840 3.492 TB (jbod/raid 0)

Network

    10 GE Copper on Brocade VDX 6740-T Switches w/ 20 GBit interconnect
    Storage Access and Cluster network on 2 physically separated ports
    Some KVM servers are still running 1 GE interfaces for access, some 10 GE

Software

    Ceph: 0.94.7
    Filesystems: OS on ext4, OSDs on xfs
    Kernel: 4.4.27 (one host of type 2 still waiting for reboot to upgrade from 4.1.16 in the next days)
    Qemu: 2.7 with librbd

Overall we’ve been happier with the second type of host (the 15k disks obviously don’t reach their limit in a mixed environment), but generally we’ve been unhappy with having RAID controllers in there at all. We’re going for pure HBA setups with newer machines.

This cluster setup has been with us for a few years now and has seen various generations of configurations, ranging from the initial huge OSDs on RAID 5, over a temporary 2-disk RAID 0 setup, to CacheCade (until it died ungracefully). Now it’s just single-disk jbod/raid 0 with journals on NVMe SSD. We added a pure SSD pool with inline journals a while ago, which has not shown any slow requests for the last 14 days or so. We added 10GE about 18 months ago and migrated that from a temporary solution to Brocade in September.

> Have you tested these SSDs for their suitability as a Ceph journal (I
> assume inline journals)? 
> This has been discussed here countless times, google.

We’re testing them right now. We stopped testing this in our development environment as any learning aside from “works generally”/“doesn’t work at all” has not been portable to our production load at all. All our experiments trying to reliably replicate production traffic in a reasonable way have failed.

I’ve FIOed them and was able to see their spec’d write IOPS (18k random 4k IOPS at a queue depth of 32) and throughput. I wasn’t able to see 100k read IOPS but maxed out around 30k.
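
For completeness, this is roughly the single-job synchronous 4k write test I understand is usually meant for journal suitability; a minimal sketch, the device path is a placeholder and the run is destructive, so only against an unused drive:

    # single-threaded O_DSYNC 4k writes, roughly the journal write pattern
    fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k \
        --numjobs=1 --iodepth=1 --runtime=60 --time_based \
        --group_reporting --name=journal-test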

> If those SSDs aren't up to snuff, they may be worse than plain disks in
> some cases.
> 
> In addition, these SSDs have an endurance of 1 DWPD, less than 0.5 when
> factoring journal and FS overheads and write amplification scenarios.
> I'd be worried about wearing these out long before 5 years are over.

I reviewed the endurance of our existing hard drives: they have seen less than 0.05 DWPD since we added them. I wasn’t able to get the NVMe drives to report how much data they have actually written, so I can’t compute their endurance yet. The Intel tool currently shows an endurance analyzer of 14.4 years, which might be a completely useless number due to aggressive over-provisioning of the SSD.

However, I’ll be monitoring this more closely on the Microns and not worry about the full 5 years just yet. 
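
For the monitoring I’m planning to simply track the written-bytes counters from SMART; a sketch of what I have in mind, device paths are placeholders and the exact attributes differ between the drive types:

    # SATA SSDs: Total_LBAs_Written attribute (where the drive reports it)
    smartctl -A /dev/sdX | grep -i Total_LBAs_Written
    # SAS drives (Micron): 'smartctl -a' lists gigabytes processed in the error counter log
    smartctl -a /dev/sdY
    # NVMe journals: data_units_written (one unit = 512,000 bytes)
    nvme smart-log /dev/nvme0n1 | grep -i data_units_written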

>> I’m currently seeing those: 
>> 
>> 2016-12-05 15:13:37.527469 osd.60 172.22.4.46:6818/19894 8080 : cluster [WRN] 5 slow requests, 1 included below; oldest blocked for > 31.675358 secs
>> 2016-12-05 15:13:37.527478 osd.60 172.22.4.46:6818/19894 8081 : cluster [WRN] slow request 31.674886 seconds old, received at 2016-12-05 15:13:05.852525: osd_op(client.518589944.0:2734750 rbd_data.1e2b40f879e2a9e3.00000000000000a2 [stat,set-alloc-hint object_size 4194304 write_size 4194304,write 1892352~4096] 277.ceaf1c22 ack+ondisk+write+known_if_redirected e1107736) currently waiting for rw locks
>> 
> Is osd.60 one of these SSDs?

It is. However, I could show the same for any other OSD in the pool - this isn’t restricted to the Micron SSDs but also shows up for the 1.8 TB SATA drives. I reviewed all the logs I have from the last two weeks and, out of all complaints, those have consistently made up ~30% of the slow requests we see. The breakdown over that period (a one-liner to reproduce the tally follows the list):

currently waiting for subops from … (~35%)
currently waiting for rw locks (~30%)
currently no flag points reached (~20%)
currently commit_sent (<5%)
currently reached_pg (<3%)
currently started (<1%)
currently waiting for scrub (extremely rare)
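
The percentages come from the cluster log; something along these lines reproduces the tally (log paths as in our setup):

    # count slow request states across plain and rotated/gzipped cluster logs
    zgrep -h 'slow request' /var/log/ceph/ceph.log* \
        | grep -o 'currently [a-z_ ]*' | sort | uniq -c | sort -rn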

The statistics for “waiting for subops” include OSD 60 (~1886 occurrences) far more often than any other OSD, even though all of the others also show a significant amount (~100-400). My current guess is that OSD 60 temporarily got more than its fair share of PGs and a lot of the movement on the cluster just affected it much more. I added 3 of the Micron SSDs at the same time and the two others are well within the normal range. I don’t have the distribution over time, so I can’t prove that right now. The Microns do have a) a higher primary affinity and b) twice the size of the normal SATA drives, so more requests ending up there is expected.

Looking at iostat I don’t see them building up large backlogs - at least I can’t spot any interactively.

Looking at the specific traffic that Ceph with an inline journal puts on the device: during backfill it appears to be saturated from a bandwidth perspective. I see around 100 MB/s going in at around 1k write IOPS. During this time, iostat shows an average queue length of around 200 requests and an average wait of around 150 ms, which is quite high. Adding a 4k FIO [1] at the same time doesn’t change throughput or IOPS significantly. Stopping the OSD then shows IOPS going up to 12k r/w, though of course bandwidth is much lower at that point.
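
For reference, those queue length and wait numbers are from the extended device statistics, something along these lines (the device name is a placeholder):

    # extended per-device stats every 5 seconds; avgqu-sz and await are the interesting columns
    iostat -x -d 5 /dev/sdX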

Something we have looked out for again but weren’t able to find something on is whether the controller might be limiting us in some way. Then again: the Intel SSDs are attached to the same controller and have not shown *any* slow request at all. All the slow requests are defined the the rbd.hdd pool (which now includes the Microns as part of a migration strategy).

I’m also wondering whether we may be CPU bound, but the stats do not indicate that. We’re running the “ondemand” governor and I’ve seen references to people preferring the “performance” governor. However, I would expect that we’d see significantly reduced idle times if this were an issue.
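
If anyone wants to compare: checking and switching the governor is just sysfs, e.g. (a sketch, no claim that this actually helps):

    # show the current governor per core
    cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | sort | uniq -c
    # switch everything to "performance" for a test (as root)
    for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo performance > $g; done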

> Tools like atop and iostat can give you a good insight on how your storage
> subsystem is doing.

Right. We have collectd running, including the Ceph plugin, which helps with historical data. At the moment I’m not getting much out of the interactive tools, as any issue will have resolved itself by the time I’m logged in. ;)

> The moment you see slow requests, your cluster has issues you need to
> address.
> Especially if this happens when no scrubs are going on (restrict scrubs
> to off-peak hours).

I’ve seen this happen when no scrubs are running. I currently cannot pinpoint the performance issues to ongoing scrubs. We have limited scrubs to a low load threshold but run them regularly during the day.
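
Once we’re on Jewel I’d probably also look at pinning (deep) scrubs to off-peak hours via the scrub hour settings, roughly like this (untested on our side, and the load threshold still applies on top):

    [osd]
    # only start scrubs between 01:00 and 06:00 local time
    osd scrub begin hour = 1
    osd scrub end hour = 6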

> The 30 seconds is an arbitrary value and the WRN would warrant an ERR in
> my book, as some applications take a very dim view on being blocked for
> more than few seconds.

Absolutely: we see adverse effects within VMs much earlier and have lowered the complaint time to 15 seconds due to that. 

Here’s the Ceph config we’re running with. Some of the values are temporarily tuned at runtime to support operations (like max backfills, etc.):

[global]
fsid = ...
public network = 172.22.4.0/24
cluster network = 172.22.8.0/24
pid file = /run/ceph/$type-$id.pid
# should correspond to the setting in /etc/init.d/ceph
max open files = 262144
mon host = <3 mons>
mon osd nearfull ratio = .9
osd pool default min size = 1
osd pool default size = 2
# recovery sleep becomes available in infernalis.
# osd recovery sleep = 0.005
osd snap trim sleep = 0.05
osd scrub sleep = 0.05

[client]
log file = /var/log/ceph/client.log
rbd cache = true
rbd default format = 2
admin socket = /run/ceph/rbd-$pid-$cctid.asok

[mon]
admin socket = /run/ceph/$cluster-$name.asok
mon addr = 172.22.4.42:6789,172.22.4.43:6789,172.22.4.54:6789
mon data = /srv/ceph/mon/$cluster-$id
mon data avail crit = 2
mon data avail warn = 5
mon osd allow primary affinity = true
mon pg warn max per osd = 3000

[mon.cartman07]
host = cartman07
mon addr = 172.22.4.43:6789
public addr = 172.22.4.43:6789
cluster addr = 172.22.8.13:6789

[osd]
host = cartman07
admin socket = /run/ceph/$cluster-$name.asok
public addr = 172.22.4.43
cluster addr = 172.22.8.13
filestore fiemap = true
filestore max sync interval = 5
# slightly more than 1 week to avoid hitting the same day the same time
osd deep scrub interval = 700000
# place this between "nearfull" and "full"
osd backfill full ratio = 0.92
# 2* 5 seconds * 200MB/s disk throughput = 2000MB
osd journal size = 2000
osd max backfills = 1
osd op complaint time = 15
osd recovery max active = 1
osd recovery op priority = 1
osd scrub load threshold = 2

Something I started playing around with today was looking up specific slow requests via the admin sockets. Here’s what happened a bit earlier. Something I see is that the request flaps between “reached pg” and “waiting for rw locks”. I’ll provide a set of 3 correlated dumps: I started with OSD 30 (Micron) and pulled the ones from OSDs 33 and 22 as they are referenced by the “waiting for subops from” entries.
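
In case it matters how these were pulled: the dumps come straight from the OSD admin sockets (paths per our config below), e.g. for osd.30:

    # in-flight and recently completed ops for osd.30
    ceph --admin-daemon /run/ceph/ceph-osd.30.asok dump_ops_in_flight
    ceph --admin-daemon /run/ceph/ceph-osd.30.asok dump_historic_ops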

Attachment: historic-ops-osd30.txt.gz
Description: GNU Zip compressed data

Attachment: historic-ops-osd22.txt.gz
Description: GNU Zip compressed data

Attachment: historic-ops-osd33.txt.gz
Description: GNU Zip compressed data


I’m also looking forward to seeing how the cluster behaves tomorrow when it finishes its current backfill, after we added the last 3 Microns tonight (so 6 in total now).

Thanks for listening, 
Christian

[1] fio --rw=readwrite --name=test --size=5000M --nrfiles=1 --numjobs=3 --loops=1000 --direct=1  --rwmixread=50 --bs=4k --runtime=20000 --iodepth=32

-- 
Christian Theune · ct@xxxxxxxxxxxxxxx · +49 345 219401 0
Flying Circus Internet Operations GmbH · http://flyingcircus.io
Forsterstraße 29 · 06112 Halle (Saale) · Deutschland
HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
