Re: Interpretation Guidance for Slow Requests

On Dec 5, 2016, at 9:42 PM, Christian Balzer <chibi@xxxxxxx> wrote:


Hello,

On Tue, 6 Dec 2016 03:37:32 +0100 Christian Theune wrote:

Hi Christian (heh),

thanks for picking this up. :)

This has become a rather long post as I added more details and gave our
history, but if we make progress then maybe this can help others in
the future. I find slow requests extremely hard to debug and as I said:
aside from scratching my own itch, I’d be happy to help future
travellers.

On 6 Dec 2016, at 00:59, Christian Balzer <chibi@xxxxxxx> wrote:

Hello,

On Mon, 5 Dec 2016 15:25:37 +0100 Christian Theune wrote:

Hi,

we’re currently expanding our cluster to grow the number of IOPS we
can provide to clients. We’re still on Hammer but in the process of
upgrading to Jewel. 

You might want to wait until the next Jewel release, given the current
number of issues.

Our issues or Jewel’s? (This is only half a joke, a friend’s Ceph
cluster’s issues on Jewel are making me quite nervous and I’m already
re-evaluating whether to postpone.)

Jewel issues, like the most recent one with scrub sending OSDs to
neverland.

We started adding pure-SSD OSDs in the last few days (based on MICRON
S610DC-3840) and the slow requests we’ve seen in the past have started
to show a different pattern.

I looked in the archives and can't find a full description of your
cluster hardware in any posts from you or the other Christian (hint,
hint). Slow requests can nearly always be traced back to HW
issues/limitations.

We’re currently running on 42% Christians. ;)

We currently have hosts of 2 generations in our cluster. They’re both
SuperMicro, sold by Thomas Krenn.

Type 1 (4 Hosts)

   SuperMicro X9DR3-F
   64 GiB RAM
   2 Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz

A bit on the weak side if you're doing lots of small IOPS, as you found
out later. Not likely a real problem here, though.

   LSI MegaRAID SAS 9271-8i
And as you already found out, not a good match for Ceph (unless you actually
do RAID, like me with Areca).
Also LSI tends to have issues unless you're on the latest drivers
(firmware and kernel side)

We currently have our OSDs behind an LSI 3108 RoC, in individual RAID 0 virtual disks. Curious to see if this looks to be an issue in this case, as we also see slow requests. We are doing completely on-disk journaling for now, though we may start evaluating NVMe journaling as soon as today.

While not ideal, they are BBU-backed, which is the result of being hurt by a power outage during the PoC stage, when we used the on-device write cache and corrupted the leveldb beyond repair.
So while an HBA would be ‘ideal’, the BBU-backed RAID has added a level of resiliency that could be helpful to some (verified with further destructive testing, with the disk cache both enabled and disabled, to confirm reproducibility).


   Dual 10-Gigabit X540-AT2
   OS on RAID 1 SATA HGST HUS724020ALS640 1.818 TB 7.2k
    Journal on 400G Intel MLC NVMe PCIe 3.0 (DC P3600) (I thought those
should be DC P3700, maybe my inventory is wrong)

That needs clarification, which ones are they now?

   Pool “rbd.ssd”

       2 OSDs on 800GB Intel DC S3510 SSDSC2BX80 (jbod/raid 0)

You're clearly a much braver man than me or have a very light load (but at
least the 3510's have 1 DWPD endurance unlike the 0.3 of 3500s).  
Please clarify, journals inline (as you suggest below) or also on the NVME?

   Pool “rbd.hdd”

       5-6 OSDs on SATA HGST HUS724020ALS640 1.818 TB 7.2k (jbod/raid 0)
       1 OSD on SAS MICRON S610DC-3840 3.492 TB (jbod/raid 0)
Same here, journal inline or also on the NVME?


Type 2 (2 Hosts)

   SuperMicro X9SRL-F
   32 GiB RAM
   2 Intel(R) Xeon(R) CPU E5-2630L v2 @ 2.40GHz
   LSI MegaRAID SAS 9260-4i
   Dual 10-Gigabit X540-AT2
   OS on RAID 1 SATA (HGST HUS724020ALS640 1.818 TB)
    Journal on 400G Intel MLC NVMe PCIe 3.0 (DC P3600) (I thought those
should be DC P3700, maybe my inventory is wrong)

   Pool “rbd.ssd”

       2 OSDs on 800GB Intel DC S3510 SSDSC2BX80 (jbod/raid 0)

   Pool “rbd.hdd”

       7 OSDs on HITACHI HUS156060VLS600 558.406 GB 15k (jbod/raid 0)
       1 OSD on SAS MICRON S610DC-3840 3.492 TB (jbod/raid 0)

Network

   10 GE Copper on Brocade VDX 6740-T Switches w/ 20 GBit interconnect
    Storage Access and Cluster network on 2 physically separated ports
    Some KVM servers are still running 1 GE interfaces for access, some
10 GE

Software

   Ceph: 0.94.7
   Filesystems: OS on ext4, OSDs on xfs
    Kernel: 4.4.27 (one host of type 2 still waiting for a reboot to
upgrade from 4.1.16 in the next few days)
    Qemu: 2.7 with librbd

Overall we’ve been happier with the second type of hosts (the 15k
disks obviously don’t reach their limit in a mixed environment), but
generally we’ve been unhappy with having RAID controllers in there at
all. We’re going for pure HBA setups with newer machines.

Nods.

This cluster setup has been with us for a few years now and has seen
various generations of configurations, ranging from an initial huge OSD on
RAID 5, over a temporary 2-disk RAID 0, to using CacheCade (until it died
ungracefully). Now it’s just single-disk jbod/raid 0 with journals on
NVMe SSD. We added a pure SSD pool with inline journals a while ago
which has not shown any slow requests for the last 14 days or so.

I'd be impressed to see slow requests from that pool, even with 3510s.
Because that level of activity should cause your HDD pool to come to a
total standstill. 

We added 10GE about 18 months ago and migrated that from a temporary
solution to Brocade in September.

Have you tested these SSDs for their suitability as a Ceph journal (I
assume inline journals)? 
This has been discussed here countless times, google.

We’re testing them right now. We stopped testing this in our development
environment as any learning aside from “works generally”/“doesn’t work
at all” has not been portable to our production load at all. All our
experiments trying to reliably replicate production traffic in a
reasonable way have failed.

I’ve FIOed them and was able to see their spec’d write IOPS (18k random
IO 4k with 32 QD) and throughput. I wasn’t able to see 100k read IOPS
but maxed out around 30k.

Neither FIO (the one you cite below) nor IOPS are of relevance to journals;
sequential sync writes are. Find the most recent discussions here, and for
slightly dated info with dubious data/results see:
https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
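
For reference, the test in that article boils down to a single sequential
sync write job at queue depth 1, something like this (WARNING: writing to
the raw device destroys data on it; /dev/sdX is a placeholder):

fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k \
    --numjobs=1 --iodepth=1 --runtime=60 --time_based \
    --group_reporting --name=journal-test

An SSD that looks great at high queue depths but collapses here is a poor
journal candidate.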


If those SSDs aren't up to snuff, they may be worse than plain disks in
some cases.

In addition, these SSDs have an endurance of 1 DWPD, which drops below 0.5
when factoring in journal and FS overheads and write amplification scenarios.
I'd be worried about wearing these out long before 5 years are over.

I reviewed the endurance of our existing hard drives. The hard drives
had less than 0.05 DWPD since we added them. I wasn’t able to get
data from the NVMe drives about how much data they actually wrote to be
able to compute endurance. The Intel tool currently shows an endurance
analyzer of 14.4 years. This might be a completely useless number due to
aggressive over-provisioning of the SSD.

No SMART interface for the NVMes? I don't have any, so I wouldn't know.

Smartmontools says they added basic NVMe support in 6.5, but I haven’t jumped from 6.2 yet.
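
Once on 6.5 or newer, something along these lines should dump the NVMe
health/SMART data (the device name is a placeholder):

smartctl -a /dev/nvme0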

Intel’s SSD Data Center Tool (isdct) is not entirely intuitive, but provides some SMART statistics, including NAND Bytes Written:

# isdct show -o json -smart F4 -intelssd
{
    "SMART Attributes PHMD54560053400AGN":
    {
        "F4":
        {
            "Description":"NAND Bytes Written",
            "ID":"F4",
            "Normalized":100,
            "Raw":1594960
        }
    }
}

There are some other statistics worth checking, including the ‘SmartHealthInfo’ nvmelog.
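
If the F4 raw counter is in units of 32 MiB - which it seems to be on a
number of Intel DC models, but check the isdct documentation for yours -
a rough DWPD figure is just bytes written divided by days in service and
capacity. A minimal sketch, with days in service and capacity as
placeholder values:

RAW=1594960; DAYS=365; CAP_TB=0.4   # days and capacity (TB) are examples
echo "$RAW $DAYS $CAP_TB" | awk '{ tb = $1 * 32 / 1024 / 1024;
    printf "%.1f TiB written, %.3f DWPD\n", tb, tb / ($2 * $3) }'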

Reed


Well, at 0.05 DWPD you're probably fine, being much more IOPS than bandwidth
bound.

Some average/peak IOPS and bandwidth numbers (ceph -s) from your cluster
would give us some idea if you're trying to do the impossible or if your
HW is to blame to some extent at least.
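Something like this, left running while the slow requests appear, would do
(pool name taken from your setup):

watch -n 5 'ceph -s; ceph osd pool stats rbd.hdd'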


However, I’ll be monitoring this more closely on the Microns and not
worry about the full 5 years just yet. 

I’m currently seeing those: 

2016-12-05 15:13:37.527469 osd.60 172.22.4.46:6818/19894 8080 : cluster [WRN] 5 slow requests, 1 included below; oldest blocked for > 31.675358 secs
2016-12-05 15:13:37.527478 osd.60 172.22.4.46:6818/19894 8081 : cluster [WRN] slow request 31.674886 seconds old, received at 2016-12-05 15:13:05.852525: osd_op(client.518589944.0:2734750 rbd_data.1e2b40f879e2a9e3.00000000000000a2 [stat,set-alloc-hint object_size 4194304 write_size 4194304,write 1892352~4096] 277.ceaf1c22 ack+ondisk+write+known_if_redirected e1107736) currently waiting for rw locks

Is osd.60 one of these SSDs?

It is. However, I can also show any of the other OSDs from the same pool
- this isn’t restricted to the Micron SSDs but also shows up for the 1.8 TB
SATA drives. I reviewed all the logs I got from the last two weeks and
out of all complaints those have always made up ~30% of the slow
requests we see. Others include (a quick way to tally these is sketched
right after the list):

Again, seeing slow requests from an SSD-based OSD strikes me as odd; aside
from LSI issues (which should show up in kernel logs) the main suspect
would be the journal writes totally overwhelming it.
Unlike the Intel 3xxx, which are known to handle sync writes
according to their specs.

currently waiting for subops from … (~35%)
currently waiting for rw locks (~30%)
currently no flag points reached (~20%)
currently commit_sent (<5%)
currently reached_pg (<3%)
currently started (<1%)
currently waiting for scrub (extremely rare)
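
A quick way to produce such a breakdown, assuming the cluster log lives at
its default location, is something like:

grep 'slow request' /var/log/ceph/ceph.log | grep -o 'currently .*' | sort | uniq -c | sort -rn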

The statistics for waiting for sub operations include OSD 60 (~1886)
more than any other, even though all of the others do show a
significant amount (~100-400).
Those are not relevant (related) to this OSD.

My current guess is that OSD 60 has gotten more than its fair share of
PGs temporarily and a lot of the movement on the cluster just affected it
much more. I added 3 of the Micron SSDs at the same time and the two
others are well in the normal range. I don’t have distribution over time,
so I can’t prove that right now. The Microns do have a) a higher primary
affinity and b) double the size of the normal SATA drives, so that should
result in more requests ending up there.


Naturally.
I'm not convinced that mixing HDDs and SSDs is the way forward: either you
can handle your load with HDDs (enough of them) and fast journals, or you
can't and are better off going all SSD, unless your use case is a match
for cache-tiering.


Looking at iostat I don’t see them creating large backlogs - at least I
can’t spot them interactively.


You should see high latencies and utilization in there when the slow requests
happen.
Also look at it with atop in a larger terminal and with at least a 5s
refresh interval.
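Something simple left running while the slow requests appear usually does
it, e.g.:

iostat -xmt 5
atop 5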

Looking at the specific traffic that Ceph with an inline journal puts on
the device, during backfill it appears to be saturated from a
bandwidth perspective. I see around 100MB/s going in with around 1k write
IOPS. During this time, iostat shows an average queue length of around
200 requests and an average wait of around 150ms, which is quite high.
Precisely.
The journal/sync write plot thickens.

Adding a 4k FIO [1] at the same time doesn’t change throughput or IOPS
significantly. Stopping the OSD then shows IOPS going up to 12k r/w, but
of course bandwidth is much less at that point.

Something we have looked at again, but weren’t able to find
anything on, is whether the controller might be limiting us in some way.
Then again: the Intel SSDs are attached to the same controller and have
not shown *any* slow requests at all. All the slow requests are confined
to the rbd.hdd pool (which now includes the Microns as part of a
migration strategy).

I’m also wondering whether we may be CPU bound, but the stats do not
indicate that. We’re running the “ondemand” governor and I’ve seen
references to people preferring the “performance” governor. However, I
would expect that we’d see significantly reduced idle times when this
becomes an issue.

The performance governor will help with (Ceph induced) latencies foremost. 
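Checking and switching is cheap enough to just try; one way via sysfs
(cpupower frequency-set -g performance being another):

cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
for c in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo performance > $c; done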

Tools like atop and iostat can give you a good insight on how your
storage subsystem is doing.

Right. We have collectd running, including the Ceph plugin, which helps with
finding historical data. At the moment I’m not getting much out of the
interactive tools as any issue will have resolved itself by the time I’m logged in. ;)

I suppose you could always force things with a fio from a VM or a 
"rados bench".

For now, concentrate on validating those Microns.

Christian

The moment you see slow requests, your cluster has issues you need to
address.
Especially if this happens when no scrubs are going on (restrict scrubs
to off-peak hours).

I’ve seen this happen when no scrubs are around. I currently cannot
pinpoint performance issues to ongoing scrubs. We have limited scrubs to
a low load threshold but run them regularly during the day.

The 30 seconds is an arbitrary value and the WRN would warrant an ERR
in my book, as some applications take a very dim view of being blocked
for more than a few seconds.

Absolutely: we see adverse effects within VMs much earlier and have
lowered the complaint time to 15 seconds due to that. 

Here’s the Ceph config we’re running with. Some of the values are
temporarily tuned at runtime to support operations (like max backfills,
etc.):

[global]
fsid = ...
public network = 172.22.4.0/24
cluster network = 172.22.8.0/24
pid file = /run/ceph/$type-$id.pid
# should correspond to the setting in /etc/init.d/ceph
max open files = 262144
mon host = <3 mons>
mon osd nearfull ratio = .9
osd pool default min size = 1
osd pool default size = 2
# recovery sleep becomes available in infernalis.
# osd recovery sleep = 0.005
osd snap trim sleep = 0.05
osd scrub sleep = 0.05

[client]
log file = /var/log/ceph/client.log
rbd cache = true
rbd default format = 2
admin socket = /run/ceph/rbd-$pid-$cctid.asok

[mon]
admin socket = /run/ceph/$cluster-$name.asok
mon addr = 172.22.4.42:6789,172.22.4.43:6789,172.22.4.54:6789
mon data = "" class="">mon data avail crit = 2
mon data avail warn = 5
mon osd allow primary affinity = true
mon pg warn max per osd = 3000

[mon.cartman07]
host = cartman07
mon addr = 172.22.4.43:6789
public addr = 172.22.4.43:6789
cluster addr = 172.22.8.13:6789

[osd]
host = cartman07
admin socket = /run/ceph/$cluster-$name.asok
public addr = 172.22.4.43
cluster addr = 172.22.8.13
filestore fiemap = true
filestore max sync interval = 5
# slightly more than 1 week to avoid hitting the same day the same time
osd deep scrub interval = 700000
# place this between "nearfull" and "full"
osd backfill full ratio = 0.92
# 2* 5 seconds * 200MB/s disk throughput = 2000MB
osd journal size = 2000
osd max backfills = 1
osd op complaint time = 15
osd recovery max active = 1
osd recovery op priority = 1
osd scrub load threshold = 2
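
For the values tuned at runtime (max backfills, recovery settings etc.)
this is typically done via injectargs rather than a restart, roughly:

ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'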

Something I started playing around with today was looking up specific
slow requests from the admin sockets. Here’s what happened a bit
earlier. Something I see is that the request flaps between “reached pg”
and “waiting for rw locks”. I’ll provide a set of 3 dumps that are
correlated. I started with OSD 30 (Micron) and pulled the ones from OSDs
33 and 22 as they are referenced from the “waiting for subops from” entries.
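
The admin socket commands in question, with the socket path from the
config above, are along the lines of:

ceph --admin-daemon /run/ceph/ceph-osd.30.asok dump_ops_in_flight
ceph --admin-daemon /run/ceph/ceph-osd.30.asok dump_historic_ops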



-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx    Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
