Re: Interpretation Guidance for Slow Requests

Hello,

On Tue, 6 Dec 2016 03:37:32 +0100 Christian Theune wrote:

> Hi Christian (heh),
> 
> thanks for picking this up. :)
> 
> This has become a rather long post as I added more details and gave our
> history, but if we make progress then maybe this can help others in
> the future. I find slow requests extremely hard to debug and as I said:
> aside from scratching my own itch, I’d be happy to help future
> travellers.
> 
> > On 6 Dec 2016, at 00:59, Christian Balzer <chibi@xxxxxxx> wrote:
> > 
> > Hello,
> > 
> > On Mon, 5 Dec 2016 15:25:37 +0100 Christian Theune wrote:
> > 
> >> Hi,
> >> 
> >> we’re currently expanding our cluster to grow the number of IOPS we
> >> can provide to clients. We’re still on Hammer but in the process of
> >> upgrading to Jewel. 
> > 
> > You might want to wait until the next Jewel release, given the current
> > number of issues.
> 
> Our issues or Jewel’s? (This is only half a joke; a friend’s Ceph
> cluster’s issues on Jewel are making me quite nervous and I’m already
> re-evaluating whether to postpone.)
>
Jewel issues, like the most recent one with scrub sending OSDs to
neverland.
 
> >> We started adding pure-SSD OSDs in the last few days (based on MICRON
> >> S610DC-3840) and the slow requests we’ve seen in the past have started
> >> to show a different pattern.
> >> 
> > I looked in the archives and can't find a full description of your
> > cluster hardware in any posts from you or the other Christian (hint,
> > hint). Slow requests can nearly always be traced back to HW
> > issues/limitations.
> 
> We’re currently running on 42% Christians. ;)
> 
> We currently have hosts of 2 generations in our cluster. They’re both
> SuperMicro, sold by Thomas Krenn.
> 
> Type 1 (4 Hosts)
> 
>     SuperMicro X9DR3-F
>     64 GiB RAM
>     2 Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz

A bit on the weak side if you're doing lots of small IOPS, as you found
out later. Not likely a real problem here, though.

>     LSI MegaRAID SAS 9271-8i
And as you already found out, not a good match for Ceph (unless you
actually do RAID, like me with Areca).
Also, LSI tends to have issues unless you're on the latest firmware and
kernel-side drivers.

>     Dual 10-Gigabit X540-AT2
>     OS on RAID 1 SATA HGST HUS724020ALS640 1.818 TB 7.2k
>     Journal on 400G Intel MLC NVMe PCI-E 3.0 (DC P3600) (I thought those
> should be DC P3700, maybe my inventory is wrong)
> 
That needs clarification: which ones are they, 3600s or 3700s?

>     Pool “rbd.ssd”
> 
>         2 OSDs on 800GB Intel DC S3510 SSDSC2BX80 (jbod/raid 0)
> 
You're clearly a much braver man than me, or have a very light load (but
at least the 3510s have 1 DWPD endurance, unlike the 0.3 of the 3500s).
Please clarify: journals inline (as you suggest below) or also on the NVMe?
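A quick way to tell (assuming the default OSD data paths) is to look at
where the journal symlinks point:

  ls -l /var/lib/ceph/osd/ceph-*/journal

A plain file or a link to a partition on the same device means inline, a
link to the P3600 means NVMe journal.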

>     Pool “rbd.hdd”
> 
>         5-6 OSDs on SATA HGST HUS724020ALS640 1.818 TB 7.2k (jbod/raid 0)
>         1 OSD on SAS MICRON S610DC-3840 3.492 TB (jbod/raid 0)
Same here: journal inline or also on the NVMe?

> 
> Type 2 (2 Hosts)
> 
>     SuperMicro X9SRL-F
>     32 GiB RAM
>     2 Intel(R) Xeon(R) CPU E5-2630L v2 @ 2.40GHz
>     LSI MegaRAID SAS 9260-4i
>     Dual 10-Gigabit X540-AT2
>     OS on RAID 1 SATA (HGST HUS724020ALS640 1.818 TB)
>     Journal on 400G Intel MLC NVMe PCI-E 3.0 (DC P3600) (I thought those
> should be DC P3700, maybe my inventory is wrong)
> 
>     Pool “rbd.ssd”
> 
>         2 OSDs on 800GB Intel DC S3510 SSDSC2BX80 (jbod/raid 0)
> 
>     Pool “rbd.hdd”
> 
>         7 OSDs on HITACHI HUS156060VLS600 558.406 GB 15k (jbod/raid 0)
>         1 OSD on SAS MICRON S610DC-3840 3.492 TB (jbod/raid 0)
> 
> Network
> 
>     10 GE Copper on Brocade VDX 6740-T Switches w/ 20 GBit interconnect
>      Storage Access and Cluster network on 2 physically separated ports
>      Some KVM servers are still running 1 GE interfaces for access, some
> 10 GE
> 
> Software
> 
>     Ceph: 0.94.7
>     Filesystems: OS on ext4, OSDs on xfs
>     Kernel: 4.4.27 (one host of type 2 still waiting for a reboot to
> upgrade from 4.1.16 in the next few days)
>     Qemu: 2.7 with librbd
> 
> Overall we’ve been happier with the second type of hosts (the 15k
> disks obviously don’t reach their limit in a mixed environment), but
> generally we’ve been unhappy with having RAID controllers in there at
> all. We’re going for pure HBA setups with newer machines.
> 
Nods.

> This cluster setup has been with us for a few years now and has seen
> various generations of configurations, ranging from an initial huge OSD
> on RAID 5, via a temporary 2-disk RAID 0, to CacheCade (until it died
> ungracefully). Now it’s just single-disk jbod/raid 0 with journals on
> NVMe SSD. We added a pure SSD pool with inline journals a while ago
> which has not shown any slow requests for the last 14 days or so.

I'd be impressed to see slow requests from that pool, even with 3510s;
that level of activity should bring your HDD pool to a total standstill
first.

> We added 10GE about 18 months ago and migrated that from a temporary
> solution to Brocade in September.
> 
> > Have you tested these SSDs for their suitability as a Ceph journal (I
> > assume inline journals)? 
> > This has been discussed here countless times, google.
> 
> We’re testing them right now. We stopped testing this in our development
> environment as any learnings beyond “works generally”/“doesn’t work at
> all” have not been portable to our production load at all. All our
> experiments trying to reliably replicate production traffic in a
> reasonable way have failed.
> 
> I’ve FIOed them and was able to see their spec’d write IOPS (18k random
> IO 4k with 32 QD) and throughput. I wasn’t able to see 100k read IOPS
> but maxed out around 30k.
> 
Neither FIO (the one you cite below) nor IOPS are of relevance to journals;
sequential sync writes are. Find the most recent discussions here, and for
slightly dated info with dubious data/results see:
https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
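The short version of that test, for reference (destructive if pointed at
a device or partition that is in use, so use a spare):

  fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k \
      --numjobs=1 --iodepth=1 --runtime=60 --time_based \
      --group_reporting --name=journal-test

A journal-worthy SSD should still manage thousands of IOPS in that mode;
the Microns may well not.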


> > If those SSDs aren't up to snuff, they may be worse than plain disks in
> > some cases.
> > 
> > In addition, these SSDs have an endurance of 1 DWPD, less than 0.5 when
> > factoring journal and FS overheads and write amplification scenarios.
> > I'd be worried about wearing these out long before 5 years are over.
> 
> I reviewed the endurance from our existing hard drives. The hard drives
> had less than 0.05 DWPD since we’ve added them. I wasn’t able to get
> data from the NVMe drives about how much data they actually wrote to be
> able to compute endurance. The Intel tool currently shows an endurance
> analyzer of 14.4 years. This might be a completely useless number due to
> aggressive over-provisioning of the SSD.

No SMART interface for the NVMes? I don't have any, so I wouldn't know.

Well, at 0.05 DWPD you're probably fine, being much more IOPS than
bandwidth bound.
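If nvme-cli or a recent smartmontools is on those hosts, the lifetime
write counters should be readable, something like:

  nvme smart-log /dev/nvme0   # "Data Units Written", 1 unit = 512,000 bytes
  smartctl -a /dev/nvme0      # needs a fairly new smartmontools

That gives you real bytes written to compute DWPD from, instead of
relying on the Intel tool's endurance estimate.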

Some average/peak IOPS and bandwidth numbers (ceph -s) from your cluster
would give us some idea whether you're trying to do the impossible or
whether your HW is to blame, to some extent at least.
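For example:

  ceph -s        # the "client io" line has aggregate B/s and op/s
  ceph osd perf  # per-OSD commit/apply latencies in ms

sampled a few times during normal load and again while slow requests are
piling up.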

> 
> However, I’ll be monitoring this more closely on the Microns and not
> worry about the full 5 years just yet. 
> 
> >> I’m currently seeing those: 
> >> 
> >> 2016-12-05 15:13:37.527469 osd.60 172.22.4.46:6818/19894 8080 :
> >> cluster [WRN] 5 slow requests, 1 included below; oldest blocked for >
> >> 31.675358 secs 2016-12-05 15:13:37.527478 osd.60
> >> 172.22.4.46:6818/19894 8081 : cluster [WRN] slow request 31.674886
> >> seconds old, received at 2016-12-05 15:13:05.852525:
> >> osd_op(client.518589944.0:2734750
> >> rbd_data.1e2b40f879e2a9e3.00000000000000a2 [stat,set-alloc-hint
> >> object_size 4194304 write_size 4194304,write 1892352~4096]
> >> 277.ceaf1c22 ack+ondisk+write+known_if_redirected e1107736) currently
> >> waiting for rw locks
> >> 
> > Is osd.60 one of these SSDs?
> 
> It is. However, I can also show any other of the OSDs from the same pool
> - this isn’t restricted to the Micron SSDs but also shows up for the 1.8 TB
> SATA drives. I reviewed all the logs I got from the last two weeks and
> out of all complaints those have always made up ~30% of the slow
> requests we see. Others include:
> 
Again, seeing slow requests from an SSD-based OSD strikes me as odd;
aside from LSI issues (which should show up in kernel logs), the main
suspect would be the journal writes totally overwhelming it, unlike the
Intel 3xxx, which are known to handle sync writes according to their
specs.

> currently waiting for subops from … (~35%)
> currently waiting for rw (~30%)
> currently no flag points reached (~20%)
> currently commit_sent (<5%)
> currently reached_pg (<3%)
> currently started (<1%)
> currently waiting for scrub (extremely rare)
> 
> The statistics for waiting for sub operations name OSD 60 (~1886 times)
> more than any other, even though all of the others do show a
> significant amount (~100-400). 
Those are not relevant (related) to this OSD.

> My current guess is that OSD 60 has gotten more than its fair share of
> PGs temporarily and a lot of the movement on the cluster just affected it
> much more. I added 3 of the Micron SSDs at the same time and the two
> others are well in the normal range. I don’t have distribution over time,
> so I can’t prove that right now. The Microns do have a) higher primary
> affinity and b) double the size of the normal SATA drives, so that should
> result in more requests ending up there.
> 

Naturally.
I'm not convinced that mixing HDDs and SSDs in one pool is the way
forward: either you can handle your load with HDDs (enough of them) and
fast journals, or you can't and are better off going all SSD, unless your
use case is a match for cache-tiering.
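As for the fair-share guess, the actual distribution is easy to check,
along these lines (the pg dump column layout varies between releases, so
treat the grep as a rough sketch):

  ceph osd df tree                            # utilization/weight per OSD
  ceph pg dump 2>/dev/null | grep -c '\[60,'  # PGs with osd.60 as primary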


> Looking at iostat I don’t see them creating large backlogs - at least I
> can’t spot them interactively.
> 

You should see high latencies and utilization in there when the slow
requests happen.
Also look at it with atop in a large terminal and with at least a 5s
refresh interval.
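For example:

  iostat -x 5   # watch await, avgqu-sz and %util per device
  atop 5        # adds per-process and per-disk views, 5s interval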

> Looking at the specific traffic that Ceph with inline journal puts on
> the device, then during backfill it appears to be saturated from a
> bandwidth perspective. I see around 100MB/s going in with around 1k write
> IOPS. During this time, iostat shows an average queue length of around
> 200 requests and an average wait of around 150ms, which is quite high.
Precisely.
The journal/sync write plot thickens.

> Adding a 4k FIO [1] at the same time doesn’t change throughput or IOPS
> significantly. Stopping the OSD then shows IOPS going up to 12k r/w, but
> of course bandwidth is much less at that point.
> 
> Something we have looked at again, but weren’t able to find anything
> on, is whether the controller might be limiting us in some way.
> Then again: the Intel SSDs are attached to the same controller and have
> not shown *any* slow requests at all. All the slow requests are confined
> to the rbd.hdd pool (which now includes the Microns as part of a
> migration strategy).
> 
> I’m also wondering whether we may be CPU bound, but the stats do not
> indicate that. We’re running the “ondemand” governor and I’ve seen
> references to people preferring the “performance” governor. However, I
> would expect that we’d see significantly reduced idle times when this
> becomes an issue.
>
The performance governor will help foremost with (Ceph-induced) latencies.
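It is cheap to try:

  cpupower frequency-set -g performance
  # or, without cpupower:
  for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
      echo performance > $g
  done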
 
> > Tools like atop and iostat can give you a good insight on how your
> > storage subsystem is doing.
> 
> Right. We have collectd running, including the Ceph plugin, which helps
> with finding historical data. At the moment I’m not getting much out of
> the interactive tools, as any issue will have resolved itself by the time
> I’m logged in. ;)
> 
I suppose you could always force things with a fio from a VM or a 
"rados bench".

For now, concentrate on validating those Microns.

Christian

> > The moment you see slow requests, your cluster has issues you need to
> > address.
> > Especially if this happens when no scrubs are going on (restrict scrubs
> > to off-peak hours).
> 
> I’ve seen this happen when no scrubs are around. I currently cannot
> pinpoint performance issues to ongoing scrubs. We have limited scrubs to
> a low load threshold but run them regularly during the day.
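
If your Hammer point release has them, the scrub window settings let you
restrict scrubs to actual off-peak hours, e.g.:

  [osd]
  osd scrub begin hour = 1
  osd scrub end hour = 6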
> 
> > The 30 seconds is an arbitrary value and the WRN would warrant an ERR
> > in my book, as some applications take a very dim view of being blocked
> > for more than a few seconds.
> 
> Absolutely: we see adverse effects within VMs much earlier and have
> lowered the complaint time to 15 seconds due to that. 
> 
> Here’s the Ceph config we’re running with. Some of the values are
> temporarily tuned at runtime to support operations (like max backfills,
> etc.):
> 
> [global]
> fsid = ...
> public network = 172.22.4.0/24
> cluster network = 172.22.8.0/24
> pid file = /run/ceph/$type-$id.pid
> # should correspond to the setting in /etc/init.d/ceph
> max open files = 262144
> mon host = <3 mons>
> mon osd nearfull ratio = .9
> osd pool default min size = 1
> osd pool default size = 2
> # recovery sleep becomes available in infernalis.
> # osd recovery sleep = 0.005
> osd snap trim sleep = 0.05
> osd scrub sleep = 0.05
> 
> [client]
> log file = /var/log/ceph/client.log
> rbd cache = true
> rbd default format = 2
> admin socket = /run/ceph/rbd-$pid-$cctid.asok
> 
> [mon]
> admin socket = /run/ceph/$cluster-$name.asok
> mon addr = 172.22.4.42:6789,172.22.4.43:6789,172.22.4.54:6789
> mon data = /srv/ceph/mon/$cluster-$id
> mon data avail crit = 2
> mon data avail warn = 5
> mon osd allow primary affinity = true
> mon pg warn max per osd = 3000
> 
> [mon.cartman07]
> host = cartman07
> mon addr = 172.22.4.43:6789
> public addr = 172.22.4.43:6789
> cluster addr = 172.22.8.13:6789
> 
> [osd]
> host = cartman07
> admin socket = /run/ceph/$cluster-$name.asok
> public addr = 172.22.4.43
> cluster addr = 172.22.8.13
> filestore fiemap = true
> filestore max sync interval = 5
> # slightly more than 1 week to avoid hitting the same day at the same time
> osd deep scrub interval = 700000
> # place this between "nearfull" and "full"
> osd backfill full ratio = 0.92
> # 2* 5 seconds * 200MB/s disk throughput = 2000MB
> osd journal size = 2000
> osd max backfills = 1
> osd op complaint time = 15
> osd recovery max active = 1
> osd recovery op priority = 1
> osd scrub load threshold = 2
> 
> Something I started playing around with today was looking up specific
> slow requests from the admin sockets. Here’s what happened a bit
> earlier. Something I see is that the request flaps between “reached pg”
> and “waiting for rw locks”. I’ll provide a set of 3 dumps that are
> correlated. I started with OSD 30 (Micron) and pulled the ones from OSD
> 33 and 22 as they are referenced by the “waiting for subops from” entries.
> 
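
For reference, given your admin socket setting those dumps would be along
the lines of:

  ceph --admin-daemon /run/ceph/ceph-osd.30.asok dump_ops_in_flight
  ceph --admin-daemon /run/ceph/ceph-osd.30.asok dump_historic_ops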


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



