Re: Interpretation Guidance for Slow Requests

Hello,

On Tue, 6 Dec 2016 20:58:52 +0100 Christian Theune wrote:

> Hi,
> 
> > On 6 Dec 2016, at 04:42, Christian Balzer <chibi@xxxxxxx> wrote:
> > Jewel issues, like the most recent one with scrub sending OSDs to
> > neverland.
> 
> Alright. We’re postponing this for now. Is it actually a widely held view that Jewel has “prime time” issues?
>
Ask the people here who keep running into issues or patching stuff.

The upgrade is far more complex (ownership changes) and has more potential for
things to go wrong than I remember with any other release in the last 3 years.
Stuff like upgrading OSDs before MONs upsets the muscle memory of all
long-time operators.
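
For reference, the ownership part boils down to either chown'ing everything to
the new ceph user or telling the daemons to keep running as root. A rough
sketch, assuming a stock package install (paths are the defaults):
---
# per host, with the Ceph daemons stopped (can take ages on big OSDs):
chown -R ceph:ceph /var/lib/ceph
# or postpone the chown and keep running as root via ceph.conf:
#   [global]
#   setuser match path = /var/lib/ceph/$type/$cluster-$id
---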

The new defaults are likely to break things unless you prepared for them
and/or have all your clients upgraded at the same time (mission impossible
for anybody running long-term VMs).
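
One example, assuming that's the kind of default meant here: Jewel's new rbd
image feature defaults produce images that older kernel clients can't map.
Pinning the old behaviour looks roughly like this (a sketch, not gospel):
---
# ceph.conf on the clients that create images: enable only the layering
# feature, so pre-Jewel krbd clients can still map new images
[client]
rbd default features = 1
---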

Loads of things changed, some massively (like cache-tiering), with poor or
no documentation (other than source code and obscure changelog entries).
 
I'm pondering letting my one cluster/unit die of natural causes
while still running Hammer, after the HW is depreciated in 3 years.

> >>>> We started adding pure-SSD OSDs in the last days (based on MICRON
> >>> S610DC-3840) and the slow requests we’ve seen in the past have started
> >>> to show a different pattern.
> >>>> 
> >>> I looked in the archives and can't find a full description of your
> >>> cluster hardware in any posts from you or the other Christian (hint,
> >>> hint). Slow requests can nearly always be traced back to HW
> >>> issues/limitations.
> >> 
> >> We’re currently running on 42% Christians. ;)
> >> 
> >> We currently have hosts of 2 generations in our cluster. They’re both
> >> SuperMicro, sold by Thomas Krenn.
> >> 
> >> Type 1 (4 Hosts)
> >> 
> >>    SuperMicro X9DR3-F
> >>    64 GiB RAM
> >>    2 Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz
> > 
> > A bit on the weak side, if you'd be doing lots of small IOPS, as you found
> > out later. Not likely a real problem here, though.
> 
> We do have a good number of cores, so I wonder whether increasing the op thread count would help? However, the Intel pure SSD pool doesn’t show that issue at all.
>
Nope, as I said, CPUs aren't your problem.

> > 
> >>    LSI MegaRAID SAS 9271-8i
> > And as you already found out, not a good match for Ceph (unless you actually
> > do RAIDs, like me with Areca)
> > Also LSI tends to have issues unless you're on the latest drivers
> > (firmware and kernel side)
> 
> Yeah. I was leaning towards using the Adaptec HBA, but currently I’m leaning towards the “devil we know”. Care to weigh in?
> 
Google. ^o^
http://www.spinics.net/lists/ceph-users/msg24370.html

I've had a comparable (according to specs) Adaptec controller
(HBA) performing abysmally (40% slower) compared to the LSI equivalent.

> Adaptec/LSI are our vendor provided choices. May I ask what your setup is regarding RAID/Areca?
> 
Also discussed here plenty of times.
4GB HW cache, OSDs are 4-disk RAID10s, thus a replica count of 2 (effectively 4).
Special use case, probably not for a general audience.

> >>    Dual 10-Gigabit X540-AT2
> >>    OS on RAID 1 SATA HGST HUS724020ALS640 1.818 TB 7.2k
> >>    Journal on 400G Intel MLC NVME PCI-E 3.0 (DC P3600) (I thought those
> >> should be DCP3700, maybe my inventory is wrong)
> >> 
> > That needs clarification: which ones are they now?
> 
> I noticed we may have a mixed setup. I’ll need to go through this on our machines in detail. I’ll follow up on that.
> 
> >>    Pool “rbd.ssd”
> >> 
> >>        2 OSDs on 800GB Intel DC S3510 SSDSC2BX80 (jbod/raid 0)
> >> 
> > You're clearly a much braver man than me or have a very light load (but at
> > least the 3510's have 1 DWPD endurance unlike the 0.3 of 3500s).  
> > Please clarify, journals inline (as you suggest below) or also on the NVME?
> 
> Journals inline.
> 
Good.
And I brain-farted: I was thinking of the 3520s; the 3510s still have 0.3
DWPD, so 0.15 DWPD after journals at best.
Danger Will Robinson!
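
Back-of-the-envelope, in case anyone wants to check that math (800GB drive,
0.3 DWPD rating, inline journal writing everything twice):
---
# rated daily write budget, halved because the inline journal doubles every write
echo '800 * 0.3 / 2' | bc -l    # ~120 GB/day of client writes, i.e. ~0.15 DWPD
---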

> >>    Pool “rbd.hdd”
> >> 
> >>        5-6 OSDs on SATA HGST HUS724020ALS640 1.818 TB 7.2k (jbod/raid 0)
> >>        1 OSD on SAS MICRON S610DC-3840 3.492 TB (jbod/raid 0)
> > Same here, journal inline or also on the NVME?
> 
> Journals inline. I’m considering running an experiment that moves the journal to NVMe, but: I ran iostat on the raw device as well as the mapped LVM volumes for OSD/journal and did not see high wait times on the journal, but I did on the OSD. Also, when I tried Sebastien’s fio setup (on the OSD’s file system, not on the raw device, as that’s in use right now) I got reasonable numbers. However, this might be a stupid test, as I read that XFS might be ignoring the DSYNC request from fio.
> 

As I wrote below, not a fan of LVM in these use cases. 
How do you know that the high waits on the "OSD" part of the disk aren't
caused by the SSD melting from the "fast" SYNC writes to the journal part?

Definitely move the journal to the NVMe; that will settle this question.
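
A rough sketch of how that move usually goes with filestore (OSD id and the
NVMe partition below are placeholders, adjust to your layout):
---
ceph osd set noout                        # avoid rebalancing while the OSD is down
systemctl stop ceph-osd@60
ceph-osd -i 60 --flush-journal            # drain the old inline journal
rm /var/lib/ceph/osd/ceph-60/journal
ln -s /dev/nvme0n1p4 /var/lib/ceph/osd/ceph-60/journal   # placeholder NVMe partition
chown ceph:ceph /dev/nvme0n1p4            # Jewel runs the OSD as the ceph user
ceph-osd -i 60 --mkjournal
systemctl start ceph-osd@60
ceph osd unset noout
---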

> >> This cluster setup has been with us for a few years now and has seen
> >> various generations of configurations ranging from initial huge OSD on
> >> RAID 5, over temporary 2-disk RAID 0, and using CacheCade (until it died
> >> ungracefully). Now it’s just single disk jbod/raid 0 with journals on
> >> NVMe SSD. We added a pure SSD Pool with inline journals a while ago
> >> which does not show any slow requests for the last 14 days or so. 
> > 
> > I'd be impressed to see slow requests from that pool, even with 3510s.
> > Because that level of activity should cause your HDD pool to come to a
> > total standstill. 
> 
> You mean that other dependent resources (CPU, network, controller) would be saturated and thus the HDD pool wouldn’t have a chance to do anything?
> 
No, I meant that if a load heavy enough to make your SSD pool produce slow
requests were to hit the HDD pool, it would simply fall over.

> >> We added 10GE about 18 months ago and migrated that from a temporary
> >> solution to Brocade in September.
> >> 
> >>> Have you tested these SSDs for their suitability as a Ceph journal (I
> >>> assume inline journals)? 
> >>> This has been discussed here countless times, google.
> >> 
> >> We’re testing them right now. We stopped testing this in our development
> >> environment as any learning aside from “works generally”/“doesn’t work
> >> at all” has not been portable to our production load at all. All our
> >> experiments trying to reliably replicate production traffic in a
> >> reasonable way have failed.
> >> 
> >> I’ve FIOed them and was able to see their spec’d write IOPS (18k random
> >> IO 4k with 32 QD) and throughput. I wasn’t able to see 100k read IOPS
> >> but maxed out around 30k.
> >> 
> > Neither the FIO run you cite below nor IOPS figures are of much relevance to
> > journals; sequential sync writes are. Find the most recent discussions here,
> > and for slightly dated info with dubious data/results see:
> > https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
> 
> Thanks for the pointer. My first test probably was moot, but I’ll read up on the list. It’s a bit hard to find, so I’ll likely follow up with a list of links for future cross-reference.
> 
> >>> If those SSDs aren't up to snuff, they may be worse than plain disks in
> >>> some cases.
> >>> 
> >>> In addition, these SSDs have an endurance of 1 DWPD, less than 0.5 when
> >>> factoring journal and FS overheads and write amplification scenarios.
> >>> I'd be worried about wearing these out long before 5 years are over.
> >> 
> >> I reviewed the endurance from our existing hard drives. The hard drives
> >> had less than 0.05 DWPD since we’ve added them. I wasn’t able to get
> >> data from the NVMe drives about how much data they actually wrote to be
> >> able to compute endurance. The Intel tool currently shows an endurance
> >> analyzer of 14.4 years. This might be a completely useless number due to
> >> aggressive over-provisioning of the SSD.
> > 
> > No SMART interface for the NVMes? I don't have any, so I wouldn't know.
> 
> It appears that needs a newer smartctl (smartmontools) than we currently have. Intel’s own tool was a bit non-helpful.
> 
Yeah, it's a tad cryptic.
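
For what it's worth, recent smartmontools builds and the nvme-cli package can
read the wear counters directly; something along these lines (device name is a
placeholder):
---
smartctl -a /dev/nvme0          # needs a smartmontools build with NVMe support
nvme smart-log /dev/nvme0       # nvme-cli: data_units_written, percentage_used, etc.
---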

> > Well, at 0.05 DWPD you're probably fine; you're much more IOPS than bandwidth
> > bound.
> 
> Yup. End-to-end latency is what’s interesting to me, as well as the total reasonable number of ops I can expect/communicate to clients. 
> 
> > Some average/peak IOPS and bandwidth numbers (ceph -s) from your cluster
> > would give us some idea if you're trying to do the impossible or if your
> > HW is to blame to some extent at least.
> 
> This is today (cleaning up some reweights earlier on):
> 
> 
> > Again, seeing slow requests from an SSD-based OSD strikes me as odd; aside
> > from LSI issues (which should show up in kernel logs), the main suspect
> > would be the journal writes totally overwhelming it, unlike the Intel 3xxx,
> > which are known to handle sync writes according to their specs.
> 
> I’m going to try to substantiate this. I’m not sure iostat supports this theory. See attached log.
> 

Let's compare one of my cache-tier nodes (Hammer) with 4 DC S3610s 800GB
(inline, file-based journals, no separate partition):
---
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00   213.60    0.80  664.80     3.20 10978.40    33.00     1.01    1.52    1.00    1.52   0.05   3.36
sdc               0.00   240.80    0.80  476.40    10.40  9832.80    41.25     0.05    0.11    0.00    0.11   0.06   2.96
sda               0.00   232.00    0.00  523.00     0.00  9112.80    34.85     0.72    0.98    0.00    0.98   0.05   2.80
sdd               0.00   166.40    0.60  603.60     2.40  8046.40    26.64     0.99    1.63    0.00    1.64   0.05   2.72
---

To your Micron SSD:
---
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdl               0.00    13.20   33.60  454.60  1453.60 12751.20    58.19    28.47   58.22    7.51   61.97   0.49  23.92
dm-24             0.00     0.00    0.00  135.00     0.00  2136.80    31.66     1.02    7.46    0.00    7.46   1.62  21.88
dm-25             0.00     0.00   33.60  329.00  1453.60 11189.60    69.74    28.67   78.99    7.52   86.29   0.64  23.08
---

I know people love LVM and it can be helpful at times.
However, in this case it adds another layer that could be problematic and
also muddies the waters in iostat a bit.
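
If you do stay on LVM, at least map the dm-NN names iostat prints back to the
actual journal/data LVs so the output stays readable; roughly:
---
lsblk -o NAME,KNAME,TYPE,SIZE /dev/sdl   # shows which dm-NN sits on top of sdl
dmsetup ls --tree                        # same mapping from the device-mapper side
---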

The await and utilization numbers are through the roof and totally
unacceptable for an SSD that calls itself "DC".

Here's a DC S3610 400GB (inline, file-based journal again) in my test
cluster's cache tier during a rados bench run:
---
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00    25.00   18.67  490.00    76.00 291600.00  1146.83     5.44   10.70    7.57   10.82   1.37  69.60
---

So an await of 10ms and 70% utilization while writing 25 times as much as
your sdl Micron up there.

This also matches up with your own results when looking at the NVMe or
sdj, which I presume is one of the DC S3510s, right?

Mind, this could also be the result of some very intrusive housekeeping
activity from the Micron controller as opposed to DSYNC weaknesses, but
clearly they are struggling immensely here.
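
For the record, the bench run above was a plain rados write benchmark;
something along these lines (pool name, runtime and thread count here are
made up):
---
rados bench -p cache-pool 60 write -t 32 --no-cleanup   # keep objects for a read pass
rados bench -p cache-pool 60 rand                       # optional random read pass
rados -p cache-pool cleanup                             # remove the benchmark objects
---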


> >> currently waiting for subops from … (~35%)
> >> currently waiting for rw (~30%)
> >> currently no flag points reached (~20%)
> >> currently commit_sent (<5%)
> >> currently reached_pg (<3%)
> >> currently started (<1%)
> >> currently waiting for scrub (extremely rare)
> >> 
> >> The statistics for waiting for sub operations include OSD 60 (~1886)
> >> more than any others, even though all of the others do show a
> >> significant amount (~100-400). 
> > Those are not relevant (related) to this OSD.
> 
> They should be, or I’m getting something very wrong: I was already referring to the OSD mentioned on the right-hand side of “waiting for subops from XXX”; the numbers I gave correspond to that side. I likely miscommunicated that.
> 
Ah, OK. 
It sounded like these were all from the logs of OSD 60 itself.
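
If you want to see where individual slow ops on that OSD spend their time, the
admin socket can dump the recent worst offenders; a sketch, run on the host
carrying osd.60 with default socket paths:
---
ceph daemon osd.60 dump_historic_ops | less   # slowest recent ops, with per-stage timestamps
ceph daemon osd.60 dump_ops_in_flight         # whatever is stuck right now
---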

> >> My current guess is that OSD 60 has gotten more than its fair share of
> >> PGs temporarily and a lot of the movement on the cluster just affected it
> >> much more. I added 3 of the Micron SSDs at the same time and the two
> >> others are well in the normal range. I don’t have distribution over time,
> >> so I can’t prove that right now. The Microns do have a) a higher primary
> >> affinity and b) double the size of the normal SATA drives, so that should
> >> result in more requests ending up there.
> >> 
> > 
> > Naturally.
> > I'm not convinced that mixing HDDs and SSDs is the way forward: either you
> > can handle your load with HDDs (enough of them) and fast journals, or you
> > can't and are better off going all SSD, unless your use case is a match
> > for cache-tiering. 
> 
> It’s a way forward in the sense that it lets us move to SSD-only without downtime. We should see improving conditions as the load comes off the HDDs, but our actual goal is to have pure pools. We’re also keeping the HDDs (and extending that pool) for low-traffic/low-priority VMs. The move may seem counter-intuitive, but this way we can split the pool in the future and keep the production-relevant VMs online while replacing the OSDs in the original pool, and then create a new pool with the HDDs and copy the RBD images back over, which requires downtime and manual operations for the affected VMs.
> 
If the goal is a pure pool, that's a perfectly fine path. 
Though maybe not with these SSDs...
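
When you get to the actual split, the usual pre-Luminous approach is a separate
CRUSH root plus rule for the SSD OSDs; a very rough sketch (bucket/host names
and the rule id are made up):
---
ceph osd crush add-bucket ssd root                          # new root for SSD OSDs
ceph osd crush add-bucket host1-ssd host
ceph osd crush move host1-ssd root=ssd
ceph osd crush create-or-move osd.60 3.49 root=ssd host=host1-ssd   # repeat per SSD OSD
ceph osd crush rule create-simple ssd-rule ssd host
ceph osd pool set rbd.ssd crush_ruleset 1                   # id from 'ceph osd crush rule dump'
---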

> >> Looking at IOstat I don’t see them creating large backlogs - at least I
> >> can’t spot them interactively.
> >> 
> > 
> > You should see high latencies, utilization in there when the slow requests
> > happen.
> > Also look at it with atop in a larger terminal and with at least 5s
> > refresh interval.
> > 
> >> Looking at the specific traffic that Ceph with an inline journal puts on
> >> the device, during backfill it appears to be saturated from a
> >> bandwidth perspective. I see around 100MB/s going in with around 1k write
> >> IOPS. During this time, iostat shows an average queue length of around
> >> 200 requests and an average wait of around 150ms, which is quite high.
> > Precisely.
> > The journal/sync write plot thickens.
> 
> I also did see much higher traffic (550MB/s), close to the theoretical maximum, without this kind of issue. The plot is thick but not yet clear. :)
> 
> >> Adding a 4k FIO [1] at the same time doesn’t change throughput or IOPS
> >> significantly. Stopping the OSD then shows IOPS going up to 12k r/w, but
> >> of course bandwidth is much less at that point.
> >> 
> >> Something we have looked into again, but weren’t able to find anything
> >> on, is whether the controller might be limiting us in some way.
> >> Then again: the Intel SSDs are attached to the same controller and have
> >> not shown *any* slow request at all. All the slow requests are confined to
> >> the rbd.hdd pool (which now includes the Microns as part of a
> >> migration strategy).
> >> 
> >> I’m also wondering whether we may be CPU bound, but the stats do not
> >> indicate that. We’re running the “ondemand” governor and I’ve seen
> >> references to people preferring the “performance” governor. However, I
> >> would expect that we’d see significantly reduced idle times when this
> >> becomes an issue.
> >> 
> > The performance governor will help with (Ceph induced) latencies foremost. 
> 
> I guess we’re not there yet at all; I mean, that should be something on the order of nanoseconds / very low microseconds, right?

Take a look at Nick Fisk's various posts and the current thread with his
blog article.

> > 
> >>> Tools like atop and iostat can give you a good insight on how your
> >>> storage subsystem is doing.
> >> 
> >> Right. We have collectd running including the Ceph plugin which helps
> >> finding historical data. At the moment I’m not getting much out of the
> >> interactive tools as any issue will have resolved once I’m logged in. ;)
> >> 
> > I suppose you could always force things with a fio from a VM or a 
> > "rados bench".
> > 
> > For now, concentrate on validating those Microns.
> 
> On it. Care to take a look at the iostat output I’m attaching?
> 
See above.

> 
> 
> So far I can see that in mixed traffic the journal (dm-24) sometimes does an OK job and sometimes not. It kinda seems sensitive to the mixed mode. I can’t visually spot any covariance, and my statistics wisdom isn’t sufficient to quickly do any kind of analysis on this.
> 
> I have also included a short iostat run from the NVMe SSD (P3600). There’s a lot more going on and it appears to 
> 
> I’ll try two things next:
> 
> - move one of the inline journals from the Microns to the NVMe

Definitely.

> - use fio to benchmark the raw LVM LV that hosted the journal before, using Sebastien’s post
> 
Testing the raw device (no LVM) would be even better; might want to play with
one of the SSDs you haven't deployed yet?
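
For the record, the journal-style test from Sebastien's post is single-threaded
4k sync writes against the raw device, roughly as below. It destroys data, so
only run it on an undeployed SSD (/dev/sdX is a placeholder):
---
fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k \
    --numjobs=1 --iodepth=1 --runtime=60 --time_based \
    --group_reporting --name=journal-test
---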

> One thing I’m still trying to figure out how to prove: whether the LSI controller may be the bottleneck. When I tried a fio on one of the pure SSD pool VMs I got increased latencies that aren’t proportional to what I see when not running fio (second run of iostat on disk sdj). 
> 

What is sdj, another Micron?

Your controller is a 6Gb/s one if I'm not mistaken. And your Microns
could saturate that in theory.
But that would be only on that link. 
Another bottleneck would be the actual PCIe interface bandwidth, but the
numbers just don't add up to that. 
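
A couple of quick checks for the controller theory (the PCI address is a
placeholder, take it from the first command):
---
lspci | grep -i lsi                                  # find the HBA's PCI address
lspci -vv -s 03:00.0 | grep -Ei 'lnkcap|lnksta'      # negotiated PCIe width/speed vs. capability
---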

Christian

> Thanks for the input so far,
> Christian
> 
> 
> 


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



