Re: Interpretation Guidance for Slow Requests

Hello,

On Wed, 7 Dec 2016 09:04:37 +0100 Christian Theune wrote:

> Hi,
> 
> > On 7 Dec 2016, at 05:14, Christian Balzer <chibi@xxxxxxx> wrote:
> > 
> > Hello,
> > 
> > On Tue, 6 Dec 2016 20:58:52 +0100 Christian Theune wrote:
> > 
> >> Alright. We’re postponing this for now. Is it actually a widespread assumption that Jewel has “prime time” issues?
> >> 
> > You ask the people here running into issues or patching stuff.
> > 
> > The upgrade is far more complex (ownerships) and has more potential for
> > things to go wrong than I remember with any other release in the last 3
> > years. Stuff like upgrading OSDs before MONs upsets the muscle memory of
> > all long-time operators.
> > 
> > The new defaults are likely to break things unless you prepared for them
> > and/or have all your clients upgraded at the same time (mission impossible
> > for anybody running long-term VMs).
> > 
> > Loads of things changed, some massively like cache-tiering, with poor or
> > no documentation (other than source code, obscure changelog entries).
> > 
> > I'm pondering to let my one cluster/unit die from natural causes
> > while still running Hammer after the HW is depreciated in 3 years.
> 
> That is quite a dim view. I’ve found myself in a similar place when experimenting with cache-tiering and I’ve spent an increasing amount of time reading Ceph code in the last weeks. From our perspective we never thought we were doing anything extraordinary, but we keep having to spend (seemingly?) extraordinary amounts of time to get what feels like the advertised standard.
> 
> Seeing friends struggle in a similar way makes me feel a bit pessimistic. Ceph, great as it is, does make me shop around for alternatives every few months, but there aren’t any that I’d seriously consider.
> 

I wasn't talking about abandoning Ceph up there, just that for this unit
(storage and VMs) a freeze might be the better, safer option.
The way it's operated makes that a possibility; others will of course
want/need to upgrade their clusters and keep them running for as long as
possible.

> >>>>   LSI MegaRAID SAS 9271-8i
> >>> And as you already found out, not a good match for Ceph (unless you actually
> >>> do RAIDs, like me with Areca)
> >>> Also LSI tends to have issues unless you're on the latest drivers
> >>> (firmware and kernel side)
> >> 
> >> Yeah. I was leaning towards using the Adaptec HBA, but currently I’m leaning towards the “devil we know”. Care to weigh in?
> >> 
> > Google. ^o^
> > http://www.spinics.net/lists/ceph-users/msg24370.html <http://www.spinics.net/lists/ceph-users/msg24370.html>
> > 
> > I've had a comparable (according to specs) Adaptec controller
> > (HBA) performing abysmally (40% slower) compared to the LSI equivalent.
> 
> So, back to square one: hardware is annoying and any specific vendor will have ups and downs … sigh.
> 
> Our current candidate would be the Adaptec 1000, for which I found one entry on the mailing list that didn’t pick up on it being bad. ;)
> 
> >> Adaptec/LSI are our vendor provided choices. May I ask what your setup is regarding RAID/Areca?
> >> 
> > Also discussed here plenty times.
> > 4GB HW cache, OSDs are 4 disk RAID10s, thus replica of 2 (effectively 4).
> > Special use case, probably not for general audience.
> 
> Sorry, I’ll try to Google faster. :)
> 
> >>>>   Dual 10-Gigabit X540-AT2
> >>>>   OS on RAID 1 SATA HGST HUS724020ALS640 1.818 TB 7.2k
> >>>>   Journal on 400G Intel MLC NVME PCI-E 3.0 (DC P3600) (I thought those
> >>>> should be DCP3700, maybe my inventory is wrong)
> >>>> 
> >>> That needs clarification, which ones are they now?
> >> 
> >> I noticed we may have a mixed setup. I’ll need to go through this on our machines in detail. I’ll follow up on that.
> >> 
> >>>>   Pool “rbd.ssd”
> >>>> 
> >>>>       2 OSDs on 800GB Intel DC S3510 SSDSC2BX80 (jbod/raid 0)
> >>>> 
> >>> You're clearly a much braver man than me or have a very light load (but at
> >>> least the 3510s have 1 DWPD endurance unlike the 0.3 of the 3500s).
> >>> Please clarify, journals inline (as you suggest below) or also on the NVME?
> >> 
> >> Journals inline.
> >> 
> > Good.
> > And I brain-farted, I was thinking of 3520s, the 3510s still have 0.3
> > DWPD, so 0.15 DWPD after journals at best.
> > Danger Will Robinson!
> 
> Off to adding monitoring for our SSD life expectancy…
> 
Yup, something I did for our stuff (and not just the Ceph SSDs) as well;
there's a nice Nagios plugin for this.
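If you roll your own instead, something along these lines works as a
starting point (a minimal sketch, not the actual plugin: the device path
and thresholds are placeholders, and the SMART attribute name is the Intel
one, other vendors such as Micron report wear under different attribute
names, so adjust the parsing per drive model). Keep in mind that with
inline journals every client write hits the SSD twice, which is why 0.3
DWPD effectively becomes ~0.15.
---
#!/usr/bin/env python
# Minimal sketch of an SSD wear check via smartctl (not the Nagios plugin
# mentioned above). "Media_Wearout_Indicator" is what Intel DC drives
# expose; other vendors use different attribute names/IDs, adjust as needed.
import subprocess
import sys

DEVICE = "/dev/sdj"      # placeholder, point it at the SSD to check
WARN, CRIT = 30, 10      # remaining-life thresholds in percent (example values)

out = subprocess.check_output(["smartctl", "-A", DEVICE]).decode()

wear = None
for line in out.splitlines():
    if "Media_Wearout_Indicator" in line:
        # SMART attribute table: ID NAME FLAG VALUE WORST THRESH ...
        # VALUE is normalized, 100 = new and counts down towards 0.
        wear = int(line.split()[3])
        break

if wear is None:
    print("UNKNOWN - no wear attribute found on %s" % DEVICE)
    sys.exit(3)
elif wear <= CRIT:
    print("CRITICAL - %s at %d%% remaining life" % (DEVICE, wear))
    sys.exit(2)
elif wear <= WARN:
    print("WARNING - %s at %d%% remaining life" % (DEVICE, wear))
    sys.exit(1)
print("OK - %s at %d%% remaining life" % (DEVICE, wear))
---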

> >>>>   Pool “rbd.hdd”
> >>>> 
> >>>>       5-6 OSDs on SATA HGST HUS724020ALS640 1.818 TB 7.2k (jbod/raid 0)
> >>>>       1 OSD on SAS MICRON S610DC-3840 3.492 TB (jbod/raid 0)
> >>> Same here, journal inline or also on the NVME?
> >> 
> >> Journals inline. I’m considering running an experiment with moving the journal to NVME, but: I ran iostat on the raw device as well as the mapped LVM volumes for OSD/journal and did not see high wait times on the journal, but on the OSD. Also, when I tried Sebastian’s fio setup (on the OSD’s file system, not on the raw device, as that’s in use right now) I got reasonable numbers. However, this might be a stupid test as I read that XFS might be ignoring the DSYNC request from fio.
> >> 
> > 
> > As I wrote below, not a fan of LVM in these use cases. 
> 
> Right. We’re looking forward to moving our historic setup to ceph-disk at some point. However, LVM isn’t supposed to be a performance issue, so I’ll run a test with/without LVM, but I don’t expect anything to come out of that.
> 

I'm not a fan of ceph-disk either (it can't handle some desired scenarios
and sometimes tries to be too smart for its own good), but at least you'll
potentially be able to get better, faster feedback.

> > How do you know that the high waits on the "OSD" part of the disk isn't
> > caused by the SSD melting from the "fast" SYNC writes to the journal part?
> 
> My thought process was that both numbers would have to go up. It seems weird to me that the sync writes within the journal would continue to be fast while the SSD was melting from it. The SSD doesn’t know about the two different LVM LVs, so having performance split exactly at that boundary is at least surprising. But obviously: abstraction leak possible.
> 
Yes, it would seem logical, but as you said, the abstraction may blur this.
Either way, abysmal numbers.
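To take at least some of the blur out of iostat, you can resolve the dm-N
names back to their LVs (and the physical disk underneath), e.g. with a
quick sketch like this (standard sysfs paths, nothing Ceph-specific):
---
#!/usr/bin/env python
# Minimal sketch: map the dm-N names iostat prints back to their LVM LV
# names so the journal and OSD lines can be told apart. Uses the standard
# /sys/block/dm-*/dm/name and slaves/ entries on current kernels.
import glob
import os

for sysdir in sorted(glob.glob("/sys/block/dm-*")):
    dm = os.path.basename(sysdir)                       # e.g. "dm-24"
    with open(os.path.join(sysdir, "dm", "name")) as f:
        lv = f.read().strip()                           # VG-LV name of the volume
    # slaves/ lists the underlying device(s) the LV sits on, e.g. "sdl"
    disks = os.listdir(os.path.join(sysdir, "slaves"))
    print("%-6s -> %-40s (on %s)" % (dm, lv, ",".join(disks)))
---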

> > Lets compare one of my cache-tier nodes (Hammer) with 4 DC S3610s 800GB
> > (inline, file based journal, no separate partition): 
> 
> I guess you’re running XFS? I’m going through code and reading up on the specific sync behaviour of the journal. I noticed in an XFS comment that various levels of SYNC might behave differently depending on whether you’re accessing a raw device or a file on e.g. XFS.
> 
EXT4 actually on these; on the staging cluster it's half XFS and half EXT4,
with noticeable but not significant differences.

> > ---
> > Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> > sdb               0.00   213.60    0.80  664.80     3.20 10978.40    33.00     1.01    1.52    1.00    1.52   0.05   3.36
> > sdc               0.00   240.80    0.80  476.40    10.40  9832.80    41.25     0.05    0.11    0.00    0.11   0.06   2.96
> > sda               0.00   232.00    0.00  523.00     0.00  9112.80    34.85     0.72    0.98    0.00    0.98   0.05   2.80
> > sdd               0.00   166.40    0.60  603.60     2.40  8046.40    26.64     0.99    1.63    0.00    1.64   0.05   2.72
> > ---
> > 
> > To your Micron SSD:
> > ---
> > Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> > sdl               0.00    13.20   33.60  454.60  1453.60 12751.20    58.19    28.47   58.22    7.51   61.97   0.49  23.92
> > dm-24             0.00     0.00    0.00  135.00     0.00  2136.80    31.66     1.02    7.46    0.00    7.46   1.62  21.88
> > dm-25             0.00     0.00   33.60  329.00  1453.60 11189.60    69.74    28.67   78.99    7.52   86.29   0.64  23.08
> > ---
> > 
> > I know people love LVM and it can be helpful at times.
> > However in this case it adds another layer that could be problematic and
> > also muddies the water with iostat a bit.
> > 
> > The await and utilization numbers are through the roof and totally
> > unacceptable for a SSD that calls itself "DC".
> > 
> > Here's a DC S3610 400GB (inline, file-based journal again) in my test
> > cluster cache tier during a rados bench run:
> > ---
> > Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> > sdb               0.00    25.00   18.67  490.00    76.00 291600.00  1146.83     5.44   10.70    7.57   10.82   1.37  69.60
> > ---
> > 
> > So an await of 10ms and 70% utilization when writing 25 times as much as
> > your sdl Micron up there. 
> > 
> > This also matches up with your own results when looking at the NVMe or
> > sdj, which I presume is one of the DC S3510s, right?
> 
> Right.
> 
> > Mind, this could also be the result of some very intrusive housekeeping
> > activity from the Micron controller as opposed to DSYNC weaknesses, but
> > clearly they are struggling immensely here.
> 
> Another experiment I’m adding to my list. Did the screenshots in my previous mail make it through the list? That one shows very regularly spiking weighted IO times for the Micron SSDs which may be a housekeeping indicator. I’ll see whether I can provoke that experimentally.
> 
Could be housekeeping, could be pagecache flushes or other XFS ops.
Probably best to test/compare with a standalone SSD.
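If you want to catch the housekeeping in the act, one way (a rough sketch,
the device name is just an assumption) is to quiesce or stop the OSD on the
Micron and then watch its write counters and weighted IO time once a
second; activity showing up with no load on it would point at the
controller doing its own thing rather than journal syncs:
---
#!/usr/bin/env python
# Minimal sketch: sample /proc/diskstats once a second for one device and
# print the per-interval write count and weighted IO time. Run it against
# the Micron (here assumed to be sdl) while the OSD on it is stopped/idle.
import time

DEVICE = "sdl"   # assumption, adjust to the drive under test

def read_stats(dev):
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if fields[2] == dev:
                # field 7 (0-based): writes completed,
                # field 13: weighted ms spent doing IO
                return int(fields[7]), int(fields[13])
    raise RuntimeError("device %s not found" % dev)

prev_w, prev_wt = read_stats(DEVICE)
while True:
    time.sleep(1)
    w, wt = read_stats(DEVICE)
    print("%s: %5d writes/s, %6d ms weighted IO time" % (DEVICE, w - prev_w, wt - prev_wt))
    prev_w, prev_wt = w, wt
---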
 
> >> - use fio to benchmark the raw LVM LV that hosted the journal before, following Sebastian’s post
> >> 
> > The actual raw device would be even better; you might want to play with one
> > of the SSDs you haven't deployed yet?
> 
> I’m actually “undeploying” one of the SSDs right now: if there’s a housekeeping issue then I think we’ll get better data with the SSD that has already seen some live action.
> 
True.
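For the raw-device run itself, something in the spirit of the journal test
from Sebastian's post should do; a minimal sketch (the device path is a
placeholder and, obviously, this destroys whatever is on it, so only point
it at the withdrawn SSD):
---
#!/usr/bin/env python
# Minimal sketch of a raw-device sync-write latency test, roughly in the
# spirit of the journal benchmark referenced earlier ("Sebastian's post").
# WARNING: this writes to the raw device and destroys whatever is on it.
import subprocess

DEVICE = "/dev/sdX"   # placeholder: the SSD pulled out of service

cmd = [
    "fio",
    "--name=journal-test",
    "--filename=%s" % DEVICE,
    "--direct=1",          # bypass the pagecache
    "--sync=1",            # O_SYNC writes, the pattern the journal cares about
    "--rw=write",
    "--bs=4k",
    "--numjobs=1",
    "--iodepth=1",
    "--runtime=60",
    "--time_based",
    "--group_reporting",
]
subprocess.check_call(cmd)
---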

> >> One thing I’m still trying to figure out how to prove: whether the LSI controller may be the bottleneck. When I ran fio on one of the pure SSD pool VMs I got increased latencies that aren’t proportional to what I see when not running fio (second run of iostat on disk sdj).
> >> 
> > 
> > What is sdj, another Micron?
> 
> sdj was the S3510 from above.
>
OK.
 
> > Your controller is a 6Gb/s one if I'm not mistaken. And your Microns
> > could saturate that in theory.
> > But that would be only on that link. 
> > Another bottleneck would be the actual PCIe interface bandwidth, but the
> > numbers just don't add up to that. 
> 
> We had a slight suspicion at some point that the controller may be wired up to the backplane in a weird way. That could lead to weird numbers. We were not able to substantiate that, though.
> 

If it were hooked up to the backplane (expander, or individual connectors
per drive?) with just one link/lane (6Gb/s), that would indeed be a
noticeable bottleneck.
But I have a hard time imagining that.

If it were connected with just one mini-SAS connector, i.e. 4 lanes to an
expander port, that would halve your potential bandwidth but still be more
than what you're currently likely to produce there.
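
For reference, the back-of-the-envelope numbers behind that (assuming SAS
2.0 lanes at 6Gb/s with 8b/10b encoding, so 80% of the raw rate is usable
payload):
---
#!/usr/bin/env python
# Rough per-lane and 4-lane bandwidth figures for SAS 2.0 links.
GBIT = 1e9

raw_per_lane = 6 * GBIT                           # 6 Gb/s raw line rate per lane
usable_per_lane = raw_per_lane * 8.0 / 10.0 / 8   # bytes/s after 8b/10b encoding

print("one lane : ~%d MB/s" % (usable_per_lane / 1e6))        # ~600 MB/s
print("4 lanes  : ~%d MB/s" % (4 * usable_per_lane / 1e6))    # ~2400 MB/s
---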

Christian

> I’ll come back when I have data from my experiments.
> 
> Thanks a lot,
> Christian
> 


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



