Re: Interpretation Guidance for Slow Requests

Hi,

On 7 Dec 2016, at 05:14, Christian Balzer <chibi@xxxxxxx> wrote:

Hello,

On Tue, 6 Dec 2016 20:58:52 +0100 Christian Theune wrote:

Alright. We’re postponing this for now. Is it actually a widespread assumption that Jewel has “prime time” issues?

You're asking the people here who are running into issues or patching stuff.

The upgrade is far more complex (ownerships) and has more potential for things
to go wrong than any other release I remember in the last 3 years.
Stuff like upgrading OSDs before MONs upsets the muscle memory of all
long-time operators.

The new defaults are likely to break things unless you prepared for them
and/or have all your clients upgraded at the same time (mission impossible
for anybody running long-term VMs).

Loads of things changed, some massively (like cache-tiering), with poor or
no documentation (other than source code and obscure changelog entries).

I'm pondering letting my one cluster/unit die of natural causes
while still running Hammer, after the HW is depreciated in 3 years.

That is quite a dim view. I’ve found myself in a similar place when experimenting with cache-tiering, and I’ve spent an increasing amount of time reading Ceph code over the last weeks. From our perspective we always thought we weren’t doing anything extraordinary, but we keep having to spend (seemingly?) extraordinary amounts of time to do what feels like the advertised standard.

Seeing friends struggle in a similar way makes me feel a bit pessimistic. Ceph, great as it is, does make me shop around for alternatives every few months, but there aren’t any that I’d seriously consider.

  LSI MegaRAID SAS 9271-8i
And as you already found out, not a good match for Ceph (unless you actually
do RAIDs, like me with Areca).
Also LSI tends to have issues unless you're on the latest drivers
(firmware and kernel side)

Yeah. I was leaning towards using the Adaptec HBA, but currently I’m tending towards the “devil we know”. Care to weigh in?

Google. ^o^
http://www.spinics.net/lists/ceph-users/msg24370.html

I've had a comparable (according to specs) Adaptec controller
(HBA) performing abysmally (40% slower) compared to the LSI equivalent.

So, back to square one: hardware is annoying and any specific vendor will have ups and downs … sigh.

Our current candidate would be the Adaptec 1000, for which I found one entry on the mailing list that didn’t pick up on it being bad. ;)

Adaptec/LSI are our vendor-provided choices. May I ask what your setup is regarding RAID/Areca?

Also discussed here plenty of times.
4GB HW cache, OSDs are 4-disk RAID10s, thus a replica of 2 (effectively 4).
Special use case, probably not for a general audience.

Sorry, I’ll try to Google faster. :)

  Dual 10-Gigabit X540-AT2
  OS on RAID 1 SATA HGST HUS724020ALS640 1.818 TB 7.2k
  Journal on 400G Intel MLC NVMe PCI-E 3.0 (DC P3600) (I thought those
should be DC P3700, maybe my inventory is wrong)

That needs clarification, which ones are they now?

I noticed we may have a mixed setup. I’ll need to go through this on our machines in detail. I’ll follow up on that.

  Pool “rbd.ssd”

      2 OSDs on 800GB Intel DC S3510 SSDSC2BX80 (jbod/raid 0)

You're clearly a much braver man than me, or have a very light load (but at
least the 3510s have 1 DWPD endurance, unlike the 0.3 of the 3500s).
Please clarify, journals inline (as you suggest below) or also on the NVME?

Journals inline.

Good.
And I brain-farted, I was thinking of 3520s, the 3510s still have 0.3
DWPD, so 0.15 DWPD after journals at best.
Danger Will Robinson!

Off to adding monitoring for our SSD life expectancy…
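
Here’s roughly what I have in mind, just a sketch: pull the wear counters via SMART. The attribute names differ between vendors (the Intel DC drives report something like Media_Wearout_Indicator, the Microns a lifetime-remaining counter), so the grep pattern below is deliberately loose and the device glob is a placeholder.

---
# Sketch: dump wear/endurance-related SMART attributes for each drive.
# Attribute names vary per vendor/firmware, so the pattern is kept loose.
for dev in /dev/sd?; do
    echo "== $dev =="
    smartctl -A "$dev" | egrep -i 'wear|life|written'
done
---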

  Pool “rbd.hdd”

      5-6 OSDs on SATA HGST HUS724020ALS640 1.818 TB 7.2k (jbod/raid 0)
      1 OSD on SAS MICRON S610DC-3840 3.492 TB (jbod/raid 0)
Same here, journal inline or also on the NVME?

Journals inline. I’m considering running an experiment with moving the journal to the NVMe, but: I ran iostat on the raw device as well as on the mapped LVM volumes for OSD/journal and did not see high wait times on the journal, only on the OSD. Also, when I tried Sebastian’s fio setup (on the OSD’s file system, not on the raw device, as that’s in use right now) I got reasonable numbers. However, this might be a stupid test, as I read that XFS might be ignoring the DSYNC request from fio.
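
For reference, this is the kind of job I mean, a sketch along the lines of Sebastian’s single-threaded sync write test; /dev/sdX is a placeholder and the run is destructive, so I’d only point it at an undeployed disk:

---
# DESTRUCTIVE: overwrites the target device. /dev/sdX is a placeholder.
# Single-threaded 4k sync writes, roughly the pattern a journal produces.
fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k \
    --numjobs=1 --iodepth=1 --runtime=60 --time_based \
    --group_reporting --name=journal-sync-test
---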


As I wrote below, not a fan of LVM in these use cases. 

Right. We’re looking forward to moving our historic setup to ceph-disk at some point. However, LVM isn’t supposed to be a performance issue, so I’ll run a test with and without LVM, but I don’t expect anything to come out of that.

How do you know that the high waits on the "OSD" part of the disk aren't
caused by the SSD melting from the "fast" SYNC writes to the journal part?

My thought process was that both numbers would have to go up. It seems weird to me that the sync writes within the journal would continue to be fast while the SSD was melting from them. The SSD doesn’t know about the two different LVM LVs, so having performance split exactly at that boundary is at least surprising. But obviously: an abstraction leak is possible.
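
As an aside, for reading the iostat output further down: to keep the journal and OSD LVs apart I map the dm-N kernel names back to the actual LVs. A quick way (nothing exotic assumed; sdl is the Micron from the earlier output):

---
# Show which dm-N kernel device belongs to which LV on a given disk.
lsblk -o NAME,KNAME,TYPE,SIZE /dev/sdl
# or via device-mapper:
dmsetup ls --tree
---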

Let's compare one of my cache-tier nodes (Hammer) with 4 DC S3610s 800GB
(inline, file-based journal, no separate partition):

I guess you’re running XFS? I’m going through code and reading up on the specific sync behaviour of the journal. I noticed in an XFS comment that various levels of SYNC might behave differently depending on whether you’re accessing a raw device or a file on e.g. XFS.

---
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00   213.60    0.80  664.80     3.20 10978.40    33.00     1.01    1.52    1.00    1.52   0.05   3.36
sdc               0.00   240.80    0.80  476.40    10.40  9832.80    41.25     0.05    0.11    0.00    0.11   0.06   2.96
sda               0.00   232.00    0.00  523.00     0.00  9112.80    34.85     0.72    0.98    0.00    0.98   0.05   2.80
sdd               0.00   166.40    0.60  603.60     2.40  8046.40    26.64     0.99    1.63    0.00    1.64   0.05   2.72
---

To your Micron SSD:
---
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdl               0.00    13.20   33.60  454.60  1453.60 12751.20    58.19    28.47   58.22    7.51   61.97   0.49  23.92
dm-24             0.00     0.00    0.00  135.00     0.00  2136.80    31.66     1.02    7.46    0.00    7.46   1.62  21.88
dm-25             0.00     0.00   33.60  329.00  1453.60 11189.60    69.74    28.67   78.99    7.52   86.29   0.64  23.08
---

I know people love LVM and it can be helpful at times.
However, in this case it adds another layer that could be problematic and
also muddies the waters with iostat a bit.

The await and utilization numbers are through the roof and totally
unacceptable for an SSD that calls itself "DC".

Here's a DC S3610 400GB (inline, file-based journal again) in my test
cluster cache tier during a rados bench run:
---
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00    25.00   18.67  490.00    76.00 291600.00  1146.83     5.44   10.70    7.57   10.82   1.37  69.60
---

So an await of 10ms and 70% utilization when writing 25 times as much as
your sdl Micron up there.

This also matches up with your own results when looking at the NVMe or
sdj, which I presume is one of the DC S3510s, right?

Right.

Mind, this could also be the result of some very intrusive housekeeping
activity from the Micron controller as opposed to DSYNC weaknesses, but
clearly they are struggling immensely here.

Another experiment I’m adding to my list. Did the screenshots in my previous mail make it through the list? They show very regularly spiking weighted IO times for the Micron SSDs, which may be a housekeeping indicator. I’ll see whether I can provoke that experimentally.
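
My rough idea for provoking it, again just a sketch and destructive, so only on the undeployed drive: hammer it with sustained random writes for a while, then stop all I/O and watch whether the device stays busy on its own.

---
# DESTRUCTIVE sketch: sustained random writes to trigger housekeeping/GC.
# /dev/sdX (and sdX below) is a placeholder for the undeployed SSD.
fio --filename=/dev/sdX --direct=1 --ioengine=libaio --rw=randwrite \
    --bs=4k --iodepth=32 --runtime=1800 --time_based --name=gc-provoke
# afterwards, with no I/O coming from us, look for background activity:
iostat -x 5 sdX
---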

- use fio (following Sebastian’s post) to benchmark the raw LVM LV that previously hosted the journal

Really raw (no LVM at all) would be even better; you might want to play with one of the SSDs you
haven't deployed yet?

I’m actually “undeploying” one of the SSDs right now: if there’s a housekeeping issue then I think we’ll get better data with the SSD that has already seen some live action.

One thing I’m still trying to figure out how to prove is whether the LSI controller may be the bottleneck. When I ran fio in one of the pure-SSD pool VMs I got increased latencies that aren’t proportional to what I see when not running fio (second run of iostat on disk sdj).
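
One approach I’m considering, as a sketch (read-only, so it could even run against live disks; the device names are just examples): drive several SSDs behind the controller in parallel and check whether the aggregate throughput scales with the number of devices or plateaus near a single link’s limit.

---
# Read-only sketch: compare single-device vs. parallel aggregate throughput.
# Device names are examples; --readonly guards against accidental writes.
for dev in /dev/sdi /dev/sdj /dev/sdk; do
    fio --readonly --filename="$dev" --direct=1 --ioengine=libaio \
        --rw=read --bs=4M --iodepth=16 --runtime=30 --time_based \
        --name="read-$(basename $dev)" &
done
wait
---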


What is sdj, another Micron?

sdj was the S3510 from above.

Your controller is a 6Gb/s one if I'm not mistaken. And your Microns
could saturate that in theory.
But that would be only on that link. 
Another bottleneck would be the actual PCIe interface bandwidth, but the
numbers just don't add up to that. 

We had a slight suspicion at some point that the controller might be wired up to the backplane in a weird way. That could lead to weird numbers. We were not able to substantiate that, though.

I’ll come back when I have data from my experiments.

Thanks a lot,
Christian

-- 
Christian Theune · ct@xxxxxxxxxxxxxxx · +49 345 219401 0
Flying Circus Internet Operations GmbH · http://flyingcircus.io
Forsterstraße 29 · 06112 Halle (Saale) · Deutschland
HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
