On Fri, 2016-07-22 at 08:56 +0100, Nick Fisk wrote:
> > -----Original Message-----
> > From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On
> > Behalf Of Martin Millnert
> > Sent: 22 July 2016 00:33
> > To: Ceph Users <ceph-users@xxxxxxxxxxxxxx>
> > Subject: Infernalis -> Jewel, 10x+ RBD latency increase
> >
> > Hi,
> >
> > I just upgraded from Infernalis to Jewel and see an approximately
> > 10x latency increase.
> >
> > Quick facts:
> >  - 3x replicated pool
> >  - 4 nodes, each 2x "E5-2690 v3 @ 2.60GHz", 128 GB RAM, 6x 1.6 TB
> >    Intel S3610 SSDs
> >  - LSI3008 controller with up-to-date firmware and upstream driver,
> >    and up-to-date firmware on the SSDs
> >  - 40GbE (Mellanox, with up-to-date drivers & firmware)
> >  - CentOS 7.2
> >
> > The physical layer checks out, both iperf3 for the network and e.g.
> > fio over all the SSDs. I haven't done much Linux tuning yet, but
> > irqbalance does a pretty good job of pairing both NIC and HBA with
> > their respective CPUs.
> >
> > I'm in performance-hunting mode, and today took the next logical
> > step of upgrading from Infernalis to Jewel.
> >
> > The tester is a remote KVM/QEMU/libvirt guest (OpenStack), a CentOS 7
> > image running fio. The test scenario is 4K random write, libaio,
> > direct I/O, QD=1, runtime=900s, test-file-size=40GiB.
> >
> > The picture went from [1] to [2]. In [1], the guest saw 98.25% of
> > the I/O complete within at most 250 µsec (~4000 IOPS). Now, in [2],
> > 98.95% of the I/O lands at ~4 msec (actually ~300 IOPS).
>
> I would be suspicious that somehow somewhere you had some sort of
> caching going on, in the 1st example.

It wouldn't surprise me either, though to the best of my knowledge I
haven't actively configured any such write caching anywhere.

I did forget one brief detail regarding the setup: we run 4x OSDs per
SSD drive, i.e. roughly 400 GB each. Consistent 4k random-write
performance onto /var/lib/ceph/osd-$num/fiotestfile, with a test config
similar to the above (job file sketched further down in this mail), is
13k IOPS *per partition*.

> 250us is pretty much unachievable for directio writes with Ceph.

Thanks for the feedback, though it's disappointing to hear.

> I've just built some new nodes with the pure goal of crushing
> (excuse the pun) write latency and after extensive tuning can't get
> it much below 600-700us.

Which of the items below, or things beyond them, have you done,
considering the direct I/O baseline?

 - SSD-only hosts
 - NIC <-> CPU/NUMA mapping
 - HBA <-> CPU/NUMA mapping
 - ceph-osd process <-> CPU/NUMA mapping
 - partitioning SSDs into multiple partitions
 - Ceph OSD tunings for concurrency (many clients)
 - Ceph OSD tunings for latency (many clients)
 - async messenger, new in Jewel (not sure what the impact is), or a
   change/tuning of the memory allocator (rough sketch further down)
 - RDMA (e.g. Mellanox) messenger

I have yet to iron out precisely what those two OSD tunings would be.

> The 4ms sounds more likely for an untuned cluster. I wonder if any of
> the RBD or qemu cache settings would have changed between versions?

I'm curious about this too. What are the relevant OSD-side configs
here? And how do I check what the librbd clients experience? Which
parameters from e.g. /etc/ceph/$clustername.conf apply to them?

I'll have to make another pass over the rbd PRs between Infernalis and
10.2.2, I suppose.
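
In the meantime, my plan for checking what a librbd client actually
runs with is the client admin socket. Sketch below; the socket path is
just the usual metavariable template and the grep is illustrative, so
adjust to taste:

    # ceph.conf on the compute node, so qemu/librbd creates a socket:
    [client]
        admin socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok

    # then, against the socket of a running guest:
    ceph --admin-daemon /var/run/ceph/<client>.asok config show | grep rbd_cache
    ceph --admin-daemon /var/run/ceph/<client>.asok perf dump

That should at least answer the "what do the clients actually see"
part, including whatever got merged in from the conf file and library
defaults.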
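
And these are the client-side cache knobs I intend to compare between
the Infernalis and Jewel clients, together with whatever cache= mode
the libvirt disk definition uses. This is a sketch of what I believe
the defaults are, not a recommendation, so please correct me if the
values are off:

    [client]
        rbd cache = true
        # stays in writethrough mode until the guest sends its first flush
        rbd cache writethrough until flush = true
        rbd cache size = 33554432         # 32 MiB
        rbd cache max dirty = 25165824    # 24 MiB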
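
For completeness, the fio job behind both the in-guest numbers and the
raw per-partition baseline looks roughly like this; the filename is
illustrative, and the baseline simply points it at a file on the OSD
partition instead:

    [global]
    ioengine=libaio
    direct=1
    rw=randwrite
    bs=4k
    iodepth=1
    runtime=900
    size=40g

    [guest-4k-randwrite]
    filename=/mnt/test/fiotestfile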
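
On the messenger/allocator bullet above, the two concrete toggles I
have written down so far are below. Completely unverified on my side,
so treat it as a sketch rather than advice (the tcmalloc variable lives
in /etc/sysconfig/ceph on the CentOS packages, if I read the unit files
right):

    # ceph.conf - switch OSDs (and clients) to the async messenger
    [global]
        ms type = async

    # /etc/sysconfig/ceph - give tcmalloc a bigger thread cache (128 MiB)
    TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728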

> > Between [1] and [2] (simple plots of fio's end-to-end latency
> > metrics), the code on the entire cluster, including the compute
> > nodes, went from Infernalis to 10.2.2.
> >
> > What's going on here?
> >
> > I haven't tuned the Ceph OSDs either in config or via the Linux
> > kernel at all yet; the upgrade to Jewel came first. I haven't
> > changed any OSD configs between [1] and [2] myself (only minimally
> > before [1], zero effort on performance tuning), other than updating
> > to the Jewel tunables. But the difference is very drastic, wouldn't
> > you say?
> >
> > Best,
> > Martin
> >
> > [1] http://martin.millnert.se/ceph/pngs/guest-ceph-fio-bench/test08/ceph-fio-bench_lat.1.png
> > [2] http://martin.millnert.se/ceph/pngs/guest-ceph-fio-bench/test10/ceph-fio-bench_lat.1.png

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com