Re: Infernalis -> Jewel, 10x+ RBD latency increase

Nick Fisk <nick@xxxxxxxxxx> · Fri, 22 Jul 2016 10:38:25 +0100

> -----Original Message-----
> From: Martin Millnert [mailto:martin@xxxxxxxxxxx]
> Sent: 22 July 2016 10:32
> To: nick@xxxxxxxxxx; 'Ceph Users' <ceph-users@xxxxxxxxxxxxxx>
> Subject: Re:  Infernalis -> Jewel, 10x+ RBD latency increase
> 
> On Fri, 2016-07-22 at 08:56 +0100, Nick Fisk wrote:
> > >
> > > -----Original Message-----
> > > From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On
> > > Behalf Of Martin Millnert
> > > Sent: 22 July 2016 00:33
> > > To: Ceph Users <ceph-users@xxxxxxxxxxxxxx>
> > > Subject:  Infernalis -> Jewel, 10x+ RBD latency increase
> > >
> > > Hi,
> > >
> > > I just upgraded from Infernalis to Jewel and see an approximate 10x
> > > latency increase.
> > >
> > > Quick facts:
> > >  - 3x replicated pool
> > >  - 4x 2x-"E5-2690 v3 @ 2.60GHz", 128GB RAM, 6x 1.6 TB Intel S3610
> > > SSDs,
> > >  - LSI3008 controller with up-to-date firmware and upstream driver,
> > > and up-to-date firmware on SSDs.
> > >  - 40GbE (Mellanox, with up-to-date drivers & firmware)
> > >  - CentOS 7.2
> > >
> > > Physical checks out, both iperf3 for network and e.g. fio over all
> > > the SSDs. Not done much of Linux tuning yet; but irqbalanced does a
> > > pretty good job with pairing both NIC and HBA with their respective
> > > CPUs.
> > >
> > > In performance hunting mode, and today took the next logical step of
> > > upgrading from Infernalis to Jewel.
> > >
> > > Tester is remote KVM/Qemu/libvirt guest (openstack) CentOS 7 image
> > > with fio. The test scenario is 4K randomwrite, libaio, directIO,
> > > QD=1, runtime=900s, test-file-size=40GiB.
> > >
> > > Went from a picture of [1] to [2]. In [1], the guest saw 98.25% of
> > > the I/O complete within maximum 250 µsec (~4000 IOPS). This, [2],
> > > sees 98.95% of the IO at ~4 msec (actually ~300 IOPs).
> >
> > I would be suspicious that somehow somewhere you had some sort of
> > caching going on, in the 1st example.
> 
> It wouldn't surprise me either, though to the best of my knowledge I haven't actively configured any such write caching anywhere.
> 
> I did forget one brief detail regarding the setup: We run 4x OSDs per SSD-drive, i.e. roughly 400 GB each.
> Consistent 4k random-write performance onto /var/lib/ceph/osd-$num/fiotestfile, with similar test-config as above, is 13k IOPS *per
> partition*.
> 
> > 250us is pretty much unachievable for directio writes with Ceph.
> 
> Thanks for the feedback, though it's disappointing to hear.
> 
> >  I've just built some new nodes with the pure goal of crushing (excuse
> > the pun) write latency and after extensive tuning can't get it much
> > below 600-700us.
> 
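
For scale: at QD=1 there is only ever one I/O in flight, so IOPS is simply the reciprocal of the per-I/O latency. A quick sketch of that arithmetic for the figures in this thread (the function names are mine, purely for illustration):

```python
def iops_at_qd1(latency_us: float) -> float:
    """With a single outstanding I/O, IOPS = 1 second / per-I/O latency."""
    return 1_000_000 / latency_us


def latency_us_at_qd1(iops: float) -> float:
    """Inverse of the above: mean per-I/O latency implied by an IOPS figure."""
    return 1_000_000 / iops


if __name__ == "__main__":
    # Latencies quoted in this thread, in microseconds.
    for label, lat_us in [("plot [1]", 250), ("plot [2]", 4000), ("tuned nodes", 600)]:
        print(f"{label}: {lat_us} us -> {iops_at_qd1(lat_us):.0f} IOPS")
    # The ~300 IOPS actually observed in [2] implies a mean latency a bit under 4 ms.
    print(f"300 IOPS -> {latency_us_at_qd1(300):.0f} us")
```

So 250us matches the ~4000 IOPS you saw before the upgrade, and even the 600us I get after tuning would cap a single QD=1 writer at well under 2000 IOPS.
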
> What of the below, or other than the below, have you done, considering the directIO baseline?
>  - SSD only hosts
>  - NIC <-> CPU/NUMA mapping
>  - HBA <-> CPU/NUMA mapping
>  - ceph-osd process <-> CPU/NUMA mapping
>  - Partition SSDs into multiple partitions
>  - Ceph OSD tunings for concurrency (many-clients)
>  - Ceph OSD tunings for latency (many-clients)
>  - async messenger, new in Jewel (not sure what impact is), or, change/tuning of memory allocator
>  - RDMA (e.g. Mellanox) messenger

The things that have made the most difference (i.e. going from 2-3ms down to 600us) are:

- Fast cores (using Xeon E3 running at 3.6GHz)
- Which are also single socket, so no NUMA to worry about
- NVMe journals (significantly lower device write latency vs SSD)
- Fix CPU freq at 3.6GHz
- Set max C-state to C1

Most of the OSD tuning is probably aimed more at high concurrency or throughput; it seems to have less of an effect on latency than the above. Of course those CPU tunings do increase power usage, so I'm looking at ways to find the best balance.
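
If you want to see where a node currently stands on the frequency and C-state points, something along these lines should do. It is only a sketch: it reads the standard Linux cpufreq and cpuidle files under /sys/devices/system/cpu, the function names are made up for illustration, and it reports settings rather than changing them. Note that if the C-state limit is applied via a kernel boot parameter, the deeper states may simply not be listed at all.

```python
from pathlib import Path


def read_text(path: Path):
    """Return the stripped contents of a sysfs file, or None if unreadable."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def cpu_power_state(sysfs_root="/sys/devices/system/cpu"):
    """Collect governor, frequencies and enabled idle states for each CPU."""
    root = Path(sysfs_root)
    report = {}
    if not root.is_dir():
        return report
    for cpu_dir in sorted(root.glob("cpu[0-9]*")):
        freq = cpu_dir / "cpufreq"
        enabled = []
        for state in sorted((cpu_dir / "cpuidle").glob("state[0-9]*")):
            # 'disable' holds 1 when the idle state has been switched off.
            if read_text(state / "disable") != "1":
                enabled.append(read_text(state / "name"))
        report[cpu_dir.name] = {
            "governor": read_text(freq / "scaling_governor"),
            "cur_khz": read_text(freq / "scaling_cur_freq"),
            "min_khz": read_text(freq / "scaling_min_freq"),
            "max_khz": read_text(freq / "scaling_max_freq"),
            "idle_states_enabled": enabled,
        }
    return report


if __name__ == "__main__":
    for cpu, info in cpu_power_state().items():
        print(cpu, info)
```

On a node tuned as described above I would expect min and max frequency to be equal and nothing deeper than C1 in the enabled idle states.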

> 
> I have yet to iron out precisely what those two OSD tunings would be.
> 
> > The 4ms sounds more likely for an untuned cluster. I wonder if any of
> > the RBD or qemu cache settings would have changed between versions?
> 
> I'm curious about this too.  What are relevant OSD-side configs here?
> And how do I check what the librbd clients experience? What parameters from e.g. /etc/ceph/$clustername.conf apply to them?

I think RBD cache is enabled by default; you can check via the admin socket:

http://ceph.com/planet/ceph-validate-that-the-rbd-cache-is-active/
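
Roughly, the check boils down to running "config show" against the client's admin socket and looking at the rbd_cache* values. A minimal sketch, assuming the librbd client (qemu) has an admin socket configured in the [client] section of its ceph.conf and that you pass the path of the resulting .asok file yourself; the helper names are mine, not part of any Ceph API:

```python
import json
import subprocess
import sys


def rbd_cache_settings(config: dict) -> dict:
    """Keep only the rbd_cache* options from a 'config show' dictionary."""
    return {k: v for k, v in sorted(config.items()) if k.startswith("rbd_cache")}


def query_admin_socket(asok_path: str) -> dict:
    """Run 'ceph --admin-daemon <socket> config show' and parse its JSON output."""
    result = subprocess.run(
        ["ceph", "--admin-daemon", asok_path, "config", "show"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)


if __name__ == "__main__":
    if len(sys.argv) == 2:
        # argv[1] is the client admin socket path (site-specific, not guessed here).
        for key, value in rbd_cache_settings(query_admin_socket(sys.argv[1])).items():
            print(f"{key} = {value}")
    else:
        print("usage: pass the path to the client's .asok file")
```

Running that against a guest before and after the upgrade would show whether rbd_cache or rbd_cache_writethrough_until_flush ended up with different values between the two versions.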

> 
> I'll have to make another pass over the rbd PRs between Infernalis and
> 10.2.2 I suppose.
> 
> 
> > > Between [1] and [2] (simple plots of FIO's E2E-latency metrics), the
> > > entire cluster including compute nodes code went from Infernalis to
> > > 10.2.2
> > >
> > > What's going on here?
> > >
> > > I haven't tuned Ceph OSDs either in config or via Linux kernel at
> > > all yet; upgrade to Jewel came first. I haven't changed any OSD
> > > configs between [1] and [2] myself (only minimally before [1], 0
> > > effort on performance tuning) , other than updated to Jewel
> > > tunables. But the difference is very drastic, wouldn't you say?
> > >
> > > Best,
> > > Martin
> > > [1] http://martin.millnert.se/ceph/pngs/guest-ceph-fio-bench/test08/ceph-fio-bench_lat.1.png
> > > [2] http://martin.millnert.se/ceph/pngs/guest-ceph-fio-bench/test10/ceph-fio-bench_lat.1.png
> > > _______________________________________________
> > > ceph-users mailing list
> > > ceph-users@xxxxxxxxxxxxxx
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
