On Fri, 2016-07-22 at 08:56 +0100, Nick Fisk wrote:
> > -----Original Message-----
> > From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On
> > Behalf Of Martin Millnert
> > Sent: 22 July 2016 00:33
> > To: Ceph Users <ceph-users@xxxxxxxxxxxxxx>
> > Subject: Infernalis -> Jewel, 10x+ RBD latency increase
> >
> > Hi,
> >
> > I just upgraded from Infernalis to Jewel and see an approximately
> > 10x latency increase.
> >
> > Quick facts:
> >  - 3x replicated pool
> >  - 4 nodes, each 2x "E5-2690 v3 @ 2.60GHz", 128 GB RAM, 6x 1.6 TB
> >    Intel S3610 SSDs
> >  - LSI3008 controller with up-to-date firmware and upstream driver,
> >    and up-to-date firmware on the SSDs
> >  - 40GbE (Mellanox, with up-to-date drivers & firmware)
> >  - CentOS 7.2
> >
> > The physical layer checks out, both iperf3 for the network and e.g.
> > fio over all the SSDs. I haven't done much Linux tuning yet, but
> > irqbalance does a pretty good job of pairing both NIC and HBA with
> > their respective CPUs.
> >
> > I'm in performance-hunting mode, and today took the next logical
> > step of upgrading from Infernalis to Jewel.
> >
> > The tester is a remote KVM/QEMU/libvirt guest (OpenStack), a CentOS 7
> > image running fio. The test scenario is 4K random write, libaio,
> > direct I/O, QD=1, runtime=900s, test-file-size=40GiB.
> >
> > The picture went from [1] to [2]. In [1], the guest saw 98.25% of
> > the I/O complete within at most 250 µsec (~4000 IOPS). Now, in [2],
> > 98.95% of the I/O lands at ~4 msec (actually ~300 IOPS).
>
> I would be suspicious that somehow somewhere you had some sort of
> caching going on, in the 1st example.

It wouldn't surprise me either, though to the best of my knowledge I
haven't actively configured any such write caching anywhere.

I did forget one brief detail regarding the setup: we run 4x OSDs per
SSD drive, i.e. roughly 400 GB each. Consistent 4k random-write
performance onto /var/lib/ceph/osd-$num/fiotestfile, with a test config
similar to the above (job file sketched further down in this mail), is
13k IOPS *per partition*.

> 250us is pretty much unachievable for directio writes with Ceph.

Thanks for the feedback, though it's disappointing to hear.

> I've just built some new nodes with the pure goal of crushing
> (excuse the pun) write latency and after extensive tuning can't get
> it much below 600-700us.

Which of the items below, or things beyond them, have you done,
considering the direct I/O baseline?

 - SSD-only hosts
 - NIC <-> CPU/NUMA mapping
 - HBA <-> CPU/NUMA mapping
 - ceph-osd process <-> CPU/NUMA mapping
 - partitioning SSDs into multiple partitions
 - Ceph OSD tunings for concurrency (many clients)
 - Ceph OSD tunings for latency (many clients)
 - async messenger, new in Jewel (not sure what the impact is), or a
   change/tuning of the memory allocator (rough sketch further down)
 - RDMA (e.g. Mellanox) messenger

I have yet to iron out precisely what those two OSD tunings would be.

> The 4ms sounds more likely for an untuned cluster. I wonder if any of
> the RBD or qemu cache settings would have changed between versions?

I'm curious about this too. What are the relevant OSD-side configs
here? And how do I check what the librbd clients experience? Which
parameters from e.g. /etc/ceph/$clustername.conf apply to them?

I'll have to make another pass over the rbd PRs between Infernalis and
10.2.2, I suppose.
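
In the meantime, my plan for checking what a librbd client actually
runs with is the client admin socket. Sketch below; the socket path is
just the usual metavariable template and the grep is illustrative, so
adjust to taste:

    # ceph.conf on the compute node, so qemu/librbd creates a socket:
    [client]
        admin socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok

    # then, against the socket of a running guest:
    ceph --admin-daemon /var/run/ceph/<client>.asok config show | grep rbd_cache
    ceph --admin-daemon /var/run/ceph/<client>.asok perf dump

That should at least answer the "what do the clients actually see"
part, including whatever got merged in from the conf file and library
defaults.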
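
And these are the client-side cache knobs I intend to compare between
the Infernalis and Jewel clients, together with whatever cache= mode
the libvirt disk definition uses. This is a sketch of what I believe
the defaults are, not a recommendation, so please correct me if the
values are off:

    [client]
        rbd cache = true
        # stays in writethrough mode until the guest sends its first flush
        rbd cache writethrough until flush = true
        rbd cache size = 33554432         # 32 MiB
        rbd cache max dirty = 25165824    # 24 MiB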
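
For completeness, the fio job behind both the in-guest numbers and the
raw per-partition baseline looks roughly like this; the filename is
illustrative, and the baseline simply points it at a file on the OSD
partition instead:

    [global]
    ioengine=libaio
    direct=1
    rw=randwrite
    bs=4k
    iodepth=1
    runtime=900
    size=40g

    [guest-4k-randwrite]
    filename=/mnt/test/fiotestfile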
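
On the messenger/allocator bullet above, the two concrete toggles I
have written down so far are below. Completely unverified on my side,
so treat it as a sketch rather than advice (the tcmalloc variable lives
in /etc/sysconfig/ceph on the CentOS packages, if I read the unit files
right):

    # ceph.conf - switch OSDs (and clients) to the async messenger
    [global]
        ms type = async

    # /etc/sysconfig/ceph - give tcmalloc a bigger thread cache (128 MiB)
    TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728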

> > Between [1] and [2] (simple plots of fio's end-to-end latency
> > metrics), the code on the entire cluster, including the compute
> > nodes, went from Infernalis to 10.2.2.
> >
> > What's going on here?
> >
> > I haven't tuned the Ceph OSDs either in config or via the Linux
> > kernel at all yet; the upgrade to Jewel came first. I haven't
> > changed any OSD configs between [1] and [2] myself (only minimally
> > before [1], zero effort on performance tuning), other than updating
> > to the Jewel tunables. But the difference is very drastic, wouldn't
> > you say?
> >
> > Best,
> > Martin
> >
> > [1] http://martin.millnert.se/ceph/pngs/guest-ceph-fio-bench/test08/ceph-fio-bench_lat.1.png
> > [2] http://martin.millnert.se/ceph/pngs/guest-ceph-fio-bench/test10/ceph-fio-bench_lat.1.png

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com