Re: Ceph + VMware + Single Thread Performance

Hello,

On Mon, 22 Aug 2016 20:34:54 +0100 Nick Fisk wrote:

> > -----Original Message-----
> > From: Christian Balzer [mailto:chibi@xxxxxxx]
> > Sent: 22 August 2016 03:00
> > To: 'ceph-users' <ceph-users@xxxxxxxxxxxxxx>
> > Cc: Nick Fisk <nick@xxxxxxxxxx>
> > Subject: Re:  Ceph + VMware + Single Thread Performance
> > 
> > 
> > Hello,
> > 
> > On Sun, 21 Aug 2016 09:57:40 +0100 Nick Fisk wrote:
> > 
> > >
> > >
> > > > -----Original Message-----
> > > > From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On
> > > > Behalf Of Christian Balzer
> > > > Sent: 21 August 2016 09:32
> > > > To: ceph-users <ceph-users@xxxxxxxxxxxxxx>
> > > > Subject: Re:  Ceph + VMware + Single Thread Performance
> > > >
> > > >
> > > > Hello,
> > > >
> > > > On Sun, 21 Aug 2016 09:16:33 +0100 Brian :: wrote:
> > > >
> > > > > Hi Nick
> > > > >
> > > > > Interested in this comment - "-Dual sockets are probably bad and
> > > > > will impact performance."
> > > > >
> > > > > Have you got real world experience of this being the case?
> > > > >
> > > > Well, Nick wrote "probably".
> > > >
> > > > Dual sockets and thus NUMA, the need for CPUs to talk to each other
> > > > and share information, certainly can impact things that are very
> > > > time critical.
> > > > How much though is a question of design, both HW and SW.
> > >
> > > There was a guy from Redhat (sorry his name escapes me now) a few
> > > months ago on the performance weekly meeting. He was analysing the CPU
> > > cache miss effects with Ceph and it looked like a NUMA setup was
> > > having quite a severe impact on some things. To be honest a lot of it
> > > went over my head, but I came away from it with a general feeling that
> > > if you can get the required performance from 1 socket, then that is
> > > probably a better bet. This includes only populating a single socket
> > > in a dual socket system. There was also a Ceph tech talk at the start
> > > of the year (High perf databases on Ceph) where the guy presenting was
> > > also recommending only populating 1 socket for latency reasons.
> > >
> > I wonder how complete their testing was and how much manual tuning they tried.
> > As in:
> > 
> > 1. Was irqbalance running?
> > Because it and the normal kernel strategies clash beautifully.
> > Irqbalance moves stuff around, the kernel tries to move things close to where the IRQs are, cat and mouse.
> > 
> > 2. Did they try with manual IRQ pinning?
> > I do, not that it's critical with my Ceph nodes, but on other machines it can make a LOT of difference.
> > Like keeping the cores reserved for KVM vhost processes near (or at least on the same NUMA node as) the network IRQs.
> > 
> > 3. Did they try pinning Ceph OSD processes?
> > While this may certainly help (and make things more predictable when the load gets high), as I said above the kernel normally does a
> > pretty good job of NOT moving things around and keeping processes close to the resources they need.
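(For concreteness, a minimal sketch of what 2. and 3. can look like; the interface name, IRQ number and core ranges below are only examples and need matching to the actual hardware layout:)
---
# stop irqbalance first, otherwise it will keep moving things around
systemctl stop irqbalance

# find the IRQ numbers of the NIC (interface name is only an example)
grep eth0 /proc/interrupts

# pin example IRQ 42 to cores 0-1 (hex bitmask, 0x3 = CPU0+CPU1)
echo 3 > /proc/irq/42/smp_affinity

# crude example: pin all ceph-osd processes on this host to cores 0-5
for p in $(pidof ceph-osd); do taskset -cp 0-5 "$p"; done
---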
> > 
> 
> From what I remember I think they went to pretty great lengths to tune things. I think one point was that if you have a 40Gb NIC on one socket and an NVMe on another, no matter where the process runs, you are going to have a lot of traffic crossing between the sockets.

Traffic yes, complete process migrations hopefully not.
But anyway, yes, that's to be expected.

And also unavoidable if you want/need to utilize the full capabilities
and PCIe lanes of a dual socket motherboard.
And in some cases (usually not with Ceph/OSDs), the IRQ load really will
benefit from more cores to play with.

> 
> Here is the DB on Ceph one
> 
> http://ceph-users.ceph.narkive.com/1sj4VI4U/ceph-tech-talk-high-performance-production-databases-on-ceph

Thanks!
Yeah, basically confirms what I know/said.

> 
> I don't think the recordings are available for the performance meeting one, but it was something to do with certain C++ string functions causing issues with the CPU cache. Honestly I can't remember much else.
> 
> > > Both of those, coupled with the fact that Xeon E3's are the cheapest way to get high clock speeds, sort of made my decision.
> > >
> > Totally agreed, my current HDD node design is based on the single CPU Supermicro 5028R-E1CR12L barebone, with an E5-1650 v3
> > (3.50GHz) CPU.
> 
> Nice. Any ideas how they compare to the E3's?
> 
Not really, as in direct comparison.
They look good enough on paper and surely perform as advertised.

> > 
> > > >
> > > > We're looking here at a case where he's trying to reduce latency by
> > > > all means and where the actual CPU needs for the HDDs are negligible.
> > > > The idea being that a "Ceph IOPS" stays on one core which is hopefully also not being shared at that time.
> > > >
> > > > If you're looking at full SSD nodes OTOH a single CPU may very well
> > > > not be able to saturate a sensible number of SSDs per node, so a
> > > > slight penalty but better utilization and overall IOPS with 2 CPUs
> > > > may be the way forward.
> > >
> > > Definitely, as always work out what your requirements are and design around them.
> > >
> > On my cache tier nodes with 2x E5-2623 v3 (3.00GHz) and currently 4 800GB DC S3610 SSDs I can already saturate all but 2 "cores", with
> > the "right"
> > extreme test cases.
> > Normal load is of course just around 4 (out of 16) "cores".
> 
> Any idea what sort of IOPS that is? Wondering how it lines up against my estimate of 10000 on my single quad core.

For background and HW, see my recent thread titled:
"Better late than never, some XFS versus EXT4 test results"

Basically something like this, on a KRBD mounted EXT4 image:
--- 
fio --size=8G --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 --rw=randwrite --name=fiojob --blocksize=4K --iodepth=64
---

This will give us about 8500 IOPS, using about 1100% (of 1200%) for
Ceph/OS, with only 20% (so next to nothing) in WAIT; the SSDs are only 35%
busy.

So with this setup and 15.6 GHz total capacity versus your 4x 3.5GHz
cores at 14GHz, 10000 might be a bit optimistic (though Jewel may help), but
definitely in the correct ballpark.
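(Back-of-envelope, assuming IOPS scale roughly linearly with aggregate clock speed and ignoring IPC differences between the E5 and E3:)
---
echo "8500 * 14 / 15.6" | bc -l    # roughly 7600 IOPS
---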

Christian
> 
> > 
> > And for the people who like it fast(er) but don't have to deal with VMware or the likes, instead of forcing the c-state to 1, just setting
> > the governor to "performance" was enough in my case to halve latency (from about 2ms to 1ms).
> 
> Is that also changing the c-state? I'm pretty sure that only affects the frequency.
> 
> > 
> > This still does save some power at times and (as Nick speculated) indeed allows some cores to use their turbo speeds.
> 
> I did a test on my new boxes and the difference between max power savings and full frequency + cstate=1 was less than 10W.
> 
> > 
> > So the 4-5 busy cores on my cache tier nodes tend to hover around 3.3GHz, instead of the 3.0GHz baseline for their CPUs.
> > And the less loaded cores don't tend to go below 2.6GHz, as opposed to the 1.2GHz that the "powersave" governor would default to.
> > 
> > Christian
> > 
> > > >
> > > > Christian
> > > >
> > > > > Thanks - B
> > > > >
> > > > > On Sun, Aug 21, 2016 at 8:31 AM, Nick Fisk <nick@xxxxxxxxxx> wrote:
> > > > > >> -----Original Message-----
> > > > > >> From: Alex Gorbachev [mailto:ag@xxxxxxxxxxxxxxxxxxx]
> > > > > >> Sent: 21 August 2016 04:15
> > > > > >> To: Nick Fisk <nick@xxxxxxxxxx>
> > > > > >> Cc: wr@xxxxxxxx; Horace Ng <horace@xxxxxxxxx>; ceph-users
> > > > > >> <ceph-users@xxxxxxxxxxxxxx>
> > > > > >> Subject: Re:  Ceph + VMware + Single Thread
> > > > > >> Performance
> > > > > >>
> > > > > >> Hi Nick,
> > > > > >>
> > > > > >> On Thu, Jul 21, 2016 at 8:33 AM, Nick Fisk <nick@xxxxxxxxxx> wrote:
> > > > > >> >> -----Original Message-----
> > > > > >> >> From: wr@xxxxxxxx [mailto:wr@xxxxxxxx]
> > > > > >> >> Sent: 21 July 2016 13:23
> > > > > >> >> To: nick@xxxxxxxxxx; 'Horace Ng' <horace@xxxxxxxxx>
> > > > > >> >> Cc: ceph-users@xxxxxxxxxxxxxx
> > > > > >> >> Subject: Re:  Ceph + VMware + Single Thread
> > > > > >> >> Performance
> > > > > >> >>
> > > > > >> >> Okay and what is your plan now to speed up ?
> > > > > >> >
> > > > > >> > Now I have come up with a lower latency hardware design,
> > > > > >> > there is not much further improvement until persistent RBD
> > > > > >> > caching is implemented, as you will be moving the SSD/NVME
> > > > > >> > closer to the client. But I'm happy with what I can achieve
> > > > > >> > at the moment. You could also experiment with bcache on the RBD.
> > > > > >>
> > > > > >> Reviving this thread, would you be willing to share the details
> > > > > >> of the low latency hardware design?  Are you optimizing for NFS or iSCSI?
> > > > > >
> > > > > > Both really, just trying to get the write latency as low as
> > > > > > possible, as you know, vmware does everything with lots of
> > > > > > unbuffered small io's. Eg when you migrate a VM or as thin
> > > > > > vmdk's grow.
> > > > > >
> > > > > > Even storage vmotions which might kick off 32 threads, as they
> > > > > > all roughly fall on the same PG, there still appears to be a
> > > > > > bottleneck with contention on the PG itself.
> > > > > >
> > > > > > These were the sort of things I was trying to optimise for, to make the time spent in Ceph as minimal as possible for each IO.
> > > > > >
> > > > > > So onto the hardware. Through reading various threads and experiments on my own I came to the following conclusions.
> > > > > >
> > > > > > -You need highest possible frequency on the CPU cores, which normally also means less of them.
> > > > > > -Dual sockets are probably bad and will impact performance.
> > > > > > -Use NVME's for journals to minimise latency
> > > > > >
> > > > > > The end result was OSD nodes based off of a 3.5GHz Xeon E3v5
> > > > > > with an Intel P3700 for a journal. I used the SuperMicro
> > > > > > X11SSH-CTF board which has 10G-T onboard as well as 8 SATA and
> > > > > > 8 SAS, so no expansion cards required. Actually this design, as
> > > > > > well as being very performant for Ceph, also works out very
> > > > > > cheap as you are using low end server parts. The whole lot +
> > > > > > 12x7.2k disks all goes into a 1U case.
> > > > > >
> > > > > > During testing I noticed that by default c-states and p-states
> > > > > > slaughter performance. After forcing max cstate to 1 and forcing
> > > > > > the CPU frequency up to max, I was seeing 600us latency for a 4kb
> > > > > > write to a 3x replica pool, or around 1600 IOPS; this is at QD=1.
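(A sketch of one common way to force the C-state, since the exact method isn't stated above; treat the parameters as an assumption about what was used:)
---
# cap C-states via kernel boot parameters (added to the bootloader cmdline):
#   intel_idle.max_cstate=1 processor.max_cstate=1

# check which idle states the cores can currently reach
cat /sys/devices/system/cpu/cpu0/cpuidle/state*/name
---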
> > > > > >
> > > > > > Few other observations:
> > > > > > 1. Power usage is around 150-200W for this config with 12x7.2k disks
> > > > > > 2. CPU usage maxing out disks is only around 10-15%, so plenty of headroom for more disks.
> > > > > > 3. NOTE FOR ABOVE: Don't include iowait when looking at CPU usage
> > > > > > 4. No idea about CPU load for pure SSD nodes, but based on the current disks, you could maybe expect ~10000 IOPS per node, before maxing out CPUs
> > > > > > 5. Single NVME seems to be able to journal 12 disks with no problem during normal operation, no doubt a specific benchmark could max it out though.
> > > > > > 6. There are slightly faster Xeon E3's, but price/performance = diminishing returns
> > > > > >
> > > > > > Hope that answers all your questions.
> > > > > > Nick
> > > > > >
> > > > > >>
> > > > > >> Thank you,
> > > > > >> Alex
> > > > > >>
> > > > > >> >
> > > > > >> >>
> > > > > >> >> Would it help to put in multiple P3700 per OSD Node to
> > > > > >> >> improve performance for a single Thread (example Storage
> > > > > >> >> VMotion)?
> > > > > >> >
> > > > > >> > Most likely not, it's all the other parts of the puzzle which
> > > > > >> > are causing the latency. ESXi was designed for storage arrays
> > > > > >> > that service IO's in the 100us-1ms range, Ceph is probably about
> > > > > >> > 10x slower than this, hence the problem. Disable the BBWC on a RAID
> > > > > >> > controller or SAN and you will see the same behaviour.
> > > > > >> >
> > > > > >> >>
> > > > > >> >> Regards
> > > > > >> >>
> > > > > >> >>
> > > > > >> >> On 21.07.16 at 14:17, Nick Fisk wrote:
> > > > > >> >> >> -----Original Message-----
> > > > > >> >> >> From: ceph-users
> > > > > >> >> >> [mailto:ceph-users-bounces@xxxxxxxxxxxxxx]
> > > > > >> >> >> On Behalf Of wr@xxxxxxxx
> > > > > >> >> >> Sent: 21 July 2016 13:04
> > > > > >> >> >> To: nick@xxxxxxxxxx; 'Horace Ng' <horace@xxxxxxxxx>
> > > > > >> >> >> Cc: ceph-users@xxxxxxxxxxxxxx
> > > > > >> >> >> Subject: Re:  Ceph + VMware + Single Thread
> > > > > >> >> >> Performance
> > > > > >> >> >>
> > > > > >> >> >> Hi,
> > > > > >> >> >>
> > > > > >> >> >> hmm i think 200 MByte/s is really bad. Is your Cluster in production right now?
> > > > > >> >> > It's just been built, not running yet.
> > > > > >> >> >
> > > > > >> >> >> So if you start a storage migration you get only 200 MByte/s right?
> > > > > >> >> > I wish. My current cluster (not this new one) would
> > > > > >> >> > storage migrate at ~10-15MB/s. Serial latency is the
> > > > > >> >> > problem; without being able to buffer, ESXi waits on an ack for each IO before sending the next.
> > > > > >> >> > Also it submits the migrations in 64kb chunks, unless you
> > > > > >> >> > get VAAI working. I think esxi will try and do them in
> > > > > >> >> > parallel, which will help as well.
> > > > > >> >> >
> > > > > >> >> >> I think it would be awesome if you get 1000 MByte/s
> > > > > >> >> >>
> > > > > >> >> >> Where is the Bottleneck?
> > > > > >> >> > Latency serialisation; without a buffer, you can't drive
> > > > > >> >> > the devices to 100%. With buffered IO (or high queue depths) I can max out the journals.
> > > > > >> >> >
> > > > > >> >> >> A FIO Test from Sebastien Han gives us 400 MByte/s raw performance from the P3700.
> > > > > >> >> >>
> > > > > >> >> >> https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
> > > > > >> >> >>
> > > > > >> >> >> How could it be that the rbd client performance is 50% slower?
> > > > > >> >> >>
> > > > > >> >> >> Regards
> > > > > >> >> >>
> > > > > >> >> >>
> > > > > >> >> >> On 21.07.16 at 12:15, Nick Fisk wrote:
> > > > > >> >> >>> I've had a lot of pain with this, smaller block sizes are even worse.
> > > > > >> >> >>> You want to try and minimize latency at every point as
> > > > > >> >> >>> there is no buffering happening in the iSCSI stack. This
> > > > > >> >> >>> means:-
> > > > > >> >> >>>
> > > > > >> >> >>> 1. Fast journals (NVME or NVRAM)
> > > > > >> >> >>> 2. 10GB or better networking
> > > > > >> >> >>> 3. Fast CPU's (Ghz)
> > > > > >> >> >>> 4. Fix CPU c-state's to C1
> > > > > >> >> >>> 5. Fix CPU's Freq to max
> > > > > >> >> >>>
> > > > > >> >> >>> Also I can't be sure, but I think there is a metadata
> > > > > >> >> >>> update happening with VMFS, particularly if you are
> > > > > >> >> >>> using thin VMDK's, this can also be a major bottleneck.
> > > > > >> >> >>> For my use case, I've switched over to NFS as it has
> > > > > >> >> >>> given much more performance at scale and less headache.
> > > > > >> >> >>>
> > > > > >> >> >>> For the RADOS Run, here you go (400GB P3700):
> > > > > >> >> >>>
> > > > > >> >> >>> Total time run:         60.026491
> > > > > >> >> >>> Total writes made:      3104
> > > > > >> >> >>> Write size:             4194304
> > > > > >> >> >>> Object size:            4194304
> > > > > >> >> >>> Bandwidth (MB/sec):     206.842
> > > > > >> >> >>> Stddev Bandwidth:       8.10412
> > > > > >> >> >>> Max bandwidth (MB/sec): 224
> > > > > >> >> >>> Min bandwidth (MB/sec): 180
> > > > > >> >> >>> Average IOPS:           51
> > > > > >> >> >>> Stddev IOPS:            2
> > > > > >> >> >>> Max IOPS:               56
> > > > > >> >> >>> Min IOPS:               45
> > > > > >> >> >>> Average Latency(s):     0.0193366
> > > > > >> >> >>> Stddev Latency(s):      0.00148039
> > > > > >> >> >>> Max latency(s):         0.0377946
> > > > > >> >> >>> Min latency(s):         0.015909
> > > > > >> >> >>>
> > > > > >> >> >>> Nick
> > > > > >> >> >>>
> > > > > >> >> >>>> -----Original Message-----
> > > > > >> >> >>>> From: ceph-users
> > > > > >> >> >>>> [mailto:ceph-users-bounces@xxxxxxxxxxxxxx]
> > > > > >> >> >>>> On Behalf Of Horace
> > > > > >> >> >>>> Sent: 21 July 2016 10:26
> > > > > >> >> >>>> To: wr@xxxxxxxx
> > > > > >> >> >>>> Cc: ceph-users@xxxxxxxxxxxxxx
> > > > > >> >> >>>> Subject: Re:  Ceph + VMware + Single Thread
> > > > > >> >> >>>> Performance
> > > > > >> >> >>>>
> > > > > >> >> >>>> Hi,
> > > > > >> >> >>>>
> > > > > >> >> >>>> Same here, I've read some blog saying that vmware will
> > > > > >> >> >>>> frequently verify the locking on VMFS over iSCSI, hence
> > > > > >> >> >>>> it will have much slower performance than NFS (with
> > > > > >> >> >>>> different locking mechanism).
> > > > > >> >> >>>>
> > > > > >> >> >>>> Regards,
> > > > > >> >> >>>> Horace Ng
> > > > > >> >> >>>>
> > > > > >> >> >>>> ----- Original Message -----
> > > > > >> >> >>>> From: wr@xxxxxxxx
> > > > > >> >> >>>> To: ceph-users@xxxxxxxxxxxxxx
> > > > > >> >> >>>> Sent: Thursday, July 21, 2016 5:11:21 PM
> > > > > >> >> >>>> Subject:  Ceph + VMware + Single Thread
> > > > > >> >> >>>> Performance
> > > > > >> >> >>>>
> > > > > >> >> >>>> Hi everyone,
> > > > > >> >> >>>>
> > > > > >> >> >>>> we see relatively slow Single Thread Performance on the iscsi Nodes of our cluster.
> > > > > >> >> >>>>
> > > > > >> >> >>>>
> > > > > >> >> >>>> Our setup:
> > > > > >> >> >>>>
> > > > > >> >> >>>> 3 Racks:
> > > > > >> >> >>>>
> > > > > >> >> >>>> 18x Data Nodes, 3 Mon Nodes, 3 iscsi Gateway Nodes with tgt (rbd cache off).
> > > > > >> >> >>>>
> > > > > >> >> >>>> 2x Samsung SM863 Enterprise SSD for Journal (3 OSD per
> > > > > >> >> >>>> SSD) and 6x WD Red 1TB per Data Node as OSD.
> > > > > >> >> >>>>
> > > > > >> >> >>>> Replication = 3
> > > > > >> >> >>>>
> > > > > >> >> >>>> chooseleaf = 3 type Rack in the crush map
> > > > > >> >> >>>>
> > > > > >> >> >>>>
> > > > > >> >> >>>> We get only ca. 90 MByte/s on the iscsi Gateway Servers with:
> > > > > >> >> >>>>
> > > > > >> >> >>>> rados bench -p rbd 60 write -b 4M -t 1
> > > > > >> >> >>>>
> > > > > >> >> >>>>
> > > > > >> >> >>>> If we test with:
> > > > > >> >> >>>>
> > > > > >> >> >>>> rados bench -p rbd 60 write -b 4M -t 32
> > > > > >> >> >>>>
> > > > > >> >> >>>> we get ca. 600 - 700 MByte/s
> > > > > >> >> >>>>
> > > > > >> >> >>>>
> > > > > >> >> >>>> We plan to replace the Samsung SSD with Intel DC P3700
> > > > > >> >> >>>> PCIe NVM'e for the Journal to get better Single Thread Performance.
> > > > > >> >> >>>>
> > > > > >> >> >>>> Is anyone of you out there who has an Intel P3700 for
> > > > > >> >> >>>> Journal and can give me test results with:
> > > > > >> >> >>>>
> > > > > >> >> >>>>
> > > > > >> >> >>>> rados bench -p rbd 60 write -b 4M -t 1
> > > > > >> >> >>>>
> > > > > >> >> >>>>
> > > > > >> >> >>>> Thank you very much !!
> > > > > >> >> >>>>
> > > > > >> >> >>>> Kind Regards !!
> > > > > >> >> >>>>
> > > > > >> >> >>>> _______________________________________________
> > > > > >> >> >>>> ceph-users mailing list ceph-users@xxxxxxxxxxxxxx
> > > > > >> >> >>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > > > >> >> >>>> _______________________________________________
> > > > > >> >> >>>> ceph-users mailing list ceph-users@xxxxxxxxxxxxxx
> > > > > >> >> >>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > > > >> >> >> _______________________________________________
> > > > > >> >> >> ceph-users mailing list
> > > > > >> >> >> ceph-users@xxxxxxxxxxxxxx
> > > > > >> >> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > > > >> >
> > > > > >> >
> > > > > >> > _______________________________________________
> > > > > >> > ceph-users mailing list
> > > > > >> > ceph-users@xxxxxxxxxxxxxx
> > > > > >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > > > >
> > > > > > _______________________________________________
> > > > > > ceph-users mailing list
> > > > > > ceph-users@xxxxxxxxxxxxxx
> > > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > > > _______________________________________________
> > > > > ceph-users mailing list
> > > > > ceph-users@xxxxxxxxxxxxxx
> > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > > >
> > > >
> > > >
> > > > --
> > > > Christian Balzer        Network/Systems Engineer
> > > > chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
> > > > http://www.gol.com/
> > > > _______________________________________________
> > > > ceph-users mailing list
> > > > ceph-users@xxxxxxxxxxxxxx
> > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > >
> > >
> > 
> > 
> > --
> > Christian Balzer        Network/Systems Engineer
> > chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
> > http://www.gol.com/
> 
> 


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


