Re: Ceph + VMware + Single Thread Performance

Nick Fisk <nick@xxxxxxxxxx> · Sun, 21 Aug 2016 09:57:40 +0100

> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Christian Balzer
> Sent: 21 August 2016 09:32
> To: ceph-users <ceph-users@xxxxxxxxxxxxxx>
> Subject: Re:  Ceph + VMware + Single Thread Performance
> 
> 
> Hello,
> 
> On Sun, 21 Aug 2016 09:16:33 +0100 Brian :: wrote:
> 
> > Hi Nick
> >
> > Interested in this comment - "-Dual sockets are probably bad and will
> > impact performance."
> >
> > Have you got real world experience of this being the case?
> >
> Well, Nick wrote "probably".
> 
> Dual sockets and thus NUMA, the need for CPUs to talk to each other and share information certainly can impact things that are very
> time critical.
> How much though is a question of design, both HW and SW.

There was a guy from Red Hat (sorry, his name escapes me now) on the weekly performance meeting a few months ago. He was analysing
CPU cache miss effects with Ceph, and it looked like a NUMA setup was having quite a severe impact on some things. To be honest a lot
of it went over my head, but I came away with the general feeling that if you can get the required performance from 1 socket,
that is probably the better bet. This includes only populating a single socket in a dual socket system. There was also a Ceph
tech talk at the start of the year (High perf databases on Ceph) where the presenter also recommended only populating 1
socket for latency reasons.

Both of those, coupled with the fact that Xeon E3's are the cheapest way to get high clock speeds, sort of made my decision.

> 
> We're looking here at a case where he's trying to reduce latency by all means and where the actual CPU needs for the HDDs are
> negligible.
> The idea being that a "Ceph IOPS" stays on one core which is hopefully also not being shared at that time.
> 
> If you're looking at full SSD nodes OTOH a single CPU may very well not be able to saturate a sensible amount of SSDs per node, so a
> slight penalty but better utilization and overall IOPS with 2 CPUs may be the way forward.

Definitely. As always, work out what your requirements are and design around them.

> 
> Christian
> 
> > Thanks - B
> >
> > On Sun, Aug 21, 2016 at 8:31 AM, Nick Fisk <nick@xxxxxxxxxx> wrote:
> > >> -----Original Message-----
> > >> From: Alex Gorbachev [mailto:ag@xxxxxxxxxxxxxxxxxxx]
> > >> Sent: 21 August 2016 04:15
> > >> To: Nick Fisk <nick@xxxxxxxxxx>
> > >> Cc: wr@xxxxxxxx; Horace Ng <horace@xxxxxxxxx>; ceph-users
> > >> <ceph-users@xxxxxxxxxxxxxx>
> > >> Subject: Re:  Ceph + VMware + Single Thread Performance
> > >>
> > >> Hi Nick,
> > >>
> > >> On Thu, Jul 21, 2016 at 8:33 AM, Nick Fisk <nick@xxxxxxxxxx> wrote:
> > >> >> -----Original Message-----
> > >> >> From: wr@xxxxxxxx [mailto:wr@xxxxxxxx]
> > >> >> Sent: 21 July 2016 13:23
> > >> >> To: nick@xxxxxxxxxx; 'Horace Ng' <horace@xxxxxxxxx>
> > >> >> Cc: ceph-users@xxxxxxxxxxxxxx
> > >> >> Subject: Re:  Ceph + VMware + Single Thread
> > >> >> Performance
> > >> >>
> > >> >> Okay and what is your plan now to speed up ?
> > >> >
> > >> > Now I have come up with a lower latency hardware design, there is
> > >> > not much further improvement until persistent RBD caching is
> > >> > implemented, as you will be moving the SSD/NVME closer to the
> > >> > client. But I'm happy with what I can achieve at the moment. You could also experiment with bcache on the RBD.
> > >>
> > >> Reviving this thread, would you be willing to share the details of
> > >> the low latency hardware design?  Are you optimizing for NFS or iSCSI?
> > >
> > > Both really, just trying to get the write latency as low as possible. As you know, VMware does everything with lots of unbuffered
> > > small IOs, e.g. when you migrate a VM or as thin VMDKs grow.
> > >
> > > Even with storage vMotions, which might kick off 32 threads, they all roughly fall on the same PG, so there still appears to be a
> > > bottleneck from contention on the PG itself.
> > >
> > > These were the sort of things I was trying to optimise for, to make the time spent in Ceph as minimal as possible for each IO.
> > >
> > > So onto the hardware. Through reading various threads and experiments on my own I came to the following conclusions.
> > >
> > > -You need the highest possible frequency on the CPU cores, which normally also means fewer of them.
> > > -Dual sockets are probably bad and will impact performance.
> > > -Use NVMe for journals to minimise latency.
> > >
> > > The end result was OSD nodes based on a 3.5GHz Xeon E3 v5 with an Intel P3700 for a journal. I used the SuperMicro X11SSH-CTF
> > > board, which has 10G-T onboard as well as 8 SATA and 8 SAS, so no expansion cards are required. As well as being
> > > very performant for Ceph, this design also works out very cheap as you are using low end server parts. The whole lot + 12x7.2k
> > > disks all goes into a 1U case.
> > >
> > > During testing I noticed that by default c-states and p-states slaughter performance. After forcing max cstate to 1 and forcing the
> > > CPU frequency up to max, I was seeing 600us latency for a 4kb write to a 3x replica pool, or around 1600 IOPS, at QD=1.
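As a rough cross-check of the QD=1 figures quoted above: with a single IO in flight, IOPS is simply the reciprocal of the per-IO latency. A minimal Python sketch (the function name is illustrative only, not from any Ceph tool):

```python
def iops_at_queue_depth(latency_s: float, queue_depth: int = 1) -> float:
    """Upper bound on IOPS when `queue_depth` IOs are kept in flight
    and each one takes `latency_s` seconds to be acknowledged."""
    if latency_s <= 0:
        raise ValueError("latency must be positive")
    return queue_depth / latency_s


# 600us per 4kb write at QD=1, the figure quoted above
print(round(iops_at_queue_depth(600e-6)))  # 1667
```

So 600us works out to a little under 1700 IOPS, which is in line with the "around 1600" seen in practice.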
> > >
> > > Few other observations:
> > > 1. Power usage is around 150-200W for this config with 12x7.2k disks
> > > 2. CPU usage maxing out disks, is only around 10-15%, so plenty of headroom for more disks.
> > > 3. NOTE FOR ABOVE: Don't include iowait when looking at CPU usage
> > > 4. No idea about CPU load for pure SSD nodes, but based on the current disks, you could maybe expect ~10000iops per node,
> > >    before maxing out CPU's
> > > 5. Single NVME seems to be able to journal 12 disks with no problem during normal operation, no doubt a specific benchmark
> > >    could max it out though.
> > > 6. There are slightly faster Xeon E3's, but price/performance = diminishing returns
> > >
> > > Hope that answers all your questions.
> > > Nick
> > >
> > >>
> > >> Thank you,
> > >> Alex
> > >>
> > >> >
> > >> >>
> > >> >> Would it help to put in multiple P3700 per OSD Node to improve performance for a single Thread (example Storage VMotion)?
> > >> >
> > >> > Most likely not, it's all the other parts of the puzzle which are
> > >> > causing the latency. ESXi was designed for storage arrays that
> > >> > service IOs in the 100us-1ms range. Ceph is probably about 10x slower than
> > >> > this, hence the problem. Disable the BBWC on a RAID controller or SAN and you will see the same behaviour.
> > >> >
> > >> >>
> > >> >> Regards
> > >> >>
> > >> >>
> > >> >> On 21.07.16 at 14:17, Nick Fisk wrote:
> > >> >> >> -----Original Message-----
> > >> >> >> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx]
> > >> >> >> On Behalf Of wr@xxxxxxxx
> > >> >> >> Sent: 21 July 2016 13:04
> > >> >> >> To: nick@xxxxxxxxxx; 'Horace Ng' <horace@xxxxxxxxx>
> > >> >> >> Cc: ceph-users@xxxxxxxxxxxxxx
> > >> >> >> Subject: Re:  Ceph + VMware + Single Thread
> > >> >> >> Performance
> > >> >> >>
> > >> >> >> Hi,
> > >> >> >>
> > >> >> >> hmm i think 200 MByte/s is really bad. Is your Cluster in production right now?
> > >> >> > It's just been built, not running yet.
> > >> >> >
> > >> >> >> So if you start a storage migration you get only 200 MByte/s right?
> > >> >> > I wish. My current cluster (not this new one) would storage
> > >> >> > migrate at ~10-15MB/s. Serial latency is the problem, without
> > >> >> > being able to buffer, ESXi waits on an ack for each IO before sending the next.
> > >> >> > Also it submits the migrations in 64kb chunks, unless you get
> > >> >> > VAAI working. I think esxi will try and do them in parallel, which will help as well.
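To put numbers on the serial-latency point: if each 64kb chunk has to be acked before the next is sent, throughput is just chunk size divided by round-trip latency. A rough Python sketch, assuming strictly one IO outstanding (which is pessimistic if ESXi does parallelise); the helper names are illustrative:

```python
KIB = 1024
MIB = 1024 * KIB


def serial_throughput_mib_s(chunk_bytes: int, latency_s: float) -> float:
    """Throughput when only one chunk is in flight at a time."""
    return chunk_bytes / latency_s / MIB


def implied_latency_ms(chunk_bytes: int, throughput_mib_s: float) -> float:
    """Per-IO latency that would explain an observed serial throughput."""
    return chunk_bytes / (throughput_mib_s * MIB) * 1000


# The ~10-15MB/s migration rate quoted above, with 64kb chunks
for mb_s in (10, 15):
    latency = implied_latency_ms(64 * KIB, mb_s)
    print(f"{mb_s} MB/s with 64kb chunks -> {latency:.2f} ms per IO")
```

Under that assumption, 10-15MB/s corresponds to roughly 4-6ms per 64kb IO, so shaving latency per IO matters far more here than adding raw device bandwidth.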
> > >> >> >
> > >> >> >> I think it would be awesome if you get 1000 MByte/s
> > >> >> >>
> > >> >> >> Where is the Bottleneck?
> > >> >> > Latency serialisation, without a buffer, you can't drive the
> > >> >> > devices to 100%. With buffered IO (or high queue depths) I can max out the journals.
> > >> >> >
> > >> >> >> An fio test from Sebastien Han gives us 400 MByte/s raw performance from the P3700.
> > >> >> >>
> > >> >> >> https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
> > >> >> >>
> > >> >> >> How could it be that the rbd client performance is 50% slower?
> > >> >> >>
> > >> >> >> Regards
> > >> >> >>
> > >> >> >>
> > >> >> >> On 21.07.16 at 12:15, Nick Fisk wrote:
> > >> >> >>> I've had a lot of pain with this, smaller block sizes are even worse.
> > >> >> >>> You want to try and minimize latency at every point as there
> > >> >> >>> is no buffering happening in the iSCSI stack. This means:-
> > >> >> >>>
> > >> >> >>> 1. Fast journals (NVME or NVRAM)
> > >> >> >>> 2. 10Gb or better networking
> > >> >> >>> 3. Fast CPU's (GHz)
> > >> >> >>> 4. Fix CPU c-state's to C1
> > >> >> >>> 5. Fix CPU's Freq to max
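For the c-state and frequency items in the list above, it can help to check what the kernel is actually doing before and after tuning. A small Python sketch that reads the Linux cpuidle and cpufreq sysfs files for one CPU. The function names are illustrative, it assumes the usual sysfs layout under /sys/devices/system/cpu, and it only reports state without changing anything:

```python
from pathlib import Path


def _read(path):
    """Return the stripped contents of a sysfs file, or None if it is absent."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


def cpu_power_settings(cpu_dir):
    """Return (governor, {idle_state_name: is_disabled}) for one CPU's
    sysfs directory, e.g. /sys/devices/system/cpu/cpu0."""
    cpu_dir = Path(cpu_dir)
    governor = _read(cpu_dir / "cpufreq" / "scaling_governor")
    states = {}
    for state in sorted(cpu_dir.glob("cpuidle/state[0-9]*")):
        name = _read(state / "name") or state.name
        states[name] = _read(state / "disable") == "1"
    return governor, states


if __name__ == "__main__":
    governor, states = cpu_power_settings("/sys/devices/system/cpu/cpu0")
    print("governor:", governor)
    for name, disabled in states.items():
        print(f"  {name}: {'disabled' if disabled else 'enabled'}")
```

On a node tuned as described, you would hope to see the governor pinned for performance and nothing deeper than C1 enabled. If the limit is applied at boot rather than at runtime, my understanding is that the deeper states may simply not be listed at all.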
> > >> >> >>>
> > >> >> >>> Also I can't be sure, but I think there is a metadata update
> > >> >> >>> happening with VMFS, particularly if you are using thin
> > >> >> >>> VMDK's, this can also be a major bottleneck. For my use
> > >> >> >>> case, I've switched over to NFS as it has given much more
> > >> >> >>> performance at scale and less headache.
> > >> >> >>>
> > >> >> >>> For the RADOS Run, here you go (400GB P3700):
> > >> >> >>>
> > >> >> >>> Total time run:         60.026491
> > >> >> >>> Total writes made:      3104
> > >> >> >>> Write size:             4194304
> > >> >> >>> Object size:            4194304
> > >> >> >>> Bandwidth (MB/sec):     206.842
> > >> >> >>> Stddev Bandwidth:       8.10412
> > >> >> >>> Max bandwidth (MB/sec): 224
> > >> >> >>> Min bandwidth (MB/sec): 180
> > >> >> >>> Average IOPS:           51
> > >> >> >>> Stddev IOPS:            2
> > >> >> >>> Max IOPS:               56
> > >> >> >>> Min IOPS:               45
> > >> >> >>> Average Latency(s):     0.0193366
> > >> >> >>> Stddev Latency(s):      0.00148039
> > >> >> >>> Max latency(s):         0.0377946
> > >> >> >>> Min latency(s):         0.015909
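The figures in that rados bench output can be sanity-checked against each other. A short Python sketch using only the numbers printed above (variable names are mine):

```python
# Figures copied from the rados bench output above
total_time_s = 60.026491
writes = 3104
write_size_bytes = 4194304  # 4 MiB per write
avg_latency_s = 0.0193366

MIB = 1024 * 1024

# Bandwidth as total data written over wall-clock time
bandwidth = writes * write_size_bytes / MIB / total_time_s
# IOPS two ways: from the op count, and from the average latency
avg_iops = writes / total_time_s
iops_from_latency = 1 / avg_latency_s

print(f"bandwidth:             {bandwidth:.1f} MB/s")
print(f"IOPS from writes/time: {avg_iops:.1f}")
print(f"IOPS from 1/latency:   {iops_from_latency:.1f}")
```

The two IOPS estimates agree closely, which is what you would expect with a single op in flight (presumably the -t 1 run that was asked for). In other words, the ~207 MB/s is set entirely by the ~19ms it takes to complete each 4MB write.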
> > >> >> >>>
> > >> >> >>> Nick
> > >> >> >>>
> > >> >> >>>> -----Original Message-----
> > >> >> >>>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx]
> > >> >> >>>> On Behalf Of Horace
> > >> >> >>>> Sent: 21 July 2016 10:26
> > >> >> >>>> To: wr@xxxxxxxx
> > >> >> >>>> Cc: ceph-users@xxxxxxxxxxxxxx
> > >> >> >>>> Subject: Re:  Ceph + VMware + Single Thread
> > >> >> >>>> Performance
> > >> >> >>>>
> > >> >> >>>> Hi,
> > >> >> >>>>
> > >> >> >>>> Same here, I've read some blog saying that vmware will
> > >> >> >>>> frequently verify the locking on VMFS over iSCSI, hence it
> > >> >> >>>> will have much slower performance than NFS (with different
> > >> >> >>>> locking mechanism).
> > >> >> >>>>
> > >> >> >>>> Regards,
> > >> >> >>>> Horace Ng
> > >> >> >>>>
> > >> >> >>>> ----- Original Message -----
> > >> >> >>>> From: wr@xxxxxxxx
> > >> >> >>>> To: ceph-users@xxxxxxxxxxxxxx
> > >> >> >>>> Sent: Thursday, July 21, 2016 5:11:21 PM
> > >> >> >>>> Subject:  Ceph + VMware + Single Thread
> > >> >> >>>> Performance
> > >> >> >>>>
> > >> >> >>>> Hi everyone,
> > >> >> >>>>
> > >> >> >>>> We see relatively slow single-thread performance on the iSCSI nodes of our cluster.
> > >> >> >>>>
> > >> >> >>>>
> > >> >> >>>> Our setup:
> > >> >> >>>>
> > >> >> >>>> 3 Racks:
> > >> >> >>>>
> > >> >> >>>> 18x Data Nodes, 3 Mon Nodes, 3 iscsi Gateway Nodes with tgt (rbd cache off).
> > >> >> >>>>
> > >> >> >>>> 2x Samsung SM863 Enterprise SSD for Journal (3 OSD per SSD)
> > >> >> >>>> and 6x WD Red 1TB per Data Node as OSD.
> > >> >> >>>>
> > >> >> >>>> Replication = 3
> > >> >> >>>>
> > >> >> >>>> chooseleaf = 3 type Rack in the crush map
> > >> >> >>>>
> > >> >> >>>>
> > >> >> >>>> We get only ca. 90 MByte/s on the iscsi Gateway Servers with:
> > >> >> >>>>
> > >> >> >>>> rados bench -p rbd 60 write -b 4M -t 1
> > >> >> >>>>
> > >> >> >>>>
> > >> >> >>>> If we test with:
> > >> >> >>>>
> > >> >> >>>> rados bench -p rbd 60 write -b 4M -t 32
> > >> >> >>>>
> > >> >> >>>> we get ca. 600 - 700 MByte/s
> > >> >> >>>>
> > >> >> >>>>
> > >> >> >>>> We plan to replace the Samsung SSD with Intel DC P3700 PCIe
> > >> >> >>>> NVMe for the journal to get better single-thread performance.
> > >> >> >>>>
> > >> >> >>>> Is there anyone out there who has an Intel P3700 for the
> > >> >> >>>> journal and can send me test results with:
> > >> >> >>>>
> > >> >> >>>>
> > >> >> >>>> rados bench -p rbd 60 write -b 4M -t 1
> > >> >> >>>>
> > >> >> >>>>
> > >> >> >>>> Thank you very much !!
> > >> >> >>>>
> > >> >> >>>> Kind Regards !!
> > >> >> >>>>
> > >> >> >>>> _______________________________________________
> > >> >> >>>> ceph-users mailing list
> > >> >> >>>> ceph-users@xxxxxxxxxxxxxx
> > >> >> >>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > >> >
> > >> >
> > >
> >
> 
> 
> --
> Christian Balzer        Network/Systems Engineer
> chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
