Re: Ceph + VMware + Single Thread Performance

> -----Original Message-----
> From: wr@xxxxxxxx [mailto:wr@xxxxxxxx]
> Sent: 21 July 2016 13:23
> To: nick@xxxxxxxxxx; 'Horace Ng' <horace@xxxxxxxxx>
> Cc: ceph-users@xxxxxxxxxxxxxx
> Subject: Re:  Ceph + VMware + Single Thread Performance
> 
> Okay, and what is your plan now to speed things up?

Now that I have come up with a lower-latency hardware design, there is not much further improvement to be had until persistent RBD caching is implemented, as that will move the SSD/NVMe closer to the client. But I'm happy with what I can achieve at the moment. You could also experiment with bcache on top of the RBD.
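
If you want to try bcache, a very rough sketch (device names here are just examples, assuming the kernel RBD client and a spare local NVMe partition for the cache; note that make-bcache writes a superblock, so do this on a blank image):

make-bcache -C /dev/nvme0n1p1 -B /dev/rbd0
echo /dev/nvme0n1p1 > /sys/fs/bcache/register   # if udev hasn't registered them already
echo /dev/rbd0 > /sys/fs/bcache/register
echo writeback > /sys/block/bcache0/bcache/cache_mode

Bear in mind that with writeback, dirty data lives only on that one node's SSD until it is flushed, so you lose the ability to fail the RBD over cleanly to another client.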

> 
> Would it help to put multiple P3700s per OSD node to improve performance for a single thread (for example, Storage vMotion)?

Most likely not; it's all the other parts of the puzzle that are causing the latency. ESXi was designed for storage arrays that service IOs in the 100us-1ms range; Ceph is probably about 10x slower than this, hence the problem. Disable the BBWC on a RAID controller or SAN and you will see the same behaviour.
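
If you want to put a number on it, a single-threaded small-block rados bench run from the gateway gives a rough per-IO write latency to compare against that 100us-1ms range, e.g.:

rados bench -p rbd 30 write -b 4096 -t 1

and then look at the Average Latency(s) line in the output.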

> 
> Regards
> 
> 
> On 21.07.16 at 14:17, Nick Fisk wrote:
> >> -----Original Message-----
> >> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf
> >> Of wr@xxxxxxxx
> >> Sent: 21 July 2016 13:04
> >> To: nick@xxxxxxxxxx; 'Horace Ng' <horace@xxxxxxxxx>
> >> Cc: ceph-users@xxxxxxxxxxxxxx
> >> Subject: Re:  Ceph + VMware + Single Thread Performance
> >>
> >> Hi,
> >>
> >> Hmm, I think 200 MByte/s is really bad. Is your cluster in production right now?
> > It's just been built, not running yet.
> >
> >> So if you start a storage migration, you only get 200 MByte/s, right?
> > I wish. My current cluster (not this new one) would storage migrate at
> > ~10-15MB/s. Serial latency is the problem: without being able to buffer,
> > ESXi waits on an ack for each IO before sending the next. It also submits
> > the migrations in 64KB chunks unless you get VAAI working. I think ESXi
> > will try to do them in parallel, which will help as well.
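> >
> > If you want to check whether VAAI is actually being used, something like this on the ESXi host should show the primitive status per device (the device ID here is just a placeholder):
> >
> > esxcli storage core device vaai status get -d naa.xxxxxxxx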
> >
> >> I think it would be awesome if you could get 1000 MByte/s.
> >>
> >> Where is the bottleneck?
> > Latency serialisation; without a buffer, you can't drive the devices
> > to 100%. With buffered IO (or high queue depths) I can max out the journals.
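> >
> > A quick way to see the effect, if you have fio built with the rbd engine (pool and image names below are just examples), is to compare queue depth 1 against queue depth 32 on the same image:
> >
> > fio --ioengine=rbd --clientname=admin --pool=rbd --rbdname=testimg --rw=write --bs=4M --iodepth=1 --runtime=60 --time_based --name=qd1
> > fio --ioengine=rbd --clientname=admin --pool=rbd --rbdname=testimg --rw=write --bs=4M --iodepth=32 --runtime=60 --time_based --name=qd32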
> >
> >> A fio test from Sebastien Han gives us 400 MByte/s raw performance from the P3700.
> >>
> >> https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
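> >>
> >> (For reference, the usual journal suitability test is single-threaded sync writes, something along the lines of: fio --filename=/dev/nvme0n1 --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based --name=journal-test, with the device path being whatever your journal device is.)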
> >>
> >> How could it be that the rbd client performance is 50% slower?
> >>
> >> Regards
> >>
> >>
> >> On 21.07.16 at 12:15, Nick Fisk wrote:
> >>> I've had a lot of pain with this; smaller block sizes are even worse.
> >>> You want to try and minimize latency at every point, as there is no
> >>> buffering happening in the iSCSI stack. This means:
> >>>
> >>> 1. Fast journals (NVMe or NVRAM)
> >>> 2. 10Gb or better networking
> >>> 3. Fast CPUs (GHz)
> >>> 4. Fix CPU C-states to C1
> >>> 5. Fix CPU frequency to max (example commands for 4 and 5 below)
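> >>>
> >>> For 4 and 5, a rough sketch of what that looks like (exact parameters and tools vary by distro and CPU):
> >>>
> >>> cpupower idle-set -D 2        # or boot with intel_idle.max_cstate=1 processor.max_cstate=1
> >>> cpupower frequency-set -g performance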
> >>>
> >>> Also I can't be sure, but I think there is a metadata update
> >>> happening with VMFS, particularly if you are using thin VMDKs; this
> >>> can also be a major bottleneck. For my use case, I've switched over to
> >>> NFS, as it has given much more performance at scale and less headache.
> >>>
> >>> For the rados bench run, here you go (400GB P3700):
> >>>
> >>> Total time run:         60.026491
> >>> Total writes made:      3104
> >>> Write size:             4194304
> >>> Object size:            4194304
> >>> Bandwidth (MB/sec):     206.842
> >>> Stddev Bandwidth:       8.10412
> >>> Max bandwidth (MB/sec): 224
> >>> Min bandwidth (MB/sec): 180
> >>> Average IOPS:           51
> >>> Stddev IOPS:            2
> >>> Max IOPS:               56
> >>> Min IOPS:               45
> >>> Average Latency(s):     0.0193366
> >>> Stddev Latency(s):      0.00148039
> >>> Max latency(s):         0.0377946
> >>> Min latency(s):         0.015909
> >>>
> >>> Nick
> >>>
> >>>> -----Original Message-----
> >>>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On
> >>>> Behalf Of Horace
> >>>> Sent: 21 July 2016 10:26
> >>>> To: wr@xxxxxxxx
> >>>> Cc: ceph-users@xxxxxxxxxxxxxx
> >>>> Subject: Re:  Ceph + VMware + Single Thread Performance
> >>>>
> >>>> Hi,
> >>>>
> >>>> Same here. I've read a blog post saying that VMware will frequently
> >>>> verify the locking on VMFS over iSCSI, hence it has much slower performance than NFS (which uses a different locking mechanism).
> >>>>
> >>>> Regards,
> >>>> Horace Ng
> >>>>
> >>>> ----- Original Message -----
> >>>> From: wr@xxxxxxxx
> >>>> To: ceph-users@xxxxxxxxxxxxxx
> >>>> Sent: Thursday, July 21, 2016 5:11:21 PM
> >>>> Subject:  Ceph + VMware + Single Thread Performance
> >>>>
> >>>> Hi everyone,
> >>>>
> >>>> we are seeing relatively slow single-thread performance on the iSCSI nodes of our cluster.
> >>>>
> >>>>
> >>>> Our setup:
> >>>>
> >>>> 3 Racks:
> >>>>
> >>>> 18x data nodes, 3 mon nodes, and 3 iSCSI gateway nodes with tgt (rbd cache off).
> >>>>
> >>>> 2x Samsung SM863 enterprise SSDs for the journal (3 OSDs per SSD) and
> >>>> 6x 1TB WD Red drives as OSDs per data node.
> >>>>
> >>>> Replication = 3
> >>>>
> >>>> chooseleaf = 3 type Rack in the crush map
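> >>>>
> >>>> (The relevant rule step is roughly "step chooseleaf firstn 0 type rack", so each of the 3 replicas ends up in a different rack.)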
> >>>>
> >>>>
> >>>> We only get ca. 90 MByte/s on the iSCSI gateway servers with:
> >>>>
> >>>> rados bench -p rbd 60 write -b 4M -t 1
> >>>>
> >>>>
> >>>> If we test with:
> >>>>
> >>>> rados bench -p rbd 60 write -b 4M -t 32
> >>>>
> >>>> we get ca. 600 - 700 MByte/s
> >>>>
> >>>>
> >>>> We plan to replace the Samsung SSDs with Intel DC P3700 PCIe NVMe
> >>>> drives for the journal to get better single-thread performance.
> >>>>
> >>>> Is there anyone out there who has an Intel P3700 for the journal and
> >>>> can share test results for:
> >>>>
> >>>>
> >>>> rados bench -p rbd 60 write -b 4M -t 1
> >>>>
> >>>>
> >>>> Thank you very much !!
> >>>>
> >>>> Kind Regards !!
> >>>>


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


