Re: Ceph + VMware + Single Thread Performance


 



> -----Original Message-----
> From: Alex Gorbachev [mailto:ag@xxxxxxxxxxxxxxxxxxx]
> Sent: 11 September 2016 03:17
> To: Nick Fisk <nick@xxxxxxxxxx>
> Cc: Wilhelm Redbrake <wr@xxxxxxxx>; Horace Ng <horace@xxxxxxxxx>; ceph-users <ceph-users@xxxxxxxxxxxxxx>
> Subject: Re:  Ceph + VMware + Single Thread Performance
> 
> Confirming again much better performance with ESXi and NFS on RBD using the XFS hint Nick uses, below.

Cool, I never experimented with different extent sizes, so I don't know if there is any performance/fragmentation benefit with larger/smaller values. I think storage vmotions might benefit from using striped RBDs with rbd-nbd, as this might get around the PG contention issue with 32 concurrent writes to the same PG. I want to test this out at some point, roughly along the lines sketched below.
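If I do get around to it, the rough shape would be something like this (an untested sketch; the image name, size and stripe geometry are just placeholders, and fancy striping needs librbd, which is why rbd-nbd rather than krbd):

rbd create rbd/esxi-nfs01 --size 2T --stripe-unit 1048576 --stripe-count 8
rbd-nbd map rbd/esxi-nfs01                 # maps via librbd, e.g. as /dev/nbd0
mkfs.xfs /dev/nbd0
mount /dev/nbd0 /mnt/esxi-nfs01
xfs_io -c "extsize 16M" /mnt/esxi-nfs01    # the extent size hint discussed further down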

> 
> I saw high load averages on the NFS server nodes, corresponding to iowait; it does not seem to cause too much trouble so far.

Yeah I get this as well, but I think this is just a side effect of having a storage backend that can support a high queue depth. Every IO in flight will increase the load by 1. However, despite what it looks like in top, it doesn't actually consume any CPU, so it shouldn't cause any problems.
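
A quick way to confirm it really is just iowait rather than CPU time on the NFS server nodes (standard tools, nothing Ceph-specific):

vmstat 1       # the 'wa' column is iowait; 'us' + 'sy' is the real CPU consumption
iostat -x 1    # per-device await/%util shows where the in-flight IO is actually queued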

> 
> Here are HD Tune Pro testing results from some recent runs.  The puzzling part is better random IO performance with a 16 MB object size
> on both iSCSI and NFS.  To my thinking this should be slower; however, this has been confirmed by the timed vmotion tests and more
> random IO tests by my coworker as well:
> 
> Test_type    read MB/s  write MB/s  read iops  write iops  read multi iops  write multi iops
> NFS 1mb          460        103        8753         66           47466            1616
> NFS 4mb          441        147        8863         82           47556             764
> iSCSI 1mb        117         76         326         90             672             938
> iSCSI 4mb        275         60         205         24            2015            1212
> NFS 16mb         455        177        7761        119           36403            3175
> iSCSI 16mb       300         65        1117        237           12389            1826
> 
> ( prettier view at
> http://storcium.blogspot.com/2016/09/latest-tests-on-nfs-vs.html )

Interesting. Are you pre-conditioning the RBDs before these tests? The only logical thing I can think of is that if you are writing to a new area of the RBD, it has to create the backing objects as it goes; larger objects would therefore need fewer object creations per MB written. Something like the example below would rule that out.
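
For what it's worth, pre-conditioning could just be a full sequential write pass over the device before the timed runs, something like this with fio (the device path is a placeholder, and it should be done before the filesystem/datastore is created):

fio --name=precondition --filename=/dev/nbd0 --rw=write --bs=4M --direct=1 --ioengine=libaio --iodepth=16

That forces all of the backing RADOS objects to be created up front, so the subsequent benchmark measures steady-state behaviour rather than object-create overhead.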

> 
> Alex
> 
> >
> > From: Alex Gorbachev [mailto:ag@xxxxxxxxxxxxxxxxxxx]
> > Sent: 04 September 2016 04:45
> > To: Nick Fisk <nick@xxxxxxxxxx>
> > Cc: Wilhelm Redbrake <wr@xxxxxxxx>; Horace Ng <horace@xxxxxxxxx>;
> > ceph-users <ceph-users@xxxxxxxxxxxxxx>
> > Subject: Re:  Ceph + VMware + Single Thread Performance
> >
> >
> >
> >
> >
> > On Saturday, September 3, 2016, Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx> wrote:
> >
> > HI Nick,
> >
> > On Sun, Aug 21, 2016 at 3:19 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
> >
> > From: Alex Gorbachev [mailto:ag@xxxxxxxxxxxxxxxxxxx]
> > Sent: 21 August 2016 15:27
> > To: Wilhelm Redbrake <wr@xxxxxxxx>
> > Cc: nick@xxxxxxxxxx; Horace Ng <horace@xxxxxxxxx>; ceph-users
> > <ceph-users@xxxxxxxxxxxxxx>
> > Subject: Re:  Ceph + VMware + Single Thread Performance
> >
> >
> >
> >
> >
> > On Sunday, August 21, 2016, Wilhelm Redbrake <wr@xxxxxxxx> wrote:
> >
> > Hi Nick,
> > I understand all of your technical improvements.
> > But why do you not use, for example, a simple Areca RAID controller with 8 GB of cache and a BBU on top in every Ceph node?
> > Configure n times RAID 0 on the controller and enable write-back cache.
> > That should be a latency "killer" like in all the proprietary storage arrays, or not?
> >
> > Best Regards !!
> >
> >
> >
> > What we saw specifically with Areca cards is that performance is excellent in benchmarking and for bursty loads. However, once we
> started loading with more constant workloads (we replicate databases and files to our Ceph cluster), this appears to have saturated the
> relatively small Areca NVDIMM caches and performance fell back to that of the bare drives.
> >
> >
> >
> > Yes, I think that is a valid point. Although latency is low, you are still having to write to the disks twice (journal + data), so once the
> caches on the cards start filling up, you are going to hit problems.
> >
> >
> >
> >
> >
> > So we built 8 new nodes with no Arecas, using M500 SSDs for journals (1 SSD per 3 HDDs), in hopes that it would help reduce the noisy
> neighbor impact. That worked, but now the overall latency is really high at times, though not always. A Red Hat engineer suggested this is due to
> loading the 7200 rpm NL-SAS drives with too many IOPS, which drives their latency sky high. Overall we are functioning fine, but I sure
> would like storage vmotion and other large operations to be faster.
> >
> >
> >
> >
> >
> > Yeah this is the biggest pain point I think. Normal VM ops are fine, but if you ever have to move a multi-TB VM, it’s just too slow.
> >
> >
> >
> > If you use iscsi with vaai and are migrating a thick provisioned vmdk, then performance is actually quite good, as the block sizes used
> for the copy are a lot bigger.
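> >
> > As a side note, whether the VAAI primitives (ATS/Clone/Zero/Delete) are actually enabled for each LUN can be checked from the ESXi shell, e.g.:
> >
> > esxcli storage core device vaai status get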
> >
> >
> >
> > However, my use case required thin provisioned VM’s + snapshots and I
> > found that using iscsi you have no control over the fragmentation of
> > the vmdk’s and so the read performance is then what suffers (certainly
> > with 7.2k disks)
> >
> >
> >
> > Also with thin provisioned vmdk’s I think I was seeing PG contention with the updating of the VMFS metadata, although I can’t be
> sure.
> >
> >
> >
> >
> >
> > I am thinking I will test a few different schedulers and readahead settings to see if we can improve this by parallelizing reads. Also
> will test NFS, but need to determine whether to do krbd/knfsd or something more interesting like CephFS/Ganesha.
> >
> >
> >
> > As you know I'm on NFS now. I've found it a lot easier to get going, and a lot more tolerant of config adjustments without
> everything suddenly dropping offline. The fact that you can specify the extent size hint on XFS helps massively with using thin
> vmdks/snapshots to avoid fragmentation. Storage vmotions are a bit faster than iSCSI, but I think I am hitting PG contention when ESXi
> tries to write 32 copy threads to the same object. There is probably some tuning that could be done here (RBD striping???) but this is
> the best it's been for a long time and I'm reluctant to fiddle any further.
> >
> >
> >
> > We have moved ahead and added NFS support to Storcium, and are now able to run NFS servers with Pacemaker in HA mode (all agents
> are public at https://github.com/akurz/resource-agents/tree/master/heartbeat).  I can confirm that VM performance is definitely
> better and benchmarks are smoother (in Windows we see a lot of choppiness with iSCSI; NFS is choppy on writes but smooth
> on reads, likely due to the bursty nature of the OSD filesystems when dealing with that small IO size).
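> >
> > For anyone curious about the general shape of such a setup, a stripped-down Pacemaker configuration for an HA NFS export would look roughly like the following. This is only a sketch using the stock ocf:heartbeat agents rather than our Storcium agents (those are in the repository linked above), and every name and address here is a placeholder:
> >
> > pcs resource create nfs_vip ocf:heartbeat:IPaddr2 ip=192.168.0.10 cidr_netmask=24
> > pcs resource create nfs_server ocf:heartbeat:nfsserver nfs_shared_infodir=/srv/nfs/info
> > pcs resource create nfs_export ocf:heartbeat:exportfs directory=/srv/nfs/esxi clientspec=192.168.0.0/24 options=rw,no_root_squash fsid=1
> > pcs constraint colocation add nfs_server with nfs_vip INFINITY
> > pcs constraint order nfs_vip then nfs_server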
> >
> >
> >
> > Were you using extsz=16384 at creation time for the filesystem?  I saw kernel memory deadlock messages during vmotion, such as:
> >
> >
> >
> >  XFS: nfsd(102545) possible memory allocation deadlock size 40320 in
> > kmem_alloc (mode:0x2400240)
> >
> >
> >
> > And analyzing fragmentation:
> >
> >
> >
> > root@roc-5r-scd218:~# xfs_db -r /dev/rbd21
> >
> > xfs_db> frag -d
> >
> > actual 0, ideal 0, fragmentation factor 0.00%
> >
> > xfs_db> frag -f
> >
> > actual 1863960, ideal 74, fragmentation factor 100.00%
> >
> >
> >
> > Just from two vmotions.
> >
> >
> >
> > Are you seeing anything similar?
> >
> >
> >
> > Found your post on setting XFS extent size hint for sparse files:
> >
> >
> >
> > xfs_io -c extsize 16M /mountpoint
> >
> > Will test - fragmentation definitely present without this.
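> >
> > To verify the hint took effect and to re-check fragmentation afterwards, something like this should do (same tools as above; the mountpoint and device are placeholders):
> >
> > xfs_io -c extsize /mountpoint          # with no size argument this just prints the current hint
> > xfs_db -r -c "frag -f" /dev/rbd21      # non-interactive form of the frag check shown earlier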
> >
> >
> >
> > Yeah, I got bitten by that when I first set it up; I then created another datastore with the extent size hint and moved everything across. I haven't
> seen any kmem alloc errors since, and sequential read performance is a lot better than thin provisioned iSCSI.
> >
> >
> >
> >
> >
> >
> >
> > Thank you,
> >
> > Alex
> >
> >
> >
> >
> >
> > But as mentioned above, thick vmdk’s with vaai might be a really good fit.
> >
> >
> >
> > Thanks for your very valuable info on analysis and hw build.
> >
> >
> >
> > Alex
> >
> >
> >
> >
> >
> >
> > On 21.08.2016 at 09:31, Nick Fisk <nick@xxxxxxxxxx> wrote:
> >
> > >> -----Original Message-----
> > >> From: Alex Gorbachev [mailto:ag@xxxxxxxxxxxxxxxxxxx]
> > >> Sent: 21 August 2016 04:15
> > >> To: Nick Fisk <nick@xxxxxxxxxx>
> > >> Cc: wr@xxxxxxxx; Horace Ng <horace@xxxxxxxxx>; ceph-users
> > >> <ceph-users@xxxxxxxxxxxxxx>
> > >> Subject: Re:  Ceph + VMware + Single Thread Performance
> > >>
> > >> Hi Nick,
> > >>
> > >> On Thu, Jul 21, 2016 at 8:33 AM, Nick Fisk <nick@xxxxxxxxxx> wrote:
> > >>>> -----Original Message-----
> > >>>> From: wr@xxxxxxxx [mailto:wr@xxxxxxxx]
> > >>>> Sent: 21 July 2016 13:23
> > >>>> To: nick@xxxxxxxxxx; 'Horace Ng' <horace@xxxxxxxxx>
> > >>>> Cc: ceph-users@xxxxxxxxxxxxxx
> > >>>> Subject: Re:  Ceph + VMware + Single Thread
> > >>>> Performance
> > >>>>
> > >>>> Okay and what is your plan now to speed up ?
> > >>>
> > >>> Now that I have come up with a lower-latency hardware design, there is
> > >>> not much further improvement to be had until persistent RBD caching is
> > >> implemented, as that will move the SSD/NVMe closer to the
> > >> client. But I'm happy with what I can achieve at the moment. You could also experiment with bcache on the RBD.
> > >>
> > >> Reviving this thread, would you be willing to share the details of
> > >> the low latency hardware design?  Are you optimizing for NFS or iSCSI?
> > >
> > > Both really, just trying to get the write latency as low as possible. As you know, vmware does everything with lots of unbuffered
> small IOs, e.g. when you migrate a VM or as thin vmdks grow.
> > >
> > > Even with storage vmotions, which might kick off 32 threads, they all roughly fall on the same PG, so there still appears to be a
> bottleneck from contention on the PG itself.
> > >
> > > These were the sort of things I was trying to optimise for, to make the time spent in Ceph as minimal as possible for each IO.
> > >
> > > So onto the hardware. Through reading various threads and experiments on my own I came to the following conclusions.
> > >
> > > - You need the highest possible frequency on the CPU cores, which normally also means fewer of them.
> > > - Dual sockets are probably bad and will impact performance.
> > > - Use NVMe drives for journals to minimise latency
> > >
> > > The end result was OSD nodes based on a 3.5GHz Xeon E3 v5 with an Intel P3700 for a journal. I used the SuperMicro X11SSH-CTF
> board, which has 10GBase-T onboard as well as 8x SATA and 8x SAS, so no expansion cards are required. This design, as well as being
> very performant for Ceph, also works out very cheap as you are using low-end server parts. The whole lot + 12x 7.2k disks all goes into
> a 1U case.
> > >
> > > During testing I noticed that the default c-states and p-states slaughter performance. After forcing the max C-state to 1 and forcing the
> CPU frequency up to max, I was seeing 600us latency for a 4kB write to a 3x replica pool, or around 1600 IOPS at QD=1.
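> > >
> > > For reference, the sort of thing involved is roughly this (a sketch; the exact mechanism depends on the distro and kernel, and the C-state limits go on the kernel command line):
> > >
> > > # kernel cmdline: intel_idle.max_cstate=1 processor.max_cstate=1
> > > cpupower frequency-set -g performance     # pin the governor so the cores sit at max frequency
> > > cpupower idle-info                        # confirm which C-states are still enabled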
> > >
> > > Few other observations:
> > > 1. Power usage is around 150-200W for this config with 12x7.2k disks
> > > 2. CPU usage maxing out disks, is only around 10-15%, so plenty of headroom for more disks.
> > > 3. NOTE FOR ABOVE: Don't include iowait when looking at CPU usage
> > > 4. No idea about CPU load for pure SSD nodes, but based on the current disks, you could maybe expect ~10000 IOPS per node before maxing out the CPUs
> > > 5. A single NVMe seems to be able to journal 12 disks with no problem during normal operation; no doubt a specific benchmark could max it out though.
> > > 6. There are slightly faster Xeon E3's, but price/performance = diminishing returns
> > >
> > > Hope that answers all your questions.
> > > Nick
> > >
> > >>
> > >> Thank you,
> > >> Alex
> > >>
> > >>>
> > >>>>
> > >>>> Would it help to put in multiple P3700 per OSD Node to improve performance for a single Thread (example Storage VMotion) ?
> > >>>
> > >>> Most likely not, it's all the other parts of the puzzle which are
> > >>> causing the latency. ESXi was designed for storage arrays that
> > >>> service
> > >> IO's in 100us-1ms range, Ceph is probably about 10x slower than
> > >> this, hence the problem. Disable the BBWC on a RAID controller or SAN and you will see the same behaviour.
> > >>>
> > >>>>
> > >>>> Regards
> > >>>>
> > >>>>
> > >>>> On 21.07.16 at 14:17, Nick Fisk wrote:
> > >>>>>> -----Original Message-----
> > >>>>>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On
> > >>>>>> Behalf Of wr@xxxxxxxx
> > >>>>>> Sent: 21 July 2016 13:04
> > >>>>>> To: nick@xxxxxxxxxx; 'Horace Ng' <horace@xxxxxxxxx>
> > >>>>>> Cc: ceph-users@xxxxxxxxxxxxxx
> > >>>>>> Subject: Re:  Ceph + VMware + Single Thread
> > >>>>>> Performance
> > >>>>>>
> > >>>>>> Hi,
> > >>>>>>
> > >>>>>> Hmm, I think 200 MByte/s is really bad. Is your cluster in production right now?
> > >>>>> It's just been built, not running yet.
> > >>>>>
> > >>>>>> So if you start a storage migration you get only 200 MByte/s right?
> > >>>>> I wish. My current cluster (not this new one) would storage
> > >>>>> migrate at ~10-15MB/s. Serial latency is the problem: without
> > >>>>> being able to buffer, ESXi waits on an ack for each IO before sending the next.
> > >>>>> Also it submits the migrations in 64kb chunks, unless you get
> > >>>>> VAAI
> > >>>> working. I think esxi will try and do them in parallel, which will help as well.
> > >>>>>
> > >>>>>> I think it would be awesome if you get 1000 MByte/s
> > >>>>>>
> > >>>>>> Where is the Bottleneck?
> > >>>>> Latency serialisation, without a buffer, you can't drive the
> > >>>>> devices to 100%. With buffered IO (or high queue depths) I can max out the journals.
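> > >>>>>
> > >>>>> To make that concrete, comparing QD=1 against a deeper queue with fio's rbd engine shows the gap (illustrative invocations; pool/image/client names are placeholders):
> > >>>>>
> > >>>>> fio --name=qd1  --ioengine=rbd --clientname=admin --pool=rbd --rbdname=test --rw=write --bs=4k --iodepth=1  --runtime=60 --time_based
> > >>>>> fio --name=qd32 --ioengine=rbd --clientname=admin --pool=rbd --rbdname=test --rw=write --bs=4k --iodepth=32 --runtime=60 --time_based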
> > >>>>>
> > >>>>>> A FIO test from Sebastien Han gives us 400 MByte/s raw performance from the P3700.
> > >>>>>>
> > >>>>>> https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-i
> > >>>>>> f-y our -ssd-is-suitable-as-a-journal-device/
> > >>>>>>
> > >>>>>> How could it be that the rbd client performance is 50% slower?
> > >>>>>>
> > >>>>>> Regards
> > >>>>>>
> > >>>>>>
> > >>>>>>> On 21.07.16 at 12:15, Nick Fisk wrote:
> > >>>>>>> I've had a lot of pain with this, smaller block sizes are even worse.
> > >>>>>>> You want to try and minimize latency at every point as there
> > >>>>>>> is no buffering happening in the iSCSI stack. This means:-
> > >>>>>>>
> > >>>>>>> 1. Fast journals (NVMe or NVRAM)
> > >>>>>>> 2. 10GB or better networking
> > >>>>>>> 3. Fast CPUs (GHz)
> > >>>>>>> 4. Fix CPU C-states to C1
> > >>>>>>> 5. Fix CPU frequency to max
> > >>>>>>>
> > >>>>>>> Also, I can't be sure, but I think there is a metadata update
> > >>>>>>> happening with VMFS, particularly if you are using thin
> > >>>>>>> VMDK's; this can also be a major bottleneck. For my use case,
> > >>>>>>> I've switched over to NFS as it has given much more
> > >>>>>>> performance at scale and
> > >>>> less headache.
> > >>>>>>>
> > >>>>>>> For the RADOS Run, here you go (400GB P3700):
> > >>>>>>>
> > >>>>>>> Total time run:         60.026491
> > >>>>>>> Total writes made:      3104
> > >>>>>>> Write size:             4194304
> > >>>>>>> Object size:            4194304
> > >>>>>>> Bandwidth (MB/sec):     206.842
> > >>>>>>> Stddev Bandwidth:       8.10412
> > >>>>>>> Max bandwidth (MB/sec): 224
> > >>>>>>> Min bandwidth (MB/sec): 180
> > >>>>>>> Average IOPS:           51
> > >>>>>>> Stddev IOPS:            2
> > >>>>>>> Max IOPS:               56
> > >>>>>>> Min IOPS:               45
> > >>>>>>> Average Latency(s):     0.0193366
> > >>>>>>> Stddev Latency(s):      0.00148039
> > >>>>>>> Max latency(s):         0.0377946
> > >>>>>>> Min latency(s):         0.015909
> > >>>>>>>
> > >>>>>>> Nick
> > >>>>>>>
> > >>>>>>>> -----Original Message-----
> > >>>>>>>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx]
> > >>>>>>>> On Behalf Of Horace
> > >>>>>>>> Sent: 21 July 2016 10:26
> > >>>>>>>> To: wr@xxxxxxxx
> > >>>>>>>> Cc: ceph-users@xxxxxxxxxxxxxx
> > >>>>>>>> Subject: Re:  Ceph + VMware + Single Thread
> > >>>>>>>> Performance
> > >>>>>>>>
> > >>>>>>>> Hi,
> > >>>>>>>>
> > >>>>>>>> Same here, I've read a blog post saying that vmware will
> > >>>>>>>> frequently verify the locking on VMFS over iSCSI, hence it
> > >>>>>>>> will have much slower performance than NFS (which uses a different
> > >> locking mechanism).
> > >>>>>>>>
> > >>>>>>>> Regards,
> > >>>>>>>> Horace Ng
> > >>>>>>>>
> > >>>>>>>> ----- Original Message -----
> > >>>>>>>> From: wr@xxxxxxxx
> > >>>>>>>> To: ceph-users@xxxxxxxxxxxxxx
> > >>>>>>>> Sent: Thursday, July 21, 2016 5:11:21 PM
> > >>>>>>>> Subject:  Ceph + VMware + Single Thread
> > >>>>>>>> Performance
> > >>>>>>>>
> > >>>>>>>> Hi everyone,
> > >>>>>>>>
> > >>>>>>>> we are seeing relatively slow single-thread performance on the iSCSI nodes of our cluster.
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> Our setup:
> > >>>>>>>>
> > >>>>>>>> 3 Racks:
> > >>>>>>>>
> > >>>>>>>> 18x Data Nodes, 3 Mon Nodes, 3 iscsi Gateway Nodes with tgt (rbd cache off).
> > >>>>>>>>
> > >>>>>>>> 2x Samsung SM863 Enterprise SSD for Journal (3 OSD per SSD)
> > >>>>>>>> and 6x WD Red 1TB per Data Node as OSD.
> > >>>>>>>>
> > >>>>>>>> Replication = 3
> > >>>>>>>>
> > >>>>>>>> chooseleaf = 3 type Rack in the crush map
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> We get only ca. 90 MByte/s on the iscsi Gateway Servers with:
> > >>>>>>>>
> > >>>>>>>> rados bench -p rbd 60 write -b 4M -t 1
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> If we test with:
> > >>>>>>>>
> > >>>>>>>> rados bench -p rbd 60 write -b 4M -t 32
> > >>>>>>>>
> > >>>>>>>> we get ca. 600 - 700 MByte/s
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> We plan to replace the Samsung SSDs with Intel DC P3700 PCIe
> > >>>>>>>> NVMe drives for the journal to get better single-thread performance.
> > >>>>>>>>
> > >>>>>>>> Is there anyone out there who has an Intel P3700 for the journal
> > >>>>>>>> and can give me test results with:
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> rados bench -p rbd 60 write -b 4M -t 1
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> Thank you very much !!
> > >>>>>>>>
> > >>>>>>>> Kind Regards !!
> > >>>>>>>>
> > >>>
> > >>>
> > >
> >
> >
> >
> > --
> >
> > --
> >
> > Alex Gorbachev
> >
> > Storcium
> >
> >
> >
> >
> >
> >

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





